# **Web scraping articles/news, then summarising and classifying them**
⏲️ **The deadline: see what could be accomplished in 1 month.**

## **Why I did this**
1. Learn about web scraping and 🤗 HuggingFace's transformers (high level)
2. I don't want to subscribe to dozens of newsletters and rely upon article/news tags to classify news

## **The data**
The data was scraped from multiple sites with `Scrapy` (see the other folders) and stored in the following format:
* `title` - main title of the article
* `subtitle` - subtitle of the article, if it has one
* `date` - date of the article
* `article` - the article text (the part that is summarised and classified)
* `link` - link to the article
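
A quick way to confirm that a scraped file matches this layout is to check its header before loading it. The sketch below uses only the standard library; the `EXPECTED_COLUMNS` name, the `missing_columns` helper, and the sample row are illustrative assumptions, not part of the scraper.

```python
import csv
import io

# Column names taken from the list above.
EXPECTED_COLUMNS = ["title", "subtitle", "date", "article", "link"]


def missing_columns(csv_text: str) -> list:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in EXPECTED_COLUMNS if col not in header]


# Made-up sample row, for illustration only.
sample = (
    "title,subtitle,date,article,link\n"
    "Example title,,2021-01-01,Example body text,https://example.com/a\n"
)
print(missing_columns(sample))
```

The cells below read the text from a column named `story`, so the check may need adjusting to whichever name the CSV actually uses.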

### **Setting up environment**

In [1]:
# Install transformers package (required for summarization and zero-shot classification)
!pip install -q transformers


In [4]:
import re # regular expressions (data cleaning)
import pandas as pd # data prep
from transformers import pipeline # text summarization

### **Importing data**

The stories are stored in a CSV file (`sec_press_releases.csv`).

In [3]:
stories = pd.read_csv("sec_press_releases.csv")

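### **Adding word count and converting all stories to strings**

A word count is added because the summarisation model limits how much text it accepts. I treat 512 words as that limit (the model itself counts tokens, so this is only an approximation) and later use `word_count` to split the data into small and large stories.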

In [5]:
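# Cast to string first so the .str accessor works on every row (missing values become 'nan')
stories['story'] = stories['story'].astype(str)

# Approximate word count: number of spaces + 1
stories['word_count'] = stories['story'].str.count(' ') + 1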
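
For stories over the limit, one option is to cut the text into consecutive chunks that each fit. This is a minimal sketch, assuming a plain whitespace split is good enough; `chunk_words` and its `limit` default are hypothetical names, and the notebook itself may handle long stories differently.

```python
def chunk_words(text: str, limit: int = 512) -> list:
    """Split text into consecutive chunks of at most `limit` whitespace-separated words."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


# 1,100 dummy words -> chunks of 512, 512 and 76 words.
chunks = chunk_words("word " * 1100)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk could then be summarised on its own and the partial summaries joined, at the cost of losing context across chunk boundaries.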

### **Initialise pipelines**
HuggingFace's pipelines save a lot of time when trying out their NLP models on simple tasks, so I used them here.

In [6]:
# Initialise summarizer and zero-shot classifier
summarizer = pipeline("summarization", device = 0)
classifier = pipeline("zero-shot-classification", device = 0)

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BartForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
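
Passing `device = 0` assumes a GPU is present. A minimal sketch of a fallback, assuming PyTorch is the backend: `pick_device` is a hypothetical helper that returns `0` for the first GPU when CUDA is available and `-1` (the pipelines' CPU setting) otherwise.

```python
def pick_device() -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


print(pick_device())
```

The result can then be passed on, for example `pipeline("summarization", device = pick_device())`.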


#### **Basic example of each pipeline**

Zero-shot classification - check whether the sentence is closer to `animals` or `technology`. The labels come back sorted by score, and the high first score shows the model assigns this sentence to `animals`.

In [7]:
classifier("I walked the dog",
           candidate_labels = ["animals", "technology"],
           multi_class = False)

{'labels': ['animals', 'technology'],
 'scores': [0.9975346326828003, 0.0024653782602399588],
 'sequence': 'I walked the dog'}
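
To turn that dictionary into a single tag, the label with the highest score can be picked out. A small sketch using the result above; `top_label` and its `threshold` parameter are hypothetical, and the 0.5 default is an arbitrary assumption rather than a recommended value.

```python
# Result copied from the cell output above.
result = {
    "labels": ["animals", "technology"],
    "scores": [0.9975346326828003, 0.0024653782602399588],
    "sequence": "I walked the dog",
}


def top_label(result: dict, threshold: float = 0.5):
    """Return the highest-scoring label, or None if its score is below `threshold`."""
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return label if score >= threshold else None


print(top_label(result))  # animals
```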

Summarisation - summarise a paragraph.

In [8]:
summarizer("""The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and 
              the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. 
              During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest 
              man-made structure in the world, a title it held for 41 years until the Chrysler Building in 
              New York City was finished in 1930. It was the first structure to reach a height of 300 metres. 
              Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than 
              the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second 
              tallest free-standing structure in France after the Millau Viaduct.""")

[{'summary_text': ' The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building . It is the second tallest free-standing structure in France after the Millau Viaduct . Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres .'}]
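
The summariser returns a list with one dictionary per input, and the text carries a leading space plus a space before each full stop. A minimal clean-up sketch with `re`; `clean_summary` is a hypothetical helper, and the sample is the first two sentences of the output above.

```python
import re


def clean_summary(output: list) -> str:
    """Extract `summary_text` and remove the whitespace the model leaves before full stops."""
    text = output[0]["summary_text"].strip()
    return re.sub(r"\s+\.", ".", text)


# First two sentences of the summariser output above.
output = [{
    "summary_text": " The tower is 324 metres (1,063 ft) tall, about the same height "
                    "as an 81-storey building . It is the second tallest free-standing "
                    "structure in France after the Millau Viaduct ."
}]
print(clean_summary(output))
```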