<a href="https://colab.research.google.com/github/Dawudis/Problem-Scoping-an-Area-with-Python/blob/main/Official_Local_Area_Problem_Scoping_Code_03162022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scrape Articles Using Newspaper and news-please**

In [None]:
!pip install newspaper3k
import newspaper
from newspaper import Article

In [None]:
import nltk
nltk.download('punkt')

I want to scope the problems facing the city of Houston. To do this, I am going to scrape the local news section of Houston's newspaper, the Houston Chronicle, for any articles that might help us.

In [3]:
site = newspaper.build("https://www.houstonchronicle.com/news/houston-texas/")  

In [4]:
urls = site.article_urls()

In [5]:
urls

['https://www.houstonchronicle.com/news/houston-texas/#content',
 'https://www.houstonchronicle.com/news/houston-texas/',
 'https://www.houstonchronicle.com/news/houston-texas/houston/',
 'https://www.houstonchronicle.com/news/houston-texas/education/',
 'https://www.houstonchronicle.com/news/houston-texas/texas/',
 'https://www.houstonchronicle.com/news/houston-texas/environment/',
 'https://www.houstonchronicle.com/news/houston-texas/health/',
 'https://www.houstonchronicle.com/news/investigations/',
 'https://www.houstonchronicle.com/news/houston-texas/crime/',
 'https://www.houstonchronicle.com/news/houston-texas/transportation/',
 'https://www.houstonchronicle.com/news/houston-texas/immigration/',
 'https://www.houstonchronicle.com/news/houston-weather/',
 'https://www.houstonchronicle.com/news/houston-texas/religion/',
 'https://www.houstonchronicle.com/news/houston-texas/space/',
 'https://www.houstonchronicle.com/news/houston-texas/trending/',
 'https://www.houstonchronicle.com
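The list returned by `article_urls()` mixes section landing pages and in-page anchors (such as `#content`) with actual stories. A minimal sketch of a pre-filter follows. The rule it applies (paths ending in `/` are section pages, anything with a fragment is an anchor) is an assumption based on the output above, not a documented property of the site, and the helper names and the example article URL are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_article(url: str) -> bool:
    """Heuristic (assumption): section pages end in '/', anchors carry a fragment."""
    parsed = urlparse(url)
    if parsed.fragment:
        return False
    return parsed.path != "" and not parsed.path.endswith("/")


def candidate_articles(urls):
    """Keep URLs that pass the heuristic, dropping exact repeats in order."""
    seen = set()
    kept = []
    for url in urls:
        if looks_like_article(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


sample = [
    "https://www.houstonchronicle.com/news/houston-texas/#content",
    "https://www.houstonchronicle.com/news/houston-texas/",
    "https://www.houstonchronicle.com/news/houston-texas/crime/",
    "https://www.example.com/news/some-story-12345.php",  # hypothetical article URL
]
print(candidate_articles(sample))
```

Applying a filter like this before scraping would reduce the number of index pages whose "main text" is only a list of headlines.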

In [6]:
import pandas as pd

Load the URLs into a dataframe 'df'

In [7]:
df = pd.DataFrame(urls, columns = ['web_url'])

We want to remove any defective links, so we keep only the URLs that respond with an HTTP status code of 200

In [8]:
import requests

In [9]:
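new_urls = []
for url in df['web_url']:
  try:
    # HEAD skips the response body; the timeout stops one slow server from stalling the loop
    if requests.head(url, timeout=10).status_code == 200:
      new_urls.append(url)
  except requests.RequestException:
    pass  # treat unreachable URLs as defective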
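The status check above can be separated from the network call so the filtering logic can be tried offline. This is a sketch under stated assumptions: `head_status` and `filter_reachable` are hypothetical helper names, and the 10-second timeout and the choice to follow redirects are illustrative defaults, not something the notebook specifies.

```python
from typing import Callable, Iterable, List, Optional


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status of a HEAD request, or None on a network error."""
    import requests  # imported lazily so filter_reachable can be tested without it

    try:
        return requests.head(url, timeout=timeout, allow_redirects=True).status_code
    except requests.RequestException:
        return None


def filter_reachable(
    urls: Iterable[str],
    get_status: Callable[[str], Optional[int]] = head_status,
) -> List[str]:
    """Keep each distinct URL whose status is exactly 200, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen:
            continue  # skip repeats so each URL is requested once
        seen.add(url)
        if get_status(url) == 200:
            kept.append(url)
    return kept


# Offline usage with a stand-in for the network call
fake_statuses = {"https://a.example/": 200, "https://b.example/": 404}
print(filter_reachable(["https://a.example/", "https://b.example/"], fake_statuses.get))
```

Passing the status function as a parameter keeps the default behaviour (a real HEAD request) while letting a dictionary lookup stand in for the network during testing.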

We put these urls into a dataframe 'df1'

In [10]:
df1 = pd.DataFrame(new_urls, columns= ['new_urls'])
df1.head()

Unnamed: 0,new_urls
0,https://www.houstonchronicle.com/news/houston-...
1,https://www.houstonchronicle.com/news/houston-...
2,https://www.houstonchronicle.com/news/houston-...
3,https://www.houstonchronicle.com/news/houston-...
4,https://www.houstonchronicle.com/news/houston-...


We use the 'news-please' crawler to extract the main text (its 'maintext' attribute) from each article URL

In [None]:
!pip3 install news-please
from newsplease import NewsPlease

In [12]:
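# Scrape each URL and keep its main text; record None when extraction fails
article_texts = []
for url in df1["new_urls"]:
  try:
    article_texts.append(NewsPlease.from_url(url).maintext)
  except Exception:
    article_texts.append(None)  # removed later by dropna()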

We create a column 'articles' within the same dataset for the article text

In [13]:
df1['articles'] = article_texts  # assign the list directly, one text per row

We drop any missing values from the dataframe and assign it to 'data'

In [14]:
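# Reset the index so later positional results line up with the remaining rows
data = df1.dropna().reset_index(drop=True)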

# **Extractive Summarization**

The full article texts are too long to analyze directly, so we generate an extractive summary of each one (a selection of sentences taken verbatim from the article) and store it in the 'article_summaries' column

In [None]:
!pip install bert-extractive-summarizer
from summarizer import Summarizer
model = Summarizer()

In [16]:
result = []
for i in data['articles']:
  result.append(model(i))
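To make "extractive" concrete: an extractive summarizer selects existing sentences rather than writing new ones. The sketch below is not the BERT-based method used by `bert-extractive-summarizer`; it swaps in a much simpler word-frequency score purely for illustration. The function name, the naive sentence splitter, and the length-based stop-word filter are all assumptions.

```python
import re
from collections import Counter


def select_sentences(text: str, max_sentences: int = 2):
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return sentences

    # Count words longer than three letters as a crude stop-word filter
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original ordering
    return [sentences[i] for i in chosen]


example_text = (
    "Flooding closed several Houston roads. "
    "Flooding also damaged homes near the bayou. "
    "The weather was mild. "
    "Officials said flooding repairs could take weeks."
)
print(" ".join(select_sentences(example_text, max_sentences=2)))
```

Whatever scoring method is used, every sentence in the result appears verbatim in the input, which is why the summaries in the table below begin with the same words as the articles.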

In [17]:
data['article_summaries'] = result  # assign by position; a new DataFrame would align on index labels instead

In [18]:
data

Unnamed: 0,new_urls,articles,article_summaries
0,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...
1,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...
2,https://www.houstonchronicle.com/news/houston-...,A vehicle hit the couple around 9:30 p.m. in t...,A vehicle hit the couple around 9:30 p.m. in t...
3,https://www.houstonchronicle.com/news/houston-...,This Texas college produces some the highest-p...,This Texas college produces some the highest-p...
4,https://www.houstonchronicle.com/news/houston-...,NEW YORK (AP) — Shares of Tesla jumped at the ...,NEW YORK (AP) — Shares of Tesla jumped at the ...
...,...,...,...
56,https://preview.houstonchronicle.com/dining/al...,Comeback Cheeseburger at Red Dwarf Photo: Alis...,Comeback Cheeseburger at Red Dwarf Photo: Alis...
57,https://preview.houstonchronicle.com/dining/al...,The new Cheeseburger Pizza from Domino’s Photo...,The new Cheeseburger Pizza from Domino’s Photo...
58,https://www.houstonchronicle.com/projects/2020...,About\nThis project will update daily.\nData i...,"Data is from URISA’s GISCorps, Coders Against ..."
59,https://preview.houstonchronicle.com/music/aft...,Willie Nelson has canceled a Houston show afte...,Willie Nelson has canceled a Houston show afte...


In [19]:
data = data.dropna()

In [20]:
data

Unnamed: 0,new_urls,articles,article_summaries
0,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...
1,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...
2,https://www.houstonchronicle.com/news/houston-...,A vehicle hit the couple around 9:30 p.m. in t...,A vehicle hit the couple around 9:30 p.m. in t...
3,https://www.houstonchronicle.com/news/houston-...,This Texas college produces some the highest-p...,This Texas college produces some the highest-p...
4,https://www.houstonchronicle.com/news/houston-...,NEW YORK (AP) — Shares of Tesla jumped at the ...,NEW YORK (AP) — Shares of Tesla jumped at the ...
...,...,...,...
56,https://preview.houstonchronicle.com/dining/al...,Comeback Cheeseburger at Red Dwarf Photo: Alis...,Comeback Cheeseburger at Red Dwarf Photo: Alis...
57,https://preview.houstonchronicle.com/dining/al...,The new Cheeseburger Pizza from Domino’s Photo...,The new Cheeseburger Pizza from Domino’s Photo...
58,https://www.houstonchronicle.com/projects/2020...,About\nThis project will update daily.\nData i...,"Data is from URISA’s GISCorps, Coders Against ..."
59,https://preview.houstonchronicle.com/music/aft...,Willie Nelson has canceled a Houston show afte...,Willie Nelson has canceled a Houston show afte...


# **Sentiment Analysis**

In [None]:
!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
import torch

In [None]:
!pip install transformers 
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Define a function that returns a sentiment score for a piece of text, from 1 (most negative) to 5 (most positive)

In [23]:
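def sentiment_score(text):
    # Truncate at the tokenizer level so inputs never exceed BERT's 512-token limit
    tokens = tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # inference only, so gradients are not needed
        result = model(tokens)
    # Class indices 0-4 correspond to ratings of 1-5
    return int(torch.argmax(result.logits)) + 1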
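The `argmax(...) + 1` step can be illustrated without loading the model. The classifier produces five raw scores (logits), one per rating class; the index of the largest one, shifted by one, is the rating. A small sketch in plain Python follows, with made-up logit values and a hypothetical helper name. The softmax is added only to show how a confidence could be read off, which the notebook itself does not do.

```python
import math


def stars_from_logits(logits):
    """Return (rating, probability) for a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]  # class index 0 corresponds to a rating of 1


# Made-up logits in which the first class scores highest
rating, confidence = stars_from_logits([2.1, 0.3, -0.5, -1.2, -0.9])
print(rating, round(confidence, 2))
```

A low winning probability would suggest the rating is a weak signal, which may matter for news text, since this model was trained on reviews rather than news articles.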

Apply this function to each article summary and store the results in a 'sentiment' column. Only the first 512 characters of each summary are passed to the model

In [24]:
data['sentiment'] = data['article_summaries'].apply(lambda x: sentiment_score(x[:512]))

Drop duplicates

In [25]:
data = data.drop_duplicates(subset='article_summaries', keep="first")
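`drop_duplicates` only removes rows whose summaries match exactly. If the same story appears under two URLs with small differences in whitespace or capitalization, both rows survive. The sketch below shows one possible stricter comparison on plain dictionaries. The normalization rule and the function name are assumptions, and the rows are invented for illustration.

```python
def drop_duplicate_texts(rows, key="article_summaries"):
    """Keep the first row for each summary, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = " ".join(row[key].split()).lower()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Invented rows: the first two differ only in spacing and case
rows = [
    {"new_urls": "https://example.com/a", "article_summaries": "Road closed after crash."},
    {"new_urls": "https://example.com/b", "article_summaries": "road  closed after crash. "},
    {"new_urls": "https://example.com/c", "article_summaries": "Council approves budget."},
]
print([row["new_urls"] for row in drop_duplicate_texts(rows)])
```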

In [33]:
data.head()

Unnamed: 0,new_urls,articles,article_summaries,sentiment
0,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...,1
2,https://www.houstonchronicle.com/news/houston-...,A vehicle hit the couple around 9:30 p.m. in t...,A vehicle hit the couple around 9:30 p.m. in t...,1
3,https://www.houstonchronicle.com/news/houston-...,This Texas college produces some the highest-p...,This Texas college produces some the highest-p...,5
4,https://www.houstonchronicle.com/news/houston-...,NEW YORK (AP) — Shares of Tesla jumped at the ...,NEW YORK (AP) — Shares of Tesla jumped at the ...,1
5,https://www.houstonchronicle.com/news/houston-...,"Meteorologists urged caution Saturday as dry, ...","Meteorologists urged caution Saturday as dry, ...",4


To find what we consider 'problems', we filter out the neutral and positive articles. To do this, we keep only the rows with a sentiment score below 3 (that is, 1 or 2) and store them in a dataframe 'sent'.

In [27]:
sent = data[data.sentiment < 3]
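Before settling on a cutoff of 3, it can help to look at how the scores are distributed. The sketch below does this on plain dictionaries with invented scores. The function name and the sample rows are hypothetical, and the threshold is a parameter so other cutoffs can be compared.

```python
from collections import Counter


def split_by_sentiment(rows, threshold=3):
    """Return (score counts, rows scoring strictly below the threshold)."""
    counts = Counter(row["sentiment"] for row in rows)
    negative = [row for row in rows if row["sentiment"] < threshold]
    return counts, negative


# Invented scores for illustration
rows = [
    {"title": "crash", "sentiment": 1},
    {"title": "budget", "sentiment": 3},
    {"title": "award", "sentiment": 5},
    {"title": "outage", "sentiment": 2},
]
counts, negative = split_by_sentiment(rows)
print(sorted(counts.items()))
print([row["title"] for row in negative])
```

A score of exactly 3 is excluded by the strict `<` comparison, matching the filter in the cell above.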

In [28]:
sent

Unnamed: 0,new_urls,articles,article_summaries,sentiment
0,https://www.houstonchronicle.com/news/houston-...,The woman got into a fight with another woman ...,The woman got into a fight with another woman ...,1
2,https://www.houstonchronicle.com/news/houston-...,A vehicle hit the couple around 9:30 p.m. in t...,A vehicle hit the couple around 9:30 p.m. in t...,1
4,https://www.houstonchronicle.com/news/houston-...,NEW YORK (AP) — Shares of Tesla jumped at the ...,NEW YORK (AP) — Shares of Tesla jumped at the ...,1
6,https://www.houstonchronicle.com/news/houston-...,Some of Houston's at-risk COVID patients feel ...,Some of Houston's at-risk COVID patients feel ...,1
10,https://www.houstonchronicle.com/news/houston-...,Fact check: Does Jackson have a record on defu...,Fact check: Does Jackson have a record on defu...,1
14,https://www.houstonchronicle.com/news/houston-...,The FBI started to investigate a tip on Jan. 1...,The FBI started to investigate a tip on Jan. 1...,1
29,https://preview.houstonchronicle.com/movies-tv...,"Still from Richard Linklater's ""Apollo 10 1/2""...","Still from Richard Linklater's ""Apollo 10 1/2""...",1
30,https://preview.houstonchronicle.com/movies-tv...,FILE - An Oscar statue is pictured underneath ...,FILE - An Oscar statue is pictured underneath ...,1
31,https://preview.houstonchronicle.com/movies-tv...,An Oscar statue sparkles in sunlight on the re...,An Oscar statue sparkles in sunlight on the re...,1
44,https://preview.houstonchronicle.com/families/...,Two participants in the Heights Kids' Day of M...,Two participants in the Heights Kids' Day of M...,1


Since we already have the summaries of the articles, we can drop the article texts themselves by dropping column 'articles'.

In [None]:
sent = sent.drop(columns='articles')  # the positional axis argument was removed in pandas 2.0

In [30]:
sent.shape

(11, 3)

Export the dataframe as a CSV file

In [31]:
sent.to_csv('houston_problem_scoping_dataset.csv', index=False) 