<a href="https://colab.research.google.com/github/Dawudis/NYT-Articles-Document-Similarity/blob/main/NYT_Articles_Document_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scraping Articles About New York City**

In [None]:
!pip install newspaper3k
import newspaper
from newspaper import Article

In [None]:
import nltk
nltk.download('punkt')

In [3]:
site = newspaper.build("https://apnews.com/hub/new-york-city", memoize_articles=False)  

In [4]:
urls = site.article_urls()

In [5]:
import pandas as pd

In [6]:
# Put the URLs into a DataFrame
df = pd.DataFrame(urls, columns=['web_url'])
df.head()

Unnamed: 0,web_url
0,https://apnews.com/hub/ap-top-25-college-baske...
1,https://apnews.com/article/immigration-new-yor...
2,https://apnews.com/article/shootings-new-york-...
3,https://apnews.com/article/kathy-hochul-new-yo...
4,https://apnews.com/article/rihanna-pregnant-ca...


In [None]:
!pip3 install news-please
from newsplease import NewsPlease

In [14]:
# Scrape each URL and collect the article titles
article_titles = []
for i in df["web_url"]:
  article_titles.append(NewsPlease.from_url(i).title)

In [8]:
# Scrape each URL again and collect the main text (may be None if extraction fails)
top_articles = []
for i in df["web_url"]:
  top_articles.append(NewsPlease.from_url(i).maintext)
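The two loops above download every URL twice, once for the title and once for the main text. A minimal sketch of a single-pass alternative follows. The helper `scrape_articles` and its `fetch` parameter are hypothetical names introduced here for illustration; `fetch` is assumed to be any callable that returns an object with `.title` and `.maintext` attributes, as `NewsPlease.from_url` does in the cells above. The `try`/`except` is a precaution in case a fetch raises.

```python
from types import SimpleNamespace


def scrape_articles(urls, fetch):
    """Fetch each URL once and return a list of row dicts.

    `fetch` is a hypothetical parameter: any callable that takes a URL and
    returns an object with `.title` and `.maintext` attributes.
    """
    rows = []
    for url in urls:
        try:
            article = fetch(url)
        except Exception as exc:  # skip URLs that cannot be fetched or parsed
            print(f"skipping {url}: {exc}")
            continue
        rows.append({
            "web_url": url,
            "article titles": article.title,
            "articles": article.maintext,
        })
    return rows


# Stand-in fetcher so the sketch runs offline (not a real scraper)
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("could not download")
    return SimpleNamespace(title=f"Title for {url}", maintext="Body text")


rows = scrape_articles(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch
)
print(rows)
```

In the notebook, the call would presumably look like `pd.DataFrame(scrape_articles(df["web_url"], NewsPlease.from_url))`, which yields the titles and the text in one DataFrame.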

In [15]:
df1 = pd.DataFrame(article_titles, columns= ['article titles'])

In [16]:
df1['articles'] = top_articles

In [56]:
df1.head()

Unnamed: 0,article titles,articles
0,NCAA College Basketball Rankings: AP Top Baske...,
1,"Food delivery workers, rideshare drivers deman...",NEW YORK (AP) — Having won rights to use resta...
2,NYC police honor 2nd officer killed in Harlem ...,The casket of New York City Police Officer Wil...
3,"Is NYC safe? Violence, perception and a compli...",FILE — New York City Mayor Eric Adams addresse...
4,"Rihanna is pregnant, debuts bump on stroll wit...","In this combination photo, A$AP Rocky attends ..."


If any rows in your DataFrame have `None` as the article text (for example, a hub page with no article body), remove those rows before vectorizing: `TfidfVectorizer` cannot process missing values.

In [65]:
# Drop the rows whose article text is None (labels 0 and 49 in this run;
# they will differ if you re-scrape) and store the result in "data1"
data1 = df1.drop(labels=[0, 49], axis=0)
data1.head()

Unnamed: 0,article titles,articles
1,"Food delivery workers, rideshare drivers deman...",NEW YORK (AP) — Having won rights to use resta...
2,NYC police honor 2nd officer killed in Harlem ...,The casket of New York City Police Officer Wil...
3,"Is NYC safe? Violence, perception and a compli...",FILE — New York City Mayor Eric Adams addresse...
4,"Rihanna is pregnant, debuts bump on stroll wit...","In this combination photo, A$AP Rocky attends ..."
5,"‘Nanny,’ ‘Exiles,’ ‘Navalny’ among top Sundanc...","This image shows a scene from ""Nanny"" by Nikya..."
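The row labels dropped above are specific to one scrape. A minimal sketch of a filter that does not depend on hard-coded labels is shown here on a made-up DataFrame (`toy` and its contents are illustrative, not scraped data). It assumes that rows with empty or whitespace-only text should be dropped along with `None`.

```python
import pandas as pd

# Toy stand-in for df1 (made-up rows, same column names as the notebook)
toy = pd.DataFrame({
    "article titles": ["Rankings page", "Story A", "Story B", "Story C"],
    "articles": [None, "NEW YORK (AP) text", "   ", "More text"],
})

# Keep only rows whose text is neither missing nor blank
has_text = toy["articles"].fillna("").str.strip() != ""
cleaned = toy[has_text].copy()  # copy so later column assignments are safe

print(cleaned.index.tolist())
```

The same two lines applied to `df1` would replace the manual `drop` calls, whichever rows happen to be empty.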


# **Creating the Similarity Clusters**

In [58]:
# First, convert the article text into TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
# fit() learns the vocabulary and IDF weights; transform() builds the document-term matrix
vec.fit(data1.articles.values)
features = vec.transform(data1.articles.values)
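To see what the vectorizer produces, here is a small sketch on three made-up sentences (the `toy_` names and the sentences are illustrative only). It uses the same settings as the cell above and relies on the fact that `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows is their cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (made up for illustration, not scraped data)
toy_docs = [
    "police officer shot in Harlem",
    "officer killed in Harlem shooting",
    "Nets star Irving fined by NBA",
]

toy_vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
toy_features = toy_vec.fit_transform(toy_docs)  # fit and transform in one call

# Rows are L2-normalized by default, so row-by-row dot products
# give the pairwise cosine similarities
similarity = (toy_features @ toy_features.T).toarray()
print(similarity.round(2))
```

The first two sentences share the terms "officer" and "harlem", so their similarity is above zero, while the third shares no terms with either and scores zero against both.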

In [59]:
from sklearn.cluster import KMeans
# Now create the clusters.
# The number of clusters is set with "n_clusters=...";
# this notebook uses 3 clusters.
clust = KMeans(init='k-means++', n_clusters=3, n_init=10)
clust.fit(features)
yhat = clust.predict(features)
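The choice of 3 clusters above is made by hand. One common way to compare candidate values is the silhouette score; the sketch below tries a few values of `n_clusters` on a made-up corpus (the documents and the `toy_`/`scores`/`best_k` names are illustrative). It fixes `random_state` so repeated runs give the same labels, which the cell above does not do.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus with three rough topics (made up for illustration)
toy_docs = [
    "police officer shot in Harlem",
    "second police officer dies after Harlem shooting",
    "Nets star Irving fined by NBA",
    "Nets waiting on Irving and Durant",
    "film festival awards top prize to documentary",
    "documentary wins audience prize at film festival",
]
toy_features = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

scores = {}
for k in range(2, 5):
    model = KMeans(init="k-means++", n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(toy_features)
    # Dense conversion is fine for a corpus this small
    scores[k] = silhouette_score(toy_features.toarray(), labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher score suggests tighter, better separated clusters, but on a small news corpus it is only a rough guide and is worth checking against the article titles.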

In [60]:
# Add a "Cluster Labels" column to the DataFrame:
# each article (row) gets the label of the cluster it belongs to, from 0 to 2
data1['Cluster Labels'] = clust.labels_
data1[['article titles', 'Cluster Labels']].head(10)

Unnamed: 0,article titles,Cluster Labels
1,"Food delivery workers, rideshare drivers deman...",1
2,NYC police honor 2nd officer killed in Harlem ...,0
3,"Is NYC safe? Violence, perception and a compli...",1
4,"Rihanna is pregnant, debuts bump on stroll wit...",1
5,"‘Nanny,’ ‘Exiles,’ ‘Navalny’ among top Sundanc...",1
6,"Man crashes into Taylor Swift's NY building, p...",1
7,Palin dines out again in NYC days after positi...,1
8,"Fallen officers sought bridges between NYPD, c...",0
9,Workers at Amazon NYC warehouse get go-ahead f...,1
10,NY trying to reduce gun violence with new stat...,1


# **The Clusters**

First Cluster // Seems to be mostly about the recent Harlem shooting in which two NYPD officers lost their lives

In [61]:
data2 = data1.loc[data1['Cluster Labels'] == 0]
data2['article titles'].head(10)

2     NYC police honor 2nd officer killed in Harlem ...
8     Fallen officers sought bridges between NYPD, c...
12    Police say suspect in NYC hospital shooting ha...
13    2nd NYPD officer dies, days after Harlem shooting
14    Man who shot 2 NYPD officers, killing 1, has died
17    In mourning yet again, NYC prepares to honor f...
19    Young officer slain in Harlem joined to help '...
20    1 NYPD officer killed, 1 severely injured in H...
25    Police: Baby girl shot in face by stray bullet...
29    Prison for teen who pleaded guilty in college ...
Name: article titles, dtype: object

Second Cluster // Seems to be a catch-all: several articles cover NYC's gun violence issues, alongside unrelated entertainment, labor, and politics stories

In [62]:
data3 = data1.loc[data1['Cluster Labels'] == 1]
data3['article titles'].head(10)

1     Food delivery workers, rideshare drivers deman...
3     Is NYC safe? Violence, perception and a compli...
4     Rihanna is pregnant, debuts bump on stroll wit...
5     ‘Nanny,’ ‘Exiles,’ ‘Navalny’ among top Sundanc...
6     Man crashes into Taylor Swift's NY building, p...
7     Palin dines out again in NYC days after positi...
9     Workers at Amazon NYC warehouse get go-ahead f...
10    NY trying to reduce gun violence with new stat...
11    Biden to NYC next week to discuss gun crime wi...
15    New NYC mayor, an ex-cop, unveils plan to stem...
Name: article titles, dtype: object

Third Cluster // Seems to be sports coverage, mostly about the Brooklyn Nets (one of New York's NBA teams)

In [64]:
data4 = data1.loc[data1['Cluster Labels'] == 2]
data4['article titles'].head(10)

27    Irving fined $25,000 by NBA for cursing at fan...
35    James Sands makes Glasgow Rangers debut in dra...
38    Nets' star Irving steadfast on vaccine despite...
40    Durant has sprained knee ligament, no timetabl...
45    While waiting on Irving, Nets still searching ...
Name: article titles, dtype: object