# Tutorial 2 - Clustering of News Articles

We will use the dataset [News Articles, from Tianru Dai (2017)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GMFCTR).

This dataset was built to study political bias in news. It contains articles from different sources, all reporting on political topics.

In this tutorial we will cluster these news articles by topic.

# Download

In [None]:
!wget "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/GMFCTR/IZQODZ" -O NewsArticles.csv

# Prepare Data

In [None]:
import pandas as pd
df = pd.read_csv('NewsArticles.csv', encoding='latin1')
df = df[['article_id', 'title', 'text']].copy().dropna().reset_index(drop=True)
print(df.shape)

In [None]:
df.head()

In [None]:
df['nb_words'] = df['text'].apply(lambda x: len(x.split()))
_ = df['nb_words'].hist(bins=50, figsize=(9, 9))

# EXERCISE: Clustering

We need to customize the `KMeans` class from `sklearn`:
* Centroids are the median point of their cluster (instead of the mean point)
* Distances are computed as cosine distances (instead of Euclidean distances)
* cosine distance = 1 - cosine similarity
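
The two ideas in the list above can be illustrated on a few hand-made vectors. This is only a small NumPy sketch; the helper `cosine_distance` and the toy points are made up for illustration and are not part of the tutorial's code.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (inputs must be non-zero vectors)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction, different length: cosine distance ignores the length
d_parallel = cosine_distance([1, 0], [3, 0])
# Orthogonal vectors: no shared direction at all
d_orthogonal = cosine_distance([1, 0], [0, 2])
print(d_parallel, d_orthogonal)

# Mean vs. median centroid: the third point is an outlier on the first axis
points = np.array([[0.0, 1.0], [0.0, 2.0], [9.0, 3.0]])
mean_centroid = points.mean(axis=0)          # pulled towards the outlier
median_centroid = np.median(points, axis=0)  # feature-wise median, more robust
print(mean_centroid, median_centroid)
```

The median is taken feature by feature, which is also what `np.median(X, axis=0)` does in the class below.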

In [None]:
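import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

class KMedians(KMeans):
    def _e_step(self, X):
        # Assign each sample to the centroid with the smallest cosine distance
        self.labels_ = cosine_distances(X, self.cluster_centers_).argmin(axis=1)

    def _average(self, X):
        # Centroid update: feature-wise median instead of the mean
        return np.median(X, axis=0)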
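
To make the assignment step and the median update concrete, here is a standalone sketch of the loop, written in plain NumPy and independent of scikit-learn. The function `k_medians_cosine`, its parameters and the toy data are hypothetical; it assumes a small dense array with no all-zero rows and takes the initial centroids as an argument.

```python
import numpy as np

def k_medians_cosine(X, init_centers, max_iter=100):
    """Toy k-medians with cosine distance (dense arrays, non-zero rows)."""
    X = np.asarray(X, dtype=float)
    # L2-normalize the rows so that dot products are cosine similarities
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.full(len(Xn), -1)
    for _ in range(max_iter):
        # Guard against all-zero centroids before normalizing them
        norms = np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1e-12)
        dist = 1.0 - Xn @ (centers / norms).T
        # Assignment step: nearest centroid by cosine distance
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, stop iterating
        labels = new_labels
        # Update step: feature-wise median of each cluster's members
        for k in range(len(centers)):
            members = Xn[labels == k]
            if len(members) > 0:
                centers[k] = np.median(members, axis=0)
    return labels, centers

# Two groups of vectors pointing in roughly orthogonal directions
X_toy = np.array([[1.0, 0.1], [2.0, 0.3], [3.0, 0.2],
                  [0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])
labels, centers = k_medians_cosine(X_toy, init_centers=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

On this toy input the first three vectors end up in one cluster and the last three in the other, regardless of their lengths.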

In [None]:
corpus = df['text']

## TODO - BoW

Instantiate and fit a `CountVectorizer`:
* English stop words
* lowercase
* set values for `min_df` and `max_df`
* choose a `max_features` value (suggestion: 50000)
* use a token pattern that captures only letters, with at least 2 letters per token
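
To see what the token pattern and the document-frequency thresholds do, the sketch below reproduces them by hand with the standard library on three made-up sentences. It is not the exercise solution: the names `docs`, `min_df` and `max_df` values are arbitrary, and the filtering only mimics the usual meaning of these thresholds (an integer `min_df` as a minimum document count, a float `max_df` as a maximum proportion of documents).

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'[a-z]{2,}')  # letters only, at least 2 of them

docs = [
    "The U.S. Senate passed 2 bills in 2017.",
    "Senate leaders said the bills were a priority.",
    "A storm hit the coast.",
]

# Lowercase first, then tokenize: digits and single letters disappear
tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
print(tokenized[0])

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))

min_df = 2    # keep tokens present in at least 2 documents
max_df = 0.9  # drop tokens present in more than 90% of the documents
vocab = sorted(
    tok for tok, df in doc_freq.items()
    if min_df <= df <= max_df * len(docs)
)
print(vocab)
```

Here "the" is removed because it occurs in every document, and tokens seen only once are removed by `min_df`.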

In [None]:
# TODO - BoW
# Create a CountVectorizer with the right parameters (hint for token pattern: r'[a-z]{2,}')
# fit it to the corpus of texts
# transform the corpus into a term-document matrix


Use the term-document matrix to cluster the documents:
* 8 clusters
* Normalize the vectors before clustering (always good practice)
* `KMedians` has the same interface as [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* You may need to adjust `max_iter`
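
One reason normalization matters here: once every row has unit L2 norm, the dot product of two rows is their cosine similarity, and the squared Euclidean distance equals twice the cosine distance. The NumPy sketch below checks this on a made-up matrix; it normalizes by hand, which is assumed to match what `Normalizer` does with its default L2 norm.

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [6.0, 8.0],   # same direction as the first row, twice as long
              [0.0, 5.0]])

# L2-normalize each row (each document vector) to unit length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# For unit vectors, the dot product is the cosine similarity
cos_sim = X_unit @ X_unit.T
cos_dist = 1.0 - cos_sim

# Pairwise squared Euclidean distances between the unit vectors
diff = X_unit[:, None, :] - X_unit[None, :, :]
sq_euclid = (diff ** 2).sum(axis=2)

print(np.round(cos_dist, 3))
print(np.allclose(sq_euclid, 2 * cos_dist))
```

After normalization, rows 0 and 1 become identical, so document length no longer influences the clustering.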

In [None]:
from sklearn.preprocessing import Normalizer

# TODO - Normalize the term-document matrix
normalizer = Normalizer()
bow_norm = normalizer.fit_transform(# TODO)

# TODO - Clustering
km = KMedians(
    # TODO
)
km.fit(# TODO)





---



The code below:
* identifies the document closest to a cluster's centroid
* lists the next 10 documents closest to that centroid

**TODO**: read the titles and judge whether the articles are related.

In [None]:
OBSERVE_CLUSTER = 2

In [None]:
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

closest, _ = pairwise_distances_argmin_min(X=km.cluster_centers_, Y=bow_norm, metric='cosine')

c = closest[OBSERVE_CLUSTER]
d = pairwise_distances(X=[km.cluster_centers_[OBSERVE_CLUSTER]], Y=bow_norm, metric='cosine')[0]
top_10_idx = np.argsort(d)[1:11]   # rank 0 is document c (printed first), so we skip it

print(df.iloc[c]['title'])
print('*' * 80)
for i, idx in enumerate(top_10_idx):
    print(f'#{i+1:>2} (idx={idx:4}, d={d[idx]:.2f}): {df.iloc[idx]["title"]}')
    

## TODO - SVD

Use SVD to create semantic vectors:
* 8 topics
* then cluster the documents into 8 clusters
* Compare the computation time with the BoW clustering
* Compare the relatedness of the articles with the BoW clustering
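
As a reminder of what the decomposition produces, here is a tiny truncated SVD computed with plain NumPy on a made-up 4x4 count matrix (rows are documents, columns are terms). It is a sketch of the idea only; `TruncatedSVD` is assumed to compute a comparable low-rank factorization on the real sparse matrix, and component signs may differ between solvers.

```python
import numpy as np

# Toy counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
bow = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 3, 1],
], dtype=float)

k = 2  # number of latent topics to keep

# Full SVD, then keep only the k largest singular values
U, S, Vt = np.linalg.svd(bow, full_matrices=False)
doc_topic = U[:, :k] * S[:k]   # one k-dimensional vector per document
topic_term = Vt[:k]            # one row of term weights per topic

print(np.round(S, 3))
print(np.round(doc_topic, 3))

# The product is the best rank-k approximation of the original matrix
approx = doc_topic @ topic_term
print(np.round(approx, 3))
```

Each document is now described by `k` numbers instead of one count per term, which is why clustering on these vectors is much cheaper.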

In [None]:
from sklearn.decomposition import TruncatedSVD
K = 8

# TODO - Instantiate a TruncatedSVD
# Transform the BoW into a document-topic matrix
svd = TruncatedSVD(
    # TODO
)
lsa = svd.fit_transform(#TODO - term-doc matrix)

In [None]:
# TODO - Show the topics and their most important words
# Replace `yourcountvectorizer` with the name of your fitted CountVectorizer
vocab = yourcountvectorizer.get_feature_names_out()

for topic in range(K):
    topic_terms = svd.components_[topic, :]
    top_10_indices = topic_terms.argsort()[::-1][:10]
    print(f'Topic {topic:>2}: {"".join([f"{vocab[i]:<15}" for i in top_10_indices])}')

In [None]:
from sklearn.preprocessing import Normalizer

# TODO - Normalize the document-topic matrix
normalizer = Normalizer()
lsa_norm = normalizer.fit_transform(#TODO - LSA)

# TODO - Clustering
lsa_km = KMedians(
    #TODO
)

lsa_km.fit(#TODO - normalized LSA)




---



The code below:
* identifies the document closest to a cluster's centroid
* lists the next 10 documents closest to that centroid

**TODO**: read the titles and judge whether the articles are related.

In [None]:
OBSERVE_CLUSTER = 2

In [None]:
import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

closest, _ = pairwise_distances_argmin_min(X=lsa_km.cluster_centers_, Y=lsa_norm, metric='cosine')

c = closest[OBSERVE_CLUSTER]
d = pairwise_distances(X=[lsa_km.cluster_centers_[OBSERVE_CLUSTER]], Y=lsa_norm, metric='cosine')[0]
top_10_idx = np.argsort(d)[1:11]   # rank 0 is document c (printed first), so we skip it

print(df.iloc[c]['title'])
print('*' * 80)
for i, idx in enumerate(top_10_idx):
    print(f'#{i+1:>2} (idx={idx:4}, d={d[idx]:.2f}): {df.iloc[idx]["title"]}')
    

## TODO - LDA

Use LDA to create semantic vectors:
* 8 topics
* Same procedure as for SVD
* Take inspiration from the LDA notebook of the lecture

In [None]:
# Create tokenized corpus
import spacy

nlp = spacy.load('en_core_web_sm')

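add_stops = {'said', 'mr'}

# Keep tokens longer than 1 character that are neither spaCy stop words
# nor in add_stops (compared in lowercase, so 'Mr' and 'Said' are removed too)
stopped_tokenized = [
    [t.text for t in tokens
     if len(t.text) > 1 and not t.is_stop and t.lower_ not in add_stops]
    for tokens in nlp.tokenizer.pipe(df['text'])
]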

In [None]:
# TODO - Create the corpus (see Classroom notebook)
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(stopped_tokenized)
corpus = [dictionary.doc2bow(txt) for txt in stopped_tokenized]

In [None]:
# TODO - Create the LDA model with 8 topics
# Evaluate it
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

lda = LdaModel(
    # TODO,
    minimum_probability=0.0
)

# Compute Coherence Score
coherence_cv = CoherenceModel(
    # TODO
    coherence='c_v'
)
c_v = # TODO

In [None]:
# TODO - Display the topics

In [None]:
import numpy as np

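# Create the document-topic matrix: one row per document, one column per topic.
# K must match the number of topics used to train `lda`.
doc_topics = np.zeros((len(corpus), K))
for i, bow in enumerate(corpus):
    # get_document_topics returns (topic_id, probability) pairs
    for j, v in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, j] = v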

In [None]:
from sklearn.preprocessing import Normalizer

# TODO - Normalize the document-topic matrix
normalizer = Normalizer()
lda_norm = normalizer.fit_transform(# TODO - document/topic matrix)

# TODO - Clustering
lda_km = KMedians(
    # TODO
)

lda_km.fit(# TODO - use the normalized document/topic matrix)




---



The code below:
* identifies the document closest to a cluster's centroid
* lists the next 10 documents closest to that centroid

**TODO**: read the titles and judge whether the articles are related.

In [None]:
OBSERVE_CLUSTER = 2

In [None]:
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

closest, _ = pairwise_distances_argmin_min(X=lda_km.cluster_centers_, Y=lda_norm, metric='cosine')

c = closest[OBSERVE_CLUSTER]
d = pairwise_distances(X=[lda_km.cluster_centers_[OBSERVE_CLUSTER]], Y=lda_norm, metric='cosine')[0]
top_10_idx = np.argsort(d)[1:11]   # rank 0 is document c (printed first), so we skip it

print(df.iloc[c]['title'])
print('*' * 80)
for i, idx in enumerate(top_10_idx):
    print(f'#{i+1:>2} (idx={idx:4}, d={d[idx]:.2f}): {df.iloc[idx]["title"]}')
    

## TODO - LDA GridSearch with Coherence


* Find the optimal number of topics by maximizing the coherence score $\textrm{C}_V$
* Display these topics
* Explore the clusters
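
The search itself can be organized as a small loop over candidate topic counts. The sketch below shows one possible structure; `grid_search_num_topics` and `fake_coherence` are hypothetical names, and the scorer is a made-up stand-in so the example runs without training any model. The commented lines indicate how a gensim-based scorer might be plugged in, assuming the `corpus`, `dictionary` and `stopped_tokenized` variables defined earlier.

```python
def grid_search_num_topics(candidates, train_and_score):
    """Score every candidate topic count and return the best one with all scores."""
    scores = {k: train_and_score(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# With gensim, the scorer could look like this (not executed here):
# def train_and_score(k):
#     model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
#     cm = CoherenceModel(model=model, texts=stopped_tokenized,
#                         dictionary=dictionary, coherence='c_v')
#     return cm.get_coherence()

def fake_coherence(k):
    # Stand-in scorer for illustration only: peaks at k = 10
    return 0.5 - abs(k - 10) / 100

best_k, scores = grid_search_num_topics([4, 6, 8, 10, 12], fake_coherence)
print(best_k, scores)
```

Keeping all scores, not only the best one, makes it easy to plot coherence against the number of topics and check whether the maximum is clear or flat.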