# Clustering Model

### Summary: 

In the baseline model, located in `../baseline_models/clustering_model.ipynb`, 
- we extracted the `sentences_df.csv` dataset from the original `ner_dataset.csv`,  
- applied two vectorization algorithms, `CountVectorizer` and `TfidfVectorizer`, to convert _shortened sentences_ (original sentences with stopwords removed) to vectors, 
- and finally, used the KMeans clustering algorithm to cluster sentences. 

The Silhouette scores of the resulting clusterings were extremely low (less than $0.1$). 

In this notebook, 
- we improve the model by adding **lemmatization** and **dimension reduction** layers,  
- and, in addition to KMeans, we explore other clustering models. 

### Load the dataset `sentences_v1.csv`, created in the baseline model

In [64]:
import pandas as pd 
import numpy as np

sentences_df = pd.read_csv('../datasets/sentences_v1.csv')
extended_df = pd.read_csv('../datasets/extended_df.csv')

In [None]:
# None disables column-width truncation; the old value -1 is deprecated in recent pandas
pd.set_option('display.max_colwidth', None)

In [51]:
sentences_df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Sentence Length', 'Sentence#', 'Content',
       'Tagged Words', 'Shortened Sentences'],
      dtype='object')

Drop columns `'Unnamed: 0.1', 'Unnamed: 0'`. 

In [52]:
sentences_df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True)

In [53]:
sentences_df.head()

Unnamed: 0,Sentence Length,Sentence#,Content,Tagged Words,Shortened Sentences
0,24,1,"['Thousands', 'of', 'demonstrators', 'have', '...","['London', 'Iraq', 'British']",Thousands demonstrators marched London pro...
1,30,2,"['Families', 'of', 'soldiers', 'killed', 'in',...",['Bush'],Families soldiers killed conflict joined p...
2,14,3,"['They', 'marched', 'from', 'the', 'Houses', '...","['Hyde', 'Park']",marched Houses Parliament rally Hyde Pa...
3,15,4,"['Police', 'put', 'the', 'number', 'of', 'marc...",,"Police number marchers 10,000 organizers ..."
4,25,5,"['The', 'protest', 'comes', 'on', 'the', 'eve'...","['Britain', 'Labor', 'Party', 'English', 'Brig...",protest comes eve annual conference Brit...


### Vectorization and Lemmatization

In [54]:
shortened_sent = sentences_df['Shortened Sentences']

##### Drop an **invalid** sentence 

In [65]:
shortened_sent[shortened_sent.isna()]

8411    NaN
Name: Shortened Sentences, dtype: object

In [66]:
sentences_df.iloc[8409: 8413]

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Sentence Length,Sentence#,Content,Tagged Words,Shortened Sentences
8409,8409,8409,33,8410,"['A', 'STATE', 'Official', 'carrying', 'off', ...","['Dome', 'of', 'the', 'Capitol']",STATE Official carrying Dome Capitol met ...
8410,8410,8410,34,8411,"['As', 'the', 'place', 'of', 'meeting', 'was',...","['midnight', 'State', 'Official', 'Dome', 'of'...","place meeting lonely time midnight , St..."
8411,8411,8411,1,8412,['The'],,
8412,8412,8412,28,8413,"['Ghost', 'replied', 'that', 'he', 'had', 'not...","['State', 'Official']","Ghost replied eaten , explaining sit..."


In [67]:
# Drop the single NaN row by reassignment instead of mutating the column view in place
shortened_sent = shortened_sent.drop(8411)

In [68]:
shortened_sent.isna().sum()

0

##### Vectorization and Lemmatization classes

In [25]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

In [14]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /Users/yinghu/nltk_data...


True

Define the `LemmaTokenizer` class. 

In [56]:
## code is adapted from sklearn: https://scikit-learn.org/stable/modules/feature_extraction.html

class LemmaTokenizer:
    def __init__(self) -> None:
        self.wnl = WordNetLemmatizer()
    
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in doc.lower().split(' ')]

The following cells test the `LemmaTokenizer` class. They are commented out because the vectorizer is fitted again in the modeling sections below, so there is no need to run it here.  

In [78]:
#vectorizer_bow = CountVectorizer(tokenizer=LemmaTokenizer())
#tokenized_sent = vectorizer_bow.fit_transform(shortened_sent)
#tokenized_sent.shape

(47958, 28760)

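Each sentence is represented by a $28760$-dimensional vector. The dimension increases after adding lemmatization, which looks odd at first. A likely cause is the tokenization rather than the lemmatizer: `LemmaTokenizer` splits on single spaces, so punctuation, one-character tokens, and the empty string produced by consecutive spaces all become features, whereas the default `CountVectorizer` token pattern keeps only alphanumeric tokens of two or more characters.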

In [77]:
#tokenized_sent_wolema = CountVectorizer().fit_transform(shortened_sent)
#tokenized_sent_wolema.shape

(47958, 27706)
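The effect of the splitting rule on vocabulary size can be checked on a toy corpus. The sketch below is an illustration only: the two sentences and the helper `space_tokenizer` are hypothetical, and the lemmatization step is left out on purpose so that only the tokenization differs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; removed stopwords leave double spaces behind.
docs = [
    "Police  number marchers 10,000 .",
    "Ghost replied  eaten , explaining a b",
]

def space_tokenizer(doc):
    # Same splitting rule as LemmaTokenizer, without the lemmatizer.
    return doc.lower().split(' ')

# Default behaviour: tokens of two or more alphanumeric characters.
default_vocab = CountVectorizer().fit(docs).vocabulary_

# Custom tokenizer: token_pattern=None because the pattern is unused here.
custom_vocab = CountVectorizer(
    tokenizer=space_tokenizer, token_pattern=None
).fit(docs).vocabulary_

print(len(default_vocab), sorted(default_vocab))
print(len(custom_vocab), sorted(custom_vocab))
```

In this toy case the custom tokenizer keeps `''`, `','`, `'.'`, `'a'`, and `'b'` as features, which the default pattern discards. Whether the same effect fully accounts for the gap between 28760 and 27706 in the real data is not verified here.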

##### word2vec for future reference

In [None]:
## Download the full model 
# import gensim.downloader as api 
# wv = api.load('word2vec-google-news-300')

##### Dimension reduction classes

In [99]:
from sklearn.decomposition import PCA

## Clustering Models

In [73]:
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

### Model 1: `CountVectorizer` with `tokenizer=LemmaTokenizer()` + `KMeans`

- When `n_clusters=2`, the Silhouette Coefficient is 0.28244613563900284 for `random_state=1, 5, 10, 42`.
- When `n_clusters=3`, the Silhouette Coefficient is 0.17099173313074406 for `random_state=0, 1`.

This is a substantial improvement over the baseline model, whose Silhouette Coefficient was below 0.1.
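To put these values in context: for each sample, the Silhouette Coefficient is $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points in the nearest other cluster. Scores lie between $-1$ and $1$, and values near $0$ indicate overlapping clusters. The sketch below uses synthetic data (not the sentence vectors) to contrast a well-separated case with a diffuse one; the helper name `kmeans_silhouette` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two tight, well-separated blobs (synthetic).
separated = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
])

# One diffuse blob without any real cluster structure.
diffuse = rng.normal(0.0, 1.0, size=(100, 2))

def kmeans_silhouette(X, k=2):
    # Fit KMeans and score the resulting labels on the same data.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

score_separated = kmeans_silhouette(separated)
score_diffuse = kmeans_silhouette(diffuse)

print(f"separated blobs: {score_separated:.3f}")
print(f"diffuse blob:    {score_diffuse:.3f}")
```

The separated case scores close to 1, while the diffuse case scores clearly lower even though KMeans still returns two clusters.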


In [75]:
shortened_sent_bow = CountVectorizer(tokenizer=LemmaTokenizer()).fit_transform(shortened_sent)

In [76]:
shortened_sent_bow.shape

(47958, 28760)

Fit and evaluate the KMeans model.

**We terminated the execution early.**

In [80]:
for random_state in [0, 1, 10, 42]: 
    for n_cluster in range(2,7):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=random_state).fit(shortened_sent_bow)
        labels = kmeans_model.labels_

        # print(f'Calinski-Harabasz Index: {calinski_harabasz_score(sample_data, labels)}')
        # print(f'Davies-Bouldin Index: {davies_bouldin_score(sample_data, labels)}')
        print(f'n_clusters = {n_cluster}, random_state = {random_state}: Silhouette Coefficient: {silhouette_score(shortened_sent_bow, labels)}')

n_clusters = 3, random_state = 0: Silhouette Coefficient: 0.17099173313074406
n_clusters = 4, random_state = 0: Silhouette Coefficient: 0.12557863515767284
n_clusters = 5, random_state = 0: Silhouette Coefficient: 0.09898807800970112
n_clusters = 6, random_state = 0: Silhouette Coefficient: 0.0744724413531645
n_clusters = 7, random_state = 0: Silhouette Coefficient: 0.059102724262040676
n_clusters = 8, random_state = 0: Silhouette Coefficient: 0.043977833607019796
n_clusters = 9, random_state = 0: Silhouette Coefficient: 0.05252731485513745
n_clusters = 10, random_state = 0: Silhouette Coefficient: 0.0317145227045245
n_clusters = 11, random_state = 0: Silhouette Coefficient: 0.03639271433403673
n_clusters = 12, random_state = 0: Silhouette Coefficient: 0.04897327320206242
n_clusters = 13, random_state = 0: Silhouette Coefficient: 0.026837843138244215
n_clusters = 14, random_state = 0: Silhouette Coefficient: 0.025795734505181533
n_clusters = 3, random_state = 1: Silhouette Coefficient:

KeyboardInterrupt: 
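One reason a search like this is slow is that `silhouette_score` needs pairwise distances between all samples. A possible way to shorten each iteration, sketched here under the assumption that an approximate score is acceptable, is the `sample_size` argument of `silhouette_score`, which scores a random subset. The data below is synthetic (`make_blobs`) and only stands in for the bag-of-words matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the sentence vectors.
X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Exact score: uses all pairwise distances.
full_score = silhouette_score(X, labels)

# Approximate score: computed on a random subset of 500 samples.
approx_score = silhouette_score(X, labels, sample_size=500, random_state=0)

print(f"full:   {full_score:.3f}")
print(f"approx: {approx_score:.3f}")
```

The subset size trades accuracy for speed, so it is worth comparing a few values of `sample_size` on the real data before relying on the approximate score.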

In [82]:
for random_state in [1, 5, 10, 42]: 
    for n_cluster in range(2,3):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=random_state).fit(shortened_sent_bow)
        labels = kmeans_model.labels_

        # print(f'Calinski-Harabasz Index: {calinski_harabasz_score(sample_data, labels)}')
        # print(f'Davies-Bouldin Index: {davies_bouldin_score(sample_data, labels)}')
        print(f'n_clusters = {n_cluster}, random_state = {random_state}: Silhouette Coefficient: {silhouette_score(shortened_sent_bow, labels)}')

n_clusters = 2, random_state = 1: Silhouette Coefficient: 0.28244613563900284
n_clusters = 2, random_state = 5: Silhouette Coefficient: 0.28244613563900284
n_clusters = 2, random_state = 10: Silhouette Coefficient: 0.28244613563900284
n_clusters = 2, random_state = 42: Silhouette Coefficient: 0.28244613563900284


Inspect the representative sentences (those closest to each cluster centroid) when `n_clusters = 2`

In [89]:
kmeans = KMeans(n_clusters=2)
sent_dist = kmeans.fit_transform(shortened_sent_bow)
representative_sent_idx = np.argmin(sent_dist, axis=0)
representative_sent = sentences_df.iloc[representative_sent_idx]

In [92]:
pd.set_option('display.max_colwidth', None)
representative_sent['Content']

16893    ['One', 'time', 'I', 'had', 'to', 'go', 'to', 'a', 'funeral', 'at', '6', 'AM', '.']                                                                                       
33172    ['Authorities', 'say', 'Commander', 'Victor', 'Berrones', 'died', 'Tuesday', 'when', 'gunmen', 'attacked', 'the', 'vehicle', 'in', 'which', 'he', 'was', 'traveling', '.']
Name: Content, dtype: object
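A caveat on the lookup above: the row positions returned by `np.argmin` refer to the vectorized matrix, which was built after one NaN row was removed, while `sentences_df.iloc` indexes the full DataFrame. If the DataFrame still contains the dropped row, positions after it may be off by one. The sketch below shows the issue and one possible fix on a hypothetical five-row frame; whether it applies to this notebook depends on the order in which the cells were run.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value at label 2.
df = pd.DataFrame({"text": ["a", "b", None, "d", "e"]})

# Rows that would be vectorized; index labels are now 0, 1, 3, 4.
kept = df["text"].dropna()

# Suppose matrix row 3 is the sample closest to a centroid.
row_positions = np.array([3])

# Positional lookup on the full frame picks the wrong row.
by_position = df.iloc[row_positions]["text"].tolist()

# Mapping positions back to index labels first picks the right row.
by_label = df.loc[kept.index[row_positions], "text"].tolist()

print(by_position)  # ['d']
print(by_label)     # ['e']
```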

Inspect the representative sentences (those closest to each cluster centroid) when `n_clusters = 3`

In [93]:
kmeans = KMeans(n_clusters=3)
sent_dist = kmeans.fit_transform(shortened_sent_bow)
representative_sent_idx = np.argmin(sent_dist, axis=0)
representative_sent = sentences_df.iloc[representative_sent_idx]

In [94]:
representative_sent['Content']

44758    ['Officials', 'say', 'the', 'strike', 'hit', 'a', 'Palestinian', 'militant', 'training', 'camp', '.']                                                                                                                                  
40246    ['One', 'of', 'them', 'got', 'terribly', 'sick', '.']                                                                                                                                                                                  
13830    ['"', 'The', 'next', 'time', 'you', 'touch', 'a', 'Nettle', ',', 'grasp', 'it', 'boldly', ',', 'and', 'it', 'will', 'be', 'soft', 'as', 'silk', 'to', 'your', 'hand', ',', 'and', 'not', 'in', 'the', 'least', 'hurt', 'you', '.', '"']
Name: Content, dtype: object

In [98]:
sent_dist.max()

42.14102267137398

### Model 2 (Not good): `TfidfVectorizer` with `tokenizer=LemmaTokenizer()` + `KMeans`

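- Using `TfidfVectorizer` results in much worse performance than `CountVectorizer`: the Silhouette Coefficient for `n_clusters=2` is about 0.008, compared with about 0.28 in Model 1.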
- **We terminated the search early.**

In [83]:
shortened_sent_tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer()).fit_transform(shortened_sent)

In [84]:
shortened_sent_tfidf.shape

(47958, 28760)

In [85]:

for n_cluster in range(2, 6):
    for random_state in range(4):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=random_state).fit(shortened_sent_tfidf)
        labels = kmeans_model.labels_
        # print(f'Calinski-Harabasz Index: {calinski_harabasz_score(sample_data, labels)}')
        # print(f'Davies-Bouldin Index: {davies_bouldin_score(sample_data, labels)}')
        print(f'n_clusters = {n_cluster}: Silhouette Coefficient: {silhouette_score(shortened_sent_tfidf, labels)}')

n_clusters = 2: Silhouette Coefficient: 0.008146341934268077
n_clusters = 2: Silhouette Coefficient: 0.008146341934268077


KeyboardInterrupt: 

### Model 3: `CountVectorizer` with `tokenizer=LemmaTokenizer()` + `TruncatedSVD` + `KMeans`

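**Remark**: The `PCA` class does not support sparse input in the scikit-learn version used here, so we use `TruncatedSVD` (latent semantic analysis) for dimension reduction instead.

We reduce the dimension to 100. The Silhouette scores, which the cell below computes on the original `shortened_sent_bow` matrix using the labels found in the reduced space, are exactly the same as before the dimension reduction.

**We terminated the parameter search early.**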

In [105]:
from sklearn.decomposition import TruncatedSVD

In [101]:
shortened_sent_bow = CountVectorizer(tokenizer=LemmaTokenizer()).fit_transform(shortened_sent)

In [116]:
lsa = TruncatedSVD(n_components=100)
shortened_sent_reducded = lsa.fit_transform(shortened_sent_bow)

In [114]:
lsa.explained_variance_ratio_[:10]

array([0.50368223, 0.03100053, 0.00577348, 0.0062948 , 0.00417825,
       0.00403233, 0.00384653, 0.00317059, 0.00308567, 0.00299286])

In [115]:
lsa.explained_variance_ratio_.sum()

0.666552006697911
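Here the first component alone accounts for about half of the variance, and all 100 components together for about two thirds. A common way to choose `n_components` is to look at the cumulative explained variance. The sketch below does this on a synthetic sparse count matrix; the data, the target of 0.5, and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic sparse count matrix standing in for the bag-of-words matrix.
X = csr_matrix(rng.poisson(0.05, size=(500, 200)).astype(float))

svd = TruncatedSVD(n_components=100, random_state=0).fit(X)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the target, if any.
target = 0.5
if cumulative[-1] >= target:
    n_needed = int(np.searchsorted(cumulative, target)) + 1
else:
    n_needed = None

print(f"total explained by 100 components: {cumulative[-1]:.3f}")
print(f"components needed for {target:.0%}: {n_needed}")
```

Note that `TruncatedSVD` does not center the data, so the individual ratios are not guaranteed to be in decreasing order, as the output above also shows for the third and fourth components.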

In [117]:
for random_state in [0, 10, 42]: 
    for n_cluster in range(2,7):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=random_state).fit(shortened_sent_reducded)
        labels = kmeans_model.labels_

        # print(f'Calinski-Harabasz Index: {calinski_harabasz_score(sample_data, labels)}')
        # print(f'Davies-Bouldin Index: {davies_bouldin_score(sample_data, labels)}')
        print(f'n_clusters = {n_cluster}, random_state = {random_state}: Silhouette Coefficient: {silhouette_score(shortened_sent_bow, labels)}')

n_clusters = 2, random_state = 0: Silhouette Coefficient: 0.28244613563900284
n_clusters = 3, random_state = 0: Silhouette Coefficient: 0.17099173313074406
n_clusters = 4, random_state = 0: Silhouette Coefficient: 0.12557863515767284


KeyboardInterrupt: 
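The vectorization and reduction steps can also be chained with the `Pipeline` class imported earlier. The sketch below is one possible arrangement on a hypothetical toy corpus. It uses the default `CountVectorizer` tokenizer instead of `LemmaTokenizer` to stay free of NLTK data, and it scores the clustering in the reduced space, which differs from the loop above, where the score is computed on `shortened_sent_bow`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two rough topics.
docs = [
    "police arrested protesters in london",
    "protesters marched through london streets",
    "police say protesters marched peacefully",
    "london police monitored the march",
    "the bank raised interest rates",
    "interest rates hurt the bank profits",
    "bank profits fell as rates rose",
    "investors expect the bank to cut rates",
]

# Bag-of-words followed by truncated SVD (LSA) down to 2 dimensions.
reducer = Pipeline([
    ("bow", CountVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])
reduced = reducer.fit_transform(docs)

# Cluster and score in the same (reduced) space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
score_reduced = silhouette_score(reduced, labels)

print(reduced.shape, labels, f"{score_reduced:.3f}")
```

Scoring in the reduced space is usually faster, but the resulting values are not directly comparable with scores computed on the original matrix.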