<b><font size="5">12 - Document Clustering</font></b><br>
<font size="3">*Or, "The First Law of Geography: Everything is related to everything else, but near things are more related than distant things. (Waldo Tobler)"*</font>
<br><br>
<b>Lesson Goal: Cluster text documents using the k-means, OPTICS, and HDBSCAN algorithms.</b> <br>

<b>Information Requirement:</b>
- Can documents be grouped into meaningful clusters from their vectorised content?

<b>Lesson Outline:</b> 
- First, we will start from the NER event dataset explored and preprocessed in the last class and cluster its constituent documents (short news headlines) using the k-means algorithm;
- Then, we will assess, optimise, and visualise the k-means clustering;
- Lastly, we will repeat the process using the density- and hierarchy-based OPTICS and HDBSCAN algorithms.

<b>Package requirements:</b>
*(Run `conda update` if needed)*
- threadpoolctl > 3.5.0
- scikit-learn (`sklearn`) >= 1.3

### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [0. Prepare the NER Dataset](#P0)
- [1. Cluster documents using k-means](#P1)
    - [1.1 Test document clustering using k-means](#P11)
    - [1.2 Assess the quality of clustering](#P12)
    - [1.3 Optimize k-means clustering using the elbow method](#P13)
    - [1.4 Visualize clusters in 3D](#P14)
- [2. Cluster documents using hierarchical- and density-based algorithms](#P2)
    - [2.1 Cluster documents using OPTICS](#P21)
    - [2.2 Cluster documents using HDBSCAN](#P22)


In [3]:
%load_ext autoreload
%autoreload 2

#General-Purpose
import pandas as pd
import numpy as np
import re
import nltk
import pickle
import shutil
import os
from tqdm import tqdm

#Preprocessing
from utils.pipeline import *
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA, TruncatedSVD

#Vectorization
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#Clustering
from sklearn.cluster import KMeans, HDBSCAN, OPTICS

#Evaluation
from sklearn.metrics import silhouette_score, calinski_harabasz_score, adjusted_mutual_info_score

#Visualization
import plotly.graph_objects as go
import plotly.express as px



In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
reviews_df = pd.read_csv('C:/Users/msard/OneDrive/Desktop/Data Science/Fall 2024/Text Mining/Hyderabadi-Word-Soup/data_hyderabad/10k_reviews.csv')
restaurants_df = pd.read_csv('C:/Users/msard/OneDrive/Desktop/Data Science/Fall 2024/Text Mining/Hyderabadi-Word-Soup/data_hyderabad/105_restaurants.csv')

<font color='#BFD72F' size=5>0. Prepare the NER Dataset</font> <a class="anchor" id="P0"></a>
- 
  
[Back to TOC](#toc)

<b>Task 0.0.1a: Option 1</b> Repeat the Data Preparation process from TM11, albeit increasing sample size (recommended sample size: >0.05).

*Expected runtime: 10' for sample size = 1.0*

In [53]:
def ner_dataset_preparer(sample_size=0.25):
    
    # Load ner_dataset
    ner_dataset = pd.read_csv('data/ner_dataset.csv')  # encoding="cp1252"
    ner_dataset = ner_dataset.drop(columns=["Sentence #","POS","Tag"])
    ner_dataset = ner_dataset.rename(columns={"Sentence":"raw_content"})
    ner_dataset = ner_dataset.sample(frac=sample_size, random_state=39)

    # Remove whitespace that precedes punctuation (raw string avoids invalid-escape warnings)
    preceding_whitespace_pattern = r"(\s)(?=[\.\,\!\?\;\:\'])"
    ner_dataset["raw_content"] = ner_dataset["raw_content"].map(lambda content : re.sub(preceding_whitespace_pattern,"",content))

    # Create full preproc column
    full_preprocessor = pipeline_v2b.MainPipeline(lemmatized=True, custom_stopwords=["said","says","say"]).main_pipeline
    ner_dataset["preproc_content"] = ner_dataset["raw_content"].map(lambda content : full_preprocessor(content))

    # Create doc2vec preproc column
    doc2vec_preprocessor = pipeline_v2b.MainPipeline(lemmatized=False,no_stopwords=False,lowercase=False).main_pipeline
    ner_dataset["doc2vec_content"] = ner_dataset["raw_content"].map(lambda content : doc2vec_preprocessor(content))

    # Vectorise using BOW
    bow_vectorizer = CountVectorizer(ngram_range=(1,1), token_pattern=r"(?u)\b\w+\b")
    ner_dataset_bow_td_matrix = bow_vectorizer.fit_transform(ner_dataset["preproc_content"]).toarray()
    ner_dataset["bow_vector"] = ner_dataset_bow_td_matrix.tolist()

    # Vectorise using TF-IDF
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), token_pattern=r"(?u)\b\w+\b")
    ner_dataset_tfidf_td_matrix = tfidf_vectorizer.fit_transform(ner_dataset["preproc_content"]).toarray()
    ner_dataset["tfidf_vector"] = ner_dataset_tfidf_td_matrix.tolist()

    # Vectorise using Doc2Vec
    # TaggedDocument expects a list of tokens; this assumes doc2vec_content holds plain strings,
    # which would otherwise be read character by character
    documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(ner_dataset["doc2vec_content"])]
    d2v_model = Doc2Vec(documents, vector_size=300, window=6, min_count=1, workers=4, epochs=20)
    ner_dataset["doc2vec_vector"] = [d2v_model.dv[idx].tolist() for idx in tqdm(range(len(ner_dataset)))]

    return ner_dataset


In [4]:
ner_dataset = ner_dataset_preparer(sample_size=0.05)

100%|██████████| 2398/2398 [00:00<00:00, 74476.79it/s]


<b>Task 0.0.1b: Option 2</b> Unzip and load the prepared full NER Dataset (warning: when unzipped, the `ner_dataset` pickle file weighs ~12 GB, although it can safely be deleted after loading the dataframe.)

*Expected runtime: 4'*

In [30]:
shutil.unpack_archive('data/ner_dataset.zip','data/')

In [None]:
with open('data/ner_dataset.pickle', 'rb') as f:
    ner_dataset = pickle.load(f)

In [32]:
os.remove('data/ner_dataset.pickle')

In [None]:
ner_dataset = ner_dataset.sample(frac=0.05, random_state=39)
ner_dataset

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector
47443,Thai officials say the country plans to instal...,thai official say country plan install first t...,Thai officials say the country plans to instal...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.013276808895170689, -0.032141394913196564, ..."
37232,Fifty-one people are believed to have killed i...,fifty-one people believe kill fiery bus crash ...,Fifty-one people are believed to have killed i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.014542167074978352, -0.013322676531970501, ..."
39016,"Soviet leaders favored the choreographer, alth...",soviet leader favor choreographer although mem...,Soviet leaders favored the choreographer altho...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.034374285489320755, -0.010712085291743279,..."
42384,Attorney General Mukasey was formally sworn in...,attorney general mukasey formally swear wednes...,Attorney General Mukasey was formally sworn in...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0017161703435704112, 0.03675336763262749, ..."
8066,"During last month's six-nation talks, North Ko...",last month six-nation talk north korea agree a...,During last months six-nation talks North Kore...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.010788045823574066, -0.004852193407714367,..."
...,...,...,...,...,...,...
7835,"In the occupied West Bank, the Israeli militar...",occupy west bank israeli military say palestin...,In the occupied West Bank the Israeli military...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.02139437198638916, 0.03249619901180267, 0...."
23126,Aboriginal settlers arrived on the continent f...,aboriginal settler arrive continent southeast ...,Aboriginal settlers arrived on the continent f...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.018541263416409492, 0.03823277726769447, -0..."
37418,Kenyan police say Immigration Minister Jebii K...,kenyan police say immigration minister jebii k...,Kenyan police say Immigration Minister Jebii K...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.030502043664455414, -0.025818223133683205,..."
41944,The International Atomic Energy Agency was set...,international atomic energy agency set united ...,The International Atomic Energy Agency was set...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.03171413019299507, -0.02980659157037735, -..."


**Task 0.0.2:** Retrieve the BoW/TF-IDF term-document matrices and generate a (Doc2Vec) component-document matrix from the `ner_dataset`.

In [5]:
# Stack the per-document vectors (stored as lists) into 2D arrays of shape (n_documents, n_features)
ner_bow_td_matrix = np.array(ner_dataset["bow_vector"].tolist())
ner_tfidf_td_matrix = np.array(ner_dataset["tfidf_vector"].tolist())
ner_doc2vec_td_matrix = np.array(ner_dataset["doc2vec_vector"].tolist())

<font color='#BFD72F' size=5>1. Cluster documents using k-means</font> <a class="anchor" id="P1"></a>
- 
  
[Back to TOC](#toc)

<font color='#BFD72F' size=4>1.1 Test document clustering using k-means</font> <a class="anchor" id="P11"></a>
-

**Recall** the k-means algorithm:
- Initialise: Place k centroids at random
- Repeat while the centroid travel distance is above a threshold:
    - Assign each data point to its closest centroid
    - Compute the mean of the data points assigned to each centroid
    - Move each centroid to that mean while recording the centroid travel distance
    - Evaluate the centroid travel distance against the threshold
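
The steps above can be sketched in plain NumPy. This is a minimal illustration, not the sklearn implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and the initial centroids are simply drawn from the data points.

```python
import numpy as np

def kmeans_sketch(X, k, tol=1e-6, max_iter=100, seed=39):
    """Hypothetical helper: plain Lloyd iterations on a dense array X."""
    rng = np.random.default_rng(seed)
    # Initialise: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its closest centroid (squared Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (left in place if it has none)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Record the largest centroid travel distance and compare it to the threshold
        travel = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if travel < tol:
            break
    return labels, centroids

# Toy data: two tight, well-separated groups of 20 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
```

On the lesson data, `KMeans` from sklearn remains the better choice, since it adds a smarter initialisation (k-means++) and optimised distance computations.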

<img src="https://assets.blog.code-specialist.com/k_means_animation_6cdd31d106.gif" width=800>

<b>Task 1.1.1:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans">sklearn implementation of k-means</a> to cluster TF-IDF document vectors. Use k=9.

In [6]:
kmeans_tfidf = KMeans(n_clusters=9, random_state=39).fit(ner_tfidf_td_matrix)

In [7]:
set(kmeans_tfidf.labels_)

{0, 1, 2, 3, 4, 5, 6, 7, 8}

In [8]:
ner_dataset["kmeans_tfidf_clusters"] = kmeans_tfidf.labels_.tolist()

In [9]:
ner_dataset.loc[ner_dataset["kmeans_tfidf_clusters"]==0]

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector,kmeans_tfidf_clusters
42384,Attorney General Mukasey was formally sworn in...,attorney general mukasey formally swear wednes...,Attorney General Mukasey was formally sworn in...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0192197784781456, -0.0401582345366478, 0.0...",0
30321,French President Jacques Chirac cut short a va...,french president jacques chirac cut short vaca...,French President Jacques Chirac cut short a va...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0013230304466560483, -0.043248146772384644...",0
21942,State Department spokesman Richard Boucher sai...,state department spokesman richard boucher say...,State Department spokesman Richard Boucher sai...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.03877526894211769, -0.048570722341537476, ...",0
18148,Brazilian President Luiz Inacio Lula da Silva ...,brazilian president luiz inacio lula da silva ...,Brazilian President Luiz Inacio Lula da Silva ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.000582390814088285, 0.0017582299187779427,...",0
40300,Former President Bill Clinton was also critica...,former president bill clinton also critical su...,Former President Bill Clinton was also critica...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.005634291097521782, -0.01781783439218998, 0...",0
...,...,...,...,...,...,...,...
12404,Visiting Afghan President Hamid Karzai thanked...,visiting afghan president hamid karzai thank i...,Visiting Afghan President Hamid Karzai thanked...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.027142636477947235, 0.024441249668598175, 0...",0
26028,Venezuela's foreign ministry has rejected Colo...,venezuelas foreign ministry reject colombian c...,Venezuelas foreign ministry has rejected Colom...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.024126417934894562, -0.03653780743479729, ...",0
43006,French President Nicolas Sarkozy offered his c...,french president nicolas sarkozy offer condole...,French President Nicolas Sarkozy offered his c...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.06145235151052475, -0.002301904372870922, ...",0
37114,Fidel RAMOS was elected president in 1992.,fidel ramos elect president 1992,Fidel RAMOS was elected president in 1992,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.024144811555743217, 0.005283172242343426, ...",0


<b>Task 1.1.2:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans">sklearn implementation of k-means</a> to cluster BoW document vectors. Use k=9.

In [10]:
kmeans_bow = KMeans(n_clusters=9, random_state=39).fit(ner_bow_td_matrix)
ner_dataset["kmeans_bow_clusters"] = kmeans_bow.labels_.tolist()

<b>Task 1.1.3:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans">sklearn implementation of k-means</a> to cluster Doc2Vec document vectors. Use k=9.

In [11]:
kmeans_doc2vec = KMeans(n_clusters=9, random_state=39).fit(ner_doc2vec_td_matrix)
ner_dataset["kmeans_doc2vec_clusters"] = kmeans_doc2vec.labels_.tolist()

<font color='#BFD72F' size=4>1.2 Assess the quality of clustering</font> <a class="anchor" id="P12"></a>
-

**Understand** the <a href="https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation">sklearn clustering evaluation metrics</a>:
- **Inertia (inf to 0; lower is better)**: sum of squared distances of samples to their closest cluster center *(k-means only)*;

<img src="https://scikit-learn.org/1.5/_images/sphx_glr_plot_affinity_propagation_001.png" width=600>

- **Adjusted Mutual Information (at most 1; 1 is best, values near 0 indicate chance-level agreement and can be slightly negative)**: function that measures the agreement of two cluster assignments, ignoring permutations of the label names *(preferably used when there is a ground truth clustering to compare against)*;

<img src="ami.jpg" width=600>

- **Silhouette Score (-1 to +1; +1 is best)**: mean silhouette coefficient of all samples. For one sample, the coefficient is (b - a) / max(a, b), where a is the mean distance to the other samples in its own cluster (intracluster) and b is the mean distance to the samples in the nearest other cluster (intercluster);

<img src="https://miro.medium.com/v2/resize:fit:1400/0*Tqd113bamewWq0_q.jpg" width=600>

- **Calinski-Harabasz Score** (0 to inf; higher is better): ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters (where dispersion is defined as the sum of squared distances).

<img src="https://media.licdn.com/dms/image/v2/D4D22AQH2Lf5ga26r_w/feedshare-shrink_800/feedshare-shrink_800/0/1706416821895?e=2147483647&v=beta&t=rbrehcqgjee_C8-678zXAU3gJBK5F06fxNA1noKRLD4" width=600>
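
As a quick sanity check of the four metrics above, the sketch below computes them on synthetic blobs rather than on the lesson data. The toy dataset and all variable names are assumptions for illustration; only the sklearn functions are the ones used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_mutual_info_score,
                             calinski_harabasz_score, silhouette_score)

# Toy data: three compact blobs in 2D
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=39)
km = KMeans(n_clusters=3, n_init=10, random_state=39).fit(X_toy)

# Inertia by hand: squared distance of every sample to its assigned centroid
manual_inertia = ((X_toy - km.cluster_centers_[km.labels_]) ** 2).sum()

sil = silhouette_score(X_toy, km.labels_)
ch = calinski_harabasz_score(X_toy, km.labels_)

# AMI ignores how clusters are numbered: renaming the ids keeps the score at 1
permuted_labels = (km.labels_ + 1) % 3
ami_permuted = adjusted_mutual_info_score(km.labels_, permuted_labels)

print("inertia (sklearn / manual):", km.inertia_, manual_inertia)
print("silhouette:", sil)
print("Calinski-Harabasz:", ch)
print("AMI against renamed labels:", ami_permuted)
```

The renaming check is the reason AMI is suited to comparing two clusterings whose cluster numbers carry no meaning, as in Task 1.2.2.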


<b>Task 1.2.1:</b> Compute the inertia, silhouette score, and Calinski-Harabasz score for all k-means clusters generated in 1.1.

In [12]:
model_list = [("kmeans tfidf k=9", kmeans_tfidf, ner_tfidf_td_matrix),
              ("kmeans BoW k=9", kmeans_bow, ner_bow_td_matrix),
              ("kmeans Doc2Vec k=9", kmeans_doc2vec, ner_doc2vec_td_matrix)]

def unsupervised_score_calculator(model_list):
    # Each entry is (display name, fitted k-means model, matrix the model was fitted on)
    for name, model, matrix in model_list:
        # Inertia
        print("Inertia of {}: {}".format(name, model.inertia_))
        # Silhouette score
        print("Silhouette score of {}: {}".format(name, silhouette_score(matrix, model.labels_)))
        # Calinski-Harabasz score
        print("Calinski-Harabasz score of {}: {}".format(name, calinski_harabasz_score(matrix, model.labels_)))
        print("\n")

In [13]:
unsupervised_score_calculator(model_list)

Inertia of kmeans tfidf k=9: 2327.511195329767
Silhouette score of kmeans tfidf k=9: 0.004651538367410463
Calinski-Harabasz score of kmeans tfidf k=9: 6.328704612368464


Inertia of kmeans BoW k=9: 29776.08294558697
Silhouette score of kmeans BoW k=9: 0.018356361456463292
Calinski-Harabasz score of kmeans BoW k=9: 11.686404379768202


Inertia of kmeans Doc2Vec k=9: 405.06094965668694
Silhouette score of kmeans Doc2Vec k=9: 0.06246234721533703
Calinski-Harabasz score of kmeans Doc2Vec k=9: 154.6993663846699




<b>Task 1.2.2:</b> Compute the Adjusted Mutual Information between all k-means clusters generated in 1.1.

In [14]:
ami_matrix = np.array([[adjusted_mutual_info_score(tuple1[1].labels_,tuple2[1].labels_) for tuple1 in model_list]\
                        for tuple2 in model_list])

names_list = [tuples[0] for tuples in model_list]
ami_df = pd.DataFrame(data=ami_matrix,columns=names_list,index=names_list)
ami_df

Unnamed: 0,kmeans tfidf k=9,kmeans BoW k=9,kmeans Doc2Vec k=9
kmeans tfidf k=9,1.0,0.39922,0.051142
kmeans BoW k=9,0.39922,1.0,0.026326
kmeans Doc2Vec k=9,0.051142,0.026326,1.0


<font color='#BFD72F' size=4>1.3 Optimize k-means clustering using the elbow method</font> <a class="anchor" id="P13"></a>
-

**Recall** that a useful heuristic for choosing the number of clusters without overfitting is the "elbow method": finding the point of maximum bend (the "elbow") in the plot of inertia against the number of clusters.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*-hmOKGfaUScq73e30R1q0g.png" width=600>

**Task 1.3.1:** Define a function that plots the inertia of a k-means model given a range of possible k values.

In [None]:
def inertia_plotter(tf_matrix, max_k = 10, verbose=False):
    x_k_nr = []
    y_inertia = []
    for k in tqdm(range(2,max_k+1)):
        x_k_nr.append(k)
        kmeans = KMeans(n_clusters=k,random_state=0).fit(tf_matrix)
        y_inertia.append(kmeans.inertia_)
        if verbose==True:
            print("For k = {}, inertia = {}".format(k,round(kmeans.inertia_,3)))
    fig = px.line(x=x_k_nr, y=y_inertia, markers=True)
    fig.show()

In [18]:
inertia_plotter(ner_tfidf_td_matrix,30)

100%|██████████| 28/28 [01:58<00:00,  4.23s/it]


In [19]:
inertia_plotter(ner_doc2vec_td_matrix,30)

100%|██████████| 28/28 [00:11<00:00,  2.51it/s]


**Task 1.3.2:** Define a function (`elbow_finder`) that returns the optimal k value ("elbow") by selecting the sample that is furthest away from the line connecting the first (smallest k) and last (highest k) samples.

<img src="https://typeset.io/figures/fig-2-kneedle-algorithm-for-online-knee-detection-a-depicts-p3zcelsb.png" width=600>
<br>
<img src="https://th.bing.com/th/id/OIP.shQdO0X4pXic4oiqG_Ka5QAAAA?rs=1&pid=ImgDetMain">
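
The geometric idea can be checked on a hand-made inertia curve before running it on real models. The sketch below is an illustration only: the helper name `furthest_from_chord` and the inertia values are invented for this example.

```python
import numpy as np

def furthest_from_chord(ks, values):
    """Hypothetical helper: k whose point lies furthest from the first-to-last line."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)
    # Line through the first and last points, written as y = a*x + c
    a = (values[-1] - values[0]) / (ks[-1] - ks[0])
    c = values[0] - a * ks[0]
    # Perpendicular distance of every point (k, value) to that line
    distances = np.abs(a * ks - values + c) / np.sqrt(a ** 2 + 1)
    return int(ks[distances.argmax()])

# Invented inertia values: steep drop up to k=4, slow decline afterwards
ks = np.arange(1, 11)
inertia = np.array([100, 60, 35, 20, 17, 15, 13.5, 12, 11, 10])
elbow_k = furthest_from_chord(ks, inertia)
print(elbow_k)  # 4
```

Because the denominator `sqrt(a**2 + 1)` is the same for every k, the selected k is simply the one with the largest vertical gap between the curve and the line.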

In [33]:
def elbow_finder(tf_matrix, max_k=10, verbose=True):
    
    y_inertia = []
    for k in tqdm(range(1,max_k+1)):
        kmeans = KMeans(n_clusters=k,random_state=0).fit(tf_matrix)
        if verbose==True:
            print("For k = {}, inertia = {}".format(k,round(kmeans.inertia_,3)))
        y_inertia.append(kmeans.inertia_)

    x = np.array([1, max_k])
    y = np.array([y_inertia[0], y_inertia[-1]])
    coefficients = np.polyfit(x, y, 1)
    line = np.poly1d(coefficients)

    a = coefficients[0]
    c = coefficients[1]

    elbow_point = max(range(1, max_k+1), key=lambda i: abs(y_inertia[i-1] - line(i)) / np.sqrt(a**2 + 1))
    print(f'Optimal value of k according to the elbow method: {elbow_point}')
    
    return elbow_point


In [27]:
elbow_finder(ner_doc2vec_td_matrix,40)

100%|██████████| 40/40 [00:17<00:00,  2.23it/s]

Optimal value of k according to the elbow method: 9





9

**Task 1.3.3:** Use the `elbow_finder` function to select an optimal k for the TF-IDF document vectors. Then, update the `kmeans_tfidf_clusters` column of the `ner_dataset` with the new cluster labels.

In [51]:
elbow_finder(ner_tfidf_td_matrix,40,verbose=False)

100%|██████████| 40/40 [03:49<00:00,  5.73s/it]

Optimal value of k according to the elbow method: 15





15

In [50]:
kmeans_tfidf = KMeans(n_clusters=15, random_state=0).fit(ner_tfidf_td_matrix)
ner_dataset["kmeans_tfidf_clusters"] = kmeans_tfidf.labels_.tolist()

**Task 1.3.4:** As per the advice on <a href="https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#performing-dimensionality-reduction-using-lsa">sklearn</a> about clustering text features, use Latent Semantic Analysis (LSA) to reduce the dimensionality of the TF-IDF term-document matrix, and then use the `elbow_finder` function to select the best k. Lastly, create a `kmeans_tfidf_lsa_clusters` column of the `ner_dataset` with the new cluster labels.


In [None]:
lsa = TruncatedSVD(n_components=200) ## The number of components can also be optimized
lsa_result = lsa.fit_transform(ner_tfidf_td_matrix)
lsa_result.shape

(2398, 200)

*Note: Use the explained variance ratio to understand the percentage of total variance in the dataset captured by the number of components used in the LSA*

<img src="https://vitalflux.com/wp-content/uploads/2020/08/Screenshot-2020-08-08-at-12.05.44-PM.png">
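
One way to pick the number of components is to look at the cumulative explained variance and keep the smallest number that reaches a target share. The sketch below uses a synthetic low-rank matrix, and the 90% target is an arbitrary assumption for illustration, not a recommendation from the lesson.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic "term-document" matrix: 5 latent topics plus a little noise
rng = np.random.default_rng(39)
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X_toy = latent @ loadings + 0.1 * rng.normal(size=(200, 50))

svd = TruncatedSVD(n_components=20, random_state=39).fit(X_toy)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target
target = 0.90  # assumed threshold
reached = np.flatnonzero(cumulative >= target)
n_needed = int(reached[0]) + 1 if reached.size else None

print("cumulative explained variance:", np.round(cumulative[:6], 3))
print("components needed for 90%:", n_needed)
```

On sparse, high-dimensional TF-IDF data the curve usually rises far more slowly than in this toy case, which is consistent with 200 components capturing only about a third of the variance in the cell below.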

In [48]:
lsa.explained_variance_ratio_.sum()

0.32194807568809086

In [42]:
elbow_finder(lsa_result,40,verbose=False)

100%|██████████| 40/40 [00:10<00:00,  3.95it/s]

Optimal value of k according to the elbow method: 12





12

In [49]:
kmeans_tfidf_lsa = KMeans(n_clusters=12, random_state=0).fit(lsa_result)
ner_dataset["kmeans_tfidf_lsa_clusters"] = kmeans_tfidf_lsa.labels_.tolist()

<font color='#BFD72F' size=4>1.4 Visualize clusters in 3D</font> <a class="anchor" id="P14"></a>
-

*Before visualising clusters in 3D, we want to quickly understand what they represent - to do so, we will name them using the top 5 TF-IDF tokens for each label.*

**Task 1.4.1:** Define a naming function that takes a dataset with a `preproc_content` column and the name of the column containing the labels, and that uses TF-IDF to retrieve the highest scoring N tokens for each label, concatenating them to return a topic name for each label.

In [None]:
def cluster_namer(dataset, label_column_name, nr_words=5):

    labels = list(set(dataset[label_column_name]))
    # Build one pseudo-document per label by concatenating its documents
    corpus = []
    for label in labels:
        label_docs = dataset["preproc_content"].loc[dataset[label_column_name]==label]
        corpus.append(" ".join(label_docs))

    # Fit TF-IDF once over all pseudo-documents (refitting inside the loop gives identical results)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), token_pattern=r"(?u)\b\w+\b")
    corpus_tfidf_td_matrix = tfidf_vectorizer.fit_transform(corpus)
    corpus_tfidf_word_list = tfidf_vectorizer.get_feature_names_out()

    label_name_list = []
    for idx in range(len(corpus)):
        label_vocabulary = pipeline_v2b.word_freq_calculator(corpus_tfidf_td_matrix[idx].toarray(),\
                                                                        corpus_tfidf_word_list, df_output=True)

        label_vocabulary = label_vocabulary.head(nr_words)
        # Join the top tokens; each token is prefixed with "_", so names start with an underscore
        label_name = "".join("_" + word for word in label_vocabulary["words"])
        label_name_list.append(label_name)

    # Replace every label with its generated name (note: modifies the passed dataset in place)
    label_name_dict = dict(zip(labels,label_name_list))
    dataset[label_column_name] = dataset[label_column_name].map(lambda label : label_name_dict[label])

    return dataset

In [61]:
ner_dataset = cluster_namer(ner_dataset, "kmeans_tfidf_clusters")
ner_dataset

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector,kmeans_tfidf_clusters,kmeans_bow_clusters,kmeans_doc2vec_clusters,kmeans_tfidf_lsa_clusters
47443,Thai officials say the country plans to instal...,thai official say country plan install first t...,Thai officials say the country plans to instal...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0007468890398740768, -0.029474955052137375,...",_official_say_last_month_un,7,2,_year_old_say_last_million
37232,Fifty-one people are believed to have killed i...,fifty-one people believe kill fiery bus crash ...,Fifty-one people are believed to have killed i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.003389752469956875, -0.026266204193234444, ...",_kill_least_people_say_bomb,7,0,_kill_least_people_wound_say
39016,"Soviet leaders favored the choreographer, alth...",soviet leader favor choreographer although mem...,Soviet leaders favored the choreographer altho...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.01417169626802206, -0.01004005502909422, -...",_government_new_say_leader_rebel,5,0,_say_mr_country_us_two
42384,Attorney General Mukasey was formally sworn in...,attorney general mukasey formally swear wednes...,Attorney General Mukasey was formally sworn in...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0192197784781456, -0.0401582345366478, 0.0...",_president_bush_say_hugo_us,5,4,_president_bush_say_mr_us
8066,"During last month's six-nation talks, North Ko...",last month six-nation talk north korea agree a...,During last months six-nation talks North Kore...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.008597010746598244, 0.02697954699397087, -...",_nuclear_program_iran_weapon_say,5,8,_nuclear_program_iran_weapon_say
...,...,...,...,...,...,...,...,...,...,...
7835,"In the occupied West Bank, the Israeli militar...",occupy west bank israeli military say palestin...,In the occupied West Bank the Israeli military...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.011245496571063995, 0.006958349607884884, ...",_palestinian_israeli_militant_fatah_abbas,1,2,_palestinian_israeli_militant_gaza_say
23126,Aboriginal settlers arrived on the continent f...,aboriginal settler arrive continent southeast ...,Aboriginal settlers arrived on the continent f...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.036412257701158524, 0.03029363416135311, -0...",_year_old_say_last_million,5,5,_year_old_say_last_million
37418,Kenyan police say Immigration Minister Jebii K...,kenyan police say immigration minister jebii k...,Kenyan police say Immigration Minister Jebii K...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.025957779958844185, -0.005221147555857897,...",_police_say_afghan_taleban_afghanistan,1,2,_say_us_official_police_military
41944,The International Atomic Energy Agency was set...,international atomic energy agency set united ...,The International Atomic Energy Agency was set...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.01855214685201645, -0.007976766675710678, ...",_nuclear_program_iran_weapon_say,5,8,_nuclear_program_iran_weapon_say


In [52]:
ner_dataset = cluster_namer(ner_dataset, "kmeans_tfidf_lsa_clusters")
ner_dataset

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector,kmeans_tfidf_clusters,kmeans_bow_clusters,kmeans_doc2vec_clusters,kmeans_tfidf_lsa_clusters
47443,Thai officials say the country plans to instal...,thai official say country plan install first t...,Thai officials say the country plans to instal...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0007468890398740768, -0.029474955052137375,...",14,7,2,_year_old_say_last_million
37232,Fifty-one people are believed to have killed i...,fifty-one people believe kill fiery bus crash ...,Fifty-one people are believed to have killed i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.003389752469956875, -0.026266204193234444, ...",12,7,0,_kill_least_people_wound_say
39016,"Soviet leaders favored the choreographer, alth...",soviet leader favor choreographer although mem...,Soviet leaders favored the choreographer altho...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.01417169626802206, -0.01004005502909422, -...",4,5,0,_say_mr_country_us_two
42384,Attorney General Mukasey was formally sworn in...,attorney general mukasey formally swear wednes...,Attorney General Mukasey was formally sworn in...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0192197784781456, -0.0401582345366478, 0.0...",1,5,4,_president_bush_say_mr_us
8066,"During last month's six-nation talks, North Ko...",last month six-nation talk north korea agree a...,During last months six-nation talks North Kore...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.008597010746598244, 0.02697954699397087, -...",7,5,8,_nuclear_program_iran_weapon_say
...,...,...,...,...,...,...,...,...,...,...
7835,"In the occupied West Bank, the Israeli militar...",occupy west bank israeli military say palestin...,In the occupied West Bank the Israeli military...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.011245496571063995, 0.006958349607884884, ...",10,1,2,_palestinian_israeli_militant_gaza_say
23126,Aboriginal settlers arrived on the continent f...,aboriginal settler arrive continent southeast ...,Aboriginal settlers arrived on the continent f...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.036412257701158524, 0.03029363416135311, -0...",13,5,5,_year_old_say_last_million
37418,Kenyan police say Immigration Minister Jebii K...,kenyan police say immigration minister jebii k...,Kenyan police say Immigration Minister Jebii K...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.025957779958844185, -0.005221147555857897,...",2,1,2,_say_us_official_police_military
41944,The International Atomic Energy Agency was set...,international atomic energy agency set united ...,The International Atomic Energy Agency was set...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.01855214685201645, -0.007976766675710678, ...",7,5,8,_nuclear_program_iran_weapon_say


**Task 1.4.2:** Define a 3D plotting function that takes a dataset, the name of a document vector column, and the name of a cluster label column, and returns a 3D graph of the document clusters.

*Note that you can define several <a href="https://plotly.com/python/discrete-color/">color scales using plotly</a> to increase readability.*

In [60]:
def plotter_3d_cluster(dataset_org, vector_column_name, cluster_label_name, write_html=False, html_name="test.html"):
    """Project document vectors to 3D with truncated SVD and plot them colored by cluster label."""

    # Work on a copy so the caller's dataframe is left untouched
    dataset = dataset_org.copy()
    dataset = cluster_namer(dataset, cluster_label_name)

    # Stack the per-document vectors into a (n_documents, n_features) matrix
    td_matrix = np.array(dataset[vector_column_name].tolist())

    # Reduce the document vectors to three components for plotting
    svd_n3 = TruncatedSVD(n_components=3)
    svd_result = svd_n3.fit_transform(td_matrix)

    # Store each SVD component in its own column
    for component in range(3):
        col_name = "svd_d3_x{}".format(component)
        dataset[col_name] = svd_result[:, component].tolist()

    fig = px.scatter_3d(dataset,
                        x='svd_d3_x0',
                        y='svd_d3_x1',
                        z='svd_d3_x2',
                        color=cluster_label_name,
                        title=vector_column_name + "__" + cluster_label_name,
                        opacity=0.7,
                        hover_name="preproc_content",
                        color_discrete_sequence=px.colors.qualitative.Alphabet)

    # Optionally export the interactive figure as a standalone HTML file
    if write_html:
        fig.write_html(html_name)
    fig.show()
    

In [59]:
plotter_3d_cluster(ner_dataset,"tfidf_vector","kmeans_tfidf_lsa_clusters")

In [62]:
plotter_3d_cluster(ner_dataset,"tfidf_vector","kmeans_tfidf_clusters")
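
*The 3D plots are projections: `TruncatedSVD` keeps only three components of the document vectors, so clusters that overlap in the plot may still be separated in the full vector space. The sketch below is a minimal illustration of how to check how much variance the three components retain. The `toy_docs` list is a hypothetical stand-in, not the lesson's NER dataset.*

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to keep the example self-contained
toy_docs = [
    "iran test missile launch",
    "pakistan launch ballistic missile test",
    "russian president putin visit moscow",
    "putin dismiss russian minister",
    "police arrest suspect after attack",
    "police say attack kill two soldier",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Same reduction as in the plotting function: three SVD components
svd = TruncatedSVD(n_components=3, random_state=0)
coords = svd.fit_transform(toy_tfidf)

# Share of the variance of the TF-IDF matrix kept by the three components
retained = svd.explained_variance_ratio_.sum()
print(f"Projected shape: {coords.shape}")
print(f"Variance retained by 3 components: {retained:.2%}")
```

*A low retained share suggests reading the 3D plot as a rough impression rather than a faithful picture of the clustering.*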

<font color='#BFD72F' size=5>2. Cluster documents using hierarchical- and density-based algorithms</font> <a class="anchor" id="P2"></a>
- 
  
[Back to TOC](#toc)

**Understand** that different clustering algorithms are suited to different use cases. Read the <a href="https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods">sklearn User Guide</a> to review their use cases and limitations.
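
*As a rough illustration of this point, the sketch below compares k-means with OPTICS on synthetic two-moons data, where the clusters are not convex. The dataset and the parameter values (`eps=0.2`, `min_samples=5`) are assumptions chosen for the toy example, not recommendations for the NER dataset.*

```python
from sklearn.cluster import KMeans, OPTICS
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic toy data (assumption for illustration): two interleaved half-circles
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means partitions the space around two centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# OPTICS with a DBSCAN-style extraction follows regions of high density
optics_labels = OPTICS(min_samples=5, cluster_method="dbscan", eps=0.2).fit_predict(X_moons)

# Adjusted Rand index against the known generating labels (1.0 = perfect match)
kmeans_ari = adjusted_rand_score(y_moons, kmeans_labels)
optics_ari = adjusted_rand_score(y_moons, optics_labels)
print(f"k-means ARI: {kmeans_ari:.2f}")
print(f"OPTICS ARI:  {optics_ari:.2f}")
```

*On data shaped like this, a density-based method can follow the curved clusters, while k-means tends to cut them along a straight boundary. Real document vectors are high-dimensional and sparse, so the picture there may differ.*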

<font color='#BFD72F' size=4>2.1 Cluster documents using OPTICS</font> <a class="anchor" id="P21"></a>
-

**Understand** that <a href="https://medium.com/@prasanNH/optics-hierarchical-density-based-clustering-e659e4b21764">OPTICS</a> finds core samples of high density, each with a minimum number of neighboring samples, and expands clusters from them, keeping the cluster hierarchy for a variable neighborhood radius.

<img src="reachability.jpg">

<img src="https://www.researchgate.net/publication/370354979/figure/fig4/AS:11431281154180452@1682697660013/Comparison-of-K-means-K3-and-OPTICS-on-Buckeye-corpus-cardinal-vowels.jpg" width=800>´

<b>Task 2.1.1:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html">sklearn implementation of OPTICS</a> to cluster TF-IDF document vectors. 

<i> **Note** that OPTICS can use various metrics to compute the reachability graph, including a precomputed square matrix of distances - or co-occurrences - between samples.</i>
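
*A minimal sketch of the precomputed option, assuming cosine distance as the measure between TF-IDF vectors. The `toy_docs` list and the variable names are hypothetical stand-ins for the lesson's data.*

```python
from sklearn.cluster import OPTICS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Hypothetical toy corpus, used only to keep the example self-contained
toy_docs = [
    "iran test missile launch",
    "pakistan launch ballistic missile test",
    "india test missile launch",
    "russian president putin visit moscow",
    "putin dismiss russian minister",
    "russian president putin send greeting",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Square (n_documents, n_documents) matrix of pairwise cosine distances
cosine_dist = pairwise_distances(toy_tfidf, metric="cosine")

# metric="precomputed" tells OPTICS to read X as a distance matrix
optics_cos = OPTICS(metric="precomputed", min_samples=2).fit(cosine_dist)
print(optics_cos.labels_)
```

*Precomputing the matrix makes the distance measure explicit, but it needs memory quadratic in the number of documents, which may be a constraint for larger corpora.*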

In [None]:
optics = OPTICS(metric="minkowski", min_samples=2).fit(ner_tfidf_td_matrix)

In [66]:
silhouette_score(ner_tfidf_td_matrix, optics.labels_)

-0.03063064133501761

In [67]:
len(set(optics.labels_))

131
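
*When reading the two numbers above, keep in mind that sklearn's OPTICS labels noise samples as `-1`. `len(set(labels_))` counts that label as if it were a cluster, and `silhouette_score` treats all noise points as one group. The sketch below, on synthetic data (an assumption for illustration), shows one way to separate noise from clusters before scoring.*

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic toy data: three blobs plus uniformly scattered background points
rng = np.random.default_rng(0)
X_blobs, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
X_noise = rng.uniform(low=-12, high=12, size=(40, 2))
X_toy = np.vstack([X_blobs, X_noise])

toy_labels = OPTICS(min_samples=5).fit(X_toy).labels_

# Count clusters without the noise label and measure the share of noise
n_clusters = len(set(toy_labels) - {-1})
noise_share = float(np.mean(toy_labels == -1))
print(f"Clusters (excluding noise): {n_clusters}")
print(f"Share of samples labeled as noise: {noise_share:.1%}")

# Silhouette on clustered samples only; it needs at least two clusters
clustered = toy_labels != -1
if n_clusters >= 2:
    score = silhouette_score(X_toy[clustered], toy_labels[clustered])
    print(f"Silhouette without noise: {score:.3f}")
```

*Reporting the noise share next to the silhouette score makes it easier to judge whether a score reflects the clusters themselves or mostly the unassigned points.*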

In [68]:
ner_dataset["optics_tfidf_clusters"] = optics.labels_.tolist()
ner_dataset.loc[ner_dataset["optics_tfidf_clusters"]==86]

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector,kmeans_tfidf_clusters,kmeans_bow_clusters,kmeans_doc2vec_clusters,kmeans_tfidf_lsa_clusters,optics_tfidf_clusters
2314,The Iranian report says the missile tests invo...,iranian report say missile test involve shahab...,The Iranian report says the missile tests invo...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.011196550913155079, 0.030995508655905724, ...",_say_officials_statement_military_ministry,6,1,_say_mr_country_us_two,86
28549,"Quoting a government source, Yonhap said the t...",quoting government source yonhap say test part...,Quoting a government source Yonhap said the te...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.011801731772720814, 0.013872230425477028, -...",_say_officials_statement_military_ministry,1,3,_government_also_say_country_leader,86
30615,The launch comes one day after Pakistan's riva...,launch come one day pakistans rival india test...,The launch comes one day after Pakistans rival...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.017659075558185577, -0.03984908387064934, 0...",_say_country_us_two_also,5,2,_say_mr_country_us_two,86
14787,Interceptor missiles are designed to shoot dow...,interceptor missile design shoot enemy missile,Interceptor missiles are designed to shoot dow...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.01977558434009552, 0.013915526680648327, -0...",_say_country_us_two_also,5,2,_say_mr_country_us_two,86
46488,Pakistan says it has successfully launched a b...,pakistan say successfully launch ballistic mis...,Pakistan says it has successfully launched a b...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.008678007870912552, -0.013260235078632832,...",_say_officials_statement_military_ministry,1,2,_say_mr_country_us_two,86
3916,Media reports differed on whether the missiles...,media report differ whether missile launch tes...,Media reports differed on whether the missiles...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.010097515769302845, -0.008393255062401295,...",_say_country_us_two_also,5,5,_say_mr_country_us_two,86


In [70]:
plotter_3d_cluster(ner_dataset,"tfidf_vector","optics_tfidf_clusters")

<font color='#BFD72F' size=4>2.2 Cluster documents using HDBSCAN</font> <a class="anchor" id="P22"></a>
-

**Understand** that, like OPTICS, <a href="https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html">HDBSCAN</a> finds core samples of high density, each with a minimum number of neighboring samples, and expands clusters from them. It does so over varying epsilon values and integrates the results to find the clustering that is most stable over epsilon, which allows it to identify clusters of different densities.

<img src="https://www.researchgate.net/profile/Solomon-Owiredu/publication/370129234/figure/fig2/AS:11431281151392423@1681978802711/A-concept-of-HDBSCAN-clustering-algorithm-a-Mutual-reachability-distance-computation.png" width=800>

<b>Task 2.2.1:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN">sklearn implementation of HDBSCAN</a> to cluster TF-IDF document vectors. 

<i> **Note** that like OPTICS, HDBSCAN can use various metrics to compute the reachability graph, including a precomputed square matrix of distances - or co-occurrences - between samples.</i>

In [72]:
hdbscan = HDBSCAN(min_cluster_size=5).fit(ner_tfidf_td_matrix)
ner_dataset["hdbscan_tfidf_clusters"] = hdbscan.labels_.tolist()

In [None]:
ner_dataset.loc[ner_dataset["hdbscan_tfidf_clusters"]==0]

Unnamed: 0,raw_content,preproc_content,doc2vec_content,bow_vector,tfidf_vector,doc2vec_vector,kmeans_clusters,optics_clusters,hdbscan_clusters
42722,Russian President Vladimir Putin has dismissed...,russian president vladimir putin dismiss presi...,Russian President Vladimir Putin has dismissed...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.002982792444527149, -0.026136036962270737, ...",0,-1,0
30979,"In Russia, President Vladimir Putin attended M...",russia president vladimir putin attend mass ch...,In Russia President Vladimir Putin attended Ma...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.0057198042050004005, -0.01159823127090931,...",0,-1,0
23756,They also appealed to G-8 member countries to ...,also appeal g-8 member country press russian p...,They also appealed to G-8 member countries to ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.005216252990067005, 0.03800731897354126, -0...",0,-1,0
13344,A letter by the group Wednesday accuses Russia...,letter group wednesday accuse russian presiden...,A letter by the group Wednesday accuses Russia...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.007039791904389858, -0.013226264156401157,...",0,-1,0
43561,Russian President Vladimir Putin has sent new ...,russian president vladimir putin send new year...,Russian President Vladimir Putin has sent new ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.030728518962860107, 0.062114447355270386, ...",0,-1,0


In [None]:
set(hdbscan.labels_)

{-1, 0, 1, 2, 3}

In [None]:
silhouette_score(ner_tfidf_td_matrix, hdbscan.labels_)

-0.0028272542332657873

In [73]:
plotter_3d_cluster(ner_dataset, "tfidf_vector", "hdbscan_tfidf_clusters")