# Clustering Example

This example notebook applies *k*-means clustering to the CHI data from the [HCI Bibliography](http://hcibib.org), building on the [Week 13 Example](https://cs533.ekstrandom.net/content/week13/Week13/).

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

In [None]:
rng = np.random.RandomState(20201119)

## Load Data

In [None]:
papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()

Let's treat missing titles and abstracts as empty strings:

In [None]:
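# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify the original frame.
papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')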

For some purposes, we want *all text*.  Let's make a field:

In [None]:
papers['all_text'] = papers['title'] + ' ' + papers['abstract']

## Raw Clustering

Let's set up a *k*-means to make 5 clusters out of our titles and abstracts.  We're also going to limit the term vectors to the 10,000 most frequent terms, to make the vectors more manageable.

In [None]:
cluster_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('cluster', KMeans(5, random_state=rng))
])

In [None]:
cluster_pipe.fit(papers['all_text'])
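
To make the two pipeline stages concrete, here is a minimal sketch on a made-up four-document corpus. The names `toy_docs` and `toy_pipe` are hypothetical and not part of the notebook's data; the point is only to show what each step holds after fitting.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for papers['all_text'].
toy_docs = [
    'touch screen gestures for mobile phones',
    'mobile touch gestures on small screen phones',
    'eye tracking study of web search',
    'web search behavior measured with eye tracking',
]

toy_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english')),
    ('cluster', KMeans(2, n_init=10, random_state=42)),
])
toy_pipe.fit(toy_docs)

# The vectorize step turns each document into one row of a sparse TF-IDF matrix.
toy_vectors = toy_pipe.named_steps['vectorize'].transform(toy_docs)
# The cluster step stores one center per cluster, in the same term space.
toy_centers = toy_pipe.named_steps['cluster'].cluster_centers_
# Cluster assignments for the documents the pipeline was fit on.
toy_labels = toy_pipe.named_steps['cluster'].labels_
```

Each center has as many entries as there are terms in the vocabulary, which is why capping the vocabulary keeps the clustering step manageable.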

Now, if we want clusters for all of our papers, we use `predict`:

In [None]:
paper_clusters = cluster_pipe.predict(papers['all_text'])

In [None]:
sns.countplot(x=paper_clusters)  # pass the labels by keyword; recent seaborn treats a positional argument as `data`

We can, for instance, get the titles of papers in cluster 0:

In [None]:
papers.loc[paper_clusters == 0, 'title']

This created a Boolean mask that is `True` where the cluster number is equal to 0, and selects those rows and the `'title'` column.
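
The masking step can be seen in isolation on a tiny example. This is a sketch with hypothetical stand-ins (`toy_papers`, `toy_clusters`) for the real frame and label array:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `papers` and `paper_clusters`.
toy_papers = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'abstract': ['a', 'b', 'c', 'd'],
})
toy_clusters = np.array([0, 1, 0, 2])

# Element-wise comparison yields one Boolean per paper.
mask = toy_clusters == 0
# .loc keeps the rows where the mask is True and only the 'title' column.
in_cluster_0 = toy_papers.loc[mask, 'title']
# Counting labels gives the cluster sizes without plotting.
cluster_sizes = np.bincount(toy_clusters)
```

This works because the label array has one entry per row of the frame, in the same order.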

It isn't obvious whether these groupings are meaningful, but they are clusters.  So far, though, we aren't doing anything to find the papers *most* central to each cluster.

We can get that with `transform`, which maps papers into *cluster distance space*, where column *j* holds each paper's distance to the center of cluster *j*:

In [None]:
paper_cdist = cluster_pipe.transform(papers['all_text'])

And we can find the papers *closest to the center* of cluster 0:

In [None]:
closest = np.argsort(paper_cdist[:, 0])[:10]  # argsort is ascending, so the first 10 are the nearest
papers.iloc[closest]['title']
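
The relationship between `transform`, `predict`, and `argsort` is easier to check on a few 2-D points. This is a sketch on hypothetical data (`points`), not the paper vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points standing in for the TF-IDF vectors.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # group near the origin
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # group near (10, 10)
])
km = KMeans(2, n_init=10, random_state=42).fit(points)

dists = km.transform(points)   # shape (n_points, n_clusters)
labels = km.predict(points)

# predict() picks the nearest center, i.e. the smallest distance in each row.
nearest = dists.argmin(axis=1)

# argsort is ascending: the first entries are closest to center 0,
# and the last entries are farthest from it.
closest_to_0 = np.argsort(dists[:, 0])[:2]
farthest_from_0 = np.argsort(dists[:, 0])[-2:]
```

Slicing from the end of the sorted indices would return the papers least like the cluster, so the direction of the slice matters.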

We can also look at clusters in space.  *t*-SNE is a dimensionality reduction technique designed for visualization.  Let's compute the *t*-SNE of our papers:

In [None]:
sne_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', max_features=10000)),
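    # Random init: PCA init (the default in recent scikit-learn) does not support sparse input.
    ('sne', TSNE(init='random', random_state=rng))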
])
paper_sne = sne_pipe.fit_transform(papers['all_text'])
paper_sne
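
As a quick sanity check of what t-SNE returns, here is a minimal sketch on synthetic data. The array `toy_X` and the parameter values are assumptions chosen only to keep the example small and fast.

```python
import numpy as np
from sklearn.manifold import TSNE

toy_rng = np.random.RandomState(0)
# Hypothetical data: 30 points in 20 dimensions, in two well-separated groups.
toy_X = np.vstack([
    toy_rng.normal(0.0, 1.0, size=(15, 20)),
    toy_rng.normal(8.0, 1.0, size=(15, 20)),
])

# perplexity must be smaller than the number of samples.
toy_sne = TSNE(
    n_components=2, perplexity=5, init='random', random_state=0
).fit_transform(toy_X)
```

The result has one row per input point and two columns. The coordinates themselves carry no meaning beyond which points end up near each other, which is why they are only used for plotting here.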

Now we can plot:

In [None]:
paper_viz = pd.DataFrame({
    'SNE0': paper_sne[:, 0],
    'SNE1': paper_sne[:, 1],
    'cluster': paper_clusters
})
sns.scatterplot(x='SNE0', y='SNE1', hue='cluster', style='cluster', data=paper_viz)

## SVD-based Clusters

Let's cluster in reduced-dimensional space:

In [None]:
svd_cluster_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD(10, random_state=rng)),  # seeded for reproducibility, like the first pipeline
    ('cluster', KMeans(10, random_state=rng))
])
paper_svd_clusters = svd_cluster_pipe.fit_predict(papers['all_text'])
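
To see what the SVD step does to the vectors before they reach *k*-means, here is a small sketch on a made-up corpus (`toy_docs` is hypothetical). It reduces to 2 components only because the toy vocabulary is tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for papers['all_text'].
toy_docs = [
    'touch screen gestures for mobile phones',
    'mobile touch gestures on small screen phones',
    'eye tracking study of web search',
    'web search behavior measured with eye tracking',
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Project the sparse term vectors onto 2 dense components.
svd = TruncatedSVD(2, random_state=42)
reduced = svd.fit_transform(tfidf)

# Fraction of the variance captured by each retained component.
explained = svd.explained_variance_ratio_
```

After this step *k*-means measures distances between short dense vectors instead of sparse term vectors, so the cluster centers no longer map directly to individual words.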

In [None]:
sns.countplot(x=paper_svd_clusters)  # pass the labels by keyword; recent seaborn treats a positional argument as `data`

In [None]:
paper_svd_cdist = svd_cluster_pipe.transform(papers['all_text'])

Let's look at Cluster 0 in this space:

In [None]:
closest = np.argsort(paper_svd_cdist[:, 0])[:10]  # argsort is ascending, so the first 10 are the nearest
papers.iloc[closest]['title']

Not sure if that's better, but it shows the concept.

Let's do the color-coded SNE visualization:

In [None]:
paper_viz = pd.DataFrame({
    'SNE0': paper_sne[:, 0],
    'SNE1': paper_sne[:, 1],
    'cluster': paper_svd_clusters
})
sns.scatterplot(x='SNE0', y='SNE1', hue='cluster', data=paper_viz)