# Clustering

Clustering is an unsupervised machine learning method: we do not tell the computer what the groupings are. Clustering is not the same as topic modeling, although clustering can yield topics. It is a more restricted approach that groups and visualizes data based on similarity. If you only want to determine topics, a conventional LDA model is usually the better fit, because it allows a document to be assigned to more than one topic. If you are looking for spatial relations and a one-to-one assignment of documents to groupings, clustering will show this better.

Much of this code has been adapted from [Brandon Rose](http://brandonrose.org/clustering).

First we'll define functions in order to collect tokenized words and stemmed words:

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from string import punctuation

stemmer = SnowballStemmer("english")

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x not in punctuation] #word tokenizer cuts the possessives
    return filtered_tokens

def tokenize_and_stem(text):
    stems = [stemmer.stem(x) for x in tokenize_only(text)]
    return stems

Now we collect these from our paragraphs. This is only necessary so that we can map stems back to full words when we label the clusters later:

In [None]:
totalvocab_stemmed = []
totalvocab_tokenized = []

for d in docs:
    totalvocab_stemmed.extend(tokenize_and_stem(d))
    totalvocab_tokenized.extend(tokenize_only(d))

Our data frame will map tokenized words to stemmed words, recalling our work with pandas in Day 3 of the introductory series:

In [None]:
import pandas as pd

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print (vocab_frame.shape[0])
print (vocab_frame)

We'll make a tf-idf (*term frequency-inverse document frequency*) matrix. Tf-idf takes the frequency of a term within a document and offsets it by the number of documents in the corpus that contain the term (more here: https://en.wikipedia.org/wiki/Tf–idf):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters: max_df is the maximum share of docs a term may occur in, min_df is the minimum share
#max_df=0.8 eliminates the most common terms, min_df=0.2 keeps distinctive terms but drops very rare ones such as proper nouns
#use_idf=True applies inverse document frequency, giving more weight to rarer terms
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=20000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,5))

tfidf_matrix = tfidf_vectorizer.fit_transform(docs) #fit the vectorizer to the docs

print(tfidf_matrix.shape)

The tfidf_matrix maps every document to every term that meets the vectorizer parameters, and assigns each pair a weight based on the term's frequency in that document and among all the documents.

In [None]:
print (tfidf_matrix)
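
To make the weighting concrete, here is a minimal pure-Python sketch of a tf-idf computation on a toy corpus. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is our understanding of the scikit-learn defaults. The names `tfidf` and `toy_docs` are hypothetical and not part of the pipeline in this notebook.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Hypothetical helper for illustration; assumes smoothed idf.
    """
    n_docs = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequency within this document
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = [
    ["whale", "sea", "ship"],
    ["whale", "sea", "captain"],
    ["garden", "sea", "rose"],
]
weights = tfidf(toy_docs)
for row in weights:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first toy document, "sea" (found in every document) receives the lowest weight and "ship" (found in only one) the highest, which is the offsetting effect described above.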

Then we need the terms from the vectorizer. These are essentially the most influential terms, taking both document frequency and corpus frequency into account. We will eventually assign them to clusters.

In [None]:
terms = tfidf_vectorizer.get_feature_names_out() #on scikit-learn older than 1.0, use get_feature_names() instead
print (terms)

In order to plot our clusters in a 2D plane, we'll want to calculate the distance between any two given docs via cosine similarity:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix) #pairwise distance matrix of shape (n_docs, n_docs)
print (dist.shape)
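
As a rough illustration of what each entry of `dist` holds, the sketch below computes cosine distance for a pair of plain Python vectors. The helper `cosine_distance` and the toy vectors are hypothetical. Returning 1.0 for an all-zero vector is a convention chosen here, not something taken from scikit-learn.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen for this sketch
    return 1.0 - dot / (norm_u * norm_v)

doc_a = [1.0, 2.0, 0.0]  # toy weight vectors, one per document
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, different length
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

toy_vectors = (doc_a, doc_b, doc_c)
toy_dist = [[cosine_distance(u, v) for v in toy_vectors] for u in toy_vectors]
for row in toy_dist:
    print([round(d, 3) for d in row])
```

Documents that use the same terms in the same proportions end up at a distance near 0 regardless of length, while documents with no terms in common end up at a distance of 1.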

Now we'll start the actual clustering, using k-means. The algorithm assigns each observation to the cluster whose mean yields the least within-cluster sum of squares, essentially the nearest mean. It then recomputes each mean from its assigned observations, and iterates until the assignments (and therefore the means) no longer change.
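
The assign-and-update loop can be sketched in a few lines of pure Python. This is a simplified illustration, not the scikit-learn implementation: it assumes plain random initialization rather than k-means++ and a single run rather than several restarts. The names `kmeans`, `squared_distance`, `mean` and `toy_points` are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    # coordinate-wise mean of a non-empty list of points
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, k, seed=10, max_iter=100):
    """Simplified k-means sketch (random init, single run)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels, centroids

toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

On this toy data the first three points and the last three points end up in separate clusters. Which of the two gets label 0 depends on the random initialization.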

But how do we know the number of clusters? This is a highly contentious matter. There are various methods (https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
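
One of the methods listed on that page is the silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is a pure-Python illustration with Euclidean distance on toy data. The helper `mean_silhouette` is hypothetical, it needs at least two clusters, and scoring singleton clusters as 0 is a convention assumed here. scikit-learn offers a comparable measure as `metrics.silhouette_score`.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_silhouette(points, labels):
    """Mean silhouette coefficient (hypothetical helper, needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
score_two = mean_silhouette(toy_points, [0, 0, 0, 1, 1, 1])
score_three = mean_silhouette(toy_points, [0, 0, 1, 2, 2, 2])
print(round(score_two, 3), round(score_three, 3))
```

A higher score suggests a better separated grouping, so on this toy data the two-cluster labeling is preferred over the three-cluster one. In practice you would compute the score for several values of `cnum` and compare.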

In [None]:
from sklearn import metrics
from sklearn.cluster import KMeans

cnum = 6
km = KMeans(init='k-means++', n_clusters=cnum, n_init=10, random_state=10)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist() #assigns each paragraph to the respective cluster

We can now do a form of topic modelling by printing the words that characterize the clusters we made. The words are those with the largest weights in each cluster's centroid, looked up in the vocab data frame by their stem:

In [None]:
#sort the term weights of each cluster center in descending order, keeping the term indices to iterate through below
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

cents_words = [] #to collect words for chart legend

for i in range(cnum): #number of clusters
    cent = []
    print("Cluster %i words:" % i, end='')

    for ind in order_centroids[i, :7]: #ind is a term index; replace 7 with the number of words to show per cluster
        a = vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0] #look up a full word for the term's first stem (.ix was removed from pandas)
        cent.append(a)
        
        print(' %a' % a, end=',')
        
    cents_words.append(cent)
    print ()
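
The `argsort()[:, ::-1]` step above can be hard to read at first. As a small sketch of the same idea in plain Python, the hypothetical helper `top_terms` below ranks the term indices of one centroid by descending weight and returns the matching terms. The centroid and vocabulary are made-up toy values.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest centroid weights (hypothetical helper)."""
    # indices sorted by descending weight, similar to argsort()[::-1] on one row
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in order[:n]]

toy_terms = ["captain", "garden", "sea", "ship", "whale"]
toy_centroid = [0.10, 0.00, 0.35, 0.20, 0.60]  # one weight per term

print(top_terms(toy_centroid, toy_terms, n=3))  # ['whale', 'sea', 'ship']
```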


To see which doc was assigned to which cluster, we'll zip our titles and clusters together:

In [None]:
print (list(zip(titles,clusters)))

## Plot clusters

To plot the docs, we apply multidimensional scaling (MDS) to reduce the distance matrix to two dimensions:

In [None]:
from sklearn.manifold import MDS

# use two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state = 10)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components), based on distances

xs, ys = pos[:, 0], pos[:, 1] #grabs x and y coordinates from pos (numpy array)

Define colors and labels for plot:

In [None]:
import random

cluster_colors = {}
cluster_names ={}

cols = ["b", "g", "r", "c", "m","y"]

for i in range(cnum): # for each cluster
    #cluster_colors[i] = "#%06x" % random.randint(0, 0xFFFFFF) #random hexadecimal color
    cluster_colors[i] = cols[i]
    cluster_names[i] = ' '.join(cents_words[i][:4])

Plot:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 

#group by cluster
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size, subplots yields a tuple of figure and axes, hence the two assignments
ax.margins(0.15) # Optional, just adds 15% padding to the autoscaling

#iterate through groups to layer the plot
#note that I use the cluster_names and cluster_colors dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, #marker size
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none') #'marker edge color'
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the doc title
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=7) #.loc replaces .ix, which was removed from pandas

    
    
#uncomment the below to save the plot if need be (save before plt.show(), otherwise the saved image may be blank)
#plt.savefig('clusters_small_noaxes.png', dpi=200)

plt.show() #show the plot

We can also cluster hierarchically via Ward's method (https://en.wikipedia.org/wiki/Ward%27s_method):

In [None]:
from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering; note that scipy treats a square 2D array as one observation vector per doc, not as precomputed distances

fig, ax = plt.subplots(figsize=(15, 40)) # set size
ax = dendrogram(linkage_matrix, orientation="right", labels=titles);

plt.tick_params(labelbottom=False) #hide the x-axis tick labels (newer matplotlib expects a boolean, not 'off')

plt.tight_layout() #show plot with tight layout

#uncomment below to save figure
# plt.savefig('ward_clusters.png', dpi=200) #save figure as ward_clusters