# Cluster Analysis Example

As guidance for this work, I refer largely to an email clustering thesis (https://www.scribd.com/document/21970433/Email-clustering-algorithm) and to the pipeline described in the Wikipedia article on text clustering.

This notebook takes example text data and does the following:
* Cleans the data (removes stopwords, stems, etc.)
* Tokenizes the words and, optionally, weights them with TF-IDF
* Performs dimension reduction (PCA algorithm)
* Performs cluster analysis
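
To make the TF-IDF step in the list above concrete, here is a minimal sketch in plain Python. It uses the textbook weighting tf * ln(N / df); scikit-learn's TfidfVectorizer, used later in this notebook, applies a smoothed and normalised variant, so its numbers will differ. The function name `tfidf` and the toy corpus are assumptions made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Textbook weighting: (term count / document length) * ln(N / df).
    This is NOT the exact formula TfidfVectorizer uses.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already lowercased and tokenised.
toy_docs = [
    ["bike", "chain", "bike"],
    ["israel", "policy", "people"],
    ["bike", "people"],
]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

In the first toy document, "chain" ends up with a higher weight than "bike" even though "bike" occurs twice, because "chain" appears in only one document. That is the property that makes TF-IDF useful as input to clustering.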

In this notebook, a number of clustering algorithms are presented, such as K-Means and MeanShift, and finally we use a Non-negative Matrix Factorisation technique for topic identification. The purpose of comparing the algorithms isn't necessarily to rule any of them out; it's more about getting a working example of each one.

For the example dataset, dimension reduction isn't really needed, as the dataset is small enough that clustering it is unlikely to be computationally intensive. However, it sets the standard for any future use of larger datasets.
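
As a rough sketch of what the dimension reduction step could look like on a larger dataset, the example below reduces a TF-IDF matrix to two components. It uses TruncatedSVD rather than PCA, a swap made here because TruncatedSVD works directly on the sparse matrix that TfidfVectorizer returns. The toy corpus and the choice of `n_components=2` are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned newsgroup posts.
toy_corpus = [
    "bike chain needs replacing",
    "new bike engine and chain",
    "israel policy debate",
    "people debate israel policy",
]

# Sparse matrix: one row per document, one column per vocabulary term.
toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)

# n_components=2 is an arbitrary choice for this sketch.
svd = TruncatedSVD(n_components=2, random_state=42)
toy_reduced = svd.fit_transform(toy_tfidf)  # dense array, one row per document

print(toy_tfidf.shape, "->", toy_reduced.shape)
```

The reduced array can be passed to a clustering model in place of the full TF-IDF matrix.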

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pprint
import nltk

from nltk.collocations import *
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from nltk.corpus import stopwords

%matplotlib inline

## Data used

For the purposes of this example, we'll just use scikit-learn's 20 newsgroups dataset: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
categories = [
    'talk.politics.mideast',
    'rec.motorcycles',
]

remove = ('headers', 'footers', 'quotes')

In [4]:
data_train = fetch_20newsgroups(subset='train', categories=categories,
                               random_state=42,
                               remove=remove)

In [5]:
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               random_state=42,
                               remove=remove)

In [6]:
train = data_train.data
train[:4]

[u'\n\nYep...I think it\'s the only CB750 with a 630 chain.\nAfter 14 years, it\'s finally stretching into the "replace" zone.\n\n\n<Sigh> I know .... I know.',
 u": \n: While you brought up the separate question of Israel's unjustified\n: policies and practices, I am still unclear about your reaction to\n: the practices and polocies reflected in the article above.\n: \n: Tim\n\nNot a separate question Mr. Clock. It is deceiving to judge the \nresistance movement out of the context of the occupation.",
 u'\n\n\tI\'m not sure that\'s true.  Let me rephrase; "You can file a complaint\n which will bring the person into court."  As I understand it, a\n "citizens arrest" does not have to be the physical detention of\n the person.\n\n Better now?',
 u'\nI bought my Moto Guzzi from a Univ of Va grad student in Charlottesville\nlast spring.\n\n\t     Mark Cervi, cervi@oasys.dt.navy.mil, (w) 410-267-2147\n\t\t DoD #0603  MGNOC #12998  \'87 Moto Guzzi SP-II\n      "What kinda bikes that?" A Moto

### Data cleaning

As we can see from the above, the data is quite messy and contains formatting, punctuation, stopwords, etc. that we'd like to strip out before running our model.
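
The cleaning described above can be sketched as a single function. This is an illustrative version only: `clean_text` and `TOY_STOPWORDS` are hypothetical names, and the stop word list is a tiny hand-picked stand-in for the NLTK English list used in the cells below.

```python
import re

# Tiny illustrative stop word list; the notebook itself uses NLTK's English list.
TOY_STOPWORDS = {"i", "it", "its", "the", "a", "with", "only", "s"}


def clean_text(text, stopwords=TOY_STOPWORDS):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    tokens = text.split()                 # also collapses \n and \t
    return [tok for tok in tokens if tok not in stopwords]


sample = "\n\nYep...I think it's the only CB750 with a 630 chain."
print(clean_text(sample))
# ['yep', 'think', 'cb750', '630', 'chain']
```

Stripping punctuation before filtering stop words means that tokens such as "chain." are compared in their bare form.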

In [7]:
df = pd.DataFrame({'col':train})
df.head()

Unnamed: 0,col
0,\n\nYep...I think it's the only CB750 with a 6...
1,: \n: While you brought up the separate questi...
2,\n\n\tI'm not sure that's true. Let me rephra...
3,\nI bought my Moto Guzzi from a Univ of Va gra...
4,I was just wondering if there were any law off...


In [8]:
df['col'] = df['col'].str.lower().str.split()

In [9]:
vectorizer = TfidfVectorizer()

In [10]:
stop = stopwords.words('english')

In [11]:
df['col'] = df['col'].apply(lambda x: [item for item in x if item not in stop])

In [12]:
df.head()

Unnamed: 0,col
0,"[yep...i, think, it's, cb750, 630, chain., 14,..."
1,"[:, :, brought, separate, question, israel's, ..."
2,"[i'm, sure, that's, true., let, rephrase;, ""yo..."
3,"[bought, moto, guzzi, univ, va, grad, student,..."
4,"[wondering, law, officers, read, this., severa..."


In [13]:
# df.to_csv('clean_news_data.csv')

In [32]:
df = pd.read_csv('clean_news_data.csv', usecols=[1],
                           names=['text'],
                           header=0)

In [33]:
df = df.dropna()
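# regex=True is needed on recent pandas versions, where str.replace is literal by default.
df['text'] = df['text'].str.replace(r'\W', ' ', regex=True)  # non-word characters -> space
df['text'] = df['text'].str.replace(r'[.,?<>-]', '', regex=True)  # drop any remaining punctuation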

In [34]:
df.head()

Unnamed: 0,text
0,yep i think its cb750 630 chain 14 years its f...
1,brought separate question israels unjustif...
2,im sure thats true let rephrase yo file compl...
3,bought moto guzzi univ va grad student charlot...
4,wondering law officers read this several quest...


In [35]:
X = vectorizer.fit_transform(df['text'])

In [36]:
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=2, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [37]:
print("Top terms per cluster:")
# Sort each centroid's term weights in descending order.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end=' ')
    print()


print("\n")
print("Prediction")

Y = vectorizer.transform(["This Motorbike has the best chain"])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["Turkey is close to Israel"])
prediction = model.predict(Y)
print(prediction)

Top terms per cluster:
Cluster 0:  israel  people  jews  armenian  israeli  armenians  turkish  would  arab  one
Cluster 1:  bike  one  would  its  like  get  know  dont  im  it


Prediction
[1]
[0]


## MeanShift Clustering

MeanShift clustering is another centroid-based clustering algorithm that works in a similar way to K-Means above. The advantage of MeanShift is that you don't have to specify the number of clusters prior to analysis. This is a distinct advantage over K-Means because, when analysing general KAs, we're not sure how many natural clusters there are.
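
Although MeanShift doesn't take a cluster count, the number of clusters it finds is governed by the bandwidth (the window size). The sketch below is a deliberately simplified flat-kernel mean shift on 1-D data in plain Python, not scikit-learn's implementation; `mean_shift_1d` and the toy points are made up for illustration.

```python
def mean_shift_1d(points, bandwidth, n_iter=50):
    """Flat-kernel mean shift on 1-D data; returns the sorted distinct modes."""
    modes = list(points)
    for _ in range(n_iter):
        shifted = []
        for m in modes:
            # Move each estimate to the mean of the original points
            # that fall within `bandwidth` of it.
            window = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(window) / len(window))
        modes = shifted
    # Merge estimates that converged to (almost) the same place.
    distinct = []
    for m in sorted(modes):
        if not distinct or m - distinct[-1] > bandwidth / 2:
            distinct.append(m)
    return distinct


toy_points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]

print(len(mean_shift_1d(toy_points, bandwidth=1.0)))   # narrow window: 2 clusters
print(len(mean_shift_1d(toy_points, bandwidth=10.0)))  # wide window: 1 cluster
```

A bandwidth that is too wide for the data merges everything into one mode, which is worth keeping in mind when reading the MeanShift result below.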

In [38]:
from sklearn.cluster import MeanShift, estimate_bandwidth

In [39]:
bandwidth = estimate_bandwidth(X.toarray(), quantile=0.2, n_samples=500) ## Do PCA or other dimension reduction for real data.

In [40]:
mean_shift_model = MeanShift(bin_seeding=True)

In [41]:
mean_shift_model.fit(X.toarray())

MeanShift(bandwidth=None, bin_seeding=True, cluster_all=True, min_bin_freq=1,
     n_jobs=1, seeds=None)

In [42]:
labels = mean_shift_model.labels_
cluster_centers = mean_shift_model.cluster_centers_

labels_unique = np.unique(labels)
n_clusters = len(labels_unique)

In [43]:
print(n_clusters)

1


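At the moment MeanShift doesn't look very helpful, as it only yields a single cluster. Note that the bandwidth estimated in cell 39 is never passed to MeanShift in cell 40 (the output above shows bandwidth=None), so the model falls back on its own default estimate. This might need work.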

## Non-negative Matrix Factorisation

As NMF is typically a dimension reduction technique in itself, the methodology for this algorithm is going to be slightly different. We'll do the following:

* Clean the data (remove stopwords, stem, etc.)
* Tokenize the words and, optionally, weight them with TF-IDF
* Perform NMF on the tokenized vectors