# <font color='#eb3483'> Evaluating Clustering Methods </font>
Once we have clusters, a natural question is to ask how good they are. In this notebook we'll walk through a few quick ways to check the performance of our clusters (these are great for comparing different clustering methods). In general we'll look at two cases:

1. External Methods: Ways to evaluate our clusters when we have a ground truth label (i.e. what the clusters should be)
2. Internal Methods: Ways to evaluate our clusters when we have no 'right' answer

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

### <font color='#eb3483'> Data Loading </font>

We are going to use the [20Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/), a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups. (Newsgroups were essentially the discussion forums of the early internet.)

In [2]:
from sklearn.datasets import fetch_20newsgroups_vectorized, fetch_20newsgroups

In [3]:
news_20 = fetch_20newsgroups()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [4]:
print(news_20.data[2])

From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: PB questions...
Organization: Purdue University Engineering Computer Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was supposed to make an
appearence "this summer" but haven't heard anymore on it - and since i
don't have access to macleak, i was wondering if anybody out there had
more info...

* has anybody heard rumors about price drops to the powerbook line like the
ones the duo's just went through recently?

* what's the impression of the display on the 180?  i could probably swing
a 180 if i got the 80Mb disk

We can directly download the text in vectorized form (TF-IDF) with `scikit-learn`, which returns a sparse matrix. 

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
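To make the idea concrete, here is a minimal sketch of the textbook TF-IDF formula on a made-up three-document corpus. It is only an illustration: the corpus and the `tfidf` helper are assumptions for this example, and `scikit-learn` uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from 20 Newsgroups).
docs = [
    "the mac display is great",
    "the hockey game was great",
    "the mac disk is small",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: (count / doc length) * log(n_docs / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = [tfidf(tokens) for tokens in tokenized]

# "the" appears in every document, so it carries no weight;
# "display" appears in only one document, so it outweighs "mac".
for word in ("the", "mac", "display"):
    print(f"{word:>8}: {weights[0][word]:.3f}")
```

Words that show up everywhere are pushed towards zero, while words that are specific to a few documents get the largest weights.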

In [6]:
data = fetch_20newsgroups_vectorized()

In [7]:
news = data.data

In [8]:
news

<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>

We can see each observation's label in `target`, and the newsgroup names those labels map to in `target_names`.

In [9]:
data.target

array([17,  7, 10, ..., 14, 12, 11])

In [12]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [10]:
len(data.target_names)

20

The dataset has 20 natural clusters (the 20 newsgroups). Let's quickly try to make our own 20 clusters using K-means.

In [11]:
from sklearn.cluster import MiniBatchKMeans

In [12]:
estimator = MiniBatchKMeans(n_clusters=20)
estimator.fit(news)
pred_labels = estimator.labels_

In [13]:
pred_labels

array([16,  9, 16, ..., 12, 14, 16], dtype=int32)

# <font color='#eb3483'> External Evaluation Measures </font>

We call external evaluation measures those that rely on "true" cluster labels known beforehand.

In [14]:
real_classes = data.target
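Cluster IDs are arbitrary (cluster 3 has no reason to correspond to class 3), so we can't compare the two label arrays element by element. External measures work from the table of how often each (true class, cluster) pair occurs. A small sketch with hypothetical toy labels, not the newsgroup data:

```python
from collections import Counter

# Hypothetical toy labels, chosen only for illustration.
true_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
cluster_ids = [1, 1, 0, 0, 0, 0]

# Count how often each (true class, cluster) pair occurs.
contingency = Counter(zip(true_labels, cluster_ids))

classes = sorted(set(true_labels))
clusters = sorted(set(cluster_ids))
print("class   " + "  ".join(f"c{c}" for c in clusters))
for cls in classes:
    row = "  ".join(f"{contingency[(cls, c)]:>2}" for c in clusters)
    print(f"{cls:<8}{row}")

# Swapping the cluster names (0 <-> 1) reorders the columns but leaves the
# counts unchanged, which is why external measures ignore the ID values.
relabelled = [1 - c for c in cluster_ids]
relabelled_counts = Counter(zip(true_labels, relabelled))
```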

In [15]:
from sklearn.metrics import (
                            homogeneity_completeness_v_measure, 
                            adjusted_rand_score, 
)

The function [homogeneity_completeness_v_measure](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html#sklearn.metrics.homogeneity_completeness_v_measure) returns a tuple with homogeneity, completeness and v-measure. What do those terms actually mean? 

- **Homogeneity**: How alike are the data points in a cluster (i.e. do they all have the same true label)? Homogeneity is 1 when every cluster contains points from a single class. In general we want clusters that are as homogeneous as possible.
- **Completeness**: For a given true label (e.g. sports news), completeness indicates how many of the data points with that label end up in the same cluster. For example, if we just put all our data in the same cluster, we'd have a completeness of 1.
- **V-Measure**: The harmonic mean of homogeneity and completeness, so it is only high when both are high. It is closely related to the "mutual information" between the cluster label and the true label (i.e. if we see the cluster label, how much do we then know about the true label).

All of these values are normalized between 0 and 1, with 1 being the best.
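The three scores above can be sketched from scratch using entropies. This is a simplified illustration of the usual definitions, not `scikit-learn`'s implementation; the helper names are made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given): uncertainty left in `target` once `given` is known."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def homogeneity_completeness_v(true_labels, cluster_ids):
    h_true, h_clust = entropy(true_labels), entropy(cluster_ids)
    # Homogeneity: knowing the cluster should leave no doubt about the class.
    homogeneity = (
        1.0 if h_true == 0
        else 1 - conditional_entropy(true_labels, cluster_ids) / h_true
    )
    # Completeness: knowing the class should leave no doubt about the cluster.
    completeness = (
        1.0 if h_clust == 0
        else 1 - conditional_entropy(cluster_ids, true_labels) / h_clust
    )
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


true = [0, 0, 1, 1]
one_big_cluster = homogeneity_completeness_v(true, [0, 0, 0, 0])
all_singletons = homogeneity_completeness_v(true, [0, 1, 2, 3])
permuted_ids = homogeneity_completeness_v(true, [1, 1, 0, 0])

for name, scores in [
    ("one big cluster", one_big_cluster),
    ("all singletons", all_singletons),
    ("permuted ids", permuted_ids),
]:
    print(name, [round(s, 3) for s in scores])
```

Putting everything in one cluster is perfectly complete but not homogeneous at all, giving every point its own cluster is the reverse, and only the V-Measure penalizes both.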


In [19]:
homogeneity, completeness, v_measure =  homogeneity_completeness_v_measure(real_classes, pred_labels)

print(f"""Cluster Model Performance:
      Homogeneity: {homogeneity}
      Completeness: {completeness}
      V-Measure: {v_measure}""")

Cluster Model Performance:
      Homogeneity: 0.04688327357652485
      Completeness: 0.08694884839520874
      V-Measure: 0.06091880762881924


Both scores are close to 0. Completeness is somewhat higher than homogeneity, meaning documents from the same newsgroup end up together slightly more often than individual clusters contain a single newsgroup. The low V-Measure confirms that these clusters aren't great.

Homogeneity, completeness and V-Measure are not adjusted for chance, so even random labelings can score above 0, especially when there are many clusters. For datasets smaller than 1000 observations or for a cluster number greater than 10, it is recommended to evaluate using the Adjusted Rand Index (ARI), available as the function [adjusted_rand_score](http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-index). ARI looks at how similar our cluster assignments are to the true labels (you can dig into the details later); for now just understand that larger positive values are good.

In [20]:
adjusted_rand_score(real_classes, pred_labels)

0.016464237308071956

The ARI is at most 1 (perfect agreement) and is close to 0 for random labelings (it can even go negative), so these clusters aren't great.
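For the curious, the ARI can be sketched by counting pairs of points that are grouped together in both labelings and correcting for the number expected by chance. This is a minimal from-scratch illustration (the function name is made up for this example), not the library code:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_ids):
    n = len(true_labels)
    # Pairs of points that share both a true class and a cluster.
    pairs_both = sum(
        comb(c, 2) for c in Counter(zip(true_labels, cluster_ids)).values()
    )
    # Pairs that share a true class, and pairs that share a cluster.
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_clust = sum(comb(c, 2) for c in Counter(cluster_ids).values())
    # Agreement expected from random labelings with the same group sizes.
    expected = pairs_true * pairs_clust / comb(n, 2)
    max_index = (pairs_true + pairs_clust) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, round(partial, 3), worse_than_chance)
```

The last case splits every true pair across clusters, which is why the score drops below 0.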

We can use external evaluation metrics for cross-validation the same way we would for any other classification/regression problem.

Scoring names accepted by `cross_val_score` include:

- `adjusted_rand_score`
- `completeness_score`
- `homogeneity_score`
- `v_measure_score`

In [21]:
from sklearn.model_selection import cross_val_score

For example, we can evaluate the performance of `MiniBatchKMeans` (here with its default number of clusters) in terms of the Adjusted Rand Index with cross-validation.

In [22]:
results = cross_val_score(X=news, y=real_classes, 
                             estimator=MiniBatchKMeans(), 
                             scoring="adjusted_rand_score", cv=3)

In [23]:
results.mean()

0.02071193587831173
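To see what a good cross-validated score looks like, here is a small sketch on synthetic data. Everything in it (the blob centers, the choice of `KMeans`, the parameter values) is an assumption made for illustration: three well-separated blobs that any reasonable clustering should recover.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score

# Synthetic, well-separated blobs (chosen so the result is easy to interpret).
X, y = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# For each fold: fit on the training part, predict clusters for the held-out
# part, and compare those predictions with the held-out true labels.
scores = cross_val_score(
    KMeans(n_clusters=3, n_init=10, random_state=0),
    X,
    y,
    scoring="adjusted_rand_score",
    cv=3,
)
print(scores, scores.mean())
```

On data this clean the ARI should sit at or very near 1 in every fold, in sharp contrast with the newsgroup result above.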

# <font color='#eb3483'>  Internal Evaluation metrics </font>

Internal measures are used when the true classes aren't known beforehand (which is usually the case). To evaluate clusters when we don't have a ground truth label, we want to measure how 'similar' our data is within clusters and how 'different' it is between clusters. Luckily for us, sklearn has some built-in methods that give us a view on that trade-off.
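One way to quantify this trade-off is the silhouette coefficient used below: for each point, compare its mean distance to its own cluster (`a`) with its mean distance to the nearest other cluster (`b`). The following is a simplified sketch for 1-D toy points, with a made-up helper name, and it assumes at least two clusters:

```python
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette for 1-D points, using absolute difference as distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [
            abs(p - q)
            for j, (q, l) in enumerate(zip(points, labels))
            if l == own and j != i
        ]
        if not same:  # a cluster with a single point gets a score of 0
            scores.append(0.0)
            continue
        a = mean(same)  # how far the point is from its own cluster
        b = min(        # how far it is from the nearest other cluster
            mean(abs(p - q) for q, l in zip(points, labels) if l == other)
            for other in set(labels)
            if other != own
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = silhouette(points, [0, 0, 0, 1, 1, 1])  # two tight, separate groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # groups mixed together
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1, while the mixed-up grouping falls below 0.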

In [24]:
from sklearn.metrics import silhouette_score, calinski_harabasz_score

These internal measures are used in `sklearn` with 2 arguments, the training dataset and the cluster labels.

First we have the Silhouette Coefficient [(silhouette_score)](http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient), which goes from -1 to 1; the higher the better.

In [25]:
silhouette_score(news, pred_labels)

-0.11035924060560391

Once again, our clustering does poorly: a negative value suggests that many points are closer to a neighboring cluster than to their own.

The Calinski-Harabasz score [(calinski_harabasz_score)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score) is another internal evaluation metric. It relates the separation between clusters to the dispersion within clusters. It is faster to compute than the Silhouette Coefficient. The higher the better. (In older versions of `scikit-learn` the function was spelled `calinski_harabaz_score`.)

In [26]:
# We pass a dense array here; note that densifying a matrix this large takes a lot of memory
calinski_harabasz_score(news.toarray(), pred_labels)

51.49986943724563
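Unlike the silhouette, this score has no fixed upper bound, so it is mostly useful for comparing clusterings of the same data. A minimal from-scratch sketch for 1-D toy points (the helper name is made up for this example) shows the ratio being computed:

```python
from statistics import mean


def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion (1-D points)."""
    n, k = len(points), len(set(labels))
    overall = mean(points)
    between = within = 0.0
    for cluster in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster]
        centre = mean(members)
        # How far the cluster centre is from the overall centre, weighted by size.
        between += len(members) * (centre - overall) ** 2
        # How spread out the members are around their own centre.
        within += sum((p - centre) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # tight, separate groups
bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # groups mixed together
print(good, round(bad, 3))
```

The well-separated grouping scores far higher than the mixed one, which is the kind of comparison this metric is meant for.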