# Clustering

**Loading files from a directory into a pandas DataFrame**

* the [load_files](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html) function takes as input a directory whose immediate subdirectories are category names for the text files they contain

```
DIR/
 category_1/
    file_1.txt file_2.txt … file_42.txt
 category_2/
    file_43.txt file_44.txt …
```

* the load_files function loads all files found in the category subdirectories and returns a dictionary-like object whose attributes include "data", the text content of the input files, "target", the integer category index of each file, and "target_names", the names of the subdirectories containing the text files (one entry per category, not one per file).
* the code below uses this function to extract the content and categories of the text files contained in the ../data/bbc/ directory and to store them in a pandas DataFrame with headers 'text' and 'label' respectively

In [None]:
!ls ../data/bbc/

In [None]:
import pandas as pd
from sklearn.datasets import load_files
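# Load all files under DATA_DIR into a pandas DataFrame
DATA_DIR = "../data/bbc/"
data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace")
# data['target'] holds one category index per file; map each index to its category name
# (data['target_names'] only has one entry per category, so it cannot be zipped with the texts)
file_labels = [data['target_names'][i] for i in data['target']]
df = pd.DataFrame({'text': data['data'], 'label': file_labels})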
df.head()
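
The relation between "target" and "target_names" described above can be checked on a tiny corpus. The sketch below is only an illustration: the temporary directory, the file names and the texts are all made up, and it assumes nothing about the real ../data/bbc/ content.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_files

# Build a tiny, hypothetical corpus on disk: two categories, three files
root = Path(tempfile.mkdtemp())
toy_files = {
    "business/001.txt": "Shares rise as profits grow",
    "business/002.txt": "Bank cuts interest rates",
    "sport/001.txt": "Striker scores twice in final",
}
for rel_path, text in toy_files.items():
    path = root / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")

toy = load_files(str(root), encoding="utf-8", decode_error="replace")

# target_names has one entry per category; target has one index per file
toy_labels = [toy.target_names[i] for i in toy.target]
toy_df = pd.DataFrame({"text": toy.data, "label": toy_labels})

print(toy.target_names)
print(toy_df)
```

Each row of `toy_df` pairs a text with the name of the subdirectory it came from, whatever order the files were loaded in.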

**Converting a corpus of texts into a tf-idf matrix**
* the input is our corpus, a list of texts
* we can specify how the text is tokenised and whether stop-words are removed
* the output is a matrix where each row is a text and each column is a token. The cells of the matrix contain the tf-idf score of the token in that text

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

corpus = ['Apples and pears are fruit','A pear is a fruit','dogs and cats are animal','a cat is an animal']

# Create a TF-IDF vectorizer to convert texts to vectors
# (nltk.word_tokenize requires the NLTK Punkt tokenizer data to be installed)
vectorizer = TfidfVectorizer(max_features=10,
                             use_idf=True,
                             stop_words='english',
                             tokenizer=nltk.word_tokenize)
# Apply the vectorizer to the input texts
M = vectorizer.fit_transform(corpus)

In [None]:
# the output matrix contains 4 rows, one for each input document,
# and at most 10 columns, as we set the max number of features to 10
print(M.shape)

**Viewing the features used by the clustering algorithm**

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out()

**Clustering** 

[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* we simply pass the tf-idf matrix (the representation of the input texts) to a KMeans clustering model
* the "fit" method fits the data, i.e. it aims to find the best set of clusters for it
* n_init: the number of times the k-means algorithm is run with different centroid seeds. The final result is the best of these n_init runs, i.e. the one with the lowest inertia.

In [None]:
from sklearn.cluster import KMeans
# Create a KMeans clustering model
km = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=5, verbose=0, random_state=3425)
# Apply the clustering model on the tf-idf matrix (the features)
km.fit(M)
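
The effect of n_init can be imitated by hand: run k-means several times with a single initialisation each and keep the run with the lowest inertia. The sketch below is only an illustration of that idea, not a reproduction of how scikit-learn draws its seeds internally; the 2-D points in `X` are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])

# Five independent runs, each with a single initialisation (n_init=1)
runs = []
for seed in range(5):
    model = KMeans(n_clusters=2, init='k-means++', n_init=1,
                   random_state=seed).fit(X)
    runs.append(model)
    print("seed %d: inertia = %.3f" % (seed, model.inertia_))

# Keep the run with the lowest inertia, as n_init=5 would do in a single call
best = min(runs, key=lambda m: m.inertia_)
print("best inertia: %.3f" % best.inertia_)
print(best.labels_)
```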


**Printing out clustering results**
* You can view which item belongs to which cluster using the labels_ attribute
* If you want to use the predicted cluster labels, e.g. for viewing (to compare with the ground truth labels), you need to explicitly store them in a list as shown below

In [None]:
# Get the predicted cluster labels (a NumPy array)
predicted_labels = km.labels_
# Store the predicted clusters in a list and print them out
predicted_labels = predicted_labels.tolist()
print(predicted_labels)

**Computing clustering evaluation metrics**

In [None]:
from sklearn import metrics
# Show ground truth labels (if available)
labels = ["fruit","fruit","animals","animals"]
print( labels)
# Show predicted labels
print( km.labels_)
# Compute and show evaluation scores
# When a ground truth is available
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
# When no ground truth is available (M is the tf-idf matrix computed above)
#print("Silhouette Coefficient: %0.3f"
#      % metrics.silhouette_score(M, km.labels_, sample_size=1000))
print()
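
Two properties of these metrics are worth checking on a small example. The ground-truth scores do not depend on how the cluster ids are numbered, and the silhouette coefficient needs only the feature matrix and the predicted clusters. The sketch below is a hedged illustration: the cluster assignments in the first part are hypothetical, and the second part assumes scikit-learn's default tokenizer instead of nltk.word_tokenize.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

truth = ["fruit", "fruit", "animals", "animals"]

# Hypothetical cluster assignments: same grouping, different cluster ids
ari_a = metrics.adjusted_rand_score(truth, [0, 0, 1, 1])
ari_b = metrics.adjusted_rand_score(truth, [1, 1, 0, 0])
# A grouping that mixes the two categories
ari_mixed = metrics.adjusted_rand_score(truth, [0, 1, 0, 1])
print(ari_a, ari_b, ari_mixed)

# Silhouette coefficient: no ground truth needed
toy_corpus = ['Apples and pears are fruit', 'A pear is a fruit',
              'dogs and cats are animal', 'a cat is an animal']
toy_M = TfidfVectorizer(stop_words='english').fit_transform(toy_corpus)
toy_km = KMeans(n_clusters=2, n_init=5, random_state=3425).fit(toy_M)
sil = metrics.silhouette_score(toy_M, toy_km.labels_)
print("Silhouette Coefficient: %0.3f" % sil)
```

The silhouette coefficient lies between -1 and 1; higher values indicate tighter, better separated clusters.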

**Printing out the number of items per cluster**

First store the documents, labels and cluster labels in a pandas DataFrame

In [None]:
data = {'text':corpus,'label':labels,'cluster':km.labels_}
df = pd.DataFrame(data)
df.head()

Then count how often each value occurs in the cluster column (= the number of documents for each cluster label)

In [None]:
df['cluster'].value_counts()
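
When ground truth labels are available, the same DataFrame can also be cross-tabulated to see how the categories are distributed over the clusters. A minimal sketch, in which the cluster assignments are hypothetical rather than taken from the KMeans run above:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'label': ['fruit', 'fruit', 'animals', 'animals'],
    'cluster': [0, 0, 1, 1],  # hypothetical cluster assignments
})

# Rows are ground truth labels, columns are cluster ids, cells are document counts
table = pd.crosstab(toy_df['label'], toy_df['cluster'])
print(table)
```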

**Printing out the top tokens of each cluster**

In [None]:
import numpy as np
print("Top terms per cluster:")
# get the number of clusters from the fitted model
true_k = km.n_clusters

# get the cluster center of each cluster
# argsort() returns, for each cluster center, the indices of its dimensions sorted by increasing value
# [:, ::-1] reverses each row so that the indices with the highest values come first (decreasing order)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# terms maps a vectorizer index to the corresponding token
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
terms = vectorizer.get_feature_names_out()

# for each cluster
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    # print out the top tokens of the centroid (ordered by decreasing tf-idf value)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print('\n')