## Text clustering with K-Means

### Corpus

This tiny test corpus consists of 24 German-language Wikipedia articles on historical and current public figures.  
The selected articles come from four different categories:

* Literature
* Philosophy
* Music
* Politics

There are 6 articles for each category. The following table depicts which article belongs to which category:

<img style="float: left;" src="categories.PNG">

## Plan

Our plan is to group these articles into four clusters automatically, ideally one cluster per category.  
To do this we will use the [K-Means clustering algorithm](https://en.wikipedia.org/wiki/K-means_clustering).  

### Preprocessing

To build our document vectors we will be using the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from sklearn.

Note that stopwords have already been removed from the source text.

In [20]:
import glob
import os
from sklearn.feature_extraction.text import TfidfVectorizer

# Sort the paths so that the order of the articles is the same on every run.
files_list = sorted(glob.glob('corpus_wikipedia/*.txt'))

# Texts are stored as a list of strings, with one entry per article.
files = []

for file in files_list:
    with open(file, encoding='utf-8') as f:
        # Read the text in all lower-case.
        files.append(f.read().lower())

# Prepare the tf-idf vectorizer using a very rudimentary tokenizing regex.
# A term is kept only if it occurs in at least 20% and at most 80% of the articles.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2,
                                   use_idf=True, ngram_range=(1, 3),
                                   token_pattern='[a-zäöü]+')

tfidf_matrix = tfidf_vectorizer.fit_transform(files)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = tfidf_vectorizer.get_feature_names_out()
# Labels are the article titles, in the same order as the rows of the tf-idf matrix.
# The title is the part of the file name before the first underscore;
# os.path.basename makes this work with both "/" and "\" path separators.
labels = [os.path.basename(filename).split("_")[0] for filename in files_list]

print("TfIdf Matrix Shape: {}".format(tfidf_matrix.shape))
print("Labels in order of files list: {}".format(labels))

TfIdf Matrix Shape: (24, 2768)
Labels in order of files list: ['Angela Merkel', 'Aristoteles', 'Axl Rose', 'Barack Obama', 'Cristina Fernández de Kirchner', 'Freddie Mercury', 'Friedrich Schiller', 'Gotthold Ephraim Lessing', 'Heinrich von Kleist', 'Immanuel Kant', 'Joe Cocker', 'Johann Gottfried Herder', 'Johann Wolfgang von Goethe', 'Julia Gillard', 'Martin Heidegger', 'Michael Jackson', 'Ozzy Osbourne', 'Platon', 'René Descartes', 'Serj Tankian', 'Sokrates', 'Tsachiagiin Elbegdordsch', 'William Shakespeare', 'Wladimir Wladimirowitsch Putin']


## Clustering

We pass the number of clusters (4, one per category) to the K-Means algorithm.  
It returns a list of numbers from 0 to 3, one number per article.  
Each number is the cluster that the corresponding article was assigned to.  
The clusters have to be labelled afterwards, because K-Means cannot tell which category a cluster corresponds to.  
K-Means only groups articles whose TF-IDF vectors are similar; it has no notion of what the groups mean.

In [21]:
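from sklearn.cluster import KMeans

num_clusters = 4

km = KMeans(n_clusters=num_clusters)

# Fit the model on the tf-idf matrix; every row (article) is assigned to one cluster.
km.fit(tfidf_matrix)

# Cluster index (0 to 3) for each article, in the same order as `labels`.
clusters = km.labels_.tolist()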

## View Results

Finally, we display the results as a pandas DataFrame to evaluate the quality of predictions.

In [23]:
import pandas as pd

df = pd.DataFrame(
    {"Labels": labels,
    "Predicted": clusters},
    columns = ["Labels", "Predicted"])

df.sort_values("Predicted")

| | Labels | Predicted |
|---|---|---|
| 0 | Angela Merkel | 0 |
| 21 | Tsachiagiin Elbegdordsch | 0 |
| 13 | Julia Gillard | 0 |
| 4 | Cristina Fernández de Kirchner | 0 |
| 23 | Wladimir Wladimirowitsch Putin | 0 |
| 3 | Barack Obama | 0 |
| 5 | Freddie Mercury | 1 |
| 19 | Serj Tankian | 1 |
| 10 | Joe Cocker | 1 |
| 22 | William Shakespeare | 1 |


## Conclusion

The outcome can differ from run to run, because the [centroids](http://bigdata-madesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/) are initialized randomly.  
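
If repeatable runs are wanted, the random initialization can be pinned with the `random_state` parameter of `KMeans`. The sketch below uses hypothetical 2-D points (`toy_points`) as a stand-in for the TF-IDF matrix; the seed values and `n_init=10` are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: four well-separated groups of six points each.
rng = np.random.default_rng(0)
group_centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
toy_points = np.vstack(
    [center + 0.1 * rng.standard_normal((6, 2)) for center in group_centers]
)

# Two independent models with the same seed start from the same centroids.
km_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit(toy_points)
km_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit(toy_points)

same_labels = bool((km_a.labels_ == km_b.labels_).all())
print(same_labels)  # True
```

Fixing the seed makes a single result reproducible; it does not make that result better. Raising `n_init` lets K-Means try several initializations and keep the one with the lowest inertia.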

The results seem very accurate: the clustering is often completely correct and only sometimes one article off.
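
To put a number on "one article off", the predicted clusters could be compared with the known categories using the adjusted Rand index, which ignores how the clusters happen to be numbered. The category and prediction lists below are hypothetical, invented for illustration; they are not results from the corpus.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth for six articles: 0 = politics, 1 = music, 2 = literature.
true_categories = [0, 0, 1, 1, 2, 2]

# Same grouping as the ground truth, but with different cluster numbers.
predicted_all_right = [2, 2, 0, 0, 1, 1]
# One literature article ends up in the music cluster.
predicted_one_off = [2, 2, 0, 0, 0, 1]

score_all_right = adjusted_rand_score(true_categories, predicted_all_right)
score_one_off = adjusted_rand_score(true_categories, predicted_one_off)

print(score_all_right)  # 1.0
print(score_one_off)    # below 1.0
```

A score of 1.0 means the grouping matches the categories exactly, and values near 0 mean the grouping is no better than chance.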