# Introduction to the Clustering of Texts

Text clustering is widely used in applications such as recommender systems, sentiment analysis, topic selection and user segmentation. Word embeddings make it possible to exploit word order and semantic information from a text corpus. But what exactly are word embeddings? Loosely speaking, they are vector representations of words. Words cannot be passed directly to a neural network (or any other machine learning algorithm); they must first be encoded in a numerical form. We can think of words as distinct inputs that are nevertheless related to one another in an abstract space of meanings. For example, the words "princess" and "queen" are close to one another in meaning, while the words "ball" and "square" are far apart.
There are a number of different approaches to generating word embeddings. Their relative merit depends on how well they place words with similar meanings close to one another in the vector space. The approaches fall roughly into two categories: probabilistic approaches (e.g. using a neural network to optimize an embedding) and classical "count-based" approaches.
In this notebook we are going to use two embedding algorithms:

- **Word2Vec** is one of the most popular techniques to learn word embeddings using shallow neural networks. It was developed by a team led by Tomas Mikolov at Google and published in 2013.

- **GloVe** is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
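To make the idea of "closeness" in an embedding space more concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration only and do not come from Word2Vec or GloVe; the helper `cosine_similarity` is likewise a hypothetical name.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
# Real models such as Word2Vec or GloVe typically use 50 to 300 dimensions.
toy_embeddings = {
    "queen":    np.array([0.9, 0.8, 0.1]),
    "princess": np.array([0.8, 0.9, 0.2]),
    "square":   np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(toy_embeddings["queen"], toy_embeddings["princess"])
sim_far = cosine_similarity(toy_embeddings["queen"], toy_embeddings["square"])

print(f"queen vs princess: {sim_close:.3f}")
print(f"queen vs square:   {sim_far:.3f}")
```

With these made-up vectors the first similarity is clearly higher than the second, which is the kind of behaviour we hope a trained embedding shows for related and unrelated words.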



In [None]:
import pandas as pd
import numpy as np
import string
import sys
import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.cluster import KMeansClusterer
 
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import confusion_matrix, classification_report, rand_score, accuracy_score
from sklearn import metrics
from sklearn.cluster import DBSCAN, MiniBatchKMeans, KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

import gensim.downloader
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (15, 10)

## KMeans in Combination with word2vec

We start with a simple example using `word2vec`, in which we cluster 10 sentences using `KMeans` clustering.

- 'this', 'is', 'the', 'one','good', 'machine', 'learning', 'book'
- 'this', 'is',  'another', 'book'
- 'another', 'book', 'in', 'collection'
- 'weather', 'rain', 'snow'
- 'yesterday', 'weather', 'snow'
- 'forecast', 'tomorrow', 'rain', 'snow'
- 'tomorrow', 'weather', 'clear'
- 'this', 'is', 'the', 'new', 'post'
- 'this', 'is', 'about', 'more', 'machine', 'learning', 'post'
- 'and', 'this', 'is', 'the', 'one', 'last', 'post', 'book'


For the embeddings we will use the gensim Word2Vec model. Because we want to cluster at sentence level, one extra step is needed to move from word level to sentence level: for each sentence, the embeddings of its words are summed and the sum is divided by the number of words in the sentence. The result is the average of all word embeddings of the sentence, which we can use just like a word-level embedding and feed to a clustering algorithm.
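The averaging step can be checked by hand on a tiny example. The sketch below uses invented two-dimensional vectors and a hypothetical helper `average_embedding`; it is not the function used in this notebook, and the zero-vector fallback for sentences without any known word is an assumption of the sketch.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "weather": np.array([1.0, 0.0]),
    "rain":    np.array([0.8, 0.2]),
    "snow":    np.array([0.6, 0.4]),
}

def average_embedding(tokens, vectors, dim=2):
    """Mean of the vectors of all known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.sum(known, axis=0) / len(known)

# "unknown" has no vector and is skipped, so the sum is divided by 3, not 4
sentence_vec = average_embedding(["weather", "rain", "snow", "unknown"], toy_vectors)
print(sentence_vec)
```

Note that averaging discards word order: two sentences containing the same words in a different order receive the same vector.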

Now we will cluster our text using the K-means algorithm with the Word2Vec embeddings. For this we will use `KMeansClusterer`, the K-means implementation from the NLTK library.

In [None]:
# Saving the sentences as a list 
sentences = [['this', 'is', 'the', 'one','good', 'machine', 'learning', 'book'],
            ['this', 'is',  'another', 'book'],
            ['another', 'book', 'in', 'collection'],
            ['weather', 'rain', 'snow'],
            ['yesterday', 'weather', 'snow'],
            ['forecast', 'tomorrow', 'rain', 'snow'],
            ['tomorrow', 'weather', 'clear'],
            ['this', 'is', 'the', 'new', 'post'],
            ['this', 'is', 'about', 'more', 'machine', 'learning', 'post'],  
            ['and', 'this', 'is', 'the', 'one', 'last', 'post', 'book']]
  

# Training a Word2Vec model on our sentences (min_count=1 keeps every word)
model = Word2Vec(sentences, min_count=1)


# Creating a function that calculates the embedding of a whole sentence by summing up
# the embeddings of its words and dividing by the number of words found in the model
def sent_vectorizer(sent, model):
    # Keep only the words that are part of the model's vocabulary
    vectors = [model.wv[w] for w in sent if w in model.wv]
    if not vectors:
        # No known word: return a zero vector instead of dividing by zero
        return np.zeros(model.wv.vector_size)
    return np.sum(vectors, axis=0) / len(vectors)


# Saving the embeddings into a list X  
X=[]
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))   

    
# Creating the KMeans Clustering model with the implementation of the nltk library
NUM_CLUSTERS=2
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=50)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

In [None]:
# If you look at our data X you see that it has more than 2 dimensions
# In order to plot the data we will use t-SNE to reduce the dimensions to 2
# (the perplexity must be smaller than the number of samples, here 10)
# The name `tsne` avoids overwriting the Word2Vec `model` from the cell above
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
np.set_printoptions(suppress=True)

Y = tsne.fit_transform(np.array(X))

# Plotting the results of the clustering
plt.scatter(Y[:, 0], Y[:, 1], c=assigned_clusters, s=290, alpha=.5)

for j in range(len(sentences)):
    plt.annotate(assigned_clusters[j], xy=(Y[j][0], Y[j][1]), xytext=(0,0), textcoords='offset points')
    print("%s %s" % (assigned_clusters[j], sentences[j]))

plt.show()

We see that the data were clustered according to our expectations: sentences were assigned to different clusters by topic.
In this example we saw how to use clustering algorithms with word embeddings at sentence level. We used a K-means clustering algorithm and a Word2Vec embedding model. In order to go from word embedding level to sentence embedding level we created an additional function.

## Several Clustering Algorithms in Combination with GloVe

`GloVe` stands for Global Vectors for Word Representation. It is an unsupervised algorithm, developed at Stanford University and published in 2014, that creates vector space models of word semantics (more commonly known as word embeddings) from a global word-word co-occurrence matrix aggregated over a corpus. GloVe is similar to Word2Vec. The primary difference is that Word2Vec is a 'predictive' model that predicts context given a word, while GloVe is a count-based model that learns the vectors by essentially performing a dimensionality reduction on the co-occurrence count matrix, which is done by minimizing a cost function.
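To illustrate what "word-word co-occurrence statistics" means, the following sketch counts how often two words appear close to each other in a toy corpus. It only shows the counting step. The actual GloVe training additionally weights the counts and fits vectors to them, which is not reproduced here. The function name `cooccurrence_counts` and the window size are assumptions made for this example.

```python
from collections import Counter

# Toy corpus reusing a few of the weather sentences from the first example
corpus = [
    ["weather", "rain", "snow"],
    ["yesterday", "weather", "snow"],
    ["tomorrow", "weather", "clear"],
]

def cooccurrence_counts(sentences, window=1):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            # Look only to the right and store the pair in sorted order,
            # so each unordered pair is counted once per occurrence
            for j in range(i + 1, min(i + window + 1, len(sent))):
                counts[tuple(sorted((word, sent[j])))] += 1
    return counts

counts = cooccurrence_counts(corpus, window=2)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Words that frequently share a context, such as "snow" and "weather" in this toy corpus, accumulate higher counts, and a count-based model uses these statistics to place them near each other.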

We will use 2 categories from the 20 Newsgroups dataset as our data. Then we are going to use and compare `KMeans`, `MiniBatchKMeans` and `AgglomerativeClustering` as our clustering algorithms.

In [None]:
# Downloading the pre-trained GloVe vectors (50 dimensions, trained on Wikipedia and Gigaword)
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

In [None]:
# Downloading the 20 Newsgroups Dataset
data = fetch_20newsgroups()

In [None]:
# Keeping only two Categories (atheism and autos) of the Newsgroup dataset
# (Feel free to experiment also with other categories)
categories = ['alt.atheism','rec.autos']
data = fetch_20newsgroups(categories=categories)

In [None]:
# Preprocess the data
# Loading the stopwords once, as a set for fast membership tests
stop = set(stopwords.words('english'))
sentences = []
for sentence in data.data:
    # Removing punctuation and special characters
    sentence = re.sub(r'[^A-Za-z0-9]+', ' ', sentence)
    # Removing the stopwords
    resultwords = [word for word in sentence.split() if word.lower() not in stop]
    sentences.append(" ".join(resultwords))

In [None]:
# Creating a function that calculates the embedding of a whole sentence by summing up
# the embeddings of its words and dividing by the number of words found in the model
def sent_vectorizer(sent, model):
    # Keep only the words that are part of the model's vocabulary
    vectors = [model[w] for w in sent.split() if w in model]
    if not vectors:
        # No known word: return a zero vector instead of dividing by zero
        return np.zeros(model.vector_size)
    return np.sum(vectors, axis=0) / len(vectors)
  
# Saving the embeddings in a list X
X=[]
for sentence in sentences:
    X.append(sent_vectorizer(sentence, glove_vectors))   

In [None]:
print('Number of targets: ', len(data.target))
print('Number of Sentences: ', len(sentences))
print('Number of X: ', len(X))

In [None]:
# Scaling the data with the StandardScaler and then using PCA for dimensionality reduction
# PCA(0.99) keeps as many components as are needed to explain 99% of the variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_ = PCA(0.99, random_state=0)
X_pca = pca_.fit_transform(X_scaled)

### KMeans

First we will use the `KMeans` clustering algorithm. We will use the `Elbow Method` to get an idea of how many clusters we should choose. The y-axis of our plot represents the WCSS, the *"within-cluster sum of squares"*. This is the sum of the squared distances between the observations in a cluster and the centroid of that cluster, added up over all clusters.
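As a quick check of this definition, the sketch below computes the WCSS by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. It runs on synthetic data from `make_blobs` (an assumption of this example) rather than on the newsgroup embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data, used only for this illustration
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# WCSS: squared distance of every point to the centroid of its own cluster
assigned_centroids = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_demo - assigned_centroids) ** 2)

print(wcss_manual, km.inertia_)
```

Both numbers should agree up to floating-point error, which is why `kmeans.inertia_` can be used directly as the WCSS value in an elbow plot.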

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=150, n_init=10, random_state=0)
    kmeans.fit(X_pca)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

This time it is quite difficult to determine the right number of clusters from the plot alone. There are two relatively clear kinks at two and three clusters before the curve flattens, so it makes sense to decide on one of these numbers. In this case we are in a very comfortable situation because we know the right number of clusters, so we choose two as the number we want to work with.

Now we will fit our model to the data which earlier was reduced in its dimensionality with a PCA.

In [None]:
# Creating the KMeans Clustering model and fit it to our data X_pca after Dimensionality Reduction
kmeans = KMeans(n_clusters=2, verbose=0, init='k-means++', max_iter=150, n_init=20, random_state=0)
kmeans.fit(X_pca)

Now we will plot the clusters and print our results. Since our data has more than three dimensions we will use t-SNE to reduce the dimensionality to visualize the results. 

In [None]:
# Using t-SNE to prepare the data for plotting
model1 = TSNE(n_components=2, random_state=0)

Y = model1.fit_transform(X_pca)

plt.scatter(Y[:, 0], Y[:, 1],c=kmeans.labels_,  s=190,alpha=.5); 
for j in range(len(sentences)):    
    plt.annotate(data.target[j], xy=(Y[j][0], Y[j][1]), xytext=(0,0), textcoords='offset points')

In [None]:
# Since we know the correct labels we can also evaluate the performance of the algorithm
print(confusion_matrix(data.target,kmeans.labels_))
print(classification_report(data.target,kmeans.labels_))

According to the classification report the algorithm achieved an **accuracy** of only **8%**. However, K-means knows nothing about the identity of the clusters and assigns the initial centroids randomly, so the cluster labels may be permuted relative to the true labels. To fix this we will perform the following steps:

In [None]:
# Create new dataframe with real and predicted labels 
df_result = pd.DataFrame()
df_result['org_target'] = data.target
df_result['pred_target'] = kmeans.labels_
df_result.head()

In [None]:
# Switch 0s and 1s for predicted label
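# (assigning the result back instead of using inplace=True on a column selection,
# which does not reliably modify the dataframe in recent pandas versions)
df_result['pred_target'] = df_result['pred_target'].replace({1: 0, 0: 1})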
df_result.head()

In [None]:
# Evaluation metrics with fixed labels
print(confusion_matrix(df_result.org_target,df_result.pred_target))
print(classification_report(df_result.org_target,df_result.pred_target))

Now, that looks better! As we can see Kmeans could cluster the texts pretty well with an **accuracy** of **92%**. 

Changing the labels manually is not always feasible; sometimes you will need an evaluation metric that is not sensitive to permutations of the labels. One metric to choose in this situation is the [Rand Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.rand_score.html). It considers not the labels themselves, but how they separate the data: for every pair of samples it checks whether the two label sets (the predicted one and the true labels) agree on putting the pair into the same cluster or into different clusters, and it reports the fraction of pairs on which they agree.
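The pair-counting idea behind the Rand Index can be written down in a few lines. The following is a minimal hand-written sketch for illustration (the function name `rand_index` and the label lists are made up); in the notebook itself we rely on scikit-learn's `rand_score`.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if both labelings put it together or both split it
        agreements += same_a == same_b
    return agreements / len(pairs)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 0, 0]        # hypothetical clustering with one mistake
swapped = [1 - p for p in pred_labels]  # same clustering, cluster names exchanged

print(rand_index(true_labels, pred_labels))
print(rand_index(true_labels, swapped))
```

Both calls return the same value, because exchanging the cluster names does not change which pairs of samples are grouped together.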

In [None]:
# Rand Index
print('Rand Index: {}'.format(rand_score(data.target,kmeans.labels_)))

### MiniBatch KMeans

The second model we are using is `MiniBatchKMeans`. The MiniBatchKMeans is a variation of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.
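As a small sanity check of the claim that both variants optimise the same objective, the sketch below fits `KMeans` and `MiniBatchKMeans` on the same synthetic data and compares the resulting partitions with the Rand Index. The data, the cluster centers and the `batch_size` value are assumptions chosen for this illustration, not settings used elsewhere in this notebook.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import rand_score

# Two well-separated synthetic blobs, used only for this illustration
X_demo, _ = make_blobs(n_samples=1000, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=0)

full = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
# Each update step uses a random mini-batch of 100 samples instead of all data
mini = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=10,
                       random_state=0).fit(X_demo)

agreement = rand_score(full.labels_, mini.labels_)
print("Rand Index between the two partitions:", agreement)
```

On easy data like this the two partitions should be practically identical; on harder data the mini-batch variant may give slightly different clusters in exchange for a shorter run time.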

In [None]:
# Creating the MiniBatchKMeans Clustering model and fit it to our preprocessed data
batch_kmeans = MiniBatchKMeans(n_clusters=2, verbose=0, max_iter=150, n_init=20, random_state=0)
batch_kmeans.fit(X_pca)

Now we will plot the clusters and print our results.

In [None]:
# Using t-SNE to prepare the data for plotting
model1 = TSNE(n_components=2, random_state=0)

Y = model1.fit_transform(X_pca)

plt.scatter(Y[:, 0], Y[:, 1], c=batch_kmeans.labels_,  s=190, alpha=.5);
for j in range(len(sentences)):    
    plt.annotate(data.target[j], xy=(Y[j][0], Y[j][1]), xytext=(0,0), textcoords='offset points') 

In [None]:
print(confusion_matrix(data.target,batch_kmeans.labels_))
print(classification_report(data.target,batch_kmeans.labels_))

Again, the cluster labels assigned by the algorithm are swapped relative to the real labels. We can use the same procedure as above to fix this:

In [None]:
# Define dataframe and change predicted target labels
df_result = pd.DataFrame()
df_result['org_target'] = data.target
df_result['pred_target'] = batch_kmeans.labels_
df_result['pred_target'] = df_result['pred_target'].replace({1: 0, 0: 1})
df_result.head()

In [None]:
# Evaluation metrics with fixed labels
print(confusion_matrix(df_result.org_target,df_result.pred_target))
print(classification_report(df_result.org_target,df_result.pred_target))

As we can see MiniBatchKMeans has an **accuracy** of **93%**. That's nearly the same as Kmeans. 

### Agglomerative Clustering

The third algorithm we are going to use is `AgglomerativeClustering`. The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.
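The bottom-up merging can be made visible on a tiny example. The sketch below uses SciPy's `linkage` function instead of scikit-learn's `AgglomerativeClustering`, because it returns the individual merge steps directly; the four one-dimensional points are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D observations forming two obvious pairs
points = np.array([[0.0], [0.5], [10.0], [10.4]])

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new_size]
# Original observations have ids 0..3, newly formed clusters get ids 4, 5, ...
Z = linkage(points, method="ward")

for a, b, dist, size in Z:
    print(f"merge {int(a)} + {int(b)} at distance {dist:.2f} -> size {int(size)}")
```

The closest pair is merged first, then the second pair, and finally the two resulting clusters are joined into one. Cutting this hierarchy at the desired number of clusters gives the final labels.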

In [None]:
# Creating the AgglomerativeClustering model and fit it to our preprocessed data
agg_clust = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X_pca)

In [None]:
# Using t-SNE to prepare the data for plotting
model1 = TSNE(random_state=0)

Y = model1.fit_transform(X_pca)

plt.scatter(Y[:, 0], Y[:, 1], c=agg_clust.labels_,  s=190, alpha=.5);
for j in range(len(sentences)):    
    plt.annotate(data.target[j], xy=(Y[j][0], Y[j][1]), xytext=(0,0), textcoords='offset points') 

In [None]:
print(confusion_matrix(data.target,agg_clust.labels_))
print(classification_report(data.target,agg_clust.labels_))

Again, the cluster labels assigned by the algorithm are swapped relative to the real labels. We can use the same procedure as above to fix this:

In [None]:
# Define dataframe and change predicted target labels
df_result = pd.DataFrame()
df_result['org_target'] = data.target
df_result['pred_target'] = agg_clust.labels_
df_result['pred_target'] = df_result['pred_target'].replace({1: 0, 0: 1})
df_result.head()

In [None]:
# Evaluate the AgglomerativeClustering algorithm 
print(confusion_matrix(df_result.org_target,df_result.pred_target))
print(classification_report(df_result.org_target,df_result.pred_target))

As you can see `AgglomerativeClustering` performed similarly to the other algorithms achieving an **accuracy** of **91%**. 
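In all three cases we swapped the predicted labels by hand. With more than two clusters this quickly becomes tedious, so here is one possible way to automate it. This is a different technique from the manual swap above: it uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to find the one-to-one mapping between cluster ids and true labels with the largest overlap. The helper `align_labels` is a hypothetical name, and the sketch assumes that all labels are non-negative integers (so no `-1` noise label).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(y_true, y_pred):
    """Rename cluster ids so they agree with the true labels as often as possible."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # contingency[c, t] = number of samples in cluster c with true label t
    contingency = np.zeros((n, n), dtype=int)
    for t, c in zip(y_true, y_pred):
        contingency[c, t] += 1
    # Hungarian algorithm: choose the one-to-one mapping with maximal overlap
    clusters, targets = linear_sum_assignment(contingency, maximize=True)
    mapping = dict(zip(clusters, targets))
    return np.array([mapping[c] for c in y_pred])

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1]   # hypothetical cluster ids: swapped, plus one error
aligned = align_labels(y_true, y_pred)
print(aligned, (aligned == np.array(y_true)).mean())
```

The aligned labels can then be passed to `confusion_matrix` and `classification_report` in the same way as the manually swapped ones.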

### DBSCAN 

We also tried `DBSCAN` on the data, but it performed relatively poorly.

You are welcome to try it yourself:

Remember to tweak the hyperparameters of DBSCAN to get different results (you should definitely do this, if all values are considered to be in one cluster only).

If you check out the classification report, you will stumble across a `-1` "cluster". This is not a cluster: it collects all the points that the algorithm considers to be noise.
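To see the `-1` noise label in isolation before trying the newsgroup data, here is a minimal sketch on synthetic points. The data and the values of `eps` and `min_samples` are assumptions chosen for this toy example and are not tuned for the GloVe embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in data: two tight groups plus one far-away outlier
group_a = rng.normal(loc=0.0, scale=0.2, size=(30, 2))
group_b = rng.normal(loc=5.0, scale=0.2, size=(30, 2))
outlier = np.array([[100.0, 100.0]])
X_demo = np.vstack([group_a, group_b, outlier])

# eps: neighbourhood radius, min_samples: neighbours needed to form a dense region
db = DBSCAN(eps=1.0, min_samples=5).fit(X_demo)

# The label -1 marks noise and is not counted as a cluster
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

On the real embeddings you will probably need to experiment with `eps` in particular, since suitable values depend on the scale and dimensionality of the data.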

In [None]:
# Your code!
from sklearn.cluster import DBSCAN