# Using Pre-Trained Word Vectors for Clustering Text Documents

A common approach to clustering text documents is TF-IDF vectorization followed by KMeans. This is simple and powerful, but it carries the usual limitation of bag-of-words models: documents are compared only through the words they share, so when the corpus is small or sparse, related documents with little vocabulary overlap can be clustered poorly. Pre-trained word vectors can, in theory, overcome some of these limitations: rather than relying only on what can be learned from the documents at hand, one draws on word relationships learned from a much larger corpus.
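
To make the bag-of-words limitation concrete, here is a minimal sketch that compares sentences by raw word counts. It is a simplified stand-in for TF-IDF (no IDF weighting, whitespace tokenization), and the helper name `bow_cosine` is made up for this illustration.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between two sentences under raw word counts."""
    va = Counter(a.lower().split())
    vb = Counter(b.lower().split())
    # Only words present in both sentences contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two food-related sentences with no word in common.
food_vs_cooking = bow_cosine("I love Indian food", "Cooking is my hobby")
# A color sentence and a food sentence that happen to share the word "is".
orange_vs_cooking = bow_cosine("Orange is hip", "Cooking is my hobby")

print(food_vs_cooking, orange_vs_cooking)
```

Under this simplified model, the two food sentences look unrelated because they share no words, while "Orange is hip" looks closer to "Cooking is my hobby" purely because of the function word "is".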

The following example provides an implementation where one can choose to cluster through count or TF-IDF vectorization, or use pre-trained word vectors. With pre-trained word vectors, "Orange is hip" is correctly clustered with the other color-based sentences, while "Cooking is my hobby" is positioned closer to the food sentences!
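
The idea behind a document vector can be sketched without any model download. spaCy's `Doc.vector` defaults to the average of the token vectors; the snippet below imitates that with a handful of hypothetical 2-dimensional vectors invented purely for illustration (real pre-trained vectors have hundreds of dimensions, and unlike this sketch, spaCy does not skip unknown words). The names `toy_vectors`, `doc_vector`, and `cosine` are assumptions of this example.

```python
import math

# Hypothetical toy "word vectors", invented for illustration only.
# First dimension loosely means "color-like", second "food-like".
toy_vectors = {
    "orange": [0.9, 0.1],
    "blue": [1.0, 0.0],
    "color": [0.8, 0.0],
    "food": [0.0, 1.0],
    "cooking": [0.1, 0.9],
}

def doc_vector(text):
    """Average the vectors of the words in `text` that have a toy vector."""
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not vecs:
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

orange_vec = doc_vector("Orange is hip")
color_vec = doc_vector("My favorite color is blue")
cooking_vec = doc_vector("Cooking is my hobby")
food_vec = doc_vector("I love Indian food")

print(cosine(orange_vec, color_vec), cosine(orange_vec, food_vec))
print(cosine(cooking_vec, food_vec), cosine(cooking_vec, color_vec))
```

With these made-up vectors, "Orange is hip" sits nearer the color sentence and "Cooking is my hobby" nearer the food sentence even though neither pair shares a content word, which is the effect the full example relies on.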

For more information on word vectors, please see:

[Deep Learning, NLP, and Representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)


In [None]:
%matplotlib inline

from __future__ import print_function
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.cm as cm


import spacy

In [None]:
def show_clusters(docs, num_clusters, vector_type="spacy"):    
    # create the vectorizer & feature matrix based on that 
    if vector_type == "count":
        vectorizer = CountVectorizer()
        # Convert to a dense ndarray for the PCA reduction used in 2D plotting
        # (toarray() avoids the np.matrix type that recent scikit-learn rejects)
        feature_matrix = vectorizer.fit_transform(docs).toarray()
    elif vector_type == "tfidf":
        vectorizer = TfidfVectorizer()
        # Convert to a dense ndarray for the PCA reduction used in 2D plotting
        # (toarray() avoids the np.matrix type that recent scikit-learn rejects)
        feature_matrix = vectorizer.fit_transform(docs).toarray()
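    elif vector_type == "spacy":
        # ensure that the model has been downloaded previously
        # python -m spacy download en_vectors_glove_md
        # (this is an older spaCy model name; on newer spaCy releases, substitute
        # an installed model that ships word vectors, e.g. en_core_web_md)
        nlp = spacy.load('en_vectors_glove_md')
        # Doc.vector defaults to the average of the document's token vectors
        feature_matrix = np.array([nlp(doc).vector for doc in docs])
    else:
        raise ValueError("Unknown vector_type: %r" % vector_type)
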
    km = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=20, n_init=10)
    cluster_labels = km.fit_predict(feature_matrix)
    
    for i, doc in enumerate(docs):
        print(cluster_labels[i], doc)
    
    # now reduce feature matrix to 2 dimensions and plot it
    pca = PCA(n_components=2).fit(feature_matrix)       
    data2D = pca.transform(feature_matrix)
    # cm.spectral was removed from matplotlib; nipy_spectral is the same colormap
    colors = cm.nipy_spectral(cluster_labels.astype(float) / num_clusters)
    plt.scatter(data2D[:,0], data2D[:,1], color=colors)
     
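    # project the cluster centers into the same 2D space and overlay them
    # (plt.hold was removed from matplotlib; successive plots overlay by default)
    centers2D = pca.transform(km.cluster_centers_)
    plt.scatter(centers2D[:,0], centers2D[:,1], 
                marker='x', s=200, linewidths=3, c='r')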
    plt.show()
    

In [None]:
docs = ["The colors are amazing", "Japanese food is great", 
        "My favorite color is blue", "I love Indian food", 
        "Orange is hip", "Cooking is my hobby"]

show_clusters(docs, 2, vector_type="spacy")