DiploDatos 2018 / Aprendizaje no supervizado / Clustering Demo*

# Aplicación de técnicas de *clustering* a documentos de texto

**Objetivos:**

En este ejemplo mostraremos cómo utilizar técnicas de clustering para aprender la estructura subyacente de un conjunto de documentos de texto.

In [95]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3

from __future__ import print_function

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib

### DATOS: Top 100 Greatest Movies of All Time (The Ultimate List), by ChrisWalczyk55

https://www.imdb.com/list/ls055592025/

El problema consiste en agrupar un conjunto de películas en base a sus críticas en inglés, 
usando para ello procesamiento del texto


Lo primero que haremos es leer los datos, disponibles en:
https://github.com/brandomr/document_cluster.git

In [96]:
# Lectura de los titulos

with open("data/document_cluster/title_list.txt") as file:
    titles = [line.strip() for line in file]
    
# Lectura de las criticas

synopses = []
with open("data/document_cluster/synopses_list_wiki.txt") as file:
    i = True
    l = ' '
    for line in file:            
        if 'BREAKS HERE' in line:
            synopses.append(l) # append the previously collected lines
            l = ' '       
        l = l + line.decode('utf-8').strip()
        
# Lectura de los generos

with open("data/document_cluster/genres_list.txt") as file:
    genres = [line.strip() for line in file]

### Análisis del texto | tokenizing


Para analizar el texto debemos estudiar la frecuencia de las palabras, es decir, separar el texto en unidades sintácticas o *tokens*.

In [97]:
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [98]:
# e.g.:
from nltk.tokenize import word_tokenize
text = "Computer science is no more about computers than astronomy is about telescopes. Edsger Dijkstra"
tokens = tokenize_only(text)
print(tokens)

['computer', 'science', 'is', 'no', 'more', 'about', 'computers', 'than', 'astronomy', 'is', 'about', 'telescopes', 'edsger', 'dijkstra']


In [99]:
totalvocab_tokenized = []

for i in synopses:
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [100]:
print('Hay en total ' + str(len(totalvocab_tokenized)) + ' tokens \n')
len(totalvocab_tokenized)
print (totalvocab_tokenized[0:50])

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_tokenized)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

Hay en total 164243 tokens 

[u'plot', u'edit', u'edit', u'edit', u'on', u'the', u'day', u'of', u'his', u'only', u'daughter', u"'s", u'wedding', u'vito', u'corleone', u'hears', u'requests', u'in', u'his', u'role', u'as', u'the', u'godfather', u'the', u'don', u'of', u'a', u'new', u'york', u'crime', u'family', u'vito', u"'s", u'youngest', u'son', u'michael', u'in', u'a', u'marine', u'corps', u'uniform', u'introduces', u'his', u'girlfriend', u'kay', u'adams', u'to', u'his', u'family', u'at']
there are 164243 items in vocab_frame


In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_only, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

terms = tfidf_vectorizer.get_feature_names()

(100, 143)


### Buscar clusters | Kmeans

Primero tenemos que hacer el *embeding*:

In [102]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [103]:
print (clusters)

# Recuento del número de elementos en cada cluster
for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

[1, 3, 2, 3, 2, 2, 1, 3, 1, 1, 2, 1, 3, 4, 0, 3, 1, 1, 4, 2, 1, 3, 1, 1, 2, 3, 1, 3, 3, 2, 1, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 0, 1, 3, 1, 1, 1, 1, 2, 2, 2, 4, 4, 2, 3, 2, 1, 2, 1, 4, 1, 2, 2, 4, 3, 1, 4, 3, 4, 1, 3, 4, 3, 3, 1, 0, 4, 1, 0, 2, 0, 1, 3, 3, 1, 0, 3, 2, 2, 2, 4, 1, 2, 1, 3, 4, 4, 4, 1, 1]
El cluster 0 tiene 6 elementos
El cluster 1 tiene 30 elementos
El cluster 2 tiene 26 elementos
El cluster 3 tiene 24 elementos
El cluster 4 tiene 14 elementos


In [104]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'genre'])

In [105]:
frame[1:10]

Unnamed: 0,title,genre
3,The Shawshank Redemption,"[u' Crime', u' Drama']"
2,Schindler's List,"[u' Biography', u' Drama', u' History']"
3,Raging Bull,"[u' Biography', u' Drama', u' Sport']"
2,Casablanca,"[u' Drama', u' Romance', u' War']"
2,One Flew Over the Cuckoo's Nest,[u' Drama']
1,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
3,Citizen Kane,"[u' Drama', u' Mystery']"
1,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
1,Titanic,"[u' Drama', u' Romance']"


In [106]:
frame.ix[1]

Unnamed: 0,title,genre
1,The Godfather,"[u' Crime', u' Drama']"
1,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
1,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
1,Titanic,"[u' Drama', u' Romance']"
1,The Godfather: Part II,"[u' Crime', u' Drama']"
1,Forrest Gump,"[u' Drama', u' Romance']"
1,The Sound of Music,"[u' Biography', u' Drama', u' Family', u' Musi..."
1,E.T. the Extra-Terrestrial,"[u' Adventure', u' Family', u' Sci-Fi']"
1,The Silence of the Lambs,"[u' Crime', u' Drama', u' Thriller']"
1,Chinatown,"[u' Drama', u' Mystery', u' Thriller']"


In [107]:
dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [108]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

In [111]:
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]     
        
for i in range(num_clusters):
    print("*** Cluster %d:" % i, end='\n\n')
    
    print("WORDS /// ", end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
    print() #add whitespace
    print() #add whitespace
    
    print("TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s / ' % title, end='')
    print() #add whitespace
    print() #add whitespace
    
print()
print()        
        

Top terms per cluster:

*** Cluster 0:

WORDS ///  army /  war /  men /  town /  killed /  new / 

TITLES ///  Vertigo /  The Philadelphia Story /  Tootsie /  The Grapes of Wrath /  The Green Mile /  American Graffiti / 

*** Cluster 1:

WORDS ///  family /  home /  father /  war /  son /  death / 

TITLES ///  The Godfather /  Gone with the Wind /  The Wizard of Oz /  Titanic /  The Godfather: Part II /  Forrest Gump /  The Sound of Music /  E.T. the Extra-Terrestrial /  The Silence of the Lambs /  Chinatown /  It's a Wonderful Life /  Amadeus /  To Kill a Mockingbird /  The Best Years of Our Lives /  My Fair Lady /  Ben-Hur /  Doctor Zhivago /  High Noon /  The Pianist /  The Exorcist /  The King's Speech /  Mr. Smith Goes to Washington /  Terms of Endearment /  Giant /  Close Encounters of the Third Kind /  The Graduate /  A Clockwork Orange /  Wuthering Heights /  North by Northwest /  Yankee Doodle Dandy / 

*** Cluster 2:

WORDS ///  car /  tells /  woman /  money /  wife /  late

# .

### Ahora lo limpiamos un poco más: STOPWORDS, STEMMING & TOKENIZING

In [132]:
# la primera vez hay que descargar la lista de 'stopwords': nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

In [133]:
# STEMMING

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [134]:
totalvocab_stemmed = []

for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)    

**Eliminar las palabras vacías**

In [135]:
from nltk.corpus import stopwords
stopwords = nltk.corpus.stopwords.words('english')
f_text = [word for word in totalvocab_stemmed if word not in stopwords]
    
vocab_frame = pd.DataFrame({'words': f_text}, index=range(len(f_text)))

In [136]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

terms = tfidf_vectorizer.get_feature_names()

**K-Means**

In [138]:
#from sklearn.metrics.pairwise import cosine_similarity
#dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

from sklearn.externals import joblib

joblib.dump(km,  'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

In [139]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'cluster', 'genre'])

In [140]:
#terms = tfidf_vectorizer.get_feature_names()

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [141]:
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]     
        
for i in range(num_clusters):
    print("[[ Cluster %d ]]" % i, end='\n\n')
    
    print("  WORDS /// ", end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        #print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
        print(' %s' % terms[ind], end=' /')
    print('\n')
    
    print("  TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s / ' % title, end='')
    print('\n\n')
    
print('\n')

Top terms per cluster:

[[ Cluster 0 ]]

  WORDS ///  kill / armi / soldier / order / war / command /

  TITLES ///  On the Waterfront /  The French Connection /  City Lights /  It Happened One Night /  Shane /  American Graffiti /  A Clockwork Orange /  Rebel Without a Cause / 


[[ Cluster 1 ]]

  WORDS ///  film / fight / year / new / scene / tell /

  TITLES ///  The Shawshank Redemption /  Psycho /  Vertigo /  West Side Story /  Chinatown /  12 Angry Men /  Amadeus /  An American in Paris /  The Apartment /  Goodfellas /  A Place in the Sun /  Fargo /  Nashville /  The Maltese Falcon /  Taxi Driver /  Double Indemnity /  Rear Window /  The Third Man / 


[[ Cluster 2 ]]

  WORDS ///  home / father / john / tell / return / day /

  TITLES ///  Casablanca /  One Flew Over the Cuckoo's Nest /  Lawrence of Arabia /  Star Wars /  The Silence of the Lambs /  The Bridge on the River Kwai /  Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb /  Apocalypse Now /  The Lord