# DATA MINING
## WEEK 13 LAB
### INSTRUCTOR: Dr. Hugo David Calderon Vilca
### TEAM MEMBERS:
- Blas Ruiz, Luis Aaron - 19200069
- Huarhuachi Ortega, Andrea Mariana - 19200267
- Ramos Rivas, Kevin Keyler - 19200096
- Rojas Villanueva, Paula Elianne - 19200266
- Torres Talaverano, Luz Elena - 19200294

# Modeling

## Description of Part Four

In this notebook we analyze text through topic modeling. The ultimate goal of topic modeling is to find the topics present in a corpus. Each document in the corpus is made up of at least one topic, and often several.

To use a topic modeling technique, we need to provide (1) a document-term matrix and (2) the number of topics we want the algorithm to find.
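
To make the first input concrete, the sketch below builds a tiny document-term matrix by hand. The three toy documents and the helper name `build_dtm` are made up for illustration; the notebook itself loads a matrix that was built in an earlier part with scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical toy corpus, not the episode transcripts
docs = ["monster in the woods", "dance at the school", "monster at the dance"]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row is one document and each column one term, which is the same layout as the `data` DataFrame loaded in the next cell.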

## Analyzing the topics of the full text

In [221]:
# Read the document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')
data

Unnamed: 0,abandoned,abilities,ability,ablaze,able,abominable,abomination,abominations,absolutely,academy,...,youve,yowza,yuck,yup,zealots,zero,zombies,zone,zurich,ça
T1C1,1,0,0,0,0,0,0,0,0,1,...,4,0,0,0,1,1,0,0,0,1
T1C2,0,0,1,0,1,0,0,0,0,0,...,4,0,0,0,0,0,0,0,0,0
T1C3,0,1,1,1,2,0,0,1,0,2,...,1,0,2,0,0,0,1,0,0,0
T1C4,0,0,1,0,1,1,0,0,1,0,...,3,0,0,0,0,2,0,2,0,0
T1C5,1,1,1,0,0,0,0,0,0,0,...,4,0,0,0,0,0,0,0,0,0
T1C6,0,0,2,0,1,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
T1C7,0,0,0,0,1,0,0,0,0,0,...,12,1,0,0,0,0,0,0,1,0
T1C8,0,0,1,0,1,0,1,1,0,0,...,4,0,0,1,0,1,0,0,0,0


In [222]:
# Import the modules needed for LDA with gensim
from gensim import matutils, models
import scipy.sparse


In [223]:
# Convert the document-term matrix into a term-document matrix
# gensim expects the transposed matrix, so we use pandas' transpose function
tdm = data.transpose()
tdm.head()

Unnamed: 0,T1C1,T1C2,T1C3,T1C4,T1C5,T1C6,T1C7,T1C8
abandoned,1,0,0,0,1,0,0,0
abilities,0,0,1,0,1,0,0,0
ability,0,1,1,1,1,2,0,1
ablaze,0,0,1,0,0,0,0,0
able,0,1,2,1,0,1,1,1


In [224]:
# Put the term-document matrix into gensim's format: DataFrame --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [225]:
# gensim also needs a dictionary of all the terms and their positions in the term-document matrix
# We reuse the CountVectorizer that we created and pickled earlier
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

Now that we have the corpus (term-document matrix) and id2word (a dictionary mapping each position to its term), we need to specify two other parameters: the number of topics and the number of passes. Let's start with 3 topics, see if the results make sense, and increase the number from there.
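
The two structures passed to gensim can be pictured with a small plain-Python sketch. It mimics the bag-of-words format that `matutils.Sparse2Corpus` yields (one list of `(term_id, count)` pairs per document, read from the columns of the term-document matrix) and the inversion of a `vocabulary_`-style mapping. The toy matrix and vocabulary are hypothetical, and the helper `columns_to_bow` is our own illustration, not a gensim function.

```python
def columns_to_bow(term_doc_matrix):
    """Turn a term-document matrix (rows = terms, columns = documents)
    into one list of (term_id, count) pairs per document, skipping zeros."""
    n_docs = len(term_doc_matrix[0])
    return [
        [(term_id, row[d]) for term_id, row in enumerate(term_doc_matrix) if row[d] != 0]
        for d in range(n_docs)
    ]

# Hypothetical vocabulary shaped like CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"dance": 0, "monster": 1, "school": 2}
# Invert it to (index -> term), as the notebook does for id2word
id2word = {index: term for term, index in vocabulary.items()}

# Rows are terms, columns are two toy documents
tdm = [
    [0, 2],  # dance
    [1, 1],  # monster
    [3, 0],  # school
]
bow = columns_to_bow(tdm)
print(bow)  # [[(1, 1), (2, 3)], [(0, 2), (1, 1)]]
print([[(id2word[i], n) for i, n in doc] for doc in bow])
```

This is why the DataFrame is transposed first: documents have to sit in the columns before the sparse matrix is handed over.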

In [226]:
# LDA with num_topics = 3
# number of passes = 10
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.006*"like" + 0.005*"want" + 0.004*"yeah" + 0.003*"hey" + 0.003*"think" + 0.003*"house" + 0.003*"pilgrim" + 0.003*"meeting" + 0.003*"joseph" + 0.003*"oh"'),
 (1,
  '0.006*"did" + 0.006*"like" + 0.005*"want" + 0.005*"garrett" + 0.004*"father" + 0.004*"way" + 0.004*"oh" + 0.004*"need" + 0.003*"tell" + 0.003*"little"'),
 (2,
  '0.009*"like" + 0.006*"need" + 0.005*"want" + 0.004*"ill" + 0.004*"think" + 0.004*"oh" + 0.004*"did" + 0.003*"yeah" + 0.003*"enid" + 0.003*"lets"')]
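
Each string returned by `print_topics()` lists the highest-weighted terms of a topic as `weight*"term"`, where the weight is the term's probability within that topic. A minimal sketch for turning such a string into `(term, weight)` pairs follows; `parse_topic` is a hypothetical helper of ours, and the input is topic 0 copied from the output above.

```python
import re

def parse_topic(topic_string):
    """Split a print_topics() string into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

# Topic 0 of the three-topic model, copied from the output above
topic_0 = ('0.006*"like" + 0.005*"want" + 0.004*"yeah" + 0.003*"hey" + 0.003*"think" + '
           '0.003*"house" + 0.003*"pilgrim" + 0.003*"meeting" + 0.003*"joseph" + 0.003*"oh"')
terms = parse_topic(topic_0)
print(terms[:3])  # [('like', 0.006), ('want', 0.005), ('yeah', 0.004)]
```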

In [227]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of position: term),
# we also need to specify two other parameters: the number of topics and the number of passes
# The number of topics is how many topics we want the model to identify in the corpus
# The number of passes is how many times the learning algorithm iterates over the full corpus
# LDA with num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.001*"like" + 0.000*"want" + 0.000*"oh" + 0.000*"think" + 0.000*"need" + 0.000*"yeah" + 0.000*"did" + 0.000*"tell" + 0.000*"rowan" + 0.000*"lets"'),
 (1,
  '0.001*"like" + 0.000*"want" + 0.000*"rowan" + 0.000*"did" + 0.000*"need" + 0.000*"good" + 0.000*"cup" + 0.000*"poe" + 0.000*"think" + 0.000*"ill"'),
 (2,
  '0.007*"like" + 0.005*"did" + 0.005*"need" + 0.005*"ill" + 0.004*"oh" + 0.004*"want" + 0.004*"father" + 0.003*"night" + 0.003*"lets" + 0.003*"think"'),
 (3,
  '0.009*"like" + 0.006*"want" + 0.005*"need" + 0.005*"think" + 0.005*"did" + 0.005*"oh" + 0.004*"rowan" + 0.004*"yeah" + 0.003*"people" + 0.003*"tell"')]

These topics are hard for a human to tell apart, since the same generic words (like, want, need, think, did) dominate every topic. We will try other approaches.
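
One rough way to quantify how similar the topics are is the Jaccard overlap of their top-term lists. The sketch below uses the ten top terms of each topic copied from the four-topic output above; the choice of Jaccard similarity is our own illustration, not something the original analysis computed.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Top terms per topic, copied from the four-topic model output above
topics = {
    0: ["like", "want", "oh", "think", "need", "yeah", "did", "tell", "rowan", "lets"],
    1: ["like", "want", "rowan", "did", "need", "good", "cup", "poe", "think", "ill"],
    2: ["like", "did", "need", "ill", "oh", "want", "father", "night", "lets", "think"],
    3: ["like", "want", "need", "think", "did", "oh", "rowan", "yeah", "people", "tell"],
}

for i, j in combinations(topics, 2):
    print(i, j, round(jaccard(topics[i], topics[j]), 2))
```

Topics 0 and 3, for example, share nine of their ten top terms, which matches the impression that the model has not separated distinct themes.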

## Identifying topics from nouns only

Here we try to identify topics using only the nouns in the text.

In [228]:
# Create a function that extracts the nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
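
The test `pos[:2] == 'NN'` works because all Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) share the same two-letter prefix. The sketch below shows the filtering step on hand-written `(word, tag)` pairs, so it runs without NLTK or its downloadable tokenizer and tagger resources. The tagged tokens are hypothetical and `keep_by_tag_prefix` is our own helper name.

```python
def keep_by_tag_prefix(tagged_tokens, prefixes):
    """Keep the words whose Penn Treebank tag starts with one of the prefixes."""
    return [word for word, tag in tagged_tokens if tag[:2] in prefixes]

# Hand-written (word, tag) pairs in the format pos_tag returns; hypothetical, not NLTK output
tagged = [("wednesday", "NNP"), ("solves", "VBZ"), ("strange", "JJ"),
          ("murders", "NNS"), ("quickly", "RB"), ("school", "NN")]

print(keep_by_tag_prefix(tagged, {"NN"}))        # ['wednesday', 'murders', 'school']
print(keep_by_tag_prefix(tagged, {"NN", "JJ"}))  # ['wednesday', 'strange', 'murders', 'school']
```

Adding the `JJ` prefix is all it takes to keep adjectives as well.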

In [229]:
# Read the cleaned data, from before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
T1C1,original release date november wednesday add...
T1C2,original release date november wednesday con...
T1C3,original release date november wednesday fin...
T1C4,original release date november wednesday and...
T1C5,original release date november years ago go...
T1C6,original release date november wednesday att...
T1C7,original release date november at mayor walk...
T1C8,original release date november wednesday and...


In [230]:
# Apply the nouns function to the transcripts to keep only the nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

Unnamed: 0,transcript
T1C1,release date november student brother pugsley ...
T1C2,release date november sheriff galpin perpetrat...
T1C3,release date november members students society...
T1C4,release date november wednesday thing break co...
T1C5,release date years gomez suspicion garrett gat...
T1C6,release date november wednesday attempts goody...
T1C7,release date november mayor walkers notices fi...
T1C8,release date november wednesday classmates tyl...


In [231]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Add the extra stop words again, since we are rebuilding the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said', 'say', 'did']
# Converted to a list because recent scikit-learn versions expect a list here, not a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Rebuild the document-term matrix with nouns only
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn



Unnamed: 0,abilities,ability,abomination,abominations,academy,accident,accounts,accusations,act,action,...,yonder,youd,youll,youve,yuck,zero,zombies,zone,zurich,ça
T1C1,0,0,0,0,1,1,0,0,0,0,...,0,0,4,4,0,1,0,0,0,1
T1C2,0,1,0,0,0,2,0,0,1,0,...,0,0,0,2,0,0,0,0,0,0
T1C3,1,1,0,1,1,0,0,0,0,0,...,1,0,1,1,1,0,1,0,0,0
T1C4,0,1,0,0,0,1,0,0,1,0,...,0,1,0,0,0,1,0,1,0,0
T1C5,1,1,0,0,0,2,1,0,0,0,...,0,0,0,2,0,0,0,0,0,0
T1C6,0,2,0,0,0,0,0,0,2,1,...,0,0,2,1,0,0,0,0,0,0
T1C7,0,0,0,0,0,0,0,0,0,0,...,0,0,2,3,0,0,0,0,1,0
T1C8,0,1,1,1,0,0,0,1,1,1,...,0,0,0,2,0,1,0,0,0,0


In [232]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [233]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.011*"father" + 0.010*"garrett" + 0.007*"gates" + 0.007*"mother" + 0.007*"family" + 0.007*"weekend" + 0.007*"way" + 0.007*"school" + 0.005*"night" + 0.005*"wednesday"'),
 (1,
  '0.001*"wednesday" + 0.001*"thing" + 0.001*"monster" + 0.001*"hes" + 0.001*"dance" + 0.001*"date" + 0.001*"weems" + 0.001*"cup" + 0.001*"mother" + 0.001*"tyler"'),
 (2,
  '0.011*"wednesday" + 0.010*"thing" + 0.010*"tyler" + 0.009*"monster" + 0.008*"crackstone" + 0.007*"dance" + 0.006*"weems" + 0.006*"date" + 0.005*"way" + 0.005*"hes"'),
 (3,
  '0.011*"thing" + 0.010*"wednesday" + 0.007*"monster" + 0.007*"school" + 0.006*"mother" + 0.005*"hyde" + 0.005*"way" + 0.005*"friends" + 0.005*"rowan" + 0.005*"gates"')]

Here the topics are clearer, but they are still hard to tell apart.
- topic [1]: Crackstone and Wednesday (Merlina)
- topic [2]: Wednesday at school
- topic [3]: Wednesday at a dance
- topic [4]: Wednesday meets Goody

## Identifying topics from nouns and adjectives only

In [234]:
# Create a function that extracts the nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize it and extract only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    # Use a distinct local name so the list does not shadow the function itself
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [235]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
T1C1,original release date november wednesday highs...
T1C2,original release date november wednesday skept...
T1C3,original release date november wednesday membe...
T1C4,original release date november wednesday thing...
T1C5,original release date years gomez suspicion ga...
T1C6,original release date november wednesday attem...
T1C7,original release date november mayor walkers f...
T1C8,original release date november wednesday class...


In [236]:
# Create a new document-term matrix using only nouns and adjectives, and also drop common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
# Rebuild the document-term matrix with nouns and adjectives only
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
# Convert the sparse document-term matrix into a DataFrame
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
# Reuse the episode names as the index
data_dtmna.index = data_nouns_adj.index
# Inspect the document-term matrix
data_dtmna



Unnamed: 0,abilities,ability,able,abominable,abomination,abominations,academy,accept,accident,accounts,...,youd,youll,young,youthful,yuck,zero,zombies,zone,zurich,ça
T1C1,0,0,0,0,0,0,1,1,1,0,...,0,5,0,0,0,1,0,0,0,1
T1C2,0,1,1,0,0,0,0,0,2,0,...,1,0,0,0,0,0,0,0,0,0
T1C3,1,1,2,0,0,1,2,0,0,0,...,0,1,1,0,2,0,1,0,0,0
T1C4,0,1,1,1,0,0,0,0,1,0,...,1,0,0,0,0,1,0,1,0,0
T1C5,1,1,0,0,0,0,0,0,2,1,...,1,0,0,1,0,0,0,0,0,0
T1C6,0,2,1,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
T1C7,0,0,1,0,0,0,0,0,0,0,...,0,3,1,0,0,0,0,0,1,0
T1C8,0,1,1,0,1,1,0,0,0,0,...,1,0,1,0,0,1,0,0,0,0
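
The `max_df=.8` setting tells `CountVectorizer` to ignore terms whose document frequency is strictly higher than 80% of the documents. With 8 episodes the limit is 6.4, so a term that appears in 7 or 8 episodes is dropped. The sketch below reproduces that rule in plain Python on hypothetical toy documents; `drop_frequent_terms` is our own helper, not a scikit-learn function.

```python
def drop_frequent_terms(docs, max_df):
    """Drop terms whose document frequency exceeds max_df * number of documents."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df <= limit)

# Hypothetical toy documents: "wednesday" appears in all five
docs = [
    "wednesday monster", "wednesday dance", "wednesday hyde",
    "wednesday dance diary", "wednesday monster hyde",
]
print(drop_frequent_terms(docs, max_df=0.8))  # ['dance', 'diary', 'hyde', 'monster']
```

This is consistent with the table above, where a term as pervasive as "wednesday" no longer shows up among the topic terms fitted on this matrix.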


In [237]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [238]:
# Try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"hyde" + 0.005*"crackstone" + 0.004*"cup" + 0.004*"poe" + 0.004*"mayor" + 0.004*"house" + 0.003*"bianca" + 0.003*"book" + 0.003*"diary" + 0.003*"jericho"'),
 (1,
  '0.000*"hyde" + 0.000*"eugene" + 0.000*"raven" + 0.000*"dance" + 0.000*"garrett" + 0.000*"gates" + 0.000*"sorry" + 0.000*"look" + 0.000*"kinbott" + 0.000*"bianca"'),
 (2,
  '0.008*"dance" + 0.008*"eugene" + 0.007*"gates" + 0.007*"raven" + 0.006*"goody" + 0.005*"god" + 0.005*"hyde" + 0.004*"goo" + 0.004*"mayor" + 0.004*"crackstone"'),
 (3,
  '0.009*"garrett" + 0.004*"gates" + 0.004*"murder" + 0.003*"gomez" + 0.003*"best" + 0.003*"week" + 0.003*"session" + 0.003*"morticia" + 0.002*"mom" + 0.002*"home"')]

Here we identified topics using only the nouns and adjectives in the text.
The topics are clearer and easier to tell apart.
- topic [1]: Crackstone and Wednesday (Merlina)
- topic [2]: Wednesday at school
- topic [3]: Wednesday and the darkness
- topic [4]: Wednesday and Goody

## Identifying topics across the whole document

In [239]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=200)
ldana.print_topics()

[(0,
  '0.003*"rest" + 0.003*"murder" + 0.003*"week" + 0.003*"youll" + 0.002*"plan" + 0.002*"brother" + 0.002*"roommate" + 0.002*"luck" + 0.002*"session" + 0.002*"gon"'),
 (1,
  '0.007*"garrett" + 0.007*"gates" + 0.006*"crackstone" + 0.005*"goody" + 0.005*"house" + 0.004*"best" + 0.004*"happy" + 0.004*"mayor" + 0.003*"pilgrim" + 0.003*"walker"'),
 (2,
  '0.015*"hyde" + 0.007*"diary" + 0.007*"kinbott" + 0.005*"dr" + 0.005*"laurel" + 0.005*"gates" + 0.004*"master" + 0.004*"mayor" + 0.003*"course" + 0.003*"job"'),
 (3,
  '0.008*"eugene" + 0.007*"dance" + 0.005*"raven" + 0.005*"sorry" + 0.004*"woods" + 0.004*"hyde" + 0.004*"bianca" + 0.004*"goo" + 0.004*"cup" + 0.003*"look"')]

4 topics about the Netflix series Wednesday (Merlina)
* Topic 0: Classmates make plans
* Topic 1: Crackstone and Goody (characters in the series)
* Topic 2: Secrets in a diary
* Topic 3: Wednesday's dance

In [241]:
# Take a look at which topic each transcript contains
corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

[(0, 'T1C1'),
 (3, 'T1C2'),
 (1, 'T1C3'),
 (3, 'T1C4'),
 (1, 'T1C5'),
 (1, 'T1C6'),
 (2, 'T1C7'),
 (3, 'T1C8')]
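
The list comprehension in the cell above unpacks each document as exactly one `(topic, probability)` pair. That works here because every episode came back with a single topic, but it raises an error as soon as a document is returned with two or more topics. A more tolerant sketch picks the most probable topic instead; the distributions below are hypothetical and `dominant_topic` is our own helper name.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

# Hypothetical per-document topic distributions in the (topic_id, probability) format
doc_topics = [
    [(0, 0.99)],
    [(1, 0.35), (3, 0.64)],
    [(2, 0.50), (3, 0.49)],
]
print([dominant_topic(d) for d in doc_topics])  # [0, 3, 2]
```

Applied to `corpus_transformed`, the same idea would presumably give the assignments shown above while also handling mixed-topic documents.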

4 topics about the Netflix series Wednesday (Merlina)
* Topic 0: Classmates make plans [T1C1]
* Topic 1: Crackstone and Goody (characters in the series) [T1C3, T1C5, T1C6]
* Topic 2: Secrets in a diary [T1C7]
* Topic 3: Wednesday's dance [T1C2, T1C4, T1C8]


Overall, these topics match what happens in the series: Wednesday's dance takes place in
episode 4, which the model assigns to the dance topic.

## Using K-means to identify topics

In [246]:
# Use our bag-of-words dataset (data_dtmna)
# Apply TF-IDF weighting (term frequency times inverse document frequency)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(smooth_idf=False)
data_tfidf = tfidf_transformer.fit_transform(data_dtmna)
# Convert it into a DataFrame
data_ = pd.DataFrame(data_tfidf.toarray(), columns=data_dtmna.columns)
data_.index = data_dtmna.index
data_

Unnamed: 0,abilities,ability,able,abominable,abomination,abominations,academy,accept,accident,accounts,...,youd,youll,young,youthful,yuck,zero,zombies,zone,zurich,ça
T1C1,0.0,0.0,0.0,0.0,0.0,0.0,0.025325,0.032681,0.017969,0.0,...,0.0,0.089843,0.0,0.0,0.0,0.021022,0.0,0.0,0.0,0.032681
T1C2,0.0,0.014324,0.014324,0.0,0.0,0.0,0.0,0.0,0.037669,0.0,...,0.018834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
T1C3,0.024424,0.01318,0.02636,0.0,0.0,0.024424,0.048849,0.0,0.0,0.0,...,0.0,0.01733,0.020274,0.0,0.063038,0.0,0.031519,0.0,0.0,0.0
T1C4,0.0,0.013471,0.013471,0.032215,0.0,0.0,0.0,0.0,0.017712,0.0,...,0.017712,0.0,0.0,0.0,0.0,0.020722,0.0,0.032215,0.0,0.0
T1C5,0.022751,0.012277,0.0,0.0,0.0,0.0,0.0,0.0,0.032285,0.02936,...,0.016143,0.0,0.0,0.02936,0.0,0.0,0.0,0.0,0.0,0.0
T1C6,0.0,0.033648,0.016824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.044243,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
T1C7,0.0,0.0,0.012848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.050681,0.019764,0.0,0.0,0.0,0.0,0.0,0.030725,0.0
T1C8,0.0,0.016513,0.016513,0.0,0.039489,0.030601,0.0,0.0,0.0,0.0,...,0.021712,0.0,0.025401,0.0,0.0,0.025401,0.0,0.0,0.0,0.0
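
With `smooth_idf=False`, scikit-learn computes the inverse document frequency as idf(t) = ln(n / df(t)) + 1, multiplies it by the raw count, and then (with its default `norm='l2'`) scales each row to unit length. The plain-Python sketch below follows those steps on hypothetical counts; it is meant to illustrate the formula, not to replace `TfidfTransformer`.

```python
import math

def tfidf(counts):
    """TF-IDF with idf = ln(n / df) + 1, followed by L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) + 1 if d else 0.0 for d in df]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# Hypothetical counts: 3 documents x 3 terms
counts = [
    [2, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
]
for row in tfidf(counts):
    print([round(v, 3) for v in row])
```

A term that occurs in every document gets idf = 1, so it is down-weighted relative to rarer terms but never zeroed out. That explains why all values in the table above lie between 0 and 1.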


In [250]:
# Use k-means to group the episodes
from sklearn.cluster import KMeans

# Create a k-means model
km = KMeans(n_clusters=4)
# Fit the k-means model
km.fit(data_)

# Get the cluster label of each episode
clusters = km.labels_.tolist()
# Build a DataFrame of episodes with their cluster
episodes = { 'transcript': data_.index, 'cluster': clusters }

frame = pd.DataFrame(episodes, index = [clusters] , columns = ['transcript', 'cluster'])
frame


Unnamed: 0,transcript,cluster
3,T1C1,3
3,T1C2,3
3,T1C3,3
0,T1C4,0
2,T1C5,2
1,T1C6,1
1,T1C7,1
1,T1C8,1


K-means shows that episodes 1, 2, and 3 are grouped together, while episodes 4 and 5 each fall into a cluster of their own and are not closely related to the other episodes. Finally, episodes 6, 7, and 8 are grouped together.
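
The grouping can be read off more easily by collecting the episodes under each label. The sketch below uses the labels copied from the table above. Note that K-means numbers its clusters arbitrarily, and since the model was created without a `random_state`, a rerun may produce different labels or even a different grouping.

```python
from collections import defaultdict

# Labels copied from the table above; the cluster ids themselves carry no meaning
episodes = ["T1C1", "T1C2", "T1C3", "T1C4", "T1C5", "T1C6", "T1C7", "T1C8"]
labels = [3, 3, 3, 0, 2, 1, 1, 1]

groups = defaultdict(list)
for episode, label in zip(episodes, labels):
    groups[label].append(episode)

for label in sorted(groups):
    print(label, groups[label])
```

Passing a fixed `random_state` to `KMeans` is one way to make the assignment reproducible between runs.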

# END OF ASSIGNMENT