## Exercises

### 1. Topic Modeling

The main goal of this exercise is to carry out an **exploratory analysis**, a key stage in any analytics, ML, DL and, of course, NLP problem, of one of the available datasets (tweets or Amazon reviews).

Besides the exploratory analysis, the student is asked to perform **topic modeling**, identifying the main themes that appear in the corpora as well as the tokens that make them up.

Including **descriptive plots** that characterize the corpora used will be highly valued.

Using the Amazon reviews datasets is recommended, though not mandatory, so that this exercise serves as the _prelude_ to exercise 2.
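
As a starting point for the exploratory analysis, the sketch below computes a few descriptive statistics (corpus size, words per review, reviews per category) using only the standard library. The sample reviews and the `describe_reviews` helper are hypothetical and only mirror the `reviewText` and `category` columns of the dataset.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample mirroring the reviewText / category columns; not real data.
sample_reviews = [
    ("Great strings, warm tone and easy to tune", "Musical_Instruments"),
    ("The hose leaks at the connector", "Patio_Lawn_and_Garden"),
    ("Labels print sharp and peel off cleanly", "Office_Products"),
    ("Pedal works fine", "Musical_Instruments"),
]

def describe_reviews(reviews):
    """Return corpus size, a word-count summary and the number of reviews per category."""
    word_counts = [len(text.split()) for text, _ in reviews]
    return {
        "n_reviews": len(reviews),
        "mean_words": mean(word_counts),
        "median_words": median(word_counts),
        "max_words": max(word_counts),
        "per_category": Counter(category for _, category in reviews),
    }

stats = describe_reviews(sample_reviews)
print(stats)
```

The same numbers can feed the descriptive plots, for instance a histogram of words per review or a bar chart of reviews per category drawn with matplotlib.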


In [2]:
import random
import pandas as pd

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from stop_words import get_stop_words 

## Data extraction and processing

In [3]:
data = pd.read_csv('../../Dataset/data.csv')

In [4]:
# Check that the data loaded correctly
len(data)

36004

In [5]:
# Same check: preview the first rows
data.head()

Unnamed: 0,helpful,reviewText,overall,category
0,"[4, 4]",This is a fantastic product that is well made....,1,Musical_Instruments
1,"[2, 3]",I was never able to get this to extend and tha...,1,Patio_Lawn_and_Garden
2,"[0, 0]",It arrived quickly and good packing. but I hav...,5,Automotive
3,"[0, 0]",prints are sharp with great color saturation. ...,5,Office_Products
4,"[7, 7]",I used to buy a lot of wire form these guys......,1,Musical_Instruments


In [6]:
# build a DataFrame with only the review text and drop rows with missing values
# (dropna() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
rtext = data[['reviewText']].dropna()


In [7]:
# check the result
rtext

Unnamed: 0,reviewText
0,This is a fantastic product that is well made....
1,I was never able to get this to extend and tha...
2,It arrived quickly and good packing. but I hav...
3,prints are sharp with great color saturation. ...
4,I used to buy a lot of wire form these guys......
...,...
35999,Found the same 4 pack at Walmart for $2 less. ...
36000,"I've used these since I fist started playing, ..."
36001,I have noticed a difference in the hills in my...
36002,There is a very short screw which goes into ea...


In [8]:
# inspect the first review to see how it is written and how its words are distributed
rtext["reviewText"][0]

"This is a fantastic product that is well made. The plastic is heavy duty and comes with a plastic cover that must be removed before use. This ensures that the wheel you get is in perfect condition.The tool is extremely useful. I have just started playing with a band, and it is very common for band members to call out a song and the key that it is in. With the wheel, I immediately know what chords to choose from. I also know what notes are in each chord. In addition, it has a note by note image of the entire fretboard so that you can see where the roots are for the scale you are playing.The wheel is also invaluable for switching keys. For example, say you're playing the I, IV, and V chords in the key of C., but your lead singer is having trouble and would rather sing it in Bb. All you need to do is dial the wheel to Bb, find out the new I, IV, and V chords, and start playing. The wheel makes chord transposition a breeze.I feel that the negative reviews you are seeing are unfair. This i

In [9]:
# check how many English stopwords the stop_words library provides

len(get_stop_words('en'))

174

In [10]:
# same check with gensim; its list is larger (337 vs 174 stopwords), so we use gensim's
gensim.parsing.preprocessing.STOPWORDS

len(gensim.parsing.preprocessing.STOPWORDS)

337

In [11]:
# preprocess the text: drop stopwords and tokens of 3 characters or fewer, which carry little information
def text_preprocessing(text):
    result=[]
    for word in gensim.utils.simple_preprocess(text) :
        if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3:
            result.append(word)
    return result
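
To make the filtering step easier to follow, here is a minimal standard-library sketch of the same idea. It is an approximation, not gensim's implementation: the regex tokenizer only roughly imitates `simple_preprocess`, and `STOPWORDS_SKETCH` and `preprocess_sketch` are hypothetical names with a deliberately tiny stopword set.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses gensim's STOPWORDS.
STOPWORDS_SKETCH = {"this", "that", "with", "have", "from", "what", "well", "made", "very"}

def preprocess_sketch(text, stopwords=STOPWORDS_SKETCH, min_len=4):
    """Lowercase, keep alphabetic runs, drop stopwords and tokens shorter than min_len."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stopwords and len(tok) >= min_len]

tokens = preprocess_sketch(
    "This is a fantastic product that is well made. The plastic is heavy duty."
)
print(tokens)  # ['fantastic', 'product', 'plastic', 'heavy', 'duty']
```

Note that `min_len=4` matches the `len(word) > 3` condition used in `text_preprocessing`.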

In [12]:
gensim.utils.simple_preprocess(rtext['reviewText'][0])

['this',
 'is',
 'fantastic',
 'product',
 'that',
 'is',
 'well',
 'made',
 'the',
 'plastic',
 'is',
 'heavy',
 'duty',
 'and',
 'comes',
 'with',
 'plastic',
 'cover',
 'that',
 'must',
 'be',
 'removed',
 'before',
 'use',
 'this',
 'ensures',
 'that',
 'the',
 'wheel',
 'you',
 'get',
 'is',
 'in',
 'perfect',
 'condition',
 'the',
 'tool',
 'is',
 'extremely',
 'useful',
 'have',
 'just',
 'started',
 'playing',
 'with',
 'band',
 'and',
 'it',
 'is',
 'very',
 'common',
 'for',
 'band',
 'members',
 'to',
 'call',
 'out',
 'song',
 'and',
 'the',
 'key',
 'that',
 'it',
 'is',
 'in',
 'with',
 'the',
 'wheel',
 'immediately',
 'know',
 'what',
 'chords',
 'to',
 'choose',
 'from',
 'also',
 'know',
 'what',
 'notes',
 'are',
 'in',
 'each',
 'chord',
 'in',
 'addition',
 'it',
 'has',
 'note',
 'by',
 'note',
 'image',
 'of',
 'the',
 'entire',
 'fretboard',
 'so',
 'that',
 'you',
 'can',
 'see',
 'where',
 'the',
 'roots',
 'are',
 'for',
 'the',
 'scale',
 'you',
 'are',
 'pl

In [15]:
# compare the original text with the processed text, and print their lengths (characters for the original, tokens for the processed list)

print('Original text:\n{}\n\n'.format(rtext['reviewText'][0]))

print(len(rtext['reviewText'][0]))

print('Processed text:\n{}'.format(text_preprocessing(rtext['reviewText'][0])))

print(len(text_preprocessing(rtext['reviewText'][0])))

Original text:
This is a fantastic product that is well made. The plastic is heavy duty and comes with a plastic cover that must be removed before use. This ensures that the wheel you get is in perfect condition.The tool is extremely useful. I have just started playing with a band, and it is very common for band members to call out a song and the key that it is in. With the wheel, I immediately know what chords to choose from. I also know what notes are in each chord. In addition, it has a note by note image of the entire fretboard so that you can see where the roots are for the scale you are playing.The wheel is also invaluable for switching keys. For example, say you're playing the I, IV, and V chords in the key of C., but your lead singer is having trouble and would rather sing it in Bb. All you need to do is dial the wheel to Bb, find out the new I, IV, and V chords, and start playing. The wheel makes chord transposition a breeze.I feel that the negative reviews you are seeing are 

In [17]:
# process all the reviews
processed_texts = []
for text in rtext['reviewText']:
    processed_texts.append(text_preprocessing(text))

In [18]:
# print one processed review to check the result
print(processed_texts[10])

['purchased', 'rain', 'bird', 'inch', 'blank', 'tubing', 'feetversion', 'adding', 'cents', 'item', 'mistake', 'purchased', 'place', 'suggest', 'familiar', 'brand', 'check', 'review', 'installing', 'drip', 'micro', 'spray', 'important', 'able', 'connect', 'water', 'purchasing', 'wrong', 'fittings', 'easy', 'hose']
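
Part of exploring the processed corpus is checking which tokens appear in so many reviews (or so few) that they are unlikely to separate topics. The sketch below counts document frequencies and filters on them with the standard library. The `docs` sample and both helper functions are hypothetical; the parameter names `no_below` and `no_above` are borrowed from gensim's `Dictionary.filter_extremes`, which offers a comparable filter, and the thresholds are illustrative only.

```python
from collections import Counter

# Hypothetical tokenized reviews standing in for processed_texts.
docs = [
    ["guitar", "strings", "great", "tone"],
    ["hose", "water", "great", "garden"],
    ["labels", "printer", "great", "paper"],
    ["guitar", "pedal", "tone"],
]

def document_frequency(tokenized_docs):
    """Count in how many documents each token appears (not total occurrences)."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return df

def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least no_below docs and in at most a no_above share of docs."""
    df = document_frequency(tokenized_docs)
    n_docs = len(tokenized_docs)
    keep = {tok for tok, n in df.items() if n >= no_below and n / n_docs <= no_above}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

df = document_frequency(docs)
filtered = filter_by_document_frequency(docs)
print(df["great"], filtered)
```

In this toy sample, "great" is dropped because it appears in three of the four documents, which is the kind of generic word that otherwise shows up in every topic.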


In [19]:
# build a gensim Dictionary that maps each unique token to an integer id, for later use
dictionary = Dictionary(processed_texts) 

In [20]:
list(dictionary.items()) 

[(0, 'addition'),
 (1, 'affiliated'),
 (2, 'appreciates'),
 (3, 'band'),
 (4, 'breeze'),
 (5, 'choose'),
 (6, 'chord'),
 (7, 'chords'),
 (8, 'comes'),
 (9, 'common'),
 (10, 'company'),
 (11, 'condition'),
 (12, 'cover'),
 (13, 'dial'),
 (14, 'duty'),
 (15, 'ensures'),
 (16, 'entire'),
 (17, 'example'),
 (18, 'extremely'),
 (19, 'fair'),
 (20, 'fantastic'),
 (21, 'feel'),
 (22, 'fretboard'),
 (23, 'having'),
 (24, 'heavy'),
 (25, 'image'),
 (26, 'immediately'),
 (27, 'invaluable'),
 (28, 'keys'),
 (29, 'know'),
 (30, 'lead'),
 (31, 'makes'),
 (32, 'member'),
 (33, 'members'),
 (34, 'need'),
 (35, 'negative'),
 (36, 'newbie'),
 (37, 'note'),
 (38, 'notes'),
 (39, 'opinion'),
 (40, 'perfect'),
 (41, 'plastic'),
 (42, 'playing'),
 (43, 'price'),
 (44, 'product'),
 (45, 'removed'),
 (46, 'reviews'),
 (47, 'roots'),
 (48, 'scale'),
 (49, 'seeing'),
 (50, 'sing'),
 (51, 'singer'),
 (52, 'song'),
 (53, 'star'),
 (54, 'start'),
 (55, 'started'),
 (56, 'switching'),
 (57, 'tool'),
 (58, 'transpo

In [21]:
# using the dictionary, build the bag-of-words corpus: one list of (token_id, count) pairs per review
corpus = [dictionary.doc2bow(doc) for doc in processed_texts] 
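
The dictionary and bag-of-words steps can be pictured with a small standard-library sketch. It is not how gensim is implemented; `build_token_ids` and `to_bow` are hypothetical helpers, and assigning ids in alphabetical order within each document is an assumption chosen to resemble the output shown in this notebook.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign each previously unseen token the next free integer id, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the mapping."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Hypothetical tokenized documents.
docs = [["band", "chords", "band", "wheel"], ["hose", "wheel", "water"]]
token2id = build_token_ids(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'band': 0, 'chords': 1, 'wheel': 2, 'hose': 3, 'water': 4}
print(bows)      # [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1)]]
```

Each document becomes a sparse vector: only the tokens it contains are listed, together with how many times each one occurs.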

In [22]:
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 3),
  (4, 1),
  (5, 1),
  (6, 2),
  (7, 3),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 2),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 2),
  (42, 4),
  (43, 1),
  (44, 3),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 2),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 2),
  (62, 5)],
 [(63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 1)],
 [(83, 1),
  (84, 1),
  (85, 2),
  (86, 1),
  (87, 2),
  (88, 1),
  (89, 1),
  (90, 1),
  (91, 

In [23]:
def check_topics(num_topics, corpus, dictionary):
    lda_model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        iterations=5,
        passes=10,
        alpha='auto'
    )
    # log_perplexity returns the per-word log-likelihood bound; perplexity is 2 ** (-bound),
    # so a higher bound (closer to 0) means lower perplexity, which is better.
    perplexity = lda_model.log_perplexity(corpus)

    # Compute the c_v coherence of the model (note: uses the global processed_texts)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_texts, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()

    return lda_model,perplexity, coherence_lda 

    

In [24]:
def view_topics(num_topics, lda_model):
    # Build one column per topic with its top 20 words
    word_dict = {}
    for i in range(num_topics):
        words = lda_model.show_topic(i, topn=20)
        word_dict['Topic #{:02d}'.format(i + 1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [25]:
# let's check whether there are subtopics; to do so I will run 3 trials
prueba,prueba_perplexity,prueba_coherence = check_topics(3, corpus,dictionary)
view_topics(3, prueba)

Unnamed: 0,Topic #01,Topic #02,Topic #03
0,tape,time,like
1,good,trap,guitar
2,like,battery,great
3,great,easy,product
4,quality,like,good
5,paper,works,sound
6,price,good,strings
7,easy,work,feeder
8,nice,great,little
9,product,bought,pedal


In [26]:
print(prueba_perplexity)
print(prueba_coherence)

-8.126174760594212
0.28493425252482535


In [27]:
prueba2,prueba2_perplexity,prueba2_coherence = check_topics(6, corpus,dictionary )
view_topics(6, prueba2)

Unnamed: 0,Topic #01,Topic #02,Topic #03,Topic #04,Topic #05,Topic #06
0,tape,guitar,like,trap,time,battery
1,feeder,sound,great,labels,product,hose
2,product,strings,good,paper,amazon,product
3,garden,great,easy,printer,bought,time
4,plants,good,nice,mouse,filter,works
5,like,like,price,traps,grill,water
6,birds,pedal,quality,print,years,tool
7,water,price,need,bait,good,like
8,deer,string,work,mice,great,clean
9,squirrels,tone,plastic,avery,better,batteries


In [28]:
print(prueba2_perplexity)
print(prueba2_coherence)

-8.10352256015578
0.3902700641511205


In [30]:
prueba3,prueba3_perplexity,prueba3_coherence = check_topics(12, corpus,dictionary )
view_topics(12, prueba3)

Unnamed: 0,Topic #01,Topic #02,Topic #03,Topic #04,Topic #05,Topic #06,Topic #07,Topic #08,Topic #09,Topic #10,Topic #11,Topic #12
0,strap,guitar,small,tape,trap,great,color,labels,handle,battery,water,paper
1,cable,sound,plastic,boxes,feeder,good,filter,printer,grill,unit,clean,stapler
2,install,strings,hold,scotch,mouse,like,pens,paper,cover,power,hose,office
3,change,pedal,like,roll,traps,works,black,print,tool,batteries,product,file
4,cables,tone,easily,packaging,bait,product,colors,avery,wood,trimmer,spray,binder
5,installed,play,place,dispenser,birds,price,write,label,weber,charge,bottle,papers
6,phone,string,large,scissors,mice,time,markers,sheet,heavy,mower,paint,folders
7,light,picks,little,tear,deer,easy,fine,cards,heat,lawn,cleaning,pages
8,engine,playing,size,gift,squirrels,quality,pencil,printing,tools,cord,wash,staples
9,plug,sounds,metal,wrap,plants,better,point,printed,charcoal,grass,glass,ring


In [31]:
print(prueba3_perplexity)
print(prueba3_coherence)

# In trial 3 the per-word bound is -8.71, lower than in trials 1 and 2 (-8.13 and -8.10),
# so perplexity on the training corpus does not improve. Coherence, however, rises to 0.56
# (from 0.28 and 0.39), which supports the presence of subtopics.

-8.712828172397218
0.5617016605128805
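
To compare the three trials side by side, the sketch below converts each reported bound into a perplexity estimate and picks the best run by each metric. It assumes the value returned by `log_perplexity` is a base-2 per-word bound, so perplexity is `2 ** (-bound)`; the `runs` dictionary simply copies the numbers printed above, and `perplexity_from_bound` is a hypothetical helper.

```python
# Values copied from the three trials reported in this notebook.
runs = {
    3: {"bound": -8.126174760594212, "coherence": 0.28493425252482535},
    6: {"bound": -8.10352256015578, "coherence": 0.3902700641511205},
    12: {"bound": -8.712828172397218, "coherence": 0.5617016605128805},
}

def perplexity_from_bound(bound):
    """Convert a per-word log-likelihood bound (assumed base 2) into perplexity."""
    return 2 ** (-bound)

for num_topics, run in runs.items():
    print(num_topics, round(perplexity_from_bound(run["bound"]), 1), round(run["coherence"], 3))

# Higher coherence is better; lower perplexity is better.
best_by_coherence = max(runs, key=lambda k: runs[k]["coherence"])
lowest_perplexity = min(runs, key=lambda k: perplexity_from_bound(runs[k]["bound"]))
print(best_by_coherence, lowest_perplexity)
```

The two metrics disagree here: coherence favors 12 topics while the perplexity estimate favors 6. Since both were measured on the training corpus, they are a rough guide rather than a held-out evaluation.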


In [27]:
# Topic 2, for example, refers to musical instruments, specifically guitars: fender, acoustic, pedals...

In [32]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(prueba3, corpus, dictionary)
vis



In [33]:
pyLDAvis.save_html(vis, '../../Dataset/topics_vis_0.html') 

# we see that topics 3 and 6 are about garden supplies