# Argentine Election Analysis

## Introduction
In this notebook I analyze a Spanish-language dataset collected during the [Argentine legislative election](https://en.wikipedia.org/wiki/Argentine_legislative_election,_2017) of 2017.
This dataset contains the data of 9 Facebook bots, crawled over a period of 16 days, following 45 sources.

__Note__: If you haven't done it already, go through the set up in the *README* of [this repo](https://github.com/rugantio/nlp_fbtrex/).

### Roadmap
Download dataset -> cast JSON to txt -> tokenization -> normalization -> phrase modeling -> topic mining -> burst the bubble -> word2vec algebra & predictive analysis

## Dataset
The dataset was prepared by the [__Facebook Tracking Exposed__](https://facebook.tracking.exposed/) project and can be retrieved in a convenient JSON format from the specific GitHub [__repo__](https://github.com/tracking-exposed/experiments-data/tree/master/silver).
There are two separate files that we'll try to break down:
* __fbtrex-data-\*.json__ - Contains all impressions relative to single users
* __semantic-entities.json__ - Contains all available metadata regarding posts

The text field of every post is stored in *semantic-entities.json*, while I can use *fbtrex-data-\*.json* to correlate which user has seen this content, thus providing an easy way to investigate the Facebook filter bubble.
Given a working environment, as explained in the *README* of this repo, just go ahead and download the files:

In [3]:
#%%bash
#Download Argentine dataset in a data subdir

import os
from urllib.request import urlretrieve
from urllib.parse import urlparse

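def datacollector(url):
    #Create the data subdir if it does not exist yet
    os.makedirs('data', exist_ok=True)
    #Name the local file after the last component of the URL path
    filename = os.path.join('data', os.path.basename(urlparse(url).path))
    #Download the file and return its local path
    filename, headers = urlretrieve(url, filename)
    return filename
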
#os.system('mkdir data && cd data')
#os.system('wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/fbtrex-data-1.json.zip')
#os.system('wget https://github.com/tracking-exposed/experiments-data/raw/master/silver/semantic-entities.json.zip')

__Note__: These commands are meant to be executed in a bash environment, not in the notebook itself. The operation may fail due to permissions.

Extract the content from the zip archive:

In [None]:
#%%bash
#Extract JSON from zipped archives
 
cd data
unzip fbtrex-data-1.json.zip
unzip semantic-entities.json.zip

__Note__: To try out this notebook I made a shorter version of the JSON. I highly recommend doing the same.

## Data preprocessing


Now that we have the dataset in JSON format, we can use the [JSON Python library](https://docs.python.org/3/library/json.html) to decode its content and store it in a Python variable. The variable type depends on the actual content of the provided file: by [default](https://docs.python.org/3/library/json.html#json-to-py-table) a JSON object is decoded to a dict and an array to a list. A common approach for working with encoded text files is to use the [codecs Python library](https://docs.python.org/3/library/codecs.html):

In [1]:
import codecs
import json

with codecs.open('data/semantic-entities.json',encoding='utf-8') as data_json:    
    data = json.load(data_json)
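
The object-to-dict and array-to-list mapping described above can be checked on a small inline string before touching the real file. The JSON below is a made-up miniature, not actual data from the dataset:

```python
import json

# Hypothetical miniature of the dataset: an array of objects,
# one of which has no "text" field
raw = '[{"id": "a1", "text": "Hola mundo"}, {"id": "b2"}]'

decoded = json.loads(raw)

print(type(decoded).__name__)     # the outer JSON array
print(type(decoded[0]).__name__)  # one JSON object inside it
print('text' in decoded[1])       # the second object lacks "text"
```
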

To print to stdout the content of the parsed JSON file just use [pprint](https://docs.python.org/3/library/pprint.html), the data pretty printer:

In [None]:
import pprint
pprint.pprint(data)

It's useful to check that the decoding was performed correctly before proceeding. The resulting decoded type can be inspected with:

In [5]:
print(type(data))

<class 'list'>


__Note__: If you are using the Spyder IDE you can keep track of variables by simply looking at the variable explorer window.

So the JSON is now a list. How many entities do we have?

In [48]:
print('There are {} total elements to analyze.'.format(len(data)))

There are 437 total elements to analyze.


Let's go deeper. We decoded the JSON to a list, but what kind of list is it? What happened to JSON objects?

In [None]:
for item in data:
    print(type(item))

Of course, *data* is not a simple list, it's a nested list of dictionaries! Let's print the *dict_keys*:

In [None]:
for item in data:
    print(item.keys())

This is interesting: in the provided dataset there are some entities that don't have a *text* field. So let's first take only the elements that have a text field and put them in a new non-nested list:

In [61]:
tex = [item['text'] for item in data if 'text' in item]

This is better. We now have an actual working list. Again, how many entities do we have?

In [62]:
print('There are actually {} text elements to analyze!'.format(len(tex)))

There are actually 429 text elements to analyze!


This is good enough for now, later we can make a deeper analysis, associating each *text* key with its *id* key and its *time* key to correlate which user visualizes which entity and when.  
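
As a rough sketch of that later step, the *text*, *id* and *time* keys could be kept together in one record per entity. The sample entries and their value formats below are invented for illustration; only the key names come from the description above, and the real file should be inspected before relying on them:

```python
# Hypothetical entries: the values and their formats are assumptions
sample = [
    {"id": "a1", "time": "2017-10-01T12:00:00Z", "text": "Primera nota"},
    {"id": "b2", "time": "2017-10-02T08:30:00Z"},   # no text field
    {"id": "c3", "text": "Tercera nota"},           # no time field
]

# Keep only entities with a text field; .get() returns None for missing keys
records = [
    {"id": item.get("id"), "time": item.get("time"), "text": item["text"]}
    for item in sample
    if "text" in item
]

for record in records:
    print(record)
```

Using `.get()` for the optional keys avoids a `KeyError` if some entities turn out to lack an *id* or a *time*.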

It's good practice to have a new txt file for every step in NLP processing. So let's create a new txt file populated with the *text* values of the *tex* list, __one per line__.

Since some of the text values are made of more than one paragraph, we need to substitute linebreaks (newline characters) with a space character. Some caution is needed because some paragraphs are separated by a double linebreak.

In [63]:
#Swap linebreaks with a space
for i in range(len(tex)):
    tex[i] = tex[i].replace('\n\n','\n')
    tex[i] = tex[i].replace('\n',' ')

#Create new txt with text keys (one per line)
with codecs.open('data/text.txt','w',encoding='utf-8') as text:
    for i in range(len(tex)):
        text.write('%s\n' % tex[i])
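
The two chained replacements above handle single and double linebreaks, but three or more consecutive linebreaks would still leave extra spaces. A possible alternative, sketched here with a hypothetical helper name, is to split on any run of whitespace and join the pieces back with single spaces. Note that this also collapses repeated spaces and trims both ends of the string, which may or may not be what you want:

```python
def flatten_whitespace(text):
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, \n, \r\n) and discards empty pieces
    return ' '.join(text.split())

sample = 'Primer parrafo.\n\n\nSegundo parrafo.\r\nTercero.'
print(flatten_whitespace(sample))
```
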

To view the file and check that everything was executed as it should, you don't need another editor:

In [57]:
#Print the first 3000 characters
with codecs.open('data/text.txt',encoding='utf-8') as text:    
    print(text.read(3000))

Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle. Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macri para que lo ayude a sostener el emprendimiento. El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia. Allí, le prometió ayuda. Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana. Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio. Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo. No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamiento de adicciones, con asistentes sociales. Macri me dijo que lo hagamos.Los tiempos de Nación no s

Data preprocessing is over: we now have a txt ready to feed our NLP modules!

## Language processing with SpaCy

Text mining tasks have become incredibly easy thanks to [spaCy](http://alpha.spacy.io/), an NLP Python module which provides:
* Non-destructive tokenization
* Syntax-driven sentence segmentation
* Pre-trained word vectors
* Part-of-speech tagging
* Named entity recognition
* Labelled dependency parsing
* A built-in visualizer 

...and much more, all with just one function!

SpaCy also provides some pre-trained [models](https://alpha.spacy.io/models/) which you can use out of the box to process different languages. SpaCy's core is written in Cython, which compiles to C; according to its own [benchmarks](https://alpha.spacy.io/usage/facts-figures) it was the fastest parser available at the time of writing, and Cython is also what makes [multithreading](https://explosion.ai/blog/multithreading-with-cython) profitable.

Follow the *README* of this repo and install the Spanish language model. Now import the model, and load spaCy's pipeline:

In [13]:
%%time 
import spacy

#Initialize SpaCy's pipeline
nlp = spacy.load('es_core_web_sm')

CPU times: user 821 ms, sys: 354 ms, total: 1.17 s
Wall time: 3.14 s


Now that we have a processing pipeline, we can call the *nlp* object as if it were a function on a string of text. This will produce a [Doc](https://alpha.spacy.io/api/doc) object, a special container that holds all linguistic annotations of the text fed in.

Let's first explore how SpaCy processes a single entity, before diving into the dataset:

In [14]:
%%time
#Snip single line of text
with codecs.open('data/text.txt',encoding='utf-8') as text:
    line_txt = text.readline()

#Standard way of processing text 
doc = nlp(line_txt)

CPU times: user 437 ms, sys: 1.12 s, total: 1.56 s
Wall time: 664 ms


In [15]:
print(doc)

Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle. Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macri para que lo ayude a sostener el emprendimiento. El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia. Allí, le prometió ayuda. Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana. Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio. Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo. No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamiento de adicciones, con asistentes sociales. Macri me dijo que lo hagamos.Los tiempos de Nación no s

Looks exactly the same! But what happened under the hood? Have a look at how [spaCy's pipeline](https://alpha.spacy.io/usage/processing-pipelines) is made:

__Text -> tokenizer -> tagger -> parser -> ner -> Doc__

Text analysis is built from the bottom up. The *tokenizer* creates a *Doc* data structure, breaking the text into tokens and storing their metadata in a tensor. The *tagger* takes these tokens (and their context) and uses the information to predict the part-of-speech tags. The *parser* assigns dependency labels between tokens and segments the text into sentences. The *ner*, named entity recognizer, detects and labels named entities.

### Sentence detection and segmentation
Sentences are automatically extracted from the parsed text:

In [69]:
for i, sent in enumerate(doc.sents):
    print ('Sentence {}:'.format(i + 1),sent,end='\n')

Sentence 1: Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle.
Sentence 2: Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macri para que lo ayude a sostener el emprendimiento. 
Sentence 3: El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia.
Sentence 4: Allí, le prometió ayuda. 
Sentence 5: Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana.
Sentence 6: Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio.
Sentence 7: Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo.
Sentence 8: No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamien

### Part-of-speech (POS) tagging and grammar analysis
Using [Pandas](http://pandas.pydata.org/), the Python data analysis library, we can get a clean table visualization.
- Text: The original word text.
- POS: The simple part-of-speech tag.
- Tag: The detailed part-of-speech tag, with full morphology!
- Dep: Syntactic dependency, i.e. the relation between tokens.

In [17]:
import pandas as pd

token_text = [token.orth_ for token in doc]
token_pos = [token.pos_ for token in doc]
token_tag = [token.tag_ for token in doc]
token_dep = [token.dep_ for token in doc]

pd.DataFrame(list(zip(token_text,token_pos,token_tag,token_dep)), columns=['Text', 'POS','Tag','Dep'])

Unnamed: 0,Text,POS,Tag,Dep
0,Cerró,VERB,VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past...,ROOT
1,el,DET,DET__Definite=Def|Gender=Masc|Number=Sing|Pron...,det
2,comedor,NOUN,NOUN__Gender=Masc|Number=Sing,nsubj
3,cordobés,ADJ,ADJ__Number=Sing,amod
4,al,ADP,ADP__AdpType=Preppron|Gender=Masc|Number=Sing,case
5,que,PRON,PRON__PronType=Rel,obj
6,Macri,PROPN,PROPN___,nsubj
7,le,PRON,PRON__Case=Dat|Number=Sing|Person=3|PronType=Prs,obj
8,había,AUX,AUX__Mood=Ind|Number=Sing|Person=3|Tense=Imp|V...,aux
9,prometido,VERB,VERB__Gender=Masc|Number=Sing|Tense=Past|VerbF...,acl


### Navigating the parse tree
SpaCy uses the terms *head* and *child* to describe the words connected by a single arc in the dependency tree. The term *dep* is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of *.dep* is a hash value. You can get the string value with *.dep\_*.
- Text: The original token text.
- Dep: The syntactic relation connecting child to head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.

In [18]:
token_text = [token.text for token in doc]
token_head_pos = [token.head.pos_ for token in doc]
token_head_text = [token.head.text for token in doc]
token_dep = [token.dep_ for token in doc]
token_children = [[child for child in token.children] for token in doc]
pd.DataFrame(list(zip(token_text,token_dep,token_head_text,token_head_pos,token_children)), columns=['Text','Dep','Head text','Head POS','Children'])


Unnamed: 0,Text,Dep,Head text,Head POS,Children
0,Cerró,ROOT,Cerró,VERB,"[comedor, ytenía, .]"
1,el,det,comedor,NOUN,[]
2,comedor,nsubj,Cerró,VERB,"[el, cordobés]"
3,cordobés,amod,comedor,NOUN,[prometido]
4,al,case,que,PRON,[]
5,que,obj,prometido,VERB,[al]
6,Macri,nsubj,prometido,VERB,[]
7,le,obj,prometido,VERB,[]
8,había,aux,prometido,VERB,[]
9,prometido,acl,cordobés,ADJ,"[que, Macri, le, había, ayuda, Luis]"
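
The head/child relation in the table above is two views of the same information: once every token knows its head, the children follow. The sketch below rebuilds the *Children* column from head positions that were copied by hand from the first ten rows of the table. It is an illustration in plain Python, not spaCy's implementation, and children that lie outside this ten-token excerpt (such as *ytenía* or *ayuda*) do not appear:

```python
from collections import defaultdict

# First ten tokens and the index of each token's head,
# transcribed from the table above (the root points to itself)
tokens = ['Cerró', 'el', 'comedor', 'cordobés', 'al',
          'que', 'Macri', 'le', 'había', 'prometido']
heads = [0, 2, 0, 2, 5, 9, 9, 9, 9, 3]

# Invert the child -> head mapping into head -> children
children = defaultdict(list)
for index, head in enumerate(heads):
    if head != index:          # skip the root's self-reference
        children[head].append(index)

for index, token in enumerate(tokens):
    print(token, [tokens[child] for child in children[index]])
```
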


### Named entity recognition (NER)

In [19]:
for num, ent in enumerate(doc.ents):
    print ('Entity {}:'.format(num + 1),ent,'-', ent.label_,end='\n')

Entity 1: Macri - PER
Entity 2: Luis Almadaes - PER
Entity 3: Desesperado - LOC
Entity 4: Mauricio Macri - PER
Entity 5: Allí - PER
Entity 6: Yo Te Ayudo Amigo del Alma - MISC
Entity 7: Almada - LOC
Entity 8: La Naciónque - LOC
Entity 9: Puse - PER
Entity 10: No llamé al Presidente - MISC
Entity 11: Macri me dijo - MISC
Entity 12: Los tiempos de Nación - MISC
Entity 13: Puse - PER
Entity 14: No pedí nada para mí - MISC
Entity 15: Vino el Presidente - PER
Entity 16: Almada - LOC
Entity 17: Provincia - LOC
Entity 18: Municipalidad - LOC
Entity 19: El proyecto quedó rengo - MISC
Entity 20: Hace - ORG
Entity 21: Carolina Stanley - PER
Entity 22: Presidente - PER
Entity 23: Ahora - MISC
Entity 24: Almada - LOC
Entity 25: Cómo - MISC
Entity 26: No tengo más plata - MISC
Entity 27: Laburo - MISC
Entity 28: El día de la visita presidencial - MISC
Entity 29: Talleres - ORG
Entity 30: Sin embargo - MISC
Entity 31: Almada - LOC
Entity 32: Estado Nacional - LOC
Entity 33: Ayer - LOC
Entity 34: Ell

### Visualization with displaCy
SpaCy has an integrated visualization library that can display the content in two styles: *dep* and *ent*.
The *dep* style shows the dependency between words using arcs, the *ent* style prints out the text with colored NER labels wrapped around words.

The method *.serve()* launches a local web server for visualization, while the method *.render()* generates the markup (SVG for *dep*, HTML for *ent*), which can be displayed directly in a notebook.

__Note__: Style *dep* is not working well in Spanish because *tag* is used instead of *POS* for annotating words, but the *tag* field is much larger than *POS* thus causing overlapping. 

__Note2__: Style *ent* can't be viewed on GitHub, but it renders nicely in Jupyter.

In [None]:
from spacy import displacy
#displacy.serve(doc, style='dep')
options = {'distance':425, 'arrow_spacing':6}
displacy.render(doc,style='dep', jupyter=True, options=options)

In [None]:
#displacy.serve(doc,style='dep')
displacy.render(doc,style='ent', jupyter=True)

### Text normalization: stemming, lemmatization and shape analysis
Let's now move on to single token analysis. *Normalization* is a way of processing text that involves changing the words to make them less unique. We talk about *stemming* when we take a word and remove its ending, producing a new token that is often not in the language dictionary. *Lemmatization* takes inflected words as input and tries to give the root word as output, so it is similar to stemming, but it produces meaningful (actually existing) words. The token *shape* is a character mask of the original (orthographic) token: letters are replaced by *x* or *X* according to their case, so only the capitalization pattern survives, and long runs are truncated (*comedor* becomes *xxxx*).

In [22]:
token_lemma = [token.lemma_ for token in doc]
token_shape = [token.shape_ for token in doc]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,Cerró,cerró,Xxxxx
1,el,el,xx
2,comedor,comedor,xxxx
3,cordobés,cordobés,xxxx
4,al,al,xx
5,que,que,xxx
6,Macri,macri,Xxxxx
7,le,le,xx
8,había,había,xxxx
9,prometido,prometido,xxxx


Too bad: lemmatization is not actually supported by the Spanish language model (the English model, however, supports it well), so the lemma shown here is just the lowercased token. We still get some normalization, as seen from the shape mask applied to every word.
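
To make the shape mask less mysterious, here is a small pure-Python approximation. The function name is made up, and the rules (letters to *x*/*X*, digits to *d*, runs of the same mask character cut after four) reflect my understanding of how spaCy builds the mask; they agree with the rows in the table above, but the result should be checked against *token.shape_* rather than taken as spaCy's own code:

```python
def word_shape(text):
    """Approximate the shape mask of a token (hypothetical helper)."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            mask = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            mask = 'd'
        else:
            mask = char            # punctuation is kept as it is
        if mask == last:
            run += 1
        else:
            run = 0
            last = mask
        if run < 4:                # keep at most four repeats per run
            shape.append(mask)
    return ''.join(shape)

for word in ['Cerró', 'comedor', 'Macri', 'el']:
    print(word, word_shape(word))
```
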

### Token-level entity analysis
The standard way to access entity annotations is the *doc.ents* property, but you can also access token entity annotations using the *token.ent_iob* and *token.ent_type* attributes; *token.ent_iob* indicates whether the token begins an entity, is inside one, or is outside any entity.

IOB Scheme:
- *I* - Token is *inside* an entity.
- *O* - Token is *outside* an entity.
- *B* - Token is the *beginning* of an entity.

In [23]:
token_entity_type = [token.ent_type_ for token in doc]
token_entity_iob = [token.ent_iob_ for token in doc]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)), columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,Cerró,,O
1,el,,O
2,comedor,,O
3,cordobés,,O
4,al,,O
5,que,,O
6,Macri,PER,B
7,le,,O
8,había,,O
9,prometido,,O
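
The IOB scheme is easiest to see on a multi-token entity, which the first ten tokens above do not contain. The sketch below derives the tags from entity spans for a short excerpt of the same text, using the *Mauricio Macri - PER* entity found earlier. The helper name and the span format (start, end exclusive, label) are assumptions made for this example, not spaCy's API:

```python
def iob_tags(n_tokens, spans):
    """Turn (start, end, label) spans, end exclusive, into per-token IOB tags."""
    tags = [('O', '')] * n_tokens
    for start, end, label in spans:
        tags[start] = ('B', label)          # first token of the entity
        for index in range(start + 1, end):
            tags[index] = ('I', label)      # remaining tokens of the entity
    return tags

tokens = ['al', 'presidente', 'Mauricio', 'Macri', 'para']
spans = [(2, 4, 'PER')]                     # "Mauricio Macri"

for token, (iob, label) in zip(tokens, iob_tags(len(tokens), spans)):
    print(token, iob, label)
```
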


### Token-level attributes
Other useful metadata is provided, such as the relative frequency of tokens, and whether or not a token matches any of these categories:
- stop-word
- punctuation
- whitespace
- number
- url

...and many more token [attributes](https://alpha.spacy.io/api/token#attributes)! If you are using the alpha version of spaCy, you can also add [custom attributes](https://explosion.ai/blog/spacy-v2-pipelines-extensions) to tokens.


In [24]:
token_attributes = [(token.orth_,token.prob,token.is_stop,token.is_punct,token.is_space,token.like_num,token.like_url) for token in doc]

df = pd.DataFrame(token_attributes,columns=['text','log_probability','stop?','punctuation?','whitespace?','number?','url?'])

df.loc[:, 'stop?':'url?'] = (df.loc[:, 'stop?':'url?'].applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,url?
0,Cerró,-20.0,,,,,
1,el,-20.0,Yes,,,,
2,comedor,-20.0,,,,,
3,cordobés,-20.0,,,,,
4,al,-20.0,Yes,,,,
5,que,-20.0,Yes,,,,
6,Macri,-20.0,,,,,
7,le,-20.0,Yes,,,,
8,había,-20.0,Yes,,,,
9,prometido,-20.0,,,,,


The relative frequency is not stored in the model (every token gets the same default value of -20.0), but that's not important since we don't intend to rely on it anyway. There are also some problems with stop-words: for example, "trabajo" should not be considered a stop-word, so in the next section we will manually adjust this attribute.

## Text normalization, lemmatization, stop-words removal and sentence segmentation
Now that we have explored all that spaCy can do for us, we can use it to parse our *text.txt* and generate a new *parsed_text.txt* that has the same text, normalized, lemmatized, deprived of stop-words and segmented in sentences.

We first define a helper function that constructs a generator to loop over *text.txt* and yield the reviews (our posts) one by one. A generator is a kind of iterator that can be consumed only once, because its items are produced on the fly rather than stored in memory, which keeps memory usage low.
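
The single-use behaviour is worth seeing once, because it is an easy source of silently empty results. In this sketch an in-memory `io.StringIO` stands in for the real file, and the function name is made up for the example:

```python
import io

def read_lines(handle):
    # Yield one line at a time instead of building a full list
    for line in handle:
        yield line.rstrip('\n')

# In-memory stand-in for a text file with one entry per line
fake_file = io.StringIO('primera\nsegunda\ntercera\n')

lines = read_lines(fake_file)
first_pass = list(lines)    # consumes the generator
second_pass = list(lines)   # nothing is left to yield

print(first_pass)
print(second_pass)
```

This is why the helper below is called again, with the filename, every time the corpus needs to be read from the start.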

Then we pass the reviews to spaCy through a second generator function that parses them, lemmatizes the text and yields segmented sentences. The standard way to run spaCy would be to call *nlp(text)* on each review, but I will use the *.pipe()* method instead, which allows efficient [multi-threading](https://spacy.io/docs/usage/processing-text#multithreading). Two [arguments](https://alpha.spacy.io/api/language#pipe) are given to *.pipe()*: *batch_size*, the number of reviews to buffer, and *n_threads*, the number of worker threads to use (default is 2; if -1, OpenMP will decide how many to use at run time). You can also pass a *disable* option to turn off pipeline components that are not needed, to further optimize the processing. Note that all processing algorithms are linear-time in the length of the string.

Luckily for us, spaCy makes it really easy to modify the pipeline. As anticipated in the previous section, we are going to add some custom stop-words that are missing from our vocabulary and remove some others. Some tokens that should be considered stop-words, such as "y" and "a", are not identified as such and, vice versa, words such as "trabajo" should not be classified as stop-words. To fix this, we list all the stop-words present in our model and manually unflag the ones that are not supposed to be stop-words; new stop-words are added only after they show up in our topic models, at which point we come back and flag them. Another thing we add, before the text reaches spaCy, is a custom normalization that replaces accented characters such as *è* and *é* with a plain *e*, because some people use accents and some don't, and in topic modeling we don't want two separate entries for *macrì* and *macri*. We also fix some punctuation, although it will be removed anyway by the token filter in the lemmatization step.
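
The accent normalization described above can also be written as a single translation table instead of a chain of replacements. The sketch below is one possible formulation, with made-up names. It maps the same vowels (uppercase accented vowels become lowercase plain ones, as in the helper below), leaves *ñ* untouched, deletes *¿* and straightens curly quotes. It first normalizes the text to NFC so that accents typed as separate combining marks are matched too, which is an extra step not present in the original helper:

```python
import unicodedata

ACCENTED = 'òóÒÓìíÌÍàáÀÁùúÙÚèéÈÉ' + '“”'
PLAIN = 'oooo' + 'iiii' + 'aaaa' + 'uuuu' + 'eeee' + '""'

# Third argument: characters to delete altogether
ACCENT_TABLE = str.maketrans(ACCENTED, PLAIN, '¿')

def simplify(text):
    # NFC merges "letter + combining accent" into one precomposed character
    return unicodedata.normalize('NFC', text).translate(ACCENT_TABLE)

print(simplify('¿Cerró el comedor cordobés?'))
print(simplify('La ministra señaló “el proyecto”'))
```

A table is applied in one pass over the string, whereas each `replace` call scans the whole text again.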

Finally, we write the sentences to a new txt file, *parsed_text.txt*.

In [72]:
%%time

#Helper function that yields all reviews via generator 
def get_review(filename):
    with codecs.open(filename,encoding='utf-8') as textfile:
        for review in textfile:
            review = review.replace('ò','o')
            review = review.replace('ó','o')
            review = review.replace('Ó','o')
            review = review.replace('Ò','o')
            review = review.replace('í','i')
            review = review.replace('ì','i')
            review = review.replace('Ì','i')            
            review = review.replace('Í','i')            
            review = review.replace('à','a')
            review = review.replace('á','a')
            review = review.replace('À','a')
            review = review.replace('Á','a')
            review = review.replace('ù','u')
            review = review.replace('Ù','u')
            review = review.replace('Ú','u')
            review = review.replace('ú','u')
            review = review.replace('è','e')
            review = review.replace('é','e')
            review = review.replace('È','e')
            review = review.replace('É','e')
            review = review.replace('¿','')
            review = review.replace('“','\"')
            review = review.replace('”','\"')            
            yield review
            
#Add and remove custom stop words
nlp.vocab["A"].is_stop = True
nlp.vocab["a"].is_stop = True
nlp.vocab["Y"].is_stop = True
nlp.vocab["y"].is_stop = True
nlp.vocab["o"].is_stop = True
nlp.vocab["O"].is_stop = True
nlp.vocab["the"].is_stop = True
nlp.vocab["The"].is_stop = True
nlp.vocab["e"].is_stop = True
nlp.vocab["E"].is_stop = True
nlp.vocab["ciento"].is_stop = True
nlp.vocab["año"].is_stop = True
nlp.vocab["años"].is_stop = True
nlp.vocab["trabajo"].is_stop = False
nlp.vocab["Trabajo"].is_stop = False
nlp.vocab["Trabajar"].is_stop = False
nlp.vocab["trabajan"].is_stop = False
nlp.vocab["Trabaja"].is_stop = False
nlp.vocab["trabaja"].is_stop = False
nlp.vocab["tiempo"].is_stop = False
nlp.vocab["Tiempo"].is_stop = False
nlp.vocab["Respecto"].is_stop = False
nlp.vocab["respecto"].is_stop = False
nlp.vocab["primero"].is_stop = False
nlp.vocab["primera"].is_stop = False
nlp.vocab["PRIMERO"].is_stop = False
nlp.vocab["primeros"].is_stop = False
nlp.vocab["primer"].is_stop = False
nlp.vocab["Primero"].is_stop = False
nlp.vocab["Primera"].is_stop = False
nlp.vocab["Momento"].is_stop = False
nlp.vocab["momento"].is_stop = False
nlp.vocab["MOMENTO"].is_stop = False
nlp.vocab["Estado"].is_stop = False
nlp.vocab["estado"].is_stop = False
nlp.vocab["Estados"].is_stop = False
nlp.vocab["grandes"].is_stop = False
nlp.vocab["diferente"].is_stop = False
nlp.vocab["diferentes"].is_stop = False
nlp.vocab["realizar"].is_stop = False
nlp.vocab["realizado"].is_stop = False
nlp.vocab["REALIZAR"].is_stop = False
nlp.vocab["proximo"].is_stop = False
nlp.vocab["empleo"].is_stop = False
nlp.vocab["Empleo"].is_stop = False
nlp.vocab["acuerdo"].is_stop = False
nlp.vocab["pasado"].is_stop = False
nlp.vocab["pasada"].is_stop = False
nlp.vocab["Van"].is_stop = False
nlp.vocab["finally"].is_stop = False
nlp.vocab["General"].is_stop = False
nlp.vocab["general"].is_stop = False
nlp.vocab["Asi"].is_stop = False
nlp.vocab["misma"].is_stop = False
nlp.vocab["mismo"].is_stop = False
nlp.vocab["mismas"].is_stop = False
nlp.vocab["mismos"].is_stop = False
nlp.vocab["nuevo"].is_stop = False
nlp.vocab["nuevos"].is_stop = False
nlp.vocab["Nuevo"].is_stop = False
nlp.vocab["NUEVO"].is_stop = False
nlp.vocab["nuevas"].is_stop = False
nlp.vocab["Nueva"].is_stop = False
nlp.vocab["nueva"].is_stop = False
nlp.vocab["igual"].is_stop = False
nlp.vocab["Igual"].is_stop = False
nlp.vocab["Debido"].is_stop = False
nlp.vocab["debido"].is_stop = False
nlp.vocab["ejemplo"].is_stop = False
nlp.vocab["verdad"].is_stop = False
nlp.vocab["Verdad"].is_stop = False
nlp.vocab["valor"].is_stop = False
nlp.vocab["Valor"].is_stop = False
nlp.vocab["VALOR"].is_stop = False
nlp.vocab["HASTA"].is_stop = False
nlp.vocab["hasta"].is_stop = False
nlp.vocab["Hasta"].is_stop = False
nlp.vocab["Buenos"].is_stop = False
nlp.vocab["buenos"].is_stop = False
nlp.vocab["BUENOS"].is_stop = False
nlp.vocab["medio"].is_stop = False
nlp.vocab["Medio"].is_stop = False
nlp.vocab["lugar"].is_stop = False
nlp.vocab["mejor"].is_stop = False
nlp.vocab["buena"].is_stop = False
nlp.vocab["BUENA"].is_stop = False
nlp.vocab["Bueno"].is_stop = False
nlp.vocab["bueno"].is_stop = False
nlp.vocab["luego"].is_stop = False
nlp.vocab["Luego"].is_stop = False
nlp.vocab["mal"].is_stop = False
nlp.vocab["poco"].is_stop = False
nlp.vocab["Poco"].is_stop = False
nlp.vocab["Pocos"].is_stop = False
nlp.vocab["pocos"].is_stop = False
nlp.vocab["embargo"].is_stop = False
nlp.vocab["verdadero"].is_stop = False
nlp.vocab["verdadera"].is_stop = False
nlp.vocab["posible"].is_stop = False
nlp.vocab["intento"].is_stop = False

#List current stop words
stop_words = []
for parsed_review in nlp.pipe(get_review("data/text.txt"),batch_size=10, n_threads=3):
    for sent in parsed_review.sents:
        for token in sent:
            if token.is_stop:
                stop_words.append(token.orth_)
print("There are {} total stop words.".format(len(stop_words)))
stop_set = set(stop_words)
print("There are {} unique stop words.".format(len(stop_set)))
print(stop_set)

There are 66027 total stop words.
There are 605 unique stop words.
{'aquel', 'Cuales', 'sean', 'sobre', 'atras', 'Tres', 'sido', 'Se', 'Ademas', 'alguno', 'Ustedes', 'principalmente', 'le', 'podriamos', 'Del', 'van', 'Me', 'como', 'tampoco', 'partir', 'detras', 'Alguna', 'tener', 'asi', 'Entre', 'Nosotros', 'Tuvo', 'dia', 'hacen', 'sola', 'pudo', 'gran', 'Cuando', 'Uno', 'Tal', 'estas', 'haya', 'algo', 'ustedes', 'propio', 'el', 'nunca', 'ante', 'me', 'trata', 'mio', 'a', 'porque', 'tenemos', 'Sin', 'trabajar', 'Algunos', 'sea', 'una', 'Esas', 'Ella', 'Debe', 'Segun', 'Muchos', 'No', 'existe', 'vamos', 'bien', 'mediante', 'mia', 'Durante', 'Ambos', 'De', 'Ninguno', 'Actualmente', 'Ello', 'Fuera', 'Sean', 'con', 'mi', 'cuales', 'ha', 'NO', 'Fin', 'Tengo', 'AuN', 'cuando', 'Lo', 'Gran', 'las', 'eras', 'intenta', 'esos', 'al', 'hay', 'haber', 'Estoy', 'hablan', 'nosotros', 'fueron', 'alrededor', 'Su', 'otros', 'QUE', 'Estan', 'aquellas', 'La', 'en', 'ALGUNA', 'podria', 'aqui', 'Esa', 'Tod

We are now ready to lemmatize our corpus!

In [73]:
%%time
#Generator function to parse reviews, lemmatize the text, and yield sentences
#WARNING: This task is computationally demanding, adjust batch_size and n_threads according to your machine
def lemmatize_corpus(filename):
    for parsed_review in nlp.pipe(get_review(filename),batch_size=10, n_threads=3):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent if not (token.is_punct or token.is_space or token.is_stop or token.like_num or token.like_email or token.like_url)])

with codecs.open('data/parsed_text.txt','w',encoding='utf-8') as f:
    for sent in lemmatize_corpus('data/text.txt'):
        f.write(sent + '\n')

CPU times: user 37.3 s, sys: 41.7 s, total: 1min 19s
Wall time: 22.7 s


Let's see an excerpt of what we got out of it:

In [74]:
with codecs.open('data/parsed_text.txt',encoding='utf-8') as parsed_text:    
    print(parsed_text.read(1000))

cerro comedor cordobes macri prometido ayuda luis almadaes cordobes ytenia comedor comunitario fundacion ayudar gente situacion calle
desesperado mayo envio mensaje redes sociales presidente mauricio macri ayude sostener emprendimiento
mandatario llamo telefono julio vieron personalmente provincia
prometio ayuda
llego alternativa bajar persiana
fundacion ayudo amigo alma ayudaba personas situacion calle vendieran golosinas peatonal pudieran capacitarse oficio
almada nacionque debio cerrar fundacion puse esfuerzo cumpli comprometi
llame presidente pedirle alfajores ayudara tratamiento adicciones asistentes sociales
macri hagamos
tiempos nacion mismos gente
pateando primero meses semana pasada meses -continuo-
puse plata bolsillo
pedi gente
vino presidente laburo tendrian
almada requisitos requisitos piden salon sale $ aparte pagando provincia alojamiento muchachos municipalidad permiso venta callejera
proyecto quedo rengo conto
hablo telefonicamente ministra carolina stanley señalo envi

## Phrase Modeling with Gensim


__Phrase modeling__ is a form of text manipulation that consists in producing new one-word tokens from two or more tokens. As we saw in named entity recognition, there are groups of words whose meaning is not simply the sum of the single words that make up the group. For example, *New York* means something different from *New* and *York*. We would like to have these words joined together in a single token, with an underscore instead of a space. We then repeat the process (__second-order phrase modeling__) to catch three-word tokens such as *New_York_City*.

We will use [__gensim__](https://radimrehurek.com/gensim/index.html), an incredible Python library that implements several unsupervised machine learning algorithms designed for text analysis and also some useful text manipulation classes. To accomplish phrase modeling, gensim provides automatic common phrase detection (multiword expressions) from a stream of sentences. The phrases are identified as __collocations__ (frequently co-occurring tokens). In the built-in [*gensim.models.phrases.Phrases*](https://radimrehurek.com/gensim/models/phrases.html) class there are actually two ways ([formulas](https://radimrehurek.com/gensim/models/phrases.html#id2)) for measuring the co-occurrence of these composite words in the corpus, meaning the __frequency__ these words appear __together__ in sequence, compared to the frequency they appear __alone__. We will use the *default* mode, with a *threshold* set to *170*, just enough to catch *mauricio_macri* as a phrase.
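
To get a feeling for what the threshold means, here is the default scoring formula as I understand it from the gensim documentation: the pair count, reduced by *min_count*, is multiplied by the vocabulary size and divided by the product of the individual word counts. The function name and all the counts below are hypothetical, chosen only to show how a tightly bound pair lands above 170 while a loose one stays far below:

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Default collocation score: (ab - min_count) * N / (a * b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: two words that almost always appear together...
tight_pair = phrase_score(count_a=40, count_b=50, count_ab=40, vocab_size=10000)

# ...and two common words that co-occur just as often in absolute terms
loose_pair = phrase_score(count_a=400, count_b=500, count_ab=40, vocab_size=10000)

threshold = 170
print(tight_pair, tight_pair > threshold)
print(loose_pair, loose_pair > threshold)
```

The same number of co-occurrences gives a very different score depending on how often each word also appears on its own, which is exactly the "together versus alone" comparison described above.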

Gensim's [gensim.models.word2vec.LineSentence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It streams the sentences from the disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
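
The streaming idea can be sketched in a few lines of plain Python. The class below is a hypothetical stand-in, not gensim's implementation: it assumes one sentence per line with whitespace-separated tokens, and reopens the file on every iteration so that the corpus can be traversed more than once (once for training, once for writing the output).

```python
import os
import tempfile

class StreamingSentences:
    """Re-iterable stream of whitespace-tokenized lines (hypothetical helper)."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass and read line by line,
        # so the whole corpus is never held in memory
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()

# Tiny demo file standing in for the parsed text file
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8')
tmp.write('presidente mauricio macri\nprometio ayuda\n')
tmp.close()

sents = StreamingSentences(tmp.name)
first_pass = list(sents)
second_pass = list(sents)  # a plain generator would already be exhausted here
os.remove(tmp.name)
print(first_pass)
```

Being re-iterable matters in the cell that follows, where the same sentence stream is consumed twice.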

Finally, we write the sentences to a new txt file, *bigram_sents.txt*, and we snip the head to check the result.

In [75]:
%%time
import codecs
import warnings
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

#Suppress useless warning
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    
    #Creates iterable of sentences
    unigram_sents = LineSentence('data/parsed_text.txt')

    #Initialize the model with our dataset
    bigram_model = Phrases(unigram_sents, threshold=170)

    #Save and load trained model in data directory (optional)
    bigram_model.save('data/bigram_model')
    bigram_model = Phrases.load('data/bigram_model')

    #Write processed sentences to the new file 
    with codecs.open('data/bigram_sents.txt','w',encoding='utf-8') as f:
        for sent in unigram_sents:
            bigram_sent = u' '.join(bigram_model[sent])
            f.write(bigram_sent + '\n')

    #Print the first 3000 characters
    with codecs.open('data/bigram_sents.txt',encoding='utf-8') as f:    
        print(f.read(3000))

cerro comedor cordobes macri prometido ayuda luis almadaes cordobes ytenia comedor comunitario fundacion ayudar gente situacion calle
desesperado mayo envio mensaje redes_sociales presidente mauricio_macri ayude sostener emprendimiento
mandatario llamo telefono julio vieron personalmente provincia
prometio ayuda
llego alternativa bajar persiana
fundacion ayudo amigo alma ayudaba personas situacion calle vendieran golosinas peatonal pudieran capacitarse oficio
almada nacionque debio cerrar fundacion puse esfuerzo cumpli comprometi
llame presidente pedirle alfajores ayudara tratamiento adicciones asistentes sociales
macri hagamos
tiempos nacion mismos gente
pateando primero meses semana pasada meses -continuo-
puse plata bolsillo
pedi gente
vino presidente laburo tendrian
almada requisitos requisitos piden salon sale $ aparte pagando provincia alojamiento muchachos municipalidad permiso venta callejera
proyecto quedo rengo conto
hablo telefonicamente ministra carolina stanley señalo envi

We can see that gensim has picked up some composite __phrases__ that we __expected__, such as *mauricio_macri*, *policia_federal* and *codigo_penal*, and some __unexpected__ ones, such as *acusado_cometer* and *universitario_anibal*. Proper lemmatization would make this process far more accurate: without it, inflected forms of the same word count as separate tokens, and the co-occurrence model produces more __false positives__. Still, setting the threshold to *170* has somewhat mitigated these spurious matches, producing a usable text.

We now repeat this step once more to join three-word tokens (second-order phrase modeling). For example, if the first pass has joined *new* and *york* into *new_york*, we would expect a second pass to join *new_york* and *city* into *new_york_city*.
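
The two-pass mechanism can be illustrated with a toy merger. This is only a sketch: the phrase sets are given by hand here, whereas gensim learns them from co-occurrence scores, and `merge_pairs` is a hypothetical helper.

```python
def merge_pairs(tokens, phrases):
    """Join adjacent token pairs listed in `phrases` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sentence = ['visito', 'new', 'york', 'city', 'ayer']

# First pass: only pairs of single words can be joined
first_pass = merge_pairs(sentence, {('new', 'york')})

# Second pass: an already joined token can be joined with its neighbour
second_pass = merge_pairs(first_pass, {('new_york', 'city')})

print(first_pass)
print(second_pass)
```

Each pass can only join two adjacent tokens, which is why a second model trained on the output of the first is needed to obtain three-word tokens.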

In [76]:
%%time
#Suppress useless warning
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    
    #Creates iterable of sentences from our two-word token dataset
    bigram_sents = LineSentence('data/bigram_sents.txt')

    #Initialize the model with our dataset
    trigram_model = Phrases(bigram_sents, threshold=400)

    #Save and load trained model in data directory (optional)
    trigram_model.save('data/trigram_model')
    trigram_model = Phrases.load('data/trigram_model')
    
    #Write processed sentences to the new file
    with codecs.open('data/trigram_sents.txt','w',encoding='utf-8') as f:
        for sent in bigram_sents:
            trigram_sentence = u' '.join(trigram_model[sent])
            f.write(trigram_sentence + '\n')

CPU times: user 324 ms, sys: 3.05 ms, total: 327 ms
Wall time: 323 ms


This is actually really useful in our case, because in Argentina people usually have either a middle name or two last names, and in these cases we are left with a one-word token! For example you see appearing in our corpus words such as *maría_eugenia_vidal* and *alejandra_gils_carbó*. Notice that I chose a higher threshold, *400*, to reduce the number of false positives.

Our corpus is now ready for topic modeling, but before proceeding let's see how the text has changed. Let's print the same text before and after processing:

In [77]:
print("Before:")
with codecs.open('data/text.txt',encoding='utf-8') as f:
    print(f.read(845))
print("________________________________________________\nAfter:")
with codecs.open('data/trigram_sents.txt',encoding='utf-8') as f:
    print(f.read(1000))

Before:
Cerró el comedor cordobés al que Macri le había prometido ayuda Luis Almadaes cordobés ytenía un comedor comunitario y una fundación para ayudar a gente en situación de calle. Desesperado, en mayo le envió un mensaje por redes sociales al presidente Mauricio Macri para que lo ayude a sostener el emprendimiento. El mandatario lo llamó por teléfono días después y el 12 de julio se vieron personalmente en esa provincia. Allí, le prometió ayuda. Pero nunca llegó y por eso no tuvo más alternativa que bajar la persiana. Su fundación "Yo Te Ayudo Amigo del Alma" ayudaba a unas 20 personas en situación de calle para que vendieran golosinas en la peatonal y pudieran capacitarse en un oficio. Almada dijo a La Naciónque debió cerrar la fundación: "Puse mi esfuerzo,cumplí con lo que me comprometí pero solo no puedo. No llamé al Presidente para pedirle alfajores, sino para que nos ayudara con tratamiento de adicciones, con asistentes sociales. Macri me dijo que lo hagamos.Los tiempos de Nac

## Training a dictionary with Gensim
Now that our text is ready to go we can use it to build a gensim [*dictionary*](https://radimrehurek.com/gensim/corpora/dictionary.html), which, in gensim's jargon, is a mapping between *words* and their integer *ids*. Dictionaries created from a corpus can later be pruned according to document frequency (removing too common or too uncommon words), saved to and loaded from disk (via the *Dictionary.save()* and *Dictionary.load()* methods), merged with other dictionaries (*Dictionary.merge_with()*) etc.

As in a Python dict, the keys of a gensim dictionary are unique, so each word appears exactly once. Some words are of no interest for topic modeling, namely those that are too common or too rare. We can remove them from our dictionary via the *Dictionary.filter_extremes()* method. After tokens have been removed, there are gaps in the id series; calling the *Dictionary.compactify()* method removes these gaps and reassigns the integer ids.
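
The pruning logic can be sketched in plain Python on a toy corpus. This is an illustration under simplifying assumptions, not gensim's code: the helper is hypothetical, and the new ids are assigned in alphabetical order purely for readability.

```python
from collections import Counter

docs = [['macri', 'gobierno', 'inflacion'],
        ['macri', 'gobierno'],
        ['macri', 'huevazos', 'pampa'],
        ['macri', 'gobierno', 'pampa']]

# Document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

def prune_extremes(doc_freq, num_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents
    and in at most a fraction `no_above` of all documents."""
    return {t for t, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs}

kept = prune_extremes(doc_freq, len(docs), no_below=2, no_above=0.8)

# Reassign contiguous ids so that no gaps are left (what compactify is for)
token2id = {token: new_id for new_id, token in enumerate(sorted(kept))}
print(token2id)
```

Here *macri* is dropped because it appears in every document, while *inflacion* and *huevazos* are dropped because they appear in only one.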

Then we call the *doc2bow()* function to parse our documents and yield a bag-of-words representation. In this "casting" the sequential order of the words is lost, but the number of occurrences of each word in the document is stored in a [vector](https://radimrehurek.com/gensim/tut1.html). We pass the bag-of-words corpus to the [*MmCorpus.serialize()*](https://radimrehurek.com/gensim/corpora/mmcorpus.html) function, which iterates through the document stream and saves the bag-of-words representation to disk in the simple [Matrix Market](http://math.nist.gov/MatrixMarket/formats.html) format. Gensim also supports [other formats](https://radimrehurek.com/gensim/tut1.html) such as [Joachims' *SVMlight* format](http://svmlight.joachims.org/), [Blei's LDA-C](http://www.cs.columbia.edu/~blei/lda-c/) format and [GibbsLDA++](http://gibbslda.sourceforge.net/) format.
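
As a rough illustration of what a bag-of-words vector looks like, the sketch below converts a token list into sorted `(token_id, count)` pairs. The toy dictionary and the `to_bow` helper are hypothetical; tokens missing from the dictionary are simply skipped.

```python
from collections import Counter

# Hypothetical toy dictionary mapping tokens to integer ids
token2id = {'ayuda': 0, 'macri': 1, 'presidente': 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# 'prometio' is not in the dictionary, so it does not appear in the vector
bow = to_bow(['macri', 'prometio', 'ayuda', 'macri'], token2id)
print(bow)
```

Only the tokens that actually occur are stored, which is what makes the representation sparse.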

We then load the matrix into a variable by calling *MmCorpus()*; we will use this variable for topic modeling.

In [78]:
%%time
from gensim.corpora import Dictionary, MmCorpus

#Set up sentence streaming
trigram_reviews = LineSentence('data/trigram_sents.txt')

#Learn the dictionary by iterating over all of the reviews
trigram_dictionary = Dictionary(trigram_reviews)

#Get infos about our dict before filtering
print("Before filtering:")
print(trigram_dictionary)

#Filter out words that appear in fewer than 6 documents or in more than 80% of them
trigram_dictionary.filter_extremes(no_below=6, no_above=0.8)

#Get infos about our dict after filtering
print("\nAfter filtering")
print(trigram_dictionary,'\n')

#Print tokens after filtering
#print(trigram_dictionary.token2id)

#Generate new ids 
trigram_dictionary.compactify()
   
#Save and load the finished dictionary in the data directory (optional)
trigram_dictionary.save('data/trigram_dict.dict')
trigram_dictionary = Dictionary.load('data/trigram_dict.dict')

#Read the reviews and generate bag-of-words representation
corpus = [trigram_dictionary.doc2bow(review) for review in trigram_reviews]

#Print review vectors
#print(corpus)

#Save the bow corpus as a matrix
MmCorpus.serialize('data/trigram_bow_corpus_all.mm',corpus)


#Load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus('data/trigram_bow_corpus_all.mm')

Before filtering:
Dictionary(14224 unique tokens: ['cerro', 'comedor', 'cordobes', 'macri', 'prometido']...)

After filtering
Dictionary(2086 unique tokens: ['cerro', 'cordobes', 'macri', 'ayuda', 'luis']...) 

CPU times: user 190 ms, sys: 7.49 ms, total: 198 ms
Wall time: 194 ms


## Topic modeling with Gensim
What is a topic model? Why would you want to create one? Gensim's creator, Radim Rehurek, gives two reasonable answers:
- To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
- To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).

As a matter of fact, the main problem with topic modeling (and other NLP tasks) is that we represent documents as vectors in a space of tokens, and since the __dimension__ of the document vectors is the number of tokens in the corpus vocabulary, it ends up being very __large__. Furthermore, every document contains only a small fraction of all the tokens in the vocabulary, so the vectors also tend to be very __sparse__. What we can do is build a new conceptual layer into our model. Instead of using tokens directly in documents, we describe everything in terms of topics: documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary.

Gensim provides different algorithms for training a topic model. Now that we have created a corpus of documents represented as a stream of vectors, we can apply different transformations to it. We are going to try all the transformations available in gensim, starting from the simpler ones and building on them, because [transformations](https://radimrehurek.com/gensim/tut2.html) can be stacked. In order, we will train on our corpus using:
- [__Tf-idf__](https://radimrehurek.com/gensim/models/tfidfmodel.html) (Term frequency - inverse document frequency)
- [__LSI__](https://radimrehurek.com/gensim/models/lsimodel.html) (Latent Semantic Indexing) aka LSA
- [__RP__](https://radimrehurek.com/gensim/models/rpmodel.html) (Random Projections)
- [__HDP__](https://radimrehurek.com/gensim/models/hdpmodel.html) (Hierarchical Dirichlet Process)
- [__LDA__](https://radimrehurek.com/gensim/models/ldamodel.html) (Latent Dirichlet Allocation) 

### Tf-idf (Term frequency - inverse document frequency)
[__Tf-idf__](https://radimrehurek.com/gensim/models/tfidfmodel.html) is the simplest transformation we can train. It works much like the bag-of-words we have already built, but this time the computed weights are real numbers:
- __Term frequency__: Counts the number of occurrences (frequency) of each dictionary term in a document.
- __Inverse document frequency__: Introduces a factor that diminishes the weight of terms that occur very frequently in the corpus and increases the weight of terms that occur rarely.

In gensim, transformations are standard Python objects, typically initialized by means of a training corpus. In the case of tf-idf, the "training" consists simply of going through the supplied corpus once and computing the document frequencies of all its tokens, which are then used to increase the weight of rare tokens. In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all.
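
The following sketch reproduces the weighting on a toy corpus in plain Python. It assumes what I understand to be gensim's default settings: raw counts as term frequency, `log2(N / df)` as inverse document frequency, and L2 normalization of each document vector. The corpus and helper names are made up for illustration.

```python
import math

# Toy bag-of-words corpus: lists of (token_id, count) pairs
bow_corpus = [[(0, 1), (1, 2)],
              [(0, 1), (2, 1)],
              [(0, 1), (1, 1)],
              [(2, 1)]]

def train_idf(corpus):
    """One pass over the corpus to collect document frequencies."""
    num_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {t: math.log2(num_docs / df) for t, df in doc_freq.items()}

def transform(doc, idf):
    """Weight counts by idf, drop zero weights, then L2-normalize."""
    weighted = [(t, count * idf[t]) for t, count in doc if idf.get(t, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else []

idf = train_idf(bow_corpus)
tfidf_corpus = [transform(doc, idf) for doc in bow_corpus]
for doc in tfidf_corpus:
    print(doc)
```

With this normalization, a document containing a single known token gets the vector `[(token_id, 1.0)]`, and a document whose tokens were all filtered out of the dictionary becomes an empty list.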

In [96]:
%%time
from gensim.models.tfidfmodel import TfidfModel

#Count the frequencies of all tokens in the corpus (training)
tfidf = TfidfModel(corpus)

#Save and load trained tf-idf model
tfidf.save('data/trainedcorpus.tfidf_model')
tfidf = TfidfModel.load('data/trainedcorpus.tfidf_model')

#Transform our corpus using trained corpus
print(tfidf)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

TfidfModel(num_docs=5108, num_nnz=30359)
[(0, 0.3077443847359691), (1, 0.6282129565459228), (2, 0.16561225255246909), (3, 0.29257167976555587), (4, 0.27471943748611183), (5, 0.2810815310231041), (6, 0.29257167976555587), (7, 0.19349683801598408), (8, 0.22440388425543967), (9, 0.26408777913420967)]
[(10, 0.4412757865501782), (11, 0.4412757865501782), (12, 0.39968230791471643), (13, 0.3468072630822345), (14, 0.22190432053216497), (15, 0.282716710350837), (16, 0.44873140032761016)]
[(17, 0.4374782747436849), (18, 0.47325106924651916), (19, 0.46478422050291057), (20, 0.47782504171992235), (21, 0.37457314701026206)]
[(3, 1.0)]
[(22, 0.5129639196152067), (23, 0.6796503188052978), (24, 0.524350513798605)]
[(5, 0.4330827407076205), (8, 0.3457553716355454), (9, 0.40689923225667657), (25, 0.4507864478505497), (26, 0.299822300101647), (27, 0.48396667681908684)]
[(5, 0.4749344849165326), (28, 0.5020218836639422), (29, 0.5199858571942918), (30, 0.5020218836639422)]
[(14, 0.27534510607727863), (31, 

[(39, 0.231418009512929), (119, 0.30673341593973463), (212, 0.3352612748200347), (275, 0.2818616558658503), (682, 0.29799600499938356), (710, 0.30673341593973463), (874, 0.313462489395698), (1406, 0.34092571111083814), (2022, 0.3627244965351748), (2023, 0.3627244965351748)]
[]
[(12, 0.3666668122991902), (155, 0.3234981726997916), (316, 0.2918376701639564), (395, 0.28637919929267547), (470, 0.34534112956820556), (632, 0.32537997288634546), (674, 0.3666668122991902), (698, 0.2823205302189518), (1110, 0.39298861809508356)]
[(1193, 0.3972931896207561), (1324, 0.4179133617537487), (1657, 0.4835902823003973), (1658, 0.4469758620752248), (2024, 0.4835902823003973)]
[(1324, 0.41090087810132525), (1657, 0.47547575603861053), (1659, 0.4552015853337273), (1660, 0.4646121065575833), (1661, 0.42662674663557826)]
[(1142, 0.47477675907044553), (1658, 0.6314921129582407), (1661, 0.6130291512792035)]
[(1142, 0.45941161577558615), (1661, 0.5931897624014227), (1662, 0.6611103335091721)]
[(1324, 0.5571114

So we see that *tf-idf* has correctly parsed our corpus, transforming it from a bag-of-words __integer__ count representation into a tf-idf __real-valued__ weighted matrix, increasing the weight of rare tokens.

We can now use this new *corpus_tfidf* to train other topic mining algorithms instead of the simpler bag-of-words representation.

### LSI (Latent Semantic Indexing)
[__LSI__](https://radimrehurek.com/gensim/models/lsimodel.html) transforms our corpus from the tf-idf weighted space into a latent space of lower dimensionality. The latent dimensions are meant to capture hidden connections between words (the topics), and their number is set via the *num_topics* parameter. We also turn the *onepass* parameter off to force the multi-pass stochastic algorithm, and increase *power_iters* and *extra_samples* (the number of power iterations and the oversampling factor, respectively) to improve accuracy.
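
To give an intuition for what a latent dimension is, the sketch below finds the dominant direction of a tiny document-term matrix by plain power iteration on `A^T A`. This is a simplified illustration only: gensim's stochastic multi-pass algorithm is considerably more elaborate, and the matrix values and helper names here are made up.

```python
import math

# Toy document-term matrix (rows: documents, columns: terms); made-up values
A = [[2.0, 2.0, 0.0, 0.0],
     [2.0, 1.5, 0.0, 0.1],
     [0.0, 0.0, 1.0, 0.9],
     [0.1, 0.0, 0.9, 1.0]]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_topic(A, power_iters=100):
    """Approximate the dominant term direction by power iteration on A^T A."""
    At = transpose(A)
    v = [1.0 / (i + 1) for i in range(len(A[0]))]  # arbitrary non-zero start
    for _ in range(power_iters):
        v = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

topic = top_topic(A)           # weight of each term in the first latent dimension
doc_coords = matvec(A, topic)  # coordinate of each document along that dimension
print([round(x, 3) for x in topic])
print([round(x, 3) for x in doc_coords])
```

Terms that tend to occur in the same documents end up with large weights in the same direction, and each document is then described by its coordinate along that direction instead of by its individual tokens.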

In [107]:
%%time
from gensim.models.lsimodel import LsiModel

#Train our corpus with lsi
lsi = LsiModel(corpus_tfidf,id2word=trigram_dictionary,num_topics=15,onepass=False,power_iters=1000,extra_samples=100)

#Save and load the finished LSI model from disk
lsi.save('data/lsi_model')
lsi = LsiModel.load('data/lsi_model')

# Accept a user-supplied topic number and print out a formatted list of the top terms
def explore_topic(topic_number, topn=15):
    print('{:20} {}'.format('term','frequency') + '\n')
    #show_topic returns (term, weight) pairs; LSI weights can be negative
    for term, frequency in lsi.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term,round(frequency, 3)))
print(trigram_dictionary)
#Print the top terms of all 15 topics
for topic_number in range(15):
    explore_topic(topic_number=topic_number)

Dictionary(2086 unique tokens: ['cerro', 'cordobes', 'macri', 'ayuda', 'luis']...)
term                 frequency

presidente           0.324
gobierno             0.277
macri                0.225
mauricio_macri       0.155
inflacion            0.152
argentina            0.148
paso                 0.143
nacional             0.136
pampa                0.133
hasta                0.116
justicia             0.115
cambiemos            0.111
ciudad               0.107
gente                0.100
caso                 0.098
term                 frequency

presidente           -0.528
gobierno             0.371
pampa                -0.280
mauricio_macri       -0.254
macri                -0.249
inflacion            0.199
huevazos             -0.179
hasta                0.136
gente                0.115
santa_rosa           -0.097
trabajo              0.094
nacional             0.094
precios              0.071
argentina            0.068
tiempo               0.066
term                 frequency

gobie

In [81]:
%%time
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
#import cPickle as pickle

#Suppress useless warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    
    # workers => sets the parallelism, and should be set to your number of physical cores minus one
    lda = LdaMulticore(trigram_bow_corpus,num_topics=15,id2word=trigram_dictionary,workers=3, passes=300)

    #Save and load the finished LDA model from disk
    lda.save('data/lda_model_all')
    lda = LdaMulticore.load('data/lda_model_all')

# Accept a user-supplied topic number and print out a formatted list of the top terms
def explore_topic(topic_number, topn=15):
    print('{:20} {}'.format('term','frequency') + '\n')
    #show_topic returns (term, probability) pairs for the given topic
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print('{:20} {:.3f}'.format(term,round(frequency, 3)))
print(trigram_dictionary)
#Print the top terms of all 15 topics
for topic_number in range(15):
    explore_topic(topic_number=topic_number)


Dictionary(2086 unique tokens: ['cerro', 'cordobes', 'macri', 'ayuda', 'luis']...)
term                 frequency

bono                 0.022
santiago_maldonado   0.019
maldonado            0.019
justicia             0.017
nacional             0.016
cordoba              0.016
familia              0.014
estado               0.013
situacion            0.012
pidio                0.009
santiago             0.008
argentina            0.008
hijo                 0.007
banda                0.007
caso                 0.007
term                 frequency

vida                 0.012
mujeres              0.012
misma                0.011
afip                 0.011
obra                 0.011
ley                  0.009
presidente           0.008
peronismo            0.008
judicial             0.007
distintos            0.007
nivel                0.007
joven                0.007
luego                0.007
marco                0.007
lee                  0.007
term                 frequency

inflacion  