<figure>
<img src="../Imagenes/logo-final-ap.png"  width="80" height="80" align="left"/> 
</figure>

# <span style="color:blue"><left>Aprendizaje Profundo</left></span>

# <span style="color:red"><center>Introduction to LDA</center></span>

<center>Latent Dirichlet Allocation</center>

##   <span style="color:blue">Authors</span>

1. Alvaro Mauricio Montenegro Díaz, ammontenegrod@unal.edu.co
2. Daniel Mauricio Montenegro Reyes, dextronomo@gmail.com 

## <span style="color:blue">References</span> 

1. Blei et al.,[Latent Dirichlet Allocation, 2003](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
2. [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python, 2018](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24), en Toward data science.

## <span style="color:blue">Content</span>

* [Introduction](#Introducción)
* [Superficial Analysis of Texts](#Superficial-Analysis-of-Texts)
* [General Terminology](#General-Terminology)
* [Textual data preprocessing](#Textual-data-preprocessing)
* [TF-IDF](#TF-IDF)
* [Generative Models: Latent Dirichlet Allocation](#Generative-Models:-Latent-Dirichlet-Allocation)
* [Example: One million headlines](#Example:-One-million-headlines)
* [Example: Airlines Tweets](#Example:-Airlines-Tweets)

## <span style="color:blue">Introduction</span>

Humans communicate using natural languages. Natural languages differ from programming languages in recent times, they follow strict syntactic and semantic rules, while the former, due to their complexity, depend on the context.

In general, text analysis has two large subareas: superficial text analysis and natural language processing.

In this lesson we deal with the superficial analysis of texts.

[[Go Back]](#Content)

## <span style="color:blue">Superficial Analysis of Texts</span>

This subarea was developed first, because the problems associated with natural language in this case are simpler. These are techniques in which it is sought to find the underlying topics in the text. In this sense, they are unsupervised and consequential type models based on automatic classification techniques.

These techniques are aimed at detecting clusters of words and documents in large data corpus.

A document is in this case a distinguishable unit from others in the corpus. For example an open response in a survey, a comment in a review, an abstract of a document, etc.

After omitting terms that are considered not contributing to the detection of topics (themes), usually known as *empty words* (`stop words`) and other preprocessing processes such as stemming, clipping (`steeming`), it is common construct an array named document-term (`dtm`).

This `dtm` matrix represents by the rows each one of the individual documents of the corpus and by the columns each one of the terms conserved in the analysis. Each position in the array contains the number of times a term appears in the document. In some cases this is a binary array, in which case the dtm indicates when a term appears in a document.

The `dtm` is the basis of the techniques known generically as *word-bag* (`word-bag`). The name derives from the fact that when organizing the dtm, the context of the words in each document is lost.

[[Go Back]](#Content)

## <span style="color:blue">General Terminology</span>

### Words or Terms

Words are the minimum units of information in natural language work.

From a very modern perspective, words are objects that can be thought of as points that are in a high-dimensional space, in such a way that close points in some sense of distance correspond to words that have a closeness within a universe of words considered.

The following image corresponds to a set of **astrophysics words**, considered in a study of abstracts of scientific articles. This is a graph obtained after processing like what we show today, developed by Montenegro and Montenegro using an analysis technique based on multidimensional item response theory (TRIM).

In this document the words will be denoted as $w_i, i = 1,2, \ldots, K$.

<figure>
<center>
<img src="../Imagenes/cluster_kmeans_10.png" width="700" height="600" align="center"/>
</center>
<figcaption>
<p style="text-align:center">Astrophysics knowledge areas, based on scientific articles</p>
</figcaption>
</figure>
Source: Alvaro Montenegro

### Documents

Documents are the subjects in superficial textual analyzes. We assume that we have a set of individual documents, each of which will be denoted by $ \mathbf {w} $. A document is considered to be a sequence of $ N $ words. Thus we have that a document is denoted as $ \mathbf {w} = \{w_1, \ldots, w_N \} $.

### Corpus

A corpus is a collection of documents on a particular problem.

This means that a corpus can be writen as $C = \{\text{doc}_{1},\text{doc}_{2},\text{doc}_{3},\dots\}$

### Topics

Topics are latent areas to which both words and documents are associated. 

One of the main purposes of text analysis is to discover or highlight such topics.

The previous figure shows, for example, the presence of 10 topics in the set of astrophysics documents analyzed.

[[Go Back]](#Content)

## <span style="color:blue">Textual data preprocessing</span> 

In what follows, we are going to use the terms token and tokenize, which are not yet adopted by the Royal Academy of the Language, but which we believe will soon be like so many others from English due to their enormous current use, due to the scientific and technological developments.

We will carry out the following steps:

- **Cleansing raw data**: Cleaning strange simbols, tags or another kind of unnecessary elements on the text.
- **Tokenization**: divide the text into sentences and sentences into words. Put the words in lowercase and remove the punctuation.
- Text is cleaned using **regular expresions**.
- Words are **lemmatized**: words in the third person are changed to the first person and the verbs in the past and future tense are changed to the present.
- All **stopwords** are removed. (**CAREFULLY**)
- Words **that have less than 3 characters are eliminated**. (**CAREFULLY**)
- Words are stemming (**stemming**): words are reduced to their root form. (**OPTIONAL**)

We will use the *gensim* and *nltk* libraries to do this work.

[[Go Back]](#Content)

### Tokenization

Some terms that will be used frequently are:

- `Corpus`: body of the text, singular. Corpora is the plural of corpus.
- `Lexicon`: words and their meanings.
- `Token`: each *entity* that is part of whatever was divided according to the rules that we establish for the analysis. For example, each word is a token when a sentence is tokenized into words. Each sentence can also be a token, if you have converted the sentences to a paragraph.

Basically, tokenize involves splitting sentences and words from the body of text.

See the following example taken from [Geek for Geeks](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/?ref=rp). We use *nltk* library.

Lets suppose that our goal is to analize the following toy example:

*Natural language processing **(NLP)** is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.* 

*Challenges in natural language processing frequently involve natural language understanding, natural language generation frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof. There are 365 days usually. This year is 2020.*

In [218]:
raw_text = '<p class="foo">Natural language processing <b>(NLP)</b> is a field ' \
       + 'of computer science, artificial intelligence ' \
       + 'and computational linguistics concerned with ' \
       +'the interactions between computers and human ' \
       + '(natural) languages, and, in particular, ' \
       + 'concerned with programming computers to ' \
       + 'fruitfully process large natural language ' \
       + 'corpora.<br/> Challenges in natural language ' \
       + 'processing frequently involve natural ' \
       + 'language understanding, natural language ' \
       + 'generation frequently from formal, machine' \
       + '-readable logical forms), connecting language ' \
       + 'and machine perception, managing human-' \
       + 'computer dialog systems, or some combination ' \
       + 'thereof. There are 365 days usually. ' \
       + 'This year is 2020.</p>'

print(raw_text)

<p class="foo">Natural language processing <b>(NLP)</b> is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.<br/> Challenges in natural language processing frequently involve natural language understanding, natural language generation frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof. There are 365 days usually. This year is 2020.</p>


### Importing resources from `nltk`

In [219]:
# import the existing word and sentence tokenizing libraries 
import nltk

# tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize
# For tweets
from nltk.tokenize import TweetTokenizer

# Special dictionaries for punctuation and stopwords
nltk.download('punkt') # Punctuation
nltk.download('stopwords') # stopwords

# Large lexical database of English
nltk.download('wordnet')

# Stopwords from nltk
from nltk.corpus import stopwords

# lematizador basado en WordNet de nltk
from nltk.stem import WordNetLemmatizer 

# notlk's steemer. Extract root of words.
from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer 

[nltk_data] Downloading package punkt to /Users/moury/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/moury/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/moury/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Importing resources from `gensim`

In [220]:
import gensim
# Importing stopwords using gensim
from gensim.parsing.preprocessing import STOPWORDS

### Cleansing raw data

In [221]:
# Transforms html to text
import html2text
# Regular Expressions
import re

# Transform html to text
text = html2text.html2text(''.join(str(raw_text)))
# Drop breakline
text = re.sub(r'\n',' ',text)
# Drop *
text = re.sub(r'\*',' ',text)
# Drop extra-spaces
text = re.sub(r'\s\s+',' ',text)
# Unify w1- w2 to w1-w2 
text = re.sub(r'\-\s','-',text)
print(text)

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof. There are 365 days usually. This year is 2020. 


### Tokenization Example

In [585]:
# sentences
print('Sentences Tokenization:\n')
sentences = sent_tokenize(text)

for i,sentence in enumerate(sentences):
    print(f"Sentence {i}:\n{sentence}\n")

Sentences Tokenization:

Sentence 0:
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.

Sentence 1:
Challenges in natural language processing frequently involve natural language understanding, natural language generation frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.

Sentence 2:
There are 365 days usually.

Sentence 3:
This year is 2020.



In [247]:
print('List with Tokens(Sentences):\n')
print(sent_tokenize(text))

List with Tokens(Sentences):

['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.', 'There are 365 days usually.', 'This year is 2020.']


In [299]:
# palabras
tokens = word_tokenize(text)

print('Word Tokenization:\n')
print(f'Tokens: {len(tokens)}\n')
for i,token in enumerate(tokens):
    print(f'Word {i}: {token}')

Word Tokenization:

Tokens: 98

Word 0: Natural
Word 1: language
Word 2: processing
Word 3: (
Word 4: NLP
Word 5: )
Word 6: is
Word 7: a
Word 8: field
Word 9: of
Word 10: computer
Word 11: science
Word 12: ,
Word 13: artificial
Word 14: intelligence
Word 15: and
Word 16: computational
Word 17: linguistics
Word 18: concerned
Word 19: with
Word 20: the
Word 21: interactions
Word 22: between
Word 23: computers
Word 24: and
Word 25: human
Word 26: (
Word 27: natural
Word 28: )
Word 29: languages
Word 30: ,
Word 31: and
Word 32: ,
Word 33: in
Word 34: particular
Word 35: ,
Word 36: concerned
Word 37: with
Word 38: programming
Word 39: computers
Word 40: to
Word 41: fruitfully
Word 42: process
Word 43: large
Word 44: natural
Word 45: language
Word 46: corpora
Word 47: .
Word 48: Challenges
Word 49: in
Word 50: natural
Word 51: language
Word 52: processing
Word 53: frequently
Word 54: involve
Word 55: natural
Word 56: language
Word 57: understanding
Word 58: ,
Word 59: natural
Word 60: langua

In [300]:
print('List with Tokens (Words):\n')
print(word_tokenize(text))

List with Tokens (Words):

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.', 'There', 'are', '365', 'days', 'usually', '.', 'This', 'year', 'is', '2020', '.']


In [301]:
# characters
chars = [char for char in text]
print('Character Tokenization:')
for i,char in enumerate(chars):
    print(f'{char}', end='|')
    

Character Tokenization:
N|a|t|u|r|a|l| |l|a|n|g|u|a|g|e| |p|r|o|c|e|s|s|i|n|g| |(|N|L|P|)| |i|s| |a| |f|i|e|l|d| |o|f| |c|o|m|p|u|t|e|r| |s|c|i|e|n|c|e|,| |a|r|t|i|f|i|c|i|a|l| |i|n|t|e|l|l|i|g|e|n|c|e| |a|n|d| |c|o|m|p|u|t|a|t|i|o|n|a|l| |l|i|n|g|u|i|s|t|i|c|s| |c|o|n|c|e|r|n|e|d| |w|i|t|h| |t|h|e| |i|n|t|e|r|a|c|t|i|o|n|s| |b|e|t|w|e|e|n| |c|o|m|p|u|t|e|r|s| |a|n|d| |h|u|m|a|n| |(|n|a|t|u|r|a|l|)| |l|a|n|g|u|a|g|e|s|,| |a|n|d|,| |i|n| |p|a|r|t|i|c|u|l|a|r|,| |c|o|n|c|e|r|n|e|d| |w|i|t|h| |p|r|o|g|r|a|m|m|i|n|g| |c|o|m|p|u|t|e|r|s| |t|o| |f|r|u|i|t|f|u|l|l|y| |p|r|o|c|e|s|s| |l|a|r|g|e| |n|a|t|u|r|a|l| |l|a|n|g|u|a|g|e| |c|o|r|p|o|r|a|.| |C|h|a|l|l|e|n|g|e|s| |i|n| |n|a|t|u|r|a|l| |l|a|n|g|u|a|g|e| |p|r|o|c|e|s|s|i|n|g| |f|r|e|q|u|e|n|t|l|y| |i|n|v|o|l|v|e| |n|a|t|u|r|a|l| |l|a|n|g|u|a|g|e| |u|n|d|e|r|s|t|a|n|d|i|n|g|,| |n|a|t|u|r|a|l| |l|a|n|g|u|a|g|e| |g|e|n|e|r|a|t|i|o|n| |f|r|e|q|u|e|n|t|l|y| |f|r|o|m| |f|o|r|m|a|l|,| |m|a|c|h|i|n|e|-|r|e|a|d|a|b|l|e| |l|o|g|i|c|a|l| |f|o|r|m|s|)|

In [302]:
print('List with Tokens (Characters):\n')
print(chars)

List with Tokens (Characters):

['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', '(', 'N', 'L', 'P', ')', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'i', 'e', 'l', 'd', ' ', 'o', 'f', ' ', 'c', 'o', 'm', 'p', 'u', 't', 'e', 'r', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e', ',', ' ', 'a', 'r', 't', 'i', 'f', 'i', 'c', 'i', 'a', 'l', ' ', 'i', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', ' ', 'a', 'n', 'd', ' ', 'c', 'o', 'm', 'p', 'u', 't', 'a', 't', 'i', 'o', 'n', 'a', 'l', ' ', 'l', 'i', 'n', 'g', 'u', 'i', 's', 't', 'i', 'c', 's', ' ', 'c', 'o', 'n', 'c', 'e', 'r', 'n', 'e', 'd', ' ', 'w', 'i', 't', 'h', ' ', 't', 'h', 'e', ' ', 'i', 'n', 't', 'e', 'r', 'a', 'c', 't', 'i', 'o', 'n', 's', ' ', 'b', 'e', 't', 'w', 'e', 'e', 'n', ' ', 'c', 'o', 'm', 'p', 'u', 't', 'e', 'r', 's', ' ', 'a', 'n', 'd', ' ', 'h', 'u', 'm', 'a', 'n', ' ', '(', 'n', 'a', 't', 'u', 'r', 'a', 'l', ')', ' ', 'l', 'a', 'n', 'g

### Tweets Tokenization

In [303]:
tknzr = TweetTokenizer()
s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
print('Tokens:\n')
print(tknzr.tokenize(s0))

Tokens:

['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']


### Tweets Toknization using `strip_handles` and `reduce_len`

In [304]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
tw = tknzr.tokenize(s1)
print(tw)

[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


###  Transform Text to Lowercase

In [305]:
tokens = [token.lower() for token in tokens]
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 98

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.', 'there', 'are', '365', 'days', 'usually', '.', 'this', 'year', 'is', '2020', '.']


### Remove special characters - regular expressions (regex)

Regular expressions are mathematical objects that allow you to interpret pieces of text.

They are key in the construction of programming languages. Here we are going to use the Python [re](https://docs.python.org/3/library/re.html) library created for handling regular expressions. 

We suggest this [re in Python tutorial](https://www.w3schools.com/python/python_regex.asp) to learn how to use the re library.

Aditionally, you can get the [Cheat-Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/) for regular expressions and an online tester [here](https://regexr.com/).

We will use here to remove some symbols: numbers and parentheses for example. This is not always the case.

In [306]:
import re
# digits (CAREFULLY)
tokens = [re.sub(r'\d+', '',token) for token in tokens]
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 98

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.', 'there', 'are', '', 'days', 'usually', '.', 'this', 'year', 'is', '', '.']


In [307]:
# parenthesis
tokens = [re.sub(r'[()]', '',token) for token in tokens]
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 98

['natural', 'language', 'processing', '', 'nlp', '', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '', 'natural', '', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', '', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.', 'there', 'are', '', 'days', 'usually', '.', 'this', 'year', 'is', '', '.']


In [308]:
# Take out punctuations and other symbols
tokens = [re.sub(r'[^\w\s]', '',token) for token in tokens]
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 98

['natural', 'language', 'processing', '', 'nlp', '', 'is', 'a', 'field', 'of', 'computer', 'science', '', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '', 'natural', '', 'languages', '', 'and', '', 'in', 'particular', '', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', '', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', '', 'machinereadable', 'logical', 'forms', '', '', 'connecting', 'language', 'and', 'machine', 'perception', '', 'managing', 'humancomputer', 'dialog', 'systems', '', 'or', 'some', 'combination', 'thereof', '', 'there', 'are', '', 'days', 'usually', '', 'this', 'year', 'is', '', '']


In [309]:
# Drop empty spaces
tokens = [token for token in tokens if len(token)>0]
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 78

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machinereadable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'humancomputer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof', 'there', 'are', 'days', 'usually', 'this', 'year', 'is']


### Lemmatization

**Lemmatization** is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the **word's lemma**, or dictionary form.

**Stemming** is the process of reducing inflected (or sometimes derived) words to their **word stem**, base or root form—generally a written word form.

Therefore, it links words with a meaning similar to a word.

Text preprocessing includes both `Stemming` and `Lemmatization`.

Many times people find these two terms confusing. Some treat these two as equals.

Actually, **lematization is preferred to stemming** because stemming performs morphological analysis of words.

The applications of the stemming are:

- It is used in comprehensive retrieval systems such as search engines.
- Used in compact indexing
- Examples of stemming:

**Example:**

* rocks -> rock
* corpora -> corpus
* better -> good

An important difference from stemming is that lemmatization takes a part of the voice parameter, "pos". If not provided, the default is "noun". In the following example we are going to place *pos = 'a'* which means adjective. If *pos = 'v'* is placed, it means verb. By default it is *pos = 'n'*, that is, a noun.

The following is the stemming implementation of some English words using the *nltk* library:

In [310]:
#nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
  
print("rocks   ->", lemmatizer.lemmatize("rocks")) 
print("corpora ->", lemmatizer.lemmatize("corpora")) 
  
# a denotes adjective in "pos" 
print("better  ->", lemmatizer.lemmatize("better", pos ="a")) 

rocks   -> rock
corpora -> corpus
better  -> good


In [311]:
#help(lemmatizer)

Y ahora vamos lematizar el texto de ejemplo, primero con verbos y luego con sustantivos

In [312]:
from nltk.stem import WordNetLemmatizer
#
# verbs
lemma_text =[]
for token in tokens:
    lemma_text.append(WordNetLemmatizer().lemmatize(token, pos='v'))

print('Tokens:\n')
print(tokens)
print('\nLemmas with pos="v":\n')
print(lemma_text)

Tokens:

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machinereadable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'humancomputer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof', 'there', 'are', 'days', 'usually', 'this', 'year', 'is']

Lemmas with pos="v":

['natural', 'language', 'process', 'nlp', 'be', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intell

In [313]:
# nouns
for i in range(len(lemma_text)):
    lemma_text[i] = WordNetLemmatizer().lemmatize(lemma_text[i], pos="n")
print('\nLemmas with pos="n":\n')
print(lemma_text)


Lemmas with pos="n":

['natural', 'language', 'process', 'nlp', 'be', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concern', 'with', 'the', 'interaction', 'between', 'computer', 'and', 'human', 'natural', 'language', 'and', 'in', 'particular', 'concern', 'with', 'program', 'computer', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpus', 'challenge', 'in', 'natural', 'language', 'process', 'frequently', 'involve', 'natural', 'language', 'understand', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machinereadable', 'logical', 'form', 'connect', 'language', 'and', 'machine', 'perception', 'manage', 'humancomputer', 'dialog', 'system', 'or', 'some', 'combination', 'thereof', 'there', 'be', 'day', 'usually', 'this', 'year', 'be']


### Stemming

Steeming is the process of producing morphological variants of a root / base word. Bypass programs are commonly known as steeming or derivation algorithms. A stemming algorithm reduces the words as in the following examples

+ "chocolates", "chocolates", "choco" at the root of the word, "chocolate"
+ "recovery", "recovered", "recover" is reduced to the root "recover".

### Potential problems:

There are mainly two problems in stemming: overstemming and understemming.

Excessive overclipping occurs when two words are derived from the same stem that have different roots.

Undercutting occurs when two words are derived from the different stem but have the same root.

For example, the widely used Porter stemmer stems "universal", "university", and "universe" to "univers". This is a case of overstemming: though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results.

An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumna"/"alumnae" → "alumna". This English word keeps Latin morphology, and so these near-synonyms are not conflated.

Lets see Stemming in practice:

In [314]:
#from nltk.stem import SnowballStemmer
from nltk.stem import PorterStemmer 
# crea una instancia de PorterStemmer 
ps = PorterStemmer()

for i in range(len(lemma_text)):
    lemma_text[i] = ps.stem(lemma_text[i])
print(lemma_text)

['natur', 'languag', 'process', 'nlp', 'be', 'a', 'field', 'of', 'comput', 'scienc', 'artifici', 'intellig', 'and', 'comput', 'linguist', 'concern', 'with', 'the', 'interact', 'between', 'comput', 'and', 'human', 'natur', 'languag', 'and', 'in', 'particular', 'concern', 'with', 'program', 'comput', 'to', 'fruit', 'process', 'larg', 'natur', 'languag', 'corpu', 'challeng', 'in', 'natur', 'languag', 'process', 'frequent', 'involv', 'natur', 'languag', 'understand', 'natur', 'languag', 'gener', 'frequent', 'from', 'formal', 'machineread', 'logic', 'form', 'connect', 'languag', 'and', 'machin', 'percept', 'manag', 'humancomput', 'dialog', 'system', 'or', 'some', 'combin', 'thereof', 'there', 'be', 'day', 'usual', 'thi', 'year', 'be']


### Removes words of length less than or equal to two (CAREFULLY)

In [586]:
tokens_4 = []

for token in tokens:
    if len(token) > 2:
        tokens_4.append(token)

tokens = tokens_4

# equivalent to
#tokens_4 = [token for token in tokens if len(token)>3]

print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 56

['natural', 'language', 'processing', 'field', 'computer', 'science', 'artificial', 'intelligence', 'computational', 'linguistics', 'concerned', 'interactions', 'computers', 'human', 'natural', 'languages', 'particular', 'concerned', 'programming', 'computers', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'formal', 'machinereadable', 'logical', 'forms', 'connecting', 'language', 'machine', 'perception', 'managing', 'humancomputer', 'dialog', 'systems', 'combination', 'thereof', 'days', 'usually', 'year']


###  Stopwords

Empty words or stopwords are words that in common language are **considered not to contribute to the semantic content of texts**. In the bag of words technique they are omitted, because they cause confusing classifications. Actually, the concept of empty words depends on the answer the researcher wants to get.

The following example shows the dictionary of English stopwords contained in the `gensim` library.

In [316]:
stopwords_gensim = list(gensim.parsing.preprocessing.STOPWORDS)
print(f'Stopwords in Gensim: {len(stopwords_gensim)}\n')
print(stopwords_gensim)

Stopwords in Gensim: 337

['front', 'hasnt', 'everywhere', 'an', 'thence', 'hence', 'quite', 'your', 'some', 'take', 'why', 'elsewhere', 'whether', 'own', 'go', 'a', 'every', 'seems', 'either', 'doing', 'anything', 'interest', 'get', 'of', 'me', 'into', 'part', 'else', 'wherein', 'under', 'less', 'full', 'still', 'below', 'seeming', 'been', 'since', 'using', 'toward', 'eight', 'hereby', 'couldnt', 'anywhere', 'until', 'within', 'is', 'although', 'latterly', 'doesn', 'had', 'more', 'otherwise', 'thereupon', 'top', 'rather', 'mostly', 'on', 'between', 'became', 'another', 'mill', 'whole', 'what', 'computer', 'via', 'call', 'he', 'keep', 'give', 'these', 'un', 'etc', 'put', 'detail', 'all', 'across', 'none', 'due', 'yours', 'co', 'and', 'hereupon', 'which', 'last', 'whereupon', 'or', 'name', 'were', 'though', 'who', 'does', 'this', 'cant', 'my', 'each', 'upon', 'most', 'never', 'among', 'whereafter', 'several', 'someone', 'sometimes', 'alone', 'against', 'anyone', 'there', 'those', 'fire'

En la librería *nltk* el diciconario de palabras vacías del inglés es actualmente:

In [317]:
from nltk.corpus import stopwords
#
stopwords_nltk = stopwords.words('english')
print(f'Stopwords in nltk: {len(stopwords_nltk)}\n')
print(stopwords_nltk)

Stopwords in nltk: 179

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',

### Note 1

Notice that the two sets of stopwords are different.

In fact, the common words are:

In [318]:
inter = set(stopwords_gensim).intersection(set(stopwords_nltk))
print(f'Common Stopwords: {len(inter)}\n')
print(inter)

Common Stopwords: 126

{'has', 'no', 'at', 'and', 'being', 'an', 'which', 're', 'your', 'or', 'some', 'were', 'who', 'are', 'be', 'does', 'through', 'this', 'why', 'my', 'out', 'own', 'each', 'will', 'a', 'most', 'myself', 'up', 'doing', 'just', 'against', 'that', 'there', 'those', 'of', 'me', 'themselves', 'into', 'they', 'the', 'yourself', 'himself', 'under', 'not', 'ours', 'she', 'below', 'been', 'him', 'i', 'too', 'off', 'any', 'to', 'few', 'have', 'very', 'her', 'don', 'when', 'until', 'was', 'herself', 'am', 'did', 'its', 'such', 'is', 'before', 'as', 'how', 'by', 'if', 'doesn', 'for', 'had', 'more', 'but', 'again', 'from', 'further', 'so', 'both', 'down', 'didn', 'ourselves', 'do', 'in', 'because', 'our', 'on', 'between', 'during', 'while', 'where', 'whom', 'we', 'about', 'here', 'their', 'what', 'than', 'nor', 'above', 'itself', 'he', 'yourselves', 'hers', 'over', 'these', 'them', 'other', 'only', 'you', 'once', 'with', 'then', 'all', 'after', 'same', 'should', 'his', 'it', 'ca

As an example we are going to remove the empty words from the tokenized text object defined above.

### Note 2

There is another library, called **SpaCy**, that also contains another set of stopwords:

In [319]:
import spacy

# install models with spaCy
# !python -m spacy download en_core_web_sm

# Load model
nlp = spacy.load('en_core_web_sm')

# Take stopwords from SpaCy
stopwords_spacy = list(nlp.Defaults.stop_words)

print(f'Stopwords in SpaCy: {len(stopwords_spacy)}\n')
print(stopwords_spacy)

Stopwords in SpaCy: 326

['front', 'everywhere', 'an', 'thence', 'ca', 'hence', 'quite', 'your', 'some', 'take', 'why', 'elsewhere', 'whether', 'own', 'go', 'a', 'every', 'seems', 'doing', 'anything', 'either', 'get', 'of', 'me', 'into', 'part', 'else', 'wherein', 'under', 'less', 'full', 'still', 'below', 'seeming', 'been', 'since', 'toward', 'using', 'eight', 'hereby', 'anywhere', 'until', 'within', "'m", 'is', 'although', 'latterly', 'had', 'more', 'otherwise', 'thereupon', 'top', 'rather', "n't", 'mostly', 'on', 'between', 'became', 'another', 'whole', 'what', '’d', 'via', 'call', 'he', 'keep', 'give', 'these', 'put', 'all', 'across', 'none', 'due', 'yours', 'and', 'hereupon', 'which', 'last', 'whereupon', 'or', 'name', 'were', 'though', 'who', 'does', 'this', 'my', 'each', 'upon', 'most', 'never', "'d", 'among', 'whereafter', 'several', 'someone', 'sometimes', 'alone', 'against', 'anyone', '‘ll', 'there', 'those', 'hundred', 'however', 'seemed', 'well', 'everyone', 'very', 'may', 

### Stopwords in Spanish using ntlk

In [320]:
palabrasVacias_nltk = stopwords.words('spanish')
print(f'Palabras Vacías en nltk, español: {len(palabrasVacias_nltk)}\n')
print(palabrasVacias_nltk)

Palabras Vacías en nltk, español: 313

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', '

**Let's remove the stop words from the example using nltk**

In [321]:
print(f'Tokens: {len(tokens)}\n')
print(tokens)

Tokens: 63

['natural', 'language', 'processing', 'field', 'computer', 'science', 'artificial', 'intelligence', 'computational', 'linguistics', 'concerned', 'with', 'interactions', 'between', 'computers', 'human', 'natural', 'languages', 'particular', 'concerned', 'with', 'programming', 'computers', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machinereadable', 'logical', 'forms', 'connecting', 'language', 'machine', 'perception', 'managing', 'humancomputer', 'dialog', 'systems', 'some', 'combination', 'thereof', 'there', 'days', 'usually', 'this', 'year']


In [322]:
tokens_n_e = [token for token in tokens if token not in stopwords_nltk]
#
tokens = tokens_n_e
print(f'Tokens: {len(tokens)}\n')
print(tokens)   

Tokens: 56

['natural', 'language', 'processing', 'field', 'computer', 'science', 'artificial', 'intelligence', 'computational', 'linguistics', 'concerned', 'interactions', 'computers', 'human', 'natural', 'languages', 'particular', 'concerned', 'programming', 'computers', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'formal', 'machinereadable', 'logical', 'forms', 'connecting', 'language', 'machine', 'perception', 'managing', 'humancomputer', 'dialog', 'systems', 'combination', 'thereof', 'days', 'usually', 'year']


## <span style="color:blue">TF-IDF</span> 

Taken from [Wikipedia](https://es.wikipedia.org/wiki/Tf-idf).

Tf-idf (Term frequency - Inverse document frequency), term frequency - inverse document frequency (that is, the frequency of occurrence of the term in the corpus of documents), is a numerical measure that expresses how relevant a word is for a document in a corpus. This measure is often used as a weighting factor in information retrieval and text mining.


The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the document corpus, which allows handling of the fact that some words are generally more common than others.

Variations of the tf-idf weight scheme are frequently used by search engines as a fundamental tool to measure the relevance of a document given a user's query, thus establishing an ordering or ranking of them.


Tf-idf can be used successfully for filtering stop-words, in different fields of pre-word processing.

### Mathematical details

Tf-idf is the product of two measurements, *term frequency* and *inverse document frequency*. There are several ways to determine the value of both.

In the case of the term frequency $ \text {tf} (t, d) $, the simplest option is to use the raw frequency of the term $ t $ in the document $ d $, that is, the number of times that the term $ t $ occurs in the document $ d $. If we denote the raw frequency of $ t $ by $ f (t, d) $, then the simple $ \text {tf} $ schema is $ \text {tf} (t, d) = f (t, d) $ .


Other possibilities are:

- *Boolean "frequencies*: tf (t, d) = 1 if t occurs in d, and 0 if not;
- *logarithmically scaled frequency*: tf (t, d) = 1 + log f (t, d) (y 0 if f (t, d) = 0);
- *standardized frequency*, to avoid a bias towards long documents. For example, divide the raw frequency by the maximum frequency of some term in the document:

$$
{\displaystyle \mathrm {tf} (t,d)={\frac {\mathrm {f} (t,d)}{\max\{\mathrm {f} (t,d):t\in d\}}}}
$$

The inverse document frequency is a measure of whether the term is common or not, in the corpus of documents. It is obtained by dividing the total number of documents by the number of documents that contain the term, and the logarithm of this quotient is taken:

$$
{\displaystyle \mathrm {idf} (t,D)=\log {\frac {|D|}{|\{d\in D:t\in d\}|}}}
$$

where

- ${\displaystyle |D|}$: cardinality of $D$, or number of documents in the corpus.
- ${\displaystyle |\{d\in D:t\in d\}|}$ : number of documents where the term $ t $ appears. If the term is not in the collection, a division-by-zero will occur. Therefore, it is common to fit this formula to ${\displaystyle 1+|\{d\in D:t\in d\}|}$.

Mathematically, the base of the logarithm function is not important and is a constant factor in the final result.

Hencem *tf-idf* is calculated as:

$$
{\displaystyle \text{tf-idf} (t,d,D)=\mathrm {tf} (t,d)\times \mathrm {idf} (t,D)}
$$

A high weight in *tf-idf* is reached with a high frequency of term (in the given document) and a low frequency of occurrence of the term in corpus of documents.

Since the quotient within the logarithm function of the idf is always greater than or equal to 1, the value of *idf* (and of *tf-idf*) is greater than or equal to 0.

When a term appears in many documents, the quotient within the logarithm approaches 1, giving a value of *idf* and *tf-idf* close to 0.

[[Go Back]](#Content)

## <span style="color:blue">Generative Models: Latent Dirichlet Allocation</span>

The Latent Dirichlet Allocation (LDA) technique is currently the most used for the extraction of documents from document corpus and is due to [Blei et al](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf). 

### Central ideas behind LDA, Blei et al.(2003)

The central ideas behind LDA are as follows. The generative model assumes that the documents are generated as follows:

1. The size $ N $ of the document is generated by a Poisson distribution $ \text {Poi} (\xi) $.
2. The topics are generated from a multinomial distribution with a probability vector $ \mathbf {\theta} $.
3. A priori it is assumed that the vector $ \mathbf {\theta} $ is generated by a Dirichlet distribution with vector of parameters $ \boldsymbol {\alpha} $. From this derives the name of the technique.
4. Each of the $ N $ words in a document is generated according to the following algorithm.
      - A topic is chosen $ z_n \sim \text {Multinomial} (\mathbf {\theta}) $.
      - The word $ w_n \sim \text {P} (w_n | z_n, \mathbf {\beta}) $ is chosen. Where $ \mathbf {\beta} $ is a matrix of probabilities of the words belonging to the topics. $ P $ is a multinomial probability conditional on the topic $ z_n $ and the vector of parameters $ \mathbf {\beta} $.


To the reader interested in the details, we refer him to the original paper of [Blei et al.](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).

The following image tries to show the central ideas behind the technique.


<figure>
<center>
<img src="../Imagenes/Diagram_Blei.png" width="800" height="700" align="center"/>
</center>
<figcaption>
<p style="text-align:center">Intuition behind LDA</p>
</figcaption>
</figure>

Fuente: 
[Intuition behind LDA](http://www.cs.cornell.edu/courses/cs6784/2010sp/lecture/30-BleiEtAl03.pdf)

Topic modeling is a type of statistical modeling to discover the abstract "themes" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text of a document on a particular topic.

Build a topic model per document and words per topic model, modeled as Dirichlet distributions.

Here we are going to apply LDA to a set of documents and divide them into topics. Let us begin!

### Importing libraries

In [323]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/moury/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

We are going to write a function that lemmatizes and preprocesses the dataset

In [348]:
from nltk.stem import PorterStemmer 

def lemmatize_stemming(text):
    ps = PorterStemmer()
    return ps.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text): #  gensim.utils.simple_preprocess tokenizes el texto
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

[[Go Back]](#Content)

## <span style="color:blue">Example: One million headlines</span>

The dataset we will use is a list of over a million news headlines published over a 15-year period and can be downloaded from [Kaggle](https://www.kaggle.com/therohk/million-headlines/metadata).

Example adapted from [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)

In [338]:
import pandas as pd
data = pd.read_csv('../Datos/abcnews-date-text.csv');

In [337]:
data

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
...,...,...
1226253,20201231,what abc readers learned from 2020 looking bac...
1226254,20201231,what are the south african and uk variants of ...
1226255,20201231,what victorias coronavirus restrictions mean f...
1226256,20201231,whats life like as an american doctor during c...


Lets check some data:

In [339]:
data_text = data[['headline_text']]
documents = data_text
documents.sample(5)

Unnamed: 0,headline_text
431728,second teacher charged over scots school assault
334778,alcohol a factor in assault increase stirling
1114105,rural sa mining tees
653306,gag lifted on palm island rioter
250541,crocs seen around resort islands


In [353]:
sample = np.random.choice(documents.index)
doc_sample = documents.iloc[sample].values[0]
print(f'Original Document: {sample}')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\nTokenized and Lemmatized Document: ')
print(preprocess(doc_sample))

Original Document: 689750
['broken', 'hill', 'woman', 'competes', 'for', 'miss', 'universe', 'australia']


Tokenized and Lemmatized Document: 
['break', 'hill', 'woman', 'compet', 'miss', 'univers', 'australia']


### Text Preprocessing

We are going to pre-process the texts, saving the results in the *processed_docs* object.

In [None]:
processed_docs = documents['headline_text'].map(preprocess)

In [359]:
processed_docs.sample(10)

764805                    [rudd, wont, wind, public, appear]
129826                        [seek, strengthen, tie, timor]
152039     [union, seek, probe, corbi, airport, drug, claim]
229507                                 [seven, brumbi, riot]
486015                               [leav, hitchhik, crash]
1086893               [cosbi, juri, deliv, verdict, holdout]
878429                      [power, leav, late, beat, demon]
627628                         [apra, ask, bank, live, will]
442389                [want, live, paradis, perish, kinglak]
1065663                    [fake, news, trump, blast, media]
Name: headline_text, dtype: object

### Building the dictionary

We create a dictionary from *processed_docs* that contains the number of times a word appears in the training set.

In [360]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 commun
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


Filter the tokens that appear in
less than 15 documents (absolute number) or
more than 0.5 documents (fraction of the total size of the corpus, not an absolute number).
After the previous two steps, keep only the first 100,000 most frequent tokens.

In [361]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

### Gensim doc2bow

For each document we create a dictionary that informs how many
words and how many times those words appear.

We put this in the *bow_corpus* object, then check our previously selected document.

In [367]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [447]:
sample = np.random.choice(documents.index)

doc_sample = documents.iloc[sample].values[0]
print(f'\nOriginal Document: {sample}')
print(doc_sample,'\n')

print('Bag of Words (BoW):\n')
print(bow_corpus[sample],'\n')

bow_doc_4310 = bow_corpus[sample]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                               dictionary[bow_doc_4310[i][0]], 
bow_doc_4310[i][1]))


Original Document: 533361
captive bred wallabies breeding in wild 

Bag of Words (BoW):

[(982, 2), (2069, 1), (4153, 1), (5099, 1)] 

Word 982 ("breed") appears 2 time.
Word 2069 ("wild") appears 1 time.
Word 4153 ("wallabi") appears 1 time.
Word 5099 ("captiv") appears 1 time.


This is a preview of the bag of words in the preprocessed document.

### TF-IDF

We create a model object *tf-idf* using `models.TfidfModel` from "bow_corpus" and put it in *tfidf*, then we apply the transformation to the whole corpus and call it *corpus_tfidf*. Finally, we preview the *TF-IDF* scores for our first document.

In [458]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

print(processed_docs[0],'\n')
print(bow_corpus[0],'\n')
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

['decid', 'commun', 'broadcast', 'licenc'] 

[(0, 1), (1, 1), (2, 1), (3, 1)] 

[(0, 0.5852942020878993),
 (1, 0.38405854933668493),
 (2, 0.5017732999224691),
 (3, 0.5080878695349914)]


### Running LDA using TF-IDF

In [496]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=4, workers=8)

**How many topics should we choose?** You can read [Evaluate Topic Models: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

In [497]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic[:120]))

Topic: 0 Word: 0.017*"coronaviru" + 0.011*"coast" + 0.011*"covid" + 0.010*"miss" + 0.009*"search" + 0.008*"polic" + 0.007*"hill" + 0.00
Topic: 1 Word: 0.009*"friday" + 0.008*"farm" + 0.007*"morrison" + 0.007*"care" + 0.007*"violenc" + 0.007*"juli" + 0.006*"age" + 0.006*"
Topic: 2 Word: 0.024*"news" + 0.019*"rural" + 0.009*"thursday" + 0.008*"grandstand" + 0.008*"nation" + 0.008*"busi" + 0.007*"financ" + 
Topic: 3 Word: 0.016*"charg" + 0.016*"murder" + 0.013*"court" + 0.011*"donald" + 0.010*"jail" + 0.010*"polic" + 0.009*"assault" + 0.009
Topic: 4 Word: 0.009*"south" + 0.008*"north" + 0.007*"turnbul" + 0.007*"korea" + 0.006*"east" + 0.006*"asylum" + 0.006*"australia" + 0.
Topic: 5 Word: 0.021*"crash" + 0.012*"polic" + 0.012*"driver" + 0.010*"die" + 0.009*"fatal" + 0.009*"woman" + 0.008*"road" + 0.008*"kil
Topic: 6 Word: 0.014*"trump" + 0.011*"elect" + 0.009*"countri" + 0.008*"govern" + 0.008*"hour" + 0.007*"health" + 0.007*"fund" + 0.006*
Topic: 7 Word: 0.012*"drum" + 0.012*"market" + 0

Can you distinguish different topics using the words in each topic and their corresponding weights?

### Performance evaluation classifying the sample document using the LDA TF-IDF model.

In [539]:
tfidf_vector

[(30, 0.6003087124903306), (440, 0.7997683725355745)]

In [504]:
sample = np.random.choice(documents.index)

doc_sample = documents.iloc[sample].values[0]

print(f'\nOriginal Document: {sample}')
print(doc_sample,'\n')

for index, score in sorted(lda_model_tfidf[tfidf_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))


Original Document: 1170265
asteroids and apocalypse and life on earth 

Score: 0.5456940531730652	 Topic: 0.024*"news" + 0.019*"rural" + 0.009*"thursday" + 0.008*"grandstand" + 0.008*"nation"
Score: 0.20384711027145386	 Topic: 0.021*"crash" + 0.012*"polic" + 0.012*"driver" + 0.010*"die" + 0.009*"fatal"
Score: 0.03131704032421112	 Topic: 0.009*"weather" + 0.009*"farmer" + 0.008*"monday" + 0.007*"andrew" + 0.007*"climat"
Score: 0.03131170570850372	 Topic: 0.014*"trump" + 0.011*"elect" + 0.009*"countri" + 0.008*"govern" + 0.008*"hour"
Score: 0.03131140395998955	 Topic: 0.009*"south" + 0.008*"north" + 0.007*"turnbul" + 0.007*"korea" + 0.006*"east"
Score: 0.03130675479769707	 Topic: 0.012*"drum" + 0.012*"market" + 0.011*"price" + 0.010*"rise" + 0.010*"share"
Score: 0.03130556643009186	 Topic: 0.009*"friday" + 0.008*"farm" + 0.007*"morrison" + 0.007*"care" + 0.007*"violenc"
Score: 0.03130359575152397	 Topic: 0.012*"interview" + 0.009*"final" + 0.009*"world" + 0.008*"australia" + 0.007*"leag

Nuestro documento de prueba tiene la mayor probabilidad de ser parte del tema que asignó nuestro modelo, que es la clasificación precisa.

## Test the model with a document not seen before.

In [503]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'

print(f'\nOriginal Document:')
print(unseen_document,'\n')

bow_vector = dictionary.doc2bow(preprocess(unseen_document))
tfidf_vector = tfidf[bow_vector]
for index, score in sorted(lda_model_tfidf[tfidf_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))


Original Document:
How a Pentagon deal became an identity crisis for Google 

Score: 0.545668363571167	 Topic: 0.024*"news" + 0.019*"rural" + 0.009*"thursday" + 0.008*"grandstand" + 0.008*"nation"
Score: 0.203868106007576	 Topic: 0.021*"crash" + 0.012*"polic" + 0.012*"driver" + 0.010*"die" + 0.009*"fatal"
Score: 0.031321410089731216	 Topic: 0.009*"weather" + 0.009*"farmer" + 0.008*"monday" + 0.007*"andrew" + 0.007*"climat"
Score: 0.03131203353404999	 Topic: 0.014*"trump" + 0.011*"elect" + 0.009*"countri" + 0.008*"govern" + 0.008*"hour"
Score: 0.0313115268945694	 Topic: 0.009*"south" + 0.008*"north" + 0.007*"turnbul" + 0.007*"korea" + 0.006*"east"
Score: 0.03130674362182617	 Topic: 0.012*"drum" + 0.012*"market" + 0.011*"price" + 0.010*"rise" + 0.010*"share"
Score: 0.03130554035305977	 Topic: 0.009*"friday" + 0.008*"farm" + 0.007*"morrison" + 0.007*"care" + 0.007*"violenc"
Score: 0.03130355849862099	 Topic: 0.012*"interview" + 0.009*"final" + 0.009*"world" + 0.008*"australia" + 0.007*"l

## <span style="color:blue">Example: Airlines Tweets</span>

This dataset can be found in [Kaggle](https://www.kaggle.com/c/spanish-arilines-tweets-sentiment-analysis/data?select=tweets_public.csv).

In [664]:
from nltk.stem import PorterStemmer 
import spacy
nlp = spacy.load('es_core_news_lg')

In [777]:
hey = nlp('Hola como estas')

In [828]:
def lemmatize(text,nlp):
    # can be parallelized
    doc = nlp(text)
    lemma = [n.lemma_ for n in doc]
        
    return lemma

def preprocess(text,nlp):
    
    result = []
    
    for token in gensim.utils.simple_preprocess(text): #  gensim.utils.simple_preprocess tokenizes el texto
        token = ''.join(x for x in token.lower() if x.isalpha())
        token = re.sub(r'http*','',token)
        #token = re.sub(r'\s\s+',' ',token)
        if token not in palabrasVacias_nltk and len(token) > 2:
            result.append(token)
        result = lemmatize(' '.join(result),nlp)
    return result

In [None]:
'''
## Parallel

# batch_size=200 is approx 8.3 gb of RAM
for essay in nlp.pipe(documents_spanish['text'], batch_size=200, n_process=4):
    # is_parsed is deprecated
    if essay.has_annotation("DEP"):
        lemma.append([n.lemma_ for n in essay])


    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some blanks in case the parse fails
        lemma.append(None)


documents_spanish['Lemas']  = lemma
'''

In [829]:
data_spanish = pd.read_csv('../Datos/tweets_public.csv')

In [830]:
data_spanish

Unnamed: 0,airline_sentiment,is_reply,reply_count,retweet_count,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,neutral,False,0,0,Trabajar en #Ryanair como #TMA: https://t.co/r...,,Fri Nov 03 12:05:12 +0000 2017,926419989107798016,,Madrid
1,neutral,True,0,0,@Iberia @FIONAFERRER Cuando gusten en Cancún s...,,Sun Nov 26 18:40:28 +0000 2017,934854385577943041,,Mexico City
2,negative,False,0,0,Sabiais que @Iberia te trata muy bien en santi...,,Mon Dec 25 15:40:45 +0000 2017,945318406441635840,,Madrid
3,negative,False,0,0,NUNCA NUNCA NUNCA pidáis el café de Ryanair.\n...,,Mon Nov 06 14:18:35 +0000 2017,927540721296568320,,Pacific Time (US & Canada)
4,positive,True,0,0,@cris_tortu @dakar @Iberia @Mitsubishi_ES @BFG...,,Mon Jan 01 23:00:57 +0000 2018,947965901332197376,,Buenos Aires
...,...,...,...,...,...,...,...,...,...,...
7862,negative,True,0,0,@Iberia @iberiaexpress especialistas en dejart...,,Thu Dec 28 22:34:23 +0000 2017,946509662341554176,,
7863,neutral,False,0,0,"Con @Iberia, mi destino a un solo click. ¡Dese...",,Wed Nov 29 18:59:49 +0000 2017,935946417495035904,,Eastern Time (US & Canada)
7864,positive,True,0,0,@Iberia Muy bien. Muchas gracias,,Tue Dec 26 21:38:36 +0000 2017,945770846949396480,,Greenland
7865,negative,False,0,0,Es que volar con Ryanair es peor que irte a ch...,,Tue Dec 19 09:08:35 +0000 2017,943045386570223616,,Atlantic Time (Canada)


In [831]:
data_text_spanish = data_spanish[['text']]
documents_spanish = data_text_spanish
documents_spanish.sample(5)

Unnamed: 0,text
5694,@Ryanair solicita licencia en #UK en previsión...
5402,@diana_twittea @JenHerranz @Iberia @iberiaexpr...
7155,@Iberia Estimados: estoy intentando hacer el ...
6666,"@ibarretxec @Iberia Otra historia, vuelo de Ma..."
5408,Q plomo resultó ni viaje a Londres en @Iberia....


In [914]:
documents_spanish[documents_spanish['text'].str.contains('iberia')]

Unnamed: 0,text
14,Un año de estos nos unirán con el resto de ibe...
66,Hoy iberia cumple 90 años de su primer vuelo. ...
76,@Zane_GH Pues menudo timo @Zane_GH @ibexpress_...
88,@AeroAsturias @aena Hoy en el aeropuerto en la...
91,"Todo en exceso es malo, menos ¡viajar! #HolaEu..."
...,...
7803,"#note8abordo, samsung regala teléfonos a todos..."
7840,"@InfoViajera una duda, viajo a guarulhos. lle..."
7841,Lo de que iberia ponga café aguado y leche en ...
7848,@elespectador Si es iberia voy a pensar que a ...


In [832]:
doc_sample

'@Iberia @KilliMR Pues realojad en vuelos de otras compañias que sabemos perfectamente que podéis o alquilad y fleta… https://t.co/JiUbV3nG4B'

In [834]:
sample = np.random.choice(documents_spanish.index)
doc_sample = documents_spanish.iloc[sample].values[0]
print(f'Original Document: {sample}')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('Lemmatized text')
print(lemmatize(doc_sample,nlp))
print('Clean text')
print(preprocess(doc_sample,nlp))

Original Document: 2134
['Viajo', 'con', '@Iberia', 'Valecia-', 'Melilla', 'con', 'transbordo', 'en', 'Madrid', 'y', 'no', 'me', 'llega', 'la', 'maleta', 'facturada,', 'como', 'siempre!', 'Si', 'no…', 'https://t.co/wsIHbRT6xx']
Lemmatized text
['viajar', 'con', '@iberia', 'valecia-', 'Melilla', 'con', 'transbordo', 'en', 'Madrid', 'y', 'no', 'yo', 'llegar', 'el', 'maleta', 'facturado', ',', 'como', 'siempre', '!', 'si', 'no', '…', 'https://t.co/wsihbrt6xx']
Clean text
['viajar', 'iberio', 'valecia', 'melilla', 'transbordo', 'madrid', 'llegar', 'maleta', 'facturado', 'siempre', 'wsihbrt']


### Text Preprocessing

We are going to pre-process the texts, saving the results in the *processed_docs* object.

In [849]:
processed_docs_spanish = documents_spanish['text'].apply(lambda x: preprocess(x, nlp))

In [867]:
processed_docs_spanish.head(10).values

array([list(['trabajar', 'ryanair', 'tmar', 'ruuarbe', 'empleo']),
       list(['iberia', 'fionaferrer', 'gustar', 'cancún', 'viajar', 'disfrutar', 'manera', 'igual']),
       list(['sabiai', 'iberia', 'tratar', 'bien', 'santiago', 'chile', 'cambiar', 'asiento', 'mandar', 'volar', 'trasero', 'uansbonn']),
       list(['nunca', 'nunca', 'nunca', 'pidar', 'café', 'ryanair', 'bueno', 'vender', 'bordo']),
       list(['cristortu', 'dakar', 'iberia', 'mitsubishies', 'bfgoodricheu', 'burgostur', 'astintlogistics', 'uremovil', 'karbium', 'éxito', 'vrkyvu']),
       list(['mgd', 'wow', 'bonito', 'solo', 'volado', 'uno', 'vez', 'iberia', 'siempre', 'tierra']),
       list(['iberia', 'plus', 'cumplir', 'año', 'querer', 'celebrar', 'él', 'contigo', 'manera', 'especial', 'elegir', 'número', 'favorito', 'wuujr', 'doge']),
       list(['barómetro', 'business', 'iberia', 'vueling', 'compañía', 'aéreo', 'utilizado', 'viaje', 'jyr']),
       list(['iberia', 'felicitación', 'iberia']),
       list(['cbe

### Building the dictionary

We create a dictionary from *processed_docs* that contains the number of times a word appears in the training set.

In [903]:
dictionary_spanish = gensim.corpora.Dictionary(processed_docs_spanish)
count = 0
for k, v in dictionary_spanish.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 empleo
1 ruuarbe
2 ryanair
3 tmar
4 trabajar
5 cancún
6 disfrutar
7 fionaferrer
8 gustar
9 iberia
10 igual


In [904]:
dictionary_spanish.most_common()[:10]

[('iberia', 4265),
 ('ryanair', 2128),
 ('vuelo', 1356),
 ('iberio', 1308),
 ('hacer', 612),
 ('hola', 574),
 ('destino', 545),
 ('poder', 497),
 ('madrid', 456),
 ('mejor', 435)]

In [905]:
dictionary_spanish.most_common()[-10:]

[('chicrevista', 1),
 ('maldad', 1),
 ('tana', 1),
 ('xcroh', 1),
 ('saplj', 1),
 ('especialista', 1),
 ('jguhm', 1),
 ('wbi', 1),
 ('chingar', 1),
 ('dbkvzo', 1)]

Filter the tokens that appear in
less than 15 documents (absolute number) or
more than 0.5 documents (fraction of the total size of the corpus, not an absolute number).
After the previous two steps, keep only the first 100,000 most frequent tokens.

In [906]:
dictionary_spanish.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [907]:
dictionary_spanish.most_common()[:10]

[('ryanair', 2128),
 ('vuelo', 1356),
 ('iberio', 1308),
 ('hacer', 612),
 ('hola', 574),
 ('destino', 545),
 ('poder', 497),
 ('madrid', 456),
 ('mejor', 435),
 ('solo', 434)]

In [908]:
dictionary_spanish.most_common()[-10:]

[('ejemplo', 15),
 ('palma', 15),
 ('estar', 15),
 ('encima', 15),
 ('bilbao', 15),
 ('bogotá', 15),
 ('rumbo', 15),
 ('imaginar', 15),
 ('diferente', 15),
 ('tren', 15)]

### Gensim doc2bow

For each document we create a dictionary that informs how many
words and how many times those words appear.

We put this in the *bow_corpus* object, then check our previously selected document.

In [909]:
bow_corpus = [dictionary_spanish.doc2bow(doc) for doc in processed_docs_spanish]

In [910]:
sample = np.random.choice(documents_spanish.index)

doc_sample = documents_spanish.iloc[sample].values[0]
print(f'\nOriginal Document: {sample}')
print(doc_sample,'\n')

print('Bag of Words (BoW):\n')
print(bow_corpus[sample],'\n')

bow_doc_4310 = bow_corpus[sample]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                               dictionary_spanish[bow_doc_4310[i][0]], 
bow_doc_4310[i][1]))


Original Document: 7427
@Iberia @vueling hola, me sabríais decir en que horarios van a ocurrir las huelgas de vuestros funcionarios? cuanto… https://t.co/XGCz7vMkq4 

Bag of Words (BoW):

[(41, 1), (54, 1), (61, 1), (189, 1), (256, 1), (372, 1), (392, 1), (443, 1)] 

Word 41 ("vueling") appears 1 time.
Word 54 ("hola") appears 1 time.
Word 61 ("decir") appears 1 time.
Word 189 ("huelga") appears 1 time.
Word 256 ("ir") appears 1 time.
Word 372 ("cuanto") appears 1 time.
Word 392 ("ocurrir") appears 1 time.
Word 443 ("horario") appears 1 time.


This is a preview of the bag of words in the preprocessed document.

### TF-IDF

We create a model object *tf-idf* using *models.TfidfModel* from "bow_corpus" and put it in *tfidf*, then we apply the transformation to the whole corpus and call it *corpus_tfidf*. Finally, we preview the *TF-IDF* scores for our first document.

In [859]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

print(processed_docs_spanish[0],'\n')
print(bow_corpus[0],'\n')
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

['trabajar', 'ryanair', 'tmar', 'ruuarbe', 'empleo'] 

[(0, 1), (1, 1), (2, 1)] 

[(0, 0.7085014039582359), (1, 0.17743377739348923), (2, 0.6830395414828386)]


### Running LDA using TF-IDF

In [915]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary_spanish, passes=4, workers=8)

**How many topics should we choose?** You can read [Evaluate Topic Models: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

In [916]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic[:120]))

Topic: 0 Word: 0.020*"querer" + 0.020*"número" + 0.019*"plus" + 0.019*"cumplir" + 0.018*"especial" + 0.018*"elegir" + 0.017*"favorito" 
Topic: 1 Word: 0.019*"vuelo" + 0.016*"seguir" + 0.016*"poder" + 0.014*"ryanair" + 0.012*"iberio" + 0.010*"caso" + 0.010*"ser" + 0.009*"
Topic: 2 Word: 0.033*"ryanair" + 0.028*"spanair" + 0.027*"accidente" + 0.026*"piloto" + 0.026*"españa" + 0.026*"huelga" + 0.026*"así" +
Topic: 3 Word: 0.030*"vuelo" + 0.030*"iberio" + 0.027*"madrid" + 0.021*"ryanair" + 0.019*"vueling" + 0.015*"buen" + 0.013*"vía" + 0.011
Topic: 4 Word: 0.022*"pasar" + 0.019*"iberio" + 0.018*"ir" + 0.018*"ryanair" + 0.016*"hacer" + 0.016*"viajar" + 0.011*"poder" + 0.011*"
Topic: 5 Word: 0.024*"iberio" + 0.022*"hacer" + 0.021*"ryanair" + 0.019*"año" + 0.018*"gracia" + 0.017*"mucho" + 0.015*"feliz" + 0.014*
Topic: 6 Word: 0.055*"ryanair" + 0.018*"express" + 0.014*"empresa" + 0.014*"billete" + 0.014*"decir" + 0.014*"euros" + 0.013*"iberio" +
Topic: 7 Word: 0.020*"ryanair" + 0.016*"iberio" 

Can you distinguish different topics using the words in each topic and their corresponding weights?

### Performance evaluation classifying the sample document using the LDA TF-IDF model.

In [921]:
sample = np.random.choice(documents_spanish.index)

doc_sample = documents_spanish.iloc[sample].values[0]

print(f'\nOriginal Document: {sample}')
print(doc_sample,'\n')

bow_vector = dictionary.doc2bow(preprocess(doc_sample,nlp))
tfidf_vector = tfidf[bow_vector]

for index, score in sorted(lda_model_tfidf[tfidf_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))


Original Document: 2069
VIDEO: Reconstruyen el trágico accidente de Spanair ocurrido en Madrid en 2008 https://t.co/RfXoncfMqI https://t.co/LpoASexogj 

Score: 0.4162733554840088	 Topic: 0.033*"ryanair" + 0.028*"spanair" + 0.027*"accidente" + 0.026*"piloto" + 0.026*"españa"
Score: 0.2856001555919647	 Topic: 0.030*"vuelo" + 0.030*"iberio" + 0.027*"madrid" + 0.021*"ryanair" + 0.019*"vueling"
Score: 0.037270687520504	 Topic: 0.021*"gracias" + 0.021*"ryanair" + 0.018*"iberio" + 0.018*"pasajero" + 0.013*"vuelo"
Score: 0.037269432097673416	 Topic: 0.020*"ryanair" + 0.016*"iberio" + 0.013*"nuevo" + 0.013*"dar" + 0.011*"vuelo"
Score: 0.037266235798597336	 Topic: 0.022*"pasar" + 0.019*"iberio" + 0.018*"ir" + 0.018*"ryanair" + 0.016*"hacer"
Score: 0.037265997380018234	 Topic: 0.019*"vuelo" + 0.016*"seguir" + 0.016*"poder" + 0.014*"ryanair" + 0.012*"iberio"
Score: 0.037264082580804825	 Topic: 0.020*"querer" + 0.020*"número" + 0.019*"plus" + 0.019*"cumplir" + 0.018*"especial"
Score: 0.03726400062

Nuestro documento de prueba tiene la mayor probabilidad de ser parte del tema que asignó nuestro modelo, que es la clasificación precisa.

## Test the model with a document not seen before.

In [922]:
unseen_document = 'Terrible el servicio brindado. No volaré nunca más con ustedes.'

print(f'\nOriginal Document:')
print(unseen_document,'\n')

bow_vector = dictionary.doc2bow(preprocess(doc_sample,nlp))
tfidf_vector = tfidf[bow_vector]
for index, score in sorted(lda_model_tfidf[tfidf_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))


Original Document:
Terrible el servicio brindado. No volaré nunca más con ustedes. 

Score: 0.41381752490997314	 Topic: 0.033*"ryanair" + 0.028*"spanair" + 0.027*"accidente" + 0.026*"piloto" + 0.026*"españa"
Score: 0.28805580735206604	 Topic: 0.030*"vuelo" + 0.030*"iberio" + 0.027*"madrid" + 0.021*"ryanair" + 0.019*"vueling"
Score: 0.03727062791585922	 Topic: 0.021*"gracias" + 0.021*"ryanair" + 0.018*"iberio" + 0.018*"pasajero" + 0.013*"vuelo"
Score: 0.03726978972554207	 Topic: 0.020*"ryanair" + 0.016*"iberio" + 0.013*"nuevo" + 0.013*"dar" + 0.011*"vuelo"
Score: 0.03726619854569435	 Topic: 0.022*"pasar" + 0.019*"iberio" + 0.018*"ir" + 0.018*"ryanair" + 0.016*"hacer"
Score: 0.03726596385240555	 Topic: 0.019*"vuelo" + 0.016*"seguir" + 0.016*"poder" + 0.014*"ryanair" + 0.012*"iberio"
Score: 0.037264056503772736	 Topic: 0.020*"querer" + 0.020*"número" + 0.019*"plus" + 0.019*"cumplir" + 0.018*"especial"
Score: 0.03726397454738617	 Topic: 0.024*"iberio" + 0.022*"hacer" + 0.021*"ryanair" + 0

[[Go Back]](#Content)