# 5. Topic Modeling
#### Juan Julián Cea Morán

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Electronic-Arts-Logo.svg/1200px-Electronic-Arts-Logo.svg.png" width=100px>

---
Topic Modeling is a classic NLP task that consists of discovering the abstract topics hidden in a corpus. This is useful, for example, to understand our data or to perform unsupervised classification.

Due to the nature of the data we are working with, Topic Modeling can be performed in different ways. The first approach is to study the topics for each language present in each of the 4 contexts. The other option is to carry out Topic Modeling for each context without taking the different languages into account.

In this case, the second option is chosen, although certain key points must be taken into account to implement it correctly. The main problem to solve is how to represent words from different languages in the same vector space, so that they are related to each other according to the context and independently of the language.

---

## Prepare data
As we saw in the classification notebook, the first step is to prepare the data. The main difference is that in this case we don't need to partition the dataset, since this is not the same kind of problem.

First of all, we have to import the data.

In [2]:
import pickle
import pandas as pd

# Load the preprocessed DataFrame; the context manager closes the file afterwards
with open("../data/preproc_df.pkl", "rb") as f:
    preproc_df = pickle.load(f)

In [3]:
preproc_df.head()

| | Preprocessed | Lang | Category |
|---|---|---|---|
| 0 | [read, book, town, everyone, uses, order, phar... | en | APR |
| 1 | [recipes, appreciated, family, small, large, r... | en | APR |
| 2 | [say, ease, author, even, made, effort, meet, ... | en | APR |
| 3 | [milady, found, good, vein, anita, blake, base... | en | APR |
| 4 | [somewhere, greece, gentlemen, decided, visit,... | en | APR |


As I said before, I'm going to perform Topic Modeling for the four different contexts in the data, so that the results show the abstract topics within those categories. This means that it is necessary to split the data by context/category.


In [34]:
# APR
apr_en_data = preproc_df.loc[(preproc_df['Category'] == 'APR') & (preproc_df['Lang'] == 'en')]['Preprocessed'].tolist()
apr_fr_data = preproc_df.loc[(preproc_df['Category'] == 'APR') & (preproc_df['Lang'] == 'fr')]['Preprocessed'].tolist()

# Conference_papers
conf_en_data = preproc_df.loc[(preproc_df['Category'] == 'Conference_papers') & (preproc_df['Lang'] == 'en')]['Preprocessed'].tolist()
conf_fr_data = preproc_df.loc[(preproc_df['Category'] == 'Conference_papers') & (preproc_df['Lang'] == 'fr')]['Preprocessed'].tolist()

# PAN11
pan_en_data = preproc_df.loc[(preproc_df['Category'] == 'PAN11') & (preproc_df['Lang'] == 'en')]['Preprocessed'].tolist()
pan_es_data = preproc_df.loc[(preproc_df['Category'] == 'PAN11') & (preproc_df['Lang'] == 'es')]['Preprocessed'].tolist()

# Wikipedia
wiki_en_data = preproc_df.loc[(preproc_df['Category'] == 'Wikipedia') & (preproc_df['Lang'] == 'en')]['Preprocessed'].tolist()
wiki_es_data = preproc_df.loc[(preproc_df['Category'] == 'Wikipedia') & (preproc_df['Lang'] == 'es')]['Preprocessed'].tolist()
wiki_fr_data = preproc_df.loc[(preproc_df['Category'] == 'Wikipedia') & (preproc_df['Lang'] == 'fr')]['Preprocessed'].tolist()
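
The same split can also be written by grouping on both columns at once. This is only a minimal sketch, assuming the `Category`, `Lang` and `Preprocessed` columns shown above; the `split_by_context_and_language` helper and the toy DataFrame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def split_by_context_and_language(df):
    """Return {(category, lang): list of token lists} for every pair in df."""
    groups = df.groupby(["Category", "Lang"])["Preprocessed"]
    return {key: docs.tolist() for key, docs in groups}


# Toy stand-in for preproc_df (hypothetical rows, same column names).
toy_df = pd.DataFrame({
    "Preprocessed": [["read", "book"], ["film", "actors"], ["livre", "auteur"]],
    "Lang": ["en", "en", "fr"],
    "Category": ["APR", "APR", "APR"],
})

data_by_pair = split_by_context_and_language(toy_df)
toy_apr_en = data_by_pair[("APR", "en")]
```

With the real data, `split_by_context_and_language(preproc_df)` would return one entry per context-language pair, so no pair has to be typed by hand.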

We could also use bigrams or trigrams (n-grams), as well as lemmatization or stemming, in the preprocessing step. As I said in the conclusions of the 2. Preprocessing notebook, this is left as future work because those operations are language dependent, so the model would need a language identifier to preprocess new samples.

---
## Different Approaches
There are different approaches to Topic Modeling. It's worth spending some time reviewing them in order to choose an appropriate one.

* **Bag of Words + LDA/NMF/LSI/etc:** When working with LDA (or similar algorithms), the usual vectorization model is Bag of Words. This model builds a dictionary with all the different words found in the corpus. Then it builds a sparse vector for each document, with as many dimensions as there are words in the dictionary. In the vector representing a document, a 1 is set at the position of a word if the document contains that word (or, more commonly, the number of times the word occurs). This kind of vectorization produces an extremely high-dimensional representation of the documents. Another problem of this approach is that it carries no information about how the different terms in the documents are related, just like a classic one-hot encoding. However, using algorithms such as LDA or NMF to reduce the dimensionality of BoW models is widespread in the literature and very effective when dealing with large amounts of data.

* **Word embeddings:** This is not Topic Modeling *per se*, but it can be used to represent keyword clusters. This model addresses the problem of capturing semantic dependencies between the words in the corpus, so that the spatial representation is consistent with the real world. For example, in a classic one-hot encoding the word *cat* would be at the same distance from *dog* as from *pencil*. The main idea behind Word Embeddings is that when you represent those words, *cat* and *dog* are closer to each other than they are to *pencil*. So a possible approach would be to use word embeddings plus a dimensionality reduction method like PCA, so that the result is a 2 or 3 dimensional space with word clusters.
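
The two points above can be illustrated with a few lines of plain Python. This is a minimal sketch on a hypothetical toy corpus, not the vectorization used later in this notebook: it builds binary Bag of Words vectors and shows that all distinct one-hot vectors are equally distant from each other.

```python
import math

# Hypothetical toy corpus of already tokenized documents.
docs = [
    ["cat", "dog", "cat"],
    ["pencil", "dog"],
]

# Dictionary: one integer id per distinct word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc})
word2id = {word: idx for idx, word in enumerate(vocabulary)}


def bag_of_words(doc):
    """Binary Bag of Words vector: 1 if the word occurs in doc, else 0."""
    vector = [0] * len(word2id)
    for word in doc:
        vector[word2id[word]] = 1
    return vector


def one_hot(word):
    """One-hot vector for a single word."""
    return bag_of_words([word])


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


doc_vectors = [bag_of_words(doc) for doc in docs]

# Every pair of distinct one-hot vectors is equally far apart,
# so the encoding says nothing about which words are related.
cat_dog = euclidean(one_hot("cat"), one_hot("dog"))
cat_pencil = euclidean(one_hot("cat"), one_hot("pencil"))
```

A word embedding replaces these sparse vectors with dense ones learned from data, which is what would allow `cat_dog` to become smaller than `cat_pencil`.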

We have to take into consideration that the first approach is not eligible for multilingual topic modeling, so we should build a model for each language and each context. However, there are several extensions to the classic LDA that address multilingual topic modeling. They can be seen in this review: http://papers.nips.cc/paper/4583-symmetric-correspondence-topic-models-for-multilingual-text-analysis.pdf. Nevertheless, there is no implementation available, so let's save this for future work.

Regarding the second option, there are some proposals applicable to multilingual data called **multilingual word embeddings**: https://www.aclweb.org/anthology/D18-1024/ and https://github.com/facebookresearch/MUSE. Like the LDA extensions, this solution will be addressed in future iterations.

---

## Implementing LDA with language dependence

As I said before, the most straightforward solution consists of performing LDA for each category and each language independently. LDA (*Latent Dirichlet Allocation*) is an unsupervised dimensionality reduction algorithm that considers each document in the corpus as a mixture of hidden topics, and each of those topics as a mixture of keywords.
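
To make the "mixture of topics, mixture of keywords" idea concrete, the generative story behind LDA can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not a training algorithm: the two topics and their keyword probabilities are hypothetical and fixed by hand, whereas LDA treats them as unknowns to be inferred from the corpus.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. The document gets its own mixture over topics.
    topic_mix = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Each word position first picks a topic from that mixture...
        topic_id = rng.choices(range(len(topics)), weights=topic_mix)[0]
        # 3. ...and then picks a word from that topic's keyword distribution.
        vocab, probs = zip(*topics[topic_id].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return topic_mix, words


# Hypothetical topics: each one is a probability distribution over keywords.
topics = [
    {"book": 0.5, "novel": 0.3, "story": 0.2},
    {"film": 0.6, "actors": 0.4},
]

rng = random.Random(0)
topic_mix, document = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
```

Training LDA goes in the opposite direction: given only the documents, it estimates the topic mixtures and the keyword distributions that most plausibly generated them.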

### 1. APR documents
Let's start with Topic Modeling for the APR documents.

#### 1.1. English
The first step is to create the Bag of Words model.

In [37]:
%%time
from gensim import corpora, models

id2word = corpora.Dictionary(apr_en_data)
id2word.filter_extremes(no_below=12)
corpus = [id2word.doc2bow(sample) for sample in apr_en_data]

Wall time: 426 ms
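
The `filter_extremes(no_below=12)` call above drops tokens that appear in fewer than 12 documents. Since `no_above` is left at gensim's default of 0.5, tokens that appear in more than half of the documents are dropped as well. The following is a rough plain-Python sketch of that document-frequency filter on a hypothetical toy corpus. It is not gensim's implementation, and it ignores the `keep_n` cap on the vocabulary size.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens that appear in at least `no_below` documents
    and in at most `no_above` (a fraction) of all documents."""
    # Count each token once per document it appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Hypothetical toy corpus.
toy_docs = [
    ["the", "book", "story"],
    ["the", "book", "novel"],
    ["the", "film", "actors"],
    ["the", "film", "story"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
```

Here `the` is removed for being too frequent, while `novel` and `actors` are removed for being too rare.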


Now, let's train the LDA model with 10 topics.

**Update:** I started with 10 topics; after studying the results, I concluded that the optimal number of topics to prevent overlapping and meaningless topics is 3.

In [38]:
%%time
num_topics = 3

lda_model = models.LdaModel(corpus, num_topics, id2word=id2word, passes=4)

Wall time: 15 s


Let's plot the topics in an interactive 2-dimensional space.

In [39]:
%matplotlib inline
import pyLDAvis
# Note: recent pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=id2word)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

**APR - English Conclusions**

As can be seen, we have performed Topic Modeling on the APR documents in English.
Over the course of the tests, we saw that the 10 initial topics were too many for this corpus, since there were a lot of overlapping clusters. After several tests, we found that the optimal number of topics was 3.

Regarding the keywords of each cluster, we can profile them as follows:
* Topic 1: Books and literature (book, read, story, characters, novel, ...)
* Topic 2: Music (album, music, group, rock, indie, ...)
* Topic 3: Films and Cinema (film, movie, actors, see, image, ...)
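
The overlap between topics was judged here by inspecting the plot. A rough way to put a number on it is the Jaccard overlap between the top keywords of each pair of topics. This is only a sketch of that heuristic, not the method used in this notebook. The `keyword_overlap` helper is hypothetical, and the keyword lists are the ones listed above. With a trained model they could instead be taken from gensim's `lda_model.show_topic(topic_id, topn=...)`.

```python
from itertools import combinations


def keyword_overlap(topic_keywords):
    """Jaccard overlap between the top-keyword sets of every pair of topics."""
    overlaps = {}
    for (i, a), (j, b) in combinations(enumerate(topic_keywords), 2):
        set_a, set_b = set(a), set(b)
        overlaps[(i, j)] = len(set_a & set_b) / len(set_a | set_b)
    return overlaps


# Top keywords of the three APR English topics listed above.
top_keywords = [
    ["book", "read", "story", "characters", "novel"],
    ["album", "music", "group", "rock", "indie"],
    ["film", "movie", "actors", "see", "image"],
]
overlaps = keyword_overlap(top_keywords)
```

Values close to 0 mean the topics share almost no top keywords. Values close to 1 suggest that two topics are near duplicates and that the number of topics may be too high.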

#### 1.2. French

Let's repeat the same process.


In [50]:
%%time
from gensim import corpora, models

id2word = corpora.Dictionary(apr_fr_data)
id2word.filter_extremes(no_below=12)
corpus = [id2word.doc2bow(sample) for sample in apr_fr_data]

num_topics = 3

lda_model = models.LdaModel(corpus, num_topics, id2word=id2word, passes=4)

Wall time: 8.05 s


In [51]:
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=id2word)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

**APR - French Conclusions**

It is worth mentioning that the French results are less accurate, since there are a lot of noisy keywords. I tested different extreme-filter values in the Bag of Words phase, as well as different numbers of LDA training passes, and these are the best results.

We can profile the same topics as in the English case, but there is still a lot of work to do in the French preprocessing (for example, taking some of the top keywords and including them in the stopword list).
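
Extending the stopword list can be sketched as follows. This is a minimal illustration: the `remove_extra_stopwords` helper is hypothetical, and the tokens are placeholders rather than the actual noisy keywords found in the French results.

```python
def remove_extra_stopwords(docs, extra_stopwords):
    """Drop every token that appears in the extra stopword set."""
    extra_stopwords = set(extra_stopwords)
    return [[tok for tok in doc if tok not in extra_stopwords] for doc in docs]


# Placeholder tokens; in practice these would be the noisy top keywords
# spotted in the pyLDAvis output.
noisy_keywords = {"plus", "tout"}
toy_fr_docs = [["livre", "plus", "auteur"], ["tout", "film", "acteurs"]]
cleaned_docs = remove_extra_stopwords(toy_fr_docs, noisy_keywords)
```

The cleaned documents would then go through the same Bag of Words and LDA steps as before.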

---
### 2. Wikipedia
#### 2.1. English

In [62]:
%%time
from gensim import corpora, models

id2word = corpora.Dictionary(wiki_en_data)
id2word.filter_extremes(no_below=12)
corpus = [id2word.doc2bow(sample) for sample in wiki_en_data]

Wall time: 7.4 s


In [65]:
%%time
num_topics = 6

lda_model = models.LdaModel(corpus, num_topics, id2word=id2word, passes=4)

Wall time: 1min 37s


In [66]:
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=id2word)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

**Wikipedia - English Conclusions**

As can be seen, we have performed Topic Modeling on the Wikipedia documents in English.
After several tests, we found that the optimal number of topics was 6.

Regarding the keywords of each cluster, we can profile them as follows:
* Topic 1: History and War (war, military, government, army, king, german, russian, ...)
* Topic 2: Music and Cinema (album, music, band, song, film, television, released, ...)
* Topic 3: Football (city, cup, football, club, players, ...)
* Topic 4: University and Knowledge (language, theory, science, philosophy, book, ...)
* Topic 5: Software and Technology (software, computer, engine, system, game, ...)
* Topic 6: Natural Sciences (species, water, plants, birds, fish, energy, stars...)

#### 2.2. Spanish


In [74]:
%%time
from gensim import corpora, models

id2word = corpora.Dictionary(wiki_es_data)
id2word.filter_extremes(no_below=12)
corpus = [id2word.doc2bow(sample) for sample in wiki_es_data]

Wall time: 8.86 s


In [77]:
%%time
num_topics = 7

lda_model = models.LdaModel(corpus, num_topics, id2word=id2word, passes=4)

Wall time: 1min


In [78]:
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=id2word)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

**Wikipedia - Spanish Conclusions**

As can be seen, we have performed Topic Modeling on the Wikipedia documents in Spanish.
After several tests, we found that the optimal number of topics was 7.

Regarding the keywords of each cluster, we can profile them as follows:
* Topic 1: Geography and History (guerra, ciudad, población, provincia, imperio, ...)
* Topic 2: Religion and Spirituality (dios, vida, libro, tiempo, espíritu, ...)
* Topic 3: Music (album, disco, banda, musica, cancion, ...)
* Topic 4: Medicine (cancer, energía, pulmon, tratamiento, células, ...)
* Topic 5: Software and Cinema (windows, linux, pelicula, version, cine, ...). This topic is not so accurate.
* Topic 6: Football (equipo, futbol, temporada, copa, partido, ...)
* Topic 7: Natural Sciences (especies, agua, hojas, aves, plantas, animales, ...)

---
**Note:** The remaining context-language pairs should be processed in the same manually supervised way until their inner topics are discovered.