# 4. Topic Modelling
### Juan Julián Cea Morán

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Electronic-Arts-Logo.svg/1200px-Electronic-Arts-Logo.svg.png" width=100px>

---
Topic Modelling is a classic task in the NLP field that consists of discovering abstract topics hidden in a corpus. This is useful, for example, to understand our data or to perform unsupervised classification tasks.

Given the nature of the data we are working with, Topic Modelling can be performed in different ways. The first approach would be to study the topics for each language present in each of the 4 contexts. The other option is to carry out Topic Modelling for each context without taking the different languages into account.

Here, the second option is chosen, although certain key points must be taken into account to implement it correctly. The main problem is to represent words from different languages in the same vector space, so that they are related to each other according to the context, independently of the language. For this, we will use multilingual word embeddings.
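To make the idea of a shared vector space concrete, here is a minimal sketch. The three-dimensional vectors are hypothetical values made up for illustration (real multilingual embeddings are learned from data and have hundreds of dimensions); the point is only that a word and its translation should end up close together, while an unrelated word should not.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
toy_embeddings = {
    "cat": [0.90, 0.10, 0.00],     # English
    "gato": [0.85, 0.15, 0.05],    # Spanish for "cat"
    "pencil": [0.05, 0.10, 0.95],  # English, unrelated concept
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_translation = cosine_similarity(toy_embeddings["cat"], toy_embeddings["gato"])
sim_unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["pencil"])

print(f"cat vs gato:   {sim_translation:.3f}")
print(f"cat vs pencil: {sim_unrelated:.3f}")
```

With vectors like these, the translation pair scores much higher than the unrelated pair, which is the property we need for language-independent topics.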

Regarding the model, there are several options: LDA, NMF, LSI, etc. We are going to start with LDA since it is the most widely used.

---

## Prepare data
As we saw in the classification notebook, the first step is to prepare the data. The main difference is that in this case we don't need to partition our dataset into training and test sets, since Topic Modelling is an unsupervised problem.

First of all, we have to import the data.

In [3]:
import pickle
import pandas as pd

# Load the preprocessed DataFrame; the context manager closes the file afterwards
with open("../data/preproc_df.pkl", "rb") as f:
    preproc_df = pickle.load(f)

In [4]:
preproc_df.head()

| | Preprocessed | Category |
|---|---|---|
| 0 | [read, book, town, everyone, uses, order, phar... | APR |
| 1 | [recipes, appreciated, family, small, large, r... | APR |
| 2 | [say, ease, author, even, made, effort, meet, ... | APR |
| 3 | [milady, found, good, vein, anita, blake, base... | APR |
| 4 | [somewhere, greece, gentlemen, decided, visit,... | APR |


As I said before, I'm going to perform Topic Modelling for the four different contexts in the data, so the results show abstract topics within those categories. This means that it is necessary to split the data by context/category.


In [5]:
apr_data = [' '.join(text) for text in preproc_df.loc[(preproc_df['Category'] == 'APR')]['Preprocessed']]
conference_data = [' '.join(text) for text in preproc_df.loc[(preproc_df['Category'] == 'Conference_papers')]['Preprocessed']]
pan_data = [' '.join(text) for text in preproc_df.loc[(preproc_df['Category'] == 'PAN11')]['Preprocessed']]
wiki_data = [' '.join(text) for text in preproc_df.loc[(preproc_df['Category'] == 'Wikipedia')]['Preprocessed']]

We could also use bigrams or trigrams (n-grams), as well as lemmatization or stemming, at the preprocessing step. As I said in the conclusions of the 2. Preprocessing notebook, this is left as future work.
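As a rough illustration of what n-grams would add, the sketch below builds every contiguous bigram and trigram from a token list. The `ngrams` helper and the toy tokens are assumptions for this example and are not part of the preprocessing notebook; a real pipeline would usually keep only n-grams that occur frequently in the corpus.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy token list, in the same format as the 'Preprocessed' column
tokens = ["topic", "modelling", "with", "word", "embeddings"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

The resulting n-grams could be appended to the original tokens before vectorization, so that expressions such as `topic_modelling` are treated as a single term.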

---
## Different Approaches
* **Bag of Words + LDA:** When working with LDA, the usual vectorization model is Bag of Words (BoW). This model builds a dictionary with all the different words found in the corpus. Then, it builds a sparse vector for each document, with as many dimensions as there are words in the dictionary. In each of those vectors, the position of a given word holds the number of times the document contains that word (or simply a 1 in the binary variant). This kind of vectorization produces an extremely high-dimensional representation of the documents. Another problem of this approach is that it carries no information about how the different terms in the documents are related, just like a classic one-hot encoding. However, using algorithms such as LDA or NMF to reduce the dimensionality of BoW models is common in the literature, and very effective when dealing with large amounts of data.

* **Word embeddings:** Another, more recent option is word embeddings. This model addresses the problem of capturing semantic dependencies between words in the corpus, so that the spatial representation is consistent with the real world. For example, in a classic one-hot encoding the word *cat* would be at the same distance from *dog* as from *pencil*. The main idea behind word embeddings is that, when you represent those words, *cat* and *dog* are closer to each other than they are to *pencil*.
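The two points above can be sketched in a few lines of plain Python. The toy documents and helper names (`bow_vector`, `one_hot`) are assumptions for this example; the notebook itself may rely on a library vectorizer instead. The first part builds count and binary BoW vectors, and the second part shows that in a one-hot encoding every pair of distinct words is equally far apart.

```python
import math
from collections import Counter

# Toy tokenized documents (illustrative only)
docs = [
    ["cat", "dog", "cat"],
    ["dog", "pencil"],
]

# Dictionary: one vector position per distinct word in the corpus
vocabulary = sorted({word for doc in docs for word in doc})
word_index = {word: i for i, word in enumerate(vocabulary)}

def bow_vector(doc, binary=False):
    """Bag-of-Words vector for one document (word counts, or 0/1 if binary)."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc).items():
        vector[word_index[word]] = 1 if binary else count
    return vector

count_vectors = [bow_vector(doc) for doc in docs]
binary_vectors = [bow_vector(doc, binary=True) for doc in docs]

# One-hot encoding of single words: all distinct words are equidistant
one_hot = {w: bow_vector([w]) for w in vocabulary}
dist_cat_dog = math.dist(one_hot["cat"], one_hot["dog"])
dist_cat_pencil = math.dist(one_hot["cat"], one_hot["pencil"])

print(vocabulary)
print(count_vectors, binary_vectors)
print(dist_cat_dog, dist_cat_pencil)
```

Both distances are equal, which is exactly the limitation that word embeddings are meant to overcome.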


---
## Apply LDA
Once we have the data, it's time to apply the model. As I said, the first model to be tested is LDA, since it is widely used for this task. LDA considers each document in the corpus as a mixture of hidden topics, and each of those topics as a mixture of keywords.

Let's start by performing Topic Modelling on the Wikipedia docs.

### Vectorization

---

**Conclusions of Classification task**
