`Zoumana KEITA, Data Scientist`

# Latent Dirichlet Allocation (LDA)

**Note**: you will need to unzip the data from the `data` folder in order to follow this notebook.  

LDA is a probabilistic model that assigns documents to topics (clusters).  
It relies on two sets of probabilities: 
- **P(word | topic)**: the probability that a particular word is associated with a particular topic. This first set of probabilities forms the **Word X Topic** matrix.  
- **P(topic | document)**: the probability that a particular topic is associated with a particular document. This second set of probabilities forms the **Topics X Documents** matrix.   

These probability values are calculated for all words, topics and documents.    
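
As a minimal sketch of these two matrices, the toy example below fits scikit-learn's `LatentDirichletAllocation` on four made-up headlines. The corpus and all `toy_*` names are illustrative assumptions, not part of the dataset. Normalizing each row of `components_` gives an estimate of P(word | topic), and the transformed output gives P(topic | document). Note that scikit-learn stores these with topics as rows and documents as rows respectively, so the shapes are the transpose of the names used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, used only to illustrate the matrix shapes
toy_docs = [
    "government announces new budget",
    "minister defends budget cuts",
    "team wins grand final",
    "coach praises team after final",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)      # document x word counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_dtm)    # one row per document: P(topic | document)

# components_ holds unnormalized topic-word weights; normalize each row
# so that it sums to 1 and can be read as P(word | topic)
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(topic_word.shape)  # (number of topics, vocabulary size)
print(doc_topic.shape)   # (number of documents, number of topics)
```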

For this tutorial, we will be using the Australian Broadcasting Corporation headlines dataset, available on Kaggle:   
https://www.kaggle.com/therohk/million-headlines 

## Import Useful Libraries 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Load the Dataset

In [None]:
news_data = pd.read_csv("../input/news-data.csv")
news_data.shape

In [None]:
news_data.head()

Our data has over a million records and two columns: 
- the date on which a headline was published.  
- the headline text itself.   

The first 5 rows show that the headlines come with no topic label, so we will use LDA to try to uncover topic clusters in the news.   
**A million** records is a lot of data, so we will use only **12000** of them to keep the computation fast.   

## Preprocessing

In [None]:
NUM_SAMPLES = 12000  # Number of headlines to sample
sample_df = news_data.sample(NUM_SAMPLES, replace=False).reset_index(drop=True)
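
Because the sample is drawn at random, the vocabulary size and topic assignments reported later will vary from run to run. A minimal sketch of one way to make the draw repeatable is to pass `random_state` to `DataFrame.sample`. The `demo` frame and the seed value 42 below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for news_data
demo = pd.DataFrame({"headline_text": [f"headline {i}" for i in range(100)]})

# Two draws with the same seed select the same rows in the same order
first = demo.sample(10, replace=False, random_state=42).reset_index(drop=True)
second = demo.sample(10, replace=False, random_state=42).reset_index(drop=True)

print(first.equals(second))  # True
```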

In [None]:
sample_df.shape

In [None]:
sample_df.head()

We are not interested in the **publish_date** column, since we will only be using the **headline_text** data.    

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.     


By defining the **CountVectorizer** object as below, we ignore:   
- all terms that appear in more than 95% of the documents in our corpus. Terms above this threshold carry little signal; most of them behave like `stopwords`.   
- all terms that appear in fewer than 2 documents.  
- the common English words in scikit-learn's built-in stop word list (`stop_words="english"`).  
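
To make the two document-frequency thresholds concrete, here is a small sketch on a made-up four-document corpus (the `docs` list is an assumption for illustration). The word "news" appears in every document and is removed by `max_df=0.95`; words that appear in a single document are removed by `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: "news" is in every document,
# "police" and "council" are in two documents each, the rest in one
docs = [
    "news police investigate crash",
    "news police charge driver",
    "news council approves plan",
    "news council rare decision",
]

# No filtering: every token is kept
full_vocab = CountVectorizer().fit(docs).vocabulary_

# max_df=0.95 drops terms present in more than 95% of documents;
# min_df=2 drops terms present in fewer than 2 documents
filtered_vocab = CountVectorizer(max_df=0.95, min_df=2).fit(docs).vocabulary_

print(len(full_vocab))         # 11
print(sorted(filtered_vocab))  # ['council', 'police']
```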

In [None]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = cv.fit_transform(sample_df['headline_text'])

In [None]:
dtm

We can see that our Document X Term Matrix (dtm) has:  
- 12000 documents, and  
- 6506 distinct words.   

We can also get all those words using the `get_feature_names_out()` method (named `get_feature_names()` in scikit-learn versions before 1.0).

In [None]:
feature_names = cv.get_feature_names_out()
len(feature_names)  # Total number of distinct words

Let's have a look at some of the features that have been extracted from the documents.  

In [None]:
feature_names[6500:]

## LDA     
From our DTM, we can now build our LDA model to extract topics from the underlying texts. The number of topics to extract is a hyperparameter, so we cannot know it at a glance. In our case, we will use 7 topics.   
LDA is an iterative algorithm; we will run 30 iterations in our case, while the default value is 10.  
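
Since the number of topics has to be chosen by hand, one possible heuristic (not the one used in this notebook) is to fit a model for several candidate values and compare their perplexity, where lower is usually better. The sketch below does this on a made-up corpus; the `docs` list and the candidate values 2, 3 and 4 are assumptions. It scores the training data for brevity, whereas a held-out set would give a fairer comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus; in the notebook this would be the sampled headlines
docs = [
    "government announces new budget",
    "minister defends budget cuts",
    "team wins grand final",
    "coach praises team after final",
    "police investigate highway crash",
    "police charge driver after crash",
    "farmers welcome heavy rain",
    "drought eases after rain",
]
counts = CountVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and record its perplexity
perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=30, random_state=42)
    model.fit(counts)
    perplexities[k] = model.perplexity(counts)

for k, value in perplexities.items():
    print(f"{k} topics: perplexity = {value:.1f}")
```

Perplexity does not always agree with how interpretable the topics look, so it is worth reading the top words of each candidate model as well.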

In [None]:
NUM_TOPICS = 7 
LDA_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=30, random_state=42)

In [None]:
LDA_model.fit(dtm)

### Show Stored Words   
Let's have a look at a random selection of the words that have been stored.  

In [None]:
len(feature_names)

In [None]:
import random

# Print 15 words picked at random from the vocabulary
for _ in range(15):
    random_word_ID = random.randrange(len(feature_names))  # valid indices: 0 .. len - 1
    print(feature_names[random_word_ID])

### Top Words Per Topic

In [None]:
len(LDA_model.components_[0])

In [None]:
# Pick a single topic 
a_topic = LDA_model.components_[0]

# Get the indices that would sort this array
a_topic.argsort()

In [None]:
# Weight of the word least representative of this topic (first index returned by argsort)
a_topic[a_topic.argsort()[0]]

In [None]:
# Weight of the word most representative of this topic (last index returned by argsort)
a_topic[a_topic.argsort()[-1]]

Let's have a look at the top 10 words for the topic we picked above.

In [None]:
# argsort sorts in ascending order, so the last 10 indices are the 10 highest-weight words
top_10_words_indices = a_topic.argsort()[-10:]

for i in top_10_words_indices:
    print(feature_names[i])
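
The slicing above relies on how `argsort` orders indices. Here is a small self-contained sketch with made-up weights (the `words` and `weights` arrays are assumptions) showing the ascending order and how to reverse it so that the strongest word comes first.

```python
import numpy as np

# Hypothetical topic weights for a five-word vocabulary
words = np.array(["rain", "court", "police", "farm", "election"])
weights = np.array([0.5, 3.2, 4.1, 0.1, 2.0])

order = weights.argsort()   # indices from smallest to largest weight
top_3 = order[-3:][::-1]    # last three indices, reversed: largest weight first

print(order)         # [3 0 4 1 2]
print(words[top_3])  # ['police' 'court' 'election']
```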

This looks like a topic about government news. Let's have a look at all 7 topics found. 

In [None]:
for i, topic in enumerate(LDA_model.components_):
    print("THE TOP {} WORDS FOR TOPIC #{}".format(10, i))
    print([feature_names[index] for index in topic.argsort()[-10:]])
    print("\n")

### Attach Discovered Topic Labels to Original News

In [None]:
final_topics = LDA_model.transform(dtm)
final_topics.shape

**final_topics** contains, for each of our 12000 documents, the probability that the document belongs to each of the 7 topics. This is a Document X Topics matrix. 
For example, below are the probability values for the first document.

In [None]:
final_topics[0]

In [None]:
final_topics[0].argmax()

This value (4) means that our LDA model assigns the first document to topic #4. Topic numbers start at 0, as in the loop above, so this is the fifth topic in the list.
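
As a quick illustration of how `argmax` picks the topic, the sketch below uses a made-up probability row for one document (the values in `doc_probs` are assumptions, not model output).

```python
import numpy as np

# Hypothetical topic probabilities for one document (7 topics, summing to 1)
doc_probs = np.array([0.02, 0.03, 0.05, 0.10, 0.70, 0.06, 0.04])

best_topic = doc_probs.argmax()  # position of the largest probability, counting from 0
print(best_topic)                # 4
print(doc_probs[best_topic])     # 0.7
```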

### Combination with the original data     
Let's create a new column called **Topic N°** that holds the number of the topic to which each document belongs.

In [None]:
sample_df["Topic N°"] = final_topics.argmax(axis=1)

In [None]:
sample_df.head()

According to our LDA model:   
- the first document belongs to topic #4.  
- the second document belongs to topic #4. 
- the third document belongs to topic #6.  

And so on for the remaining documents.   
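
Once every document carries a topic number, a natural follow-up is to check how many documents fall into each topic. The sketch below shows one way to do that with `value_counts`; `demo_df` is a hypothetical stand-in for `sample_df` with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for sample_df with the "Topic N°" column already filled
demo_df = pd.DataFrame({
    "headline_text": ["a", "b", "c", "d", "e"],
    "Topic N°": [4, 4, 6, 0, 4],
})

# Number of documents assigned to each topic, ordered by topic number
topic_sizes = demo_df["Topic N°"].value_counts().sort_index()
print(topic_sizes)
```

A very uneven distribution can be a hint that the chosen number of topics is worth revisiting.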

## Some Visualization       
We will be using the `pyLDAvis` module to visualize the topics associated with our documents.   

In [None]:
# Note: recent pyLDAvis releases (3.4.0 and later) ship this module as pyLDAvis.lda_model
import pyLDAvis.sklearn

In [None]:
pyLDAvis.enable_notebook() # To enable the visualization on the notebook

In [None]:
panel = pyLDAvis.sklearn.prepare(LDA_model, dtm, cv, mds='tsne') # Create the panel for the visualization
panel

### Some Comments On The Graphic     

- By selecting a particular term on the right, we can see which topic(s) it belongs to.    
- Conversely, by choosing a topic on the left, we can see its terms ranked from most to least relevant.  

**If you liked this kernel, please upvote. I am also open to suggestions**