# Latent Dirichlet Allocation (LDA)

This notebook walks through an example of training an LDA topic model on news articles and then using it to infer which topic each article belongs to.

In [1]:
import pandas as pd

npr = pd.read_csv('../UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Sample article:

In [2]:
npr.Article[0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

From the content above, article 0 appears to be about politics.

Now, let's create the document term matrix.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# max_df=0.9 discards words that appear in more than 90% of the documents,
# which gets rid of very common words.
# min_df=2 keeps only words that appear in at least 2 documents.
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr.Article)
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
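To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a made-up four-sentence corpus. The sentences and the `toy_` names are illustrative assumptions, not part of the NPR data, and stop-word removal is left out on purpose so that the `max_df` filter has something to remove.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]

# Same thresholds as above, but without stop_words so that the
# effect of max_df is visible on "the"
toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index in the matrix
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.toarray())
```

In this sketch "the" is dropped because it appears in all four documents (more than 90%), and words such as "mat" or "bird" are dropped because they appear in only one document. Only "cat", "dog", "on" and "sat" remain as columns.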

With the document term matrix we can now train the LDA model using this vocabulary.

In [6]:
%%time
from sklearn.decomposition import LatentDirichletAllocation

# n_components is the number of topics I want to detect
LDA = LatentDirichletAllocation(n_components=7, random_state=42)
LDA.fit(dtm)

CPU times: user 8min 34s, sys: 6.24 s, total: 8min 40s
Wall time: 1min 27s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
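Since fitting on the full matrix takes a while, it can help to see the shapes involved on a tiny input first. The following is only a sketch: the count matrix `toy_counts` is made up, and two topics are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 4 documents x 5 terms
toy_counts = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 2, 1],
    [0, 1, 3, 3, 0],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# One row per topic, one column per term
print(toy_lda.components_.shape)

# One row per document, one column per topic
toy_doc_topics = toy_lda.transform(toy_counts)
print(toy_doc_topics.shape)

# Each document's topic weights form a distribution
print(toy_doc_topics.sum(axis=1))
```

The same two arrays, at full size, are what the rest of the notebook inspects.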

Next, we extract the values the model learned and match them against the vocabulary to get an idea of what the topics are.

### Step 1: Grab the vocabulary of words
We can rely on the count vectorizer we fitted earlier, since it holds the reference for every term in the document term matrix.

In [7]:
print(f"Number of words: {len(cv.get_feature_names())}")

Number of words: 54777


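The feature names are just a list of words; a word's position in the list is its index, which lets us translate an index back into a human-readable word. (In scikit-learn 1.0 and later, `get_feature_names()` is deprecated in favor of `get_feature_names_out()`, and it was removed in 1.2.)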

In [9]:
import random

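def print_random_word():
    fn = cv.get_feature_names()
    # randrange excludes the upper bound, so the index is always valid
    # (random.randint(0, len(fn)) could return len(fn) and raise IndexError)
    random_word_id = random.randrange(len(fn))
    print(f"Word index {random_word_id} is: {fn[random_word_id]}")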

print_random_word()

Word index 3806 is: asking


### Step 2: Grab the topics
With the trained LDA model we can now read off the topics. Each topic is described by its top N words, but the model refers to words by index rather than by the words themselves, which is why we need the vocabulary from above.

First, we can confirm that the number of components matches the number of topics we specified when training the model.

In [10]:
len(LDA.components_)

7

In [13]:
LDA.components_.shape

(7, 54777)

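We have a NumPy array of shape (7, 54777): 7 topics by 54777 words in the vocabulary. Each entry is the weight of a word within a topic. Strictly speaking these are unnormalized pseudo-counts; dividing each row by its sum gives the probability of each word under that topic.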
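scikit-learn describes `components_` as pseudo-counts that become a distribution over words once each row is normalized. As a minimal sketch of that normalization, using a made-up 2 x 3 array in place of the real `LDA.components_`:

```python
import numpy as np

# Hypothetical stand-in for LDA.components_: 2 topics x 3 words
toy_components = np.array([
    [2.0, 6.0, 2.0],
    [1.0, 1.0, 8.0],
])

# Divide each row by its sum so every topic becomes a distribution over words
toy_topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(toy_topic_word_probs)
```

Normalizing does not change the ranking of words within a topic, so sorting the raw weights selects the same top words as sorting the probabilities.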

What we need to do now is to grab the highest probability words for the topics.

#### Note:
On a NumPy array, the `argsort()` method returns an array of indexes into the original array, ordered so that the corresponding values go from lowest to highest. This is exactly what we need: the last indexes it returns point to the highest-weighted words of each topic.

In [14]:
# example of argsort()
import numpy as np
arr = np.array([10, 200, 1])
# From this array we can observe that in terms of the index of each element,
# from the lowest to the highest that would be: [2, 0, 1] because:
# arr[2] = 1, arr[0]=10 and arr[1]=200
arr.argsort()

array([2, 0, 1])
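Because `argsort()` sorts in ascending order, slicing the tail gives the largest values with the largest one last. If you prefer the largest first, one option is to reverse the slice, as in this small sketch on the same array (the name `top_2_descending` is just illustrative):

```python
import numpy as np

arr = np.array([10, 200, 1])

# Take the last two indexes (the two largest values), then reverse them
# so that the largest value comes first
top_2_descending = arr.argsort()[-2:][::-1]

print(top_2_descending)       # indexes of the two largest values, largest first
print(arr[top_2_descending])  # the corresponding values
```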

In [16]:
single_topic = LDA.components_[0]
# argsort is ascending, so the last ten indexes are the highest-weighted words
top_10_single_topic = single_topic.argsort()[-10:]
top_10_single_topic

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])

Now simply map the indexes to the vocabulary we got earlier.

In [17]:
print("For the first topic, we got the following words:")
# fetch the vocabulary once instead of rebuilding it on every iteration
vocab = cv.get_feature_names()
for index in top_10_single_topic:
    print(vocab[index])

For the first topic, we got the following words:
new
percent
government
company
million
care
people
health
said
says


So, it seems the first topic is about politics and healthcare.

Now let's obtain the top 10 words for each topic.

In [21]:
def print_topic_words(top=10):
    vocab = cv.get_feature_names()
    # each row of components_ holds one topic's weight for every word
    for topic, word_weights in enumerate(LDA.components_):
        top_n = word_weights.argsort()[-top:]
        print(f">>>For topic #{topic}, the top {top} words are:")
        words = [vocab[index] for index in top_n]
        print(f"{words}\n")

In [23]:
print_topic_words(15)

>>>For topic #0, the top 15 words are:
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']

>>>For topic #1, the top 15 words are:
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']

>>>For topic #2, the top 15 words are:
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']

>>>For topic #3, the top 15 words are:
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']

>>>For topic #4, the top 15 words are:
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']

>>>For topic #5, the top 15 words are:
['years', 'going', 've', 'life', 'don', 'new', '

Now let's attach the obtained topics to the original documents.

In [24]:
%%time
topic_results = LDA.transform(dtm)

CPU times: user 34.1 s, sys: 484 ms, total: 34.6 s
Wall time: 5.8 s


In [26]:
topic_results.shape

(11992, 7)

The result is an array of shape (11992, 7): one row per document and one column per topic. Each entry is the probability that the document belongs to the corresponding topic. For instance, let's take a look at the first document.

In [27]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

This array tells us that document $0$ has a $68\%$ probability of belonging to topic #1, whose top words listed above are:

```
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']
```
These words are closely related to politics, which matches the assumption we made before training the LDA model!

Now, let's apply the same logic to the original dataframe by assigning each document its most probable topic.

In [29]:
npr['Topic'] = topic_results.argmax(axis=1)

In [31]:
npr.head(20)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2


And now we have nicely assigned topics to each document!
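Taking only the `argmax` hides how confident the assignment is: document 0, for example, was 68% topic #1 but also 30% topic #4. A possible extension, sketched here on a made-up array standing in for `topic_results`, is to keep the winning probability next to the winning topic.

```python
import numpy as np

# Hypothetical stand-in for topic_results: 3 documents x 7 topics.
# The first row reuses the rounded values shown for document 0 above;
# the other two rows are invented for illustration.
toy_topic_results = np.array([
    [0.02, 0.68, 0.00, 0.00, 0.30, 0.00, 0.00],
    [0.01, 0.01, 0.95, 0.01, 0.01, 0.005, 0.005],
    [0.30, 0.05, 0.05, 0.35, 0.05, 0.10, 0.10],
])

# Index of the most probable topic per document, and its probability
dominant_topic = toy_topic_results.argmax(axis=1)
dominant_prob = toy_topic_results.max(axis=1)

for doc_id, (topic, prob) in enumerate(zip(dominant_topic, dominant_prob)):
    print(f"Document {doc_id}: topic #{topic} with probability {prob:.2f}")
```

On the real data the same idea would be a one-liner such as `npr['Topic_Prob'] = topic_results.max(axis=1)`, where the column name `Topic_Prob` is an assumption chosen for this example. A low value, like the third toy row, flags documents that mix several topics.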