# Latent Dirichlet Allocation

**Imports**

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import random

import pandas as pd
import numpy as np
np.set_printoptions(suppress=True, threshold=10)
pd.set_option('display.max_colwidth', 100)

**Loading Data**

In [2]:
df = pd.read_csv('articles.csv')
print('Shape:', df.shape)
df.head()

Shape: (11992, 1)


Unnamed: 0,Article
0,"In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in t..."
1,Donald Trump has used Twitter — his preferred means of communication — to weigh in on a ...
2,"Donald Trump is unabashedly praising Russian President Vladimir Putin, a day after outgoing Pr..."
3,"Updated at 2:50 p. m. ET, Russian President Vladimir Putin says Russia won’t be expelling U. S. ..."
4,"From photography, illustration and video, to data visualizations and immersive experiences, visu..."


## 1. Bag of Words

We create a bag of words that will be fed into the LDA model.
`max_df` is the maximum document frequency: a word that appears in more than 95% of the documents is ignored.
`min_df` is the minimum document frequency: a word that appears in fewer than 2 documents is ignored.

In [3]:
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [4]:
# Create a document term matrix
dtm = vectorizer.fit_transform(df['Article'])
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

The 11992 rows correspond to the documents in the data, and the 54777 columns correspond to the words in the bag-of-words vocabulary.

**Check the transformed features**

In [5]:
feature_names = vectorizer.get_feature_names_out()
print(f'{len(feature_names)} transformed features')
# Pick one feature at random (no hardcoded vocabulary size)
feature_names[random.randrange(len(feature_names))]

54777 transformed features


'succeeding'

In [6]:
# Print 5 random words from the vocabulary
feature_names = vectorizer.get_feature_names_out()
for _ in range(5):
    print(random.choice(feature_names))

closets
coldest
kugeler
defiance
trundle


## 2. LDA

Now we create an LDA model with 7 topics (`n_components=7`).

In [7]:
lda = LatentDirichletAllocation(n_components=7, random_state=42)
lda.fit(dtm)

`lda.components_[i, j]` is the weight of word `j` in topic `i`. It is not a probability; you can think of it as a pseudocount representing the number of times word `j` was assigned to topic `i`.

After normalization, each row can also be viewed as a distribution over the words for that topic: `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.
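
As a small sketch of that normalization, the array below stands in for `lda.components_`. Its values and the names `components` and `topic_word_dist` are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical pseudocounts: 2 topics over a 3-word vocabulary
components = np.array([[8.0, 1.0, 1.0],
                       [2.0, 2.0, 4.0]])

# Divide each row by its own sum so every topic becomes a distribution over words
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist)              # rows: [0.8, 0.1, 0.1] and [0.25, 0.25, 0.5]
print(topic_word_dist.sum(axis=1))  # each row now sums to 1
```

`keepdims=True` has the same effect as indexing with `[:, np.newaxis]`: the row sums keep a column shape so the division broadcasts row by row.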

In [8]:
print('(n_topics, n_words):', lda.components_.shape)

lda_df = pd.DataFrame(lda.components_.T, index=vectorizer.get_feature_names_out(), 
                      columns=[f'Topic {i+1}' for i in range(7)])
lda_df.head()

(n_topics, n_words): (7, 54777)


Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7
00,8.643328,27.619175,7.227839,1.752141,3.114887,46.148639,0.493991
000,2380.143327,536.394437,824.033986,900.736692,350.409655,51.44086,418.841042
00000,0.142901,0.142857,0.142857,0.142857,0.142857,3.142814,0.142857
000s,3.142641,0.142861,0.142928,0.142857,0.142915,0.142886,0.142911
000th,0.142857,0.143092,0.143214,1.768809,0.143387,0.143158,2.515483


To access a topic's word weights, we use `lda.components_[topic_index]`.

In [9]:
# First topic
lda.components_[0]

array([   8.64332806, 2380.14332687,    0.14290052, ...,    0.14300682,
          0.14290204,    0.14286163])

### Top K Words for a Topic

`lda.components_` is an array of shape (n_topics, n_words) containing 7 topics with 54777 words each.

To get the top K words for each topic, we need to sort the words in each topic by their weights. The weights are the values in `lda.components_`.

Remember that the document-term matrix used to fit the LDA model came from a CountVectorizer. Each column index therefore has to be mapped back to its feature name to get the actual word.
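
The sort-then-look-up step can be sketched as a small helper. The function `top_k_words`, the five-word vocabulary, and the weights are hypothetical; unlike a plain `argsort()[-k:]` slice, this version reverses the order so the heaviest word comes first.

```python
import numpy as np

def top_k_words(topic_weights, feature_names, k):
    """Return the k highest-weighted words of one topic, heaviest first."""
    # argsort is ascending, so reverse it before taking the first k indices
    top_idx = np.argsort(topic_weights)[::-1][:k]
    return [str(feature_names[i]) for i in top_idx]

# Hypothetical vocabulary and weights standing in for one row of lda.components_
toy_feature_names = np.array(['apple', 'bank', 'court', 'doctor', 'election'])
toy_weights = np.array([0.5, 12.0, 3.0, 7.5, 0.1])

top_three = top_k_words(toy_weights, toy_feature_names, 3)
print(top_three)
```

With the fitted model, the same helper would presumably be called as `top_k_words(lda.components_[0], vectorizer.get_feature_names_out(), 20)`.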

In [10]:
# Get the indices of the 20 highest weights in the first topic
# (argsort is ascending, so the top word comes last)
first_topic = lda.components_[0]
top_twenty_words = first_topic.argsort()[-20:]

In [11]:
# Print the top 20 words (top_twenty_words holds index positions, lowest weight first)
feature_names = vectorizer.get_feature_names_out()
for i in top_twenty_words:
    print(feature_names[i])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


**Top 15 words for each topic**

In [12]:
# For each topic, print the top 15 words (lowest weight first, top word last)
feature_names = vectorizer.get_feature_names_out()
for topic in lda.components_:
    top_fifteen_words = topic.argsort()[-15:]
    print([feature_names[i] for i in top_fifteen_words], '\n')

['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says'] 

['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said'] 

['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says'] 

['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says'] 

['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump'] 

['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like'] 

['student', 'years', 'data', 'science', 'university', 'people', 'time', 'schools', 'just', 'education', 'new', 'like', 'students', 'school', 'says'] 



### Identifying a Topic

Given the top N words of a topic, we can infer what the topic is about.
For example, the 7th topic, ['student', 'years', 'data', 'science', 'university', ...], could be education.


### Probability of a Document belonging to a Topic

We need to apply the LDA model to the document-term matrix to get the probability of each document belonging to each topic.

In [13]:
topics_results = lda.transform(dtm)
topics_results.shape

(11992, 7)

`topics_results` gives us the probabilities of each document belonging to the 7 different topics.
We can see that the first document has a 68% probability of belonging to Topic 2.

In [14]:
topics_results[0].round(4) * 100

array([ 1.61, 68.33,  0.02,  0.02, 29.97,  0.02,  0.02])

## Assigning Topics to Articles in the DataFrame

Now that we have the probabilities of documents belonging to topics, we can assign each document its most probable topic.

In [15]:
nums_topics = []
for topic in topics_results:
    # Get the index of the topic with the highest probability
    top_topic_index = topic.argmax()
    # Add 1 to the index, so it starts from 1 instead of 0
    nums_topics.append(top_topic_index + 1)

nums_topics[:10]

[2, 2, 2, 2, 3, 4, 4, 3, 4, 3]
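
The loop above can also be written as a single vectorized call. This is a sketch, not the notebook's method: the first row of `doc_topic` reuses the rounded probabilities shown for the first document, the second row is made up, and the names `doc_topic`, `dominant_topic`, and `dominant_prob` are hypothetical.

```python
import numpy as np

# Stand-in for topics_results: one row per document, one column per topic
doc_topic = np.array([
    [0.0161, 0.6833, 0.0002, 0.0002, 0.2997, 0.0002, 0.0002],  # first document (rounded)
    [0.10, 0.05, 0.60, 0.05, 0.10, 0.05, 0.05],                # made-up second document
])

# argmax along axis=1 picks the most probable topic per row; +1 makes topics 1-based
dominant_topic = doc_topic.argmax(axis=1) + 1
# The matching probability shows how strongly each document leans toward that topic
dominant_prob = doc_topic.max(axis=1)

print(dominant_topic)
print(dominant_prob)
```

Keeping `dominant_prob` alongside the label helps spot documents that are split between topics, such as the first one, which also has about a 30% weight on Topic 5.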

In [16]:
df['Topic'] = nums_topics
df.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in t...",2
1,Donald Trump has used Twitter — his preferred means of communication — to weigh in on a ...,2
2,"Donald Trump is unabashedly praising Russian President Vladimir Putin, a day after outgoing Pr...",2
3,"Updated at 2:50 p. m. ET, Russian President Vladimir Putin says Russia won’t be expelling U. S. ...",2
4,"From photography, illustration and video, to data visualizations and immersive experiences, visu...",3


In [17]:
# Keys start at 1 because df['Topic'] stores 1-based topic numbers
topic_mapping = {1: 'Business', 2: 'Politics', 3: 'Lifestyle', 4: 'Health',
                 5: 'Elections', 6: 'Entertainment', 7: 'Education'}

df['Topic'] = df['Topic'].map(topic_mapping)
df.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in t...",Politics
1,Donald Trump has used Twitter — his preferred means of communication — to weigh in on a ...,Politics
2,"Donald Trump is unabashedly praising Russian President Vladimir Putin, a day after outgoing Pr...",Politics
3,"Updated at 2:50 p. m. ET, Russian President Vladimir Putin says Russia won’t be expelling U. S. ...",Politics
4,"From photography, illustration and video, to data visualizations and immersive experiences, visu...",Lifestyle
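
When mapping topic numbers to labels, it is worth checking that every number has a label, because `Series.map` returns `NaN` for keys missing from the dictionary. The sketch below uses a made-up series of 1-based topic numbers; `topic_ids`, `labels`, and `mapped` are hypothetical names.

```python
import pandas as pd

# Hypothetical 1-based topic numbers, as produced by argmax() + 1
topic_ids = pd.Series([2, 2, 3, 7])

# The label keys must use the same 1-based numbering as the topic numbers
labels = {1: 'Business', 2: 'Politics', 3: 'Lifestyle', 4: 'Health',
          5: 'Elections', 6: 'Entertainment', 7: 'Education'}

mapped = topic_ids.map(labels)

# Any NaN here would mean a topic number had no matching key
unmapped_count = int(mapped.isna().sum())
print(mapped.tolist())
print('Unmapped rows:', unmapped_count)
```

If the dictionary keys ran from 0 to 6 instead, Topic 7 would come back as `NaN` and every other label would be shifted by one.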
