# Latent Dirichlet Allocation

## Data

We use articles from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd
import time
import random

In [2]:
npr = pd.read_csv('npr.csv')

In [3]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [5]:
len(npr)

11992

In [6]:
npr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11992 entries, 0 to 11991
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Article  11992 non-null  object
dtypes: object(1)
memory usage: 93.8+ KB


In [7]:
npr.describe()

Unnamed: 0,Article
count,11992
unique,11991
top,"Washington state has released an estimated 3, ..."
freq,2


## Preprocessing

In [8]:
# CountVectorizer turns the raw articles into a document-term matrix of word counts
from sklearn.feature_extraction.text import CountVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

If `max_df` is 0.9, any word that appears in more than 90% of the documents is discarded.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

If `min_df` is 0.2, a word is kept only if it appears in at least 20% of the documents.
If `min_df` is 3, a word must appear in at least 3 documents to be kept.
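
To make the two thresholds concrete, here is a minimal pure-Python sketch of the document-frequency filter on a toy corpus. It only illustrates the rule described above and is not scikit-learn's implementation: `filter_vocab` and the four toy documents are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def filter_vocab(docs, max_df=1.0, min_df=1):
    """Keep terms whose document frequency lies between min_df and max_df.

    Floats are read as a proportion of documents, ints as absolute counts,
    following the parameter description above (hypothetical helper).
    """
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    # terms strictly above max_count or strictly below min_count are ignored
    return sorted(t for t, c in df.items() if min_count <= c <= max_count)

docs = [
    "the senate passed the bill",
    "the bill stalled in the house",
    "the chef cooked rice",
    "the rice was cold",
]
vocab = filter_vocab(docs, max_df=0.95, min_df=2)
print(vocab)
```

With these settings "the" is dropped because it occurs in every document, and words that occur in a single document are dropped by `min_df=2`.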

In [9]:
# creating instance of the count vectorizer
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [10]:
# transform the articles into a document-term matrix (one row per article, one column per term)
dtm = cv.fit_transform(npr['Article'])

In [11]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [12]:
# The vocabulary holds 54,777 distinct terms (about 55,000) across 11,992 articles
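
The numbers in the matrix summary also show why a sparse format is used. The arithmetic below is a small sketch that takes the shape and the stored-element count from the output above and assumes every stored element is a non-zero count.

```python
n_docs, n_terms = 11992, 54777   # shape of dtm reported above
stored = 3033388                 # stored elements reported above

# fraction of the matrix cells that actually hold a count
density = stored / (n_docs * n_terms)
# average number of distinct vocabulary terms per article
avg_terms_per_doc = stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct terms per article: {avg_terms_per_doc:.0f}")
```

Fewer than one cell in two hundred is filled, so a dense array of the same shape would consist almost entirely of zeros.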

## LDA

In [13]:
# import the LDA estimator
from sklearn.decomposition import LatentDirichletAllocation

In [14]:
# instantiate the model; n_components=7 is the number of topics we want to sort the articles into
LDA = LatentDirichletAllocation(n_components=7, learning_method='online', max_iter=15, verbose=1, random_state=42)

In [15]:
# fitting LDA to get different topics from documents
tic = time.time()
LDA.fit(dtm)
toc = time.time()

print('time taken to get the topics using LDA', toc-tic)

iteration: 1 of max_iter: 15
iteration: 2 of max_iter: 15
iteration: 3 of max_iter: 15
iteration: 4 of max_iter: 15
iteration: 5 of max_iter: 15
iteration: 6 of max_iter: 15
iteration: 7 of max_iter: 15
iteration: 8 of max_iter: 15
iteration: 9 of max_iter: 15
iteration: 10 of max_iter: 15
iteration: 11 of max_iter: 15
iteration: 12 of max_iter: 15
iteration: 13 of max_iter: 15
iteration: 14 of max_iter: 15
iteration: 15 of max_iter: 15
time taken to get the topics using LDA 176.01935625076294
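
For measuring durations like this, `time.perf_counter()` is generally a better fit than `time.time()` because it is a monotonic, high-resolution clock. The helper below is a hypothetical sketch (the name `timed` is not part of the notebook), and the toy workload stands in for the fit call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")

# toy workload; in the notebook this would wrap the model fitting step
with timed("toy workload"):
    total = sum(i * i for i in range(100_000))
```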


In [16]:
# now there are 3 things that we need to take care of :

# 1. grab the vocab of words
# 2. grab the topics
# 3. grab the highest probability words per topic

### Grab the Vocab of words


In [17]:
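# size of the vocabulary learned by the vectorizer
# (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
#  newer versions use cv.get_feature_names_out() instead)
len(cv.get_feature_names())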

54777

In [18]:
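# let's look at some random words from the vocabulary
feature_names = cv.get_feature_names()  # build the list once instead of on every iteration
for i in range(15):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])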

capabilities
tormentors
anatomies
brash
formlessness
625
schindler
semiconscious
und
ornamented
airmen
nuttall
suckle
catcall
canterbury


### Grab the topics

In [19]:
# we have trained LDA which consists of LDA components
len(LDA.components_)

7

In [20]:
# it is a NumPy array holding, for every topic, a weight (pseudo-count) for each word in the vocabulary
type(LDA.components_)

numpy.ndarray

In [22]:
# one row per topic, one column per vocabulary word: each topic has a weight for every word
LDA.components_.shape

(7, 54777)

In [23]:
# lets have a look at components
LDA.components_

array([[2.52814589e+01, 1.18811053e+03, 1.42857157e-01, ...,
        1.42938060e-01, 1.42865130e-01, 1.42994690e-01],
       [1.75634706e+01, 6.78512027e+00, 1.42858131e-01, ...,
        1.42857146e-01, 1.43025627e-01, 1.42972105e-01],
       [5.08756153e+00, 7.43146984e+02, 1.42857165e-01, ...,
        5.14323850e+00, 2.38910207e+00, 1.42938371e-01],
       ...,
       [1.43308259e-01, 5.74942905e+02, 1.42857155e-01, ...,
        1.42868092e-01, 1.42857212e-01, 1.42895601e-01],
       [4.05242410e+01, 1.40054881e+01, 3.78538252e+00, ...,
        1.43171460e-01, 1.44525680e-01, 2.17199977e+00],
       [5.35639369e+00, 1.84226560e+03, 1.43121327e-01, ...,
        1.43003037e-01, 1.43263613e-01, 1.42946222e-01]])
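
The entries of `components_` are unnormalized pseudo-counts rather than probabilities, so the rows do not sum to 1. Dividing each row by its sum gives a word distribution per topic. The sketch below shows that step in plain Python on a made-up 2 x 4 matrix; the values and the helper name `normalize_rows` are assumptions for illustration only.

```python
# Toy stand-in for LDA.components_: 2 topics x 4 words (made-up pseudo-counts)
components = [
    [25.0, 1188.0, 0.25, 0.25],
    [0.5, 6.5, 2.0, 1.0],
]

def normalize_rows(matrix):
    """Divide every row by its sum so that each row becomes a distribution."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized

topic_word = normalize_rows(components)
for row in topic_word:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 6))
```

The many values close to 0.142857 in the real output equal 1/7, which is consistent with words that received little more than the default prior of `1 / n_components` in that topic.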

### Grab the highest probability words per topic

In [24]:
# Let's grab one of the topics; we don't yet know what it represents,
# so we will look at its most important words and then give it a suitable name
single_topic = LDA.components_[0]

In [25]:
# Returns the indices (index positions) that would sort this array in ascending order.
single_topic.argsort()

# For example:
# >>> import numpy as np
# >>> arr = np.array([10, 20, 0, 7])
# >>> arr.argsort()
# array([2, 3, 0, 1])
# i.e. the indices of arr's elements ordered from smallest to largest value

array([49173, 11509, 30304, ..., 37374, 42993, 42561], dtype=int64)

In [26]:
# Let's check the weight of the least important word in this topic
single_topic[49173]

0.1428571435984841

In [27]:
# Weight of the word most representative of this topic
single_topic[42561]

7292.154328318624

In [28]:
# Top 20 words for this topic: 
single_topic.argsort()[-20:]

array([25555, 42985, 54412, 33992, 40950, 46581,  4050, 27219,  1483,
       31426,  9511, 49613, 53019, 11657, 21228, 40955, 36283, 37374,
       42993, 42561], dtype=int64)
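
Because `argsort()` sorts in ascending order, the last index in that slice is the strongest word. A small helper can return the words themselves, strongest first. This is a plain-Python sketch with a toy vocabulary and toy weights; `top_k_words` is a hypothetical name, not part of the notebook.

```python
def top_k_words(weights, vocab, k=3):
    """Return the k highest-weighted words, strongest first."""
    # indices ordered by ascending weight, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # take the last k indices and reverse them so the top word comes first
    return [vocab[i] for i in reversed(order[-k:])]

vocab = ["city", "police", "said", "water", "food"]   # toy vocabulary
weights = [310.0, 5200.5, 7292.2, 0.14, 0.14]         # toy topic row

print(top_k_words(weights, vocab, k=3))
```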

In [29]:
# keep the indices of the top 30 words in a variable
top_word_indices = single_topic.argsort()[-30:]

In [30]:
# Let's see what those words are (listed from weakest to strongest):
top_words = []
for index in top_word_indices:
    top_word = cv.get_feature_names()[index]
    top_words.append(top_word)
print(top_words)

['world', 'violence', 'north', 'officials', 'united', 'isis', 'security', 'officers', 'syria', 'group', 'international', 'say', 'years', 'npr', 'reported', 'state', 'attack', 'killed', 'according', 'military', 'city', 'told', 'war', 'country', 'government', 'reports', 'people', 'police', 'says', 'said']


In [31]:
# Let's print the top 30 words for every topic

for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 30 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-30:]])
    print('\n')

THE TOP 30 WORDS FOR TOPIC #0
['world', 'violence', 'north', 'officials', 'united', 'isis', 'security', 'officers', 'syria', 'group', 'international', 'say', 'years', 'npr', 'reported', 'state', 'attack', 'killed', 'according', 'military', 'city', 'told', 'war', 'country', 'government', 'reports', 'people', 'police', 'says', 'said']


THE TOP 30 WORDS FOR TOPIC #1
['asked', 'obama', 'election', 'security', 'post', 'department', 'comey', 'press', 'information', 'national', 'committee', 'statement', 'investigation', 'new', 'intelligence', 'russian', 'npr', 'fbi', 'clinton', 'did', 'media', 'told', 'white', 'campaign', 'russia', 'house', 'news', 'president', 'said', 'trump']


THE TOP 30 WORDS FOR TOPIC #2
['town', 'called', 'come', 'place', 'long', 'ago', 'used', 'don', 'local', 'national', 'eat', 'way', 'little', 'animals', 'small', 'year', 'world', 'make', 'day', 'home', 'time', 'city', 'new', 'years', 'just', 'people', 'water', 'like', 'food', 'says']


THE TOP 30 WORDS FOR TOPIC #3
[

### Attaching Discovered Topic Labels to Original Articles
We have the document-term matrix and our NPR articles.
With LDA we now transform the documents so that each one can be assigned to a topic.

In [32]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [33]:
dtm.shape

(11992, 54777)

In [34]:
len(npr)

11992

In [35]:
# probability of each document belonging to each topic
topic_results = LDA.transform(dtm)

In [36]:
# lets see if it has categorised into topics
topic_results.shape

(11992, 7)

In [37]:
# for each document it will have associated probability of relating to each topic
topic_results[0]

array([7.50563640e-02, 6.74292758e-01, 7.10203426e-03, 2.25561402e-04,
       2.19627275e-01, 2.25559437e-04, 2.34704470e-02])

In [38]:
# round to two digits
# the first article belongs to topic #1 with about 67% probability
topic_results[0].round(2)

array([0.08, 0.67, 0.01, 0.  , 0.22, 0.  , 0.02])

In [39]:
topic_results[0].argmax()

1
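
`argmax()` always returns a topic, even when two topics are almost tied. One way to flag such cases is to compare the best topic with the runner-up. The sketch below is plain Python; the helper `dominant_topic` and the `min_margin` threshold of 0.1 are hypothetical choices for illustration, not something the notebook defines.

```python
def dominant_topic(probs, min_margin=0.1):
    """Return (topic index, probability, is_confident) for one document.

    is_confident is True when the best topic beats the runner-up by at least
    min_margin, a hypothetical threshold chosen only for illustration.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = probs[best] - probs[runner_up]
    return best, probs[best], margin >= min_margin

# rounded topic mixture of the first article, copied from the output above
first_article = [0.08, 0.67, 0.01, 0.0, 0.22, 0.0, 0.02]
print(dominant_topic(first_article))

# a made-up mixture in which two topics are nearly tied
print(dominant_topic([0.40, 0.38, 0.22]))
```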

In [40]:
# Let's look at the top words of topic #1, the topic assigned to the first article
topic_check = LDA.components_[1]
top_word_indices = topic_check.argsort()[-30:]

top_words = []
for index in top_word_indices:
    top_word = cv.get_feature_names()[index]
    top_words.append(top_word)
print(top_words)

['asked', 'obama', 'election', 'security', 'post', 'department', 'comey', 'press', 'information', 'national', 'committee', 'statement', 'investigation', 'new', 'intelligence', 'russian', 'npr', 'fbi', 'clinton', 'did', 'media', 'told', 'white', 'campaign', 'russia', 'house', 'news', 'president', 'said', 'trump']


In [41]:
# Let's check the first document, which we think belongs to topic #1
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

### Combining with Original Data

In [42]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [43]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 3, 4, 4], dtype=int64)

In [44]:
npr['Topic'] = topic_results.argmax(axis=1)

In [45]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",5
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",2
9,"Eighteen years ago, on New Year’s Eve, David F...",2


In [46]:
set(npr['Topic'])

{0, 1, 2, 3, 4, 5, 6}

In [47]:
# Let's assign a name to each of these topic numbers
topic_dict ={0:'Govt_Militry', 
             1:'White_house' , 
             2:'Editorial' , 
             3:'Education' , 
             4:'Politics' , 
             5:'World_news' , 
             6:'Business'   
}

In [48]:
npr['Topic_name'] = npr['Topic'].map(topic_dict)
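
Once every article has a label, a quick tally shows how the labels are spread. The sketch below does this in plain Python for the first ten topic ids shown by `npr.head(10)`; on the full DataFrame, `npr['Topic_name'].value_counts()` would give the same kind of summary.

```python
from collections import Counter

topic_dict = {0: 'Govt_Militry', 1: 'White_house', 2: 'Editorial', 3: 'Education',
              4: 'Politics', 5: 'World_news', 6: 'Business'}

# dominant topic ids of the first ten articles, as shown in npr.head(10)
first_ten = [1, 1, 1, 1, 5, 3, 3, 2, 2, 2]

# same idea as Series.map(topic_dict): look up a name for every topic id
labels = [topic_dict[t] for t in first_ten]
counts = Counter(labels)
for name, n in counts.most_common():
    print(f"{name:<12} {n}")
```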

In [49]:
npr.head(20)

Unnamed: 0,Article,Topic,Topic_name
0,"In the Washington of 2016, even when the polic...",1,White_house
1,Donald Trump has used Twitter — his prefe...,1,White_house
2,Donald Trump is unabashedly praising Russian...,1,White_house
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,White_house
4,"From photography, illustration and video, to d...",5,World_news
5,I did not want to join yoga class. I hated tho...,3,Education
6,With a who has publicly supported the debunk...,3,Education
7,"I was standing by the airport exit, debating w...",2,Editorial
8,"If movies were trying to be more realistic, pe...",2,Editorial
9,"Eighteen years ago, on New Year’s Eve, David F...",2,Editorial


In [50]:
npr.tail(20)

Unnamed: 0,Article,Topic,Topic_name
11972,In a disappointment to Alzheimer’s patients an...,3,Education
11973,"In my early 20s, smitten by the mythic underpi...",5,World_news
11974,It’s been a lively year for social media maven...,0,Govt_Militry
11975,This is not a review. It started out as one: I...,5,World_news
11976,"On a summer’s day in December, a warehouse in ...",2,Editorial
11977,"Elections aren’t exactly cozy, even in the bes...",5,World_news
11978,"Although her oldest child, Ben, is 10 years ol...",3,Education
11979,"When a political scandal explodes in France, t...",0,Govt_Militry
11980,The darkest moment for American police this ye...,0,Govt_Militry
11981,Russia was ordered to vacate two compounds it ...,1,White_house
