# Latent Dirichlet Allocation (LDA)

**Topic modeling** is a method for unsupervised classification of documents. Like clustering on numeric data, it finds natural groups of items (topics) even when we’re not sure what we’re looking for.

**LDA Assumptions:**

*   Each document is just a collection of words or a “bag of words”. Thus, the order of the words and the grammatical role of the words (subject, object, verbs, …) are not considered in the model.

*   Words like am/is/are/of/a/the/but/… don’t carry any information about the “topics” and can therefore be removed from the documents as a preprocessing step. In fact, we can usually remove words that occur in at least 80% to 90% of the documents with little loss of information.
For example, if our corpus contains only medical documents, words like human, body, health, etc. might be present in most of the documents and can be removed, since they add nothing that would make a document stand out.

*   We know beforehand how many topics we want. ‘k’ is pre-decided.

*   When updating the topic of a word, we assume that all topic assignments except that of the current word are correct, and then update the assignment of the current word using our model of how documents are generated:

`p(word w with topic t) = p(topic t | document d) * p(word w | topic t)`
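
As a rough illustration of this update, the sketch below scores each candidate topic for one word using toy counts. The counts, the helper name `topic_scores`, and the smoothing constants `alpha` and `beta` are assumptions made for illustration, not values from this notebook. Note also that scikit-learn's `LatentDirichletAllocation`, used below, fits the model with variational inference rather than this per-word update.

```python
# Toy scoring of candidate topics for a single word (illustrative only).
# All counts are made up; alpha and beta are assumed smoothing constants.

def topic_scores(doc_topic_counts, topic_word_counts, topic_totals,
                 vocab_size, alpha=0.1, beta=0.01):
    """Return normalized p(topic t | document d) * p(word w | topic t) per topic."""
    k = len(doc_topic_counts)
    doc_len = sum(doc_topic_counts)
    scores = []
    for t in range(k):
        # How much of document d is currently assigned to topic t
        p_topic_given_doc = (doc_topic_counts[t] + alpha) / (doc_len + k * alpha)
        # How often word w is assigned to topic t across the corpus
        p_word_given_topic = (topic_word_counts[t] + beta) / (
            topic_totals[t] + vocab_size * beta
        )
        scores.append(p_topic_given_doc * p_word_given_topic)
    total = sum(scores)
    return [s / total for s in scores]

# Document d (excluding the current word): 8 words in topic 0, 2 in topic 1.
# Word w is assigned to topic 0 three times and to topic 1 thirty times
# across the corpus; each topic holds 100 word assignments in total.
probs = topic_scores(
    doc_topic_counts=[8, 2],
    topic_word_counts=[3, 30],
    topic_totals=[100, 100],
    vocab_size=50,
)
print([round(p, 3) for p in probs])
```

In this toy case the word leans towards topic 1 even though the document is mostly topic 0, because the word itself is far more common in topic 1.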

More Info at https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')

In [3]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
npr['Article'][4000]

'The headline shocked the   world of the surface Navy: Seven sailors aboard the destroyer USS Fitzgerald were killed, and other crew members injured, when the warship collided with a cargo vessel off Japan. As the Navy family grieves, both it and the wider world are asking the same question: How did this happen? The short answer is that no one knows  —   yet. Official inquiries into what led up to the encounter could take months or more. The Navy and the U. S. Coast Guard both likely will eventually issue reports that describe what happened and could make recommendations for preventing another such accident. ”I will not speculate on how long these investigations will last,” said Vice Adm. Joseph Aucoin, commander of the Navy’s 7th Fleet. The Fitzgerald and the other ships of Destroyer Squadron 15, based outside Tokyo, fall under his authority. There are clues, however, that explain how something like the Fitzgerald’s collision could happen, including photographs of the ships involved, 

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')

In [7]:
dtm = cv.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
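
To make the `max_df=0.9` and `min_df=2` settings concrete, here is a minimal pure-Python sketch of document-frequency filtering on a made-up corpus. The corpus and the helper `filter_by_document_frequency` are hypothetical. The sketch only mirrors the idea (a float `max_df` is read as a proportion of documents, an integer `min_df` as an absolute count) and is not scikit-learn's implementation.

```python
def filter_by_document_frequency(docs, max_df=0.9, min_df=2):
    """Keep words whose document frequency is >= min_df (a count)
    and <= max_df (a proportion of all documents)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each word at most once per document
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

toy_docs = [
    "health care costs rise",
    "health study on sleep",
    "health care reform vote",
    "health budget vote delayed",
]
kept = filter_by_document_frequency(toy_docs, max_df=0.9, min_df=2)
print(kept)  # ['care', 'vote']
```

Here "health" appears in every document and is dropped by `max_df`, while words that appear in only one document are dropped by `min_df`.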

In [9]:
from sklearn.decomposition import LatentDirichletAllocation

In [11]:
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [12]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [13]:
# Finding Vocabulary of words

In [14]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(cv.get_feature_names_out())

54777

In [17]:
cv.get_feature_names_out()[50000]

'transcribe'

In [18]:
# topic-word weights: one row per topic, one column per vocabulary word
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])

In [20]:
LDA.components_.shape

(7, 54777)
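
Each row of `components_` holds unnormalized word weights (pseudo-counts) for one topic; dividing a row by its sum gives a word distribution for that topic. The many values near 0.1428 in the output above are consistent with 1/7, which would match scikit-learn's default topic-word prior of `1 / n_components`; treat that reading as an assumption. A minimal sketch of the normalization, with made-up weights and a hypothetical helper `normalize_rows`:

```python
# Made-up weights for 2 topics over a 4-word vocabulary (not the NPR values).
toy_components = [
    [8.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 6.0, 3.0],
]

def normalize_rows(rows):
    """Turn each row of non-negative weights into probabilities that sum to 1."""
    return [[w / sum(row) for w in row] for row in rows]

topic_word_probs = normalize_rows(toy_components)
for t, row in enumerate(topic_word_probs):
    print(t, [round(p, 2) for p in row])
```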

In [24]:
# Words for the first topic
first_topic = LDA.components_[0]

In [25]:
first_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [26]:
# indices of the 10 highest-weighted words in the topic
# (argsort is ascending, so the last index has the highest weight)
first_topic.argsort()[-10:]  # indices into the CountVectorizer vocabulary

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])
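
`argsort()` returns the indices that would sort the array in ascending order, so the last 10 indices point at the 10 largest weights. The same idea in plain Python, using a hypothetical five-word vocabulary and made-up weights:

```python
toy_vocab = ["care", "health", "music", "school", "vote"]  # hypothetical vocabulary
toy_weights = [12.0, 40.5, 0.1, 3.2, 7.7]                  # made-up topic weights

# Indices ordered from smallest to largest weight (what argsort produces).
order = sorted(range(len(toy_weights)), key=lambda i: toy_weights[i])

top3_idx = order[-3:]                          # three largest, still ascending
top3_words = [toy_vocab[i] for i in top3_idx]
print(top3_idx)    # [4, 0, 1]
print(top3_words)  # ['vote', 'care', 'health']
```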

In [28]:
# getting the words (look up the vocabulary once, then index into it)
feature_names = cv.get_feature_names_out()
for i in first_topic.argsort()[-10:]:
    print(feature_names[i])

new
percent
government
company
million
care
people
health
said
says


In [35]:
# Highest-weighted words per topic (Topic 0 was printed above, so start at 1)
feature_names = cv.get_feature_names_out()
for i in range(1, LDA.components_.shape[0]):
    print(f"Top 10 words for Topic {i} :")
    print([feature_names[ind] for ind in LDA.components_[i].argsort()[-10:]])
    print('\n')

Top 10 words for Topic 1 :
['npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


Top 10 words for Topic 2 :
['time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


Top 10 words for Topic 3 :
['disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


Top 10 words for Topic 4 :
['obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


Top 10 words for Topic 5 :
['new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']


Top 10 words for Topic 6 :
['people', 'time', 'schools', 'just', 'education', 'new', 'like', 'students', 'school', 'says']




In [36]:
results = LDA.transform(dtm)

In [39]:
results.shape

(11992, 7)

In [41]:
# probability of each topic for the first document
results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])
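
Assigning each document to its most probable topic is an argmax over its row. Using the rounded probabilities printed above for the first article, a plain-Python equivalent looks like this (the helper `argmax` is illustrative, standing in for NumPy's `argmax`):

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

first_doc = [0.02, 0.68, 0.0, 0.0, 0.3, 0.0, 0.0]  # results[0].round(2) from above
print(argmax(first_doc))  # 1
```

The first article is therefore assigned to Topic 1, with Topic 4 a distant second.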

In [42]:
npr['Topic']=results.argmax(axis=1)

In [43]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4
