# Latent Dirichlet Allocation

## Data
> We will be using articles from NPR (National Public Radio), obtained from their website www.npr.org

In [79]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [7]:
npr = pd.read_csv('npr.csv')

In [8]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [9]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [10]:
len(npr)

11992

## Preprocessing

**`max_df`**: `float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**: `float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
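To make the two thresholds concrete, here is a minimal pure-Python sketch of document-frequency pruning on a toy corpus. The helper `prune_vocabulary` and the four sentences are invented for illustration. It mimics the documented behaviour (floats read as proportions, ints as absolute counts) rather than reproducing `CountVectorizer` internals, and it tokenizes on whitespace only.

```python
from collections import Counter


def prune_vocabulary(docs, max_df=1.0, min_df=1):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Floats are read as a proportion of documents, ints as absolute counts.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(t for t, df in doc_freq.items() if min_count <= df <= max_count)


toy_docs = [
    "the senate passed the health bill",
    "the health study followed patients",
    "the band released the album",
    "the senate vote on the album tax",
]

# "the" occurs in all 4 documents (more than 90%), so max_df=0.9 drops it;
# words seen in only one document fall below min_df=2.
vocab = prune_vocabulary(toy_docs, max_df=0.9, min_df=2)
print(vocab)
```

With these settings only terms that appear in at least two documents, but not in nearly all of them, survive.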

In [12]:
cv = CountVectorizer(max_df = 0.9, min_df = 2, stop_words='english')

In [14]:
dtm  = cv.fit_transform(npr['Article'])

In [15]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## LDA
> **`n_components`** depends entirely on the user, who must have some intuition about how many topics the documents cover.
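When intuition alone is not enough, one option is to fit a few candidate values and compare them. The sketch below does this with the `perplexity` method of `LatentDirichletAllocation` on an invented toy corpus. The corpus and the candidate values are assumptions, and perplexity on the training data tends to favour larger models, so in practice a held-out set and a look at the top words are more informative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus; the notebook would use npr['Article'] instead.
toy_docs = [
    "senate vote election campaign voters",
    "election campaign president voters party",
    "health patients disease study medical",
    "medical study health care patients",
    "album music band song concert",
    "music song band album tour",
    "president senate party vote campaign",
    "disease health care study women",
]
toy_dtm = CountVectorizer(stop_words="english").fit_transform(toy_docs)

perplexities = {}
for k in (2, 3, 4):  # candidate topic counts, chosen arbitrarily for the sketch
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    perplexities[k] = model.perplexity(toy_dtm)  # lower is better

for k, value in perplexities.items():
    print(f"n_components={k}: perplexity={value:.1f}")
```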

In [17]:
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [18]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

## Showing Stored Words

In [19]:
# Grab the vocabulary of words

In [25]:
len(cv.get_feature_names())

54777

In [27]:
# this is simply the list of all words kept from the documents (the vocabulary)
type(cv.get_feature_names())



list
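A note on versions: `get_feature_names` was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of `get_feature_names_out`, which returns a NumPy array instead of a list. The helper below is a hypothetical compatibility shim, not part of the original notebook. The two small classes are stand-ins so the sketch runs without a fitted vectorizer.

```python
def vocabulary_terms(vectorizer):
    """Return the fitted vocabulary as a plain list of strings."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn: the accessor returns an array of strings.
        return list(vectorizer.get_feature_names_out())
    # Older scikit-learn: the accessor already returns a list.
    return list(vectorizer.get_feature_names())


class OldStyleVectorizer:
    """Hypothetical stand-in exposing only the old accessor."""

    def get_feature_names(self):
        return ["album", "health", "senate"]


class NewStyleVectorizer:
    """Hypothetical stand-in exposing only the new accessor."""

    def get_feature_names_out(self):
        return ("album", "health", "senate")


print(vocabulary_terms(OldStyleVectorizer()))
print(vocabulary_terms(NewStyleVectorizer()))
```

In the notebook, `vocabulary_terms(cv)` could then be called once and the resulting list reused, instead of calling the accessor inside every loop iteration.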

In [33]:
# select 10 random words from the vocabulary
import random

for i in range(10):
    random_word_id = random.randint(0, 54776)
    print(cv.get_feature_names()[random_word_id])

lifelong
bulldozer
bert
excused
beachwear
batons
befitting
steelmaker
makeups
holman
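The loop above hard-codes the upper index `54776` and can draw the same word twice. A possible alternative, sketched here with an invented stand-in vocabulary, is `random.sample`, which draws distinct items and needs no index arithmetic.

```python
import random

# Hypothetical stand-in for the fitted vocabulary (the list returned by the vectorizer).
vocabulary = [f"word{i}" for i in range(1000)]

rng = random.Random(42)                # seeded so the draw is repeatable
sample = rng.sample(vocabulary, k=10)  # 10 distinct words, no hard-coded upper bound

for word in sample:
    print(word)
```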


## Showing Top Words Per Topic

In [34]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
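The values in `components_` are not probabilities. The scikit-learn documentation describes them as pseudocounts of how often each word was assigned to each topic, and dividing every row by its sum turns them into a distribution over words for that topic. The many entries close to `1.42857e-01` are consistent with the default prior of `1 / n_components` (here 1/7) for words that were essentially never assigned to the topic. A minimal sketch with invented numbers:

```python
import numpy as np

# Toy stand-in for LDA.components_: 2 topics x 4 vocabulary terms (values invented).
components = np.array([
    [8.0, 2380.0, 0.5, 11.5],
    [27.0, 536.0, 0.5, 36.5],
])

# Divide each row by its own sum so every topic becomes a distribution over terms.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(4))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```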

In [35]:
print(len(LDA.components_))

7

In [44]:
print(len(LDA.components_[0]))

54777


In [36]:
print(LDA.components_[0])

[8.64332806e+00 2.38014333e+03 1.42900522e-01 ... 1.43006821e-01
 1.42902042e-01 1.42861626e-01]


In [37]:
single_topic = LDA.components_[0]

In [40]:
single_topic.argsort()  # vocabulary indices ordered from the lowest to the highest weight for this topic

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [48]:
# Returns the indices that would sort this array.
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)
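Because `argsort` orders indices from the smallest to the largest value, the slice `[-10:]` holds the ten highest-weighted words, with the single strongest word last. The toy example below (invented weights and vocabulary) shows the same pattern on five words and reverses the slice to list the strongest word first.

```python
import numpy as np

# Invented weights for one topic and a matching toy vocabulary.
weights = np.array([0.14, 2380.1, 8.6, 0.15, 536.4])
vocab = ["aardvark", "said", "water", "zebra", "health"]

order = weights.argsort()   # indices from lowest to highest weight
top3 = order[-3:]           # three largest weights, still in ascending order
top3_desc = top3[::-1]      # reversed: strongest word first

top_words = [vocab[i] for i in top3_desc]
print(top_words)
```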

In [41]:
# One of the words least representative of this topic (its index appears near the start of the argsort output)
single_topic[18302]

0.14285714309286987

In [43]:
single_topic[10421]

2152.345550878652

In [49]:
top_word_indices = single_topic.argsort()[-10:]

In [50]:
for index in top_word_indices:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says




These look like business articles, perhaps. Let's confirm by using `.transform()` on our vectorized articles to attach a label number. But first, let's view all 7 topics found.

In [51]:
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # use a separate name (i) for word indices so it does not shadow the topic index
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print()
    print()

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']


T

## Attaching Discovered Topic Labels to Original Articles

In [53]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [54]:
dtm.shape

(11992, 54777)

In [55]:
len(npr)

11992

## Transforming the sparse matrix with the fitted Latent Dirichlet Allocation model

In [52]:
topic_results = LDA.transform(dtm)

In [59]:
len(topic_results)

11992

In [56]:
topic_results[0]

array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
       2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

In [57]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [58]:
topic_results[0].argmax()

1

This means that our model thinks that the first article belongs to topic #1.
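Taking the `argmax` keeps only the strongest topic and hides how mixed a document is: the first article is about 68% topic #1 but also about 30% topic #4. The sketch below keeps the winning probability next to the label. The first row is the rounded output shown above; the second row is invented to show a more evenly mixed document.

```python
import numpy as np

doc_topic = np.array([
    [0.02, 0.68, 0.00, 0.00, 0.30, 0.00, 0.00],  # rounded topic_results[0]
    [0.05, 0.05, 0.40, 0.35, 0.05, 0.05, 0.05],  # invented, evenly mixed document
])

dominant = doc_topic.argmax(axis=1)  # index of the largest probability in each row
strength = doc_topic.max(axis=1)     # the probability of that dominant topic

for doc_id, (topic, share) in enumerate(zip(dominant, strength)):
    print(f"document {doc_id}: topic #{topic} ({share:.0%} of the mixture)")
```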

## Combining with Original Data

In [60]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [93]:
topic_results.shape

(11992, 7)

In [100]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 3, 4, 0], dtype=int64)

In [68]:
npr['Topic'] = topic_results.argmax(axis=1)

In [70]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2


# NMF - Non-Negative Matrix Factorization

In [76]:
tfidf = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words='english')

In [77]:
dtm = tfidf.fit_transform(npr['Article'])

In [78]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [82]:
nmf_model = NMF(n_components=7, random_state=42)

In [83]:
nmf_model.fit(dtm)



NMF(n_components=7, random_state=42)

In [84]:
tfidf.get_feature_names()[23000]



'herpes'

In [85]:
tfidf.get_feature_names()[2300]

'albala'

In [87]:
nmf_model.components_

array([[0.00000000e+00, 2.49950821e-01, 0.00000000e+00, ...,
        1.70313822e-03, 2.37544362e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 8.22048918e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 3.12379960e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.89723338e-03, 0.00000000e+00, 1.50186440e-03, ...,
        7.06428924e-04, 5.85500542e-04, 6.89536542e-04],
       [4.01763234e-03, 5.31643833e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [88]:
len(nmf_model.components_)

7

In [89]:
len(nmf_model.components_[0])

54777

In [86]:
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print()
    print()

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'album', 'way', 'time', 'son

In [101]:
topic_result = nmf_model.transform(dtm)

In [102]:
topic_result[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

### NMF gives non-negative coefficients that show how strongly a document loads on each topic; they need not sum to 1.
### LDA gives, for each document, a probability distribution over the topics.
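To read NMF coefficients on a scale comparable to LDA's probabilities, one option is to divide each row by its sum. This is only a convenience for comparison, not something NMF does itself. The sketch below uses the coefficients printed above for the first article.

```python
import numpy as np

# Coefficients for the first article, copied from the topic_result[0] output above.
nmf_row = np.array([0.0, 0.12075603, 0.00140297, 0.05919954, 0.01518909, 0.0, 0.0])

total = nmf_row.sum()
# Guard against an all-zero row, which would otherwise divide by zero.
shares = nmf_row / total if total > 0 else np.zeros_like(nmf_row)

print(shares.round(3))   # relative weight of each topic for this article
print(shares.argmax())   # the dominant topic is unchanged by the rescaling
```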

In [103]:
len(topic_result)

11992

In [105]:
topic_result[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

In [107]:
topic_result[0].argmax() # it gives the index position of the maximum value.

1

In [108]:
# To apply the same operation to every row (document), pass `axis=1`; the result holds,
# for each document, the index of the topic with the largest coefficient.
topic_result.argmax(axis=1) 

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)

In [109]:
npr['Topic'] = topic_result.argmax(axis=1)

In [110]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5
6,With a who has publicly supported the debunk...,0
7,"I was standing by the airport exit, debating w...",0
8,"If movies were trying to be more realistic, pe...",0
9,"Eighteen years ago, on New Year’s Eve, David F...",5


In [111]:
mytopic_dir = {0:'Health',1:'Election',2:'Legis',3:'Politics',4:'Election',5:'Music',6:'Edu'}
npr['Topic Label'] = npr['Topic'].map(mytopic_dir)

In [112]:
npr.head()

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,Election
1,Donald Trump has used Twitter — his prefe...,1,Election
2,Donald Trump is unabashedly praising Russian...,1,Election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Politics
4,"From photography, illustration and video, to d...",6,Edu
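One thing to watch with a dictionary mapping: if a topic id is missing from the dictionary, pandas `Series.map` yields `NaN` for those rows. The pure-Python sketch below shows an explicit fallback instead. The incomplete `partial_labels` dictionary and the `"Unlabeled"` placeholder are invented for illustration.

```python
# Topic ids of the first five articles, read off the table above.
topics = [1, 1, 1, 3, 6]

# Hypothetical, deliberately incomplete mapping: topic 6 has no label yet.
partial_labels = {0: "Health", 1: "Election", 3: "Politics"}

# dict.get supplies a visible placeholder instead of silently producing a missing value.
labels = [partial_labels.get(topic, "Unlabeled") for topic in topics]
print(labels)
```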
