# Latent Dirichlet Allocation

## Data

We will use articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).


In [None]:
import pandas as pd 

In [None]:
npr = pd.read_csv('npr.csv')  # load the NPR articles dataset

In [None]:
npr.head() # the code for this assignment is like the LDA exercise in class

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


We only have the article text. Notice that there is no label column indicating which topic each article belongs to! Let's use LDA to try to discover clusters of articles.

In [None]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [None]:
len(npr)

11992

## Preprocessing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words), i.e. terms that are very common across documents. If a float (e.g. 0.95), the parameter represents a proportion of documents, so terms that appear in more than 95 percent of the documents are discarded; if an integer, it represents an absolute document count. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
`min_df` is the minimum document frequency, i.e. the minimum number of documents a word must appear in. For example, with `min_df=2`, a word must appear in at least 2 documents to be counted by this CountVectorizer. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, it represents an absolute document count. This parameter is ignored if vocabulary is not None.
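To make the two thresholds concrete, here is a minimal sketch on a made-up four-document corpus. The documents, the names `toy_docs`, `toy_cv`, `toy_dtm`, and the value `max_df=0.75` are illustrative assumptions, not part of the NPR analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the NPR data.
toy_docs = [
    "the cat sat",
    "the cat ran",
    "the dog ran",
    "the bird flew",
]

# max_df=0.75: drop terms found in more than 75% of documents (here: more than 3 of 4).
# min_df=2: drop terms found in fewer than 2 documents.
toy_cv = CountVectorizer(max_df=0.75, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)     # terms that survive both filters
print(toy_dtm.shape)  # (number of documents, number of kept terms)
```

In this toy corpus "the" appears in all four documents and is removed by `max_df`, while words that appear in a single document are removed by `min_df`, which leaves only "cat" and "ran".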

In [None]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [None]:
dtm = cv.fit_transform(npr['Article'])  # build the document-term matrix from the article text

In [None]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
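The summary above already tells us how sparse the matrix is. As a quick back-of-the-envelope check, the following sketch recomputes the density from the three numbers in that output (the variable names are just for illustration):

```python
# Values copied from the sparse-matrix summary above.
n_docs, n_terms = 11992, 54777
stored = 3033388

# Fraction of cells in the document-term matrix that hold a non-zero count.
density = stored / (n_docs * n_terms)
print(f"{density:.4%} of the cells are non-zero")

# Average number of distinct vocabulary terms per article.
print(f"about {stored / n_docs:.0f} distinct terms per article on average")
```

Well under one percent of the cells are populated, which is why the matrix is kept in a sparse format rather than as a dense array.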

## LDA
Fit an LDA model with 7 topics to the document-term matrix.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
LDA_model = LatentDirichletAllocation(n_components=7, random_state=42)  # create the LDA model

In [None]:
LDA_model.fit(dtm) #fitting LDA model

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
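Fitting on the full corpus takes a while, so it can help to try the same pipeline on a tiny example first. The sketch below uses a made-up four-document corpus and two topics; all names prefixed with `toy_` and the documents themselves are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the NPR data.
toy_docs = [
    "vote election campaign vote",
    "election campaign president vote",
    "health patients study disease",
    "disease health patients medical",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)

# n_components is the number of topics; random_state fixes the initialization.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_dtm)

doc_topics = toy_lda.transform(toy_dtm)
print(toy_lda.components_.shape)                 # (topics, vocabulary size)
print(doc_topics.shape)                          # (documents, topics)
print(np.allclose(doc_topics.sum(axis=1), 1.0))  # each row is a distribution over topics
```

The shapes mirror what the full model produces: one row of word weights per topic, and one row of topic weights per document.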

## Showing Stored Words

In [None]:
len(cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

54777

In [None]:
cv.get_feature_names_out()[2300]

'albala'

In [None]:
import random 

In [None]:
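feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])  # display 10 random words from the vocabulary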

jalalabad
bugler
b_m_jefferson
maguire
tentatively
innocence
rakim
uggams
976
theorist
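Drawing indices with `random.randint` can pick the same word twice. As a possible alternative, `random.sample` draws without replacement; the sketch below uses a short hypothetical word list in place of the fitted vocabulary.

```python
import random

# Hypothetical stand-in for the fitted vocabulary; with the real model this
# would be the list of feature names taken from `cv`.
vocabulary = ["jalalabad", "bugler", "maguire", "tentatively", "innocence", "theorist"]

rng = random.Random(42)                # seeded so the sample is reproducible
sampled = rng.sample(vocabulary, k=3)  # 3 distinct words, no repeats
print(sampled)
```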


## Show Top Words Per Topic

In [None]:
len(LDA_model.components_)

7

In [None]:
len(LDA_model.components_[0])

54777

In [None]:
single_topic = LDA_model.components_[0]

In [None]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [None]:
# Weight of word 27208 in this topic (per the argsort output above, the least representative word is at index 2475)
single_topic[27208]

3.142629908439293

In [None]:
# Word most representative of this topic
single_topic[42993]

6247.245510521077
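Values such as 6247.2 are unnormalized weights, not probabilities. If a probability distribution over words is wanted for each topic, one common approach is to divide every row of `components_` by its sum. The sketch below shows this on a small hypothetical array rather than on the fitted model.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [6.0, 3.0, 0.5, 0.5],
    [1.0, 1.0, 4.0, 2.0],
])

# Divide each row by its sum to get a per-topic distribution over words.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # each row sums to 1
```

Normalizing does not change the ranking of words within a topic, so the `argsort`-based top-word lists stay the same.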

In [None]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])
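Since `argsort` returns indices ordered from the smallest to the largest value, slicing from the end picks out the top entries. A minimal sketch on a hypothetical five-element array:

```python
import numpy as np

# Hypothetical weights for a 5-word vocabulary.
toy_topic = np.array([0.5, 3.0, 1.2, 9.1, 0.1])

order = toy_topic.argsort()     # indices from smallest to largest weight
top_two = order[-2:]            # indices of the two largest weights, ascending
top_two_desc = order[::-1][:2]  # same indices, largest weight first
print(order, top_two, top_two_desc)
```

Note that the slice `[-n:]` lists the strongest word last; reversing the order first puts it at the front.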

In [None]:
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA_model.components_):
    # print the top 15 words for each topic
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

## Attach Discovered Topic Labels to Original Articles

In [None]:
topic_results = LDA_model.transform(dtm)  # per-document topic weights, one row per article

In [None]:
topic_results.shape

(11992, 7)

In [None]:
topic_results[0] #how well each topic fits first document

array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
       2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

In [None]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [None]:
topic_results[0].argmax()  # index of the most representative topic

1
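Taking `argmax` keeps only the dominant topic. To see what that discards, the sketch below reuses the seven weights printed above for the first article and also looks up the runner-up topic (the names `doc0`, `best`, and `runner_up` are just for illustration).

```python
import numpy as np

# Topic weights for the first article, copied from the output above.
doc0 = np.array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
                 2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

print(np.isclose(doc0.sum(), 1.0))  # the weights form a distribution over the 7 topics
best = doc0.argmax()                # index of the dominant topic
runner_up = doc0.argsort()[-2]      # index of the second-strongest topic
print(best, runner_up)
```

For this article, topic 1 dominates with a weight of about 0.68, but topic 4 still carries about 0.30, so a single hard label hides a substantial secondary theme.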

## Combine with Original Data

In [None]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [None]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 3, 4, 0])

In [None]:
npr['Topic'] = topic_results.argmax(axis=1) #labelling topic with topic best fit to document

In [None]:
mytopic_dict = {0:'insurance',1:'politics',2:'environment',3:'medical',4:'election',5:'music',6:'education'}
npr['Topic Type'] = npr['Topic'].map(mytopic_dict) #attaching topic name according to topic number

In [None]:
npr.head()

Unnamed: 0,Article,Topic,Topic Type
0,"In the Washington of 2016, even when the polic...",1,politics
1,Donald Trump has used Twitter — his prefe...,1,politics
2,Donald Trump is unabashedly praising Russian...,1,politics
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,politics
4,"From photography, illustration and video, to d...",2,environment
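Once every article has a label, a natural next step is to count how many articles fall under each topic. The sketch below does this on a small hypothetical DataFrame, and also shows that `map` leaves a missing value for any topic number absent from the dictionary, which is worth checking for.

```python
import pandas as pd

# Hypothetical topic assignments for five articles.
toy = pd.DataFrame({"Topic": [1, 1, 2, 0, 1]})

# Hypothetical labels; topic 0 is deliberately left out to show the effect.
toy_labels = {1: "politics", 2: "environment"}
toy["Topic Type"] = toy["Topic"].map(toy_labels)

label_counts = toy["Topic Type"].value_counts()  # articles per labelled topic
n_unlabelled = toy["Topic Type"].isna().sum()    # rows whose topic has no label
print(label_counts)
print(n_unlabelled)
```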
