# Latent Dirichlet Allocation

## Data

We will be using articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd

npr = pd.read_csv('npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Notice that the articles come without topic labels! Let's use LDA to try to discover clusters of related articles.

## Preprocessing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
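
To make the two thresholds concrete, here is a minimal plain-Python sketch of the document-frequency rule described above, applied to a made-up four-document corpus. The helper `filter_by_document_frequency` and the toy documents are illustrative assumptions, not part of scikit-learn, and the whitespace tokenization is a simplification of what `CountVectorizer` does.

```python
from collections import Counter

def filter_by_document_frequency(docs, max_df=1.0, min_df=1):
    """Return the sorted vocabulary that survives the max_df / min_df cut-offs.

    Hypothetical helper for illustration only. Floats are treated as a
    proportion of documents, integers as absolute document counts.
    """
    n_docs = len(docs)
    # Count in how many documents each word appears (not how often overall).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))

    # Convert proportions to absolute document counts.
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df

    # Drop words strictly above max_count or strictly below min_count.
    return sorted(w for w, df in doc_freq.items() if min_count <= df <= max_count)

toy_docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
]
vocab = filter_by_document_frequency(toy_docs, max_df=0.9, min_df=2)
print(vocab)  # ['banana', 'cherry']
```

Here `apple` appears in all four documents (more than 90%) and is dropped as a corpus-specific stop word, while `date` appears in only one document and falls below `min_df=2`.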

In [3]:
cv = CountVectorizer(max_df=.9, min_df=2, stop_words='english') 
# max_df=.9: discard words that appear in more than 90% of the documents
# min_df=2: keep only words that appear in at least 2 documents

In [4]:
dtm = cv.fit_transform(npr['Article'])

In [5]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
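
The summary above reports 3,033,388 stored entries in an 11992 x 54777 matrix. A quick back-of-the-envelope check, using only the numbers printed above, suggests why a sparse format is a sensible choice here:

```python
n_docs, n_terms = 11992, 54777   # shape reported for dtm above
stored = 3033388                 # stored elements reported above

# Fraction of cells in the document-term matrix that are actually stored.
density = stored / (n_docs * n_terms)
print(f"{density:.2%} of the cells are stored")
```

With the real matrix in memory, the same ratio could be computed as `dtm.nnz / (dtm.shape[0] * dtm.shape[1])`.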

## LDA

In [6]:
from sklearn.decomposition import LatentDirichletAllocation

In [7]:
lda = LatentDirichletAllocation(n_components=7, random_state=42)

In [8]:
lda.fit(dtm)

## Showing Stored Words

In [9]:
# Grab the vocabulary of words
len(cv.get_feature_names_out())

54777

In [10]:
type(cv.get_feature_names_out())

numpy.ndarray

In [28]:
import random 

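feature_names = cv.get_feature_names_out()

for i in range(10):
    # randint is inclusive on both ends, so stop at the last valid index
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])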

nobel
umber
andrews
refrigeration
disputed
invincible
infraction
crayon
migratory
amoxicillin


### Showing Top Words Per Topic

In [29]:
len(lda.components_)

7

In [30]:
type(lda.components_)

numpy.ndarray

In [32]:
lda.components_.shape

(7, 54777)

In [31]:
lda.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
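
The rows of `components_` are not probabilities; scikit-learn documents them as pseudo-counts of how often each word was assigned to each topic. Many entries above sit near 1.42857e-01, i.e. 1/7, which is consistent with the default topic-word prior of `1 / n_components` for words a topic essentially never uses. Dividing each row by its sum gives a per-topic distribution over words. A minimal sketch on a small made-up array (the numbers are illustrative, not taken from the fitted model):

```python
import numpy as np

# Made-up pseudo-counts for 2 topics over a 4-word vocabulary (illustrative only).
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 3.0, 6.0],
])

# Normalize each row so that it sums to 1.
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_dist)
print(topic_word_dist.sum(axis=1))  # each row sums to 1
```

On the fitted model the same expression would be `lda.components_ / lda.components_.sum(axis=1, keepdims=True)`. Normalizing a row does not change the ranking of words within a topic, so sorting the raw values works just as well for finding top words.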

In [33]:
single_topic = lda.components_[0]

In [34]:
# argsort returns word indices ordered from lowest to highest weight for this topic
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [36]:
# ARGSORT --> INDEX POSITIONS SORTED FROM LEAST --> GREATEST
# TOP 10 VALUES (10 GREATEST VALUES)
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])
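
Because `argsort` orders indices from smallest to largest value, the slice `[-10:]` lists the strongest word last. A small sketch on made-up weights (the `weights` and `words` arrays are illustrative assumptions) shows the ordering and how to reverse it so the strongest word comes first:

```python
import numpy as np

weights = np.array([0.1, 5.0, 2.5, 0.3, 9.0])                    # made-up word weights
words = np.array(["alpha", "beta", "gamma", "delta", "omega"])   # made-up vocabulary

order = weights.argsort()          # indices from smallest to largest weight
top_three = order[-3:]             # the three largest, still in ascending order
top_three_desc = top_three[::-1]   # reverse so the strongest word comes first

print(order)                  # [0 3 2 1 4]
print(words[top_three_desc])  # ['omega' 'beta' 'gamma']
```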

In [39]:
top_twenty_words = single_topic.argsort()[-20:]

In [40]:
for index in top_twenty_words:
    print(cv.get_feature_names_out()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


In [48]:
for i,topic in enumerate(lda.components_):
    print(f"THE TOP 20 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names_out()[index] for index in topic.argsort()[-20:]])
    print("\n")

THE TOP 20 WORDS FOR TOPIC #0
['president', 'state', 'tax', 'insurance', 'trump', 'companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 20 WORDS FOR TOPIC #1
['white', 'according', 'attack', 'reported', 'war', 'military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 20 WORDS FOR TOPIC #2
['little', 'know', 'don', 'year', 'make', 'way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 20 WORDS FOR TOPIC #3
['world', 'research', 'university', 'percent', 'care', 'time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 20 WORDS FOR TOPIC #4
['donald', 'political', 'states', 'law', 'just', 'voters', 'vote', 'election', 'par

### Attaching Discovered Topic Labels to Original Articles

In [44]:
topic_results = lda.transform(dtm)

In [45]:
topic_results.shape

(11992, 7)

In [46]:
npr['Topic'] = topic_results.argmax(axis=1)

In [49]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2
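
`lda.transform` returns one row per article with a weight for each of the 7 topics, and `argmax(axis=1)` keeps only the index of the largest weight in each row. Keeping the largest weight itself alongside the label can help spot articles that do not belong clearly to any single topic. A small sketch on made-up weights (the array and the `confidence` name are illustrative assumptions):

```python
import numpy as np

# Made-up topic weights for 3 documents over 3 topics (each row sums to 1).
toy_topic_results = np.array([
    [0.10, 0.70, 0.20],
    [0.50, 0.25, 0.25],
    [0.20, 0.20, 0.60],
])

labels = toy_topic_results.argmax(axis=1)    # index of the dominant topic per row
confidence = toy_topic_results.max(axis=1)   # weight of that dominant topic

print(labels)      # [1 0 2]
print(confidence)  # [0.7 0.5 0.6]
```

On the real data this could be stored as an extra column, for example `npr['Topic_Weight'] = topic_results.max(axis=1)`, where the column name is a hypothetical choice.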
