# Unsupervised Learning
### LDA represents documents as mixtures of topics that spit out words with certain probabilities
> The number of topics is not learned from the data; the user chooses it up front

> The user must interpret what the topics are

* LDA - Latent Dirichlet Allocation
> The Dirichlet distribution is a probability distribution over probability distributions
> LDA is based on the Dirichlet distribution

**Assumptions:**
1. Documents with similar topics use similar groups of words
2. Latent topics can be identified by groups of words appearing together
3. Documents are probability distributions over latent topics
4. Topics themselves are probability distributions over words

<img src='document_distribution.png'>
<img src='topics_distributions.png'>
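
Assumptions 3 and 4 can be read as a recipe for generating a document. The sketch below is a toy illustration in NumPy, not how scikit-learn fits the model. The six-word vocabulary, the number of topics, and the Dirichlet concentration value of 0.5 are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical six-word vocabulary and topic count, chosen only for illustration
vocab = np.array(["health", "care", "vote", "election", "music", "song"])
n_topics = 3

# Assumption 4: each topic is a probability distribution over words
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Assumption 3: each document is a probability distribution over topics
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# Generate a 12-word document: pick a topic for each word, then a word from that topic
word_topics = rng.choice(n_topics, size=12, p=doc_topic)
words = [rng.choice(vocab, p=topic_word[t]) for t in word_topics]

print("document-topic mixture:", doc_topic.round(2))
print("generated words:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that could plausibly have produced them.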

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')
npr.head()

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [4]:
# npr['Article'][0]
npr.shape

(11992, 1)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
# discard words that appear in more than 90% of the documents
# only include the word if it appears in at least 2 documents
# use 'english' stop_words
cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')
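
To see what `max_df` and `min_df` do, here is a small sketch on a made-up four-document corpus (the documents and the `toy_` names are hypothetical). Stop words are left out so that only the two frequency filters are at work.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus
toy_docs = [
    "apple banana common",
    "apple cherry common",
    "banana cherry common",
    "durian common",
]

toy_cv = CountVectorizer(max_df=0.9, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# 'common' is in 4 of 4 documents (more than 90%), so max_df drops it;
# 'durian' is in only 1 document, so min_df drops it
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.toarray())
```

Only `apple`, `banana` and `cherry` survive, so the last document ends up as a row of zeros.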

In [5]:
# Because this is unsupervised learning (there are no labels to score against), a train-test split doesn't make sense here

In [6]:
dtm = cv.fit_transform(npr['Article'])

In [7]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
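
The summary above says the matrix is sparse; a quick calculation with the reported figures shows how sparse. This is plain arithmetic on the numbers printed above, nothing more.

```python
# Figures copied from the sparse-matrix summary above
n_docs, n_terms = 11992, 54777
stored = 3033388

# Fraction of cells that hold a non-zero count
density = stored / (n_docs * n_terms)
print(f"{density:.2%} of the cells are non-zero")
```

Well under 1% of the cells are filled, which is why the document-term matrix is kept in a compressed sparse format rather than as a dense array.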

## Performing LDA.fit for determining topics
1. Grab the vocabulary of words
2. Grab the topics
3. Grab the highest probability of words per topic

#### Step 1: Grab the vocab words

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

In [9]:
# Use n_components as the number of topics to be identified
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

In [10]:
# This may take a while
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
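
Since the number of topics is the user's choice, one rough way to compare candidates is the model's `perplexity` score. The sketch below uses a made-up six-document corpus and scores each model on the same data it was fit on, so treat it as a heuristic illustration rather than a recommendation for `n_components`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with three loose themes
toy_docs = [
    "health care insurance patients hospital",
    "patients hospital doctors health study",
    "election vote campaign voters party",
    "campaign party president election vote",
    "music album song band concert",
    "song band music festival album",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    # Lower perplexity means the model is less "surprised" by the word counts
    perplexity_by_k[k] = model.perplexity(toy_dtm)

for k, score in perplexity_by_k.items():
    print(f"n_components={k}: perplexity={score:.1f}")
```

The score is only a guide; whether the resulting topics make sense still has to be judged by reading their top words.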

In [11]:
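# the vocabulary: one entry per unique term kept by the CountVectorizer
# (on scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2)
len(cv.get_feature_names())
# retrieve a specific word by its column index
cv.get_feature_names()[50000]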

'transcribe'

In [14]:
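# ASIDE - look up a random word in the CountVectorizer vocabulary (the columns of the document-term matrix)
import random
# randrange excludes the upper bound, so the index is always valid (randint would include it)
random_word_id = random.randrange(len(cv.get_feature_names()))
cv.get_feature_names()[random_word_id]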

'ninjas'

#### Step 2: grab the topics

In [16]:
# the number of prescribed topics
display(len(LDA.components_))
# NumPy array of topic-word weights (pseudo-counts, not yet normalized into probabilities)
display(type(LDA.components_))
print(f'Number of distinct words in DTM: {LDA.components_.shape[1]}')
print(f'Number of topics (dictated) in corpus: {LDA.components_.shape[0]}')
print()
display(LDA.components_)

7

numpy.ndarray

Number of distinct words in DTM: 54777
Number of topics (dictated) in corpus: 7



array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
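
The values in `LDA.components_` are unnormalized weights, which is why some are in the thousands. The many entries near 0.142857 equal 1/7, which appears to match scikit-learn's default prior of `1 / n_components` for words a topic never picked up. To read a row as a probability distribution over words, divide it by its sum. A minimal sketch with a made-up weight matrix standing in for `LDA.components_`:

```python
import numpy as np

# Hypothetical 2-topic x 4-word weight matrix standing in for LDA.components_
components = np.array([
    [8.0, 2380.0, 0.14, 11.86],
    [27.0, 536.0, 0.14, 36.86],
])

# Divide each row by its sum so every topic becomes a distribution over words
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(4))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Normalizing does not change the ranking of words within a topic, so `argsort` on the raw weights gives the same top words.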

In [18]:
single_topic = LDA.components_[0]
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [19]:
import numpy as np
arr = np.array([10, 200,1])
display(arr)
# This demonstrates what indices would be used to sort from lowest to highest
# where 2 is associated with the value 1 (the smallest value), 0 with 10, and 1 with 200 (the largest value)
display(arr.argsort())

array([ 10, 200,   1])

array([2, 0, 1])

In [20]:
# ARGSORT --> INDEX POSITIONS SORTED FROM LEAST TO GREATEST
# TOP 10 VALUES (10 GREATEST VALUES)
# LAST 10 VALUES OF ARGSORT() ARE ASSOCIATED WITH 10 GREATEST VALUES
single_topic.argsort()[-10:] # get the indices of the highest probability words for the topic

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])
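
Slicing the end of `argsort()` returns the top entries in ascending order, so the strongest word comes last. Reversing the slice puts the largest first. A small sketch on the same three-element array used earlier:

```python
import numpy as np

arr = np.array([10, 200, 1])

# Last two positions of the ascending argsort, reversed so the largest comes first
top_two = arr.argsort()[-2:][::-1]

print(top_two)       # indices of the two largest values
print(arr[top_two])  # the values themselves, largest first
```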

In [21]:
top_ten_words = single_topic.argsort()[-10:]

In [22]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


In [58]:
# Let's increase the number of words; the top ten suggest government, business, or healthcare
top_twenty_words = single_topic.argsort()[-20:]
for index in top_twenty_words:
    print(cv.get_feature_names()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


#### Step 3: Grab the highest probability words per topic

In [42]:
for i,topic in enumerate(LDA.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',
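
The loop above calls `cv.get_feature_names()` once per topic, which rebuilds the full vocabulary list each time. One possible alternative is a small helper that takes the vocabulary once and indexes it with NumPy. The function name `top_words` and the toy inputs are hypothetical; in the notebook the inputs would be `LDA.components_` and the vectorizer's feature names.

```python
import numpy as np

def top_words(components, vocab, n=15):
    """Return, for each topic row, its n highest-weight words, highest first."""
    vocab = np.asarray(vocab)
    # argsort ascending, keep the last n columns, then flip to descending
    top_idx = np.argsort(components, axis=1)[:, -n:][:, ::-1]
    return [vocab[row].tolist() for row in top_idx]

# Hypothetical weights and vocabulary for illustration
toy_components = np.array([
    [5.0, 0.1, 3.0, 0.2],
    [0.1, 4.0, 0.3, 6.0],
])
toy_vocab = ["health", "vote", "care", "election"]

for i, words in enumerate(top_words(toy_components, toy_vocab, n=2)):
    print(f"topic {i}: {words}")
```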

In [43]:
# Attach these topic numbers to the original articles

In [29]:
display(dtm)
display(npr.head())

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [30]:
topic_results = LDA.transform(dtm)

In [46]:
# One row per document/article in the original dataframe,
# one column per topic
display(topic_results.shape)
# Topic probabilities for the first document
display(topic_results[0].round(2))
# The first document is most strongly associated with the second topic (topic 1)
for i,topic in enumerate(LDA.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
    print('\n')
    
display(npr['Article'][0][:120])
# topic 1 includes words like military, house, security, russia...

(11992, 7)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year sho'

In [47]:
# Now, let's just return the position of the highest probability:
topic_results[0].argmax()

1

In [48]:
# Now assign them to the original dataframe to "predict" the topics
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()

| | Article | Topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 1 |
| 4 | From photography, illustration and video, to d... | 2 |
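
`argmax` keeps only the winning topic and discards how strong the win was. A possible extension, sketched on made-up probabilities, stores the top probability as well and maps topic numbers to hand-written labels. The column names `Topic_Prob` and `Topic_Label` and the labels themselves are hypothetical; LDA never names a topic, so the labels come from reading the top words.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities (3 documents x 3 topics)
toy_results = np.array([
    [0.02, 0.68, 0.30],
    [0.10, 0.15, 0.75],
    [0.34, 0.33, 0.33],
])
toy_df = pd.DataFrame({"Article": ["doc a", "doc b", "doc c"]})

# Index of the most probable topic in each row
toy_df["Topic"] = toy_results.argmax(axis=1)
# How strongly the document leans toward its top topic
toy_df["Topic_Prob"] = toy_results.max(axis=1)
# Hand-written labels chosen by a person after inspecting the top words
topic_labels = {0: "business", 1: "politics", 2: "lifestyle"}
toy_df["Topic_Label"] = toy_df["Topic"].map(topic_labels)

print(toy_df)
```

The third row shows why the extra column helps: its top topic wins with a probability of only 0.34, so the assignment is close to arbitrary.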


# Summarized:

#### Setup:

In [50]:
import pandas as pd
npr = pd.read_csv('npr.csv')
npr.head()
from sklearn.feature_extraction.text import CountVectorizer

# discard words that appear in more than 90% of the documents
# only include the word if it appears in at least 2 documents
# use 'english' stop_words
cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')

dtm = cv.fit_transform(npr['Article'])

#### Step 1: Grab the vocab words & Step 2: Grab the topics

In [52]:
from sklearn.decomposition import LatentDirichletAllocation
# Use n_components as the number of topics to be identified
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
# This may take a while
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

#### Words:

In [64]:
# the vocabulary: one entry per unique term kept by the CountVectorizer
cv.get_feature_names()[:10]

['00', '000', '00000', '000s', '000th', '002', '004', '007', '009', '00s']

#### Step 2: Grab the topics

In [67]:
# This holds the topics:
display(LDA.components_)

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])

#### Combine Step 1 and 2:
> See topics and their top words (not necessary for computation, but helpful for labeling the topics)

In [71]:
for i,topic in enumerate(LDA.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]][::-1])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['says', 'said', 'health', 'people', 'care', 'million', 'company', 'government', 'percent', 'new', '000', 'federal', 'year', 'money', 'companies']


THE TOP 15 WORDS FOR TOPIC #1
['said', 'trump', 'president', 'police', 'told', 'people', 'news', 'says', 'reports', 'npr', 'government', 'russia', 'security', 'house', 'military']


THE TOP 15 WORDS FOR TOPIC #2
['says', 'like', 'people', 'just', 'food', 'years', 'new', 'city', 'water', 'time', 'day', 'home', 'family', 'world', 'way']


THE TOP 15 WORDS FOR TOPIC #3
['says', 'people', 'health', 'women', 'like', 'study', 'children', 'just', 'patients', 'disease', 'medical', 'years', 'don', 'new', 'time']


THE TOP 15 WORDS FOR TOPIC #4
['trump', 'said', 'clinton', 'president', 'state', 'people', 'campaign', 'republican', 'court', 'obama', 'new', 'party', 'election', 'vote', 'voters']


THE TOP 15 WORDS FOR TOPIC #5
['like', 'just', 'people', 'think', 'know', 'time', 'really', 'music', 'way', 'new', 'don', 'life

#### Step 3: Grab the topic probabilities for each document

In [69]:
topic_results = LDA.transform(dtm)

#### Finalize: Assign the classes/topics to the observations

In [70]:
npr['Topic'] = topic_results.argmax(axis=1)
npr.head()

| | Article | Topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 1 |
| 4 | From photography, illustration and video, to d... | 2 |
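
The same fitted objects can also score text that was not in the original corpus. The sketch below uses a made-up toy corpus and two made-up new sentences. The key point is that new text goes through the already-fitted vectorizer with `transform`, not `fit_transform`, so its columns line up with the vocabulary the model was trained on.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for npr['Article']
toy_docs = [
    "health care insurance patients hospital",
    "patients hospital doctors health study",
    "election vote campaign voters party",
    "campaign party president election vote",
    "music album song band concert",
    "song band music festival album",
]
toy_cv = CountVectorizer()
toy_lda = LatentDirichletAllocation(n_components=3, random_state=42)
toy_lda.fit(toy_cv.fit_transform(toy_docs))

# Unseen text: reuse the fitted vocabulary; words outside it are ignored
new_docs = ["voters chose a president in the election", "the band played a new song"]
new_dtm = toy_cv.transform(new_docs)
new_topic_probs = toy_lda.transform(new_dtm)

print(new_topic_probs.round(2))
print(new_topic_probs.argmax(axis=1))
```

Which topic number each sentence lands in depends on the fit, so the numbers only mean something once the topics' top words have been inspected.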
