# Unsupervised Learning
### LDA represents documents as mixtures of topics, each of which generates words with certain probabilities
> The number of topics is not learned from the data; the user must choose it

> The user must also interpret what each topic means

* LDA - Latent Dirichlet Allocation
> Named after the Dirichlet distribution, a probability distribution over probability vectors
> LDA is based on this distribution

**Assumptions:**
1. Documents with similar topics use similar groups of words
2. Latent topics can be identified by groups of words appearing together
3. Documents are probability distributions over latent topics
4. Topics themselves are probability distributions over words
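
Assumptions 3 and 4 can be read as a recipe for generating a document. The block below is a minimal sketch of that recipe using only the standard library. The vocabulary, the number of topics, the concentration value `0.5`, and the helper `sample_dirichlet` are all made up for illustration and are not part of the notebook.

```python
import random

random.seed(0)  # fixed seed so repeated runs draw the same samples

def sample_dirichlet(alpha):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical toy vocabulary and topic count
vocab = ["election", "vote", "senate", "virus", "vaccine", "hospital"]
n_topics = 2

# Assumption 4: each topic is a probability distribution over words
topic_word = [sample_dirichlet([0.5] * len(vocab)) for _ in range(n_topics)]

# Assumption 3: each document is a probability distribution over topics
doc_topic = sample_dirichlet([0.5] * n_topics)

# Generate a short document: pick a topic, then pick a word from that topic
document = []
for _ in range(8):
    topic = random.choices(range(n_topics), weights=doc_topic)[0]
    word = random.choices(vocab, weights=topic_word[topic])[0]
    document.append(word)

print("topic mixture:", doc_topic)
print("document:", document)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that could plausibly have produced them.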

<img src='document_distribution.png'>
<img src='topics_distributions.png'>

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [4]:
# npr['Article'][0]  # uncomment to print the full text of the first article
npr.shape

(11992, 1)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
# max_df=0.9: discard words that appear in more than 90% of the documents
# min_df=2: keep a word only if it appears in at least 2 documents
# stop_words='english': drop common English stop words
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

In [7]:
# Because this is unsupervised learning, there are no labels to evaluate against, so it doesn't make sense to split the data into training and test sets

In [8]:
dtm = cv.fit_transform(npr['Article'])

In [9]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [10]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
# n_components sets the number of topics to identify
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [13]:
# This may take a while
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [None]:
# 1. Grab the vocabulary of words
# 2. Grab the topics
# 3. Grab the highest-probability words per topic