# Topic Modelling Using Latent Dirichlet Allocation

## What does LDA do?
LDA represents each document as a mixture of topics, where each topic emits words with certain probabilities.
It assumes that documents are produced in the following fashion:
- Choose a topic mixture for the document over K fixed topics.

### Generate each word in the document
- First pick a topic according to the topic mixture (a multinomial distribution) sampled in the previous step.
- Then use that topic to generate the word itself.
- Example: if the topic is 'food', the probability of generating the word 'apple' is higher than that of generating 'home'.

#### LDA then tries to backtrack from the words to the topics that might have generated this collection of words
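
The generative story above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the two topics, the four-word vocabulary, the probabilities and the Dirichlet parameter `alpha` are all made up for the example and do not come from the NPR data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["apple", "bread", "home", "door"]
topic_word = np.array([
    [0.60, 0.30, 0.05, 0.05],   # a 'food'-like topic
    [0.05, 0.05, 0.50, 0.40],   # a 'house'-like topic
])

def generate_document(n_words, alpha=(1.0, 1.0)):
    """Sample one document by following the generative story."""
    # Step 1: draw the document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Step 2: pick a topic according to the document's topic mixture.
        t = rng.choice(len(theta), p=theta)
        # Step 3: pick a word according to that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[t])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta, words)
```

Fitting LDA is the reverse problem: only `words` is observed, and both `theta` and `topic_word` have to be inferred.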


## Assumptions for LDA (Latent Dirichlet Allocation) in Topic Modelling
- Documents are probability distributions over latent topics.
- Topics themselves are probability distributions over words.
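
Taken together, the two assumptions say that the probability of seeing a word in a document is a topic-weighted average. One way to write this down is shown here; the symbols θ (document-topic distribution) and φ (topic-word distribution) are notation introduced for this sketch and are not used elsewhere in the notebook.

```latex
p(w \mid d) \;=\; \sum_{t=1}^{K} p(w \mid t)\, p(t \mid d)
\;=\; \sum_{t=1}^{K} \phi_{t,w}\, \theta_{d,t},
\qquad \sum_{t=1}^{K} \theta_{d,t} = 1, \quad \sum_{w} \phi_{t,w} = 1
```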

## Steps in how LDA is executed
- We iterate through every document and randomly assign each word in it to one of the K topics that we defined before.
- This random assignment already gives us a topic representation of every document and a word distribution for every topic, although not very good ones.
- To improve them, we iterate over every word w in every document d and, for each topic t, calculate:
    p(topic t | document d) - the proportion of words in document d that are currently assigned to topic t.
- For the same word and each topic t, we also calculate:
    p(word w | topic t) - the proportion of assignments to topic t, over all documents, that come from the word w.
- Reassign w to a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t).
- This product is essentially the probability that topic t generated the word w.
- After repeating these steps many times, the assignments settle down, and for each topic we have the words with the highest probability of being assigned to it.
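
The steps above describe a sampling procedure usually called collapsed Gibbs sampling. Below is a minimal sketch of it, assuming a tiny hypothetical corpus of word ids and made-up smoothing priors `alpha` and `beta`. It is meant to mirror the steps, not to reproduce what scikit-learn does: `LatentDirichletAllocation` fits the model with variational Bayes rather than sampling.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t

    # Random initial assignment of every word to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment from the counts.
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1

                # Proportional to p(t | d) * p(w | t), smoothed by alpha and beta.
                p_t_given_d = doc_topic[d] + alpha
                p_w_given_t = (topic_word[:, w] + beta) / (topic_total + beta * vocab_size)
                p = p_t_given_d * p_w_given_t
                p /= p.sum()

                # Reassign the word to a topic sampled from that distribution.
                t = rng.choice(n_topics, p=p)
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1

    return doc_topic, topic_word

# Hypothetical corpus: word ids 0-2 are 'food' words, 3-5 are 'sport' words.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of a document ended up in each topic, and each row of `topic_word` counts how often each word was assigned to a topic.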

## Importing basic libraries

In [2]:
import pandas as pd
npr_csv = pd.read_csv('npr.csv')

In [3]:
npr_csv.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


## Our dataset consists of articles on different topics

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

max_df = 0.9 gets rid of terms that appear in more than 90% of the documents<br>
min_df = 2 keeps only terms that appear in at least 2 documents<br>
stop_words = "english" removes common English stop words

In [5]:
cv = CountVectorizer(max_df = 0.9, min_df = 2, stop_words="english")
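
To see what these three settings do, here is a small sketch on a hypothetical four-document corpus (the documents and the `toy_` names are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus.
toy_docs = [
    "apple banana fruit",
    "apple cherry fruit",
    "apple banana car",
    "apple engine car",
]

toy_cv = CountVectorizer(max_df=0.9, min_df=2, stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

# 'apple' is in 100% of the documents (more than 90%), so max_df drops it.
# 'cherry' and 'engine' each appear in only one document, so min_df drops them.
print(sorted(toy_cv.vocabulary_))  # ['banana', 'car', 'fruit']
print(toy_dtm.shape)               # (4, 3)
```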

Build the document-term matrix, which holds the count of each term in each document

In [7]:
dtm = cv.fit_transform(npr_csv['Article'])

Importing our LDA Library

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
LDA = LatentDirichletAllocation(n_components= 7, random_state=42)

In [11]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

### The number of words in our vocabulary

In [12]:
len(cv.get_feature_names())

54777

### Extracting the top 15 words from each topic 

In [13]:
import numpy as np

In [14]:
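feature_names = cv.get_feature_names()  # on scikit-learn >= 1.0, use cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'The top 15 words for topic #{index}')
    # argsort() sorts ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')
    print('\n')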

The top 15 words for topic #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




The top 15 words for topic #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




The top 15 words for topic #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




The top 15 words for topic #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




The top 15 words for topic #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




The top 15 words for topic #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know'
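
The `argsort()[-15:]` idiom used to pick the top words can be checked on a small example. The vocabulary and weights here are hypothetical stand-ins for one row of `LDA.components_`:

```python
import numpy as np

# Hypothetical word weights for a single topic (one row of components_).
toy_vocab = np.array(["cat", "dog", "fish", "bird", "horse"])
toy_topic = np.array([0.1, 5.0, 2.0, 0.3, 9.0])

# argsort() returns indices ordered from the smallest weight to the largest,
# so the last 3 indices point at the 3 heaviest words.
top3 = toy_topic.argsort()[-3:]
print(toy_vocab[top3])        # ['fish' 'dog' 'horse']

# Reverse the slice to list the most important word first.
print(toy_vocab[top3[::-1]])  # ['horse' 'dog' 'fish']
```

This is why the most characteristic word of each topic appears last in the printed lists.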

In [15]:
topic_results = LDA.transform(dtm)

In [17]:
npr_csv['Topic'] = topic_results.argmax(axis= 1)

In [19]:
npr_csv

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2
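
`transform` returns one row per document with a weight for every topic, and `argmax(axis=1)` keeps only the strongest topic in each row. A small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical output of transform() for 3 documents and 4 topics.
toy_topic_results = np.array([
    [0.05, 0.80, 0.10, 0.05],
    [0.40, 0.10, 0.45, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])

# For each row (document), pick the column (topic) with the largest value.
toy_topics = toy_topic_results.argmax(axis=1)
print(toy_topics)  # [1 2 0]
```

Note that the second document is almost evenly split between topics 0 and 2, and the third has no dominant topic at all (ties resolve to the first index). A single label hides that mixture.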


# Non Negative Matrix Factorization
- NNMF is an unsupervised learning algorithm that performs dimensionality reduction and clustering at the same time.
- We will use TF-IDF in conjunction with this algorithm to model topics across documents.

## General idea behind NNMF
- We are given a non-negative matrix A (the Term Document Matrix) containing our features. The goal is to find a K-dimensional approximation of A in terms of non-negative factors W (Basis Vectors) and H (Coefficient Matrix).

### Note :
- Basis Vectors: The topics (clusters) in the data.
- Coefficient Matrix: The membership weights for documents relative to each topic.

<img src ="NNMF_matrix.png" width ="70%" alt ="Non_neagtive metrices" />

- Basically, we want the product of W and H to approximate our matrix A. To measure how close the approximation is, we calculate an objective function.

<img src ="objective_function.png" width ="70%" alt ="Objective Function" />
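
In text form, a common choice of objective is the squared Frobenius norm of the reconstruction error. This is an assumption about what the figure shows, although it agrees with the `beta_loss='frobenius'` setting in the fitted model's output further down. The dimensions n and m are notation introduced here.

```latex
\min_{W \ge 0,\; H \ge 0} \; \frac{1}{2}\,\lVert A - WH \rVert_F^2
\;=\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \bigl(A_{ij} - (WH)_{ij}\bigr)^2
```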

- An expectation-maximization style optimization then alternately refines W and H in order to minimise the value of the objective function.

<img src ="approximate_expectation.png" width ="70%" alt ="Approximate Expectation" />
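
One well-known way to carry out this alternating refinement is the multiplicative update rule of Lee and Seung. The sketch below assumes that rule and the squared-error objective; it may differ in detail from the figure, and it is not what scikit-learn runs in this notebook (the model below reports `solver='cd'`, coordinate descent). The matrix `A` is a made-up example.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-10, seed=0):
    """Factorise a non-negative matrix A (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed. Both updates only multiply
        # by non-negative ratios, so W and H stay non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 4 x 4 non-negative matrix with two obvious blocks.
A = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_multiplicative(A, k=2)
objective = 0.5 * np.sum((A - W @ H) ** 2)
print(np.round(W @ H, 2))
print("objective:", objective)
```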

### So we'll create a Term Document Matrix with TF-IDF Vectorization

<img src ="tem_document_matrix.png" width ="70%" alt ="Term Document Matrix" />

### Achieving our final result

<img src ="final_matrix.png" width ="70%" alt ="Target Matrix" />


## In order to implement it we'll follow the same steps as in LDA

In [8]:
import pandas as pd

In [9]:
npr_csv = pd.read_csv('npr.csv')

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tfid = TfidfVectorizer(max_df= 0.95, min_df= 2, stop_words= 'english')

In [13]:
dtm = tfid.fit_transform(npr_csv['Article'])
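
A small sketch of what TF-IDF weighting produces, using a hypothetical three-document corpus and the vectorizer's default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus.
toy_docs = [
    "health care insurance",
    "health study patients",
    "election campaign voters",
]

toy_tfidf = TfidfVectorizer()
dense = toy_tfidf.fit_transform(toy_docs).toarray()
vocab = toy_tfidf.vocabulary_

# Every entry is non-negative, which NMF requires.
print((dense >= 0).all())

# With the default norm='l2', each document row has unit length.
norms = np.linalg.norm(dense, axis=1)
print(norms)

# 'health' appears in two documents, so within the first document it gets a
# lower weight than 'care', which appears in only one.
print(dense[0, vocab["health"]] < dense[0, vocab["care"]])
```

Unlike the raw counts used for LDA, these weights down-weight terms that are spread across many documents.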

In [14]:
from sklearn.decomposition import NMF

In [15]:
nfm_model = NMF(n_components= 7, random_state=42)

In [16]:
nfm_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
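
It helps to know how W and H map onto the scikit-learn objects. Because scikit-learn expects documents as rows, the roles are transposed relative to the term-document picture above: the output of `fit_transform` (or `transform`) holds the per-document topic weights, and `components_` holds the per-topic term weights. A minimal sketch on a made-up matrix (the `init` and `max_iter` values are choices for this example only):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term matrix: 4 documents (rows) x 5 terms (columns).
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 3.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
doc_topic = toy_nmf.fit_transform(X)   # one row of topic weights per document
topic_term = toy_nmf.components_       # one row of term weights per topic

print(doc_topic.shape, topic_term.shape)  # (4, 2) (2, 5)
print(np.round(doc_topic @ topic_term, 1))
```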

In [18]:
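feature_names = tfid.get_feature_names()  # on scikit-learn >= 1.0, use tfid.get_feature_names_out()
for i, topic in enumerate(nfm_model.components_):
    print(f"The top 15 words from Topic #{i}")
    # Use a separate name for the word index so it does not shadow the topic index i
    print([feature_names[j] for j in topic.argsort()[-15:]])
    print('\n')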

The top 15 words from Topic #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


The top 15 words from Topic #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The top 15 words from Topic #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The top 15 words from Topic #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


The top 15 words from Topic #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The top 15 words from Topic #5
['love', 've', 'don

In [19]:
topic_results = nfm_model.transform(dtm)

In [20]:
npr_csv['Topic']  = topic_results.argmax(axis=1)

In [21]:
npr_csv.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


## Labeling our topics

In [30]:
my_topic_dict = {0: 'Health', 1: 'Election', 2: 'Legislation', 3: 'Politics', 4: 'Election', 5: 'Music', 6: 'Education'}
npr_csv['Topic Label'] = npr_csv['Topic'].map(my_topic_dict)

In [31]:
npr_csv[0:10]

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,Election
1,Donald Trump has used Twitter — his prefe...,1,Election
2,Donald Trump is unabashedly praising Russian...,1,Election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Politics
4,"From photography, illustration and video, to d...",6,Education
5,I did not want to join yoga class. I hated tho...,5,Music
6,With a who has publicly supported the debunk...,0,Health
7,"I was standing by the airport exit, debating w...",0,Health
8,"If movies were trying to be more realistic, pe...",0,Health
9,"Eighteen years ago, on New Year’s Eve, David F...",5,Music
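
One thing to watch with `map` and a hand-written dictionary: any topic id missing from the dictionary silently becomes `NaN`. A short sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical topic ids; topic 7 has no entry in the dictionary.
toy = pd.DataFrame({"Topic": [0, 5, 7]})
toy_labels = {0: "Health", 5: "Music"}

toy["Topic Label"] = toy["Topic"].map(toy_labels)
print(toy)

# Ids without a label come back as NaN, so it is worth checking for gaps.
n_missing = toy["Topic Label"].isna().sum()
print(n_missing)  # 1
```

The labels themselves are a manual reading of each topic's top words, so they should be revisited whenever the model is refitted with different settings.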
