### NLP: topic modelling using LDA and NMF  
Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.  

The "topics" produced by topic modeling techniques are clusters of similar words.  
  
We explore 2 methods here:  
1) LDA (Latent Dirichlet Allocation): a probabilistic method. You choose the number of topics, and the model assigns related words to each topic.  
2) NMF (Non-Negative Matrix Factorization): the same idea, but the topics come from factorising the document-term matrix.

### Import the necessary libraries
- LDA works only with a bag-of-words representation.
- Regular expressions (`re`) and `nltk` are imported for processing text (gensim and spacy are not needed for the scikit-learn workflow used here).
- pyLDAvis and matplotlib are used for visualization, and numpy for numerical work.
- Pandas is used for manipulating and viewing data in tabular format.



In [2]:
# Sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# Data handling and text processing
import numpy as np
import pandas as pd
import re, nltk

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline



#### display_topics() is a commonly used function to display topics and their top terms
- model - the fitted topic model (LDA or NMF)
- feature_names - the vocabulary terms, in the vectorizer's column order
- no_top_words - how many terms to display per topic

**Do not change this function.**

In [3]:

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
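
The slice `topic.argsort()[:-no_top_words - 1:-1]` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the negative-step slice walks back from the end to pick the `no_top_words` largest. Below is a minimal, dependency-free sketch of the same ranking, assuming a plain list of weights. The names `top_term_indices`, `weights` and `vocab`, and the weights themselves, are illustrative and not part of the notebook.

```python
def top_term_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    # Indices ordered from smallest to largest weight (what argsort gives).
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in display_topics: start at the last element and step
    # backwards, stopping before position -n - 1.
    return ascending[:-n - 1:-1]


# Hypothetical topic-term weights for a five-word vocabulary.
vocab = ["health", "police", "study", "school", "drug"]
weights = [0.40, 0.05, 0.30, 0.10, 0.15]

top = top_term_indices(weights, 3)
print(" ".join(vocab[i] for i in top))
```

For tied weights the order may differ from NumPy's, since the two sorts break ties differently.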

#### Function for Data Cleaning
- clean_documents() performs data cleansing on the raw documents.
- It is a stub function for now. When you prepare your own data set, write your own pre-processing logic here.
- As you evaluate the resulting topics, you may have to revisit this function to improve the data clean-up.

In [4]:
def clean_documents(document):
    # Placeholder: write your data preparation code here
    
    return document
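
As one possible starting point for the stub above, here is a hedged sketch of a cleaner. It assumes the function receives a list of raw strings (which is how the notebook calls it) and applies three regular-expression passes. The name `clean_documents_sketch`, the sample texts and the specific rules are illustrative choices, not the notebook's method.

```python
import re


def clean_documents_sketch(documents):
    """Illustrative cleaner: takes a list of raw strings, returns a cleaned list."""
    cleaned = []
    for text in documents:
        text = text.lower()
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
        text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters only
        text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
        cleaned.append(text)
    return cleaned


# Hypothetical raw inputs.
sample = [
    "WASHINGTON -- Congressional Republicans have a new fear: They might win.",
    "Read more at https://example.com/story?id=42 (updated 2016)",
]
print(clean_documents_sketch(sample))
```

Keeping letters only is a blunt rule: it splits contractions such as "don't" into "don" and "t", so the rules are worth adjusting once you have inspected the resulting topics.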

###  Data Processing 

#### This is the section to modify if you have other sources
- Load the documents from their source.
- The LDA topic model requires a document-word matrix as its main input.
- Vectorise the documents using count vectorization.
- LDA needs raw term counts (not TF-IDF weights) because it is a probabilistic model of word occurrences.
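
To make the document-word matrix concrete, here is a small sketch that builds one by hand with the standard library. The three documents are made up for illustration. `CountVectorizer` does more than this (lower-casing, stop-word removal, the `max_features` cap), all of which is omitted here.

```python
from collections import Counter

# Hypothetical mini-corpus.
docs = [
    "police said officers shot man",
    "health study said drug risk",
    "police officers said police report",
]

# Vocabulary: every distinct word, in sorted order (one matrix column each).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary term, raw counts as values.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Word order inside each document is discarded; only the counts survive, which is what "bag of words" means.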


In [5]:
df = pd.read_csv('datasets/News Set A.csv')
data= df["content"].values.tolist()

corpus = clean_documents(data)
corpus[0]

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 

In [6]:
no_features = 1000
cv = CountVectorizer(max_features=no_features, stop_words='english')
document_terms = cv.fit_transform(corpus)
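# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
tf_feature_names = cv.get_feature_names()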

#### Create the LDA model and apply it to the corpus
- Create the LDA object.
- Fit and transform the vectorized documents (the term-count matrix) using LDA.


In [7]:
no_topics = 20    # this is just an initial guess
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
# Fit the model and get the document-topic matrix
lda_output = lda_model.fit_transform(document_terms)
# print(lda_model)  # To look at the Model attributes
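
`fit_transform` returns a document-topic matrix: one row per document, one column per topic, with each row describing how strongly that document loads on each topic. A common next step is to read off the dominant topic per document. The sketch below does this on hypothetical rows (3 documents, 4 topics). The values and the helper name `dominant_topic` are made up for illustration.

```python
# Hypothetical document-topic rows; each row sums to 1.
doc_topic = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.85, 0.05],
    [0.25, 0.40, 0.25, 0.10],
]


def dominant_topic(row):
    """Return (topic_index, weight) for the highest-weighted topic in a row."""
    index = max(range(len(row)), key=lambda i: row[i])
    return index, row[index]


dominant = [dominant_topic(row) for row in doc_topic]
for doc_id, (topic, weight) in enumerate(dominant):
    print("Document %d -> Topic %d (weight %.2f)" % (doc_id, topic, weight))
```

With the notebook's NumPy array, `lda_output.argmax(axis=1)` gives the same topic indices in one call.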


### Preview the terms of each topic
- Display the top terms of each topic.

In [8]:
no_top_words = 20
display_topics(lda_model, tf_feature_names, no_top_words)

Topic 0:
said health people study dr use medical research drug data years cases new according human university change report risk used
Topic 1:
clinton hillary sanders campaign says democratic presidential state bernie foundation debate mrs nominee email candidate emails secretary supporters said trump
Topic 2:
said family life children years told father time mother home like son black day man young child parents just wife
Topic 3:
school percent students university education 000 workers jobs college year americans student schools american state states program immigration high number
Topic 4:
said police cnn officers told according man people officer gun authorities killed video shooting black shot attack reported death incident
Topic 5:
china united north states korea said trade nuclear world american south president iran chinese deal foreign policy country international obama
Topic 6:
house republicans senate ryan republican democrats care said health congress obamacare paul speaker 

In [9]:
kmin= 15
kmax = 25
tf_vectorized_documents= document_terms

topic_models = []
# try each value of k
for k in range(kmin,kmax+1):
    print("Applying LDA for k=%d ..." % k )
    lda_model = LatentDirichletAllocation(n_components=k, max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
    lda_output = lda_model.fit_transform(tf_vectorized_documents)  
    log_likelihood = lda_model.score(tf_vectorized_documents)
    perplexity = lda_model.perplexity(tf_vectorized_documents)
    topic_models.append( (k,lda_model,lda_output, log_likelihood, perplexity) ) # store for later


Applying LDA for k=15 ...
Applying LDA for k=16 ...
Applying LDA for k=17 ...
Applying LDA for k=18 ...
Applying LDA for k=19 ...
Applying LDA for k=20 ...
Applying LDA for k=21 ...
Applying LDA for k=22 ...
Applying LDA for k=23 ...
Applying LDA for k=24 ...
Applying LDA for k=25 ...


In [10]:
# Prefer the k with the highest log-likelihood and the lowest perplexity
for model in topic_models:
    print("k topics : % 2d, Log Likelihood : % 5.2f   Perplexity : %5.2f" % (model[0], model[3], model[4]))

k topics :  15, Log Likelihood : -51082868.30   Perplexity : 524.73
k topics :  16, Log Likelihood : -51030554.64   Perplexity : 521.38
k topics :  17, Log Likelihood : -51052165.09   Perplexity : 522.76
k topics :  18, Log Likelihood : -50966566.11   Perplexity : 517.30
k topics :  19, Log Likelihood : -50905533.92   Perplexity : 513.45
k topics :  20, Log Likelihood : -50798014.52   Perplexity : 506.72
k topics :  21, Log Likelihood : -50805089.87   Perplexity : 507.16
k topics :  22, Log Likelihood : -50805418.66   Perplexity : 507.18
k topics :  23, Log Likelihood : -50655259.89   Perplexity : 497.93
k topics :  24, Log Likelihood : -50729983.07   Perplexity : 502.51
k topics :  25, Log Likelihood : -50668270.93   Perplexity : 498.73
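
The table above can also be scanned programmatically. The sketch below copies the printed values into a list and picks the best `k` under each criterion. It also checks an assumption about how the two columns relate: if perplexity is exp(-log_likelihood / N) for a fixed word count N, then -log_likelihood / ln(perplexity) should be nearly the same in every row. Both metrics here were computed on the training documents, so treat the winner as a hint rather than a verdict.

```python
import math

# (k, log_likelihood, perplexity) copied from the output above.
results = [
    (15, -51082868.30, 524.73),
    (16, -51030554.64, 521.38),
    (17, -51052165.09, 522.76),
    (18, -50966566.11, 517.30),
    (19, -50905533.92, 513.45),
    (20, -50798014.52, 506.72),
    (21, -50805089.87, 507.16),
    (22, -50805418.66, 507.18),
    (23, -50655259.89, 497.93),
    (24, -50729983.07, 502.51),
    (25, -50668270.93, 498.73),
]

best_by_likelihood = max(results, key=lambda r: r[1])
best_by_perplexity = min(results, key=lambda r: r[2])
print("Highest log-likelihood: k=%d" % best_by_likelihood[0])
print("Lowest perplexity:      k=%d" % best_by_perplexity[0])

# Assumed relation: perplexity = exp(-log_likelihood / N), so the implied
# N = -log_likelihood / ln(perplexity) should be roughly constant.
implied_counts = [-ll / math.log(ppl) for _, ll, ppl in results]
spread = (max(implied_counts) - min(implied_counts)) / min(implied_counts)
print("Relative spread of implied word count: %.6f" % spread)
```

Because the two columns move together, they are one criterion rather than two independent ones; a held-out set or a coherence measure would give a second opinion.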


#### Using pyLDAvis to visualize topic models
- pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data.
- The package extracts information from a fitted LDA topic model to form an interactive web-based visualization.
- Intuitively, a good topic model shows fairly large, non-overlapping bubbles, one per topic.
- For examples of how to use scikit-learn's topic models with pyLDAvis, see https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb
- Note: the topic numbering in this visualisation does not correspond to the model's own topic numbering.


In [11]:
# Display the pyLDAvis panel.
# lda_model here is the last model fitted in the loop above (k=25).
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, document_terms, cv)
panel



#### Using the NMF model on the same corpus

In [12]:
#NMF model TFIDF

n_features = 1000
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(corpus)
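
Unlike LDA, NMF is fitted on TF-IDF weights rather than raw counts. The sketch below computes such weights by hand on a made-up corpus, following what scikit-learn documents as its defaults: a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit Euclidean length. The `max_df`, `min_df` and stop-word filters used in the cell above are left out, and all names are illustrative.

```python
import math

# Hypothetical tokenised mini-corpus.
docs = [
    ["police", "said", "officers"],
    ["health", "said", "study"],
    ["police", "said", "police"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Raw count times idf, then scaled so the row has unit Euclidean length."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_rows = [tfidf_row(doc) for doc in docs]
for term, weight in zip(vocab, tfidf_rows[0]):
    print("%-9s %.3f" % (term, weight))
```

In the first document, "said" occurs in every document and so gets the lowest non-zero weight, while "officers" occurs only once in the corpus and gets the highest.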


In [13]:
#NMF model
n_components = 20
n_top_words = 20
nmf_model = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5)
nmf = nmf_model.fit(tfidf)

tfidf_feature_names = tfidf_vectorizer.get_feature_names()
display_topics(nmf, tfidf_feature_names, n_top_words)

Topic 0:
said people think don like going know just want ve really say way things lot right make didn time says
Topic 1:
trump donald president campaign republican presidential election nominee pence america gop rally washington new supporters media white administration immigration candidate
Topic 2:
mr said mrs party united new did like senator years political officials states chief york campaign lawyer trump interview american
Topic 3:
clinton hillary sanders campaign democratic foundation state presidential emails mrs email nominee bernie secretary voters candidate election party debate supporters
Topic 4:
police officers said officer man shooting shot killed city arrested authorities told incident black attack car cnn video according department
Topic 5:
isis islamic syria attack state military forces attacks syrian said iraq group terrorist government turkey israel muslim security united killed
Topic 6:
breitbart news gun daily texas bannon radio live fox follow media conservative 

In [14]:
k = n_components
A = tfidf 
# create the model
from sklearn import decomposition
model = decomposition.NMF( init="nndsvd", n_components=k ) 
# apply the model and extract the two factor matrices
W = model.fit_transform( A )
H = model.components_
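
The cell above factorises the document-term matrix `A` into a document-topic matrix `W` and a topic-term matrix `H`, so that `A` is approximately `W` times `H` with all entries non-negative. To show what such a factorisation involves, here is a dependency-free sketch using the classic multiplicative update rule. This is a swapped-in technique chosen for brevity: scikit-learn's `NMF` uses a coordinate-descent solver by default. The toy matrix and all names (`A_toy`, `nmf_multiplicative`, and so on) are illustrative.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf_multiplicative(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative A (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
             for rh, rp, rq in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(rw, rp, rq)]
             for rw, rp, rq in zip(W, num, den)]
    return W, H


# Toy matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
A_toy = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
]

W_init, H_init = nmf_multiplicative(A_toy, k=2, n_iter=0)  # random start only
W_toy, H_toy = nmf_multiplicative(A_toy, k=2, n_iter=200)
err_before = frobenius_error(A_toy, W_init, H_init)
err_after = frobenius_error(A_toy, W_toy, H_toy)
print("error before updates: %.4f" % err_before)
print("error after updates:  %.4f" % err_after)
```

Each row of `H_toy` plays the role of a topic (weights over terms) and each row of `W_toy` says how much of each topic a document contains, mirroring `H` and `W` in the notebook.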

In [15]:
import numpy as np
def get_descriptor( terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( terms[term_index] )
    return top_terms

In [16]:
## NMF top words by topics
descriptors = []
terms = tfidf_feature_names
for topic_index in range(k):
    descriptors.append( get_descriptor( terms, H, topic_index, 20 ) )
    str_descriptor = ", ".join( descriptors[topic_index] )
    print("Topic %02d: %s" % ( topic_index+1, str_descriptor ))
    print()


Topic 01: people, think, like, don, going, know, just, ve, want, really, says, way, things, say, right, lot, make, good, thing, need

Topic 02: trump, donald, president, campaign, republican, presidential, election, nominee, pence, said, rally, america, gop, washington, supporters, new, wall, comments, media, administration

Topic 03: mr, ms, said, mrs, party, new, like, united, did, york, years, political, senator, chief, lawyer, states, wrote, officials, times, campaign

Topic 04: clinton, hillary, sanders, campaign, democratic, foundation, state, presidential, mrs, emails, email, nominee, bernie, secretary, voters, candidate, election, debate, supporters, department

Topic 05: police, officers, officer, man, shooting, shot, black, arrested, city, killed, attack, car, video, gun, department, incident, authorities, enforcement, crime, scene

Topic 06: isis, islamic, syria, attack, state, attacks, iraq, forces, syrian, military, terrorist, group, turkey, musli

In [17]:
## LDA top words by topics

display_topics(lda_model, tf_feature_names, no_top_words)



Topic 0:
news breitbart media twitter facebook follow social political daily com post speech new bannon milo radio reporter times page conservative
Topic 1:
clinton hillary sanders campaign democratic state presidential email bernie emails mrs foundation secretary nominee candidate debate obama new supporters voters
Topic 2:
said family children life years father mother told home time son film school parents child young wife man like day
Topic 3:
percent 000 year million according number years california workers rate states 2015 report new jobs people increase average population 2014
Topic 4:
police said officers man officer gun shooting killed shot people authorities video told arrested violence attack car victims incident city
Topic 5:
team game season year university students time world won games best college players sports play win series second final fans
Topic 6:
women health said medical study dr sexual men research university people cases sex human use risk care drug body child