<h1 align=center>Media Topic Tracking</h1>
<h2 align=center>Using Natural Language Processing</h2>
<h2 align=center>and</h2>
<h2 align=center>Machine Learning</h2>

### Topic Modeling

After the text is cleaned we can start topic modeling, which can be done in several ways.  The first step in any approach is vectorizing the text using either a Count Vectorizer or a Term Frequency - Inverse Document Frequency (TF-IDF) Vectorizer.  We then apply one of several approaches to topic identification.  The simplest is Singular Value Decomposition (SVD) on the term matrix (the vectorizer output).  A more complicated approach is Latent Dirichlet Allocation (LDA), a probabilistic method for finding latent topics in the term matrix.  Finally, we can apply Principal Component Analysis (PCA) to the term matrix and then apply a clustering algorithm, such as KMeans, to the output of PCA.<br>
<br>
In this notebook I examine SVD and LDA.  PCA and KMeans are evaluated in a separate notebook.

In [1]:
import sys
import re
import os.path
import requests
import time
import pandas as pd
import numpy as np

from os import path

from pymongo import MongoClient

In [2]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity


In [6]:
# Open the document database
db_client = MongoClient()
db_news = db_client['news_search']
db_news_col = db_news['search_result']
db_news_content = db_news['news_content']

In [7]:
# Define some utility functions
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
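
The slice `argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. The sketch below shows what it selects, using a made-up weight vector and term list (both hypothetical, not taken from the fitted model): `argsort()` orders indices from smallest to largest weight, and the negative-step slice walks backwards from the end to pick the `n_top_words` largest.

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only
toy_weights = np.array([0.1, 0.7, 0.2, 0.9])
toy_terms = ["poll", "iowa", "voter", "caucus"]
n_top_words = 2

# argsort() gives ascending order [0, 2, 1, 3]; the reversed slice keeps the last two
top_idx = toy_weights.argsort()[:-n_top_words - 1:-1]

print([toy_terms[i] for i in top_idx])   # ['caucus', 'iowa']
```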

<h2 align=center>Vectorizing</h2>

### Count Vectorizer
As mentioned above, there are several vectorizing techniques.  The first is the count
vectorizer, which represents each article by the number of times each vocabulary term appears in it.
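
To make the document-term matrix concrete, here is a minimal sketch on a three-sentence toy corpus (hypothetical text, not drawn from the news database). It uses default `CountVectorizer` settings rather than the parameters chosen in this notebook, and reads the column order from `vocabulary_` so it does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, for illustration only
toy_corpus = [
    "iowa caucus result delayed",
    "caucus result favors sanders",
    "sanders wins debate",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus)   # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; sort the terms by that index
toy_terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(toy_terms)
print(toy_counts.toarray())
```

Each row is one document and each column is one term, so summing down a column gives the corpus-wide frequency of that term.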

In [8]:
# build our corpus
corpus = []
cursor = db_news_content.find({}, {'_id':1, 'text': 1, 'short_text' : 1, 'prop_nouns' : 1})
for article in list(cursor) :
    corpus.append(article['short_text'])


In [11]:
# Make a list of articles for transformation and a list of associated article id's
cursor = db_news_content.find({}, {'_id':1, 'text': 1, 'short_text' : 1, 'prop_nouns' : 1})
X_what = []
y = []
articles = list(cursor)
for article in articles :
    X_what.append(article['short_text'])
    y.append(article['_id'])


In [17]:
# Instantiate the vectorizer
# A float max_df is a proportion of documents and must lie in [0.0, 1.0]; 1.0 filters nothing
vectorizer = CountVectorizer(stop_words='english', max_df=1.0, max_features=80, ngram_range=(1,2))
cv = vectorizer.fit(corpus)

# Transform the article list and convert it to a data frame
X_cv_what = cv.transform(X_what)
X_cv_what_df = pd.DataFrame(X_cv_what.toarray(), columns=vectorizer.get_feature_names())


### TF-IDF
The second vectorizing technique is TF-IDF, which down-weights terms that appear in many articles.
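
The effect of the inverse document frequency term can be seen on a tiny example. This is a sketch on a hypothetical toy corpus with default `TfidfVectorizer` settings (not the parameters used in this notebook): a word that appears in every document ends up with a lower weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "sanders" is in every document, the other words in one each
toy_corpus = ["sanders iowa", "sanders debate", "sanders poll"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_corpus).toarray()

# Look up column indices by term
col = toy_tfidf.vocabulary_
common = toy_weights[0, col["sanders"]]   # term shared by all documents
rare = toy_weights[0, col["iowa"]]        # term unique to the first document

print(common, rare)
```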

In [18]:
# Instantiate the vectorizer
# A float max_df is a proportion of documents and must lie in [0.0, 1.0]; 1.0 filters nothing
tf_vect = TfidfVectorizer(stop_words='english', max_df=1.0, max_features=80, ngram_range=(1,2))
tfidf = tf_vect.fit(corpus)

# Transform the document list and create a dataframe
X_tf_what = tfidf.transform(X_what)
X_tf_what_df = pd.DataFrame(X_tf_what.toarray(), columns=tf_vect.get_feature_names())


<h2 align=center>Topic Modeling</h2>

Now that the corpus has been vectorized we'll look at the actual topic modeling techniques. 

<h3>LDA</h3>


In [20]:
# Prep some values for later visualization
sum_words = X_cv_what.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
# Sort by frequency first, then keep at most the 300 most frequent terms
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)[:300]
top_words = [word[0].upper() for word in words_freq]

In [21]:
# Instantiate the model
lda_model = LatentDirichletAllocation(n_components=8,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )

# Fit the model on the count-vectorized document-term matrix; the result is the document-topic matrix
lda_output = lda_model.fit_transform(X_cv_what)

In [22]:
print_top_words(lda_model, vectorizer.get_feature_names(), 5)

Topic #0: debate steyer bloomberg billionaire million
Topic #1: biden voter poll sander hampshire
Topic #2: caucus iowa result delegate sander
Topic #3: iowa buttigieg people event year
Topic #4: buttigieg klobuchar black debate hampshire
Topic #5: sander biden trump democrats year
Topic #6: bloomberg trump york people city
Topic #7: warren woman sander plan debate
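
Besides the per-topic word lists, the document-topic matrix returned by `fit_transform` can be used to assign each article a dominant topic. The following is a minimal sketch on a hypothetical toy corpus with two topics; the corpus, variable names, and `n_components=2` are illustrative assumptions, not the settings used above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "iowa caucus result delegate",
    "caucus delegate result iowa iowa",
    "debate billionaire bloomberg steyer",
    "bloomberg steyer debate million",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)   # shape: (n_docs, n_topics)

# Each row is a probability distribution over topics
print(toy_doc_topics.sum(axis=1))

# Index of the highest-weight topic for each document
toy_dominant = toy_doc_topics.argmax(axis=1)
print(toy_dominant)
```

The same `argmax(axis=1)` step could be applied to `lda_output` to label each article with its strongest topic.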



pyLDAvis is a useful visualization tool for LDA output.  The package projects the
fitted topics into two dimensions (here with t-SNE, via `mds='tsne'`) to produce an interactive view of how the topics relate to one another.

In [34]:

import sklearn
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
ldavizdata = pyLDAvis.sklearn.prepare(lda_model, X_cv_what, cv, mds='tsne')
pyLDAvis.save_html(ldavizdata, 'pyLDA_Full_Corpus.html')



In [35]:
# Using the output above we can name the topics
topic_what_labels = ['Sanders Trump', 'People', 'Biden', 'Bloomberg',
                     'Iowa Caucus Result', 'Buttigieg', 'Voter Steyer', 'Warren']

<h3>Truncated Singular Value Decomposition</h3>

In [23]:
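# Reduce the count-vectorized term matrix to 15 latent components
lsa_cv = TruncatedSVD(n_components=15)
X_cv_topic_what = lsa_cv.fit_transform(X_cv_what)

# Fraction of the total variance captured by the 15 components
sum(lsa_cv.explained_variance_ratio_)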

0.762987352441249
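
One way to decide how many components to keep is to look at the cumulative explained variance instead of only its total. The sketch below does this on a small random count matrix; the matrix, the `target` threshold, and all variable names are hypothetical stand-ins for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a document-term matrix: 20 documents x 10 terms
rng = np.random.RandomState(0)
toy_matrix = rng.poisson(1.0, size=(20, 10)).astype(float)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
toy_svd.fit(toy_matrix)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

# Smallest number of components reaching the (hypothetical) target, if any
target = 0.5
reached = cumulative >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None

print(cumulative)
print(n_needed)
```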