 
<h1 style="text-align: center;"><span style="color: #333399;">Topic Modeling and Fake News: LDA, NMF, LSI</span></h1>
<h6 style="text-align: center;">Created by: Michael Gagliano on 10/29/2018</h6>
<h6 style="text-align: center;">K-State Honor Code: "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."</h6>

## -- NOTE: Topic modeling is a form of clustering. LDA assumes a document can belong to multiple clusters (topics) to varying degrees. -- ##

In practice, this means a document does not <i>exclusively</i> belong to one primary cluster; each document is a mixture of topics. We are ultimately running a specialized, soft type of clustering here.

# 1. Importing necessary packages and data

In [1]:
import csv
import pandas as pd

# import packages for text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re

import gensim
from gensim.corpora import Dictionary
from gensim.models import ldamodel
from gensim import corpora, models, similarities

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet

import numpy
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity



## Make the processes visible

In [2]:
#https://radimrehurek.com/gensim/tutorial.html
# this makes process visible

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Import data

We're importing our corpus now. This corpus includes fake news documents from https://www.kaggle.com/mrisdal/fake-news/data.

In [3]:
# Using the below code adopted from Stack Overflow: https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072


import sys
import csv
maxInt = sys.maxsize
decrement = True

while decrement:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    decrement = False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

In [4]:
# Checking the number of rows in the raw CSV file (the count includes the header row)
with open('data/fake.csv', 'rt', encoding="utf8") as f:
    texts = [row for row in csv.reader(f)]
len(texts)

13000

# 2. Text Pre-Processing

<b>Modifying dataframe to pull just text over</b>

In [5]:
text = pd.read_csv('data/fake.csv', header=0)

In [6]:
# Keeping only the English-language fake news rows

text = text[(text['language'] == 'english')]
len(text)

12403

In [7]:
# Shortening dataframe to only the first 2000 rows
texts = text[:2000]
len(texts)

2000

<b>Isolating just the text column</b>

In [8]:
texts1 = texts['text']
texts1

0       Print They should pay all the back all the mon...
1       Why Did Attorney General Loretta Lynch Plead T...
2       Red State : \nFox News Sunday reported this mo...
3       Email Kayla Mueller was a prisoner and torture...
4       Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...
5       Print Hillary goes absolutely berserk! She exp...
6       BREAKING! NYPD Ready To Make Arrests In Weiner...
7       BREAKING! NYPD Ready To Make Arrests In Weiner...
8       \nLimbaugh said that the revelations in the Wi...
9       Email \nThese people are sick and evil. They w...
10                                                       
11      \nWho? Comedian. \nWhere would she move? Spain...
12      Students expressed their “fear” over a Trump p...
13      Email For Republican politicians like Ohio Gov...
14      Copyright © 2016 100PercentFedUp.com, in assoc...
15      Go to Article A Trump supporter wearing a Trum...
16      Copyright © 2016 100PercentFedUp.com, in assoc...
17      Go to 

In [9]:
# Replace every run of non-letter characters (digits, punctuation, symbols) with a space
documents = [re.sub("[^a-zA-Z]+", " ", str(text)) for text in texts1]

# tokenize
texts = [[word for word in text.lower().split() ] for text in documents]

# lemmatizing words (WordNetLemmatizer treats words as nouns by default): friends --> friend
lmtzr = WordNetLemmatizer()
texts = [[lmtzr.lemmatize(word) for word in text ] for text in texts]

#porter_stemmer = PorterStemmer()
#texts = [[porter_stemmer.stem(word) for word in text ] for text in texts]

# remove common words 
stoplist = stopwords.words('english')
texts = [[word for word in text if word not in stoplist] for text in texts]

#remove short words
texts = [[ word for word in tokens if len(word) >= 3 ] for tokens in texts]


This step removes extra stopwords.

**The quality of topic modeling often depends on how thoroughly you remove stopwords.**

**I found that verbs, adverbs, and adjectives are less meaningful than nouns, since nouns tend to convey topics (and issues) such as tax, budget, education, war, etc.**

If you want to select all nouns, you can do it in Python https://stackoverflow.com/questions/33587667/extracting-all-nouns-from-a-text-file-using-nltk
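
A minimal sketch of noun filtering, assuming the tokens have already been tagged with Penn Treebank part-of-speech tags (the format `nltk.pos_tag` returns, where noun tags start with `NN`). The helper name `keep_nouns` and the hand-tagged sample are illustrative and not part of this notebook.

```python
def keep_nouns(tagged_tokens):
    """Keep only words whose Penn Treebank tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Hand-tagged sample (an assumption for illustration);
# in practice these pairs would come from nltk.pos_tag(tokens)
sample = [("government", "NN"), ("raised", "VBD"), ("taxes", "NNS"),
          ("quickly", "RB"), ("clinton", "NNP")]
nouns = keep_nouns(sample)
print(nouns)
```

Applied to this notebook, the filter would run on each tokenized document before the stopword steps, so that only noun tokens reach the dictionary.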

In [10]:
# A list of extra stopwords, originally specific to debate transcripts and reused here (extend it if you want to remove more stopwords)
extra_stopwords = ['will', 'people', 'need', 'think', 'well','going', 'can', 'country', 'know', 'lot', 'get','make','way','president', 'want',
                'like','say','got','said','just','something','tell','put','now', 'bad','back','want','right','every','one','use','come','never', 
                'many','along','things','day','also','first','guy', 'great', 'take', 'good', 'much','anderson', 'let', 'would', 'year', 'thing', 'america',
                'talk', 'talking', 'thank', 'does', 'give', 'look', 'believe', 'tonight','today','see']

extra_stoplist = extra_stopwords
texts = [[word for word in text if word not in extra_stoplist] for text in texts]
#https://github.com/alexperrier/datatalks/blob/master/debates/R/stm.R

### CREATING A CONTENT-SPECIFIC DICTIONARY 

In [11]:
# this is text processing required for topic modeling with Gensim

## Create a dictionary representation of the documents.
dictionary = Dictionary(texts)
dictionary.save('data/fake.dict')  # store the dictionary, for future reference

len(dictionary)

2018-10-29 21:54:07,436 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-10-29 21:54:07,901 : INFO : built Dictionary(31293 unique tokens: ['another', 'asap', 'benefit', 'bust', 'came']...) from 2000 documents (total 570179 corpus positions)
2018-10-29 21:54:07,902 : INFO : saving Dictionary object under data/fake.dict, separately None
2018-10-29 21:54:07,916 : INFO : saved data/fake.dict


31293

In [12]:
## Remove rare and common tokens.
# ignore words that appear in fewer than 2 documents or in more than 40% of documents (removes too infrequent & too frequent words) - an optional step

dictionary.filter_extremes(no_below=2, no_above=0.4) #https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes
len(dictionary)

2018-10-29 21:54:07,960 : INFO : discarding 12516 tokens: [('scamming', 1), ('deduct', 1), ('deduction', 1), ('rationing', 1), ('berserk', 1), ('bezerk', 1), ('binder', 1), ('enabler', 1), ('eqb', 1), ('gpjps', 1)]...
2018-10-29 21:54:07,962 : INFO : keeping 18777 tokens which were in no less than 2 and no more than 800 (=40.0%) documents
2018-10-29 21:54:07,983 : INFO : resulting dictionary: Dictionary(18777 unique tokens: ['another', 'asap', 'benefit', 'bust', 'came']...)


18777
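
The filtering rule reported in the log above (keep tokens found in no fewer than `no_below` documents and no more than `no_above` × the number of documents) can be sketched in plain Python. This is an illustration of the rule on a toy corpus, not gensim's implementation; the function name `filter_by_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.4):
    """Return the set of tokens whose document frequency lies within the bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each token once per document
    max_docs = int(no_above * len(docs))   # e.g. 40% of 2000 documents = 800
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus (made up): "trump" is in every document, several tokens appear once
toy_docs = [
    ["trump", "election", "vote"],
    ["trump", "war", "syria"],
    ["trump", "election", "email"],
    ["trump", "war", "russia"],
    ["trump", "fbi", "email"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.4)
print(sorted(kept))
```

In the toy corpus the token that appears in every document and the tokens that appear only once are both dropped, which mirrors the drop from 31293 to 18777 tokens in this notebook.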

In [13]:
# convert each document to a bag-of-words vector of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('data/fake.mm', corpus)  # store to disk, for later use
len(corpus)

2018-10-29 21:54:08,331 : INFO : storing corpus in Matrix Market format to data/fake.mm
2018-10-29 21:54:08,331 : INFO : saving sparse matrix to data/fake.mm
2018-10-29 21:54:08,331 : INFO : PROGRESS: saving document #0
2018-10-29 21:54:08,492 : INFO : PROGRESS: saving document #1000
2018-10-29 21:54:08,726 : INFO : saved 2000x18777 matrix, density=0.940% (353135/37554000)
2018-10-29 21:54:08,733 : INFO : saving MmCorpus index to data/fake.mm.index


2000

In [14]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 18777
Number of documents: 2000
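
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what a `doc2bow`-style conversion produces: each document becomes a list of `(token_id, count)` pairs, and tokens missing from the vocabulary are dropped. The helper `to_bow` and the toy vocabulary are illustrative assumptions, not gensim code.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy vocabulary (made up); "server" is not in it and is therefore ignored
toy_token2id = {"clinton": 0, "email": 1, "fbi": 2}
bow = to_bow(["email", "fbi", "email", "server", "clinton"], toy_token2id)
print(bow)
```

This is why tokens removed by `filter_extremes` no longer contribute to the corpus: they have no id in the dictionary, so they never appear in any document vector.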


In [15]:
# later you can retrieve the saved dict and corpus
# https://radimrehurek.com/gensim/tut1.html

saved_dict = Dictionary.load('data/fake.dict')

## - HOW TO RETRIEVE THE DICTIONARY IN THE FUTURE - ##

#for token, token_id in saved_dict.token2id.items():
#    print(token, token_id)

2018-10-29 21:54:08,755 : INFO : loading Dictionary object from data/fake.dict
2018-10-29 21:54:08,770 : INFO : loaded data/fake.dict


In [16]:
# you can retrieve the saved corpus

corpus_saved = corpora.MmCorpus('data/fake.mm')


2018-10-29 21:54:08,779 : INFO : loaded corpus index from data/fake.mm.index
2018-10-29 21:54:08,780 : INFO : initializing cython corpus reader from data/fake.mm
2018-10-29 21:54:08,781 : INFO : accepted corpus with 2000 documents, 18777 features, 353135 non-zero entries


# 3. Answering Questions
- a. <b>Explain the difference between text classification and topic modeling in your own words.<br><br></b>

The largest difference between the two, in the context I have used them, is that text classification is supervised and topic modeling is unsupervised. Text classification predicts a predefined label for each document from labeled training examples, while topic modeling discovers latent groupings (topics) in unlabeled text. <br><br>

- b. <b>Explain the goal of topic modeling (LDA) in your own words. <br><br></b>

Topic modeling (via LDA) determines the "breakdown" of content within a dataset into topics, and estimates the posterior probability ("weight" or "score") of each word within each topic. One word may appear across multiple topics, with a different score in each. LDA is a <b>generative</b> model: it assumes each document was produced by first drawing a mixture of topics and then drawing each word from one of those topics.<br><br>

- c. <b>Explain document-topic matrix (distribution) in your own words. <br><br></b>

The document-topic distribution gives, for each document, the proportion of each topic that the model assigns to that document; the proportions for one document sum to 1.<br><br>

- d. <b>Explain term-topic matrix (distribution) in your own words. <br><br></b>

The term-topic distribution gives, for each topic, the tokens (terms) in the vocabulary and their relative scores (probabilities) within that topic.
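
The two distributions can be illustrated with a toy example. The sketch below uses made-up numbers (2 documents, 2 topics, 3 terms): each row of the document-topic matrix and each row of the term-topic matrix sums to 1, and mixing them gives the probability of a term in a document. All names and values here are hypothetical and are not taken from the fitted model.

```python
vocab = ["election", "vote", "syria"]

# Document-topic matrix: one row per document, one column per topic
doc_topic = [
    [0.9, 0.1],   # document 0 is mostly topic 0
    [0.2, 0.8],   # document 1 is mostly topic 1
]

# Term-topic matrix: one row per topic, one column per vocabulary term
topic_term = [
    [0.5, 0.4, 0.1],   # topic 0 favors "election" and "vote"
    [0.1, 0.1, 0.8],   # topic 1 favors "syria"
]

def term_probs(doc_row, topic_term):
    """p(term | doc) = sum over topics of p(topic | doc) * p(term | topic)."""
    n_terms = len(topic_term[0])
    return [sum(p_topic * topic_term[k][v] for k, p_topic in enumerate(doc_row))
            for v in range(n_terms)]

for d, row in enumerate(doc_topic):
    probs = term_probs(row, topic_term)
    print(d, dict(zip(vocab, (round(p, 2) for p in probs))))
```

The same word can get non-zero probability from several topics, which is the point made in answer (b) about one word appearing across multiple topics.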

# 4. Determining the Best Model (LDA, NMF, LSI)

- The process below can take many hours ...

### Note: A more powerful and efficient option is likely PySpark

# LDA Model Building

**passes** controls how many times the model is trained on the entire corpus. It is important to set the number of "passes" high enough for the model to converge.

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb

In [17]:
numpy.random.seed(1) # setting the random seed so that repeated runs give the same results and the same model
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=15, passes=20, eval_every = 1)

2018-10-29 21:54:08,792 : INFO : using symmetric alpha at 0.06666666666666667
2018-10-29 21:54:08,792 : INFO : using symmetric eta at 0.06666666666666667
2018-10-29 21:54:08,795 : INFO : using serial LDA version on this node
2018-10-29 21:54:08,840 : INFO : running online (multi-pass) LDA training, 15 topics, 20 passes over the supplied corpus of 2000 documents, updating model once every 2000 documents, evaluating perplexity every 2000 documents, iterating 50x with a convergence threshold of 0.001000
2018-10-29 21:54:11,789 : INFO : -11.189 per-word bound, 2334.2 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:54:11,789 : INFO : PROGRESS: pass 0, at document #2000/2000
2018-10-29 21:54:13,422 : INFO : topic #3 (0.067): 0.007*"trump" + 0.007*"state" + 0.005*"war" + 0.005*"clinton" + 0.004*"world" + 0.003*"military" + 0.003*"election" + 0.003*"american" + 0.003*"even" + 0.003*"new"
2018-10-29 21:54:13,424 : INFO : topic #11 (0.067): 0.005*

2018-10-29 21:54:35,513 : INFO : topic #1 (0.067): 0.004*"state" + 0.004*"american" + 0.004*"iran" + 0.003*"muslim" + 0.003*"land" + 0.003*"war" + 0.003*"election" + 0.003*"even" + 0.003*"white" + 0.003*"world"
2018-10-29 21:54:35,515 : INFO : topic #5 (0.067): 0.007*"state" + 0.006*"election" + 0.004*"new" + 0.004*"vote" + 0.004*"trump" + 0.004*"government" + 0.003*"million" + 0.003*"law" + 0.003*"clinton" + 0.003*"american"
2018-10-29 21:54:35,517 : INFO : topic diff=0.705225, rho=0.377964
2018-10-29 21:54:38,717 : INFO : -8.510 per-word bound, 364.5 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:54:38,719 : INFO : PROGRESS: pass 6, at document #2000/2000
2018-10-29 21:54:40,283 : INFO : topic #12 (0.067): 0.008*"white" + 0.007*"african" + 0.007*"american" + 0.005*"kenya" + 0.005*"black" + 0.004*"africa" + 0.003*"new" + 0.003*"indian" + 0.003*"world" + 0.003*"support"
2018-10-29 21:54:40,283 : INFO : topic #6 (0.067): 0.013*"war" + 0.

2018-10-29 21:55:02,625 : INFO : topic #5 (0.067): 0.008*"state" + 0.005*"election" + 0.005*"new" + 0.004*"government" + 0.004*"vote" + 0.004*"million" + 0.004*"law" + 0.003*"information" + 0.003*"news" + 0.003*"machine"
2018-10-29 21:55:02,626 : INFO : topic #9 (0.067): 0.006*"life" + 0.006*"world" + 0.005*"even" + 0.004*"science" + 0.004*"feel" + 0.004*"really" + 0.003*"body" + 0.003*"may" + 0.003*"work" + 0.003*"food"
2018-10-29 21:55:02,628 : INFO : topic diff=0.193484, rho=0.277350
2018-10-29 21:55:05,451 : INFO : -8.426 per-word bound, 343.9 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:55:05,451 : INFO : PROGRESS: pass 12, at document #2000/2000
2018-10-29 21:55:06,817 : INFO : topic #13 (0.067): 0.014*"black" + 0.005*"email" + 0.005*"new" + 0.004*"police" + 0.004*"state" + 0.004*"trump" + 0.003*"phone" + 0.003*"post" + 0.003*"officer" + 0.003*"link"
2018-10-29 21:55:06,818 : INFO : topic #7 (0.067): 0.017*"trump" + 0.015*"party

2018-10-29 21:55:28,583 : INFO : topic #6 (0.067): 0.015*"war" + 0.012*"syria" + 0.009*"state" + 0.008*"russia" + 0.008*"government" + 0.007*"syrian" + 0.006*"military" + 0.005*"clinton" + 0.004*"saudi" + 0.004*"russian"
2018-10-29 21:55:28,584 : INFO : topic #5 (0.067): 0.008*"state" + 0.005*"new" + 0.005*"election" + 0.005*"government" + 0.004*"million" + 0.004*"law" + 0.003*"information" + 0.003*"news" + 0.003*"company" + 0.003*"vote"
2018-10-29 21:55:28,586 : INFO : topic diff=0.078981, rho=0.229416
2018-10-29 21:55:31,484 : INFO : -8.397 per-word bound, 337.0 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:55:31,484 : INFO : PROGRESS: pass 18, at document #2000/2000
2018-10-29 21:55:32,948 : INFO : topic #4 (0.067): 0.035*"clinton" + 0.013*"hillary" + 0.012*"trump" + 0.011*"email" + 0.008*"fbi" + 0.008*"campaign" + 0.005*"foundation" + 0.005*"bill" + 0.005*"state" + 0.005*"new"
2018-10-29 21:55:32,948 : INFO : topic #2 (0.067): 0.01

Review the topic diff at the end of the process log (e.g., topic diff=0.104663). A value close to zero means that the model has converged; a value still well above zero indicates we should set the number of "passes" higher (e.g., 30).
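
One way to review convergence without scrolling through the log is to pull the `topic diff` values out of the log text. The sketch below parses lines in the format shown above with a regular expression; the helper name `topic_diffs` is hypothetical, and the sample lines are copied from the log of this run.

```python
import re

# Matches the "topic diff=..., rho=..." part of a gensim training log line
TOPIC_DIFF = re.compile(r"topic diff=([0-9.]+), rho=([0-9.]+)")

def topic_diffs(log_lines):
    """Return the topic diff values, in order of appearance, from log lines."""
    diffs = []
    for line in log_lines:
        match = TOPIC_DIFF.search(line)
        if match:
            diffs.append(float(match.group(1)))
    return diffs

sample_log = [
    "2018-10-29 21:54:35,517 : INFO : topic diff=0.705225, rho=0.377964",
    "2018-10-29 21:54:38,719 : INFO : PROGRESS: pass 6, at document #2000/2000",
    "2018-10-29 21:55:02,628 : INFO : topic diff=0.193484, rho=0.277350",
    "2018-10-29 21:55:28,586 : INFO : topic diff=0.078981, rho=0.229416",
]
diffs = topic_diffs(sample_log)
print(diffs)
print("steadily decreasing:", all(a > b for a, b in zip(diffs, diffs[1:])))
```

A sequence that keeps shrinking toward zero suggests the topics are settling; if the last value is still far from zero, more passes may help.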

In [18]:
numpy.random.seed(1) # setting random seed to get the same results each time. 
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=35, passes=30, eval_every = 1)

2018-10-29 21:55:37,298 : INFO : using symmetric alpha at 0.02857142857142857
2018-10-29 21:55:37,299 : INFO : using symmetric eta at 0.02857142857142857
2018-10-29 21:55:37,304 : INFO : using serial LDA version on this node
2018-10-29 21:55:37,403 : INFO : running online (multi-pass) LDA training, 35 topics, 30 passes over the supplied corpus of 2000 documents, updating model once every 2000 documents, evaluating perplexity every 2000 documents, iterating 50x with a convergence threshold of 0.001000
2018-10-29 21:55:41,348 : INFO : -13.081 per-word bound, 8667.5 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:55:41,354 : INFO : PROGRESS: pass 0, at document #2000/2000
2018-10-29 21:55:43,649 : INFO : topic #15 (0.029): 0.009*"trump" + 0.006*"state" + 0.006*"clinton" + 0.005*"election" + 0.003*"american" + 0.003*"political" + 0.003*"war" + 0.003*"government" + 0.003*"world" + 0.002*"vote"
2018-10-29 21:55:43,657 : INFO : topic #2 (0.029)

2018-10-29 21:56:09,430 : INFO : topic #1 (0.029): 0.006*"saudi" + 0.005*"yemen" + 0.005*"iran" + 0.005*"muslim" + 0.004*"war" + 0.003*"group" + 0.003*"juror" + 0.003*"arm" + 0.003*"state" + 0.003*"force"
2018-10-29 21:56:09,433 : INFO : topic #14 (0.029): 0.016*"russian" + 0.011*"russia" + 0.008*"putin" + 0.007*"state" + 0.005*"election" + 0.005*"new" + 0.005*"show" + 0.004*"medium" + 0.004*"trump" + 0.003*"government"
2018-10-29 21:56:09,437 : INFO : topic diff=1.773915, rho=0.377964
2018-10-29 21:56:12,683 : INFO : -8.673 per-word bound, 408.3 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:56:12,683 : INFO : PROGRESS: pass 6, at document #2000/2000
2018-10-29 21:56:14,328 : INFO : topic #1 (0.029): 0.006*"saudi" + 0.005*"yemen" + 0.005*"iran" + 0.005*"muslim" + 0.004*"war" + 0.003*"juror" + 0.003*"arm" + 0.003*"group" + 0.003*"state" + 0.003*"force"
2018-10-29 21:56:14,328 : INFO : topic #5 (0.029): 0.007*"state" + 0.007*"election" +

2018-10-29 21:56:39,561 : INFO : topic #26 (0.029): 0.016*"white" + 0.016*"black" + 0.009*"trump" + 0.007*"state" + 0.006*"class" + 0.006*"movement" + 0.006*"obama" + 0.005*"comanche" + 0.005*"war" + 0.004*"american"
2018-10-29 21:56:39,562 : INFO : topic #3 (0.029): 0.011*"state" + 0.009*"united" + 0.008*"muslim" + 0.006*"world" + 0.006*"china" + 0.006*"percent" + 0.005*"immigrant" + 0.005*"government" + 0.005*"facebook" + 0.005*"war"
2018-10-29 21:56:39,566 : INFO : topic diff=0.455931, rho=0.277350
2018-10-29 21:56:42,912 : INFO : -8.505 per-word bound, 363.3 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:56:42,912 : INFO : PROGRESS: pass 12, at document #2000/2000
2018-10-29 21:56:44,642 : INFO : topic #9 (0.029): 0.013*"life" + 0.009*"feel" + 0.006*"world" + 0.006*"even" + 0.006*"film" + 0.006*"really" + 0.005*"mind" + 0.004*"actually" + 0.004*"science" + 0.004*"best"
2018-10-29 21:56:44,642 : INFO : topic #8 (0.029): 0.013*"north"

2018-10-29 21:57:10,180 : INFO : topic #34 (0.029): 0.011*"syrian" + 0.008*"syria" + 0.006*"cancer" + 0.006*"government" + 0.005*"study" + 0.005*"civilian" + 0.004*"war" + 0.004*"breast" + 0.004*"medical" + 0.004*"diet"
2018-10-29 21:57:10,181 : INFO : topic #14 (0.029): 0.020*"russian" + 0.014*"russia" + 0.011*"putin" + 0.008*"state" + 0.006*"show" + 0.005*"election" + 0.005*"new" + 0.004*"vladimir" + 0.004*"moscow" + 0.004*"http"
2018-10-29 21:57:10,184 : INFO : topic diff=0.153596, rho=0.229416
2018-10-29 21:57:13,701 : INFO : -8.454 per-word bound, 350.6 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:57:13,701 : INFO : PROGRESS: pass 18, at document #2000/2000
2018-10-29 21:57:15,448 : INFO : topic #14 (0.029): 0.020*"russian" + 0.014*"russia" + 0.011*"putin" + 0.008*"state" + 0.006*"show" + 0.005*"election" + 0.005*"new" + 0.004*"vladimir" + 0.004*"moscow" + 0.004*"http"
2018-10-29 21:57:15,450 : INFO : topic #9 (0.029): 0.012*"lif

2018-10-29 21:57:41,924 : INFO : topic #0 (0.029): 0.022*"child" + 0.015*"adhd" + 0.008*"drug" + 0.006*"disorder" + 0.006*"american" + 0.005*"even" + 0.005*"pharmaceutical" + 0.005*"france" + 0.005*"kid" + 0.004*"french"
2018-10-29 21:57:41,932 : INFO : topic #6 (0.029): 0.016*"war" + 0.016*"syria" + 0.013*"russia" + 0.009*"syrian" + 0.009*"government" + 0.009*"state" + 0.009*"russian" + 0.008*"military" + 0.006*"clinton" + 0.005*"washington"
2018-10-29 21:57:41,936 : INFO : topic diff=0.073399, rho=0.200000
2018-10-29 21:57:45,374 : INFO : -8.427 per-word bound, 344.1 perplexity estimate based on a held-out corpus of 2000 documents with 550673 words
2018-10-29 21:57:45,374 : INFO : PROGRESS: pass 24, at document #2000/2000
2018-10-29 21:57:47,407 : INFO : topic #27 (0.029): 0.019*"israel" + 0.012*"israeli" + 0.011*"palestinian" + 0.011*"water" + 0.008*"state" + 0.007*"gaza" + 0.007*"flint" + 0.006*"trump" + 0.006*"law" + 0.005*"border"
2018-10-29 21:57:47,409 : INFO : topic #15 (0.029

2018-10-29 21:58:22,189 : INFO : topic #6 (0.029): 0.017*"war" + 0.016*"syria" + 0.014*"russia" + 0.009*"syrian" + 0.009*"government" + 0.009*"state" + 0.009*"russian" + 0.009*"military" + 0.005*"clinton" + 0.005*"obama"
2018-10-29 21:58:22,190 : INFO : topic #5 (0.029): 0.008*"state" + 0.007*"election" + 0.005*"government" + 0.005*"new" + 0.005*"law" + 0.005*"machine" + 0.005*"serco" + 0.004*"voting" + 0.004*"office" + 0.004*"vote"
2018-10-29 21:58:22,193 : INFO : topic diff=0.044824, rho=0.179605


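<b>New run with 35 topics and 30 passes:</b> the last topic diff in the log above is 0.044824. Topic diff measures how much the topics changed during the final pass, so it indicates convergence rather than accuracy. Because this run also raises num_topics from 15 to 35, its topic diff is not directly comparable with the first run.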
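
The training logs above report both a per-word bound and a perplexity estimate. The numbers are consistent with perplexity = 2^(-bound), which the short check below reproduces for pairs copied from the logs. Small differences are expected because the bound is printed to only three decimals; the function name `perplexity_from_bound` is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# (per-word bound, reported perplexity) pairs copied from the training logs above
logged = [(-11.189, 2334.2), (-8.510, 364.5), (-8.397, 337.0), (-8.427, 344.1)]
for bound, reported in logged:
    print(bound, round(perplexity_from_bound(bound), 1), reported)
```

Lower perplexity means the model assigns higher likelihood to the evaluated documents. Here the evaluation corpus is the training corpus itself, so these values track training progress rather than generalization.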

In [19]:
model.save('data/lda.model') # same for tfidf, lda, ...
#model = models.LdaModel.load('data/lda.model')

#https://stackoverflow.com/questions/17354417/gensim-how-to-save-lda-models-produced-topics-to-a-readable-format-csv-txt-et

2018-10-29 21:58:22,201 : INFO : saving LdaState object under data/lda.model.state, separately None
2018-10-29 21:58:22,238 : INFO : saved data/lda.model.state
2018-10-29 21:58:22,259 : INFO : saving LdaModel object under data/lda.model, separately ['expElogbeta', 'sstats']
2018-10-29 21:58:22,260 : INFO : storing np array 'expElogbeta' to data/lda.model.expElogbeta.npy
2018-10-29 21:58:22,270 : INFO : not storing attribute dispatcher
2018-10-29 21:58:22,271 : INFO : not storing attribute id2word
2018-10-29 21:58:22,272 : INFO : not storing attribute state
2018-10-29 21:58:22,280 : INFO : saved data/lda.model


# Prints the topics. (Topic-Term Distribution)

In [20]:
model.show_topics(num_topics=15)
#show_topics(num_topics=10, num_words=10, log=False, formatted=True)

[(9,
  '0.012*"life" + 0.008*"feel" + 0.007*"really" + 0.007*"even" + 0.006*"world" + 0.005*"film" + 0.005*"mind" + 0.005*"actually" + 0.005*"work" + 0.005*"best"'),
 (14,
  '0.021*"russian" + 0.013*"russia" + 0.013*"putin" + 0.008*"state" + 0.006*"show" + 0.005*"election" + 0.005*"http" + 0.005*"vladimir" + 0.005*"new" + 0.004*"moscow"'),
 (3,
  '0.012*"state" + 0.009*"united" + 0.009*"muslim" + 0.008*"china" + 0.007*"world" + 0.007*"percent" + 0.006*"immigrant" + 0.005*"facebook" + 0.005*"government" + 0.005*"chinese"'),
 (13,
  '0.022*"nato" + 0.021*"russia" + 0.011*"troop" + 0.010*"russian" + 0.009*"state" + 0.007*"force" + 0.007*"border" + 0.005*"military" + 0.004*"tension" + 0.004*"new"'),
 (6,
  '0.017*"war" + 0.016*"syria" + 0.014*"russia" + 0.009*"syrian" + 0.009*"government" + 0.009*"state" + 0.009*"russian" + 0.009*"military" + 0.005*"clinton" + 0.005*"obama"'),
 (5,
  '0.008*"state" + 0.007*"election" + 0.005*"government" + 0.005*"new" + 0.005*"law" + 0.005*"machine" + 0.00

In [21]:
# Prints the topics.
for top in model.show_topics(num_topics=35):
    print(top)

(0, '0.023*"child" + 0.015*"adhd" + 0.009*"drug" + 0.007*"disorder" + 0.006*"american" + 0.006*"pharmaceutical" + 0.006*"even" + 0.005*"france" + 0.005*"kid" + 0.004*"industry"')
(1, '0.008*"saudi" + 0.008*"iran" + 0.007*"yemen" + 0.007*"muslim" + 0.005*"iranian" + 0.004*"war" + 0.004*"arm" + 0.004*"arabia" + 0.004*"juror" + 0.004*"force"')
(2, '0.025*"migrant" + 0.022*"refugee" + 0.021*"muslim" + 0.012*"philippine" + 0.009*"asylum" + 0.008*"duterte" + 0.007*"german" + 0.006*"police" + 0.006*"germany" + 0.006*"europe"')
(3, '0.012*"state" + 0.009*"united" + 0.009*"muslim" + 0.008*"china" + 0.007*"world" + 0.007*"percent" + 0.006*"immigrant" + 0.005*"facebook" + 0.005*"government" + 0.005*"chinese"')
(4, '0.034*"trump" + 0.017*"clinton" + 0.010*"campaign" + 0.009*"woman" + 0.008*"hillary" + 0.008*"donald" + 0.008*"story" + 0.007*"medium" + 0.006*"video" + 0.006*"news"')
(5, '0.008*"state" + 0.007*"election" + 0.005*"government" + 0.005*"new" + 0.005*"law" + 0.005*"machine" + 0.005*"serc

In [22]:
# print words without probability
for i in range(0,35):
    topics = model.show_topic(i, 10)
    print(', '.join([str(word[0]) for word in topics]))

child, adhd, drug, disorder, american, pharmaceutical, even, france, kid, industry
saudi, iran, yemen, muslim, iranian, war, arm, arabia, juror, force
migrant, refugee, muslim, philippine, asylum, duterte, german, police, germany, europe
state, united, muslim, china, world, percent, immigrant, facebook, government, chinese
trump, clinton, campaign, woman, hillary, donald, story, medium, video, news
state, election, government, new, law, machine, serco, voting, office, vote
war, syria, russia, syrian, government, state, russian, military, clinton, obama
party, trump, left, system, democratic, black, class, white, police, robot
north, obama, korea, vitamin, black, food, trump, korean, left, war
life, feel, really, even, world, film, mind, actually, work, best
school, student, girl, muslim, teacher, trump, child, old, american, gun
church, catholic, download, child, god, birkenfeld, silver, podcast, window, anthrax
nuclear, american, indian, weapon, new, coup, thanksgiving, white, nation,

# Assigns the topics to the documents in corpus

## Shows probability of each document pertaining to each modeled topic

In [23]:
lda_corpus = model[corpus]

results = []
for doc_topics in lda_corpus:
    print(doc_topics)           # list of (topic id, probability) pairs for one document
    results.append(doc_topics)

[(2, 0.27882034), (5, 0.333458), (24, 0.11771571), (27, 0.24669765)]
[(28, 0.99389035)]
[(1, 0.14703268), (20, 0.5871697), (22, 0.25770658)]
[(4, 0.49837133), (28, 0.4653649)]
[(5, 0.10990971), (20, 0.017801927), (23, 0.44172493), (29, 0.09123863), (32, 0.33442685)]
[(4, 0.9910878)]
[(4, 0.018416265), (6, 0.3321984), (9, 0.035429962), (20, 0.5440982), (31, 0.06324588)]
[(0, 0.027536476), (1, 0.16524708), (4, 0.1287991), (5, 0.4620863), (20, 0.21245258)]
[(4, 0.016046526), (6, 0.07120346), (20, 0.8730206), (31, 0.031891216)]
[(10, 0.9852814)]
[(0, 0.028571429), (1, 0.028571429), (2, 0.028571429), (3, 0.028571429), (4, 0.028571429), (5, 0.028571429), (6, 0.028571429), (7, 0.028571429), (8, 0.028571429), (9, 0.028571429), (10, 0.028571429), (11, 0.028571429), (12, 0.028571429), (13, 0.028571429), (14, 0.028571429), (15, 0.028571429), (16, 0.028571429), (17, 0.028571429), (18, 0.028571429), (19, 0.028571429), (20, 0.028571429), (21, 0.028571429), (22, 0.028571429), (23, 0.028571429), (24, 

[(16, 0.6334038), (20, 0.1312128), (22, 0.16505377)]
[(4, 0.10382561), (5, 0.08762091), (6, 0.07656753), (20, 0.5411683), (25, 0.04766897), (33, 0.11128059)]
[(5, 0.98017496)]
[(14, 0.7000126), (19, 0.21427312)]
[(14, 0.2592105), (18, 0.5493075), (20, 0.1517304)]
[(4, 0.0827404), (14, 0.258459), (18, 0.543583), (20, 0.07670831)]
[(18, 0.86122453)]
[(18, 0.053499393), (29, 0.3874143), (30, 0.54183567)]
[(18, 0.86122453)]
[(18, 0.053510807), (29, 0.38738188), (30, 0.5418566)]
[(14, 0.7000396), (19, 0.21424608)]
[(14, 0.70002437), (19, 0.21426135)]
[(0, 0.028571429), (1, 0.028571429), (2, 0.028571429), (3, 0.028571429), (4, 0.028571429), (5, 0.028571429), (6, 0.028571429), (7, 0.028571429), (8, 0.028571429), (9, 0.028571429), (10, 0.028571429), (11, 0.028571429), (12, 0.028571429), (13, 0.028571429), (14, 0.028571429), (15, 0.028571429), (16, 0.028571429), (17, 0.028571429), (18, 0.028571429), (19, 0.028571429), (20, 0.028571429), (21, 0.028571429), (22, 0.028571429), (23, 0.028571429), (

[(9, 0.3303344), (11, 0.23430444), (26, 0.4027081)]
[(10, 0.05753075), (21, 0.25493044), (29, 0.67566496)]
[(14, 0.75445145), (30, 0.23917791)]
[(4, 0.16411038), (20, 0.706309), (25, 0.120345354)]
[(4, 0.28600258), (5, 0.05773262), (9, 0.09417564), (13, 0.07511154), (19, 0.10752604), (20, 0.092374474), (29, 0.15435329), (32, 0.12768178)]
[(4, 0.9165877), (20, 0.07049639)]
[(4, 0.1151952), (7, 0.025735747), (27, 0.7464802), (34, 0.09199087)]
[(4, 0.26803395), (30, 0.06089274), (31, 0.668124)]
[(1, 0.5884949), (5, 0.20350862), (29, 0.19748749)]
[(6, 0.23671483), (13, 0.44368383), (27, 0.31342378)]
[(9, 0.11007749), (14, 0.025599329), (27, 0.025801696), (29, 0.82549626)]
[(6, 0.15101816), (13, 0.3131526), (19, 0.3809387), (26, 0.06750752), (30, 0.07439604)]
[(5, 0.1711817), (13, 0.037780903), (17, 0.0694301), (18, 0.07990279), (30, 0.6381479)]
[(5, 0.2761947), (9, 0.26276937), (13, 0.16055511), (27, 0.07510596), (29, 0.20016484)]
[(4, 0.27963597), (20, 0.4285938), (24, 0.15710098), (31, 0

[(1, 0.33930394), (2, 0.12031397), (18, 0.3554349), (33, 0.17258404)]
[(0, 0.51428574), (1, 0.0142857125), (2, 0.0142857125), (3, 0.0142857125), (4, 0.0142857125), (5, 0.0142857125), (6, 0.0142857125), (7, 0.0142857125), (8, 0.0142857125), (9, 0.0142857125), (10, 0.0142857125), (11, 0.0142857125), (12, 0.0142857125), (13, 0.0142857125), (14, 0.0142857125), (15, 0.0142857125), (16, 0.0142857125), (17, 0.0142857125), (18, 0.0142857125), (19, 0.0142857125), (20, 0.0142857125), (21, 0.0142857125), (22, 0.0142857125), (23, 0.0142857125), (24, 0.0142857125), (25, 0.0142857125), (26, 0.0142857125), (27, 0.0142857125), (28, 0.0142857125), (29, 0.0142857125), (30, 0.0142857125), (31, 0.0142857125), (32, 0.0142857125), (33, 0.0142857125), (34, 0.0142857125)]
[(18, 0.98028874), (33, 0.014669269)]
[(6, 0.06348298), (19, 0.21574046), (23, 0.013601704), (29, 0.67200536), (32, 0.03445581)]
[(1, 0.055308964), (6, 0.5006412), (16, 0.03319232), (18, 0.31108025), (19, 0.031942736), (23, 0.0642784)]
[(13,

[(2, 0.07397877), (9, 0.3607333), (15, 0.39757824), (24, 0.13607708)]
[(1, 0.63451445), (9, 0.16366416), (32, 0.19420232)]
[(7, 0.62971616), (23, 0.363141)]
[(0, 0.34285712), (7, 0.34285712)]
[(6, 0.41652432), (18, 0.32011616), (23, 0.25791737)]
[(19, 0.49426615), (33, 0.4864919)]
[(15, 0.18137504), (17, 0.096644446), (25, 0.57451445), (31, 0.13192718)]
[(0, 0.05563937), (4, 0.4007795), (9, 0.015296875), (15, 0.12788197), (19, 0.1279329), (29, 0.2630977)]
[(2, 0.033493843), (4, 0.5197153), (7, 0.046901867), (20, 0.27022675), (21, 0.03633706), (26, 0.044049475), (29, 0.04111243)]
[(5, 0.5100498), (16, 0.118931815), (29, 0.36599478)]
[(20, 0.8889052), (29, 0.10384202)]
[(4, 0.47707376), (16, 0.25521517), (20, 0.10608929), (29, 0.15026647)]
[(1, 0.25126368), (4, 0.73599505)]
[(4, 0.26858395), (20, 0.09786968), (29, 0.6206691)]
[(2, 0.060711816), (4, 0.091817185), (9, 0.04927891), (10, 0.21709889), (15, 0.11287816), (16, 0.079752944), (17, 0.1492491), (20, 0.053042017), (29, 0.18108289)]
[

[(4, 0.34065387), (9, 0.033788577), (15, 0.501349), (29, 0.1212757)]
[(20, 0.0814937), (33, 0.9154649)]
[(4, 0.08646027), (5, 0.051385045), (6, 0.48695558), (20, 0.37143013)]
[(5, 0.02143128), (6, 0.024429547), (9, 0.022469774), (20, 0.29055732), (29, 0.63814616)]
[(0, 0.44247106), (4, 0.13841683), (9, 0.07120479), (29, 0.34040126)]
[(4, 0.12814204), (5, 0.04979638), (6, 0.01245024), (13, 0.014138564), (15, 0.03082945), (20, 0.25922418), (29, 0.47385392), (31, 0.030519946)]
[(6, 0.55192745), (13, 0.44259077)]
[(4, 0.07803967), (20, 0.29705262), (22, 0.041427895), (29, 0.5800064)]
[(4, 0.11136167), (9, 0.01859149), (15, 0.12208576), (16, 0.09762605), (17, 0.046793513), (20, 0.13236229), (22, 0.046637397), (29, 0.053465277), (31, 0.3636871)]
[(17, 0.23017056), (32, 0.015905246), (33, 0.70628923), (34, 0.044975113)]
[(5, 0.15454689), (16, 0.015993336), (23, 0.1198341), (24, 0.47911906), (29, 0.0648464), (31, 0.16359398)]
[(13, 0.07513286), (14, 0.07630799), (23, 0.84559065)]
[(1, 0.339143

[(13, 0.27280867), (14, 0.23423433), (33, 0.40984014)]
[(12, 0.2788068), (13, 0.25714284), (33, 0.23547895)]
[(12, 0.34646353), (15, 0.24504746), (23, 0.16998194), (34, 0.17524174)]
[(9, 0.3358129), (11, 0.30925375), (20, 0.31337488)]
[(0, 0.014285714), (1, 0.014285714), (2, 0.014285714), (3, 0.014285714), (4, 0.014285714), (5, 0.014285714), (6, 0.014285714), (7, 0.014285714), (8, 0.014285714), (9, 0.014285714), (10, 0.014285714), (11, 0.014285714), (12, 0.014285714), (13, 0.014285714), (14, 0.014285714), (15, 0.014285714), (16, 0.014285714), (17, 0.014285714), (18, 0.014285714), (19, 0.014285714), (20, 0.014285714), (21, 0.5142857), (22, 0.014285714), (23, 0.014285714), (24, 0.014285714), (25, 0.014285714), (26, 0.014285714), (27, 0.014285714), (28, 0.014285714), (29, 0.014285714), (30, 0.014285714), (31, 0.014285714), (32, 0.014285714), (33, 0.014285714), (34, 0.014285714)]
[(5, 0.057412587), (9, 0.07256273), (18, 0.1686582), (20, 0.054370835), (30, 0.60283685), (32, 0.039313387)]
[(

[(14, 0.03974847), (16, 0.22268079), (29, 0.19691515), (33, 0.5347508)]
[(3, 0.19803186), (29, 0.78700215)]
[(6, 0.47523412), (21, 0.4992833)]
[(6, 0.56495535), (15, 0.113098085), (18, 0.15167055), (22, 0.0757673), (29, 0.08190366)]
[(0, 0.31652775), (29, 0.6769246)]
[(4, 0.50113785), (20, 0.4867742)]
[(4, 0.92051697), (20, 0.07740626)]
[(6, 0.73104256), (29, 0.26643643)]
[(22, 0.4694833), (29, 0.5239691)]
[(5, 0.37858355), (29, 0.6099893)]
[(22, 0.8825738), (29, 0.113140464)]
[(22, 0.99385166)]
[(11, 0.99523807)]
[(20, 0.60489327), (24, 0.099795304), (29, 0.29222262)]
[(6, 0.8951373), (20, 0.03502681), (29, 0.066896014)]
[(4, 0.99334633)]
[(20, 0.49990025), (29, 0.49676812)]
[(13, 0.83663756), (20, 0.055602968), (29, 0.10307082)]
[(14, 0.03970999), (16, 0.22272772), (29, 0.19689651), (33, 0.5347611)]
[(3, 0.19809695), (29, 0.78693706)]
[(6, 0.47509688), (21, 0.49942046)]
[(0, 0.3165499), (29, 0.67690253)]
[(4, 0.5010676), (20, 0.48684454)]
[(4, 0.9205142), (20, 0.07740903)]
[(6, 0.731

[(24, 0.99764216)]
[(4, 0.2832027), (9, 0.2430115), (29, 0.46013978)]
[(5, 0.058478057), (8, 0.87695473), (22, 0.015984114), (34, 0.046722338)]
[(24, 0.99764216)]
[(9, 0.052401327), (14, 0.9426879)]
[(9, 0.12265198), (14, 0.8002298), (26, 0.07220269)]
[(0, 0.079302), (9, 0.19252686), (12, 0.15518114), (14, 0.22376068), (18, 0.123453796), (20, 0.2173207)]
[(4, 0.08071583), (11, 0.016232623), (15, 0.24010214), (16, 0.020834746), (20, 0.035518877), (23, 0.117547944), (29, 0.48398456)]
[(6, 0.058116388), (9, 0.06553312), (20, 0.8732721)]
[(10, 0.7044432), (23, 0.29124448)]
[(4, 0.5989149), (8, 0.39791048)]
[(15, 0.43606725), (17, 0.023410922), (29, 0.53812844)]
[(9, 0.08283658), (15, 0.35323974), (29, 0.5591617)]
[(4, 0.51311404), (17, 0.042255912), (31, 0.43809947)]
[(4, 0.014407459), (6, 0.15953709), (20, 0.575855), (29, 0.2469561)]
[(2, 0.032461315), (4, 0.03680102), (11, 0.061447304), (24, 0.39875525), (27, 0.15390062), (29, 0.31242853)]
[(2, 0.015665542), (11, 0.7595202), (29, 0.22095

[(4, 0.085071996), (9, 0.079956554), (12, 0.64125353), (15, 0.06494367), (17, 0.018241089), (19, 0.017958974), (20, 0.06724329), (26, 0.014927434)]
[(1, 0.74080175), (6, 0.16660492), (19, 0.09171679)]
[(16, 0.9961904)]
[(1, 0.014397104), (4, 0.047440603), (5, 0.41509333), (9, 0.07488379), (20, 0.095303096), (23, 0.046556264), (24, 0.07082161), (26, 0.0130038615), (29, 0.1141394), (31, 0.050027672), (32, 0.04275631)]
[(5, 0.39590672), (6, 0.118475944), (13, 0.10185832), (17, 0.057978857), (19, 0.189937), (20, 0.061348222), (34, 0.06987065)]
[(0, 0.03242693), (4, 0.09613558), (5, 0.061985545), (7, 0.029169789), (9, 0.07856662), (17, 0.4656721), (18, 0.0332594), (23, 0.10836678), (24, 0.011192365), (26, 0.062364314), (33, 0.018420348)]
[(20, 0.99845314)]
[(6, 0.490118), (20, 0.5088874)]
[(3, 0.13196078), (5, 0.29124907), (9, 0.5222774), (17, 0.044210605)]
[(20, 0.22912662), (28, 0.65405256), (29, 0.115390025)]
[(5, 0.05151975), (6, 0.39212614), (18, 0.067228876), (20, 0.030429587), (30, 0

[(4, 0.119525164), (5, 0.07141392), (7, 0.03962416), (9, 0.2703153), (19, 0.28064162), (27, 0.04982539), (30, 0.06705448), (31, 0.09721686)]
[(9, 0.26571175), (19, 0.4164645), (31, 0.10922817), (33, 0.20456958)]
[(33, 0.98407495)]
[(9, 0.30145845), (20, 0.6850722)]
[(9, 0.25629625), (18, 0.24991359), (29, 0.48670268)]
[(5, 0.040544435), (9, 0.018935716), (25, 0.6916485), (29, 0.24442056)]
[(7, 0.6378035), (29, 0.35521236)]
[(9, 0.19124098), (14, 0.2596617), (15, 0.18166755), (29, 0.3625898)]
[(0, 0.120314494), (9, 0.18806969), (15, 0.63968736), (29, 0.045691065)]
[(9, 0.30152303), (15, 0.33964014), (17, 0.12177266), (29, 0.2315628)]
[(29, 0.06676947), (33, 0.92884517)]
[(3, 0.16095042), (4, 0.16750103), (9, 0.6236924), (20, 0.03443627)]
[(9, 0.09878249), (23, 0.12753399), (29, 0.7608941)]
[(4, 0.05626999), (9, 0.24096254), (32, 0.69573456)]
[(9, 0.083938874), (10, 0.020988975), (20, 0.43692148), (29, 0.45106497)]
[(17, 0.42471606), (20, 0.29874504), (21, 0.17092207), (33, 0.08347398)]


[(4, 0.18264343), (17, 0.44260868), (29, 0.30134577), (33, 0.06504637)]
[(0, 0.15300104), (4, 0.50909305), (20, 0.21510196), (29, 0.112002544)]
[(23, 0.20283459), (29, 0.7883537)]
[(5, 0.08524844), (9, 0.07442572), (20, 0.26461935), (23, 0.021411229), (24, 0.048846465), (28, 0.024392806), (29, 0.47409946)]
[(7, 0.2288743), (12, 0.30803487), (29, 0.4524596)]
[(4, 0.2330076), (5, 0.60641336), (24, 0.15493529)]
[(1, 0.19853044), (9, 0.62949073), (13, 0.04191343), (29, 0.1170402)]
[(1, 0.12960042), (4, 0.42937762), (9, 0.10766529), (29, 0.32033145)]
[(4, 0.26448047), (13, 0.09563683), (29, 0.6293736)]
[(4, 0.2366355), (29, 0.75214005)]
[(4, 0.30425042), (5, 0.23533626), (6, 0.18134135), (20, 0.25032252), (24, 0.020812973)]
[(4, 0.1855079), (20, 0.49064645), (29, 0.18978019), (30, 0.12725225)]
[(4, 0.14584947), (5, 0.4189928), (9, 0.09886615), (13, 0.060436882), (15, 0.041528776), (29, 0.2265092)]
[(4, 0.6636754), (26, 0.018013602), (29, 0.22495422), (30, 0.0828125)]
[(4, 0.23408861), (9, 0

[(27, 0.99726355)]
[(9, 0.12389731), (14, 0.05460372), (17, 0.39791995), (19, 0.40817901), (28, 0.014397516)]
[(3, 0.024542553), (9, 0.1501754), (14, 0.010846865), (15, 0.35606268), (19, 0.36725098), (30, 0.08954628)]
[(9, 0.02798276), (15, 0.09793058), (17, 0.21107443), (19, 0.32525104), (26, 0.02344227), (29, 0.2973907), (34, 0.012514118)]
[(6, 0.08089659), (20, 0.018443916), (28, 0.0541552), (29, 0.504885), (32, 0.33909824)]
[(6, 0.037853096), (9, 0.028504964), (15, 0.080296256), (19, 0.36524984), (24, 0.029154453), (26, 0.047307245), (27, 0.016535759), (29, 0.37516928)]
[(16, 0.49117878), (17, 0.19267847), (26, 0.22087401), (34, 0.089888394)]
[(1, 0.42226654), (6, 0.17537914), (23, 0.14571594), (30, 0.021262893), (34, 0.22052543)]
[(6, 0.51497036), (19, 0.34602442), (26, 0.107368864), (29, 0.030727882)]
[(6, 0.10100863), (19, 0.8025289), (20, 0.017073767), (23, 0.07539898)]
[(1, 0.01383984), (27, 0.9847207)]
[(5, 0.080858946), (24, 0.9175953)]
[(6, 0.1306185), (19, 0.55699533), (25

<function print>

In [24]:
# finding highest value from each row
toptopic = [max(collection, key=lambda x: x[1])[0] for collection in results]
toptopic[:5] #Previewing first 5

[5, 28, 20, 4, 23]

In [25]:
toptopic = pd.DataFrame(toptopic)
documents = pd.DataFrame(documents)
documents = documents.rename(columns = {0: 'documents'})
summary = documents.join(toptopic)
summary.rename(columns = {0: 'top_topic'}, inplace = True)
summary.head()

| | documents | top_topic |
|---|---|---|
| 0 | Print They should pay all the back all the mon... | 5 |
| 1 | Why Did Attorney General Loretta Lynch Plead T... | 28 |
| 2 | Red State Fox News Sunday reported this mornin... | 20 |
| 3 | Email Kayla Mueller was a prisoner and torture... | 4 |
| 4 | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | 23 |


In [64]:
summary2 = summary.groupby('top_topic').count()

In [88]:
summary2.sort_values(by='documents', ascending = False).head()

| top_topic | documents |
|---|---|
| 29 | 268 |
| 20 | 147 |
| 4 | 117 |
| 19 | 109 |
| 6 | 101 |


<b> The top 5 most popular fake news topics are 29, 20, 4, 19, and 6 <i>from the above topic modeling</i>, not the pyLDAvis model below.</b>
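
The same tally can be reproduced without pandas. The following is a minimal sketch using only the standard library; the `results` rows are a handful copied from the per-document output printed earlier, and the helper name `top_topic` is hypothetical.

```python
from collections import Counter

# A few per-document topic distributions copied from the output above;
# each row is a list of (topic_id, probability) pairs.
results = [
    [(20, 0.0814937), (33, 0.9154649)],
    [(6, 0.55192745), (13, 0.44259077)],
    [(4, 0.07803967), (20, 0.29705262), (22, 0.041427895), (29, 0.5800064)],
    [(3, 0.19803186), (29, 0.78700215)],
    [(4, 0.92051697), (20, 0.07740626)],
]


def top_topic(doc_topics):
    """Return the topic id with the highest probability (hypothetical helper)."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


top_topics = [top_topic(row) for row in results]
counts = Counter(top_topics)

# most_common() orders topics by how many documents they dominate.
for topic_id, n_docs in counts.most_common():
    print(topic_id, n_docs)
```

On the full corpus, `counts.most_common(5)` would play the role of the `groupby` and `sort_values` cells above.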

# Appendix 1 - pyLDAvis 

In [90]:
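import pyLDAvis.gensim  # newer pyLDAvis releases expose this module as pyLDAvis.gensim_models

pyLDAvis lays the topics out in two dimensions by scaling the distances between topics (principal coordinate analysis by default).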

In [91]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)

Topic numbers in this visualization (LDAvis) do not match the topic ids used above. Some of the topics it shows:

- Tax cut / budget
- War: Syria, Iraq, Iran
- Immigration
- Planned Parenthood
- Big banks / Wall Street
- Judgement issue
- Iran, military
- Gun
- ...

# Appendix 2 

We want to show off the `get_term_topics` and `get_document_topics` functionalities, and a good way to do so is to play around with words that can have different meanings in different contexts.

The word `bank` is a good candidate: it can mean either a financial institution or a river bank.
The examples below instead query words from the fake news corpus, such as `war` and `clinton`.

### get_term_topics

The function `get_term_topics` returns, for a given word, the topics it most likely belongs to, each with its probability.
A few examples:

In [92]:
model.get_term_topics('war')

[(6, 0.017154103)]

In [93]:
model.get_term_topics('clinton')

[(4, 0.016514735), (20, 0.047979996), (25, 0.012405078), (29, 0.017000437)]
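
One way to read this output: for a given word, the model reports every topic whose weight for that word clears a minimum threshold. The sketch below mimics that idea on a hand-made topic-word table. The table values, the `term_topics` helper, and the `0.01` threshold are all illustrative assumptions, not gensim's internals.

```python
# Hypothetical topic -> {word: probability} table (values are made up).
topic_word = {
    0: {"war": 0.017, "clinton": 0.001, "tax": 0.002},
    1: {"war": 0.003, "clinton": 0.048, "tax": 0.001},
    2: {"war": 0.002, "clinton": 0.017, "tax": 0.030},
}


def term_topics(word, minimum_probability=0.01):
    """List (topic_id, probability) pairs where the word clears the threshold."""
    return [
        (topic_id, words[word])
        for topic_id, words in sorted(topic_word.items())
        if words.get(word, 0.0) >= minimum_probability
    ]


print(term_topics("war"))
print(term_topics("clinton"))
```

A word that never clears the threshold in any topic simply yields an empty list.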

### get_document_topics 

`get_document_topics` is an existing gensim method that uses the `inference` function to get the sufficient statistics and work out the topic distribution of a document.

It can also report the topic distribution for each word in the document.
Let us test this with a short document built from the words `tax`, `cut`, `budget`, and `border`.

When `per_word_topics` is set to `True`, `get_document_topics` returns, along with the standard document-topic proportions, each word id followed by a list of topic ids sorted from most to least likely, plus the phi values for each word.

In [94]:
bow = ['tax','cut','budget','border']

In [95]:
bow = model.id2word.doc2bow(bow) # convert to bag of words format first
print(bow)

[(353, 1), (1223, 1), (2927, 1), (4334, 1)]


In [96]:
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics

[(353, [25]), (1223, [25]), (2927, [25]), (4334, [25])]

In [97]:
phi_values

[(353, [(25, 0.99999994)]),
 (1223, [(25, 1.0)]),
 (2927, [(25, 1.0)]),
 (4334, [(25, 0.99999994)])]

In [98]:
for k, v in dictionary.token2id.items():
    print(k, v)

another 0
asap 1
benefit 2
bust 3
came 4
case 5
commit 6
control 7
deported 8
entire 9
everyone 10
family 11
four 12
fraud 13
government 14
group 15
immigrant 16
interest 17
million 18
money 19
month 20
muslim 21
numerous 22
pay 23
plus 24
print 25
refugee 26
related 27
reported 28
somali 29
stealing 30
stole 31
system 32
taxpayer 33
two 34
according 35
accusation 36
administration 37
aimed 38
amendment 39
american 40
answer 41
approved 42
assistant 43
attorney 44
avoid 45
barracuda 46
barred 47
beacon 48
behalf 49
billion 50
blocking 51
bound 52
brigade 53
cash 54
chosen 55
com 56
communication 57
comply 58
congress 59
congressional 60
core 61
corrupt 62
corruption 63
course 64
covering 65
deal 66
declining 67
deflects 68
delivered 69
detail 70
disclosing 71
earlier 72
effort 73
either 74
essentially 75
evidence 76
exclusively 77
fifth 78
finest 79
fla 80
follow 81
foremost 82
free 83
freeing 84
friday 85
general 86
hostage 87
incriminating 88
informing 89
initially 90
inquiry 91
inve

reach 1160
refused 1161
relative 1162
respective 1163
reveals 1164
romney 1165
roof 1166
seem 1167
shockingly 1168
shot 1169
somehow 1170
somewhat 1171
spite 1172
surpassed 1173
swing 1174
topped 1175
total 1176
turn 1177
turnout 1178
unlike 1179
vaunted 1180
virginia 1181
virtually 1182
whether 1183
wisconsin 1184
worked 1185
yankee 1186
alliance 1187
become 1188
built 1189
copyright 1190
liberty 1191
loop 1192
newsletter 1193
reserved 1194
sign 1195
subscribe 1196
wpdevelopers 1197
brave 1198
demonizing 1199
fly 1200
msnbc 1201
penny 1202
protest 1203
reporter 1204
shirt 1205
sided 1206
supporter 1207
violence 1208
wearing 1209
yorkers 1210
accustomed 1211
angeles 1212
annual 1213
anti 1214
antonio 1215
appropriate 1216
backing 1217
barrier 1218
baseball 1219
bat 1220
big 1221
billionaire 1222
border 1223
build 1224
bull 1225
burned 1226
carpet 1227
certainly 1228
chose 1229
clip 1230
coping 1231
criticism 1232
decade 1233
demonstrator 1234
depressed 1235
describing 1236
disappointme

linking 2160
list 2161
lord 2162
lux 2163
manafort 2164
manager 2165
martin 2166
meltdown 2167
meme 2168
mention 2169
minor 2170
mook 2171
mountain 2172
named 2173
natural 2174
neera 2175
non 2176
notified 2177
notorious 2178
objected 2179
oligarch 2180
operative 2181
outlined 2182
oversight 2183
palmieri 2184
permission 2185
planned 2186
podesta 2187
polarizing 2188
politically 2189
posted 2190
potentially 2191
pouring 2192
prior 2193
profit 2194
progress 2195
provocateur 2196
provoke 2197
recovered 2198
relationship 2199
robby 2200
roland 2201
ruling 2202
russia 2203
russian 2204
sally 2205
sexually 2206
shown 2207
slate 2208
spearheaded 2209
stating 2210
stepped 2211
strategy 2212
submitted 2213
sweetheart 2214
tanden 2215
tarmac 2216
teenage 2217
though 2218
thread 2219
tidal 2220
tie 2221
trick 2222
ukraine 2223
ukrainian 2224
underground 2225
underlined 2226
unravel 2227
vanguard 2228
wave 2229
whose 2230
yates 2231
ability 2232
accelerated 2233
accusatory 2234
accused 2235
accus

audio 3258
baiting 3259
blog 3260
clickbait 3261
cult 3262
deceit 3263
dishonesty 3264
eaten 3265
eating 3266
equate 3267
evaporated 3268
explains 3269
fiction 3270
fish 3271
foer 3272
franklin 3273
legitimacy 3274
lemon 3275
mic 3276
misinformation 3277
misogynist 3278
obsession 3279
pale 3280
path 3281
planted 3282
publishing 3283
rank 3284
retraction 3285
rightly 3286
saddest 3287
screwed 3288
season 3289
shrill 3290
simultaneously 3291
sorry 3292
spoil 3293
sum 3294
superpower 3295
tabloid 3296
tale 3297
tormented 3298
wilder 3299
amid 3300
candidacy 3301
chossudovsky 3302
episode 3303
experienced 3304
fanfare 3305
fetish 3306
globalresearch 3307
highlighting 3308
michel 3309
professor 3310
sustainable 3311
www 3312
acr 3313
blast 3314
bookmaker 3315
brexit 3316
britain 3317
broadcast 3318
edition 3319
esoteric 3320
evaluation 3321
fun 3322
gripping 3323
marcus 3324
mature 3325
nato 3326
odds 3327
overdrive 3328
packed 3329
permitting 3330
publication 3331
rejoined 3332
resume 3333

subterfuge 4409
suggests 4410
sunset 4411
super 4412
surrendered 4413
tackle 4414
technocrat 4415
traveling 4416
troll 4417
truthful 4418
underpinned 4419
vain 4420
version 4421
westerner 4422
xenophobic 4423
amish 4424
arkansas 4425
attendance 4426
biblical 4427
body 4428
brotherhood 4429
carrying 4430
census 4431
columbus 4432
consist 4433
deed 4434
dennis 4435
descendant 4436
despair 4437
developer 4438
endorse 4439
enter 4440
faith 4441
fivethirtyeight 4442
flamboyant 4443
freefall 4444
governing 4445
granted 4446
guaranteed 4447
guaranteeing 4448
heritage 4449
hopelessness 4450
horner 4451
imperative 4452
indiana 4453
informal 4454
instruct 4455
looked 4456
maintaining 4457
marriage 4458
mathematically 4459
midwest 4460
mood 4461
museum 4462
nate 4463
patriotism 4464
perennial 4465
persecuted 4466
pledged 4467
pledging 4468
portion 4469
poured 4470
predictive 4471
price 4472
protestant 4473
reconsider 4474
reformation 4475
reliably 4476
resigned 4477
rule 4478
rural 4479
sect 4480

cared 5659
circulated 5660
cleaned 5661
cps 5662
denial 5663
despicable 5664
disagrees 5665
disbanded 5666
disease 5667
disinformation 5668
ditto 5669
documented 5670
echelon 5671
elsewhere 5672
endless 5673
feed 5674
figured 5675
filth 5676
flagrant 5677
food 5678
forgive 5679
fragment 5680
fronted 5681
gate 5682
gavin 5683
google 5684
gutted 5685
heading 5686
heinous 5687
hitherto 5688
inflicting 5689
instantly 5690
jailed 5691
jimstone 5692
justify 5693
kill 5694
knocked 5695
lit 5696
loyal 5697
lunch 5698
malfunctioning 5699
manually 5700
mcfadyen 5701
moderator 5702
monopoly 5703
msm 5704
nearest 5705
neck 5706
pamela 5707
poison 5708
poisoned 5709
practically 5710
psyop 5711
putting 5712
rat 5713
recover 5714
rejoice 5715
restored 5716
rooted 5717
rot 5718
salvaged 5719
scientific 5720
scum 5721
seamless 5722
sewer 5723
sit 5724
snatching 5725
snopes 5726
spewed 5727
spot 5728
stream 5729
stretched 5730
study 5731
subtracted 5732
suddenly 5733
tainted 5734
task 5735
toast 5736
tr

surgical 6909
switched 6910
taste 6911
telegraph 6912
hero 6913
violater 6914
bleeding 6915
cemetery 6916
charcoal 6917
coffee 6918
internally 6919
kent 6920
liquid 6921
max 6922
poisoning 6923
vomit 6924
vomiting 6925
afghan 6926
ball 6927
connecting 6928
extract 6929
overlord 6930
par 6931
russkies 6932
script 6933
uranium 6934
zionist 6935
deception 6936
ghostrager 6937
isolated 6938
agitated 6939
applause 6940
boredom 6941
colonize 6942
colonized 6943
demise 6944
deserved 6945
emotional 6946
eon 6947
extinction 6948
genocide 6949
incarnation 6950
mat 6951
myriad 6952
precious 6953
sentient 6954
soft 6955
abysmal 6956
carved 6957
caste 6958
clarity 6959
disregard 6960
dissolution 6961
dissolve 6962
gasp 6963
guise 6964
hah 6965
happenstance 6966
herald 6967
intrigue 6968
mask 6969
maya 6970
mere 6971
pounce 6972
pray 6973
providence 6974
randomly 6975
remorse 6976
shattered 6977
skirt 6978
stagnant 6979
veil 6980
void 6981
avid 6982
fan 6983
glory 6984
sooner 6985
covered 6986
dug 6

transmit 8107
untold 8108
unvaccinated 8109
variation 8110
variously 8111
vest 8112
vista 8113
waded 8114
warm 8115
brute 8116
carey 8117
wedler 8118
autonomous 8119
mechanized 8120
robotics 8121
surround 8122
workforce 8123
demand 8124
forbidden 8125
holbrooks 8126
alabama 8127
shelby 8128
carter 8129
jimmy 8130
saga 8131
talcum 8132
designated 8133
gunnar 8134
jabhat 8135
obliquely 8136
ulson 8137
blair 8138
mainland 8139
chicken 8140
behest 8141
rake 8142
montreal 8143
nadia 8144
prupis 8145
slammed 8146
dismissed 8147
erdo 8148
recep 8149
tayyip 8150
advising 8151
confirmation 8152
est 8153
fourkiller 8154
guild 8155
militarized 8156
oklahoma 8157
owl 8158
protector 8159
tamara 8160
tribal 8161
broze 8162
derrick 8163
mint 8164
mintpressnews 8165
reservation 8166
demo 8167
mechanism 8168
abide 8169
acceptance 8170
acquiring 8171
actuality 8172
adaaa 8173
albany 8174
allege 8175
amr 8176
analog 8177
authored 8178
bibliography 8179
breached 8180
breast 8181
cancel 8182
carpenter 8183

bailing 9158
belligerent 9159
breakthrough 9160
burgeoning 9161
buying 9162
centered 9163
chicanery 9164
chomsky 9165
cleverly 9166
concedes 9167
conceding 9168
consequential 9169
consequently 9170
constitute 9171
contemplated 9172
contemptible 9173
conversely 9174
convincing 9175
curtail 9176
cynthia 9177
decidedly 9178
defeating 9179
denies 9180
deport 9181
deregulated 9182
detractor 9183
disgust 9184
dismantling 9185
dismissing 9186
displaced 9187
dissimilar 9188
dominant 9189
domination 9190
enables 9191
enrichment 9192
espoused 9193
evilism 9194
faction 9195
fossil 9196
framed 9197
freer 9198
futile 9199
hence 9200
implemented 9201
impoverishment 9202
intending 9203
lingering 9204
marginally 9205
materialize 9206
measurable 9207
menace 9208
militarism 9209
minimize 9210
minimized 9211
morality 9212
multitude 9213
mute 9214
nader 9215
noam 9216
notably 9217
objection 9218
objectionable 9219
openness 9220
oppressed 9221
pandering 9222
perpetuation 9223
picking 9224
placate 9225
poor

contaminates 10158
discriminatory 10159
distinguished 10160
facie 10161
impeccable 10162
inconsistent 10163
inspiring 10164
instituted 10165
invariably 10166
irish 10167
kamran 10168
kindly 10169
masterly 10170
prima 10171
punjab 10172
recognised 10173
scholarship 10174
scotland 10175
shutting 10176
tolerance 10177
transforming 10178
worrying 10179
admirable 10180
antagonism 10181
ballroom 10182
bankrupt 10183
brazen 10184
chanted 10185
decried 10186
ethically 10187
exceptionalism 10188
exceptionally 10189
grandfather 10190
hilton 10191
impotent 10192
incarnate 10193
keen 10194
landmark 10195
lining 10196
muscular 10197
negotiating 10198
nevertrump 10199
nobility 10200
nuanced 10201
portraying 10202
pragmatist 10203
prefers 10204
reckoning 10205
ross 10206
smelling 10207
societal 10208
spectacularly 10209
surpass 10210
tanker 10211
underestimated 10212
vitriol 10213
weasel 10214
advertised 10215
allah 10216
arbaeen 10217
askari 10218
attending 10219
carnival 10220
customary 10221
dicta

applaud 11373
astronomy 11374
camus 11375
culturally 11376
cure 11377
decolonize 11378
deficient 11379
dismissal 11380
egged 11381
eurocentric 11382
evaded 11383
evoked 11384
humiliation 11385
huntington 11386
identitarian 11387
joel 11388
johannesburg 11389
lament 11390
lecturer 11391
lightening 11392
mag 11393
marxism 11394
maverick 11395
methodology 11396
modernity 11397
mythical 11398
postcolonial 11399
remaking 11400
renaissance 11401
replaces 11402
restart 11403
rhodes 11404
scrapped 11405
scratched 11406
straitjacket 11407
subjugated 11408
swede 11409
swirling 11410
tooth 11411
tragic 11412
tuition 11413
voodoo 11414
watered 11415
zimbabwe 11416
clearest 11417
correlated 11418
espouse 11419
linda 11420
obscures 11421
predicts 11422
projected 11423
quartz 11424
questionnaire 11425
resentment 11426
respondent 11427
salience 11428
sample 11429
survey 11430
surveyed 11431
accounted 11432
bedard 11433
birthright 11434
illegals 11435
markedly 11436
newborn 11437
pew 11438
ankle 11439


thinly 12408
throat 12409
traumatised 12410
trope 12411
trouncing 12412
unaccountable 12413
unemployed 12414
unmistakable 12415
unquestioned 12416
unwillingness 12417
vichy 12418
waited 12419
worthless 12420
ziers 12421
barrow 12422
birmingham 12423
blackburn 12424
borough 12425
bradford 12426
cantle 12427
chuka 12428
dame 12429
devon 12430
dwindled 12431
finalised 12432
highlighted 12433
leicester 12434
louise 12435
luton 12436
mixing 12437
newham 12438
polarisation 12439
pupil 12440
retiring 12441
slough 12442
steepest 12443
umunna 12444
urgent 12445
yorkshire 12446
bludgeon 12447
curdling 12448
encampment 12449
eritrea 12450
hampered 12451
hooded 12452
jeanne 12453
littered 12454
lively 12455
makeshift 12456
organised 12457
passer 12458
pedestrian 12459
pitching 12460
rubbish 12461
ruining 12462
shopkeeper 12463
squalid 12464
squalor 12465
squat 12466
squatter 12467
stalingrad 12468
abound 12469
apt 12470
evangelical 12471
mitigate 12472
rift 12473
underscoring 12474
brightest 12475

haaretz 13657
unesco 13658
reluctantly 13659
twin 13660
adolf 13661
archaeological 13662
doubting 13663
nordic 13664
polar 13665
rumored 13666
treasure 13667
chat 13668
lure 13669
proposing 13670
unbeknownst 13671
mammogram 13672
commune 13673
contaminated 13674
mutant 13675
veg 13676
wheat 13677
absorbing 13678
booming 13679
harness 13680
platinum 13681
respiratory 13682
unexpectedly 13683
dipshit 13684
adapting 13685
dairy 13686
dietary 13687
glucose 13688
lactose 13689
amended 13690
hauling 13691
heating 13692
cheated 13693
demagogic 13694
dillary 13695
infinitely 13696
starter 13697
theantimedia 13698
undermined 13699
shit 13700
lackey 13701
alice 13702
amitabh 13703
desai 13704
forgets 13705
quirky 13706
reassured 13707
blackberry 13708
plante 13709
snippet 13710
ufkeozcx 13711
nauseam 13712
taker 13713
synchronized 13714
sicker 13715
manbij 13716
fucker 13717
birgitta 13718
icelandic 13719
nascent 13720
nsd 13721
pirate 13722
ttir 13723
beirut 13724
flattened 13725
photograph 137

clay 14907
ousting 14908
retaining 14909
bout 14910
euronews 14911
overcrowded 14912
parasite 14913
petrol 14914
endure 14915
navigate 14916
condones 14917
cupcake 14918
hugging 14919
tattooed 14920
throwback 14921
chandler 14922
infidel 14923
parlance 14924
shipping 14925
benchmark 14926
criminalize 14927
endemic 14928
omit 14929
renaming 14930
stricter 14931
tackling 14932
acknowledgement 14933
beheading 14934
combining 14935
extinct 14936
hugged 14937
judiciary 14938
oppress 14939
subjugate 14940
testament 14941
bert 14942
hardworking 14943
lashing 14944
quitting 14945
theresa 14946
charmaine 14947
cheaper 14948
corporal 14949
escorted 14950
luxurious 14951
nicer 14952
refurbishment 14953
tumor 14954
coliseum 14955
rickety 14956
cursory 14957
singling 14958
teased 14959
weaver 14960
contemplate 14961
cynic 14962
moe 14963
shakeup 14964
spilman 14965
winston 14966
blazing 14967
eclipse 14968
whitewash 14969
apoplectic 14970
inducing 14971
matteo 14972
paralysis 14973
shored 14974
str

allotted 16157
alternet 16158
bakken 16159
bald 16160
braun 16161
cody 16162
colorlines 16163
crisscrossed 16164
funes 16165
governement 16166
grist 16167
hitchcock 16168
interfered 16169
joye 16170
lakota 16171
lummi 16172
nowthis 16173
null 16174
plenary 16175
pommersheim 16176
precept 16177
pursuance 16178
remi 16179
seizement 16180
thereof 16181
truthout 16182
yessenia 16183
advertise 16184
agility 16185
allergic 16186
asthma 16187
attesting 16188
autobiography 16189
bead 16190
berkshire 16191
caracol 16192
catastrophically 16193
checker 16194
chittum 16195
chronically 16196
cinder 16197
cineas 16198
crisp 16199
defraud 16200
deputized 16201
diploma 16202
disbursed 16203
disparate 16204
downright 16205
evacuee 16206
excerpted 16207
faltered 16208
fizzled 16209
forego 16210
formaldehyde 16211
gala 16212
garment 16213
gleaming 16214
globetrotting 16215
gouverneur 16216
hanes 16217
hanesbrands 16218
hardwood 16219
hathaway 16220
headache 16221
helpful 16222
hile 16223
honeymoon 16224


fitness 17407
workout 17408
kudos 17409
marvel 17410
superhuman 17411
transforms 17412
degrasse 17413
squeezing 17414
tyson 17415
flex 17416
hoof 17417
sidekick 17418
treater 17419
trusty 17420
voil 17421
armor 17422
guttural 17423
pond 17424
rotating 17425
compartment 17426
frankenstein 17427
wiper 17428
barn 17429
grasping 17430
newbie 17431
darken 17432
rizzo 17433
showcase 17434
trunk 17435
champagne 17436
gigantic 17437
zoo 17438
clickhole 17439
onion 17440
antiviral 17441
coughing 17442
digest 17443
dispenser 17444
doorknob 17445
fdr 17446
flu 17447
phlegm 17448
delicious 17449
plump 17450
immersive 17451
mobilizes 17452
peru 17453
rebirth 17454
leftover 17455
subtract 17456
darkened 17457
meditate 17458
meditating 17459
meditation 17460
mindful 17461
appetite 17462
gratification 17463
jackal 17464
poise 17465
silk 17466
dancer 17467
gently 17468
princess 17469
relish 17470
sensation 17471
tracing 17472
traverse 17473
reconnect 17474
fateful 17475
thrilling 17476
trophy 17477
con

likud 18656
messianic 18657
reuven 18658
aegis 18659
attar 18660
itv 18661
subverted 18662
trumpeting 18663
underpinning 18664
upstairs 18665
pyatt 18666
rattle 18667
allocating 18668
cumbersome 18669
krassotkin 18670
nat 18671
burdened 18672
metric 18673
precluded 18674
bullhorn 18675
conformity 18676
mesh 18677
dianne 18678
feinstein 18679
quaint 18680
gabbard 18681
tulsi 18682
extrapolation 18683
quadrennial 18684
dovish 18685
languish 18686
rarer 18687
residual 18688
schema 18689
tiller 18690
gentry 18691
predisposition 18692
antalya 18693
carya 18694
kagan 18695
mujahedeen 18696
regnum 18697
rejoining 18698
fracked 18699
hypothermia 18700
engendered 18701
entreaty 18702
hilt 18703
junky 18704
motivate 18705
adduced 18706
bona 18707
fide 18708
functionary 18709
inept 18710
peasantry 18711
broc 18712
larken 18713
larkenrose 18714
legitimizing 18715
corbettreport 18716
shrug 18717
solace 18718
newsbud 18719
skouras 18720
spiro 18721
pizzagate 18722
subreddit 18723
edmonds 18724
ellro

In [99]:
for k, v in dictionary.token2id.items():
    if v == 474:
        print(k, v)

damning 474
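
Scanning every entry to find one id works, but inverting the mapping once makes repeated lookups cheap. A minimal sketch, with a plain dict standing in for `dictionary.token2id`; the five entries are taken from the outputs above, and `id2token` is a name introduced here for illustration.

```python
# Stand-in for dictionary.token2id, using entries shown in the output above.
token2id = {"money": 19, "print": 25, "damning": 474, "border": 1223, "russia": 2203}

# Invert the mapping once; ids are unique, so no entries collide.
id2token = {token_id: token for token, token_id in token2id.items()}

print(id2token[474])
print(id2token[1223])
```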


# Appendix 3 - Non-negative Matrix Factorization (NMF)

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [101]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   stop_words='english')

In [102]:
for i in texts[:2]:
    print(str(i))

['uuid', 'ord_in_thread', 'author', 'published', 'title', 'text', 'language', 'crawled', 'site_url', 'country', 'domain_rank', 'thread_title', 'spam_score', 'main_img_url', 'replies_count', 'participants_count', 'likes', 'comments', 'shares', 'type']
['6a175f46bcd24d39b3e962ad0f29936721db70db', '0', 'Barracuda Brigade', '2016-10-26T21:41:00.000+03:00', 'Muslims BUSTED: They Stole Millions In Gov’t Benefits', 'Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related', 'english', '2016-10-27T01:49:27.168+03:00', '100percentfedup.com', 'US', '256

In [103]:
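# Flatten each CSV row to a single string for TfidfVectorizer. This keeps every
# column (uuid, timestamps, URLs) and the header row, so metadata tokens show up
# in the topics below.
texts = [str(i) for i in texts]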
for i in texts[:2]:
    print(i)

['uuid', 'ord_in_thread', 'author', 'published', 'title', 'text', 'language', 'crawled', 'site_url', 'country', 'domain_rank', 'thread_title', 'spam_score', 'main_img_url', 'replies_count', 'participants_count', 'likes', 'comments', 'shares', 'type']
['6a175f46bcd24d39b3e962ad0f29936721db70db', '0', 'Barracuda Brigade', '2016-10-26T21:41:00.000+03:00', 'Muslims BUSTED: They Stole Millions In Gov’t Benefits', 'Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related', 'english', '2016-10-27T01:49:27.168+03:00', '100percentfedup.com', 'US', '256

In [104]:
tfidf = tfidf_vectorizer.fit_transform(texts)

In [105]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
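
The slice `topic.argsort()[:-n_top_words - 1:-1]` picks the indices of the `n_top_words` largest weights, largest first. A plain-Python sketch of the same selection follows; the weights, the vocabulary, and the `top_indices` helper are made up for illustration.

```python
def top_indices(weights, n_top_words):
    """Indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n_top_words]


# Made-up weights for one topic and a made-up vocabulary.
feature_names = ["clinton", "email", "gold", "russia", "vote"]
topic = [0.9, 0.4, 0.0, 0.1, 0.7]

top_words = [feature_names[i] for i in top_indices(topic, 3)]
print("Topic #0: " + " ".join(top_words))
```

When two weights tie, this version and `argsort` may order them differently, so the example avoids ties.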

In [106]:
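# Note: scikit-learn 1.2 removed `alpha` (use `alpha_W`/`alpha_H`) and
# `get_feature_names()` (use `get_feature_names_out()`).
nmf = NMF(n_components=35, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()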

n_top_words = 5

print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model (Frobenius norm):
Topic #0: people like world just time
Topic #1: clinton foundation hillary campaign clintons
Topic #2: trump donald president campaign election
Topic #3: la el que en los
Topic #4: russia russian putin nato moscow
Topic #5: 26t22 org truth 24453 16
Topic #6: text results italic block comment
Topic #7: fbi comey investigation director emails
Topic #8: pipeline dakota rock standing police
Topic #9: на не что по ru
Topic #10: aleppo syrian al civilians qaeda
Topic #11: obama nobama president house administration
Topic #12: election voting vote voter voters
Topic #13: gold silver market dollar price
Topic #14: der und die zu das
Topic #15: 11 02 pakalertpress com pakalert
Topic #16: 27t00 37 tank 2435 43
Topic #17: syria war assad syrian russia
Topic #18: podesta wikileaks emails email campaign
Topic #19: 10 03 com anonhq http
Topic #20: galacticconnection click alexandra galactic psychic
Topic #21: westernjournalism 26t22 46 829 43
Topic #22: israel 

In [107]:
# Kullback-Leibler loss is only supported by the multiplicative-update solver ('mu').
nmf = NMF(n_components=20, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)

print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
n_top_words = 5

print_top_words(nmf, tfidf_feature_names, n_top_words)


Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: time way people years world
Topic #1: clinton hillary wikileaks fbi emails
Topic #2: trump donald uploads hillary wp
Topic #3: jpg uploads http com 000
Topic #4: le en la el spanish
Topic #5: jpg wp http bs com
Topic #6: 02 use time want link
Topic #7: russia russian said military nthe
Topic #8: police violence water standing state
Topic #9: russian ru на что по
Topic #10: syria syrian said terrorists people
Topic #11: obama americans black congress policies
Topic #12: vote voting election voters votes
Topic #13: market world year money news
Topic #14: nu w1200 noreply die https
Topic #15: wp uploads com bs www
Topic #16: said com time year bs
Topic #17: war world wars power military
Topic #18: state united year president said
Topic #19: 10 com 03 bs www
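Several of the NMF topics above are dominated by URL and file-path fragments (`jpg`, `wp`, `uploads`, `http`, `com`) and by timestamp pieces such as `26t22`. A minimal sketch of a token filter that could run before vectorizing is shown below. The `NOISE_TOKENS` list and the `drop_noise` helper are hypothetical names introduced here for illustration; they are not part of the pipeline used in this notebook.

```python
# Hypothetical noise list, built from tokens visible in the topic output above.
NOISE_TOKENS = {"jpg", "wp", "uploads", "http", "https", "com", "www", "org"}


def drop_noise(tokens, noise=NOISE_TOKENS):
    """Lowercase tokens and drop web residue and anything containing a digit."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in noise:
            continue  # URL or file-path fragment
        if any(ch.isdigit() for ch in tok):
            continue  # timestamps and counters such as '26t22' or '000'
        kept.append(tok)
    return kept


sample = ["Trump", "uploads", "jpg", "26t22", "clinton", "http", "www", "emails", "000"]
cleaned = drop_noise(sample)
print(cleaned)
```

The digit rule is deliberately blunt: it also removes tokens such as `w1200`, which may or may not be desirable depending on the corpus.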



# Appendix 4 - Latent Semantic Indexing (LSI)

### LSI using TFIDF

In [45]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2018-10-29 21:59:02,179 : INFO : collecting document frequencies
2018-10-29 21:59:02,181 : INFO : PROGRESS: processing document #0
2018-10-29 21:59:02,341 : INFO : calculating IDF weights for 2000 documents and 18776 features (353135 matrix non-zeros)


In [46]:
corpus_tfidf = tfidf[corpus]
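To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting as I understand gensim's defaults: raw term frequency times `log2(N / df)`, followed by scaling each document to unit length. The function name `tfidf_weights` and the toy corpus are assumptions made for illustration only.

```python
import math


def tfidf_weights(bow_corpus):
    """Weight (term_id, count) documents by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)

    # Document frequency: in how many documents does each term appear?
    doc_freq = {}
    for doc in bow_corpus:
        for term_id, _count in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted = []
    for doc in bow_corpus:
        vec = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        vec = [(t, w) for t, w in vec if w != 0.0]  # terms in every document carry no weight
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm > 0 else [])
    return weighted


# Toy bag-of-words corpus (hypothetical): term 0 occurs in every document.
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1)],
]
weighted = tfidf_weights(toy_corpus)
for doc in weighted:
    print(doc)
```

Term 0 appears in all three toy documents, so its IDF is zero and it drops out entirely. This is the same effect that pushes very common words down in the TF-IDF version of LSI.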

In [47]:
numpy.random.seed(1) # setting random seed to get the same results each time. 

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=20) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2018-10-29 21:59:02,409 : INFO : using serial LSI version on this node
2018-10-29 21:59:02,410 : INFO : updating model with new documents
2018-10-29 21:59:03,839 : INFO : preparing a new chunk of documents
2018-10-29 21:59:03,911 : INFO : using 100 extra samples and 2 power iterations
2018-10-29 21:59:03,912 : INFO : 1st phase: constructing (18777, 120) action matrix
2018-10-29 21:59:03,975 : INFO : orthonormalizing (18777, 120) action matrix
2018-10-29 21:59:04,465 : INFO : 2nd phase: running dense svd on (120, 2000) matrix
2018-10-29 21:59:04,519 : INFO : computing the final decomposition
2018-10-29 21:59:04,520 : INFO : keeping 20 factors (discarding 56.986% of energy spectrum)
2018-10-29 21:59:04,537 : INFO : processed documents up to #2000
2018-10-29 21:59:04,540 : INFO : topic #0(7.121): 0.244*"trump" + 0.220*"clinton" + 0.126*"hillary" + 0.115*"war" + 0.107*"obama" + 0.106*"black" + 0.105*"russia" + 0.104*"election" + 0.102*"syria" + 0.098*"state"
2018-10-29 21:59:04,543 : INFO 

In [48]:
lsi.print_topics(20)

2018-10-29 21:59:04,574 : INFO : topic #0(7.121): 0.244*"trump" + 0.220*"clinton" + 0.126*"hillary" + 0.115*"war" + 0.107*"obama" + 0.106*"black" + 0.105*"russia" + 0.104*"election" + 0.102*"syria" + 0.098*"state"
2018-10-29 21:59:04,577 : INFO : topic #1(4.327): -0.315*"email" + -0.300*"subscribe" + -0.225*"notify" + -0.206*"donate" + -0.195*"donation" + 0.192*"syria" + -0.177*"blog" + -0.165*"clinton" + -0.149*"post" + -0.145*"address"
2018-10-29 21:59:04,580 : INFO : topic #2(4.184): -0.271*"subscribe" + 0.226*"trump" + -0.206*"notify" + -0.201*"syria" + -0.193*"email" + -0.187*"donate" + -0.167*"donation" + -0.158*"blog" + -0.155*"post" + 0.144*"clinton"
2018-10-29 21:59:04,582 : INFO : topic #3(3.577): 0.388*"clinton" + 0.367*"fbi" + -0.205*"black" + 0.174*"investigation" + 0.173*"comey" + 0.159*"foundation" + 0.153*"hillary" + -0.144*"trump" + 0.119*"email" + 0.117*"server"
2018-10-29 21:59:04,585 : INFO : topic #4(3.335): -0.240*"syria" + -0.230*"trump" + 0.209*"mosul" + -0.152*

[(0,
  '0.244*"trump" + 0.220*"clinton" + 0.126*"hillary" + 0.115*"war" + 0.107*"obama" + 0.106*"black" + 0.105*"russia" + 0.104*"election" + 0.102*"syria" + 0.098*"state"'),
 (1,
  '-0.315*"email" + -0.300*"subscribe" + -0.225*"notify" + -0.206*"donate" + -0.195*"donation" + 0.192*"syria" + -0.177*"blog" + -0.165*"clinton" + -0.149*"post" + -0.145*"address"'),
 (2,
  '-0.271*"subscribe" + 0.226*"trump" + -0.206*"notify" + -0.201*"syria" + -0.193*"email" + -0.187*"donate" + -0.167*"donation" + -0.158*"blog" + -0.155*"post" + 0.144*"clinton"'),
 (3,
  '0.388*"clinton" + 0.367*"fbi" + -0.205*"black" + 0.174*"investigation" + 0.173*"comey" + 0.159*"foundation" + 0.153*"hillary" + -0.144*"trump" + 0.119*"email" + 0.117*"server"'),
 (4,
  '-0.240*"syria" + -0.230*"trump" + 0.209*"mosul" + -0.152*"saudi" + 0.147*"isi" + -0.135*"war" + 0.131*"iraqi" + -0.130*"russia" + -0.119*"clinton" + -0.113*"turkey"'),
 (5,
  '-0.380*"mosul" + -0.333*"isi" + -0.261*"trump" + -0.228*"iraqi" + -0.173*"civil
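The log line "keeping 20 factors (discarding 56.986% of energy spectrum)" can be read as the share of squared singular values that falls outside the retained factors. The sketch below shows that calculation on made-up singular values. It is my reading of the log message rather than a copy of gensim's code, and the function name `discarded_energy` is hypothetical. Note that the share is measured only over the factors the solver actually computed (120 here: 20 topics plus 100 extra samples), not over the full spectrum of the matrix.

```python
def discarded_energy(singular_values, k):
    """Fraction of squared-singular-value mass outside the top k factors."""
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    return 1.0 - sum(energy[:k]) / total


# Made-up singular values for illustration only.
toy_singular_values = [4.0, 2.0, 1.0, 1.0]
share = discarded_energy(toy_singular_values, 2)
print(f"discarding {100 * share:.3f}% of energy spectrum")
```

A high discarded share, as in the TF-IDF run above, suggests that 20 factors capture less than half of the structure the solver found, which is one reason to treat the LSI topics as a coarse summary.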

### LSI using corpus (not tfidf)

In [49]:
# using corpus (not tfidf)

numpy.random.seed(1) # setting random seed to get the same results each time. 

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=20) # initialize an LSI transformation
corpus_lsi = lsi[corpus] # wrap the original bag-of-words corpus directly: bow->fold-in-lsi (no tfidf step here)

2018-10-29 21:59:04,633 : INFO : using serial LSI version on this node
2018-10-29 21:59:04,635 : INFO : updating model with new documents
2018-10-29 21:59:04,636 : INFO : preparing a new chunk of documents
2018-10-29 21:59:04,811 : INFO : using 100 extra samples and 2 power iterations
2018-10-29 21:59:04,812 : INFO : 1st phase: constructing (18777, 120) action matrix
2018-10-29 21:59:04,903 : INFO : orthonormalizing (18777, 120) action matrix
2018-10-29 21:59:05,554 : INFO : 2nd phase: running dense svd on (120, 2000) matrix
2018-10-29 21:59:05,606 : INFO : computing the final decomposition
2018-10-29 21:59:05,607 : INFO : keeping 20 factors (discarding 40.523% of energy spectrum)
2018-10-29 21:59:05,629 : INFO : processed documents up to #2000
2018-10-29 21:59:05,632 : INFO : topic #0(530.517): 0.320*"clinton" + 0.287*"trump" + 0.226*"state" + 0.169*"war" + 0.160*"american" + 0.140*"white" + 0.132*"hillary" + 0.129*"government" + 0.127*"party" + 0.126*"election"
2018-10-29 21:59:05,63

In [50]:
lsi.print_topics(20)

2018-10-29 21:59:05,650 : INFO : topic #0(530.517): 0.320*"clinton" + 0.287*"trump" + 0.226*"state" + 0.169*"war" + 0.160*"american" + 0.140*"white" + 0.132*"hillary" + 0.129*"government" + 0.127*"party" + 0.126*"election"
2018-10-29 21:59:05,652 : INFO : topic #1(279.084): -0.627*"clinton" + 0.219*"war" + -0.218*"trump" + -0.193*"hillary" + 0.181*"syria" + 0.143*"government" + 0.141*"syrian" + -0.131*"email" + 0.103*"world" + 0.100*"force"
2018-10-29 21:59:05,654 : INFO : topic #2(250.244): 0.603*"trump" + -0.337*"clinton" + 0.200*"white" + 0.199*"black" + -0.129*"government" + 0.124*"party" + -0.121*"syria" + 0.120*"obama" + 0.120*"election" + 0.114*"donald"
2018-10-29 21:59:05,656 : INFO : topic #3(216.759): -0.347*"syria" + -0.266*"war" + -0.259*"syrian" + 0.220*"kenya" + -0.219*"trump" + -0.209*"russia" + 0.182*"african" + 0.154*"black" + -0.118*"russian" + 0.103*"africa"
2018-10-29 21:59:05,658 : INFO : topic #4(201.118): -0.480*"kenya" + -0.350*"african" + -0.209*"africa" + -0.1

[(0,
  '0.320*"clinton" + 0.287*"trump" + 0.226*"state" + 0.169*"war" + 0.160*"american" + 0.140*"white" + 0.132*"hillary" + 0.129*"government" + 0.127*"party" + 0.126*"election"'),
 (1,
  '-0.627*"clinton" + 0.219*"war" + -0.218*"trump" + -0.193*"hillary" + 0.181*"syria" + 0.143*"government" + 0.141*"syrian" + -0.131*"email" + 0.103*"world" + 0.100*"force"'),
 (2,
  '0.603*"trump" + -0.337*"clinton" + 0.200*"white" + 0.199*"black" + -0.129*"government" + 0.124*"party" + -0.121*"syria" + 0.120*"obama" + 0.120*"election" + 0.114*"donald"'),
 (3,
  '-0.347*"syria" + -0.266*"war" + -0.259*"syrian" + 0.220*"kenya" + -0.219*"trump" + -0.209*"russia" + 0.182*"african" + 0.154*"black" + -0.118*"russian" + 0.103*"africa"'),
 (4,
  '-0.480*"kenya" + -0.350*"african" + -0.209*"africa" + -0.146*"clinton" + -0.132*"kenyan" + -0.128*"union" + -0.125*"trump" + -0.121*"illicit" + -0.120*"black" + -0.119*"war"'),
 (5,
  '0.568*"black" + 0.401*"white" + -0.232*"trump" + -0.196*"kenya" + 0.182*"clinton"
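One rough way to compare the two LSI runs is to parse the strings returned by `print_topics` and measure how many top terms a pair of topics share. The sketch below does this for topic 0 of the TF-IDF run and topic 0 of the raw-count run, using the strings printed above. The helper names `parse_topic` and `jaccard` are hypothetical, and Jaccard overlap is only one of several possible similarity measures.

```python
import re

# Matches pieces such as -0.315*"email" in a print_topics string.
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a topic string into a list of (term, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic_string)]


def jaccard(terms_a, terms_b):
    """Share of distinct terms that appear in both lists."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)


tfidf_topic0 = ('0.244*"trump" + 0.220*"clinton" + 0.126*"hillary" + 0.115*"war" + '
                '0.107*"obama" + 0.106*"black" + 0.105*"russia" + 0.104*"election" + '
                '0.102*"syria" + 0.098*"state"')
raw_topic0 = ('0.320*"clinton" + 0.287*"trump" + 0.226*"state" + 0.169*"war" + '
              '0.160*"american" + 0.140*"white" + 0.132*"hillary" + 0.129*"government" + '
              '0.127*"party" + 0.126*"election"')

tfidf_terms = [term for term, _ in parse_topic(tfidf_topic0)]
raw_terms = [term for term, _ in parse_topic(raw_topic0)]
overlap = jaccard(tfidf_terms, raw_terms)
print(f"Jaccard overlap of topic 0 top terms: {overlap:.3f}")
```

Because the sign of an LSI factor is arbitrary, comparing term sets (or absolute weights) is usually safer than comparing signed weights across runs.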

# 6. Storytelling

<b>The models are not free of bias: the choice of stopwords, and the manual tuning of the number of passes and the number of topics, all influence the results. However, we found:</b>

- The LDA and LSI models produced very similar topic-term distributions. I believe they would be even closer if the data were cleaned further and tighter parameters were set, especially regarding stopwords. <br><br>

- LDA was actually "more accurate" in its term-topic modeling with a lower number of topics (15 compared to 35), likely as a result of chunking the dataset. Given the size of the sample we pulled, too many topics may have spread the data too thin to form coherent topics.<br><br>

- The most common terms in the top 5 models overlapped heavily, primarily: 'Clinton', 'Trump', 'Russia', 'Election', 'Party', 'American'. Because these terms appear across so many topics, it may be difficult to determine the exact context of each document from this analysis alone. Bi-grams and tri-grams may help tell a better story here. <br><br>

- On the flip side of the above point, the text was not thoroughly cleaned, and some obscure and unrelated documents remained in the corpus. This diversified the topic-term distribution and may have diluted the accuracy of the results.<br><br>

- From a technical perspective, all we have done here is model the corpus using probabilities. This is helpful for surface-level observations of a large corpus.<br><br>

- From a managerial perspective, this may not be extremely helpful. Drilling down with deeper text analysis, such as bi-grams, word frequency, and sentiment analysis within each topic, may provide better insight for potential action in the future.<br><br>

- Identifying fake news is crucial and time-consuming. Any method used to screen news should be applied with a high level of skepticism and thorough examination. This includes searching author names, publication names and ownership, and checking the validity of claims, quotes, and sources provided throughout articles. Creating a visual map or breadcrumb trail of these things may be difficult, but it would be well worth it if there were a way to scrape for those characteristics and assess them in a separate program.
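
As a starting point for the bi-gram idea raised above, the sketch below counts adjacent word pairs across tokenized documents using only the standard library. The toy documents and the `bigram_counts` helper are hypothetical and are not drawn from the fake news corpus. gensim's `Phrases` model is one library option for doing this at scale, but I have not run it on this data.

```python
from collections import Counter


def bigram_counts(tokenized_docs):
    """Count adjacent token pairs within each document."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(zip(tokens, tokens[1:]))  # pairs never cross document boundaries
    return counts


# Toy tokenized documents (hypothetical), standing in for cleaned article text.
toy_docs = [
    ["hillary", "clinton", "email", "server"],
    ["clinton", "foundation", "donation"],
    ["fbi", "investigation", "clinton", "email"],
    ["clinton", "foundation", "email"],
]

counts = bigram_counts(toy_docs)
for pair, n in counts.most_common(3):
    print(" ".join(pair), n)
```

Pairs such as "clinton email" and "clinton foundation" separate contexts that the single term 'Clinton' lumps together, which is the kind of distinction the unigram topics above cannot make.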

