# LDA TOPIC MODELING


## This notebook applies LDA modeling to a an dataset of 480 news articles shared to Reddit and Twitter, along with their corresponding comment threads

## The utility of topic modeling methods is their capability to uncover unobserved variables—topics—which shape the meaning of textual documents. 

### In the following case, we are interested in understanding what characteristics of news articles elicit user engagement on social media
using the GENSIM, NLTK, spaCy, and SKLearn libraries:

https://radimrehurek.com/gensim/

https://www.nltk.org

https://scikit-learn.org/stable/

https://spacy.io


and folowing a really useful tutorial from the following journal article: 

https://link_to_article_here.com


# Steps

## 1. Preparing the text for preprocessing
    1a. Spell check
    1b. Expand contractions

## 2. Text preprocessing

     2a. Partition data into testing and training subsets
 
     2b. Tokenization
     
     2c. Stop Word Removal
     
     2d. Lemmatization
     
     2e. Bigrams and Trigrams
     
     2f. Exclude terms in > 99% and < 1% of documents
     
     2g. Generate Corpus and Dictionary
 

## 2. Selecting the number of topics (k)
 
     2a. Computing Model Perplexity


## 3. Model Results

     3a. pyLDAvis visualization to assist with Topic Labeling
     
     3b. Topic Mixtures (Document-Term Matrix)

 
# Helpful Links:

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

     
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158


# 1. Before you read in your data, you should do the following:

## 1a. Run your textual data through a spell checker
     Altough there are automated spell checkers, they aren't as accurate as we'd like. Thus, we 
     encourage you to have a human run it through a spellchecker. Although this is time intensive, we 
     believe this is necessary.
## 1b. Expand all english contractions (e.g., "don't" -> "do not")
    Similar to the spellchecker, we needed human coders to do this, to ensure accuracy


# Required Libraries

In [234]:
import nltk
import numpy as np
import pandas as pd
from pandas import read_excel

from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer 

from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib
from sklearn.manifold import MDS

import spacy
from spacy.lang.en import English

# Compute bigrams.
from gensim.models import Phrases
from gensim.utils import simple_preprocess

from IPython.display import display 
import vpython as vs
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib.colors as mcolors
from matplotlib.patches import Rectangle
from matplotlib.ticker import FuncFormatter

import mpld3
from mpld3 import plugins, utils
import plotly
import plotly.graph_objs as go
import json
import pyLDAvis
import pyLDAvis.gensim

from sklearn.cluster import KMeans
from sklearn.utils import shuffle

import gensim
from gensim.models.wrappers import LdaMallet
from gensim.test.utils import common_corpus, common_dictionary
from gensim import corpora, models, similarities
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from PIL import *
import pickle

# ----------------------------------------------Misc---------------------------------------
import re
import csv
import os 
import codecs
import collections as cs
import logging
import random

pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)

# Read in the following:
### 1. dataset
### 2. Stopword location
### 3. Gensim location
### 4. Mallet path

In [235]:
#Paths
file_location = '/Users/me/Desktop/LDA_Main/project2/data_articles/news_dataset.xlsx'
stopwords_location = '/Users/me/Desktop/LDA_Main/project2/data_articles/stopwords.txt'
log_location = '/Users/me/Desktop/LDA_Main/project2/data_articles/gensim.log'
os.environ['MALLET_HOME'] = '/Users/me/Desktop/LDA_Main/mallet-2.0.8'
mallet_path = '/Users/me/Desktop/LDA_Main/mallet-2.0.8/bin/mallet'

## Did the dataset load with the correct number of columns and rows?

In [236]:
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

480 Rows.  15 Columns.


## Make sure the head of the dataset looks correct

In [237]:
indices = [0,333,777,932]

samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Sample Tickets:")
display(samples)

Sample Tickets:




Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike



Unnamed: 0,uid,title,first_com,all_com,text,Year,Month,week,Post_link,Article_link,Article_domain,date,engagement,Platform,Gender
0,1.0,"Daisy Coleman, teen at center of Maryville sex...",Those poor girls. Both of them.,"Gentry County officials last week, meanwhile, ...","Daisy Coleman, the northwest Missouri teenager...",2014.0,1.0,1.0,://www.reddit.com/r/news/comments/1upxir/dai...,://www.kansascity.com/news/special-reports/m...,Kansas City Star,41646.7,81.0,Reddit,
1,334.0,Former Waldron Schools Secretary Pleads Guilty...,Former Waldron Schools Secretary Pleads Guilty...,,A former Waldron School District secretary was...,,,,://twitter.com/DavidLo55938576/status/685248...,://www.5new line.com/article/news/local/o...,,2016-01-07 23:56:11+00:00,0.0,Twitter,c
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,


# 2. DATA PREPROCESSING

# 2a. Split data into test and training sets

In [238]:
# Randomimze the order of the rows in the dataframe
data = shuffle(data)

In [239]:
#Partition data into a testing and training sets

from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data, test_size=0.50, random_state=42)
print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  15 Columns.
240 Rows.  15 Columns.
480 Rows.  15 Columns.


In [240]:
#Add an extra column to the dataframe to identify the testing vs the training rows vs the full dataset

data_train['Data_Type'] = "traning_data"
data_test['Data_Type'] = "testing_data"
data['Data_Type'] = "full_data"

print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  16 Columns.
240 Rows.  16 Columns.
480 Rows.  16 Columns.




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



# 2b. Tokenization
### Tokenization involves coverting the text to lowercase, removing special characters, null valies, and punctuation from the text

## Here we can see how many null values there are in each column

In [241]:
# null values in each column of the testing set
data_train.isnull().sum()

uid                 0
title               0
first_com           0
all_com           108
text                0
Year              115
Month             115
week              115
Post_link           0
Article_link        0
Article_domain    117
date                0
engagement          1
Platform            0
Gender            125
Data_Type           0
dtype: int64

In [242]:
# null values in each column of the training set
data_test.isnull().sum()

uid                 0
title               1
first_com           0
all_com           114
text                0
Year              125
Month             125
week              125
Post_link           0
Article_link        0
Article_domain    126
date                1
engagement          2
Platform            0
Gender            115
Data_Type           0
dtype: int64

## Now we need to remove null values from the data

In [243]:
#----------------------------------------Removing Null values--------------------------------------------------

#finding null values in the training subset 
print("=============Training Data=============")
data_train['text'] = data_train['text']

print('Number of rows in text:', len(data_train['text']))

print("-------------------")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#finding null values in the testing subset 
print("\n=============Testing Data==============\n")
data_test['text'] = data_test['text']

print('Number of rows in text:', len(data_test['text']))

print("-------------------")

print("Null Values in text: {}".format(data_test['text'].isnull().sum()))


print("\n\n====================Removing Null Values==========================\n\n")
#Removing null values from the training subset 

text_train = data_train['text']

print("After removing Null Values in Train Data")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#Removing null values from the testing subset 

text_test = data_test['text']

print("\n\nAfter removing Null Values in Test Data")
print("Null Values in text: {}".format(data_test['text'].isnull().sum()))

Number of rows in text: 240
-------------------
Null Values in text: 0


Number of rows in text: 240
-------------------
Null Values in text: 0




After removing Null Values in Train Data
Null Values in text: 0


After removing Null Values in Test Data
Null Values in text: 0




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



## Convert the text to lowercase

In [244]:
#----------------------------------------Converting to Lower case--------------------------------------------------

##Train Data
text_train = text_train.str.lower()

##Test Data
text_test = text_test.str.lower()

print("=======Train Data==============\n")
print(text_train.head(2))

print("\n=======Test Data==============\n")
print(text_test.head(2))


56     a teacher's aide and cheerleader coach who was...
106    a high school track coach is in jail for alleg...
Name: text, dtype: object


213    el paso police arrested a former texas priest ...
118    it has been confirmed that a reported sexual a...
Name: text, dtype: object


## Remove the following:
- special characters
- alphanumerics
- numbers
- words that appear in the corpus less than twice
- extra spaces

In [245]:
##Remove from training set

text_train_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_train]
text_train_regex = [re.sub(r'\'', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\d', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'_', ' ',  sent) for sent in text_train_regex]

##Remove from testing set

text_test_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_test]
text_test_regex = [re.sub(r'\'', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\d', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'_', ' ',  sent) for sent in text_test_regex]


print("=======Train Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])

print("\n=======Test Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])



[INFO] article_text....................

[' paso police arrested former texas priest who during paso catholic diocese investigation last year allegedly admitted sexually abusing child\n\nmiguel luna  was arrested monday afternoon front his home the  block keltner avenue northeast paso police said was prosecuter with aggravated sexual assault child according jail records\n\nluna being held  bond\n\npolice said the arrest came after investigation the crimes against children unit into allegations woman who said she was repeatedly sexually abused luna when she was child not clear the woman the same person who was part the investigation the diocese\n\npolice said that the alleged sexual abuse spanned several years the ending only after the girl moved away\n\nofficials with the diocese could not immediately reached for comment monday paso catholic diocese bishop mark seitz last year had news conference regarding church investigation that found alleged sexual misconduct luna involving girl\

## Remove All Punctuation

In [246]:
def tokenize(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes all punctuation

In [247]:
##Training set
text_train_tokens = list(tokenize(text_train_regex))


print("\n[INFO] article text....................\n")
print(text_train_tokens[:2])

##Testing set
text_test_tokens = list(tokenize(text_test_regex))

print("\n[INFO] article text....................\n")
print(text_test_tokens[:2])


[INFO] article text....................

[['teachers', 'aide', 'and', 'cheerleader', 'coach', 'who', 'was', 'accused', 'having', 'sexual', 'encounters', 'with', 'yearold', 'student', 'has', 'been', 'prosecuter', 'with', 'rape', 'christina', 'jewellbellucio', 'greece', 'was', 'prosecuter', 'tuesday', 'with', 'thirddegree', 'rape', 'and', 'thirddegree', 'criminal', 'sex', 'act', 'both', 'felonies', 'for', 'allegedly', 'having', 'sexual', 'contact', 'with', 'the', 'student', 'her', 'home', 'and', 'another', 'location', 'june', 'laurel', 'heiden', 'spokeswoman', 'for', 'the', 'greece', 'central', 'school', 'district', 'said', 'the', 'district', 'was', 'informed', 'the', 'alleged', 'misconduct', 'monday', 'and', 'launched', 'internal', 'investigation', 'the', 'district', 'also', 'notified', 'greece', 'police', 'heiden', 'said', 'jewellbellucio', 'was', 'hired', 'cheerleader', 'coach', 'greece', 'olympia', 'high', 'school', 'august', 'heiden', 'said', 'she', 'became', 'teachers', 'aide', 'a

# 2c. Lemmatization
## Reduces words to their lemma (e.g., assaulted -> assault)

In [267]:
def lemmatization(texts, allowed_postags=['VERB', 'NOUN']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

In [268]:
nlp = spacy.load('en', disable=['parser', 'ner'])
# Lemmatize Training Set, keeping only noun, adj, vb, adv

text_train_lemma = lemmatization(text_train_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_train_lemma)))
print(text_train_lemma[:4])

# Lemmatize Testing Set, keeping only noun, adj, vb, adv

text_test_lemma = lemmatization(text_test_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_test_lemma)))
print(text_test_lemma[:4])

240
[['teacher', 'aide', 'and', 'cheerleader', 'coach', 'who', 'be', 'accuse', 'have', 'sexual', 'encounter', 'with', 'yearold', 'student', 'have', 'be', 'prosecuter', 'with', 'rape', 'christina', 'jewellbellucio', 'greece', 'be', 'prosecuter', 'tuesday', 'with', 'thirddegree', 'rape', 'and', 'thirddegree', 'criminal', 'sex', 'act', 'both', 'felony', 'for', 'allegedly', 'have', 'sexual', 'contact', 'with', 'the', 'student', '-PRON-', 'home', 'and', 'another', 'location', 'june', 'laurel', 'heiden', 'spokeswoman', 'for', 'the', 'greece', 'central', 'school', 'district', 'say', 'the', 'district', 'be', 'inform', 'the', 'allege', 'misconduct', 'monday', 'and', 'launch', 'internal', 'investigation', 'the', 'district', 'also', 'notify', 'greece', 'police', 'heiden', 'say', 'jewellbellucio', 'be', 'hire', 'cheerleader', 'coach', 'greece', 'olympia', 'high', 'school', 'august', 'heiden', 'say', '-PRON-', 'become', 'teacher', 'aide', 'and', 'then', 'senior', 'teacher', 'aide', 'the', 'school',

# 2d. Remove Stopwords
## NOTE: Edit the stopwords txt file to add additional words to filter out

In [269]:
#Prepare to remove stopwords
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))
newStopWords =[str(x.strip()) for x in open(stopwords_location,'r').read().split('\n')]
stopwords.update(newStopWords)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts]
print(len(stopwords))

4416


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannahstevens/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [270]:
##Remove stopwords from training set

text_train_stopwords = remove_stopwords(text_train_lemma)


print("\n[INFO] article text....................\n")
print(text_train_stopwords[:2])

##Remove stopwords from testing set

text_test_stopwords = remove_stopwords(text_test_lemma)


print("\n[INFO] article text....................\n")
print(text_test_stopwords[:2])


[INFO] article text....................

[['teacher', 'aide', 'cheerleader', 'coach', 'accuse', 'sexual', 'encounter', 'yearold', 'student', 'prosecuter', 'rape', 'christina', 'jewellbellucio', 'greece', 'prosecuter', 'tuesday', 'thirddegree', 'rape', 'thirddegree', 'criminal', 'sex', 'act', 'felony', 'allegedly', 'sexual', 'contact', 'student', 'home', 'another', 'location', 'laurel', 'heiden', 'spokeswoman', 'greece', 'central', 'school', 'district', 'say', 'district', 'inform', 'allege', 'misconduct', 'monday', 'launch', 'internal', 'investigation', 'district', 'also', 'notify', 'greece', 'police', 'heiden', 'say', 'jewellbellucio', 'hire', 'cheerleader', 'coach', 'greece', 'olympia', 'high', 'school', 'august', 'heiden', 'say', 'become', 'teacher', 'aide', 'senior', 'teacher', 'aide', 'school', 'accord', 'school', 'document', 'jewellbellucio', 'name', 'class', 'coach', 'year', 'winter', 'section', 'cheerlead', 'sectional', 'resign', 'position', 'school', 'tuesday', 'heiden', 'say'

# 2e. Bigrams and Trigrams
##    Bigrams are two words that frequently co-occur together
##    Trigrams are three words that frequently co-occur together

In [271]:
##Training set
            
text_train_bigram = Phrases(text_train_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_train_trigram = Phrases(text_train_bigram[text_train_stopwords], threshold=1)

text_train_bigram_mod = gensim.models.phrases.Phraser(text_train_bigram)
text_train_trigram_mod = gensim.models.phrases.Phraser(text_train_trigram)

for idx in range(len(text_train_stopwords)):
    for token in text_train_trigram_mod[text_train_bigram_mod[text_train_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_train_stopwords[idx].append(token)
            
##Testing set

text_test_bigram = Phrases(text_test_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_test_trigram = Phrases(text_test_bigram[text_test_stopwords], threshold=1)

text_test_bigram_mod = gensim.models.phrases.Phraser(text_test_bigram)
text_test_trigram_mod = gensim.models.phrases.Phraser(text_test_trigram)

for idx in range(len(text_test_stopwords)):
    for token in text_test_trigram_mod[text_test_bigram_mod[text_test_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_test_stopwords[idx].append(token)

## Print the head of the data with Bigrams and Trigrams
### e.g., under the testing dataset, you can see the words 'report sexual' and 'good guy' are bigrams

In [272]:
print("\n[INFO] Text Article training data....................\n")
print(text_train_stopwords[:2])

print("\n[INFO] Text Article testing data....................\n")
print(text_test_stopwords[:2])


[INFO] Text Article training data....................

[['teacher', 'aide', 'cheerleader', 'coach', 'accuse', 'sexual', 'encounter', 'yearold', 'student', 'prosecuter', 'rape', 'christina', 'jewellbellucio', 'greece', 'prosecuter', 'tuesday', 'thirddegree', 'rape', 'thirddegree', 'criminal', 'sex', 'act', 'felony', 'allegedly', 'sexual', 'contact', 'student', 'home', 'another', 'location', 'laurel', 'heiden', 'spokeswoman', 'greece', 'central', 'school', 'district', 'say', 'district', 'inform', 'allege', 'misconduct', 'monday', 'launch', 'internal', 'investigation', 'district', 'also', 'notify', 'greece', 'police', 'heiden', 'say', 'jewellbellucio', 'hire', 'cheerleader', 'coach', 'greece', 'olympia', 'high', 'school', 'august', 'heiden', 'say', 'become', 'teacher', 'aide', 'senior', 'teacher', 'aide', 'school', 'accord', 'school', 'document', 'jewellbellucio', 'name', 'class', 'coach', 'year', 'winter', 'section', 'cheerlead', 'sectional', 'resign', 'position', 'school', 'tuesday', '

# 2f. Remove words that occur in less than & greater than of documents
### The corpus is our collection of documents (i.e., our textual questionnaire responses)
### The dictionary takes each unique word in the corpus and assigns them an index

In [273]:
dictionary_text_train = corpora.Dictionary(text_train_stopwords)
dictionary_text_train.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_train = [dictionary_text_train.doc2bow(text) for text in text_train_stopwords]
pickle.dump(corpus_text_train, open('/Users/mes/Desktop/LDA_Main/project2/data_articles/corpus_text_train.pkl', 'wb'))
dictionary_text_train.save('/Users/me/Desktop/LDA_Main/project2/data_articles/dictionary_text_train.gensim')

dictionary_text_test = corpora.Dictionary(text_test_stopwords)
dictionary_text_test.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_test = [dictionary_text_test.doc2bow(text) for text in text_test_stopwords]
pickle.dump(corpus_text_test, open('/Users/me/Desktop/LDA_Main/project2/data_articles/corpus_text_test.pkl', 'wb'))
dictionary_text_test.save('/Users/me/Desktop/LDA_Main/project2/data_articles/dictionary_text_test.gensim')

# 2g. Print Dictionary and Corpus for each subset 

In [274]:
## Training data dictionary & corpus
text_train_stopwords
for k, v in dictionary_text_train.token2id.items():
    print(k, v)

accord 0
accuse 1
accuse sexual 2
act 3
aide 4
allege 5
allegedly 6
also 7
another 8
arraign 9
august 10
back 11
bail 12
become 13
central 14
cheerlead 15
cheerleader 16
christina 17
class 18
coach 19
contact 20
court 21
criminal 22
district 23
document 24
due 25
encounter 26
felony 27
greece 28
heiden 29
high 30
high school 31
hire 32
home 33
inform 34
internal 35
internal investigation 36
investigation 37
jewellbellucio 38
launch 39
laurel 40
location 41
march 42
misconduct 43
monday 44
name 45
notify 46
olympia 47
police 48
position 49
prosecuter 50
prosecuter rape 51
rape 52
release 53
resign 54
say 55
school 56
school district 57
section 58
sectional 59
senior 60
sex 61
sex act 62
sexual 63
sexual contact 64
spokeswoman 65
student 66
teacher 67
thirddegree 68
thirddegree criminal 69
town 70
tuesday 71
violent 72
winter 73
year 74
yearold 75
yearold student 76
according 77
allegation 78
arrest 79
arrest tuesday 80
assistant 81
available 82
clara 83
comment 84
count 85
county 86
cou

quickly 862
quit 863
reach 864
register 865
register sex_offender 866
repeat 867
reputation 868
robin 869
satisfied 870
say feel 871
secretary 872
see 873
sentence 874
sentencing 875
separate 876
september 877
serve 878
several time 879
sexual battery 880
shed 881
shortly 882
slap 883
something 884
southern 885
stop 886
sue 887
suspension 888
tease 889
tell investigator 890
though 891
tional 892
today 893
touch 894
touchy 895
try get 896
two separate 897
two year 898
unable 899
verdict 900
vindicate 901
violate 902
would try 903
wrap 904
wrong 905
year ago 906
academy 907
achievement 908
achievement prep 909
alcohol 910
andrea 911
anticipate 912
arrest make 913
attorney say 914
automate 915
behind 916
bewilder 917
breast 918
bring 919
brown 920
camera 921
campus 922
catch 923
cause 924
charter 925
cially 926
communicate 927
cooperate 928
day jail 929
daygarbees 930
disappear 931
drug 932
elementary 933
estimate 934
exit 935
feel safe 936
female 937
final 938
finger 939
fully 940
garbee

host 1612
important 1613
inc 1614
inc citate 1615
increasingly 1616
indicate 1617
influence 1618
initially 1619
insider 1620
instance 1621
interaction 1622
intoxicate 1623
intoxicated 1624
investigative 1625
involuntary 1626
jail vate 1627
john doe 1628
landscape 1629
later say 1630
lawmaker 1631
least 1632
legal 1633
legally 1634
legitimately 1635
liberal 1636
los 1637
los angeles 1638
majority 1639
make public 1640
make statement 1641
marxist 1642
massive 1643
material 1644
measure 1645
memoir 1646
mensright 1647
michele 1648
muchneede 1649
nature 1650
navigate 1651
obama 1652
obtain 1653
occidental 1654
occidental college 1655
one year 1656
pace 1657
panic 1658
penalty 1659
performance 1660
perhaps 1661
petition 1662
poet 1663
policie 1664
politically 1665
potential 1666
predominately 1667
prevalence 1668
problem 1669
procedure 1670
professor 1671
properly 1672
propose 1673
protect 1674
punk 1675
question 1676
raise 1677
rea 1678
reexamine 1679
refer 1680
rely 1681
represent 1682
re

inexcusably 2612
inquiry 2613
jacobs 2614
jumble 2615
jump 2616
kick 2617
launch investigation 2618
laura 2619
letter send 2620
olympic 2621
olympics 2622
one day 2623
professional 2624
protract 2625
provost 2626
raymond 2627
razorback 2628
reason 2629
reasonable 2630
reflect 2631
report assault 2632
respond request 2633
right thing 2634
rigorous 2635
severe 2636
shameful 2637
survjustice 2638
suspended 2639
transmit 2640
two day 2641
united states 2642
uphold 2643
whole 2644
wonder 2645
write letter 2646
abc 2647
alcoholic 2648
alpha 2649
alpha delta 2650
bedroom 2651
bloodstain 2652
chapel 2653
chief 2654
chocolate 2655
collaborate 2656
container 2657
delta 2658
duke 2659
durham 2660
durham police 2661
frat 2662
fraternity house 2663
keith 2664
kit 2665
knock 2666
lawrence 2667
lease 2668
lopez 2669
mean 2670
member fraternity 2671
phi 2672
pour 2673
pursuant 2674
recruit 2675
rem 2676
semester 2677
swere 2678
thing remember 2679
throw 2680
tuesday afternoon 2681
undergrad 2682
under

birthday 3418
booker 3419
boss 3420
circumstance 3421
comments 3422
community service 3423
complie 3424
constitution 3425
correctly 3426
craig 3427
critique 3428
defense attorney 3429
defer 3430
democrat 3431
design 3432
disappoint 3433
district attorneys 3434
everything 3435
face maximum 3436
fiancee 3437
fight 3438
file motion 3439
fiveyear 3440
girl cry 3441
girl mother 3442
girl testify 3443
guilt 3444
howard 3445
howard say 3446
identify victim 3447
jeanine 3448
kiss 3449
lawn 3450
livid 3451
marry 3452
maximum 3453
mess 3454
mop 3455
mow 3456
performing 3457
polygraph 3458
pornography 3459
rape girl 3460
rape victim 3461
rape yearold 3462
rapist 3463
reconsider 3464
recuse 3465
refrain 3466
rehabilitate 3467
removal 3468
requirement 3469
restriction 3470
rethink 3471
say anything 3472
say everything 3473
say think 3474
say thursday 3475
scottie 3476
sentence year 3477
sex crime 3478
sex offender 3479
sir 3480
society 3481
spend time 3482
spending 3483
spur 3484
suppose 3485
term 

yelling 4361
chance 4362
goda 4363
gwen 4364
largo 4365
marlboro 4366
ministry 4367
rell 4368
upper 4369
accidental 4370
accuse rape 4371
assure 4372
autobiography 4373
bungalow 4374
comedian 4375
cramp 4376
decadeold 4377
dickin 4378
dickinson 4379
dignify 4380
discredit 4381
exclusive 4382
glass 4383
groggily 4384
janice 4385
joan 4386
lake 4387
leg 4388
lifeguard 4389
menstrual 4390
nov 4391
npr 4392
oncamera 4393
outpouring 4394
pajama 4395
patchwork 4396
reconnecte 4397
rehab 4398
resurface 4399
robe 4400
seman 4401
several week 4402
supermodel 4403
tahoe 4404
tarshis 4405
thinking 4406
tonight 4407
wine 4408
woke 4409
yikes 4410
asbury 4411
basharas 4412
deon 4413
domestic violence 4414
evans 4415
interrupt 4416
lab 4417
michigan 4418
michigan state 4419
murderforhire 4420
nunlee 4421
tuddle 4422
upstairs 4423
vonda 4424
wayne 4425
ally 4426
bbc 4427
behave 4428
clip 4429
collectively 4430
culpability 4431
deafness 4432
decisionmake 4433
diversity 4434
empower 4435
equitable 4436

freedom 5361
heavy 5362
hitter 5363
horrible 5364
impressionable 5365
katy 5366
keha 5367
kelis 5368
kelly 5369
kesha file 5370
legit 5371
mafia 5372
manipulation 5373
mos 5374
nake 5375
paint 5376
perry 5377
pitbull 5378
plane 5379
pop star 5380
regain 5381
rida 5382
rowland 5383
snort 5384
sore 5385
talented 5386
tik 5387
tok 5388
wholehearte 5389
adequately 5390
aegean 5391
airconditione 5392
ancient 5393
antalya 5394
aquifer 5395
astronomer 5396
australian 5397
availability 5398
avert 5399
barrier 5400
bland 5401
brine 5402
cafe 5403
careful 5404
catalog 5405
cautiously 5406
chaperone 5407
chemist 5408
chemistry 5409
clancy 5410
clancys 5411
cloak 5412
coauthor 5413
conceptualize 5414
covet 5415
culturally 5416
dangerous 5417
denizli 5418
deposit 5419
designate 5420
diary 5421
diet 5422
discipline 5423
dissertation 5424
doll 5425
eat 5426
elsewhere 5427
endlessly 5428
engineering 5429
excursion 5430
fieldbase 5431
fieldwork 5432
fingernail 5433
formidable 5434
fundamental 5435
gift

interesting 6361
internationally 6362
invitation 6363
jailnte 6364
jerk 6365
joy 6366
khawaja 6367
labase 6368
lighting 6369
lightning 6370
locally 6371
lomatski 6372
macho 6373
make uncomfortable 6374
manchester 6375
marea 6376
marketer 6377
marshall 6378
messagebase 6379
misconstrue 6380
misogyny 6381
music festival 6382
network 6383
offend 6384
onstage 6385
orebase 6386
peak 6387
plus 6388
policing 6389
polite 6390
portland 6391
preppe 6392
proactive 6393
promoter 6394
psychedelic 6395
putte 6396
pwr 6397
queerpunk 6398
quota 6399
rate 6400
reid 6401
reignite 6402
rejection 6403
residentrobert 6404
restroom 6405
rufus 6406
saferbraver 6407
sara 6408
seigel 6409
sense 6410
shitty 6411
showcase 6412
similarly 6413
sol 6414
soundcheck 6415
stagecoach 6416
stamper 6417
stefanie 6418
stock 6419
sweden 6420
tenn 6421
thrust 6422
transcend 6423
unofficial 6424
unwelcome 6425
usual 6426
valley music 6427
viral 6428
whittington 6429
womenonly 6430
borough 6431
bronx 6432
hotline 6433
nypds 6

scrolleradscom 7361
slowly 7362
snap 7363
specious 7364
spurious 7365
suicidal 7366
symbol 7367
thengirlfriend 7368
ultrareligious 7369
utterly 7370
bridge 7371
mcdonad 7372
mcdonald 7373
alcvivar 7374
alumnus 7375
ames 7376
coincidence 7377
corrective 7378
deviate 7379
healthandsafety 7380
iowa 7381
iowa state 7382
maria 7383
mccarroll 7384
natasha 7385
notification 7386
notifications 7387
oren 7388
sae 7389
tuttle 7390
weghorst 7391
buenas 7392
congregant 7393
foreign 7394
heba 7395
jorge 7396
knupp 7397
larry 7398
matta 7399
norwalk 7400
nuevas 7401
abandon 7402
altar 7403
ampedup 7404
arivaca 7405
atascosa 7406
bidegain 7407
bidegain say 7408
border 7409
brave 7410
caffeine 7411
cold 7412
coyote 7413
cruz 7414
desert 7415
desk 7416
electrolyte 7417
embark 7418
estrada 7419
estradas 7420
fronteras 7421
funnel 7422
guatemala 7423
impossible 7424
injection 7425
jaime 7426
jurisdiction 7427
mancinasflore 7428
migrant 7429
mountains 7430
nickname 7431
nogales 7432
oaxaca 7433
organized 

longheld 8360
loosely 8361
lorde 8362
louder 8363
lyn 8364
marginalization 8365
melanie say 8366
microcosm 8367
midterm 8368
millertonnet 8369
misclassifye 8370
misleading 8371
mystique 8372
neal 8373
nita 8374
nocontact 8375
notebook 8376
olivia 8377
opening 8378
oppress 8379
optional 8380
ostensible 8381
outlines 8382
overwhelmed 8383
overwhelming 8384
pack 8385
parcel 8386
partnership 8387
pearl 8388
perpetual 8389
persist 8390
pews 8391
phys 8392
plainly 8393
portray 8394
prestigious 8395
professorship 8396
programming 8397
propel 8398
pulliam 8399
pwis 8400
quest 8401
radical 8402
raheem 8403
reclassify 8404
reconcile 8405
reimagine 8406
resistance 8407
respectability 8408
respectable 8409
respectively 8410
retract 8411
retweet 8412
revive 8413
rhythm 8414
rice 8415
root 8416
rudy 8417
rushing 8418
sag 8419
schmidt 8420
scrawled 8421
seductive 8422
sidetrack 8423
silvanus 8424
singlesex 8425
sisterhood 8426
smalltown 8427
soul 8428
spawn 8429
speaker 8430
speech 8431
spelhouse 843

In [275]:
## Testing Data dictionary & corpus
text_test_stopwords
for k, v in dictionary_text_test.token2id.items():
    print(k, v)

abuse 0
accord 1
act 2
acting 3
action 4
admit 5
admit sexual 6
adult 7
afternoon 8
aggravated 9
allegation 10
allege 11
allege sexual 12
allegedly 13
also 14
also work 15
always 16
among 17
antonio 18
anyone 19
anyone information 20
arrest 21
arrest come 22
arrest monday 23
ask 24
assault 25
assumption 26
august 27
avenue 28
away 29
begin 30
bishop 31
bless 32
block 33
board 34
bond 35
call 36
call crime 37
canon 38
case 39
catholic 40
cause 41
chance 42
child 43
child accord 44
christi 45
church 46
church official 47
clear 48
clergy 49
come 50
come forward 51
comment 52
commit 53
conference 54
contact 55
contact police 56
contradictory 57
corpus 58
could 59
crime 60
crime child 61
criminal 62
deeply 63
department 64
diocesan 65
diocese 66
directly 67
disclose 68
ending 69
enforcement 70
feel 71
find 72
formal 73
formally 74
former 75
forward 76
four 77
front 78
girl 79
girl say 80
god 81
harassment 82
hold 83
home 84
home block 85
immediately 86
immoral 87
include 88
inform 89
inform

true 902
tug 903
unaware 904
understand 905
underway 906
unincorporated 907
upsetting 908
vaunted 909
version 910
virginia 911
wait 912
walk 913
war 914
warrant 915
washington 916
welfare 917
whose 918
win 919
within 920
yard 921
yearold 922
yearold man 923
young 924
youth 925
zebrowski 926
account 927
actress 928
aggressive 929
ago 930
appearance 931
art 932
article 933
backlash 934
barbara 935
battle 936
belt 937
bill 938
blank 939
bowman 940
business 941
camille 942
cancellation 943
caption 944
case settle 945
classic 946
click 947
collection 948
comedian 949
court 950
date 951
david 952
day 953
deny 954
donation 955
dozen 956
drugged 957
face 958
famous 959
follower 960
forget 961
funny 962
hashtag 963
headline 964
height 965
host 966
ian 967
lawsuit 968
lawsuit claim 969
letterman 970
meme 971
mentor 972
morning 973
multiple 974
multiple time 975
next 976
number 977
online 978
ope 979
opinion 980
pant 981
picture 982
pleasure 983
pointed 984
popularity 985
post 986
prosecuter sexu

lucrative 1902
lyndhurst 1903
make public 1904
malicious 1905
marketing 1906
match 1907
mattress 1908
morningafter 1909
naked 1910
native 1911
new york_city 1912
nice 1913
nomination 1914
onenight 1915
opportunity 1916
otherwise 1917
outreach 1918
overcome 1919
paytoplay 1920
perhaps 1921
physician 1922
pick 1923
plan 1924
politic 1925
political 1926
potentially 1927
pro 1928
promise 1929
proposal 1930
quid 1931
quo 1932
reassure 1933
recall 1934
relationship 1935
remember wake 1936
resemble 1937
respect 1938
return 1939
riverside 1940
rutger 1941
say client 1942
say one 1943
select 1944
sexism 1945
sexual relationship 1946
shit 1947
shot 1948
side 1949
sometime 1950
south 1951
spike 1952
stick 1953
strive 1954
stupid 1955
submit 1956
sure 1957
suv 1958
sylvia 1959
take advantage 1960
taxi 1961
tell new 1962
tell want 1963
tequila 1964
third 1965
time alleged 1966
totally 1967
trade 1968
transmit 1969
two month 1970
two year 1971
vendor 1972
vindictive 1973
volunteer 1974
voter 1975
wa

sentencing 2901
sex offender 2902
slammer 2903
spark 2904
sun 2905
surgery 2906
telegraph 2907
top 2908
touching 2909
transgender 2910
trouser 2911
wakefield 2912
wink 2913
women 2914
womenonly 2915
wood 2916
apprehend 2917
aurora 2918
biasrelate 2919
black man 2920
canvass 2921
cbs 2922
hate 2923
hate crime 2924
heel 2925
hurl 2926
incident involve 2927
mentally 2928
plaza 2929
racial 2930
shopping 2931
slur 2932
viejo 2933
woman leave 2934
young man 2935
aggravate 2936
aggravate sexual 2937
anymore 2938
arrest tuesday 2939
assault deadly 2940
blanket 2941
boy 2942
carlos 2943
clothe 2944
country illegally 2945
deadly 2946
downstairs 2947
easily 2948
emotional 2949
girl walk 2950
harlow 2951
illegally 2952
immigrant 2953
known 2954
lap 2955
mental 2956
mostwante 2957
mysacom 2958
northeast side 2959
sexual conduct 2960
undocumented 2961
way arrest 2962
would come 2963
yearold boy 2964
amy 2965
amys 2966
baking 2967
bouzaglo 2968
customer 2969
duo 2970
dynamic 2971
edit 2972
exclusive 

phone interview 3901
pray 3902
prominent 3903
puberty 3904
purity 3905
quran 3906
rabbi 3907
reaching 3908
reluctant 3909
replace 3910
reproduction 3911
reputation 3912
resist 3913
retain 3914
retire 3915
revere 3916
rockford 3917
saleem 3918
scarf 3919
scholar 3920
school work 3921
schukar 3922
screen 3923
secretarial 3924
secretive 3925
segment 3926
settlement 3927
sewer 3928
sexual health 3929
sexuality 3930
shamsi 3931
simply 3932
since remove 3933
site 3934
sixth 3935
sjailngboard 3936
socialize 3937
spare 3938
specify 3939
stain 3940
stature 3941
sticky 3942
student accord 3943
substantiate 3944
suddenly 3945
suspicious 3946
taboo 3947
tell authority 3948
tension 3949
terrorism 3950
terrorist 3951
thomas 3952
three woman 3953
typically 3954
ubaidulla 3955
ultimately 3956
ultravictorian 3957
uncomfortable 3958
uncommon 3959
unfair 3960
unheardof 3961
unmarriageable 3962
unthinkable 3963
value 3964
vately 3965
veil 3966
worker 3967
yearold woman 3968
youtube 3969
youtube video 3970

intermittently 4901
italian 4902
javelina 4903
jimmy 4904
jimmys 4905
kimberly 4906
lay 4907
ldquothe 4908
leroy 4909
lunch 4910
marina 4911
marinadelrey 4912
mental health 4913
miserable 4914
mistreat 4915
momma 4916
moody 4917
mortified 4918
musician 4919
navigate 4920
nondisclosure 4921
notice intent 4922
novel 4923
orange 4924
ottavia 4925
payment 4926
peace 4927
peril 4928
portrait 4929
precipitate 4930
producer 4931
prolific 4932
prostitute 4933
pseudonymous 4934
quietly 4935
reunion 4936
revelation 4937
rey 4938
ritzcarlton 4939
rivet 4940
rome 4941
sattro 4942
scrub 4943
selfie 4944
series 4945
sexual encounter 4946
shitty 4947
smoking 4948
societal 4949
speech 4950
spiral 4951
spotlight 4952
stationery 4953
stepfather 4954
subsequent 4955
subsequently 4956
taliban 4957
theoretically 4958
thingsrdquo 4959
torsos 4960
trepidation 4961
two young 4962
twotake 4963
unclothed 4964
unidentified 4965
veteran 4966
victim sexual_assault 4967
weakness 4968
willis 4969
witherspoon 4970
xo

firstdegree murder 5901
fitzgerald 5902
girardeau 5903
heartbreaking 5904
heights 5905
herd 5906
imperial 5907
indication 5908
jailmary 5909
jamie 5910
linkedin 5911
louis 5912
louis county 5913
manhunt 5914
mcculloch 5915
mcculloch say 5916
nonprofit 5917
parishioner 5918
postdispatch 5919
region 5920
religious supply 5921
ridge 5922
rosary 5923
schmidt 5924
sentiment 5925
shop 5926
sing 5927
somber 5928
southeast 5929
store louis 5930
stutte 5931
suburban 5932
suburban louis 5933
twoday 5934
unoppose 5935
wesley 5936
wildwood 5937
yearold jamie 5938
beverly 5939
condemn 5940
fame 5941
nel 5942
roy 5943
roy moore 5944
step forward 5945
without consent 5946
womens 5947
fatally 5948
get away 5949
gunman 5950
married 5951
mondays 5952
person interest 5953
bloodied 5954
chitwood 5955
could hear 5956
daytona 5957
dispatcher 5958
frolander 5959
immediately respond 5960
independently 5961
knot 5962
lip 5963
living 5964
motionless 5965
observe 5966
puddle 5967
raymond 5968
reuter 5969
severel

asinine 6897
attendant 6898
bloomington 6899
boun 6900
coordination 6901
deeper 6902
diligence 6903
duck 6904
excite 6905
hunch 6906
ie 6907
kara 6908
laurel 6909
lean 6910
mclean 6911
mesa 6912
mesas 6913
reading 6914
regional 6915
standpoint 6916
stepping 6917
stumble 6918
ticketing 6919
try get 6920
wesbite 6921
wmbd 6922
broadway 6923
corner 6924
freret 6925
lbs 6926
nopd 6927
tulane 6928
willow 6929
element 6930
genesee 6931
genesee county 6932
hearing schedule 6933
imperative 6934
lape 6935
leyton 6936
linden 6937
pickell 6938
pickell say 6939
postpone 6940
pretrial 6941
punishable 6942
restore 6943
stern 6944
swear 6945
vienna 6946
walmart 6947
zofchak 6948
astrophysicist 6949
behave 6950
besides 6951
cosmos 6952
credo 6953
degrasse 6954
dialogue 6955
geographic 6956
global 6957
hayden 6958
irreversibly 6959
museum 6960
neil 6961
patheoscom 6962
planetarium 6963
religion 6964
renowned 6965
scientist 6966
servant 6967
spirituality 6968
tyson 6969
vehemently 6970
chatham 6971
corr

schooling 7650
seminary 7651
slightly 7652
sound like 7653
spiritually 7654
spring 7655
standing 7656
unconnecte 7657
untrained 7658
unworthy 7659
ward 7660
weaken 7661
welllike 7662
withdrawal 7663
withdrawn 7664
worthiness 7665
worthless 7666
aggravated sexual 7667
brutal 7668
calais 7669
crockett 7670
daniella 7671
dfw 7672
ezmorin 7673
grayson 7674
hampton 7675
informational 7676
nurses 7677
plano 7678
sane 7679
sgt hampton 7680
sherman 7681
surgical 7682
adventure 7683
affleck 7684
batman 7685
beek 7686
believer 7687
bigname 7688
bombshell 7689
boob 7690
busan 7691
carr 7692
collapse 7693
complicit 7694
condemnation 7695
countless 7696
crews 7697
damning 7698
daw 7699
der 7700
enterjailse 7701
excellent 7702
grinned 7703
hef 7704
hilarie 7705
honk 7706
horn 7707
humiliate 7708
invitation 7709
jfk 7710
korea 7711
maker 7712
megadonor 7713
namely 7714
oliver 7715
oscarwinne 7716
predation 7717
premiere 7718
rockn 7719
soundtrack 7720
spy 7721
stevens 7722
ted 7723
thankfully 7724
th

migrate 8533
mindless 8534
poptart 8535
realworld 8536
romantic 8537
shape 8538
theory 8539
twosentence 8540
uscs 8541
vogue 8542
altogether 8543
backstreet 8544
charismatic 8545
console 8546
diagram 8547
dining 8548
discretion 8549
eager 8550
furnish 8551
hangout 8552
info 8553
layout 8554
liquor 8555
melissa 8556
melissaschuman 8557
naturally 8558
nick 8559
reppe 8560
schuman 8561
virgin 8562
jailce 8563
kanawha 8564
magistrate 8565
nitro 8566
preemployment 8567
probationary 8568
propertysurety 8569
videoe 8570
arise 8571
authenticate 8572
buddybuddy 8573
continually 8574
fohounhedo 8575
gut 8576
lax 8577
max 8578
maxime 8579
nineminute 8580
rideshare 8581
shady 8582
suntimes 8583
wifes 8584
yassin 8585
yassin say 8586
amandas 8587
aregot 8588
asst 8589
ate 8590
bank 8591
blackburn 8592
blackburns 8593
davey 8594
eyewitness news 8595
hench 8596
impd 8597
interrogate 8598
jalen 8599
kessler 8600
mothers 8601
woods 8602
alec 8603
blake 8604
chorizo 8605
edward 8606
expletive 8607
expre

## Ensure the data looks correct
### i.e., are words properly lemmatized, tokenized, etc?

In [276]:
# View information about each corpus

print("\n[INFO] Article Dataset....................\n")
print('Train Data', str(len(text_train_stopwords)))
print('Test Data', str(len(text_test_stopwords)))

print("\n\n=================================================\n\n")

print("[INFO] Training Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_train_stopwords[:1])

print("\n[INFO] Testing Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_test_stopwords[:1])


[INFO] Article Dataset....................

Train Data 240
Test Data 240




[INFO] Training Sample Data
--------------------------------------
--------------------------------------

[INFO] Article Text....................

[['teacher', 'aide', 'cheerleader', 'coach', 'accuse', 'sexual', 'encounter', 'yearold', 'student', 'prosecuter', 'rape', 'christina', 'jewellbellucio', 'greece', 'prosecuter', 'tuesday', 'thirddegree', 'rape', 'thirddegree', 'criminal', 'sex', 'act', 'felony', 'allegedly', 'sexual', 'contact', 'student', 'home', 'another', 'location', 'laurel', 'heiden', 'spokeswoman', 'greece', 'central', 'school', 'district', 'say', 'district', 'inform', 'allege', 'misconduct', 'monday', 'launch', 'internal', 'investigation', 'district', 'also', 'notify', 'greece', 'police', 'heiden', 'say', 'jewellbellucio', 'hire', 'cheerleader', 'coach', 'greece', 'olympia', 'high', 'school', 'august', 'heiden', 'say', 'become', 'teacher', 'aide', 'senior', 'teacher', 'aide', 'school', 'acco

In [277]:

print("\n[INFO] Article Text Dictionary and Corpus.....................\n")

print("\nNumber of unique words in the training set's dictionary:")
print(str(len(dictionary_text_train)))

print("\nNumber of unique words in the testing set's dictionary:")
print(str(len(dictionary_text_test)))

print("\nNumber of rows in the testing set:")
print(str(len(corpus_text_test)))

print("\nNumber of rows in the training set:")
print(str(len(corpus_text_train)))


[INFO] Article Text Dictionary and Corpus.....................


Number of unique words in the training set's dictionary:
8455

Number of unique words in the testing set's dictionary:
8879

Number of rows in the testing set:
240

Number of rows in the training set:
240


# 3. Select number of topics (k)

## 3a. Setting Model Hyperparameters 

### 1. Beta (referred to as 'eta' in gensim) = the [distribution of the] number of words per topic
### 2. Alpha =  the [distribution of the] number of topics per document

#### Both alpha and eta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’:
        - ‘auto’ = the model learns the best values for the hyperparameters as it is trained       
                   on more and more data (i.e., it learns an asymmetric prior from the corpus). See 
                   http://jonathan-huang.org/research/dirichlet/dirichlet.pdf for an overview             
        - 'asymmetric' = uses a fixed, normalized asymmetric prior of 1.0 / k (number of topics)
        - 'symmetric' = uses a distribution of 1 / k (number of topics)

In Bayesian statistics, we have to define the distributions (i.e., prior distributions) of unknown variables (e.g., ϕ and θ) before running the data analysis. These should be defined based on theoretical assumptions about how we think the topics are actually distributed amongst our data. In our case, it makes sense to assume that some documents discuss more/less topics than other documents; thus we set the document-topic distribution to be asymmetric. 
#### Thus, we reccommend setting alpha = 'auto' as it sets the distribution to be asymmetric, and learns the best alpha value (i.e., lowest perplexity scores) from the data itself. It also makes sense to assume that some topics contain more words than others. Thus, we reccomend setting the distribution of the number of words per topic to be asymmetric as well. 

### 3. Passes = number of laps the model goes through the entire corpus
        - Incrasing the number of passes reduces model bias
### 4. Chunksize = number of documents to load into memory at a time 
        - smaller chunksizes save memory, but take longer to train
### 5. Update_every = number of chunks to process before maximizing your model 
### 6. Random state = sets the seed to make the model reproducable
### 7. Number of topics (k)
Researchers must tell the model how many (k) prominent goal inference topics to sort each ‘bag of words’ document into. Problematically, several different k-values might work. Thus, we use a metric called perplexity to help us to determine the optimal number of topics. The utility in perplexity comes from comparing perplexity values across models with differing k-values to pinpoint the best model (i.e., the model with the lowest perplexity score). 

Thus, we recommend testing the perplexity of the model with a variety of k values, and then runing the final model using the k-value with the lowest perplexity score.


## 3b. Compute model perplexity scores
We run the model with different topic numbers (k) to find the optimal topics number

We will start by looking at k=1-32 topics

In [285]:
k = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
random_state=42
update_every=2
chunksize=1800
passes=600
iterations=1600
alpha='auto'
eta='auto'
per_word_topics=True

lda_model_text_trains = []
lda_model_text_tests = []

In [286]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] Article Text Training Data LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_train = gensim.models.ldamodel.LdaModel(corpus=corpus_text_train,
                                                  id2word=dictionary_text_train,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

    lda_model_text_trains.append(lda_model_text_train)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_train.log_perplexity(corpus_text_train))


***********************************************************************
[INFO] Article Text Training Data LDA Results....
***********************************************************************

Perplexity (num_topics = 2):  -7.796055046679738

Perplexity (num_topics = 4):  -7.782917860978472

Perplexity (num_topics = 6):  -7.814755115892571

Perplexity (num_topics = 8):  -7.836687645516105

Perplexity (num_topics = 10):  -7.859251299619402

Perplexity (num_topics = 12):  -7.882492604344419

Perplexity (num_topics = 14):  -7.903351307101996

Perplexity (num_topics = 16):  -7.9169580113246125

Perplexity (num_topics = 18):  -7.93318788534051

Perplexity (num_topics = 20):  -7.941399663894025

Perplexity (num_topics = 22):  -7.957818621067821

Perplexity (num_topics = 24):  -7.972209064262733

Perplexity (num_topics = 26):  -7.985005679507375

Perplexity (num_topics = 28):  -7.996059802274738

Perplexity (num_topics = 30):  -8.015043843599607

Perplexity (num_topics = 32):  -8.020541479

### We can see that k=3-9 topics have the most drastic decreases in perplexity scores. 
### Thus, we opt to take a closer look at these 4 models

### Here is the representative 'Bag of Words' for each topic in the Training Dataset
As you can see, this doesn't give us enough information about each topic to assess model fit. Thus, we utilize a interactive visualization

In [298]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Training Data Topic Outputs....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_trains[j].show_topics(num_topics=i, num_words=10, log=True, formatted=True)
    for topic in topics:
        print(topic)


***********************************************************************
[INFO] Article Text Training Data Topic Outputs....
***********************************************************************

[INFO] Num_topics: 1

(2, '0.027*"say" + 0.024*"police" + 0.013*"assault" + 0.009*"victim" + 0.008*"arrest" + 0.008*"sexual" + 0.008*"woman" + 0.007*"yearold" + 0.007*"accord" + 0.006*"man"')

[INFO] Num_topics: 2

(1, '0.031*"say" + 0.009*"woman" + 0.007*"tell" + 0.006*"one" + 0.006*"would" + 0.005*"school" + 0.005*"assault" + 0.005*"patient" + 0.005*"sexual" + 0.005*"police"')
(5, '0.020*"police" + 0.017*"say" + 0.010*"assault" + 0.009*"arrest" + 0.008*"sexual" + 0.007*"victim" + 0.007*"yearold" + 0.007*"accord" + 0.006*"man" + 0.006*"girl"')

[INFO] Num_topics: 3

(6, '0.023*"say" + 0.011*"sexual" + 0.009*"assault" + 0.005*"student" + 0.005*"happen" + 0.005*"butter" + 0.004*"sexual assault" + 0.004*"woman" + 0.004*"get" + 0.004*"man"')
(0, '0.026*"say" + 0.010*"assault" + 0.008*"woman" + 

IndexError: list index out of range

# 3c. Model selection using pyLDAvis Visualization 
## We're interested in the models with k= 3,4,5,6,7 topics, so we visualize those using the pyLDAvis documentation (https://www.aclweb.org/anthology/W14-3110.pdf)

## When selecting the optimal number of topics, we need to find a balance between underfitting and underfitting the model

### OVERFITTING (i.e., too many topics): 
#### Practical takeaway- this can make it harder for human coders to label
#### pros- less overlap amongst topics
#### cons- less coherence amongst the words in each topic; decreased varaiance in each document's distirbution of topics

### UNDERFITTING (i.e., too few topics): 
#### Practical takeaway- doesn't produce enough variance, limiting options for statistical analyses
#### pros- more coherent 'bag of words' comprising each topic; increased varaiance in the distirbution of topics in each document
#### cons- more overlap amongst topics

## Reading pyLDAvis

### LEFT PANE:
- The area of each circle represents the prevalence of each topic over the entire corpus 
- The distance between the center of circles indicate the similarity between topics (i.e., inter-topic differences)

---------------------------------------------------------------------------------------------------------

### RIGHT PANE:
- If you hover over a particular topic on the left, the histogram on the right side lists the top 30 most relevant terms
- The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term
- A slider at the top can adjust the relevence metric (λ); however, for our purposes, be sure it i set to λ = 1. For more information on the relevance metric, see (https://www.aclweb.org/anthology/W14-3110.pdf). 








In [309]:
print("k = 14 topics ...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[6], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 14 topics ...................


In [310]:
print("k = 10 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[4], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 10 topics...................


# k=5 topics looks the best to me! The topics appear to be relatively spread out, with no overlapping topics
# At the same time, the 'bag of words' comprising each topic appears coherent enough to label. 

In [311]:
print("k = 12 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[5], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 12 topics...................


In [312]:
print("k = 8 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[3], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 8 topics...................


In [313]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[2], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 6 topics...................


In [314]:
print("k = 4 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[1], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 4 topics...................


# 4. Model Validation for Test Dataset with 5 topics (Num_Topics = 5)

### Now, we check to make sure the training dataset shows a similar trends, to offer evidence of reproducability. We do this by:

(1) Validating that the the testing set perplexity value for k=5 topics relatively similar to the training set (it is)

(2) check the fit of the pyLDAvis visualization for indicators of similar model fit (it looks good)

## 4a. Get Perplexity Scores of the Testing Dataset

In [315]:
print("\n***********************************************************************")
print("[INFO] Article text Testing Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_test = gensim.models.ldamodel.LdaModel(corpus=corpus_text_test,
                                                  id2word=dictionary_text_test,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                                      
                                                  per_word_topics=per_word_topics)

    lda_model_text_tests.append(lda_model_text_test)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_test.log_perplexity(corpus_text_test))


***********************************************************************
[INFO] Article text Testing Dataset LDA Results....
***********************************************************************

Perplexity (num_topics = 1):  -7.81853972914781

Perplexity (num_topics = 2):  -7.8099506163040235

Perplexity (num_topics = 3):  -7.849578165574469

Perplexity (num_topics = 4):  -7.872137101924929

Perplexity (num_topics = 5):  -7.860978996355535

Perplexity (num_topics = 6):  -7.860813348659627

Perplexity (num_topics = 7):  -7.867544201941664

Perplexity (num_topics = 8):  -7.871373485118535

Perplexity (num_topics = 9):  -7.882356706735592

Perplexity (num_topics = 10):  -7.897328224484313

Perplexity (num_topics = 11):  -7.893920758916011

Perplexity (num_topics = 12):  -7.897566461503732

Perplexity (num_topics = 13):  -7.9050351906700165

Perplexity (num_topics = 14):  -7.90396301725211

Perplexity (num_topics = 15):  -7.917879398442028

Perplexity (num_topics = 16):  -7.920450538385

## 4b. Check the pyLDAvis visualization for indicators of similar model fit (it looks good)

In [316]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Testing Data Topic Output....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_tests[j].show_topics(num_topics=i, num_words=5, log=True, formatted=True)
    for topic in topics:
        print(topic)


***********************************************************************
[INFO] Article Text Testing Data Topic Output....
***********************************************************************

[INFO] Num_topics: 1

(1, '0.022*"say" + 0.012*"police" + 0.011*"assault" + 0.009*"sexual" + 0.005*"woman"')

[INFO] Num_topics: 2

(2, '0.012*"say" + 0.011*"sexual" + 0.009*"assault" + 0.004*"case" + 0.004*"student"')
(1, '0.022*"say" + 0.011*"police" + 0.010*"assault" + 0.008*"sexual" + 0.006*"school"')

[INFO] Num_topics: 3

(2, '0.009*"sexual" + 0.007*"assault" + 0.005*"winston" + 0.004*"case" + 0.004*"say"')
(0, '0.025*"say" + 0.012*"assault" + 0.009*"police" + 0.008*"woman" + 0.008*"sexual"')
(1, '0.020*"say" + 0.010*"police" + 0.010*"assault" + 0.008*"sexual" + 0.006*"woman"')

[INFO] Num_topics: 4

(2, '0.012*"sexual" + 0.009*"assault" + 0.008*"say" + 0.005*"winston" + 0.005*"police"')
(4, '0.021*"say" + 0.009*"woman" + 0.007*"tell" + 0.007*"police" + 0.006*"defendent"')
(0, '0.025*"sa

(2, '0.026*"sall" + 0.013*"affidavit" + 0.010*"coast" + 0.010*"coworker" + 0.010*"affidavit say"')
(34, '0.010*"fbi" + 0.007*"back" + 0.007*"judge" + 0.007*"republicans" + 0.007*"investiage"')
(8, '0.033*"church" + 0.017*"bishop" + 0.017*"allegation" + 0.009*"claim" + 0.009*"time"')
(17, '0.019*"say" + 0.009*"police" + 0.009*"case" + 0.006*"december" + 0.006*"monica"')
(23, '0.024*"say" + 0.023*"police" + 0.014*"allen say" + 0.014*"allen" + 0.012*"township"')
(27, '0.017*"sexual" + 0.012*"woman" + 0.008*"assault" + 0.007*"believe" + 0.007*"bill"')
(33, '0.011*"would" + 0.011*"noo" + 0.009*"sinner" + 0.009*"jesus" + 0.009*"school"')
(28, '0.021*"schneider" + 0.012*"bugai" + 0.011*"victim" + 0.011*"would" + 0.009*"sfpd"')
(14, '0.013*"year" + 0.011*"one" + 0.011*"coffey" + 0.010*"player" + 0.010*"former"')
(36, '0.022*"girl" + 0.016*"child" + 0.012*"sheriffs" + 0.012*"say" + 0.010*"deputy"')
(32, '0.021*"say" + 0.011*"police" + 0.010*"woman" + 0.008*"assault" + 0.007*"booker"')
(9, '0.01

IndexError: list index out of range

In [317]:
print("k = 5 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[3], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

k = 5 topics...................


In [318]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[4], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

k = 6 topics...................


## Now that we've done this, we can run the full model (test & training) with k=5 topics

# 5. Run LDA model on full dataset

## 5a. Repeat preprocessing steps 2a-2g on the full dataset

In [319]:
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

480 Rows.  15 Columns.


In [320]:
indices = [0,333,777,932]

samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Sample Tickets:")
display(samples)

Sample Tickets:




Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike



Unnamed: 0,uid,title,first_com,all_com,text,Year,Month,week,Post_link,Article_link,Article_domain,date,engagement,Platform,Gender
0,1.0,"Daisy Coleman, teen at center of Maryville sex...",Those poor girls. Both of them.,"Gentry County officials last week, meanwhile, ...","Daisy Coleman, the northwest Missouri teenager...",2014.0,1.0,1.0,://www.reddit.com/r/news/comments/1upxir/dai...,://www.kansascity.com/news/special-reports/m...,Kansas City Star,41646.7,81.0,Reddit,
1,334.0,Former Waldron Schools Secretary Pleads Guilty...,Former Waldron Schools Secretary Pleads Guilty...,,A former Waldron School District secretary was...,,,,://twitter.com/DavidLo55938576/status/685248...,://www.5new line.com/article/news/local/o...,,2016-01-07 23:56:11+00:00,0.0,Twitter,c
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,


In [321]:
# number of null values in each column of the full dataset
data.isnull().sum()

uid                 0
title               1
first_com           0
all_com           222
text                0
Year              240
Month             240
week              240
Post_link           0
Article_link        0
Article_domain    243
date                1
engagement          3
Platform            0
Gender            240
dtype: int64

In [322]:
#finding null values in the full dataset
print("=============Full Dataset=============")
data['text'] = data['text']

print('Number of rows in text:', len(data['text']))

print("-------------------")
print("Null Values in text: {}".format(data['text'].isnull().sum()))

#Removing null values from the full dataset

text = data['text']

print("After removing Null Values in Full Dataset")
print("Null Values in text: {}".format(data['text'].isnull().sum()))



Number of rows in text: 480
-------------------
Null Values in text: 0
After removing Null Values in Full Dataset
Null Values in text: 0


In [323]:
#----------------------------------------Convert everything to Lower case--------------------------------------------------

##Train Data
text = text.str.lower()

print("=======Full Dataset==============\n")
print(text.head(1))



0    daisy coleman, the northwest missouri teenager...
Name: text, dtype: object


In [324]:
##Remove special characters from full dataset

text_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text]
text_regex = [re.sub(r'\'', '', sent) for sent in text_regex]
text_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_regex]
text_regex = [re.sub(r'\d', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'_', ' ',  sent) for sent in text_regex]

print("=======Full Dataset==============\n")
print("\n[INFO] text....................\n")
print(text_regex[:2])




[INFO] text....................

['daisy coleman the northwest missouri teenager whose alleged  sexual assault maryville has garnered national attention was recovering tuesday from suicide attempt family friend familiar with the situation told the star\nrobin bourland longtime acquaintance the coleman family said coleman was being treated kansas city childrens psychiatric hospital after ingesting unidentified pills sunday evening colemans mother melinda coleman first released news the suicide attempt facebook post monday\nits been heartbreaking for them bourland said the colemans who now live albany this has been really long drawnout battle and its heartbreaking see something like this happening\nfederal judge kansas accused sexual\nharassment submits his resignation\naccording bourland the incident stemmed from online harassment that daisy coleman  received after attending party over the weekend disparaging facebook post generated   additional harsh attacks said bourland and just es

In [325]:
## Full Data set
text_tokens = list(tokenize(text_regex))


print("\n[INFO] text....................\n")
print(text_tokens[:2])


[INFO] text....................

[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'alleged', 'sexual', 'assault', 'maryville', 'has', 'garnered', 'national', 'attention', 'was', 'recovering', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'told', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'said', 'coleman', 'was', 'being', 'treated', 'kansas', 'city', 'childrens', 'psychiatric', 'hospital', 'after', 'ingesting', 'unidentified', 'pills', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'released', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', 'its', 'been', 'heartbreaking', 'for', 'them', 'bourland', 'said', 'the', 'colemans', 'who', 'now', 'live', 'albany', 'this', 'has', 'been', 'really', 'long', 'drawnout', 'battle', 'and', 'its', 'heartbreaking', 'see', 'something', 'like', 'this', 'happening', 'federal', 'jud

In [326]:
# Lemmatize full dataset, keeping only noun, adj, vb, adv

text_lemma = lemmatization(text_tokens, allowed_postags=['ADJ', 'VERB', 'ADV', 'SCONJ', 'PRON', 'PART', 'INTJ', 'AUX', 'ADV', 'ADP', 'ADJ'])
print(str(len(text_lemma)))
print(text_lemma[:4])

480
[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'have', 'garner', 'national', 'attention', 'be', 'recover', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'tell', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'say', 'coleman', 'be', 'be', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'after', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'release', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', '-PRON-', 'be', 'heartbreake', 'for', '-PRON-', 'bourland', 'say', 'the', 'coleman', 'who', 'now', 'live', 'albany', 'this', 'have', 'be', 'really', 'long', 'drawnout', 'battle', 'and', '-PRON-', 'heartbreaking', 'see', 'something', 'like', 'this', 'happen', 'federal', 'judge', 'kansas', 'accuse', 'sexual', 'harassment', 's

In [327]:
#Trigrams and Bigrams in full dataset
            
text_bigram = Phrases(text_lemma, min_count=3, delimiter=b' ', threshold=1)
text_trigram = Phrases(text_bigram[text_lemma], threshold=1)

text_bigram_mod = gensim.models.phrases.Phraser(text_bigram)
text_trigram_mod = gensim.models.phrases.Phraser(text_trigram)

for idx in range(len(text_lemma)):
    for token in text_trigram_mod[text_bigram_mod[text_lemma[idx]]]:
        #print(token)
        if ' ' in token:
            text_lemma[idx].append(token)
print("\n[INFO] text....................\n")
print(text_lemma[:2])


[INFO] text....................

[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'have', 'garner', 'national', 'attention', 'be', 'recover', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'tell', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'say', 'coleman', 'be', 'be', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'after', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'release', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', '-PRON-', 'be', 'heartbreake', 'for', '-PRON-', 'bourland', 'say', 'the', 'coleman', 'who', 'now', 'live', 'albany', 'this', 'have', 'be', 'really', 'long', 'drawnout', 'battle', 'and', '-PRON-', 'heartbreaking', 'see', 'something', 'like', 'this', 'happen', 'federal', 'judge', 'kansas', 'accus

In [328]:
##Remove stopwords from training set

text_stopwords = remove_stopwords(text_lemma)


print("\n[INFO] text....................\n")
print(text_stopwords[:2])




[INFO] text....................

[['daisy', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'garner', 'national', 'attention', 'recover', 'tuesday', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'situation', 'tell', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'family', 'say', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'first', 'release', 'news', 'suicide', 'attempt', 'facebook', 'post', 'monday', 'heartbreake', 'bourland', 'say', 'live', 'albany', 'really', 'long', 'drawnout', 'battle', 'heartbreaking', 'see', 'something', 'like', 'happen', 'federal', 'judge', 'kansas', 'accuse', 'sexual', 'harassment', 'submit', 'resignation', 'accord', 'bourland', 'incident', 'stem', 'online', 'harassment', 'daisy', 'receive', 'attend', 'party', 'weekend', 'disparage', 'facebook', 'post', 'generate', 'additional', 'harsh', 'atta

In [329]:
## Generate Corpus and Dictionary for full dataset

dictionary_text = corpora.Dictionary(text_stopwords)
dictionary_text.filter_extremes(no_below=.01, no_above=0.99)
corpus_text = [dictionary_text.doc2bow(text) for text in text_stopwords]
pickle.dump(corpus_text, open('/Users/me/Desktop/LDA_Main/project2/data_articles/corpus_text.pkl', 'wb'))
dictionary_text.save('/Users/me/Desktop/LDA_Main/project2/data_articles/dictionary_text.gensim')

## 5b. Ensure the model perplexity score for k=5 topics is relatively similar to the perplexity of the testing and training sets (it is)

In [330]:
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta= 'auto'
per_word_topics=True

lda_model_text = []

In [331]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                            
                                                  per_word_topics=per_word_topics)

    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (num_topics = 1):  -7.427279812484941

Perplexity (num_topics = 2):  -7.389076487519594

Perplexity (num_topics = 3):  -7.3666858749920285

Perplexity (num_topics = 4):  -7.380203813789564

Perplexity (num_topics = 5):  -7.403557617739715

Perplexity (num_topics = 6):  -7.407368698753037

Perplexity (num_topics = 7):  -7.413200647619442

Perplexity (num_topics = 8):  -7.4111298751562025

Perplexity (num_topics = 9):  -7.411717042523926

Perplexity (num_topics = 10):  -7.4115652399320044

Perplexity (num_topics = 11):  -7.424479865980419

Perplexity (num_topics = 12):  -7.4231746774150125

Perplexity (num_topics = 13):  -7.423783190119228

Perplexity (num_topics = 14):  -7.422064043632383

Perplexity (num_topics = 15):  -7.415118205091898

Perplexity (num_topics = 16):  -7.409866194398966

Pe

## 5c. Run final model

In [333]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 4):  -7.380203813789564


## 5d. Visualize Model Results

In [334]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 4...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 4

(0, '0.038*"say" + 0.025*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"tell" + 0.008*"accord" + 0.008*"county" + 0.007*"arrest"')
(1, '0.028*"sexual" + 0.020*"assault" + 0.019*"say" + 0.018*"student" + 0.013*"university" + 0.012*"college" + 0.010*"campus" + 0.008*"school" + 0.008*"report" + 0.007*"rape"')
(2, '0.027*"say" + 0.013*"woman" + 0.009*"sexual" + 0.008*"tell" + 0.008*"assault" + 0.007*"would" + 0.006*"come" + 0.006*"time" + 0.005*"year" + 0.005*"make"')
(3, '0.023*"say" + 0.010*"driver" + 0.008*"woman" + 0.007*"assault" + 0.007*"tell" + 0.006*"defendent" + 0.006*"go" + 0.006*"get" + 0.005*"would" + 0.005*"like"')
text.....k = 5...................



Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





In [335]:
# Initializing LDA Models and Parameters
topic_number = 5
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 5):  -7.403557617739715


In [336]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 5...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 5

(0, '0.039*"say" + 0.026*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.009*"woman" + 0.008*"accord" + 0.008*"tell" + 0.008*"county" + 0.007*"arrest"')
(1, '0.032*"sexual" + 0.022*"student" + 0.022*"assault" + 0.017*"say" + 0.016*"university" + 0.015*"college" + 0.012*"campus" + 0.010*"school" + 0.009*"report" + 0.008*"rape"')
(2, '0.026*"say" + 0.013*"woman" + 0.010*"sexual" + 0.009*"tell" + 0.009*"assault" + 0.007*"would" + 0.007*"come" + 0.006*"time" + 0.006*"year" + 0.006*"know"')
(3, '0.026*"say" + 0.011*"driver" + 0.009*"woman" + 0.008*"assault" + 0.007*"defendent" + 0.007*"tell" + 0.007*"go" + 0.006*"get" + 0.006*"would" + 0.005*"want"')
(4, '0.022*"say" + 0.011*"saleem" + 0.010*"woman" + 0.007*"school" + 0.005*"jack" + 0.005*"mozaffar" + 0.005*"one" + 0

In [337]:
# Initializing LDA Models and Parameters
topic_number = 6
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 6):  -7.407368698753037


In [338]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 6

(0, '0.039*"say" + 0.027*"police" + 0.017*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"accord" + 0.008*"county" + 0.008*"girl" + 0.008*"tell"')
(1, '0.038*"sexual" + 0.028*"assault" + 0.019*"student" + 0.014*"college" + 0.013*"say" + 0.012*"university" + 0.012*"rape" + 0.011*"campus" + 0.008*"report" + 0.008*"school"')
(2, '0.027*"say" + 0.013*"woman" + 0.012*"sexual" + 0.011*"assault" + 0.009*"tell" + 0.007*"come" + 0.006*"time" + 0.006*"year" + 0.006*"would" + 0.005*"claim"')
(3, '0.028*"say" + 0.010*"woman" + 0.009*"tell" + 0.008*"would" + 0.008*"get" + 0.007*"go" + 0.007*"defendent" + 0.006*"know" + 0.006*"assault" + 0.006*"want"')
(4, '0.024*"say" + 0.012*"saleem" + 0.011*"woman" + 0.008*"school" + 0.006*"jack" + 0.006*"mozaffar" + 0.006*"one" + 0.00

In [339]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 4):  -7.380203813789564


In [340]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 4

(0, '0.038*"say" + 0.025*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"tell" + 0.008*"accord" + 0.008*"county" + 0.007*"arrest"')
(1, '0.028*"sexual" + 0.020*"assault" + 0.019*"say" + 0.018*"student" + 0.013*"university" + 0.012*"college" + 0.010*"campus" + 0.008*"school" + 0.008*"report" + 0.007*"rape"')
(2, '0.027*"say" + 0.013*"woman" + 0.009*"sexual" + 0.008*"tell" + 0.008*"assault" + 0.007*"would" + 0.006*"come" + 0.006*"time" + 0.005*"year" + 0.005*"make"')
(3, '0.023*"say" + 0.010*"driver" + 0.008*"woman" + 0.007*"assault" + 0.007*"tell" + 0.006*"defendent" + 0.006*"go" + 0.006*"get" + 0.005*"would" + 0.005*"like"')
text.....k = 6...................


## 6. Save the analysis results to an excel file for topic validation

### 6a. Here we generate a column that tells us which topic each response contributed the most to

In [341]:
cols = [color for name, color in mcolors.XKCD_COLORS.items()]
mycolors = [color for name, color in mcolors.XKCD_COLORS.items()]

In [342]:
#  Full Dataset


def format_topics_sentences(ldamodel, corpus, texts):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get the main topic of each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution, and Keywords for each document
        raw_frame = {}
        for j, (topic_num, prop_topic) in enumerate(row):
            #if j < 2:  # => dominant topic
                #wp = ldamodel.show_topic(topic_num)
                #topic_keywords = ", ".join([word for word, prop in wp])

                #sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,7), topic_keywords]), ignore_index=True)
            if j==0:
                raw_frame['Dominant'] = topic_num

            raw_frame['Topic' + str(topic_num)] = round(prop_topic, 4)

            #else:
            #    break
        df = pd.DataFrame(data=raw_frame, index=[0])
        sent_topics_df = sent_topics_df.append(df)
        
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution']

    # Add original text to the end of the output
    #contents = pd.Series(texts)
    #sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords_text = format_topics_sentences(ldamodel=lda_model_text, corpus=corpus_text, texts=text_stopwords)

df_dominant_topic_text = df_topic_sents_keywords_text.reset_index()
# Format
df_dominant_topic_text.index.name='Document_No';

print(df_dominant_topic_text.head(812))
# Format
#df_dominant_topic_text_train = df_topic_sents_keywords_text_train.reset_index()
#df_dominant_topic_text_train.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
#df_dominant_topic_text_train.head(3)


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





             index  Dominant  Topic0  Topic1  Topic2  Topic3
Document_No                                                 
0                0         0  0.7415  0.2584     NaN     NaN
1                0         2  0.0100     NaN  0.9898     NaN
2                0         2  0.2357  0.1125  0.6518     NaN
3                0         1  0.0546  0.9453     NaN     NaN
4                0         0  0.9992     NaN     NaN     NaN
5                0         2     NaN  0.3929  0.6069     NaN
6                0         3  0.2378     NaN     NaN  0.7616
7                0         0  0.9995     NaN     NaN     NaN
8                0         0  0.3910  0.2288  0.3801     NaN
9                0         0  0.9996     NaN     NaN     NaN
10               0         0  0.9996     NaN     NaN     NaN
11               0         2  0.4202     NaN  0.5796     NaN
12               0         0  0.8280     NaN  0.1719     NaN
13               0         3     NaN     NaN     NaN  0.9998
14               0      

## 6b. Generate a dataframe to export the results into

In [344]:
lda_topics_text = np.array(df_dominant_topic_text['Dominant'])
topic0_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic0'])
topic1_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic1'])
topic2_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic2'])
topic3_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic3'])


text = np.array(data['text'])

uid = np.array(data['uid'])

results = { 
    'uid' : uid, 
    'text': text, 
    'lda_topics_text': lda_topics_text, 
    'topic0_contrib_lda_topics_text':topic0_contrib_lda_topics_text,
    'topic1_contrib_lda_topics_text':topic1_contrib_lda_topics_text,
    'topic2_contrib_lda_topics_text':topic2_contrib_lda_topics_text,
    'topic3_contrib_lda_topics_text':topic3_contrib_lda_topics_text,
}

frame = pd.DataFrame(results, columns = [
                                                'uid',
                                                'text', 'lda_topics_text', 
                                                'topic0_contrib_lda_topics_text',
                                                'topic1_contrib_lda_topics_text',
                                                'topic2_contrib_lda_topics_text',
                                                'topic3_contrib_lda_topics_text',

                                              ])



frame.to_excel("/Users/me/Desktop/LDA_Main/project2/data_articles/lda_results_full_dataset_topic_num_5.xlsx")

## 5d. Export restuls to an .xlsx file

In [345]:
'/Users/me/Desktop/LDA_Main/project2/data_articles/news_dataset.xlsx'
frame.to_excel("/Users/me/Desktop/LDA_Main/project2/data_articles/lda_results_All_data_topic_num_4___.xlsx")