# modelingeditorialslants 

## LDA Modeling of Topics Discussed in New Articles


## This notebook applies LDA modeling to a an dataset of 480 news articles shared to the '/r/news/ subreddit and Twitter

## Topic Modeling is a text analysis method that uses machine learning algorithms to classify the content of a large body of texts—called a corpus—into meaningful topics. 

## The specific technique used here, Latent Dirichlet Allocation (LDA), identifies underlying patterns of co-occurring words within individual texts to extrapolate latent categories across a corpus of texts (i.e., topics).

### In the following case, we are interested in understanding what characteristics of news articles elicit user engagement on social media relying heavily on the GENSIM, NLTK, spaCy, and SKLearn libraries:

https://radimrehurek.com/gensim/

https://www.nltk.org

https://scikit-learn.org/stable/

https://spacy.io



# Steps

## 1. Preprocessing
    1a. Spell check
    
    1b. Expand contractions

## 2. Text processing

     2a. Partition data into testing and training subsets
 
     2b. Tokenization
     
     2c. Stop Word Removal
     
     2d. Lemmatization
     
     2e. Bigrams and Trigrams
     
     2f. Exclude terms in > 99% and < 1% of documents
     
     2g. Generate Corpus and Dictionary
 

## 2. Selecting the number of topics (k)
 
     2a. Computing Model Perplexity


## 3. Model Results

     3a. pyLDAvis visualization to assist with Topic Labeling
     
     3b. Topic Mixtures (Document-Term Matrix)

 
# Helpful Links:

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

     
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158


# Sampling Procedures
We took a systematic random sample of 50 posts shared to each platform every week for five years (from January 2015 to December 2019, totaling 24,000 posts). Human coders then filtered through all 24,000 posts to extract two posts from each week (one from Reddit and one from Twitter), which shared a news report of a sexual assault. We only included reports of sexual assault; all other news frames were excluded (e.g., stories about rape culture more generally). ### Additionally, only reports written English were retained. If all 50 posts on a particular day were excluded, another 50 posts were sampled from that particular week. Human coders extracted the text from the news articles and their comment threads. When a news article had been taken down, a different post was sampled for that week.

# 1. Preprocessing Procedures:

## 1a. Run your textual data through a spell checker
     Altough there are automated spell checkers, they aren't as accurate as we'd like. Thus, we 
     encourage you to have a human run it through a spellchecker. Although this is time intensive, we 
     believe this is necessary.
## 1b. Expand all english contractions (e.g., "don't" -> "do not")
    Similar to the spellchecker, we needed human coders to do this, to ensure accuracy

# Required Libraries

In [3]:
## Load Required Libraries

#general
import numpy as np
import pandas as pd
import re
import pickle
from IPython.display import display

#setting up Jupyter notebook 
%matplotlib inline
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)

#text preprocessing
import nltk
from nltk.corpus import stopwords

import spacy
from spacy.lang.en import English

from gensim.models import Phrases
from gensim.utils import simple_preprocess


#modeling
import gensim
from gensim.models.ldamodel import LdaModel


#plotting
import pyLDAvis
import pyLDAvis.gensim

# Read in the following:
### 1. dataset
### 2. Stopword location
### 3. Gensim location
### 4. Mallet path

In [11]:
## Set file Paths

#data files
file_location = './data_articles/news_dataset.xlsx'

#stop words
stopwords_location = './data_articles/stopwords.txt'

## Did the dataset load with the correct number of columns and rows?

In [12]:
## Check to make sure the dataset looks correct
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

480 Rows.  15 Columns.


## Make sure the head of the dataset looks correct

In [15]:
indices = [0,333,777,932]

samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Sample Tickets:")
display(samples)

Sample Tickets:




Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike



Unnamed: 0,uid,title,first_com,all_com,text,Year,Month,week,Post_link,Article_link,Article_domain,date,engagement,Platform,Gender
0,1.0,"Daisy Coleman, teen at center of Maryville sex...",Those poor girls. Both of them.,"Gentry County officials last week, meanwhile, ...","Daisy Coleman, the northwest Missouri teenager...",2014.0,1.0,1.0,://www.reddit.com/r/news/comments/1upxir/dai...,://www.kansascity.com/news/special-reports/m...,Kansas City Star,41646.7,81.0,Reddit,
1,334.0,Former Waldron Schools Secretary Pleads Guilty...,Former Waldron Schools Secretary Pleads Guilty...,,A former Waldron School District secretary was...,,,,://twitter.com/DavidLo55938576/status/685248...,://www.5new line.com/article/news/local/o...,,2016-01-07 23:56:11+00:00,0.0,Twitter,c
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,


# 2. DATA PREPROCESSING

# 2a. Split data into test and training sets

In [16]:
# Randomimze the order of the rows in the dataframe
data = shuffle(data)

In [17]:
#Partition data into a testing and training sets

from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data, test_size=0.50, random_state=42)
print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  15 Columns.
240 Rows.  15 Columns.
480 Rows.  15 Columns.


In [18]:
#Add an extra column to the dataframe to identify the testing vs the training rows vs the full dataset

data_train['Data_Type'] = "traning_data"
data_test['Data_Type'] = "testing_data"
data['Data_Type'] = "full_data"

print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  16 Columns.
240 Rows.  16 Columns.
480 Rows.  16 Columns.




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



# 2b. Tokenization
### Tokenization involves coverting the text to lowercase, removing special characters, null valies, and punctuation from the text

## Here we can see how many null values there are in each column

In [19]:
# null values in each column of the testing set
data_train.isnull().sum()

uid                 0
title               0
first_com           0
all_com            98
text                0
Year              105
Month             105
week              105
Post_link           0
Article_link        0
Article_domain    107
date                1
engagement          1
Platform            0
Gender            135
Data_Type           0
dtype: int64

In [20]:
# null values in each column of the training set
data_test.isnull().sum()

uid                 0
title               1
first_com           0
all_com           124
text                0
Year              135
Month             135
week              135
Post_link           0
Article_link        0
Article_domain    136
date                0
engagement          2
Platform            0
Gender            105
Data_Type           0
dtype: int64

## Now we need to remove null values from the data

In [21]:
#----------------------------------------Removing Null values--------------------------------------------------

#finding null values in the training subset 
print("=============Training Data=============")
data_train['text'] = data_train['text']

print('Number of rows in text:', len(data_train['text']))

print("-------------------")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#finding null values in the testing subset 
print("\n=============Testing Data==============\n")
data_test['text'] = data_test['text']

print('Number of rows in text:', len(data_test['text']))

print("-------------------")

print("Null Values in text: {}".format(data_test['text'].isnull().sum()))


print("\n\n====================Removing Null Values==========================\n\n")
#Removing null values from the training subset 

text_train = data_train['text']

print("After removing Null Values in Train Data")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#Removing null values from the testing subset 

text_test = data_test['text']

print("\n\nAfter removing Null Values in Test Data")
print("Null Values in text: {}".format(data_test['text'].isnull().sum()))

Number of rows in text: 240
-------------------
Null Values in text: 0


Number of rows in text: 240
-------------------
Null Values in text: 0




After removing Null Values in Train Data
Null Values in text: 0


After removing Null Values in Test Data
Null Values in text: 0




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



## Convert the text to lowercase

In [22]:
#----------------------------------------Converting to Lower case--------------------------------------------------

##Train Data
text_train = text_train.str.lower()

##Test Data
text_test = text_test.str.lower()

print("=======Train Data==============\n")
print(text_train.head(2))

print("\n=======Test Data==============\n")
print(text_test.head(2))


210    police are investigating reports of a sexual a...
320    the attorney for the female employee who was a...
Name: text, dtype: object


386    the texarkana texas police department arrested...
327    viola davis has moved fans with passionate per...
Name: text, dtype: object


## Remove the following:
- special characters
- alphanumerics
- numbers
- words that appear in the corpus less than twice
- extra spaces

In [23]:
##Remove from training set

text_train_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_train]
text_train_regex = [re.sub(r'\'', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\d', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'_', ' ',  sent) for sent in text_train_regex]

##Remove from testing set

text_test_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_test]
text_test_regex = [re.sub(r'\'', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\d', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'_', ' ',  sent) for sent in text_test_regex]


print("=======Train Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])

print("\n=======Test Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])



[INFO] article_text....................

['the texarkana texas police department arrested timothy paul wilder texarkana texas for sexual assault child\n\n \n\nofficers were notified the allegations after the child told family member the assault  the female victim who now fourteen years old was six eight years old when the assault occurred\n\n \n\nwilder turned himself detectives monday morning with his attorney and was booked into the bistate jail was released later the day after posting  bond\n\n \n\nthe case has been forwarded the district attorneys office\n\n', 'viola davis has moved fans with passionate performances the big and small screens but recent speech which opened about her embattled sisters suffering the one the most haunting and emotional moments she ever\n\nthe rape foundations dedication ceremony for the new stuart housesanta monica  november  actress viola davis attends the rape foundations dedication ceremony for the new stuart house november santa monica california

## Remove All Punctuation

In [24]:
def tokenize(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes all punctuation

In [25]:
##Training set
text_train_tokens = list(tokenize(text_train_regex))


print("\n[INFO] article text....................\n")
print(text_train_tokens[:2])

##Testing set
text_test_tokens = list(tokenize(text_test_regex))

print("\n[INFO] article text....................\n")
print(text_test_tokens[:2])


[INFO] article text....................

[['police', 'are', 'investigating', 'reports', 'sexual', 'assault', 'montclair', 'highs', 'school', 'involving', 'two', 'minors', 'officials', 'said', 'principal', 'earle', 'let', 'the', 'parents', 'know', 'about', 'the', 'alleged', 'incident', 'the', 'schools', 'allgender', 'bathroom', 'monday', 'email', 'the', 'essex', 'county', 'prosecutors', 'office', 'now', 'investigating', 'police', 'say', 'the', 'victim', 'girl', 'and', 'the', 'alleged', 'attacker', 'boy', 'there', 'are', 'criminal', 'forceds', 'yet'], ['the', 'attorney', 'for', 'the', 'female', 'employee', 'who', 'was', 'attacked', 'duke', 'university', 'hospital', 'sept', 'raised', 'questions', 'tuesday', 'about', 'why', 'the', 'case', 'not', 'being', 'investigated', 'sexual', 'assault', 'the', 'attack', 'occurred', 'about', 'visitor', 'bathroom', 'the', 'ninth', 'floor', 'the', 'hospital', 'workers', 'said', 'the', 'female', 'technician', 'was', 'followed', 'into', 'the', 'bathroom', 

# 2c. Lemmatization
## Reduces words to their lemma (e.g., assaulted -> assault)

In [26]:
def lemmatization(texts, allowed_postags=['VERB', 'NOUN']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

In [28]:
nlp = spacy.load('en', disable=['parser', 'ner'])
# Lemmatize Training Set, keeping only nouns and verbs

text_train_lemma = lemmatization(text_train_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_train_lemma)))
print(text_train_lemma[:4])

# Lemmatize Testing Set, keeping only nouns and verbs
text_test_lemma = lemmatization(text_test_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_test_lemma)))
print(text_test_lemma[:4])

240
[['police', 'be', 'investigate', 'report', 'sexual', 'assault', 'montclair', 'high', 'school', 'involve', 'two', 'minor', 'official', 'say', 'principal', 'earle', 'let', 'the', 'parent', 'know', 'about', 'the', 'allege', 'incident', 'the', 'school', 'allgender', 'bathroom', 'monday', 'email', 'the', 'essex', 'county', 'prosecutor', 'office', 'now', 'investigate', 'police', 'say', 'the', 'victim', 'girl', 'and', 'the', 'allege', 'attacker', 'boy', 'there', 'be', 'criminal', 'forced', 'yet'], ['the', 'attorney', 'for', 'the', 'female', 'employee', 'who', 'be', 'attack', 'duke', 'university', 'hospital', 'sept', 'raise', 'question', 'tuesday', 'about', 'why', 'the', 'case', 'not', 'be', 'investigate', 'sexual', 'assault', 'the', 'attack', 'occur', 'about', 'visitor', 'bathroom', 'the', 'ninth', 'floor', 'the', 'hospital', 'worker', 'say', 'the', 'female', 'technician', 'be', 'follow', 'into', 'the', 'bathroom', 'after', 'complete', 'procedure', 'and', 'be', 'choke', 'and', 'shove', 't

240
[['the', 'texarkana', 'texas', 'police', 'department', 'arrest', 'timothy', 'paul', 'wilder', 'texarkana', 'texas', 'for', 'sexual', 'assault', 'child', 'officer', 'be', 'notify', 'the', 'allegation', 'after', 'the', 'child', 'tell', 'family', 'member', 'the', 'assault', 'the', 'female', 'victim', 'who', 'now', 'fourteen', 'year', 'old', 'be', 'six', 'eight', 'year', 'old', 'when', 'the', 'assault', 'occur', 'wilder', 'turn', '-PRON-', 'detective', 'monday', 'morning', 'with', '-PRON-', 'attorney', 'and', 'be', 'book', 'into', 'the', 'bistate', 'jail', 'be', 'release', 'later', 'the', 'day', 'after', 'post', 'bond', 'the', 'case', 'have', 'be', 'forward', 'the', 'district', 'attorney', 'office'], ['viola', 'davis', 'have', 'move', 'fan', 'with', 'passionate', 'performance', 'the', 'big', 'and', 'small', 'screen', 'but', 'recent', 'speech', 'which', 'open', 'about', '-PRON-', 'embattle', 'sister', 'suffer', 'the', 'one', 'the', 'most', 'haunting', 'and', 'emotional', 'moment', '-PRO

# 2d. Remove Stopwords
## NOTE: Edit the stopwords txt file to add additional words to filter out

In [29]:
#Prepare to remove stopwords
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))
newStopWords =[str(x.strip()) for x in open(stopwords_location,'r').read().split('\n')]
stopwords.update(newStopWords)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts]
print(len(stopwords))

4416


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannahstevens/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
##Remove stopwords from training set

text_train_stopwords = remove_stopwords(text_train_lemma)


print("\n[INFO] article text....................\n")
print(text_train_stopwords[:2])

##Remove stopwords from testing set

text_test_stopwords = remove_stopwords(text_test_lemma)


print("\n[INFO] article text....................\n")
print(text_test_stopwords[:2])


[INFO] article text....................

[['police', 'investigate', 'report', 'sexual', 'assault', 'montclair', 'high', 'school', 'involve', 'two', 'minor', 'official', 'say', 'principal', 'earle', 'let', 'parent', 'know', 'allege', 'incident', 'school', 'allgender', 'bathroom', 'monday', 'email', 'essex', 'county', 'prosecutor', 'office', 'investigate', 'police', 'say', 'victim', 'girl', 'allege', 'attacker', 'boy', 'criminal', 'forced', 'yet'], ['attorney', 'female', 'employee', 'attack', 'duke', 'university', 'hospital', 'sept', 'raise', 'question', 'tuesday', 'case', 'investigate', 'sexual', 'assault', 'attack', 'occur', 'visitor', 'bathroom', 'ninth', 'floor', 'hospital', 'worker', 'say', 'female', 'technician', 'follow', 'bathroom', 'complete', 'procedure', 'choke', 'shove', 'ground', 'bleed', 'nearly', 'unable', 'walk', 'eventually', 'load', 'onto', 'stretcher', 'take', 'emergency', 'room', 'treatment', 'witness', 'say', 'justin', 'bamberg', 'victim', 'attorney', 'say', 'attack

# 2e. Bigrams and Trigrams
##    Bigrams are two words that frequently co-occur together
##    Trigrams are three words that frequently co-occur together

In [31]:
##Training set
            
text_train_bigram = Phrases(text_train_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_train_trigram = Phrases(text_train_bigram[text_train_stopwords], threshold=1)

text_train_bigram_mod = gensim.models.phrases.Phraser(text_train_bigram)
text_train_trigram_mod = gensim.models.phrases.Phraser(text_train_trigram)

for idx in range(len(text_train_stopwords)):
    for token in text_train_trigram_mod[text_train_bigram_mod[text_train_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_train_stopwords[idx].append(token)
            
##Testing set

text_test_bigram = Phrases(text_test_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_test_trigram = Phrases(text_test_bigram[text_test_stopwords], threshold=1)

text_test_bigram_mod = gensim.models.phrases.Phraser(text_test_bigram)
text_test_trigram_mod = gensim.models.phrases.Phraser(text_test_trigram)

for idx in range(len(text_test_stopwords)):
    for token in text_test_trigram_mod[text_test_bigram_mod[text_test_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_test_stopwords[idx].append(token)

## Print the head of the data with Bigrams and Trigrams
### e.g., under the testing dataset, you can see the words 'report sexual' and 'good guy' are bigrams

In [32]:
print("\n[INFO] Text Article training data....................\n")
print(text_train_stopwords[:2])

print("\n[INFO] Text Article testing data....................\n")
print(text_test_stopwords[:2])


[INFO] Text Article training data....................

[['police', 'investigate', 'report', 'sexual', 'assault', 'montclair', 'high', 'school', 'involve', 'two', 'minor', 'official', 'say', 'principal', 'earle', 'let', 'parent', 'know', 'allege', 'incident', 'school', 'allgender', 'bathroom', 'monday', 'email', 'essex', 'county', 'prosecutor', 'office', 'investigate', 'police', 'say', 'victim', 'girl', 'allege', 'attacker', 'boy', 'criminal', 'forced', 'yet', 'police investigate', 'report sexual_assault', 'high school', 'involve two', 'official say', 'allege incident', 'county prosecutor', 'police say', 'criminal forced'], ['attorney', 'female', 'employee', 'attack', 'duke', 'university', 'hospital', 'sept', 'raise', 'question', 'tuesday', 'case', 'investigate', 'sexual', 'assault', 'attack', 'occur', 'visitor', 'bathroom', 'ninth', 'floor', 'hospital', 'worker', 'say', 'female', 'technician', 'follow', 'bathroom', 'complete', 'procedure', 'choke', 'shove', 'ground', 'bleed', 'nearly'

# 2f. Remove words that occur in less than & greater than of documents
### The corpus is our collection of documents (i.e., our textual questionnaire responses)
### The dictionary takes each unique word in the corpus and assigns them an index

In [34]:
dictionary_text_train = corpora.Dictionary(text_train_stopwords)
dictionary_text_train.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_train = [dictionary_text_train.doc2bow(text) for text in text_train_stopwords]
pickle.dump(corpus_text_train, open('./data_articles/corpus_text_train.pkl', 'wb'))
dictionary_text_train.save('./data_articles/dictionary_text_train.gensim')

dictionary_text_test = corpora.Dictionary(text_test_stopwords)
dictionary_text_test.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_test = [dictionary_text_test.doc2bow(text) for text in text_test_stopwords]
pickle.dump(corpus_text_test, open('./data_articles/corpus_text_test.pkl', 'wb'))
dictionary_text_test.save('./data_articles/dictionary_text_test.gensim')

# 2g. Print Dictionary and Corpus for each subset 

In [35]:
## Training data dictionary & corpus
text_train_stopwords
for k, v in dictionary_text_train.token2id.items():
    print(k, v)

allege 0
allege incident 1
allgender 2
assault 3
attacker 4
bathroom 5
boy 6
county 7
county prosecutor 8
criminal 9
criminal forced 10
earle 11
email 12
essex 13
forced 14
girl 15
high 16
high school 17
incident 18
investigate 19
involve 20
involve two 21
know 22
let 23
minor 24
monday 25
montclair 26
office 27
official 28
official say 29
parent 30
police 31
police investigate 32
police say 33
principal 34
prosecutor 35
report 36
report sexual_assault 37
say 38
school 39
sexual 40
two 41
victim 42
yet 43
adamantly 44
administrator 45
alert 46
answer 47
anticipate 48
apprehension 49
ask 50
assault case 51
attack 52
attack occur 53
attorney 54
avoid 55
awake 56
aware 57
bamberg 58
begin 59
behalf 60
believe 61
bleed 62
building 63
case 64
choke 65
client 66
comment 67
complete 68
complexion 69
conference 70
confidentiality 71
consistent 72
continue 73
continue investigate 74
cooperate 75
criticize 76
decline 77
decline comment 78
deny 79
department 80
describe 81
develop 82
different 83

persuade 1097
political 1098
popular 1099
popularity 1100
previously 1101
program 1102
prosecuter rape 1103
reenact 1104
reenactment 1105
religious 1106
remove 1107
reply 1108
return 1109
roommate 1110
sandra 1111
sexual abuse 1112
shade 1113
shut 1114
situate 1115
something like 1116
south 1117
spend 1118
split 1119
start 1120
state attorney 1121
stop victim 1122
streetthe 1123
stuff 1124
submit 1125
suggest 1126
thought 1127
triathlon 1128
uic 1129
underwear 1130
unfortunately 1131
valentines 1132
washington 1133
washington post 1134
west 1135
affiliate 1136
anyone information 1137
black 1138
book 1139
chicago police 1140
confirm 1141
connection 1142
county sheriffs 1143
daylong 1144
doug 1145
dps 1146
drama 1147
drive 1148
extradited 1149
ferris 1150
flight 1151
gholston 1152
grand 1153
guam 1154
happen night 1155
indicate 1156
indication 1157
information release 1158
instruct 1159
international 1160
kettner 1161
list 1162
lock 1163
male 1164
mecosta 1165
message 1166
mlive 1167
nei

mile 1847
party 1848
rcalle 1849
rochester 1850
vicitms 1851
word 1852
young 1853
young woman 1854
abdullah 1855
accelerate 1856
accordance 1857
account 1858
accredit 1859
accuser last 1860
action 1861
acute 1862
address 1863
admission 1864
adopt 1865
advocacy 1866
advocacy group 1867
aggressive 1868
alarming 1869
almost 1870
alqaria 1871
americanize 1872
americans 1873
ana 1874
angry 1875
anguish 1876
apologize 1877
archbishop 1878
asian 1879
assimilate 1880
assimilated 1881
attitude 1882
attribute 1883
award 1884
baggage 1885
basement 1886
bear 1887
bedroom 1888
belief 1889
better 1890
billy 1891
black pant 1892
blanket 1893
blemish 1894
blog 1895
brief 1896
bristle 1897
broker 1898
bylaw 1899
catholic 1900
ceiling 1901
challenging 1902
chaplain 1903
chapter 1904
chicago act 1905
chicago area 1906
class 1907
cloister 1908
closeknit 1909
cmdr 1910
commenter 1911
common 1912
computer 1913
confide 1914
conflict 1915
conscientious 1916
conservative 1917
corporate 1918
country 1919
cousin

caleb 2847
campus sexual_assault 2848
campuse 2849
citate 2850
city police 2851
college campus 2852
college campuse 2853
columnist 2854
combat 2855
conclude 2856
condom 2857
conscious 2858
constitutional 2859
coordinate 2860
correct 2861
crack 2862
criminal trial 2863
dakota 2864
dangerously 2865
danielle 2866
defeat 2867
defining 2868
demonstration 2869
dezmine 2870
dirk 2871
discrimination 2872
dismantle 2873
disregard 2874
dorm 2875
doubt 2876
draft 2877
due 2878
due process 2879
educations 2880
effort 2881
emotional distress 2882
encouraged 2883
establish 2884
exercise 2885
federal judge 2886
firstyear 2887
forks 2888
foundation 2889
gloria 2890
goingtohave 2891
gpa 2892
guidance 2893
guilt 2894
helpless 2895
inc 2896
inc citate 2897
infliction 2898
informed 2899
injury 2900
intentional 2901
john doe 2902
justice department 2903
libel 2904
likely 2905
make statement 2906
malice 2907
mandate 2908
marilou 2909
meyer 2910
mirkovich 2911
multidefendent 2912
negligence 2913
nevertheless

distract 3846
earlymorne 3847
fleet 3848
foyer 3849
freckle 3850
garden 3851
heighten 3852
indecently 3853
jean 3854
kcolore 3855
neighborhood 3856
nonlife 3857
past 3858
pea 3859
plague 3860
police spokeswoman 3861
police think 3862
populated 3863
rattle 3864
spate 3865
spiked 3866
summer 3867
tileston 3868
timberlandtype 3869
unity 3870
warn 3871
watchful 3872
woman attack 3873
also prosecuter 3874
believe may 3875
certainly 3876
craigslist 3877
criminal complaint 3878
doyle 3879
duress 3880
extortion 3881
face year 3882
include rape 3883
maverick 3884
mendez 3885
mission 3886
modeling 3887
photographs 3888
preliminary 3889
readiness 3890
rosale 3891
rosales 3892
solicit 3893
soliciting 3894
tional victim 3895
valley 3896
ashamed 3897
await 3898
boxer 3899
burn 3900
countys 3901
girlfriend 3902
humiliating 3903
month later 3904
rookie 3905
rope 3906
sausage 3907
shave 3908
soulsearche 3909
stun 3910
talk sexual 3911
thankful 3912
torture 3913
waldeck 3914
waldeck say 3915
bohannon 39

cruel 4846
daisy 4847
debut 4848
delaney 4849
earlier year 4850
effect 4851
elementary 4852
ella 4853
empowerment 4854
endearment 4855
endure 4856
exclude 4857
exploration 4858
fairon 4859
fantasy 4860
festival 4861
find percent 4862
flyer 4863
fundraiser 4864
genderbase 4865
gov 4866
grandfather 4867
groundwork 4868
halt 4869
hauntingly 4870
help get 4871
host 4872
hug 4873
hurl 4874
incentivize 4875
influence 4876
introduction 4877
jada 4878
kindergartner 4879
loud 4880
louder 4881
mcauliffes 4882
meme 4883
mindset 4884
multipronge 4885
nonprofit 4886
norris 4887
one boy 4888
one five 4889
one thing 4890
pave 4891
plea deal 4892
poorlyshe 4893
positive 4894
pott 4895
potts 4896
presentation 4897
press conference 4898
pretend 4899
preventive 4900
project 4901
public school 4902
push back 4903
rape culture 4904
raphael 4905
raphael say 4906
rate 4907
readymade 4908
researcher 4909
rose 4910
rug 4911
safe bae 4912
safebae 4913
school student 4914
secon 4915
shael 4916
snowball 4917
some

disciplinary action 5846
epsilon 5847
exact 5848
formally 5849
fraternal 5850
fraternitys 5851
gamma 5852
immediate 5853
improjailetie 5854
ing 5855
initiative 5856
kappa 5857
kappa alpha 5858
lcoal 5859
logan 5860
poi 5861
proud 5862
student conduct 5863
usu 5864
utah 5865
vitale 5866
voluntarily 5867
wray 5868
adventure 5869
affleck 5870
around time 5871
batman 5872
beek 5873
believer 5874
ben 5875
bigname 5876
bombshell 5877
boob 5878
burton 5879
bury 5880
busan 5881
carr 5882
carrie 5883
collapse 5884
complicit 5885
condemn 5886
condemnation 5887
crews 5888
damning 5889
daw 5890
der 5891
development 5892
easy 5893
enterjailse 5894
excellent 5895
grinned 5896
hef 5897
hilarie 5898
horror 5899
invitation 5900
jfk 5901
korea 5902
mansion 5903
megadonor 5904
model 5905
namely 5906
navigate 5907
oliver 5908
openly 5909
oscarwinne 5910
predation 5911
premiere 5912
relief 5913
rival 5914
rockn 5915
seemingly 5916
soundtrack 5917
spy 5918
ted 5919
thankfully 5920
thusly 5921
toll 5922
toxi

request comment 6846
sec 6847
standup 6848
storyline 6849
tribeca 6850
undress 6851
whoa 6852
belmar 6853
blend 6854
busy 6855
calvary 6856
cape 6857
daylight 6858
deviant 6859
dissolve 6860
echo 6861
elaborate 6862
fergu 6863
fitzgerald 6864
girardeau 6865
heartbreaking 6866
heights 6867
herd 6868
identifie 6869
jailmary 6870
like see 6871
linkedin 6872
mcculloch say 6873
operator 6874
padua 6875
parishioner 6876
postdispatch 6877
relieve 6878
ridge 6879
rosary 6880
say decision 6881
sentiment 6882
shop 6883
sing 6884
somber 6885
stutte 6886
unoppose 6887
wesley 6888
wildwood 6889
anyway 6890
apd 6891
ejaculate 6892
hardly 6893
lineup 6894
reintegrate 6895
sarber 6896
semen 6897
squeeze 6898
zip 6899
accord state 6900
activitie 6901
adrienne 6902
assess 6903
greek 6904
greek life 6905
jailnciple 6906
kehrwald 6907
kentucky 6908
lamba 6909
lambda 6910
ledger 6911
letter send 6912
membership 6913
nick 6914
philanthropic 6915
say letter 6916
student affair 6917
tumultuous 6918
waldrip 69

reading 7845
say still 7846
standpoint 7847
stepping 7848
ticketing 7849
wesbite 7850
wmbd 7851
concertgoer 7852
lookedon 7853
mansfield 7854
patron 7855
roxbury 7856
urban 7857
venue 7858
xfinity 7859
allred 7860
assault minor 7861
beverly 7862
nel 7863
roy 7864
adequately 7865
ajaill 7866
application 7867
assault report 7868
athletics 7869
aware allegation 7870
banished 7871
barcombpeter 7872
beaty 7873
burden 7874
casebycase 7875
clinkscale 7876
conductbehavior 7877
curry 7878
dole 7879
emphasis 7880
enable 7881
entail 7882
erinn 7883
expectation 7884
forcedd 7885
forgo 7886
gary 7887
haute 7888
hyman 7889
jayhawker 7890
journalworld 7891
libby 7892
likewise 7893
listing 7894
marchiony 7895
mcclure 7896
nonacademic 7897
notation 7898
offfield 7899
opener 7900
plaintiff 7901
plausible 7902
readmission 7903
reiterate 7904
revise 7905
roerig 7906
roerig say 7907
roster 7908
rower 7909
rowing 7910
scenario 7911
sherard 7912
sycamore 7913
tackett 7914
tammara 7915
teams 7916
technically 

accomplished 8845
accomplishment 8846
alaleh 8847
alto 8848
cier 8849
clara 8850
coconspirator 8851
commentator 8852
congressman 8853
dennis 8854
dilute 8855
distinction 8856
downplay 8857
empathize 8858
exhibit 8859
exswimmer 8860
firsttime 8861
gent 8862
harhold 8863
hastert 8864
herhold 8865
herholds 8866
kianerci 8867
ladie 8868
merciful 8869
mercury 8870
messy 8871
mitigate 8872
mockery 8873
palo 8874
perp 8875
perps 8876
persky 8877
perturb 8878
plus 8879
promiscuity 8880
remorse 8881
seriousness 8882
shred 8883
silly 8884
smack 8885
soft 8886
summation 8887
swimmer 8888
swimming 8889
tactless 8890
timeout 8891
twelve 8892
unindicted 8893
weak 8894
afterhour 8895
alaric 8896
captain say 8897
carla 8898
counties 8899
feasible 8900
hayes say 8901
lankershim 8902
lapds 8903
robberyhomicide 8904
sale 8905
sentra 8906
ventura 8907
xux 8908
zuniga 8909
ackley 8910
buch 8911
fisher 8912
kristina 8913
noel 8914
absentia 8915
braun 8916
champagne 8917
chinatown 8918
diagnostic 8919
fierce

In [36]:
## Testing Data dictionary & corpus
text_test_stopwords
for k, v in dictionary_text_test.token2id.items():
    print(k, v)

allegation 0
arrest 1
assault 2
assault female 3
assault occur 4
attorney 5
bistate 6
bond 7
book 8
case 9
child 10
day 11
department 12
detective 13
district 14
district attorney 15
eight 16
family 17
family member 18
female 19
forward 20
fourteen 21
jail 22
later 23
member 24
monday 25
morning 26
notify 27
occur 28
office 29
officer 30
old 31
paul 32
police 33
police department 34
post 35
release 36
sexual 37
sexual assault 38
six 39
tell 40
texarkana 41
texas 42
timothy 43
turn 44
victim 45
wilder 46
year 47
year old 48
abuse 49
action 50
actress 51
adam 52
add 53
afternoon 54
age 55
angry 56
apologize 57
assailant 58
attend 59
attention 60
away 61
believe 62
benefit 63
big 64
blame 65
california 66
ceremony 67
corner 68
could 69
course 70
create 71
creative 72
davis 73
dedication 74
demand 75
describe 76
devastated 77
different 78
dirty 79
drug 80
dunham 81
embattle 82
emotional 83
end 84
event 85
ever 86
explain 87
eye 88
fan 89
far 90
feel 91
find 92
follow 93
foundation 94
frail

money 829
motel 830
motion 831
mouth 832
muck 833
mud 834
multiple 835
nail 836
naked 837
near 838
need 839
neighbor 840
neighborhood 841
nerve 842
never 843
never see 844
newspaper 845
next 846
nice 847
night attack 848
nod 849
north 850
northeastern 851
nothing 852
nowhere 853
nurse 854
oct 855
offense 856
office decline 857
official 858
one case 859
opendoor 860
option 861
outside 862
outstanding 863
owe 864
packet 865
page 866
paperwork 867
parent 868
park 869
parker 870
parking 871
parking lot 872
parkway 873
parris 874
part 875
participate 876
pass 877
passenger 878
patch 879
paternal 880
pay 881
payne 882
peek 883
people 884
people know 885
personal 886
petty 887
phone 888
physical 889
pick 890
picnic 891
place 892
plan 893
plate 894
plea 895
plea deal 896
plead 897
please 898
plow 899
pluff 900
pocket 901
point 902
police call 903
police find 904
police officer 905
police report 906
police say 907
policy 908
port 909
pose 910
position 911
presence 912
previous 913
prison 914
pr

total 1579
trump 1580
trump say 1581
unearthed 1582
washington 1583
washington post 1584
wrong 1585
act 1586
actual 1587
advance 1588
aggressive 1589
allege sexual 1590
also claim 1591
anonymous 1592
asleep 1593
band 1594
ben 1595
brooklyn 1596
bruce 1597
bttm 1598
bttms 1599
cancel 1600
circulate 1601
claim make 1602
fact 1603
fallout 1604
initial 1605
jen 1606
jezebel 1607
know know 1608
liv 1609
management 1610
opening 1611
percent 1612
permission 1613
pitchfork 1614
predatory 1615
protection 1616
pseudonym 1617
pwr 1618
pwr bttm 1619
representative 1620
rough 1621
shocking 1622
story tell 1623
take place 1624
tonight 1625
tour 1626
two day 1627
upcoming 1628
absolutely 1629
absurd 1630
accusation 1631
acquit 1632
alcoholic 1633
altar 1634
bickering 1635
blasey 1636
blind 1637
blumenthal 1638
blurbs 1639
brett 1640
camp 1641
christine 1642
college 1643
conclusion 1644
counselor 1645
country 1646
defendant 1647
delay 1648
democrats 1649
desperately 1650
develop 1651
disservice 1652
d

queen 2578
quickly 2579
say victim 2580
source say 2581
subway 2582
truck 2583
two man 2584
vague 2585
victim yearold 2586
woman attack 2587
abdulaziz 2588
alsaud 2589
arrest wednesday 2590
beverly 2591
coerce 2592
compound 2593
copulation 2594
diplomatic 2595
estate 2596
gate 2597
gated 2598
hollywood 2599
immunity 2600
jailnce 2601
majed 2602
rent 2603
saudi 2604
spokesperson 2605
suspicion 2606
thr 2607
wallingford 2608
within 2609
worker 2610
affect 2611
artist 2612
attire 2613
boycott 2614
closed 2615
continuously 2616
datsik 2617
defeat 2618
desensitize 2619
difficult 2620
diligent 2621
drinkingdose 2622
edm 2623
electronic 2624
electronic music 2625
ethical 2626
even though 2627
fame 2628
favorite 2629
festival 2630
fight back 2631
find guilty 2632
fuel 2633
fuk 2634
get rid 2635
harm 2636
hate 2637
human 2638
ideology 2639
involve sexual 2640
love one 2641
minimize 2642
miss 2643
music community 2644
overuse 2645
plague 2646
prevent 2647
production 2648
psychologically 2649
rat

undetected 3578
usually 3579
yvonne 3580
airborne 3581
boring 3582
bragg 3583
burris 3584
carolinabase 3585
crystal 3586
disobey 3587
erik 3588
fayobservercom 3589
fivesentence 3590
forcible 3591
forfeit 3592
formerly 3593
fort 3594
generally 3595
heinous 3596
hometown 3597
innocence 3598
major 3599
military 3600
proceed 3601
provide detail 3602
say monday 3603
scant 3604
sixday 3605
therefor 3606
violentmartial 3607
accord criminal 3608
complaint file 3609
complaint say 3610
consume 3611
encounter 3612
gordon 3613
incident reportedly 3614
jailce 3615
kanawha 3616
magistrate 3617
nitro 3618
preemployment 3619
probationary 3620
propertysurety 3621
prosecuter sexual_assault 3622
seconddegree 3623
videoe 3624
app 3625
blindly 3626
curly 3627
knife 3628
malia 3629
middle 3630
mistakenly 3631
motto 3632
parka 3633
photograph 3634
sasha 3635
sedan 3636
sidwell 3637
silvercolore 3638
thirty 3639
accountability 3640
alamo 3641
allegation make 3642
aught 3643
blog 3644
blogger 3645
chud 3646
ci

discretion 4578
eager 4579
exception 4580
furnish 4581
hangout 4582
healing 4583
impression 4584
info 4585
laugh 4586
layout 4587
liquor 4588
living 4589
melissa 4590
melissaschuman 4591
naturally 4592
nick 4593
outpouring 4594
religious 4595
reppe 4596
roommate 4597
schuman 4598
value 4599
virgin 4600
vocal 4601
accord police 4602
accredit 4603
ala 4604
allege sexually 4605
background 4606
background check 4607
behalf 4608
checkr 4609
connote 4610
councilman 4611
councilwoman 4612
database 4613
deactivate 4614
driverpartner 4615
early sunday 4616
forceful 4617
girl take 4618
girl tell 4619
honolulu 4620
kobayashi 4621
makiki 4622
menor 4623
mililani 4624
moana 4625
ongoing 4626
oversee 4627
oversight 4628
penetration 4629
prosecuter firstdegree 4630
public safety 4631
regulate 4632
rider 4633
rideshare 4634
rideshare company 4635
ridesharing 4636
ron 4637
screener 4638
screening 4639
shopping 4640
smart 4641
sufficient 4642
undergo 4643
wadahara 4644
wrinkle 4645
advertise 4646
advert

hampton 5578
heck 5579
heritage 5580
immigration 5581
informational 5582
laceration 5583
like happen 5584
nurses 5585
plano 5586
rivera 5587
sane 5588
sgt hampton 5589
sherman 5590
surgical 5591
willing 5592
actually happen 5593
album 5594
apart 5595
billboard 5596
cosbys 5597
desire 5598
echelon 5599
extricate 5600
freekesha 5601
gain 5602
grossly 5603
incentive 5604
judge rule 5605
madonna 5606
professionally 5607
protract 5608
rainn 5609
socialize 5610
somewhat 5611
sooner 5612
stern 5613
underreported 5614
worth 5615
wringer 5616
ysupportsrape 5617
finland 5618
forward allegation 5619
forward claim 5620
judd 5621
laak 5622
ninni 5623
refresh 5624
suddenly 5625
thread 5626
accompany 5627
adversarial 5628
alexander 5629
alisa 5630
amendment 5631
approjailate 5632
assertive 5633
bolster 5634
briefing 5635
chasm 5636
clarity 5637
commonwealth 5638
congress 5639
contravene 5640
convention 5641
countrys 5642
culpability 5643
deaf 5644
democrat 5645
diane 5646
directive 5647
disagree 5648

infraction 6577
insert 6578
lcps 6579
loudoun 6580
lpd 6581
say believe 6582
shelter 6583
student school 6584
teammate 6585
timesmirror 6586
wayde 6587
woodgrove 6588
arena 6589
babe 6590
baxter 6591
color 6592
farm 6593
fort worth 6594
granbury 6595
grande 6596
hidalgo 6597
hidalgo county 6598
holiday 6599
intern 6600
juvenile jail 6601
marie 6602
mcallen 6603
passage 6604
rangers 6605
reached 6606
rio 6607
rite 6608
savannah 6609
selena 6610
services 6611
sixmonth 6612
southwest 6613
destination 6614
kissimmee 6615
osceola 6616
penamelo 6617
wesh 6618
wooded 6619
broomstick 6620
chester 6621
classman 6622
conestoga 6623
cosign 6624
disgusting 6625
dykes 6626
everyone think 6627
excoriate 6628
extraordinary 6629
internal investigation 6630
lieute 6631
mcginnis 6632
noone 6633
outdated 6634
ritual 6635
scribe 6636
sixpage 6637
state victim 6638
talkabouttown 6639
thankfully 6640
unsupervised 6641
milby 6642
herthey 6643
mountain 6644
rady 6645
ramirez 6646
sdpd 6647
churchs 6648
congre

leed 7577
malebodie 7578
nicola 7579
overview 7580
padded 7581
reassignment 7582
slammer 7583
stephen 7584
surgery 7585
telegraph 7586
touching 7587
wakefield 7588
wink 7589
cuban 7590
factually 7591
lemon 7592
thirdhand 7593
untoward 7594
blindfold 7595
button 7596
defiance 7597
deportation 7598
gag 7599
kgw 7600
martinez 7601
methinduce 7602
multnomah 7603
psychosis 7604
reentere 7605
reese 7606
sanctuary 7607
scarf 7608
sergio 7609
sock 7610
transient 7611
undocumented 7612
phoenix 7613
bloodied 7614
chitwood 7615
daytona 7616
frolander 7617
independently 7618
knot 7619
motionless 7620
puddle 7621
raymond 7622
reuter 7623
scratch 7624
swollen 7625
achievement 7626
achievement prep 7627
anticipate 7628
automate 7629
bewilder 7630
daygarbees 7631
elementary 7632
garbee 7633
goodbye 7634
gracefully 7635
handing 7636
heal 7637
lunging 7638
muffin 7639
pcp 7640
prep 7641
profession 7642
school teacher 7643
shantora 7644
smack 7645
snapshot 7646
stressor 7647
stub 7648
unpaid 7649
acrosst

## Ensure the data looks correct
### i.e., are words properly lemmatized, tokenized, etc?

In [37]:
# View information about each corpus

print("\n[INFO] Article Dataset....................\n")
print('Train Data', str(len(text_train_stopwords)))
print('Test Data', str(len(text_test_stopwords)))

print("\n\n=================================================\n\n")

print("[INFO] Training Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_train_stopwords[:1])

print("\n[INFO] Testing Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_test_stopwords[:1])


[INFO] Article Dataset....................

Train Data 240
Test Data 240




[INFO] Training Sample Data
--------------------------------------
--------------------------------------

[INFO] Article Text....................

[['police', 'investigate', 'report', 'sexual', 'assault', 'montclair', 'high', 'school', 'involve', 'two', 'minor', 'official', 'say', 'principal', 'earle', 'let', 'parent', 'know', 'allege', 'incident', 'school', 'allgender', 'bathroom', 'monday', 'email', 'essex', 'county', 'prosecutor', 'office', 'investigate', 'police', 'say', 'victim', 'girl', 'allege', 'attacker', 'boy', 'criminal', 'forced', 'yet', 'police investigate', 'report sexual_assault', 'high school', 'involve two', 'official say', 'allege incident', 'county prosecutor', 'police say', 'criminal forced']]

[INFO] Testing Sample Data
--------------------------------------
--------------------------------------

[INFO] Article Text....................

[['texarkana', 'texas', 'police', 'department', 'a

In [38]:

print("\n[INFO] Article Text Dictionary and Corpus.....................\n")

print("\nNumber of unique words in the training set's dictionary:")
print(str(len(dictionary_text_train)))

print("\nNumber of unique words in the testing set's dictionary:")
print(str(len(dictionary_text_test)))

print("\nNumber of rows in the testing set:")
print(str(len(corpus_text_test)))

print("\nNumber of rows in the training set:")
print(str(len(corpus_text_train)))


[INFO] Article Text Dictionary and Corpus.....................


Number of unique words in the training set's dictionary:
9013

Number of unique words in the testing set's dictionary:
8330

Number of rows in the testing set:
240

Number of rows in the training set:
240


# 3. Select number of topics (k)

## 3a. Setting Model Hyperparameters 

### 1. Beta (referred to as 'eta' in gensim) = the [distribution of the] number of words per topic
### 2. Alpha =  the [distribution of the] number of topics per document

#### Both alpha and eta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’:
        - ‘auto’ = the model learns the best values for the hyperparameters as it is trained       
                   on more and more data (i.e., it learns an asymmetric prior from the corpus). See 
                   http://jonathan-huang.org/research/dirichlet/dirichlet.pdf for an overview             
        - 'asymmetric' = uses a fixed, normalized asymmetric prior of 1.0 / k (number of topics)
        - 'symmetric' = uses a distribution of 1 / k (number of topics)

In Bayesian statistics, we have to define the distributions (i.e., prior distributions) of unknown variables (e.g., ϕ and θ) before running the data analysis. These should be defined based on theoretical assumptions about how we think the topics are actually distributed amongst our data. In our case, it makes sense to assume that some documents discuss more/less topics than other documents; thus we set the document-topic distribution to be asymmetric. 
#### Thus, we reccommend setting alpha = 'auto' as it sets the distribution to be asymmetric, and learns the best alpha value (i.e., lowest perplexity scores) from the data itself. It also makes sense to assume that some topics contain more words than others. Thus, we reccomend setting the distribution of the number of words per topic to be asymmetric as well. 

### 3. Passes = number of laps the model goes through the entire corpus
        - Incrasing the number of passes reduces model bias
### 4. Chunksize = number of documents to load into memory at a time 
        - smaller chunksizes save memory, but take longer to train
### 5. Update_every = number of chunks to process before maximizing your model 
### 6. Random state = sets the seed to make the model reproducable
### 7. Number of topics (k)
Researchers must tell the model how many (k) prominent goal inference topics to sort each ‘bag of words’ document into. Problematically, several different k-values might work. Thus, we use a metric called perplexity to help us to determine the optimal number of topics. The utility in perplexity comes from comparing perplexity values across models with differing k-values to pinpoint the best model (i.e., the model with the lowest perplexity score). 

Thus, we recommend testing the perplexity of the model with a variety of k values, and then runing the final model using the k-value with the lowest perplexity score.


## 3b. Compute model perplexity scores
We run the model with different topic numbers (k) to find the optimal topics number

We will start by looking at k=1-32 topics

In [39]:
k = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
random_state=42
update_every=2
chunksize=1800
passes=600
iterations=1600
alpha='auto'
eta='auto'
per_word_topics=True

lda_model_text_trains = []
lda_model_text_tests = []

In [286]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] Article Text Training Data LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_train = gensim.models.ldamodel.LdaModel(corpus=corpus_text_train,
                                                  id2word=dictionary_text_train,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

    lda_model_text_trains.append(lda_model_text_train)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_train.log_perplexity(corpus_text_train))


***********************************************************************
[INFO] Article Text Training Data LDA Results....
***********************************************************************

Perplexity (num_topics = 2):  -7.796055046679738

Perplexity (num_topics = 4):  -7.782917860978472

Perplexity (num_topics = 6):  -7.814755115892571

Perplexity (num_topics = 8):  -7.836687645516105

Perplexity (num_topics = 10):  -7.859251299619402

Perplexity (num_topics = 12):  -7.882492604344419

Perplexity (num_topics = 14):  -7.903351307101996

Perplexity (num_topics = 16):  -7.9169580113246125

Perplexity (num_topics = 18):  -7.93318788534051

Perplexity (num_topics = 20):  -7.941399663894025

Perplexity (num_topics = 22):  -7.957818621067821

Perplexity (num_topics = 24):  -7.972209064262733

Perplexity (num_topics = 26):  -7.985005679507375

Perplexity (num_topics = 28):  -7.996059802274738

Perplexity (num_topics = 30):  -8.015043843599607

Perplexity (num_topics = 32):  -8.020541479

### We can see that k=3-9 topics have the most drastic decreases in perplexity scores. 
### Thus, we opt to take a closer look at these 4 models

### Here is the representative 'Bag of Words' for each topic in the Training Dataset
As you can see, this doesn't give us enough information about each topic to assess model fit. Thus, we utilize a interactive visualization

In [298]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Training Data Topic Outputs....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_trains[j].show_topics(num_topics=i, num_words=10, log=True, formatted=True)
    for topic in topics:
        print(topic)


***********************************************************************
[INFO] Article Text Training Data Topic Outputs....
***********************************************************************

[INFO] Num_topics: 1

(2, '0.027*"say" + 0.024*"police" + 0.013*"assault" + 0.009*"victim" + 0.008*"arrest" + 0.008*"sexual" + 0.008*"woman" + 0.007*"yearold" + 0.007*"accord" + 0.006*"man"')

[INFO] Num_topics: 2

(1, '0.031*"say" + 0.009*"woman" + 0.007*"tell" + 0.006*"one" + 0.006*"would" + 0.005*"school" + 0.005*"assault" + 0.005*"patient" + 0.005*"sexual" + 0.005*"police"')
(5, '0.020*"police" + 0.017*"say" + 0.010*"assault" + 0.009*"arrest" + 0.008*"sexual" + 0.007*"victim" + 0.007*"yearold" + 0.007*"accord" + 0.006*"man" + 0.006*"girl"')

[INFO] Num_topics: 3

(6, '0.023*"say" + 0.011*"sexual" + 0.009*"assault" + 0.005*"student" + 0.005*"happen" + 0.005*"butter" + 0.004*"sexual assault" + 0.004*"woman" + 0.004*"get" + 0.004*"man"')
(0, '0.026*"say" + 0.010*"assault" + 0.008*"woman" + 

IndexError: list index out of range

# 3c. Model selection using pyLDAvis Visualization 
## We're interested in the models with k= 3,4,5,6,7 topics, so we visualize those using the pyLDAvis documentation (https://www.aclweb.org/anthology/W14-3110.pdf)

## When selecting the optimal number of topics, we need to find a balance between underfitting and underfitting the model

### OVERFITTING (i.e., too many topics): 
#### Practical takeaway- this can make it harder for human coders to label
#### pros- less overlap amongst topics
#### cons- less coherence amongst the words in each topic; decreased varaiance in each document's distirbution of topics

### UNDERFITTING (i.e., too few topics): 
#### Practical takeaway- doesn't produce enough variance, limiting options for statistical analyses
#### pros- more coherent 'bag of words' comprising each topic; increased varaiance in the distirbution of topics in each document
#### cons- more overlap amongst topics

## Reading pyLDAvis

### LEFT PANE:
- The area of each circle represents the prevalence of each topic over the entire corpus 
- The distance between the center of circles indicate the similarity between topics (i.e., inter-topic differences)

---------------------------------------------------------------------------------------------------------

### RIGHT PANE:
- If you hover over a particular topic on the left, the histogram on the right side lists the top 30 most relevant terms
- The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term
- A slider at the top can adjust the relevence metric (λ); however, for our purposes, be sure it i set to λ = 1. For more information on the relevance metric, see (https://www.aclweb.org/anthology/W14-3110.pdf). 








In [309]:
print("k = 14 topics ...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[6], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 14 topics ...................


In [310]:
print("k = 10 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[4], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 10 topics...................


# k=5 topics looks the best to me! The topics appear to be relatively spread out, with no overlapping topics
# At the same time, the 'bag of words' comprising each topic appears coherent enough to label. 

In [311]:
print("k = 12 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[5], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 12 topics...................


In [312]:
print("k = 8 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[3], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 8 topics...................


In [313]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[2], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 6 topics...................


In [314]:
print("k = 4 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[1], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

k = 4 topics...................


# 4. Model Validation for Test Dataset with 5 topics (Num_Topics = 5)

### Now, we check to make sure the training dataset shows a similar trends, to offer evidence of reproducability. We do this by:

(1) Validating that the the testing set perplexity value for k=5 topics relatively similar to the training set (it is)

(2) check the fit of the pyLDAvis visualization for indicators of similar model fit (it looks good)

## 4a. Get Perplexity Scores of the Testing Dataset

In [315]:
print("\n***********************************************************************")
print("[INFO] Article text Testing Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_test = gensim.models.ldamodel.LdaModel(corpus=corpus_text_test,
                                                  id2word=dictionary_text_test,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                                      
                                                  per_word_topics=per_word_topics)

    lda_model_text_tests.append(lda_model_text_test)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_test.log_perplexity(corpus_text_test))


***********************************************************************
[INFO] Article text Testing Dataset LDA Results....
***********************************************************************

Perplexity (num_topics = 1):  -7.81853972914781

Perplexity (num_topics = 2):  -7.8099506163040235

Perplexity (num_topics = 3):  -7.849578165574469

Perplexity (num_topics = 4):  -7.872137101924929

Perplexity (num_topics = 5):  -7.860978996355535

Perplexity (num_topics = 6):  -7.860813348659627

Perplexity (num_topics = 7):  -7.867544201941664

Perplexity (num_topics = 8):  -7.871373485118535

Perplexity (num_topics = 9):  -7.882356706735592

Perplexity (num_topics = 10):  -7.897328224484313

Perplexity (num_topics = 11):  -7.893920758916011

Perplexity (num_topics = 12):  -7.897566461503732

Perplexity (num_topics = 13):  -7.9050351906700165

Perplexity (num_topics = 14):  -7.90396301725211

Perplexity (num_topics = 15):  -7.917879398442028

Perplexity (num_topics = 16):  -7.920450538385

## 4b. Check the pyLDAvis visualization for indicators of similar model fit (it looks good)

In [316]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Testing Data Topic Output....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_tests[j].show_topics(num_topics=i, num_words=5, log=True, formatted=True)
    for topic in topics:
        print(topic)


***********************************************************************
[INFO] Article Text Testing Data Topic Output....
***********************************************************************

[INFO] Num_topics: 1

(1, '0.022*"say" + 0.012*"police" + 0.011*"assault" + 0.009*"sexual" + 0.005*"woman"')

[INFO] Num_topics: 2

(2, '0.012*"say" + 0.011*"sexual" + 0.009*"assault" + 0.004*"case" + 0.004*"student"')
(1, '0.022*"say" + 0.011*"police" + 0.010*"assault" + 0.008*"sexual" + 0.006*"school"')

[INFO] Num_topics: 3

(2, '0.009*"sexual" + 0.007*"assault" + 0.005*"winston" + 0.004*"case" + 0.004*"say"')
(0, '0.025*"say" + 0.012*"assault" + 0.009*"police" + 0.008*"woman" + 0.008*"sexual"')
(1, '0.020*"say" + 0.010*"police" + 0.010*"assault" + 0.008*"sexual" + 0.006*"woman"')

[INFO] Num_topics: 4

(2, '0.012*"sexual" + 0.009*"assault" + 0.008*"say" + 0.005*"winston" + 0.005*"police"')
(4, '0.021*"say" + 0.009*"woman" + 0.007*"tell" + 0.007*"police" + 0.006*"defendent"')
(0, '0.025*"sa

(2, '0.026*"sall" + 0.013*"affidavit" + 0.010*"coast" + 0.010*"coworker" + 0.010*"affidavit say"')
(34, '0.010*"fbi" + 0.007*"back" + 0.007*"judge" + 0.007*"republicans" + 0.007*"investiage"')
(8, '0.033*"church" + 0.017*"bishop" + 0.017*"allegation" + 0.009*"claim" + 0.009*"time"')
(17, '0.019*"say" + 0.009*"police" + 0.009*"case" + 0.006*"december" + 0.006*"monica"')
(23, '0.024*"say" + 0.023*"police" + 0.014*"allen say" + 0.014*"allen" + 0.012*"township"')
(27, '0.017*"sexual" + 0.012*"woman" + 0.008*"assault" + 0.007*"believe" + 0.007*"bill"')
(33, '0.011*"would" + 0.011*"noo" + 0.009*"sinner" + 0.009*"jesus" + 0.009*"school"')
(28, '0.021*"schneider" + 0.012*"bugai" + 0.011*"victim" + 0.011*"would" + 0.009*"sfpd"')
(14, '0.013*"year" + 0.011*"one" + 0.011*"coffey" + 0.010*"player" + 0.010*"former"')
(36, '0.022*"girl" + 0.016*"child" + 0.012*"sheriffs" + 0.012*"say" + 0.010*"deputy"')
(32, '0.021*"say" + 0.011*"police" + 0.010*"woman" + 0.008*"assault" + 0.007*"booker"')
(9, '0.01

IndexError: list index out of range

In [317]:
print("k = 5 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[3], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

k = 5 topics...................


In [318]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[4], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

k = 6 topics...................


## Now that we've done this, we can run the full model (test & training) with k=5 topics

# 5. Run LDA model on full dataset

## 5a. Repeat preprocessing steps 2a-2g on the full dataset

In [319]:
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

480 Rows.  15 Columns.


In [320]:
indices = [0,333,777,932]

samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Sample Tickets:")
display(samples)

Sample Tickets:




Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike



Unnamed: 0,uid,title,first_com,all_com,text,Year,Month,week,Post_link,Article_link,Article_domain,date,engagement,Platform,Gender
0,1.0,"Daisy Coleman, teen at center of Maryville sex...",Those poor girls. Both of them.,"Gentry County officials last week, meanwhile, ...","Daisy Coleman, the northwest Missouri teenager...",2014.0,1.0,1.0,://www.reddit.com/r/news/comments/1upxir/dai...,://www.kansascity.com/news/special-reports/m...,Kansas City Star,41646.7,81.0,Reddit,
1,334.0,Former Waldron Schools Secretary Pleads Guilty...,Former Waldron Schools Secretary Pleads Guilty...,,A former Waldron School District secretary was...,,,,://twitter.com/DavidLo55938576/status/685248...,://www.5new line.com/article/news/local/o...,,2016-01-07 23:56:11+00:00,0.0,Twitter,c
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,


In [321]:
# number of null values in each column of the full dataset
data.isnull().sum()

uid                 0
title               1
first_com           0
all_com           222
text                0
Year              240
Month             240
week              240
Post_link           0
Article_link        0
Article_domain    243
date                1
engagement          3
Platform            0
Gender            240
dtype: int64

In [322]:
#finding null values in the full dataset
print("=============Full Dataset=============")
data['text'] = data['text']

print('Number of rows in text:', len(data['text']))

print("-------------------")
print("Null Values in text: {}".format(data['text'].isnull().sum()))

#Removing null values from the full dataset

text = data['text']

print("After removing Null Values in Full Dataset")
print("Null Values in text: {}".format(data['text'].isnull().sum()))



Number of rows in text: 480
-------------------
Null Values in text: 0
After removing Null Values in Full Dataset
Null Values in text: 0


In [323]:
#----------------------------------------Convert everything to Lower case--------------------------------------------------

##Train Data
text = text.str.lower()

print("=======Full Dataset==============\n")
print(text.head(1))



0    daisy coleman, the northwest missouri teenager...
Name: text, dtype: object


In [324]:
##Remove special characters from full dataset

text_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text]
text_regex = [re.sub(r'\'', '', sent) for sent in text_regex]
text_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_regex]
text_regex = [re.sub(r'\d', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'_', ' ',  sent) for sent in text_regex]

print("=======Full Dataset==============\n")
print("\n[INFO] text....................\n")
print(text_regex[:2])




[INFO] text....................

['daisy coleman the northwest missouri teenager whose alleged  sexual assault maryville has garnered national attention was recovering tuesday from suicide attempt family friend familiar with the situation told the star\nrobin bourland longtime acquaintance the coleman family said coleman was being treated kansas city childrens psychiatric hospital after ingesting unidentified pills sunday evening colemans mother melinda coleman first released news the suicide attempt facebook post monday\nits been heartbreaking for them bourland said the colemans who now live albany this has been really long drawnout battle and its heartbreaking see something like this happening\nfederal judge kansas accused sexual\nharassment submits his resignation\naccording bourland the incident stemmed from online harassment that daisy coleman  received after attending party over the weekend disparaging facebook post generated   additional harsh attacks said bourland and just es

In [325]:
## Full Data set
text_tokens = list(tokenize(text_regex))


print("\n[INFO] text....................\n")
print(text_tokens[:2])


[INFO] text....................

[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'alleged', 'sexual', 'assault', 'maryville', 'has', 'garnered', 'national', 'attention', 'was', 'recovering', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'told', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'said', 'coleman', 'was', 'being', 'treated', 'kansas', 'city', 'childrens', 'psychiatric', 'hospital', 'after', 'ingesting', 'unidentified', 'pills', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'released', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', 'its', 'been', 'heartbreaking', 'for', 'them', 'bourland', 'said', 'the', 'colemans', 'who', 'now', 'live', 'albany', 'this', 'has', 'been', 'really', 'long', 'drawnout', 'battle', 'and', 'its', 'heartbreaking', 'see', 'something', 'like', 'this', 'happening', 'federal', 'jud

In [326]:
# Lemmatize full dataset, keeping only noun, adj, vb, adv

text_lemma = lemmatization(text_tokens, allowed_postags=['ADJ', 'VERB', 'ADV', 'SCONJ', 'PRON', 'PART', 'INTJ', 'AUX', 'ADV', 'ADP', 'ADJ'])
print(str(len(text_lemma)))
print(text_lemma[:4])

480
[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'have', 'garner', 'national', 'attention', 'be', 'recover', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'tell', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'say', 'coleman', 'be', 'be', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'after', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'release', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', '-PRON-', 'be', 'heartbreake', 'for', '-PRON-', 'bourland', 'say', 'the', 'coleman', 'who', 'now', 'live', 'albany', 'this', 'have', 'be', 'really', 'long', 'drawnout', 'battle', 'and', '-PRON-', 'heartbreaking', 'see', 'something', 'like', 'this', 'happen', 'federal', 'judge', 'kansas', 'accuse', 'sexual', 'harassment', 's

In [327]:
#Trigrams and Bigrams in full dataset
            
text_bigram = Phrases(text_lemma, min_count=3, delimiter=b' ', threshold=1)
text_trigram = Phrases(text_bigram[text_lemma], threshold=1)

text_bigram_mod = gensim.models.phrases.Phraser(text_bigram)
text_trigram_mod = gensim.models.phrases.Phraser(text_trigram)

for idx in range(len(text_lemma)):
    for token in text_trigram_mod[text_bigram_mod[text_lemma[idx]]]:
        #print(token)
        if ' ' in token:
            text_lemma[idx].append(token)
print("\n[INFO] text....................\n")
print(text_lemma[:2])


[INFO] text....................

[['daisy', 'coleman', 'the', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'have', 'garner', 'national', 'attention', 'be', 'recover', 'tuesday', 'from', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'with', 'the', 'situation', 'tell', 'the', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'the', 'coleman', 'family', 'say', 'coleman', 'be', 'be', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'after', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'coleman', 'first', 'release', 'news', 'the', 'suicide', 'attempt', 'facebook', 'post', 'monday', '-PRON-', 'be', 'heartbreake', 'for', '-PRON-', 'bourland', 'say', 'the', 'coleman', 'who', 'now', 'live', 'albany', 'this', 'have', 'be', 'really', 'long', 'drawnout', 'battle', 'and', '-PRON-', 'heartbreaking', 'see', 'something', 'like', 'this', 'happen', 'federal', 'judge', 'kansas', 'accus

In [328]:
##Remove stopwords from training set

text_stopwords = remove_stopwords(text_lemma)


print("\n[INFO] text....................\n")
print(text_stopwords[:2])




[INFO] text....................

[['daisy', 'northwest', 'missouri', 'teenager', 'whose', 'allege', 'sexual', 'assault', 'maryville', 'garner', 'national', 'attention', 'recover', 'tuesday', 'suicide', 'attempt', 'family', 'friend', 'familiar', 'situation', 'tell', 'star', 'robin', 'bourland', 'longtime', 'acquaintance', 'family', 'say', 'treat', 'kansas', 'city', 'children', 'psychiatric', 'hospital', 'ingest', 'unidentified', 'pill', 'sunday', 'evening', 'colemans', 'mother', 'melinda', 'first', 'release', 'news', 'suicide', 'attempt', 'facebook', 'post', 'monday', 'heartbreake', 'bourland', 'say', 'live', 'albany', 'really', 'long', 'drawnout', 'battle', 'heartbreaking', 'see', 'something', 'like', 'happen', 'federal', 'judge', 'kansas', 'accuse', 'sexual', 'harassment', 'submit', 'resignation', 'accord', 'bourland', 'incident', 'stem', 'online', 'harassment', 'daisy', 'receive', 'attend', 'party', 'weekend', 'disparage', 'facebook', 'post', 'generate', 'additional', 'harsh', 'atta

In [329]:
## Generate Corpus and Dictionary for full dataset

dictionary_text = corpora.Dictionary(text_stopwords)
dictionary_text.filter_extremes(no_below=.01, no_above=0.99)
corpus_text = [dictionary_text.doc2bow(text) for text in text_stopwords]
pickle.dump(corpus_text, open('/Users/me/Desktop/LDA_Main/project2/data_articles/corpus_text.pkl', 'wb'))
dictionary_text.save('/Users/me/Desktop/LDA_Main/project2/data_articles/dictionary_text.gensim')

## 5b. Ensure the model perplexity score for k=5 topics is relatively similar to the perplexity of the testing and training sets (it is)

In [330]:
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta= 'auto'
per_word_topics=True

lda_model_text = []

In [331]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                            
                                                  per_word_topics=per_word_topics)

    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (num_topics = 1):  -7.427279812484941

Perplexity (num_topics = 2):  -7.389076487519594

Perplexity (num_topics = 3):  -7.3666858749920285

Perplexity (num_topics = 4):  -7.380203813789564

Perplexity (num_topics = 5):  -7.403557617739715

Perplexity (num_topics = 6):  -7.407368698753037

Perplexity (num_topics = 7):  -7.413200647619442

Perplexity (num_topics = 8):  -7.4111298751562025

Perplexity (num_topics = 9):  -7.411717042523926

Perplexity (num_topics = 10):  -7.4115652399320044

Perplexity (num_topics = 11):  -7.424479865980419

Perplexity (num_topics = 12):  -7.4231746774150125

Perplexity (num_topics = 13):  -7.423783190119228

Perplexity (num_topics = 14):  -7.422064043632383

Perplexity (num_topics = 15):  -7.415118205091898

Perplexity (num_topics = 16):  -7.409866194398966

Pe

## 5c. Run final model

In [333]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 4):  -7.380203813789564


## 5d. Visualize Model Results

In [334]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 4...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 4

(0, '0.038*"say" + 0.025*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"tell" + 0.008*"accord" + 0.008*"county" + 0.007*"arrest"')
(1, '0.028*"sexual" + 0.020*"assault" + 0.019*"say" + 0.018*"student" + 0.013*"university" + 0.012*"college" + 0.010*"campus" + 0.008*"school" + 0.008*"report" + 0.007*"rape"')
(2, '0.027*"say" + 0.013*"woman" + 0.009*"sexual" + 0.008*"tell" + 0.008*"assault" + 0.007*"would" + 0.006*"come" + 0.006*"time" + 0.005*"year" + 0.005*"make"')
(3, '0.023*"say" + 0.010*"driver" + 0.008*"woman" + 0.007*"assault" + 0.007*"tell" + 0.006*"defendent" + 0.006*"go" + 0.006*"get" + 0.005*"would" + 0.005*"like"')
text.....k = 5...................



Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





In [335]:
# Initializing LDA Models and Parameters
topic_number = 5
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 5):  -7.403557617739715


In [336]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 5...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 5

(0, '0.039*"say" + 0.026*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.009*"woman" + 0.008*"accord" + 0.008*"tell" + 0.008*"county" + 0.007*"arrest"')
(1, '0.032*"sexual" + 0.022*"student" + 0.022*"assault" + 0.017*"say" + 0.016*"university" + 0.015*"college" + 0.012*"campus" + 0.010*"school" + 0.009*"report" + 0.008*"rape"')
(2, '0.026*"say" + 0.013*"woman" + 0.010*"sexual" + 0.009*"tell" + 0.009*"assault" + 0.007*"would" + 0.007*"come" + 0.006*"time" + 0.006*"year" + 0.006*"know"')
(3, '0.026*"say" + 0.011*"driver" + 0.009*"woman" + 0.008*"assault" + 0.007*"defendent" + 0.007*"tell" + 0.007*"go" + 0.006*"get" + 0.006*"would" + 0.005*"want"')
(4, '0.022*"say" + 0.011*"saleem" + 0.010*"woman" + 0.007*"school" + 0.005*"jack" + 0.005*"mozaffar" + 0.005*"one" + 0

In [337]:
# Initializing LDA Models and Parameters
topic_number = 6
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 6):  -7.407368698753037


In [338]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 6

(0, '0.039*"say" + 0.027*"police" + 0.017*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"accord" + 0.008*"county" + 0.008*"girl" + 0.008*"tell"')
(1, '0.038*"sexual" + 0.028*"assault" + 0.019*"student" + 0.014*"college" + 0.013*"say" + 0.012*"university" + 0.012*"rape" + 0.011*"campus" + 0.008*"report" + 0.008*"school"')
(2, '0.027*"say" + 0.013*"woman" + 0.012*"sexual" + 0.011*"assault" + 0.009*"tell" + 0.007*"come" + 0.006*"time" + 0.006*"year" + 0.006*"would" + 0.005*"claim"')
(3, '0.028*"say" + 0.010*"woman" + 0.009*"tell" + 0.008*"would" + 0.008*"get" + 0.007*"go" + 0.007*"defendent" + 0.006*"know" + 0.006*"assault" + 0.006*"want"')
(4, '0.024*"say" + 0.012*"saleem" + 0.011*"woman" + 0.008*"school" + 0.006*"jack" + 0.006*"mozaffar" + 0.006*"one" + 0.00

In [339]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))


***********************************************************************
[INFO] text Full Dataset LDA Results....
***********************************************************************

Perplexity (topic_number = 4):  -7.380203813789564


In [340]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)


***********************************************************************
[INFO] text Full Dataset Model Results....
***********************************************************************

[INFO] Num_topics: 4

(0, '0.038*"say" + 0.025*"police" + 0.016*"assault" + 0.012*"sexual" + 0.011*"victim" + 0.010*"woman" + 0.008*"tell" + 0.008*"accord" + 0.008*"county" + 0.007*"arrest"')
(1, '0.028*"sexual" + 0.020*"assault" + 0.019*"say" + 0.018*"student" + 0.013*"university" + 0.012*"college" + 0.010*"campus" + 0.008*"school" + 0.008*"report" + 0.007*"rape"')
(2, '0.027*"say" + 0.013*"woman" + 0.009*"sexual" + 0.008*"tell" + 0.008*"assault" + 0.007*"would" + 0.006*"come" + 0.006*"time" + 0.005*"year" + 0.005*"make"')
(3, '0.023*"say" + 0.010*"driver" + 0.008*"woman" + 0.007*"assault" + 0.007*"tell" + 0.006*"defendent" + 0.006*"go" + 0.006*"get" + 0.005*"would" + 0.005*"like"')
text.....k = 6...................


## 6. Save the analysis results to an excel file for topic validation

### 6a. Here we generate a column that tells us which topic each response contributed the most to

In [341]:
cols = [color for name, color in mcolors.XKCD_COLORS.items()]
mycolors = [color for name, color in mcolors.XKCD_COLORS.items()]

In [342]:
#  Full Dataset


def format_topics_sentences(ldamodel, corpus, texts):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get the main topic of each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution, and Keywords for each document
        raw_frame = {}
        for j, (topic_num, prop_topic) in enumerate(row):
            #if j < 2:  # => dominant topic
                #wp = ldamodel.show_topic(topic_num)
                #topic_keywords = ", ".join([word for word, prop in wp])

                #sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,7), topic_keywords]), ignore_index=True)
            if j==0:
                raw_frame['Dominant'] = topic_num

            raw_frame['Topic' + str(topic_num)] = round(prop_topic, 4)

            #else:
            #    break
        df = pd.DataFrame(data=raw_frame, index=[0])
        sent_topics_df = sent_topics_df.append(df)
        
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution']

    # Add original text to the end of the output
    #contents = pd.Series(texts)
    #sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords_text = format_topics_sentences(ldamodel=lda_model_text, corpus=corpus_text, texts=text_stopwords)

df_dominant_topic_text = df_topic_sents_keywords_text.reset_index()
# Format
df_dominant_topic_text.index.name='Document_No';

print(df_dominant_topic_text.head(812))
# Format
#df_dominant_topic_text_train = df_topic_sents_keywords_text_train.reset_index()
#df_dominant_topic_text_train.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
#df_dominant_topic_text_train.head(3)


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





             index  Dominant  Topic0  Topic1  Topic2  Topic3
Document_No                                                 
0                0         0  0.7415  0.2584     NaN     NaN
1                0         2  0.0100     NaN  0.9898     NaN
2                0         2  0.2357  0.1125  0.6518     NaN
3                0         1  0.0546  0.9453     NaN     NaN
4                0         0  0.9992     NaN     NaN     NaN
5                0         2     NaN  0.3929  0.6069     NaN
6                0         3  0.2378     NaN     NaN  0.7616
7                0         0  0.9995     NaN     NaN     NaN
8                0         0  0.3910  0.2288  0.3801     NaN
9                0         0  0.9996     NaN     NaN     NaN
10               0         0  0.9996     NaN     NaN     NaN
11               0         2  0.4202     NaN  0.5796     NaN
12               0         0  0.8280     NaN  0.1719     NaN
13               0         3     NaN     NaN     NaN  0.9998
14               0      

## 6b. Generate a dataframe to export the results into

In [344]:
lda_topics_text = np.array(df_dominant_topic_text['Dominant'])
topic0_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic0'])
topic1_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic1'])
topic2_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic2'])
topic3_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic3'])


text = np.array(data['text'])

uid = np.array(data['uid'])

results = { 
    'uid' : uid, 
    'text': text, 
    'lda_topics_text': lda_topics_text, 
    'topic0_contrib_lda_topics_text':topic0_contrib_lda_topics_text,
    'topic1_contrib_lda_topics_text':topic1_contrib_lda_topics_text,
    'topic2_contrib_lda_topics_text':topic2_contrib_lda_topics_text,
    'topic3_contrib_lda_topics_text':topic3_contrib_lda_topics_text,
}

frame = pd.DataFrame(results, columns = [
                                                'uid',
                                                'text', 'lda_topics_text', 
                                                'topic0_contrib_lda_topics_text',
                                                'topic1_contrib_lda_topics_text',
                                                'topic2_contrib_lda_topics_text',
                                                'topic3_contrib_lda_topics_text',

                                              ])



frame.to_excel("./data_articles/lda_results_full_dataset_topic_num_5.xlsx")

## 5d. Export restuls to an .xlsx file

In [345]:
'./data_articles/news_dataset.xlsx'
frame.to_excel("./data_articles/lda_results_All_data_topic_num_4.xlsx")