# modelingeditorialslants 

## LDA Modeling of Topics Discussed in New Articles


## This notebook applies LDA modeling to a an dataset of 480 news articles shared to the '/r/news/' subreddit and Twitter

## Topic Modeling is a text analysis method that uses machine learning algorithms to classify the content of a large body of texts—called a corpus—into meaningful topics. 

## The specific technique used here, Latent Dirichlet Allocation (LDA), identifies underlying patterns of co-occurring words within individual texts to extrapolate latent categories across a corpus of texts (i.e., topics).

### In the following case, we are interested in understanding what characteristics of news articles elicit user engagement on social media relying heavily on the GENSIM, NLTK, spaCy, and SKLearn libraries:

https://radimrehurek.com/gensim/

https://www.nltk.org

https://scikit-learn.org/stable/

https://spacy.io



# Steps

## 1. Preprocessing
    1a. Spell check
    
    1b. Expand contractions

## 2. Text processing

     2a. Partition data into testing and training subsets
 
     2b. Tokenization
     
     2c. Stop Word Removal
     
     2d. Lemmatization
     
     2e. Bigrams and Trigrams
     
     2f. Exclude terms in > 99% and < 1% of documents
     
     2g. Generate Corpus and Dictionary
 

## 2. Selecting the number of topics (k)
 
     2a. Computing Model Perplexity


## 3. Model Results

     3a. pyLDAvis visualization to assist with Topic Labeling
     
     3b. Topic Mixtures (Document-Term Matrix)

 
# Helpful Links:

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

     
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158


# Sampling Procedures
We took a systematic random sample of 50 posts shared to each platform every week for five years (from January 2015 to December 2019, totaling 24,000 posts). Human coders then filtered through all 24,000 posts to extract two posts from each week (one from Reddit and one from Twitter), which shared a news report of a sexual assault. We only included reports of sexual assault; all other news frames were excluded (e.g., stories about rape culture more generally). ### Additionally, only reports written English were retained. If all 50 posts on a particular day were excluded, another 50 posts were sampled from that particular week. Human coders extracted the text from the news articles and their comment threads. When a news article had been taken down, a different post was sampled for that week.

# 1. Preprocessing Procedures:

## 1a. Run your textual data through a spell checker
     Altough there are automated spell checkers, they aren't as accurate as we'd like. Thus, we 
     encourage you to have a human run it through a spellchecker. Although this is time intensive, we 
     believe this is necessary.
## 1b. Expand all english contractions (e.g., "don't" -> "do not")
    Similar to the spellchecker, we needed human coders to do this, to ensure accuracy

# Required Libraries

In [12]:
## Load Required Libraries

#general
import numpy as np
import pandas as pd
import re
import pickle
from IPython.display import display

#setting up Jupyter notebook 
%matplotlib inline
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)

#text preprocessing
import nltk
from nltk.corpus import stopwords

import spacy
from spacy.lang.en import English

from gensim.models import Phrases
from gensim.utils import simple_preprocess

from sklearn.utils import shuffle

#modeling
import gensim
from gensim.models.ldamodel import LdaModel


#plotting
import pyLDAvis
import pyLDAvis.gensim

In [40]:
import nltk
import numpy as np
import pandas as pd
from pandas import read_excel

from nltk.corpus import stopwords
from nltk.tag import pos_tag

from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib
from sklearn.manifold import MDS

import spacy
from spacy.lang.en import English

# Compute bigrams.
from gensim.models import Phrases
from gensim.utils import simple_preprocess

from IPython.display import display 
import vpython as vs
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib.colors as mcolors
from matplotlib.patches import Rectangle
from matplotlib.ticker import FuncFormatter

import mpld3
from mpld3 import plugins, utils
import plotly
import plotly.graph_objs as go
import json
import pyLDAvis
import pyLDAvis.gensim

from sklearn.cluster import KMeans
from sklearn.utils import shuffle

import gensim
from gensim.models.wrappers import LdaMallet
from gensim.test.utils import common_corpus, common_dictionary
from gensim import corpora, models, similarities
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from PIL import *
import pickle

# ----------------------------------------------Misc---------------------------------------
import re
import csv
import os 
import codecs
import collections as cs
import logging
import random

pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)



<IPython.core.display.Javascript object>

# Read in the following:
### 1. dataset
### 2. Stopword location
### 3. Gensim location
### 4. Mallet path

In [13]:
## Set file Paths

#data files
file_location = './data/news_dataset.xlsx'

#stop words
stopwords_location = './data/stopwords.txt'

## Did the dataset load with the correct number of columns and rows?

In [14]:
## Check to make sure the dataset looks correct
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

480 Rows.  6 Columns.


## Make sure the head of the dataset looks correct

In [15]:
samples = pd.DataFrame(data.loc[0:999], columns = data.keys()).reindex(columns=['id', 'text'])
print("Sample Tickets:")
display(samples)

Sample Tickets:


Unnamed: 0,id,text
0,1,"Daisy Coleman, the northwest Missouri teenager..."
1,2,Swiss Cheese Pervert' terrorizes Mayfair GENTL...
2,3,Pat-Down At Defense Intelligence Agency Leads...
3,4,Half of Sexual Abuse Claims in American prison...
4,5,An educator in California was prosecuter with ...
5,6,"For decades, students at Bob Jones University ..."
6,7,"Modern Family star Sarah Hyland, 23, was posin..."
7,8,A recent ruling by the Texas court of Criminal...
8,9,The top Army prosecutor for sexual assault cas...
9,10,Two teenage girls in southern Maryland bullied...


# 2. DATA PREPROCESSING

# 2a. Split data into test and training sets

In [16]:
# Randomimze the order of the rows in the dataframe
data = shuffle(data)

In [17]:
#Partition data into a testing and training sets

from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data, test_size=0.50, random_state=42)
print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  6 Columns.
240 Rows.  6 Columns.
480 Rows.  6 Columns.


In [20]:
#Add an extra column to the dataframe to identify the testing vs the training rows vs the full dataset

data_train['Data_Type'] = "traning_data"
data_test['Data_Type'] = "testing_data"
data['Data_Type'] = "full_data"

print("{} Rows.  {} Columns.".format(*data_train.shape))
print("{} Rows.  {} Columns.".format(*data_test.shape))
print("{} Rows.  {} Columns.".format(*data.shape))

240 Rows.  7 Columns.
240 Rows.  7 Columns.
480 Rows.  7 Columns.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


# 2b. Tokenization
### Tokenization involves coverting the text to lowercase, removing special characters, null values, and punctuation from the text

## Here we can see how many null values there are in each column

In [21]:
# null values in each column of the testing set
data_train.isnull().sum()

id             0
text           0
Year         124
Month        124
week         124
Platform       0
Data_Type      0
dtype: int64

In [22]:
# null values in each column of the training set
data_test.isnull().sum()

id             0
text           0
Year         116
Month        116
week         116
Platform       0
Data_Type      0
dtype: int64

## Now we need to remove null values from the data

In [23]:
#----------------------------------------Removing Null values--------------------------------------------------

#finding null values in the training subset 
print("=============Training Data=============")
data_train['text'] = data_train['text']

print('Number of rows in text:', len(data_train['text']))

print("-------------------")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#finding null values in the testing subset 
print("\n=============Testing Data==============\n")
data_test['text'] = data_test['text']

print('Number of rows in text:', len(data_test['text']))

print("-------------------")

print("Null Values in text: {}".format(data_test['text'].isnull().sum()))


print("\n\n====================Removing Null Values==========================\n\n")
#Removing null values from the training subset 

text_train = data_train['text']

print("After removing Null Values in Train Data")
print("Null Values in text: {}".format(data_train['text'].isnull().sum()))


#Removing null values from the testing subset 

text_test = data_test['text']

print("\n\nAfter removing Null Values in Test Data")
print("Null Values in text: {}".format(data_test['text'].isnull().sum()))

Number of rows in text: 240
-------------------
Null Values in text: 0


Number of rows in text: 240
-------------------
Null Values in text: 0




After removing Null Values in Train Data
Null Values in text: 0


After removing Null Values in Test Data
Null Values in text: 0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


## Convert the text to lowercase

In [24]:
#----------------------------------------Converting to Lower case--------------------------------------------------

##Train Data
text_train = text_train.str.lower()

##Test Data
text_test = text_test.str.lower()

print("=======Train Data==============\n")
print(text_train.head(2))

print("\n=======Test Data==============\n")
print(text_test.head(2))


284    beth* thought brian* was sweet – a little shy ...
336    for decades, students at  — the elite historic...
Name: text, dtype: object


252    police say a woman says she was assaulted in t...
11     mckelvey, 27, was accused of having sexual con...
Name: text, dtype: object


## Remove the following:
- special characters
- alphanumerics
- numbers
- words that appear in the corpus less than twice
- extra spaces

In [25]:
##Remove from training set

text_train_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_train]
text_train_regex = [re.sub(r'\'', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\d', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_train_regex]
text_train_regex = [re.sub(r'_', ' ',  sent) for sent in text_train_regex]

##Remove from testing set

text_test_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text_test]
text_test_regex = [re.sub(r'\'', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\d', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_test_regex]
text_test_regex = [re.sub(r'_', ' ',  sent) for sent in text_test_regex]


print("=======Train Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])

print("\n=======Test Data==============\n")
print("\n[INFO] article_text....................\n")
print(text_test_regex[:2])



[INFO] article_text....................

['police say woman says she was assaulted the stairwell the garland avenue parking garage', 'mckelvey  was accused having sexual contact with two teenagers who attended his church reached plea deal forceds thirddegree sexual abuse and two counts sexual exploitation clergy was sentenced prison term  years for the thirddegree sexual abuse prosecuter and five years each for both sexual exploitation counselor clergy forceds  the first two counts will run consecutively and the last count concurrent could spend  years prison\n\nmckelvey also being added the iowa sex offender registry part the sentence\n\ndes moines police said they were contacted girl and her parents august  regarding the sexual assault\n\npolice said that during the investigation was discovered there could second victim investigators made contact with the second girl also juvenile and confirmed inapprojailate sexual incidents had occurred with her well\n\nmckelvey was youth pastor 

## Remove All Punctuation

In [26]:
def tokenize(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes all punctuation

In [27]:
##Training set
text_train_tokens = list(tokenize(text_train_regex))


print("\n[INFO] article text....................\n")
print(text_train_tokens[:2])

##Testing set
text_test_tokens = list(tokenize(text_test_regex))

print("\n[INFO] article text....................\n")
print(text_test_tokens[:2])


[INFO] article text....................

[['beth', 'thought', 'brian', 'was', 'sweet', 'little', 'shy', 'but', 'trustworthy', 'she', 'knew', 'from', 'her', 'parents', 'look', 'for', 'men', 'who', 'respect', 'women', 'but', 'when', 'got', 'her', 'alone', 'near', 'their', 'university', 'minnesota', 'campus', 'began', 'push', 'too', 'hard', 'too', 'fast', 'didnt', 'stop', 'when', 'she', 'asked', 'him', 'didnt', 'seem', 'care', 'when', 'she', 'said', 'she', 'realized', 'what', 'was', 'about', 'happen', 'and', 'her', 'brain', 'shut', 'down', 'most', 'people', 'still', 'envision', 'sexual', 'assault', 'rare', 'occurrence', 'committed', 'strangers', 'fact', 'out', 'times', 'the', 'rapist', 'knows', 'the', 'victim', 'according', 'the', 'national', 'institute', 'justice', 'the', 'good', 'news', 'thanks', 'partly', 'the', 'campus', 'sexual', 'violence', 'elimination', 'save', 'act', 'more', 'victims', 'are', 'comfortable', 'reporting', 'rape', 'college', 'authorities', 'often', 'confidentially'

# 2c. Lemmatization
Next, the lemmatization tool in the spaCy package transformed words to their base lemma, or dictionary term (Srinivasa-Desikan, 2018). Lemmatization converts different forms of words to their base lemma (e.g., the word “assaulted” and “assaulting” are transformed to “assault”). We felt lemmatization was theoretically necessary, as suffixes and prefixes could confound our results. For example, linguistic masking devices and the use of passive versus active voice have been shown to influence attributions of blame (Leets, 2000; Knobloch-Westerwick & Taylor, 2008).

In [28]:
def lemmatization(texts, allowed_postags=['VERB', 'NOUN']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

In [29]:
nlp = spacy.load('en', disable=['parser', 'ner'])
# Lemmatize Training Set, keeping only nouns and verbs

text_train_lemma = lemmatization(text_train_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_train_lemma)))
print(text_train_lemma[:4])

# Lemmatize Testing Set, keeping only nouns and verbs
text_test_lemma = lemmatization(text_test_tokens, allowed_postags=['VERB', 'NOUN'])
print(str(len(text_test_lemma)))
print(text_test_lemma[:4])

240
[['beth', 'think', 'brian', 'be', 'sweet', 'little', 'shy', 'but', 'trustworthy', '-PRON-', 'know', 'from', '-PRON-', 'parent', 'look', 'for', 'man', 'who', 'respect', 'woman', 'but', 'when', 'get', '-PRON-', 'alone', 'near', '-PRON-', 'university', 'minnesota', 'campus', 'begin', 'push', 'too', 'hard', 'too', 'fast', 'do', 'not', 'stop', 'when', '-PRON-', 'ask', '-PRON-', 'do', 'not', 'seem', 'care', 'when', '-PRON-', 'say', '-PRON-', 'realize', 'what', 'be', 'about', 'happen', 'and', '-PRON-', 'brain', 'shut', 'down', 'most', 'people', 'still', 'envision', 'sexual', 'assault', 'rare', 'occurrence', 'commit', 'stranger', 'fact', 'out', 'time', 'the', 'rapist', 'know', 'the', 'victim', 'accord', 'the', 'national', 'institute', 'justice', 'the', 'good', 'news', 'thank', 'partly', 'the', 'campus', 'sexual', 'violence', 'elimination', 'save', 'act', 'more', 'victim', 'be', 'comfortable', 'reporting', 'rape', 'college', 'authority', 'often', 'confidentially', 'the', 'bad', 'news', 'the

240
[['police', 'say', 'woman', 'say', '-PRON-', 'be', 'assault', 'the', 'stairwell', 'the', 'garland', 'avenue', 'parking', 'garage'], ['mckelvey', 'be', 'accuse', 'have', 'sexual', 'contact', 'with', 'two', 'teenager', 'who', 'attend', '-PRON-', 'church', 'reach', 'plea', 'deal', 'forced', 'thirddegree', 'sexual', 'abuse', 'and', 'two', 'count', 'sexual', 'exploitation', 'clergy', 'be', 'sentence', 'prison', 'term', 'year', 'for', 'the', 'thirddegree', 'sexual', 'abuse', 'prosecuter', 'and', 'five', 'year', 'each', 'for', 'both', 'sexual', 'exploitation', 'counselor', 'clergy', 'forced', 'the', 'first', 'two', 'count', 'will', 'run', 'consecutively', 'and', 'the', 'last', 'count', 'concurrent', 'could', 'spend', 'year', 'prison', 'mckelvey', 'also', 'be', 'add', 'the', 'iowa', 'sex', 'offender', 'registry', 'part', 'the', 'sentence', 'des', 'moines', 'police', 'say', '-PRON-', 'be', 'contact', 'girl', 'and', '-PRON-', 'parent', 'august', 'regard', 'the', 'sexual', 'assault', 'police'

# 2d. Remove Stopwords
## Next, we filtered out commonly occurring words (e.g., “the,” “an”)—known as stop words—to make sure the documents only included meaningful words. 
## NOTE: Edit the stopwords txt file to add additional words to filter out

In [30]:
#Prepare to remove stopwords
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))
newStopWords =[str(x.strip()) for x in open(stopwords_location,'r').read().split('\n')]
stopwords.update(newStopWords)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts]
print(len(stopwords))

4416


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannahstevens/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
##Remove stopwords from training set

text_train_stopwords = remove_stopwords(text_train_lemma)


print("\n[INFO] article text....................\n")
print(text_train_stopwords[:2])

##Remove stopwords from testing set

text_test_stopwords = remove_stopwords(text_test_lemma)


print("\n[INFO] article text....................\n")
print(text_test_stopwords[:2])


[INFO] article text....................

[['beth', 'think', 'brian', 'sweet', 'little', 'shy', 'trustworthy', 'know', 'parent', 'look', 'man', 'respect', 'woman', 'get', 'alone', 'near', 'university', 'minnesota', 'campus', 'begin', 'push', 'hard', 'fast', 'stop', 'ask', 'seem', 'care', 'say', 'realize', 'happen', 'brain', 'shut', 'people', 'still', 'envision', 'sexual', 'assault', 'rare', 'occurrence', 'commit', 'stranger', 'fact', 'time', 'rapist', 'know', 'victim', 'accord', 'national', 'institute', 'justice', 'good', 'news', 'thank', 'partly', 'campus', 'sexual', 'violence', 'elimination', 'save', 'act', 'victim', 'comfortable', 'reporting', 'rape', 'college', 'authority', 'often', 'confidentially', 'bad', 'news', 'number', 'rape', 'go', 'decade', 'research', 'consistently', 'indicate', 'college', 'woman', 'victim', 'nonconsensual', 'sex', 'lion', 'den', 'still', 'ask', 'young', 'people', 'fend', 'say', 'caroline', 'palmer', 'law', 'policy', 'manager', 'minnesota', 'coalition', 's

# 2e. Bigrams and Trigrams
##    Bigrams are two words that frequently co-occur together
##    Trigrams are three words that frequently co-occur together

In [32]:
##Training set
            
text_train_bigram = Phrases(text_train_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_train_trigram = Phrases(text_train_bigram[text_train_stopwords], threshold=1)

text_train_bigram_mod = gensim.models.phrases.Phraser(text_train_bigram)
text_train_trigram_mod = gensim.models.phrases.Phraser(text_train_trigram)

for idx in range(len(text_train_stopwords)):
    for token in text_train_trigram_mod[text_train_bigram_mod[text_train_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_train_stopwords[idx].append(token)
            
##Testing set

text_test_bigram = Phrases(text_test_stopwords, min_count=3, delimiter=b' ', threshold=1)
text_test_trigram = Phrases(text_test_bigram[text_test_stopwords], threshold=1)

text_test_bigram_mod = gensim.models.phrases.Phraser(text_test_bigram)
text_test_trigram_mod = gensim.models.phrases.Phraser(text_test_trigram)

for idx in range(len(text_test_stopwords)):
    for token in text_test_trigram_mod[text_test_bigram_mod[text_test_stopwords[idx]]]:
        #print(token)
        if ' ' in token:
            text_test_stopwords[idx].append(token)

## Print the head of the data with Bigrams and Trigrams
### e.g., under the testing dataset, you can see the words 'report sexual' and 'good guy' are bigrams

In [33]:
print("\n[INFO] Text Article training data....................\n")
print(text_train_stopwords[:2])

print("\n[INFO] Text Article testing data....................\n")
print(text_test_stopwords[:2])


[INFO] Text Article training data....................

[['beth', 'think', 'brian', 'sweet', 'little', 'shy', 'trustworthy', 'know', 'parent', 'look', 'man', 'respect', 'woman', 'get', 'alone', 'near', 'university', 'minnesota', 'campus', 'begin', 'push', 'hard', 'fast', 'stop', 'ask', 'seem', 'care', 'say', 'realize', 'happen', 'brain', 'shut', 'people', 'still', 'envision', 'sexual', 'assault', 'rare', 'occurrence', 'commit', 'stranger', 'fact', 'time', 'rapist', 'know', 'victim', 'accord', 'national', 'institute', 'justice', 'good', 'news', 'thank', 'partly', 'campus', 'sexual', 'violence', 'elimination', 'save', 'act', 'victim', 'comfortable', 'reporting', 'rape', 'college', 'authority', 'often', 'confidentially', 'bad', 'news', 'number', 'rape', 'go', 'decade', 'research', 'consistently', 'indicate', 'college', 'woman', 'victim', 'nonconsensual', 'sex', 'lion', 'den', 'still', 'ask', 'young', 'people', 'fend', 'say', 'caroline', 'palmer', 'law', 'policy', 'manager', 'minnesota', '

# 2f. Remove words that occur in less than & greater than of documents
### The corpus is our collection of documents (i.e., our textual questionnaire responses)
### The dictionary takes each unique word in the corpus and assigns them an index

In [34]:
dictionary_text_train = corpora.Dictionary(text_train_stopwords)
dictionary_text_train.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_train = [dictionary_text_train.doc2bow(text) for text in text_train_stopwords]
pickle.dump(corpus_text_train, open('./data/corpus_text_train.pkl', 'wb'))
dictionary_text_train.save('./data/dictionary_text_train.gensim')

dictionary_text_test = corpora.Dictionary(text_test_stopwords)
dictionary_text_test.filter_extremes(no_below=.01, no_above=0.99)
corpus_text_test = [dictionary_text_test.doc2bow(text) for text in text_test_stopwords]
pickle.dump(corpus_text_test, open('./data/corpus_text_test.pkl', 'wb'))
dictionary_text_test.save('./data/dictionary_text_test.gensim')

# 2g. Print Dictionary and Corpus for each subset 

In [35]:
## Training data dictionary & corpus
text_train_stopwords
for k, v in dictionary_text_train.token2id.items():
    print(k, v)

able 0
accord 1
account 2
acquaintance 3
act 4
active 5
adult 6
advantage 7
advice 8
age 9
aggressive 10
alcohol 11
alone 12
alongside 13
also 14
although 15
answer 16
approach 17
around 18
ask 19
assault 20
attack 21
attempt 22
attempt rape 23
attitude 24
augsburg 25
aurora 26
aurora center 27
authority 28
average 29
avoid 30
away 31
bad 32
become 33
begin 34
behavior 35
beth 36
big 37
ble 38
block 39
boston 40
boy 41
brain 42
brian 43
bystander 44
call 45
campus 46
campus sexual 47
car 48
care 49
caroline 50
case 51
center 52
certain 53
change 54
child 55
civil 56
class 57
clothe 58
cloud 59
coach 60
coalition 61
college 62
comfortable 63
commit 64
confidentially 65
confront 66
consent 67
consequence 68
consider 69
consistently 70
constitute 71
contact 72
continue 73
control 74
conversation 75
cournoyer 76
create 77
crime 78
criminal 79
damage 80
danger 81
date 82
daughter 83
decade 84
defeminize 85
den 86
dept 87
director 88
distract 89
donna 90
drinking 91
duluthbase 92
dunn 93
eas

please 1076
point 1077
police 1078
police department 1079
politic 1080
popular 1081
portray 1082
potential 1083
predomi 1084
predomi tly 1085
preparation 1086
prepare 1087
president 1088
prestigious 1089
previous 1090
problem 1091
proceeding 1092
process 1093
professor 1094
professorship 1095
programming 1096
prohibit 1097
propel 1098
properly 1099
prosecute 1100
protect 1101
protest 1102
provide 1103
psychology 1104
public 1105
public safety 1106
publicize 1107
publish 1108
pull 1109
pulliam 1110
punishment 1111
purpose 1112
put 1113
put hand 1114
pwis 1115
quaalude 1116
quest 1117
question 1118
questioning 1119
quickly 1120
quiet 1121
race 1122
racial 1123
radical 1124
raheem 1125
rape victim 1126
rape woman 1127
rate 1128
rather 1129
raymond 1130
reach 1131
react 1132
read 1133
really 1134
recall 1135
receive 1136
recent 1137
reclassify 1138
reconcile 1139
record 1140
recreation 1141
red 1142
refer 1143
refuse 1144
regard 1145
reimagine 1146
reiterate 1147
relationship 1148
remain 1

dressing 1836
dressing room 1837
enter 1838
escalate 1839
essentially 1840
excerpt 1841
film 1842
harder 1843
hbo 1844
icon 1845
icy 1846
instal 1847
kiss 1848
laugh 1849
lifelike 1850
limo 1851
marnie 1852
mask 1853
memoir 1854
misogynist 1855
moment 1856
multiple occasion 1857
occasion 1858
petulant 1859
prison 1860
profile 1861
reveal 1862
say put 1863
secret 1864
telegraph 1865
tell anyone 1866
tippi 1867
touch 1868
try force 1869
whole 1870
aggravate 1871
ago 1872
also prosecuter 1873
anchor 1874
argument 1875
assault woman 1876
assistant 1877
bank 1878
bank say 1879
batey 1880
bathroom 1881
beginning 1882
believe 1883
betray 1884
bottle 1885
brandon 1886
brief 1887
briefing 1888
build 1889
bully 1890
bullying 1891
bus 1892
camera 1893
cellphone 1894
chest 1895
climb 1896
close 1897
codefendant 1898
commodore 1899
composure 1900
convict 1901
cory 1902
count 1903
count aggravate 1904
credibility 1905
crucial 1906
daily 1907
death 1908
defendent place 1909
defense 1910
deliberately 

baseball 2602
complex 2603
composite 2604
composite sketch 2605
earlier week 2606
fight back 2607
husky 2608
jawline 2609
line 2610
maryland 2611
mid 2612
montgomery 2613
montgomery county 2614
rockville 2615
rollins 2616
run away 2617
shirt 2618
sketch 2619
special 2620
special victim 2621
tan 2622
urge 2623
africanamerican 2624
ambassador 2625
awareness 2626
collaboration 2627
counter 2628
definitely 2629
diana 2630
differently 2631
establish 2632
explain 2633
findthe 2634
happenedit 2635
healing 2636
imagine 2637
inspiring 2638
isiah 2639
jay 2640
last year 2641
legend 2642
manhood 2643
nba 2644
old 2645
press forced 2646
rape yearold 2647
ready 2648
ross 2649
say could 2650
share story 2651
sharing 2652
site 2653
terrify 2654
therapeutic 2655
therapy 2656
thomas 2657
transpire 2658
traumatic 2659
voiceless 2660
voicethe 2661
would happen 2662
wrong 2663
year old 2664
zeke 2665
accord criminal_complaint 2666
agent 2667
aggravate sexual 2668
airlines 2669
airport 2670
amand 2671
atte

muslim woman 3602
muslims 3603
nadiah 3604
nadiah mohajir 3605
nasty 3606
negative 3607
new allegation 3608
nobody 3609
none 3610
notion 3611
object 3612
office say 3613
officially 3614
omer 3615
omer mozaffar 3616
one another 3617
ordinarily 3618
originate 3619
outsider 3620
pakistan 3621
pakistani 3622
paramount 3623
peaceful 3624
pedophile 3625
phone interview 3626
pool 3627
pray 3628
prayer 3629
prominent 3630
prompt 3631
puberty 3632
publicity 3633
purity 3634
put arm 3635
quran 3636
rabbi 3637
raise 3638
reaching 3639
reluctant 3640
replace 3641
reproduction 3642
resist 3643
retain 3644
right thing 3645
rockford 3646
saleem 3647
saleem first 3648
saleem found 3649
saleem say 3650
say mozaffar 3651
scholar 3652
school health 3653
school work 3654
school would 3655
schukar 3656
secretarial 3657
secretive 3658
segment 3659
settlement 3660
sewer 3661
sexual health 3662
sexuality 3663
shamsi 3664
shoulder 3665
sin 3666
since remove 3667
sixth 3668
sjailngboard 3669
social worker 3670


year ago 4602
acuaintance 4603
chock 4604
dec 4605
flirite 4606
horrific 4607
longtime 4608
mile 4609
one man 4610
rcalle 4611
rochester 4612
vicitms 4613
yearold male 4614
boise 4615
bsu 4616
code conduct 4617
corner 4618
donzale 4619
hendrix 4620
idaho 4621
investigation allegation 4622
jackson 4623
marquis 4624
perform oral 4625
redshirt 4626
reon 4627
rep 4628
roddie 4629
statesman 4630
wrongdoing 4631
avimael 4632
child year 4633
copulation 4634
inside bathroom 4635
lewd 4636
lopez 4637
noon 4638
oral copulation 4639
penetration 4640
sexual penetration 4641
alien 4642
also face 4643
arrival 4644
aurelio 4645
buren 4646
chi 4647
children 4648
dezgomez 4649
donald 4650
donald trump 4651
extradite 4652
fla 4653
go home 4654
hartford township 4655
illegally 4656
mexican 4657
mich 4658
miss 4659
police find 4660
polk 4661
polk county 4662
protective 4663
report authority 4664
trump 4665
wzzm 4666
wzzmdeputie 4667
attack happen 4668
bone 4669
broken 4670
endure 4671
fundraising 4672
nat

traumatize 5601
truly 5602
undertake 5603
wildly 5604
admittedly 5605
audit 5606
block avenue 5607
busse 5608
demonstration 5609
dignity 5610
dildo 5611
disappointing 5612
downtown 5613
epsilon 5614
fraternity house 5615
greg 5616
headquarters 5617
jessee 5618
kpbs 5619
marcher 5620
montezuma 5621
obscenity 5622
organizer 5623
phis 5624
portion 5625
row 5626
sdsu 5627
seriousness 5628
sigma 5629
state university 5630
virginias 5631
abuse accord 5632
authorized 5633
brick 5634
charlotte 5635
charter 5636
chudzicki 5637
claiming 5638
cute 5639
demetrius 5640
exhibit 5641
fether 5642
kipp 5643
kipp charlotte 5644
lawsuit say 5645
rape girl 5646
seventhgrade 5647
uncle 5648
unconsensual 5649
victim sexual 5650
voicemail 5651
cuban 5652
factually 5653
gop 5654
lemon 5655
telephone 5656
tell story 5657
thirdhand 5658
untoward 5659
abandonment 5660
accuracy 5661
adhere 5662
administrations 5663
ali 5664
approve 5665
arise 5666
arthur 5667
caleb 5668
columnist 5669
combat 5670
constitutional 5

laurel 6601
olympia 6602
school district 6603
sectional 6604
teacher aide 6605
milby 6606
andre 6607
atlantic 6608
burkeman 6609
cbs news 6610
counsels 6611
decade ago 6612
journalist 6613
lobbyist 6614
meddling 6615
mueller 6616
newsmax 6617
nowjustice 6618
offer reward 6619
paralegal 6620
polite 6621
portnoy 6622
probe 6623
reward 6624
rightwe 6625
russian 6626
scheme 6627
special counsel 6628
supporter 6629
tray 6630
unveil 6631
witch 6632
bang 6633
johnson 6634
military 6635
obstruct 6636
pillow 6637
portsmouth 6638
recant 6639
sailor 6640
seaman 6641
virginianpilot 6642
acquaint 6643
broadcast 6644
chicago police 6645
dedicate 6646
depict 6647
disable 6648
eddie 6649
genuine 6650
glorify 6651
guglielmi 6652
guglielmi say 6653
headquarter 6654
hideous 6655
identifiable 6656
reunite 6657
screengrab 6658
staffer 6659
superintendent 6660
amphitheater 6661
auburn 6662
bestknown 6663
cornell 6664
dawn 6665
des 6666
dilemma 6667
entity 6668
grammar 6669
grammywinne 6670
hayne 6671
herre 

bypassed 7601
disadvantage 7602
earn 7603
earth 7604
geoff 7605
geoffrey 7606
janet 7607
marcy 7608
marcys 7609
misuse 7610
nasa 7611
objectify 7612
parameter 7613
planet 7614
post publish 7615
postdoctoral 7616
procedural 7617
provost 7618
respectful 7619
revelation 7620
richey 7621
sexualize 7622
sincerely 7623
solar 7624
web 7625
york times 7626
evolution 7627
hey 7628
honour 7629
hooker 7630
liar 7631
naughty 7632
phrase 7633
purport 7634
realise 7635
reclaim 7636
recognise 7637
slut 7638
subversive 7639
sup 7640
tricky 7641
ultimate 7642
varietys 7643
ajaill 7644
application 7645
banished 7646
barcombpeter 7647
beaty 7648
casebycase 7649
clinkscale 7650
conductbehavior 7651
correspondence 7652
curry 7653
daisy 7654
dan 7655
dole 7656
durham 7657
entail 7658
erinn 7659
forcedd 7660
forgo 7661
gary 7662
halloween 7663
haute 7664
hyman 7665
jayhawker 7666
journalworld 7667
lawsuit allege 7668
libby 7669
likewise 7670
listing 7671
marchiony 7672
mcclure 7673
misunderstood 7674
nonacad

lawful 8601
meritt 8602
meritts 8603
plenty 8604
thatno 8605
alamo 8606
aught 8607
blogger 8608
chud 8609
cinemas 8610
cinephile 8611
collider 8612
complicated 8613
condemnation 8614
cool 8615
devin 8616
drafthouse 8617
editorinchief 8618
empathetic 8619
faraci 8620
faracis 8621
forgiveness 8622
honorable 8623
ideological 8624
influential 8625
knowle 8626
loyalty 8627
monster 8628
pence 8629
pussy 8630
running 8631
sciretta 8632
slashfilm 8633
views 8634
vile 8635
weintraub 8636
worthy 8637
alma 8638
coastal 8639
cornerback 8640
knowledgeable 8641
lewis iii 8642
mater 8643
ray 8644
wmbf 8645
biasrelate 8646
canvass 8647
heel 8648
slur 8649
viejo 8650
bakersfield 8651
bug 8652
cohn 8653
gabriel 8654
harrington 8655
jeopardise 8656
kern 8657
kern county 8658
nations 8659
ncwp 8660
nondisclosure 8661
payment 8662
payofor 8663
stiles 8664
unethical 8665
unheard 8666
waive 8667
accidentally 8668
arrogant 8669
assault public 8670
batess 8671
belittle 8672
bum 8673
caress 8674
certainty 8675


In [36]:
## Testing Data dictionary & corpus
text_test_stopwords
for k, v in dictionary_text_test.token2id.items():
    print(k, v)

assault 0
avenue 1
garage 2
garland 3
parking 4
police 5
police say 6
say 7
stairwell 8
woman 9
woman say 10
abuse 11
accuse 12
accuse sexual 13
accuser 14
action 15
add 16
also 17
anyone 18
assembly 19
attend 20
august 21
authority 22
begin 23
bravery 24
child 25
church 26
clergy 27
come 28
come forward 29
concurrent 30
confirm 31
congregation 32
consecutively 33
contact 34
could 35
counselor 36
count 37
deal 38
deceive 39
defendent 40
des 41
discover 42
entire 43
ever 44
except 45
exploitation 46
feel 47
find 48
first 49
five 50
five year 51
foolish 52
forced 53
forward 54
girl 55
go 56
go make 57
god 58
happy 59
heritage 60
hope 61
hurt 62
inapprojailate 63
inapprojailate sexual 64
incident 65
incident occur 66
incredible 67
insecurity 68
inspire 69
intelligent 70
investigation 71
investigator 72
iowa 73
issue 74
justice 75
juvenile 76
last 77
life 78
like 79
lure 80
make 81
mckelvey 82
moines 83
monster 84
normal 85
notify 86
occur 87
offender 88
one 89
parent 90
part 91
pastor 92


rebuke 868
refusal 869
refuse 870
release 871
remind 872
represent 873
representative 874
request 875
reside 876
resign 877
resist 878
resolve 879
return 880
ring 881
role 882
rub 883
rule 884
seek 885
separately 886
serious allegation 887
sesame 888
sex upon 889
sexual abuse 890
sexual contact 891
sexually abuse 892
sexy 893
shortly 894
significannotly 895
similarly 896
singer 897
site 898
skill 899
socialize 900
solicit 901
sordid 902
spank 903
spit 904
statement 905
statute 906
statute limitation 907
straight 908
street 909
submit 910
subsequently 911
suffer 912
suit 913
summer 914
sure 915
surreptitiously 916
swim 917
teenage 918
theater 919
thewrap 920
though 921
three 922
transformer 923
trial 924
trip 925
twice 926
ultimately 927
underwater 928
undress 929
unreasonable 930
unspecified 931
unwillingness 932
upcoming 933
use 934
use position 935
verbally 936
vince 937
vindicate 938
voice 939
vow 940
warn 941
water 942
wealth 943
wear 944
without 945
woody 946
work 947
xmen 948
aff

really 1618
rise 1619
safe 1620
san 1621
santa 1622
say tuesday 1623
sexual relationship 1624
side 1625
similar 1626
suspicion 1627
take allegation 1628
track coach 1629
two month 1630
union 1631
vice 1632
want make 1633
yearold student 1634
accord police 1635
already 1636
baptist 1637
baptist church 1638
choir 1639
copulation 1640
forced include 1641
highlands 1642
kareem 1643
meet 1644
north 1645
police ask 1646
sacramento 1647
sodomy 1648
testament 1649
accord new 1650
affairs 1651
allege rape 1652
allegedly rape 1653
antianxiety 1654
badge 1655
bag 1656
basically 1657
beach 1658
brighton 1659
brooklyn 1660
brutalize 1661
bureau 1662
buybust 1663
chipotle 1664
city 1665
coney 1666
consensual 1667
cop 1668
damn 1669
david 1670
definitely 1671
deny 1672
desk 1673
district attorneys 1674
dnainfos 1675
dodge 1676
drugrelate 1677
duty 1678
edward 1679
endowment 1680
espey 1681
foot 1682
force perform 1683
hall 1684
hurowitz 1685
instead 1686
island 1687
john 1688
justification 1689
kidna

online 2617
performance 2618
perry 2619
physical 2620
promote 2621
ross 2622
samuel 2623
sexual performance 2624
sville 2625
tamper 2626
yearold boy 2627
ahead 2628
amanda 2629
amandas 2630
anxious 2631
aregot 2632
asst 2633
ate 2634
atm 2635
bank 2636
blackburn 2637
blackburns 2638
burglary 2639
captain 2640
card 2641
commander 2642
commit crime 2643
concerned 2644
continue investigate 2645
davey 2646
dna 2647
energy 2648
eric 2649
excite 2650
eyewitness 2651
gas 2652
gas station 2653
hench 2654
impd 2655
indianapolis 2656
interest 2657
interrogate 2658
investigative 2659
jalen 2660
kessler 2661
killer 2662
last week 2663
major 2664
mcconnell 2665
metro police 2666
michigan 2667
mothers 2668
news conference 2669
parole 2670
patient 2671
pregnant 2672
scene 2673
source 2674
source close 2675
step forward 2676
subdivision 2677
surveillance video 2678
tragically 2679
wat 2680
watch 2681
week ago 2682
welcome 2683
woods 2684
accord statement 2685
allege sexually 2686
angela 2687
awards 26

barbizon 3617
bonita 3618
communicate 3619
corporate 3620
diego 3621
dreamed 3622
elliott 3623
friar 3624
girlfriend 3625
heart 3626
jailority 3627
janine 3628
kanter 3629
kelly 3630
kyser 3631
kyser say 3632
law firm 3633
marry 3634
mind 3635
modeling 3636
money 3637
nightmare 3638
outstanding 3639
photographer 3640
say word 3641
skip 3642
someone know 3643
strive 3644
tape 3645
trigger 3646
unaffiliate 3647
unfortunate 3648
unnamed 3649
young girl 3650
bloodied 3651
bruise 3652
chitwood 3653
daytona 3654
dispatcher 3655
florida 3656
frolander 3657
independently 3658
interrupt 3659
knot 3660
lay 3661
lip 3662
living 3663
motionless 3664
observe 3665
police chief 3666
puddle 3667
raymond 3668
reuter 3669
say look 3670
scratch 3671
swollen 3672
yearsold 3673
andrea 3674
blouse 3675
casey 3676
caseys 3677
clerk 3678
coordinator 3679
county official 3680
deposit 3681
dress 3682
dry 3683
elect 3684
employment 3685
ever happen 3686
executive 3687
favor 3688
flake 3689
front 3690
generally i

sneak 4616
solo 4617
spearsnsync 4618
sponsor 4619
spree 4620
stadium 4621
stre 4622
submission 4623
superstar 4624
surprise 4625
tel 4626
tiny 4627
tour 4628
toy 4629
toya 4630
transparent 4631
ugly 4632
underpant 4633
undisclosed 4634
usher 4635
vicitm 4636
victimize 4637
vilify 4638
wade 4639
waist 4640
wasand 4641
weitzman 4642
wembley 4643
wildly 4644
would come 4645
yearslong 4646
zonen 4647
assault victim 4648
back home 4649
boat 4650
dec 4651
get home 4652
portable 4653
potty 4654
prosecuter sexual 4655
ramp 4656
relay 4657
rosa 4658
tina 4659
vacation 4660
visiting 4661
woodlawn 4662
beer 4663
gore 4664
muldrow 4665
pornographic 4666
probable 4667
probable cause 4668
sallisaw 4669
sand 4670
sand rape 4671
sequoyah 4672
sister 4673
six count 4674
tionally 4675
troy 4676
twoyear 4677
accompany 4678
alumna 4679
anonymously 4680
awareness 4681
barnhart 4682
chancellor 4683
constitute 4684
crackdown 4685
cynthia 4686
discus 4687
intensify 4688
knew 4689
learning 4690
mit 4691
poll 

occupational 5616
ptsd 5617
specialty 5618
stomach 5619
ventura 5620
abc news 5621
adopt 5622
artur 5623
asher 5624
bump 5625
certificate 5626
corruption 5627
corruption minor 5628
custom 5629
dauphin 5630
doest 5631
dual 5632
embassy 5633
enrolling 5634
enrollment 5635
falsified 5636
infanti 5637
john harris 5638
news today 5639
offices 5640
potts 5641
relationship yearold 5642
samarin 5643
samarin tell 5644
say aware 5645
science 5646
stephayne 5647
tell abc 5648
tell whtm 5649
tourist 5650
ukraine 5651
ukrainian 5652
visa 5653
whtm 5654
array 5655
blow 5656
bob 5657
ciara 5658
cyrus 5659
def 5660
extensively 5661
flick 5662
flo 5663
hitter 5664
impressionable 5665
katy 5666
keha 5667
kelis 5668
legit 5669
mafia 5670
manipulation 5671
miley 5672
mos 5673
nake 5674
paint 5675
pitbull 5676
plane 5677
regain 5678
rida 5679
rowland 5680
sore 5681
tik 5682
tok 5683
wholehearte 5684
anything like 5685
believable 5686
churchs 5687
clean 5688
congregational 5689
esser 5690
facet 5691
involve

hacktivist 6613
heartbreake 6614
heartbreaking 6615
injustice 6616
latter 6617
maryville 6618
melinda 6619
nodaway 6620
notoriety 6621
persist 6622
peters 6623
plant 6624
psychiatric 6625
rabbit 6626
rally 6627
recover 6628
resignation 6629
rice 6630
skinned 6631
temperature 6632
terrorize 6633
timetable 6634
welfare 6635
yard 6636
zach 6637
abuse child 6638
accommodate 6639
advertising 6640
alternate 6641
alumnus 6642
angrily 6643
assail 6644
basyle 6645
beige 6646
betray 6647
bible 6648
brick 6649
bubble 6650
burchwell 6651
compel 6652
conflict 6653
consultant 6654
consulting 6655
defer 6656
denomination 6657
dismissal 6658
erin 6659
evangelical 6660
evangelist 6661
explanation 6662
fundamentalism 6663
fundamentalist 6664
godly 6665
greatgrand 6666
handbook 6667
immaculate 6668
insular 6669
intense 6670
lawn 6671
lewis 6672
literal 6673
lowrise 6674
lustful 6675
manicure 6676
ordinary 6677
primaryand 6678
randy 6679
rewrote 6680
scripturally 6681
sideline 6682
spirit 6683
superintend

grandfather 7615
ineligible 7616
intellectual 7617
intellectually 7618
jurist 7619
kaitlyn 7620
malinger 7621
malingeringthe 7622
orleansbase 7623
purposely 7624
row 7625
rylie 7626
threejudge 7627
caldwell 7628
chs 7629
iosim 7630
newark 7631
reiterate 7632
timothy 7633
arbitrary 7634
bulldog 7635
cornell 7636
culpable 7637
destroy 7638
drastic 7639
espns 7640
flimsy 7641
galvanize 7642
irreparable 7643
ivy 7644
medcalf 7645
myron 7646
oblivious 7647
oneil 7648
overly 7649
pillory 7650
primarythought 7651
rational 7652
retaliatory 7653
simultaneously 7654
stern say 7655
ves 7656
warmup 7657
whip 7658
whipping 7659
zone 7660
ahole 7661
albeit 7662
alyssa 7663
apparatus 7664
appreciate 7665
availability 7666
booker 7667
bookers 7668
categorize 7669
chitchat 7670
clist 7671
comprehend 7672
contenderhardly 7673
corroboration 7674
cory 7675
cory booker 7676
croc 7677
cycle 7678
default 7679
democrats 7680
dhillon 7681
disengage 7682
entryway 7683
exception 7684
exclusively 7685
flip 7686
f

## Ensure the data looks correct
### i.e., are words properly lemmatized, tokenized, etc?

In [37]:
# View information about each corpus

print("\n[INFO] Article Dataset....................\n")
print('Train Data', str(len(text_train_stopwords)))
print('Test Data', str(len(text_test_stopwords)))

print("\n\n=================================================\n\n")

print("[INFO] Training Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_train_stopwords[:1])

print("\n[INFO] Testing Sample Data")
print("--------------------------------------")
print("--------------------------------------")
print("\n[INFO] Article Text....................\n")
print(text_test_stopwords[:1])


[INFO] Article Dataset....................

Train Data 240
Test Data 240




[INFO] Training Sample Data
--------------------------------------
--------------------------------------

[INFO] Article Text....................

[['beth', 'think', 'brian', 'sweet', 'little', 'shy', 'trustworthy', 'know', 'parent', 'look', 'man', 'respect', 'woman', 'get', 'alone', 'near', 'university', 'minnesota', 'campus', 'begin', 'push', 'hard', 'fast', 'stop', 'ask', 'seem', 'care', 'say', 'realize', 'happen', 'brain', 'shut', 'people', 'still', 'envision', 'sexual', 'assault', 'rare', 'occurrence', 'commit', 'stranger', 'fact', 'time', 'rapist', 'know', 'victim', 'accord', 'national', 'institute', 'justice', 'good', 'news', 'thank', 'partly', 'campus', 'sexual', 'violence', 'elimination', 'save', 'act', 'victim', 'comfortable', 'reporting', 'rape', 'college', 'authority', 'often', 'confidentially', 'bad', 'news', 'number', 'rape', 'go', 'decade', 'research', 'consistently', 'indicate', 'college', 'w

In [38]:

print("\n[INFO] Article Text Dictionary and Corpus.....................\n")

print("\nNumber of unique words in the training set's dictionary:")
print(str(len(dictionary_text_train)))

print("\nNumber of unique words in the testing set's dictionary:")
print(str(len(dictionary_text_test)))

print("\nNumber of rows in the testing set:")
print(str(len(corpus_text_test)))

print("\nNumber of rows in the training set:")
print(str(len(corpus_text_train)))


[INFO] Article Text Dictionary and Corpus.....................


Number of unique words in the training set's dictionary:
8979

Number of unique words in the testing set's dictionary:
8343

Number of rows in the testing set:
240

Number of rows in the training set:
240


# 3. Select number of topics (k)

## 3a. Setting Model Hyperparameters 

### 1. Beta (referred to as 'eta' in gensim) = the [distribution of the] number of words per topic
### 2. Alpha =  the [distribution of the] number of topics per document

#### Both alpha and eta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’:
- ‘auto’ = the model learns the best values for the hyperparameters; in other words, it learns an asymmetric prior from the corpus). See http://jonathan-huang.org/research/dirichlet/dirichlet.pdf for an overview             
- 'asymmetric' = uses a fixed, normalized asymmetric prior of 1.0 / k (number of topics)
- 'symmetric' = uses a distribution of 1 / k (number of topics)

In Bayesian statistics, we have to define the distributions (i.e., prior distributions) of unknown variables (e.g., ϕ and θ) before running the data analysis. These should be defined based on theoretical assumptions about how we think the topics are actually distributed amongst our data. In our case, it makes sense to assume that some documents discuss more/less topics than other documents

#### Thus, we reccommend setting alpha = 'auto' as it sets the distribution to be asymmetric, and learns the best alpha value (i.e., lowest perplexity scores) from the data itself. It also makes sense to assume that some topics contain more words than others. Thus, we reccomend setting the distribution of the number of words per topic to be asymmetric as well. 

### 3. Passes = number of laps the model goes through the entire corpus
        - Increasing the number of passes reduces model bias
### 4. Chunksize = number of documents to load into memory at a time 
        - smaller chunksizes save memory, but take longer to train
### 5. Update_every = number of chunks to process before maximizing your model 
### 6. Random state = sets the seed to make the model reproducable
### 7. Number of topics (k)
Researchers must tell the model how many (k) prominent goal inference topics to sort each ‘bag of words’ document into. To ascertain the optimal number of topics for our study, we used an evaluation metrics called topic perplexity. Topic perplexity looks at the normalized log-likelihood of a test set. In other words, it examines how surprised a model is to see new terms. Relative to scores generated by other topic numbers (k), lower perplexity scores generally indicate greater accuracy (Newman, Lau, Grieser, & Baldwin, 2010). Notably, a single model perplexity score is useless on its own; the value of perplexity measures comes from comparing perplexity values across multiple models with different numbers of topics (k).

## 3b. Compute model perplexity scores
We run the model with different topic numbers (k) to find the optimal topics number

We will start by looking at k=1-32 topics

In [39]:
k = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
random_state=42
update_every=2
chunksize=1800
passes=600
iterations=1600
alpha='auto'
eta='auto'
per_word_topics=True

lda_model_text_trains = []
lda_model_text_tests = []

## Which perplexity score is best?
When selecting the number of topics for the model, researchers should choose the lowest perplexity measure from the range of theoretically appropriate k-values they hope to measure. We aimed to measure news topics, broadly. Therefore, we want to take a closer look at the models with the lowest number of topics, relative to the lowest perplexity score. Results suggested that the seven models with the lowest number of topics, relative to the lowest perplexity score, are k = 3-7 topics.

In [None]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] Article Text Training Data LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_train = gensim.models.ldamodel.LdaModel(corpus=corpus_text_train,
                                                  id2word=dictionary_text_train,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

    lda_model_text_trains.append(lda_model_text_train)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_train.log_perplexity(corpus_text_train))


***********************************************************************
[INFO] Article Text Training Data LDA Results....
***********************************************************************

Perplexity (num_topics = 2):  -7.799363946574354

Perplexity (num_topics = 4):  -7.819559453604992

Perplexity (num_topics = 6):  -7.85134911073541

Perplexity (num_topics = 8):  -7.872259397487816


### Recall, can see that k=3-9 topics have the most drastic decreases in perplexity scores. 
### Thus, we opt to take a closer look at these 7 models

### Here is the representative 'Bag of Words' for each topic in the Training Dataset
As you can see, this doesn't give us enough information about each topic to assess model fit. Thus, we utilize a interactive visualization

In [None]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Training Data Topic Outputs....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_trains[j].show_topics(num_topics=i, num_words=10, log=True, formatted=True)
    for topic in topics:
        print(topic)

# 3c. Model selection using pyLDAvis Visualization 
## We're interested in the models with k= 3,4,5,6,7 topics, so we visualize those using the pyLDAvis documentation (https://www.aclweb.org/anthology/W14-3110.pdf)

## When selecting the optimal number of topics, we need to find a balance between underfitting and underfitting the model

### OVERFITTING (i.e., too many topics): 
#### Practical takeaway- this can make it harder for human coders to label
#### pros- less overlap amongst topics
#### cons- less coherence amongst the words in each topic; decreased varaiance in each document's distirbution of topics

### UNDERFITTING (i.e., too few topics): 
#### Practical takeaway- doesn't produce enough variance, limiting options for statistical analyses
#### pros- more coherent 'bag of words' comprising each topic; increased varaiance in the distirbution of topics in each document
#### cons- more overlap amongst topics

## Reading pyLDAvis

### LEFT PANE:
- The area of each circle represents the prevalence of each topic over the entire corpus 
- The distance between the center of circles indicate the similarity between topics (i.e., inter-topic differences)

---------------------------------------------------------------------------------------------------------

### RIGHT PANE:
- If you hover over a particular topic on the left, the histogram on the right side lists the top 30 most relevant terms
- The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term
- A slider at the top can adjust the relevence metric (λ); however, for our purposes, be sure it i set to λ = 1. For more information on the relevance metric, see (https://www.aclweb.org/anthology/W14-3110.pdf). 








In [None]:
print("k = 14 topics ...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[6], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

In [None]:
print("k = 12 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[5], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

In [None]:
print("k = 10 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[4], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

# k=4 topics looks the best to me! The topics appear to be relatively spread out, with no overlapping topics
# At the same time, the 'bag of words' comprising each topic appears coherent enough to label. 

# As our objective was to capture the news article topics broadly, the four-topic model worked well.

In [None]:
print("k = 8 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[3], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

In [None]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[2], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

In [None]:
print("k = 4 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_trains[1], corpus_text_train, dictionary_text_train)
pyLDAvis.display(lda_display)

# 4. Model Validation for Test Dataset with 5 topics (Num_Topics = 5)

### Now, we check to make sure the training dataset shows a similar trends, to offer evidence of reproducability. We do this by:

(1) Validating that the the testing set perplexity value for k=5 topics relatively similar to the training set (it is)

(2) check the fit of the pyLDAvis visualization for indicators of similar model fit (it looks good)

## 4a. Get Perplexity Scores of the Testing Dataset

In [None]:
print("\n***********************************************************************")
print("[INFO] Article text Testing Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text_test = gensim.models.ldamodel.LdaModel(corpus=corpus_text_test,
                                                  id2word=dictionary_text_test,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                                      
                                                  per_word_topics=per_word_topics)

    lda_model_text_tests.append(lda_model_text_test)
    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text_test.log_perplexity(corpus_text_test))

## 4b. Check the pyLDAvis visualization for indicators of similar model fit (it looks good)

In [None]:
l = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
print("\n***********************************************************************")
print("[INFO] Article Text Testing Data Topic Output....")
print("***********************************************************************")

for i, j in zip(k, l):
    print("\n[INFO] Num_topics: {}\n".format(i))
    topics = lda_model_text_tests[j].show_topics(num_topics=i, num_words=5, log=True, formatted=True)
    for topic in topics:
        print(topic)

In [None]:
print("k = 5 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[3], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

In [None]:
print("k = 6 topics...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text_tests[4], corpus_text_test, dictionary_text_test)
pyLDAvis.display(lda_display)

## Now that we've done this, we can run the full model (test & training) with k=5 topics

# 5. Run LDA model on full dataset

## 5a. Repeat preprocessing steps 2a-2g on the full dataset

In [None]:
try:
    data = pd.read_excel(file_location, encoding='latin1')
    print("{} Rows.  {} Columns.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")

In [None]:
indices = [0,333,777,932]

samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Sample Tickets:")
display(samples)

In [None]:
# number of null values in each column of the full dataset
data.isnull().sum()

In [None]:
#finding null values in the full dataset
print("=============Full Dataset=============")
data['text'] = data['text']

print('Number of rows in text:', len(data['text']))

print("-------------------")
print("Null Values in text: {}".format(data['text'].isnull().sum()))

#Removing null values from the full dataset

text = data['text']

print("After removing Null Values in Full Dataset")
print("Null Values in text: {}".format(data['text'].isnull().sum()))



In [None]:
#----------------------------------------Convert everything to Lower case--------------------------------------------------

##Train Data
text = text.str.lower()

print("=======Full Dataset==============\n")
print(text.head(1))


In [None]:
##Remove special characters from full dataset

text_regex = [re.sub(r'\S*@\S*\s?', '', sent) for sent in text]
text_regex = [re.sub(r'\'', '', sent) for sent in text_regex]
text_regex = [re.sub(r'[^\w\s]', '', sent) for sent in text_regex]
text_regex = [re.sub(r'\d', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'\W*\b\w{1,2}\b', '',  sent) for sent in text_regex]
text_regex = [re.sub(r'_', ' ',  sent) for sent in text_regex]

print("=======Full Dataset==============\n")
print("\n[INFO] text....................\n")
print(text_regex[:2])


In [None]:
## Full Data set
text_tokens = list(tokenize(text_regex))


print("\n[INFO] text....................\n")
print(text_tokens[:2])

In [None]:
# Lemmatize full dataset, keeping only noun, adj, vb, adv

text_lemma = lemmatization(text_tokens, allowed_postags=['ADJ', 'VERB', 'ADV', 'SCONJ', 'PRON', 'PART', 'INTJ', 'AUX', 'ADV', 'ADP', 'ADJ'])
print(str(len(text_lemma)))
print(text_lemma[:4])

In [None]:
#Trigrams and Bigrams in full dataset
            
text_bigram = Phrases(text_lemma, min_count=3, delimiter=b' ', threshold=1)
text_trigram = Phrases(text_bigram[text_lemma], threshold=1)

text_bigram_mod = gensim.models.phrases.Phraser(text_bigram)
text_trigram_mod = gensim.models.phrases.Phraser(text_trigram)

for idx in range(len(text_lemma)):
    for token in text_trigram_mod[text_bigram_mod[text_lemma[idx]]]:
        #print(token)
        if ' ' in token:
            text_lemma[idx].append(token)
print("\n[INFO] text....................\n")
print(text_lemma[:2])

In [None]:
##Remove stopwords from training set

text_stopwords = remove_stopwords(text_lemma)


print("\n[INFO] text....................\n")
print(text_stopwords[:2])



In [None]:
## Generate Corpus and Dictionary for full dataset

dictionary_text = corpora.Dictionary(text_stopwords)
dictionary_text.filter_extremes(no_below=.01, no_above=0.99)
corpus_text = [dictionary_text.doc2bow(text) for text in text_stopwords]
pickle.dump(corpus_text, open('./data/corpus_text.pkl', 'wb'))
dictionary_text.save('./data/dictionary_text.gensim')

## 5b. Ensure the model perplexity score for k=5 topics is relatively similar to the perplexity of the testing and training sets (it is)

In [None]:
k = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta= 'auto'
per_word_topics=True

lda_model_text = []

In [None]:
#Get Perplexity Scores of Training Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


for i in k:

    lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=i, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,                                                            
                                                  per_word_topics=per_word_topics)

    print('\nPerplexity (num_topics = {}): '.format(i), lda_model_text.log_perplexity(corpus_text))

## 5c. Run final model

In [None]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))

## 5d. Visualize Model Results

In [None]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 4...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)

In [None]:
# Initializing LDA Models and Parameters
topic_number = 5
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))

In [None]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 5...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)

In [None]:
# Initializing LDA Models and Parameters
topic_number = 6
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))

In [None]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)

In [None]:
# Initializing LDA Models and Parameters
topic_number = 4
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

# Full Dataset

print("\n***********************************************************************")
print("[INFO] text Full Dataset LDA Results....")
print("***********************************************************************")


lda_model_text = gensim.models.ldamodel.LdaModel(corpus=corpus_text,
                                                  id2word=dictionary_text,
                                                  num_topics=topic_number, 
                                                  random_state=random_state,
                                                  update_every=update_every,
                                                  chunksize=chunksize,
                                                  passes=passes,
                                                  iterations=iterations,
                                                  alpha=alpha,
                                                  eta=eta,
                                                  per_word_topics=per_word_topics)

print('\nPerplexity (topic_number = {}): '.format(topic_number), lda_model_text.log_perplexity(corpus_text))

In [None]:
#text Model Results

print("\n***********************************************************************")
print("[INFO] text Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_text.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

print("text.....k = 6...................")
lda_display = pyLDAvis.gensim.prepare(lda_model_text, corpus_text, dictionary_text)
pyLDAvis.display(lda_display)

## 6. Save the analysis results to an excel file for topic validation

### 6a. Here we generate a column that tells us which topic each response contributed the most to

In [None]:
cols = [color for name, color in mcolors.XKCD_COLORS.items()]
mycolors = [color for name, color in mcolors.XKCD_COLORS.items()]

In [None]:
#  Full Dataset


def format_topics_sentences(ldamodel, corpus, texts):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get the main topic of each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution, and Keywords for each document
        raw_frame = {}
        for j, (topic_num, prop_topic) in enumerate(row):
            #if j < 2:  # => dominant topic
                #wp = ldamodel.show_topic(topic_num)
                #topic_keywords = ", ".join([word for word, prop in wp])

                #sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,7), topic_keywords]), ignore_index=True)
            if j==0:
                raw_frame['Dominant'] = topic_num

            raw_frame['Topic' + str(topic_num)] = round(prop_topic, 4)

            #else:
            #    break
        df = pd.DataFrame(data=raw_frame, index=[0])
        sent_topics_df = sent_topics_df.append(df)
        
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    #sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution']

    # Add original text to the end of the output
    #contents = pd.Series(texts)
    #sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords_text = format_topics_sentences(ldamodel=lda_model_text, corpus=corpus_text, texts=text_stopwords)

df_dominant_topic_text = df_topic_sents_keywords_text.reset_index()
# Format
df_dominant_topic_text.index.name='Document_No';

print(df_dominant_topic_text.head(812))
# Format
#df_dominant_topic_text_train = df_topic_sents_keywords_text_train.reset_index()
#df_dominant_topic_text_train.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
#df_dominant_topic_text_train.head(3)

## 6b. Generate a dataframe to export the results into

In [None]:
lda_topics_text = np.array(df_dominant_topic_text['Dominant'])
topic0_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic0'])
topic1_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic1'])
topic2_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic2'])
topic3_contrib_lda_topics_text = np.array(df_dominant_topic_text['Topic3'])


text = np.array(data['text'])

id = np.array(data['id'])

results = { 
    'id' : id, 
    'text': text, 
    'lda_topics_text': lda_topics_text, 
    'topic0_contrib_lda_topics_text':topic0_contrib_lda_topics_text,
    'topic1_contrib_lda_topics_text':topic1_contrib_lda_topics_text,
    'topic2_contrib_lda_topics_text':topic2_contrib_lda_topics_text,
    'topic3_contrib_lda_topics_text':topic3_contrib_lda_topics_text,
}

frame = pd.DataFrame(results, columns = [
                                                'id',
                                                'text', 'lda_topics_text', 
                                                'topic0_contrib_lda_topics_text',
                                                'topic1_contrib_lda_topics_text',
                                                'topic2_contrib_lda_topics_text',
                                                'topic3_contrib_lda_topics_text',

                                              ])



frame.to_excel("./data/lda_results_full_dataset_topic_num_5.xlsx")

## 5d. Export restuls to an .xlsx file

In [None]:
'./data/news_dataset.xlsx'
frame.to_excel("./data/lda_results.xlsx")