# Topic modeling
## Lab31 - HRF-Police

Author: Micah Swain

## Guiding Questions
Can I cluster tweet topics to isolate police brutality incidents on Twitter?

## Approach
- Apply NLP to the Twitter data
- Estimate topic model on the data
- Create visualizations relating to the topics.

# Import Data

In [48]:
import pandas as pd

In [49]:
df = pd.read_csv("../static/combined_tweets.csv")

In [50]:
df.dropna(inplace=True)

In [51]:
df.shape

(6539, 3)

In [52]:
df.head()

Unnamed: 0,ids,text,reddit
0,1266136557871869952,Police in NYC made several arrests during a pr...,1
1,1266159669262893057,Calls for justice for George Floyd. Protesters...,1
2,1266555286678048770,NYPD just casually slamming a dude with a car ...,1
3,1266540710188195843,Update: Got her permission with a fuck yeah. T...,1
4,1266529475757510656,NYPD officer just called a female protester a ...,1


In [53]:
df['text'][0]

'Police in NYC made several arrests during a protest in NYC.'

# Apply NLP

- To do later: figure out which unit of analysis works best (lemmas, adjectives, keywords, nouns, spaCy tokens, etc.)

In [54]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [126]:
# NOT USING

def get_lemmas(text):
    
    stop_words = ['police', 'officer', 'cop']
    
    lemmas = []
    
    doc = nlp(text.lower())
    
    for token in doc:
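        # Use the string attribute prefix_ throughout; token.prefix is an
        # integer hash, so comparing it to " " would always be True.
        conditions = (not token.is_stop) and \
                    (not token.is_punct) and \
                    (token.pos_ != 'PRON') and \
                    (token.lemma_ not in stop_words) and \
                    (token.prefix_ != "@") and \
                    (token.prefix_ != " ")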
        if conditions:
            lemmas.append(token.lemma_)
    
    return lemmas

In [99]:
# NOT USING: superseded by prepare_text_for_lda below (tokenize is defined in a later cell)
df['lemmas'] = df['text'].apply(tokenize)


In [125]:
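# `parser` was never used; tokenize() relies on the `nlp` pipeline loaded above,
# so the unimported English() parser is left commented out.
# from spacy.lang.en import English
# parser = English()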
def tokenize(text):
    lda_tokens = []
    tokens = nlp(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.pos_ == 'PROPN':
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
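
For quick experiments without loading a spaCy pipeline, the placeholder logic in `tokenize` can be approximated with regular expressions. This is a rough sketch rather than the notebook's method: `simple_tokenize` and both patterns are hypothetical, it splits on whitespace only, and it cannot drop proper nouns because that step needs a part-of-speech tagger.

```python
import re

# Hypothetical patterns; spaCy's like_url and tokenizer rules are more thorough.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
MENTION_RE = re.compile(r"^@\w+")

def simple_tokenize(text):
    """Whitespace tokenizer that mirrors tokenize()'s URL and @-mention placeholders."""
    tokens = []
    for raw in text.split():
        if URL_RE.match(raw):
            tokens.append("URL")
        elif MENTION_RE.match(raw):
            tokens.append("SCREEN_NAME")
        else:
            # Strip surrounding punctuation, then lowercase like token.lower_.
            word = raw.strip(".,!?;:\"'()[]").lower()
            if word:
                tokens.append(word)
    return tokens

print(simple_tokenize("@nytimes Police made arrests. https://example.com"))
```

Keeping the same `URL` and `SCREEN_NAME` placeholders means the downstream stop-word list would not need to change if this stand-in were swapped in.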

In [113]:
import nltk

nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/micahswain/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [114]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/micahswain/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [115]:
more_stop = ['police', 'officer', 'cop', 'SCREEN_NAME']

In [127]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    tokens = [token for token in tokens if token not in more_stop]
    return tokens
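
The order of the four filters in `prepare_text_for_lda` affects what survives. The sketch below replays the same steps on a toy token list to make two effects visible: the length filter removes short but potentially meaningful words, and the custom stop words only match after lemmatization. Everything here is a stand-in: `filter_tokens`, the tiny stop list, and the naive plural-stripping lemmatizer are hypothetical and much cruder than the NLTK resources used above.

```python
def filter_tokens(tokens, stop_words, extra_stop, lemmatize=lambda w: w, min_len=5):
    """Apply the same four steps, in the same order, as prepare_text_for_lda."""
    tokens = [t for t in tokens if len(t) >= min_len]   # equivalent to len(t) > 4
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatize(t) for t in tokens]
    return [t for t in tokens if t not in extra_stop]

# Toy inputs: a tiny stand-in stop list and a naive plural-stripping lemmatizer.
toy_stop = {"during", "about"}
toy_extra = {"police", "officer", "cop", "SCREEN_NAME"}
strip_s = lambda w: w[:-1] if w.endswith("s") else w

tokens = ["SCREEN_NAME", "police", "shot", "protesters", "during", "riot", "officers"]
print(filter_tokens(tokens, toy_stop, toy_extra, strip_s))
```

In this toy run "shot" and "riot" are lost to the length filter, while "officers" is removed only because it is lemmatized to "officer" before the custom stop list is applied. Whether a cutoff of four characters is right for this corpus is worth checking against the real data.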

In [128]:
df['lemmas'] = df['text'].apply(prepare_text_for_lda)

In [129]:
df

Unnamed: 0,ids,text,reddit,lemmas
0,1266136557871869952,Police in NYC made several arrests during a pr...,1,"[several, arrest, protest]"
1,1266159669262893057,Calls for justice for George Floyd. Protesters...,1,"[call, justice, protester, street, rally]"
2,1266555286678048770,NYPD just casually slamming a dude with a car ...,1,"[casually, slam]"
3,1266540710188195843,Update: Got her permission with a fuck yeah. T...,1,"[update, permission, push, fling]"
4,1266529475757510656,NYPD officer just called a female protester a ...,1,"[call, female, protester, stupid, fucking, bit..."
...,...,...,...,...
6555,1364030259171971076,@narsheviking @heghapoghagan @Raz_Libar @n0cla...,0,[proxy]
6556,1364030261483208710,@Lauren_Southern This is a reminder that the p...,0,"[reminder, follow, order, tyrant]"
6557,1364030261906857984,@nytimes Pelosi set the attack up days earlier...,0,"[attack, earlier, stand, anarchist]"
6558,1364030277463519234,So i started playing and then some time in Dec...,0,"[start, playing, start, faith]"


## Topic Modeling w/ Gensim
* Learn a Vocabulary
* Create a Bag of Words (BoW) representation of each document
* Estimate our LDA model
* Clean up results
* Add topic information back to dataframe
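
To make the first two steps concrete, here is a plain-Python sketch of what learning a vocabulary and building a bag-of-words representation involve. It is an illustration only: `build_vocab` and `to_bow` are hypothetical helpers, and the integer ids they assign will generally differ from the ones gensim's `Dictionary` produces.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of `doc` found in `vocab`."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Two lemma lists taken from the dataframe preview above.
docs = [["several", "arrest", "protest"], ["start", "playing", "start", "faith"]]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the shape the LDA model consumes; word order is discarded.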

In [130]:
import gensim
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

In [131]:
id2word = corpora.Dictionary(df['lemmas'])

In [132]:
len(id2word.keys())

5551

In [133]:
corpus = [id2word.doc2bow(doc) for doc in df['lemmas']]

In [141]:
lda = LdaMulticore(corpus=corpus,
                  id2word=id2word,
                  num_topics = 5,
                  passes = 200,
                  workers = 4,
                  random_state = 42)

In [142]:
lda.print_topics()

[(0,
  '0.014*"protester" + 0.011*"people" + 0.008*"department" + 0.008*"crowd" + 0.007*"watch" + 0.007*"look" + 0.007*"arrest" + 0.007*"black" + 0.006*"criminal" + 0.006*"today"'),
 (1,
  '0.018*"kill" + 0.011*"people" + 0.010*"worker" + 0.009*"defund" + 0.008*"social" + 0.008*"still" + 0.007*"thought" + 0.007*"would" + 0.007*"custody" + 0.006*"health"'),
 (2,
  '0.017*"think" + 0.012*"arrest" + 0.009*"really" + 0.008*"protest" + 0.007*"involve" + 0.007*"going" + 0.006*"peaceful" + 0.006*"attack" + 0.006*"protester" + 0.006*"little"'),
 (3,
  '0.018*"death" + 0.017*"black" + 0.014*"report" + 0.012*"would" + 0.012*"video" + 0.012*"could" + 0.009*"brutality" + 0.007*"force" + 0.007*"legal" + 0.007*"charge"'),
 (4,
  '0.022*"people" + 0.017*"arrest" + 0.013*"right" + 0.010*"charge" + 0.008*"attack" + 0.008*"woman" + 0.007*"call" + 0.007*"murder" + 0.007*"chief" + 0.006*"former"')]

In [143]:
import re

words = [re.findall(r'"([^"]*)"', t[1]) for t in lda.print_topics()]

In [144]:
topics = [' '.join(t[0:5]) for t in words]

In [145]:
for topic_id, t in enumerate(topics):
    print(f"----- Topic {topic_id} ------")
    print(t, end="\n\n")

----- Topic 0 ------
protester people department crowd watch

----- Topic 1 ------
kill people worker defund social

----- Topic 2 ------
think arrest really protest involve

----- Topic 3 ------
death black report would video

----- Topic 4 ------
people arrest right charge attack
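
The regular expression above keeps only the words. If the weights are also of interest, the printed topic strings can be parsed into `(word, weight)` pairs, as in this minimal sketch; `parse_topic` and `TERM_RE` are hypothetical names, and the sample string is a shortened copy of topic 1 from the output above. Gensim's `show_topic(topic_id, topn)` returns such pairs directly, so parsing is mainly useful when only the printed text is available.

```python
import re

# Matches one term of the form 0.018*"kill" and captures weight and word.
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]*)"')

def parse_topic(topic_string):
    """Turn '0.018*"kill" + ...' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_string)]

sample = '0.018*"kill" + 0.011*"people" + 0.010*"worker"'
print(parse_topic(sample))
```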



## Analyzing the Results of LDA
- How good are the topics themselves:
    * Using intertopic distance visualization
    * Looking at some of the token distributions
- Using the LDA topics for analysis:
    * Score each tweet with a top topic
    * Summary visualization of topic versus sentiment
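
Scoring each document with a top topic comes down to picking the most probable entry from its topic distribution. The sketch below assumes the distribution is a list of `(topic_id, probability)` pairs, which is the shape gensim returns for `lda[bow]`; the helper `dominant_topic` and the example numbers are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical distribution for one tweet, in the shape gensim returns for lda[bow].
example = [(0, 0.05), (2, 0.15), (3, 0.80)]
print(dominant_topic(example))
```

Under those assumptions, a new column could be filled with something like `[dominant_topic(lda[bow])[0] for bow in corpus]`, provided every document yields at least one topic.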

In [146]:
# Note: newer pyLDAvis releases (3.x) rename this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim
import warnings

warnings.filterwarnings("ignore")

pyLDAvis.enable_notebook()

In [147]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)