# Topic modeling
## Lab31 - HRF-Police

Author: Micah Swain

## Guiding Questions
Can I cluster Twitter topics to isolate police brutality incidents on Twitter?

## Approach
- Apply NLP to the Twitter data
- Estimate topic model on the data
- Create visualizations relating to the topics.

# Import Data

In [1]:
# import dependencies
import pandas as pd
import spacy
from spacy.lang.en import English
import nltk

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

In [2]:
# read csv to test data
df = pd.read_csv("combined_tweets.csv")

In [3]:
# drop NAs and get shape
df.dropna(inplace=True)
df.shape

(6539, 3)

In [4]:
# seeing how the data looks
df.head()

Unnamed: 0,ids,text,reddit
0,1266136557871869952,Police in NYC made several arrests during a pr...,1
1,1266159669262893057,Calls for justice for George Floyd. Protesters...,1
2,1266555286678048770,NYPD just casually slamming a dude with a car ...,1
3,1266540710188195843,Update: Got her permission with a fuck yeah. T...,1
4,1266529475757510656,NYPD officer just called a female protester a ...,1


In [5]:
# example of text 
sample = df['text'].iloc[0]  # first remaining row by position, regardless of index labels
sample

'Police in NYC made several arrests during a protest in NYC.'

# Apply NLP

- To do later: figure out which unit of analysis works best (lemmas, adjectives, keywords, nouns, spaCy tokens, etc.)

In [6]:
# loading small version of english nlp
nlp = spacy.load("en_core_web_sm")

In [7]:
# load blank English pipeline from spacy (not used below; tokenize() relies on `nlp`)
parser = English()

# boiler-plate tokenize function
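def tokenize(text):
    """Parses a string into a list of lowercased tokens for LDA.
    Whitespace tokens and proper nouns are dropped; URLs and @mentions
    are replaced by the placeholders 'URL' and 'SCREEN_NAME'.
    Args: text (str): The string that the function will tokenize.
    Returns: list: the tokens kept after the rules above
    """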
    lda_tokens = []
    tokens = nlp(text)

    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.pos_ == 'PROPN':
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [8]:
# sample through function to test outcome
tokenize(sample)

['police',
 'in',
 'made',
 'several',
 'arrests',
 'during',
 'a',
 'protest',
 'in',
 '.']
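
The sample tweet contains no URL or @mention, so the placeholder branches of `tokenize` are not exercised above. A minimal sketch of those two rules, assuming a plain whitespace split instead of spaCy (so no proper-noun filtering and no punctuation splitting); `simple_tokenize`, `URL_PATTERN`, and the example tweet are hypothetical names and data used only for illustration:

```python
import re

# Assumption: a token counts as a URL if it starts with http(s):// or www.
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def simple_tokenize(text):
    """Whitespace tokenizer that mimics only the placeholder rules of tokenize()."""
    tokens = []
    for raw in text.split():
        if URL_PATTERN.match(raw):
            tokens.append("URL")          # same placeholder as tokenize()
        elif raw.startswith("@"):
            tokens.append("SCREEN_NAME")  # same placeholder as tokenize()
        else:
            tokens.append(raw.lower())
    return tokens

# Made-up tweet, not taken from combined_tweets.csv
example_tokens = simple_tokenize("@someone Video of the arrest: https://example.com/clip")
print(example_tokens)
# ['SCREEN_NAME', 'video', 'of', 'the', 'arrest:', 'URL']
```

Because this sketch splits on whitespace only, punctuation stays attached to words (`'arrest:'`), which is one reason the notebook relies on spaCy for the real tokenization.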

In [9]:
nltk.download('wordnet')

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    

def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

[nltk_data] Downloading package wordnet to C:\Users\Josh
[nltk_data]     Carlisle\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [12]:
# universal stopwords from nltk
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to C:\Users\Josh
[nltk_data]     Carlisle\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [13]:
# extra stop words specific to this model
more_stop = ['police', 'officer', 'cop', 'SCREEN_NAME']

In [14]:
def prepare_text_for_lda(text):
    """Tokenizes text, keeps tokens longer than 4 characters, removes stopwords, and lemmatizes the rest"""
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    tokens = [token for token in tokens if token not in more_stop]
    return tokens
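
To see what each filtering stage contributes, the sketch below traces the sample tweet's tokens through the same four steps. It is only an illustration: `toy_stop` and `toy_lemmas` are tiny hand-written stand-ins (assumptions) for `en_stop` and `get_lemma`, so it runs without NLTK.

```python
# Output of tokenize(sample) as shown earlier in the notebook.
sample_tokens = ['police', 'in', 'made', 'several', 'arrests',
                 'during', 'a', 'protest', 'in', '.']

# Hand-written stand-ins (assumptions) for en_stop, get_lemma and more_stop.
toy_stop = {'in', 'a', 'during'}
toy_lemmas = {'arrests': 'arrest'}
toy_more_stop = {'police', 'officer', 'cop', 'SCREEN_NAME'}

def trace_filters(tokens):
    """Return the token list after each stage of prepare_text_for_lda."""
    long_enough = [t for t in tokens if len(t) > 4]         # length in characters
    no_stop = [t for t in long_enough if t not in toy_stop]
    lemmas = [toy_lemmas.get(t, t) for t in no_stop]        # fall back to the word itself
    final = [t for t in lemmas if t not in toy_more_stop]
    return long_enough, no_stop, lemmas, final

stages = trace_filters(sample_tokens)
for name, stage in zip(['length', 'stopwords', 'lemmas', 'domain stops'], stages):
    print(f"{name:>12}: {stage}")
```

The final stage gives `['several', 'arrest', 'protest']`, matching the first row of the `lemmas` column. Note that the length filter runs first, so three-character tokens such as `'cop'` and the `'URL'` placeholder are already gone before the custom stop list is applied.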

In [15]:
# creates column in DF with lemmas
df['lemmas'] = df['text'].apply(prepare_text_for_lda)

In [16]:
# visualize your work
df

Unnamed: 0,ids,text,reddit,lemmas
0,1266136557871869952,Police in NYC made several arrests during a pr...,1,"[several, arrest, protest]"
1,1266159669262893057,Calls for justice for George Floyd. Protesters...,1,"[call, justice, protester, street, rally]"
2,1266555286678048770,NYPD just casually slamming a dude with a car ...,1,"[casually, slam]"
3,1266540710188195843,Update: Got her permission with a fuck yeah. T...,1,"[update, permission, push, fling]"
4,1266529475757510656,NYPD officer just called a female protester a ...,1,"[call, female, protester, stupid, fucking, bit..."
...,...,...,...,...
6555,1364030259171971076,@narsheviking @heghapoghagan @Raz_Libar @n0cla...,0,"[classic, proxy]"
6556,1364030261483208710,@Lauren_Southern This is a reminder that the p...,0,"[reminder, follow, order, tyrant]"
6557,1364030261906857984,@nytimes Pelosi set the attack up days earlier...,0,"[attack, earlier, capitol, stand, anarchist]"
6558,1364030277463519234,So i started playing and then some time in Dec...,0,"[start, playing, start, faith]"


## Topic Modeling w/ Gensim
* Learn a Vocabulary
* Create a Bag of Words (BoW) representation of each document
* Estimate our LDA model
* Clean up results
* Add topic information back to dataframe
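
The first two steps can be sketched in plain Python to show what a vocabulary and a Bag of Words row look like. This is a simplified illustration, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, the third document is made up, and gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

# First two documents come from the lemmas column; the third is made up.
docs = [['several', 'arrest', 'protest'],
        ['call', 'justice', 'protester', 'street', 'rally'],
        ['arrest', 'protest', 'protest']]

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_vocabulary(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(bows[2])  # [(1, 1), (2, 2)] -> 'arrest' once, 'protest' twice
```

Word order is discarded in this representation; only which vocabulary entries occur in a document, and how often, is passed on to the topic model.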

In [17]:
import gensim
from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

In [18]:
id2word = corpora.Dictionary(df['lemmas'])

In [19]:
len(id2word.keys())

5557

In [20]:
corpus = [id2word.doc2bow(doc) for doc in df['lemmas']]

In [21]:
lda = LdaMulticore(corpus=corpus,
                  id2word=id2word,
                  num_topics = 5,
                  passes = 200,
                  workers = 4,
                  random_state = 42)

In [22]:
lda.print_topics()

[(0,
  '0.024*"would" + 0.020*"charge" + 0.016*"arrest" + 0.015*"could" + 0.008*"first" + 0.007*"chief" + 0.006*"assault" + 0.006*"death" + 0.006*"want" + 0.006*"try"'),
 (1,
  '0.016*"attack" + 0.013*"kill" + 0.011*"protester" + 0.010*"arrest" + 0.009*"call" + 0.007*"please" + 0.007*"protest" + 0.007*"report" + 0.007*"killing" + 0.007*"peaceful"'),
 (2,
  '0.013*"protest" + 0.010*"department" + 0.010*"worker" + 0.009*"spray" + 0.008*"another" + 0.008*"arrest" + 0.007*"pepper" + 0.007*"social" + 0.007*"night" + 0.006*"custody"'),
 (3,
  '0.017*"right" + 0.015*"state" + 0.014*"people" + 0.011*"force" + 0.008*"really" + 0.007*"found" + 0.007*"suspect" + 0.006*"shooting" + 0.006*"health" + 0.005*"public"'),
 (4,
  '0.032*"people" + 0.015*"black" + 0.010*"brutality" + 0.008*"murder" + 0.008*"crime" + 0.008*"death" + 0.007*"video" + 0.007*"protestors" + 0.007*"still" + 0.007*"station"')]

In [23]:
import re

words = [re.findall(r'"([^"]*)"', t[1]) for t in lda.print_topics()]

In [24]:
topics = [' '.join(t[0:5]) for t in words]

In [25]:
for topic_id, t in enumerate(topics):  # avoid shadowing the built-in id()
    print(f"----- Topic {topic_id} ------")
    print(t, end="\n\n")

----- Topic 0 ------
would charge arrest could first

----- Topic 1 ------
attack kill protester arrest call

----- Topic 2 ------
protest department worker spray another

----- Topic 3 ------
right state people force really

----- Topic 4 ------
people black brutality murder crime
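
The regular-expression step in cells 23 and 24 can be checked in isolation. The sketch below applies the same pattern to the Topic 4 string copied from the `print_topics()` output above; `topic_string`, `topic_words`, and `label` are names introduced here for illustration.

```python
import re

# Topic 4 exactly as printed by lda.print_topics() above.
topic_string = ('0.032*"people" + 0.015*"black" + 0.010*"brutality" + '
                '0.008*"murder" + 0.008*"crime" + 0.008*"death" + '
                '0.007*"video" + 0.007*"protestors" + 0.007*"still" + '
                '0.007*"station"')

# Capture everything between each pair of double quotes.
topic_words = re.findall(r'"([^"]*)"', topic_string)

# Keep the five highest-weighted words as a short topic label.
label = ' '.join(topic_words[:5])
print(label)  # people black brutality murder crime
```

As an alternative to parsing the formatted string, gensim's `lda.show_topic(topic_id, topn=5)` returns (word, probability) pairs directly, which would likely make the regex unnecessary.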



## Analyzing the Results of LDA
- How good are the topics themselves:
    * Using intertopic distance visualization
    * Looking at some of the token distributions
- Using the LDA topics for analysis:
    * Score each tweet with a top topic
    * Summary visualization of topic versus sentiment
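
Scoring each tweet with a top topic amounts to picking the highest-probability entry from that tweet's topic distribution. A minimal sketch, assuming each distribution is a list of `(topic_id, probability)` pairs; `dominant_topic` is a hypothetical helper and the example distributions are made up, not model output:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None if empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up per-tweet topic distributions for illustration only.
example_distributions = [
    [(0, 0.10), (4, 0.85)],
    [(1, 0.55), (2, 0.40)],
    [],  # no topic reported for this document
]

top_topics = [dominant_topic(d) for d in example_distributions]
print(top_topics)  # [(4, 0.85), (1, 0.55), None]
```

In this notebook the distributions would presumably come from `lda.get_document_topics(bow)` for each `bow` in `corpus`, and the resulting topic ids could be stored in a new dataframe column (for example a hypothetical `top_topic`), which covers the "add topic information back to dataframe" step.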

In [26]:
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import warnings

warnings.filterwarnings("ignore")

pyLDAvis.enable_notebook()

In [27]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)