# COGS 108 - Final Project 

# Overview

In this research project, we analyzed whether Trump's inflammatory rhetoric against Asia on Twitter had any effect on anti-Asian hate crimes in the US. Specifically, we performed sentiment analysis on Donald Trump's tweets and compared the results against FBI hate crime records. We found no statistically significant correlation in our models. However, since this is only a preliminary study, we suggest directions for further research.

# Names

- Lily Nguyen
- Prithviraj Pahwa
- Sebastian Nunez
- Dev Bhatia

# Research Question

What effect has the negative sentiment of Trump’s Tweets about China and Asian affairs since 2013 had on hate crimes against Asian-Americans in the U.S.?

We measure the sentiment of Trump's Tweets using VADER.

## Background and Prior Work

The motivation for this question stems from President Trump's controversial rhetoric regarding the Coronavirus (COVID-19) and China. Considering recent events, we want to examine Trump's prior record of inflaming racial and ethnic tensions against minority populations within the United States. Previous research has observed that Trump's rhetoric has affected Arab and Muslim-American behavior [1,2,3]. Additionally, a correlation has been found between Trump's rhetoric and anti-Muslim hate crimes [4]. With our approach, we hope to apply similar techniques to a different ethnic group (Chinese-Americans).

References:
- [1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102652
- [2] https://alexandra-siegel.com/wp-content/uploads/2019/08/qjps_election_hatespeech_RR.pdf
- [3] https://www.cambridge.org/core/journals/american-political-science-review/article/effects-of-divisive-political-campaigns-on-the-daytoday-segregation-of-arab-and-muslim-americans/D9977E81BE7E3A11772E51CAA8A9812F
- [4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3149103


# Hypothesis


We hypothesize that in the months when Trump posted the most negative Tweets about China or Asian affairs, more hate crimes were committed. Similarly, we hypothesize that in the weeks when Trump posted the most negative Tweets about China or Asian affairs, more hate crimes were committed in those weeks. We argue that Trump’s inflammatory rhetoric causes hate crimes to increase through the widespread promotion of hateful ideas that encourage audiences to act on them. From our own reading, his Tweets about China are negative towards China (economic blame, "trade war", etc.). Thus, his rhetoric could spark anti-Chinese sentiment among his followers, possibly leading to an increase in hate crimes against Chinese-Americans.

# Dataset(s)

We are using two datasets to test our hypothesis.

__Tweets by Donald Trump__
- Dataset Name: 2009_to_2020_Trump_Tweets.csv
- Link to the dataset: http://www.trumptwitterarchive.com/archive
- Number of observations: 41,082 
- This dataset includes all of President Trump's tweets from May 2009 to roughly the present day. We want to use this dataset to understand his sentiment towards China/Asia.

__FBI Hate Crimes Dataset__
- Dataset Name: hate_crime.zip
- Link to the dataset: https://crime-data-explorer.fr.cloud.gov/downloads-and-docs
- Number of observations: 201,402
- This dataset contains hate crime records within the United States from 1991-2018.

We will explore our research question by aggregating both of these dataframes over a chosen time horizon. For example, if we choose a time horizon of a month, we will aggregate the hate crime data by adding up all the hate crimes against a given minority (say, Asians) that occurred within each month. Similarly, we will aggregate our tweets dataset so that, for each month, we only look at the tweets Donald Trump posted in that month.

We will try out time horizons of a month as well as a week.
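
As a rough illustration of this aggregation, the sketch below uses a few made-up rows rather than the real files. The column names `Date`, `neg`, and `INCIDENT_DATE` and the date formats mirror the data shown later in this notebook; the helpers `period_key` and `aggregate`, the output column names, and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are invented for illustration).
tweets = pd.DataFrame({
    "Date": ["01-05-2018 10:00:00", "01-20-2018 12:30:00", "02-03-2018 08:15:00"],
    "neg": [0.2, 0.4, 0.1],
})
crimes = pd.DataFrame({"INCIDENT_DATE": ["03-JAN-18", "17-JAN-18", "02-FEB-18"]})

# Parse the date strings using the formats visible in the raw data.
tweets["Date"] = pd.to_datetime(tweets["Date"], format="%m-%d-%Y %H:%M:%S")
crimes["INCIDENT_DATE"] = pd.to_datetime(crimes["INCIDENT_DATE"], format="%d-%b-%y")

def period_key(dates, freq):
    """Map each timestamp to its month ("M") or week ("W") period."""
    return dates.dt.to_period(freq)

def aggregate(freq):
    """Count hate crimes and average the tweet `neg` score per period."""
    crime_counts = crimes.groupby(period_key(crimes["INCIDENT_DATE"], freq)).size()
    mean_neg = tweets.groupby(period_key(tweets["Date"], freq))["neg"].mean()
    return pd.concat(
        [crime_counts.rename("hate_crimes"), mean_neg.rename("mean_neg")], axis=1
    )

monthly = aggregate("M")
weekly = aggregate("W")
print(monthly)
print(weekly)
```

Each row of `monthly` or `weekly` then pairs a hate crime count with a sentiment summary for the same period, which is the shape needed to compare the two series.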

# Setup

In [0]:
!pip install nltk --user
!pip install plotly --user
!pip install statsmodels --user
!pip install gensim --user



In [0]:
import pandas as pd
import numpy as np
import plotly.express as px
import nltk 
nltk.download('vader_lexicon')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk import pos_tag, word_tokenize, WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
import plotly.figure_factory as ff
import plotly.graph_objects as go
import statsmodels.api as sm
import statsmodels.formula.api as smf
import gensim
import matplotlib.pyplot as plt
from gensim.models import Word2Vec

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



The twython library has not been installed. Some functionality from the twitter package will not be available.



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.



pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.



# Data Cleaning

As a proof of concept, we initially want to check whether words such as Asia/China are associated with negative terms throughout Trump's Twitter profile. Thus, we will create word embedding models to see which words occupy a similar space.

We have found a dataset of Donald Trump's tweets and will use that to conduct our analysis.

In [19]:
df = pd.read_csv("https://raw.githubusercontent.com/COGS108/group55_sp20/master/original_Trump_Tweets.csv?token=AKZZ2IEWN5UMDHA75VDNONC65BACW")
df.head()

Unnamed: 0,id_str,text,created_at,favorite_count,retweet_count,is_retweet,source
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone
1,1.257851e+18,RT @realDonaldTrump: For the constant criticis...,05-06-2020 1:52:17,0,18579,True,Twitter for iPhone
2,1.25785e+18,RT @NNVPLizer2019: “The Navajo Nation is a Nat...,05-06-2020 1:50:21,0,6738,True,Twitter for iPhone
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone
4,1.257798e+18,RT @realDonaldTrump: Will be doing a major int...,05-05-2020 22:21:41,0,11399,True,Twitter for iPhone


In [21]:
# Changing column names to make them neater
rename_dict = {"id_str":"ID", "text":"Text", "created_at":"Date", "favorite_count":"Favourites",
               "retweet_count":"Retweets", "is_retweet":"Is_Retweet", "source":"Source"}
df = df.rename(rename_dict, axis = 1)
df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone
1,1.257851e+18,RT @realDonaldTrump: For the constant criticis...,05-06-2020 1:52:17,0,18579,True,Twitter for iPhone
2,1.25785e+18,RT @NNVPLizer2019: “The Navajo Nation is a Nat...,05-06-2020 1:50:21,0,6738,True,Twitter for iPhone
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone
4,1.257798e+18,RT @realDonaldTrump: Will be doing a major int...,05-05-2020 22:21:41,0,11399,True,Twitter for iPhone


In [22]:
# Getting only tweets i.e. taking out retweets
tweets_df = df[df["Is_Retweet"] == False]
tweets_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone
6,1.257794e+18,I was thrilled to be back in the Great State o...,05-05-2020 22:06:11,38713,9402,False,Twitter for iPhone
7,1.257768e+18,Will be doing a major interview tonight at 6:3...,05-05-2020 20:25:31,51521,12017,False,Twitter for iPhone
8,1.257753e+18,On #NationalTeacherDay we recognize the countl...,05-05-2020 19:25:14,34874,8602,False,Twitter for iPhone


In [28]:
# Let's look at whether our dataset has null values in the text column
print("% of null values in text column: " + str(tweets_df["Text"].isnull().mean()))

% of null values in text column: 0.06589358799454298


After checking, these are all older tweets, which we can simply remove from our dataset.

In [29]:
tweets_df = tweets_df[tweets_df["Text"].notnull()]
print("% of null values in text column: " + str(tweets_df["Text"].isnull().mean()))
tweets_df.head()

% of null values in text column: 0.0


Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone
6,1.257794e+18,I was thrilled to be back in the Great State o...,05-05-2020 22:06:11,38713,9402,False,Twitter for iPhone
7,1.257768e+18,Will be doing a major interview tonight at 6:3...,05-05-2020 20:25:31,51521,12017,False,Twitter for iPhone
8,1.257753e+18,On #NationalTeacherDay we recognize the countl...,05-05-2020 19:25:14,34874,8602,False,Twitter for iPhone


We can thus proceed with our dataset as is for now.

In [23]:
# Changing column names to make them neater
rename_dict = {"id_str":"ID", "text":"Text", "created_at":"Date", "favorite_count":"Favourites",
               "retweet_count":"Retweets", "is_retweet":"Is_Retweet", "source":"Source"}
tweets_df = tweets_df.rename(rename_dict, axis = 1)
tweets_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone
6,1.257794e+18,I was thrilled to be back in the Great State o...,05-05-2020 22:06:11,38713,9402,False,Twitter for iPhone
7,1.257768e+18,Will be doing a major interview tonight at 6:3...,05-05-2020 20:25:31,51521,12017,False,Twitter for iPhone
8,1.257753e+18,On #NationalTeacherDay we recognize the countl...,05-05-2020 19:25:14,34874,8602,False,Twitter for iPhone


We will create a Word2Vec Word Embedding model to see what sort of space words like China/Asia occupy in Trump's lexicon.

In [0]:
tweets_w2v = Word2Vec(tweets_df["Text"].apply(word_tokenize), sg = 1)

In [0]:
tweets_w2v.wv.most_similar("China", topn = 20)


Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.



[('Iran', 0.7977445721626282),
 ('Europe', 0.7679320573806763),
 ('Turkey', 0.7641910314559937),
 ('OPEC', 0.7572132349014282),
 ('Canada', 0.7428255677223206),
 ('trade', 0.7427544593811035),
 ('oil', 0.7269597053527832),
 ('NATO', 0.7192667722702026),
 ('Chinese', 0.7141560316085815),
 ('nuclear', 0.708271861076355),
 ('negotiations', 0.7011555433273315),
 ('terrorism', 0.700624942779541),
 ('countries', 0.700510561466217),
 ('Nuclear', 0.6995514631271362),
 ('Mexico', 0.6876879334449768),
 ('Germany', 0.6876776218414307),
 ('Tariffs', 0.6817082166671753),
 ('Trade', 0.6801229119300842),
 ('Korea', 0.6779588460922241),
 ('Pakistan', 0.6742398738861084)]
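
The number next to each word is a cosine similarity between word vectors, which is what gensim's `most_similar` reports. The sketch below computes cosine similarity by hand on made-up 3-dimensional vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions, not values taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors (assumption); real Word2Vec vectors have many more dimensions.
toy_vectors = {
    "China": [0.9, 0.1, 0.3],
    "trade": [0.8, 0.2, 0.4],
    "thrilled": [-0.2, 0.9, -0.1],
}

for word in ("trade", "thrilled"):
    score = cosine_similarity(toy_vectors["China"], toy_vectors[word])
    print(word, round(score, 3))
```

A score near 1 means two words tend to appear in similar contexts in the corpus; on its own it does not say whether that context is positive or negative.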

In [0]:
tweets_w2v.wv.most_similar("Chinese", topn = 20)





[('Taliban', 0.8238333463668823),
 ('Kurds', 0.810981035232544),
 ('European', 0.8043925762176514),
 ('unlimited', 0.7999637126922607),
 ('negotiating', 0.7982966899871826),
 ('Honduras', 0.797623872756958),
 ('Iranians', 0.7974710464477539),
 ('ceasefire', 0.7971891164779663),
 ('fighters', 0.7909480929374695),
 ('prisoners', 0.7896702289581299),
 ('negotiated', 0.7889379262924194),
 ('mostly', 0.7858562469482422),
 ('Turkey', 0.7822383642196655),
 ('sanctions', 0.7819817066192627),
 ('Virus', 0.78155517578125),
 ('practices', 0.7812646627426147),
 ('foolishly', 0.7810503244400024),
 ('TPP', 0.7809121608734131),
 ('vast', 0.7808462977409363),
 ('charged', 0.7785201668739319)]

In [0]:
tweets_w2v.wv.most_similar("Asia", topn = 20)





[('shipped', 0.9240245819091797),
 ('export', 0.9094930291175842),
 ('devaluation', 0.9083917140960693),
 ('planes', 0.9080590009689331),
 ('robbed', 0.9071637988090515),
 ('eating', 0.9070737957954407),
 ('removing', 0.9068514108657837),
 ('subsidizing', 0.9018558859825134),
 ('renegotiate', 0.9016463756561279),
 ('capital', 0.9013884663581848),
 ('relationships', 0.9012633562088013),
 ('non-Tariffed', 0.9011678695678711),
 ('dangers', 0.9010893106460571),
 ('reaching', 0.9010334014892578),
 ('devalue', 0.9009050130844116),
 ('worldwide', 0.8999800682067871),
 ('pumping', 0.8982710838317871),
 ('ruin', 0.8960363268852234),
 ('onslaught', 0.8953925967216492),
 ('permanently', 0.8951070308685303)]

In [0]:
tweets_w2v.wv.most_similar("Asian", topn = 20)





[('boom', 0.9389751553535461),
 ('GROWTH', 0.9364407658576965),
 ('workforce', 0.9351952075958252),
 ('rural', 0.9337780475616455),
 ('timeless', 0.9331744909286499),
 ('soaring', 0.9316504597663879),
 ('factory', 0.9298288822174072),
 ('as…', 0.9297479391098022),
 ('bonds', 0.928805947303772),
 ('independence', 0.9273960590362549),
 ('500000', 0.9272329807281494),
 ('significantly', 0.926680326461792),
 ('employers', 0.9265719056129456),
 ('stagnant', 0.9263258576393127),
 ('250000', 0.9260743260383606),
 ('needless', 0.9260541796684265),
 ('developed', 0.9255905151367188),
 ('upward', 0.9255610704421997),
 ('waving', 0.9248571991920471),
 ('Migration', 0.9247310757637024)]

Although these terms are mostly associated with ordinary words, the relatively high similarity with words like Taliban, terrorism, devaluation, robbed, ruin and, most importantly, Virus does tend to indicate racist rhetoric on Trump's part. Thus, we feel confident in pursuing our research question.

Our research question is mostly concerned with looking at Donald Trump's inflammatory rhetoric against Asian people (in particular, Chinese people). Thus, we will include an additional column which checks whether the tweet directly mentions at least one of Asia or China.

In [0]:
# Creating a column to see if China/Asia/Asian-Americans are mentioned in the
# tweet's text.
watch_words = ["asia", "china", "asian", "chinese"]

In [0]:
def check_if_words_mentioned(tweet, word_list = watch_words):
  """
  Checking if the tweet mentions any word in the word_list. 
  Used to identify tweets in which Asia/China is directly mentioned.
  """
  for i in tweet.split():
    if i.lower() in word_list:
      return True
  return False
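
Because `tweet.split()` splits only on whitespace, a token such as `China.` or `#China` does not equal any entry in `watch_words` and is not counted. A possible variant, shown here only as a sketch, extracts alphabetic runs with a regular expression before matching. The name `mentions_watch_word` is hypothetical, and using it in place of the function above would change how many tweets are matched.

```python
import re

watch_words = ["asia", "china", "asian", "chinese"]

def mentions_watch_word(tweet, word_list=watch_words):
    """Return True if any alphabetic token in the tweet is in word_list."""
    # Lowercase first, then keep only runs of letters so punctuation is dropped.
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return any(token in word_list for token in tokens)

print(mentions_watch_word("They blame China."))   # trailing period is ignored
print(mentions_watch_word("Visited Chinatown"))   # longer words do not match
```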

In [32]:
tweets_df["China/Asia Mentioned?"] = tweets_df["Text"].apply(lambda x: check_if_words_mentioned(x, watch_words))
tweets_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone,False
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone,False
6,1.257794e+18,I was thrilled to be back in the Great State o...,05-05-2020 22:06:11,38713,9402,False,Twitter for iPhone,False
7,1.257768e+18,Will be doing a major interview tonight at 6:3...,05-05-2020 20:25:31,51521,12017,False,Twitter for iPhone,False
8,1.257753e+18,On #NationalTeacherDay we recognize the countl...,05-05-2020 19:25:14,34874,8602,False,Twitter for iPhone,False


For exploratory purposes, we will create an additional dataframe called asia_df, which will consist of all the tweets in which Trump directly mentions Asia/China.

In [33]:
# Copy the subset so that adding columns later does not modify a view of tweets_df
asia_df = tweets_df[tweets_df["China/Asia Mentioned?"] == True].copy()
asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?
55,1.257042e+18,Intelligence has just reported to me that I wa...,05-03-2020 20:18:19,165655,41034,False,Twitter for iPhone,True
56,1.257042e+18,....Fake News got it wrong again as always and...,05-03-2020 20:18:19,96666,22990,False,Twitter for iPhone,True
79,1.25672e+18,The Democrats are just as always looking for t...,05-02-2020 23:00:39,54048,17076,False,Twitter for iPhone,True
134,1.256244e+18,Concast (@NBCNews) and Fake News @CNN are goin...,05-01-2020 15:27:29,163161,45538,False,Twitter for iPhone,True
517,1.25159e+18,China wants Sleepy Joe sooo badly. They want a...,04-18-2020 19:13:28,121495,34270,False,Twitter for iPhone,True


Let us look at how many times Trump mentioned Asia/China in his tweets each year.

In [34]:
mentions_by_year = asia_df["Date"].apply(pd.to_datetime).apply(lambda x: x.year).value_counts()
fig = px.bar(x = mentions_by_year.index, y = mentions_by_year, color = mentions_by_year)
fig.update_layout(title = "Number Of Tweets Mentioning Asia/China by year", xaxis_title = "Year", yaxis_title = "Number Of Tweets")

Let us move towards analyzing the emotions behind the tweets using Sentiment Analysis.

In [35]:
asia_df['token'] = asia_df['Text'].apply(word_tokenize) 
asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token
55,1.257042e+18,Intelligence has just reported to me that I wa...,05-03-2020 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th..."
56,1.257042e+18,....Fake News got it wrong again as always and...,05-03-2020 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ..."
79,1.25672e+18,The Democrats are just as always looking for t...,05-02-2020 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin..."
134,1.256244e+18,Concast (@NBCNews) and Fake News @CNN are goin...,05-01-2020 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @..."
517,1.25159e+18,China wants Sleepy Joe sooo badly. They want a...,04-18-2020 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th..."


Let us look at a couple of examples to see how VADER performs.

In [36]:
print(asia_df["Text"].iloc[0])

Intelligence has just reported to me that I was correct and that they did NOT bring up the CoronaVirus subject matter until late into January just prior to my banning China from the U.S. Also they only spoke of the Virus in a very non-threatening or matter of fact manner...


In [37]:
analyser = SentimentIntensityAnalyzer()
analyser.polarity_scores(asia_df["Text"].iloc[0])

{'compound': 0.5106, 'neg': 0.0, 'neu': 0.895, 'pos': 0.105}

In [38]:
print(asia_df["Text"].iloc[2])

The Democrats are just as always looking for trouble. They do nothing constructive even in times of crisis. They don’t want to blame their cash cow China for the plague. China is blaming Europe. Dr. Fauci will be testifying before the Senate very soon! #DONOTHINGDEMOCRATS https://t.co/fgHuYeiOQY


In [39]:
analyser.polarity_scores(asia_df["Text"].iloc[2])

{'compound': -0.908, 'neg': 0.231, 'neu': 0.746, 'pos': 0.024}

The sentiment analysis broadly matches our intuitions, although the neg rating for the second tweet (0.231) seems a bit low.
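
For context on the `compound` value: VADER sums the valence scores of the words in the text (after its rule-based adjustments) and squashes that sum into the range -1 to 1 using x / sqrt(x² + α), with α = 15 in the reference implementation. The sketch below reproduces only that normalization step; `normalize_compound` is a name chosen here, and the raw sums passed in are invented.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into (-1, 1), as done for VADER's `compound`."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Invented raw valence sums, from strongly negative to strongly positive.
for raw_sum in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(raw_sum, round(normalize_compound(raw_sum), 4))
```

Because of this squashing, a tweet containing several negative words moves quickly towards -1, which may explain why the second tweet scores -0.908 on `compound` even though its `neg` proportion is only 0.231.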

We will now move onto pre-processing the text by removing stop-words and stemming the remaining tokens to hopefully extract more robust sentiments from the text.

In [40]:
stop_words = set(stopwords.words('english'))
asia_df['stop_tokens'] = asia_df['token'].apply(lambda x: [item for item in x if item not in stop_words])
asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens
55,1.257042e+18,Intelligence has just reported to me that I wa...,05-03-2020 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th...","[Intelligence, reported, I, correct, NOT, brin..."
56,1.257042e+18,....Fake News got it wrong again as always and...,05-03-2020 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ...","[..., .Fake, News, got, wrong, always, tens, t..."
79,1.25672e+18,The Democrats are just as always looking for t...,05-02-2020 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin...","[The, Democrats, always, looking, trouble, ., ..."
134,1.256244e+18,Concast (@NBCNews) and Fake News @CNN are goin...,05-01-2020 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @...","[Concast, (, @, NBCNews, ), Fake, News, @, CNN..."
517,1.25159e+18,China wants Sleepy Joe sooo badly. They want a...,04-18-2020 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[China, wants, Sleepy, Joe, sooo, badly, ., Th..."


In [41]:
# Moving onto stemming
ps = PorterStemmer()
asia_df['stem_tokens'] = asia_df['stop_tokens'].apply(lambda x: [ps.stem(y) for y in x])
asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens,stem_tokens
55,1.257042e+18,Intelligence has just reported to me that I wa...,05-03-2020 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th...","[Intelligence, reported, I, correct, NOT, brin...","[intellig, report, I, correct, not, bring, cor..."
56,1.257042e+18,....Fake News got it wrong again as always and...,05-03-2020 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ...","[..., .Fake, News, got, wrong, always, tens, t...","[..., .fake, new, got, wrong, alway, ten, thou..."
79,1.25672e+18,The Democrats are just as always looking for t...,05-02-2020 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin...","[The, Democrats, always, looking, trouble, ., ...","[the, democrat, alway, look, troubl, ., they, ..."
134,1.256244e+18,Concast (@NBCNews) and Fake News @CNN are goin...,05-01-2020 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @...","[Concast, (, @, NBCNews, ), Fake, News, @, CNN...","[concast, (, @, nbcnew, ), fake, new, @, cnn, ..."
517,1.25159e+18,China wants Sleepy Joe sooo badly. They want a...,04-18-2020 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[china, want, sleepi, joe, sooo, badli, ., the..."


We will now apply VADER's sentiment analysis. As a first look, the next cell scores the raw text of each tweet; the pre-processed tokens are scored further below.

In [42]:
# Using apply to perform the row-wise analysis
asia_df["Text"].apply(analyser.polarity_scores)

55       {'neg': 0.0, 'neu': 0.895, 'pos': 0.105, 'comp...
56       {'neg': 0.142, 'neu': 0.752, 'pos': 0.106, 'co...
79       {'neg': 0.231, 'neu': 0.746, 'pos': 0.024, 'co...
134      {'neg': 0.13, 'neu': 0.69, 'pos': 0.18, 'compo...
517      {'neg': 0.075, 'neu': 0.75, 'pos': 0.175, 'com...
                               ...                        
44767    {'neg': 0.309, 'neu': 0.691, 'pos': 0.0, 'comp...
44781    {'neg': 0.254, 'neu': 0.652, 'pos': 0.094, 'co...
44786    {'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'comp...
44788    {'neg': 0.059, 'neu': 0.781, 'pos': 0.161, 'co...
44910    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
Name: Text, Length: 733, dtype: object

We will write a function to add all these ratings into our dataframe.

In [0]:
def take_tokens_to_new_df(df, token_col):
  """
  Uses the correct token column and conducts sentiment analysis on that column
  in the dataframe. Returns a new dataframe with all the relevant information.
  """
  new_df = df.copy()
  ser = new_df[token_col].apply(lambda x: " ".join(x)).apply(analyser.polarity_scores)
  indices = []
  negatives = []
  neutrals = []
  positives = []
  compounds = []
  for i, j in zip(ser.index, ser):
    indices.append(i)
    negatives.append(j["neg"])
    neutrals.append(j["neu"])
    positives.append(j["pos"])
    compounds.append(j["compound"])
  new_df["indices"] = indices
  new_df["neg"] = negatives
  new_df["neu"] = neutrals
  new_df["pos"] = positives
  new_df["compound"] = compounds
  return new_df
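
The explicit loop can also be written more compactly by turning the list of score dictionaries into a DataFrame and joining it back on the index. This is a sketch of that alternative: `add_scores` is a hypothetical name, and `fake_polarity_scores` is a made-up stand-in for `analyser.polarity_scores` so that the example runs without the VADER lexicon. Unlike the function above, it does not add an `indices` column.

```python
import pandas as pd

def fake_polarity_scores(text):
    """Hypothetical stand-in for analyser.polarity_scores (same keys, invented values)."""
    n_bad = text.lower().split().count("bad")
    neg = min(1.0, 0.5 * n_bad)
    return {"neg": neg, "neu": 1.0 - neg, "pos": 0.0, "compound": -neg}

def add_scores(df, token_col, scorer):
    """Join token lists into strings, score them, and append one column per score."""
    joined = df[token_col].apply(" ".join)
    # One dict per row becomes one row of the scores frame, aligned on df's index.
    scores = pd.DataFrame(joined.apply(scorer).tolist(), index=df.index)
    return df.join(scores)

toy = pd.DataFrame({"stem_tokens": [["bad", "deal"], ["great", "deal"]]}, index=[55, 79])
scored = add_scores(toy, "stem_tokens", fake_polarity_scores)
print(scored)
```

In the notebook, the real scorer would be passed instead, for example `add_scores(asia_df, "stem_tokens", analyser.polarity_scores)`.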

In [44]:
# Applying the function to asia_df
asia_df = take_tokens_to_new_df(asia_df, "stem_tokens")
asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens,stem_tokens,indices,neg,neu,pos,compound
55,1.257042e+18,Intelligence has just reported to me that I wa...,05-03-2020 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th...","[Intelligence, reported, I, correct, NOT, brin...","[intellig, report, I, correct, not, bring, cor...",55,0.145,0.766,0.089,-0.5267
56,1.257042e+18,....Fake News got it wrong again as always and...,05-03-2020 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ...","[..., .Fake, News, got, wrong, always, tens, t...","[..., .fake, new, got, wrong, alway, ten, thou...",56,0.287,0.568,0.145,-0.7177
79,1.25672e+18,The Democrats are just as always looking for t...,05-02-2020 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin...","[The, Democrats, always, looking, trouble, ., ...","[the, democrat, alway, look, troubl, ., they, ...",79,0.157,0.803,0.04,-0.5848
134,1.256244e+18,Concast (@NBCNews) and Fake News @CNN are goin...,05-01-2020 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @...","[Concast, (, @, NBCNews, ), Fake, News, @, CNN...","[concast, (, @, nbcnew, ), fake, new, @, cnn, ...",134,0.095,0.646,0.258,0.6476
517,1.25159e+18,China wants Sleepy Joe sooo badly. They want a...,04-18-2020 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[china, want, sleepi, joe, sooo, badli, ., the...",517,0.06,0.73,0.21,0.3595


In [45]:
# Repeating the same process of pre-processing and sentiment analysis for the larger tweets_df
tweets_df['token'] = tweets_df['Text'].apply(word_tokenize) 
tweets_df['stop_tokens'] = tweets_df['token'].apply(lambda x: [item for item in x if item not in stop_words])
tweets_df['stem_tokens'] = tweets_df['stop_tokens'].apply(lambda x: [ps.stem(y) for y in x])
tweets_df = take_tokens_to_new_df(tweets_df, "stem_tokens")
tweets_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens,stem_tokens,indices,neg,neu,pos,compound
0,1.257873e+18,Thank you @Honeywell! https://t.co/4jH6NF63XI,05-06-2020 3:22:09,36163,8497,False,Twitter for iPhone,False,"[Thank, you, @, Honeywell, !, https, :, //t.co...","[Thank, @, Honeywell, !, https, :, //t.co/4jH6...","[thank, @, honeywel, !, http, :, //t.co/4jh6nf...",0,0.0,0.518,0.482,0.4199
3,1.257823e+18,https://t.co/cnQ4tL0aN3,05-06-2020 0:02:17,57072,17626,False,Twitter for iPhone,False,"[https, :, //t.co/cnQ4tL0aN3]","[https, :, //t.co/cnQ4tL0aN3]","[http, :, //t.co/cnq4tl0an3]",3,0.0,1.0,0.0,0.0
6,1.257794e+18,I was thrilled to be back in the Great State o...,05-05-2020 22:06:11,38713,9402,False,Twitter for iPhone,False,"[I, was, thrilled, to, be, back, in, the, Grea...","[I, thrilled, back, Great, State, Arizona, inc...","[I, thrill, back, great, state, arizona, incre...",6,0.0,0.615,0.385,0.784
7,1.257768e+18,Will be doing a major interview tonight at 6:3...,05-05-2020 20:25:31,51521,12017,False,Twitter for iPhone,False,"[Will, be, doing, a, major, interview, tonight...","[Will, major, interview, tonight, 6:30, P.M., ...","[will, major, interview, tonight, 6:30, p.m., ...",7,0.0,1.0,0.0,0.0
8,1.257753e+18,On #NationalTeacherDay we recognize the countl...,05-05-2020 19:25:14,34874,8602,False,Twitter for iPhone,False,"[On, #, NationalTeacherDay, we, recognize, the...","[On, #, NationalTeacherDay, recognize, countle...","[On, #, nationalteacherday, recogn, countless,...",8,0.0,0.866,0.134,0.4199


Let us examine what the neg ratings look like for the two dataframes.

In [46]:
# Entire Dataframe
tweets_df["neg"].describe()

count    41082.000000
mean         0.074147
std          0.122814
min          0.000000
25%          0.000000
50%          0.000000
75%          0.136000
max          1.000000
Name: neg, dtype: float64

In [47]:
# Subset of Dataframe
asia_df["neg"].describe()

count    733.000000
mean       0.099632
std        0.116034
min        0.000000
25%        0.000000
50%        0.070000
75%        0.172000
max        0.624000
Name: neg, dtype: float64

Let us also look at the distributions of the two columns to gain a firmer understanding of how the ratings are spread. We will draw both a boxplot and a plot of the entire distribution.

In [48]:
# Drawing Boxplots
y0 = tweets_df["neg"]
y1 = asia_df["neg"]
fig = go.Figure()
fig.add_trace(go.Box(y=y0, name = "All Tweets"))
fig.add_trace(go.Box(y=y1, name = "Asia Tweets"))
fig.update_layout(title = "Negative Ratings of Tweets for Different Datasets", 
                  xaxis_title = "Dataset", yaxis_title = "Negative Ratings")
fig.show()

As we can observe, there isn't a great deal of difference between the two datasets although the Asia Tweets do seem to be on the slightly more negative side.

In [49]:
y0 = tweets_df["neg"]
y1 = asia_df["neg"]
# Group data together
hist_data = [y0, y1]

group_labels = ['Whole Dataset', 'Asia Dataset']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, show_hist = False, bin_size=.2)
fig.update_layout(title = "Distribution of Negative Ratings for Different Datasets", 
                  xaxis_title = "Negative Rating", yaxis_title = "Density")
fig.show()

It's again tough to make out much detail from this, apart from the greater proportion of 0 negative ratings for the whole dataset. Confirming this:

In [50]:
print("% of 0 values for neg for the whole dataset: " + str(np.round(y0.value_counts(normalize = True)[0], 4) * 100))
print("% of 0 values for neg for the Asia dataset: " + str(np.round(y1.value_counts(normalize = True)[0], 2) * 100))

% of 0 values for neg for the whole dataset: 65.29
% of 0 values for neg for the Asia dataset: 44.0


### Integrating Hate Crime Data

In [0]:
# Unzipping hate_crime to get hate_crime_data
!unzip hate_crime.zip -d hate_crime_data

Archive:  hate_crime.zip
  inflating: hate_crime_data/hate_crime.csv  
  inflating: hate_crime_data/HC Readme.docx  


In [0]:
hate_crime_data = pd.read_csv("/content/hate_crime_data/hate_crime.csv")
hate_crime_data.head()


Columns (19) have mixed types.Specify dtype option on import or set low_memory=False.



Unnamed: 0,INCIDENT_ID,DATA_YEAR,ORI,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,STATE_ABBR,STATE_NAME,DIVISION_NAME,REGION_NAME,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,INCIDENT_DATE,ADULT_VICTIM_COUNT,JUVENILE_VICTIM_COUNT,TOTAL_OFFENDER_COUNT,ADULT_OFFENDER_COUNT,JUVENILE_OFFENDER_COUNT,OFFENDER_RACE,OFFENDER_ETHNICITY,VICTIM_COUNT,OFFENSE_NAME,TOTAL_INDIVIDUAL_VICTIMS,LOCATION_NAME,BIAS_DESC,VICTIM_TYPES,MULTIPLE_OFFENSE,MULTIPLE_BIAS
0,3015,1991,AR0040200,Rogers,,City,AR,Arkansas,West South Central,South,5,"Cities from 10,000 thru 24,999",31-AUG-91,,,1,,,White,,1,Intimidation,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-Black or African American,Individual,S,S
1,3016,1991,AR0290100,Hope,,City,AR,Arkansas,West South Central,South,6,"Cities from 2,500 thru 9,999",19-SEP-91,,,1,,,Black or African American,,1,Simple Assault,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual,S,S
2,43,1991,AR0350100,Pine Bluff,,City,AR,Arkansas,West South Central,South,3,"Cities from 50,000 thru 99,999",04-JUL-91,,,1,,,Black or African American,,1,Aggravated Assault,1.0,Residence/Home,Anti-Black or African American,Individual,S,S
3,44,1991,AR0350100,Pine Bluff,,City,AR,Arkansas,West South Central,South,3,"Cities from 50,000 thru 99,999",24-DEC-91,,,1,,,Black or African American,,2,Aggravated Assault;Destruction/Damage/Vandalis...,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual,M,S
4,3017,1991,AR0350100,Pine Bluff,,City,AR,Arkansas,West South Central,South,3,"Cities from 50,000 thru 99,999",23-DEC-91,,,1,,,Black or African American,,1,Aggravated Assault,1.0,Service/Gas Station,Anti-White,Individual,S,S


In [0]:
# Examining the data
ser = hate_crime_data["BIAS_DESC"].value_counts(normalize = True)[:10]
ser

Anti-Black or African American                               0.342875
Anti-Jewish                                                  0.129636
Anti-White                                                   0.115912
Anti-Gay (Male)                                              0.100872
Anti-Hispanic or Latino                                      0.063738
Anti-Other Race/Ethnicity/Ancestry                           0.050153
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)    0.030173
Anti-Asian                                                   0.029359
Anti-Multiple Races, Group                                   0.024101
Anti-Lesbian (Female)                                        0.021181
Name: BIAS_DESC, dtype: float64

In [0]:
fig = px.bar(x = ser.index, y = np.round(ser * 100, 4), color = ser)
fig.update_layout(title = "Percentage Of All Hate Crimes For Each Persecuted Group",
                  xaxis_title = "Group", yaxis_title = "Percentage Of All Hate Crimes")
fig.show()

We see that hate crimes against Asians are a relatively small share of all hate crimes. However, recent events and certain political rhetoric have inflamed racial/ethnic tensions, and thus hate crimes against Asian Americans are certainly worth exploring.

In [0]:
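# Isolate anti-Asian hate crimes in a new dataframe called asian_hc.
# str.contains also matches rows where "Anti-Asian" is one of several listed biases.
hate_crime_data["Anti-Asian?"] = hate_crime_data["BIAS_DESC"].str.contains("Anti-Asian")
# .copy() gives an independent frame, so adding columns to asian_hc later
# does not trigger pandas' SettingWithCopyWarning.
asian_hc = hate_crime_data[hate_crime_data["Anti-Asian?"] == True].copy()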
asian_hc.head()

Unnamed: 0,INCIDENT_ID,DATA_YEAR,ORI,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,STATE_ABBR,STATE_NAME,DIVISION_NAME,REGION_NAME,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,INCIDENT_DATE,ADULT_VICTIM_COUNT,JUVENILE_VICTIM_COUNT,TOTAL_OFFENDER_COUNT,ADULT_OFFENDER_COUNT,JUVENILE_OFFENDER_COUNT,OFFENDER_RACE,OFFENDER_ETHNICITY,VICTIM_COUNT,OFFENSE_NAME,TOTAL_INDIVIDUAL_VICTIMS,LOCATION_NAME,BIAS_DESC,VICTIM_TYPES,MULTIPLE_OFFENSE,MULTIPLE_BIAS,Anti-Asian?
43,33,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",13-NOV-91,,,1,,,White,,1,Intimidation,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-Asian,Individual,S,S,True
55,3010,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",07-DEC-91,,,1,,,Unknown,,1,Intimidation,1.0,Restaurant,Anti-Asian,Individual,S,S,True
63,51,1991,CA0300900,Garden Grove,,City,CA,California,Pacific,West,2,"Cities from 100,000 thru 249,999",16-AUG-91,,,2,,,White,,2,Aggravated Assault,2.0,Parking/Drop Lot/Garage,Anti-Asian,Individual,S,S,True
71,77,1991,CO0010100,Aurora,,City,CO,Colorado,Mountain,West,2,"Cities from 100,000 thru 249,999",01-AUG-91,,,1,,,White,,1,Aggravated Assault,1.0,Residence/Home,Anti-Asian,Individual,S,S,True
116,95,1991,CO0031200,Chatfield State Park,,Other State Agency,CO,Colorado,Mountain,West,7,"Cities under 2,500",29-JUN-91,,,1,,,White,,1,Destruction/Damage/Vandalism of Property,1.0,Other/Unknown,Anti-Asian,Individual,S,S,True
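
As a small illustration of how the substring filter behaves, the sketch below applies `str.contains` to a few made-up bias descriptions. The semicolon-separated multi-bias string and the missing value are assumptions for illustration, not rows taken from the FBI file.

```python
import pandas as pd

# Hypothetical bias descriptions; the multi-bias string format is an assumption.
toy_bias = pd.Series(["Anti-Asian", "Anti-White", "Anti-Asian;Anti-Jewish", None])

# na=False treats missing descriptions as non-matches instead of propagating NaN.
mask = toy_bias.str.contains("Anti-Asian", na=False)

# .copy() makes the subset independent of the original Series.
subset = toy_bias[mask].copy()
print(subset.tolist())
```

Under these assumptions, an incident with several recorded biases is counted as anti-Asian as long as "Anti-Asian" appears anywhere in the description.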


To analyze the temporal component of the tweets and the hate crimes, we have to adopt time horizons over which to aggregate them. Let us start with a time horizon of <b> 1 month </b> for now.

In [0]:
asian_hc["INCIDENT_DATE"] = asian_hc["INCIDENT_DATE"].apply(pd.to_datetime)
asian_hc["Incident Month & Year"] = asian_hc["INCIDENT_DATE"].apply(lambda x: str(x.month) + ", " + str(x.year))
asian_hc.head()

Unnamed: 0,INCIDENT_ID,DATA_YEAR,ORI,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,STATE_ABBR,STATE_NAME,DIVISION_NAME,REGION_NAME,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,INCIDENT_DATE,ADULT_VICTIM_COUNT,JUVENILE_VICTIM_COUNT,TOTAL_OFFENDER_COUNT,ADULT_OFFENDER_COUNT,JUVENILE_OFFENDER_COUNT,OFFENDER_RACE,OFFENDER_ETHNICITY,VICTIM_COUNT,OFFENSE_NAME,TOTAL_INDIVIDUAL_VICTIMS,LOCATION_NAME,BIAS_DESC,VICTIM_TYPES,MULTIPLE_OFFENSE,MULTIPLE_BIAS,Anti-Asian?,Incident Month,Incident Month & Year
43,33,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",1991-11-13,,,1,,,White,,1,Intimidation,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-Asian,Individual,S,S,True,"11, 1991","11, 1991"
55,3010,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",1991-12-07,,,1,,,Unknown,,1,Intimidation,1.0,Restaurant,Anti-Asian,Individual,S,S,True,"12, 1991","12, 1991"
63,51,1991,CA0300900,Garden Grove,,City,CA,California,Pacific,West,2,"Cities from 100,000 thru 249,999",1991-08-16,,,2,,,White,,2,Aggravated Assault,2.0,Parking/Drop Lot/Garage,Anti-Asian,Individual,S,S,True,"8, 1991","8, 1991"
71,77,1991,CO0010100,Aurora,,City,CO,Colorado,Mountain,West,2,"Cities from 100,000 thru 249,999",1991-08-01,,,1,,,White,,1,Aggravated Assault,1.0,Residence/Home,Anti-Asian,Individual,S,S,True,"8, 1991","8, 1991"
116,95,1991,CO0031200,Chatfield State Park,,Other State Agency,CO,Colorado,Mountain,West,7,"Cities under 2,500",1991-06-29,,,1,,,White,,1,Destruction/Damage/Vandalism of Property,1.0,Other/Unknown,Anti-Asian,Individual,S,S,True,"6, 1991","6, 1991"


In [0]:
# Let us get the monthly counts of hate crimes for our entire dataset.
monthly_counts_df = asian_hc.groupby("Incident Month & Year").count()
monthly_counts_df["Number Of Hate Crime Incidents"] = monthly_counts_df["INCIDENT_ID"]
monthly_counts_df = monthly_counts_df[["Number Of Hate Crime Incidents"]]
monthly_counts_df.head()

Unnamed: 0_level_0,Number Of Hate Crime Incidents
Incident Month & Year,Unnamed: 1_level_1
"1, 1991",9
"1, 1992",17
"1, 1993",21
"1, 1994",15
"1, 1995",25
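
The string keys above sort lexicographically ("1, 1991", "1, 1992", ...), which is fine for joining but not for plotting over time. One possible alternative, sketched here on made-up dates, is to bucket incidents with a pandas monthly `Period`; the column name `Incident Period` is hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical incident dates (ISO strings for simplicity, not the raw file format).
toy = pd.DataFrame({"INCIDENT_DATE": ["1991-11-13", "1991-12-07", "1991-08-16", "1991-08-01"]})

# Parse the dates, then bucket each one into its calendar month.
dates = pd.to_datetime(toy["INCIDENT_DATE"])
toy["Incident Period"] = dates.dt.to_period("M")

# size() counts rows per group without depending on any particular column.
monthly = toy.groupby("Incident Period").size().rename("Number Of Hate Crime Incidents")
print(monthly)
```

A `Period` index sorts chronologically, so the same table can feed both a join and a time-series plot.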


Let us group together our tweets with the same monthly time horizon.

In [0]:
# Making a copy of the asia_df with processed_asia_df
processed_asia_df = asia_df.copy()
processed_asia_df["Date"] = processed_asia_df["Date"].apply(pd.to_datetime)
processed_asia_df["Tweet Month & Year"] = processed_asia_df["Date"].apply(lambda x: str(x.month) + ", " + str(x.year))
processed_asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens,stem_tokens,indices,neg,neu,pos,compound,Tweet Month & Year
33,1.260000e+18,Intelligence has just reported to me that I wa...,2020-05-03 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th...","[Intelligence, reported, I, correct, NOT, brin...","[intellig, report, I, correct, not, bring, cor...",33,0.145,0.766,0.089,-0.5267,"5, 2020"
34,1.260000e+18,....Fake News got it wrong again as always and...,2020-05-03 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ...","[..., .Fake, News, got, wrong, always, tens, t...","[..., .fake, new, got, wrong, alway, ten, thou...",34,0.287,0.568,0.145,-0.7177,"5, 2020"
51,1.260000e+18,The Democrats are just as always looking for t...,2020-05-02 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin...","[The, Democrats, always, looking, trouble, ., ...","[the, democrat, alway, look, troubl, ., they, ...",51,0.157,0.803,0.040,-0.5848,"5, 2020"
74,1.260000e+18,Concast (@NBCNews) and Fake News @CNN are goin...,2020-05-01 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @...","[Concast, (, @, NBCNews, ), Fake, News, @, CNN...","[concast, (, @, nbcnew, ), fake, new, @, cnn, ...",74,0.095,0.646,0.258,0.6476,"5, 2020"
245,1.250000e+18,China wants Sleepy Joe sooo badly. They want a...,2020-04-18 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[china, want, sleepi, joe, sooo, badli, ., the...",245,0.060,0.730,0.210,0.3595,"4, 2020"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40735,2.580000e+17,China has so much of our debt that they can't ...,2012-10-16 18:01:26,11,10,False,Twitter Web Client,True,"[China, has, so, much, of, our, debt, that, th...","[China, much, debt, ca, n't, put, us, default,...","[china, much, debt, ca, n't, put, us, default,...",40735,0.370,0.630,0.000,-0.8313,"10, 2012"
40749,2.580000e+17,China is our enemy--they want to destroy us --...,2012-10-16 13:53:35,28,208,False,Twitter Web Client,True,"[China, is, our, enemy, --, they, want, to, de...","[China, enemy, --, want, destroy, us, --, Reds...","[china, enemi, --, want, destroy, us, --, reds...",40749,0.297,0.593,0.110,-0.4939,"10, 2012"
40754,2.580000e+17,So China is ordering us to raise the Debt Limi...,2012-10-16 00:48:35,284,465,False,Twitter Web Client,True,"[So, China, is, ordering, us, to, raise, the, ...","[So, China, ordering, us, raise, Debt, Limit, ...","[So, china, order, us, rais, debt, limit, ...,...",40754,0.315,0.685,0.000,-0.5574,"10, 2012"
40756,2.580000e+17,Every dollar @BarackObama spends costs $1.40 w...,2012-10-15 21:25:45,223,705,False,Twitter Web Client,True,"[Every, dollar, @, BarackObama, spends, costs,...","[Every, dollar, @, BarackObama, spends, costs,...","[everi, dollar, @, barackobama, spend, cost, $...",40756,0.000,0.798,0.202,0.5093,"10, 2012"


In [0]:
asia_tweets_monthly_data = processed_asia_df.groupby("Tweet Month & Year").count()[["ID"]].rename({"ID":"Number Of Tweets"}, axis = 1)
asia_tweets_monthly_data.head()

Unnamed: 0_level_0,Number Of Tweets
Tweet Month & Year,Unnamed: 1_level_1
"1, 2013",39
"1, 2014",2
"1, 2015",7
"1, 2016",1
"1, 2017",3
...,...
"9, 2015",1
"9, 2016",2
"9, 2017",2
"9, 2018",7


We will now integrate our sentiment analysis of these tweets into our results. To aggregate across the months, we will use a variety of approaches. Initially, we will take the mean of the sentiment ratings (neg, neu, pos and compound) across all tweets in the month and add that to our new dataframe.

In [0]:
def get_monthly_average_neg_ratings(month, df):
  """
  Gets the average monthly sentiment ratings for the given month.
  """
  subset = df[df["Tweet Month & Year"] == month]
  d = {}
  d["Neg Mean"] = subset["neg"].mean()
  d["Neu Mean"] = subset["neu"].mean()
  d["Pos Mean"] = subset["pos"].mean()
  d["Compound Mean"] = subset["compound"].mean()
  return d

In [0]:
negs = []
neus = []
poss = []
compounds = []
for i in asia_tweets_monthly_data.index:
  d = get_monthly_average_neg_ratings(i, processed_asia_df)
  negs.append(d["Neg Mean"])
  neus.append(d["Neu Mean"])
  poss.append(d["Pos Mean"])
  compounds.append(d["Compound Mean"])

In [0]:
asia_tweets_monthly_data["Neg Mean For Month"] = negs
asia_tweets_monthly_data["Neu Mean For Month"] = neus
asia_tweets_monthly_data["Pos Mean For Month"] = poss
asia_tweets_monthly_data["Compound Mean For Month"] = compounds
asia_tweets_monthly_data.head()

Unnamed: 0_level_0,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month
Tweet Month & Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"1, 2013",39,0.156231,0.736795,0.107000,-0.056254
"1, 2014",2,0.182500,0.576000,0.242000,0.521400
"1, 2015",7,0.086429,0.796857,0.116714,0.189629
"1, 2016",1,0.000000,0.859000,0.141000,0.624900
"1, 2017",3,0.143333,0.763667,0.093000,-0.002933
...,...,...,...,...,...
"9, 2015",1,0.152000,0.695000,0.152000,0.000000
"9, 2016",2,0.192500,0.807500,0.000000,-0.559450
"9, 2017",2,0.116000,0.546500,0.337500,0.666750
"9, 2018",7,0.065143,0.851857,0.082857,0.106529
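
The helper function and loop above can also be expressed as a single `groupby` with named aggregation. The sketch below uses made-up scores and is meant only to show the shape of that alternative, not to replace the cells above.

```python
import pandas as pd

# Hypothetical tweets with VADER-style scores (values are made up).
toy = pd.DataFrame({
    "Tweet Month & Year": ["1, 2013", "1, 2013", "2, 2013"],
    "ID": [1, 2, 3],
    "neg": [0.10, 0.30, 0.00],
    "neu": [0.80, 0.60, 0.90],
    "pos": [0.10, 0.10, 0.10],
    "compound": [-0.2, -0.6, 0.4],
})

# Named aggregation: each output column is a (source column, function) pair.
# The dict is unpacked because the output names contain spaces.
monthly = toy.groupby("Tweet Month & Year").agg(**{
    "Number Of Tweets": ("ID", "count"),
    "Neg Mean For Month": ("neg", "mean"),
    "Neu Mean For Month": ("neu", "mean"),
    "Pos Mean For Month": ("pos", "mean"),
    "Compound Mean For Month": ("compound", "mean"),
})
print(monthly)
```

This computes the count and all four means in one pass and keeps them aligned by index, so no intermediate lists are needed.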


We will now merge this dataset with our number of hate crimes dataset to perform our analysis.

In [0]:
final_df = asia_tweets_monthly_data.join(monthly_counts_df, how = "inner")
final_df.head()

Unnamed: 0,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month,Number Of Hate Crime Incidents
"1, 2013",39,0.156231,0.736795,0.107,-0.056254,13
"1, 2014",2,0.1825,0.576,0.242,0.5214,13
"1, 2015",7,0.086429,0.796857,0.116714,0.189629,7
"1, 2016",1,0.0,0.859,0.141,0.6249,7
"1, 2017",3,0.143333,0.763667,0.093,-0.002933,6
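
Because the join uses `how = "inner"`, any month that appears in only one of the two tables is dropped. That includes months in which no anti-Asian incident was recorded, since those months never appear in the grouped counts. The sketch below, on made-up tables, shows one way to see which months would be lost.

```python
import pandas as pd

# Hypothetical monthly tables sharing the "month, year" string index.
tweets = pd.DataFrame({"Number Of Tweets": [39, 2, 7]},
                      index=["1, 2013", "1, 2014", "2, 2014"])
crimes = pd.DataFrame({"Number Of Hate Crime Incidents": [13, 13]},
                      index=["1, 2013", "1, 2014"])

# Inner join keeps only months present in both tables.
inner = tweets.join(crimes, how="inner")

# A left join keeps every tweet month; unmatched months get NaN counts.
left = tweets.join(crimes, how="left")
dropped = left[left["Number Of Hate Crime Incidents"].isna()].index.tolist()
print(len(inner), dropped)
```

Whether an unmatched month should be treated as zero incidents or as missing data depends on the coverage of the hate crime file, so the sketch only reports the months rather than filling them.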


We can now move on to the data analysis portion:

# Data Analysis & Results

In [0]:
# Renaming columns to put into Statsmodels OLS
final_df = final_df.rename({"Number Of Hate Crime Incidents": "Num_Hate_Crimes", "Number Of Tweets":"Num_Tweets", 
                            "Neg Mean For Month":"Neg_Mean", "Pos Mean For Month":"Pos_Mean", "Neu Mean For Month":"Neu_Mean",
                            "Compound Mean For Month":"Compound_Mean"}, axis = 1)
final_df.head()

Unnamed: 0,Num_Tweets,Neg_Mean,Neu_Mean,Pos_Mean,Compound_Mean,Num_Hate_Crimes
"1, 2013",39,0.156231,0.736795,0.107,-0.056254,13
"1, 2014",2,0.1825,0.576,0.242,0.5214,13
"1, 2015",7,0.086429,0.796857,0.116714,0.189629,7
"1, 2016",1,0.0,0.859,0.141,0.6249,7
"1, 2017",3,0.143333,0.763667,0.093,-0.002933,6


We will now test whether our features (the Number Of Tweets within the time horizon and the Average Negative Rating) have a significant effect on the number of hate crimes against Asian Americans. To do so, we fit a Least Squares Linear Regression with the Number of Tweets and the Monthly Negative Tweet Ratings as our features and the Number of Hate Crimes as our target variable. We will start off with a smaller set of features and add more in expanded models. Formally, our hypothesis test is as follows:

H0: Coefficient on Num_Tweets/Neg_Mean = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Mean != 0 <br>
Significance Level: 5%
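
Written out, the model behind this test can be sketched as follows, where $t$ indexes months and the $\beta$ symbols are notation introduced here for illustration:

$$
\text{Num\_Hate\_Crimes}_t = \beta_0 + \beta_1\,\text{Num\_Tweets}_t + \beta_2\,\text{Neg\_Mean}_t + \varepsilon_t
$$

Each coefficient is tested separately, $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$ for $j \in \{1, 2\}$, using the t-statistic reported in the regression summary:

$$
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}
$$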



In [0]:
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Mean", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.100
Model:                            OLS   Adj. R-squared:                  0.070
Method:                 Least Squares   F-statistic:                     3.317
Date:                Fri, 05 Jun 2020   Prob (F-statistic):             0.0430
Time:                        07:26:42   Log-Likelihood:                -178.01
No. Observations:                  63   AIC:                             362.0
Df Residuals:                      60   BIC:                             368.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.2714      1.025      9.046      0.0

As we can see, the Number of Tweets seems to be of little to no consequence. However, the Mean of Negative Ratings seems to have some effect, given its p-value of 0.014 (below our 0.05 significance level) and its t-statistic of 2.525. Even so, the R-squared of our model is a low 0.10, which means that the model explains only 10% of the variation in the number of hate crimes. This leads us to think that our model has low predictive power.

Let us try expanding the number of features. We will now also include the Mean of Positive, Neutral and Compound ratings that we calculated using Sentiment Analysis.

Formally, our new hypothesis test is as follows:

H0: Coefficient on Num_Tweets/Neg_Mean/Neu_Mean/Pos_Mean/Compound_Mean = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Mean/Neu_Mean/Pos_Mean/Compound_Mean != 0 <br>
Significance Level: 5%

In [0]:
# final_df already carries the renamed columns, so it can be passed in directly.
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Mean + Neu_Mean + Pos_Mean + Compound_Mean", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.123
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     1.601
Date:                Fri, 05 Jun 2020   Prob (F-statistic):              0.175
Time:                        07:30:24   Log-Likelihood:                -177.18
No. Observations:                  63   AIC:                             366.4
Df Residuals:                      57   BIC:                             379.2
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      -848.9139   2962.469     -0.287

This expanded model seems to confirm that our approach (as currently constructed) has low predictive power. None of the features appears to have a statistically significant effect on the Number Of Hate Crimes (the p-values are very high and the t-statistics are low). Furthermore, the R-squared of 0.123, together with an adjusted R-squared that falls from 0.070 to 0.046, reinforces that the model lacks predictive power.
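
One possible contributor to the very large standard error on the intercept is collinearity among the sentiment features. VADER's `neg`, `neu` and `pos` scores are proportions that sum to 1 for each tweet (up to rounding), so their monthly means also sum to roughly 1 and nearly duplicate the intercept column. The sketch below illustrates the idea on made-up monthly means using only NumPy; it is a diagnostic illustration, not a re-analysis of the project data.

```python
import numpy as np

# Hypothetical monthly means; each row's neg + neu + pos equals 1,
# mimicking VADER's proportion scores.
neg = np.array([0.15, 0.18, 0.09, 0.00, 0.14])
pos = np.array([0.11, 0.24, 0.12, 0.14, 0.09])
neu = 1.0 - neg - pos

# Design matrix with an explicit intercept column, mirroring the constant
# that the regression formula adds implicitly.
X = np.column_stack([np.ones_like(neg), neg, neu, pos])

# Four columns but only three independent directions,
# because intercept = neg + neu + pos.
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

When the rank is lower than the number of columns, the individual coefficients are not separately identified. Dropping one of the three proportion features would be one way to avoid this.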

We will try to aggregate the sentiment analysis using weeks rather than months and see if the change in time horizon makes our model perform better.

In [0]:
asian_hc["INCIDENT_DATE"] = asian_hc["INCIDENT_DATE"].apply(pd.to_datetime)
asian_hc["Incident Week & Year"] = asian_hc["INCIDENT_DATE"].apply(lambda x: str(x.week) + ", " + str(x.year))
asian_hc.head()

Unnamed: 0,INCIDENT_ID,DATA_YEAR,ORI,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,STATE_ABBR,STATE_NAME,DIVISION_NAME,REGION_NAME,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,INCIDENT_DATE,ADULT_VICTIM_COUNT,JUVENILE_VICTIM_COUNT,TOTAL_OFFENDER_COUNT,ADULT_OFFENDER_COUNT,JUVENILE_OFFENDER_COUNT,OFFENDER_RACE,OFFENDER_ETHNICITY,VICTIM_COUNT,OFFENSE_NAME,TOTAL_INDIVIDUAL_VICTIMS,LOCATION_NAME,BIAS_DESC,VICTIM_TYPES,MULTIPLE_OFFENSE,MULTIPLE_BIAS,Anti-Asian?,Incident Month,Incident Month & Year,Incident Week & Year
43,33,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",1991-11-13,,,1,,,White,,1,Intimidation,1.0,Highway/Road/Alley/Street/Sidewalk,Anti-Asian,Individual,S,S,True,"11, 1991","11, 1991","46, 1991"
55,3010,1991,AZ0072300,Phoenix,,City,AZ,Arizona,Mountain,West,1B,"Cities from 500,000 thru 999,999",1991-12-07,,,1,,,Unknown,,1,Intimidation,1.0,Restaurant,Anti-Asian,Individual,S,S,True,"12, 1991","12, 1991","49, 1991"
63,51,1991,CA0300900,Garden Grove,,City,CA,California,Pacific,West,2,"Cities from 100,000 thru 249,999",1991-08-16,,,2,,,White,,2,Aggravated Assault,2.0,Parking/Drop Lot/Garage,Anti-Asian,Individual,S,S,True,"8, 1991","8, 1991","33, 1991"
71,77,1991,CO0010100,Aurora,,City,CO,Colorado,Mountain,West,2,"Cities from 100,000 thru 249,999",1991-08-01,,,1,,,White,,1,Aggravated Assault,1.0,Residence/Home,Anti-Asian,Individual,S,S,True,"8, 1991","8, 1991","31, 1991"
116,95,1991,CO0031200,Chatfield State Park,,Other State Agency,CO,Colorado,Mountain,West,7,"Cities under 2,500",1991-06-29,,,1,,,White,,1,Destruction/Damage/Vandalism of Property,1.0,Other/Unknown,Anti-Asian,Individual,S,S,True,"6, 1991","6, 1991","26, 1991"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201218,552461,2018,WASPD0000,Seattle,,City,WA,Washington,Pacific,West,1B,"Cities from 500,000 thru 999,999",2018-12-17,1.0,0.0,1,1.0,0.0,White,Unknown,1,Simple Assault,1.0,Specialty Store,Anti-Asian,Individual,S,S,True,"12, 2018","12, 2018","51, 2018"
201223,552485,2018,WASPD0000,Seattle,,City,WA,Washington,Pacific,West,1B,"Cities from 500,000 thru 999,999",2018-10-25,1.0,0.0,1,1.0,0.0,White,Unknown,2,Robbery,1.0,Convenience Store,Anti-Asian,Business;Individual,M,S,True,"10, 2018","10, 2018","43, 2018"
201226,552490,2018,WASPD0000,Seattle,,City,WA,Washington,Pacific,West,1B,"Cities from 500,000 thru 999,999",2018-11-15,1.0,0.0,1,1.0,0.0,Black or African American,Unknown,1,Intimidation,1.0,Hotel/Motel/Etc.,Anti-Asian,Individual,S,S,True,"11, 2018","11, 2018","46, 2018"
201228,552496,2018,WASPD0000,Seattle,,City,WA,Washington,Pacific,West,1B,"Cities from 500,000 thru 999,999",2018-10-08,1.0,0.0,0,,,Unknown,Unknown,1,Destruction/Damage/Vandalism of Property,1.0,Parking/Drop Lot/Garage,Anti-Asian,Individual,S,S,True,"10, 2018","10, 2018","41, 2018"
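
One caveat with pairing `x.week` and `x.year`: `week` is the ISO week number, while `year` is the calendar year. Around New Year the two can disagree, so a date in late December may receive the same label as the first week of January of that same year. The sketch below shows the effect on a few hand-picked dates and an alternative key built from `isocalendar()`; the variable names are hypothetical.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-17", "2018-12-31", "2018-01-01"]))

# Labelling as in the cell above: ISO week number paired with the calendar year.
calendar_key = dates.apply(lambda x: str(x.week) + ", " + str(x.year))

# Alternative: pair the ISO week number with the ISO year it belongs to.
iso = dates.dt.isocalendar()
iso_key = iso["week"].astype(str) + ", " + iso["year"].astype(str)

print(calendar_key.tolist())
print(iso_key.tolist())
```

With the calendar-year label, 2018-12-31 and 2018-01-01 fall into the same bucket even though they are almost a year apart. The ISO-year label keeps them separate.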


In [0]:
weekly_counts_df = asian_hc.groupby("Incident Week & Year").count()
weekly_counts_df["Number Of Hate Crime Incidents"] = weekly_counts_df["INCIDENT_ID"]
weekly_counts_df = weekly_counts_df[["Number Of Hate Crime Incidents"]]
weekly_counts_df.head()

Unnamed: 0_level_0,Number Of Hate Crime Incidents
Incident Week & Year,Unnamed: 1_level_1
"1, 1992",2
"1, 1993",6
"1, 1994",3
"1, 1995",6
"1, 1996",12
...,...
"9, 2014",12
"9, 2015",2
"9, 2016",3
"9, 2017",3


In [0]:
processed_asia_df["Date"] = processed_asia_df["Date"].apply(pd.to_datetime)
processed_asia_df["Tweet Week & Year"] = processed_asia_df["Date"].apply(lambda x: str(x.week) + ", " + str(x.year))
processed_asia_df.head()

Unnamed: 0,ID,Text,Date,Favourites,Retweets,Is_Retweet,Source,China/Asia Mentioned?,token,stop_tokens,stem_tokens,indices,neg,neu,pos,compound,Tweet Month & Year,Tweet Week & Year
33,1.260000e+18,Intelligence has just reported to me that I wa...,2020-05-03 20:18:19,165655,41034,False,Twitter for iPhone,True,"[Intelligence, has, just, reported, to, me, th...","[Intelligence, reported, I, correct, NOT, brin...","[intellig, report, I, correct, not, bring, cor...",33,0.145,0.766,0.089,-0.5267,"5, 2020","18, 2020"
34,1.260000e+18,....Fake News got it wrong again as always and...,2020-05-03 20:18:19,96666,22990,False,Twitter for iPhone,True,"[..., .Fake, News, got, it, wrong, again, as, ...","[..., .Fake, News, got, wrong, always, tens, t...","[..., .fake, new, got, wrong, alway, ten, thou...",34,0.287,0.568,0.145,-0.7177,"5, 2020","18, 2020"
51,1.260000e+18,The Democrats are just as always looking for t...,2020-05-02 23:00:39,54048,17076,False,Twitter for iPhone,True,"[The, Democrats, are, just, as, always, lookin...","[The, Democrats, always, looking, trouble, ., ...","[the, democrat, alway, look, troubl, ., they, ...",51,0.157,0.803,0.040,-0.5848,"5, 2020","18, 2020"
74,1.260000e+18,Concast (@NBCNews) and Fake News @CNN are goin...,2020-05-01 15:27:29,163161,45538,False,Twitter for iPhone,True,"[Concast, (, @, NBCNews, ), and, Fake, News, @...","[Concast, (, @, NBCNews, ), Fake, News, @, CNN...","[concast, (, @, nbcnew, ), fake, new, @, cnn, ...",74,0.095,0.646,0.258,0.6476,"5, 2020","18, 2020"
245,1.250000e+18,China wants Sleepy Joe sooo badly. They want a...,2020-04-18 19:13:28,121495,34270,False,Twitter for iPhone,True,"[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[China, wants, Sleepy, Joe, sooo, badly, ., Th...","[china, want, sleepi, joe, sooo, badli, ., the...",245,0.060,0.730,0.210,0.3595,"4, 2020","16, 2020"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40735,2.580000e+17,China has so much of our debt that they can't ...,2012-10-16 18:01:26,11,10,False,Twitter Web Client,True,"[China, has, so, much, of, our, debt, that, th...","[China, much, debt, ca, n't, put, us, default,...","[china, much, debt, ca, n't, put, us, default,...",40735,0.370,0.630,0.000,-0.8313,"10, 2012","42, 2012"
40749,2.580000e+17,China is our enemy--they want to destroy us --...,2012-10-16 13:53:35,28,208,False,Twitter Web Client,True,"[China, is, our, enemy, --, they, want, to, de...","[China, enemy, --, want, destroy, us, --, Reds...","[china, enemi, --, want, destroy, us, --, reds...",40749,0.297,0.593,0.110,-0.4939,"10, 2012","42, 2012"
40754,2.580000e+17,So China is ordering us to raise the Debt Limi...,2012-10-16 00:48:35,284,465,False,Twitter Web Client,True,"[So, China, is, ordering, us, to, raise, the, ...","[So, China, ordering, us, raise, Debt, Limit, ...","[So, china, order, us, rais, debt, limit, ...,...",40754,0.315,0.685,0.000,-0.5574,"10, 2012","42, 2012"
40756,2.580000e+17,Every dollar @BarackObama spends costs $1.40 w...,2012-10-15 21:25:45,223,705,False,Twitter Web Client,True,"[Every, dollar, @, BarackObama, spends, costs,...","[Every, dollar, @, BarackObama, spends, costs,...","[everi, dollar, @, barackobama, spend, cost, $...",40756,0.000,0.798,0.202,0.5093,"10, 2012","42, 2012"


In [0]:
asia_tweets_weekly_data = processed_asia_df.groupby("Tweet Week & Year").count()[["ID"]].rename({"ID":"Number Of Tweets"}, axis = 1)
asia_tweets_weekly_data.head()

Unnamed: 0_level_0,Number Of Tweets
Tweet Week & Year,Unnamed: 1_level_1
"1, 2013",3
"1, 2014",1
"1, 2015",1
"1, 2017",2
"1, 2019",3


In [0]:
def get_weekly_average_neg_ratings(week, df):
  """
  Getting average sentiment ratings for the given week from the dataframe.
  """
  subset = df[df["Tweet Week & Year"] == week]
  d = {}
  d["Neg Mean"] = subset["neg"].mean()
  d["Neu Mean"] = subset["neu"].mean()
  d["Pos Mean"] = subset["pos"].mean()
  d["Compound Mean"] = subset["compound"].mean()
  return d

In [0]:
negs = []
neus = []
poss = []
compounds = []
for i in asia_tweets_weekly_data.index:
  d = get_weekly_average_neg_ratings(i, processed_asia_df)
  negs.append(d["Neg Mean"])
  neus.append(d["Neu Mean"])
  poss.append(d["Pos Mean"])
  compounds.append(d["Compound Mean"])

In [0]:
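# Note: these are weekly means. The "For Month" suffix is kept only so that
# the rename step further down works unchanged.
asia_tweets_weekly_data["Neg Mean For Month"] = negs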
asia_tweets_weekly_data["Neu Mean For Month"] = neus
asia_tweets_weekly_data["Pos Mean For Month"] = poss
asia_tweets_weekly_data["Compound Mean For Month"] = compounds
asia_tweets_weekly_data.head()

Unnamed: 0_level_0,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month
Tweet Week & Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"1, 2013",3,0.229333,0.662667,0.108333,-0.254367
"1, 2014",1,0.231,0.445,0.325,0.1759
"1, 2015",1,0.286,0.5,0.214,-0.3818
"1, 2017",2,0.165,0.835,0.0,-0.31275
"1, 2019",3,0.04,0.860333,0.099667,0.216567


In [0]:
final_df = asia_tweets_weekly_data.join(weekly_counts_df, how = "inner")
final_df.head()

Unnamed: 0,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month,Number Of Hate Crime Incidents
"1, 2013",3,0.229333,0.662667,0.108333,-0.254367,3
"1, 2014",1,0.231,0.445,0.325,0.1759,5
"1, 2015",1,0.286,0.5,0.214,-0.3818,1
"1, 2017",2,0.165,0.835,0.0,-0.31275,1
"10, 2013",3,0.110333,0.805333,0.084,-0.125133,3


In [0]:
# Renaming for ease of statsmodels
final_df = final_df.rename({"Number Of Hate Crime Incidents": "Num_Hate_Crimes", "Number Of Tweets":"Num_Tweets", 
                            "Neg Mean For Month":"Neg_Mean", "Pos Mean For Month":"Pos_Mean", "Neu Mean For Month":"Neu_Mean",
                            "Compound Mean For Month":"Compound_Mean"}, axis = 1)
final_df.head()

Unnamed: 0,Num_Tweets,Neg_Mean,Neu_Mean,Pos_Mean,Compound_Mean,Num_Hate_Crimes
"1, 2013",3,0.229333,0.662667,0.108333,-0.254367,3
"1, 2014",1,0.231,0.445,0.325,0.1759,5
"1, 2015",1,0.286,0.5,0.214,-0.3818,1
"1, 2017",2,0.165,0.835,0.0,-0.31275,1
"10, 2013",3,0.110333,0.805333,0.084,-0.125133,3


We will apply the same hypothesis tests as we did in the monthly example. Mathematically:

H0: Coefficient on Num_Tweets/Neg_Mean = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Mean != 0 <br>
Significance Level: 5%

In [0]:
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Mean", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     1.838
Date:                Fri, 05 Jun 2020   Prob (F-statistic):              0.163
Time:                        07:37:48   Log-Likelihood:                -273.84
No. Observations:                 144   AIC:                             553.7
Df Residuals:                     141   BIC:                             562.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8791      0.245     11.774      0.0

This one seems to perform even worse than the monthly example. None of the features are near statistical significance and an R-squared of 0.025 implies very little predictive power.

Adding all our features, our new hypothesis test is:

H0: Coefficient on Num_Tweets/Neg_Mean/Neu_Mean/Pos_Mean/Compound_Mean = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Mean/Neu_Mean/Pos_Mean/Compound_Mean != 0 <br>
Significance Level: 5%

In [0]:
# final_df already carries the renamed columns, so it can be passed in directly.
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Mean + Neu_Mean + Pos_Mean + Compound_Mean", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.487
Date:                Fri, 05 Jun 2020   Prob (F-statistic):              0.198
Time:                        07:40:03   Log-Likelihood:                -271.91
No. Observations:                 144   AIC:                             555.8
Df Residuals:                     138   BIC:                             573.6
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      -118.5704    645.709     -0.184

The same issues hold for this model. None of our variables shows discernible significance, and our R-squared remains low.

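For the remainder of the analysis, we will stick to the monthly time horizon, since the monthly models explained more of the variation (R-squared of 0.100 and 0.123, versus 0.025 and 0.051 for the weekly models).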

We will now attempt a different way of aggregating sentiment ratings across our chosen time horizon. Intuitively, racist attitudes get worse with greater consumption of insensitive rhetoric. Thus, rather than taking the mean, we will take the sum of the sentiment ratings and see if that improves our model.

### Aggregating Sentiment Ratings Using Sum

In [0]:
def get_monthly_sum_neg_ratings(month, df):
  """
  Aggregates sum of sentiment ratings in df for the given month
  """
  subset = df[df["Tweet Month & Year"] == month]
  d = {}
  d["Neg Sum"] = subset["neg"].sum()
  d["Neu Sum"] = subset["neu"].sum()
  d["Pos Sum"] = subset["pos"].sum()
  d["Compound Sum"] = subset["compound"].sum()
  return d

In [0]:
negs = []
neus = []
poss = []
compounds = []
for i in asia_tweets_monthly_data.index:
  d = get_monthly_sum_neg_ratings(i, processed_asia_df)
  negs.append(d["Neg Sum"])
  neus.append(d["Neu Sum"])
  poss.append(d["Pos Sum"])
  compounds.append(d["Compound Sum"])

In [0]:
asia_tweets_monthly_data["Neg Sum For Month"] = negs
asia_tweets_monthly_data["Neu Sum For Month"] = neus
asia_tweets_monthly_data["Pos Sum For Month"] = poss
asia_tweets_monthly_data["Compound Sum For Month"] = compounds
asia_tweets_monthly_data.head()

Tweet Month & Year,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month,Neg Sum For Month,Neu Sum For Month,Pos Sum For Month,Compound Sum For Month
"1, 2013",39,6.093,28.735,4.173,-2.1939,6.093,28.735,4.173,-2.1939
"1, 2014",2,0.365,1.152,0.484,1.0428,0.365,1.152,0.484,1.0428
"1, 2015",7,0.605,5.578,0.817,1.3274,0.605,5.578,0.817,1.3274
"1, 2016",1,0.0,0.859,0.141,0.6249,0.0,0.859,0.141,0.6249
"1, 2017",3,0.43,2.291,0.279,-0.0088,0.43,2.291,0.279,-0.0088
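
The per-month loop above can also be expressed as a single pandas `groupby`. The following is a minimal sketch, not the notebook's actual pipeline: the `tweets` frame and its values are toy data made up for illustration, and only the column names (`Tweet Month & Year`, `neg`, `neu`, `pos`, `compound`) are taken from the cells above.

```python
import pandas as pd

# Toy per-tweet VADER scores (illustrative values only, not the project's data).
tweets = pd.DataFrame({
    "Tweet Month & Year": ["1, 2013", "1, 2013", "2, 2013"],
    "neg": [0.2, 0.1, 0.0],
    "neu": [0.7, 0.8, 0.9],
    "pos": [0.1, 0.1, 0.1],
    "compound": [-0.4, 0.0, 0.3],
})

score_cols = ["neg", "neu", "pos", "compound"]
grouped = tweets.groupby("Tweet Month & Year")

# Compute the sum and the mean of every score column for each month in one pass.
monthly = grouped[score_cols].agg(["sum", "mean"])

# Flatten the two-level column index into names like "Neg Sum For Month".
monthly.columns = [
    f"{col.capitalize()} {stat.capitalize()} For Month"
    for col, stat in monthly.columns
]

# Number of tweets per month, aligned on the same month labels.
monthly["Number Of Tweets"] = grouped.size()
print(monthly)
```

Computing both aggregates side by side also makes it easy to confirm that the mean columns and the sum columns differ whenever a month contains more than one tweet.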


In [0]:
final_df = asia_tweets_monthly_data.join(monthly_counts_df, how = "inner")
final_df.head()

Tweet Month & Year,Number Of Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month,Neg Sum For Month,Neu Sum For Month,Pos Sum For Month,Compound Sum For Month,Number Of Hate Crime Incidents
"1, 2013",39,6.093,28.735,4.173,-2.1939,6.093,28.735,4.173,-2.1939,13
"1, 2014",2,0.365,1.152,0.484,1.0428,0.365,1.152,0.484,1.0428,13
"1, 2015",7,0.605,5.578,0.817,1.3274,0.605,5.578,0.817,1.3274,7
"1, 2016",1,0.0,0.859,0.141,0.6249,0.0,0.859,0.141,0.6249,7
"1, 2017",3,0.43,2.291,0.279,-0.0088,0.43,2.291,0.279,-0.0088,6


In [0]:
# Renaming columns
final_df = final_df.rename({"Number Of Hate Crime Incidents": "Num_Hate_Crimes", "Number Of Tweets":"Num_Tweets", 
                            "Neg Sum For Month":"Neg_Sum", "Pos Sum For Month":"Pos_Sum", "Neu Sum For Month":"Neu_Sum",
                            "Compound Sum For Month":"Compound_Sum"}, axis = 1)
final_df.head()

Tweet Month & Year,Num_Tweets,Neg Mean For Month,Neu Mean For Month,Pos Mean For Month,Compound Mean For Month,Neg_Sum,Neu_Sum,Pos_Sum,Compound_Sum,Num_Hate_Crimes
"1, 2013",39,6.093,28.735,4.173,-2.1939,6.093,28.735,4.173,-2.1939,13
"1, 2014",2,0.365,1.152,0.484,1.0428,0.365,1.152,0.484,1.0428,13
"1, 2015",7,0.605,5.578,0.817,1.3274,0.605,5.578,0.817,1.3274,7
"1, 2016",1,0.0,0.859,0.141,0.6249,0.0,0.859,0.141,0.6249,7
"1, 2017",3,0.43,2.291,0.279,-0.0088,0.43,2.291,0.279,-0.0088,6


We will perform the same hypothesis test as earlier but use our new variables this time. Formally:

H0: Coefficient on Num_Tweets/Neg_Sum = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Sum != 0 <br>
Significance Level: 5%
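
As a hedged illustration of the decision rule stated above, the sketch below fits the same formula on synthetic data (generated here purely for demonstration, NOT the project's data) and compares each coefficient's two-sided p-value with the 5% level. The `demo` frame and its data-generating process are assumptions; `pvalues`, `rsquared`, and `f_pvalue` are attributes of a fitted statsmodels OLS result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (NOT the project's data).
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Num_Tweets": rng.poisson(5, n),
    "Neg_Sum": rng.uniform(0, 3, n),
})
# By construction, the outcome depends on Neg_Sum but not on Num_Tweets.
demo["Num_Hate_Crimes"] = 10 + 2.0 * demo["Neg_Sum"] + rng.normal(0, 1, n)

res = smf.ols("Num_Hate_Crimes ~ Num_Tweets + Neg_Sum", data=demo).fit()

alpha = 0.05
# Reject H0 for a predictor when its two-sided p-value falls below alpha.
reject_h0 = res.pvalues.drop("Intercept") < alpha
print(reject_h0)
print(f"R-squared: {res.rsquared:.3f}, overall F-test p-value: {res.f_pvalue:.3g}")
```

The same three quantities can be read off the printed `res.summary()` table; pulling them from the result object simply makes the comparison against the significance level explicit.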

In [0]:
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Sum", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.028
Method:                 Least Squares   F-statistic:                    0.1478
Date:                Fri, 05 Jun 2020   Prob (F-statistic):              0.863
Time:                        07:48:44   Log-Likelihood:                -181.16
No. Observations:                  63   AIC:                             368.3
Df Residuals:                      60   BIC:                             374.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     11.2571      0.704     15.998      0.0

We again face the same issues as in the prior examples: no statistically significant result and a very low R-squared (0.005). We will attempt to add features and see if this model performs better.

Our new hypothesis test is:

H0: Coefficient on Num_Tweets/Neg_Sum/Neu_Sum/Pos_Sum/Compound_Sum = 0 <br>
H1: Coefficient on Num_Tweets/Neg_Sum/Neu_Sum/Pos_Sum/Compound_Sum != 0 <br>
Significance Level: 5%

In [0]:
mod = smf.ols(formula="Num_Hate_Crimes ~ Num_Tweets + Neg_Sum + Neu_Sum + Pos_Sum + Compound_Sum", data=final_df)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:        Num_Hate_Crimes   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                 -0.079
Method:                 Least Squares   F-statistic:                   0.08984
Date:                Fri, 05 Jun 2020   Prob (F-statistic):              0.994
Time:                        07:50:40   Log-Likelihood:                -181.07
No. Observations:                  63   AIC:                             374.1
Df Residuals:                      57   BIC:                             387.0
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       11.2400      0.739     15.218   

This model also fails to find any statistically significant relation between our variables.
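
One caveat that may apply to the five-feature model: VADER reports `neg`, `neu`, and `pos` as proportions that sum to roughly 1 for each text, so their monthly sums should add up to roughly `Num_Tweets`. If that holds, the predictors are nearly collinear and the individual coefficients are hard to interpret. The sketch below checks this on the five rows printed by `final_df.head()` above; whether it holds for the full dataset is an assumption that would need to be verified on `final_df` itself.

```python
import pandas as pd

# The five rows displayed by final_df.head() earlier in the notebook.
rows = pd.DataFrame({
    "Num_Tweets": [39, 2, 7, 1, 3],
    "Neg_Sum": [6.093, 0.365, 0.605, 0.0, 0.43],
    "Neu_Sum": [28.735, 1.152, 5.578, 0.859, 2.291],
    "Pos_Sum": [4.173, 0.484, 0.817, 0.141, 0.279],
})

# If neg + neu + pos is about 1 per tweet, these totals track the tweet count.
share_total = rows[["Neg_Sum", "Neu_Sum", "Pos_Sum"]].sum(axis=1)
gap = (share_total - rows["Num_Tweets"]).abs()

print(pd.DataFrame({"share_total": share_total, "Num_Tweets": rows["Num_Tweets"]}))
print("Largest absolute gap:", gap.max())
```

If the same pattern appears in the full data, dropping one of the three sum columns (or `Num_Tweets`) would be one way to remove the redundancy before refitting.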

# Ethics & Privacy

All datasets used in this experiment are available to the general public. Neither dataset violates the privacy of any individual included in it.

All of the tweets analyzed are sourced from trumptwitterarchive.com. The site is maintained by Brendan Brown, monitors Trump's tweets in real time, and explicitly states that any visitor has permission to use the dataset. Initially, we had a minor concern regarding the validity of the data, so we looked deeper into the publisher's process for obtaining these tweets. We found that they used the Tweepy module in Python to scrape Twitter for tweets from Trump's accounts, without filtering on keywords that portray the author of the tweet in a positive or negative light.

Because we do not find any significant relation between Trump's tweets and hate crimes towards Asian-Americans in the US during 2015-2018, we want to be clear that our results do not provide any indication of future trends.

Moreover, Donald Trump's privacy is not breached, as his Tweets are made publicly available by Trump himself.

Additionally, our use of the hate_crime dataset does NOT single out any individual case, or any set of cases within a specific set of regions in the US. The data cannot be linked to any one individual and primarily records the state, the time of the crime, and the offender's bias motivation.

In terms of the impact of our analysis, we build on previous literature on the causes of hate crime, which could be helpful in combating it. We do not consider our results an exoneration of Trump, and we do not claim that this study definitively shows that Trump's actions do not correspond to increased anti-Asian hate crimes; we are merely observing the consequences of his actions.



# Conclusion & Discussion


In this study, we sought to understand the relationship between the negative sentiment of Trump's tweets about Asia and China and hate crimes against Asian-Americans in the U.S. We did not find a statistically significant correlation between the two in our analysis.

We do believe that our model is limited due to the following reasons:
- Choosing somewhat arbitrary time horizons limits the effectiveness of our study, as we may not capture the time lag between Trump's tweets and subsequent crimes.
- Aggregating the sentiments (negative, positive, etc.) by taking their mean or sum across our time horizon means that a lot of relevant data is lost. Hate crimes can be prompted by even a single tweet, so aggregating sentiment across different tweets hides the impact that one tweet can have.
- We do not have access to hate crime data for 2019 and 2020, which significantly limits our analysis, as the coronavirus pandemic began in late 2019 and has inflamed racial tensions.
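
The time-lag limitation in the first point could be explored by adding lagged sentiment columns before fitting the regression. The following is a minimal sketch under stated assumptions: the values are toy numbers, the `Neg_Sum_lag1` and `Neg_Sum_lag2` names are hypothetical, and the index mimics the "month, year" labels used in this notebook (which sort alphabetically rather than chronologically, so they are parsed into dates first).

```python
import pandas as pd

# Toy values only; the index mimics the "month, year" labels used in the notebook.
monthly = pd.DataFrame(
    {"Neg_Sum": [4.0, 1.0, 2.0, 3.0], "Num_Hate_Crimes": [10, 12, 8, 9]},
    index=["1, 2013", "10, 2012", "11, 2012", "12, 2012"],
)

# Parse the labels into real dates so rows can be ordered chronologically.
parts = monthly.index.to_series().str.split(", ", expand=True).astype(int)
dates = pd.to_datetime({"year": parts[1], "month": parts[0], "day": 1})
monthly.index = pd.DatetimeIndex(dates)

# Sort by date and insert NaN rows for any missing months ("MS" = month start),
# so that shifting by one row always means shifting by one calendar month.
monthly = monthly.sort_index().asfreq("MS")

# Sentiment from one and two months earlier (hypothetical column names).
monthly["Neg_Sum_lag1"] = monthly["Neg_Sum"].shift(1)
monthly["Neg_Sum_lag2"] = monthly["Neg_Sum"].shift(2)

# Rows without a complete lag history cannot be used in a regression.
lagged = monthly.dropna()
print(lagged)
```

The lagged columns could then be used as predictors in the same kind of formula as before, for example `Num_Hate_Crimes ~ Neg_Sum_lag1`, to test whether crimes follow negative tweets with a delay.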

Although our analysis did not establish a relationship between the number of hate crimes and the sentiment of Trump's tweets, we hope to improve the model in the future by identifying confounding variables and controlling for them as much as possible (these might include state and time effects). Even though our model did not provide substantial evidence for our hypothesis, we do not take this as evidence that no relationship exists, given how the coronavirus is linked to Asia and how Trump and his supporters have heightened their hateful rhetoric towards Asia/China. We also hope to add more data to our model once hate crime data for 2019-2020 becomes accessible.



# Team Contributions


Prithviraj: Found the Twitter data and cleaned it. Also did some exploratory data analysis. Aggregated the hate crime data and the tweet data and built the model to analyze the data.

Lily: Helped in background research & prior work. Also helped in coming up with the hypothesis and research question.

Dev: Helped in formulating the research questions. Focused on considering and writing out the privacy and ethics section.

Sebastian: Performed the sentiment analysis on Trump's Tweets. Also helped in writing out the conclusion and discussion part.