Zhixian (Alex) Li

Final Project

# Method
## Dataset
We use the dataset gathered by Davidson et al. (2017), which consists of tweets containing terms from a hate speech lexicon compiled by Hatebase.org. Collected in 2017, it is relatively recent, yet it predates the controversies and the surge of unwanted posts surrounding the 2022 acquisition of Twitter (Levin, 2022).

Workers were then recruited from CrowdFlower (CF) to label each tweet as "hate speech," as containing "offensive language," or as neither. On top of these three types, we also propose a binary distinction: a tweet is either disturbing (hate speech or offensive language) or non-disturbing (neither). Each entry was coded by a minimum of 3 and a maximum of 9 workers, with an inter-coder agreement score of 92% on the CF platform. The data therefore indicate how likely a tweet (an entry) is to be hate speech or an offensive post, as measured by how many coders classified it as such. The dataset can be found at https://github.com/t-davidson/hate-speech-and-offensive-language. The columns are explained below.

**tweet**: the tweet itself, which coders may have identified as containing hate speech, offensive language, or neither. It is left in its raw, uncensored form, ready for analysis. In only one or two cases does a tweet contain a censored word, as in "f**k". Because this may have been the user's own choice (a form of self-censorship), I decided to leave it as is.

**count**: the number of CF participants who coded this specific entry (tweet), with a minimum of 3. We can use this to normalize other measures in the data.

**hate_speech**: the number of CF participants who judged the entry (tweet) to be hate speech. Note that it is in raw counts, so we need to normalize it with regard to **count**.

**offensive_language**: the number of CF participants who judged the entry (tweet) to be offensive. Note that it is in raw counts, so we need to normalize it with regard to **count**.

**neither**: the number of CF participants who judged the entry (tweet) to be neither offensive nor hate speech. Note that it is in raw counts, so we need to normalize it with regard to **count**.
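
The normalization described for the three vote columns can be sketched as follows. This is a minimal illustration on made-up rows that reuse the dataset's column names; the `disturbing_share` and `is_disturbing` columns and the 0.5 cut-off are assumptions for the example, not part of the original data or analysis.

```python
import pandas as pd

# Toy rows with the same column names as the dataset; the values are made up.
toy = pd.DataFrame({
    "count": [3, 6, 3, 3],
    "hate_speech": [0, 1, 3, 0],
    "offensive_language": [3, 5, 0, 1],
    "neither": [0, 0, 0, 2],
})

vote_cols = ["hate_speech", "offensive_language", "neither"]

# Divide each raw vote count by the number of coders for that tweet.
props = toy[vote_cols].div(toy["count"], axis=0)

# Hypothetical binary view: share of coders who chose either disturbing label.
toy["disturbing_share"] = props["hate_speech"] + props["offensive_language"]
toy["is_disturbing"] = toy["disturbing_share"] > 0.5  # assumed cut-off
```

After normalization, the three proportions of every row sum to 1, which is a quick check that `count` really is the total number of votes.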

##Linguistic Analysis

Linguistic analysis in our study consists of two parts, sentiment analysis and semantic similarity measurement, both used to discern patterns within the tweet entries. We then compare which of the two methods is more effective.

*POS-Tagging Sentiment Analysis*

Using the spaCy package and the NRC Word-Emotion Lexicon developed by Mohammad & Turney (2012), we conducted sentiment analysis on the tweets. This lexicon categorizes words into positive or negative sentiment groups, as well as eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. By calculating sentiment scores for different parts of speech (POS) within each tweet, we aimed to gauge the overall positivity or negativity of the entries as well as the specific emotion of anger, since these are the measures that display significant scores (some of the other emotions, such as disgust and fear, often return a score of zero in sentiment analysis). The sentiment analysis process involved normalizing raw counts of word occurrences by tweet length.

*Semantic Similarity*

We also employed semantic similarity analysis using the spaCy package and a comprehensive NLP pipeline to elucidate linguistic patterns. Before conducting this analysis, we preprocessed the text by removing stop words and punctuation and converting all text to lowercase. Additionally, a tweet was assigned to the hate speech, offensive language, or neither category only when all of its coders agreed (a consensus-based approach). We selected a subset of entries from each tweet type to serve as corpora, facilitating the comparison of semantic similarity scores across different categories.

The process involved three main steps:

1. Construction of corpora: We compiled 132 randomly selected entries from each tweet type into separate lists of words, creating distinct corpora for hate speech, offensive language, and neutral tweets.
2. Calculation of semantic similarity scores: We randomly selected 131 entries from each tweet type that had not been used in constructing its corpus and computed their semantic similarity scores relative to the respective corpora. This step allowed us to quantify the degree of similarity between individual tweets and each category.
3. Comparative analysis: Each tweet was assessed for its semantic similarity to all three corpora, resulting in three distinct similarity scores per tweet: hate speech similarity score, offensive language similarity score, and neutral similarity score. By examining these scores, we aimed to discern underlying linguistic similarities and differences across different tweet types.

Through the integration of sentiment analysis and semantic similarity measurements, our linguistic analysis offers a comprehensive understanding of the linguistic characteristics of hate speech within the Twitter dataset under investigation.
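
To make the similarity score concrete: by default, spaCy's `Doc.similarity` is the cosine similarity between the averaged word vectors of two texts. The sketch below reproduces that idea in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration (real spaCy vectors have hundreds of dimensions), and the helper names are hypothetical.

```python
import math

# Made-up 3-dimensional "word vectors", for illustration only.
toy_vectors = {
    "sunny": [0.9, 0.1, 0.0],
    "warm": [0.8, 0.2, 0.1],
    "angry": [0.0, 0.9, 0.3],
    "furious": [0.1, 0.8, 0.4],
}

def mean_vector(words):
    """Average the vectors of the given words, dimension by dimension."""
    vecs = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

tweet = mean_vector(["furious"])             # a one-word "tweet"
calm_corpus = mean_vector(["sunny", "warm"])  # a tiny reference corpus
angry_corpus = mean_vector(["angry"])         # another tiny reference corpus

sim_calm = cosine(tweet, calm_corpus)
sim_angry = cosine(tweet, angry_corpus)
```

Here the toy tweet scores higher against `angry_corpus` than against `calm_corpus`, which is the kind of contrast the three per-tweet scores are meant to capture.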

##Statistical Analysis

We conduct ANOVA analyses on:
1. The sentiment scores and
2. The semantic similarity scores.

*POS-Tagging Sentiment Analysis*

We run ANOVA tests in R across the three types of tweets (organized as three groups) for each of the three sentiment/emotion scores (positive sentiment, negative sentiment, anger). For each sentiment/emotion score, we also report:
1. Significance of the ANOVA test. This indicates whether a statistically significant difference exists between the groups in terms of that sentiment/emotion score.
2. A pairwise t-test. Its p-values are Bonferroni-adjusted, and it indicates which pairs of groups differ and whether each difference is statistically significant.
3. A Tukey Honestly Significant Difference (HSD) test. This serves a similar purpose to the pairwise t-test and provides a check on its results.
4. A Kruskal-Wallis test. This rank-based test relaxes the normality assumption of a standard ANOVA, reaffirming whether a statistically significant difference exists between the categories.
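
For readers who want to see what the ANOVA test computes, the F statistic can be worked out by hand. The sketch below is a plain-Python illustration on made-up numbers; it is not the R workflow used in this study, and `one_way_anova_f` and `toy_scores` are hypothetical names.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Three made-up groups standing in for hate, offensive, and neither tweets.
toy_scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_scores)
```

A large F means the group means differ by more than the spread within groups would suggest; the p-value reported by R comes from comparing F with an F distribution on `df_b` and `df_w` degrees of freedom.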

*Semantic Similarity*

We run ANOVA tests in R across the three types of tweets (organized as three groups) for each of the three similarity scores (similarity with the hate corpus, the offensive corpus, and the neither corpus). For the similarity scores of the three groups with each corpus, we report:
1. Significance of the ANOVA test. This indicates whether a statistically significant difference exists between the groups in terms of semantic similarity scores.
2. A pairwise t-test. Its p-values are Bonferroni-adjusted, and it indicates which pairs of groups differ and whether each difference is statistically significant.
3. A Tukey Honestly Significant Difference (HSD) test. This serves a similar purpose to the pairwise t-test and provides a check on its results.
4. A Kruskal-Wallis test. This rank-based test relaxes the normality assumption of a standard ANOVA, reaffirming whether a statistically significant difference exists between the categories.
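
The Kruskal-Wallis test works on ranks rather than raw scores, which is why it does not need the normality assumption. The following plain-Python sketch computes its H statistic on made-up data. It is an illustration only: it omits the tie correction that statistical packages normally apply, and the function name is hypothetical.

```python
def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for a list of groups of numbers."""
    pooled = sorted(x for g in groups for x in g)
    # Give each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

# Three made-up, fully separated groups.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Under the null hypothesis, H is compared with a chi-squared distribution with one fewer degrees of freedom than the number of groups.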

Finally, we examine whether POS-tagging sentiment analysis or semantic similarity better distinguishes the three types of tweets, and how much of the difference each method can explain.
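
One common way to express how much of the variation in a score is explained by group membership is eta squared, the ratio of the between-group sum of squares to the total sum of squares. Whether this is the measure intended here is an assumption; the sketch below simply shows the calculation in plain Python on made-up numbers, with a hypothetical function name.

```python
from statistics import mean

def eta_squared(groups):
    """Share of total variation explained by group membership (0 to 1)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total if ss_total else 0.0

# Made-up scores for three groups of tweets.
effect = eta_squared([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A value near 1 would mean the tweet type accounts for most of the variation in the score, while a value near 0 would mean it accounts for very little.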

We also display our results using bar graphs generated with R packages, particularly ggplot2. To run the ANOVA tests, we use the R packages tidyverse, ggpubr, and rstatix.

# NLP Analysis
The following steps were taken for the NLP Analysis:
1. Read in our dataset (hate speech, file name "labeled_data.csv") through Colab and Pandas.
2. Clean up the dataset and define the three tweet categories as stated above.
3. Create datasets for tweets in each category.
4. Compile a corpus of text for each category (hate corpus, offensive corpus, neither corpus).
5. Find semantic similarity scores of each tweet with regard to each corpus (hate corpus, offensive corpus, neither corpus).
6. Use spaCy to compute the normalized number of positive and negative words, as well as the normalized number of anger words (an emotion), found in each tweet.

## 1. Reading Data

In [None]:
# Load the Drive helper
from google.colab import drive

# Below will prompt for authorization but it will make your google drive available (i.e., mount your drive).
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#find out where you are and move to correct location
import os #package for figuring out operating system

os.getcwd() #what is the current working directory

os.listdir() #what is in current working directory

os.chdir("drive/MyDrive/Colab_Notebooks") #change directory

os.listdir()

['trump_tweets.csv',
 'Untitled0.ipynb',
 '“1-introduction-to-python-solns”的副本',
 '“2_functions_and_apis.ipynb”的副本',
 '“1-text-simple-classification-training.ipynb”的副本 (1)',
 '“gradio-demo.ipynb”的副本',
 '“1-text-simple-classification-training.ipynb”的副本',
 '“string-manipulation_student.ipynb”的副本',
 'HW1_DS5780',
 'Alex_Li_extra_credit_HW_2024122',
 'reading_600_texts_metadata_only.csv',
 'reading_600_texts.csv',
 '“mophology_grammaticality.ipynb”的副本',
 'morphology_savefile',
 'Little Bit of Research',
 'sentiment',
 '“POS tagging and sentiment.ipynb”的副本 (1)',
 '“POS tagging and sentiment.ipynb”的副本',
 '“3_iteration_and_tidbits.ipynb”的副本',
 '“DS5780-named-entity-recognition_student.ipynb”的副本 (1)',
 '“DS5780-named-entity-recognition_student.ipynb”的副本',
 'Plonkering',
 '“dependency_parsing_clauses.ipynb”的副本',
 'Extra Credit Assignment 2024219',
 '“pos_tagging.ipynb”的副本',
 'ex_credit_2024226',
 'Plonker 2024226',
 'labeled_data.csv',
 'Zhixian_(Alex)_Li_Assignment_1_202433.ipynb',
 'CRAPII_10

We now load the dataframe of hate speech/offensive language/neither tweets gathered by Davidson et al. (2017). This dataframe contains statements classified as hate speech, statements classified as offensive, and statements classified as neither. Content warning: the tweets may be offensive or hateful.

In [None]:
import pandas as pd

#read in the .csv file of the dataset gathered by Davidson, et al. (2017)

reading_data = pd.read_csv('labeled_data.csv', encoding = "ISO-8859-1") # This is for the referential dataset of hate speech/offensive speech/neither speech

reading_data

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an..."
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies


Judging from the number of rows and columns, all of our data appears to have been read in. We can now continue with our data analysis, given that everything is loaded properly.

## 2. Clean Dataset/Data Wrangling
We now clean the dataset. This section consists of two parts:
1. We remove all stop words and punctuation.
2. We do some data wrangling to classify which tweets count as hate speech or offensive language and which fall into neither category.

In [None]:
# download the large English spaCy model, which is often not installed by default.
import spacy
import spacy.cli #spacy command line interface

spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
nlp = spacy.load("en_core_web_lg")

The following cell produces cleaned data with no stop words or punctuation, all in lowercase.

In [None]:
#define a function first
def stopwords_remove(text): #function
    doc = nlp(text, disable=["parser", "ner"]) #spacy the text
    text_no_stopwords = [token.text for token in doc if not token.is_stop and not token.is_punct and not token.is_digit] #get list of words that are not stopwords or punctuation or standalone numbers
    return ' '.join(text_no_stopwords) #return a string


# Apply the stopwords_remove function to the 'tweet' column
reading_data['cw_tweet'] = reading_data['tweet'].apply(stopwords_remove)

reading_data['cw_tweet'] = reading_data['cw_tweet'].str.lower() #lowercase

reading_data


Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,cw_tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,rt @mayasolovely woman complain cleaning house...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,rt @mleew17 boy dats cold tyga dwn bad cuffin ...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,rt @urkindofbrand dawg rt @80sbaby4life fuck b...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,rt @c_g_anderson @viva_based look like tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,rt @shenikaroberts shit hear true faker bitch ...
...,...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,muthaf***in lie 8220;@lifeasking @20_pearls @c...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an...",gone broke wrong heart baby drove redneck crazy
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...,young buck wanna eat dat nigguh like ai nt fuc...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies,youu got wild bitches tellin lies


Everything appears to be cleaned properly. The censored terms (written with asterisks) reflect the original authors' choice to self-censor, since no asterisks are found anywhere else in the dataset, so they are kept as they are. Some numbers remain inside terms, but no standalone numbers are included in the cleaned data. We now check the shape of our data.
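
The cleaned output above still shows fragments such as `8220;@lifeasking`, which are left over from HTML character references like `&#8220;` in the raw tweets. One optional extra step, which is not part of the pipeline in this notebook, is to decode those references before tokenizing. The sketch below uses Python's standard `html.unescape` on a made-up string.

```python
import html

# A made-up raw string in the style of the tweets shown above.
raw = "&#8220;@user: hi&#8221; &amp; more"

# Decode HTML character references into the characters they stand for.
decoded = html.unescape(raw)
```

After decoding, the curly quotation marks are ordinary punctuation characters, so a punctuation filter would be expected to remove them along with the rest.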

In [None]:
# to observe the shape of our data
print(reading_data.shape)

(24783, 8)


## 3. Create Datasets

We now create a dataset with only hate speech, a dataset with only offensive language, and a dataset with only neither. If every coder labeled a tweet as "neither", it belongs to the neither category; if every coder labeled it as "hate_speech", it belongs to the hate category; and if every coder labeled it as "offensive_language", it belongs to the offensive category. While this forgoes some more ambiguous entries, it provides a clear guideline on what is and is not hate speech.

In [None]:
reading_data['neither'] = reading_data['neither'] / reading_data['count'] # share of coders who rated the tweet as neither (this overwrites the raw count, so run this cell only once)
reading_data['hate'] = reading_data['hate_speech'] / reading_data['count'] # share of coders who rated the tweet as hate speech
reading_data['offensive'] = reading_data['offensive_language'] / reading_data['count'] # share of coders who rated the tweet as offensive language
read_df = reading_data[['neither', 'hate', 'offensive', 'cw_tweet']].copy() # what we really need is the categorizations of each tweet and the cw_tweet

In [None]:
hate_df = read_df[read_df['hate'] == 1] #This creates a dataframe of tweets that every coder rated as hate speech
offensive_df = read_df[read_df['offensive'] == 1] #This creates a dataframe of tweets that every coder rated as offensive language
neither_df = read_df[read_df['neither'] == 1] #This creates a dataframe of tweets that every coder rated as neither

In [None]:
hate_df

Unnamed: 0,neither,hate,offensive,cw_tweet
89,0.0,1.0,0.0,@cb_baby24 @white_thunduh alsarabsss s beaner ...
110,0.0,1.0,0.0,@devilgrimz @vigxrarts fucking gay blacklisted...
184,0.0,1.0,0.0,@markroundtreejr lmfaoooo hate black people ht...
459,0.0,1.0,0.0,hey look video man found kidnapped girls ohio ...
540,0.0,1.0,0.0,people time aryan race stand mongerls turn wor...
...,...,...,...,...
24085,0.0,1.0,0.0,inb4 unfunny nigger kill
24307,0.0,1.0,0.0,oh fag anyways
24523,0.0,1.0,0.0,receptionist m forced talk racist cunt called ...
24559,0.0,1.0,0.0,working bill prevent retards voting knew retar...


In [None]:
offensive_df

Unnamed: 0,neither,hate,offensive,cw_tweet
1,0.0,0.0,1.0,rt @mleew17 boy dats cold tyga dwn bad cuffin ...
2,0.0,0.0,1.0,rt @urkindofbrand dawg rt @80sbaby4life fuck b...
4,0.0,0.0,1.0,rt @shenikaroberts shit hear true faker bitch ...
6,0.0,0.0,1.0,@__brighterdays sit hate bitch got shit going
7,0.0,0.0,1.0,8220;@selfiequeenbri cause tired big bitches c...
...,...,...,...,...
24772,0.0,0.0,1.0,gone pussy pop stage
24774,0.0,0.0,1.0,care bout dis bitch dick yo feelings
24775,0.0,0.0,1.0,worried bout bitches need
24780,0.0,0.0,1.0,young buck wanna eat dat nigguh like ai nt fuc...


In [None]:
neither_df

Unnamed: 0,neither,hate,offensive,cw_tweet
0,1.0,0.0,0.0,rt @mayasolovely woman complain cleaning house...
63,1.0,0.0,0.0,@addicted2guys -simplyaddictedtoguys http://t....
70,1.0,0.0,0.0,"@arizonasfinest6 eggplant emoji doe?""y looked ..."
115,1.0,0.0,0.0,@domworldpeace baseball season win yankees lov...
118,1.0,0.0,0.0,@dunderbail early bird night owl wise worms
...,...,...,...,...
24714,1.0,0.0,0.0,wondertrade best pokemon feature got level jap...
24721,1.0,0.0,0.0,wth playing missy mean seriously rt @mr_republ...
24735,1.0,0.0,0.0,yay like jacob colored layouts
24736,1.0,0.0,0.0,yaya ho cute avi tho rt @vivala_ari idea sleep


The results are quite peculiar. Only 263 entries are classified uniformly as hate, compared with 2872 classified uniformly as neither and 14347 classified uniformly as offensive. This might be because hate speech is generally harder to identify than offensive language (anything with a curse word could be considered offensive, for instance), as Davidson et al. (2017) point out.

## 4. Compiling an entire text for each category
Here we compile a single reference text from tweets classified unanimously as hate speech by every coder. Rather than using all 263 hate tweets, we split them: a random sample of 132 forms the reference text for "hate", and the remaining 131 are used to test whether each entry is more similar to the hate reference text than to the non-hate reference texts. We do the same for the other categories: 132 randomly sampled tweets are compiled into a reference sampler, and 131 further tweets are used to test it.

In [None]:
# define a function that joins every cleaned tweet in a dataframe into one long string
def compiletext(dataset):
  return ' '.join(dataset['cw_tweet'])

hate_df_cropped = hate_df.sample(132, replace = False) # select the tweets that will form the hate sampler
all_hate = compiletext(hate_df_cropped) # hate sampler created
#print(all_hate)

Next, we compile a list of words from tweets marked unanimously as offensive. We randomly sample 132 tweets from the dataset of tweets classified unanimously as offensive to form the list.

In [None]:
offensive_df_cropped = offensive_df.sample(132, replace = False)
all_offensive = compiletext(offensive_df_cropped)
#print(all_offensive)

Now we do the same for the neither category, compiling a list of words that could be considered neither hateful nor offensive. Again, we randomly sample 132 entries.



In [None]:
neither_df_cropped = neither_df.sample(132, replace = False)
all_neither = compiletext(neither_df_cropped)
#print(all_neither)

foley abduction linked british jihadi kidnapping ring http://t.co/x6sip3woxb @joshrogin @elilake need like fuzzy blankets sleep 128525;&#128139 monkey monkey doo rt @itsgirllcode reasons mermaid 

 periods 
 pants 
 perfect hair 
 u lure men death 

 free clam bra charlie crist outlawed hated rediculed vilified shaking president hand country democracy rt @giants mariano rivera street outside yankee stadium named giants similar honors http://t.co/&#8230 rt @t_mart88 jaheim donell jones lyfe jennings ginuwine charlie wilson dru hill jodeci tyrese etc bring type rt @danwashburn man captured muammar gaddafi wearing yankees cap http://t.co/7g70popw season retweeting @nikaaaa3 yankee tweets 128514;&#128514;&#128514;&#128514 @jessespector bird hat hats hats beat 8220;@atrue_cowboy truth distance love birds http://t.co/1kqnw4lj&#8221 tell royals stay home trash&#128553 rt @niallofficial absolutely perfect day london suns gona cracker wembley tonight yeaaaaahhhh 127813;&#127813;&#127813 rt @got

## 5. Testing Semantic Similarities
We can now examine the similarity of the remaining hate-speech entries (the other 131) with
1. the hate-speech compilation,
2. the offensive-speech compilation, and
3. the neither-hate-nor-offensive compilation.

We can then see whether each entry is most similar to 1., 2., or 3. We run the same semantic similarity analysis on the offensive speech entries and the neither entries.

In [None]:
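# Sample the hate speech entries to be tested, excluding those already used in the hate corpus
hate_data = hate_df.drop(hate_df_cropped.index).sample(131, replace = False)

# Sample the offensive speech entries to be tested, excluding those already used in the offensive corpus
offensive_data = offensive_df.drop(offensive_df_cropped.index).sample(131, replace = False)

# Sample the neither speech entries to be tested, excluding those already used in the neither corpus
neither_data = neither_df.drop(neither_df_cropped.index).sample(131, replace = False)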
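
Because the method calls for test tweets that were not used to build the corpus, it helps to check that the two samples share no rows. The sketch below shows one way to draw a disjoint and repeatable split with pandas on a made-up dataframe. The seed value and the names `toy_df`, `reference`, and `test` are assumptions for the example.

```python
import pandas as pd

# Ten made-up rows standing in for one category of tweets.
toy_df = pd.DataFrame({"cw_tweet": [f"tweet {i}" for i in range(10)]})

# Fixing random_state (an arbitrary seed) makes the split repeatable.
reference = toy_df.sample(6, replace=False, random_state=0)

# Dropping the reference rows first guarantees the test rows are unseen.
test = toy_df.drop(reference.index).sample(4, replace=False, random_state=0)

# True when no row label appears in both samples.
no_overlap = set(reference.index).isdisjoint(test.index)
```

Without a fixed seed, each run of the notebook draws different tweets, so the similarity scores and downstream statistics would vary from run to run.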

We now run spaCy over the tweets in all three test dataframes, and over the three reference corpora.

In [None]:
hate_docs = list(nlp.pipe(hate_data["cw_tweet"])) #spacy all the texts in the dataframe
offensive_docs = list(nlp.pipe(offensive_data["cw_tweet"])) #spacy all the texts in the dataframe
neither_docs = list(nlp.pipe(neither_data["cw_tweet"])) #spacy all the texts in the dataframe

hate_ref = nlp(all_hate) #nlp the all_hate dataset
offensive_ref = nlp(all_offensive) #nlp the all_offensive dataset
neither_ref = nlp(all_neither) #nlp the all_neither dataset

### Similarity Tests

The next step is the similarity tests. Here we find the similarity between each tweet entry in each category and a given corpus, that is, one of the lists compiled above. For instance, we can compare a hate tweet with the hate corpus, the offensive corpus, and the neither corpus. Because the process is repetitive, we wrap it in a function.

In [None]:
def similarity_test(ref, docs):
  doc_similarity = [] #holder list for spacy docs similarity
  for doc in docs:
    similarity_score = doc.similarity(ref) #check similarity between tweet (doc) and reference corpus
    doc_similarity.append(similarity_score)
  return doc_similarity #returns a list of similarity scores

We would now define a function that uses the similarity score function defined above to find similarity scores in each category, and append it to a dataframe. For instance, we could append the similarity score of each test tweet entry in hate_speech to the hate_data dataframe.

In [None]:
def append_df(docs, df):
  df = df.copy() #work on a copy so the sampled dataframe passed in is not modified in place

  similarity_score_hate = similarity_test(hate_ref, docs) #utilize the previous function to obtain all similarity scores of tweet entries with regard to hate corpus
  similarity_score_offensive = similarity_test(offensive_ref, docs) #obtain all similarity scores of tweet entries with regard to offensive corpus
  similarity_score_neither = similarity_test(neither_ref, docs) #obtain all similarity scores of tweet entries with regard to neither corpus

  #append each list of similarity scores as a new column of the dataframe
  df['hate_score'] = similarity_score_hate
  df['offensive_score'] = similarity_score_offensive
  df['neither_score'] = similarity_score_neither

  return df

We are now running the append_df function to find the similarity scores of each tweet with regard to 1) the hate corpus, 2) the offensive corpus, and 3) the neither corpus. We are also appending these scores to each dataframe.

In [None]:
hate_data = append_df(hate_docs, hate_data)
offensive_data = append_df(offensive_docs, offensive_data)
neither_data = append_df(neither_docs, neither_data)
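
One simple way to read the three scores per tweet is to ask which corpus each tweet is closest to. The sketch below is an illustration only and is not part of the statistical analysis in this study. The four rows are copied from the output tables shown later in this notebook, and four rows are far too few to support any conclusion; `closest_corpus` is a hypothetical helper.

```python
def closest_corpus(scores):
    """Return the corpus name with the highest similarity score."""
    return max(scores, key=scores.get)

# (true category, similarity scores) for four tweets from the tables below.
rows = [
    ("offensive", {"hate": 0.627864, "offensive": 0.704937, "neither": 0.487391}),
    ("offensive", {"hate": 0.596041, "offensive": 0.506198, "neither": 0.623926}),
    ("hate", {"hate": 0.429916, "offensive": 0.468362, "neither": 0.123594}),
    ("neither", {"hate": 0.551176, "offensive": 0.527947, "neither": 0.555604}),
]

predicted = [closest_corpus(scores) for _, scores in rows]
accuracy = sum(p == true for p, (true, _) in zip(predicted, rows)) / len(rows)
```

In these four rows the scores for the three corpora are often close together, which is one reason the group-level tests described in the Method section are more informative than tweet-by-tweet comparisons.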

## 6. Analysis of Sentiments (Positive vs. Negative)

In this section, we use the NRC lexicon to analyze the sentiments and emotions of the words (Mohammad & Turney, 2012). In particular, we focus on positive and negative sentiment because prior work tested them and found statistically significant outcomes (Udanor & Anyanwu, 2019). We also include the emotion of anger because, among all the negative emotions, it is the one that returns nonzero scores for some items rather than returning all 0s.

In [None]:
#to begin with, let's move to the correct location

print(os.getcwd()) #what is the current working directory, in this case should be Colab_Notebooks

%cd /content/drive/MyDrive/Colab_Notebooks/sentiment

print(os.getcwd())

print(os.listdir()) #what is in current working directory, in this case should direct to the csv file of NRC Lexicons


/content/drive/MyDrive/Colab_Notebooks
/content/drive/MyDrive/Colab_Notebooks/sentiment
/content/drive/MyDrive/Colab_Notebooks/sentiment
['nrc.csv', 'labeled_data_output.csv']


In [None]:
#Now let us read in the NRC Lexicons

sent_df = pd.read_csv('nrc.csv')

sent_df #take a look at the entries

Unnamed: 0,Anger_NRC,Anticipation_NRC,Disgust_NRC,Fear_NRC,Joy_NRC,Negative_NRC,Positive_NRC,Sadness_NRC,Surprise_NRC,Trust_NRC
0,expletive,unfulfilled,smut,smut,tantalizing,smut,greeting,measles,greeting,proven
1,inept,tantalizing,measles,measles,felicity,expletive,tantalizing,inconsequential,unfulfilled,privy
2,unfulfilled,wait,inept,lynch,lovable,measles,inventor,unfulfilled,tantalizing,pawn
3,lynch,haste,perverted,militia,unbeaten,inept,felicity,lynch,trump,lovable
4,agitation,unbeaten,lynch,servile,superstar,perverted,civility,gray,unbeaten,merchant
...,...,...,...,...,...,...,...,...,...,...
3319,,,,,,revive,,,,
3320,,,,,,lace,,,,
3321,,,,,,schism,,,,
3322,,,,,,annoy,,,,


Judging from the results and the original file, the words are all loaded correctly. There are some NaNs, but they arise because some of the lists are longer than others. Let us proceed by converting the dataframe to a list of lists.

In [None]:
nrc_lists = sent_df.values.T.tolist()

Next, we assign each column to a named list by hand, since this dataset does not have many columns.

In [None]:
anger = nrc_lists[0]
anticipation = nrc_lists[1]
disgust = nrc_lists[2]
fear = nrc_lists[3]
joy = nrc_lists[4]
negative = nrc_lists[5]
positive = nrc_lists[6]
sadness = nrc_lists[7]
surprise = nrc_lists[8]
trust = nrc_lists[9]

print(negative[0:11]) #displays correctly
print(positive[0:11]) #displays correctly
print(positive[-11]) #still contains NaNs

['smut', 'expletive', 'measles', 'inept', 'perverted', 'inconsequential', 'unfulfilled', 'tantalizing', 'lynch', 'constrained', 'agitation']
['greeting', 'tantalizing', 'inventor', 'felicity', 'civility', 'artistic', 'lovable', 'restful', 'unbeaten', 'superstar', 'tutelage']
nan
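
The NaN padding seen above is harmless for membership tests, but it can be dropped when building lookups, and a set makes each lookup much faster than scanning a list. The sketch below is a minimal plain-Python illustration on a made-up column; `to_word_set` and `toy_positive` are hypothetical names.

```python
def to_word_set(column):
    """Keep only real words; NaN padding is a float, not a str."""
    return {w for w in column if isinstance(w, str)}

# A made-up lexicon column padded with NaN, like the shorter NRC columns.
toy_positive = ["greeting", "lovable", float("nan"), float("nan")]

positive_words = to_word_set(toy_positive)
is_positive = "lovable" in positive_words
```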


We are now going to do sentiment analysis. But to begin with, we should define a safe_divide function. This is to prevent issues happening when the denominator is 0.

In [None]:
# safe divide function to stop zero in the denominator from causing problems.
# this is particularly useful when we normalize everything.

def safe_divide(a, b): # function takes two arguments. Here the numerator is the number of positive/negative words and the denominator is the number of words total.
    if b != 0:
        return a/b
    else:
        return 0

Now comes the most important part: the analysis itself. Negations complicate the reading of sentiment/emotion words, so we simply exclude any such word that has a negation shortly before it.

The list of negations is taken from Cambridge Dictionary (2024).

In [None]:
def emotion_analysis(reading_data_docs):

  #let's introduce a list of negations, compiled from Cambridge Dictionary (2024)
  negations = ["not", "never", "neither", "barely", "little", "few", "hardly", "scarcely", "seldom", "rarely", "no", "nothing", "none", "nobody", "nowhere", "don't", "shouldn't", "wouldn't", "won't", "couldn't", "can't", "cannot"]

  #sets make the lexicon lookups below fast; the NaN padding in the lists never matches a word
  positive_set = set(positive)
  negative_set = set(negative)
  anger_set = set(anger)

  nw_final = []
  all_positives = []
  all_negatives = []
  all_anger = []

  for doc in reading_data_docs:
    nw = 0 #count for the total number of words
    positive_count = 0 #the number of positive words
    negative_count = 0 #the number of negative words
    anger_count = 0 #the number of anger words

    words = [] #negations, adjectives, adverbs, and verb lemmas in each doc

    for token in doc:
      if token.is_punct or token.is_space: #excluding punctuations and spaces
        continue
      nw += 1
      if token.text in negations: #have to keep negations in the list (added once, even if also tagged ADV)
        words.append(token.text)
      elif token.pos_ in {"ADJ", "ADV"}: #find adj. and adv.
        words.append(token.text)
      elif token.pos_ == "VERB": #find verbs
        words.append(token.lemma_)

    for i, word in enumerate(words):
      #skip the word if a negation appears among the two words before it
      #(max(0, i-2) lets the check also cover the second word in the list)
      if any(w in negations for w in words[max(0, i-2):i]):
        continue
      if word in positive_set:
        positive_count += 1
      if word in negative_set:
        negative_count += 1
      if word in anger_set:
        anger_count += 1

    #norm by number of words and append to the output lists
    all_positives.append(safe_divide(positive_count, nw))
    all_negatives.append(safe_divide(negative_count, nw))
    all_anger.append(safe_divide(anger_count, nw))
    nw_final.append(nw)

  return (all_positives, all_negatives, all_anger, nw_final)
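
The negation rule can be seen in isolation on plain word lists. The sketch below is a simplified stand-in for the check inside `emotion_analysis`, written so that the two-word window also applies to the first words of a tweet. The shortened `NEGATIONS` set, the toy lexicon, and the function name are assumptions for the example.

```python
NEGATIONS = {"not", "never", "no"}  # shortened list for the example

def count_unnegated(words, lexicon, window=2):
    """Count lexicon words with no negation among the `window` words before them."""
    count = 0
    for i, word in enumerate(words):
        preceding = words[max(0, i - window):i]  # up to `window` earlier words
        if word in lexicon and not any(p in NEGATIONS for p in preceding):
            count += 1
    return count

toy_positive = {"good", "happy"}  # made-up lexicon

negated = count_unnegated(["not", "good"], toy_positive)
mixed = count_unnegated(["good", "not", "happy"], toy_positive)
distant = count_unnegated(["not", "really", "very", "good"], toy_positive)
```

In the last case the negation sits three words before "good", outside the two-word window, so the word is still counted. This shows the trade-off of a fixed window: it is simple, but it misses negations that are further away.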

In [None]:
#This function appends the results from the emotion_analysis function to our target datasets.
def append_df_ea(docs, df):
  df = df.copy() #work on a copy so the dataframe passed in is not modified in place
  pos, neg, ang, nw = emotion_analysis(docs)
  df['POS'] = pos
  df['NEG'] = neg
  df['ANG'] = ang
  df['nw'] = nw
  return df

In [None]:
# the actual process utilizing both functions created above.
hate_data = append_df_ea(hate_docs, hate_data)
offensive_data = append_df_ea(offensive_docs, offensive_data)
neither_data = append_df_ea(neither_docs, neither_data)

Here we will just show the results from the dataframe of offensive tweets. It has been checked to work similarly for datasets of hate tweets and neither tweets.

In [None]:
# display one dataframe to check that everything is appended correctly.
#hate_data
offensive_data
#neither_data

Unnamed: 0,neither,hate,offensive,cw_tweet,hate_score,offensive_score,neither_score,POS,NEG,ANG,nw
13338,0.0,0.0,1.0,niggas straight hoes bruh stop fuckin em,0.627864,0.704937,0.487391,0.000000,0.000000,0.000000,7
14636,0.0,0.0,1.0,rt @cesar_wtx pink nipples cute looking nasty ...,0.672881,0.737149,0.611094,0.125000,0.125000,0.125000,8
23849,0.0,0.0,1.0,pussies,0.468185,0.507126,0.290199,0.000000,0.000000,0.000000,1
13420,0.0,0.0,1.0,planet earth protective bull dykes,0.596041,0.506198,0.623926,0.200000,0.000000,0.000000,5
18561,0.0,0.0,1.0,rt @anderson6benton 8220;@t_rev_13 dumb bitche...,0.286640,0.318218,0.269675,0.111111,0.222222,0.000000,9
...,...,...,...,...,...,...,...,...,...,...,...
5898,0.0,0.0,1.0,@eanahs girl wanting shit growing hoes bout 6m...,0.756131,0.746567,0.717331,0.125000,0.000000,0.000000,8
9080,0.0,0.0,1.0,females damn near naked wondering niccas jus w...,0.718797,0.738074,0.609591,0.000000,0.100000,0.100000,10
18501,0.0,0.0,1.0,rt @_twelvestarr niggas catching attitudes exp...,0.803441,0.816503,0.762343,0.083333,0.083333,0.000000,12
5613,0.0,0.0,1.0,@bigshaadswerver yea tj hoe dunking tk hitting...,0.700463,0.697575,0.698217,0.000000,0.090909,0.090909,11


Now, let us compile all three dataframes together for our final output:

In [None]:
data_final = pd.concat([hate_data, offensive_data, neither_data])
data_final

Unnamed: 0,neither,hate,offensive,cw_tweet,hate_score,offensive_score,neither_score,POS,NEG,ANG,nw
6641,0.0,1.0,0.0,@matt_chu22 @ryanriehle agree ethiopian starvi...,0.812004,0.804991,0.676782,0.100000,0.00,0.0,10
3448,0.0,1.0,0.0,@idisdummies speak coherent sentence sounding ...,0.663531,0.538475,0.698305,0.125000,0.00,0.0,8
4015,0.0,1.0,0.0,@makeanothahov t.o faggot crybaby,0.429916,0.468362,0.123594,0.000000,0.00,0.0,4
4076,0.0,1.0,0.0,@menachemdreyfus @natlfascist88 @waspnse shit ...,0.700178,0.757788,0.505909,0.000000,0.00,0.0,16
24776,0.0,1.0,0.0,niggers,0.636074,0.585991,0.438029,0.000000,0.00,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...
20134,1.0,0.0,0.0,rt @the_blueprint trash hope rt @complexmag @b...,0.499402,0.480513,0.625962,0.000000,0.00,0.0,10
21871,1.0,0.0,0.0,clumsy bird lol life,0.551176,0.527947,0.555604,0.000000,0.25,0.0,4
6062,1.0,0.0,0.0,@hardball @now @hardball_chris jimmy want talk...,0.764405,0.775169,0.706643,0.181818,0.00,0.0,11
21627,1.0,0.0,0.0,value abu qutada&#8217;s statement credentials...,0.442987,0.268430,0.555722,0.076923,0.00,0.0,13


The number of rows and columns is correct (393 = 3*131), which means our three dataframes were combined successfully. We can therefore output our final dataframe.

In [None]:
data_final.to_csv('data_final_2024417.csv')
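
When a CSV written this way is read back in, the saved index returns as an extra column named `Unnamed: 0` unless it is declared as the index, which is how the `Unnamed: 0` columns in the tables above arise. The sketch below shows the round trip on a made-up two-row dataframe, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Made-up rows with non-consecutive row labels, like the sampled dataframes.
toy = pd.DataFrame({"cw_tweet": ["a", "b"], "POS": [0.1, 0.0]}, index=[6641, 3448])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)  # index comes back as a column "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored as row labels
```

Passing `index_col=0` when reading keeps the original row labels, so a tweet in the final output can be traced back to its row in the source dataframe.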

# References

Cambridge Dictionary. (2024). *Negation – English Grammar Today*. Cambridge Dictionary. https://dictionary.cambridge.org/us/grammar/british-grammar/negation_2

Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated hate speech detection and the problem of offensive language. *Proceedings of the International AAAI Conference on Web and Social Media, 11*(1), 512–515. https://doi.org/10.1609/icwsm.v11i1.14955

Levin, B. (2022, October 6). *Get ready for Elon Musk to turn Twitter into a right-wing cesspool that could hand Trump 2024*. Vanity Fair. https://www.vanityfair.com/news/2022/10/elon-musk-twitter-deal-donald-trump-right-wing-hate

Mohammad, S. M., & Turney, P. D. (2012). Crowdsourcing a word–emotion association lexicon. *Computational Intelligence, 29*(3), 436–465. https://doi.org/10.1111/j.1467-8640.2012.00460.x

Udanor, C., & Anyanwu, C. C. (2019). Combating the challenges of social media hate speech in a polarized society. *Data Technologies and Applications, 53*(4), 501–527. https://doi.org/10.1108/dta-01-2019-0007
