# <span style="color:blue">Evaluating Sequence Learning Models for Identifying Hate Speech using Explainable AI - Validation data preprocessing notebook</span>

## Validation data set preprocessing


Author: Amir Mozahebi <br>
Thesis: Evaluating Sequence Learning Models for Identifying Hate Speech using Explainable AI

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%store -r sequenceLength
%store -r vocabSize
%store -r validationSet
%store -r tokenizer
%store -r word_index

In [3]:
validationSet.head(10)

Unnamed: 0,index,label,tweet
0,0,2,advises to build a happy and longlasting mar...
1,1,2,some of yall bios are trash mine pret...
2,2,2,nd keyring make results shop at xx
3,3,2,there is beauty all around beautiful beauty po...
4,4,2,dogs are bathed and flea free new beds and spr...
5,5,2,im sorry you are going through a divorce
6,6,2,when the internet is broken so you cant watch ...
7,7,2,poont gotta be trash
8,8,2,i would be amazed if more people were honest a...
9,9,2,my city is sooo shit for fuck sake


In [4]:
validationSet["label"].value_counts()

2    900
1    900
0    900
Name: label, dtype: int64

In [5]:
# Split the validation set by class label: 2 = neither, 1 = offensive, 0 = hate
validationSet_neither = validationSet.loc[validationSet['label'] == 2]
validationSet_offensive = validationSet.loc[validationSet['label'] == 1]
validationSet_hate = validationSet.loc[validationSet['label'] == 0]

In [6]:
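# Renumber the rows of each subset from 0; reassigning avoids modifying a filtered slice in place
validationSet_neither = validationSet_neither.reset_index(drop=True)
validationSet_offensive = validationSet_offensive.reset_index(drop=True)
validationSet_hate = validationSet_hate.reset_index(drop=True)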
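
The per-label split and index reset above could also be sketched as one helper. This is only an illustration on a toy DataFrame: `split_by_label`, `LABEL_NAMES` and the toy rows are hypothetical names and data, while the label encoding (0 = hate, 1 = offensive, 2 = neither) is taken from the variable names used in this notebook.

```python
import pandas as pd

# Label encoding as implied by the subset names in this notebook.
LABEL_NAMES = {0: "hate", 1: "offensive", 2: "neither"}

def split_by_label(df, label_column="label"):
    """Return one DataFrame per class, each re-indexed from 0 (hypothetical helper)."""
    return {
        LABEL_NAMES[label]: group.reset_index(drop=True)
        for label, group in df.groupby(label_column)
    }

# Toy data, not taken from the validation set.
toy = pd.DataFrame({"label": [2, 0, 1, 0], "tweet": ["a", "b", "c", "d"]})
subsets = split_by_label(toy)
print({name: len(frame) for name, frame in subsets.items()})
```

A dictionary keyed by class name keeps the three subsets together, which may be convenient when the same preprocessing step has to be applied to each of them.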

In [7]:
validationSet_hate

Unnamed: 0,index,label,tweet
0,1800,0,back in my day there were nowhere near as many...
1,1801,0,there is a clear correlation between muslim po...
2,1802,0,i love that my workplace is arab people free
3,1803,0,these black women are not different from this ...
4,1804,0,fag
...,...,...,...
895,2695,0,there is no difference between white women and...
896,2696,0,it is not easy to say and do not come for me i...
897,2697,0,the quran is the book that you must obey where...
898,2698,0,this is getting ridiculous why does everyone t...


### Remove stopwords

In [8]:
# Build the stop word set: NLTK's English list plus two tweet-specific additions
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

preprocessing_stopwords = set(stopwords.words('english'))
additionalStopwords = ["u", "im"]
preprocessing_stopwords = preprocessing_stopwords.union(additionalStopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amirmozahebi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
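# Remove stop words: split the tweet on whitespace and keep only tokens
# that are not in preprocessing_stopwords (exact, case-sensitive match)
def removeStopwords(tweet):
    filtered = [word for word in tweet.split() if word not in preprocessing_stopwords]
    return " ".join(filtered)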

In [10]:
validationSet_neither_stopwords_removed = validationSet_neither["tweet"].apply(removeStopwords)
validationSet_offensive_stopwords_removed = validationSet_offensive["tweet"].apply(removeStopwords)
validationSet_hate_stopwords_removed = validationSet_hate["tweet"].apply(removeStopwords)
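
To make the filtering step concrete, here is a minimal sketch that runs without the NLTK download. The stop word set `toy_stopwords` and the example sentence are made up for illustration; the notebook itself uses NLTK's English list extended by "u" and "im".

```python
# Hypothetical mini stop word list, for illustration only.
toy_stopwords = {"in", "my", "there", "were", "u", "im"}

def remove_stopwords(tweet, stopword_set):
    """Drop every whitespace-separated token that appears in stopword_set."""
    return " ".join(word for word in tweet.split() if word not in stopword_set)

example = "im sure u were there in my city"
print(remove_stopwords(example, toy_stopwords))  # prints: sure city
```

The membership test is an exact string comparison, so a token such as "Im" (capitalised) would not be removed by this set. That is harmless as long as the tweets are lower-cased beforehand, as they appear to be in the samples shown above.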

### Lemmatization

In [11]:
#Import and initialize Lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [16]:
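# Lemmatization function
# Note: WordNetLemmatizer.lemmatize() looks up a single word (as a noun by default).
# Called on a whole tweet, it returns multi-word strings unchanged, which is why
# plurals such as "rates" survive in the output of this step.
def lemmatizing(tweet):
    lemmatized_text = lemmatizer.lemmatize(tweet)
    return lemmatized_text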

In [17]:
validationSetLemmatized_neither = [lemmatizing(tweet) for tweet in validationSet_neither_stopwords_removed]
validationSetLemmatized_offensive = [lemmatizing(tweet) for tweet in validationSet_offensive_stopwords_removed]
validationSetLemmatized_hate = [lemmatizing(tweet) for tweet in validationSet_hate_stopwords_removed]
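
`WordNetLemmatizer.lemmatize` expects one word at a time, so a per-token variant of the step above might look like the sketch below. It is an illustration, not the method used in this notebook: `lemmatize_tokens` is a hypothetical helper, and `toy_lemmatize` with its lookup table is a stand-in so that the example runs without the WordNet corpus.

```python
def lemmatize_tokens(tweet, lemmatize_word):
    """Apply a single-word lemmatizer to each whitespace-separated token."""
    return " ".join(lemmatize_word(word) for word in tweet.split())

# Toy stand-in for a real lemmatizer (hypothetical lookup table).
toy_lemmas = {"dogs": "dog", "beds": "bed"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

print(lemmatize_tokens("dogs bathed flea free new beds", toy_lemmatize))
# prints: dog bathed flea free new bed

# With NLTK, the callable could be the bound method of the lemmatizer, e.g.
# lemmatize_tokens(tweet, lemmatizer.lemmatize)
```

If the stored `tokenizer` was fitted on training text that went through the whole-tweet version of this step, switching only the validation data to per-token lemmatization would make the two preprocessing pipelines inconsistent, so such a change would have to be applied to both.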

In [18]:
validationSetLemmatized_hate

['back day nowhere near many black people country coincidence life better back',
 'clear correlation muslim population crime rates',
 'love workplace arab people free',
 'black women different black cow',
 'fag',
 'hes fucking pervert like gays',
 'black chicks annoying',
 'much like united states example sorts natural wonders cannot seen anywhere else world everglades grand canyon asian men plenty',
 'allahsoil conflate islamic extremism terrorism emiratis word',
 'spoty perhaps murrayhaters would happier murray bit like gazza hero drunk wifebeater',
 'man look awful costume look like trannie',
 'theres toxic energy foreigners dont like',
 'mixed race arent welcome around',
 'would like called heshe stick gender born simples',
 'truth sexualharassment laborviolations',
 'love rude arabs',
 'would love go back time tell windrush turn',
 'cant make literal retard level literally par course',
 'think romanians real threat country',
 'pueorico environmental injustice inflame protests coal

### Tokenization

In [22]:
validationSet_neither_tokens = [nltk.word_tokenize(tweet) for tweet in validationSetLemmatized_neither]
validationSet_offensive_tokens = [nltk.word_tokenize(tweet) for tweet in validationSetLemmatized_offensive]
validationSet_hate_tokens = [nltk.word_tokenize(tweet) for tweet in validationSetLemmatized_hate]

In [23]:
validationSet_hate_tokens[0]

['back',
 'day',
 'nowhere',
 'near',
 'many',
 'black',
 'people',
 'country',
 'coincidence',
 'life',
 'better',
 'back']

### Sequencing and Padding

In [24]:
# The fitted tokenizer is restored with %store above, so only pad_sequences is needed here
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [28]:
validationSequences_neither = tokenizer.texts_to_sequences(validationSet_neither_tokens) 
validationSequences_offensive = tokenizer.texts_to_sequences(validationSet_offensive_tokens)
validationSequences_hate = tokenizer.texts_to_sequences(validationSet_hate_tokens)
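
As a rough mental model of what `texts_to_sequences` does with a list of tokens, the sketch below maps tokens to ids through a word index. It assumes the tokenizer has no out-of-vocabulary token configured, in which case unknown words are skipped; whether that holds for the stored `tokenizer` is not visible in this notebook. `tokens_to_ids` and `toy_word_index` are hypothetical, with a few ids borrowed from the sample output further down.

```python
def tokens_to_ids(tokens, word_index, num_words=None):
    """Map tokens to integer ids, skipping unknown or out-of-range words."""
    ids = []
    for token in tokens:
        idx = word_index.get(token)
        if idx is None:
            continue  # word never seen when the tokenizer was fitted
        if num_words is not None and idx >= num_words:
            continue  # word outside the num_words most frequent words
        ids.append(idx)
    return ids

# Toy word index, not the stored word_index.
toy_word_index = {"back": 53, "day": 27, "near": 633}

print(tokens_to_ids(["back", "day", "unseenword", "near"], toy_word_index))
# prints: [53, 27, 633]
print(tokens_to_ids(["back", "day", "near"], toy_word_index, num_words=100))
# prints: [53, 27]
```

Skipped words shorten the sequence silently, so validation tweets with many unseen words end up with more padding than their token count suggests.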

In [29]:
paddedSequences_neither = pad_sequences(validationSequences_neither, maxlen=sequenceLength, padding="post", truncating="post")
paddedSequences_offensive = pad_sequences(validationSequences_offensive, maxlen=sequenceLength, padding="post", truncating="post")
paddedSequences_hate = pad_sequences(validationSequences_hate, maxlen=sequenceLength, padding="post", truncating="post")
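
The effect of `padding="post"` and `truncating="post"` can be sketched in plain Python for a single sequence. `pad_post` is a hypothetical helper for illustration; the real `pad_sequences` works on a batch and returns a 2D NumPy array rather than a list.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut the sequence after maxlen items, then append padding values at the end."""
    truncated = list(sequence)[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

print(pad_post([53, 27, 3358], maxlen=5))          # prints: [53, 27, 3358, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # prints: [1, 2, 3, 4, 5]
```

With both options set to "post", the start of each tweet is always preserved and zeros only ever appear at the end, which matches the padded sample printed in the next section.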

### Show some values

In [31]:
print(validationSet_hate["tweet"][0])
print(validationSet_hate_stopwords_removed[0])
print(validationSetLemmatized_hate[0])
print(validationSet_hate_tokens[0])
print(paddedSequences_hate[0])



back in my day there were nowhere near as many black people in this country it is no coincidence that life was better back then
back day nowhere near many black people country coincidence life better back
back day nowhere near many black people country coincidence life better back
['back', 'day', 'nowhere', 'near', 'many', 'black', 'people', 'country', 'coincidence', 'life', 'better', 'back']
[  53   27 3358  633   49    9    4   36 9494   47   98   53    0    0
    0    0    0    0    0    0    0    0    0    0    0]


### Show some validation samples

In [34]:
validationSet_hate

Unnamed: 0,index,label,tweet
0,1800,0,back in my day there were nowhere near as many...
1,1801,0,there is a clear correlation between muslim po...
2,1802,0,i love that my workplace is arab people free
3,1803,0,these black women are not different from this ...
4,1804,0,fag
...,...,...,...
895,2695,0,there is no difference between white women and...
896,2696,0,it is not easy to say and do not come for me i...
897,2697,0,the quran is the book that you must obey where...
898,2698,0,this is getting ridiculous why does everyone t...


### Store notebook variables

In [33]:
%store validationSet_neither
%store validationSet_offensive 
%store validationSet_hate
%store validationSetLemmatized_neither
%store validationSetLemmatized_offensive 
%store validationSetLemmatized_hate

Stored 'validationSet_neither' (DataFrame)
Stored 'validationSet_offensive' (DataFrame)
Stored 'validationSet_hate' (DataFrame)
Stored 'validationSetLemmatized_neither' (list)
Stored 'validationSetLemmatized_offensive' (list)
Stored 'validationSetLemmatized_hate' (list)
