# Hate Speech Detector

Hate speech proliferates online, posing significant societal challenges and threatening the safety and well-being of individuals and communities. Despite efforts to combat this issue, hate speech often goes undetected due to its complex and evolving nature. Identifying and addressing instances of hate speech is crucial for fostering inclusivity, tolerance, and safety in online spaces.

To curb the effect of hate speech, we will build a machine learning model that can accurately identify tweets containing hate speech so that they can be filtered out.

## Dataset

The data for this project is obtained from 6 different sources. We download the dataset from each source, merge them, and then preprocess the result to resolve any quality and structural issues.

The process of assembling the dataset is explained below.

In [1]:
# Import the libraries

import glob
import pandas as pd

### Dataset one


The first dataset is obtained from this [link](https://github.com/Vicomtech/hate-speech-dataset). We first downloaded it to our local machine and saved it in a directory called hate-speech-dataset. The dataset is not structured: each tweet is stored in its own txt file, and the corresponding labels are stored separately in a file called annotations_metadata.csv.

Tweets with a label are for training, while those without a label are for testing. We will load the data in the all_files folder, which contains both training and test data, and then merge it with the labels. The resulting dataset will later be separated into train and test sets based on the presence or absence of a label.

In [2]:
# Specify the path to the folder containing the files
folder_path = "hate-speech-dataset/all_files"

# Use glob to list every .txt file in the folder; a forward slash works on
# Windows as well as on Linux and macOS, and avoids the invalid "\*" escape
file_names = glob.glob(folder_path + "/*.txt")

In [3]:
# Read the content of each file and store the tweet with its file_id
import os

tweet_list = []

for file_name in file_names:
    # File name without its folder and ".txt" extension, on any operating system
    file_id = os.path.splitext(os.path.basename(file_name))[0]
    with open(file_name, "r", encoding="utf-8") as f:
        tweet = f.read()
    tweet_list.append({"file_id": file_id, "tweet": tweet})

# Save the first dataset as df1
df1 = pd.DataFrame(tweet_list)
df1.head()

Unnamed: 0,file_id,tweet
0,12834217_1,"As of March 13th , 2014 , the booklet had been..."
1,12834217_10,Thank you in advance. : ) Download the youtube...
2,12834217_2,In order to help increase the booklets downloa...
3,12834217_3,( Simply copy and paste the following text int...
4,12834217_4,Click below for a FREE download of a colorfull...
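
The loading step above can also be written with `pathlib`, which handles path separators on every operating system. The following is a minimal sketch rather than the code used to produce the output above: `load_text_folder` is a hypothetical helper name, the demo files are made up, and unlike the cell above it strips surrounding whitespace from each text.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_text_folder(folder):
    """Read every .txt file in `folder` into a DataFrame of (file_id, tweet)."""
    rows = []
    # sorted() makes the row order deterministic across operating systems
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append({
            "file_id": path.stem,  # file name without the .txt extension
            # strip() removes leading/trailing whitespace (an assumption of this sketch)
            "tweet": path.read_text(encoding="utf-8").strip(),
        })
    return pd.DataFrame(rows, columns=["file_id", "tweet"])


# Tiny self-contained demo with two made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "111_1.txt").write_text("first example\n", encoding="utf-8")
    Path(tmp, "111_2.txt").write_text("second example\n", encoding="utf-8")
    demo = load_text_folder(tmp)

print(demo)
```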


In [4]:
# Get the labels for df1

df1_label = pd.read_csv("hate-speech-dataset/annotations_metadata.csv")

# Merge df1 and df1_label on the file_id column

df1 = pd.merge(left=df1, right=df1_label, on="file_id", how="left")

# view the complete dataset
df1.head()

Unnamed: 0,file_id,tweet,user_id,subforum_id,num_contexts,label
0,12834217_1,"As of March 13th , 2014 , the booklet had been...",572066,1346,0,noHate
1,12834217_10,Thank you in advance. : ) Download the youtube...,572066,1346,0,noHate
2,12834217_2,In order to help increase the booklets downloa...,572066,1346,0,noHate
3,12834217_3,( Simply copy and paste the following text int...,572066,1346,0,noHate
4,12834217_4,Click below for a FREE download of a colorfull...,572066,1346,0,hate
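
Because a left merge silently leaves the label empty for any text without a matching annotation, it can help to check how many rows actually matched. The sketch below uses small made-up frames in place of df1 and df1_label, and relies on the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for df1 and df1_label (values are made up)
texts = pd.DataFrame({"file_id": ["a_1", "a_2", "a_3"],
                      "tweet": ["one", "two", "three"]})
labels = pd.DataFrame({"file_id": ["a_1", "a_2"],
                       "label": ["noHate", "hate"]})

# validate raises an error if file_id is repeated on either side;
# indicator adds a _merge column telling where each row came from
merged = pd.merge(texts, labels, on="file_id", how="left",
                  validate="one_to_one", indicator=True)

# "left_only" rows are texts that have no annotation
unlabelled = merged[merged["_merge"] == "left_only"]
print(merged["_merge"].value_counts())
```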


In [5]:
# We need only the tweet and label columns; copy to avoid SettingWithCopyWarning

df1 = df1[["tweet", "label"]].copy()

# Write a function to categorize the label
def is_hate(x):
    # Missing or non-text labels stay unlabelled
    if not isinstance(x, str):
        return None
    if x.lower() == "hate":
        return 1
    elif x.lower() == "nohate":
        return 0
    else:
        # Any other label value is left unlabelled
        return None

# Create the is_hate column
df1["is_hate"] = df1["label"].apply(is_hate)

# Drop the label column
df1 = df1.drop(columns="label")

df1.head()

Unnamed: 0,tweet,is_hate
0,"As of March 13th , 2014 , the booklet had been...",0.0
1,Thank you in advance. : ) Download the youtube...,0.0
2,In order to help increase the booklets downloa...,0.0
3,( Simply copy and paste the following text int...,0.0
4,Click below for a FREE download of a colorfull...,1.0
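
The same categorization can be expressed as a dictionary lookup with `Series.map`, which leaves any value that is missing or not in the dictionary as NaN. This is a sketch on a made-up series, not the cell that produced the output above; `"other"` is an invented label used only to show the fall-through case.

```python
import pandas as pd

# Made-up labels, including an unmapped value and a missing one
labels = pd.Series(["noHate", "hate", "other", None])

# Keys are lower-case because the labels are lower-cased before the lookup
label_map = {"hate": 1, "nohate": 0}
is_hate = labels.str.lower().map(label_map)

print(is_hate.tolist())
```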


### Dataset two

The second dataset comes from Hugging Face. You can access it through this [link](https://huggingface.co/datasets/tweets_hate_speech_detection). The dataset is split into train and test sets. We will merge the two and rename the columns so that they are consistent with df1.

In [6]:
# Load the train and test datasets
df2_train = pd.read_csv("huggingface/train_tweet.csv", encoding = "utf-8")
df2_test = pd.read_csv("huggingface/test_tweets.csv", encoding = "utf-8")



In [7]:
df2_train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [8]:
df2_test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [9]:
# concat the two datasets

df2 = pd.concat([df2_train, df2_test])

# take out label and tweet columns and thereafter rename label to is_hate
df2 = df2[["tweet", "label"]]

# rename the columns 
df2.columns = ["tweet", 'is_hate']

df2.head()

Unnamed: 0,tweet,is_hate
0,@user when a father is dysfunctional and is s...,0.0
1,@user @user thanks for #lyft credit i can't us...,0.0
2,bihday your majesty,0.0
3,#model i love u take with u all the time in ...,0.0
4,factsguide: society now #motivation,0.0
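
The labels appear as floats (0.0) because the test file has no label column: concatenation fills the missing labels with NaN, and a column that contains NaN is stored as float. A small sketch with made-up rows illustrates this; `ignore_index=True` is an optional extra used here so that the combined frame gets a fresh index.

```python
import pandas as pd

# Made-up stand-ins for df2_train and df2_test
train = pd.DataFrame({"id": [1, 2], "label": [0, 1], "tweet": ["a", "b"]})
test = pd.DataFrame({"id": [3], "tweet": ["c"]})  # no label column

combined = pd.concat([train, test], ignore_index=True)
combined = combined[["tweet", "label"]].rename(columns={"label": "is_hate"})

# The test row has a missing label, so the whole column becomes float
print(combined)
print(combined["is_hate"].dtype)
```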


### Dataset three

The third dataset is called hatespeech kenya. It is obtained from this [link](https://www.kaggle.com/datasets/edwardombui/hatespeech-kenya). We will load the dataset and rename the columns to be consistent with the previous two.

In [10]:
# load the dataset
df3 = pd.read_csv("HateSpeech_Kenya.csv", encoding = "utf-8")
df3.head()

Unnamed: 0,hate_speech,offensive_language,neither,Class,Tweet
0,0,0,3,0,['The political elite are in desperation. Ordi...
1,0,0,3,0,"[""Am just curious the only people who are call..."
2,0,0,3,0,['USERNAME_3 the area politicians are the one ...
3,0,0,3,0,['War expected in Nakuru if something is not d...
4,0,0,3,0,['USERNAME_4 tells kikuyus activists that they...


In [11]:
df3['Class'].unique()

array([0, 1, 2], dtype=int64)

The classes are 0 for neither, 1 for offensive, and 2 for hate speech. We will therefore recode the label as 1 for hate speech and 0 for the other classes.

In [12]:
# Define a function to reassign the label
def reassign_label(x):
    if x == 2:
        return 1
    else:
        return 0
    
# apply the function on class column
df3['is_hate'] = df3['Class'].apply(reassign_label)
df3.head()

Unnamed: 0,hate_speech,offensive_language,neither,Class,Tweet,is_hate
0,0,0,3,0,['The political elite are in desperation. Ordi...,0
1,0,0,3,0,"[""Am just curious the only people who are call...",0
2,0,0,3,0,['USERNAME_3 the area politicians are the one ...,0
3,0,0,3,0,['War expected in Nakuru if something is not d...,0
4,0,0,3,0,['USERNAME_4 tells kikuyus activists that they...,0
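
For a simple rule like this, a vectorized comparison gives the same result as applying a function row by row. A minimal sketch on made-up class values:

```python
import pandas as pd

# Made-up class codes: 0 = neither, 1 = offensive, 2 = hate speech
classes = pd.Series([0, 1, 2, 2, 0], name="Class")

# True where the class is hate speech, then converted to 1/0
is_hate = (classes == 2).astype(int)

print(is_hate.tolist())  # [0, 0, 1, 1, 0]
```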


In [13]:
# take the tweet and is_hate columns

df3 = df3[["Tweet", "is_hate"]]

# rename the columns to tweet and is_hate
df3.columns = ["tweet", "is_hate"]

In [14]:
df3.head()

Unnamed: 0,tweet,is_hate
0,['The political elite are in desperation. Ordi...,0
1,"[""Am just curious the only people who are call...",0
2,['USERNAME_3 the area politicians are the one ...,0
3,['War expected in Nakuru if something is not d...,0
4,['USERNAME_4 tells kikuyus activists that they...,0
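
The tweets in this dataset appear to be stored as stringified Python lists, for example `['War expected in Nakuru ...`. If that holds for the full values (the output above is truncated, so this is an assumption), the brackets and quotes could be removed with `ast.literal_eval`. The sketch below uses made-up strings, and `unwrap_list_string` is a hypothetical helper name.

```python
import ast

import pandas as pd


def unwrap_list_string(text):
    """Turn "['some text']" into "some text"; return other input unchanged."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # not a Python literal, keep the raw string
    if isinstance(value, list):
        # Join the list items in case a tweet was split into several pieces
        return " ".join(str(item) for item in value)
    return text


# Made-up examples in the same shape as the Tweet column
raw = pd.Series(["['War expected soon']", '["Am just curious"]', "plain text"])
clean = raw.apply(unwrap_list_string)

print(clean.tolist())
```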


### Dataset four


The fourth dataset is the Hate Speech Detection curated Dataset, obtained from this [link](https://www.kaggle.com/datasets/waalbannyantudre/hate-speech-detection-curated-dataset). We will load the dataset and reconcile the column names.

In [15]:
df4 = pd.read_csv("HateSpeechDataset.csv", encoding = "utf-8")

df4.head()

Unnamed: 0,Content,Label,Content_int
0,denial of normal the con be asked to comment o...,1,"[146715, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,..."
1,just by being able to tweet this insufferable ...,1,"[146715, 14, 15, 16, 17, 7, 18, 19, 20, 21, 22..."
2,that is retarded you too cute to be single tha...,1,"[146715, 28, 29, 30, 26, 31, 32, 7, 5, 33, 28,..."
3,thought of a real badass mongol style declarat...,1,"[146715, 35, 1, 24, 36, 37, 38, 39, 40, 1, 41,..."
4,afro american basho,1,"[146715, 46, 47, 48, 146714]"


In [16]:
# Get the Content and Label columns; they are renamed to tweet and is_hate below
df4 = df4[["Content", "Label"]]

# rename the columns 
df4.columns = ["tweet", "is_hate"]

# view the dataset
df4.head()

Unnamed: 0,tweet,is_hate
0,denial of normal the con be asked to comment o...,1
1,just by being able to tweet this insufferable ...,1
2,that is retarded you too cute to be single tha...,1
3,thought of a real badass mongol style declarat...,1
4,afro american basho,1


In [17]:
df4['is_hate'].value_counts()

is_hate
0        361594
1         79305
Label         7
Name: count, dtype: int64

In [18]:
df4[df4['is_hate'] == "Label"]

Unnamed: 0,tweet,is_hate
190108,content,Label
418486,content,Label
422333,content,Label
424241,content,Label
426162,content,Label
435474,content,Label
437104,content,Label


Seven of the observations in dataset four have the non-boolean label "Label". A closer look shows that these entries are not meaningful, as their corresponding tweet is just "content". We will therefore drop those observations.

In [19]:
df4 = df4[df4['is_hate'] != "Label"]
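
A more general way to handle stray non-numeric labels is to coerce the column to numbers and drop whatever fails to convert. This is a sketch on made-up rows, not the approach used above; note that it would also drop rows whose label is simply missing.

```python
import pandas as pd

# Made-up rows: labels arrive as strings, one of them is not a number
toy = pd.DataFrame({"tweet": ["some text", "content", "more text"],
                    "is_hate": ["1", "Label", "0"]})

# Anything that is not a number becomes NaN instead of raising an error
toy["is_hate"] = pd.to_numeric(toy["is_hate"], errors="coerce")

# Drop the rows that failed to convert and store the labels as integers
toy = toy.dropna(subset=["is_hate"]).astype({"is_hate": int})

print(toy)
```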

### Dataset five

The fifth dataset is the Hate Speech and Offensive Language Dataset from this [link](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset). We will load the dataset and reconcile the columns.

In [20]:
# load the dataset
df5 = pd.read_csv("labeled_data.csv", encoding = "utf-8")

df5.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [21]:
df5['class'].unique()

array([2, 1, 0], dtype=int64)

Class 0 refers to hate speech, class 1 to offensive language, and class 2 to neither. We will recode the label so that hate speech gets a label of 1 and the other classes get a label of 0.

In [22]:
# Define a function to reassign the label (hate speech is class 0 in this dataset)
def reassign_label(x):
    if x == 0:
        return 1
    else:
        return 0
    
# apply the function on class column
df5['is_hate'] = df5['class'].apply(reassign_label)
df5.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,is_hate
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,0
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,0
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,0
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,0
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,0
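
Since the class codes differ between datasets three and five, a quick cross-tabulation of the original class against the new label is one way to confirm the recoding. A minimal sketch on made-up class values:

```python
import pandas as pd

# Made-up class codes: 0 = hate speech, 1 = offensive, 2 = neither
toy = pd.DataFrame({"class": [2, 1, 1, 0, 0, 2]})
toy["is_hate"] = (toy["class"] == 0).astype(int)

# Rows are the original classes, columns the new binary label;
# only class 0 should have counts in the is_hate == 1 column
check = pd.crosstab(toy["class"], toy["is_hate"])

print(check)
```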


In [23]:
# Extract the tweet and is_hate columns
df5 = df5[["tweet", "is_hate"]]

df5.head()

Unnamed: 0,tweet,is_hate
0,!!! RT @mayasolovely: As a woman you shouldn't...,0
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,0
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,0
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,0
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,0


### Complete dataset

Now that each of the datasets has been prepared, we will concatenate them and continue with data preprocessing.

In [24]:
# concat df1, df2, df3, df4 and df5

df = pd.concat([df1, df2, df3, df4, df5])

df.head()

Unnamed: 0,tweet,is_hate
0,"As of March 13th , 2014 , the booklet had been...",0.0
1,Thank you in advance. : ) Download the youtube...,0.0
2,In order to help increase the booklets downloa...,0.0
3,( Simply copy and paste the following text int...,0.0
4,Click below for a FREE download of a colorfull...,1.0
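
After concatenation it is no longer possible to tell which source a row came from. If that information is useful later, for example to compare label balance per source, one option is to tag each frame before combining. The sketch below uses made-up frames, and the `source` column and its names are assumptions, not part of the original pipeline.

```python
import pandas as pd

# Made-up stand-ins for two of the prepared datasets
a = pd.DataFrame({"tweet": ["x", "y"], "is_hate": [0, 1]})
b = pd.DataFrame({"tweet": ["z"], "is_hate": [1]})

frames = {"dataset_one": a, "dataset_two": b}

# Add a source column to each frame, then stack them with a fresh index
combined = pd.concat(
    [frame.assign(source=name) for name, frame in frames.items()],
    ignore_index=True,
)

print(combined)
```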


In [25]:
# Convert the labels to the same datatype
df['is_hate'] = df['is_hate'].astype(float)

In [26]:
df['is_hate'].unique()

array([ 0.,  1., nan])

## Check for duplicates

Since we gathered the data from different sources, there is a possibility of overlap between them. To address this, we check for duplicate entries and drop them.

In [27]:
# check for duplicates
df.duplicated().sum()

26187

In [28]:
# Drop duplicates
df.drop_duplicates(inplace = True)

# Confirm that no duplicates remain
df.duplicated().sum()

0
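
Note that `duplicated()` with no arguments compares whole rows, so the same tweet text with two different labels (or with one label and one missing label) is not counted as a duplicate. The sketch below, on made-up rows, shows one way such cases could be inspected; how to resolve them is a separate decision.

```python
import numpy as np
import pandas as pd

# Made-up rows: one tweet with conflicting labels, one labelled and unlabelled
toy = pd.DataFrame({
    "tweet": ["same text", "same text", "other text", "other text", "unique"],
    "is_hate": [0.0, 1.0, 1.0, np.nan, 0.0],
})

# Exact (tweet, label) duplicates: none in this toy frame
exact_dupes = toy.duplicated().sum()

# Rows whose text repeats, regardless of label
repeated_text = toy[toy.duplicated(subset="tweet", keep=False)]

# Tweets that carry more than one distinct non-missing label
label_counts = toy.groupby("tweet")["is_hate"].nunique()
conflicting = label_counts[label_counts > 1].index.tolist()

print(exact_dupes, len(repeated_text), conflicting)
```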

## Split into train and test

Entries with labels are for training, while entries with no label are for testing. We will therefore split the dataset into train and test sets based on whether a label is present.

In [29]:
# Extract Train dataset
df_train = df[df['is_hate'].notnull()]

In [30]:
# extract test dataset
df_test = df[df['is_hate'].isnull()]
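
A quick sanity check after the split is to confirm that the two parts together cover the data exactly once, and to look at the class balance of the labelled part. The sketch below uses a made-up frame; the names `labelled` and `unlabelled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the combined dataset
toy = pd.DataFrame({"tweet": ["a", "b", "c", "d"],
                    "is_hate": [0.0, 1.0, np.nan, 0.0]})

labelled = toy[toy["is_hate"].notna()].copy()
unlabelled = toy[toy["is_hate"].isna()].copy()

# The two parts should cover the data exactly once
assert len(labelled) + len(unlabelled) == len(toy)

# Labels can be stored as integers once the missing values are gone
labelled["is_hate"] = labelled["is_hate"].astype(int)

# Share of each class among the labelled rows
print(labelled["is_hate"].value_counts(normalize=True))
```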

## Save clean dataset

Finally, we save the cleaned train and test datasets as CSV files.

In [31]:
df_train.to_csv("dataset/clean_train.csv", index = False)
df_test.to_csv("dataset/clean_test.csv", index = False)
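
`to_csv` does not create missing folders, so the `dataset` directory has to exist before saving. The sketch below creates the folder first and reads the file back as a check; it writes a made-up frame to a temporary location so that it does not touch the real output files.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for df_train
toy_train = pd.DataFrame({"tweet": ["a", "b"], "is_hate": [0.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "dataset")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    out_path = os.path.join(out_dir, "clean_train.csv")
    toy_train.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded)
```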