AUTHOR : ABIRAMI RAJALINGAM

I wrote all of the explanatory text and comments in this notebook. It was adapted from https://github.com/NitishKundu/Tweets_hate_speech_classification/blob/main/notebooks/hate_speech_detection_transfer_learning_model.ipynb

# Hate-speech-Tweets-Data-Prep

## Importing necessary libraries

BLOCK 1

In [20]:
#importing necessary packages
#basic Python packages for data manipulation and storage
import numpy as np
#the numpy library is used for numerical computations and array manipulation
import pandas as pd
#the pandas library is used for data manipulation and analysis


import matplotlib.pyplot as plt
#graph plotting package from the matplotlib library

import re
#re library is used for regular expression facilities 

import nltk
#NLTK library is used for NLP tasks like text processing and analysis tasks such as tokenization, stemming, and part-of-speech tagging
from nltk.corpus import stopwords
#here the code imports the stopwords module from the Natural Language Toolkit (NLTK) library
#stopwords module contains a list of common words in a language (such as "the", "a", "an", "in", "is", etc.) 
#that are often considered to be unimportant in text analysis 
from nltk.stem.wordnet import WordNetLemmatizer
#The code imports the WordNetLemmatizer class from the NLTK library, which provides a tool for lemmatizing words in natural language text.
from nltk.corpus import wordnet
# imports the wordnet module from the NLTK library, which provides a lexical database for the English language.
#The wordnet module includes a variety of resources, including synonyms and antonyms for words, as well as information on word senses, semantic relationships, and word usage examples.
lemmatizer = nltk.stem.WordNetLemmatizer()
#creates an instance of the WordNetLemmatizer class from the NLTK library and assigns it to the variable lemmatizer.
#The WordNetLemmatizer is a tool for lemmatizing words in natural language text, 
#which means reducing a word to its base or dictionary form based on its context in the sentence.


nltk.download('omw-1.4')
#This downloads version 1.4 of the Open Multilingual Wordnet (OMW), a multilingual lexical database of word meanings.
nltk.download('punkt')
#This downloads the pre-trained models for the Punkt tokenizer, an unsupervised algorithm for sentence segmentation.
nltk.download('averaged_perceptron_tagger')
#This downloads a pre-trained model for part-of-speech (POS) tagging using the Averaged Perceptron algorithm.
nltk.download('wordnet')
#This downloads the WordNet lexical database, which is a large semantic lexicon for the English language.
nltk.download('stopwords')
#This downloads lists of common stopwords, including the list for the English language.

import warnings
#This imports the Python built-in warnings module, which provides a way to handle warnings that occur during the execution of a program.
warnings.filterwarnings('ignore') 
#This sets the filter to ignore all warnings generated by the code that follows.
%matplotlib inline
#This is a Jupyter notebook-specific command that enables the display of Matplotlib plots in the output of a Jupyter notebook cell. 

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Mounting Google Drive

BLOCK 2

In [21]:
from google.colab import drive
# This imports the drive module from the google.colab library, which provides a set of tools for working with Google Drive in Colab.

drive.mount('/content/gdrive')
#This mounts the Google Drive of the user who is running the Colab notebook and prompts them to enter an authorization code to grant Colab access to their Drive
root_path = 'gdrive/My Drive/hate_speech/'
#This sets the root_path variable to the path of a specific directory in the user's Google Drive.

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


#### Reading data in csv format

BLOCK 3

In [22]:
data_1 =  pd.read_csv('gdrive/My Drive/hate_speech/labeled_data.csv')
#This uses the pd.read_csv() function from the pandas library to read a CSV file stored on the mounted Google Drive.
#The data is loaded into a DataFrame called data_1.
data_1.head()
#This displays the first few rows of the DataFrame.

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;


#### Inspecting data

BLOCK 4

In [23]:
data_1.info()
#inspects the data using info() function

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          24783 non-null  int64 
 1   count               24783 non-null  int64 
 2   hate_speech         24783 non-null  int64 
 3   offensive_language  24783 non-null  int64 
 4   neither             24783 non-null  int64 
 5   class               24783 non-null  int64 
 6   tweet               24783 non-null  object
dtypes: int64(6), object(1)
memory usage: 1.3+ MB


#### Indexing first column as "id"

BLOCK 5

In [24]:
# renaming the first column as id
data_1.rename(columns={'Unnamed: 0':'id'}, inplace = True)

# increasing max length for all columns and number of columns
# (None removes the column-width limit so full tweets are displayed)
pd.set_option('display.max_colwidth', None)
pd.set_option("display.max_columns", 50)

pd.set_option('display.max_info_columns', 500)
pd.set_option('display.max_rows', 500)

In [25]:
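# preview the first 5 rows with id as the index
# (the result is not assigned, so data_1 itself keeps id as a regular column)
data_1.set_index('id').head(5)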

Unnamed: 0_level_0,count,hate_speech,offensive_language,neither,class,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
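
Note that `set_index()` returns a new DataFrame unless its result is assigned or `inplace=True` is passed. A minimal sketch on a made-up two-row frame (the `toy` and `preview` names are illustrative assumptions, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical toy frame standing in for data_1
toy = pd.DataFrame({"id": [10, 11], "tweet": ["first", "second"]})

# set_index returns a new frame; toy itself is left untouched
preview = toy.set_index("id")

print(list(toy.columns))     # 'id' is still a regular column in toy
print(preview.index.name)    # the returned frame uses 'id' as its index
```

This is why the later cells can still refer to `data_1['id']` after the preview above.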


#### Checking class value counts

BLOCK 6

In [26]:
data_1['class'].value_counts()
#to count the number of instances of each unique value in the 'class' column of a pandas DataFrame called data_1

1    19190
2    4163 
0    1430 
Name: class, dtype: int64
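
The counts above point to a strong class imbalance. A rough sketch of turning them into shares, using only the three counts printed above (the `class_counts` dictionary and its label names are written out by hand for illustration):

```python
# Counts copied by hand from the value_counts() output above
class_counts = {
    "offensive language (1)": 19190,
    "neither (2)": 4163,
    "hate speech (0)": 1430,
}

total = sum(class_counts.values())
shares = {name: n / total for name, n in class_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```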

#### Inspecting Hate speech class

BLOCK 7

In [27]:
hate_speech_df = data_1[data_1['class'] == 0]
#here the hate speech class is 0 
#filtering only the hate speech class and assigning into new data frame
hate_speech_df.head()
#inspecting the first few rows of the data frame using head() function

Unnamed: 0,id,count,hate_speech,offensive_language,neither,class,tweet
85,85,3,2,1,0,0,"""@Blackman38Tide: @WhaleLookyHere @HowdyDowdy11 queer"" gaywad"
89,90,3,3,0,0,0,"""@CB_Baby24: @white_thunduh alsarabsss"" hes a beaner smh you can tell hes a mexican"
110,111,3,3,0,0,0,"""@DevilGrimz: @VigxRArts you're fucking gay, blacklisted hoe"" Holding out for #TehGodClan anyway http://t.co/xUCcwoetmn"
184,186,3,3,0,0,0,"""@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPLE https://t.co/RNvD2nLCDR"" This is why there's black people and niggers"
202,204,3,2,1,0,0,"""@NoChillPaz: ""At least I'm not a nigger"" http://t.co/RGJa7CfoiT""\n\nLmfao"


#### Inspecting Offensive language class

BLOCK 8

In [28]:
offensive_lang_df = data_1[data_1['class'] == 1]
#here the offensive language class is 1
#filtering only the offensive language class and assigning into new data frame
offensive_lang_df.head()
#inspecting the first few rows of the data frame using head() function

Unnamed: 0,id,count,hate_speech,offensive_language,neither,class,tweet
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""


#### Inspecting Neutral language class

In [29]:
neutral_df = data_1[data_1['class'] == 2]
#here the neutral language class is 2
#filtering only the neutral language class and assigning into new data frame
neutral_df.head()

Unnamed: 0,id,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
40,40,3,0,1,2,2,""" momma said no pussy cats inside my doghouse """
63,63,3,0,0,3,2,"""@Addicted2Guys: -SimplyAddictedToGuys http://t.co/1jL4hi8ZMF"" woof woof hot scally lad"
66,66,3,0,1,2,2,"""@AllAboutManFeet: http://t.co/3gzUpfuMev"" woof woof and hot soles"
67,67,3,0,1,2,2,"""@Allyhaaaaa: Lemmie eat a Oreo &amp; do these dishes."" One oreo? Lol"


## Reading 2nd dataset  - Train Dataset

BLOCK 9

In [30]:
data_2 = pd.read_csv('gdrive/My Drive/hate_speech/train_label.csv')
#This uses the pd.read_csv() function from the pandas library to read the training CSV file stored on the mounted Google Drive.
#The data is loaded into a DataFrame called data_2.
data_2.head()
#This displays the first few rows of the DataFrame.

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


In [31]:
data_2.set_index('id').head(5)
#previewing the first 5 rows with 'id' as the index (data_2 itself is not modified)

Unnamed: 0_level_0,label,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
3,0,bihday your majesty
4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
5,0,factsguide: society now #motivation


#### Inspecting Neutral language class from 2nd dataset

BLOCK 10

In [32]:
neutral_2 = data_2[data_2['label'] == 0]
#here in the 2nd dataset neutral language class is 0
#filtering only the neutral language class and assigning into new data frame
neutral_2.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


#### Inspecting Hate speech class from 2nd dataset



BLOCK 11

In [33]:
hate_speech_2 = data_2[data_2['label'] == 1]
#here in the 2nd dataset hate speech language class is 1
#filtering only the hate speech language class and assigning into new data frame
hate_speech_2.head()

Unnamed: 0,id,label,tweet
13,14,1,@user #cnn calls #michigan middle school 'build the wall' chant '' #tcot
14,15,1,no comment! in #australia #opkillingbay #seashepherd #helpcovedolphins #thecove #helpcovedolphins
17,18,1,retweet if you agree!
23,24,1,@user @user lumpy says i am a . prove it lumpy.
34,35,1,it's unbelievable that in the 21st century we'd need something like this. again. #neverump #xenophobia


#### No. of records of respective classes from 1st dataset

BLOCK 12

In [34]:
print(" Number of records for hate speech (1st Dataframe) : ", len(hate_speech_df['class']))
print("\n Number of records for offensive language (1st Dataframe) : ", len(offensive_lang_df['class']))
print("\n Number of records for neutral language (1st Dataframe) : ", len(neutral_df['class']))
#The above code is counting the number of records in three different dataframes: 'hate_speech_df', 'offensive_lang_df', and 'neutral_df'. 
#The first dataframe contains records of hate speech, the second contains records of offensive language, and the third contains records of neutral language. 
#The purpose of this code is to provide a summary of the number of records in each category.

 Number of records for hate speech (1st Dataframe) :  1430

 Number of records for offensive language (1st Dataframe) :  19190

 Number of records for neutral language (1st Dataframe) :  4163


#### No. of records of respective classes from 2nd dataset

BLOCK 13

In [35]:
print(" Number of records for hate speech (2nd Dataframe) : ", len(hate_speech_2['label']))
print("\n Number of records for neutral language (2nd Dataframe) : ", len(neutral_2['label']))
#The above code is counting the number of records in two different dataframes from 2nd dataset: 'hate_speech_2', 'neutral_2'. 
#The first dataframe contains records of hate speech, the second contains records of neutral language. 
#The purpose of this code is to provide a summary of the number of records in each category.

 Number of records for hate speech (2nd Dataframe) :  2242

 Number of records for neutral language (2nd Dataframe) :  29720


#### Dropping neutral class from 1st dataset

In [36]:
data_1.drop(data_1[data_1['class'] == 2].index, inplace=True)
#dropping the neutral class from first dataset
data_1.head()
#inspecting the dataframe

Unnamed: 0,id,count,hate_speech,offensive_language,neither,class,tweet
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""


BLOCK 14

In [37]:
data_1['class'].value_counts()
#The above code counts the number of occurrences of each unique value in the 'class' column of a DataFrame called data_1 
#and displays the counts in descending order.

1    19190
0    1430 
Name: class, dtype: int64

In [38]:
data_1.drop(columns=['count', 'hate_speech', 'offensive_language', 'neither'], inplace=True)
data_1.set_index('id').head()

#The above code drops the columns 'count', 'hate_speech', 'offensive_language', and 'neither'
#from the DataFrame data_1. The 'inplace=True' parameter specifies that the DataFrame should be modified in place rather than returning a new DataFrame.
#The second line previews the first five rows with 'id' as the index; because the result of set_index() is not assigned,
#data_1 itself keeps 'id' as a regular column.

Unnamed: 0_level_0,class,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""


In [39]:
neutral_2 = neutral_2[:21000].copy()
print(len(neutral_2))
#The above code keeps only the first 21000 rows of the DataFrame neutral_2, taking a copy so that later
#changes do not act on a slice of data_2, and then prints the number of remaining rows.

21000


In [40]:
neutral_2.insert(2, "target", 0, True)
neutral_2
#The above code inserts a new column called "target", filled with 0, into the
#DataFrame neutral_2 at position 2 (i.e., as the third column), shifting the later columns to the right.
#The last argument (allow_duplicates=True) permits the insert even if a column with that name already exists.

Unnamed: 0,id,label,target,tweet
0,1,0,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,0,bihday your majesty
3,4,0,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,0,factsguide: society now #motivation
...,...,...,...,...
22574,22575,0,0,coldâï¸âï¸âï¸ #like4like #likeforlikealways #hours#tagsforlikesâ¦
22575,22576,0,0,@user 3 more sleeps until @user ð´ð´ð´ #hurryup @user @user @user @user #conquercancer
22576,22577,0,0,can't wait to sta gettin my shit together. #future #goals
22579,22580,0,0,can #lighttherapy help with #sad or #depression? #altwaystoheal #healthy is !!


In [41]:
data_1.insert(2, "target", 1, True)
data_1
#The above code inserts a new column called "target", filled with 1, into the
#DataFrame data_1 at position 2 (i.e., as the third column), shifting the later columns to the right.
#Only classes 0 (hate speech) and 1 (offensive language) remain in data_1, so both receive target 1.

Unnamed: 0,id,class,target,tweet
1,1,1,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,2,1,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,4,1,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,5,1,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""
...,...,...,...,...
24776,25289,0,1,you're all niggers
24777,25290,0,1,you're such a retard i hope you get type 2 diabetes and die from a sugar rush you fucking faggot @Dare_ILK
24778,25291,1,1,"you's a muthaf***in lie &#8220;@LifeAsKing: @20_Pearls @corey_emanuel right! His TL is trash &#8230;. Now, mine? Bible scriptures and hymns&#8221;"
24780,25294,1,1,young buck wanna eat!!.. dat nigguh like I aint fuckin dis up again


In [42]:
data_1.drop(columns=['class'], inplace=True)
#The code drops the 'class' column from the DataFrame data_1
neutral_2.drop(columns=['label'], inplace=True)
#The code drops the 'label' column from the DataFrame neutral_2



In [43]:
data_1.target.value_counts()
#The above code counts the number of occurrences of each unique value in the 'target' column of a DataFrame called data_1 and displays the counts in descending order. 

1    20620
Name: target, dtype: int64

In [44]:
neutral_2.target.value_counts()
#The above code counts the number of occurrences of each unique value in the 'target' column of the DataFrame neutral_2 and displays the counts in descending order.

0    21000
Name: target, dtype: int64

### Merging both dataframes

BLOCK 15

In [45]:
df_final = pd.concat([data_1, neutral_2], join = 'inner')
#The above code stacks the rows of the two DataFrames data_1 and neutral_2 on top of each other (the default, axis=0) using the 'pd.concat()' function.
#The 'join = "inner"' parameter specifies that only columns that are present in both DataFrames should be included in the final concatenated DataFrame.
#The concatenated DataFrame is assigned to a new variable called df_final.
df_final.set_index('id', inplace=True)
#The 'set_index()' function sets the 'id' column as the new index of the DataFrame.
#The 'inplace=True' parameter specifies that df_final should be modified directly rather than returning a new DataFrame.
df_final.head()
#Finally, 'head()' displays the first five rows of the modified DataFrame.

Unnamed: 0_level_0,target,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
5,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"""
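
By default `pd.concat()` stacks frames row-wise (`axis=0`), and `join='inner'` keeps only the columns the frames share. A small sketch on invented frames (`left`, `right` and the `extra` column are assumptions for illustration) also shows that the original row labels are carried over, which is one reason an index reset is useful afterwards:

```python
import pandas as pd

# Hypothetical frames; 'extra' exists only in the first one
left = pd.DataFrame({"id": [1, 2], "target": [1, 1], "tweet": ["a", "b"], "extra": [9, 9]})
right = pd.DataFrame({"id": [1, 2], "target": [0, 0], "tweet": ["c", "d"]})

# Rows are appended; only the shared columns survive the inner join
stacked = pd.concat([left, right], join="inner")

print(stacked.shape)
print(sorted(stacked.columns))
print(stacked.index.tolist())   # row labels from both inputs are kept as-is
```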


In [46]:
# Checking final count after merging both dataframes
df_final.target.value_counts()

0    21000
1    20620
Name: target, dtype: int64

In [47]:
# Shuffling the rows and resetting the index
final_df = df_final.sample(frac=1, random_state=43).reset_index(drop=True)
final_df

#The above code shuffles the rows of a DataFrame called df_final randomly using the 'sample()' function. 
#The 'frac=1' parameter specifies that all the rows should be included in the shuffled DataFrame, while 'random_state=43' 
#ensures reproducibility of the shuffling process.
#The 'reset_index()' function resets the index of the shuffled DataFrame 
#to a default sequential index, and the 'drop=True' parameter prevents the old index from being added as a new column in the DataFrame. 
#The shuffled DataFrame is then assigned to a new variable called final_df.

Unnamed: 0,target,tweet
0,0,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine
4,1,@munafal777 karma is a bitch
...,...,...
41615,0,all dressed up and ready for @user
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;
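
A quick way to check that `sample(frac=1, random_state=...)` behaves as a reproducible shuffle is to run it twice on a toy frame and compare the results. The frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"x": range(10)})

# Same seed, same permutation; reset_index(drop=True) renumbers the rows 0..n-1
first = toy.sample(frac=1, random_state=43).reset_index(drop=True)
second = toy.sample(frac=1, random_state=43).reset_index(drop=True)

print(first.equals(second))
print(sorted(first["x"]) == list(range(10)))   # no rows lost or duplicated
```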


In [48]:
# total length of final dataset

len(final_df)

41620

#### Converting tweets to lowercase

BLOCK 16

In [49]:
final_df['tweet_lower'] = final_df['tweet'].apply(lambda x: x.lower() if isinstance(x, str) else x)
#The above code creates a new column in the DataFrame final_df called 'tweet_lower',
#which is a lowercase version of the 'tweet' column. The 'apply()' function applies a lambda function to each value of the 'tweet' column.
#The lambda function checks if the value is a string, and if so, converts it to lowercase using the 'lower()' method;
#non-string values (for example missing values) are passed through unchanged.
#The result is then stored in the new 'tweet_lower' column.
final_df.head()

Unnamed: 0,target,tweet,tweet_lower
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch


In [50]:
# Dropping all duplicates
final_df = final_df.drop_duplicates(subset ='tweet_lower', keep = 'first')
#The above code removes duplicate rows from a DataFrame called final_df based on the 'tweet_lower' column 
#and keeps only the first occurrence of each unique value using the 'drop_duplicates()' function. 
#The 'subset' parameter specifies that the duplicates should be identified based on the 'tweet_lower' column, 
#and the 'keep' parameter specifies that only the first occurrence of each unique value should be kept. 
#The modified DataFrame with duplicates removed is assigned to the same variable as before.
print(len(final_df))

40222


In [51]:
assert len(final_df) == 40222
#The above code uses an assertion statement to check if the length of the DataFrame called final_df is equal to 40222
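
One consequence of `keep='first'` is worth spelling out: when the same lowercased text appears with different labels, the label of the first occurrence wins. A minimal sketch with invented rows (not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows: the first two differ only in letter case and in their label
toy = pd.DataFrame({
    "target": [1, 0, 1],
    "tweet": ["Hello World", "hello world", "Bye"],
})

toy["tweet_lower"] = toy["tweet"].str.lower()
deduped = toy.drop_duplicates(subset="tweet_lower", keep="first")

print(len(deduped))
print(deduped["target"].tolist())   # the label of the first duplicate is the one kept
```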

In [52]:
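# selecting the tweets that contain a shortened http://t.co URL
retweets =  final_df[final_df['tweet_lower'].str.contains(r'http://t(?!$)')]
#The above code creates a new DataFrame called 'retweets' by filtering the DataFrame 'final_df' using the 'str.contains()' method with a regular expression pattern.
#The pattern matches 'http://t' when it is not at the very end of the text, so the result holds every tweet with such a URL (not only retweets).
#Nothing is removed at this point; the URLs are stripped in a later step.
retweets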

Unnamed: 0,target,tweet,tweet_lower
17,1,Young Gem &amp; Don Chief been killing it in the yo...going in on that real music..no bubble gum trash in the booth... http://t.co/lkjNb7uZmt,young gem &amp; don chief been killing it in the yo...going in on that real music..no bubble gum trash in the booth... http://t.co/lkjnb7uzmt
19,1,RT @urbandictionary: @The2kGod nigger: A fully grown niglet http://t.co/HJKlNHDrWT http://t.co/x6y5XKKPMq,rt @urbandictionary: @the2kgod nigger: a fully grown niglet http://t.co/hjklnhdrwt http://t.co/x6y5xkkpmq
22,1,RT @Bayonettes: When you tweet som'n just for laughs &amp; a bitch wanna spoil the fun by getting Serious.. &#128530;&#128530; http://t.co/9LD1HWz1BU,rt @bayonettes: when you tweet som'n just for laughs &amp; a bitch wanna spoil the fun by getting serious.. &#128530;&#128530; http://t.co/9ld1hwz1bu
96,1,RT @PRAYINGFORHEAD: ya pussy stank @_ANGELSAMUELS http://t.co/i9tXjSDOZZ,rt @prayingforhead: ya pussy stank @_angelsamuels http://t.co/i9txjsdozz
100,1,ALL HE NEEDS IS A SKIRT!!!\nSure looks like a bitch to me.\n http://t.co/89HH52pCUe&#8221;,all he needs is a skirt!!!\nsure looks like a bitch to me.\n http://t.co/89hh52pcue&#8221;
...,...,...,...
41486,1,RT @LiIuglymane: When your side bitch tries to hug you in public http://t.co/T5uRIsXNdF,rt @liiuglymane: when your side bitch tries to hug you in public http://t.co/t5urisxndf
41520,1,&#8220;@JalapenoBright: This hoe waited until she got 45 to make a sextape chile.. http://t.co/r9gMmImakO&#8221; OH MY LORD&#128561;,&#8220;@jalapenobright: this hoe waited until she got 45 to make a sextape chile.. http://t.co/r9gmmimako&#8221; oh my lord&#128561;
41524,1,"#longhair don't care got my #gayboyproblems everywhere, I'm a #bitch I'm a #champ, I'm totally full of&#8230; http://t.co/Qia88qGmWh","#longhair don't care got my #gayboyproblems everywhere, i'm a #bitch i'm a #champ, i'm totally full of&#8230; http://t.co/qia88qgmwh"
41605,1,"#porn,#android,#iphone,#ipad,#sex,#xxx, | #CloseUp | pussy fuck close up http://t.co/0dvaZWLq2q","#porn,#android,#iphone,#ipad,#sex,#xxx, | #closeup | pussy fuck close up http://t.co/0dvazwlq2q"
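
To see what the filter pattern does and does not catch, here is a small sketch with made-up strings. The broader `ANY_URL` pattern is an assumption added for comparison, not something the notebook uses:

```python
import re

TCO_HTTP = r"http://t(?!$)"   # pattern used in the filter above
ANY_URL = r"https?://\S+"     # hypothetical broader alternative

samples = [
    "look http://t.co/abc",      # plain http t.co link
    "secure https://t.co/abc",   # https link
    "cut off http://t",          # 't' is the last character of the text
]

# str.contains() with a regex behaves like re.search() on each value
narrow = [bool(re.search(TCO_HTTP, s)) for s in samples]
broad = [bool(re.search(ANY_URL, s)) for s in samples]

print(narrow)
print(broad)
```

Under these assumptions the narrow pattern skips `https` links, so the filtered frame may undercount tweets that carry URLs.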


In [53]:
tweets_emoji =  final_df[final_df['tweet_lower'].str.contains(r'#[0-9]')]
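#The above code first creates a new DataFrame called 'tweets_emoji' by filtering
#the DataFrame 'final_df' using the 'str.contains()' method with a regular expression pattern.
#The pattern matches a '#' followed by a digit, as in the HTML numeric character references (e.g. &#128514;) that encode emoji in this data;
#hashtags that begin with a digit match as well.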
tweets_emoji['target'].value_counts()

1    5065
0    379 
Name: target, dtype: int64
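
The `#[0-9]` pattern picks up HTML numeric character references such as `&#128514;`, but it also fires on hashtags that begin with a digit. A hedged sketch (the stricter `ENTITY` pattern and the sample strings are assumptions for illustration) compares the two and decodes an entity with the standard library's `html.unescape`:

```python
import html
import re

LOOSE = r"#[0-9]"        # pattern used in the cell above
ENTITY = r"&#[0-9]+;"    # hypothetical stricter pattern: a full numeric reference

samples = ["so funny &#128514;", "#2020 was rough", "plain #hashtag"]

loose_hits = [bool(re.search(LOOSE, s)) for s in samples]
entity_hits = [bool(re.search(ENTITY, s)) for s in samples]

# html.unescape turns the numeric reference back into the character it encodes
decoded = html.unescape(samples[0])

print(loose_hits)
print(entity_hits)
print(decoded)
```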

In [54]:
#Function removing every match of a specific regex pattern from the text
def pattern_remover(input_txt, pattern):

    """Function replacing every match of a regex pattern with an empty string"""
    #re.sub() replaces all non-overlapping matches of the pattern in the input text with an empty string
    #and returns the modified text; applying the pattern directly avoids re-interpreting matched text as a new regex
    return re.sub(pattern, '', input_txt)
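
One thing to watch when removing matches with `re.sub`: feeding previously matched text back in as a pattern makes the regex engine interpret it again, and a short match can also eat the start of a longer one. A minimal sketch comparing the two approaches (both helper names and the sample text are hypothetical):

```python
import re

def remove_by_match(text, pattern):
    """Remove matches by reusing each matched string as a new pattern (fragile)."""
    for match in re.findall(pattern, text):
        text = re.sub(match, "", text)
    return text

def remove_by_pattern(text, pattern):
    """Remove every match of the pattern in a single pass."""
    return re.sub(pattern, "", text)

sample = "@ab hello @abc"

fragile = remove_by_match(sample, r"@\w*")
robust = remove_by_pattern(sample, r"@\w*")

print(repr(fragile))   # '@ab' is also stripped from the front of '@abc'
print(repr(robust))
```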

In [55]:
def count(input_txt, pattern):
    """Simple function returning the number of pattern matches in each tweet"""
    #This function takes two parameters: input_txt, which is a string containing the text of a tweet, and pattern, which is a regular expression pattern.
    #The function uses the re.findall() function to find all instances of the pattern in the input text.
    r = re.findall(pattern, input_txt)
    #It then returns the length of the resulting list, which corresponds to the number of times the pattern appears in the tweet.
    return len(r)

In [56]:
final_df['handle_count'] = np.vectorize(count)(final_df['tweet_lower'], r"@[\w]*")
#adds a new column to the final_df DataFrame called handle_count.
#The column is created using the np.vectorize() function, which applies the count()
#function (defined earlier) to each element of the tweet_lower column in final_df,
#passing in the regular expression pattern @[\w]* (written as a raw string) as the pattern argument.
#This pattern matches the @ symbol followed by zero or more alphanumeric characters or underscores.
final_df['handle_count'].value_counts()
#The second line of code displays the counts of unique values in the handle_count column, 
#indicating how many tweets contain 0, 1, 2, etc. mentions of Twitter handles.

0     21227
1     13623
2     3596 
3     1152 
4     351  
5     143  
6     72   
7     24   
8     20   
9     9    
10    4    
11    1    
Name: handle_count, dtype: int64
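
Because `*` allows zero word characters, the handle pattern also counts a bare `@`. A small sketch with invented strings; the stricter `@\w+` variant is an assumption shown only for comparison:

```python
import re

def count_matches(text, pattern):
    """Return how many non-overlapping matches of pattern occur in text."""
    return len(re.findall(pattern, text))

LOOSE = r"@[\w]*"   # pattern used for handle_count
STRICT = r"@\w+"    # hypothetical variant requiring at least one character

tweet = "rt @user_1: thanks @user2!"
aside = "meet me @ noon"

print(count_matches(tweet, LOOSE))    # two handles
print(count_matches(aside, LOOSE))    # the lone '@' is counted
print(count_matches(aside, STRICT))   # the lone '@' is ignored
```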

In [57]:
# remove twitter handles (@user)
final_df['handle_removed'] = np.vectorize(pattern_remover)(final_df['tweet_lower'], r"@[\w]*")
#This line of code uses the np.vectorize() function to apply the pattern_remover() function to every 
#element of the final_df['tweet_lower'] column, where each element is a string representing a tweet. 
#The pattern_remover() function takes two arguments: input_txt, which is the text to remove patterns from, 
#and pattern, which is the regex pattern to remove. In this case, the pattern is @[\w]*, 
#which matches any word starting with the @ symbol, representing a user handle on Twitter. 
#The output of the function is a new column in final_df called handle_removed, which contains the same tweets as the tweet_lower column, 
#but with all user handles removed.

In [58]:
final_df.head()
#inspecting the final_df dataframe using head() 

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch


In [59]:
# remove URLs (http:// or https:// links) from the tweets
final_df['url_removed'] = np.vectorize(pattern_remover)(final_df['handle_removed'], "https?://[A-Za-z0-9./]*")
#adds a new column to the final_df dataframe called "url_removed". 
#The column is created by applying the pattern_remover function to the "handle_removed" column of 
#the final_df dataframe using the np.vectorize function. 
#The pattern_remover function takes two arguments: input_txt (the text to search for the pattern to remove) 
#and pattern (the regular expression pattern to remove). In this case, 
#the pattern being removed is any URL that appears in the text. 
#The resulting column contains the same text as the "handle_removed" column, but with any URLs removed.
final_df.head()

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch
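
The character class in the URL pattern only covers letters, digits, dots and slashes, so a link that contains other characters is removed only up to the first of them. The sketch below compares that pattern with a broader one, `https?://\S+`, which is an alternative suggested here and not part of the original notebook. The sample text is made up.

```python
import re

NARROW_URL = r"https?://[A-Za-z0-9./]*"   # pattern used in the notebook
BROAD_URL = r"https?://\S+"               # assumed alternative: everything up to the next whitespace

text = "read this https://example.com/a-b?id=1 now"

narrow_result = re.sub(NARROW_URL, "", text)
broad_result = re.sub(BROAD_URL, "", text)
print(repr(narrow_result))
print(repr(broad_result))
```

With the narrow pattern the tail of the link (`-b?id=1`) stays in the text, while the broad pattern removes the whole link. Shortened links made only of letters, digits, dots and slashes are handled the same way by both patterns.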


In [60]:
# removing all characters apart from letters hashtags and apostrophes
final_df['special_char_removed'] = final_df['url_removed'].str.replace("[^a-zA-Z#']", " ", regex=True)
#adds a new column called 'special_char_removed' to the DataFrame 'final_df'. 
#It performs a regex replace on the 'url_removed' column using the pattern "[^a-zA-Z#']", 
#which matches any character that is not an uppercase or lowercase letter, 
#a hash symbol (#), or an apostrophe ('), and replaces it with a space character. 
#regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.
final_df

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch
...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #
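
The last row above shows a side effect of this step: an HTML character reference such as `&#128074;` loses its `&`, digits and `;` but keeps its `#`. The sketch below reproduces this with `re.sub` on a made-up string; it is an illustration on plain Python strings rather than on the dataframe.

```python
import re

# Same character class as in the notebook: keep letters, '#' and apostrophes
KEEP_PATTERN = r"[^a-zA-Z#']"

text = "it's #fun &#128074;"
cleaned = re.sub(KEEP_PATTERN, " ", text)

# Each removed character is replaced by exactly one space, so the length is unchanged
print(repr(cleaned))
print(cleaned.split())
```

Because every removed character becomes a space, the cleaned string can contain long runs of whitespace, and the leftover `#` stands alone instead of marking a hashtag.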


In [61]:
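# removing standalone '#' characters
final_df['single_hashtag_removed'] = np.vectorize(pattern_remover)(final_df['special_char_removed'], " # ")
#The above code applies the pattern_remover function to remove standalone hash symbols (a '#' with a space on each side) 
#from the 'special_char_removed' column in the final_df DataFrame and stores the output in a new column called 'single_hashtag_removed'. 
#These stray symbols are left behind when the previous step strips the digits from HTML character references such as &#128074;. 
#Hashtags attached to a word (for example #horrorshow) are not affected.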
final_df

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch
...,...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch
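
The pattern `" # "` needs a space on both sides of the hash, which works for the data above because the stray hashes are surrounded by several spaces. On more tightly packed text it can miss some of them. The sketch below contrasts it with a lookaround pattern, `(?<!\S)#(?!\S)`, which is an assumed alternative and not something the notebook uses. The sample strings are made up.

```python
import re

SPACE_HASH_SPACE = " # "            # pattern used in the notebook
LONE_HASH = r"(?<!\S)#(?!\S)"       # assumed alternative: '#' with no non-space neighbour

text = "nice day # # #"

notebook_style = re.sub(SPACE_HASH_SPACE, "", text)
lookaround_style = re.sub(LONE_HASH, "", text)
hashtag_untouched = re.sub(LONE_HASH, "", "#sunny day")

print(repr(notebook_style))
print(repr(lookaround_style))
print(repr(hashtag_untouched))
```

The lookaround version removes only the hash character and leaves the surrounding spaces alone, so neighbouring words are not glued together and a hash at the very end of the string is also caught.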


BLOCK 17

In [62]:
# counting tweet length
final_df['tweets_length'] = final_df['single_hashtag_removed'].str.len()
#The above code calculates the length of each tweet in characters and stores it in a new column 'tweets_length' in the final_df dataframe. 
#It uses the .str.len() accessor, which returns the length of the string in each row of the 'single_hashtag_removed' column. 
#The count includes any whitespace left behind by the earlier removal steps.
final_df

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17
...,...,...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,32
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,84
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,91
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch,54


In [63]:
# counting tweets longer than 280 characters; anything above Twitter's 280-character limit is treated as an outlier

len(final_df[final_df['tweets_length'] > 280])
#The above code checks the number of tweets in the DataFrame final_df 
#that have a length greater than 280 characters. 
#It does this by filtering the DataFrame for rows where the value in the tweets_length column is greater than 280, and 
#then getting the length of the resulting DataFrame.

6

In [64]:
# removing the rows whose tweet length is greater than 280; there are very few of them, so dropping them has a negligible effect on model building

idx = final_df.index[final_df['tweets_length'] > 280]
#creates an index idx for rows in final_df that have a tweets_length greater than 280.

final_df = final_df.drop(idx, axis=0)
#drops all the rows that have a tweets_length greater than 280 using the drop() method of pandas dataframes. 
#The axis=0 parameter specifies that rows need to be dropped and not columns.

assert len(final_df[final_df['tweets_length'] > 280]) == 0
#checks if there are any rows with a tweets_length greater than 280 left in the final_df dataframe. 
#If the assertion passes, it means that there are no rows with a tweets_length greater than 280 in the final_df dataframe.
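
The same filtering can be written with a boolean mask instead of collecting index labels and calling `drop()`. This is a minimal sketch on a made-up three-row dataframe; the names `toy_df`, `MAX_TWEET_LENGTH` and the column names are chosen for illustration only.

```python
import pandas as pd

MAX_TWEET_LENGTH = 280  # assumed cut-off, matching the limit used in the notebook

# Made-up data: the second row is deliberately too long
toy_df = pd.DataFrame({"text": ["short tweet", "x" * 300, "another short one"]})
toy_df["length"] = toy_df["text"].str.len()

# Keep only the rows at or below the limit
filtered_df = toy_df[toy_df["length"] <= MAX_TWEET_LENGTH]

print(len(filtered_df))
print(list(filtered_df.index))
```

As with `drop()`, the surviving rows keep their original index labels, so the index has gaps afterwards. Calling `reset_index(drop=True)` would renumber the rows if a contiguous index is needed later.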

In [65]:
final_df.isna().sum()
#checking for missing (NaN) values in each column using isna().sum()

target                    0
tweet                     0
tweet_lower               0
handle_count              0
handle_removed            0
url_removed               0
special_char_removed      0
single_hashtag_removed    0
tweets_length             0
dtype: int64

In [66]:
final_df
#inspecting the final_df dataframe

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17
...,...,...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,32
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,84
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,91
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch,54


BLOCK 18


In [67]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    
    """ Function defining the actual part of speech as adjective, 
    verb, noun or adverb"""
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
#This function takes in a part of speech (POS) tag from the nltk library, and maps it to the corresponding POS tag in the wordnet library. 
#It does this by checking the first letter of the nltk tag, and mapping it to the appropriate wordnet tag. 
#If the tag is not one of the four main POS categories (adjective, verb, noun, or adverb), the function returns None.
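
The same mapping can be expressed as a dictionary lookup on the first letter of the tag. The sketch below avoids importing NLTK by using the single-letter strings 'a', 'v', 'n' and 'r', on the assumption that these are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`. The function name `penn_to_wordnet` is hypothetical.

```python
# Assumed values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'NNS') to a WordNet POS letter, or None."""
    # tag[:1] is the first character, or '' for an empty tag
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])

for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

Tags outside the four categories, such as `DT` for determiners, fall through to `None`, which is the signal the lemmatization step uses to leave a token unchanged.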

In [68]:
def lemmatize_sentence(sentence):
    
    """Function to lemmatize with POS all tweets"""
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        # 'ass' kept being reduced to 'as' for some reason
        if word == 'ass':
            lemmatized_sentence.append(word)
        
        elif tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)


#This function lemmatize_sentence takes a sentence as input, then tokenizes the sentence and finds the Part-of-Speech (POS) 
#tag for each token using the nltk.pos_tag() function. 
#It then maps the tag to the corresponding wordnet tag using the nltk_tag_to_wordnet_tag() function. 
#The lemmatization is performed using the lemmatizer.lemmatize() function from the WordNetLemmatizer module of the NLTK library. 
#Finally, it returns the lemmatized sentence as a string by joining the lemmatized tokens.

In [69]:
final_df['lemmatized'] = final_df['single_hashtag_removed'].apply(lambda x: lemmatize_sentence(x))
final_df

#The above code applies the function lemmatize_sentence to each row of the single_hashtag_removed column in the final_df dataframe, 
#and saves the results in a new column called lemmatized. 
#The lemmatize_sentence function lemmatizes each word in a sentence using its part of speech (POS) tag, 
#which is determined using the NLTK library's pos_tag function. The lemmatized sentence is returned as a string and is joined together using a space separator.

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length,lemmatized
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49,brand new big flowdan in the email # horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23,fuck bitch get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95,rt exactly do n't be gettin no additional pussy for them soft ass tweet jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77,niccas use to be like you always tryn pull a finesse duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17,karma be a bitch
...,...,...,...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,32,all dress up and ready for
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,84,too many bitch get rabies and i hate a ho hoppin ' woman # stank pussy poppin ' woman
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,91,the # lesmiserables gang ready for westendlive # lesmiserables # lesmis # westend # love
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch,54,dumb ugly stupid bullshit ass bitch


In [70]:
# removing space after hashtags
final_df['lemmatized'] = final_df['lemmatized'].str.replace('# ', '#')
#The above code is replacing all instances of the string '# ' with '#' in the 'lemmatized' column of the DataFrame 'final_df'.
final_df

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length,lemmatized
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49,brand new big flowdan in the email #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23,fuck bitch get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95,rt exactly do n't be gettin no additional pussy for them soft ass tweet jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77,niccas use to be like you always tryn pull a finesse duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17,karma be a bitch
...,...,...,...,...,...,...,...,...,...,...
41615,0,all dressed up and ready for @user,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,32,all dress up and ready for
41616,1,Too many bitches got rabies And I hate a ho hoppin' woman #Stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,84,too many bitch get rabies and i hate a ho hoppin ' woman #stank pussy poppin ' woman
41617,0,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,91,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love
41618,1,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch,54,dumb ugly stupid bullshit ass bitch


In [71]:
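# removing the space before an apostrophe
final_df['lemmatized'] = final_df['lemmatized'].str.replace(" '", "'")
#This code replaces all occurrences of " '" with "'" in the 'lemmatized' column of the final_df dataframe. 
#This removes the space that tokenization inserted between a word and a following apostrophe (hoppin ' becomes hoppin'). 
#Contractions that were split as "do n't" are not affected, because there is no space directly before their apostrophe.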
final_df.head()

Unnamed: 0,target,tweet,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length,lemmatized
0,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49,brand new big flowdan in the email #horrorshow
1,1,@mckinley719 fuck bitches get money,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23,fuck bitch get money
2,1,RT @_____AL: Exactly. Don't be gettin no additional pussy for them soft ass tweets. Jus be you dude lol,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95,rt exactly do n't be gettin no additional pussy for them soft ass tweet jus be you dude lol
3,1,Niccas use to be like you always Tryn pull a finesse....Duh nicca I need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77,niccas use to be like you always tryn pull a finesse duh nicca i need mine
4,1,@munafal777 karma is a bitch,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17,karma be a bitch
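
The two clean-up replacements applied after lemmatization can be tried on a plain string. This is a minimal sketch with a hypothetical helper name, `rejoin_tokens`, and made-up input text.

```python
def rejoin_tokens(text):
    """Undo two tokenizer splits: '# word' -> '#word' and "word '" -> "word'" (hypothetical helper)."""
    return text.replace("# ", "#").replace(" '", "'")

rejoined = rejoin_tokens("ready for # summer hoppin ' along")
contraction = rejoin_tokens("do n't stop")

print(rejoined)
print(contraction)
```

The second call shows that a split contraction is left as it is, since neither of the two patterns occurs in it.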


In [72]:
# removing stopwords
stop = set(stopwords.words('english'))
#initializes a variable named 'stop' with the English stopwords from the NLTK library, stored as a set so that membership checks are fast.
final_df['tweet_stopwords_removed'] = final_df['lemmatized'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
#uses the apply() method on the 'lemmatized' column of the final_df dataframe to apply a lambda function to each row of the column.
#lambda function splits the sentence into words, and for each word, it checks if the word is in the list of stopwords. 
#If the word is not in the list of stopwords, it is added to a new list of words. 
#Finally, the words are joined back into a sentence using the join() method with a space separator. 
#The resulting sentence is returned and assigned to the 'tweet_stopwords_removed' column of the final_df dataframe.
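
The stopword filter described above can be sketched on plain strings, without pandas or NLTK. This is only an illustration: `SAMPLE_STOPWORDS` is a short hypothetical stand-in for `stopwords.words('english')`, and `remove_stopwords` is a helper name introduced here, not something defined in the notebook. Keeping the stopwords in a set rather than a list is one possible design choice, since a set membership test does not scan the whole collection.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
SAMPLE_STOPWORDS = {"in", "the", "all", "up", "and", "for"}


def remove_stopwords(text, stopword_set):
    """Return text without the whitespace-separated tokens found in stopword_set."""
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)


print(remove_stopwords("brand new big flowdan in the email #horrorshow", SAMPLE_STOPWORDS))
# brand new big flowdan email #horrorshow
```

On a DataFrame column the same helper could be used as `final_df['lemmatized'].apply(lambda x: remove_stopwords(x, set(stop)))`, assuming `stop` holds the NLTK list as in the cell above.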


In [73]:
final_df['tweet_stopwords_removed'] = final_df['tweet_stopwords_removed'].apply(lambda x: ' '.join([word for word in x.split() if len(word)>2]))
#removes every word of two characters or fewer (for example "rt") from the stopword-free tweets.
#The lambda splits each tweet into words, keeps only the words longer than 2 characters,
#joins them back into a string, and the result overwrites the 'tweet_stopwords_removed' column.
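
The length filter can be sketched the same way on a plain string. `drop_short_words` and its `min_length` parameter are hypothetical names used only for this illustration; `len(word) >= 3` is equivalent to the `len(word) > 2` test used in the cell above. Note that `len()` counts every character, so punctuation such as `#` or an apostrophe counts toward a token's length.

```python
def drop_short_words(text, min_length=3):
    """Keep only the whitespace-separated tokens with at least min_length characters."""
    return " ".join(word for word in text.split() if len(word) >= min_length)


# Made-up sample string: "rt" and "be" have two characters and are dropped.
print(drop_short_words("rt exactly jus be you dude lol"))
# exactly jus you dude lol
```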

In [74]:
final_df.drop(columns='tweet', inplace=True)
#drops the original 'tweet' column from final_df in place; columns= already names the axis, so axis=1 is not needed

In [75]:
final_df = final_df.loc[:, ["tweet_lower","handle_count","handle_removed","url_removed", "special_char_removed", "single_hashtag_removed", "tweets_length", "lemmatized", "tweet_stopwords_removed", "target"]]
#selects the listed columns from final_df, in the listed order, and assigns the result back to final_df.
#The ':' keeps every row, and 'target' is placed last; any column not named in the list is dropped.
#loc is pandas' label-based indexer: it selects rows and columns by label or by boolean array.

In [76]:
final_df
#inspecting the dataframe

Unnamed: 0,tweet_lower,handle_count,handle_removed,url_removed,special_char_removed,single_hashtag_removed,tweets_length,lemmatized,tweet_stopwords_removed,target
0,brand new big flowdan in the emails #horrorshow,0,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,brand new big flowdan in the emails #horrorshow,49,brand new big flowdan in the email #horrorshow,brand new big flowdan email #horrorshow,0
1,@mckinley719 fuck bitches get money,1,fuck bitches get money,fuck bitches get money,fuck bitches get money,fuck bitches get money,23,fuck bitch get money,fuck bitch get money,1
2,rt @_____al: exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,1,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt : exactly. don't be gettin no additional pussy for them soft ass tweets. jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,rt exactly don't be gettin no additional pussy for them soft ass tweets jus be you dude lol,95,rt exactly do n't be gettin no additional pussy for them soft ass tweet jus be you dude lol,exactly n't gettin additional pussy soft ass tweet jus dude lol,1
3,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,0,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse....duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use to be like you always tryn pull a finesse duh nicca i need mine,77,niccas use to be like you always tryn pull a finesse duh nicca i need mine,niccas use like always tryn pull finesse duh nicca need mine,1
4,@munafal777 karma is a bitch,1,karma is a bitch,karma is a bitch,karma is a bitch,karma is a bitch,17,karma be a bitch,karma bitch,1
...,...,...,...,...,...,...,...,...,...,...
41615,all dressed up and ready for @user,1,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,all dressed up and ready for,32,all dress up and ready for,dress ready,0
41616,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,0,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy-poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,too many bitches got rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,84,too many bitch get rabies and i hate a ho hoppin' woman #stank pussy poppin' woman,many bitch get rabies hate hoppin' woman #stank pussy poppin' woman,1
41617,@user the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,1,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #loveâ¦,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,91,the #lesmiserables gang ready for westendlive #lesmiserables #lesmis #westend #love,#lesmiserables gang ready westendlive #lesmiserables #lesmis #westend #love,0
41618,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,0,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch &#128074;&#128074;&#128074;,dumb ugly stupid bullshit ass bitch # # #,dumb ugly stupid bullshit ass bitch,54,dumb ugly stupid bullshit ass bitch,dumb ugly stupid bullshit ass bitch,1


In [77]:
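# Saving the final data to Google Drive so it can be used to build the model
final_df.to_csv('final_data.csv')
#to_csv() also writes the row index as an unnamed first column by default
!cp final_data.csv "gdrive/My Drive/hate_speech/"
#the leading '!' runs a shell command from the notebook; cp copies the CSV into the hate_speech folder on Google Drive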
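
If the saved file is later loaded with `pd.read_csv`, the index written by `to_csv` comes back as a column named `Unnamed: 0` unless it is handled explicitly. The sketch below shows two common ways to deal with that. It uses a tiny hypothetical frame in place of `final_df` and an in-memory buffer in place of `final_data.csv`; `csv_roundtrip` is a helper name introduced only for this illustration.

```python
import io

import pandas as pd

# Tiny hypothetical frame standing in for final_df.
df = pd.DataFrame({"tweet_stopwords_removed": ["dress ready", "karma"], "target": [0, 1]})


def csv_roundtrip(frame, write_kwargs=None, read_kwargs=None):
    """Write frame to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **(write_kwargs or {}))
    buffer.seek(0)
    return pd.read_csv(buffer, **(read_kwargs or {}))


# Default to_csv(): the index is written and read back as an extra column.
print(csv_roundtrip(df).columns.tolist())
# ['Unnamed: 0', 'tweet_stopwords_removed', 'target']

# Option 1: tell read_csv that the first column is the index.
print(csv_roundtrip(df, read_kwargs={"index_col": 0}).columns.tolist())
# ['tweet_stopwords_removed', 'target']

# Option 2: do not write the index in the first place.
print(csv_roundtrip(df, write_kwargs={"index": False}).columns.tolist())
# ['tweet_stopwords_removed', 'target']
```

Which option fits depends on how the modelling notebook reads the file, which is not shown here.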