# Hate Speech Detection Using SVM (Support Vector Machine) Model
The Hate Speech Detection project is aimed at developing a machine learning model that can identify and classify hate speech in social media posts. Hate speech is defined as any language that is intended to degrade, intimidate, or incite violence or prejudicial action against a particular group of people based on their race, ethnicity, religion, gender, sexual orientation, or other characteristic.

The use of hate speech has become a growing concern in recent years, with the rise of social media platforms that provide a global audience for anyone with an internet connection. Hate speech not only has a negative impact on the individuals or groups targeted, but it can also contribute to a wider climate of intolerance and discrimination.

The proposed solution is to use a Support Vector Machine (SVM) algorithm, a type of supervised learning model, to analyze text data from social media posts and classify them as either hate speech or not. The project involves several stages, including data preprocessing, feature extraction, model training and testing, and performance evaluation. The ultimate goal is to develop a model that can accurately identify and flag hate speech, which can be used by social media companies to enforce their content policies and protect their users from harmful content.

#### Required Libraries
1) Pandas is a Python library used for data manipulation and analysis. It provides data structures and functions to read, write and manipulate tabular data (e.g., CSV, Excel, SQL database). In the context of the Hate Speech Detection project, Pandas is used to store and preprocess the raw text data in a DataFrame object.

2) NumPy is a fundamental Python library for scientific computing that provides support for large, multi-dimensional arrays and matrices, as well as a wide range of mathematical functions to operate on these arrays. In the context of the Hate Speech Detection project, NumPy is used to handle the arrays and matrices that are used in the SVM algorithm.

3) Matplotlib is a Python 2D plotting library that provides a variety of visualizations, such as line plots, scatter plots, histograms, and bar charts. In the context of the Hate Speech Detection project, Matplotlib is used to visualize the data and model performance.

4) NLTK (Natural Language Toolkit) is a Python library for natural language processing (NLP) tasks, such as tokenization, stemming, lemmatization, part-of-speech tagging, and text classification. In the context of the Hate Speech Detection project, NLTK is used to preprocess the text data and extract relevant features.

5) Scikit-learn (sklearn) is a popular Python library for machine learning that provides a variety of algorithms and tools for data preprocessing, feature extraction, model selection, and evaluation. In the context of the Hate Speech Detection project, Scikit-learn is used to implement the SVM algorithm, split the data into training and testing sets, and evaluate the model performance.

6) Pickle is a module in the Python standard library used for serializing and de-serializing Python objects (e.g., model objects) to a file or a stream. In the context of the Hate Speech Detection project, Pickle is used to save the trained SVM model and the fitted vectorizer to files so that they can be used later for prediction without retraining the model.

In [1]:
# Import the libraries needed in this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, svm
from sklearn.metrics import accuracy_score

### Read and Process Data for training and testing

#### Read data with pandas
The data set can be loaded into a pandas DataFrame with the ‘read_csv’ function. I have set the encoding to ‘latin-1’ because the text contains many special characters.

In [2]:
Corpus = pd.read_csv(r"./labeled_data.csv",encoding='latin-1')

View the corpus data in tabular form. This is a free, open-source dataset downloaded from Kaggle. It has 24783 rows and 7 columns, but only two columns are needed for training: class and tweet. The class column holds one of three labels: 0 means hate speech, 1 means offensive language, and 2 means neither of them. The tweet column holds the original text. It is important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.

In [3]:
Corpus

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an..."
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies
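
Since only the class and tweet columns are needed, the frame can be narrowed before further processing. The sketch below uses a tiny in-memory CSV with made-up rows (not the Kaggle file) to show the selection and a per-class count. The label_names mapping is an illustrative helper and is not part of the notebook.

```python
import io
import pandas as pd

# Made-up stand-in for labeled_data.csv with the same column names
csv_text = """count,hate_speech,offensive_language,neither,class,tweet
3,0,0,3,2,first example tweet
3,0,3,0,1,second example tweet
3,0,3,0,1,third example tweet
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Keep only the two columns used for training
subset = toy[['class', 'tweet']]

# Hypothetical mapping from the numeric class to a readable name
label_names = {0: 'hate speech', 1: 'offensive', 2: 'neither'}
class_counts = subset['class'].map(label_names).value_counts()
print(class_counts)
```

Counting rows per class early is a cheap way to see whether the labels are balanced before training.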


#### Data Pre-Processing
This is an important step in any data mining process. It transforms raw data into a format that NLP models can work with. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data pre-processing helps resolve such issues, which in turn helps the classification algorithms produce better results.

In [4]:
# Step - a : Remove rows with a missing tweet, if any, and renumber the index so that later positional loops stay aligned.
Corpus = Corpus.dropna(subset=['tweet']).reset_index(drop=True)
# Step - b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['tweet'] = [entry.lower() for entry in Corpus['tweet']]
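
As a small illustration of the two cleaning steps, the sketch below runs them on a made-up three-row frame. It assumes the goal is to drop rows whose tweet is missing and to keep the index contiguous afterwards. The name toy is a hypothetical stand-in for Corpus.

```python
import pandas as pd

toy = pd.DataFrame({
    'class': [2, 1, 1],
    'tweet': ['Hello World', None, 'DOG and dog'],
})

# Drop rows with a missing tweet and renumber the index 0..n-1
cleaned = toy.dropna(subset=['tweet']).reset_index(drop=True)

# Lower-case every tweet with the vectorized string accessor
cleaned['tweet'] = cleaned['tweet'].str.lower()

print(cleaned)
```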

##### Tokenization: 
This is a process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing. NLTK Library has word_tokenize and sent_tokenize to easily break a stream of text into a list of words or sentences, respectively.

In [5]:
Corpus['tweet']= [word_tokenize(entry) for entry in Corpus['tweet']]

In [6]:
# first entry text after tokenization
print(Corpus["tweet"][0])

['!', '!', '!', 'rt', '@', 'mayasolovely', ':', 'as', 'a', 'woman', 'you', 'should', "n't", 'complain', 'about', 'cleaning', 'up', 'your', 'house', '.', '&', 'amp', ';', 'as', 'a', 'man', 'you', 'should', 'always', 'take', 'the', 'trash', 'out', '...']
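
To make the idea of a token concrete without relying on NLTK's downloaded models, here is a rough regular-expression tokenizer. It is only an approximation for illustration: unlike word_tokenize, it splits contractions at the apostrophe and breaks "..." into separate dots. The function name simple_tokenize is made up for this sketch.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("RT @user: don't stop!")
print(tokens)
```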


In [7]:
# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
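
The lookup above keys on the first letter of each Penn Treebank tag returned by pos_tag. The sketch below reproduces it with plain strings so that it runs without the WordNet corpus. It assumes the WordNet constants are the single letters 'n', 'a', 'v' and 'r', which is how NLTK defines wn.NOUN, wn.ADJ, wn.VERB and wn.ADV.

```python
from collections import defaultdict

# String stand-ins for wn.NOUN, wn.ADJ, wn.VERB, wn.ADV
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

# Any tag letter without an explicit entry falls back to NOUN
toy_tag_map = defaultdict(lambda: NOUN)
toy_tag_map['J'] = ADJ
toy_tag_map['V'] = VERB
toy_tag_map['R'] = ADV

# Penn Treebank tags: VBG (gerund), JJ (adjective), RB (adverb), NNS (plural noun)
for tag in ['VBG', 'JJ', 'RB', 'NNS', '.']:
    print(tag, '->', toy_tag_map[tag[0]])
```

Because the default is NOUN, punctuation tags and any other unlisted tags are treated as nouns.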

##### Word Stemming/Lemmatization: 
The aim of both processes is the same, reducing the inflectional forms of each word into a common base or root. Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
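
The difference can be sketched with two toy functions. Neither is NLTK's implementation: toy_stem strips a few suffixes blindly, while toy_lemmatize looks up a (word, part-of-speech) pair in a small hand-written table, so the same surface form can map to different base forms.

```python
SUFFIXES = ('ing', 'ed', 's')

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hand-written lemma table keyed by (word, part of speech); illustrative only
LEMMAS = {
    ('meeting', 'v'): 'meet',
    ('meeting', 'n'): 'meeting',
    ('ran', 'v'): 'run',
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the base form for a (word, pos) pair, or the word unchanged."""
    return LEMMAS.get((word, pos), word)

print(toy_stem('meeting'))            # same result whatever the part of speech
print(toy_lemmatize('meeting', 'n'))  # noun reading is kept as is
print(toy_lemmatize('meeting', 'v'))  # verb reading is reduced
print(toy_stem('ran'), toy_lemmatize('ran', 'v'))
```

The irregular form "ran" shows the other side of the trade-off: suffix stripping cannot reach "run", while a dictionary lookup can.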

The code below is the preprocessing step for this project. It expects a Pandas DataFrame called "Corpus" whose "tweet" column holds the lower-cased, tokenized tweets produced by the previous steps.

The code performs the following tasks:

1) Iterates over each tweet in the "tweet" column of the DataFrame using an enumerated loop.

2) For each tweet, initializes an empty list called "Final_words".

3) Uses the NLTK pos_tag function to assign a part-of-speech tag (noun, verb, etc.) to each token in the tweet.

4) Checks that the word is not a stopword (common words like "a", "the", "and", etc.) and that it contains only alphabetic characters.

5) If both conditions are true, lemmatizes the word using WordNetLemmatizer and the appropriate part-of-speech tag.

6) Stores the final processed list of words for each tweet as a string in a new column called "text_final" in the same DataFrame.

This preprocessing step standardizes the text data by removing uninformative tokens and reducing words to their base form (lemmatization), so that related forms are grouped together (e.g., "running" and "ran" both become "run" when tagged as verbs). This gives the SVM model a smaller, more consistent vocabulary to learn from.

In [8]:
# Build the stop word set and the lemmatizer once, rather than on every iteration
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()
for index,entry in enumerate(Corpus['tweet']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Keep only alphabetic words that are not stop words
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus.loc[index,'text_final'] = str(Final_words)

View the text before and after stop word removal and lemmatization

In [9]:
print(Corpus["tweet"][0])
print(Corpus["text_final"][0])

['!', '!', '!', 'rt', '@', 'mayasolovely', ':', 'as', 'a', 'woman', 'you', 'should', "n't", 'complain', 'about', 'cleaning', 'up', 'your', 'house', '.', '&', 'amp', ';', 'as', 'a', 'man', 'you', 'should', 'always', 'take', 'the', 'trash', 'out', '...']
['rt', 'mayasolovely', 'woman', 'complain', 'clean', 'house', 'amp', 'man', 'always', 'take', 'trash']
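
Note that text_final holds the string form of a Python list, brackets and quotes included. This still works downstream because TfidfVectorizer's default analyzer keeps only word tokens of two or more characters, so the punctuation is discarded. The sketch below checks that on one row. Joining the words with spaces is shown as an alternative that yields the same tokens; it is a suggestion of this example, not what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['rt', 'mayasolovely', 'woman', 'complain', 'clean', 'house']

as_list_string = str(words)      # "['rt', 'mayasolovely', ...]"
as_plain_text = ' '.join(words)  # "rt mayasolovely ..."

# The analyzer is the preprocessing + tokenizing function the vectorizer applies
analyzer = TfidfVectorizer().build_analyzer()

print(analyzer(as_list_string))
print(analyzer(as_plain_text))
```

One side effect of the default analyzer is that single-letter words are dropped in either form.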


##### Prepare Train and Test Data sets
The Corpus will be split into two data sets, Training and Test. The training data set will be used to fit the model and the predictions will be performed on the test data set. This can be done through the train_test_split function from the sklearn library. The Training Data will have 80% of the corpus and Test data will have the remaining 20% as we have set the parameter test_size=0.2.

In [10]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['class'],test_size=0.2)

In [11]:
# Label-based lookup: this raises a KeyError if row 10 ended up in the test split
print(Train_X[10])
print(Train_Y[10])

['keeks', 'bitch', 'curve', 'everyone', 'lol', 'walk', 'conversation', 'like', 'smh']
1
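
The split is random, so results change from run to run unless a seed is fixed. Below is a minimal sketch on made-up data, assuming a repeatable split with equal class proportions in both parts is wanted. The random_state and stratify arguments are additions for illustration and are not used in the notebook's own call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

texts = pd.Series([f"tweet {i}" for i in range(10)])
labels = pd.Series([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.2,      # 20% of rows go to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=labels,    # keep class proportions the same in both parts
)

print(len(X_train), len(X_test))
# The split keeps the original index labels, so use .iloc for positional access
print(X_train.iloc[0], y_train.iloc[0])
```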


##### Word Vectorization
It is a general process of turning a collection of text documents into numerical feature vectors. There are many methods to convert text data to vectors which the model can understand, and one of the most widely used is TF-IDF. This is an acronym that stands for “Term Frequency-Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

1) Term Frequency: This summarizes how often a given word appears within a document.

2) Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
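
For readers who do want the arithmetic, the sketch below computes TF-IDF by hand for a made-up three-document corpus. It follows the smoothed formula that scikit-learn's TfidfVectorizer uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The helper names are invented for this example.

```python
import math
from collections import Counter

docs = [
    ['trash', 'house', 'trash'],
    ['house', 'clean'],
    ['house', 'woman'],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def idf(term: str) -> float:
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc: list[str]) -> dict[str, float]:
    """Raw term count times idf, scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf(docs[0])
print(scores)
```

In the first document, "trash" outweighs "house" both because it occurs twice and because "house" appears in every document, which pushes its idf down to the minimum of 1.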

The following syntax can be used to first fit the TF-IDF model on the whole corpus. This will help TF-IDF build a vocabulary of words which it has learned from the corpus data and it will assign a unique integer number to each of these words. There will be a maximum of 5000 unique words/features as we have set parameter max_features=5000.

Finally we will transform Train_X and Test_X to vectorized Train_X_Tfidf and Test_X_Tfidf. Each row of these will now contain the integer indices of the words in that tweet and their associated importance as calculated by TF-IDF.

In [12]:
# Vectorize the texts
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

You can use the syntax below to see the vocabulary that it has learned from the corpus.

In [31]:
print(Tfidf_vect.vocabulary_)
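
Below is a small sketch of the fit/transform pattern on made-up documents. Here the vectorizer is fitted on the training documents only, which is a common alternative to fitting on the whole corpus because it keeps test-set statistics out of training. Treat that choice as an assumption of this example rather than what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['clean house', 'take trash out', 'clean trash']
test_docs = ['clean garden']  # 'garden' was never seen during fit

vect = TfidfVectorizer(max_features=5000)
vect.fit(train_docs)

train_matrix = vect.transform(train_docs)
test_matrix = vect.transform(test_docs)

# vocabulary_ maps each learned word to its column index
print(vect.vocabulary_)
# Sparse matrices: one row per document, one column per vocabulary word
print(train_matrix.shape, test_matrix.shape)
# Unseen words are ignored, so only 'clean' contributes to the test row
print(test_matrix.nnz)
```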



#### Training and Testing Data

##### Train an SVM (Support Vector Machine) model with training data

In [192]:
# Initialize an SVM model
SVM = svm.SVC(C=1.0, kernel='linear')

In [193]:
# Train with training data
history = SVM.fit(Train_X_Tfidf,Train_Y)
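
To show what fitting and predicting look like end to end, here is a toy run on four made-up sentences with two classes. It chains the vectorizer and the classifier with scikit-learn's make_pipeline, which is a convenience not used in the notebook. The texts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ['good nice happy', 'nice good day', 'bad awful sad', 'awful bad day']
labels = [1, 1, 0, 0]

# The pipeline applies TF-IDF, then a linear-kernel SVM with C=1.0
toy_model = make_pipeline(TfidfVectorizer(), SVC(C=1.0, kernel='linear'))
toy_model.fit(texts, labels)

toy_predictions = toy_model.predict(['good nice', 'bad awful'])
print(toy_predictions)
```

A pipeline also has the practical benefit that raw text goes in and labels come out, so the vectorizer and the model cannot be mismatched at prediction time.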

##### Test with test data

In [194]:
# predict the labels on the test dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)

In [195]:
# Use accuracy_score function to get the accuracy (true labels first, then predictions)
print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)

SVM Accuracy Score ->  91.10349001412145
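
Accuracy alone can hide weak performance on a rare class. The sketch below uses made-up labels (not the notebook's predictions) to show how a confusion matrix and per-class recall expose a class that is never predicted, even when overall accuracy looks reasonable.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented example: class 0 is rare and is always mistaken for class 1
y_true = [1, 1, 1, 1, 0, 2, 1, 1, 2, 0]
y_pred = [1, 1, 1, 1, 1, 2, 1, 1, 2, 1]

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
# Recall per class: share of each true class that was predicted correctly
per_class_recall = recall_score(y_true, y_pred, labels=[0, 1, 2], average=None)

print(acc)
print(cm)
print(per_class_recall)
```

Applied to this project, the same calls would take Test_Y and predictions_SVM in place of the invented lists.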


#### Save and Load Model

In [171]:
# We will use the pickle library to save and load the model
import pickle

In [172]:
# save model as file 'svm_model.pkl'
with open('svm_model.pkl', 'wb') as file:
    pickle.dump(SVM, file)


In [173]:
# save vectorizer as file 'vectorizer.pkl'
with open('vectorizer.pkl', 'wb') as file:
    pickle.dump(Tfidf_vect, file)

#### Detect Hate Speech/Offensive Speech using loaded model and vectorizer

In [174]:
# Load model
with open('svm_model.pkl', 'rb') as file:
    # Call load method to deserialize
    model = pickle.load(file)

In [175]:
# Load Vectorizer
with open('vectorizer.pkl', 'rb') as file:
    # Call load method to deserialize
    vectorizer = pickle.load(file)
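
The save and load steps form a round trip. The sketch below demonstrates it with a plain dictionary standing in for the trained objects, written to a temporary directory. Bundling the model and the vectorizer in one file is a suggestion of this example, so that the pair cannot drift apart. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted SVM and TF-IDF vectorizer
bundle = {
    'model': 'fitted-model-placeholder',
    'vectorizer': 'fitted-vectorizer-placeholder',
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'bundle.pkl')

    # Serialize to disk in binary mode
    with open(path, 'wb') as file:
        pickle.dump(bundle, file)

    # Deserialize back into a new object
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == bundle)
```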

In [176]:
def detect(text:str):
    labels = ["Hate Speech","Offensive","Neutral"]
    # convert to lower case
    text = text.lower()
    # tokenize
    text = word_tokenize(text)
    # WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(text):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # Wrap the processed words in a one-element list, in the same string form used for training
    text_final = [str(Final_words)]
    vectorized_text = vectorizer.transform(text_final)
    predictions = model.predict(vectorized_text)
    print(f"Category of given text is '{labels[predictions[0]]}' ")


In [191]:
detect("Womens are the power of our country")

Category of given text is 'Neutral' 
