## **Hate speech detection using Machine Learning**

In [None]:
# Import packages
import pandas as pd
import numpy as np
import nltk  # Natural Language Toolkit
import string
import re  # regular expressions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from nltk.corpus import stopwords


In [None]:
# importing the dataset
data = pd.read_csv("twitter.csv")
print(data.head(10))

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   
5           5      3            1                   2        0      1   
6           6      3            0                   3        0      1   
7           7      3            0                   3        0      1   
8           8      3            0                   3        0      1   
9           9      3            1                   2        0      1   

                                               tweet  
0  !!! RT @mayasolovely: As a woman you shouldn't...  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  
2  !!!!!!! RT @UrKindOfBrand Da

nltk.SnowballStemmer("english") creates a stemmer object from the Natural Language Toolkit (NLTK) library for stemming English words.

Stemming reduces words to a base or root form by stripping suffixes. For instance, "running" and "runs" are both reduced to "run". Irregular forms such as "ran" are left unchanged, because a stemmer applies suffix rules rather than looking words up in a dictionary. Stemming helps normalize words for natural language processing (NLP) tasks.
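
As a quick illustration of suffix stripping, the sketch below stems a handful of words with NLTK's Snowball stemmer. The word list is made up for this example and is not taken from the dataset.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Example words chosen for illustration (an assumption, not from twitter.csv)
words = ["running", "runs", "ran", "tweets", "tweeting"]

# Map each word to the stem the Snowball algorithm produces
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Regular inflections such as "running" and "runs" should collapse to the same stem, while an irregular form such as "ran" is expected to come back unchanged, since no suffix rule applies to it.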

In [None]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

In [None]:
# Map the numeric 'class' codes to readable names in a new column called 'label'
data["label"] = data["class"].map({0: "Hate Speech", 1: "Offensive Language", 2: "No Hate and Offensive"})
print(data.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet                  label  
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language  


In [None]:
# Keep only the columns we need ('tweet' and 'label') in a separate DataFrame
data1 = data[['tweet','label']]
print(data1.head())

                                               tweet                  label
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language


The re library in Python provides support for regular expressions, which are a powerful tool for matching patterns in text. Regular expressions allow you to search, match, and manipulate strings based on specific patterns.
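
To make the pattern matching concrete, here is a minimal sketch that uses re.sub to strip a URL, a mention, and a hashtag symbol from one string. The sample tweet and the exact patterns are illustrative assumptions; they are similar to, but not identical to, the ones used later in this notebook.

```python
import re

# Made-up tweet for illustration
tweet = "RT @user: check this out https://example.com #Cool"

# Find all mentions before removing anything
mentions = re.findall(r"@\w+", tweet)

text = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
text = re.sub(r"@\w+", "", text)                    # drop mentions
text = re.sub(r"#(\w+)", r"\1", text)               # drop '#', keep the word
text = re.sub(r"[^A-Za-z\s]", "", text)             # keep letters and whitespace only

# Collapse the runs of whitespace left behind by the removals
cleaned = " ".join(text.split())

print(mentions)
print(cleaned)
```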

Stopwords are common words that are usually filtered out during natural language processing (NLP) tasks because they carry little meaningful information about the content of the text. Examples of stopwords include "and", "the", "is", "in", and "at". Removing stopwords helps to reduce the size of the data and focus on the more important words that are relevant to the analysis.
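
The filtering step can be sketched in a few lines. The stopword set below is a tiny hand-written stand-in (an assumption for this example); NLTK's English list, used later in this notebook, is much longer.

```python
# Tiny illustrative stopword set; NLTK's stopwords.words('english') has many more entries
STOP_WORDS = {"and", "the", "is", "in", "at"}

sentence = "the cat is in the garden and the dog is at the door"

# Keep only the words that are not stopwords
content_words = [word for word in sentence.split() if word not in STOP_WORDS]

print(content_words)
```

Using a set rather than a list keeps each membership check fast, which matters when the filter runs over thousands of tweets.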

In [None]:
from nltk.tokenize import TweetTokenizer

# Initialize the stemmer and tokenizer
stemmer = SnowballStemmer("english")
tokenizer = TweetTokenizer()


Tokenizing is the process of splitting text into individual units, called tokens. These tokens can be words, phrases, symbols, or other meaningful elements. In natural language processing (NLP), tokenizing is a crucial step because it allows the text to be processed and analyzed more easily.
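
As a small illustration, the sketch below runs NLTK's TweetTokenizer on a made-up tweet. The sample text and the second tokenizer's options are assumptions for this example; the notebook itself uses the default settings.

```python
from nltk.tokenize import TweetTokenizer

# Made-up tweet for illustration
sample_tweet = "@user this is so cool!!! #nlp :)"

# Default settings keep handles and hashtags as single tokens
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(sample_tweet)
print(tokens)

# Optional settings: drop @handles and shorten long runs of repeated characters
handle_free_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokens_no_handles = handle_free_tokenizer.tokenize(sample_tweet)
print(tokens_no_handles)
```

Unlike a plain whitespace split, this tokenizer is designed for social media text, so handles and hashtags are expected to survive as single tokens.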

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Clean a single tweet and return the stemmed, stopword-free text
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Strip the '#' from hashtags but keep the word itself
    text = re.sub(r'#(\w+)', r'\1', text)

    # Remove everything except letters and whitespace (digits, punctuation, symbols)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize the text
    words = tokenizer.tokenize(text)

    # Remove stopwords and stem the remaining words
    words = [stemmer.stem(word) for word in words if word not in stop_words]

    # Join the cleaned words back into a single string
    return ' '.join(words)

In [None]:
# Apply the cleaning function to data1, which holds only the necessary columns.
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
data1 = data1.copy()
data1['tweet'] = data1['tweet'].apply(clean_text)


**Training using Decision tree classifier**

CountVectorizer converts text into numerical features for machine learning models. It learns a vocabulary from the input documents and produces a sparse matrix of token counts, with one row per document and one column per vocabulary term. This matrix can be used for NLP tasks such as text classification, clustering, and topic modeling, and parameters such as the n-gram range, stop word list, and minimum document frequency can be adjusted to suit the task.
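
A toy example may help show what the count matrix looks like. The three short documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = ["good day", "bad day", "good good night"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# Recover the column order from the learned vocabulary (term -> column index)
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(counts.toarray())
```

Each row is one document and each column one vocabulary term, so the repeated "good" in the third document shows up as a count of 2 in that column.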

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [None]:
x = np.array(data1['tweet'])
y = np.array(data1['label'])

X = cv.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=123)

clf = DecisionTreeClassifier()
clf.fit(x_train,y_train)
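

The cell above fits the vectorizer on all tweets before splitting, so the vocabulary also includes words that appear only in the test set. A common alternative, sketched here on made-up toy data, is to split first and fit the vectorizer on the training portion only. The texts, labels, and variable names below are assumptions for illustration and not part of this notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data for illustration
texts = ["good day", "nice day", "good night", "nice night",
         "bad day", "awful day", "bad night", "awful night"]
labels = ["pos"] * 4 + ["neg"] * 4

# Split the raw text first, keeping the class balance in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels
)

# Learn the vocabulary from the training texts only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuse the training vocabulary

toy_clf = DecisionTreeClassifier(random_state=123)
toy_clf.fit(X_train, y_train)

test_accuracy = toy_clf.score(X_test, y_test)
print(f"Toy test accuracy: {test_accuracy:.2f}")
```

Words that occur only in the test texts are simply ignored by transform, which mirrors what happens when the model later sees genuinely new tweets.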


In [None]:
# Evaluate on the held-out test set
from sklearn.metrics import accuracy_score, classification_report
# Make predictions
y_pred = clf.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.88

Classification Report:
                       precision    recall  f1-score   support

          Hate Speech       0.34      0.30      0.32       436
No Hate and Offensive       0.82      0.85      0.83      1247
   Offensive Language       0.93      0.93      0.93      5752

             accuracy                           0.88      7435
            macro avg       0.69      0.69      0.69      7435
         weighted avg       0.87      0.88      0.88      7435



In [None]:
# Testing on a new sentence (the sample is vectorized as typed, without clean_text)
sample = "It's been a killing journey"
d = cv.transform([sample]).toarray()
print(clf.predict(d))

['No Hate and Offensive']
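
The test cell passes the raw sentence to cv.transform, while the training tweets went through clean_text first. One way to keep the two consistent is to attach the cleaning step to the vectorizer itself, so it runs on every input. The sketch below shows the idea on made-up toy data; simple_clean is a simplified, hypothetical stand-in for clean_text (no stemming or stopword removal), and the texts and labels are assumptions for illustration.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def simple_clean(text):
    """Simplified, hypothetical stand-in for the notebook's clean_text."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)                   # drop mentions
    text = re.sub(r"[^a-z\s]", "", text)               # keep letters and whitespace
    return text


# Made-up toy data for illustration
train_texts = ["have a nice day", "nice to see you",
               "you are awful", "awful awful person"]
train_labels = ["friendly", "friendly", "unfriendly", "unfriendly"]

# The preprocessor runs on every string, at fit time and at predict time
model = make_pipeline(
    CountVectorizer(preprocessor=simple_clean),
    DecisionTreeClassifier(random_state=0),
)
model.fit(train_texts, train_labels)

prediction = model.predict(["@someone What a NICE day!!!"])[0]
print(prediction)
```

With this setup a new sentence can be passed in as typed, and it is cleaned the same way as the training text before it is counted.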
