## Hate Speech Recognition

**Problem Statement**

The goal is to develop an automated hate speech detection system for online platforms that identifies and filters hate speech in user-generated content. The system should accurately distinguish between hate speech, offensive language, and benign content, contributing to a safer and more inclusive online environment.

In [2]:
# Importing libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Importing data
data = pd.read_csv('twitter.csv')
data.head(25)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
5,5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just..."
6,6,3,0,3,0,1,"!!!!!!""@__BrighterDays: I can not just sit up ..."
7,7,3,0,3,0,1,!!!!&#8220;@selfiequeenbri: cause I'm tired of...
8,8,3,0,3,0,1,""" &amp; you might not get ya bitch back &amp; ..."
9,9,3,1,2,0,1,""" @rhythmixx_ :hobbies include: fighting Maria..."


In [6]:
# Labelling the data: map the numeric class codes to readable names

data['labels'] = data['class'].map({0: "Hate Speech", 1: "Offensive Language", 2: "No hate and offensive language"})
print(data.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  \
0  !!! RT @mayasolovely: As a woman you shouldn't...   
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...   
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...   
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...   
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...   

                           labels  
0  No hate and offensive language  
1              Offensive Language  
2              Offensive Language  
3              Offensive Language  
4              Offensive Language  


In [7]:
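# Keep only the text and label columns; copy so the later column assignment
# does not trigger pandas' SettingWithCopyWarning
data = data[['tweet', 'labels']].copy()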
print(data.head())

                                               tweet  \
0  !!! RT @mayasolovely: As a woman you shouldn't...   
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...   
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...   
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...   
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...   

                           labels  
0  No hate and offensive language  
1              Offensive Language  
2              Offensive Language  
3              Offensive Language  
4              Offensive Language  


## NLP

In [9]:
import re
import string

import nltk
from nltk.corpus import stopwords

# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing
stemmer = nltk.SnowballStemmer('english')
stopword = set(stopwords.words('english'))

In [12]:
def cleantext(text):
    """Lowercase, strip noise, drop stop words, and stem a single tweet."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    # Drop stop words, then stem what remains
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

data["tweet"] = data["tweet"].apply(cleantext)
data.head(50)
    

Unnamed: 0,tweet,labels
0,rt mayasolov woman shouldnt complain clean ho...,No hate and offensive language
1,rt boy dat coldtyga dwn bad cuffin dat hoe ...,Offensive Language
2,rt urkindofbrand dawg rt ever fuck bitch sta...,Offensive Language
3,rt cganderson vivaba look like tranni,Offensive Language
4,rt shenikarobert shit hear might true might f...,Offensive Language
5,tmadisonx shit blow meclaim faith somebodi sti...,Offensive Language
6,brighterday sit hate anoth bitch got much shi...,Offensive Language
7,caus im tire big bitch come us skinni,Offensive Language
8,amp might get ya bitch back amp,Offensive Language
9,rhythmixx hobbi includ fight mariambitch,Offensive Language
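
The regex steps of `cleantext` can be tried in isolation to see why the output above looks the way it does. The following is a minimal sketch using only the standard library; the helper name `strip_noise` and the sample strings are illustrative assumptions, and the stop word and stemming steps are deliberately left out.

```python
import re
import string

def strip_noise(text):
    """Apply only the regex steps of cleantext (no stop words, no stemming)."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    return text

# Hypothetical input: the handle contains digits, so the whole token is dropped.
sample = "!!! RT @user123: Check https://example.com NOW [ad]"
cleaned = strip_noise(sample)
print(cleaned.split())

# Punctuation is deleted rather than replaced by a space, so words can merge.
merged = strip_noise("cold...tyga")
print(merged)
```

The second call reproduces the `coldtyga` token visible in row 1 of the output. Replacing punctuation with a space instead of an empty string would keep such words apart, at the cost of changing the vocabulary.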


In [14]:
## Dividing data

x = np.array(data['tweet'])
y = np.array(data['labels'])

In [16]:
cv = CountVectorizer()
X = cv.fit_transform(x)  # Learn the vocabulary and build the bag-of-words matrix
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
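
The cell above fits the vectorizer on all tweets before splitting, so the vocabulary also includes words that occur only in test rows. One possible alternative, sketched here on a tiny made-up dataset, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The toy `texts` and `labels` are assumptions for illustration, not the notebook's data, and this is not the method used for the score reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the cleaned tweets and their labels (hypothetical data).
texts = ["good morning friend", "nice day today", "awful rude insult",
         "rude nasty insult", "lovely sunny morning", "nasty awful words",
         "friend nice day", "insult rude nasty"]
labels = ["neither", "neither", "offensive", "offensive",
          "neither", "offensive", "neither", "offensive"]

# Split the raw text first, keeping the class ratio in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits the vectorizer on the training text only.
model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=42))
model.fit(text_train, y_train)

vocab = model.named_steps["countvectorizer"].vocabulary_
score = model.score(text_test, y_test)
print(len(vocab), score)
```

A side benefit of this design is that `model.predict` accepts raw strings directly, so the vectorizer and classifier cannot get out of sync.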



In [18]:
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.8759919300605246

### We get about 87.6% accuracy on the held-out test set
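
A single accuracy figure can hide weak performance on a rare class, which may matter here since most of the sample rows shown earlier are labelled "Offensive Language". The sketch below shows how a confusion matrix and per-class report could be produced with scikit-learn. The `y_true` and `y_pred` lists are made-up values standing in for `y_test` and `clf.predict(X_test)`; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

label_names = ["Hate Speech", "Offensive Language", "No hate and offensive language"]

# Hypothetical true and predicted labels (not taken from the real model).
y_true = (["Offensive Language"] * 6
          + ["No hate and offensive language"] * 3
          + ["Hate Speech"])
y_pred = (["Offensive Language"] * 6
          + ["No hate and offensive language"] * 3
          + ["Offensive Language"])

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, both in label_names order.
cm = confusion_matrix(y_true, y_pred, labels=label_names)

print(accuracy)
print(cm)
print(classification_report(y_true, y_pred, labels=label_names, zero_division=0))
```

In this made-up case the accuracy is high even though the single "Hate Speech" example is misclassified, which is the kind of gap the per-class recall column makes visible.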

In [1]:
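def hate_recognition():
    import streamlit as st
    st.title("Hate Speech Detection")
    user = st.text_area("Enter your tweet:")
    if len(user) < 1:
        st.write(" ")
    else:
        # Clean the input the same way as the training tweets, then vectorize and classify
        sample = cv.transform([cleantext(user)])
        prediction = clf.predict(sample)
        st.title(prediction[0])

# Call the function at module level; a call indented inside the body would never run
hate_recognition()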
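
A Streamlit app is usually started as a separate script with `streamlit run`, so the fitted `cv` and `clf` objects from the notebook session would not be available there unless they are saved first. One possible approach, sketched below under that assumption, is to serialize both objects together with `pickle`. The toy training data and the dictionary keys are illustrative, and the sketch round-trips through bytes in memory rather than a real file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical training set standing in for the notebook's cleaned tweets.
texts = ["good morning", "nice day", "rude insult", "nasty insult"]
labels = ["neither", "neither", "offensive", "offensive"]

cv = CountVectorizer()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(cv.fit_transform(texts), labels)

# Serialize the vectorizer and classifier together so they stay in sync.
# In practice this would be pickle.dump(...) to a file the app script loads.
blob = pickle.dumps({"vectorizer": cv, "classifier": clf})

# Later, e.g. at the top of the Streamlit script: restore and predict.
restored = pickle.loads(blob)
features = restored["vectorizer"].transform(["rude insult"])
pred = restored["classifier"].predict(features)
print(pred[0])
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used to load should match the one used to save.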