#### The objective of this task is to detect hate speech in tweets. For simplicity, a tweet is said to contain hate speech if it expresses racist or sexist sentiment. The task is therefore to distinguish racist or sexist tweets from all other tweets.

#### Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

In [1]:
import pandas as pd
import numpy as np 
import nltk
import re
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
tweet_train = pd.read_csv("twitter_reviews.csv",index_col = "id")

In [3]:
tweet_train.head()

id,label,tweet
1,0,@user when a father is dysfunctional and is s...
2,0,@user @user thanks for #lyft credit i can't us...
3,0,bihday your majesty
4,0,#model i love u take with u all the time in ...
5,0,factsguide: society now #motivation


In [4]:
tweet_train.shape

(31962, 2)

In [5]:
tweet_train["label"].value_counts()/len(tweet_train)

0    0.929854
1    0.070146
Name: label, dtype: float64

#### This dataset is imbalanced: about 93% of the tweets have label 0 and only about 7% have label 1.
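
#### With proportions like these, plain accuracy says little: a hypothetical baseline that always predicts label 0 would already be right about 93% of the time while finding no hate speech at all. A small sketch of that arithmetic, using counts reported in this notebook's outputs (the always-0 baseline is an illustration, not part of this notebook's model):

```python
# Counts taken from the outputs in this notebook
n_total = 31962                    # tweet_train.shape[0]
n_majority = 29720                 # rows with label 0 (0.929854 * 31962)
n_minority = n_total - n_majority  # rows with label 1

# Hypothetical baseline: predict label 0 for every tweet
baseline_accuracy = n_majority / n_total  # every label-0 tweet counts as correct
baseline_recall_1 = 0 / n_minority        # no label-1 tweet is ever found

print(f"minority share:    {n_minority / n_total:.6f}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"baseline recall 1: {baseline_recall_1:.1f}")
```

#### This is why the per-class precision and recall are worth reading alongside the overall accuracy.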

In [6]:
df_mazority = tweet_train[tweet_train["label"] ==0]
df_mazority.shape

(29720, 2)

#### To balance the dataset, I oversample the minority label (sampling with replacement) until it has as many rows as the majority label.

In [7]:
from sklearn.utils import resample
df = resample(tweet_train[tweet_train.label == 1],
              replace = True,                  # sample with replacement
              n_samples = len(df_mazority),    # match the size of the majority label
              random_state = 2)

In [8]:
df.shape

(29720, 2)

In [9]:
balanced_df=  pd.concat([df_mazority,df])

In [10]:
balanced_df["label"].value_counts()

1    29720
0    29720
Name: label, dtype: int64

In [11]:
len(balanced_df)

59440

In [12]:
Y = balanced_df["label"] # dependent var
X = balanced_df["tweet"] # independent var

# train-test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size = 0.3,random_state = 42)
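
#### A note on the order of operations: in the cells above, the minority label is oversampled before the train-test split, so copies of the same tweet can land in both x_train and x_test, which tends to make test scores look better than they would on unseen tweets. A minimal sketch of the alternative order (split first, then oversample only the training part), on hypothetical toy data rather than twitter_reviews.csv; the names toy, train, test and train_balanced are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data standing in for tweet_train (4 minority rows, 16 majority rows)
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(20)],
    "label": [1] * 4 + [0] * 16,
})

# 1) Split first; stratify keeps the original label ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=42
)

# 2) Oversample the minority label inside the training part only.
majority = train[train["label"] == 0]
minority = train[train["label"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=2
)
train_balanced = pd.concat([majority, minority_up])

# 3) No tweet text is shared between the balanced training part and the test part.
overlap = set(train_balanced["tweet"]) & set(test["tweet"])
print(train_balanced["label"].value_counts().to_dict())
print(len(test), len(overlap))
```

#### With this order the test part keeps the original class ratio, so its scores are not directly comparable to the ones reported below.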

In [13]:
from nltk.corpus import stopwords # English stopword list
from nltk.stem import PorterStemmer # Porter stemming algorithm
ps = PorterStemmer() # create the stemmer
corpus = []

In [14]:
# Remove @mentions, non-letter characters and stopwords, then stem each remaining word
def stop_stem_text(tweet):
    stop_words = set(stopwords.words('english'))  # build the set once per call, not once per word
    tweets = " ".join(filter(lambda x: x[0] != '@',tweet.split()))  # drop @mentions
    tweets = re.sub('[^a-zA-Z]',' ',tweets)  # replace every non-letter with a space
    tweets = tweets.lower()
    tweets = tweets.split()
    tweets = [ps.stem(word) for word in tweets if word not in stop_words]
    tweets = " ".join(tweets)
    return tweets

In [15]:
trn_df = x_train.apply(stop_stem_text) # apply the cleaning function defined above to the training data

In [16]:
trn_df.head()

id
31653    six scientist six suicid secret die protect go...
29104    queue basket food shop guy say give us smile t...
20554    opinion rife lgbt commun gay peopl demand equa...
20433                                inboxzero mam weekend
13819    fear root racism mostli fear ego acknowledg an...
Name: tweet, dtype: object
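
#### To make the cleaning steps in stop_stem_text easier to follow, here is a small sketch of its first steps (mention removal, non-letter replacement, lowercasing, splitting) on a made-up tweet. The sample text is hypothetical, and stopword removal and stemming are left out so the snippet needs only the standard library:

```python
import re

sample = "@user Thanks for the #lyft credit, can't use it!"  # hypothetical tweet

# Drop every whitespace-separated token that starts with '@'
no_mentions = " ".join(w for w in sample.split() if w[0] != "@")

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub("[^a-zA-Z]", " ", no_mentions)

# Lowercase and split on whitespace
tokens = letters_only.lower().split()

print(tokens)
# ['thanks', 'for', 'the', 'lyft', 'credit', 'can', 't', 'use', 'it']
```

#### The apostrophe in "can't" becomes a space, so the word is split into "can" and "t", and hashtags lose the # but keep their text.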

In [17]:
# Convert the cleaned tweets into a bag-of-words count matrix (sparse by default).
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 30000) # keep at most the 30000 most frequent words to limit memory use

In [18]:
trn_df1 = cv.fit_transform(trn_df).toarray()

In [19]:
len(trn_df1)

41608
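
#### As an illustration of what the bag-of-words step produces, here is a minimal sketch on three made-up, already stemmed documents (docs, labels and toy_cv are hypothetical names, not part of this notebook). It also shows that MultinomialNB can be fitted on the sparse matrix directly, so the .toarray() conversion is a choice rather than a requirement:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["love thi day", "hate thi day", "love love"]  # hypothetical stemmed tweets
labels = [0, 1, 0]                                     # hypothetical labels

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # SciPy sparse matrix, one row per document

# Column order follows the indices stored in vocabulary_ (alphabetical)
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)             # ['day', 'hate', 'love', 'thi']
print(counts.toarray())  # rows: [1 0 1 1], [1 1 0 1], [0 0 2 0]

# Fit once on the sparse matrix and once on the dense array
nb_sparse = MultinomialNB().fit(counts, labels)
nb_dense = MultinomialNB().fit(counts.toarray(), labels)

# Both models give the same predictions
same = (nb_sparse.predict(counts) == nb_dense.predict(counts.toarray())).all()
print(same)
```

#### Keeping the matrix sparse stores only the non-zero counts, which matters when most of the columns are zero for any given tweet.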

In [20]:
#Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
model = nb.fit(trn_df1,y_train)

In [21]:
test_df = x_test.apply(stop_stem_text) # apply the same cleaning function to the test data

In [22]:
test_df.head()

id
21961                 real waspi sborn abject povey togeth
21182                  shit happen life goe good bless day
20502    end smoke reefer go crazi listen negro music e...
18719    mani delici burger amp entre select one hardes...
14897                                wanna see shohairclub
Name: tweet, dtype: object

In [23]:
test_df1 = cv.transform(test_df).toarray()

In [24]:
test_df1

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [25]:
pred = model.predict(test_df1)

In [26]:
from sklearn.metrics import classification_report

print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.97      0.92      0.94      8886
           1       0.92      0.97      0.95      8946

    accuracy                           0.94     17832
   macro avg       0.95      0.94      0.94     17832
weighted avg       0.95      0.94      0.94     17832



In [27]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,pred))

[[8150  736]
 [ 253 8693]]
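
#### The figures in the classification report can be recomputed by hand from this confusion matrix. A short sketch of that arithmetic, assuming scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 8150, 736   # true label 0: predicted 0, predicted 1
fn, tp = 253, 8693   # true label 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)  # of the tweets flagged as 1, how many are really 1
recall_1 = tp / (tp + fn)     # of the real label-1 tweets, how many were flagged
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"label 0: precision {precision_0:.2f}, recall {recall_0:.2f}")
print(f"label 1: precision {precision_1:.2f}, recall {recall_1:.2f}")
print(f"accuracy: {accuracy:.4f}")
```

#### The 736 false positives are ordinary tweets flagged as hate speech, and the 253 false negatives are racist/sexist tweets that were missed.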


In [29]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,pred))

0.9445379093764019
