### **Problem Statement**
The dataset contains tweets, each labelled with a positive or negative sentiment.

Cyberbullying and hate speech have been a menace for a long time, so our objective is to detect tweets associated with negative sentiment. In this dataset, a tweet is classified as hate speech if it contains racist or sexist content.

Our task is therefore to separate racist and sexist tweets from all other tweets and filter them out.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
url = 'https://raw.githubusercontent.com/AmbujaBudakoti27/Sentiment-Analysis/main/twitter_training_dataset.csv'
df = pd.read_csv(url, names=["id", "label", "tweet"])

In [3]:
df.head(5)

| | id | label | tweet |
|---|---|---|---|
| 0 | id | label | tweet |
| 1 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 2 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 3 | 3 | 0 | bihday your majesty |
| 4 | 4 | 0 | #model i love u take with u all the time in ... |


### **Dataset Description**

The data is in CSV format.
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain text.
Each line of the file is a data record, and each record consists of one or more fields separated by commas.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes that the tweet is racist/sexist and label ‘0’ denotes that it is not, our objective is to predict the labels on the given test dataset.

### Attribute Information


*   id : The id associated with each tweet in the dataset
*   tweet : The text of the tweet, collected from various sources and carrying either positive or negative sentiment
*   label : A tweet with label '0' has positive sentiment (not racist/sexist), while a tweet with label '1' has negative sentiment (racist/sexist)







In [4]:
df.drop('id',axis='columns',inplace=True)

In [5]:
df.shape

(31963, 2)

In [6]:
df

| | label | tweet |
|---|---|---|
| 0 | label | tweet |
| 1 | 0 | @user when a father is dysfunctional and is s... |
| 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 3 | 0 | bihday your majesty |
| 4 | 0 | #model i love u take with u all the time in ... |
| ... | ... | ... |
| 31958 | 0 | ate @user isz that youuu?... |
| 31959 | 0 | to see nina turner on the airwaves trying to... |
| 31960 | 0 | listening to sad songs on a monday morning otw... |
| 31961 | 1 | @user #sikh #temple vandalised in in #calgary,... |


In [7]:
df['label'].value_counts()

0        29720
1         2242
label        1
Name: label, dtype: int64
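The single `label` entry in the counts above indicates that the CSV's own header line was read as a data row, which happens when `names=` is passed to `pd.read_csv` without `header=0`. A minimal sketch of the difference, using a small in-memory stand-in for the real file (the `csv_text` contents are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV, including its header line.
csv_text = "id,label,tweet\n1,0,bihday your majesty\n2,1,some offensive text\n"
columns = ["id", "label", "tweet"]

# names= alone: the header line is kept as the first data row.
with_stray = pd.read_csv(io.StringIO(csv_text), names=columns)

# names= plus header=0: the header line is replaced by the given names.
clean = pd.read_csv(io.StringIO(csv_text), names=columns, header=0)

print(with_stray.shape, clean.shape)
print(clean["label"].dtype)
```

With the header row gone, `label` is parsed as an integer column rather than as strings, so later steps do not have to work around a third "class".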

In [8]:
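import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))

def clean_tweet(text):
  m = re.sub('[^a-zA-Z]', ' ', text)  # keep letters only
  m = re.sub('user', '', m)           # drop the anonymised @user handle
  m = m.lower().split()
  m = [ps.stem(word) for word in m if word not in stop_words]
  return ' '.join(m)

# Assign the whole column at once instead of chained indexing (df["tweet"][i] = m).
df["tweet"] = df["tweet"].apply(clean_tweet)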

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
df

| | label | tweet |
|---|---|---|
| 0 | label | tweet |
| 1 | 0 | father dysfunct selfish drag kid dysfunct run |
| 2 | 0 | thank lyft credit use caus offer wheelchair va... |
| 3 | 0 | bihday majesti |
| 4 | 0 | model love u take u time ur |
| ... | ... | ... |
| 31958 | 0 | ate isz youuu |
| 31959 | 0 | see nina turner airwav tri wrap mantl genuin h... |
| 31960 | 0 | listen sad song monday morn otw work sad |
| 31961 | 1 | sikh templ vandalis calgari wso condemn act |


The label counts shown earlier (29,720 tweets with label 0 versus 2,242 with label 1) show that the data is heavily imbalanced.

In [10]:
tweets = df['tweet']
tweets

0                                                    tweet
1            father dysfunct selfish drag kid dysfunct run
2        thank lyft credit use caus offer wheelchair va...
3                                           bihday majesti
4                              model love u take u time ur
                               ...                        
31958                                        ate isz youuu
31959    see nina turner airwav tri wrap mantl genuin h...
31960             listen sad song monday morn otw work sad
31961          sikh templ vandalis calgari wso condemn act
31962                                         thank follow
Name: tweet, Length: 31963, dtype: object

In [11]:
y = pd.get_dummies(df['label'])
y = y.iloc[:,1].values
y

array([0, 0, 0, ..., 0, 1, 0], dtype=uint8)

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 2500)
X = cv.fit_transform(df["tweet"]).toarray()

In [13]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
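The matrix above is a bag-of-words encoding: each row is a tweet, each column is a vocabulary term, and each cell is how often that term occurs in that tweet. With `max_features = 2500`, only the 2500 most frequent terms are kept as columns, so most cells are zero. A small sketch on two toy "cleaned tweets" (the `docs` list and the `toy_` names are illustrative, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents standing in for cleaned tweets.
docs = ["thank follow", "listen sad song sad"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(docs).toarray()

# Column order follows the alphabetically sorted vocabulary.
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_counts)
```

Note that "sad" appears twice in the second document, so its cell holds a count of 2 rather than a 0/1 flag.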

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0,  stratify=y)
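
The `stratify=y` argument matters for an imbalanced dataset like this one: it keeps the share of each class the same in the training and test splits. A minimal sketch on toy data with a 90/10 class ratio (the `_toy` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 examples of class 0 and 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both splits keep the 10% share of class 1.
print(y_tr_toy.mean(), y_te_toy.mean())
```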

In [15]:
from sklearn.naive_bayes import MultinomialNB
sentiment_detect_model_v1 = MultinomialNB().fit(X_train, y_train)
y_pred = sentiment_detect_model_v1.predict(X_test)

from sklearn.metrics import accuracy_score, f1_score
accuracy_v1 = accuracy_score(y_test, y_pred)
f1_v1 = f1_score(y_test, y_pred)

from sklearn.metrics import confusion_matrix
conf_v1 = confusion_matrix(y_test, y_pred)

In [16]:
accuracy_v1, f1_v1

(0.9424370405130612, 0.6229508196721312)

In [17]:
from sklearn.metrics import classification_report
classificationr=classification_report(y_test, y_pred)
print(classificationr)

              precision    recall  f1-score   support

           0       0.98      0.96      0.97      5945
           1       0.58      0.68      0.62       448

    accuracy                           0.94      6393
   macro avg       0.78      0.82      0.80      6393
weighted avg       0.95      0.94      0.94      6393



Here we see that the F1 score for class 1 (negative tweets) is only 0.62, far below the 0.97 for class 0. This is because the dataset is highly imbalanced.
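
For reference, the F1 score is the harmonic mean of precision $P$ and recall $R$. Plugging in the rounded class-1 values from the report gives a figure in line with the reported 0.62 (the small difference comes from rounding the inputs):

$$
F_1 = \frac{2PR}{P + R} \approx \frac{2 \times 0.58 \times 0.68}{0.58 + 0.68} \approx 0.63
$$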

Why use F1 score instead of accuracy?

The class counts shown earlier make clear how imbalanced the dataset is: tweets with label 0 (positive sentiment) far outnumber tweets with label 1 (negative sentiment).
If accuracy is the evaluation metric, a model can score highly simply by predicting label 0 for almost every tweet while misclassifying many of the label 1 tweets. The F1 score combines precision and recall for the minority class, so it exposes these errors.
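
The gap between the two metrics can be checked with a hypothetical baseline that predicts label 0 for every tweet. The sketch below rebuilds only the test-set class sizes from the classification report (5945 and 448). It does not use the actual tweets or the trained model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Test-set class sizes taken from the classification report above.
y_true = np.array([0] * 5945 + [1] * 448)

# Hypothetical baseline: predict label 0 for every tweet.
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
# zero_division=0 avoids a warning when no tweet is predicted as class 1.
baseline_f1 = f1_score(y_true, y_baseline, zero_division=0)

print(f"accuracy={baseline_accuracy:.3f}, f1={baseline_f1:.3f}")
```

This baseline reaches roughly 93% accuracy, close to the 94% of the Naive Bayes model, yet its F1 score for class 1 is 0 because it never flags a single hateful tweet.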