In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### **Problem Statement**

The dataset contains tweets, each labelled with a positive or negative sentiment.

Cyberbullying and hate speech have been a menace for a long time, so our objective for this task is to detect tweets associated with negative sentiment. In this dataset, a tweet counts as hate speech if it contains racist or sexist content.

Our task is therefore to separate racist and sexist tweets from other tweets so that they can be filtered out.

In [2]:
url = 'https://raw.githubusercontent.com/AmbujaBudakoti27/Sentiment-Analysis/main/twitter_training_dataset.csv'
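# Note: `names=` is passed without `header=0`, so the file's own header line
# is read as the first data row (visible as row 0 in the output below).
df = pd.read_csv(url, names=["id", "label", "tweet"])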

In [3]:
df.head(5)

|   | id | label | tweet |
|---|----|-------|-------|
| 0 | id | label | tweet |
| 1 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 2 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 3 | 3 | 0 | bihday your majesty |
| 4 | 4 | 0 | #model i love u take with u all the time in ... |


### **Dataset Description**

The data is in CSV format.
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain text.
Each line of the file is a data record, and each record consists of one or more fields separated by commas.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes that the tweet is racist/sexist and label ‘0’ denotes that it is not, our objective is to predict the labels on the given test dataset.

### Attribute Information


*   id : The id associated with each tweet in the dataset
*   tweet : The text of the tweet, collected from various sources and carrying either positive or negative sentiment
*   label : A tweet with label '0' is of positive sentiment, while a tweet with label '1' is of negative sentiment

In [4]:
df.drop('id',axis='columns',inplace=True)

In [5]:
df.shape

(31963, 2)

In [6]:
df['label'].value_counts()

0        29720
1         2242
label        1
Name: label, dtype: int64

**The data is imbalanced: 29720 tweets carry label 0, while only 2242 carry label 1.**

### **Imbalanced Learning**


"The class imbalance problem typically occurs when, in a classification problem, there are many more instances of some classes than others. In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones." https://www3.nd.edu/~dial/publications/chawla2004editorial.pdf

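To make the quoted effect concrete, the sketch below scores a hypothetical baseline that always predicts the majority label, using the class counts printed above (29720 tweets with label 0, 2242 with label 1). The `majority_baseline` helper is illustrative only and is not part of the notebook.

```python
def majority_baseline(n_majority, n_minority):
    """Score a classifier that always predicts the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total      # every majority sample counts as correct
    minority_recall = 0 / n_minority   # no minority sample is ever detected
    return accuracy, minority_recall

# Class counts taken from df['label'].value_counts() above
accuracy, minority_recall = majority_baseline(29720, 2242)
print(f"accuracy={accuracy:.4f}, minority recall={minority_recall:.1f}")
```

The baseline reaches roughly 93% accuracy without finding a single racist/sexist tweet, which is why accuracy alone is a weak yardstick on this dataset.
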
In [7]:
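import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

for i in range(len(df)):
  # Keep letters only, drop the anonymised 'user' handle, and lowercase
  m = re.sub('[^a-zA-Z]', ' ', df.loc[i, "tweet"])
  m = re.sub('user', '', m)
  m = m.lower()
  m = m.split()
  # Remove stop words and stem what remains
  m = [ps.stem(word) for word in m if word not in stop_words]
  m = ' '.join(m)
  # .loc avoids pandas chained-assignment warnings
  df.loc[i, "tweet"] = m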

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
# Class count
df['label'].value_counts()

0        29720
1         2242
label        1
Name: label, dtype: int64

In [9]:
count_pos = 29720
count_neg = 2242

In [10]:
df_pos = df[df['label'] == '0']
df_neg = df[df['label'] == '1']
df_pos.shape

(29720, 2)

## **Mitigating Skewness of Data**

Two common techniques for dealing with class imbalance are:
1. Undersampling
2. Oversampling

imbalanced-learn offers several oversampling techniques. I will use the following two:

*   RandomOverSampler
*   SMOTE (Synthetic Minority Over-Sampling Technique)
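
For contrast with the oversampling approaches used in this notebook, here is a minimal sketch of random undersampling, which discards majority-class rows instead of adding minority-class ones. The `random_undersample` helper and the toy data are assumptions made for illustration; the notebook itself does not undersample.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 8 majority-class tweets (label 0) and 2 minority-class tweets (label 1)
rows = [f"tweet {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
under_rows, under_labels = random_undersample(rows, labels)
print(under_labels.count(0), under_labels.count(1))
```

Applied to this dataset, undersampling would keep only 2242 of the 29720 label-0 tweets, so most of the majority-class text would go unused.
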




### **RandomOverSampler**
Random over-sampling simply repeats samples of the minority class until the number of samples in each class is balanced. The cell below does this with pandas `DataFrame.sample(replace=True)` rather than with imbalanced-learn's `RandomOverSampler` class.
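
The idea can be sketched in plain Python on toy data. The `random_oversample` helper below is hypothetical: it keeps every original row and appends random repeats of the smaller class, whereas the pandas cell in this notebook redraws the whole minority class with replacement. Both end with equal class counts.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Repeat randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy data: 8 majority-class tweets (label 0) and 2 minority-class tweets (label 1)
rows = [f"tweet {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
over_rows, over_labels = random_oversample(rows, labels)
print(Counter(over_labels))
```

No new information is created: the extra rows are exact copies of existing minority-class tweets.
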

In [11]:
# Draw count_pos rows from the minority class (label 1) with replacement,
# then stack them under the majority-class rows
df_neg_over = df_neg.sample(count_pos, replace=True)
df_over = pd.concat([df_pos, df_neg_over], axis=0)


In [12]:
df_neg_over.shape, df_over.shape

((29720, 2), (59440, 2))

In [13]:
df_over

|   | label | tweet |
|---|-------|-------|
| 1 | 0 | father dysfunct selfish drag kid dysfunct run |
| 2 | 0 | thank lyft credit use caus offer wheelchair va... |
| 3 | 0 | bihday majesti |
| 4 | 0 | model love u take u time ur |
| 5 | 0 | factsguid societi motiv |
| ... | ... | ... |
| 24603 | 1 | michel obama ape heel case begin donaldtrump a... |
| 22919 | 1 | think told dictionari |
| 16574 | 1 | hispan amp feel like stomp listen retweet bori... |
| 31692 | 1 | way facebook repoingsystem fail communitystand... |


In [14]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000)
X_r = cv.fit_transform(df_over['tweet']).toarray()

In [15]:
X_r.shape

(59440, 5000)

In [16]:
y_r = pd.get_dummies(df_over['label'])
y_r = y_r.iloc[:,1].values
y_r

array([0, 0, 0, ..., 1, 1, 1], dtype=uint8)

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_r, y_r, test_size=0.2, random_state=0,  stratify=y_r)

In [18]:
from sklearn.naive_bayes import MultinomialNB
sentiment_detect_model_v1 = MultinomialNB().fit(X_train, y_train)
y_pred = sentiment_detect_model_v1.predict(X_test)

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
accuracy_v1 = accuracy_score(y_test, y_pred)
f1_v1 = f1_score(y_test, y_pred)
conf_v1 = confusion_matrix(y_test, y_pred)

In [19]:
accuracy_v1, f1_v1

(0.9310228802153432, 0.931904999169573)

In [20]:
from sklearn.metrics import classification_report
classificationr=classification_report(y_test, y_pred)
print(classificationr)

              precision    recall  f1-score   support

           0       0.94      0.92      0.93      5944
           1       0.92      0.94      0.93      5944

    accuracy                           0.93     11888
   macro avg       0.93      0.93      0.93     11888
weighted avg       0.93      0.93      0.93     11888



In [21]:
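def pred(text):
  """Clean `text` like the training tweets and print the v1 (random oversampling) model's prediction."""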
  m = re.sub('[^a-zA-Z]', ' ', text)
  m = re.sub('user', '', m)
  m = m.lower()
  m = m.split()
  m = [ps.stem(word) for word in m if not word in stopwords.words('english')]
  m =' '.join(m)
  k=[]
  k.append(m)
  x = cv.transform(k).toarray()
  y_pred = sentiment_detect_model_v1.predict(x)
  if y_pred[0]==0:
    print("Positive")
  else:
    print("Negative")

In [22]:
text = "Die in hell"
pred(text)

Negative


In [23]:
text = "The world is a beautiful place"
pred(text)

Positive


### **SMOTE (Synthetic Minority Over-Sampling Technique)**

SMOTE is an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement.

According to the original research paper "SMOTE: Synthetic Minority Over-sampling Technique" (Chawla et al., 2002), "synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbour. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general." In other words, to create a new synthetic sample, SMOTE picks a minority-class sample, finds its k nearest minority-class neighbours, chooses one of them, and places the new sample at a random point on the line segment in feature space between the original sample and that neighbour.
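
The interpolation step described above can be written as x_new = x + u * (x_neighbour - x), with u drawn uniformly between 0 and 1. The sketch below shows only that step on a made-up pair of vectors. The `smote_point` helper is hypothetical, and the nearest-neighbour search that the real algorithm performs is left out.

```python
import random

def smote_point(sample, neighbour, rng):
    """Return a random point on the segment between sample and neighbour."""
    gap = rng.random()  # random number in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)
sample = [1.0, 0.0, 2.0]      # toy minority-class feature vector
neighbour = [3.0, 0.0, 0.0]   # assumed to be one of its nearest minority-class neighbours
synthetic = smote_point(sample, neighbour, rng)
print(synthetic)
```

Each coordinate of the synthetic point lies between the two originals, so on word-count features the synthetic rows generally hold fractional rather than integer counts.
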

In [24]:
y = pd.get_dummies(df['label'])
y = y.iloc[:,1].values
y

array([0, 0, 0, ..., 0, 1, 0], dtype=uint8)

In [25]:
X = cv.fit_transform(df['tweet']).toarray()

In [26]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
# fit_resample is the current name; fit_sample was removed in recent imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(X, y)

y_sm.shape



(59442,)

In [27]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_sm, y_sm, test_size=0.2, random_state=15, stratify=y_sm)

In [28]:
from sklearn.naive_bayes import MultinomialNB
sentiment_detect_model_v2 = MultinomialNB().fit(X_train2, y_train2)
y_pred2 = sentiment_detect_model_v2.predict(X_test2)

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
accuracy_v2 = accuracy_score(y_test2, y_pred2)
f1_v2 = f1_score(y_test2, y_pred2)
conf_v2 = confusion_matrix(y_test2, y_pred2)

In [29]:
from sklearn.metrics import classification_report
classificationr2=classification_report(y_test2, y_pred2)
print(classificationr2)

              precision    recall  f1-score   support

           0       0.94      0.96      0.95      5945
           1       0.96      0.93      0.95      5944

    accuracy                           0.95     11889
   macro avg       0.95      0.95      0.95     11889
weighted avg       0.95      0.95      0.95     11889



SMOTE gives slightly higher accuracy and F1 score than random oversampling (about 0.95 versus 0.93 on their respective test splits). Based on the results so far, SMOTE looks preferable to random oversampling for this task.
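
One point to keep in mind when reading both sets of scores: in the cells above, the data is oversampled before `train_test_split` is called. The toy sketch below (made-up strings, not the tweet dataset) illustrates what that ordering does in the random oversampling case, where the repeated rows are exact copies.

```python
import random

rng = random.Random(0)
minority = [f"minority tweet {i}" for i in range(20)]  # toy minority class

# Oversample first (the order used in the cells above), then split 80/20
oversampled = rng.choices(minority, k=200)
rng.shuffle(oversampled)
train, test = oversampled[:160], oversampled[160:]

# Fraction of test rows that also appear verbatim in the training split
seen = set(train)
overlap = sum(row in seen for row in test) / len(test)
print(f"test rows also present in train: {overlap:.0%}")
```

Because copies of the same tweet can land on both sides of the split, the reported test scores are likely optimistic. A common alternative is to split first and resample only the training portion.
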

In [30]:
def pred2(text):
  """Clean `text` like the training tweets and print the v2 (SMOTE) model's prediction."""
  m = re.sub('[^a-zA-Z]', ' ', text)
  m = re.sub('user', '', m)
  m = m.lower()
  m = m.split()
  m = [ps.stem(word) for word in m if not word in stopwords.words('english')]
  m =' '.join(m)
  k=[]
  k.append(m)
  x = cv.transform(k).toarray()
  y_pred = sentiment_detect_model_v2.predict(x)
  if y_pred[0]==0:
    print("Positive")
  else:
    print("Negative")

In [53]:
text = "I am disgusted by the way he acted"
pred2(text)
text = "The black man was murdered"
pred2(text)
text = "I am delighted at your arrival, i though I would never see you again!!"
pred2(text)
text = "The man was excited and elated for the dinner he had been planning for a long time"
pred2(text)
text = "I am happy for the concert, will get to go out after a long time"
pred2(text)
text = "Women cannot do what men can do"
pred2(text)
text = "The world is a beautiful place"
pred2(text)
text = "Whites are superior then the black"
pred2(text)

Negative
Negative
Positive
Positive
Positive
Negative
Positive
Negative
