## Problem Statement

#### Hate Speech Classification

Hate speech is unfortunately common on the Internet. Social media sites such as Facebook and Twitter must identify and censor problematic posts while weighing the right to freedom of speech. Detecting and moderating hate speech matters because hate speech has been linked to actual hate crimes. Early identification of users who promote hate speech could enable outreach programs that try to prevent an escalation from speech to action.

The objective of this task is to detect hate speech in tweets. For simplicity, a tweet is said to contain hate speech if it expresses a racist or sexist sentiment. The task is therefore to distinguish racist or sexist tweets from all other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

#### Data Source 
https://datahack.analyticsvidhya.com/contest/hate-speech-classification/

### 1. Import Libraries

In [1]:
import os
import gc
import re
import string
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import sequence, text
from keras.preprocessing.text import Tokenizer
from keras.models import Model, Sequential
from keras.layers import (Input, Dense, Dropout, Embedding, LSTM, CuDNNGRU, Conv1D,
                          GlobalMaxPooling1D, Flatten, MaxPooling1D, GRU, GlobalMaxPool1D,
                          SpatialDropout1D, Bidirectional)
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam

warnings.filterwarnings("ignore")

Using TensorFlow backend.


### 2. Load the train and test datasets

In [2]:
## load dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test_tweets.csv")

train.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [3]:
test.head()

| | id | tweet |
|---|---|---|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... |
| 1 | 31964 | @user #white #supremacists want everyone to s... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... |


### 3. Data Cleaning

- lowercasing and URL removal
- punctuation removal
- stopword removal
- lemmatization

In [4]:
punctuation = string.punctuation
stopword = stopwords.words("english")

lem = WordNetLemmatizer()

def clean(text):
    """Lowercase a tweet, strip URLs, punctuation and stopwords, then lemmatize."""
    text = text.lower()

    # remove URLs
    text = re.sub(r"http\S+", "", text)

    # punctuation removal (this also strips '@' and '#')
    text = "".join(ch for ch in text if ch not in punctuation)

    # stopword removal
    words = [w for w in text.split() if w not in stopword]

    # lemmatization: treat each word as a verb first, then as a noun
    words = [lem.lemmatize(word, 'v') for word in words]
    words = [lem.lemmatize(word, 'n') for word in words]

    return " ".join(words)
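
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch of the first three steps (lowercasing, URL removal, punctuation and stopword removal). It leaves out lemmatization, and `DEMO_STOPWORDS` and `clean_basic` are hypothetical names introduced only for this illustration; the notebook itself uses NLTK's full English stopword list.

```python
import re
import string

# Hypothetical, tiny stopword list used only for this illustration;
# the notebook uses nltk.corpus.stopwords.words("english") instead.
DEMO_STOPWORDS = {"this", "is", "a", "the", "to"}

def clean_basic(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, drop URLs and punctuation, then drop stopwords (no lemmatization)."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(w for w in text.split() if w not in stopwords)

example = clean_basic("@user This is a #GREAT day!! http://t.co/abc")
print(example)  # user great day
```

Because `@` and `#` are punctuation characters, mentions and hashtags survive only as plain words.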

In [5]:
train['cleaned'] = train['tweet'].apply(clean)
test['cleaned'] = test['tweet'].apply(clean)

### 4. Working with train dataset

#### 4 (a). Apply to_categorical on label variable
keras.utils.to_categorical converts a vector of integer class labels (from 0 to num_classes - 1) into a binary class matrix with one one-hot row per sample.

In [6]:
target = to_categorical(train['label'])
train = train.drop('label', axis = 1)
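
The one-hot conversion used above can be illustrated without Keras. The following is a minimal sketch, assuming integer labels starting at 0; `one_hot` is a hypothetical helper, not the Keras implementation.

```python
def one_hot(labels, num_classes=None):
    """Turn integer labels into one-hot rows, e.g. 1 -> [0.0, 1.0] for two classes."""
    if num_classes is None:
        num_classes = max(labels) + 1  # infer the class count from the largest label
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

encoded = one_hot([0, 1, 0])
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

With two classes, column 0 marks "not racist/sexist" and column 1 marks "racist/sexist", which is why the final layer of the model has two softmax units.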

#### 4 (b). Split train dataset

In [7]:
x_train, x_val, y_train, y_val = train_test_split(train['cleaned'], target, test_size = 0.2, random_state = 1)

#### 4 (c). Tokenize words
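NLTK's word_tokenize() splits text into word tokens. It is used here to count the unique words in the training split (the vocabulary size) and then to find the length of the longest tweet: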

In [8]:
words = " ".join(x_train)
words = nltk.word_tokenize(words)
dist = nltk.FreqDist(words)
num_unique_words = len(dist)

In [9]:
# length (in tokens) of the longest cleaned tweet in the training split
max_len = max(len(nltk.word_tokenize(w)) for w in x_train)
max_len

26

In [10]:
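max_features = num_unique_words  # vocabulary size passed to the Tokenizer and the Embedding layer
max_words = max_len              # every sequence is padded or truncated to this length
batch_size = 128                 # defined here, but model1.fit is later called with batch_size=512
embed_dim = 300                  # dimensionality of the word embeddings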

### 5. Text Preprocessing using Keras Tokenizer
- The Tokenizer class vectorizes a text corpus by turning each text into either a sequence of integers (one word index per token) or a vector whose coefficient for each token can be binary, a word count, or a tf-idf weight. This notebook uses the integer-sequence form.

In [11]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(x_train))
x_train = tokenizer.texts_to_sequences(x_train)
x_val = tokenizer.texts_to_sequences(x_val)
x_test  = tokenizer.texts_to_sequences(test['cleaned'])
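
As a rough illustration of what fitting and converting does, the sketch below assigns indices by descending word frequency starting at 1 and drops unseen words and words whose index is not below `num_words`. It is a simplification under stated assumptions: `fit_word_index` and `to_sequences` are hypothetical helpers, and the real Tokenizer additionally lowercases text and filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, and so on."""
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    ranked = [w for w, _ in counts.most_common()]  # ties keep first-seen order
    return {w: i for i, w in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index; skip unseen words and indices >= num_words."""
    sequences = []
    for t in texts:
        seq = []
        for w in t.split():
            i = word_index.get(w)
            if i is None:
                continue  # word never seen while fitting
            if num_words is not None and i >= num_words:
                continue  # word is too rare to fit in the vocabulary cap
            seq.append(i)
        sequences.append(seq)
    return sequences

word_index = fit_word_index(["love day", "love love hate"])
sequences = to_sequences(["hate love unseen", "day"], word_index, num_words=3)
print(word_index)  # {'love': 1, 'day': 2, 'hate': 3}
print(sequences)   # [[1], [2]]
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value in the next step.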


### 6. Sequence Preprocessing using keras.preprocessing.sequence.pad_sequences
- pad_sequences pads all sequences to the same length. Sequences longer than maxlen are truncated to fit. Where padding and truncation happen is controlled by the padding and truncating arguments; both default to 'pre', that is, the start of the sequence.

In [12]:
x_train = sequence.pad_sequences(x_train, maxlen=max_words)
x_val = sequence.pad_sequences(x_val, maxlen=max_words)
x_test = sequence.pad_sequences(x_test, maxlen=max_words)
#print(x_train.shape)
#print(x_val.shape)
#print(x_test.shape)
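
The default behaviour (zero-padding at the front, truncating from the front) can be sketched in plain Python. `pad_pre` is a hypothetical helper for illustration only and assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with value up to maxlen (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                   # 'pre' truncation drops items from the start
    return [value] * (maxlen - len(seq)) + seq  # 'pre' padding adds zeros at the start

padded = [pad_pre(s, 4) for s in [[5, 2], [7, 1, 3, 9, 4, 8]]]
print(padded)  # [[0, 0, 5, 2], [3, 9, 4, 8]]
```

In this notebook maxlen equals the longest training tweet, so training sequences are only padded, while longer validation or test tweets would lose their first tokens.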

### 7. Initiate a Model
The Keras functional API builds a model from an input tensor and an output tensor. Example:

from keras.models import Model
from keras.layers import Input, Dense

a = Input(shape=(32,))
b = Dense(32)(a)
model = Model(inputs=a, outputs=b)

- This model includes all layers required to compute b given a.
- Before training, the model is configured with compile():

compile(optimizer, loss=None, metrics=None, loss_weights=None, sample_weight_mode=None, weighted_metrics=None, target_tensors=None)


In [13]:
inp = Input(shape=(max_words,))
x = Embedding(max_features, embed_dim)(inp)
x = Bidirectional(GRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(2, activation = "softmax")(x)

model1 = Model(inputs = inp, outputs=x)
model1.compile(loss = 'categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
model1.summary()  # summary() prints the table itself and returns None

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 26)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 26, 300)           11084100  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 26, 128)           140160    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
Total para
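
The per-layer parameter counts in the summary can be reproduced by hand. The sketch below is a worked check, with two assumptions: the vocabulary size 36947 is inferred from the embedding row (11084100 / 300), and the GRU uses the classic formulation with one bias vector per gate, which matches the 140160 shown for the bidirectional layer.

```python
embed_dim = 300
gru_units = 64
vocab_size = 11084100 // embed_dim  # inferred from the embedding row of the summary

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# GRU: 3 gates, each with input weights, recurrent weights and one bias vector
gru_one_direction = 3 * (gru_units * (gru_units + embed_dim) + gru_units)
bigru_params = 2 * gru_one_direction  # forward and backward GRUs

# Dense layers: weights plus biases
dense_1_params = (2 * gru_units) * 16 + 16
dense_2_params = 16 * 2 + 2

layer_sum = embedding_params + bigru_params + dense_1_params + dense_2_params
print(vocab_size, bigru_params, dense_1_params, dense_2_params, layer_sum)
# 36947 140160 2064 34 11226358
```

Almost all of the parameters sit in the embedding matrix, so the vocabulary size dominates the model size.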

### 8. Fit a Model

In [14]:
model1.fit(x_train, y_train, batch_size=512, epochs=20, validation_data=(x_val, y_val))


Train on 25569 samples, validate on 6393 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x21e5c0b2438>

### 9. Find F1 score

In [15]:
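# softmax outputs already lie in [0, 1]; rounding turns them into one-hot predictions
pred1 = np.round(np.clip(model1.predict(x_val), 0, 1))
# average=None returns one F1 score per class: [class 0, class 1]
print(f1_score(y_val, pred1, average=None))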

[0.97691149 0.66826923]
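
The two numbers are the F1 scores for class 0 (not racist/sexist) and class 1 (racist/sexist). As a reminder of how a per-class F1 score is derived from precision and recall, here is a minimal sketch on made-up labels; `f1_for_class` is a hypothetical helper and the toy data is not taken from the notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]
f1_class0 = f1_for_class(y_true_demo, y_pred_demo, 0)
f1_class1 = f1_for_class(y_true_demo, y_pred_demo, 1)
print(f1_class0, f1_class1)  # 0.75 0.5
```

The same pattern appears in the notebook's result: the majority class scores much higher than the minority hate-speech class, so the class 1 score is the more informative one for this task.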


### 10. Predict target variable for test and save the result in csv file

In [16]:
# round the softmax outputs to one-hot rows, then take the column holding the 1 as the label
pred1 = np.round(np.clip(model1.predict(x_test), 0, 1)).astype(int)
pred1 = pd.DataFrame(pred1).idxmax(axis=1)
submission_GRU = pd.DataFrame({'id': test['id'], 'label': pred1})
submission_GRU.to_csv("submission_GRU.v3.csv", index=False)