# Recurrent Neural Network with LSTM for Toxic Comment Classification

# 1 - Packages

Let's first import all the packages that I'll be using during this project.

*  numpy is the main package for scientific computing with Python.
*  pandas is a library for data manipulation and analysis.
*  train_test_split is used to split our training data into training, dev datasets.
*  Tokenizer allows to vectorize a text corpus, by turning each text into a vector.
*  pad_sequences transforms a list of num_samples sequences into a 2D Numpy array.
*  keras.layers contains the layers we'll use to create our RNN.
*  Model will contain all the layers we'll use to get our output.

In [1]:
import numpy as np, pandas as pd

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model

Using TensorFlow backend.


# 2 - Dataset

*  We imported the train, test data provided by MNIST.
*  We then imported an embedding dataset which I found on kaggle datasets.
* We then splitted the train data into two datasets, one contained features and other contained the target variables.

In [2]:
train = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv')
test = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv')
embedding_file = '../input/glove6b50d/glove.6B.50d.txt'
X_raw = train['comment_text'].values
Y = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test_raw = test['comment_text'].values

In [3]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a comment to use

# 3 - Data Pre-processing

*  We created an object of Tokenizer class with max numbers = 20000 and then we trained it on X_raw dataset.
*  We then turned the list of texts into sequences.
*  We then padded our dataset so that every sequence is about the same size.
*  We then splitted the features, targets into training, dev sets with 1/5 ratio.

In [4]:
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(X_raw))
X_tokenized = tokenizer.texts_to_sequences(X_raw)
X_test_tokenized = tokenizer.texts_to_sequences(X_test_raw)
X = pad_sequences(X_tokenized, maxlen = 100)
X_test = pad_sequences(X_test_tokenized, maxlen = 100)
X_train, X_dev, Y_train, Y_dev = train_test_split(X, Y, test_size = 0.2, random_state = 42)

*  Here we converted our embedding_file into a dictionary where word is the key and the embeddin vector is the value.
*  We then took all the embedding valued and calculated their mean and standard deviation.

In [5]:
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(embedding_file))
all_embs = np.stack(embeddings_index.values())
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

* **word_index :  ** A dictionary of words and their uniquely assigned integers.
* **embedding_matrix:  ** A matrix which contains the relationship between different words.

In [6]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

# 4 - Model

Now as we have prepared our training data, we can start building our model.

**In this model we have used - **

* **Embedding layer :  ** Turns positive integers (indexes) into dense vectors of fixed size.

* **Bi-directional RNN with LSTM :  ** It involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.

* **Dense :  ** A simple Neural Network layer.

* **Dropout :  ** So that the weights don't rely too much on some activations.

* **Dense:  ** Output layer with sogmoid as the activation function.



In [7]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# 5 - Train the model

As our model is now ready we can start it's training.

In [8]:
model.fit(X_train, Y_train, batch_size=32, epochs=2, validation_split=0.1);

Train on 114890 samples, validate on 12766 samples
Epoch 1/2
Epoch 2/2


# 6 - Test the model

Now that our model is trained we can test how well our model can perform on unseen datasets

In [9]:
dev_loss, dev_accuracy = model.evaluate(X_dev, Y_dev)
print('loss on dev set is: {}'.format(round(dev_loss, 2)))
print('accuracy on dev set is: {}'.format(round(dev_accuracy, 2)))

loss on dev set is: 0.27
accuracy on dev set is: 0.99


# 7 - Submit the result

In [10]:
y_test = model.predict([X_test], batch_size=1024, verbose=1)
sample_submission = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv')
sample_submission[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_test
sample_submission.to_csv('submission.csv', index=False)



# Conclusion
* In this project I learned how to use keras api.
* I learned a lot of interesting stuff like how to convert a sentence into a form which can be used by the model and the inner working of the lstm(Which I didn't use bcz I used keras which handles all that confusing stuff itself).
* In this project I learned how to handel the text datasets and how to perform multiclass classification for cases like this.