
# Introduction

* This notebook performs NLP sentiment classification of comments: for each comment, the model predicts six labels that correspond to different types of verbal violence.
* The dataset is from Kaggle: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
* In this notebook I use an embedding layer followed by a bidirectional LSTM (built with Keras) as the prediction model.
* In another notebook I built a linear baseline for this competition: https://github.com/MiaDor12/NLP_Sentiment_Analysis-Toxic_Comments-with_sklearn.
* The predictions from the linear model that I submitted to Kaggle received an average AUC score of 0.83514.
* The predictions from the deep learning model (this notebook) that I submitted to Kaggle received an average AUC score of 0.97163.
* The deep learning model therefore raised the score by more than 13 percentage points (0.13649 in absolute AUC).
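
The comparison in the last bullet is plain arithmetic. A quick check, using only the two scores quoted above (the variable names are illustrative and not part of the notebook):

```python
# Average AUC scores quoted in the introduction.
linear_auc = 0.83514
lstm_auc = 0.97163

# Absolute gain, expressed in percentage points.
gain_points = (lstm_auc - linear_auc) * 100

# Share of the linear model's remaining gap to a perfect score (1 - AUC) that was closed.
gap_closed = (lstm_auc - linear_auc) / (1 - linear_auc)

print(f"gain: {gain_points:.3f} percentage points")
print(f"share of the (1 - AUC) gap closed: {gap_closed:.1%}")
```

The second figure is an additional way of reading the same two numbers: it relates the gain to how much room for improvement the linear baseline left.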


### The notebook is divided into the following parts:
* Part 1: Notebook Preparation
* Part 2: Data Processing
* Part 3: Modeling - Deep Learning

# Part 1: Notebook Preparation

## Import

In [1]:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model

## Config

In [2]:
labels_list = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [3]:
embed_size = 50 # dimension of each word vector
max_features = 20000 # vocabulary size, i.e. the number of rows in the embedding matrix
maxlen = 100 # number of word positions kept per comment (longer comments are truncated, shorter ones padded)
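
To make these three settings concrete, the short sketch below works out what they imply for the embedding layer. It only restates the config values; the derived names are illustrative.

```python
embed_size = 50       # dimension of each word vector
max_features = 20000  # number of rows in the embedding matrix
maxlen = 100          # word positions kept per comment

# The embedding matrix holds one vector of length embed_size per vocabulary index.
embedding_weights = max_features * embed_size

# After the embedding lookup, every comment becomes a (maxlen, embed_size) array.
embedded_comment_shape = (maxlen, embed_size)
values_per_comment = maxlen * embed_size

print(embedding_weights)       # 1000000
print(embedded_comment_shape)  # (100, 50)
print(values_per_comment)      # 5000
```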

# Part 2: Data Processing

In [4]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

display(df_train.head(5))
print(df_train.shape)
display(df_test.head(5))
print(df_test.shape)

|  | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


(159571, 8)


|  | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


(153164, 2)


In [5]:
list_sentences_train = df_train["comment_text"].fillna("_na_").values
y = df_train[labels_list].values

list_sentences_test = df_test["comment_text"].fillna("_na_").values

In [6]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_test = pad_sequences(list_tokenized_test, maxlen=maxlen)
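
The cell above turns raw text into fixed-length integer sequences. The following is a simplified pure-Python sketch of that idea, not the Keras implementation: it splits on whitespace only (the Keras `Tokenizer` also strips punctuation by default), and the helper names are hypothetical. It mirrors the default behaviour of `pad_sequences`, which pads and truncates at the start of each sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort keeps first-seen order on ties
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unseen words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequences, maxlen):
    """Keep the last maxlen tokens and left-pad shorter sequences with 0."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

train_texts = ["you are great", "you are not", "you rock"]
word_index = fit_word_index(train_texts)

# The vocabulary is learned from the training texts only, then applied to new text.
new_texts = ["you are not great at all", "you are you are you are great"]
sequences = texts_to_sequences(new_texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=5)

print(word_index)  # {'you': 1, 'are': 2, 'great': 3, 'not': 4, 'rock': 5}
print(sequences)   # [[1, 2, 3], [1, 2, 1, 2, 1, 2, 3]]
print(padded)      # [[0, 0, 1, 2, 3], [1, 2, 1, 2, 3]]
```

As in the notebook, the vocabulary is fitted on the training comments and reused for the test comments, so words that appear only in the test set are dropped.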

# Part 3: Modeling - Deep Learning

In [7]:
# Build the model
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(20, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
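
As a rough check on the size of this architecture, the trainable weights can be counted by hand. The sketch below assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and a bias) and dense layers with biases; under those assumptions the total should agree with what `model.summary()` reports.

```python
max_features, embed_size, lstm_units = 20000, 50, 64

# Embedding: one vector of length embed_size per vocabulary index.
embedding = max_features * embed_size

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_one_direction = 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)
bi_lstm = 2 * lstm_one_direction  # forward and backward copies

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# GlobalMaxPool1D has no weights; it outputs 2 * lstm_units features.
dense_layers = (
    dense_params(2 * lstm_units, 50)
    + dense_params(50, 20)
    + dense_params(20, 6)
)

total = embedding + bi_lstm + dense_layers
print(embedding, bi_lstm, dense_layers, total)  # 1000000 58880 7596 1066476
```

Almost all of the weights sit in the embedding matrix, which is why `max_features` and `embed_size` dominate the model size.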

In [8]:
# Fit the model
model.fit(X_train, y, batch_size=32, epochs=2, validation_split=0.1)


Epoch 1/2
Epoch 2/2
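
The model is compiled with accuracy as its only metric, while the scores quoted in the introduction are average AUC values. A minimal pure-Python sketch of a per-label ROC AUC averaged over the label columns follows; the function names are hypothetical, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used on held-out predictions instead.

```python
def roc_auc(y_true, y_score):
    """AUC via the rank-sum formula: the chance that a random positive outranks a random negative."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores so they share one average rank.
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_columnwise_auc(y_true_rows, y_score_rows):
    """Average the per-label AUC over all label columns."""
    n_labels = len(y_true_rows[0])
    aucs = [
        roc_auc([row[c] for row in y_true_rows], [row[c] for row in y_score_rows])
        for c in range(n_labels)
    ]
    return sum(aucs) / n_labels

# Toy example with two labels and four comments (made-up values).
toy_true = [[0, 1], [0, 0], [1, 1], [1, 0]]
toy_pred = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.8], [0.8, 0.3]]
toy_auc = mean_columnwise_auc(toy_true, toy_pred)
print(toy_auc)  # 0.875
```

Each label needs at least one positive and one negative example in the evaluated rows, otherwise the AUC for that column is undefined.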



In [9]:
# Predict the labels of the test dataset
y_test = model.predict(X_test, batch_size=1024, verbose=1)

# Build the submission table: the comment id followed by one probability column per label
predictionsDf = pd.DataFrame(y_test, columns=labels_list)
predictionsDf.insert(0, 'id', df_test['id'].values)
predictionsDf.to_csv('NLP_toxic_comments_submission.csv', index=False)
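
Before uploading, it can be worth checking that the file has the expected shape. The sketch below is a hypothetical helper, not part of the notebook: it assumes the submission consists of an `id` column followed by the six label columns, each holding a probability between 0 and 1, and it runs on a small in-memory sample rather than the real file.

```python
import csv
import io

labels_list = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def check_submission(handle):
    """Validate the header and probability range; return the number of data rows."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ['id'] + labels_list:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        for label in labels_list:
            p = float(row[label])
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{label} out of range for id {row['id']}: {p}")
        n_rows += 1
    return n_rows

# Made-up sample with the same layout as the submission file.
sample = (
    "id,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "a1,0.98,0.30,0.95,0.02,0.90,0.10\n"
    "b2,0.01,0.00,0.00,0.00,0.01,0.00\n"
)
n_rows = check_submission(io.StringIO(sample))
print(n_rows)  # 2
```

For the real file, the same function could be called as `check_submission(open('NLP_toxic_comments_submission.csv', newline=''))`, and the returned row count compared with `df_test.shape[0]`.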

