<center><u><h1 style="color:green">Toxic Comment Classification</h1></u></center>

### Table of Contents
1. Importing Data and Libraries
2. Exploratory Data Analysis (EDA)
3. Data Pre-processing
4. Modeling
    * Naive Bayes SVM Model
    * LSTM
    * BERT model
5. Model Ensembling

<h2 style="color:blue">1. Importing Libraries</h2>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

import warnings
# Silences every warning (including deprecation notices) to keep cell output short
warnings.simplefilter(action="ignore")

<h2 style="color:blue">Loading the Data</h2>

In [2]:
train = pd.read_csv('../dataset/train.csv')
test = pd.read_csv('../dataset/test.csv')

In [3]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


<h2 style="color:blue">2. Exploratory Data Analysis</h2>

<h2 style="color:blue">3. Data Pre-Processing</h2>

In [5]:
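# The six binary labels; a single comment may carry several of them (multi-label)
classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
targets = train[classes].values

# Raw comment text for both splits
train_sentences = train['comment_text']
test_sentences = test['comment_text']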


<h2 style="color:blue">Tokenization</h2>

In [6]:
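max_features = 22000  # vocabulary cap: only the most frequent words get an id
tokenizer = Tokenizer(num_words = max_features)
# Build the word index from the training text only, then reuse it for the test text
tokenizer.fit_on_texts(list(train_sentences))
tokenized_train = tokenizer.texts_to_sequences(train_sentences)
tokenized_test = tokenizer.texts_to_sequences(test_sentences)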

In [7]:
tokenized_train[:1]

[[688, 75, 1, 126, 130, 177, 29, 672, 4511, 12052, 1116, 86, 331, 51, 2278,
  11448, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 15190, 2825, 4,
  45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273,
  985]]
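To make the ids above easier to read, here is a minimal pure-Python sketch of the idea behind the tokenizer: words are ranked by frequency, the most frequent word gets id 1, and any word whose id is not below `num_words` is dropped when converting text to sequences. The helper names `fit_word_index` and `texts_to_ids` and the simple regular expression are assumptions made for illustration; the real Keras `Tokenizer` uses its own filtering rules.

```python
from collections import Counter
import re

WORD_RE = re.compile(r"[a-z0-9']+")  # simplified word pattern (assumption)

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map each text to ids, dropping unknown words and ids >= num_words."""
    sequences = []
    for text in texts:
        ids = (word_index.get(w) for w in WORD_RE.findall(text.lower()))
        sequences.append([i for i in ids if i is not None and i < num_words])
    return sequences

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = fit_word_index(toy_texts)
# With num_words=3 only ids 1 and 2 survive, so "dog" and "sat" are dropped
toy_ids = texts_to_ids(["the dog sat"], toy_index, num_words=3)
```

Low ids such as the repeated `1` in the output above therefore correspond to very common words, while ids in the thousands are rarer ones.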

<h3 style="color:blue">Padding</h3>

In [8]:
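maxlen = 200  # every comment becomes exactly 200 token ids
# By default pad_sequences pads with zeros, and truncates, at the front of each sequence
X_train = pad_sequences(tokenized_train, maxlen = maxlen)
X_test = pad_sequences(tokenized_test, maxlen = maxlen)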
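The effect of the padding step on a single sequence can be sketched in a few lines. This mirrors the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`); the helper name `pad_pre` is hypothetical and the function handles one sequence at a time with a positive `maxlen`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last
    `maxlen` ids of long ones (assumes maxlen > 0)."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_example = pad_pre([5, 8, 2], maxlen=5)
long_example = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
```

Short comments gain leading zeros, and long comments lose their beginning rather than their end.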

In [9]:
# Token count per comment, useful for checking the choice of maxlen
totalNumWords = [len(comment) for comment in tokenized_train]
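One way to use these lengths is to check how many comments fit inside `maxlen` without truncation. The sketch below is an illustration on made-up lengths, not a result from this dataset; the helper name `coverage` is an assumption.

```python
import numpy as np

def coverage(lengths, maxlen):
    """Fraction of sequences with at most `maxlen` tokens (not truncated)."""
    lengths = np.asarray(lengths)
    return float((lengths <= maxlen).mean())

# Made-up token counts, for illustration only
toy_lengths = [12, 47, 80, 150, 199, 200, 350, 1200]
toy_fraction = coverage(toy_lengths, maxlen=200)
```

In the notebook the same call would be `coverage(totalNumWords, maxlen)`; a value close to 1 suggests that few comments are being cut.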

<h2 style="color:Blue;">4. Modeling</h2>

<h3 style="color:green;">Naive Bayes SVM Model</h3>

<h3 style="color:green;">LSTM</h3>

In [10]:
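embed_size = 128  # dimensionality of each learned word vector

inp = Input(shape = (maxlen, ))
x = Embedding(max_features, embed_size)(inp)
# return_sequences=True keeps all 200 hidden states so they can be max-pooled
x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)  # max over the time axis: (200, 60) -> (60,)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
# One independent sigmoid per label, since a comment can carry several labels
x = Dense(6, activation="sigmoid")(x)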

In [11]:
model = Model(inputs=inp, outputs=x)
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [12]:
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 128)          2816000   
_________________________________________________________________
lstm_layer (LSTM)            (None, 200, 60)           45360     
_________________________________________________________________
global_max_pooling1d (Global (None, 60)                0         
_________________________________________________________________
dropout (Dropout)            (None, 60)                0         
_________________________________________________________________
dense (Dense)                (None, 50)                3050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)               
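The parameter counts in the summary can be reproduced by hand from the layer sizes, which is a useful sanity check on the architecture. The variable names below are chosen for illustration. The final dense layer's count is derived from the `Dense(6)` definition in the model code, since its row is not shown in the output above.

```python
max_features, embed_size = 22000, 128
lstm_units, dense_units, n_labels = 60, 50, 6

# Embedding: one embed_size-dimensional vector per vocabulary id
embedding_params = max_features * embed_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_size + lstm_units) * lstm_units + lstm_units)

# Dense layers: weight matrix plus one bias per output unit
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_labels + n_labels  # derived, not in the summary

print(embedding_params, lstm_params, dense_params, output_params)
```

Almost all of the weights sit in the embedding table, so `max_features` and `embed_size` dominate the model size.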

In [13]:
batch_size = 32
epochs = 2
model.fit(X_train, targets, batch_size=batch_size, epochs=epochs, validation_split=0.1)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x257eb2c9348>

<h3 style="color:blue;">Prediction</h3>

In [14]:
prediction = model.predict(X_test)
prediction

array([[9.9876463e-01, 4.7929606e-01, 9.6515357e-01, 5.0769478e-02,
        9.0644848e-01, 3.0519968e-01],
       [1.2151003e-03, 3.3778169e-06, 2.2137165e-04, 4.7504791e-06,
        2.1991134e-04, 1.0523786e-04],
       [2.3737848e-03, 2.2966593e-05, 5.8311224e-04, 3.0894731e-05,
        5.8305264e-04, 3.0198693e-04],
       ...,
       [8.7785721e-04, 9.8991063e-07, 1.2421608e-04, 1.1324536e-06,
        1.1200194e-04, 4.8286187e-05],
       [4.6142936e-03, 2.3861034e-05, 7.4401498e-04, 2.5410016e-05,
        7.9599023e-04, 4.6423078e-04],
       [9.8672056e-01, 5.8082610e-02, 8.5039860e-01, 6.7667365e-03,
        7.2913611e-01, 4.2897999e-02]], dtype=float32)
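Each row of `prediction` holds six probabilities in the order of `classes`. A minimal sketch of pairing those rows with comment ids is shown below. It assumes a one-row-per-id layout with one column per label; the helper name `build_submission` and the toy ids are hypothetical.

```python
import numpy as np
import pandas as pd

classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def build_submission(ids, probs, label_names):
    """Return a DataFrame with an id column followed by one column per label."""
    probs = np.asarray(probs)
    if probs.shape != (len(ids), len(label_names)):
        raise ValueError("probs must have shape (n_ids, n_labels)")
    frame = pd.DataFrame(probs, columns=label_names)
    frame.insert(0, "id", list(ids))
    return frame

# Toy values for illustration only
toy_submission = build_submission(
    ["a1", "b2"],
    [[0.9, 0.1, 0.8, 0.0, 0.7, 0.2], [0.0] * 6],
    classes,
)
```

In the notebook this could be called as `build_submission(test["id"], prediction, classes)` and written out with `to_csv("submission.csv", index=False)`, where the file name is an assumption.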

<h3 style="color:green;">BERT model</h3>

<h2 style="color:Blue;">5. Model Ensembling</h2>

### Work in progress...