# Toxic comment classification

## Importance of the problem
The Internet carries a huge volume of comments, and it is extremely hard to moderate them for negative online behaviour such as toxic comments (comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion).

An automatic toxicity classifier can help companies of any size moderate their comments efficiently.

## Data

We used the Kaggle dataset from the Jigsaw Unintended Bias in Toxicity Classification competition (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).

In [1]:
import numpy as np
import pandas as pd

train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [2]:
print(train.shape)
train.head()

(1804874, 45)


| | id | target | comment_text | severe_toxicity | obscene | identity_attack | insult | threat | asian | atheist | ... | article_id | rating | funny | wow | sad | likes | disagree | sexual_explicit | identity_annotator_count | toxicity_annotator_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59848 | 0.0 | This is so cool. It's like, 'would you want yo... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | ... | 2006 | rejected | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 |
| 1 | 59849 | 0.0 | Thank you!! This would make my life a lot less... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | ... | 2006 | rejected | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 |
| 2 | 59852 | 0.0 | This is such an urgent design problem; kudos t... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | ... | 2006 | rejected | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 |
| 3 | 59855 | 0.0 | Is this something I'll be able to install on m... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | ... | 2006 | rejected | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 |
| 4 | 59856 | 0.893617 | haha you guys are a bunch of losers. | 0.021277 | 0.0 | 0.021277 | 0.87234 | 0.0 | 0.0 | 0.0 | ... | 2006 | rejected | 0 | 0 | 0 | 1 | 0 | 0.0 | 4 | 47 |


## Cleaning data

Our dataset looks fine, but we remove punctuation from the comment text because it is not useful for our model.

To work with text, we map each word to a 300-dimensional vector.

We used pretrained fastText word vectors for that (https://fasttext.cc/docs/en/english-vectors.html).
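The fastText `.vec` files are plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its vector components. The sketch below parses that layout from a tiny, made-up 3-dimensional sample rather than the real 300-dimensional file; `SAMPLE_VEC`, `parse_vec`, and the sample words and numbers are invented for illustration.

```python
import io

# Hypothetical miniature of a .vec file: header "<count> <dim>", then word + floats.
SAMPLE_VEC = """2 3
hello 0.1 0.2 0.3
world -0.5 0.0 0.25
"""

def parse_vec(handle):
    """Return {word: [floats]} read from a fastText-style text file handle."""
    _count, dim = map(int, handle.readline().split())  # header line
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) == dim:  # ignore malformed lines
            vectors[word] = values
    return vectors

sample_vectors = parse_vec(io.StringIO(SAMPLE_VEC))
print(sample_vectors['world'])
```

Reading the header separately keeps it out of the word dictionary, where it would otherwise show up as a bogus one-component "word".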

In [17]:
import re, string

# Every comment becomes a sequence of 220 token ids: shorter ones are zero-padded, longer ones truncated
TEXT_LEN=220

def preprocess(data):
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    def clean(text):
        return regex.sub('', text)

    data = data.astype(str).apply(clean)
    return data

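def toKeyValue(key, *values):
    # One line of the .vec file: a word followed by its vector components.
    return key, np.asarray(values, dtype='float32')


def loadDict(path):
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the header line ("<word count> <dimension>")
        return dict(toKeyValue(*line.rstrip().split(' ')) for line in f)

def makeEmbleding(words):
    # Row i holds the vector of the word with tokenizer index i;
    # row 0 (padding) and words missing from the pretrained vectors stay all-zero.
    wordVecDict = loadDict('./inputs/wiki-news-300d-1M.vec')
    inputMatrix = np.zeros((len(words) + 1, 300))
    for word, i in words.items():
        if word in wordVecDict:
            inputMatrix[i] = wordVecDict[word]
    return inputMatrix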
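To see what the punctuation regex does to a comment, here is a small standalone sketch that reuses the same pattern on two sample strings (the second string is shortened from the first row of the data preview; `punct_re`, `samples`, and `cleaned` are illustrative names):

```python
import re
import string

# Same pattern as in preprocess(): a character class of all ASCII punctuation.
punct_re = re.compile('[%s]' % re.escape(string.punctuation))

samples = [
    "haha you guys are a bunch of losers.",
    "It's like, 'would you...'",
]
cleaned = [punct_re.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Note that apostrophes are removed as well, so contractions collapse ("It's" becomes "Its"), which may affect how many words are later found in the pretrained vocabulary.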

In [9]:
x_train = preprocess(train['comment_text'])
y_target_train = np.where(train['target'] >= 0.5, 1, 0)
y_classes_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']]

x_test = preprocess(test['comment_text'])
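The main label is a hard 0/1 decision at a threshold of 0.5, while the auxiliary targets in `y_classes_train` keep their fractional values. A plain-Python sketch of the same thresholding rule as the `np.where` call, on toy scores (0.893617 is the `target` of the fifth row shown in the preview; the other values are made up):

```python
# Toy toxicity scores; only 0.893617 comes from the data preview above.
scores = [0.0, 0.893617, 0.5, 0.49]

# Same rule as np.where(target >= 0.5, 1, 0): the threshold itself counts as toxic.
labels = [1 if s >= 0.5 else 0 for s in scores]
print(labels)
```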

In [34]:
from keras.preprocessing import text, sequence

print('Start tokenizing all test and train words..')
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(list(x_train) + list(x_test))

print('Start sequence..')
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

print('Add paddings..')
x_train = sequence.pad_sequences(x_train, maxlen=TEXT_LEN)
x_test = sequence.pad_sequences(x_test, maxlen=TEXT_LEN)

# NOTE: createModel(embedding, ...) in the Modeling section needs this matrix;
# uncomment the two lines below to build it.
# print('Making embedding matrix..')
# embedding = makeEmbleding(tokenizer.word_index)

print('Prepare data done!')

Start tokenizing all test and train words..
Start sequence..
Add paddings..
Prepare data done!
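With its default arguments, Keras `pad_sequences` both pads and truncates at the front of each sequence, so a long comment keeps its last `maxlen` tokens. The following dependency-free sketch imitates that default behaviour on toy token ids; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (assumes maxlen >= 1)."""
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

short = pad_pre([5, 8, 2], maxlen=5)               # shorter than maxlen: zeros in front
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # longer than maxlen: front is cut off
print(short)
print(long_)
```

Index 0 is never assigned to a word by the tokenizer, which is why the padding value does not collide with real tokens.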


## Modeling

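We decided to use a bidirectional LSTM recurrent neural network architecture for our project. It allows the model to take word order and the surrounding context into account when scoring a comment.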

In [26]:
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, Dropout, add, concatenate
from keras.layers import LSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D

def createModel(inputMatrix, num_aux_targets):
    LSTM_UNITS = 128
    # Must equal the size of the pooled features (see the concatenate below)
    DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
    
    words = Input(shape=(TEXT_LEN,))
    
    # Frozen pretrained embeddings: token id -> 300-dimensional vector
    x = Embedding(*inputMatrix.shape, weights=[inputMatrix], trainable=False)(words)
    x = SpatialDropout1D(0.3)(x)
    # Two stacked bidirectional LSTMs, each returning the full sequence
    x = Bidirectional(LSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(LSTM(LSTM_UNITS, return_sequences=True))(x)

    # Summarize the sequence with max and average pooling over time
    hidden = concatenate([
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    # Two dense layers with skip connections
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    
    # Main output (toxic or not) and auxiliary outputs (toxicity subtypes)
    result = Dense(1, activation='sigmoid')(hidden)
    aux_result = Dense(num_aux_targets, activation='sigmoid')(hidden)
    
    model = Model(inputs=words, outputs=[result, aux_result])
    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model
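The choice `DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS` is what makes the skip connections (`add`) shape-compatible. A short arithmetic sketch of the feature sizes, assuming the default `Bidirectional` merge mode (concatenation of the forward and backward outputs); the intermediate variable names are illustrative:

```python
LSTM_UNITS = 128

# Bidirectional LSTM: forward and backward outputs are concatenated per time step.
bidirectional_features = 2 * LSTM_UNITS

# Max pooling and average pooling each yield one vector of that size; both are concatenated.
pooled_features = 2 * bidirectional_features

# Width of the dense layers whose output is added back onto the pooled vector.
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS

print(bidirectional_features, pooled_features, DENSE_HIDDEN_UNITS)
```

If `LSTM_UNITS` changes, both sizes scale together, so the element-wise `add` keeps working without further edits.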

In [35]:
model = createModel(embedding, y_classes_train.shape[-1])

In [None]:
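from keras.callbacks import LearningRateScheduler
# Trial run on a small slice of the data (rows 1..99) to check the pipeline end to end
model.fit(
    x_train[1:100],
    [y_target_train[1:100], y_classes_train[1:100]],
    batch_size=32,
    epochs=10,
    verbose=2,
    callbacks=[
        # Start at 1e-3 and multiply the learning rate by 0.6 after every epoch
        LearningRateScheduler(lambda epoch: 1e-3 * (0.6 ** epoch))
    ]
)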

Epoch 1/10
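The `LearningRateScheduler` callback above applies an exponential decay: the learning rate starts at 1e-3 and is multiplied by 0.6 for every completed epoch (Keras passes a zero-based epoch index). The sketch below evaluates the same formula outside Keras to show the values over the 10 epochs; `schedule` and `rates` are illustrative names.

```python
def schedule(epoch, base_lr=1e-3, decay=0.6):
    """Same formula as the lambda passed to LearningRateScheduler."""
    return base_lr * (decay ** epoch)

rates = [schedule(epoch) for epoch in range(10)]
for epoch, lr in enumerate(rates):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

By the last epoch the rate has dropped to roughly 1e-5, about one hundredth of its starting value.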
