# Machine Learning Final Project

## Toxic Comment Classification Challenge

- Anton Bilchuk
- Ivan Prodaiko

Ukrainian Catholic University, May 2019

ATTENTION: a CUDA-capable device is required to run this model!

## Data Preprocessing

In [3]:
import numpy as np
import pandas as pd

train = pd.read_csv("./data/train.csv")
test = pd.read_csv("./data/test.csv")

print("Train shape : ",train.shape)
print("Test shape : ",test.shape)

train.head()

Train shape :  (1804874, 45)
Test shape :  (97320, 2)


Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


## We will start with embeddings. We have to extract all words from the train and test datasets and find their vector forms.

In [4]:
all_comments = pd.concat([train[['id','comment_text']], test], axis=0)
print("All comments shape: " + str(all_comments.shape))

All comments shape: (1902194, 2)


## Let's load a pretrained embedding matrix stored in the word2vec text format. We used 1 million word vectors trained on Wikipedia 2017, the UMBC WebBase corpus and the statmt.org news dataset.
Any other pretrained embeddings can be used here instead.

In [5]:
from gensim.models import KeyedVectors

embeddings = KeyedVectors.load_word2vec_format('./inputs/wiki-news-300d-1M.vec')

## Text preprocessing

Load the mapping used to expand contractions ("aren't" becomes "are not", and so on).

In [8]:
import pickle

# Use a context manager so the file is closed after loading.
with open("./inputs/contraction_mapping.pickle", "rb") as pickle_in:
    contraction_mapping = pickle.load(pickle_in)

Expand contractions: replace "aren't" with "are not", and so on.

In [9]:
def clean_contractions(text):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])
    return text
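
As a quick illustration, the snippet below runs the same logic on a made-up sentence. The three-entry `toy_mapping` and the function name `clean_contractions_demo` are hypothetical stand-ins; the actual contents of the pickled `contraction_mapping` are not shown in this notebook.

```python
# Hypothetical stand-in for the pickled mapping; the real one is not shown here.
toy_mapping = {"aren't": "are not", "i'll": "i will", "it's": "it is"}

def clean_contractions_demo(text, mapping):
    # Normalize typographic apostrophes so they match the mapping keys.
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    # Expand each space-separated token that has an entry in the mapping.
    return ' '.join(mapping.get(t, t) for t in text.split(" "))

demo = clean_contractions_demo("it’s fine, they aren’t here", toy_mapping)
print(demo)  # it is fine, they are not here
```

Because tokens are split on spaces only, a contraction directly followed by punctuation (for example `aren't,`) is not found in the mapping and stays unchanged.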

Separate punctuation marks from words by surrounding them with spaces. Replace _ and backticks with a space.

In [10]:
punctuations = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'
punctuations_mapping = {"_":" ", "`":" "}

def clean_punctuations(text):
    for p in punctuations_mapping:
        text = text.replace(p, punctuations_mapping[p])    
    for p in punctuations:
        text = text.replace(p, " " + p + " ")
    return text
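
To make the effect concrete, here is a small sketch of the same two-step idea on a made-up string. It uses a shortened, hypothetical punctuation set (`demo_punctuations`) rather than the full string above.

```python
# Shortened, hypothetical punctuation set; the notebook uses a longer string.
demo_punctuations = "?!.,;"
demo_mapping = {"_": " ", "`": " "}

def clean_punctuations_demo(text):
    # First map selected characters to a plain space.
    for p, replacement in demo_mapping.items():
        text = text.replace(p, replacement)
    # Then surround each punctuation mark with spaces so it becomes its own token.
    for p in demo_punctuations:
        text = text.replace(p, " " + p + " ")
    return text

spaced = clean_punctuations_demo("thank_you!! ok.")
print(spaced.split())  # ['thank', 'you', '!', '!', 'ok', '.']
```

The extra spaces this produces are harmless for the later tokenization step, which splits on whitespace.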

Swear words carry the same meaning but can be written in many different forms. We decided to replace every swear word that has no pretrained embedding with a single swear word (the first one in our list).

In [13]:
import re

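swear_words = np.load('./inputs/swear_words.npy')

# Collect the swear words whose inner text (first and last characters stripped)
# has no pretrained embedding; only these need to be replaced.
swear_replaces = []
for swear in swear_words:
    if swear[1:-1] not in embeddings:
        swear_replaces.append(swear)
# Join them into one alternation pattern for re.sub.
swear_replaces = '|'.join(swear_replaces)

def replace_swears(text):
    # Replace every match with the first swear word in the list.
    return re.sub(swear_replaces, ' ' + swear_words[0] + ' ', text)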
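
The sketch below shows one possible variant of this step that escapes regex metacharacters before building the pattern. It is an illustration under assumptions, not the notebook's code: the word list `demo_swears`, the set `known_words` (a stand-in for the embedding vocabulary) and the function name are all hypothetical, and the entries are assumed to be padded with one space on each side, which is what the slicing in the cell above suggests.

```python
import re

# Hypothetical, mild stand-ins, assumed to be space-padded like the entries
# of swear_words.npy appear to be.
demo_swears = [" darn ", " d@rn ", " da.rn "]
known_words = {"darn"}  # stand-in for the embedding vocabulary

# Keep only variants with no embedding, escaping regex metacharacters.
to_replace = [s for s in demo_swears if s[1:-1] not in known_words]
pattern = re.compile("|".join(re.escape(s) for s in to_replace))

def replace_swears_demo(text):
    # Every matched variant becomes the first (canonical) entry.
    return pattern.sub(demo_swears[0], text)

print(replace_swears_demo("well d@rn it"))   # well darn it
print(replace_swears_demo("well daxrn it"))  # well daxrn it
```

Without escaping, the `.` in the third entry would match any character. Note also that an empty `to_replace` list would yield an empty pattern that matches at every position, so a guard would be needed in that case.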

### Apply all preprocessing to comments

In [14]:
print("Start")
all_comments['comment_text'] = all_comments['comment_text'].apply(lambda x: x.lower())
print("1/4 Done")
all_comments['comment_text'] = all_comments['comment_text'].apply(clean_contractions)
print("2/4 Done")
all_comments['comment_text'] = all_comments['comment_text'].apply(clean_punctuations)
print("3/4 Done")
all_comments['comment_text'] = all_comments['comment_text'].apply(replace_swears)
print("4/4 Done")

Start
1/4 Done
2/4 Done
3/4 Done
4/4 Done


In [15]:
train_texts = all_comments.iloc[:len(train),:]
test_texts = all_comments.iloc[len(train):,:]

train = pd.concat([train_texts,train[['target']]],axis=1)
train.head()

Unnamed: 0,id,comment_text,target
0,59848,"this is so cool . it is like , ' ...",0.0
1,59849,thank you ! ! this would make my life a ...,0.0
2,59852,this is such an urgent design problem ; kud...,0.0
3,59855,is this something i will be able to install on...,0.0
4,59856,haha you guys are a bunch of losers .,0.893617


In [16]:
del all_comments

In [17]:
# Binarize the label: a comment counts as toxic when its score is at least 0.5.
train['target'] = train['target'] >= 0.5

In [18]:
from sklearn import model_selection

train_dataset, validation_dataset = model_selection.train_test_split(train, test_size=0.1)
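
The resulting split sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the validation share up to a whole number of rows and gives the remainder to the training set; the row count is taken from the shape printed earlier.

```python
import math

n_rows = 1804874   # rows in train.csv, from the shape printed earlier
test_size = 0.1

n_validation = math.ceil(n_rows * test_size)
n_train = n_rows - n_validation
print(n_train, n_validation)  # 1624386 180488
```

These numbers agree with the "Train on 1624386 samples, validate on 180488 samples" line in the training log.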

We have to tokenize all words so that each one is represented by a unique integer index.

Comments differ in length, but our model has a fixed input size, so we added a pad_text function that pads comments shorter than MAX_SEQUENCE_LENGTH with zeros and truncates longer ones.

In [24]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_NUM_WORDS = 100000
MAX_SEQUENCE_LENGTH = 256
TOXICITY_COLUMN = 'target'
TEXT_COLUMN = 'comment_text'

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_dataset[TEXT_COLUMN])

def pad_text(texts, tokenizer):
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_SEQUENCE_LENGTH)
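
For intuition, here is a pure-Python sketch of what padding does with the default settings of `pad_sequences` (zeros are added at the front, and overly long sequences lose their first items). The helper `pad_pre` is hypothetical and only mimics that behaviour for a single sequence.

```python
def pad_pre(sequence, maxlen, value=0):
    # Keep only the last `maxlen` items (pre-truncation) ...
    trimmed = sequence[-maxlen:]
    # ... and left-pad with `value` up to `maxlen` (pre-padding).
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

Index 0 is never assigned to a word by the tokenizer, so it can safely serve as the padding value.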

The embeddings matrix maps each tokenizer word index, rather than the word itself, to its 300-dimensional vector.

In [25]:
EMBEDDINGS_DIMENSION = 300
embeddings_matrix = np.zeros((len(tokenizer.word_index) + 1,EMBEDDINGS_DIMENSION))

for word, index in tokenizer.word_index.items():
    if word in embeddings:
        embeddings_matrix[index] = embeddings[word]
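
The same construction can be sketched on toy data. Everything below is hypothetical (a 3-dimensional lookup and a three-word index); it only illustrates that row 0 is reserved for padding and that words without a pretrained vector keep an all-zero row.

```python
# Hypothetical toy data: a 3-dimensional "pretrained" lookup and a word index
# in the style of tokenizer.word_index (indices start at 1; 0 is padding).
toy_vectors = {"cool": [0.1, 0.2, 0.3], "thank": [0.4, 0.5, 0.6]}
toy_word_index = {"cool": 1, "thank": 2, "qwertyish": 3}
dim = 3

# One row per index, plus row 0 for padding; unknown words stay all-zero.
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, index in toy_word_index.items():
    if word in toy_vectors:
        matrix[index] = list(toy_vectors[word])

# Number of indexed words that have a pretrained vector.
covered = sum(1 for w in toy_word_index if w in toy_vectors)
print(matrix[1], matrix[3])  # [0.1, 0.2, 0.3] [0.0, 0.0, 0.0]
print(covered, "of", len(toy_word_index), "words covered")
```

Counting covered words this way is a cheap check of how much of the vocabulary the pretrained vectors actually reach.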

Prepare all data before modeling

In [26]:
# training
train_text = pad_text(train_dataset[TEXT_COLUMN], tokenizer)
train_labels = train_dataset[TOXICITY_COLUMN]

# validation step
validate_text = pad_text(validation_dataset[TEXT_COLUMN], tokenizer)
validate_labels = validation_dataset[TOXICITY_COLUMN]

# Modeling

Our model is described in the paper, where you can also find a diagram and a description of each layer. In short, it is a bidirectional GRU recurrent network for binary classification. It also has a 1D convolutional layer.

In [28]:
from keras.layers import Embedding, Input, Dense, CuDNNGRU,concatenate, Bidirectional, SpatialDropout1D, Conv1D, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.optimizers import RMSprop, Adam
from keras.models import Model
from keras.callbacks import EarlyStopping

# (256x1) The input is a comment encoded as a sequence of tokenizer word indexes.
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# (256x300) The embedding layer transforms every word index into a 300-dimensional vector.
embedding_layer = Embedding(len(tokenizer.word_index) + 1,
                            EMBEDDINGS_DIMENSION,
                            weights=[embeddings_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

x = embedding_layer(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)   
x = Conv1D(64, kernel_size = 2, padding = "valid", kernel_initializer = "he_uniform")(x)

avg_pooling = GlobalAveragePooling1D()(x)
max_pooling = GlobalMaxPooling1D()(x)     

x = concatenate([avg_pooling, max_pooling])

dense = Dense(1, activation='sigmoid')(x)

model = Model(sequence_input, dense)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 256)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 256, 300)     90366600    input_1[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d_1 (SpatialDro (None, 256, 300)     0           embedding_1[0][0]                
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 256, 128)     140544      spatial_dropout1d_1[0][0]        
__________________________________________________________________________________________________
conv1d_1 (
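
The two parameter counts visible in the summary above can be reproduced by hand. This is a worked check, not output from the notebook: `vocab_rows` is inferred by dividing the embedding parameter count by 300, and the GRU formula assumes the cuDNN layout with separate input and recurrent bias vectors per gate.

```python
vocab_rows = 301222   # inferred: 90366600 / 300 rows in the embedding matrix
emb_dim = 300
units = 64

embedding_params = vocab_rows * emb_dim

# One CuDNNGRU direction: 3 gates, each with an input kernel, a recurrent
# kernel and two bias vectors.
gru_one_direction = 3 * (units * emb_dim + units * units + 2 * units)
bidirectional_params = 2 * gru_one_direction

print(embedding_params)      # 90366600
print(bidirectional_params)  # 140544
```

Since the embedding layer is created with `trainable=False`, its 90366600 parameters are frozen and only the remaining layers are trained.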

In [29]:
model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['acc']
)

# We train until the model stops improving: early stopping monitors the validation loss with a patience of 3 epochs.
The maximum number of epochs is 100 and the batch size is 1024. This is a fairly large batch size, but we have a large dataset.

In [31]:
model.fit(
    train_text,
    train_labels,
    batch_size=1024,
    epochs=100,
    validation_data=(validate_text, validate_labels),
    callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)])

Train on 1624386 samples, validate on 180488 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 00011: early stopping
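
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of patience-based early stopping, not Keras' implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, not the values from the run above.
made_up = [0.30, 0.25, 0.24, 0.26, 0.25, 0.27]
print(early_stopping_epoch(made_up))  # 6
```

Under this logic, stopping at epoch 11 with a patience of 3 would mean the lowest validation loss was reached at epoch 8. The per-epoch losses are not shown in the log, so this cannot be confirmed here.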


# Save model progress

In [32]:
print("Saving model..")
model.save('gru.h5')

print("Saving embeddings..")
pd.DataFrame(embeddings_matrix).to_csv('./embedding_matrix.gru.csv')

print("Saving word tokenizer..")
with open('tokenizer.gru.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

Saving model..
Saving embeddings..
Saving word tokenizer..
