Okay, let's try a word embedding model. Hopefully this will be a bit more robust than the first model I played with, since the words are better represented than they are by a simple tokenizer.

First, let's import our libraries and dataset.

In [41]:
import os
import sys
import numpy as np
import pandas as pd
import csv
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model

In [42]:
# Set up a few parameters
base_dir = ''
glove_dir = os.path.join(base_dir, 'glove.6B')
max_seq_len = 1000
max_num_words = 20000
embedding_dim = 100
validation_split = 0.2

In [47]:
# Load the data into pandas for easy handling
data = pd.read_csv('train.csv')
# Hold out 20% of the rows as a local test set
train, test = train_test_split(data, test_size=0.2)

In [49]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
57055,9887960831f7ffa0,"""Re:MOTDThere is no need to be sorry, we all a...",0,0,0,0,0,0
55384,93f917f8f0342a26,"""\nhalf of it is pure quackery. that wouldn't ...",0,0,0,0,0,0
345,00dcf539dd64e23e,"""\nYes, it looks much better than before. At t...",0,0,0,0,0,0
159486,fed9fcfd8505a0ad,", spent by private U.S. citizens",0,0,0,0,0,0
97690,0a9b776d6b5befb6,REVERSIONS \n\nAny particular reason you are r...,0,0,0,0,0,0


In [50]:
test.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
40664,6c8b054e2f4b59b0,"""\nThat must have been part of the new Greek b...",0,0,0,0,0,0
10449,1b9ca6cddc442f50,Survivors \n\nUnder this head we say there are...,0,0,0,0,0,0
75259,c94f62ef7644c6b5,"""\n\nCome on, use your common sense. You don't...",0,0,0,0,0,0
100667,1ac27752e1cb3b71,""" \n\nAnd what I'm writing is classified as an...",0,0,0,0,0,0
5063,0d6bd0c4dc3cd9a1,(27 Feb 2006 - 6 Apr 2006),0,0,0,0,0,0


In [51]:
# Quick sanity check: does any column contain missing values?
train.isnull().any()

id               False
comment_text     False
toxic            False
severe_toxic     False
obscene          False
threat           False
insult           False
identity_hate    False
dtype: bool

So our dataset is adequately clean (Kaggle sure is nice). I've gone ahead and split the Kaggle "training" data into training and testing sets. I want to test on a held-out subset of the data so I can validate the model locally. Once I'm confident in it, I can submit predictions to Kaggle for out-of-sample validation. That said, this is a luxury I seldom have, so I don't want to lean on it like a crutch.
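One thing worth noting about the split: without a seed, `train_test_split` shuffles differently on every run, so the rows shown above will change if the notebook is re-executed. Here is a minimal sketch of a repeatable split, assuming a made-up toy frame in place of `train.csv` and an arbitrary seed of 42 (neither is from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical values, illustration only)
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [0, 1] * 5,
})

# Fixing random_state makes the shuffle, and therefore the split, repeatable
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# Both calls select exactly the same rows
same_split = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)

# No row lands in both the training and the testing set
no_overlap = set(train_a.index).isdisjoint(test_a.index)
```

Whether to pin the seed is a judgment call. It makes results comparable between runs, at the cost of always evaluating on the same 20% of the data.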

In [52]:
# Let's chop up the data into the component pieces we want
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]
list_sentences_test = test["comment_text"]

In [55]:
list_sentences_train.head()

57055     "Re:MOTDThere is no need to be sorry, we all a...
55384     "\nhalf of it is pure quackery. that wouldn't ...
345       "\nYes, it looks much better than before. At t...
159486                     , spent by private U.S. citizens
97690     REVERSIONS \n\nAny particular reason you are r...
Name: comment_text, dtype: object

In [54]:
list_sentences_test.head()

40664     "\nThat must have been part of the new Greek b...
10449     Survivors \n\nUnder this head we say there are...
75259     "\n\nCome on, use your common sense. You don't...
100667    " \n\nAnd what I'm writing is classified as an...
5063                             (27 Feb 2006 - 6 Apr 2006)
Name: comment_text, dtype: object

In [58]:
y[:5]

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

Okay, the dataset looks pretty hunky-dory. Let's pre-process it with some GloVe embeddings.
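As a rough idea of what that involves: GloVe's pretrained vectors ship as plain text, one token per line followed by its space-separated vector components. The sketch below parses that format into a dictionary. The helper name `load_glove_vectors` and the three-dimensional sample are made up for illustration, and this is not necessarily how the rest of this notebook does it.

```python
import io
import numpy as np

def load_glove_vectors(lines, embedding_dim):
    """Parse GloVe-format text lines into a {word: vector} dict.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    such as an open file handle.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        # Expect one token plus embedding_dim numbers; skip anything else
        if len(parts) != embedding_dim + 1:
            continue
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

# Tiny made-up sample standing in for a real GloVe file
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove_vectors(sample, embedding_dim=3)
```

With the real data, the same function could be handed an open file from `glove_dir` (for example the 100-dimensional file, if that is the one downloaded) along with `embedding_dim`.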