# Train Toxicity Model

This notebook trains a model to detect toxicity in online comments. It uses a CNN architecture for text classification, trained on the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973) with pre-trained GloVe embeddings, which can be downloaded from:
http://nlp.stanford.edu/data/glove.6B.zip
(source page: http://nlp.stanford.edu/projects/glove/).

This model is a modification of [example code](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py) found in the [Keras GitHub repository](https://github.com/fchollet/keras) and released under an [MIT license](https://github.com/fchollet/keras/blob/master/LICENSE). For the full license text, see the [online copy](https://github.com/fchollet/keras/blob/master/LICENSE) or the file KERAS_LICENSE in this repository.

## Usage Instructions
(TODO: nthain) - Move to README

Prior to running the notebook, you must:

* Download the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973)
* Download pre-trained [GloVe embeddings](http://nlp.stanford.edu/data/glove.6B.zip)
* (optional) To skip the training step, you will need to download a model and tokenizer file. We are looking into the appropriate means for distributing these (sometimes large) files.
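
Once the GloVe archive is unpacked, each line of a vector file holds a word followed by its space-separated vector components. The following is a minimal sketch of turning such lines into an embedding matrix, written in plain Python so it runs without extra packages. The function names `parse_glove_lines` and `build_embedding_matrix` are hypothetical and are not part of `model_tool`; the `word_index` argument is assumed to map each word to a positive integer id, as a Keras tokenizer does.

```python
def parse_glove_lines(lines):
    """Parse GloVe-format text lines into a {word: [float, ...]} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


def build_embedding_matrix(word_index, embeddings, dim):
    """Build a (len(word_index) + 1) x dim matrix as a list of rows.

    Row 0 is left as zeros (assumed to be the padding id). Words that
    have no pre-trained vector also keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None and len(vector) == dim:
            matrix[idx] = vector
    return matrix


# Toy usage with two-dimensional vectors instead of a real GloVe file.
toy_embeddings = parse_glove_lines(["the 0.1 0.2", "cat 0.3 0.4"])
toy_matrix = build_embedding_matrix({"the": 1, "dog": 2}, toy_embeddings, 2)
```

With a real file, the lines would come from an open file handle (for example one of the `.txt` files inside `glove.6B.zip`), and `dim` would match the `embedding_dim` used for training.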

In [None]:
!pip install --upgrade tensorflow==1.4

Collecting tensorflow==1.4
  Downloading tensorflow-1.4.0-cp27-cp27mu-manylinux1_x86_64.whl (40.7MB)


In [7]:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd

from model_tool import ToxModel
from gensim.models.keyedvectors import KeyedVectors
from nltk.corpus import stopwords
import re



ImportError: cannot import name np_utils

In [6]:
!pip install keras==2.0.6

Collecting keras==2.0.6
  Downloading Keras-2.0.6.tar.gz (228kB)
Collecting theano (from keras==2.0.6)
  Downloading Theano-1.0.1.tar.gz (2.8MB)
Building wheels for collected packages: keras, theano
  Running setup.py bdist_wheel for keras ... done
  Stored in directory: /home/dcek/.cache/pip/wheels/c2/80/ba/2beab8c2131e2dcc391ee8a2f55e648af66348115c245e0839
  Running setup.py bdist_wheel for theano ... done
  Stored in directory: /home/dcek/.cache/pip/wheels/46/a2/7d/b4cac381d5151daa9f9e0b3e4e4b65edaea6355ae296c97cf2
Successfully built keras theano
Installing collected packages: theano, keras
  Found existing installation: Keras 2.1.3
    Uninstalling Keras-2.1.3:
      Successfully uninstalled Keras-2.1.3
Successfully installed keras-2.0.6 theano-1.0.1


## Load Data

In [2]:
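# Load the cleaned training set and preview the first 20 comments.
training_3rd_vis = pd.read_csv('cleaned_final_train.csv')
display(training_3rd_vis["comment_text"].head(n=20))
# len() of a DataFrame is its row count, so the value printed here is the number of records.
n_records_features_vis_3rd = len(training_3rd_vis)
print(" Number of features {}".format(n_records_features_vis_3rd))
# Path passed to ToxModel.train() below.
train = 'cleaned_final_train.csv'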


0     explanation edits made username hardcore metal...
1     aww matches background colour seemingly stuck ...
2     hey man really trying edit war guy constantly ...
3     make real suggestions improvement wondered sec...
4                         sir hero chance remember page
5              congratulations well use tools well talk
6                           cocksucker piss around work
7     vandalism matt shirvington article reverted pl...
8     sorry word nonsense offensive anyway intending...
9                  alignment subject contrary dulithgow
10    fair use rationale image wonju jpg thanks uplo...
11                     bbq man lets discuss maybe phone
12    hey talk exclusive group wp talibans good dest...
14    oh girl started arguments stuck nose belong be...
15    juelz santanas age two zero zero two juelz san...
16              bye look come think comming back tosser
17      redirect talk voydan pop georgiev chernodrinski
18    mitsurugi point made sense argue include h

 Number of features 638284


In [3]:
model_list = []

## Train Models

In [None]:
MODEL_NAME = 'augmentori_cleaned_singlebigru_fasttext_1st'
debias_random_model = ToxModel()
debias_random_model.train(
    2,
    train,
    text_column='comment_text',
    toxic='toxic',
    severe_toxic='severe_toxic',
    obscene='obscene',
    threat='threat',
    insult='insult',
    identity_hate='identity_hate',
    model_name=MODEL_NAME,
    model_list=model_list)

Hyperparameters
---------------
max_num_words: 199814
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 0.0007
es_patience: 1
batch_size: 256
embedding_dim: 300
epochs: 1
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: False
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
train_text_temp shape (638284, 250) and train_labels_temp shape (638284, 6)
 ---- 
train_text shape (638284, 250) and train_labels shape (638284, 6)
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 574456 samples, validate on 63828 samples
Epoch 1/1

Epoch 00001: val_loss improved from inf to 0.04554, saving model to models/augmentori_cleaned_singlebigru_fasttext_1st0_model.h5
 - 652s - loss: 0.0676 - acc: 0.9767 - val_loss: 0.0455 - val_acc: 0.9821
Epoch 0 auc 0.983094783184 best_auc -1
Train on 574456 samples, validate on 63828 sa
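
The training log above reports an `auc` value after each epoch but does not show how it is computed. One common choice for multi-label output like this is the mean of the per-label ROC AUC scores. The following is a minimal sketch under that assumption, in plain Python; the function names `roc_auc` and `mean_column_auc` are hypothetical and are not part of `model_tool`.

```python
def roc_auc(labels, scores):
    """ROC AUC for one binary label, using the rank-sum (Mann-Whitney U) formula."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores [i, j) and give each the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def mean_column_auc(label_rows, score_rows):
    """Average the per-column AUC over all label columns."""
    n_cols = len(label_rows[0])
    aucs = []
    for c in range(n_cols):
        col_labels = [row[c] for row in label_rows]
        col_scores = [row[c] for row in score_rows]
        aucs.append(roc_auc(col_labels, col_scores))
    return sum(aucs) / len(aucs)
```

In practice a library routine would normally replace the hand-written `roc_auc`; the sketch only illustrates the averaging over the six label columns.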

In [None]:
from keras.models import load_model
import os
model_list = []
for fold_id in range(0, 10):
    model_path = 'augmentori_gru_lstm' + str(fold_id)
    model = load_model(
        os.path.join('models', '%s_model.h5' % model_path))
    model_list.append(model)
    

In [8]:
from keras.models import load_model
import numpy as np
import os

# Load the ten per-fold models, then overwrite each with its separately saved weights.
model_list = []
for fold_id in range(0, 10):
    model_name = 'augmentori_gru_lstm' + str(fold_id)
    model = load_model(
        os.path.join('models', '%s_model.h5' % model_name))
    weights_path = os.path.join('models', "model{0}_weights.npy".format(fold_id))
    weights = np.load(weights_path)
    model.set_weights(weights)
    model_list.append(model)

In [11]:
from keras.preprocessing.sequence import pad_sequences
import cPickle
import os

def prep_text(texts):
    """Turns text into padded sequences.

    The fitted tokenizer must already be saved under models/ before
    calling this function.

    Args:
      texts: Sequence of text strings.

    Returns:
      A tokenized and padded text sequence as a model input.
    """
    model_name = 'augmentori_gru_lstm'
    tokenizer_path = os.path.join('models', '%s_tokenizer.pkl' % model_name)
    # Use a context manager so the pickle file is closed after loading.
    with open(tokenizer_path, 'rb') as f:
        tokenizer = cPickle.load(f)
    text_sequences = tokenizer.texts_to_sequences(texts)
    return pad_sequences(text_sequences, maxlen=250)
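
To make the effect of `pad_sequences(..., maxlen=250)` concrete, here is a minimal plain-Python sketch of its default behaviour for a single sequence: short sequences are left-padded with zeros, and long sequences keep only their last `maxlen` tokens. The helper name `pad_sequence` is hypothetical; the real Keras function works on a batch and returns a NumPy array.

```python
def pad_sequence(seq, maxlen, value=0):
    """Mimic Keras pad_sequences defaults (padding='pre', truncating='pre')
    for one sequence of token ids."""
    seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
    return [value] * (maxlen - len(seq)) + seq  # pad at the front if too short


short = pad_sequence([7, 8, 9], maxlen=5)
long = pad_sequence(range(10), maxlen=5)
```

Because truncation happens at the front, a comment longer than 250 tokens is represented by its final 250 tokens.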

In [12]:

meta_train = pd.read_csv('final_train.csv')
X = prep_text(meta_train['comment_text'])

n_folds = 10
fold_size = len(X) // n_folds
fold_predictions = []
for fold_id in range(n_folds):
    fold_start = fold_size * fold_id
    # The last fold also takes the rows left over by the integer division.
    fold_end = len(X) if fold_id == n_folds - 1 else fold_start + fold_size

    val_x = X[fold_start:fold_end]

    # Predict each fold's rows with the model stored at the same index.
    fold_predictions.append(
        model_list[fold_id].predict(val_x, batch_size=128))

# Stack the per-fold predictions back into the original row order.
total_meta = np.concatenate(fold_predictions, axis=0)
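
The fold boundaries used in the loop above can be checked in isolation. The following is a small sketch of the same arithmetic; the helper name `fold_bounds` is hypothetical. Each fold holds `n_rows // n_folds` rows, and the last fold also absorbs the remainder.

```python
def fold_bounds(n_rows, n_folds):
    """Return a list of (start, end) index pairs, one per fold."""
    fold_size = n_rows // n_folds
    bounds = []
    for fold_id in range(n_folds):
        start = fold_size * fold_id
        # The last fold runs to the end so no rows are dropped.
        end = n_rows if fold_id == n_folds - 1 else start + fold_size
        bounds.append((start, end))
    return bounds


bounds = fold_bounds(638284, 10)
first_fold_size = bounds[0][1] - bounds[0][0]
last_fold_size = bounds[-1][1] - bounds[-1][0]
```

For the 638284-row training set, a regular fold has 63828 rows, which matches the "validate on 63828 samples" line in the training log, while the last fold has 63832 rows.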

In [13]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
subm = pd.read_csv('sample_submission.csv')
submid = pd.DataFrame({'id': subm["id"]})
total_meta_data = pd.concat([submid, pd.DataFrame(total_meta, columns = label_cols)], axis=1)


In [14]:
display(total_meta_data.head(n=20))


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,5.298672e-07,1.361637e-10,7.52802e-09,3.818864e-10,3.872117e-08,7.897755e-10
1,0000247867823ef7,2.175462e-06,2.507667e-09,3.737669e-08,7.730432e-10,1.419462e-07,1.429161e-09
2,00013b17ad220c46,0.01382709,1.843092e-07,4.588822e-06,1.099868e-08,9.896796e-06,1.191894e-07
3,00017563c3f7919a,4.313957e-07,9.578123e-11,6.308026e-09,3.524564e-10,3.479985e-08,8.696782e-10
4,00017695ad8997eb,2.839943e-06,5.312022e-09,1.006481e-07,1.515298e-09,3.593553e-07,2.260464e-09
5,0001ea8717f6de06,2.226181e-06,1.201818e-08,6.024056e-08,2.033103e-09,2.641394e-07,3.800403e-09
6,00024115d4cbde0f,0.9997441,0.0009550502,0.9863858,2.46944e-08,0.9533083,2.931226e-05
7,000247e83dcc1211,7.83567e-05,3.863176e-08,1.40317e-07,7.525878e-10,6.863291e-07,3.154716e-09
8,00025358d4737918,6.994727e-07,4.960894e-10,1.851791e-08,6.080487e-10,4.313723e-08,7.851165e-10
9,00026d1092fe71cc,1.34181e-06,1.451595e-09,2.378028e-08,4.642549e-10,8.334749e-08,4.743154e-10


In [15]:
total_meta_data.to_csv('augmentori_meta_grulstmCV_nopretrain.csv', index=False)

In [8]:
test_predicts = pd.read_csv('gru_cv_output.csv')
display(test_predicts.head(n=20))
test_predicts.shape


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999538,0.6181733,0.99764,0.05096922,0.9898657,0.3255197
1,0000247867823ef7,1.3e-05,1.247423e-07,3e-06,4.708379e-08,2.553956e-06,1.125692e-06
2,00013b17ad220c46,5.5e-05,1.836647e-06,2e-05,6.44184e-07,1.196271e-05,7.204705e-06
3,00017563c3f7919a,1.3e-05,2.737614e-07,3e-06,1.778666e-07,1.700491e-06,5.387048e-07
4,00017695ad8997eb,0.000919,9.784006e-06,0.000113,2.973163e-06,2.222102e-05,6.399844e-06
5,0001ea8717f6de06,0.000131,1.483309e-06,4.8e-05,9.474676e-07,1.413156e-05,4.073511e-06
6,00024115d4cbde0f,4.3e-05,6.720049e-07,1.2e-05,7.15059e-07,9.158607e-06,1.265853e-06
7,000247e83dcc1211,0.295549,0.0002245386,0.005013,3.404184e-06,0.006566032,0.0001335455
8,00025358d4737918,0.029627,5.836995e-06,0.000559,3.917941e-07,0.001921604,1.296143e-05
9,00026d1092fe71cc,1.4e-05,1.825587e-07,2e-06,4.109103e-07,1.30062e-06,3.998649e-07


(153164, 7)

### Random model

In [3]:
MODEL_NAME = 'multi-labelNLP_charrnn'
debias_random_model = ToxModel()
debias_random_model.train(
    1,
    train,
    text_column='comment_text',
    toxic='toxic',
    severe_toxic='severe_toxic',
    obscene='obscene',
    threat='threat',
    insult='insult',
    identity_hate='identity_hate',
    model_name=MODEL_NAME)

Hyperparameters
---------------
max_num_words: 50000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 7e-05
es_patience: 1
batch_size: 128
embedding_dim: 300
epochs: 50
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
train_text_temp shape (159571, 250) and train_labels_temp shape (159571, 4)
 ---- 
valid_text shape (15958, 250) and valid_labels shape (15958, 4)
train_text shape (143613, 250) and train_labels shape (143613, 4)
Data prepared!
Loading embeddings...
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
nicknamer not in vocabulary
 not in vocabulary
 not in vocabulary
songfulness not in vocabulary
Elonore not in vocabulary
homager not in vocabulary
forebodingness not in vocabulary
 not in vocabulary
Fredek not in vocabulary
 not in v

 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
dissipatedness not in vocabulary
 not in vocabulary
 not in vocabulary
musicianships not in vocabulary
mahjongs not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
campanologist not in vocabulary
 not in vocabulary
reprinter not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Christianizes not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
dyslexically not in vocabulary
preluder not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
moderatenesses not in vocabulary
Ingamar not in vocabulary
jiujitsus not in vocabulary
subcomputation not in vocabulary
 not in vocabulary
archfiends not in vocabulary
 not in vocabulary
foxtrotting not in vocabulary
electroencephalographs not in vocabulary
adversene

 not in vocabulary
Nikaniki not in vocabulary
Arianist not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
knowledgeableness not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
zigzagger not in vocabulary
Exchequers not in vocabulary
 not in vocabulary
 not in vocabulary
magnetohydrodynamical not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Jurua not in vocabulary
 not in vocabulary
chanciness not in vocabulary
 not in vocabulary
maydays not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
trinketed not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
outargued not in vocabulary
Reichstags not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in voca

outfaced not in vocabulary
dumdums not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Sanskritize not in vocabulary
 not in vocabulary
geocentricism not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
uraniums not in vocabulary
 not in vocabulary
disreputableness not in vocabulary
monarchistic not in vocabulary
hdqrs not in vocabulary
Fahrenheits not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
disinformations not in vocabulary
pericardia not in vocabulary
 not in vocabulary
Nefen not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
Aguistin not in vocabulary
 not in vocabulary
Gayel not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabular

disreputes not in vocabulary
 not in vocabulary
Gerhardine not in vocabulary
 not in vocabulary
 not in vocabulary
monkeyshine not in vocabulary
hydraulicking not in vocabulary
friedcake not in vocabulary
dratting not in vocabulary
 not in vocabulary
Melamie not in vocabulary
 not in vocabulary
 not in vocabulary
womenfolks not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
doters not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
legmen not in vocabulary
Americanizations not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
clamorer not in vocabulary
 not in vocabulary
 not in vocabulary
Rayshell not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
promenader not in vocabulary
balefuller not in vocabulary
nonbelligerent not in vocabulary
 not in vocabulary
barning not

 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
demimondaines not in vocabulary
stormbound not in vocabulary
Katuscha not in vocabulary
 not in vocabulary
 not in vocabulary
Americanizations not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Caroljean not in vocabulary
Aprilette not in vocabulary
epiglottises not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
saltinesses not in vocabulary
 not in vocabulary
shirtmake not in vocabulary
 not in vocabulary
 not in vocabulary
clandestineness not in vocabulary
 not in vocabulary
Ingeberg not in vocabulary
Agace not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
glottalization not in vocabulary
ignorantness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
backarrow not in

Rochella not in vocabulary
 not in vocabulary
dormants not in vocabulary
capeskins not in vocabulary
unbudging not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
tenderheartedness not in vocabulary
 not in vocabulary
 not in vocabulary
cosignatory not in vocabulary
instituter not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
underusing not in vocabulary
schussboomer not in vocabulary
 not in vocabulary
empaneling not in vocabulary
 not in vocabulary
Melisandra not in vocabulary
 not in vocabulary
darkener not in vocabulary
nonvoter not in vocabulary
Guendolen not in vocabulary
clawer not in vocabulary
 not in vocabulary
 not in vocabulary
squidded not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Kakalina not in vocabulary
 not in vocabulary
 no

 not in vocabulary
courteousnesses not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
passioning not in vocabulary
 not in vocabulary
 not in vocabulary
indubitableness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
supremal not in vocabulary
Kamillah not in vocabulary
cherisher not in vocabulary
 not in vocabulary
heister not in vocabulary
 not in vocabulary
Corenda not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
lassoer not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
hookedness not in vocabulary
 not in vocabulary
sextupling not in vocabulary
spininess not in vocabulary
spreeing not in vocabulary
Kwangchows not in vocabulary
 not in vocabulary
coveter not in vocabulary
 not in vocabulary
 not in vocabulary
 not in 

cacophonist not in vocabulary
 not in vocabulary
plainsongs not in vocabulary
 not in vocabulary
 not in vocabulary
bonhomies not in vocabulary
dishabilles not in vocabulary
 not in vocabulary
zigzagger not in vocabulary
Kwangchows not in vocabulary
 not in vocabulary
nonperformances not in vocabulary
reputing not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
tradeswoman not in vocabulary
 not in vocabulary
 not in vocabulary
Janenna not in vocabulary
respectableness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Diannne not in vocabulary
 not in vocabulary
 not in vocabulary
guaranis not in vocabulary
 not in vocabulary
Tildie not in vocabulary
gigacycle not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
manilas not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
groveler not in vocabulary
 not in vocabulary
twenti

branchlike not in vocabulary
polygraphing not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
disproportionates not in vocabulary
dreamlessness not in vocabulary
dishabilles not in vocabulary
 not in vocabulary
 not in vocabulary
havocked not in vocabulary
 not in vocabulary
 not in vocabulary
Chiarra not in vocabulary
 not in vocabulary
magnetohydrodynamical not in vocabulary
Jillana not in vocabulary
deathward not in vocabulary
 not in vocabulary
 not in vocabulary
sicklies not in vocabulary
battledores not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
markkaa not in vocabulary
interindex not in vocabulary
countersignatures not in vocabulary
Weidar not in vocabulary
possessional not in vocabulary
Indianian not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Sheratan not in vocabulary
 not in vocabulary
 not in vocabulary
Netzahualcoyotl not in vocabulary
bel

Sikhisms not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
meagres not in vocabulary
 not in vocabulary
Europeanizations not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
Valenka not in vocabulary
 not in vocabulary
Ransell not in vocabulary
 not in vocabulary
Trevar not in vocabulary
 not in vocabulary
 not in vocabulary
Elladine not in vocabulary
detailedness not in vocabulary
 not in vocabulary
sharkskins not in vocabulary
Nikolia not in vocabulary
nepotist not in vocabulary
dozenths not in vocabulary
blacksnakes not in vocabulary
Janenna not in vocabulary
technocracies not in vocabulary
 not in vocabulary
 not in vocabulary
hatstands not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
reservednesses not in vocabulary
eloquences n

disproportionates not in vocabulary
 not in vocabulary
butterfats not in vocabulary
[] not in vocabulary List is empty
temporarinesses not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
waitpeople not in vocabulary
 not in vocabulary
avarices not in vocabulary
 not in vocabulary
monkeyshine not in vocabulary
jocoseness not in vocabulary
companionableness not in vocabulary
 not in vocabulary
stockpiler not in vocabulary
beseecher not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Sanskritize not in vocabulary
 not in vocabulary
intraindustry not in vocabulary
 not in vocabulary
perspicuousness not in vocabulary
Gabriellia not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
excusableness not in vocabulary
 not in vocabulary
supersaturate not in vocabulary
 not in vocabulary
 not in vocabulary
acmes not in voc

haywires not in vocabulary
nigglers not in vocabulary
cabstand not in vocabulary
 not in vocabulary
 not in vocabulary
Caralie not in vocabulary
 not in vocabulary
Barbabra not in vocabulary
Indianian not in vocabulary
 not in vocabulary
fruitfuller not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Hatchure not in vocabulary
valetudinarianism not in vocabulary
nonadministrative not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
repaves not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
shrewishly not in vocabulary
 not in vocabulary
underregistration not in vocabulary
 not in vocabulary
Weisenheimer not in vocabulary
 not in vocabulary
Gaultiero not in vocabulary
[] not in vocabulary List is empty
magnetohydrodynamical not in vocabulary
williwaws not in vocabulary
 not in vocabulary


OHiggins not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Nerti not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
miscopying not in vocabulary
dogtrotting not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
irremediableness not in vocabulary
kilocycle not in vocabulary
 not in vocabulary
Murvyn not in vocabulary
humaner not in vocabulary
 not in vocabulary
 not in vocabulary
glibber not in vocabulary
 not in vocabulary
kriegspiel not in vocabulary
 not in vocabulary
Gerianna not in vocabulary
 not in vocabulary
fopping not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
redyes not in vocabulary
 not in vocabulary
Celisse not in vocabulary
Ulberto not in vocabulary
 not in vocabulary
Hurleigh not in vocabulary
vibraharps not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
cascaras not in vocabulary
malarkeys not in vocabulary
jingoisms not in vocabulary

 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
monkeyshine not in vocabulary
 not in vocabulary
devilments not in vocabulary
magnetohydrodynamical not in vocabulary
 not in vocabulary
 not in vocabulary
hopples not in vocabulary
 not in vocabulary
 not in vocabulary
marshiness not in vocabulary
[] not in vocabulary List is empty
delinquently not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
designational not in vocabulary
 not in vocabulary
thirtieths not in vocabulary
mortifier not in vocabulary
Yanaton not in vocabulary
 not in vocabulary
retroflection not in vocabulary
Myrwyn not in vocabulary
 not in vocabulary
changeablenesses not in vocabulary
warinesses not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
dadaisms not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
[] not in 

kolas not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
insinuator not in vocabulary
 not in vocabulary
spatterdock not in vocabulary
 not in vocabulary
compartmentalizations not in vocabulary
counterpoising not in vocabulary
 not in vocabulary
dismals not in vocabulary
 not in vocabulary
mortgageable not in vocabulary
 not in vocabulary
 not in vocabulary
Visakhapatnams not in vocabulary
 not in vocabulary
Menkalinan not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
tympanums not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
Frederigo not in vocabulary
 not in vocabulary
 not in vocabulary
Wafs not in vocabulary
autocorrelate not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
antiformant not in vocabulary
 not in vocabulary
greathearted not in vocabulary
Bahamanians not in vocabulary
Wezen not in vocabulary
 not in vocabulary
 not in voc

 not in vocabulary
solvently not in vocabulary
 not in vocabulary
facsimiled not in vocabulary
 not in vocabulary
 not in vocabulary
insentience not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
steeplejacks not in vocabulary
 not in vocabulary
Allianora not in vocabulary
balkiness not in vocabulary
Kelbee not in vocabulary
 not in vocabulary
underregistration not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
satisfactoriness not in vocabulary
contentments not in vocabulary
Kessia not in vocabulary
 not in vocabulary
 not in vocabulary
prexes not in vocabulary
ballyhoos not in vocabulary
cyclopedias not in vocabulary
 not in vocabulary
 not in vocabulary
AstroTurfs not in vocabulary
 not in vocabulary
imping not in vocabulary
Gabrila not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabular

[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
paludal not in vocabulary
Hanukas not in vocabulary
 not in vocabulary
almoners not in vocabulary
communized not in vocabulary
 not in vocabulary
 not in vocabulary
Emelyne not in vocabulary
Americanizations not in vocabulary
 not in vocabulary
 not in vocabulary
landslid not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
pitapats not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
comprehensibleness not in vocabulary
fierier not in vocabulary
 not in vocabulary
indistinguishableness not in vocabulary
Gwenore not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
Faunie not in vocabulary
 not in vocabulary
 not in vocabulary
intraline not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary

 not in vocabulary
 not in vocabulary
wintertimes not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
spectrographically not in vocabulary
squabbest not in vocabulary
Mullikan not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Anglophobes not in vocabulary
loyaler not in vocabulary
 not in vocabulary
shadowiness not in vocabulary
outhits not in vocabulary
 not in vocabulary
 not in vocabulary
Tallahoosa not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
extricable not in vocabulary
deallocator not in vocabulary
Gallicism not in vocabulary
 not in vocabulary
 not in vocabulary
Lucais not in vocabulary
LOuverture not in vocabulary
wherewithals not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
sanes not in vocabulary
 not in vocabul

 not in vocabulary
jiujitsus not in vocabulary
indefatigableness not in vocabulary
obligingness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
whimseys not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
peristalses not in vocabulary
Viviyan not in vocabulary
 not in vocabulary
Mariejeanne not in vocabulary
 not in vocabulary
goldenest not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
heterogeneousness not in vocabulary
 not in vocabulary
 not i

[] not in vocabulary List is empty
courteousnesses not in vocabulary
testatrices not in vocabulary
 not in vocabulary
overcomplexity not in vocabulary
outgoes not in vocabulary
thymines not in vocabulary
feater not in vocabulary
Franciskus not in vocabulary
 not in vocabulary
 not in vocabulary
Aridatha not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
millidegrees not in vocabulary
mayoresses not in vocabulary
peristalses not in vocabulary
Hildagard not in vocabulary
 not in vocabulary
Dickensians not in vocabulary
krills not in vocabulary
 not in vocabulary
ransomer not in vocabulary
wolfishness not in vocabulary
 not in vocabulary
 not in vocabulary
Melisandra not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
singletree not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in v

atheroscleroses not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
debauchedness not in vocabulary
 not in vocabulary
swa

 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Merridie not in vocabulary
backarrow not in vocabulary
wilded not in vocabulary
gigacycle not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
forestallment not in vocabulary
 not in vocabulary
newsdealer not in vocabulary
 not in vocabulary
bulimias not in vocabulary
maharishis not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Vidovik not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Kelila not in vocabulary
gestapos not in vocabulary
 not in vocabulary
junketeered not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
handshaker not in vocabulary
 not in vocabulary
 not in vocabulary
electroencephalographs not in vocabulary
candlewicks not in vocabulary
 no

 not in vocabulary
Margette not in vocabulary
henpecks not in vocabulary
Guntar not in vocabulary
 not in vocabulary
 not in vocabulary
telemarketings not in vocabulary
 not in vocabulary
astronomies not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
permafrosts not in vocabulary
perihelia not in vocabulary
 not in vocabulary
 not in vocabulary
babbitts not in vocabulary
 not in vocabulary
ensurer not in vocabulary
dirking not in vocabulary
Gabriellia not in vocabulary
 not in vocabulary
 not in vocabulary
newsdealer not in vocabulary
crudding not in vocabulary
musculatures not in vocabulary
tomcatted not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
nervelessness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Tawsha not in vo

[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
eggheaded not in vocabulary
gorgeousnesses not in vocabulary
Ashien not in vocabulary
climbings not in vocabulary
 not in vocabulary
perverter not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
roadsweepers not in vocabulary
Tadio not in vocabulary
 not in vocabulary
 not in vocabulary
sentimentalization not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
coynesses not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
dopier not in vocabulary
 not in vocabulary
elodeas not in vocabulary
Isahella not in vocabulary
 not in vocabulary
Labradorean not in vocabulary
vitalizations not in vocabulary
 not in vocabulary
 not in vocabulary
turgidness not in vocabulary
 not in voc

 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
metallings not in vocabulary
horticultures not in vocabulary
epicyclical not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
wildcatted not in vocabulary
Ardelis not in vocabulary
perfunctoriness not in vocabulary
 not in vocabulary
patricides not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
chutzpahs not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
calciums not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
batistes not in vocabulary
vims not in vocabulary
 not in vocabulary
squirehood not in vocabulary
 not in vocabulary
 not in vocabulary
homeopathies not in vocabulary
 not in vocabulary
uncoloredness not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
cephalics not in vocabulary

 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
rebeller not in vocabulary
galliums not in vocabulary
finites not in vocabulary
 not in vocabulary
Olenek not in vocabulary
 not in vocabulary
Cthrine not in vocabulary
traditionalized not in vocabulary
 not in vocabulary
 not in vocabulary
caparisoning not in vocabulary
 not in vocabulary
Sheffielder not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
aphoristically not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
fullword not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
buddings not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
knowledgeableness not in vocabulary
figurativeness not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocab

 not in vocabulary
 not in vocabulary
 not in vocabulary
lazinesses not in vocabulary
Vitia not in vocabulary
fuhrers not in vocabulary
Porrima not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
reticences not in vocabulary
sowbelly not in vocabulary
astrakhans not in vocabulary
 not in vocabulary
Sheratan not in vocabulary
 not in vocabulary
toastmistress not in vocabulary
melancholias not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
gauziness not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
lovelinesses not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
diagrammaticality not in vocabulary
courteousnesses not in vocabulary
militarisms not in vocabulary
Reichstags not in vocabulary
bonhomies not in vocabulary
 not in vocabulary
Rafaelia not in vocabulary
 not 

exacter not in vocabulary
 not in vocabulary
caviled not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Heindrick not in vocabulary
 not in vocabulary
 not in vocabulary
Welshwoman not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
briner not in vocabulary
 not in vocabulary
 not in vocabulary
noninterchangeable not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
Valenka not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
camaraderies not in vocabulary
 not in vocabulary
Beitris not in vocabulary
vigilantist not in vocabulary
 not in vocabulary
Chiarra not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
greenswards not in vocabulary
 not in vocabulary
burnables not in vocabulary
flecker not in vocabulary
speedboating not in vocabulary
netts not in vocabulary
 not in vocabulary
 no

[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
bedaub not in vocabulary
mongolisms not in vocabulary
Aguistin not in vocabulary
 not in vocabulary
metropolitanization not in vocabulary
salacity not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
underenumerated not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
passioning not in vocabulary
[] not in vocabulary List is empty
colonelcies not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
bemire not in vocabulary
consumerisms not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
tandooris not in vocabulary
krills not in vocabulary
 not in vocabulary
polychemicals not in vocabulary
 not in vocabulary
foxtrotting not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
Gwenore not in vocabulary
 no

 not in vocabulary
 not in vocabulary
sigher not in vocabulary
chrisms not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
vealer not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
molybdenums not in vocabulary
 not in vocabulary
 not in vocabulary
argots not in vocabulary
 not in vocabulary
bleater not in vocabulary
separatenesses not in vocabulary
imping not in vocabulary
 not in vocabulary
 not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
 not in vocabulary
 not in vocabulary
seizors not in vocabulary
 not in vocabulary
brontosaurs not in vocabulary
[] not in vocabulary List is empty
ethnologies not in vocabulary
 not in vocabulary
misguidedness not in vocabulary
[] not in vocabulary List is empty
 not in vocabulary
kriegspiel not in vocabulary
 not in vocabulary
 not in vocabulary
 not in vocabulary
stinkingly not in vocabulary
 not in vocabulary
beriberis not in vocabulary
sinisterness not in vocabulary
connotative

Train on 143613 samples, validate on 15958 samples
Epoch 1/50
Epoch 00000: val_loss improved from inf to 0.08809, saving model to models/multi-labelNLP_model.h5
1049s - loss: 0.2646 - acc: 0.9442 - val_loss: 0.0881 - val_acc: 0.9708
Epoch 2/50
Epoch 00001: val_loss improved from 0.08809 to 0.06706, saving model to models/multi-labelNLP_model.h5
1045s - loss: 0.0788 - acc: 0.9715 - val_loss: 0.0671 - val_acc: 0.9708
Epoch 3/50
Epoch 00002: val_loss improved from 0.06706 to 0.05972, saving model to models/multi-labelNLP_model.h5
1011s - loss: 0.0646 - acc: 0.9774 - val_loss: 0.0597 - val_acc: 0.9833
Epoch 4/50
Epoch 00003: val_loss improved from 0.05972 to 0.05452, saving model to models/multi-labelNLP_model.h5
1000s - loss: 0.0581 - acc: 0.9799 - val_loss: 0.0545 - val_acc: 0.9837
Epoch 5/50
Epoch 00004: val_loss improved from 0.05452 to 0.05107, saving model to models/multi-labelNLP_model.h5
991s - loss: 0.0541 - acc: 0.9820 - val_loss: 0.0511 - val_acc: 0.9845
Epoch 6/50
Epoch 00005: 

ValueError: Unknown layer: Attention
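The long run of "not in vocabulary" messages in the output above presumably comes from looking up each tokenizer word in the pre-trained embeddings. The following is a minimal sketch, not the code in `model_tool`, of how such a lookup could collect the misses and report one summary instead of one line per word. The function name `build_embedding_matrix` and the toy `word_index` and `vectors` mappings are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, embedding_dim):
    """Return (matrix, missing_words) for a tokenizer index and a word->vector map.

    Row 0 is left as zeros for padding. Out-of-vocabulary words also keep a
    zero row and are collected instead of being printed one by one.
    """
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype=np.float32)
    missing = []
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[idx] = vector
    return matrix, missing


# Toy data standing in for a real tokenizer index and embedding file.
toy_index = {"hello": 1, "world": 2, "peristalses": 3}
toy_vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(toy_index, toy_vectors, embedding_dim=2)
print("%d of %d words not in vocabulary" % (len(missing), len(toy_index)))
```

Keeping the missing words in a list also makes it easy to inspect them later, for example to spot empty tokens that should have been filtered out before the lookup.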

In [8]:
MODEL_NAME = 'multi-labelNLP-second'
second_model = ToxModel()
second_model.train(
    0, train,
    text_column='comment_text',
    toxic='toxic',
    severe_toxic='severe_toxic',
    obscene='obscene',
    threat='threat',
    insult='insult',
    identity_hate='identity_hate',
    model_name=MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 7e-05
es_patience: 1
batch_size: 128
embedding_dim: 300
epochs: 50
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: False
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
train_text_temp shape (159571, 250) and train_labels_temp shape (159571, 2)
 ---- 
valid_text shape (15958, 250) and valid_labels shape (15958, 2)
train_text shape (143613, 250) and train_labels shape (143613, 2)
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 143613 samples, validate on 15958 samples
Epoch 1/50
Epoch 00000: val_loss improved from inf to 0.19217, saving model to models/multi-labelNLP-second_model.h5
493s - loss: 0.3494 - acc: 0.9083 - val_loss: 0.1922 - val_acc: 0.9470
Epoch 2/50
Epoch 00001: val_loss improved from 
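The shapes printed above (159571 rows split into 143613 training and 15958 validation rows) are consistent with holding out 10% of the data. The sketch below reproduces that arithmetic under the assumption of a simple tail split. How `ToxModel` actually selects the validation rows (shuffled or not) is not shown in this notebook, and `split_train_valid` is a hypothetical helper.

```python
def split_train_valid(rows, valid_percent=10):
    """Split a sequence into (train, valid), holding out the tail.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    split_at = len(rows) * (100 - valid_percent) // 100
    return rows[:split_at], rows[split_at:]


rows = list(range(159571))  # same row count as train_text_temp above
train_rows, valid_rows = split_train_valid(rows)
print(len(train_rows), len(valid_rows))
```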

In [7]:
debias_random_model = ToxModel(model_name="multi-labelNLP-gru-cv0") 

Hyperparameters
---------------
max_num_words: 30000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 7e-05
embedding_dim: 300
cnn_kernel_sizes: [5, 5, 5]
es_patience: 1
epochs: 50
cnn_filter_sizes: [128, 128, 128]
batch_size: 128
model_name: multi-labelNLP-gru-cv
max_sequence_length: 250
stop_early: False
embedding_trainable: False



In [4]:
second_model = ToxModel(model_name="multi-labelNLP-second") 

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 7e-05
embedding_dim: 300
cnn_kernel_sizes: [5, 5, 5]
es_patience: 1
epochs: 50
cnn_filter_sizes: [128, 128, 128]
batch_size: 128
model_name: multi-labelNLP-second
max_sequence_length: 250
stop_early: False
embedding_trainable: False



In [8]:
import numpy as np
random_test = pd.read_csv('test.csv')
np.where(pd.isnull(random_test))  # (row positions, column positions) of null cells

(array([], dtype=int64), array([], dtype=int64))
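`np.where(pd.isnull(...))` returns a pair of arrays: the row positions and the column positions of the missing cells. Two empty arrays, as above, mean the frame has no nulls. A small sketch on toy data (the frame `toy` is made up for illustration) shows what a non-empty result looks like and how to pull out the affected rows:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for test.csv; the second comment is missing.
toy = pd.DataFrame({"id": ["a1", "b2", "c3"],
                    "comment_text": ["fine", np.nan, "also fine"]})

null_rows, null_cols = np.where(pd.isnull(toy))
print(null_rows, list(toy.columns[null_cols]))  # where the missing cells are
print(toy[toy.isnull().any(axis=1)])            # the affected rows themselves
```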

In [7]:
print(random_test.iloc[52300]) #print value of null row

id                                               56cf8b7315e85f14
comment_text    SOmebody fucked up the homepage plz edit!! tha...
Name: 52300, dtype: object


In [9]:
random_test = pd.read_csv('test.csv')
#random_test = random_test.dropna()
prediction = debias_random_model.predict(random_test['comment_text'])

In [10]:
prediction.shape

(153164, 6)

In [9]:
random_test = pd.read_csv('test.csv')
random_test = random_test.dropna()
print(random_test.iloc[52300])

id                                                   231302702569
comment_text    Just a note about external links: I have remov...
Name: 52301, dtype: object


In [10]:
random_test.shape

(226997, 2)

In [8]:
# Print the first 20 prediction rows.
for p in prediction[:20]:
    print(p)

[0.94351786 0.07261173 0.87614065 0.2488063 ]
[2.8036063e-04 2.0171970e-05 3.3794012e-04 2.7047885e-05]
[0.0013453  0.00012583 0.00167132 0.00017606]
[1.0656077e-04 9.6954127e-06 1.1833789e-04 1.3261302e-05]
[9.6572866e-04 7.3803014e-05 1.2766599e-03 1.0159020e-04]
[1.4914108e-04 1.2232612e-05 1.7201710e-04 1.6546424e-05]
[3.7022735e-04 3.1974312e-05 4.6733892e-04 4.4136799e-05]
[0.03845833 0.00216745 0.05712204 0.00358418]
[0.00292935 0.00019908 0.005152   0.00027639]
[1.1728035e-04 1.0837189e-05 1.2751747e-04 1.4890081e-05]
[0.06014245 0.00027489 0.07595153 0.00041313]
[0.0037365  0.00061411 0.00671029 0.00094831]
[1.1792146e-04 6.9626149e-06 1.3104505e-04 8.9941350e-06]
[1.12934598e-04 1.16250676e-05 1.20583965e-04 1.62092383e-05]
[9.21920000e-05 9.38676385e-06 1.02012295e-04 1.30086419e-05]
[3.8325472e-04 3.0754032e-05 4.9539335e-04 4.2015748e-05]
[0.00463269 0.00045546 0.00638342 0.000654  ]
[0.00668138 0.00048621 0.01024681 0.00069852]
[1.05928797e-04 1.01692503e-05 1.12349095e-0

In [9]:
#second model
random_test = pd.read_csv('test.csv')
random_test = random_test.dropna()
prediction_second = second_model.predict(random_test['comment_text'])
prediction_second.shape

(153164, 2)

In [10]:
# Print the first 20 prediction rows of the second model.
for p in prediction_second[:20]:
    print(p)

[0.98729837 0.32326353]
[8.3768519e-04 1.1632358e-06]
[1.1443039e-03 2.0584848e-06]
[4.5535603e-04 3.8094805e-07]
[2.3912357e-02 7.7905795e-05]
[5.241920e-04 4.858725e-07]
[2.1548306e-03 5.0328690e-06]
[0.73191476 0.0132686 ]
[0.03209163 0.00011017]
[4.660586e-04 4.495613e-07]
[0.43936914 0.00131104]
[0.30907333 0.00103553]
[6.220829e-04 5.968530e-07]
[5.3532032e-04 4.9457026e-07]
[4.8731244e-04 3.8716783e-07]
[1.6188634e-03 2.8966961e-06]
[1.9687539e-02 3.7851005e-05]
[1.3295321e-02 4.3206492e-05]
[5.089241e-04 4.337683e-07]
[1.7802878e-03 3.2394330e-06]


In [11]:
random_test = pd.read_csv('test.csv')
#random_test = random_test.dropna()
test_id = random_test['id'].astype(str)

In [12]:
test_id.shape

(153164,)

In [13]:
header = ["id"]
df = pd.DataFrame(test_id, columns=header)
#display(df.head(n=20))

df.id = df.id.astype("str")
print(df.dtypes)
display(df.head(n=20))
#print(np.where(pd.isnull(df)))
#print(df.shape)
#df.reset_index(drop=True, inplace=True)



id    object
dtype: object


Unnamed: 0,id
0,00001cee341fdb12
1,0000247867823ef7
2,00013b17ad220c46
3,00017563c3f7919a
4,00017695ad8997eb
5,0001ea8717f6de06
6,00024115d4cbde0f
7,000247e83dcc1211
8,00025358d4737918
9,00026d1092fe71cc


In [14]:
#IF NO SPLIT
headers = ["toxic","severe_toxic","obscene","threat","insult","identity_hate"]
test_df = pd.DataFrame(prediction, columns=headers, dtype=float)
display(test_df.head(n=20))
print(np.where(pd.isnull(test_df)))
print(test_df.shape)
test_df.reset_index(drop=True, inplace=True)


Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0.999144,0.4910571,0.997201,0.01553519,0.9842854,0.1343174
1,1.3e-05,6.671647e-07,1e-05,1.191349e-07,4.216913e-06,5.983958e-07
2,0.000209,6.238616e-06,8.5e-05,1.348426e-06,3.153176e-05,5.135477e-06
3,2.5e-05,9.297542e-07,1.2e-05,3.098087e-07,4.252371e-06,1.555411e-07
4,0.002148,1.040108e-05,0.000469,3.862768e-05,0.000105268,2.039362e-05
5,4.2e-05,1.26899e-06,1.6e-05,2.848018e-07,4.899765e-06,3.813471e-07
6,5.1e-05,1.395919e-06,1.6e-05,6.946671e-07,7.570949e-06,4.054838e-07
7,0.071056,0.0001163827,0.001971,1.682328e-06,0.0008182763,6.897181e-06
8,0.8019,3.431676e-05,0.024083,2.839076e-06,0.1149196,1.346543e-05
9,1.1e-05,3.321367e-07,7e-06,5.973085e-08,1.487395e-06,1.68403e-07


(array([], dtype=int64), array([], dtype=int64))
(153164, 6)


In [14]:
#IF SPLIT
headers = ["toxic","severe_toxic"]
test_df_second = pd.DataFrame(prediction_second, columns=headers, dtype=float)
display(test_df_second.head(n=20))
print(np.where(pd.isnull(test_df_second)))
print(test_df_second.shape)
test_df_second.reset_index(drop=True, inplace=True)


Unnamed: 0,toxic,severe_toxic
0,0.987298,0.3232635
1,0.000838,1.163236e-06
2,0.001144,2.058485e-06
3,0.000455,3.809481e-07
4,0.023912,7.79058e-05
5,0.000524,4.858725e-07
6,0.002155,5.032869e-06
7,0.731915,0.0132686
8,0.032092,0.0001101697
9,0.000466,4.495613e-07


(array([], dtype=int64), array([], dtype=int64))
(153164, 2)


In [15]:
headers = ["obscene","threat","insult","identity_hate"]
test_df = pd.DataFrame(prediction, columns=headers, dtype=float)
display(test_df.head(n=20))
print(np.where(pd.isnull(test_df)))
print(test_df.shape)
test_df.reset_index(drop=True, inplace=True)

Unnamed: 0,obscene,threat,insult,identity_hate
0,0.943518,0.072612,0.876141,0.248806
1,0.00028,2e-05,0.000338,2.7e-05
2,0.001345,0.000126,0.001671,0.000176
3,0.000107,1e-05,0.000118,1.3e-05
4,0.000966,7.4e-05,0.001277,0.000102
5,0.000149,1.2e-05,0.000172,1.7e-05
6,0.00037,3.2e-05,0.000467,4.4e-05
7,0.038458,0.002167,0.057122,0.003584
8,0.002929,0.000199,0.005152,0.000276
9,0.000117,1.1e-05,0.000128,1.5e-05


(array([], dtype=int64), array([], dtype=int64))
(153164, 4)


In [15]:
#IF NO SPLIT
df_new = pd.concat([df,test_df], axis=1)
#df_new = df.merge(test_df, how='outer')
#df_new.id = df_new.id.astype("int")
display(df_new.head(n=20))
print(df_new.dtypes)

np.where(pd.isnull(df_new))

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999144,0.4910571,0.997201,0.01553519,0.9842854,0.1343174
1,0000247867823ef7,1.3e-05,6.671647e-07,1e-05,1.191349e-07,4.216913e-06,5.983958e-07
2,00013b17ad220c46,0.000209,6.238616e-06,8.5e-05,1.348426e-06,3.153176e-05,5.135477e-06
3,00017563c3f7919a,2.5e-05,9.297542e-07,1.2e-05,3.098087e-07,4.252371e-06,1.555411e-07
4,00017695ad8997eb,0.002148,1.040108e-05,0.000469,3.862768e-05,0.000105268,2.039362e-05
5,0001ea8717f6de06,4.2e-05,1.26899e-06,1.6e-05,2.848018e-07,4.899765e-06,3.813471e-07
6,00024115d4cbde0f,5.1e-05,1.395919e-06,1.6e-05,6.946671e-07,7.570949e-06,4.054838e-07
7,000247e83dcc1211,0.071056,0.0001163827,0.001971,1.682328e-06,0.0008182763,6.897181e-06
8,00025358d4737918,0.8019,3.431676e-05,0.024083,2.839076e-06,0.1149196,1.346543e-05
9,00026d1092fe71cc,1.1e-05,3.321367e-07,7e-06,5.973085e-08,1.487395e-06,1.68403e-07


id                object
toxic            float64
severe_toxic     float64
obscene          float64
threat           float64
insult           float64
identity_hate    float64
dtype: object


(array([], dtype=int64), array([], dtype=int64))

In [18]:
#IF SPLIT
df_new = pd.concat([df,test_df_second,test_df], axis=1)
#df_new = df.merge(test_df, how='outer')
#df_new.id = df_new.id.astype("int")
display(df_new.head(n=20))
print(df_new.dtypes)

np.where(pd.isnull(df_new))

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.987298,0.3232635,0.943518,0.072612,0.876141,0.248806
1,0000247867823ef7,0.000838,1.163236e-06,0.00028,2e-05,0.000338,2.7e-05
2,00013b17ad220c46,0.001144,2.058485e-06,0.001345,0.000126,0.001671,0.000176
3,00017563c3f7919a,0.000455,3.809481e-07,0.000107,1e-05,0.000118,1.3e-05
4,00017695ad8997eb,0.023912,7.79058e-05,0.000966,7.4e-05,0.001277,0.000102
5,0001ea8717f6de06,0.000524,4.858725e-07,0.000149,1.2e-05,0.000172,1.7e-05
6,00024115d4cbde0f,0.002155,5.032869e-06,0.00037,3.2e-05,0.000467,4.4e-05
7,000247e83dcc1211,0.731915,0.0132686,0.038458,0.002167,0.057122,0.003584
8,00025358d4737918,0.032092,0.0001101697,0.002929,0.000199,0.005152,0.000276
9,00026d1092fe71cc,0.000466,4.495613e-07,0.000117,1.1e-05,0.000128,1.5e-05


id                object
toxic            float64
severe_toxic     float64
obscene          float64
threat           float64
insult           float64
identity_hate    float64
dtype: object


(array([], dtype=int64), array([], dtype=int64))
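`pd.concat(..., axis=1)` aligns frames on their index labels, not on row position. If the ids come from a frame that was filtered with `dropna()` (so its index has gaps) while the predictions carry a fresh 0..n-1 index, the rows no longer line up and NaNs appear. The toy frames below are made up to illustrate the effect and the `reset_index(drop=True)` fix:

```python
import numpy as np
import pandas as pd

ids = pd.DataFrame({"id": ["a1", "b2", "c3", "d4"]})
ids = ids.drop(1)  # simulate a row removed by dropna(); index is now 0, 2, 3
scores = pd.DataFrame(np.array([[0.9], [0.1], [0.2]]), columns=["toxic"])

misaligned = pd.concat([ids, scores], axis=1)
aligned = pd.concat([ids.reset_index(drop=True), scores], axis=1)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

Checking the shape and the null count after the concat, as the cells above do, is a cheap way to catch this kind of misalignment.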

In [18]:
#hds = ["id","toxic","severe_toxic","obscene","threat","insult","identity_hate"]

#df2 = pd.DataFrame([[231298963278,0.5,0.5,0.5,0.5,0.5,0.5]], columns = hds)
#df_newer = df_new.append(df2)

In [13]:
df_new.shape

(153164, 7)

In [16]:
head = ["id","toxic","severe_toxic","obscene","threat","insult","identity_hate"]
df_new.to_csv('cv_gru_output.csv', columns = head, index=False)
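Before using the written file, it can help to confirm that the header matches the expected columns and that every score is a probability. The following is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and the sample row reuses values from the table above.

```python
import csv
import io

EXPECTED = ["id", "toxic", "severe_toxic", "obscene",
            "threat", "insult", "identity_hate"]


def check_submission(handle, expected_columns=EXPECTED):
    """Return the number of data rows; raise ValueError on a malformed file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != expected_columns:
        raise ValueError("unexpected header: %r" % (reader.fieldnames,))
    count = 0
    for row in reader:
        for name in expected_columns[1:]:
            value = float(row[name])
            if not 0.0 <= value <= 1.0:
                raise ValueError("%s out of range in row %s" % (name, row["id"]))
        count += 1
    return count


sample = io.StringIO(
    "id,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "00001cee341fdb12,0.987298,0.3232635,0.943518,0.072612,0.876141,0.248806\n"
)
print(check_submission(sample))
```

To check the real output, pass an open file handle instead of the in-memory sample, for example `with open('cv_gru_output.csv') as f: check_submission(f)`.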

In [None]:
for p in prediction[:20]:
    print(p)

### Plain wikipedia model

In [7]:
MODEL_NAME = 'cnn_wiki_tox_v3'
wiki_model = ToxModel()
wiki_model.train(wiki['train'], wiki['dev'], text_column = 'comment', label_column = 'is_toxic', model_name = MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 95692 samples, validate on 32128 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.17471, saving model to ../models/cnn_wiki_tox_v3_model.h5
134s - loss: 0.2437 - acc: 0.9141 - val_loss: 0.1747 - val_acc: 0.9359
Epoch 2/20
Epoch 00001: val_loss improved from 0.17471 to 0.14997, saving model to ../models/cnn_wiki_tox_v3_model.h5
134s - loss: 0.1654 - acc: 0.9388 - val_loss: 0.1500 - val_acc: 0.9439
Epoch 3/20
Epoch 00002: val_loss improved from 0.14997 to 0.13735, saving mo
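The hyperparameters `stop_early`, `es_patience` and `es_min_delta` printed above suggest the usual patience-based early stopping on validation loss. The sketch below illustrates that scheme in plain Python. It is an assumption about the intended behaviour, not the code in `model_tool`, and `epochs_to_run` is a hypothetical name.

```python
def epochs_to_run(val_losses, patience=1, min_delta=0.0):
    """Return how many epochs run before early stopping triggers.

    Training stops once `patience` consecutive epochs fail to improve
    the best validation loss by more than `min_delta`.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)


# The first three values are from the log above; the last two are made up.
print(epochs_to_run([0.17471, 0.14997, 0.13735, 0.13800, 0.13600]))
```

With `patience=1` and `min_delta=0`, the first epoch whose validation loss does not beat the best value so far ends training, which is why a run configured for 20 epochs may finish much earlier.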

In [8]:
wiki_test = pd.read_csv(wiki['test'])
wiki_model.score_auc(wiki_test['comment'], wiki_test['is_toxic'])

0.95997760130597887
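`score_auc` presumably reports the area under the ROC curve. Its implementation lives in `model_tool` and is not shown here, so that is an assumption. For intuition, the ROC AUC equals the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one, with ties counting one half. A small sketch of that definition on made-up labels and scores (`roc_auc` is a hypothetical helper):

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: P(score_pos > score_neg), ties count 0.5.

    Quadratic in the number of examples, so only suitable for small inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

Because AUC depends only on the ranking of the scores, it is unaffected by the choice of a classification threshold.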

### Debiased model

In [9]:
MODEL_NAME = 'cnn_debias_tox_v3'
debias_model = ToxModel()
debias_model.train(debias['train'], debias['dev'], text_column = 'comment', label_column = 'is_toxic', model_name = MODEL_NAME)

Hyperparameters
---------------
max_num_words: 10000
dropout_rate: 0.3
verbose: True
cnn_pooling_sizes: [5, 5, 40]
es_min_delta: 0
learning_rate: 5e-05
es_patience: 1
batch_size: 128
embedding_dim: 100
epochs: 20
cnn_filter_sizes: [128, 128, 128]
cnn_kernel_sizes: [5, 5, 5]
max_sequence_length: 250
stop_early: True
embedding_trainable: False

Fitting tokenizer...
Tokenizer fitted!
Preparing data...
Data prepared!
Loading embeddings...
Embeddings loaded!
Building model graph...
Training model...
Train on 99157 samples, validate on 33283 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.16575, saving model to ../models/cnn_debias_tox_v3_model.h5
140s - loss: 0.2258 - acc: 0.9215 - val_loss: 0.1657 - val_acc: 0.9406
Epoch 2/20
Epoch 00001: val_loss improved from 0.16575 to 0.14430, saving model to ../models/cnn_debias_tox_v3_model.h5
139s - loss: 0.1595 - acc: 0.9420 - val_loss: 0.1443 - val_acc: 0.9472
Epoch 3/20
Epoch 00002: val_loss improved from 0.14430 to 0.13724, savin

In [11]:
debias_test = pd.read_csv(debias['test'])
debias_model.score_auc(debias_test['comment'], debias_test['is_toxic'])