## Challenge: Build an NLP Model
<br>
I am using <a href='https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge'>this Kaggle competition</a> to build an NLP model. The task is to identify "toxic comments" in hand-labeled data provided by Jigsaw. The challenge is not only to find toxic comments but also to correctly identify the type of toxicity, given as six classes.

In [117]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
import nltk
import gc
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss, make_scorer
from scipy import sparse

In [125]:
df = pd.read_csv('train.csv')
classes = [
    'toxic',
    'severe_toxic',
    'obscene', 
    'threat',
    'insult', 
    'identity_hate'
]

df[(df[classes].sum(axis=1) > 1)][classes].head()

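| | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|
| 32 | 1 | 0 | 1 | 0 | 1 | 0 |
| 81 | 1 | 0 | 1 | 0 | 1 | 0 |
| 86 | 1 | 0 | 1 | 0 | 1 | 0 |
| 104 | 1 | 0 | 1 | 0 | 1 | 0 |
| 122 | 0 | 0 | 1 | 0 | 1 | 0 |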


As can be seen above, the classes are not mutually exclusive.

## Evaluation

Model performance is measured by the mean column-wise log loss over the six toxic comment types. To calculate it, compute the log loss of the predicted probabilities for each class separately (averaged over all comments), then take the mean of those six log-loss values.
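A minimal sketch of this metric, assuming the predictions are probabilities arranged with one column per class. The function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are illustrative assumptions, not part of the competition's own code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_prob, eps=1e-15):
    """Return (mean over classes, per-class log loss) for binary label columns."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # Binary cross-entropy for every (comment, class) cell.
    per_cell = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Average over comments first (one value per class), then over classes.
    per_class = per_cell.mean(axis=0)
    return per_class.mean(), per_class

# Toy data: three comments, two classes.
y_true = np.array([[1, 0], [0, 0], [1, 1]])
y_prob = np.array([[0.9, 0.2], [0.1, 0.1], [0.8, 0.6]])
score, per_class = mean_columnwise_log_loss(y_true, y_prob)
print(per_class, score)
```

With six label columns, `per_class` would hold one log loss per toxic comment type and `score` would be the single number being compared.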

## Data Cleaning & Preparation
For a minimal first attempt, I'll use spaCy to parse the comment text.

In [4]:
prs = spacy.load('en')
data = pd.DataFrame(index=df.index)


def parse(document):
    iteration = next(stat)
    if iteration % 10000 == 0:
        print(iteration, " of {} rows complete!".format(len(df)))
    return prs(document)

stat = iter(range(0, len(df)))
data['raw_parse'] = df.comment_text.apply(parse)

In [12]:
data.raw_parse.iloc[1234]

"

 One more feeder league? 

If I am correct, there should be one more league added to the Relegation to table template. The league in question is the Aldershot Senior League. Please view this talk page for further details under section 18 (Error?) of the page. STalk to me "

In [48]:
def get_words(document):
    iteration = next(stat)
    if iteration % 10000 == 0:
        print(iteration, " of {} rows complete!".format(len(df)))
    lemmas = []
    for token in document:
        if not token.is_punct and not token.is_space and not token.is_oov:
            lemmas.append(token.lemma_)
    return lemmas

def get_oov(document):
    iteration = next(stat)
    if iteration % 10000 == 0:
        print(iteration, " of {} rows complete!".format(len(df)))
    oovs = []
    for token in document:
        if token.is_oov:
            oovs.append(token.orth_)
    return oovs

def reconstruct_text(lemmas):
    iteration = next(stat)
    if iteration % 10000 == 0:
        print(iteration, " of {} rows complete!".format(len(df)))
    result = ''
    for lemma in lemmas:
        if str(lemma) == '-PRON-':
            result += ' ' + 'pronoun'
        else:
            result += ' ' + str(lemma)
    return result

In [29]:
stat = iter(range(0, len(df)))
data['lemmas'] = data.raw_parse.apply(get_words)

0  of 95851 rows complete!
5000  of 95851 rows complete!
10000  of 95851 rows complete!
15000  of 95851 rows complete!
20000  of 95851 rows complete!
25000  of 95851 rows complete!
30000  of 95851 rows complete!
35000  of 95851 rows complete!
40000  of 95851 rows complete!
45000  of 95851 rows complete!
50000  of 95851 rows complete!
55000  of 95851 rows complete!
60000  of 95851 rows complete!
65000  of 95851 rows complete!
70000  of 95851 rows complete!
75000  of 95851 rows complete!
80000  of 95851 rows complete!
85000  of 95851 rows complete!
90000  of 95851 rows complete!
95000  of 95851 rows complete!


In [33]:
stat = iter(range(0, len(df)))
data['oovs'] = data.raw_parse.apply(get_oov)

0  of 95851 rows complete!
10000  of 95851 rows complete!
20000  of 95851 rows complete!
30000  of 95851 rows complete!
40000  of 95851 rows complete!
50000  of 95851 rows complete!
60000  of 95851 rows complete!
70000  of 95851 rows complete!
80000  of 95851 rows complete!
90000  of 95851 rows complete!


In [49]:
stat = iter(range(0, len(df)))
data['clean_text'] = data.lemmas.apply(reconstruct_text)

0  of 95851 rows complete!
10000  of 95851 rows complete!
20000  of 95851 rows complete!
30000  of 95851 rows complete!
40000  of 95851 rows complete!
50000  of 95851 rows complete!
60000  of 95851 rows complete!
70000  of 95851 rows complete!
80000  of 95851 rows complete!
90000  of 95851 rows complete!


In [50]:
gc.collect()

672

In [51]:
data.clean_text[1234]

' one more feeder league if pronoun be correct there should be one more league add to the relegation to table template the league in question be the senior league please view this talk page for further detail under section 18 error of the page to pronoun'
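The reconstruction step can also be written with `str.join` instead of repeated string concatenation. This is only a sketch: the name `reconstruct_text_joined` and the `placeholder` parameter are hypothetical, and it assumes, as the code above does, that pronoun lemmas arrive as the string `-PRON-`.

```python
def reconstruct_text_joined(lemmas, placeholder="pronoun"):
    """Join lemmas into one string, replacing '-PRON-' with a placeholder word."""
    words = (placeholder if str(lemma) == "-PRON-" else str(lemma) for lemma in lemmas)
    # Prefix every word with a space so the result matches the loop version,
    # including its leading space.
    return "".join(" " + word for word in words)

sample = ["if", "-PRON-", "be", "correct"]
cleaned = reconstruct_text_joined(sample)
print(repr(cleaned))
```

The behavior is meant to be the same as the loop-based `reconstruct_text`; the vectorizers used later tokenize on their own, so the leading space is harmless.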

## BoW

In [71]:
gc.collect()
vctr = CountVectorizer(ngram_range=(1, 2), stop_words = 'english', min_df=.001, max_df=.5)
X = vctr.fit_transform(data.clean_text)

In [65]:
ll_scorer = make_scorer(log_loss)
Y = df.toxic

In [77]:
mod = MultinomialNB()
cross_val_score(mod, X, Y, scoring=ll_scorer)

array([ 2.29605676,  2.35342502,  2.54585443])
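One thing worth checking here: with default arguments, `make_scorer(log_loss)` scores whatever `predict` returns, which for a classifier is hard 0/1 labels rather than probabilities. A sketch of a probability-based alternative is below, using scikit-learn's built-in `"neg_log_loss"` scoring string. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the competition data, so the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features and balanced binary labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(60, 8))
y_demo = np.array([0, 1] * 30)

# "neg_log_loss" evaluates predict_proba output and returns the negated loss,
# because scikit-learn scorers follow a "greater is better" convention.
neg_ll = cross_val_score(MultinomialNB(), X_demo, y_demo, scoring="neg_log_loss", cv=3)
prob_log_loss = -neg_ll
print(prob_log_loss)
```

Flipping the sign recovers the usual log loss, where lower is better.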

In [87]:
class NullModel(object):
    
    def __init__(self):
        return None
    
    def fit(self, Xtrain, Ytrain):
        return self
    
    def predict(self, X):
        length = X.shape[0]
        return [0 for x in range(0, length)]
    
    def get_params(self, deep=False):
        return {}
    
nulmod = NullModel()

cross_val_score(nulmod, X, Y, scoring=ll_scorer)

array([ 3.34134637,  3.27118427,  3.3728007 ])

In [79]:
cross_val_score(mod, X, Y)

array([ 0.93352321,  0.93186228,  0.92629108])

In [68]:
Y.sum()/len(df)

0.09636832166591898
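This class rate appears to explain the size of the null model's log loss. When log loss is computed on hard 0/1 predictions, each wrong prediction is clipped to a tiny probability and costs roughly `-log(eps)`, so the score is approximately the error rate times that penalty. The sketch below assumes a clipping constant of `1e-15` (the default in older scikit-learn releases, as far as I know); `hard_label_log_loss` is a hypothetical helper written for this illustration.

```python
import numpy as np

EPS = 1e-15  # assumed clipping constant, not confirmed by the notebook

def hard_label_log_loss(y_true, y_pred, eps=EPS):
    """Log loss when the 'probabilities' are really hard 0/1 labels."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Toy case: 1 positive in 10 rows, and a model that always predicts 0.
y_true = np.array([1] + [0] * 9)
y_null = np.zeros_like(y_true)
loss = hard_label_log_loss(y_true, y_null)

penalty = -np.log(EPS)             # cost of one confident mistake, about 34.54
toxic_rate = 0.09636832166591898   # share of toxic comments, from the cell above
implied = toxic_rate * penalty     # about 3.33, close to the null scores above
print(loss, penalty, implied)
```

Under that assumption, the reported log-loss values behave like a rescaled error rate rather than a measure of how well calibrated the predicted probabilities are.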

In [74]:
df.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

In [97]:
def test_features(df, X):
    mean_ll = []
    null_ll = []
    mean_acc = []
    maj_class = []
    for cls in classes:
        Y = df[cls]
        mod = MultinomialNB()
        mean_ll.append(cross_val_score(mod, X, Y, scoring=ll_scorer, cv=5).mean())
        nulmod = NullModel()
        null_ll.append(cross_val_score(nulmod, X, Y, scoring=ll_scorer, cv=5).mean())
        mean_acc.append(cross_val_score(mod, X, Y, cv=5).mean())
        maj_class.append(1-(Y.sum()/len(df)))

    result = pd.DataFrame()
    result['class'] = classes
    result['mean_log_loss'] = mean_ll
    result['null_log_loss'] = null_ll
    result['mean_accuracy'] = mean_acc
    result['majority_class_prior'] = maj_class
    return result

In [113]:
result_ll = []
ngram_max = []
for x in range(1, 5):
    gc.collect()
    print("round: ", x)
    mdf = x
    ngram_max.append(mdf)
    vctr = CountVectorizer(ngram_range=(1, mdf), stop_words = 'english', min_df=.0005, max_df=.5)
    print("vectorizing...")
    X = vctr.fit_transform(data.clean_text)
    print("testing...")
    result = test_features(df, X)
    result_ll.append(result.mean_log_loss.mean())
    print("done!")
    


round:  1
vectorizing...
testing...
done!
round:  2
vectorizing...
testing...
done!
round:  3
vectorizing...
testing...
done!
round:  4
vectorizing...
testing...
done!


In [114]:
rezz = pd.DataFrame()
rezz['ll'] = result_ll
rezz['mdf'] = ngram_max
rezz.loc[rezz.ll.idxmin()]  # row with the lowest mean log loss

ll     1.016469
mdf    1.000000
Name: 0, dtype: float64

In [94]:
result.mean_log_loss.mean()

1.2684722758112856

In [95]:
result.null_log_loss.mean()

1.2728948663912809

In [115]:
gc.collect()
vctr = CountVectorizer(ngram_range=(1, 1), stop_words = 'english', min_df=.0005, max_df=.5)
X = vctr.fit_transform(data.clean_text)
test_features(df, X)


| | class | mean_log_loss | null_log_loss | mean_accuracy | majority_class_prior |
|---|---|---|---|---|---|
| 0 | toxic | 1.899363 | 3.328444 | 0.945008 | 0.903632 |
| 1 | severe_toxic | 0.644657 | 0.347727 | 0.981336 | 0.989932 |
| 2 | obscene | 1.146612 | 1.840968 | 0.966803 | 0.946699 |
| 3 | threat | 0.339445 | 0.109903 | 0.990172 | 0.996818 |
| 4 | insult | 1.379034 | 1.717012 | 0.960073 | 0.950287 |
| 5 | identity_hate | 0.689701 | 0.293315 | 0.980031 | 0.991508 |


## Tf-Idf

In [119]:
gc.collect()
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words = 'english', min_df=.0005, max_df=.5)

In [120]:
gc.collect()
print("vectorizing...")
X = tfidf.fit_transform(data.clean_text)
print("testing features...")
test_features(df, X)

vectorizing...
testing features...


| | class | mean_log_loss | null_log_loss | mean_accuracy | majority_class_prior |
|---|---|---|---|---|---|
| 0 | toxic | 1.783316 | 3.328444 | 0.948368 | 0.903632 |
| 1 | severe_toxic | 0.341601 | 0.347727 | 0.99011 | 0.989932 |
| 2 | obscene | 0.963907 | 1.840968 | 0.972092 | 0.946699 |
| 3 | threat | 0.110263 | 0.109903 | 0.996808 | 0.996818 |
| 4 | insult | 1.154168 | 1.717012 | 0.966584 | 0.950287 |
| 5 | identity_hate | 0.28791 | 0.293315 | 0.991664 | 0.991508 |


In [122]:
result = test_features(df, X)
result.mean_log_loss.mean()

0.77352750511435786

## "Out of Box" Performance of CountVectorizer and Tf-idf
Out of the box, the CountVectorizer features barely improved on the "null" model, which just predicts the majority class all the time (mean log loss of 1.268 versus 1.273). After tinkering with the vectorizer's hyperparameters, the mean log loss dropped from the null model's 1.27 to 1.02: not great, but a marked improvement.
<br>
Tf-idf "out of the box" gave a mean log loss of 0.77, which is a huge improvement; still not awesome, but it is evidence of Tf-idf's general applicability. The only class it wasn't able to predict well was "threat" (its log loss of 0.110 is no better than the null model's), which is an extremely rare class (99.7% of documents are not threats).
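The summary figures can be re-derived from the per-class tables as a sanity check. The sketch below copies the per-class log-loss values from the two result tables; the variable names are my own.

```python
classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Per-class log loss, copied from the result tables in the sections above.
count_ll = [1.899363, 0.644657, 1.146612, 0.339445, 1.379034, 0.689701]
tfidf_ll = [1.783316, 0.341601, 0.963907, 0.110263, 1.154168, 0.28791]
null_ll = [3.328444, 0.347727, 1.840968, 0.109903, 1.717012, 0.293315]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

summary = {
    "count_vectorizer": round(mean(count_ll), 2),
    "tfidf": round(mean(tfidf_ll), 2),
    "null_model": round(mean(null_ll), 2),
}

# Classes where the Tf-idf model scores worse than the null model.
worse_than_null = [c for c, m, n in zip(classes, tfidf_ll, null_ll) if m > n]
print(summary, worse_than_null)
```

The three means round to 1.02, 0.77, and 1.27, and "threat" is the only class where the Tf-idf log loss is higher than the null model's.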