# 3 - XGBoost
Let's do some explorations using [XGBoost](https://github.com/dmlc/xgboost). 

In [6]:
import numpy as np
import pandas as pd
import zipfile

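filepath = '/Users/freddiekarlbom/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/train.csv.zip'

# Read train.csv straight out of the zip archive without extracting it
# (names chosen so the built-in zip() is not shadowed)
with zipfile.ZipFile(filepath) as archive:
    with archive.open('train.csv') as csv_file:
        df = pd.read_csv(csv_file)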

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [9]:
prediction_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

X = df['comment_text']
Y = df[prediction_columns]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1337)

In [58]:
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

In [64]:
import xgboost as xgb

In [81]:
pipeline = Pipeline([('vect', CountVectorizer(stop_words="english")),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultiOutputClassifier(estimator=xgb.XGBClassifier(silent=False))),
])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__estimator__max_depth': (3, 5),
              'clf__estimator__learning_rate': (0.1, 0.2),
              'clf__estimator__n_estimators': (100, 200)
             }

clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=True)

In [82]:
# Small subset of data to iterate more quickly...
clf.fit(X_train[:5000], Y_train[:5000])

Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 21.8min
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed: 52.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...g_lambda=1, scale_pos_weight=1, seed=None,
       silent=False, subsample=1),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__estimator__max_depth': (3, 5), 'clf__estimator__learning_rate': (0.1, 0.2), 'clf__estimator__n_estimators': (100, 200)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=True)

In [87]:
clf.best_params_

{'clf__estimator__learning_rate': 0.1,
 'clf__estimator__max_depth': 3,
 'clf__estimator__n_estimators': 100,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 1)}

OK, so that 52-minute grid search just says that the default settings are the best, and that the time was wasted. Time for a deep breath before we move on.
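Before writing a search off completely, it can help to check how far apart the candidates actually scored. The sketch below is a minimal illustration, assuming a fitted `GridSearchCV`; the toy data, the `LogisticRegression` grid and the names `toy_search` and `score_spread` are stand-ins of mine, not the pipeline from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in data; in the notebook this would be the fitted `clf` instead.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# cv_results_ holds one entry per candidate; sort by rank to compare them.
results = pd.DataFrame(toy_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score",
           "rank_test_score", "mean_fit_time"]
results = results[columns].sort_values("rank_test_score")
print(results)

# Gap between the best and the worst candidate. A small gap suggests the
# grid covered parameters that barely matter on this data.
score_spread = results["mean_test_score"].max() - results["mean_test_score"].min()
print(f"spread in mean CV accuracy: {score_spread:.4f}")
```

Running the same few lines against `clf.cv_results_` would show whether the 32 candidates differed by more than their fold-to-fold standard deviation, and which settings dominated the fit time.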

In [84]:
clf.score(X_train[:5000], Y_train[:5000])

0.92800000000000005

In [134]:
clf.score(X_test, Y_test)

0.91214437899486156

In [None]:
model = clf.best_estimator_

In [None]:
full_model = model.fit(X_train, Y_train)
full_model.score(X_train, Y_train)

In [None]:
full_model.score(X_test, Y_test)

## Gridsearch analysis
Compared to the ~90% accuracy of plain guessing (always predicting non-toxic), this is not a very big improvement. It's worth noting, though, that because toxic comments are so rare, a small subset of the data probably doesn't contain many examples to learn from.
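To make the "plain guessing" baseline concrete: as far as I can tell, `score` on a `MultiOutputClassifier` counts a comment as correct only when all six labels match, so the baseline is the share of comments with no label at all. Here is a small sketch on made-up data; the label matrix `Y_toy` and its 10% toxic share are assumptions for illustration, not the real dataset.

```python
import numpy as np

# Hypothetical label matrix: 1,000 comments x 6 labels, where exactly 100
# rows carry a positive label (made-up data for illustration).
rng = np.random.default_rng(0)
n_rows, n_labels = 1000, 6
Y_toy = np.zeros((n_rows, n_labels), dtype=int)
toxic_rows = rng.choice(n_rows, size=100, replace=False)
Y_toy[toxic_rows, 0] = 1  # mark each chosen row with the first label

# "Plain guessing": predict no label at all for every comment.
Y_guess = np.zeros_like(Y_toy)

# Subset accuracy: a row counts as correct only if all six labels match.
subset_accuracy = (Y_guess == Y_toy).all(axis=1).mean()
print(subset_accuracy)  # 0.9
```

Any model has to beat this number on the real labels before its accuracy says anything about toxic comments.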

Because toxic comments are comparatively rare, the model may be learning to almost always predict non-toxic. To investigate this, and to see if we can improve on it, let's create a training set with a different distribution that contains a larger share of toxic comments, and then train the most promising model on that data.

This should allow the model to learn much better rules for toxic comments, but might end up giving a lot of false positives on the test set.
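One way to check whether a model is mostly guessing non-toxic is to look at precision and recall per label rather than a single accuracy number. This is a sketch on made-up arrays (`Y_true` and `Y_pred` are invented for illustration); with the real model, `Y_pred` would come from `predict` on the test set.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up ground truth for 8 comments (illustration only).
Y_true = np.array([
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])
# A made-up model that only ever fires on the most common label,
# and even there catches just two of the four toxic comments.
Y_pred = np.zeros_like(Y_true)
Y_pred[[0, 2], 0] = 1

# average=None returns one score per label column instead of one number.
# zero_division=0 reports 0 for labels that are never predicted.
precision = precision_score(Y_true, Y_pred, average=None, zero_division=0)
recall = recall_score(Y_true, Y_pred, average=None, zero_division=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>14}  precision={p:.2f}  recall={r:.2f}")
```

Low recall on the rare labels alongside high overall accuracy would be the signature of a model that has mostly learned to say non-toxic.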

In [92]:
# Select every training comment that has at least one toxic label
Y_train_toxic = Y_train.loc[(Y_train == 1).any(axis=1), :]
X_train_toxic = X_train.loc[Y_train_toxic.index.values]
Y_train_toxic.shape[0]

14604

In [116]:
# Let's go for ~2/3 non toxic and 1/3 toxic in the training set
# giving us a training set roughly 1/3 of the original size
non_toxic_samples = 30000
Y_train_nontoxic = Y_train.loc[(Y_train == 0).all(axis=1), :]

Y_train_nontoxic_sample = Y_train_nontoxic.sample(n=non_toxic_samples, random_state=1337)
X_train_nontoxic_sample = X_train.loc[Y_train_nontoxic_sample.index.values]

Y_train_sample = pd.concat([Y_train_toxic, Y_train_nontoxic_sample])
X_train_sample = pd.concat([X_train_toxic, X_train_nontoxic_sample])

In [119]:
from sklearn.base import clone
new_model = clone(model).fit(X_train_sample, Y_train_sample)  # clone so `model` itself is not refit in place

In [136]:
# Note - this was unexpected
new_model.score(X_train_sample, Y_train_sample)

0.73033808627028962

In [135]:
new_model.score(X_train, Y_train)

0.9156622311350644

In [122]:
new_model.score(X_test, Y_test)

0.90669256799097631

## Sub-word level embeddings
We saw no improvement from shifting the distribution of the training set. Instead, let's test what happens if we look at character n-grams instead of word n-grams.

The benefit from this is that we will capture word stems much better. A toxic word like _fuck_ will now be recognized as the same no matter if it's _fuck_, _fucked_, _fucking_ or some other version, which might yield better results.
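To see the effect on a small scale, the sketch below extracts `char_wb` n-grams from three variants of one word and lists the n-grams they share. The helper `char_ngrams` and the example words are my own choices for illustration; only `CountVectorizer` and its `analyzer` and `ngram_range` parameters come from the pipeline used here.

```python
from sklearn.feature_extraction.text import CountVectorizer

def char_ngrams(word, ngram_range=(3, 6)):
    """Return the set of char_wb n-grams CountVectorizer extracts from one word."""
    vect = CountVectorizer(analyzer="char_wb", ngram_range=ngram_range)
    vect.fit([word])
    return set(vect.vocabulary_)

variants = ["insult", "insulted", "insulting"]
grams = {word: char_ngrams(word) for word in variants}

# N-grams present in every variant: these features fire regardless of suffix.
shared = set.intersection(*grams.values())
print(sorted(shared))

# A word-level vectorizer would instead treat the three variants as
# three unrelated tokens with no features in common.
```

Note that `char_wb` pads each word with a space on both sides, so n-grams such as `" ins"` mark the start of a word and count as separate features from `"ins"`.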

In [None]:
pipeline_nchar = Pipeline([('vect', CountVectorizer(analyzer='char_wb', ngram_range=(3,6))),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultiOutputClassifier(estimator=xgb.XGBClassifier())),
])

In [127]:
pipeline_nchar.fit(X_train[:5000], Y_train[:5000])

In [129]:
pipeline_nchar.score(X_train[:5000], Y_train[:5000])

0.94920000000000004

In [133]:
pipeline_nchar.score(X_test, Y_test)

0.91189372101767141

In [None]:
nchar_df = pd.DataFrame(pipeline_nchar.predict(X_test))
pd.crosstab(nchar_df[0], Y_test.toxic)

## Wrapping up
This sub-word level embedding approach seems promising when looking at the training set. We are clearly overfitting when comparing to the test set, but given that we trained on just a small piece of the data, it might generalise better if we increase the training size.

After running some grid search on very small subsets, I ended up selecting the below pipeline which will be trained on the full dataset.

In [218]:
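# early_stopping_rounds is left out: early stopping needs a validation set
# (eval_set) at fit time, and none is passed anywhere in this pipeline
pipeline_nchar = Pipeline([('vect', CountVectorizer(analyzer='char_wb', ngram_range=(3, 6))),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultiOutputClassifier(estimator=xgb.XGBClassifier(silent=False))),
])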

parameters = {
    'vect__max_df': (0.1, 0.2),
    'vect__min_df': (10, 20)
}

clf_nchar = GridSearchCV(pipeline_nchar, parameters, n_jobs=-1, verbose=True)
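
The `max_df` and `min_df` values in this grid prune the vocabulary by document frequency. A minimal sketch of what they do, using word tokens on a tiny made-up corpus for readability (the corpus and the helper `vocab` are assumptions for illustration; the same thresholds apply to character n-grams):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus (illustration only).
docs = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "the bird flew away",
]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vect = CountVectorizer(**kwargs)
    vect.fit(docs)
    return sorted(vect.vocabulary_)

# No pruning: every token is kept.
print(vocab())
# An integer min_df is an absolute count: keep tokens found in at least 2 documents.
print(vocab(min_df=2))  # ['cat', 'sat', 'the']
# A float max_df is a proportion: also drop tokens found in more than 75% of documents.
print(vocab(min_df=2, max_df=0.75))  # ['cat', 'sat']
```

On a 1,000-comment subset, `min_df=10` means an n-gram must appear in at least 10 comments, while `max_df=0.1` drops anything appearing in more than 100 of them, so the two thresholds interact strongly with the subset size.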

In [None]:
clf_nchar.fit(X_train[:1000], Y_train[:1000])

print(clf_nchar.score(X_train, Y_train))
print(clf_nchar.score(X_test, Y_test))

Fitting 3 folds for each of 4 candidates, totalling 12 fits


## Takeaways
- Even when training on a small subset of the full data, grid search and iteration take a very long time. Learning which steps are slow will be key to avoiding unnecessary waiting and iterating faster. Running the code in the cloud rather than on my local computer should help as well.
- I'm still going through the motions of the process quite mechanically while learning. Going forward, I need to start digging deeper into the actual featurisation.
- Pipelines are a good way to package a finished workflow and to run grid searches, but I'm beginning to realise the importance of digging deeper into the vectorization steps, which means it would be better to handle them separately at first. This goes all the way back to the data exploration step: looking at things such as term frequency and which words are in the data would have let me make more informed decisions.
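
As a starting point for that kind of exploration, the vectorizer can be fitted on its own, outside any `Pipeline`, so the resulting matrix can be inspected directly. This is a rough sketch on made-up comments; the `comments` list stands in for `df['comment_text']` and the variable names are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up comments standing in for df['comment_text'] (illustration only).
comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "what an idiot, a complete idiot",
    "please see the talk page",
    "the edit was helpful, thanks",
]

# Vectorize once, outside any Pipeline, so the matrix can be inspected
# and reused across several models without refitting the vectorizer.
vect = CountVectorizer(stop_words="english")
counts = vect.fit_transform(comments)

# Total occurrences of each term across the corpus.
totals = np.asarray(counts.sum(axis=0)).ravel()
index_to_term = {idx: term for term, idx in vect.vocabulary_.items()}

# Most frequent terms first, ties broken alphabetically.
top = sorted(
    ((index_to_term[i], int(totals[i])) for i in range(len(totals))),
    key=lambda pair: (-pair[1], pair[0]),
)
for term, count in top[:5]:
    print(term, count)
```

Splitting the same counts by label (for example, toxic versus non-toxic rows) would show which terms separate the classes before any model is trained.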