# 3 - XGBoost
Let's do some explorations using [XGBoost](https://github.com/dmlc/xgboost). 

In [176]:
import numpy as np
import pandas as pd
import zipfile

filepath = '/Users/freddiekarlbom/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/train.csv.zip'

# Use names that do not shadow the built-in zip().
with zipfile.ZipFile(filepath) as archive:
    with archive.open('train.csv') as csv_file:
        df = pd.read_csv(csv_file)
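
As an aside, pandas can usually read a zip archive directly when it contains a single CSV, which would avoid the nested `with` blocks. A minimal sketch on a throwaway archive (the file names and the two toy rows are made up for illustration, and I am assuming the Kaggle archive holds only one file):

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny single-file zip archive to stand in for train.csv.zip (toy data).
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "toy_train.csv.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr(
        "toy_train.csv",
        "id,comment_text,toxic\na1,hello there,0\nb2,go away,1\n",
    )

# compression="zip" is also what pandas infers from the ".zip" suffix.
# This only works when the archive holds exactly one file.
toy_df = pd.read_csv(archive_path, compression="zip")
print(toy_df.shape)
```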

In [177]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [178]:
prediction_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

X = df['comment_text']
Y = df[prediction_columns]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1337)

In [179]:
from sklearn.multioutput import MultiOutputClassifier

In [180]:
import xgboost as xgb

In [181]:
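# Note: `silent` was replaced by `verbosity` in later XGBoost releases. Early
# stopping also needs a validation set (`eval_set`) at fit time; none is passed
# here, so `early_stopping_rounds` is probably not doing useful work (newer
# releases may even raise an error).
pipeline = Pipeline([('vect', CountVectorizer(stop_words="english")),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultiOutputClassifier(estimator=xgb.XGBClassifier(silent=False,
                                                                           early_stopping_rounds=5))),
])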

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__estimator__max_depth': (3, 5),
              'clf__estimator__learning_rate': (0.05, 0.1),
              'clf__estimator__n_estimators': (100, 200)
             }

clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=True)

In [182]:
# Small subset of data to iterate more quickly...
clf.fit(X_train[:2000], Y_train[:2000])

Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed: 15.6min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...g_lambda=1, scale_pos_weight=1, seed=None,
       silent=False, subsample=1),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__estimator__max_depth': (3, 5), 'clf__estimator__learning_rate': (0.05, 0.1), 'clf__estimator__n_estimators': (100, 200)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=True)
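
The "32 candidates, 96 fits" in the log follows directly from the grid: five parameters with two values each, times three folds. A small sketch of that arithmetic using only the standard library (the grid is copied from the cell above; the fold count is taken from the log line):

```python
from itertools import product

# Same grid as in the search above.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__estimator__max_depth': (3, 5),
    'clf__estimator__learning_rate': (0.05, 0.1),
    'clf__estimator__n_estimators': (100, 200),
}

n_folds = 3  # from the "Fitting 3 folds" line in the log

# Every combination of one value per parameter is one candidate.
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 32 96
```

Each additional two-valued parameter doubles the number of fits, so trimming the grid is the cheapest way to shorten a search like this.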

In [183]:
clf.best_params_

{'clf__estimator__learning_rate': 0.05,
 'clf__estimator__max_depth': 3,
 'clf__estimator__n_estimators': 100,
 'tfidf__use_idf': False,
 'vect__ngram_range': (1, 1)}

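Ok, so that roughly 16-minute grid search (15.6 minutes according to the log) just picked the smallest, simplest value for every parameter, which is at or close to the default settings, so the time was largely wasted. Time for a deep breath before we move on.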

In [184]:
clf.score(X_train, Y_train)

0.90509912055315322

In [185]:
clf.score(X_test, Y_test)

0.90493796215064548

In [186]:
model = clf.best_estimator_

In [187]:
full_model = model.fit(X_train, Y_train)
full_model.score(X_train, Y_train)

0.9097644363671813

In [188]:
full_model.score(X_test, Y_test)

0.90731921293395157
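
A note on reading these numbers: as far as I know, `score` on a multi-label classifier in scikit-learn reports subset accuracy, meaning a row only counts as correct when all six labels match. On data where most rows carry no label at all, a model that always predicts "all zeros" already scores high. The sketch below uses made-up labels (90 clean rows out of 100, an assumption for illustration rather than a measurement on this dataset) to show the effect:

```python
import numpy as np

n_rows, n_labels = 100, 6

# Toy label matrix: 90 rows with no labels, 10 rows with at least one label set.
Y_toy = np.zeros((n_rows, n_labels), dtype=int)
Y_toy[:10, 0] = 1  # ten rows flagged with the first label
Y_toy[:3, 2] = 1   # three of those also carry another label

# Baseline "model": predict no label for every row.
Y_pred = np.zeros_like(Y_toy)

# Subset accuracy: a row is correct only if every label matches.
subset_accuracy = np.mean(np.all(Y_toy == Y_pred, axis=1))
print(subset_accuracy)  # 0.9
```

Computing the same baseline on the real labels, for example with `(Y_test.sum(axis=1) == 0).mean()`, would show how much of the score above is actually earned by the model.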

## Manual Tuning
The code below is ugly because it was written in haste to get a submission in before the competition deadline. Not shown here are some sub-word-level embedding tests that I also tried briefly.

In [197]:
pipeline_manual = Pipeline([('vect', CountVectorizer(ngram_range=(1, 1), stop_words="english")),
                 ('clf', MultiOutputClassifier(estimator=xgb.XGBClassifier(max_depth=10,
                                                                          learning_rate=0.01,
                                                                          n_estimators=1000,
                                                                          eval_metric='auc',
                                                                          base_score=0.1))),
])

In [198]:
manual = pipeline_manual.fit(X_train, Y_train)

In [199]:
manual.score(X_test, Y_test)

0.91440030078957257
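
As far as I recall, the competition was scored on mean column-wise ROC AUC rather than accuracy, so a local check in that metric is closer to the leaderboard. A minimal sketch on made-up numbers, using `roc_auc_score` from scikit-learn (two labels instead of six to keep it short):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities for two labels (made-up numbers).
y_true = np.array([[1, 0],
                   [0, 0],
                   [1, 1],
                   [0, 1]])
y_prob = np.array([[0.9, 0.2],
                   [0.1, 0.6],
                   [0.8, 0.7],
                   [0.3, 0.4]])

# One AUC per label column, then the plain average across columns.
per_label_auc = [
    roc_auc_score(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])
]
mean_auc = float(np.mean(per_label_auc))
print(per_label_auc, mean_auc)
```

With the pipeline above, the probabilities would come from `manual.predict_proba(X_test)`, which returns one array per label; the positive-class probability is the `[:, 1]` column of each.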

In [283]:
filepath = '/Users/freddiekarlbom/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/test.csv.zip'

# From here on, df holds the competition test set (no labels), not the training data.
with zipfile.ZipFile(filepath) as archive:
    with archive.open('test.csv') as csv_file:
        df = pd.read_csv(csv_file)

In [284]:
X = df['comment_text']

In [285]:
outcome = manual.predict_proba(X)

In [290]:
predictions = pd.DataFrame([outcome[0][:,1], 
                   outcome[1][:,1],
                   outcome[2][:,1],
                   outcome[3][:,1],
                   outcome[4][:,1],
                   outcome[5][:,1]])

In [291]:
predictions = predictions.T

In [292]:
predictions.columns = prediction_columns
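
The three cells above (stack, transpose, rename) can be collapsed into one step. A sketch of the same idea, with random stand-in data in place of the real `predict_proba` output:

```python
import numpy as np
import pandas as pd

prediction_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Stand-in for the predict_proba output: one (n_samples, 2) array per label,
# with columns [P(class 0), P(class 1)]. Random numbers, for illustration only.
rng = np.random.RandomState(0)
positive = rng.uniform(size=(4, len(prediction_columns)))
outcome = [
    np.column_stack([1 - positive[:, j], positive[:, j]])
    for j in range(len(prediction_columns))
]

# Keep only the positive-class probability for each label, keyed by label name.
predictions = pd.DataFrame(
    {col: outcome[j][:, 1] for j, col in enumerate(prediction_columns)}
)
print(predictions.shape)  # (4, 6)
```

Building the frame from a dict keyed by label name keeps the column order tied to `prediction_columns`, so there is no separate transpose or rename to get out of sync.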

In [293]:
df.head()

|   | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


In [301]:
prediction_columns

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [320]:
d = {
    'id': df['id'],
    'toxic': predictions['toxic'],
    'severe_toxic': predictions['severe_toxic'],
    'obscene': predictions['obscene'],
    'threat': predictions['threat'],
    'insult': predictions['insult'],
    'identity_hate': predictions['identity_hate'],
}

submission = pd.DataFrame(data=d)

In [321]:
ind = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
submission = submission.reindex(columns=ind)
submission.head()

|   | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.992834 | 0.268399 | 0.88634 | 0.142722 | 0.952689 | 0.034542 |
| 1 | 0000247867823ef7 | 0.052105 | 0.004629 | 0.021585 | 0.001585 | 0.028282 | 0.004806 |
| 2 | 00013b17ad220c46 | 0.047138 | 0.00214 | 0.015783 | 0.00076 | 0.021427 | 0.003288 |
| 3 | 00017563c3f7919a | 0.015043 | 0.001031 | 0.00718 | 0.001124 | 0.00714 | 0.00097 |
| 4 | 00017695ad8997eb | 0.055856 | 0.002097 | 0.012666 | 0.001194 | 0.020693 | 0.003508 |


In [323]:
submission.to_csv('predictions.csv', index=False)

## Takeaways
- Even when training on a small subset of the full data, grid search takes a very long time (about 16 minutes for 96 fits on 2,000 rows), which makes iteration slow. Learning which steps take the time will be key to avoiding unnecessary waiting and iterating faster, as will running the code in the cloud rather than on my local computer.
- I'm still going through the motions of the process quite mechanically while learning. Going forward, I need to start digging deeper into the actual featurisation.
- Pipelines are a good way to package a finished workflow and to run grid searches, but I'm beginning to realise the importance of digging deeper into the vectorization steps, which means it would be better to handle them separately at first. This goes all the way back to the data exploration step: looking at things such as term frequencies and which words are actually in the data would have let me make more informed decisions.
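
The kind of term-frequency look mentioned in the last point can be very small. A sketch using only the standard library, with toy comments standing in for `df['comment_text']` and a deliberately tiny, hypothetical stop-word list (the one behind `stop_words="english"` in `CountVectorizer` is much longer):

```python
import re
from collections import Counter

# Toy comments standing in for df['comment_text'] (made up for illustration).
comments = [
    "Thanks for the edit, the article looks fine now.",
    "The title is fine as it is.",
    "Please stop the edit war on this article.",
]

# Hypothetical, deliberately small stop-word list for this example only.
stop_words = {"the", "is", "as", "it", "for", "on", "this"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Count every non-stop-word token across all comments.
term_counts = Counter(
    token
    for comment in comments
    for token in tokenize(comment)
    if token not in stop_words
)

print(term_counts.most_common(3))
```

Running the same count separately on toxic and non-toxic comments would be one way to see which words actually separate the classes before committing to a vectorizer configuration.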