# Training an XGBoost model for [MoA prediction on Kaggle](https://www.kaggle.com/c/lish-moa/overview)



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/lish-moa/sample_submission.csv
/kaggle/input/lish-moa/train_targets_scored.csv
/kaggle/input/lish-moa/train_targets_nonscored.csv
/kaggle/input/lish-moa/train_features.csv
/kaggle/input/lish-moa/test_features.csv


## Importing the good stuff

* We'll be training an XGBoost model here, but since it's a multilabel problem, we'll use the `MultiOutputClassifier` as a wrapper over the `XGBClassifier` 


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from category_encoders import CountEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss

import matplotlib.pyplot as plt

from sklearn.multioutput import MultiOutputClassifier

import os
import warnings
warnings.filterwarnings('ignore')

## Some important params

* You could go for a larger number of folds, but that would take much longer to train and won't necessarily give better results for this dataset


In [3]:
SEED = 42
NFOLDS = 5
DATA_DIR = '/kaggle/input/lish-moa/'
np.random.seed(SEED)

## Loading all the CSV files

In [4]:
train = pd.read_csv(DATA_DIR + 'train_features.csv')
targets = pd.read_csv(DATA_DIR + 'train_targets_scored.csv')

test = pd.read_csv(DATA_DIR + 'test_features.csv')
sub = pd.read_csv(DATA_DIR + 'sample_submission.csv')

## Dropping the `sig_id` column for training 

In [None]:
X = train.iloc[:,1:].to_numpy()
X_test = test.iloc[:,1:].to_numpy()
y = targets.iloc[:,1:].to_numpy()

## Defining the pipeline


* `MultiOutputClassifier` fits one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification
* `tree_method='gpu_hist'` tells XGBoost to build its trees with the histogram algorithm on a CUDA-capable GPU
* The pipeline has 2 main parts:

    * `CountEncoder` encodes categorical values; the `cols` argument specifies the list of columns to encode (here columns 0 and 2, which are `cp_type` and `cp_dose` once `sig_id` is dropped)
    * `classifier` is the `MultiOutputClassifier` wrapped around the `XGBClassifier`
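
To make the count-encoding idea concrete, here is a minimal pure-Python sketch. It is not the `category_encoders` implementation: the helper `count_encode` and the toy values are hypothetical, and mapping unseen categories to 0 is an assumption of this sketch.

```python
from collections import Counter

def count_encode(train_values, values):
    """Replace each category with how often it occurs in train_values."""
    counts = Counter(train_values)
    # Categories never seen in training map to 0 (an assumption of this sketch).
    return [counts.get(v, 0) for v in values]

# Toy column resembling cp_type; the values are illustrative only.
train_col = ['trt_cp', 'trt_cp', 'ctl_vehicle', 'trt_cp']
encoded = count_encode(train_col, ['trt_cp', 'ctl_vehicle', 'unseen'])
print(encoded)  # [3, 1, 0]
```

The point is only that a string column turns into a numeric one the trees can split on.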

In [5]:
classifier = MultiOutputClassifier(XGBClassifier(tree_method='gpu_hist'))

clf = Pipeline([('encode', CountEncoder(cols=[0, 2])),
                ('classify', classifier)
               ])

## Fine tuning 

* `classify__estimator__colsample_bytree` specifies the fraction of columns (features) randomly sampled when constructing each tree
* `classify__estimator__max_depth` specifies the maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit
* `classify__estimator__learning_rate` is the learning rate, which shrinks the contribution of each new tree
* `classify__estimator__max_delta_step` specifies the maximum delta step allowed for each leaf output; setting this to `0` means there's no limit. This constraint can help with logistic objectives when the classes are extremely imbalanced
* `classify__estimator__subsample` specifies the subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees.
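
The long parameter names follow scikit-learn's nested naming: pipeline step name, then the wrapper's `estimator` attribute, then the parameter itself, joined by double underscores. A small sketch of that routing, using `DecisionTreeClassifier` as a stand-in (an assumption, so it runs without XGBoost or a GPU):

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# DecisionTreeClassifier is a stand-in for XGBClassifier in this sketch.
pipe = Pipeline([('classify', MultiOutputClassifier(DecisionTreeClassifier()))])

# <step name>__<wrapper attribute>__<parameter of the wrapped estimator>
pipe.set_params(classify__estimator__max_depth=3)

# The value ends up on the estimator inside the MultiOutputClassifier.
inner = pipe.named_steps['classify'].estimator
print(inner.max_depth)
```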

In [6]:
params = {'classify__estimator__colsample_bytree': 0.6522,
          'classify__estimator__gamma': 3.6975,
          'classify__estimator__learning_rate': 0.0503,
          'classify__estimator__max_delta_step': 2.0706,
          'classify__estimator__max_depth': 10,
          'classify__estimator__min_child_weight': 31.5800,
          'classify__estimator__n_estimators': 166,
          'classify__estimator__subsample': 0.8639
         }

_ = clf.set_params(**params)

## The training loop 

Some notes first:
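* OOF means out-of-fold: the predictions a model makes on the fold it was not trained on, which plays the role of a "holdout set"
* K-fold cross-validation divides the set of observations into k groups, or folds, of approximately equal size. Each fold in turn is treated as the validation set while the model is fit on the remaining k − 1 folds. `KFold` does not shuffle by default, so the folds used here are consecutive blocks of rows.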
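
As a rough illustration of how unshuffled folds are carved out, here is a NumPy-only sketch. The helper `kfold_indices` is hypothetical and only mimics the idea; it is not scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, n_folds)  # fold sizes differ by at most one
    for i, val_idx in enumerate(folds):
        # Everything outside the current fold is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print('val:', val_idx, 'train size:', len(train_idx))
```

Every row lands in exactly one validation fold, which is why `oof_preds` ends up fully filled after the loop.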

**Important**: we're dropping all the rows where `cp_type==ctl_vehicle` from each training fold because the targets of these rows are always zero, so we hardcode zeros for them in the final submission instead. This is the part of the loop that drops them:

```
ctl_mask = X_train[:,0]=='ctl_vehicle'
X_train = X_train[~ctl_mask,:]
y_train = y_train[~ctl_mask]
```

* We're using `predict_proba` here instead of the usual `predict`.
    * `predict` gives hard class labels like `0`, `1`
    * `predict_proba` gives the probability of y being `0` or `1`, which is what log loss is computed on
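
With `MultiOutputClassifier`, `predict_proba` returns a list with one `(n_samples, 2)` array per target, which is why the loop reshapes it. The sketch below simulates such a list with random numbers (an assumption, no model is trained) to show what `np.array(...)[:,:,1].T` does to the shapes.

```python
import numpy as np

n_samples, n_targets = 4, 3
rng = np.random.default_rng(0)

# Simulated stand-in for the predict_proba output: one (n_samples, 2) array
# per target, with columns [P(y=0), P(y=1)].
p1 = rng.uniform(size=(n_targets, n_samples))
proba_list = [np.column_stack([1 - p, p]) for p in p1]

stacked = np.array(proba_list)   # shape (n_targets, n_samples, 2)
positive = stacked[:, :, 1].T    # shape (n_samples, n_targets), positive class only

# Thresholding at 0.5 gives hard 0/1 labels, roughly what predict would return.
hard_labels = (positive >= 0.5).astype(int)
print(stacked.shape, positive.shape, hard_labels.shape)
```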

Our final prediction on the test set is the average of the predictions made by the models trained on each fold
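
A quick check that adding `preds / NFOLDS` on every fold is the same as taking the mean over folds. The per-fold predictions here are random placeholders, not real model output.

```python
import numpy as np

NFOLDS = 5
rng = np.random.default_rng(0)

# Hypothetical per-fold test predictions, each of shape (n_test, n_targets).
fold_preds = [rng.uniform(size=(3, 2)) for _ in range(NFOLDS)]

test_preds = np.zeros((3, 2))
for preds in fold_preds:
    test_preds += preds / NFOLDS  # add this fold's share of the average

# Same result as averaging over the fold axis in one go.
print(np.allclose(test_preds, np.mean(fold_preds, axis=0)))
```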

**Useful links for a beginner like myself:**
* [What is log loss ? explained by a kaggle grandmaster](https://www.kaggle.com/dansbecker/what-is-log-loss)
* [Video explaining log loss](https://www.youtube.com/watch?v=MztgenIfGgM&ab_channel=BhaveshBhatt)
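
For reference, a minimal NumPy sketch of binary log loss over flattened targets, mirroring how the training loop calls `log_loss` on `np.ravel`-ed arrays. The helper `binary_log_loss`, the clipping value `eps` and the toy numbers are assumptions of this sketch, not scikit-learn's implementation.

```python
import numpy as np

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs; eps is an assumption of this sketch.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

# Toy example: 2 samples x 2 targets, flattened before scoring.
y_true = np.array([[0, 1], [0, 0]])
y_pred = np.array([[0.1, 0.8], [0.2, 0.05]])
loss = binary_log_loss(np.ravel(y_true), np.ravel(y_pred))
print(loss)
```

Predicting exactly 0 for a target that really is 0 contributes almost nothing to the loss, which is why hardcoding zeros for the control rows is safe.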

In [7]:
oof_preds = np.zeros(y.shape)
test_preds = np.zeros((test.shape[0], y.shape[1]))
oof_losses = []
kf = KFold(n_splits=NFOLDS)
for fn, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print('Starting fold: ', fn)
    X_train, X_val = X[trn_idx], X[val_idx]
    y_train, y_val = y[trn_idx], y[val_idx]
    
    # drop where cp_type==ctl_vehicle (baseline)
    ctl_mask = X_train[:,0]=='ctl_vehicle'
    X_train = X_train[~ctl_mask,:]
    y_train = y_train[~ctl_mask]
    
    clf.fit(X_train, y_train)
    val_preds = clf.predict_proba(X_val) # list of preds per class
    val_preds = np.array(val_preds)[:,:,1].T # take the positive class
    oof_preds[val_idx] = val_preds
    
    loss = log_loss(np.ravel(y_val), np.ravel(val_preds))
    oof_losses.append(loss)
    preds = clf.predict_proba(X_test)
    preds = np.array(preds)[:,:,1].T # take the positive class
    test_preds += preds / NFOLDS  ## good old averaging 
    
print(oof_losses)
print('Mean OOF loss across folds', np.mean(oof_losses))
print('STD OOF loss across folds', np.std(oof_losses))

Starting fold:  0
Starting fold:  1
Starting fold:  2
Starting fold:  3
Starting fold:  4
[0.0169781773377249, 0.01704491710861325, 0.016865153552168475, 0.01700900926983899, 0.01717882474706338]
Mean OOF loss across folds 0.017015216403081797
STD OOF loss across folds 0.00010156682747757948


## Hardcoding the preds where `train['cp_type']=='ctl_vehicle'` on OOF preds and test preds 

In [8]:
# set control train preds to 0
control_mask = train['cp_type']=='ctl_vehicle'
oof_preds[control_mask] = 0

print('OOF log loss: ', log_loss(np.ravel(y), np.ravel(oof_preds)))

OOF log loss:  0.0167240932391125


In [9]:
control_mask = test['cp_type']=='ctl_vehicle'
test_preds[control_mask] = 0

## Making a submission

In [10]:
sub.iloc[:,1:] = test_preds
sub.to_csv('submission.csv', index=False)

In [11]:
sub.head()

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,id_0004d9e33,0.002041,0.002052,0.002055,0.011048,0.013914,0.003767,0.002396,0.005898,0.002035,...,0.002038,0.001974,0.002214,0.003002,0.002947,0.002033,0.002294,0.002079,0.001879,0.002083
1,id_001897cda,0.002041,0.002052,0.002055,0.003816,0.004644,0.003374,0.00203,0.005303,0.002035,...,0.002038,0.002162,0.002117,0.001061,0.005006,0.002033,0.005964,0.002051,0.002421,0.002085
2,id_002429b5b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,id_00276f245,0.002041,0.002052,0.002055,0.010955,0.008548,0.003237,0.002358,0.004069,0.002035,...,0.002038,0.002048,0.002364,0.005247,0.002792,0.002033,0.002569,0.00207,0.002198,0.002063
4,id_0027f1083,0.002041,0.002052,0.002055,0.014007,0.017502,0.002726,0.003422,0.004535,0.002035,...,0.002038,0.001964,0.00248,0.005332,0.00275,0.002033,0.002204,0.002079,0.001766,0.002074
