# LightGBM - Multilabel Classification

This notebook uses LightGBM and scikit-learn's [MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) to frame the problem as 206 separate binary classification tasks.

**Pros**:
- Simple - can use any binary classification model.

**Cons**:
- Doesn't consider any correlations between the different labels.
- Slow - have to train 206 separate models per fit (here 1,030 models in total: 206 labels × 5 folds).
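
To make the one-model-per-label framing concrete, here is a minimal sketch on synthetic data. The toy arrays `X_toy` and `Y_toy` and the use of `LogisticRegression` as the base model are assumptions for illustration only; the notebook itself wraps `LGBMClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Three toy binary labels, each driven by a different feature (illustrative only).
Y_toy = (X_toy[:, :3] > 0).astype(int)

clf = MultiOutputClassifier(LogisticRegression()).fit(X_toy, Y_toy)

# One independently fitted binary model per label column.
n_models = len(clf.estimators_)

# predict_proba returns a list with one (n_samples, 2) array per label.
proba = clf.predict_proba(X_toy)
proba_shapes = [p.shape for p in proba]
print(n_models, proba_shapes)
```

Because each label gets its own estimator, nothing learned for one label is shared with another, which is the correlation limitation listed above.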

Source kernels:
- [XGBoost baseline - multilabel classification](https://www.kaggle.com/fchmiel/xgboost-baseline-multilabel-classification)
- [MoA LightGBM - 206 models](https://www.kaggle.com/nroman/moa-lightgbm-206-models)

In [1]:
%load_ext autoreload
%aimport numpy, matplotlib, pandas, category_encoders, sklearn, lightgbm
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

from category_encoders import CountEncoder
from datetime import datetime
from lightgbm import LGBMClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from time import time

from src.data.make_dataset import get_base_datasets


In [2]:
plt.style.use("../style.mplstyle")
SEED = 42
NFOLDS = 5
np.random.seed(SEED)

## Load Data

- `df`: feature data.
  - 772 gene expression features and 100 cell viability features, plus:
    - `cp_type`: whether the experiment is a treatment (contains drug) or a control (contains no drug - probably DMSO, which has negligible biological effects).
    - `cp_dose`: the dose level used in the experiment. Generally, a higher dose leads to a stronger effect.
    - `cp_time`: time elapsed between adding the drug and taking the measurement.
    - `flag`: specifies whether the row is training (n=23,814) or test (n=3,982) data.
  - One row = one drug at a specific dose (high/low) and time point (24/48/72 hours), identified by `sig_id`. 5,000 unique drugs in total, with ~6 records each (no column links the records).
- `df_tts`: 206 binary target mechanisms for the 23,814 training rows.
- `df_ttn`: 402 additional unscored targets for the 23,814 training rows, for model development.

In [3]:
df, df_ttn, df_tts = get_base_datasets()
print(
    df.query('flag=="train"').shape,
    df.query('flag=="test"').shape,
    df_ttn.shape,
    df_tts.shape,
)

(23814, 877) (3982, 877) (23814, 403) (23814, 207)


## Train Model

In [4]:
X = df.query('flag=="train"').iloc[:, 1:-1]
X_test = df.query('flag=="test"').iloc[:, 1:-1]
y = df_tts.iloc[:, 1:].values

With LightGBM, minimal pre-processing is necessary: just encode the two categorical columns, `cp_type` and `cp_dose` (positions 0 and 2 of `X`). Since each takes only two values, any type of encoding is fine.
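
As a rough illustration of what count encoding does to such a column, the sketch below replaces each category with its frequency in the training data. The helper `count_encode` and the toy values are hypothetical, written in plain NumPy; it is not the `CountEncoder` implementation, and mapping unseen categories to 0 is a choice made for this sketch.

```python
import numpy as np

def count_encode(train_col, col):
    """Replace each category in `col` with its frequency in `train_col`."""
    cats, counts = np.unique(train_col, return_counts=True)
    lookup = dict(zip(cats.tolist(), counts.tolist()))
    # Categories unseen in training map to 0 here (an assumption of this sketch).
    return np.array([lookup.get(str(v), 0) for v in col])

# Toy two-valued dose column (illustrative values only).
dose_train = np.array(["low", "high", "low", "low"])
encoded = count_encode(dose_train, dose_train)
print(encoded)
```

One property worth keeping in mind with count encoding specifically: two categories that happen to have identical counts would receive the same code.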

In [5]:
lgb = MultiOutputClassifier(LGBMClassifier())
pipe = Pipeline(
    [("encode", CountEncoder(cols=[0, 2], return_df=False)), ("classify", lgb)]
)

The following hyperparameters were taken from another problem and have not been tuned for this one, so they are unlikely to be optimal.

In [6]:
params = {
    "classify__estimator__num_leaves": 491,
    "classify__estimator__min_child_weight": 0.03,
    "classify__estimator__feature_fraction": 0.3,
    "classify__estimator__bagging_fraction": 0.4,
    "classify__estimator__min_data_in_leaf": 106,
    "classify__estimator__objective": "binary",
    "classify__estimator__max_depth": -1,
    "classify__estimator__learning_rate": 0.01,
    "classify__estimator__boosting_type": "gbdt",
    "classify__estimator__bagging_seed": 11,
    "classify__estimator__metric": "binary_logloss",
    "classify__estimator__verbosity": 0,
    "classify__estimator__reg_alpha": 0.4,
    "classify__estimator__reg_lambda": 0.6,
    "classify__estimator__random_state": 47,
}

_ = pipe.set_params(**params)
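
The double-underscore keys above route each value through the nesting: the pipeline step named `classify`, then the wrapper's `estimator`, then the parameter on the base model. A minimal sketch of the same routing, assuming a toy pipeline (`toy_pipe`, with `StandardScaler` and `LogisticRegression` standing in for the notebook's encoder and `LGBMClassifier`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline with the same nesting as the notebook's `pipe`.
toy_pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("classify", MultiOutputClassifier(LogisticRegression())),
    ]
)

# "<step>__estimator__<param>" reaches the base model inside the wrapper.
toy_pipe.set_params(classify__estimator__C=0.5)

base_model = toy_pipe.named_steps["classify"].estimator
print(base_model.C)
```

`MultiOutputClassifier` clones this base estimator once per label at fit time, so a parameter set this way applies to every per-label model.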

For training, we drop all control rows (`cp_type == "ctl_vehicle"`). At prediction time, we set the predictions for control rows to all zeros.
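
The two halves of this rule can be sketched with a boolean mask on toy arrays. All values below are made up for illustration; `"treated"` is a placeholder label for non-control rows, and the constant 0.3 stands in for real model output.

```python
import numpy as np

cp_type = np.array(["treated", "ctl_vehicle", "treated", "ctl_vehicle"])
features = np.arange(8, dtype=float).reshape(4, 2)
targets = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

ctl_mask = cp_type == "ctl_vehicle"

# Training: keep only the non-control rows.
X_fit, y_fit = features[~ctl_mask], targets[~ctl_mask]

# Prediction: start from stand-in model outputs, then overwrite control rows with zeros.
preds = np.full(targets.shape, 0.3)
preds[ctl_mask] = 0.0
print(X_fit.shape, preds)
```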

In [7]:
start = time()

oof_preds = np.zeros(y.shape)
test_preds = np.zeros((X_test.shape[0], y.shape[1]))
oof_losses = []
kf = KFold(n_splits=NFOLDS)
for fn, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"Fold: {fn} - elapsed time: {(time()-start)/60:.1f} mins")
    X_train, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_train, y_val = y[trn_idx], y[val_idx]

    ctl_mask = (X_train.iloc[:, 0] == "ctl_vehicle").values
    X_train = X_train[~ctl_mask]
    y_train = y_train[~ctl_mask]

    pipe.fit(X_train.values, y_train)
    val_preds = pipe.predict_proba(X_val.values)  # list of preds per class
    val_preds = np.array(val_preds)[:, :, 1].T  # take the positive class results only
    oof_preds[val_idx] = val_preds

    loss = log_loss(np.ravel(y_val), np.ravel(val_preds))
    oof_losses.append(loss)
    preds = pipe.predict_proba(X_test.values)
    preds = np.array(preds)[:, :, 1].T  # take the positive class results only
    test_preds += preds / NFOLDS


print(f"Completed. Total elapsed time: {(time()-start)/60:.1f} mins")
print(f"OOF losses: {[str(l)[:8] for l in oof_losses]}")
print(
    f"Mean OOF (STD) loss across folds: {np.mean(oof_losses):.6f} ({np.std(oof_losses):.6f})"
)

Fold: 0 - elapsed time: 0.0 mins
Fold: 1 - elapsed time: 8.9 mins
Fold: 2 - elapsed time: 17.2 mins
Fold: 3 - elapsed time: 25.6 mins
Fold: 4 - elapsed time: 34.1 mins
Completed. Total elapsed time: 42.5 mins
OOF losses: ['0.017037', '0.017113', '0.016950', '0.016995', '0.017307']
Mean OOF (STD) loss across folds: 0.017081 (0.000125)
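
The reshaping step inside the loop (`np.array(...)[:, :, 1].T`) can be looked at in isolation. The sketch below fakes the list that `MultiOutputClassifier.predict_proba` returns, using random numbers and made-up sizes, and assumes every label saw both classes during fitting so that each array has two columns.

```python
import numpy as np

n_labels, n_samples = 3, 4
rng = np.random.default_rng(0)

# Stand-in for predict_proba output: a list of (n_samples, 2) arrays, one per label.
proba_list = []
for _ in range(n_labels):
    p_pos = rng.uniform(size=n_samples)
    proba_list.append(np.column_stack([1 - p_pos, p_pos]))

stacked = np.array(proba_list)   # shape (n_labels, n_samples, 2)
pos = stacked[:, :, 1].T         # positive-class column only, shape (n_samples, n_labels)
print(stacked.shape, pos.shape)
```

The transpose puts samples on rows and labels on columns, matching the layout of `y` and of `oof_preds`.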


In [8]:
# set control train preds to 0
control_mask = df.query('flag=="train"')["cp_type"] == "ctl_vehicle"
oof_preds[control_mask] = 0

print(f"OOF log loss: {log_loss(np.ravel(y), np.ravel(oof_preds)):.6f}")

OOF log loss: 0.016869
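
Flattening `y` and the predictions before calling `log_loss` averages the binary log loss over every (row, label) cell. Since all label columns have the same number of rows, this equals the mean of the per-column log losses. A small check on random toy data, using a hand-written `binary_log_loss` helper (hypothetical, with an assumed clipping constant `eps`) rather than scikit-learn's function:

```python
import numpy as np

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(50, 4))
y_pred = rng.uniform(0.01, 0.99, size=(50, 4))

# Loss over the flattened arrays versus the mean of per-column losses.
flat = binary_log_loss(y_true.ravel(), y_pred.ravel())
per_col = float(
    np.mean([binary_log_loss(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])])
)
print(flat, per_col)
```

The clipping matters for the control rows: a hard 0 prediction on a positive label would otherwise give an infinite loss.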


In [9]:
# set control test preds to 0
control_mask = X_test["cp_type"] == "ctl_vehicle"
test_preds[control_mask] = 0

## Create Submission

In [10]:
sub = pd.read_csv("../data/raw/sample_submission.csv")
sub.iloc[:, 1:] = test_preds
t = datetime.now().strftime("%Y%m%d%H%M%S")
sub.to_csv(f"../data/submissions/LightGBM_{t}.csv", index=False)