# ML Assignment 2 - Canadian Hospital Re-admittance Challenge

*Harsh Kumar - IMT2021016* |
*Subhajeet Lahiri - IMT2021022* |
*Sai Madhavan G - IMT2021101*

This file contains our attempts at training non-neural ensemble models.

The data used in this file has been preprocessed using the same methods as in assignment 1. The code can be found in `preprocessing.py`.

## Methodology

We use implementations of models from several ensembling paradigms:

### Bagging

- [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
- [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier)

### Boosting

- [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html)
- [LGBMClassifier](https://lightgbm.readthedocs.io/en/stable/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)
- [LGBMClassifier (DART variant)](https://lightgbm.readthedocs.io/en/stable/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier)
- [CatBoostClassifier](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier)
- [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier)

### Voting

- [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier)


We first evaluate each model by its cross-validation score (macro-averaged F1 over 5 stratified folds).

We then compare the models' performance on the Kaggle leaderboard.

Finally, we use the softmax of the Kaggle leaderboard scores as weights for a voting classifier with 'soft' voting.
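
To make the last step concrete, here is a minimal sketch of how weighted soft voting combines per-model class probabilities. The probability tables and the two weights below are made up for illustration and are not taken from our models.

```python
import numpy as np

# Hypothetical class probabilities from two models for three samples
# (rows: samples, columns: classes). Values are illustrative only.
proba_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3],
                    [0.1, 0.2, 0.7]])
proba_b = np.array([[0.3, 0.5, 0.2],
                    [0.1, 0.3, 0.6],
                    [0.2, 0.2, 0.6]])

# Hypothetical voting weights, one per model.
weights = np.array([0.75, 0.25])

# Soft voting: weighted average of the probabilities, then argmax per sample.
avg_proba = np.average(np.stack([proba_a, proba_b]), axis=0, weights=weights)
soft_vote = avg_proba.argmax(axis=1)
print(soft_vote)
```

With these numbers the second model alone would pick class 2 for the middle sample, but the more heavily weighted first model pulls the ensemble to class 1.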

## Results (Kaggle leaderboard)

- rf: 72.7%
- et: 71.5%
- xgb: 73.2%
- lgb: 73.5%
- dart: 73.1%
- cb: 73.0%
- hgb: 73.0%
- voting: 73.5%

## Code

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from preprocessing import load_data, preprocessing_and_fe

In [3]:
seed = 17

Loading the preprocessed data (refer to `preprocessing.py` for the exact steps):

In [4]:
data, test_data = load_data("../data/")
X, y, x, enc_ids, cat_feat = preprocessing_and_fe(data, test_data)

In [5]:
true_indices = [index for index, value in enumerate(cat_feat) if value]

In [6]:
cat_cols = X.columns[cat_feat]
X[cat_cols] = X[cat_cols].astype('int')

In [7]:
cat_cols = x.columns[cat_feat]
x[cat_cols] = x[cat_cols].astype('int')

In [8]:
splits = 5
skf = StratifiedKFold(n_splits = splits, random_state = seed, shuffle = True)
np.random.seed(seed)
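
As a quick illustration of why a stratified splitter is used here, the sketch below counts the class labels in each validation fold. The toy labels are an assumption for demonstration only and are unrelated to the hospital data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (illustrative only): 60 / 30 / 10 samples per class.
y_toy = np.repeat([0, 1, 2], [60, 30, 10])
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

fold_counts = []
for _, val_idx in skf_toy.split(X_toy, y_toy):
    # Count how many samples of each class land in this validation fold.
    fold_counts.append(np.bincount(y_toy[val_idx], minlength=3).tolist())

print(fold_counts)
```

Every fold keeps the 6:3:1 class ratio of the full label vector, so the macro F1 of each fold is computed on a comparable class mix.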

In [17]:
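def cross_val_score(estimator, cv = skf, label = '', include_original = False):
    """Run stratified CV on (X, y), print macro-F1 scores, and refit on all data.

    `include_original` is accepted but currently unused.
    Returns the per-fold validation scores, out-of-fold predictions, and the refit model.
    """
    #initialise the out-of-fold prediction array and the score lists
    val_predictions = np.zeros((len(X)))
    train_scores, val_scores = [], []
    
    #train the model, predict class labels, and evaluate the metric for each fold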
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        
        model = clone(estimator)
        
        #define train set
        X_train = X.iloc[train_idx].reset_index(drop = True)
        y_train = y.iloc[train_idx].reset_index(drop = True)
        
        #define validation set
        X_val = X.iloc[val_idx].reset_index(drop = True)
        y_val = y.iloc[val_idx].reset_index(drop = True)
        
        #train model
        model.fit(X_train, y_train)
        
        #make predictions
        train_preds = model.predict(X_train)
        val_preds = model.predict(X_val)
        val_preds = val_preds.reshape((-1,))
        val_predictions[val_idx] += val_preds
        
        #evaluate model for a fold
        train_score = f1_score(y_train, train_preds, average='macro')
        val_score = f1_score(y_val, val_preds, average='macro')
        
        #append model score for a fold to list
        train_scores.append(train_score)
        val_scores.append(val_score)
    
    print(f'Val Score: {np.mean(val_scores):.5f} ± {np.std(val_scores):.5f} | Train Score: {np.mean(train_scores):.5f} ± {np.std(train_scores):.5f} | {label}')
    #refit a fresh clone on the full training set for test-time predictions
    model = clone(estimator)
    model.fit(X, y)
    
    return val_scores, val_predictions, model
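
The metric used in the function above is macro-averaged F1, which gives every class the same weight regardless of its frequency. The sketch below recomputes it by hand on toy labels (illustrative values, not our data) and compares the result with `f1_score`; the helper `per_class_f1` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predictions (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

def per_class_f1(y_true, y_pred, cls):
    """F1 for one class treated as the positive label."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Macro F1 is the unweighted mean of the per-class F1 scores.
manual_macro = np.mean([per_class_f1(y_true, y_pred, c) for c in np.unique(y_true)])
sklearn_macro = f1_score(y_true, y_pred, average='macro')
print(manual_macro, sklearn_macro)
```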

In [18]:
score_list, oof_list = pd.DataFrame(), pd.DataFrame()
trained_models = {}

models = [
    ('rf', RandomForestClassifier(random_state = seed)),
    ('et', ExtraTreesClassifier(random_state = seed)),
    ('xgb', XGBClassifier(random_state = seed)),
    ('lgb', LGBMClassifier(random_state = seed, verbose=0)),
    ('dart', LGBMClassifier(random_state = seed, boosting_type = 'dart', verbose=0)),
    ('cb', CatBoostClassifier(random_state = seed, verbose=0, cat_features=true_indices, task_type='GPU', devices='0')),
    ('hgb', HistGradientBoostingClassifier(random_state = seed, categorical_features=cat_feat)),
]

for (label, model) in models:
    score_list[label], oof_list[label], trained_models[label] = cross_val_score(
        model,
        label = label,
        include_original = True
    )

Val Score: 0.53357 ± 0.00308 | Train Score: 0.99999 ± 0.00001 | rf
Val Score: 0.52252 ± 0.00253 | Train Score: 1.00000 ± 0.00000 | et
Val Score: 0.55455 ± 0.00287 | Train Score: 0.68135 ± 0.00322 | xgb
Val Score: 0.54462 ± 0.00414 | Train Score: 0.59089 ± 0.00144 | lgb
Val Score: 0.52247 ± 0.00382 | Train Score: 0.53642 ± 0.00262 | dart
Val Score: 0.55826 ± 0.00308 | Train Score: 0.64903 ± 0.00283 | cb
Val Score: 0.54925 ± 0.00361 | Train Score: 0.63662 ± 0.00901 | hgb
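
The log above shows very different gaps between train and validation scores. A small sketch to tabulate them, with the mean scores copied by hand from that output:

```python
# (validation, train) mean macro-F1 per model, copied from the log above.
cv_scores = {
    'rf':   (0.53357, 0.99999),
    'et':   (0.52252, 1.00000),
    'xgb':  (0.55455, 0.68135),
    'lgb':  (0.54462, 0.59089),
    'dart': (0.52247, 0.53642),
    'cb':   (0.55826, 0.64903),
    'hgb':  (0.54925, 0.63662),
}

# Train-minus-validation gap: larger values mean more overfitting to the training folds.
gaps = {label: round(train - val, 5) for label, (val, train) in cv_scores.items()}

for label, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{label}: {gap:.5f}')
```

The two bagging models fit the training folds almost perfectly (gap of about 0.47), while the boosting models with default settings stay much closer to their validation scores.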


In [23]:
predictions = [(model, trained_models[model].predict(x)) for model in trained_models]

In [24]:
predictions

[('rf', array([2, 2, 1, ..., 1, 1, 1], dtype=int64)),
 ('et', array([2, 2, 1, ..., 1, 1, 1], dtype=int64)),
 ('xgb', array([1, 2, 1, ..., 1, 1, 1], dtype=int64)),
 ('lgb', array([1, 2, 1, ..., 1, 1, 1], dtype=int64)),
 ('dart', array([1, 2, 1, ..., 1, 1, 1], dtype=int64)),
 ('cb',
  array([[1],
         [2],
         [1],
         ...,
         [1],
         [1],
         [1]], dtype=int64)),
 ('hgb', array([1, 2, 1, ..., 1, 1, 1], dtype=int64))]
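
Note that the `cb` entry above is a column-shaped array, while the other models return flat arrays. A minimal sketch of the flattening that `reshape(-1)` performs, using a small stand-in array rather than the real predictions:

```python
import numpy as np

# Stand-in for a column-shaped prediction array like the `cb` entry above.
col_preds = np.array([[1], [2], [1]])

# reshape(-1) flattens it to the same 1-D layout the other models return.
flat_preds = col_preds.reshape(-1)
print(col_preds.shape, flat_preds.shape)
```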

In [25]:
def gen_submission(label, prediction):
    submission_df = pd.DataFrame()
    submission_df['enc_id'] = enc_ids
    submission_df['readmission_id'] = prediction.reshape(-1).astype('float')
    submission_df.to_csv(f"{label}.csv", index=False)

In [26]:
for prediction in predictions:
    gen_submission(*prediction)

In [31]:
trained_models.keys()

dict_keys(['rf', 'et', 'xgb', 'lgb', 'dart', 'cb', 'hgb'])

In [32]:
# Kaggle leaderboard scores (%) in the order rf, et, xgb, lgb, dart, cb, hgb
weights = [72.7, 71.5, 73.2, 73.5, 73.1, 73, 73]

In [34]:
def softmax(x):
    """Compute softmax values for the set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

In [35]:
softmax(weights)

array([0.10675778, 0.03215483, 0.17601382, 0.23759381, 0.15926389,
       0.14410793, 0.14410793])
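
The spread of these weights follows from a property of softmax: it depends only on the differences between scores, and the ratio of two weights is the exponential of their score difference. The sketch below checks this on the leaderboard scores listed earlier; `stable_softmax` is a hypothetical helper that subtracts the maximum before exponentiating, which is a common way to avoid overflow and gives the same weights.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax that subtracts the maximum first to avoid overflow."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Leaderboard scores (in percent) in the order rf, et, xgb, lgb, dart, cb, hgb.
scores = [72.7, 71.5, 73.2, 73.5, 73.1, 73.0, 73.0]

w = stable_softmax(scores)

# Shifting every score by the same constant leaves the weights unchanged.
w_shifted = stable_softmax(np.array(scores) - 70.0)
print(np.allclose(w, w_shifted))

# lgb leads et by 2 points, so its weight is exp(2) times larger.
print(w[3] / w[1], np.exp(2.0))
```

This is why `et`, only two points behind `lgb` on the leaderboard, ends up with roughly a seventh of its vote.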

In [36]:
weights = softmax(weights)

In [37]:
model = VotingClassifier(models, weights=weights, voting = 'soft')
model.fit(X, y)
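
`VotingClassifier` matches weights to estimators by position, so the weight list must follow the same order as the estimator list. The sketch below checks this on synthetic data with two small stand-in estimators and made-up weights (none of which are part of our pipeline) by rebuilding the ensemble probabilities from the fitted clones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=17)

# Stand-in estimators and weights; the i-th weight applies to the i-th estimator.
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('dt', DecisionTreeClassifier(max_depth=3, random_state=17))]
toy_weights = [0.7, 0.3]

vc = VotingClassifier(estimators, weights=toy_weights, voting='soft')
vc.fit(X_toy, y_toy)

# Rebuild the ensemble probabilities by hand from the fitted clones.
manual = np.average([est.predict_proba(X_toy) for est in vc.estimators_],
                    axis=0, weights=toy_weights)
print(np.allclose(manual, vc.predict_proba(X_toy)))
```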

In [38]:
res = model.predict(x)

In [39]:
res

array([1, 2, 1, ..., 1, 1, 1], dtype=int64)

In [40]:
gen_submission('voting', res)