# Performance Improvements

In this notebook, we'll cover some examples of how model performance can be improved. The techniques covered are:
- Sampling for imbalanced learning
- Bagging
- Boosting
- Continuous model selection using bandits

Clone the repo with notebooks and corresponding data. 

In [None]:
!git clone https://github.com/TurboML-Inc/colab-notebooks.git

Set up the environment and install TurboML's SDK. 

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()
!bash colab-notebooks/install_turboml.sh

The kernel should now be restarted with TurboML's SDK installed.

In [None]:
cd colab-notebooks

Log in to your TurboML instance.

In [None]:
import pandas as pd
import turboml as tb

# BACKEND_URL and API_KEY are placeholders: set them to your instance's values first.
tb.init(backend_url=BACKEND_URL, api_key=API_KEY)

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score

In [None]:
# reset_index() adds an "index" column, used below as the key field of each dataset.
transactions_df = pd.read_csv("data/transactions.csv").reset_index()
labels_df = pd.read_csv("data/labels.csv").reset_index()

In [None]:
transactions = tb.PandasDataset(
    dataset_name="transactions_performance_improve",
    key_field="index",
    dataframe=transactions_df,
    upload=True,
)
labels = tb.PandasDataset(
    dataset_name="labels_performance_improve",
    key_field="index",
    dataframe=labels_df,
    upload=True,
)

In [None]:
numerical_fields = [
    "transactionAmount",
    "localHour",
]
categorical_fields = [
    "digitalItemCount",
    "physicalItemCount",
    "isProxyIP",
]
features = transactions.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels.get_label_field(label_field="is_fraud")

Now that our setup is ready, let's first see the performance of a base HoeffdingTreeClassifier model.

In [None]:
htc_model = tb.HoeffdingTreeClassifier(n_classes=2)

In [None]:
deployed_model = htc_model.deploy("htc_classifier", input=features, labels=label)

In [None]:
outputs = deployed_model.get_outputs()

In [None]:
len(outputs)

In [None]:
true_labels = labels_df["is_fraud"].values

In [None]:
real_outputs = np.array([x["record"].predicted_class for x in outputs])
roc_auc_score(true_labels, real_outputs)
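
A note on the metric: `roc_auc_score` is fed hard class predictions here. With 0/1 predictions the ROC curve has a single operating point, so the AUC reduces to the average of the true-positive and true-negative rates. The dependency-free sketch below illustrates this on made-up numbers; `pairwise_auc`, `scores` and `hard_predictions` are hypothetical names and values, not outputs of the deployed model.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: four negatives, two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
hard_predictions = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)          # 0.875
auc_from_hard = pairwise_auc(y_true, hard_predictions)  # 0.625 = (TPR 0.5 + TNR 0.75) / 2
```

If the output records also expose a continuous score (an assumption to verify against the record's fields), passing that to `roc_auc_score` would give a threshold-independent number.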

Not bad. But can we improve it further? We haven't yet used the fact that the dataset is highly skewed.
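
To see how skewed a label column is, it helps to look at the class fractions. The snippet below is a minimal sketch on hypothetical labels (2 positives out of 100); `class_distribution` and `example_labels` are illustrative names, not part of the notebook's data.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: 2 fraudulent transactions out of 100.
example_labels = [0] * 98 + [1] * 2
distribution = class_distribution(example_labels)
print(distribution)
```

On the real data, `labels_df["is_fraud"].value_counts(normalize=True)` reports the same fractions.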

## Sampling for Imbalanced Learning


In [None]:
sampler_model = tb.RandomSampler(
    n_classes=2, desired_dist=[0.5, 0.5], sampling_method="under", base_model=htc_model
)
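
The idea behind undersampling can be sketched without any library: drop majority-class samples at random so that the samples reaching the base model roughly follow the desired class distribution. The code below is an illustration under that assumption, not TurboML's implementation; `undersample_stream` and the synthetic `stream` (one positive every 50 samples) are hypothetical.

```python
import random
from collections import Counter


def undersample_stream(labels, desired_dist, seed=0):
    """Yield (index, label) pairs kept by a simple online random undersampler."""
    rng = random.Random(seed)
    seen = Counter()
    for i, label in enumerate(labels):
        seen[label] += 1
        total = sum(seen.values())
        # Ratio of target share to observed share for every class seen so far.
        ratios = {c: desired_dist[c] / (seen[c] / total) for c in seen}
        # The most under-represented class is always kept; others are thinned.
        keep_probability = ratios[label] / max(ratios.values())
        if rng.random() < keep_probability:
            yield i, label


# Synthetic stream: 2% positives, 98% negatives.
stream = [1 if i % 50 == 0 else 0 for i in range(10_000)]
kept = [label for _, label in undersample_stream(stream, desired_dist=[0.5, 0.5])]
positive_fraction = sum(kept) / len(kept)
print(len(kept), round(positive_fraction, 2))
```

Every positive is kept, while negatives are accepted with a probability that shrinks as they pile up, so the kept stream ends up close to balanced at the cost of discarding most of the data.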

In [None]:
deployed_model = sampler_model.deploy(
    "undersampler_model", input=features, labels=label
)

In [None]:
outputs = deployed_model.get_outputs()

In [None]:
len(outputs)

In [None]:
real_outputs = np.array([x["record"].predicted_class for x in outputs])
roc_auc_score(true_labels, real_outputs)

## Bagging

In [None]:
lbc_model = tb.LeveragingBaggingClassifier(n_classes=2, base_model=htc_model)
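
Online bagging approximates bootstrap resampling on a stream by showing each sample to each ensemble member k times, with k drawn from a Poisson distribution; leveraging bagging uses a larger rate (commonly 6) to increase the diversity of the members. The sketch below shows only this resampling idea and omits the drift detection used by the full algorithm. `CentroidLearner`, `OnlineBagging` and the synthetic data are hypothetical stand-ins, not TurboML classes.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CentroidLearner:
    """Toy 1-D online classifier: predicts the class with the nearest weighted mean."""

    def __init__(self):
        self.weighted_sum = {}
        self.weight = {}

    def learn(self, x, y, weight):
        self.weighted_sum[y] = self.weighted_sum.get(y, 0.0) + weight * x
        self.weight[y] = self.weight.get(y, 0.0) + weight

    def predict(self, x):
        if not self.weight:
            return 0  # arbitrary default before any training
        return min(self.weight, key=lambda c: abs(x - self.weighted_sum[c] / self.weight[c]))


class OnlineBagging:
    def __init__(self, n_models=5, lam=6.0, seed=0):
        self.models = [CentroidLearner() for _ in range(n_models)]
        self.lam = lam
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for model in self.models:
            k = poisson(self.lam, self.rng)  # each member gets its own sample weight
            if k > 0:
                model.learn(x, y, weight=k)

    def predict(self, x):
        votes = Counter(model.predict(x) for model in self.models)
        return votes.most_common(1)[0][0]


# Synthetic stream: class 0 clusters near 0, class 1 near 10.
data_rng = random.Random(42)
ensemble = OnlineBagging()
for i in range(200):
    y = i % 2
    ensemble.learn(10.0 * y + data_rng.uniform(-1.0, 1.0), y)
print(ensemble.predict(0.5), ensemble.predict(9.5))
```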

In [None]:
deployed_model = lbc_model.deploy("lbc_classifier", input=features, labels=label)

In [None]:
outputs = deployed_model.get_outputs()

In [None]:
len(outputs)

In [None]:
real_outputs = np.array([x["record"].predicted_class for x in outputs])
roc_auc_score(true_labels, real_outputs)

## Boosting

In [None]:
abc_model = tb.AdaBoostClassifier(n_classes=2, base_model=htc_model)
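
Online boosting passes each sample through the ensemble members in sequence and raises the sample's weight whenever a member misclassifies it, so later members concentrate on the hard cases. The sketch below shows one common form of that weight update, in the spirit of Oza and Russell's online boosting; whether TurboML uses exactly this rule is an assumption, and `update_example_weight` and the `stats` dictionaries are hypothetical.

```python
def update_example_weight(lam, correct, stats):
    """Update one member's running totals and return the sample's new weight.

    stats holds the total weight this member has classified correctly
    ("correct") and incorrectly ("wrong") so far.
    """
    stats["correct" if correct else "wrong"] += lam
    total = stats["correct"] + stats["wrong"]
    if correct:
        # Equivalent to lam / (2 * (1 - error_rate)): the weight shrinks.
        return lam * total / (2.0 * stats["correct"])
    # Equivalent to lam / (2 * error_rate): the weight grows.
    return lam * total / (2.0 * stats["wrong"])


# A member that has been right on 9 units of weight now makes its first mistake.
stats_mistake = {"correct": 9.0, "wrong": 0.0}
weight_after_mistake = update_example_weight(1.0, correct=False, stats=stats_mistake)  # 5.0

# A member with a 10% error history classifies the sample correctly.
stats_hit = {"correct": 9.0, "wrong": 1.0}
weight_after_hit = update_example_weight(1.0, correct=True, stats=stats_hit)  # 0.55
```

In a full online booster, the updated weight would set the Poisson rate used to train the next member on this sample.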

In [None]:
deployed_model = abc_model.deploy("abc_classifier", input=features, labels=label)

In [None]:
outputs = deployed_model.get_outputs()

In [None]:
len(outputs)

In [None]:
real_outputs = np.array([x["record"].predicted_class for x in outputs])
roc_auc_score(true_labels, real_outputs)

## Continuous Model Selection with Bandits

In [None]:
bandit_model = tb.BanditModelSelection(base_models=[htc_model, lbc_model, abc_model])
deployed_model = bandit_model.deploy(
    "demo_classifier_bandit", input=features, labels=label
)
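
Bandit-based model selection treats each candidate model as an arm and keeps routing to the one with the best observed performance, while still exploring the others occasionally. The sketch below uses an epsilon-greedy rule with a 0/1 reward for a correct prediction; the selection rule and metric TurboML actually uses may differ, and `EpsilonGreedySelector`, `simulate` and the fixed accuracies are hypothetical.

```python
import random


class EpsilonGreedySelector:
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.rng = random.Random(seed)

    def best(self):
        """Index of the arm with the highest average reward so far."""
        return max(range(len(self.means)), key=self.means.__getitem__)

    def select(self):
        if 0 in self.counts:
            return self.counts.index(0)  # try every arm at least once
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return self.best()  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running mean of the rewards for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(accuracies, rounds=5000, seed=1):
    """Simulate models that are correct with the given fixed probabilities."""
    env = random.Random(seed)
    selector = EpsilonGreedySelector(len(accuracies))
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if env.random() < accuracies[arm] else 0.0
        selector.update(arm, reward)
    return selector


selector = simulate([0.6, 0.7, 0.9])
print(selector.best(), selector.counts)
```

Because the averages are updated after every sample, the selected model can change over time if the relative performance of the candidates shifts.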

In [None]:
outputs = deployed_model.get_outputs()

In [None]:
len(outputs)

In [None]:
real_outputs = np.array([x["record"].predicted_class for x in outputs])
roc_auc_score(true_labels, real_outputs)