# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library for packaging and saving an ML model.

We'll use the Lending Club Loans dataset and build an XGBoost classifier on it.

Before feeding the data to the model, we need to pre-process them. During development, pre-processing code is usually written separately from the model. We want to wrap and save the pre-processing together with the model, so that we have a Processing+Model pipeline ready to take unprocessed data, process them, and make predictions.

We can do that with Clearbox Wrapper, but all the pre-processing code must be wrapped in a single function. In this way, we can pass the function to the _save_model_ method.
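To make the Processing+Model idea concrete, here is a minimal, library-free sketch. It is not Clearbox Wrapper's implementation: the class `PreprocessThenPredict`, the function `preprocess`, the toy `ThresholdModel`, and the sample values are all hypothetical names and data chosen for illustration. Only the column names come from the dataset.

```python
from typing import Callable, List, Sequence


class PreprocessThenPredict:
    """Hypothetical stand-in for a wrapped model: raw input -> preprocessing -> model."""

    def __init__(self, preprocessing: Callable, model) -> None:
        self.preprocessing = preprocessing
        self.model = model

    def predict(self, raw_rows):
        # The caller never touches processed data: preprocessing runs inside predict.
        return self.model.predict(self.preprocessing(raw_rows))


def preprocess(raw_rows: Sequence[dict]) -> List[List[float]]:
    """All preprocessing lives in one function: rescale income, encode the term."""
    return [
        [row["annual_income"] / 1000.0,
         1.0 if row["payments_term"] == "60 months" else 0.0]
        for row in raw_rows
    ]


class ThresholdModel:
    """Toy model that only understands already-processed numeric features."""

    def predict(self, features):
        return [int(income_k < 50.0 and long_term == 1.0)
                for income_k, long_term in features]


pipeline = PreprocessThenPredict(preprocess, ThresholdModel())

# Invented sample rows (the term strings are an assumption, not taken from the data).
raw = [
    {"annual_income": 30000.0, "payments_term": "60 months"},
    {"annual_income": 90000.0, "payments_term": "36 months"},
]
print(pipeline.predict(raw))  # [1, 0]
```

Because the preprocessing is a single callable, it can be stored and shipped next to the model as one unit, which is the reason for the single-function requirement.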

## Install and import required libraries

In [1]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install xgboost

!pip install clearbox-wrapper


In [2]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

from xgboost import XGBClassifier

import clearbox_wrapper as cbw

## Datasets

We have two separate CSV files, one for the training set and one for the test set.

In [3]:
loans_training_csv_path = 'loans_training_set.csv'
loans_test_csv_path = 'loans_test_set.csv'

In [5]:
loans_training = pd.read_csv(loans_training_csv_path)
loans_test = pd.read_csv(loans_test_csv_path)

In [6]:
target_column = 'loan_risk'

In [7]:
y_train = loans_training[target_column]
X_train = loans_training.drop(target_column, axis=1)

In [8]:
y_test = loans_test[target_column]
X_test = loans_test.drop(target_column, axis=1)

In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31148 entries, 0 to 31147
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   loan_amount         31148 non-null  int64  
 1   payments_term       31148 non-null  object 
 2   monthly_payment     31148 non-null  float64
 3   grade               31148 non-null  int64  
 4   working_years       31148 non-null  int64  
 5   home                31148 non-null  object 
 6   annual_income       31148 non-null  float64
 7   verification        31148 non-null  object 
 8   purpose             31148 non-null  object 
 9   debt_to_income      31148 non-null  float64
 10  delinquency         31148 non-null  int64  
 11  inquiries           31148 non-null  int64  
 12  open_credit_lines   31148 non-null  int64  
 13  derogatory_records  31148 non-null  int64  
 14  revolving_balance   31148 non-null  int64  
 15  revolving_rate      31148 non-null  float64
 16  tota

In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7788 entries, 0 to 7787
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   loan_amount         7788 non-null   int64  
 1   payments_term       7788 non-null   object 
 2   monthly_payment     7788 non-null   float64
 3   grade               7788 non-null   int64  
 4   working_years       7788 non-null   int64  
 5   home                7788 non-null   object 
 6   annual_income       7788 non-null   float64
 7   verification        7788 non-null   object 
 8   purpose             7788 non-null   object 
 9   debt_to_income      7788 non-null   float64
 10  delinquency         7788 non-null   int64  
 11  inquiries           7788 non-null   int64  
 12  open_credit_lines   7788 non-null   int64  
 13  derogatory_records  7788 non-null   int64  
 14  revolving_balance   7788 non-null   int64  
 15  revolving_rate      7788 non-null   float64
 16  total_

## Create a preprocessing function

The data need to be preprocessed before being passed as input to the model. You can use your own custom code for the preprocessing; just remember to wrap all of it in a single function.

The following preprocessing is built from several components provided by scikit-learn:


In [11]:
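# Numeric columns (called "ordinal" here) are imputed and scaled;
# string-typed (object) columns are imputed and one-hot encoded.
ordinal_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()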


ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

x_preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
                                               ('cat', categorical_transformer, categorical_features)])
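As a rough illustration of what this kind of preprocessor produces, the sketch below fits the same pipeline shape on a tiny invented DataFrame. The column names `loan_amount` and `home` come from the dataset, but the values, the `toy_*` names, and the `to_dense` helper are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows: invented values, two columns only.
toy_train = pd.DataFrame({"loan_amount": [1000, 2000, 3000],
                          "home": ["RENT", "OWN", "RENT"]})
# A new row whose category was never seen during fit.
toy_new = pd.DataFrame({"loan_amount": [2000], "home": ["OTHER"]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), ["loan_amount"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["home"]),
])


def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; normalise to a NumPy array."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_out = to_dense(toy_preprocessor.fit_transform(toy_train))
new_out = to_dense(toy_preprocessor.transform(toy_new))

print(train_out.shape)  # (3, 3): one scaled numeric column + two one-hot columns
print(new_out)          # the unseen category is encoded as all zeros
```

With `handle_unknown='ignore'`, a category that was not present at fit time does not raise an error at prediction time; its one-hot columns are simply all zero.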

In [12]:
x_preprocessor.fit(X_train)

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['loan_amount', 'monthly_payment', 'grade',
                                  'working_years', 'annual_income',
                                  'debt_to_income', 'delinquency', 'inquiries',
                                  'open_credit_lines', 'derogatory_records',
                                  'revolving_balance', 'revolving_rate',
                                  'total_accounts', 'bankruptcies',
                                  'fico_average']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                             

As usual, we encode the Y labels with a simple LabelEncoder:

In [13]:
y_encoding = LabelEncoder()

In [14]:
y_encoding.fit(y_train)

LabelEncoder()
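For readers unfamiliar with LabelEncoder, this small sketch shows the mapping it learns. The label values `"bad"` and `"good"` are invented for the example and are not necessarily the values found in the `loan_risk` column.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["bad", "good", "good", "bad"]  # hypothetical label values

toy_encoder = LabelEncoder().fit(toy_labels)

# Classes are stored in sorted order; the integer code is the index in classes_.
classes = toy_encoder.classes_.tolist()
encoded = toy_encoder.transform(toy_labels).tolist()
decoded = toy_encoder.inverse_transform(encoded).tolist()

print(classes)  # ['bad', 'good']
print(encoded)  # [0, 1, 1, 0]
print(decoded)  # ['bad', 'good', 'good', 'bad']
```

Keeping the fitted encoder around matters later: `inverse_transform` is what turns the model's integer predictions back into the original labels.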

## Create and train the model

We build a simple XGBoost classifier, setting a few basic parameters:

In [15]:
xgb_clf = XGBClassifier(n_estimators=30, max_depth=8, random_state=42)

We proceed to pre-process the X and encode the Y:

In [16]:
X_train_processed = x_preprocessor.transform(X_train)

In [17]:
y_train_encoded = y_encoding.transform(y_train)


Finally, we fit the model on the processed data:

In [18]:
xgb_clf.fit(X_train_processed, y_train_encoded)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=8,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=30, n_jobs=0, num_parallel_tree=1, random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder in a specified path. 

The model dependency (xgboost) and its version are detected automatically by CBW and added to the requirements saved in the resulting folder. However (**IMPORTANT**), you need to pass any additional dependencies required by the pre-processing as a list through the _additional_deps_ parameter. In this case, we need to add scikit-learn.

In [19]:
wrapped_model_path = 'loans_xgboost_wrapped_model_v0.0.1'

In [20]:
processing_dependencies = ["scikit-learn==0.23.2"]
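A hardcoded pin like the one above can drift from the version actually installed in the environment. One possible alternative, sketched here with the standard library only, is to read the installed version at save time. The helper name `pin_installed` is hypothetical and not part of Clearbox Wrapper.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_installed(package_name: str) -> str:
    """Return 'name==installed_version', or the bare name if it is not installed."""
    try:
        return f"{package_name}=={version(package_name)}"
    except PackageNotFoundError:
        # Fall back to an unpinned requirement rather than failing.
        return package_name


processing_dependencies = [pin_installed("scikit-learn")]
print(processing_dependencies)
```

The printed value depends on the environment, so no expected output is shown. Pinning to the version used to fit the preprocessor reduces the chance that the saved preprocessing is loaded with an incompatible scikit-learn release.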

In [21]:
cbw.save_model(wrapped_model_path, xgb_clf, x_preprocessor, additional_deps=processing_dependencies)

<clearbox_wrapper.clearbox_wrapper.ClearboxWrapper at 0x7fc66d84f880>

## Unzip and load the model

The following cells are not necessary for end users: the generated zip file should be uploaded to our SaaS as it is. Here, however, we want to show how to load a saved model and compare it with the original one. Similar code runs in the backend of the Clearbox AI SaaS.

In [22]:
import zipfile

In [23]:
zipped_model_path = 'loans_xgboost_wrapped_model_v0.0.1.zip'
unzipped_model_path = 'loans_xgboost_wrapped_model_v0.0.1_unzipped'

In [24]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)
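When an archive comes from elsewhere, it can be worth checking its integrity before extracting it. The sketch below runs a full zip round trip in a temporary directory using only the standard library. The folder and file names are placeholders and do not reflect the actual layout of a Clearbox Wrapper archive.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a placeholder folder and zip it (stands in for a saved model folder).
    source_dir = tmp_path / "toy_model"
    source_dir.mkdir()
    (source_dir / "placeholder.txt").write_text("example content\n")
    archive = shutil.make_archive(str(tmp_path / "toy_model_archive"), "zip",
                                  root_dir=source_dir)

    target_dir = tmp_path / "toy_model_unzipped"
    with zipfile.ZipFile(archive, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = zip_ref.testzip()
        if bad_member is None:
            zip_ref.extractall(target_dir)

    extracted = sorted(p.name for p in target_dir.iterdir())

print(bad_member)  # None
print(extracted)   # ['placeholder.txt']
```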

In [25]:
loaded_model = cbw.load_model(unzipped_model_path)

With the original model, the input data (X_test) must go through the pre-processing function before being passed to the model.

In [26]:
X_test_processed = x_preprocessor.transform(X_test)
original_model_predictions = xgb_clf.predict_proba(X_test_processed)

With the wrapped model, **the pre-processing is part of the predict pipeline**, so we can pass the raw input data directly to the model's predict function:

In [28]:
loaded_model_predictions = loaded_model.predict(X_test)

We check that the predictions made with the original model and the wrapped one are equal:

In [29]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
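Exact equality is a strict check. If the saved model were reloaded on a different machine or library version, tiny floating-point differences might appear, and a tolerance-based comparison could be more robust. The sketch below uses invented probability arrays, and the tolerances are an arbitrary choice for illustration.

```python
import numpy as np

# Invented class probabilities standing in for the two sets of predictions.
original = np.array([[0.9, 0.1], [0.2, 0.8]])
reloaded = original + 1e-9  # simulate a tiny numerical drift

# Passes when every element agrees within the given relative/absolute tolerance.
np.testing.assert_allclose(original, reloaded, rtol=1e-6, atol=1e-8)

# The predicted classes (argmax over probabilities) are unaffected by the drift.
same_labels = np.array_equal(original.argmax(axis=1), reloaded.argmax(axis=1))
exactly_equal = np.array_equal(original, reloaded)

print(same_labels)    # True
print(exactly_equal)  # False
```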


## Remove all generated files and directories

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)