# Clearbox Wrapper Tutorial

Clearbox Wrapper is a Python library for packaging and saving an ML model.

We'll use the Lending Club Loans dataset and build an XGBoost classifier on it.

Before feeding the data to the model, we need to preprocess it. During development, preprocessing code is usually written separately from the model. We want to wrap and save the preprocessing along with the model, so that we have a Processing+Model pipeline ready to take unprocessed data, process it, and make predictions.

We can do that with Clearbox Wrapper, but all the preprocessing code must be wrapped in a single function. That way, we can pass the function to the `save_model` method.
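
To make the idea concrete, here is a minimal sketch of what a combined Processing+Model pipeline does. `SimplePipeline`, `scale_income`, and `ThresholdModel` are hypothetical names made up for illustration. This is not how Clearbox Wrapper is implemented, only the general pattern of chaining one preprocessing function with a model.

```python
from typing import Callable, List, Sequence


class ThresholdModel:
    """Toy stand-in for a trained model: predicts 1 when the input exceeds 0.5."""

    def predict(self, rows: Sequence[float]) -> List[int]:
        return [1 if value > 0.5 else 0 for value in rows]


def scale_income(raw_incomes: Sequence[float]) -> List[float]:
    """Toy preprocessing: divide incomes by an assumed maximum of 100_000."""
    return [income / 100_000 for income in raw_incomes]


class SimplePipeline:
    """Chains a single preprocessing function with a model (illustrative only)."""

    def __init__(self, preprocessing: Callable, model) -> None:
        self.preprocessing = preprocessing
        self.model = model

    def predict(self, raw_data):
        # Raw data goes through the preprocessing first, then into the model.
        return self.model.predict(self.preprocessing(raw_data))


pipeline = SimplePipeline(scale_income, ThresholdModel())
predictions = pipeline.predict([20_000, 80_000, 55_000])
```

The caller only ever sees raw data and predictions, which is why the preprocessing has to be a single callable.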

## Install and import required libraries

In [1]:
%%capture 
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install xgboost

!pip install clearbox-wrapper==0.3.10

In [2]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import sklearn.metrics as metrics

from xgboost import XGBClassifier

import clearbox_wrapper as cbw

## Datasets

The training and test sets are stored in two separate CSV files.

In [3]:
loans_training_csv_path = 'loans_training_set.csv'
loans_test_csv_path = 'loans_test_set.csv'

In [4]:
loans_training = pd.read_csv(loans_training_csv_path)
loans_test = pd.read_csv(loans_test_csv_path)

In [5]:
target_column = 'loan_risk'

In [6]:
y_train = loans_training[target_column]
X_train = loans_training.drop(target_column, axis=1)

In [7]:
y_test = loans_test[target_column]
X_test = loans_test.drop(target_column, axis=1)

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31148 entries, 0 to 31147
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   loan_amount         31148 non-null  int64  
 1   payments_term       31148 non-null  object 
 2   monthly_payment     31148 non-null  float64
 3   grade               31148 non-null  int64  
 4   working_years       31148 non-null  int64  
 5   home                31148 non-null  object 
 6   annual_income       31148 non-null  float64
 7   verification        31148 non-null  object 
 8   purpose             31148 non-null  object 
 9   debt_to_income      31148 non-null  float64
 10  delinquency         31148 non-null  int64  
 11  inquiries           31148 non-null  int64  
 12  open_credit_lines   31148 non-null  int64  
 13  derogatory_records  31148 non-null  int64  
 14  revolving_balance   31148 non-null  int64  
 15  revolving_rate      31148 non-null  float64
 16  tota

In [9]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7788 entries, 0 to 7787
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   loan_amount         7788 non-null   int64  
 1   payments_term       7788 non-null   object 
 2   monthly_payment     7788 non-null   float64
 3   grade               7788 non-null   int64  
 4   working_years       7788 non-null   int64  
 5   home                7788 non-null   object 
 6   annual_income       7788 non-null   float64
 7   verification        7788 non-null   object 
 8   purpose             7788 non-null   object 
 9   debt_to_income      7788 non-null   float64
 10  delinquency         7788 non-null   int64  
 11  inquiries           7788 non-null   int64  
 12  open_credit_lines   7788 non-null   int64  
 13  derogatory_records  7788 non-null   int64  
 14  revolving_balance   7788 non-null   int64  
 15  revolving_rate      7788 non-null   float64
 16  total_

## Create a preprocessing function

The data needs to be preprocessed before being passed as input to the model. You can use your own custom code for the preprocessing; just remember to wrap all of it in a single function.
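
As an illustration of custom code wrapped in a single function, here is a sketch using plain pandas. The function name `preprocess`, the default income value, and the `"RENT"` category are assumptions chosen for the example; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Assumed fallback for missing incomes (hypothetical value, fixed ahead of time
# so that the function behaves the same on any input batch).
DEFAULT_INCOME = 50_000.0


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """All preprocessing steps live inside this one function."""
    processed = raw.copy()
    # Step 1: fill missing incomes with the fixed default.
    processed["annual_income"] = processed["annual_income"].fillna(DEFAULT_INCOME)
    # Step 2: compress the income range with a log transform.
    processed["log_income"] = np.log1p(processed["annual_income"])
    # Step 3: turn a categorical column into a binary flag.
    processed["is_renter"] = (processed["home"] == "RENT").astype(int)
    return processed[["log_income", "is_renter"]]


toy_raw = pd.DataFrame({
    "annual_income": [40_000.0, np.nan, 90_000.0],
    "home": ["RENT", "OWN", "RENT"],
})
features = preprocess(toy_raw)
```

The input frame is copied rather than modified, so calling the function twice on the same raw data gives the same result.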

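The following preprocessing is built from several scikit-learn components. Numeric columns (named `ordinal_features` in the code) are imputed with the median and standardized, while categorical columns are imputed with the most frequent value and one-hot encoded: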


In [10]:
ordinal_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()


ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

x_preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
                                               ('cat', categorical_transformer, categorical_features)])
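
To see what this kind of transformer produces, here is a small sketch on made-up data. The column names and values in `toy_train` and `toy_new` are assumptions. It also shows the effect of `handle_unknown='ignore'`: a category not seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    "amount": [1000.0, 2000.0, np.nan, 4000.0],
    "home": ["RENT", "OWN", "RENT", "MORTGAGE"],
})

numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())])
categorical = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                              ("onehot", OneHotEncoder(handle_unknown="ignore"))])
toy_preprocessor = ColumnTransformer(transformers=[("num", numeric, ["amount"]),
                                                   ("cat", categorical, ["home"])])
toy_preprocessor.fit(toy_train)

# "OTHER" never appeared in the training data.
toy_new = pd.DataFrame({"amount": [3000.0], "home": ["OTHER"]})
transformed = toy_preprocessor.transform(toy_new)
# The result may be sparse or dense depending on its density; normalize to dense.
transformed = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)
```

The output has one scaled numeric column followed by one column per category seen during fitting, so its width is fixed at fit time.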

In [11]:
x_preprocessor.fit(X_train)

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['loan_amount', 'monthly_payment', 'grade',
                                  'working_years', 'annual_income',
                                  'debt_to_income', 'delinquency', 'inquiries',
                                  'open_credit_lines', 'derogatory_records',
                                  'revolving_balance', 'revolving_rate',
                                  'total_accounts', 'bankruptcies',
                                  'fico_average']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                             

As usual, we encode the Y labels with a simple `LabelEncoder`:

In [12]:
y_encoding = LabelEncoder()

In [13]:
y_encoding.fit(y_train)

LabelEncoder()
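
A small sketch of what the encoder does, on made-up labels. The actual values of `loan_risk` are not shown in this tutorial, so `"safe"` and `"risky"` are assumptions. `LabelEncoder` sorts the distinct labels, maps them to integers, and can map the integers back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["safe", "risky", "safe", "safe"]  # hypothetical label values
toy_encoder = LabelEncoder().fit(toy_labels)

classes = list(toy_encoder.classes_)                      # sorted distinct labels
encoded = toy_encoder.transform(toy_labels).tolist()      # labels -> integers
decoded = toy_encoder.inverse_transform([0, 1]).tolist()  # integers -> labels
```

Because the model is trained on encoded labels, its predictions are integers; the fitted encoder is what turns them back into the original label values.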

## Create and train the model

We build a simple XGBoost classifier, setting a few basic parameters:

In [14]:
xgb_clf = XGBClassifier(n_estimators=30, max_depth=8, random_state=42)

We then preprocess X and encode Y:

In [15]:
X_train_processed = x_preprocessor.transform(X_train)
X_test_processed = x_preprocessor.transform(X_test)

In [16]:
y_train_encoded = y_encoding.transform(y_train)
y_test_encoded = y_encoding.transform(y_test)

Finally, we fit the model on the processed data:

In [17]:
xgb_clf.fit(X_train_processed, y_train_encoded)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=8,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=30, n_jobs=8, num_parallel_tree=1, random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

We show some metrics on the training set...

In [18]:
training_predictions = xgb_clf.predict(X_train_processed)

In [19]:
print('Training Metrics Report:\n', metrics.classification_report(y_train_encoded, training_predictions))

Training Metrics Report:
               precision    recall  f1-score   support

           0       0.98      0.35      0.51      5433
           1       0.88      1.00      0.93     25715

    accuracy                           0.88     31148
   macro avg       0.93      0.67      0.72     31148
weighted avg       0.90      0.88      0.86     31148



...and on the test set:

In [20]:
test_predictions = xgb_clf.predict(X_test_processed)

In [21]:
print('Test Metrics Report:\n', metrics.classification_report(y_test_encoded, test_predictions))

Test Metrics Report:
               precision    recall  f1-score   support

           0       0.82      0.22      0.35      1364
           1       0.86      0.99      0.92      6424

    accuracy                           0.85      7788
   macro avg       0.84      0.60      0.63      7788
weighted avg       0.85      0.85      0.82      7788
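
In both reports, the recall for class 0 is much lower than for class 1 (0.22 versus 0.99 on the test set), which the overall accuracy hides because class 1 dominates the support. As a reminder of what the report measures, the following sketch computes per-class precision and recall by hand on made-up labels. `precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # how often the class was predicted
    actual = sum(1 for t in y_true if t == label)     # how often the class really occurs
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up, imbalanced example: class 0 is rare and mostly missed.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

precision_0, recall_0 = precision_recall(toy_true, toy_pred, label=0)
precision_1, recall_1 = precision_recall(toy_true, toy_pred, label=1)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
```

In this toy case the accuracy is 0.7 even though only one of the four class 0 examples is found, which is the same pattern the reports above show.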



## Wrap and Save the Model

Finally, we use Clearbox Wrapper to wrap and save the model and the preprocessor as a zipped folder in a specified path. 

The model dependency (xgboost) and its version are detected automatically by CBW and added to the requirements saved in the resulting folder. However (**IMPORTANT**), you need to pass any additional dependencies required by the preprocessing as a list. In this case, we need to add scikit-learn.
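
One way to keep a pinned version in line with what is actually installed is to build the requirement string from the current environment. This is only a sketch and not part of Clearbox Wrapper: `pinned_requirement` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def pinned_requirement(package_name: str) -> str:
    """Return 'name==version' for an installed package, or the bare name otherwise."""
    try:
        return f"{package_name}=={metadata.version(package_name)}"
    except metadata.PackageNotFoundError:
        # Package is not installed in this environment: leave it unpinned.
        return package_name


example_dependencies = [pinned_requirement("scikit-learn")]
```

The resulting list has the same shape as a hand-written list of requirement strings.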

We also pass the training dataset to `save_model` in order to generate a preprocessing and model signature. The signature describes the inputs and (optionally) the outputs as data frames with (optionally) named columns and data types.
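
To give a rough idea of the information a signature carries, the sketch below collects column names and data types from a data frame. The format used here (a list of dictionaries) and the helper `describe_columns` are assumptions for illustration; the format CBW actually writes is not shown in this tutorial.

```python
import pandas as pd


def describe_columns(frame: pd.DataFrame) -> list:
    """Collect the name and dtype of every column (illustrative format only)."""
    return [{"name": str(column), "type": str(dtype)}
            for column, dtype in frame.dtypes.items()]


toy_frame = pd.DataFrame({
    "loan_amount": [5000, 12000],
    "revolving_rate": [0.31, 0.58],
})
toy_signature = describe_columns(toy_frame)
```

A description like this lets a loader check that incoming data has the expected columns and types before predicting.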


In [22]:
wrapped_model_path = 'loans_xgboost_wrapped_model_v0.3.10'

In [23]:
processing_dependencies = ["scikit-learn==0.23.2"]

In [24]:
cbw.save_model(wrapped_model_path, xgb_clf, preprocessing=x_preprocessor, input_data=X_train, additional_deps=processing_dependencies)

## Unzip and load the model

The following cells are not needed by end users: the generated zip should be uploaded to our SaaS as is. Here, however, we want to show how to load a saved model and compare it with the original one. Similar code runs in the backend of the Clearbox AI SaaS.

**IMPORTANT**: To ensure reproducibility and avoid loading errors, the wrapped model must be loaded with the same Python version that was used to save it.
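
A simple guard along these lines can be sketched with the standard library. The helper `same_minor_version` is hypothetical, and comparing only the major and minor numbers is an assumption: the tutorial does not say how strict the match has to be.

```python
import sys


def same_minor_version(saved_version: str) -> bool:
    """Compare the major.minor part of a saved version string with the running interpreter."""
    saved_major, saved_minor = (int(part) for part in saved_version.split(".")[:2])
    return (saved_major, saved_minor) == tuple(sys.version_info[:2])


# Version string of the interpreter running this code, e.g. recorded at save time.
current_version = ".".join(str(part) for part in sys.version_info[:3])
```

Recording `current_version` next to the saved model and checking it before loading turns a confusing deserialization error into an explicit one.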

In [25]:
import zipfile

In [26]:
zipped_model_path = 'loans_xgboost_wrapped_model_v0.3.10.zip'
unzipped_model_path = 'loans_xgboost_wrapped_model_v0.3.10_unzipped'

In [27]:
with zipfile.ZipFile(zipped_model_path, 'r') as zip_ref:
    zip_ref.extractall(unzipped_model_path)

In [28]:
loaded_model = cbw.load_model(unzipped_model_path)

With the original model, the input data (`X_test`) must go through the preprocessing function before reaching the model.

In [29]:
original_model_predictions = xgb_clf.predict(X_test_processed)

With the wrapped model, **the preprocessing is part of the predict pipeline**, so we can pass the raw input data directly to the model's predict function:

In [30]:
loaded_model_predictions = loaded_model.predict(X_test)

We check that the predictions made with the original model and the wrapped one are equal:

In [31]:
np.testing.assert_array_equal(original_model_predictions, loaded_model_predictions)
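
`np.testing.assert_array_equal` returns nothing when the arrays match and raises an `AssertionError` when they differ, which is why the cell above produces no output. The sketch below shows the failing case on made-up arrays, together with a simple agreement ratio that can help when debugging a mismatch.

```python
import numpy as np

first = np.array([1, 0, 1, 1])
second = np.array([1, 0, 0, 1])  # differs from `first` in one position

# Fraction of positions where the two prediction arrays agree.
agreement = float(np.mean(first == second))

try:
    np.testing.assert_array_equal(first, second)
    arrays_match = True
except AssertionError:
    arrays_match = False
```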

## Remove all generated files and directories

In [None]:
import os
import shutil

In [None]:
if os.path.exists(zipped_model_path):
    os.remove(zipped_model_path)

In [None]:
if os.path.exists(unzipped_model_path):
    shutil.rmtree(unzipped_model_path)
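
An alternative to manual cleanup is to do the unzip step inside a temporary directory, which is deleted automatically. The sketch below uses only the standard library and a placeholder file instead of a real wrapped model; the file and folder names are made up.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in for a saved model folder containing one placeholder file.
    source_dir = os.path.join(workdir, "toy_model")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "model.txt"), "w") as handle:
        handle.write("placeholder")

    # Zip the folder, then extract it again, mirroring the steps above.
    archive_path = shutil.make_archive(os.path.join(workdir, "toy_model"), "zip", source_dir)
    extract_dir = os.path.join(workdir, "toy_model_unzipped")
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        zip_ref.extractall(extract_dir)
    extracted_files = sorted(os.listdir(extract_dir))

# Leaving the `with` block removes the directory and everything in it.
workdir_removed = not os.path.exists(workdir)
```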