# DATA-611 Final Project - Eland - AutoML Model Training

This notebook trains a classification model using auto-sklearn.

It starts with `train.csv`, trains a model on that dataset, and saves the fitted pipeline to `model.pkl`.

This notebook was originally developed in Azure Machine Learning Studio using the Python 3 (ipykernel) kernel on a STANDARD_E4DS_V4 compute instance.

## Dependencies

In [1]:
%pip install pandas
%pip install scikit-learn
%pip install auto-sklearn

Note: you may need to restart the kernel to use updated packages.


## Data Loading

Load the `train.csv` file, using its first column as the index.

In [2]:
import pandas as pd

df_train = pd.read_csv('train.csv', index_col=0)
df_train.head()

,Credit Amount,Repay Delay Sep,Repay Delay Aug,Repay Delay Jul,Prior Pay Sep,Prior Pay Aug,Prior Pay Jul,Graduate School,Is Married,Prior Pay Total,Repay Delay Total,Defaulted
14720,130000,0,0,0,10000,10000,15000,0,False,70000,0,False
2522,200000,0,0,0,753,547,2,0,False,3344,0,False
24918,90000,0,0,0,1968,1968,1218,0,False,8132,0,True
35444,200000,0,0,0,372,865,23,1,False,72777,0,True
7608,260000,0,0,0,10027,10107,20000,1,True,70134,0,False


In [3]:
# Split our label column from the rest of the data
X_train = df_train.drop(columns=['Defaulted'])
y_train = df_train['Defaulted']

## Automated ML Model Training

In [4]:
# Create a pipeline using auto-sklearn to auto-select the best model
import autosklearn.classification
import autosklearn.metrics  # imported explicitly because the f1 metric is referenced below
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# This should be the same as the number of CPU cores of the machine running the notebook
num_cores = 4

training_minutes = 180

# auto-sklearn searches for strong algorithm and hyperparameter combinations
classifier = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=training_minutes * 60,  # total search budget in seconds
    n_jobs=num_cores,
    metric=autosklearn.metrics.f1,
)

# Create a pipeline that scales the data, then runs the auto-sklearn classifier search
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', classifier)
])
pipeline

Pipeline(steps=[('scaler', StandardScaler()),
                ('model',
                 AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                                       metric=f1, n_jobs=4,
                                       time_left_for_this_task=10800))])
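The value `num_cores = 4` is hard-coded for the machine this notebook was developed on. When running elsewhere, a minimal sketch like the following could derive it instead. `detect_num_cores` is a hypothetical helper that is not part of the original notebook, and `os.cpu_count()` reports logical CPUs, which may exceed the number of physical cores.

```python
import os


def detect_num_cores(default=1):
    """Return the number of logical CPUs, or `default` if it cannot be determined."""
    count = os.cpu_count()  # may return None on some platforms
    return count if count else default


# Hypothetical replacement for the hard-coded value in the cell above
num_cores = detect_num_cores()
print(num_cores)
```

The result can then be passed as `n_jobs` in the same way as the hard-coded value.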

In [5]:
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Display info on the classifier
classifier

AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      metric=f1, n_jobs=4, per_run_time_limit=4320,
                      time_left_for_this_task=10800)
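After fitting, the classifier reports `per_run_time_limit=4320` even though the notebook never sets it. That value is consistent with a default of `n_jobs * time_left_for_this_task // 10`. The formula is an assumption inferred from the printed values, not something this notebook confirms. The arithmetic can be checked with a short sketch:

```python
training_minutes = 180
num_cores = 4

# Total search budget in seconds, as passed to time_left_for_this_task
time_left_for_this_task = training_minutes * 60

# Assumed default: upper bound in seconds for fitting a single candidate model
per_run_time_limit = num_cores * time_left_for_this_task // 10

print(time_left_for_this_task)
print(per_run_time_limit)
```

If a tighter cap per candidate is wanted, `per_run_time_limit` can be passed explicitly when constructing `AutoSklearnClassifier`.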

### Evaluated Models

In [6]:
# Display the leaderboard
classifier.leaderboard()

model_id,rank,ensemble_weight,type,cost,duration
312,1,0.04,gradient_boosting,0.188775,37.328512
176,2,0.06,gradient_boosting,0.190801,33.657909
469,3,0.04,gradient_boosting,0.190929,43.874617
751,4,0.06,gradient_boosting,0.191211,29.78471
684,5,0.02,gradient_boosting,0.191227,48.822893
736,6,0.08,gradient_boosting,0.191533,39.896581
692,7,0.02,gradient_boosting,0.191609,37.488533
666,8,0.14,gradient_boosting,0.192243,34.652723
695,9,0.02,gradient_boosting,0.19225,37.027887
661,10,0.06,gradient_boosting,0.192369,37.656005
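In the leaderboard, lower `cost` is better. Assuming auto-sklearn reports cost as the distance from the metric's optimum, and that the optimum for F1 is 1, the validation F1 of each model can be recovered as `1 - cost`. The sketch below applies that assumption to the first three rows; `leaderboard_rows` and `cost_to_f1` are illustrative names, with values copied by hand from the output above.

```python
# (model_id, cost) pairs copied from the first three leaderboard rows
leaderboard_rows = [
    (312, 0.188775),
    (176, 0.190801),
    (469, 0.190929),
]


def cost_to_f1(cost):
    """Convert a cost to an F1 score, assuming cost = 1 - F1."""
    return 1.0 - cost


f1_scores = {model_id: cost_to_f1(cost) for model_id, cost in leaderboard_rows}

for model_id, f1 in f1_scores.items():
    print(f"model {model_id}: validation F1 ~ {f1:.4f}")
```

Under that assumption, the ten listed models differ by less than 0.004 in validation F1, so the ensemble weights matter more than the individual ranks.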


In [7]:
# Enable pretty printing
from pprint import pprint

# Pretty print the classifier model
pprint(classifier.show_models(), indent=3)

{  176: {  'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f540a765c10>,
           'cost': 0.19080115920351504,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f5421d16b50>,
           'ensemble_weight': 0.06,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f540a765d30>,
           'model_id': 176,
           'rank': 1,
           'sklearn_classifier': HistGradientBoostingClassifier(early_stopping=False,
                               l2_regularization=2.3814598105175607e-08,
                               learning_rate=0.13979963154620015, max_iter=512,
                               max_leaf_nodes=321, min_samples_leaf=3,
                               n_iter_no_change=0, random_state=1,
                               validation_fraction=N

### Model Serialization

Now that we have a model, let's serialize it to disk using `pickle` so we can evaluate it in a different notebook.

In [8]:
import pickle

# Use a context manager so the file handle is closed once the model is written
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
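The evaluation notebook needs to read `model.pkl` back. A minimal sketch of the save and load round trip is shown here, using a stand-in dictionary and a temporary directory so that it runs without a trained pipeline. `save_model` and `load_model` are hypothetical helpers, not functions from the original notebook.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize `model` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Deserialize and return the object stored at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be the fitted pipeline
stand_in_model = {"name": "placeholder", "steps": ["scaler", "model"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Loading the real `model.pkl` the same way requires auto-sklearn and scikit-learn to be installed in the loading environment, because unpickling needs the original classes to be importable.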

Work continues in `ModelMetrics.ipynb`.