In [1]:
# Module: Data Science in Finance, AutoML 
# Version 1.0
# Topic : AutoML - TPOT
# Example source: https://www.kaggle.com/wendykan/lending-club-loan-data
#####################################################################
# For support or questions, contact Sri Krishnamurthy at
# sri@quantuniversity.com
# Copyright 2018 QuantUniversity LLC.
#####################################################################

# AutoML with TPOT

AutoML is the process of automating an end-to-end machine learning pipeline. [TPOT](https://epistasislab.github.io/tpot/) uses genetic programming to optimise these pipelines by selecting the best model and its hyperparameters.

![TPOT](https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-ml-pipeline.png)
<center>Image source: https://epistasislab.github.io/tpot/</center>

This notebook explains the basic workflow involved in an AutoML pipeline with TPOT.

### Imports

In [2]:
# for numerical analysis and data processing
import numpy as np
import pandas as pd

#AutoML
from sklearn.metrics import make_scorer  # the sklearn.metrics.scorer module path was removed in newer scikit-learn releases
from tpot import TPOTRegressor

### Dataset

The data set contains LendingClub loan data from August 2011 to December 2011 for a subset of borrowers, together with a data dictionary that describes each feature. The original data file contains redundant features that are not needed for making predictions; the file provided here has already been cleaned and keeps only the relevant ones. The features are of two types: numerical and categorical.

Read the input data from the CSV file.

In [4]:
df = pd.read_csv("../data/LendingClubLoan.csv", low_memory=False)
del df['issue_d'] # remove the issue date, as it won't affect the prediction (redundant feature)
df_description = pd.read_excel('../data/LCDataDictionary.xlsx').dropna()

In [5]:
df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,inq_last_6mths,loan_status_Binary
0,5000,36 months,10.65,162.87,B,B2,10+ years,RENT,24000.0,Verified,credit_card,AZ,27.65,0,1,0
1,2500,60 months,15.27,59.83,C,C4,< 1 year,RENT,30000.0,Source Verified,car,GA,1.0,0,5,1
2,2400,36 months,15.96,84.33,C,C5,10+ years,RENT,12252.0,Not Verified,small_business,IL,8.72,0,2,0
3,10000,36 months,13.49,339.31,C,C1,10+ years,RENT,49200.0,Source Verified,other,CA,20.0,0,1,0
4,3000,60 months,12.69,67.79,B,B5,1 year,RENT,80000.0,Source Verified,other,OR,17.94,0,0,0


### Preparing categorical features - One Hot Encoding

**The current version of TPOT does not support sparse matrices ([link](https://github.com/EpistasisLab/tpot/issues/526)), so some preprocessing is needed first, such as converting categorical features to numeric values.**

One way of representing categorical features is called one-hot encoding. Assume a categorical feature X with possible values \[a, b, c, d\]. If X=c in some sample, one-hot encoding represents the feature as X=\[0, 0, 1, 0\]. It is a binary array whose length equals the number of possible feature values, with a 1 in the position of the actual value.

If X can take the values a, b, c, d, then:

X=c  
X=\[0, 0, 1, 0\]

X=a  
X=\[1, 0, 0, 0\]
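
The encoding described above can be sketched in a few lines of NumPy. The helper name `one_hot` and the list `categories` are illustrative choices, not part of the notebook or of TPOT.

```python
import numpy as np

def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    encoded = np.zeros(len(categories), dtype=int)
    encoded[categories.index(value)] = 1
    return encoded

categories = ["a", "b", "c", "d"]   # possible values of feature X
x_is_c = one_hot("c", categories)   # X = c
x_is_a = one_hot("a", categories)   # X = a
print(x_is_c, x_is_a)               # [0 0 1 0] [1 0 0 0]
```

Each vector has as many entries as there are categories, so a feature with many distinct values (such as a state code) produces a wide, mostly zero matrix.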

In [6]:
numeric_columns = df.select_dtypes(include=['float64','int64']).columns
categorical_columns = df.select_dtypes(include=['object']).columns

In [7]:
for col in categorical_columns:
    df[col] = df[col].astype('category')

#### Dictionary for categorical features.

In [8]:
categories={}
for cat in categorical_columns:
    categories[cat] = df[cat].cat.categories.tolist()

In [9]:
p_categories = df['purpose'].cat.categories.tolist()
s_categories = df['addr_state'].cat.categories.tolist()
df[categorical_columns] = df[categorical_columns].apply(lambda x: x.cat.codes)
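
Note that `cat.codes` replaces each category with a single integer rather than a one-hot vector, which is why the category lists are saved first. A minimal sketch of that round trip on a toy column (the values below are made up for illustration and are not the loan data):

```python
import pandas as pd

# Toy categorical column, standing in for a column such as 'grade'
grades = pd.Series(["B", "C", "C", "A"]).astype("category")

labels = grades.cat.categories.tolist()  # sorted unique labels
codes = grades.cat.codes.tolist()        # position of each value in `labels`

# The saved label list is what allows codes to be mapped back later
decoded = [labels[code] for code in codes]
print(labels, codes, decoded)
```

Without the saved list, an integer such as `2` in the encoded frame could no longer be traced back to its original label.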

In [10]:
df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,purpose,addr_state,dti,delinq_2yrs,inq_last_6mths,loan_status_Binary
0,5000,0,10.65,162.87,1,6,1,2,24000.0,2,1,3,27.65,0,1,0
1,2500,1,15.27,59.83,2,13,10,2,30000.0,1,0,10,1.0,0,5,1
2,2400,0,15.96,84.33,2,14,1,2,12252.0,0,10,12,8.72,0,2,0
3,10000,0,13.49,339.31,2,10,1,2,49200.0,1,8,4,20.0,0,1,0
4,3000,1,12.69,67.79,1,9,0,2,80000.0,1,8,31,17.94,0,0,0


Storing interest rate statistics

In [11]:
min_rate= df['int_rate'].min()
max_rate= df['int_rate'].max()
print(min_rate, max_rate, max_rate- min_rate)

5.42 24.11 18.689999999999998


In [12]:
df_max = df.max()
df_min = df.min()

## Preparing the dataset 

The data is split into training and testing data. x represents the input features, whereas y represents the output, i.e. the interest rate. As a rule of thumb, we split the data into 80% training data and 20% testing or validation data.

In [13]:
y = df.iloc[:,df.columns.isin(["int_rate"])]
x = df.loc[:, ~df.columns.isin(["int_rate"])]

total_samples=len(df)
split = 0.8

x_train = x[0:int(total_samples*split)]
x_test = x[int(total_samples*split):total_samples]
y_train = y[0:int(total_samples*split)]
y_test = y[int(total_samples*split):total_samples]
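
The slicing above is a sequential split: the first 80% of rows go to training and the remainder to testing, with no shuffling. A small sketch of the same idea, using a hypothetical helper `sequential_split` and toy data:

```python
import pandas as pd

def sequential_split(frame, train_fraction):
    """Split rows in their original order: first part for training, rest for testing."""
    cutoff = int(len(frame) * train_fraction)
    return frame.iloc[:cutoff], frame.iloc[cutoff:]

# Toy frame with 10 rows, not the loan data
toy = pd.DataFrame({"feature": range(10), "int_rate": [r / 2 for r in range(10)]})
train, test = sequential_split(toy, 0.8)
print(len(train), len(test))  # 8 2
```

Because the rows are not shuffled, the test set is simply the tail of the file. If the file is ordered in some way, a shuffled split may give a more representative test set.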

## AutoML

#### TPOT Regressor [link](https://epistasislab.github.io/tpot/api/#regression)
The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. The TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.

These pipelines are various combinations of different preprocessors and sklearn models. Some preprocessors include:
* Binarizer
* FastICA
* FeatureAgglomeration
* MaxAbsScaler
* Normalizer
* PCA
* StandardScaler
* RBFSampler
* OneHotEncoder

TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.
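
As a quick check of this formula, the sketch below computes the evaluation budget for two settings. The function name `total_pipelines` is illustrative, and it assumes that the offspring size falls back to the population size when it is not set.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    # Assumption: offspring_size defaults to population_size when unset
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size

# Settings used later in this notebook: 10 generations, population of 10
notebook_budget = total_pipelines(population_size=10, generations=10)

# 100 generations with a population of 100
default_budget = total_pipelines(population_size=100, generations=100)

print(notebook_budget, default_budget)  # 110 10100
```

These figures are upper bounds on the work requested. A time limit such as `max_time_mins` can stop the search before all of the pipelines have been evaluated.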

**TPOTRegressor's performance is only as good as the amount of time it is allowed to optimize.**

Here is an excerpt from TPOT's official documentation: 

*"TPOT will take a while to run on larger datasets, but it’s important to realize why. With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search."*

### The following is all the code needed to find the best model:

In [14]:
tpot = TPOTRegressor(generations=10, 
                     population_size=10, 
                     verbosity=2, 
                     max_time_mins=10, # total time budget for the search, in minutes
                     max_eval_time_mins=1, # time limit per pipeline evaluation, in minutes
                     scoring='neg_mean_absolute_error')
tpot.fit(x_train, y_train)

  from numpy.core.umath_tests import inner1d
  y = column_or_1d(y, warn=True)



Generation 1 - Current best internal CV score: -0.05914191270485769
Generation 2 - Current best internal CV score: -0.05823082512024584
Generation 3 - Current best internal CV score: -0.057404570621235884
Generation 4 - Current best internal CV score: -0.057404570621235884
Generation 5 - Current best internal CV score: -0.057161803168295645

10.482767866666668 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: RandomForestRegressor(XGBRegressor(OneHotEncoder(CombineDFs(input_matrix, input_matrix), minimum_fraction=0.15, sparse=False, threshold=10), learning_rate=1.0, max_depth=1, min_child_weight=15, n_estimators=100, nthread=1, subsample=0.7000000000000001), bootstrap=False, max_features=0.7500000000000001, min_samples_leaf=10, min_samples_split=5, n_estimators=100)


TPOTRegressor(config_dict=None, crossover_rate=0.1, cv=5,
       disable_update_check=False, early_stop=None, generations=1000000,
       max_eval_time_mins=1, max_time_mins=10, memory=None,
       mutation_rate=0.9, n_jobs=1, offspring_size=None,
       periodic_checkpoint_folder=None, population_size=10,
       random_state=None, scoring='neg_mean_absolute_error', subsample=1.0,
       use_dask=False, verbosity=2, warm_start=False)

In [15]:
print(tpot.scoring_function + ": "+ str(tpot.score(x_test,y_test)))

neg_mean_absolute_error: -0.5510954468595537


  y = column_or_1d(y, warn=True)


### Using the best pipeline to make predictions

In [16]:
prediction = tpot.fitted_pipeline_.predict(x_test)
prediction_train = tpot.fitted_pipeline_.predict(x_train)

In [17]:
prediction[0:10]

array([14.24131455, 12.33427308, 14.65      , 10.64908971,  7.82975813,
       11.70685   ,  5.85571429,  9.92460308, 10.6488225 , 11.41856718])

#### TPOT export
TPOT can also export the best sklearn pipeline found by the AutoML flow as a Python file with a single method call.

In [18]:
tpot.export('tpot_sample.py')

True

### Export the best model

In [19]:
import pickle
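# use a context manager so the file is closed once the pipeline is written
with open('tpot_pipeline.model', 'wb') as model_file:
    pickle.dump(tpot.fitted_pipeline_, model_file)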

### MAPE (Mean Absolute Percentage Error)

In [20]:
def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of |error| / |actual|, expressed in percent; y_true must not contain zeros
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [21]:
print(y_test.shape, y_train.shape)

(2000, 1) (7999, 1)


In [22]:
mape_test = mean_absolute_percentage_error(y_test.values.ravel(), prediction)
mape_train = mean_absolute_percentage_error(y_train.values.ravel(), prediction_train)

In [23]:
print("Training-set MAPE: "+str(mape_train))
print("Test-set MAPE: "+str(mape_test))

Training-set MAPE: 0.3916904794533839
Test-set MAPE: 4.991844759384315


### Actual values

In [24]:
y_test.values[0:5].ravel()

array([13.49, 11.49, 13.99, 10.59,  7.49])

### Predicted values

In [25]:
prediction[0:5]

array([14.24131455, 12.33427308, 14.65      , 10.64908971,  7.82975813])
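
To make the comparison concrete, the sketch below applies the same MAPE formula to the five actual and predicted values shown above. It only restates numbers already printed in this notebook; the variable names are illustrative.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of |error| / |actual|, in percent
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# First five actual and predicted interest rates, copied from the outputs above
actual = np.array([13.49, 11.49, 13.99, 10.59, 7.49])
predicted = np.array([14.24131455, 12.33427308, 14.65, 10.64908971, 7.82975813])

abs_errors = np.abs(actual - predicted)  # error in interest-rate points
sample_mape = mean_absolute_percentage_error(actual, predicted)
print(abs_errors.round(2), round(sample_mape, 2))
```

Five rows are too few to draw conclusions from, but each prediction here is within one interest-rate point of the actual value, and the resulting percentage is in the same region as the test-set MAPE reported above.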