<a href="https://colab.research.google.com/github/MattiaVerticchio/SantanderTransactionPrediction/blob/master/SantanderCustomerTransactionPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Santander Customer Transaction Prediction
The objective of this Kaggle competition is to predict customer behavior. The problem is modeled as binary classification: we try to predict whether a customer will (`1`) or won’t (`0`) make a transaction. The dataset contains 200 real-valued features and 1 boolean target. The evaluation metric is the area under the receiver operating characteristic curve (ROC-AUC).
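
To make the metric concrete, here is a minimal sketch of ROC-AUC computed as the probability that a randomly chosen positive is scored above a randomly chosen negative. The function name `roc_auc` and the toy labels and scores are made up for illustration; the pairwise comparison is only practical for small inputs and is not how Kaggle or LightGBM computes the score.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as one half.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the metric depends only on how the predictions are ordered, any monotonic rescaling of the scores leaves it unchanged.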

# Notebook setup
To build and tune the model we’ll use `optuna`, a hyperparameter optimization framework. The model we’ll train is Microsoft’s LightGBM, a gradient boosting decision tree learner, for which `optuna` provides an integration. Let’s install the packages.

In [None]:
%%bash

# Hyperparameter optimization framework
pip install --quiet optuna

# Kaggle APIs to download the dataset
pip install --upgrade --force-reinstall --no-deps --quiet kaggle

# GPU-accelerated Microsoft LightGBM (OpenCL build, linked against the OpenCL library shipped with CUDA)
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build ; cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j$(nproc)
cd ../python-package
python setup.py install --precompile

Submodule path 'compute': checked out '36c89134d4013b2e5e45bc55656a18bd6141995a'
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-

Cloning into 'LightGBM'...
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Cloning into '/content/LightGBM/compute'...
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
no previously-included directories found matching 'build'
INFO:LightGBM:Installing lib_lightgbm from: ['../lib_lightgbm.so']


Once the packages are installed, we’ll retrieve the dataset from the source. Here we’ll use the Kaggle API to download the dataset of the `santander-customer-transaction-prediction` competition as a `zip` file.

The `kaggle.json` file contains the `username` and `key` of your own Kaggle account.
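
The same credentials file can also be written from Python. The sketch below is only an illustration: the helper name `write_kaggle_credentials` is hypothetical, the values are placeholders, and it writes to a temporary directory rather than to `~/.kaggle/`.

```python
import json
import os
import stat
import tempfile

def write_kaggle_credentials(directory, username, key):
    """Write a kaggle.json file that only its owner can read or write."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as handle:
        json.dump({"username": username, "key": key}, handle)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return path

# Demonstration with placeholder values in a throwaway directory
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(demo_dir, "USERNAME", "KEY")
with open(demo_path) as handle:
    print(handle.read())
```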

In [None]:
%%bash
# Set up Kaggle APIs
mkdir -p ~/.kaggle/
# Overwrite (rather than append to) the credentials file, so re-running the cell keeps the JSON valid
echo '{"username": "USERNAME", "key": "KEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Download the file
kaggle competitions download -c santander-customer-transaction-prediction

# Unzip and delete the archive
unzip santander-customer-transaction-prediction.zip
rm santander-customer-transaction-prediction.zip

Downloading santander-customer-transaction-prediction.zip to /content

Archive:  santander-customer-transaction-prediction.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               



# Preprocessing
Let’s import the installed libraries and Pandas to manipulate the data.

In [None]:
# Data management
import pandas as pd

# Microsoft LightGBM classifier with hyperparameter optimization
import optuna.integration.lightgbm as lgb

# Garbage collection
import gc

Here we’ll read the dataset and separate features and target.

In [None]:
# Reading train and test data
X_train = pd.read_csv('/content/train.csv', index_col='ID_code')
X_test  = pd.read_csv('/content/test.csv',  index_col='ID_code')

# Separating features and target
y_train = X_train[['target']].astype('bool')
X_train = X_train.drop(columns='target')

# Matrix for all the features (DataFrame.append was removed in pandas 2.0)
X = pd.concat([X_train, X_test])

Colab’s resources limit how far feature augmentation can be explored on a dataset of this size. Still, the following techniques could be worth exploring:
- Feature interaction
- Feature ratio
- Polynomial combinations
- Trigonometric transforms
- Clustering
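
As a rough illustration of the first four techniques, the sketch below derives new columns on a tiny made-up frame. The column names `var_0` and `var_1` and their values are assumptions chosen for the example, and clustering is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the real-valued features (hypothetical names)
toy = pd.DataFrame({"var_0": [1.0, 2.0, 4.0], "var_1": [2.0, 4.0, 8.0]})

augmented = toy.copy()
augmented["var_0_x_var_1"] = toy["var_0"] * toy["var_1"]    # feature interaction
augmented["var_0_div_var_1"] = toy["var_0"] / toy["var_1"]  # feature ratio
augmented["var_0_sq"] = toy["var_0"] ** 2                   # polynomial term
augmented["var_0_sin"] = np.sin(toy["var_0"])               # trigonometric transform
print(augmented)
```

With 200 features, pairwise interactions alone would add close to 20,000 columns, which is why this kind of augmentation quickly becomes expensive.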

I will add a few new aggregated columns to the `X` DataFrame; each one is computed row-wise.

In [None]:
cols = X.columns.values

X['sum']  = X[cols].sum(axis=1)
X['min']  = X[cols].min(axis=1)
X['max']  = X[cols].max(axis=1)
X['mean'] = X[cols].mean(axis=1)
X['std']  = X[cols].std(axis=1)
X['var']  = X[cols].var(axis=1)
X['skew'] = X[cols].skew(axis=1)
X['kurt'] = X[cols].kurtosis(axis=1)
X['med']  = X[cols].median(axis=1)

Now let’s create the train and test sets.

In [None]:
# Use 32bit floating point numbers to save memory
X = X.astype('float32')

# Training LightGBM dataset
dtrain = lgb.Dataset(X.iloc[0:200000], label=y_train)
# Testing DataFrame
X_test = X.iloc[200000:400000]
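
The effect of the `float32` cast above can be checked on a small made-up frame. This is only a sketch with arbitrary dimensions, not a measurement of the real dataset.

```python
import numpy as np
import pandas as pd

# 1000 rows x 10 columns of float64 zeros (arbitrary toy dimensions)
frame = pd.DataFrame(np.zeros((1000, 10)))

bytes64 = int(frame.memory_usage(index=False).sum())
bytes32 = int(frame.astype("float32").memory_usage(index=False).sum())
print(bytes64, bytes32)
```

Each value shrinks from 8 bytes to 4, so the feature matrix takes half the memory at the cost of reduced precision.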

# Model tuning
The learning model we’ll use is Microsoft’s LightGBM, a fast gradient boosting decision tree implementation, wrapped by `optuna`, which acts as the hyperparameter optimizer.

The hyperparameters are optimized using a stepwise process that follows a particular, well-established order:
- `feature_fraction`
- `num_leaves`
- `bagging` (`bagging_fraction` and `bagging_freq`)
- `feature_fraction` (second stage)
- `regularization_factors` (`lambda_l1` and `lambda_l2`)
- `min_data_in_leaf`
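
The idea of tuning one parameter at a time, in a fixed order, can be sketched in plain Python. This is not `optuna`'s implementation: the helper `stepwise_tune`, the toy objective, and the parameter names `a` and `b` are all hypothetical.

```python
def stepwise_tune(score, start, search_space):
    """Tune one parameter at a time, keeping earlier choices fixed.

    `score` maps a parameter dict to a number to maximize; `search_space`
    maps parameter names to candidate values, in the order to tune them.
    """
    best = dict(start)
    best_score = score(best)
    for name, candidates in search_space.items():
        for value in candidates:
            trial = {**best, name: value}
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score

# Hypothetical objective with a known optimum at a=3, b=-1
def toy_score(p):
    return -((p["a"] - 3) ** 2) - (p["b"] + 1) ** 2

best, best_score = stepwise_tune(
    toy_score,
    start={"a": 0, "b": 0},
    search_space={"a": [1, 2, 3, 4], "b": [-2, -1, 0]},
)
print(best, best_score)
```

The trade-off is that each step searches a small space, but interactions between parameters tuned in different steps are only partially explored.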

Firstly, we define a few parameters for the model.


In [None]:
# Dictionary of starting LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "device": "gpu",          # Enable hardware acceleration
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
    }

Then we create a `LightGBMTunerCV` object. We perform 5-fold stratified cross-validation to estimate the model’s AUC. I set a very high `num_boost_round` and enabled early stopping to avoid overfitting the training data, which would lead to poor generalization on unseen data. The patience for early stopping is set to 100 rounds.

In [None]:
# Tuner object with Stratified 5-Fold Cross Validation
tuner = lgb.LightGBMTunerCV(params,                     # GBM settings
                            dtrain,                     # Training dataset
                            num_boost_round=999999,     # Set max iterations
                            nfold=5,                    # Number of CV folds
                            stratified=True,            # Stratified samples
                            early_stopping_rounds=100,  # Callback for CV's AUC
                            verbose_eval=False)         # Stay silent
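
The patience rule behind early stopping can be sketched independently of LightGBM. The function `best_round_with_patience` and the score list are made up for illustration; LightGBM's own callback may differ in details.

```python
def best_round_with_patience(scores, patience):
    """Return (best round, rounds actually run), both counted from 1.

    Training stops once `patience` rounds have passed without a new best
    score; higher scores are better.
    """
    best_score = float("-inf")
    best_round = 0
    for round_number, value in enumerate(scores, start=1):
        if value > best_score:
            best_score, best_round = value, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, len(scores)

toy_scores = [0.60, 0.70, 0.75, 0.74, 0.73, 0.76, 0.72]
print(best_round_with_patience(toy_scores, patience=2))  # stops at round 5
print(best_round_with_patience(toy_scores, patience=5))  # sees the later peak
```

With a short patience the search stops before the late improvement at round 6, which is the reason for choosing a generous patience when training time allows it.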

## Hyperparameter tuning
`optuna` provides methods to perform the search; let’s execute them in the established order.

In [None]:
tuner.tune_feature_fraction()

feature_fraction, val_score: 0.890046:  10%|#         | 1/10 [01:42<15:20, 102.28s/it][I 2020-08-12 20:01:38,388] Trial 0 finished with value: 0.8900455747083807 and parameters: {'feature_fraction': 0.4}. Best is trial 0 with value: 0.8900455747083807.
feature_fraction, val_score: 0.890046:  20%|##        | 2/10 [03:13<13:11, 98.96s/it] [I 2020-08-12 20:03:09,609] Trial 1 finished with value: 0.8880712979305946 and parameters: {'feature_fraction': 1.0}. Best is trial 0 with value: 0.8900455747083807.
feature_fraction, val_score: 0.890046:  30%|###       | 3/10 [04:50<11:28, 98.39s/it][I 2020-08-12 20:04:46,664] Trial 2 finished with value: 0.8883527109173398 and parameters: {'feature_fraction': 0.8666666666666667}. Best is trial 0 with value: 0.8900455747083807.
feature_fraction, val_score: 0.890046:  40%|####      | 4/10 [06:28<09:49, 98.17s/it][I 2020-08-12 20:06:24,307] Trial 3 finished with value: 0.8891923168138411 and parameters: {'feature_fraction': 0.5333333333333333}. Best is 

In [None]:
tuner.tune_num_leaves()

num_leaves, val_score: 0.886435:   5%|5         | 1/20 [02:07<40:29, 127.87s/it][I 2020-08-13 11:47:30,425] Trial 0 finished with value: 0.8864345288158869 and parameters: {'num_leaves': 72}. Best is trial 0 with value: 0.8864345288158869.
num_leaves, val_score: 0.896680:  10%|#         | 2/20 [05:07<42:59, 143.28s/it][I 2020-08-13 11:50:29,679] Trial 1 finished with value: 0.8966800198764471 and parameters: {'num_leaves': 4}. Best is trial 1 with value: 0.8966800198764471.
num_leaves, val_score: 0.896680:  15%|#5        | 3/20 [25:34<2:12:46, 468.61s/it][I 2020-08-13 12:10:57,406] Trial 2 finished with value: 0.8907882672292018 and parameters: {'num_leaves': 256}. Best is trial 1 with value: 0.8966800198764471.
num_leaves, val_score: 0.896680:  20%|##        | 4/20 [48:42<3:18:30, 744.39s/it][I 2020-08-13 12:34:05,258] Trial 3 finished with value: 0.8912731123927198 and parameters: {'num_leaves': 175}. Best is trial 1 with value: 0.8966800198764471.
num_leaves, val_score: 0.896680:  2

In [None]:
tuner.tune_bagging()

bagging, val_score: 0.897956:  10%|#         | 1/10 [02:53<26:01, 173.50s/it][I 2020-08-13 14:28:03,019] Trial 20 finished with value: 0.8979563042250929 and parameters: {'bagging_fraction': 0.8429864478650875, 'bagging_freq': 7}. Best is trial 20 with value: 0.8979563042250929.
bagging, val_score: 0.898318:  20%|##        | 2/10 [06:22<24:33, 184.14s/it][I 2020-08-13 14:31:31,977] Trial 21 finished with value: 0.8983178297023929 and parameters: {'bagging_fraction': 0.8662505913776934, 'bagging_freq': 7}. Best is trial 21 with value: 0.8983178297023929.
bagging, val_score: 0.898318:  30%|###       | 3/10 [09:45<22:09, 189.96s/it][I 2020-08-13 14:34:55,514] Trial 22 finished with value: 0.8981793162601844 and parameters: {'bagging_fraction': 0.866089819885371, 'bagging_freq': 7}. Best is trial 21 with value: 0.8983178297023929.
bagging, val_score: 0.898318:  40%|####      | 4/10 [13:17<19:38, 196.48s/it][I 2020-08-13 14:38:27,233] Trial 23 finished with value: 0.897835041634071 and para

In [None]:
tuner.tune_feature_fraction_stage2()

feature_fraction_stage2, val_score: 0.898318:  33%|###3      | 1/3 [02:22<04:44, 142.30s/it][I 2020-08-14 00:32:51,096] Trial 0 finished with value: 0.8983178324678871 and parameters: {'feature_fraction': 0.4}. Best is trial 0 with value: 0.8983178324678871.
feature_fraction_stage2, val_score: 0.898318:  67%|######6   | 2/3 [04:14<02:13, 133.18s/it][I 2020-08-14 00:34:43,003] Trial 1 finished with value: 0.8978251424285564 and parameters: {'feature_fraction': 0.44000000000000006}. Best is trial 0 with value: 0.8983178324678871.
feature_fraction_stage2, val_score: 0.898349: 100%|##########| 3/3 [06:36<00:00, 135.87s/it][I 2020-08-14 00:37:05,153] Trial 2 finished with value: 0.8983486192438492 and parameters: {'feature_fraction': 0.48000000000000004}. Best is trial 2 with value: 0.8983486192438492.
feature_fraction_stage2, val_score: 0.898349: 100%|##########| 3/3 [06:36<00:00, 132.12s/it]


In [None]:
tuner.tune_regularization_factors()

regularization_factors, val_score: 0.898000:   5%|5         | 1/20 [02:18<43:56, 138.78s/it][I 2020-08-14 08:05:53,660] Trial 0 finished with value: 0.8979999043089274 and parameters: {'lambda_l1': 0.009247467086373043, 'lambda_l2': 0.5140701207760274}. Best is trial 0 with value: 0.8979999043089274.
regularization_factors, val_score: 0.898000:  10%|#         | 2/20 [04:24<40:25, 134.74s/it][I 2020-08-14 08:07:58,983] Trial 1 finished with value: 0.8979375193704587 and parameters: {'lambda_l1': 1.6758022062094455, 'lambda_l2': 1.7460765921189703}. Best is trial 0 with value: 0.8979999043089274.
regularization_factors, val_score: 0.898000:  15%|#5        | 3/20 [06:17<36:23, 128.47s/it][I 2020-08-14 08:09:52,807] Trial 2 finished with value: 0.8976738012500055 and parameters: {'lambda_l1': 0.10657770816862401, 'lambda_l2': 0.009179388424901706}. Best is trial 0 with value: 0.8979999043089274.
regularization_factors, val_score: 0.898125:  20%|##        | 4/20 [08:33<34:51, 130.74s/it][I 

In [None]:
tuner.tune_min_data_in_leaf()

min_data_in_leaf, val_score: 0.898016:  20%|##        | 1/5 [02:11<08:46, 131.62s/it][I 2020-08-14 09:40:49,872] Trial 0 finished with value: 0.8980155559271935 and parameters: {'min_child_samples': 100}. Best is trial 0 with value: 0.8980155559271935.
min_data_in_leaf, val_score: 0.898315:  40%|####      | 2/5 [04:45<06:55, 138.43s/it][I 2020-08-14 09:43:24,192] Trial 1 finished with value: 0.8983145344158274 and parameters: {'min_child_samples': 10}. Best is trial 1 with value: 0.8983145344158274.
min_data_in_leaf, val_score: 0.898360:  60%|######    | 3/5 [07:24<04:49, 144.61s/it][I 2020-08-14 09:46:03,217] Trial 2 finished with value: 0.8983597038946354 and parameters: {'min_child_samples': 50}. Best is trial 2 with value: 0.8983597038946354.
min_data_in_leaf, val_score: 0.898360:  80%|########  | 4/5 [09:27<02:18, 138.07s/it][I 2020-08-14 09:48:06,036] Trial 3 finished with value: 0.8980410766823225 and parameters: {'min_child_samples': 25}. Best is trial 2 with value: 0.898359703

Here are the results.
- `feature_fraction` = 0.48
- `num_leaves` = 3
- `bagging_fraction` = 0.8662505913776934
- `bagging_freq` = 7
- `lambda_l1` = 2.6736262550429385e-08
- `lambda_l2` = 0.0013546195528208944
- `min_child_samples` = 50

The next step is to find a good `num_boost_round` via cross-validation, so that the final model can be retrained without overfitting. Here I set the hyperparameters we found and start training with 10-fold stratified cross-validation and early stopping. This time the patience threshold is set to 1000 rounds, so that we can be reasonably sure to reach the best model achievable with these settings.

In [None]:
# Dictionary of tuned LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "device": "gpu",          # Enable hardware acceleration
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
    # Adding optimized hyperparameters
    "feature_fraction": 0.48,
    "num_leaves": 3,
    "bagging_fraction" : 0.8662505913776934,
    "bagging_freq" : 7,
    "lambda_l1": 2.6736262550429385e-08,
    "lambda_l2": 0.0013546195528208944,
    "min_child_samples": 50}

We now run the cross-validation with the settings we found.

In [None]:
finalModel = lgb.cv(params,
                    dtrain,
                    num_boost_round=999999,
                    early_stopping_rounds=1000,
                    nfold=10,
                    stratified=True,
                    verbose_eval=True)

[1]	cv_agg's auc: 0.578547 + 0.00825778
[2]	cv_agg's auc: 0.614765 + 0.00592437
[3]	cv_agg's auc: 0.633975 + 0.00711882
[4]	cv_agg's auc: 0.637971 + 0.00693536
[5]	cv_agg's auc: 0.656148 + 0.00943904
[6]	cv_agg's auc: 0.65855 + 0.00834924
[7]	cv_agg's auc: 0.668349 + 0.00732269
[8]	cv_agg's auc: 0.671891 + 0.00751849
[9]	cv_agg's auc: 0.681102 + 0.00755524
[10]	cv_agg's auc: 0.68885 + 0.00792294
[11]	cv_agg's auc: 0.694907 + 0.00803205
[12]	cv_agg's auc: 0.701212 + 0.00714948
[13]	cv_agg's auc: 0.702881 + 0.00810169
[14]	cv_agg's auc: 0.707366 + 0.00654004
[15]	cv_agg's auc: 0.713907 + 0.00542673
[16]	cv_agg's auc: 0.715726 + 0.00523241
[17]	cv_agg's auc: 0.718859 + 0.00487938
[18]	cv_agg's auc: 0.721158 + 0.00573641
[19]	cv_agg's auc: 0.723224 + 0.00637797
[20]	cv_agg's auc: 0.726243 + 0.00705448
[21]	cv_agg's auc: 0.730126 + 0.00771347
[22]	cv_agg's auc: 0.73359 + 0.00716319
[23]	cv_agg's auc: 0.735387 + 0.00615426
[24]	cv_agg's auc: 0.739328 + 0.0056503
[25]	cv_agg's auc: 0.742203 +
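
The stratified folds used in this cross-validation keep the share of positives roughly equal in every fold. Below is a minimal sketch of that idea on made-up labels; the helper `stratified_fold_ids` is hypothetical and is not the routine LightGBM uses internally.

```python
import numpy as np

def stratified_fold_ids(labels, n_folds, seed=0):
    """Assign each sample a fold id so that every fold receives a
    near-equal share of each class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin to the folds
        fold_ids[members] = np.arange(len(members)) % n_folds
    return fold_ids

toy_labels = np.array([1] * 10 + [0] * 90)  # 10% positives, made-up data
folds = stratified_fold_ids(toy_labels, n_folds=5)
positives_per_fold = [int(toy_labels[folds == k].sum()) for k in range(5)]
print(positives_per_fold)
```

Stratification matters most when one class is rare, since a plain random split could leave some folds with very few positives and make the per-fold AUC noisy.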

At this point we can train the final model on the whole dataset, using the optimized hyperparameters and number of boosting rounds.

In [None]:
# Importing the official library
import lightgbm as lgb

# Retrieving the best training iteration
CV_results = pd.DataFrame(finalModel)
# idxmax() returns a zero-based row label, so add 1 to get a number of rounds
best_iterations = CV_results['auc-mean'].idxmax() + 1

# Training the final model 
model = lgb.train(params, dtrain, num_boost_round=best_iterations)
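
When turning a cross-validation history into a number of boosting rounds, note that DataFrame row labels start at 0 while rounds are counted from 1. A small sketch on a made-up history (the AUC values are invented):

```python
import pandas as pd

# Made-up CV history: one mean AUC per boosting round
toy_cv = pd.DataFrame({"auc-mean": [0.60, 0.70, 0.75, 0.74]})

best_position = int(toy_cv["auc-mean"].idxmax())  # zero-based row label
best_num_rounds = best_position + 1               # rounds are counted from 1
print(best_position, best_num_rounds)
```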

With the final model, we can make the predictions on the test set and create a CSV file to submit.

In [None]:
pred = model.predict(X_test)
df = pd.DataFrame(pred, columns=['target'])
df.index.name = 'ID_code'
df = df.rename('test_{}'.format)
df.to_csv('sub.csv')
df

ID_code,target
test_0,0.051829
test_1,0.206173
test_2,0.219142
test_3,0.253446
test_4,0.039699
...,...
test_199995,0.032182
test_199996,0.007454
test_199997,0.003174
test_199998,0.117466


Using the Kaggle API, we submit the CSV and find out the AUC score.

In [None]:
! kaggle competitions submit -c santander-customer-transaction-prediction -f /content/sub.csv -m Tuned_LightGBM

100% 6.06M/6.06M [00:04<00:00, 1.39MB/s]
Successfully submitted to Santander Customer Transaction Prediction

# Results & Conclusions
The results are the following:
- Private score = 0.89610
- Public score = 0.89867

Overall, an AUC score of ~0.90 for a single-model prediction is not bad, considering that the prize-winning top 5 scored ~0.92.

This particular experiment focused on hyperparameter tuning, but what could be done to further improve the scores?

Of course, we could dive deeper into feature engineering by augmenting the available data with the methods described above. Also, an ensemble learning model could be implemented to combine different model architectures and stack/blend the results.
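
One simple way to blend models, sketched below under made-up predictions, is to average the ranks of their outputs instead of the raw probabilities. The function `rank_average` and both prediction vectors are hypothetical; this is one of several blending schemes, not a method used in this notebook.

```python
import numpy as np

def rank_average(predictions):
    """Blend prediction vectors by averaging their ranks, rescaled to [0, 1].

    Ties within a vector are broken by position, which is good enough
    for a sketch.
    """
    predictions = np.asarray(predictions, dtype=float)
    # argsort of argsort gives the 0-based rank of each element in its row
    ranks = predictions.argsort(axis=1).argsort(axis=1)
    return ranks.mean(axis=0) / (predictions.shape[1] - 1)

# Two hypothetical models scoring the same four customers
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.20, 0.30, 0.60, 0.90]
blend = rank_average([model_a, model_b])
print(blend)
```

Rank averaging suits an AUC-based competition because the metric depends only on the ordering of the predictions, so models with differently calibrated probabilities can be combined without rescaling.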