<a href="https://colab.research.google.com/github/MattiaVerticchio/PersonalProjects/blob/master/TransactionPrediction/TransactionPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Santander Customer Transaction Prediction
> [Italiano]() / **English**

> **Abstract**
>
> The objective of this notebook is to predict customer behavior. This is a binary classification problem, where we try to predict if a customer will (`1`) or won’t (`0`) make a transaction. The dataset contains 200 real-valued features and one boolean target. The evaluation metric is the Area Under the Receiver Operating Characteristic Curve (ROC-AUC).
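
To make the metric concrete, here is a minimal sketch of ROC-AUC computed from its pairwise definition: the probability that a randomly chosen positive is scored above a randomly chosen negative. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration. The competition uses its own scorer, and this quadratic loop is not meant for datasets of this size.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two negatives, two positives
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.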

## Introduction
To build and tune the model, we’ll use `optuna`, which is a hyperparameter optimization framework. The model we’ll train is Microsoft’s LightGBM, a gradient boosting decision tree learner, integrated with `optuna`. Let’s install the packages.

In [1]:
%%bash
# Hyperparameter optimization framework
pip install --quiet optuna

Once installed, we’ll retrieve the dataset from the source. Here we’ll use the Kaggle API to download the dataset from the Santander Customer Transaction Prediction competition as `zip` files.

The `kaggle.json` file holds the account’s `username` and API `key`, both retrievable from the Kaggle account settings.

In [2]:
%%bash
# Set up Kaggle APIs (replace the placeholders with your own credentials)
mkdir -p ~/.kaggle/
echo '{"username": "YOUR_USERNAME", "key": "YOUR_KEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Download the file
kaggle competitions download -c santander-customer-transaction-prediction

Downloading test.csv.zip to /content

Downloading train.csv.zip to /content

Downloading sample_submission.csv.zip to /content
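
For readers who prefer to stay in Python, the credentials file from the shell cell above could be written along these lines. This is a sketch under assumptions: `write_kaggle_credentials` is a hypothetical helper, the values are placeholders, and the demo writes to a temporary directory rather than to `~/.kaggle/`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(directory, username, key):
    """Write kaggle.json into `directory`, readable and writable by the owner only."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as handle:
        json.dump({"username": username, "key": key}, handle)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return path


# Demo with placeholder values in a temporary directory (not real credentials)
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(demo_dir, "YOUR_USERNAME", "YOUR_KEY")
with open(demo_path) as handle:
    demo_content = json.load(handle)
print(sorted(demo_content))
```

Opening the file in write mode overwrites any previous content, so running the helper twice still leaves a single valid JSON object.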




### Preprocessing
Let’s import the installed libraries and Pandas to manage the data.

In [3]:
# Data management
import pandas as pd
# Microsoft LightGBM classifier with hyperparameter optimization
import optuna.integration.lightgbm as lgb

Here we’ll read the dataset and separate features and target.

In [5]:
# Reading train and test data
X_train = pd.read_csv('train.csv.zip', index_col='ID_code')
X_test  = pd.read_csv('test.csv.zip',  index_col='ID_code')

# Separating features and target
y_train = X_train[['target']].astype('bool')
X_train = X_train.drop(columns='target')

# Matrix for all the features (DataFrame.append was removed in pandas 2.0)
X = pd.concat([X_train, X_test])

Google Colaboratory’s memory limits rule out extensive feature augmentation on a dataset of this size. With more resources, it could be useful to explore the following techniques:
- Feature interaction
- Feature ratio
- Polynomial combinations
- Trigonometric transforms
- Clustering
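
As a rough illustration of the first three techniques in the list, the sketch below derives an interaction, a ratio, and a squared term on a toy DataFrame. The column names `var_0` and `var_1` and the values are hypothetical, and the notebook itself does not apply these transforms.

```python
import pandas as pd

# Toy frame with two hypothetical feature columns
toy = pd.DataFrame({"var_0": [1.0, 2.0, 4.0], "var_1": [2.0, 4.0, 8.0]})

# Feature interaction: product of two columns
toy["var_0_x_var_1"] = toy["var_0"] * toy["var_1"]
# Feature ratio: quotient of two columns (assumes a non-zero denominator)
toy["var_0_over_var_1"] = toy["var_0"] / toy["var_1"]
# Polynomial combination: squared term of one column
toy["var_0_sq"] = toy["var_0"] ** 2

print(toy)
```

Applied to all pairs of 200 columns, the interaction step alone would add close to 20,000 new columns, which is why memory becomes the bottleneck.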

However, due to memory limits, I will only add a few new aggregated columns to the `X` DataFrame.

In [6]:
cols = X.columns.values

X['sum']  = X[cols].sum(axis=1)
X['min']  = X[cols].min(axis=1)
X['max']  = X[cols].max(axis=1)
X['mean'] = X[cols].mean(axis=1)
X['std']  = X[cols].std(axis=1)
X['var']  = X[cols].var(axis=1)
X['skew'] = X[cols].skew(axis=1)
X['kurt'] = X[cols].kurtosis(axis=1)
X['med']  = X[cols].median(axis=1)

Now let’s create the train and test sets.

In [7]:
# Training LightGBM dataset
dtrain = lgb.Dataset(X.iloc[0:200000], label=y_train)
# Testing DataFrame
X_test = X.iloc[200000:400000]
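
The combine, aggregate, and split pattern used above can be sketched on toy data. In this version the split point comes from the length of the training frame instead of a hard-coded row count. All names and values here are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]},
                         index=["train_0", "train_1"])
toy_test = pd.DataFrame({"a": [5.0], "b": [7.0]}, index=["test_0"])

# Stack train and test so the aggregates are computed once for both
combined = pd.concat([toy_train, toy_test])

# Freeze the original columns so new aggregates never feed into later ones
base_cols = list(combined.columns)
combined["sum"] = combined[base_cols].sum(axis=1)
combined["mean"] = combined[base_cols].mean(axis=1)

# Split back by position, using the training length as the boundary
n_train = len(toy_train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
```

Deriving `n_train` from the data keeps the split correct if the input files ever change size.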

## Model building
The learning model we’ll use is Microsoft’s LightGBM, a fast gradient boosting decision tree implementation, wrapped by `optuna`, as an optimizer for hyperparameters.

The hyperparameters are optimized using a stepwise process that follows a particular, well-established order:
- `feature_fraction`
- `num_leaves`
- `bagging`
- `feature_fraction` (second, finer pass)
- `regularization_factors`
- `min_data_in_leaf`
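
The idea behind a stepwise search can be sketched in plain Python: tune one hyperparameter at a time while keeping the best values found so far. This is only an illustration of the principle, not `optuna`'s implementation. The helper `stepwise_search`, the toy objective, and the candidate grids are all assumptions.

```python
def stepwise_search(objective, start, grids, order):
    """Tune one parameter at a time, keeping the best configuration so far."""
    best = dict(start)
    best_score = objective(best)
    for name in order:                      # follow the fixed tuning order
        for candidate in grids[name]:
            trial = dict(best, **{name: candidate})
            score = objective(trial)
            if score > best_score:          # higher is better, like AUC
                best, best_score = trial, score
    return best, best_score


# Toy objective (hypothetical): peaks at feature_fraction=0.5, num_leaves=4
def toy_objective(p):
    return -(p["feature_fraction"] - 0.5) ** 2 - (p["num_leaves"] - 4) ** 2


grids = {"feature_fraction": [0.25, 0.5, 0.75], "num_leaves": [2, 4, 8]}
start = {"feature_fraction": 0.75, "num_leaves": 8}
best, best_score = stepwise_search(
    toy_objective, start, grids, ["feature_fraction", "num_leaves"]
)
print(best, best_score)
```

A stepwise search evaluates far fewer configurations than a full grid, at the cost of possibly missing combinations that only work well together.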

First, we define a few starting parameters for the model.


In [9]:
# Dictionary of starting LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
    }

Then we create a `LightGBMTunerCV` object. We perform 5-fold stratified cross-validation to estimate the model’s AUC. I set a very high `num_boost_round` and enabled early stopping to avoid overfitting the training data, which would lead to poor generalization on unseen data. The patience for early stopping is set to 100 rounds.

In [10]:
# Tuner object with Stratified 5-Fold Cross Validation
tuner = lgb.LightGBMTunerCV(params,                     # GBM settings
                            dtrain,                     # Training dataset
                            num_boost_round=999999,     # Set max iterations
                            nfold=5,                    # Number of CV folds
                            stratified=True,            # Stratified samples
                            early_stopping_rounds=100,  # Callback for CV's AUC
                            verbose_eval=False)         # Stay silent

[I 2020-09-26 13:40:18,787] A new study created in memory with name: no-name-2b82055c-7e7f-422a-9968-31cc3e0187c2
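
To show what "stratified" means here, the sketch below assigns samples to folds so that each fold keeps the overall class ratio. It is a simplified, deterministic illustration under assumptions: `stratified_folds` is a hypothetical helper, and a real splitter would typically also shuffle the samples.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Return one fold index per sample, preserving the class ratio per fold."""
    folds = []
    seen = defaultdict(int)                 # samples of each class seen so far
    for y in labels:
        folds.append(seen[y] % n_folds)     # deal each class round-robin
        seen[y] += 1
    return folds


# Toy target with 20% positives (hypothetical)
fold_labels = [0] * 8 + [1] * 2
assignment = stratified_folds(fold_labels, n_folds=2)
fold_counts = [
    Counter(y for y, f in zip(fold_labels, assignment) if f == fold)
    for fold in range(2)
]
print(fold_counts)
```

Each of the two folds ends up with four negatives and one positive, so the 20% positive rate is preserved. With an imbalanced target this keeps the per-fold AUC estimates comparable.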


### Hyperparameters tuning
`optuna` provides the call that performs the search. Let’s execute it, following the established order.

In [None]:
tuner.run()

Here are the results.
- `feature_fraction` = 0.48
- `num_leaves` = 3
- `bagging_fraction` = 0.8662505913776934
- `bagging_freq` = 7
- `lambda_l1` = 2.6736262550429385e-08
- `lambda_l2` = 0.0013546195528208944
- `min_child_samples` = 50

The next step is to find a good `num_boost_round` via cross-validation, so that the final model can be retrained without overfitting. Here I set the hyperparameters we found and start training with 10-fold stratified cross-validation and early stopping. This time the patience threshold is set to 1000, so that we can be confident of reaching the best model available with these settings.

In [12]:
# Dictionary of tuned LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
    # Adding optimized hyperparameters
    "feature_fraction": 0.48,
    "num_leaves": 3,
    "bagging_fraction" : 0.8662505913776934,
    "bagging_freq" : 7,
    "lambda_l1": 2.6736262550429385e-08,
    "lambda_l2": 0.0013546195528208944,
    "min_child_samples": 50}

We now run the cross-validation with the tuned settings.

In [None]:
finalModel = lgb.cv(params,
                    dtrain,
                    num_boost_round=999999,
                    early_stopping_rounds=1000,
                    nfold=10,
                    stratified=True,
                    verbose_eval=False)
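
The effect of the patience threshold can be sketched on a toy metric history. This is an illustration of the general early-stopping rule, not LightGBM's code. The helper name and the AUC values are assumptions.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, rounds_run) for a metric where higher is better."""
    best_score = float("-inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop training
            return best_round, round_number
    return best_round, len(scores)


# Toy validation AUC per boosting round (hypothetical values)
toy_auc = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82, 0.90]
print(best_round_with_patience(toy_auc, patience=3))   # (3, 6)
print(best_round_with_patience(toy_auc, patience=10))  # (8, 8)
```

With a patience of 3 the search stops at round 6 and keeps round 3, missing the later improvement at round 8. A larger patience costs more training time but is less likely to stop on a temporary plateau.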

At this point we can train the final model on the whole dataset, using the optimized hyperparameters and number of boosting rounds.

In [None]:
# Importing the official library
import lightgbm as lgb

# Retrieving the best number of boosting rounds
CV_results = pd.DataFrame(finalModel)
# idxmax() returns a 0-based position, so add 1 to get a round count
best_iterations = CV_results['auc-mean'].idxmax() + 1

# Training the final model 
model = lgb.train(params, dtrain, num_boost_round=best_iterations)

With the final model, we can make the predictions on the test set and create a CSV file to submit.

In [None]:
pred = model.predict(X_test)
df = pd.DataFrame(pred, columns=['target'])
df.index.name = 'ID_code'
df = df.rename('test_{}'.format)
df.to_csv('sub.csv')
df

| ID_code | target |
|---|---|
| test_0 | 0.051829 |
| test_1 | 0.206173 |
| test_2 | 0.219142 |
| test_3 | 0.253446 |
| test_4 | 0.039699 |
| ... | ... |
| test_199995 | 0.032182 |
| test_199996 | 0.007454 |
| test_199997 | 0.003174 |
| test_199998 | 0.117466 |
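
As a small sanity check before submitting, the file layout can be verified on toy predictions. The sketch below builds a frame with an `ID_code` index and a `target` column, confirms that the values are valid probabilities, and inspects the CSV text. The IDs and values are made up for illustration.

```python
import io

import pandas as pd

# Toy IDs and predicted probabilities (hypothetical)
toy_ids = ["test_0", "test_1", "test_2"]
toy_pred = [0.05, 0.21, 0.22]

submission = pd.DataFrame(
    {"target": toy_pred},
    index=pd.Index(toy_ids, name="ID_code"),
)

# Predictions are probabilities, so they must lie in [0, 1]
assert submission["target"].between(0, 1).all()

# Write to an in-memory buffer instead of a file to inspect the output
buffer = io.StringIO()
submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])  # ID_code,target
```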


As before, we use the Kaggle API, this time to submit the CSV and find out the AUC score.

In [None]:
! kaggle competitions submit -c santander-customer-transaction-prediction -f /content/sub.csv -m Tuned_LightGBM

100% 6.06M/6.06M [00:04<00:00, 1.39MB/s]
Successfully submitted to Santander Customer Transaction Prediction

## Results & Conclusions
The results are the following:
- Private score = 0.89610
- Public score = 0.89867

Overall, an AUC score of ~0.90 for a single model is not bad, considering that the prize-winning top 5 scored ~0.92.

This particular experiment focused on hyperparameter tuning, but what could be done to further improve the scores?

Of course, we could dive deeper into feature engineering by augmenting the available data with the methods described above. Also, an ensemble learning model could be implemented to combine different model architectures and stack/blend the results.
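
The simplest form of blending is a weighted average of the probabilities predicted by several models. The sketch below shows that idea under assumptions: `blend` is a hypothetical helper and the two prediction lists are toy values, not outputs of the models in this notebook.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists of equal length."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models          # equal weights by default
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)        # one row per sample
    ]


# Toy predictions from two hypothetical models
model_a = [0.2, 0.8, 0.4]
model_b = [0.4, 0.6, 0.0]
blended = blend([model_a, model_b])
weighted = blend([model_a, model_b], weights=[3.0, 1.0])
print(blended, weighted)
```

Blending tends to help most when the models make different kinds of errors. The weights would normally be chosen on validation data rather than by hand.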