<a href="https://colab.research.google.com/github/MattiaVerticchio/PersonalProjects/blob/master/TransactionPrediction/TransactionPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Santander Customer Transaction Prediction

> **Abstract**
>
> The objective of this notebook is to predict customer behavior. The task is binary classification: we predict whether a customer will (`1`) or won’t (`0`) make a transaction. The dataset contains 200 real anonymized features and one boolean target. We’ll use LightGBM as an ensemble learning model. The evaluation metric is the Area Under the Receiver Operating Characteristic Curve (ROC-AUC), and the final cross-validated score is ~0.90.

## Introduction
To tune the classification model, we’ll use `optuna`, which is a hyperparameter optimization framework. The model we’ll train is Microsoft’s LightGBM, a gradient boosting decision tree learner, integrated with `optuna`. Let’s first install the packages.

In [1]:
%%bash
pip install -q optuna

Once installed, we’ll retrieve the dataset from the source. Here we’ll use the Kaggle API to download the dataset from the Santander Customer Transaction Prediction competition as `zip` files.

The `kaggle.json` file holds the account’s `username` and API `key`, both retrievable from the Kaggle account settings. Treat the key as a secret and keep it out of shared notebooks.
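
As a minimal sketch of the same setup in Python, the credentials file can be written with owner-only permissions from values supplied at run time. The helper name `write_kaggle_credentials` is hypothetical (it is not part of the Kaggle package), and the function only mirrors the file layout and `chmod 600` step described here.

```python
import json
import stat
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, directory: Path) -> Path:
    """Write kaggle.json into `directory` with owner-only permissions."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "kaggle.json"
    # Overwrite rather than append, so re-running never leaves invalid JSON
    path.write_text(json.dumps({"username": username, "key": key}))
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # Same effect as `chmod 600`
    return path
```

A call such as `write_kaggle_credentials(username, key, Path.home() / ".kaggle")`, with `username` and `key` read from user input or environment variables, keeps the secret out of the notebook source.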

In [2]:
%%bash
# Set up the Kaggle API credentials
mkdir -p ~/.kaggle
touch ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# Replace the placeholders with your own username and API key
echo '{"username": "<your-username>", "key": "<your-api-key>"}' > ~/.kaggle/kaggle.json

# Download the file
kaggle competitions download -c santander-customer-transaction-prediction

Downloading train.csv.zip to /content

Downloading test.csv.zip to /content

Downloading sample_submission.csv.zip to /content




### Preprocessing
Let’s import the installed libraries and Pandas to manage the data.

In [3]:
import pandas as pd                        # Data management
import optuna.integration.lightgbm as lgb  # Hyperparameter optimization

Here we’ll read the dataset and separate features and target.

In [4]:
X_train = pd.read_csv('train.csv.zip', index_col='ID_code')  # Training data
X_test  = pd.read_csv('test.csv.zip',  index_col='ID_code')  # Testing data

y_train = X_train[['target']].astype('bool')  # Separating features and target
X_train = X_train.drop(columns='target')

X = pd.concat([X_train, X_test])  # Matrix for all the features

On Google Colaboratory, memory limits prevent us from exploring feature augmentation widely on a dataset of this size. Other techniques may be worth exploring, but here I will only add a few aggregated columns to the `X` DataFrame.

In [5]:
cols = X.columns.values

X['sum']  = X[cols].sum(axis=1)       # Sum of all the values
X['min']  = X[cols].min(axis=1)       # Minimum value in the sample
X['max']  = X[cols].max(axis=1)       # Maximum value in the sample
X['mean'] = X[cols].mean(axis=1)      # Mean sample value
X['std']  = X[cols].std(axis=1)       # Standard deviation of the sample
X['var']  = X[cols].var(axis=1)       # Variance of the sample
X['skew'] = X[cols].skew(axis=1)      # Skewness of the sample
X['kurt'] = X[cols].kurtosis(axis=1)  # Kurtosis of each sample
X['med']  = X[cols].median(axis=1)    # Median sample value
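
To make the row-wise aggregation concrete, here is a small sketch on a made-up three-column frame (the values and the names `toy` and `base_cols` are illustrative assumptions). It shows why the column list is captured before any aggregate is added: later aggregates are computed from the original features only.

```python
import pandas as pd

# Toy stand-in for the real feature matrix (values are made up)
toy = pd.DataFrame(
    {"var_0": [1.0, 2.0], "var_1": [3.0, 4.0], "var_2": [5.0, 9.0]},
    index=["row_a", "row_b"],
)

base_cols = list(toy.columns)  # Snapshot taken BEFORE adding aggregates

toy["sum"] = toy[base_cols].sum(axis=1)    # Uses only var_0..var_2
toy["mean"] = toy[base_cols].mean(axis=1)  # Does not include the new 'sum' column
toy["std"] = toy[base_cols].std(axis=1)    # Sample standard deviation (ddof=1)
```

For `row_a` the features are 1, 3 and 5, so the sum is 9, the mean is 3 and the sample standard deviation is 2.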

Now let’s create the train and test sets.

In [6]:
dtrain = lgb.Dataset(X.iloc[0:200000], label=y_train)  # Training data
X_test = X.iloc[200000:400000]                         # Testing data

## Model building
The learning model we’ll use is Microsoft’s LightGBM, a fast gradient boosting decision tree implementation, wrapped by `optuna`, as an optimizer for hyperparameters.

The hyperparameters are optimized in a stepwise process that follows a fixed order:
- `feature_fraction`
- `num_leaves`
- `bagging` (`bagging_fraction` and `bagging_freq`)
- `feature_fraction` (second, finer pass)
- `regularization_factors` (`lambda_l1` and `lambda_l2`)
- `min_data_in_leaf` (reported as its alias `min_child_samples`)
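
The stepwise idea can be sketched in a few lines: tune one parameter at a time and freeze its best value before moving to the next. This is only a toy illustration of the principle, not `optuna`'s implementation; `stepwise_search`, `toy_score` and the candidate grids are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]


def stepwise_search(
    score: Callable[[Params], float],
    start: Params,
    steps: Sequence[Tuple[str, List[float]]],
) -> Params:
    """Tune one parameter at a time, freezing the best value before moving on."""
    best = dict(start)
    for name, candidates in steps:
        # Evaluate each candidate with all previously tuned values held fixed
        best[name] = max(candidates, key=lambda v: score({**best, name: v}))
    return best


def toy_score(p: Params) -> float:
    # Hypothetical objective with its maximum at feature_fraction=0.5, num_leaves=4
    return -((p["feature_fraction"] - 0.5) ** 2) - ((p["num_leaves"] - 4) ** 2)


tuned = stepwise_search(
    toy_score,
    start={"feature_fraction": 1.0, "num_leaves": 31},
    steps=[
        ("feature_fraction", [0.4, 0.5, 0.6]),
        ("num_leaves", [2, 4, 8]),
    ],
)
```

Compared with a joint search over all parameters, this evaluates far fewer combinations, at the cost of possibly missing interactions between parameters tuned in different steps.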

Firstly, we define a few parameters for the model.


In [7]:
params = {                    # Dictionary of starting parameters
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Threads; match your physical CPU cores
}

Then we create a `LightGBMTunerCV` object. We run 5-fold stratified cross-validation to estimate the model’s ROC-AUC. I set a very high `num_boost_round` and enabled early stopping, so training halts once the validation score stops improving; this limits overfitting on the training data, which could lead to poor generalization on unseen data. The early-stopping patience is set to 100 rounds.
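
The patience rule can be illustrated in isolation. The sketch below is a simplified stand-in for an early-stopping callback, assuming a metric to maximize; the function name `early_stopping` and the score history are made up for the example.

```python
from typing import Sequence, Tuple


def early_stopping(scores: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (best_round, stopped_round), both 0-based, for a metric to maximize."""
    best_round = 0
    for current, value in enumerate(scores):
        if value > scores[best_round]:
            best_round = current
        elif current - best_round >= patience:
            return best_round, current  # No improvement for `patience` rounds
    return best_round, len(scores) - 1  # Ran out of rounds before stopping


# Made-up validation AUC per boosting round
auc_history = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.90]
best, stopped = early_stopping(auc_history, patience=3)
```

Here training stops at round 5 with the best score at round 2, so the later 0.90 is never reached. A larger patience trades longer training for a lower risk of stopping too early.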

In [8]:
tuner = lgb.LightGBMTunerCV(    # Tuner object with Stratified 5-Fold CV
    params,                     # GBM settings
    dtrain,                     # Training dataset
    num_boost_round=999999,     # Set max iterations
    nfold=5,                    # Number of CV folds
    stratified=True,            # Stratified samples
    early_stopping_rounds=100,  # Callback for CV's AUC
    verbose_eval=False          # Stay silent
)

[I 2020-09-27 09:18:47,388] A new study created in memory with name: no-name-5d9e68ba-ed35-47ff-8c8c-73b9f862c708


### Hyperparameters tuning
`optuna` provides a single call that performs the search, running the steps in the established order.

In [None]:
tuner.run()

Here are the results.
- `feature_fraction = 0.48` 
- `num_leaves = 3`
- `bagging_fraction = 0.8662505913776934`
- `bagging_freq = 7`
- `lambda_l1 = 2.6736262550429385e-08`
- `lambda_l2 = 0.0013546195528208944`
- `min_child_samples = 50`

The next step is to find a good `num_boost_round` via cross-validation, so that the final model can be retrained without overfitting. Here I set the hyperparameters we found and run 10-fold stratified cross-validation with early stopping. This time the patience threshold is set to 20 rounds.
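
Stratification means each fold keeps roughly the same class ratio as the full training set. A minimal sketch of the idea follows, without the shuffling a real splitter would normally apply; `stratified_folds` and the label vector are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[int], n_folds: int) -> List[int]:
    """Assign each sample a fold id so every fold keeps roughly the class ratio."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [0] * len(labels)
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# Made-up imbalanced target with 10% positives
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, n_folds=10)
positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
```

Every fold ends up with exactly one positive and nine negatives, whereas a plain random split could leave some folds with no positives at all.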

In [9]:
# Dictionary of tuned LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Threads; match your physical CPU cores
    # Adding optimized hyperparameters
    "feature_fraction": 0.48,
    "num_leaves": 3,
    "bagging_fraction" : 0.8662505913776934,
    "bagging_freq" : 7,
    "lambda_l1": 2.6736262550429385e-08,
    "lambda_l2": 0.0013546195528208944,
    "min_child_samples": 50
}

We now run cross-validation with the tuned settings. Note that `lgb.cv` returns the per-round evaluation history rather than a trained model.

In [10]:
finalModel = lgb.cv(           # Training the cross-validated model
    params,                    # Loading the parameters
    dtrain,                    # Training dataset
    num_boost_round=999999,    # Setting a lot of boosting rounds
    early_stopping_rounds=20,  # Stop training after 20 non-productive rounds
    nfold=10,                  # Cross-validation folds
    stratified=True,           # Stratified sampling
)

## Results & Conclusions

In [17]:
CV_results = pd.DataFrame(finalModel)             # Saving iterations
best_iteration = CV_results['auc-mean'].idxmax()  # Best iteration
CV_results.loc[best_iteration]                    # Best CV ROC-AUC  

auc-mean    0.897531
auc-stdv    0.002463
Name: 1886, dtype: float64

The model reaches a cross-validated ROC-AUC of ~0.90 (mean 0.8975, standard deviation 0.0025 across the 10 folds).
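
For intuition about the metric, ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all pairs; it is a didactic version with made-up data, not the implementation LightGBM uses, and it is quadratic in the number of samples.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
```

Three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.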

This experiment focused on hyperparameter tuning, but what could further improve the model’s score?

- Explore feature engineering by augmenting the available data, for example with:
    - Feature interaction
    - Feature ratio
    - Polynomial combinations
    - Trigonometric transforms
    - Clustering
- Build an ensemble that combines different models and stacks or blends their predictions.
- Calibrate the model’s predicted probabilities.

### Finalize the model
At this point, we can train the final model on the whole training set, using the optimized hyperparameters and the number of boosting rounds found by cross-validation.
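
The number of boosting rounds follows from the cross-validation history. Because positions in that history are 0-based, the round count is the best position plus one; with the best index of 1886 reported above, that gives 1887 rounds. A minimal sketch, assuming a history dictionary keyed by `auc-mean` as in the results above (the helper name and the toy values are hypothetical):

```python
from typing import Dict, List


def best_num_boost_round(cv_history: Dict[str, List[float]], metric: str = "auc-mean") -> int:
    """Number of boosting rounds that reaches the best mean CV score."""
    scores = cv_history[metric]
    best_index = max(range(len(scores)), key=scores.__getitem__)  # 0-based position
    return best_index + 1  # Rounds are counted from 1


# Made-up history shaped like the dictionary returned by the CV step
toy_history = {"auc-mean": [0.80, 0.88, 0.90, 0.89], "auc-stdv": [0.01, 0.01, 0.01, 0.01]}
rounds = best_num_boost_round(toy_history)
```

In the toy history the best mean AUC sits at position 2, so the final model would be trained for 3 rounds.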

In [None]:
import lightgbm as lgb  # Importing the official Microsoft LightGBM

model = lgb.train(                   # Training the final model 
    params,                          # Loading the parameters            
    dtrain,                          # Training dataset
    num_boost_round=best_iteration + 1  # Boosting rounds (idxmax is 0-based)
)
