<a href="https://colab.research.google.com/github/MattiaVerticchio/PersonalProjects/blob/master/TransactionPrediction/TransactionPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Santander Customer Transaction Prediction

> **Abstract**
>
> The objective of this notebook is to predict customer behavior. The problem is a binary classification, where we try to predict whether a customer will (`1`) or won’t (`0`) make a transaction. The dataset contains 200 real-valued features and one boolean target. The evaluation metric is the Area Under the Receiver Operating Characteristic Curve (ROC-AUC).
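
To make the metric concrete, here is a minimal sketch of ROC-AUC read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): two negatives, two positives
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why the metric depends only on the order of the predictions and not on their scale.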

## Introduction
To tune the classification model, we’ll use `optuna`, which is a hyperparameter optimization framework. The model we’ll train is Microsoft’s LightGBM, a gradient boosting decision tree learner, integrated with `optuna`. Let’s first install the packages.

In [1]:
%%bash
pip install -q optuna

Once installed, we’ll retrieve the dataset from the source. Here we’ll use the Kaggle API to download the dataset from the Santander Customer Transaction Prediction competition as `zip` files.

The `kaggle.json` file holds the `username` and `key` that identify a Kaggle account; both can be retrieved from the account settings page.

In [2]:
%%bash
# Set up the Kaggle API credentials
mkdir -p ~/.kaggle
# Replace the placeholders with the values from your own Kaggle account settings
echo '{"username": "YOUR_USERNAME", "key": "YOUR_API_KEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Download the file
kaggle competitions download -c santander-customer-transaction-prediction

Downloading test.csv.zip to /content

Downloading train.csv.zip to /content

Downloading sample_submission.csv.zip to /content




### Preprocessing
Let’s import the installed libraries and Pandas to manage the data.

In [None]:
import pandas as pd                        # Data management
import optuna.integration.lightgbm as lgb  # Hyperparameter optimization

Here we’ll read the dataset and separate features and target.

In [None]:
X_train = pd.read_csv('train.csv.zip', index_col='ID_code')  # Training data
X_test  = pd.read_csv('test.csv.zip',  index_col='ID_code')  # Testing data

y_train = X_train[['target']].astype('bool')  # Separating features and target
X_train = X_train.drop(columns='target')

X = pd.concat([X_train, X_test])  # Matrix for all the features

Google Colaboratory’s memory limits rule out extensive feature augmentation on a dataset of this size. With more resources, the following techniques would be worth exploring:
- Feature interaction
- Feature ratio
- Polynomial combinations
- Trigonometric transforms
- Clustering

However, due to memory limits, I will only add a few row-wise aggregate columns to the `X` DataFrame.

In [None]:
cols = X.columns.values

X['sum']  = X[cols].sum(axis=1)       # Sum of all the values
X['min']  = X[cols].min(axis=1)       # Minimum value in the sample
X['max']  = X[cols].max(axis=1)       # Maximum value in the sample
X['mean'] = X[cols].mean(axis=1)      # Mean sample value
X['std']  = X[cols].std(axis=1)       # Standard deviation of the sample
X['var']  = X[cols].var(axis=1)       # Variance of the sample
X['skew'] = X[cols].skew(axis=1)      # Skewness of the sample
X['kurt'] = X[cols].kurtosis(axis=1)  # Kurtosis of each sample
X['med']  = X[cols].median(axis=1)    # Median sample value
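
For comparison with the aggregates above, here is a rough sketch of what interaction, ratio and polynomial features would look like, and why they quickly become too large. It uses plain Python dictionaries instead of pandas to stay self-contained; the `augment` helper, the feature names and the zero-denominator fallback are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb


def augment(row):
    """Return a copy of one sample with pairwise and squared features added."""
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]                     # feature interaction
        out[f"{a}/{b}"] = row[a] / row[b] if row[b] else 0.0  # ratio (0.0 if b == 0, an arbitrary choice)
    for name, value in row.items():
        out[f"{name}^2"] = value ** 2                          # polynomial term
    return out


# Hypothetical sample with two features
sample = {"var_0": 2.0, "var_1": 4.0}
augmented = augment(sample)

# Number of distinct feature pairs for the 200 original columns
n_pairs = comb(200, 2)
print(n_pairs)  # 19900
```

With 19,900 pairs, a single family of pairwise features stored as 64-bit floats for all 400,000 rows would need roughly 64 GB, which is why only a handful of aggregate columns are added here.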

Now let’s create the train and test sets.

In [None]:
dtrain = lgb.Dataset(X.iloc[0:200000], label=y_train)  # Training data
X_test = X.iloc[200000:400000]                         # Testing data

## Model building
The learning model we’ll use is Microsoft’s LightGBM, a fast gradient boosting decision tree implementation, wrapped by `optuna`, as an optimizer for hyperparameters.

The hyperparameters are optimized with a stepwise process that tunes one group at a time, in a fixed order:
- `feature_fraction`
- `num_leaves`
- `bagging`
- `feature_fraction` (second, finer pass)
- `regularization_factors`
- `min_data_in_leaf`
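
The idea behind the stepwise order can be sketched as a search that tunes one parameter at a time while keeping the earlier choices fixed. This is only a conceptual illustration under simplifying assumptions: the `stepwise_search` helper, the value grids and the toy scoring function are hypothetical, and `optuna` samples candidate values rather than scanning a fixed grid like this.

```python
def stepwise_search(score, start, grid, order):
    """Tune parameters one by one, freezing each best value before moving on."""
    best = dict(start)
    for name in order:
        # Try every candidate for this parameter with all others unchanged
        candidates = [dict(best, **{name: value}) for value in grid[name]]
        best = max(candidates, key=score)
    return best


def toy_score(p):
    """Hypothetical objective, highest at feature_fraction=0.5 and num_leaves=8."""
    return -(p["feature_fraction"] - 0.5) ** 2 - ((p["num_leaves"] - 8) / 10) ** 2


grid = {
    "feature_fraction": [0.4, 0.5, 0.6, 0.7],
    "num_leaves": [2, 8, 32],
}
start = {"feature_fraction": 1.0, "num_leaves": 31}

best = stepwise_search(toy_score, start, grid, order=["feature_fraction", "num_leaves"])
print(best)  # {'feature_fraction': 0.5, 'num_leaves': 8}
```

The search evaluates 4 + 3 candidates instead of the 4 × 3 a full grid would need. The price is that interactions between parameters are only partly explored, which may be why `feature_fraction` appears twice in the list.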

First, we define a few parameters for the model.


In [None]:
# Dictionary of starting LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
}

Then we create a `LightGBMTunerCV` object. We perform 5-fold stratified cross-validation to estimate the model’s AUC. I set a very high `num_boost_round` and enabled early stopping, so that training halts before the model overfits the training data, which would lead to poor generalization on unseen data. The patience for early stopping is set to 100 rounds.

In [None]:
# Tuner object with Stratified 5-Fold Cross Validation
tuner = lgb.LightGBMTunerCV(
    params,                     # GBM settings
    dtrain,                     # Training dataset
    num_boost_round=999999,     # Set max iterations
    nfold=5,                    # Number of CV folds
    stratified=True,            # Stratified samples
    early_stopping_rounds=100,  # Callback for CV's AUC
    verbose_eval=False          # Stay silent
)

[I 2020-09-26 13:40:18,787] A new study created in memory with name: no-name-2b82055c-7e7f-422a-9968-31cc3e0187c2
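
As a rough illustration of how patience-based early stopping behaves, the sketch below scans a sequence of validation scores and stops once no improvement has been seen for `patience` rounds. The helper name and the score history are made up for the example, and LightGBM’s own callback may differ in details.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, rounds_run) for a higher-is-better metric."""
    best_score = float("-inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number  # no improvement for `patience` rounds
    return best_round, len(scores)


# Hypothetical validation AUC per boosting round
history = [0.70, 0.75, 0.80, 0.79, 0.80, 0.78, 0.81]

print(best_round_with_patience(history, patience=3))  # (3, 6)
print(best_round_with_patience(history, patience=4))  # (7, 7)
```

With a patience of 3 the loop stops at round 6 and never sees the better score at round 7, while a patience of 4 reaches it. A larger patience costs more training time but lowers the risk of stopping on a temporary plateau.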


### Hyperparameter tuning
`optuna` provides a single call that performs the search. Let’s run it and let the tuner go through the steps in the established order.

In [None]:
tuner.run()

Here are the results.
- `feature_fraction` = 0.48
- `num_leaves` = 3
- `bagging_fraction` = 0.8662505913776934
- `bagging_freq` = 7
- `lambda_l1` = 2.6736262550429385e-08
- `lambda_l2` = 0.0013546195528208944
- `min_child_samples` = 50

The next step is to find a good `num_boost_round` via cross-validation, so that we can retrain the final model without overfitting. Here I set the hyperparameters we found and run 10-fold stratified cross-validation with early stopping. This time the patience is raised to 1000 rounds, which makes it unlikely that training stops before reaching the best score these settings can achieve.

In [None]:
# Dictionary of tuned LightGBM parameters
params = {
    "objective": "binary",    # Binary classification
    "metric": "auc",          # Used in competition
    "verbosity": -1,          # Stay silent
    "boosting_type": "gbdt",  # Gradient Boosting Decision Tree
    "max_bin": 63,            # Faster training on GPU
    "num_threads": 2,         # Use all physical cores of CPU
    # Adding optimized hyperparameters
    "feature_fraction": 0.48,
    "num_leaves": 3,
    "bagging_fraction": 0.8662505913776934,
    "bagging_freq": 7,
    "lambda_l1": 2.6736262550429385e-08,
    "lambda_l2": 0.0013546195528208944,
    "min_child_samples": 50
}

We now run the cross-validation with the tuned settings.

In [None]:
finalModel = lgb.cv(
    params,
    dtrain,
    num_boost_round=999999,
    early_stopping_rounds=1000,
    nfold=10,
    stratified=True,
    verbose_eval=False
)
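
Stratification, used in both cross-validation runs, keeps the share of positive samples roughly equal across folds. The sketch below shows one simple way to do this with a round-robin assignment per class. The `stratified_folds` helper and the toy labels are assumptions for illustration; library implementations typically also shuffle the samples first.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # round-robin within the class
    return folds


# Hypothetical imbalanced target: 4 positives out of 20 samples
labels = [1] * 4 + [0] * 16
folds = stratified_folds(labels, n_folds=4)

positive_rates = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
print(positive_rates)  # [0.2, 0.2, 0.2, 0.2]
```

Without stratification, a fold could by chance contain very few positives, which would make its AUC estimate noisy.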

At this point we can train the final model on the whole dataset, using the optimized hyperparameters and number of boosting rounds.

In [None]:
# Importing the official library
import lightgbm as lgb

# Retrieving the best number of boosting rounds
CV_results = pd.DataFrame(finalModel)
best_iterations = CV_results['auc-mean'].idxmax() + 1  # idxmax is 0-based

# Training the final model
model = lgb.train(params, dtrain, num_boost_round=best_iterations)

## Results & Conclusions

This particular experiment focused on hyperparameter tuning, but what could be done to further improve the scores?

Of course, we could dive deeper into feature engineering by augmenting the available data with the methods described above. Also, an ensemble learning model could be implemented to combine different model architectures and stack/blend the results.