# Predictive Analysis and Model Tuning with GPU

---

In this notebook I test the **[Google Colab](https://colab.research.google.com/notebooks/welcome.ipynb)** GPU runtime by training a LightGBM (LGBM) model.

The aim is to perform the classification task from the [Santander Customer Transaction Prediction](https://www.kaggle.com/c/santander-customer-transaction-prediction) challenge on Kaggle.

There is a [CPU version](https://github.com/FedericoRaimondi/me/blob/master/Santander_Customer_Transaction_Prediction/PredictiveAnalysis_ModelTuning/PredictiveAnalysis_ModelTuning.ipynb) of this notebook that can be run locally. Refer to that version if you need a detailed explanation.

_Before starting, remember to set **'GPU' under 'Runtime'->'Change runtime type'**._

You will also need to upload 3 files:
- train.csv
- test.csv
- synthetic_samples_indexes.csv ([used to keep only the real test data](https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split))
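
The later cells read these three files from the working directory, so a quick existence check can save a failed run. This is a minimal sketch, assuming the files were uploaded to the current directory; the helper name `missing_files` is hypothetical and not part of the original notebook.

```python
import os

REQUIRED_FILES = ["train.csv", "test.csv", "synthetic_samples_indexes.csv"]


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]


missing = missing_files(".")
if missing:
    print("Please upload:", missing)
else:
    print("All input files found.")
```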

In [1]:
# Importing all the libraries needed
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
import time
warnings.filterwarnings('ignore')
sns.set()

%matplotlib inline
%cd /content/

# Loading datasets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

print("Train shape: " + str(df_train.shape))
print("Test shape: " + str(df_test.shape))

# Splitting the target variable and the features
X_train = df_train.loc[:,'var_0':]
y_train = df_train.loc[:,'target']

print(X_train.shape)
print(y_train.shape)

/content
Train shape: (200000, 202)
Test shape: (200000, 201)
(200000, 200)
(200000,)


In [2]:
# Indexes of the synthetic (fake) rows in the test set
synthetic_samples_indexes = pd.read_csv('synthetic_samples_indexes.csv')

# Keep only the real test rows
is_synthetic = df_test.index.isin(synthetic_samples_indexes['synthetic_samples_indexes'])
df_test_real = df_test[~is_synthetic]
X_test = df_test_real.loc[:,'var_0':]
X_test.shape

(100000, 200)
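
The cell above keeps only the test rows whose index is not listed as synthetic. The same `~index.isin(...)` pattern is sketched below on toy data; the frame and the index values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_test and the synthetic index list
toy_test = pd.DataFrame({"var_0": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_synthetic = pd.DataFrame({"synthetic_samples_indexes": [1, 3]})

# Boolean mask: True where the row index is flagged as synthetic
is_synthetic = toy_test.index.isin(toy_synthetic["synthetic_samples_indexes"])
toy_real = toy_test[~is_synthetic]

print(toy_real.index.tolist())  # [0, 2, 4]
```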

In [0]:
# Setting up LGBM GPU version
!git clone --recursive https://github.com/Microsoft/LightGBM
%cd /content/LightGBM
!mkdir build
# In-source build: '.' makes the source directory explicit
!cmake -DUSE_GPU=1 .
!make -j$(nproc)
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools pandas numpy scipy scikit-learn -U
%cd /content/LightGBM/python-package
!sudo python setup.py install --precompile

In [0]:
# Frequency Encoding


def get_count(df):
    '''
    Function that adds one column for each variable (excluding 'ID_code', 'target')
    populated with the value frequencies
    '''
    for var in [i for i in df.columns if i not in ['ID_code','target']]:
        df[var+'_count'] = df.groupby(var)[var].transform('count')
    return df


In [5]:
# Stack train and real test rows so the counts cover both sets
X_tot = pd.concat([X_train, X_test])
print(X_tot.shape)

start = time.time()
X_tot = get_count(X_tot)
end = time.time()
print('It took %.2f seconds\nShape: ' %(end - start))
print(X_tot.shape)

# Split the encoded frame back into its train and test parts
X_train_count = X_tot.iloc[:len(X_train)]
X_test_count = X_tot.iloc[len(X_train):]

(300000, 200)
It took 39.74 seconds
Shape: 
(300000, 400)
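
To make the frequency encoding concrete, here is a toy run of the same `groupby(...).transform('count')` idea used in `get_count`. The column values are invented for illustration, and the `value_counts` variant is shown only as an equivalent alternative, not as the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({"var_0": [0.5, 0.5, 1.2, 0.5, 3.0, 1.2]})

# Same technique as get_count: each row receives the frequency of its own value
toy["var_0_count"] = toy.groupby("var_0")["var_0"].transform("count")

# Equivalent formulation: look up each value in the value_counts table
alt_count = toy["var_0"].map(toy["var_0"].value_counts())

print(toy["var_0_count"].tolist())  # [3, 3, 2, 3, 1, 2]
```

Rare values end up with a count of 1, which gives the model a way to tell unique values apart from repeated ones.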


## GPU
---

In [0]:
# Model GPU

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split
import lightgbm as lgb

# The parameters for Light Gradient Boost
lgb_params = {
        'bagging_fraction': 0.77,
        'bagging_freq': 2,
        'lambda_l1': 0.7,
        'lambda_l2': 2,
        'learning_rate': 0.01,
        'max_depth': 3,
        'min_data_in_leaf': 22,
        'min_gain_to_split': 0.07,
        'min_sum_hessian_in_leaf': 19,
        'num_leaves': 20,
        'feature_fraction': 1,
        'save_binary': True,
        'seed': 42,
        'feature_fraction_seed': 42,
        'bagging_seed': 42,
        'drop_seed': 42,
        'data_random_seed': 42,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': 'false',
        'num_threads': 6,
        'device': 'gpu',
        'gpu_platform_id': 0,
        'gpu_device_id': 0
}

In [7]:
# random_state has no effect without shuffle=True (recent scikit-learn versions raise an error), so it is omitted
folds = KFold(n_splits=4)
target = df_train['target']
y_hat = np.zeros([200000, 200])
test_hat = np.zeros([100000, 200])
i = 0
start = time.time()
for feature in ['var_' + str(x) for x in range(200)]: # loop over all features 
    feat_choices = [feature, feature + '_count']
    print('Model using: ' + str(feat_choices))
    oof = np.zeros(len(X_train_count))
    predictions = np.zeros(len(X_test_count))
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_count[feat_choices].values, target.values)):
        trn_data = lgb.Dataset(X_train_count.iloc[trn_idx][feat_choices], label=target.iloc[trn_idx])
        val_data = lgb.Dataset(X_train_count.iloc[val_idx][feat_choices], label=target.iloc[val_idx])
        clf = lgb.train(lgb_params, trn_data, 130, valid_sets = [val_data], verbose_eval=-1)
        oof[val_idx] = clf.predict(X_train_count.iloc[val_idx][feat_choices])
        predictions += clf.predict(X_test_count[feat_choices]) / folds.n_splits
    print(">>> CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
    
    y_hat[:, i] = oof
    test_hat[:, i] = predictions
    i += 1

    
end = time.time()
print('It took %.2f seconds' %(end - start))

Model using: ['var_0', 'var_0_count']
>>> CV score: 0.54809 
Model using: ['var_1', 'var_1_count']
>>> CV score: 0.54580 
Model using: ['var_2', 'var_2_count']
>>> CV score: 0.55087 
Model using: ['var_3', 'var_3_count']
>>> CV score: 0.50841 
Model using: ['var_4', 'var_4_count']
>>> CV score: 0.50234 
Model using: ['var_5', 'var_5_count']
>>> CV score: 0.52743 
Model using: ['var_6', 'var_6_count']
>>> CV score: 0.55783 
Model using: ['var_7', 'var_7_count']
>>> CV score: 0.50124 
Model using: ['var_8', 'var_8_count']
>>> CV score: 0.51733 
Model using: ['var_9', 'var_9_count']
>>> CV score: 0.54118 
Model using: ['var_10', 'var_10_count']
>>> CV score: 0.49850 
Model using: ['var_11', 'var_11_count']
>>> CV score: 0.51800 
Model using: ['var_12', 'var_12_count']
>>> CV score: 0.55976 
Model using: ['var_13', 'var_13_count']
>>> CV score: 0.55424 
Model using: ['var_14', 'var_14_count']
>>> CV score: 0.50525 
Model using: ['var_15', 'var_15_count']
>>> CV score: 0.51397 
Model using:

In [8]:
valid_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
print('>>> Your CV score is:', roc_auc_score(target, valid_pred))

>>> Your CV score is: 0.9209126218152106
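
The loop above builds one column of out-of-fold predictions per feature and then averages the columns. The sketch below reproduces that structure on random toy data. It is an illustration under stated assumptions: a least-squares line fit stands in for `lgb.train`, and `np.array_split` stands in for an unshuffled `KFold`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_splits = 12, 3, 4
X = rng.normal(size=(n_rows, n_features))
y = rng.integers(0, 2, size=n_rows)

# Contiguous validation folds, as an unshuffled KFold would produce
fold_indexes = np.array_split(np.arange(n_rows), n_splits)
oof_matrix = np.zeros((n_rows, n_features))

for j in range(n_features):          # one small model per feature
    for val_idx in fold_indexes:
        trn_idx = np.setdiff1d(np.arange(n_rows), val_idx)
        # Stand-in model: fit a line on the training rows only
        slope, intercept = np.polyfit(X[trn_idx, j], y[trn_idx], deg=1)
        # Each row is predicted by a model that never saw it
        oof_matrix[val_idx, j] = slope * X[val_idx, j] + intercept

# Average the per-feature predictions, same as .sum(axis=1) / n_features
blended = oof_matrix.mean(axis=1)
print(blended.shape)  # (12,)
```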


## CPU
---

In [0]:
# Model CPU

# The parameters for Light Gradient Boost
lgb_params = {
        'bagging_fraction': 0.77,
        'bagging_freq': 2,
        'lambda_l1': 0.7,
        'lambda_l2': 2,
        'learning_rate': 0.01,
        'max_depth': 3,
        'min_data_in_leaf': 22,
        'min_gain_to_split': 0.07,
        'min_sum_hessian_in_leaf': 19,
        'num_leaves': 20,
        'feature_fraction': 1,
        'save_binary': True,
        'seed': 42,
        'feature_fraction_seed': 42,
        'bagging_seed': 42,
        'drop_seed': 42,
        'data_random_seed': 42,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': 'false',
        'num_threads': 6,
        'device': 'cpu'
}
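
The CPU dictionary repeats the GPU one except for the device settings. One way to avoid the duplication is sketched here with a shortened dictionary; the names `gpu_params` and `cpu_params` are illustrative and do not appear in the original notebook.

```python
# Shortened copy of the GPU parameters, for illustration only
gpu_params = {
    'learning_rate': 0.01,
    'max_depth': 3,
    'objective': 'binary',
    'metric': 'auc',
    'device': 'gpu',
    'gpu_platform_id': 0,
    'gpu_device_id': 0,
}

# Drop the GPU-only keys and switch the device; everything else is shared
cpu_params = {k: v for k, v in gpu_params.items() if not k.startswith('gpu_')}
cpu_params['device'] = 'cpu'

print(cpu_params)
```

Deriving one dictionary from the other helps keep the two runs comparable, since only the device can differ.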

In [10]:
# random_state has no effect without shuffle=True (recent scikit-learn versions raise an error), so it is omitted
folds = KFold(n_splits=4)
target = df_train['target']
y_hat = np.zeros([200000, 200])
test_hat = np.zeros([100000, 200])
i = 0
start = time.time()
for feature in ['var_' + str(x) for x in range(200)]: # loop over all features 
    feat_choices = [feature, feature + '_count']
    print('Model using: ' + str(feat_choices))
    oof = np.zeros(len(X_train_count))
    predictions = np.zeros(len(X_test_count))
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_count[feat_choices].values, target.values)):
        trn_data = lgb.Dataset(X_train_count.iloc[trn_idx][feat_choices], label=target.iloc[trn_idx])
        val_data = lgb.Dataset(X_train_count.iloc[val_idx][feat_choices], label=target.iloc[val_idx])
        clf = lgb.train(lgb_params, trn_data, 130, valid_sets = [val_data], verbose_eval=-1)
        oof[val_idx] = clf.predict(X_train_count.iloc[val_idx][feat_choices])
        predictions += clf.predict(X_test_count[feat_choices]) / folds.n_splits
    print(">>> CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
    
    y_hat[:, i] = oof
    test_hat[:, i] = predictions
    i += 1

    
end = time.time()
print('It took %.2f seconds' %(end - start))

Model using: ['var_0', 'var_0_count']
>>> CV score: 0.54809 
Model using: ['var_1', 'var_1_count']
>>> CV score: 0.54580 
Model using: ['var_2', 'var_2_count']
>>> CV score: 0.55087 
Model using: ['var_3', 'var_3_count']
>>> CV score: 0.50841 
Model using: ['var_4', 'var_4_count']
>>> CV score: 0.50234 
Model using: ['var_5', 'var_5_count']
>>> CV score: 0.52743 
Model using: ['var_6', 'var_6_count']
>>> CV score: 0.55783 
Model using: ['var_7', 'var_7_count']
>>> CV score: 0.50124 
Model using: ['var_8', 'var_8_count']
>>> CV score: 0.51733 
Model using: ['var_9', 'var_9_count']
>>> CV score: 0.54118 
Model using: ['var_10', 'var_10_count']
>>> CV score: 0.49850 
Model using: ['var_11', 'var_11_count']
>>> CV score: 0.51800 
Model using: ['var_12', 'var_12_count']
>>> CV score: 0.55976 
Model using: ['var_13', 'var_13_count']
>>> CV score: 0.55424 
Model using: ['var_14', 'var_14_count']
>>> CV score: 0.50525 
Model using: ['var_15', 'var_15_count']
>>> CV score: 0.51397 
Model using:

### Result
---

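In this case the GPU version took about 32% longer than the CPU version to train the models:

- **GPU: 2635.93 seconds**
- **CPU: 1995.11 seconds**

The GPU works best with large and dense datasets! Here each of the 200 models is trained on only two columns (a feature and its count), so the work per model is likely too small to offset the GPU overhead.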
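
For reference, the gap between the two timings reported above can be worked out directly. The variable names are illustrative.

```python
gpu_seconds = 2635.93
cpu_seconds = 1995.11

extra_seconds = gpu_seconds - cpu_seconds   # absolute difference
slowdown = gpu_seconds / cpu_seconds        # GPU time relative to CPU time

print('GPU took %.2f s longer (%.2fx the CPU time)' % (extra_seconds, slowdown))
# GPU took 640.82 s longer (1.32x the CPU time)
```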