# Predictive Analysis & Model Tuning

---

In [1]:
# Importing all the libraries needed
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
import time
warnings.filterwarnings('ignore')
sns.set()

%matplotlib inline

# getting my path C:\\Users\\username\\Desktop
# /Users/username/Desktop for Mac
path = os.getcwd()

In [2]:
# reading the training set (train.csv), I had this file in C:\\Users\\username\\Desktop\\...

# Mac

path = '/'.join(path.split("/")[:4])
df_train = pd.read_csv(path + '/Santander_Customer_Transaction_Prediction/data/train.csv')
df_test = pd.read_csv(path + '/Santander_Customer_Transaction_Prediction/data/test.csv')

# Windows

#path = '\\'.join(path.split("\\")[:4])
#df_train = pd.read_csv(path + '\\Santander_Customer_Transaction_Prediction\\data\\train.csv')
#df_test = pd.read_csv(path + '\\Santander_Customer_Transaction_Prediction\\data\\test.csv')

In [3]:
print("Train shape: " + str(df_train.shape))
print("Test shape: " + str(df_test.shape))

Train shape: (200000, 202)
Test shape: (200000, 201)


In [4]:
# Splitting the target variable and the features
X_train = df_train.loc[:,'var_0':]
y_train = df_train.loc[:,'target']

In [5]:
print(X_train.shape)
print(y_train.shape)

(200000, 200)
(200000,)


### Sorting Fake Test Data
After a discussion on Kaggle, it turned out that the test dataset was built from roughly half real data (used for the LB scores) and half synthetic data (perhaps to increase the difficulty of the competition). The kernel below was one of the most important of the competition, so it is worth a look :)

Here is the kernel: [List of Fake Samples and Public/Private LB split](https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split)

In [6]:
synthetic_samples_indexes = pd.read_csv('synthetic_samples_indexes.csv')

In [7]:
df_test_real = df_test.copy()
df_test_real = df_test_real[~df_test_real.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))]
X_test = df_test_real.loc[:,'var_0':]
X_test.shape

(100000, 200)

### Frequency Encoding

---

As discussed in the [EDA notebook](https://github.com/FedericoRaimondi/me/blob/master/Santander_Customer_Transaction_Prediction/Exploratory_Data_Analysis/Data%20Exploration.ipynb), frequency encoding may help our tree-based model learn how often each value occurs within each variable.

I tried both computing the counts on the training set only and on train + test concatenated.

The second option performs significantly better!

In [8]:
def get_count(df):
    '''
    Function that adds one column for each variable (excluding 'ID_code', 'target')
    populated with the value frequencies
    '''
    for var in [i for i in df.columns if i not in ['ID_code','target']]:
        df[var+'_count'] = df.groupby(var)[var].transform('count')
    return df
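
To make the output of `get_count` concrete, here is a minimal sketch on a toy frame. The values are made up, and the `map(value_counts())` variant is an alternative I am swapping in for comparison, not the method used above; it yields the same counts for non-null values and may be faster on large frames.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the competition files
toy = pd.DataFrame({"var_0": [1.5, 2.0, 1.5, 3.0, 1.5, 2.0]})

# Frequency encoding via groupby/transform, as in get_count
toy["var_0_count"] = toy.groupby("var_0")["var_0"].transform("count")

# Alternative: look each value up in the column's value_counts table
toy["var_0_count_alt"] = toy["var_0"].map(toy["var_0"].value_counts())

print(toy)
```

Each row ends up carrying the number of times its own value appears in the column, which is the extra signal the tree model can split on.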

In [9]:
X_tot = pd.concat([X_train, X_test])
print(X_tot.shape)

(300000, 200)


In [10]:
start = time.time()
X_tot = get_count(X_tot)
end = time.time()
print('It took %.2f seconds\nShape: ' %(end - start))
print(X_tot.shape)

It took 275.81 seconds
Shape: 
(300000, 400)


In [11]:
X_train_count = X_tot.iloc[0:200000]
X_test_count = X_tot.iloc[200000:]

## Let's build our model

But first let's split the train set into train/dev

### LightGBM

---

In this competition LightGBM turned out to be one of the top choices, maybe the best one considering the training time.

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed
and efficient with the following advantages:
- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data

**[Documentation](https://lightgbm.readthedocs.io/en/latest/)**

**[Explanation](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc)**

In [12]:
# Libraries
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split
import lightgbm as lgb
from bayes_opt import BayesianOptimization

In [13]:
# 0.8 train, 0.2 dev
X_train,X_valid,y_train,y_valid = train_test_split(X_train_count, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [14]:
print('X_train shape: {}\n'.format(X_train.shape))
print('y_train shape: {}\n'.format(y_train.shape))
print('X_valid shape: {}\n'.format(X_valid.shape))
print('y_valid shape: {}'.format(y_valid.shape))

X_train shape: (160000, 400)

y_train shape: (160000,)

X_valid shape: (40000, 400)

y_valid shape: (40000,)


## Data Augmentation

---

In [15]:
# Data Augmentation 2x if y = 1 , 1x if y = 0
def augment(x,y,t=2):
    '''
    Data Augmentation 2x if y = 1 , 1x if y = 0
    '''
    xs,xn = [],[]
    for i in range(t):
        mask = y>0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(int(x1.shape[1]/2)):
            np.random.shuffle(ids)
            x1[:,c] = x1[ids][:,c]
            x1[:,c+200] = x1[ids][:,c+200] # The new features must go with their original one!
        xs.append(x1)

    for i in range(t//2):
        mask = y==0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(int(x1.shape[1]/2)):
            np.random.shuffle(ids)
            x1[:,c] = x1[ids][:,c]
            x1[:,c+200] = x1[ids][:,c+200] # The new features must go with their original one!
        xn.append(x1)

    xs = np.vstack(xs)
    xn = np.vstack(xn)
    ys = np.ones(xs.shape[0])
    yn = np.zeros(xn.shape[0])
    x = np.vstack([x,xs,xn])
    y = np.concatenate([y,ys,yn])
    return x,y
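
The core move in `augment` is to shuffle each feature independently across rows of the same class, while keeping every raw feature paired with its count column. A minimal sketch of that step on a toy array follows; the array, `n_raw` and the use of `np.random.default_rng` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_raw = 2  # hypothetical: 2 raw features followed by their 2 count columns
# Rows of a single class; column j + n_raw holds the count for column j
x_pos = np.array([
    [0.1, 10.0, 1, 5],
    [0.2, 20.0, 2, 6],
    [0.3, 30.0, 3, 7],
    [0.4, 40.0, 4, 8],
], dtype=float)

x_new = x_pos.copy()
for c in range(n_raw):
    ids = rng.permutation(x_pos.shape[0])         # a fresh row order per feature
    x_new[:, c] = x_pos[ids, c]                   # shuffle the raw feature
    x_new[:, c + n_raw] = x_pos[ids, c + n_raw]   # move its count along with it

print(x_new)
```

Every column keeps exactly the same set of values, so the per-feature distributions within the class are unchanged; only the combinations across features are new. This only makes sense if the features are (close to) independent given the class.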

In [None]:
start = time.time()
# Trying Augmentation Only for training set!
X_tr, y_tr = augment(X_train.values, y_train.values)
print('X_tr Augm shape: {}'.format(X_tr.shape))
print('y_tr Augm shape: {}'.format(y_tr.shape))

end = time.time()
print('It took %.2f seconds' %(end - start))

- X_tr Augm shape: (336078, 400)
- y_tr Augm shape: (336078,)

It took 183.11 seconds

In [None]:
X_tr = pd.DataFrame(data=X_tr,columns=X_train.columns)
y_tr = pd.DataFrame(data=y_tr)

In [None]:
y_tr.columns = ['target']

In [16]:
# List of all the features
features = [c for c in X_train.columns if c not in ['ID_code', 'target']]

In [17]:
# The parameters for Light Gradient Boost
lgb_params = {
        'bagging_fraction': 0.77,
        'bagging_freq': 2,
        'lambda_l1': 0.7,
        'lambda_l2': 2,
        'learning_rate': 0.01,
        'max_depth': 3,
        'min_data_in_leaf': 22,
        'min_gain_to_split': 0.07,
        'min_sum_hessian_in_leaf': 19,
        'num_leaves': 20,
        'feature_fraction': 1,
        'save_binary': True,
        'seed': 42,
        'feature_fraction_seed': 42,
        'bagging_seed': 42,
        'drop_seed': 42,
        'data_random_seed': 42,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': 'false',
        'num_threads': 6
}

In [None]:
start = time.time()

trn_data = lgb.Dataset(X_train, label=y_train)
# trn_data = lgb.Dataset(X_tr, label=y_tr) # Augmentation
val_data = lgb.Dataset(X_valid, label=y_valid)

# Training
clf = lgb.train(lgb_params, trn_data, 100000, valid_sets = [trn_data, val_data], verbose_eval=5000, early_stopping_rounds = 3000)

end = time.time()
print('It took %.2f seconds' %(end - start))

The cell above will take a while to run...

Here are some results I obtained with different approaches:

- *CV: 0.900, LB: 0.900*, Model trained with original features
- *CV: 0.901, LB: 0.901*, Model trained with original features and augmentation

(After competition)

- *CV: 0.910, LB: 0.910*, Model trained with frequency encoding
- *CV: 0.916, LB: 0.915*, Model trained with frequency encoding and augmentation

__NOTE__: In the last days of the competition I tried frequency encoding with no luck! That was because of the 'feature_fraction' parameter: if you don't set it to 1, a tree may sample an engineered feature without its original one, so the two cannot interact!
Unfortunately, I didn't have this intuition during the competition, but I learned something I can use next time :)
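
A rough back-of-the-envelope check of this effect: if each tree sees a random subset of the 400 columns, how often does it see both `var_i` and `var_i_count`? The sketch below assumes the subset size is simply `round(n_features * fraction)`; the exact rounding LightGBM applies is not verified here, and `prob_pair_sampled` is a hypothetical helper.

```python
def prob_pair_sampled(n_features, fraction):
    """Chance that two given features both land in one random subset."""
    k = max(1, round(n_features * fraction))  # assumed subset size
    return k * (k - 1) / (n_features * (n_features - 1))

for fraction in (0.05, 0.5, 1.0):
    print(fraction, round(prob_pair_sampled(400, fraction), 4))
```

With a small fraction the pair is almost never available to the same tree, which fits the observation that the count features only paid off with `feature_fraction` set to 1.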

In [None]:
# Predictions
train_pred = clf.predict(X_train[features], num_iteration=clf.best_iteration)
val_pred = clf.predict(X_valid[features], num_iteration=clf.best_iteration)
predictions = clf.predict(X_test_count[features], num_iteration=clf.best_iteration)

In [None]:
# Printing the ROC AUC scores
print(">> Train score: {:<8.5f}".format(roc_auc_score(y_train, train_pred)))
print("\n>> Valid score: {:<8.5f}".format(roc_auc_score(y_valid, val_pred)))

In [None]:
# Displaying feature importance

importance_df = pd.DataFrame()
importance_df["feature"] = features
importance_df["importance"] = clf.feature_importance()

importance_df = importance_df.sort_values(by='importance', ascending=False)
plt.figure(figsize=(14,26))
sns.barplot(x="importance", y="feature", data=importance_df)
plt.title('LightGBM Features')
plt.tight_layout()

## A Faster Approach

---

OK, the model is fine, but we need a faster way to make our predictions!

Many kernels proposed fascinating approaches:
- [Felipe Mello's one](https://www.kaggle.com/felipemello/step-by-step-guide-to-the-magic-lb-0-922)
- [Chris Deotte's one](https://www.kaggle.com/cdeotte/200-magical-models-santander-0-920)


These two really did impressive work in the competition and helped many of us understand many concepts.

Basically, we know that our features are essentially uncorrelated, but we are not sure about their independence...

If we manage to build several models, each using only one feature, and together they perform as well as before, then we have good evidence of independence!

In the next code, we are going to build 200 models with only 2 features each: the original one and the related frequency variable (i.e. 'var_0' and 'var_0_count').
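
To see why averaging many weak single-feature models can work when features are independent, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the data is random noise with a small class-dependent shift per feature, and each "model" is just the raw feature used as a score rather than a LightGBM model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n_samples, n_features = 20000, 200

# Synthetic stand-in: each feature is independent noise plus a small
# class-dependent shift (a weak, independent signal)
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features)) + 0.15 * y[:, None]

# "One model per feature": each feature is used directly as a score
single_aucs = [roc_auc_score(y, X[:, j]) for j in range(n_features)]

# Combine the per-feature scores by a plain average
combined_auc = roc_auc_score(y, X.mean(axis=1))

print("best single-feature AUC: %.3f" % max(single_aucs))
print("combined AUC:            %.3f" % combined_auc)
```

Individually each score is barely better than chance, but their average separates the classes well, because independent noise cancels out while the small shifts add up.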

In [18]:
start = time.time()

iteration = 120
y_hat = np.zeros([int(200000*0.2), 200])
test_hat = np.zeros([100000, 200])
i = 0
for feature in ['var_' + str(x) for x in range(200)]: # loop over all the raw features
    feat_choices = [feature, feature + '_count']
    trn_data = lgb.Dataset(X_train[feat_choices], y_train)
    #trn_data = lgb.Dataset(X_tr[feat_choices], y_tr) # Augmentation
    val_data = lgb.Dataset(X_valid[feat_choices], y_valid)
    clf = lgb.train(lgb_params, trn_data, iteration, valid_sets=[val_data], verbose_eval=-1)
    y_hat[:, i] = clf.predict(X_valid[feat_choices])
    test_hat[:, i] = clf.predict(X_test_count[feat_choices])
    i += 1
    
end = time.time()
print('It took %.2f seconds' %(end - start))

It took 336.93 seconds


_Less than 6 min on my PC! That's fast :)_

In [19]:
val_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
score = roc_auc_score(y_valid, val_pred)
print('>>> Your CV score is: ', score)

>>> Your CV score is:  0.922140665488566


I tried both with and without Augmentation. These are the results:

- No Augm: *CV: 0.922, LB: 0.92046*
- Augm: *CV: 0.921, LB: 0.92038*

## Bayesian Parameter Optimization

---

In the Kernels section, I found this [Bayesian Parameter Optimization](https://www.kaggle.com/fayzur/lgb-bayesian-parameters-finding-rank-average) tutorial that helped me tune the LGBM parameters.

In [None]:
def LGB_bayesian(
    num_leaves,  # int
    min_data_in_leaf,  # int
    learning_rate,
    min_sum_hessian_in_leaf,    # float, passed through without casting
    lambda_l1,
    lambda_l2,
    min_gain_to_split,
    max_depth):
    
    # LightGBM expects the next three parameters to be integers, so we cast them
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)

    assert type(num_leaves) == int
    assert type(min_data_in_leaf) == int
    assert type(max_depth) == int

    param = {
        'bagging_fraction': 0.7693,
        'bagging_freq': 2,
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        'learning_rate': learning_rate,
        'max_depth': max_depth,
        'min_data_in_leaf': min_data_in_leaf,
        'min_gain_to_split': min_gain_to_split,
        'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
        'num_leaves': num_leaves,
        'feature_fraction': 1,
        'save_binary': True,
        'seed': 42,
        'feature_fraction_seed': 42,
        'bagging_seed': 42,
        'drop_seed': 42,
        'data_random_seed': 42,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': 'false',
        'num_threads': 6
        }   
    
    trn_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_valid, label=y_valid)
    
    num_round = 5000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [val_data], verbose_eval=5000, early_stopping_rounds = 500)
    
    predictions = clf.predict(X_valid[features], num_iteration=clf.best_iteration)   
    
    score = roc_auc_score(y_valid, predictions)
    
    
    return score

In [None]:
# Bounded region of parameter space
bounds_LGB = {
    'num_leaves': (10, 25), 
    'min_data_in_leaf': (15, 40),  
    'learning_rate': (0.005, 0.012),
    'min_sum_hessian_in_leaf': (15, 20),
    'lambda_l1': (0, 5), 
    'lambda_l2': (0, 5), 
    'min_gain_to_split': (0, 1.0),
    'max_depth':(1,10),
}

In [None]:
# Bayesian Optimizer
LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)

In [None]:
init_points = 5
n_iter = 5

In [None]:
print('-' * 100)

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0, alpha=1e-6)

In [None]:
# Magic Parameters xD
LGB_BO.max['params']

Please note that these parameters **must be integers:**

- num_leaves
- min_data_in_leaf
- max_depth
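
Since the optimizer works in a continuous space, the best parameters it reports are floats. A minimal sketch of casting them before reuse is shown below; `best_params` holds made-up values and `cast_int_params` is a hypothetical helper, not part of `bayes_opt` or LightGBM.

```python
# Hypothetical optimizer output: every value comes back as a float
best_params = {
    "num_leaves": 19.7,
    "min_data_in_leaf": 22.3,
    "max_depth": 3.9,
    "learning_rate": 0.0101,
}

INT_PARAMS = ("num_leaves", "min_data_in_leaf", "max_depth")

def cast_int_params(params, int_keys=INT_PARAMS):
    """Return a copy of params with the listed keys truncated to int."""
    fixed = dict(params)
    for key in int_keys:
        if key in fixed:
            fixed[key] = int(fixed[key])  # same truncation as in LGB_bayesian
    return fixed

tuned = cast_int_params(best_params)
print(tuned)
```

Truncating with `int()` mirrors what the objective function did during the search, so the final model is trained with the same values that were actually scored.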

## KFold CV

---

Let's try splitting our training set with KFold cross-validation.
This should help us improve our performance a bit and get more reliable results!

I chose 4 folds, but this could be changed!
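
The mechanics of out-of-fold (OOF) prediction can be sketched on synthetic data. The data and the stand-in "model" (the direction between the two class means) are assumptions chosen to keep the example small; the real cell trains a LightGBM model per fold instead.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # synthetic training features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)
X_new = rng.normal(size=(200, 3))                   # synthetic "test" features

folds = KFold(n_splits=4)                # contiguous folds, no shuffling
oof = np.zeros(len(X))                   # one out-of-fold score per training row
test_pred = np.zeros(len(X_new))         # test scores averaged over the folds
seen = np.zeros(len(X), dtype=int)       # how often each row was validated

for trn_idx, val_idx in folds.split(X):
    X_trn, y_trn = X[trn_idx], y[trn_idx]
    # Stand-in "model": direction between the two class means of this fold
    w = X_trn[y_trn == 1].mean(axis=0) - X_trn[y_trn == 0].mean(axis=0)
    oof[val_idx] = X[val_idx] @ w
    test_pred += (X_new @ w) / folds.n_splits
    seen[val_idx] += 1
```

Every training row is scored exactly once, by a model that never saw it, which is what makes the CV score computed on `oof` a fair estimate.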

In [23]:
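# No shuffling, so the folds are contiguous blocks; recent scikit-learn versions
# raise an error if random_state is set while shuffle=False
folds = KFold(n_splits=4)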
target = df_train['target']
y_hat = np.zeros([200000, 200])
test_hat = np.zeros([100000, 200])
i = 0
start = time.time()
for feature in ['var_' + str(x) for x in range(200)]: # loop over all features 
    feat_choices = [feature, feature + '_count']
    print('Model using: ' + str(feat_choices))
    oof = np.zeros(len(X_train_count))
    predictions = np.zeros(len(X_test_count))
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_count[feat_choices].values, target.values)):
        trn_data = lgb.Dataset(X_train_count.iloc[trn_idx][feat_choices], label=target.iloc[trn_idx])
        val_data = lgb.Dataset(X_train_count.iloc[val_idx][feat_choices], label=target.iloc[val_idx])
        clf = lgb.train(lgb_params, trn_data, 130, valid_sets = [val_data], verbose_eval=-1)
        oof[val_idx] = clf.predict(X_train_count.iloc[val_idx][feat_choices])
        predictions += clf.predict(X_test_count[feat_choices]) / folds.n_splits
    print(">>> CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
    
    y_hat[:, i] = oof
    test_hat[:, i] = predictions
    i += 1

    
end = time.time()
print('It took %.2f seconds' %(end - start))

Model using: ['var_0', 'var_0_count']
>>> CV score: 0.54809 
Model using: ['var_1', 'var_1_count']
>>> CV score: 0.54580 
Model using: ['var_2', 'var_2_count']
>>> CV score: 0.55087 
Model using: ['var_3', 'var_3_count']
>>> CV score: 0.50841 
Model using: ['var_4', 'var_4_count']
>>> CV score: 0.50234 
Model using: ['var_5', 'var_5_count']
>>> CV score: 0.52743 
Model using: ['var_6', 'var_6_count']
>>> CV score: 0.55783 
Model using: ['var_7', 'var_7_count']
>>> CV score: 0.50124 
Model using: ['var_8', 'var_8_count']
>>> CV score: 0.51733 
Model using: ['var_9', 'var_9_count']
>>> CV score: 0.54118 
Model using: ['var_10', 'var_10_count']
>>> CV score: 0.49850 
Model using: ['var_11', 'var_11_count']
>>> CV score: 0.51800 
Model using: ['var_12', 'var_12_count']
>>> CV score: 0.55976 
Model using: ['var_13', 'var_13_count']
>>> CV score: 0.55424 
Model using: ['var_14', 'var_14_count']
>>> CV score: 0.50525 
Model using: ['var_15', 'var_15_count']
>>> CV score: 0.51397 
...
>>> CV score: 0.50336 
Model using: ['var_130', 'var_130_count']
>>> CV score: 0.52450 
Model using: ['var_131', 'var_131_count']
>>> CV score: 0.52783 
Model using: ['var_132', 'var_132_count']
>>> CV score: 0.52029 
Model using: ['var_133', 'var_133_count']
>>> CV score: 0.54559 
Model using: ['var_134', 'var_134_count']
>>> CV score: 0.51581 
Model using: ['var_135', 'var_135_count']
>>> CV score: 0.52331 
Model using: ['var_136', 'var_136_count']
>>> CV score: 0.49996 
Model using: ['var_137', 'var_137_count']
>>> CV score: 0.52493 
Model using: ['var_138', 'var_138_count']
>>> CV score: 0.51759 
Model using: ['var_139', 'var_139_count']
>>> CV score: 0.57431 
Model using: ['var_140', 'var_140_count']
>>> CV score: 0.51090 
Model using: ['var_141', 'var_141_count']
>>> CV score: 0.52950 
Model using: ['var_142', 'var_142_count']
>>> CV score: 0.51776 
Model using: ['var_143', 'var_143_count']
>>> CV score: 0.50204 
Model using: ['var_144', 'var_144_count']
>>> CV score: 0.51797 
...

Well, almost *36 min* on my PC. That's quite good!

Please note that when I tried KFold CV with all the raw features it took several hours!

In [25]:
valid_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
print('>>> Your CV score is:', roc_auc_score(target, valid_pred))

>>> Your CV score is: 0.9209122683537883


This is the best result I achieved (unfortunately after the end of the competition).
Maybe with a few more days I could have reached the first step of the magic (LB: 0.908-0.910), but I am grateful for this experience.

Kaggle is a great place to learn and practice your data science skills!

**4 KFold Results:**
- *CV: 0.92091*
- *PublicLB: 0.92223*
- *PrivateLB: 0.92057*

## Submission

---

Preparing the submission file!

We only need the 'ID_codes' and 'pred' columns.

__Note__ that I predicted values only for the real test data and set the fake ones to 0.
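
The fill logic can be checked on a miniature example before running it on the full test set. The frame, the synthetic indexes and the predictions below are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature test set: 5 rows, of which rows 1 and 4 are synthetic
test = pd.DataFrame({"ID_code": ["test_%d" % i for i in range(5)]})
synthetic_idx = [1, 4]
real_mask = ~test.index.isin(synthetic_idx)

preds_real = np.array([0.7, 0.2, 0.9])  # one prediction per real row, in order

submission = pd.DataFrame({"ID_code": test["ID_code"].values})
submission["target"] = 0.0                        # default for synthetic rows
submission.loc[real_mask, "target"] = preds_real  # fill real rows in order

print(submission)
```

The assignment relies on the predictions being in the same row order as the real samples in the test file, which holds here because both are derived from the same mask.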


In [26]:
subm = pd.DataFrame({"ID_codes":df_test[~df_test.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))].loc[:,'ID_code']})
subm['pred'] = predictions
subm.head()

| | ID_codes | pred |
|---|---|---|
| 3 | test_3 | 0.500013 |
| 7 | test_7 | 0.499373 |
| 11 | test_11 | 0.498615 |
| 15 | test_15 | 0.498093 |
| 16 | test_16 | 0.500646 |


In [27]:
ID_codes = df_test[~df_test.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))].loc[:,'ID_code']
submission = pd.DataFrame({"ID_code": df_test.ID_code.values})
submission['target'] = 0
submission.loc[submission['ID_code'].isin(ID_codes), 'target'] = subm['pred'].values

In [28]:
submission.head()

| | ID_code | target |
|---|---|---|
| 0 | test_0 | 0.0 |
| 1 | test_1 | 0.0 |
| 2 | test_2 | 0.0 |
| 3 | test_3 | 0.500013 |
| 4 | test_4 | 0.0 |


In [29]:
submission.to_csv(r'submission.csv', index = None, header=True)