
<h2><center><font size="4">Dataset used: Santander Customer Transaction Prediction</font></center></h2>

<br>

# <a id='0'>Content</a>


- <a href='#1'>Introduction</a>
- <a href='#2'>Prepare for data analysis</a>
- <a href='#6'>Model</a>
- <a href='#7'>Submission</a>  


# <a id='1'>Introduction</a>  

In this challenge, Santander invites Kagglers to help them identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data they have available to solve this problem.  

The data is anonymized: each row contains 200 numerical values, identified only by a number.  

In the following we will explore the data, prepare it for a model, train a model and predict the target value for the test set.



# <a id='2'>Prepare for data analysis</a>  


## Load packages


In [1]:
import gc
import os
import logging
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
warnings.filterwarnings('ignore')
import imblearn

## Load data   

Let's check what data files are available.

In [2]:
PATH="./data/"
os.listdir(PATH)

['test.csv', 'train.csv']

Let's load the train and test data files.

In [10]:
%%time
train_df = pd.read_csv(PATH+"train.csv")
test_df = pd.read_csv(PATH+"test.csv")

CPU times: user 7.77 s, sys: 233 ms, total: 8 s
Wall time: 8 s


In [6]:
train_df.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


In [7]:
test_df.head()

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,-2.1556,11.8495,-1.43,2.4508,13.7112,2.4669,4.3654,10.72,15.4722,-8.7197
1,test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,...,10.6165,8.8349,0.9403,10.1282,15.5765,0.4773,-1.4852,9.8714,19.1293,-20.976
2,test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.895,20.2537,1.5233,...,-0.7484,10.9935,1.9803,2.18,12.9813,2.1281,-7.1086,7.0618,19.8956,-23.1794
3,test_3,8.5374,-1.3222,12.022,6.5749,8.8458,3.1744,4.9397,20.566,3.3755,...,9.5702,9.0766,1.658,3.5813,15.1874,3.1656,3.9567,9.2295,13.0168,-4.2108
4,test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.989,...,4.2259,9.1723,1.2835,3.3778,19.5542,-0.286,-5.1612,7.2882,13.926,-9.1846


In [11]:
train_data = train_df.drop(['ID_code','target'],axis=1)
test_data = test_df.drop(['ID_code'],axis=1)
train_data.head()

Unnamed: 0,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,-4.92,5.747,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,3.1468,8.0851,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,-4.9193,5.9525,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,-5.8609,8.245,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,6.2654,7.6784,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


In [14]:
target = train_df['target']

## Naive Bayes Classifier

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(train_data)  
x_test = scaler.transform(test_data)  

In [15]:
from sklearn.naive_bayes import GaussianNB  
classifier = GaussianNB()  
classifier.fit(x_train, target)

In [16]:
classifier.score(x_train, target)

0.921695

In [18]:
from sklearn.metrics import roc_auc_score, roc_curve

# predict() returns hard 0/1 labels, so the ROC curve built from them has a
# single interior point and this AUC equals balanced accuracy. To score the
# ranking instead, use classifier.predict_proba(x_train)[:, 1].
x_train_pred = classifier.predict(x_train)
print(roc_auc_score(target, x_train_pred))
fpr, tpr, thresholds = roc_curve(target, x_train_pred)

0.6754651025994682


In [111]:
from sklearn.metrics import auc
auc(fpr, tpr)

0.6754651025994682

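Gaussian Naive Bayes on standardized features gives an AUC of 0.675 on the training data. This figure is computed from hard class predictions rather than predicted probabilities, and it is a training score, not a validation score.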
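The AUC above is computed from `classifier.predict`, which returns hard 0/1 labels. The sketch below shows, on synthetic data, how that number can differ from an AUC computed on predicted probabilities. The dataset from `make_classification` and all variable names are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the Santander data (an assumption).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

model = GaussianNB().fit(X, y)

hard_labels = model.predict(X)          # 0/1 decisions at the default threshold
scores = model.predict_proba(X)[:, 1]   # estimated probability of class 1

auc_from_labels = roc_auc_score(y, hard_labels)
auc_from_scores = roc_auc_score(y, scores)
bal_acc = balanced_accuracy_score(y, hard_labels)

print("AUC from hard labels:  ", round(auc_from_labels, 4))
print("AUC from probabilities:", round(auc_from_scores, 4))
# With 0/1 predictions the ROC curve has one interior point, so its area
# reduces to balanced accuracy, (TPR + TNR) / 2.
print("Balanced accuracy:     ", round(bal_acc, 4))
```

Since the competition metric is AUC on predicted scores, the probability-based value is probably the more relevant one to track.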

## Logistic Regression

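In [18]:
from sklearn.linear_model import LogisticRegression
# X_over and y_over are the oversampled features and labels created in the
# Class Imbalance section below; that cell must be run first.
clf_logistic = LogisticRegression(random_state=0).fit(X_over, y_over)
clf_logistic.score(X_over, y_over)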

0.7701693144045091

In [19]:
x_train_pred = clf_logistic.predict(X_over)
print(roc_auc_score(y_over,x_train_pred))
fpr, tpr, thresholds = roc_curve(y_over,x_train_pred)

0.7701693144045092


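On the oversampled data, logistic regression reaches a training AUC of 0.770. That is lower than the 0.809 that Naive Bayes reaches on oversampled data in the Class Imbalance section below. It is higher than the 0.675 of Naive Bayes on the original data, but those two scores are measured on different datasets and are not directly comparable.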

## Class Imbalance

In [17]:
import imblearn
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
print("Count before oversampling:  ")
print(Counter(target))
oversample = RandomOverSampler(sampling_strategy='minority')
# oversample = RandomOverSampler(sampling_strategy=0.5)

X_over, y_over = oversample.fit_resample(train_data, target)
print("Count after oversampling: ")
print(Counter(y_over))

Count before oversampling:  
Counter({0: 179902, 1: 20098})
Count after oversampling: 
Counter({0: 179902, 1: 179902})


In [16]:
# Training Naive Bayes again on oversampled data.
# The model is fitted on the oversampled features directly; no scaling is applied here.
classifier = GaussianNB()
classifier.fit(X_over, y_over)

print(classifier.score(X_over, y_over))

x_train_pred = classifier.predict(X_over)

# AUC from hard labels; on balanced classes it coincides with accuracy
print(roc_auc_score(y_over, x_train_pred))

0.8085290880590543
0.8085290880590543


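After random oversampling, the training AUC of Naive Bayes rises from 0.675 to 0.809. Both figures are measured on the data the model was trained on, and the oversampled set contains duplicated minority rows, so neither is a validation estimate.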
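The scores in this section are measured on the resampled training rows themselves. One way to get a less optimistic estimate is to hold out a validation set before resampling, so that no duplicated row ends up on both sides. The sketch below illustrates this with a hand-rolled oversampler on hypothetical data; `random_oversample` is an illustrative helper that assumes class 1 is the minority, and it is not the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 1000 rows, roughly 10% positives.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# Split first, so validation rows never take part in resampling.
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def random_oversample(X, y, rng):
    """Duplicate class-1 rows (sampling with replacement) until classes match."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

X_bal, y_bal = random_oversample(X_trn, y_trn, rng)

print("train before:", np.bincount(y_trn))
print("train after: ", np.bincount(y_bal))
print("validation:  ", np.bincount(y_val))  # left at the original class ratio
```

A model fitted on `X_bal, y_bal` can then be scored on `X_val, y_val`, which keeps the original class ratio.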

Now let's try the SMOTE technique.

In [19]:
from imblearn.over_sampling import SMOTE

print("Count before oversampling:  ")
print(Counter(target))
oversample = SMOTE(random_state=1)
# oversample = RandomOverSampler(sampling_strategy=0.5)

X_over, y_over = oversample.fit_resample(train_data, target)
print("Count after oversampling: ")
print(Counter(y_over))

Count before oversampling:  
Counter({0: 179902, 1: 20098})
Count after oversampling: 
Counter({0: 179902, 1: 179902})
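Unlike random oversampling, SMOTE balances the classes by creating new minority rows instead of duplicating existing ones. The sketch below illustrates the core interpolation idea only: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. `smote_like_samples` is a hypothetical helper written for this illustration; it is a simplification and not the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows
    pick = rng.integers(1, k + 1, size=n_new)       # neighbour rank, skipping self
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[partner] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(50, 3))  # hypothetical minority-class rows
synthetic = smote_like_samples(X_minority, n_new=200)
print(synthetic.shape)
```

Because every synthetic row lies between two real minority rows, the new points stay inside the range already covered by the minority class.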


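In [22]:
# Training Naive Bayes again using SMOTE.
# The model is fitted on the unscaled SMOTE features.
classifier = GaussianNB()
classifier.fit(X_over, y_over)

print(classifier.score(X_over, y_over))

x_train_pred = classifier.predict(X_over)
# The model was fitted on unscaled features, so predict on the unscaled
# test features (test_data), not on the standardized x_test.
x_test_predictions = classifier.predict(test_data)

# AUC from hard labels; on balanced classes it coincides with accuracy
print(roc_auc_score(y_over, x_train_pred))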

0.8655823726250959
0.8655823726250957


### Decision Tree Classifier on Oversampled Data

In [23]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_over,y_over)
print(clf.score(X_over, y_over))
x_train_pred = clf.predict(X_over)
print(roc_auc_score(y_over,x_train_pred))

1.0
1.0


The decision tree scores a perfect 1.0 on the data it was trained on, which indicates overfitting. Let's try a random forest.
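A perfect training score on its own does not show how much the tree overfits; comparing it with a score on held-out rows does. The sketch below makes that comparison on synthetic data with 10% label noise. The dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% randomly flipped labels (an assumption).
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_trn, y_trn)

train_auc = roc_auc_score(y_trn, tree.predict_proba(X_trn)[:, 1])
valid_auc = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
print("train AUC:  ", round(train_auc, 4))
print("holdout AUC:", round(valid_auc, 4))
```

The gap between the two numbers is the part of the training score that comes from memorizing the training rows, noise included.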

In [30]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=6, random_state=0)
clf.fit(X_over,y_over)
x_train_pred = clf.predict(X_over)
print(roc_auc_score(y_over,x_train_pred))

0.7669536747784905


With `max_depth=6`, the random forest reaches an AUC of 0.767 on the training data, below the 0.866 of Naive Bayes with SMOTE.

# <a id='6'>Model</a>  

From the train columns list, we drop the ID and target to form the features list.

In [26]:
features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']

We define the hyperparameters for the model.

In [27]:
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary', 
    'verbosity': 1
}

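We train the model with 10-fold stratified cross-validation, collecting out-of-fold predictions on the training set and averaging the test predictions over the folds.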

In [29]:
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=44000)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])

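    num_round = 1000000
    # verbose_eval and early_stopping_rounds were removed from lgb.train in LightGBM 4.0;
    # on those versions pass callbacks=[lgb.early_stopping(3000), lgb.log_evaluation(1000)] instead.
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)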
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))

Fold 0
[LightGBM] [Info] Number of positive: 18089, number of negative: 161911
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 53040
[LightGBM] [Info] Number of data points in the train set: 180000, number of used features: 208
Training until validation scores don't improve for 3000 rounds
[1000]	training's auc: 0.885918	valid_1's auc: 0.875496
[2000]	training's auc: 0.903601	valid_1's auc: 0.887846
[3000]	training's auc: 0.914985	valid_1's auc: 0.894505
[4000]	training's auc: 0.922396	valid_1's auc: 0.898394
[5000]	training's auc: 0.928084	valid_1's auc: 0.900316
[6000]	training's auc: 0.932925	valid_1's auc: 0.901107
[7000]	training's auc: 0.937124	valid_1's auc: 0.901798
[8000]	training's auc: 0.941121	valid_1's auc: 0.902207
[9000]	training's auc: 0.944827	valid_1's auc: 0.90226
[10000]	training's auc: 0.948293	valid_1's auc: 0.90227
[11000]	training's auc: 0.951589	valid_1's auc: 0.902188
Early stopping, best iteration is:
[8864]	training's a

[11000]	training's auc: 0.951859	valid_1's auc: 0.898186
[12000]	training's auc: 0.955047	valid_1's auc: 0.898071
[13000]	training's auc: 0.958043	valid_1's auc: 0.898155
Early stopping, best iteration is:
[10356]	training's auc: 0.949799	valid_1's auc: 0.898259
Fold 6
[LightGBM] [Info] Number of positive: 18088, number of negative: 161912
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 53040
[LightGBM] [Info] Number of data points in the train set: 180000, number of used features: 208
Training until validation scores don't improve for 3000 rounds
[1000]	training's auc: 0.886795	valid_1's auc: 0.868372
[2000]	training's auc: 0.903886	valid_1's auc: 0.881877
[3000]	training's auc: 0.915243	valid_1's auc: 0.889648
[4000]	training's auc: 0.92264	valid_1's auc: 0.893456
[5000]	training's auc: 0.928345	valid_1's auc: 0.895663
[6000]	training's auc: 0.933156	valid_1's auc: 0.897068
[7000]	t
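The loop above follows an out-of-fold pattern: each training row is scored by the one fold model that did not see it, and the test prediction is the average over all fold models. The sketch below reproduces that pattern in a small, fast form. It swaps in `LogisticRegression` for LightGBM and uses synthetic data, both assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the train and test sets (assumptions).
X_all, y_all = make_classification(
    n_samples=2100, n_features=10, weights=[0.9, 0.1], random_state=0
)
X, y, X_test = X_all[:2000], y_all[:2000], X_all[2000:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(X))             # one held-out prediction per training row
test_pred = np.zeros(len(X_test))  # test prediction averaged over fold models

for trn_idx, val_idx in folds.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[trn_idx], y[trn_idx])
    # Rows in val_idx were not used to fit this fold's model.
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / folds.n_splits

cv_auc = roc_auc_score(y, oof)
print("CV score: {:.5f}".format(cv_auc))
```

Because every entry of `oof` comes from a model that never saw that row, the resulting AUC is a validation estimate, unlike the training-set scores reported in the earlier sections.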

Let's check the feature importance.

In [None]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')
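The cell above picks the top features by their mean importance across folds. The sketch below shows that aggregation step on a tiny hand-made table, so the grouping and ordering are easy to verify by eye. The importance values are hypothetical.

```python
import pandas as pd

# Hypothetical per-fold importances for three features over two folds.
feature_importance_df = pd.DataFrame({
    "Feature": ["var_0", "var_1", "var_2", "var_0", "var_1", "var_2"],
    "importance": [10, 40, 25, 20, 30, 25],
    "fold": [1, 1, 1, 2, 2, 2],
})

# Average each feature over the folds, then rank from most to least important.
mean_importance = (
    feature_importance_df.groupby("Feature")["importance"]
    .mean()
    .sort_values(ascending=False)
)
top_features = mean_importance.head(2).index.tolist()

print(mean_importance)
print(top_features)
```

Passing an explicitly ordered list such as `mean_importance.index` to the `order` argument of `sns.barplot` is one way to make the bars follow the averaged ranking rather than the order of the individual per-fold rows.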

# <a id='7'>Submission</a>  

We write the averaged test predictions to a submission file.

In [None]:
sub_df = pd.DataFrame({"ID_code":test_df["ID_code"].values})
sub_df["target"] = predictions
sub_df.to_csv("submission.csv", index=False)