<h1><center><font size="6">Santander Customer Transaction Prediction</font></center></h1>

<h2><center><font size="4">Dataset used: Santander Customer Transaction Prediction</font></center></h2>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Another_new_Santander_bank_-_geograph.org.uk_-_1710962.jpg/640px-Another_new_Santander_bank_-_geograph.org.uk_-_1710962.jpg" width="500"></img>

<br>

# <a id='1'>Content</a>  

- Introduction
- Exploratory Data Analysis
- Load Data and Reducing Memory Usage
- Basic Statistics Plot
- Feature Engineering
- Regression Model (Light GBM)
- Classification Model (Logistic Regression, Naive Bayes, Random Forest)
- Feature Engineering (Sorting Fake Test Data, Frequency Encoding, K Fold CV)
- Conclusion
- Submission

# <a id='1'>Abstract</a>  

In this challenge, Santander invites Kagglers to help them identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data they have available to solve this problem.  

The data is anonymized: each row contains 200 numerical values, identified only by a number.  

In the following we will explore the data, prepare it for a model, train a model and predict the target value for the test set, then prepare a submission.


At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.



# <a id='2'>Exploratory Data Analysis</a>  



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math

import warnings
warnings.filterwarnings("ignore")


# <a id='2'> Load Data and Reducing Memory Usage</a>  

In [None]:
#https://www.kaggle.com/gemartin/load-data-reduce-memory-usage

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

def import_data(file):
    """create a dataframe and optimize its memory usage"""
    # keep_date_col is deprecated in recent pandas and has no effect here, so it is dropped
    df = pd.read_csv(file, parse_dates=True)
    df = reduce_mem_usage(df)
    return df
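
One caveat about the float16 downcast in `reduce_mem_usage`: half precision keeps only about three significant decimal digits, so values that differ in the fourth decimal place can collapse into a single value. This matters later in the notebook, where the number of times each value occurs is used as a feature. The snippet below is a minimal sketch on made-up numbers (not competition data) that compares the number of distinct values before and after casting.

```python
import numpy as np

# Hypothetical feature values with four decimal places (illustration only).
values = np.array([8.9255, 8.9256, 8.9257, 11.5006, 11.5007], dtype=np.float64)

as_half = values.astype(np.float16)    # what reduce_mem_usage may pick for small ranges
as_single = values.astype(np.float32)  # a safer choice when exact values matter

n_unique_original = len(np.unique(values))
n_unique_half = len(np.unique(as_half))
n_unique_single = len(np.unique(as_single))

print("distinct values: float64={}, float16={}, float32={}".format(
    n_unique_original, n_unique_half, n_unique_single))
```

If value counts are going to be used as features, it may be safer to compute them on the original values, or to stop the downcast at float32.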


#### Shape of Training and Test Data

In [None]:
train = import_data("train.csv")
test = import_data("test.csv")

print("\n\nTrain Size : \t{}\nTest Size : \t{}".format(train.shape, test.shape))

The train dataset has 202 columns and the test dataset has 201. The extra column in the train dataset is `target`, which is not present in the test dataset.

In [None]:
train.head(2)

The data is fully anonymized, so even with domain knowledge we cannot tell which features are meaningful. We can start with basic row-wise statistics such as mean, standard deviation, counts and median, and do the feature engineering later.

# <a id='2'> Basic Stats</a>

### Target Distribution

In [None]:
sns.countplot(x=train['target'])

In [None]:
train.target.value_counts()

### Is the Dataset balanced?

In [None]:
mylst = list(train["target"].value_counts())  # df_train is not defined yet at this point
zero = round(float((mylst[0]/sum(mylst))*100),2)
one = round(float((mylst[1]/sum(mylst))*100),2)
print('The dataset has {zero} % of target 0 and {one} % of target 1'.format(zero=zero, one=one))


**Well, not very balanced... we'll take that into account!**

In [None]:
t0=train[train['target']==0]
t1=train[train['target']==1]

In [None]:
print('Distributions of the first 28 features')
plt.figure(figsize=(20,16))
for i, col in enumerate(list(train.columns)[2:30]):
    plt.subplot(7,4,i + 1)
    plt.hist(t0[col],label='target 0',color='Red')
    plt.hist(t1[col],label='target 1',color='Blue')
    plt.title(col)
    plt.grid()
    plt.legend(loc='upper right')
    plt.tight_layout()

As shown above, nearly 90% of the target values are 0 (we assume 0 means the customer did not make the transaction) and only about 10% are 1 (we assume 1 means the customer made the transaction).

This makes the data significantly imbalanced.
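
With this kind of imbalance, accuracy says very little: a model that always predicts 0 is already right about 90% of the time. That is one reason the models below are scored with ROC AUC. The sketch below illustrates the point on synthetic labels (not the competition data); `rank_auc` is a hypothetical helper written for this example, using the rank-based formula for AUC and assuming no tied scores.

```python
import numpy as np

def rank_auc(y_true, scores):
    """ROC AUC from the Mann-Whitney rank formula (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.10).astype(int)        # about 10% positives

# A "model" that always predicts 0 looks good on accuracy.
always_zero = np.zeros_like(y)
accuracy_always_zero = (always_zero == y).mean()

# Scores unrelated to the label give an AUC close to 0.5.
auc_random = rank_auc(y, rng.random(len(y)))

# Scores that carry some signal give an AUC clearly above 0.5.
auc_informative = rank_auc(y, y + rng.normal(0, 1, len(y)))

print("accuracy (always 0): {:.3f}".format(accuracy_always_zero))
print("AUC (random scores): {:.3f}".format(auc_random))
print("AUC (informative scores): {:.3f}".format(auc_informative))
```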

In [None]:
train.drop(['ID_code'],axis=1,inplace=True)
labels=train['target']
train.drop(['target'],axis=1,inplace=True)

In [None]:
train.select_dtypes(include='float16')

In [None]:
train.astype(np.float64).describe()

We can make a few observations here:

* the standard deviation is relatively large for both train and test variable data;
* the min, max, mean and std values for train and test data look quite close.

#### Missing Values:

In [None]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

In [None]:
missing_data(train)

In [None]:
missing_data(test)

There are no missing values in either the train or the test dataset.

### Performing EDA

#### Mean

In [None]:
features = train.columns.tolist()

In [None]:
plt.figure(figsize=(16,6))
plt.title("Mean in train and test set")
sns.distplot(train[features].mean(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].mean(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()

#### Standard Deviation

In [None]:
plt.figure(figsize=(16,6))
plt.title("Standard Deviation in train and test set")
sns.distplot(train[features].std(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].std(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()

#### Skewness

In [None]:
plt.figure(figsize=(16,6))
plt.title("Skewness in train and test set")
sns.distplot(train[features].skew(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].skew(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()

#### Min

In [None]:
plt.figure(figsize=(16,6))
plt.title("Min in train and test set")
sns.distplot(train[features].min(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].min(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()

#### Max

In [None]:
plt.figure(figsize=(16,6))
plt.title("Max in train and test set")
sns.distplot(train[features].max(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test[features].max(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend()
plt.show()

#### Comparing Distribution of Feature

We can see from the above that all the variables have nearly the same distribution, on the same scale.

### Duplicate Values

In [None]:
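# ID_code and target were dropped from train above, so the 200 var_* columns now start at position 0
features = train.columns.values[:200]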
unique_max_train = []
unique_max_test = []
for feature in features:
    values = train[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])

In [None]:
np.transpose((pd.DataFrame(unique_max_train, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(15))

In [None]:
np.transpose((pd.DataFrame(unique_max_test, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(15))

The same columns in the train and test sets have the same, or a very close, number of duplicates of the same or very close values. This is an interesting pattern that we might be able to use later.

# <a id='2'>Feature Engineering</a>  

In [None]:
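# ID_code and target were dropped from train above, so the 200 var_* columns now start at position 0
idx = features = train.columns.values[:200]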
for df in [test, train]:
    df['sum'] = df[idx].sum(axis=1)  
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)

In [None]:
train.head(2)

In [None]:
train.drop(['kurt'],axis=1,inplace=True)

In [None]:
train.head()

In [None]:
test.head(2)

In [None]:
test.drop(['kurt','ID_code'],axis=1,inplace=True)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train)

In [None]:
train_data.shape

# <a id='2'>Let's see if there are some correlations between our variables</a> 


In [None]:
# choose a threshold to spot correlation above its abs()
# try 0.08 or 0.05 to have some results, even though is not a relevant correlation 
threshold = 0.3
# df_train is not defined yet at this point; correlate the raw variables held in train
dfcorr = train[features].corr()
dfcorr1 = dfcorr.copy()
dfcorr1[abs(dfcorr1) < threshold] = None
dfcorr1[abs(dfcorr1) >= threshold] = 1

In [None]:
# all the variables have at least corr = 1 with itself so we want to know
# which variables have more than 1 record above the threshold
cor = dfcorr1.sum(axis=1) > 1

In [None]:
# Listing the variables that is worth investigating on
var_to_check = list(cor[cor.values == True].index)

In [None]:
if len(var_to_check) > 0:
    print('These are the variables with correlations >= {}:'.format(threshold))
    print(str(var_to_check) + '\n')
    for i in var_to_check:
        print(str(dfcorr[(abs(dfcorr[i]) >= threshold) & (abs(dfcorr[i]) != 1)][i]) + '\n')
else:
    print('There are no significant correlations to look!')

In [None]:
dfcorr[(dfcorr!=1) & (abs(dfcorr)>0.1)].count()

All the pairwise correlations are below 0.1 in absolute value: the variables are almost completely uncorrelated.

Maybe the Santander team preprocessed the data!

# <a id='2'>Modelling</a>  

### Regression Model

## Light GBM

In [None]:
!pip install lightgbm

In [None]:
import lightgbm as lgb

In [None]:
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary', 
    'verbosity': 1
}

#njobs = -1

In [None]:
#https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train
#https://www.kaggle.com/ashishpatel26/kfold-lightgbm/code 
#(learned from here how to use stratified k-fold with model)
#https://github.com/KazukiOnodera/Santander-Customer-Transaction-Prediction/blob/master/final_solution/onodera/py/907_predict_0410-2.py

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve
# recent scikit-learn versions reject random_state when shuffle=False, so shuffling is enabled
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=44000)
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, labels.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=labels.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=labels.iloc[val_idx])

    num_round = 1000000
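    # verbose_eval and early_stopping_rounds were removed from lgb.train in LightGBM 4; callbacks replace them
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(3000), lgb.log_evaluation(1000)])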
    oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(labels, oof)))
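
The loop above relies on stratified folds, so that each fold keeps roughly the same share of 1s as the full training set. The sketch below shows the idea on toy labels; `stratified_fold_ids` is a hypothetical helper written for illustration and is not how scikit-learn implements `StratifiedKFold`.

```python
import numpy as np

def stratified_fold_ids(y, n_splits, seed=0):
    """Assign each row a fold id so that every fold keeps roughly the same class ratio."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        # deal the shuffled rows of this class round-robin across the folds
        fold_ids[idx] = np.arange(len(idx)) % n_splits
    return fold_ids

# Toy labels with 10% positives, similar to the target distribution here.
y = np.array([0] * 900 + [1] * 100)
fold_ids = stratified_fold_ids(y, n_splits=10)

positive_rate_per_fold = [y[fold_ids == k].mean() for k in range(10)]
print(positive_rate_per_fold)
```

Each row belongs to exactly one validation fold, which is why `oof` ends up with one out-of-fold prediction per training row.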

In [None]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')

### Classification Model

### 1. Logistic Regression

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from random import randrange, uniform
from scipy.stats import chi2_contingency
%matplotlib inline

In [None]:
# the steps below use the 'target' column, which only exists in the training file
trans = pd.read_csv("C:/Users/shali/Desktop/INFO 7390/Assignments/Assignment2/Data/train.csv")

### Detect and delete outliers from data

In [None]:
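for i in range(2, 202):
    # interquartile range of the i-th column (columns 0 and 1 are ID_code and target)
    q75, q25 = np.percentile(trans.iloc[:, i], [75, 25])
    iqr = q75 - q25

    # Tukey fences, named lower/upper so the built-in min() and max() are not shadowed
    lower = q25 - (iqr * 1.5)
    upper = q75 + (iqr * 1.5)

    # keep only the rows that fall inside the fences for this column
    trans = trans[(trans.iloc[:, i] >= lower) & (trans.iloc[:, i] <= upper)]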

In [None]:
trans.shape

In [None]:
plt.boxplot(trans['var_0'] ,vert=True,patch_artist=True)

In [None]:
trans = trans.drop(trans.columns[0], axis = 1)

In [None]:
from  sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(trans.drop('target',axis=1), 
                                                    trans['target'], test_size=0.30, 
                                                    random_state=101)

In [None]:
print(x_train.shape)
print(x_test.shape)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.metrics import accuracy_score
from math import log

In [None]:
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import RandomizedSearchCV

C = LogisticRegression()

import math

parameter_data = [0.0001,0.001,0.01,0.1,1,5,10,20,30,40]

log_my_data = [math.log10(x) for x in parameter_data]

#print(log_my_data)
print("Printing parameter Data and Corresponding Log value")
data={'Parameter value':parameter_data,'Corresponding Log Value':log_my_data}
param=pd.DataFrame(data)
print("="*100)
print(param)
parameters = {'C':parameter_data}
clf = RandomizedSearchCV(C, parameters, cv=3, scoring='roc_auc', return_train_score=True, n_jobs=-1)
clf.fit(x_train, y_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

plt.plot(log_my_data, train_auc, label='Train AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(log_my_data,train_auc - train_auc_std,train_auc + train_auc_std,alpha=0.2,color='darkblue')

plt.plot(log_my_data, cv_auc, label='CV AUC')
# this code is copied from here: https://stackoverflow.com/a/48803361/4084039
plt.gca().fill_between(log_my_data,cv_auc - cv_auc_std,cv_auc + cv_auc_std,alpha=0.2,color='darkorange')

plt.scatter(log_my_data, train_auc, label='Train AUC points')

In [None]:
def model_predict(clf, data):
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs

    y_data_pred = []
    y_data_pred.extend(clf.predict_proba(data[:])[:,1])
  
    return y_data_pred

In [None]:
from sklearn.metrics import roc_curve, auc


# multi_class='warn' and solver='warn' are rejected by current scikit-learn;
# the other arguments were library defaults, so only the essential ones are kept
neigh = LogisticRegression(C=1.0, solver='lbfgs', max_iter=100)
neigh.fit(x_train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, x_train)    
y_test_pred = model_predict(neigh, x_test)

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
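from sklearn.metrics import confusion_matrix
# y_test_pred holds probabilities, so turn them into 0/1 labels first (0.5 is an assumed cut-off)
y_test_label = (np.array(y_test_pred) >= 0.5).astype(int)
# rows are the actual classes, columns the predicted classes
CM = pd.DataFrame(confusion_matrix(y_test, y_test_label, labels=[0, 1]))

#let us save TP, TN, FP, FN
TN = CM.iloc[0,0]
FN = CM.iloc[1,0]
TP = CM.iloc[1,1]
FP = CM.iloc[0,1]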

In [None]:
sns.heatmap(CM, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix")
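
The four counts saved above (TN, FN, TP, FP) can be turned into the usual summary metrics. The numbers in this sketch are made up for illustration and are not results from this notebook; on imbalanced data, precision and recall for class 1 are usually more informative than accuracy.

```python
# Made-up counts for illustration only; in the notebook they come from CM.
TN, FP, FN, TP = 53000, 900, 4500, 1600

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # of the rows predicted as 1, how many really are 1
recall = TP / (TP + FN)      # of the rows that really are 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, f1))
```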

In [None]:
test =pd.read_csv("test.csv")

In [None]:
id_code = test.iloc[:,0]

In [None]:
test = test.drop("ID_code" ,axis=1)
predictions_test = neigh.predict(test)
df = pd.DataFrame({"ID_code" :id_code ,"target": predictions_test})
df.head()

In [None]:
test_logistic = df.join(test)
test_logistic.to_csv('logisticmodelpred.csv')
test_logistic.head()

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_logistic,palette='RdBu_r')
test_logistic['target'].value_counts()

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
neigh = GaussianNB()
neigh.fit(x_train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, x_train)    
y_test_pred = model_predict(neigh, x_test)

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
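# y_test_pred holds probabilities, so turn them into 0/1 labels first (0.5 is an assumed cut-off)
y_test_label = (np.array(y_test_pred) >= 0.5).astype(int)
# rows are the actual classes, columns the predicted classes
CM = pd.DataFrame(confusion_matrix(y_test, y_test_label, labels=[0, 1]))

#let us save TP, TN, FP, FN
TN = CM.iloc[0,0]
FN = CM.iloc[1,0]
TP = CM.iloc[1,1]
FP = CM.iloc[0,1]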
sns.heatmap(CM, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix")

In [None]:
predictions_test = neigh.predict(test)

In [None]:
df = pd.DataFrame({"ID_code" :id_code ,"target": predictions_test})
df.head()

In [None]:
test_nb = df.join(test)

In [None]:
test_nb.head(2)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_nb,palette='RdBu_r')
test_nb['target'].value_counts()

In [None]:
test_nb.to_csv('Naive Bayes Prediction.csv')

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

C = RandomForestClassifier()

n_estimators=[10,50,100,200]
max_depth=[1, 5, 10, 50]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc', return_train_score=True,n_jobs=-1)
clf.fit(x_train, y_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = RandomForestClassifier(n_estimators=100,max_depth=10,class_weight='balanced')
neigh.fit(x_train, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, x_train)   
y_test_pred = model_predict(neigh, x_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
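# y_test_pred holds probabilities, so turn them into 0/1 labels first (0.5 is an assumed cut-off)
y_test_label = (np.array(y_test_pred) >= 0.5).astype(int)
# rows are the actual classes, columns the predicted classes
CM = pd.DataFrame(confusion_matrix(y_test, y_test_label, labels=[0, 1]))

#let us save TP, TN, FP, FN
TN = CM.iloc[0,0]
FN = CM.iloc[1,0]
TP = CM.iloc[1,1]
FP = CM.iloc[0,1]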
sns.heatmap(CM, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix")

In [None]:
predictions_rfc = neigh.predict(test)

In [None]:
df = pd.DataFrame({"ID_code" :id_code ,"target": predictions_rfc})
df.head()

In [None]:
test_rfc = df.join(test)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='target',data=test_rfc,palette='RdBu_r')
test_rfc['target'].value_counts()

In [None]:
test_rfc.to_csv('RandomForestPrediction.csv')

# <a id='2'>Predictive Analysis & Model Tuning</a>

At this point, what did I know?

1) Regular transformations wouldn't work (many topics shared failed experiments).

2) The test dataset had fake data, and that was relevant.

3) That unique values were, somehow, important.

So I decided to experiment with both of these ideas: I merged the train and test datasets without the fake values, created a new column for each variable with the number of times each value occurs, and trained my model.

 
SETTING FEATURE FRACTION TO 1

Feature fraction is a parameter of the LGB model. It goes from 0 to 1 and is the fraction of features (columns) that is randomly selected for each tree during training. My feature fraction was 0.3. When I set it to 1, the model was able to look at all the variables at once, and voilà: CV 0.899.

TIP: Setting feature fraction to 1 is a great way to understand the impact of a new feature in your model.
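
A small piece of arithmetic shows why this matters for paired features such as a variable and its count column. The sketch assumes that each tree gets a uniformly random subset of `round(feature_fraction * n_features)` columns, drawn without replacement; the exact rounding LightGBM applies may differ. `prob_pair_sampled` is a hypothetical helper, and 400 is the number of columns once every variable has its count column.

```python
from fractions import Fraction

def prob_pair_sampled(n_features, feature_fraction):
    """Chance that two specific columns both land in one tree's random feature subset."""
    k = round(feature_fraction * n_features)
    return Fraction(k, n_features) * Fraction(k - 1, n_features - 1)

for fraction in (0.05, 0.3, 1.0):
    p = prob_pair_sampled(400, fraction)
    print("feature_fraction={}: P(both columns available) = {:.4f}".format(fraction, float(p)))
```

Under these assumptions, a tree with a feature fraction of 0.3 sees a given variable together with its count column in fewer than one case out of ten, while with 1 the pair is always available.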






In [None]:
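# train was modified in the sections above (ID_code and target dropped, extra columns added),
# so the raw files are reloaded here to get the 'target' column back
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")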

In [None]:
print("Train shape: " + str(df_train.shape))
print("Test shape: " + str(df_test.shape))

In [None]:
# Splitting the target variable and the features
X_train = df_train.loc[:,'var_0':]
y_train = df_train.loc[:,'target']

In [None]:
print(X_train.shape)
print(y_train.shape)

## Sorting Fake Test Data

After reading a discussion on Kaggle, it seems that the test dataset is made of half real data (used for the LB scores) and half synthetic data (maybe to increase the difficulty of the competition). Note that this was one of the most important kernels of the competition, so it is worth looking at :)

Reference https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
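
As far as I understand the referenced kernel, the idea is that a real row contains at least one value that occurs only once in its column, while a synthetic row, being assembled from values of real rows, contains none. The sketch below applies that rule to a tiny made-up array. It is an illustration of the idea under that assumption, not the kernel's code, and `rows_with_unique_value` is a hypothetical helper.

```python
import numpy as np

def rows_with_unique_value(values):
    """Boolean mask: True where a row holds at least one value that occurs once in its column."""
    n_rows, n_cols = values.shape
    has_unique = np.zeros(n_rows, dtype=bool)
    for c in range(n_cols):
        _, inverse, counts = np.unique(values[:, c], return_inverse=True, return_counts=True)
        # counts[inverse] is, for every row, how often its value occurs in column c
        has_unique |= counts[inverse.reshape(-1)] == 1
    return has_unique

# Toy data: three "real" rows, then two rows assembled from values already present.
real = np.array([[1, 10, 100],
                 [2, 20, 200],
                 [3, 30, 300]])
synthetic = np.array([[1, 20, 300],
                      [2, 30, 300]])
toy_test = np.vstack([real, synthetic])

real_mask = rows_with_unique_value(toy_test)
synthetic_idx = np.flatnonzero(~real_mask)
print(real_mask)
print(synthetic_idx)
```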

In [None]:
import pandas as pd
synthetic_samples_indexes = pd.read_csv('C:/Users/shali/Desktop/INFO 7390/Assignments/Assignment2/Data/synthetic_samples_indexes.csv')

In [None]:
df_test_real = test.copy()
df_test_real = df_test_real[~df_test_real.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))]
X_test = df_test_real.loc[:,'var_0':]
X_test.shape

# <a id='2'>Now let's plot density graphs</a>

In [None]:
fig = plt.figure(figsize=(5,5))
# legend labels now match the filters: target > 0 is class 1, target == 0 is class 0
sns.distplot(df_train[df_train['target']>0]['var_0'], hist=False,label='1', color='red')
sns.distplot(df_train[df_train['target']==0]['var_0'], hist=False,label='0', color='green')
plt.xlabel('var_0', fontsize=12)
locs, labels = plt.xticks()
plt.tick_params(axis='x', which='major', labelsize=10)
plt.tick_params(axis='y', which='major', labelsize=10)

#balanced output

## Frequency Encoding


As discussed in the EDA notebook, frequency encoding may help our tree-based model to also learn how often each value occurs for each variable.
https://www.kaggle.com/cdeotte/200-magical-models-santander-0-920

I tried computing the counts both on the training set only and on train + test concatenated.

The second approach gives a significant performance gain!
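
To make the difference between the two options concrete, here is a toy example with one made-up column. It uses `value_counts` with `map`, which gives the same counts as the `groupby(...).transform('count')` used in this notebook.

```python
import pandas as pd

train_toy = pd.DataFrame({"var_0": [1.5, 2.5, 2.5]})
test_toy = pd.DataFrame({"var_0": [2.5, 3.5, 1.5]})

# Option 1: counts learned from the training rows only.
train_only = train_toy["var_0"].map(train_toy["var_0"].value_counts())

# Option 2: counts learned from train and test together.
both = pd.concat([train_toy, test_toy], ignore_index=True)
combined = both["var_0"].map(both["var_0"].value_counts())
train_combined = combined.iloc[:len(train_toy)]

print(train_only.tolist())       # [1, 2, 2]
print(train_combined.tolist())   # [2, 3, 3]
```

The same training rows get different counts once the test values are included, which is why the choice can change model performance.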


**MAKING MY JOB EASIER using Frequency Encoding**

At this point, I had a problem to solve: each new value-frequency feature only mattered for one specific original feature. My model, however, was checking all possible interactions between my 400 features and taking a long time to run.


So I decided to train my model with 2 features at a time: the original one and an extra column with its value counts.

In [None]:
def get_count(df):
    '''
    Function that adds one column for each variable (excluding 'ID_code', 'target')
    populated with the value frequencies
    '''
    for var in [i for i in df.columns if i not in ['ID_code','target']]:
        df[var+'_count'] = df.groupby(var)[var].transform('count')
    return df

In [None]:
X_tot = pd.concat([X_train, X_test])
print(X_tot.shape)

In [None]:
import time
start = time.time()
X_tot = get_count(X_tot)
end = time.time()
print('It took %.2f seconds\nShape: ' %(end - start))
print(X_tot.shape)

In [None]:
X_train_count = X_tot.iloc[0:200000]
X_test_count = X_tot.iloc[200000:]

In [None]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(10,10,figsize=(18,22))

    for feature in features:
        i += 1
        plt.subplot(10,10,i)
        sns.distplot(df1[feature], hist=False,label=label1, color='red')
        sns.distplot(df2[feature], hist=False,label=label2, color='green')
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show()

In [None]:
t0 = df_train.loc[df_train['target'] == 0]
t1 = df_train.loc[df_train['target'] == 1]
features = df_train.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features)

In [None]:
features = df_train.columns.values[102:202]
plot_feature_distribution(t0, t1, '0', '1', features)

# Let's build Model
But first, let's split the train set into training and validation sets.

In [None]:
# Libraries
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split
import lightgbm as lgb

In [None]:
# 0.8 train, 0.2 dev
X_train,X_valid,y_train,y_valid = train_test_split(X_train_count, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [None]:
print('X_train shape: {}\n'.format(X_train.shape))
print('y_train shape: {}\n'.format(y_train.shape))
print('X_valid shape: {}\n'.format(X_valid.shape))
print('y_valid shape: {}'.format(y_valid.shape))

## Data Augmentation


Augmentation is a method to increase the amount of training data by randomly shuffling/transforming the features in a certain way. It improves accuracy by letting the model see more cases of both "1" and "0" samples in training, so the model can generalize better to new data.

In summary, this technique tends to work when:

a. The number of features is large

b. No feature is strongly correlated with the target (which is the case in my example)

c. Correlations among features are also low

I augmented both the 1 and 0 training samples and tuned the amount of augmentation. Currently, my best setting is to add 2x "1"s and 1x "0"s.
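
The shuffle itself is easier to see on a toy array. In this sketch (made-up numbers, two raw features instead of 200), each raw column is permuted across the rows of one class, and its count column is moved with the same permutation so that every value keeps its own count. Column contents stay the same, only the combinations across rows change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rows of a single class; columns are [var_0, var_1, var_0_count, var_1_count].
x = np.array([[1.0, 10.0, 5.0, 7.0],
              [2.0, 20.0, 6.0, 8.0],
              [3.0, 30.0, 9.0, 4.0]])
n_raw = 2  # number of raw features; their count columns follow in the same order

augmented = x.copy()
for c in range(n_raw):
    perm = rng.permutation(len(x))
    # move a raw value and its count together, so the pair stays consistent
    augmented[:, c] = x[perm, c]
    augmented[:, c + n_raw] = x[perm, c + n_raw]

print(augmented)
```

Because the features are nearly uncorrelated, these recombined rows are plausible new samples of the same class.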

In [None]:
# Data Augmentation 2x if y = 1 , 1x if y = 0
#Reference
#https://www.kaggle.com/jiweiliu/lgb-2-leaves-augment

def augment(x,y,t=2):
    '''
    Data Augmentation 2x if y = 1 , 1x if y = 0
    '''
    xs,xn = [],[]
    for i in range(t):
        mask = y>0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(int(x1.shape[1]/2)):
            np.random.shuffle(ids)
            x1[:,c] = x1[ids][:,c]
            x1[:,c+200] = x1[ids][:,c+200] # The new features must go with their original one!
        xs.append(x1)

    for i in range(t//2):
        mask = y==0
        x1 = x[mask].copy()
        ids = np.arange(x1.shape[0])
        for c in range(int(x1.shape[1]/2)):
            np.random.shuffle(ids)
            x1[:,c] = x1[ids][:,c]
            x1[:,c+200] = x1[ids][:,c+200] # The new features must go with their original one!
        xn.append(x1)

    xs = np.vstack(xs)
    xn = np.vstack(xn)
    ys = np.ones(xs.shape[0])
    yn = np.zeros(xn.shape[0])
    x = np.vstack([x,xs,xn])
    y = np.concatenate([y,ys,yn])
    return x,y

In [None]:
start = time.time()
# Trying Augmentation Only for training set!
X_tr, y_tr = augment(X_train.values, y_train.values)
print('X_tr Augm shape: {}'.format(X_tr.shape))
print('y_tr Augm shape: {}'.format(y_tr.shape))

end = time.time()
print('It took %.2f seconds' %(end - start))

In [None]:
X_tr = pd.DataFrame(data=X_tr,columns=X_train.columns)
y_tr = pd.DataFrame(data=y_tr)

In [None]:
y_tr.columns = ['target']

In [None]:
# List of all the features
features = [c for c in X_train.columns if c not in ['ID_code', 'target']]

In [None]:
# The parameters for Light Gradient Boost
lgb_params = {
        'bagging_fraction': 0.77,
        'bagging_freq': 2,
        'lambda_l1': 0.7,
        'lambda_l2': 2,
        'learning_rate': 0.01,
        'max_depth': 3,
        'min_data_in_leaf': 22,
        'min_gain_to_split': 0.07,
        'min_sum_hessian_in_leaf': 19,
        'num_leaves': 20,
        'feature_fraction': 1,
        'save_binary': True,
        'seed': 42,
        'feature_fraction_seed': 42,
        'bagging_seed': 42,
        'drop_seed': 42,
        'data_random_seed': 42,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': 'false',
        'num_threads': 6
}

Less than 6 minutes on my PC! That's fast :)

In [None]:
val_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
score = roc_auc_score(y_valid, val_pred)
print('>>> Your CV score is: ', score)

###  With Augmentation

In [None]:
start = time.time()

iteration = 120
y_hat = np.zeros([int(200000*0.2), 200])
test_hat = np.zeros([100000, 200])
i = 0
for feature in ['var_' + str(x) for x in range(200)]: # loop over all the raw features
    feat_choices = [feature, feature + '_count']
    trn_data = lgb.Dataset(X_tr[feat_choices], y_tr) # Augmentation
    val_data = lgb.Dataset(X_valid[feat_choices], y_valid)
    clf = lgb.train(lgb_params, trn_data, iteration, valid_sets=[val_data], verbose_eval=-1)
    y_hat[:, i] = clf.predict(X_valid[feat_choices])
    test_hat[:, i] = clf.predict(X_test_count[feat_choices])
    i += 1
    
end = time.time()
print('It took %.2f seconds' %(end - start))

In [None]:
val_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
score = roc_auc_score(y_valid, val_pred)
print('>>> Your CV score is: ', score)

I tried both with and without augmentation. These are the results:

- No augmentation: CV 0.8974
- Augmentation: CV 0.8963

In this setup, augmentation gave a slightly lower score.

# MY FINAL RUN WITH 4 FOLD

# KFold CV

Let's split the training set with KFold cross-validation. This should improve our performance a bit and give us more reliable results!

I chose 4 folds, but this could be changed!
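The idea behind the out-of-fold (OOF) predictions used in this part can be sketched on toy data. This is only an illustration under my own assumptions: `oof_predictions` is a hypothetical helper, the folds are built by hand with NumPy (contiguous and unshuffled, like `KFold` without shuffling), and a simple class-mean scorer stands in for LightGBM.

```python
import numpy as np

def oof_predictions(x, y, n_splits=4):
    """Out-of-fold scores from a toy one-feature model (a stand-in for LightGBM)."""
    oof = np.zeros(len(x))
    # Contiguous, unshuffled folds
    folds = np.array_split(np.arange(len(x)), n_splits)
    for val_idx in folds:
        trn_mask = np.ones(len(x), dtype=bool)
        trn_mask[val_idx] = False
        # "Train": class means estimated on the training folds only
        mean_pos = x[trn_mask][y[trn_mask] == 1].mean()
        mean_neg = x[trn_mask][y[trn_mask] == 0].mean()
        # "Predict": the score grows as a value moves from the negative mean to the positive mean
        oof[val_idx] = np.abs(x[val_idx] - mean_neg) - np.abs(x[val_idx] - mean_pos)
    return oof

# Synthetic data: positives are shifted by +1 compared with negatives
rng = np.random.default_rng(42)
y_demo = (rng.random(2000) < 0.3).astype(int)
x_demo = rng.normal(loc=y_demo * 1.0, scale=1.0)

oof = oof_predictions(x_demo, y_demo)
print('mean OOF score, positives: %.3f' % oof[y_demo == 1].mean())
print('mean OOF score, negatives: %.3f' % oof[y_demo == 0].mean())
```

Every row is scored exactly once, by a model that never saw it during training, which is why a score computed on the full OOF vector is a fairer estimate than one computed on rows the model was fitted on.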

In [None]:
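# random_state only has an effect with shuffle=True, and recent scikit-learn
# versions raise an error if it is set while shuffle is False
folds = KFold(n_splits=4, shuffle=False)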
target = df_train['target']
y_hat = np.zeros([200000, 200])
test_hat = np.zeros([100000, 200])
i = 0
start = time.time()
for feature in ['var_' + str(x) for x in range(200)]: # loop over all features 
    feat_choices = [feature, feature + '_count']
    print('Model using: ' + str(feat_choices))
    oof = np.zeros(len(X_train_count))
    predictions = np.zeros(len(X_test_count))
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_count[feat_choices].values, target.values)):
        trn_data = lgb.Dataset(X_train_count.iloc[trn_idx][feat_choices], label=target.iloc[trn_idx])
        val_data = lgb.Dataset(X_train_count.iloc[val_idx][feat_choices], label=target.iloc[val_idx])
        clf = lgb.train(lgb_params, trn_data, 130, valid_sets = [val_data], verbose_eval=-1)
        oof[val_idx] = clf.predict(X_train_count.iloc[val_idx][feat_choices])
        predictions += clf.predict(X_test_count[feat_choices]) / folds.n_splits
    print(">>> CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
    
    y_hat[:, i] = oof
    test_hat[:, i] = predictions
    i += 1

    
end = time.time()
print('It took %.2f seconds' %(end - start))

Well, almost 36 minutes on my PC. That's quite good!

Please note that when I tried KFold CV with all the raw features it took several hours!

In [None]:
valid_pred = (y_hat).sum(axis=1)/200
predictions = (test_hat).sum(axis=1)/200
print('>>> Your CV score is:', roc_auc_score(target, valid_pred))

This is the best result I achieved, about 0.89; even so, I am grateful for this experience.


4 KFold Results:

CV: 0.8959

# Kaggle Score Evaluation

Public Score: 0.92223

Private Score: 0.92057

Kaggle rank on the public leaderboard: 62 out of 8820, i.e. roughly the top 0.7%

# Conclusion

I applied multiple models (Logistic Regression, Decision Tree, Random Forest, Naive Bayes and LightGBM), combined with sorting out the fake test data, frequency encoding and KFold CV. Of these models, LightGBM gave the highest score.

# Submission

Preparing the submission file!

We only need 'ID_codes' and 'pred' columns.

Note that I predicted values only for the real test data and set the fake ones to 0.
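The fill logic can be illustrated on a tiny example. This is a NumPy sketch with made-up IDs, indexes and predictions (all hypothetical); the notebook itself does the same thing with pandas on `df_test`.

```python
import numpy as np

# Hypothetical toy data: 6 test rows, of which rows 1 and 4 are fake (synthetic)
all_ids = np.array(['test_%d' % i for i in range(6)])
synthetic_idx = np.array([1, 4])
real_mask = ~np.isin(np.arange(len(all_ids)), synthetic_idx)

# Predictions exist only for the real rows, in their original order
real_preds = np.array([0.10, 0.80, 0.30, 0.55])

target = np.zeros(len(all_ids))
target[real_mask] = real_preds  # fake rows keep the placeholder 0

for id_code, value in zip(all_ids, target):
    print(id_code, value)
```

The important detail is that the predictions must be in the same order as the real rows, otherwise the masked assignment silently pairs them with the wrong IDs.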

In [None]:
subm = pd.DataFrame({"ID_codes":df_test[~df_test.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))].loc[:,'ID_code']})
subm['pred'] = predictions
subm.head()

In [None]:
ID_codes = df_test[~df_test.index.isin(list(synthetic_samples_indexes['synthetic_samples_indexes']))].loc[:,'ID_code']
submission = pd.DataFrame({"ID_code": df_test.ID_code.values})
submission['target'] = 0
submission.loc[submission['ID_code'].isin(ID_codes), 'target'] = subm['pred'].values

In [None]:
submission.head()

In [None]:
submission.to_csv(r'submission.csv', index = None, header=True)