This is a data analysis portfolio written by Daniel Yoo, using data from the Kaggle competition 'Santander Customer Transaction Prediction'. The goal of the competition is to identify which customers will make a specific transaction in the future.

First, I preprocess the data. I check whether the data needs scaling, and then whether any variable is highly correlated with the other variables. If so, I need to remove it, since highly correlated predictors introduce multicollinearity. The data is also unbalanced: the ratio between 'target 0' and 'target 1' is about 8:1. Augmenting the minority class could help us get a better result, so I add 8 shuffled copies of the 'target 1' rows to bring the ratio close to 1:1.

Next, modeling. I use two models: LightGBM and a neural network. LightGBM is a gradient-boosted ensemble of decision trees, and on tabular data like this it usually performs better than a single decision tree or a random forest. I could have chosen another gradient boosting library such as XGBoost, but LightGBM is typically faster because it grows trees leaf-wise (vertically) rather than level-wise. I also tried XGBoost and CatBoost, but I got the best result from LightGBM. I use stratified K-fold cross-validation with 5 folds. It is possible to pick the best parameters using grid search, but I don't have enough computation power for that, so I skip it.
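To give a rough sense of why grid search is expensive here, the sketch below only counts how many model fits a small grid would need with 5-fold cross-validation. The parameter names come from the LightGBM settings used later in this notebook, but the candidate values are hypothetical and were not tuned or tested.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
param_grid = {
    "num_leaves": [7, 13, 31],
    "learning_rate": [0.01, 0.05],
    "feature_fraction": [0.05, 0.1, 0.3],
    "min_data_in_leaf": [40, 80],
}
n_folds = 5

# Every combination of values is one candidate configuration.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate has to be trained once per fold.
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # 36 180
```

Even this small grid needs 180 full training runs, and each run in this notebook can go on for many thousands of boosting rounds.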

I also use a neural network. I don't expect high accuracy from it, but I think the data is big enough not only for classical machine learning models but also for deep learning. I start with a simple NN and, if it works well, I can extend it with deeper layers, K-fold cross-validation, or a CNN. Since we have 200 variables, I use 100 units for each hidden layer, as in the code below. With several layers the network can overfit the training set, so I use 10% dropout. I also use batch normalization. It is similar to scaling X, but it rescales the inputs of every layer rather than only the raw input. It might increase the accuracy and usually speeds up training; however, since we use both batch norm and dropout, there might be high bias because of too much regularization.
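As a rough illustration of what batch normalization does to a layer's activations, here is a minimal NumPy sketch of the training-time forward pass. The function name `batch_norm_forward` and the toy batch are assumptions made for this example; the Keras layer additionally learns `gamma` and `beta` and keeps moving averages for use at prediction time.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-3):
    """Normalize each unit over the batch, then rescale and shift."""
    mean = h.mean(axis=0)                     # per-unit mean over the batch
    var = h.var(axis=0)                       # per-unit variance over the batch
    h_hat = (h - mean) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
h = rng.normal(loc=5.0, scale=3.0, size=(64, 100))  # toy batch: 64 rows, 100 units
out = batch_norm_forward(h, gamma=np.ones(100), beta=np.zeros(100))

# After normalization every unit has mean close to 0 and std close to 1.
print(out.mean(axis=0)[:3].round(6))
print(out.std(axis=0)[:3].round(3))
```

This is the same operation as standardizing X, only applied to the activations of a hidden layer and recomputed for every batch.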

<font size="4">Data Preprocessing</font>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X = train.iloc[:,2:]
y = train['target']

In [None]:
#Checking correlation between variables
plt.figure(figsize=(15,15))
sns.heatmap(X.corr())
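With 200 variables the heatmap is hard to read by eye, so a list of the strongly correlated pairs can be a useful complement. The helper below is a sketch: `high_corr_pairs` is a hypothetical name, the 0.8 threshold is an arbitrary assumption, and it runs on toy data rather than on the competition file.

```python
import numpy as np

def high_corr_pairs(values, names, threshold=0.8):
    """Return (name_a, name_b, r) for column pairs with |r| above threshold."""
    corr = np.corrcoef(values, rowvar=False)       # columns are variables
    rows, cols = np.triu_indices_from(corr, k=1)   # upper triangle, no diagonal
    pairs = []
    for i, j in zip(rows, cols):
        if abs(corr[i, j]) > threshold:
            pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data: 'a_noisy' is almost a copy of 'a', while 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
toy = np.column_stack([a, b, a + 0.01 * rng.normal(size=1000)])
pairs = high_corr_pairs(toy, ["a", "b", "a_noisy"])
print(pairs)
```

On the notebook's data the call would be `high_corr_pairs(X.values, list(X.columns))`; an empty list means no pair exceeds the threshold.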

In [7]:
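#Data Augmentation
def aug(x,y,t=8):
    #Append t copies of the target-1 rows with each column shuffled independently
    #(the default t=8 is an assumption that matches the call below)
    xs = []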
    for i in range(t):
        x1 = x[y==1].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:,c] = x1[ids][:,c]
        xs.append(x1)

    xs = np.vstack(xs)
    ys = np.ones(xs.shape[0])
    x = np.vstack([x,xs])
    y = np.concatenate([y,ys])
    return x,y
XX, YY = aug(X.values, y.values, 8)
sns.countplot(x=YY)  #pass the labels as x; newer seaborn versions require the keyword
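The core of `aug` is the per-column shuffle, so here is a small sketch of that step on its own. It assumes the features are roughly independent within the positive class, which is what makes mixing values across rows reasonable. `shuffle_columns` and the toy array are hypothetical names for this example.

```python
import numpy as np

def shuffle_columns(x_pos, rng):
    """Return a copy of x_pos with every column permuted independently."""
    out = x_pos.copy()
    for c in range(out.shape[1]):
        out[:, c] = rng.permutation(out[:, c])
    return out

rng = np.random.default_rng(0)
x_pos = np.arange(12, dtype=float).reshape(4, 3)  # 4 toy positive rows, 3 features
x_new = shuffle_columns(x_pos, rng)

# Each column keeps exactly the same set of values ...
same_marginals = all(
    np.array_equal(np.sort(x_new[:, c]), np.sort(x_pos[:, c])) for c in range(3)
)
print(same_marginals)  # True
# ... but the rows are typically new combinations of those values.
print(x_new)
```

The per-variable distributions of the positive class are unchanged, while any dependence between variables in the synthetic rows is destroyed.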

In [26]:
#Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_S = sc.fit_transform(XX)

<font size="4">Modeling</font>

In [7]:
#lightGBM
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

In [8]:
lgb_params = { "objective" : "binary", "metric" : "auc", "boosting": 'gbdt', "max_depth" : -1, "num_leaves" : 13, "learning_rate" : 0.01, 
              "bagging_freq": 5, "bagging_fraction" : 0.4, "feature_fraction" : 0.05, "min_data_in_leaf": 80, "min_sum_hessian_in_leaf": 10, 
              "tree_learner": "serial", "boost_from_average": "false", "bagging_seed" : 101, "verbosity" : 1, "seed": 101}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

In [None]:
from sklearn.metrics import roc_auc_score
temp = train[['ID_code', 'target']].copy()  #copy so that adding columns does not touch train
temp['predict'] = 0.0
predictions = test[['ID_code']].copy()
val_aucs = []
feature_importance_df = pd.DataFrame()
features = [col for col in train.columns if col not in ['target', 'ID_code']]
for fold, (trn_idx, val_idx) in enumerate(skf.split(train, train['target'])):
    X_train, y_train = train.iloc[trn_idx][features], train.iloc[trn_idx]['target']
    X_cv, y_cv = train.iloc[val_idx][features], train.iloc[val_idx]['target']
    N = 5    
    p_valid,yp = 0,0
    for i in range(N):
        X_t, y_t = aug(X_train.values, y_train.values, 8)  #t was missing; 8 is assumed, matching the earlier call
        X_t = pd.DataFrame(X_t)
        X_t = X_t.add_prefix('var_')
        trn_data = lgb.Dataset(X_t, label=y_t)
        val_data = lgb.Dataset(X_cv, label=y_cv)
        evals_result = {}
        lgb_clf = lgb.train(lgb_params,
                        trn_data,
                        100000,
                        valid_sets = [trn_data, val_data],
                        early_stopping_rounds=3000,
                        verbose_eval = 1000,
                        evals_result=evals_result
                       )
        p_valid += lgb_clf.predict(X_cv)
        yp += lgb_clf.predict(test[features])  #X_test was never defined; predict on the test features
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgb_clf.feature_importance()
    fold_importance_df["fold"] = fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    temp.loc[val_idx, 'predict'] = p_valid/N  #.loc avoids chained assignment (train has the default integer index)
    val_score = roc_auc_score(y_cv, p_valid)
    val_aucs.append(val_score)
    predictions['fold{}'.format(fold+1)] = yp/N
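In the loop above the predictions of the N models are summed in `p_valid` and divided by N only when they are stored. For the AUC this makes no difference, because AUC depends only on the ordering of the scores. The sketch below checks this with a rank-based AUC on toy numbers; `auc_from_ranks` is a hypothetical helper that assumes there are no tied scores.

```python
import numpy as np

def auc_from_ranks(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

y = np.array([0, 0, 1, 0, 1, 1])
p_sum = np.array([0.5, 1.5, 2.0, 2.5, 3.5, 4.5])  # toy sum of 5 model outputs

auc_sum = auc_from_ranks(y, p_sum)
auc_mean = auc_from_ranks(y, p_sum / 5)
print(auc_sum, auc_mean)  # the two values are identical
```

Dividing by N matters only when the values are read as probabilities, for example when thresholding at 0.5.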

In [69]:
#The value is hard-coded from an earlier run; the metric is ROC AUC (see roc_auc_score above)
print('The average AUC over the 5 CV folds is %f%%' %90.1)

The average AUC over the 5 CV folds is 90.100000%


In [None]:
#Plot to check variable importance
descend = feature_importance_df[['feature','importance']].groupby('feature').mean().sort_values(by='importance', ascending=False)
plt.figure(figsize=(12,36))
sns.barplot(x = descend['importance'], y = descend.index)
plt.title('Variable Importance')

In [3]:
#Artificial Neural Networks
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import BatchNormalization

Using TensorFlow backend.


In [36]:
from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(X_S,YY, test_size = 0.1, random_state = 101)
classifier = Sequential()
classifier.add(Dense(units = 100, kernel_initializer = 'glorot_uniform', activation = 'relu', input_dim = 200))
classifier.add(Dropout(0.1))
classifier.add(BatchNormalization())
classifier.add(Dense(units = 100, kernel_initializer = 'glorot_uniform', activation = 'relu'))
classifier.add(BatchNormalization())
classifier.add(Dense(units = 100, kernel_initializer = 'glorot_uniform', activation = 'relu'))
classifier.add(BatchNormalization())
classifier.add(Dense(units = 1, kernel_initializer = 'glorot_uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 64, epochs = 100)

In [40]:
y_pred = classifier.predict(X_cv)
y_pred = (y_pred > 0.5)
from sklearn.metrics import confusion_matrix
cm_nn = confusion_matrix(y_cv, y_pred)

In [55]:
print('The accuracy of NN model on cross validation set is %f%%' %((1-(4455+3833)/len(y_cv))*100))

The accuracy of NN model on cross validation set is 77.028188%
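The print statement above hard-codes the two misclassification counts. A sketch of how the same number could be read directly from a confusion matrix is shown below, using a toy matrix rather than the real `cm_nn`; `accuracy_from_confusion` is a hypothetical helper name.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Toy matrix in scikit-learn's layout: rows are true labels, columns are predictions.
cm_toy = np.array([[50, 10],
                   [5, 35]])
acc = accuracy_from_confusion(cm_toy)
print(acc)  # 0.85
```

In the notebook the call would be `accuracy_from_confusion(cm_nn)`, which stays correct if the model is retrained and the counts change.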


<font size="4">Conclusion</font>

LightGBM works very well on this dataset. It could have scored higher if I had tuned several hyperparameters. 90% is a good result, but I cannot say it is a very high accuracy. Note that the 90.1% above is computed with roc_auc_score, so it is an AUC rather than a classification accuracy, and it is not directly comparable with the baseline that follows. The original dataset has 8 times more target 0 data than target 1. If a model says everyone has target 0, it will have about 88.9% accuracy (8/9) on the original data. Of course, it will have only about 50% accuracy on the augmented dataset.
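The baseline figures in this paragraph can be checked with a few lines of arithmetic. The counts below are hypothetical and chosen only to match the 8:1 ratio; `majority_baseline_accuracy` is a name made up for this sketch.

```python
def majority_baseline_accuracy(n_neg, n_pos):
    """Accuracy of a model that predicts target 0 for every row."""
    return n_neg / (n_neg + n_pos)

# Hypothetical counts matching the 8:1 ratio of the original data.
original = majority_baseline_accuracy(8000, 1000)
print(round(original, 3))   # 0.889

# After adding 8 shuffled copies of the positives there are 9000 positives in total.
augmented = majority_baseline_accuracy(8000, 9000)
print(round(augmented, 3))  # 0.471
```

Any accuracy quoted for this data therefore has to be read against a baseline of almost 89% on the original set and a little under 50% on the augmented one.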

Deep learning does not work well here. Logistic regression, which takes less than a minute to run, has 77% accuracy, which is exactly the same accuracy as the 4-layer NN. I found no reason to use the NN model. The dataset might be too small for deep learning, or a NN might not suit the data. We might get a better result with a CNN, but I won't try it here.