# Santander Customer Transaction Prediction
This project attempts to predict which customers will make a specific transaction in the future, based on 200 unnamed numeric features.

## Imports and setup
First of all, we import all the libraries that will be used in the notebook.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

We then load our training and test data. Two versions are saved in the data folder: the full dataset and a reduced set that lets us iterate faster while developing the model.

In [2]:
# Set to True to use the full dataset instead of the reduced one.
use_full_data = False
train_path = 'data/train.csv' if use_full_data else 'data/train_small.csv'
test_path = 'data/test.csv' if use_full_data else 'data/test_small.csv'

train_set = pd.read_csv(train_path)
test_set = pd.read_csv(test_path)

## Data exploration
We begin our analysis by exploring the data, looking at which features are available and at the quality of the data.

In [3]:
train_set.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
1,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
2,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104
3,train_5,0,11.4763,-2.3182,12.608,8.6264,10.9621,3.5609,4.5322,15.2255,...,-6.3068,6.6025,5.2912,0.4403,14.9452,1.0314,-3.6241,9.767,12.5809,-4.7602
4,train_6,0,11.8091,-0.0832,9.3494,4.2916,11.1355,-8.0198,6.1961,12.0771,...,8.783,6.4521,3.5325,0.1777,18.3314,0.5845,9.1104,9.1143,10.8869,-3.2097


In [4]:
test_set.head()

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,-4.92,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_3,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,-5.8609,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
2,train_7,13.558,-7.9881,13.8776,7.5985,8.6543,0.831,5.689,22.3262,5.0647,...,13.17,6.5491,3.9906,5.8061,23.1407,-0.3776,4.2178,9.4237,8.6624,3.4806
3,train_12,8.7671,-4.6154,9.7242,7.4242,9.0254,1.4247,6.2815,12.3143,5.6964,...,0.3782,7.4382,0.0854,1.3444,17.2439,-0.0798,5.7389,8.4897,17.0938,4.6106
4,train_14,13.808,5.0514,17.2611,8.512,12.8517,-9.1622,5.7327,21.0517,-4.5117,...,1.074,8.322,3.2619,1.6738,17.4797,-0.0257,-3.5323,9.3439,24.4479,-5.111


We see that the train set contains a column for sample ID and a column for the target, as well as 200 features. The test set also contains ID and features, but does not contain target values, as it is supposed to be used for evaluation of competition scores.

In [5]:
print(f"Fraction of positive samples: {train_set['target'].mean():.2%}")

Fraction of positive samples: 9.97%


The target value is binary: either the customer makes the transaction or they do not. Only about 10% of the customers make the transaction, so the two classes are clearly imbalanced.
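
With roughly one positive sample in ten, plain accuracy can look good even for a useless model. The sketch below illustrates this on synthetic labels (not the competition data); the variable names and the choice of ROC AUC as an alternative metric are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with roughly 10% positives (an assumption mirroring the ratio above).
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.10).astype(int)

# A trivial "model" that always predicts the majority (negative) class.
y_pred_majority = np.zeros_like(y_true)
# The same model expressed as a constant score for every sample.
constant_scores = np.zeros(len(y_true), dtype=float)

baseline_accuracy = accuracy_score(y_true, y_pred_majority)
baseline_auc = roc_auc_score(y_true, constant_scores)

print(f'Accuracy of always predicting 0: {baseline_accuracy:.2%}')
print(f'ROC AUC of a constant score: {baseline_auc:.2f}')
```

The accuracy of this baseline lands near the share of negative samples, while a ranking-based metric such as ROC AUC shows that the constant predictor carries no information.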

In [6]:
print(f'Missing values in train set: {train_set.isna().sum().sum()}')
print(f'Missing values in test set: {test_set.isna().sum().sum()}')

Missing values in train set: 0
Missing values in test set: 0


As we can see, neither of the data sets contains any missing values, so no imputation is needed.

## Data Pre-Processing
As mentioned above, there are no missing values that need to be handled during preprocessing. However, processing the data to engineer good features is still an important part of the training process, even though deep learning methods can learn relevant features as part of end-to-end training.

We will attempt several pre-processing techniques in order to produce accurate predictions.

We begin by simply extracting the features and targets from the full data sets:

In [7]:
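# Column 0 is ID_code and column 1 is target; the remaining columns are the features.
train_x = train_set.iloc[:, 2:].values
train_y = train_set.iloc[:, 1].values

# The test set has no target column, so the features start right after ID_code.
test_x = test_set.iloc[:, 1:].values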

Many machine learning methods perform better on standardized data, i.e. data where each feature has been scaled to have zero mean and unit variance.

In [9]:
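# Fit the scaler on the training features only, then apply the same
# mean and standard deviation to the test features.
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)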
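
As a quick sanity check of what standardization does, the following sketch uses small synthetic arrays rather than the competition data. The names `toy_train` and `toy_new` are hypothetical; the point is that the per-feature mean and standard deviation are learned from one array and then reused unchanged on another.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a different offset and spread per column.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[10.0, -3.0, 0.5], scale=[2.0, 0.1, 5.0], size=(1000, 3))
toy_new = rng.normal(loc=[10.0, -3.0, 0.5], scale=[2.0, 0.1, 5.0], size=(200, 3))

# Learn the per-feature mean and standard deviation from the training array only.
scaler = StandardScaler().fit(toy_train)
toy_train_scaled = scaler.transform(toy_train)

# Reuse the same statistics for data the scaler has not seen.
toy_new_scaled = scaler.transform(toy_new)

print('Train means after scaling:', toy_train_scaled.mean(axis=0).round(3))
print('Train std devs after scaling:', toy_train_scaled.std(axis=0).round(3))
print('New-data means after scaling:', toy_new_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 up to floating-point error, while the new data is only approximately centered because it is scaled with the training statistics.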

## Model Selection and Training
The next step is selecting a model type and fitting it to the training data.

> ### Note on validation
> We already have two data sets, labeled train and test. However, the test data set is meant to be used for the competition submission, and does not contain labels. To estimate how well a model generalizes, we must therefore hold out part of the train data for validation. This split is made before each model is trained.

We begin by fitting a Random Forest classifier using the scaled data. 

In [None]:
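# Hold out part of the labeled training data (25% by default) for validation.
# Note: x_test and y_test come from the train set, not from test_set.
x_train, x_test, y_train, y_test = train_test_split(train_x_scaled, train_y)

model = RandomForestClassifier(n_estimators=100, max_depth=5, verbose=True)
model.fit(x_train, y_train)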
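
The notebook stops at fitting the model. One way the held-out split could then be used is sketched below on synthetic data from `make_classification`, not on the competition data. The names `toy_x`, `toy_y`, and `toy_model`, the smaller forest, and the choice of ROC AUC as the metric are all assumptions made for illustration. The sketch also shows k-fold cross-validation as an alternative to a single hold-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the real data, with about 10% positives (assumption).
toy_x, toy_y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Hold-out validation: stratify so both parts keep the same class ratio.
x_fit, x_val, y_fit, y_val = train_test_split(
    toy_x, toy_y, stratify=toy_y, random_state=0
)
toy_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
toy_model.fit(x_fit, y_fit)

# Score the predicted probability of the positive class rather than hard labels.
val_auc = roc_auc_score(y_val, toy_model.predict_proba(x_val)[:, 1])
print(f'Hold-out ROC AUC: {val_auc:.3f}')

# k-fold cross-validation: every sample is used for validation exactly once.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(toy_model, toy_x, toy_y, cv=folds, scoring='roc_auc')
print(f'5-fold ROC AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}')
```

A single hold-out split is cheaper, while the k-fold estimate is less sensitive to which samples happen to land in the validation part.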