# What am I doing in this notebook?

So far, I have learned about and tried different models for this problem in https://www.kaggle.com/sfktrkl/tps-nov-2021. That notebook covers some regression and classification models, with and without cross-validation. I also tried a very simple feature selection and applied it to some of those models.

Although a couple of those models give good results, I have seen that many people get better results with deep learning, so I have started learning about it as well.

So, in this notebook, I will try out what I have learned so far. I do not expect very good results, at least not before I understand the basics behind neural networks, but I will still try my best 💪.

# Importing Libraries and Loading datasets

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Plot
import seaborn as sns
import matplotlib.pyplot as plt

# Scaler
from sklearn.preprocessing import StandardScaler

# Neural Network
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Cross-Validation
from sklearn.model_selection import StratifiedKFold

In [None]:
train_data = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')

# Pre-processing

In [None]:
# Get train data without the target and ids
X = train_data.iloc[:, 1:-1].copy()
# Get the target
y = train_data.target.copy()

# Create test X, drop ids.
test_X = test_data.iloc[:, 1:].copy()

In [None]:
# Apply a scaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
test_X = scaler.transform(test_X)
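
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` does to a single column: it subtracts the training mean and divides by the training (population) standard deviation, then reuses those same two numbers for the test data. The toy values and the helper name `standardize` are made up for illustration.

```python
import statistics

def standardize(values, mean, std):
    """Shift by the given mean and divide by the given standard deviation."""
    return [(v - mean) / std for v in values]

# Toy column standing in for one training feature (hypothetical values)
train_col = [1.0, 2.0, 3.0, 4.0]
test_col = [2.5, 5.0]

# Statistics come from the training data only
mean = statistics.fmean(train_col)
std = statistics.pstdev(train_col)  # population std (ddof=0), as StandardScaler uses

train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics rather than its own.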

# Modelling

In this notebook, I will try different configurations and try to understand which changes affect the results most.

What I have tried so far:

```
Version 1
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
512 neurons, Dropout 0.3, BatchNormalization, relu
256 neurons, Dropout 0.3, BatchNormalization, relu
128 neurons, Dropout 0.3, BatchNormalization, relu
Overall AUC: 0.770
```
---
```
Version 3
callbacks: EarlyStopping min_delta=0.001, patience=20
callbacks: ReduceLROnPlateau monitor='val_loss', factor=0.2, patience=5, min_lr=0.001
3 layers, 
512 neurons, Dropout 0.3, BatchNormalization, relu
256 neurons, Dropout 0.3, BatchNormalization, relu
128 neurons, Dropout 0.3, BatchNormalization, relu
Overall AUC: 0.769
(Notes, difference is probably caused by the ReduceLROnPlateau callback)
```
---
```
Version 4, (Try a different activation function with Version 1's configuration)
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
512 neurons, Dropout 0.3, BatchNormalization, linear
256 neurons, Dropout 0.3, BatchNormalization, linear
128 neurons, Dropout 0.3, BatchNormalization, linear
Overall AUC: 0.749
(Notes, activation function didn't work well, let's try swish in Version 5)
```
---
```
Version 5
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
512 neurons, Dropout 0.3, BatchNormalization, swish
256 neurons, Dropout 0.3, BatchNormalization, swish
128 neurons, Dropout 0.3, BatchNormalization, swish
Overall AUC: 0.770
(Notes, little better than Version 1, let's play with neuron numbers)
```
---
```
Version 6
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, BatchNormalization, swish
64 neurons, Dropout 0.3, BatchNormalization, swish
32 neurons, Dropout 0.3, BatchNormalization, swish
Overall AUC: 0.766
(Notes, although the AUC is lower, it got a better score.
In the next version I will play with the n_splits.)
```
---
```
Version 7
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 5
3 layers, 
100 neurons, Dropout 0.3, BatchNormalization, swish
64 neurons, Dropout 0.3, BatchNormalization, swish
32 neurons, Dropout 0.3, BatchNormalization, swish
Overall AUC: 0.761
(Notes, looks like it is better to have more splits.
For the next version I will try layers without
batch normalization)
```
---
```
Version 8
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, swish
64 neurons, Dropout 0.3, swish
32 neurons, Dropout 0.3, swish
Overall AUC: 0.766
(Notes, pretty much the same AUC as Version 6 but
it has the highest score. Let's try to change the
batch size in the next version)
```
---
```
Version 9
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 1024, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, swish
64 neurons, Dropout 0.3, swish
32 neurons, Dropout 0.3, swish
Overall AUC: 0.765
(Notes, increasing the batch size reduced the score.
For next version, I want to try ReduceLROnPlateau again.)
```
---
```
Version 10
callbacks: EarlyStopping min_delta=0.001, patience=20
callbacks: ReduceLROnPlateau monitor='val_loss', factor=0.2, patience=5, min_lr=0.001
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, swish
64 neurons, Dropout 0.3, swish
32 neurons, Dropout 0.3, swish
Overall AUC: 0.766
(Notes, almost the same score as Version 8. Let's
drop the callback again and try without dropout in layers.)
```
---
```
Version 11
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, swish
64 neurons, swish
32 neurons, swish
Overall AUC: 0.768
(Notes, although it gets a better AUC, its score is lower,
probably due to overfitting. For next version I want to try
a different scaler, MinMaxScaler probably)
```
---
```
Version 12
callbacks: EarlyStopping min_delta=0.001, patience=20
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, swish
64 neurons, Dropout 0.3, swish
32 neurons, Dropout 0.3, swish
Overall AUC: 0.755
(Notes, MinMaxScaler didn't work very well.
Next I will try changing patience in early stopping.)
```
---
```
Version 13
callbacks: EarlyStopping min_delta=0.001, patience=5
EPOCHS: 100, BATCH_SIZE: 512, N_SPLITS: 15
3 layers, 
100 neurons, Dropout 0.3, swish
64 neurons, Dropout 0.3, swish
32 neurons, Dropout 0.3, swish
Overall AUC: 0.763
(Notes, changing patience worked well, it looks like it
helped with the overfitting. Let's try changing the dropout)
```
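
To compare the versions at a glance, the log above can be put into a small list and sorted. This is only a convenience sketch: the AUC values are copied from the log, while the one-line change descriptions are my own shorthand for each version.

```python
# (version, main change, overall AUC) copied from the log above
experiments = [
    (1, "baseline 512/256/128, relu, BatchNormalization", 0.770),
    (3, "add ReduceLROnPlateau", 0.769),
    (4, "linear activation", 0.749),
    (5, "swish activation", 0.770),
    (6, "smaller layers 100/64/32", 0.766),
    (7, "N_SPLITS 5", 0.761),
    (8, "no BatchNormalization", 0.766),
    (9, "BATCH_SIZE 1024", 0.765),
    (10, "add ReduceLROnPlateau again", 0.766),
    (11, "no Dropout", 0.768),
    (12, "MinMaxScaler", 0.755),
    (13, "EarlyStopping patience 5", 0.763),
]

# Highest AUC first; ties keep their original order
ranked = sorted(experiments, key=lambda row: row[2], reverse=True)

for version, change, auc in ranked:
    print(f"Version {version:>2}  AUC {auc:.3f}  {change}")
```

Keep in mind that this ranks by cross-validation AUC only. As the notes say, the submission score did not always move in the same direction.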

In [None]:
# Set seeds
my_seed = 1
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

## Callbacks

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping
early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,           # Minimum amount of change to count as an improvement
    patience=5,                # How many epochs to wait before stopping
    restore_best_weights=True) # Roll back to the weights of the best epoch
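
To see how `min_delta` and `patience` interact, here is a simplified re-implementation of the idea in plain Python. It is a sketch of the rule, not Keras' actual code, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=5):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best loss seen so far. Training stops once
    `patience` epochs in a row fail to improve.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: quick progress, then a long plateau
losses = [0.60, 0.55, 0.53, 0.5295, 0.5299, 0.531, 0.530, 0.532, 0.5292, 0.528]

print(early_stopping_epoch(losses, patience=5))
print(early_stopping_epoch(losses, patience=20))
```

With `patience=5` the run stops during the plateau and, like `restore_best_weights=True`, falls back to the best epoch before it. With `patience=20` the same run continues to the last epoch and picks up the late, small improvement.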

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau
reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss', 
    factor=0.2,                # Factor by which the learning rate will be reduced
    patience=5,                # Number of epochs with no improvement
    min_lr=0.001)              # Lower bound on the learning rate
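
One detail worth checking here is the floor `min_lr`. The sketch below mimics only the update rule (shrink by `factor`, never go below `min_lr`); the helper `next_lr` is hypothetical, not a Keras function. It assumes the model is compiled with `optimizer='adam'` at Keras' default learning rate of 0.001.

```python
def next_lr(lr, factor=0.2, min_lr=0.001):
    """Learning rate after one plateau: shrink by `factor`, never below `min_lr`."""
    return max(lr * factor, min_lr)

ADAM_DEFAULT_LR = 0.001  # Keras' default learning rate for optimizer='adam'

# With the settings above, the floor equals the starting rate,
# so the learning rate stays at 0.001.
print(next_lr(ADAM_DEFAULT_LR))

# With a lower floor (1e-5 is a hypothetical choice), the rate can actually decay
lr = ADAM_DEFAULT_LR
schedule = []
for _ in range(4):
    lr = next_lr(lr, min_lr=1e-5)
    schedule.append(lr)
print(schedule)
```

If that assumption holds, this may be part of why the versions that used this callback scored so close to the ones without it. A smaller `min_lr` would be needed for the callback to have any effect.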

In [None]:
CALLBACKS = [early_stopping]

## Model

In [None]:
# Play with those configurations...
EPOCHS = 100
BATCH_SIZE = 512
N_SPLITS = 15

Note that I am using a sigmoid as the output activation, since this is a binary classification problem.

So, I plan to play with the other settings and leave the output activation as it is. I hope this is a correct approach :)  
(This article is worth reading: https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/)
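
As a quick illustration of the activations mentioned in this notebook, here is a sketch of sigmoid, relu and swish written by hand with the standard formulas. These are not the Keras implementations, just the same math in plain Python.

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1), so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, z)

def swish(z):
    """z * sigmoid(z): close to relu for large |z|, but smooth around zero."""
    return z * sigmoid(z)

for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  relu={relu(z):.3f}  swish={swish(z):.3f}")
```

The sigmoid is the only one of the three whose output always lies between 0 and 1, which is why it fits the single output neuron of a binary classifier. A `linear` activation, in contrast, leaves its input unchanged, so a stack of linear layers can only represent a linear function of the inputs.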

In [None]:
model = keras.Sequential([
    layers.Dense(100, activation='swish', input_shape=[X.shape[1]]),
    layers.Dropout(0.2),
    layers.Dense(64, activation='swish'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='swish'),
    layers.Dropout(0.2),
    # For a binary classification problem, use a sigmoid output
    layers.Dense(1, activation='sigmoid')])

According to https://www.kaggle.com/ryanholbrook/stochastic-gradient-descent, training a network needs two things:  
A "loss function" that measures how good the network's predictions are.  
An "optimizer" that can tell the network how to change its weights.

So, we can play with the optimizer and the loss function, but we should probably keep the metric.
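
To make "measures how good the predictions are" concrete, here is a hand-written sketch of binary cross-entropy, the loss used in the next cell. It follows the standard formula; the clipping constant `eps` and the toy labels and probabilities are my own choices for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]

confident_right = binary_crossentropy(y_true, [0.9, 0.1, 0.9, 0.1])
unsure = binary_crossentropy(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = binary_crossentropy(y_true, [0.1, 0.9, 0.1, 0.9])

print(confident_right, unsure, confident_wrong)
```

The loss is smallest for confident correct predictions and grows quickly for confident wrong ones, which is the signal the optimizer uses to adjust the weights.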

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['AUC'])

## Training

In [None]:
test_predictions = np.zeros(test_X.shape[0])
skf = StratifiedKFold(n_splits=N_SPLITS, random_state=48, shuffle=True)
scores = {fold: None for fold in range(skf.n_splits)}
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    train_X, val_X = X[train_idx], X[val_idx]
    train_y, val_y = y.iloc[train_idx], y.iloc[val_idx]

    # Start every fold from freshly initialised weights. Reusing one fitted
    # model would let it train on rows that later serve as validation data.
    fold_model = keras.models.clone_model(model)
    fold_model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=[keras.metrics.AUC(name='auc')])  # Fixed name keeps the 'val_auc' key

    history = fold_model.fit(
        train_X, train_y,
        validation_data=(val_X, val_y),
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=CALLBACKS,        # Put your callbacks in a list
        verbose=0)                  # Turn off training log

    scores[fold] = history.history
    print(f"Fold {fold + 1} \t\t AUC: {np.max(scores[fold]['val_auc'])}")

    # Add this fold's share of the averaged test prediction
    test_predictions += fold_model.predict(test_X, batch_size=BATCH_SIZE).ravel() / skf.n_splits

# Best validation AUC of each fold, averaged over the folds
overall_auc = [np.max(scores[fold]['val_auc']) for fold in range(skf.n_splits)]
print('Overall Mean AUC: ', np.mean(overall_auc))
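
As a reminder of what the AUC numbers printed above mean: the AUC is the probability that a randomly chosen positive row gets a higher prediction than a randomly chosen negative row. Below is a minimal sketch that counts those pairs directly on made-up data. The Keras metric estimates the same quantity from a fixed set of thresholds, so its value can differ slightly from this exact count.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predictions (hypothetical values)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, y_score))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so the values around 0.77 reported in this notebook sit roughly halfway between the two.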

# Evaluation

In [None]:
# Credits to https://www.kaggle.com/mlanhenke/tps-11-nn-baseline-keras?scriptVersionId=79830528
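n_cols = 5
n_rows = int(np.ceil(skf.n_splits / n_cols))  # Enough rows for any N_SPLITS
fig, ax = plt.subplots(n_rows, n_cols, tight_layout=True, figsize=(20, 5 * n_rows))
ax = ax.flatten()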

for fold in range(skf.n_splits):
    df_eval = pd.DataFrame({'train_loss': scores[fold]['loss'], 'valid_loss': scores[fold]['val_loss']})

    min_train = np.round(np.min(df_eval['train_loss']),5)
    min_valid = np.round(np.min(df_eval['valid_loss']),5)
    delta = np.round(min_valid - min_train,5)
    
    sns.lineplot(
        x=df_eval.index,
        y=df_eval['train_loss'],
        label='train_loss',
        ax = ax[fold]
    )

    sns.lineplot(
        x=df_eval.index,
        y=df_eval['valid_loss'],
        label='valid_loss',
        ax = ax[fold]
    )
    
    ax[fold].set_ylabel('')
    ax[fold].set_xlabel(f"Fold {fold+1}\nmin_train: {min_train}\nmin_valid: {min_valid}\ndelta: {delta}", fontstyle='italic')

sns.despine()

# Submission

In [None]:
# Run the code to save predictions in the format used for competition scoring
output = pd.DataFrame({'id': test_data.id, 'target': test_predictions})
output.to_csv('submission.csv', index=False)

In [None]:
output