# Introduction
Greetings!👋

In this kernel you will find my data science approach to the "Tabular Playground Series - May 2021" competition using **D**enoising **A**uto**e**ncoders with swap noise. As always, any feedback is very much appreciated! :)

Check out my other notebooks about this competition:

* [EDA+LGBM+Optuna using GPU](https://www.kaggle.com/aipi12/eda-lgbm-optuna-using-gpu)

* [EDA+CatBoost+Optuna](https://www.kaggle.com/aipi12/eda-catboost-optuna)

For more information about DAEs and the swap noise technique:

* https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629

* https://towardsdatascience.com/how-to-apply-self-supervision-to-tabular-data-introducing-dfencoder-eec21c4afaef

* https://www.kaggle.com/springmanndaniel/1st-place-turn-your-data-into-daeta

# Table of contents:

1. Meeting our data

2. Doing a bit of preprocessing

3. Implementing Denoising Autoencoder

    3.1 Implementing swap noise
    
    3.2 Creating and fitting DAE
    
    3.3 Extracting features

# 1. Meeting our data

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/train.csv', index_col = 'id')
test = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/test.csv', index_col = 'id')
train

In [None]:
test

In [None]:
train.shape

In [None]:
test.shape

In [None]:
target = train.target.copy()
target

In [None]:
train.drop('target', axis = 1, inplace = True)
train

In [None]:
(train.columns).equals(test.columns)

# 2. Doing a bit of preprocessing

In [None]:
train_test = pd.concat([train, test], keys = ['train', 'test'], axis = 0)
train_test
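
As a small illustration of why `keys` is passed to `pd.concat` here: the extra outer index level labels each row's origin, so the two parts can be pulled back out with `.xs` after joint preprocessing. The toy frames below (`toy_train`, `toy_test`) are made up for the example and are not the competition data.

```python
import pandas as pd

# Hypothetical toy frames standing in for train/test (ids continue across them).
toy_train = pd.DataFrame({'feature_0': [0, 1, 2]}, index=pd.Index([0, 1, 2], name='id'))
toy_test = pd.DataFrame({'feature_0': [3, 4]}, index=pd.Index([3, 4], name='id'))

# Stack both parts; `keys` adds an outer index level labelling each row's origin.
toy_all = pd.concat([toy_train, toy_test], keys=['train', 'test'], axis=0)

# .xs selects one outer key and drops that level, restoring the original index.
recovered_train = toy_all.xs('train')
recovered_test = toy_all.xs('test')
```

Selecting by key rather than by row position means the split does not depend on remembering how many rows each part had.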

In [None]:
train_test = (train_test - train_test.mean()) / train_test.std()
train = train_test.xs('train').copy()
test = train_test.xs('test').copy()
train
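
To show what the standardization formula above does, here is a minimal sketch on a made-up frame (`toy` is hypothetical): after scaling, every column has mean 0 and standard deviation 1, whatever its original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns on very different scales.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 30.0, 50.0, 70.0]})

# Same formula as above: subtract the column mean, divide by the column std.
# pandas' .std() uses the sample standard deviation (ddof=1) by default.
toy_scaled = (toy - toy.mean()) / toy.std()

col_means = toy_scaled.mean()
col_stds = toy_scaled.std()
```

Note that a constant column has a standard deviation of zero, so this formula would turn it into NaN values; it is worth checking for such columns before scaling.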

In [None]:
class_map = {
    'Class_1': 0,
    'Class_2': 1,
    'Class_3': 2,
    'Class_4': 3,
}

target = target.map(class_map).astype('int')

target

# 3. Implementing Denoising Autoencoder

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.random.set_seed(1)

In [None]:
train_test.head(5)

# 3.1 Implementing swap noise

In [None]:
def df_inputSwapNoise(df, p):
    """
    Custom function for implementing swap noise.
    It takes: DataFrame of data, fraction of rows (0 to 1) to be replaced in each column;
    And it outputs: DataFrame with noise (the input DataFrame is modified in place).
    """
    n = df.shape[0]
    swap_n = round(n * p)
    for col in df.columns:
        # Copy, so the untouched column remains the pool of replacement values
        arr = df[col].to_numpy().copy()
        # Pick swap_n distinct rows to corrupt
        swap_idx = np.random.choice(n, size = swap_n, replace = False)
        # Overwrite them with values drawn at random from the same column
        arr[swap_idx] = np.random.choice(df[col].to_numpy(), size = swap_n)
        df[col] = arr
    return df

In [None]:
noisy_train_test = df_inputSwapNoise(train_test.copy(), 0.15)

In [None]:
noisy_train_test.equals(train_test)
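
For intuition, swap noise can also be sketched on a plain NumPy array. The version below is a vectorized variant, not the function used in this notebook: `swap_noise` is a hypothetical name, and it corrupts each cell independently with probability `p` instead of picking a fixed number of rows per column. The key property is the same: a corrupted cell receives a value from the same column of another row.

```python
import numpy as np

def swap_noise(x, p, rng):
    """Return a copy of 2-D array `x` where each cell is, with probability `p`,
    replaced by the value from the same column of a randomly chosen row."""
    n_rows, n_cols = x.shape
    # Boolean mask of the cells to corrupt.
    mask = rng.random((n_rows, n_cols)) < p
    # For every cell, a random donor row; the column stays the same.
    donor_rows = rng.integers(0, n_rows, size=(n_rows, n_cols))
    col_idx = np.broadcast_to(np.arange(n_cols), (n_rows, n_cols))
    donors = x[donor_rows, col_idx]
    return np.where(mask, donors, x)

rng = np.random.default_rng(0)
# Toy data: column j only holds values in [1000*j, 1000*j + 999], so a value
# leaking in from another column would be easy to spot.
x = np.arange(1000)[:, None] + 1000 * np.arange(3)[None, :]
noisy = swap_noise(x, 0.15, rng)
changed_fraction = (noisy != x).mean()
```

Because the replacement values come from the column's own empirical distribution, the corrupted rows still look plausible feature by feature, which is what makes the denoising task non-trivial for the autoencoder.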

Plotting the amount of noise per feature:

In [None]:
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set_style('whitegrid')

plt.figure(figsize = (16, 12))
noise_counts = noisy_train_test.ne(train_test).sum()  # cells changed by the noise, per feature
sns.barplot(x = noise_counts.values, y = noise_counts.index, palette = 'winter_r')

# 3.2 Creating and fitting DAE

In [None]:
autoencoder = keras.Sequential([layers.Dense(input_shape = [noisy_train_test.shape[1]], 
                                             units = 1500, activation = 'relu'),
                                layers.Dense(units = 1500, activation = 'relu'),
                                layers.Dense(units = 1500, activation = 'relu'),
                                layers.Dense(units = noisy_train_test.shape[1], activation = 'linear')])

autoencoder.compile(optimizer = 'adam',
                    loss = 'mse')

In [None]:
autoencoder.summary()

In [None]:
autoencoder.fit(
    noisy_train_test, 
    train_test, 
    epochs = 1000,
    batch_size = 128,
)

# 3.3 Extracting features

In [None]:
layers_list = [layer.output for layer in autoencoder.layers[:-1]]

In [None]:
feat_extraction_model = keras.Model(inputs = autoencoder.input, outputs = layers_list)

In [None]:
ext_features = feat_extraction_model.predict(train_test)

In [None]:
ext_features

In [None]:
ext_features[0].shape
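
To make explicit what the extraction model returns: one activation matrix per hidden layer, each with one row per sample and one column per unit. The NumPy sketch below mimics this with random weights and tiny, made-up layer sizes; it is not the trained network, and `dense_relu` and `params` are names invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_units = 5, 4, 3  # tiny, made-up sizes

def dense_relu(inputs, weights, bias):
    # What a Dense layer with ReLU activation computes: max(0, inputs @ W + b).
    return np.maximum(0.0, inputs @ weights + bias)

x = rng.normal(size=(n_rows, n_features))
# Random stand-ins for the weights of three hidden layers.
shapes = [(n_features, n_units), (n_units, n_units), (n_units, n_units)]
params = [(rng.normal(size=s), np.zeros(s[1])) for s in shapes]

hidden_outputs = []
h = x
for weights, bias in params:
    h = dense_relu(h, weights, bias)
    hidden_outputs.append(h)  # keep every hidden layer, not just the last

# Side-by-side concatenation gives one wide feature matrix per row.
features = np.concatenate(hidden_outputs, axis=1)
```

With the three 1500-unit hidden layers used in this notebook, the same concatenation yields 4500 features per row.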

In [None]:
train_test_dae = pd.DataFrame()
for n in range(len(layers_list)):
    dae = pd.DataFrame(data = ext_features[n], 
                       columns = [f'feature_{ext_features[0].shape[1] * n + i}' for i in range(ext_features[n].shape[1])])
    train_test_dae = pd.concat([train_test_dae, dae], axis = 1)

In [None]:
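# Reuse the original ids (inner index level of train_test) instead of relying on the default 0..N-1 index
train_test_dae.index = train_test.index.get_level_values('id')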
train_test_dae

In [None]:
train_dae = train_test_dae.iloc[:train.shape[0]]
test_dae = train_test_dae.iloc[train.shape[0]:]

In [None]:
train_dae.to_csv('train_dae.csv')
test_dae.to_csv('test_dae.csv')