![](https://github.com/SauravMaheshkar/Tabular-Playground-Series-May-2021/blob/main/assets/Banner.png?raw=true)

# Table of Content

1. [Packages 📦 and Basic Setup](#basic)
2. [Pre-Processing 👎🏻 -> 👍](#preprocess)
3. [The Model 👷‍♀️](#model)
4. [Training 💪🏻](#train)

## Disclaimer

This kernel builds on top of [@subinium](https://www.kaggle.com/subinium)'s kernel [TPS-May:Deeplearning Pipeline for Beginner](https://www.kaggle.com/subinium/tps-may-deeplearning-pipeline-for-beginner)

<a id = "basic"> </a>

# Packages 📦 and Basic Setup

In [1]:
%%capture
!pip install wandb

import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Weights and Biases
import wandb
from wandb.keras import WandbCallback
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("WANDB_API_KEY")
wandb.login(key=api_key);
wandb.init(project='TPS May 2021', entity='sauravmaheshkar')

As we can see, the dataset consists of 50 feature columnns with 4 classes. 

In [2]:
# Basic Paths
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

train = train.drop('id', axis=1)
test = test.drop('id', axis=1)
train.head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49,target
0,0,0,1,0,1,0,0,0,0,0,...,0,0,21,0,0,0,0,0,0,Class_2
1,0,0,0,0,2,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Class_1
2,0,0,0,0,0,0,0,0,0,2,...,0,1,0,0,0,0,13,2,0,Class_1
3,0,0,0,0,0,0,0,3,0,0,...,0,0,0,0,0,0,0,1,0,Class_4
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,Class_2


<a id = 'preprocess'> </a>
# Pre-Processing 👎🏻 -> 👍

In this kernel I'll highlight the relatively new `tensorflow.feature_column` submodule which contains a ton of useful methods to deal with structured data. For more details kindly visit the [documentation](https://www.tensorflow.org/api_docs/python/tf/feature_column). Tensorflow offers a ton of feature columns for us to experiment with, viz :

* Numeric columns
* Bucketized columns
* Categorical columns
* Embedding columns
* Hashed feature columns
* Crossed feature columns

Have a look at [this tutorial](https://www.tensorflow.org/tutorials/structured_data/feature_columns) for more details.

Upon a closer look, we realize that most features are left skewed in this dataset. Thus, Normalization seems ideal.

In [3]:
for i in range(50):
    mean, std = train[f'feature_{i}'].mean(), train[f'feature_{i}'].std()
    train[f'feature_{i}'] = train[f'feature_{i}'].apply(lambda x : (x-mean)/std)

We also convert the `target` columns into binary class matrices using the `tf.keras.utils.to_categorical()` function.

In [4]:
# transform target column into 0,1,2,3 values
label_dict = {val:idx for idx, val in enumerate(sorted(train['target'].unique()))}
train['target'] = train['target'].map(label_dict)

train['target'] = tf.keras.utils.to_categorical(train['target'])

Lastly, we split out dataset into train, validation and test splits. Most deep learning models will overfit, so having a nice split ratio is key 🔑

In [5]:
train, test = train_test_split(train, test_size=0.2)
train, val = train_test_split(train, test_size=0.4)

print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

48000 train examples
32000 validation examples
20000 test examples


Rather than using dataframes we'll create a `tf.data.Dataset` from our original dataframe. This also allows us to efficiently shuffle, prefetch and create batches.

In [6]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

A numeric column is the simplest type of feature column. It is used to represent real valued features. When using this column, our model will receive the column value from the dataframe **unchanged**. The output of a feature column will become the input to the model.

In [7]:
from tensorflow import feature_column

feature_columns = []

for i in range(50):
    feature_columns.append(feature_column.numeric_column(f'feature_{i}'))

<a id='model'></a>
# The Model 👷‍♀️

We create a `feature_layer`, which will act as the first layer in our model. A batch size of 64 was arbitrarily chosen.

In [8]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

batch_size = 64
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

We just add a Dense layer in the end to act as a classification head for the model and compile using the `sgd` optimizer and the `sparse_categorical_crossentropy` loss function.

In [9]:
model = tf.keras.models.Sequential([
        feature_layer,
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(4, activation='softmax')])

model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

<a id='train'></a>
# Training 💪🏻

We train for a mere 10 epochs and the overfitting is quite obvious here. Better Regularization strategies + better feature selection can be extremely helpful moving forward. We'll also log our metrics to [Weights and Biases](https://wandb.ai/site) for efficient experiment tracking.

In [10]:
model.fit(train_ds,epochs = 10, verbose = 2,
          validation_data=val_ds,
          validation_steps = 100,
          callbacks = [WandbCallback()])

Epoch 1/10
750/750 - 6s - loss: 0.7997 - accuracy: 0.7844 - val_loss: 0.3898 - val_accuracy: 0.9145
Epoch 2/10
750/750 - 3s - loss: 0.3643 - accuracy: 0.9158 - val_loss: 0.3476 - val_accuracy: 0.9145
Epoch 3/10
750/750 - 3s - loss: 0.3376 - accuracy: 0.9158 - val_loss: 0.3326 - val_accuracy: 0.9145
Epoch 4/10
750/750 - 3s - loss: 0.3260 - accuracy: 0.9158 - val_loss: 0.3237 - val_accuracy: 0.9145
Epoch 5/10
750/750 - 3s - loss: 0.3184 - accuracy: 0.9158 - val_loss: 0.3180 - val_accuracy: 0.9145
Epoch 6/10
750/750 - 3s - loss: 0.3136 - accuracy: 0.9158 - val_loss: 0.3138 - val_accuracy: 0.9145
Epoch 7/10
750/750 - 3s - loss: 0.3102 - accuracy: 0.9158 - val_loss: 0.3113 - val_accuracy: 0.9145
Epoch 8/10
750/750 - 3s - loss: 0.3075 - accuracy: 0.9158 - val_loss: 0.3084 - val_accuracy: 0.9145
Epoch 9/10
750/750 - 3s - loss: 0.3054 - accuracy: 0.9158 - val_loss: 0.3069 - val_accuracy: 0.9145
Epoch 10/10
750/750 - 3s - loss: 0.3038 - accuracy: 0.9158 - val_loss: 0.3052 - val_accuracy: 0.9145

<tensorflow.python.keras.callbacks.History at 0x7f6c28112e90>

We can evaluate on our test set and see the loss and accuracy. As, we can see the model overfits to a large extent.

In [11]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
print("Loss", loss)

Accuracy 0.9150999784469604
Loss 0.30434146523475647


# Submission

In [12]:
train_df = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
train_df = train_df.drop('id', axis = 1)
test_df = test_df.drop('id', axis=1)

for i in range(50):
    mean, std = train_df[f'feature_{i}'].mean(), train_df[f'feature_{i}'].std()
    test_df[f'feature_{i}'] = test_df[f'feature_{i}'].apply(lambda x : (x-mean)/std)

In [13]:
def df_to_dataset_test(dataframe, batch_size=32):
    dataframe = dataframe.copy()
    ds = tf.data.Dataset.from_tensor_slices(dict(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

test_df = df_to_dataset_test(test_df, batch_size=batch_size)

In [14]:
sample_submission[['Class_1','Class_2', 'Class_3', 'Class_4']] = model.predict(test_df)
sample_submission.to_csv('densefeatures_submission.csv',index = False)
sample_submission.head()

Unnamed: 0,id,Class_1,Class_2,Class_3,Class_4
0,100000,0.900258,0.085768,0.006744,0.00723
1,100001,0.903497,0.083409,0.006768,0.006326
2,100002,0.914519,0.072834,0.006857,0.00579
3,100003,0.915672,0.071767,0.00665,0.005911
4,100004,0.917454,0.069082,0.006707,0.006757
