# <span style= "color:crimson"><u>Going Deep!!!</u></span>
<img src = "https://i.pinimg.com/originals/1a/10/6e/1a106e5fa6cb78e4a89bff36fccc3a02.png">


<p style = "font-size:150%"><span style="color:blue">
    We all know that when it comes to tabular data, <strong>Machine Learning</strong> boosting methods kick the asses of every other ML algorithm.
    <br>
    Although deep learning has proved unreasonably effective in fields like Vision, Speech, and Language, it hasn't shown the same effectiveness on tabular data. </span>
    
<br>
<span style= "color:crimson"><span style="font-size:150%"><u>Why Go Deeper?</u></span></span>
    
<p style="font-size:150%"><span style="color:blue">Today, we can train machines to detect objects in images, extract meaning from text, stop spam emails, drive cars, discover new drug candidates, and beat top players in Chess, Go, and countless other games, and most of the credit goes to deep learning.
<br>
A machine doesn't understand the science or the theory; all it does is generalise ideas from seeing a lot of data, so, lacking a perfect theory, we have to rely on intuitions. <strong>The intuition for going deeper is that a deeper neural network generalises more and overfits less</strong>. <br>
Take a face image as an example: even if part of the face is hidden, the network will still pick up a signal from the remaining input, and therefore generalise better. It's a good intuition, and it appears to be what is actually happening. Experiments confirm that deep neural networks outperform shallow ones on common image as well as text tasks.
</span>
</p>

<table><tr><td><img src ="https://pytorch-tabular.readthedocs.io/en/latest/imgs/pytorch_tabular_logo.png"></td>

<td><span style= "color:black"><span style="font-size:350%">The <br>Deep Learning Framework<br>for Tabular Data</span></span></td></tr></table>


 

<p style="font-size:150%">
        As the name suggests, <strong>PyTorch Tabular</strong> is a framework built on <a href="https://pytorch.org/">PyTorch</a> and <a href="https://www.pytorchlightning.ai/">PyTorch Lightning</a> for working with tabular data.<br>
        Rather than doing everything from scratch, it makes our job easy: we can do everything in a few simple steps.
    </p>

# <span style="color:crimson"><u>In this tutorial, we'll learn to implement pytorch-tabular in <strong>5-Simple-Steps</strong>:</u></span>
### 
<span style="font-size:150%">
Step 1: Installation<br>
Step 2: Setting up the Configs<br>
Step 3: Initializing the Model and Training<br>
Step 4: Evaluating the Model on unseen data<br>
Step 5: Saving the Model
</span>

In [None]:
import warnings 
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

In [None]:
# reading data
data=pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
data['diagnosis'] = data['diagnosis'].astype('category').cat.codes  # encode labels as integer codes
data.drop(['id','Unnamed: 32'],axis=1,inplace=True)

In [None]:
def load_classification_data(df, target_col, test_size):
    """Rename columns to feature_i/target, infer column types, and split into train/test."""
    features = np.array(df.drop(target_col, axis=1))
    labels = np.array(df[target_col])
    data = np.hstack([features, labels.reshape(-1, 1)])
    # Generic names: feature_0 ... feature_{n-1}, with the last column renamed to "target"
    col_names = [f"feature_{i}" for i in range(data.shape[-1])]
    col_names[-1] = "target"
    data = pd.DataFrame(data, columns=col_names)
    feature_names = col_names[:-1]
    # Heuristic: fewer than 10 unique values -> categorical, otherwise continuous
    cat_col_names = [x for x in feature_names if len(data[x].unique()) < 10]
    num_col_names = [x for x in feature_names if x not in cat_col_names]
    # Hold out a random but reproducible fraction of the rows as the test set
    test_idx = data.sample(int(test_size * len(data)), random_state=42).index
    test = data[data.index.isin(test_idx)]
    train = data[~data.index.isin(test_idx)]
    return (train, test, ["target"], cat_col_names, num_col_names)

In [None]:
train, test, target_col, cat_col_names, num_col_names= load_classification_data(data,'diagnosis',0.2)
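
The helper above decides which columns are categorical with a simple cardinality rule: fewer than 10 distinct values means categorical, anything else is continuous. Below is a minimal, pandas-free sketch of just that rule. The function name `split_by_cardinality` and the toy columns are made up for illustration; they are not part of this notebook's data or of PyTorch Tabular.

```python
def split_by_cardinality(columns, max_categories=10):
    """Split column names into (categorical, continuous) lists.

    `columns` maps a column name to its list of values. A column with
    fewer than `max_categories` distinct values is treated as categorical,
    mirroring the `len(data[x].unique()) < 10` check in the helper above.
    """
    categorical, continuous = [], []
    for name, values in columns.items():
        if len(set(values)) < max_categories:
            categorical.append(name)
        else:
            continuous.append(name)
    return categorical, continuous


# Hypothetical toy data: one low-cardinality column, one high-cardinality column.
toy = {
    "feature_0": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "feature_1": [0.5 * i for i in range(12)],
}
cat_cols, num_cols = split_by_cardinality(toy)
print(cat_cols, num_cols)
```

Keep in mind that this is only a heuristic: a genuinely numeric column that happens to take few values (a count from 0 to 5, say) would also end up being treated as categorical.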

**Sorry in advance for the amount of output printed below.**

# Step 1: Installation
*Note: the install might downgrade some preinstalled libraries, but that is fine here.*

In [None]:
! pip install pytorch_tabular[all]
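
Since the install prints a lot and may shuffle package versions around, a quick check that the key packages can actually be found saves some head-scratching later. This is just a small standard-library sketch; `is_installed` is a hypothetical helper name, and the three package names are the ones this notebook relies on.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pytorch_tabular", "torch", "pytorch_lightning"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If anything shows up as MISSING, re-running the install cell (and restarting the kernel) is usually the first thing to try.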

# Importing Libraries
The data pre-processing that PyTorch Tabular needs is already done above, so all that is left is to import the library.

In [None]:
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig, NodeConfig, TabNetModelConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
from pytorch_tabular.categorical_encoders import CategoricalEmbeddingTransformer

# Step 2: Setting up the Configs

In [None]:
data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
    continuous_feature_transform="quantile_normal",
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=1, #index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="4096-4096-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3,
    metrics=["accuracy"]
)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
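
A word on `continuous_feature_transform="quantile_normal"`: it asks for each continuous column to be reshaped toward a normal distribution based on the ranks of its values. As far as I know, PyTorch Tabular hands this off to scikit-learn's quantile transformer. The sketch below is not that implementation, just a standard-library illustration of the idea (rank each value, turn the rank into a quantile, map it through the inverse normal CDF). The function name `quantile_normal` is made up for this example.

```python
from statistics import NormalDist


def quantile_normal(values):
    """Rank-based mapping of `values` onto a standard normal scale.

    Each value's rank is turned into a quantile strictly between 0 and 1
    and pushed through the inverse normal CDF. Ties get the average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# A heavily skewed toy column: the outlier 1000 no longer dominates the scale.
skewed = [1, 2, 3, 4, 1000]
print([round(z, 2) for z in quantile_normal(skewed)])
```

The point to take away is that only the ordering of the values matters, so extreme outliers get pulled in to a sensible range before the network sees them.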

# Step 3 & Step 4: Training and Evaluating

In [None]:
tabular_model.fit(train=train, test=test)
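
`layers="4096-4096-512"` describes three hidden layers, and it is worth seeing how big that makes the network. Here is a rough count of the weights and biases in a plain fully connected stack with those widths. It assumes 30 input features and 2 output classes (the shape of the breast-cancer data used above) and ignores embeddings, batch norm, and anything else the real model adds, so treat the number as an order of magnitude only. `count_mlp_parameters` is a made-up helper, not a library function.

```python
def count_mlp_parameters(n_inputs, layers, n_outputs):
    """Count weights and biases of a plain fully connected network.

    `layers` uses the same "a-b-c" string format as the `layers` argument
    in the model config.
    """
    widths = [n_inputs] + [int(w) for w in layers.split("-")] + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# Assumed shapes: 30 input features, 2 classes.
print(f"{count_mlp_parameters(30, '4096-4096-512', 2):,}")  # 19,006,978
print(f"{count_mlp_parameters(30, '1024-1024-32', 2):,}")
```

That is roughly 19 million parameters for a dataset with only a few hundred rows, which is a good reason to keep an eye on the validation metrics for overfitting, or to try a smaller `layers` string.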

In [None]:
tabular_model.evaluate(test)
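
Accuracy is easier to judge next to a baseline. The sketch below computes the accuracy you would get by always predicting the most common class; if the model's test accuracy is not clearly above that, it has not learned much. The labels here are invented toy values, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Toy labels: 6 of 10 samples belong to class 0.
toy_labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

In the notebook itself, the same number comes from `test['target'].value_counts(normalize=True).max()`.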

# Step 5: Saving the Model

In [None]:
tabular_model.save_model('./saved_models/tab_model')

In [None]:
loaded_model = TabularModel.load_from_checkpoint('./saved_models/tab_model/')

# Testing on other datasets

# 🌹 iris-flower dataset

In [None]:
# reading data
data2=pd.read_csv('../input/iris/Iris.csv')
data2.drop(['Id'],axis=1,inplace=True)
data2['Species'] = data2['Species'].astype('category').cat.codes  # encoding labels

In [None]:
train, test, target_col, cat_col_names, num_col_names  = load_classification_data(data2,'Species',0.2)

In [None]:
data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
    continuous_feature_transform="quantile_normal",
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=16,
    max_epochs=100,
    auto_select_gpus = True,
    gpus=1, #index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="4096-4096-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3,
    metrics=["accuracy"]
)
tabular_model2 = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [None]:
tabular_model2.fit(train=train, test=test)

In [None]:
tabular_model2.evaluate(test)

# 👨‍🔬 pima-indian-diabetes dataset

In [None]:
# reading data
data3=pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
# encode labels in place; copying them into a new column would leave 'Outcome' among the features and leak the target
data3['Outcome'] = data3['Outcome'].astype('category').cat.codes

In [None]:
train, test, target_col, cat_col_names, num_col_names = load_classification_data(data3, 'Outcome', 0.2)

In [None]:
data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
    continuous_feature_transform="quantile_normal",
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=1, #index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="1024-1024-32",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3,
    metrics=["accuracy"]
)
tabular_model3 = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [None]:
tabular_model3.fit(train=train, test=test)

In [None]:
tabular_model3.evaluate(test)

# ❤ heart-disease-uci dataset

In [None]:
# reading data
data4=pd.read_csv('../input/heart-disease-uci/heart.csv')
data4['target'] = data4['target'].astype('category').cat.codes  # encoding labels

In [None]:
train, test, target_col, cat_col_names, num_col_names= load_classification_data(data4,'target',0.2)

In [None]:
data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
    continuous_feature_transform="quantile_normal",
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=1, #index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="4096-4096-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3,
    metrics=["accuracy"]
)
tabular_model4 = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [None]:
tabular_model4.fit(train=train, test=test)

In [None]:
tabular_model4.evaluate(test)

# 🍷 winequality-red dataset

In [None]:
# reading data
data5=pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
data5['quality'] = data5['quality'].astype('category').cat.codes  # encoding labels

In [None]:
train, test, target_col, cat_col_names, num_col_names= load_classification_data(data5,'quality',0.2)

In [None]:
data_config = DataConfig(
    target=['target'], #target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
    continuous_feature_transform="quantile_normal",
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True, # Runs the LRFinder to automatically derive a learning rate
    batch_size=32,
    max_epochs=100,
    gpus=1, #index of the GPU to use. 0, means CPU
)
optimizer_config = OptimizerConfig()
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="4096-4096-512",  # Number of nodes in each layer
    activation="LeakyReLU", # Activation between each layers
    learning_rate = 1e-3,
    metrics=["accuracy"]
)
tabular_model5 = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [None]:
tabular_model5.fit(train=train, test=test)

In [None]:
tabular_model5.evaluate(test)

# Conclusion:
The module worked almost OK on these datasets. <br>
There is still a lot of research going on in this field, and I don't want to prejudge that it won't get any better.<br>
PS: I tried it on the [tabular-playground-series-jun-2021](https://www.kaggle.com/c/tabular-playground-series-jun-2021/overview) dataset, <br>and it worked really well; yes, it worked better than classical algorithms, without any preprocessing.

<strong><span style="color:crimson"><span style="font-size:150%">If you like my work, please don't forget to leave an upvote!</span></span></strong>

<strong><span style="color: seagreen"><span style="font-size:150%"> If you don't, at least leave a comment on what I should do to improve it!</span></span></strong>