# PyTorch Tabular Basics!

Welcome to the first notebook of my tutorial series on PyTorch Tabular!

If you are a data science enthusiast and want to explore the use of Deep Learning architectures to solve tabular data problems, then you have come to the right place 😄

In this tutorial series I will try to teach you how to work with PyTorch Tabular. This package allows you to easily leverage some of the latest advancements in Deep Learning Architectures. 

![Alt text](data/banner.jpg)

## Content index:

This tutorial can be divided into the following short segments:

1. A regression problem: Crab Age Dataset

2. Approaching the problem with traditional ML

3. How PyTorch Tabular works

4. Data Configuration

5. Trainer Configuration

6. Optimizer Configuration

7. Model Configuration

8. Joining everything together to train our model

10. Evaluation and comparison: DL vs Traditional ML

11. Saving, Loading and Predicting with our Model

12. Conclusion

---

You are expected to be familiar with basic Machine Learning and Deep Learning concepts such as overfitting, learning rate, train, validation and test sets, epochs and loss, and to have basic programming experience. Don't worry if you don't know all of these concepts; I will always clearly reference them so you can review them if you want 😃


## 1. A regression problem: Crab Age Dataset

For this tutorial I propose working with the Crab Age Dataset, which can be obtained from this [Kaggle competition](https://www.kaggle.com/competitions/playground-series-s3e16/overview), specifically the train.csv file.

In [21]:
import pandas as pd

df = pd.read_csv("data/train.csv")

df.head()

| | id | Sex | Length | Diameter | Height | Weight | Shucked Weight | Viscera Weight | Shell Weight | Age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | I | 1.525 | 1.175 | 0.375 | 28.973189 | 12.728926 | 6.647958 | 8.348928 | 9 |
| 1 | 1 | I | 1.1 | 0.825 | 0.275 | 10.418441 | 4.521745 | 2.324659 | 3.40194 | 8 |
| 2 | 2 | M | 1.3875 | 1.1125 | 0.375 | 24.777463 | 11.3398 | 5.556502 | 6.662133 | 9 |
| 3 | 3 | F | 1.7 | 1.4125 | 0.5 | 50.660556 | 20.354941 | 10.991839 | 14.996885 | 11 |
| 4 | 4 | I | 1.25 | 1.0125 | 0.3375 | 23.289114 | 11.977664 | 4.50757 | 5.953395 | 8 |


For a commercial crab farmer, knowing the right age of a crab helps them decide if and when to harvest it. Beyond a certain age, there is negligible growth in a crab's physical characteristics, so it is important to time the harvest to reduce cost and increase profit. Here are some of the things you can do with this dataset:

- Exploratory data analysis - Understand how different physical features change with age and overall, how they relate with each other.

- Feature Engineering - Define new features using a combination of given data points to gain insights and help improve model accuracy.

- **Build a regression model to predict the age of the crab.**

Feel free to explore the dataset and look up the meaning of the features. For the sake of brevity, we will skip straight to building a model.

## 2. Approaching the problem with traditional ML

Let's review some of the basic steps you would take when approaching a problem like this:

1. Understand the purpose of the project. (Take a look above) ✅

2. Explore the data and understand its features - for the sake of brevity we won't cover this, but if you want to, I recommend you check this [Kaggle discussion](https://www.kaggle.com/competitions/playground-series-s3e16/discussion/413736) to quickly get up to speed!

3. Prepare the features for training - For now, we won't do any feature engineering and we will only encode the 'Sex' column, creating a column for each possible option: Male, Female or Indeterminate.

In [22]:
df = pd.get_dummies(df, columns=['Sex'], prefix='aut')

df

| | id | Length | Diameter | Height | Weight | Shucked Weight | Viscera Weight | Shell Weight | Age | aut_F | aut_I | aut_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.5250 | 1.1750 | 0.3750 | 28.973189 | 12.728926 | 6.647958 | 8.348928 | 9 | 0 | 1 | 0 |
| 1 | 1 | 1.1000 | 0.8250 | 0.2750 | 10.418441 | 4.521745 | 2.324659 | 3.401940 | 8 | 0 | 1 | 0 |
| 2 | 2 | 1.3875 | 1.1125 | 0.3750 | 24.777463 | 11.339800 | 5.556502 | 6.662133 | 9 | 0 | 0 | 1 |
| 3 | 3 | 1.7000 | 1.4125 | 0.5000 | 50.660556 | 20.354941 | 10.991839 | 14.996885 | 11 | 1 | 0 | 0 |
| 4 | 4 | 1.2500 | 1.0125 | 0.3375 | 23.289114 | 11.977664 | 4.507570 | 5.953395 | 8 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 74046 | 74046 | 1.6625 | 1.2625 | 0.4375 | 50.660556 | 20.680960 | 10.361742 | 12.332033 | 10 | 1 | 0 | 0 |
| 74047 | 74047 | 1.0750 | 0.8625 | 0.2750 | 10.446791 | 4.323299 | 2.296310 | 3.543687 | 6 | 0 | 1 | 0 |
| 74048 | 74048 | 1.4875 | 1.2000 | 0.4125 | 29.483480 | 12.303683 | 7.540967 | 8.079607 | 10 | 1 | 0 | 0 |
| 74049 | 74049 | 1.2125 | 0.9625 | 0.3125 | 16.768729 | 8.972617 | 2.919999 | 4.280774 | 8 | 0 | 1 | 0 |


4. Split the dataframe into train, validation and test sets

In [23]:
from sklearn.model_selection import train_test_split

# Splitting the original dataframe into train, val and test dataframes
train, test = train_test_split(df, random_state=42, test_size=0.2)
train, val = train_test_split(train, random_state=42, test_size=0.2)

print(f"Train Shape: {train.shape} | Val Shape: {val.shape} | Test Shape: {test.shape}")

Train Shape: (47392, 12) | Val Shape: (11848, 12) | Test Shape: (14811, 12)
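
The shapes above follow from applying a 20% split twice. As a sanity check, the arithmetic can be reproduced by hand. This is only a sketch: it assumes that the held-out size is rounded up to the next whole row, which matches the printed shapes, and `held_out_rows` is a helper name I made up for the illustration.

```python
def held_out_rows(n_rows, percent):
    """Rows held out when `percent` % of n_rows is rounded up to a whole row."""
    # Integer ceiling of n_rows * percent / 100 (avoids floating point surprises)
    return (n_rows * percent + 99) // 100


total_rows = 47392 + 11848 + 14811  # rows in the full dataframe

test_rows = held_out_rows(total_rows, 20)  # first split: 20% for test
remaining = total_rows - test_rows
val_rows = held_out_rows(remaining, 20)    # second split: 20% of what is left
train_rows = remaining - val_rows

print(train_rows, val_rows, test_rows)
print(
    f"train {train_rows / total_rows:.0%} | "
    f"val {val_rows / total_rows:.0%} | "
    f"test {test_rows / total_rows:.0%}"
)
```

In other words, the model effectively trains on roughly 64% of the data, validates on 16% and is tested on 20%.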


In [24]:
# Separate the target variable and features



# Training data 
X_train = train.drop(['Age', 'id'], axis=1)
y_train = train['Age']

# Validation data
X_val = val.drop(['Age', 'id'], axis=1)
y_val = val['Age']

# Testing data (drop 'id' as well, so the columns match the training features)
X_test = test.drop(['Age', 'id'], axis=1)
y_test = test['Age']

# Training + Validation data
X_train_val = pd.concat([X_train, X_val])
y_train_val = pd.concat([y_train, y_val])

5. Build some baseline models with default parameters

In [25]:
from sklearn.linear_model import LinearRegression

# Train a Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_val, y_train_val)

In [26]:
from sklearn.ensemble import RandomForestRegressor

# Train a Random Forest model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train_val, y_train_val)

In [27]:
import xgboost as xgb

# Initialize the XGBRegressor model
xgb_model = xgb.XGBRegressor(random_state=42)

# Fit the model on training data and validate on validation data
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

If you were to continue from here, you would evaluate the performance of these baselines with a metric such as Mean Absolute Error (MAE). You might also try more models, engineer new features, explore feature correlations or tune hyperparameters. But what if you were to use PyTorch Tabular?
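
In case MAE is new to you, here is a minimal sketch of what it computes: the average absolute difference between the true and the predicted values. The true ages are those of the first five crabs in the dataset, the predictions are hypothetical numbers invented for the illustration, and `mae_by_hand` is just a name I chose.

```python
def mae_by_hand(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


true_ages = [9, 8, 9, 11, 8]                 # ages of the first five crabs
predicted_ages = [9.5, 7.0, 10.0, 9.0, 8.5]  # hypothetical predictions

mae = mae_by_hand(true_ages, predicted_ages)
print(f"MAE: {mae:.2f}")
```

An MAE of 1 would mean that, on average, the predicted age is off by one unit of age.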

## 3. How PyTorch Tabular works

PyTorch Tabular is a library designed for handling tabular data using PyTorch, a very popular ML framework.

PyTorch Tabular provides an interface for building and training deep learning models specifically tailored for tabular datasets through 4 core components:

1. Data Configuration (data_config)

    This is the class where you define how your data is organised and pre-processed.

2. Trainer Configuration (trainer_config)

    This is where you control parts of the training process and other hyperparameters related to optimization.

3. Model Configuration (model_config)

    This defines the architecture and behavior of your model as well as the type of task you are trying to solve.

4. Optimizer Configuration (optimizer_config)

    This typically specifies the optimizer and some of its hyperparameters. You always have to instantiate it, but tuning its parameters is optional.

In the following sections I will show you how to use each of these components with the regression problem we have been working on. Before that, let's **install the package**:

In [8]:
!pip install pytorch_tabular[extra]

In [28]:
import pytorch_tabular

print(pytorch_tabular.__version__)

1.1.0


The `[extra]` suffix installs the complete library together with its extra dependencies.

## 4. Data Configuration

For DataConfig you only **need** to define the target column as well as continuous and categorical columns.

However there are some more parameters you can tune and some things to keep in mind:

- **Normalize continuous features (`normalize_continuous`)**:

    - By default, PyTorch Tabular normalizes continuous features, which often makes the model easier to optimize.
    - You can set this parameter to `False` to use the values in their original form.
    - Default: `True`
    
- **Transform continuous features (`continuous_feature_transform`)**:

    - PyTorch Tabular also offers built-in transformations for continuous features. Some of the options are:
    - `'quantile_normal'`, `'yeo-johnson'`, `'quantile_uniform'`, `'box-cox'`
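
To give an intuition for what normalizing a continuous feature does, here is a small sketch in plain Python. It assumes that normalization means standardizing to zero mean and unit standard deviation; the library handles this internally, so this is an illustration rather than its actual code. The values are the `Length` of the first five crabs and `standardize` is a name I made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


lengths = [1.525, 1.1, 1.3875, 1.7, 1.25]  # 'Length' of the first five crabs
scaled = standardize(lengths)

print([round(v, 2) for v in scaled])
```

After scaling, every continuous feature lives on a comparable scale, so a feature measured in large units (such as `Weight`) does not dominate one measured in small units (such as `Height`).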

PyTorch Tabular also has two parameters, `num_workers` and `pin_memory`, that change the data loading process, but we won't focus on them. You can read more about them [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

Let's define the DataConfig class for our problem:

In [10]:
from pytorch_tabular.config import DataConfig

data_config = DataConfig(

    target=['Age'],  # target should always be a list. Multi-targets are only supported for regression.

    continuous_cols=['Length', 'Height', 'Diameter', 'Shucked Weight', 'Viscera Weight', 'Shell Weight'], # our continuous features

    categorical_cols=['aut_F', 'aut_I', 'aut_M'], # our categorical features


)

Our dataset doesn't have any missing values. In case you are wondering, PyTorch Tabular can handle missing values in categorical features natively, but missing values in numerical features need to be handled separately.
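
If your own dataset does have gaps in a numerical column, one common option is to fill them with the median of the observed values before handing the data to PyTorch Tabular. The sketch below shows the idea in plain Python; the column with a missing value is hypothetical (our dataset has none) and `fill_missing_with_median` is a name chosen for the illustration.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


heights = [0.375, None, 0.375, 0.5, 0.3375]  # hypothetical column with a gap
filled = fill_missing_with_median(heights)

print(filled)
```

With pandas the same idea is usually written with `fillna`. Whichever way you do it, compute the median on the training set only and reuse it for the validation and test sets, so no information leaks from them.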

Let's move on to Trainer Configuration.

## 5. Trainer Configuration

In TrainerConfig we define how our model is trained. This is where you will start to find some DL concepts. I will provide a short description of them but not explain them in depth. You don't need to specify any parameters here, as all the mandatory ones have default values, but let's take a look at some of the most important ones:


- **Batch Size (`batch_size`)**:
    - Determines the number of samples processed in each training iteration.
    - Default: 64

- **Maximum Epochs (`max_epochs`)**:
    - Specifies the maximum number of training epochs before stopping.
    - Default: 10

-  **Early Stopping (`early_stopping`)**:
    - The loss/metric that is monitored for early stopping.
    - Default: 'valid_loss'

-  **Early Stopping Mode (`early_stopping_mode`)**:
    - The direction in which the loss/metric should be optimized.
    - Default: 'min' (minimize)

-  **Early Stopping Patience (`early_stopping_patience`)**:
    - The number of epochs to wait without any improvement in the loss/metric before training is stopped.
    - Default: 3

-  **Learning Rate Finder (`auto_lr_find`)**:
    - Lets you find a good learning rate and automatically use it for training the network.
    - Uses the method proposed in the paper [Cyclical Learning Rates for Training Neural Networks](https://arxiv.org/abs/1506.01186)
    - Default: False
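
The three early stopping parameters work together, and a small simulation can make that clearer. This is a simplified sketch of the logic, not the actual callback used under the hood (which has more options); the validation losses are invented and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(valid_losses, patience=3, mode="min"):
    """Return the 0-based epoch at which training would stop, or None."""
    best = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):
        if best is None:
            improved = True
        elif mode == "min":
            improved = loss < best
        else:
            improved = loss > best

        if improved:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None  # patience never ran out


# Invented validation losses, one per epoch
losses = [2.10, 1.80, 1.65, 1.66, 1.70, 1.64, 1.67, 1.68, 1.69, 1.60]
stop_epoch = early_stopping_epoch(losses, patience=3, mode="min")

print(f"Training would stop at epoch index {stop_epoch}")
```

Notice that the last loss in the list is the best one, but training never gets there: the patience of 3 runs out first. A small patience saves time but can stop training too early.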

There are many more parameters that allow you to personalize the training process. We covered some of the most basic ones, but you can always check the others [here](https://pytorch-tabular.readthedocs.io/en/latest/training/#pytorch_tabular.config.TrainerConfig).

In [11]:
from pytorch_tabular.config import TrainerConfig

trainer_config = TrainerConfig(

    batch_size=1024,  # Let's increase the default batch size; you may increase/decrease it further depending on your CPU/GPU
    max_epochs=100,  # Train for a higher number of epochs
)

For now, let's keep the training configuration simple and not define any more parameters. However, keep in mind that training will use the Early Stopping parameters with their default values! Let's move on to Optimizer Configuration.

## 6. Optimizer Configuration

With the OptimizerConfig class you can customize the optimizer and learning rate scheduler of your model. The most important parameter is the optimizer:

- **Optimizer (`optimizer`)**:

    - This parameter refers to the optimizer used to update the weights of neural networks during training through gradient descent.
    - Default: 'Adam'
    - SGD, RMSProp, AdamW and other valid PyTorch optimizers are also accepted.
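
If the idea of an optimizer updating weights feels abstract, here is a tiny sketch of plain gradient descent on a single weight. It is not Adam (which adapts the step size as it goes), only the basic update that these optimizers build on. The function being minimised and the name `gradient_descent` are made up for the illustration.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * gradient  # step against the gradient
    return w


w_final = gradient_descent(learning_rate=0.1)
print(f"w after 50 steps: {w_final:.4f}")  # the minimum is at w = 3
```

Try a much smaller learning rate such as 0.001 and you will see that 50 steps are no longer enough to get close to 3. That is the trade-off the learning rate controls.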

There are other parameters that you can find [here](https://pytorch-tabular.readthedocs.io/en/latest/tutorials/02-Exploring%20Advanced%20Features%20with%20PyTorch%20Tabular/#3-optimizerconfig). Since they go into more complex DL topics, we won't cover them in this tutorial.


In [12]:
from pytorch_tabular.config import OptimizerConfig

optimizer_config = OptimizerConfig()

Lastly, let's move on to Model Configuration.

## 7. Model Configuration

This is where we define which model to use and the corresponding hyperparameters. PyTorch Tabular has separate config classes for [each model it offers](https://pytorch-tabular.readthedocs.io/en/stable/models/#available-models). All of them share a few core parameters from a ModelConfig class, the most important of which are:

- **Type of task (`task`)**:

    - This defines whether we are running the model for a regression or a classification task.

- **Learning rate (`learning_rate`)**:

    - The learning rate of the model.
    - Default: 1e-3

- **Loss function (`loss`)**:

    - The loss function to be applied. 
    - By Default, it is MSELoss for regression and CrossEntropyLoss for classification.


- **Metrics to keep track of (`metrics`)**:

    - The list of metrics you need to track during training. 
    - The metrics should be one of the functional metrics implemented in torchmetrics. 
    - By default, it is accuracy if classification and mean_squared_error for regression
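
Since mean squared error is both the default loss and the default metric for regression, it helps to see what it computes. The sketch below does it by hand in plain Python; the predictions are hypothetical and `mse_by_hand` is a name chosen for the illustration. PyTorch's `MSELoss` computes the same average on tensors with its default settings.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


true_ages = [9, 8, 9, 11, 8]                 # ages of the first five crabs
predicted_ages = [9.5, 7.0, 10.0, 9.0, 8.5]  # hypothetical predictions

mse = mse_by_hand(true_ages, predicted_ages)

# How much of the total squared error comes from the single worst prediction?
squared_errors = [(t - p) ** 2 for t, p in zip(true_ages, predicted_ages)]
largest_share = max(squared_errors) / sum(squared_errors)

print(f"MSE: {mse:.2f}")
print(f"Share of the largest error: {largest_share:.0%}")
```

Because the errors are squared, one prediction that is off by 2 weighs more than all the other errors combined. Keep this in mind when comparing a model trained with MSE against a metric such as MAE.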


Let's create the model_config object for our problem using the GANDALFConfig class. Don't worry if you don't understand all of the parameters. Every model has its own parameters that you will need to study if you want to understand what they do; the link above takes you to the documentation of each model, where you can learn more about it and its parameters. Using ChatGPT is also a great way to learn about them!

In [13]:
from pytorch_tabular.models import GANDALFConfig

model_config = GANDALFConfig(
    task="regression",
    gflu_stages=6,
    gflu_feature_init_sparsity=0.3,
    gflu_dropout=0.0,
    learning_rate=1e-3,
)

## 8. Joining everything together to train our model

Now that we have defined each of the core components, we can start the training process with the TabularModel class. This is where we pass in the components we defined previously and start training our model:

In [16]:
from pytorch_tabular import TabularModel

tabular_model = TabularModel(
    data_config=data_config,        
    model_config=model_config,      
    optimizer_config=optimizer_config, 
    trainer_config=trainer_config,  
    verbose=True                    # Print training progress
)

tabular_model.fit(train=train, validation=val)

Seed set to 42


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


You are using a CUDA device ('NVIDIA GeForce RTX 3070 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
c:\Users\stude\anaconda3\envs\tutorial\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:639: Checkpoint directory C:\Users\stude\Documents\portfolio\PyTorch Tabular\saved_models exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]




<pytorch_lightning.trainer.trainer.Trainer at 0x23ecb14af80>

If you need a short visual reminder, here is what our full PyTorch Tabular code looks like:

In [None]:
from pytorch_tabular import TabularModel
from pytorch_tabular.models import GANDALFConfig
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig

data_config = DataConfig(

    target=['Age'],  # target should always be a list. Multi-targets are only supported for regression.
    continuous_cols=['Length', 'Height', 'Diameter', 'Shucked Weight', 'Viscera Weight', 'Shell Weight'], # our continuous features
    categorical_cols=['aut_F', 'aut_I', 'aut_M'], # our categorical features
    normalize_continuous = True # let's normalize the continuous features

)

trainer_config = TrainerConfig(

    batch_size=1024, # Let's increase the default batch size
    max_epochs=100, # Choose to train for a higher number of epochs
)

optimizer_config = OptimizerConfig()

model_config = GANDALFConfig(
    task="regression",
    gflu_stages=6,
    gflu_feature_init_sparsity=0.3,
    gflu_dropout=0.0,
    learning_rate=1e-3,
)

tabular_model = TabularModel(
    data_config=data_config,        
    model_config=model_config,      
    optimizer_config=optimizer_config, 
    trainer_config=trainer_config,  
    verbose=True                    # Print training progress
)

tabular_model.fit(train=train, validation=val)


## 10. Evaluation and comparison: DL vs Traditional ML

In [17]:
import time
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

# Timing the prediction and calculating the MAE for each model

# Random Forest Model
start_time = time.time()
rf_pred = rf_model.predict(X_test)
rf_time = time.time() - start_time
rf_mae = mean_absolute_error(y_test, rf_pred)

# XGBoost Model
start_time = time.time()
xgb_pred = xgb_model.predict(X_test)
xgb_time = time.time() - start_time
xgb_mae = mean_absolute_error(y_test, xgb_pred)

# Linear Regression Model
start_time = time.time()
lr_pred = lr_model.predict(X_test)
lr_time = time.time() - start_time
lr_mae = mean_absolute_error(y_test, lr_pred)

# Gandalf Model
start_time = time.time()
gandalf_pred = tabular_model.predict(test)
gandalf_time = time.time() - start_time
true_values = test['Age']
gandalf_mae = mean_absolute_error(true_values, gandalf_pred)

# Create a DataFrame with Model, MAE, and Inference Time
data = {
    'model': ['Random Forest', 'XGBoost', 'Linear Regression', 'Gandalf'],
    'MAE': [rf_mae, xgb_mae, lr_mae, gandalf_mae],
    'inference_time': [rf_time, xgb_time, lr_time, gandalf_time]
}

# Use a separate name so the original dataframe `df` is not overwritten
results_df = pd.DataFrame(data)

# Sort the DataFrame by MAE in ascending order
df_sort_mae = results_df.sort_values(by='MAE', ascending=True)

df_sort_mae


| | model | MAE | inference_time |
|---|---|---|---|
| 1 | XGBoost | 1.427673 | 0.012975 |
| 3 | Gandalf | 1.431624 | 0.43446 |
| 0 | Random Forest | 1.465381 | 0.472417 |
| 2 | Linear Regression | 1.488617 | 0.0 |
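
The inference time of 0.0 for Linear Regression is most likely an artefact of the timer rather than a truly instantaneous prediction: on some platforms `time.time()` ticks too coarsely to measure very fast calls (this is my assumption, I did not verify it on this machine). A more robust sketch uses `time.perf_counter()` and keeps the best of several runs. `time_call` and `fake_predict` are hypothetical helpers, and the fake predictor only stands in for a real model so that the example runs on its own.

```python
import time


def time_call(func, *args, repeats=5):
    """Return func(*args) and the best wall-clock time over `repeats` runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def fake_predict(rows):
    """Stand-in for a model's predict method (sums each row)."""
    return [sum(row) for row in rows]


rows = [[1.0, 2.0, 3.0]] * 10_000
predictions, elapsed = time_call(fake_predict, rows)

print(f"{len(predictions)} predictions in {elapsed:.6f} s")
```

In the comparison above you could use it as `time_call(lr_model.predict, X_test)`. Also keep in mind that a single timing on a laptop is noisy, so treat small differences in inference time with caution.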


## 11. Saving, Loading and Predicting with our Model

Saving and loading models in PyTorch Tabular is very simple! All we need is to use the `save_model` and `load_model` methods respectively.

Furthermore, `save_model` has a very useful feature: by default it also saves the datamodule, which contains the training, validation, and test data.

This can be disabled by setting `inference_only` to `True`.

In [52]:
tabular_model.save_model("models/simple_gandalf", inference_only = False)

In [53]:
gandalf_model = TabularModel.load_model("models/simple_gandalf")

c:\Users\stude\anaconda3\envs\tutorial\lib\site-packages\lightning_fabric\utilities\cloud_io.py:56: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.


Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [54]:
# Make predictions using Gandalf's model with the .predict() method
gandalf_prediction = gandalf_model.predict(test)

# Ensure that both arrays are 1-dimensional
true_values_flat = test['Age'].values.flatten() 
gandalf_pred_flat = gandalf_prediction.values.flatten() 

# Create a DataFrame with Gandalf's predictions and the true values
gandalf_results = pd.DataFrame({
    'True Values': true_values_flat,
    'Gandalf Predictions': gandalf_pred_flat
})

# Display the DataFrame
gandalf_results


| | True Values | Gandalf Predictions |
|---|---|---|
| 0 | 18 | 13.383442 |
| 1 | 6 | 5.344850 |
| 2 | 8 | 9.801076 |
| 3 | 8 | 7.898350 |
| 4 | 8 | 7.461432 |
| ... | ... | ... |
| 14806 | 9 | 8.847288 |
| 14807 | 10 | 10.874093 |
| 14808 | 11 | 10.625701 |
| 14809 | 3 | 4.783142 |


## 12. Conclusion

In this tutorial we covered the very basics of how to train a Deep Learning model to solve a regression problem! 

- First we described the dataset we used
- Then we showed how you would approach the problem with traditional ML
- We learned how PyTorch Tabular works and trained a Gandalf model with it
- We compared the models we created on MAE and inference time
- Lastly we learned how to save, load and predict with our model

This first tutorial already contained a lot of information. Don't worry if you feel overwhelmed; with practice everything will become clearer 🙂

In the next tutorial we will cover cross-validation with PyTorch Tabular 😁

I hope you found this tutorial useful and that you come back. The next tutorial will come out soon!

## Author:

Francisco Ribeiro Mansilha

Feel free to contact me if you have any thoughts you would like to share or spot any mistakes!

francisco.ribeiro.mansilha@gmail.com