# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model. This creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will just be utilizing the deeptabular component of this library through the TabMlp network:

In [1]:
!pip install pytorch-widedeep



In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
import torch

df = pd.read_csv('hdb_price_prediction.csv')

# Divide the dataset into train and test sets
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]

# Define target variable and features
target = ['resale_price']
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
cat_embed_cols = ['month', 'town', 'flat_model_type', 'storey_range']


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [4]:
tab_preprocessor = TabPreprocessor(
    continuous_cols=continuous_cols,
    categorical_cols=cat_embed_cols
)
X_train_tab = tab_preprocessor.fit_transform(train_data)
X_test_tab = tab_preprocessor.transform(test_data)

# Define the TabMlp model
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[200, 100]    
)

model = WideDeep(deeptabular=tab_mlp)

# Create a Trainer for the model
trainer = Trainer(
    model,
    objective="rmse",
    metrics=[R2Score()],
    verbose=1,
    seed=42,
    num_workers=0
)

# Define the target variable for training
target_train = train_data['resale_price'].values

# Train the model
trainer.fit(X_tab=X_train_tab, target=target_train, n_epochs=100, batch_size=64)

epoch 1: 100%|█████████████████████████████| 1366/1366 [00:12<00:00, 111.28it/s, loss=2.76e+5, metrics={'r2': -3.2408}]
epoch 2: 100%|██████████████████████████████| 1366/1366 [00:11<00:00, 117.89it/s, loss=1.24e+5, metrics={'r2': 0.3058}]
epoch 3: 100%|██████████████████████████████| 1366/1366 [00:10<00:00, 125.53it/s, loss=1.19e+5, metrics={'r2': 0.3494}]
epoch 4: 100%|██████████████████████████████| 1366/1366 [00:11<00:00, 119.93it/s, loss=1.18e+5, metrics={'r2': 0.3588}]
epoch 5: 100%|██████████████████████████████| 1366/1366 [00:12<00:00, 105.49it/s, loss=1.18e+5, metrics={'r2': 0.3594}]
epoch 6: 100%|███████████████████████████████| 1366/1366 [00:12<00:00, 107.47it/s, loss=1.18e+5, metrics={'r2': 0.363}]
epoch 7: 100%|██████████████████████████████| 1366/1366 [00:11<00:00, 114.18it/s, loss=1.18e+5, metrics={'r2': 0.3656}]
epoch 8: 100%|██████████████████████████████| 1366/1366 [00:13<00:00, 101.00it/s, loss=1.17e+5, metrics={'r2': 0.3667}]
epoch 9: 100%|██████████████████████████

>Report the test RMSE and the test R2 value that you obtained.

In [5]:
# Get predictions from the trained model
target_test = test_data['resale_price'].values
predictions = trainer.predict(X_tab=X_test_tab)

# Calculate RMSE
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(target_test, predictions, squared=False)

# Calculate R2 score
from sklearn.metrics import r2_score
r2 = r2_score(target_test, predictions)

print(f"Test RMSE: {rmse}")
print(f"Test R2 Score: {r2}")

predict: 100%|████████████████████████████████████████████████████████████████████| 1128/1128 [00:03<00:00, 341.44it/s]

Test RMSE: 144876.06406611577
Test R2 Score: 0.2666829079552636



