# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.

In [1]:
!pip install pytorch-widedeep



In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets, using entries from 2020 and earlier as training data and entries from 2021 onwards as test data.

In [4]:
df = pd.read_csv('hdb_price_prediction.csv')

# Training Data
df_train = df[df['year'] <= 2020].copy()
# Testing Data
df_test = df[df['year'] >= 2021].copy()

# Dropping unnecessary columns
df_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_test.drop(columns=['year','full_address','nearest_stn'], inplace=True)

print("Training Data:", df_train.shape)
print("Testing Data:", df_test.shape)

Training Data: (87370, 11)
Testing Data: (72183, 11)


2. Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [7]:
# Define continuous and categorical columns
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
                   'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
cat_embed_cols = ['month', 'town', 'flat_model_type', 'storey_range']
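
The columns listed in `cat_embed_cols` are the ones that receive learnable embeddings. As a rough illustration of the idea only (not the library's implementation), the sketch below maps each category to an integer index, looks up one vector per category, and concatenates it with the continuous features. The town names, the embedding size, and the helper `build_input` are hypothetical and chosen purely for the example.

```python
import random

random.seed(0)

# Hypothetical vocabulary for a single categorical column such as 'town'
towns = ["ANG MO KIO", "BEDOK", "TAMPINES"]
town_to_index = {name: i for i, name in enumerate(towns)}

# One vector per category; in a real network these values are trained,
# here they are only randomly initialised
embed_dim = 4
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in towns
]

def build_input(town, continuous_values):
    """Concatenate the town's embedding with the continuous features."""
    return embedding_table[town_to_index[town]] + list(continuous_values)

# The MLP would receive a vector of length embed_dim + number of continuous features
x = build_input("BEDOK", [0.3, -1.2])
print(len(x))
```

With several categorical columns, each column would have its own table, and all looked-up vectors would be concatenated before the first linear layer.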

In [8]:
# Create and fit the TabPreprocessor
tab_preprocessor = TabPreprocessor(
    cat_embed_cols = cat_embed_cols, continuous_cols = continuous_cols
)

# Scaled Training Data
X_tab = tab_preprocessor.fit_transform(df_train)



In [10]:
tab_mlp = TabMlp(column_idx=tab_preprocessor.column_idx,
                 cat_embed_input=tab_preprocessor.cat_embed_input,
                 cat_embed_dropout=0.1,
                 continuous_cols=continuous_cols,
                 mlp_hidden_dims=[200, 100]) # Two linear layers with 200 and 100 neurons

wide_deep = WideDeep(deeptabular=tab_mlp)

In [11]:
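# Create the Trainer
trainer = Trainer(
    model=wide_deep,
    objective="regression",  # "regression" selects the MSE loss; "rmse" is the alias for RMSE
    metrics=[R2Score],
    batch_size=64,  # batch_size is an argument of Trainer.fit(); given here it does not set the batch size
    num_workers=0
)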

In [12]:
# Train the model
trainer.fit(
    X_tab=X_tab,
    target=df_train['resale_price'],
    n_epochs=100,
)

epoch 1: 100%|██████████| 2731/2731 [00:07<00:00, 347.16it/s, loss=3.61e+10, metrics={'r2': -0.5192}]
epoch 2: 100%|██████████| 2731/2731 [00:07<00:00, 354.41it/s, loss=8.85e+9, metrics={'r2': 0.6274}] 
epoch 3: 100%|██████████| 2731/2731 [00:07<00:00, 369.80it/s, loss=5.35e+9, metrics={'r2': 0.7748}]
epoch 4: 100%|██████████| 2731/2731 [00:07<00:00, 369.40it/s, loss=4.32e+9, metrics={'r2': 0.8183}]
epoch 5: 100%|██████████| 2731/2731 [00:07<00:00, 379.96it/s, loss=3.95e+9, metrics={'r2': 0.8336}]
epoch 6: 100%|██████████| 2731/2731 [00:07<00:00, 355.66it/s, loss=3.72e+9, metrics={'r2': 0.8432}]
epoch 26: 100%|██████████| 2731/2731 [00:08<00:00, 331.14it/s, loss=2.61e+9, metrics={'r2': 0.8903}]
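
The progress bar reports 2731 iterations per epoch on 87370 training rows. As a quick arithmetic check, the sketch below recovers the batch size implied by that count and converts the epoch-26 loss into price units. It assumes the final partial batch is kept rather than dropped, and that the logged loss is a mean squared error, which is what the `"regression"` objective uses.

```python
import math

n_train = 87370          # rows in df_train, from the shape printed earlier
steps_per_epoch = 2731   # iterations per epoch, from the progress bar

def steps(n_rows, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)

# Batch sizes consistent with the observed number of iterations
candidates = [b for b in (16, 32, 64, 128) if steps(n_train, b) == steps_per_epoch]
print(candidates)            # [32]
print(steps(n_train, 64))    # 1366

# Square root of the epoch-26 loss, assuming the loss is an MSE
rmse_epoch_26 = math.sqrt(2.61e9)
print(round(rmse_epoch_26))  # 51088
```

Under these assumptions the log corresponds to a batch size of 32, the default of `Trainer.fit`, and a training RMSE of roughly 51,000 at epoch 26. To train with batches of 64, `batch_size=64` would be passed to `trainer.fit(...)`.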

3. Report the test RMSE and the test R2 value that you obtained.

In [21]:
# Scaled Test Data
X_test = tab_preprocessor.transform(df_test)

# Make predictions on the test dataset
y_pred = trainer.predict(X_tab = X_test)

predict: 100%|██████████| 2256/2256 [00:02<00:00, 888.38it/s]


In [22]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_true = df_test['resale_price']

mse = mean_squared_error(y_true, y_pred)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))
r2 = r2_score(y_true, y_pred)
print("R2 Score:", r2)

Root Mean Squared Error (RMSE): 94274.93646960524
R2 Score: 0.6894789636373253
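
For reference, the two reported metrics can be written out from their definitions. The sketch below is a plain-Python illustration of what `mean_squared_error` followed by a square root, and `r2_score`, compute. The three price pairs are made-up toy values, not rows from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
toy_true = [300000.0, 450000.0, 600000.0]
toy_pred = [310000.0, 440000.0, 630000.0]

toy_rmse = rmse(toy_true, toy_pred)
toy_r2 = r2(toy_true, toy_pred)
print(toy_rmse, toy_r2)
```

RMSE is expressed in the same units as `resale_price`, while R2 compares the model's squared error with that of always predicting the mean of the test targets.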
