# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we will use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network:

In [1]:
!pip install pytorch-widedeep






In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# YOUR CODE HERE
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]
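
A year-based split like the one above can be sanity-checked by confirming that every row lands in exactly one part and that no training row is dated later than a test row. The sketch below does this on a small made-up DataFrame; the `toy` frame, its values, and the `CUTOFF_YEAR` name are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300_000, 320_000, 410_000, 450_000, 500_000],
})

CUTOFF_YEAR = 2020  # last year that belongs to the training set

train_part = toy[toy["year"] <= CUTOFF_YEAR]
test_part = toy[toy["year"] > CUTOFF_YEAR]

# The two parts should cover every row exactly once ...
assert len(train_part) + len(test_part) == len(toy)
# ... and no training row should be dated after the earliest test row.
assert train_part["year"].max() < test_part["year"].min()

print(len(train_part), len(test_part))
```

Splitting by time rather than at random means the test set measures how well the model extrapolates to later years, which is usually a harder task than interpolating within the training period.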

2. Refer to the documentation of Pytorch-WideDeep (https://pytorch-widedeep.readthedocs.io/en/latest/index.html) and perform the following tasks:
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [4]:
# YOUR CODE & RESULT HERE
# categorical columns
cat_embed_cols = ['month', 'town', 'flat_model_type', 'storey_range']

# continuous columns
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

# create the preprocessor for the deeptabular component
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols,
    continuous_cols=continuous_cols,
    cols_to_scale=continuous_cols
)

# preprocess data
train_target = train_data['resale_price'].values
test_target = test_data['resale_price'].values
train_data = tab_preprocessor.fit_transform(train_data)
test_data = tab_preprocessor.transform(test_data)

# initialize model
tabmlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)
model = WideDeep(deeptabular=tabmlp)  # since trainer needs WideDeep class

# create the trainer
trainer = Trainer(
    model=model,
    objective='rmse',
    metrics=[R2Score],
    num_workers=0,
    seed=SEED
)

# training
trainer.fit(
    X_tab=train_data,
    target=train_target,
    n_epochs=100,
    batch_size=64
)


epoch 1: 100%|██████████| 1366/1366 [00:11<00:00, 122.42it/s, loss=2.39e+5, metrics={'r2': -2.4291}]
epoch 2: 100%|██████████| 1366/1366 [00:09<00:00, 141.81it/s, loss=8.22e+4, metrics={'r2': 0.6577}]
epoch 3: 100%|██████████| 1366/1366 [00:09<00:00, 141.82it/s, loss=6.31e+4, metrics={'r2': 0.81}]  
epoch 4: 100%|██████████| 1366/1366 [00:09<00:00, 149.18it/s, loss=5.81e+4, metrics={'r2': 0.8436}]
epoch 5: 100%|██████████| 1366/1366 [00:09<00:00, 144.47it/s, loss=5.57e+4, metrics={'r2': 0.8588}]
epoch 6: 100%|██████████| 1366/1366 [00:10<00:00, 133.30it/s, loss=5.44e+4, metrics={'r2': 0.8664}]
epoch 7: 100%|██████████| 1366/1366 [00:09<00:00, 137.68it/s, loss=5.36e+4, metrics={'r2': 0.8708}]
epoch 8: 100%|██████████| 1366/1366 [00:09<00:00, 143.37it/s, loss=5.32e+4, metrics={'r2': 0.8733}]
epoch 9: 100%|██████████| 1366/1366 [00:09<00:00, 142.80it/s, loss=5.3e+4, metrics={'r2': 0.8737}] 
epoch 10: 100%|██████████| 1366/1366 [00:09<00:00, 148.05it/s, loss=5.27e+4, metrics={'r2': 0.8755}

3. Report the test RMSE and the test R2 value that you obtained.

In [5]:
# YOUR CODE & RESULT HERE
from sklearn.metrics import r2_score, mean_squared_error

# test the model
res = trainer.predict(X_tab=test_data, batch_size=64)

# report RMSE and R2
RMSE = np.sqrt(mean_squared_error(test_target, res))
r2 = r2_score(test_target, res)

print(f"RMSE: {RMSE}")
print(f"R2: {r2}")

predict: 100%|██████████| 1128/1128 [00:03<00:00, 334.10it/s]

RMSE: 99218.29356196288
R2: 0.6560605147559431
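
To make the two reported metrics concrete, here is a minimal sketch that computes RMSE and R2 directly from their standard definitions, using only the standard library. The helper names `rmse` and `r2` and the three sample prices are illustrative assumptions; the numbers are made up and are not taken from the model output above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up prices, purely for illustration.
y_true = [300_000.0, 400_000.0, 500_000.0]
y_pred = [310_000.0, 390_000.0, 520_000.0]

example_rmse = rmse(y_true, y_pred)
example_r2 = r2(y_true, y_pred)
print(f"RMSE: {example_rmse:.2f}")
print(f"R2: {example_r2:.4f}")
```

RMSE is expressed in the same units as the target (here, the resale price), while R2 is unitless and compares the model's squared error against that of always predicting the mean of the true values.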



