# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network:

In [1]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset ('hdb_price_prediction.csv') into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [2]:
# Load the dataset
df = pd.read_csv('hdb_price_prediction.csv')

# Divide the dataset into train and test sets by applying the given year conditions
train_df = df[df['year'] <= 2020]  # Training data includes entries from year 2020 and before
test_df = df[df['year'] >= 2021]  # Test data includes entries from year 2021 and after
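
A year-based split like this is worth a quick sanity check: the two subsets should not overlap, together they should cover every row, and no training row should come from after 2020. The sketch below runs those checks on a small made-up frame (the toy values are assumptions for illustration, not rows from `hdb_price_prediction.csv`); the same assertions could be applied to `train_df` and `test_df`.

```python
import pandas as pd

# Toy stand-in for the HDB data; values are invented for illustration only.
toy_df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300000.0, 320000.0, 410000.0, 450000.0, 520000.0],
})

# Same conditions as the split above.
toy_train = toy_df[toy_df["year"] <= 2020]
toy_test = toy_df[toy_df["year"] >= 2021]

# The two subsets should partition the data: no overlap, no dropped rows.
assert len(toy_train) + len(toy_test) == len(toy_df)
assert toy_train.index.intersection(toy_test.index).empty

# No future information leaks into the training set.
assert toy_train["year"].max() <= 2020
assert toy_test["year"].min() >= 2021

print(len(toy_train), len(toy_test))
```

Splitting by time rather than at random means the test score measures how well the model extrapolates to later years, which is usually a harder task than interpolating within the training period.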

2. Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)
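
To make the architecture in these tasks concrete, here is a minimal NumPy sketch of the idea behind an embedding-plus-MLP tabular network: each categorical column is looked up in its own embedding table, the embeddings are concatenated with the continuous features, and the result passes through two dense layers of 200 and 100 units. This is not the library's implementation. The column counts, embedding widths, and the ReLU activation are assumptions chosen for illustration, and dropout, normalisation, and the final regression output layer are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 categorical columns and 3 continuous columns.
n_categories = [5, 4]      # number of distinct values per categorical column
embed_dims = [3, 2]        # embedding width chosen for each categorical column
n_continuous = 3
hidden_dims = [200, 100]   # the two MLP layers requested in the task

# One lookup table per categorical column (randomly initialised here;
# in a real network these would be learned during training).
embed_tables = [rng.normal(size=(n, d)) for n, d in zip(n_categories, embed_dims)]

# Dense layer parameters: input -> 200 -> 100.
in_dim = sum(embed_dims) + n_continuous
layer_dims = [in_dim] + hidden_dims
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(layer_dims[:-1], layer_dims[1:])]
biases = [np.zeros(b) for b in layer_dims[1:]]


def forward(x_cat, x_cont):
    """Embed categorical codes, concatenate with continuous features, apply the MLP."""
    embedded = [table[x_cat[:, i]] for i, table in enumerate(embed_tables)]
    h = np.concatenate(embedded + [x_cont], axis=1)
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)  # linear layer followed by ReLU
    return h


# A batch of 4 rows: integer category codes plus continuous values.
x_cat = np.array([[0, 1], [4, 3], [2, 0], [1, 2]])
x_cont = rng.normal(size=(4, n_continuous))
out = forward(x_cat, x_cont)
print(out.shape)  # (4, 100)
```

The integer category codes play the role of the label-encoded columns that a tabular preprocessor produces, which is why the preprocessing step comes before the model is built.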

In [3]:
# Define the target
target = train_df['resale_price'].values

# Column type variables from the assignment pdf file
categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']  # Categorical columns
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']  # Continuous columns

# Create the TabPreprocessor
tab_preprocessor = TabPreprocessor(cat_embed_cols=categorical_cols, continuous_cols=continuous_cols)

# Transform the training dataset
X_tab = tab_preprocessor.fit_transform(train_df)

# Create the TabMlp model with 2 linear layers in the MLP, with 200 and 100 neurons respectively
tabmlp = TabMlp(
    mlp_hidden_dims=[200, 100],  # 2 linear layers in the MLP, with 200 and 100 neurons respectively
    column_idx=tab_preprocessor.column_idx,  # Column indices
    cat_embed_input=tab_preprocessor.cat_embed_input,  # Embedding input
    continuous_cols=continuous_cols  # Continuous columns
)

# Create the WideDeep model
model = WideDeep(deeptabular=tabmlp)

# Create the Trainer
trainer = Trainer(
    model=model,  # Pass the model
    cost_function="rmse",  # RMSE cost function
    metrics=[R2Score()],  # R2 score
    num_workers=0  # Set the num_workers parameter to 0
)

# Define the epochs and batch size
no_epochs = 100
batch_size = 64

# Train the model
trainer.fit(
    X_tab=X_tab,  # Pass the transformed training dataset
    target=target,  # Target variable
    n_epochs=no_epochs,  # Number of epochs
    batch_size=batch_size  # Batch size
)

epoch 1: 100%|██████████| 1366/1366 [00:06<00:00, 197.87it/s, loss=1.84e+5, metrics={'r2': -1.2476}]
epoch 2: 100%|██████████| 1366/1366 [00:06<00:00, 205.21it/s, loss=1.01e+5, metrics={'r2': 0.4758}]
epoch 3: 100%|██████████| 1366/1366 [00:06<00:00, 205.43it/s, loss=7.96e+4, metrics={'r2': 0.6846}]
epoch 4: 100%|██████████| 1366/1366 [00:06<00:00, 207.05it/s, loss=6.6e+4, metrics={'r2': 0.7979}] 
epoch 5: 100%|██████████| 1366/1366 [00:06<00:00, 202.77it/s, loss=6.13e+4, metrics={'r2': 0.8291}]
epoch 6: 100%|██████████| 1366/1366 [00:06<00:00, 203.77it/s, loss=5.92e+4, metrics={'r2': 0.8423}]
epoch 7: 100%|██████████| 1366/1366 [00:06<00:00, 199.56it/s, loss=5.79e+4, metrics={'r2': 0.8495}]
epoch 8: 100%|██████████| 1366/1366 [00:06<00:00, 200.82it/s, loss=5.68e+4, metrics={'r2': 0.8552}]
epoch 9: 100%|██████████| 1366/1366 [00:06<00:00, 208.91it/s, loss=5.58e+4, metrics={'r2': 0.8607}]
epoch 10: 100%|██████████| 1366/1366 [00:06<00:00, 206.53it/s, loss=5.47e+4, metrics={'r2': 0.8663}]
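
The progress bars above show 1366 batches per epoch. With a batch size of 64, that pins down the approximate size of the training set, which is a cheap way to confirm that the year filter kept the expected number of rows. The sketch below assumes every training row is used in each epoch and that the last, possibly smaller, batch is kept rather than dropped; neither assumption is confirmed by the output itself.

```python
import math

batch_size = 64            # from the training call
batches_per_epoch = 1366   # from the progress bars

# ceil(n / batch_size) == batches_per_epoch holds exactly for n in this range.
min_rows = (batches_per_epoch - 1) * batch_size + 1
max_rows = batches_per_epoch * batch_size

assert math.ceil(min_rows / batch_size) == batches_per_epoch
assert math.ceil(max_rows / batch_size) == batches_per_epoch
assert math.ceil((min_rows - 1) / batch_size) == batches_per_epoch - 1

print(min_rows, max_rows)  # 87361 87424
```

Under those assumptions, `len(train_df)` should fall between 87,361 and 87,424.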

In [4]:
# Transform the test dataset
X_tab_test = tab_preprocessor.transform(test_df)

# Predict the target variable
y_pred = trainer.predict(X_tab=X_tab_test)
y_pred

predict: 100%|██████████| 1128/1128 [00:02<00:00, 519.41it/s]


array([173035.72, 192246.64, 289562.8 , ..., 594494.4 , 518723.94,
       553011.7 ], dtype=float32)

3. Report the test RMSE and the test R2 value that you obtained.

In [5]:
# Import the dependencies we will need to compute the RMSE and R2
from sklearn.metrics import mean_squared_error, r2_score

# Define the ground truth and the predictions
y_true = test_df['resale_price']  # Ground truth

print('RMSE & R2')

# Compute the RMSE as the square root of the MSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f'Test RMSE: {rmse}')

# Compute the R2 value
r2 = r2_score(y_true, y_pred) 
print(f'Test R2: {r2}')

RMSE & R2
Test RMSE: 100703.4402070674
Test R2: 0.6456869534456644
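
For reference, both metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below is a minimal NumPy version run on made-up numbers (the helper name `rmse_and_r2` and the toy prices are assumptions for illustration); applied to `y_true` and `y_pred` from the cell above, it should agree with the scikit-learn values up to floating-point error.

```python
import numpy as np


def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R2) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2


# Toy prices, invented for illustration only.
toy_true = [300000.0, 400000.0, 500000.0]
toy_pred = [310000.0, 390000.0, 520000.0]

toy_rmse, toy_r2 = rmse_and_r2(toy_true, toy_pred)
print(f"RMSE: {toy_rmse:.2f}, R2: {toy_r2:.2f}")  # RMSE: 14142.14, R2: 0.97
```

RMSE is expressed in the same units as the target (here, resale price), while R2 is unitless, so the two are best read together.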
