# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model. This creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will just be utilizing the deeptabular component of this library through the TabMlp network:

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.3.2-py3-none-any.whl (21.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.8/21.8 MB[0m [31m38.2 MB/s[0m eta [36m0:00:00[0m
Collecting einops (from pytorch-widedeep)
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.6/44.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
Collecting torchmetrics (from pytorch-widedeep)
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m805.2/805.2 kB[0m [31m54.1 MB/s[0m eta [36m0:00:00[0m
Collecting fastparquet>=0.8.1 (from pytorch-widedeep)
  Downloading fastparquet-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m62.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cramjam>=2.3 (from fastparquet>=0.8.1->pytorch-wided

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd
import torch

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score



>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [15]:
# TODO: Enter your code here
# The train and test data is prepared.
df_start = pd.read_csv('hdb_price_prediction.csv',index_col = None)
columns_to_drop = ['full_address', 'nearest_stn']
df = df_start.drop(columns_to_drop,axis=1)

continuous_columns = ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"]
categorical_columns = ["month", "town", "flat_model_type", "storey_range"]
target_column = ["resale_price"]
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]

train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

drop_year = ['year']

train_data = train_data.drop(drop_year,axis=1)
test_data = test_data.drop(drop_year,axis=1)


In [16]:
index_dict = {}
i = 0
for column_name in train_data.columns:
  index_dict[column_name] = i
  i += 1

print(index_dict)

{'month': 0, 'town': 1, 'dist_to_nearest_stn': 2, 'dist_to_dhoby': 3, 'degree_centrality': 4, 'eigenvector_centrality': 5, 'flat_model_type': 6, 'remaining_lease_years': 7, 'floor_area_sqm': 8, 'storey_range': 9, 'resale_price': 10}


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [17]:
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_columns, continuous_cols=continuous_columns
)

X_train_data = train_data.drop(target_column, axis = 1)
X_train = tab_preprocessor.fit_transform(X_train_data)
y_train = train_data['resale_price'].values

X_test_data = test_data.drop(target_column, axis = 1)



In [18]:
col_names = categorical_columns + continuous_columns
column_idx = {k:v for v,k in enumerate(col_names)}
cat_embed_input = [(u,i,j) for u,i,j in zip(categorical_columns, [100]*4, [200]*4)]
tabmlp_model = TabMlp(column_idx = column_idx, cat_embed_input = cat_embed_input, continuous_cols = continuous_columns)
model = WideDeep(deeptabular = tabmlp_model)
trainer = Trainer(model = model, objective = "rmse", metrics = [R2Score()], num_workers = 0,)

In [19]:
trainer.fit(
    X_tab = X_train,
    target = y_train,
    n_epochs=100,
    batch_size=64,
)

epoch 1: 100%|██████████| 1366/1366 [00:13<00:00, 104.99it/s, loss=1.68e+5, metrics={'r2': -0.9832}]
epoch 2: 100%|██████████| 1366/1366 [00:13<00:00, 100.11it/s, loss=6.55e+4, metrics={'r2': 0.7981}]
epoch 3: 100%|██████████| 1366/1366 [00:12<00:00, 106.15it/s, loss=6.1e+4, metrics={'r2': 0.8311}]
epoch 4: 100%|██████████| 1366/1366 [00:13<00:00, 101.33it/s, loss=5.97e+4, metrics={'r2': 0.8393}]
epoch 5: 100%|██████████| 1366/1366 [00:13<00:00, 101.90it/s, loss=5.91e+4, metrics={'r2': 0.8424}]
epoch 6: 100%|██████████| 1366/1366 [00:13<00:00, 101.28it/s, loss=5.88e+4, metrics={'r2': 0.845}]
epoch 7: 100%|██████████| 1366/1366 [00:13<00:00, 100.16it/s, loss=5.82e+4, metrics={'r2': 0.8477}]
epoch 8: 100%|██████████| 1366/1366 [00:13<00:00, 101.39it/s, loss=5.79e+4, metrics={'r2': 0.8492}]
epoch 9: 100%|██████████| 1366/1366 [00:13<00:00, 101.43it/s, loss=5.77e+4, metrics={'r2': 0.8505}]
epoch 10: 100%|██████████| 1366/1366 [00:13<00:00, 101.69it/s, loss=5.7e+4, metrics={'r2': 0.8538}]
e

In [20]:
X_test = tab_preprocessor.transform(X_test_data)
y_test = test_data['resale_price'].values
preds = trainer.predict(X_tab = X_test, batch_size = 64)

predict: 100%|██████████| 1128/1128 [00:05<00:00, 216.68it/s]


In [21]:
pred_df = pd.DataFrame(data=preds, columns=['predictions'])


In [22]:
pred_df['actual'] = pd.DataFrame(data=y_test)

In [23]:
pred_df # Just to compare the predictions with the actual values

Unnamed: 0,predictions,actual
0,155490.437500,211000.0
1,178643.859375,225000.0
2,288589.812500,260000.0
3,289925.718750,265000.0
4,272628.000000,265000.0
...,...,...
72178,578773.250000,780000.0
72179,626387.375000,808000.0
72180,617686.562500,788888.0
72181,563054.812500,822800.0


>Report the test RMSE and the test R2 value that you obtained.

In [24]:
from sklearn.metrics import mean_squared_error
# Calculating RMSE
rmse = np.sqrt(mean_squared_error(pred_df['actual'], pred_df['predictions']))
r2_inbuilt = R2Score()
print('R2 = ', r2_inbuilt(torch.tensor(pred_df['actual'].values), torch.tensor(pred_df['predictions'].values)))
# Calculating R-squared (R2) value

print(f"Test RMSE: {rmse:.2f}")


R2 =  0.5584535983423558
Test RMSE: 101753.57
