# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network.
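
To make "learnable embeddings" concrete, here is a minimal sketch in plain Python of what happens to one row before it reaches the MLP: the categorical value is mapped to an index, the index selects a row of an embedding table, and that row is concatenated with the continuous features. The category labels, `EMBED_DIM`, and the helper `encode_row` are hypothetical and chosen only for illustration. This is not how TabMlp is implemented internally, where the table is a trainable parameter updated by gradient descent.

```python
import random

rng = random.Random(42)

# Hypothetical toy vocabulary for a single categorical column.
categories = ["town_a", "town_b", "town_c"]
cat_to_idx = {name: i for i, name in enumerate(categories)}

EMBED_DIM = 2  # hypothetical embedding size

# One row of weights per category. Randomly initialised here; in a real
# model these values are learned during training.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in categories
]


def encode_row(category, continuous_values):
    """Look up the category embedding and append the continuous features."""
    embedding = embedding_table[cat_to_idx[category]]
    return embedding + list(continuous_values)


x = encode_row("town_b", [92.0, 0.75])
print(len(x))  # EMBED_DIM embedding values followed by 2 continuous values
```

The resulting vector `x` is what a feedforward network would receive as input for that row.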

In [4]:
!pip install pytorch-widedeep


In [8]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from sklearn.metrics import mean_squared_error
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
# Load the dataset and split it chronologically:
# 2020 and earlier for training, 2021 onwards for testing.
df = pd.read_csv('hdb_price_prediction.csv')
train_df = df[df['year'] <= 2020]
test_df = df[df['year'] >= 2021]


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [4]:
# Continuous and categorical feature columns used by the deeptabular component.
continuous_cols = [
    'dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
    'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm',
]
cat_col_names = ['month', 'town', 'flat_model_type', 'storey_range']

# TabPreprocessor reads each tuple as (column name, embedding dimension).
# Here the embedding dimension of each column is set to its number of
# unique values in the training set.
cat_embed_cols = [(col, len(train_df[col].unique())) for col in cat_col_names]

tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols
)

# Fit the preprocessor on the training data only, then transform it.
X_tab = tab_preprocessor.fit_transform(train_df)

# MLP with two hidden layers of 200 and 100 neurons.
tab_mlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)
model = WideDeep(deeptabular=tab_mlp)

# Train with the RMSE objective and track R2 during training.
trainer = Trainer(model, objective='rmse', metrics=[R2Score], device='cpu', num_workers=0)
trainer.fit(X_tab=X_tab, target=np.array(train_df['resale_price']), n_epochs=100, batch_size=64)

epoch 1: 100%|██████████| 1366/1366 [00:14<00:00, 92.16it/s, loss=2.01e+5, metrics={'r2': -1.6453}] 
epoch 2: 100%|██████████| 1366/1366 [00:11<00:00, 114.78it/s, loss=8.21e+4, metrics={'r2': 0.6774}]
epoch 3: 100%|██████████| 1366/1366 [00:11<00:00, 122.15it/s, loss=7.29e+4, metrics={'r2': 0.759}] 
epoch 4: 100%|██████████| 1366/1366 [00:10<00:00, 133.08it/s, loss=6.96e+4, metrics={'r2': 0.7838}]
epoch 5: 100%|██████████| 1366/1366 [00:11<00:00, 122.63it/s, loss=6.72e+4, metrics={'r2': 0.799}] 
epoch 6: 100%|██████████| 1366/1366 [00:16<00:00, 85.10it/s, loss=6.57e+4, metrics={'r2': 0.8079}] 
epoch 7: 100%|██████████| 1366/1366 [00:14<00:00, 93.40it/s, loss=6.48e+4, metrics={'r2': 0.8133}] 
epoch 8: 100%|██████████| 1366/1366 [00:11<00:00, 122.35it/s, loss=6.4e+4, metrics={'r2': 0.8173}] 
epoch 9: 100%|██████████| 1366/1366 [00:13<00:00, 101.06it/s, loss=6.35e+4, metrics={'r2': 0.8203}]
epoch 10: 100%|██████████| 1366/1366 [00:12<00:00, 112.80it/s, loss=6.3e+4, metrics={'r2': 0.8228}]
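
The `1366/1366` in each progress bar is the number of mini-batches per epoch, not the number of rows. As a rough sketch, the batch size of 64 can be used to bound the size of the training set. This assumes that no rows are held out for validation and that the last, possibly smaller, batch is kept; the helper `batches_per_epoch` is hypothetical and only illustrates the arithmetic.

```python
import math

BATCH_SIZE = 64
BATCHES_PER_EPOCH = 1366  # read off the training progress bar


def batches_per_epoch(n_rows, batch_size):
    """Number of mini-batches when the final partial batch is kept."""
    return math.ceil(n_rows / batch_size)


# Smallest and largest row counts that produce exactly 1366 batches.
min_rows = (BATCHES_PER_EPOCH - 1) * BATCH_SIZE + 1
max_rows = BATCHES_PER_EPOCH * BATCH_SIZE
print(min_rows, max_rows)  # 87361 87424
```

Under those assumptions, the training set contains between 87,361 and 87,424 rows.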

>Report the test RMSE and the test R2 value that you obtained.

RMSE Score: 106068
R2 Score: 0.607
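
For reference, the two metrics can be written out by hand. The sketch below uses the standard definitions, RMSE as the square root of the mean squared error and R2 as one minus the ratio of the residual sum of squares to the total sum of squares. The prices in `y_true` and `y_pred` are made-up toy values, not rows from the dataset, and the helper names `rmse` and `r2` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy resale prices, made up for illustration.
y_true = [300_000.0, 450_000.0, 520_000.0, 610_000.0]
y_pred = [320_000.0, 430_000.0, 500_000.0, 650_000.0]

print(round(rmse(y_true, y_pred), 2))  # 26457.51
print(round(r2(y_true, y_pred), 4))    # 0.9455
```

RMSE is in the same unit as the target (here, dollars), while R2 is unitless: a model that always predicts the mean of `y_true` scores 0, and a perfect model scores 1.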

In [9]:
# Apply the preprocessor fitted on the training data to the test set.
X_test_tab = tab_preprocessor.transform(test_df)
pred = trainer.predict(X_tab=X_test_tab, batch_size=64)

# RMSE is the square root of the mean squared error.
rmse = pow(mean_squared_error(test_df['resale_price'], pred), 0.5)

# R2Score is called as (predictions, targets).
r2 = R2Score()
r2_score = r2(pred, test_df['resale_price'])

print("RMSE Score:", rmse)
print("R2 Score:", r2_score)

predict: 100%|██████████| 1128/1128 [00:03<00:00, 365.79it/s]

RMSE Score: 106067.97369493851
R2 Score: 0.6069325560507035



