# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network:

In [None]:
!pip install pytorch-widedeep

In [None]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [None]:
df = pd.read_csv('hdb_price_prediction.csv')

# YOUR CODE HERE
train_data = df[df['year'] <= 2020]
test_data = df[df['year'] >= 2021]

2. Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [None]:
# YOUR CODE & RESULT HERE
# Define the target variable and the names of the continuous and categorical variables
target = ["resale_price"]
continuous_cols = [
    "dist_to_nearest_stn",
    "dist_to_dhoby",
    "degree_centrality",
    "eigenvector_centrality",
    "remaining_lease_years",
    "floor_area_sqm",
]
categorical_cols = ["month", "town", "flat_model_type", "storey_range"]

# cat_embed_cols is the documented parameter name in recent pytorch-widedeep releases
preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_cols, continuous_cols=continuous_cols
)
x_train = preprocessor.fit_transform(train_data)
y_train = train_data[target].values

model = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=preprocessor.column_idx,
    cat_embed_input=preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)

# Wrap the TabMlp model as the deeptabular component of WideDeep
# (no wide, text, or image components are used here)
wide = WideDeep(deeptabular=model)

# Set up Trainer and train
trainer = Trainer(
    wide, objective="root_mean_squared_error", metrics=[R2Score], num_workers=0
)

# Fit the model
trainer.fit(X_tab=x_train, target=y_train, n_epochs=100, batch_size=64)

3. Report the test RMSE and the test R2 value that you obtained.

In [None]:
# YOUR CODE & RESULT HERE
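# root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import r2_score, root_mean_squared_error

# Apply the preprocessor fitted on the training data to the test data
x_test = preprocessor.transform(test_data)
y_test = test_data[target].values

predictions = trainer.predict(X_tab=x_test, batch_size=64)

# root_mean_squared_error already returns the RMSE, so no extra square root is applied
print(f"RMSE: {root_mean_squared_error(test_data['resale_price'], predictions)}")
print(f"R2: {r2_score(test_data['resale_price'], predictions)}")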
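
The two metrics reported above can be cross-checked by hand. The following is a minimal sketch in plain Python, assuming the usual definitions (RMSE as the square root of the mean squared residual, R2 as one minus the ratio of residual to total sum of squares). The helper names `rmse` and `r2` and the toy prices are hypothetical and are not part of the assignment or the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy resale prices (hypothetical values, not taken from the dataset)
y_true = [300000.0, 450000.0, 500000.0, 650000.0]
y_pred = [310000.0, 440000.0, 520000.0, 630000.0]

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(f"RMSE: {toy_rmse:.2f}")
print(f"R2: {toy_r2:.4f}")
```

Because RMSE is expressed in the same units as the target, it reads directly as a typical error in dollars, whereas R2 is unitless and compares the model against always predicting the mean price.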