CS4001/4042 Assignment 1
---
Part B, Q1 (15 marks)
---

Real world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [13]:
SEED = 42

import os
import torch
import random
import numpy as np
import pandas as pd
import torch.nn as nn
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (DataConfig, OptimizerConfig, TrainerConfig)

from sklearn.metrics import r2_score

random.seed(SEED)
np.random.seed(SEED)
os.environ['TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD'] = '1' # https://github.com/suno-ai/bark/issues/626

> Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from year 2020 and before as training data, and year 2021 as test data (validation set is not required).
**Do not** use data from year 2022 and year 2023.



In [21]:
df = pd.read_csv('hdb_price_prediction.csv')
df['year'] = pd.to_numeric(df['year'], errors='coerce')
df = df[df['year'] < 2022]
train_df = df[df['year'] <= 2020]
test_df = df[df['year'] == 2021]

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to be 1024 and set max_epoch as 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [None]:
%%capture
data_config = DataConfig(
  target=['resale_price'],
  continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'],
  categorical_cols=['month', 'town', 'flat_model_type', 'storey_range']
)

trainer_config = TrainerConfig(
  auto_lr_find=True,
  batch_size=1024,
  max_epochs=50
)

model_config = CategoryEmbeddingModelConfig(
  task='regression',
  layers='50',
  activation='ReLU',
  learning_rate=1e-3
)

optimizer_config = OptimizerConfig()

tabular_model = TabularModel(
  data_config=data_config,
  model_config=model_config,
  optimizer_config=optimizer_config,
  trainer_config=trainer_config
)

tabular_model.fit(train=train_df, validation=test_df)
result = tabular_model.evaluate(test_df)
prediction = tabular_model.predict(test_df)

> Report the test RMSE error and the test R2 value that you obtained.



In [47]:
true_y = test_df['resale_price'].values
r2_value = r2_score(true_y, prediction) #type: ignore

print(f"R2 Value: {r2_value}")
print(f"Root Mean Squared Error: {result[0]['test_loss'] ** 0.5}") #type: ignore

R2 Value: 0.8474233754422245
Root Mean Squared Error: 63529.121763172516


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. 



In [58]:
errors_df = pd.DataFrame()
squeezed_prediction = np.squeeze(prediction) #type: ignore
errors_df['resale_price'] = test_df['resale_price']
errors_df['prediction'] = squeezed_prediction
errors_df['error'] = abs(errors_df['resale_price'] - errors_df['prediction'])
print(errors_df.nlargest(25, 'error'))

        resale_price    prediction         error
92405       780000.0  3.801949e+05  399805.06250
112128      998000.0  6.727598e+05  325240.18750
90251      1001000.0  1.324944e+06  323943.87500
90957       968000.0  6.501336e+05  317866.37500
91871       680888.0  3.687742e+05  312113.78125
90608      1360000.0  1.053320e+06  306679.75000
92504       695000.0  3.922965e+05  302703.53125
92299       690000.0  3.910424e+05  298957.62500
92442      1165000.0  8.684987e+05  296501.31250
98379       873000.0  5.839535e+05  289046.50000
91694       680000.0  3.917008e+05  288299.21875
92340      1245000.0  9.616702e+05  283329.75000
92066       628000.0  3.465577e+05  281442.28125
93670      1238000.0  9.584582e+05  279541.81250
91497       618000.0  3.398550e+05  278145.03125
90521       988000.0  7.101542e+05  277845.75000
90432      1280000.0  1.005957e+06  274042.87500
93825       938000.0  6.652728e+05  272727.25000
92073       668000.0  4.001608e+05  267839.15625
92496       640000.0

Part B, Q2 (10 marks)
---
In Question B1, we used the Category Embedding model. This creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will just be utilizing the deeptabular component of this library through the TabMlp network.

In [4]:
import pandas as pd
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data（validation set is not required here).

In [8]:
df = pd.read_csv('hdb_price_prediction.csv')
df['year'] = pd.to_numeric(df['year'], errors='coerce')
train_df = df[df['year'] >= 2021]
test_df = df[df['year'] <= 2020]

>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 hidden layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 60 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [9]:
target_col = 'resale_price'
cat_embed_cols=['month', 'town', 'flat_model_type', 'storey_range']
continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

tab_preprocessor = TabPreprocessor(
  cat_embed_cols=cat_embed_cols,
  continuous_cols=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(train_df)
target_df = train_df[target_col].values

tab_mlp = TabMlp(
  column_idx=tab_preprocessor.column_idx,
  cat_embed_input=tab_preprocessor.cat_embed_input,
  continuous_cols=tab_preprocessor.continuous_cols,
  mlp_hidden_dims=[200, 100]
)

model = WideDeep(deeptabular=tab_mlp)
trainer = Trainer(model, loss_fn='rmse', metrics=[R2Score], num_workers=0)
trainer.fit(X_tab=X_tab, target=target_df, n_epochs=60, batch_size=64)

epoch 1: 100%|██████████| 1128/1128 [00:11<00:00, 99.52it/s, loss=2.18e+5, metrics={'r2': -1.7549}] 
epoch 2: 100%|██████████| 1128/1128 [00:10<00:00, 102.74it/s, loss=1.07e+5, metrics={'r2': 0.4997}]
epoch 3: 100%|██████████| 1128/1128 [00:10<00:00, 107.93it/s, loss=9.41e+4, metrics={'r2': 0.6154}]
epoch 4: 100%|██████████| 1128/1128 [00:10<00:00, 106.33it/s, loss=8.14e+4, metrics={'r2': 0.7251}]
epoch 5: 100%|██████████| 1128/1128 [00:10<00:00, 104.83it/s, loss=7.3e+4, metrics={'r2': 0.7886}] 
epoch 6: 100%|██████████| 1128/1128 [00:11<00:00, 95.75it/s, loss=6.86e+4, metrics={'r2': 0.817}] 
epoch 7: 100%|██████████| 1128/1128 [00:12<00:00, 88.87it/s, loss=6.65e+4, metrics={'r2': 0.8307}] 
epoch 8: 100%|██████████| 1128/1128 [00:11<00:00, 94.43it/s, loss=6.48e+4, metrics={'r2': 0.8406}] 
epoch 9: 100%|██████████| 1128/1128 [00:12<00:00, 90.66it/s, loss=6.39e+4, metrics={'r2': 0.8452}] 
epoch 10: 100%|██████████| 1128/1128 [00:11<00:00, 94.73it/s, loss=6.29e+4, metrics={'r2': 0.8511}] 

>Report the test RMSE and the test R2 value that you obtained.

In [19]:
X_tab_te = tab_preprocessor.transform(test_df)
predictions = trainer.predict(X_tab=X_tab_te)
r2_score = r2_score(test_df[target_col].values, predictions)
print(f"R2 Score: {r2_score}")

predict: 100%|██████████| 1366/1366 [00:04<00:00, 284.94it/s]

R2 Score: -0.1337902193666367





Part B, Q3 (10 marks)
---
Besides ensuring that your neural network performs well, it is important to be able to explain the model’s decision. **Captum** is a very handy library that helps you to do so for PyTorch models.

Many model explainability algorithms for deep learning models are available in Captum. These algorithms are often used to generate an attribution score for each feature. Features with larger scores are more ‘important’ and some algorithms also provide information about directionality (i.e. a feature with very negative attribution scores means the larger the value of that feature, the lower the value of the output).

In general, these algorithms can be grouped into two paradigms:
- **perturbation based approaches** (e.g. Feature Ablation)
- **gradient / backpropagation based approaches** (e.g. Saliency)

The former adopts a brute-force approach of removing / permuting features one by one and does not scale up well. The latter depends on gradients and they can be computed relatively quickly. But unlike how backpropagation computes gradients with respect to weights, gradients here are computed **with respect to the input**. This gives us a sense of how much a change in the input affects the model’s outputs.




---



In [None]:
from captum.attr import Saliency, InputXGradient, IntegratedGradients, GradientShap, FeatureAblation

> First, use the train set (year 2020 and before) and test set (year 2021) following the splits in Question B1 (validation set is not required here). To keep things simple, we will **limit our analysis to numeric / continuous features only**. Drop all categorical features from the dataframes. Standardise the features via **StandardScaler** (fit to training set, then transform all).

In [None]:
# TODO: Enter your code here

> Follow this tutorial to generate the plot from various model explainability algorithms (https://captum.ai/tutorials/House_Prices_Regression_Interpret).
Specifically, make the following changes:
- Use a feedforward neural network with 3 hidden layers, each having 5 neurons. Train using Adam optimiser with learning rate of 0.001.
- Use Input x Gradients, Integrated Gradients, DeepLift, GradientSHAP, Feature Ablation. To avoid long running time, you can limit the analysis to the first 1000 samples in test set.

In [None]:
# TODO: Enter your code here

> Read the following [descriptions](https://captum.ai/docs/attribution_algorithms) and [comparisons](https://captum.ai/docs/algorithms_comparison_matrix) in Captum to build up your understanding of the difference of various explainability algorithms. Based on your plot, identify the three most important features for regression. Explain how each of these features influences the regression outcome.


\# TODO: \<Enter your answer here\>

Part B, Q4 (10 marks)
---

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution which the models were trained on. Recall that machine learning models often work with the assumption that the test distribution should be similar to train distribution. When this assumption is violated, model performance will be adversely impacted.  In the last part of this assignment, we will investigate to what extent model degradation has occurred.




---



In [None]:
!pip install alibi-detect

In [None]:
from alibi_detect.cd import TabularDrift

> Evaluate your model from B1 on data from year 2022 and report the test R2.

In [None]:
# TODO: Enter your code here

> Evaluate your model from B1 on data from year 2023 and report the test R2.

In [None]:
# TODO: Enter your code here

> Did model degradation occur for the deep learning model?

\# TODO: \<Enter your answer here\>

Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2020 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [None]:
# TODO: Enter your code here

> Assuming that the flurry of housing measures have made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?

\# TODO: \<Enter your answer here\>

> From your analysis via TabularDrift, which features contribute to this shift?

\# TODO: \<Enter your answer here\>

> Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.

\# TODO: \<Enter your answer here\>

In [None]:
# TODO: Enter your code here