CS4001/4042 Assignment 1
---
Part B, Q1 (15 marks)
---

Real world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
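
As a rough illustration of these two options, the sketch below uses plain Python with made-up town values and a hypothetical embedding size of 3. It maps a categorical column to integer codes, one-hot encodes it, and looks up rows from a randomly initialised embedding table. In a real network the embedding rows would be learned during training rather than left random.

```python
import random

# Toy categorical column (values are illustrative, not taken from the dataset)
towns = ["BISHAN", "TAMPINES", "BISHAN", "HOUGANG"]

# 1) Label encoding: map each distinct category to an integer code
vocab = {cat: idx for idx, cat in enumerate(sorted(set(towns)))}
codes = [vocab[t] for t in towns]

# 2) One-hot encoding: one indicator column per category
one_hot = [[1 if code == j else 0 for j in range(len(vocab))] for code in codes]

# 3) Embedding: each code indexes a row of a (num_categories x embed_dim) table.
#    The table is random here; a neural network would learn it.
rng = random.Random(42)
embed_dim = 3  # hypothetical embedding size
table = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in vocab]
embedded = [table[code] for code in codes]

print(codes)
print(one_hot[0])
print(len(embedded), len(embedded[0]))
```

Rows that share a category (the two `BISHAN` entries here) map to the same embedding vector, which is what lets the network share what it learns about a category across samples.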

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [36]:
!pip install "pytorch_tabular[extra]"  # quoted so that zsh does not treat [extra] as a glob pattern
# ! pip install numpy
# ! pip install pandas
# ! pip install torch
# ! pip install scikit-learn
# ! pip install matplotlib


In [32]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

import torch
import torch.nn as nn

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from year 2020 and before as training data, and year 2021 as test data (validation set is not required).
**Do not** use data from year 2022 and year 2023.



In [33]:
df = pd.read_csv('hdb_price_prediction.csv')
print(f"Dataset shape: {df.shape}")
print(f"Year range: {df['year'].min()} to {df['year'].max()}")

# Split data by year
train_df = df[df['year'] <= 2020].copy()
test_df_2021 = df[df['year'] == 2021].copy()  # copy so columns can be added later without a SettingWithCopyWarning

print(f"Train set (≤2020): {train_df.shape[0]} records")
print(f"Test set (2021): {test_df_2021.shape[0]} records")

Dataset shape: (159553, 14)
Year range: 2017 to 2023
Train set (≤2020): 87370 records
Test set (2021): 29057 records


> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [34]:
# TODO: Enter your code here
# Define features as specified in the assignment
categorical_features = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_features = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 
                      'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
target = ['resale_price']

# Configure the model
data_config = DataConfig(
    target=target,
    continuous_cols=continuous_features,
    categorical_cols=categorical_features,
)

trainer_config = TrainerConfig(
    auto_lr_find=True,  # Automatically tune learning rate
    batch_size=1024,
    max_epochs=50,
)

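# Adam, stated explicitly; the learning rate is tuned automatically and no scheduler is set
optimizer_config = OptimizerConfig(optimizer="Adam")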

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50"
)

# Initialize and train the model
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

print("Training the PyTorch Tabular model...")
tabular_model.fit(train=train_df)

# Evaluate the model on 2021 test data
results = tabular_model.evaluate(test_df_2021)
pred_df = tabular_model.predict(test_df_2021)
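# PyTorch Tabular names the prediction column "<target>_prediction"; target is a one-element list
test_preds = pred_df[f"{target[0]}_prediction"].values
test_actuals = test_df_2021[target[0]].values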

Seed set to 42


Training the PyTorch Tabular model...


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_encoded[col].fillna(self._imputed, inplace=True)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


/Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/saved_models exists and is not empty.
/Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at /Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/.lr_find_81c11c4f-b60e-4045-b005-87754d39b359.ckpt
Restored all states from the checkpoint at /Users/aryansethi/Documents/Personal/Neural-Networks-Assignment-1/.lr_find_81c11c4f-b60e-4045-b005-87754d39b359.ckpt


Output()

UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL builtins.dict was not an allowed global by default. Please use `torch.serialization.add_safe_globals([dict])` or the `torch.serialization.safe_globals([dict])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

> Report the test RMSE error and the test R2 value that you obtained.



In [18]:
# TODO: Enter your code here
# Calculate metrics
rmse = np.sqrt(mean_squared_error(test_actuals, test_preds))
r2 = r2_score(test_actuals, test_preds)

print(f"\nB1 Results on 2021 Test Set:")
print(f"RMSE: ${rmse:.2f}")
print(f"R² Score: {r2:.4f}")


B1 Results on 2021 Test Set:
RMSE: $76635.10
R² Score: 0.7780
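
For reference, the two metrics are defined as RMSE = sqrt(mean((y − ŷ)²)) and R² = 1 − SS_res / SS_tot. The sketch below computes both in plain Python on made-up numbers. The helper names `rmse` and `r2` are hypothetical, and the values are unrelated to the results above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up resale prices and predictions
y_true = [300_000, 400_000, 500_000, 600_000]
y_pred = [310_000, 390_000, 520_000, 580_000]

print(round(rmse(y_true, y_pred), 2))
print(round(r2(y_true, y_pred), 4))
```

RMSE is in the same units as the target (dollars here), whereas R² is unitless and measures the share of variance explained relative to always predicting the mean.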


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. 



In [19]:
# TODO: Enter your code here
# Calculate errors for each test sample
test_df_2021['predicted'] = test_preds
# target is a one-element list, so index it to subtract a Series rather than a DataFrame
test_df_2021['error'] = np.abs(test_df_2021['predicted'] - test_df_2021[target[0]])

# Show top 25 samples with largest errors
top_25_errors = test_df_2021.sort_values(by='error', ascending=False).head(25)
print("\nTop 25 samples with largest prediction errors:")
print(top_25_errors[['year', 'month', 'town', 'flat_model_type', 'floor_area_sqm', 'resale_price', 'predicted', 'error']])


Top 25 samples with largest prediction errors:
        year  month          town        flat_model_type  floor_area_sqm  \
92405   2021     11   BUKIT MERAH       3 ROOM, Standard            88.0   
90608   2021     12        BISHAN           5 ROOM, DBSS           120.0   
90957   2021      6   BUKIT BATOK   EXECUTIVE, Apartment           144.0   
92442   2021     11   BUKIT MERAH       5 ROOM, Improved           113.0   
112128  2021     12      TAMPINES  EXECUTIVE, Maisonette           148.0   
90521   2021     10        BISHAN       5 ROOM, Improved           121.0   
90432   2021      8        BISHAN           5 ROOM, DBSS           120.0   
90483   2021      9        BISHAN           5 ROOM, DBSS           120.0   
98379   2021     12       HOUGANG   EXECUTIVE, Apartment           142.0   
105702  2021      6    QUEENSTOWN   EXECUTIVE, Apartment           148.0   
92533   2021     12   BUKIT MERAH       5 ROOM, Improved           115.0   
91871   2021      6   BUKIT MERAH       

Part B, Q2 (10 marks)
---
In Question B1, we used the Category Embedding model. This creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will make use of a library called Pytorch-WideDeep. This library makes it easy to work with multimodal deep-learning problems combining images, text, and tables. We will just be utilizing the deeptabular component of this library through the TabMlp network.

In [20]:
! pip install pytorch-widedeep


In [26]:
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data (validation set is not required here).

In [27]:
# TODO: Enter your code here
print("\nPart B, Question 2: PyTorch-WideDeep Implementation")

# For B2, test set includes 2021 and after
test_df_b2 = df[df['year'] >= 2021].copy()
print(f"Train set (≤2020): {train_df.shape[0]} records")
print(f"Test set (≥2021): {test_df_b2.shape[0]} records")


Part B, Question 2: PyTorch-WideDeep Implementation
Train set (≤2020): 87370 records
Test set (≥2021): 72183 records


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 hidden layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 60 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [29]:
# Prepare features and target (target is a one-element list, so index it)
X_train = train_df.copy()
y_train = X_train.pop(target[0]).values
X_test = test_df_b2.copy()
y_test = X_test.pop(target[0]).values

# Preprocess the tabular data; categorical columns are passed via cat_embed_cols
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_features,
    continuous_cols=continuous_features,
    cols_to_scale=continuous_features
)

X_tab_train = tab_preprocessor.fit_transform(X_train)
X_tab_test = tab_preprocessor.transform(X_test)

# Create the TabMlp model with 2 hidden layers (200, 100)
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_features,
    mlp_hidden_dims=[200, 100],
    mlp_activation="relu"
)

# Combine the components
model = WideDeep(deeptabular=tab_mlp)

# Create a Trainer with the RMSE objective; the optimizer is passed as an object
trainer = Trainer(
    model,
    objective="root_mean_squared_error",
    optimizers=torch.optim.Adam(model.parameters(), lr=0.001),
    metrics=[R2Score],
    callbacks=None,
    verbose=1,
    num_workers=0
)

# Train the model
print("Training the PyTorch-WideDeep model...")
trainer.fit(
    X_tab=X_tab_train,
    target=y_train,
    n_epochs=60,
    batch_size=64,
    val_split=0.1
)

>Report the test RMSE and the test R2 value that you obtained.

In [21]:
# TODO: Enter your code here
# Evaluate on test set
preds = trainer.predict(X_tab=X_tab_test)
rmse_b2 = np.sqrt(mean_squared_error(y_test, preds))
r2_b2 = r2_score(y_test, preds)

print(f"\nB2 Results on Test Set (≥2021):")
print(f"RMSE: ${rmse_b2:.2f}")
print(f"R² Score: {r2_b2:.4f}")

Part B, Q3 (10 marks)
---
Besides ensuring that your neural network performs well, it is important to be able to explain the model’s decision. **Captum** is a very handy library that helps you to do so for PyTorch models.

Many model explainability algorithms for deep learning models are available in Captum. These algorithms are often used to generate an attribution score for each feature. Features with larger scores are more ‘important’, and some algorithms also provide information about directionality (i.e. a strongly negative attribution score means that larger values of that feature lower the output).

In general, these algorithms can be grouped into two paradigms:
- **perturbation based approaches** (e.g. Feature Ablation)
- **gradient / backpropagation based approaches** (e.g. Saliency)

The former adopts a brute-force approach of removing / permuting features one by one and does not scale up well. The latter depends on gradients and they can be computed relatively quickly. But unlike how backpropagation computes gradients with respect to weights, gradients here are computed **with respect to the input**. This gives us a sense of how much a change in the input affects the model’s outputs.
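
The difference between the two paradigms can be sketched without any library. The example below is a minimal illustration under stated assumptions: a toy function stands in for a trained network, gradients are approximated with central differences (Captum would use autograd instead), and the helper names `feature_ablation` and `input_gradient` are hypothetical.

```python
def model(x):
    """Toy differentiable model standing in for a trained network."""
    return 2.0 * x[0] ** 2 - 3.0 * x[1]

def feature_ablation(f, x, baseline=0.0):
    """Perturbation-based: output change when one feature is replaced by a baseline."""
    attributions = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline
        attributions.append(f(x) - f(ablated))
    return attributions

def input_gradient(f, x, eps=1e-6):
    """Gradient of the output with respect to each input (central differences)."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

x = [1.0, 2.0]
grads = input_gradient(model, x)
saliency = [abs(g) for g in grads]                   # Saliency: |df/dx_i|
input_x_grad = [xi * g for xi, g in zip(x, grads)]   # Input x Gradient
ablation = feature_ablation(model, x)

print(saliency, input_x_grad, ablation)
```

Feature ablation needs one extra forward pass per feature, while the gradient-based scores come from a single gradient computation. The two families can also disagree: for the first feature here, Input x Gradient and Feature Ablation give different magnitudes because the model is nonlinear in that feature.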




---



In [None]:
!pip install captum

In [25]:
from captum.attr import Saliency, InputXGradient, IntegratedGradients, DeepLift, GradientShap, FeatureAblation

> First, use the train set (year 2020 and before) and test set (year 2021) following the splits in Question B1 (validation set is not required here). To keep things simple, we will **limit our analysis to numeric / continuous features only**. Drop all categorical features from the dataframes. Standardise the features via **StandardScaler** (fit to training set, then transform all).
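
The "fit to training set, then transform all" step can be sketched in plain Python as below. This is an illustration only, using made-up values and hypothetical helper names (`fit_scaler`, `transform`). In the notebook, `StandardScaler.fit` and `StandardScaler.transform` from scikit-learn would play these roles.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return (mean, std) of a training column; population std, as StandardScaler uses."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardise a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]

train_area = [60.0, 80.0, 100.0, 120.0]  # made-up floor_area_sqm values
test_area = [90.0, 150.0]

mean, std = fit_scaler(train_area)             # statistics come from the training set only
train_scaled = transform(train_area, mean, std)
test_scaled = transform(test_area, mean, std)  # the test set reuses the training statistics

print(mean, std)
print(test_scaled)
```

Fitting on the training set alone avoids leaking information about the test distribution into the preprocessing step.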

In [27]:
# TODO: Enter your code here

> Follow this tutorial to generate the plot from various model explainability algorithms (https://captum.ai/tutorials/House_Prices_Regression_Interpret).
Specifically, make the following changes:
- Use a feedforward neural network with 3 hidden layers, each having 5 neurons. Train using Adam optimiser with learning rate of 0.001.
- Use Input x Gradients, Integrated Gradients, DeepLift, GradientSHAP, Feature Ablation. To avoid long running time, you can limit the analysis to the first 1000 samples in test set.

In [29]:
# TODO: Enter your code here

> Read the following [descriptions](https://captum.ai/docs/attribution_algorithms) and [comparisons](https://captum.ai/docs/algorithms_comparison_matrix) in Captum to build up your understanding of the difference of various explainability algorithms. Based on your plot, identify the three most important features for regression. Explain how each of these features influences the regression outcome.


\# TODO: \<Enter your answer here\>

Part B, Q4 (10 marks)
---

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution on which the models were trained. Recall that machine learning models typically assume that the test distribution is similar to the training distribution. When this assumption is violated, model performance is adversely affected. In the last part of this assignment, we will investigate the extent to which model degradation has occurred.




---



In [None]:
!pip install alibi-detect

In [35]:
from alibi_detect.cd import TabularDrift

> Evaluate your model from B1 on data from year 2022 and report the test R2.

In [37]:
# TODO: Enter your code here

> Evaluate your model from B1 on data from year 2023 and report the test R2.

In [39]:
# TODO: Enter your code here

> Did model degradation occur for the deep learning model?

\# TODO: \<Enter your answer here\>

Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.
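
As a compact summary of the three shift types, as read from the first reference linked above, the joint distribution can be factorised in two ways, and each shift corresponds to one factor changing while its counterpart stays fixed:

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y) \\
\text{Covariate shift:}\quad & P(X) \text{ changes},\; P(Y \mid X) \text{ unchanged} \\
\text{Label shift:}\quad & P(Y) \text{ changes},\; P(X \mid Y) \text{ unchanged} \\
\text{Concept drift:}\quad & P(Y \mid X) \text{ changes},\; P(X) \text{ unchanged}
\end{align*}
```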

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2020 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)
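
As far as the Alibi Detect documentation describes it, TabularDrift runs a univariate test per feature (Kolmogorov-Smirnov for numeric features, chi-squared for categorical ones) and corrects for multiple testing. The sketch below is a dependency-free illustration of the numeric case only, on synthetic data. It samples 1000 points from each of two made-up populations and compares the two-sample K-S statistic against the usual large-sample 5% critical value. It is not the library's implementation, and the function name `ks_statistic` is hypothetical.

```python
import bisect
import math
import random

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

rng = random.Random(42)
# Synthetic stand-ins for one numeric feature in the reference and the new data
reference_population = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted_population = [rng.gauss(0.5, 1.0) for _ in range(5000)]

# Sample 1000 points from each, as the task requires
ref = rng.sample(reference_population, 1000)
new = rng.sample(shifted_population, 1000)

# Large-sample critical value of the two-sample K-S test at level alpha
n, m, alpha = len(ref), len(new), 0.05
critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))

d = ks_statistic(ref, new)
print(f"K-S statistic: {d:.3f}, critical value: {critical:.3f}")
print("drift detected" if d > critical else "no drift detected")
```

With several features tested at once, the per-feature significance level has to be tightened (for example by a Bonferroni correction), otherwise some features will be flagged by chance alone.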


In [43]:
# TODO: Enter your code here

> Assuming that the flurry of housing measures has affected the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?

\# TODO: \<Enter your answer here\>

> From your analysis via TabularDrift, which features contribute to this shift?

\# TODO: \<Enter your answer here\>

> Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.

\# TODO: \<Enter your answer here\>

In [None]:
# TODO: Enter your code here