# Question B4 (10 marks)

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution which the models were trained on. Recall that machine learning models often work with the assumption that the test distribution is similar to the training distribution. When this assumption is violated, model performance will be adversely impacted. In the last part of this assignment, we will investigate to what extent model degradation has occurred.




---



Your co-investigators used a linear regression model to rapidly test out several combinations of train/test splits and shared their findings with you in a brief report attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [1]:
!pip install alibi-detect






In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

1. Evaluate your model from B1 on data from year 2022 and report the test R2.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
import pytorch_tabular
from sklearn.metrics import r2_score

# load model
model = pytorch_tabular.tabular_model.TabularModel.load_model('b1')

# load data
test_data_2022 = df[df['year'] == 2022]

# test data
pred_2022 = model.predict(test_data_2022)
r2_2022 = r2_score(test_data_2022['resale_price'], pred_2022['resale_price_prediction'])

print(f"2022 R2: {r2_2022}")

Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


2022 R2: 0.41781363520246395


2. Evaluate your model from B1 on data from year 2023 and report the test R2.

In [4]:
# TODO: Enter your code here
test_data_2023 = df[df['year'] == 2023]

# test data
pred_2023 = model.predict(test_data_2023)
r2_2023 = r2_score(test_data_2023['resale_price'], pred_2023['resale_price_prediction'])

print(f"2023 R2: {r2_2023}")

2023 R2: 0.13473505427989907


3. Did model degradation occur for the deep learning model?


In [5]:
# YOUR ANSWER HERE

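Yes, model degradation occurred. The test R^2 falls from 0.4178 on the 2022 data to 0.1347 on the 2023 data, so the model explains less than half of the variance in 2022 prices and very little of it in 2023. This corroborates the linear regression findings in Appendix A, where the test R2 also dropped to 0.41 for 2022 and 0.10 for 2023.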
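As an aid to reading these scores, the sketch below computes R^2 by hand with NumPy, assuming the usual definition (1 minus the ratio of the residual sum of squares to the total sum of squares). The arrays are made-up illustrative values, not rows from the HDB dataset, and `r2_manual` is a hypothetical helper name.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only (hypothetical prices in thousands of dollars)
y_true = np.array([400.0, 500.0, 600.0, 700.0])

perfect = r2_manual(y_true, y_true)                        # exact predictions
mean_only = r2_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
biased = r2_manual(y_true, y_true - 150.0)                 # systematically 150 too low

print(perfect, mean_only, biased)
```

A score near 0 means the model does little better than always predicting the mean price, and a model that systematically underpredicts (as can happen when prices rise after training) can even score below 0.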



---



4. Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [6]:
# YOUR CODE HERE
# reference data: training years (2019 and before); test data: year 2023
train_data = df[df['year'] <= 2019]
test_data = df[df['year'] == 2023]

# number of data points to compare from each set
n_ref = 1000
n_test = 1000

# NOTE: slicing takes the first 1000 rows of each set rather than a random
# sample; the output shown below was produced with this slice.
ref_data, test_data = train_data[:n_ref], test_data[:n_test]
cat_embed_cols = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
target = ['resale_price']

# define feature names and category map (keys are the categorical column indices)
feature_names = cat_embed_cols + continuous_cols
category_map = {
    0: df['month'].unique().tolist(),
    1: df['town'].unique().tolist(),
    2: df['flat_model_type'].unique().tolist(),
    3: df['storey_range'].unique().tolist(),
}

# build the reference and test feature matrices
X_ref = ref_data[feature_names].values
X_test = test_data[feature_names].values

# define TabularDrift
categories_per_feature = {f: None for f in list(category_map.keys())}
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)

# give predictions
preds = cd.predict(X_test)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

# detect drifting
fpreds = cd.predict(X_test, drift_type='feature')
for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Drift? Yes!
month -- Drift? No! -- Chi2 0.000 -- p-value 1.000
town -- Drift? Yes! -- Chi2 667.474 -- p-value 0.000
flat_model_type -- Drift? Yes! -- Chi2 77.586 -- p-value 0.000
storey_range -- Drift? Yes! -- Chi2 38.800 -- p-value 0.001
dist_to_nearest_stn -- Drift? No! -- K-S 0.055 -- p-value 0.094
dist_to_dhoby -- Drift? Yes! -- K-S 0.218 -- p-value 0.000
degree_centrality -- Drift? No! -- K-S 0.029 -- p-value 0.783
eigenvector_centrality -- Drift? Yes! -- K-S 0.195 -- p-value 0.000
remaining_lease_years -- Drift? Yes! -- K-S 0.271 -- p-value 0.000
floor_area_sqm -- Drift? Yes! -- K-S 0.134 -- p-value 0.000
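The cell above takes the first 1000 rows of each subset, which in a date-ordered file may all come from the same month (this could explain the `month` statistic of exactly 0). A minimal sketch of drawing a random sample instead is given below. It uses a small synthetic dataframe as a stand-in for the HDB data (`demo_df` and its values are hypothetical), and its results would differ from the output printed above.

```python
import numpy as np
import pandas as pd

SEED = 42
n_ref, n_test = 1000, 1000

# Hypothetical stand-in for the HDB dataframe (two columns, random values)
rng = np.random.default_rng(SEED)
demo_df = pd.DataFrame({
    'year': rng.integers(2017, 2024, size=20_000),         # years 2017..2023
    'floor_area_sqm': rng.normal(95, 20, size=20_000),
})

train_part = demo_df[demo_df['year'] <= 2019]
test_part = demo_df[demo_df['year'] == 2023]

# Random draws without replacement; random_state makes them reproducible
ref_sample = train_part.sample(n=n_ref, random_state=SEED)
test_sample = test_part.sample(n=n_test, random_state=SEED)

print(ref_sample['year'].value_counts().sort_index())
```

With the real data, `ref_sample[feature_names].values` and `test_sample[feature_names].values` would take the place of `X_ref` and `X_test`.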


5. Assuming that the flurry of housing measures has made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


In [7]:
# YOUR ANSWER HERE

Concept drift possibly led to model degradation. The housing measures act on how the market prices a flat rather than on the flat's own attributes (location, floor area, remaining lease), so their main effect is on the relationship between the features and resale_price, i.e. P(Y|X) changes. A change in P(Y|X) is concept drift by definition. Note that the TabularDrift output above also flags differences in P(X) for several features between the two samples, so some covariate shift may be present as well.
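To illustrate the definition, the sketch below builds synthetic data in which P(X) is identical in both periods but P(Y|X) changes through a constant price uplift. All numbers are hypothetical and are not derived from the HDB dataset; a straight-line fit stands in for the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Same feature distribution P(X) in both periods
x_old = rng.uniform(0, 10, size=2000)
x_new = rng.uniform(0, 10, size=2000)

# Different P(Y|X): the new period adds a constant uplift (hypothetical)
y_old = 3.0 * x_old + rng.normal(0, 1, size=2000)
y_new = 3.0 * x_new + 10.0 + rng.normal(0, 1, size=2000)

# Fit a straight line on the old period only
slope, intercept = np.polyfit(x_old, y_old, deg=1)

r2_old = r2(y_old, slope * x_old + intercept)
r2_new = r2(y_new, slope * x_new + intercept)
print(f"old R2: {r2_old:.3f}, new R2: {r2_new:.3f}")
```

The fit remains good on the old period but degrades sharply on the new one even though the inputs look the same, which is the signature of concept drift.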

6. From your analysis via TabularDrift, which features contribute to this shift?


In [8]:
# YOUR ANSWER HERE

Contributing features: town, flat_model_type, storey_range, dist_to_dhoby, eigenvector_centrality, remaining_lease_years, floor_area_sqm
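The list above can be reproduced from the printed p-values. The sketch below assumes that the feature-level prediction compares each p-value directly with `p_val=.05`, and that the batch-level prediction applies a Bonferroni correction across the 10 features; both are my reading of the library's defaults rather than something verified here. The p-values are copied from the output above, rounded to 3 decimals.

```python
# p-values copied from the TabularDrift output (rounded to 3 decimals)
p_values = {
    'month': 1.000,
    'town': 0.000,
    'flat_model_type': 0.000,
    'storey_range': 0.001,
    'dist_to_nearest_stn': 0.094,
    'dist_to_dhoby': 0.000,
    'degree_centrality': 0.783,
    'eigenvector_centrality': 0.000,
    'remaining_lease_years': 0.000,
    'floor_area_sqm': 0.000,
}
alpha = 0.05

# Feature-level decision: each p-value is compared with alpha directly
drifted = [name for name, p in p_values.items() if p < alpha]

# Batch-level decision (assumed Bonferroni): alpha divided by the number of tests
bonferroni_threshold = alpha / len(p_values)
batch_drift = min(p_values.values()) < bonferroni_threshold

print(drifted)
print(f"batch drift: {batch_drift} (threshold {bonferroni_threshold:.3f})")
```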

7. Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


Here I use all data from 2017 to 2022 to train the model. To give more weight to recent data, I keep a larger proportion of the rows from recent years and a smaller proportion from earlier years: all of the 2022 data, 80% of 2021, 60% of 2020, 40% of 2019, 20% of 2018 and 10% of 2017. This enables the model to leverage both past context and present dynamics.

After that, I shuffle the dataset and split it so that 90% of the data is used for training and 10% for validation. Finally, I use the trained model to predict on the 2023 data. The R^2 score is about 0.7038, a substantial improvement.
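The year-by-year proportions described above can also be expressed as a single mapping from year to fraction. The sketch below shows that idea on a small synthetic dataframe (`demo_df` is a hypothetical stand-in with 100 rows per year); it uses `round` where the cell below uses `int`, so row counts could differ by one on real data.

```python
import pandas as pd

SEED = 42

# Fraction of each year's rows to keep, following the proportions in the text
year_fractions = {2017: 0.1, 2018: 0.2, 2019: 0.4, 2020: 0.6, 2021: 0.8, 2022: 1.0}

# Hypothetical stand-in data: 100 rows per year
demo_df = pd.DataFrame({
    'year': [y for y in year_fractions for _ in range(100)],
    'resale_price': range(600),
})

parts = []
for year, frac in year_fractions.items():
    rows = demo_df[demo_df['year'] == year]
    parts.append(rows.sample(n=round(frac * len(rows)), random_state=SEED))

# Concatenate and shuffle so that a later train/validation split mixes all years
weighted = pd.concat(parts).sample(frac=1, random_state=SEED)
counts = weighted['year'].value_counts().sort_index().to_dict()
print(counts)
```

Keeping the fractions in one dictionary makes it easier to try other weighting schemes without editing six separate lines.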

In [9]:
# YOUR CODE HERE
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

# load data and set proportion
data_2017 = df[df['year'] == 2017]
data_2018 = df[df['year'] == 2018]
data_2019 = df[df['year'] == 2019]
data_2020 = df[df['year'] == 2020]
data_2021 = df[df['year'] == 2021]
data_2022 = df[df['year'] == 2022]
data_2017 = data_2017.sample(n=int(0.1 * len(data_2017)), random_state=42)
data_2018 = data_2018.sample(n=int(0.2 * len(data_2018)), random_state=42)
data_2019 = data_2019.sample(n=int(0.4 * len(data_2019)), random_state=42)
data_2020 = data_2020.sample(n=int(0.6 * len(data_2020)), random_state=42)
data_2021 = data_2021.sample(n=int(0.8 * len(data_2021)), random_state=42)

# shuffle the  data
data = pd.concat([data_2017, data_2018, data_2019, data_2020, data_2021, data_2022])
data_shuffled = data.sample(frac=1, random_state=42)
split_index = int(0.9 * len(data_shuffled))
train_data = data_shuffled.iloc[:split_index]
val_data = data_shuffled.iloc[split_index:]
test_data = df[df['year'] == 2023]

# define data
data_config = DataConfig(
    target=['resale_price'],
    continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'],
    categorical_cols=['month', 'town', 'flat_model_type', 'storey_range']
)

# define trainer
trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=50,
    auto_lr_find=True  # to find an optimal lr
)

# define model
model_config = CategoryEmbeddingModelConfig(
    task='regression',
    layers='50'
)

# define optimizer
optimizer_config = OptimizerConfig(
    optimizer='Adam'
)

# initialization
model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

# train the model
model.fit(train=train_data, validation=val_data, seed=SEED)

pred = model.predict(test_data)
r2 = r2_score(test_data['resale_price'], pred['resale_price_prediction'])

print(f"2023 R2: {r2}")

Seed set to 42


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


You are using a CUDA device ('NVIDIA GeForce RTX 4060 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
D:\Program Files\Python310\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:639: Checkpoint directory E:\git\SC4001\Programming Assignment\saved_models exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
D:\Program Files\Python310\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
D:\Program Files\Python310\lib\site-packages\pytorch_lightning\trainer\connectors\

Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.6918309709189363
Restoring states from the checkpoint path at E:\git\SC4001\Programming Assignment\.lr_find_86457976-9887-486d-883e-9f737d40f4dd.ckpt
Restored all states from the checkpoint at E:\git\SC4001\Programming Assignment\.lr_find_86457976-9887-486d-883e-9f737d40f4dd.ckpt


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Output()

2023 R2: 0.7037734076617993


With this method, the R^2 score on the 2023 data is about 0.7038, up from 0.1347 for the original B1 model.

### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | 0.41    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but its performance drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |