CS4001/4042 Assignment 1, Part B, Q4
---

Model degradation is a common issue when deploying machine learning models (including neural networks) in the real world. New data points can exhibit a different pattern from older ones due to factors such as changes in government policy or market sentiment. For instance, housing prices in Singapore have been rising, and the Singapore government introduced three rounds of cooling measures in recent years (16 December 2021, 30 September 2022 and 27 April 2023).

In such situations, the distribution of the new data points can differ from the distribution the models were trained on. Recall that machine learning models typically assume that the test distribution is similar to the training distribution. When this assumption is violated, model performance suffers. In the last part of this assignment, we will investigate to what extent model degradation has occurred.




---



Your co-investigators used a linear regression model to quickly test several combinations of train/test splits and shared their findings in a brief report, attached as Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [22]:
!pip install alibi-detect

Collecting alibi-detect
  Obtaining dependency information for alibi-detect from https://files.pythonhosted.org/packages/ed/2a/9e11bfee0cf54c0ea78243cb0559d604878f6d58d0ac6227d11e7007decc/alibi_detect-0.11.4-py3-none-any.whl.metadata
  Using cached alibi_detect-0.11.4-py3-none-any.whl.metadata (28 kB)
Collecting opencv-python<5.0.0,>=3.2.0 (from alibi-detect)
  Obtaining dependency information for opencv-python<5.0.0,>=3.2.0 from https://files.pythonhosted.org/packages/38/d2/3e8c13ffc37ca5ebc6f382b242b44acb43eb489042e1728407ac3904e72f/opencv_python-4.8.1.78-cp37-abi3-win_amd64.whl.metadata
  Using cached opencv_python-4.8.1.78-cp37-abi3-win_amd64.whl.metadata (20 kB)
Collecting scikit-image!=0.17.1,<0.22,>=0.14.2 (from alibi-detect)
  Obtaining dependency information for scikit-image!=0.17.1,<0.22,>=0.14.2 from https://files.pythonhosted.org/packages/f3/93/65601f7577d6fd49ec23bf8fb58c04d8170b06a1544452ae2ea9f59bf11f/scikit_image-0.21.0-cp310-cp310-win_amd64.whl.metadata
  Using cached 

In [23]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

import pytorch_tabular
from sklearn.metrics import r2_score

> Evaluate your model from B1 on data from year 2022 and report the test R2.

In [24]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
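# Load the model saved in Part B1 (named model_dir because `dir` shadows a Python builtin)
model_dir = 'saved_models/part_B_1'
tabular_model = pytorch_tabular.tabular_model.TabularModel.load_model(model_dir)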

df_2022 = df[df['year'] == 2022]
pred_df = tabular_model.predict(df_2022)

r2 = r2_score(pred_df['resale_price'], pred_df['resale_price_prediction'])
print(f"R^2: {r2}")

2023-10-13 20:03:12,486 - {pytorch_tabular.tabular_model:129} - INFO - Experiment Tracking is turned off
2023-10-13 20:03:12,497 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Output()

R^2: 0.4388472038478126


> Evaluate your model from B1 on data from year 2023 and report the test R2.

In [25]:
# TODO: Enter your code here
# Filter the dataset for the year 2023
df_2023 = df[df['year'] == 2023]

# Use the tabular model to predict resale prices for the year 2023
pred_df = tabular_model.predict(df_2023)

# Calculate the R^2 score for model evaluation
r2 = r2_score(pred_df['resale_price'], pred_df['resale_price_prediction'])

# Print the R^2 score for the year 2023
print(f"R^2 for 2023: {r2}")

Output()

R^2 for 2023: 0.16212585514427558


> Did model degradation occur for the deep learning model?


Yes. The test $R^2$ fell from 0.439 on the 2022 data to 0.162 on the 2023 data, which is consistent with the decline the linear regression model shows in Appendix A.



---



Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.
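
To make the distinction concrete, the following is a minimal synthetic sketch (not the HDB data) of how a change in $P(X)$ and a change in $P(Y|X)$ can affect test $R^2$ differently. It assumes a one-feature linear model fitted by ordinary least squares; the helper names `make_data`, `fit_line` and `r2`, as well as all numeric values, are illustrative assumptions.

```python
import random

def make_data(rng, n, x_low, x_high, slope, intercept):
    """Draw n points with y = slope * x + intercept + Gaussian noise."""
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(42)

# Train on x in [0, 10] with y = 2x + noise.
slope, intercept = fit_line(*make_data(rng, 500, 0, 10, 2.0, 0.0))

# Each scenario: (x_low, x_high, true slope, true intercept) of the test data.
scenarios = {
    "no shift": (0, 10, 2.0, 0.0),          # same P(X), same P(Y|X)
    "covariate shift": (10, 20, 2.0, 0.0),  # P(X) moves, P(Y|X) unchanged
    "concept drift": (0, 10, 2.0, 8.0),     # P(X) unchanged, P(Y|X) moves
}
scores = {}
for name, params in scenarios.items():
    xs, ys = make_data(rng, 500, *params)
    scores[name] = r2(ys, [slope * x + intercept for x in xs])
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

In this toy setting the covariate shift does little harm because the linear model is correctly specified and extrapolates well; a misspecified model would generally fare worse. The level change in the concept drift scenario, loosely analogous to a market-wide price change, makes every prediction systematically off and lowers $R^2$ sharply.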

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [26]:
# Define categorical and continuous columns
categorical_columns = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_columns = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
target_columns = ['resale_price']

# Select relevant columns and create feature names
selected_columns = categorical_columns + continuous_columns
X = df[selected_columns]
y = df[target_columns]
feature_names = X.columns.values

# Create a category map for categorical columns
category_map = {i: df[column].unique().tolist() for i, column in enumerate(categorical_columns)}

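# Filter the data for reference and test sets.
# Note: [:1000] takes the first 1000 rows in file order, not a random sample;
# df.sample(n=1000, random_state=SEED) would draw the rows at random instead.
df_train = df[df['year'] <= 2019][:1000]  # Reference set
df_2023 = df[df['year'] == 2023][:1000]  # Test set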

# Extract X and y for reference and test sets
X_ref = df_train[selected_columns].values
y_ref = df_train[target_columns].values
X_test = df_2023[selected_columns].values
y_test = df_2023[target_columns].values

# Create categories_per_feature for TabularDrift
categories_per_feature = {i: None for i in category_map.keys()}

# Initialize TabularDrift with the reference data
cd = TabularDrift(X_ref, p_val=0.05, categories_per_feature=categories_per_feature)
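
For orientation, `TabularDrift` runs one univariate test per feature (Kolmogorov-Smirnov for continuous features, chi-squared for categorical ones) and by default applies a Bonferroni correction, so with `p_val=0.05` and 10 features each feature is tested against 0.05 / 10 = 0.005. The sketch below re-implements only the two-sample K-S statistic and the corrected threshold in plain Python, to show what the test statistic for a continuous feature measures. `ks_statistic` and `bonferroni_threshold` are hypothetical helper names, not part of Alibi Detect, and the toy samples are made up.

```python
import bisect

def ks_statistic(ref, test):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    ref_sorted, test_sorted = sorted(ref), sorted(test)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for v in ref_sorted + test_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, v) / len(ref_sorted)
        cdf_test = bisect.bisect_right(test_sorted, v) / len(test_sorted)
        gap = max(gap, abs(cdf_ref - cdf_test))
    return gap

def bonferroni_threshold(p_val, n_features):
    """Per-feature significance level after Bonferroni correction."""
    return p_val / n_features

# Toy samples (illustrative values only, not taken from the HDB data).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
identical = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]

print(ks_statistic(reference, identical))  # identical samples: no gap
print(ks_statistic(reference, shifted))    # disjoint samples: maximal gap
print(bonferroni_threshold(0.05, 10))      # per-feature level for 10 features
```

A feature is flagged when its p-value falls below the corrected level, which is why a feature with a p-value of 0.094 is not reported as drifted even though a single uncorrected test at 0.05 would also not flag it, while one at 0.001 is.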

In [27]:
# Detect drift and analyze features
preds = cd.predict(X_test)
labels = ['No', 'Yes']

# Print whether drift is detected or not
print('Drift? {}'.format(labels[preds['data']['is_drift']]))
threshold = preds['data']['threshold']
print("Threshold:", threshold)

drifted_columns = []

# Predict drift at the feature level
fpreds = cd.predict(X_test, drift_type='feature')

print("The following features have drifted:")

# Loop through each feature and analyze drift
for f in range(cd.n_features):
    feature_type = 'Categorical' if f < len(categorical_columns) else 'Continuous'
    feature_name = selected_columns[f]
    is_drift = 'Yes' if fpreds['data']['is_drift'][f] else 'No'
    distance_statistic = fpreds['data']['distance'][f]
    p_value = fpreds['data']['p_val'][f]

    # Print information about the feature's drift status
    print(f'{feature_type} Feature: {feature_name}')
    print(f'Drift Detected: {is_drift}')
    print(f'Drift Test Statistic: {distance_statistic:.3f}')
    print(f'P-Value: {p_value:.3f}\n')

    if is_drift == 'Yes':
        drifted_columns.append(feature_name)


Drift? Yes
Threshold: 0.005
The following features have drifted:
Categorical Feature: month
Drift Detected: No
Drift Test Statistic: 0.000
P-Value: 1.000

Categorical Feature: town
Drift Detected: Yes
Drift Test Statistic: 667.474
P-Value: 0.000

Categorical Feature: flat_model_type
Drift Detected: Yes
Drift Test Statistic: 77.586
P-Value: 0.000

Categorical Feature: storey_range
Drift Detected: Yes
Drift Test Statistic: 38.800
P-Value: 0.001

Continuous Feature: dist_to_nearest_stn
Drift Detected: No
Drift Test Statistic: 0.055
P-Value: 0.094

Continuous Feature: dist_to_dhoby
Drift Detected: Yes
Drift Test Statistic: 0.218
P-Value: 0.000

Continuous Feature: degree_centrality
Drift Detected: No
Drift Test Statistic: 0.029
P-Value: 0.783

Continuous Feature: eigenvector_centrality
Drift Detected: Yes
Drift Test Statistic: 0.195
P-Value: 0.000

Continuous Feature: remaining_lease_years
Drift Detected: Yes
Drift Test Statistic: 0.271
P-Value: 0.000

Continuous Feature: floor_area_sqm
Dr

> Assuming that the flurry of housing measures has made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


Concept drift: the relationship between the features and `resale_price`, $P(Y|X)$, has changed.

> From your analysis via TabularDrift, which features contribute to this shift?


TabularDrift flags the following features as drifted (p-value below the corrected threshold of 0.005):

In [28]:
drifted_columns

['town',
 'flat_model_type',
 'storey_range',
 'dist_to_dhoby',
 'eigenvector_centrality',
 'remaining_lease_years',
 'floor_area_sqm']

> Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


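Retrain the model periodically on newly acquired data so that it adapts to the current distribution. Here, the model is retrained on all data up to and including 2022 and then evaluated on 2023.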
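
One way to organise such retraining is a walk-forward scheme: for each new year, refit on the years that precede it (optionally only the most recent few) and test on that year. The generator below is a minimal sketch of that idea. `walk_forward_splits` and its `window` parameter are hypothetical names introduced here, the 2017 to 2023 span is an assumption about the dataset, and the fitting step itself is left to whichever model is in use.

```python
def walk_forward_splits(years, first_test_year, window=None):
    """Yield (train_years, test_year) pairs in chronological order.

    window=None trains on every earlier year (expanding window);
    window=k trains on only the k most recent earlier years (sliding window).
    """
    if window is not None and window < 1:
        raise ValueError("window must be at least 1")
    ordered = sorted(set(years))
    for test_year in ordered:
        if test_year < first_test_year:
            continue
        earlier = [y for y in ordered if y < test_year]
        if window is not None:
            earlier = earlier[-window:]
        if earlier:  # skip a test year that has no earlier data to train on
            yield earlier, test_year

years = range(2017, 2024)  # assumed span of the dataset, 2017 to 2023

for train_years, test_year in walk_forward_splits(years, first_test_year=2022):
    print(f"train on {train_years[0]}-{train_years[-1]}, test on {test_year}")
```

With a DataFrame that has a `year` column, each split could be materialised as `df[df['year'].isin(train_years)]` for training and `df[df['year'] == test_year]` for testing. A sliding window is worth trying when older data no longer reflects current conditions.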

In [29]:
# TODO: Enter your code here
tabular_model.fit(train=df[df['year'] <= 2022])

Global seed set to 42
2023-10-13 20:03:13,739 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-13 20:03:13,747 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-13 20:03:13,888 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-13 20:03:13,917 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-13 20:03:13,968 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
  rank_zero_warn(
  rank_zero_warn(
  rank_zero_warn(
  rank_zero_warn(


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

  rank_zero_warn(
  rank_zero_warn(
`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\alpha\PycharmProjects\pythonProject3\.lr_find_5a3603dc-5448-4811-ad19-cfcd7ff194e3.ckpt
Restored all states from the checkpoint file at C:\Users\alpha\PycharmProjects\pythonProject3\.lr_find_5a3603dc-5448-4811-ad19-cfcd7ff194e3.ckpt
2023-10-13 20:03:17,293 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-13 20:03:17,294 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


Output()

2023-10-13 20:04:57,968 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-13 20:04:57,969 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model
  rank_zero_deprecation(


<pytorch_lightning.trainer.trainer.Trainer at 0x2606eee2020>

In [30]:
test = df[df['year'] == 2023]

result = tabular_model.evaluate(test)
pred_df = tabular_model.predict(test)

target = pred_df['resale_price']
prediction = pred_df['resale_price_prediction']
r_squared = r2_score(target, prediction)
print(f"R^2 Score: {r_squared}")

Output()

  rank_zero_warn(


Output()

R^2 Score: 0.5529873559271797


Retraining on data up to and including 2022 raises the 2023 test $R^2$ from 0.162 to 0.553, so incorporating more recent data substantially reduces the degradation.

### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2  |
|--------------|----------|----------|
| Year <= 2020 | 2021     | 0.76     |
| Year <= 2020 | **2022** | **0.41** |
| Year <= 2020 | **2023** | **0.10** |



Similarly, a model trained on 2017 data predicts 2018-2021 well (with slight degradation in 2021), but its performance drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |