# Question B4 (10 marks)

Model degradation is a common issue when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiment. For instance, housing prices in Singapore have been increasing, and the Singapore government has introduced three rounds of cooling measures in recent years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution on which the models were trained. Recall that machine learning models often assume that the test distribution is similar to the training distribution. When this assumption is violated, model performance is adversely affected. In the last part of this assignment, we will investigate the extent to which model degradation has occurred.




---



Your co-investigators used a linear regression model to rapidly test several combinations of train/test splits and shared their findings with you in a brief report attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [2]:
!pip install alibi-detect



In [3]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

1. Evaluate your model from B1 on data from year 2022 and report the test R2.

In [5]:
# Import relevant libraries

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

In [6]:
# Redoing model from B1 and testing on 2022 instead of 2021

df = pd.read_csv('hdb_price_prediction.csv')

# Training Data Set: Year 2019 and before
df_train = df[df['year'] <= 2019].copy()
# Validation Data Set: Year 2020
df_val = df[df['year'] == 2020].copy()
# Testing Data Set: Year 2022
df_test22 = df[df['year'] == 2022].copy()

# Dropping Unnecessary Columns
df_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_val.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_test22.drop(columns=['year','full_address','nearest_stn'], inplace=True)

numeric = ['dist_to_nearest_stn','dist_to_dhoby','degree_centrality','eigenvector_centrality',
                 'remaining_lease_years','floor_area_sqm']
categorical = ['month','town','flat_model_type','storey_range']

data_config = DataConfig(
    target=["resale_price"],  
    continuous_cols=numeric,
    categorical_cols=categorical,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

In [7]:
from torch_optimizer import QHAdam
# Training Tabular Model
tabular_model.fit(df_train, 
                  validation=df_val,
                  optimizer=QHAdam)


<pytorch_lightning.trainer.trainer.Trainer at 0x2a12ef855e0>

In [8]:
# Evaluation and Prediction
evaluation22 = tabular_model.evaluate(df_test22)
predicted22 = tabular_model.predict(df_test22)

In [9]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# True Values and Predicted Values
y_true22 = df_test22['resale_price'].values
y_pred22 = predicted22['resale_price_prediction']

mse = mean_squared_error(y_true22, y_pred22)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))
r2 = r2_score(y_true22, y_pred22)
print("R2 Score:", r2)

Root Mean Squared Error (RMSE): 129910.63385219268
R2 Score: 0.41781365864686426
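
For reference, the R2 reported above is the coefficient of determination, 1 - SS_res / SS_tot. The following is a minimal NumPy sketch of that definition; the helper name `r2_from_definition` and the arrays are illustrative assumptions, not values from the HDB dataset. It shows that R2 is 0 for a constant mean prediction and can become negative for a model that does worse than the mean.

```python
import numpy as np

def r2_from_definition(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not from the assignment data)
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
perfect = y_true.copy()                          # exact predictions
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
biased = y_true - 200_000.0                      # systematic under-prediction

print(r2_from_definition(y_true, perfect))
print(r2_from_definition(y_true, mean_only))
print(r2_from_definition(y_true, biased))
```

A systematic bias, such as prices rising across the board after training, is enough to push R2 down sharply even when the ranking of flats is still predicted well.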


2. Evaluate your model from B1 on data from year 2023 and report the test R2.

In [12]:
# Data Preparation for 2023

# Testing Data Set: Year 2023
df_test23 = df[df['year'] == 2023].copy()

# Dropping Unnecessary Columns
df_test23.drop(columns=['year','full_address','nearest_stn'], inplace=True)

In [13]:
# Evaluation and Prediction
evaluation23 = tabular_model.evaluate(df_test23)
predicted23 = tabular_model.predict(df_test23)

In [14]:
# True Values and Predicted Values
y_true23 = df_test23['resale_price'].values
y_pred23 = predicted23['resale_price_prediction']

mse = mean_squared_error(y_true23, y_pred23)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))
r2 = r2_score(y_true23, y_pred23)
print("R2 Score:", r2)

Root Mean Squared Error (RMSE): 159714.97544234028
R2 Score: 0.13473508672140222


3. Did model degradation occur for the deep learning model?


Yes, model degradation occurred. The test R2 dropped from **0.7555578453226205** on the 2021 test data (in B1) to **0.41781365864686426** on the 2022 test data, and then to **0.13473508672140222** on the 2023 test data.



---



4. Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [55]:
# Dropping Resale Price (target) leaving only features
train_drift = df_train.copy()
train_drift.drop(columns=['resale_price'],inplace=True)

test_drift = df_test23.copy()
test_drift.drop(columns=['resale_price'],inplace=True)

feature_names = train_drift.columns
feature_names

Index(['month', 'town', 'dist_to_nearest_stn', 'dist_to_dhoby',
       'degree_centrality', 'eigenvector_centrality', 'flat_model_type',
       'remaining_lease_years', 'floor_area_sqm', 'storey_range'],
      dtype='object')

In [57]:
# Sample 1000 data points each
sample_train = train_drift.sample(1000, random_state = 42)
sample_test = test_drift.sample(1000, random_state = 42)

In [61]:
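# Detecting drift
# NOTE: this marks every column index as categorical, so TabularDrift applies
# the Chi-squared test to all features, including the continuous ones.
categories_per_feature = {f: None for f in range(sample_train.values.shape[1])}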

cd = TabularDrift(sample_train.values, 
                  p_val=.05, 
                  categories_per_feature=categories_per_feature)

preds = cd.predict(sample_test.values)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Drift? Yes!


In [63]:
# Individual feature-wise drift

fpreds = cd.predict(sample_test.values, drift_type='feature')

for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

month -- Drift? Yes! -- Chi2 430.336 -- p-value 0.000
town -- Drift? No! -- Chi2 33.178 -- p-value 0.127
dist_to_nearest_stn -- Drift? No! -- Chi2 1799.333 -- p-value 0.153
dist_to_dhoby -- Drift? No! -- Chi2 1799.333 -- p-value 0.153
degree_centrality -- Drift? Yes! -- Chi2 14.145 -- p-value 0.003
eigenvector_centrality -- Drift? Yes! -- Chi2 110.044 -- p-value 0.008
flat_model_type -- Drift? Yes! -- Chi2 62.122 -- p-value 0.001
remaining_lease_years -- Drift? Yes! -- Chi2 824.113 -- p-value 0.000
floor_area_sqm -- Drift? Yes! -- Chi2 210.241 -- p-value 0.000
storey_range -- Drift? Yes! -- Chi2 27.842 -- p-value 0.010
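
In the run above every feature is tested with Chi2, because `categories_per_feature` lists all column indices. A possible alternative, sketched below, is to list only the indices of the categorical columns, so that TabularDrift uses the K-S test for the remaining numeric columns. The feature order and the `categorical` list are copied from the earlier cells. The commented-out call is an assumed usage that was not run here, so no new results are reported, and the mixed-type array may need its numeric columns cast to float first.

```python
# Feature order as printed by `feature_names` above
feature_names = ['month', 'town', 'dist_to_nearest_stn', 'dist_to_dhoby',
                 'degree_centrality', 'eigenvector_centrality', 'flat_model_type',
                 'remaining_lease_years', 'floor_area_sqm', 'storey_range']
categorical = ['month', 'town', 'flat_model_type', 'storey_range']

# Map only the categorical column indices; None asks the detector to infer
# the category values from the reference data.
categories_per_feature = {
    i: None for i, name in enumerate(feature_names) if name in categorical
}
print(categories_per_feature)

# Assumed usage, mirroring the call above (not executed here):
# cd = TabularDrift(sample_train.values, p_val=.05,
#                   categories_per_feature=categories_per_feature)
```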


5. Assuming that the flurry of housing measures has made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


Concept drift possibly led to model degradation. Concept drift, also known as posterior shift, occurs when the conditional distribution of the output given the input, P(Y|X), changes while the input distribution P(X) remains the same. In this case, the cooling measures introduced by the government have altered the relationship between the features and resale_price.
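
To make the definition concrete, here is a small synthetic sketch (not the HDB data) in which P(X) is identical before and after a hypothetical change, while P(Y|X) changes. The slopes, intercepts and noise levels are arbitrary assumptions chosen for illustration. A line fitted on the old regime scores well there and noticeably worse on the new regime.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

n = 2000
# Same input distribution P(X) before and after the change
x_old = rng.normal(size=n)
x_new = rng.normal(size=n)

# P(Y|X) changes: slope and intercept differ after the (hypothetical) change
y_old = 2.0 * x_old + rng.normal(scale=0.5, size=n)
y_new = 3.5 * x_new + 2.0 + rng.normal(scale=0.5, size=n)

# Fit a straight line on the old regime only
slope, intercept = np.polyfit(x_old, y_old, deg=1)

r2_old = r2(y_old, slope * x_old + intercept)
r2_new = r2(y_new, slope * x_new + intercept)
print(f"R2 on old regime: {r2_old:.3f}")
print(f"R2 on new regime: {r2_new:.3f}")
```

A feature-only drift detector would see nothing unusual in this toy setting, since `x_old` and `x_new` come from the same distribution; only the labels reveal the change.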

6. From your analysis via TabularDrift, which features contribute to this shift?


From the feature-wise TabularDrift results above, 'month', 'degree_centrality', 'eigenvector_centrality', 'flat_model_type', 'remaining_lease_years', 'floor_area_sqm' and 'storey_range' have drifted and contribute to the shift. Each of these features has a p-value of at most 0.010, which is below the 0.05 threshold passed to TabularDrift.
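
Several of the drifted features listed above, such as 'remaining_lease_years' and 'floor_area_sqm', are continuous. For columns that are not declared categorical, TabularDrift uses the two-sample Kolmogorov-Smirnov (K-S) test. The sketch below illustrates the K-S statistic on synthetic samples with NumPy only. The helper names are hypothetical, the data are made up, and the p-value is approximated with a permutation test rather than the analytic K-S p-value that a statistics library would report.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

def permutation_p_value(a, b, n_perm=200, seed=0):
    """Approximate p-value obtained by reshuffling the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(a, b)
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if ks_statistic(perm[:a.size], perm[a.size:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic stand-ins for a continuous feature (values are made up)
rng = np.random.default_rng(42)
reference = rng.normal(loc=70.0, scale=10.0, size=500)
same = rng.normal(loc=70.0, scale=10.0, size=500)
shifted = rng.normal(loc=75.0, scale=10.0, size=500)

print("same:   ", ks_statistic(reference, same))
print("shifted:", ks_statistic(reference, shifted),
      permutation_p_value(reference, shifted))
```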

7. Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


To reduce the effect of concept drift, the training, validation and test data should include the periods affected by the cooling measures (16 December 2021, 30 September 2022, 27 April 2023).

Therefore, we can adjust the train-validation-test split to:
-  Train: before and inclusive of 2021
-  Validate: 2022
-  Test: 2023
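
The split listed above can be sketched as a small helper. The function name `split_by_year` and the toy frame are assumptions for illustration; only the `year` and `resale_price` column names come from the dataset. Each subset is returned as a `.copy()`, so later in-place operations work on independent frames.

```python
import pandas as pd

def split_by_year(df, train_until, val_year, test_year, year_col="year"):
    """Return independent train/val/test frames split on a year column."""
    train = df[df[year_col] <= train_until].copy()
    val = df[df[year_col] == val_year].copy()
    test = df[df[year_col] == test_year].copy()
    return train, val, test

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022, 2022, 2023],
    "resale_price": [400_000, 420_000, 480_000, 530_000, 540_000, 560_000],
})
train, val, test = split_by_year(toy, train_until=2021, val_year=2022, test_year=2023)
print(len(train), len(val), len(test))
```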

In [88]:
# Splitting data (.copy() gives independent frames, so the in-place drops
# below do not trigger pandas' SettingWithCopyWarning)
final_train = df[df['year'] <= 2021].copy()
final_val = df[df['year'] == 2022].copy()
final_test = df[df['year'] == 2023].copy()

# Dropping Unnecessary Columns
final_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
final_val.drop(columns=['year','full_address','nearest_stn'], inplace=True)
final_test.drop(columns=['year','full_address','nearest_stn'], inplace=True)

numeric = ['dist_to_nearest_stn','dist_to_dhoby','degree_centrality','eigenvector_centrality',
                 'remaining_lease_years','floor_area_sqm']
categorical = ['month','town','flat_model_type','storey_range']

data_config = DataConfig(
    target=["resale_price"],
    continuous_cols=numeric,
    categorical_cols=categorical,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)
optimizer_config = OptimizerConfig()

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)


In [90]:
# Training Tabular Model
tabular_model.fit(final_train, 
                  validation=final_val, 
                  optimizer=QHAdam)


<pytorch_lightning.trainer.trainer.Trainer at 0x2a134e20b50>

In [92]:
# Evaluation and Prediction
final_evaluation = tabular_model.evaluate(final_test)
final_predicted = tabular_model.predict(final_test)

In [94]:
# True Values and Predicted Values
y_true_final = final_test['resale_price'].values
y_pred_final = final_predicted['resale_price_prediction']

mse = mean_squared_error(y_true_final, y_pred_final)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))
r2 = r2_score(y_true_final, y_pred_final)
print("R2 Score:", r2)

Root Mean Squared Error (RMSE): 134635.19982569522
R2 Score: 0.38514168047980224


The test R2 for year 2023 improved from **0.13473508672140222** to **0.38514168047980224** after the training and validation sets were extended to include more recent data.

### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.
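
The preprocessing described above could look roughly like the following scikit-learn pipeline. This is a hedged sketch on synthetic data, not the co-investigators' actual code: the toy frame, the town labels and the coefficients are made up, and only the column names `floor_area_sqm`, `town` and `resale_price` are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "floor_area_sqm": rng.uniform(50, 150, size=n),
    "town": rng.choice(["A", "B", "C"], size=n),
})
town_effect = toy["town"].map({"A": 0.0, "B": 50_000.0, "C": 100_000.0})
toy["resale_price"] = (3_000.0 * toy["floor_area_sqm"] + town_effect
                       + rng.normal(scale=5_000.0, size=n))

# Scale continuous columns, one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["floor_area_sqm"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["town"]),
])
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

features = ["floor_area_sqm", "town"]
train, test = toy.iloc[:150], toy.iloc[150:]
model.fit(train[features], train["resale_price"])
score = model.score(test[features], test["resale_price"])  # R2 for regressors
print(f"Toy test R2: {score:.3f}")
```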

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | **0.41**    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but its performance drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |