# Question B1 (15 marks)

Real-world datasets often mix numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
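
As a small illustration of the two options, the sketch below one-hot encodes a categorical column and also converts it to integer codes, which is the form an embedding layer would use as lookup indices. The toy values are made up; only the column names mirror this dataset.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "town": ["QUEENSTOWN", "TAMPINES", "QUEENSTOWN"],
    "floor_area_sqm": [97.0, 104.0, 109.0],
})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy, columns=["town"], dtype=int)

# Integer codes: each category becomes an index into an embedding table.
codes = toy["town"].astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```

One-hot columns grow with the number of categories, whereas an embedding maps each integer code to a short learned vector.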

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]



In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

1. Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [4]:
df = pd.read_csv('hdb_price_prediction.csv')

# Training Data Set: Year 2019 and before
df_train = df[df['year'] <= 2019].copy()
# Validation Data Set: Year 2020
df_val = df[df['year'] == 2020].copy()
# Testing Data Set: Year 2021
df_test = df[df['year'] == 2021].copy()

# Dropping Columns not used for training
df_train.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_val.drop(columns=['year','full_address','nearest_stn'], inplace=True)
df_test.drop(columns=['year','full_address','nearest_stn'], inplace=True)

print("Training Data:", df_train.shape)
print("Validation Data:", df_val.shape)
print("Testing Data:", df_test.shape)

Training Data: (64057, 11)
Validation Data: (23313, 11)
Testing Data: (29057, 11)


2. Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose the Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) or a scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.
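
As a rough picture of what such a network computes, the sketch below embeds each categorical column, concatenates the embeddings with the continuous features, and passes the result through one hidden layer of 50 units. It is a NumPy illustration with made-up sizes and random weights, not PyTorch Tabular's implementation; `cardinalities`, `embedding_dims` and `forward` are hypothetical names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: 2 categorical and 3 continuous columns.
cardinalities = [12, 26]   # assumed number of distinct values per categorical column
embedding_dims = [6, 13]   # assumed embedding width per categorical column
n_continuous = 3
hidden_units = 50          # one hidden layer with 50 neurons

# One lookup table (cardinality x embedding width) per categorical column.
tables = [rng.normal(size=(c, d)) for c, d in zip(cardinalities, embedding_dims)]

in_dim = sum(embedding_dims) + n_continuous
W1, b1 = rng.normal(size=(in_dim, hidden_units)), np.zeros(hidden_units)
W2, b2 = rng.normal(size=(hidden_units, 1)), np.zeros(1)

def forward(cat_idx, cont):
    """cat_idx: (batch, 2) integer codes; cont: (batch, 3) floats."""
    # Look up one embedding vector per row for each categorical column.
    embedded = [table[cat_idx[:, i]] for i, table in enumerate(tables)]
    x = np.concatenate(embedded + [cont], axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
    return hidden @ W2 + b2                # one regression output per row

batch_cat = np.array([[0, 5], [11, 25]])
batch_cont = rng.normal(size=(2, n_continuous))
out = forward(batch_cat, batch_cont)
print(out.shape)
```

In a trained model the lookup tables and weights are learned jointly, so categories that behave similarly with respect to the target can end up with similar embedding vectors.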

In [6]:
numeric = ['dist_to_nearest_stn','dist_to_dhoby','degree_centrality','eigenvector_centrality', 
           'remaining_lease_years','floor_area_sqm']
categorical = ['month','town','flat_model_type','storey_range']

data_config = DataConfig(
    target=["resale_price"],  
    continuous_cols=numeric,
    categorical_cols=categorical,
)

In [7]:
trainer_config = TrainerConfig(
    auto_lr_find=True,  # automatically tune the learning rate
    batch_size=1024,
    max_epochs=50,
)

In [8]:
model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  
)

In [9]:
optimizer_config = OptimizerConfig(optimizer='Adam')

In [10]:
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

3. Report the test RMSE and the test R2 value that you obtained.



In [12]:
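from torch_optimizer import QHAdam

# Train the tabular model.
# Note: an optimizer class passed to fit() is used in place of the optimizer
# named in OptimizerConfig (Adam), so this run trains with QHAdam.
tabular_model.fit(df_train,
                  validation=df_val,
                  optimizer=QHAdam)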



<pytorch_lightning.trainer.trainer.Trainer at 0x1c90a4ee880>

In [13]:
# Evaluation and Prediction
evaluation = tabular_model.evaluate(df_test)
predicted = tabular_model.predict(df_test)

In [14]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# True Values and Predicted Values
y_true = df_test['resale_price'].values
y_pred = predicted['resale_price_prediction']

mse = mean_squared_error(y_true, y_pred)
print("Root Mean Squared Error (RMSE):", np.sqrt(mse))
r2 = r2_score(y_true, y_pred)
print("R2 Score:", r2)

Root Mean Squared Error (RMSE): 80411.27126542194
R2 Score: 0.7555578453226205
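
For reference, the two metrics can be written out by hand. The sketch below uses only the standard library and made-up toy values, with hypothetical helper names `rmse` and `r2`; it is meant to show the formulas, not to replace the scikit-learn calls above.

```python
import math

def rmse(y_true, y_pred):
    # Square root of the mean squared residual.
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean).
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 10.0]

rmse_toy = rmse(y_true_toy, y_pred_toy)  # sqrt(3 / 4), about 0.866
r2_toy = r2(y_true_toy, y_pred_toy)      # 1 - 3 / 20, about 0.85
print(rmse_toy, r2_toy)
```

RMSE is in the same units as resale_price, while R2 is unitless: a value of about 0.756 means the predictions account for roughly three quarters of the variance in the 2021 test prices.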


4. Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [33]:
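# Create absolute error column (y_pred keeps the index of df_test)
df_test['absolute_error'] = abs(y_true - y_pred)

# Top 25 test samples with the largest error. Look the rows up in the original
# dataframe so the dropped columns (year, full_address, nearest_stn) are shown.
worst_idx = df_test['absolute_error'].nlargest(25).index
worst_predictions = df.loc[worst_idx].assign(
    absolute_error=df_test.loc[worst_idx, 'absolute_error'])
worst_predictions = worst_predictions.reset_index(drop=True)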

worst_predictions

index,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,absolute_error
0,11,2021,BUKIT MERAH,46 SENG POH ROAD,Tiong Bahru,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0,419085.375
1,2,2021,QUEENSTOWN,46 STIRLING ROAD,Commonwealth,0.570988,4.922054,0.016807,0.00535,"4 ROOM, Terrace",46.916667,134.0,01 TO 03,975000.0,413788.3125
2,12,2021,QUEENSTOWN,89 DAWSON ROAD,Queenstown,0.658035,3.807573,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.333333,109.0,04 TO 06,968000.0,407555.0625
3,8,2021,QUEENSTOWN,42 STIRLING ROAD,Queenstown,0.554599,4.841933,0.016807,0.008342,"4 ROOM, Terrace",46.416667,120.0,01 TO 03,930000.0,394128.4375
4,10,2021,QUEENSTOWN,92 DAWSON ROAD,Queenstown,0.584731,3.882019,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.5,97.0,13 TO 15,958000.0,383422.9375
5,6,2021,QUEENSTOWN,91 DAWSON ROAD,Queenstown,0.745596,3.720593,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.916667,97.0,07 TO 09,930000.0,380730.5625
6,6,2021,QUEENSTOWN,89 DAWSON ROAD,Queenstown,0.658035,3.807573,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.916667,109.0,10 TO 12,950000.0,366701.5625
7,11,2021,BUKIT MERAH,127D KIM TIAN ROAD,Tiong Bahru,0.686789,2.664024,0.016807,0.047782,"5 ROOM, Improved",90.333333,113.0,16 TO 18,1165000.0,365478.0
8,6,2021,QUEENSTOWN,150 MEI LING STREET,Queenstown,0.245207,4.709043,0.016807,0.008342,"EXECUTIVE, Apartment",73.416667,148.0,10 TO 12,1235000.0,365216.5625
9,6,2021,QUEENSTOWN,91 DAWSON ROAD,Queenstown,0.745596,3.720593,0.016807,0.008342,"4 ROOM, Premium Apartment Loft",93.916667,109.0,31 TO 33,1032888.0,362622.0625


In [37]:
# Analysis of numeric data
print("\nAnalysis of Worst Predictions:")
for col in ['floor_area_sqm', 'remaining_lease_years', 'dist_to_nearest_stn', 'dist_to_dhoby', 
            'degree_centrality', 'eigenvector_centrality']:
    print(f"\n{col} statistics:")
    print(f"Mean for worst predictions: {worst_predictions[col].mean():.2f}")
    print(f"Mean for all test data: {df_test[col].mean():.2f}")

# Analysis of categorical data
for col in ['month', 'town', 'flat_model_type', 'storey_range']:
    print(f"\nTop 5 {col} in worst predictions:")
    print(worst_predictions[col].value_counts().nlargest(5))
    
    # Compare to overall distribution
    print(f"\nOverall top 5 {col} distribution:")
    print(df_test[col].value_counts().nlargest(5))


Analysis of Worst Predictions:

floor_area_sqm statistics:
Mean for worst predictions: 112.60
Mean for all test data: 98.25

remaining_lease_years statistics:
Mean for worst predictions: 82.50
Mean for all test data: 75.37

dist_to_nearest_stn statistics:
Mean for worst predictions: 0.64
Mean for all test data: 0.82

dist_to_dhoby statistics:
Mean for worst predictions: 4.94
Mean for all test data: 10.98

degree_centrality statistics:
Mean for worst predictions: 0.02
Mean for all test data: 0.02

eigenvector_centrality statistics:
Mean for worst predictions: 0.02
Mean for all test data: 0.01

Top 5 month in worst predictions:
month
12    5
6     5
10    3
9     3
11    2
Name: count, dtype: int64

Overall top 5 month distribution:
month
8     2735
7     2655
11    2566
9     2510
10    2495
Name: count, dtype: int64

Top 5 town in worst predictions:
town
QUEENSTOWN     16
BUKIT MERAH     5
BUKIT BATOK     1
ANG MO KIO      1
TAMPINES        1
Name: count, dtype: int64

Overall top 5 t

Trends:
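1. The worst predictions are dominated by "4 ROOM, Premium Apartment Loft" flats (13 of 25 cases), even though this flat model type is not among the five most common types in the test data.
2. 16 of the 25 worst predictions are in Queenstown, although Queenstown is not among the five most common towns in the test data. The worst predictions are also closer to the city centre on average (mean dist_to_dhoby of 4.94 versus 10.98 for all test data).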

Ways to reduce errors:
1. Premium, unusual flats are predicted poorly, possibly because their flat model type is relatively rare. More training data on premium flats could help the model learn their pricing. Adding more specific flat features could also help, since they would give the model more grounds to distinguish premium flats from common flat types.
2. Location-specific pricing in Queenstown is not captured well, so the model would likely benefit from more training data from Queenstown.
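
One way to check trends like these more systematically is to compare each category's share among the worst predictions with its share in the whole test set, and to look at the mean absolute error per category. The sketch below does this on a made-up toy frame standing in for df_test; the values and the names `lift` and `mae_by_town` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for df_test with an absolute_error column; values are made up.
toy = pd.DataFrame({
    "town": ["QUEENSTOWN", "QUEENSTOWN", "TAMPINES", "TAMPINES", "BEDOK", "BEDOK"],
    "absolute_error": [400.0, 380.0, 40.0, 60.0, 50.0, 30.0],
})

worst = toy.nlargest(2, "absolute_error")

# Share of each town among the worst rows relative to its share overall.
share_worst = worst["town"].value_counts(normalize=True)
share_all = toy["town"].value_counts(normalize=True)
lift = (share_worst / share_all).dropna().sort_values(ascending=False)

# Mean absolute error per town, largest first.
mae_by_town = (
    toy.groupby("town")["absolute_error"].mean().sort_values(ascending=False)
)

print(lift)
print(mae_by_town)
```

A lift well above 1 marks a category that is over-represented among the worst predictions, which is the pattern described for Queenstown above.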