CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often have a mix of numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
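
As a minimal sketch of the two options named above (not part of the assignment code), the snippet below encodes a toy `town` column first as one-hot vectors and then as integer indices, which is the form an embedding layer would look up. The sample values and the helper `one_hot` are assumptions chosen for illustration.

```python
# Toy categorical column; the values are illustrative, not read from the dataset.
towns = ["BISHAN", "TAMPINES", "BISHAN", "HOUGANG"]

# Build a stable vocabulary: category -> integer index.
vocab = {name: idx for idx, name in enumerate(sorted(set(towns)))}


def one_hot(value, vocab):
    """Return a list with a single 1 at the category's index."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


# Option 1: one-hot encoding (one column per category).
one_hot_rows = [one_hot(t, vocab) for t in towns]

# Option 2: integer indices, the input format an embedding layer looks up.
index_rows = [vocab[t] for t in towns]

print(vocab)
print(one_hot_rows)
print(index_rows)
```

One-hot vectors grow with the number of categories, whereas an embedding maps each index to a short learned vector, which is why embeddings are often preferred for high-cardinality columns.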

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]

In [5]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [6]:
# TODO: Enter your code here

df = pd.read_csv('hdb_price_prediction.csv')
df_train = df[df['year'] <= 2019]
df_val = df[df['year'] == 2020]
df_test = df[df['year'] == 2021]


> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set `batch_size` to 1024 and `max_epochs` to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [91]:
# TODO: Enter your code here
num_col_names = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
cat_col_names = ['month', 'town', 'flat_model_type', 'storey_range']
data_config = DataConfig(
    target=[
        'resale_price'
    ],  # target should always be a list. Multi-targets are only supported for regression. Multi-Task Classification is not implemented
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    auto_lr_find=True,  # Runs the LRFinder to automatically derive a learning rate
    batch_size=1024,
    max_epochs=50,
)
optimizer_config = OptimizerConfig('Adam')

model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # Number of nodes in each layer
    activation="LeakyReLU",  # Activation applied after each hidden layer
    learning_rate=1e-3,
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=df_train, validation=df_val)
result = tabular_model.evaluate(df_test)
pred_df = tabular_model.predict(df_test)
tabular_model.save_model("hdb_regression")
loaded_model = TabularModel.load_from_checkpoint("hdb_regression")

2023-10-02 11:14:55,597 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off


Global seed set to 42
2023-10-02 11:14:55,626 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-02 11:14:55,635 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-02 11:14:55,700 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-02 11:14:55,740 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-02 11:14:55,781 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started
`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at d:\micha\Michael\GitHub\Michael_Lee\CZ4042_lab\Assignment\.lr_find_7e711698-7b00-4b13-9f5f-34d200a6509e.ckpt
Restored all states from the checkpoint file at d:\micha\Michael\GitHub\Michael_Lee\CZ4042_lab\Assignment\.lr_find_7e711698-7b00-4b13-9f5f-34d200a6509e.ckpt
2023-10-02 11:14:59,354 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-02 11:14:59,354 - {pytorch_tabular.tabular_model:582} - INFO - Training Started


2023-10-02 11:15:45,529 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-02 11:15:45,530 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


2023-10-02 11:15:47,946 - {pytorch_tabular.tabular_model:129} - INFO - Experiment Tracking is turned off
2023-10-02 11:15:47,946 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


> Report the test RMSE error and the test R2 value that you obtained.



In [94]:
from sklearn.metrics import mean_squared_error, r2_score

# Compare the true prices with the model's predictions on the test set.
y_true = pred_df['resale_price']
y_pred = pred_df['resale_price_prediction']

rmse = mean_squared_error(y_true, y_pred) ** 0.5  # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)  # separate name so the imported r2_score is not shadowed

print("RMSE Score:", rmse)
print("R2 Score:", r2)


RMSE Score: 71069.38710449345
R2 Score: 0.8090553972101858


Test RMSE: 71069

Test R2: 0.81
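
For reference, both metrics can be computed by hand. The sketch below uses made-up prices (an assumption for illustration, not the notebook's data) to show what the two numbers measure: RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, chosen only to keep the arithmetic easy to follow.
y_true = [400_000.0, 500_000.0, 600_000.0]
y_pred = [410_000.0, 480_000.0, 630_000.0]

rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(rmse_value)
print(r2_value)
```

RMSE is expressed in the same unit as the target (dollars here), so it should be read relative to typical resale prices, while R2 is unitless.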


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [95]:
# TODO: Enter your code here
pred_df['pred_error'] = abs(pred_df['resale_price_prediction'] - pred_df['resale_price'])
pred_df = pred_df.sort_values('pred_error', ascending=False)
pred_df.head(25).reset_index()


Unnamed: 0,index,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price,resale_price_prediction,pred_error
0,92405,11,2021,BUKIT MERAH,46 SENG POH ROAD,Tiong Bahru,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0,406487.7,373512.3125
1,90957,6,2021,BUKIT BATOK,288A BUKIT BATOK STREET 25,Bukit Batok,1.29254,10.763777,0.016807,0.000217,"EXECUTIVE, Apartment",75.583333,144.0,10 TO 12,968000.0,631980.5,336019.5
2,112128,12,2021,TAMPINES,156 TAMPINES STREET 12,Tampines,0.370873,12.479752,0.033613,0.000229,"EXECUTIVE, Maisonette",61.75,148.0,01 TO 03,998000.0,668925.2,329074.75
3,90608,12,2021,BISHAN,273B BISHAN STREET 24,Bishan,0.776182,6.297489,0.033613,0.015854,"5 ROOM, DBSS",88.833333,120.0,37 TO 39,1360000.0,1045015.0,314985.125
4,90521,10,2021,BISHAN,237 BISHAN STREET 22,Bishan,0.947205,6.663943,0.033613,0.015854,"5 ROOM, Improved",69.583333,121.0,07 TO 09,988000.0,689532.2,298467.8125
5,114254,9,2021,WOODLANDS,789 WOODLANDS AVENUE 6,Woodlands,1.915461,16.660245,0.016807,2.4e-05,"EXECUTIVE, Maisonette",75.083333,141.0,10 TO 12,800000.0,501547.0,298453.0
6,92442,11,2021,BUKIT MERAH,127D KIM TIAN ROAD,Tiong Bahru,0.686789,2.664024,0.016807,0.047782,"5 ROOM, Improved",90.333333,113.0,16 TO 18,1165000.0,867835.1,297164.9375
7,98379,12,2021,HOUGANG,615 HOUGANG AVENUE 8,Hougang,0.899849,8.828235,0.016807,0.001507,"EXECUTIVE, Apartment",63.666667,142.0,04 TO 06,873000.0,585992.5,287007.5
8,92340,10,2021,BUKIT MERAH,56 HAVELOCK ROAD,Tiong Bahru,0.451387,2.128424,0.016807,0.047782,"5 ROOM, Improved",90.75,114.0,34 TO 36,1245000.0,961098.4,283901.5625
9,91871,6,2021,BUKIT MERAH,17 TIONG BAHRU ROAD,Tiong Bahru,0.693391,2.058774,0.016807,0.047782,"3 ROOM, Standard",50.583333,88.0,01 TO 03,680888.0,401855.0,279033.0
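
One pattern worth checking is the direction of the errors. The sketch below copies the actual and predicted prices from the ten rows shown above and computes the signed residual (actual minus predicted) for each row; the variable names are illustrative.

```python
# (resale_price, resale_price_prediction) pairs copied from the ten rows shown above.
rows = [
    (780000.0, 406487.7),
    (968000.0, 631980.5),
    (998000.0, 668925.2),
    (1360000.0, 1045015.0),
    (988000.0, 689532.2),
    (800000.0, 501547.0),
    (1165000.0, 867835.1),
    (873000.0, 585992.5),
    (1245000.0, 961098.4),
    (680888.0, 401855.0),
]

# A positive residual means the model predicted a price below the actual one.
residuals = [actual - predicted for actual, predicted in rows]
underpredicted = sum(r > 0 for r in residuals)

print(f"{underpredicted} of {len(rows)} rows are under-predicted")
```

For these ten rows every residual is positive, so the model under-predicts each of them. The same check on the full `pred_df` (using the signed difference rather than `abs`) would show whether this holds for all 25 rows.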


In [98]:
# Re-sort so that the best predictions (smallest absolute error) come first.
good_pred_df = pred_df.sort_values('pred_error', ascending=True).reset_index()

print("Compare the categorical features for the 25 worst and 25 best predictions. The tables display the top 5 categories within each categorical feature for the 25 worst and 25 best predictions.")
for col in cat_col_names:
    print(f"===== Comparing for categorical category {col} =====")
    # pred_df is already sorted by descending error, so head(25) gives the worst rows.
    bad_counts = pred_df[col].head(25).value_counts().head(5)
    good_counts = good_pred_df[col].head(25).value_counts().head(5)

    df_compare = pd.DataFrame()
    df_compare['Bad Category'] = bad_counts.index.tolist()
    df_compare['Bad Category Count'] = bad_counts.values.tolist()
    df_compare['Good Category'] = good_counts.index.tolist()
    df_compare['Good Category Count'] = good_counts.values.tolist()

    print(df_compare)

Compare the categorical features for the 25 worst and 25 best predictions. The tables display the top 5 categories within each categorical feature for the 25 worst and 25 best predictions.
===== Comparing for categorical category month =====
   Bad Category  Bad Category Count  Good Category  Good Category Count
0            12                  10              6                    4
1            11                   3              1                    4
2             6                   3              2                    3
3            10                   3              4                    3
4             8                   3              3                    3
===== Comparing for categorical category town =====
  Bad Category  Bad Category Count Good Category  Good Category Count
0  BUKIT MERAH                   8   BUKIT MERAH                    4
1       BISHAN                   6     TOA PAYOH                    2
2      HOUGANG                   2       GEYLANG               

In [99]:
pred_df[num_col_names].head(25).describe()

Unnamed: 0,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,remaining_lease_years,floor_area_sqm
count,25.0,25.0,25.0,25.0,25.0,25.0
mean,0.821817,6.11832,0.019496,0.023421,74.23,118.68
std,0.359394,3.861212,0.006288,0.026168,15.730427,19.208765
min,0.370873,1.982722,0.016807,2.4e-05,50.166667,88.0
25%,0.581977,2.594828,0.016807,0.006243,63.666667,113.0
50%,0.745596,6.370404,0.016807,0.008342,73.5,120.0
75%,1.081018,8.071776,0.016807,0.047782,90.166667,126.0
max,1.915461,16.660245,0.033613,0.103876,96.75,154.0


In [100]:
good_pred_df[num_col_names].head(25).describe()

Unnamed: 0,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,remaining_lease_years,floor_area_sqm
count,25.0,25.0,25.0,25.0,25.0,25.0
mean,0.710519,8.340128,0.017479,0.018681,63.5,89.2
std,0.387744,4.525807,0.005381,0.043785,12.934307,26.925824
min,0.114837,0.687215,0.008403,1.1e-05,45.916667,45.0
25%,0.439622,4.847078,0.016807,0.000382,53.5,65.0
50%,0.717746,8.162365,0.016807,0.004897,61.416667,91.0
75%,0.923266,12.165743,0.016807,0.018783,71.666667,110.0
max,1.787313,17.346043,0.033613,0.217454,92.083333,133.0


For the top 25 test samples with the largest errors, the following trends were observed.

- From the data analysis done, most of the errors in prediction seem to stem from differences in the distribution of the categorical features between the worst and the best predictions.
- 10 of the 25 worst predictions have month = 12, far more than any other month (the next most common months each appear 3 times). This concentration could have contributed to the large errors observed.
- Another deviation is town = 'BISHAN', which accounts for 6 of the 25 worst predictions.
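
Raw counts among the worst 25 rows are only informative relative to how common each category is in the test set as a whole. As a hedged sketch of that comparison, the helper below (a hypothetical `category_lift`, run on made-up month values) divides each category's share among the worst rows by its share overall; a value well above 1 suggests the category is over-represented among the poor predictions.

```python
from collections import Counter


def category_lift(worst, everything):
    """Share of each category among the worst rows divided by its overall share."""
    worst_counts = Counter(worst)
    all_counts = Counter(everything)
    return {
        cat: (count / len(worst)) / (all_counts[cat] / len(everything))
        for cat, count in worst_counts.items()
    }


# Hypothetical data: month 12 is 2 of the 4 worst rows but only 2 of 8 rows overall.
all_months = [1, 3, 6, 6, 9, 11, 12, 12]
worst_months = [6, 11, 12, 12]

lift = category_lift(worst_months, all_months)
print(lift)
```

In the notebook, the same idea could be applied by passing `pred_df[col].head(25)` and `pred_df[col]` as lists for each column in `cat_col_names`.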

Improvements to reduce errors.
- Normalisation of the continuous data, including the target, can be performed. The values of the target label 'resale_price' are very large compared to the input features the model is trained on, which could have contributed to the large errors observed.
- Feature engineering can be used both to remove redundant features and to create features that provide more useful information during training.
- Different encoding techniques for the categorical features can be explored, which may help the model learn from and use the categorical features better.
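
One possible way to act on the first suggestion is to train on a log-transformed target and invert the transform before computing the error. This was not evaluated in the notebook; the sketch below only demonstrates the transform and its inverse on illustrative prices, using the standard library.

```python
import math

# Illustrative resale prices spanning roughly the range seen in the dataset.
prices = [250_000.0, 480_000.0, 1_360_000.0]

# Forward transform: the model would be trained on log1p(price) instead of the raw price.
log_prices = [math.log1p(p) for p in prices]

# Inverse transform: map model outputs back to dollars before computing RMSE.
restored = [math.expm1(v) for v in log_prices]

print(log_prices)
print(restored)
```

The transformed targets lie within a narrow range, which is closer to the scale of the normalised input features, and the round trip recovers the original prices up to floating-point error.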