CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often mix numeric and categorical features, and this dataset is one example. To build models on such data, the categorical features have to be encoded or embedded.
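
To make "encoded or embedded" concrete, here is a minimal sketch in plain Python. The town values are illustrative and the embedding table is random rather than learned, so this only shows the shape of each representation, not what PyTorch Tabular does internally.

```python
import random

# Illustrative category values (assumed for this example only).
towns = ["BISHAN", "TAMPINES", "BISHAN", "QUEENSTOWN"]

# Label encoding: map each distinct category to an integer index.
vocab = {name: idx for idx, name in enumerate(sorted(set(towns)))}
indices = [vocab[t] for t in towns]

# One-hot encoding: one binary column per category.
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in indices]

# Embedding: each category index selects a row of a small dense table.
# In a trained network these rows are learned; here they are random.
rng = random.Random(0)
embedding_dim = 2
table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
embedded = [table[idx] for idx in indices]

print(indices)
print(one_hot)
print(embedded)
```

Rows that share a category share an embedding vector, and the vector length stays fixed however many categories there are, whereas one-hot width grows with the number of categories.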

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]



In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

from sklearn.metrics import mean_squared_error, r2_score
import math



> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# Divide the dataset into train, validation, and test sets
train_data = df[df['year'] <= 2019]
val_data = df[df['year'] == 2020]
test_data = df[df['year'] == 2021]

# Define target variable and features
target = ['resale_price']
continuous_features = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 
                       'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']
categorical_features = ['month', 'town', 'flat_model_type', 'storey_range']
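
The splitting rule can also be written as an explicit mapping from year to split, which makes the exclusion of 2022 and 2023 visible instead of implicit. This is a small sketch only; `assign_split` is a hypothetical helper, not part of the notebook or of any library.

```python
def assign_split(year):
    """Return the split for a given year, or None if the year is unused."""
    if year <= 2019:
        return "train"
    if year == 2020:
        return "validation"
    if year == 2021:
        return "test"
    return None  # 2022 and 2023 are deliberately left out


for year in (2017, 2019, 2020, 2021, 2022, 2023):
    print(year, assign_split(year))
```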

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to be 1024 and set max_epoch as 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [4]:
# TODO: Enter your code here
data_config = DataConfig(
    target=target,
    continuous_cols=continuous_features,
    categorical_cols=categorical_features
)

trainer_config = TrainerConfig(
    auto_lr_find=True,
    max_epochs=50,
    batch_size=1024
)

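model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # one hidden layer with 50 neurons
    activation="ReLU",
    learning_rate=0.001  # starting value only; auto_lr_find=True overrides it before training
)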

optimizer_config = OptimizerConfig(optimizer="Adam")

model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config
)

model.fit(train=train_data, validation=val_data)

2023-10-13 20:35:43,158 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
Global seed set to 42
2023-10-13 20:35:43,194 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-13 20:35:43,194 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-13 20:35:43,285 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-13 20:35:43,319 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-10-13 20:35:43,552 - {pytorch_tabular.tabular_model:573} - INFO - Auto LR Find Started
`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at C:\Users\Keerthana\SC4001 Assignment\.lr_find_e1131849-1c70-4c8d-ad61-23daea637bc7.ckpt
Restored all states from the checkpoint file at C:\Users\Keerthana\SC4001 Assignment\.lr_find_e1131849-1c70-4c8d-ad61-23daea637bc7.ckpt
2023-10-13 20:35:47,005 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-13 20:35:47,006 - {pytorch_tabular.tabular_model:582} - INFO - Training Started



2023-10-13 20:36:15,746 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-13 20:36:15,746 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


<pytorch_lightning.trainer.trainer.Trainer at 0x1ff84b87be0>

In [5]:
predictions = model.predict(test_data)

predictions_array = predictions['resale_price_prediction'].values

# Calculate RMSE
test_rmse = math.sqrt(mean_squared_error(test_data['resale_price'], predictions_array))

# Calculate R2 score
test_r2 = r2_score(test_data['resale_price'], predictions_array)

print(f"Test RMSE: {test_rmse:.2f}")
print(f"Test R2: {test_r2:.2f}")


Test RMSE: 76696.92
Test R2: 0.78


> Report the test RMSE error and the test R2 value that you obtained.



\# TODO: \<Enter your answer here\>

Test RMSE: 76696.92

Test R2: 0.78
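
For reference, the two reported metrics can be written out in plain Python. This sketch uses the standard definitions of RMSE and R2 with made-up prices; the values below are not from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up example values (assumption, for illustration only).
y_true = [300000.0, 450000.0, 600000.0]
y_pred = [310000.0, 440000.0, 650000.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
print(f"R2: {r2(y_true, y_pred):.2f}")
```

RMSE is in the same units as the target (dollars here), while R2 is unitless and compares the model against always predicting the mean price.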

> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [6]:
# TODO: Enter your code here
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
test_data = test_data.assign(
    absolute_error=(test_data['resale_price'] - predictions_array).abs()
)

top_25_errors = test_data.sort_values(by='absolute_error', ascending=False).head(25)

# Print the top 25 rows with the largest errors
print(top_25_errors)

        month  year          town                full_address    nearest_stn  \
92405      11  2021   BUKIT MERAH            46 SENG POH ROAD    Tiong Bahru   
90957       6  2021   BUKIT BATOK  288A BUKIT BATOK STREET 25    Bukit Batok   
112128     12  2021      TAMPINES      156 TAMPINES STREET 12       Tampines   
90608      12  2021        BISHAN       273B BISHAN STREET 24         Bishan   
106192     12  2021    QUEENSTOWN              89 DAWSON ROAD     Queenstown   
91871       6  2021   BUKIT MERAH         17 TIONG BAHRU ROAD    Tiong Bahru   
93825       8  2021  CENTRAL AREA       4 TANJONG PAGAR PLAZA  Tanjong Pagar   
92504      12  2021   BUKIT MERAH            49 KIM PONG ROAD    Tiong Bahru   
105695      6  2021    QUEENSTOWN              91 DAWSON ROAD     Queenstown   
90432       8  2021        BISHAN       275A BISHAN STREET 24         Bishan   
92299      10  2021   BUKIT MERAH         36 MOH GUAN TERRACE    Tiong Bahru   
92442      11  2021   BUKIT MERAH       


\# TODO: \<Enter your answer here\>

The test samples with the largest prediction errors are generally high-value flats, with resale prices well above those of most other flats in the dataset. Many of the rows visible in the output above are in towns such as Bukit Merah, Queenstown and Bishan.
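
One possible way to reduce errors on unusually expensive flats, which was not tried in this notebook, is to train on a log-transformed price and convert predictions back to dollars afterwards. This is a common approach for right-skewed targets and is offered here as an assumption, not a verified improvement. The prices and the constant log-space error below are invented for illustration.

```python
import math

# Hypothetical resale prices (not taken from the dataset).
prices = [250_000.0, 400_000.0, 1_200_000.0]

# Transform the target before training: log1p compresses large values.
log_prices = [math.log1p(p) for p in prices]

# Stand-in for model output in log space: every prediction is off by +0.05.
predicted_log = [v + 0.05 for v in log_prices]

# Invert the transform to report predictions in dollars.
predicted = [math.expm1(v) for v in predicted_log]

# The same log-space error becomes roughly the same relative error
# for cheap and expensive flats alike.
relative_errors = [(p_hat - p) / p for p, p_hat in zip(prices, predicted)]
print(relative_errors)
```

If this were applied, RMSE and R2 would still need to be computed on the back-transformed predictions so that they remain comparable with the values reported above.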