# 2: Removing Outliers and Transforming data

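## 2.1: Using Mahalanobis distance to identify outliers
Because the data is multivariate, removing outliers one variable at a time is not best practice: an observation can look ordinary on each variable yet be unusual in combination. We therefore use the Mahalanobis distance, which measures how far each observation lies from the mean while accounting for the covariance between variables, to identify outliers across several variables at once.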
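
For reference, the standard definition of the Mahalanobis distance of an observation $\mathbf{x}$ from a distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ is written out here. In the cells that follow, $\boldsymbol{\mu}$ corresponds to `mean_vec`, $\Sigma$ to `cov_matrix` and $\Sigma^{-1}$ to `inv_cov_matrix`; the symbols themselves are notation introduced for this note.

```latex
D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \, \Sigma^{-1} \, (\mathbf{x} - \boldsymbol{\mu})}
```

When $\Sigma$ is the identity matrix this reduces to the ordinary Euclidean distance from the mean.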

### 2.1.1: Loading and preparing the dataset for Mahalanobis distance calculation

In [2]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Read the CSV file
raw_df = pd.read_csv('ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv')

# Convert the remaining-lease text (e.g. "61 years 04 months") to a number of months
# so that it can be used as a numeric variable in the Mahalanobis distance
def convert_lease_to_months(lease_str):
    years, months = 0, 0  # Default values

    # Extract years
    year_match = re.search(r'(\d+)\s*years?', lease_str)
    if year_match:
        years = int(year_match.group(1))

    # Extract months
    month_match = re.search(r'(\d+)\s*months?', lease_str)
    if month_match:
        months = int(month_match.group(1))

    return years * 12 + months  # Convert to total months

# Apply function to column
raw_df["remaining_lease_months"] = raw_df["remaining_lease"].apply(convert_lease_to_months)

df = raw_df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200540 entries, 0 to 200836
Data columns (total 12 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   month                   200540 non-null  object 
 1   town                    200540 non-null  object 
 2   flat_type               200540 non-null  object 
 3   block                   200540 non-null  object 
 4   street_name             200540 non-null  object 
 5   storey_range            200540 non-null  object 
 6   floor_area_sqm          200540 non-null  float64
 7   flat_model              200540 non-null  object 
 8   lease_commence_date     200540 non-null  int64  
 9   remaining_lease         200540 non-null  object 
 10  resale_price            200540 non-null  float64
 11  remaining_lease_months  200540 non-null  int64  
dtypes: float64(2), int64(2), object(8)
memory usage: 19.9+ MB


In [3]:
df_numeric = df[['floor_area_sqm', 'remaining_lease_months', 'resale_price']].copy() 


# Compute the mean and covariance matrix
mean_vec = np.mean(df_numeric, axis=0)
cov_matrix = np.cov(df_numeric, rowvar=False)

# Compute the inverse of the covariance matrix
inv_cov_matrix = np.linalg.inv(cov_matrix)

# Compute Mahalanobis distance for each data point
df_numeric['mahalanobis_dist'] = df_numeric.apply(lambda x: mahalanobis(x, mean_vec, inv_cov_matrix), axis=1)


# View top outliers
df_numeric.sort_values(by='mahalanobis_dist', ascending=False).head(70)

Unnamed: 0,floor_area_sqm,remaining_lease_months,resale_price,mahalanobis_dist
182474,366.7,564,1568000.0,11.688276
200280,117.0,1038,1600000.0,7.101499
174508,112.0,1088,1588000.0,7.090145
197905,106.0,1020,1550000.0,7.060178
174615,113.0,1050,1580000.0,7.047303
...,...,...,...,...
148453,113.0,1135,1450000.0,6.051356
148368,113.0,1137,1450000.0,6.050331
183103,117.0,1023,1450000.0,6.048822
171900,172.0,808,1500000.0,6.019152
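
The row-wise `apply` above calls `mahalanobis` once per row, which can be slow on roughly 200,000 rows. A vectorised NumPy sketch that should produce the same distances is shown here on a small made-up frame; the helper name `mahalanobis_all` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def mahalanobis_all(frame):
    """Mahalanobis distance of every row of `frame` from the column means."""
    values = frame.to_numpy(dtype=float)
    diff = values - values.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))
    # Row-wise quadratic form: diff_i^T @ inv_cov @ diff_i for every row i
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return pd.Series(np.sqrt(squared), index=frame.index, name="mahalanobis_dist")

# Made-up rows using the same three columns as df_numeric
toy = pd.DataFrame({
    "floor_area_sqm": [44.0, 67.0, 68.0, 92.0, 110.0, 146.0],
    "remaining_lease_months": [736, 727, 745, 900, 1010, 737],
    "resale_price": [232000.0, 250000.0, 265000.0, 480000.0, 610000.0, 800000.0],
})

dist = mahalanobis_all(toy)
print(dist.sort_values(ascending=False))
```

The result keeps the frame's index, so it can be assigned directly as a new column in the same way as the `apply` version.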


Based on the Mahalanobis distances, one observation (index 182474, distance 11.69) stands well apart from the rest, whose distances are all about 7.1 or lower. We therefore remove this single outlier before proceeding.

In [4]:
# Find the index of the top outlier
top_outlier_idx = df_numeric['mahalanobis_dist'].idxmax()

# Remove the outlier from the dataset
df_cleaned = df.drop(index=top_outlier_idx)

# Verify the removal
print(f"Removed row at index: {top_outlier_idx}")
print(df_cleaned.shape)  # Check the new shape

# Rebuild the numeric subset from the cleaned data
df_numeric_cleaned = df_cleaned[['floor_area_sqm', 'remaining_lease_months', 'resale_price']].copy() 

# Compute the mean and covariance matrix
mean_vec = np.mean(df_numeric_cleaned, axis=0)
cov_matrix = np.cov(df_numeric_cleaned, rowvar=False)

# Compute the inverse of the covariance matrix
inv_cov_matrix = np.linalg.inv(cov_matrix)

# Compute Mahalanobis distance for each data point
df_numeric_cleaned['mahalanobis_dist'] = df_numeric_cleaned.apply(lambda x: mahalanobis(x, mean_vec, inv_cov_matrix), axis=1)
# View top outliers
df_numeric_cleaned.sort_values(by='mahalanobis_dist', ascending=False).head(70)

Removed row at index: 182474
(200539, 12)


Unnamed: 0,floor_area_sqm,remaining_lease_months,resale_price,mahalanobis_dist
200280,117.0,1038,1600000.0,7.101510
174508,112.0,1088,1588000.0,7.090138
197905,106.0,1020,1550000.0,7.060166
174615,113.0,1050,1580000.0,7.047301
175779,105.0,1023,1542880.0,7.033331
...,...,...,...,...
148368,113.0,1137,1450000.0,6.050326
183103,117.0,1023,1450000.0,6.048841
171900,172.0,808,1500000.0,6.019791
175710,97.0,1030,1370000.0,6.017171


## 2.2: Retaining only relevant columns

In [5]:
# relevant features

df_cleaned = df_cleaned[["month", 'town', 'storey_range', 'floor_area_sqm', 'flat_type','flat_model','remaining_lease','resale_price']]
df_cleaned.head()

Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price
0,2017-01,ANG MO KIO,10 TO 12,44.0,2 ROOM,Improved,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,04 TO 06,68.0,3 ROOM,New Generation,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,265000.0


## 2.3: Transforming Datatypes

In [6]:
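# Work on a copy: df_cleaned was created by column selection, and adding columns to it directly may raise pandas' SettingWithCopyWarning
df = df_cleaned.copy()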
df.dtypes

month               object
town                object
storey_range        object
floor_area_sqm     float64
flat_type           object
flat_model          object
remaining_lease     object
resale_price       float64
dtype: object

### 2.3.1: Converting month to a month count from Jan 2017 (Jan 2017 = 1)

In [7]:
# changing month to months from jan 2017
df[['y', 'm_from2017']]=df['month'].str.split(pat="-", expand = True)
df['y'] = pd.to_numeric(df['y']).sub(2017).mul(12)
df['m_from2017'] = pd.to_numeric(df['m_from2017'])
df['m_from2017'] = df['m_from2017'].add(df['y'])

# can delete col y now
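
The same month count can be computed without the intermediate `y` column. The following is a minimal sketch, assuming every value has the `YYYY-MM` form seen in the data; `months_from_jan_2017` is a hypothetical helper name.

```python
import pandas as pd

def months_from_jan_2017(month_series):
    """Map 'YYYY-MM' strings to a running month count, with 2017-01 -> 1."""
    parts = month_series.str.split("-", expand=True).astype(int)
    years, months = parts[0], parts[1]
    return (years - 2017) * 12 + months

sample = pd.Series(["2017-01", "2017-12", "2018-01", "2025-02"])
counts = months_from_jan_2017(sample)
print(counts.tolist())  # [1, 12, 13, 98]
```

The value 98 for `2025-02` agrees with the `m_from2017` column shown in the later outputs.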

### 2.3.2: Changing remaining lease to no. of months left in lease

In [8]:
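# changing remaining lease to no. of months left
# split on the unit words: 'lYr' receives the years, 'lease months left' the months, 'h' whatever text remains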
df[['lYr', 'lease months left', 'h']]=df['remaining_lease'].str.split(r"years|months|month", expand = True)
df['lYr'] = pd.to_numeric(df['lYr']).mul(12)
df['lease months left'] = pd.to_numeric(df['lease months left']).add(df['lYr'])
df.head()
# remove col lYr and h

Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price,y,m_from2017,lYr,lease months left,h
0,2017-01,ANG MO KIO,10 TO 12,44.0,2 ROOM,Improved,61 years 04 months,232000.0,0,1,732,736.0,
1,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,60 years 07 months,250000.0,0,1,720,727.0,
2,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,262000.0,0,1,744,749.0,
3,2017-01,ANG MO KIO,04 TO 06,68.0,3 ROOM,New Generation,62 years 01 month,265000.0,0,1,744,745.0,
4,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,265000.0,0,1,744,749.0,
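
One thing to watch with the split above: a lease string with no months part would leave the months field empty, so the total may end up missing for that row. A sketch of a pattern-based alternative, equivalent in spirit to `convert_lease_to_months` from section 2.1.1, is given here. The helper name `lease_to_months` and the value `"95 years"` are hypothetical, used only to illustrate the edge case.

```python
import pandas as pd

def lease_to_months(lease_series):
    """Convert strings such as '61 years 04 months' to a total number of months."""
    years = pd.to_numeric(lease_series.str.extract(r"(\d+)\s*years?", expand=False))
    months = pd.to_numeric(lease_series.str.extract(r"(\d+)\s*months?", expand=False))
    # A missing component (no match) counts as zero instead of propagating NaN
    return (years.fillna(0) * 12 + months.fillna(0)).astype(int)

sample = pd.Series(["61 years 04 months", "62 years 01 month", "95 years"])
total_months = lease_to_months(sample)
print(total_months.tolist())  # [736, 745, 1140]
```

The first two results match the `lease months left` values in the output above.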


### 2.3.3: Encoding towns using one hot encoding

In [9]:
# encoding the towns - one hot encoding

# Standardize column names
df.rename(columns=lambda x: x.strip().lower(), inplace=True)

# Ensure 'town' column is a string and clean spaces
df['town'] = df['town'].astype(str).str.strip()

# One-hot encode
df_encoded = pd.get_dummies(df, columns=['town'], dtype = int)

# Ensure only specified towns are kept
towns = [
    "JURONG WEST", "SENGKANG", "WOODLANDS", "PUNGGOL", "TAMPINES", "YISHUN", "BEDOK", "HOUGANG",
    "ANG MO KIO", "BUKIT MERAH", "CHOA CHU KANG", "TOA PAYOH", "BUKIT BATOK", "BUKIT PANJANG",
    "KALLANG/WHAMPOA", "PASIR RIS", "GEYLANG", "QUEENSTOWN", "SEMBAWANG", "JURONG EAST",
    "BISHAN", "CLEMENTI", "SERANGOON", "CENTRAL AREA", "MARINE PARADE", "BUKIT TIMAH"
]

town_columns = [f'town_{town}' for town in towns if f'town_{town}' in df_encoded.columns]
df_encoded = df_encoded[['town'] + town_columns] if 'town' in df_encoded.columns else df_encoded[town_columns]

df = pd.concat([df, df_encoded], axis = 1)
df.head()

Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price,y,m_from2017,...,town_GEYLANG,town_QUEENSTOWN,town_SEMBAWANG,town_JURONG EAST,town_BISHAN,town_CLEMENTI,town_SERANGOON,town_CENTRAL AREA,town_MARINE PARADE,town_BUKIT TIMAH
0,2017-01,ANG MO KIO,10 TO 12,44.0,2 ROOM,Improved,61 years 04 months,232000.0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,60 years 07 months,250000.0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,262000.0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,2017-01,ANG MO KIO,04 TO 06,68.0,3 ROOM,New Generation,62 years 01 month,265000.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,2017-01,ANG MO KIO,01 TO 03,67.0,3 ROOM,New Generation,62 years 05 months,265000.0,0,1,...,0,0,0,0,0,0,0,0,0,0
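
The encode, filter and concatenate steps above can be wrapped in one reusable function. This is a minimal sketch on made-up rows; `one_hot` is a hypothetical helper, and unlike the cell above it uses `reindex` so that a listed category absent from the data still gets an all-zero column.

```python
import pandas as pd

def one_hot(frame, column, categories):
    """Append 0/1 indicator columns for `column`, one per entry in `categories`."""
    dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
    wanted = [f"{column}_{category}" for category in categories]
    # reindex fixes the column order and fills absent categories with 0
    dummies = dummies.reindex(columns=wanted, fill_value=0)
    return pd.concat([frame, dummies], axis=1)

toy = pd.DataFrame({
    "town": ["BEDOK", "ANG MO KIO", "BEDOK"],
    "resale_price": [300000.0, 232000.0, 310000.0],
})
encoded = one_hot(toy, "town", ["ANG MO KIO", "BEDOK", "YISHUN"])
print(encoded.columns.tolist())
```

Because only the indicator columns are concatenated, the original columns are not duplicated, and the same function could be applied to another categorical column such as `flat_model`.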


### 2.3.4: Encoding flat model using one hot encoding

In [10]:
# encoding the flat_model - one hot encoding

# Standardize column names (this also lowercases the town_* dummy columns created above)
df.rename(columns=lambda x: x.strip().lower(), inplace=True)

# Ensure 'flat_model' column is a string and clean spaces
df['flat_model'] = df['flat_model'].astype(str).str.strip()

# One-hot encode
df_encoded = pd.get_dummies(df, columns=['flat_model'], dtype = int)

# Ensure only specified flat models are kept
models = [
    "Model A", "Improved", "New Generation", "Premium Apartment", "Simplified", "Apartment", "Maisonette", "Standard",
    "DBSS", "Model A2", "Model A-Maisonette", "Adjoined flat", "Type S1", "2-room",
    "Type S2", "Premium Apartment Loft", "Terrace", "Multi Generation", "3Gen", "Improved-Maisonette",
    "Premium Maisonette"
]

model_columns = [f'flat_model_{model}' for model in models if f'flat_model_{model}' in df_encoded.columns]
df_encoded = df_encoded[['flat_model'] + model_columns] if 'flat_model' in df_encoded.columns else df_encoded[model_columns]

df = pd.concat([df, df_encoded], axis = 1)
df.tail()

Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price,y,m_from2017,...,flat_model_Adjoined flat,flat_model_Type S1,flat_model_2-room,flat_model_Type S2,flat_model_Premium Apartment Loft,flat_model_Terrace,flat_model_Multi Generation,flat_model_3Gen,flat_model_Improved-Maisonette,flat_model_Premium Maisonette
200832,2025-02,YISHUN,01 TO 03,142.0,EXECUTIVE,Apartment,62 years 05 months,845000.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200833,2025-01,YISHUN,04 TO 06,146.0,EXECUTIVE,Maisonette,61 years 05 months,800000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200834,2025-02,YISHUN,07 TO 09,146.0,EXECUTIVE,Maisonette,60 years 05 months,818888.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200835,2025-01,YISHUN,01 TO 03,146.0,EXECUTIVE,Maisonette,62 years 02 months,960000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200836,2025-02,YISHUN,01 TO 03,145.0,EXECUTIVE,Apartment,61 years 10 months,868888.0,96,98,...,0,0,0,0,0,0,0,0,0,0


### 2.3.5: Encoding Flat Type using label encoding

In [11]:
# encoding the flat type - label encoding
flat_type_mapping = {
    '1 ROOM': 0,
    '2 ROOM': 1,
    '3 ROOM': 2,
    '4 ROOM': 3,
    '5 ROOM': 4,
    'EXECUTIVE': 5,
    'MULTI-GENERATION': 6
}

# Apply the mapping to the flat_type column
df['flat_type'] = df['flat_type'].map(flat_type_mapping)
df.tail()


Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price,y,m_from2017,...,flat_model_Adjoined flat,flat_model_Type S1,flat_model_2-room,flat_model_Type S2,flat_model_Premium Apartment Loft,flat_model_Terrace,flat_model_Multi Generation,flat_model_3Gen,flat_model_Improved-Maisonette,flat_model_Premium Maisonette
200832,2025-02,YISHUN,01 TO 03,142.0,5,Apartment,62 years 05 months,845000.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200833,2025-01,YISHUN,04 TO 06,146.0,5,Maisonette,61 years 05 months,800000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200834,2025-02,YISHUN,07 TO 09,146.0,5,Maisonette,60 years 05 months,818888.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200835,2025-01,YISHUN,01 TO 03,146.0,5,Maisonette,62 years 02 months,960000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200836,2025-02,YISHUN,01 TO 03,145.0,5,Apartment,61 years 10 months,868888.0,96,98,...,0,0,0,0,0,0,0,0,0,0
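
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelt or new category would pass through silently. A small sketch of a guarded version follows; `encode_ordinal` is a hypothetical helper, and the mapping is copied from the cell above.

```python
import pandas as pd

flat_type_mapping = {
    "1 ROOM": 0, "2 ROOM": 1, "3 ROOM": 2, "4 ROOM": 3,
    "5 ROOM": 4, "EXECUTIVE": 5, "MULTI-GENERATION": 6,
}

def encode_ordinal(series, mapping):
    """Map categories to integer codes, failing loudly on unmapped values."""
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped categories: {unmapped}")
    return encoded.astype(int)

sample = pd.Series(["2 ROOM", "EXECUTIVE", "4 ROOM"])
encoded = encode_ordinal(sample, flat_type_mapping)
print(encoded.tolist())  # [1, 5, 3]
```

The same check would apply to any dictionary-based label encoding.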


### 2.3.6: Encoding Storey Range using label encoding

In [12]:
# encoding the storey_range - label encoding
storey_range_mapping = {
    '01 TO 03': 0, '04 TO 06': 1, '07 TO 09': 2, '10 TO 12': 3, '13 TO 15': 4,
    '16 TO 18': 5, '19 TO 21': 6, '22 TO 24': 7, '25 TO 27': 8, '28 TO 30': 9,
    '31 TO 33': 10, '34 TO 36': 11, '37 TO 39': 12, '40 TO 42': 13, '43 TO 45': 14,
    '46 TO 48': 15, '49 TO 51': 16
}

# Apply the mapping to the storey_range column
df['storey_range'] = df['storey_range'].map(storey_range_mapping)
df.tail()

Unnamed: 0,month,town,storey_range,floor_area_sqm,flat_type,flat_model,remaining_lease,resale_price,y,m_from2017,...,flat_model_Adjoined flat,flat_model_Type S1,flat_model_2-room,flat_model_Type S2,flat_model_Premium Apartment Loft,flat_model_Terrace,flat_model_Multi Generation,flat_model_3Gen,flat_model_Improved-Maisonette,flat_model_Premium Maisonette
200832,2025-02,YISHUN,0,142.0,5,Apartment,62 years 05 months,845000.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200833,2025-01,YISHUN,1,146.0,5,Maisonette,61 years 05 months,800000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200834,2025-02,YISHUN,2,146.0,5,Maisonette,60 years 05 months,818888.0,96,98,...,0,0,0,0,0,0,0,0,0,0
200835,2025-01,YISHUN,0,146.0,5,Maisonette,62 years 02 months,960000.0,96,97,...,0,0,0,0,0,0,0,0,0,0
200836,2025-02,YISHUN,0,145.0,5,Apartment,61 years 10 months,868888.0,96,98,...,0,0,0,0,0,0,0,0,0,0
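
The codes in the mapping above follow a regular pattern, so they can also be derived from the string itself. This is a sketch under the assumption that every range is a three-floor band starting at floor 01, as in the dictionary; `storey_range_to_code` is a hypothetical helper name.

```python
import pandas as pd

def storey_range_to_code(series):
    """Derive the ordinal code from the lower floor of 'NN TO MM' strings."""
    lower_floor = series.str.extract(r"^(\d+)", expand=False).astype(int)
    # Bands start at floors 1, 4, 7, ... so (lower_floor - 1) // 3 gives 0, 1, 2, ...
    return (lower_floor - 1) // 3

sample = pd.Series(["01 TO 03", "10 TO 12", "49 TO 51"])
codes = storey_range_to_code(sample)
print(codes.tolist())  # [0, 3, 16]
```

An explicit dictionary remains the safer choice if the data could contain bands of a different width.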


### 2.3.7: Removing redundant columns used for transformation

In [13]:
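# remove redundant/encoded cols
# note: drop() returns a new DataFrame; without assigning the result (or passing inplace=True), df itself keeps these columns
df.drop(columns=['lyr', 'h', 'remaining_lease', 'y', 'town','flat_model', 'month'])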

Unnamed: 0,storey_range,floor_area_sqm,flat_type,resale_price,m_from2017,lease months left,town_jurong west,town_sengkang,town_woodlands,town_punggol,...,flat_model_Adjoined flat,flat_model_Type S1,flat_model_2-room,flat_model_Type S2,flat_model_Premium Apartment Loft,flat_model_Terrace,flat_model_Multi Generation,flat_model_3Gen,flat_model_Improved-Maisonette,flat_model_Premium Maisonette
0,3,44.0,1,232000.0,1,736.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,67.0,2,250000.0,1,727.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,67.0,2,262000.0,1,749.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,68.0,2,265000.0,1,745.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,67.0,2,265000.0,1,749.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200832,0,142.0,5,845000.0,98,749.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
200833,1,146.0,5,800000.0,97,737.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
200834,2,146.0,5,818888.0,98,725.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
200835,0,146.0,5,960000.0,97,746.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
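
Note that `DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up frame shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({"month": ["2017-01"], "y": [0], "m_from2017": [1]})

# Without assignment the result is only displayed; toy is unchanged
toy.drop(columns=["month", "y"])
before = toy.columns.tolist()

# Assigning the result back keeps the change
toy = toy.drop(columns=["month", "y"])
after = toy.columns.tolist()

print(before)  # ['month', 'y', 'm_from2017']
print(after)   # ['m_from2017']
```

Applied to the cell above, assigning the result back to `df` would be needed for the dropped columns to stay out of later steps.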


In [14]:
df.dtypes

month                                 object
town                                  object
storey_range                           int64
floor_area_sqm                       float64
flat_type                              int64
flat_model                            object
remaining_lease                       object
resale_price                         float64
y                                      int64
m_from2017                             int64
lyr                                    int64
lease months left                    float64
h                                     object
town_jurong west                       int64
town_sengkang                          int64
town_woodlands                         int64
town_punggol                           int64
town_tampines                          int64
town_yishun                            int64
town_bedok                             int64
town_hougang                           int64
town_ang mo kio                        int64
town_bukit

## 2.4: Exporting transformed data to new csv

In [15]:
# generate the new csv file to use for the model
df.to_csv('cleaned_hdb.csv', index = False)

In [16]:
df1 = pd.read_csv("cleaned_hdb.csv")
df1.shape

(200539, 60)
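
A quick round-trip check can confirm that the export preserves the columns. The sketch below uses an in-memory buffer and made-up rows instead of the real file; it also shows that `index=False` discards the original index, so the reloaded frame is numbered from 0 again.

```python
import io
import pandas as pd

# Made-up rows with a gap in the index, as happens after dropping rows
toy = pd.DataFrame(
    {
        "storey_range": [3, 0],
        "floor_area_sqm": [44.0, 67.0],
        "resale_price": [232000.0, 250000.0],
    },
    index=[0, 5],
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # the index is not written to the file
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)            # (2, 3)
print(restored.index.tolist())   # [0, 1]
```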