# Capstone Project: Singapore HDB Resale Price Prediction
___

<p align = 'center'>
  <img src = "https://github.com/ElangSetiawan/sg-hdb-resale/blob/main/images/hdb_shintaro_tay_st_photo.jpg?raw=true" width = 75%>
</p>
Source : https://www.straitstimes.com/singapore/housing/households-that-received-help-with-mortgage-payments-nearly-triple-that-of-same


**Problem Statement**

Public housing in Singapore is subsidised housing built and managed by the government under the Housing and Development Board (HDB). Most public housing in Singapore is owner-occupied. Under Singapore’s housing ownership programme, housing units are sold to applicants who meet certain income, citizenship and property ownership requirements, on a 99-year leasehold. The estate’s land and common areas continue to be owned by the government. Owner-occupied public housing can be sold to others in a resale market, subject to certain restrictions. Prices within the resale market are not regulated by the government.

Demand for resale flats since the end of the Circuit Breaker has pushed prices and sales to new highs. According to the HDB Price Index for Q2 2021, resale flat prices climbed 3% from Q1 2021, rising for the fifth consecutive quarter since Q2 2020. Prices were also 11% higher than a year earlier. As data scientists, we want to understand the factors driving resale flat prices and provide predicted sale prices for property portals.

**Models Explored**

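|Model|Description|
|---|---|
|LinearRegression|Ordinary least squares linear regression (scikit-learn)|
|XGBRegressor|Gradient-boosted decision tree regressor (XGBoost)|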


**Evaluation Metrics**

The success criterion is a difference of less than 2% between the train and test scores, as a check against overfitting and underfitting.
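
As a rough illustration of this criterion, the sketch below fits a model, scores it on train and test data, and checks whether the two scores differ by less than 2%. The use of R² as the score, the helper name `score_gap_pct`, and the synthetic data are all assumptions made for illustration; the notebook does not state which metric is compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def score_gap_pct(train_score, test_score):
    """Gap between train and test scores, as a percentage of the train score."""
    return abs(train_score - test_score) / abs(train_score) * 100


# Synthetic data standing in for the engineered features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# LinearRegression.score returns R^2 on the given data.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
gap = score_gap_pct(train_r2, test_r2)

print(f"train R2={train_r2:.4f}  test R2={test_r2:.4f}  gap={gap:.2f}%")
print("within 2% target:", gap < 2)
```

The same helper could be applied to any pair of train and test scores, for example RMSE, as long as both are computed with the same metric.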

**Workflow Process**  
1. Notebook 1 of 2 : General EDA
2. Notebook 1 of 2 : EDA on Geolocation
3. Notebook 2 of 2 : Data Preprocessing
4. Notebook 2 of 2 : Feature Engineering
5. Notebook 2 of 2 : Create Model
6. Notebook 2 of 2 : Iterative Model tuning


In [None]:
# # installing less common packages (uncomment if you do not have these installed)
# !pip install geopy
# !pip install geopandas
# !pip install featuretools

In [1]:
import pandas as pd, numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)
from mpl_toolkits import mplot3d
import seaborn as sns

import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
from geopy.distance import geodesic

import datetime as dt

import shapely
from shapely import geometry
from shapely import ops
from shapely.geometry import Point, LineString, Polygon, MultiPoint
from shapely.ops import nearest_points

import warnings
warnings.filterwarnings('ignore')

from sklearn.impute import SimpleImputer

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier


sns.set_style('ticks')

pd.set_option('display.max_columns', None)

%matplotlib inline

# 1.0 Data Import
___

In [2]:
df_raw_1990 = pd.read_csv('../data/raw/resale-flat-prices-based-on-approval-date-1990-1999.csv')
df_raw_2000 = pd.read_csv('../data/raw/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
df_raw_2012 = pd.read_csv('../data/raw/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
df_raw_2015 = pd.read_csv('../data/raw/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
# Coverage up to 2021-11-23 - data.gov.sg
df_raw_2017 = pd.read_csv('../data/raw/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')

df_raw_1990.shape,df_raw_2000.shape,df_raw_2012.shape,df_raw_2015.shape,df_raw_2017.shape

((287196, 10), (369651, 10), (52203, 10), (37153, 11), (113753, 11))
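
The shapes show that the two most recent files have 11 columns while the older ones have 10. A minimal sketch for finding which columns differ between two frames is given below; the toy frames and the helper name `column_diff` are hypothetical stand-ins for the loaded CSVs.

```python
import pandas as pd


def column_diff(left, right):
    """Return (columns only in left, columns only in right) as sorted lists."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)


# Toy frames standing in for an older and a newer file (hypothetical values).
old = pd.DataFrame({"month": ["1990-01"], "resale_price": [9000]})
new = pd.DataFrame({
    "month": ["2015-01"],
    "resale_price": [255000.0],
    "remaining_lease": [70],
})

only_old, only_new = column_diff(old, new)
print("only in old:", only_old)
print("only in new:", only_new)
```

On the real data this would be called as `column_diff(df_raw_2012, df_raw_2015)`.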

# 2.0 General EDA
___

In [3]:
# Print basic information about a dataframe:
# - shape, i.e. number of rows and columns
# - number and percentage of rows containing at least one null value
# - number and percentage of rows that belong to a duplicate group
# - data types of the columns

def basic_eda(df, df_name):
    n_rows = df.shape[0]
    # rows with at least one null value (not the total count of null cells)
    null_rows = df.isnull().any(axis=1).sum()
    # keep=False counts every member of a duplicate group, not just the surplus copies
    dupe_rows = df.duplicated(keep=False).sum()

    print(df_name.upper())
    print()
    print(f"Rows: {n_rows} \t Columns: {df.shape[1]}")
    print()

    print(f"Total null rows: {null_rows}")
    print(f"Percentage null rows: {round(null_rows / n_rows * 100, 2)}%")
    print()

    print(f"Total duplicate rows: {dupe_rows}")
    print(f"Percentage dupe rows: {round(dupe_rows / n_rows * 100, 2)}%")
    print()

    print(df.dtypes)
    print("-----\n")

In [4]:
dfs = [
    (df_raw_1990, 'from 1990 to 1999'),
    (df_raw_2000, 'from 2000 to 2012'),
    (df_raw_2012, 'from 2012 to 2014'),
    (df_raw_2015, 'from 2015 to 2016'),
    (df_raw_2017, 'from 2017 to 2021-11-23')
    ]

In [5]:
[basic_eda(df, name) for df, name in dfs]

FROM 1990 TO 1999

Rows: 287196 	 Columns: 10

Total null rows: 0
Percentage null rows: 0.0%

Total duplicate rows: 1638
Percentage dupe rows: 0.57%

month                   object
town                    object
flat_type               object
block                   object
street_name             object
storey_range            object
floor_area_sqm         float64
flat_model              object
lease_commence_date      int64
resale_price             int64
dtype: object
-----

FROM 2000 TO 2012

Rows: 369651 	 Columns: 10

Total null rows: 0
Percentage null rows: 0.0%

Total duplicate rows: 1014
Percentage dupe rows: 0.27%

month                   object
town                    object
flat_type               object
block                   object
street_name             object
storey_range            object
floor_area_sqm         float64
flat_model              object
lease_commence_date      int64
resale_price           float64
dtype: object
-----

FROM 2012 TO 2014

Rows: 52203 	 Colum

[None, None, None, None, None]

| Observations | Action |
|---|---|
|Duplicate rows detected | Investigate or remove duplicate rows |
|New column 'remaining_lease' is introduced from 2015 onwards | Create a 'remaining_lease' column for data prior to 2015 |
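
To follow up on the duplicate rows, it helps to separate "every row that has a twin" (what `basic_eda` counts with `keep=False`) from "the surplus copies that would be dropped". The sketch below shows the difference on a toy frame with hypothetical values.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 are identical.
toy = pd.DataFrame({
    "month": ["1990-01", "1990-01", "1990-01"],
    "block": ["309", "309", "216"],
    "resale_price": [9000, 9000, 47200],
})

# keep=False flags every member of a duplicate group.
n_in_groups = int(toy.duplicated(keep=False).sum())

# keep='first' flags only the surplus copies, i.e. what drop_duplicates() removes.
n_surplus = int(toy.duplicated(keep="first").sum())

deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_in_groups, n_surplus, len(deduped))
```

Identical rows may be genuine separate transactions (same block, storey range, price and month), so inspecting them before dropping is the safer route.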

In [16]:
# The 'remaining_lease' column only exists in the datasets from 2015 onwards.
# Add a placeholder 'remaining_lease' column to the pre-2015 dataframes before concatenating all the datasets.

df_raw_1990['remaining_lease'] = 'dummy'
df_raw_2000['remaining_lease'] = 'dummy'
df_raw_2012['remaining_lease'] = 'dummy'
df_all = pd.concat([df_raw_1990, df_raw_2000, df_raw_2012, df_raw_2015, df_raw_2017])
df_all.reset_index(inplace=True)
basic_eda(df_all, 'all_data')

ALL_DATA

Rows: 859956 	 Columns: 12

Total null rows: 0
Percentage null rows: 0.0%

Total duplicate rows: 0
Percentage dupe rows: 0.0%

index                    int64
month                   object
town                    object
flat_type               object
block                   object
street_name             object
storey_range            object
floor_area_sqm         float64
flat_model              object
lease_commence_date      int64
resale_price           float64
remaining_lease         object
dtype: object
-----
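
The zero duplicate count above should be read with care. Calling `reset_index(inplace=True)` without `drop=True` keeps each file's old row number as an `index` column, so rows that are otherwise identical differ in that column. The toy example below (hypothetical values) illustrates the effect and one possible way to check for duplicates on the original columns only.

```python
import pandas as pd

# Two toy files (hypothetical values); the first contains two identical rows.
part_a = pd.DataFrame({"month": ["1990-01", "1990-01"], "resale_price": [9000, 9000]})
part_b = pd.DataFrame({"month": ["2000-01"], "resale_price": [150000]})

combined = pd.concat([part_a, part_b])
combined.reset_index(inplace=True)  # old row numbers become an 'index' column

# With the 'index' column included, the identical rows look distinct.
dupes_with_index = int(combined.duplicated(keep=False).sum())

# Restricting the check to the original columns reveals them again.
data_cols = [c for c in combined.columns if c != "index"]
dupes_without_index = int(combined.duplicated(subset=data_cols, keep=False).sum())

print(dupes_with_index, dupes_without_index)
```

An alternative is `pd.concat(..., ignore_index=True)`, which renumbers the rows without adding an extra column.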



In [12]:
df_all.head(-1)

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,remaining_lease
0,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,9000.0,dummy
1,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,04 TO 06,31.0,IMPROVED,1977,6000.0,dummy
2,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,8000.0,dummy
3,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,07 TO 09,31.0,IMPROVED,1977,6000.0,dummy
4,1990-01,ANG MO KIO,3 ROOM,216,ANG MO KIO AVE 1,04 TO 06,73.0,NEW GENERATION,1976,47200.0,dummy
...,...,...,...,...,...,...,...,...,...,...,...
113747,2021-11,YISHUN,EXECUTIVE,361,YISHUN RING RD,01 TO 03,146.0,Maisonette,1988,668000.0,65 years 08 months
113748,2021-11,YISHUN,EXECUTIVE,792,YISHUN RING RD,10 TO 12,144.0,Apartment,1987,690000.0,64 years 10 months
113749,2021-11,YISHUN,EXECUTIVE,611,YISHUN ST 61,10 TO 12,142.0,Apartment,1987,680000.0,65 years 01 month
113750,2021-11,YISHUN,EXECUTIVE,614,YISHUN ST 61,01 TO 03,142.0,Apartment,1987,632000.0,64 years 06 months


| Observations | Action |
|---|---|
|Column 'month' is text in yyyy-mm format | Create a new datetime column 'sale_date' set to yyyy-mm-01 |
|Column 'remaining_lease' is text | Create a new 'remaining_year' column computed as 99 - (sale_date.year - lease_commence_date) |
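
The second action in the table can be sketched as follows, assuming a 99-year lease and the `month` and `lease_commence_date` columns shown in the preview. The toy rows reuse values from that preview; the result counts whole years only and ignores months, so it is an approximation of the textual `remaining_lease` values.

```python
import pandas as pd

# Toy rows reusing values shown in the preview above.
toy = pd.DataFrame({
    "month": ["1990-01", "2021-11"],
    "lease_commence_date": [1977, 1988],
})

# 'month' is text in yyyy-mm form; parse it to the first day of that month.
toy["sale_date"] = pd.to_datetime(toy["month"], format="%Y-%m")

# remaining_year = 99 - (sale year - lease commencement year)
LEASE_YEARS = 99
toy["remaining_year"] = LEASE_YEARS - (
    toy["sale_date"].dt.year - toy["lease_commence_date"]
)

print(toy[["month", "lease_commence_date", "remaining_year"]])
```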

In [42]:
# New column 'sale_date' as datetime.

df_copy = df_all.copy()
df_copy['sale_date'] = pd.to_datetime(df_copy['month']+'-01')
df_copy.head()

Unnamed: 0,index,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,remaining_lease,sale_date
0,0,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,9000.0,dummy,1990-01-01
1,1,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,04 TO 06,31.0,IMPROVED,1977,6000.0,dummy,1990-01-01
2,2,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,10 TO 12,31.0,IMPROVED,1977,8000.0,dummy,1990-01-01
3,3,1990-01,ANG MO KIO,1 ROOM,309,ANG MO KIO AVE 1,07 TO 09,31.0,IMPROVED,1977,6000.0,dummy,1990-01-01
4,4,1990-01,ANG MO KIO,3 ROOM,216,ANG MO KIO AVE 1,04 TO 06,73.0,NEW GENERATION,1976,47200.0,dummy,1990-01-01


0         1990-01-01
1         1990-01-01
2         1990-01-01
3         1990-01-01
4         1990-01-01
             ...    
859951    2021-11-01
859952    2021-11-01
859953    2021-11-01
859954    2021-11-01
859955    2021-11-01
Name: month, Length: 859956, dtype: object

In [None]:
# This cell is used for testing and development to avoid rerunning the call to pushshift API.
# The code is commented out and only used when necessary. 
# Pickle the raw DataFrames. 
"""
print("pickling ml_raw:", ml_raw.shape)
picklefile = open('../data/interim/ml_raw.pickle', 'wb') #create a file
pickle.dump(ml_raw, picklefile, pickle.HIGHEST_PROTOCOL) #pickle the dataframe
picklefile.close() #close file

print("pickling ds_raw:", ds_raw.shape)
picklefile = open('../data/interim/ds_raw.pickle', 'wb') #create a file
pickle.dump(ds_raw, picklefile, pickle.HIGHEST_PROTOCOL) #pickle the dataframe
picklefile.close() #close file
"""