# Capstone Project: Singapore HDB Resale Price Prediction
___

<p align = 'center'>
  <img src = "https://github.com/ElangSetiawan/sg-hdb-resale/blob/main/images/hdb_shintaro_tay_st_photo.jpg?raw=true" width = 75%>
</p>
Source : https://www.straitstimes.com/singapore/housing/households-that-received-help-with-mortgage-payments-nearly-triple-that-of-same


**Problem Statement**

Public housing in Singapore is subsidised housing built and managed by the government under the Housing and Development Board (HDB). Most public housing in Singapore is owner-occupied. Under Singapore’s housing ownership programme, housing units are sold to applicants who meet certain income, citizenship and property ownership requirements, on a 99-year leasehold. The estate’s land and common areas continue to be owned by the government. Owner-occupied public housing can be sold to others in a resale market, subject to certain restrictions. Prices within the resale market are not regulated by the government.

Demand for resale flats since the end of the Circuit Breaker has pushed prices and sales to new highs. According to the HDB Price Index in Q2 2021, resale flat prices climbed 3% from Q1 2021, growing for the fifth consecutive quarter since Q2 2020. Prices were also 11% higher compared to a year ago. As data scientists, we want to understand the factors driving the price of resale flats and provide predicted sale prices for property portals.

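**Models Explored**

|Models|Description|
|---|---|
|LinearRegression|Ordinary least squares linear regression (scikit-learn)|
|XGBRegressor|Gradient-boosted decision tree regressor (XGBoost)|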


**Evaluation Metrics**

The models are evaluated on how closely the test score tracks the train score: the difference between the two should be less than 2%, so that the model is neither overfitting nor underfitting.
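
One way to make this criterion concrete is to compute the same score on the train and test data and compare the two. The sketch below assumes the score is R² and that the 2% threshold applies to the absolute difference between the two scores. Both are assumptions, since the text does not name the score, and the function names are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, computed without external libraries."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_gap(train_score, test_score):
    """Absolute train/test score difference, in percentage points."""
    return abs(train_score - test_score) * 100.0


def is_within_tolerance(train_score, test_score, tolerance_pct=2.0):
    """True when the train/test gap is below the tolerance (assumed 2%)."""
    return fit_gap(train_score, test_score) < tolerance_pct
```

A model scoring 0.90 on train and 0.89 on test would pass this check, while 0.95 against 0.90 would be flagged as overfitting.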

**Workflow Process**  
1. Notebook 1 of 2 : General EDA
2. Notebook 1 of 2 : Geolocation preprocessing


**Data Sources**  
1. Singapore postal sector and districts:<br> 
https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm
2. Singapore HDB information and resale prices<br>
https://data.gov.sg/dataset/hdb-property-information<br>
https://data.gov.sg/dataset/resale-flat-prices
3. Singapore primary schools<br>
https://en.wikipedia.org/wiki/List_of_schools_in_Singapore
4. Singapore MRT<br>
https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations
5. Singapore LRT<br>
https://en.wikipedia.org/wiki/List_of_Singapore_LRT_stations
6. Singapore Shopping Malls<br>
https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore
7. Ministry of Education - Primary School Balloting system<br>
https://www.moe.gov.sg/primary/p1-registration/distance

**HDB information and resale transaction prices**

The site data.gov.sg provides both the HDB property information dataset and the monthly resale transactions. Because older transaction data does not improve the model, only the dataset from 2017 onwards is considered for this project.

**Geolocation and points of interest**

This project hypothesizes that the distance from an HDB flat to nearby amenities such as MRT/LRT stations, schools, and shopping centres is an important price factor. The geolocation data is obtained through the API at https://developers.onemap.sg/commonapi/search?, which provides free geolocation data, unlike Google Maps.<br>
In the Singapore context, postal codes are grouped into 28 districts. A mapping table is created for this.
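
The distance hypothesis can be illustrated with a small sketch that finds the amenity nearest to a flat. It uses the haversine great-circle formula with a mean Earth radius of 6371 km instead of the geodesic distance from geopy, so the result is an approximation. The function names are hypothetical and the coordinates are sample values for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_amenity(flat, amenities):
    """Return (name, distance_km) of the amenity closest to `flat`.

    `flat` is a (lat, lon) tuple; `amenities` maps a name to a (lat, lon) tuple.
    """
    name = min(amenities, key=lambda n: haversine_km(*flat, *amenities[n]))
    return name, haversine_km(*flat, *amenities[name])


# Sample coordinates for illustration
stations = {
    "Admiralty": (1.440589, 103.800991),
    "Bayfront": (1.281874, 103.859080),
}
name, dist = nearest_amenity((1.442635, 103.800040), stations)
print(name, round(dist, 2))
```

The same idea extends to schools and shopping malls by passing a different dictionary of amenities.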

# 1.0 Python Libraries

In [1]:
# # installing less common packages (uncomment if you do not have these installed)
# !pip install geopy
# !pip install geopandas
# !pip install featuretools

In [2]:
# The following code imports the standard required libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)
from mpl_toolkits import mplot3d
import seaborn as sns

import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
from geopy.distance import geodesic
import folium

import json 
import requests
import time

import datetime as dt

import shapely
from shapely import geometry
from shapely import ops
from shapely.geometry import Point, LineString, Polygon, MultiPoint
from shapely.ops import nearest_points

from sklearn.base import BaseEstimator, TransformerMixin


import warnings
# warnings.filterwarnings('ignore')

sns.set_style('ticks')

pd.set_option('display.max_columns', None)

%matplotlib inline

# 1.1 Geolocation Helper Functions

In [64]:
# The following helper functions are intended for use in a scikit-learn pipeline
# Reference: https://www.kaggle.com/lucabasa/understand-and-use-a-pipeline#A-Pipeline-step-by-step


def convert_list_to_string(org_list, separator=' '):
    """ Convert list to string, by joining all item in list with given separator.
        Returns the concatenated string """
    return separator.join(org_list)

def clean_string(in_string):
    """ Remove characters in string that cause an error in onemap_api """
    out_string = ' ' + in_string.upper() + ' '
    out_string = out_string.replace("'","%27").replace(" ST. "," SAINT ").replace(" RD "," ROAD "). \
                           replace(" ST "," STREET ").replace(" AVE "," AVENUE ").replace(" PK "," PARK "). \
                           strip().replace(" ","%20")
    
    return out_string

def get_onemap_api(address):
    """ Given a string address, call the onemap API to obtain the geodata """
    qry = clean_string(address)
    req = requests.get('https://developers.onemap.sg/commonapi/search?searchVal='+qry+'&returnGeom=Y&getAddrDetails=Y&pageNum=1')
    resultsdict = req.json()  # parse the JSON response instead of calling eval() on remote text
    if resultsdict['found']>0:
        latitude = float(resultsdict['results'][0]['LATITUDE'])
        longitude = float(resultsdict['results'][0]['LONGITUDE'])
        if resultsdict['results'][0]['POSTAL'] == 'NIL':
            postal = 0
        else:
            postal = int(resultsdict['results'][0]['POSTAL'])
        return resultsdict['found'], latitude, longitude, postal, resultsdict['results'][0]['ADDRESS']  

    else:
        return resultsdict['found'], 0.0, 0.0, 0, qry
    

class get_location(BaseEstimator, TransformerMixin):
    '''
    This class takes in a dataframe containing a column 'address' with the address to be geocoded.
    The columns latitude, longitude, postcode, mailing_address, district and point are added
    if they do not exist yet, and are then populated with the geocoding results.
    '''
    def __init__(self, overwrite=False, verbose=True, maxrows=10):
        self.overwrite = overwrite # If True, existing location data is overwritten, skip otherwise
        self.verbose = verbose     # If True, print progress messages
        self.maxrows = maxrows     # Limit number of rows queried
        self.geo_columns = ['latitude','longitude','postcode','mailing_address','district','point']

    def fit(self, X, y=None):
        # Do nothing
        return self
    
    # The following checks if the list of geo_columns exists in the dataframe and creates the columns if not 
    def match_columns(self, X):
        miss_cols = list(set(self.geo_columns) - set(X.columns))
        
        err = 0
        
        if len(miss_cols) > 0:
            for col in miss_cols:
                if col == 'latitude' or col == 'longitude':
                    X[col] = 0.0  # insert a column for the missing latitude/longitude as float64
                elif col == 'postcode' or col == 'district':
                    X[col] = 0    # insert a column for the missing postcode/district as int64
                elif col == 'mailing_address':
                    X[col] = 'Not Found'   # insert an address column as string
                else:
                    X[col] = 0
                err += 1
                      
        if err > 0 and self.verbose == True:
            print('Columns ' + convert_list_to_string(miss_cols, ', ') + ' are added.')
            
        return X
        
    def transform(self, X):
        result = 0
        count = 0
        failed_count = 0
        skipped_count = 0
        
        X = self.match_columns(X)
        
        i_addr = X.columns.get_loc('address')
        i_lat  = X.columns.get_loc('latitude')
        i_lon  = X.columns.get_loc('longitude')
        i_post = X.columns.get_loc('postcode')
        i_mail = X.columns.get_loc('mailing_address')
        i_dist = X.columns.get_loc('district')
        i_point = X.columns.get_loc('point')
        
        for i in range(min(self.maxrows, len(X))):  # never index past the last row of X
            if self.overwrite == True or (X.iloc[i,i_lat] == 0.0 and X.iloc[i,i_lon] == 0.0): 
                result, X.iloc[i,i_lat], X.iloc[i,i_lon], X.iloc[i,i_post], X.iloc[i,i_mail] \
                     = get_onemap_api(X.iloc[i,i_addr])
                if result > 0:
                    X.iloc[i,i_dist] = sg_districts[X.iloc[i,i_post]//10000] 
                    X.iloc[i,i_point] = Point(X.iloc[i,i_lon], X.iloc[i,i_lat])
                else:
                    failed_count = failed_count + 1
                    if self.verbose == True:
                        print('Failed to get geodata for: '+ X.iloc[i,i_addr])
            else:
                skipped_count = skipped_count + 1

            count = count + 1

            if count%100 == 0 and count > skipped_count:
                time.sleep(1) # Sleep 1 second after each 100 iterations
                
            if self.verbose == True and count%1000 == 0:
                print('Processed: ' + str(count) + ' addresses, ' + str(failed_count) + ' failed, ' 
                  + str(skipped_count) + ' skipped.')

        if self.verbose == True:
            print('Processed: ' + str(count) + ' addresses, ' + str(failed_count) + ' failed, ' 
                  + str(skipped_count) + ' skipped.')

        return X
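
The response handling in `get_onemap_api` can be checked offline, without calling the API. The sketch below is a hypothetical helper, `parse_onemap_result`, that applies the same field logic to an already-decoded response dictionary. The sample payload is made up for illustration and only mimics the fields the code above reads (`found`, `results`, `LATITUDE`, `LONGITUDE`, `POSTAL`, `ADDRESS`).

```python
def parse_onemap_result(resultsdict, query):
    """Extract (found, latitude, longitude, postal, address) from a decoded response.

    Mirrors the field handling of get_onemap_api; hypothetical helper for testing.
    """
    if resultsdict.get('found', 0) > 0:
        first = resultsdict['results'][0]
        # A missing postal code arrives as the string 'NIL' and is stored as 0
        postal = 0 if first['POSTAL'] == 'NIL' else int(first['POSTAL'])
        return (resultsdict['found'], float(first['LATITUDE']),
                float(first['LONGITUDE']), postal, first['ADDRESS'])
    # Nothing found: return neutral values and echo the query back
    return resultsdict.get('found', 0), 0.0, 0.0, 0, query


# Made-up payload for illustration only
sample = {
    'found': 1,
    'results': [{
        'LATITUDE': '1.3621', 'LONGITUDE': '103.8561',
        'POSTAL': '560406', 'ADDRESS': '406 ANG MO KIO AVENUE 10 SINGAPORE 560406',
    }],
}
print(parse_onemap_result(sample, '406%20ANG%20MO%20KIO%20AVENUE%2010'))
```

Separating the parsing from the HTTP call keeps the network access in one place and lets the field logic be tested with fixed inputs.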
    

# 1.2 Mapping Helper Functions

In [142]:
# Helper functions for mapping the geolocations
# This is the center of Singapore in latitude and longitude
sg_lat = 1.28967
sg_lon = 103.85007

def marker_circle(X, the_map):
    """Creates markers for all `coordinates` passed, and adds onto `the_map`.  """
    i_addr = X.columns.get_loc('address')
    i_lat  = X.columns.get_loc('latitude')
    i_lon  = X.columns.get_loc('longitude')
    i_post = X.columns.get_loc('postcode')
    i_mail = X.columns.get_loc('mailing_address')
    i_dist = X.columns.get_loc('district')
    i_point = X.columns.get_loc('point')

    for i in range(len(X)):
        folium.CircleMarker(location = [X.iloc[i,i_lat],X.iloc[i,i_lon]],
                  radius=1.0,
                  popup=X.iloc[i,i_mail]).add_to(the_map)
        
def marker_icons(X, the_map, color='red', icon='arrow-down' ):
    """Creates markers for all points passed, and adds onto `the_map`.  """
    i_addr = X.columns.get_loc('address')
    i_lat  = X.columns.get_loc('latitude')
    i_lon  = X.columns.get_loc('longitude')
    i_post = X.columns.get_loc('postcode')
    i_mail = X.columns.get_loc('mailing_address')
    i_dist = X.columns.get_loc('district')
    i_point = X.columns.get_loc('point')
    iconprefix = 'fa'
    iconname=icon
    
    for i in range(len(X)):
        folium.Marker(location = [X.iloc[i,i_lat],X.iloc[i,i_lon]],
                  popup=X.iloc[i,i_mail], 
                  icon=folium.Icon(color=color ,prefix= iconprefix, icon=iconname )).add_to(the_map)
    

# 1.0 Data Import
___
Import the datasets into Python. The input files are either downloaded from data.gov.sg or created manually from Wikipedia information, as referenced above.

In [8]:
# 1.1 HDB flat information - Location information
df_raw_property_info = pd.read_csv('../data/raw/hdb-property-information.csv')
print(df_raw_property_info.shape)
print(df_raw_property_info.columns)

(12442, 24)
Index(['blk_no', 'street', 'max_floor_lvl', 'year_completed', 'residential',
       'commercial', 'market_hawker', 'miscellaneous', 'multistorey_carpark',
       'precinct_pavilion', 'bldg_contract_town', 'total_dwelling_units',
       '1room_sold', '2room_sold', '3room_sold', '4room_sold', '5room_sold',
       'exec_sold', 'multigen_sold', 'studio_apartment_sold', '1room_rental',
       '2room_rental', '3room_rental', 'other_room_rental'],
      dtype='object')


In [10]:
# 1.2 HDB flat information - Resale Prices
# Coverage up to 2021-11-23 - data.gov.sg
df_raw_2017 = pd.read_csv('../data/raw/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')
print(df_raw_2017.shape)
print(df_raw_2017.columns)

(113753, 11)
Index(['month', 'town', 'flat_type', 'block', 'street_name', 'storey_range',
       'floor_area_sqm', 'flat_model', 'lease_commence_date',
       'remaining_lease', 'resale_price'],
      dtype='object')


In [69]:
# Points of Interest
# 1.3 MRT and LRT locations
df_raw_mrt_lrt = pd.read_csv('../data/raw/Singapore_MRT_LRT_stations.csv')
print(df_raw_mrt_lrt.shape)
print(df_raw_mrt_lrt.columns)

(166, 2)
Index(['station_id', 'station_name'], dtype='object')


In [26]:
# 1.4 Primary School locations
df_raw_schools = pd.read_csv('../data/raw/Singapore_Primary_schools.csv')
print(df_raw_schools.shape)
print(df_raw_schools.columns)

(184, 1)
Index(['name'], dtype='object')


In [172]:
# 1.5 Shopping mall locations
df_raw_shopping = pd.read_csv('../data/raw/Singapore_Shopping.csv')
print(df_raw_shopping.shape)
print(df_raw_shopping.columns)

(153, 1)
Index(['name'], dtype='object')


In [53]:
# 1.6 Mapping of postcode to districts
def keystoint(x):
    return {int(k): int(v) for k, v in x}
with open('../data/raw/Singapore_districts.json') as d:
    sg_districts = json.load(d, object_pairs_hook=keystoint)
    print(sg_districts)
    print(type(sg_districts))


{0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 2, 14: 3, 15: 3, 16: 3, 9: 4, 10: 4, 11: 5, 12: 5, 13: 5, 17: 6, 18: 7, 19: 7, 20: 8, 21: 8, 22: 9, 23: 9, 24: 10, 25: 10, 26: 10, 27: 10, 28: 11, 29: 11, 30: 11, 31: 12, 32: 12, 33: 12, 34: 13, 35: 13, 36: 13, 37: 13, 38: 14, 39: 14, 40: 14, 41: 14, 42: 15, 43: 15, 44: 15, 45: 15, 46: 16, 47: 16, 48: 16, 49: 17, 50: 17, 81: 17, 51: 18, 52: 18, 53: 19, 54: 19, 55: 19, 82: 19, 56: 20, 57: 20, 58: 21, 59: 21, 60: 22, 61: 22, 62: 22, 63: 22, 64: 22, 65: 23, 66: 23, 67: 23, 68: 23, 69: 24, 70: 24, 71: 24, 72: 25, 73: 25, 77: 26, 78: 26, 75: 27, 76: 27, 79: 28, 80: 28}
<class 'dict'>
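
The mapping is keyed by postal sector, the first two digits of the six-digit postcode, which is why `get_location.transform` looks up `postcode // 10000`. The sketch below shows that lookup on a small subset of the mapping printed above. The helper name `postcode_to_district` and the fallback value 0 for unknown sectors are assumptions for illustration.

```python
# Subset of the sector-to-district mapping printed above
sg_districts_subset = {0: 0, 1: 1, 38: 14, 56: 20, 73: 25}


def postcode_to_district(postcode, mapping, default=0):
    """Map a six-digit postcode (stored as int) to its district."""
    sector = postcode // 10000  # first two digits of the postcode
    return mapping.get(sector, default)


print(postcode_to_district(738344, sg_districts_subset))  # sector 73
print(postcode_to_district(18957, sg_districts_subset))   # sector 01, stored without the leading zero
```

Using `dict.get` with a default avoids a `KeyError` for a sector that is missing from the mapping, at the cost of silently assigning district 0.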


# 2.0 General EDA
___

In [165]:
# getting some basic information about each dataframe
# shape of dataframe i.e. number of rows and columns
# total number of rows with null values
# total number of duplicates
# data types of columns

def assess_NA(data):
    """
    Returns a pandas dataframe denoting the total number of NA values and the percentage of NA values in each column.
    The column names are noted on the index.
    
    Parameters
    ----------
    data: dataframe
    """
    # pandas series denoting features and the sum of their null values
    null_sum = data.isnull().sum()
    # instantiate columns for missing data
    total = null_sum.sort_values(ascending=False)
    percent = ( ((null_sum / len(data.index))*100).round(2) ).sort_values(ascending=False)
    
    # concatenate along the columns to create the complete dataframe
    df_NA = pd.concat([total, percent], axis=1, keys=['Number of NA', 'Percent NA'])
    
    # drop rows that don't have any missing data; omit if you want to keep all rows
    df_NA = df_NA[ (df_NA.T != 0).any() ]
    
    return df_NA

def basic_eda(df, df_name):
    print(df_name.upper())
    print()
    print(f"Rows: {df.shape[0]} \t Columns: {df.shape[1]}")
    print()
    
    null_rows = df.isnull().any(axis=1).sum()  # rows with at least one null value
    print(f"Total null rows: {null_rows}")
    print(f"Percentage null rows: {round(null_rows / df.shape[0] * 100, 2)}%")
    print()
    
    print(f"Total duplicate rows: {df[df.duplicated(keep=False)].shape[0]}")
    print(f"Percentage dupe rows: {round(df[df.duplicated(keep=False)].shape[0] / df.shape[0] * 100, 2)}%")
    print()
    
    print('Data Type of the columns')
    print(df.dtypes)
    print()
    
    df_NA = assess_NA(df)
    if len(df_NA) > 0:
        print('Missing value for each columns:')
        print(df_NA)
    else:
        print('There is no missing value.')
    print()
    
    print('Statistics for numerical columns')
    print(df.describe())
    print()
    
    print('Top 5 rows')
    print(df.head(5))
    print("-----\n")

In [166]:
basic_eda(df_raw_property_info, 'hdb property info')

HDB PROPERTY INFO

Rows: 12442 	 Columns: 24

Total null rows: 0
Percentage null rows: 0.0%

Total duplicate rows: 0
Percentage dupe rows: 0.0%

Data Type of the columns
blk_no                   object
street                   object
max_floor_lvl             int64
year_completed            int64
residential              object
commercial               object
market_hawker            object
miscellaneous            object
multistorey_carpark      object
precinct_pavilion        object
bldg_contract_town       object
total_dwelling_units      int64
1room_sold                int64
2room_sold                int64
3room_sold                int64
4room_sold                int64
5room_sold                int64
exec_sold                 int64
multigen_sold             int64
studio_apartment_sold     int64
1room_rental              int64
2room_rental              int64
3room_rental              int64
other_room_rental         int64
dtype: object

There is no missing value.

Statistics for nume

The EDA report shows that there is no missing data.

The following columns are categorical columns for label encoding:
residential, commercial, market_hawker, miscellaneous, multistorey_carpark, precinct_pavilion, bldg_contract_town

In [167]:
basic_eda(df_raw_2017, 'Prices from 1/1/2017 to 23/11/2021')

PRICES FROM 1/1/2017 TO 23/11/2021

Rows: 113753 	 Columns: 11

Total null rows: 0
Percentage null rows: 0.0%

Total duplicate rows: 494
Percentage dupe rows: 0.43%

Data Type of the columns
month                   object
town                    object
flat_type               object
block                   object
street_name             object
storey_range            object
floor_area_sqm         float64
flat_model              object
lease_commence_date      int64
remaining_lease         object
resale_price           float64
dtype: object

There is no missing value.

Statistics for numerical columns
       floor_area_sqm  lease_commence_date  resale_price
count   113753.000000        113753.000000  1.137530e+05
mean        97.855562          1995.004492  4.578275e+05
std         24.149495            13.394424  1.584956e+05
min         31.000000          1966.000000  1.400000e+05
25%         82.000000          1985.000000  3.420000e+05
50%         95.000000          1995.000000  4.2700

The EDA report shows that there is no missing data.
494 duplicate rows are detected. Since these represent only 0.43% of the overall dataset, the duplicates will be removed.

The lease commence date needs to be converted to datetime.
The month needs to be converted to datetime.
The remaining lease needs to be converted to the number of years remaining in the lease.

The following columns are categorical columns for one hot encoding:
flat_type, storey_range, flat_model.
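
The `remaining_lease` values are strings such as `61 years 04 months`. The sketch below converts such a string to a number of years, assuming every value follows the pattern `<n> years`, optionally followed by `<m> month(s)`. The helper name `lease_to_years` is hypothetical.

```python
import re

# "<n> year(s)" optionally followed by "<m> month(s)"
_LEASE_PATTERN = re.compile(r'(\d+)\s+years?(?:\s+(\d+)\s+months?)?')


def lease_to_years(text):
    """Convert e.g. '61 years 04 months' to years as a float."""
    match = _LEASE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f'Unrecognised remaining_lease value: {text!r}')
    years = int(match.group(1))
    months = int(match.group(2) or 0)  # months are optional
    return years + months / 12
```

On a dataframe this could be applied with `df['remaining_lease'].apply(lease_to_years)`. Raising on unexpected input makes format changes in the source data visible instead of producing silent NaN values.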

In [171]:
# Explore the categorical columns

print(df_raw_2017.flat_type.value_counts())
print()
print(df_raw_2017.storey_range.value_counts())
print()
print(df_raw_2017.flat_model.value_counts())

4 ROOM              47328
5 ROOM              28946
3 ROOM              26596
EXECUTIVE            9090
2 ROOM               1687
MULTI-GENERATION       58
1 ROOM                 48
Name: flat_type, dtype: int64

04 TO 06    26340
07 TO 09    23785
10 TO 12    21196
01 TO 03    20279
13 TO 15    10864
16 TO 18     5049
19 TO 21     2165
22 TO 24     1605
25 TO 27      898
28 TO 30      566
31 TO 33      288
34 TO 36      268
37 TO 39      254
40 TO 42      128
43 TO 45       31
46 TO 48       27
49 TO 51       10
Name: storey_range, dtype: int64

Model A                   37337
Improved                  28466
New Generation            14746
Premium Apartment         12996
Apartment                  4572
Simplified                 4478
Maisonette                 3436
Standard                   3229
DBSS                       2161
Model A2                   1349
Adjoined flat               214
Model A-Maisonette          204
Type S1                     204
Type S2                     118

### 2.1 Basic feature engineering
___

| Observations | Action |
|---|---|
|Column 'month' is text yyyy-mm | Create new 'sale_date' column of datetime yyyy-mm-01 |
|Column 'remaining_lease' is text | Create new 'remaining_year' column of 99 - (sale_date.year - lease_commence_date)|

In [61]:
# New column 'sale_date' as datetime.
df_price = df_raw_2017.copy()
df_price['sale_date'] = pd.to_datetime(df_price['month']+'-01')
df_price['lease_date'] = pd.to_datetime(df_price['lease_commence_date'].astype(str), format='%Y')
df_price['remaining_year'] = 99 - (df_price.sale_date.dt.year - df_price.lease_date.dt.year)
df_price['address'] = df_price['block']+' '+df_price['street_name']
df_price.head(-3)

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price,sale_date,lease_date,address,remaining_year
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0,2017-01-01,1979-01-01,406 ANG MO KIO AVE 10,61
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0,2017-01-01,1978-01-01,108 ANG MO KIO AVE 4,60
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0,2017-01-01,1980-01-01,602 ANG MO KIO AVE 5,62
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0,2017-01-01,1980-01-01,465 ANG MO KIO AVE 10,62
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0,2017-01-01,1980-01-01,601 ANG MO KIO AVE 5,62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113745,2021-11,YISHUN,EXECUTIVE,387,YISHUN RING RD,04 TO 06,146.0,Maisonette,1988,65 years 08 months,688000.0,2021-11-01,1988-01-01,387 YISHUN RING RD,66
113746,2021-11,YISHUN,EXECUTIVE,328,YISHUN RING RD,01 TO 03,146.0,Maisonette,1988,65 years 08 months,668000.0,2021-11-01,1988-01-01,328 YISHUN RING RD,66
113747,2021-11,YISHUN,EXECUTIVE,361,YISHUN RING RD,01 TO 03,146.0,Maisonette,1988,65 years 08 months,668000.0,2021-11-01,1988-01-01,361 YISHUN RING RD,66
113748,2021-11,YISHUN,EXECUTIVE,792,YISHUN RING RD,10 TO 12,144.0,Apartment,1987,64 years 10 months,690000.0,2021-11-01,1987-01-01,792 YISHUN RING RD,65


### 2.2 Geocoding the addresses
___

In [70]:
# Create the df_mrt_lrt dataset for MRT/LRT Information with geo location
df_mrt_lrt = df_raw_mrt_lrt.copy()
df_mrt_lrt['address'] = df_mrt_lrt['station_id']
print(f'Processing {len(df_mrt_lrt)} rows.')
finder = get_location(overwrite=True, verbose=True, maxrows=len(df_mrt_lrt))
df_mrt_lrt = finder.fit_transform(df_mrt_lrt)
df_mrt_lrt.head(100)

Processing 166 rows.
Columns district, longitude, mailing_address, latitude, point, postcode are added.
Processed: 166 addresses, 0 failed, 0 skipped.


Unnamed: 0,station_id,station_name,address,district,longitude,mailing_address,latitude,point,postcode
0,NS10,Admiralty,NS10,25,103.800991,70 WOODLANDS AVENUE 7 ADMIRALTY MRT STATION (N...,1.440589,POINT (103.800990519771 1.44058856161847),738344
1,EW9,Aljunied,EW9,14,103.882906,81 LORONG 25 GEYLANG ALJUNIED MRT STATION (EW9...,1.316433,POINT (103.882906044385 1.3164326118157),388310
2,NS16,Ang Mo Kio,NS16,20,103.849558,2795 ANG MO KIO AVENUE 8 ANG MO KIO MRT STATIO...,1.369933,POINT (103.84955809232 1.36993284962262),569812
3,CC12,Bartley,CC12,19,103.880178,90 BARTLEY ROAD BARTLEY MRT STATION (CC12) SIN...,1.342501,POINT (103.880177899184 1.34250117805245),539788
4,CE1,Bayfront,CE1,1,103.859080,11 BAYFRONT AVENUE BAYFRONT MRT STATION (DT16 ...,1.281874,POINT (103.859079764874 1.28187378879209),18957
...,...,...,...,...,...,...,...,...,...
95,NS11,Sembawang,NS11,27,103.820046,11 CANBERRA ROAD SEMBAWANG MRT STATION (NS11) ...,1.449051,POINT (103.820046140211 1.44905082158502),759775
96,NE16,Sengkang,NE16,19,103.895485,5 SENGKANG SQUARE SENGKANG MRT STATION (NE16) ...,1.391695,POINT (103.895484694279 1.39169462601522),545062
97,NE12,Serangoon,NE12,19,103.873575,600 UPPER SERANGOON ROAD SERANGOON MRT STATION...,1.349708,POINT (103.873574849884 1.34970788089564),534801
98,EW3,Simei,EW3,18,103.953377,30 SIMEI STREET 3 SIMEI MRT STATION (EW3) SING...,1.343197,POINT (103.953377214378 1.34319707851829),529888


The LRT station BP14 (Ten Mile Junction) was permanently closed on 13 January 2019, which caused a geo lookup error. It was removed from the dataset to resolve the issue.

In [147]:
# Map the MRT and LRT stations for checking the correctness of the geolocation

sg_map = folium.Map(location=[sg_lat, sg_lon], zoom_start=14)  # avoid shadowing the built-in map()
marker_icons(df_mrt_lrt, sg_map, color='green', icon='fa-subway')  # adds markers in place, returns None
sg_map

In [71]:
# Create the df_school dataset for School Information with geo location
df_school = df_raw_schools.copy()
df_school['address'] = df_school['name']
print(f'Processing {len(df_school)} rows.')
finder = get_location(overwrite=True, verbose=True, maxrows=len(df_school))
df_school = finder.fit_transform(df_school)
df_school.head(len(df_school))

Processing 184 rows.
Columns district, longitude, mailing_address, latitude, point, postcode are added.
Processed: 184 addresses, 0 failed, 0 skipped.


Unnamed: 0,name,address,district,longitude,mailing_address,latitude,point,postcode
0,Admiralty Primary School,Admiralty Primary School,25,103.800040,11 WOODLANDS CIRCLE ADMIRALTY PRIMARY SCHOOL S...,1.442635,POINT (103.800040119743 1.4426347903311),738907
1,Ahmad Ibrahim Primary School,Ahmad Ibrahim Primary School,27,103.832942,10 YISHUN STREET 11 AHMAD IBRAHIM PRIMARY SCHO...,1.433153,POINT (103.832942401086 1.43315271543517),768643
2,Ai Tong School,Ai Tong School,20,103.833020,100 BRIGHT HILL DRIVE AI TONG SCHOOL SINGAPORE...,1.360583,POINT (103.833020333986 1.3605834338904),579646
3,Alexandra Primary School,Alexandra Primary School,3,103.824425,2A PRINCE CHARLES CRESCENT ALEXANDRA PRIMARY S...,1.291334,POINT (103.824424680531 1.29133439161334),159016
4,Anchor Green Primary School,Anchor Green Primary School,19,103.887165,31 ANCHORVALE DRIVE ANCHOR GREEN PRIMARY SCHOO...,1.390370,POINT (103.887165375933 1.39036998654612),544969
...,...,...,...,...,...,...,...,...
179,Yuhua Primary School,Yuhua Primary School,22,103.741106,158 JURONG EAST STREET 24 YUHUA PRIMARY SCHOOL...,1.342802,POINT (103.741105772644 1.34280230475033),609558
180,Yumin Primary School,Yumin Primary School,18,103.950462,3 TAMPINES STREET 21 YUMIN PRIMARY SCHOOL SING...,1.351292,POINT (103.950461927088 1.35129177656981),529393
181,Zhangde Primary School,Zhangde Primary School,3,103.825952,51 JALAN MEMBINA ZHANGDE PRIMARY SCHOOL SINGAP...,1.284212,POINT (103.825951875662 1.28421153335379),169485
182,Zhenghua Primary School,Zhenghua Primary School,23,103.769314,9 FAJAR ROAD ZHENGHUA PRIMARY SCHOOL SINGAPORE...,1.379549,POINT (103.769313521752 1.37954887512229),679002


Some schools have been permanently closed, which caused geo lookup errors. These were removed from the dataset to resolve the issue. Additionally, the OneMap API does not recognize "St." as an abbreviation for "Saint", so the clean_string code was modified to handle this case.
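
The effect of the "St." handling can be checked by running `clean_string` on a school name. The block repeats the function from section 1.1, with the chained calls wrapped in parentheses, so that it runs on its own. The inputs are used only as examples.

```python
def clean_string(in_string):
    """Copy of clean_string from section 1.1."""
    out_string = ' ' + in_string.upper() + ' '
    out_string = (out_string.replace("'", "%27").replace(" ST. ", " SAINT ")
                  .replace(" RD ", " ROAD ").replace(" ST ", " STREET ")
                  .replace(" AVE ", " AVENUE ").replace(" PK ", " PARK ")
                  .strip().replace(" ", "%20"))
    return out_string


print(clean_string("St. Andrew's Junior School"))
# SAINT%20ANDREW%27S%20JUNIOR%20SCHOOL
print(clean_string("406 Ang Mo Kio Ave 10"))
# 406%20ANG%20MO%20KIO%20AVENUE%2010
```

The padding spaces added at the start and end are what allow abbreviations in the first or last position of the string to match the space-delimited patterns.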

In [155]:
sg_map = folium.Map(location=[sg_lat, sg_lon], zoom_start=14)  # avoid shadowing the built-in map()
marker_icons(df_school, sg_map, color='blue', icon="fa-graduation-cap")  # adds markers in place, returns None
sg_map

In [74]:
# Create the df_shopping dataset for Shopping Malls Information with geo location
df_shopping = df_raw_shopping.copy()
df_shopping['address'] = df_shopping['name']
print(f'Processing {len(df_shopping)} rows.')
finder = get_location(overwrite=True, verbose=True, maxrows=len(df_shopping))
df_shopping = finder.fit_transform(df_shopping)
df_shopping.head(len(df_shopping))


Processing 153 rows.
Columns district, longitude, mailing_address, latitude, point, postcode are added.
Processed: 153 addresses, 0 failed, 0 skipped.


Unnamed: 0,name,address,district,longitude,mailing_address,latitude,point,postcode
0,100 AM,100 AM,2,103.843471,100 TRAS STREET 100 AM SINGAPORE 079027,1.274588,POINT (103.84347073661 1.27458821795426),79027
1,600 @ Toa Payoh,600 @ Toa Payoh,12,103.850978,600 LORONG 4 TOA PAYOH 600 @ TOA PAYOH SINGAPO...,1.334036,POINT (103.850977706475 1.33403623894465),319515
2,Anchorpoint,Anchorpoint,3,103.805608,368 ALEXANDRA ROAD ANCHORPOINT SHOPPING CENTRE...,1.288935,POINT (103.805607779399 1.28893477974497),159952
3,Beauty World Centre,Beauty World Centre,21,103.776539,144 UPPER BUKIT TIMAH ROAD BEAUTY WORLD CENTRE...,1.342413,POINT (103.776539385406 1.34241264188642),588177
4,Beauty World Plaza,Beauty World Plaza,21,103.776259,140 UPPER BUKIT TIMAH ROAD BEAUTY WORLD PLAZA ...,1.341800,POINT (103.776259359854 1.34180018619223),588176
...,...,...,...,...,...,...,...,...
148,Yew Tee Square,Yew Tee Square,23,103.747345,623 CHOA CHU KANG STREET 62 YEW TEE SQUARE SIN...,1.398321,POINT (103.747344866472 1.39832134857945),680623
149,321 Clementi,321 Clementi,5,103.764987,321 CLEMENTI AVENUE 3 321 CLEMENTI SINGAPORE 1...,1.312002,POINT (103.764986676365 1.31200212030821),129905
150,Cathay Cineleisure Orchard,Cathay Cineleisure Orchard,9,103.836430,8 GRANGE ROAD CATHAY CINELEISURE ORCHARD SINGA...,1.301521,POINT (103.836429655016 1.30152101873533),239695
151,GV Yishun,GV Yishun,27,103.836473,51 YISHUN CENTRAL 1 GOLDEN VILLAGE (GV YISHUN)...,1.429916,POINT (103.836473396124 1.42991554202388),768794


In [174]:
sg_map = folium.Map(location=[sg_lat, sg_lon], zoom_start=14)  # avoid shadowing the built-in map()
marker_icons(df_shopping, sg_map, color='purple', icon="fa-shopping-cart")  # adds markers in place, returns None
sg_map

In [None]:
# Create the df_info dataset for HDB Information with geo location
df_info = df_raw_property_info.copy()
df_info['address'] = df_info['blk_no'] + ' ' + df_info['street']
print(f'Processing {len(df_info)} rows.')
finder = get_location(overwrite=False, verbose=True, maxrows=len(df_info))
df_info = finder.fit_transform(df_info)
df_info.head(len(df_info))

In [None]:
"""
prices = df_raw_2017.copy()
prices['address'] = prices['block'] + ' ' + prices['street_name']
finder = get_location(overwrite=False, verbose=True)
prices = finder.fit_transform(prices)
prices.head(100)
"""

In [None]:
"""
#newdf = df1.merge(df2, how='left', on='name')
test_info = 
# Create point geometries
geometry = geopandas.points_from_xy(df.Longitude, df.Latitude)
geo_df = geopandas.GeoDataFrame(df[['Year','Name','Country', 'Latitude', 'Longitude', 'Type']], geometry=geometry)

geo_df.head()

from geopy import distance
print(distance.distance(wellington, salamanca).km)
"""

### 2.3 Pickle the datasets for the modelling

In [None]:
import pickle

# Save the dataframe as pickle.
print("pickling df_info:", df_info.shape)
# The context manager closes the file even if pickling fails
with open('../data/interim/df_info.pickle', 'wb') as picklefile:
    pickle.dump(df_info, picklefile, pickle.HIGHEST_PROTOCOL)
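
In the modelling notebook the saved object can be restored with `pickle.load`. The sketch below shows the round trip on a small stand-in object in a temporary directory, since the real `df_info` and the `../data/interim/` path are not available outside this project. The file name is reused only for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for df_info; any picklable object behaves the same way
data = {'address': ['406 ANG MO KIO AVE 10'], 'district': [20]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'df_info.pickle')
    # Write the object to disk
    with open(path, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    # Read it back, as the modelling notebook would
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == data)
```

For dataframes, `pd.read_pickle` gives the same result in a single call. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.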
