<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Prediction of HDB Resale Flat Prices
Author: Edmond Ang

## Contents
* [1. Test Data Cleaning](#1.-Test-Data-Cleaning)
    * [1.1. Imports](#1.1.-Imports)
    * [1.2. Initial summary of data](#1.2.-Initial-summary-of-data)
* [2. Cleaning](#2.-Cleaning)
    * [2.1. Check for null values](#2.1.-Check-for-null-values)
    * [2.2. Change NaN values](#2.2.-Change-NaN-values)
    * [2.3. Change Y and N values](#2.3.-Change-Y-and-N-values)
    * [2.4. Drop repetitive features](#2.4.-Drop-repetitive-features)
    * [2.5. Separate out id column](#2.5.-Separate-out-id-column)
* [3. Export Data](#3.-Export-Data)

---
## 1. Test Data Cleaning
---

* Apply the same cleaning steps used on the training data so that the baseline model can generate predictions on the test set

### 1.1. Imports

In [1]:
import numpy as np
import pandas as pd
import math
import os

### 1.2. Initial summary of data

In [2]:
hdb_df = pd.read_csv('datasets/test.csv')
pd.set_option('display.max_columns', None)  # to see all columns
print(hdb_df.shape)
print(hdb_df.head())

(16737, 76)
       id Tranc_YearMonth         town flat_type block          street_name  \
0  114982         2012-11       YISHUN    4 ROOM   173         YISHUN AVE 7   
1   95653         2019-08  JURONG WEST    5 ROOM  986C    JURONG WEST ST 93   
2   40303         2013-10   ANG MO KIO    3 ROOM   534    ANG MO KIO AVE 10   
3  109506         2017-10    WOODLANDS    4 ROOM    29         MARSILING DR   
4  100149         2016-08  BUKIT BATOK    4 ROOM   170  BT BATOK WEST AVE 8   

  storey_range  floor_area_sqm         flat_model  lease_commence_date  \
0     07 TO 09            84.0         Simplified                 1987   
1     04 TO 06           112.0  Premium Apartment                 2008   
2     07 TO 09            68.0     New Generation                 1980   
3     01 TO 03            97.0     New Generation                 1979   
4     16 TO 18           103.0            Model A                 1985   

   Tranc_Year  Tranc_Month  mid_storey  lower  upper  mid  \
0      



---
## 2. Cleaning
---

### 2.1. Check for null values

In [3]:
pd.set_option('display.max_rows', None)  # to see all rows
missing_hdb_df = pd.DataFrame(hdb_df.isna().sum()).reset_index()
missing_hdb_df.columns = ['col', 'num_na']
missing_hdb_df['%na'] = missing_hdb_df['num_na']/len(hdb_df)*100
missing_hdb_df.style

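|   | col | num_na | %na |
|---|---|---|---|
| 0 | id | 0 | 0.0 |
| 1 | Tranc_YearMonth | 0 | 0.0 |
| 2 | town | 0 | 0.0 |
| 3 | flat_type | 0 | 0.0 |
| 4 | block | 0 | 0.0 |
| 5 | street_name | 0 | 0.0 |
| 6 | storey_range | 0 | 0.0 |
| 7 | floor_area_sqm | 0 | 0.0 |
| 8 | flat_model | 0 | 0.0 |
| 9 | lease_commence_date | 0 | 0.0 |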
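
A more compact variant of the same summary can be sketched with `isna().mean()`, keeping only the columns that actually contain missing values. The toy frame below is invented for illustration and merely borrows two column names from the dataset; it is not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hdb_df (values are invented for illustration)
toy_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'Mall_Within_500m': [1.0, np.nan, np.nan, 2.0],
    'Mall_Nearest_Distance': [350.0, 1200.0, np.nan, 80.0],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'num_na': toy_df.isna().sum(),
    '%na': toy_df.isna().mean() * 100,
})

# Keep only columns that have missing values, worst first
missing_summary = (
    missing_summary[missing_summary['num_na'] > 0]
    .sort_values('num_na', ascending=False)
)
print(missing_summary)
```

Filtering to non-zero rows makes the columns that need attention easier to spot than scrolling through the full table.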


### 2.2. Change NaN values 
* Set NaN to 0 in the 'X within Y distance' features (`Mall_Within_*` and `Hawker_Within_*`)

In [4]:
# fill missing values in the 'X within Y distance' features with 0
within_cols = ['Mall_Within_500m', 'Mall_Within_1km', 'Mall_Within_2km',
               'Hawker_Within_500m', 'Hawker_Within_1km', 'Hawker_Within_2km']
hdb_df[within_cols] = hdb_df[within_cols].fillna(0)

* Kaggle requires a prediction for every row, so no rows can be dropped. NaN values in 'Mall_Nearest_Distance' are therefore replaced with 4000 (i.e. 4km), which is beyond the maximum observed value, to mark these flats as far from any mall

In [5]:
print(hdb_df['Mall_Nearest_Distance'].max())  # maximum observed distance
hdb_df['Mall_Nearest_Distance'] = hdb_df['Mall_Nearest_Distance'].fillna(4000.0)  # 4km sentinel

3496.40291
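
The sentinel only conveys "far away" if it really exceeds every observed distance, so that assumption can be guarded in code. Below is a minimal sketch on toy values. The `was_missing` indicator is an optional extra that is not part of the notebook, and the sketch assumes the training data is filled with the same sentinel.

```python
import numpy as np
import pandas as pd

# Toy distances; NaN marks rows with no recorded nearest mall
distances = pd.Series([350.0, 3496.40291, np.nan, 80.0],
                      name='Mall_Nearest_Distance')

SENTINEL = 4000.0  # same fill value as the notebook (4km)

# Guard: the sentinel must be larger than every real value
assert SENTINEL > distances.max()

# Optional (not in the notebook): flag imputed rows so a model can
# tell them apart from genuinely observed distances
was_missing = distances.isna()

filled = distances.fillna(SENTINEL)
print(filled.tolist())
print(was_missing.tolist())
```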


### 2.3. Change Y and N values
* to True and False respectively so Y and N can become machine-readable

In [6]:
# anything other than "N" becomes True, "N" becomes False
yn_cols = ['residential', 'commercial', 'market_hawker', 'multistorey_carpark', 'precinct_pavilion']
for col in yn_cols:
    hdb_df[col] = hdb_df[col] != 'N'
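
One subtlety of this conversion is that every value other than "N" becomes True, including missing or unexpected values. The sketch below contrasts that behaviour with a stricter mapping on a toy Series. The strict variant is an alternative for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy flags with one missing value (invented for illustration)
flags = pd.Series(['Y', 'N', np.nan, 'Y'], name='commercial')

# Notebook behaviour: everything that is not exactly "N" becomes True
lenient = flags != 'N'

# Stricter alternative: values outside the mapping stay missing
strict = flags.map({'Y': True, 'N': False})

print(lenient.tolist())
print(strict.isna().sum())
```

With the strict mapping, a follow-up null check would surface any value that is neither "Y" nor "N" instead of silently treating it as True.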

### 2.4. Drop repetitive features 
* These features duplicate information carried by other columns (e.g. latitude and longitude values are already represented by the location name)
* Dropping them also avoids an explosion in the number of features if `get_dummies` is used on these categorical columns

In [7]:
# sanity check: rule of thumb against the curse of dimensionality
# where N = 150,634, (N)**0.5 = 388 features
print(len(hdb_df['block'].unique()))
print(len(hdb_df['address'].unique()))
print(len(hdb_df['bus_stop_name'].unique()))  # examples of how many features one-hot encoding would create

2248
7044
1586
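
The arithmetic behind this check can be made explicit. The sketch below takes N from the notebook comment and the unique-value counts from the output above, and compares each count with the square-root budget. It assumes one-hot encoding creates roughly one column per unique value.

```python
import math

n_rows = 150_634  # N quoted in the notebook comment
feature_budget = math.isqrt(n_rows)  # floor of sqrt(N)

# Unique-value counts printed by the notebook
cardinalities = {'block': 2248, 'address': 7044, 'bus_stop_name': 1586}

for col, n_unique in cardinalities.items():
    verdict = 'exceeds' if n_unique > feature_budget else 'fits within'
    print(f'{col}: {n_unique} dummy columns {verdict} the budget of {feature_budget}')
```

Each of the three columns alone would exceed the budget several times over, which supports dropping them rather than encoding them.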


In [8]:
# drop 'repetitive' columns
# keep 'id' for now; it is needed later for the kaggle submission
drop_cols = [
    'Tranc_YearMonth', 'lease_commence_date', 'flat_type', 'block', 'flat_model',
    'storey_range', 'mid', 'lower', 'upper', 'postal', 'floor_area_sqft',
    'street_name', 'address', 'planning_area', 'mrt_name', 'bus_stop_name',
    'pri_sch_latitude', 'pri_sch_longitude', 'sec_sch_latitude', 'sec_sch_longitude',
    'bus_stop_latitude', 'bus_stop_longitude', 'mrt_latitude', 'mrt_longitude',
    'Longitude', 'Latitude',
]  # 'block' was listed twice in the original list; once is enough
hdb_df = hdb_df.drop(columns=drop_cols)

* Sanity checks

In [9]:
print(hdb_df.isnull().sum())  # results show none remaining

id                           0
town                         0
floor_area_sqm               0
Tranc_Year                   0
Tranc_Month                  0
mid_storey                   0
full_flat_type               0
hdb_age                      0
max_floor_lvl                0
year_completed               0
residential                  0
commercial                   0
market_hawker                0
multistorey_carpark          0
precinct_pavilion            0
total_dwelling_units         0
1room_sold                   0
2room_sold                   0
3room_sold                   0
4room_sold                   0
5room_sold                   0
exec_sold                    0
multigen_sold                0
studio_apartment_sold        0
1room_rental                 0
2room_rental                 0
3room_rental                 0
other_room_rental            0
Mall_Nearest_Distance        0
Mall_Within_500m             0
Mall_Within_1km              0
Mall_Within_2km              0
Hawker_N

In [10]:
print(hdb_df.dtypes)  # looks good

id                             int64
town                          object
floor_area_sqm               float64
Tranc_Year                     int64
Tranc_Month                    int64
mid_storey                     int64
full_flat_type                object
hdb_age                        int64
max_floor_lvl                  int64
year_completed                 int64
residential                     bool
commercial                      bool
market_hawker                   bool
multistorey_carpark             bool
precinct_pavilion               bool
total_dwelling_units           int64
1room_sold                     int64
2room_sold                     int64
3room_sold                     int64
4room_sold                     int64
5room_sold                     int64
exec_sold                      int64
multigen_sold                  int64
studio_apartment_sold          int64
1room_rental                   int64
2room_rental                   int64
3room_rental                   int64
o

### 2.5. Separate out id column
* to merge with y_pred later for kaggle submission

In [11]:
id_df = hdb_df['id'].rename('Id')  # renamed copy to fit kaggle requirement
print(id_df.head())
hdb_df = hdb_df.drop(columns=['id'])

0    114982
1     95653
2     40303
3    109506
4    100149
Name: Id, dtype: int64
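
For context, the later merge with `y_pred` might look like the sketch below. The predictions are made-up numbers, and `Predicted` is an assumed column name; the competition's sample submission file is the authority on the exact header.

```python
import numpy as np
import pandas as pd

# Stand-ins: the first ids from the output above and invented predictions
id_series = pd.Series([114982, 95653, 40303], name='Id')
y_pred = np.array([350000.0, 480000.0, 290000.0])  # hypothetical model output

# Predictions are matched by position, so the lengths must agree
assert len(id_series) == len(y_pred)

# 'Predicted' is an assumed column name, not confirmed by the notebook
submission = pd.DataFrame({'Id': id_series.to_numpy(), 'Predicted': y_pred})
print(submission)
```

Because the match is positional, the test rows must not be reordered or dropped between separating the ids and generating predictions.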


---
## 3. Export Data
---

In [12]:
newpath = 'output'
os.makedirs(newpath, exist_ok=True)  # create the folder if it does not exist yet

hdb_df.to_csv(os.path.join(newpath, 'cleaned_baseline_hdb_test.csv'), index=False)
id_df.to_csv(os.path.join(newpath, 'cleaned_hdb_test_id.csv'), index=False)
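
A quick round-trip check can confirm that an export wrote what was intended. The sketch below uses a temporary directory and a toy frame so that it does not touch the real `output` folder; the file name `toy_cleaned.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame borrowing two column names from the cleaned data
toy_df = pd.DataFrame({'town': ['YISHUN', 'WOODLANDS'],
                       'floor_area_sqm': [84.0, 97.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'output'
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    out_file = out_dir / 'toy_cleaned.csv'  # hypothetical file name
    toy_df.to_csv(out_file, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy_df.shape
same_columns = list(reloaded.columns) == list(toy_df.columns)
print(same_shape, same_columns)
```

Writing with `index=False` is what keeps an extra unnamed index column from appearing when the file is read back.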

In [13]:
# restore the default display options
pd.reset_option('display.max_rows')  # stop showing all rows
pd.reset_option('display.max_columns')  # stop showing all columns