In [1]:
import pandas as pd
import numpy as np
import random
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import timeit

pd.set_option('display.max_rows', 60000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 200)
pd.options.display.float_format = '{:,.2f}'.format

In [9]:
# Load the data
df = pd.read_csv('resale_flat_prices.csv', low_memory=False, index_col=0)

## Six steps in CRISP-DM the standard data mining process
### 1. Understanding Business

#### Problem Statement
To predict the resale prices of Singapore HDB flats using official data from the Singapore Public Data Repository.

House dealers, potential buyers, and sellers will be interested in predicting the price of an HDB flat based on its attributes.

More precisely, we try to answer the following three business questions:
- Is the price of a flat related to its location?
- Is the price of a flat related to the size of the flat?
- Can the price of a flat be predicted from its known variables with appropriate accuracy?

### 2. Data Understanding

#### Gathering data
The data is taken from the Singapore Public Data Repository (https://data.gov.sg/).
It covers transactions from March 2012 to September 2020.

#### Describing data

In [15]:
df.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0


In [25]:
df.shape

(169730, 11)

From initial observation, the dataset contains 169,730 records with 11 variables. <br>
Notes from the HDB metadata :

The datatypes of the initial dataset:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 169730 entries, 0 to 52202
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   month                169730 non-null  object 
 1   town                 169730 non-null  object 
 2   flat_type            169730 non-null  object 
 3   block                169730 non-null  object 
 4   street_name          169730 non-null  object 
 5   storey_range         169730 non-null  object 
 6   floor_area_sqm       169730 non-null  float64
 7   flat_model           169730 non-null  object 
 8   lease_commence_date  169730 non-null  int64  
 9   remaining_lease      117527 non-null  object 
 10  resale_price         169730 non-null  float64
dtypes: float64(2), int64(1), object(8)
memory usage: 15.5+ MB


An inspection of the numerical data types in the dataset:

In [26]:
df.describe()[['floor_area_sqm','resale_price']]

Unnamed: 0,floor_area_sqm,resale_price
count,169730.0,169730.0
mean,97.05,445514.51
std,24.49,141097.13
min,31.0,140000.0
25%,74.0,345000.0
50%,95.0,420000.0
75%,112.0,515000.0
max,280.0,1258000.0


#### Verifying data quality

There are 52,203 NaN records in the remaining_lease column. We may have to explore how to impute data for these records. Their lease_commence_date values range from 1966 to 2012.

In [27]:
df.isnull().sum()

month                      0
town                       0
flat_type                  0
block                      0
street_name                0
storey_range               0
floor_area_sqm             0
flat_model                 0
lease_commence_date        0
remaining_lease        52203
resale_price               0
dtype: int64

In [28]:
df[df['remaining_lease'].isnull()]['lease_commence_date'].unique()

array([1986, 1980, 1984, 1981, 1978, 1979, 1985, 1977, 1976, 1982, 2001,
       2003, 1996, 2002, 2006, 1972, 1988, 1983, 1975, 1987, 1993, 2000,
       1997, 2005, 1989, 2010, 1990, 1992, 1998, 2004, 1969, 1970, 1973,
       2008, 2009, 1999, 2007, 1974, 1994, 1995, 1971, 1967, 1991, 1968,
       1966, 2012, 2011], dtype=int64)
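One possible way to impute the missing values is sketched below. It assumes a standard 99-year HDB lease (`LEASE_YEARS` is an assumption, not something stated in the dataset) and works at year granularity only, because lease_commence_date holds just a year. The helper `lease_to_months` and the column `remaining_lease_months` are hypothetical names introduced for illustration, and the small `sample` frame stands in for `df`.

```python
import re
import pandas as pd

LEASE_YEARS = 99  # assumption: standard HDB lease length, not taken from the data

def lease_to_months(text):
    """Convert strings such as '61 years 04 months' to a month count (NaN if missing)."""
    if pd.isna(text):
        return float("nan")
    years = re.search(r"(\d+)\s*year", text)
    months = re.search(r"(\d+)\s*month", text)
    total = (int(years.group(1)) if years else 0) * 12
    total += int(months.group(1)) if months else 0
    return float(total)

# Small stand-in for df; the last row mimics a record without remaining_lease
sample = pd.DataFrame({
    "month": ["2017-01", "2017-01", "2014-06"],
    "lease_commence_date": [1979, 1980, 1986],
    "remaining_lease": ["61 years 04 months", "62 years 01 month", None],
})

# Parse the values that are present
sample["remaining_lease_months"] = sample["remaining_lease"].map(lease_to_months)

# Coarse estimate for the missing ones: lease length minus the flat's age at sale
sale_year = pd.to_datetime(sample["month"], format="%Y-%m").dt.year
estimate = (LEASE_YEARS - (sale_year - sample["lease_commence_date"])) * 12
sample["remaining_lease_months"] = sample["remaining_lease_months"].fillna(estimate)
```

Because the estimate ignores the month in which the lease started, it can be off by up to a year; comparing it against rows where remaining_lease is known would show whether that is acceptable.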

### 3. Data Preparation

#### Selecting data
We will select all columns except block and street_name, since the exact address may not add value to this analysis.

#### Cleaning data
We will implement the following cleaning strategy: 
- convert the transaction "month" and "lease_commence_date" columns to the datetime datatype
- convert the "flat_type" column to discrete data
- transform "remaining_lease" to months
- convert "town", "storey_range", and "flat_model" to the categorical datatype
- keep "floor_area_sqm" and "resale_price" as the float datatype
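The datatype conversions in this list could be sketched as follows. This is a minimal illustration on a two-row stand-in for `df`, not the final cleaning code; in particular, treating flat_type as an ordered categorical whose order is the sorted label order is an assumption about what "discrete data" should mean here.

```python
import pandas as pd

# Two-row stand-in for df with only the columns being converted
sample = pd.DataFrame({
    "month": ["2017-01", "2017-02"],
    "town": ["ANG MO KIO", "BEDOK"],
    "flat_type": ["2 ROOM", "3 ROOM"],
    "storey_range": ["10 TO 12", "01 TO 03"],
    "flat_model": ["Improved", "New Generation"],
    "lease_commence_date": [1979, 1978],
})

# Transaction month and lease commencement year as datetime
sample["month"] = pd.to_datetime(sample["month"], format="%Y-%m")
sample["lease_commence_date"] = pd.to_datetime(
    sample["lease_commence_date"].astype(str), format="%Y"
)

# flat_type as an ordered categorical (assumption: sorted labels give a sensible order)
flat_type_order = sorted(sample["flat_type"].unique())
sample["flat_type"] = pd.Categorical(
    sample["flat_type"], categories=flat_type_order, ordered=True
)

# Remaining text columns as unordered categoricals
for col in ["town", "storey_range", "flat_model"]:
    sample[col] = sample[col].astype("category")
```

The integer codes of the ordered categorical are available through `sample["flat_type"].cat.codes` if a numeric encoding is needed later.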

In [24]:
df.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0


#### Constructing data
Other variables to be populated: 
- split the transaction month into separate year and month columns
- (wishlist) add the distance to MRT stations, malls, and well-known schools
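Splitting the transaction month into year and month parts might look like the sketch below. The column names `sale_year` and `sale_month` are hypothetical, and `sample` stands in for `df`.

```python
import pandas as pd

# Stand-in for df with the "YYYY-MM" strings seen in the month column
sample = pd.DataFrame({"month": ["2017-01", "2020-09"]})

# Parse once, then derive the two new columns from the datetime accessor
parsed = pd.to_datetime(sample["month"], format="%Y-%m")
sample["sale_year"] = parsed.dt.year
sample["sale_month"] = parsed.dt.month
```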


#### Integrating data
The dataset was constructed from 3 files from the data.gov.sg site:
1. 'resale-flat-prices-based-on-registration-date-from-jan-2017-onwards'
2. 'resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016'
3. 'resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014'
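A minimal sketch of how three such files could be combined with `pd.concat` is shown below. The small frames stand in for the downloaded files (with real files, each would come from `pd.read_csv` on a local path, which is not shown in this notebook). The assumption that the oldest file lacks the remaining_lease column is a guess, although it would be consistent with the 52,203 missing values seen earlier.

```python
import pandas as pd

# Stand-ins for the three source files, newest first
part_2017 = pd.DataFrame({
    "month": ["2017-01", "2017-01"],
    "remaining_lease": ["61 years 04 months", "60 years 07 months"],
    "resale_price": [232000.0, 250000.0],
})
part_2015 = pd.DataFrame({
    "month": ["2015-01"],
    "remaining_lease": ["70"],
    "resale_price": [255000.0],
})
# Assumption: the oldest file has no remaining_lease column at all
part_2012 = pd.DataFrame({
    "month": ["2012-03", "2012-03"],
    "resale_price": [250000.0, 265000.0],
})

# Stack the frames; columns missing from a file are filled with NaN
combined = pd.concat([part_2017, part_2015, part_2012], sort=False)
```

Each file keeps its own row numbers here, so index labels repeat across files. Passing `ignore_index=True` to `pd.concat` would give a single unique running index instead.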

#### Formatting data
See "Cleaning data" above.

### 4. Modelling

To predict the price from the features, we may start with a linear regression model.
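As a rough illustration of the idea, the sketch below fits an ordinary least squares line of resale_price on floor_area_sqm using `numpy.linalg.lstsq` rather than a dedicated modelling library. It uses only the five rows shown by `df.head()`, so the fitted numbers are not meaningful; a real model would use the full dataset, more features, and a train/test split.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(); far too few for a real model
sample = pd.DataFrame({
    "floor_area_sqm": [44.0, 67.0, 67.0, 68.0, 67.0],
    "resale_price": [232000.0, 250000.0, 262000.0, 265000.0, 265000.0],
})

# Design matrix with an intercept column followed by the single feature
X = np.column_stack([np.ones(len(sample)), sample["floor_area_sqm"].to_numpy()])
y = sample["resale_price"].to_numpy()

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# In-sample predictions and root mean squared error
predicted = X @ coef
rmse = float(np.sqrt(np.mean((y - predicted) ** 2)))
```

The slope is the estimated price change per additional square metre, which relates directly to the business question about flat size.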

### 5. Evaluation 
### 6. Deployment