# Motivation

There are various analyses we can conduct on the same dataset. It is important to set the agenda before looking at the data so that we do not get lost along the way.

For this analysis, we are interested in understanding the key factors that affect the `resale price`.

# Retrieve data

We retrieve data from Data.gov.sg.

We could download the data manually, but here we use a custom object that downloads data from a given URL.
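
The internals of the custom `Data` object are not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch using the standard library's `urllib` and `zipfile` together with pandas. The function names `download`, `list_zip_members` and `read_csv_from_zip` are hypothetical and are not the `hdb_resale_data` API; the demonstration runs offline on a tiny in-memory archive with made-up rows.

```python
import io
import urllib.request
import zipfile

import pandas as pd


def download(url: str, dest: str) -> str:
    """Save the resource at `url` to the local path `dest` and return `dest`."""
    urllib.request.urlretrieve(url, dest)
    return dest


def list_zip_members(zip_file) -> list:
    """Return the names of the files stored inside a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        return archive.namelist()


def read_csv_from_zip(zip_file, member: str) -> pd.DataFrame:
    """Load one CSV member of a zip archive into a DataFrame."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle)


# Offline demonstration: build a small archive in memory (made-up rows).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "month,resale_price\n2017-01,232000\n2017-02,250000\n")

members = list_zip_members(buffer)
sample = read_csv_from_zip(buffer, members[0])
print(members)
print(sample)
```

`download` is defined but not called here, so the sketch does not need network access.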

In [None]:
####################
# import libraries #
####################
# patch the standard library as early as possible so that
# later imports pick up gevent's cooperative versions
import gevent.monkey
gevent.monkey.patch_all()

import numpy as np  # numeric operations
import matplotlib.pyplot as plt  # for plotting
import pandas as pd  # dataframe

from hdb_resale_data import (
    # retrieve location data
    Location,
    # custom Data object
    Data,
)

In [None]:
url = "https://data.gov.sg/dataset/7a339d20-3c57-4b11-a695-9348adfd7614/download"
data = Data(url)  # Data takes a url (string)

data.download(filename="../data/data.zip")  # save the download under the given filename
data.zip_filename(zip_file="../data/data.zip")  # display the names of the files inside the zip

In [None]:
# read the file inside the zip file
df = data.read_zip(zip_file="../data/data.zip", 
                   filename="resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

# Data profiling

We want to check for the following profiles:

1. Missing values
2. Outliers
3. Skewed data

We will keep a list of the data cleaning steps required before we carry out the cleaning itself.
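
The three checks above can be sketched with pandas on a toy DataFrame. The column values below are made up for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Made-up values purely for illustration; not taken from the HDB dataset.
toy = pd.DataFrame({
    "floor_area_sqm": [44.0, 67.0, 68.0, 70.0, 92.0, None],
    "resale_price": [232000.0, 250000.0, 262000.0, 265000.0, 270000.0, 1200000.0],
})

# 1. Missing values: number of NaN entries per column.
missing = toy.isna().sum()


# 2. Outliers: flag values further than 1.5 * IQR from the quartiles.
def iqr_outliers(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


outliers = toy.apply(iqr_outliers).sum()

# 3. Skewed data: sample skewness; a positive value indicates a long right tail.
skewness = toy.skew()

print(missing, outliers, skewness, sep="\n\n")
```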

## Data preview

Sometimes the easiest way to identify data cleaning steps is to look at the data itself.

In [None]:
df.info()

In [None]:
df.head()

Observations:

1. `month` is stored as object (string) rather than datetime
2. `remaining_lease` is stored as object (string) rather than a numeric duration
3. Location data is stored as strings, which makes it hard to compare locations meaningfully
4. `storey_range` is stored as a string, which makes it hard to compare storeys meaningfully

Let's fix the issues before proceeding.

In [None]:
##########################
# fix datetime for month #
##########################
# problem: month is in yyyy-mm, while common datetime formats require a day as well
# solution: append the string '-01' before converting to datetime
# to ensure pandas converts the strings correctly,
# we explicitly pass the format
df["month"] = pd.to_datetime(df["month"] + "-01", format="%Y-%m-%d")
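
As a quick check of the conversion above, the sketch below runs it on a few made-up `yyyy-mm` strings. It also shows an alternative that passes `format="%Y-%m"` directly, in which case the day defaults to the first of the month.

```python
import pandas as pd

months = pd.Series(["2017-01", "2017-02", "2018-12"])  # made-up sample values

# Approach used in this notebook: append a day, then parse with an explicit format.
with_day = pd.to_datetime(months + "-01", format="%Y-%m-%d")

# Alternative: parse the year-month directly; the day defaults to the 1st.
direct = pd.to_datetime(months, format="%Y-%m")

print(with_day.equals(direct))
print(direct.dt.year.tolist(), direct.dt.month.tolist())
```

Both approaches give the same timestamps on this sample, so the choice is mostly a matter of taste.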

In [None]:
####################################
# convert remaining_lease to years #
####################################
# problem: remaining_lease is a string in years and months, we want a standardised unit
# solution: convert remaining_lease to years (years = months / 12)
# we use regex to extract the years and the months;
# "months?" matches both "month" and "months"
years = df["remaining_lease"].str.extract(r"(\d+) years?", expand=False).astype("float")
months = df["remaining_lease"].str.extract(r"(\d+) months?", expand=False).astype("float")

In [None]:
# some months are missing, let's confirm these are entries that state no months
df.loc[months.isna().to_numpy().ravel(), "remaining_lease"].unique()

In [None]:
# add a new column with the remaining lease in years
# (entries without months count as 0 months, otherwise they would become NaN)
# and remove the old column
df["remaining_lease_years"] = years + months.fillna(0) / 12
df.pop("remaining_lease")
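
To illustrate the extraction logic, here is a small sketch on made-up lease strings. The string format (for example `"61 years 04 months"` or `"94 years"`) is assumed from the description above, so check it against the real column before relying on these patterns.

```python
import pandas as pd

# Made-up strings in the assumed "<y> years <m> months" format.
lease = pd.Series([
    "61 years 04 months",
    "94 years",
    "70 years 1 month",
    "65 years 11 months",
])

# One capture group each; expand=False returns a Series instead of a DataFrame.
years = lease.str.extract(r"(\d+) years?", expand=False).astype("float")
months = lease.str.extract(r"(\d+) months?", expand=False).astype("float")

# Entries without a month part are NaN after extraction; count them as 0 months.
lease_years = years + months.fillna(0) / 12

print(pd.DataFrame({"lease": lease, "years": years, "months": months, "lease_years": lease_years}))
```

Without `fillna(0)`, the `"94 years"` row would end up as `NaN` because adding `NaN` to any number gives `NaN`.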

In [None]:
#####################
# fix location data #
#####################
# problem: location data is stored as strings, so it is hard to compare different locations
# solution: retrieve the geolocation
# retrieving the geolocation is a more challenging task, so we illustrate the idea here and
# execute it in a separate script
loc = Location()
location = df["block"] + " " + df["street_name"]
links = loc.url1 + location + loc.url2
responses = loc.get_gresponse(links.iloc[:100].values)

In [None]:
# estimated total run time in hours (around 2 hrs),
# given that 100 requests took around 6 sec
6 / 100 * links.shape[0] / 60 / 60

In [None]:
# we show the first 2 results as an illustration
[loc.json_load(response) for response in responses][:2]

We will complete the location requests in our data cleaning step, as they are time-consuming.
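
The run time estimate above is a simple linear extrapolation from one timed batch. The sketch below wraps it in a hypothetical helper, `estimated_hours`, and also hints at one possible way to reduce the work: if several records share the same block and street, each unique address only needs to be requested once. The row count and addresses are made up, and whether deduplication helps depends on the actual data.

```python
def estimated_hours(n_requests: int, seconds_per_batch: float = 6.0, batch_size: int = 100) -> float:
    """Scale a measured batch time linearly to `n_requests` requests, in hours."""
    return seconds_per_batch / batch_size * n_requests / 3600


# Hypothetical addresses: resale records may repeat the same block and street.
addresses = [
    "406 ANG MO KIO AVE 10",
    "108 ANG MO KIO AVE 4",
    "406 ANG MO KIO AVE 10",
]
unique_addresses = sorted(set(addresses))

# 120,000 is an illustrative row count, not the size of the real dataset.
print(estimated_hours(120_000))
print(len(addresses), len(unique_addresses))
```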

In [None]:
####################
# fix storey_range #
####################
# problem: storey_range is stored as a string, so it is hard to make meaningful comparisons
# solution: since the storey range is numeric in nature, let's take the first (lowest) storey
df["storey_min"] = df["storey_range"].str.extract(r"(\d+) TO", expand=False)
df["storey_min"] = df["storey_min"].astype("int")
df.pop("storey_range")
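
The notebook keeps only the lowest storey of each range. As a sketch of the same idea on made-up values, the example below extracts both bounds with two capture groups; the midpoint column is an optional alternative and is not used in this analysis. The `"NN TO NN"` format is assumed from the pattern used above.

```python
import pandas as pd

storey_range = pd.Series(["10 TO 12", "01 TO 03", "04 TO 06"])  # made-up sample values

# Two capture groups return a DataFrame with one column per group.
bounds = storey_range.str.extract(r"(\d+) TO (\d+)").astype("int")
bounds.columns = ["storey_min", "storey_max"]

# Optional: the midpoint of the range as a single numeric summary.
bounds["storey_mid"] = (bounds["storey_min"] + bounds["storey_max"]) / 2

print(bounds)
```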

## Data cleaning

We have now applied the fixes identified during the data preview. The cleaned data looks as follows.

In [None]:
df.head()

In [None]:
df.info()