# Motivation

In this notebook, we continue our EDA, following the data profiling and data cleaning steps.

Our main goal remains to investigate the key factors that affect HDB resale prices.

In particular, we are interested in observing:

1. Distributions of prices
2. Outliers of prices

In [None]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

sns.set()

In [None]:
# load the data we preprocessed previously to save time here
with open("../data/hdb_final", "rb") as f:
    df = pickle.load(f)

In [None]:
df.head()

In [None]:
# with a serialised (pickled) object, our data types are preserved
df.info()
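
As a rough illustration of why the pickled object keeps its data types, the sketch below round-trips a small made-up frame through pickle and through CSV. The frame `toy` and its values are assumptions for illustration only; the column names simply mirror the ones used in this notebook.

```python
import io
import pickle

import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
toy = pd.DataFrame(
    {
        "month": pd.to_datetime(["2020-01-01", "2020-02-01"]),
        "town": pd.Categorical(["BEDOK", "YISHUN"]),
        "resale_price": [350000.0, 420000.0],
    }
)

# Round trip through pickle: dtypes survive unchanged
restored = pickle.loads(pickle.dumps(toy))

# Round trip through CSV: dtypes are re-inferred from plain text
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(restored.dtypes)
print(from_csv.dtypes)
```

With CSV, the datetime and categorical columns would need to be converted again after loading, which is the extra work the pickle avoids.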

# Data visualization

## Time series - or not?

We will investigate whether prices change significantly with time (trend and seasonality).

Trend and seasonality are an issue because current prices then become a function of past prices. Formally:

$y_{t+1} = f(X, y_{t})$

where $y$ is the resale price and $X$ is the set of time-invariant factors such as location.

A clear trend and seasonality are observed. Prices generally increase over the years, with the following exceptions:

1. Prices decrease from 2018 to 2019
2. Prices spike after 2020

Impact on prediction model:

1. The general price increase (trend) suggests that we have to de-trend the data before prediction. Similarly, we have to remove the seasonality before prediction.
2. The sudden price changes affect data stationarity; that is, past data might not be representative of present data. We might want to decide which time period is most relevant for our prediction.
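
One possible way to de-trend and de-seasonalise a monthly price series is sketched below. It runs on synthetic numbers (a linear trend plus a yearly cycle), not on the HDB data, and the rolling-mean decomposition is only one of several options, chosen here as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly mean prices (made-up numbers): linear trend + yearly cycle
months = pd.date_range("2015-01-01", periods=60, freq="MS")
t = np.arange(60)
prices = pd.Series(
    400_000 + 1_000 * t + 5_000 * np.sin(2 * np.pi * t / 12), index=months
)

# Trend: centred 12-month rolling mean (NaN where the window is incomplete)
trend = prices.rolling(window=12, center=True).mean()
detrended = prices - trend

# Seasonality: average detrended value for each calendar month
seasonal = detrended.groupby(detrended.index.month).transform("mean")

# What is left after removing both trend and seasonality
residual = detrended - seasonal

print(residual.dropna().abs().max())
```

On this idealised series the residual is essentially zero; on real data it would contain the variation that the other features have to explain.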

In [None]:
fig, ax = plt.subplots()
sns.lineplot(data=df, x="month", y="resale_price", ax=ax)
plt.title("Resale price over time")
plt.xticks(rotation=-45);

Since our data is not a stationary time series, time will be an important factor in all our subsequent analysis.

## Before deep diving

It is very easy to get lost in exploratory data analysis. A dataset with just 4 columns can already be arranged in 4 x 3 x 2 x 1 = 24 different orders for investigation. Therefore, it is very important to take a step back and decide which key features you want to investigate.
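
To make the arithmetic concrete: the product 4 x 3 x 2 x 1 counts the possible orderings of 4 columns, while the number of unordered column pairs is smaller. The sketch below enumerates both, using four column names from this dataset purely as an example.

```python
from itertools import combinations, permutations
from math import comb, factorial

# Four example columns from this dataset
columns = ["town", "flat_type", "storey_min", "floor_area_sqm"]

# Every ordering of the four columns
orderings = list(permutations(columns))

# Every unordered pair of columns
pairs = list(combinations(columns, 2))

print(len(orderings), factorial(len(columns)))
print(len(pairs), comb(len(columns), 2))
```

Either way, the count grows quickly with the number of columns, which is why we narrow the investigation first.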

We highlight the key features we want to investigate in our EDA; you might want to deep dive into certain areas if you have hypotheses about the data.

Before we investigate features, it is very important to understand whether there is data drift (past data is not representative of current data). One easy way to investigate data drift is to look at the data as a time series.
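
A minimal sketch of such a drift check is shown below: it compares the mean price per year and the relative change from one available year to the next. The frame `toy` and its prices are made up; only the column names `month` and `resale_price` come from this notebook.

```python
import pandas as pd

# Made-up transactions for illustration
toy = pd.DataFrame(
    {
        "month": pd.to_datetime(
            [
                "2018-03-01", "2018-09-01",
                "2019-03-01", "2019-09-01",
                "2021-03-01", "2021-09-01",
            ]
        ),
        "resale_price": [400_000, 420_000, 390_000, 410_000, 500_000, 520_000],
    }
)

# Mean price per calendar year
yearly_mean = toy.groupby(toy["month"].dt.year)["resale_price"].mean()

# Relative change versus the previous available year
yearly_change = yearly_mean.pct_change()

print(yearly_mean)
print(yearly_change)
```

A large jump between periods is a hint that older data may not be representative of the current market.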

Our main focus for EDA is to prepare for predictive modeling. Therefore, we will investigate the following aspects of each data column:

1. Do we have sufficient coverage for each category/range of data?
2. Do we need to consider each category/range of data in order to predict resale price?

These two aspects will greatly affect our choice of features:

1. If there is insufficient coverage of data, we have to either cluster similar entities and predict the group price, or apply techniques that handle imbalanced data (undersampling, oversampling, SMOTE, etc.).
2. If the average sales price is consistent across the categories/ranges of a column, there might be no need to include that column in the model. Furthermore, understanding the distribution of mean prices might be useful for determining the clusters.
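
As a simple stand-in for the clustering idea in point 1, the sketch below lumps every category with too few records into a single `"OTHER"` bucket. The town counts and the threshold `min_count` are made up for illustration; this is a simpler technique than clustering or SMOTE, not a replacement for them.

```python
import pandas as pd

# Made-up town labels; the counts are for illustration only
towns = pd.Series(
    ["BEDOK"] * 5 + ["YISHUN"] * 4 + ["BUKIT TIMAH"] + ["MARINE PARADE"],
    name="town",
)

# Hypothetical threshold: categories with fewer records are considered rare
min_count = 2
counts = towns.value_counts()
rare = counts[counts < min_count].index

# Keep frequent labels, replace every rare label with "OTHER"
towns_grouped = towns.where(~towns.isin(rare), "OTHER")

print(towns_grouped.value_counts())
```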

To answer these two questions, we will plot the following graphs for each data column. Since we know from the previous section that the data is not stationary, we have to extend our investigation across time.

1. Bar plot showing counts for each category/range of data
2. Line plot showing counts for each category/range of data across time
3. Bar plot showing mean sales price for each category/range of data
4. Line plot showing mean sales price for each category/range of data across time
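
The four plots above correspond to four aggregations of the data. A minimal sketch of those aggregations on a made-up frame is shown below; `toy` and its values are assumptions, while the column names follow this notebook.

```python
import pandas as pd

# Made-up transactions for illustration
toy = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"]
        ),
        "flat_type": ["3 ROOM", "4 ROOM", "3 ROOM", "4 ROOM", "4 ROOM"],
        "resale_price": [300_000, 400_000, 310_000, 420_000, 440_000],
    }
)

# 1. Counts per category
count_by_cat = toy["flat_type"].value_counts()

# 2. Counts per category across time
count_by_cat_month = toy.groupby(["month", "flat_type"]).size()

# 3. Mean price per category
mean_by_cat = toy.groupby("flat_type")["resale_price"].mean()

# 4. Mean price per category across time
mean_by_cat_month = toy.groupby(["month", "flat_type"])["resale_price"].mean()

print(count_by_cat, count_by_cat_month, mean_by_cat, mean_by_cat_month, sep="\n\n")
```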

Now that we have settled on our visualizations, let's write some helper functions to draw the plots.

In [None]:
def bar_count(col: str, ax = None, legend="auto", data=None) -> None:
    """Plot a bar chart of transaction counts for each category/range of `col`."""
    # copy at call time so that neither df nor a shared default is ever mutated
    data = (df if data is None else data).copy()

    # remember whether we own the figure; ax is always set after this block
    show = ax is None
    if show:
        # if no axis provided, create one
        fig, ax = plt.subplots()

    # object dtype is treated as categorical; anything else is discretised
    category_data = data[col].dtype == "object"

    if not category_data:
        # if data is not categorical, we need to discretise it
        data[col] = pd.cut(data[col].values, bins=10)

    data = data[col].value_counts(normalize=False)

    if category_data:
        # categorical data: order the bars by count
        data = data.sort_values()
    else:
        # binned data: keep the bins in ascending order
        data = data.sort_index()

    data.plot.barh(xlabel="total data",
                   title=f"total number of transactions by {col}",
                   ax=ax)

    if show:
        # if no axis provided, plot
        plt.show()
        
def line_count(col: str, ax = None, legend="auto", data=None) -> None:
    """Plot a line chart of monthly transaction counts for each category/range of `col`."""
    # copy at call time so that neither df nor a shared default is ever mutated
    data = (df if data is None else data).copy()

    # remember whether we own the figure; ax is always set after this block
    show = ax is None
    if show:
        # if no axis provided, create one
        fig, ax = plt.subplots()

    # object dtype is treated as categorical; anything else is discretised
    category_data = data[col].dtype == "object"

    if not category_data:
        # if data is not categorical, we need to discretise it
        data[col] = pd.cut(data[col].values, bins=10)

    data = data.groupby(["month", col], as_index=False)["resale_price"]\
        .count().rename(columns={"resale_price":"count"})

    if category_data:
        # only sort when it's categorical data
        data = data.sort_values(by="count")

    sns.lineplot(
        data = data,
        x = "month", y = "count", hue=col,
        ax=ax, legend=legend
    )

    ax.set_title(f"number of units sold across time by {col}\nlegend ordered ascending by count")
    ax.set_xlabel("")
    plt.setp(ax.get_xticklabels(), rotation=-45, horizontalalignment='right')

    if legend:
        ax.legend(loc=1, bbox_to_anchor=(1.5, 1.))

    if show:
        # if no axis provided, plot
        plt.show()
        
def bar_mean(col: str, ax = None, data=None) -> None:
    """Plot a bar chart of the mean sales price for each category/range of `col`."""
    # copy at call time so that neither df nor a shared default is ever mutated
    data = (df if data is None else data).copy()

    # remember whether we own the figure; ax is always set after this block
    show = ax is None
    if show:
        # if no axis provided, create one
        fig, ax = plt.subplots()

    # object dtype is treated as categorical; anything else is discretised
    category_data = data[col].dtype == "object"

    if not category_data:
        # if data is not categorical, we need to discretise it
        data[col] = pd.cut(data[col].values, bins=10)

    data = data.groupby(col, as_index=True)["resale_price"].mean()

    if category_data:
        # only sort when it's categorical data
        data = data.sort_values()

    data.plot.barh(xlabel="mean price",
                   title=f"mean price by {col}",
                   ax=ax)

    if show:
        # if no axis provided, plot
        plt.show()
        
def line_mean(col: str, ax = None, legend="auto", data=None) -> None:
    """Plot a line chart of the monthly mean sales price for each category/range of `col`."""
    # copy at call time so that neither df nor a shared default is ever mutated
    data = (df if data is None else data).copy()

    # remember whether we own the figure; ax is always set after this block
    show = ax is None
    if show:
        # if no axis provided, create one
        fig, ax = plt.subplots()

    # object dtype is treated as categorical; anything else is discretised
    category_data = data[col].dtype == "object"

    if not category_data:
        # if data is not categorical, we need to discretise it
        data[col] = pd.cut(data[col].values, bins=10)

    data = data.groupby(["month", col], as_index=False)["resale_price"]\
        .mean().rename(columns={"resale_price":"mean_resale_price"})

    if category_data:
        # only sort when it's categorical data
        data = data.sort_values(by="mean_resale_price")

    sns.lineplot(
        data = data,
        x = "month", y = "mean_resale_price", hue=col,
        ax=ax, legend=legend
    )

    ax.set_title(f"mean price across time by {col}\nlegend ordered ascending by mean_resale_price")
    ax.set_xlabel("")
    plt.setp(ax.get_xticklabels(), rotation=-45, horizontalalignment='right')

    if legend:
        ax.legend(loc=1, bbox_to_anchor=(1.5, 1.))

    if show:
        # if no axis provided, plot
        plt.show()
        
def plotting(col: str) -> None:
    """Draw all four plots for `col` in a single 2x2 figure."""
    fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(14, 10), gridspec_kw={"hspace": 0.4})
    bar_count(col, ax1)
    line_count(col, ax2, legend=False)
    bar_mean(col, ax3)
    line_mean(col, ax4, legend=False)
    plt.suptitle(col.upper())
    plt.show()
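
The helpers above discretise non-categorical columns with `pd.cut(..., bins=10)`. The sketch below shows what that call does on a handful of made-up floor areas; the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up floor areas in square metres
areas = pd.Series([35.0, 60.0, 67.0, 90.0, 104.0, 121.0, 150.0], name="floor_area_sqm")

# bins=10 splits the observed range into 10 equal-width intervals
binned = pd.cut(areas, bins=10)

# Count the records per interval, listed in ascending interval order
counts = binned.value_counts().sort_index()

print(counts)
```

Because the intervals are equal-width over the observed range, a few extreme values can leave many intervals nearly empty, which is worth keeping in mind when reading the count plots.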

## Location

### town

Property prices in Singapore are heavily affected by the estate. Some estates (such as Orchard) are known to command higher prices than others.


Now, it is clear that:
1. Some estates have more data than others
2. The number of units sold shows seasonality
3. The number of units sold shows a very slight trend

Impact on prediction:

1. Some towns have very few transactions, so we might not have sufficient data to capture the pattern in those towns.\
We might want to consider using clustering techniques to group those minority towns together.\
Unsupervised machine learning could be one way; alternatively, we can cluster towns by domain knowledge (e.g. rich estates, older estates, etc.).

2. The number of units sold seems to correlate with the mean resale price.\
It is possible that when most people do not want to sell their houses, the market is bad and therefore the price is low. (Alternatively, most people may not want to buy a house at this time; we are not sure whether demand or supply causes the low number of transactions.)\
We might want to consider how we could estimate the total number of transactions. One possibility is to include macro-economic factors to account for the performance of the economy.
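
One way to put a number on the relationship in point 2 is to correlate the monthly transaction count with the monthly mean price, as sketched below. The frame `toy` and its prices are made up, so the resulting value says nothing about the real data, and a correlation on its own does not tell us whether supply or demand is the cause.

```python
import pandas as pd

# Made-up transactions: later months have more sales and higher prices
toy = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2020-01-01"] * 2 + ["2020-02-01"] * 3 + ["2020-03-01"] * 4
        ),
        "resale_price": [
            400_000, 410_000,
            420_000, 430_000, 440_000,
            450_000, 460_000, 470_000, 480_000,
        ],
    }
)

# Number of transactions and mean price per month
monthly = toy.groupby("month")["resale_price"].agg(count="count", mean_price="mean")

# Pearson correlation between monthly volume and monthly mean price
corr = monthly["count"].corr(monthly["mean_price"])

print(monthly)
print(corr)
```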

In [None]:
plotting("town")

## flat_type

Observation:

1. Most of the units sold are 4-room HDB flats
2. However, the most expensive HDB flats are Multi-Gen

Impact on prediction:
1. We have very little data on both the most expensive HDB flats (Multi-Gen) and the cheapest HDB flats (1 room),\
so again clustering or some other technique is needed.
2. Clearly, different flat types command different prices.

In [None]:
plotting("flat_type")

## storey_min

Observation:
1. Most transactions occur on relatively low floors. However, we know that most HDB blocks are under 20 storeys high. There might be something special about units above the 20th storey.
2. Clearly, prices are higher for units on higher floors

Impact on prediction:
1. Clearly, we have to take storey height into consideration
2. The sparse records for higher storeys might require us to cluster the very high units together.
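
A minimal sketch of clustering the very high units together is to bin `storey_min` with explicit edges and an open-ended top bucket. The cut points and labels below are hypothetical choices for illustration, as are the storey values.

```python
import numpy as np
import pandas as pd

# Made-up minimum storeys for illustration
storey_min = pd.Series([1, 4, 7, 10, 13, 19, 25, 31, 40], name="storey_min")

# Hypothetical cut points: everything above storey 20 shares one bucket
bins = [0, 5, 10, 15, 20, np.inf]
labels = ["01-05", "06-10", "11-15", "16-20", "21+"]

# Intervals are right-inclusive by default, e.g. storey 10 falls in "06-10"
storey_group = pd.cut(storey_min, bins=bins, labels=labels)

print(storey_group.value_counts().sort_index())
```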

In [None]:
plotting("storey_min")

## floor_area_sqm

Observation:
1. Most transactions involve units with a small floor area
2. The most expensive units are those with a larger floor area

Impact on prediction:
1. Data is sparse for larger floor areas
2. Clearly, floor area affects sales price

In [None]:
plotting("floor_area_sqm")

## flat_model

Observation:
1. Most of the data is concentrated in a few flat models, while some flat models have very few records
2. The most expensive units are not the most commonly seen flat models

Impact on prediction:
1. Data is sparse for some flat models
2. Clearly, flat model affects housing prices

In [None]:
plotting("flat_model")

## lease_commence_date

Observation:
1. Most transactions involve units built in earlier years
2. The most expensive transactions involve units built in later years

Impact on prediction:
1. The data is slightly more balanced across the different lease commencement years
2. The prices are also more balanced, but there is clearly still some difference between commencement years.

In [None]:
plotting("lease_commence_date")

## remaining_lease_years

Observation:
1. Most of the transactions involve units with longer remaining leases
2. The most expensive units are also those with longer remaining leases

Impact on prediction:
1. Clear price difference across different remaining lease years
2. Relatively more balanced data across different remaining lease years.

In [None]:
plotting("remaining_lease_years")

# Conclusion

As we have observed from our data visualization, there is a clear difference in sales prices across the values of each column (lucky for us). However, more data preprocessing is still required before we can fit a model to our data, which we will do in the next notebook.