# Property Prices Prediction Model


In [None]:
if __name__ == '__main__':
  from google.colab import drive
  drive.mount("/content/drive")

In [None]:
# %cd "/content/drive/MyDrive/"+ # ADD YOUR PATH HERE
# %ls

In [None]:
import numpy as np
import pandas as pd
import folium
from folium.plugins import HeatMap
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import datetime

## Dataset Description

As a partner in a property investment company, your objective is to profit from buying properties and eventually selling them. To do this, you need a reliable price prediction model trained on historical property transactions. Comparing the model's predicted prices against prevailing asking prices helps you identify properties whose future sale is likely to bring in a profit.

The private housing dataset for the Hong Kong Island areas of Central, Sheung Wan and Sai Wan is an Excel Comma Separated Value (CSV) text file, which includes 29 features and 1139 observations. Each observation represents the sale of a home, and each feature is an attribute describing the home or the circumstances of the sale.

The features include string, boolean, categorical and numerical variables. These features are as follows:

| Field Label | Field description | Data format | Remark |
| --- | --- | --- | --- |
| Reg_Date | Register date in Land Registry | DD/MM/YYYY |  |
| Reg_Year | Register year | YYYY | Extract from Reg_Date field |
| Prop_Name_ENG | Name of the property | String |  |
| Address_ENG | Address of the property | String |  |
| Prop_Type | Type of the property | Boolean (Single/ Estate) |  |
| Tower | Tower number | String |  |
| Floor | Floor of the property | Numeric |  |
| Floor_(H/M/L) | Floor category (H/M/L) | Category (H/M/L) | Floor 0 to 15 => L, Floor 16 to 34 => M, Floor higher than 35 => H |
| Flat | Flat of the property | String |  |
| Bed_Room | No. of bedrooms in the property | Numeric |  |
| Roof | Roof included? | Boolean (Y/N) |  |
| Build_Ages | Building age at registration | String |  |
| Rehab_Year | Year of building rehabilitation schemes | Numeric | [Website](https://www.ura.org.hk/en/project/rehabilitation/building-rehabilitation-schemes) |
| Eff_Build_Age | Effective building age that considers the building rehabilitation schemes | String | If URA is 0, Eff_Age = Build_Ages. If URA is not 0, (Reg_Year)+{[Build_Age-(Reg_Year-URA)]*2/3} |
| SalePrice_10k | Sale Price | Numeric |  |
| SaleableArea | Saleable area of the property | Numeric |  |
| Gross Area | Gross area of the property | Numeric |  |
| SaleableAreaPrice | Price per saleable area | Numeric |  |
| Gross Area_Price | Price per gross area | Numeric |  |
| Kindergarten | No. of kindergartens near the property | Numeric |  |
| Primary_Schools | No. of primary schools near the property | Numeric |  |
| Secondary_Schools | No. of secondary schools near the property | Numeric |  |
| Parks | No. of public parks near the property | Numeric |  |
| Library | No. of libraries near the property | Numeric |  |
| Bus_Route | No. of bus stations near the property | Numeric |  |
| Mall | No. of shopping malls near the property | Numeric | Shun Tak Centre, Western Market, Infinitus Plaza |
| Wet Market | No. of wet markets near the property | Numeric |  |
| Latitude | The latitude of the property | Numeric | [Website](http://www.mapdevelopers.com/batch_geocode_tool.php) |
| Longitude | The longitude of the property | Numeric | [Website](http://www.mapdevelopers.com/batch_geocode_tool.php) |
| Edu_Inst | The education institution near the property | Numeric | Sum of Kindergarten, Primary_Schools and Secondary_Schools |
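
Two of the derived fields in the table follow simple rules that can be written out directly. The sketch below is illustrative only: the helper names `floor_category` and `edu_inst` are hypothetical and not part of the dataset code, and it assumes floor 35 itself counts as high, which the remark in the table leaves open.

```python
def floor_category(floor: int) -> str:
    """Map a floor number to the L/M/H category described for Floor_(H/M/L)."""
    if floor <= 15:
        return "L"
    if floor <= 34:
        return "M"
    # Assumption: floor 35 is treated as high, like the floors above it.
    return "H"


def edu_inst(kindergarten: int, primary_schools: int, secondary_schools: int) -> int:
    """Edu_Inst is the sum of the three school counts."""
    return kindergarten + primary_schools + secondary_schools


print(floor_category(8), floor_category(20), floor_category(40))
print(edu_inst(2, 1, 3))
```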



Relevant information on the dataset can be retrieved by running the code below:

In [None]:
# load the Hong Kong Properties dataset 
pd.set_option('display.max_columns', None)
dataframe = pd.read_csv('HKProp_Dataset.csv')
dataframe.head()

You can check the shape of the dataframe we used in this lab. The correct shape should be `(1100, 28)`.

In [None]:
# print the shape for the dataframe
print(dataframe.shape)

## Visualization of the Sale Price distribution

Before modeling, we first visualize the data to better understand its distribution. The target value for this analysis is the sale price, stored in the `'SalePrice_10k'` column of the dataframe. We use each property's sale price to set the radius of its circle marker and plot the markers on a map. Different colors mark sale prices in `(0,600)`, `[600,1000]`, and `(1000,+∞)`.

You can explore the sale price directly on the map. By clicking the circle on the map, you can see the name, type, and floor of different properties. 

In [None]:
# Visualize the sale price
hongkong_map = folium.Map(location=[22.281442,114.152991],
                        zoom_start=15,
                   tiles="cartodbpositron")

for i in range(len(dataframe)):
    lat = dataframe['Latitude'].iloc[i] 
    long = dataframe['Longitude'].iloc[i] 
    SalePrice_10k = dataframe['SalePrice_10k'].iloc[i]
    radius = min(SalePrice_10k/100, 20)

    if SalePrice_10k > 1000:
        color = "#008080"  # teal: high price
    elif SalePrice_10k < 600:
        color = "#9BCD9B"  # light green: low price
    else:
        color = "#9C9C9C"  # grey: mid-range price
    
    popup_text = """Name : {}<br>
                Type : {}<br>
                Floor : {}<br>"""
    
    popup_text = popup_text.format(
        dataframe['Prop_Name_ENG'].iloc[i], 
        dataframe['Prop_Type'].iloc[i], 
        dataframe['Floor'].iloc[i]
        )

    folium.CircleMarker(location = [lat, long], popup= popup_text,radius = radius, color = color, fill = True).add_to(hongkong_map)

hongkong_map
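
The color and radius rules inside the loop can be pulled out into small helpers, which makes the bin boundaries easy to check without drawing a map. This is a minimal sketch: `price_color` and `marker_radius` are hypothetical names, and the thresholds and hex codes are copied from the cell above.

```python
def price_color(sale_price_10k: float) -> str:
    """Return the marker color for a sale price given in units of 10k HKD."""
    if sale_price_10k > 1000:
        return "#008080"  # teal: above 1000
    if sale_price_10k < 600:
        return "#9BCD9B"  # light green: below 600
    return "#9C9C9C"      # grey: 600 to 1000, both ends included


def marker_radius(sale_price_10k: float, cap: float = 20) -> float:
    """Scale the radius with price, capped so expensive sales do not cover the map."""
    return min(sale_price_10k / 100, cap)


for price in (450, 600, 1000, 3500):
    print(price, price_color(price), marker_radius(price))
```

Note that prices of exactly 600 and exactly 1000 both fall in the middle bin.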

Another direct way to see where sales concentrate is a heatmap, here weighted by sale price. You can zoom in or out to explore the spatial distribution of these sales.

In [None]:
num = len(dataframe)
lat = np.array(dataframe["Latitude"][0:num])
lon = np.array(dataframe["Longitude"][0:num])
sale_price = np.array(dataframe["SalePrice_10k"][0:num], dtype=float)  # heatmap weight

# each heatmap point is [latitude, longitude, weight]
data1 = [[lat[i], lon[i], sale_price[i]] for i in range(num)]

hongkong_map = folium.Map(location=[22.285442,114.152991],
                        zoom_start=16) 
HeatMap(data1).add_to(hongkong_map) 
folium.TileLayer('cartodbpositron').add_to(hongkong_map)

hongkong_map

After a glance at the dataset distribution, you should have a rough idea of how sale prices relate to location. Next we preprocess and combine the features in the dataset to construct the training and validation data `X_trainval, y_trainval` and the testing data `X_test, y_test` for training and evaluating our MLP model. **Please DO NOT modify the code**. The correct shapes are `(1000, 166) (100, 166) (1000,) (100,)` for `X_trainval, X_test, y_trainval, y_test`, respectively.

In [None]:
# Code for generating input data and ground truth labels

y = np.array(dataframe['SalePrice_10k'])

# replace cells equal to '--' with '0' and cells equal to '$' with an empty string
# (these are whole-cell matches, not substring replacements)
dataframe = dataframe.replace('--', '0').replace('$', '')  
Reg_Date = pd.to_datetime(np.array(dataframe['Reg_Date']),infer_datetime_format=True)
Days_to_reg_date = [int(t.days) for t in (Reg_Date - datetime.datetime.today())]
Bedroom = dataframe['Bed_Room'].replace('Studio', '0').fillna(0)
Is_studio = [1 if e == 0 else 0 for e in np.array(dataframe['Bed_Room'])] 

SaleableArea = [int(str(t).replace(',', '').replace('nan', '0')) for t in dataframe['SaleableArea']]
SaleableAreaPrice = [int(str(t).replace(',', '').replace('$', '0').replace('nan', '0')) for t in dataframe['SaleableAreaPrice']]
GrossArea = [int(str(t).replace(',', '').replace('nan', '0')) for t in dataframe['Gross Area']]
GrossAreaPrice = [int(str(t).replace(',', '').replace('$', '0').replace('nan', '0')) for t in dataframe['Gross Area_Price']]

X = np.concatenate([np.array(Days_to_reg_date).reshape(-1, 1), np.array(Bedroom).reshape(-1, 1),
                        np.array(Is_studio).reshape(-1, 1), np.array(SaleableArea).reshape(-1, 1),
                        np.array(GrossAreaPrice).reshape(-1, 1),
                        np.array(GrossArea).reshape(-1, 1),
                        np.array(SaleableAreaPrice).reshape(-1, 1),
                        np.array(pd.get_dummies(dataframe[['Flat', 'Prop_Type', 'Tower', 'Roof']])),
                       np.array(dataframe[['Floor', 'Build_Ages', 'Rehab_Year', 'Kindergarten', 'Primary_Schools', 'Secondary_Schools',
 'Parks', 'Library', 'Bus_Route', 'Mall', 'Wet Market', 'Latitude', 'Longitude']].fillna(0))], axis=1)

X_trainval, X_test, y_trainval, y_test = X[:1000], X[1000:], y[:1000], y[1000:]
print(X_trainval.shape, X_test.shape, y_trainval.shape, y_test.shape)
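
The area and price columns are cleaned with a chain of string replacements before conversion to integers. The sketch below isolates that step in a hypothetical helper, `to_int`, using the same replacements as the `SaleableAreaPrice` line above. It only handles integer-like text; a value such as `'450.5'` would raise a `ValueError`.

```python
def to_int(value) -> int:
    """Convert a raw cell such as '$12,345' or NaN to an integer."""
    text = str(value)
    text = text.replace(',', '')     # drop thousands separators
    text = text.replace('$', '0')    # a leading '$' becomes a harmless leading zero
    text = text.replace('nan', '0')  # missing values become 0
    return int(text)


print(to_int('$12,345'))
print(to_int('1,139'))
print(to_int(float('nan')))
```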


In [None]:
def shuffle_data_numpy(X, y, numpy_seed):
    # fix the random seed
    np.random.seed(numpy_seed)

    num_sample = X.shape[0]

    order = np.random.choice(num_sample, num_sample, replace = False)

    X_shuffle = X[order]
    y_shuffle = y[order]

    return X_shuffle, y_shuffle

def train_val_split(X_trainval, y_trainval, train_size, numpy_seed):
    X_shuffle, y_shuffle = shuffle_data_numpy(X_trainval, y_trainval, numpy_seed)

    X_train = X_shuffle[ : train_size, : ]
    y_train = y_shuffle[ : train_size]

    X_val = X_shuffle[train_size : , : ]
    y_val = y_shuffle[train_size : ]

    return X_train, X_val, y_train, y_val

In [None]:
X_train, X_val, y_train, y_val = train_val_split(X_trainval, y_trainval, 700, 42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

In [None]:
# No additional import allowed
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def MyModel(num_dense_layer, dense_layer_unit, input_dim, dropout_ratio):
    # Create a sequential model
    model = Sequential()

    for i in range(num_dense_layer):
        model.add(Dense(units = dense_layer_unit, activation = "relu", kernel_initializer="uniform"))
        model.add(Dropout(dropout_ratio))

    model.add(Dense(units = 1, activation = "linear", kernel_initializer="uniform"))

    model.build(input_shape=(None, input_dim))

    return model

In [None]:
# Keep these default settings for the model you submit to ZINC!
num_dense_layer = 2
dense_layer_unit = 40
input_dim = len(X_train[0])
dropout_ratio = 0

model = MyModel(num_dense_layer, dense_layer_unit, input_dim, dropout_ratio)
model.summary()
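
As a cross-check on the output of `model.summary()`, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The sketch below uses hypothetical helper names and assumes `input_dim = 166`, the feature count stated for `X_trainval`.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(input_dim: int, num_dense_layer: int, dense_layer_unit: int) -> int:
    """Total parameters of the hidden layers plus the single-unit output layer."""
    total = 0
    n_in = input_dim
    for _ in range(num_dense_layer):
        total += dense_params(n_in, dense_layer_unit)
        n_in = dense_layer_unit
    total += dense_params(n_in, 1)  # linear output layer
    return total


# 166*40 + 40 = 6680, 40*40 + 40 = 1640, 40*1 + 1 = 41
print(mlp_params(166, 2, 40))  # 8361
```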

In [None]:
from tensorflow.keras.optimizers import Adam

def MyModel_Training(model, X_train, y_train, X_val, y_val, batchsize, train_epoch):
    adam_optimizer = Adam(learning_rate = 1e-3)
    
    model.compile( 
        optimizer = adam_optimizer,
        loss = "mean_squared_error",
        metrics = ["mae"]
    )

    history = model.fit(X_train, y_train, epochs = train_epoch, validation_data = (X_val, y_val), batch_size = batchsize, validation_batch_size = batchsize)

    return history, model

model = MyModel(num_dense_layer, dense_layer_unit, input_dim, dropout_ratio)

batchsize = 8
train_epoch = 50

history, model = MyModel_Training(model, X_train, y_train, X_val, y_val, batchsize, train_epoch)
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=1)
print(f'Test Mean Absolute Error (MAE): {test_mae}')

In [None]:
model.save('./mlp_model.keras')

You can load the model you just saved and check the mean absolute error again.

In [None]:
from tensorflow.keras.models import load_model
mlp_model = load_model("./mlp_model.keras")
test_loss, test_mae = mlp_model.evaluate(X_test, y_test, verbose=1)
print(f'Test Mean Absolute Error (MAE): {test_mae}')
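
MAE stands for mean absolute error: the average of `|y_true - y_pred|` over the samples, in the same unit as the target (10k HKD here). The training loss, mean squared error, averages the squared differences instead, so it penalizes large misses more heavily. The sketch below computes both in plain Python on made-up numbers, not on values from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values in units of 10k HKD (illustrative only)
y_true = [500.0, 800.0, 1200.0]
y_pred = [450.0, 900.0, 1100.0]

print(mean_absolute_error(y_true, y_pred))  # (50 + 100 + 100) / 3
print(mean_squared_error(y_true, y_pred))   # (2500 + 10000 + 10000) / 3
```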

### Sanity Check of your MLP Training

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['mae'], label='Training mae')
plt.plot(history.history['val_mae'], label='Validation mae')
plt.xlabel('Epochs')
plt.ylabel('MAE')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

### How accurately can your model predict the sale price? Let's check!
Here we plot the predicted sale price (y-axis) against the ground truth sale price (x-axis). If most of the points lie near the line `y=x`, your MLP model predicts property prices well.

In [None]:
test_predictions = model.predict(X_test).flatten()

plt.figure(figsize=(6, 6))
plt.scatter(y_test, test_predictions)
plt.xlabel('Ground Truth Values for Sale Price (10k HKD)')
plt.ylabel('Predictions for Sale Price (10k HKD)')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 3000], [-100, 3000])
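
One way to put a number on how close the points are to the line `y=x` is to count the share of predictions that land within a relative tolerance of the true price. This is only a sketch of one possible check, not part of the lab: `fraction_within` is a hypothetical helper, the 10% tolerance is an arbitrary choice, and the values are made up.

```python
def fraction_within(y_true, y_pred, rel_tol=0.10):
    """Share of predictions whose error is at most rel_tol of the true value."""
    hits = sum(
        1 for t, p in zip(y_true, y_pred)
        if abs(p - t) <= rel_tol * abs(t)
    )
    return hits / len(y_true)


# Toy values in units of 10k HKD (illustrative only)
y_true = [500.0, 800.0, 1200.0, 2000.0]
y_pred = [520.0, 700.0, 1250.0, 2600.0]

print(fraction_within(y_true, y_pred))  # 2 of the 4 predictions are within 10%
```

With the notebook's variables, the same function could be called as `fraction_within(y_test, test_predictions)`.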