## Contents
* [1. Baseline Model Prediction](#1.-Baseline-Model-Prediction)
    * [1.1. Imports](#1.1.-Imports)
    * [1.2. Read datasets](#1.2.-Read-datasets)
* [2. Model Preparations](#2.-Model-Preparations)
    * [2.1. Train-test split](#2.1.-Train-test-split)
    * [2.2. Scale and one-hot encode (ohe)](#2.2.-Scale-and-one-hot-encode-(ohe))
    * [2.3. Make train and test datasets whole](#2.3.-Make-train-and-test-datasets-whole)
* [3. Baseline Model Creation](#3.-Baseline-Model-Creation)
    * [3.1. Instantiate model](#3.1.-Instantiate-model)
    * [3.2. Model fitting and evaluation](#3.2.-Model-fitting-and-evaluation)
    * [3.3. Scale and ohe test.csv](#3.3.-Scale-and-ohe-test.csv)
    * [3.4. Predict and export](#3.4.-Predict-and-export)

---
## 1. Baseline Model Prediction
---
Overview of methodology:
1. Conduct appropriate model preparation steps before model fitting
    * Train-test split
    * Separate continuous and categorical features and conduct scaling and one-hot encoding respectively
    * Combine the separated features to make train and test datasets whole again
2. Instantiate a baseline linear regression model, fit it, and evaluate its $R^{2}$ score
3. Use the baseline model to predict on the blind test dataset, and export the predictions in the submission format
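
For orientation, the three steps above can be sketched end to end on a made-up toy frame. This sketch swaps in scikit-learn's `ColumnTransformer` and `Pipeline` as a compact alternative to the manual split-and-concatenate approach this notebook uses; the column values, the `town_premium` mapping and the noise level are all assumptions invented for illustration, not the HDB data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not the HDB dataset)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "floor_area_sqm": rng.uniform(40, 150, n),
    "town": rng.choice(["A", "B", "C"], n),
})
town_premium = toy["town"].map({"A": 0.0, "B": 50_000.0, "C": 100_000.0})
toy["resale_price"] = (
    3_000 * toy["floor_area_sqm"] + town_premium + rng.normal(0, 5_000, n)
)

# Step 1: train-test split, then scale continuous and one-hot encode categorical
X = toy.drop(columns=["resale_price"])
y = toy["resale_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["floor_area_sqm"]),
    ("ohe", OneHotEncoder(drop="first"), ["town"]),
])

# Step 2: fit the linear regression and evaluate R^2 on the held-out split
model = Pipeline([("prep", preprocess), ("lr", LinearRegression())])
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(round(test_r2, 3))

# Step 3: predict on unseen rows with the same fitted preprocessing
preds = model.predict(X_test.head())
```

The pipeline keeps the fitted scaler and encoder together with the model, so the same preprocessing is applied automatically at prediction time.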

### 1.1. Imports

In [78]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import r2_score

### 1.2. Read datasets

In [79]:
hdb_train_df = pd.read_csv('output/cleaned_baseline_hdb_train.csv')
hdb_test_df = pd.read_csv('output/cleaned_baseline_hdb_test.csv')
id_df = pd.read_csv('output/cleaned_hdb_test_id.csv')

### Change selected data to object/bool data type
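- These columns are either time-related (`Tranc_Year`, `Tranc_Month`) or yes/no flags (`pri_sch_affiliation`, `affiliation`), so casting them to `object`/`bool` ensures they are treated as categorical features later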

In [80]:
hdb_train_df = hdb_train_df.astype({'Tranc_Year': 'object', 'Tranc_Month': 'object'})
hdb_train_df = hdb_train_df.astype({'pri_sch_affiliation': 'bool', 'affiliation': 'bool'})
hdb_test_df = hdb_test_df.astype({'Tranc_Year': 'object', 'Tranc_Month': 'object'})
hdb_test_df = hdb_test_df.astype({'pri_sch_affiliation': 'bool', 'affiliation': 'bool'})
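
To see why the cast matters, here is a minimal sketch on a hypothetical four-column frame (the values are made up). Before the cast every column is numeric; afterwards `select_dtypes` routes the year, month and flag columns to the categorical group.

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more columns
toy = pd.DataFrame({
    "Tranc_Year": [2019, 2020, 2021],
    "Tranc_Month": [1, 6, 12],
    "floor_area_sqm": [90.0, 110.0, 67.0],
    "affiliation": [0, 1, 0],
})

# Before casting, all four columns are int64/float64
before = list(toy.select_dtypes(include=["int64", "float64"]).columns)

toy = toy.astype(
    {"Tranc_Year": "object", "Tranc_Month": "object", "affiliation": "bool"}
)

# After casting, only floor_area_sqm is left in the continuous group
categorical = list(toy.select_dtypes(include=["object", "bool"]).columns)
continuous = list(toy.select_dtypes(include=["int64", "float64"]).columns)
print(categorical)
print(continuous)
```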

---
## 2. Model Preparations
---

### 2.1. Train-test split

In [81]:
X = hdb_train_df.drop(columns=['resale_price'])
y = hdb_train_df['resale_price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test), len(y_train), len(y_test))
print(len(X_test.columns))
print(X_train.info())

112975 37659 112975 37659
49
<class 'pandas.core.frame.DataFrame'>
Int64Index: 112975 entries, 119232 to 121958
Data columns (total 49 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   town                       112975 non-null  object 
 1   floor_area_sqm             112975 non-null  float64
 2   Tranc_Year                 112975 non-null  object 
 3   Tranc_Month                112975 non-null  object 
 4   mid_storey                 112975 non-null  int64  
 5   full_flat_type             112975 non-null  object 
 6   hdb_age                    112975 non-null  int64  
 7   max_floor_lvl              112975 non-null  int64  
 8   year_completed             112975 non-null  int64  
 9   residential                112975 non-null  bool   
 10  commercial                 112975 non-null  bool   
 11  market_hawker              112975 non-null  bool   
 12  multistorey_carpark        112975 non-null  bool   


### 2.2. Scale and one-hot encode (ohe)
* separate continuous and categorical features

In [82]:
# separate column names by dtype
X_train_cat = X_train.select_dtypes(include=['object','bool']).columns
X_train_cont = X_train.select_dtypes(include=['int64', 'float64']).columns
X_test_cat = X_test.select_dtypes(include=['object','bool']).columns
X_test_cont = X_test.select_dtypes(include=['int64', 'float64']).columns

print(X_train_cat)
print(X_train_cont)

Index(['town', 'Tranc_Year', 'Tranc_Month', 'full_flat_type', 'residential',
       'commercial', 'market_hawker', 'multistorey_carpark',
       'precinct_pavilion', 'pri_sch_name', 'pri_sch_affiliation',
       'sec_sch_name', 'affiliation'],
      dtype='object')
Index(['floor_area_sqm', 'mid_storey', 'hdb_age', 'max_floor_lvl',
       'year_completed', 'total_dwelling_units', '1room_sold', '2room_sold',
       '3room_sold', '4room_sold', '5room_sold', 'exec_sold', 'multigen_sold',
       'studio_apartment_sold', '1room_rental', '2room_rental', '3room_rental',
       'other_room_rental', 'Mall_Nearest_Distance', 'Mall_Within_500m',
       'Mall_Within_1km', 'Mall_Within_2km', 'Hawker_Nearest_Distance',
       'Hawker_Within_500m', 'Hawker_Within_1km', 'Hawker_Within_2km',
       'hawker_food_stalls', 'hawker_market_stalls', 'mrt_nearest_distance',
       'bus_interchange', 'mrt_interchange', 'bus_stop_nearest_distance',
       'pri_sch_nearest_distance', 'vacancy', 'sec_sch_nearest_d

* scale and ohe the separate features

In [83]:
ss = StandardScaler()
# note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
ohe = OneHotEncoder(sparse=False, drop='first')

# scale and one-hot encode based on columns defined above
Z_train_ss = ss.fit_transform(X_train[X_train_cont])
Z_test_ss = ss.transform(X_test[X_test_cont])
Z_train_ohe = ohe.fit_transform(X_train[X_train_cat])
Z_test_ohe = ohe.transform(X_test[X_test_cat])
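
The cell above fits the scaler on the training split only and then reuses it on the test split. A minimal sketch with made-up numbers shows what that means: the test row is standardised with the training mean and standard deviation, so it does not end up centred at zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

ss_toy = StandardScaler()
z_train = ss_toy.fit_transform(train)  # learns mean=2 and std from train only
z_test = ss_toy.transform(test)        # reuses the training mean and std

print(z_train.ravel())
print(z_test.ravel())
```

Fitting on the training split alone keeps information from the test split out of the preprocessing step.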

### 2.3. Make train and test datasets whole

In [84]:
Z_train = np.concatenate([Z_train_ss, Z_train_ohe], axis=1)
Z_test = np.concatenate([Z_test_ss, Z_test_ohe], axis=1)

---
## 3. Baseline Model Creation
---

### 3.1. Instantiate model

In [85]:
lr = LinearRegression()

In [86]:
Z_train.shape

(112975, 438)
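
The 438 columns can be sanity-checked. With `drop='first'`, each categorical column contributes its number of categories minus one. Assuming every one of the 49 features falls into one of the two dtype groups, the 13 categorical columns leave 36 continuous ones, so the one-hot block should account for 438 - 36 = 402 columns. A minimal sketch of the counting rule on a hypothetical toy frame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns: 3 towns and a 2-valued flag
toy_cat = pd.DataFrame({
    "town": ["A", "B", "C", "A"],
    "commercial": [True, False, True, False],
})

ohe_toy = OneHotEncoder(drop="first")
encoded = ohe_toy.fit_transform(toy_cat).toarray()

# Each column contributes (number of categories - 1) one-hot columns
expected_cols = sum(len(cats) - 1 for cats in ohe_toy.categories_)
print(encoded.shape, expected_cols)
```

On the real data the same check would compare `Z_train_ohe.shape[1]` with the sum over `ohe.categories_`.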

### 3.2. Model fitting and evaluation

In [87]:
lr.fit(Z_train, y_train)

In [88]:
lr.score(Z_train, y_train)  # train score

0.9204702913339389

In [89]:
lr.score(Z_test, y_test)  # test R^2 is within ~0.001 of the train R^2, so no sign of overfitting

0.9191976361973623
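
For a regressor, `score` returns the coefficient of determination, $R^{2} = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on four made-up prices and compares the result with `r2_score`; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted prices
y_true = np.array([300_000.0, 400_000.0, 500_000.0, 600_000.0])
y_hat = np.array([310_000.0, 390_000.0, 520_000.0, 580_000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```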

### 3.3. Scale and ohe test.csv
* Scale the continuous features and one-hot encode the categorical features of `hdb_test_df`, reusing the scaler and encoder already fitted on the training data

In [90]:
# separate continuous and categorical features
X_hidden_cat = hdb_test_df.select_dtypes(include=['object','bool']).columns
X_hidden_cont = hdb_test_df.select_dtypes(include=['int64', 'float64']).columns

print(X_hidden_cat)
print(X_hidden_cont)

Index(['town', 'Tranc_Year', 'Tranc_Month', 'full_flat_type', 'residential',
       'commercial', 'market_hawker', 'multistorey_carpark',
       'precinct_pavilion', 'pri_sch_name', 'pri_sch_affiliation',
       'sec_sch_name', 'affiliation'],
      dtype='object')
Index(['floor_area_sqm', 'mid_storey', 'hdb_age', 'max_floor_lvl',
       'year_completed', 'total_dwelling_units', '1room_sold', '2room_sold',
       '3room_sold', '4room_sold', '5room_sold', 'exec_sold', 'multigen_sold',
       'studio_apartment_sold', '1room_rental', '2room_rental', '3room_rental',
       'other_room_rental', 'Mall_Nearest_Distance', 'Mall_Within_500m',
       'Mall_Within_1km', 'Mall_Within_2km', 'Hawker_Nearest_Distance',
       'Hawker_Within_500m', 'Hawker_Within_1km', 'Hawker_Within_2km',
       'hawker_food_stalls', 'hawker_market_stalls', 'mrt_nearest_distance',
       'bus_interchange', 'mrt_interchange', 'bus_stop_nearest_distance',
       'pri_sch_nearest_distance', 'vacancy', 'sec_sch_nearest_d

In [91]:
# scale and ohe respectively
Z_hidden_ss = ss.transform(hdb_test_df[X_hidden_cont])
Z_hidden_ohe = ohe.transform(hdb_test_df[X_hidden_cat])

In [92]:
# concat the data
Z_hidden = np.concatenate([Z_hidden_ss, Z_hidden_ohe], axis=1)
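
One thing to watch at this step: an encoder fitted with the default `handle_unknown='error'` raises if the blind set contains a category it never saw during fitting. The cells above ran, so that apparently did not happen here. The sketch below shows both behaviours on hypothetical town labels; the toy encoders omit `drop='first'` to keep the example simple.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels: "D" appears only in the blind set
train_cat = pd.DataFrame({"town": ["A", "B", "C"]})
blind_cat = pd.DataFrame({"town": ["B", "D"]})

# Default behaviour: unknown categories raise a ValueError at transform time
strict = OneHotEncoder()
strict.fit(train_cat)
try:
    strict.transform(blind_cat)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": unknown categories become an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train_cat)
encoded = lenient.transform(blind_cat).toarray()

print(raised)
print(encoded)
```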

### 3.4. Predict and export
* Predict, combine the predictions with 'id' and save as a CSV file (for Kaggle submission)

In [94]:
y_pred = lr.predict(Z_hidden)
id_df['Predicted'] = y_pred
print(id_df.head())

id_df.to_csv('output/baseline_pred.csv', index=False)

       Id  Predicted
0  114982   371978.5
1   95653   482970.5
2   40303   380298.5
3  109506   308694.0
4  100149   437106.5
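
Before submitting, it can be worth checking that the predictions line up one-to-one with the ids and that the file has the expected header. A minimal sketch, assuming the two-column `Id`, `Predicted` layout printed above; it writes to an in-memory buffer instead of a file, and the three rows are copied from that printout purely as toy values.

```python
import io

import numpy as np
import pandas as pd

# Toy ids and predictions taken from the printed head above
id_toy = pd.DataFrame({"Id": [114982, 95653, 40303]})
pred_toy = np.array([371978.5, 482970.5, 380298.5])

# One prediction per id, and no missing values
assert len(id_toy) == len(pred_toy)
assert not np.isnan(pred_toy).any()

submission = id_toy.assign(Predicted=pred_toy)

# Write to a buffer to inspect the CSV text without touching the disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])
```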
