# Final Project

For the final project we are going to use a well-known dataset taken from Kaggle, namely the Melbourne Housing Dataset.  

It contains various information about houses and apartments in the Melbourne region (Australia). The goal is to predict the price of each property.  
The dataset can be found in the data folder (under the name melb_data.csv), while a more detailed description is available on the Kaggle page: [Melbourne Housing Snapshot](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot)

In [1]:
import numpy as np
import random
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV

Set seed for reproducibility

In [2]:
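def set_seed(seed=666):
    """Seed the NumPy and Python random number generators used by the script"""

    # Seed the global generators; a bare np.random.RandomState(seed) call
    # only builds an object that is discarded, so it is not needed here
    np.random.seed(seed)
    random.seed(seed)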
    
set_seed()

In [3]:
# Read the data
data = pd.read_csv('./data/melb_data.csv')
data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


### Columns to use for prediction

For simplicity, we select only numerical variables to be used in the tree-based Machine Learning models.  
If you are eager to try out slightly more complicated code, feel free to also use categorical variables with the proper encoding (Label Encoding for ordinal variables, CatBoost Encoding for nominal ones).
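As a minimal sketch of the optional encoding step, the snippet below label-encodes an ordered categorical column using pandas only. The `Size` column and its levels are hypothetical and do not come from melb_data.csv. For nominal columns, one option (assuming the third-party `category_encoders` package is installed) is its `CatBoostEncoder` class, which is not shown here.

```python
import pandas as pd

# Toy data: "Size" is a hypothetical ordinal column, not part of melb_data.csv
toy = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Declare the natural order of the levels explicitly
size_order = ["small", "medium", "large"]

# Each level is mapped to its position in size_order
toy["Size_encoded"] = pd.Categorical(
    toy["Size"], categories=size_order, ordered=True
).codes

print(toy)
```

Passing the order explicitly matters: without it, the levels would be sorted alphabetically and the integer codes would no longer reflect the ranking.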

In [4]:
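# Select a subset of predictors, plus the target column Price
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt', 'Price']

# Keep only those columns and delete the rows containing NaN values
# (reassigning avoids an in-place drop on a slice of the original DataFrame)
data = data[cols_to_use].dropna(axis=0)
print(data.shape)

# Separate the target from the predictors and express it in thousands
y = data.pop('Price') / 1000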

(6858, 6)


In [5]:
data.head()

| | Rooms | Distance | Landsize | BuildingArea | YearBuilt |
|---|---|---|---|---|---|
| 1 | 2 | 2.5 | 156.0 | 79.0 | 1900.0 |
| 2 | 3 | 2.5 | 134.0 | 150.0 | 1900.0 |
| 4 | 4 | 2.5 | 120.0 | 142.0 | 2014.0 |
| 6 | 3 | 2.5 | 245.0 | 210.0 | 1910.0 |
| 7 | 2 | 2.5 | 256.0 | 107.0 | 1890.0 |


In [6]:
# Separate data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=42)

## Train the Models

Here you are required to train at least one of the three models we talked about during the course: Decision Tree, Random Forest and Gradient Boosting.   
Regarding Gradient Boosting, feel free to choose the implementation that suits you best, between XGBoost and CatBoost.  

Your goal is to achieve the lowest possible MSE (Mean Squared Error) on the test set for the variable Price. To do this, you are required to choose the best hyperparameters for the models, using the techniques already discussed when needed, such as Grid Search or Early Stopping.  
To assess the MSE, use the mean_squared_error function from the sklearn.metrics module. Its syntax is:  
*mean_squared_error(y_test, model.predict(X_test))*
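To make the metric concrete, here is a small sketch that evaluates `mean_squared_error` on hand-made arrays and checks it against the formula computed with NumPy. The numbers are hypothetical and only mimic a target expressed in thousands, as `y` is above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices, in thousands
y_true = np.array([1035.0, 1465.0, 1600.0])
y_pred = np.array([1000.0, 1500.0, 1500.0])

mse = mean_squared_error(y_true, y_pred)
manual_mse = np.mean((y_true - y_pred) ** 2)  # same quantity computed by hand
rmse = np.sqrt(mse)                           # back on the scale of the target

print(mse, manual_mse, rmse)
```

Because the MSE is in squared units, its square root (the RMSE) is usually easier to read: it is on the same scale as the target itself.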

In [7]:
from sklearn.metrics import mean_squared_error

**Remember**: Since the Y variable (the variable to be predicted) is numerical, namely the Price, we are in a Regression framework. We already know that Tree models work in the same way for categorical and continuous Y variables; just keep in mind that you should use the regression implementation of each model.  

This means you should import the following objects:  

In [8]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor, Pool
from xgboost import XGBRegressor

*The implementation is the same; there are only minor changes in the defaults (e.g. the default split criterion for Regression is variance reduction, i.e. squared error, while for Classification it is Gini)*   
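One quick way to check these defaults, sketched here for the scikit-learn decision trees only, is to read the `criterion` attribute of freshly created estimators. The exact name printed for the regressor depends on the installed scikit-learn version.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Default split criterion of each estimator
reg_criterion = DecisionTreeRegressor().criterion
clf_criterion = DecisionTreeClassifier().criterion

print("Regression default:    ", reg_criterion)
print("Classification default:", clf_criterion)
```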

Remember also to set the proper Loss Function (for Regression it is suggested to use Mean Squared Error).  

In practice, pay attention to the following:

* In GridSearchCV, the scoring parameter should be **"neg_mean_squared_error"**
* In CatBoost, the eval_metric should be **"RMSE"** (and the parameters regarding class weights are not needed, since we no longer have classes for the Y variable)
* In XGBoost, the default objective parameter is already **"reg:squarederror"**, while the eval_metric parameter should be set to **"rmse"**; there is no need for parameters regarding the class weight (such as scale_pos_weight)  
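The first point can be sketched as follows, using a Decision Tree on synthetic data standing in for `X_train` and `y_train`. The grid values are placeholders chosen only for illustration, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for X_train / y_train)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

# Illustrative grid; the values are placeholders
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scores are maximised, so the MSE is negated
    cv=5,
)
search.fit(X_demo, y_demo)

best_cv_mse = -search.best_score_  # flip the sign to recover the MSE
print(search.best_params_, best_cv_mse)
```

Note that `best_score_` is negative by construction, which is why the sign is flipped before reporting the cross-validated MSE.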

**GOOD LUCK!**