<a href="https://www.kaggle.com/code/galvangoh/housing-price-regression-lowest-rmse-models?scriptVersionId=105749880" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

***
# Melbourne Housing Price Dataset
***
## Regression Problem
***
#### [Melbourne house pricing](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home "Click to see the source from Kaggle")
****

#### Notes on Specific Variables
---

>Rooms: Number of rooms

>Price: Price in dollars

>Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

>Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

>SellerG: Real Estate Agent

>Date: Date sold

>Distance: Distance from CBD

>Regionname: General Region (West, North West, North, North east …etc)

>Propertycount: Number of properties that exist in the suburb.

>Bedroom2 : Scraped # of Bedrooms (from different source)

>Bathroom: Number of Bathrooms

>Car: Number of carspots

>Landsize: Land Size

>BuildingArea: Building Size

>CouncilArea: Governing council for the area

# A. Libraries needed for this study

In [None]:
# for working with dataframes
import pandas as pd
from sklearn import preprocessing
import numpy as np

# for visualisation
import plotly.express as px
import plotly.graph_objects as go

# for model evaluation
from sklearn.model_selection import KFold

# for the various ML models selected for this project
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor, Pool, cv
import lightgbm as lgb

# for evaluating predictions
from sklearn import metrics
import statistics
from sklearn.metrics import mean_squared_error

# hide any warnings from output
import warnings
warnings.filterwarnings("ignore")

# notebook settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# 1. Data Preparation

## 1.1 Importing dataset

In [None]:
# set pandas to display all columns of data
pd.set_option('display.max_columns', None)

# read in the data
melb_house_price_df = pd.read_csv('../input/melb-housing-dataset/melb_data.csv')

## 1.2 High level look at the dataset

In [None]:
melb_house_price_df.head()

In [None]:
melb_house_price_df.shape

13580 rows and 21 columns of data

In [None]:
melb_house_price_df.info()

In [None]:
melb_house_price_df.describe().T

## 1.3 Split dataframe into numerical and categorical dataframes

In [None]:
# select the numerical columns (everything that is not of object dtype)
num_col = melb_house_price_df.select_dtypes(exclude='object')

# select the categorical columns (everything that is not int64 or float64)
cat_col = melb_house_price_df.select_dtypes(exclude=['int64','float64'])

print(f"There are {num_col.shape[1]} numerical features.")
print(f"\nThere are {cat_col.shape[1]} categorical features.")

### 1.3.1 Finding numerical features with correlation with outcome

In [None]:
# compute the pairwise correlation matrix of the numerical features
num_col_corr = num_col.corr()

# create a heatmap of all numerical features and outcome
num_col_heatmap = px.imshow(num_col_corr, height = 700, width = 700,
                            title = "Heatmap Correlation for Numerical Features")

num_col_heatmap.show()

<b>Here, we can see that some of the numerical features show some correlation with Price:</b>
1. Rooms
2. Bedroom2
3. Bathroom

### 1.3.2 Look at the relationship between categorical features and outcome

In [None]:
# add in the outcome column into cat_col
cat_col["Price"] = melb_house_price_df["Price"]

cat_col_df = px.scatter_matrix(cat_col, height = 900, width = 900)

cat_col_df.show()

<b>Not much can be inferred at this stage, but a few minor observations can be made:</b>
1. Different suburbs have different housing prices.
2. Type h properties tend to be sold in higher volumes.
3. The council area affects housing prices.
4. Different regions have different housing prices.

# 2. Exploratory Data Analysis

## 2.1 Understanding features of this dataset

### 2.1.1 Suburb

#### A suburb is an area within Melbourne where people live and where businesses and government organisations are located.

#### Here we look at the total value of property sales in each suburb.

In [None]:
# total sale value per suburb, sorted from highest to lowest
# (selecting Price before sum() avoids summing the non-numeric columns)
suburb_highest_price = melb_house_price_df.groupby(by="Suburb")[["Price"]].sum().sort_values(by="Price", ascending=False)
suburb_highest_price.reset_index(inplace=True)
suburb_highest_price

In [None]:
# design the bar chart (a histogram that sums Price for each suburb)
fig_suburb = px.histogram(suburb_highest_price, x="Suburb", y="Price",
                          log_y=True, width=5000, height=700,
                          orientation='v', title="Property Sales History for all Suburbs")

# display the plot
fig_suburb.show()

<b>Suburb with the highest property sales</b> -> Brighton<br>
<b>Suburb with the lowest property sales</b> -> Bacchus Marsh

#### Here we look at the range of house prices across all suburbs in this dataset.

In [None]:
# create a new dataframe for the pivot table
suburb_sum_price = melb_house_price_df.pivot_table(index="Suburb", columns="Type", aggfunc=("min","max","sum"))["Price"]

# sort the dataframe; change the (aggregate, type) tuple to sort by a different price column
suburb_sum_price = suburb_sum_price.sort_values(by=[("min","u")], ascending=False)

# display the dataframe
suburb_sum_price

### 2.1.2 Method

#### Kaggle lists a number of acronyms; however, only 5 methods appear in this dataset.

S - property sold - [what does it mean](https://upside.com.au/articles/selling-your-property/selling-guide/common-real-estate-terms-and-definitions-explained)<br>
SP - property sold prior - [what does it mean?](https://www.therealestateconversation.com.au/blog/justin-nickerson/why-would-you-sell-property-prior-auction/justin-nickerson-auctioneer/justin)<br>
PI - property passed in - [what does it mean?](https://www.greghocking.com.au/what-happens-if-a-property-is-passed-in-at-auction/#:~:text=When%20a%20property%20is%20passed,crowd%20or%20a%20vendor%20bid.)<br>
VB - vendor bid - [what does it mean?](https://www.realestate.com.au/advice/auction-hammer-falls-vendor-bid/)<br>
SA - sold after auction - [what does it mean?](https://www.domain.com.au/advice/the-block-2018-how-auctions-work-in-victoria-777537/)

#### Here we find out all different types of property sales methods

In [None]:
# group the dataframe by the sales methods
melb_house_price_df.groupby(by="Method").count()

Focusing on the features without null values, such as "Price":<br>
Sales method <b>"S" (Property Sold)</b> is the most common.<br>
<b>"SA" (Sold After Auction)</b> is the least common sales method.

#### Visualise the total amount of sales by each method

In [None]:
# design the bar plot to show the sum of property prices differentiated by sales method
fig_method = px.histogram(melb_house_price_df, x="Price", y="Method",
                          color="Type", title="Total Amount of Sales by Method")

# display the plot
fig_method.show()

- "S" (Property Sold) is the clear winner here.

- Property type "h" has also won the hearts of many buyers: it has the highest sales amount regardless of the sales method.

- [This article](https://lawyersconveyancing.com.au/faq/auctions-faq/) explains why the other sales methods do not attract as many sales as Property Sold.

### 2.1.3 Distance

#### This feature should be intuitive for most of us. Distance here means the distance between the property and the CBD (central business district), so we would naturally expect properties closer to the CBD to sell for more.

In [None]:
# labeling for the plot
labels = {"Price":"House Price",
          "Distance":"Distance to CBD (Km)"} # many kagglers have used kilometers as the metric

# design the plot to show the effects on property prices when distance matters
fig_house_dist_price = px.scatter(melb_house_price_df, x="Distance", y="Price",
                                  facet_col="Type", log_y=True, trendline="ols",
                                  title="House Prices w.r.t. Distance to CBD", marginal_x="box",
                                  height=600, labels=labels, trendline_color_override="black")
fig_house_dist_price.show()

- Setting the outliers aside and looking at the trendline, house prices fall as the distance to the CBD increases.
- The majority of the properties are located within 20 km of the CBD.
- Beyond 20 km from the CBD, type "h" makes up a higher proportion of properties than types "u" and "t", probably because these owners want peace away from the bustling city.

### 2.1.4 Geographical features

#### As with Distance, the geographical features of a property should also affect its price. Knowing the places of interest around each Postcode, Latitude and Longitude would allow some feature engineering to be performed.

In [None]:
# create a new dataframe to hold all the geographical features
# (.copy() so that columns added later do not write into a slice of the original)
geo_features_df = melb_house_price_df[["Postcode", "CouncilArea", "Lattitude", "Longtitude", "Price"]].copy()
geo_features_df.head()

#### Create a new column to show the different group of pricing

In [None]:
# cut the range of property price into 10 groups and create a new column for it
geo_features_df["Price_Group"] = pd.cut(geo_features_df["Price"], bins=10)
geo_features_df

# show all 10 different price group
geo_features_df.Price_Group.value_counts().sort_values(ascending=False)

#### Relabel each price group with a number, so that a higher group number corresponds to a higher property price.

In [None]:
# using label encoder to do the reassignment
le = preprocessing.LabelEncoder()

# convert the Price_Group values to strings so that plotly displays them as discrete colours rather than a continuous scale
geo_features_df["Price_Group"] = le.fit_transform(geo_features_df["Price_Group"]).astype(str)

In [None]:
# check if the label encoders works
geo_features_df.Price_Group.value_counts().sort_values(ascending=False)

# check dtype of price group
geo_features_df.info()
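As an alternative to cutting first and label-encoding afterwards, `pd.cut` can return the ordinal bin number directly, together with the bin edges. The sketch below uses hypothetical toy prices (not taken from the dataset) and shows how a legend could be built from the computed edges rather than typed by hand.

```python
import pandas as pd

# Hypothetical toy prices in dollars, made up for illustration
toy_prices = pd.Series([150_000, 900_000, 2_500_000, 4_800_000, 9_000_000], name="Price")

# labels=False makes pd.cut return the integer bin index (0 = cheapest bin);
# retbins=True additionally returns the array of bin edges
price_group, bin_edges = pd.cut(toy_prices, bins=10, labels=False, retbins=True)

# cast to str so that a plotting library treats the groups as discrete categories
price_group = price_group.astype(str)

# build the legend text (in $ million) from the computed edges
legend = {
    str(i): f"{bin_edges[i] / 1e6:.3f} - {bin_edges[i + 1] / 1e6:.3f}"
    for i in range(len(bin_edges) - 1)
}
print(price_group.tolist())
print(legend)
```

Deriving the legend from `bin_edges` keeps it in sync with the data if the price range ever changes.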

In [None]:
# setting the price legend
price_legend = {"0":"< 0.976",
                "1":"0.976 - 1.868",
                "2":"1.868 - 2.759",
                "3":"2.759 - 3.651",
                "4":"3.651 - 4.542",
                "5":"4.542 - 5.434",
                "6":"5.434 - 6.325",
                "7":"6.325 - 7.217",
                "8":"7.217 - 8.108",
                "9":"8.108 - 9.000"}

# edit the labels
labels = {"Price_Group":"Prices in $ Million"}

# design the geographical plot
geo_fig = px.scatter_mapbox(geo_features_df, lat="Lattitude", lon="Longtitude",
                            hover_name="CouncilArea", zoom=10, color="Price_Group",
                            height=800, width=1000, labels=labels,
                            category_orders = {"Price_Group":["0","1","2","3","4","5","6","7","8","9"]},
                            title="Property Price and Location In Melbourne",
                            mapbox_style="open-street-map")

#change the price legend
geo_fig.for_each_trace(lambda t: t.update(name = price_legend[t.name],
                                          legendgroup = price_legend[t.name],
                                          hovertemplate = t.hovertemplate.replace(t.name, price_legend[t.name])
                                         )
                      )

<b>This scatter_mapbox visualisation gives the following observations:</b>
1. Properties priced at or below 1.868 million make up the majority of the dataset.
2. Properties are more densely packed within and around the city centre. Referring to section 2.1.3, these should be the properties within 20 km of the CBD.
3. Interestingly, properties priced from 1.868 million upwards tend to cluster to the east of Melbourne. As prices increase further, the properties are found almost only in the east.
4. The number of properties decreases as the price increases.
5. Properties priced above 3.651 million are usually found within the City of Boroondara.
6. It is difficult to draw inferences for properties valued above 6.325 million, which is a disadvantage when trying to locate the outliers.

### 2.1.5 Percentage of each property type

#### Earlier, in section 2.1.4, we saw that lower-priced properties make up the majority of this dataset. We create a pie chart to explore which property type drives this observation.

In [None]:
house_type_fig = px.pie(melb_house_price_df, values="Price", names="Type",
                        title="Proportion of Property Types in Melbourne",
                        hole=0.3)

house_type_fig.show()

- Type h properties account for about 80% of this dataset! Note that the pie chart uses `values="Price"`, so each slice is a share of the summed prices rather than a count of properties.
- Because this dataset labels all sub-categories of type h (house, cottage, villa, semi, terrace) simply as "h", it would be helpful if this category could be split further to reduce the current imbalance.

### 2.1.6 Housing Facilities

#### We all know that property prices increase with a more spacious house, more rooms and more carspots (car parks). Here we try to visualise which facility pushes property prices up the most in Melbourne.

#### create a new dataframe to hold the data for this section's visualisation.

In [None]:
prop_facil_df = melb_house_price_df[["Rooms","Price","Bathroom","Car","Landsize"]]
prop_facil_df.head(3)

#### visualise with the newly created dataframe, one plot per feature (4 plots in total).
- The steeper the gradient of the trendline, the larger the change in house price for each additional unit of the feature being compared.

In [None]:
# visualise rooms facility
prop_facil_fig_rm = px.scatter(prop_facil_df, x='Rooms', y="Price", color="Price", log_y=True,
                               trendline="ols", title="Effects of Property Price for No. of Rooms",
                               labels= {"Price":"House Price", "Rooms":"No. Rooms"},
                               color_continuous_scale=[(0, "red"), (0.5, "green"), (1, "blue")])

prop_facil_fig_rm.show()

In [None]:
# visualise bathroom facility
prop_facil_fig_bathrm = px.scatter(prop_facil_df, x='Bathroom', y="Price", color="Price", log_y=True,
                                   trendline="ols", title="Effects of Property Price for No. of Bathrooms",
                                   labels= {"Price":"House Price", "Bathroom":"No. of Bathrooms"},
                                   color_continuous_scale=[(0, "red"), (0.5, "green"), (1, "blue")])

prop_facil_fig_bathrm.show()

In [None]:
# visualise carspots facility
prop_facil_fig_car = px.scatter(prop_facil_df, x='Car', y="Price", color="Price", log_y=True,
                                trendline="ols", title="Effects of Property Price for No. of Carspots",
                                labels= {"Price":"House Price", "Car":"No. of Carspots"},
                                color_continuous_scale=[(0, "red"), (0.5, "green"), (1, "blue")])

prop_facil_fig_car.show()

In [None]:
# visualise landsize area
prop_facil_fig_area = px.scatter(prop_facil_df, x='Price', y="Landsize", color="Price", log_x=True, log_y=True,
                                 trendline="ols", title="Effects of Property Price for Landsize",
                                 labels= {"Price":"House Price", "Landsize":"House Landsize"},
                                 color_continuous_scale=[(0, "red"), (0.5, "green"), (1, "blue")])

prop_facil_fig_area.show()

#### Summary for 2.1.6 Housing Facilities

In [None]:
prop_facil_fig_rm_results = px.get_trendline_results(prop_facil_fig_rm).px_fit_results.iloc[0]
print(f"Gradient (Rooms): {prop_facil_fig_rm_results.params[1]}")

prop_facil_fig_bathrm_results = px.get_trendline_results(prop_facil_fig_bathrm).px_fit_results.iloc[0]
print(f"Gradient (Bathroom): {prop_facil_fig_bathrm_results.params[1]}")

prop_facil_fig_car_results = px.get_trendline_results(prop_facil_fig_car).px_fit_results.iloc[0]
print(f"Gradient (Carspots): {prop_facil_fig_car_results.params[1]}")

prop_facil_fig_area_results = px.get_trendline_results(prop_facil_fig_area).px_fit_results.iloc[0]
print(f"Gradient (Landsize): {prop_facil_fig_area_results.params[1]}")

- The number of bathrooms has the greatest influence on property prices. This could be linked to the underlying cost: bathrooms may need more materials and labour during construction.
- The number of rooms has the second-highest influence on property prices.
- Carspots and Landsize have a weaker influence on property prices.
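One caveat when comparing trendline gradients across features: a gradient depends on the unit of the feature, whereas the correlation coefficient does not. The following is a minimal sketch with hypothetical toy numbers (not taken from the dataset), using plain NumPy instead of the plotly trendline.

```python
import numpy as np

# Hypothetical toy data: land size in square metres and price in dollars
landsize_m2 = np.array([200.0, 350.0, 500.0, 650.0, 800.0])
price = np.array([500_000.0, 640_000.0, 810_000.0, 930_000.0, 1_100_000.0])

# least-squares slope of price against the feature (dollars per square metre)
slope_m2 = np.polyfit(landsize_m2, price, deg=1)[0]

# the same feature expressed in hectares (1 ha = 10,000 square metres)
landsize_ha = landsize_m2 / 10_000
slope_ha = np.polyfit(landsize_ha, price, deg=1)[0]

# the correlation coefficient is unaffected by the change of unit
corr_m2 = np.corrcoef(landsize_m2, price)[0, 1]
corr_ha = np.corrcoef(landsize_ha, price)[0, 1]

print(f"slope per m^2: {slope_m2:.1f}, slope per ha: {slope_ha:.1f}")
print(f"correlation: {corr_m2:.4f} vs {corr_ha:.4f}")
```

For this reason, gradients of features measured on very different scales (for example a room count versus a land size) are best read alongside a unit-free measure such as the correlation.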

### 2.1.7 Geopolitical Influence

#### Sometimes, areas with greater political influence have different property prices. However, without political knowledge of Melbourne, we can only make observations based on the values in this dataset.

#### create a new dataframe to hold the columns for this visualisation.

In [None]:
# only taking features with possible geopolitical relations
geopol_df = melb_house_price_df[["Type","Price","CouncilArea","Regionname","Propertycount"]]
geopol_df.head(3)

#### visualise using a treemap, as there is a hierarchical relation between the columns.

However, a treemap cannot handle null values, and only the "CouncilArea" column has them. To keep the dataset otherwise unchanged, we temporarily replace the null values with "unknown".

In [None]:
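# assign() returns a new dataframe, avoiding an in-place fillna on a slice of the original
geopol_df = geopol_df.assign(CouncilArea=geopol_df["CouncilArea"].fillna("unknown"))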

In [None]:
geopol_fig = px.treemap(geopol_df, path=[px.Constant("Melbourne"),"Regionname","CouncilArea","Type"],
                        values="Price",color="Price", height=800, hover_name="Type",
                        labels={"Price":"Prices in $"}, color_continuous_scale="earth",
                        title="Melbourne Property Price by Metropolitan Region")

geopol_fig.data[0].hovertemplate = '<b>%{label}</b><br>%{value}'
geopol_fig.show()

- The Southern Metropolitan region has the highest total value of property sales.
- Every metropolitan region has its own "unknown" council area, and some are quite significant. There were originally 1369 missing values, about 10% of the dataset, and together they add up to quite a sum. If imputed wrongly, they can give a misleading impression. The pie chart below illustrates the proportion of "unknown" council areas.

In [None]:
council_area_fig = px.pie(geopol_df, values="Price", names="CouncilArea",
                          height=700, width=900, hole=0.3,
                          title="11.1% of Unknown Property Sales Transaction!")

council_area_fig.show()

### 2.1.8 Effects of Time on Property Price

#### We try to observe whether there is any relationship between when properties were sold and their prices.

In [None]:
# define the features needed for this visualisation
date_sold_df = melb_house_price_df[["Date","Price", "Type"]].copy()

# convert the Date object type into datetime format
date_sold_df["Date"] = pd.to_datetime(date_sold_df["Date"], exact=True, format="%d/%m/%y")

date_sold_df.info()

In [None]:
date_sold_fig = px.scatter(date_sold_df, x="Date", y="Price",
                           facet_col="Type", height=700 , width=1500, marginal_x="histogram",
                           title="Property Type Sales Volume Across Time")

date_sold_fig.show()

- Observation 1: There are gaps, around March 2016 and January 2017, during which almost no properties were sold.
- Observation 2: Type h properties have a higher sales volume. This can also be seen in sections 2.1.2 and 2.1.5.
- Observation 3: Properties tend to be sold as the year end approaches. This trend does not hold for type u and t properties.
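The seasonal pattern can also be checked numerically by counting sales per month and property type. This is a minimal sketch on hypothetical toy rows (the dates and types below are made up); on the real data the same grouping could be applied to `date_sold_df`.

```python
import pandas as pd

# Hypothetical toy sales, using the same DD/MM/YY style as the Date column
toy_sales = pd.DataFrame({
    "Date": ["3/09/16", "17/09/16", "12/11/16", "26/11/16", "10/12/16", "4/02/17"],
    "Type": ["h", "u", "h", "h", "t", "h"],
})
toy_sales["Date"] = pd.to_datetime(toy_sales["Date"], format="%d/%m/%y")

# number of sales for each (month, type) pair, with types spread across columns
sales_per_month = (
    toy_sales.groupby([toy_sales["Date"].dt.month.rename("Month"), "Type"])
    .size()
    .unstack(fill_value=0)
)
print(sales_per_month)
```

A table like this makes it easier to see whether the year-end peak is present for every property type or only for type h.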

[This article](https://www.realestate.com.au/insights/when-is-the-best-time-to-sell-your-home-statistically/) explains that sellers want to sell their houses when they look their best, which encourages buyers to pay more. The article also states that the best time to sell a property is November, which matches the trend we observe in this dataset.

[Another article](https://www.theage.com.au/property/news/testing-time-as-melbourne-s-spring-selling-season-begins-will-property-prices-keep-going-down-20220825-p5bcol.html) states that most property buyers rush into the spring property market. This seems to be a seasonal norm in Melbourne, as more sellers tend to list their properties during spring. December to February is summer in Melbourne, and the property market usually quietens down in this period. In the hot weather, buyers would probably find it uncomfortable to visit houses and attend auctions.

[The 4 seasons of Melbourne](https://www.australia.com/en-sg/facts-and-planning/weather-in-australia/melbourne-weather.html)

## 2.2 Finding out null values

With a better understanding of the dataset gained through visualisation, we move on to data cleaning.

In [None]:
# list out features with null values
melb_house_price_df.isnull().sum()

<b>Features with null values:</b>
1. SellerG
2. Car
3. BuildingArea
4. YearBuilt
5. CouncilArea
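Besides the raw counts, the share of missing values per column helps in deciding between imputing and dropping a feature. A minimal sketch on a hypothetical toy dataframe (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with some missing values
toy_df = pd.DataFrame({
    "Car": [1.0, 2.0, np.nan, 1.0],
    "BuildingArea": [np.nan, 120.0, np.nan, 95.0],
    "Price": [500_000, 750_000, 620_000, 910_000],
})

# count and percentage of nulls per column
null_summary = pd.DataFrame({
    "null_count": toy_df.isnull().sum(),
    "null_pct": toy_df.isnull().mean() * 100,
})

# keep only the columns that actually have nulls, worst first
null_summary = null_summary[null_summary["null_count"] > 0].sort_values("null_pct", ascending=False)
print(null_summary)
```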

### 2.2.1 Handling null values - "SellerG"

<b>Find out the number of property agents in this dataset.</b>

In [None]:
# nunique() counts the distinct agent names and, unlike unique(), does not count NaN as an agent
num_of_property_agent = melb_house_price_df.SellerG.nunique()

# print out the results
print(f"There are {num_of_property_agent} property agents in this dataset.")

<b>Remove the one row with a null value in "SellerG".</b>

In [None]:
melb_house_price_df = melb_house_price_df[melb_house_price_df["SellerG"].notna()]

This [article](https://towardsdatascience.com/two-pandas-tricks-i-wish-id-known-earlier-60af0a049735) explains why `.notna()` is better than `.dropna()`.
Credits to [Liad Pollak Zuckerman](https://medium.com/@pollakliad) for the interesting insight!

In [None]:
melb_house_price_df.SellerG.isnull().sum()

### 2.2.2 Handling null values - "Car"

#### using statistical imputation method for missing values in "Car".

We find the most frequent value (no. carspots) within the "Car" column.

In [None]:
melb_house_price_df["Car"].mode()

In [None]:
melb_house_price_df["Car"] = melb_house_price_df["Car"].fillna(melb_house_price_df["Car"].mode()[0])

In [None]:
melb_house_price_df.Car.isnull().sum()

### 2.2.3 Handling null values - BuildingArea & YearBuilt

#### These 2 features have the most null values.
- BuildingArea is also referred to as building size. This vague term could mean a number of things, for example how large the building is in terms of height, width and breadth. No other feature helps to clarify what BuildingArea actually means. This column has nearly 50% null values, and careless imputation may affect how the ML models train in section 4. Therefore, it is safer to drop this feature.


- YearBuilt is the year in which the property was built. It is also safer to drop this feature. Construction costs differ from year to year, depending on the performance of the construction sector in Melbourne during that period. There could also have been a period when Melbourne had too many empty properties and, to attract buyers and investors, real estate companies decided to sell at attractive prices. Careless imputation can influence how the ML models learn this feature.

In [None]:
melb_house_price_df.drop(["BuildingArea","YearBuilt"], axis=1, inplace=True)

In [None]:
# check the shape of the dataset again
melb_house_price_df.shape

### 2.2.4 Handling null values - CouncilArea

#### Earlier in section 2.1.7...
- We visualised CouncilArea in a treemap and found that, for each property type within each CouncilArea (and for each Regionname further up the hierarchy), the sales with an unknown council area can add up to a very large sum.
- CouncilArea is a categorical feature, so filling with the `.mode()` of the whole column looks like the most suitable approach... but it would be dangerous to do so.

#### we see what is the most common CouncilArea in the dataset

In [None]:
melb_house_price_df.CouncilArea.mode()

>In section 2.1.7 we observed that Moreland has 6.58% of sales transactions, around 960 million dollars.<br>
>Adding the 11.1% (around 1.6 billion dollars!) to Moreland would distort the dataset so that it no longer accurately reflects the performance of the property industry.


#### What we will do is still use the `.mode()` function, but at the Regionname level. This imputation is not perfect, as a suburb may not actually belong to the council area it is filled with.

In [None]:
# how many unique region names are there in this dataset
melb_house_price_df["Regionname"].value_counts()

#### for each region, we find the most commonly appeared council area

In [None]:
melb_house_price_df[melb_house_price_df.Regionname == "Southern Metropolitan"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Northern Metropolitan"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Western Metropolitan"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Eastern Metropolitan"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "South-Eastern Metropolitan"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Eastern Victoria"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Northern Victoria"].CouncilArea.mode()
melb_house_price_df[melb_house_price_df.Regionname == "Western Victoria"].CouncilArea.mode()

#### create a dictionary to hold all the commonly appeared council area which corresponds to its respective region

In [None]:
mapping_dict = dict({"Southern Metropolitan":"Boroondara",
                     "Northern Metropolitan":"Moreland",
                     "Western Metropolitan":"Moonee Valley",
                     "Eastern Metropolitan":"Banyule",
                     "South-Eastern Metropolitan":"Kingston",
                     "Eastern Victoria":"Yarra Ranges",
                     "Northern Victoria":"Melton",
                     "Western Victoria":"Melton"})

#### using `.map()` to convert all NaN values

In [None]:
melb_house_price_df["CouncilArea"] = melb_house_price_df["CouncilArea"].fillna(melb_house_price_df["Regionname"].map(mapping_dict))
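The per-region lookup and the hand-written dictionary could also be derived in one step with `groupby`. The sketch below runs on a hypothetical toy dataframe (region and council names are placeholders) and assumes every region has at least one known council area; a region with only nulls would need separate handling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two regions, each with one missing council area
toy_df = pd.DataFrame({
    "Regionname": ["North", "North", "North", "South", "South", "South"],
    "CouncilArea": ["A", "A", np.nan, "B", np.nan, "B"],
})

# most frequent council area per region (mode() ignores NaN by default);
# .iloc[0] picks the first value if several are tied
region_mode = toy_df.groupby("Regionname")["CouncilArea"].agg(lambda s: s.mode().iloc[0])

# fill each missing council area with the mode of its own region
toy_df["CouncilArea"] = toy_df["CouncilArea"].fillna(toy_df["Regionname"].map(region_mode))
print(toy_df)
```

Computing the mapping from the data avoids copying values by hand, at the cost of hiding the chosen council areas from the reader unless `region_mode` is printed.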

In [None]:
melb_house_price_df.isnull().sum()

all seems good now.

## 2.3 Data wrangling

#### So far we have done some visualisations to understand the features and sorted out the null values. Now we can clean the data further so that the features are in the correct data type or format before moving on to the model training phase.

#### Another high level look at the dataset.

In [None]:
melb_house_price_df.head(3)

#### Check out the shape of the updated dataset

In [None]:
melb_house_price_df.shape

- Previously, the original dataset had 13580 rows and 21 columns of data.<br>
- Now, the dataset has 13579 rows and 19 columns of data.<br>
- We have dropped as few rows and columns as possible. This preserves most of the original dataset, so that there is more room for tweaking during the model training phase.

#### Check out the number of numerical and categorical features

In [None]:
melb_house_price_df.info()

#### Note that in section 1.2 we were only observing. Now, if we look into the details...
1. Postcode (int), Propertycount (int), Lattitude (float) & Longtitude (float) are stored as numerical data types. In practice, they should be treated as a form of categorical data.

>For example, Postcode represents the area in which the property is located. It is a fixed label that does not change. Other features (the internal and external facilities of the property) can change and influence the property price, but the postal code stays the same. Keeping it as a numerical data type not only misrepresents the data, it may also be treated by a model as a quantity that drives property prices, which it is not, at least for the problem of this project. The same applies to Lattitude and Longtitude.

>Similarly, Propertycount could be treated by the training model as a quantity, where a suburb with a high property count implies high property prices, which is not necessarily true.

- Hence, Postcode, Propertycount, Lattitude & Longtitude should be converted into string format (object data type).

2. Date (object) should not be an object data type but rather a datetime data type.

>With a datetime data type we can perform feature engineering on Date and gain more insights (are there months or seasons in which prices are hiked?).

- Hence, Date should be converted into datetime data type.

### 2.3.1 Changing data type - Postcode, Propertycount, Lattitude & Longtitude

In [None]:
# list out the columns to convert
cols_num_to_cat = ["Postcode","Propertycount","Lattitude", "Longtitude"]

# run a loop to convert the features into string format
for col in cols_num_to_cat:
    melb_house_price_df[col] = melb_house_price_df[col].astype(str)

In [None]:
# checking if the change is effective
melb_house_price_df[cols_num_to_cat].info()

### 2.3.2 Changing data type - Date

#### Randomly sample 10 rows in the data set and check format of the dates.

In [None]:
melb_house_price_df[["Date"]].sample(10)

- For index 6853, the date is 30/7/16, i.e. 30th July 2016. Since 30 cannot be a month, the day must come first.
- For index 834, the date is 10/12/16, which therefore reads as 10th December 2016.
- So we can confirm that the dates are in the format DD/MM/YY.
- This is good. We just have to convert the data type, parsing the strings with this format.
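Passing the format explicitly matters for ambiguous strings such as 10/12/16, which could be read as either 10 December or 12 October. A minimal sketch using the two sample strings quoted above (the variable names are only for illustration):

```python
import pandas as pd

# the two sample date strings discussed above
toy_dates = pd.Series(["30/7/16", "10/12/16"])

# day-first parsing with an explicit format, as used in this notebook
day_first = pd.to_datetime(toy_dates, format="%d/%m/%y")

# the same ambiguous string read month-first gives a different date
month_first = pd.to_datetime("10/12/16", format="%m/%d/%y")

print(day_first.dt.strftime("%d %B %Y").tolist())
print(month_first.strftime("%d %B %Y"))
```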

before the change (DD/MM/YY)

In [None]:
melb_house_price_df["Date"].head(3)

In [None]:
# convert to datetime datatype
melb_house_price_df["Date"] = pd.to_datetime(melb_house_price_df["Date"], exact=True, format="%d/%m/%y")

In [None]:
# checking if the change is effective
melb_house_price_df[["Date"]].info()

after the change (YYYY-MM-DD)

In [None]:
melb_house_price_df["Date"].head(3)

In [None]:
# melb_house_price_df["Date"] = melb_house_price_df["Date"].dt.strftime("%d/%m/%y")

- The code above would give us back the original date format, but it would also turn the Date column back into object data type, which is not what we want.
- This is not an issue, as it only affects how the dates are displayed. For this project it is not important, so we will accept the format YYYY-MM-DD.
- Later on, we can do feature engineering to extract the year and month, which will make the data clearer for future readers.

# 3. Feature Engineering

In [None]:
melb_house_price_df.head(3)

#### We identify which features need to be engineered and the techniques to use.
1. <b>Numerical features</b> - Feature scaling (for distance-based ML algorithms).
2. <b>Datetime features</b> - Extract year and month.
3. <b>Categorical features</b> - One-hot encoding.
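As a quick illustration of one-hot encoding, here is a minimal sketch with `pd.get_dummies` on a hypothetical toy dataframe (the rows are made up; this is not necessarily how the encoding is applied later in the notebook).

```python
import pandas as pd

# Hypothetical toy dataframe with one categorical and one numerical column
toy_df = pd.DataFrame({"Type": ["h", "u", "t", "h"], "Rooms": [3, 2, 3, 4]})

# each category of Type becomes its own indicator column; Rooms is left untouched
encoded = pd.get_dummies(toy_df, columns=["Type"])
print(encoded)
```

For a linear model, passing `drop_first=True` drops one indicator per feature, which avoids perfectly collinear columns.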

#### Before moving ahead with this section
- We should also identify which ML algorithms we will use for this project's regression problem.
- In this project, we will use 4 different algorithms, namely:
>1. Linear Regression from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
>2. Random Forest from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
>3. CatBoost from [Yandex](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor).
>4. LightGBM from [Microsoft](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html).

- With this in mind, we can plan how we will engineer our features.

## 3.1 Feature Engineering - Datetime Features

- In this section, we extract the month in which the property was sold.

In [None]:
# extract the month of sale and store it in a new column
melb_house_price_df["Month_Sold"] = melb_house_price_df["Date"].dt.month
melb_house_price_df[["Date", "Month_Sold"]].head()
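
The cell above extracts the month only. If the year of sale is also wanted, a sketch along the following lines would work. The toy dates and the `Year_Sold` column name are assumptions for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy sale dates (illustrative values only)
sales = pd.DataFrame({"Date": pd.to_datetime(["2016-12-03", "2017-02-04", "2017-09-23"])})

# The .dt accessor exposes calendar components of a datetime column
sales["Year_Sold"] = sales["Date"].dt.year    # hypothetical column name
sales["Month_Sold"] = sales["Date"].dt.month
print(sales)
```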

## 3.2 Feature Engineering - Numerical Features

<b>Identify the numerical features</b>
- Note to exclude Date and Price.
- Date should not be scaled.
- Price is our target, hence it will be kept as it is.

<b>Methods of feature scaling</b>
- There are 2 widely known methods for feature scaling. Feature scaling matters for distance-based models because it brings the features onto a comparable range.
 - 1. Z-score normalisation (standardization)
 - 2. Normalisation (min-max scaling)


- Z-score normalisation expresses each value as the number of standard deviations it lies away from the mean.
- Normalisation rescales values onto a scale between 0 and 1.
- Z-score normalisation is preferred here because it is less sensitive to outliers than min-max normalisation.
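
As a small worked example of the two methods, the sketch below applies both formulas to a toy column that contains one extreme value. The numbers are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy land sizes with one extreme value (illustrative numbers only)
landsize = pd.Series([200.0, 350.0, 400.0, 600.0, 10000.0])

# Z-score: subtract the mean, divide by the (sample) standard deviation
z_score = (landsize - landsize.mean()) / landsize.std()

# Min-max: shift by the minimum, divide by the range, giving values in [0, 1]
min_max = (landsize - landsize.min()) / (landsize.max() - landsize.min())

print(pd.DataFrame({"landsize": landsize, "z_score": z_score, "min_max": min_max}))
```

With min-max scaling, the single large value pins the top of the range at 1 and squeezes the four ordinary values below 0.05.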

Z-score normalisation is only needed for Linear Regression, and we do not want to mix up our dataframes. Hence, we make a copy of the dataframe just for Linear Regression. Note that the remaining feature engineering steps have to be done for both the distance-based dataframe and the tree-based dataframe.

In [None]:
lin_reg_df = melb_house_price_df.copy(deep=True)
lin_reg_df.head(3)

In [None]:
# find the numerical features within the data frame
num_features = lin_reg_df.select_dtypes(include=["int64","float64"])

# list out the column labels for numerical features
num_features.columns

In [None]:
# define the features to scale and take out Price column as it is our target
num_features_scaling = ['Rooms', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'Month_Sold']

# run a loop to scale the features
for col in num_features_scaling:
    lin_reg_df[col] = (lin_reg_df[col] - lin_reg_df[col].mean())/lin_reg_df[col].std()

In [None]:
lin_reg_df[num_features_scaling].head()
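
The manual loop above can also be expressed with scikit-learn's `StandardScaler`. One detail to be aware of is that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two differ by a constant factor. The sketch below uses a toy dataframe (made-up values) to show the two agree once `ddof=0` is used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numerical features (illustrative values only)
toy = pd.DataFrame({"Rooms": [2, 3, 3, 4, 5],
                    "Distance": [2.5, 5.0, 7.5, 10.0, 12.5]}, dtype=float)

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# StandardScaler also divides by the population standard deviation
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))  # True
```

For a dataset with thousands of rows the difference between the two conventions is negligible, so either choice should be fine for this project.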

## 3.3 Feature Engineering - Categorical Features

<b>Identify the categorical features</b>
- Which object-type features go through this section?
- Earlier in our visualisation, we got a sense of the number of unique values in certain features.
- Features such as Suburb, CouncilArea, etc. have many different names (over a hundred).
- In this section, we want to use the One-Hot Encoding technique on certain features only.
- The features are selected on the basis that, once One-Hot Encoded, they will not influence the direction in which the algorithm learns the features.

<b>Before One-Hot Encoding is performed</b>
- Take note of the ML algorithms we are planning to implement.
- Certain algorithms are able to handle categorical columns well and do not require One-Hot Encoding.
- At the beginning of section 3, we mentioned that CatBoost and LightGBM will be used.
- According to the [CatBoost](https://catboost.ai/en/docs/features/categorical-features) and [LightGBM](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html) documentation, these models are able to handle categorical features on their own.
- However, Linear Regression and Random Forest require the categorical features to be encoded numerically.
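
For readers new to One-Hot Encoding, here is a minimal sketch of what `pd.get_dummies` does on a toy dataframe. The rows are made up, and only the column names `Type`, `Method` and `Rooms` are borrowed from the dataset.

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "Type": ["h", "u", "t", "h"],
    "Method": ["S", "SP", "S", "PI"],
    "Rooms": [3, 2, 3, 4],
})

# Each category becomes its own indicator column; untouched columns are kept
encoded = pd.get_dummies(toy, columns=["Type", "Method"])
print(encoded.columns.tolist())
# ['Rooms', 'Type_h', 'Type_t', 'Type_u', 'Method_PI', 'Method_S', 'Method_SP']
```

Each row has exactly one active indicator per original column, which is why a feature with over a hundred categories would add over a hundred columns.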

In [None]:
# find the categorical features within the data frame
cat_features = lin_reg_df.select_dtypes(include="object")
cat_features.head(3)
cat_features.columns

In [None]:
# find the categorical features within the tree-based data frame
cat_features = melb_house_price_df.select_dtypes(include="object")
cat_features.head(3)
cat_features.columns

#### we select only these categorical features -> ['Type', 'Method', 'Regionname']

In [None]:
# define the features for One-Hot Encoding
features = ['Type', 'Method', 'Regionname']

# pass the defined features
cat_OH_feature_dist_df = pd.get_dummies(lin_reg_df[features])
cat_OH_feature_tree_df = pd.get_dummies(melb_house_price_df[features])

# look at the One-Hot Encoded features
cat_OH_feature_dist_df.head(3)
cat_OH_feature_tree_df.head(3)

In [None]:
cat_OH_feature_dist_df.shape
cat_OH_feature_tree_df.shape

#### 16 new One-Hot Encoded columns generated out of 3 columns ['Type', 'Method', 'Regionname']

#### Merge these 16 new columns into both dataframes

In [None]:
# merge the One-Hot Encoded features into the data set
melb_data_dist_df = pd.concat([lin_reg_df, cat_OH_feature_dist_df], axis=1)
melb_data_tree_df = pd.concat([melb_house_price_df, cat_OH_feature_tree_df], axis=1)

#### Once merged, we can drop away the original ['Type', 'Method', 'Regionname']

In [None]:
melb_data_dist_df.drop(features, axis=1, inplace=True)
melb_data_tree_df.drop(features, axis=1, inplace=True)

In [None]:
melb_data_dist_df.head(3)
melb_data_tree_df.head(3)

In [None]:
# look again at the new shape of the dataset
melb_data_dist_df.shape
melb_data_tree_df.shape

# 4. Model Building

#### As this is a regression problem, we list the ML algorithms we will be using for this project. They were already mentioned in section 3, but let's repeat them here in section 4.
1. Linear Regression from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) - baseline model.
2. Random Forest from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
3. CatBoost from [Yandex](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor).
4. LightGBM from [Microsoft](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html).

#### A quick run through of how we build our models...
1. Select the predictors to use to predict Price and select Price as the target.
2. Use KFold cross-validation for model evaluation, with the number of folds `n_splits=10`.
3. Instantiate the untrained model with `random_state=2` where necessary.
4. Use GridSearchCV for hyperparameter tuning whenever possible.
5. Train the model once per fold, as set in point 2.

#### Reminder that we now have 2 different kinds of dataframes to work with
><b>With</b> One-Hot Encoding `melb_data_dist_df` and `melb_data_tree_df`<br>
>- These dataframes are meant for Linear Regression and Random Forest respectively.<br>

><b>Without</b> One-Hot Encoding `melb_house_price_df`
>- This dataframe is meant for CatBoost and LightGBM.
>- Yes, earlier we built a tree-based dataframe that has been One-Hot encoded. Because CatBoost and LightGBM do not require One-Hot Encoding, we repeat the feature engineering process for them starting from `melb_house_price_df`.

>> - CatBoost is able to handle categorical features, and we have to specify the categorical features during model training.
>> - LightGBM handles categorical features when they are integer-encoded.

#### Have a list of column labels from the dataframe

In [None]:
melb_data_dist_df.columns

## 4.1 Linear Regression - Baseline Model

In [None]:
# extract out the column labels which are numerical
features_num = melb_data_dist_df.select_dtypes(exclude="object")
features_num = list(features_num.columns)
features_num

In [None]:
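# imports needed for the RMSE calculations in this section
import statistics
from sklearn.metrics import mean_squared_error

# Select the predictors to use to predict Price and select Price as the target.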
X = melb_data_dist_df[[
    'Rooms', 'Distance', 'Bedroom2', 'Bathroom',
    'Car', 'Landsize', 'Month_Sold', 'Type_h', 'Type_t',
    'Type_u', 'Method_PI', 'Method_S', 'Method_SA', 'Method_SP',
    'Method_VB', 'Regionname_Eastern Metropolitan', 'Regionname_Eastern Victoria',
    'Regionname_Northern Metropolitan', 'Regionname_Northern Victoria', 'Regionname_South-Eastern Metropolitan',
    'Regionname_Southern Metropolitan', 'Regionname_Western Metropolitan', 'Regionname_Western Victoria'
]]

y = melb_data_dist_df["Price"]

# Split the dataset into ten sets
kf = KFold(n_splits=10)

# Instantiate the untrained Linear Regression object
linear_reg = linear_model.LinearRegression()

# Create an empty list for storing the RMSE of each fold
kfold_RMSE_linear_reg = []

# Iterate through each fold and calculate the RMSE
for train_index, test_index in kf.split(X):
    
    # Extract the training and test data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Fit the model
    linear_reg_model = linear_reg.fit(X_train, y_train)
    y_pred_linear_reg = linear_reg_model.predict(X_test)
    
    # Calculate the RMSE for each fold and append it
    RMSE_linear_reg = mean_squared_error(y_test, y_pred_linear_reg, squared=False)
    kfold_RMSE_linear_reg.append(RMSE_linear_reg)

print("Linear Regression | RMSE for each fold:", kfold_RMSE_linear_reg)
print("\nLinear Regression | Average RMSE: ", statistics.mean(kfold_RMSE_linear_reg))

#### Summary for Linear Regression model
1. All numerical features within the dataset were fitted into the Linear Regression model initially.
2. Removing some features at random and repeating the training process did not lower the RMSE.

## 4.2 Random Forest

In [None]:
# Select the predictors to use to predict Price and select Price as the target.
X = melb_data_tree_df[[
    'Rooms', 'Distance', 'Bedroom2', 'Bathroom',
    'Car', 'Landsize', 'Month_Sold', 'Type_h', 'Type_t',
    'Type_u', 'Method_PI', 'Method_S', 'Method_SA', 'Method_SP',
    'Method_VB', 'Regionname_Eastern Metropolitan', 'Regionname_Eastern Victoria',
    'Regionname_Northern Metropolitan', 'Regionname_Northern Victoria', 'Regionname_South-Eastern Metropolitan',
    'Regionname_Southern Metropolitan', 'Regionname_Western Metropolitan', 'Regionname_Western Victoria'
]]

y = melb_data_tree_df["Price"]

# Split the dataset into ten sets
kf = KFold(n_splits=10)

# Instantiate the untrained Random Forest Regressor object
rf_reg = RandomForestRegressor(random_state=2)

# Create an empty list for storing the RMSE of each fold
kfold_RMSE_rf_reg = []

# Define the parameters of Random Forest Regressor
parameters = {"max_depth":[5,7,10],
              "min_samples_split":[2,4,6],
              "min_samples_leaf":[1,2,3]
              }

# Create an empty tuned RandomForestRegressor model
rf_reg_tuned = GridSearchCV(estimator=rf_reg, param_grid=parameters, n_jobs=-1, verbose=1)

# Iterate through each fold and calculate the RMSE
for train_index, test_index in kf.split(X):
    
    # Extract the training and test data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Fit the model
    rf_model = rf_reg_tuned.fit(X_train, y_train)
    y_pred_rf = rf_model.predict(X_test)
    
    # Calculate the RMSE for each fold and append it
    RMSE_rf_model = mean_squared_error(y_test, y_pred_rf, squared=False)
    kfold_RMSE_rf_reg.append(RMSE_rf_model)

In [None]:
print("Random Forest | RMSE for each fold: ", kfold_RMSE_rf_reg)
print("\nRandom Forest | Average RMSE: ", statistics.mean(kfold_RMSE_rf_reg),"\n")

# Show the best parameters for the tuned Random Forest Regressor model
print(rf_reg_tuned.best_params_)

#### Summary
- All numerical features were taken, including the One-Hot encoded features.
- A simple grid search over 3 parameters was done.

## 4.3 CatBoost

We do not want to use `melb_data_tree_df` since it has been One-Hot encoded. Instead, we use the dataframe from before section 3.2.

In [None]:
melb_data_catboost_df = melb_house_price_df.copy(deep=True)

In [None]:
melb_data_catboost_df.head(3)

In [None]:
melb_data_catboost_df.columns

In [None]:
# Select the predictors to use to predict Price and select Price as the target.
X = melb_data_catboost_df[[
    'Suburb', 'Address', 'Rooms', 'Type', 'Method', 'SellerG',
    'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
    'Landsize', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname',
    'Propertycount', 'Month_Sold'
]]

# set Price as the target
y = melb_data_catboost_df['Price']

# get the indices of all categorical features
cat_features_indices = np.where(X.dtypes == object)[0]
cat_features_indices
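
To see what the `np.where(X.dtypes == object)[0]` expression above returns, here is a minimal sketch on a toy dataframe. The rows are made up, and the text columns are created with an explicit `object` dtype so the comparison behaves the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy predictors: two text columns and two numerical columns (illustrative values)
toy = pd.DataFrame({
    "Suburb": pd.Series(["Abbotsford", "Richmond"], dtype=object),
    "Rooms": [2, 3],
    "Type": pd.Series(["h", "u"], dtype=object),
    "Distance": [2.5, 2.6],
})

# Boolean mask over the columns, then the positions where the mask is True
cat_idx = np.where(toy.dtypes == object)[0]
print(cat_idx)                         # [0 2]
print(toy.columns[cat_idx].tolist())   # ['Suburb', 'Type']
```

Because the result holds column positions, it is only valid for a dataframe with the same column order as the one it was computed from.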

In [None]:
# Instantiate the untrained CatBoost Regressor object
catboost_reg = CatBoostRegressor(random_state=2, cat_features=cat_features_indices,
                                 verbose=False, early_stopping_rounds=10, iterations=500,
#                                  task_type="GPU", devices="0:1" # uncomment if running on GPU
                                )

# Define the parameters of Cat Boost Regressor for grid_search
parameters = {"depth":[6,7,8],
              "l2_leaf_reg":[3,4,5,],
              "random_strength":[0.2,0.3,0.4]
              }

grid_search_results = catboost_reg.grid_search(parameters, X, y,
                                               cv=10, verbose=False,
                                               search_by_train_test_split=False,
                                               partition_random_seed=2, shuffle=True
                                              ) 

In [None]:
# print out the best params used for the grid search model
print("\nBest Params : ", grid_search_results['params'])

cv_results = pd.DataFrame(grid_search_results["cv_results"])
cv_results

- By default, the CatBoostRegressor runs for 1000 iterations (trees). In the above code, we set iterations to 500 to reduce training time.
- early_stopping_rounds of 10 means training stops once the evaluation metric has not improved for 10 consecutive iterations, and the search then moves on to the next run.
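
The early-stopping rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the general idea, not CatBoost's internal implementation. The function name and the loss values are hypothetical.

```python
def stopping_iteration(val_losses, patience):
    """Return the index at which training would stop for the given validation losses."""
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best value: remember where it happened
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` consecutive iterations
            return i
    # Never triggered: training runs through all iterations
    return len(val_losses) - 1

losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3]
print(stopping_iteration(losses, patience=3))  # 5
```

Here the best loss (7.0) is reached at iteration 2, and the three iterations after it bring no improvement, so training stops at iteration 5.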

In [None]:
# # Define the input dataset for cross-validation
# cv_dataset = Pool(X, y, cat_features=cat_features_indices)

# # Define the parameters of Cat Boost Regressor
# parameters = {"depth":8,
#               "l2_leaf_reg":3,
#               "random_strength":0.4,
#               "loss_function":"RMSE",
#               "early_stopping_rounds":10
#               }

# cv_data = cv(pool=cv_dataset,
#              params=parameters,
#              plot=True,
#              fold_count=10,
#              logging_level="Silent",
#              return_models=True,
#              partition_random_seed=2,
#              shuffle=True,
#              early_stopping_rounds=10,
#              iterations=500
#             )

In [None]:
# Select the predictors to use to predict Price and select Price as the target.
X = melb_data_catboost_df[[
    'Suburb', 'Address', 'Rooms', 'Type', 'Method', 'SellerG',
    'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
    'Landsize', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname',
    'Propertycount', 'Month_Sold'
]]

y = melb_data_catboost_df['Price']

# Split the dataset into ten sets
kf = KFold(n_splits=10)

# Instantiate the untrained CatBoost Regressor object with the best parameters from the grid search
catboost_reg = CatBoostRegressor(random_state=2, depth=8, l2_leaf_reg=3, random_strength=0.4,
                                 early_stopping_rounds=10)

# CatBoost supports GPU. If running on GPU, then uncomment and run this code below.
# catboost_reg = CatBoostRegressor(random_state=2, depth=8, l2_leaf_reg=3, random_strength=0.4,
#                                  task_type="GPU", devices="0:1")

# Create an empty list for storing the RMSE of each fold
kfold_RMSE_catboost_reg = []

# Iterate through each fold and calculate the RMSE
for train_index, test_index in kf.split(X):
    
    # Extract the training and test data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Fit the model
    catboost_model = catboost_reg.fit(X_train, y_train, cat_features=cat_features_indices, verbose=1000)
    y_pred_catboost = catboost_model.predict(X_test)
    
    # Calculate the RMSE for each fold and append it
    RMSE_catboost_model = mean_squared_error(y_test, y_pred_catboost, squared=False)
    kfold_RMSE_catboost_reg.append(RMSE_catboost_model)

In [None]:
print("Cat Boost | RMSE for each fold: ", kfold_RMSE_catboost_reg)
print("\nCat Boost | Average RMSE: ", statistics.mean(kfold_RMSE_catboost_reg),"\n")


#### Summary
- In the beginning, we use CatBoost's grid_search method to find the best parameters for model training. To shorten this process, we cut the iterations by half, down to 500 trees.
- By the end of the grid search, the RMSE decreases to ~281000, an improvement over Random Forest.
- With the best parameters found in the grid search, we repeat the training process with KFold cross-validation for a fair comparison of RMSE against the other models.
- The average RMSE of ~288000 is a little higher than the RMSE from the grid search, but it is still below Random Forest and Linear Regression.

## 4.4 LightGBM

In [None]:
melb_data_lightgbm_df = melb_house_price_df.copy(deep=True)

In [None]:
melb_data_lightgbm_df.head(3)

In [None]:
melb_data_lightgbm_df.info()

In [None]:
label_encoder_features = melb_data_lightgbm_df[[
    'Suburb','Type','Method','SellerG','Postcode',
    'CouncilArea','Regionname','Propertycount','Lattitude','Longtitude'
]]

label_encoder_features

le = preprocessing.LabelEncoder()

for col in label_encoder_features:
    melb_data_lightgbm_df[col] = le.fit_transform(melb_data_lightgbm_df[col])

In [None]:
melb_data_lightgbm_df.head(3)
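
The loop above reuses a single `LabelEncoder`, so after the loop it only remembers the classes of the last column it was fitted on. If the original labels are needed later, one option is to keep one encoder per column. The sketch below shows this on toy data. The `encoders` dictionary is a hypothetical helper and is not part of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (illustrative values only)
toy = pd.DataFrame({
    "Type": ["h", "u", "t", "h"],
    "Method": ["S", "SP", "S", "PI"],
})

encoders = {}   # hypothetical helper: column name -> fitted encoder
for col in ["Type", "Method"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col].to_numpy())

print(toy["Type"].tolist())                                    # [0, 2, 1, 0]
print(encoders["Type"].inverse_transform([0, 1, 2]).tolist())  # ['h', 't', 'u']
```

`LabelEncoder` assigns integers in sorted order of the labels, so the codes carry no meaning beyond identifying the category.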

In [None]:
# Select the predictors to use to predict Price and select Price as the target.
X = melb_data_lightgbm_df[[
    'Suburb', 'Rooms', 'Type', 'Method','SellerG',
    'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
    'Landsize', 'CouncilArea', 'Regionname','Propertycount', 'Month_Sold',
    'Lattitude','Longtitude'
]]

# set Price as the target
y = melb_data_lightgbm_df['Price']

# Split the dataset into ten sets
kf = KFold(n_splits=10)

# Instantiate the untrained LGBMRegressor object (replaced by the params-based instance below)
lgbm_reg = lgb.LGBMRegressor(random_state=2, n_jobs=-1, n_estimators=300)

# Parameters for the regressor; uncomment the device lines if running on GPU

params={
    "random_state":2,
    "verbose":-1,
    "n_jobs":1,
    "n_estimators":500,
    "force_col_wise":True,
#     "device":"gpu",
#     "gpu_platform_id":0,
#     "gpu_device":0
}

lgbm_reg = lgb.LGBMRegressor(**params)

# Create an empty list for storing the RMSE for each fold
kfold_RMSE_lgbm_reg = []

# Define the tuning parameters for LGBMRegressor
parameters ={'max_depth':[2,3,4],
             'num_leaves':[15,20,25],
             "min_data_in_leaf":[3,5,8],
#              "feature_fraction":[0.5]
            }

# Create an empty tuned LightGBMRegressor model
lgbm_reg_tuned = GridSearchCV(estimator=lgbm_reg, param_grid=parameters, verbose=False, n_jobs=-1)

# Iterate through each fold and calculate the RMSE
for train_index, test_index in kf.split(X):
    
    # Extract the training and test data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
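    # Fit the model, passing the categorical columns by name so they match the columns of X
    lgbm_model = lgbm_reg_tuned.fit(
        X_train, y_train,
        categorical_feature=['Suburb', 'Type', 'Method', 'SellerG',
                             'CouncilArea', 'Regionname', 'Propertycount'],
        verbose=False)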
    y_pred_lgb = lgbm_model.predict(X_test)
    
    # Calculate the RMSE for each fold and append it
    RMSE_lgm_model = mean_squared_error(y_test, y_pred_lgb, squared=False)
    kfold_RMSE_lgbm_reg.append(RMSE_lgm_model)

In [None]:
print("LightGBM | RMSE for each fold: ", kfold_RMSE_lgbm_reg)
print("\nLightGBM | Average RMSE: ", statistics.mean(kfold_RMSE_lgbm_reg))

In [None]:
print(lgbm_model.best_params_)

#### Summary
- A tuned LightGBM model was attempted using GridSearchCV.
- A few tweaks to the hyperparameters were made, and the RMSE is still hovering around the ~290000 range.
- Lattitude and Longtitude were added in the last iteration, which managed to reduce the RMSE very slightly.

# 5. Re-training The Best Model

## 5.1 Compile all the KFold RMSE into a DataFrame

In [None]:
training_models = pd.DataFrame(list(zip(kfold_RMSE_linear_reg,kfold_RMSE_rf_reg,
                                        kfold_RMSE_catboost_reg,kfold_RMSE_lgbm_reg)),
                                columns=["LinearRegression","RandomForestRegressor",
                                         "CatBoostRegressor","LightGBMRegressor"])

training_models

## 5.3 Visualise the RMSE of all Training Models

In [None]:
trace1 = go.Box(y=kfold_RMSE_linear_reg, boxmean=True, name="LinearRegression")
trace2 = go.Box(y=kfold_RMSE_rf_reg, boxmean=True, name="RandomForestRegressor")
trace3 = go.Box(y=kfold_RMSE_catboost_reg, boxmean=True, name="CatBoostRegressor")
trace4 = go.Box(y=kfold_RMSE_lgbm_reg, boxmean=True, name="LightGBMRegressor")

RMSE_models_fig = go.Figure(data=[trace1, trace2, trace3, trace4])

RMSE_models_fig.update_layout(title_text="RMSE of Training Models",
                              yaxis_title="RMSE", xaxis_title="Training Models")

#### Summary
- The LinearRegression model, as the baseline, provides a benchmark for the other models.
- The tree-based models have produced mean RMSE below 350K.
- CatBoostRegressor has the lowest mean RMSE amongst all 4 training models (~288000, section 4.3).
- LightGBMRegressor comes close, with RMSE hovering around ~290000 (section 4.4).
- We can see a strong competition between the 2 gradient boosting models.
- Since CatBoostRegressor gave us the lowest mean RMSE, we will select CatBoost as the best model.
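
When two models are this close, it helps to be explicit about which statistic is being compared. The sketch below uses hypothetical per-fold RMSE values (not the results of this notebook) to show that the model with the lowest mean RMSE is not necessarily the one with the lowest single-fold RMSE.

```python
import pandas as pd

# Hypothetical per-fold RMSE values, for illustration only
toy_rmse = pd.DataFrame({
    "ModelA": [410.0, 395.0, 420.0],
    "ModelB": [300.0, 310.0, 295.0],
    "ModelC": [305.0, 290.0, 315.0],
})

# One row per model, with the mean and the minimum over the folds
summary = toy_rmse.agg(["mean", "min"]).T
print(summary)
print("Lowest mean RMSE:", summary["mean"].idxmin())          # ModelB
print("Lowest single-fold RMSE:", summary["min"].idxmin())    # ModelC
```

The same two lines could be applied to the `training_models` dataframe from section 5.1 to back the model choice with numbers.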

## 5.4 Re-training the Best Model

In [None]:
# Select the predictors to use to predict Price and select Price as the target.
X = melb_data_catboost_df[[
    'Suburb', 'Address', 'Rooms', 'Type', 'Method', 'SellerG',
    'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
    'Landsize', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname',
    'Propertycount', 'Month_Sold'
]]

y = melb_data_catboost_df['Price']

# Instantiate the untrained CatBoost Regressor object with the best parameters
catboost_reg = CatBoostRegressor(random_state=2, depth=8, l2_leaf_reg=3, random_strength=0.4,
                                 early_stopping_rounds=10)

# CatBoost supports GPU. If running on GPU, then uncomment and run this code below.
# catboost_reg = CatBoostRegressor(random_state=2, depth=8, l2_leaf_reg=3, random_strength=0.4,
#                                  task_type="GPU", devices="0:1")


# Fit the model
catboost_model = catboost_reg.fit(X, y, cat_features=cat_features_indices, verbose=1000)