# Melbourne House Prices - Kaggle Tutorial

## Data Exploration
The first step in any machine learning project is to become familiar with the data. Pandas will be used for this project.

In [144]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor

The most important part of the Pandas library is the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). A DataFrame holds data in a similar fashion to a sheet in Excel, or a table in a SQL database.

Firstly, the file path for the CSV containing the data is saved in a variable for ease of access.

Then, the CSV is read using pd.read_csv and the data is stored in a DataFrame called melbourne_data.

The Melbourne data has some missing values (some houses for which some variables were not recorded). The simplest option, and the one used on this dataset, is to drop houses with missing values using the [DataFrame.dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.
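To make the effect of dropping rows concrete, here is a minimal sketch on a made-up frame. The `houses` DataFrame and its values are assumptions for illustration, not rows from melb_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
houses = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [1035000.0, 1465000.0, np.nan, 1876000.0],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
})

# axis=0 drops every row that contains at least one missing value.
cleaned = houses.dropna(axis=0)

print(houses.shape)         # shape before dropping
print(cleaned.shape)        # shape after dropping
print(list(cleaned.index))  # surviving rows keep their original labels
```

Because dropna keeps the original row labels, the index of the cleaned data is no longer consecutive, which is why later outputs in this tutorial show row labels such as 1, 2, 4, 6, 7.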

In [145]:
melbourne_file_path = 'melb_data.csv'

melbourne_data = pd.read_csv(melbourne_file_path)

melbourne_data = melbourne_data.dropna(axis = 0)

Overview of the "cleaned up" data.

In [146]:
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1068828.0 | 9.751097 | 3101.947708 | 2.902034 | 1.57634 | 1.573596 | 471.00694 | 141.568645 | 1964.081988 | -37.807904 | 144.990201 | 7435.489509 |
| std | 0.971079 | 675156.4 | 5.612065 | 86.421604 | 0.970055 | 0.711362 | 0.929947 | 897.449881 | 90.834824 | 38.105673 | 0.07585 | 0.099165 | 4337.698917 |
| min | 1.0 | 131000.0 | 0.0 | 3000.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.16492 | 144.54237 | 389.0 |
| 25% | 2.0 | 620000.0 | 5.9 | 3044.0 | 2.0 | 1.0 | 1.0 | 152.0 | 91.0 | 1940.0 | -37.855438 | 144.926198 | 4383.75 |
| 50% | 3.0 | 880000.0 | 9.0 | 3081.0 | 3.0 | 1.0 | 1.0 | 373.0 | 124.0 | 1970.0 | -37.80225 | 144.9958 | 6567.0 |
| 75% | 4.0 | 1325000.0 | 12.4 | 3147.0 | 3.0 | 2.0 | 2.0 | 628.0 | 170.0 | 2000.0 | -37.7582 | 145.0527 | 10175.0 |
| max | 8.0 | 9000000.0 | 47.4 | 3977.0 | 9.0 | 8.0 | 10.0 | 37000.0 | 3112.0 | 2018.0 | -37.45709 | 145.52635 | 21650.0 |


The dataset has too many variables to be analysed efficiently. One way to cull the data is to rely on existing knowledge of what is required from the dataset.

There are many ways to select a subset of the data; two approaches will be used in this project:
*   Dot notation, used to select the "prediction target"
*   Selecting with a column list, used to select the "features"

The columns of the dataset are inspected to establish the prediction target and features.

In [147]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

## Selecting The Prediction Target
A variable can be pulled out with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

The dot notation is used to select the column that is required to be predicted, which is called the prediction target. By convention, the prediction target is called **y**.
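The difference between a single column and a column list can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the Melbourne data, values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

target = toy.Price                     # dot notation returns a Series
same_target = toy["Price"]             # bracket notation selects the same column
features = toy[["Rooms", "Bathroom"]]  # a list of column names returns a DataFrame

print(type(target).__name__)
print(type(features).__name__)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation works for any column name.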

In [148]:
y = melbourne_data.Price

## Choosing Features
The columns that are input into the model (and later used to make predictions) are called "features." In this case, those would be the columns used to determine the home price. Sometimes, all columns except the target will be used as features. Other times, it is more valuable to use fewer features.

By convention, this data is called X.

For now, a model will be built with only a few features. Later on it'll be iterated and compared to models built with different features.

In [149]:
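# Columns of the Melbourne data used to train the models below
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
# Columns of the test set in test.csv, used later for predictions on new data
test_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

X = melbourne_data[melbourne_features]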


In [130]:
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [131]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


## Building the Model
The scikit-learn library is used to create the model. When coding, this library is written as sklearn, as seen in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

*   Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
*   Fit: Capture patterns from provided data. This is the heart of modeling.
*   Predict: Use the fitted model to generate predictions.
*   Evaluate: Determine how accurate the model's predictions are.

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures the same results are returned in each run. This is considered a good practice. Any number can be used and model quality won't depend meaningfully on exactly what value was chosen.
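The four steps and the effect of random_state can be sketched on synthetic data. Everything named with a `_toy` suffix below is an assumption made for illustration and is unrelated to the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 50 * X_toy[:, 0] + 20 * X_toy[:, 1] + rng.normal(0, 5, size=200)
X_new_toy = rng.uniform(0, 10, size=(20, 3))

# Define: two models with the same random_state.
model_a = DecisionTreeRegressor(random_state=1)
model_b = DecisionTreeRegressor(random_state=1)

# Fit: learn patterns from the same data.
model_a.fit(X_toy, y_toy)
model_b.fit(X_toy, y_toy)

# Predict: apply both fitted models to unseen rows.
preds_a = model_a.predict(X_new_toy)
preds_b = model_b.predict(X_new_toy)

# Evaluate: an in-sample MAE here, only to show the call.
mae_a = mean_absolute_error(y_toy, model_a.predict(X_toy))

print(np.array_equal(preds_a, preds_b))  # same seed, same data, same predictions
print(mae_a)
```

Fixing random_state makes a run repeatable; it does not make the model better or worse in any systematic way.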

In [132]:
melbourne_model = DecisionTreeRegressor(random_state = 1)

melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

## Model Predictions
The model is now fitted and can be used to make predictions.

In practice, predictions would be made for new houses coming on the market rather than for houses whose prices are already in the dataset. However, predictions will be made for the first few rows of the training data to see how the predict function works.

In [133]:
print('Making predictions for the following 5 houses')
print(X.head())
print('The predictions are:')
print(melbourne_model.predict(X.head()))
print('The data for these houses for comparison to what the model predicted:')
print(melbourne_data.head())

Making predictions for the following 5 houses
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are:
[1035000. 1465000. 1600000. 1876000. 1636000.]
The data for these houses for comparison to what the model predicted:
       Suburb          Address  Rooms Type      Price Method SellerG  \
1  Abbotsford  25 Bloomburg St      2    h  1035000.0      S  Biggin   
2  Abbotsford     5 Charles St      3    h  1465000.0     SP  Biggin   
4  Abbotsford      55a Park St      4    h  1600000.0     VB  Nelson   
6  Abbotsford     124 Yarra St      3    h  1876000.0      S  Nelson   
7  Abbotsford    98 Charles St      2    h  1636000.0      S  Nelson   

        Date  Distance  Postcode  ...  Bathroom  Car  Landsize  Buildin

## Model Validation
In most (though not all) applications, the relevant measure of model quality is **predictive accuracy**.

There are many metrics for summarising model quality, a common one being **Mean Absolute Error** (or **MAE**). The metric starts from the prediction error for each house: *error = actual - predicted*

The MAE metric takes the absolute value of each error, thus converting each error to a positive number. It then takes the average of those absolute errors and this is the measure of model quality. In other words, this translates to "On average, the predictions are off by about X".
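As a worked example, the calculation can be done by hand and compared with scikit-learn's result. The three predicted prices below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([1035000.0, 1465000.0, 1600000.0])
predicted = np.array([1000000.0, 1500000.0, 1650000.0])  # hypothetical predictions

errors = actual - predicted            # 35000, -35000, -50000
absolute_errors = np.abs(errors)       # 35000, 35000, 50000
manual_mae = absolute_errors.mean()    # (35000 + 35000 + 50000) / 3

print(manual_mae)
print(mean_absolute_error(actual, predicted))
```

Both values agree: on average, these predictions are off by about 40,000.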

The mean absolute error metric can be imported as *from sklearn.metrics import mean_absolute_error*.

In [134]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

A common mistake when assessing predictive accuracy is making predictions with the training data and comparing those predictions to the target values in the training data. This measure is called an "in-sample" score. The problem with in-sample scores is that the model learns its patterns from the training set, including patterns that hold only in that particular dataset and are not actually important for the prediction in question. Testing on the training set therefore risks overstating predictive accuracy.

Consequently, since models' practical value comes from making predictions on new data, performance needs to be measured on data that wasn't used to build the model. The most straightforward way to do this is to exclude a portion of the data from the model-building process, and then use it to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.

The scikit-learn library has a function *train_test_split* that breaks the data into two pieces: one dataset used for training and the other used for validation.
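A short sketch of what the split returns, using a made-up frame of 100 rows. The `X_toy` and `y_toy` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows.
X_toy = pd.DataFrame({
    "Rooms": np.arange(100) % 5 + 1,
    "Landsize": np.arange(100) * 10.0,
})
y_toy = pd.Series(np.arange(100) * 1000.0, name="Price")

# With no test_size given, a quarter of the rows is held out for validation.
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

print(len(train_X_toy), len(val_X_toy))
# Features and target are split on the same rows, so their labels line up.
print(train_X_toy.index.equals(train_y_toy.index))
```

The rows are shuffled before splitting, and random_state makes that shuffle repeatable.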

In [135]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

val_predictions = melbourne_model.predict(val_X)
mean_absolute_error(val_y, val_predictions)

274541.772326232

## Model Validation Methods Comparison
Comparing the two validation results shows that the error rose dramatically, from about $1,116 in-sample to about $274,542 on the held-out validation data. This demonstrates the risk of validating a model on in-sample data.

There are many ways to improve an underperforming model, such as experimenting with different features that could be more appropriate for the dataset in question or different model types.

## Model Optimisation

Having a reliable way to measure model accuracy makes it possible to experiment with different models and establish which one best predicts the target for a given dataset. But what alternatives are there for models?

The scikit-learn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) lists many options for the decision tree model used in this project.

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, there are 4 groups of houses. Splitting each of those again creates 8 groups. If the number of groups keeps doubling with each added level of splits, there are 2^10 groups of houses by the 10th level. That's 1024 leaves.
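The doubling can be checked with a few lines of arithmetic, assuming every group is split in two at every level:

```python
# Number of groups (leaves) when every group is split in two at each level.
for depth in (1, 2, 3, 10):
    print(depth, 2 ** depth)

leaves_at_depth_10 = 2 ** 10
```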

When the houses are divided amongst many leaves, fewer houses are found in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
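This effect can be sketched on synthetic noisy data by comparing the in-sample and validation error of a tree that is allowed to grow without limits. The data and the `_toy` names below are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus noise (an assumption for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 50, size=500)

train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

# No depth or leaf limit: the tree keeps splitting until leaves are pure.
deep_tree = DecisionTreeRegressor(random_state=0).fit(train_X_toy, train_y_toy)

train_mae = mean_absolute_error(train_y_toy, deep_tree.predict(train_X_toy))
val_mae = mean_absolute_error(val_y_toy, deep_tree.predict(val_X_toy))

print(deep_tree.get_n_leaves())  # number of leaves in the unrestricted tree
print(train_mae, val_mae)        # in-sample error vs. validation error
```

The unrestricted tree reproduces the training targets almost exactly but does noticeably worse on rows it has not seen, which is the overfitting pattern described above.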

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting.

Furthermore, a utility function can be used to help compare MAE scores from different values for max_leaf_nodes.

In [136]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes leaves and return its validation MAE.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

A for-loop can be used to compare the accuracy of models built with different values for max_leaf_nodes.

In [137]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271996
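In the output above, 500 leaf nodes gives the lowest validation error. Rather than reading the best value off the printout, it can be selected programmatically. The sketch below assumes synthetic data and introduces the hypothetical names `candidates`, `scores` and `best_size`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Same helper as above: validation MAE for a tree with a leaf limit.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 50, size=500)
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

candidates = [5, 50, 500, 5000]
# Map each candidate size to its validation MAE.
scores = {
    n: get_mae(n, train_X_toy, val_X_toy, train_y_toy, val_y_toy)
    for n in candidates
}
# Pick the size with the smallest validation MAE.
best_size = min(scores, key=scores.get)

print(scores)
print(best_size)
```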


## Fit the Optimised Model
With the model now optimised, based on the best number of leaf nodes found for this dataset, it can be built with the whole dataset to use the full volume of data and maximise accuracy. Part of the data no longer needs to be held out for validation, as all the modelling decisions have been made.

In [138]:
final_model = DecisionTreeRegressor(max_leaf_nodes = 500, random_state = 0)

final_model.fit(X, y)

DecisionTreeRegressor(max_leaf_nodes=500, random_state=0)

## Random Forests Algorithm
Decision trees create a dilemma.  A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will underfit and perform poorly because it fails to capture as many distinctions in the raw data. 

Even today's most sophisticated modelling techniques face this tension between overfitting and underfitting. However, many models have clever ideas that can lead to better performance.

The **random forest** modelling technique uses many trees and it makes a prediction by averaging the predictions of each component tree. In general, it has much better predictive accuracy than a single decision tree and it works well with default parameters (unlike more complex models that are sensitive to their hyperparameters and require tuning). 
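The averaging can be checked directly: a fitted RandomForestRegressor exposes its component trees in `estimators_`, and the mean of their predictions should match the forest's own prediction. The data below is synthetic and the `_toy` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = 100 * X_toy[:, 0] - 30 * X_toy[:, 2] + rng.normal(0, 20, size=300)
X_new_toy = rng.uniform(0, 10, size=(5, 3))

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_toy, y_toy)

# One row of predictions per component tree.
per_tree = np.array([tree.predict(X_new_toy) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(averaged, forest.predict(X_new_toy)))
```

Individual trees disagree with each other, and averaging smooths out their individual errors.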

In [139]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_X, train_y)

melb_preds = forest_model.predict(val_X)

forest_mae = mean_absolute_error(val_y, melb_preds)

print('Mean Absolute Error is: %d' %(forest_mae))

Mean Absolute Error is: 207190


There is likely room for improvement; however, even with default parameters the random forest algorithm delivered a significant improvement over the best tuned decision tree model (a validation MAE of 207,190 versus 261,718).

Now that the mean absolute error for the random forest algorithm has been established, the model can be trained on the full dataset.

In [140]:
rf_model_full_dataset = RandomForestRegressor(random_state = 1)

rf_model_full_dataset.fit(X, y)

RandomForestRegressor(random_state=1)

## Predictions on New Data
The 'test.csv' file has a set of new data that can be used to make predictions with the trained model.

In [141]:
test_data_path = 'test.csv'
test_data = pd.read_csv(test_data_path)

With the data imported, the prediction dataset with only the required features can be created.

In [142]:
test_X = test_data[test_features]

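Finally, the model can be used to make predictions on the test data. The columns passed to predict must be the same ones the model was fitted on. Here the model was fitted on the five columns in melbourne_features, while test_X contains the seven columns in test_features, so the call below raises the ValueError shown.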

In [143]:
test_predictions = rf_model_full_dataset.predict(test_X)

ValueError: X has 7 features, but DecisionTreeRegressor is expecting 5 features as input.
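One way to avoid this mismatch is to select the training and test columns with the same feature list. The sketch below uses made-up frames rather than test.csv; `train_toy`, `test_toy`, `make_frame` and the three-column `feature_names` list are hypothetical, and the tutorial itself does not show a matching training file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical subset of test_features, kept short for the example.
feature_names = ["LotArea", "YearBuilt", "FullBath"]


def make_frame(n_rows):
    # Build a made-up frame with the three feature columns.
    return pd.DataFrame({
        "LotArea": rng.uniform(2000, 15000, size=n_rows),
        "YearBuilt": rng.integers(1900, 2011, size=n_rows),
        "FullBath": rng.integers(1, 4, size=n_rows),
    })


train_toy = make_frame(100)
train_toy["SalePrice"] = 50 * train_toy["LotArea"] + 1000 * train_toy["FullBath"]
test_toy = make_frame(10)

model = RandomForestRegressor(n_estimators=10, random_state=1)
# Fit and predict with the same column list, in the same order.
model.fit(train_toy[feature_names], train_toy["SalePrice"])
preds = model.predict(test_toy[feature_names])

print(model.n_features_in_)  # number of features seen during fit
print(preds.shape)
```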

## Kaggle Competitions
The code below demonstrates how predictions can be saved in the format used for competition scoring and thus be correctly submitted.  

In [85]:
output = pd.DataFrame({'Id': test_data.Id, 'SalePrice': test_predictions})
output.to_csv('submission.csv', index=False)

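To show what the submission file looks like, the sketch below writes the same two-column layout to an in-memory buffer instead of a file. The Id values and predictions are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions, for illustration only.
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([120000.0, 155000.5, 180250.0])

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# index=False leaves out the DataFrame's row labels, so only the two columns are written.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```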