# Housing Price Prediction with Decision Tree and Random Forest

This is an introductory ML project from [Kaggle's Micro Courses](https://www.kaggle.com/learn/intro-to-machine-learning).

In this project, I used Random Forest and Decision Trees to predict the housing prices in Melbourne, Australia.

In the end, I compared the two models using MAE (Mean Absolute Error).

## 1. Importing Pandas

In [1]:
import pandas as pd

## 2. Loading Data

In [2]:
path = 'melb_data.csv'
melb_data = pd.read_csv(path)

In [3]:
# This is what the data looks like
melb_data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


In [4]:
# A brief summary of the data
melb_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


## 3. Dropping Rows with Any Missing Values

In [5]:
melb_data = melb_data.dropna(axis=0)

In [6]:
# After the data cleansing
melb_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1068828.0 | 9.751097 | 3101.947708 | 2.902034 | 1.57634 | 1.573596 | 471.00694 | 141.568645 | 1964.081988 | -37.807904 | 144.990201 | 7435.489509 |
| std | 0.971079 | 675156.4 | 5.612065 | 86.421604 | 0.970055 | 0.711362 | 0.929947 | 897.449881 | 90.834824 | 38.105673 | 0.07585 | 0.099165 | 4337.698917 |
| min | 1.0 | 131000.0 | 0.0 | 3000.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.16492 | 144.54237 | 389.0 |
| 25% | 2.0 | 620000.0 | 5.9 | 3044.0 | 2.0 | 1.0 | 1.0 | 152.0 | 91.0 | 1940.0 | -37.855438 | 144.926198 | 4383.75 |
| 50% | 3.0 | 880000.0 | 9.0 | 3081.0 | 3.0 | 1.0 | 1.0 | 373.0 | 124.0 | 1970.0 | -37.80225 | 144.9958 | 6567.0 |
| 75% | 4.0 | 1325000.0 | 12.4 | 3147.0 | 3.0 | 2.0 | 2.0 | 628.0 | 170.0 | 2000.0 | -37.7582 | 145.0527 | 10175.0 |
| max | 8.0 | 9000000.0 | 47.4 | 3977.0 | 9.0 | 8.0 | 10.0 | 37000.0 | 3112.0 | 2018.0 | -37.45709 | 145.52635 | 21650.0 |
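
The row count falls from 13580 to 6196 because `dropna(axis=0)` removes every row with at least one missing value, and columns such as `BuildingArea` and `YearBuilt` had many gaps. The sketch below shows the same behaviour on a small toy DataFrame. The toy values and the names `toy`, `missing_per_column`, and `cleaned` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 142.0],
    "YearBuilt": [1900.0, None, 2014.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that has at least one missing value
cleaned = toy.dropna(axis=0)
print(len(toy), "->", len(cleaned))
```

Checking `isna().sum()` first is one way to see which columns cause most of the lost rows before deciding to drop them.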


## 4. Splitting the Data into Training Sets and Validation Sets

In [7]:
from sklearn.model_selection import train_test_split
y = melb_data.Price # Prediction Target
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
            'YearBuilt', 'Lattitude', 'Longtitude']
X = melb_data[features] # Features

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
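
The call above does not pass `test_size`, so scikit-learn falls back to its default and holds out 25% of the rows for validation. A minimal sketch on toy lists shows the resulting sizes. The names `X_toy`, `y_toy`, and `again` are assumptions made for this example only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs: 8 rows, one feature each
X_toy = [[i] for i in range(8)]
y_toy = [10 * i for i in range(8)]

# With no test_size given, scikit-learn holds out 25% of the rows
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)
print(len(train_X_toy), len(val_X_toy))

# The same random_state reproduces the same split
again = train_test_split(X_toy, y_toy, random_state=0)
print(again[1] == val_X_toy)
```

Fixing `random_state` is what makes the MAE figures later in the notebook repeatable from run to run.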

## 5. Build, Train, and Predict

In [8]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# max_leaf_nodes caps the number of leaves, which controls how complex the tree can get.
# This function fits one tree per candidate size (5, 50, 500, and 5000 leaf nodes)
# and returns the lowest validation MAE.
def decision_tree_best_mae(train_X, val_X, train_y, val_y):
    min_mae = float('inf')
    for num in [5, 50, 500, 5000]:
        # 5-1 Decision Tree
        model = DecisionTreeRegressor(max_leaf_nodes=num, random_state=0)
        model.fit(train_X, train_y) # Train our model
        preds_val = model.predict(val_X) # Make a prediction
        mae = mean_absolute_error(val_y, preds_val)
        min_mae = min(min_mae, mae)
    return min_mae
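
The function above reports only the lowest MAE, not which tree size produced it. One possible variation, sketched below, also returns the winning `max_leaf_nodes` value. The helper `best_leaf_count` and the synthetic data are hypothetical. They stand in for the Melbourne data so the sketch runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def best_leaf_count(train_X, val_X, train_y, val_y, candidates=(5, 50, 500, 5000)):
    """Return (best max_leaf_nodes, its validation MAE). Hypothetical helper."""
    scores = {}
    for num in candidates:
        model = DecisionTreeRegressor(max_leaf_nodes=num, random_state=0)
        model.fit(train_X, train_y)
        scores[num] = mean_absolute_error(val_y, model.predict(val_X))
    # Pick the candidate with the smallest validation error
    best = min(scores, key=scores.get)
    return best, scores[best]

# Synthetic stand-in data so the sketch runs without melb_data.csv
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1, size=200)

# train_test_split returns train_X, val_X, train_y, val_y in that order
split = train_test_split(X_demo, y_demo, random_state=0)
best_size, best_mae = best_leaf_count(*split)
print(best_size, best_mae)
```

Knowing the best size would make it possible to refit a single final tree with that setting, which the notebook itself does not do.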

In [9]:
# 5-2 Random Forest
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state = 0)
forest_model.fit(train_X, train_y) # Train our model
forest_preds = forest_model.predict(val_X) # Make a prediction.

## 6. Compare the Models!

In [10]:
print('Random Forest MAE: %d' % (mean_absolute_error(val_y, forest_preds)))
print('Decision Tree MAE: %d' % (decision_tree_best_mae(train_X, val_X, train_y, val_y)))

Random Forest MAE: 193528
Decision Tree MAE: 243495


## 7. Conclusion

As the name suggests, MAE (Mean Absolute Error) is the average of the absolute differences between the actual values and the predictions.
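
The definition can be written out directly. The sketch below computes MAE by hand in plain Python. The function name `mean_absolute_error_manual` and the prices are hypothetical and chosen only for illustration. The notebook itself uses `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)

# Hypothetical prices, chosen only for illustration
actual_prices = [500000, 750000, 1000000]
predicted_prices = [450000, 800000, 1200000]

# Absolute errors are 50000, 50000, and 200000, so the mean is 100000.0
print(mean_absolute_error_manual(actual_prices, predicted_prices))
```

Because the errors are in the same unit as the target, an MAE of 193528 means the predictions are off by roughly that many dollars on average.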

| | Decision Tree | Random Forest |
| --- | :--- | :--- |
| MAE | 243495 | 193528 |


As the results show, Random Forest performed considerably better than our Decision Tree: its MAE is about 50,000 lower (193528 vs. 243495).
This makes sense, because a Random Forest is an ensemble that trains many Decision Trees and averages their predictions.
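
To see why averaging helps, here is a toy sketch in plain Python. The three lists of per-tree predictions are made up, and the example leaves out how a real forest makes its trees differ (for example, by training each one on a bootstrap sample of the rows). It only illustrates the averaging step.

```python
# Hypothetical predictions from three trees for the same two houses
actual = [900000, 600000]
tree_predictions = [
    [1000000, 500000],
    [850000, 700000],
    [940000, 630000],
]

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Forest-style prediction: average the trees' outputs for each house
averaged = [sum(col) / len(col) for col in zip(*tree_predictions)]

single_tree_maes = [mae(actual, preds) for preds in tree_predictions]
mean_single_mae = sum(single_tree_maes) / len(single_tree_maes)
print(single_tree_maes, mean_single_mae, mae(actual, averaged))
```

In this toy case the trees err in different directions, so their errors partly cancel. In general, the MAE of the averaged prediction is never worse than the average of the individual trees' MAEs, although it is not guaranteed to beat the single best tree.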

This is the end of this project.