In [4]:
import pandas as pd

In [5]:
df = pd.read_csv("melb_data.csv")

In [6]:
df.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


### 1. Selecting The Prediction Target

The column we want to predict is called the prediction target, and by convention it is named `y`. Here the target is the home price, stored in the `Price` column.

The columns passed into our model (and later used to make predictions) are called "features." In our case, those are the columns used to determine the home price. Sometimes you will use every column except the target as a feature. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.



In [9]:
y = df.Price
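As a small aside on the cell above, dot notation (`df.Price`) and bracket notation (`df["Price"]`) return the same column. The sketch below uses a made-up three-row frame named `toy` (an assumption for illustration, not the Melbourne data) to show the equivalence and to count missing target values.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; the values are invented.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [850000.0, 1035000.0, 1600000.0],
})

y_dot = toy.Price          # attribute access, as in the cell above
y_bracket = toy["Price"]   # bracket access, works for any column name

# Both forms select the same Series.
same = y_dot.equals(y_bracket)

# Rows with a missing target cannot be used for training, so count them.
missing_targets = int(y_bracket.isna().sum())
print(same, missing_targets)
```

Bracket notation is the safer habit when a column name contains spaces or clashes with a DataFrame method name.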

### 2. Choosing "Features"




In [10]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

In [11]:
X = df[features]

In [12]:
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1.534242 | 558.416127 | -37.809203 | 144.995216 |
| std | 0.955748 | 0.691712 | 3990.669241 | 0.07926 | 0.103916 |
| min | 1.0 | 0.0 | 0.0 | -38.18255 | 144.43181 |
| 25% | 2.0 | 1.0 | 177.0 | -37.856822 | 144.9296 |
| 50% | 3.0 | 1.0 | 440.0 | -37.802355 | 145.0001 |
| 75% | 3.0 | 2.0 | 651.0 | -37.7564 | 145.058305 |
| max | 10.0 | 8.0 | 433014.0 | -37.40853 | 145.52635 |
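The `count` row is 13580.0 for every selected column, which suggests none of these five features has missing values. A direct way to check is `isna().sum()`. The sketch below uses a hypothetical four-row frame with one deliberately planted `NaN` to show how the two views relate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaN in Landsize is planted on purpose.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, np.nan, 134.0, 94.0],
})

# describe() reports the number of non-missing entries per column...
counts = toy.describe().loc["count"]

# ...while isna().sum() counts the missing entries directly.
missing = toy.isna().sum()

print(counts)
print(missing)
```

A column whose `count` is lower than the number of rows has gaps that need handling before it is used as a feature.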


In [13]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 202.0 | -37.7996 | 144.9984 |
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 3 | 3 | 2.0 | 94.0 | -37.7969 | 144.9969 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |


### 3. Building Your Model

The steps to building and using a model are:

>1. Define: What type of model will it be? A decision tree? Some other type of model? Other parameters of the chosen model type are specified at this step too.

>2. Fit: Capture patterns from provided data. This is the heart of modeling.

>3. Predict: Just what it sounds like.

>4. Evaluate: Determine how accurate the model's predictions are.

In [14]:
from sklearn.tree import DecisionTreeRegressor

price_model = DecisionTreeRegressor(random_state=1)

price_model.fit(X,y)

DecisionTreeRegressor(random_state=1)

In [15]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 202.0 | -37.7996 | 144.9984 |
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 3 | 3 | 2.0 | 94.0 | -37.7969 | 144.9969 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |


In [17]:
price_model.predict(X.head())

array([1480000., 1035000., 1465000.,  850000., 1600000.])
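These five predictions equal the actual prices of the first five rows exactly. That is expected rather than impressive: the model is predicting rows it was fitted on, and an unrestricted decision tree can memorize its training data. The sketch below illustrates this on synthetic data (the arrays `X_toy` and `y_toy` are assumptions for illustration), where the target is pure noise with nothing to learn, yet the in-sample error is still essentially zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is random noise, unrelated to the features.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = rng.normal(size=200)

# An unrestricted tree keeps splitting until each leaf holds one sample.
tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_toy, y_toy)

# Scoring on the rows used for fitting: the tree reproduces them.
in_sample_mae = mean_absolute_error(y_toy, tree.predict(X_toy))
print(in_sample_mae)
```

A near-zero error on training rows therefore says little about how the model will behave on homes it has not seen.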

### 4. Model Validation



**There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.**


#### A. A model scored on the same data it was trained on looks better than it really is, so we split our data into train and test datasets using `train_test_split`.
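As a rough illustration of what the split produces, the sketch below runs `train_test_split` on small placeholder arrays (`X_toy` and `y_toy` are assumptions, not the housing data). With no `test_size` or `train_size` given, scikit-learn holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 feature columns.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# random_state fixes the shuffle so the split is reproducible.
train_X, test_X, train_y, test_y = train_test_split(X_toy, y_toy, random_state=1)

print(train_X.shape, test_X.shape)

# No row lands in both sets.
overlap = set(train_y) & set(test_y)
print(len(overlap))
```

Passing the same `random_state` on every run keeps the train and test rows fixed, which makes scores comparable between experiments.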


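#### B. The prediction error for each house is: error = actual − predicted

Taking the absolute value of each error and averaging over all houses gives the MAE: MAE = mean(|actual − predicted|). With scikit-learn:

> mean_absolute_error(test_y, predictions)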
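To make the metric concrete, here is a minimal sketch that computes MAE by hand with NumPy and compares it with `mean_absolute_error`. The four `actual` and `predicted` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented values, chosen so the arithmetic is easy to follow.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

# Per-row error: actual minus predicted.
errors = actual - predicted

# Absolute values stop positive and negative errors from cancelling out.
mae_by_hand = np.mean(np.abs(errors))

mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)
```

The absolute errors are 10, 10, 30 and 0, so both computations give 12.5: on average, a prediction is off by 12.5 units.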


In [21]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 1)

# Define model

price_model = DecisionTreeRegressor(random_state = 1)

# Fit model

price_model.fit(train_X, train_y)

# Get predicted prices on the held-out test data

predictions = price_model.predict(test_X)

In [22]:
from sklearn.metrics import mean_absolute_error

# calculate mean_absolute_error

mean_absolute_error(test_y, predictions)


241632.16966126655
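An MAE of about 241,632 means that, on the held-out homes, the predicted price is off by roughly that amount on average. One way this number becomes useful is as a yardstick for comparing models built with different features. The sketch below is one possible way to do that, not a prescribed method: the helper `score_features` and the synthetic frame `toy` are hypothetical names introduced here, and the data is generated rather than read from `melb_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def score_features(data, feature_names, target_name):
    """Return the test-set MAE of a decision tree trained on the given columns."""
    train_X, test_X, train_y, test_y = train_test_split(
        data[feature_names], data[target_name], random_state=1
    )
    model = DecisionTreeRegressor(random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))


# Synthetic stand-in data: price depends on both columns plus noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=300),
    "Landsize": rng.uniform(50, 1000, size=300),
})
toy["Price"] = (
    200000 * toy["Rooms"] + 500 * toy["Landsize"] + rng.normal(0, 20000, size=300)
)

# Same split and same model settings, different feature lists.
results = {
    "Rooms only": score_features(toy, ["Rooms"], "Price"),
    "Rooms + Landsize": score_features(toy, ["Rooms", "Landsize"], "Price"),
}
for name, mae in results.items():
    print(f"{name}: {mae:,.0f}")
```

Because the split uses the same `random_state` for every feature list, differences in MAE come from the features rather than from which rows happened to land in the test set.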