## Recap
We've built a model. In this exercise we'll test how good our model is.

Here is where the previous exercise left off.

In [2]:
# Code we have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


# Exercises

## Step 1: Split Data
Use the `train_test_split` function to split the data into training and validation sets.

Give it the argument `random_state=1` so the `check` functions know what to expect when verifying our code.

Recall, our features are loaded in the DataFrame **X** and our target is loaded in **y**.
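
To make the idea of a split concrete, here is a minimal sketch of a shuffled hold-out split written with the standard library only. It is a simplified stand-in, not scikit-learn's implementation: `simple_holdout_split` and the toy data are hypothetical names and values introduced for illustration, and the function will not select the same rows as `train_test_split`. The 25% validation fraction mirrors scikit-learn's default when neither `test_size` nor `train_size` is given.

```python
import random

def simple_holdout_split(rows, targets, val_fraction=0.25, seed=1):
    """Hypothetical helper: shuffle row positions, then hold out a fraction."""
    positions = list(range(len(rows)))
    # A fixed seed gives the same shuffle, and so the same split, on every run
    random.Random(seed).shuffle(positions)
    n_val = round(len(rows) * val_fraction)
    val_pos, train_pos = positions[:n_val], positions[n_val:]

    def pick(seq, pos):
        return [seq[i] for i in pos]

    # Features and targets are selected with the same positions, so they stay aligned
    return (pick(rows, train_pos), pick(rows, val_pos),
            pick(targets, train_pos), pick(targets, val_pos))

# Made-up toy data, not the Iowa housing file
toy_X = [[i, i * 2] for i in range(8)]
toy_y = [i * 100 for i in range(8)]

toy_train_X, toy_val_X, toy_train_y, toy_val_y = simple_holdout_split(toy_X, toy_y)
print(len(toy_train_X), len(toy_val_X))  # 6 training rows, 2 validation rows
```

The fixed seed plays the same role as `random_state=1`: rerunning the split yields the same rows in each set, which is what makes results reproducible and checkable.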


In [3]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

print("train_X :-")
print(train_X)

print("train_y :-")
print(train_y)

print("val_X :-")
print(val_X)

print("val_y :-")
print(val_y)

train_X :-
      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
6       10084       2004      1694         0         2             3   
807     21384       1923      1072       504         1             3   
955      7136       1946       979       979         2             4   
1040    13125       1957      1803         0         2             3   
701      9600       1969      1164         0         1             3   
...       ...        ...       ...       ...       ...           ...   
715     10140       1974      1350         0         2             3   
905      9920       1954      1063         0         1             3   
1096     6882       1914       773       582         1             3   
235      1680       1971       483       504         1             2   
1061    18000       1935       894         0         1             2   

      TotRmsAbvGrd  
6                7  
807              6  
955              8  
1040             8  
701              6 

## Step 2: Specify and Fit the Model

Create a `DecisionTreeRegressor` model and fit it to the training data.
Set `random_state` to 1 again when creating the model.

In [4]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

## Step 3: Make Predictions with Validation Data


In [5]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

print(val_predictions)

[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000.
 262000. 180000. 121000. 175900. 210000. 248900. 131000. 100000. 149350.
 235000. 156000. 149900. 265979. 193500. 377500. 100000. 162900. 145000.
 180000. 582933. 146000. 140000.  91500. 112500. 113000. 145000. 312500.
 110000. 132000. 305000. 128000. 162900. 115000. 110000. 124000. 215200.
 180000.  79000. 192000. 282922. 235000. 132000. 325000.  80000. 237000.
 208300. 100000. 120500. 162000. 153000. 187000. 185750. 335000. 129000.
 124900. 185750. 133700. 127000. 230000. 146800. 157900. 136000. 153575.
 335000. 177500. 143000. 202500. 168500. 105000. 305900. 192000. 190000.
 140200. 134900. 128950. 213000. 108959. 149500. 190000. 175900. 160000.
 250580. 157000. 120500. 147500. 118000. 117000. 110000. 130000. 148500.
 148000. 190000. 130500. 127000. 120500. 135000. 168000. 176432. 128000.
 147000. 260000. 132000. 129500. 171000. 181134. 227875. 189000. 282922.
  94750. 185000. 194000. 159000. 279500. 290000. 13

Inspect your predictions and actual values from validation data.

In [6]:
# print the top few validation predictions
print(val_predictions.tolist()[:6])
# print the top few actual prices from validation data
print(val_y.tolist()[:6])

[186500.0, 184000.0, 130000.0, 92000.0, 164500.0, 220000.0]
[231500, 179500, 122000, 84500, 142000, 325624]


What do you notice that is different from the in-sample predictions (printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
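
As a hint, the sketch below uses a toy "model" that simply memorizes its training rows, which is roughly what an unrestricted decision tree does when every training row is distinct. Everything here is an assumption made for illustration: the prices and features are made up, and `predict` and `mean_abs_error` are hypothetical helpers, not part of scikit-learn.

```python
# Made-up (LotArea, YearBuilt) -> SalePrice pairs, not taken from the Iowa data
train_rows = {(8000, 2000): 200000, (9500, 1975): 150000, (11000, 2002): 250000}
val_rows = {(9000, 1990): 180000, (12000, 2005): 260000}

# Fallback guess for homes the model has never seen: the average training price
fallback = sum(train_rows.values()) / len(train_rows)

def predict(features):
    """Return the memorized price if the home was in training, else the fallback."""
    return train_rows.get(features, fallback)

def mean_abs_error(rows):
    """Average of |actual - predicted| over the given rows."""
    return sum(abs(price - predict(feats)) for feats, price in rows.items()) / len(rows)

in_sample_error = mean_abs_error(train_rows)
validation_error = mean_abs_error(val_rows)
print(in_sample_error)   # 0.0: every training home is reproduced exactly
print(validation_error)  # 40000.0: unseen homes are not
```

The in-sample score looks perfect only because the model has already seen those answers, which is why the validation predictions give the more honest picture.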

## Step 4: Calculate the Mean Absolute Error in Validation Data


In [7]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

print(val_mae)

29652.931506849316
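
To see what `mean_absolute_error` computes, the sketch below works through the calculation by hand for the six validation homes printed earlier in Step 3. It uses plain Python rather than scikit-learn, and the variable names are chosen here for illustration.

```python
# First six validation predictions and actual prices, copied from the Step 3 output
predicted = [186500.0, 184000.0, 130000.0, 92000.0, 164500.0, 220000.0]
actual = [231500, 179500, 122000, 84500, 142000, 325624]

# Absolute error for each home: |actual - predicted|
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
print(abs_errors)

# MAE is the average of those absolute errors
mae_first_six = sum(abs_errors) / len(abs_errors)
print(round(mae_first_six, 2))
```

This gives about 32187.33 for these six homes. It differs from the `val_mae` above because that value averages over every row in the validation set, not just the first six.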
