**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/model-validation).**

---


## Recap
You've built a model. In this exercise you will test how good your model is.

Run the cell below to set up your coding environment where the previous exercise left off.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice  # Target to predict

feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]  # Available features

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

# In-sample predictions: the model is fit and evaluated on the same data,
# with no split into training and validation sets
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex4 import *
print("Setup Complete")

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]
Setup Complete


# Exercises

## Step 1: Split Your Data
Use the `train_test_split` function to split up your data.

Give it the argument `random_state=1` so the `check` functions know what to expect when verifying your code.

Recall, your features are loaded in the DataFrame **X** and your target is loaded in **y**.
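
As a rough illustration of what the split does, here is a minimal sketch on a made-up eight-row dataset. The `toy_` names and values are hypothetical and are not part of the course data; the only library call used is `train_test_split` itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows, one feature, one target
toy_X = pd.DataFrame({"LotArea": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]})
toy_y = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="SalePrice")

# Same random_state -> same shuffle every time the cell is run
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)

# With the default test_size of 0.25, 2 of the 8 rows go to validation
print(toy_train_X.shape, toy_val_X.shape)

# Features and targets are shuffled together, so their indexes still match
print(toy_val_X.index.equals(toy_val_y.index))
```

The features and the target are passed in together so that each validation row keeps its matching target value.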


In [2]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split data into training and validation sets.
# test_size=0.25 matches the default: a quarter of the rows go to validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.25)

# Check your answer
step_1.check()


<span style="color:#33cc33">Correct</span>

In [3]:
val_X

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
258,12435,2001,963,829,2,3,7
267,8400,1939,1052,720,2,4,8
288,9819,1967,900,0,1,3,5
649,1936,1970,630,0,1,1,3
1233,12160,1959,1188,0,1,3,6
...,...,...,...,...,...,...,...
1017,5814,1984,1360,0,1,1,4
534,9056,2004,707,707,2,3,6
1334,2368,1970,765,600,1,3,7
1369,10635,2003,1668,0,2,3,8


In [4]:
train_X

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
6,10084,2004,1694,0,2,3,7
807,21384,1923,1072,504,1,3,6
955,7136,1946,979,979,2,4,8
1040,13125,1957,1803,0,2,3,8
701,9600,1969,1164,0,1,3,6
...,...,...,...,...,...,...,...
715,10140,1974,1350,0,2,3,7
905,9920,1954,1063,0,1,3,6
1096,6882,1914,773,582,1,3,7
235,1680,1971,483,504,1,2,5


In [5]:
# The lines below will show you a hint or the solution.
# step_1.hint() 
# step_1.solution()


## Step 2: Specify and Fit the Model

Create a `DecisionTreeRegressor` model and fit it to the relevant data.
Set `random_state` to 1 again when creating the model.
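
To get a feel for what fitting produces, the sketch below trains a tree on synthetic data and inspects it. The `demo_` names and the data are assumptions made up for illustration; `get_depth()` and `get_n_leaves()` are methods of scikit-learn's decision tree estimators.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: 50 rows, 2 features, noisy linear target
rng = np.random.RandomState(0)
demo_X = rng.uniform(0, 10, size=(50, 2))
demo_y = demo_X[:, 0] * 3 + rng.normal(0, 1, size=50)

# Before .fit() the estimator only holds its settings; after .fit() it holds a tree
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(demo_X, demo_y)

print("Depth:", demo_model.get_depth())
print("Leaves:", demo_model.get_n_leaves())

# With no depth limit, the tree keeps splitting until it reproduces
# the training targets
print("Matches training targets:", np.allclose(demo_model.predict(demo_X), demo_y))
```

An unrestricted tree can memorize its training rows, which is why in-sample predictions alone say little about how good the model is.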

In [6]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)
# This creates an empty, untrained decision tree model.
# No data has been learned yet; think of it as buying a notebook
# without writing in it.

# What is stored in iowa_model right now?
# At this point the model knows nothing: it has no rules and no splits.
# It is just a decision tree structure waiting to be trained.

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)
# This tells the model: "Here are the inputs (train_X) and the correct
# answers (train_y). Learn the pattern that maps X -> y."
# Each row in train_X matches one value in train_y.

# What does the decision tree do during .fit()?
# Internally, the tree:
#   1. Looks at all the training data
#   2. Chooses the best feature to split on
#   3. Splits the data to reduce prediction error
#   4. Repeats the splitting again and again
#   5. Stops when the data is pure, the maximum depth is reached,
#      or no improvement is possible

# After .fit() returns, the model has learned its rules, the tree
# structure is built, and the model is ready to predict.

# Check your answer
step_2.check()


[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000.
 262000.]


<span style="color:#33cc33">Correct</span>

In [7]:
# step_2.hint()
# step_2.solution()

## Step 3: Make Predictions with Validation Data


In [8]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# The model has never seen val_X during training.

# .predict() means:
# "Use what you have learned to estimate outputs for new inputs."

print(val_predictions)
# One predicted value is printed per row in val_X.

# Calling .predict() does not train or change the model:
# no learning occurs at this step.

# Check your answer
step_3.check()

[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000.
 262000. 180000. 121000. 175900. 210000. 248900. 131000. 100000. 149350.
 235000. 156000. 149900. 265979. 193500. 377500. 100000. 162900. 145000.
 180000. 582933. 146000. 140000.  91500. 112500. 113000. 145000. 312500.
 110000. 132000. 305000. 128000. 162900. 115000. 110000. 124000. 215200.
 180000.  79000. 192000. 282922. 235000. 132000. 325000.  80000. 237000.
 208300. 100000. 120500. 162000. 153000. 187000. 185750. 335000. 129000.
 124900. 185750. 133700. 127000. 230000. 146800. 157900. 136000. 153575.
 335000. 177500. 143000. 202500. 168500. 105000. 305900. 192000. 190000.
 140200. 134900. 128950. 213000. 108959. 149500. 190000. 175900. 160000.
 250580. 157000. 120500. 147500. 118000. 117000. 110000. 130000. 148500.
 148000. 190000. 130500. 127000. 120500. 135000. 168000. 176432. 128000.
 147000. 260000. 132000. 129500. 171000. 181134. 227875. 189000. 282922.
  94750. 185000. 194000. 159000. 279500. 290000. 13

<span style="color:#33cc33">Correct</span>

In [9]:
# step_3.hint()
# step_3.solution()

Inspect your predictions and actual values from validation data.

In [10]:
# print the top few validation predictions
print(val_predictions[:5])

# print the top few actual prices from validation data
print(val_y.head())  # These are the actual sale prices

[186500. 184000. 130000.  92000. 164500.]
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64


What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
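
One way to see the difference is to score the same tree on the rows it was fit on and on rows it has never seen. The sketch below does this on synthetic data; the `noisy_`, `fit_` and `held_` names and the data itself are hypothetical and are not the Iowa dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a linear trend plus random noise
rng = np.random.RandomState(0)
noisy_X = rng.uniform(0, 10, size=(200, 1))
noisy_y = 5 * noisy_X[:, 0] + rng.normal(0, 3, size=200)

fit_X, held_X, fit_y, held_y = train_test_split(noisy_X, noisy_y, random_state=1)
tree = DecisionTreeRegressor(random_state=1).fit(fit_X, fit_y)

# Error on the rows the tree was trained on
in_sample_mae = mean_absolute_error(fit_y, tree.predict(fit_X))
# Error on rows the tree has never seen
held_out_mae = mean_absolute_error(held_y, tree.predict(held_X))

print("In-sample MAE:", in_sample_mae)
print("Held-out MAE:", held_out_mae)
```

The in-sample error comes out close to zero because the tree has memorized those rows, noise included, while the held-out error reflects how well it generalizes.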

## Step 4: Calculate the Mean Absolute Error in Validation Data


In [11]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

# Print the validation MAE
print(val_mae)

# Check your answer
step_4.check()

29652.931506849316


<span style="color:#33cc33">Correct</span>

In [12]:
# step_4.hint()
# step_4.solution()

Is that MAE good?  There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.
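
To make the metric concrete, MAE is the average of the absolute differences between actual and predicted values. The sketch below works it out by hand for the five validation homes printed earlier on this page and compares the result with `mean_absolute_error`. The `first_` variable names are introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five validation predictions and actual prices, copied from the output above
first_predictions = np.array([186500, 184000, 130000, 92000, 164500])
first_actuals = np.array([231500, 179500, 122000, 84500, 142000])

# Absolute errors: 45000, 4500, 8000, 7500, 22500
abs_errors = np.abs(first_actuals - first_predictions)

# MAE is the mean of those absolute errors
manual_mae = abs_errors.mean()
sklearn_mae = mean_absolute_error(first_actuals, first_predictions)

print(manual_mae)   # 17500.0
print(sklearn_mae)  # same value
```

This covers only five homes, so it will not match the full validation MAE reported above.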

# Keep Going

You are ready for **[Underfitting and Overfitting](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**


---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*