## Recap
So far, you have loaded your data and reviewed it with the following code. Run this cell to set up your coding environment where the previous step left off.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

X.head()
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex4 import *

First in-sample predictions: [ 208500.  181500.  223500.  140000.  250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


# Exercises


## Step 1: Split Your Data
Use the `train_test_split` function to split your data into training and validation sets.

You could use any number for `random_state`, but use `random_state=1` so the `check` functions know what to expect when they verify your code.
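
To see why a fixed `random_state` matters, here is a minimal sketch on made-up data (the `toy_X` and `toy_y` lists are hypothetical stand-ins, not the Iowa housing data). It shows that two calls with the same seed return the same split, and that about a quarter of the rows are held out when no `test_size` is given.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (hypothetical data, not the Iowa housing set)
toy_X = [[i] for i in range(8)]
toy_y = [10 * i for i in range(8)]

# Two calls with the same seed produce the same split
first = train_test_split(toy_X, toy_y, random_state=1)
second = train_test_split(toy_X, toy_y, random_state=1)
same_split = first == second

# With no test_size given, 25% of the rows are held out for validation
train_X_toy, val_X_toy, train_y_toy, val_y_toy = first
print(same_split, len(train_X_toy), len(val_X_toy))
```

Without a fixed seed, each run may shuffle the rows differently, which is why the first attempt in the cell that follows is flagged by the checker.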

In [2]:
# Import the function
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y)
print("Incorrect")
step_1.check()

Incorrect


<span style="color:#cc3333">Incorrect:</span> The training data had different rows than expected

In [3]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print("Correct")
step_1.check()

Correct


<span style="color:#33cc33">Correct</span>

In [4]:
# The lines below will show you a hint or the solution.
step_1.hint() 
# step_1.solution()


<span style="color:#3366cc">Hint:</span> The function you need to import is part of sklearn. When calling the function, the arguments are X and y. Ensure you set the random_state to 1.

## Step 2: Specify and Fit the Model

In [5]:
# Fit the model with the training data.
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)
step_2.check()

[ 186500.  184000.  130000.   92000.  164500.  220000.  335000.  144152.
  215000.  262000.]
[ 186500.  184000.  130000.   92000.  164500.  220000.  335000.  144152.
  215000.  262000.]


<span style="color:#33cc33">Correct</span>

In [6]:
# step_2.hint()
step_2.solution()

<span style="color:#33cc99">Solution:</span> 
```python
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)
```

## Step 3: Make Predictions with Validation Data


In [7]:
# Predict with all validation observations
val_predictions = iowa_model.predict(train_X)
print("Incorrect")
step_3.check()

Incorrect


<span style="color:#cc3333">Incorrect:</span> `val_predictions` is wrong size. Did you predict with the wrong data?

In [8]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
print("Correct")
step_3.check()

Correct


<span style="color:#33cc33">Correct</span>

In [9]:
step_3.hint()
# step_3.solution()

<span style="color:#3366cc">Hint:</span> Run predict on the right validation data object.

Inspect your predictions and actual values from validation data.

In [10]:
# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(val_y.head().tolist())

What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
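
As a reminder of that idea, the sketch below uses a small made-up dataset (one hypothetical feature whose target is its square, not the housing data). An unrestricted decision tree can memorize every training row, so its in-sample error is zero, while rows it has never seen are predicted less accurately.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, target is its square
toy_X = [[x] for x in range(20)]
toy_y = [x * x for x in range(20)]

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)

# Fit on the training rows only
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(tr_X, tr_y)

# Error on rows the tree has seen vs. rows it has not
in_sample_mae = mean_absolute_error(tr_y, toy_model.predict(tr_X))
validation_mae = mean_absolute_error(va_y, toy_model.predict(va_X))
print("In-sample MAE:", in_sample_mae)
print("Validation MAE:", validation_mae)
```

The in-sample figure looks perfect only because the model is being graded on data it was fit to; the validation figure is the more honest estimate.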

## Step 4: Calculate the Mean Absolute Error in Validation Data


In [11]:
from sklearn.metrics import mean_absolute_error
# mean_absolute_error expects the actual values first, then the predictions
val_mae = mean_absolute_error(val_y, val_predictions)

step_4.check()

<span style="color:#33cc33">Correct</span>

Is that MAE good? There is no general rule for what counts as a good value; it depends on the application. But you'll see how to use this number in the next step.
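
For intuition about what the number means, here is a hand-rolled sketch of the calculation in plain Python. The helper `mean_abs_error` and the prices are hypothetical, for illustration only; in the exercise itself you should keep using `sklearn.metrics.mean_absolute_error`.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical sale prices and predictions, for illustration only
actual_prices = [200000, 150000, 300000]
predicted_prices = [210000, 145000, 285000]

# Individual errors are 10000, 5000 and 15000
toy_mae = mean_abs_error(actual_prices, predicted_prices)
print(toy_mae)
```

The result is in the same units as the target, so here it reads as "on average, predictions are off by this many dollars".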

# Keep Going

Now that you can measure model performance, you are ready to run some experiments comparing different models. The key is to understand **[Underfitting and Overfitting](https://www.kaggle.com/dansbecker/underfitting-overfitting-and-model-optimization)**. It's an especially fun part of machine learning. 

---
**[Course Home Page](https://www.kaggle.com/learn/machine-learning)**


