### What is model validation?
`Measuring model quality is the key to iteratively improving your models.`

I'll want to evaluate almost every model I ever build. In most applications, the relevant measure of model quality is predictive accuracy: will the model's predictions be close to what actually happens?
Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in that same training data.
You first need to summarize model quality in an understandable way. If you compare actual and predicted values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality; one of them is `MAE` (Mean Absolute Error).

**The prediction error for each house is:**  
`error = actual - predicted`  
So, if a house's price is 150,000 USD and you predicted it would cost 100,000 USD, the error is 50,000 USD.

With the MAE, we take the absolute value of each error (this converts each error to a positive number). We then take the average of those absolute errors. This is our measure of model quality.  
`In plain English, this can be stated as:`
> On average, our predictions are off by about X.
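
As a minimal sketch of the calculation described above, the snippet below computes the MAE by hand for three made-up houses. The prices are illustrative assumptions, not values from the Iowa data set.

```python
# Illustrative values only: three hypothetical houses, prices in USD.
actual = [150_000, 200_000, 120_000]
predicted = [100_000, 210_000, 120_000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so over- and under-predictions don't cancel out
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 0]
print(mae)     # 20000.0
```

Here the predictions are off by 20,000 USD on average, even though one of them is exactly right.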

***The example from: 003_1_first_machine_learning_model.***
```Python
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, predictions)

# Output: 30642.671232876713
```

In [90]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

iowa_file_path = './data/home_data_for_machine_learning/train.csv'
home_data = pd.read_csv(iowa_file_path)

y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

print('\nComplete')

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]

Complete
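
The in-sample predictions above match the actual prices exactly, which says little about how the model handles homes it has not seen. A rough sketch of that effect on synthetic stand-in data follows. The data, the `_demo` names, and the noise level are all assumptions made for illustration and have nothing to do with the Iowa file.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the Iowa data):
# price depends linearly on area, plus random noise.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=400)
price = 100 * area + rng.normal(0, 20_000, size=400)
X_demo = area.reshape(-1, 1)

# Hold out part of the data so the model never sees it during fitting.
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, price, random_state=1
)

# An unrestricted tree can memorize every training row.
demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(train_X_demo, train_y_demo)

in_sample_mae = mean_absolute_error(train_y_demo, demo_model.predict(train_X_demo))
validation_mae = mean_absolute_error(val_y_demo, demo_model.predict(val_X_demo))

print(f"In-sample MAE:  {in_sample_mae:.2f}")
print(f"Validation MAE: {validation_mae:.2f}")
```

The in-sample score should come out at (or extremely close to) zero, while the validation score reflects the noise the tree memorized. That gap is why the exercises below score the model on held-out data.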


**Note on `Series.tolist()`**
```Python
# Syntax: Series.tolist()
# Returns: the values of the Series as a plain Python list
```
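
A short illustration of the conversion, using a small hand-made Series (the values and the `prices` name are placeholders, not read from the data file):

```python
import pandas as pd

# Hypothetical sale prices, used only to illustrate the conversion.
prices = pd.Series([208500, 181500, 223500], name="SalePrice")

# tolist() drops the index and returns plain Python values.
as_list = prices.tolist()

print(type(as_list).__name__)  # list
print(as_list)                 # [208500, 181500, 223500]
```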

### Exercises

### Step 1: Split your data

In [91]:
import sklearn.model_selection
print(dir(sklearn.model_selection))

['BaseCrossValidator', 'GridSearchCV', 'GroupKFold', 'GroupShuffleSplit', 'KFold', 'LeaveOneGroupOut', 'LeaveOneOut', 'LeavePGroupsOut', 'LeavePOut', 'ParameterGrid', 'ParameterSampler', 'PredefinedSplit', 'RandomizedSearchCV', 'RepeatedKFold', 'RepeatedStratifiedKFold', 'ShuffleSplit', 'StratifiedKFold', 'StratifiedShuffleSplit', 'TimeSeriesSplit', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_search', '_split', '_validation', 'check_cv', 'cross_val_predict', 'cross_val_score', 'cross_validate', 'fit_grid_point', 'learning_curve', 'permutation_test_score', 'train_test_split', 'validation_curve']


In [92]:
# Import `train_test_split` function
import sklearn
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets (by default, 25% is held out)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [93]:
print(train_X.shape)
print(train_y.shape)

(1095, 7)
(1095,)


In [94]:
print(val_X.shape)
print(val_y.shape)

(365, 7)
(365,)
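
The shapes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out. The idea can be sketched with the standard library alone. This is a simplified stand-in, not scikit-learn's actual implementation, and `simple_split` is a hypothetical helper.

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=1):
    """Shuffle row indices and hold out a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(rows) * test_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

# 1460 rows in total: 1095 + 365, matching the shapes printed above.
rows = list(range(1460))
train_rows, val_rows = simple_split(rows)

print(len(train_rows), len(val_rows))  # 1095 365
```

Fixing the seed plays the same role as `random_state=1`: the shuffle, and therefore the split, is reproducible from run to run.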


### Step 2: Specify and Fit model  

Create a `DecisionTreeRegressor` model and fit it to the training data.  
Set `random_state` to 1 again when creating the model.

In [95]:
# Specify the model.
# help(DecisionTreeRegressor)
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
# help(iowa_model.fit)
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

### Step 3: Make predictions with the validation data

In [69]:
# Predict prices for all `val_X` observations.
# help(iowa_model.predict)
forcast = iowa_model.predict(val_X)

In [73]:
# Inspect the predictions and actual values from the validation data.

# Print the first few validation predictions.
print(forcast[:10])
print(50 * "-")
# Print the first few actual prices.
print(val_y[:10])

[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000.
 262000.]
--------------------------------------------------
258     231500
267     179500
288     122000
649      84500
1233    142000
167     325624
926     285000
831     151000
1237    195000
426     275000
Name: SalePrice, dtype: int64


### Step 4: Calculate the Mean Absolute Error on the validation data

In [127]:
from sklearn.metrics import mean_absolute_error
# print(dir(sklearn.metrics))
# help(mean_absolute_error)

val_mae = mean_absolute_error(val_y, forcast)
print(f"MAE = ${int(val_mae)}")

MAE = $29652


### Underfitting and Overfitting

I've built my model, and now it's time to tune the size of the tree to make better predictions.

In [112]:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [126]:
for max_leaf_nodes in [5, 50, 60, 500, 900, 1000, 1070]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes}  \t\t Mean Absolute Error: {my_mae}")

Max leaf nodes: 5  		 Mean Absolute Error: 35044.51323006062
Max leaf nodes: 50  		 Mean Absolute Error: 27405.9305977479
Max leaf nodes: 60  		 Mean Absolute Error: 27110.899469831442
Max leaf nodes: 500  		 Mean Absolute Error: 29454.183240959952
Max leaf nodes: 900  		 Mean Absolute Error: 30016.684474885846
Max leaf nodes: 1000  		 Mean Absolute Error: 30006.734246575343
Max leaf nodes: 1070  		 Mean Absolute Error: 30011.569863013698


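Of the options listed, 60 is the optimal number of leaves: it gives the lowest validation MAE (about 27,111). Trees with far fewer leaves underfit, and trees with many more leaves overfit.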
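
Rather than reading the best value off the printout by eye, the comparison can be done in code. A small sketch, with the MAE values copied from the output above (`mae_by_leaves` and `best_tree_size` are names chosen here for illustration):

```python
# Validation MAE per max_leaf_nodes value, copied from the output above.
mae_by_leaves = {
    5: 35044.51323006062,
    50: 27405.9305977479,
    60: 27110.899469831442,
    500: 29454.183240959952,
    900: 30016.684474885846,
    1000: 30006.734246575343,
    1070: 30011.569863013698,
}

# The best tree size is the key with the smallest validation MAE.
best_tree_size = min(mae_by_leaves, key=mae_by_leaves.get)

print(best_tree_size)  # 60
```

In the notebook, the same dictionary could presumably be built directly with a comprehension over `get_mae(n, train_X, val_X, train_y, val_y)` for each candidate `n`, instead of copying the numbers by hand.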

### Conclusion
Here's the takeaway: Models can suffer from either:

**Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or  
**Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.  
We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy.  
This lets us try many candidate models and keep the best one.