# Underfitting and Overfitting

Consider a tree-based model. When we divide the houses among many leaves, each leaf contains fewer houses. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new (unseen) data, because each prediction is based on only a few houses. **Overfitting** refers to this situation: a model matches the training data almost perfectly but does poorly on validation data and other new data.

On the other hand, if we make our tree very shallow, it doesn't divide the houses into very distinct groups. When a model fails to capture important distinctions and patterns in the data, and so performs poorly even on the training data, that is called **underfitting**.
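The two failure modes can be seen side by side by measuring error on both the training and the validation data. The following is a minimal sketch, assuming synthetic data (a noisy sine wave) rather than the housing data; the helper name `train_and_val_mae` and the leaf counts 2, 10, and 300 are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic data standing in for a real dataset (noisy sine wave)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)

# Default split: 75% training (300 rows), 25% validation (100 rows)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def train_and_val_mae(max_leaf_nodes):
    """Hypothetical helper: return (training MAE, validation MAE) for one tree size."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# 2 leaves: very coarse tree; 300 leaves: one leaf per training row
results = {n: train_and_val_mae(n) for n in [2, 10, 300]}
for n, (train_mae, val_mae) in results.items():
    print(f"{n:>4} leaves: training MAE {train_mae:.3f}, validation MAE {val_mae:.3f}")
```

With one leaf per training row, the training error drops to essentially zero while the validation error does not, which is the signature of overfitting. With only 2 leaves, the error is high even on the training data, which is underfitting.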


Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, this is the low point of the (red) validation curve in the figure below.

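The example below compares the validation mean absolute error (MAE) of decision trees with different values of max_leaf_nodes.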

In [1]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [2]:
# Load the data and split it into training and validation sets
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
melbourne_data = pd.read_csv("melb_data.csv")
# Drop rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
# (names must match the CSV columns exactly, including 'Lattitude' and 'Longtitude')
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Split data into training and validation sets, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

In [3]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  254983


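Of the options listed, 500 leaves gives the lowest validation MAE (243,495), so it is the best choice among these four values. With 5 or 50 leaves the tree underfits, and with 5000 leaves the validation error rises again, which suggests overfitting.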
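Rather than reading the best value off the printed output, the selection can be done in code. This is a small sketch, assuming the scores are stored in a dictionary keyed by tree size; the names `scores` and `best_tree_size` are illustrative, and the values are copied from the output above.

```python
# Validation MAE for each candidate max_leaf_nodes, copied from the printed results
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# min() with key=scores.get returns the key whose value (MAE) is smallest
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # 500
```

In a live notebook, the same dictionary could be built from the helper defined earlier, for example `{n: get_mae(n, train_X, val_X, train_y, val_y) for n in [5, 50, 500, 5000]}`.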