**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**

---


## Recap
You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")


# Make training predictions and calculate training mean absolute error
train_predictions = iowa_model.predict(train_X)
train_mae = mean_absolute_error(train_y, train_predictions)

# Print both training and validation MAE
print("Training MAE: {:,.0f}".format(train_mae))
print("Validation MAE: {:,.0f}".format(val_mae))

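# Compare performance. This is only a rough heuristic: a training MAE far
# below the validation MAE (as here) is the usual sign of overfitting.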
if train_mae < val_mae:
    print("The model might be overfitting.")
elif train_mae > val_mae:
    print("The model might be underfitting.")
else:
    print("The model is performing consistently.")


Validation MAE: 29,653

Setup complete
Training MAE: 62
Validation MAE: 29,653
The model might be overfitting.
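
The MAE values above are averages of absolute prediction errors. As a minimal sketch of what `mean_absolute_error` computes, the hypothetical helper `mae_by_hand` below reproduces the calculation in plain Python on made-up prices (not taken from the Iowa data). It also shows that swapping the two arguments leaves the result unchanged, because the absolute difference is symmetric.

```python
def mae_by_hand(actual, predicted):
    """Mean of |actual - predicted|; hypothetical helper for illustration."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up sale prices and predictions, for illustration only
actual_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]

print(mae_by_hand(actual_prices, predicted_prices))  # 10000.0
print(mae_by_hand(predicted_prices, actual_prices))  # 10000.0 (order does not matter)
```

Each prediction here misses by 10,000, so the average miss is also 10,000.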


# Exercises
You could write the function `get_mae` yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below.

In [2]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Test the function with different max_leaf_nodes values
leaf_node_values = [5, 10, 20, 50, 100, 200]

for max_leaf_nodes in leaf_node_values:
    val_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"MAE for max_leaf_nodes={max_leaf_nodes}: {val_mae:.0f}")

MAE for max_leaf_nodes=5: 35045
MAE for max_leaf_nodes=10: 31585
MAE for max_leaf_nodes=20: 28707
MAE for max_leaf_nodes=50: 27406
MAE for max_leaf_nodes=100: 27283
MAE for max_leaf_nodes=200: 28136


## Step 1: Compare Different Tree Sizes
Write a loop that tries each value of *max_leaf_nodes* from a set of candidate values.

Call the *get_mae* function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.

In [3]:
# Initialize variables to track the best tree size
best_tree_size = None
best_mae = float('inf')  # Start with a very large value for comparison

# Loop through the candidate_max_leaf_nodes to find the best tree size
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
for max_leaf_nodes in candidate_max_leaf_nodes:
    # Calculate MAE for each max_leaf_nodes
    val_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"MAE for max_leaf_nodes={max_leaf_nodes}: {val_mae:.0f}")
    
    # Update the best tree size if the current MAE is lower
    if val_mae < best_mae:
        best_mae = val_mae
        best_tree_size = max_leaf_nodes

# Print the best tree size after loop
print(f"The best tree size is: {best_tree_size}")

MAE for max_leaf_nodes=5: 35045
MAE for max_leaf_nodes=25: 29016
MAE for max_leaf_nodes=50: 27406
MAE for max_leaf_nodes=100: 27283
MAE for max_leaf_nodes=250: 27894
MAE for max_leaf_nodes=500: 29454
The best tree size is: 100


In [4]:
# The lines below will show you a hint or the solution.
step_1.hint() 
step_1.solution()

<span style="color:#3366cc">Hint:</span> You will call `get_mae` in the loop. You'll need to map the names of your data structure to the names in `get_mae`.

<span style="color:#33cc99">Solution:</span> 
```python
# Here is a short solution with a dict comprehension.
# The lesson gives an example of how to do this with an explicit loop.
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)

```
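
To see how `min(scores, key=scores.get)` picks the winner, here is a small sketch that fills a dictionary with the rounded MAE values printed in Step 1 instead of calling `get_mae`. The name `printed_scores` is an assumption introduced for this illustration.

```python
# Rounded validation MAEs copied from the Step 1 output above
printed_scores = {5: 35045, 25: 29016, 50: 27406, 100: 27283, 250: 27894, 500: 29454}

# min() iterates over the dict's keys; key=printed_scores.get ranks each key by its MAE
best_size = min(printed_scores, key=printed_scores.get)
print(best_size)                  # 100
print(printed_scores[best_size])  # 27283
```

Keeping every score in a dictionary, rather than only the running best, also lets you inspect how the error changes across tree sizes.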

## Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size.  That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [5]:
# Fit the final model with the best tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

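# Fit it on all of the data (training and validation rows together)
final_model.fit(X, y)

# Make predictions on the validation set
# Note: val_X is part of X, so the model has already seen these rows
final_predictions = final_model.predict(val_X)

# Calculate the MAE on those rows (optimistic, not a true held-out estimate)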
final_mae = mean_absolute_error(val_y, final_predictions)
print(f"Final validation MAE: {final_mae:,.0f}")


# Check your answer
step_2.check()


Final validation MAE: 16,816


<span style="color:#33cc33">Correct</span>
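
The final MAE printed above is far lower than the best value found in Step 1, but `final_model` was fit on data that includes `val_X`, so the number most likely flatters the model. The sketch below illustrates the effect on synthetic data (an assumption; it does not use the Iowa dataset, and the names `X_demo`, `y_demo`, `held_out_mae`, and `seen_mae` are made up for this example). It scores the same validation rows with a tree that never saw them and with a tree fit on everything.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 1.0, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)
size = 20  # arbitrary tree size chosen for the demo

# Tree fit on the training split only: va_X is genuinely unseen
held_out_model = DecisionTreeRegressor(max_leaf_nodes=size, random_state=1).fit(tr_X, tr_y)
held_out_mae = mean_absolute_error(va_y, held_out_model.predict(va_X))

# Tree fit on all rows: va_X was part of the fitting data
full_model = DecisionTreeRegressor(max_leaf_nodes=size, random_state=1).fit(X_demo, y_demo)
seen_mae = mean_absolute_error(va_y, full_model.predict(va_X))

print(f"MAE on unseen validation rows: {held_out_mae:.3f}")
print(f"MAE on rows seen during fit:   {seen_mae:.3f}")
```

Only the first figure estimates how the model would perform on new data. Refitting on all rows is still a sensible final step before deployment, as long as the earlier held-out score is the one you report.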

In [6]:
step_2.hint()
step_2.solution()

<span style="color:#3366cc">Hint:</span> Fit with the ideal value of `max_leaf_nodes`. In the fit step, use all of the data in the dataset.

<span style="color:#33cc99">Solution:</span> 
```python
# Create the model, passing best_tree_size as max_leaf_nodes
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# fit the final model
final_model.fit(X, y)
```

You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.

# Keep Going

You are ready for **[Random Forests](https://www.kaggle.com/dansbecker/random-forests).**


---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*