**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**

---


## Recap
You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.

In [1]:
# Code you have previously used to load data
# IMPORT LIBRARIES
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# READ FILE
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Read the CSV (comma-separated values) file into a pandas DataFrame
home_data = pd.read_csv(iowa_file_path)

# CREATE TARGET OBJECT and call it y
y = home_data.SalePrice

# Create X that contains the features that I want to consider. I may only choose some of them from original file.
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]


# Split into validation and training data.
# Without a fixed random_state, the split would be different on each run.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)

# Compare the predictions with the true values
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE: {:,.0f}".format(val_mae))

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")

Validation MAE: 29,653

Setup complete


# Exercises
You could write the function `get_mae` yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below.

In [2]:
# def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
#     model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
#     model.fit(train_X, train_y)
#     preds_val = model.predict(val_X)
#     mae = mean_absolute_error(val_y, preds_val)
#     return(mae)
def get_mae(chosen_max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=chosen_max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    # Do predictions
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(preds_val, val_y)
    # Return the score so callers can compare tree sizes
    return mae

### My note: I rewrote the function myself to get familiar with the material.

In [3]:
def chau_get_mean_absolute_error(given_max_leaf_nodes, train_X, val_X, train_y, val_y):
    my_model = DecisionTreeRegressor(max_leaf_nodes=given_max_leaf_nodes, random_state=14)
    my_model.fit(train_X, train_y)
    predicted_values = my_model.predict(val_X)
    mae = mean_absolute_error(val_y, predicted_values)
    return mae
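
For reference, MAE is the average absolute difference between actual and predicted values, so swapping the two arguments gives the same number (scikit-learn documents the order as `y_true`, then `y_pred`). The sketch below is a plain-Python illustration only: the helper name `mean_absolute_error_by_hand` and the prices are made up for this example and are not part of the course code.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all rows (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices, purely for illustration
actual = [200000, 150000, 310000]
predicted = [210000, 140000, 300000]

# Each prediction is off by 10,000, so the average error is 10,000
mae_forward = mean_absolute_error_by_hand(actual, predicted)
mae_swapped = mean_absolute_error_by_hand(predicted, actual)
print(mae_forward, mae_swapped)
```

Both calls print the same value because the absolute value discards the sign of each difference.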

## Step 1: Compare Different Tree Sizes
Write a loop that tries each of the following candidate values for *max_leaf_nodes*.

Call the *get_mae* function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.

In [4]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
lowest_mae = None
best_tree_size = None
for max_node in candidate_max_leaf_nodes:
    my_mae = chau_get_mean_absolute_error(max_node, train_X, val_X, train_y, val_y)
    # Keep the smallest MAE seen so far and the tree size that produced it.
    # Starting from None also covers the case where the first candidate is the best.
    if lowest_mae is None or my_mae < lowest_mae:
        lowest_mae = my_mae
        best_tree_size = max_node
    print("Max leaf nodes: %d \n Mean Absolute Error: %d" % (max_node, my_mae))

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
print(best_tree_size)

# Check your answer
step_1.check()

Max leaf nodes: 5 
 Mean Absolute Error: 35044
Max leaf nodes: 25 
 Mean Absolute Error: 29016
Max leaf nodes: 50 
 Mean Absolute Error: 27734
Max leaf nodes: 100 
 Mean Absolute Error: 27611
Max leaf nodes: 250 
 Mean Absolute Error: 28271
Max leaf nodes: 500 
 Mean Absolute Error: 29591
100


<span style="color:#33cc33">Correct</span>

In [5]:
print(best_tree_size)

100


## My note: 
1. I used a for-loop to try different numbers of leaves and compute the mean absolute error (MAE) for each tree.
2. I took the MAE of the first tree (5 leaves) as my baseline and compared the MAE of each later tree against the lowest value seen so far.
3. Each time I found a lower MAE, I recorded which tree it came from by storing its number of leaves.
4. Another way is to create a dictionary that maps each number of leaves to its MAE and then loop through it again to compare; however, that is less efficient in memory and time. Instead, I kept a running minimum in a variable inside the loop and compared along the way.
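
For comparison, here is a minimal sketch of the dictionary approach mentioned in the last point. To keep it runnable on its own, the MAE values are copied from the printed output of the loop above rather than recomputed; the names `mae_by_leaf_nodes` and `best_size_from_dict` are illustrative and not part of the course code.

```python
# Validation MAE per candidate tree size, copied from the loop's printed output
mae_by_leaf_nodes = {
    5: 35044,
    25: 29016,
    50: 27734,
    100: 27611,
    250: 28271,
    500: 29591,
}

# min() with key=dict.get returns the key whose stored MAE is the smallest
best_size_from_dict = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print(best_size_from_dict)
```

In the notebook itself, the dictionary could presumably be built with a comprehension that calls `get_mae` once per entry of `candidate_max_leaf_nodes`. With only six candidates, the extra memory and the second pass are small either way.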

In [6]:
# The lines below will show you a hint or the solution.
# step_1.hint() 
# step_1.solution()

## Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size.  That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [7]:
# Use the best tree size found in Step 1
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)

# Fit the final model on all of the data (no validation hold-out)
final_model.fit(X, y)

# Check your answer
step_2.check()

<span style="color:#33cc33">Correct</span>

In [8]:
# step_2.hint()
# step_2.solution()

You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.

# Keep Going

You are ready for **[Random Forests](https://www.kaggle.com/dansbecker/random-forests).**


---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*