**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**

---


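## Recap
You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.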

In [15]:
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")

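# Exercises

You could write the function `get_mae` yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Take a look at the code cell below.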

In [16]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
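
To get a feel for what `get_mae` returns before using it on the Iowa data, here is a minimal sketch that calls it on synthetic data. The `make_regression` data, the `demo_*` names, and the two tree sizes are assumptions made for illustration; they are not part of the course exercise, and the numbers printed will not match the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)


# Assumption: synthetic regression data stands in for the Iowa housing data.
demo_X, demo_y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=0)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

# Hypothetical sizes: one very shallow tree and one moderately sized tree.
demo_maes = {}
for size in (2, 50):
    demo_maes[size] = get_mae(size, demo_train_X, demo_val_X, demo_train_y, demo_val_y)
    print(f"max_leaf_nodes={size:>3}  validation MAE: {demo_maes[size]:,.1f}")
```

Each call trains a fresh tree, so the function can be called repeatedly with different sizes without one run affecting another.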

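## Step 1: Compare Different Tree Sizes

Write a loop that tries each of the following values for *max_leaf_nodes* from a set of possible values.

Call the *get_mae* function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.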

In [17]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Score each value in candidate_max_leaf_nodes to find the ideal tree size
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = min(scores, key=scores.get)

# Check your answer
step_1.check()

In [18]:
# The lines below will show you a hint or the solution.
# step_1.hint() 
# step_1.solution()
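
Step 1 ranks tree sizes by validation error rather than training error. As a rough illustration of why, the sketch below records both errors for each candidate size. It uses synthetic data from `make_regression` as a stand-in, and every `demo_*` name is hypothetical, so treat it as a sketch under those assumptions rather than as the exercise solution.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic, noisy regression data instead of the Iowa housing data.
demo_X, demo_y = make_regression(n_samples=500, n_features=7, noise=25.0, random_state=0)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

demo_sizes = [5, 25, 50, 100, 250, 500]
demo_train_mae = {}
demo_val_mae = {}
for size in demo_sizes:
    model = DecisionTreeRegressor(max_leaf_nodes=size, random_state=0)
    model.fit(demo_train_X, demo_train_y)
    # Error on the rows the tree was fitted to
    demo_train_mae[size] = mean_absolute_error(demo_train_y, model.predict(demo_train_X))
    # Error on rows the tree has never seen
    demo_val_mae[size] = mean_absolute_error(demo_val_y, model.predict(demo_val_X))
    print(f"{size:>3} leaves  train MAE: {demo_train_mae[size]:8.1f}  "
          f"validation MAE: {demo_val_mae[size]:8.1f}")

# Pick the size with the lowest validation error, as in Step 1.
demo_best_size = min(demo_val_mae, key=demo_val_mae.get)
print("Best size by validation MAE:", demo_best_size)
```

With enough leaves a tree can memorize its training rows, so the training error is a poor guide to tree size. The validation error is the number that reflects performance on new data.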

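## Step 2: Fit Model Using All Data

You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.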

In [19]:
# Fill in the argument to set the optimal tree size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model on all of the data
final_model.fit(X, y)

# Check your answer
step_2.check()

In [20]:
# step_2.hint()
# step_2.solution()
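
The Step 2 workflow (choose the size on a train/validation split, then refit on every row) can be sketched end to end as follows. This is a minimal sketch on synthetic data: the `make_regression` data, the `demo_*` names, and the helper `demo_validation_mae` are assumptions for illustration and do not come from the course.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic regression data stands in for the Iowa housing data.
demo_X, demo_y = make_regression(n_samples=500, n_features=7, noise=25.0, random_state=0)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)


def demo_validation_mae(size):
    """Hypothetical helper: validation MAE of a tree capped at `size` leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=size, random_state=0)
    model.fit(demo_train_X, demo_train_y)
    return mean_absolute_error(demo_val_y, model.predict(demo_val_X))


# 1) Choose the tree size using only the train/validation split.
demo_best_size = min([5, 25, 50, 100, 250, 500], key=demo_validation_mae)

# 2) Refit with that size on all rows; no data is held out any more.
demo_final_model = DecisionTreeRegressor(max_leaf_nodes=demo_best_size, random_state=1)
demo_final_model.fit(demo_X, demo_y)

print("Chosen max_leaf_nodes:", demo_best_size)
print("Leaves in the final tree:", demo_final_model.get_n_leaves())
```

Once the model is refitted on all rows there is no held-out data left to score it on, so the tree size has to be decided before this final fit.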

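You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.

# Keep Going

You are ready for **[Random Forests](https://www.kaggle.com/dansbecker/random-forests).**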


---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*