# Decision Trees for Regression

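Lab tasks
1. Use sklearn.tree.DecisionTreeRegressor to solve the Kaggle house price prediction problem.
2. For a decision tree with maximum depth 10, compute the MAE and RMSE under 10-fold cross-validation on the training set.
3. Plot how the decision tree's MAE on the training set and on the test set changes as the maximum depth goes from 1 to 30.
4. Choose a reasonable maximum depth for the tree and justify the choice.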

## 1. Load the data

In [None]:
import pandas as pd
data = pd.read_csv('data/kaggle_house_price_prediction/kaggle_hourse_price_train.csv')

In [None]:
# Drop features (columns) that contain missing values
data = data.dropna(axis=1)

# Keep only the integer-valued features
data = data.select_dtypes(include='int64')

In [None]:
data.head()

## 2. Split the dataset

Use 70% of the rows as the training set and the remaining 30% as the test set.

In [None]:
from sklearn.utils import shuffle

In [None]:
data_shuffled = shuffle(data, random_state = 32)
split_line = int(len(data_shuffled) * 0.7)
training_data = data_shuffled[:split_line]
testing_data = data_shuffled[split_line:]
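
The same 70/30 split can also be written with scikit-learn's `train_test_split`. The sketch below is an alternative, not the method used in this notebook. It runs on a small made-up DataFrame (the rows are illustrative), and because it shuffles differently from `shuffle` followed by slicing, it would not reproduce the same partition as the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the house-price table (values are made up for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000],
})

# Hold out 30% of the rows; random_state makes the split repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=32)

print(len(toy_train), len(toy_test))
```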

## 3. Import the model

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [None]:
from sklearn.tree import DecisionTreeRegressor

## 4. Select the features and the target

In [None]:
features = data.columns.tolist()
target = 'SalePrice'
features.remove(target)

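## 5. Training and prediction

Below, compute the 10-fold cross-validated MAE and RMSE of a decision tree with maximum depth 10, trained on all features of the training set.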

In [None]:
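# Features and target of the training set
X_train = training_data[features]
y_train = training_data[target]

# Out-of-fold predictions from 10-fold cross-validation on the training set
model = DecisionTreeRegressor(max_depth=10)
y_pred = cross_val_predict(model, X_train, y_train, cv=10)
mae = mean_absolute_error(y_train, y_pred)
# Take the square root of the MSE directly; the `squared=False` argument
# is deprecated and no longer accepted by newer scikit-learn releases
rmse = mean_squared_error(y_train, y_pred) ** 0.5
print("MAE:", mae)
print("RMSE:", rmse)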
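
To make the two metrics concrete: MAE is the mean of the absolute errors, and RMSE is the square root of the mean squared error. The sketch below computes both by hand on three made-up prices (the numbers are illustrative, not taken from the dataset, and the variable names are chosen so they do not overwrite `mae` and `rmse` above). Because RMSE squares the errors before averaging, one large miss pulls it above the MAE.

```python
import math

# Made-up true prices and predictions, for illustration only
y_true = [200000, 150000, 300000]
y_hat = [210000, 140000, 270000]

errors = [t - p for t, p in zip(y_true, y_hat)]

# MAE: average magnitude of the errors
mae_by_hand = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse_by_hand = math.sqrt(sum(e * e for e in errors) / len(errors))

print("MAE:", round(mae_by_hand, 2))
print("RMSE:", round(rmse_by_hand, 2))
```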

10-fold cross-validation metrics for the decision tree with maximum depth 10, trained on all features:

| MAE | RMSE |
| --- | --- |
| 26529.84204384742 | 42130.77600109948 |

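## 6. Vary the maximum depth and plot how the decision tree's error changes

Plot the training-set and test-set MAE of the decision tree for maximum depths from 1 to 30. Draw both curves in a single figure, with the maximum depth on the x-axis and the MAE on the y-axis.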

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")

In [None]:
# Plot the training-set and test-set MAE for maximum depths from 1 to 30
train_mae = []
test_mae = []
for depth in range(1, 31):
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(testing_data[features])
    train_mae.append(mean_absolute_error(y_train, y_pred_train))
    test_mae.append(mean_absolute_error(testing_data[target], y_pred_test))
plt.plot(range(1, 31), train_mae, label='Train MAE')
plt.plot(range(1, 31), test_mae, label='Test MAE')
plt.xlabel('Max Depth')
plt.ylabel('MAE')
plt.title('Decision Tree Performance')
plt.legend()
plt.show()
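
For task 4, one common way to choose the depth without consulting the test set is to compare cross-validated MAE across depths on the training data and take the depth with the lowest value, preferring a shallower tree when several depths score about the same. The sketch below illustrates this on synthetic data from `make_regression`, so the depth it prints says nothing about the house-price data. The helper name `cv_mae_by_depth` and the fixed `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_mae_by_depth(X, y, depths, cv=10):
    """Return {depth: mean cross-validated MAE} for each candidate depth."""
    results = {}
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        # scikit-learn reports negated MAE so that higher is better; flip the sign back
        scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
        results[depth] = -scores.mean()
    return results


# Synthetic stand-in data (an assumption for the sketch, not the Kaggle set)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

mae_by_depth = cv_mae_by_depth(X_demo, y_demo, depths=range(1, 16))
best_depth = min(mae_by_depth, key=mae_by_depth.get)
print("Depth with the lowest cross-validated MAE:", best_depth)
```

With the notebook's own variables, the equivalent call would be `cv_mae_by_depth(X_train, y_train, range(1, 31))`.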