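## Decision Tree Model
- Node
- Branch (split)
- Pruning: pre-pruning limits growth while the tree is being built; post-pruning trims the tree after it has been fully grown.
- Advantages of tree models: they handle large numbers of variables automatically, picking the most important predictors from all candidates to split the samples. They also do not require the data to satisfy normality, independence, or homogeneity of variance, so they apply more broadly.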
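
The two pruning styles listed above can be sketched with scikit-learn. This is a minimal illustration, not a tuned model: the choices of `max_depth=3`, `min_samples_leaf=5`, and the mid-range `ccp_alpha` are assumptions made only for the demo, and cost-complexity pruning is just one way to post-prune.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then collapse weak subtrees
# using cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take a mid-range alpha from the pruning path.
# In practice alpha would usually be selected by cross-validation.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

print("full tree leaves:  ", full_tree.get_n_leaves())
print("pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("post-pruned leaves:", post_pruned.get_n_leaves())
```

Larger values of `ccp_alpha` prune more aggressively, so the post-pruned tree never has more leaves than the full tree.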

### Decision Tree Classification

In [8]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import pandas as pd
import math
import joblib
from sklearn import metrics

In [5]:

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

DTC = DecisionTreeClassifier()
DTC.fit(X_train, y_train)



DecisionTreeClassifier()

In [2]:
print(iris.feature_names)
DTC.feature_importances_

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


array([0.01906837, 0.02764914, 0.40706919, 0.5462133 ])
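
The raw importance array is easier to read when each value is paired with its feature name. The sketch below refits a tree on the full iris data with a fixed `random_state` (an assumption for reproducibility), so its numbers will not match the output above exactly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction.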

In [3]:
DTC.score(X_test, y_test)

0.9555555555555556

In [4]:
print(classification_report(y_test, DTC.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.89      1.00      0.94        17
           2       1.00      0.87      0.93        15

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45



In [None]:
# Visualize the tree
import graphviz
from sklearn.tree import export_graphviz

export_graphviz(DTC, out_file="classify_tree.dot", feature_names=iris.feature_names, class_names=iris.target_names)

# Read the .dot file back and render it with graphviz
# (the .dot file can also be opened directly in a Graphviz viewer)
with open('classify_tree.dot') as f:
    dot_graph = f.read()

graph = graphviz.Source(dot_graph)
graph.render('classify_tree')

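### Decision Tree Regression
- Classification vs. regression: in classification the dependent variable is categorical; in regression it is continuous.
- A classification tree measures node impurity with information entropy (or Gini impurity, which is scikit-learn's default criterion).
- A regression tree measures impurity with the mean squared error instead.
- In a classification tree, the prediction is the mode (most frequent class) of the samples in the leaf.
- In a regression tree, the prediction is the mean of the samples in the leaf.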
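
The impurity measures and leaf predictions described above can be written out directly with NumPy. This is a minimal sketch of the definitions, not scikit-learn's internal implementation; the helper names `entropy`, `node_mse`, `leaf_class`, and `leaf_value` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def node_mse(values):
    """Mean squared error of a node around its own mean (the node variance)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def leaf_class(labels):
    """Classification leaf output: the most frequent class in the leaf."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes[np.argmax(counts)]

def leaf_value(values):
    """Regression leaf output: the mean target value in the leaf."""
    return float(np.mean(values))

mixed_node = np.array([0, 0, 1, 1])        # two classes, evenly split
majority_leaf = np.array([0, 0, 0, 1])     # class 0 is the mode
reg_leaf = np.array([1.0, 2.0, 3.0, 6.0])  # continuous targets

print("entropy of mixed node:", entropy(mixed_node))
print("predicted class:", leaf_class(majority_leaf))
print("MSE of regression leaf:", node_mse(reg_leaf))
print("predicted value:", leaf_value(reg_leaf))
```

A pure node (all samples in one class, or all targets equal) scores zero on both measures, which is why splits that lower these values are preferred.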

In [None]:
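from sklearn.tree import DecisionTreeRegressor

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
boston = datasets.load_boston()
X = boston.data
y = boston.target

# Limit the depth to 3 so the tree stays small enough to read
DTR = DecisionTreeRegressor(max_depth=3)
DTR.fit(X, y)
# R^2 measured on the training data itself
print(DTR.score(X, y))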


In [None]:
export_graphviz(DTR, out_file='regression_tree.dot', feature_names=boston.feature_names)

# Read the .dot file back and render it with graphviz
with open('regression_tree.dot') as f:
    dot_graph = f.read()

graph = graphviz.Source(dot_graph)
graph.render('regression_tree')

## Part 1: Decision tree regression applied to ads effectiveness prediction

In [9]:
df = pd.read_csv("./data/ads_3.csv")

X = df[df.columns[:62]]
Y = df[df.columns[62:]]
MSE = []
RMSE = []
R_squared = []

for i in range(12):
    # Fit one tree per target column
    y = Y[Y.columns[i]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    DT_regression = DecisionTreeRegressor(max_depth=4)
    DT_regression.fit(X_train, y_train)

    # joblib.dump(DT_regression, "model/SVM_regression/model{}.pkl".format(i+1))
    # Evaluate on the held-out split
    y_pred = DT_regression.predict(X_test)
    mse = metrics.mean_squared_error(y_test, y_pred)
    MSE.append(mse)
    RMSE.append(math.sqrt(mse))
    R_squared.append(metrics.r2_score(y_test, y_pred))
    

In [11]:
result_dic = {"MSE":MSE, "RMSE":RMSE, "R_squared":R_squared}
result_df = pd.DataFrame(result_dic, index=Y.columns)
result_df.to_csv("result/DT_regression.csv")

In [22]:
MSE = []
RMSE = []
R_squared = []

for i in range(12):
    y = Y[Y.columns[i]]

    # Refit on the full dataset and save one model per target column
    DT_regression = DecisionTreeRegressor(max_depth=6)
    DT_regression.fit(X, y)

    joblib.dump(DT_regression, "model/DT_regression/model{}.pkl".format(i+1))

    # These metrics are in-sample: the model is scored on the data it was fit on
    y_pred = DT_regression.predict(X)
    mse = metrics.mean_squared_error(y, y_pred)
    MSE.append(mse)
    RMSE.append(math.sqrt(mse))
    R_squared.append(metrics.r2_score(y, y_pred))
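
Because the loop above scores each model on the same data it was fit on, its metrics are not directly comparable with the held-out metrics from the earlier loop. The sketch below illustrates the gap on synthetic data; the data-generating formula, the sample size, and the depths tried are all assumptions made for the demo and have nothing to do with `ads_3.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 5))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Compare in-sample and held-out R^2 as the tree is allowed to grow deeper.
scores = {}
for depth in (2, 4, 6, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth={}: train R^2={:.3f}, test R^2={:.3f}".format(depth, *scores[depth]))
```

Training R² keeps rising with depth because a deeper tree can memorize the training samples, while held-out R² typically levels off or drops, so a held-out score is the safer basis for choosing `max_depth`.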

In [20]:
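# This writes to the same path as the hold-out results saved earlier,
# so whichever cell runs last overwrites the file.
result_dic = {"MSE":MSE, "RMSE":RMSE, "R_squared":R_squared}
result_df = pd.DataFrame(result_dic, index=Y.columns)
result_df.to_csv("result/DT_regression.csv")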