Like SVMs, decision trees are versatile machine learning algorithms: they can handle classification, regression, and even multi-output tasks.

In [1]:
import os
PROJECT_ROOT_DIR="."
CHAPTER_ID = "decision_trees"
def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id)

## Training and Visualizing a Decision Tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

In [3]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [4]:
X = iris.data[:, 2:]  # petal length and petal width (the last two features)
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Visualizing the Decision Tree

In [5]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file=image_path('iris_tree.dot'),
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
)

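Use Graphviz to convert the `.dot` file written above into a PNG image:  
`dot -Tpng images/decision_trees/iris_tree.dot -o images/decision_trees/iris_tree.png`

In [None]:
!dot -Tpng images/decision_trees/iris_tree.dot -o images/decision_trees/iris_tree.png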

![](./images/decision_trees/iris_tree.png)
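If Graphviz is not installed, a plain-text rendering of the same tree can stand in for the image. The sketch below assumes scikit-learn 0.21 or later, where `sklearn.tree.export_text` is available. It refits the depth-2 tree so that it runs on its own, and `random_state=42` is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Plain-text view of the split rules, one line per node
tree_text = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(tree_text)
```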

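Node attributes:  
$samples$: the number of training instances that reach the node  
$value$: the number of training instances of each $class$ in the node, used to estimate the probability of each $class$  
$class$: the predicted class, i.e. the $class$ with the most instances in $value$  
$gini$: the Gini impurity; a node is pure ($gini=0$) when all of its instances belong to the same $class$  
For the green node: $g_{green}=1-(0/54)^2-(49/54)^2-(5/54)^2\approx 0.168$  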

$Gini\ impurity:$ $\displaystyle G_i=1-\sum_{k=1}^n p_{i,k}^2$, where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in node $i$.
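As a quick check of the green-node arithmetic, the sketch below computes the Gini impurity as one minus the sum of squared class ratios, starting from a node's `value` counts. The helper name `gini_impurity` is made up for illustration and is not part of scikit-learn.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Green node of the iris tree: value = [0, 49, 5]
green = gini_impurity([0, 49, 5])
print(round(green, 3))  # 0.168

# A pure node has impurity 0
print(gini_impurity([50, 0, 0]))  # 0.0
```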

In [6]:
tree_clf.predict_proba([[5, 1.5]])
# class ratios in the leaf this sample falls into: 0/54, 49/54, 5/54

array([[0.        , 0.90740741, 0.09259259]])

In [7]:
tree_clf.predict([[5, 1.5]])
# predicted class 1 (versicolor), the class with the highest probability

array([1])
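The two calls above are related: the predicted class is the one with the highest estimated probability in the leaf. A minimal sketch of that relationship, refitting the tree so the block is self-contained (`random_state=42` is an assumption added for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = [[5, 1.5]]  # petal length 5 cm, petal width 1.5 cm
proba = tree_clf.predict_proba(sample)[0]
predicted = tree_clf.predict(sample)[0]

# predict() returns the class whose ratio in the leaf is largest
same_class = predicted == tree_clf.classes_[np.argmax(proba)]
print(iris.target_names[predicted], same_class)
```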

## CART Decision Trees

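The $Classification\ And\ Regression\ Tree$ (CART) algorithm trains a decision tree by "growing" it: it repeatedly splits the training set in two using a single feature and a threshold, choosing the pair that produces the purest subsets.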
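To make the "growing" step concrete, here is a minimal sketch of how one CART split could be chosen for a single feature: try the midpoint between each pair of adjacent sorted values as a threshold and keep the one with the lowest weighted Gini impurity. The function names and the toy data are illustrative assumptions; scikit-learn's actual implementation is optimized and searches over all features.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, cost) minimizing the weighted Gini of both subsets."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_cost = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Each subset's impurity is weighted by its share of the samples
        cost = len(left) / n * gini(left) + len(right) / n * gini(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Toy data (made up): petal lengths in cm and their class labels
petal_length = [1.5, 4.75, 1.0, 6.0, 1.25, 4.5, 1.75, 5.25]
species = [0, 1, 0, 2, 0, 1, 0, 2]

threshold, cost = best_split(petal_length, species)
print(threshold, round(cost, 3))  # 3.125 0.25
```

Growing a full tree would apply the same search recursively to each subset until a stopping condition such as `max_depth` is reached.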

## Decision Tree Regression

In [13]:
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Fit a regression tree on all four features; the integer class labels
# (0, 1, 2) are treated here as a numeric target
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(iris.data, iris.target)

# class_names is omitted: it only applies to classification trees
export_graphviz(
        tree_reg,
        out_file=image_path('iris_tree_reg.dot'),
        feature_names=iris.feature_names,
        rounded=True,
        filled=True
)

![](./images/decision_trees/iris_tree_reg.png)
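A regression tree predicts a value rather than a class: with the default criterion, each leaf outputs the mean target of the training instances that fall into it. The sketch below checks this on synthetic noisy quadratic data. The dataset, the seed, and the variable names are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data (made up for illustration): y = 4(x - 0.5)^2 + noise
rng = np.random.RandomState(42)
X_quad = rng.rand(200, 1)
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + rng.randn(200) / 10

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

# apply() returns the index of the leaf each instance ends up in
leaf_ids = tree_reg.apply(X_quad)
predictions = tree_reg.predict(X_quad)

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    # Every instance in a leaf gets the same prediction: the leaf's mean target
    print(leaf, in_leaf.sum(), predictions[in_leaf][0], y_quad[in_leaf].mean())
```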