# Decision Trees
![](picture/01.png)

Check if every item in the dataset is in the same class:

    If so, return the class label
    Else
        find the best feature to split the data
        split the dataset
        create a branch node
            for each split
                call createBranch and add the result to the branch node
        return branch node
             
Please note the recursive nature of createBranch. It calls itself in the second-to-last line.

**Note:** some decision trees make a binary split of the data, but we won't do this. We'll follow the ID3 algorithm.

### ID3

Entropy is defined as the expected value of the information. First, we need to define information. If you’re classifying something that can take on multiple values, the information for symbol $x_i$ is defined as

$l(x_i) = -log_{2}p(x_i)$

where $p(x_i)$ is the probability of choosing this class.

To calculate entropy, you need the expected value of all the information of all possible values of our class. This is given by

$H = - \sum_{i=1}^{n}p(x_i)log_{2}p(x_i)$

where n is the number of classes.
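As a quick check of the two formulas, the sketch below computes the information of a single outcome and the entropy of a distribution directly from probabilities. The helper names `information` and `entropy` are illustrative only and are not the functions defined later in this document.

```python
from math import log2

def information(p):
    """Information (in bits) of an outcome that has probability p."""
    return -log2(p)

def entropy(probs):
    """Expected information of a distribution given as a list of probabilities."""
    # outcomes with probability 0 contribute nothing to the sum
    return sum(p * information(p) for p in probs if p > 0)

print(information(0.5))      # information of one side of a fair coin
print(entropy([0.5, 0.5]))   # two equally likely classes
print(entropy([1.0]))        # a single class, no disorder
print(entropy([0.4, 0.6]))   # an uneven two-class distribution
```

Entropy is largest when the classes are equally likely and drops to zero when only one class is present.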

Note: see [this article](https://blog.csdn.net/acdreamers/article/details/44661149) for more on ID3.

### DataSet

See the data in table 3.1. It contains five animals pulled from the sea and asks if they can survive without coming to the surface and if they have flippers. We would like to classify these animals into two classes: fish and not fish. Now we want to decide whether we should split the data based on the first feature or the second feature. To answer this question, we need some quantitative way of determining how to split the data. We’ll discuss that next.

![](picture/02.png)

### How to compute the Shannon entropy

Suppose the target we want to learn is "is fish?".

Table 3.1 has 5 examples in total: 2 labeled "yes" and 3 labeled "no". So we compute the Shannon entropy as follows:

$H = - \frac{3}{5} log_{2}\frac{3}{5} - \frac{2}{5} log_{2} \frac{2}{5} = 0.9709505944546686$


### 1. Function to calculate the Shannon entropy of a dataset

In [1]:
from math import log

def calcShannonEnt(dataSet):
    """
    Compute the Shannon entropy of a dataset.

    parameter-- dataSet: a list of feature vectors (lists) whose last
                element is the class label

    returns-- the Shannon entropy of the class labels
    """
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # the class label is the last element of each feature vector
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.
    for key in labelCounts:
        # probability of this class, then its contribution to the entropy
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt

In [2]:
def createDataSet():
    """
    Create the dataset from table 3.1 (marine animal data).
    Note: in the features, 1 = yes and 0 = no.
    returns-- data set and labels
    """
    dataSet = [[1,1,'yes'],
              [1,1,"yes"],
              [1,0,"no"],
              [0,1,"no"],
              [0,1,"no"]]
    labels = ['no surfacing','flippers']
    return dataSet,labels

Now we can try computing the Shannon entropy:

In [3]:
dataSet,labels = createDataSet()
shannonENT = calcShannonEnt(dataSet)
print(shannonENT)

0.9709505944546686


### 2. Splitting the dataset
You just saw how to measure the amount of disorder in a dataset. For our classifier algorithm to work, you need to measure the entropy, split the dataset, measure the entropy on the split sets, and see if splitting it was the right thing to do. You’ll do this for all of our features to determine the best feature to split on. Think of it as a two-dimensional plot of some data. You want to draw a line to separate one class from another. Should you do this on the X-axis or the Y-axis? The answer is what you’re trying to find out here.

In [4]:
def splitDataSet(dataSet,axis,value):
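    """
    Split the dataset on a given feature.

    parameters-- dataSet: list of feature vectors
                 axis: index of the feature to split on
                 value: feature value to keep

    returns-- retDataSet: the matching rows with the feature at `axis` removed
    """
    
    retDataSet = []
    for featVec in dataSet:
        # keep only the rows whose feature at `axis` equals the target value
        if featVec[axis] == value: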
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

In [5]:
dataSet,labels = createDataSet()
print(dataSet)
retDataSet = splitDataSet(dataSet,0,1)
print(retDataSet)

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
[[1, 'yes'], [1, 'yes'], [0, 'no']]


You’re now going to combine the Shannon entropy calculation and the splitDataSet() function to cycle through the dataset and decide which feature is the best to split on. Using the entropy calculation tells you which split best organizes your data.

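### 3. Choosing the best feature to split on

Before splitting each non-leaf node of the decision tree, compute the information gain of every attribute and split on the attribute with the largest information gain. The larger the information gain, the better the attribute separates the samples and the more representative it is. This is clearly a top-down greedy strategy.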

In [6]:
def chooseBestFeatureToSplit(dataSet):
    """
    Choose the best feature to split on.
    
    return:
        bestFeature: index of the feature with the largest information gain,
                     or -1 if no feature has a positive information gain
    """
    # number of features; the last column of each row is the class label,
    # and every row must have the same length
    numFeatures = len(dataSet[0]) -1 
    # entropy of the class label ("is fish") before any split
    baseEntropy = calcShannonEnt(dataSet)
    # best information gain so far (starts at 0) and the index of the
    # feature that achieves it (starts at -1)
    bestInfoGain = 0.; bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet] # all values of feature i
        uniqueVals = set(featList) # unique values of feature i
        newEntropy = 0.
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            
            # weight each subset's entropy by its share of the dataset
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            
        infoGain = baseEntropy - newEntropy # information gain of feature i
        if (infoGain > bestInfoGain): # keep the largest gain seen so far
            bestInfoGain = infoGain
            bestFeature = i  # update best feature
    return bestFeature

In [7]:
dataSet,labels = createDataSet()
print(dataSet)
bestFeature = chooseBestFeatureToSplit(dataSet)
print("Best Feature index is ",bestFeature)

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
Best Feature index is  0


So the best feature is "no surfacing", that is, whether the animal can survive without coming to the surface.


You’ll stop under the following conditions: you run out of attributes on which to split or all the instances in a branch are the same class. If all instances have the same class, then you’ll create a leaf node, or terminating block. Any data that reaches this leaf node is deemed to belong to the class of that leaf node. 
![](picture/03.png)

### Before we create the decision tree, we need to understand its logic

Let's look at the logic of a decision tree.

We use this example dataset:

![](picture/04.png)


- step 1: compute Entropy(Fish) = $- \frac{3}{5} log_{2}\frac{3}{5} - \frac{2}{5} log_{2} \frac{2}{5}=0.9709505944546686$
- step 2: compute the information gain (IG) of each feature
    - No surfacing:
        - 1:[is Fish] yes,yes,no 
        - 0:[is Fish] no,no
        - Entropy(1) = $- \frac{2}{3} log_{2}\frac{2}{3} - \frac{1}{3} log_{2} \frac{1}{3}=0.918296 $
        - Entropy(0) = $- 0 - \frac{2}{2} log_{2} \frac{2}{2}=0 $
        - Entropy(Fish | No surfacing) = $\frac{3}{5}\times 0.918296 + \frac{2}{5} \times 0 = 0.5510$
        - the weights $\frac{3}{5},\frac{2}{5}$ are the fractions of the total dataset with No surfacing = 1 and No surfacing = 0
        - IG(Fish, No surfacing) = 0.9709505944546686 - 0.5510 = 0.42
    - Flippers:
        - 1:[is Fish] yes,yes,no,no
        - 0:[is Fish] no
        - Entropy(1) = $- \frac{2}{4} log_{2}\frac{2}{4} - \frac{2}{4} log_{2} \frac{2}{4}=1.0 $
        - Entropy(0) = $- 0 - \frac{1}{1} log_{2} \frac{1}{1}=0 $
        - Entropy(Fish | Flippers) = $\frac{4}{5}\times 1 + \frac{1}{5} \times 0 = 0.8$
        - the weights $\frac{4}{5},\frac{1}{5}$ are the fractions of the total dataset with Flippers = 1 and Flippers = 0
        - IG(Fish, Flippers) = 0.9709505944546686 - 0.8 = 0.17095
        
- step 3: choose the feature with the largest IG
    - so we choose the feature "No surfacing".
- step 4: use the feature "No surfacing" to split the dataset.
- step 5: repeat steps 1-4 on each subset
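The hand calculation above can be reproduced in a few lines. This is a minimal, self-contained sketch; `entropy` and `info_gain` are illustrative helper names rather than the functions defined earlier, and the rows are the five animals from table 3.1.

```python
from collections import Counter
from math import log2

# rows: [no surfacing, flippers, is fish]
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

def entropy(rows):
    """Shannon entropy of the class label (last column)."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

gains = {name: info_gain(data, i)
         for i, name in enumerate(['no surfacing', 'flippers'])}
for name, gain in gains.items():
    print(name, round(gain, 4))
```

The feature with the larger value in `gains` is the one chosen in step 3.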
    
**Last Note:**
 - If the best feature does not split the dataset into subsets that each have a single class label, the remaining features are used to split further.
     - [[1, 1, 'yes'], [1, 0, 'yes'], [1, 0, 'no'], [0, 1, 'yes'], [0, 1, 'no']]
 
 - If there are no more features to split on but a subset still contains more than one class label, we use a "majority vote".
     - [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'yes'], [0, 1, 'no']]
 

In [8]:
import operator
def majorityCnt(classList):
    """
    Return the class label that occurs most often in classList.
    """
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        # count every vote, including the first one for each label
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    print(sortedClassCount)
    return sortedClassCount[0][0]

In [9]:
majorityCnt(['yes','yes','yes','no'])

[('yes', 3), ('no', 1)]


'yes'

### 4. Tree-building code

The first stopping condition is that if all the class labels are the same, then you return this label.

The second stopping condition is the case when there are no more features to split on. You then return the majority class.

In [10]:
def createTree(dataSet,labels):
    # class labels: ["yes","yes",...], the last element of every row
    classList = [example[-1] for example in dataSet]
    
    # first stopping condition: all instances have the same class
    if classList.count(classList[0]) == len(classList):
        
        return classList[0]
    
    # second stopping condition: no features left, only the class column
    if len(dataSet[0]) == 1:
        
        return  majorityCnt(classList)
    
    # index of the best feature
    bestFeat = chooseBestFeatureToSplit(dataSet)
    
    # no remaining feature gives any information gain (bestFeat is -1):
    # fall back to the majority vote instead of indexing with -1,
    # which would split on the class label itself
    if bestFeat == -1:
        
        return majorityCnt(classList)
    
    # name of the best feature
    bestFeatLabel = labels[bestFeat]
    
    # create result tree.
    myTree = {bestFeatLabel:{}}
    
    del (labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues) # get unique values to split dataset.
    for value in uniqueVals:
        subLabels = labels[:] # copy of the remaining labels
        # recurse on each subset
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

In [11]:
dataSet,labels = createDataSet()
print(dataSet)
print(labels)
myTree = createTree(dataSet,labels)
print(myTree)

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
['no surfacing', 'flippers']
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}


Majority vote:

In [12]:
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'yes'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
myTree = createTree(dataSet,labels)
print(myTree)

[('yes', 1), ('no', 1)]
{'flippers': {0: 'no', 1: {'no surfacing': {0: 'yes', 1: 'yes'}}}}


The best feature alone can't separate the classes:

In [13]:
dataSet = [[1, 1, 'yes'], [1, 0, 'yes'], [1, 0, 'no'], [0, 1, 'yes'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
myTree = createTree(dataSet,labels)
print(myTree)

[('yes', 1), ('no', 1)]
[('yes', 1), ('no', 1)]
{'no surfacing': {0: 'yes', 1: {'flippers': {0: 'yes', 1: 'yes'}}}}


### 5. Using decision trees to predict contact lens type

The Lenses dataset is one of the more famous datasets. It’s a number of observations based on patients’ eye conditions and the type of contact lenses the doctor prescribed. The classes are hard, soft, and no contact lenses. The data is from the UCI database repository and is modified slightly so that it can be displayed more easily.

In [14]:
def loadingDataSet():
    """
    Build a decision tree for the contact lens dataset.
    returns:
        lenesTree
    """
    path = "data_set/lenses.txt"
    # each line holds tab-separated feature values followed by the class label
    with open(path) as fr:
        lenses = [inst.strip().split('\t') for inst in fr.readlines() ]
    lenesLabels = ['age','prescript','astigmatic','tearRate']
    lenesTree = createTree(lenses,lenesLabels)
    return lenesTree

In [15]:
lenesTree = loadingDataSet()
print(lenesTree)

{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'young': 'hard', 'pre': 'no lenses', 'presbyopic': 'no lenses'}}, 'myope': 'hard'}}, 'no': {'age': {'young': 'soft', 'pre': 'soft', 'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}}}}}}}
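Once a tree has been built, the nested dictionaries can be walked to classify a new sample. The sketch below is one way to do that. The `classify` function and its arguments are illustrative and not defined elsewhere in this document, and the tree literal is the one printed for the fish dataset in section 4.

```python
def classify(tree, featLabels, testVec):
    """Walk a nested-dict tree and return the class label for testVec."""
    # a leaf is stored as a plain label rather than a dict
    if not isinstance(tree, dict):
        return tree
    featLabel = next(iter(tree))             # feature tested at this node
    featIndex = featLabels.index(featLabel)  # position of that feature in testVec
    branches = tree[featLabel]
    # a feature value never seen during training raises KeyError here
    return classify(branches[testVec[featIndex]], featLabels, testVec)

fishTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
fishLabels = ['no surfacing', 'flippers']
print(classify(fishTree, fishLabels, [1, 0]))
print(classify(fishTree, fishLabels, [1, 1]))
```

Pass the original, complete list of feature names: `createTree` deletes entries from the label list it is given, so that list should be copied before the tree is built.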


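### 6. C4.5

**Information gain ratio:**

The ratio of the information gain $g(D,A)$ of feature $A$ to the entropy $H_A(D)$ of the training dataset $D$ with respect to the values of feature $A$.

C4.5 builds the tree in the same way as ID3, except that it selects features by information gain ratio, so we won't go into more detail here.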
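To make the gain ratio concrete, here is a minimal sketch that computes it for the fish dataset. It assumes the usual C4.5 definition, the information gain divided by the entropy of the feature's own values. `gain_ratio` and the other helper names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of discrete values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())

def gain_ratio(rows, axis):
    """Information gain of feature `axis` divided by the entropy of its values."""
    labels = [row[-1] for row in rows]
    feature = [row[axis] for row in rows]
    conditional = 0.0
    for value in set(feature):
        subset = [row[-1] for row in rows if row[axis] == value]
        conditional += len(subset) / len(rows) * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # the feature has a single value, so it cannot split the data
    return (entropy(labels) - conditional) / split_info

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for i, name in enumerate(['no surfacing', 'flippers']):
    print(name, gain_ratio(data, i))
```

Dividing by the entropy of the feature values penalizes features that take many distinct values, which plain information gain tends to favor.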

---------------

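### 7. CART

Generating a CART decision tree is the process of recursively building a binary decision tree. Regression trees use the squared-error minimization criterion and classification trees use the Gini index minimization criterion to select features and generate the binary tree.

CART generation algorithm

Input: training dataset D, stopping conditions

Output: CART decision tree

Starting from the root node, recursively apply the following operations to each node on the training dataset to build a binary decision tree:

(1) Let D be the training dataset at the node and compute the Gini index of each existing feature on this dataset. For each feature A and each value a that it can take, split D into two parts D1 and D2 according to whether the test A=a is "yes" or "no" for each sample, and compute the Gini index for A=a.

(2) Among all possible features A and all their possible split points a, choose the feature with the smallest Gini index and its split point as the optimal feature and optimal split point. Generate two child nodes from the current node accordingly, and distribute the training dataset to the two child nodes by that feature.

(3) Recursively apply (1) and (2) to the two child nodes until a stopping condition is met.

(4) The CART decision tree is generated.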


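**Note:**

- $Gini(D,A)=\frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$
- $Gini(D) = 1-\sum_{k=1}^{K}(\frac{|C_k|}{|D|})^2$
- $C_k$ is the subset of samples in $D$ that belong to class k, and K is the number of classes

- Stopping conditions:
    - The number of samples in the node is below a threshold.
    - The Gini index of the sample set is below a preset threshold, meaning the samples almost all belong to the same class.
    - There are no more features left for the current sample set.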
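The two Gini formulas can be checked on a small example. The sketch below is plain Python rather than the pandas implementation that follows, and the function names are illustrative. The 15 rows are the "owns a house" column and the class column of the loan table used in section 7.2, written here as yes/no.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2"""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(rows, axis, value):
    """Gini(D, A): weighted Gini of the binary split A == value vs. A != value."""
    d1 = [row[-1] for row in rows if row[axis] == value]
    d2 = [row[-1] for row in rows if row[axis] != value]
    return len(d1) / len(rows) * gini(d1) + len(d2) / len(rows) * gini(d2)

# (owns a house, class)
rows = [('no', 'no'), ('no', 'no'), ('no', 'yes'), ('yes', 'yes'), ('no', 'no'),
        ('no', 'no'), ('no', 'no'), ('yes', 'yes'), ('yes', 'yes'), ('yes', 'yes'),
        ('yes', 'yes'), ('yes', 'yes'), ('no', 'yes'), ('no', 'yes'), ('no', 'no')]

print(gini([label for _, label in rows]))
print(gini_split(rows, 0, 'no'))
```

The second value is 4/15, about 0.2667, which is the Gini index that the CART run in section 7.2 reports for this split.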
    

In [16]:
import pandas as pd
import numpy as np

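### 7.1 Logic walkthrough

1. To implement CART we mainly use the pandas library, which makes it easy to handle the training samples.

2. When training is first called, some parameters need to be initialized:

```python
self.m,self.n = data.shape 
self.Dtree = {}
self._Gini = np.inf
```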

3. Once the parameters are initialized, define the conditions for exiting the recursion:

```python

if  m <= self.Threshold_data:
    print("第一个结束条件退出")
    return self.Dtree
...

```

4. Next, compute the Gini index to get the optimal feature A and its split point a.

```python
def calcGini(self,X,labels):
    ....
```

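**Note:** the computation here is done directly in vector form (the Gini formula can be rewritten in vector form), and $Gini(D,A)$ is split into two terms.

5. Once the smallest Gini index has given us the optimal feature A and its split point a, we split the data on A=a: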

```python
def split_Tree(self,X,best_label,best_class):
    ...
```

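**Note:** although we can split the dataset on A=a, we do not know which side of the split point a is a leaf node. The a side may be a leaf, the not-a side may be a leaf, neither child may be a leaf, or both may be leaves, so we need to check: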

```python
if ((X_equal_last_label == X_equal_last_label.iloc[0]).all())...:
    ....
```


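6. After the dataset has been split, handle the four cases above:

- 1. Both sides of the A=a split are leaf nodes: put the split into the final dictionary.
- 2. One side of the A=a split is a leaf node and the other is not: put the leaf side into the final dictionary and keep recursing on the side that is not a leaf.
- 3. Neither side of the A=a split is a leaf node: keep recursing on both. To keep the result readable, a list is created to hold the splits of the two child nodes.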

```python

if (res_1 is False) and (res_2 is False)...:
    ...

```

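7. When the whole recursion finishes, we get the final dictionary. We can also tune the thresholds in the stopping conditions to deal with overfitting, although pruning is the better approach.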


---------------------

In [17]:
class CART:
    def __init__(self,Threshold_data=0,Threshold_Gini=0.001,isPrintGini=False):
        
        self.isPrintGini = isPrintGini
        self.Threshold_data = Threshold_data
        self.Threshold_Gini = Threshold_Gini
        self.is_start = True
        self.cache_best_label = None
        
    def init_args(self,data,labels):
        self.m,self.n = data.shape
        self.Dtree = {}
        self._Gini = np.inf
        
    def split_Tree(self,X,best_label,best_class):
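        """
        Check for leaf nodes. The best_label chosen by the Gini index does not tell us
        which side of the split is a leaf, so each side has to be checked separately.
        """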
        X_equal_ = X[X[best_label]==best_class]
        X_not_equql_ = X[X[best_label]!=best_class]
        X_equal_last_label = X_equal_.iloc[:,-1]
        X_not_equql_last_label = X_not_equql_.iloc[:,-1]
        
        if ((X_equal_last_label == X_equal_last_label.iloc[0]).all()) and ((X_not_equql_last_label == X_not_equql_last_label.iloc[0]).all()):
            # both X_equal_ and X_not_equql_ are leaf nodes, so this branch is fully split
            
            return False,False
        
        elif (X_equal_last_label == X_equal_last_label.iloc[0]).all():
            # X_equal_ is a leaf node, so the other side is not and should be split further
            return X_not_equql_,False
        
        
        elif (X_not_equql_last_label == X_not_equql_last_label.iloc[0]).all():
            # X_not_equql_ is a leaf node, so the other side is not and should be split further
            return False,X_equal_
        
        else:
            # neither side is a leaf node, so both sides should be split further
            return X_equal_,X_not_equql_
        
                
    def calcGini(self,X,labels):
        res_list = []
        for label in labels[:-1]:
                
            dupli_X = X.drop_duplicates(label)[label]
            groupby_X = X.groupby([label,labels[-1]]).size()
            
            for i in dupli_X:
     
                ####### Gini, first term: D_1 #####
                gini_1 = 1-np.sum(np.power(groupby_X[i] /X.groupby([label]).size()[i],2))
                Gini_1 = X.groupby([label]).size()[i] / self.m * gini_1
                
                ########## Gini, second term: D_2 ########
                filter_X = X[X[label] != i]  # drop the rows where this feature equals i
                gini_2 = 1- np.sum(np.power(filter_X.groupby(labels[-1]).size() / filter_X.groupby(labels[-1]).size().sum(),2))
                Gini_2 = (self.m - X.groupby([label]).size()[i]) / self.m  * gini_2
                GINI = Gini_1 + Gini_2
                
                # store in the list so the smallest Gini value can be found
                res_list.append([label,i,GINI])
                    
                    
        # pick the entry with the smallest Gini value
        best_label,best_class,_Gini = sorted(res_list,key=lambda x:x[2])[0]
        
        return best_label,best_class,_Gini
        
    
    def input_Dtree(self,best_label,best_class):
        
        if self.cache_best_label is not None:
            self.Dtree[self.cache_best_label].append({best_label:best_class})
        else:
            self.Dtree[best_label] = best_class
    
    
    def fit(self,data,labels):
        m = data.shape[0]
        
        if self.is_start:
            self.init_args(data,labels)
            self.is_start = False
        
        # stopping conditions
        if  m <= self.Threshold_data:
            print("第一个结束条件退出")
            return self.Dtree
        
        elif self._Gini <= self.Threshold_Gini:
            print("第二个结束条件退出")
            
            return self.Dtree
        elif len(labels)-1 == 0:
            print("第三个结束条件退出")
            
            return self.Dtree
        
        else:
            best_label,best_class,self._Gini = self.calcGini(data,labels)
                
            if self.isPrintGini:
                print("~Best label:",best_label,"~Best class:",best_class,"~Caculate Gini:",self._Gini)
                
            # start splitting: if one side is a leaf node the result has the form {key:value}, otherwise {key:{key:{}...}}
            res_1,res_2 = self.split_Tree(data,best_label,best_class)
            
            # handle the result according to the different cases of the A=a split
            if (res_1 is False) and (res_2 is False):
                
                self.input_Dtree(best_label,best_class)
                
            elif (res_1 is not False) and (res_2 is False):
                
                self.input_Dtree(best_label,best_class)
                    
                data = res_1.drop(columns=[best_label])
                labels = data.columns
                self.fit(data,labels)
                
            elif (res_1 is False) and (res_2 is not False):
                
                self.input_Dtree(best_label,best_class)
                    
                data = res_2.drop(columns=[best_label])
                labels = data.columns
                self.fit(data,labels)
                
            elif (res_1 is not False) and (res_2 is not False):
                
                self.Dtree[best_label+":"+best_class] = []
                self.cache_best_label = best_label+":"+best_class
                print("{}:{} 子分支第一个".format(best_label,best_class))
                
                data = res_1
                labels = data.columns
                self.fit(data,labels)
                
                self.Dtree[best_label+":not "+best_class] = []
                self.cache_best_label = best_label+":not "+best_class
                
                print("{}:not {} 子分支第二个".format(best_label,best_class))
                
                data = res_2
                labels = data.columns
                self.fit(data,labels)
                self.cache_best_label = None
                
        return self.Dtree


### 7.2 Trying example 5.4 from Statistical Learning Methods (Li Hang)

In [18]:
# Problem 5.1 from the book
def create_data():
    datasets = np.array([
               ['青年', '否', '否', '一般', '否'],
               ['青年', '否', '否', '好', '否'],
               ['青年', '是', '否', '好', '是'],
               ['青年', '是', '是', '一般', '是'],
               ['青年', '否', '否', '一般', '否'],
               ['中年', '否', '否', '一般', '否'],
               ['中年', '否', '否', '好', '否'],
               ['中年', '是', '是', '好', '是'],
               ['中年', '否', '是', '非常好', '是'],
               ['中年', '否', '是', '非常好', '是'],
               ['老年', '否', '是', '非常好', '是'],
               ['老年', '否', '是', '好', '是'],
               ['老年', '是', '否', '好', '是'],
               ['老年', '是', '否', '非常好', '是'],
               ['老年', '否', '否', '一般', '否'],
               ])
    labels = np.array(['年龄', '有工作', '有自己的房子', '信贷情况', '类别'])
    # return the dataset and the name of each column
    return datasets, labels

In [19]:
datasets, labels = create_data()

In [20]:
data = pd.DataFrame(data=datasets,columns=labels)
data

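| | 年龄 | 有工作 | 有自己的房子 | 信贷情况 | 类别 |
|---|---|---|---|---|---|
| 0 | 青年 | 否 | 否 | 一般 | 否 |
| 1 | 青年 | 否 | 否 | 好 | 否 |
| 2 | 青年 | 是 | 否 | 好 | 是 |
| 3 | 青年 | 是 | 是 | 一般 | 是 |
| 4 | 青年 | 否 | 否 | 一般 | 否 |
| 5 | 中年 | 否 | 否 | 一般 | 否 |
| 6 | 中年 | 否 | 否 | 好 | 否 |
| 7 | 中年 | 是 | 是 | 好 | 是 |
| 8 | 中年 | 否 | 是 | 非常好 | 是 |
| 9 | 中年 | 否 | 是 | 非常好 | 是 |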
9,中年,否,是,非常好,是


In [21]:
cart2 = CART(isPrintGini=True)    

In [22]:
data_test = pd.DataFrame(data=datasets,columns=labels)
Dtree = cart2.fit(data_test,labels)
print(Dtree)

~Best label: 有自己的房子 ~Best class: 否 ~Caculate Gini: 0.26666666666666666
~Best label: 有工作 ~Best class: 否 ~Caculate Gini: 0.0
{'有自己的房子': '否', '有工作': '否'}


We can see that the result is correct.

### 7.3 Trying the example from Machine Learning in Action

In [23]:
def loadingDataSet():
    """
    Load the contact lens dataset.
    returns:
        lenses, lenesLabels
    """
    path = "data_set/lenses.txt"
    # each line holds tab-separated feature values followed by the class label
    with open(path) as fr:
        lenses = np.array([inst.strip().split('\t') for inst in fr.readlines() ])
    lenesLabels = ['age','prescript','astigmatic','tearRate','Joker']
    return lenses,lenesLabels

In [24]:
lenses,lenesLabels = loadingDataSet()
data_lenses = pd.DataFrame(data=lenses,columns=lenesLabels)
data_lenses

| | age | prescript | astigmatic | tearRate | Joker |
|---|---|---|---|---|---|
| 0 | young | myope | no | reduced | no lenses |
| 1 | young | myope | no | normal | soft |
| 2 | young | myope | yes | reduced | no lenses |
| 3 | young | myope | yes | normal | hard |
| 4 | young | hyper | no | reduced | no lenses |
| 5 | young | hyper | no | normal | soft |
| 6 | young | hyper | yes | reduced | no lenses |
| 7 | young | hyper | yes | normal | hard |
| 8 | pre | myope | no | reduced | no lenses |
| 9 | pre | myope | no | normal | soft |


In [25]:
cart2 = CART(isPrintGini=True,Threshold_Gini=-np.inf) 

In [26]:
Dtree = cart2.fit(data_lenses,lenesLabels)
print(Dtree)

~Best label: tearRate ~Best class: reduced ~Caculate Gini: 0.3263888888888889
~Best label: astigmatic ~Best class: yes ~Caculate Gini: 0.31944444444444436
astigmatic:yes 子分支第一个
~Best label: prescript ~Best class: hyper ~Caculate Gini: 0.05555555555555555
~Best label: age ~Best class: young ~Caculate Gini: 0.0
astigmatic:not yes 子分支第二个
~Best label: age ~Best class: presbyopic ~Caculate Gini: 0.041666666666666664
~Best label: prescript ~Best class: myope ~Caculate Gini: 0.0
{'tearRate': 'reduced', 'astigmatic:yes': [{'prescript': 'hyper'}, {'age': 'young'}], 'astigmatic:not yes': [{'age': 'presbyopic'}, {'prescript': 'myope'}]}


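The split from the original example is shown below:
![](picture/tree_1.png)

The result is identical, but what is missing here is pruning.

CART pruning: see **Statistical Learning Methods**, p. 73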
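The CART class above does not prune. As a point of comparison only, scikit-learn exposes minimal cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below assumes scikit-learn 0.22 or later and uses the built-in iris data rather than the lenses data. Picking the middle alpha on the pruning path is an arbitrary choice made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# fully grown tree, no pruning
full = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)

# candidate alphas along the cost-complexity pruning path
path = full.cost_complexity_pruning_path(X_iris, y_iris)

# refit with a mid-range alpha; larger alphas prune more aggressively
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_iris, y_iris)

print("nodes before pruning:", full.tree_.node_count)
print("nodes after pruning: ", pruned.tree_.node_count)
```

In practice the alpha would be chosen by cross-validation rather than taken from the middle of the path.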

### 8. sklearn.tree.DecisionTreeClassifier

[sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

[sklearn.tree.DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

In [27]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
import graphviz

In [28]:
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0, 1, -1]])
    # print(data)
    return data[:,:2], data[:,-1]
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

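Note that scikit-learn expects data in a different form from the one we built earlier, so we cannot use scikit-learn directly to check the results of our examples. You can of course convert the example data into that form and try.

Input format expected by scikit-learn: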

```
X : array-like or sparse matrix, shape = [n_samples, n_features]
The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.

y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels) as integers or strings.
```
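The fish dataset from the first sections already uses 0/1 features, so it can be put into this form by separating the feature columns from the label column. This is a minimal sketch, assuming scikit-learn 0.21 or later for `export_text`. The names `X_fish`, `y_fish` and `fishClf` are illustrative. `criterion='entropy'` is chosen to mirror the entropy-based splitting used earlier, although scikit-learn still builds a binary tree rather than running ID3.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
featureNames = ['no surfacing', 'flippers']

# X has shape [n_samples, n_features]; y has shape [n_samples]
X_fish = [row[:-1] for row in dataSet]
y_fish = [row[-1] for row in dataSet]

fishClf = DecisionTreeClassifier(criterion='entropy', random_state=0)
fishClf.fit(X_fish, y_fish)

# text rendering of the fitted tree, no graphviz needed
print(export_text(fishClf, feature_names=featureNames))
print(fishClf.predict([[1, 1], [0, 1]]))
```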



In [29]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [30]:
clf.score(X_test, y_test)

0.9666666666666667

In [31]:
export_graphviz(clf, out_file="picture/mytree.dot")

Converting the dot file to a png file requires installing [graphviz](https://www.graphviz.org/download/).
Use the commands given in [sklearn.tree.export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html), replacing `tree.dot` with the exported `picture/mytree.dot`:

```
$ dot -Tps tree.dot -o tree.ps  (PostScript format) 
$ dot -Tpng tree.dot -o tree.png    (PNG format)
```

![](picture/mytree.png)