# Lab 6: Decision Tree Classifier
- Name: 冯思程
- Student ID: 2112213
- Major: Computer Science and Technology

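## Lab Requirements
### Deadline: December 15
Package the lab code and the lab report in an archive named in the form "student ID + name (6)" and send it to 18329300691@163.com.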

### Dataset

The provided datasets are used here.


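### Basic Requirements
- Build an ID3 decision tree on the Watermelon-train1 dataset (discrete attributes only);
- Use the resulting ID3 decision tree to predict on the Watermelon-test1 dataset and report the classification accuracy.
### Intermediate Requirements
- Build a C4.5 or CART decision tree on the Watermelon-train2 dataset; it must be able to handle continuous attributes;
- Predict on the Watermelon-test2 test set and report the classification accuracy.
### Advanced Requirements
Apply any pruning algorithm to the trees built for the basic and intermediate requirements, observe whether the classification accuracy on the test set improves, and give the analysis.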



=======================================================================================================================

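# Getting Started

**Environment**: Python 3.10.9 + VS Code 1.82.2 + the required third-party libraries, such as pandas.

<span style="color:red">**Note**</span>: The code below is commented throughout, and each code block is preceded by a short explanation and analysis. Thank you to the senior students grading this report for your hard work!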

## Basic Requirements

### Importing the required packages

In [12]:
import pandas as pd
from math import log
from sklearn.model_selection import train_test_split

### Loading the data

In [13]:
train1 = pd.read_csv("Watermelon-train1.csv", encoding='gbk')
test1 = pd.read_csv("Watermelon-test1.csv", encoding='gbk')
train2 = pd.read_csv("Watermelon-train2.csv", encoding='gbk')
test2 = pd.read_csv("Watermelon-test2.csv", encoding='gbk')
print(train1)

    编号  色泽  根蒂  敲声  纹理 好瓜
0    1  青绿  蜷缩  浊响  清晰  是
1    2  乌黑  蜷缩  沉闷  清晰  是
2    3  乌黑  蜷缩  浊响  清晰  是
3    4  青绿  蜷缩  沉闷  清晰  是
4    5  浅白  蜷缩  浊响  清晰  是
5    6  青绿  稍蜷  浊响  清晰  是
6    7  乌黑  稍蜷  浊响  稍糊  是
7    8  乌黑  稍蜷  浊响  清晰  是
8    9  乌黑  稍蜷  沉闷  稍糊  否
9   10  青绿  硬挺  清脆  清晰  否
10  11  浅白  硬挺  清脆  模糊  否
11  12  浅白  蜷缩  浊响  模糊  否
12  13  青绿  稍蜷  浊响  稍糊  否
13  14  浅白  稍蜷  沉闷  稍糊  否
14  15  浅白  蜷缩  浊响  模糊  否
15  16  青绿  蜷缩  沉闷  稍糊  否


### Data preprocessing

In [14]:
def preprocessing(data):
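    # Drop the first column (the sample ID)
    data = data.drop(columns=['编号'])
    # Mapping from category names to integer codes
    feature_mappings = {
        '色泽': {'浅白': 0, '乌黑': 1, '青绿': 2},
        '根蒂': {'蜷缩': 0, '稍蜷': 1, '硬挺': 2},
        '敲声': {'浊响': 0, '沉闷': 1, '清脆': 2},
        '纹理': {'模糊': 0, '稍糊': 1, '清晰': 2},
        '好瓜': {'是': 1, '否': 0}
    }

    # Encode the features and the target; columns without a mapping (such as 密度) are left unchanged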
    for column, mapping in feature_mappings.items():
        if column in data.columns:
            data[column] = data[column].map(mapping)
    
    return data

train1 = preprocessing(train1)
test1 = preprocessing(test1)
train2 = preprocessing(train2)
test2 = preprocessing(test2)
print(train1)
print(train2)

    色泽  根蒂  敲声  纹理  好瓜
0    2   0   0   2   1
1    1   0   1   2   1
2    1   0   0   2   1
3    2   0   1   2   1
4    0   0   0   2   1
5    2   1   0   2   1
6    1   1   0   1   1
7    1   1   0   2   1
8    1   1   1   1   0
9    2   2   2   2   0
10   0   2   2   0   0
11   0   0   0   0   0
12   2   1   0   1   0
13   0   1   1   1   0
14   0   0   0   0   0
15   2   0   1   1   0
    色泽  根蒂  敲声  纹理     密度  好瓜
0    2   0   0   2  0.697   1
1    1   0   1   2  0.774   1
2    1   0   0   2  0.634   1
3    2   0   1   2  0.608   1
4    0   0   0   2  0.556   1
5    2   1   0   2  0.403   1
6    1   1   0   1  0.481   1
7    1   1   0   2  0.437   1
8    1   1   1   1  0.666   0
9    2   2   2   2  0.243   0
10   0   2   2   0  0.245   0
11   0   0   0   0  0.343   0
12   2   1   0   1  0.639   0
13   0   1   1   1  0.657   0
14   1   1   0   2  0.360   0
15   0   0   0   0  0.593   0
16   2   0   1   1  0.719   0


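### Implementing the ID3 decision tree

In information theory and probability, entropy measures the uncertainty of a random variable. Let $X$ be a discrete random variable that takes finitely many values; its entropy is
$$ H(X)=-\sum_{i=1}^{n} p\left(x_{i}\right) \log p\left(x_{i}\right)$$ 
$H(X)$ is called the entropy of the random variable $X$ and measures its uncertainty. The larger the entropy, the greater the uncertainty. Entropy is largest when $X$ is uniformly distributed, and it is zero when one outcome has probability 1.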
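
As a quick check of the entropy formula, the sketch below evaluates it on a few class-count vectors. The helper name `entropy` and the count-based interface are choices made for this illustration only; base-2 logarithms are assumed, matching the `log(probability, 2)` call used in the code of this report.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    total = sum(counts)
    # Classes with a zero count contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


balanced = entropy([8, 8])   # two equally likely classes, as in the train1 labels
pure = entropy([16])         # every sample in one class
skewed = entropy([7, 1])     # mostly one class
print(balanced, pure, skewed)
```

The balanced case gives the maximum for two classes, the pure case gives zero, and the skewed case falls in between, which is the behaviour described in the text.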

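The conditional entropy $H(Y \mid X)$ is the uncertainty of the random variable $Y$ given the random variable $X$. It is defined as the expectation over $X$ of the entropy of the conditional distribution of $Y$ given $X$: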
$$H(Y \mid X)=\sum_{x} p(x) H(Y \mid X=x) =-\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x)$$

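The **information gain** of feature $A$ on dataset $D$ is the difference between the entropy $H(D)$ and the conditional entropy $H(D \mid A)$:
$$H(D) - H(D \mid A)$$

It is the amount by which knowing feature $A$ reduces the uncertainty about dataset $D$. A feature with a larger information gain separates the classes better, so we **choose the feature that reduces the uncertainty of the data the most**, that is, the feature with the largest information gain.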
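
To make the definition concrete, the following sketch computes the information gain of 纹理 on train1 from class counts read off the encoded table above. The function names and the `[good, bad]` count layout are assumptions of this example, not part of the report's code.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """H(D) - H(D|A); child_counts holds the class counts of each subset D_v."""
    total = sum(parent_counts)
    conditional = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - conditional


# Class counts as [good, bad], counted from the train1 table.
parent = [8, 8]
texture_children = [[7, 1], [1, 4], [0, 3]]   # 清晰, 稍糊, 模糊
gain = information_gain(parent, texture_children)
print(f"{gain:.3f}")
```

The printed value should agree with the gain that the ID3 log in this report shows for 纹理 at the root (0.503).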

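#### Building the decision tree

Starting from the root node, compute the information gain of every candidate feature, choose the feature with the largest information gain as the splitting feature for that node, and create one child node for each value of that feature.
Then apply the same procedure recursively to each child node until a stopping condition is met, which yields a decision tree.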

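#### Stopping conditions:

  1. All samples at the current node belong to the same class;
  2. All samples at the current node have identical attribute values, or no attributes remain for further splitting;
  3. The maximum tree depth is reached;
  4. The minimum number of samples per leaf is reached.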
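
The four conditions can be collected into one predicate, as in the hedged sketch below. The function `should_stop` and its parameters `max_depth` and `min_samples_leaf` are hypothetical names introduced here; the tree-building functions in this report check only conditions 1 and 2. For condition 2 the sketch covers only the "no attributes remain" case, and reading condition 4 as "the node holds at most `min_samples_leaf` samples" is one possible interpretation.

```python
def should_stop(labels, n_remaining_features, depth, max_depth=None, min_samples_leaf=1):
    """Return True if the node holding `labels` should become a leaf."""
    if len(set(labels)) <= 1:
        return True   # condition 1: the node is pure
    if n_remaining_features == 0:
        return True   # condition 2: no attribute is left to split on
    if max_depth is not None and depth >= max_depth:
        return True   # condition 3: the depth limit is reached
    if len(labels) <= min_samples_leaf:
        return True   # condition 4: the node is already as small as a leaf may be
    return False


print(should_stop([1, 1, 1], n_remaining_features=3, depth=0))
print(should_stop([1, 0, 1, 0], n_remaining_features=2, depth=1, max_depth=3))
```

When a node stops without being pure, the leaf label is usually decided by majority vote, which is what `majority_vote` does in the code that follows.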

In [15]:
# Compute the Shannon entropy (in bits) of the label column
def calculate_entropy(dataframe, target_column='好瓜'):
    
    num_entries = len(dataframe)
    label_counts = dataframe[target_column].value_counts()

    entropy = 0.0
    for count in label_counts:
        probability = count / num_entries
        entropy -= probability * log(probability, 2)

    return entropy


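# Extract the subset of rows whose value for the given feature equals `value`, and drop that feature column from the result.
def split_dataframe(dataframe, feature, value):

    # Keep only the rows where the feature column equals the given value
    filtered_df = dataframe[dataframe[feature] == value]

    # Drop the feature column
    subdataframe = filtered_df.drop(columns=[feature])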

    return subdataframe


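# Iterate over every feature column, compute its information gain, and return the name of the feature with the largest gain.
def id3_choose_best_feature_to_split(dataframe, target_column='好瓜'):
    
    base_entropy = calculate_entropy(dataframe, target_column)
    # Start below zero so that a feature is still returned when every gain is 0
    best_info_gain = -1.0
    best_feature = ''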

    for feature in dataframe.columns:
        if feature == target_column:
            continue  # Skip the target column
        unique_vals = dataframe[feature].unique()
        new_entropy = 0.0
        
        for value in unique_vals:
            sub_df = split_dataframe(dataframe, feature, value)
            probability = len(sub_df) / float(len(dataframe))
            new_entropy += probability * calculate_entropy(sub_df, target_column)
        
        info_gain = base_entropy - new_entropy
        print(f"ID3: Information Gain for feature '{feature}': {info_gain:.3f}")
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = feature

    return best_feature


# When all features have been used but the class labels are still mixed, decide the leaf's class by majority vote.
def majority_vote(labels):
    label_count = {}
    for label in labels:
        if label not in label_count:
            label_count[label] = 0
        label_count[label] += 1

    sorted_label_count = sorted(label_count.items(), key=lambda item: item[1], reverse=True)
    return sorted_label_count[0][0]


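# Build an ID3 decision tree recursively, represented as nested dicts
def ID3_createTree(dataset, labels):
    classList = dataset.iloc[:, -1].tolist()  # assumes the last column is the class label

    # Stop: all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]

    # Stop: only the label column is left, so fall back to a majority vote
    if len(dataset.columns) == 1:
        return majority_vote(classList)

    bestFeatLabel = id3_choose_best_feature_to_split(dataset, dataset.columns[-1])
    ID3Tree = {bestFeatLabel: {}}

    # Feature names that remain after removing the one just used
    remaining_features = [lbl for lbl in dataset.columns if lbl != bestFeatLabel and lbl != dataset.columns[-1]]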

    uniqueVals = dataset[bestFeatLabel].unique()
    for value in uniqueVals:
        reducedDataset = split_dataframe(dataset, bestFeatLabel, value)
        ID3Tree[bestFeatLabel][value] = ID3_createTree(reducedDataset, remaining_features)

    return ID3Tree


# Classify a single instance (one row of a DataFrame)
def classify(decision_tree, feature_labels, test_vector):
    if not isinstance(decision_tree, dict):
        # If decision_tree is not a dict, it is the value of a leaf node
        return decision_tree

    root_feature = list(decision_tree.keys())[0]
    sub_tree = decision_tree[root_feature]

    if root_feature not in feature_labels:
        return None  # the feature is not in feature_labels

    feature_index = feature_labels.index(root_feature)
    
    if feature_index >= len(test_vector):
        return None  # the feature index is out of range for this row

    # Positional lookup: feature_labels must list the columns in the same order as the row
    feature_value = test_vector.iloc[feature_index]

    if isinstance(sub_tree, dict) and feature_value in sub_tree:
        # Only descend when sub_tree is a dict that has a branch for this value
        if isinstance(sub_tree[feature_value], dict):
            return classify(sub_tree[feature_value], feature_labels, test_vector)
        else:
            return sub_tree[feature_value]
    return None  # no branch for this value: the instance is left unclassified




# Classify every row of a DataFrame.
def classifytest(decision_tree, feature_labels, test_data):
    classification_results = []
    for _, row in test_data.iterrows():
        classification_results.append(classify(decision_tree, feature_labels, row))
    return classification_results


# Compute the accuracy
def calculate_accuracy(predicted_labels, true_labels):
    if len(true_labels) == 0:
        return 0  # with no ground-truth labels, report an accuracy of 0
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    accuracy = correct / len(true_labels)
    return accuracy


In [16]:
# Build the decision tree
feature_labels = list(train1.columns)  # all column names (features plus the label column)
decision_tree1 = ID3_createTree(train1, feature_labels)

# Classify the test data
predicted_labels1 = classifytest(decision_tree1, feature_labels, test1)

# Compute the accuracy
true_labels = test1.iloc[:, -1].tolist()  # ground-truth labels
accuracy = calculate_accuracy(predicted_labels1, true_labels)

print("Accuracy:", accuracy)

ID3: Information Gain for feature '色泽': 0.174
ID3: Information Gain for feature '根蒂': 0.148
ID3: Information Gain for feature '敲声': 0.180
ID3: Information Gain for feature '纹理': 0.503
ID3: Information Gain for feature '色泽': 0.138
ID3: Information Gain for feature '根蒂': 0.544
ID3: Information Gain for feature '敲声': 0.544
ID3: Information Gain for feature '色泽': 0.322
ID3: Information Gain for feature '根蒂': 0.073
ID3: Information Gain for feature '敲声': 0.322
ID3: Information Gain for feature '根蒂': 0.000
ID3: Information Gain for feature '敲声': 1.000
Accuracy: 0.7


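## Intermediate Requirements

I chose the C4.5 algorithm here. A brief introduction follows:

- C4.5 is similar to ID3 and improves on it.
- The problem with information gain as a splitting criterion:

     Information gain is biased toward features with many distinct values. Take a student ID as a feature: every student has a different ID, so splitting on it puts each student in a separate group, which is meaningless. C4.5 instead selects features by **information gain ratio** while building the tree, which corrects this bias.

- Characteristics
  - Can discretize continuous attributes
  - Can handle incomplete data
  - Requires multiple sequential scans and sorts of the dataset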
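
The bias described above can be seen numerically. The sketch below compares information gain with the gain ratio, taken here as the gain divided by the intrinsic value $IV(A) = -\sum_v p_v \log_2 p_v$ of the split. The ID-like feature with 16 distinct values is hypothetical, and the 纹理 counts are read off train1; the helper names are assumptions of this example.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain_and_ratio(parent_counts, child_counts):
    """Return (information gain, gain ratio) of a split described by class counts."""
    total = sum(parent_counts)
    weights = [sum(child) / total for child in child_counts]
    conditional = sum(w * entropy(child) for w, child in zip(weights, child_counts))
    gain = entropy(parent_counts) - conditional
    intrinsic_value = sum(-w * log2(w) for w in weights if w > 0)
    ratio = gain / intrinsic_value if intrinsic_value > 0 else 0.0
    return gain, ratio


parent = [8, 8]                                # [good, bad] in train1
id_children = [[1, 0]] * 8 + [[0, 1]] * 8      # hypothetical ID: one sample per value
texture_children = [[7, 1], [1, 4], [0, 3]]    # 纹理: 清晰, 稍糊, 模糊

id_gain, id_ratio = gain_and_ratio(parent, id_children)
tex_gain, tex_ratio = gain_and_ratio(parent, texture_children)
print(f"ID-like: gain={id_gain:.3f} ratio={id_ratio:.3f}")
print(f"纹理:    gain={tex_gain:.3f} ratio={tex_ratio:.3f}")
```

By raw information gain the ID-like feature wins, while by gain ratio 纹理 ranks higher, because the intrinsic value grows with the number of branches.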


In [17]:
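# Select the best feature to split on for C4.5 (largest information gain ratio)
def C45_chooseBestFeatureToSplit(dataframe):
    
    base_entropy = calculate_entropy(dataframe)
    # Start below zero so that a feature is still returned when every ratio is 0
    best_info_gain_ratio = -1.0
    best_feature = ''

    for feature in dataframe.columns[:-1]:  # iterate over all features, excluding the final label column
        feat_list = dataframe[feature]
        # Distinct values of this feature; every distinct value, including each
        # value of the continuous 密度 column, becomes its own branch here
        unique_vals = set(feat_list)
        new_entropy = 0.0
        IV = 0.0

        for value in unique_vals:
            sub_dataframe = split_dataframe(dataframe, feature, value)
            p = len(sub_dataframe) / float(len(dataframe))
            new_entropy += p * calculate_entropy(sub_dataframe)
            IV -= p * log(p, 2) if p > 0 else 0  # guard against log(0) when p is 0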

        info_gain = base_entropy - new_entropy
        info_gain_ratio = info_gain / IV if IV != 0 else 0

        print(f"C4.5: Information Gain Ratio for feature '{feature}': {info_gain_ratio:.3f}")
        if info_gain_ratio > best_info_gain_ratio:
            best_info_gain_ratio = info_gain_ratio
            best_feature = feature

    return best_feature

# Build a C4.5 decision tree recursively
def C45_createTree(dataset, labels):
    classList = dataset.iloc[:, -1].tolist()  # assumes the last column is the class label

    if classList.count(classList[0]) == len(classList):
        return classList[0]

    # No features are left for further splitting
    if len(dataset.columns) == 1:
        return majority_vote(classList)

    bestFeatLabel = C45_chooseBestFeatureToSplit(dataset)
    C45Tree = {bestFeatLabel: {}}

    # Remove the feature that was just used
    remaining_features = [lbl for lbl in labels if lbl != bestFeatLabel]

    uniqueVals = dataset[bestFeatLabel].unique()
    for value in uniqueVals:
        reducedDataset = split_dataframe(dataset, bestFeatLabel, value)
        C45Tree[bestFeatLabel][value] = C45_createTree(reducedDataset, remaining_features)

    return C45Tree
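
The selection function above branches on every distinct value of a column, so each 密度 value gets its own branch. A common way to treat a continuous attribute in C4.5 is a binary split at a threshold chosen among the midpoints of adjacent sorted values. The sketch below illustrates that idea on the 密度 column of train2; the function `best_threshold` is hypothetical, and ranking the candidate thresholds by information gain is an assumption of this sketch rather than the report's method.

```python
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split `value <= threshold`."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, -1.0
    # Candidate thresholds are the midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        conditional = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - conditional
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# 密度 and 好瓜 columns of train2, in table order (8 good melons, then 9 bad ones).
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]
good = [1] * 8 + [0] * 9
threshold, gain = best_threshold(density, good)
print(f"threshold={threshold:.4f} gain={gain:.3f}")
```

A tree built this way stores a comparison such as `密度 <= threshold` at the node, so unseen density values in the test set still follow one of the two branches instead of finding no matching branch.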



In [18]:
# Build the decision tree
feature_labels = list(train2.columns)  # all column names (features plus the label column)
decision_tree2 = C45_createTree(train2, feature_labels)

# Classify the test data
predicted_labels2 = classifytest(decision_tree2, feature_labels, test2)

# Compute the accuracy
true_labels2 = test2.iloc[:, -1].tolist()  # ground-truth labels
accuracy2 = calculate_accuracy(predicted_labels2, true_labels2)

print("Accuracy:", accuracy2)

C4.5: Information Gain Ratio for feature '色泽': 0.068
C4.5: Information Gain Ratio for feature '根蒂': 0.102
C4.5: Information Gain Ratio for feature '敲声': 0.106
C4.5: Information Gain Ratio for feature '纹理': 0.263
C4.5: Information Gain Ratio for feature '密度': 0.244
C4.5: Information Gain Ratio for feature '色泽': 0.031
C4.5: Information Gain Ratio for feature '根蒂': 0.339
C4.5: Information Gain Ratio for feature '敲声': 0.270
C4.5: Information Gain Ratio for feature '密度': 0.241
C4.5: Information Gain Ratio for feature '色泽': 0.274
C4.5: Information Gain Ratio for feature '敲声': 0.000
C4.5: Information Gain Ratio for feature '密度': 0.579
C4.5: Information Gain Ratio for feature '色泽': 0.212
C4.5: Information Gain Ratio for feature '根蒂': 0.101
C4.5: Information Gain Ratio for feature '敲声': 0.332
C4.5: Information Gain Ratio for feature '密度': 0.311
C4.5: Information Gain Ratio for feature '色泽': 1.000
C4.5: Information Gain Ratio for feature '根蒂': 0.000
C4.5: Information Gain Ratio for feature '密度':

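## Advanced Requirements

- Decision trees are prone to **overfitting**. The cause is that learning focuses entirely on classifying the training data correctly, which produces an overly complex tree.
- The remedy is **pruning**, that is, simplifying the tree after it has been built: some subtrees or leaf nodes are cut from the tree, and their root or parent node becomes a new leaf node.
- The two basic pruning strategies are **pre-pruning** and **post-pruning**:
   - **Pre-pruning** **stops tree growth early** according to some rule, for example when the tree reaches a user-specified depth, when a node holds fewer samples than a user-specified number, or when the decrease in impurity falls below a user-specified threshold.
   - **Post-pruning** removes branches from a fully grown tree. After the tree is built, every non-leaf node is examined one by one, **bottom-up**.

Here I add pre-pruning and post-pruning to the ID3 tree from the basic requirements and to the C4.5 tree from the intermediate requirements, and then analyze the results.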
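
One common form of pre-pruning compares held-out accuracy before and after a candidate split and keeps the split only if accuracy improves. The sketch below shows that decision in isolation. The function `keep_split` and the dict-of-label-lists inputs are hypothetical and are not the interface used by the functions in this report.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, truths):
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def keep_split(train_branches, val_branches):
    """Decide whether a candidate split survives pre-pruning.

    Both arguments map each feature value to the class labels that fall into
    that branch (training labels and held-out labels respectively).
    Returns (keep, accuracy_as_leaf, accuracy_after_split).
    """
    train_all = [y for ys in train_branches.values() for y in ys]
    val_all = [y for ys in val_branches.values() for y in ys]
    leaf_label = majority(train_all)
    acc_leaf = accuracy([leaf_label] * len(val_all), val_all)

    predictions, truths = [], []
    for value, val_labels in val_branches.items():
        # Each branch predicts its own training majority; an empty or unseen
        # branch falls back to the parent's majority label.
        branch_train = train_branches.get(value)
        branch_label = majority(branch_train) if branch_train else leaf_label
        predictions += [branch_label] * len(val_labels)
        truths += val_labels
    acc_split = accuracy(predictions, truths)
    return acc_split > acc_leaf, acc_leaf, acc_split


train_branches = {"a": [1, 1, 1, 0], "b": [0, 0, 1]}
print(keep_split(train_branches, {"a": [1, 1], "b": [0, 0]}))
print(keep_split(train_branches, {"a": [1, 1], "b": [0, 1]}))
```

The strict comparison means a split that only ties the leaf is pruned, which favours the smaller tree; using `>=` instead would keep such splits.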


In [19]:
def ID3_createTree2(dataset, labels, test_dataset=None, pre_pruning=True, post_pruning=True):
    classList = dataset.iloc[:, -1].tolist()  # assumes the last column is the class label

    if classList.count(classList[0]) == len(classList):
        return classList[0]

    if len(dataset.columns) == 1:
        return majority_vote(classList)

    bestFeatLabel = id3_choose_best_feature_to_split(dataset)
    ID3Tree = {bestFeatLabel: {}}
    
    uniqueVals = dataset[bestFeatLabel].unique()

    # Defined before the pre-pruning block so the subtree loop below also runs when pre_pruning is False
    accuracy_with_split = 0

    # Pre-pruning (test_dataset is the held-out split used for the pruning decisions)
    if pre_pruning and test_dataset is not None:
        leaf = majority_vote(classList)
        accuracy_without_split = calculate_accuracy(
            classifytest({bestFeatLabel: leaf}, labels, test_dataset),
            test_dataset.iloc[:, -1]
        )
        for value in uniqueVals:
            subDataset = split_dataframe(dataset, bestFeatLabel, value)
            subTestset = split_dataframe(test_dataset, bestFeatLabel, value)
            if not subTestset.empty:  # skip branches with no held-out samples
                subTree = ID3_createTree2(subDataset, labels[:], subTestset)
                accuracy_with_split += calculate_accuracy(
                    classifytest({bestFeatLabel: {value: subTree}}, labels, subTestset),
                    subTestset.iloc[:, -1]
                )
        if accuracy_without_split >= accuracy_with_split / len(uniqueVals):
            print("发生预剪枝处理")
            return leaf

    # Build the subtrees
    for value in uniqueVals:
        subDataset = split_dataframe(dataset, bestFeatLabel, value)
        if test_dataset is not None:  # only split the held-out data when it was provided
            subTestset = split_dataframe(test_dataset, bestFeatLabel, value)
            if not subTestset.empty:  # skip branches with no held-out samples
                # This build only feeds accuracy_with_split; the subtree stored in the tree is built again below
                subTree = ID3_createTree2(subDataset, labels[:], subTestset, pre_pruning, post_pruning)
                accuracy_with_split += calculate_accuracy(
                    classifytest({bestFeatLabel: {value: subTree}}, labels, subTestset),
                    subTestset.iloc[:, -1]
                )
        else:
            subTestset = None
        ID3Tree[bestFeatLabel][value] = ID3_createTree2(subDataset, labels[:], subTestset, pre_pruning, post_pruning)
    # Post-pruning
    if post_pruning and test_dataset is not None:
        leaf = majority_vote(classList)
        accuracy_without_split = calculate_accuracy(
            classifytest({bestFeatLabel: leaf}, labels, test_dataset),
            test_dataset.iloc[:, -1]
        )
        accuracy_with_split = calculate_accuracy(
            classifytest(ID3Tree, labels, test_dataset),
            test_dataset.iloc[:, -1]
        )
        if accuracy_without_split >= accuracy_with_split:
            print("发生后剪枝处理")
            return leaf

    return ID3Tree



In [20]:
# Build the decision tree
feature_labels = list(train1.columns)  # all column names (features plus the label column)

# Split train1: 70% to grow the tree, 30% held out for the pruning decisions
train, test = train_test_split(train1, test_size=0.3, random_state=2023)
decision_tree3 = ID3_createTree2(train, feature_labels, test)

# Classify the test data
predicted_labels3 = classifytest(decision_tree3, feature_labels, test1)

# Compute the accuracy
true_labels = test1.iloc[:, -1].tolist()  # ground-truth labels
accuracy = calculate_accuracy(predicted_labels3, true_labels)

print("Accuracy 2.0:", accuracy)

ID3: Information Gain for feature '色泽': 0.254
ID3: Information Gain for feature '根蒂': 0.150
ID3: Information Gain for feature '敲声': 0.227
ID3: Information Gain for feature '纹理': 0.319
ID3: Information Gain for feature '色泽': 0.128
ID3: Information Gain for feature '根蒂': 0.592
ID3: Information Gain for feature '敲声': 0.592
ID3: Information Gain for feature '色泽': 0.918
ID3: Information Gain for feature '根蒂': 0.252
ID3: Information Gain for feature '敲声': 0.918
发生预剪枝处理
ID3: Information Gain for feature '色泽': 0.128
ID3: Information Gain for feature '根蒂': 0.592
ID3: Information Gain for feature '敲声': 0.592
ID3: Information Gain for feature '色泽': 0.128
ID3: Information Gain for feature '根蒂': 0.592
ID3: Information Gain for feature '敲声': 0.592
ID3: Information Gain for feature '色泽': 0.918
ID3: Information Gain for feature '根蒂': 0.252
ID3: Information Gain for feature '敲声': 0.918
发生预剪枝处理
ID3: Information Gain for feature '色泽': 0.918
ID3: Information Gain for feature '根蒂': 0.252
ID3: Information G

In [21]:
def C45_createTree2(dataframe, feature_labels, test_dataframe=None, pre_pruning=True, post_pruning=True):
    class_list = dataframe.iloc[:, -1].tolist()

    if len(set(class_list)) == 1:
        return class_list[0]

    if len(dataframe.columns) == 1:
        return majority_vote(class_list)

    best_feature = C45_chooseBestFeatureToSplit(dataframe)
    C45_tree = {best_feature: {}}
    unique_values = dataframe[best_feature].unique()

    # Pre-pruning (test_dataframe is the held-out split used for the pruning decisions)
    if pre_pruning and test_dataframe is not None:
        leaf = majority_vote(class_list)
        leaf_predictions = [leaf] * len(test_dataframe)
        root_accuracy = calculate_accuracy(leaf_predictions, test_dataframe.iloc[:, -1].tolist())
        split_accuracies = []

        for value in unique_values:
            sub_dataframe = split_dataframe(dataframe, best_feature, value)
            sub_test_dataframe = split_dataframe(test_dataframe, best_feature, value)
            if not sub_test_dataframe.empty:
                sub_tree = C45_createTree2(sub_dataframe, feature_labels, sub_test_dataframe, pre_pruning, post_pruning)
                sub_predictions = classifytest({best_feature: sub_tree}, feature_labels, sub_test_dataframe)
                split_accuracies.append(calculate_accuracy(sub_predictions, sub_test_dataframe.iloc[:, -1].tolist()))

        if split_accuracies and all(accuracy < root_accuracy for accuracy in split_accuracies):
            print("发生预剪枝处理")
            return leaf

    # Build the subtrees
    for value in unique_values:
        sub_dataframe = split_dataframe(dataframe, best_feature, value)
        if not sub_dataframe.empty:
            sub_test_dataframe = split_dataframe(test_dataframe, best_feature, value) if test_dataframe is not None else None
            C45_tree[best_feature][value] = C45_createTree2(sub_dataframe, feature_labels, sub_test_dataframe, pre_pruning, post_pruning)

    # Post-pruning: replace the subtree with a leaf if the leaf is more accurate on the held-out data
    if post_pruning and test_dataframe is not None:
        leaf = majority_vote(class_list)
        tree_predictions = classifytest(C45_tree, feature_labels, test_dataframe)
        tree_accuracy = calculate_accuracy(tree_predictions, test_dataframe.iloc[:, -1].tolist())
        leaf_accuracy = calculate_accuracy([leaf] * len(test_dataframe), test_dataframe.iloc[:, -1].tolist())

        if leaf_accuracy > tree_accuracy:
            print("发生后剪枝处理")
            return leaf

    return C45_tree


In [22]:
# Build the decision tree
feature_labels = list(train2.columns)  # all column names (features plus the label column)

# Split train2: 70% to grow the tree, 30% held out for the pruning decisions
trainc, testc = train_test_split(train2, test_size=0.3, random_state=2023)
decision_tree4 = C45_createTree2(trainc, feature_labels, testc)

# Classify the test data
predicted_labels4 = classifytest(decision_tree4, feature_labels, test2)

# Compute the accuracy
true_labels = test2.iloc[:, -1].tolist()  # ground-truth labels
accuracy = calculate_accuracy(predicted_labels4, true_labels)

print("Accuracy 2.0:", accuracy)

C4.5: Information Gain Ratio for feature '色泽': 0.152
C4.5: Information Gain Ratio for feature '根蒂': 0.113
C4.5: Information Gain Ratio for feature '敲声': 0.118
C4.5: Information Gain Ratio for feature '纹理': 0.257
C4.5: Information Gain Ratio for feature '密度': 0.273
发生后剪枝处理
Accuracy 2.0: 0.6


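### Analysis

After adding pre-pruning and post-pruning, the classification accuracy of the ID3 and C4.5 decision trees changed as follows:
- ID3 decision tree: improved from 0.7 to 0.8
- C4.5 decision tree: 0.6 before and 0.6 after, no change

For the ID3 tree, the markers I printed show that pre-pruning was triggered three times, and accuracy rose from 0.7 to 0.8. Given the number of rows in test1, this corresponds to one additional correctly classified row, which suggests that pre-pruning improved the tree's classification performance and helped avoid overfitting.

For the C4.5 tree, the markers show that post-pruning was triggered once, yet the accuracy did not change. After looking into this, I believe it is related to the amount of data. Post-pruning walks the fully built tree and prunes it, so in principle it can remove subtrees that may be overfitting. I attribute the unchanged accuracy to test2 being too small: it has only 5 rows, so chance plays a large role. With enough data, I would expect a clear improvement.

Additional note: once pre-pruning and post-pruning are added, the training data for the ID3 and C4.5 trees is no longer the same as before. Previously the whole of train was used to build the tree; now train is split into a training part and a held-out part during construction, so this is not a perfectly controlled comparison. I also considered keeping train unchanged and taking part of test for the pruning checks, but rejected that idea because it would bias the test set even more. In every experiment I evaluated on the same test set.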
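
To put the small-test-set concern into numbers, the sketch below computes the smallest possible accuracy step and a rough standard error for an accuracy measured on `n` rows. The standard error uses the usual binomial formula with a normal approximation, which is only indicative at such small sizes. The value `n = 5` for test2 comes from the analysis above; `n = 10` for test1 is an assumption inferred from the one-row step between 0.7 and 0.8.

```python
from math import sqrt


def accuracy_resolution(n_samples):
    """Smallest possible change in accuracy on a test set of n_samples rows."""
    return 1 / n_samples


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate (normal approximation)."""
    return sqrt(accuracy * (1 - accuracy) / n_samples)


# (rows, measured accuracy); the row count 10 for test1 is an assumption.
for n, acc in [(5, 0.6), (10, 0.7), (10, 0.8)]:
    step = accuracy_resolution(n)
    se = accuracy_standard_error(acc, n)
    print(f"n={n} accuracy={acc} step={step:.2f} standard_error={se:.3f}")
```

With so few rows, a single prediction moves the accuracy by a whole step, so differences of one step between two trees should be read with caution.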