# Assignment 2: Decision Tree Implementation


## Loading the dataset
Load the training dataset and take a first look at it with head() and isnull().

In [236]:
import numpy as np 
import pandas as pd 

train_data = pd.read_csv("kaggle/input/car/car_evaluation.csv", header=None)
train_data.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
train_data.head()

  buying  maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


In [237]:
train_data.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64
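
Every column in this dataset is categorical, so describe() and value_counts() give a useful complementary view: the number of distinct levels per column and the balance of the target classes. A minimal sketch on a small made-up frame (the rows below are illustrative and not taken from car_evaluation.csv; the names `sample`, `summary` and `class_share` are hypothetical):

```python
import pandas as pd

# Hypothetical mini-sample in the same style as the car dataset.
sample = pd.DataFrame({
    "buying": ["vhigh", "vhigh", "low", "med", "low"],
    "safety": ["low", "med", "high", "high", "low"],
    "class":  ["unacc", "unacc", "acc", "acc", "unacc"],
})

# For non-numeric columns describe() reports count, unique, top and freq.
summary = sample.describe()

# Relative frequency of each target label; a quick check for class imbalance.
class_share = sample["class"].value_counts(normalize=True)

print(summary)
print(class_share)
```

A strongly skewed `class_share` would be a reason to read plain accuracy with some caution.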

## Classifying with decision trees

Set up the encoder and add a helper that measures accuracy.

In [238]:
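import category_encoders as ce

# Map every categorical feature to integer codes.
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])

def accuracy(y_true, y_pred):
    # Compare by position, so a pandas index on either input cannot misalign the result.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))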

Split the data into training and test sets, then encode the features.

In [239]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data.drop(['class'], axis=1), train_data['class'], test_size=0.3, random_state=42)

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
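
An ordinal encoder replaces each category with an integer code. As a rough illustration of the idea (this is not the category_encoders implementation), the sketch below builds such a mapping by hand. The helper names `fit_ordinal` and `apply_ordinal`, and the choice of -1 for unseen categories, are assumptions made for this example.

```python
def fit_ordinal(values):
    """Assign codes 1..N to categories in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def apply_ordinal(values, mapping, unknown=-1):
    """Encode values; categories not seen during fitting get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


train_col = ["vhigh", "high", "vhigh", "med"]
test_col = ["med", "low"]  # "low" never appears in train_col

# Learn the mapping from the training column only, then reuse it.
mapping = fit_ordinal(train_col)
encoded_train = apply_ordinal(train_col, mapping)
encoded_test = apply_ordinal(test_col, mapping)

print(mapping)
print(encoded_train, encoded_test)
```

Learning the mapping from the training column and only reusing it on the test column is the same pattern as the fit_transform / transform pair above: the test split never influences the fitted codes.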

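Classify with scikit-learn's DecisionTreeClassifier and print the accuracy. Note that scikit-learn implements an optimized version of CART rather than ID3; criterion='entropy' makes it choose splits by information gain, as ID3 does.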

In [240]:
from sklearn.tree import DecisionTreeClassifier

model2 = DecisionTreeClassifier(criterion='entropy', max_depth=15)
model2.fit(X_train, y_train)
y_pred_train_sklearn = model2.predict(X_train)
y_pred_test_sklearn = model2.predict(X_test)
print("The accuracy in the train data set: ", accuracy(y_train, y_pred_train_sklearn))
print("The accuracy in the test data set:  ", accuracy(y_test, y_pred_test_sklearn))

The accuracy in the train data set:  1.0
The accuracy in the test data set:   0.9421965317919075


Classify with the self-implemented ID3 decision tree and print the accuracy.

In [241]:
from DecisionTree import DecisionTree
dt = DecisionTree(max_depth=5)
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)
print("The accuracy in the train data set: ", accuracy(y_train, y_pred_train))
print("The accuracy in the test data set:  ", accuracy(y_test, y_pred_test))

The accuracy in the train data set:  0.9710504549214226
The accuracy in the test data set:   0.9190751445086706
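
The DecisionTree module itself is not shown here. For orientation, ID3 picks at each node the attribute with the largest information gain, that is, the largest drop in label entropy after splitting on that attribute. A minimal sketch of that criterion follows; the helper names `entropy` and `information_gain` are hypothetical and not taken from the author's module.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature, labels):
    """Entropy reduction obtained by splitting `labels` on each value of `feature`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each child's entropy by the fraction of samples it receives.
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted


# Toy data: `safety` separates the classes perfectly, `doors` not at all.
safety = np.array([1, 1, 2, 2])
doors = np.array([1, 2, 1, 2])
label = np.array(["unacc", "unacc", "acc", "acc"])

gain_safety = information_gain(safety, label)  # 1 bit
gain_doors = information_gain(doors, label)    # 0 bits
print(gain_safety, gain_doors)
```

An ID3 node would split on `safety` here, since it has the larger gain.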


## Comparing the two decision tree models


Use a t-test to check whether the two decision tree models differ significantly.

In [242]:
from sklearn import model_selection
errorlist_sklearn = np.array([])
errorlist_mydecis = np.array([])

# split the train_data into 5 folds
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(train_data):
    X_train, X_test = train_data.iloc[train_index], train_data.iloc[test_index]
    y_train, y_test = X_train['class'], X_test['class']
    X_train = encoder.fit_transform(X_train.drop(['class'], axis=1))
    X_test = encoder.transform(X_test.drop(['class'], axis=1))

    dt = DecisionTree(max_depth=2)
    dt.fit(X_train, y_train)
    y_pred_train = dt.predict(X_train)
    y_pred_test = dt.predict(X_test)
    errorlist_mydecis = np.append(errorlist_mydecis, 1 - accuracy(y_test, y_pred_test))

    dt2 = DecisionTreeClassifier(criterion='entropy', max_depth=2)
    dt2.fit(X_train, y_train)
    y_pred_train_sklearn = dt2.predict(X_train)
    y_pred_test_sklearn = dt2.predict(X_test)
    errorlist_sklearn = np.append(errorlist_sklearn, 1 - accuracy(y_test, y_pred_test_sklearn))

print("The array of error for sklearn: ", errorlist_sklearn)
print("The array of error for my decision tree: ", errorlist_mydecis)

# Paired t statistic over the k = 5 folds: sqrt(k) * mean(d) / std(d),
# where d is the per-fold difference in error between the two models.
error = errorlist_sklearn - errorlist_mydecis
k = len(error)
check_t = np.sqrt(k / np.var(error, ddof=1)) * np.mean(error)
print("The check_t: ", round(check_t, 3))

The array of error for sklearn:  [0.25722543 0.20231214 0.19942197 0.22028986 0.27536232]
The array of error for my decision tree:  [0.23410405 0.21098266 0.20231214 0.22898551 0.28695652]
The check_t:  -0.274


In [243]:
# Independent two-sample t-test on the per-fold errors (scipy's default assumes equal variances)
from scipy import stats
t_statistic, p_value = stats.ttest_ind(errorlist_sklearn, errorlist_mydecis)
print("The t-statistic: ", t_statistic)
print("The p-value: ", p_value)

The t-statistic:  -0.08255088912599556
The p-value:  0.9362367668355995
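
Both models are evaluated on the same five folds, so the per-fold errors are paired. The usual alternative to the independent-samples test is then a paired t-test on the per-fold differences d_i: t = sqrt(k) * mean(d) / std(d), with k - 1 degrees of freedom. A minimal sketch using the error arrays printed above; the function name `paired_t` is hypothetical, and 2.776 is the two-sided 5% critical value of Student's t for 4 degrees of freedom.

```python
import numpy as np


def paired_t(errors_a, errors_b):
    """Paired t statistic for two error arrays measured on the same folds."""
    d = np.asarray(errors_a) - np.asarray(errors_b)
    k = d.size
    # ddof=1 gives the sample standard deviation of the differences.
    return float(np.sqrt(k) * d.mean() / d.std(ddof=1))


# Per-fold test errors copied from the cross-validation output above.
err_sklearn = np.array([0.25722543, 0.20231214, 0.19942197, 0.22028986, 0.27536232])
err_mytree = np.array([0.23410405, 0.21098266, 0.20231214, 0.22898551, 0.28695652])

t_paired = paired_t(err_sklearn, err_mytree)
T_CRIT = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom

significant = abs(t_paired) > T_CRIT
print("paired t:", round(t_paired, 3), "significant:", significant)
```

Here |t| stays far below the critical value, which agrees with the p-value of about 0.94 reported above: at max_depth=2 the two trees show no significant difference in error. scipy.stats.ttest_rel computes the same paired statistic together with a p-value.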
