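|Course: Machine Learning|Student name: 邓力予|Student ID: 20201910442|
|-|-|-|
|Experiment: Machine Learning Experiment 1| | |
|School: School of Mathematics and Statistics|Major: Data Science and Big Data Technology|Year: 2020 cohort|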

# Using scikit-learn

## Importing packages

In [1]:
from sklearn import tree
import pandas as pd
from sklearn.metrics import accuracy_score
import numpy as np

## Loading the data
### Reading the data with pandas

In [2]:
data_train = pd.read_csv("fashion-mnist_train.csv")
data_test = pd.read_csv("fashion-mnist_test.csv")

### Splitting the dataset

- Training set

In [3]:
data_train_x = data_train.iloc[:, 1:]
data_train_y = data_train["label"]

- Test set

In [4]:
data_test_x = data_test.iloc[:, 1:]
data_test_y = data_test["label"]

## Training a model with scikit-learn

### Training

- Instantiate the model

In [5]:
model_sklearn = tree.DecisionTreeClassifier()

- Train the model

In [6]:
model_sklearn.fit(data_train_x, y=data_train_y)

### Testing

In [7]:
data_test_y_pre = model_sklearn.predict(data_test_x)

- Compute the accuracy

In [8]:
score_sklearn = accuracy_score(data_test_y, data_test_y_pre)
print(score_sklearn)

0.7993
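
The printed score is the fraction of test samples whose predicted label equals the true label. A minimal pure-Python sketch of that definition, using made-up labels (`y_true`, `y_pred`, and `accuracy` are illustrative names, not part of the experiment):

```python
# Hypothetical labels, for illustration only
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

print(accuracy(y_true, y_pred))  # 4 of 5 match -> 0.8
```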


# Without scikit-learn

## Building the decision tree

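- Compute the information entropy

In [9]:
def cal_information_entropy(data):
    # The class label is read from the last column of `data`
    data_label = data.iloc[:, -1]
    label_class = data_label.value_counts()  # sample count for each class
    Ent = 0
    for k in label_class.keys():
        p_k = label_class[k] / len(data_label)  # proportion of class k
        Ent += -p_k * np.log2(p_k)
    return Ent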
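
The function above computes Ent(D) = -sum_k p_k * log2(p_k), where p_k is the proportion of class k. The sketch below reproduces the same quantity on plain Python lists so the numbers are easy to check by hand; the helper name `entropy` and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy([0, 0, 1, 1]))  # two equally likely classes -> 1.0
print(entropy([3, 3, 3, 3]))  # a single class -> 0.0
```

Entropy is largest when the classes are evenly mixed and drops to zero when a node is pure, which is why `create_tree` can stop as soon as only one class remains.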

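- Compute the information gain of attribute a on the given data

In [10]:
def cal_information_gain(data, a):
    Ent = cal_information_entropy(data)  # entropy before splitting
    feature_class = data[a].value_counts()  # sample count for each value of attribute a
    cond_ent = 0  # weighted entropy of the subsets after splitting on a
    for v in feature_class.keys():
        weight = feature_class[v] / data.shape[0]
        Ent_v = cal_information_entropy(data.loc[data[a] == v])
        cond_ent += weight * Ent_v
    return Ent - cond_ent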
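
To make the definition concrete, here is a small self-contained sketch of information gain on toy lists. The helpers `entropy` and `information_gain` and the two made-up features are illustrative and independent of the notebook's functions.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after grouping by values."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# Toy data: feat_a separates the classes perfectly, feat_b not at all
labels = [0, 0, 1, 1]
feat_a = ["x", "x", "y", "y"]
feat_b = ["x", "y", "x", "y"]
print(information_gain(feat_a, labels))  # 1.0
print(information_gain(feat_b, labels))  # 0.0
```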

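- Compute the gain ratio, starting with the intrinsic value (IV)

In [11]:
def cal_gain_ratio(data, a):
    IV_a = 0
    feature_class = data[a].value_counts()  # sample count for each value of attribute a
    for v in feature_class.keys():
        weight = feature_class[v] / data.shape[0]
        IV_a += -weight * np.log2(weight)
    if IV_a == 0:  # attribute takes a single value: nothing to split, avoid dividing by zero
        return 0.0
    gain_ratio = cal_information_gain(data, a) / IV_a
    return gain_ratio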
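
For reference, the quantities computed by the three functions above can be written as follows. The notebook itself gives no formulas, so the notation is an assumption: $D$ is the data set, $p_k$ the proportion of class $k$, and $D^v$ the subset of $D$ where attribute $a$ takes value $v$.

```latex
\begin{aligned}
\mathrm{Ent}(D) &= -\sum_{k} p_k \log_2 p_k \\
\mathrm{Gain}(D, a) &= \mathrm{Ent}(D) - \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v) \\
\mathrm{IV}(a) &= -\sum_{v} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|} \\
\mathrm{Gain\_ratio}(D, a) &= \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}
\end{aligned}
```

$\mathrm{IV}(a)$ grows with the number of distinct values of $a$, so dividing by it penalizes attributes that split the data into many small subsets.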

- Get the most frequent class label

In [12]:
def get_most_label(data):
    data_label = data.iloc[:, -1]
    label_sort = data_label.value_counts(sort=True)  # sorted by count, descending
    return label_sort.index[0]

- Select the best feature: among the features whose information gain is at least the average, pick the one with the highest gain ratio

In [13]:
def get_best_feature(data):
    features = data.columns[:-1]
    res = {}
    for a in features:
        gain = cal_information_gain(data, a)
        gain_ratio = cal_gain_ratio(data, a)
        res[a] = (gain, gain_ratio)
    res = sorted(res.items(), key=lambda x: x[1][0], reverse=True)  # rank by information gain
    res_avg = sum([x[1][0] for x in res]) / len(res)  # average information gain
    good_res = [x for x in res if x[1][0] >= res_avg]  # keep features with gain at or above the average
    result = sorted(good_res, key=lambda x: x[1][1], reverse=True)  # rank those by gain ratio
    return result[0][0]  # feature with the highest gain ratio among the high-gain features
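
The two-stage rule can be seen in isolation on a handful of made-up scores. In this sketch the feature names and the (information gain, gain ratio) pairs are hypothetical; only the selection logic mirrors `get_best_feature`.

```python
# Hypothetical (information gain, gain ratio) pairs per feature, for illustration only
scores = {
    "f1": (0.40, 0.20),
    "f2": (0.30, 0.60),
    "f3": (0.05, 0.90),  # highest ratio, but its gain is below average
}

# Stage 1: keep features whose gain is at or above the average gain (about 0.25 here)
avg_gain = sum(g for g, _ in scores.values()) / len(scores)
candidates = {f: s for f, s in scores.items() if s[0] >= avg_gain}

# Stage 2: among those, pick the highest gain ratio
best = max(candidates, key=lambda f: candidates[f][1])
print(best)  # f2
```

Filtering by gain first keeps a feature such as `f3` from winning on ratio alone when it barely reduces entropy.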

- Split the data into (attribute value, subset) tuples and drop the feature column that was just used

In [14]:
def drop_exist_feature(data, best_feature):
    attr = pd.unique(data[best_feature])  # values the feature takes in this subset
    new_data = [(nd, data[data[best_feature] == nd]) for nd in attr]
    new_data = [(n[0], n[1].drop([best_feature], axis=1)) for n in new_data]
    return new_data

- Build the decision tree

In [15]:
def create_tree(data):
    data_label = data.iloc[:, -1]
    if len(data_label.value_counts()) == 1:  # only one class left
        return data_label.values[0]
    if all(len(data[i].value_counts()) == 1 for i in data.iloc[:, :-1].columns):  # all samples share the same feature values: return the majority class
        return get_most_label(data)
    best_feature = get_best_feature(data)  # best splitting feature by the gain / gain-ratio rule
    Tree = {best_feature: {}}  # the tree is stored as nested dicts
    exist_vals = pd.unique(data[best_feature])  # values of the best feature present in the current data
    if len(exist_vals) != len(column_count[best_feature]):  # some values seen in the full training set are missing here
        no_exist_attr = set(column_count[best_feature]) - set(exist_vals)  # the missing values
        for no_feat in no_exist_attr:
            Tree[best_feature][no_feat] = get_most_label(data)  # missing values map to the current majority class
    for item in drop_exist_feature(data, best_feature):  # recurse on each value of the feature
        Tree[best_feature][item[0]] = create_tree(item[1])
    return Tree

- Predict

In [16]:
def predict(Tree, test_data):
    first_feature = list(Tree.keys())[0]  # feature tested at this node
    second_dict = Tree[first_feature]  # branches keyed by feature value
    input_first = test_data.get(first_feature)
    input_value = second_dict[input_first]  # raises KeyError if the value has no branch
    if isinstance(input_value, dict):  # a dict is a subtree; anything else is a leaf label
        class_label = predict(input_value, test_data)
    else:
        class_label = input_value
    return class_label
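
The nested-dict layout is easier to picture with a tiny hand-written tree. The sketch below uses made-up feature names and labels and an iterative helper, `walk`, which is an illustrative variant of the recursive `predict` rather than the notebook's own code.

```python
# Hand-written tree in the same layout: {feature: {value: subtree_or_label}}
# Feature names and labels are made up for illustration.
toy_tree = {
    "outlook": {
        "sunny": {"windy": {True: "stay in", False: "go out"}},
        "rainy": "stay in",
    }
}

def walk(tree, sample):
    """Follow branches until a non-dict leaf is reached (iterative variant)."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))             # the single feature tested at this node
        node = node[feature][sample[feature]]  # KeyError if the value has no branch
    return node

print(walk(toy_tree, {"outlook": "sunny", "windy": False}))  # go out
print(walk(toy_tree, {"outlook": "rainy", "windy": True}))   # stay in
```

A sample whose feature value has no branch raises `KeyError`, which is the case the test loop later catches.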

## Training

- Record the distinct values of each feature in a global variable

In [17]:
column_count = dict([(ds, list(pd.unique(data_train[ds]))) for ds in data_train.iloc[:, :-1].columns])
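
The tree-building helpers read the class label with `data.iloc[:, -1]` and treat `data.columns[:-1]` as the features, whereas `data_train_x = data_train.iloc[:, 1:]` suggests that `label` is the first column of the CSV. A minimal sketch, on a made-up frame, of one way to move `label` to the end so that the two conventions agree (`toy`, `feature_cols`, and `label_last` are illustrative names):

```python
import pandas as pd

# Made-up frame with the label in the first column, mimicking the CSV layout
toy = pd.DataFrame({"label": [0, 1, 1], "pixel1": [5, 0, 7], "pixel2": [0, 0, 9]})

# Reorder the columns so that the label comes last
feature_cols = [c for c in toy.columns if c != "label"]
label_last = toy[feature_cols + ["label"]]

print(list(label_last.columns))         # ['pixel1', 'pixel2', 'label']
print(label_last.iloc[:, -1].tolist())  # [0, 1, 1]
```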

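- Train the decision tree

In [18]:
try:
    # Reuse a previously saved tree if one exists
    with open("Tree.txt", "r") as file_tree:
        dicision_Tree = eval(file_tree.read())
except FileNotFoundError:
    # Otherwise build the tree and cache its string form
    dicision_Tree = create_tree(data_train)
    with open("Tree.txt", "w") as file_tree:
        file_tree.write(str(dicision_Tree))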
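
Caching the tree as `str(...)` and reading it back with `eval` works only if every key and leaf has a text form that evaluates back to the same object. As an alternative sketch (not what the notebook does), the standard-library `pickle` module can store a nested dict directly. The file name `tree.pkl` and the helper `load_or_build` are hypothetical, and, like `eval`, `pickle.load` should only be used on files you created yourself.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a cached object from path, or build it and cache it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = build()
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj

# Usage with a toy tree; a temporary directory keeps the example self-contained
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "tree.pkl")
    first = load_or_build(cache, lambda: {"pixel1": {0: 3, 255: {"pixel2": {0: 1}}}})
    second = load_or_build(cache, lambda: {"never": "called"})
    print(first == second)  # True
```

The second call finds the cache file and never invokes its `build` argument, which is the same load-or-train pattern as the cell above.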

## Testing

- Predict on the test set

In [19]:
pre_y_mytree = []
for i in range(len(data_test_x)):
    try:
        pre_y_mytree.append(predict(dicision_Tree, data_test_x.iloc[i, :]))
    except KeyError:
        # no branch for this feature value: record a missing prediction
        pre_y_mytree.append(np.nan)

- Accuracy

In [20]:
right_count = 0
for i in range(len(data_test_y)):
    if pre_y_mytree[i] == data_test_y.iloc[i]:
        right_count = right_count + 1
print(right_count / len(data_test_y))

0.1
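
Because the prediction loop stores `np.nan` whenever `predict` raises `KeyError`, the accuracy alone does not show how many samples received no prediction at all. A small sketch, on made-up lists, of reporting the two separately (`preds` and `truth` are hypothetical stand-ins for `pre_y_mytree` and `data_test_y`):

```python
import math

# Made-up predictions; float("nan") marks a sample where no branch matched
preds = [0, float("nan"), 2, 0, 4]
truth = [0, 1, 2, 1, 4]

def is_missing(p):
    """True for the NaN placeholder, False for an actual label."""
    return isinstance(p, float) and math.isnan(p)

n_missing = sum(is_missing(p) for p in preds)
n_correct = sum(p == t for p, t in zip(preds, truth))  # NaN never equals a label

print(n_missing / len(preds))                # share with no prediction: 0.2
print(n_correct / len(preds))                # overall accuracy: 0.6
print(n_correct / (len(preds) - n_missing))  # accuracy among answered samples: 0.75
```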
