## 机器学习-周志华-习题6.3

### 6.3 选择两个 UCI 数据集，分别用线性核和高斯核训练一个 SVM，并与BP 神经网络和 C4.5 决策树进行实验比较。

这里就只用`sklearn`中自带的iris数据集来对比题中几个算法。这里数据集不大，只有150个样本，所以就不拿出额外的样本作为测试集了，进行`5-flod`交叉验证，最后验证集的平均准确率作为评价模型标准。

--- 
- SVM将使用`sklearn.svm`
- BP神经网络将使用`Tensorflow`实现
- 关于C4.5。Python中貌似没有C4.5的包，在第四章写的决策树代码也并不是严格的C4.5，为了方便这里就还是使用`sklearn`吧。`sklearn`中决策树是优化的CART算法。

---
此外，各模型都进行了粗略的调参，不过在这里的`notebook`省略了。

### 1、导入相关包

In [1]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.model_selection import KFold, train_test_split, cross_val_score, cross_validate
from sklearn import svm, tree
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, log_loss
import numpy as np
# import tensorflow as tf

### 2、数据读入

In [2]:
iris = datasets.load_iris()
X = pd.DataFrame(iris['data'], columns=iris['feature_names'])

y = pd.Series(iris['target_names'][iris['target']])
# y = pd.get_dummies(y)

In [11]:
X.head()
y.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
dtype: object

### 3、模型对比

#### 3.1 线性核SVM

In [4]:
linear_svm = svm.SVC(C=1, kernel='linear')
linear_scores = cross_validate(linear_svm, X, y, cv=5, scoring='accuracy')

In [5]:
linear_scores['test_score'].mean()

0.9800000000000001

#### 3.2 高斯核SVM

In [6]:
rbf_svm = svm.SVC(C=1)
rbf_scores = cross_validate(rbf_svm, X, y, cv=5, scoring='accuracy')

In [7]:
rbf_scores['test_score'].mean()

0.9666666666666666

#### 3.3 BP神经网络

这里BP神经网络使用`tensorflow`实现，其实在`sklearn`中也有（当然在第五章也用`numpy`实现过，也可以用），不过这里因为个人原因还是使用`tensorflow`。。不过事实上如果为了答这道题，使用`sklearn`其实代码量会更少。

---
`tensorflow`里面没有现成的交叉验证的api（`tensorflow`中虽然也有其他机器学习算法的api，但它主要还是针对深度学习的工具，训练一个深度学习模型常常需要大量的数据，这个时候做交叉验证成本太高，所以深度学习中通常不做交叉验证，这也为什么`tensorflow`没有cv的原因），这里使用 `sklearn.model_selection.KFold`实现BP神经网络的交叉验证。

In [8]:
# # 定义模型，这里采用一层隐藏层的BP神经网络，神经元个数为16
# x_input = tf.placeholder('float', shape=[None, 4])
# y_input = tf.placeholder('float', shape=[None, 3])
# 
# keep_prob = tf.placeholder('float', name='keep_prob')
# 
# W1 = tf.get_variable('W1', [4, 16], initializer=tf.contrib.layers.xavier_initializer(seed=0))
# b1 = tf.get_variable('b1', [16], initializer=tf.contrib.layers.xavier_initializer(seed=0))
# 
# h1 = tf.nn.relu(tf.matmul(x_input, W1) + b1)
# h1_dropout = tf.nn.dropout(h1, keep_prob=keep_prob, name='h1_dropout')
# 
# W2 = tf.get_variable('W2', [16, 3], initializer=tf.contrib.layers.xavier_initializer(seed=0))
# b2 = tf.get_variable('b2', [3], initializer=tf.contrib.layers.xavier_initializer(seed=0))
# 
# y_output = tf.matmul(h1_dropout, W2) + b2

In [9]:
# # 定义训练步骤、准确率等
# cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=y_output, labels=y_input))
# 
# train_step = tf.train.AdamOptimizer(0.003).minimize(cost)
# 
# correct_prediction = tf.equal(tf.argmax(y_output, 1), tf.argmax(y_input, 1))
# accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))

In [10]:
# # 将目标值one-hot编码
# y_dummies = pd.get_dummies(y)

In [11]:
# sess = tf.Session()
# init = tf.global_variables_initializer()
# costs = []
# accuracys = []
# 
# for train, test in KFold(5, shuffle=True).split(X):
#     sess.run(init)
#     X_train = X.iloc[train, :]
#     y_train = y_dummies.iloc[train, :]
#     X_test = X.iloc[test, :]
#     y_test = y_dummies.iloc[test, :]
# 
#     for i in range(1000):
#         sess.run(train_step, feed_dict={x_input: X_train, y_input: y_train, keep_prob: 0.3})
# 
#     test_cost_, test_accuracy_ = sess.run([cost, accuracy],
#                                           feed_dict={x_input: X_test, y_input: y_test, keep_prob: 1})
#     accuracys.append(test_accuracy_)
#     costs.append(test_cost_)

In [12]:
# print(accuracys)
# print(np.mean(accuracys))

[0.96666664, 0.96666664, 0.96666664, 0.96666664, 0.93333334]
0.96


In [13]:
encoder = OneHotEncoder(sparse_output=False)
y_dummies = encoder.fit_transform(y.values.reshape(-1, 1))
# y_dummies = pd.get_dummies(y)
# 标准化特征数据
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 初始化 KFold 交叉验证
kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracys = []
costs = []

# 遍历每个训练/测试分割
for train_index, test_index in kf.split(X_scaled):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y_dummies[train_index], y_dummies[test_index]

    # 定义 MLPClassifier 模型
    # 这里隐含层的神经元个数为 16，激活函数为 ReLU
    # 与原始代码中的参数 keep_prob 类似，这里使用默认的 dropout
    model = MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                          solver='adam', learning_rate_init=0.003,
                          max_iter=1000, random_state=0)

    # 训练模型
    model.fit(X_train, y_train)

    # 预测
    y_test_pred_probs = model.predict_proba(X_test)
    y_test_pred = model.predict(X_test)

    # 计算准确率和损失
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_cost = log_loss(y_test, y_test_pred_probs)

    accuracys.append(test_accuracy)
    costs.append(test_cost)

# 输出结果
print(accuracys)
print(np.mean(accuracys))



每折的准确率: [1.0, 0.6666666666666666, 0.9333333333333333, 0.7333333333333333, 1.0]
平均准确率: 0.8666666666666666




#### 3.4 CART

In [13]:
cart_tree = tree.DecisionTreeClassifier()
tree_scores = cross_validate(rbf_svm, X, y, cv=5, scoring='accuracy')

In [14]:
tree_scores

{'fit_time': array([ 0.00199413,  0.00256157,  0.00185156,  0.00298214,  0.0030067 ]),
 'score_time': array([ 0.00099921,  0.00099659,  0.00114751,  0.00106406,  0.        ]),
 'test_score': array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ]),
 'train_score': array([ 0.98333333,  0.98333333,  0.99166667,  0.98333333,  0.975     ])}

In [15]:
tree_scores['test_score'].mean()

0.98000000000000009

### 4 总结

因为`iris`数据原因，本身容易区分，这四个模型最终结果来看几乎一致（除了自己拿`tensorflow`写的BP神经网络，验证集上的准确率低了0.02）