## Testing Machine Learning Algorithms - Accuracy

Use **train_test_split** to set aside one part of the data for training and keep the rest for testing.

This notebook trains a kNN classifier on two datasets bundled with sklearn: the iris dataset and the handwritten digits dataset.

### 1. Iris dataset

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets 

Load the dataset and preprocess it:

In [2]:
iris = datasets.load_iris() # use the iris dataset
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [3]:
X = iris.data
y = iris.target

In [4]:
print(X.shape, y.shape)

(150, 4) (150,)


In [5]:
# Randomly shuffle the row indexes of the dataset
shuffled_indexes = np.random.permutation(len(X))
shuffled_indexes

array([116,  51,  61,  48, 139,  83, 117, 137,   7, 119,  95,  71, 125,
       100, 144, 102, 129, 123,  42,  65,  70, 103,  98,   2,  93,  56,
        74,  16, 104,   1, 127, 126,  77,  32,  72,   4,  37, 147,  60,
        84,  33,   0, 136,  80, 132, 145,  14, 142, 108,  52,  34,  78,
        86,  66,  43, 133,  21, 118,  68, 101,  29,   6,  15,  69,  85,
         5, 121,  58,  67,  73,  38,  24,  41,  19,  47, 111, 131,  91,
         3, 105,  90,  22, 124,  49,  26, 135, 106,  57, 143,  96, 130,
        64,   8,  59,  44,  99, 149, 148,  79, 140,  39,  89,  30,  94,
        27,  11,  17, 107,  28,  54,  53, 109,  82,  63,  92, 122,  40,
       113,  23,  36,  62,  13,  12, 112, 141,  46, 138,  97,  25, 146,
        50,  87,  76,  55,  81,  18,  20,  75, 115,  31, 128, 120, 110,
        45,  88,   9,  35, 134,  10, 114])

Split the data into a training set and a test set:

In [6]:
test_ratio = 0.2
test_size = int(len(X) * test_ratio)

# The first 20% of the shuffled indexes form the test set; the remaining 80% form the training set
test_indexes = shuffled_indexes[:test_size]
train_indexes = shuffled_indexes[test_size:]

In [7]:
X_train = X[train_indexes]
y_train = y[train_indexes]

X_test = X[test_indexes]
y_test = y[test_indexes]
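
The manual steps above (shuffle the indexes, cut at `test_size`, index into `X` and `y`) can be wrapped into one reusable function. The following is a minimal sketch; the name `split_train_test` and the `seed` parameter are hypothetical and not part of the original notebook. The return order mirrors sklearn's `train_test_split`.

```python
import numpy as np

def split_train_test(X, y, test_ratio=0.2, seed=None):
    """Shuffle row indexes, then split X and y into train and test parts.

    Hypothetical helper: the name and the `seed` argument are assumptions
    made for illustration.
    """
    assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be in [0, 1]"

    # A seeded generator makes the shuffle reproducible
    rng = np.random.default_rng(seed)
    shuffled_indexes = rng.permutation(len(X))

    # The first test_size shuffled indexes go to the test set, the rest to training
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]
```

With the iris arrays from above, `split_train_test(X, y, test_ratio=0.2)` would return 120 training rows and 30 test rows.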

Evaluation:

In [9]:
%run kNN/kNN.py 

my_knn_clf = KNNClassifier(3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
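
The file `kNN/kNN.py` is not shown in this notebook. As a rough idea of what a classifier with the same `fit` / `predict` interface might look like, here is a minimal sketch. The class name `SimpleKNN`, the Euclidean distance, and the majority vote are assumptions and may differ from the actual `KNNClassifier`.

```python
import numpy as np
from collections import Counter

class SimpleKNN:
    """Hypothetical stand-in for KNNClassifier; not the author's implementation."""

    def __init__(self, k):
        assert k >= 1, "k must be at least 1"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # kNN has no real training step: it only stores the training data
        assert self.k <= len(X_train), "k cannot exceed the number of training rows"
        self._X_train = np.asarray(X_train, dtype=float)
        self._y_train = np.asarray(y_train)
        return self

    def predict(self, X_predict):
        X_predict = np.asarray(X_predict, dtype=float)
        return np.array([self._predict_one(x) for x in X_predict])

    def _predict_one(self, x):
        # Euclidean distance from x to every training row (an assumption)
        distances = np.sqrt(np.sum((self._X_train - x) ** 2, axis=1))
        nearest = np.argsort(distances)[: self.k]
        # Majority vote among the k nearest labels
        votes = Counter(self._y_train[nearest].tolist())
        return votes.most_common(1)[0][0]
```

Under these assumptions, `SimpleKNN(3).fit(X_train, y_train).predict(X_test)` follows the same call pattern as the cell above.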

In [10]:
sum(y_predict == y_test) / len(y_test)
# This ratio is the model's accuracy on the test set

0.9
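
The accuracy used here is simply the fraction of test samples whose predicted label matches the true label. A small sketch of that formula as a function follows; the name `accuracy` is hypothetical.

```python
import numpy as np

def accuracy(y_true, y_predict):
    """Fraction of predictions that equal the true labels (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_predict = np.asarray(y_predict)
    assert y_true.shape == y_predict.shape, "both arrays must have the same shape"
    # The mean of a boolean array is the share of True entries
    return float(np.mean(y_true == y_predict))

# Three of the four predictions are correct
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```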

### 2. Handwritten digits dataset

Load the dataset and split it with sklearn's `train_test_split` function:

In [11]:
digits = datasets.load_digits()
digits.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

In [13]:
X = digits.data
y = digits.target
print(X.shape, y.shape) # each sample has 64 features

(1797, 64) (1797,)


In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
# Returns the training set and the test set directly

print(X_train.shape, y_train.shape)

(1437, 64) (1437,)


Evaluation:

In [17]:
my_knn_clf = KNNClassifier(3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)

In [18]:
sum(y_predict == y_test) / len(y_test)
# This ratio is the model's accuracy on the test set

0.9888888888888889
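
For comparison, the same experiment can be run entirely with sklearn. This sketch swaps in `KNeighborsClassifier` for the custom `KNNClassifier`, so the resulting accuracy is not guaranteed to match the value above exactly. `accuracy_score` and the estimator's `score` method compute the same ratio as the manual `sum(...) / len(...)` expression.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Same split parameters as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666
)

# Assumption: sklearn's classifier stands in for the custom KNNClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)

# Two equivalent ways to get the accuracy on the test set
acc = accuracy_score(y_test, y_predict)
print(acc)
print(knn_clf.score(X_test, y_test))
```

`score` is convenient when only the accuracy is needed, since it predicts and compares in one call.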