## Patrick 🌰

In [1]:
# Import the iris data loader from sklearn.datasets.
from sklearn.datasets import load_iris
# Load the data with the loader and store it in the variable iris.
iris = load_iris()
# Check the size of the data.
iris.data.shape


(150, 4)

In [2]:
# View the data description. This is a good habit for a machine learning practitioner.
print(iris.DESCR)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [3]:

from sklearn.model_selection import train_test_split
# Use train_test_split with the random seed random_state to hold out 25% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=33)
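
As a quick sanity check on the split above, the resulting sizes can be worked out by hand. This is a minimal sketch that assumes a fractional `test_size` is rounded up to a whole number of samples, with the remainder used for training; the helper name `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Hypothetical helper: train/test sizes for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # the test fraction is rounded up
    n_train = n_samples - n_test               # everything else is used for training
    return n_train, n_test


n_train, n_test = split_sizes(150, 0.25)
print(n_train, n_test)  # 112 38
```

The test size of 38 agrees with the total support shown in the classification report further down.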


In [4]:
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
# Import KNeighborsClassifier, the K-nearest neighbors classifier, from sklearn.neighbors.
from sklearn.neighbors import KNeighborsClassifier

# Standardize the training and test features.
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Use the K-nearest neighbors classifier to predict the classes of the test data; the predictions are stored in y_predict.
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_predict = knc.predict(X_test)
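
The standardization step above fits the scaler on the training features only and then reuses the same statistics for the test features. The sketch below illustrates that idea without any dependencies; it is not scikit-learn's implementation. The helpers `fit_scaler` and `transform` and the toy numbers are hypothetical, and the sketch assumes the population standard deviation is used for scaling.

```python
import statistics


def fit_scaler(rows):
    """Learn the per-column mean and population standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns so the division below is safe.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Scale rows with statistics that were learned elsewhere (here: the training set)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data (made up for illustration), two features per sample.
train = [[5.0, 3.0], [6.0, 3.5], [7.0, 4.0]]
test = [[6.0, 3.0]]

means, stds = fit_scaler(train)          # statistics come from the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # the test set reuses those statistics
print(test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into preprocessing, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.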


In [6]:
# Use the model's built-in score method to evaluate accuracy.
print('The accuracy of K-Nearest Neighbor Classifier is', knc.score(X_test, y_test))


The accuracy of K-Nearest Neighbor Classifier is 0.8947368421052632


In [7]:
# Again use classification_report from sklearn.metrics for a more detailed analysis of the predictions.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict, target_names=iris.target_names))


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       0.73      1.00      0.85        11
   virginica       1.00      0.79      0.88        19

   micro avg       0.89      0.89      0.89        38
   macro avg       0.91      0.93      0.91        38
weighted avg       0.92      0.89      0.90        38
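
The per-class figures in the report can be reproduced by hand. The confusion matrix in the sketch below is not printed by the notebook; it is inferred from the reported support, precision and recall (four virginica samples predicted as versicolor), so treat it as an assumption. The helper `per_class_metrics` is hypothetical.

```python
# Rows are true classes, columns are predicted classes.
# Inferred from the report above, not printed by the notebook.
labels = ["setosa", "versicolor", "virginica"]
confusion = [
    [8, 0, 0],
    [0, 11, 0],
    [0, 4, 15],
]


def per_class_metrics(confusion, k):
    """Precision, recall and F1 for class index k."""
    tp = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)  # column sum: everything predicted as k
    actual_k = sum(confusion[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k, name in enumerate(labels):
    p, r, f1 = per_class_metrics(confusion, k)
    print(f"{name:>10}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(labels))) / total
print(f"accuracy={accuracy:.4f}")  # accuracy=0.8947
```

Under this assumed matrix, versicolor precision is 11/15 and virginica recall is 15/19, which round to the 0.73 and 0.79 shown in the report.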



* K-nearest neighbors (KNN) classification is a very intuitive machine learning model, which makes it popular with beginners, and many textbooks use it as an introductory example. It is distinctive, but it also has flaws. Careful readers will notice the biggest difference between KNN and other models: it has no parameter training process and instead makes each classification decision directly from where the test sample falls in the distribution of the training data. KNN is therefore a nonparametric model. This same decision procedure, however, leads to high computational cost and memory consumption. For every test sample, the model must traverse all the training samples held in memory, compute the similarity to each one, sort the results, and take the labels of the K nearest training samples before it can make a decision. With m test samples and n training samples, this brute-force search needs on the order of m × n similarity computations, so the cost grows quadratically as both sets grow. Once the dataset becomes even moderately large, users have to weigh the extra computation time.
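
The decision procedure described above can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not what `KNeighborsClassifier` does internally: the function `knn_predict`, the Euclidean distance as the similarity measure, and the toy data are all chosen for this example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify one query point by majority vote among its k nearest training points."""
    # One distance per training sample: this is the full traversal of the training set.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort by distance and keep the labels of the k closest samples.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # a
print(knn_predict(train_X, train_y, [4.0, 4.0], k=3))  # b
```

There is no training step beyond keeping the data in memory, and every call to `knn_predict` walks the whole training set, which is where the cost discussed above comes from.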