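# Problem Definition

Classifying the MNIST data:

An image is $x$ and its class is $y$. Given a dataset $(x_1, y_1), \dots, (x_n, y_n)$, we want to predict the class of a previously unseen $x$.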

In [1]:
from mnist_tools import load_mnist, plot_images

train_x, train_y, test_x, test_y = load_mnist()

- Pixel values range from 0 to 255; rescale them to $[0,1]$.
- Each image is 28 × 28 pixels.

In [5]:
train_x.shape

(60000, 28, 28)

In [7]:
train_x = train_x.reshape(-1, 28 * 28).astype(float) / 255
test_x = test_x.reshape(-1, 28 * 28).astype(float) / 255

In [8]:
train_x.shape

(60000, 784)

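Classification:

Construct a classification function $\hat y = f(x)$ whose prediction $\hat y$ is as close as possible to the true $y$.

This requires a measure of how close the two are; that measure is usually called the loss function.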
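As a minimal sketch of what such a measure can look like, the snippet below computes two common choices on toy values: the zero-one loss (the fraction of wrong predictions) and the cross-entropy of a predicted probability vector. The function names and the toy inputs are illustrative assumptions and are not part of this notebook's code.

```python
import math

def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[true_class])

# Toy labels: four samples, the last one is predicted incorrectly.
print(zero_one_loss([3, 1, 4, 1], [3, 1, 4, 7]))

# The loss is small when the true class gets high probability,
# and large when it gets low probability.
print(cross_entropy([0.1, 0.8, 0.1], true_class=1))
print(cross_entropy([0.1, 0.8, 0.1], true_class=0))
```

The accuracy reported by `accuracy_score` in the cells that follow is one minus the zero-one loss.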

Linear classifier:

In [10]:
from sklearn.linear_model import LogisticRegression # logistic regression: binary by nature, used here for 10 classes
from sklearn.metrics import accuracy_score          # accuracy

Multiclass logistic regression (softmax regression):

In [12]:
LogisticRegression?

In [11]:
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs')

In [13]:
lr_y = lr.fit(train_x, train_y).predict(test_x)

In [15]:
test_x.shape

(10000, 784)

In [14]:
lr_y.shape

(10000,)

In [16]:
accuracy_score(test_y, lr_y)

0.92569999999999997
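Softmax regression scores each class with a linear function of the input and then turns those scores into probabilities with the softmax function. The sketch below illustrates only that last step on hand-picked scores; `softmax` is a hypothetical helper written for this note, not the scikit-learn implementation.

```python
import math

def softmax(scores):
    """Map real-valued class scores to probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                     # toy linear scores, one per class
probs = softmax(scores)
predicted = probs.index(max(probs))          # predicted class = largest probability
print(probs, predicted)
```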

In [17]:
import pickle

with open("lr.pkl", "wb") as f:
    pickle.dump(lr, f)

In [18]:
with open("lr.pkl", "rb") as f:
    lr_pkl = pickle.load(f)

In [19]:
lr_pkl_y = lr_pkl.predict(test_x)

In [20]:
accuracy_score(test_y, lr_pkl_y)

0.92569999999999997

Nearest neighbor:

In [21]:
from sklearn.neighbors import KNeighborsClassifier

In [22]:
KNeighborsClassifier?

In [24]:
knn = KNeighborsClassifier(n_neighbors=1)

K-nearest neighbors: with 600 training samples and 10,000 test samples, a brute-force search compares 10,000 × 600 pairs.

In [26]:
knn_y = knn.fit(train_x[::100], train_y[::100]).predict(test_x)

In [27]:
accuracy_score(test_y, knn_y)

0.83940000000000003
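To make the pairwise cost concrete, here is a minimal brute-force 1-nearest-neighbor sketch on toy 2-D points. It assumes squared Euclidean distance; the helper names and toy data are hypothetical, and scikit-learn may use a faster search structure internally.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def predict_1nn(points, labels, x):
    """Return the label of the training point closest to x."""
    best = min(range(len(points)), key=lambda i: squared_distance(points[i], x))
    return labels[best]

toy_train_x = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_train_y = [0, 1, 2]
toy_test_x = [[0.1, 0.2], [0.9, 0.8]]

# Every test point is compared against every training point, so the cost is
# len(toy_test_x) * len(toy_train_x) distance computations.
toy_pred = [predict_1nn(toy_train_x, toy_train_y, x) for x in toy_test_x]
print(toy_pred)
```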

Decision tree:

In [28]:
from sklearn.tree import DecisionTreeClassifier

In [29]:
dt = DecisionTreeClassifier()

In [45]:
dt_y = dt.fit(train_x[::100], train_y[::100]).predict(test_x)

In [46]:
accuracy_score(test_y, dt_y)

0.60260000000000002

SVM:

In [40]:
from sklearn.svm import SVC   # SVC is for classification, SVR for regression

In [41]:
SVC?

In [42]:
svm = SVC()

In [43]:
svm_y = svm.fit(train_x[::100], train_y[::100]).predict(test_x)

In [44]:
accuracy_score(test_y, svm_y)

0.79549999999999998

Random forest:

In [47]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

In [48]:
rf = RandomForestClassifier()

In [49]:
rf_y = rf.fit(train_x, train_y).predict(test_x)

In [50]:
accuracy_score(test_y, rf_y)

0.94810000000000005

AdaBoost:

In [51]:
ada = AdaBoostClassifier()

In [52]:
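ada_y = ada.fit(train_x, train_y).predict(test_x)  # fit the AdaBoost model, not rf

In [53]:
accuracy_score(test_y, ada_y)

0.94689999999999996

The score recorded above comes from a run in which this cell fit `rf` instead of `ada`, so it is a second random forest result rather than an AdaBoost one.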

Neural network:

In [54]:
from sklearn.neural_network import MLPClassifier

In [55]:
mlp = MLPClassifier() # multilayer perceptron

In [56]:
mlp_y = mlp.fit(train_x, train_y).predict(test_x)

(784) -> (100) -> (10): the $i$-th output dimension is the probability of class $i$.

In [57]:
accuracy_score(test_y, mlp_y)

0.98099999999999998
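The layer structure of such a network can be sketched as a forward pass: an affine layer, a nonlinearity, a second affine layer, and a softmax that yields one probability per class. The example below uses a toy 3 -> 2 -> 2 network with hand-picked weights and assumes a ReLU hidden activation; all names and values are illustrative and no training is shown.

```python
import math

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def softmax(v):
    """Turn scores into probabilities that sum to 1."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(W, b, x):
    """Affine layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(W1, b1, x))
    return softmax(dense(W2, b2, hidden))

# Toy 3 -> 2 -> 2 network with hand-picked weights.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]

mlp_probs = forward([1.0, 0.0, 2.0], W1, b1, W2, b2)
print(mlp_probs)  # one probability per class
```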

PCA:

In [58]:
from sklearn.pipeline import Pipeline

In [59]:
from sklearn.decomposition import PCA

In [62]:
pca = PCA(n_components=50)

In [63]:
pca_lr = Pipeline([('pca1', pca), ('lr2', lr)])

In [64]:
pca_lr_y = pca_lr.fit(train_x, train_y).predict(test_x)

In [65]:
accuracy_score(test_y, pca_lr_y)

0.9123

In [66]:
pca_mlp = Pipeline([('s1', pca), ('s2', mlp)])

In [67]:
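pca_mlp_y = pca_mlp.fit(train_x, train_y).predict(test_x)  # fit the PCA + MLP pipeline, not pca_lr

In [68]:
accuracy_score(test_y, pca_mlp_y)

0.91279999999999994

The score recorded above comes from a run in which this cell fit `pca_lr` instead of `pca_mlp`, so it reflects PCA followed by logistic regression rather than PCA followed by the MLP.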