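# Building a Digit Recognition Model

## Project Background
#### Data Source
* The project uses a Kaggle dataset. Download: https://www.kaggle.com/c/digit-recognizer

#### Project Goal
* Build a model from the pixel values of the handwritten digits in the training set and use it to predict the digits in the test set.
* This project uses a convolutional neural network (CNN).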

# 1. Import the Required Libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import plotly as py
import plotly.graph_objs as go
import cufflinks as cf
import plotly.express as px
from plotly.offline import iplot,init_notebook_mode
cf.go_offline(connected=True)
init_notebook_mode(connected=True)
cf.set_config_file(theme='pearl') # set the chart color theme

from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers import Dense
from keras.optimizers import RMSprop
from sklearn.model_selection import train_test_split
from keras.callbacks import ReduceLROnPlateau
from sklearn.metrics import confusion_matrix

import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


# 2. Load and Explore the Data

In [2]:
df_train = pd.read_csv(r'train.csv')
df_test = pd.read_csv(r'test.csv')

In [3]:
# Create X_train and y_train
y_train = df_train["label"]
X_train = df_train.drop("label",axis = 1) 

In [4]:
# Inspect y_train
print(y_train.value_counts().sort_values())
y_train.iplot(kind = 'hist')

5    3795
8    4063
4    4072
0    4132
6    4137
2    4177
9    4188
3    4351
7    4401
1    4684
Name: label, dtype: int64


* The counts for the digits 0-9 range from 3,795 to 4,684. The differences are small, so the training set is fairly balanced.
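
To put a number on "fairly balanced", the counts printed above can be turned into class shares and a max/min ratio. This is only a small sketch that copies the counts by hand; the names `counts`, `shares` and `imbalance_ratio` are made up for illustration.

```python
# Label counts copied from the value_counts() output above.
counts = {5: 3795, 8: 4063, 4: 4072, 0: 4132, 6: 4137,
          2: 4177, 9: 4188, 3: 4351, 7: 4401, 1: 4684}

total = sum(counts.values())
# Share of each digit in the training set, in percent.
shares = {digit: 100 * n / total for digit, n in sorted(counts.items())}
# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)
print({d: round(s, 1) for d, s in shares.items()})
print(round(imbalance_ratio, 2))
```

The most frequent digit (1) appears roughly 1.23 times as often as the least frequent one (5), so no class weighting is used in the rest of the notebook.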

In [5]:
# Inspect X_train
print('Shape of X:',X_train.shape)

Shape of X: (42000, 784)


* The training set has 42,000 samples, each with 784 pixel values, i.e. a 28 x 28 image.
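
The link between the 784 columns and a 28 x 28 image can be checked with a toy row. The sketch below uses random values rather than real data and assumes the pixels are stored row by row, which is how `reshape` interprets a flat array.

```python
import numpy as np

# Stand-in for one row of X_train: 784 pixel values (random here, not real data).
rng = np.random.default_rng(0)
flat_row = rng.integers(0, 256, size=784)

side = int(np.sqrt(flat_row.size))      # 28
image = flat_row.reshape(side, side)    # row-major: the first 28 values form the top row

# Pixel (r, c) of the image is element r * 28 + c of the flat row.
r, c = 5, 7
assert image[r, c] == flat_row[r * side + c]
print(image.shape)
```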

In [6]:
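# Inspect the test set
print('Test set shape:',df_test.shape)

Test set shape: (28000, 784)


* The test set has 28,000 samples, 66.7% of the size of the training set. To improve the model's predictive accuracy, it is worth enlarging the training set through data augmentation.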

In [7]:
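# Visualize a sample to see what the data looks like
pick_one = np.random.randint(0, len(X_train)) # upper bound is exclusive; no seed is set, so each run shows a different sample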
img = np.array(X_train.iloc[pick_one,:]).reshape((28,28))
fig = px.imshow(img,color_continuous_scale='gray')
fig.update_layout(width=100,height=100,coloraxis_showscale=False,
                  margin=dict(l=10, r=10, b=10, t=10),
                  xaxis=dict(showticklabels=False),yaxis=dict(showticklabels=False))
fig.show()
print('This is digit {}'.format(y_train[pick_one]))

This is digit 5


# 3. Data Preprocessing

In [8]:
# Check for missing values
print(X_train.isnull().any().describe())
print('-'*20)
print(y_train.isnull().any())
print('-'*20)
print(df_test.isnull().any().describe())

count       784
unique        1
top       False
freq        784
dtype: object
--------------------
False
--------------------
count       784
unique        1
top       False
freq        784
dtype: object


* The data contain no missing values.

In [9]:
# Check for out-of-range values
print(X_train.max().max())
print(X_train.min().min())
print('-'*20)
print(y_train.max())
print(y_train.min())
print('-'*20)
print(df_test.max().max())
print(df_test.min().min())

255
0
--------------------
9
0
--------------------
255
0


* Pixel values lie between 0 and 255 and labels between 0 and 9, so there are no out-of-range values.

In [10]:
# Scale the pixel values to the range [0, 1]
X_train = X_train / 255
df_test = df_test / 255
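
As a quick illustration of the scaling step, dividing by 255 maps the observed pixel range 0-255 onto 0-1. The sample values below are made up; the same divisor has to be applied to both the training and the test data so that the model sees inputs on one scale.

```python
import numpy as np

pixels = np.array([0, 64, 128, 255])   # made-up sample pixel values
scaled = pixels / 255                   # true division returns floats

print(scaled.min(), scaled.max())
```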

In [11]:
# Reshape the data into 28 x 28 images with a single channel
X_train = X_train.values.reshape(-1,28,28,1)
df_test = df_test.values.reshape(-1,28,28,1)
# One-hot encode y_train
from keras.utils import to_categorical
y_train = to_categorical(y_train, num_classes = 10)
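
For readers unfamiliar with one-hot encoding, the following NumPy sketch shows the kind of array the encoding step produces. It is an illustration, not the Keras implementation; the helper `one_hot` and the sample labels are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

sample_labels = [5, 0, 9]          # made-up labels for illustration
encoded = one_hot(sample_labels)
print(encoded)
# argmax along axis 1 recovers the original labels
print(np.argmax(encoded, axis=1))
```

The `argmax` round trip is the same operation the evaluation section uses to turn one-hot labels and predicted probabilities back into digits.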

In [12]:
# Split the training data into a training part and a validation part to check the model's accuracy
X_train_1,X_train_2,y_train_1,y_train_2 = train_test_split(X_train,y_train,test_size=0.2,random_state=0)

In [13]:
print(X_train_1.shape)
print(y_train_1.shape)
print(X_train_2.shape)
print(y_train_2.shape)

(33600, 28, 28, 1)
(33600, 10)
(8400, 28, 28, 1)
(8400, 10)


In [14]:
# Augment the data to reduce overfitting
from keras.preprocessing.image import ImageDataGenerator

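train_datagen = ImageDataGenerator(rotation_range=10,  # range of random rotations, in degrees
                                   width_shift_range=0.1, # maximum horizontal shift, as a fraction of the width
                                   height_shift_range=0.1, # maximum vertical shift, as a fraction of the height
                                   zoom_range=0.1) # random zoom in or out by up to 10%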


train_datagen.fit(X_train_1)
train_set = train_datagen.flow(X_train_1,y_train_1, batch_size=32)
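
To give a feel for what a shift range of 0.1 means on a 28 x 28 image, here is a simplified NumPy sketch of a random shift. It is not what `ImageDataGenerator` does internally: it assumes whole-pixel shifts and zero filling, whereas Keras samples fractional shifts and fills vacated pixels according to its `fill_mode` setting. The function `random_shift` and the test square are hypothetical.

```python
import numpy as np

def random_shift(image, max_fraction=0.1, rng=None):
    """Shift a 2-D image by a random whole number of pixels, filling gaps with zeros."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    max_dy = int(height * max_fraction)   # 2 pixels for a 28-pixel side
    max_dx = int(width * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))

    shifted = np.zeros_like(image)
    # Source and destination windows that overlap after the shift.
    src_rows = slice(max(0, -dy), height - max(0, dy))
    dst_rows = slice(max(0, dy), height - max(0, -dy))
    src_cols = slice(max(0, -dx), width - max(0, dx))
    dst_cols = slice(max(0, dx), width - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted, (dy, dx)

rng = np.random.default_rng(0)
image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0            # a made-up 8 x 8 square standing in for a digit
shifted, (dy, dx) = random_shift(image, rng=rng)
print(dy, dx, shifted.sum() == image.sum())
```

Because the square sits well inside the frame, a shift of at most two pixels moves it without cutting anything off, which is why small shifts are a safe augmentation for centered digits.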

# 4. Build the Convolutional Neural Network (CNN)

In [15]:
# Initialize the convolutional neural network
classifier = Sequential()

In [16]:
# Add the first convolutional block: two convolution layers, max pooling, and dropout
classifier.add(Convolution2D(filters=32,kernel_size=(5,5),padding = 'Same',
                             activation = 'relu',input_shape = (28,28,1))) 
classifier.add(Convolution2D(filters=32,kernel_size=(5,5),padding = 'Same',
                             activation = 'relu')) 
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Dropout(0.25))

In [17]:
# Add the second convolutional block: two convolution layers, max pooling, and dropout
classifier.add(Convolution2D(filters=64,kernel_size=(5,5),activation = 'relu',padding = 'Same'))
classifier.add(Convolution2D(filters=64,kernel_size=(5,5),activation = 'relu',padding = 'Same'))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Dropout(0.25))

In [18]:
# Add the third convolutional block: two convolution layers, max pooling, and dropout
classifier.add(Convolution2D(filters=64,kernel_size=(5,5),activation = 'relu',padding = 'Same'))
classifier.add(Convolution2D(filters=64,kernel_size=(5,5),activation = 'relu',padding = 'Same'))
classifier.add(MaxPooling2D(pool_size=(2,2),strides=(2,2)))
classifier.add(Dropout(0.25))

In [19]:
# Add the flatten layer
classifier.add(Flatten())

In [20]:
# Add the fully connected layers
classifier.add(Dense(units = 256,activation = "relu")) # hidden layer
classifier.add(Dropout(0.5))
classifier.add(Dense(units = 10,activation = "softmax")) # output layer
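
The size of the network defined above can be worked out by hand. The sketch below is plain arithmetic under two assumptions: 'same' padding leaves the spatial size unchanged, and 2 x 2 max pooling with its default settings halves it, rounding down. The helper names are made up; `classifier.summary()` reports the actual figures.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2-D convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

side, channels = 28, 1
total = 0
# Number of filters in each of the three convolutional blocks defined above.
for filters in (32, 64, 64):
    for _ in range(2):                               # two convolution layers per block
        total += conv_params(5, channels, filters)   # 'same' padding keeps side unchanged
        channels = filters
    side //= 2                                       # 2 x 2 max pooling floors odd sizes
    print("after block:", (side, side, channels))

flat = side * side * channels
total += dense_params(flat, 256) + dense_params(256, 10)
print("flattened size:", flat)
print("trainable parameters:", total)
```

Under these assumptions the feature maps shrink from 28 to 14 to 7 to 3 pixels per side, the flatten layer outputs 576 values, and the model has 535,402 trainable parameters. Dropout layers add none.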

In [21]:
# Compile the network with the Adam optimizer
classifier.compile(optimizer = "adam",loss = 'categorical_crossentropy',metrics = ['accuracy'])

In [22]:
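# Set up the callback: halve the learning rate when validation accuracy stops improving
# (the training log reports the metric as 'val_accuracy', so that is the name to monitor)
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.5,
                              patience=3, verbose=1,min_lr=0.00001)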
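
The rule behind this callback can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Keras implementation: it ignores the callback's `min_delta` and `cooldown` options, and the function `simulate_reduce_lr` is hypothetical. The accuracies fed in are the first ten `val_accuracy` values from the training log in section 5.1.

```python
def simulate_reduce_lr(val_acc_history, lr=0.001, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch's plateau check."""
    best = float("-inf")
    wait = 0            # epochs since the monitored value last improved
    lrs = []
    for acc in val_acc_history:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # never go below min_lr
                wait = 0
        lrs.append(lr)
    return lrs

val_acc = [0.9820, 0.9860, 0.9886, 0.9871, 0.9875,
           0.9857, 0.9913, 0.9902, 0.9902, 0.9915]
lrs = simulate_reduce_lr(val_acc)
print(lrs)
```

With these values the best accuracy is not beaten in epochs 4 to 6, so the sketch halves the rate once, after epoch 6.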

# 5. Train the Model and Evaluate Its Performance

# 5.1. Train the Model

In [23]:
history = classifier.fit_generator(train_set,
                         steps_per_epoch=1050, # number of training samples / batch size
                         epochs=30,# number of epochs; adjust based on the training results
                         validation_data=(X_train_2,y_train_2),
                         callbacks=[reduce_lr],
                         verbose = 2)
history

Epoch 1/30
 - 214s - loss: 0.4419 - accuracy: 0.8544 - val_loss: 0.0649 - val_accuracy: 0.9820
Epoch 2/30
 - 213s - loss: 0.1137 - accuracy: 0.9707 - val_loss: 0.0404 - val_accuracy: 0.9860
Epoch 3/30
 - 216s - loss: 0.0819 - accuracy: 0.9781 - val_loss: 0.0518 - val_accuracy: 0.9886
Epoch 4/30
 - 216s - loss: 0.0798 - accuracy: 0.9795 - val_loss: 0.0476 - val_accuracy: 0.9871
Epoch 5/30
 - 216s - loss: 0.0696 - accuracy: 0.9814 - val_loss: 0.0505 - val_accuracy: 0.9875
Epoch 6/30
 - 218s - loss: 0.0660 - accuracy: 0.9835 - val_loss: 0.0453 - val_accuracy: 0.9857
Epoch 7/30
 - 218s - loss: 0.0626 - accuracy: 0.9837 - val_loss: 0.0394 - val_accuracy: 0.9913
Epoch 8/30
 - 219s - loss: 0.0567 - accuracy: 0.9851 - val_loss: 0.0397 - val_accuracy: 0.9902
Epoch 9/30
 - 216s - loss: 0.0576 - accuracy: 0.9851 - val_loss: 0.0381 - val_accuracy: 0.9902
Epoch 10/30
 - 215s - loss: 0.0551 - accuracy: 0.9864 - val_loss: 0.0364 - val_accuracy: 0.9915
Epoch 11/30
 - 215s - loss: 0.0504 - accuracy: 0.

<keras.callbacks.callbacks.History at 0x2682266be48>
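
The value 1050 passed as `steps_per_epoch` follows from the split and the batch size. A minimal check, with the two numbers copied from the earlier cells:

```python
import math

n_train = 33600      # rows in X_train_1, from the shapes printed earlier
batch_size = 32      # batch_size passed to train_datagen.flow

# Rounding up would also count a final partial batch if the division were not exact.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)
```

Deriving the value this way avoids having to update a hard-coded number when the split ratio or batch size changes.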

In [24]:
history=pd.DataFrame(history.history)

In [25]:
history.head()

| | val_loss | val_accuracy | loss | accuracy | lr |
|---|---|---|---|---|---|
| 0 | 0.06486 | 0.982024 | 0.441855 | 0.854375 | 0.001 |
| 1 | 0.040383 | 0.985952 | 0.113693 | 0.970655 | 0.001 |
| 2 | 0.051834 | 0.988571 | 0.081872 | 0.978095 | 0.001 |
| 3 | 0.047606 | 0.987143 | 0.079805 | 0.979494 | 0.001 |
| 4 | 0.050547 | 0.9875 | 0.06962 | 0.981369 | 0.001 |


In [26]:
# Plot the training history
fig = py.subplots.make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter( y=history.loss.tolist(),name='training_loss'),row=1, col=1)
fig.add_trace(go.Scatter( y=history.accuracy.tolist(),name='training_accuracy'),row=1, col=2)
fig.add_trace(go.Scatter( y=history.val_loss.tolist(),name='val_loss'),row=1, col=1)
fig.add_trace(go.Scatter( y=history.val_accuracy.tolist(),name='val_accuracy'),row=1, col=2)
fig.update_layout(height=500, width=800,
                  title={'text': "Model training history", 'y':0.92,'x':0.44})
fig.show()

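* Accuracy rises with some fluctuation and then levels off.
* Validation accuracy is higher than training accuracy, which suggests the model is not overfitting. Part of the gap is expected, because dropout and data augmentation are applied only during training.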

# 5.2. Evaluate the Model

In [27]:
# Look at the confusion matrix
# Predict class probabilities for the validation set
y_pred = classifier.predict(X_train_2)
# Convert the predicted probabilities to class labels
y_pred_classes = np.argmax(y_pred,axis = 1) 
# Convert the one-hot validation labels back to class labels
y_real = np.argmax(y_train_2,axis = 1) 
# compute the confusion matrix
cmx = confusion_matrix(y_real, y_pred_classes) 

In [28]:
pd.DataFrame(cmx,index=np.arange(10),columns=np.arange(10))

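| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 812 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 951 | 1 | 0 | 0 | 0 | 0 | 8 | 0 | 0 |
| 2 | 0 | 0 | 856 | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
| 3 | 2 | 0 | 2 | 858 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 824 | 0 | 0 | 0 | 0 | 2 |
| 5 | 0 | 0 | 0 | 3 | 0 | 751 | 1 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 1 | 2 | 0 | 836 | 0 | 0 | 0 |
| 7 | 0 | 2 | 2 | 1 | 1 | 0 | 0 | 891 | 0 | 2 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 768 | 0 |
| 9 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 1 | 2 | 803 |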


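The matrix (rows are true digits, columns are predicted digits) shows that the model fits well overall, but a few confusions stand out:
* 1 recognized as 7 (8 cases)
* 9 recognized as 4 (3 cases) or as 5 (3 cases)
* 5 recognized as 3 (3 cases)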
In [29]:
# Collect the misclassified samples
errors=pd.Series(y_pred_classes - y_real != 0)
y_pred=pd.DataFrame(y_pred)
y_pred.insert(0,'errors',errors)

errors_index=y_pred[y_pred['errors']>0].index
X_train_2=pd.DataFrame(X_train_2.reshape(8400,784) * 255)

In [30]:
# Visualize one misclassified sample
pick_one = np.random.randint(0, errors_index.size) # upper bound is exclusive; no seed is set, so each run shows a different sample
img = np.array(X_train_2.iloc[errors_index[pick_one],:]).reshape((28,28))
fig = px.imshow(img,color_continuous_scale='gray')
fig.update_layout(width=100,height=100,coloraxis_showscale=False,
                  margin=dict(l=10, r=10, b=10, t=10),
                  xaxis=dict(showticklabels=False),yaxis=dict(showticklabels=False))
fig.show()
print('This is digit {}'.format(y_real[errors_index[pick_one]]))
print('Recognized as {}'.format(y_pred_classes[errors_index[pick_one]]))

This is digit 6
Recognized as 4


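* Some handwritten digits are so sloppy that they are genuinely ambiguous.
* For example, a 6 with an oversized loop and a very short tail can be mistaken for a 0. Some errors of this kind are hard to avoid.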
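
Picking a misclassified sample can also be done without a DataFrame. The sketch below is one possible alternative using a boolean comparison, with made-up arrays standing in for `y_pred_classes` and `y_real`; `error_positions` and `pick` are hypothetical names.

```python
import numpy as np

# Made-up predictions and labels standing in for y_pred_classes and y_real.
y_pred_classes = np.array([6, 4, 1, 7, 9, 0])
y_real         = np.array([6, 6, 1, 1, 9, 0])

# Positions where the prediction differs from the true label.
error_positions = np.flatnonzero(y_pred_classes != y_real)

rng = np.random.default_rng()
pick = int(rng.choice(error_positions))   # choice never indexes past the end
print(pick, y_real[pick], y_pred_classes[pick])
```

Drawing directly from the list of error positions sidesteps any off-by-one mistake in the random index range.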

# 6. Submit the Predictions

In [31]:
results = classifier.predict(df_test)
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")

In [32]:
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)
submission.to_csv("digit_recognizer_model.csv",index=False)