\documentclass[12pt,twocolumn,letterpaper]{article}
\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[breaklinks=true,bookmarks=false]{hyperref}
\cvprfinalcopy % *** Uncomment this line for the final submission
\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}
\setcounter{page}{1}
\begin{document}
\title{{\huge Deep Learning for Image Recognition} \\
Machine Learning Engineer Nanodegree \\
Capstone Project}
\author{Brian Lester\\
{\tt\small blester125@gmail.com}}
\maketitle
%%%%%%%%% ABSTRACT
\begin{abstract}
Deep Learning has become one of the most popular specializations in Machine
Learning due to massive increases in both the amount of data and the amount
of computing power available. These increases have allowed neural networks to
grow in size and depth, which in turn has allowed them to be applied to new
fields, especially image processing. Image processing is often done with
convolutional networks that reduce the spatial size of images while
increasing their depth. This paper applies a deep convolutional neural
network to the problem of classifying multiple digits in natural scenes. The
depth of this network allows a single network to classify every digit, rather
than the traditional approach of localizing, segmenting, and classifying each
digit individually.
\end{abstract}
%%%%%%%%% BODY TEXT
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Definition}
%-------------------------------------------------------------------------------
\subsection{Project Overview}
This project tackles multi-digit sequence recognition in real-world scenes.
This has many applications, the most obvious being recognizing addresses from
GPS-tagged photos, which would allow maps to be annotated with addresses
automatically from collected data (like that captured by the Google Street
View car). There are other applications as well. As self-driving cars become
more common they need a way to determine the speed limit; without extensive
vehicle-to-infrastructure communication, a car must learn the speed limit in
an area either from map data or by reading speed limit signs. This is
especially true in road construction zones, where map data may not be up to
date. Multi-digit recognition can also be used to automatically scan passports
and similar documents. Clearly this technology can be applied in many areas.
One of the first applications of Convolutional Neural Networks was by LeCun in
1989 ~\cite{lecun-89c}. These networks were used for digit recognition so that
mail sorting machines could recognize the handwritten digits that appear in
zip codes, allowing automatic mail sorting and routing machines to be built.
It is fitting that neural networks are still used for digit recognition.
However, current applications (even automatic mail sorters) need to recognize
more than one digit at a time (zip codes are five digits long, after all).
With the old networks the problem had to be broken into several parts: digit
detection, where digit sequences are found in the image; segmentation of the
detected area into probable digits; and recognition of the individual digits
in each segment. This multi-step process meant programmers had to write code
for every stage, which consumes programmer time. It is much easier to have a
single larger neural network do the whole process. The use of an end-to-end
network that recognizes all the digits in an image at once was introduced by
Goodfellow \etal ~\cite{goodfellow}.
The dataset for this project is the ``Street View House Numbers'' (SVHN)
dataset, which can be found at \url{http://ufldl.stanford.edu/housenumbers/}.
This dataset includes 73,257 training digits, 26,032 testing digits, and
531,131 extra ``somewhat less difficult samples'' to use as additional
training examples. These are RGB images with labels and bounding boxes
included for each image.
%-------------------------------------------------------------------------------
\subsection{Problem Statement}
The problem addressed in this paper is finding and reporting multiple digits
in a picture of a natural scene. This differs from the usual approach to
classifying multiple digits in that it uses a single deep neural network to
identify all the digits in the image, rather than a multi-step process of
localizing the digits, segmenting them, and then classifying each digit
individually. The deep network learns to do these steps automatically, without
human help, which allows the computer to optimize each step of the process in
ways that are non-obvious to a human programmer.
This is a supervised learning problem, and it is solved here with a deep
convolutional neural network. The convolutional network processes the image
into a much deeper feature vector. This feature vector is then passed to six
fully connected classifiers (each two layers deep) that output the length of
the sequence in the image and the first five digits in the image. This
end-to-end strategy is based on the findings of the Goodfellow \etal
paper ~\cite{goodfellow}.
The network outputs six softmax classifiers (each giving the probability of
every possible value for that classifier). The first classifier outputs the
length of the sequence of digits in the scene. The remaining five classifiers
output the value of each digit in the sequence (with 10 meaning that the digit
is not present). By taking the argmax (the index of the largest value) of each
softmax output and keeping the five digit values, excluding any tens, the
result is the sequence of digits in the image.
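As a concrete illustration, the following minimal sketch (assuming the six
softmax outputs are available as numpy probability vectors) shows how a
prediction is decoded:
\begin{verbatim}
import numpy as np

def decode(softmax_outputs):
    # softmax_outputs: six probability vectors.
    # The first covers lengths; the other five
    # cover digit values 0-10, where 10 means
    # "digit not present".
    length = int(np.argmax(softmax_outputs[0]))
    digits = [int(np.argmax(p))
              for p in softmax_outputs[1:]]
    # Drop the 10s to recover the sequence.
    return "".join(str(d) for d in digits
                   if d != 10)

# e.g. for the label 102 the digit heads
# peak at 1, 0, 2, 10, 10 -> "102"
\end{verbatim}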
%-------------------------------------------------------------------------------
\subsection{Metrics}
% The better metric
The metric used to measure the accuracy of the deep convolutional neural
network is the proportion of correct classifications. For any given image
there are six labels. The first is the length of the sequence of digits (1
through 6, where 6 means more than 5 digits). The next five labels are the
digits in the image, from 0 to 10, where 10 means that that particular digit
is not in the image. This metric is sufficient for this project because
recognizing multiple digits with a single neural network is the point of the
project, and the fraction of correct classifications is a good measure of how
well the network is performing. The accuracy is computed as
$100.0 \cdot \frac{\textit{Number of correct classifications}}{\textit{Number of samples} \,\cdot\, \textit{Maximum number of digits}}$
where the maximum number of digits is 5. This is the accuracy value used in
training and for evaluating the validation set.
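A minimal sketch of this metric, assuming predictions and labels are integer
numpy arrays of digit classes:
\begin{verbatim}
import numpy as np

def accuracy(preds, labels, max_digits=5):
    # preds, labels: int arrays of shape
    # [num_samples, max_digits] holding digit
    # classes (10 = not present).
    correct = np.sum(preds == labels)
    total = labels.shape[0] * max_digits
    return 100.0 * correct / total
\end{verbatim}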
If this network were used in an application where accuracy is even more
important (for example, generating addresses for an automatically created map,
where a wrong address is a serious problem for a user), then a stricter metric
could be used in which the entire classification must be correct for a sample
to count as correct. For example, if the label is 137 and the model outputs
17, the prediction is simply wrong, not 66.6\% correct. This network will not
be used in such a high-stakes situation, so the proportion of correct
classifications is sufficient for seeing which tweaks to the model improve
performance.
% The real metric
%Our metric that we will use to measure the accuracy of the Deep Convolutional
%Neural Network is the proportion of the input images where the correct length
%of the sequence is output and each element in the sequence of digits are correct.
%Having some digit correct and some wrong counts as a wrong answer. Only images
%that are perfectly match the labels are considered correct. For example an image
%that contains 197 and the Neural Network output 137 would be incorrect even
%though the 1 and 7 are correct.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Analysis}
%-------------------------------------------------------------------------------
\subsection{Data Exploration}
The data comes from the Street View House Numbers dataset from Stanford,
available at \url{http://ufldl.stanford.edu/housenumbers/}. It is split into
three datasets: the ``Training'' dataset with 33,402 images, the ``Test''
dataset with 13,068 images used to evaluate the performance of the network,
and the ``Extra'' dataset with 202,353 images. The extra images are considered
``easier'' examples than the training images, both by the website where the
datasets can be downloaded and in a paper by LeCun \etal
~\cite{sermanet-icpr-12}.
The dataset is a large collection of images taken from Google Street View. The
images are similar to the MNIST dataset of handwritten digits in that they are
small, cropped digit images; the difference is that the SVHN images come from
real-world, natural scenes and contain multiple digits. Each digit ranges from
0 to 9 and there are between 1 and 6 digits in an image. The images are RGB
images of various sizes, with three color channels (red, green, and blue).
There are no features other than the pixel values. The dataset also includes
labels for the digits and the information needed to draw bounding boxes around
each digit (given as the x and y coordinates plus the width and height of each
box). These boxes were used during preprocessing to help center and crop the
digits; they are not considered input to the neural network. The pixels of the
images are the only features fed to the network.
Table \ref{table:stats} shows some statistics about the datasets. Due to the
order in which the image processing is done (the images from the training and
extra datasets are processed and then split into training and validation
sets), most of these statistics are not available for the final training set
or the validation set. The height is the maximum height of an image in that
particular dataset, and the width is the maximum width. The mean and standard
deviation are computed over the pixel values in each dataset. The table shows
that all the datasets are approximately the same apart from size. The training
and validation sets do not have this information because the images are
processed (which includes normalization) before they are split into those two
datasets; this normalization means their mean and standard deviation are 0 and
1 respectively.
Table \ref{table:lengths} shows the frequency of each sequence length across
the datasets. Most lengths occur in similar numbers, but there are very few
images with five digits and only one with six. This length-six sequence could
be considered an outlier; however, it was not removed from the dataset because
in the real world some addresses or images will have six or more digits and
the model should be able to handle them. The table shows that the training set
has a mode of length 2, the extra set of length 3, and the test set of
length 2. The datasets themselves are simple: collections of images that are
all quite similar to one another.
%% Data table
\begin{table*}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|}
\hline
Dataset & Size & Max Height & Max Width & Mean & Std. Dev. \\
\hline
A.) Train & 33,402 & 501 & 876 & 139.22 & 59.67 \\
\hline
B.) Extra & 202,353 & 415 & 668 & 135.94 & 61.62 \\
\hline
C.) Test & 13,068 & 516 & 1083 & 133.72 & 66.11 \\
\hline
D.) Train & 230,071 & N/A & N/A & N/A & N/A \\
\hline
E.) Valid & 5,684 & N/A & N/A & N/A & N/A \\
\hline
\end{tabular}
\end{center}
\caption{This table details information about the datasets. These datasets include
A.) The training set that was split to create the D.) Train dataset and E.) Valid dataset,
B.) The extra dataset that is used to create the training and validation datasets,
C.) The test set that is used to evaluate the model,
D.) The training set that is used for training the model (created from the train and extra datasets),
and E.) The validation set that is used to tune hyper-parameters of the model.
Size is the number of images in each dataset. Max
height is the height of the tallest image in each dataset. Max width is the
width of the widest image in each dataset. Mean is the mean pixel value of the
images in the dataset. Std. Dev. is the standard deviation of the pixel values
in the images. \textit{Note:} The final Training and Validation sets list
little information because they are created after the images from the Train
and Extra datasets are processed and normalized, so they have no meaningful
mean or standard deviation.}
\label{table:stats}
\end{table*}
%% Length table
\begin{table}
\begin{center}
\resizebox{\linewidth}{!}{
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
Length & 1 & 2 & 3 & 4 & 5 & 6 \\
\hline
Train & 5,137 & 18,130 & 8,691 & 1,434 & 9 & 1 \\
\hline
Extra & 9,385 & 71,726 & 106,789 & 14,338 & 115 & 0 \\
\hline
Test & 2,483 & 8,356 & 2,081 & 146 & 2 & 0 \\
\hline
\end{tabular}
}
\end{center}
\caption{This table shows the frequency of each sequence length in the
Train, Extra, and Test datasets.}
\label{table:lengths}
\end{table}
%-------------------------------------------------------------------------------
\subsection{Exploratory Visualization}
The dataset is just pictures, and the features are just pixel values. There
are no separate features that can be plotted against each other to find
correlations, so no features can be identified for removal this way. Instead
of plotting features, examples from the training dataset are shown.
Figure \ref{fig:Original Figure} is an image from the training dataset with
labels of length: 3 and values: 1, 0, 2, 10, 10, meaning the image contains
the digit sequence 102, as can clearly be seen. The example also shows the
bounding boxes that surround each digit. These could be used to crop
individual digits to train a single-digit classifier, but are instead used to
help center and crop the images so that all the digits are visible.
%% Original picture %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/figure_og.png}}
\end{center}
\caption{An example image from the Training dataset. Bounding boxes for each
digit have been added to the image.}
\label{fig:Original Figure}
\end{figure}
Figure \ref{fig:6 Digit Figure} contains an example of an image that could be
considered an outlier: it is the only image in the training set that contains
a sequence of length six. It was not removed from the dataset because, in
real-world use of this network, encountering sequences of more than five
digits is quite plausible. For a mapping application looking for addresses of
up to five digits, it is important to detect sequences longer than five digits
and disregard them, rather than producing overlapping outputs or some other
unexpected prediction.
%% 6 digit picture %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/6_digits.png}}
\end{center}
\caption{The only example in the training dataset that has six digits.}
\label{fig:6 Digit Figure}
\end{figure}
%-------------------------------------------------------------------------------
\subsection{Algorithms and Techniques}
The following algorithms are used to implement a deep convolutional neural
network that identifies multi-digit sequences in real-world scenes.
\begin{itemize}
\item Logistic Regression: Using a matrix and a vector, an input of size $i$
can be transformed into an output of size $o$ by multiplying the input by an
$[i, o]$ matrix and then adding a vector of size $o$. By adjusting the values
of the matrix, an output can be created that depends on different parts of
the input with different weights. Logistic regression is the de facto
standard building block for neural network classifiers, which is why it is
used here.
\item Rectified Linear Unit (ReLU) Activation: Rectifier activation
introduces non-linearity into the network. When input is passed through a
ReLU, values below a threshold are set to zero (the default TensorFlow
threshold is used). ReLU is used because it produces sparse activations (only
some of the units are active) and is efficient to compute.
\item Convolution: A small matrix called a kernel, of size $x$ by $x$ by the
input depth by the output depth, is moved over the input (the amount the
kernel moves each step is called the stride), transforming the input into an
output that is slightly smaller spatially but has the output depth. There are
two forms of padding: valid padding, where the kernel never leaves the image,
and same padding, where the image is padded with zeros. Convolution is a form
of weight sharing in which the same weights are applied across the image, so
the network does not have to care where in the input the important features
are. As the kernel moves across the image it performs logistic regression on
each patch of the input.
\item Max Pooling: Small sections of the image are grouped together and only
the largest value is kept. This reduces the spatial size of the input while
retaining its depth. Max pooling was chosen because it was used in
AlexNet ~\cite{alex}, which was a large influence on this network.
\item Local Response Normalization: This restricts values to a smaller range
while preserving the relative ratios between values in the input. The
restriction helps reduce overfitting, where the model is tuned so well to the
training data that it fails to generalize to the test data (manifesting as
low training error and high test error). Default parameters from TensorFlow
are used. Local response normalization was chosen, like max pooling, due to
its use in AlexNet ~\cite{alex}.
\item Dropout: When applied to a layer, dropout randomly zeros out some of
its values with some probability. Because any given input may be missing, the
network cannot depend on any single feature; this forces the system to learn
redundant representations, which leads to more general solutions. Dropout was
shown to be helpful by Goodfellow \etal in their paper ~\cite{goodfellow}.
\item Softmax: Softmax normalizes the output values of logistic regression so
that they sum to one, forming a valid probability distribution that can be
used to make predictions. Softmax is the classic way in image recognition to
turn the results of logistic regression into a probability distribution, so
it is used here.
\item Cross Entropy: Cross entropy measures how wrong a prediction is
compared to the correct answer. This loss function summarizes how far the
model's predictions are from the true labels. It is the cost function used
with logistic regression and is therefore used here.
\item AdagradOptimizer: An adaptive optimization algorithm implemented in
TensorFlow that takes the gradients of the cost function (partial derivatives
with respect to each weight variable) and updates the weights. It is
efficient and built into TensorFlow, so it is a good choice.
\end{itemize}
The deep convolutional neural network is built with four convolutional layers
that transform the input, a 50 by 50 by 1 grayscale image, into a single
feature vector of length 128. This feature vector is passed to six logistic
regression classifiers, each with a hidden layer. These six classifiers output
the length and the values of the digits in the image. The network works
directly on the pixel values of the input images, as follows (a sketch of the
layer stack appears after the list).
\begin{enumerate}
\item The first convolutional layer uses a kernel of size 3 by 3 by 1 by 16
and a stride of 1. The layer uses valid padding. This convolution reduces the
size of the image from size 50 by 50 by 1 to 48 by 48 by 16. Then max pooling is
applied to the input with a stride of two. This results in an image that is
size 24 by 24 by 16. This pooling is followed by local response normalization
to help reduce overfitting.
\item The second convolutional layer uses a kernel of 1 by 1 by 16 by 32 and
a stride of 1. This means that the result of the convolution is the same size
as the input but it is much deeper (32 compared to 16). This technique of a
convolutional layer that does not reduce the input size but increases the
depth is taken from the implementation of a similar network by Goodfellow
\etal ~\cite{goodfellow}. After this second convolution local response
normalization is applied again. Changing this order of pooling and
normalization is adapted from the AlexNet architecture used by Krizhevsky
\etal to win the ImageNet competition in 2012 ~\cite{alex}. Then pooling is
applied resulting in an input of 12 by 12 by 32.
\item The third convolution is then applied with a kernel of size 5 by 5 by
32 by 64 and a stride of 1 to create an output with size 8 by 8 by 64.
Pooling is used to change the size to 4 by 4 by 64. Normalization is then
applied.
\item The final convolution, with kernel 1 by 1 by 64 by 128 and stride 1, is
then applied to create an output of 1 by 1 by 128. This output is normalized
and then reshaped into a feature vector of size 128.
\end{enumerate}
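The following is a minimal sketch of this stack in TensorFlow 1.x-style code,
not the project's actual implementation. Note one assumption: the final layer
uses a 4 by 4 kernel here, since a 1 by 1 kernel would leave the spatial size
at 4 by 4 rather than the stated 1 by 1.
\begin{verbatim}
import tensorflow as tf  # 1.x-style API

def w(shape):  # weight helper
    return tf.Variable(
        tf.truncated_normal(shape, stddev=0.1))

def conv(x, kernel):
    return tf.nn.relu(tf.nn.conv2d(
        x, w(kernel), strides=[1, 1, 1, 1],
        padding='VALID'))

def pool(x):  # 2x2 max pool, stride 2
    return tf.nn.max_pool(
        x, ksize=[1, 2, 2, 1],
        strides=[1, 2, 2, 1], padding='SAME')

lrn = tf.nn.local_response_normalization

x = tf.placeholder(tf.float32,
                   [None, 50, 50, 1])
# 1: 50x50x1 -> 48x48x16 -> pool -> 24x24x16
h = lrn(pool(conv(x, [3, 3, 1, 16])))
# 2: 1x1 conv deepens to 32; LRN, then pool
h = pool(lrn(conv(h, [1, 1, 16, 32])))
# 3: 12x12x32 -> 8x8x64 -> pool -> 4x4x64
h = lrn(pool(conv(h, [5, 5, 32, 64])))
# 4: collapse to 1x1x128 (4x4 kernel assumed)
h = lrn(conv(h, [4, 4, 64, 128]))
features = tf.reshape(h, [-1, 128])
\end{verbatim}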
This final feature vector is used as input to six softmax classifiers, each
with a hidden layer of size 16. The first classifier has an output space of 0
to 6 (the length of the sequence in the image) and the rest have output spaces
of 0 through 10 (the values of the digits). Each classifier gives the
probability that its digit takes each value; by taking the most probable value
we can predict the digits in the image. Dropout is applied to every layer
except the input and output layers.
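Continuing the sketch above, one classifier head might look like the
following (bias terms omitted for brevity; the dropout placeholder is an
assumption about how the keep probability is fed in):
\begin{verbatim}
keep_prob = tf.placeholder(tf.float32)

def head(features, n_classes):
    hid = tf.nn.relu(
        tf.matmul(features, w([128, 16])))
    hid = tf.nn.dropout(hid, keep_prob)
    # Return raw logits; softmax is applied
    # inside the loss function.
    return tf.matmul(hid, w([16, n_classes]))

length_logits = head(features, 7)
digit_logits = [head(features, 11)
                for _ in range(5)]
\end{verbatim}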
The results are compared to the labels using
``sparse\_softmax\_cross\_entropy\_with\_logits'' to calculate the loss. This
cost is then minimized by the AdagradOptimizer in order to train the model.
%-------------------------------------------------------------------------------
\subsection{Benchmark}
According to the paper by Goodfellow \etal ~\cite{goodfellow}, human operators
have about 98\% accuracy when identifying multi-digit sequences in natural
scenes. This would obviously be a good benchmark to try to reach. However, the
system created by Goodfellow \etal ~\cite{goodfellow} itself reached 96\%
accuracy on natural scenes, and this state-of-the-art result seems a better
benchmark than human level (it seems more attainable than trying to beat the
state of the art). Even this result is out of reach given the development
environment used here. Goodfellow \etal ~\cite{goodfellow} achieved 96\% with
a neural network that was eleven layers deep before being connected to the six
final output layers; the final feature vector created by their convolutional
network was of size 4096, while the vector created by this network is only of
size 128 ~\cite{goodfellow}. They also trained with a distributed framework
called DistBelief, and training still took six days. The time and resources to
create such a deep network are lacking here. A shallower network (about four
or five layers) should be able to reach about 90\% accuracy, according to the
graph in Figure 4 of the Goodfellow \etal paper ~\cite{goodfellow}.
Because so many samples contain short sequences, a poor model might be tempted
to guess 10 (the ``not present'' value) for every digit. If the model also
guessed that every sequence has length 2 (the mode of the training set), it
would achieve an accuracy of 60.79\%. This sets a floor that the model should
beat.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Methodology}
%-------------------------------------------------------------------------------
\subsection{Data Processing}
Data processing for this problem is based on the methods described by
Goodfellow \etal in their paper ~\cite{goodfellow}. First, a bounding box that
encompasses all the individual digit bounding boxes in the image is found.
This enclosing box is not included in the labels of the dataset; it is
computed programmatically by taking the highest, lowest, leftmost, and
rightmost edges of the provided per-digit boxes. The enclosing box is then
scaled up by 30\%, and the image is cropped to the scaled-up box. The cropped
image is resized to 50 by 50 pixels. In the paper by Goodfellow \etal
~\cite{goodfellow} the images are resized to 64 by 64 pixels; that size was
tried first, but the development environment ran out of memory, and after
trying several sizes 50 seemed to be the largest that the environment could
handle. After resizing, the images are converted from RGB with three color
channels to grayscale, so the resulting images have a single channel. Finally,
the images are normalized by subtracting the mean pixel value from each pixel
and dividing by the standard deviation. This normalization was the only such
step applied to the dataset; techniques like LeCun-style local contrast
normalization were not used.
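A minimal sketch of this pipeline follows, assuming PIL is the image library
(an assumption; any resizing library would do):
\begin{verbatim}
import numpy as np
from PIL import Image  # assumed library

def preprocess(img, boxes, pad=0.30, size=50):
    # boxes: per-digit (x, y, w, h) tuples.
    left = min(x for x, y, w, h in boxes)
    top = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    bot = max(y + h for x, y, w, h in boxes)
    # Expand the enclosing box by 30%.
    cx = (left + right) / 2.0
    cy = (top + bot) / 2.0
    hw = (right - left) * (1 + pad) / 2.0
    hh = (bot - top) * (1 + pad) / 2.0
    im = Image.fromarray(img).crop(
        (int(cx - hw), int(cy - hh),
         int(cx + hw), int(cy + hh)))
    # Resize to 50x50 and convert to grayscale.
    im = im.resize((size, size)).convert('L')
    p = np.asarray(im, dtype=np.float32)
    # Normalize to zero mean, unit std dev.
    return (p - p.mean()) / p.std()
\end{verbatim}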
As explained above, the one image with a length-six sequence could be
considered an outlier, but it was not removed from the dataset due to the high
likelihood of similar images being found in the wild.
The SVHN data is divided into three datasets: train, test, and extra. The
extra dataset is a large collection of easy samples, while the train dataset
is a smaller collection of ``more difficult'' samples. This difficulty
assessment comes from a paper by LeCun \etal ~\cite{sermanet-icpr-12}, and the
sentiment is echoed by the website where the dataset can be downloaded. To
create training and validation datasets from the train and extra datasets, a
method from LeCun \etal ~\cite{sermanet-icpr-12} was used: the validation set
is composed of $\frac{2}{3}$ train samples and $\frac{1}{3}$ extra samples
from each class. This breaks down into about 400 train samples and about 200
extra samples per class (5,684 total), where a sample's ``class'' is defined
by the first digit of the sequence in the image.
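A sketch of this per-class split (the label arrays holding each sample's
first digit are assumed to exist):
\begin{verbatim}
import numpy as np

def pick_per_class(first_digits, n):
    # first_digits[i] is the first digit of
    # sample i's sequence (its "class").
    picks = []
    for digit in range(10):
        idx = np.where(
            first_digits == digit)[0]
        picks.extend(idx[:n].tolist())
    return np.array(picks)

# ~2/3 of validation from train, ~1/3 from
# extra: 400 and 200 samples per class.
# train_firsts / extra_firsts are assumed
# label arrays.
val_train = pick_per_class(train_firsts, 400)
val_extra = pick_per_class(extra_firsts, 200)
\end{verbatim}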
Figure \ref{fig:Processed Figure} shows Figure \ref{fig:Original Figure} after the image has been processed.
%% Processed figure %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/figure_processed.png}}
\end{center}
\caption{The same image from Figure 1 after it has been processed (cropped,
resized to 50 x 50 pixels and converted to grayscale). }
\label{fig:Processed Figure}
\end{figure}
%-------------------------------------------------------------------------------
\subsection{Implementation}
The first step of the implementation was to obtain the datasets. These were
fetched from \url{http://ufldl.stanford.edu/housenumbers/} using a modified
version of the download code from the deep learning course at Udacity
(\url{https://www.udacity.com/course/deep-learning--ud730}); that code can be
found at
\url{https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity}.
The data was then extracted using more code adapted from the course.
Implementing the data preprocessing required familiarity with numpy to reshape
and normalize the image arrays. The tricky part was the format of the SVHN
label file: the labels are saved in a MATLAB .mat file, and files written by
the newest versions of MATLAB cannot be loaded into Python dictionaries
automatically.
The h5py module was new for this project; it was used to read the data from
the .mat file and parse it into Python dictionaries. This was one of the
trickier parts of the project, and help from the Udacity forums was invaluable
for understanding how h5py exposes references and how to fetch data through
them. The code used is adapted from
\url{https://github.com/hangyao/street_view_house_numbers/blob/master/3_preprocess_multi.ipynb}
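A sketch of the access pattern, following the layout of the label file as
handled in that notebook (the exact structure is an assumption here):
\begin{verbatim}
import h5py

f = h5py.File('digitStruct.mat', 'r')
names = f['digitStruct']['name']
bboxes = f['digitStruct']['bbox']

def image_name(i):
    # Follow the HDF5 reference and decode
    # stored character codes into a filename.
    return ''.join(chr(c[0])
                   for c in f[names[i][0]])

def box_attr(i, attr):
    # attr: 'label', 'left', 'top', 'width',
    # or 'height'. Images with one digit
    # store a scalar; images with more store
    # a list of references.
    box = f[bboxes[i][0]][attr]
    if len(box) == 1:
        return [int(box[0][0])]
    return [int(f[r[0]][0][0]) for r in box]
\end{verbatim}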
Once the preprocessing code was written, the next problem was the limitations
of the development environment. All work was done on a laptop, and when the
preprocessing code tried to resize every image to 64 by 64 pixels the program
ran out of memory, so a smaller size had to be found. After trying
successively smaller dimensions, 50 by 50 seemed to be the largest size the
images could be scaled to without running out of memory.
The neural network itself was implemented in TensorFlow ~\cite{Google}. This
was a first experience with TensorFlow, but it proved fairly approachable. The
network was first implemented as a straight block of code in an IPython
notebook, which was obviously not conducive to use inside an application, so
it was rewritten using functions. This took a while to figure out due to
inexperience with TensorFlow; in particular, the fact that nothing is
evaluated until a session is run, the idea of a computation graph, and the use
of placeholders took getting used to. The code still looked unreadable, so
variable scopes were added to clean it up, and TensorBoard summary operations
were added to help visualize both the model and the learning process.
Figure \ref{fig:model image} shows the visualization of the entire TensorFlow
graph, and Figure \ref{fig:network figure} shows the graph of the network
itself.
%% Model picture %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.5\linewidth]{images/model_graph.png}}
\end{center}
\caption{A picture of the TensorFlow graph. The model is the network itself;
the rest is what allows for training and saving the graph.}
\label{fig:model image}
\end{figure}
%% Network picture %%
\begin{figure*}
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/network_graph.png}}
\end{center}
\caption{Picture of the network from TensorBoard.
Larger image available at \url{http://imgur.com/cx9DINa}}
\label{fig:network figure}
\end{figure*}
The trickiest part of the implementation was calculating the loss function for
this network. Most softmax implementations use cross entropy with ``one-hot
encoded'' labels (the label is a vector with a 1 at the index of the answer,
rather than a scalar that is the answer). To avoid converting the labels to
one-hot vectors (which would have been a drain on system resources), the
TensorFlow function ``sparse\_softmax\_cross\_entropy\_with\_logits'' was used
to calculate the cross entropy. The TensorFlow documentation notes that the
input to this function should be the output of logistic regression (the
logits) rather than the output of softmax, because the function applies
softmax itself for efficiency. This required a slight rewrite: the inference
function that builds the network graph originally returned the softmax of the
logits layer, and it needed to return the logits themselves. To compute the
total loss, the result of
``sparse\_softmax\_cross\_entropy\_with\_logits'' is evaluated for each of the
six logits outputs against its corresponding label, and these six results are
summed, as sketched below.
%% Loss graph %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/loss_graph.png}}
\end{center}
\caption{This graph of the loss function over training steps shows how the loss
decreased as the model was trained.}
\label{fig:loss}
\end{figure}
Figure \ref{fig:loss} shows the value of the loss function at various training
steps. The loss trended down during training, showing that it is a good loss
function and that minimizing it trains the model.
%-------------------------------------------------------------------------------
\subsection{Refinement}
The original model had only three convolutional layers and no hidden layers in
the logits classifiers. It also applied dropout only between the final feature
vector and the logits layers. This model had the accuracy reported in the
first row of Table \ref{table:acc}. While the test accuracy is reported in
this table, it was not used to tune the model, as that would let features of
the test set bleed into the training process. The fact that both the training
and validation accuracies are low means the model is not complex enough to
represent the data (the model has high bias). To address this, the network was
made deeper (another convolutional layer was added) and larger (hidden layers
were added to the logits classifiers).
Once the network was expanded, the accuracies were those in the second row of
Table \ref{table:acc}. The fact that the validation error is so much higher
than the training error (shown by the lower validation accuracy) means the
model is overfitting the training data: it has high variance, has trouble
generalizing to the test data, and will likely struggle on other novel data. A
common technique to fight overfitting in neural networks is dropout, which was
then added to every layer of the model except the input and output layers.
Once the network was made larger and dropout was added, the accuracies
reported in row 3 of Table \ref{table:acc} were obtained. Similar steps
(especially increases in depth) could be taken from here to get even better
performance.
%-------------------------------------------------------------------------------
\section{Results}
\subsection{Model Evaluation and Validation}
The final model is a deep convolutional neural network. This architecture was
chosen due to its use, with great success, by Goodfellow \etal
~\cite{goodfellow}. The final model was deeper than originally planned, and
the added complexity resulted in a very long training time: 11.18 hours on an
Intel i5 CPU. Figure \ref{fig:accuracy} shows the accuracy on the training and
validation datasets over the course of training. The validation accuracy was
still increasing when training ended, meaning more training could improve it
further. Table \ref{table:acc} includes the final test accuracy of 93.89\%.
This accuracy is close to the training and validation accuracies, which means
the model does not suffer from high variance: it does not overfit the training
data and is able to generalize to new data. The model is therefore a good fit
for this problem.
In addition to using the test set, the model's performance was vetted using
both a camera application and the ability to run the model on arbitrary
images. The model performs reasonably well in these applications. However,
these input methods reveal a slight problem: the model performs rather poorly
when the digits are not centered in the image. Unlike the test set images,
these camera images have no bounding boxes and therefore no way to center the
digits. The model also performs better when the digits are about 30\% smaller
than the boundaries of the image. These input methods show that the model is
sensitive to slight changes in the input. This could be fixed with a deeper
network and more training input with the digits in various positions. More
convolutional layers would also help, because convolutions squeeze out the
spatial dimensions of the images. Without having digits in various positions
in the training images, the model never really learns to localize the digits.
%% Accuracy Graph %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/accuracy.png}}
\end{center}
\caption{A graph showing how the accuracy on the training batch and the
accuracy on the validation set changed as the model was trained.}
\label{fig:accuracy}
\end{figure}
%% Accuracy Table %%
\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
Accuracy & Train & Valid & Test \\
\hline
Original & 90.13 & 88.38 & 89.45 \\
\hline
Second & 95.21 & 90.39 & 91.11 \\
\hline
Best & 95.63 & 92.73 & 93.89 \\
\hline
\end{tabular}
\end{center}
\caption{Accuracy of the model on each dataset, where the Train column is the
accuracy on the last batch of training data.}
\label{table:acc}
\end{table}
The results of this model are quite strong, and it can be trusted in
applications where wrong answers are not catastrophic. If the network were
being used for mapping data, where a wrong address could mean sending a user
to the wrong place, then the model would have to be trained further.
Currently only one validation set is used to validate the model. Normally,
cross-validation is used to create multiple validation sets, which ensures
that there is not some hidden pattern in a single validation set leading to
abnormally high or abnormally low results. This was not done due to time
constraints but will be done in the future when improvements are made to the
model. This model recognizes digits found in natural scenes; it creates usable
output and is a good solution to the problem.
%-------------------------------------------------------------------------------
\subsection{Justification}
%% Example output %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/example_output.png}}
\end{center}
\caption{An example of some of the output of the model. The images were fed
into the model; Label is the dataset label for each image and Predict is the
model's output for that image.}
\label{fig:output}
\end{figure}
Figure \ref{fig:output} shows several example outputs. It includes the image
that was given to the network, the label from the dataset, and the digits the
model predicted for the image. The model missed only a single digit, in the
second image.
This model sufficiently solves the problem. A network that guessed only 2 for
the sequence length and only 10 for each digit would score 60.79\%; this is
the minimum accuracy required to do better than educated guessing. The test
accuracy of the final network is 93.89\%, far better than the minimum
benchmark and fairly close to the accuracy reported by Goodfellow \etal
~\cite{goodfellow}. Being close to that benchmark is a very good result.
This network is a good solution to the problem. It has a few sensitivity
problems, as discussed in the ``Model Evaluation and Validation'' section;
when the improvements mentioned in the ``Improvement'' section are applied,
these problems may disappear. Even before those improvements, the model does a
good job classifying digits.
%-------------------------------------------------------------------------------
\section{Conclusion}
\subsection{Free Form Visualization}
There is not much more to visualize, as the dataset is just images, so most of
these visualizations concern the learning process.
Figure \ref{fig:two image} shows two images. On the left is a processed image
that has been resized and converted to grayscale; on the right is the same
image after it has passed through one convolutional layer. The changes show
how the convolution transforms the image. The reduction in size is hard to
see because the output is only two pixels smaller.
%% Two image pictures %%
\begin{figure}[t]
\begin{center}
\fbox{
\includegraphics[width=0.5\linewidth]{images/input_image.png}
\includegraphics[width=0.5\linewidth]{images/conv_image.png}
}
\end{center}
\caption{The image on the left is one of the images that has been processed
and is ready to be fed into the network. The image on the right is the image
after it has been through a single convolution.}
\label{fig:two image}
\end{figure}
The following figures, \ref{fig:conv}, \ref{fig:hidden}, and
\ref{fig:logits}, show activation histograms for various weights in the
network. Figure \ref{fig:conv} shows the activations for the weights in the
third convolutional layer. Figure \ref{fig:hidden} shows the histogram for
the hidden layer in the fifth logits classifier. Figure \ref{fig:logits}
shows the activations for one of the final logits layers.
%% Convolutional graph %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/conv_3_graph.png}}
\end{center}
\caption{This shows the activation of the third convolutional layer weights in
the network. The top and bottom light blue lines show that training doesn't
begin to affect this set of weights until almost 40,000 training steps.}
\label{fig:conv}
\end{figure}
%% Hidden Graph %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/hidden_5_graph.png}}
\end{center}
\caption{This shows the activation of the hidden layer weights in the fifth
logits classifier. The top and bottom light blue lines show that training
doesn't begin to affect this set of weights until almost 30,000 training
steps.}
\label{fig:hidden}
\end{figure}
%% Logits Graph %%
\begin{figure}[t]
\begin{center}
\fbox{\includegraphics[width=0.9\linewidth]{images/logits_1_graph.png}}
\end{center}
\caption{This shows the activation of the output layer weights in the first logits in
the network. The top and bottom light blue lines show that the training starts to change
weights almost immediately.}
\label{fig:logits}
\end{figure}
These graphs are somewhat hard to read, but they show the distribution of the
activations at each layer. The middle line splits the activations so that
50\% fall above it and 50\% below. The dark lines bound the region containing
69\% of the activations, while the light blue lines bound 89\%. All three
graphs show where the majority of activations (between the blue lines) lie.
The differences between the graphs show how the different layers are trained.
The deepest layer (the third convolution's weights in Figure \ref{fig:conv})
is the slowest to change; the change is visible as the expansion in the y
dimension of the top and bottom lines. The hidden layer changes more quickly,
and the logits layer changes quickest of all: it starts with a wide
distribution and stays wide. The difference between these graphs is explained
by the saturation of the logistic function at different depths of the
network ~\cite{xavier}. These graphs help visualize the learning process, and
in future runs they will help reveal patterns and show whether changes are
helping the model's performance.
%-------------------------------------------------------------------------------
\subsection{Reflection}
The first step of this project was to read the existing literature on the
topic. The most helpful papers were by Goodfellow \etal ~\cite{goodfellow} and
LeCun \etal ~\cite{sermanet-icpr-12}. The Goodfellow \etal
paper ~\cite{goodfellow} was especially helpful; that is where the idea of
creating a single convolutional feature vector and using it as input to
several logistic classifiers came from.
The next step was to analyze and process the dataset. This was an interesting
part of the process because h5py was new; without this project it would
probably never have been used. The h5py module is used to access data
structures saved on disk.
After processing the data, the model was built in TensorFlow, which was
interesting precisely because it was new. One challenge was figuring out how
to build the convolutional network so that the final feature vector was only
1 by 1 by its depth. The hardest part was turning the mess of spaghetti code
into a well-organized function: several errors were introduced in the process
that resulted in very low accuracy (the cross entropy calculation did not take
into account the fourth logits output), and finding the cause took several
days.
The final step was adding the network into an application, which was an
interesting first use of OpenCV. After hooking all the pieces together, the
last thing to do was train the model. This took 11 hours, and the anticipation
was one of the harder parts of the project.
This was one of the largest end-to-end projects yet undertaken by the author,
and while the results are a little weaker than the goal, the project was a
large success.
%-------------------------------------------------------------------------------
\subsection{Improvement}
There is a lot of room for improvement from here. One of the first changes
that might help performance would be to normalize the data after it is split
into training and validation sets rather than before; this would make the
images within each dataset more uniform.
Another problem is that the model struggles when the digit sequence is not in
the center of the image. This problem, together with the accuracy not being as
high as it could be, can be addressed with a single tweak: as discussed in the
paper by Goodfellow \etal ~\cite{goodfellow}, more data can be created
artificially by taking different crops around the bounding box. For example,
after enlarging the overall bounding box, the whole enlarged box can be taken
as one data point; a crop from the enlarged top-left corner to the original
bottom-right corner puts the digits in the lower right of the image; and so on
for the other corners. A single image can thus produce five training points.
This would train the model to detect digits wherever they appear in the image,
and the increase in data would also improve accuracy. A sketch of the idea
follows.
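A minimal sketch of this augmentation (the exact crop choices are an
assumption about how the corners would be mixed):
\begin{verbatim}
def corner_crops(img, box, big_box):
    # box: (l, t, r, b) around the digits;
    # big_box: the same box scaled up 30%.
    l, t, r, b = box
    L, T, R, B = big_box
    crops = [(L, T, R, B),  # whole big box
             (L, T, r, b),  # digits lower right
             (l, t, R, B),  # digits upper left
             (L, t, r, B),  # digits upper right
             (l, T, R, b)]  # digits lower left
    return [img[ct:cb, cl:cr]
            for cl, ct, cr, cb in crops]
\end{verbatim}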
Some other improvements require a better development environment. One obvious
change is to resize the images to 64 by 64 rather than 50 by 50 during
preprocessing, which gives more pixels and clearer images but requires more
memory. The loss graph in Figure \ref{fig:loss} shows that the loss was still
decreasing at the end of training, and Figure \ref{fig:accuracy} shows the
accuracy still increasing, so the model would benefit from more training. The
best model already took 11 hours to train, so realistically training for
significantly more steps would require a GPU to speed up training.
Another improvement is a more complex model: a deeper convolutional network,
more hidden units and more hidden layers in the logits classifiers, and
possibly larger convolutional kernels with same padding. A more complex
network is also easier to overfit, so dropout would still be needed, and for a
deeper network to be feasible, parallel training on a GPU would be required.
The strongest indicator that a deeper network would help is that in the
Goodfellow \etal paper ~\cite{goodfellow} the feature vector produced by the
convolutional part of the network had 4096 features, compared to 128 in this
model.
The majority of improvements to this network boil down to a larger network and
more data, which requires an improved development environment. This work
should be easy to expand on, and it will be soon.
{\small
\bibliographystyle{ieee}
\bibliography{bib}
}
\end{document}