# Recognition of convolutional neural network based on CUDA Technology

Yi-bin HUANG, Kang Li, Ge Wang, Min Cao, Pin Li, Yu-jia Zhang Institute of Automation, Chinese Academy of Sciences

Abstract - To address the question of whether the Graphics Processing Unit (GPU), a stream processor with high floating-point performance, is applicable to neural networks, this paper proposes a parallel recognition algorithm for Convolutional Neural Networks (CNNs). The algorithm adopts Compute Unified Device Architecture (CUDA) technology, defines the parallel data structures, and describes the mapping mechanism for computing tasks on CUDA. The parallel recognition algorithm implemented on a GPU of the GTX200 hardware architecture is compared with the serial algorithm on a CPU, and it improves speed by nearly 60 times. The results show that a GPU based on the stream processor architecture is more applicable than a CPU to some neural network applications.

Key words - stream processor; Single-Instruction Multiple-Thread (SIMT); GTX200 hardware architecture; Compute Unified Device Architecture (CUDA) technology; Convolutional Neural Networks (CNNs)

The recognition algorithm of Convolutional Neural Networks (CNNs) is widely used in many research areas, but its implementation involves high arithmetic intensity. A newly emerging paradigm, the streaming model [1] of the Graphics Processing Unit (GPU), devotes more transistors to data processing than a CPU and provides many cores (hundreds of them). Significant application-level speedup over uniprocessor execution, a low barrier to entry, numerical precision and accuracy, wide availability to end users, and strong scalability are the programmable features that make the GPU a new hotspot.

### I. CUDA

NVIDIA proposed CUDA (Compute Unified Device Architecture) [2] in 2007, which opened a new era of GPU computing and established the GPU as an important new computing resource [3]. In general, programs for the GPU are written in C with minimal extensions, and the single-program multiple-data style makes programming easy. Figure 1 shows the CUDA programming model. A CUDA program integrates serial host C code with highly parallel device kernel C code. A CUDA device is a coprocessor to the CPU, or host, and runs a large number of threads in parallel. The execution model involves kernels, threads, blocks and grids. A CUDA kernel is a data-parallel portion of an application and is executed by an array of threads. All threads run the same code but operate on different data. The monolithic thread array is divided into multiple blocks, and a group of blocks composes a grid. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization, but threads in different blocks cannot cooperate. Every thread has an ID (1D, 2D or 3D) that is used to compute memory addresses and make control decisions. Similarly, each block has an ID (1D or 2D). A warp of 32 threads physically runs on an SM (Streaming Multiprocessor) and shares instructions. A warp is executed once its operands are ready. All warps are dynamically scheduled by the SM, which is an array of SPs (Streaming Processors).
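The relationship between grids, blocks, threads and warps described above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not CUDA code: the function names and the launch size (3 blocks of 64 threads) are assumptions chosen only for illustration, and the loops run sequentially where a real device would run the threads in parallel.

```python
# Pure-Python emulation of CUDA's block/thread indexing (no GPU needed).
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_thread_ids(grid_dim, block_dim):
    """Flat global ID that each (block, thread) pair would compute in 1D."""
    ids = []
    for block_idx in range(grid_dim):        # plays the role of blockIdx.x
        for thread_idx in range(block_dim):  # plays the role of threadIdx.x
            ids.append(block_idx * block_dim + thread_idx)
    return ids


def warps_per_block(block_dim):
    """Number of 32-thread warps needed to hold one block (rounded up)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


# Hypothetical launch: 3 blocks of 64 threads each
ids = global_thread_ids(grid_dim=3, block_dim=64)
print(len(ids), ids[0], ids[-1])  # 192 0 191
print(warps_per_block(64))        # 2
```

Each thread uses its ID in this way to select the data element it works on, which is why all threads can run the same code on different data.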

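For comparison, the GTX980 adopts the GM204 core of the Maxwell architecture and has 2048 SPs, with a theoretical computation speed of 4.6 TFLOPS and a memory bandwidth of 224 GB/s. The GTX280, based on the GTX200 architecture, has 240 SPs and a theoretical computation speed of 933 GFLOPS.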

### II. CONVOLUTIONAL NEURAL NETWORK: FIVE LAYERS

A CNN is characterized by three basic concepts: feature maps, weight sharing and subsampling [5-6]. The five-layer CNN structure applied to handwritten numeral recognition is shown in Figure 1. The first three layers are composed of feature maps. The feature maps of each layer are smaller than those of the previous layer, but there are more of them. Feature maps in the same layer share an acceptance domain (receptive field), and each feature map shares one set of weights and one bias among its neurons.

C1 is the input layer, which accepts 29\*29 images of handwritten digits in the form of 0-1 matrices. C1 therefore has 29\*29 = 841 typical neurons.

C2 is a convolutional layer. Convolving C1 with a 5\*5 acceptance domain produces six feature maps. Because the convolution subsamples C1, each feature map in C2 is composed of 13\*13 neurons, which is consistent with a step of 2 between acceptance domains, since (29-5)/2+1 = 13. C2 therefore has 13\*13\*6 = 1014 nodes and (5\*5+1)\*6 = 156 weights linked to C1. In total, there are 1014\*(5\*5+1) = 26364 connections between C1 and C2.



Figure 1. CNN: five layers

C3 is also a convolutional layer. It has 50 feature maps, which are smaller than those of C2 but more numerous. Convolving C2 with a 5\*5 acceptance domain produces a 5\*5 image in each feature map. Each pixel of a C3 feature map is generated by convolving a 5\*5 acceptance domain with the combination of the corresponding regions of the six C2 feature maps. This layer has 5\*5\*50 = 1250 neurons and (5\*5+1)\*6\*50 = 7800 weights linked to C2. In total, there are 1250\*(5\*5+1) = 32500 connections between C2 and C3.

N1 is a fully connected layer of 100 neurons. It has no feature maps, but a shared bias is used for classification. The 1250 neurons of C3 are fully connected to N1. Hence there are 100\*(1250+1) = 125100 connections between C3 and N1, and N1 has (1250+1)\*100 = 125100 weights.

N2 is the output layer, which outputs the recognized handwritten digit, $0\sim9$. It is fully connected to N1 and has 10 neurons that perform the classification and produce the result. There are (100+1)\*10 = 1010 weights and 10\*(100+1) = 1010 connections.
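The neuron, weight and connection counts quoted for the layers can be checked with a few lines of arithmetic. The sketch below simply re-evaluates the formulas given in the text. The step of 2 between acceptance domains is an assumption inferred from the sizes 29, 13 and 5 rather than a value stated in the text, and the helper name `conv_output_side` is hypothetical.

```python
def conv_output_side(input_side, kernel, step):
    """Side length of a feature map after a convolution with the given step."""
    return (input_side - kernel) // step + 1


KERNEL = 5  # 5*5 acceptance domain, from the text
STEP = 2    # assumption: inferred from 29 -> 13 -> 5, not stated in the text

c2_side = conv_output_side(29, KERNEL, STEP)       # 13
c3_side = conv_output_side(c2_side, KERNEL, STEP)  # 5

per_map = KERNEL * KERNEL + 1  # 25 weights plus 1 bias

# (neurons, weights, connections) per layer, following the text's formulas
layers = {
    "C2": (c2_side ** 2 * 6, per_map * 6, c2_side ** 2 * 6 * per_map),
    "C3": (c3_side ** 2 * 50, per_map * 6 * 50, c3_side ** 2 * 50 * per_map),
    "N1": (100, (1250 + 1) * 100, 100 * (1250 + 1)),
    "N2": (10, (100 + 1) * 10, 10 * (100 + 1)),
}

for name, (neurons, weights, connections) in layers.items():
    print(name, neurons, weights, connections)
# C2 1014 156 26364
# C3 1250 7800 32500
# N1 100 125100 125100
# N2 10 1010 1010
```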

Moreover, every layer except N2 uses the scaled hyperbolic tangent function [2] as its activation function:

$$\phi(x) = 1.7159 \times \tanh\left(\frac{6x}{9}\right) \tag{1}$$
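As a quick numerical illustration of this activation function, the following sketch evaluates it with Python's standard `math` module. The function name `activation` is an assumption made for this example.

```python
import math


def activation(x):
    """Scaled hyperbolic tangent: 1.7159 * tanh(6x / 9)."""
    return 1.7159 * math.tanh(6.0 * x / 9.0)


# The function is odd, passes through the origin and saturates at +/-1.7159.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, round(activation(x), 4))
```

With these constants the output is very close to 1 at x = 1 and to -1 at x = -1, so inputs of unit magnitude stay within the near-linear part of the curve.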

### III. PARALLEL RECOGNITION ALGORITHM FOR CONVOLUTIONAL NEURAL NETWORKS

#### 3.1 Parallel data structure based on CUDA

The data structures used by the algorithm are described below. Each is mapped to CUDA and addressed through `blockIdx` (x, y) and `threadIdx` (x, y, z), and each is stored as a one-dimensional array.

##### (1) Convolution feature map

The five-layer convolutional neural network can be described by a data structure (acceptance domain, feature map) composed of two 2D arrays, $FM_{ij} = \{P_{ij}[N][N],\ F_{ij}[M][M]\}$, where $P_{ij}[N][N]$ is the $j$-th acceptance domain of the $i$-th layer, an $N \times N$ 2D array, and $F_{ij}[M][M]$ is the $j$-th feature map of the $i$-th layer, an $M \times M$ 2D array.

When a single layer is mapped to CUDA, it forms a set of one-dimensional arrays, as in the following formula:

$$\_FM_i = \{\, \_P_i[N*N],\ \_F_i[(M*M)_j]\ \{ \_F_{i1}[M*M],\ \_F_{i2}[M*M],\ \dots,\ \_F_{ij}[M*M] \} \,\} \tag{2}$$

Here $\_P_i$, a set of $N \times N$ pixels, and $\_F_i$, a set of convolution neurons $Neuron_{CNN}$, represent the feature mapping set of the $i$-th layer. Each set shares one acceptance domain $\_P_i$, which receives the sampled feature maps from the layer above. $\_F_{ij}$ is the $j$-th feature map of the $i$-th layer, a 1D array made up of $M \times M$ convolution neurons, and all feature maps of the $i$-th layer are assembled into the 1D array $\_F_i$. The range of $j$ depends on the number of feature maps in the layer. When $i = 1$, the structure does not contain the acceptance set $\_P_i$.
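To make the one-dimensional layout concrete, the sketch below flattens the feature maps of one layer into a single list and computes the offset of a given neuron. It assumes row-major order within each map and zero-based indices, neither of which is specified in the text, and the names `flatten_feature_maps` and `flat_index` are hypothetical.

```python
def flatten_feature_maps(maps):
    """Concatenate M x M feature maps (lists of rows) into one 1D list."""
    flat = []
    for fmap in maps:
        for row in fmap:
            flat.extend(row)
    return flat


def flat_index(j, row, col, m):
    """Offset of neuron (row, col) of feature map j inside the 1D array."""
    return j * m * m + row * m + col


# Two hypothetical 3 x 3 feature maps; each value encodes its own position
M = 3
maps = [[[100 * j + 10 * r + c for c in range(M)] for r in range(M)]
        for j in range(2)]
flat = flatten_feature_maps(maps)

print(len(flat))                     # 18
print(flat[flat_index(1, 2, 0, M)])  # 120
```

A contiguous layout of this kind lets a block index select a feature map and a thread index select a neuron inside it with simple integer arithmetic.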

##### (2) Weight layers and connection weights between the layers

The weights that map each feature map of a convolutional layer to the layer above can be expressed as $W_{ij}\{wd_{ij}[i][j], e_{ij}\}$, where $wd_{ij}$ is the two-dimensional array of link weights of the $j$-th feature map of the $i$-th layer over the corresponding $N \times N$ acceptance domain. Each feature map shares one set of weights and one bias $e_{ij}$, and these are combined into a one-dimensional array when mapped to CUDA. The weight set of an ordinary layer, however, is defined by the following formula:

##### (3) Behavior of the two kinds of neuron

The behavior of a convolution neuron can be written as:

$$Neuron_{CNN} = f\left(\sum_{j=1}^{N \times N} \_w_{ik}[j]\, pixel_j + \_e_{ik}\right) \tag{4}$$

$$\_w_{ik}[j] \in \_wd_{ik}[N \times N], \quad pixel_j \in \_P, \quad \_e_{ik} \in \_W_i^{\dagger}$$

where $N$ is taken from $\_P$ and $k$ denotes the $k$-th feature map.

The behavior of a typical neuron can be modeled as:

$$Neuron_{is} = f\left(\_e_{is} + \sum_{j=1}^{n} \_w_{ij}^{s}\, u_j\right) \tag{5}$$

$$\_w_{ij}^{s} \in \_W_i^{\dagger}, \quad u_j \in Neuron_{i-1,s}, \quad \_e_{is} \in \_W_i^{\dagger}$$

where $u_j$ is the computed output of the neuron at the starting point of the link.
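Both kinds of neuron apply the activation function to a bias plus a weighted sum. What differs is where the weights come from. The sketch below illustrates that difference in plain Python: the function names are assumptions, `f` is taken to be the scaled hyperbolic tangent used elsewhere in the paper, and the tiny inputs are made up for illustration.

```python
import math


def f(x):
    """Activation function, assumed to be the paper's scaled tanh."""
    return 1.7159 * math.tanh(6.0 * x / 9.0)


def neuron(weights, inputs, bias):
    """f(bias + sum_j weights[j] * inputs[j]), common to both neuron kinds."""
    return f(bias + sum(w * u for w, u in zip(weights, inputs)))


def conv_feature_map(patches, shared_weights, shared_bias):
    """Convolution neurons: every neuron of a map reuses one weight set."""
    return [neuron(shared_weights, p, shared_bias) for p in patches]


def typical_layer(inputs, weight_rows, biases):
    """Typical neurons: every neuron has its own weight row and bias."""
    return [neuron(w, inputs, b) for w, b in zip(weight_rows, biases)]


# Hypothetical data: two 2-pixel patches, then a 2-neuron ordinary layer
conv_out = conv_feature_map([[0.0, 0.0], [1.0, 0.0]], [1.0, 1.0], 0.0)
full_out = typical_layer([1.0, 0.0], [[1.0, 5.0], [0.0, 0.0]], [0.0, 0.0])
print(conv_out)
print(full_out)
```

Sharing one weight set per feature map is what keeps the weight count of a convolutional layer far below its connection count.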

#### 3.2 Thread settings

The GTX280 has compute capability 1.3. In accordance with that specification, the thread settings of the kernels between successive layers, from top to bottom, are shown in Table 1.

Table 1. Settings of threads and processor

| Kernel | Grid dim (dim3) | Block dim (dim3) | Threads |
|--------|----------------|-------------------|---------|
| 1      | (6,1)          | (13,13)           | 1014    |
| 2      | (50,1)         | (5,5)             | 1250    |
| 3      | (100,1)        | (1,1)             | 100     |
| 4      | (10,1)         | (1,1)             | 10      |
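The thread totals in Table 1 follow directly from the grid and block dimensions. The sketch below recomputes them and checks each block against the limit of 512 threads per block that applies to compute capability 1.x devices. The function name `launch_size` is hypothetical.

```python
MAX_THREADS_PER_BLOCK = 512  # per-block limit for compute capability 1.x

# (kernel, grid dim, block dim) copied from Table 1
TABLE_1 = [
    (1, (6, 1), (13, 13)),
    (2, (50, 1), (5, 5)),
    (3, (100, 1), (1, 1)),
    (4, (10, 1), (1, 1)),
]


def launch_size(grid, block):
    """Return (blocks, threads per block, total threads) for one launch."""
    blocks = grid[0] * grid[1]
    per_block = block[0] * block[1]
    return blocks, per_block, blocks * per_block


totals = []
for kernel, grid, block in TABLE_1:
    blocks, per_block, total = launch_size(grid, block)
    assert per_block <= MAX_THREADS_PER_BLOCK
    totals.append(total)

print(totals)  # [1014, 1250, 100, 10]
```

The totals match the neuron counts of C2, C3, N1 and N2, so each block corresponds to one feature map (or one neuron in the fully connected layers) and each thread to one neuron.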

#### 3.3 Structure of the recognition results

After the above process, the input is finally mapped to a similarity set $U = \{u_i\}$, $i = 0, 1, \dots, 9$, where $u_i$ represents the similarity between the identified individual and the digit $i$. The recognition result is then obtained from $\mathrm{MAX}\{U\}$, that is, the digit with the largest similarity.
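Selecting the result from the similarity set amounts to an argmax over ten values. A minimal sketch follows, with a made-up similarity vector and a hypothetical function name `classify`. On a tie it returns the smallest digit, which is a choice made here and not something the text specifies.

```python
def classify(similarities):
    """Return the index i whose similarity u_i is the largest."""
    return max(range(len(similarities)), key=lambda i: similarities[i])


# Hypothetical output of the N2 layer for one input image
u = [-0.9, -0.8, -1.0, 0.95, -0.7, -0.6, -0.9, -1.0, -0.85, -0.75]
print(classify(u))  # 3
```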

### IV. KEY ALGORITHM DESCRIPTION

Assume that `dimGrid` and `dimBlock` are the grid dimension and the block dimension, respectively.

#### 4.1 Convolution Computation Algorithm

Convolution computation algorithm is as follows:

- (1) Launch the kernel: `Kernel<<<dimGrid, dimBlock>>>`
- (2) For i = 1 To n×n (concurrently)
  Initialize the convolution result r in the shared memory of the device
- (3) `__shared__ double r;` with r set to 0
- (4) Index the data position: `blockID = blockIdx.x`
- (5) Take the offset $\_e_{is}$ from $\_W_i$
- (6) Repeat steps 1) to 4) according to the size of the acceptance domain $\_P_i$
  - 1) Sample from the matrix $\_FM_i$ according to $\_P_i$
  - 2) Perform the convolution according to formula (4)
  - 3) Call the data collection algorithm of Section 4.2
  - 4) Apply the activation function of formula (1) to adjust the output amplitude of the neurons

  End For
- (7) Synchronize the threads: `__syncthreads()`
- (8) End
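The steps above can be emulated sequentially to show which work each block and thread performs. The following is a pure-Python sketch and not the authors' kernel. It assumes one input map (as between C1 and C2), a step of 2 between acceptance domains, row-major 1D storage, and one block per feature map with one thread per output neuron. All names are hypothetical.

```python
import math


def activation(x):
    """Formula (1): scaled hyperbolic tangent."""
    return 1.7159 * math.tanh(6.0 * x / 9.0)


def conv_layer(image, in_side, weights, biases, kernel=5, step=2):
    """Emulate the convolution kernel; loops stand in for parallel threads."""
    out_side = (in_side - kernel) // step + 1
    out = [0.0] * (len(weights) * out_side * out_side)
    for block_x in range(len(weights)):      # blockIdx.x: one feature map
        for ty in range(out_side):           # threadIdx.y
            for tx in range(out_side):       # threadIdx.x
                r = biases[block_x]          # step (5): take the offset
                for ky in range(kernel):     # step (6): walk the domain
                    for kx in range(kernel):
                        y = ty * step + ky
                        x = tx * step + kx
                        w = weights[block_x][ky * kernel + kx]
                        r += w * image[y * in_side + x]
                # collect the activated result into the flat output array
                idx = block_x * out_side * out_side + ty * out_side + tx
                out[idx] = activation(r)
    return out


# Hypothetical input: an all-ones 29 x 29 image and six uniform 5 x 5 kernels
image = [1.0] * (29 * 29)
weights = [[0.04] * 25 for _ in range(6)]
biases = [0.0] * 6
c2 = conv_layer(image, 29, weights, biases)
print(len(c2))  # 1014
```

On a GPU the three outer loops disappear: the hardware supplies `blockIdx` and `threadIdx`, and only the walk over the acceptance domain remains inside each thread.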

#### 4.2 Data Collection Algorithm

Data collection algorithm is as follows:

- (1) For idx = 1 To n Do (concurrently)
  - For idy = 1 To n Do
    - Use shared memory to collect the data
    - Thread[idx\*pitch+idy] do $\_FM_i[g(idx, idy)]$ = (`__shared__ double`) result
  - End For

  End For
- (2) End

#### 4.3 Identity Classification Algorithm

Identity classification algorithm is as follows:

- (1) For i = 1 To n×n (concurrently)
  Initialize the classification result cr in the shared memory of the device
- (2) `__shared__ double cr;` with cr set to 0
- (3) Index the data position: `blockID = blockIdx.x`
- (4) Perform the classification according to formula (5)
- (5) Call the data collection algorithm of Section 4.2
- (6) Apply the activation function of formula (1) to adjust the output amplitude of the neurons

  End For
- (7) Return the data to the user
- (8) End
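The classification steps can be sketched for a fully connected layer whose weights live in one flat array. This is an illustrative emulation only. It assumes that each neuron owns `n_in + 1` consecutive entries with the bias stored first, an ordering the text does not specify, and the function name `fully_connected` is hypothetical.

```python
def fully_connected(inputs, flat_weights, n_out, activate):
    """Emulate one block per output neuron over a flat weight array."""
    n_in = len(inputs)
    stride = n_in + 1                  # n_in weights plus one bias
    assert len(flat_weights) == stride * n_out
    out = []
    for block_x in range(n_out):       # blockIdx.x selects the neuron
        base = block_x * stride
        cr = flat_weights[base]        # assumed layout: bias comes first
        for j in range(n_in):          # formula (5): weighted sum
            cr += flat_weights[base + 1 + j] * inputs[j]
        out.append(activate(cr))
    return out


# Hypothetical 2-input, 2-output layer with an identity "activation"
w = [0.5, 1.0, 1.0,    # neuron 0: bias 0.5, weights 1.0 and 1.0
     -1.0, 0.0, 0.0]   # neuron 1: bias -1.0, weights 0.0 and 0.0
print(fully_connected([1.0, 2.0], w, 2, lambda x: x))  # [3.5, -1.0]
```

With 1250 inputs and 100 outputs this layout needs (1250+1)\*100 = 125100 entries, which matches the weight count given for N1.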

### V. CONTRAST EXPERIMENT

The experiments were run on an NVIDIA GeForce GTX280 with 1 GB of global memory, installed in a PC with an Intel Core2 E8400 3.0 GHz CPU.

To make the results comparable, we use the MNIST [6] handwritten digit library and a self-built library on both the CPU and the GPU. The results show that the accuracy on the self-built library and on MNIST is about 93% and about 95%, respectively, as shown in Table 2.

Table 2. Detection results of the digital recognition

| Libraries  | Samples | Correct | Correct rate (%) |
|------------|---------|---------|------------------|
| MNIST      | 10 000  | 9570    | 95.7             |
|            |         | 9450    | 94.5             |
|            |         | 9620    | 96.2             |
| Self-built | 1000    | 935     | 93.5             |
|            |         | 941     | 94.1             |
|            |         | 927     | 92.7             |

With regard to accuracy, CUDA and high-level languages on x86 CPUs handle floating-point arithmetic differently. On the CPU, most current high-level languages (including C) follow the IEEE-754 floating-point standard to define the storage format. In CUDA, compute devices follow the IEEE-754 standard for single-precision binary floating-point arithmetic, with the following deviations (only some are cited here; see Ref. [4] for details):

- (1) Addition and multiplication are often combined into a single multiply-add instruction (FMAD);
- (2) Division is implemented via the reciprocal in a non-standard-compliant way;
- (3) Square root is implemented via the reciprocal in a non-standard-compliant way;
- (4) Directed rounding towards plus or minus infinity is not supported;
- (5) There is no dynamically configurable rounding mode;
- (6) There is no mechanism for detecting floating-point exceptions, and floating-point exceptions are always masked;
- (7) The result of an operation involving one or more input NaNs is a NaN with the bit pattern 0x7FFFFFFF.
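The size of the numerical difference that single-precision arithmetic can introduce into a weighted sum can be estimated on the CPU alone. The sketch below rounds every intermediate value to 32 bits with Python's `struct` module and compares the result with a double-precision sum. It only illustrates single versus double precision: it does not reproduce the FMAD truncation or the other device-specific behaviors listed above, and the weights and inputs are made up.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(ws, xs):
    """Weighted sum with every intermediate result rounded to 32 bits."""
    acc = 0.0
    for w, x in zip(ws, xs):
        product = to_f32(to_f32(w) * to_f32(x))
        acc = to_f32(acc + product)
    return acc


def dot_f64(ws, xs):
    """Weighted sum in ordinary double precision."""
    return sum(w * x for w, x in zip(ws, xs))


# Hypothetical 5 x 5 acceptance domain: 25 weights and 25 pixel values
ws = [0.1] * 25
xs = [0.3] * 25
difference = abs(dot_f32(ws, xs) - dot_f64(ws, xs))
print(difference)
```

For a sum of this length the difference stays far below the gaps between the similarity values of different digits, which is why such errors are unlikely to change which digit is selected.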

The final similarity outputs for the digits 0-9 differ slightly between the GPU and the CPU. Figure 2 shows that, over 1 000 recognition runs, the standard deviation of the per-digit similarity outputs between the CPU and the GPU is on the order of $10^{-7}$. This calculation error is small enough, and whether a recognition is correct depends only on the similarity of each digit. Therefore the CPU and the GPU give the same recognition and detection results, while their speeds differ by up to two orders of magnitude. Figure 3 compares the floating-point computing power of the CPU and the GPU: the difference in average floating-point computing power reaches 60 times at its peak. As the number of recognitions increases, the floating-point throughput levels off and tends towards a linear trend.



Figure 2. Standard deviation of the recognition output



Figure 3. Difference of average floating-point computing power of n times

### VI. CONCLUSION

Although the topology of a convolutional neural network is simple, it still requires a huge amount of computation. With the support of the CUDA programming model, an NVIDIA GPU based on the stream processor hardware architecture significantly accelerates handwriting recognition based on a convolutional neural network, and its advantage over the CPU is substantial. Experiments show that the stream processor is suitable for convolutional neural networks. However, when the input is large, the limited bandwidth of data transfer between devices may become a bottleneck for stream processors. Improving the utilization of the stream processors by scheduling work and allocating data sensibly would benefit a variety of neural network applications.

### REFERENCES

- [1] Zhang Ying, Yang Xuejun, Wang Guibin, et al. Scientific Computing Applications on a Stream Processor[C]//Proc. of IEEE Int'l Symp. on Performance Analysis of Systems and Software. Austin, Texas, USA: [s. n.], 2008: 105-114.
- [2] Luebke D. CUDA: Scalable Parallel Programming for High-performance Scientific Computing[C]//Proc. of the 5th IEEE Int'l Symp. on Biomedical Imaging: From Nano to Macro. Paris, France: [s. n.], 2008: 836-838.
- [3] Wen Tian, Fan Xu, Hon-yuan Wang, et al. Fast Scale Invariant Feature Transform Algorithm Based on CUDA[J]. COMPUTER ENGINEERING, 2010, 36(8): 219-221.
- [4] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide 2.0[EB/OL]. (2008-06-07). http://developer.download.nvidia.com/compute/cuda/2\_0/docs/NVIDIA-CUDA-Programming\_Guide\_2.0.pdf.
- [5] Lawrence S, Giles CL, Tsoi AC. Convolutional Neural Networks for Face Recognition[C]//Proc. of IEEE Computer Society Conference on CVPR. San Francisco, California, USA: [s. n.], 1996: 217-222.
- [6] Boxu Xiao, Lijing Zhang. Face Detection Method Using Convolutional Neural Networks Based on the Shunting Inhibition Mechanism[J]. JOURNAL OF COMPUTER APPLICATIONS, 2006, 26(z1): 46-48.