# Deep Computer Vision Using Convolutional Neural Networks

Although IBM's Deep Blue supercomputer beat the chess world champion Garry Kasparov back in 1996, it wasn't until fairly recently that computers were able to reliably perform seemingly trivial tasks such as detecting a puppy in a picture or recognising spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialised visual, auditory, & other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features; for example, when you look at a picture of a cute puppy, you cannot choose *not* to see the puppy, *not* to notice its cuteness. Nor can you explain *how* you recognise a cute puppy; it's just obvious to you. Thus, we cannot trust our subjective experience: perception is not trivial at all, & to understand it, we must look at how the sensory modules work.

Convolutional neural networks (CNNs) emerged from the study of the brain's visual cortex, & they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, & the tricks developed for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, & more. Moreover, CNNs are not restricted to visual perception: they are also successful at many other tasks, such as voice recognition & natural language processing. However, we will focus on visual applications for now.

In this lesson, we will explore where CNNs came from, what their building blocks look like, & how to implement them using TensorFlow & Keras. Then we will discuss some of the best CNN architectures, as well as other visual tasks, including object detection (classifying multiple objects in an image & placing bounding boxes around them) & semantic segmentation (classifying each pixel according to the class of the object it belongs to).

---

# The Architecture of the Visual Cortex

David Hubel & Torsten Wiesel performed a series of experiments on cats in 1958 & 1959 (& a few years later on monkeys), giving crucial insights into the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for their work). In particular, they showed that many neurons in the visual cortex have a small *local receptive field*, meaning they react only to visual stimuli located in a limited region of the visual field (see the figure below, in which the local receptive fields of five neurons are represented by dashed circles). The receptive fields of different neurons overlap & together tile the whole visual field.

Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, & they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighbouring lower-level neurons (notice in the figure below that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

<img src = "Images/Receptive Fields of Visual Cortex Neurons.png" width = "600" style = "margin:auto"/>

These studies of the visual cortex inspired the neocognitron, introduced in 1980, which gradually evolved into what we now call *convolutional neural networks*. An important milestone was a 1998 paper by Yann LeCun that introduced the famous *LeNet-5* architecture, widely used by banks to recognise handwritten check numbers. This architecture has some building blocks that you already know, such as fully connected layers & sigmoid activation functions, but it also introduces two new building blocks: *convolutional layers* & *pooling layers*. Let's look at them now.

---

# Convolutional Layers

The most important building block of a CNN is the *convolutional layer*: neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, & so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

<img src = "Images/Convolutional Layers.png" width = "500" style = "margin:auto"/>

A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ & $f_w$ are the height & width of the receptive field. In order for a layer to have the same height & width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called *zero padding*.

<img src = "Images/Convolutional Layer Connection & Zero Padding.png" width = "500" style = "margin:auto"/>
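The index ranges & the zero padding described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `receptive_field` & `zero_pad` are hypothetical, the stride is 1, & the receptive field height & width are assumed to be odd so that the padding splits evenly on both sides.

```python
def receptive_field(i, j, f_h, f_w):
    """Return the inclusive (first, last) row & column ranges seen by neuron (i, j).

    Indices refer to the zero-padded previous layer; the stride is assumed to be 1.
    """
    return (i, i + f_h - 1), (j, j + f_w - 1)


def zero_pad(layer, f_h, f_w):
    """Surround a 2D list with zeros so a stride-1 layer keeps its height & width.

    Assumes f_h & f_w are odd, so the same padding is added on each side.
    """
    pad_h, pad_w = (f_h - 1) // 2, (f_w - 1) // 2
    width = len(layer[0]) + 2 * pad_w
    padded = [[0] * width for _ in range(pad_h)]            # zero rows on top
    for row in layer:
        padded.append([0] * pad_w + list(row) + [0] * pad_w)
    padded += [[0] * width for _ in range(pad_h)]           # zero rows at the bottom
    return padded


layer = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
padded = zero_pad(layer, f_h=3, f_w=3)   # 3 x 3 input becomes 5 x 5

rows, cols = receptive_field(0, 0, f_h=3, f_w=3)
print(rows, cols)   # (0, 2) (0, 2)
```

With a 3 x 3 receptive field, the padded 5 x 5 input leaves room for exactly 3 x 3 receptive field positions, so the next layer has the same height & width as the unpadded input.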

It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields. This dramatically reduces the model's computational complexity. The shift from one receptive field to the next is called the *stride*. In the diagram below, a 5 x 7 input layer (plus zero padding) is connected to a 3 x 4 layer, using 3 x 3 receptive fields & a stride of 2 (in this example, the stride is the same in both directions, but it does not have to be so). A neuron located in row $i$, column $j$ in the upper layer is connected to the outputs of the neurons in the previous layer located in rows $i \times s_h$ to $i \times s_h + f_h - 1$, columns $j \times s_w$ to $j \times s_w + f_w - 1$, where $s_h$ & $s_w$ are the vertical & horizontal strides.

<img src = "Images/Stride.png" width = "500" style = "margin:auto"/>
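The strided index formula can be checked with a few lines of Python. The function names below are hypothetical, & the output size rule (input size divided by stride, rounded up) is an assumption that matches the convention TensorFlow uses for `"same"` padding; other padding schemes give different sizes.

```python
import math


def strided_receptive_field(i, j, f_h, f_w, s_h, s_w):
    """Inclusive (first, last) row & column ranges seen by neuron (i, j).

    Indices refer to the zero-padded previous layer.
    """
    rows = (i * s_h, i * s_h + f_h - 1)
    cols = (j * s_w, j * s_w + f_w - 1)
    return rows, cols


def output_size(in_h, in_w, s_h, s_w):
    """Output height & width, assuming zero padding is added as needed
    so that every input position is covered (ceil(input / stride))."""
    return math.ceil(in_h / s_h), math.ceil(in_w / s_w)


# A 5 x 7 input with a stride of 2 in both directions gives a 3 x 4 layer.
print(output_size(5, 7, s_h=2, s_w=2))                                  # (3, 4)

# The bottom-right neuron of that 3 x 4 layer, with a 3 x 3 receptive field.
print(strided_receptive_field(2, 3, f_h=3, f_w=3, s_h=2, s_w=2))        # ((4, 6), (6, 8))
```

The last neuron reaches row 6 & column 8, so the padded input must be 7 x 9: one extra row & column of zeros on each side of the 5 x 7 input.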

## Filters

A neuron's weights can be represented as a small image the size of the receptive field. For example, the figure below shows two possible sets of weights, called *filters* (or *convolutional kernels*). The first one is represented as a black square with a vertical white line in the middle (it's a 7 x 7 matrix full of 0s except for the central column, which is full of 1s); neurons using these weights will ignore everything in their receptive field except for the central vertical line (since all inputs will get multiplied by 0, except for the ones located in the central vertical line). The second filter is a black square with a horizontal white line in the middle. Once again, neurons using these weights will ignore everything in their receptive field except for the central horizontal line.

<img src = "Images/Applying Filters to get Different Feature Maps.png" width = "600" style = "margin:auto"/>

Now if all neurons in a layer use the same vertical line filter (& the same bias term) & you feed the network the input image shown (the bottom image), the layer will output the top-left image. Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-right image is what you get if all neurons use the same horizontal line filter; notice that the horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons using the same filter outputs a *feature map*, which highlights the areas in an image that activate the filter the most. Of course, you do not have to define the filters manually: instead, during training, the convolutional layer will automatically learn the most useful filters for its task, & the layers above will learn to combine them into more complex patterns.
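To make the effect of the two hand-made filters concrete, here is a minimal NumPy sketch. It is an illustration rather than how a real layer is implemented: `apply_filter` is a hypothetical helper using plain loops, the bias is assumed to be 0, there is no activation function, & the 15 x 15 test image (a white cross on a black background) is made up for the example.

```python
import numpy as np


def apply_filter(image, kernel):
    """Slide one filter over a 2D image (stride 1, zero padding, no bias)."""
    f_h, f_w = kernel.shape
    pad_h, pad_w = f_h // 2, f_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="constant")
    feature_map = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Weighted sum of the inputs in the neuron's receptive field
            feature_map[i, j] = np.sum(padded[i:i + f_h, j:j + f_w] * kernel)
    return feature_map


# 7 x 7 filters full of 0s except for the central column / central row
vertical = np.zeros((7, 7))
vertical[:, 3] = 1
horizontal = np.zeros((7, 7))
horizontal[3, :] = 1

# Made-up input: one vertical & one horizontal white line forming a cross
image = np.zeros((15, 15))
image[:, 7] = 1
image[7, :] = 1

v_map = apply_filter(image, vertical)
h_map = apply_filter(image, horizontal)

print(v_map[0, 7])   # 4.0: on the vertical line, strong response
print(v_map[7, 0])   # 1.0: on the horizontal line only, weak response
print(v_map[7, 7])   # 7.0: centre of the cross, strongest response
```

In `v_map` the pixels along the vertical line score between 4 & 7, while pixels that lie only on the horizontal line score 1 at most, which is the "enhanced versus blurred" effect described above. `h_map` shows the mirror image of this behaviour.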

## Stacking Multiple Feature Maps

Up to now, for simplicity, we have represented the output of each convolutional layer as a 2D layer, but in reality, a convolutional layer has multiple filters (you decide how many) & outputs one feature map per filter, so it is more accurately represented in 3D.

<img src = "Images/Convolutional Layers with Multiple Feature Maps & Images with Color Channels.png" width = "550" style = "margin:auto"/>

The layer has one neuron per pixel in each feature map, & all neurons within a given feature map share the same parameters (i.e., the same weights & bias term). Neurons in different feature maps use different parameters. A neuron's receptive field is the same as described earlier, but it extends across all of the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.
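The following NumPy sketch shows one way to picture a layer with several feature maps. It rests on a few assumptions that are not in the text above: the helper name `conv_layer` is hypothetical, the stride is 1 with zero padding, there is no activation function, & the filters are stored with shape `(f_h, f_w, input maps, output maps)`. The inputs & weights are random placeholders, not trained values.

```python
import numpy as np


def conv_layer(inputs, filters, biases):
    """Apply several filters to a stack of feature maps (stride 1, zero padding).

    inputs:  (height, width, n_in)      feature maps of the previous layer
    filters: (f_h, f_w, n_in, n_out)    one set of weights per output feature map
    biases:  (n_out,)                   one bias term per output feature map
    """
    f_h, f_w, n_in, n_out = filters.shape
    pad_h, pad_w = f_h // 2, f_w // 2
    padded = np.pad(inputs, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)), mode="constant")
    height, width, _ = inputs.shape
    outputs = np.zeros((height, width, n_out))
    for i in range(height):
        for j in range(width):
            # The receptive field spans every feature map of the previous layer
            patch = padded[i:i + f_h, j:j + f_w, :]
            # Sum over rows, columns & input maps, for all filters at once
            outputs[i, j, :] = np.tensordot(patch, filters, axes=3) + biases
    return outputs


rng = np.random.default_rng(42)
inputs = rng.random((6, 8, 3))             # e.g. a tiny image with 3 colour channels
filters = rng.normal(size=(3, 3, 3, 2))    # two 3 x 3 filters spanning 3 input maps
biases = np.zeros(2)

outputs = conv_layer(inputs, filters, biases)
print(outputs.shape)                       # (6, 8, 2): one feature map per filter

# Parameters are shared by all neurons within a feature map
n_params = filters.size + biases.size
print(n_params)                            # 56
```

Because every neuron in a feature map reuses the same weights & bias, this layer needs only 56 parameters for its 6 x 8 x 2 = 96 neurons, however large the input image becomes.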

