# Convolutional Neural Networks

"Deep Learning" is a general term that usually refers to the use of neural networks with multiple layers that emulate the way the human brain learns and makes decisions. A convolutional neural network is a kind of neural network that extracts *features* from matrices of numeric values (often images) by convolving multiple filters over the matrix values to apply weights and identify patterns, such as edges and corners in an image. The numeric representations of these patterns are then passed to a fully-connected neural network layer to map the features to specific classes.

## Basic Neural Network Recap

Your brain works by connecting networks of neurons, each of which receives electrochemical stimuli from multiple inputs, which cause the neuron to fire under certain conditions. When a neuron fires, it creates an electrochemical charge that is passed as an input to one or more other neurons, creating a complex *feed-forward* network made up of layers of neurons that pass the signal on. An artificial neural network uses the same principles, but the inputs are numeric values with associated *weights* that reflect their relative importance. The neuron takes these input values and weights and applies them to an *activation function* that determines the output that the artificial neuron passes on to the next layer:

<br/>
<div align="center" style='font-size:24px;'>&#8694;&#9711;&rarr;</div>
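
As a rough illustration of the artificial neuron described above, the following sketch computes a weighted sum of the inputs and passes it through an activation function. The sigmoid activation, the bias term, and all of the numeric values are assumptions chosen for the example, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return sigmoid(np.dot(inputs, weights) + bias)

# Hypothetical values chosen only for illustration
x = np.array([0.5, 0.2, 0.9])    # input values
w = np.array([0.4, -0.6, 0.1])   # one weight per input
b = 0.0                          # bias term (assumed)

print(neuron_output(x, w, b))
```

Increasing a weight makes the corresponding input count for more in the sum, which is the quantity that training adjusts.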

As the human brain learns from experience, the inputs to the neurons are strengthened or weakened depending on their importance to the decisions that the brain needs to make in response to stimuli. Similarly, you train an artificial neural network using a supervised learning technique in which a *loss function* is used to evaluate how well the multi-layered model detects known labels. You can then find the derivative of the loss function to determine whether the level of error (loss) is reduced by increasing or decreasing the weights associated with the inputs, and then apply *backpropagation* to adjust the weights and improve the model iteratively over multiple training *epochs*. The result of this training is a deep learning model that consists of:
* An *input* layer to which the initial input variables are passed.
* One or more *hidden* layers in which the weights optimized by training determine the signal that is fed forward through the network.
* An *output* layer that presents the results.

## Convolutional Neural Networks (CNNs)
Convolutional Neural Networks, or *CNNs*, are a particular type of artificial neural network that works well with matrix inputs, such as images (which are fundamentally just multi-dimensional matrices of pixel intensity values). There are various kinds of layer in a CNN, but a common architecture is to build a sequence of *convolutional* layers that find patterns in individual areas of the input matrix and *pooling* layers that aggregate these patterns. Additionally, some layers may *drop* data (which helps avoid *overfitting* the model to the training data). Finally, a layer will *flatten* the matrix data, and a linear *dense*, or *fully connected*, layer will perform classification and reshape the predictions to conform with the expected output format.

Conceptually, a Convolutional Neural Network for image classification is made up of multiple layers that extract features, such as edges and corners, followed by one or more fully-connected layers to classify objects based on these features. You can visualize it like this:

<table>
    <tr><td rowspan=2 style='border: 1px solid black;'>&#x21d2;</td><td style='border: 1px solid black;'>Convolutional Layer</td><td style='border: 1px solid black;'>Pooling Layer</td><td style='border: 1px solid black;'>Convolutional Layer</td><td style='border: 1px solid black;'>Pooling Layer</td><td style='border: 1px solid black;'>Drop Layer</td><td style='border: 1px solid black;'>Fully Connected Layer</td><td rowspan=2 style='border: 1px solid black;'>&#x21d2;</td></tr>
    <tr><td colspan=5 style='border: 1px solid black; text-align:center;'>Feature Extraction</td><td style='border: 1px solid black; text-align:center;'>Classification</td></tr>
</table>

*Note: In Machine Learning, particularly "deep learning", matrices used in neural networks are often referred to as **tensors**. In a simplistic (which is to say, not strictly accurate) sense, a tensor is just a generalized term for a multi-dimensional matrix. In some deep learning frameworks, like PyTorch, a tensor is a specific type of data structure with properties and methods that support deep learning operations.*

### Convolutional Layers
Convolutional layers apply filters to a subregion of the input image, and *convolve* the filter across the image to extract features (such as edges, corners, etc.). For example, suppose the following matrix represents the pixels in a 6x6 image:

$$\begin{bmatrix}255 & 255 & 255 & 255 & 255 & 255\\255 & 255 & 0 & 0 & 255 & 255\\255 & 0 & 0 & 0 & 0 & 255\\255 & 0 & 0 & 0 & 0 & 255\\255 & 255 & 0 & 0 & 255 & 255\\255 & 255 & 255 & 255 & 255 & 255\end{bmatrix}$$

And let's suppose that a filter matrix is defined as a matrix of *weight* values like this:

$$\begin{bmatrix}0 & 1 & 0\\0 & 1 & 0\\0 & 1 & 0\end{bmatrix}$$

The convolution layer applies the filter to the image matrix one "patch" at a time; so the first operation would apply to the <span style="color:red">red</span> elements below:

$$\begin{bmatrix}\color{red}{255} & \color{red}{255} & \color{red}{255} & 255 & 255 & 255\\\color{red}{255} & \color{red}{255} & \color{red}{0} & 0 & 255 & 255\\\color{red}{255} & \color{red}{0} & \color{red}{0} & 0 & 0 & 255\\255 & 0 & 0 & 0 & 0 & 255\\255 & 255 & 0 & 0 & 255 & 255\\255 & 255 & 255 & 255 & 255 & 255\end{bmatrix}$$

To apply the filter, we multiply the patch area by the filter elementwise, and add the results:

$$\begin{bmatrix}255 & 255 & 255\\255 & 255 & 0\\255 & 0 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 0\\0 & 1 & 0\\0 & 1 & 0\end{bmatrix}= \begin{bmatrix}(255 \times 0) + (255 \times 1) + (255 \times 0) & +\\ (255 \times 0) + (255 \times 1) + (0 \times 0) & + \\ (255 \times 0) + (0 \times 1) + (0 \times 0)\end{bmatrix}  = 510$$

This result is then used as the value for the first element of a feature map:

$$\begin{bmatrix}\color{red}{510} & ? & ? & ?\\? & ? & ? & ?\\? & ? & ? & ?\\? & ? & ? & ?\end{bmatrix}$$

Next we move the patch along one pixel and apply the filter to the new patch area:

$$\begin{bmatrix}255 & \color{red}{255} & \color{red}{255} & \color{red}{255} & 255 & 255\\255 & \color{red}{255} & \color{red}{0} & \color{red}{0} & 255 & 255\\255 & \color{red}{0} & \color{red}{0} & \color{red}{0} & 0 & 255\\255 & 0 & 0 & 0 & 0 & 255\\255 & 255 & 0 & 0 & 255 & 255\\255 & 255 & 255 & 255 & 255 & 255\end{bmatrix}$$

$$\begin{bmatrix}255 & 255 & 255\\255 & 0 & 0\\0 & 0 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 0\\0 & 1 & 0\\0 & 1 & 0\end{bmatrix}= \begin{bmatrix}(255 \times 0) + (255 \times 1) + (255 \times 0) & +\\ (255 \times 0) + (0 \times 1) + (0 \times 0) & + \\ (0 \times 0) + (0 \times 1) + (0 \times 0)\end{bmatrix}  = 255 $$

So we can fill in that value on our feature map:
$$\begin{bmatrix}510 & \color{red}{255} & ? & ?\\? & ? & ? & ?\\? & ? & ? & ?\\? & ? & ? & ?\end{bmatrix}$$

Then we just repeat the process, moving the patch across the entire image matrix until we have a completed feature map like this:

$$\begin{bmatrix}510 & 255 & 255 & 510\\255 & 0 & 0 & 255\\255 & 0 & 0 & 255\\510 & 255 & 255 & 510\end{bmatrix}$$
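
The walkthrough above can be reproduced with a short NumPy sketch. The function name `convolve2d` is made up for this example, and the loop is written for readability rather than speed. Like most deep learning frameworks, it slides the filter without flipping it (strictly speaking, a cross-correlation).

```python
import numpy as np

image = np.array([
    [255, 255, 255, 255, 255, 255],
    [255, 255,   0,   0, 255, 255],
    [255,   0,   0,   0,   0, 255],
    [255,   0,   0,   0,   0, 255],
    [255, 255,   0,   0, 255, 255],
    [255, 255, 255, 255, 255, 255],
])

kernel = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
])

def convolve2d(img, filt):
    """Slide filt over img one pixel at a time (no padding)."""
    fh, fw = filt.shape
    out_h = img.shape[0] - fh + 1
    out_w = img.shape[1] - fw + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for r in range(out_h):
        for c in range(out_w):
            patch = img[r:r + fh, c:c + fw]
            # Multiply the patch by the filter elementwise and add the results
            out[r, c] = np.sum(patch * filt)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The printed 4x4 matrix matches the completed feature map shown above.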

You'll have noticed that as a result of convolving a patch across the original image, we've "lost" a 1-pixel strip around the edge. Typically, we apply a *padding* rule to keep the convolved image the same size as the original image, often by simply creating a 1-pixel wide edge of 0 values, like this:

$$\begin{bmatrix}0 & 0 & 0 & 0 & 0 & 0\\0 & 510 & 255 & 255 & 510 & 0\\0 & 255 & 0 & 0 & 255 & 0\\0 & 255 & 0 & 0 & 255 & 0\\0 & 510 & 255 & 255 & 510 & 0\\0 & 0 & 0 & 0 & 0 & 0\end{bmatrix}$$
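
One way to produce the zero-filled border shown above is NumPy's `np.pad`. This is only a sketch of the padded result as the text presents it; it does not show how any particular framework applies padding internally.

```python
import numpy as np

feature_map = np.array([
    [510, 255, 255, 510],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [510, 255, 255, 510],
])

# Add a 1-pixel border of zeros on every side: 4x4 becomes 6x6
padded = np.pad(feature_map, pad_width=1, mode='constant', constant_values=0)
print(padded)
```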

### Pooling Layers
After using one or more convolution layers to create a feature map, you can use a pooling layer to reduce the size of the matrix. A common technique is to use *MaxPooling*, in which a patch is applied to the matrix and the maximum value within the patch is retained while the others are discarded.

For example, we could apply a 2x2 patch to our feature map to extract the largest value in each 2x2 subarea:

$$\begin{bmatrix}\color{blue}{0} & \color{blue}{0} & \color{green}{0} & \color{green}{0} & \color{red}{0} & \color{red}{0}\\\color{blue}{0} & \color{blue}{510} & \color{green}{255} & \color{green}{255} & \color{red}{510} & \color{red}{0}\\\color{magenta}{0} & \color{magenta}{255} & \color{orange}{0} & \color{orange}{0} & \color{cyan}{255} & \color{cyan}{0}\\\color{magenta}{0} & \color{magenta}{255} & \color{orange}{0} & \color{orange}{0} & \color{cyan}{255} & \color{cyan}{0}\\\color{brown}{0} & \color{brown}{510} & 255 & 255 & \color{yellow}{510} & \color{yellow}{0}\\\color{brown}{0} & \color{brown}{0} & 0 & 0 & \color{yellow}{0} & \color{yellow}{0}\end{bmatrix}\Longrightarrow \begin{bmatrix}\color{blue}{510} & \color{green}{255} & \color{red}{510}\\\color{magenta}{255} & \color{orange}{0} & \color{cyan}{255}\\\color{brown}{510} & 255 & \color{yellow}{510}\end{bmatrix}$$
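
The same 2x2 max pooling can be sketched in NumPy. The helper name `max_pool` is made up for this example, and it assumes that the matrix height and width are exact multiples of the patch size.

```python
import numpy as np

padded_map = np.array([
    [0,   0,   0,   0,   0, 0],
    [0, 510, 255, 255, 510, 0],
    [0, 255,   0,   0, 255, 0],
    [0, 255,   0,   0, 255, 0],
    [0, 510, 255, 255, 510, 0],
    [0,   0,   0,   0,   0, 0],
])

def max_pool(matrix, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = matrix.shape
    # Reshape so that axes 1 and 3 index the positions inside each block
    blocks = matrix.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

pooled = max_pool(padded_map)
print(pooled)
```

The result is the 3x3 matrix on the right-hand side of the illustration above.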

### Activation Functions
After each layer of filtering or pooling, it's common to apply a *rectified linear unit (ReLU)* function to the feature maps that have been produced. This has the effect of ensuring that all values in the feature maps are zero or higher.
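
As a small illustration, ReLU can be written in one line of NumPy. The input values below are made up to show what happens to negative numbers.

```python
import numpy as np

def relu(x):
    """Replace every negative value with 0 and leave the rest unchanged."""
    return np.maximum(0, x)

values = np.array([[-3.0, 0.0, 2.5],
                   [1.0, -0.5, 4.0]])   # hypothetical feature map values
print(relu(values))
```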

### Dropout Layers
In any machine learning training process, there is a danger of *overfitting* the model to the training data. In other words, you might end up with a model that works extremely well with the data on which it was trained, but can't generalize effectively to classify new images. One way in which you can reduce the risk of overfitting is to randomly drop some of the feature maps.
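
The sketch below shows one common way to implement dropout during training. It makes two assumptions beyond the text: it zeroes individual values rather than whole feature maps, and it rescales the surviving values so that their expected total stays the same (often called "inverted" dropout). The function name and the input values are made up for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction of values and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)   # fixed seed so the sketch is repeatable
activations = np.ones((4, 4))         # hypothetical layer output
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Dropout is only applied while training; when the model is used for prediction, all values are passed through unchanged.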

### Dense (Fully-Connected) Layers
After the previous layers have created feature maps, a final linear *fully-connected* layer is used to generate class predictions - you can think of the fully-connected layer as the endpoint of the classifier that determines which combination of features found in the previous layers "adds up" to a particular class. To create a fully-connected layer, the feature maps are flattened into a single 1-dimensional matrix and a function is applied to calculate the probability for each class that the model is designed to predict - usually this final function is a *Sigmoid* or *SoftMax* function that assigns a value between 0 and 1 to each class. With *SoftMax*, which is the usual choice when each image belongs to exactly one class, these values add up to 1:

$$\begin{bmatrix}510 & 255 & 510\\255 & 0 & 255\\510 & 255 & 510\end{bmatrix}\begin{bmatrix}255 & 255 & 510\\255 & 0 & 255\\510 & 255 & 255\end{bmatrix}...$$

$$ \Downarrow $$

$$\begin{bmatrix}510 & 255 & 510 & 255 & 0 & 255 & 510 & 255 & 510 & 255 & 255 & 510 & 255 & 0 & 255 & 510 & 255 & 255 ...\end{bmatrix}$$

$$ \Downarrow $$

$$\begin{bmatrix}C_{1} & C_{2} & C_{3} \\ 0.15 & 0.8 & 0.05\end{bmatrix}$$
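
The flattening and SoftMax steps can be sketched as follows. The class scores are hypothetical stand-ins for what a trained dense layer might produce, so the resulting probabilities are not meant to match the 0.15, 0.8, and 0.05 shown above.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that add up to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# The two pooled feature maps from the illustration above
feature_maps = np.array([
    [[510, 255, 510], [255, 0, 255], [510, 255, 510]],
    [[255, 255, 510], [255, 0, 255], [510, 255, 255]],
])
flattened = feature_maps.flatten()   # 18 values in a single row

# Hypothetical class scores that a dense layer might compute from the flattened values
scores = np.array([1.0, 2.7, -0.1])
probabilities = softmax(scores)

print(flattened)
print(probabilities, probabilities.sum())
```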

### Backpropagation
When we train a CNN, we perform multiple passes forward through the network of layers, and then use a *loss function* to measure the difference between the output values (which you may recall are probability predictions for each class) and the actual values for the known image classes used to train the model (in other words, 1 for the correct class and 0 for all the others). For example, in the example above the predicted probabilities are 0.15 for C<sub>1</sub>, 0.8 for C<sub>2</sub>, and 0.05 for C<sub>3</sub>. Let's suppose that the image in question is an example of C<sub>2</sub>, so the expected output is actually 0 for C<sub>1</sub>, 1 for C<sub>2</sub>, and 0 for C<sub>3</sub>. The error (or *loss*) represents how far from the expected values our results are.
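
To make the loss concrete, the sketch below scores the predictions from this example with categorical cross-entropy. The choice of loss function is an assumption at this point in the text, although it is the one the Keras model later in this document is compiled with.

```python
import numpy as np

def categorical_cross_entropy(predicted, expected):
    """Negative log of the probability assigned to the correct class."""
    eps = 1e-12   # avoids log(0) if a predicted probability is exactly 0
    return -np.sum(expected * np.log(predicted + eps))

predicted = np.array([0.15, 0.8, 0.05])   # model output for C1, C2, C3
expected = np.array([0.0, 1.0, 0.0])      # the image is actually a C2

loss = categorical_cross_entropy(predicted, expected)
print(round(loss, 4))   # 0.2231
```

A perfect prediction of 1.0 for C<sub>2</sub> would give a loss of 0, and the loss grows as the probability assigned to the correct class shrinks.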

Having calculated the loss, the training process uses a specified *optimizer* to calculate the derivative of the loss function with respect to the weights and biases used in the network layers, and determine how best to adjust them to reduce the loss. We then go backwards through the network, adjusting the weights before the next forward pass.
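
The following sketch shows the idea on the smallest possible case: a single dense layer with SoftMax output, trained on one example with plain gradient descent. Everything in it is an assumption made for illustration (the input values, the learning rate, the use of cross-entropy loss, and the use of basic gradient descent rather than a more sophisticated optimizer). A real CNN repeats the same kind of update for every layer.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def loss_fn(probs, target):
    return -np.sum(target * np.log(probs + 1e-12))

x = np.array([0.5, 0.1, 0.9, 0.3])   # hypothetical flattened features
target = np.array([0.0, 1.0, 0.0])   # the correct class is the second one
weights = np.zeros((4, 3))           # one weight per feature per class
bias = np.zeros(3)
learning_rate = 0.1

losses = []
for step in range(20):
    probs = softmax(x @ weights + bias)    # forward pass
    losses.append(loss_fn(probs, target))  # measure the error
    # For SoftMax with cross-entropy, the derivative of the loss
    # with respect to the class scores is (probs - target)
    grad_scores = probs - target
    # Move each weight and bias a small step against its gradient
    weights -= learning_rate * np.outer(x, grad_scores)
    bias -= learning_rate * grad_scores

print(losses[0], losses[-1])
```

The loss recorded on the last step is lower than on the first, which is the behavior training relies on.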

## Building a CNN
There are several commonly used frameworks for creating CNNs, including *PyTorch*, *TensorFlow*, the *Microsoft Cognitive Toolkit (CNTK)*, and *Keras* (which is a high-level API that can use TensorFlow or CNTK as a back end).

### A Simple Example
The example we'll use to explore this is a classification model that can classify images of geometric shapes.

First, we'll generate some images for our classification model. Run the cell below to do that (note that it may take several minutes to run).

In [None]:
# Function to create a random image (of a square, circle, or triangle)
def create_image (size, shape):
    from random import randint
    import numpy as np
    from PIL import Image, ImageDraw
    
    xy1 = randint(10,40)
    xy2 = randint(60,100)
    col = (randint(0,200), randint(0,200), randint(0,200))

    img = Image.new("RGB", size, (255, 255, 255))
    draw = ImageDraw.Draw(img)
    
    if shape == 'circle':
        draw.ellipse([(xy1,xy1), (xy2,xy2)], fill=col)
    elif shape == 'square':
        draw.rectangle([(xy1,xy1), (xy2,xy2)], fill=col)
    else: # triangle
        draw.polygon([(xy1,xy1), (xy2,xy2), (xy2,xy1)], fill=col)
    del draw
    
    return np.array(img)

# function to create a dataset of images
def generate_image_data (classes, size, cases, img_dir):
    import os, shutil
    from PIL import Image
    
    if os.path.exists(img_dir):
        replace_folder = input("Image folder already exists. Enter Y to replace it (this can take a while!). \n")
        if replace_folder == "Y":
            print("Deleting old images...")
            shutil.rmtree(img_dir)
        else:
            return # Quit - no need to replace existing images
    os.makedirs(img_dir)
    print("Generating new images...")
    i = 0
    while(i < (cases - 1) / len(classes)):
        if (i%25 == 0):
            print("Progress:{:.0%}".format((i*len(classes))/cases))
        i += 1
        for classname in classes:
            img = Image.fromarray(create_image(size, classname))
            saveFolder = os.path.join(img_dir,classname)
            if not os.path.exists(saveFolder):
                os.makedirs(saveFolder)
            imgFileName = os.path.join(saveFolder, classname + str(i) + '.jpg')
            try:
                img.save(imgFileName)
            except:
                try:
                    # Retry (resource constraints in Azure notebooks can cause occasional disk access errors)
                    img.save(imgFileName)
                except:
                    # We gave it a shot - time to move on with our lives
                    print("Error saving image", imgFileName)
            
# Our classes will be circles, squares, and triangles
classnames = ['circle', 'square', 'triangle']

# All images will be 128x128 pixels
img_size = (128,128)

# We'll store the images in a folder named 'shapes'
folder_name = 'shapes'

# Generate 1200 random images.
generate_image_data(classnames, img_size, 1200, folder_name)

print("Image files ready in %s folder!" % folder_name)

### Setting up the Frameworks
Now that we have our data, we're ready to build a CNN. The first step is to install and configure the frameworks we want to use.

We're going to use TensorFlow as a back-end for the Keras machine learning framework.

> **Note**: In the Azure DSVM, these packages are already installed - we'll just ensure that we have the latest version of Keras. To install TensorFlow on your own system, consult the documentation at https://www.tensorflow.org/install/. To install Keras, consult the Keras installation documentation at https://keras.io/#installation.

In [None]:
import sys
! {sys.executable} -m pip install --upgrade keras

import tensorflow, keras
print('TensorFlow version:',tensorflow.__version__)
print('Keras version:',keras.__version__)

from keras import backend as K

### Preparing the Data
Before we can train the model, we need to prepare the data. We'll divide the feature values by 255 to normalize them as floating point values between 0 and 1, and we'll split the data so that we can use 70% of it to train the model, and hold back 30% to validate it. When loading the data, the data generator will assign "one-hot encoded" numeric labels to indicate which class each image belongs to, based on the subfolders in which the data is stored. In this case, there are three subfolders - *circle*, *square*, and *triangle*, so the labels will consist of three *0* or *1* values indicating which of these classes is associated with the image - for example the label [0 1 0] indicates that the image belongs to the second class (*square*).
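
To illustrate the label format, here is a minimal sketch of one-hot encoding. The helper `one_hot` is made up for this example; the data generator used in the next cell produces labels of this form for you.

```python
import numpy as np

class_names = ['circle', 'square', 'triangle']   # alphabetical, like the subfolders

def one_hot(class_name, names):
    """Return a vector of 0s with a single 1 at the position of class_name."""
    label = np.zeros(len(names), dtype=int)
    label[names.index(class_name)] = 1
    return label

print(one_hot('square', class_names))   # [0 1 0]
```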

In [None]:
from keras.preprocessing.image import ImageDataGenerator

data_folder = 'shapes'
img_size = (128, 128)
batch_size = 30

print("Getting Data...")
datagen = ImageDataGenerator(rescale=1./255, # normalize pixel values
                             validation_split=0.3) # hold back 30% of the images for validation

print("Preparing training dataset...")
train_generator = datagen.flow_from_directory(
    data_folder,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='categorical',
    subset='training') # set as training data

print("Preparing validation dataset...")
validation_generator = datagen.flow_from_directory(
    data_folder,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation') # set as validation data

classes = sorted(train_generator.class_indices.keys())
print("class names: ", classes)

### Defining the CNN
Now we're ready to define our model. This involves defining the layers for our CNN, specifying an *optimizer*, and compiling the model for multi-class classification. In this example, we'll use an optimizer based on the *Adam* algorithm and set its *learning rate* parameter (which determines how much the weights are adjusted after backpropagation identifies their effect on loss). These settings can have a significant impact on how well (and how quickly) your model learns the optimal weights and bias values required to predict accurately.

> Note: For information about the optimizers available in Keras, see https://keras.io/optimizers/

In [None]:
# Define a CNN classifier network
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import optimizers

# Define the model as a sequence of layers
model = Sequential()

# The input layer accepts an image and applies a convolution that uses 32 6x6 filters and a rectified linear unit activation function
model.add(Conv2D(32, (6, 6), input_shape=train_generator.image_shape, activation='relu'))

# Next we'll add a max pooling layer with a 2x2 patch
model.add(MaxPooling2D(pool_size=(2,2)))

# We can add as many layers as we think necessary - here we'll add another convolution, max pooling, and dropout layer
model.add(Conv2D(32, (6, 6), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# A dropout layer randomly drops some nodes to reduce inter-dependencies (which can cause over-fitting)
model.add(Dropout(0.2))

# Now we'll flatten the feature maps and generate an output layer with a predicted probability for each class
model.add(Flatten())
model.add(Dense(train_generator.num_classes, activation='softmax'))

# We'll use the ADAM optimizer
opt = optimizers.Adam(lr=0.001)

# With the layers defined, we can now compile the model for categorical (multi-class) classification
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# summary() prints the layer table itself and returns None, so there is no need to wrap it in print()
model.summary()

### Training the Model
With the layers of the CNN defined, we're ready to train the model using our image data. In the example below, we use 5 iterations (*epochs*) to train the model in 30-image batches, holding back 30% of the data for validation. For each batch, the loss function measures the error (*loss*) in the model's predictions, and the optimizer adjusts the weights (which were randomly initialized before the first iteration) to try to reduce it. At the end of each epoch, the model is evaluated against the validation data.

> **Note**: We're only using 5 epochs to reduce the training time for this simple example. A real-world CNN is usually trained over more epochs than this. CNN model training is processor-intensive, so it's recommended to perform this on a system that can leverage GPUs (such as the Data Science Virtual Machine in Azure) to reduce training time. Status will be displayed as the training progresses.

In [None]:
# Train the model over 5 epochs using 30-image batches and using the validation holdout dataset for validation
num_epochs = 5
history = model.fit_generator(
    train_generator,
    steps_per_epoch = train_generator.samples // batch_size,
    validation_data = validation_generator, 
    validation_steps = validation_generator.samples // batch_size,
    epochs = num_epochs)

### View the Loss History
We tracked average training and validation loss history for each epoch. We can plot these to verify that loss reduced as the model was trained, and to detect *over-fitting* (which is indicated by a continued drop in training loss after validation loss has levelled out or started to increase).

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

epoch_nums = range(1,num_epochs+1)
training_loss = history.history["loss"]
validation_loss = history.history["val_loss"]
plt.plot(epoch_nums, training_loss)
plt.plot(epoch_nums, validation_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.show()

### Save the Model
Now that we have trained the model, we can save it with the trained weights. Then later, we can reload it and use it to predict classes from new images.

In [None]:
from keras.models import load_model

modelFileName = 'shape-classifier.h5'

model.save(modelFileName) # saves the trained model
print("Model saved.")

del model  # deletes the existing model variable

### Use the Model with New Data
Now that we've trained and evaluated our model, we can use it to predict classes for new images.

In [None]:
# Function to predict the class of an image
def predict_image(classifier, image_array):
    import numpy as np
    
    # We need to format the input to match the training data
    # The generator loaded the values as floating point numbers
    # and normalized the pixel values, so...
    imgfeatures = image_array.astype('float32')
    imgfeatures /= 255
    
    # These are the classes our model can predict
    classnames = ['circle', 'square', 'triangle']
    
    # Predict the class of each input image
    predictions = classifier.predict(imgfeatures)
    
    predicted_classes = []
    for prediction in predictions:
        # The prediction for each image is the probability for each class, e.g. [0.8, 0.1, 0.1]
        # So get the index of the highest probability
        class_idx = np.argmax(prediction)
        # And append the corresponding class name to the results
        predicted_classes.append(classnames[int(class_idx)])
    # Return the predicted class names as a list
    return predicted_classes


from random import randint
import numpy as np
%matplotlib inline

# load the saved model
model = load_model(modelFileName) 

# Create a random test image
img = create_image ((128,128), classes[randint(0, len(classes)-1)])
plt.imshow(img)

# Create an array of (1) images to match the expected input format
img_array = img.reshape(1, img.shape[0], img.shape[1], img.shape[2])

# get the predicted classes
predicted_classes = predict_image(model, img_array)

# Display the prediction for the first image (we only submitted one!)
print(predicted_classes[0])