# Convolutional Neural Networks

In this notebook we will implement a convolutional neural network. Rather than doing everything from scratch we will make use of [TensorFlow 2](https://www.tensorflow.org/) and the [Keras](https://keras.io) high level interface.

## Convolutional neural network for the MNIST dataset

Implement the neural network in "[Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. The [Keras Layer documentation](https://keras.io/api/layers/) includes information about the layers supported. In particular, [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d) and [`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d) layers may be useful.

In [1]:
import tensorflow as tf
import numpy as np

### MNIST Dataset

First, let us load the MNIST digits dataset that we will be using to train our network. This is available directly within Keras:

In [2]:
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()

The data comes as a set of integers in the range [0,255] representing the shade of gray of a given pixel. Let's first rescale them to be in the range [0,1]:

In [3]:
x_train, x_test = x_train / 255.0, x_test / 255.0

We first need to reshape the input data so that each image is 28 x 28 x 1 rather than 28 x 28. This is because convolutional layers expect an explicit channel axis: a colour image would be 28 x 28 x 3 to account for the three colour channels (red, green, blue), but here we have only one grayscale channel.

In [4]:
X_train = x_train[..., np.newaxis]
X_test = x_test[..., np.newaxis]
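
As a quick sanity check of the reshaping step, the sketch below applies the same `np.newaxis` indexing to a small dummy array. The name `batch` and its contents are made up for illustration; the real MNIST training array has shape (60000, 28, 28).

```python
import numpy as np

# Hypothetical stand-in for a batch of grayscale images: 4 images of 28 x 28 pixels.
batch = np.zeros((4, 28, 28))

# Indexing with [..., np.newaxis] appends a trailing axis of length 1 (the channel axis).
batch_with_channel = batch[..., np.newaxis]

print(batch.shape)               # (4, 28, 28)
print(batch_with_channel.shape)  # (4, 28, 28, 1)
```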

Now we construct our network with three convolutional layers, two pooling layers and two fully connected layers at the end.

In [9]:
# Construct the CNN model
model = tf.keras.models.Sequential([
    # First convolutional layer
    tf.keras.layers.Conv2D(6,(5,5),input_shape=(28,28,1),padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

    # Second convolutional layer
    tf.keras.layers.Conv2D(16,(5,5), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    
    # Third convolutional layer
    tf.keras.layers.Conv2D(120,(5,5), activation='relu'),
    # Flattening layer
    tf.keras.layers.Flatten(),

    # Fully connected layers
    tf.keras.layers.Dense(84, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')  # Output layer with 10 classes
])
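
To see how the feature maps shrink through this network, the following sketch traces the spatial size and the number of trainable parameters layer by layer using plain arithmetic. It assumes the Keras defaults (stride 1 for convolutions, `'valid'` padding unless stated, and a stride equal to the pool size for pooling); the helper names `conv_params` and `dense_params` are made up for this illustration. The total should agree with what `model.summary()` reports.

```python
def conv_params(kernel, in_channels, filters):
    # Each filter has kernel * kernel * in_channels weights plus one bias.
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    # Each unit has one weight per input plus one bias.
    return (inputs + 1) * units

size = 28                        # input: 28 x 28 x 1
c1 = conv_params(5, 1, 6)        # 'same' padding keeps 28 x 28, now with 6 channels
size = size // 2                 # 2 x 2 max pooling: 14 x 14 x 6
c2 = conv_params(5, 6, 16)
size = size - 5 + 1              # 'valid' 5 x 5 convolution: 10 x 10 x 16
size = size // 2                 # 2 x 2 max pooling: 5 x 5 x 16
c3 = conv_params(5, 16, 120)
size = size - 5 + 1              # 'valid' 5 x 5 convolution: 1 x 1 x 120
flat = size * size * 120         # length of the flattened vector
d1 = dense_params(flat, 84)
d2 = dense_params(84, 10)

total = c1 + c2 + c3 + d1 + d2
print(flat)   # 120
print(total)  # 61706
```

Note that the third convolution reduces each feature map to a single value, so the `Flatten` layer simply yields a vector of 120 numbers.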


# Understanding Convolutional Neural Networks (CNNs)

## 1. What is a Filter in CNNs?
A **filter** (or **kernel**) is a small matrix of numbers (parameters) that slides over the input image and performs a convolution operation. The result is a **feature map**, which highlights specific patterns or features in the image, such as edges, corners, or textures.

### Key Characteristics of a Filter:
- **Size:**
  - Filters are typically small, like $3 \times 3$, $5 \times 5$, or $7 \times 7$, relative to the input size.
  - For a $28 \times 28$ MNIST image, a $5 \times 5$ filter scans over $5 \times 5$ sections of the image.

- **Convolution Operation:**
  - At each position, the filter performs an **element-wise multiplication** with the input pixels it overlaps and sums the results.
  - The filter then moves (or **slides**) to the next position by a step size called the **stride**.

### Example:
Let’s say the input image is:
$$
\text{Input Image: }
\begin{bmatrix}
1 & 2 & 0 & 1 \\
3 & 4 & 1 & 0 \\
0 & 1 & 3 & 2 \\
1 & 0 & 2 & 4
\end{bmatrix}
$$

And the filter (kernel) is:
$$
\text{Filter: }
\begin{bmatrix}
1 & 0 \\
-1 & 1
\end{bmatrix}
$$

To compute the convolution:
1. Place the filter over the top-left corner of the input.
2. Perform element-wise multiplication and sum the results:
   $$
   (1 \cdot 1) + (2 \cdot 0) + (3 \cdot -1) + (4 \cdot 1) = 1 + 0 - 3 + 4 = 2
   $$
3. Slide the filter to the right by one step (stride of 1) and repeat.

Repeating this at every position gives a $3 \times 3$ **feature map**, smaller than the original $4 \times 4$ image.
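
The worked example above can be checked with a few lines of NumPy. This is a minimal sketch rather than how Keras implements `Conv2D`: the function name `convolve2d_valid` is made up for illustration, it handles a single channel only, and, like most deep learning libraries, it does not flip the kernel (strictly speaking it computes a cross-correlation).

```python
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [3, 4, 1, 0],
                  [0, 1, 3, 2],
                  [1, 0, 2, 4]])

kernel = np.array([[1, 0],
                   [-1, 1]])

def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the image currently covered by the kernel.
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[ 2 -1 -1]
#  [ 4  6  0]
#  [-1  3  5]]
```

The top-left entry, 2, matches the hand calculation in step 2.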

---

## 2. Padding in CNNs
**Padding** determines how the edges of the input are handled during convolution. Without padding, the convolution operation reduces the size of the output feature map.

### Types of Padding:
1. **Same Padding:**
   - Adds zeros around the edges of the input so that, with a stride of 1, the output size is the same as the input size.
   - If the input is $28 \times 28$ and the filter is $3 \times 3$, one row or column of zeros is added on each side and the output remains $28 \times 28$.

2. **Valid Padding:**
   - No padding is added.
   - The output size is reduced because the filter cannot extend beyond the input edges.
   - For example, a $28 \times 28$ input with a $3 \times 3$ filter results in an output size of $26 \times 26$ (subtracting $2$ rows and columns).
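
The output sizes quoted above follow from a simple formula. The sketch below assumes the convention used by Keras and TensorFlow, where `'valid'` gives $\lfloor (n - k)/s \rfloor + 1$ and `'same'` gives $\lceil n/s \rceil$ for input size $n$, filter size $k$ and stride $s$; the function name `conv_output_size` is made up for illustration.

```python
import math

def conv_output_size(n, k, padding="valid", stride=1):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        # Zero padding is chosen so that only the stride shrinks the output.
        return math.ceil(n / stride)
    if padding == "valid":
        # The filter must fit entirely inside the input.
        return (n - k) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(28, 3, "same"))   # 28
print(conv_output_size(28, 3, "valid"))  # 26
print(conv_output_size(28, 5, "valid"))  # 24
```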

### Why Use Padding?
- **Same Padding:**
  - Lets the filter be centred on pixels at the border, so features near the edges of the image are not under-represented.
  - Preserves the spatial size of the output.

- **Valid Padding:**
  - Reduces computation by shrinking the feature map.
  - Useful when the exact output size isn’t critical.

---

## 3. Pooling in CNNs
Pooling is a **down-sampling** operation that reduces the spatial size of the feature maps. This reduces the amount of computation in later layers and can help reduce overfitting.

### Types of Pooling:
1. **Max Pooling:**
   - Divides the input into non-overlapping regions and takes the maximum value from each region.
   - For example, for a $2 \times 2$ region:
     $$
     \text{Region: }
     \begin{bmatrix}
     1 & 3 \\
     2 & 4
     \end{bmatrix}
     \quad \text{Max Value: } 4
     $$

2. **Average Pooling:**
   - Takes the average of all values in the region.
     $$
     \text{Average Value: } \frac{1 + 3 + 2 + 4}{4} = 2.5
     $$

### Key Parameters:
- **Pool Size:**
  - Defines the size of the pooling region, e.g., $2 \times 2$.
- **Stride:**
  - Determines how far the pooling window moves. Default is equal to the pool size, ensuring non-overlapping regions.
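
Both pooling types can be sketched in NumPy for the common case where the stride equals the pool size. The function name `pool2d` and the $4 \times 4$ input are made up for illustration (its top-left block is the region used above), and the sketch assumes the input height and width are divisible by the pool size.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride equal to the pool size) on a 2D array."""
    h, w = x.shape
    # Split the array into (h/size) x (w/size) blocks of shape size x size.
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [2, 4, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])

max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")

print(max_pooled)
# [[4 2]
#  [2 8]]
print(avg_pooled)
# [[2.5  1.  ]
#  [1.25 6.5 ]]
```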

### Why Use Pooling?
- Reduces the spatial size of feature maps.
- Focuses on dominant features while discarding unnecessary details.
- Makes the model robust to small translations and distortions.

---

## 4. Fully Connected (Dense) Layers
After convolution and pooling, the feature maps are flattened into a 1D vector and passed through fully connected (dense) layers.

### Role of Dense Layers:
- Combine all the extracted features to make predictions.
- Each neuron in a dense layer connects to every value in the input vector, learning a weighted sum.
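
The weighted sum computed by a dense layer is a matrix-vector product plus a bias. The sketch below uses a tiny layer with 3 inputs and 2 neurons; the function name `dense` and all numbers are made up for illustration, and the activation function is left out.

```python
import numpy as np

def dense(x, weights, bias):
    """Weighted sum of the inputs for every neuron, before any activation."""
    return x @ weights + bias

x = np.array([1.0, 2.0, 3.0])        # flattened input vector
weights = np.array([[0.5, -1.0],     # one column of weights per neuron
                    [0.0,  1.0],
                    [1.0,  0.5]])
bias = np.array([0.1, -0.2])

out = dense(x, weights, bias)
print(out)  # approximately [3.6 2.3]
```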

### Why 84 Neurons?
- The number $84$ is inspired by the **LeNet-5 architecture**, which used this size for its hidden layer.
- It’s large enough to capture complex patterns while being computationally efficient.

### Why 10 Neurons in the Output Layer?
- The MNIST dataset has 10 classes (digits 0-9).
- Each neuron corresponds to a class, and the output values represent the probabilities for each class.

---

## 5. Activation Functions
Activation functions introduce **non-linearity** into the model, allowing it to learn complex patterns.

### Common Activation Functions:
- **ReLU (Rectified Linear Unit):**
  $$
  \text{ReLU}(x) = \max(0, x)
  $$
  - Helps mitigate the vanishing gradient problem, because its gradient is exactly 1 for positive inputs.
- **Sigmoid:**
  - Maps values to the range $(0, 1)$, often used in output layers for binary classification.
- **Softmax:**
  - Converts raw scores into probabilities for multi-class classification.
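
The three functions above can be written in a few lines of NumPy. This is an illustrative sketch, not the Keras implementation; the input `scores` is made up, and the softmax subtracts the maximum score first, a common trick to avoid overflow that does not change the result.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtracting the maximum keeps the exponentials from overflowing.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([-1.0, 0.0, 2.0])

print(relu(scores))     # [0. 0. 2.]
print(sigmoid(scores))  # every value lies strictly between 0 and 1
print(softmax(scores))  # non-negative values that sum to 1
```

The largest score receives the largest softmax probability, which is why the predicted class is usually taken as the index of the maximum output.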

---

## 6. Why Does the Model Use This Architecture?
1. **Convolutional Layers:**
   - Detect hierarchical features (edges, shapes, patterns).
   - Increasing filters (6, 16, 120) allows the network to learn more complex features.

2. **Pooling Layers:**
   - Reduce the spatial size of feature maps, focusing on the most important features.

3. **Dense Layers:**
   - Combine features to make the final prediction.

4. **Output Layer:**
   - Maps the features to probabilities for each digit class.

---

## Summary:
- **Filters:** Learn to detect features like edges and shapes.
- **Kernel Size:** Defines the receptive field of each filter.
- **Padding:** Determines how edges are handled.
- **Pooling:** Reduces spatial size while preserving important features.
- **Dense Layers:** Combine features for classification.
- **Activation Functions:** Add non-linearity to learn complex patterns.

This combination allows CNNs to process images efficiently and make accurate predictions.


Next, we compile the model, specifying sparse categorical cross-entropy loss and the Adam optimiser.

In [10]:
# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
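
For intuition about the loss chosen above, the sketch below computes sparse categorical cross-entropy by hand: for each example it takes the negative log of the probability the model assigned to the true class, then averages over the batch. "Sparse" means the labels are integer class indices rather than one-hot vectors. The function is a simplified NumPy illustration, not the Keras implementation, and the probabilities and labels are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # Pick, for each row, the probability assigned to the correct class.
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

# Hypothetical softmax outputs for two examples and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # integer class indices

loss = sparse_categorical_crossentropy(probs, labels)
print(loss)  # about 0.29
```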

Now we train the model for 20 epochs.

In [11]:
# Train the model for 20 epochs
model.fit(X_train, y_train, epochs=20, verbose=1)

Epoch 1/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 5ms/step - accuracy: 0.8760 - loss: 0.3991
Epoch 2/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 5ms/step - accuracy: 0.9810 - loss: 0.0607
Epoch 3/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 5ms/step - accuracy: 0.9858 - loss: 0.0428
Epoch 4/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 5ms/step - accuracy: 0.9898 - loss: 0.0334
Epoch 5/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 5ms/step - accuracy: 0.9915 - loss: 0.0259
Epoch 6/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 5ms/step - accuracy: 0.9925 - loss: 0.0223
Epoch 7/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 5ms/step - accuracy: 0.9939 - loss: 0.0196
Epoch 8/20
[1m1875/1875[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 6ms/step - accuracy: 0.9954 - loss: 0.0138
Epoch 9/20
[1m1875/

<keras.src.callbacks.history.History at 0x12f82a35bb0>
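
The figure 1875 in the log is the number of batches per epoch. A small sketch of where it comes from, assuming the standard MNIST training set of 60000 images and the default `batch_size` of 32 that `model.fit` uses when none is given:

```python
import math

num_train = 60000   # images in the MNIST training set
batch_size = 32     # Keras default when batch_size is not passed to fit

steps_per_epoch = math.ceil(num_train / batch_size)
print(steps_per_epoch)  # 1875
```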

We have achieved 99.6% accuracy after training for 20 epochs. Let's check this against the test data:

In [12]:
model.evaluate(X_test, y_test, verbose=False)

[0.052501361817121506, 0.9879999756813049]

The test accuracy is 98.8%, slightly below the training accuracy, so the model may have overfitted slightly, but it is still highly accurate.