# Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a class of deep learning models that are especially effective for image-related tasks such as image classification, object detection, and facial recognition.

They are inspired by how humans learn to recognise objects and how the human visual system processes visual information.



## How Machines Learn to See Images

Similar to how a child learns to recognise objects, a machine learning algorithm must be shown a very large number of images before it can generalise. Only after seeing enough examples can it make correct predictions for images it has never seen before.

However, computers see the world very differently from humans.

A computer does not see objects, shapes, or meaning. Its world consists entirely of numbers. A grayscale image can be represented as a two-dimensional array of numbers, where each number is the brightness of one pixel; a colour image adds a third dimension for its colour channels.

Even though computers perceive images differently, this does not mean they cannot learn patterns. It simply means that we must represent images in a way that allows algorithms to discover meaningful structures from numerical data.
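As a small illustration of this idea, the sketch below builds a made-up 5 × 5 grayscale image as a plain Python list of lists. The pixel values and variable names are invented for this example; a real image is simply a much larger grid of the same kind.

```python
# A tiny 5x5 grayscale "image": 0 is black, 255 is white.
# The bright pixels form a vertical line in the middle column.
image = [
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
]

height = len(image)
width = len(image[0])
print("height x width:", height, "x", width)

# The computer only has these numbers; "a vertical line" is a pattern
# that an algorithm has to discover from them.
for row in image:
    print(" ".join(f"{value:3d}" for value in row))
```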


## Why Regular Neural Networks Are Not Enough

A regular (fully connected) neural network has no built-in notion of which input values are neighbours; every input is connected to every neuron in the same way. When working with images, this usually means flattening the image into a one-dimensional vector before feeding it into the network.

Flattening an image causes a major problem: all spatial information is lost. The network no longer knows which pixels were originally close to each other or how they formed shapes such as edges or corners.
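To make this concrete, the following sketch assumes row-by-row (row-major) flattening of a 28-pixel-wide image and checks how far apart two vertically adjacent pixels end up in the flattened vector. The helper names `flatten` and `flat_index` are hypothetical and exist only for this illustration.

```python
def flatten(image):
    """Concatenate the rows of a 2D image into one long list (row-major order)."""
    return [pixel for row in image for pixel in row]


def flat_index(row, col, width):
    """Position of pixel (row, col) inside the flattened vector."""
    return row * width + col


tiny = [[1, 2],
        [3, 4]]
print(flatten(tiny))  # [1, 2, 3, 4]

# Two pixels that touch vertically in a 28-pixel-wide image...
width = 28
above = flat_index(10, 5, width)
below = flat_index(11, 5, width)

# ...end up a full row apart once the image is flattened.
print(below - above)  # 28
```

Nothing in the flattened vector itself tells the network that these two positions were neighbours.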

As a result, a traditional neural network cannot effectively capture patterns like textures, edges, or object structures. This makes it inefficient and poorly suited for image-related tasks.

Convolutional Neural Networks solve this problem by preserving spatial relationships and learning features directly from local regions of the image.


## Biological Inspiration of Convolutional Neural Networks

Convolutional Neural Networks are strongly inspired by the way the human visual system processes visual information. Studies conducted in the 1950s and 1960s by neuroscientists Hubel and Wiesel revealed important insights into how visual perception works in mammals.

Their research showed that neurons in the visual cortex do not respond to the entire visual field at once. Instead, each neuron is activated only by stimuli within a small, specific region of the visual field. This concept is known as a receptive field.

They identified two main types of visual neurons. The first type, called simple cells, respond to basic visual patterns such as straight lines or edges at specific orientations and positions. The second type, known as complex cells, respond to the same patterns even when their position changes slightly within the visual field.

This hierarchical and local processing of visual information inspired the design of Convolutional Neural Networks. Similar to the human brain, CNNs analyse images by detecting simple features first and gradually combining them to form more complex representations.

## Architecture of Convolutional Neural Networks

The architecture of a Convolutional Neural Network (CNN) is fundamentally different from that of a traditional fully connected neural network. CNNs are specifically designed to process data that has a grid-like structure, such as images.

Unlike regular neural networks, which treat all input features equally, CNNs exploit the spatial structure of images. This allows them to learn visual patterns more efficiently while significantly reducing the number of parameters required.

![cnn architecture](https://media.licdn.com/dms/image/v2/D5612AQGOui8XZUZJSA/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1680532048475?e=2147483647&v=beta&t=5gZVHYNL2Vc2mK3iKrpK-FcpURIFdyaP4Vi38eeeZyM)


## Components of a Convolutional Neural Network

A Convolutional Neural Network is not a single, uniform block of computation. Instead, it is organised into two major components, each responsible for a distinct role in the overall learning process.

The first component is responsible for understanding the visual content of an image. It analyses the raw pixel values and learns to identify meaningful visual patterns such as edges, textures, shapes, and object parts. This component is commonly referred to as the **feature extraction component**.

The second component is responsible for interpreting the features learned by the first component. Using these extracted features, it determines what the image represents by assigning probabilities to different output classes. This component is known as the **classification component**.

By dividing the network into these two components, CNNs are able to efficiently process images, first by learning *what* is present in the image and then by deciding *what the image means*.


## Feature Extraction Component

The feature extraction component is the first and most important part of a Convolutional Neural Network. Its primary role is to transform raw image data into meaningful representations that can be understood and used by the network.

When an image is first given as input to a CNN, it exists only as a grid of numerical pixel values. At this stage, the network has no knowledge of shapes, objects, or patterns. Individual pixel values by themselves carry very little useful information. The task of the feature extraction component is to discover structure and patterns within this raw data.

This component works by gradually analysing small regions of the image and learning visual features at multiple levels of complexity. In the earliest stages, the network learns very simple features such as edges, lines, and corners. These features are fundamental building blocks and appear in almost all images, regardless of the object being depicted.

As the image data moves deeper through the feature extraction component, these simple features are combined to form more complex patterns such as textures, curves, and basic shapes. In even deeper layers, the network learns high-level representations such as object parts or distinctive visual characteristics that are strongly associated with specific classes.

The feature extraction component does not rely on manually defined rules or handcrafted features. Instead, it automatically learns which features are important directly from the training data. This ability to learn features hierarchically and automatically is one of the key strengths of Convolutional Neural Networks and is a major reason for their success in image-related tasks.

## Convolution Operation in Convolutional Neural Networks

Convolution is one of the most important building blocks of a Convolutional Neural Network. It is the primary mechanism through which the network learns visual features from an image.

In mathematics, convolution refers to an operation that combines two functions to produce a third function. Conceptually, it merges two sources of information. In the context of CNNs, these two sources are:
1. The input image
2. A small matrix called a filter or kernel

The result of applying a filter to an image through convolution is known as a **feature map**.

In the animation below, the filter (the green square) slides over the input (the blue square). At each position, the sum of the element-wise products is written into the feature map (the red square).

The area covered by the filter is also called the receptive field, a term borrowed from the neurons of the visual cortex. In this animation, the filter is 3×3.

![convolution](https://cdn-media-1.freecodecamp.org/images/Htskzls1pGp98-X2mHmVy9tCj0cYXkiCrQ4t)

## Filter (Kernel) and Input Image

A filter, also called a kernel, is a small matrix of learnable values. Common filter sizes are 3×3 or 5×5. The filter is much smaller than the input image and is designed to focus on local regions rather than the entire image at once.

The input image is represented as a matrix of pixel values. For a grayscale image, this matrix has two dimensions (height and width). For a colour image, it has three dimensions (height, width, and depth).



The convolution operation is performed by sliding the filter over the input image.

At each position:
- The filter is placed over a small region of the image
- Element-wise multiplication is performed between the filter values and the corresponding pixel values
- All resulting values are summed to produce a single number

This single number is placed into the corresponding position in the feature map.

This process is repeated as the filter moves across the entire image.
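The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how a deep learning library implements it: it assumes a stride of 1 and no padding, and the image and the vertical-edge filter are hand-picked for the example (in a CNN the filter values would be learned). As in most deep learning libraries, the filter is applied without flipping it.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiplication over the current region, then a sum.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# 5x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [3, 3, 0]
```

The feature map is large where the filter overlaps the edge and zero in the uniformly bright region, which is what "detecting a feature" means here.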

## Receptive Field

The region of the image that the filter covers at any given time is called the **receptive field**.

For example:
- A 3×3 filter has a receptive field of size 3×3
- This means the neuron associated with that output value only looks at a 3×3 region of the input image

This concept is inspired by biological vision, where neurons respond only to stimuli within a small region of the visual field.


## From 2D to 3D Convolution

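For simplicity, convolution is often illustrated using two-dimensional, single-channel images. In practice, the input usually has a third dimension (depth), so the filter is a **three-dimensional** volume, even though it still slides only along the height and width of the image.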

A colour image has three channels: Red, Green, and Blue. Therefore:
- The filter also has a depth of 3
- The filter spans the entire depth of the input image
- The convolution operation produces a two-dimensional feature map

Each filter always covers all input channels.
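A rough sketch of this, assuming the image is stored as `image[row][col][channel]` and the filter as `kernel[row][col][channel]` (a layout chosen for this example), stride 1 and no padding. The function name and the toy values are hypothetical.

```python
def convolve_multichannel(image, kernel):
    """Apply one filter that spans every channel; the result is a 2D feature map."""
    k_h, k_w, depth = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            # Sum over the window's height, width AND all channels.
            sum(
                image[i + di][j + dj][c] * kernel[di][dj][c]
                for di in range(k_h)
                for dj in range(k_w)
                for c in range(depth)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x4 "RGB" image in which every pixel is (1, 2, 3).
image = [[[1, 2, 3] for _ in range(4)] for _ in range(4)]

# One 3x3x3 filter filled with ones.
kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

feature_map = convolve_multichannel(image, kernel)
print(feature_map)  # [[54, 54], [54, 54]]
```

Each output value is 9 × (1 + 2 + 3) = 54: the three channels are collapsed into a single number per position, so the feature map has no depth of its own.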


## Multiple Filters and Feature Maps

A single filter can detect only one type of feature. To learn multiple features, a convolution layer uses **many filters**.

Each filter:
- Learns a different visual pattern
- Produces its own feature map

All feature maps are stacked together to form the output of the convolution layer. The number of feature maps equals the number of filters used.

![3d](https://cdn-media-1.freecodecamp.org/images/Gjxh-aApWTzIRI1UNmGnNLrk8OKsQaf2tlDu)
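Because every filter spans all input channels, the number of learnable values in a convolution layer can be estimated with simple arithmetic. The sketch below assumes square filters and one bias per filter (the usual default in libraries such as Keras); the function name is hypothetical.

```python
def conv_layer_params(num_filters, kernel_size, in_channels):
    """Learnable values in a convolution layer, assuming one bias per filter."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return num_filters * (weights_per_filter + 1)


# 32 filters of size 3x3 on a single-channel (grayscale) input.
print(conv_layer_params(32, 3, 1))   # 320

# 64 filters of size 3x3 on the 32 feature maps produced by the previous layer.
print(conv_layer_params(64, 3, 32))  # 18496
```

The count does not depend on the height or width of the image, because the same filter values are reused at every position.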

## Activation Function After Convolution

After convolution, the output values are passed through an activation function to introduce non-linearity.

In CNNs, the most commonly used activation function is the **ReLU (Rectified Linear Unit)**.

ReLU replaces all negative values with zero while keeping positive values unchanged. This allows the network to learn complex patterns and improves training efficiency.
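A minimal sketch of ReLU applied to a small, made-up feature map:

```python
def relu(x):
    """Return x if it is positive, otherwise 0."""
    return max(0.0, x)


feature_map = [
    [-2.0, 0.5],
    [3.0, -0.1],
]

activated = [[relu(value) for value in row] for row in feature_map]
print(activated)  # [[0.0, 0.5], [3.0, 0.0]]
```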



## Stride: How the Filter Moves

Stride defines how many pixels the filter moves at each step.

- A stride of 1 means the filter moves one pixel at a time
- Larger stride values cause the filter to move in larger steps

Increasing the stride reduces the size of the feature map and lowers computational cost, but may also result in loss of fine-grained information.
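The effect of stride on the feature map size can be computed directly. The sketch below assumes no padding and a square input and filter; the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Width (or height) of the feature map when no padding is used."""
    return (input_size - kernel_size) // stride + 1


# A 3x3 filter on a 28x28 input with increasing stride.
for stride in (1, 2, 3):
    print(stride, conv_output_size(28, 3, stride))
# stride 1 -> 26, stride 2 -> 13, stride 3 -> 9
```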


## Padding: Preserving Spatial Size

After convolution, the feature map is usually smaller than the input image. To control this size reduction, **padding** is used.

Padding adds a border of zero-valued pixels around the input image. This allows:
- Preservation of spatial dimensions
- Better handling of edge information
- Proper alignment of filters with the image

Padding is especially useful when multiple convolution layers are stacked.
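As an illustration, the sketch below pads a small made-up image with a one-pixel border of zeros and then checks, using the standard output-size formula, that a 3×3 filter with a padding of 1 keeps a 28×28 input at 28×28. Both helper names are hypothetical.

```python
def zero_pad(image, pad):
    """Return a copy of `image` surrounded by `pad` rows/columns of zeros."""
    padded_width = len(image[0]) + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * padded_width for _ in range(pad))
    return padded


def padded_output_size(input_size, kernel_size, pad, stride=1):
    """Feature map width (or height) for a given amount of zero padding."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
for row in zero_pad(image, 1):
    print(row)

print(padded_output_size(28, 3, pad=0))  # 26: the map shrinks
print(padded_output_size(28, 3, pad=1))  # 28: the size is preserved
```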


## Pooling Layer After Convolution

After one or more convolution layers, a pooling layer is commonly added.

Pooling reduces the spatial dimensions of feature maps, which:
- Decreases the number of parameters needed in later layers (pooling itself has none)
- Reduces computation
- Helps control overfitting

The most common pooling operation is **max pooling**, where the maximum value within a window is selected.

![](https://cdn-media-1.freecodecamp.org/images/96HH3r99NwOK818EB9ZdEbVY3zOBOYJE-I8Q)
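A minimal sketch of max pooling, assuming non-overlapping 2×2 windows (a stride equal to the window size) and a made-up 4×4 feature map:

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 4], [7, 9]]
```

The 4×4 map shrinks to 2×2 while the strongest response in each region is kept.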

When designing a convolution layer, four key hyperparameters must be chosen carefully:
- Kernel size
- Number of filters
- Stride
- Padding

These parameters directly influence the feature map size, learning capacity, and performance of the CNN.


## Classification Component of a Convolutional Neural Network

The classification component is the second major part of a Convolutional Neural Network. While the feature extraction component focuses on learning *what visual patterns exist* in the image, the classification component focuses on *what those patterns mean*.

By the time the data reaches the classification component, the original image has already been transformed into a set of high-level features. These features encode important information about the image, such as the presence of edges, shapes, textures, and object parts. The role of the classification component is to interpret these learned features and make a final decision about the class of the input image.

In essence, the classification component answers the question:  
**“Given the features that have been extracted, what is the image?”**

## Fully Connected Layers in the Classification Component

The core of the classification component consists of one or more fully connected (dense) layers. These layers operate in a similar manner to traditional neural networks.

In a fully connected layer, each neuron is connected to every neuron in the previous layer. This complete connectivity allows the network to combine all extracted features and learn complex relationships between them.

Unlike convolution layers, which focus on local patterns, fully connected layers consider the entire set of learned features simultaneously. This makes them well suited for high-level reasoning and decision making.

## Purpose of Fully Connected Layers

The features extracted by convolution and pooling layers are spatial and distributed across feature maps. While these features are useful, they do not directly correspond to class predictions.

Fully connected layers serve as a classifier that:
- Combines all extracted features
- Learns decision boundaries between classes
- Produces scores corresponding to each class

Through training, the fully connected layers learn how different feature combinations relate to different output classes.

## Output Layer in the Classification Component

The final layer of the classification component is known as the output layer. This layer produces a numerical score for each possible class.

Each neuron in the output layer corresponds to one class. The values produced by this layer indicate how strongly the network believes the input image belongs to each class.

These raw scores are not yet probabilities and must be further processed using an activation function.

## Softmax Activation Function

The softmax activation function is commonly used in the output layer of a CNN for multi-class classification tasks.

Softmax converts raw output scores into probabilities. Each probability lies between 0 and 1, and the sum of all probabilities equals 1.

This makes the output interpretable, as it directly represents the model’s confidence in each class. The class with the highest probability is selected as the final prediction.
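A small sketch of softmax using only the standard library. The three scores are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes
probs = softmax(scores)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(probs.index(max(probs)))       # 0, the predicted class
```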


## Final Interpretation of the CNN Output

After the softmax activation function is applied, the CNN produces a probability distribution over all possible classes.

The predicted class is the one with the highest probability. This probabilistic output not only provides a prediction but also indicates how confident the network is in its decision.

This marks the completion of the classification component and the final stage of the Convolutional Neural Network.

## Training a Convolutional Neural Network

Training a Convolutional Neural Network is the process through which the network learns the values of its parameters so that it can make accurate predictions.

During training, the CNN is shown a large number of input images along with their correct labels. Based on these examples, the network gradually adjusts its internal parameters to minimise prediction errors.

Although CNNs have a specialised architecture, the fundamental training principle is the same as that of traditional neural networks: learning through error correction.



## What a CNN Learns During Training

A CNN contains two types of layers:
- Layers with learnable parameters (convolution and fully connected layers)
- Layers without learnable parameters (activation, pooling, flattening)

During training:
- Convolution layers learn filter values (and, typically, one bias per filter)
- Fully connected layers learn weights and biases

Pooling and activation layers do not learn parameters; they only transform data.


## Forward Propagation in a CNN

Forward propagation is the process of passing an input image through the entire CNN from start to end.

The input image first passes through the feature extraction component, where convolution, activation, and pooling operations extract meaningful features. These features are then flattened and passed through the classification component, producing output scores for each class.

At the end of forward propagation, the network produces a prediction based on its current parameter values.


## Loss Function in CNN Training

The loss function measures how far the network’s prediction is from the true label.

It produces a single numerical value that represents the error made by the network for a given input. A smaller loss value indicates better performance.

In image classification tasks, common loss functions include:
- Categorical Cross-Entropy (labels given as one-hot vectors)
- Sparse Categorical Cross-Entropy (labels given as integer class indices)

The objective of training is to minimise the loss function.

## Why Cross-Entropy Loss Is Used

Cross-entropy loss is well suited for classification problems because it strongly penalises confident but incorrect predictions.

When the network assigns high probability to the wrong class, the loss becomes large. When it assigns high probability to the correct class, the loss becomes small.

This behaviour encourages the network to produce accurate and confident predictions.
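To see this behaviour in numbers, the sketch below computes the cross-entropy loss for a single example, assuming the label is an integer class index and the prediction is already a probability distribution. The three predictions are made up.

```python
import math


def cross_entropy(probs, true_class):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_class])


true_class = 1

confident_right = cross_entropy([0.05, 0.90, 0.05], true_class)
unsure = cross_entropy([0.30, 0.40, 0.30], true_class)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_class)

print(round(confident_right, 3))  # 0.105
print(round(unsure, 3))           # 0.916
print(round(confident_wrong, 3))  # 2.996
```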


## Backpropagation in Convolutional Neural Networks

Backpropagation is the process through which the network works out how each parameter should change in order to reduce the loss.

After forward propagation and loss calculation, the error is propagated backward through the network. During this process, gradients of the loss function with respect to each parameter are computed.

In CNNs, backpropagation computes gradients for:
- Filter values in convolution layers
- Weights and biases in fully connected layers

Although the mathematical details are more complex due to convolution operations, the underlying principle remains the same as in traditional neural networks.

## Gradient Descent and Optimization

Once gradients are computed through backpropagation, an optimization algorithm is used to update the parameters.

Gradient descent updates parameters in the direction that reduces the loss. The size of each update is controlled by the learning rate.

Choosing an appropriate learning rate is crucial:
- Too large → unstable training
- Too small → slow learning
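The effect of the learning rate can be seen on a toy problem. The sketch below minimises the made-up one-parameter loss L(w) = (w − 3)², whose gradient is 2(w − 3); it is not a CNN, only an illustration of the update rule.

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)


def gradient_descent(start, learning_rate, steps):
    """Repeatedly move the parameter against the gradient."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


good = gradient_descent(start=0.0, learning_rate=0.1, steps=50)
too_small = gradient_descent(start=0.0, learning_rate=0.001, steps=50)
too_large = gradient_descent(start=0.0, learning_rate=1.1, steps=50)

print(good)       # very close to the minimum at w = 3
print(too_small)  # has barely moved away from the starting point
print(too_large)  # has overshot repeatedly and diverged
```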

## Optimizers in CNN Training

Optimizers determine how parameter updates are performed.

One of the most commonly used optimizers is Adam (Adaptive Moment Estimation). Adam combines the advantages of momentum and adaptive learning rates.

Adam automatically adjusts learning rates for each parameter, leading to faster and more stable convergence in CNN training.
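For readers curious about what Adam does to a single parameter, here is a sketch of one update step using the commonly quoted default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). The function name `adam_step` and the toy loss L(w) = (w − 3)² are assumptions made for this illustration; libraries apply the same idea to every parameter at once.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; `t` is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


w, m, v = 0.0, 0.0, 0.0
w_after_first_step = None
for t in range(1, 2001):
    grad = 2 * (w - 3)  # gradient of the toy loss (w - 3)^2
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    if t == 1:
        w_after_first_step = w

print(w_after_first_step)  # roughly 0.01: the first step has size close to lr
print(w)                   # has moved towards the minimum at w = 3
```

Dividing by the square root of the squared-gradient average is what gives each parameter its own effective step size.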

## Epochs and Batches

Training data is not processed all at once.

- A batch is a small subset of the training data
- An epoch is one complete pass through the entire training dataset

Training over multiple epochs allows the network to refine its parameters gradually and improve accuracy.
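The relationship between dataset size, batch size, and the number of updates per epoch is simple arithmetic. The sketch below uses hypothetical helper names and, as an example, 60,000 samples with a batch size of 32.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data`; the last batch may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


print(steps_per_epoch(60000, 32))  # 1875

batches = list(iterate_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```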


## Training and Validation Data

During training, data is typically divided into:
- Training data, used to update parameters
- Validation data, used to evaluate performance during training

Validation helps monitor overfitting and ensures that the model generalises well to unseen data.
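One simple way to create such a split is to shuffle the samples and hold out a fixed fraction. The sketch below is only an illustration with a hypothetical helper name; the integers stand in for labelled images, and the 20% fraction is an arbitrary choice.

```python
import random


def train_validation_split(samples, validation_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


samples = list(range(100))  # stand-ins for 100 labelled images
train, validation = train_validation_split(samples, validation_fraction=0.2)

print(len(train), len(validation))  # 80 20
```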

In Keras, the CNN must be compiled before training begins.

Compilation involves specifying:
- The loss function
- The optimizer
- The evaluation metric

These choices define how the model learns and how its performance is measured.

Only through training does the CNN learn to extract meaningful features and make accurate predictions on image data.


# Example: Image Classification Using Convolutional Neural Network (CNN)

In this section, let us look at the implementation of a Convolutional Neural Network (CNN) for image classification using the MNIST dataset. The objective of this example is to understand how CNNs are applied in practice to classify images by learning hierarchical visual features and making accurate predictions.

The complete workflow includes dataset overview, data preprocessing, model construction, training, evaluation, and prediction.


## Characteristics of the MNIST Dataset

The MNIST dataset (Modified National Institute of Standards and Technology dataset) is one of the most widely used benchmark datasets in the field of machine learning and deep learning. It is primarily used for training and evaluating image classification models, especially Convolutional Neural Networks.

The dataset consists of images of handwritten digits ranging from 0 to 9. Each image represents a single digit written by different individuals, making the dataset suitable for studying pattern recognition and generalization.

- The dataset contains 70,000 images in total  
- 60,000 images are used for training  
- 10,000 images are used for testing  
- Each image is of size 28 × 28 pixels
- Images are grayscale, meaning they have a single color channel  
- Pixel values range from 0 to 255, where 0 represents black and 255 represents white  


### Importing Required Libraries

NumPy is used for numerical operations and array manipulation.  
Matplotlib is used for visualizing image data.  
TensorFlow Keras provides high-level APIs to build, train, and evaluate Convolutional Neural Networks, including layers such as convolution, pooling, and fully connected layers.

```python
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Activation
from tensorflow.keras.datasets import mnist
```


### Loading the MNIST Dataset

The MNIST dataset is loaded directly using Keras.  
It is automatically split into training data (60,000 images) and testing data (10,000 images).  
Each image is a grayscale handwritten digit with a corresponding label from 0 to 9.


```python
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```


### Understanding the Dataset Shape

The training images have the shape (60000, 28, 28), meaning there are 60,000 images of size 28 × 28 pixels.  
The labels have the shape (60000,), indicating one label per image.  
Since CNNs expect channel information, the data must be reshaped before training.


```python
print(x_train.shape)
print(y_train.shape)
```

```
(60000, 28, 28)
(60000,)
```


### Reshaping the Data

Keras convolution layers expect each image in the format (height, width, channels),  
so the full array has the shape (number of images, height, width, channels).

Since MNIST images are grayscale, they have only one channel.  
Reshaping adds this channel dimension and makes the data compatible with CNN layers.


```python
# -1 lets NumPy infer the number of images; the trailing 1 is the single grayscale channel
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
```


### Normalizing the Data

Pixel values in MNIST range from 0 to 255.  
Dividing by 255 scales the values to the range 0 to 1.  
Normalization improves training stability and speeds up convergence.


```python
# Scale pixel values from the range [0, 255] to [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0
```


### Creating the CNN Model

The Sequential model is used to build the CNN layer by layer in a linear stack.  
This approach is simple and well suited for standard CNN architectures.


```python
model = Sequential()
```


### First Convolution Block

The convolution layer applies 32 filters of size 3 × 3 to extract basic visual features such as edges.  
The ReLU activation introduces non-linearity, allowing the network to learn complex patterns.  
Max pooling reduces the spatial dimensions and helps control overfitting.


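```python
# 32 filters of size 3x3 applied to 28x28 single-channel images
model.add(Conv2D(32, (3,3), input_shape=(28,28,1)))
model.add(Activation('relu'))
# 2x2 max pooling halves the height and width of each feature map
model.add(MaxPooling2D((2,2)))
```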




### Second Convolution Block

The second convolution layer learns more complex features by building on the features extracted earlier.  
Increasing the number of filters allows the network to capture richer visual information.  
Pooling again reduces the feature map size while retaining important details.


```python
# 64 filters of size 3x3, each spanning all 32 feature maps from the first block
model.add(Conv2D(64, (3,3)))
model.add(Activation('relu'))
model.add(MaxPooling2D((2,2)))
```


### Flattening the Feature Maps

Flattening converts the three-dimensional feature maps into a one-dimensional vector.  
This step is required before passing the data to fully connected layers for classification.


```python
model.add(Flatten())
```
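The length of the flattened vector can be traced by hand. The sketch below assumes the Keras defaults used in this example: no padding and a stride of 1 for the convolution layers, and 2×2 pooling with a stride of 2 that drops any leftover row or column. The helper names are hypothetical.

```python
def conv_size(size, kernel):
    """Output width/height of a convolution with stride 1 and no padding."""
    return size - kernel + 1


def pool_size(size, window):
    """Output width/height of non-overlapping pooling (leftover pixels are dropped)."""
    return size // window


size = 28                   # MNIST images are 28 x 28
size = conv_size(size, 3)   # first convolution  -> 26
size = pool_size(size, 2)   # first max pooling  -> 13
size = conv_size(size, 3)   # second convolution -> 11
size = pool_size(size, 2)   # second max pooling -> 5

flattened_length = size * size * 64  # 64 feature maps from the second block
print(size, flattened_length)        # 5 1600
```

Under these assumptions, the dense layer that follows receives a vector of 1,600 values per image.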


### Fully Connected Layer

The dense layer combines all extracted features and learns relationships between them.  
This layer performs high-level reasoning and prepares the data for final classification.


```python
model.add(Dense(128))
model.add(Activation('relu'))
```


### Output Layer

The output layer has 10 neurons, one for each digit class (0–9).  
The softmax activation converts outputs into probabilities, enabling multi-class classification.


```python
model.add(Dense(10))
model.add(Activation('softmax'))
```


### Compiling the Model

The loss function measures classification error.  
The Adam optimizer efficiently updates model parameters.  
Accuracy is used to evaluate the model’s performance.


```python
model.compile(
    # "sparse" because the MNIST labels are integers (0-9), not one-hot vectors
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
```


### Training the CNN

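The model is trained using the training dataset for multiple epochs.  
Training adjusts the filters and weights to minimize loss and improve accuracy.  
Validation data is used to monitor generalization performance. For simplicity, this example passes the test set as validation data; in practice, a separate validation split taken from the training data keeps the test set untouched until the final evaluation.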


```python
model.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=32,
    validation_data=(x_test, y_test)
)
```


```
Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 68s 35ms/step - accuracy: 0.9606 - loss: 0.1277 - val_accuracy: 0.9866 - val_loss: 0.0419
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 64s 34ms/step - accuracy: 0.9866 - loss: 0.0431 - val_accuracy: 0.9886 - val_loss: 0.0320
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 64s 34ms/step - accuracy: 0.9908 - loss: 0.0284 - val_accuracy: 0.9901 - val_loss: 0.0271
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 61s 32ms/step - accuracy: 0.9937 - loss: 0.0206 - val_accuracy: 0.9896 - val_loss: 0.0331
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 64s 34ms/step - accuracy: 0.9945 - loss: 0.0167 - val_accuracy: 0.9901 - val_loss: 0.0305
```

### Evaluating the Model

The trained CNN is evaluated on the test data, which was never used to update the model's weights.  
The test accuracy indicates how well the model generalizes to new images.


```python
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print("Test Accuracy:", test_accuracy)
```

```
313/313 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.9901 - loss: 0.0305
Test Accuracy: 0.9901000261306763
```


In this example, a Convolutional Neural Network was built to classify handwritten digits from the MNIST dataset.  
The model used convolution and pooling layers for feature extraction and fully connected layers for classification.  
This demonstrates the complete workflow of image classification using CNNs.



## Task for the Reader

1. Modify the CNN architecture by changing the number of convolution filters and observe how it affects the training and test accuracy.

2. Increase or decrease the number of training epochs and analyze the impact on model performance and overfitting.

3. Replace the max pooling layer with average pooling and compare the results obtained during testing.

4. Test the trained model on multiple individual images from the test dataset and identify cases where the model makes incorrect predictions.

5. Experiment with different optimizers such as SGD or RMSprop and compare their convergence behavior with Adam.

6. Try visualizing the training and validation accuracy curves to better understand how the model learns over time.

Completing these tasks will help deepen your practical understanding of how CNN design choices influence image classification performance.
