# Neural Network Image Classification Architecture

*The sections below describe the structure of a neural network designed for image classification.*

## Input Layer

The input layer receives the raw image as the input to the neural network. It ensures that the image data is in a format that the subsequent layers can process efficiently and effectively.

In the input layer, the image is represented as a grid of pixels, with each pixel holding numerical values ranging from 0 to 255 (for 8-bit images). These values represent color intensity: a vector of 3 numbers per pixel for 3-channel (RGB) images and a single value per pixel for grayscale images.

The input layer may receive either a 2-dimensional or a 3-dimensional tensor, depending on the nature of the input data. For a color image, it receives a 3-dimensional tensor of height, width, and 3 channel values. For a grayscale image, it only requires a 2-dimensional tensor of height and width.
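As a minimal sketch of the two tensor shapes described above, the snippet below builds one color and one grayscale image with NumPy. The 4 x 4 image size and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 4 x 4 images; uint8 stores intensities in the range 0..255.
height, width = 4, 4
rgb_image = np.zeros((height, width, 3), dtype=np.uint8)  # color: H x W x 3
gray_image = np.zeros((height, width), dtype=np.uint8)    # grayscale: H x W

# One RGB pixel is a vector of three intensities (red, green, blue).
rgb_image[0, 0] = [255, 128, 0]

print(rgb_image.ndim, rgb_image.shape)    # 3 (4, 4, 3)
print(gray_image.ndim, gray_image.shape)  # 2 (4, 4)
```

Some frameworks expect grayscale images with an explicit channel axis of size 1 or place the channel axis first, so the exact layout depends on the library in use.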

Note that before our images are fed to the neural network for training, they are preprocessed into a suitable format. This includes resizing them to a consistent size and normalizing the pixel values, and it may also include data augmentation, such as rotation, to increase the diversity of our data.
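The preprocessing steps above can be sketched with plain NumPy. The helper names (`resize_nearest`, `normalize`, `augment_rotate90`), the nearest-neighbor resizing method, the scaling to the range 0 to 1, and the 90-degree rotation are all illustrative assumptions; the text does not specify which resizing method, normalization scheme, or rotation angles are used.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Resize an H x W (x C) array using nearest-neighbor sampling."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return image[rows[:, None], cols]

def normalize(image):
    """Scale 8-bit pixel values from 0..255 to floats in 0..1."""
    return image.astype(np.float32) / 255.0

def augment_rotate90(image):
    """Rotate the image by 90 degrees in the height/width plane."""
    return np.rot90(image, k=1, axes=(0, 1))

# Hypothetical 6 x 8 RGB image with deterministic pixel values.
raw = np.arange(6 * 8 * 3, dtype=np.uint8).reshape(6, 8, 3)

resized = resize_nearest(raw, 4, 4)
normalized = normalize(resized)
rotated = augment_rotate90(normalized)

print(resized.shape, normalized.dtype, rotated.shape)
```

In practice an image library would normally handle resizing with better interpolation; the point here is only the order of operations.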

## Convolutional Layers

A **convolutional layer** performs a ***convolution operation*** to *detect specific features in the input data, producing one or more feature maps*. It uses filters, stride, padding, and an activation function to perform this operation, and it shares parameters to increase efficiency and prevent overfitting.

**Filters or kernels** are *small matrices of weights that the network learns during training*. ***Each filter is responsible for detecting a specific feature in the input data***, such as edges, corners, or more complex patterns in higher layers. The number of filters in a convolutional layer determines the number of feature maps it will produce.

The **convolution operation** *slides the filter (kernel) across the input data (image) and computes the dot product between the filter and the input at each position*. This results in a feature map that represents the presence of the specific feature detected by the filter in the input data.
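The sliding dot product described above can be sketched as follows for a single-channel image, with a step of one position and no padding. The function name `conv2d_valid` and the hand-written edge filter are assumptions for illustration; in a real network the filter weights are learned. Like most deep learning libraries, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` one position at a time, without padding."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w]
            # Dot product between the filter and the patch under it.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = np.array([[-1, 1],
                          [-1, 1]], dtype=float)

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)
# [[0. 2. 0.]
#  [0. 2. 0.]
#  [0. 2. 0.]]
```

The feature map is non-zero only in the middle column, which is where the filter straddles the edge between the dark and bright halves.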

*How our kernel is slid across the image during the convolution operation is determined by the value of the stride*. The **stride** is the ***number of positions the filter moves at each step during the convolution operation***. A stride of 1 means the filter slides one position at a time, while a stride of 2 means it jumps two positions at each step. The ***stride affects the size of the output feature map***: a *larger stride results in a smaller feature map*, while a *smaller stride results in a larger feature map*.

When a kernel would slide past the edge of the image, some of its positions would cover non-existent pixels; padding determines how such cases are handled. **Padding** involves ***adding extra pixels (typically zeros) around the border of the input data***. This is done *to control the spatial size of the output feature maps and to preserve the spatial dimensions of the input*. **'Valid' padding** means ***no padding is used (the filter does not go outside the input)***, and **'same' padding** means ***enough padding is added to keep the output size the same as the input size*** (at a stride of 1).
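The combined effect of stride and padding on the output size follows the standard relation `floor((n + 2p - k) / s) + 1`, where `n` is the input size, `k` the kernel size, `p` the padding added on each side, and `s` the stride. A small sketch, with the function name and the 32-pixel example chosen only for illustration:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Hypothetical 32-pixel-wide input with a 3-wide kernel.
print(conv_output_size(32, 3, stride=1, padding=0))  # 30: 'valid' padding shrinks the output
print(conv_output_size(32, 3, stride=2, padding=0))  # 15: a larger stride shrinks it further
print(conv_output_size(32, 3, stride=1, padding=1))  # 32: 'same' padding at stride 1
```

For an odd kernel size `k` and a stride of 1, 'same' padding corresponds to `p = (k - 1) / 2` pixels on each side.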

An **activation function**, usually a ***rectified linear unit (ReLU)***, is applied after the convolution operation to *introduce non-linearity into the model*.
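ReLU keeps positive values unchanged and replaces negative values with zero. A minimal sketch, applied element-wise to a made-up feature map:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: element-wise max(x, 0)."""
    return np.maximum(x, 0)

# Hypothetical feature map values produced by a convolution.
feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])

activated = relu(feature_map)
print(activated)
# [[0.  0.5]
#  [3.  0. ]]
```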

One of the **benefits of adding a convolutional layer to our neural network** is that ***the same filter is applied to different parts of the input***, meaning *the same weights are used multiple times*. This ***parameter sharing reduces the number of parameters in the model***, making it *more efficient* and *less prone to overfitting* compared to fully connected layers.
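To give a rough sense of how much parameter sharing saves, the sketch below counts the weights and biases of a convolutional layer and of a fully connected layer that produces the same number of output values. The layer sizes (a 32 x 32 RGB input, 16 filters of size 3 x 3, 'same' padding) and the function names are assumptions chosen for illustration.

```python
def conv_param_count(kernel_size, in_channels, num_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_size * kernel_size * in_channels * num_filters + num_filters

def dense_param_count(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features

# Hypothetical layer: 32 x 32 RGB input, 16 filters of size 3 x 3.
conv_params = conv_param_count(kernel_size=3, in_channels=3, num_filters=16)

# Fully connected layer mapping the flattened image to as many output
# values as the convolution yields with 'same' padding (32 x 32 x 16).
dense_params = dense_param_count(32 * 32 * 3, 32 * 32 * 16)

print(conv_params)   # 448
print(dense_params)  # 50348032
```

The convolutional layer reuses the same 448 parameters at every image position, whereas the fully connected layer needs a separate weight for every input-output pair.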

## Pooling Layer

**Pooling layers**, also known as ***downsampling*** layers, perform ***dimensionality reduction***, *reducing the spatial size of their input and therefore the number of values passed to later layers*. This operation is ***similar to the convolutional layer*** in that it slides a window with a chosen size and stride across the input, ***but instead of applying learned weights it outputs a summary statistic of the values within that window***. The result is a downsampled version of the input that retains important information while reducing its size.

There are several types of pooling operations, but the most common ones are ***max pooling*** and ***average pooling***. **Max pooling** outputs the maximum value within the window, while **average pooling** outputs the average value. Max pooling is more commonly used as it tends to retain the most salient features.
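Both pooling variants can be sketched with one helper that slides a window over a feature map and summarizes each window. The function name `pool2d`, its parameters, and the example values are assumptions for illustration; the window size of 2 and stride of 2 are used as defaults here.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Summarize each size x size window with its maximum or its mean."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            pooled[i, j] = reduce_fn(window)
    return pooled

# Hypothetical 4 x 4 feature map.
feature_map = np.array([[1, 3, 2, 0],
                        [5, 7, 4, 6],
                        [9, 8, 1, 2],
                        [0, 4, 3, 5]], dtype=float)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")

print(max_pooled)  # [[7. 6.] [9. 5.]]
print(avg_pooled)  # [[4. 3.] [5.25 2.75]]
```

The helper contains no weights: the output depends only on the input values, the window size, and the stride.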

Pooling layers ***help make the representation approximately invariant to small translations*** of the input. They also ***help to control overfitting*** by providing an abstracted form of the representation.

Unlike convolutional layers and fully connected layers, ***pooling layers do not have any learnable parameters***. They perform a fixed operation on the input.

The most commonly used downsampling technique is ***max pooling***, as it *reduces the spatial dimensions of the input, provides translation invariance, and highlights the dominant features of our input data*. Because max pooling, like downsampling techniques in general, discards information, it is usually ***set with a kernel size of at most 3 and a stride of 2***; the most common configuration is a kernel of size 2 with a stride of 2. **Deviating from these settings usually results in a poorly performing model**.