<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/week-9-CNN/CNN_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Networks (CNN)

## Computational motivations

## Regular neural networks cannot scale to full images!


"MLP" is an acronym for "**Multi-Layer Perceptron**," which is a type of neural network characterized by multiple layers through which data passes in a feedforward manner. In the recent lecture, we explored how an MLP, also known as a **fully connected network**, can be trained to classify images from the MNIST dataset. The MNIST dataset comprises small, **2-dimensional grayscale** images of handwritten digits, each with a resolution of **28 by 28 pixels**.

In contrast, the image dimensions we encounter in modern applications can be much larger. For instance, a **standard JPEG** image might have dimensions of **640 by 480 pixels**, with an additional dimension for color channels, typically resulting in a 3-dimensional array of **640 x 480 x 3**. Similarly, medical imaging, such as MRI scans of the brain, might produce volumetric data with dimensions reaching 300 x 300 x 200 voxels.

If one were to connect every voxel in such large images in a **fully connected network**, the number of parameters to learn would be **astronomically high**. For example, an MRI scan with dimensions of **300 x 300 x 200** contains **18 million voxels**, so a fully connected layer needs **18 million weights for every single unit** it contains. A layer with just 100 units would already require a **weight matrix** with around **1.8 billion parameters**.

This situation, where the **number of features dwarfs the number of training examples**, can lead to **overfitting**, where the model learns the training data too well, including its noise and anomalies. Consequently, such a model would likely perform poorly when generalizing to new, unseen data. This is a significant concern in machine learning, as it undermines the purpose of creating models that can predict and perform well on real-world data.
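To make the parameter counts above concrete, the sketch below counts the learnable parameters of a single fully connected layer for the three image sizes mentioned in this section. The helper name `dense_layer_params` and the layer width of 100 units are illustrative assumptions, not values taken from the lecture.

```python
from math import prod

def dense_layer_params(input_shape, n_units, bias=True):
    """Number of learnable parameters in one fully connected layer."""
    n_inputs = prod(input_shape)      # the image is flattened into one long vector
    weights = n_inputs * n_units      # one weight per (input, unit) pair
    biases = n_units if bias else 0   # one bias per unit
    return weights + biases

# The width of 100 units is an arbitrary choice for illustration.
for name, shape in [("MNIST", (28, 28)),
                    ("JPEG", (640, 480, 3)),
                    ("MRI", (300, 300, 200))]:
    print(f"{name:6s} inputs={prod(shape):>11,d}  "
          f"params={dense_layer_params(shape, 100):>14,d}")
```

With a single output unit and no bias, the MRI case needs exactly 18 million weights; every additional unit adds another 18 million.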

Moreover, using fully connected layers for image data is highly **redundant**. A fully connected layer, like linear regression, gives every input feature its own independent weight and has no notion of which inputs are neighbours. Image data, however, has inherent structure: **adjacent pixels or voxels are often correlated**.

<font color='blue'>**This spatial correlation is a crucial piece of information that fully connected networks typically ignore.**</font>

However, it is important to recognize that images are more than just collections of correlated pixels; they can be **conceptualized as hierarchies of increasingly complex patterns or textures**. For instance, the Fourier transform encoding of magnetic resonance images exemplifies this by **decomposing images into patterns of varying spatial frequencies**, from **low-level textures** to **more complex structures**. This concept is not limited to medical imaging; it also applies to natural images.

The hierarchical nature of image data suggests that a different network architecture, such as convolutional neural networks (CNNs), might be more appropriate. CNNs leverage the **correlated spatial information** and the hierarchical structure of images by using convolutional filters to capture patterns within localized regions, allowing for a significant reduction in the number of parameters and better generalization capabilities.
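As a rough illustration of the parameter reduction mentioned above, the following sketch counts the parameters of one convolutional layer. The function name `conv_layer_params`, the 3 x 3 kernel size, and the choice of 32 filters are assumptions made for this example only.

```python
def conv_layer_params(kernel_shape, in_channels, n_filters, bias=True):
    """Parameters in one convolutional layer.

    Each filter spans all input channels and is reused at every image
    position, so the image size does not appear in the count.
    """
    weights_per_filter = in_channels
    for k in kernel_shape:
        weights_per_filter *= k
    return n_filters * (weights_per_filter + (1 if bias else 0))

# 32 filters of size 3x3 applied to an RGB image of any width and height
print(conv_layer_params((3, 3), in_channels=3, n_filters=32))     # 896
# 32 filters of size 3x3x3 applied to a single-channel 3D volume
print(conv_layer_params((3, 3, 3), in_channels=1, n_filters=32))  # 896
```

The same 896 parameters apply whether the input is 28 x 28 or 640 x 480, because the filters are shared across all positions.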

## What is a bespoke learning approach?

The term "bespoke learning approach" in the context of Convolutional Neural Networks (CNNs) usually **refers to a tailored or customized training method specifically designed for a CNN to optimize its performance on a particular task or dataset.** This approach is not a standard or universally defined method but rather a concept that involves fine-tuning the learning process of a CNN to suit specific needs. Here are some aspects that might be involved:

1. **Customized Architecture**: Designing a CNN architecture that is specifically suited for the task at hand. This can include choosing the number of layers, types of layers (convolutional, pooling, fully connected, etc.), and other architectural features.

2. **Tailored Data Preprocessing**: Applying data preprocessing techniques that are particularly effective for the given dataset, like normalization, augmentation, or specific transformations that enhance the relevant features for the task.

3. **Specialized Training Procedures**: Implementing training procedures that are optimized for the specific problem, such as selecting an appropriate loss function, choosing a specific optimization algorithm, or setting a custom learning rate schedule.

4. **Focused Feature Engineering**: Incorporating domain knowledge into the feature engineering process to ensure that the CNN learns relevant and significant patterns for the specific application.

5. **Hyperparameter Tuning**: Fine-tuning the hyperparameters of the CNN, like the number of filters in each convolutional layer or the dropout rate, to optimize performance for the specific task.

6. **Task-Specific Regularization Techniques**: Using regularization techniques that are best suited for the task, such as L1/L2 regularization, dropout, or batch normalization, to prevent overfitting and improve generalization.

7. **Custom Loss Functions**: Designing loss functions that directly correspond to the specific objectives of the task, which can be crucial in applications with unique requirements.

In essence, a bespoke learning approach in the context of CNNs is about tailoring every aspect of the CNN's design, training, and operation to best suit the unique requirements of a specific task, dataset, or application. This approach recognizes that the most effective CNN configuration and training methodology may vary significantly depending on the specific characteristics and challenges of the task at hand.

---


**NOTE: An extra note about Fourier transforms in MRI images**

The Fourier transform is used as a method to break down MRI images into various components based on their **spatial frequencies**. Here's a breakdown of what this means:

1. **Fourier Transform in MRI**: The Fourier transform is a mathematical tool that converts between spatial information (like the position of structures within the body) and spatial-frequency information. An MRI scanner acquires its raw data in the spatial-frequency domain (known as k-space), and an inverse Fourier transform turns these measurements into the images we see in an MRI scan.

2. **Decomposing Images**: Decomposition here refers to the process of breaking down the complex image data into simpler, more understandable components. In the case of MRI, this means **separating the image data into patterns according to how rapidly their intensity varies across space.**

3. **Patterns of Varying Spatial Frequencies**: Spatial frequency in an image refers to **how often the intensity changes over space.**

<font color='blue'>**High spatial frequencies correspond to rapid changes in intensity and are often seen in the fine details or edges in an image, while low spatial frequencies correspond to slow changes in intensity and are associated with general shapes or areas of uniform color or intensity.**</font>

4. **From Low-Level Textures to More Complex Structures**: This phrase indicates a range of spatial frequencies represented in the image, from simple, smooth variations (low-level textures) to intricate, detailed patterns (more complex structures). In MRI, this allows for the detailed visualization of both the overall anatomy (like the shape of an organ) and finer details (like the texture of tissues).

So, in summary, the Fourier transform in MRI exemplifies the concept of image decomposition by breaking down the images into elements of varying spatial frequencies, allowing for the detailed analysis of everything from simple textures to complex anatomical structures.
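The decomposition into low and high spatial frequencies can be sketched on a single row of pixels. This is a minimal illustration under stated assumptions: a 1D signal stands in for a full image, the naive `dft`/`idft` helpers and the cutoff of 2 are choices made for this example, and in practice one would use a fast Fourier transform routine (for example NumPy's `np.fft.fft2` for 2D images).

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (O(n^2)); fine for a short demo."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; keeps the real part because the signals here are real."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

# One row of pixels: a slow brightness ramp with a sharp edge in the middle.
row = [0.1 * t for t in range(8)] + [2.0 + 0.1 * t for t in range(8)]

n = len(row)
spectrum = dft(row)
cutoff = 2  # arbitrary boundary between "low" and "high" frequencies

# Index k and index n - k describe the same spatial frequency.
low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
high = [c if min(k, n - k) > cutoff else 0 for k, c in enumerate(spectrum)]

low_part = idft(low)    # smooth overall shape
high_part = idft(high)  # fine detail, concentrated around the edge

# The two parts add back up to the original row.
reconstructed = [l + h for l, h in zip(low_part, high_part)]
print(max(abs(a - b) for a, b in zip(reconstructed, row)))
```

Because the transform is linear, no information is lost by the split: the low-frequency part carries the general shape and the high-frequency part carries the rapid changes.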

## Taking inspiration from human vision

The insightful observation that the mammalian brain efficiently processes visual information has inspired the structure and function of modern image recognition systems. When light reaches our eyes, the captured signals are transmitted from the retina to the primary visual cortex, known as V1, located at the back of the brain. V1 serves as a topographical map for visual stimuli: points that are close together in the visual field are processed by adjacent neurons in V1. The neurons here are adept at detecting edges and other high-frequency spatial features, acting as specialized edge detectors.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig1.png" width = "500" >

As visual information progresses through the brain's visual pathway, it is relayed to subsequent cortical areas, such as V2 and V4. Each of these regions extracts and processes more complex patterns, building on the simple features identified by V1.

**This hierarchical processing is essential**, as it allows the brain to handle complex visual tasks, such as object recognition and motion detection. For instance, object recognition involves higher-order visual areas that evolve to represent increasingly complex features until the signals reach the inferior temporal cortex. In this region, individual neurons can be highly selective, responding vigorously to specific objects, like faces.

Convolutional Neural Networks (CNNs) are crafted to emulate this hierarchical processing structure. In a CNN, multiple layers of convolutional filters are applied to input images, where initial layers may resemble the function of V1 by detecting simple edges and textures. As the data passes through successive layers, the network identifies more intricate patterns, eventually recognizing whole objects within the images.

This layered approach offers a substantial advantage: CNNs do not rely on spatial normalization or image registration. There is no need to align images to a standard form or to assume that corresponding pixels across different images represent identical content. This flexibility is crucial when dealing with varied image presentations where direct pixel-to-pixel comparison is not feasible. Convolutional filters respond to a pattern wherever it appears in the image, so shifts in position are handled naturally; robustness to other transformations, such as scaling, rotation, or skewing, is typically learned from varied training examples rather than guaranteed by the architecture.

This characteristic of CNNs is particularly beneficial when compared to traditional machine learning (ML) techniques, which often treat input examples as **flat feature vectors**, comparing each feature based solely on its position within the vector. This method can suffice for simple, consistent datasets like MNIST, where the images are relatively **uniform in size and position**. However, it becomes impractical for more complex or varied datasets, such as those involving natural scenes or biological structures, **where the spatial relationships and features cannot be neatly mapped onto a fixed vector space** without losing critical structural information.

Thus, CNNs represent a significant advancement over previous hand-engineered feature detection methods. They allow for the learning of features in a way that respects the intrinsic variability and complexity of the visual world, much like our own biological visual processing systems.
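The limitation of comparing flat feature vectors position by position, described above, shows up even in a toy example. In this sketch (the 6 x 6 image, the helper names, and the use of squared Euclidean distance are all assumptions for illustration), a bar shifted by two pixels ends up farther from the original than a completely blank image does.

```python
def flatten(image):
    """Turn a 2D image (list of rows) into a flat feature vector."""
    return [pixel for row in image for pixel in row]

def squared_distance(a, b):
    """Position-by-position comparison of two flat feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def shift_right(image, n):
    """Shift every row n pixels to the right, padding with zeros on the left."""
    return [[0] * n + row[:len(row) - n] for row in image]

# A 6x6 image containing a vertical bar in column 1
bar = [[1 if c == 1 else 0 for c in range(6)] for _ in range(6)]
shifted = shift_right(bar, 2)            # the same bar, now in column 3
blank = [[0] * 6 for _ in range(6)]      # no bar at all

print(squared_distance(flatten(bar), flatten(shifted)))  # 12
print(squared_distance(flatten(bar), flatten(blank)))    # 6
```

A method that only compares features by position would judge the blank image to be the closer match, even though the shifted image contains exactly the same structure.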



---


**More simple explanation:** Think of it like this: When you look at a picture, your brain doesn't just see a bunch of pixels; **it sees shapes, edges, and patterns**. This happens because the visual information from your eyes goes on a bit of a journey inside your brain, starting at a place called V1. This is where your brain starts to **make sense of all** the lines and edges in what you're looking at.

From there, the info hops from one brain spot to another, with each stop getting better at figuring out what you're seeing — from simple patterns all the way up to complex stuff like recognizing your best friend's face in a crowd.

**Now, imagine trying to teach a computer to do that.** That's where Convolutional Neural Networks (CNNs) come in. They're like a computer version of your brain's visual journey. In a CNN, the first layers are like your brain's V1 area — **good at spotting edges and basic patterns.** As you go **deeper** into the network, it **gets better** at seeing more complicated things, like shapes and eventually whole objects, **without getting confused if the picture is tilted or a little blurry.**

This is super cool because, unlike older computer vision methods that needed everything lined up perfectly, CNNs can deal with pictures being all sorts of different sizes and angles — just like we do when we see something new.

So, for simple stuff like the MNIST dataset where you have **digits neatly centered and looking pretty similar, old-school methods were fine.** But throw in a photo from your last vacation or a medical image, and things get trickier. **This is where CNNs really shine, handling all the messy, real-world variability** like champs, much like our own nifty brains do.

## Taking inspiration from human vision: Feature detection


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig7.png" width = "700" >

One of the standout benefits of Convolutional Neural Networks (CNNs) compared to older **feature detection techniques** is their ability to **autonomously learn features that are perfectly tailored to the specific task they are designed to solve**. This <font color='blue'>**bespoke learning approach**</font> is far more efficient than the **one-size-fits-all** strategy of **hand-engineered feature detectors**.

To illustrate, let's consider what happens when a CNN is trained on a dataset of **natural images**. During training, the network learns various features at different levels, which we can observe in the layers of the network. Each layer captures different **aspects** of the images: **starting from basic elements to more complex ones**.

In the beginning layers, like 'layer 1', the network learns **fundamental features**. Think of these as the visual alphabet—the **edges, corners, and textures**.

The <font color='blue'>**"patches" or "filter kernels"**</font> learned at this stage are activated by **simple patterns in the images**. For example, a set of nine patches in 'layer 1' might all be triggered by the same edge or texture feature. As we move to **subsequent features** within this layer, different sets of patches activate for **different basic visual components.**

As we progress to deeper layers, **the relationship between the learned filters and the image patches becomes more direct**. Each filter in 'layer 2' might respond to slightly more complex patterns that make up part of an object, like the contour of a petal or the curve of a shell.

By the time we get to 'layer 5', the network has advanced to recognizing and responding to entire objects or significant portions of them. **The features that activate in this layer are much more sophisticated**. They could be picking up on the whole shape of a face, the form of an animal, or the circular pattern of a wheel.

<font color='blue'>**At this stage, the network has moved beyond the basic 'visual alphabet' and is now 'reading' and understanding complete 'visual words' or even 'visual sentences', so to speak.**</font>

In summary, as a CNN learns from a dataset, it builds a **layered understanding of the visual world,** starting from the simplest elements to the most complex structures. This hierarchical learning process allows CNNs to adapt to a wide variety of visual tasks, making them extremely versatile and powerful tools in image recognition.


---


Think of CNNs like a team of artists learning to paint different scenes. Instead of using a one-size-fits-all set of brushes, they create their own unique brushes tailored to the specific scene they're painting. **This is how CNNs get a leg up over older methods that used a standard toolkit no matter what the picture was.**

Let's picture a CNN in action, like a group of artists learning to paint by looking at lots of different pictures of nature. At the very start, the 'layer 1' artists are focused on the basics: they're figuring out how to draw simple lines and edges, the kind of details you'd find in leaves or the ripples of water. The patches you see labeled 'layer 1' are like their early sketches, and the images that **get these newbie artists really excited (or 'activate strongly')** are ones with these simple patterns.

As our artistic CNN progresses to 'layer 2', the complexity ramps up. Now, they're not just drawing lines; **they're putting those lines together to make textures and basic shapes**—think of the patterns on a butterfly's wing or the roughness of tree bark.

Fast forward to 'layer 5', and our artists are now painting whole scenes—capturing the **essence of a face or the dynamic shape of a spinning wheel**. It's no longer about lines or textures; it's about bringing together all these elements to recognize complex objects in their entirety.

So, when you look at the examples provided for each layer, you'll notice that the early layers get excited about the simple stuff, while the later layers are all about the big picture. Just as you can clearly see the strokes in a painting that make up a tree or the sky, you can see how the features learned by the CNN reflect the intricate parts of the images they've been studying.

In essence, by the time you reach the higher layers of the network, these CNNs aren't just recognizing patterns; they're identifying whole items and intricate parts of the scene—just like how we might recognize faces or objects in our everyday lives. This is the magic of CNNs; they start from scratch and build up an understanding of the visual world that's perfectly suited to the task at hand, just like a painter mastering their craft.




---


Drawing upon the intricate workings of the human eye and brain, Convolutional Neural Networks (CNNs) are crafted to echo the remarkable capabilities of our visual system. These sophisticated networks are structured to learn and identify patterns in a way that mirrors how we process visual information.

At the outset, a CNN starts simple, learning spatial filters that can detect basic elements like edges and lines—much like the early stages of visual processing in the human brain. As the network delves deeper, these filters evolve, becoming more complex and capable of recognizing textures and patterns. This gradual progression from detecting simple edges to discerning intricate textures is akin to the visual hierarchy present in our own cognitive processing.

As the layers advance, these networks become adept at identifying specific features of objects—transitioning from mere edge filters into comprehensive object detectors. This is reflective of the higher-order visual processing that occurs within the human cortex, where complex visual stimuli are interpreted and understood.

One of the most revolutionary aspects of CNNs is their ability to learn directly from the data, which eliminates the need for explicit prior modelling or spatial normalization of the signal. This quality is particularly advantageous because, in human vision, we do not consciously model or standardize visual inputs; our brains naturally adjust and recognize objects regardless of variations in size, position, or orientation.

By embracing this approach, CNNs can process and understand visual inputs in their raw form, accommodating a wide range of variations and inconsistencies in the images. This flexibility allows CNNs to perform robustly in real-world applications where the conditions are rarely controlled or uniform, much like our own visual experiences. It's this adaptability, inspired by the fluidity of human sight, that makes CNNs a groundbreaking tool in the field of computer vision.

## Taking inspiration from human vision: Object classification & object localization

Much like our own visual system's versatility in handling a variety of **visual tasks**, Convolutional Neural Networks (CNNs) are trained to master a wide array of **visual recognition** challenges. For instance, they can be taught to **classify objects** by identifying and labeling what they represent in an image—a process known as <font color='blue'>**object classification**</font>. Here, a CNN might look at an image and discern that it's a car.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/cars.png" width = "700" >

But their skills don't end there. CNNs can also pinpoint the exact location of an object within an image, a task referred to as <font color='blue'>**object localization**</font>. It's like when you're looking for your car in a parking lot; a CNN can scan an image and tell you right where the car is sitting on the grid. Furthermore, CNNs can go even deeper with **semantic segmentation**, which is like coloring in a picture by labels. In this case, the CNN would color in all the pixels that make up the car, effectively segmenting it from the rest of the image.

However, it's important to note that while our human visual system is a generalist—capable of shifting effortlessly between different visual tasks without needing to relearn how to see—CNNs often require specialized designs for each specific task. **The architecture that excels at classifying images may not be the best for localizing objects or segmenting them, necessitating different CNN configurations for each task.**

Despite this, the field is actively evolving, and researchers are continually drawing inspiration from the mammalian brain to create more flexible and general-purpose CNNs. The goal is to achieve a level of generalization akin to human vision, where a single system can adapt and excel across a variety of tasks with ease. The journey of improving CNNs is an ongoing testament to the profound impact that studying natural intelligence systems has on the advancement of artificial ones.



---

## What is Semantic Segmentation?


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/semantic_segmentation1.jpeg" width = "500" >

[A 2021 guide to Semantic Segmentation](https://nanonets.com/blog/semantic-image-segmentation-2020/)

Semantic segmentation is a computer vision task, today usually tackled with deep learning, that associates a label or category with every pixel in an image. It is used to recognize the collections of pixels that form distinct categories. For example, an autonomous vehicle needs to identify vehicles, pedestrians, traffic signs, pavement, and other road features. Here's a detailed look at what semantic segmentation entails:

1. **Image Division into Segments**: In semantic segmentation, an image is partitioned into segments, each made up of pixels that share a label. Unlike object detection, which marks objects with bounding boxes, semantic segmentation classifies each pixel in the image.

2. **Classification of Each Segment**: Each segment (or pixel) is classified into a category. For example, in an image of a street scene, pixels might be classified into categories such as 'road', 'car', 'pedestrian', 'building', 'sky', etc.

3. **Contextual Understanding**: The process provides a comprehensive understanding of the image, not just identifying objects but also their boundaries and relations to each other within the scene.

4. **Applications**: Semantic segmentation is widely used in various applications including autonomous vehicles (for understanding road scenes), medical imaging (for identifying different tissues or anomalies), satellite image analysis (land use, urban planning), and more.

5. **Deep Learning Techniques**: Modern semantic segmentation often utilizes deep learning techniques, especially Convolutional Neural Networks (CNNs), which are effective in handling the complexities of image data and learning spatial hierarchies of features.

6. **Pixel-Level Classification**: Unlike other forms of image classification that categorize an entire image or object detection that identifies objects within bounding boxes, semantic segmentation goes down to the pixel level for classification, resulting in very detailed and precise image analysis.

In summary, semantic segmentation is a crucial technique in computer vision for detailed and context-aware analysis of images, classifying each pixel into meaningful categories, and enabling a wide range of applications in various fields.
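The pixel-level classification described above can be sketched as follows. This assumes a model has already produced one score per class for every pixel; the label set, the score values, and the function name `segment` are made up for illustration.

```python
CLASSES = ["road", "car", "sky"]  # hypothetical label set

def segment(scores):
    """Pick the highest-scoring class for every pixel.

    scores[c][y][x] is the score of class c at pixel (y, x).
    Returns a label map with the same height and width as the image.
    """
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(n_classes), key=lambda c: scores[c][y][x])
             for x in range(width)]
            for y in range(height)]

# Made-up per-class scores for a tiny 2x3 image
scores = [
    [[0.1, 0.2, 0.1], [0.8, 0.3, 0.7]],  # road
    [[0.1, 0.1, 0.2], [0.1, 0.6, 0.2]],  # car
    [[0.8, 0.7, 0.7], [0.1, 0.1, 0.1]],  # sky
]

label_map = segment(scores)
for row in label_map:
    print([CLASSES[c] for c in row])
```

Whole-image classification would reduce these scores to one label for the entire picture; segmentation keeps one label per pixel.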

**Other computational vision tasks**

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/semantic_segmentation2.png" width = "500" >



## Why Convolutions?: A Summary

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig2.png" width = "700" >

CNNs are engineered to imitate the human visual processing system. **They learn representations through multiple convolutional layers, where initial layers often function as edge detectors, and deeper layers progressively identify more intricate textures or entire objects.** This approach is particularly beneficial for image recognition, object localization, and segmentation tasks, as it enables the comparison of images without needing spatial normalization or image registration.

**This means that CNNs don't rely on the assumption that pixels in the same relative positions across different images represent identical content.** Such an assumption can be challenging to fulfill, especially in complex natural scenes. For instance, consider the varied architectures of buildings under the same category label - achieving one-to-one pixel correspondence in such cases would be exceedingly difficult.

Similarly, in medical imaging, individual variations in structures like brains can't be fully accounted for through **deformable models alone.** CNNs, with their ability to learn and identify features at various levels of complexity, provide a more adaptable and effective way to analyze and interpret such diverse and variable image data.



## The convolution operation

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/sobel.png" width = "600" >

The convolution operation is a fundamental process in image processing, particularly in the detection of edges within images. This is where Sobel edge detectors come into play, which are designed to perform this task through two specific filters: **one that identifies horizontal edges and another for vertical edges.**

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig8.png" width = "600" >

These Sobel filters work by **approximating the gradient of the image**. The gradient here refers to the rate of change in brightness across the image. In places where there's a sharp change in intensity, like the edge of an object, the gradient is large. The horizontal filter, with **its unique arrangement of positive and negative values**, highlights areas of the image where there's a significant horizontal change in intensity. Similarly, the vertical filter is tuned to **capture sharp changes in the vertical direction.**

When these filters are 'convolved' over the image—meaning they are passed over every part of the image and applied to each pixel—**they compute the finite-difference approximation of the gradient**. This convolution process results in a new image where the intensity of each pixel corresponds to the strength of the gradient at that point. So, in the output image, the edges—places where the original image changes rapidly from dark to light or vice versa—stand out as areas of high intensity. This technique is especially useful in many computer vision tasks because edges are critical for understanding the structure and shape of objects within images.
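To see how the arrangement of positive and negative values produces a gradient estimate, the sketch below applies the two 3 x 3 Sobel kernels to individual image patches. The patch values are made up, and the sign of each response depends on the orientation convention chosen for the kernels, which is an assumption here.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]    # responds to left-to-right changes in brightness
SOBEL_Y = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]  # responds to top-to-bottom changes in brightness

def response(kernel, patch):
    """Sum of element-wise products between a 3x3 kernel and a 3x3 patch."""
    return sum(k * p
               for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))

flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]       # uniform brightness
vertical_edge = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]    # dark left, bright right
horizontal_edge = [[0, 0, 0], [0, 0, 0], [9, 9, 9]]  # dark top, bright bottom

print(response(SOBEL_X, flat_patch), response(SOBEL_Y, flat_patch))            # 0 0
print(response(SOBEL_X, vertical_edge), response(SOBEL_Y, vertical_edge))      # 36 0
print(response(SOBEL_X, horizontal_edge), response(SOBEL_Y, horizontal_edge))  # 0 36
```

Each kernel subtracts the brightness on one side of the centre from the brightness on the other side, which is a weighted finite difference; a uniform patch therefore gives zero.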

## What are Sobel filters?

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/sobel_filtering.jpg" width = "400" >

**Sobel filters are a type of filter used in image processing, particularly for edge detection**, which can also be applied in the context of CNNs. Here's an overview:

1. **Purpose**: Sobel filters are used to detect edges in images. They work by emphasizing regions of high spatial frequency (i.e., rapid changes in intensity) which correspond to edges.

2. **How They Work**: The Sobel filter operates by convolving a small matrix (or kernel) with the original image. This process **calculates the gradient of image intensity at each pixel**, which highlights the edges.

3. **Two Dimensions**: Typically, there are two Sobel filters – one for detecting **horizontal edges** (changes in the **vertical direction**) and another for detecting vertical edges (changes in the horizontal direction). **They are often used together to find the overall gradient magnitude at each point in the image.**

4. **Application in CNNs**: In the context of CNNs, while the network learns its own filters during training, Sobel filters (or the concept behind them) **can be informative, especially in the initial layers of the network where edge detection is often a key feature.**

5. **Pre-processing Tool**: Apart from being an inspiration for early CNN layers, Sobel filters can also be used as a pre-processing step to enhance edges in images before feeding them into a CNN, which may help the network learn edge-related features more easily.

In summary, Sobel filters are used for edge detection in images and can be relevant in the design and functioning of early layers in CNNs, either as a direct method or as a conceptual guide for what the network should learn to do in its initial stages.
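Combining the two Sobel responses into an overall gradient magnitude, as mentioned in point 3, might look like the sketch below. The 5 x 5 test image and the helper names `filter2d` and `gradient_magnitude` are assumptions for illustration; no padding is used, so the output is smaller than the input.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical change

def filter2d(image, kernel):
    """Slide a 3x3 kernel over the image (no padding, no kernel flip)."""
    h, w = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(3) for j in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]

def gradient_magnitude(image):
    """Combine both directional responses: sqrt(gx^2 + gy^2) per position."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

# 5x5 image with a bright square in the lower-right corner
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]

magnitude = gradient_magnitude(image)
for row in magnitude:
    print([round(v, 2) for v in row])
```

The magnitude is zero inside the uniform square and largest along its boundary, regardless of whether the boundary runs horizontally or vertically.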

## Convolved meaning

The word "convolved" is derived from the mathematical operation known as convolution. **To convolve means to roll or fold together**; it describes the process of combining two sets of information.

**In the context of mathematics and signal processing, convolution is a formal operation that expresses how the shape of one function is modified by another function.**

When one function is convolved with another, the convolution reflects how the shape of one is "smeared" by the other.

**In the context of image processing, for example, convolving an image with a filter means taking each point of the image and combining it with surrounding points based on the pattern defined by the filter.**

This operation is fundamental in many applications, such as edge detection, blurring, and sharpening in images, as well as in signal processing and time series analysis.



---

The term "convolved" refers to the process of applying a convolution operation, which is a fundamental mathematical operation in the field of image processing and analysis. When an image is convolved with a filter (also known as a kernel), the filter is slid over the image, usually from the top-left corner to the bottom-right corner, and at each position, a mathematical operation is performed.

Here's a step-by-step breakdown of what happens during convolution:

1. **Overlay**: The filter is placed on top of the image such that it covers a portion of the image the same size as the filter.

2. **Element-wise Multiplication**: Each element of the filter is multiplied by the corresponding element of the image it covers.

3. **Summation**: The results of the multiplications are then summed up to get a single number.

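4. **Store**: This single number is written to a new output image at the location corresponding to the center of the filter. The original pixel values are left unchanged so that later filter positions still see the original data.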

5. **Slide**: The filter is then moved (or slid) across to the next position on the image, and the process is repeated.

This operation essentially mixes the filter's values with the image's values, allowing features like edges, textures, or patterns to be accentuated depending on the type of filter used. For edge detection, as in the case of the Sobel filters, the convolution process calculates how much the intensity changes in a local area of the image, which corresponds to the presence of edges.
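The five steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function name `slide_filter`, the averaging (box blur) kernel, and the choice to use no padding are assumptions made to keep the example short.

```python
def slide_filter(image, kernel):
    """Apply a square kernel to an image by sliding it over every position.

    No padding is used, so the output is smaller than the input.
    """
    k = len(kernel)                        # kernel is assumed to be k x k
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):                 # step 5: slide top to bottom ...
        for x in range(out_w):             # ... and left to right
            total = 0
            for i in range(k):             # step 1: overlay the kernel at (y, x)
                for j in range(k):
                    # steps 2 and 3: multiply element-wise and accumulate the sum
                    total += kernel[i][j] * image[y + i][x + j]
            output[y][x] = total           # step 4: store the result in the output
    return output

box_blur = [[1 / 9] * 3 for _ in range(3)]  # averaging filter
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # copies the centre pixel

image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]

blurred = slide_filter(image, box_blur)
print(blurred)
print(slide_filter(image, identity))
```

Swapping in a different kernel changes which feature is accentuated while the sliding procedure itself stays the same.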

# Prior to CNNs … Hand Engineered Features

The convolutional operation


## Convolution in Mathematics

<font color='blue'>**Convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. It is computed by flipping and shifting one function, multiplying it with the other, and accumulating the products over the whole domain.**</font>

Here's a more technical explanation:

In the context of mathematics, especially in signal processing, convolution is **a function derived from two given functions by integration that expresses how the shape of one is modified by the other.** The mathematical expression for the continuous convolution of two functions $ f $ and $ g $ is written as:

$$
(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau) g(t - \tau) d\tau
$$

Here, one function is reversed and shifted, and then integrated across the domain of the other function. In discrete systems, such as image processing with digital computers, the integral is replaced by a sum:

$$
(f * g)[n] = \sum_{m=-\infty}^{+\infty} f[m] g[n - m]
$$

This operation slides the $ g $ function over $ f $, multiplying and accumulating the overlap values at each position.

In the context of CNNs, this mathematical concept is used to apply filters to an input (like an image) to create feature maps, effectively transforming the input data to highlight certain features.
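To make the discrete sum concrete, here is a small sketch that evaluates it term by term and compares the result with NumPy's built-in `np.convolve`. The helper name `discrete_convolution` and the two short sequences are made up for illustration, and both sequences are assumed to be zero outside the listed values.

```python
import numpy as np

def discrete_convolution(f, g):
    """Full discrete convolution: (f * g)[n] = sum_m f[m] * g[n - m]."""
    n_out = len(f) + len(g) - 1
    result = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(f)):
            if 0 <= n - m < len(g):          # g is treated as zero outside its support
                result[n] += f[m] * g[n - m]
    return result

# Hypothetical example sequences.
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

manual = discrete_convolution(f, g)
library = np.convolve(f, g)   # default mode "full" returns every overlap position
print(manual)
print(library)
```

The index `n - m` is what reverses $ g $ before it is slid across $ f $.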



---

The concept of convolution in the context of Convolutional Neural Networks (CNNs) is directly adapted from the mathematical operation of convolution.

In CNNs, the convolution operation involves the following steps:

1. **Filters**: <font color='blue'>**CNNs use filters (also called kernels), which are small matrices of learnable weights.**</font> These filters are analogous to the function $ g $ in the convolution formula.

2. **Input Data**: The input data (like an image) to which the filters are applied can be thought of as the function $ f $ in the convolution formula.

3. **Convolution Operation**: As in the mathematical definition, the filter is applied across the input data. For each position of the filter on the input, the element-wise multiplication is performed between the filter and the part of the input it covers, and then the results are summed up to get a single value. This is similar to the discrete convolution formula:

$$
(f * g)[n] = \sum_{m} f[m] g[n - m]
$$

In this context, $ n $ would be the current position of the filter on the input, $ m $ would run over the input elements that the filter covers at that position, and $ f[m] $ and $ g[n - m] $ would be the corresponding elements of the input data and the filter, respectively.

4. **Feature Maps**: The result of this convolution operation across the entire input creates a feature map, which highlights features from the input that the filter is designed to detect, such as edges or textures.

5. **Learning Process**: In a trained CNN, the values of the filter weights are learned through backpropagation. As the network is exposed to more data, it adjusts these weights to minimize the difference between the predicted output and the actual output.

6. **Stacking Layers**: CNNs typically have multiple layers of convolutions, with each layer designed to detect increasingly complex features. The first layer might detect simple edges, while deeper layers might detect more complex patterns by convolving over the feature maps produced by previous layers.

In summary, CNNs use the convolution operation to systematically apply filters to input data, creating feature maps that represent the presence of specific features within the data. This is how CNNs are able to learn from visual data and perform tasks such as image and video recognition.

## Taking inspiration from human vision

1. **Designed to Mimic the Human Visual System**:
   - Explanation: Our visual system processes images in a stepwise manner, from simple to complex. Signals from the eyes are sent to the **primary visual cortex (V1), which acts like a series of edge detectors.**
   - Example: When you look at a tree, your eyes first detect the **edges and contours** before recognizing it as a tree.
   - Relation to Outline: CNNs are structured similarly, **with initial layers acting as edge detectors, much like the V1 region in the brain.**

2. **Learns Spatial Filters of Increasing Complexity**:
   - Explanation: After the initial edge detection in V1, **the visual signals are passed to subsequent visual regions (V2, V3, etc.)** that detect more complex patterns and textures.
   - Example: After recognizing edges, your brain begins to notice the **bark's texture**, the leaves' shapes, and how they overlap and form the tree's canopy.
   - Relation to Outline: In CNNs, after the first layers learn to detect edges, **the following layers learn filters that can detect more complex features like textures and object parts.**

3. **From Edge Filters to Object Detectors**:
   - Explanation: As the visual signals move through the higher visual regions in the brain, they become increasingly abstract, and certain cells in the inferior temporal cortex may respond specifically to complex objects, like faces.
   - Example: Your brain not only sees a tree but can also differentiate between types of trees, or recognize a face in a crowd.
   - Relation to Outline: Similarly, deeper layers in CNNs learn to detect complex objects as a whole, moving from simple edge detection to comprehensive object detection.

4. **Removes Requirement for Spatial Normalisation of the Signal**:
   - Explanation: **Traditional machine learning models require images to be normalized or registered so that corresponding pixels are compared.** However, this is not viable for images with complex variations.
   - Example: Imagine trying to compare two pictures of the same breed of dog, **but one is closer to the camera than the other. Traditional methods would struggle unless the images are normalized to align them perfectly.**
   - Relation to Outline: CNNs do not require such normalization because **they learn to recognize features regardless of their position in the image, similar to how our visual system can recognize objects regardless of their location in our field of view.**

In summary, CNNs are designed to process visual information in a way that closely resembles the hierarchical, multi-stage processing of the human visual system, starting from basic edge detection and culminating in the recognition of complex objects. This design allows CNNs to handle the variability and complexity found in real-world visual scenes without the rigid spatial normalization required by traditional image processing and machine learning techniques.

## Adaptive Feature Hierarchies in Convolutional Neural Networks

A key strength of Convolutional Neural Networks (CNNs) is their adaptive capability of **learning features from data,** as opposed to using predefined or "hand-engineered" feature detectors.

Here's an elaboration on that concept:

- **Bespoke Feature Learning**: Unlike traditional feature detection methods, where features have to be manually crafted and chosen by a human expert, **CNNs automatically learn the most relevant features for the task at hand directly from the data.** This process is "bespoke" in the sense that it's **custom-tailored**: the features that a CNN learns for one dataset may be very different from those it learns for another.

- **Layer-wise Feature Hierarchy**: The architecture of CNNs is designed to reflect a **hierarchy of feature complexity**:
   - **Layer 1**: The first layer typically learns **basic visual features, such as edges, corners, or colors**. For instance, if we visualize the filters from the first layer, we might see that **one filter activates strongly for vertical edges, while another might activate for green patches.**
   - **Layer 2**: The second layer combines these basic features to detect more complex structures, like simple textures or patterns. The **features at this level are more abstract than in layer 1**, and **each filter has a specific pattern it looks for in the image.**
   - **Layer 5**: By the fifth layer, the network has **combined lower-level features** to recognize high-level concepts, such as parts of objects or entire objects like faces or wheels. Here, the feature detectors are highly specialized, and each one may respond to very complex visual patterns.

- **Visualization and Interpretation**: When training a CNN, we can visualize what each layer is learning by examining the filters (also known as kernels) and their corresponding activations. For example, we can extract image patches that maximally activate a particular filter to understand what kind of features that filter is representing. This helps in interpreting how the CNN is processing and understanding the input data.

- **One-to-One Correspondence**: In deeper layers, there tends to be a **one-to-one correspondence between a filter and the complex feature it detects**. This means each filter is responsible for **identifying one specific**, complex pattern within the input data.

- **Complexity and Recognition**: As you go deeper into the CNN, the network **abstracts** more and the features it responds to become representations of whole objects or significant parts of objects. For example, a filter in a higher layer might specifically activate when it sees a human face or the circular shape of a wheel.

In essence, CNNs have the significant advantage of **adapting their feature detection to the specific characteristics of the data they are trained on, creating a custom-tailored set of feature detectors that can range from simple to complex**. This adaptability is a key factor in their success in various image recognition tasks.

## Flexibility and specialization of Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are flexible enough to be specialized for a variety of visual tasks, which invites both parallels and contrasts with the human visual system.

**Elaboration on CNN Tasks:**

1. **Object Classification**: CNNs can be trained to identify and label objects within an image, such as recognizing a vehicle and labeling it as a "car". This involves assigning a category to the entire image or to specific objects detected in the image.

2. **Object Localization**: Beyond classification, **CNNs can also determine the location of an object within an image. For instance, after recognizing a car, the network can also pinpoint its position by placing a bounding box around it.** This is useful in applications where the position of objects is crucial, like autonomous driving systems.

3. **Semantic Segmentation**: This is a more granular task where CNNs label each pixel of an image that belongs to a particular object. In the case of the car, semantic segmentation would involve labeling all pixels that make up the car's image, effectively segmenting it from the rest of the picture. **This is particularly useful in medical imaging to delineate the boundaries of organs or in autonomous vehicles to understand the environment at the pixel level.**

**Contrast with Human Visual System:**

- **The human visual system is highly generalizable, meaning it can perform a wide range of visual tasks using the same underlying mechanisms.** We can recognize objects, judge their location and movement, and understand complex scenes without needing to switch between different modes of processing.

- **CNNs, however, typically require different architectures or training processes to excel at different tasks.** For example, a network trained for object classification may not perform well on object localization or semantic segmentation without significant modifications.

**Current Research and Inspiration from Biology:**

- **Despite their impressive capabilities, CNNs still lack the generalization power inherent to the human visual system.** There's ongoing research to design CNNs that can perform multiple tasks or transfer learning from one task to another without needing completely separate models.

- Much of this research takes inspiration from biological neural networks, particularly the mammalian brain. By understanding how the brain processes visual information so efficiently and flexibly, researchers hope to replicate this adaptability in CNNs. For instance, studies into neural plasticity and how the brain repurposes neurons for different visual functions may inform new artificial network architectures that are capable of similar flexibility.

## Understanding the Convolutional Operation: From Sobel Edge Detection to Feature Analysis

The convolutional operation is a cornerstone of image processing and is particularly effective for feature detection, such as identifying edges within an image. Let's delve into the specific example of hand-engineered features using a Sobel edge detector to understand this better:

- **Sobel Edge Detector**: This is a popular algorithmic approach used to detect edges in images. **The Sobel edge detector uses two distinct convolutional filters—one for detecting horizontal edges and another for vertical edges.**

- **Horizontal and Vertical Filters**: The filter designed to detect horizontal edges has weights arranged to highlight changes in intensity from top to bottom (that is, across a horizontal edge), whereas the filter for vertical edges emphasizes changes from left to right. These filters are typically small matrices (e.g., 3x3) with specific values that are designed to respond strongly to edges in their respective orientations.

- **Finite Difference Approximation**: The principle behind the Sobel operator is to **approximate the gradient of the image intensity**. In mathematical terms, the gradient measures how much the intensity changes in space, which is indicative of an edge. Since digital images are discrete, the Sobel operator uses a finite difference method to estimate these gradients.

- **Convolution with Filters**: The actual process involves convolving the image with each of the Sobel filters. **Convolution, in this context, means sliding the filter over the image, multiplying the overlapping values, and summing them to produce a new pixel value for the output image.** This operation is performed for every pixel position in the image, effectively **scanning the whole image with the filters.**

- **Edge Intensity and Gradients**: The result of this convolution is two new images that represent the **gradient of the original image in the horizontal and vertical directions.** Wherever there is a sharp change in intensity in the original image, the corresponding pixel in the gradient image will have a high value. Therefore, the edges can be seen as areas with high intensity in the gradient images, indicating places where the gradient, and thus the change in image intensity, is large.

- **Resulting Edge Map**: By combining the horizontal and vertical gradient images, an overall edge map can be produced. This edge map highlights the locations of edges in the original image, showing where the intensity sharply changes, and thus, where the boundaries of objects are likely to be.

In summary, the Sobel edge detector is a concrete example of using convolutional operations with hand-engineered filters to extract meaningful features—like edges—from an image. This process is a precursor to the more complex, learned convolutions that take place within a CNN, where filters are not designed by hand but are instead learned from data to capture a vast array of features, not just edges.
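As a rough illustration of the points above, the sketch below applies both Sobel filters to a synthetic image containing a single vertical edge. The image, the helper name `apply_filter`, and the sign convention of the kernels (positive weights on the left or top) are assumptions for the example; the filter is applied without flipping, and the output is left unnormalized.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sobel kernels (sign convention assumed: positive weights on the left / at the top).
sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)
sobel_horizontal = sobel_vertical.T   # positive top row, negative bottom row

def apply_filter(image, kernel):
    """Multiply-and-sum the kernel with every 3x3 window of the image (no padding)."""
    windows = sliding_window_view(image, kernel.shape)   # shape (H-2, W-2, 3, 3)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Synthetic 6x6 image: bright left half, dark right half, i.e. one vertical edge.
image = np.zeros((6, 6))
image[:, :3] = 100.0

vertical_response = apply_filter(image, sobel_vertical)
horizontal_response = apply_filter(image, sobel_horizontal)
print(vertical_response)
print(horizontal_response)
```

The vertical filter responds only in the columns where its window straddles the edge, while the horizontal filter stays at zero because nothing changes from top to bottom in this image.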

## Edge Detection in Image Processing Using Sobel Filters

This image illustrates the Sobel edge detection filters and the convolution operation involved in using them.

Here's an elaboration on the components of the image:

- **Convolution Operation**: This is the process of applying a filter (also called a kernel) to an image to extract certain features, such as edges. It involves sliding the filter over the image, performing element-wise multiplication with the part of the image it covers, and summing up the results to form a new image.

- **Sobel Filters**: The image shows two 3x3 Sobel filters. These filters are designed to detect edges in different orientations:
  - The **Horizontal Filter** (on the left) is used to detect horizontal edges in the image. The positive values are at the top, and the negative values are at the bottom, which means this filter will respond strongly to changes in intensity from top to bottom, such as a light region sitting above a dark one.
  - The **Vertical Filter** (on the right) is used to detect vertical edges. The positive values are on the left side of the filter, and the negative values are on the right side, making this filter sensitive to left-to-right transitions in intensity.

- **Finite-Difference Approximation of the Gradient**: By convolving these filters with the image, we approximate the gradient of the image. The gradient refers to the rate of change of brightness within the image, which is high at the edges where the color changes abruptly. The convolution operation with Sobel filters highlights these areas by producing high values in the new image, effectively outlining where the edges are.

- **Edge Detection**: The process outlined in the image captures the essence of edge detection using the Sobel operator. By applying these two filters and then combining the resulting images, we can detect the presence and orientation of edges in the original image, which is a fundamental step in many image processing tasks.

This visualization is a common way to explain how edge detection works in the field of computer vision and is a fundamental concept for those learning about image processing and convolutional neural networks.
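For reference, the two 3x3 filters described above can be written out explicitly. The symbols $ G_h $ (horizontal filter) and $ G_v $ (vertical filter) are labels introduced here for convenience, and the signs follow the description above (positive values at the top and on the left, respectively); some references use the opposite sign convention.

$$
G_h = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix},
\qquad
G_v = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}
$$

The middle row of $ G_h $ and the middle column of $ G_v $ are zero, so each filter compares the pixels on one side of the center with those on the other side.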

## Applying Sobel Edge Detection to MRI Image Analysis: A Synthesis of Convolution and Cross-Correlation

This section walks through a convolutional operation, specifically the vertical Sobel filter, applied to a small patch of an MRI image of the human brain, along with the technical details involved:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig9.png" width = "600" >

- **MRI Image Patch**: The process starts with an 8x8 patch of an MRI image. This patch has been selected because it contains an edge, which is indicated by a transition from a very bright to a very dark region within the brain structure.

- **Sobel Filter Application**: To detect the orientation and presence of this edge, the patch is convolved with the vertical Sobel filter. The vertical Sobel filter is a 3x3 matrix designed to highlight vertical edges by responding to changes in intensity in the horizontal (left-to-right) direction.

- **Cross-Correlation vs. Convolution**: Technically, the operation performed is called cross-correlation, not convolution. In true convolution, the filter would be flipped both horizontally and vertically before being applied to the image. However, in cross-correlation, the filter is used as is, without flipping.

- **Relevance in CNNs**: In the context of Convolutional Neural Networks (CNNs), the distinction between cross-correlation and convolution is generally disregarded. This is because CNNs learn the filter weights directly from the data during training, which means they can learn the correct orientation of the filter for the task regardless of whether the operation is technically convolution or cross-correlation.

- **Valid Assumption**: Given that CNNs can learn to identify features effectively whether the filters are flipped or not, the use of cross-correlation is considered a valid assumption in practice. This simplification does not affect the network's ability to learn from the data and detect features like edges in images.

In summary, the Sobel filter is applied to an MRI image patch to highlight the edges within it, and the operation used is akin to cross-correlation, which is a standard approach in CNNs. This approach simplifies the process without compromising the feature learning and edge detection capabilities of the network.
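The difference between the two operations is easy to check numerically. In the sketch below, true convolution is implemented as cross-correlation with a kernel flipped along both axes. The random 8x8 array is only a stand-in for the MRI patch (the real pixel values are not reproduced here), and the function names are assumptions for the example.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide the kernel over the image as is (no flipping, no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def convolve(image, kernel):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    return cross_correlate(image, np.flip(kernel))

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)

# Random stand-in for an 8x8 image patch (hypothetical data).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(8, 8)).astype(float)

xcorr = cross_correlate(patch, sobel_vertical)
conv = convolve(patch, sobel_vertical)
print(np.allclose(conv, -xcorr))
```

For this particular filter, flipping it along both axes just negates it, so the two results differ only in sign. A learned filter can absorb such a flip into its weights, which is why the distinction is usually ignored in CNNs.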

## The convolutional* operation

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/fig10.png" width = "600" >

Breaking down the process, it involves an element-wise multiplication followed by a summation. For instance, consider a convolution operation with a kernel applied to a section of an image. For the top row of this section, the computation would be 1×55 + 0×60 - 1×65. For the second row, it's 2×67 + 0×65 - 2×63, and for the third row, 1×73 + 0×64 - 1×62. These results are then summed up, yielding a total of 9. This sum is assigned to the corresponding middle pixel of the kernel's position on the image, which, in the context of the entire operation, becomes the outer top-left pixel of the output.

It's important to note that the resulting image grid from this convolution operation will be smaller than the original. Specifically, it will have 2 fewer rows and 2 fewer columns, so the 8x8 patch yields a 6x6 output. The output size equals the number of positions at which the kernel fits fully inside the original image; at the borders, the kernel cannot be centered without hanging over the edge.
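The arithmetic for the first kernel position can be reproduced directly. The pixel values and kernel weights below are the ones quoted in the text; the variable names are just for illustration.

```python
import numpy as np

# 3x3 section of the image under the kernel (values taken from the worked example).
patch = np.array([[55, 60, 65],
                  [67, 65, 63],
                  [73, 64, 62]])

# Vertical Sobel kernel as used in the worked example.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

row_sums = (patch * kernel).sum(axis=1)   # element-wise product, summed per row
total = row_sums.sum()                    # row contributions -10, 8 and 11 add up to 9
print(row_sums, total)

# Output size without padding: one value per position where the kernel fully fits.
image_size, kernel_size = 8, 3
output_size = image_size - kernel_size + 1
print(output_size)
```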



---

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/figure_red.png" width = "600" >

After calculating the product at the initial position, the filter is moved one step to the right, and the operation is repeated. In this next position, the resulting value from the convolution operation is -13.



---

This process continues for each position along the top row, with the filter being applied at each location sequentially.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/figure_green.png" width = "600" >



---

Then, the filter moves down one row and slides across again, repeating this procedure row by row.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/figure_blue.png" width = "600" >



---

This process is carried out until the filter reaches the bottom right corner of the image.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/figure_purpule.png" width = "600" >



---

As a result of the convolutional operation, a new image is produced where primarily vertical edges are retained. This outcome is due to the nature of the edge detection process: the element-wise product and sum operation yields higher amplitudes at locations with a significant intensity change from left to right. In areas where the intensity is relatively constant, the values tend to cancel out, highlighting areas of strong vertical edge presence in the image.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/figure_purpule_2.png" width = "600" >






## Key Steps in Sobel Edge Detection: Beyond Basic Convolution

To get the full picture with the Sobel operation, you actually need to do convolutions with both vertical and horizontal filters. Then, you kind of bring their outputs together for the final effect. Also, here's a nifty detail – to really nail the gradient approximation, you should **normalize** the value you get from the **filter operation**. This means dividing it by 4. Just a little extra info to make sure you've got the whole process down pat!


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a11.png" width = "600" >

Note that a couple of important steps were glossed over. First, the full Sobel operation convolves the image with both the vertical and the horizontal filter and combines the results. Second, the output of each filter should be divided by 4 to return the correct finite difference calculation of the gradient.
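Here is a minimal sketch of those two extra steps. The text does not say how the two outputs are combined; the sketch assumes the commonly used gradient magnitude, $ \sqrt{g_x^2 + g_y^2} $. The ramp image and the function name `sobel_edges` are also assumptions for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SOBEL_VERTICAL = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)
SOBEL_HORIZONTAL = SOBEL_VERTICAL.T

def sobel_edges(image):
    """Return normalized horizontal/vertical differences and their combined magnitude."""
    windows = sliding_window_view(image, (3, 3))
    # Dividing by 4 (= 1 + 2 + 1, the sum of the smoothing weights) leaves the plain
    # difference between the pixels on either side of the center.
    grad_x = np.einsum("ijkl,kl->ij", windows, SOBEL_VERTICAL) / 4.0
    grad_y = np.einsum("ijkl,kl->ij", windows, SOBEL_HORIZONTAL) / 4.0
    magnitude = np.hypot(grad_x, grad_y)   # assumed way of combining the two outputs
    return grad_x, grad_y, magnitude

# Hypothetical test image: intensity rises by 3 per pixel from left to right.
ramp = np.tile(3.0 * np.arange(6), (5, 1))
grad_x, grad_y, magnitude = sobel_edges(ramp)
print(grad_x[0], grad_y[0], magnitude[0])
```

On this ramp the normalized horizontal response is -6 everywhere: the left neighbor minus the right neighbor, two pixels apart, with the sign set by the kernel having its positive weights on the left.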


## Padding

You might have noticed that directly applying the convolution operation results in an output image that's smaller by 2 pixels in each dimension. The output has exactly one pixel for each position where the filter fits fully onto the image, and a 3x3 filter cannot be centered on the outermost rows and columns. To produce an output that's the same size as the original, we need to pad the image, usually with zeros.

Padding is the solution to prevent our image grid from shrinking due to the convolution operation. By adding a layer of zeros around the image's perimeter, we enable our kernels to be centered even on the outermost pixels. For a 3x3 kernel, a single layer of padding suffices. However, larger kernels require more padding. It's also worth mentioning here that convolutional kernels typically have odd numbers of rows and columns. This design gives them a central point, which is important for evenly distributing their effects across the image.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a2.png" width = "600" >

In the lexicon of deep learning, a convolution performed without any padding is referred to as a "VALID" convolution. On the other hand, a convolution that yields an output of the same size as the original image, achieved by adding padding, is termed a "SAME" convolution.
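A quick way to see the difference between the two modes is to compare output shapes with and without zero padding. The sketch below assumes odd, square kernels and a stride of 1, for which the padding per side is $ (f - 1) / 2 $. The helper name `same_padding` and the 8x8 dummy image are made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def same_padding(kernel_size):
    """Zeros to add on each side so the output keeps the input size (odd kernel, stride 1)."""
    return (kernel_size - 1) // 2

image = np.ones((8, 8))   # dummy image; only its shape matters here
shapes = {}
for f in (3, 5, 7):
    p = same_padding(f)
    padded = np.pad(image, p, mode="constant", constant_values=0)
    # The first two axes of the window view count the positions where the kernel fits.
    valid_shape = sliding_window_view(image, (f, f)).shape[:2]    # "VALID": no padding
    same_shape = sliding_window_view(padded, (f, f)).shape[:2]    # "SAME": padded input
    shapes[f] = (p, valid_shape, same_shape)
    print(f"kernel {f}x{f}: padding {p}, VALID {valid_shape}, SAME {same_shape}")
```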


## Strides

Let's talk about strides next, they're pretty cool! In deep learning, when we do convolutions, sometimes we play a bit of hopscotch with the image grid. Imagine we have a 7x7 image, and we want to do a convolution, but with a stride of 2.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a4.png" width = "600" >

So, what happens first? Well, it starts off just like usual. We place our kernel on the first spot it fits on the image, and then do our element-wise multiplication and sum. It's like the first step in a dance where we're about to skip around a bit more!

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a5.png" width = "600" >

But here's where the fun twist comes in! Instead of moving to the very next spot (you know, the one marked by the blue dotted line), we're going to hop two pixels over. That's right, we'll place our kernel at the position shown by the red box. And just like that, we get -27. But this time, it's not just any number - it's the next element in our output matrix. It's like playing a game of leapfrog on the image!

And if we keep up our two-pixel skipping game, we soon reach the outer right edge of the image. That's our cue to stop - we've hit the boundary!

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a6.png" width = "600" >

In deep learning, we can get creative with convolutions by playing around with different strides. For instance, imagine we have an input image with a height $ h = 7 $, a kernel size $ f = 3 $, and we decide to use a stride $ s = 2 $. The formula to figure out the output dimensions after our strided (and maybe padded) convolution adventure is $ \left\lfloor \frac{(h + 2 \times p - f)}{s} \right\rfloor + 1 $, where $ p $ is the number of padding pixels added on each side. With no padding ($ p = 0 $), our example gives $ \lfloor 4 / 2 \rfloor + 1 = 3 $.

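But here's a little heads-up: not all input dimensions play nice with strides larger than 1. If $ (h + 2 \times p - f) $ isn't a whole multiple of the stride $ s $ (with $ s = 2 $, that means an odd number), the kernel can't land exactly on the far edge. What happens next depends on the framework: PyTorch's `Conv2d` simply rounds down, as in the formula, so the leftover rows and columns at the edge are never visited, while TensorFlow/Keras with `padding="same"` throws in some extra zero padding to make everything fit.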

So, when we move our filter kernel across the image, skipping 2 pixels each time (because our stride is 2), we're changing the image's dimensions. It scales according to our input height $ h $, plus double our padding $ 2 \times p $, minus our filter size $ f $, all divided by our stride $ s $. We round this down (those bracket-y symbols are mathematicians' way of saying "round down") and add 1 to get our final size.

Just a note though, not every input dimension is cut out for bigger strides. In theory, the kernels should fit neatly into the image, given the stride. But don't sweat it too much - if you accidentally mismatch image and stride dimensions, modern deep learning frameworks won't crash: depending on the framework and padding mode, they either skip the leftover border pixels or add extra padding. It's like having a safety net, but it's still worth checking the output size you actually get!
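The output-size formula is easy to turn into a small helper. The sketch below is plain Python with hypothetical function names; it also lists the positions at which the kernel is placed, so you can see what the floor in the formula means when the sizes don't match up.

```python
def conv_output_size(h, f, s=1, p=0):
    """Output length along one dimension: floor((h + 2p - f) / s) + 1."""
    return (h + 2 * p - f) // s + 1

def kernel_positions(h, f, s=1, p=0):
    """Start indices (in the padded image) at which the kernel is placed."""
    return list(range(0, h + 2 * p - f + 1, s))

# The example from the text: 7x7 image, 3x3 kernel, stride 2, no padding.
size_7 = conv_output_size(7, 3, s=2)
positions_7 = kernel_positions(7, 3, s=2)
print(size_7, positions_7)

# A mismatched case: h = 8 gives (8 - 3) = 5, which is not a multiple of the stride.
size_8 = conv_output_size(8, 3, s=2)
positions_8 = kernel_positions(8, 3, s=2)
last_pixel_covered = positions_8[-1] + 3 - 1
print(size_8, positions_8, last_pixel_covered)
```

In the second case the last kernel position covers pixels 4 to 6, so the final pixel (index 7) is never visited unless extra padding is added.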

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/a7.png" width = "600" >








## Hand Engineered Features

Beyond basic edge detectors like Sobel filters, there's a whole world of more sophisticated feature detection techniques in image processing. These methods are designed to recognize much more complex textures and features in images. Let's dive into some of these advanced techniques:

1. **Scale-Invariant Feature Transformation (SIFT):**
   - SIFT is designed to detect and describe local features in images.
   - The key aspect of SIFT is its ability to be invariant to image scale and rotation. It can also handle changes in illumination, noise, and minor changes in viewpoint.
   - SIFT works by identifying key points in an image (like corners or edges) and describing them in a way that's consistent, regardless of how the image is transformed.

2. **Speeded-Up Robust Features (SURF):**
   - SURF is often considered as an extension of SIFT. It's faster to compute and has been found to be more robust against different transformations.
   - Just like SIFT, it's used for detecting and describing local features in images, but with improved speed and efficiency.

3. **Local Binary Patterns (LBP):**
   - LBP is a simple yet very effective texture operator that labels the pixels of an image by thresholding the neighborhood of each pixel and considers the result as a binary number.
   - It's particularly powerful for texture classification and has seen significant use in applications like facial recognition.

4. **Histogram of Oriented Gradients (HOG):**
   - HOG is particularly used for object detection in computer vision.
   - It works by dividing the image into small connected regions, called cells, and for each cell, computing a histogram of gradient directions or edge orientations for the pixels within the cell.

These more complex feature detectors like SIFT, SURF, LBP, and HOG rely on large banks of hand-designed filters and descriptors, each built to detect specific, complex image features or textures. They're crafted to be as invariant as possible to transformations like rotations, translations, and scaling.

Unlike simple edge detection, where comparison is made pixel by pixel, these methods involve comparing all combinations of detected features in each image. This allows for the identification of common key points or features, even when the images have undergone significant transformations.

In summary, these advanced techniques represent a significant leap over basic edge detection, offering robust, sophisticated methods for feature detection and description, crucial for tasks like object detection, facial recognition, and image classification.
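To give a feel for how simple some of these hand-engineered descriptors are, here is a minimal sketch of the basic LBP code for a single 3x3 neighborhood. The neighbor ordering (clockwise from the top-left), the `>=` threshold and the bit weighting are assumptions made for the example; implementations differ on these conventions, and real LBP pipelines go on to build histograms of these codes over image regions.

```python
import numpy as np

def lbp_code(patch):
    """Basic Local Binary Pattern code for the center pixel of a 3x3 patch."""
    center = patch[1, 1]
    # Neighbor coordinates, clockwise starting at the top-left corner (assumed ordering).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    # Threshold each neighbor against the center: 1 if it is at least as bright.
    bits = [int(patch[r, c] >= center) for r, c in offsets]
    # Read the eight bits as a binary number (first neighbor = least significant bit).
    return sum(bit << i for i, bit in enumerate(bits))

# Hypothetical 3x3 neighborhood.
patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)
print(code)
```

Nothing in this descriptor is learned: the thresholding rule and the neighbor ordering are fixed by hand.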

## CNNs learn filters from data

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023//main/week-9-CNN/img/b1.png" width = "800" >

CNNs operate in a similar way to traditional feature detectors but take it a step further by **learning bespoke filter kernels** that are **specifically fine-tuned for various image recognition tasks**.

Here's how they do it: **CNNs optimize the network weights, and these weights are essentially the filter kernels they've learned.** So, **instead of learning large weight matrices that correspond to the full dimensions of the images, like in fully connected networks, CNNs focus on smaller, more manageable weights.** These are the convolutional kernels, typically sized 3x3, 5x5, or 7x7, with a certain depth.

<font color='blue'>**But CNNs don’t just learn a single set of filters. They create a whole hierarchy of them, layer by layer. In each layer of a CNN, the learned weight matrices, or filters, are convolved with the image to produce an activation map for that layer. This activation map then serves as the input for the next layer. This layered approach enables the network to progressively learn more complex textures by combining responses from earlier layers. For example, you can imagine how merging edge activations of different orientations could lead to encoding more complex textures.**</font>

In essence, CNNs are about building up complexity, starting from simple patterns and gradually moving to more intricate ones, allowing for a nuanced understanding and recognition of various features in images.



---


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023//main/week-9-CNN/img/b2.png" width = "800" >

Alongside learning these complex textures, CNNs also downsample the data after every few layers, either through pooling or striding. This process expands the receptive field of the filters at each successive layer, as illustrated by boxes of increasing size. This expansion enables the filters to recognize image features at progressively larger scales.

For instance, early in the network, filters might detect simple features like edges. As the receptive field grows, the network begins to recognize more substantial parts of objects, such as wheels in the case of vehicle images.

Eventually, as the layers progress and the receptive field encompasses a larger portion of the image, the network becomes capable of detecting entire objects. This step-by-step process allows the CNN to build up from recognizing basic patterns to understanding complex structures within an image.



---

## What is a receptive field in a CNN?

In Convolutional Neural Networks (CNNs), **the receptive field refers to the region of the input space that a particular CNN feature is looking at, or to which it is responding.** It essentially defines the extent of the area in the input image that affects or **contributes to the calculation of a particular feature in the network**. Here's a more detailed explanation:

1. **Basic Concept**: <font color='blue'>**When a convolutional operation is applied to an input image, each neuron (or unit) in the convolutional layer focuses on a small region of the input image. This specific region is known as the neuron's receptive field.**</font>

2. **Layer-wise Expansion**: As you go deeper into the CNN, the receptive field of neurons in higher layers becomes larger in terms of the input space. This is because each neuron in a deeper layer receives input from a combination of neurons in the previous layer, effectively looking at a larger portion of the input image.

3. **Importance in Feature Detection**: In early layers of a CNN, neurons have smaller receptive fields and thus detect simple, local features like edges or textures. In deeper layers, neurons have larger receptive fields and can integrate these local features to detect more complex patterns or objects.

4. **Determining Factors**: The size of the receptive field is determined by several factors, including the size of the filters (kernels) used in convolutional layers, the depth of the layer in the network, and the strides of the convolution and pooling layers that precede it. Padding does not change the size of the receptive field, only how it lines up with the image border.

5. **Overlap and Coverage**: Receptive fields of neighboring neurons often overlap, allowing the network to cover the entire image and capture the spatial hierarchy of features.

In summary, the receptive field in a CNN is a critical concept that determines how much of the input image a neuron 'sees' and processes, playing a key role in the network's ability to extract and hierarchically organize visual features from simple to complex.
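To make the layer-wise expansion concrete, here is a minimal sketch of how the receptive field can be computed for a stack of layers. It assumes plain (non-dilated) convolutions and pooling; the function name `receptive_field` and the example stack are hypothetical, chosen only for illustration.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    r, jump = 1, 1  # one input pixel sees itself; neighbouring units are 1 pixel apart
    sizes = []
    for kernel_size, stride in layers:
        r = r + (kernel_size - 1) * jump  # each extra kernel tap adds `jump` input pixels
        jump = jump * stride              # striding spreads neighbouring units further apart
        sizes.append(r)
    return sizes


# Hypothetical stack: two 3x3 convs, a 2x2 pool with stride 2, then two more 3x3 convs
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
rf = receptive_field(stack)
print(rf)  # [3, 5, 6, 10, 14]
```

Notice that the two convolutions after the pooling step each add 4 pixels to the receptive field instead of 2, which is the effect of downsampling described earlier.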

**Translational Invariance**

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023//main/week-9-CNN/img/b3.png" width = "500" >

Combining multiscale learning with the inherent translation equivariance of 2D/3D Euclidean convolutions makes standard CNNs effective for image processing. <font color='blue'>**Translation equivariance means that if you translate an image and then apply convolution, the result will be the same as if you first convolved the image and then translated it. This property, known as equivariance, is a key feature of CNNs.**</font>

**CNNs are also designed to be approximately invariant to translations.** This means that **regardless of where an object is located in an image**, the CNN will still **recognize it as the same object.** This is largely due to the downsampling process, typically achieved through pooling.

However, it's important to note that this built-in property is specific to **translation**; tolerance to **scaling** is only partial and comes from the multiscale structure of the network. CNNs **are not** inherently equipped to handle other types of transformations, such as **rotations** and **non-linear deformations**. That's why data augmentation techniques are often used in practice to improve the network's ability to handle these kinds of variations.

Despite this limitation, <font color='blue'>**the translation equivariance and invariance properties**</font> of CNNs are significant advantages, especially in 2D/3D image processing. They mean that, for instance, **if an image is moved to the right, the corresponding feature layer produced by the convolution will also shift to the right** (i.e., $ f(T(x)) = T(f(x)) $, where $ T $ represents translation, and $ f $ represents the convolutional operation).
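The identity $ f(T(x)) = T(f(x)) $ can be checked numerically. The sketch below is an illustration under a simplifying assumption: both the convolution and the translation wrap around at the image borders, so the equality holds exactly (with zero padding it holds only away from the borders). The helper names `circular_conv2d` and `translate_right` are made up for this example.

```python
import numpy as np


def circular_conv2d(x, w):
    """Slide kernel `w` over 2D image `x`, wrapping around at the borders."""
    out = np.zeros_like(x, dtype=float)
    for di in range(w.shape[0]):
        for dj in range(w.shape[1]):
            # A negative roll brings pixel (i + di, j + dj) to position (i, j)
            out += w[di, dj] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return out


def translate_right(x, pixels):
    """T: shift the image `pixels` columns to the right (with wrap-around)."""
    return np.roll(x, shift=pixels, axis=1)


rng = np.random.default_rng(0)
x = rng.random((8, 8))   # random stand-in for an image
w = rng.random((3, 3))   # random stand-in for a learned filter

lhs = circular_conv2d(translate_right(x, 2), w)  # f(T(x))
rhs = translate_right(circular_conv2d(x, w), 2)  # T(f(x))
print(np.allclose(lhs, rhs))                     # equivariance: the feature map shifts too

# Pooling over the whole map discards position, giving invariance
print(np.isclose(lhs.max(), circular_conv2d(x, w).max()))
```

The first check is equivariance (the feature map moves with the image); the second shows how a pooling step on top of it turns that into invariance.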

These properties, however, are not as straightforward or inherent in other domains, such as graphs and surfaces, which we'll explore in the final lecture of the course. The unique capabilities of CNNs in handling spatial transformations in images underscore their significance in the field of computer vision and image analysis.

## Filters not neurons!


1. **Filters as Weight Matrices**: In CNNs, filters (or kernels) are indeed the core components that perform the convolution operations. A filter is represented by a weight matrix $ W_i^l \in \mathbb{R}^{f \times f \times d_{(l-1)}} $, where $ f $ is the filter size and $ d_{(l-1)} $ is the depth of the input to that layer. These filters are the learned parameters of the network that help in extracting features from the input image.

**Imagine Filters as Artisans of Data**: Within the realms of CNNs, think of filters as skilled artisans whose tools are the weight matrices $ W_i^l \in \mathbb{R}^{f \times f \times d_{(l-1)}} $. Here, $ f $ isn't just a number—it's the size of the artisan's canvas, and $ d_{(l-1)} $ represents the depth of insight they have into the input data. These artisans craftily carve out patterns and features from the raw image data, helping to bring out the essence of the visual information.

2. **Receptive Field**: The concept of a receptive field is important in CNNs because each 'neuron' (or more accurately, each filter application) processes a small, localized region of the input image. This is different from a fully connected network, where each neuron is connected to every input. In CNNs, this locality is captured by the size of the filter, and the receptive field is the part of the image that is 'seen' by the filter during the convolution operation.

**Receptive Field - The Window to the World**: In the CNN landscape, each 'neuron' gets a personalized window—the receptive field—through which it gazes at a small segment of the image. It's a departure from the old days of fully connected networks, where each neuron was bombarded with the whole image. Now, our little windows keep things local and focused, allowing for a more intimate and detailed understanding of the image's features.

3. **Parameter Sharing and Convolution**: Parameter sharing is a fundamental feature of CNNs that allows the network to significantly reduce the number of parameters compared to a fully connected network. Because the same filter (with the same weights) is applied across the entire input image, this operation is equivalent to a convolution, where each element of the output feature map is computed by elementwise multiplication of the filter with the input image's receptive field followed by summing the results.

**Parameter Sharing - A Convolutional Symphony**: The magic of parameter sharing in CNNs is akin to an orchestra playing in unison. Just as each musician plays the same note in harmony, each neuron in a convolutional layer shares the same filter weights, creating a symphony of convolutions across the image. This not only cuts down on the cacophony of too many parameters but also ensures that the learned patterns are universally recognized across the entire image.

4. **Output Depth Determined by Number of Filters**: The depth of the output feature map is determined by the number of filters used in the convolutional layer. For example, if you apply 12 filters to an input image of size $ 32 \times 32 \times 3 $ with padding to maintain the same spatial dimensions, you would end up with an output feature map of size $ 32 \times 32 \times 12 $. Each filter produces a different feature map, and stacking these maps along the depth dimension forms the complete output of the convolutional layer.

**Depth of Output - The Ensemble of Filters**: The depth of the output in a CNN layer is set by the ensemble of filters applied. If you're picturing a $ 32 \times 32 \times 3 $ image, and you decide to call upon 12 unique filters, you're essentially creating 12 unique interpretations of the image, each capturing different features. Stack these interpretations together, and you get a $ 32 \times 32 \times 12 $ feature map that holds a richer, more nuanced representation of the original image.

The transition from Multi-Layer Perceptrons (MLPs) to CNNs involves understanding these key differences: the use of local receptive fields, shared weights, and the spatial hierarchy of features that the network can learn. CNNs are thus more suited to tasks like image recognition, where the spatial arrangement of pixels is crucial for identifying patterns and features.

So, when shifting your mindset from the more traditional neural networks to CNNs, it's like moving from a one-size-fits-all approach to a bespoke tailor-made understanding of images. The CNN approach is more nuanced, allowing the network to create a hierarchy of features, from the simple to the complex, which is perfect for making sense of the visual world.
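A bit of arithmetic shows how much parameter sharing saves for the $ 32 \times 32 \times 3 $ example with 12 filters. This is only a back-of-the-envelope sketch: the $ 3 \times 3 $ filter size is an assumption (the text does not fix it), biases are ignored, and the "unshared" case is a hypothetical layer in which every output position has its own private copy of the filter weights.

```python
h, w, d_in = 32, 32, 3   # input volume from the example above
f, n_filters = 3, 12     # assumed: 12 filters of size 3 x 3 x 3, padded to keep 32 x 32

weights_per_filter = f * f * d_in              # 27 weights in each filter
output_shape = (h, w, n_filters)               # one 32 x 32 map per filter
n_outputs = h * w * n_filters                  # 12,288 output values

shared = n_filters * weights_per_filter        # CNN: the same filter is reused at every position
unshared = n_outputs * weights_per_filter      # hypothetical: separate weights per position
fully_connected = n_outputs * (h * w * d_in)   # every output connected to every input

print(output_shape)                            # (32, 32, 12)
print(f"shared (CNN):       {shared:,}")       # 324
print(f"local but unshared: {unshared:,}")     # 331,776
print(f"fully connected:    {fully_connected:,}")  # 37,748,736
```

Local receptive fields alone already cut the count by about a factor of 100 relative to the fully connected case; sharing the filters across positions cuts it by another factor of 1,024 (the number of positions).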

## Convolutional Networks: Forward Propagation

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/Screen%20Shot%202023-11-14%20at%2011.02.54%20AM.png" width = "700" >

Imagine this lovely RGB image of a flower, bursting with red, green, and blue hues. This image isn't just a flat picture; it has depth $ d_0 = 3 $ corresponding to its color channels, which gives us a three-dimensional perspective on our digital flower.

Now, to understand this image through the eyes of a CNN, we start by learning filters, or mini lenses, that are shaped $ f \times f \times d_0 $. **Each filter is crafted to have the same depth as the image** so that it can fully appreciate all the color nuances. These filters are not just regular lenses; they're more like detectives' magnifying glasses, looking for specific clues—in our case, patterns and textures.

When we slide these filters across the image—carefully moving them stride by stride and adding a bit of padding around the edges to ensure we don't miss any part of the picture—we get a new image. This **new image, or activation map**, has a shape $ h_1 $, calculated by the formula you've learned, which neatly encapsulates the findings of our detective filters.
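As a reminder, the formula in question is presumably the standard one: for input size $ h_0 $, filter size $ f $, padding $ p $ and stride $ s $, the output size is $ h_1 = \lfloor (h_0 + 2p - f)/s \rfloor + 1 $. A small helper (the name `conv_output_size` is just for illustration) makes it easy to sanity-check shapes:

```python
def conv_output_size(h_in, f, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (h_in + 2 * pad - f) // stride + 1


print(conv_output_size(10, f=3, stride=1, pad=1))    # 10: 3x3 filter with padding 1 keeps the size
print(conv_output_size(32, f=3, stride=1, pad=1))    # 32
print(conv_output_size(227, f=11, stride=4, pad=0))  # 55
```

The same formula applies independently to the height and the width.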

And what do these filters find? Well, initially, they might catch the simplest hints, like the edges of petals, much like an **edge detector** would. This is because early on, CNNs tend to learn simple, local patterns before moving on to complex ones.

Here's where you, the user, come in. You decide you want 64 different insights into our flower's picture, so you set $ d_1 = 64 $. **This means you're asking for 64 filters to scan the image. Each filter will give you a different perspective, focusing on various features of the flower.** Once the convolution operation is done, you'll have $ d_1 $ different 2D images—each a unique interpretation of the flower, representing different features highlighted by each filter.

The result? A multi-faceted, richly detailed understanding of the flower's image, with depth $ d_1 $ that the CNN will use to further analyze or make decisions, such as identifying the type of flower or detecting anomalies. This process is the essence of how CNNs process visual data, building up complex representations from simple beginnings.

## Convolutional Networks: Forward Propagation


Imagine we have an input $X \in \mathbb{R}^{10 \times 10 \times 3}$, which is an image with a height and width of 10 pixels each, across 3 color channels (RGB). We add padding of one pixel around the image (one extra pixel on every side), which expands it to $X \in \mathbb{R}^{12 \times 12 \times 3}$.

Our aim is to learn 64 filters, each with dimensions $3 \times 3 \times 3$. We denote these filters as $W_i$ for $i$ ranging from 0 to 63. Each filter will convolve with the image to detect various features.

Here's how we could express the convolution operation for the first filter $W_0$ in Python, along with the bias term $b_0$:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/orange.png" width = "400" >


```python
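import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: a random 10x10 RGB image, zero-padded by one pixel on each side
X = np.pad(rng.random((10, 10, 3)), ((1, 1), (1, 1), (0, 0)))  # shape (12, 12, 3)

# W_0 is the first 3x3x3 filter and b_0 is its bias term (random placeholders)
W_0 = rng.standard_normal((3, 3, 3))
b_0 = 0.1

# With a 3x3 filter and stride 1, the padded 12x12 input gives a 10x10 output
output_height = X.shape[0] - 3 + 1
output_width = X.shape[1] - 3 + 1
Z = np.zeros((output_height, output_width, 64))  # Initialize the output volume Z

# Convolution operation for the top-left corner of the image
Z[0, 0, 0] = np.sum(X[:3, :3, :] * W_0) + b_0

# Moving the filter to the right by one stride (stride = 1)
Z[0, 1, 0] = np.sum(X[:3, 1:4, :] * W_0) + b_0

# Moving the filter another stride to the right
Z[0, 2, 0] = np.sum(X[:3, 2:5, :] * W_0) + b_0

# This process would continue for the entire image...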
```

The expression `Z[0, 0, 0]` captures the sum of element-wise products between the filter $W_0$ and the top-left corner of the image, adding the bias $b_0$ afterwards. This is done for each position the filter moves over, always applying a stride of 1, which gives us the activation map `Z[:, :, 0]` for the first filter $W_0$.

This approach is akin to neurons in a fully connected layer, but rather than connecting to every feature, each 'neuron' in a CNN connects locally, corresponding to the size of the filter, such as $3 \times 3 \times 3$.

By executing this procedure with all 64 filters, we create a detailed feature map for each filter, which collectively forms a comprehensive feature representation of the original input image. This detailed feature representation is the fundamental mechanism by which CNNs interpret visual information.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/red.png" width = "400" >

The result of the convolution at position $0,1,0$ in the output volume is essentially the application of the same filter $W_0$, now translated one unit to the right. This mirrors the convolution process as explained earlier, where the filter slides across the input space. However, in this context, we're employing notation that aligns with what was introduced during the initial discussion on fully connected MLP networks.

Think of the filter $ W_0 $ as a little explorer on a grid. At the spot $ 0,1,0 $ in our output, our explorer stands after taking a single step to the right. It's the same diligent explorer $ W_0 $, just shifted over by one spot in the grid, ready to uncover more secrets in the image. This step-by-step adventure across the grid isn't much different from the process we saw in the first lecture with fully connected networks, but instead of each explorer having a unique path, here they all follow the same route across the image, picking up patterns along the way.



Once our intrepid filter $ W_0 $ has completed its grid-wide quest, it's time for the second filter $ W_1 $ to take the stage. It follows an identical path, weaving through the grid and capturing its own unique set of features. As it does this, $ W_1 $ decorates the next level of our output volume $ Z $, filling up the second depth channel. So if you think about it, while $ W_0 $ filled up the very first layer of our 3D feature map (that's the zeroth index of $ Z $), $ W_1 $ comes in to add another layer of insights right on top of it (populating the first index). This pattern of exploration and discovery is repeated for each and every filter, layering up our understanding of the image in $ Z $.
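Putting the whole walk together, the loop below is a minimal (and deliberately slow) sketch of the full forward pass for all 64 filters. The image, filters and biases are random placeholders, and the array `W` simply stacks the filters so that `W[k]` plays the role of $ W_k $.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.pad(rng.random((10, 10, 3)), ((1, 1), (1, 1), (0, 0)))  # padded input, (12, 12, 3)
W = rng.standard_normal((64, 3, 3, 3))  # W[k] plays the role of filter W_k
b = np.zeros(64)                        # b[k] is the bias of filter W_k

f, stride = 3, 1
out_h = (X.shape[0] - f) // stride + 1
out_w = (X.shape[1] - f) // stride + 1
Z = np.zeros((out_h, out_w, 64))

for k in range(64):                     # filter W_k fills depth channel k of Z
    for i in range(out_h):              # step down the image
        for j in range(out_w):          # step across the image
            r, c = i * stride, j * stride
            Z[i, j, k] = np.sum(X[r:r + f, c:c + f, :] * W[k]) + b[k]

print(Z.shape)  # (10, 10, 64)
```

Three nested loops are fine for understanding what happens, but far too slow for real images, which motivates the matrix formulation that follows.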

Matrix operations are the secret sauce that makes convolutional computations both elegant and efficient. Let's break down how this works with the convolution of image blocks and filters:

Firstly, we have these little squares or "blocks" of our image, each the size of the filter we're about to apply — say, a 3x3 grid. In the traditional sense, convolution would have us slide the filter over these blocks one by one, doing a lot of multiplications and additions. However, there's a clever trick we can use to speed things up: vectorization.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/yellow.png" width = "500" >


Vectorization is all about reshaping and alignment. We take each of these 3x3 blocks and stretch them out from a square into a long column. It's like taking a cube of string cheese and peeling off the strands one by one until you have a neat row. When you do this with every block, you get a tall stack of columns, each representing a piece of the original image, ready to be processed.

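Now, we perform a similar stretch with each filter, turning what was once a 3x3 kernel into a 9-element vector. Why 9? Because 3 multiplied by 3 is 9, and that's how many pixels we have in each block or filter. (For a filter with depth, such as the 3x3x3 filters above, the same stretch gives 3 x 3 x 3 = 27 elements.)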

Once everything is stretched into columns and lined up, we bring in the big guns of matrix operations: the dot product. By taking the dot product of the filter vectors with the image block vectors, we're effectively doing all the multiplications and additions in one fell swoop. This is much faster than the old way of doing it pixel by pixel, convolution by convolution.

The beauty of this approach is that it can be done very quickly, especially on modern computers that are optimized for these kinds of matrix operations. By using this method, we can process entire images with multiple filters rapidly, which is what makes training and using CNNs feasible on a large scale.

Let's expand on this process of transforming image blocks into a matrix with a concrete example:

Suppose we have an input image with dimensions [227x227x3] — a standard size for many classic convolutional neural network architectures. Now, we want to convolve this image with a set of filters that are each [11x11x3] in size. The stride, which is the step we take as we slide the filter across the image, is set to 4. This means that instead of moving our filter one pixel at a time, we move it four pixels over for each step.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/green.png" width = "500" >

For each position of the filter on the image, we take the [11x11x3] block of pixels it covers and stretch this block out into a long, thin column vector. This vector has a length of 11 multiplied by 11 multiplied by 3, which equals 363. This stretching is like unrolling a ball of yarn into a straight line — we're turning a 3D block of our image into a 1D column.

Now, as we stride across our input image with a step of 4, we don't cover every single pixel one by one. Instead, we jump four pixels each time, which reduces the number of positions we need to stop at. Calculating the number of positions is simple: we subtract the filter size from the input size, divide by the stride, and add 1. For both the width and the height, this gives us (227 - 11) / 4 + 1 = 55 positions.

When we do this for both dimensions, we find that we have 55 positions along the width and 55 along the height, resulting in 55 times 55, which is 3025 different positions in total. As we repeat the stretching process for each of these positions, we form a matrix $ X_{\text{col}} $ from the operation known as $ \text{im2col} $. This matrix has a size of [363 x 3025], where each of the 3025 columns is one of these unrolled receptive fields.

One important thing to note is that because the stride is smaller than the filter size, our receptive fields will overlap, which means that some pixels from the input volume will be represented multiple times across different columns of the $ X_{\text{col}} $ matrix. This is a natural consequence of how convolution works with striding and ensures that we capture the necessary information from each part of the image as we apply our filters.

This matrix is a powerful representation that allows us to perform convolution as a matrix multiplication, which is a highly optimized operation on modern computing architectures. It's how we can efficiently process large images with complex filters, making the magic of CNNs possible in practical applications like image recognition.
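Here is a minimal sketch of `im2col` for this example, written with plain loops for readability rather than speed. The image is a random placeholder, and the ordering of values inside each column (row by row, with channels last) is one possible convention, not the only one.

```python
import numpy as np


def im2col(x, f, stride):
    """Unroll every f x f x d receptive field of x (H, W, d) into one column."""
    h, w, d = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    cols = np.empty((f * f * d, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            cols[:, i * out_w + j] = x[r:r + f, c:c + f, :].ravel()
    return cols


rng = np.random.default_rng(0)
image = rng.random((227, 227, 3))       # random stand-in for a real image
X_col = im2col(image, f=11, stride=4)
print(X_col.shape)                      # (363, 3025)

# Overlapping receptive fields: X_col stores many pixels more than once
print(X_col.size, image.size)           # 1098075 154587
```

The last line makes the overlap visible: the unrolled matrix holds roughly seven times as many values as the original image, which is the memory cost paid for turning convolution into one matrix multiplication.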

## Convolutional Networks: Forward Propagation Summary

Let's elaborate on how the convolution operation in a neural network can be implemented efficiently using matrix operations. This will involve taking local regions of the input, stretching them out, and then performing dot products with the stretched-out filters.

1. **Stretching the Image Blocks into Columns**: We begin by targeting specific blocks of the image that align with the size of our filters. Each block is then stretched out into a column. This process, often referred to as `im2col`, transforms the $ f \times f \times d_{(l-1)} $ blocks from the image into 1D columns and stacks them side by side, forming a large matrix where each column is one image block.

2. **Stretching the Filters into Rows**: In parallel, we take each filter, which is a small weight matrix, and stretch it out into a row. By doing this with all the filters, we create another matrix where each row corresponds to one filter. The resulting matrix has the shape $ d_l \times (f \cdot f \cdot d_{(l-1)}) $, with $ d_l $ being the number of filters and $ f \cdot f \cdot d_{(l-1)} $ the flattened filter size (for example, $ 11 \times 11 \times 3 = 363 $ for the filters in the example above).

3. **Matrix Multiplication**: With both matrices prepared, we carry out a single matrix multiplication using a function like `np.matmul`. This operation effectively computes the dot product between each stretched image block and filter. It's a highly optimized way to do many convolutions simultaneously because modern hardware is very good at crunching these kinds of large matrix operations.

4. **Reshaping Back into a Volume**: After the matrix multiplication, we're left with a 2D matrix where each element represents a convolution result. To get back to a format that represents the spatial structure of our original image, we reshape this matrix into a 3D volume. This volume has a depth equal to the number of filters, with the spatial dimensions determined by the original image size, filter size, stride, and padding used during the convolution.

5. **Applying Activation Function**: Lastly, each value in the output volume is passed through an activation function, like the Rectified Linear Unit (ReLU). This function introduces non-linearity into the model, allowing the network to learn more complex features. It operates on each value individually, typically setting all negative values to zero and leaving positive values unchanged, though other behaviors are possible depending on the specific function used.

By following these steps, we can transform the convolution operation from a series of individual computations into a few large, efficient matrix operations. This is crucial for making training and inference with CNNs practical on large-scale data.
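The five steps can be sketched end to end as below. This is an illustrative implementation under assumptions, not a reference one: the function names, the small input size, and the choice of 4 random filters are all made up so that the result can be cross-checked against the direct "multiply and sum" definition.

```python
import numpy as np


def im2col(x, f, stride):
    """Return the unrolled patches (one per column) and the output height/width."""
    h, w, _ = x.shape
    rows = range(0, h - f + 1, stride)
    cols = range(0, w - f + 1, stride)
    patches = [x[r:r + f, c:c + f, :].ravel() for r in rows for c in cols]
    return np.stack(patches, axis=1), len(rows), len(cols)


def conv_forward(x, filters, biases, stride=1, pad=0):
    """x: (H, W, d_in); filters: (n, f, f, d_in); biases: (n,)."""
    n, f = filters.shape[0], filters.shape[1]
    x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    X_col, out_h, out_w = im2col(x_pad, f, stride)      # step 1: (f*f*d_in, positions)
    W_row = filters.reshape(n, -1)                      # step 2: (n, f*f*d_in)
    Z = np.matmul(W_row, X_col) + biases[:, None]       # step 3: (n, positions)
    Z = Z.reshape(n, out_h, out_w).transpose(1, 2, 0)   # step 4: (out_h, out_w, n)
    return np.maximum(Z, 0)                             # step 5: ReLU


rng = np.random.default_rng(0)
x = rng.random((10, 10, 3))
filters = rng.standard_normal((4, 3, 3, 3))
biases = rng.standard_normal(4)
A = conv_forward(x, filters, biases, stride=1, pad=1)

# Cross-check one output value against the direct definition
x_pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
direct = max(np.sum(x_pad[2:5, 3:6, :] * filters[1]) + biases[1], 0.0)
print(A.shape, np.isclose(A[2, 3, 1], direct))
```

The only subtle part is step 4: the columns of `X_col` are ordered row by row over the output positions, so the reshape has to use the same order before the filter axis is moved to the end.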

## Convolutional Networks: Backward Propagation

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/blue_purpule.png" width = "700" >

Backpropagation in convolutional layers is quite a nifty process, leveraging the same convolutional operation that's used in the forward pass. Here's how it elegantly unfolds:

During the forward pass, the convolutional layer performs its key operation, indicated by the asterisk symbol. This operation involves elementwise multiplications followed by a summation. For instance, the value at the first position of the output feature map, $ z_{11} $, is calculated as $ z_{11} = w_{11} \times x_{11} + w_{12} \times x_{12} + \ldots $, and so on across the overlapping elements of the input and the filter.

Now, let's turn our attention to backpropagation. When the network is learning, it estimates how much the loss function $ L $, which measures the error of the model, would change with respect to the parameters, such as our weights $ W $. This is where the partial derivatives come into play.

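Considering the partial derivative of the loss $ L $ with respect to a weight $ w_{11} $, we apply the chain rule. Because the same weight $ w_{11} $ contributes to every output element $ z_{ij} $, the chain rule sums over all of them (the rule of total derivatives): $ \frac{\partial L}{\partial w_{11}} = \sum_{i,j} \frac{\partial L}{\partial z_{ij}} \frac{\partial z_{ij}}{\partial w_{11}} $. Since each $ \frac{\partial z_{ij}}{\partial w_{11}} $ is simply the corresponding element of $ X $ (for instance, $ \frac{\partial z_{11}}{\partial w_{11}} = x_{11} $), **this computation also simplifies to a convolutional operation**.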

<font color='blue'>**Here, the elements of $ X $ are convolved with the partial derivatives of $ L $ with respect to $ Z $.**</font>

**This means that the backpropagation for a convolutional layer, where we're determining how to adjust the weights based on the error, can again be formulated as a series of matrix multiplication operations.** These operations are just as optimized and efficient as those in the forward pass, thanks to the regular and structured way that convolutional layers operate.

By translating the error back through the network using this process, the model can effectively 'learn' from the mistakes it makes, adjusting its filters to better recognize the patterns and features in the input data. This ability to learn through backpropagation is what powers the continued improvement of a CNN as it is exposed to more data and more iterations of training.
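The claim that the weight gradient is itself a convolution of $ X $ with $ \frac{\partial L}{\partial Z} $ can be verified numerically on a tiny single-channel example. Everything below is a sketch under assumptions: stride 1, no padding, random placeholder data, and a toy loss chosen so that the upstream gradient is known exactly.

```python
import numpy as np


def correlate2d_valid(x, k):
    """Slide kernel k over x (no padding, stride 1): elementwise multiply and sum."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


rng = np.random.default_rng(0)
X = rng.random((5, 5))
W = rng.standard_normal((3, 3))
dL_dZ = rng.standard_normal((3, 3))   # placeholder upstream gradient, same shape as Z

# Backward pass for the weights: slide dL/dZ over X
dL_dW = correlate2d_valid(X, dL_dZ)   # shape (3, 3), same as W


# Finite-difference check using the toy loss L = sum(dL_dZ * Z)
def loss(w):
    return np.sum(dL_dZ * correlate2d_valid(X, w))


eps = 1e-6
numeric = np.zeros_like(W)
for a in range(3):
    for b in range(3):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[a, b] += eps
        W_minus[a, b] -= eps
        numeric[a, b] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(dL_dW, numeric, atol=1e-6))
```

The analytic gradient needs a single pass of the sliding-window operation, while the numerical check needs two forward passes per weight, which is why backpropagation is used for training and finite differences only for testing.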

# CNN building blocks: Pooling

When we talk about constructing CNNs, there are several architectural elements to consider to ensure they perform effectively. One crucial aspect is downsampling, which serves multiple purposes within the network.

<font color='blue'>**Downsampling is a method to reduce the spatial dimensions (i.e., the width and height) of the input volume as it moves through layers of the CNN.**</font>

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/multiple_layers.png" width = "700" >

The goals of downsampling are to:

1. **Increase the Receptive Field**: By downsampling, we effectively expand the receptive field of the neurons in the subsequent layers. The receptive field is the region of the input data that a particular neuron processes. A larger receptive field means that the neuron can capture information from a wider area of the input, which is useful for recognizing larger-scale patterns.

2. **Reduce the Number of Parameters**: Smaller activation volumes mean less computation in the following convolutional layers and fewer weights in any fully connected layers that come after them. (The convolutional filters themselves have the same number of weights whatever the input size.) This reduction can help to prevent overfitting, where a model learns the training data too well and fails to generalize to new, unseen data.

3. **Increase Representational Power**: By downsampling, we can control the amount of information that flows through the network. Each layer can then learn to represent the data in a more compact and abstract form, focusing on the higher-level features rather than the fine-grained details that might not be as important for the task at hand.

Now, let's delve into pooling, one of the most common methods for downsampling:

<font color='blue'>**Pooling is a downsampling technique that reduces the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. There are different types of pooling, but the most common is max pooling.**</font>

**Max pooling operates on each depth slice of the input and resizes it spatially, using the MAX operation. Essentially, a max pooling layer partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.**

For example, a max pooling layer with a pool size of 2x2 and the typical stride of 2, applied to an input with dimensions [227x227x3], would output a volume whose spatial dimensions are roughly halved: [113x113x3]. Because 227 is odd, the last row and column are not covered by any full 2x2 window. The depth of 3 is unchanged, since pooling acts on each depth slice separately.

Pooling layers serve to aggressively downsample the input volume's spatial dimensions, which decreases the number of parameters and the amount of computation in the network, and hence also helps control overfitting.

By incorporating pooling layers at appropriate points in a CNN, we ensure the network can abstract and condense information, leading to more efficient learning and a network that's better at capturing the essence of the input data for classification or other tasks.

## Max pool operation

Diving deeper into the downsampling process within CNNs, we often use pooling layers to reduce the **spatial dimensions of the input volume**. Let's explore how this works with **2x2 pooling filters and a stride of 2**, and touch on alternatives like average pooling.

**Max Pooling: survival of the fittest**

The most common pooling technique is **max pooling**. It is straightforward yet powerful. Imagine you have a tiny window, a 2x2 pooling filter, that you slide over your image. With a stride of 2, this window **jumps two pixels over each time,** never overlapping with its previous positions. At each stop, the window looks at the 4 pixels it covers and picks the biggest pixel value to pass forward. This "**survival of the fittest**" approach ensures that only the most prominent features within each block are preserved.


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/pooling1.png" width = "600" >

So, if we visualize this process on a section of an image:

- In a 2x2 block highlighted in yellow, if the highest pixel value is 9, then 9 is the one that gets to move on.

- Similarly, for a green block, if the highest value is 7, then 7 is chosen.

- For a blue block, the maximum value let's say is 3, so 3 proceeds.

- And for a red block with the maximum value being 9 again, the number 9 is advanced to the next layer.

This max pooling process halves both the width and the height of the image, so the total number of values drops by a factor of 4. If you start with an image of size 8x8, after max pooling with these parameters, you end up with a 4x4 output.
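For 2x2 windows with stride 2, max pooling can be written in a couple of lines with a reshape. The sketch below assumes a 2D input with even height and width; the pixel values are hypothetical, picked only so that the four block maxima are 9, 7, 3 and 9 as in the description above.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    # Axes after reshape: (block row, row inside block, block column, column inside block)
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


# Hypothetical pixel values; each 2x2 block has one clear maximum
x = np.array([[1, 9, 2, 7],
              [4, 3, 5, 6],
              [0, 3, 9, 1],
              [2, 1, 4, 8]])
print(max_pool_2x2(x))
# [[9 7]
#  [3 9]]

print(max_pool_2x2(np.zeros((8, 8))).shape)  # (4, 4)
```

For a volume with several channels, the same function would be applied to each depth slice separately.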

**Average Pooling**:
While less common, average pooling follows a similar procedure, but instead of taking the maximum value from each block, it calculates the average. This means every pixel within the window contributes equally to the resulting pooled value. Average pooling is like taking a consensus from each block rather than a standout feature.
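Average pooling with 2x2 windows and stride 2 differs from max pooling only in the final reduction. A minimal sketch on a hypothetical 4x4 input (again assuming even height and width):

```python
import numpy as np


def avg_pool_2x2(x):
    """2x2 average pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


# Hypothetical pixel values, for illustration only
x = np.array([[1, 9, 2, 7],
              [4, 3, 5, 6],
              [0, 3, 9, 1],
              [2, 1, 4, 8]], dtype=float)
pooled = avg_pool_2x2(x)
print(pooled)  # block means: 4.25 and 5.0 in the top row, 1.5 and 5.5 in the bottom row
```

In the top-left block the single large value 9 is diluted to 4.25 by its neighbours, which is exactly the "consensus" behaviour described above.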

Both max and average pooling help to make the representation approximately invariant to small translations of the input. Invariance to local translations can be very useful if we care more about whether a feature is present rather than its exact location.

These operations are non-parametric, meaning they don't have parameters that learn from the data. Instead, they're fixed operations designed to reduce the size of the input, focus on the most significant information, and reduce the number of parameters and computation required in subsequent layers. This makes the network faster and reduces the risk of overfitting by providing an abstracted form of the representation.



---


Backpropagation through a max pooling layer is an intriguing process because it's quite different from backpropagation through fully connected or convolutional layers that have weights.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/pooling2.png" width = "700" >


In a max pooling layer, each output value is the maximum of a set of inputs (a block of the input image). During the forward pass, we typically keep track of which input was the maximum (often using a mask or an index map). Consider one block within the pooling layer, say the yellow block in the figure. During the forward pass, we select the maximum value from this block. In the backpropagation phase, we need to route the gradient back only through this particular value, since it alone influenced the output.

Here's the critical bit: when backpropagating through a max pooling layer, the gradient of the loss with respect to the input of the max pooling layer is only non-zero for the element that was the maximum during the forward pass. This is because only this element had an impact on the output.

So, if we denote the forward operation of the max pool as a weighted sum over all elements in the block, the "weights" are effectively 0 for all but the largest element (which has a weight of 1). The backpropagation step involves passing the gradient through to only this largest element.

This process can be visualized as follows:

- During the forward pass, for each block, we record the position of the maximum element. We could say that each position gets a "vote" of zero, except for the maximum element, which gets a "vote" of one.
- During backpropagation, when we compute the gradient of the loss with respect to the output of the max pooling layer, we distribute this gradient only to the position that had the maximum vote. All other positions in the block remain unaffected (they get a gradient of zero).

In practice, this is often implemented by creating a mask during the forward pass that records the position of the maximum element. During the backpropagation step, this mask is used to distribute the gradients correctly.
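
The mask-based routing can be sketched in NumPy as follows. This is an illustration rather than how any particular framework implements it; the function names are hypothetical, the window is assumed to be 2x2 with stride 2, and ties within a block are assumed not to occur.

```python
import numpy as np

def max_pool_forward(x):
    """2x2, stride-2 max pooling; also returns a 0/1 mask marking each block's maximum."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    out = blocks.max(axis=(1, 3))
    # Broadcast each block maximum back over its block and compare.
    # Assumption: no ties within a block (a tie would mark several positions).
    mask = (blocks == out[:, None, :, None]).reshape(h, w).astype(x.dtype)
    return out, mask

def max_pool_backward(grad_out, mask):
    """Route each upstream gradient to the position of its block's maximum."""
    # Repeat each upstream gradient over its 2x2 block, then zero the non-maximum positions.
    upsampled = np.repeat(np.repeat(grad_out, 2, axis=0), 2, axis=1)
    return upsampled * mask

x = np.array([[1.0, 9.0, 2.0, 7.0],
              [4.0, 3.0, 5.0, 0.0],
              [3.0, 1.0, 9.0, 2.0],
              [0.0, 2.0, 4.0, 6.0]])
out, mask = max_pool_forward(x)
grad_out = np.array([[0.1, 0.2],
                     [0.3, 0.4]])   # hypothetical upstream gradient
grad_in = max_pool_backward(grad_out, mask)
print(grad_in)
```

Each of the four upstream gradient values lands, unchanged, on the position that held the maximum of its block; the other twelve positions receive zero.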

The simplicity of this process is one of the reasons max pooling works so well. It allows the network to retain the strongest signals and propagate gradients in a way that reinforces the importance of these signals. This selective backpropagation helps to maintain the spatial hierarchy of features that the network is learning.



---

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/pooling3.png" width = "700" >

Backpropagation through a max pooling layer is an exercise in connectivity—or rather, selective connectivity. Let's take a step-by-step look at how the gradients are calculated during this phase, using the ReLU function as an instructive comparison.

In a fully connected network, when we backpropagate through a ReLU activation function, we're essentially asking, "Where were the neurons fired up?" Because ReLU zeros out any negative values, only positive inputs affect the output and, consequently, only they have non-zero gradients.
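
A minimal sketch of this ReLU behaviour, with hypothetical function names and input values:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(grad_out, x):
    # The gradient passes through where the input was positive and is zero elsewhere.
    return grad_out * (x > 0)

x = np.array([-2.0, 0.5, 3.0, -0.1])
grad_in = relu_backward(np.ones(4), x)
print(grad_in)  # [0. 1. 1. 0.]
```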

Now, let's map this understanding to the max pooling operation. During the forward pass, max pooling acts like a tournament, picking out the champion (the maximum value) from each local pool (block of the input image) and giving it the spotlight in the output. When we reverse this process during backpropagation, we're essentially handing back the prize—a slice of the gradient—to only the winner of each tournament. All the other participants (the non-maximum values) walk away with nothing, or a gradient of zero.

So here's what happens during backpropagation through a max pooling layer:

1. **Gradient Allocation**: The incoming gradient from the next layer (upstream in the backpropagation flow) arrives at the max pooling layer. This gradient contains information about how much each neuron's activation needs to change to reduce the overall loss.

2. **Tracking the Winners**: To correctly allocate this gradient, the network must remember who the winners were during the forward pass. This is where the "masks" come into play. A mask is a record of the locations of the maximum values within each block. The network must keep these masks handy as it moves forward because they're critical for guiding the backward flow.

3. **Propagating the Gradient**: When backpropagating, the gradient from the next layer is distributed exclusively to these recorded maximum positions. The gradient for each maximum value is precisely the incoming gradient from the layer above, passed through unchanged because these were the elements that influenced the output. For all other positions, the gradient is zero, because these elements had no influence on the forward pass output.

The key implementation detail is saving these masks during the forward pass, much like setting a marker at each winning position. Without these markers, the network wouldn't know where to send the gradients during the backward pass.

It's a beautiful system of cause and effect: each neuron's contribution is measured, noted, and remembered. This way, during backpropagation, each contribution can be precisely adjusted, ensuring that the learning process is both efficient and accurate. Keeping track of which neurons were the most activated is crucial for the network to learn which features are the most important in making predictions.


## CNN building blocks: Strided convolutions

Strided convolutions are a more modern approach to downsampling in CNNs, and they offer an alternative to the traditional max or average pooling layers.

**Strided Convolutions for Downsampling**:
In a strided convolution, instead of moving the filter across the image one pixel at a time, we move it according to the stride length. For example, with a stride of 2, the filter jumps two pixels at a time. This action effectively reduces the spatial dimensions of the output, achieving the downsampling effect without a separate pooling layer.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/strided.png" width = "700" >


The main advantage of this method is that the convolutional filters are learned during training. In other words, the network learns the best way to downsample the input data, rather than relying on a fixed function like max or average pooling. This can theoretically improve the model's expressivity because the filters can adapt to what's most useful for reducing the loss, potentially learning to pool in a more sophisticated way than just taking the maximum or average.
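
To make the effect of the stride concrete, here is a small NumPy sketch of a single-channel convolution without padding. The function `conv2d` and the kernel weights are hypothetical; in a real network the weights would be learned. With the particular weights chosen here (all 0.25), a 2x2 kernel with stride 2 happens to reproduce average pooling, which shows that a fixed pooling function is one of the things a strided convolution could learn.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation of a single-channel image."""
    k = kernel.shape[0]  # assumes a square kernel
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The window's top-left corner moves `stride` pixels per step.
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.full((2, 2), 0.25)  # hypothetical weights, equivalent to averaging

print(conv2d(x, kernel, stride=1).shape)  # (7, 7)
print(conv2d(x, kernel, stride=2).shape)  # (4, 4)
```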

**Trade-offs**:
However, there are trade-offs to consider with strided convolutions:

1. **Increased Parameter Count**: Since the downsampling is done by the convolutional filters themselves, it increases the number of parameters the network has to learn. This can be a downside, especially when working with small datasets where overfitting becomes a concern. More parameters mean the network has more capacity to memorize the training data, which can harm its ability to generalize to unseen data.

2. **Backpropagation Complexity**: It has been suggested that strided convolutions make it harder than pooling layers do for the gradient to flow back to shallower layers in the network. This could slow down learning in these earlier layers and require careful initialization and learning rate strategies to mitigate.

The paper "FishNet: A Versatile Backbone for Image, Region, and Pixel-Level Prediction" discusses these issues in more depth. It argues that strided convolutions may impede the direct propagation of loss gradients to shallower layers, compared to pooling operations.

In conclusion, while strided convolutions can potentially increase a model's ability to learn more complex and useful representations, they come with challenges that need to be addressed through careful network design and training procedures. The choice between pooling and strided convolutions will depend on the specific requirements of the task, the size and nature of the dataset, and the capacity of the network being used.

## CNN building blocks: 1 x 1 convolutions

The use of 1x1 convolutions, also known as pointwise convolutions, is a clever design within CNNs that can yield computational and representational efficiencies.

**Understanding 1x1 Convolutions**:
Imagine a tiny kernel, the orange block in the figure below, that's just 1 pixel wide and 1 pixel tall, but as deep as the input volume it's processing. This is the 1x1 convolutional filter. It might seem counterintuitive at first (after all, how much can you learn from just one pixel?), but the strength of 1x1 convolutions isn't in spatial filtering; it lies in channel-wise computation.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/11convolution.png" width = "700" >

**Compressing and Expanding Channels**:
When this 1x1 filter sweeps across an incoming block (let's call it the blue block) with dimensions H x W x D, it mixes the information across the D channels, outputting a single number for each spatial position. The result is a single feature map with dimensions H x W x 1. If we deploy N such filters across the same blue block, we end up with an output of H x W x N.

Here's the trick: if N is smaller than D, we've compressed the data, reducing the number of channels. If N is greater than D, we've done the opposite, expanding the number of channels. The spatial dimensions H and W stay the same in both cases. These operations can substantially alter the representational power of the network.
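
Because a 1x1 filter only mixes channels, the whole operation can be sketched as a matrix multiplication applied at every spatial position. The sizes below are hypothetical, the layout is assumed to be channels-last, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, N = 4, 5, 8, 3            # hypothetical sizes; N < D compresses the channels
x = rng.normal(size=(H, W, D))     # input block, channels last
weights = rng.normal(size=(D, N))  # N filters, each of size 1 x 1 x D

# At every spatial position, each filter takes a weighted sum over the D channels.
y = x @ weights
print(y.shape)  # (4, 5, 3)
```

Choosing `N` larger than `D` in the same code would expand the number of channels instead.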

**Efficiency and Regularization**:
The 1x1 convolution serves a dual purpose:

1. **Efficiency**: By changing the number of channels, we can reduce the dimensions before applying more computationally intensive operations, like larger convolutions. This can speed up the network and decrease the memory footprint during both training and inference.

2. **Regularization**: They can act as a form of learnable pooling across channels. Reducing a high number of input channels to a lower number means the larger convolutions that follow need fewer parameters. This reduction can help mitigate overfitting, especially when the available training data is limited, because the network is forced to learn a more compact representation of the data.

**Practical Use**:
These 1x1 convolutions are often sandwiched between larger convolutional layers. This design pattern is known as a bottleneck and is prevalent in modern architectures like ResNets and Inception networks. The initial 1x1 convolutions reduce the dimensionality, the subsequent larger convolutions perform the heavy lifting in terms of feature extraction, and then another set of 1x1 convolutions may scale the number of channels back up to reintegrate the extracted features.
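
A rough parameter count illustrates why the bottleneck is attractive. The channel counts below are hypothetical and biases are ignored; the point is only the relative size of the two designs.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution from c_in to c_out channels (biases ignored)."""
    return k * k * c_in * c_out

channels, reduced = 256, 64  # hypothetical channel counts

# One 3x3 convolution applied directly to all 256 channels.
direct = conv_params(3, channels, channels)

# Bottleneck: 1x1 reduce, 3x3 on the reduced channels, 1x1 expand.
bottleneck = (conv_params(1, channels, reduced)
              + conv_params(3, reduced, reduced)
              + conv_params(1, reduced, channels))

print(direct, bottleneck)  # 589824 69632
```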

In essence, 1x1 convolutions allow networks to manipulate the depth dimension of the data efficiently, providing a powerful tool for controlling the flow of information and the representational capacity of the network. They exemplify the innovative architectures that have been developed to make CNNs both powerful in their representational abilities and efficient in their use of computational resources.

## CNN building blocks: Batch normalisation

Batch normalization (batch norm) is a technique designed to make the training of deep networks more stable and faster. To understand its significance, let's first consider how deep networks are typically trained.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/Batch%20normalisation.png" width = "300" >


**The Role of Stochastic Gradient Descent**:
Deep learning networks often rely on stochastic gradient descent (SGD) or its variants for training. SGD updates the model's weights using only a subset of the data, known as a mini-batch, rather than the full dataset. This approach is practical for very large datasets where computing the loss across all examples would be computationally infeasible. However, using mini-batches introduces variability into the training process: the gradient computed on one batch may point in a slightly different direction than the gradient computed on another batch due to the different data samples contained in each.

**The Challenge of Noise**:
This variability means that the gradient descent process can be noisy, which may lead to unstable training dynamics. For example, the network's weights might be updated too aggressively in response to the noisy gradient from a particular batch, causing the learning process to diverge.

**How Batch Norm Helps**:
Batch normalization addresses this issue by normalizing the output of the previous layer for each batch. It applies a transformation that keeps the mean output close to 0 and the output standard deviation close to 1. The original batch norm paper attributes the stabilizing effect to reducing internal covariate shift, which is the change in the distribution of network activations caused by weight updates during training.

**The Mechanism of Batch Norm**:
Here’s what happens during batch norm:
1. The mean and variance are calculated for the data in the batch.
2. The data are normalized, subtracting the batch mean and dividing by the batch standard deviation.
3. The normalized data are then scaled and shifted by two learnable parameters, γ (gamma) and β (beta), which allow the network to undo the normalization if that's the optimal thing to do.

**The Benefits**:
- It allows each layer of the network to learn on a more stable distribution of inputs, which can improve the speed of training since it allows for higher learning rates.

- It provides a form of regularization, as the noise introduced by the normalization can have a slight regularization effect, reducing overfitting.

Batch norm has become a standard component in many CNN architectures because of its effectiveness in stabilizing training and allowing for faster convergence. For a more detailed understanding of batch normalization and to see it in action, refer to the lecture notes or other deep learning resources that discuss batch norm in the context of different types of neural networks.



## CNN building blocks: Batch normalisation

The table below outlines the algorithm for the batch normalization (Batch Norm) transform, a key operation used in training deep neural networks. Let's walk through the steps of this operation:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/table.png" width = "400" >

1. **Mean Calculation**: For a given mini-batch $B = \{x_1, \dots, x_m\}$, the first step is to compute the mean $\mu_B$ of the input values across the batch by summing all the inputs $x_i$ and dividing by the number of inputs $m$: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$. The mean represents the average activation across the batch for each input variable.

2. **Variance Calculation**: Next, we calculate the variance $\sigma_B^2$ of the batch. This involves taking each input value $x_i$, subtracting the batch mean $\mu_B$, squaring the result, summing these squared differences, and then dividing by the number of inputs: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$. The variance measures how much the activations vary from the mean within the batch.

3. **Normalization**: Each input $x_i$ is then normalized by subtracting the batch mean $\mu_B$ and dividing by the square root of the batch variance plus a small constant $\epsilon$. The constant sits inside the square root and prevents division by zero: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$. This step ensures that the resulting activations have a mean of 0 and a variance of approximately 1, which helps to stabilize the learning process by making sure that the scale of activations doesn't become too large or too small.

4. **Scaling and Shifting**: Finally, the normalized values are transformed using learned parameters $\gamma$ (gamma) and $\beta$ (beta): $y_i = \gamma \hat{x}_i + \beta$. These parameters allow the network to scale and shift the normalized activations in whatever way it learns is most beneficial for reducing the loss. This step gives the network the flexibility to undo the normalization if that's what's needed to get better performance.

Gamma and beta are learned for each input variable and are part of the model's parameters that are updated during training using gradient descent. The presence of $\gamma$ and $\beta$ means that batch normalization can learn to maintain the representational power of the network, even after the standardization step.
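
The four steps can be sketched in NumPy as follows. This is a training-mode forward pass only, assuming a mini-batch stored as an array of shape `(m, d)`; the function name, the default `eps`, and the example data are illustrative choices, and the running statistics used at test time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm for a mini-batch x of shape (m, d): m examples, d input variables."""
    mu = x.mean(axis=0)                    # step 1: per-variable batch mean
    var = x.var(axis=0)                    # step 2: per-variable batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))  # hypothetical activations

# With gamma = 1 and beta = 0 the output is simply the normalized input.
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # each entry close to 0
print(y.std(axis=0))   # each entry close to 1
```

Setting `gamma` to the batch standard deviation and `beta` to the batch mean would approximately recover the original inputs, which is the sense in which the network can undo the normalization.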

In frameworks like PyTorch, batch norm layers should be created in the constructor of the network class, because $\gamma$ and $\beta$ are part of the model's state that is learned from the data. Layers defined in the constructor are registered with the model, so their parameters are tracked during training and receive gradients during backpropagation.

This batch normalization process is beneficial because it counteracts the issue of internal covariate shift, where the distribution of each layer's inputs changes as the parameters of the previous layers change. By normalizing the inputs to each layer, batch normalization allows each layer to learn on a more stable distribution of inputs, which can lead to faster training and improved generalization performance.

For a more intuitive explanation of how batch normalization aids training, and for its theoretical underpinnings, see the suggested NeurIPS paper, which looks more closely at the mechanics and benefits of batch normalization in neural networks.


## CNN building blocks: Dropout for regularisation

Dropout is a regularization technique that prevents neural networks from overfitting on their training data, thus enhancing their ability to generalize to new, unseen data.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/dropout.png" width = "500" >


**How Dropout Works**:
Imagine each neuron in a neural network layer as a player on a sports team. During training, dropout randomly selects a subset of these neurons at each training step (that is, for each mini-batch, not once per epoch) and temporarily 'benches' them: their activations are set to zero, and they contribute to neither the forward pass nor backpropagation. Their connections to the next layer are effectively ignored during this training step.

The probability of any given neuron being dropped is set by a predefined parameter, typically denoted $p$. For example, with $p = 0.5$, there is a 50% chance that any given neuron will be dropped during a training step.

**The Impact on Learning**:
When neurons are dropped out, the network cannot rely on any single set of neurons to make predictions, because it does not know which neurons will be 'benched' at any given training step. This forces the network to distribute the learning across all neurons, which can lead to a more robust representation of the data. Essentially, it encourages the network to be redundant, so that it does not become too dependent on any single path of activation to make predictions.

**Training vs. Testing**:
During training, dropout is active, and neurons are randomly dropped. At test time, dropout is disabled, and all neurons contribute to processing the input. To account for the difference between the number of active neurons during training and testing, the activations are typically scaled. With drop probability $p$, each neuron is kept with probability $1 - p$, so the output of each neuron during testing is multiplied by $1 - p$. For example, if $p = 0.5$, test-time outputs are multiplied by 0.5 to balance the larger number of active neurons compared to the training phase.

**The Trade-off**:
There is a delicate balance to be struck with dropout. Too little dropout, and the network may still overfit; too much dropout, and the network's capacity to learn can be too restricted, potentially leading to underfitting. The dropout rate is often a hyperparameter that is tuned based on the performance of the network on a validation set.

**Dropout as a Masking Operation**:
Conceptually, dropout can be thought of as a hand-engineered masking operation. It randomly generates a mask where the activations of certain neurons are set to zero. This mask changes at every iteration of the training process, giving rise to a wide variety of neural pathways being trained.
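
The masking view can be sketched in a few lines of NumPy. The function `dropout_train` is a hypothetical helper that applies the mask without any rescaling; calling it twice shows that a fresh mask is drawn on every call.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Zero each activation independently with probability p (no rescaling)."""
    mask = rng.random(x.shape) >= p  # True where the neuron is kept
    return x * mask, mask

rng = np.random.default_rng(0)
activations = np.ones(10)
dropped_a, mask_a = dropout_train(activations, p=0.5, rng=rng)
dropped_b, mask_b = dropout_train(activations, p=0.5, rng=rng)
print(mask_a.astype(int))  # one random pattern of kept (1) and dropped (0) neurons
print(mask_b.astype(int))  # an independently drawn pattern for the next step
```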

By preventing co-adaptation and encouraging independent contribution from each neuron, dropout enhances the generalization power of the neural network. It is a simple yet effective tool that has been widely adopted in the training of deep neural networks.

For a deeper dive into dropout and its applications, including practical implementations in neural network training, refer to the FastAI lesson notes.

## At test time


Dropout is a regularization technique used in neural networks, typically to prevent overfitting. The concept of dropout involves randomly setting a fraction of the input units to 0 at each update during training time, which helps in making the model more robust and less prone to overfitting.

Here's a breakdown of how dropout works and the **rationale behind the adjustments made during training and test time**:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/dropout2.png" width = "700" >


1. **Dropout During Training:**
   - During training, dropout randomly deactivates a fraction $p$ of the neurons in a layer. For $p = 0.5$, this means that on average half of the neurons are turned off randomly in each training step.
   - This process forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
   - Because only a subset of neurons is active, the network's capacity is effectively reduced, helping to mitigate overfitting.

2. **Effect on Activations:**
   - When you drop out neurons, the overall signal passing through the network decreases: with drop probability $p$, the expected input to the next layer shrinks by a factor of $1 - p$. To compensate, the activations of the remaining neurons are effectively scaled up, either implicitly, as the weights adapt during training, or explicitly, by dividing the surviving activations by $1 - p$.

3. **Adjustments at Test Time:**
   - At test time, dropout is turned off, meaning all neurons are active. However, now we have a problem: since all neurons are active, the overall signal passing through the network will be higher compared to the training phase.
   - To correct for this, one approach is to scale down the activations at test time. Srivastava et al. suggested multiplying all weights during testing by the probability of keeping a neuron. Their paper writes this retention probability as $p$; in the notation used here, where $p$ is the drop probability, the factor is $1 - p$. This scales the activations down to match their expected level during training.
   - Another approach, which is used in frameworks like PyTorch, is to apply the scaling during training instead. In this case, the surviving activations are scaled up by $1 / (1 - p)$ during training, so that no scaling is needed at test time.

4. **Reason for Adjustments:**
   - The reason for these adjustments is to ensure consistency between the training phase and the test phase. During training, with dropout, the network learns with a reduced capacity. At test time, you want to use the full capacity of the network, but the activations need to be balanced to reflect the training conditions.
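
The two scaling schemes from point 3 can be compared numerically. The sketch below uses hypothetical constant activations and treats $p$ as the drop probability, so the keep probability is $1 - p$; in both schemes the average training-time signal matches the test-time signal.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                          # drop probability
x = np.full(100_000, 2.0)        # hypothetical activations, all equal to 2
keep = rng.random(x.shape) >= p  # True where a neuron is kept

# Scheme 1: plain dropout in training, scale by the keep probability at test time.
train_plain = x * keep
test_scaled = x * (1 - p)

# Scheme 2 (inverted dropout): scale up in training, leave test time untouched.
train_inverted = x * keep / (1 - p)
test_plain = x

print(train_plain.mean(), test_scaled.mean())    # both close to 1.0
print(train_inverted.mean(), test_plain.mean())  # both close to 2.0
```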

In summary, the key point of dropout is to make each neuron more robust and to force the network to learn redundant representations. The adjustments made during training and test times are necessary to maintain the effectiveness of the network, ensuring that it performs well both during training (with dropout) and at test time (without dropout).

## Should batchnorm (BN) and dropout be used together?

The interaction between dropout and batch normalization (BN) in neural networks can indeed create complexities, as these techniques affect the network's learning dynamics in different ways. Let's explore these interactions and some of the strategies used to mitigate potential issues.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/dropout3.png" width = "400" >


1. **Batch Normalization and Dropout:**
   - **Batch Normalization (BN):** BN standardizes the outputs of a layer for each mini-batch, stabilizing the learning process. It maintains a running mean and variance during training, which are used to normalize the data at test time.
   - **Dropout:** As previously discussed, dropout randomly deactivates neurons during training to prevent overfitting.

2. **Issue with Combining BN and Dropout:**
   - When used together, these methods can conflict. BN calculates the mean and variance statistics during training. If dropout is also used, these statistics may not represent the true distribution of the network's activations, especially since dropout changes the distribution of the activations with each iteration.
   - At test time, BN uses the statistics gathered during training for normalization. However, if dropout was used, these statistics might not be accurate since all neurons are now active, potentially leading to suboptimal performance.

3. **Impact on Modern Networks:**
   - This conflict has led to a decline in the use of dropout in some modern architectures, especially those heavily reliant on BN.
   - However, dropout is still effective and used in certain scenarios. For instance, placing a dropout layer after the final BN layer can be beneficial, because no later BN layer then sees the shifted variance.

4. **Wide ResNet's Approach:**
   - The Wide ResNet architecture found a novel way to integrate dropout effectively by placing it after the activation layer, not before. This ordering seems to mitigate the negative interaction between dropout and BN.
   - Placing dropout after the activation layer might preserve the effectiveness of BN's normalization by maintaining a more consistent distribution of activations.

5. **Importance of Layer Ordering:**
   - This highlights the significance of the order in which layers are arranged in a network. The sequence of operations can greatly influence the network's learning dynamics and its ultimate performance.

6. **Reference to the Paper:**
   - The paper available at [arXiv:1905.05928](https://arxiv.org/abs/1905.05928) examines the interaction between dropout and BN in more depth, along with potential solutions or workarounds.
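
The mismatch described in point 2 can be illustrated with a small simulation. The sketch assumes unit-variance Gaussian activations and inverted dropout with drop probability $p = 0.5$; these choices are for illustration only. The variance that a following BN layer would measure during training differs from the variance it encounters at test time, when dropout is switched off.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.normal(size=200_000)    # hypothetical unit-variance activations

keep = rng.random(x.shape) >= p
train_out = x * keep / (1 - p)  # inverted dropout, training mode
test_out = x                    # dropout disabled at test time

print(train_out.var(), test_out.var())  # roughly 2.0 versus roughly 1.0
```

Under these assumptions the training-time variance is about $1 / (1 - p)$ times the test-time variance, so statistics accumulated by BN during training no longer match the activations seen at test time.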

In conclusion, while dropout and batch normalization are both powerful techniques in deep learning, their interaction requires careful consideration, especially in complex network architectures. The order of these layers, as well as understanding the implications of each technique on the network's learning dynamics, is crucial for optimal performance.

### What order should I put my layers? With max pool

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers1.png" width = "400" >

The organization of layers in a neural network, particularly when using techniques like batch normalization (BN), dropout, and activation functions, can significantly affect the network's performance. Understanding the best practices for layer ordering can help in designing more effective models. Let's break down the typical layer organization:

1. **Batch Normalization Before Activation (Traditional Approach):**
   - Traditionally, and as suggested in the original batch normalization paper, BN is applied right after the convolutional layers and before the activation functions.
   - This approach allows the BN layer to normalize the outputs of the convolutional layers before they are passed through the non-linear activation function. It stabilizes the learning process by ensuring that the inputs to the activation functions do not have too high variance, which can accelerate training and improve performance.

2. **Max Pooling and Activation Ordering:**
   - Max pooling commutes with any non-decreasing activation function, such as ReLU: the maximum of the activated values equals the activation of the maximum value. This means applying the activation before max pooling yields the same result as applying max pooling before the activation. (Max pooling itself is not an element-wise operation, and this equivalence does not hold for average pooling.)
   - Despite this, the most common practice is to apply the activation function before max pooling, so that non-linearities are introduced into the feature maps before the max pooling layer subsamples them. Applying max pooling first gives the same output for such activations while evaluating the activation on fewer elements.

3. **Placement of Dropout:**
   - When incorporating dropout into a network that also uses batch normalization, the general recommendation is to place dropout after batch normalization.
   - Placing dropout before BN can disrupt the normalization process, as dropout changes the distribution of the activations during training. This can make the statistics (mean and variance) computed by BN less representative of the true distribution, potentially leading to poorer model performance.

4. **General Layer Ordering Recommendation:**
   - **Convolutional Layer** (for feature extraction)
   - **Batch Normalization** (to normalize these features)
   - **Activation Function** (to introduce non-linearity)
   - **Max Pooling** (for downsampling)
   - **Dropout** (for regularization, placed after BN)

5. **Flexibility and Model-Specific Adjustments:**
   - While these guidelines are commonly followed, it's important to remember that the optimal layer organization can vary depending on the specific architecture and the problem at hand. Experimentation and validation are key in determining the most effective configuration for any given model.
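
The ordering question for ReLU and max pooling (point 2) can be checked directly. The sketch below assumes 2x2 max pooling with stride 2 and a random hypothetical feature map; both orderings produce identical outputs.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))   # hypothetical feature map

a = max_pool_2x2(relu(x))     # activation first, then pooling
b = relu(max_pool_2x2(x))     # pooling first, then activation
print(np.array_equal(a, b))   # True
```

In the second ordering the ReLU is evaluated on 16 values instead of 64, which is why some implementations pool first.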

In summary, while there are common practices in the ordering of layers, particularly regarding batch normalization, activation functions, max pooling, and dropout, the exact organization can depend on the specific requirements and goals of the neural network being developed. Experimentation and empirical validation remain crucial in determining the most effective layer sequence.

### What order should I put my layers? Replace pool with a strided convolution


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers2.png" width = "600" >

Swapping out the pooling layer for a strided convolution is a design choice that is increasingly common in modern neural network architectures. This approach has specific implications for the organization and functioning of the network. Let's elaborate on this structure:

1. **Strided Convolution vs. Pooling:**
   - **Pooling Layers:** Traditionally, pooling layers (like max pooling) are used to reduce the spatial dimensions (width and height) of the input volume. They help in making the representation smaller and more manageable, and in reducing the number of parameters and computation in the network.
   - **Strided Convolution:** As an alternative, a convolution layer with a stride greater than one can be used. Strided convolutions reduce the spatial dimensions of the output by skipping input values based on the stride. This approach combines feature extraction (done by the convolution) and downsampling (traditionally done by pooling) into a single operation.

2. **Advantages of Strided Convolution:**
   - **Learning to Downsample:** Unlike pooling, which has a fixed downsampling method (like taking the maximum or average), strided convolutions learn how to downsample the input. This can potentially lead to better performance since the network learns the most effective way to reduce the spatial dimensions.
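   - **Efficiency:** When a strided convolution replaces a stride-1 convolution followed by pooling, the filter is evaluated at fewer positions, which reduces the computational load by combining two operations into one. Note that a strided convolution added purely as a substitute for a pooling layer still introduces learnable parameters, whereas pooling has none.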

3. **Typical Layer Structure with Strided Convolution:**
   - If you choose to use strided convolutions instead of pooling layers, the typical layer structure in a convolutional neural network might look like this:
     - **Convolutional Layer** (with stride 1 for feature extraction)
     - **Batch Normalization** (to normalize these features)
     - **Activation Function** (to introduce non-linearity)
     - **Strided Convolutional Layer** (with stride > 1 for downsampling and additional feature extraction)
     - **Dropout** (for regularization, if needed)

4. **Considerations in Using Strided Convolutions:**
   - When using strided convolutions for downsampling, it's important to consider the stride size and the kernel size carefully. These parameters will determine how the input is downsampled and can significantly affect the network's ability to capture and retain important features.
   - Strided convolutions, while efficient, may sometimes lead to the loss of certain fine-grained features. This is something to consider, especially in tasks where retaining detailed spatial information is crucial.
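
The effect of stride and kernel size on the output can be checked with the usual output-size formula, `floor((n + 2p - k) / s) + 1`. The sketch below is only illustrative: the input size of 32, the 3×3 kernels, the padding of 1, and the 64 channels are assumptions chosen for the example, not values from the lecture.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one spatial axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1

size = 32  # hypothetical input height/width

# Pooling route: 3x3 conv (stride 1, padding 1), then 2x2 max pooling (stride 2).
after_conv = conv_output_size(size, kernel=3, stride=1, padding=1)
after_pool = conv_output_size(after_conv, kernel=2, stride=2)

# Strided route: 3x3 conv (stride 1, padding 1), then 3x3 conv with stride 2.
after_conv1 = conv_output_size(size, kernel=3, stride=1, padding=1)
after_strided = conv_output_size(after_conv1, kernel=3, stride=2, padding=1)

print(after_conv, after_pool)      # 32 16
print(after_conv1, after_strided)  # 32 16

# Pooling has no learnable parameters; the strided conv does (weights + biases).
channels = 64  # hypothetical number of input and output channels
pool_params = 0
strided_conv_params = 3 * 3 * channels * channels + channels
print(pool_params, strided_conv_params)  # 0 36928
```

Both routes halve the spatial size here; the difference is that the strided route has weights to learn for the downsampling step.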

In summary, replacing pooling layers with strided convolutions is a viable and often beneficial choice in designing neural network architectures. The network learns how to downsample instead of relying on a fixed rule, but the stride and kernel size need careful tuning so that the network remains effective at feature extraction and representation.
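
To make the comparison concrete, here is a minimal NumPy sketch of both downsampling operations on a single-channel image. The helper names `conv2d` and `max_pool2d` are hypothetical, and the random kernel merely stands in for weights that would be learned during training.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
kernel = rng.normal(size=(2, 2))       # stands in for learned weights

pooled = max_pool2d(x, size=2)         # fixed rule: keep the maximum
strided = conv2d(x, kernel, stride=2)  # weighted sum with learnable weights
dense = conv2d(x, kernel, stride=1)

print(pooled.shape, strided.shape)     # (4, 4) (4, 4)

# Stride 2 is the stride-1 result evaluated at every second position.
print(np.allclose(strided, dense[::2, ::2]))  # True

# Average pooling is the special case of a strided conv with fixed weights 1/4.
avg_pool = x.reshape(4, 2, 4, 2).mean(axis=(1, 3))
avg_as_conv = conv2d(x, np.full((2, 2), 0.25), stride=2)
print(np.allclose(avg_pool, avg_as_conv))     # True
```

The last check shows why a strided convolution can be seen as a generalization of pooling: with fixed, equal weights it reproduces average pooling, and with learned weights it can do something else.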

### What order should I put my layers? Move batch-norm after dropout and after activation?


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers3.png" width = "400" >

Beyond the orderings discussed above, it is worth noting that in a well-known StackOverflow discussion on this question, some notable figures in the field advocate placing batch normalization after the ReLU activation rather than before it. There is no settled consensus, so it is worth trying both arrangements on your own task and consulting the literature and discussion forums on the topic.
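
One way to see what changes between the two orderings is to look at the statistics of what the next layer receives. The sketch below is a simplified illustration: `batch_norm` here uses batch statistics only and omits the learned scale and shift, and the random `pre_activations` are a hypothetical stand-in for convolution outputs.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Training-mode batch norm over the batch axis, without learned scale/shift."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
# Hypothetical pre-activations: a batch of 256 examples with 4 features each.
pre_activations = rng.normal(loc=2.0, scale=3.0, size=(256, 4))

bn_then_relu = relu(batch_norm(pre_activations))  # BN before the activation
relu_then_bn = batch_norm(relu(pre_activations))  # BN after the activation

# BN -> ReLU: outputs are non-negative, so their mean is above zero.
print(bn_then_relu.mean(axis=0))
# ReLU -> BN: the next layer sees inputs with zero mean and unit variance.
print(relu_then_bn.mean(axis=0), relu_then_bn.var(axis=0))
```

Neither result says which ordering trains better; it only shows that the two arrangements hand differently distributed inputs to the following layer, which is why the choice is usually settled empirically.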


## CNN building blocks: Dropout for Bayesian DL

Dropout can also be used to estimate uncertainty. Automated real-time fetal head segmentation from ultrasound images using Bayesian deep learning with Monte-Carlo Dropout (MC Dropout) is one example: it yields not only a segmentation but also a measure of how confident the model is in it. Here's an elaboration on this method and its significance:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers4.png" width = "800" >

1. **Monte-Carlo Dropout for Uncertainty Estimation:**
   - MC Dropout is a technique used in Bayesian deep learning. Unlike traditional dropout, which is used only during training for regularization, MC Dropout is employed during inference (test time) to estimate uncertainty.
   - By activating dropout during inference, the network can generate multiple predictions for the same input by randomly dropping different sets of neurons each time. This results in a variety of outputs.

2. **Predicting and Analyzing Multiple Samples:**
   - The process involves making multiple predictions (N samples) for a given input. Each prediction will be slightly different due to the randomness introduced by dropout.
   - Analyzing these variations allows for the estimation of uncertainty in the network's predictions. Regions where predictions vary greatly indicate higher uncertainty, while consistent predictions suggest greater confidence.

3. **Generating Error Bounds:**
   - The variations in predictions can be visualized as error bounds, illustrated in pink in the figure above. These bounds give a visual representation of the certainty of the network's predictions.
   - Areas with wider error bounds are those where the model's predictions vary more, indicating higher uncertainty. This is particularly useful in medical imaging, where understanding the confidence level of a prediction is crucial.

4. **Application in Fetal Head Circumference Prediction:**
   - In fetal head circumference prediction from ultrasound images, as in the paper by Samuel Budd, a current CDT student, this method can be particularly insightful.
   - By applying MC Dropout during the inference of fetal head circumference, the model not only provides the segmentation but also indicates which parts of the segmentation are less certain. This is often most noticeable at the edges of the predicted circumference, where the model might be less confident.

5. **Dropout Beyond Regularization:**
   - Although dropout has seen a decrease in popularity as a regularization tool in some modern neural network architectures, its utility in uncertainty estimation at test time remains significant.
   - This method of using dropout shifts the focus from preventing overfitting to providing a measure of the model's confidence in its predictions, which is especially important in critical applications like medical imaging.
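
The sampling procedure described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the method from the paper: the two-layer network, its random weights `W1` and `W2`, the dropout rate of 0.5, and the 100 samples are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a tiny two-layer network, standing in for a trained model.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 1))

def predict_with_dropout(x, p_drop=0.5):
    """One stochastic forward pass with dropout left ON at test time."""
    hidden = np.maximum(x @ W1, 0.0)               # ReLU hidden layer
    keep = rng.random(hidden.shape) >= p_drop      # new random mask on every call
    hidden = hidden * keep / (1.0 - p_drop)        # inverted-dropout rescaling
    return 1.0 / (1.0 + np.exp(-(hidden @ W2)))    # sigmoid probability

x = rng.normal(size=(5, 16))  # 5 hypothetical inputs with 16 features each
n_samples = 100               # N stochastic forward passes

samples = np.stack([predict_with_dropout(x) for _ in range(n_samples)])

mean_prediction = samples.mean(axis=0)  # the prediction to report
uncertainty = samples.std(axis=0)       # spread across passes = uncertainty estimate

print(samples.shape)            # (100, 5, 1)
print(mean_prediction.ravel())
print(uncertainty.ravel())
```

In a segmentation setting the same computation would be done per pixel, so `uncertainty` becomes a map that highlights regions, such as the edges of the head, where the sampled predictions disagree. In frameworks such as PyTorch, this typically means keeping the dropout layers in training mode during inference.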

In summary, the use of Monte-Carlo Dropout during inference for automated real-time fetal head segmentation from ultrasound images is an innovative approach that extends beyond mere image segmentation. It provides valuable insights into the model's certainty about its predictions, offering crucial information in medical diagnoses and treatments.



## CNN building blocks: Fully Connected Layer

The final crucial component in many neural network architectures, especially for tasks like classification and regression, is the fully connected (FC), or linear, layer. Here's an elaboration on the role of these layers and how they function:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers5.png" width = "800" >


1. **Function of Fully Connected Layers:**
   - Fully connected layers, often referred to as dense or linear layers, are a standard component in neural networks where each input node is connected to each output node.
   - The primary role of these layers is to take the high-level features learned by previous layers (like convolutional or pooling layers) and use them to make predictions or classifications.

2. **Expectation of Flattened Inputs:**
   - FC layers expect inputs in a flattened or vectorized form. This means that the multidimensional output (like the 2D feature maps from convolutional layers) must be converted into a 1D vector before being fed into a fully connected layer.
   - This flattening process essentially aligns all the learned features into a format that the FC layer can process.

3. **Parameter Reduction and Output Generation:**
   - One key function of fully connected layers is to reduce the number of features to the number of outputs required by the task.
   - In binary classification or single-output regression tasks, the final FC layer typically has a single output neuron, representing the predicted class probability (after a sigmoid) or the predicted value.
   - For multi-output predictions, such as multi-class classification, the number of neurons in the final FC layer equals the number of classes (K). After a softmax, each neuron's output can be interpreted as the network's confidence in the corresponding class.

4. **Importance in Network Architecture:**
   - Fully connected layers are often positioned towards the end of the network architecture. After the initial layers have performed feature extraction and dimensionality reduction, the FC layers focus on mapping these features to the desired output.
   - The design and number of FC layers can significantly impact the network's performance and its ability to generalize from the training data to make accurate predictions.

5. **Integration with Other Layers:**
   - In architectures that include dropout or batch normalization, these components might also be integrated before or between FC layers to improve generalization and control overfitting.
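
The flatten-then-classify step can be sketched as follows. The feature-map shape of 16×4×4, the K = 10 classes, and the random weights are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last conv/pool layer: (channels, height, width).
feature_maps = rng.normal(size=(16, 4, 4))
num_classes = 10  # K

# Flatten: the FC layer expects a 1D vector of 16 * 4 * 4 = 256 features.
flat = feature_maps.reshape(-1)

# FC layer: every input feature is connected to every output neuron.
W = rng.normal(size=(num_classes, flat.size)) * 0.01
b = np.zeros(num_classes)
logits = W @ flat + b

# Softmax turns the K outputs into class confidences that sum to 1.
exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs = exp / exp.sum()

print(flat.shape)       # (256,)
print(W.size + b.size)  # 2570 parameters in this layer
print(probs.shape)      # (10,)
```

Note that the shape of `W` depends on `flat.size`, so the layer is tied to one specific feature-map size.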

In summary, fully connected or linear layers play a crucial role in neural network architectures, especially for tasks like classification and regression. They are responsible for taking the learned high-level features and mapping them to the desired output format, whether it be for binary classification, multi-class classification, or regression tasks. Their design and placement in the network are key to the model's overall performance and effectiveness.


---

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/layers6.png" width = "700" >


A significant limitation of fully connected layers is that the number of inputs must be defined in advance, because each neuron holds one weight for every feature from the preceding layer. In effect, the size of the input data is hard-coded, which limits flexibility in input dimensions.
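
The hard-coded input size shows up directly in the shape of the weight matrix. In this small sketch the 10 output neurons and the 7×7×256 and 9×9×256 feature-map sizes are illustrative assumptions.

```python
import numpy as np

# FC weights sized for a flattened 7x7x256 feature map (12544 features).
W = np.zeros((10, 7 * 7 * 256))

# Works: the input has exactly the number of features the layer was built for.
ok = W @ np.zeros(7 * 7 * 256)
print(ok.shape)  # (10,)

# A larger image yields a 9x9x256 feature map (20736 features), which no longer fits.
try:
    W @ np.zeros(9 * 9 * 256)
    mismatch = False
except ValueError:
    mismatch = True

print(mismatch)  # True
```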


## Fully Convolutional Networks

Fully Convolutional Networks (FCNs) were developed to address the limitations posed by fully connected layers, specifically the need to hard-code input sizes. Here's how FCNs provide a more flexible alternative:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/week-9-CNN/img/cats.png" width = "500" >


1. **Replacing Fully Connected with Convolutional Layers:**

   - In traditional network architectures, fully connected (FC) layers require a **fixed input size**: each neuron has **one weight per input feature**, so the weight matrix is sized for a **specific number of input features**. This limits the network's ability to **handle variable input sizes**.

   - FCNs circumvent this limitation by replacing FC layers with convolutional layers. These layers can handle variable input sizes, as **they apply filters across the input space and are not dependent on a fixed-size input**.

2. **Example of Convolutional Replacement:**

   - Consider a scenario where the last convolutional layer in a network (before what would typically be an FC layer) has an output shape of **7×7×256.**

   - In a traditional architecture, an FC layer with **4096 neurons** might follow this. **Each neuron would be connected to all 7×7×256 features.**

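   - In an FCN, instead of this FC layer, you would use a convolutional layer with **4096 kernels, each of size 7×7×256.** Because each kernel covers the entire 7×7×256 feature map, the layer produces a **1×1×4096 output that is mathematically identical to the output of the FC layer.** Any further FC layers then become **1×1 convolutions** over this 1×1×4096 output. On a larger input, the same kernels slide across the bigger feature map and produce a spatial grid of outputs instead of a single vector.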

3. **Advantages in Flexibility and Application:**

   - This design allows the network to accept inputs of varying sizes, as the convolutional layers can adapt to different spatial dimensions.

   - This flexibility has significant implications, particularly in tasks like **semantic segmentation**. In semantic segmentation, the goal is often to process images of varying sizes and produce a pixel-wise classification of the image. **FCNs can handle different input image sizes without needing to resize them to a fixed input dimension.**

4. **Implications for Semantic Segmentation:**

   - In semantic segmentation, FCNs can process entire images, regardless of their size, and produce **segmentation maps that correspond in size to the input image** (typically after upsampling the coarse output back to the input resolution). This is a significant advantage over traditional architectures that require inputs to be a specific size.
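
The equivalence between an FC layer and a convolution whose kernels cover the whole feature map can be verified numerically. To keep the sketch fast, the sizes 3×3×4 and 5 outputs stand in for 7×7×256 and 4096; the helper `conv_valid` and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_, C, n_out = 3, 3, 4, 5  # small stand-ins for 7, 7, 256 and 4096

# FC layer weights: one row of H*W*C weights per output neuron.
fc_weights = rng.normal(size=(n_out, H * W_ * C))
# The same numbers viewed as n_out convolution kernels of size H x W x C.
kernels = fc_weights.reshape(n_out, H, W_, C)

def conv_valid(x, kernels):
    """Valid cross-correlation: x is (H, W, C), kernels is (n_out, kh, kw, C)."""
    n_k, kh, kw, _ = kernels.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, n_k))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

# Same-size input: the convolution reproduces the FC layer exactly.
x = rng.normal(size=(H, W_, C))
fc_out = fc_weights @ x.reshape(-1)
conv_out = conv_valid(x, kernels)
print(conv_out.shape)                       # (1, 1, 5)
print(np.allclose(conv_out[0, 0], fc_out))  # True

# Larger input: the FC layer would fail, the convolution returns a grid of outputs.
bigger = rng.normal(size=(5, 6, C))
bigger_out = conv_valid(bigger, kernels)
print(bigger_out.shape)                     # (3, 4, 5)
```

Each position in the `(3, 4, 5)` output is what the original FC layer would have produced on the corresponding 3×3 window of the larger feature map, which is what makes dense, per-location predictions possible.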

In summary, Fully Convolutional Networks offer a versatile alternative to traditional networks by replacing **fully connected layers with convolutional layers.** This allows them to handle inputs of varying sizes, which is particularly beneficial in applications like semantic segmentation where input size flexibility is crucial.

## CNN Summary

**Goal:**
- The primary goal of CNNs is to learn specialized convolutional filters that compress an image into a compact, task-relevant representation (representation learning). These networks are designed to identify and encode patterns in images automatically and efficiently.

**Key Building Blocks:**
1. **Convolutional Layers:** The foundational elements where filters are applied to extract features from the input.
2. **Downsampling Operations (Pooling or Striding):** These reduce the spatial dimensions of the feature maps, aiding in creating a more compact representation.
3. **Activation Functions:** Layers like ReLU introduce non-linearity, enabling the network to capture complex patterns.
4. **Batch Normalization and Dropout (Optional):** Used for regularization and to stabilize training, with batch normalization being more prevalent in modern CNNs.

**Weights are Filters:**
- In CNNs, the weights of the convolutional layers act as filters. These filters learn to recognize various patterns in the input image, such as edges or textures in the initial layers, and increasingly complex features in deeper layers. The concept of parameter sharing, where the same filter is applied across different regions of the input, allows these networks to efficiently learn spatial hierarchies and reduces the overall number of parameters.
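
The saving from parameter sharing is easy to quantify. The comparison below assumes a 28×28 grayscale input (as in MNIST) and 16 filters of size 3×3; these numbers are chosen for illustration, and `conv_params` and `fc_params` are hypothetical helper names.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a conv layer; independent of the image size."""
    return out_channels * (kernel * kernel * in_channels) + out_channels

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

h = w = 28  # hypothetical grayscale input, 28 x 28 pixels

# Conv layer: 16 filters of size 3x3, each shared across all image positions.
conv = conv_params(in_channels=1, out_channels=16, kernel=3)

# FC layer producing the same number of output values (28 * 28 * 16).
fc = fc_params(n_in=h * w * 1, n_out=h * w * 16)

print(conv)        # 160
print(fc)          # 9847040
print(fc // conv)  # 61544
```

The convolutional layer produces an output of the same size with roughly sixty thousand times fewer parameters, because each filter's weights are reused at every position.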

**Tricks:**
1. **1×1 Convolutions:** Utilized to reduce the number of filters or channels, thereby decreasing the network's parameters and enabling more efficient computation.
2. **Replacing Fully Connected Layers with Convolutional Layers:** This makes the network adaptable to varying input sizes, which is particularly beneficial for tasks requiring flexibility in input dimensions, like image segmentation.
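
A 1×1 convolution is simply the same linear map applied to the channel vector at every pixel, which the following sketch illustrates. The 7×7×256 feature map, the reduction to 64 channels, and the random weights are assumptions for the example; biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(7, 7, 256))  # hypothetical feature map (height, width, channels)
w = rng.normal(size=(256, 64))    # 64 kernels of size 1x1x256, stored as a matrix

# 1x1 convolution: a matrix product over the channel axis at every pixel.
y = x @ w
print(y.shape)                             # (7, 7, 64)
print(np.allclose(y[2, 3], x[2, 3] @ w))   # True: same map at each location

# Effect on a following 3x3 conv with 256 output channels (weights only).
params_without_reduction = 3 * 3 * 256 * 256
params_with_reduction = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 256
print(params_without_reduction)  # 589824
print(params_with_reduction)     # 163840
```

The spatial size is unchanged; only the channel dimension shrinks, which is why 1×1 convolutions are a cheap way to reduce the cost of the layers that follow.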

By focusing on these aspects, CNNs become powerful tools for processing and interpreting image data, with applications ranging from image classification to more complex tasks like semantic segmentation. The efficiency and effectiveness of CNNs are largely due to their specialized structure, leveraging convolutional filters to process spatial information in images.