

### **What is a Convolutional Neural Network (CNN)?**

A CNN is a type of artificial neural network designed to process and analyze visual data, like images. It is inspired by the way the human brain processes visual information in the visual cortex.

In the human brain, neurons in the visual cortex are specialized to detect specific patterns, such as edges, shapes, or textures. Similarly, a CNN identifies features in images through layers of connected neurons, progressively learning more complex features.

---

### **Key Steps in a CNN:**

1. **Input Image:**
   - For example, we have a grayscale image of size $6 \times 6$, where each pixel intensity value ranges from 0 to 255.
   - To normalize the pixel values, we scale them to the range 0 to 1. Because the smallest possible value is 0 and the largest is 255, min-max scaling reduces to dividing by 255:
   $$
   \text{Normalized Value} = \frac{\text{Pixel Value}}{255}.
   $$

2. **Convolution Operation:**
   - A **filter (or kernel)** is a small matrix (e.g., $3 \times 3$) that slides over the image and performs element-wise multiplication followed by summation. This operation helps detect features like edges.
   - If the input image is $n \times n$ (e.g., $6 \times 6$) and the filter is $f \times f$ (e.g., $3 \times 3$), the output size is determined by the formula:
   $$
   \text{Output Size} = (n - f + 1) \times (n - f + 1).
   $$

   - For a $6 \times 6$ input and a $3 \times 3$ filter, the output size becomes $4 \times 4$.

3. **Padding:**
   - Padding involves adding a border (of zeros or other values) around the input image to control the size of the output.
   - Without padding, the output shrinks after each convolution, and pixels near the border are covered by fewer filter positions than pixels in the center. Padding counteracts both effects.
   - The formula for output size with padding is:
   $$
   \text{Output Size} = (n + 2p - f + 1) \times (n + 2p - f + 1),
   $$
   where $p$ is the padding size.

   - For example, with $p = 1$, the $6 \times 6$ image padded becomes $8 \times 8$, and with a $3 \times 3$ filter, the output size is $6 \times 6$ (no shrinkage).

4. **Types of Padding:**
   - **Valid Padding:** No padding; results in smaller output.
   - **Same Padding:** Pads the image so the output size matches the input size.

5. **Feature Extraction:**
   - The convolutional layer extracts features, such as edges, corners, or textures, by applying multiple filters.

6. **Activation Function:**
   - After convolution, an activation function (like ReLU) introduces non-linearity, enabling the network to learn complex patterns.

7. **Pooling:**
   - Pooling layers reduce the size of the feature maps, preserving important features while reducing computational complexity.
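
The convolution, padding, and activation steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `conv2d`, the random test image, and the choice of a vertical-edge kernel are assumptions made for the example. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking, cross-correlation).

```python
import numpy as np

def conv2d(image, kernel, padding=0):
    """Stride-1 2D convolution: slide the kernel, multiply element-wise, sum."""
    if padding > 0:
        # Add a border of zeros around the image.
        image = np.pad(image, padding, mode="constant", constant_values=0)
    n_rows, n_cols = image.shape
    f_rows, f_cols = kernel.shape
    out = np.zeros((n_rows - f_rows + 1, n_cols - f_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_rows, j:j + f_cols]
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 6x6 grayscale image, normalized to [0, 1].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6)) / 255.0
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

valid = conv2d(image, kernel)             # no padding  -> 4x4
same = conv2d(image, kernel, padding=1)   # p = 1       -> 6x6
relu = np.maximum(same, 0.0)              # ReLU keeps positive responses only

print(valid.shape, same.shape)
```

The two shapes match the formulas above: $6 - 3 + 1 = 4$ without padding and $6 + 2 - 3 + 1 = 6$ with $p = 1$.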

---

### **Relation to the Human Brain:**
- Similar to how the **visual cortex** processes images through layers of neurons, a CNN processes an image in stages:
  - **Lower layers** detect simple features like edges.
  - **Higher layers** learn more complex patterns and combine features to identify objects.

This hierarchical processing mimics how our brain recognizes objects.




### **What is Stride in CNN?**

In Convolutional Neural Networks, **stride** refers to the step size at which the filter (or kernel) moves as it slides over the input image during the convolution operation.

- **Stride = 1:** The filter moves one pixel at a time, both horizontally and vertically. This results in a detailed feature map (dense output).
- **Stride > 1:** The filter skips over pixels as it moves, which reduces the size of the output feature map, effectively downsampling the input.

---

### **How Stride Affects the Output Size?**

When stride ($S$) is introduced, the output size formula for convolution changes. The modified formula becomes:

$$
\text{Output Size} = \left\lfloor \frac{n + 2p - f}{S} \right\rfloor + 1
$$

Where:
- $n$: Side length of the input image (e.g., $n = 6$ for a $6 \times 6$ image).
- $f$: Side length of the filter (e.g., $f = 3$ for a $3 \times 3$ filter).
- $p$: Padding size.
- $S$: Stride.
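
The formula translates directly into a small helper. The function name `conv_output_size` and its argument names are assumptions chosen for this sketch; integer floor division plays the role of $\lfloor \cdot \rfloor$.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4
print(conv_output_size(6, 3, p=1))   # 6
print(conv_output_size(7, 3, s=2))   # 3
```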

---

### **Example Without Padding ($p = 0$):**

- Input image: $6 \times 6$
- Filter: $3 \times 3$
- Stride: $S = 2$

Using the formula:
$$
\text{Output Size} = \left\lfloor \frac{6 - 3}{2} \right\rfloor + 1 = \lfloor 1.5 \rfloor + 1 = 1 + 1 = 2.
$$

The output feature map will be $2 \times 2$.

---

### **Effect of Stride on Padding Formula:**

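When $S > 1$, the padding must be chosen with the stride in mind to reach a desired output size (e.g., to keep the same spatial dimensions). The output size is still given by:

$$
\text{Output Size} = \left\lfloor \frac{n + 2p - f}{S} \right\rfloor + 1
$$

For example, to keep the output the same size as the input ($\text{Output Size} = n = 6$) with filter size $f = 3$ and stride $S = 2$, solve the formula for $p$ (ignoring the floor):
$$
p = \frac{S \times (\text{Output Size} - 1) - n + f}{2} = \frac{2 \times 5 - 6 + 3}{2} = 3.5.
$$
Since $p$ must be a whole number, round up to $p = 4$ (or pad unevenly on the two sides); the formula then gives $\lfloor 11 / 2 \rfloor + 1 = 6$.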
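
As a quick numerical check of the padding formula, the sketch below solves for $p$ and plugs the result back into the output-size formula. The helper names `exact_padding` and `output_size` are assumptions for this example, and rounding up with `math.ceil` is one possible way to handle a fractional result.

```python
import math

def exact_padding(n, f, s, out):
    """Solve out = (n + 2p - f) / s + 1 for p; the result may be fractional."""
    return (s * (out - 1) - n + f) / 2

def output_size(n, f, p, s):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

n, f, s = 6, 3, 2
p_exact = exact_padding(n, f, s, out=n)   # 3.5, not a valid padding size
p = math.ceil(p_exact)                    # round up to the next whole pixel

print(p_exact, p, output_size(n, f, p, s))   # 3.5 4 6
```

Rounding down instead ($p = 3$) would give an output of size 5, so the padding has to be rounded up to preserve the input size here.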

---

### **Summary:**

- **Stride** determines how much the filter skips while convolving over the image.
- A higher stride reduces the output size, resulting in downsampling.
- When $S > 1$, the padding formula needs to account for stride to maintain the desired output size.





---

### **How Filter Values are Updated in a Convolutional Neural Network (CNN)**

The process of updating **filter values** in a **Convolutional Neural Network (CNN)** is conceptually similar to updating weights in an Artificial Neural Network (ANN). Let's break down the steps involved:

---

### **1. Initializing Filter Values**
- Filters (or kernels) are small matrices (e.g., $3 \times 3$ or $5 \times 5$) initialized with random values at the start of the training process.
- Each filter is designed to learn specific features (like edges, corners, or textures) from the input image.

---

### **2. Forward Pass**
- The input image is convolved with the filter(s) to produce **feature maps**.
- These feature maps are then passed through an **activation function** (e.g., ReLU) that introduces non-linearity.
- The results are forwarded through subsequent layers, leading to predictions.

---

### **3. Calculating the Loss**
- A **loss function** (e.g., Mean Squared Error, Cross-Entropy Loss) measures the difference between the predicted output and the actual target label.
- The loss represents how far the network’s prediction is from the desired output.

---

### **4. Backpropagation to Update Filter Values**
The process of updating filter values is similar to how weights are updated in an ANN. The steps are as follows:

#### **Step 1: Compute Gradients**
- During backpropagation, the loss is propagated backward through the network.
- Gradients of the loss with respect to the filter values are computed using the **chain rule of differentiation**.
- These gradients indicate how much each filter contributed to the error.

#### **Step 2: Update Filter Values**
- Filter values are updated using **Gradient Descent** or one of its variants (e.g., Adam, RMSprop). For plain gradient descent, the update rule is:

$$
W_{\text{new}} = W_{\text{old}} - \eta \cdot \frac{\partial \text{Loss}}{\partial W}
$$

Where:
- $W_{\text{old}}$: Current filter values.
- $\eta$: Learning rate (a small constant that controls the step size).
- $\frac{\partial \text{Loss}}{\partial W}$: Gradient of the loss with respect to the filter values.

#### **Step 3: Repeat Until Convergence**
- A single update rarely brings the filter values to an optimal state, so the process is repeated over many iterations (one update per batch) and multiple epochs (full passes over the training data).
- Over time, filters learn to detect specific features that minimize the loss.
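
The forward pass, loss, gradient, and update steps can be sketched for a single $3 \times 3$ filter. This is a toy setup under stated assumptions: the random image, the target feature map (produced by a known vertical-edge kernel), the mean squared error loss, the learning rate, and all function names are hypothetical choices for illustration, and a real CNN would rely on a framework's automatic differentiation.

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1, no-padding convolution (kernel is not flipped)."""
    f = kernel.shape[0]
    size = image.shape[0] - f + 1
    out = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

def loss_and_grad(image, kernel, target):
    """Return 0.5 * mean squared error and its gradient w.r.t. the kernel."""
    pred = conv2d(image, kernel)
    error = pred - target
    loss = 0.5 * np.mean(error ** 2)
    d_pred = error / error.size              # dLoss/dPred
    rows, cols = error.shape
    grad = np.zeros_like(kernel)
    for a in range(kernel.shape[0]):
        for b in range(kernel.shape[1]):
            # Chain rule: kernel[a, b] multiplies image[i + a, j + b] at every output (i, j).
            grad[a, b] = np.sum(d_pred * image[a:a + rows, b:b + cols])
    return loss, grad

rng = np.random.default_rng(0)
image = rng.random((6, 6))                                # hypothetical input
true_kernel = np.array([[1, 0, -1]] * 3, dtype=float)     # filter we hope to recover
target = conv2d(image, true_kernel)

kernel = rng.normal(scale=0.1, size=(3, 3))   # random initialization
lr = 0.1                                      # learning rate (eta)
losses = []
for step in range(200):
    loss, grad = loss_and_grad(image, kernel, target)
    losses.append(loss)
    kernel = kernel - lr * grad               # W_new = W_old - eta * dLoss/dW

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

The loss shrinks across iterations, which is the "repeat until convergence" behavior described above.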

---

### **How Do Filters Adapt to the Input Image?**
- The gradients are computed based on the specific input image (or batch of images).
- This ensures that the filters adapt to the patterns in the training data. For example:
  - In early layers, filters might learn to detect simple features like edges or corners.
  - In deeper layers, filters may learn more complex features, such as shapes or even entire objects.

---

### **Summary**
1. Filters in CNNs are analogous to weights in ANNs.
2. Filters are initialized randomly and updated during training using backpropagation and gradient descent.
3. The goal is to adjust filter values so they can optimally extract features from input images to minimize prediction errors.

---





### **Filters vs. Pooling in a Convolutional Neural Network (CNN)**

Both **filters** and **pooling** are crucial in a **Convolutional Neural Network (CNN)**, but they serve **different purposes**. Let's break down each of them:

---

### **Filters in CNN**

Filters (or kernels) are essential components of the **convolution operation**.

#### **Purpose of Filters:**
- Filters are used to **extract features** from the input image.
- They detect patterns such as **edges**, **corners**, **textures**, and other features, which become more complex as the CNN deepens.

#### **How Filters Work:**
- A filter slides over the input image, performing **element-wise multiplication** and summation (this is the **convolution** operation).
- This produces a **feature map**, which highlights specific features that the filter is trained to detect.

---

### **Pooling in CNN**

**Pooling** is a separate operation applied **after** the convolution operation. It serves to **reduce the spatial size** of the feature maps.

#### **Purpose of Pooling:**
- To **reduce dimensionality**, which lowers the number of parameters needed in later layers and makes the model more efficient.
- To make the network **invariant to small translations or distortions** in the input image (helps generalize).
- To reduce the **computational complexity** and help **reduce overfitting**.

#### **How Pooling Works:**
- Pooling involves sliding a window (e.g., $2 \times 2$) over the feature map and **summarizing** the values in that window.
- Common types of pooling:
  - **Max Pooling**: Takes the **maximum value** in each window.
  - **Average Pooling**: Takes the **average** of all values in each window.

For example, consider the following input feature map:

$$
\begin{bmatrix}
1 & 3 & 2 & 1 \\
4 & 6 & 5 & 2 \\
7 & 8 & 9 & 4 \\
3 & 2 & 1 & 5
\end{bmatrix}
$$

After applying $2 \times 2$ **Max Pooling** (stride = 2), the resulting feature map is:

$$
\begin{bmatrix}
6 & 5 \\
8 & 9
\end{bmatrix}
$$
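
The same example can be reproduced in NumPy. The sketch below assumes non-overlapping windows (stride equal to the window size) and uses a reshape so that each $2 \times 2$ block gets its own pair of axes; the function name `pool2d_nonoverlapping` is an assumption for this example.

```python
import numpy as np

def pool2d_nonoverlapping(x, size=2, mode="max"):
    """Pool with stride == window size by grouping the array into blocks."""
    rows, cols = x.shape
    if rows % size or cols % size:
        raise ValueError("input dimensions must be divisible by the window size")
    # blocks[i, a, j, b] == x[i * size + a, j * size + b]
    blocks = x.reshape(rows // size, size, cols // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 8, 9, 4],
                        [3, 2, 1, 5]])

print(pool2d_nonoverlapping(feature_map, mode="max"))
# [[6 5]
#  [8 9]]
print(pool2d_nonoverlapping(feature_map, mode="avg"))
# [[3.5  2.5 ]
#  [5.   4.75]]
```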

---

### **Key Differences Between Filters and Pooling**

| **Aspect**          | **Filters**                                      | **Pooling**                                  |
|---------------------|-------------------------------------------------|----------------------------------------------|
| **Purpose**          | Extract features (edges, textures, etc.)        | Reduce spatial size and retain essential features. |
| **Operation**        | Convolution (element-wise multiplication + sum) | Aggregation (max or average of a window)     |
| **Output**           | Feature maps with extracted patterns            | Downsampled feature maps                    |
| **Learnable?**       | Yes, filter values are updated during training  | No, pooling is a fixed operation            |

---

### **Are Both Present in CNN?**

Yes, both **filters (convolution layers)** and **pooling layers** are present in CNNs, and they **work together**:

1. **Filters (Convolution Layers):**
   - Extract specific features from the input image.
   - Multiple filters are applied to the same input image, each detecting a different feature (e.g., edges, corners).

2. **Pooling Layers:**
   - Downsample the feature maps produced by the filters.
   - Reduce the spatial dimensions and computational load while preserving essential information.

---

### **Workflow in CNN:**

1. Input Image →  
2. **Convolution Layer (Filters)**: Extract features →  
3. **Activation Function (e.g., ReLU)**: Add non-linearity →  
4. **Pooling Layer**: Reduce feature map size →  
5. Repeat for multiple layers.


---





### Main Purpose of Pooling

1. **Retaining Dominant Features:**
   - Pooling focuses on retaining the **most prominent or dominant features** from a feature map.
   - For example, in an image of three cats, pooling will help preserve key details (like the outline of a cat’s ear or eyes) while reducing less important information (like background noise).

2. **Dimensionality Reduction:**
   - Pooling reduces the size of the feature map, making the model computationally efficient.
   - This reduced size helps prevent overfitting by simplifying the learned patterns.

3. **Translation Invariance:**
   - Pooling makes the CNN more robust to minor shifts or distortions in the input image. For example, even if a cat’s ear moves slightly, max pooling will still capture the maximum value in that region, preserving the feature.

4. **Why Max Pooling?**
   - **Max pooling** selects the largest value in the pooling window, assuming the most important feature in that region is represented by the maximum value.
   - **Average pooling** computes the average, which smooths the feature map and is less commonly used for feature extraction.

---

### Stride in Pooling vs. Convolution

**Stride** in pooling works similarly to stride in convolution. Here’s how:

1. **Stride in Convolution:**
   - The filter (e.g., $3 \times 3$) moves across the image in steps defined by the stride value.
   - Example: For stride = 1, the filter moves one pixel at a time. For stride = 2, it moves two pixels at a time, skipping every other position.

2. **Stride in Pooling:**
   - The pooling window (e.g., $2 \times 2$) moves across the feature map in the same manner.
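   - If **stride = 1**, consecutive pooling windows overlap, and a $2 \times 2$ window shrinks the output by only one row and one column.  
   - If **stride = 2**, the window moves two pixels at a time, so $2 \times 2$ windows do not overlap and the output size is roughly halved.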

#### Example of Max Pooling with Stride:
- Input Feature Map:
  $$
  \begin{bmatrix}
  1 & 3 & 2 & 4 \\
  5 & 6 & 7 & 8 \\
  9 & 10 & 11 & 12 \\
  13 & 14 & 15 & 16
  \end{bmatrix}
  $$

- Pooling Window: $2 \times 2$, **Stride = 2** (no overlap).  
  The pooling operation slides across the feature map, selecting the maximum value in each window:
  $$
  \begin{bmatrix}
  \text{Max}(1, 3, 5, 6) & \text{Max}(2, 4, 7, 8) \\
  \text{Max}(9, 10, 13, 14) & \text{Max}(11, 12, 15, 16)
  \end{bmatrix}
  =
  \begin{bmatrix}
  6 & 8 \\
  14 & 16
  \end{bmatrix}
  $$

- **Output Feature Map:** $2 \times 2$.
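
To see the effect of the stride directly, the sketch below implements max pooling with an explicit `stride` argument and applies it to the feature map above. The function name `max_pool2d` and its defaults are assumptions for this example.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling with a square window and a configurable stride."""
    out_rows = (x.shape[0] - size) // stride + 1
    out_cols = (x.shape[1] - size) // stride + 1
    out = np.empty((out_rows, out_cols), dtype=x.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride          # top-left corner of the window
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]])

print(max_pool2d(feature_map, size=2, stride=2))
# [[ 6  8]
#  [14 16]]
print(max_pool2d(feature_map, size=2, stride=1))
# [[ 6  7  8]
#  [10 11 12]
#  [14 15 16]]
```

With stride 1 the windows overlap and the output is $3 \times 3$; with stride 2 they do not overlap and the output is $2 \times 2$.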

---

### Key Differences Between Filters and Pooling

| **Aspect**          | **Filter (Convolution)**                          | **Pooling**                                 |
|----------------------|--------------------------------------------------|---------------------------------------------|
| **Purpose**          | Detect specific patterns/features (e.g., edges). | Downsample while retaining important features. |
| **Operation**        | Element-wise multiplication and summation.       | Aggregation (max or average in a window).   |
| **Stride**           | Moves the filter across the input image.         | Moves the pooling window across the feature map. |
| **Learnable?**       | Yes, filter values are updated during training.   | No, pooling uses fixed operations.          |

---

### Summary

- **Pooling Layers** do not detect features like filters but help summarize and reduce the feature maps by keeping the most significant values (e.g., max pooling).
- **Stride** in pooling works just like in filters, controlling how far the pooling window moves at each step.
- Pooling and filters work together in CNNs: filters extract features, and pooling reduces their size while preserving key information.

---



### **What is Flattening?**

The flattening layer is used to convert the multi-dimensional output from previous layers (like convolutional and pooling layers) into a **1D vector**. This vector can then be used as input for **dense (fully connected) layers** in the network.

#### **Example:**
- After passing through convolution and pooling layers, we get a feature map. For example, let's say the output from the pooling layer is a $4 \times 4 \times 3$ matrix (4x4 spatial dimensions with 3 channels).
- The flattening layer takes this $4 \times 4 \times 3$ matrix and reshapes it into a 1D vector with $4 \times 4 \times 3 = 48$ elements:
  $$
  [x_1, x_2, x_3, \dots, x_{48}]
  $$

---

### **Purpose of the Flattening Layer:**

1. **Connecting Convolutional Layers to Dense Layers:**
   - **Convolutional and pooling layers** work to extract **spatial** and **hierarchical features** of the input data (such as edges, textures, shapes).
   - The **dense layers** then perform high-level reasoning and decision-making, like classifying objects or regressing values.
   - **Flattening** acts as the intermediary, converting these spatial features into a format (1D vector) that the dense layer can process.

2. **Preparation for Classification/Regression:**
   - In classification tasks, the flattened vector is passed to the dense layers, where it is transformed into a probability distribution over the classes (e.g., "cat" or "dog").
   - For regression, the dense layer could output a continuous value, like the price of an item.

---

### **How Does Flattening Work?**

To illustrate, let's use a simple example:

- **Input:** A feature map of size $4 \times 4 \times 3$.
- **Flattening operation:** The flattening layer simply rearranges the elements of this array into a **1D vector**.

If a single $4 \times 4$ channel of the feature map looks like this:
$$
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}
$$
After flattening, that channel becomes:
$$
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
$$
- With all three channels, the flattened vector has $4 \times 4 \times 3 = 48$ elements. This transformation prepares the data for the next steps in the neural network, where the dense layer will perform computations based on the flattened vector.
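
In NumPy, flattening is a reshape. The sketch below assumes row-major (C-order) layout, which is NumPy's default; the placeholder values in the three-channel array are hypothetical and only serve to show the resulting shape.

```python
import numpy as np

# One 4x4 channel with the values 1..16, as in the matrix above.
channel = np.arange(1, 17).reshape(4, 4)
flat_channel = channel.reshape(-1)        # row by row
print(flat_channel)
# [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]

# A full 4x4x3 feature map (placeholder values) flattens to 48 elements.
feature_map = np.arange(48).reshape(4, 4, 3)
flat = feature_map.reshape(-1)
print(flat.shape)   # (48,)
```

The order of elements in the flattened vector depends on the memory layout, but any fixed order works as long as it is used consistently during training and inference.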

---

### **Flattening vs Dense Layer**

While the flattening layer reshapes the data, the dense layer does the actual computation and decision-making:

| **Aspect**           | **Flattening**                              | **Dense Layer**                              |
|----------------------|---------------------------------------------|---------------------------------------------|
| **Purpose**           | Reshape multi-dimensional data into 1D.     | Perform computations to make predictions.   |
| **Learnable?**        | No, it’s just a reshaping operation.         | Yes, weights and biases are updated during training. |
| **Role in CNN**       | Bridge between convolutional and dense layers. | Final layer(s) for decision-making.         |

#### **Dense Layer (Fully Connected Layer):**
- After flattening, the **dense layer** uses the 1D vector as input and performs **linear transformations** followed by activation functions (e.g., ReLU, softmax).
- The output of a dense layer is computed as:
  $$
  y = W \cdot x + b
  $$
  where:
  - $W$ is the weight matrix,
  - $x$ is the input vector (the flattened data),
  - $b$ is the bias vector.
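
A minimal sketch of this computation for a two-class classifier follows. The input vector, the randomly initialized weights, the number of classes, and the helper names `dense` and `softmax` are all assumptions for illustration; in a trained network, $W$ and $b$ would come from backpropagation.

```python
import numpy as np

def dense(x, W, b):
    """Linear transformation y = W . x + b."""
    return W @ x + b

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(48)                          # hypothetical flattened 4x4x3 feature map
W = rng.normal(scale=0.1, size=(2, 48))     # one row of weights per class
b = np.zeros(2)

probs = softmax(dense(x, W, b))
print(probs, probs.sum())
```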

---

### **Workflow Example in CNN:**

The following steps demonstrate a typical workflow in a Convolutional Neural Network (CNN):

1. **Input Image:** An image is passed to the network.
2. **Convolution + Pooling Layers:** Extract relevant features from the image (such as edges, textures).
3. **Flattening Layer:** Converts the multi-dimensional feature map into a 1D vector.
4. **Dense Layers:** Perform classification (e.g., "cat" or "dog") or regression (e.g., predicting a value).

---

### **Example:**

Let’s assume you are working on a simple **image classification** task with the CNN:
1. **Input Image:** A $32 \times 32 \times 3$ image.
2. **Convolution + Pooling:** The network extracts features and reduces the size of the feature map.
3. **Flattening Layer:** Suppose the resulting feature map is $16 \times 16 \times 8$; it is flattened into a vector of size $16 \times 16 \times 8 = 2048$.
4. **Dense Layer:** This vector is passed to a dense layer to output a probability distribution across classes.

---



Let's take an example of a small $6 \times 6$ grayscale image and walk through the steps of convolution, pooling, flattening, and connecting to dense layers. Here's the pixel data of the input image (values range from 0 to 255):

### **Input Image (6x6):**
$$
\begin{bmatrix}
34 & 78 & 120 & 200 & 180 & 90 \\
65 & 90 & 150 & 210 & 190 & 80 \\
100 & 130 & 170 & 220 & 200 & 100 \\
95 & 120 & 160 & 215 & 185 & 85 \\
75 & 100 & 145 & 195 & 175 & 70 \\
60 & 80 & 110 & 180 & 165 & 55
\end{bmatrix}
$$

---

### **Step 1: Normalize Pixel Values**
Normalize each pixel value to a range of 0 to 1 by dividing by 255.  
For example, the normalized value of 34 is $\frac{34}{255} \approx 0.133$.

$$
\begin{bmatrix}
0.133 & 0.306 & 0.471 & 0.784 & 0.706 & 0.353 \\
0.255 & 0.353 & 0.588 & 0.824 & 0.745 & 0.314 \\
0.392 & 0.510 & 0.667 & 0.863 & 0.784 & 0.392 \\
0.373 & 0.471 & 0.627 & 0.843 & 0.725 & 0.333 \\
0.294 & 0.392 & 0.569 & 0.765 & 0.686 & 0.275 \\
0.235 & 0.314 & 0.431 & 0.706 & 0.647 & 0.216
\end{bmatrix}
$$
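
The normalized matrix can be reproduced with a single NumPy division. This is a small sketch; the variable names are assumptions, and the values are rounded to three decimals only for display.

```python
import numpy as np

image = np.array([
    [34, 78, 120, 200, 180, 90],
    [65, 90, 150, 210, 190, 80],
    [100, 130, 170, 220, 200, 100],
    [95, 120, 160, 215, 185, 85],
    [75, 100, 145, 195, 175, 70],
    [60, 80, 110, 180, 165, 55],
], dtype=np.float64)

normalized = image / 255.0          # scale every pixel to [0, 1]
print(np.round(normalized, 3))
```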

---

### **Step 2: Convolution**
Use a $3 \times 3$ filter with the following weights:
$$
\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix}
$$

Stride = 1, no padding.

#### Apply Convolution:
1. Place the filter on the top-left $3 \times 3$ region of the normalized image:
   $$
   \begin{bmatrix}
   0.133 & 0.306 & 0.471 \\
   0.255 & 0.353 & 0.588 \\
   0.392 & 0.510 & 0.667
   \end{bmatrix}
   $$

2. Multiply element-wise with the filter and sum:
   $$
   (0.133 \cdot 1) + (0.306 \cdot 0) + (0.471 \cdot -1) +
   (0.255 \cdot 1) + (0.353 \cdot 0) + (0.588 \cdot -1) +
   (0.392 \cdot 1) + (0.510 \cdot 0) + (0.667 \cdot -1)
   $$
   Result: $0.133 - 0.471 + 0.255 - 0.588 + 0.392 - 0.667 = -0.946$.

3. Repeat for each $3 \times 3$ region. Final output (feature map) size is $4 \times 4$ (using formula $n - f + 1$):

$$
\begin{bmatrix}
-0.946 & \dots & \dots & \dots \\
\dots & \dots & \dots & \dots \\
\dots & \dots & \dots & \dots \\
\dots & \dots & \dots & \dots
\end{bmatrix}
$$
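
The full $4 \times 4$ feature map can be computed with NumPy's `sliding_window_view` (available in NumPy 1.20 and later), which exposes every $3 \times 3$ window of the image at once. This is a sketch under the same settings as above (stride 1, no padding); the variable names are assumptions. Because it uses unrounded normalized values, the top-left entry comes out as about $-0.945$, slightly different from the hand calculation with values rounded to three decimals.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([
    [34, 78, 120, 200, 180, 90],
    [65, 90, 150, 210, 190, 80],
    [100, 130, 170, 220, 200, 100],
    [95, 120, 160, 215, 185, 85],
    [75, 100, 145, 195, 175, 70],
    [60, 80, 110, 180, 165, 55],
], dtype=float) / 255.0

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# windows[i, j] is the 3x3 patch whose top-left corner is at (i, j).
windows = sliding_window_view(image, kernel.shape)     # shape (4, 4, 3, 3)
feature_map = (windows * kernel).sum(axis=(2, 3))      # multiply and sum per window

print(feature_map.shape)    # (4, 4)
print(np.round(feature_map, 3))
```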

---

### **Step 3: Pooling**
Apply $2 \times 2$ Max Pooling (stride = 2) on the feature map.

1. Take each $2 \times 2$ block and find the maximum value.
2. Result: Output size is $2 \times 2$.

---

### **Step 4: Flattening**
Flatten the $2 \times 2$ matrix into a 1D vector:
$$
[x_1, x_2, x_3, x_4]
$$

---

### **Step 5: Dense Layer**
Connect the flattened vector to a dense layer, where each value contributes to the final output through weights and biases.

---




### **Step 1: Pooling**
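Continuing the $6 \times 6$ example, we take the **feature map** produced by the convolution and apply $2 \times 2$ max pooling with a stride of 2.

Assume the **feature map after convolution** is as follows. Only the top-left entry ($-0.946$) was computed in Step 2 of the previous example; the remaining entries are illustrative values chosen for this walkthrough, not the actual convolution results: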

$$
\text{Feature Map (4x4):}
\begin{bmatrix}
-0.946 & -0.371 & 0.215 & 0.489 \\
0.158 & 0.345 & 0.678 & 0.789 \\
0.527 & 0.391 & 0.812 & 0.923 \\
0.234 & 0.412 & 0.653 & 0.789
\end{bmatrix}
$$

#### **Max Pooling (2x2, stride=2):**
We apply $2 \times 2$ pooling, taking the **maximum value** in each $2 \times 2$ block.

1. **Top-left block:**
   $$
   \begin{bmatrix}
   -0.946 & -0.371 \\
   0.158 & 0.345
   \end{bmatrix}
   $$
   Maximum = $0.345$.

2. **Top-right block:**
   $$
   \begin{bmatrix}
   0.215 & 0.489 \\
   0.678 & 0.789
   \end{bmatrix}
   $$
   Maximum = $0.789$.

3. **Bottom-left block:**
   $$
   \begin{bmatrix}
   0.527 & 0.391 \\
   0.234 & 0.412
   \end{bmatrix}
   $$
   Maximum = $0.527$.

4. **Bottom-right block:**
   $$
   \begin{bmatrix}
   0.812 & 0.923 \\
   0.653 & 0.789
   \end{bmatrix}
   $$
   Maximum = $0.923$.

#### **Output After Pooling (2x2):**
$$
\begin{bmatrix}
0.345 & 0.789 \\
0.527 & 0.923
\end{bmatrix}
$$

---

### **Step 2: Flattening**
We convert the $2 \times 2$ matrix into a **1D vector** by unraveling the matrix row by row.

1. Start with the first row: $[0.345, 0.789]$.
2. Add the second row: $[0.527, 0.923]$.

#### **Flattened Vector:**
$$
[0.345, 0.789, 0.527, 0.923]
$$

---

### **Step 3: Connect to Dense Layer**
The flattened vector becomes the input for the dense layer. Each value in the vector will be multiplied by a weight, added to a bias, and passed through an activation function.

For example:
$$
\text{Dense Layer Input: } [0.345, 0.789, 0.527, 0.923]
$$
This vector will be processed by the dense layer for the final output (e.g., classification probabilities).
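
To make the last step concrete, the sketch below flattens the pooled $2 \times 2$ matrix and feeds it to a single dense neuron with a sigmoid activation. The weights `w` and bias `b` are hypothetical values chosen only for illustration; in a trained network they would be learned.

```python
import numpy as np

pooled = np.array([[0.345, 0.789],
                   [0.527, 0.923]])

x = pooled.reshape(-1)                   # flatten row by row -> [0.345, 0.789, 0.527, 0.923]

w = np.array([0.5, -0.25, 0.1, 0.2])     # hypothetical weights, one per input value
b = 0.1                                  # hypothetical bias

z = w @ x + b                            # weighted sum plus bias
output = 1.0 / (1.0 + np.exp(-z))        # sigmoid squashes z into (0, 1)

print(round(float(z), 5))                # 0.31255
print(float(output))
```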

---