CNNs, filters, and deep learning
This class covers convolutional neural networks (CNNs). If you have any trouble running your code locally, you can find the full notebook here.

Question 1: CNN Intro (2)


Engineer two prompts for ChatGPT or a similar LLM: 



One that produces a good high-level explanation of how a CNN works, and 
One that produces a good step-by-step explanation of the mechanism of convolution and how you might use convolution to transform images.


Include your entire dialog with the LLM below, including both of these prompts and your suggestions for improvements.



Once you've primed it with your two prompts, ask the LLM this follow-up question:



In 150 words or less, describe the "tied weights" property of CNNs. Specifically, how are CNNs just neural networks where the weights are "shared" between features? Why is the chain rule of differentiation especially important when calculating parameters in deep networks such as a convolutional neural network?


Q: Give me a high level explanation on how a cnn works (100 words)

A: A Convolutional Neural Network (CNN) is a deep learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from the other. The pre-processing required in a CNN is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, CNNs have the ability to learn these filters/characteristics. The architecture of a CNN is analogous to that of the connectivity pattern of Neurons in the Human Brain and was inspired by the organization of the Visual Cortex.



Q: Now, give me a detailed, step by step explanation of how a CNN works

A: Certainly! Convolutional Neural Networks (CNNs) are a class of deep neural networks, most commonly applied to analyzing visual imagery. They have revolutionized the field of computer vision, achieving remarkable performance in tasks such as image classification, object detection, and more. Here's a detailed, step-by-step explanation of how CNNs work:



1. **Input Layer:**

  - The input layer holds the raw pixel values of the image. For instance, an image of size 200x200 pixels will have 40,000 neurons in the input layer, with each neuron representing the intensity of a pixel.



2. **Convolution Layer:**

  - The primary purpose of the convolution layer is to detect features such as edges, colors, or textures in the input image.

  - Filters (also known as kernels) slide (or "convolve") across the input image and perform element-wise multiplication with the part of the image they are covering.

  - The sum of the multiplication is taken, and the result forms a single pixel in the output feature map (or activation map).

  - Multiple filters are used to extract different features, resulting in multiple feature maps.



3. **ReLU (Rectified Linear Unit) Layer:**

  - This layer applies a non-linear activation function to the output of the convolution layer. The most common function used is ReLU, which replaces all negative pixel values in the feature map with zero.

  - The purpose of ReLU is to introduce non-linearity into the system, allowing the network to solve more complex problems.



4. **Pooling (Subsampling or Downsampling) Layer:**

  - This layer reduces the spatial dimensions (width and height) of the input volume for the next convolutional layer. It reduces the number of parameters, which helps in reducing computation and also helps in making the detection of features invariant to scale and orientation changes.

  - Max pooling is a common technique used in pooling where the maximum element from the rectified feature map is taken.



5. **Fully Connected (FC) Layer:**

  - After several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers.

  - Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activation can hence be computed with a matrix multiplication followed by a bias offset.



6. **Flatten Layer:**

  - The Flatten Layer converts the final feature maps into a single 1D vector. This flattening step is needed so that you can make use of fully connected layers after some convolution/maxpool layers.



7. **Output Layer:**

  - The last fully connected layer outputs the final probabilities for each class in a classification task (like softmax function), or the final features in a feature extraction task.



8. **Backpropagation:**

  - To learn the weights of the filters, CNNs use backpropagation, a form of supervised learning. After the forward pass, the network will compute the loss (i.e., the difference between the predicted output and the actual output).

  - The network then uses this loss to update the filter values in a way that minimizes the loss, using an optimization technique like gradient descent.



By stacking several convolutional and pooling layers, CNNs are able to learn hierarchical feature representations of the input data. Lower layers might learn to detect simple features like edges and curves, while deeper layers can recognize complex features like faces or objects. This hierarchical feature extraction makes CNNs very effective for tasks like image recognition.
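
As a side note on the answer above, the forward pass it describes (convolution, ReLU, pooling, flatten, fully connected, softmax) can be sketched in a few lines of NumPy. This is only a minimal illustration: the 8x8 input, the two random filters, and the three output classes are made-up stand-ins for real data and learned weights, and no training (backpropagation) is performed.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# Toy 8x8 grayscale "image" and two 3x3 filters (random stand-ins for learned weights).
image = rng.normal(size=(8, 8))
filters = rng.normal(size=(2, 3, 3))

# Convolution layer: one feature map per filter (8 - 3 + 1 = 6, so each map is 6x6).
feature_maps = np.stack(
    [signal.correlate2d(image, f, mode="valid") for f in filters]
)

# ReLU: negative activations become zero.
activated = np.maximum(feature_maps, 0.0)

# 2x2 max pooling with stride 2: each 6x6 map shrinks to 3x3.
n_maps, height, width = activated.shape
pooled = activated.reshape(n_maps, height // 2, 2, width // 2, 2).max(axis=(2, 4))

# Flatten, then a fully connected layer mapping 18 features to 3 hypothetical classes.
flat = pooled.reshape(-1)
fc_weights = rng.normal(size=(3, flat.size))
fc_bias = np.zeros(3)
logits = fc_weights @ flat + fc_bias

# Softmax turns the logits into class probabilities that sum to one.
exp_logits = np.exp(logits - logits.max())
probabilities = exp_logits / exp_logits.sum()

print(feature_maps.shape, pooled.shape, probabilities)
```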



Q: In 150 words or less, describe the "tied weights" property of CNNs. Specifically, how are CNNs just neural networks where the weights are "shared" between features? Why is the chain rule of differentiation especially important when calculating parameters in deep networks such as a convolutional neural network?



A: "Tied weights" in CNNs refer to the concept of weight sharing, a key property that differentiates them from fully connected neural networks. In a CNN, the same filter (a set of weights) is applied across different parts of the input, effectively sharing the weights. This means that instead of learning separate weights for every pixel in the image, the network learns filters that detect features (like edges or textures) irrespective of their position in the input. This significantly reduces the number of parameters, making CNNs more efficient and less prone to overfitting.



The chain rule of differentiation is crucial in training deep networks like CNNs because it allows for the computation of gradients with respect to the weights during backpropagation. In a CNN, due to the depth and weight sharing, gradients of the loss function with respect to the weights are calculated by applying the chain rule iteratively through each layer. This enables the network to update its weights (including shared weights) in a way that minimizes the loss, effectively learning the feature representations in a hierarchical manner.



Critique: 

While the answer succinctly encapsulates the essence of "tied weights" in CNNs and their implications, such as parameter efficiency and reduced overfitting, some aspects of it could be improved. For instance, a more concrete example or analogy could be provided to illustrate the concept of weight sharing. 

Additionally, the role of the chain rule could be clarified further. While it mentions how important it is for computing gradients, it would be beneficial to emphasize that the chain rule enables the propagation of error information from the output back to each layer, which is essential for adjusting the shared weights appropriately during training.
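
To make the weight sharing and chain rule points concrete, here is a small numerical sketch. It assumes a single 3x3 filter slid over a 6x6 random image and a made-up loss (half the sum of squared activations). Because the same nine weights are reused at every position, the chain rule gives each weight a gradient that is the sum of the contributions from all positions. The sketch checks that against a finite-difference estimate, then compares parameter counts for the 200x200 example mentioned in the dialog.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))


def loss(k):
    """Half the sum of squared activations of one filter slid over the image."""
    out = signal.correlate2d(image, k, mode="valid")  # 4x4 feature map
    return 0.5 * np.sum(out ** 2)


# Chain rule with shared weights:
# dL/dk[a, b] = sum over positions (i, j) of dL/dout[i, j] * image[i + a, j + b]
out = signal.correlate2d(image, kernel, mode="valid")
upstream = out  # dL/dout for the loss defined above
grad_chain_rule = signal.correlate2d(image, upstream, mode="valid")  # 3x3

# Finite-difference estimate of the same gradient, one weight at a time.
eps = 1e-6
grad_numeric = np.zeros_like(kernel)
for a in range(3):
    for b in range(3):
        step = np.zeros_like(kernel)
        step[a, b] = eps
        grad_numeric[a, b] = (loss(kernel + step) - loss(kernel - step)) / (2 * eps)

print(np.allclose(grad_chain_rule, grad_numeric, atol=1e-5))

# Parameter counts for a 200x200 input: one shared 3x3 filter (plus bias)
# versus a dense layer with one output per pixel.
shared_params = kernel.size + 1
dense_params = (200 * 200) * (200 * 200) + 200 * 200
print(shared_params, dense_params)
```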





Core Question: Convolutions (3)
In the readings, you looked at how a convolution layer is simply a "stack" of filters applied over an image to create a family of "new" images. In this pre-class work, we'll be looking at the "one filter" level to build an intuition of how convolution works. Here, we use the "raccoon face" image that ships as a sample image with SciPy.

In [None]:
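# modified from the docs 
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate2d.html
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt

try:
    # SciPy >= 1.10 provides the sample image in scipy.datasets
    # (face() needs the optional pooch package to download it)
    from scipy.datasets import face as load_face
except ImportError:
    # older SciPy releases expose it as scipy.misc.face
    from scipy.misc import face as load_face

# load the grayscale raccoon once and subtract the mean so values are centered on zero
face = load_face(gray=True).astype(float)
face -= face.mean()
plt.imshow(face, cmap="Greys")
plt.colorbar()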

A convolution is essentially a "template matching" process: we give it a "mini-image", and it searches for instances of that mini-image in a larger image. Here, let's extract the right eye of the raccoon to use as a template.

In [None]:
# copy the slice so that centering the template does not modify `face` in place
template = face[300:365, 670:750].copy()
template -= template.mean()

plt.imshow(template, cmap="Greys")
plt.colorbar()

Now, we "convolve" this template over the image. Strictly speaking, machine learning uses "cross-correlation" rather than convolution. The two are closely related: a convolution flips the template (top to bottom and left to right) before sliding it, so it effectively searches for flipped instances of the template, whereas cross-correlation searches for the template as provided. The naming mismatch is largely historical: convolutions arose in the signal processing community first, and the term carried over into machine learning.



Tl;dr: in machine learning, when we say "convolution" we really mean "cross-correlation". And as we see, there is a sharp activation (darker is stronger) near the right eye of the raccoon, right where the original template was taken from!

In [None]:
corr = signal.correlate2d(face, template, boundary='symm', mode='same')
plt.imshow(corr, cmap="Greys")
plt.colorbar()
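
The convolution versus cross-correlation distinction described above can be checked numerically. The sketch below uses a small random array as a stand-in for the image and an arbitrary asymmetric 3x3 kernel (both are made up for illustration). It compares `signal.correlate2d` with `signal.convolve2d` applied to the flipped kernel and to the unflipped kernel.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10))

# An asymmetric kernel, so that flipping it actually changes it.
kernel = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])

corr = signal.correlate2d(image, kernel, mode="same")

# Convolution with the kernel flipped along both axes...
conv_flipped = signal.convolve2d(image, kernel[::-1, ::-1], mode="same")
# ...and convolution with the kernel as given.
conv_plain = signal.convolve2d(image, kernel, mode="same")

print("correlate == convolve with flipped kernel:", np.allclose(corr, conv_flipped))
print("correlate == convolve with same kernel:   ", np.allclose(corr, conv_plain))
```

For a kernel that is unchanged by the flip, the two operations would agree, which is one reason the terms are often used interchangeably.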

Core Question 2a: 
What is this image? Why are there strong activations in locations other than the right eye (where the template is from), such as the left eye and the nose?

The image is the output of template matching, which is used to locate and recognize parts of an image that match a particular pattern or template. 

The heatmap shows the matching scores. With the "Greys" colormap used here, darker areas indicate a stronger match with the template.

The strong responses at locations other than the right eye, where the template comes from, suggest that the left eye and the nose share features with the right eye (similar edges, shapes, or textures) that the simple template matching algorithm responds to. This can result in false positives, highlighting a limitation of basic template matching techniques.
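
As a rough illustration of how the match location can be read off the score map, the sketch below repeats the template matching on a synthetic random image (a stand-in for the raccoon, so that it runs without downloading anything). The patch position and size are arbitrary choices. With `mode="same"` and an odd-sized template, the highest score should land on the center of the region the template was cut from, while every other location still receives some nonzero score.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
image = rng.normal(size=(60, 60))

# Cut an 11x11 patch out of the image and mean-center it, as done for the eye template.
top, left, size = 20, 35, 11
template = image[top:top + size, left:left + size].copy()
template -= template.mean()

# Slide the template over the image and locate the strongest response.
scores = signal.correlate2d(image, template, mode="same")
peak_row, peak_col = np.unravel_index(np.argmax(scores), scores.shape)

expected_center = (top + size // 2, left + size // 2)
print("peak at", (peak_row, peak_col), "expected", expected_center)
```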


We could also try other "filters": the following is known as the Sobel operator, a classic edge detector in computer vision:

In [None]:
template = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])

plt.imshow(template, cmap="Greys")
plt.colorbar()

When applied over the original image, we have:

In [None]:
corr = signal.correlate2d(face, template, boundary='symm', mode='same')
plt.imshow(corr, cmap="Greys")
print("Operator 1")
plt.colorbar()

Question 2b: Is this a horizontal or a vertical edge detector? Why have you chosen the answer that you have?



Hint: Look at the Sobel operator itself above (the 3x3 grid image). In which way is it oriented? What "pattern" would activate it?



This is a vertical edge detector. It responds to changes in intensity along the horizontal direction (left to right), and those are the changes that mark a vertical edge.



Rows (Top to Bottom):



The top row ([1, 0, -1]) has a positive weight on the left and a negative weight on the right, so it computes the left pixel minus the right pixel. It responds to any light-dark transition from left to right along the top of the area it covers.

The middle row ([2, 0, -2]) computes the same left-minus-right difference but with doubled weights, so the row passing through the center pixel counts the most.

The bottom row ([1, 0, -1]) is identical to the top row, so it responds to the same left-to-right transitions along the bottom of the covered area.



Columns (Left to Right):



The left column ([1, 2, 1]) has all positive values, so bright pixels to the left of the center increase the response.

The middle column ([0, 0, 0]) has zero values, so the center column of pixels is ignored.

The right column ([-1, -2, -1]) mirrors the left column with negative values, so bright pixels to the right of the center decrease the response.



Because of this, when the filter is applied (convolved) over the image, areas where the intensity changes strongly from left to right produce strong responses: positive when the left side is brighter, negative when the right side is brighter. These areas correspond to vertical edges within the image. Where the intensity changes only from top to bottom (a horizontal edge), the left and right columns see the same values and cancel, so the response is zero.
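
One way to check the orientation empirically is to apply the kernel to two synthetic images, one containing only a vertical edge and one containing only a horizontal edge. The 8x8 step images below are made up for this purpose. Only the kernel comes from the notebook.

```python
import numpy as np
from scipy import signal

sobel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])

# Dark on the left, bright on the right: the boundary runs top to bottom (a vertical edge).
vertical_edge = np.zeros((8, 8))
vertical_edge[:, 4:] = 1.0

# Dark on top, bright on the bottom: the boundary runs left to right (a horizontal edge).
horizontal_edge = vertical_edge.T

response_to_vertical = signal.correlate2d(vertical_edge, sobel, mode="valid")
response_to_horizontal = signal.correlate2d(horizontal_edge, sobel, mode="valid")

print("max |response| on vertical edge:  ", np.abs(response_to_vertical).max())
print("max |response| on horizontal edge:", np.abs(response_to_horizontal).max())
```

The kernel gives a nonzero response only on the image whose intensity changes from left to right.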



Question 2c: We've only looked at grayscale images here (that is, only one channel), but most real-world images are in color! Based on the Shafkat (2018) reading, how is the match with a "filter/kernel" computed when the image has more than one "channel"?

According to Shafkat (2018), when dealing with color images, which typically have three channels (Red, Green, and Blue), the matching with a filter or kernel is done by extending the filter to have the same depth as the number of channels. It can be done as follows:



Channel Extension - The filter is extended across the depth of the input image. For a color image with three channels, this means we will have a 3-dimensional filter. If the original grayscale filter is 3x3, the color filter would be 3x3x3.



Filter Application - At each location on the image, every channel of the filter is multiplied element-wise with the matching channel of the image patch underneath it.



Summation - The products are summed within each channel and then across channels (plus a bias term) to produce a single value. This summation collapses the channel dimension, so one filter yields a two-dimensional array as the output.



Activation Map Creation - The single value from the summation is then placed in the corresponding location of the activation map (feature map).



Repetition - This process is repeated for each location on the input image, sliding the filter across the image both horizontally and vertically.



Multiple Filters - If multiple filters are applied, each one will produce a separate activation map. When all the filters have been applied, the activation maps are stacked along the depth dimension to produce the output volume.



This enables the network to learn filters that can detect features specific to a particular channel combination (such as color and texture).
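
The steps above can be sketched with NumPy and SciPy. This is a minimal illustration under assumed shapes: a random 3-channel 6x6 "image" stored channels-first and a single random 3x3x3 filter, with the bias term left out. It computes one feature map by correlating each channel with its own slice of the filter and summing the results, then checks one output value against a direct sum over the 3x3x3 patch.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# Hypothetical color image (channels, height, width) and one filter of matching depth.
image = rng.normal(size=(3, 6, 6))
kernel = rng.normal(size=(3, 3, 3))

# Correlate each channel with its own kernel slice, then sum across channels.
per_channel = [
    signal.correlate2d(image[c], kernel[c], mode="valid") for c in range(3)
]
feature_map = np.sum(per_channel, axis=0)  # a single 4x4 map: the channel axis is gone

# Direct check at one location: multiply the 3x3x3 patch by the filter and sum everything.
i, j = 1, 2
direct = np.sum(image[:, i:i + 3, j:j + 3] * kernel)

print(feature_map.shape, np.isclose(feature_map[i, j], direct))
```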

Extension Question: Sobel operators + Gradients (4)
We've only looked at one of the two Sobel operators here. Read through the wiki page on Sobel operators, implement the other kernel, and apply it over the image:

What does the raccoon image look like with the other kernel applied?
How is it different from the one in question 2?

The first kernel (from question 2) measures changes in intensity along the horizontal direction, so it responds strongly where brightness changes from left to right, which is where the vertical edges are.



The second kernel measures changes in intensity along the vertical direction, so it responds strongly where brightness changes from top to bottom. The second image therefore emphasizes horizontal edges.



The differences between the two images show the orientation of the edges they are designed to detect. The first operator responds more to vertical lines and edges (like the sides of the raccoon's face), while the second responds more to horizontal lines and edges (like the transitions around the eyes and the nose).

In [None]:
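# the other Sobel kernel: the transpose of the one used in question 2
template = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])
plt.imshow(template, cmap="Greys")
plt.colorbar()

# start a new figure so the filtered image is not drawn over the kernel plot
plt.figure()
corr = signal.correlate2d(face, template, boundary='symm', mode='same')
plt.imshow(corr, cmap="Greys")
print("Operator 2")
plt.colorbar()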

Once you have the other filter applied, you can combine both filtered images with the following code to get the image gradient. What does it look like (visualize it with imshow) and how is it different from either filter in isolation?

The image shows the gradient magnitude of the edges in the raccoon image by combining horizontal and vertical edge detections. 

Unlike the images produced by the individual horizontal or vertical filters, this combined image highlights the edges in all directions. The result is a more complete representation of the edges and textures in the image, with increased detail and sharper contrast at the boundaries of shapes. 

This makes the outlines of the raccoon's features more defined against the background, compared to either filter used in isolation. However, it may also amplify noise, making the image appear rougher.

In [None]:
# fill in here
# Sobel kernels: kernel_x measures left-to-right change, kernel_y measures top-to-bottom change
kernel_x = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
kernel_y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])

grad_x = signal.correlate2d(face, kernel_x, boundary='symm', mode='same')
grad_y = signal.correlate2d(face, kernel_y, boundary='symm', mode='same')

# gradient magnitude: per-pixel Euclidean norm of the two filter responses
gradient = np.sqrt(grad_x**2 + grad_y**2)

# visualize here
plt.imshow(gradient, cmap='gray')
plt.title('Gradient Magnitude')
plt.colorbar()
plt.show()

#MLDevelopment
Review the class Notion and do some synthesis: write some useful summary notes for a previous topic from the first half of class and upload them to the class Notion. Include your synthesis below.

SESSION 9: METRICS AND CROSS-VALIDATION

Some key points from that class include:



Training, Validation, and Test Sets 

  - The training set is used for the model to learn and identify patterns.

  - The validation set helps tune the model's hyperparameters and compare different models.

  - The test set is used for the final evaluation, providing an estimate of the model's performance on unseen data.



Performance Metrics

  - Accuracy - measures the overall correctness of the model.

  - Precision - gauges the correctness of positive predictions.

  - Recall - (or Sensitivity) assesses the coverage of actual positive cases.

  - F1 Score - combines precision and recall into a single metric, giving a balance of both.

  - ROC Curve - plots True Positive Rate against False Positive Rate at various threshold settings.

  - AUC - the area under the ROC curve; it equals the probability that the model scores a randomly chosen positive example above a randomly chosen negative one.
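
As a small worked example of the threshold-based metrics above, the sketch below computes accuracy, precision, recall, and F1 from scratch. The ten labels and predictions are invented purely for illustration.

```python
import numpy as np

# Hypothetical ground truth and predictions for ten examples (1 = positive class).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Confusion-matrix counts.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```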



Cross-Validation

  - K-Fold Cross-Validation - divides the data into k subsets, using k-1 for training and the remaining one for validation, rotating until each subset has been used for validation.

  - Leave-One-Out Cross-Validation (LOOCV) - takes k to the extreme, where k equals the number of observations.

  - Procedure for cross-validation - includes shuffling the data, splitting it into groups, using each group in turn as the held-out set, and summarizing the model's skill across groups.

  - Data preparation and hyperparameter tuning should occur within the cross-validation loop to prevent data leakage.
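
The procedure above can be sketched in plain NumPy. The helper `k_fold_indices` and the random data are hypothetical and written only for this illustration. In practice a library routine would normally be used. The loop standardizes each fold using statistics from the training portion only, which is the kind of in-loop preparation that avoids leakage.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# Made-up feature matrix: 20 observations, 3 features.
X = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=(20, 3))

splits = list(k_fold_indices(len(X), k=5))
for train_idx, val_idx in splits:
    # Fit the preprocessing on the training fold only, then apply it to both parts.
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    X_train = (X[train_idx] - mean) / std
    X_val = (X[val_idx] - mean) / std
    # A model would be fit on X_train and scored on X_val here.
    print(len(train_idx), len(val_idx))
```

Setting `k` equal to the number of observations turns the same loop into LOOCV.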



Bias-Variance Tradeoff 

  - Bias - refers to errors due to overly simplistic assumptions in the learning algorithm (underfitting).

  - Variance - refers to errors due to the model being overly sensitive to the particular training set, typically because it is too complex (overfitting).



Points to Consider

  - Certain methods like the Validation Set Approach can give highly variable estimates, depending on which observations land in each split, and can overestimate the test error.

  - AUC doesn't account for the distribution of classes and treats all errors equally.

  - Cross-validation techniques can be computationally expensive, particularly LOOCV.

