# I- Intro to PyTorch and Convolution


In [18]:
import numpy as np
from matplotlib import pyplot as plt
import PIL.Image as I
import torch
from scipy.fftpack import fft2, ifft2,ifftshift, fftshift  
import cv2

#### Question 1 (Pseudo-Code)
Write a pseudo-code snippet that performs the following steps:
1. Assume you are given a NumPy array named color_image with a shape of (H, W, C) where H=512, W=512, and C=3.
2. Assume you are also given a NumPy array named gray_patch with a shape of (P, P) where P=64. 
3. Insert the gray_patch into the top-left corner of the color image. Since the patch is grayscale, its values should be replicated across all three color channels of the target region in color_image.

In [24]:
# Load the image as a (512, 512, 3) uint8 array; PIL images do not support slice assignment.
color_image = np.array(I.open("/Users/ryanqchiqache/PycharmProjects/Machine-Learning-Learning-Center/ComputerVisionCourse/Exercise02/saturn.png").convert("RGB").resize((512, 512)))
# Build a (64, 64) grayscale patch, here a downscaled grayscale copy of the same image.
gray_patch = np.array(I.fromarray(color_image).convert("L").resize((64, 64)))
# Replicate the single channel three times: (64, 64) -> (64, 64, 3).
replicated_patch = np.stack([gray_patch] * 3, axis=-1)
color_image[0:64, 0:64, :] = replicated_patch

#### Question 2 (Free-Form)
- In image processing pipelines, it is common to convert image data between different data
types and ranges. 
- Explain why you would normalize an 8-bit image array (with pixel values
in the range [0, 255]) to a floating-point array (with values in the range [0, 1]) before applying
a filter like a Gaussian blur.

In [None]:
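# NOTE We normalize 8-bit images to [0, 1] floats before filtering for two reasons: integer arithmetic on uint8 wraps around or truncates (sums overflow past 255 and fractional kernel weights are lost), and float inputs in a fixed [0, 1] range match what most filter implementations expect.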
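
The overflow problem can be seen with just two pixels. This is a small sketch with made-up values (the array `pixels` is hypothetical), comparing an average computed in uint8 with one computed on normalized floats.

```python
import numpy as np

# Two bright 8-bit pixels; their true average is 225.
pixels = np.array([200, 250], dtype=np.uint8)

# uint8 arithmetic wraps modulo 256: 200 + 250 = 450 becomes 194.
wrapped_avg = (pixels[:1] + pixels[1:]) // 2

# Normalize to [0, 1] floats first, then average, then map back to 8 bits.
as_float = pixels.astype(np.float32) / 255.0
float_avg = (as_float[:1] + as_float[1:]) / 2
restored = np.round(float_avg * 255).astype(np.uint8)

print(wrapped_avg[0], restored[0])
```

The integer path yields 97 while the float path recovers 225.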

#### Question 3 (Pseudo-Code)
 - Most deep learning frameworks like PyTorch expect image data in the format [Batch,
Channels, Height, Width] (B, C, H, W)
 - while libraries like Pillow and Matplotlib often work with [Height, Width, Channels] (H, W, C). 
  - Write a pseudo-code function convert_hwc_to_bchw(image_array) that takes a single image as a NumPy array in HWC format and converts it into a PyTorch tensor suitable for a model, with a batch size of 1.

In [33]:
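def convert_hwc_to_bchw(image_array):
    # (H, W, C) -> (C, H, W)
    tensor = torch.from_numpy(image_array).permute(2, 0, 1)
    # Add a batch dimension: (C, H, W) -> (1, C, H, W)
    bchw = tensor.unsqueeze(0)
    return bchw.to(dtype=torch.float32)

# randint's upper bound is exclusive, so 256 covers the full uint8 range
np_array = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)

print(convert_hwc_to_bchw(np_array).shape)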

torch.Size([1, 3, 128, 128])


#### Question 4 (Free-Form)
- Explain the two primary differences between a NumPy ndarray and a PyTorch Tensor that make Tensors more suitable for deep learning.

In [None]:
# NOTE PyTorch tensors are better suited for deep learning because they support GPU acceleration and automatic differentiation for backpropagation.

#### Question 5 (Pseudo-Code)
- Implement a function convolve_2d(image, kernel) from scratch using basic array operations (e.g., loops and element-wise multiplication). 
- The function should perform a 2D cross-correlation (as is standard in CNNs).
- Inputs: A 2D single-channel image (H x W NumPy array) and a 2D kernel (K x K NumPy array).
- Output: A 2D filtered image.
- Assumptions: Use a stride of 1 and no padding. The output image size will be smaller than the input.

In [34]:
def conv_2d(img: np.ndarray, kernel: np.ndarray, stride=1, bias=0.0):
    H, W = img.shape
    kH, kW = kernel.shape

    # Output size without padding; the parentheses matter when stride > 1
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1

    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            region = img[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Cross-correlation: the kernel is not flipped
            output[i, j] = np.sum(region * kernel) + bias

    return output
           
# Example

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

kernel = np.array([[1, 0],
                   [0, -1]])

out = conv_2d(img, kernel, stride=1, bias=0)
print(out)


[[-4. -4.]
 [-4. -4.]]


#### Question 6 (Free-Form)

The following formula defines a 2D Gaussian filter kernel:

$$f(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

1. **What is the primary effect of applying a Gaussian filter to an image?**  
   →  "It smooths the image by averaging pixel intensities with a weighted kernel, reducing noise and fine detail."

2. **How does changing the value of σ (sigma) affect this outcome?**  
   → " A larger σ results in more blurring, as the kernel becomes wider and includes more neighboring pixels in the averaging process. A smaller σ keeps more detail."
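
To make the effect of σ concrete, here is a minimal sketch that samples the formula above on a discrete grid. The helper name `gaussian_kernel` and the 7x7 size are assumptions for illustration; the kernel is renormalized so its weights sum to 1.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sample the 2D Gaussian on a size x size grid (size assumed odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    # Renormalize so the discrete weights sum to 1 (keeps image brightness).
    return kernel / kernel.sum()

narrow = gaussian_kernel(7, sigma=0.5)  # weight concentrated on the center
wide = gaussian_kernel(7, sigma=2.0)    # weight spread over the neighbors

print(round(narrow[3, 3], 3), round(wide[3, 3], 3))
```

With the small σ almost all weight sits on the center pixel, so the image barely changes; with the large σ the center weight drops and neighbors contribute much more, which is the stronger blur.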

#### Question 7 (Free-Form)
- A Laplace filter is often used for edge detection. A common 3x3 kernel for the Laplacian operator is:
[[ 0, 1, 0],
[ 1,-4, 1],
[ 0, 1, 0]]
- Explain briefly, in terms of what the filter calculates, why this kernel highlights edges and regions of rapid intensity change in an image. 
- (Hint: Think about what the filter approximates mathematically).

In [35]:
# NOTE This kernel highlights edges because it approximates the **second spatial derivative** (the Laplacian, ∂²I/∂x² + ∂²I/∂y²) of the image.
# NOTE It computes the sum of the four direct neighbors minus four times the central pixel. The result is zero in flat or linearly varying regions and large in magnitude where the intensity changes abruptly, i.e., at edges.
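
A small numerical check of this behavior, as a sketch: the test images `ramp` and `step` are made up, and `correlate_valid` is a hypothetical helper doing an unpadded cross-correlation.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def correlate_valid(image, kernel):
    """Unpadded cross-correlation with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Linear ramp in x: constant first derivative, zero second derivative.
ramp = np.tile(np.arange(6, dtype=float), (6, 1))
# Step edge between columns 2 and 3.
step = np.zeros((6, 6))
step[:, 3:] = 10.0

ramp_response = correlate_valid(ramp, LAPLACIAN)
step_response = correlate_valid(step, LAPLACIAN)
print(step_response[0])
```

The ramp gives an all-zero response, while the step gives a positive and a negative peak on either side of the edge and zero elsewhere.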

#### Question 8 (Free-Form)

**Two major limitations of hand-crafted filters:**

1. They are **fixed and manually designed**, so they can’t adapt to complex patterns or data variability.
2. They lack **learnability**, meaning they can’t improve over time or extract higher-level features.

**CNNs overcome these by:**
- Learning optimal filters during training through backpropagation.
- Stacking multiple layers to extract hierarchical features (edges → textures → objects).


#### Question 9 (Free-Form)

- conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, padding=2)
1.  Explain the purpose of in_channels and out_channels:
     - in_channels=3: Specifies the number of channels in the input image (e.g., RGB).
     - out_channels=32: Specifies the number of filters (feature maps) the layer will learn, resulting in 32 output channels.

2. What is the shape of the learnable weight tensor in conv_layer :
     - conv_layer.weight.shape → [32, 3, 5, 5] where 32 is the output channels (filters), 3 input channels per filter, and 5x5 kernel size

#### Question 10 (Free-Form)
- An input tensor with shape [16, 3, 128, 128] (Batch, Channels, Height, Width) is passed to the conv_layer of Question 9. Calculate the shape of the output tensor.
- out_dim = (input_dim + 2 * padding - kernel_size) / stride + 1 = (128 + 2 * 2 - 5) / 1 + 1 = 128
- out_shape = [16, 32, 128, 128]
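
The same arithmetic as a small helper, as a sketch in plain Python. The function names are hypothetical and dilation is assumed to be 1.

```python
def conv2d_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (dilation assumed to be 1)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def conv2d_output_shape(input_shape, out_channels, kernel_size, stride=1, padding=0):
    """Map a (B, C, H, W) input shape to the (B, C_out, H_out, W_out) output shape."""
    batch, _, height, width = input_shape
    return (
        batch,
        out_channels,
        conv2d_output_size(height, kernel_size, stride, padding),
        conv2d_output_size(width, kernel_size, stride, padding),
    )

shape = conv2d_output_shape((16, 3, 128, 128), out_channels=32, kernel_size=5, padding=2)
print(shape)
```

With kernel_size=5 and padding=2 the spatial size is preserved, which is why this combination is often called "same" padding.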

#### Question 11 (Free-Form)
- Q: The output of a custom convolution implementation in NumPy can be identical to the output of PyTorch's nn.Conv2d layer (when using the same kernel).
 - What does this imply about the fundamental operation that nn.Conv2d performs? 
  - Why is using the PyTorch layer vastly more powerful and efficient in a deep learning context?
#### Answer :
- If a custom NumPy convolution matches the output of nn.Conv2d (with the same kernel), it means that nn.Conv2d performs the same fundamental operation — a cross-correlation over the input tensor.
- However, PyTorch's Conv2d is far more powerful because:
    - It supports GPU acceleration for large-scale, efficient training.
    - It integrates with PyTorch's autograd system, enabling automatic differentiation and backpropagation.

# II-  Fourier Transformation

#### Question 1
- Q: The Discrete Fourier Transform (DFT) of an image produces a complex-valued result. What are the two components called, and what does each represent in terms of the image's frequency content?

- A: The two components are the magnitude spectrum, representing the strength of frequencies, and the phase spectrum, representing the spatial arrangement of those frequencies in the image.

#### Question 2
- Q: What is the purpose of fftshift when visualizing the 2D Fourier transform of an image? Where is the zero-frequency (DC) component located before and after applying fftshift?

- A: fftshift moves the DC component from the top-left corner (default position) to the center of the frequency spectrum, making visualization of frequency content more intuitive.
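
A minimal check of where the DC component sits, sketched with `np.fft` (an assumption; the notebook imports the equivalent functions from `scipy.fftpack`). A constant image has all its energy at zero frequency, so the largest coefficient marks the DC position.

```python
import numpy as np

# Constant image: the only non-zero Fourier coefficient is the DC term.
image = np.ones((4, 6))
spectrum = np.fft.fft2(image)

def peak_index(array):
    """Return the (row, col) of the largest magnitude as plain ints."""
    idx = np.unravel_index(np.argmax(np.abs(array)), array.shape)
    return tuple(int(v) for v in idx)

dc_before = peak_index(spectrum)                  # top-left corner
dc_after = peak_index(np.fft.fftshift(spectrum))  # center of the array

print(dc_before, dc_after)
```

For an array of shape (4, 6) the DC term moves from (0, 0) to (2, 3), i.e., to index n // 2 along each axis.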

#### Question 3
 - Q: What would the centered Fourier magnitude spectrum look like for Image A (horizontal stripes) and Image B (vertical stripes)
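 - A: Image A varies only in the vertical direction, so its spectrum shows bright points along the vertical axis through the center. Image B varies only in the horizontal direction, so its bright points lie along the horizontal axis. The two spectra are 90° rotations of each other.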
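
A quick numerical check of where the spectral peaks land for striped images. This is a sketch with synthetic cosine stripes (4 cycles across a 32x32 image, both values chosen arbitrarily) and `np.fft` standing in for the `scipy.fftpack` functions.

```python
import numpy as np

size = 32
rows = np.arange(size)

# Horizontal stripes: intensity depends only on the row index.
profile = np.cos(2 * np.pi * 4 * rows / size)
image_a = np.tile(profile[:, None], (1, size))
# Vertical stripes: the transpose.
image_b = image_a.T

def peak_positions(image):
    """Positions of the dominant peaks in the centered magnitude spectrum."""
    magnitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    peaks = np.argwhere(magnitude > 0.5 * magnitude.max())
    return sorted((int(r), int(c)) for r, c in peaks)

peaks_a = peak_positions(image_a)
peaks_b = peak_positions(image_b)
print(peaks_a, peaks_b)
```

The center of the shifted spectrum is at (16, 16). Horizontal stripes put their two peaks in column 16 (the vertical axis), and vertical stripes put them in row 16 (the horizontal axis).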

#### Question 4
- Q: How are the magnitude and phase spectra affected by the following transformations?

- Answers : 
- Rotating the image by 30 degrees: rotates both magnitude and phase spectra.
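- Flipping horizontally: mirrors both spectra along the horizontal frequency axis (F(u, v) becomes F(-u, v), up to a linear phase term that depends on the flip axis); the magnitude values are unchanged, only rearranged.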
- Increasing contrast: scales the magnitude spectrum; phase remains unchanged.

#### Question 5
- Q: Write a pseudo-code function fourier_denoise(image, threshold) that denoises an image by keeping only the strongest frequency components.

In [36]:
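def fourier_denoise(image, threshold):
    F = fft2(image)
    F_shifted = fftshift(F)

    # Boolean mask that keeps only components whose magnitude exceeds the threshold
    mask = np.abs(F_shifted) > threshold

    F_filtered = F_shifted * mask
    F_ishifted = ifftshift(F_filtered)
    denoised = ifft2(F_ishifted)

    # The input is real, so any imaginary part is numerical noise
    return np.real(denoised)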
    

#### Question 6
- Q: What is the difference between a low-pass filter and a high-pass filter in the Fourier domain? and what is it used for ?

- A: A low-pass filter preserves low frequencies (center of the spectrum) and is used for blurring or denoising. A high-pass filter preserves high frequencies (edges of the spectrum) and is used for edge detection.
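
Both filters can be sketched with one circular mask in the shifted spectrum. The helper names, the hard (ideal) cutoff, and the random test image are assumptions for illustration; practical filters usually use a smooth roll-off to avoid ringing.

```python
import numpy as np

def radial_mask(shape, cutoff):
    """True inside a circle of radius `cutoff` around the spectrum center."""
    h, w = shape
    y, x = np.ogrid[:h, :w]
    distance = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2)
    return distance <= cutoff

def frequency_filter(image, cutoff, keep="low"):
    """Ideal low-pass (keep='low') or high-pass (keep='high') filter."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    mask = radial_mask(image.shape, cutoff)
    if keep == "high":
        mask = ~mask  # complement: everything outside the circle
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

rng = np.random.default_rng(0)
image = rng.random((16, 16))
low = frequency_filter(image, cutoff=4, keep="low")
high = frequency_filter(image, cutoff=4, keep="high")
```

Because the two masks are complementary, `low + high` reproduces the input, and the high-pass output has zero mean since the DC term is removed.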

#### Question 7
- Q: What is aliasing in image down-sampling, and what does the Nyquist-Shannon theorem say about avoiding it?
- A: Aliasing occurs when high frequencies are misrepresented as low ones. To avoid it, the sampling rate must be at least twice the highest frequency present in the image.

#### Question 8
- Q: What essential pre-processing step is required before down-sampling an image?
- A: Low-pass filtering is needed to remove high-frequency content that would cause aliasing when the image is resized.
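
Aliasing and the effect of pre-filtering are easiest to see in 1D. The sketch below uses a made-up cosine with 24 cycles over 64 samples and an ideal Fourier-domain low-pass in place of a Gaussian blur (a deliberate simplification).

```python
import numpy as np

n = np.arange(64)
signal = np.cos(2 * np.pi * 24 * n / 64)  # 24 cycles over 64 samples

def dominant_frequency(x):
    """Index of the strongest bin in the one-sided spectrum."""
    return int(np.argmax(np.abs(np.fft.rfft(x))))

original_peak = dominant_frequency(signal)

# Keep every second sample: only frequencies below 16 cycles survive intact.
subsampled = signal[::2]
aliased_peak = dominant_frequency(subsampled)

# Pre-filter: zero every bin at or above the new Nyquist limit, then subsample.
spectrum = np.fft.rfft(signal)
spectrum[16:] = 0
filtered = np.fft.irfft(spectrum, n=64)
filtered_subsampled = filtered[::2]

print(original_peak, aliased_peak)
```

Without the filter the 24-cycle component reappears as a false 8-cycle pattern (32 - 24). With the filter the component is removed before subsampling, so nothing is misrepresented.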

#### Question 9
- Q: Describe the three-step procedure to resize an image while minimizing aliasing.
- A :

In [37]:
def low_pass_filter(image):
    # Apply a Gaussian blur or other smoothing filter
    kernel = cv2.getGaussianKernel(ksize=5, sigma=1)
    gaussian = kernel @ kernel.T
    return cv2.filter2D(image, -1, gaussian)

def resize_nearest(image, scale):
    # Resize using nearest-neighbor interpolation
    new_size = (int(image.shape[1] * scale), int(image.shape[0] * scale))
    return cv2.resize(image, new_size, interpolation=cv2.INTER_NEAREST)

def resize_with_antialiasing(image, scale):
    # Step 1: Blur the image to remove high frequencies
    blurred = low_pass_filter(image)
    
    # Step 2: Resize using nearest neighbor (aliasing is reduced because the blur already attenuated high frequencies)
    resized = resize_nearest(blurred, scale)
    return resized

#### Question 10 (Free-Form)

- Q: Why does Gaussian blur before resizing improve quality, and what is this technique called?

- A: Gaussian blur removes high frequencies that cause jagged edges when down-sampling. This pre-filtering step is called anti-aliasing and leads to smoother, higher-quality results.

# III- Detection and Segmentation

#### Question 1

- Q: What is the primary goal of the Grad-CAM technique? What two key pieces of information does it extract from a CNN, and from which parts of the computation are they obtained?

- A: Grad-CAM highlights regions of the input image that are important for a specific prediction. It extracts: (1) feature maps from the forward pass, and (2) gradients of the target class score w.r.t. those feature maps from the backward pass.

#### Question 2

- Q: Why is a late convolutional layer (e.g., layer4 in ResNet) typically used in Grad-CAM instead of an early layer?

- A: Late layers capture more semantic information (what the object is) but have lower spatial resolution. Early layers have higher spatial resolution but low-level features (edges, textures). Grad-CAM prefers semantic relevance over pixel precision.



#### Question 3
- Q: Write a pseudo-code function calculate_grad_cam(activations, gradients) that computes Grad-CAM.

In [20]:
def calculate_grad_cam(activations, gradients):
    # Step 1: Compute global average pooling of gradients
    alpha = gradients.mean(axis=(1, 2))  # shape: (C,)

    # Step 2: Weighted sum of the activations
    weighted_sum = (alpha[:, None, None] * activations).sum(axis=0)  # shape: (H, W)

    # Step 3: Apply ReLU
    heatmap = np.maximum(weighted_sum, 0)
    return heatmap

#### Question 4

- Q: What is a "hook" in PyTorch, and why is it used in Grad-CAM?

- A: A hook is a function that allows you to extract intermediate data from a model during the forward or backward pass. In Grad-CAM:
- A forward hook captures feature maps.
- A backward hook captures gradients w.r.t. those maps.



#### Question 5
- Q: What role do the final FC layer weights play in CAM, and what kind of architecture does this require?

- A: CAM uses the FC weights to compute a weighted sum over the final convolutional feature maps. This requires a network with a Global Average Pooling (GAP) layer before the FC layer, as in the original CAM paper.

#### Question 6

- Q: Compare Grad-CAM and CAM in terms of how they compute feature importance. What makes Grad-CAM more general?

- A: CAM uses FC weights directly (requires GAP + FC structure). Grad-CAM uses gradients to compute importance weights, making it applicable to any CNN architecture, not just those with GAP layers.

#### Question 7
- Q: How do you process a 7x7 class activation map to a binary 224x224 mask?
- A: Upscale to the input resolution, for example with cv2.resize(cam, (224, 224), interpolation=cv2.INTER_LINEAR), normalize to [0, 1], then threshold and cast with .astype(np.uint8).
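
The three steps can be sketched in NumPy alone. Note the swap: this version uses nearest-neighbor upscaling via `np.repeat` instead of linear interpolation, and the function name, the 0.5 threshold, and the toy CAM are assumptions.

```python
import numpy as np

def cam_to_mask(cam, out_size=224, threshold=0.5):
    """Upscale a square CAM, normalize to [0, 1], and binarize."""
    # Step 1: nearest-neighbor upscaling by an integer factor (32 for 7 -> 224).
    factor = out_size // cam.shape[0]
    upscaled = np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)
    # Step 2: min-max normalization; the epsilon avoids division by zero.
    lo, hi = upscaled.min(), upscaled.max()
    normalized = (upscaled - lo) / (hi - lo + 1e-8)
    # Step 3: threshold and cast to an 8-bit mask of zeros and ones.
    return (normalized >= threshold).astype(np.uint8)

cam = np.zeros((7, 7))
cam[2:4, 3:5] = 1.0  # a 2x2 "hot" region
mask = cam_to_mask(cam)
print(mask.shape, mask.sum())
```

Each CAM cell becomes a 32x32 block, so the 2x2 hot region covers 4096 mask pixels. Linear interpolation would give smoother mask borders.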

#### Question 8
- Q: Why is it important to visualize color-coded, blended segmentation masks?

- A: It helps users intuitively understand where and why the model is predicting each class.

#### Question 9
- Q: How does pseudo-segmentation differ from fully-trained segmentation networks like FCNs?
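- A: Pseudo-segmentation (e.g., thresholded CAMs) uses no pixel-level labels during training, so it can only give coarse, low-resolution masks.
- Fully-trained segmentation networks like FCNs are trained on ground-truth segmentation masks and produce high-resolution segmentations.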


# IV- Convolutional Neural Networks

#### Question 1 
- Do exercise for Convolution calculation in sheet 4

#### Question 2
- Q: Why use stride 2 in CNNs, and what is another layer that reduces spatial dimensions?

- A: Stride 2 reduces spatial dimensions to lower computational cost and to extract coarse-level features. Another example is max-pooling, which downsamples by taking the max value in local regions.

#### Question 3
- Q: What features might the following 2x2 kernels detect?

w1 = [[1, 0], [0, 1]]: Diagonal pattern detection (main diagonal)

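w2 = [[1, 0], [-1, 0]]: Horizontal edge detector (difference between vertically adjacent pixels, i.e., across rows)

w3 = [[1, 1], [0, 0]]: Sums two horizontally adjacent pixels in the top row; it responds to bright horizontal structures but, with no negative weights, it is a local sum rather than a true edge detector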

#### Question 4
- Q : Calculate the receptive field of a neuron in the final feature map of a CNN with the following architecture:
- Layer 1: Convolution with kernel size 5x5, stride 1.
- Layer 2: Max-Pooling with kernel size 2x2, stride 2.
- Layer 3: Convolution with kernel size 3x3, stride 1.

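- A : Use RF_out = RF_in + (k - 1) * jump, where jump is the product of all previous strides.
- Layer 1: RF = 1 + (5 - 1) * 1 = 5, jump = 1
- Layer 2: RF = 5 + (2 - 1) * 1 = 6, jump = 2
- Layer 3: RF = 6 + (3 - 1) * 2 = 10, so each final neuron sees a 10x10 input region.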
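
The receptive field of this stack can be computed layer by layer with the standard recurrence: each layer adds (kernel_size - 1) times the product of all earlier strides. A minimal sketch, with a hypothetical helper name:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf, jump = 1, 1  # a single input pixel; distance between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Conv 5x5 stride 1, max-pool 2x2 stride 2, conv 3x3 stride 1
layers = [(5, 1), (2, 2), (3, 1)]
rf = receptive_field(layers)
print(rf)
```

The intermediate values are 5 after the first convolution, 6 after pooling, and 10 after the last convolution, so one neuron in the final feature map sees a 10x10 input region.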

#### Question 5
- Q: Training accuracy is 100%, test accuracy is 75%. What is this, and how to fix it?

- A: This is overfitting — the model memorizes training data. Fix it with:

- Regularization (e.g., dropout, weight decay)

- Data augmentation to increase variability

#### Question 6
- Q: What’s the purpose of model.train() vs model.eval() in PyTorch?

- A: These modes switch behavior:

- train(): Enables dropout, batchnorm updates

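- eval(): Disables dropout and makes batchnorm use its stored running statistics (inference mode). It does not turn off gradient tracking; use torch.no_grad() for that.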

#### Question 7, 8, 9, 10, 11
- Questions :
- Q7: Why use data augmentation (e.g., flipping, cropping) in CNN training?
- Q8: What does a saliency map represent?
- Q9: How do early vs. deep ResNet layer features differ?
- Q10: How to reduce a [64, 56, 56] activation map for visualization?
- Q11: How would saliency maps for 'cat' and 'dog' differ in the same image?

- Answers :
- A7: It increases data diversity, improving model generalization and robustness to variations in input images.
- A8: It shows which input pixels most influence the output class. High values indicate strong impact on the prediction.
- A9: Early layers detect edges, textures; deeper layers detect semantic objects or class-specific features.
- A10: Two options: Mean over channels → single-channel heatmap or Select max or specific channel for targeted visualization
- A11: Each highlights regions most relevant to its class — the cat map will focus on cat features, the dog map on dog features. This shows the network learns class-specific spatial cues.

# V- Sequence Modeling in Vision
- PRETTY STRAIGHT FORWARD

# VI- Image Captioning and ViT


#### Question 1:
- Q: A standard image captioning model is composed of an encoder and a decoder.
1.  What is the role of the encoder (e.g., a pretrained VGG19)? What kind of representation does it produce from the input image?
2. What is the role of the decoder (e.g., an LSTM)? What sequence does it generate?

- A: Encoder (e.g., VGG19): The encoder's role is to extract meaningful visual features from the input image. A pretrained CNN like VGG19 processes the image through multiple convolutional layers and outputs a high-level, compact representation (feature map or vector) that captures semantic information.

- Decoder (e.g., LSTM): The decoder generates a sequence of words (the caption) from the encoded image features. At each time step, it uses its internal state and possibly context from attention to predict the next word.


#### Question 2:
- Q: In an attention-based image captioning model:

- How is the decoder's state at a given timestep used to create an attention distribution over the encoder's spatial feature maps?
- What is the benefit of this attention mechanism compared to a non-attentional model?
- A: 
- The decoder's hidden state at each timestep is used to compute attention weights over the encoder’s spatial feature map (e.g., via dot-product or additive attention). This results in a context vector which emphasizes relevant spatial locations of the image for generating the current word.
- Attention allows the model to dynamically focus on different parts of the image when generating each word. This leads to more accurate and descriptive captions, especially for complex images, unlike a single global vector which may lose spatial detail.
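
One attention step can be sketched as follows. This uses scaled dot-product scoring (one of the two options mentioned above), random stand-in values, and the assumption that the decoder state and the encoder features share the same dimension; real models usually add learned projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def attend(hidden_state, feature_map):
    """hidden_state: (D,), feature_map: (H, W, D) -> context (D,), weights (H, W)."""
    h, w, d = feature_map.shape
    locations = feature_map.reshape(h * w, d)        # one feature vector per location
    scores = locations @ hidden_state / np.sqrt(d)   # similarity to the decoder state
    weights = softmax(scores)                        # attention distribution
    context = weights @ locations                    # weighted sum of features
    return context, weights.reshape(h, w)

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 16))  # stand-in for encoder output
hidden_state = rng.normal(size=16)         # stand-in for the decoder state
context, weights = attend(hidden_state, feature_map)
```

The weights form a probability map over the 7x7 grid, and the context vector is what the decoder consumes when predicting the next word.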

#### Question 3:
- Q: Why is using a pretrained CNN (e.g., VGG19 trained on ImageNet) effective for image captioning?
- A:
- Pretrained CNNs have learned general-purpose visual features from large datasets like ImageNet, such as edges, textures, shapes, and object parts. This transferred knowledge helps the captioning model recognize and describe objects in new images without training from scratch.

#### Question 4:
- Q: What makes an LSTM well-suited for generating word sequences? Why not use a feed-forward network?
- A:
- LSTMs can model long-range dependencies and remember past context through their gated memory cells, which is essential for generating coherent and grammatically correct sentences. Feed-forward networks lack this sequential memory capability.


#### Question 5:
- Q: Describe the two main steps of the patch embedding process in a Vision Transformer (ViT):

- A:
- The image is divided into fixed-size patches (e.g., 16x16).

- Each patch is flattened and linearly projected into a fixed-dimensional embedding vector using a learnable projection matrix.
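
The two steps can be sketched in NumPy. The image is random, `embed_dim = 64` is an arbitrary choice, and the projection matrix is a random stand-in for weights that would be learned during training.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, p, p, C)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

# Step 1: 14 x 14 = 196 patches, each flattened to 16 * 16 * 3 = 768 values.
patches = patchify(image, patch_size=16)

# Step 2: linear projection to the embedding dimension.
embed_dim = 64
projection = rng.normal(scale=0.02, size=(768, embed_dim))
tokens = patches @ projection

print(patches.shape, tokens.shape)
```

In PyTorch implementations both steps are often fused into a single convolution with kernel size and stride equal to the patch size.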


#### Question 6:
- Q:
- Why is positional embedding essential in ViTs?

- Why isn't this needed in CNNs?
- A:
- Transformers lack inherent spatial order, so positional embeddings provide information about the position of each patch, allowing the model to reason about the image structure.
- CNNs have spatial locality built into their architecture via the convolution operation and kernel sliding mechanism, so position information is implicitly encoded.


#### Question 7:

- Q:
- What is the purpose of the [CLS] token in a ViT for classification?

- How is it used for prediction?

- A:
- The [CLS] token is a learnable vector prepended to the input sequence. After the final transformer layer, it captures a summary of the entire image.
- Its final representation is passed through a classification head (e.g., linear layer + softmax) to produce the class label.

#### Question 8:
- Q: Name the two main sub-layers in a Transformer Encoder block and describe them:

- A:
- Multi-Head Self-Attention: Allows each patch to attend to all other patches to capture global context.

- Feed-Forward Network (MLP): Applies non-linear transformations independently to each token to increase representation capacity.


#### Question 9:
- Q: 
- What inductive bias does a CNN have?

- How does the lack of this bias in ViT affect its performance?
- A:
- CNNs assume spatial locality and translation equivariance through local receptive fields and weight sharing (pooling adds approximate translation invariance).
- ViTs require more data to learn spatial structure from scratch. They often underperform on small datasets but can surpass CNNs when trained on very large datasets.

#### Question 10:
- Q:
- How does a CNN naturally produce a spatial feature map?

- How does a ViT produce spatial features for detection?
- A:
- Through convolution and pooling, CNNs reduce the input image into a lower-resolution spatial map that preserves object locations.
- The sequence of patch embeddings must be reshaped back into a grid or combined with positional information to build a spatial feature map for detection heads.

# VII- Matching


#### Question 1: SIFT (feature extraction and feature description)
- What is a keypoint?
- A location in the image that has distinctive texture or structure, like corners or blobs.

- What is a descriptor?
- A vector that encodes the local appearance around a keypoint. It must be robust to scale, rotation, and illumination changes.


#### Question 2:
- How many point pairs are needed to estimate an affine transformation?
- At least 3 non-collinear point pairs (each pair gives 2 equations for the 6 unknown parameters).

- What happens if you use more?
- Least squares is applied to minimize the error across all pairs, improving robustness to noise.
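
A minimal least-squares fit, as a sketch: the helper name `fit_affine`, the synthetic points, and the ground-truth matrix are made up for illustration.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix M such that dst ≈ src @ M[:, :2].T + M[:, 2]."""
    # Homogeneous source coordinates: one row [x, y, 1] per point.
    A = np.hstack([src, np.ones((len(src), 1))])
    solution, *_ = np.linalg.lstsq(A, dst, rcond=None)  # shape (3, 2)
    return solution.T

true_M = np.array([[1.2, -0.3, 5.0],
                   [0.4, 0.9, -2.0]])
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, size=(10, 2))
dst = src @ true_M[:, :2].T + true_M[:, 2]

estimated_M = fit_affine(src, dst)
```

With exactly 3 non-collinear pairs the system is solved exactly; with more pairs, `lstsq` minimizes the summed squared residuals over all of them.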

#### Question 3:
- Why does least-squares fail with outlier matches?
- Outliers introduce large errors that skew the solution, leading to an inaccurate transformation.



#### Question 4:
- Give RANSAC algorithm in pseudocode :
- ```
  for N iterations:
    sample 3 point pairs
    fit affine model
    count inliers (matches consistent with the model)
    if current model has more inliers, save it
  final_model = fit affine model on best inlier set
  ```
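
The pseudocode above can be turned into a runnable NumPy sketch. The function names, the default iteration count, the 1-pixel threshold, and the synthetic matches are all assumptions for illustration.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix mapping src points to dst points."""
    A = np.hstack([src, np.ones((len(src), 1))])
    solution, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return solution.T

def apply_affine(M, points):
    return points @ M[:, :2].T + M[:, 2]

def ransac_affine(src, dst, n_iters=200, threshold=1.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # Minimal sample: 3 point pairs determine an affine model.
        idx = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[idx], dst[idx])
        # Inliers: matches whose reprojection error is below the threshold.
        errors = np.linalg.norm(apply_affine(M, src) - dst, axis=1)
        inliers = errors < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Final step: refit on all inliers of the best model.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers

# Synthetic matches: 30 exact correspondences followed by 10 gross outliers.
true_M = np.array([[1.2, -0.3, 5.0], [0.4, 0.9, -2.0]])
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(40, 2))
dst = apply_affine(true_M, src)
dst[30:] += rng.uniform(20, 40, size=(10, 2))

model, inliers = ransac_affine(src, dst)
```

A plain least-squares fit on all 40 matches would be pulled toward the outliers, while the RANSAC result matches the true transform because only the consistent matches are used for the final fit.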

#### Question 5:
- Why is RANSAC robust?
- It focuses only on inliers and discards outliers. It avoids fitting to bad matches, unlike least-squares.



#### Question 6:
- What is the inlier threshold?
- A distance threshold to decide whether a match fits the model.

- After the main RANSAC loop terminates and has identified the best set of inliers,
what is the final step that is typically performed to get the most accurate model? Why
is this step important?
- Re-fit the model using all inliers for higher accuracy.

#### Question 7:
- Q :
 - Define dense optical flow. How does it differ from the sparse keypoint correspondences found in Task 1?
 - What kind of information does a dense flow field provide?

- A:
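- Dense flow computes a motion vector for every pixel, while sparse methods compute correspondences only at selected keypoints.
- A dense flow field provides a 2D displacement (u, v) per pixel, i.e., the direction and magnitude of apparent motion everywhere in the frame.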




#### Question 8:
- Explain the concept of backward image warping using an optical flow field.
- Iterate over the target image grid, map back to the source using flow, and sample pixel values from the source.
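
A minimal sketch of backward warping. It assumes the convention that `flow[y, x] = (dx, dy)` points from a target pixel to its location in the source, and it uses nearest-neighbor sampling where practical implementations typically interpolate bilinearly.

```python
import numpy as np

def backward_warp(source, flow):
    """For every target pixel, look up the source pixel the flow points to."""
    h, w = source.shape[:2]
    ys, xs = np.mgrid[:h, :w]  # target pixel grid
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Target pixels whose lookup falls outside the source have no valid value.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(source)
    warped[valid] = source[src_y[valid], src_x[valid]]
    return warped, valid

source = np.arange(16).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0  # every target pixel samples one pixel to its right

warped, valid = backward_warp(source, flow)
print(warped)
```

Because the loop runs over the target grid, every target pixel gets at most one value and no gaps appear inside the valid region. The last column has no source pixel to sample and stays zero, which is a small-scale version of the holes discussed in the next question.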

#### Question 9:
- When warping one video frame to another using optical flow, artifacts often appear in the
resulting image. Define what occlusions and disocclusions are in this context and explain
how they can lead to visible artifacts in the warped image (e.g., holes, smeared pixels).
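- Occlusion: Part of the source that is not visible in the target.
- Disocclusion: Part of the target that is not visible in the source. For these pixels no valid correspondence exists, so the warp either leaves holes or samples wrong source pixels, which shows up as smeared or duplicated content.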



# VIII- Self-Supervised Learning

- Question 1:

- Softmax with τ = 0.1 → sharper distribution (more confident).

- Softmax with τ = 10.0 → flatter distribution (less confident).
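
The two temperatures can be compared on the same logits. The logit values are made up; only the two τ values come from the notes above.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
sharp = softmax(logits, tau=0.1)
flat = softmax(logits, tau=10.0)

print(np.round(sharp, 4), np.round(flat, 4))
```

With τ = 0.1 nearly all probability mass goes to the largest logit; with τ = 10.0 the three probabilities are close to uniform.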

- Question 2 :
- Why use low τ in contrastive learning?
- A low τ sharpens the softmax, so the loss is dominated by the positive and the most similar (hard) negatives. This gives stronger gradients and better feature separation.



- Question 3
- NT-Xent Loss:

- ```L(i, j) = -log [ exp(sim(z_i, z_j)/τ) / sum_{k ≠ i} exp(sim(z_i, z_k)/τ) ]```

- Numerator: similarity of positive pair.

- Denominator: similarities to all other samples k ≠ i (the positive plus all negatives).

- τ: controls sharpness/confidence.
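
A NumPy sketch of this loss, averaged over all anchors. It assumes cosine similarity for sim and a batch layout where rows 2k and 2k+1 are the two views of image k; both the layout and the toy embeddings are illustrative choices.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """z: (2N, D) embeddings; rows 2k and 2k+1 form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via dot product
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude k = i from the sum
    partner = np.arange(len(z)) ^ 1                   # 0 <-> 1, 2 <-> 3, ...
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    log_prob = sim[np.arange(len(z)), partner] - log_denominator
    return -log_prob.mean()

# Views of the same image agree: low loss.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# Views of the same image disagree: high loss.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

loss_aligned = nt_xent(aligned)
loss_mixed = nt_xent(mixed)
print(round(loss_aligned, 4), round(loss_mixed, 4))
```

Setting the diagonal to negative infinity makes exp(sim) exactly zero there, which implements the k ≠ i restriction in the denominator.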



- Question 4 (What is the "pretext task" in the SimCLR framework? Describe the two key steps involved in
creating the data for this task and what the model is trained to do.)
- Pretext task in SimCLR:

- Create two augmentations of the same image.

- Train model to bring them closer in feature space than other (different) images.




- Question 5

- t-SNE: Visualizes learned embeddings. Good = clusters by class.

- Linear Probing: Train a linear classifier on frozen embeddings. High accuracy → strong representations.



- Question 6:
- Linear Probing:
- Freeze backbone, train only a linear classifier on top. Used to evaluate the quality of learned embeddings.



- Question 7: (How does the training objective of CLIP (Contrastive Language-Image Pretraining) differ
from that of SimCLR? What are the "positive pairs" and "negative pairs" in the context of
CLIP's training?)
 -  SimCLR matches views of the same image. 
  - CLIP matches images and text. Positive = correct image-caption pair. Negative = mismatched pairs.


- Question 8: 
- Zero-shot inference with CLIP:

- Encode input image using CLIP image encoder.

- Encode class names using CLIP text encoder.

- Compute similarity between image embedding and each text embedding.

- Predict class with highest similarity.
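
The last two steps can be sketched without loading CLIP itself. The embeddings below are hand-written toy vectors standing in for encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np

def zero_shot_classify(image_embedding, text_embeddings, class_names):
    """Pick the class whose text embedding has the highest cosine similarity."""
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarities = txt @ img
    return class_names[int(np.argmax(similarities))], similarities

class_names = ["cat", "dog", "car"]
# Toy stand-ins for the outputs of the text encoder (one row per class prompt).
text_embeddings = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
# Toy stand-in for the output of the image encoder.
image_embedding = np.array([0.2, 0.9, 0.1])

label, similarities = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(label)
```

Adding a new class only requires encoding one more text prompt, which is what makes the classifier zero-shot.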

- Question 9:

- What is zero-shot learning?
- Predicting unseen classes without retraining.

- Why can CLIP do this but ResNet can’t?
- CLIP learns joint image-text space. It generalizes via text prompts, while traditional classifiers only recognize seen labels.