## Homework 02

In this notebook you'll explore, train, and evaluate models on the FashionMNIST dataset. FashionMNIST was designed as a more difficult drop-in replacement for MNIST.

For this assignment you'll want to use a CoCalc compute server with a GPU. Make sure you've watched the video at the beginning of the lesson about compute servers.

In [None]:
# Add your imports here
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

from introdl.utils import config_paths_keys

# Configure paths
paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']
MODELS_PATH = paths['MODELS_PATH']

### Warmup (5 pts)

Train LeNet5Rev on FashionMNIST and evaluate the performance on the test set.  Include convergence plots of loss and accuracy on the training and test data.

### Improve the model (24 pts)

Try increasing the number of convolutional layers to as many as six, each followed by a ReLU layer.  You may need to increase
the number of channels (but not in every layer).  Use two max pooling layers.  Kernel size can be 3 or 5,
but adjust the padding so that the convolutional layers preserve the size of the feature maps.
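
As a hint for the padding requirement, here is a minimal sketch of the standard output-size formula for a convolution without dilation. The helper names `conv_output_size` and `same_padding` are made up for this illustration and are not part of `torch` or `introdl`; the value returned by `same_padding` is what you would pass as the `padding` argument of `nn.Conv2d`, assuming an odd kernel size and a stride of 1.

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial size (height or width) after a convolution with no dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Padding that preserves the size for an odd kernel with stride 1."""
    return kernel_size // 2


# FashionMNIST images are 28 x 28, so check both allowed kernel sizes.
for k in (3, 5):
    p = same_padding(k)
    print(f"kernel_size={k}, padding={p}: 28 -> {conv_output_size(28, k, p)}")

# Without padding, a 5 x 5 kernel shrinks the feature map.
print(f"kernel_size=5, padding=0: 28 -> {conv_output_size(28, 5)}")
```

With size-preserving convolutions, only the max pooling layers change the height and width of the feature maps, which makes the rest of the architecture easier to reason about.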

You can also simplify the classifier.  Try a single linear layer instead of multiple linear layers
separated by ReLU functions.
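
If you use a single linear layer, its `in_features` must match the flattened output of the last convolutional block. The sketch below shows the arithmetic, assuming size-preserving convolutions, 2 x 2 max pooling with the default stride, and a hypothetical 32 output channels (your channel count may differ). The helper `flattened_features` is made up for this illustration.

```python
def flattened_features(channels, height, width, num_pools, pool_size=2):
    """Number of values per image after flattening the final feature maps.

    Assumes the convolutions preserve height and width, so only the
    pooling layers shrink the feature maps.
    """
    for _ in range(num_pools):
        height //= pool_size
        width //= pool_size
    return channels * height * width


# Hypothetical example: 32 channels out of the last conv layer, two pools.
in_features = flattened_features(channels=32, height=28, width=28, num_pools=2)
print(in_features)  # 32 * 7 * 7 = 1568
```

Under these assumptions the classifier would be a single `nn.Linear(in_features, 10)`, since FashionMNIST has ten classes.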

You should be able to achieve about 92% accuracy on the test set.

Try at least three different models and describe your experiments.  For each experiment, include the model and its convergence plots.

### Describe the things you tried (3 pts)

Summarize the network architectures you tried.  What worked best?  What didn't help?

### Analyze your best model (8 pts)

Make a confusion matrix for the predictions of your best model on the test set.  You can set `use_class_labels = True` when using `evaluate_classifier` to see the names of the classes.  You can also access the names of the classes as an attribute of the dataset, e.g. `dataset.classes`.

Describe which classes your model confuses most often.  Plot examples of the images that your model is getting wrong.  Do these misclassifications make sense?  Are the images from the misclassified classes hard to distinguish by eye?
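
To make the bookkeeping concrete, here is a small pure-Python sketch of how a confusion matrix is tallied and how the most-confused class pairs and the misclassified indices can be read from it. It is not how `evaluate_classifier` is implemented. The labels and predictions are made-up toy values, only three of the ten FashionMNIST class names are used, and the helper names are hypothetical.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix


def most_confused_pairs(matrix, class_names, top_k=3):
    """Off-diagonal (count, true_name, predicted_name) entries, largest first."""
    pairs = [
        (matrix[t][p], class_names[t], class_names[p])
        for t in range(len(matrix))
        for p in range(len(matrix))
        if t != p and matrix[t][p] > 0
    ]
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs[:top_k]


# Toy data for illustration only; replace with your model's test predictions.
class_names = ["T-shirt/top", "Pullover", "Shirt"]
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
pred_labels = [0, 2, 2, 1, 2, 2, 0, 2]

matrix = confusion_matrix(true_labels, pred_labels, len(class_names))
for name, row in zip(class_names, matrix):
    print(f"{name:12s} {row}")

print(most_confused_pairs(matrix, class_names, top_k=1))

# Indices of misclassified samples, useful for choosing which images to plot.
wrong = [i for i, (t, p) in enumerate(zip(true_labels, pred_labels)) if t != p]
print(wrong)
```

Rows are true classes and columns are predicted classes, so large off-diagonal entries point to the class pairs worth plotting.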

## Understanding Model Training

### Training vs Validation Data [3 pts]

1. In your FashionMNIST experiments above, you used training and test sets. Explain why we need separate datasets for training and evaluation. What problem are we trying to avoid?

YOUR ANSWER:

2. During training, when should you evaluate on the test/validation set? Every batch? Every epoch? Only at the end? Explain your reasoning.

YOUR ANSWER:

3. If your training accuracy is 99% but your test accuracy is only 60%, what does this indicate about your model? What might you do to fix this?

YOUR ANSWER:

### Understanding the Training Loop [5 pts]

The `train_simple_network` function you used above contains a training loop. Below is simplified code showing the key parts of what happens inside that function. For each marked line, explain in your own words what that line does and why it's necessary for training the network.

```python
def training_loop(model, train_loader, loss_func, optimizer, epochs):
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            
            # LINE A: What does optimizer.zero_grad() do and why is it necessary?
            optimizer.zero_grad()
            
            # LINE B: What is happening when we call model(data)?
            output = model(data)
            
            # LINE C: What does the loss function compute and what type of value does it return?
            loss = loss_func(output, target)
            
            # LINE D: What does loss.backward() calculate and where does it store the results?
            loss.backward()
            
            # LINE E: What does optimizer.step() do with the information from loss.backward()?
            optimizer.step()
```

YOUR EXPLANATIONS:
- LINE A: 
- LINE B: 
- LINE C: 
- LINE D: 
- LINE E: 

## Reflection [2 pts]

1. What, if anything, did you find difficult to understand for this lesson? Why?

YOUR ANSWER:

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

YOUR ANSWER:

## Reading Comprehension: Chapter 3 - Convolutional Neural Networks [10 pts]

These questions test your understanding of Chapter 3 from "Inside Deep Learning" by Edward Raff. 

### Spatial Structure and Prior Beliefs [2 pts]

1. The textbook states that "convolutions are powerful yet simple tools that help us encode information about the problem into the design of our network architecture." Explain what **spatial structural prior belief** means in the context of CNNs and why shuffling pixels in an image destroys this structure. Give a concrete example from your FashionMNIST experiments.

YOUR ANSWER:

2. Why does the textbook emphasize that CNNs work well for images but not for tabular/columnar data? What assumption do CNNs make that may not hold for spreadsheet-like data?

YOUR ANSWER:

### Weight Sharing and Translation [3 pts]

3. Section 3.2.4 introduces the concept of **weight sharing**. Explain how weight sharing in convolutions differs from fully connected layers. Why does this make CNNs more parameter-efficient for image data?

YOUR ANSWER:

4. The textbook discusses **translation invariance** as a desired property for image classification (Section 3.5). In your experiments above, did you observe any issues with small shifts in the input? How does max pooling help achieve partial translation invariance, and why is it only "partial"?

YOUR ANSWER:

5. According to the textbook, what is the relationship between stride and the output size? If you use a stride of 3 in a convolution, how does this affect the spatial dimensions of your output?

YOUR ANSWER:

### Architectural Design Choices [3 pts]

6. The textbook mentions that "you can have too much of a good thing and make a network too deep to learn" (Section 3.5.1). Based on your reading and experiments, what are the tradeoffs between network depth and performance? Why might a 200-layer network not always be better than a 20-layer network?

YOUR ANSWER:

7. In Section 3.4.3, the textbook introduces the `nn.Flatten()` operation. Explain why this operation is necessary when transitioning from convolutional layers to fully connected layers. What would happen if you tried to connect a Conv2d output directly to a Linear layer without flattening?

YOUR ANSWER:

8. The textbook suggests increasing the number of filters by a factor of K after each pooling layer of size K (Section 3.5.1). What is the computational reasoning behind this guideline? Did you follow this pattern in your experiments above?

YOUR ANSWER:

### Data Augmentation Philosophy [2 pts]

9. Section 3.6 describes data augmentation as "the feature engineering counterpart to deep learning." Based on the textbook's examples, what makes a good vs. bad augmentation for a specific dataset? Why does the author warn against using vertical flips for MNIST?

YOUR ANSWER:

10. The textbook states that neural networks are "data-hungry" and learn best with diverse data. How does data augmentation address this need differently than simply collecting more real data? What are the limitations of augmentation that the textbook mentions?

YOUR ANSWER: