## 1. After each stride-2 conv, why do we double the number of filters?
**Answer:**
After each stride-2 convolution, the height and width of the feature map are each halved, so the number of activations per channel drops by a factor of four. Doubling the number of filters offsets part of that loss: the total number of activations shrinks by a factor of two rather than four, so the layer's capacity does not fall too sharply in one step. Deeper layers also tend to represent more abstract and more varied features, and the extra filters give the network room to learn them.
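As a rough check of that arithmetic, the sketch below tracks the activation count through one stride-2 convolution. The function names, the 3x3 kernel with padding 1, and the 28x28x8 input are illustrative assumptions, not values taken from the answer above.

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Output length along one spatial axis of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def activation_count(height, width, channels):
    """Total number of activations in a feature map."""
    return height * width * channels


# Hypothetical input feature map: 28x28 with 8 channels.
h, w, c = 28, 28, 8
h2, w2 = conv_out_size(h), conv_out_size(w)  # each axis is halved

before = activation_count(h, w, c)
same_filters = activation_count(h2, w2, c)         # filter count unchanged
doubled_filters = activation_count(h2, w2, 2 * c)  # filter count doubled

print(before, same_filters, doubled_filters)
```

Keeping the filter count fixed cuts the activations to a quarter, while doubling it cuts them only to a half.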

---

## 2. Why do we use a larger kernel in the first conv with MNIST (with `simple_cnn`)?
**Answer:**
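With a 3x3 kernel on a single-channel MNIST image, each output position sees only 9 input pixels. If the first layer has about that many filters (for example, 8), it produces nearly as many outputs as it receives inputs at each position, so it is under little pressure to learn useful features. A larger kernel (e.g., 5x5) gives each position 25 input pixels, so the layer has to compress more information into the same number of filters and can detect larger patterns, such as strokes and curves, from the very first layer. Because MNIST images are small (28x28 pixels) and simple, the extra cost of the larger kernel is modest.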
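The kernel-size argument comes down to counting how many input values feed each output position. Here is a minimal sketch of that count; the helper name and the choice of 8 filters are assumptions made for illustration.

```python
def inputs_per_position(kernel_size, in_channels=1):
    """Number of input values a single filter sees at one position."""
    return kernel_size * kernel_size * in_channels


n_filters = 8  # assumed filter count for the first layer

for k in (3, 5):
    n_in = inputs_per_position(k)
    print(f"{k}x{k} kernel: {n_in} inputs -> {n_filters} outputs per position")
```

With the 3x3 kernel the layer maps 9 values to 8, which is close to a copy; with the 5x5 kernel it has to summarise 25 values in 8.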

---

## 3. What data is saved by ActivationStats for each layer?
**Answer:**
`ActivationStats` is a fastai callback that records the following for each layer it hooks:
- **Mean:** The average activation value in the layer.
- **Standard deviation:** The spread of the activation values around the mean.
- **Histogram:** The distribution of activation values, recorded when histograms are enabled.
- **Fraction near zero:** The share of activations that are close to zero, which indicates how much of the layer is effectively inactive.

These statistics are useful for diagnosing issues such as activations collapsing toward zero or blowing up, and for checking that activations stay within a reasonable range during training.
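To make the recorded quantities concrete, here is a minimal pure-Python sketch that computes a mean, standard deviation, and histogram for one layer's activations. The function name, the bin range, and the use of the population standard deviation are assumptions for illustration; this is not the library's implementation.

```python
import statistics


def summarize_activations(acts, bins=4, lo=0.0, hi=1.0):
    """Return mean, std, and a fixed-range histogram of activation values."""
    width = (hi - lo) / bins
    hist = [0] * bins
    for a in acts:
        # Clamp out-of-range values into the first or last bin.
        idx = min(max(int((a - lo) / width), 0), bins - 1)
        hist[idx] += 1
    return {
        "mean": statistics.fmean(acts),
        "std": statistics.pstdev(acts),
        "hist": hist,
    }


# Hypothetical activations from a single layer on one batch.
layer_stats = summarize_activations([0.0, 0.1, 0.2, 0.9])
print(layer_stats)
```

A histogram with most of its mass in the lowest bin, as in this toy input, is the kind of pattern that suggests many units are barely active.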

---

## 4. How do we access a learner's callback after training has completed?
**Answer:**
In fastai, each callback attached to a `Learner` is available as an attribute named after the callback class in snake_case, and the full list of callbacks is in `learn.cbs`. After training has finished, you can read the callback and the data it collected:
```python
# Callbacks are attributes of the Learner, named after their class in snake_case
stats = learn.activation_stats  # the ActivationStats callback instance
```

---

## 5. What are the drawbacks of activations above zero?
**Answer:**
Activations that sit well above zero (large and not centered on zero) can cause several problems:
- **Exploding gradients:** If activations grow in magnitude from layer to layer, the gradients computed during backpropagation can grow with them, which makes training unstable and can cause the loss to diverge.
- **Overfitting:** Very large activations may make the model overly sensitive to specific patterns in the training set, so it fits those patterns rather than learning generalizable features.
- **Saturation:** In activation functions like sigmoid or tanh, large positive activations can push the function into its saturation region, where the gradient is very small. This slows down learning as updates to the weights become minimal.
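
The saturation effect is easy to see numerically. This small sketch evaluates the sigmoid's derivative at a few inputs; the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:4.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The gradient peaks at 0.25 for an input of zero and shrinks toward zero as the input grows, so weights feeding a saturated unit receive almost no update.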

---

## 6. What are the benefits and drawbacks of training with larger batches?
**Answer:**
**Benefits:**
- **Stable Gradient Estimates:** Larger batches tend to provide more accurate and stable estimates of the gradient, which can lead to smoother and potentially faster convergence during training.
- **Efficient Computation:** Larger batch sizes can make better use of parallel processing capabilities of modern GPUs, leading to faster computation and shorter training times.

**Drawbacks:**
- **Increased Memory Usage:** Larger batches require more memory, which may limit the maximum batch size, especially on GPUs with limited VRAM.
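- **Less Regularization:** Smaller batches introduce more noise into the gradient estimate, which can act as a form of regularization and help prevent overfitting. Larger batches reduce this noise, which can make the model more prone to overfitting.
- **Fewer Updates per Epoch:** A larger batch size means fewer batches per epoch, so the weights are updated fewer times for the same amount of data.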
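
The link between batch size and gradient noise can be illustrated with a small simulation. In this sketch the per-example gradients are hypothetical: a true value of 1.0 plus unit Gaussian noise. The function name and batch sizes are likewise assumptions.

```python
import random
import statistics


def batch_gradient_noise(batch_size, n_batches=200, seed=0):
    """Std. dev. of the batch-mean gradient across many random batches."""
    rng = random.Random(seed)
    batch_means = []
    for _ in range(n_batches):
        # Hypothetical per-example gradients: true value 1.0 plus noise.
        grads = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        batch_means.append(statistics.fmean(grads))
    return statistics.pstdev(batch_means)


small = batch_gradient_noise(batch_size=4)
large = batch_gradient_noise(batch_size=256)
print(f"batch size 4: {small:.3f}   batch size 256: {large:.3f}")
```

The spread of the batch-mean gradient falls roughly with the square root of the batch size, which is both the stability benefit and the lost regularization described above.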

---

## 7. Why should we avoid starting training with a high learning rate?
**Answer:**
Starting training with a high learning rate can cause several issues, mainly because the initial weights are random and far from a good solution:
- **Divergence:** A high learning rate may cause the model to take steps that are too large during gradient descent, overshooting the optimal solution so that the loss increases rather than decreases.
- **Instability:** High learning rates can make the training process unstable, with the loss fluctuating wildly or failing to converge.
- **Poor Local Minima:** Even if the model converges, it may do so to a suboptimal solution, missing out on finding the best possible parameters due to the large steps skipping over finer adjustments.
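
Divergence shows up even on a one-dimensional toy problem. The sketch below runs plain gradient descent on f(x) = x² with two learning rates; the function, starting point, and rates are illustrative choices.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimise f(x) = x**2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x


stable = gradient_descent(lr=0.1)    # each step multiplies x by 0.8
diverged = gradient_descent(lr=1.1)  # each step multiplies x by -1.2

print(abs(stable), abs(diverged))
```

With the smaller rate, x shrinks toward the minimum at zero; with the larger rate, every step overshoots by more than it corrects, so the distance from the minimum keeps growing.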

---

## 8. What are the benefits of training with a high learning rate?
**Answer:**
- **Faster Convergence:** A high learning rate can speed up the convergence of the model by allowing it to make significant progress towards the optimal solution in the initial stages of training.
- **Escaping Local Minima:** A higher learning rate can help the model jump out of local minima, potentially finding a better, more generalizable solution.
- **Efficient Early Training:** In the early stages of training, when the model is far from the optimum, a high learning rate can quickly reduce the loss, bringing the model closer to a good solution more efficiently.
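
The speed benefit can be sketched by counting how many gradient descent steps are needed to get close to the minimum of f(x) = x². The function name, tolerance, and learning rates are assumptions for illustration; both rates are below the point where this toy problem diverges.

```python
def steps_to_converge(lr, x0=1.0, tol=1e-3, max_steps=10_000):
    """Count gradient descent steps on f(x) = x**2 until |x| < tol."""
    x = x0
    for step in range(1, max_steps + 1):
        x -= lr * 2 * x
        if abs(x) < tol:
            return step
    return max_steps


slow = steps_to_converge(lr=0.01)
fast = steps_to_converge(lr=0.4)
print(f"lr=0.01: {slow} steps   lr=0.4: {fast} steps")
```

The higher rate reaches the same tolerance in far fewer steps, as long as it stays below the threshold at which the updates start to overshoot.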

---

## 9. Why do we want to end the training with a low learning rate?
**Answer:**
Ending the training with a low learning rate helps in fine-tuning the model:
- **Precise Adjustment:** A low learning rate allows for smaller, more precise adjustments to the model parameters, helping the model converge to a better and more refined solution.
- **Avoid Overshooting:** It reduces the risk of overshooting the minimum of the loss function, leading to more stable and reliable convergence.
- **Final Tuning:** As the model nears the optimal solution, a low learning rate helps in careful tuning, ensuring that the model settles into the best possible state with minimal error.
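
The effect of lowering the learning rate near the end can be seen on f(x) = |x|, where every gradient step has the same size regardless of how close x is to the minimum. This is a toy sketch; the function name, the starting rate of 0.3, and the decay factor of 0.9 are assumed values.

```python
def minimise_abs(lr_schedule, x0=1.0):
    """Gradient descent on f(x) = |x|; the gradient is the sign of x."""
    x = x0
    for lr in lr_schedule:
        grad = (x > 0) - (x < 0)  # sign(x), which is 0 exactly at the minimum
        x -= lr * grad
    return x


steps = 50
constant = [0.3] * steps                          # same rate throughout
decayed = [0.3 * 0.9 ** t for t in range(steps)]  # rate shrinks every step

x_const = minimise_abs(constant)
x_decay = minimise_abs(decayed)
print(abs(x_const), abs(x_decay))
```

With a constant rate, x keeps bouncing back and forth across the minimum; with a decaying rate, each bounce is smaller than the last, so x settles much closer to zero.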

---
