# **Working with Real Data (MNIST Example)**
Now that we’ve coded an MLP from scratch, let’s move to a more complex task using real-world data.

We will use **TensorFlow/Keras** to simplify the implementation and train a network to recognize handwritten digits using the **MNIST** dataset.


##### Code using Keras:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the data
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

# One-hot encode labels
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)

# Build the model
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5, batch_size=128)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')
```

Here’s what this does:
- Loads the **MNIST dataset** of handwritten digits.
- Preprocesses the images (flattening them to vectors and normalizing the pixel values).
- Defines a simple MLP with one hidden layer of 512 neurons using **ReLU** and a softmax output layer.
- Trains the network on the training data.
- Evaluates the accuracy on the test data.

#### 5. **Explaining Key Concepts**

- **Loss function**: We use **categorical cross-entropy** for multiclass classification problems.
- **Optimization**: We use **Adam optimizer**, a variant of gradient descent, which automatically adjusts the learning rate during training.
- **Evaluation**: After training, we test the model on the test data to see how well it generalizes to unseen examples.



#### Day 2 Conclusion
Today, we covered:
- Multilayer Perceptron (MLP) and how it extends the simple perceptron.
- Backpropagation and gradient descent for training deep networks.
- Hands-on coding, first by building an MLP from scratch and then using TensorFlow to work on real-world data.

## **1. Imports**

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
```

- **`tensorflow`**: A popular open-source machine learning library developed by Google.
- **`tensorflow.keras`**: TensorFlow's high-level neural networks API, which simplifies building and training models.
- **`layers` and `models`**: Submodules used to construct neural network layers and models.
- **`mnist`**: A dataset of handwritten digits (0-9), consisting of 60,000 training images and 10,000 testing images.


---

In [1]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist


In [2]:
import tensorflow as tf
# tf.config.list_physical_devices('GPU')



## **2. Loading the MNIST Dataset**

```python
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
```
- The MNIST dataset is split into training and testing sets.
  - **`train_images`**: 60,000 images used for training.
  - **`train_labels`**: Corresponding labels for training images.
  - **`test_images`**: 10,000 images used for testing.
  - **`test_labels`**: Corresponding labels for test images.


---

## **3. Data Preprocessing**

### **3.1 Reshaping and Normalizing Images**

```python
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255
```

---

In [3]:

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess the data
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

In [4]:
train_labels[3]

1


- **Reshaping**:
  - Original shape: `(num_samples, 28, 28)`.
  - Reshaped to: `(num_samples, 784)` to flatten the 28x28 images into 1D vectors.
- **Normalization**:
  - Pixel values are integers from 0 to 255.
  - Dividing by 255 scales the values to the range [0, 1], which helps the neural network train more efficiently.

### **3.2 One-Hot Encoding Labels**

```python
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)
```

- **One-Hot Encoding**:
  - Converts integer labels (0-9) into binary class matrices.
  - Example: Label `3` becomes `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`.
- **Why One-Hot Encoding?**
  - Necessary for multiclass classification with categorical cross-entropy loss.
  - Allows the model to output probabilities for each class.

---

### Why Do I have to do One-Hot Encoding?

In deep learning, **loss functions** like **categorical cross-entropy** and **sparse categorical cross-entropy** are used for classification tasks. Both are used for multi-class classification problems, but the main difference between them lies in how the labels are encoded. Here's an explanation:

### 1. **Categorical Cross-Entropy**
- **Usage**: Used when your labels are **one-hot encoded**.
  
  One-hot encoding means that for a classification problem with $ N $ classes, each label is represented as a vector of length $ N $, where one element is 1 (indicating the correct class) and all other elements are 0.

  Example:
  - Suppose you have 3 classes (0, 1, 2).
  - If the label is 2, it would be one-hot encoded as [0, 0, 1].

- **Formula**:
  $$
  L = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)
  $$
  Where:
  - $ y_i $ is the one-hot encoded true label.
  - $ \hat{y}_i $ is the predicted probability for class $ i $.
  
- **When to use**: When your output labels are already one-hot encoded. For instance, in Keras, if your labels are one-hot encoded, you would use `categorical_crossentropy` as your loss function.

### 2. **Sparse Categorical Cross-Entropy**
- **Usage**: Used when your labels are **integers** rather than one-hot encoded.

  Instead of one-hot encoding, each label is represented by a single integer, where the value of the integer corresponds to the class index.

  Example:
  - Suppose you have 3 classes (0, 1, 2).
  - If the label is 2, it is represented as the integer `2` (not as a one-hot vector like `[0, 0, 1]`).

- **Formula**:
  The formula is similar to categorical cross-entropy, but the difference is in the input labels. Instead of using one-hot encoded vectors, the labels are integer class indices.

  $$
  L = -\log(\hat{y}_c)
  $$
  Where:
  - $ c $ is the correct class index (an integer).
  - $ \hat{y}_c $ is the predicted probability for the correct class $ c $.

- **When to use**: When your labels are integers (i.e., class indices) rather than one-hot encoded vectors. This is common in Keras when the labels are not explicitly converted to one-hot vectors; you would use `sparse_categorical_crossentropy`.

### Key Differences
- **Label Format**:
  - **Categorical Cross-Entropy**: Requires **one-hot encoded labels**.
  - **Sparse Categorical Cross-Entropy**: Requires **integer labels**.

- **Use Case**:
  - If your dataset labels are one-hot encoded (each label is a vector like `[0, 0, 1]`), you should use **categorical cross-entropy**.
  - If your dataset labels are integer-encoded (each label is a scalar like `2`), you should use **sparse categorical cross-entropy**.

### 3. **Binary Cross-Entropy**
- **Usage**: Used for **binary classification** problems where the output has two possible classes (0 or 1).

- **Formula**:
  $$
  L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
  $$
  Where:
  - $ y $ is the true label (0 or 1).
  - $ \hat{y} $ is the predicted probability for class 1.

- **When to use**: For binary classification tasks, where there are only two possible classes (e.g., spam or not spam, cat or dog).

### 1. **Poisson Loss**
**Use Case**:
Used when modeling count data where the predictions are expected to follow a **Poisson distribution**.
- **Description**:
Poisson loss calculates the loss assuming that the data follows a Poisson distribution, which is commonly used for count-based data (like the number of events occurring within a fixed interval).

### Summary:
- **Categorical Cross-Entropy**: Use when labels are **one-hot encoded** for multi-class classification.
- **Sparse Categorical Cross-Entropy**: Use when labels are **integers** (class indices) for multi-class classification.
- **Binary Cross-Entropy**: Use for **binary classification** problems (two classes).

These loss functions calculate how far the predicted probabilities are from the actual class labels, helping the model adjust weights to improve its predictions during training.
Summary:
Categorical Cross-Entropy: Used for multi-class classification with one-hot encoded labels.
Sparse Categorical Cross-Entropy: Used for multi-class classification with integer encoded labels (more efficient than categorical cross-entropy).
Binary Cross-Entropy: Used for binary classification (0/1 labels).
Poisson Loss: Used for modeling count data where the data follows a Poisson distribution.
Each loss function is designed for different types of classification problems and different label formats (one-hot vs. integer encoding), so you should choose based on your specific problem and how your data is structured.


In [5]:

# One-hot encode labels
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)




## **4. Building the Model**

```python
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
model.add(layers.Dense(10, activation='softmax'))
```

### **4.1 Model Architecture**

- **`models.Sequential()`**:
  - A linear stack of layers.
  - Suitable for models where layers are added one after another.
- **Alternatives to Sequential Model**:
  - **Functional API**: Allows for complex architectures (e.g., models with multiple inputs/outputs, shared layers).
    - Example:
      ```python
      inputs = tf.keras.Input(shape=(28 * 28,))
      x = layers.Dense(512, activation='relu')(inputs)
      outputs = layers.Dense(10, activation='softmax')(x)
      model = tf.keras.Model(inputs=inputs, outputs=outputs)
      ```
  - **Subclassing API**: For custom models and layers by subclassing `tf.keras.Model` or `tf.keras.layers.Layer`.

### **4.2 Layers**

#### **4.2.1 Dense Layers**

- **`layers.Dense`**:
  - Also known as fully connected layers.
  - Each neuron in the layer is connected to every neuron in the previous and next layers.
- **Why Use Dense Layers?**
  - Suitable for tasks where the relationship between input features and output is not spatially dependent.
  - In this example, images are flattened, so spatial relationships are not preserved.

----------------------------------------------------------------
#### **4.2.2 First Dense Layer**

```python
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
```

- **Units**: 512 neurons.
- **Activation Function**: `'relu'` (Rectified Linear Unit).
  - **ReLU Activation**:
    - Formula: $ f(x) = \max(0, x) $
    - Introduces non-linearity.
    - Helps with the vanishing gradient problem.
- **Input Shape**: `(28 * 28,)` (flattened 784-dimensional vector).


In [6]:

# Build the model
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


---

#### **4.2.3 Second Dense Layer**

```python
model.add(layers.Dense(10, activation='softmax'))
```

- **Units**: 10 neurons (one for each digit class 0-9).
- **Activation Function**: `'softmax'`.
  - **Softmax Activation**:
    - Converts logits into probabilities that sum to 1.
    - Suitable for multiclass classification.
---

In [7]:
model.add(layers.Dense(10, activation='softmax'))



----------------------------------------------------------------
### **4.3 Activation Functions**

#### **Common Activation Functions**:

1. **Sigmoid**:
   - Formula: $ f(x) = \frac{1}{1 + e^{-x}} $
   - Output range: (0, 1)
   - Used in binary classification.
   - **Not ideal** for hidden layers due to vanishing gradient.

2. **Tanh**:
   - Formula: $ f(x) = \tanh(x) $
   - Output range: (-1, 1)
   - Centers data around zero.

3. **ReLU**:
   - Formula: $ f(x) = \max(0, x) $
   - Advantages:
     - Computationally efficient.
     - Mitigates vanishing gradient.
   - Disadvantages:
     - Dying ReLU problem (neurons stop activating).

4. **Leaky ReLU**:
   - Formula: $ f(x) = \max(\alpha x, x) $, where $ \alpha $ is a small constant (e.g., 0.01).
   - Allows a small gradient when the unit is not active.

5. **ELU (Exponential Linear Unit)**:
   - Combines benefits of ReLU and mitigates the dying ReLU problem.

6. **Softmax**:
   - Used in the output layer for multiclass classification.
   - Converts outputs to probability distributions.

**Choosing Activation Functions**:

- **Hidden Layers**:
  - Typically use ReLU, Leaky ReLU, or ELU.
  - ReLU is a common default choice due to its simplicity and effectiveness.

- **Output Layer**:
  - **Binary Classification**: Use `'sigmoid'` with `'binary_crossentropy'` loss.
  - **Multiclass Classification**: Use `'softmax'` with `'categorical_crossentropy'` loss.

---



## **5. Compiling the Model**

```python
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

### **5.1 Optimizer**

- **`optimizer='adam'`**:
  - **Adam (Adaptive Moment Estimation)**:
    - Combines RMSprop and momentum. Check bonus info below
    - Adjusts learning rate adaptively for each parameter.
    - Effective and widely used default optimizer.
- **Alternative Optimizers**:

1. **SGD (Stochastic Gradient Descent)**:
   - Basic optimizer with a fixed learning rate.
   - Can use momentum to improve convergence.
   - Example:
     ```python
     optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
     ```

2. **RMSprop**:
   - Maintains per-parameter learning rates, adjusting them based on the average of recent magnitudes of the gradients.

3. **Adagrad**:
   - Adapts the learning rate for each parameter based on the historical gradients.

4. **Adadelta**:
   - Extension of Adagrad, reduces its aggressive, monotonically decreasing learning rate.

**Choosing an Optimizer**:

- **Adam** is generally a good starting point.
- **SGD with Momentum** can be effective but may require more hyperparameter tuning.
- **Experimentation** is key; different optimizers may perform better on different datasets.


### Recall Loss function

### **5.2 Loss Function**

- **`loss='categorical_crossentropy'`**:
  - Used for multiclass classification when labels are one-hot encoded.
  - Measures the difference between two probability distributions (true labels and predicted probabilities).

- **Alternative Loss Functions**:

1. **`sparse_categorical_crossentropy`**:
   - Used when labels are integers instead of one-hot encoded vectors.
   - Example:
     - If labels are `[0, 1, 2]` instead of `[[1,0,0], [0,1,0], [0,0,1]]`.

2. **`binary_crossentropy`**:
   - Used for binary classification tasks.
   - Labels are either 0 or 1.

**Choosing a Loss Function**:

- For **multiclass classification** with one-hot labels, use **`categorical_crossentropy`**.
- For **multiclass classification** with integer labels, use **`sparse_categorical_crossentropy`**.
- For **binary classification**, use **`binary_crossentropy`**.

---

In [8]:

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])




## **6. Training the Model**

```python
model.fit(train_images, train_labels, epochs=5, batch_size=128)
```

### **6.1 Epochs**

- **`epochs=5`**:
  - Number of times the entire training dataset is passed through the model.
  - More epochs can lead to better learning but may cause overfitting if too many.


### **6.2 Batch Size**

- **`batch_size=128`**:
  - Number of samples processed before the model is updated.
  - **Why 128?**
    - It's a commonly used batch size.
    - Powers of 2 (e.g., 32, 64, 128, 256) are often chosen for computational efficiency on GPUs.
  - **Impact of Batch Size**:
    - **Smaller Batch Sizes**:
      - More frequent updates.
      - Can lead to noisier updates but may generalize better.
    - **Larger Batch Sizes**:
      - Faster computation per epoch (due to parallelism).
      - May require fewer epochs.
    - **Trade-off**:
      - Small batches can improve generalization but may take longer to train.
      - Large batches can speed up training but may require tuning learning rates.

**Choosing a Batch Size**:

- Experiment with different batch sizes.
- Consider computational resources (memory constraints).
- Common practice is to start with 32, 64, or 128.


---

**Summary of What Happens in 1 Epoch**:
You have **1280 samples**, split into **10 batches**.
During the forward pass, each hidden layer sees all the samples in the current batch, processes them, and passes the output to the next layer.
After processing each batch, the model calculates the loss and updates the weights in the backward pass.
Weights are **updated 10 times** in **1 epoch** (once after each batch).
After 10 epochs, the model will have seen the entire dataset 10 times, and the weights will have been updated **100 times in total (10 batches per epoch * 10 epochs)**.

**Conclusion:**
The key takeaway is that the weights in your hidden layers are updated after every batch of **128 samples**. Therefore, in one epoch, the weights are updated **10 times (once after each batch)**, and this happens regardless of how many layers you have in your network. The hidden layers process each batch independently, and backpropagation adjusts the weights after each batch.

---


In [9]:
# Train the model
model.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.8722 - loss: 0.4585
Epoch 2/5
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - accuracy: 0.9668 - loss: 0.1169
Epoch 3/5
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 6ms/step - accuracy: 0.9797 - loss: 0.0717
Epoch 4/5
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.9857 - loss: 0.0491
Epoch 5/5
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.9901 - loss: 0.0349


<keras.src.callbacks.history.History at 0x1f120fcdcd0>

---

## **7. Evaluating the Model**

```python
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')
```

- **`model.evaluate`**:
  - Computes the loss and metrics on the test data.
- **`test_acc`**:
  - The accuracy of the model on the test dataset.

---

In [10]:

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.9753 - loss: 0.0802
Test accuracy: 0.9789000153541565


---

## **8. Alternative Network Variants**

To illustrate different variants for the same data, let's explore other network architectures and configurations.

### **8.1 Convolutional Neural Network (CNN)**

**Why Use CNNs?**

- **Spatial Hierarchy**: Preserve spatial relationships in image data.
- **Feature Extraction**: Automatically learn features like edges, textures.
- **Better Performance**: Often achieve higher accuracy on image data compared to fully connected networks.

**CNN Model for MNIST**:

```python
from tensorflow.keras import datasets, layers, models

# Reload data to get original shapes
(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()

# Expand dimensions to include channel information
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

# One-hot encode labels
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)

# Build CNN model
cnn_model = models.Sequential()
cnn_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(10, activation='softmax'))

# Compile the model
cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
cnn_model.fit(train_images, train_labels, epochs=5, batch_size=64)

# Evaluate the model
test_loss, test_acc = cnn_model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')
```

**Explanation**:

- **`Conv2D` Layers**:
  - Extract features from images using convolutional filters.
  - Parameters:
    - **Filters**: Number of output filters in the convolution.
    - **Kernel Size**: Size of the convolution window.

- **`MaxPooling2D` Layers**:
  - Reduce spatial dimensions (height and width) to reduce computational load and control overfitting.

- **`Flatten` Layer**:
  - Flattens the 2D outputs to 1D for the fully connected layers.

- **Batch Size**: Reduced to 64 as CNNs are more computationally intensive.

### **8.2 Using Different Activation Functions**

#### **Replace ReLU with Leaky ReLU**

```python
from tensorflow.keras.layers import LeakyReLU

model = models.Sequential()
model.add(layers.Dense(512, input_shape=(28 * 28,)))
model.add(LeakyReLU(alpha=0.01))
model.add(layers.Dense(10, activation='softmax'))
```

- **LeakyReLU**:
  - Allows a small gradient when the unit is not active.
  - Can prevent the dying ReLU problem.

### **8.3 Using Different Optimizers**

#### **Using SGD with Momentum**

```python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
```

- **SGD with Momentum**:
  - Momentum helps accelerate SGD in the relevant direction and dampens oscillations.
  - **Learning Rate**: Needs to be specified manually.

### **8.4 Using Different Loss Functions**

#### **Using Sparse Categorical Crossentropy**

- If labels are not one-hot encoded:

```python
# Do not one-hot encode labels
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```

- **`sparse_categorical_crossentropy`**:
  - Expects integer labels instead of one-hot encoded vectors.
  - Can simplify preprocessing steps.

### **8.5 Adjusting Batch Size**

#### **Smaller Batch Size**

```python
model.fit(train_images, train_labels, epochs=5, batch_size=32)
```

- **Advantages**:
  - Potentially better generalization.
  - More updates per epoch.

#### **Larger Batch Size**

```python
model.fit(train_images, train_labels, epochs=5, batch_size=256)
```

- **Advantages**:
  - Faster computation per epoch due to parallelism.
  - Requires less frequent updates.

**Note**: When changing batch sizes, consider adjusting the learning rate accordingly.

### **8.6 Adding Dropout Layers to Prevent Overfitting**

```python
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(10, activation='softmax'))
```

- **`layers.Dropout(0.5)`**:
  - Randomly sets 50% of the input units to 0 during training.
  - Helps prevent overfitting by reducing reliance on specific neurons.

### **8.7 Deeper Network**

```python
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(28 * 28,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
```

- **Advantages**:
  - Deeper networks can capture more complex patterns.
- **Considerations**:
  - May require more data to prevent overfitting.
  - Training time increases.

### **8.8 Using Batch Normalization**

```python
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
model.add(layers.BatchNormalization())
model.add(layers.Dense(10, activation='softmax'))
```

- **Batch Normalization**:
  - Normalizes the outputs of the previous layer.
  - Can speed up training and improve performance.

---

## **9. Conclusion**

- **Flexibility**: TensorFlow and Keras provide flexibility to experiment with different architectures, activation functions, loss functions, optimizers, and batch sizes.
- **Experimentation**: The best configuration often depends on the specific dataset and problem. It's crucial to experiment and validate different approaches.
- **Best Practices**:
  - Start with a simple model and gradually increase complexity.
  - Monitor training and validation metrics to detect overfitting.
  - Use techniques like dropout and batch normalization to improve generalization.
  - Adjust hyperparameters (learning rate, batch size, epochs) based on performance.

For better understanding feel free to modify layers options.

In [11]:
from tensorflow.keras import datasets, layers, models

# Reload data to get original shapes
(train_images, train_labels), (test_images,
                               test_labels) = datasets.mnist.load_data()

# Expand dimensions to include channel information
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

# One-hot encode labels
train_labels = tf.keras.utils.to_categorical(train_labels)
test_labels = tf.keras.utils.to_categorical(test_labels)

# Build CNN model
cnn_model = models.Sequential()
cnn_model.add(layers.Conv2D(
    32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(10, activation='softmax'))

# Compile the model
cnn_model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
cnn_model.fit(train_images, train_labels, epochs=5, batch_size=64)

# Evaluate the model
test_loss, test_acc = cnn_model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc}')

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/5
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m26s[0m 24ms/step - accuracy: 0.8697 - loss: 0.4188
Epoch 2/5
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 22ms/step - accuracy: 0.9832 - loss: 0.0548
Epoch 3/5
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 24ms/step - accuracy: 0.9889 - loss: 0.0368
Epoch 4/5
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 24ms/step - accuracy: 0.9913 - loss: 0.0274
Epoch 5/5
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 25ms/step - accuracy: 0.9936 - loss: 0.0196
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - accuracy: 0.9862 - loss: 0.0402
Test accuracy: 0.989300012588501




For **Day 3**, we’ll move into **convolutional neural networks (CNNs)**, which are especially useful for image data like MNIST but on a deeper level. We'll also explore **pooling**, **convolutions**, and how to extract spatial features.

### RMSprop and Momentum in Deep Learning

Both **RMSprop** and **Momentum** are optimization techniques used to improve the convergence speed and performance of gradient descent in deep learning. They address some of the challenges faced by standard gradient descent, such as oscillations and the vanishing/exploding gradient problem. Here's an explanation of each:

---

### **Momentum**

**Momentum** is a method that helps gradient descent accelerate in the right direction by smoothing the oscillations caused by gradients. It does this by incorporating a fraction of the previous update to the current update, much like how a ball rolling down a hill gains momentum as it continues to move forward.

#### How it works:
In traditional gradient descent, the update rule for the weights is:

$
w \gets w - \eta \cdot \nabla L(w)
$

Where:
- $ \eta $ is the learning rate.
- $ \nabla L(w) $ is the gradient of the loss function with respect to the weights.

With **momentum**, we introduce a velocity term, $ v $, which accumulates the gradients over time:

$
v \gets \beta v - \eta \cdot \nabla L(w)
$
$
w \gets w + v
$

Where:
- $ v $ is the velocity (momentum) of the weights.
- $ \beta $ is the momentum coefficient (usually between 0.9 and 0.99), controlling how much of the previous velocity is retained.

#### Intuition:
- Momentum helps the optimizer move through **flat regions** (where gradients are small) more quickly by maintaining some velocity.
- It also reduces **oscillations** when descending into narrow valleys of the loss surface because it dampens the influence of noisy gradient updates by smoothing them out.

#### Benefits:
- Faster convergence, especially in scenarios with **high curvature** or **flat regions**.
- Reduces the oscillations caused by steep gradients in some directions, leading to more stable updates.

---

### **RMSprop (Root Mean Square Propagation)**

**RMSprop** is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter individually, based on the recent history of gradients. This helps the optimizer deal with the vanishing/exploding gradient problem and maintain a steady learning process.

#### How it works:
RMSprop maintains a moving average of the squared gradients for each parameter, which helps scale the learning rate appropriately.

The update rule for RMSprop is:

1. Update the exponentially decaying average of the squared gradients:
   
   $
   E[g^2]_t \gets \beta E[g^2]_{t-1} + (1 - \beta) g_t^2
   $
   
   Where:
   - $ g_t $ is the gradient at time step $ t $.
   - $ E[g^2]_t $ is the moving average of the squared gradients.
   - $ \beta $ is the decay rate (commonly set to 0.9).

2. Update the weights:

   $
   w \gets w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
   $

   Where:
   - $ \eta $ is the global learning rate.
   - $ \epsilon $ is a small constant added to prevent division by zero (usually $ 10^{-8} $).

#### Intuition:
- RMSprop adapts the learning rate based on how large or small the gradients have been for each parameter, meaning parameters with large gradients get smaller updates, and parameters with small gradients get larger updates.
- This adjustment helps prevent large, noisy gradients from causing overly large updates, while ensuring small gradients still make meaningful progress.

#### Benefits:
- **Effective for non-stationary loss surfaces**: RMSprop performs well when the gradient magnitudes change frequently (non-stationary problems).
- **Prevents vanishing/exploding gradients**: By adjusting the learning rate based on the squared gradients, RMSprop helps stabilize the training process, especially in deep networks where vanishing/exploding gradients can be problematic.

---

### Comparison: **Momentum vs. RMSprop**
- **Momentum** accelerates gradient descent by building velocity and smoothing oscillations. It works well when the optimizer faces challenges with high curvature, small gradients, or oscillations.
  
- **RMSprop** adapts the learning rate for each parameter individually based on the recent gradient history. It excels at solving problems where the gradients are highly variable or non-stationary, and it helps stabilize the training process.

### **Combining Momentum and RMSprop**
Interestingly, both techniques can be combined. The **Adam optimizer** (Adaptive Moment Estimation) incorporates both **momentum** and **RMSprop** principles by using momentum for first-order gradients (like in momentum) and RMSprop-like behavior for second-order gradients (adaptively scaling the learning rate).

In summary:
- **Momentum** adds velocity to the gradient updates, helping traverse flat regions and dampen oscillations.
- **RMSprop** adjusts the learning rate for each parameter individually based on the recent history of gradients, preventing large updates in some directions and making training more stable.

These optimizers improve upon vanilla gradient descent, making them effective choices in deep learning models.