### 1. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

Logistic Regression and the classical Perceptron are both linear classifiers, but there are important differences between them. Generally, Logistic Regression is preferred over the classical Perceptron for several reasons:

1. **Output Range:**
   - **Logistic Regression:** The output of a logistic regression model is a probability estimate, bounded between 0 and 1. This makes it suitable for binary classification tasks, where the output can be interpreted as the probability of belonging to a certain class.
   - **Perceptron:** The output of a classical Perceptron is a hard class prediction (0 or 1) from a fixed threshold, so it provides no estimate of class probability.

2. **Differentiability:**
   - **Logistic Regression:** The logistic function used in logistic regression is smooth and differentiable. This differentiability is crucial for gradient-based optimization algorithms, allowing efficient and stable training.
   - **Perceptron:** The step function used in the classical Perceptron has a zero derivative everywhere except at the threshold, where it is undefined, so it gives gradient-based methods no signal to follow.

3. **Gradient Descent:**
   - **Logistic Regression:** Logistic Regression can be trained using gradient descent or other optimization methods that require the computation of gradients. The smoothness of the logistic function facilitates stable and efficient optimization.
   - **Perceptron:** Because the step function provides no useful gradient, the Perceptron cannot be trained with gradient descent. Its training algorithm instead updates the weights only on misclassified examples, and it is guaranteed to converge only when the training data is linearly separable. Logistic Regression trained by gradient descent still converges to a good solution when the data is not linearly separable.

To make a Perceptron equivalent to a Logistic Regression classifier, you can make the following modifications:

1. **Activation Function:** Replace the step function in the Perceptron with a logistic (sigmoid) activation function. This transforms the output of the Perceptron into a continuous range between 0 and 1.

2. **Loss Function:** Train the model with gradient descent on the logistic loss (log loss, also called cross-entropy) instead of using the Perceptron update rule. The logistic loss supports a probabilistic interpretation and is compatible with gradient-based optimization methods.
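
To make the contrast concrete, the following is a minimal NumPy sketch, not a reference implementation. The function names, the learning rate `eta`, and the toy data are illustrative assumptions. It places one Perceptron update next to one gradient descent step on the log loss; the two differ only in whether the prediction passes through a step function or a sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_step(w, b, x, y, eta=0.1):
    """One Perceptron update for a single example (x, y), with y in {0, 1}.

    The prediction is a hard 0/1 threshold, so the weights change
    only when the example is misclassified.
    """
    y_hat = 1.0 if x @ w + b >= 0 else 0.0
    error = y - y_hat              # -1, 0, or +1
    return w + eta * error * x, b + eta * error

def logistic_step(w, b, x, y, eta=0.1):
    """One stochastic gradient descent step on the log loss.

    Same update shape as the Perceptron, but the prediction is a
    probability, so every example contributes a (possibly small) update.
    """
    p = sigmoid(x @ w + b)
    error = y - p                  # a real number in (-1, 1)
    return w + eta * error * x, b + eta * error

# Toy example (hypothetical data): a correctly classified positive instance.
w, b = np.array([1.0, -0.5]), 0.0
x, y = np.array([2.0, 1.0]), 1.0

w_p, b_p = perceptron_step(w, b, x, y)
w_l, b_l = logistic_step(w, b, x, y)
print(w_p, b_p)  # unchanged: the Perceptron is already correct on this example
print(w_l, b_l)  # nudged: logistic regression still increases its confidence
```

On this example the Perceptron leaves its weights alone because the prediction is already correct, while the logistic step keeps pushing the predicted probability toward 1.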

### 2. Why was the logistic activation function a key ingredient in training the first MLPs?


The logistic activation function, also known as the sigmoid activation function, played a key role in training the first Multilayer Perceptrons (MLPs) for several reasons:

1. **Differentiability:**
   - The logistic function is differentiable for all values of its input. This differentiability is crucial for training neural networks using gradient-based optimization algorithms like gradient descent. The ability to compute derivatives allows efficient backpropagation of errors through the network, enabling the adjustment of weights to minimize the error.

2. **Squashing Nonlinearity:**
   - The logistic function "squashes" its input into a range between 0 and 1. This characteristic is important for introducing nonlinearity into the network. Without nonlinearity, stacking multiple layers of neurons would not provide any additional representational power. The logistic function introduces the necessary nonlinearity, allowing the network to learn complex relationships in the data.

3. **Output as Probabilities:**
   - The logistic function produces outputs in the range [0, 1], which can be interpreted as probabilities. In binary classification problems, the output of the logistic activation function can represent the probability of belonging to the positive class. This probability interpretation is crucial for tasks where understanding the uncertainty or confidence of predictions is important.

4. **Smooth Transitions:**
   - The logistic function changes smoothly with its input, so a small change in the weights produces a small change in the output, and the gradient always indicates a direction in which to adjust the weights. In contrast, the step function used in the original Perceptron is flat everywhere except at the threshold, so its gradient is zero and gradient descent cannot make progress.

5. **Nonzero Gradient Everywhere (but Saturation at the Extremes):**
   - The logistic function has a nonzero derivative for every input, so backpropagated gradients always carry some signal. It does not, however, mitigate the vanishing gradient problem: its derivative is at most 0.25 and approaches zero for large positive or negative inputs, so gradients shrink as they are backpropagated through many layers. This saturation is one reason later networks moved to activation functions such as ReLU.
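
The role of the derivative can be checked numerically. The sketch below uses plain Python, and its function names are illustrative. It evaluates the logistic function's derivative, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, at a few points. The derivative is nonzero everywhere, unlike the step function's, but it shrinks quickly as $|z|$ grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def step_grad(z):
    """Derivative of the step function: 0 wherever it is defined (z != 0)."""
    return 0.0

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:+5.1f}  sigmoid'={sigmoid_grad(z):.6f}  step'={step_grad(z):.1f}")
```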

### 3. Name three popular activation functions. Can you draw them?


Three popular activation functions used in neural networks are:

1. **ReLU (Rectified Linear Unit):**
   - **Function:** $f(x) = \max(0, x)$
   - **Advantages:** ReLU is computationally efficient and helps mitigate the vanishing gradient problem. It introduces non-linearity into the network and has been widely adopted in many deep learning architectures.

2. **Sigmoid (Logistic):**
   - **Function:** $f(x) = \frac{1}{1 + e^{-x}}$
   - **Advantages:** Sigmoid squashes its input to the range $(0, 1)$, making it suitable for binary classification problems where the output can be interpreted as a probability. However, it has a tendency to saturate for extreme input values, leading to vanishing gradients.

3. **tanh (Hyperbolic Tangent):**
   - **Function:** $f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$
   - **Advantages:** The tanh function squashes its input to the range $(-1, 1)$, making its output zero-centered. Zero-centered activations tend to help training converge faster than the sigmoid's always-positive outputs. Like the sigmoid, tanh still saturates for extreme input values, so it does not eliminate vanishing gradients.
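
Since the question asks for a drawing, here is one hedged way to produce it. The sketch defines the three functions with the standard library only. The helper `plot_activations` is an assumption of this example and needs Matplotlib installed; it is defined but not called, so the rest of the block runs without it.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Same formula as in the text: (e^{2x} - 1) / (e^{2x} + 1)
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

ACTIVATIONS = {"ReLU": relu, "Sigmoid": sigmoid, "tanh": tanh}

def plot_activations(lo=-5.0, hi=5.0, n=201):
    """Plot the three activations on [lo, hi]. Requires Matplotlib."""
    import matplotlib.pyplot as plt  # imported lazily; optional dependency

    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, (name, fn) in zip(axes, ACTIVATIONS.items()):
        ax.plot(xs, [fn(x) for x in xs])
        ax.set_title(name)
        ax.grid(True)
    plt.show()

# Spot-check a few values without plotting.
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```

Calling `plot_activations()` should show a hinge at zero for ReLU and two S-shaped curves, one bounded by 0 and 1 and the other by -1 and 1.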

### 4. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.  
a. What is the shape of the input matrix X?  
b. What about the shape of the hidden layer’s weight vector Wh, and the shape of its bias vector bh?  
c. What is the shape of the output layer’s weight vector Wo, and its bias vector bo?  
d. What is the shape of the network’s output matrix Y?  
e. Write the equation that computes the network’s output matrix Y as a function of X, Wh, bh, Wo and bo.

Let's denote the following:

- $X$ as the input matrix of shape $(\text{batch size}, \text{input features})$,
- $Wh$ as the weight matrix for the hidden layer of shape $(\text{input features}, \text{number of hidden neurons})$,
- $bh$ as the bias vector for the hidden layer of shape $(\text{number of hidden neurons},)$,
- $Wo$ as the weight matrix for the output layer of shape $(\text{number of hidden neurons}, \text{number of output neurons})$,
- $bo$ as the bias vector for the output layer of shape $(\text{number of output neurons},)$,
- $Y$ as the output matrix of shape $(\text{batch size}, \text{number of output neurons})$.

Now, based on the information provided:

1. **Input Matrix $X$:**
   - Shape: $(\text{batch size}, 10)$

2. **Hidden Layer's Weight Matrix $Wh$:**
   - Shape: $(10, 50)$

3. **Hidden Layer's Bias Vector $bh$:**
   - Shape: $(50,)$

4. **Output Layer's Weight Matrix $Wo$:**
   - Shape: $(50, 3)$

5. **Output Layer's Bias Vector $bo$:**
   - Shape: $(3,)$

6. **Network's Output Matrix $Y$:**
   - Shape: $(\text{batch size}, 3)$

7. **Equation for the Network's Output $Y$:**
   - The hidden layer applies ReLU to its weighted sum. Because every artificial neuron in this network uses ReLU, the output layer applies ReLU to its weighted sum as well.

   The equations are as follows:

   $$\text{Hidden layer weighted sum: } H = X \cdot Wh + bh$$
   $$\text{Hidden layer output: } A = \max(0, H)$$
   $$\text{Network output: } Y = \max(0, A \cdot Wo + bo)$$

Combined into a single expression:

$$Y = \max(0, \max(0, X \cdot Wh + bh) \cdot Wo + bo)$$

These equations represent the forward pass of the given MLP, where $X$ is the input matrix, $Wh$ and $Wo$ are the weight matrices for the hidden and output layers, $bh$ and $bo$ are the bias vectors (added to every row of the corresponding matrix product), and $\max(0, \cdot)$ denotes the element-wise ReLU activation function.
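
The shapes can be checked with a short NumPy sketch. The batch size of 32 and the random initial values are arbitrary assumptions made for the demonstration; only the layer sizes (10, 50, 3) come from the exercise. ReLU is applied at both layers because the exercise states that all artificial neurons use it.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 32                      # arbitrary; not fixed by the exercise

X = rng.normal(size=(batch_size, 10))
Wh = rng.normal(size=(10, 50))
bh = np.zeros(50)
Wo = rng.normal(size=(50, 3))
bo = np.zeros(3)

def relu(z):
    """Element-wise ReLU."""
    return np.maximum(0, z)

A = relu(X @ Wh + bh)                # hidden activations; bh broadcasts over rows
Y = relu(A @ Wo + bo)                # network output

for name, arr in [("X", X), ("Wh", Wh), ("bh", bh), ("A", A),
                  ("Wo", Wo), ("bo", bo), ("Y", Y)]:
    print(f"{name:>2}: {arr.shape}")
```

Changing `batch_size` changes only the first dimension of `X`, `A`, and `Y`; the weight and bias shapes depend on the layer sizes alone.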

### 5. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function?

#### Email Classification (Spam or Ham):

1. **Number of Neurons in Output Layer:**
   - For binary classification (spam or ham), you need one neuron in the output layer.
  
2. **Activation Function in Output Layer:**
   - Use the sigmoid activation function in the output layer for binary classification. The sigmoid function squashes the output between 0 and 1, representing the probability of the input belonging to the positive class (spam).

#### MNIST Classification:

1. **Number of Neurons in Output Layer:**
   - For MNIST, which has 10 classes (digits 0 through 9), you need 10 neurons in the output layer, each representing a different digit.

2. **Activation Function in Output Layer:**
   - Use the softmax activation function in the output layer for multi-class classification. The softmax function converts the raw output scores into probability distributions over the different classes, facilitating the interpretation of the network's confidence in each class.
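
As a small illustration of the two output activations, the following plain-Python sketch applies a sigmoid to a single logit and a softmax to ten logits. The logit values are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Spam vs. ham: one output neuron, one probability (hypothetical logit).
p_spam = sigmoid(1.2)
print(f"P(spam) = {p_spam:.3f}, P(ham) = {1 - p_spam:.3f}")

# MNIST: ten output neurons, one probability per digit (hypothetical logits).
logits = [0.1, -1.3, 2.5, 0.0, 0.4, -0.7, 1.1, 3.2, -2.0, 0.6]
probs = softmax(logits)
predicted_digit = max(range(10), key=lambda k: probs[k])
print(f"sum = {sum(probs):.6f}, predicted digit = {predicted_digit}")
```

A single sigmoid output is enough for two classes because the probability of the second class is one minus the first, whereas softmax normalizes all ten outputs so that they sum to 1.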

### 6. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

#### Backpropagation:

**Backpropagation** is the algorithm used to train neural networks with gradient descent. Each training step consists of a forward pass, a backward pass, and a weight update.

1. **Forward Pass:** During the forward pass, the input data is passed through the network, layer by layer, to compute the predicted output. The input is multiplied by the weights, and activation functions are applied to produce the output of each layer. The predicted output is then compared to the actual target values using a loss function, which quantifies the difference between the predicted and actual values.
   
2. **Backward Pass:** The backward pass involves calculating the gradient of the loss function with respect to the weights of the network. The gradients are computed using the chain rule of calculus, starting from the output layer and moving backward through the network. The gradients indicate how much the loss would increase or decrease if the weights were adjusted.

3. **Weight Update:** The computed gradients are used to update the weights of the network in the direction that minimizes the loss. This process is typically performed using an optimization algorithm like gradient descent.

#### Reverse-Mode Autodiff:

**Reverse-mode autodiff (automatic differentiation)** is a mathematical technique for efficiently computing gradients in the context of machine learning and neural network training.

1. **Forward Evaluation:** During the forward pass, the computational graph is constructed to represent the operations performed in the model. Intermediate values and operations are recorded during this phase.
   
2. **Backward Differentiation:** In the backward pass, gradients are calculated by propagating the derivative of the loss backward through the computational graph. The chain rule is applied to compute the gradients efficiently.

#### Difference between Backpropagation and Reverse-Mode Autodiff:

1. **Terminology:** "Backpropagation" usually refers to the whole training procedure for neural networks: the forward pass, the backward pass that computes the gradients, and the gradient descent step that updates the weights. "Reverse-mode autodiff" refers only to the technique for computing gradients in a computational graph, and it is not limited to neural networks.

2. **Scope:** Backpropagation is a specific application of reverse-mode autodiff in the context of neural network training. Reverse-mode autodiff is a broader concept used in various mathematical contexts, not exclusive to neural networks.

In summary, backpropagation uses reverse-mode autodiff to compute the gradients and then applies a gradient descent step to update the weights. Reverse-mode autodiff on its own is only the gradient computation, and it can be applied well beyond neural network training.
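
To illustrate the distinction, here is a toy reverse-mode autodiff sketch in plain Python. The `Value` class and its methods are hypothetical names invented for this example, not part of any library, and only addition and multiplication are supported. Nothing in it is specific to neural networks; a full training loop would add a loss function and a gradient descent step on top.

```python
class Value:
    """A scalar node in a computational graph (illustrative, not a library API)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None          # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad         # d(a+b)/da = 1
            other.grad += out.grad        # d(a+b)/db = 1
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward_fn = backward_fn
        return out

    def backward(self):
        """Reverse pass: apply the chain rule in reverse topological order."""
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                   # d(output)/d(output)
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

# The forward pass builds the graph for f(w, x, b) = w * x + b.
w, x, b = Value(3.0), Value(2.0), Value(1.0)
f = w * x + b
f.backward()                              # one reverse pass yields all gradients
print(f.data, w.grad, x.grad, b.grad)
```

The forward pass records each operation and its inputs, and a single reverse pass then fills in the gradient of the output with respect to every input, which is why this mode suits functions with many inputs and one scalar output, such as a loss.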

### 7. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

In a Multilayer Perceptron (MLP), there are several hyperparameters that you can tweak to influence its training and performance. Here is a list of common hyperparameters in an MLP:

1. **Number of Hidden Layers:** The number of layers between the input and output layers.

2. **Number of Neurons in Each Hidden Layer:** The number of artificial neurons in each hidden layer.

3. **Learning Rate:** Determines the step size during optimization. It influences how much the weights are adjusted during each iteration of training.

4. **Activation Functions:** The activation function applied to the output of each neuron in the hidden layers. Common choices include ReLU, sigmoid, and tanh.

5. **Weight Initialization:** The method used to initialize the weights before training. Common methods include random initialization, Xavier/Glorot initialization, and He initialization.

6. **Regularization Techniques:** Techniques to prevent overfitting, such as L1 or L2 regularization, dropout, and early stopping.

7. **Batch Size:** The number of training examples used in each iteration of training.

8. **Number of Epochs:** The number of times the entire training dataset is passed through the network during training.

9. **Optimizer:** The optimization algorithm used for updating weights during training. Common choices include stochastic gradient descent (SGD), Adam, RMSprop, etc.

10. **Learning Rate Schedule:** A strategy for changing the learning rate during training. This can be a fixed schedule or adaptive methods.

11. **Momentum:** A parameter that adds a fraction of the previous weight update to the current update during optimization.

#### If an MLP is overfitting the training data, meaning it performs well on the training set but poorly on new, unseen data, you can try the following hyperparameter adjustments:

1. **Reduce Model Complexity:** Decrease the number of hidden layers or neurons in each layer to reduce the model's capacity.

2. **Apply Regularization:** Increase the strength of regularization techniques such as L1 or L2 regularization. Alternatively, add dropout to the hidden layers.

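3. **Adjust Learning Rate:** Experiment with different learning rates. The learning rate mainly affects how quickly and how reliably training converges, so it is usually not the first remedy for overfitting, but a poorly chosen value can still hurt performance on the validation set.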

4. **Early Stopping:** Monitor the performance on a validation set during training and stop training when the performance on the validation set starts to degrade.

5. **Data Augmentation:** Introduce data augmentation techniques to artificially increase the size of the training dataset.

6. **Batch Normalization:** Add batch normalization layers to normalize the inputs to hidden layers, which can help stabilize and accelerate training.

7. **Hyperparameter Search:** Perform a systematic hyperparameter search, possibly using techniques like grid search or random search, to find the best combination of hyperparameters for your specific problem.

It's important to note that the effectiveness of these adjustments may vary depending on the specific characteristics of your data and the problem at hand. Experimentation and monitoring the model's performance on a validation set are essential for finding the optimal set of hyperparameters.
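
Early stopping is simple enough to sketch in a few lines of plain Python. The function name, the `patience` parameter, and the validation losses below are assumptions made for illustration. In Keras, the `EarlyStopping` callback (with its `patience` and `restore_best_weights` arguments) plays a comparable role.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs; the weights from `best_epoch`
    are the ones you would keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving, then degrading (overfitting).
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.44, 0.47, 0.50]
stop, best = early_stopping_epoch(val_losses, patience=3)
print(f"stop after epoch {stop}, keep weights from epoch {best}")
```

A larger `patience` tolerates noisier validation curves at the cost of a few extra epochs of training.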

### 8. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to between 0 and 1

# Flatten the 28x28 images into 784-dimensional vectors
x_train_flat = x_train.reshape((x_train.shape[0], -1))
x_test_flat = x_test.reshape((x_test.shape[0], -1))

# One-hot encode the labels
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)

# Build the MLP model: two ReLU hidden layers with dropout, softmax output
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(784,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Set up callbacks: keep the best model seen so far and log summaries for TensorBoard
checkpoint_cb = ModelCheckpoint("model_checkpoint.h5", save_best_only=True)
tensorboard_cb = TensorBoard(log_dir="./logs", histogram_freq=1)

# Train the model
# Note: the test set doubles as the validation set here, so the saved
# checkpoint is selected on test data and the final score is optimistic.
model.fit(
    x_train_flat, y_train_cat, epochs=20,
    validation_data=(x_test_flat, y_test_cat),
    callbacks=[checkpoint_cb, tensorboard_cb]
)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test_flat, y_test_cat)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
```




Output (per-epoch progress for the 20 epochs omitted):

```
Test Accuracy: 98.21%
```
