1. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

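Logistic Regression is generally preferable for two reasons. First, a classical Perceptron converges only if the dataset is linearly separable, whereas a Logistic Regression classifier converges to a good solution even when it is not. Second, Logistic Regression outputs estimated class probabilities, while a Perceptron outputs only a hard class prediction. To make a Perceptron equivalent to a Logistic Regression classifier, replace its step activation function with the sigmoid (logistic) function and train it with Gradient Descent on the binary cross-entropy (log loss) instead of the Perceptron learning rule. Here's Python code: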

# Perceptron with sigmoid activation and binary cross-entropy loss
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def perceptron(X, weights):
    # Linear score followed by the sigmoid: the estimated P(y = 1 | x).
    # No separate bias term: append a column of ones to X to learn one.
    z = np.dot(X, weights)
    return sigmoid(z)

def binary_cross_entropy(y_true, y_pred):
    # Clip predictions so that log(0) is never evaluated
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Average the per-sample losses into one scalar
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Training loop (update weights using batch gradient descent)
def train_perceptron(X, y, learning_rate, epochs):
    weights = np.random.rand(X.shape[1])  # Initialize weights
    for epoch in range(epochs):
        y_pred = perceptron(X, weights)
        loss = binary_cross_entropy(y, y_pred)  # Tracked for monitoring only
        # Gradient of the mean binary cross-entropy with respect to the weights
        gradient = np.dot(X.T, (y_pred - y)) / X.shape[0]
        weights -= learning_rate * gradient
    return weights
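
For contrast with the sigmoid version above, here is a minimal sketch of a classical Perceptron: a hard step activation trained with the Perceptron learning rule. The names `step` and `train_classical_perceptron`, the toy AND dataset, and the default learning rate and epoch count are assumptions chosen for illustration only.

```python
import numpy as np

def step(z):
    # Heaviside step: a hard 0/1 decision with no probability attached
    return (z >= 0).astype(int)

def train_classical_perceptron(X, y, learning_rate=1.0, epochs=20):
    # Hypothetical helper: Perceptron learning rule, one sample at a time
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # error is -1, 0 or +1; weights change only on a misclassification
            error = y_i - step(np.dot(x_i, weights) + bias)
            weights += learning_rate * error * x_i
            bias += learning_rate * error
    return weights, bias

# Linearly separable toy data: logical AND (made up for this sketch)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

weights, bias = train_classical_perceptron(X, y)
predictions = step(X @ weights + bias)
```

On this separable dataset the rule settles on a separating line, but the output is only a class label. On data that is not linearly separable, the same loop would keep changing the weights without settling, which is the case where the sigmoid version is the better choice.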


2. Why was the logistic activation function a key ingredient in training the first MLPs?

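The logistic (sigmoid) activation function was key because its derivative is nonzero everywhere, so Gradient Descent always has a slope to follow. The step function used by earlier Perceptrons is flat everywhere except at 0, so its gradient is zero and Gradient Descent cannot make progress. The sigmoid is also non-linear, which lets stacked layers approximate complex functions. Its drawback is that it saturates for large inputs, causing vanishing gradients in deep networks, which later motivated activation functions such as ReLU.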
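
The gradient argument can be checked numerically. The sketch below compares the derivative of the sigmoid with that of a step function at a few sample points; the function names and the sample points are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)), strictly positive
    s = sigmoid(z)
    return s * (1 - s)

def step_derivative(z):
    # The step function is flat on both sides of 0 (and undefined at 0),
    # so the slope is treated as 0 everywhere in this sketch
    return np.zeros_like(z, dtype=float)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # arbitrary sample points
sig_grad = sigmoid_derivative(z)
step_grad = step_derivative(z)
```

The sigmoid slope peaks at 0.25 for z = 0 and shrinks toward 0 as |z| grows, which is the saturation behind vanishing gradients. The step slope is 0 at every point, so there is nothing for Gradient Descent to follow.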

3. Name three popular activation functions. Can you draw them?

Three popular activation functions are:

ReLU (Rectified Linear Unit):
f(x) = max(0, x)

Sigmoid:
f(x) = 1 / (1 + exp(-x))

Tanh (Hyperbolic Tangent):
f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

These functions have distinctive shapes. ReLU is flat at 0 for negative inputs and rises as a straight line with slope 1 for positive inputs. Sigmoid is an S-shaped curve between 0 and 1 that passes through 0.5 at x = 0, while Tanh is an S-shaped curve between -1 and 1 that passes through 0 at x = 0.
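
A minimal sketch for drawing the three curves, assuming NumPy and Matplotlib are installed. The helper name `plot_activations` and its default plotting range are assumptions, not part of the exercise.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def plot_activations(x_min=-5.0, x_max=5.0, n_points=200):
    # Hypothetical helper. Matplotlib is imported here so that the numeric
    # functions above can be used without it.
    import matplotlib.pyplot as plt
    x = np.linspace(x_min, x_max, n_points)
    curves = [("ReLU", relu), ("Sigmoid", sigmoid), ("Tanh", tanh)]
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, (name, fn) in zip(axes, curves):
        ax.plot(x, fn(x))
        ax.set_title(name)
        ax.grid(True)
    plt.show()

x = np.array([-2.0, 0.0, 2.0])  # a few points to evaluate by hand
```

Calling `plot_activations()` draws the three curves side by side. Evaluating the functions on `x` confirms the formulas listed above.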

4. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons, all using the ReLU activation function.

Input matrix X shape: (batch_size, 10)
Hidden layer weight matrix Wh shape: (10, 50)
Hidden layer bias vector bh shape: (50,)
Output layer weight matrix Wo shape: (50, 3)
Output layer bias vector bo shape: (3,)
Network's output matrix Y shape: (batch_size, 3)
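The equation for computing the network's output Y as a function of X, Wh, bh, Wo, and bo is as follows (it applies to a whole batch at once, because the bias vectors are broadcast across the rows, and ReLU is applied at both layers as the question states):

hidden_output = np.maximum(0, np.dot(X, Wh) + bh)
Y = np.maximum(0, np.dot(hidden_output, Wo) + bo)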
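
The shapes listed above can be verified with a quick sketch. The batch size of 4 and the random values are arbitrary assumptions. ReLU is applied at the output layer too, since the exercise states that all artificial neurons use it.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4  # arbitrary choice for this check

X = rng.normal(size=(batch_size, 10))   # input matrix
Wh = rng.normal(size=(10, 50))          # hidden layer weights
bh = np.zeros(50)                       # hidden layer biases
Wo = rng.normal(size=(50, 3))           # output layer weights
bo = np.zeros(3)                        # output layer biases

def relu(z):
    return np.maximum(0, z)

# The bias vectors broadcast across the batch dimension
hidden_output = relu(X @ Wh + bh)
Y = relu(hidden_output @ Wo + bo)
```

Changing `batch_size` changes only the first dimension of `X`, `hidden_output` and `Y`. The parameter shapes stay the same.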


5. How many neurons do you need in the output layer for email classification (spam or ham)? What activation function should you use? For MNIST classification, how many neurons are needed, and what activation function?

For email classification (binary classification, spam or ham), you need 1 neuron in the output layer with the sigmoid activation function.

For MNIST classification (multi-class, 10 digits), you need 10 neurons in the output layer with the softmax activation function.
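
As a rough illustration of the two output layers, the sketch below applies a sigmoid to a single spam score and a softmax to ten digit scores. The logit values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum before exponentiating, for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Binary case: one output neuron, read as the estimated probability of spam
spam_logit = np.array([1.2])  # made-up score
p_spam = sigmoid(spam_logit)

# Multi-class case: ten output neurons, one per digit, probabilities sum to 1
digit_logits = np.array([[0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2, 1.1, 0.7, -0.5]])
digit_probs = softmax(digit_logits)
predicted_digit = int(np.argmax(digit_probs, axis=-1)[0])
```

Softmax suits MNIST because the ten classes are mutually exclusive, so the ten outputs should form a single probability distribution.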

6. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

Backpropagation is an algorithm for training neural networks with Gradient Descent. For each mini-batch, it computes the gradient of the loss function with respect to every weight and bias, then adjusts these parameters to reduce the loss. Each training step involves a forward and a backward pass through the network:

Forward Pass: Compute the network's predictions and the corresponding loss, keeping the intermediate results of each layer.
Backward Pass: Go through the layers in reverse order, applying the chain rule to compute the gradient of the loss with respect to each parameter, then update the parameters with a Gradient Descent step.

Reverse-mode autodiff is a general technique for efficiently computing the gradients of any computational graph, and it is well suited to functions with many inputs and few outputs, such as a loss function. The two terms are not quite the same thing: reverse-mode autodiff only computes gradients, while backpropagation refers to the whole training procedure, which uses reverse-mode autodiff for the gradient computation and Gradient Descent for the parameter updates.
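
The forward and backward passes can be sketched by hand for a tiny network. Everything here is an assumption made for illustration: a 2-3-1 network with a tanh hidden layer, a linear output, a mean squared error loss, random data, and the function names. The finite-difference comparison at the end is only a sanity check on the hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 2))             # 5 made-up samples, 2 features
y = rng.normal(size=(5, 1))             # made-up regression targets
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(X, W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)            # hidden activations, kept for the backward pass
    y_pred = h @ W2 + b2                # linear output layer
    return h, y_pred

def mse(y_pred, y):
    return np.mean((y_pred - y) ** 2)

def backward(X, y, h, y_pred, W2):
    # Chain rule applied from the loss back to each parameter
    d_y_pred = 2 * (y_pred - y) / y_pred.size
    dW2 = h.T @ d_y_pred
    db2 = d_y_pred.sum(axis=0)
    d_h = d_y_pred @ W2.T
    d_z1 = d_h * (1 - h ** 2)           # derivative of tanh
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)
    return dW1, db1, dW2, db2

h, y_pred = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, y, h, y_pred, W2)

# Finite-difference check on one weight of the first layer
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
loss_plus = mse(forward(X, W1_plus, b1, W2, b2)[1], y)
loss_minus = mse(forward(X, W1_minus, b1, W2, b2)[1], y)
numeric_dW1_00 = (loss_plus - loss_minus) / (2 * eps)

# One Gradient Descent step completes a training iteration
learning_rate = 0.01
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
```

The `backward` function is the gradient computation, which an autodiff framework would generate automatically. The four update lines at the end are the Gradient Descent part of a backpropagation training step.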

7. List all the hyperparameters you can tweak in an MLP. If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

Hyperparameters in an MLP include:

Learning rate
Number of hidden layers
Number of neurons in each hidden layer
Activation functions
Regularization techniques (e.g., L1 or L2 regularization)
Mini-batch size
Number of epochs
Weight initialization methods
Optimizers (e.g., SGD, Adam)
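
If the MLP overfits, you can:

Decrease the number of hidden layers
Decrease the number of neurons in hidden layers
Add dropout or strengthen L1/L2 regularization
Use early stopping (stop training when the validation loss stops improving)
Reduce the number of epochs

Gathering more training data also helps, although it is not a hyperparameter. Lowering the learning rate does not by itself reduce overfitting, and switching to convolutional or recurrent layers is a change of architecture that helps only when the data has matching structure, such as images or sequences. Because these settings interact, compare combinations on a validation set rather than tuning them one at a time.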
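
Early stopping, one of the remedies listed above, can be sketched independently of any framework. The function name `early_stopping_epoch`, the `patience` parameter, and the validation losses are assumptions for illustration. Real training code would compute the losses one epoch at a time and restore the best weights.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: remember it and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop here
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses: improving at first, then drifting upward
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=3)
```

With these numbers the loss is lowest at epoch 3, and training stops at epoch 6 after three epochs without improvement. The weights saved at epoch 3 are the ones to keep.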