In [1]:
# Why Prefer Logistic Regression over a Perceptron?

# Preference: Logistic Regression outputs class probabilities rather than hard 0/1 predictions. Its log loss is convex, so gradient descent still converges to a good solution when the data is not linearly separable. The Perceptron, by contrast, outputs only a binary class, and its learning rule converges only if the data is linearly separable.
# Tweaking a Perceptron: To make a Perceptron equivalent to a Logistic Regression classifier, replace the threshold step function with a sigmoid activation function and train it with gradient descent on the log loss (cross-entropy).
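
# A minimal sketch of the tweak described above: one neuron with a sigmoid activation, trained by batch gradient descent on the mean log loss. The function names, the learning rate, the epoch count, and the toy AND dataset are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_unit(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss for a single sigmoid neuron."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            error = p - yi  # derivative of the log loss w.r.t. the pre-activation
            for j in range(n_features):
                grad_w[j] += error * xi[j] / m
            grad_b += error / m
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(x, w, b):
    """Estimated probability of the positive class for one sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data (hypothetical): the logical AND of two binary inputs.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]

w, b = train_logistic_unit(X, y)
probs = [predict_proba(x, w, b) for x in X]
print([round(p, 3) for p in probs])
```

# Thresholding each probability at 0.5 recovers a hard class label like the Perceptron's, while the probability itself is kept as a measure of confidence.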

In [2]:
# Importance of Logistic Activation in Training Early MLPs:

# The logistic (sigmoid) activation function introduced a smooth non-linearity, letting MLPs learn patterns that are not linearly separable. Unlike the step function, whose derivative is zero wherever it is defined, the sigmoid is differentiable with a nonzero derivative everywhere, so backpropagation always has a gradient to follow.
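
# A small sketch of why differentiability matters, assuming a central finite difference as the gradient estimate: the step function gives gradient descent nothing to work with, while the sigmoid's derivative s * (1 - s) is positive everywhere. The helper names are illustrative.

```python
import math

def step(z):
    """Threshold activation used by the original Perceptron."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.5, 3.0):
    print(f"z={z:+.1f}  step'={numerical_grad(step, z):.3f}  sigmoid'={sigmoid_grad(z):.3f}")
```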

In [3]:
# Three Popular Activation Functions:

# Sigmoid: Squashes input to the (0,1) range.
# ReLU (Rectified Linear Unit): Outputs zero for negative inputs and the input itself for positive inputs, i.e. max(0, z).
# Tanh: Squashes input to the (-1,1) range, generally preferred over sigmoid due to zero-centered outputs.
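
# The three functions listed above can be sketched in a few lines of plain Python. The sample inputs are arbitrary and only meant to show each function's output range.

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Zero for negative z, identity for positive z."""
    return max(0.0, z)

def tanh(z):
    """Squashes z into (-1, 1), centered on zero."""
    return math.tanh(z)

for z in (-3.0, 0.0, 3.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  relu={relu(z):.1f}  tanh={tanh(z):+.3f}")
```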


In [4]:
# Shape of Input Matrix X: If there are n samples, each with 10 features, X has a shape of (n, 10).
# Shape of Hidden Layer's Weight Matrix W_h: Since W_h maps from 10 input features to 50 hidden neurons, it has a shape of (10, 50).
# Shape of Hidden Layer's Bias Vector b_h: One bias per hidden neuron, shape is (50,).
# Shape of Output Layer's Weight Matrix W_o: Maps from 50 hidden neurons to 3 output neurons, shape is (50, 3).
# Shape of Output Layer's Bias Vector b_o: One bias per output neuron, shape is (3,).
# Shape of Network's Output Matrix Y: Final output matrix shape is (n, 3) (one row per sample and one column per class).
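
# These shapes can be checked with a quick NumPy forward pass. This is only a sketch: the batch size n = 4, the random weights, and the ReLU hidden activation are assumptions, and the output layer is left without an activation.

```python
import numpy as np

n = 4  # hypothetical number of samples
rng = np.random.default_rng(0)

X = rng.normal(size=(n, 10))     # input matrix: n samples, 10 features
W_h = rng.normal(size=(10, 50))  # hidden layer weights
b_h = np.zeros(50)               # hidden layer biases
W_o = rng.normal(size=(50, 3))   # output layer weights
b_o = np.zeros(3)                # output layer biases

H = np.maximum(0.0, X @ W_h + b_h)  # hidden activations (ReLU assumed)
Y = H @ W_o + b_o                   # network output

print(X.shape, W_h.shape, b_h.shape, H.shape, W_o.shape, b_o.shape, Y.shape)
```

# The bias vectors broadcast across the n rows, which is why they have shape (50,) and (3,) rather than (n, 50) and (n, 3).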

In [5]:
# Output Layer Neurons and Activation Functions:

# Spam or Ham Classification: Use 1 output neuron with a sigmoid activation function. Its output is the estimated probability that the email is spam (label 1) rather than ham (label 0).
# MNIST Classification: Use 10 output neurons (one per digit) with softmax activation to output class probabilities.
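
# A sketch of the two output layers in plain Python. The logit values are made up for illustration: a single sigmoid output for the binary case, and a softmax over 10 logits for the digits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: one logit, one probability of "spam".
spam_probability = sigmoid(1.2)

# Multiclass case: ten hypothetical logits, one per digit 0-9.
digit_logits = [0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2, 1.1, 0.4, -0.7]
digit_probs = softmax(digit_logits)
predicted_digit = max(range(10), key=lambda k: digit_probs[k])

print(f"P(spam) = {spam_probability:.3f}")
print(f"predicted digit = {predicted_digit}")
```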


In [6]:
# Backpropagation and Reverse-Mode Autodiff:

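# Backpropagation: A method for training neural networks. For each batch it runs a forward pass, measures the loss, propagates the error backward through the layers to compute the gradient of the loss with respect to every weight and bias, and then takes a gradient descent step to reduce the loss.
# Difference: Reverse-mode autodiff is only the gradient computation: one forward pass through a computation graph to get the values, then one reverse pass to get the gradients of the output with respect to all inputs. It works for any computation graph, not just neural networks. Backpropagation uses reverse-mode autodiff for its gradient step and adds the parameter update, repeated over many batches.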
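
# To make the distinction concrete, here is a toy scalar reverse-mode autodiff in plain Python. The Value class and its methods are hypothetical and written only for this illustration; frameworks such as TensorFlow implement the same idea on tensors. The final update line is the part that a training loop adds on top of the gradient computation.

```python
import math

class Value:
    """A scalar node in a computation graph that records how it was produced."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        """Reverse pass: apply the chain rule from this node back to the inputs."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)  # appended after all of its parents

        visit(self)
        self.grad = 1.0
        # Reversed order visits each node only after every node that consumes it.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

# Forward pass through a tiny graph: out = sigmoid(w * x + b).
w, x, b = Value(2.0), Value(-1.0), Value(0.5)
out = (w * x + b).sigmoid()

# Reverse pass: this is the reverse-mode autodiff part.
out.backward()
print(w.grad, x.grad, b.grad)

# A training step would then use the gradients to update the parameters.
learning_rate = 0.1
for param in (w, b):
    param.data -= learning_rate * param.grad
```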


In [7]:
# Hyperparameters to Tweak in an MLP:

# Hyperparameters: Number of layers, number of neurons per layer, learning rate, activation functions, batch size, number of epochs, dropout rate, regularization strength, optimizer choice.
# Handling Overfitting: Reduce the number of neurons/layers, add dropout, increase regularization, or use early stopping.
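
# Early stopping is the easiest of these remedies to express in code. The sketch below is a simplified, framework-free version of the patience rule: stop once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is the last epoch that would be run; best_epoch is the epoch
    with the lowest validation loss seen up to that point.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after the third epoch.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

# Restoring the weights from the best epoch, rather than keeping those from the stopping epoch, is what makes the rule act as a regularizer.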


In [12]:
# Training a Deep MLP on MNIST to 98% Accuracy:

# Steps:
# Build the model architecture with multiple hidden layers.
# Use early stopping and dropout for regularization.
# Save checkpoints during training.
# Log metrics in TensorBoard for tracking.
# Visualize learning curves and accuracy to assess model performance.



import tensorflow as tf

# Load and preprocess the MNIST data
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile and train with checkpoint saving
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("mnist_model.keras", save_best_only=True)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, validation_split=0.1, callbacks=[checkpoint_cb, early_stopping_cb])

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")


Epoch 1/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.8504 - loss: 0.5115 - val_accuracy: 0.9653 - val_loss: 0.1262
Epoch 2/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9494 - loss: 0.1707 - val_accuracy: 0.9742 - val_loss: 0.0922
Epoch 3/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9623 - loss: 0.1221 - val_accuracy: 0.9763 - val_loss: 0.0811
Epoch 4/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9712 - loss: 0.0954 - val_accuracy: 0.9768 - val_loss: 0.0785
Epoch 5/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9748 - loss: 0.0811 - val_accuracy: 0.9807 - val_loss: 0.0724
Epoch 6/20
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9782 - loss: 0.0682 - val_accuracy: 0.9810 - val_loss: 0.0707
Epoch 7/20
1