# MNIST Classification with a Multi-Layer Perceptron from Scratch (NumPy only)

In this notebook, I implement a fully-connected neural network (MLP) from scratch using only NumPy. scikit-learn is used solely to download the dataset.

The goal is to classify handwritten digits from the MNIST dataset. The network architecture:
Input (784) → Dense(128) → ReLU → Dense(64) → ReLU → Dense(10) → Softmax

I created this project to strengthen my mathematical understanding of deep learning: no tutorials and no high-level ML libraries.

In [None]:
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1, as_frame=False)

X = mnist['data']       # Shape: (70000, 784)
y = mnist['target']     # Shape: (70000,)

X = X / 255.0           # Normalize pixel values to [0, 1]
y = y.astype(np.int32)  # Convert labels to integers

In [None]:
# Split into train/test (60k train, 10k test)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (60000, 784)
X_test shape: (10000, 784)
y_train shape: (60000,)
y_test shape: (10000,)


# Define Activation and Loss Functions

- ReLU introduces non-linearity.
- Softmax turns logits into probabilities.
- Cross-entropy penalizes incorrect class probabilities.
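
For reference, the three functions can be written out as follows. This is a sketch in my own notation (B is the batch size, K the number of classes, and the hat marks predicted probabilities), chosen to match the code in the cells below rather than any particular textbook:

$$
\begin{aligned}
\mathrm{ReLU}(z) &= \max(0, z) \\
\mathrm{softmax}(z)_k &= \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \\
L &= -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}
\end{aligned}
$$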

In [89]:
def ReLU(z: np.ndarray) -> np.ndarray:
    """Applies the ReLU function elementwise"""
    return np.maximum(0, z)

def ReLU_deriv(z: np.ndarray) -> np.ndarray:
    """Derivative of ReLU: 1 if z > 0, else 0"""
    return np.where(z > 0, 1, 0)

In [90]:
def softmax(Z: np.ndarray) -> np.ndarray:
    """
    Softmax function to convert logits (Z) into probabilities.
    Normalizes across classes (axis=1), independently for each row of Z.
    The row-wise max is subtracted first for numerical stability.
    """
    Z = Z - np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)
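
Subtracting the row-wise max does not change the result, but it keeps `np.exp` from overflowing on large logits. A minimal sketch of the difference, where `naive_softmax` is a hypothetical unshifted variant written only for comparison and the logit values are made up:

```python
import numpy as np

def softmax(Z):
    # Same computation as the notebook's softmax: shift each row by its max, then normalize
    Z = Z - np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

def naive_softmax(Z):
    # Hypothetical unshifted version, for comparison only
    exp_Z = np.exp(Z)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

# exp(1000) overflows to inf, and inf / inf gives nan; warnings are silenced for the demo
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)

stable = softmax(logits)
small = softmax(np.array([[0.0, 1.0, 2.0]]))   # same logits shifted down by 1000

print(naive)    # all nan
print(stable)   # valid probabilities, identical to `small`
```

Because softmax only depends on differences between logits, the shifted and unshifted inputs give the same probabilities whenever the unshifted version does not overflow.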

In [None]:
def CrossEntropy(yhat: np.ndarray, y: np.ndarray, eps: float = 1e-15) -> float:
    """
    Computes the mean Cross-Entropy loss for multi-class classification.

    Parameters:
    - yhat (np.ndarray): Predicted probabilities with shape (batch_size, num_classes)
    - y (np.ndarray): One-hot encoded true labels with shape (batch_size, num_classes)
    - eps (float): Small value to prevent log(0)

    Returns:
    - float: Mean cross-entropy loss over the batch
    """
    yhat = np.clip(yhat, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(yhat), axis=1))
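
Since each row of `y` is one-hot, the inner sum just selects the log-probability of the true class. A small sketch of that equivalence, using a hypothetical helper `cross_entropy_from_labels` that takes integer labels directly (the probabilities here are invented for illustration):

```python
import numpy as np

def cross_entropy_from_labels(yhat, labels, eps=1e-15):
    # Hypothetical helper: picks out each sample's true-class probability by index
    yhat = np.clip(yhat, eps, 1 - eps)
    return -np.mean(np.log(yhat[np.arange(len(labels)), labels]))

yhat = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
labels = np.array([0, 1])

# One-hot version, using the same formula as CrossEntropy
y_onehot = np.zeros_like(yhat)
y_onehot[np.arange(len(labels)), labels] = 1
loss_onehot = -np.mean(np.sum(y_onehot * np.log(yhat), axis=1))

loss_indexed = cross_entropy_from_labels(yhat, labels)
print(loss_onehot, loss_indexed)   # the two values agree
```

The indexed form avoids building the one-hot matrix, though the one-hot form is convenient here because backpropagation needs `y_onehot` anyway.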

# Architecture

Input (784) → Dense(128) → ReLU → Dense(64) → ReLU → Output(10, softmax)

- w1 shape = (128, 784)
- b1 shape = (128,)
- output shape = (bs, 128), where bs is the batch size

- w2 shape = (64, 128)
- b2 shape = (64,)
- output shape = (bs, 64)

- w3 shape = (10, 64)
- b3 shape = (10,)
- output shape = (bs, 10)
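
The shapes above can be checked quickly by pushing a dummy batch through zero-valued layers. This is only a sketch: `layer_sizes` and the batch size of 32 are names and values I picked for illustration.

```python
import numpy as np

layer_sizes = [784, 128, 64, 10]   # input, hidden 1, hidden 2, output
bs = 32                            # batch size, assumed for illustration

a = np.zeros((bs, layer_sizes[0]))   # dummy input batch
shapes = []
n_params = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    w = np.zeros((fan_out, fan_in))  # weights stored as (out, in), as listed above
    b = np.zeros(fan_out)
    a = a @ w.T + b                  # (bs, fan_in) @ (fan_in, fan_out) -> (bs, fan_out)
    shapes.append(a.shape)
    n_params += w.size + b.size

print(shapes)     # [(32, 128), (32, 64), (32, 10)]
print(n_params)   # 109386
```

Storing weights as (out, in) is why the forward pass multiplies by `w.T`.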

# Initialize Weights and Biases with He Initialization
He initialization scales each layer's weights by sqrt(2 / fan_in), which helps prevent vanishing and exploding gradients in ReLU networks. I ran into exactly this problem before I added it.

In [128]:
w1 = np.random.randn(128, 784) * np.sqrt(2 / 784)
b1 = np.zeros(128)

w2 = np.random.randn(64, 128) * np.sqrt(2 / 128)
b2 = np.zeros(64)

w3 = np.random.randn(10, 64) * np.sqrt(2 / 64)
b3 = np.zeros(10)
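
To see what the sqrt(2 / fan_in) factor does, here is a rough sketch that pushes synthetic standard-normal inputs (not MNIST) through the first two layer shapes and tracks the mean squared activation. The helper `mean_square_after_layers` and the unscaled baseline are hypothetical, added only for contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 784))   # synthetic inputs with mean square close to 1

def mean_square_after_layers(scale_fn):
    # Push x through 784 -> 128 -> 64 with ReLU and record the mean squared activation
    a, out = x, []
    for fan_in, fan_out in [(784, 128), (128, 64)]:
        w = rng.standard_normal((fan_out, fan_in)) * scale_fn(fan_in)
        a = np.maximum(0, a @ w.T)
        out.append(float(np.mean(a ** 2)))
    return out

he = mean_square_after_layers(lambda n: np.sqrt(2 / n))   # He scaling, as in the cell above
unscaled = mean_square_after_layers(lambda n: 1.0)        # hypothetical baseline: plain randn

print("He:      ", he)        # stays on the order of 1 at both layers
print("Unscaled:", unscaled)  # grows by orders of magnitude per layer
```

With He scaling the activation magnitude stays roughly constant from layer to layer, while the unscaled weights blow it up, which is the kind of instability that makes training diverge.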

In [None]:
batch_size = 32
epochs = 50
lr = 0.01

for epoch in range(epochs):
    total_loss = 0    # Sum of losses for this epoch
    correct = 0       # Count of correct predictions

    # Shuffle data each epoch
    perm = np.random.permutation(X_train.shape[0])
    X_train = X_train[perm]
    y_train = y_train[perm]
    
    for batch in range(0, X_train.shape[0], batch_size):
        X_train_batch = X_train[batch: batch + batch_size]  # (batch_size, 784)
        y_train_batch = y_train[batch: batch + batch_size]  # (batch_size,)
        batch_len = X_train_batch.shape[0]  # Actual batch size (handles last batch)

        #Forward Pass
        z1 = np.dot(X_train_batch, w1.T) + b1               # (batch_size, 128)
        a1 = ReLU(z1)

        z2 = np.dot(a1, w2.T) + b2                          # (batch_size, 64)
        a2 = ReLU(z2)

        z3 = np.dot(a2, w3.T) + b3                          # (batch_size, 10)
        yhat = softmax(z3)                                  # (batch_size, 10)

        # Onehot Encoding Labels
        y_onehot = np.zeros((batch_len, 10))
        y_onehot[np.arange(batch_len), y_train_batch] = 1   # Turn class indices into one-hot vectors

        # Loss and Accuracy
        loss = CrossEntropy(yhat, y_onehot)                 # Average loss over batch
        total_loss += loss

        preds = np.argmax(yhat, axis=1)                     # Class prediction per sample
        correct += np.sum(preds == y_train_batch)           # Tally correct predictions

        # Back Propagation
        dz3 = yhat - y_onehot                               # Gradient of the summed (not averaged) batch loss w.r.t. z3
        da2 = np.dot(dz3, w3)                               # Backprop into hidden layer 2
        dz2 = da2 * ReLU_deriv(z2)

        da1 = np.dot(dz2, w2)                               # Backprop into hidden layer 1
        dz1 = da1 * ReLU_deriv(z1)

        dw3 = np.dot(dz3.T, a2)
        db3 = dz3.sum(axis=0)
        dw2 = np.dot(dz2.T, a1)
        db2 = dz2.sum(axis=0)
        dw1 = np.dot(dz1.T, X_train_batch)
        db1 = dz1.sum(axis=0)

        # Adjusting Parameters
        w3 -= lr * dw3
        b3 -= lr * db3
        w2 -= lr * dw2
        b2 -= lr * db2
        w1 -= lr * dw1
        b1 -= lr * db1

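    # Epoch Results
    acc = correct / X_train.shape[0]
    # total_loss sums per-batch mean losses, so average over batches rather than samples.
    # The log below was produced with the earlier per-sample division, so its loss
    # values are about batch_size times smaller than the true mean loss.
    n_batches = int(np.ceil(X_train.shape[0] / batch_size))
    print(f"Epoch {epoch+1} | Loss: {total_loss / n_batches:.4f} | Accuracy: {acc:.4f}")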
        

Epoch 1 | Loss: 0.0077 | Accuracy: 0.9222
Epoch 2 | Loss: 0.0035 | Accuracy: 0.9660
Epoch 3 | Loss: 0.0026 | Accuracy: 0.9740
Epoch 4 | Loss: 0.0020 | Accuracy: 0.9797
Epoch 5 | Loss: 0.0016 | Accuracy: 0.9840
Epoch 6 | Loss: 0.0015 | Accuracy: 0.9847
Epoch 7 | Loss: 0.0012 | Accuracy: 0.9879
Epoch 8 | Loss: 0.0010 | Accuracy: 0.9895
Epoch 9 | Loss: 0.0011 | Accuracy: 0.9885
Epoch 10 | Loss: 0.0009 | Accuracy: 0.9904
Epoch 11 | Loss: 0.0009 | Accuracy: 0.9909
Epoch 12 | Loss: 0.0008 | Accuracy: 0.9922
Epoch 13 | Loss: 0.0007 | Accuracy: 0.9935
Epoch 14 | Loss: 0.0006 | Accuracy: 0.9933
Epoch 15 | Loss: 0.0007 | Accuracy: 0.9936
Epoch 16 | Loss: 0.0006 | Accuracy: 0.9943
Epoch 17 | Loss: 0.0006 | Accuracy: 0.9947
Epoch 18 | Loss: 0.0006 | Accuracy: 0.9940
Epoch 19 | Loss: 0.0006 | Accuracy: 0.9934
Epoch 20 | Loss: 0.0006 | Accuracy: 0.9939
Epoch 21 | Loss: 0.0008 | Accuracy: 0.9932
Epoch 22 | Loss: 0.0005 | Accuracy: 0.9946
Epoch 23 | Loss: 0.0005 | Accuracy: 0.9956
Epoch 24 | Loss: 0.0

I let the model overfit the training set intentionally.
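
The backpropagation step in the training loop relies on the fact that, for softmax followed by cross-entropy, the gradient of a sample's loss with respect to its logits is `yhat - y`. One way to gain confidence in formulas like these is a finite-difference gradient check. The sketch below uses a tiny hypothetical network (5 inputs, 4 hidden units, 3 classes) with random data, and a loss summed over the batch so that it matches the unaveraged `dz3` used above:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def summed_loss(w1, b1, w2, b2, X, Y):
    # One-hidden-layer version of the notebook's forward pass; loss is summed over the batch
    a1 = np.maximum(0, X @ w1.T + b1)
    yhat = softmax(a1 @ w2.T + b2)
    return -np.sum(Y * np.log(yhat))

# Tiny made-up problem: batch of 6, 5 inputs, 4 hidden units, 3 classes
X = rng.standard_normal((6, 5))
Y = np.eye(3)[rng.integers(0, 3, size=6)]
w1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(4)
w2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)

# Analytic gradients, following the same pattern as the training loop
z1 = X @ w1.T + b1
a1 = np.maximum(0, z1)
yhat = softmax(a1 @ w2.T + b2)
dz2 = yhat - Y
dw2 = dz2.T @ a1
dz1 = (dz2 @ w2) * (z1 > 0)
dw1 = dz1.T @ X

def numeric_grad(w, loss_fn, h=1e-5):
    # Central finite differences, perturbing one weight at a time in place
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + h
        plus = loss_fn()
        w[idx] = old - h
        minus = loss_fn()
        w[idx] = old
        g[idx] = (plus - minus) / (2 * h)
    return g

loss_fn = lambda: summed_loss(w1, b1, w2, b2, X, Y)
num_dw1 = numeric_grad(w1, loss_fn)
num_dw2 = numeric_grad(w2, loss_fn)

print(np.max(np.abs(num_dw1 - dw1)), np.max(np.abs(num_dw2 - dw2)))   # both should be tiny
```

If the analytic and numeric gradients disagree by more than rounding error, the backward pass most likely has a bug, such as a missing transpose or a wrong axis in a sum.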

In [130]:
# Forward pass on test set
z1_test = np.dot(X_test, w1.T) + b1
a1_test = ReLU(z1_test)

z2_test = np.dot(a1_test, w2.T) + b2
a2_test = ReLU(z2_test)

z3_test = np.dot(a2_test, w3.T) + b3
yhat_test = softmax(z3_test)

# Predictions
preds_test = np.argmax(yhat_test, axis=1)
accuracy_test = np.mean(preds_test == y_test)

print(f"Test Accuracy: {accuracy_test:.4f}")


Test Accuracy: 0.9831
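
A single accuracy number hides which digits get confused with each other. A confusion matrix is one way to look deeper. The sketch below uses a hypothetical `confusion_matrix` helper and toy labels; in the notebook the inputs would be `y_test` and `preds_test` with `num_classes=10`:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    # Hypothetical helper: rows are true classes, columns are predicted classes
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add, so repeated (true, pred) pairs accumulate
    return cm

# Toy labels for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # fraction of each true class predicted correctly

print(cm)
print(per_class_acc)   # [1.  0.5 1. ]
```

`np.add.at` is used instead of `cm[y_true, y_pred] += 1` because plain fancy-index assignment would count a repeated (true, predicted) pair only once.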


# Conclusion
The multi-layer perceptron (MLP) achieved 98.31% accuracy on the MNIST test set. The model was built entirely from scratch using NumPy, without any high-level libraries like PyTorch or TensorFlow. It uses:

- ReLU activations to introduce non-linearity in the hidden layers.
- A softmax function to convert the final layer’s outputs into class probabilities.
- Cross-entropy loss to measure prediction error for multi-class classification.
- Mini-batch gradient descent, with gradients computed by backpropagation, to update the weights.
- He initialization to ensure stable training for deep ReLU networks.

This was a more advanced follow-up to my softmax classifier project, and it deepened my understanding of how deep neural networks are trained, especially backpropagation. It was a challenging but rewarding project, and I'm proud to have written every part myself.