# A Minimal Softmax Classifier From Scratch Using the Iris Dataset
In this notebook, I build a softmax classifier from scratch using NumPy. The classifier learns to distinguish the three iris species: setosa, versicolor, and virginica.
I made this to understand the math behind machine learning models, and it was a huge help.
I comment as much of the code as I can, both to demonstrate my understanding and because writing the explanations deepens it. That amount of explanation is what sets this project apart from most of my others.
For clarification, this was written entirely from my own understanding. No tutorials were used.

In [316]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

In [317]:
data = load_iris()
X = data['data']    # Shape: (150, 4) — 4 features: sepal length, sepal width, petal length, petal width
y = data['target']  # Shape: (150,) — Integer labels: 0 = setosa, 1 = versicolor, 2 = virginica

## One-Hot Encoding of Labels
To use softmax and cross-entropy properly, we need to one-hot encode the integer class labels.
For example, instead of the integer label 1, we represent the class as the vector [0, 1, 0].


In [318]:
y_onehot = np.zeros(shape=(y.shape[0], 3))   # Create an array of zeros with shape (150, 3)
y_onehot[np.arange(y.size), y] = 1           # Set the appropriate class index to 1 for each sample
y_onehot[0]                                  # Example output

array([1., 0., 0.])
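
As a side note, here is a minimal sketch of an equivalent way to build the one-hot matrix, by indexing an identity matrix with the labels, plus the reverse step with `argmax`. The toy `labels` array and the names `onehot_eye`, `onehot_idx`, and `decoded` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy labels standing in for the 150 Iris targets.
labels = np.array([0, 2, 1, 2])
num_classes = 3

# Indexing the identity matrix by label picks out one row per sample,
# and that row is the one-hot vector for the label.
onehot_eye = np.eye(num_classes)[labels]

# Same construction as in the cell above, for comparison.
onehot_idx = np.zeros((labels.size, num_classes))
onehot_idx[np.arange(labels.size), labels] = 1

# argmax along the class axis recovers the integer labels.
decoded = onehot_eye.argmax(axis=1)
print(onehot_eye)
print(decoded)
```

Both constructions give the same matrix. The `argmax` step is the same trick the training loop later uses to turn a one-hot row back into a class index.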

In [319]:
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.25, random_state=62)

We split 25% of the data (38 samples) into the test set:
- `X_train.shape = (112, 4)`
- `y_train.shape = (112, 3)`
- `X_test.shape = (38, 4)`
- `y_test.shape = (38, 3)`


We now define the softmax and cross-entropy functions, which convert logits to probabilities and measure how wrong the predictions are. Training uses stochastic gradient descent (SGD) with a batch size of 1, so both functions operate on a single example at a time.

In [320]:
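def softmax(Z: np.ndarray) -> np.ndarray:
    """
    Softmax function to convert logits (Z) into probabilities.
    Expects a single 1-D vector of logits and normalizes over all of its entries.

    Example:
    If Z = [2.0, 1.0, 0.1], softmax(Z) ≈ [0.659, 0.242, 0.099]
    """
    exps = np.exp(Z)      # Exponentiate each logit
    total = exps.sum()    # Normalizing constant (named to avoid shadowing the built-in sum)
    return exps / total   # Shape is still (3,)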
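
The `softmax` above exponentiates the raw logits directly, which is fine for Iris-sized values but can overflow for very large logits. Below is a minimal sketch of the common max-shift variant. The name `stable_softmax` is hypothetical and the notebook does not use it. It also checks the example from the docstring.

```python
import numpy as np

def stable_softmax(z: np.ndarray) -> np.ndarray:
    """Softmax over a 1-D logit vector, shifted by max(z) to avoid overflow."""
    shifted = z - np.max(z)      # largest entry becomes 0, so every exp() is <= 1
    exps = np.exp(shifted)
    return exps / exps.sum()

z = np.array([2.0, 1.0, 0.1])
probs = stable_softmax(z)
print(np.round(probs, 3))        # [0.659 0.242 0.099]

# Logits this large would overflow np.exp without the shift.
big = stable_softmax(np.array([1000.0, 999.0, 998.1]))
print(np.allclose(big, probs))   # True
```

Subtracting a constant from every logit leaves the output unchanged, because the factor cancels between numerator and denominator. That is why the two calls agree.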

For the loss function, cross-entropy is used. It takes the natural logarithm of the probabilities from the softmax function, multiplies them element-wise by y (the one-hot true label), sums the result, and negates it. Only the log of the probability assigned to the correct class contributes, since every other entry of y is 0.

In [321]:
def CrossEntropy(yhat: np.ndarray, y: np.ndarray) -> float:
    """
    Cross-Entropy Loss for a single training example.
    Assumes y is one-hot encoded and yhat contains probabilities.
    """
    return - np.sum(y * np.log(yhat))
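
To make the "only the correct class counts" point concrete, here is a small worked example. It reuses the probabilities from the softmax docstring and assumes, purely for illustration, that the true class is index 1.

```python
import numpy as np

yhat = np.array([0.659, 0.242, 0.099])  # example probabilities from the softmax docstring
y = np.array([0.0, 1.0, 0.0])           # hypothetical true class: index 1

full_sum = -np.sum(y * np.log(yhat))    # the formula as written in CrossEntropy
shortcut = -np.log(yhat[np.argmax(y)])  # only the true-class term survives
print(round(full_sum, 3), round(shortcut, 3))
```

Both expressions evaluate to -ln(0.242), which is roughly 1.419. A confident correct prediction (probability near 1) drives the loss toward 0, while a probability near 0 for the true class makes it grow without bound.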

## Parameter Initialization
Initialize weights and biases.
- w: shape (3, 4) → 3 classes × 4 input features
- b: shape (3,) → 1 bias per class

In [322]:
w = np.random.rand(3, 4)
b = np.random.rand(3)
w, b

(array([[0.72188236, 0.66838096, 0.62716875, 0.08227346],
        [0.9810613 , 0.86350937, 0.99160996, 0.23015252],
        [0.85203934, 0.42605142, 0.25747436, 0.66313353]]),
 array([0.85785875, 0.59830137, 0.95121131]))
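
Because the cell above is unseeded, the starting weights differ on every run. If reproducible results matter, one option is NumPy's `Generator` API with a fixed seed. This is a sketch and not what the notebook does. The seed value and the names `w_demo`, `b_demo`, and `rng_again` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # the seed value 0 is an arbitrary choice

w_demo = rng.random((3, 4))      # uniform in [0, 1), like np.random.rand(3, 4)
b_demo = rng.random(3)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(0)
print(np.array_equal(w_demo, rng_again.random((3, 4))))
```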

We loop through the training data one example at a time (stochastic gradient descent), using the softmax function for multi-class probability prediction and cross-entropy loss to measure prediction error. The shapes involved are:
- X_train of shape (N, D) → N samples, D features
- w of shape (C, D) → C classes, D features
- b of shape (C,) → one bias per class

In [323]:
lr = 0.01
epochs = 100

for epoch in range(epochs):
    total_loss = 0  # Sum of losses over the entire epoch
    correct = 0     # Count of correctly predicted examples
    for z in range(X_train.shape[0]):
        x = X_train[z]          # Shape: (4,) — single training sample
        raw = np.dot(w, x) + b  # Shape: (3,) — raw scores (logits) per class
        yhat = softmax(raw)     # Shape: (3,) — predicted probabilities per class
        # np.dot(w, x) performs matrix-vector multiplication:
        # w (3x4) · x (4,) gives the logits for the 3 classes

        loss = CrossEntropy(yhat, y_train[z]) # Scalar loss for one sample
        total_loss += loss                    # Accumulate loss

        pred_class = np.argmax(yhat)        # Predicted class index
        true_class = np.argmax(y_train[z])  # Ground truth class index
        if pred_class == true_class:
            correct += 1                    # Count if prediction is correct

        dz = yhat - y_train[z] # Gradient of loss w.r.t. raw scores (softmax output - true labels)

        dw = np.outer(dz, x) # Shape: (3, 4) — gradient of loss w.r.t. weights
        db = dz              # Shape: (3,) — gradient of loss w.r.t. biases
        w -= lr * dw         # SGD update for weights
        b -= lr * db         # SGD update for biases

        # dz is the gradient of the cross-entropy loss with respect to the logits.
        # np.outer(dz, x) computes the full gradient matrix for w.
        # The update step then subtracts the gradients scaled by the learning rate.
    acc = correct / X_train.shape[0]
    print(f"Epoch {epoch} Loss: {total_loss:.4f} | Accuracy: {acc:.4f}")
        

Epoch 0 Loss: 112.5391 | Accuracy: 0.5625
Epoch 1 Loss: 78.2860 | Accuracy: 0.7232
Epoch 2 Loss: 67.9412 | Accuracy: 0.7321
Epoch 3 Loss: 61.8576 | Accuracy: 0.7411
Epoch 4 Loss: 57.5247 | Accuracy: 0.7768
Epoch 5 Loss: 54.1238 | Accuracy: 0.8036
Epoch 6 Loss: 51.3053 | Accuracy: 0.8393
Epoch 7 Loss: 48.8916 | Accuracy: 0.8482
Epoch 8 Loss: 46.7799 | Accuracy: 0.8571
Epoch 9 Loss: 44.9052 | Accuracy: 0.8571
Epoch 10 Loss: 43.2231 | Accuracy: 0.8661
Epoch 11 Loss: 41.7015 | Accuracy: 0.8750
Epoch 12 Loss: 40.3162 | Accuracy: 0.8750
Epoch 13 Loss: 39.0483 | Accuracy: 0.8839
Epoch 14 Loss: 37.8827 | Accuracy: 0.8839
Epoch 15 Loss: 36.8068 | Accuracy: 0.8929
Epoch 16 Loss: 35.8104 | Accuracy: 0.9107
Epoch 17 Loss: 34.8848 | Accuracy: 0.9107
Epoch 18 Loss: 34.0225 | Accuracy: 0.9196
Epoch 19 Loss: 33.2172 | Accuracy: 0.9196
Epoch 20 Loss: 32.4633 | Accuracy: 0.9196
Epoch 21 Loss: 31.7561 | Accuracy: 0.9196
Epoch 22 Loss: 31.0912 | Accuracy: 0.9196
Epoch 23 Loss: 30.4651 | Accuracy: 0.9196
...
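
The training loop relies on the identity `dz = yhat - y` for the gradient of the cross-entropy loss with respect to the logits. One way to gain confidence in it is a finite-difference check, sketched below. The helper names (`softmax_1d`, `loss_from_logits`) and the example logits, label, and features are hypothetical. The max-shifted softmax is a stability choice made here, not something taken from the notebook.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - z.max())          # max-shift for numerical stability
    return e / e.sum()

def loss_from_logits(z, y):
    return -np.sum(y * np.log(softmax_1d(z)))

z = np.array([0.5, -1.2, 2.0])       # hypothetical logits
y = np.array([0.0, 0.0, 1.0])        # hypothetical one-hot label

analytic = softmax_1d(z) - y         # the formula used in the training loop

# Central differences: nudge one logit at a time and measure the loss change.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (loss_from_logits(z + step, y) - loss_from_logits(z - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))

# The weight gradient is the outer product of dz with the input features.
x = np.array([5.1, 3.5, 1.4, 0.2])   # hypothetical feature vector
dw = np.outer(analytic, x)
print(dw.shape)                      # (3, 4)
```

Row `c` of `dw` is `dz[c] * x`, so each class's weights move along the input, scaled by how much that class's probability was over- or under-estimated.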

In [324]:
correct = 0
for i in range(X_test.shape[0]):
    pred = np.dot(w, X_test[i]) + b
    yhat = softmax(pred)

    pred_class = np.argmax(yhat)
    true_class = np.argmax(y_test[i])
    if pred_class == true_class:
        correct += 1

acc = correct / X_test.shape[0]
print(f"Test Accuracy: {acc:.4f}")

Test Accuracy: 0.9737
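
The evaluation loop above can also be written without a Python loop. The sketch below uses random stand-in arrays with hypothetical names (`w_demo`, `b_demo`, `X_demo`, `y_demo`) so that it runs on its own. With the notebook's `w`, `b`, `X_test`, and `y_test` substituted in, it should compute the same accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins with the same layout as the notebook's arrays.
w_demo = rng.random((3, 4))
b_demo = rng.random(3)
X_demo = rng.random((5, 4))
y_demo = np.eye(3)[rng.integers(0, 3, size=5)]

# Loop version, one sample at a time as in the cell above.
loop_preds = np.array([np.argmax(np.dot(w_demo, x) + b_demo) for x in X_demo])

# Vectorized version: one matrix product gives logits of shape (5, 3).
logits = X_demo @ w_demo.T + b_demo
vec_preds = logits.argmax(axis=1)

accuracy = np.mean(vec_preds == y_demo.argmax(axis=1))
print(np.array_equal(loop_preds, vec_preds), accuracy)
```

The softmax call is skipped here because softmax preserves the ordering of the logits, so the `argmax` of the logits and of the probabilities is the same class.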


## Conclusion

- The softmax classifier achieved 97.37% accuracy on the Iris test set (37 of 38 samples).
- The model itself was implemented entirely from scratch in NumPy; scikit-learn was used only to load the dataset and split it.
- I used:
  - The **softmax** function to convert logits to probabilities.
  - **Cross-entropy loss** to penalize incorrect predictions.
  - **Stochastic Gradient Descent** to update weights after each sample.

This was a fun little project that solidified my understanding.

---

