<a href="https://colab.research.google.com/github/JonaFlavier/Learning/blob/main/handwritten_digit_identifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Neural Networks: Image Classifier**

This notebook documents **my exploration into neural networks**, starting with a **simple handwritten digit classifier** using the MNIST dataset. The goal is to **understand the fundamental building blocks** of a neural network, including:

- **Forward propagation**: How inputs flow through layers to produce predictions.
- **Backpropagation**: How errors are propagated backward to adjust weights.
- **Gradient descent**: The optimization process to improve model accuracy.

## **Why Start with MNIST?**
MNIST is a widely used dataset for handwritten digit recognition, making it a **great starting point** for understanding deep learning fundamentals. It consists of **28×28 grayscale images** labeled from **0 to 9**.

## **🔍 What This Notebook Covers**
- Implementing a **fully connected neural network** from scratch.
- Training the model using **Stochastic Gradient Descent (SGD)**.
- Evaluating initial results and identifying areas for improvement.
- Laying the foundation for future optimizations and experiments.

As I progress, I will explore techniques to **enhance performance**, such as **different activation functions, weight initialization strategies, and optimization algorithms**.

In [1]:
!pip install pandas numpy tensorflow



In [2]:
import random
import numpy as np
from tensorflow.keras.datasets import mnist

# **Creating the neural network**
### The class Network explores the underlying process used to train models.

### This model uses the most basic methods: Stochastic Gradient Descent (SGD) to update the parameters, backpropagation to compute the gradients, sigmoid as the activation function, and the cost derivative to measure the output error that drives the weight and bias updates.
---
## **Formulas applied:**
### **Feedforward propagation:**

$$
z = W \cdot a + b
$$
$$
a' = \sigma(z)
$$

- Each neuron computes a weighted sum of its inputs.
- The weighted sum $z$ is passed through the **sigmoid activation function**.
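As a minimal sketch of the two formulas above, the following computes one layer's forward step with NumPy. The weights, biases, and inputs are illustrative values chosen for easy arithmetic, not values from this notebook.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer: 3 inputs feeding 2 neurons
W = np.array([[0.5, -0.5, 0.0],
              [1.0,  1.0, 1.0]])      # shape (2, 3): (destination, source)
b = np.zeros((2, 1))                  # one bias per destination neuron
a = np.array([[1.0], [1.0], [2.0]])   # input activations as a column vector

z = W @ a + b        # weighted sums, shape (2, 1): [[0.0], [4.0]]
a_next = sigmoid(z)  # activations handed to the next layer
print(z.ravel(), a_next.ravel())
```

The first neuron's weighted sum is 0, so its activation is exactly 0.5; the second neuron's sum of 4 pushes its activation close to 1.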

## **Sigmoid Activation function:**

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
### **Derivative (used in backpropagation):**
$$
\sigma'(z) = \sigma(z)(1 - \sigma(z))
$$
- The sigmoid function maps inputs to values between 0 and 1.
- The derivative is used for adjusting weights during backpropagation.
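A quick way to gain confidence in the derivative formula is to compare it with a finite-difference estimate. The sketch below does that at a few sample points; the step size `h` and the sample values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
h = 1e-5
# Central-difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_diff = np.max(np.abs(numeric - sigmoid_prime(z)))
print(sigmoid_prime(z), max_diff)
```

The derivative peaks at $z = 0$ with a value of 0.25 and shrinks toward zero as $|z|$ grows, which is why saturated sigmoid neurons learn slowly.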

## **Backpropagation:**
### **Output Layer Error:**
$$
\delta^L = (a^L - y) \odot \sigma'(z^L)
$$

### **Hidden Layer Error Propagation:**
$$
\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)
$$
- The error propagates backward through the network using the derivative of the activation function.
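The two error formulas can be sketched on a tiny network and checked against a numerical gradient. This assumes a quadratic cost $C = \tfrac{1}{2}\lVert a - y \rVert^2$, which is consistent with the cost derivative $(a - y)$ used in this notebook; the layer sizes, seed, and inputs are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))
x = np.array([[0.2], [0.7]])
y = np.array([[1.0], [0.0]])

def cost(first_layer_weights):
    # Quadratic cost as a function of the first weight matrix only
    a1 = sigmoid(first_layer_weights @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Forward pass, keeping the weighted sums for the backward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward pass: output-layer error, then hidden-layer error
delta2 = (a2 - y) * sigmoid_prime(z2)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
grad_W1 = delta1 @ x.T  # gradient of the cost with respect to W1

# Numerical check on a single weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(W1_plus) - cost(W1_minus)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```

If the two printed numbers agree to several decimal places, the backward pass matches the cost being differentiated.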



## **Stochastic Gradient Descent (SGD):**
$$
W = W - \frac{\eta}{m} \sum \nabla W
$$
$$
b = b - \frac{\eta}{m} \sum \nabla b
$$
- $\eta$ is the learning rate.
- $m$ is the batch size.
- Gradients are accumulated over the mini-batch and used to update the weights and biases.
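The update rule can be traced by hand on a very small example. The weight matrix, per-sample gradients, and learning rate below are made-up values for illustration only.

```python
import numpy as np

eta = 0.5                      # learning rate (illustrative value)
W = np.array([[1.0, 2.0]])     # a single 1x2 weight matrix

# Hypothetical per-sample gradients for a mini-batch of m = 2 samples
grads = [np.array([[0.2, 0.4]]),
         np.array([[0.6, 0.0]])]
m = len(grads)

# Sum the gradients over the mini-batch, then take one averaged step
W_new = W - (eta / m) * sum(grads)
print(W_new)
```

Here the summed gradient is `[[0.8, 0.4]]`, the step factor `eta / m` is 0.25, and the updated weights are `[[0.8, 1.9]]`.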


## **Cost Function Derivative:**

$$
\frac{\partial C}{\partial a} = (a_{\text{output}} - y)
$$
- Measures the error between the predicted output $a_{\text{output}}$ and the actual label $y$.
- This derivative helps update the model parameters.
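The derivative above corresponds to a quadratic cost, assuming $C = \tfrac{1}{2}\lVert a_{\text{output}} - y \rVert^2$ (the notebook does not define the cost itself, so treat that form as an assumption). A minimal sketch with made-up activations and a target of the same shape:

```python
import numpy as np

def quadratic_cost(a, y):
    # Assumed cost: half the squared distance between output and target
    return 0.5 * np.sum((a - y) ** 2)

def cost_derivative(a, y):
    # Partial derivatives of the quadratic cost with respect to each output
    return a - y

a_output = np.array([[0.8], [0.3]])   # illustrative output activations
y = np.array([[1.0], [0.0]])          # target with the same shape as the output

grad = cost_derivative(a_output, y)
print(quadratic_cost(a_output, y), grad.ravel())
```

Note that the subtraction is element-wise, so the target needs one entry per output neuron for the result to mean "error per neuron".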



In [13]:
# create a network object

class Network (object):
  def __init__(self, sizes):
    """
    sizes -> list containing the number of neurons per layer
    e.g. [2,4,3,1] => the input layer has 2 neurons, the two hidden layers have 4 and 3 neurons, and the output layer has 1 neuron
    every connection gets a random initial weight and every non-input neuron a random initial bias; both are adjusted at each SGD update
    the input layer has no biases, since biases only apply to neurons that compute a weighted sum (every layer after the input)

    the weight matrix will be for connecting => (destination layer, source layer)
      (4 neuron layer, 2 neuron layer),
      (3 neuron layer, 4 neuron layer),
      (1 neuron layer, 3 neuron layer)

    """
    self.num_layers = len(sizes)
    self.sizes = sizes
    self.biases = [np.random.randn(y,1) for y in sizes[1:]] # for every layer after the input layer in sizes, attach a random bias
    self.weights = [np.random.randn(y,x) for x, y in zip(sizes[:-1], sizes[1:])] # the weight matrix will be for connecting => (4,2), (3,4), (1,3)
    print(f"biases {len(self.biases)} {self.biases[0].shape} {self.biases[1].shape}")
    print(f"weights {len(self.weights)} {self.weights[0].shape} {self.weights[1].shape}")

  def sigmoid(self,z):
    # function is defined by => σ(z) = 1/(1+e^-z)
    return 1.0/(1.0 + np.exp(-z))

  def sigmoid_prime(self, z):
    # derivative of the sigmoid => σ'(z) = σ(z)(1-σ(z))
    return self.sigmoid(z)*(1-self.sigmoid(z))

  def feedforward(self, a):
    # apply the activation function per layer => a = σ(weight*input + bias)
    for b, w in zip(self.biases, self.weights):
      a = self.sigmoid(np.dot(w, a)+b)
    return a

  def SGD (self, training_data, epochs, mini_batch_size, eta, test_data=None):
    """
    Training neural network: Stochastic Gradient Descent
    training_data = list of tuples (training inputs X, desired outputs Y)
    test_data = if provided will evaluate against test data after each epoch

    """
    print(f"{len(training_data)}")
    if test_data is not None: n_test = len(test_data) # if present get the number of test data rows
    n = len(training_data) # get the length of the data rows being trained

    for i in range(epochs):
      # shuffle the training data
      random.shuffle(training_data)
      # segregate into mini batches
      mini_batches = [
          training_data[k:k+mini_batch_size]
          for k in range(0, n, mini_batch_size)
      ]

      # update per mini batch
      for mini_batch in mini_batches:
        self.update_mini_batch(mini_batch, eta)

      # evaluate iteration if test data is present
      if test_data:
        print(f"Epoch {i}: {self.evaluate(test_data)}/ {n_test}")
      else:
        print(f"Epoch {i} complete")

  def update_mini_batch(self, mini_batch, eta):
    """
    updates weights and biases using gradient descent
    backpropagation is used per mini batch
    mini_batch => list of tuples "(x, y)"
    eta => learning_rate
    """
    nabla_b = [np.zeros(b.shape) for b in self.biases] # list of zero matrices to store sum of gradients of the bias for all samples in minibatch
    nabla_w = [np.zeros(w.shape) for w in self.weights] # list of zero matrices to store sum of gradient of the weights for all samples in minibatch

    for x, y in mini_batch:
      delta_nabla_b, delta_nabla_w = self.backprop(x, y)
      nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] # accumulate this sample's bias gradients, layer by layer
      nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)] # accumulate this sample's weight gradients, layer by layer

    # update rule: param = param - (eta / batch_size) * summed_gradient
    self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)] # set the new biases from this mini_batch
    self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)] # set the new weights from this mini_batch

  def backprop(self, x, y):
    # print(f"x shape{x.shape} y shape{y.shape}")
    x = x.reshape(-1,1)
    y = y.reshape(-1,1)
    nabla_b =[np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]

    # feed forward
    activation = x
    activations = [x] # list for all activations
    zs = [] # list storing all z vectors

    for b, w in zip(self.biases, self.weights):
      z = np.dot(w, activation)+b # using the function w*input + b
      # print(f"z value {z}")
      zs.append(z) # store weighted sum before activation
      # print(f" zs length {len(zs)} {zs[0].shape}")
      activation = self.sigmoid(z) # apply activation function
      activations.append(activation) # collect activation result into list

    # backward pass
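    # output-layer error: δ^L = (a^L - y) ⊙ σ'(z^L)
    # note: with the training data built in the next cell, y is a raw digit label of shape (1, 1),
    # so (a^L - y) broadcasts that digit across all 10 outputs; the formula expects a target
    # with the same shape as the output layer (for example a one-hot vector)
    delta = self.cost_derivative(activations[-1], y) * self.sigmoid_prime(zs[-1])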
    # print(f"delta shape: {delta.shape}")
    delta = delta.reshape(-1, 1)  # Ensure delta is a column vector

    # print(f"sigmoid prime: {self.sigmoid_prime(zs[-1])}")

    nabla_b[-1] = delta # gradient of cost function with respect to biases in output layer
    # print(f"nabla_b delta shape{nabla_b[-1].shape}")
    nabla_w[-1] = np.dot(delta, activations[-2].T) # gradient of the cost wrt the output-layer weights: outer product of the error and the previous layer's activations
    # print(f"nabla_w[-1] shape{nabla_w[-1].shape}")

    for l in range(2, self.num_layers):
      z = zs[-l] # weighted sum for the l-th layer counting from the output
      sp = self.sigmoid_prime(z) # sigmoid derivative at that layer
      delta = np.dot(self.weights[-l+1].T, delta) * sp # backpropagate the error: transpose of the next layer's weight matrix times its delta, scaled element-wise by σ'(z)

      activations_prev = activations[-l-1].reshape(-1,1)
      # print(f"delta shape: {delta.shape}")
      # print(f"activations[-l-1] shape: {activations_prev.shape}")



      nabla_b[-l] = delta # bias gradient for the current layer
      nabla_w[-l] = np.dot(delta, activations_prev.T) # weight gradient for the current layer

    return (nabla_b, nabla_w)


  def evaluate(self, test_data):
    """
    store in a tuple list the inference result and the actual result in (prediction, true_result) format
    then sum up which inferences were equal to the true result
    """

    test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]

    return sum(int(x.item()==y.item()) for (x, y) in test_results)



  def cost_derivative(self, output_activations, y):
    """
    get the vector of partial derivatives ∂C_x/∂a for the output activations
    """
    return (output_activations-y)


In [20]:
# pull the data from mnist and feed training data and test data into the model
(x_train, y_train), (x_test, y_test) = mnist.load_data() # get the data
print(f"{x_train.shape}")
print(f"{y_train.shape}")
print(f"{x_test.shape}")
print(f"{y_test.shape}")

# normalise the input data to values 0-1
x_train = x_train.astype('float32')/255.0
x_test = x_test.astype('float32')/255.0

#flatten images from 2d (28, 28) to 1d (784,)
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
# print(f"{x_train.shape}")
# print(f"{y_train.shape}")
# print(f"{x_test.shape}")
# print(f"{y_test.shape}")

# convert to list


training_data = list(zip(x_train, y_train))
test_data = list(zip(x_test, y_test))

net = Network([784, 100,10]) # network with 784 neurons in the input layer, 100 neurons in the hidden layer and 10 in the output layer


net.SGD(training_data, 50, 10, 5.0, test_data=test_data)
print("End of training")

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
(60000, 784)
(60000, 1)
(10000, 784)
(10000, 1)
biases 2 (100, 1) (10, 1)
weights 2 (100, 784) (10, 100)
60000
Epoch 0: 1126/ 10000
Epoch 1: 1126/ 10000
Epoch 2: 1127/ 10000
Epoch 3: 1126/ 10000
Epoch 4: 1126/ 10000
Epoch 5: 1125/ 10000
Epoch 6: 1124/ 10000
Epoch 7: 1123/ 10000
Epoch 8: 1123/ 10000
Epoch 9: 1123/ 10000
Epoch 10: 1122/ 10000
Epoch 11: 1122/ 10000
Epoch 12: 1122/ 10000
Epoch 13: 1122/ 10000
Epoch 14: 1122/ 10000
Epoch 15: 1121/ 10000
Epoch 16: 1121/ 10000
Epoch 17: 1121/ 10000
Epoch 18: 1121/ 10000
Epoch 19: 1121/ 10000
Epoch 20: 1121/ 10000
Epoch 21: 1121/ 10000
Epoch 22: 1121/ 10000
Epoch 23: 1121/ 10000
Epoch 24: 1121/ 10000
Epoch 25: 1121/ 10000
Epoch 26: 1121/ 10000
Epoch 27: 1121/ 10000
Epoch 28: 1121/ 10000
Epoch 29: 1121/ 10000
Epoch 30: 1121/ 10000
Epoch 31: 1121/ 10000
Epoch 32: 1121/ 10000
Epoch 33: 1121/ 10000
Epoch 34: 1121/ 10000
Epoch 35: 1120/ 10000
Epoch 36: 1120/ 10000
Epoch 37: 1120/ 10000
Epoch 38: 112

# **Initial Evaluation and Observations**

After implementing the base neural network, the initial results suggest that performance is **not yet optimal**. Below are some key observations from the current implementation.

## **1. Observed Issues**
- **Inconsistent training results** – Accuracy varies significantly across runs.
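- **Slow convergence** – In the run above, test accuracy stays between 1,120 and 1,127 correct out of 10,000 (about 11%) for every fully logged epoch, which is close to chance for ten classes.
- **Potential vanishing gradient problem** – The sigmoid derivative is at most 0.25 and approaches zero for large $|z|$, so saturated neurons may produce very small gradient updates.
- **Weight initialization concerns** – Standard-normal initialization on a 784-input layer might produce large weighted sums that saturate the sigmoid and affect stability.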
- **Possible overfitting** – Training accuracy is much higher than test accuracy.
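One thing that may be worth ruling out before the larger changes is the shape of the targets. In the data-preparation cell each label is a single digit, so inside `backprop` the subtraction `output_activations - y` broadcasts that digit across all ten outputs. The sketch below contrasts this with a one-hot target; `one_hot` is a hypothetical helper that is not part of the notebook, and the output activations are stand-in values.

```python
import numpy as np

def one_hot(label, num_classes=10):
    # Hypothetical helper (not in the notebook): column vector with 1.0 at the label index
    target = np.zeros((num_classes, 1))
    target[int(label)] = 1.0
    return target

output = np.full((10, 1), 0.5)         # stand-in for the network's output activations
y_raw = np.array([7]).reshape(-1, 1)   # a raw digit label, shape (1, 1)

diff_raw = output - y_raw              # broadcasts to (10, 1); every entry is 0.5 - 7
diff_one_hot = output - one_hot(7)     # -0.5 at index 7, +0.5 everywhere else
print(diff_raw.ravel())
print(diff_one_hot.ravel())
```

With the raw label, every output neuron receives the same large error regardless of which digit is correct. If the training targets were one-hot encoded, `evaluate` would still need integer labels (or an `np.argmax` over the one-hot vector) for its comparison.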

## **2. Areas for Further Investigation**
To improve the network’s performance, the following areas will be explored:
- **Weight Initialization:** Investigate whether **Xavier (Glorot) Initialization** helps stabilize training.
- **Activation Functions:** Compare **ReLU vs. Sigmoid** to address gradient vanishing.
- **Loss Function:** Evaluate whether **Cross-Entropy Loss** is more effective than MSE for classification.
- **Optimizer Choices:** Test **Adam or Momentum-based SGD** for faster convergence.
- **Regularization Techniques:** Consider using **L2 Regularization or Dropout** to reduce overfitting.
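As a starting point for the weight-initialization experiment, the sketch below compares standard-normal weights with the Glorot (Xavier) normal scheme, which scales the standard deviation to $\sqrt{2 / (n_{in} + n_{out})}$. The input vector is random data standing in for a normalised image, so the exact numbers are illustrative rather than measurements from MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 100   # first layer of the [784, 100, 10] network

# Standard-normal weights, as in the current Network.__init__
W_randn = rng.standard_normal((fan_out, fan_in))

# Glorot/Xavier normal: same distribution, rescaled by sqrt(2 / (fan_in + fan_out))
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
W_xavier = rng.standard_normal((fan_out, fan_in)) * xavier_std

x = rng.random((fan_in, 1))  # stand-in input with values in [0, 1)

z_randn = W_randn @ x
z_xavier = W_xavier @ x
print(f"std of z with randn weights:  {z_randn.std():.2f}")
print(f"std of z with Xavier weights: {z_xavier.std():.2f}")
```

Smaller weighted sums keep more neurons in the region where the sigmoid derivative is not close to zero, which is the effect the experiment would aim to confirm on the real data.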

## **3. Next Steps**
- Run additional experiments to analyze how weight initialization affects training stability.
- Compare different activation functions and their impact on gradient flow.
- Implement and test different optimization strategies.
- Evaluate the effects of regularization on generalization.

The next phase will involve **testing these modifications one by one** to analyze their impact on performance.


