# Introduction

In this Python notebook, I will show you how to build your own neural network to classify images from the MNIST dataset.

The theoretical explanations and formulas presented in this notebook are based on insights provided by ChatGPT, which I have carefully reviewed and adjusted as the author to ensure clarity and accuracy.

The code is written from scratch, with every part of the neural network coded by hand, to provide a deeper understanding of how these models work.

# Dataset

## 1. Introduction to MNIST Dataset

The **MNIST dataset** (Modified National Institute of Standards and Technology) is a collection of **70,000 grayscale images** of handwritten digits (0-9). Each image is **28×28 pixels**, making it a common benchmark for image classification tasks.

* **Training set**: 60,000 images
* **Test set**: 10,000 images

This dataset is widely used to test the performance of machine learning models, particularly in neural network-based classification.

In [None]:
# Load and visualize the dataset
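
As a rough sketch of the preparation step, the snippet below flattens each 28×28 image into 784 values, scales pixels to the range [0, 1], and one-hot encodes the labels. The arrays are random stand-ins with MNIST's shapes, not the real data, and the names `preprocess` and `one_hot` as well as the [0, 1] scaling are my assumptions for illustration.

```python
import numpy as np

# Stand-in data with MNIST's shapes (NOT the real dataset): 5 fake images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1, 7])

def preprocess(images):
    """Flatten each 28x28 image to 784 values and scale pixels to [0, 1]."""
    flat = images.reshape(images.shape[0], -1)
    return flat.astype(np.float32) / 255.0

def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot rows (a 1 at the label's index)."""
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

X = preprocess(images)
Y = one_hot(labels)
print(X.shape, Y.shape)  # (5, 784) (5, 10)
```

With the real dataset, the same two functions would apply to the 60,000 training images and the 10,000 test images.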

## 2. Neural Networks Overview

A neural network is a computational model, loosely inspired by the human brain, that consists of neurons organized into layers. The key layers are:

* Input Layer: Takes the pixel values as input (28×28 = 784 values per image).
* Hidden Layers: Transform the input using weights and activation functions.
* Output Layer: Produces the final classification, with one output per digit (0-9).

## 3. Neuron Implementation

Each neuron is a computational unit that multiplies its inputs by weights, adds a bias, and produces an output. Written for a whole layer of neurons at once, the first step is:

Compute Weighted Sum:
$$ z = Wx + b $$

where:
* 𝑥 = input vector
* 𝑊 = weight matrix (one row of weights per neuron)
* 𝑏 = bias vector (one bias per neuron)
* 𝑧 = vector of weighted sums (called logits at the output layer)

In [2]:
# Code for initializing the neural network structure here
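
One possible way to set up the structure is a small layer class that stores `W` and `b` and computes $z = Wx + b$. This is a minimal sketch, not necessarily the layout used later in this notebook: the class name `DenseLayer`, the hidden size of 128, and the 0.01 weight scale are all assumptions chosen for illustration.

```python
import numpy as np

class DenseLayer:
    """A fully connected layer computing z = W x + b for one input vector."""

    def __init__(self, n_inputs, n_neurons, rng):
        # Small random weights break the symmetry between neurons.
        # The 0.01 scale is an illustrative assumption.
        self.W = rng.normal(0.0, 0.01, size=(n_neurons, n_inputs))
        self.b = np.zeros(n_neurons)

    def forward(self, x):
        return self.W @ x + self.b

rng = np.random.default_rng(42)
hidden = DenseLayer(784, 128, rng)  # 784 = 28 * 28 pixels; 128 is assumed
output = DenseLayer(128, 10, rng)   # 10 outputs, one per digit

x = rng.random(784)                 # stand-in for one flattened image
z_hidden = hidden.forward(x)
z_out = output.forward(z_hidden)    # weighted sums only, no activation yet
print(z_hidden.shape, z_out.shape)  # (128,) (10,)
```

Each row of `W` holds the weights of one neuron, so the matrix product evaluates all neurons of the layer in a single step.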

## 4. Activation Functions

Activation functions introduce non-linearity into the network, which is crucial for learning complex patterns.

> Without non-linearity, a neural network, no matter how many layers it has, would behave like a single linear transformation. This is because composing linear layers (matrix multiplications plus biases) yields just another linear layer, which reduces the network to a simple linear model. Adding non-linearity lets the network model intricate relationships in the data and perform more sophisticated tasks.
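
The collapse described above can be checked numerically. In this small sketch (the matrices are hand-picked toy values, not anything from MNIST), two stacked linear layers give exactly the same output as one combined layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$, while putting a ReLU between them changes the result.

```python
import numpy as np

# Toy values chosen by hand for illustration.
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([0.5, -0.5])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([1.0])
x = np.array([2.0, 1.0])

# Two linear layers applied one after the other.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping expressed as a single linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

# With a ReLU between the layers, the single-layer shortcut no longer holds.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(two_layers, one_layer)  # both are [-0.5]
print(with_relu)              # [2.5]
```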

Common activation functions include:

### 1. ReLU (Rectified Linear Unit):

$$ f(z) = \max(0, z) $$

* Used in hidden layers
* It introduces non-linearity and helps mitigate the vanishing gradient problem, because its gradient is exactly 1 for every positive input.

In [5]:
# Code for ReLU function
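
A minimal NumPy sketch of ReLU and its derivative (the derivative is needed later for backpropagation). The function names are my own, and setting the derivative to 0 at exactly $z = 0$ is a common convention that I assume here.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def relu_derivative(z):
    """1 where z > 0, otherwise 0 (0 is used at z == 0 by convention)."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))             # [0. 0. 3.]
print(relu_derivative(z))  # [0. 0. 1.]
```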

### 2. Softmax (for output layer):

$$ \hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $$
 
* Converts logits into probabilities for classification. Softmax ensures that the output values sum to 1, making them interpretable as probabilities.
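
Softmax can be sketched as follows. Subtracting the largest logit before exponentiating is a standard trick that I add here for numerical stability. It does not change the result, because the common factor cancels in the fraction, but it prevents overflow for large logits.

```python
import numpy as np

def softmax(z):
    """Convert logits to probabilities along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # three probabilities that sum to 1

# Large logits would overflow a naive exp(z); the shifted version stays finite.
big = softmax(np.array([1000.0, 1001.0]))
```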

### 3. Sigmoid (optional, not used in this code)

$$ f(z) = \frac{1}{1 + e^{-z}} $$

* Often used for binary classification but not common for multi-class problems like MNIST.
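
For completeness, a one-line sketch of sigmoid. It is shown only for comparison and is not part of the network built in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Squash each value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

values = sigmoid(np.array([-2.0, 0.0, 2.0]))
print(values)  # the middle value is exactly 0.5
```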

## 5. Loss Function

To evaluate how well the network is performing, we use a loss function that measures the difference between the predicted outputs and the true labels. In the case of classification, we typically use Cross-Entropy Loss:

For a single training example:
$$ L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) $$

where:
* $C$ = number of classes (10 in MNIST)
* $y_i$ = true label for class $i$ (one-hot encoded: 1 for the correct digit, 0 otherwise)
* $\hat{y}_i$ = predicted probability for class $i$ (from softmax)
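
Cross-entropy for one-hot labels, $L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$, can be sketched as below. The function name and the small `eps` clip, which guards against $\log(0)$, are my additions for illustration. Because the label is one-hot, only the probability assigned to the correct class contributes to the loss.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy per example; y_true is one-hot, y_pred holds probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One example whose true digit is 3; the model gives that class probability 0.7.
y_true = np.zeros(10)
y_true[3] = 1.0
y_pred = np.full(10, 0.3 / 9)
y_pred[3] = 0.7

loss = cross_entropy(y_true, y_pred)
print(loss)  # -log(0.7), about 0.357
```

For a batch of examples, the per-example losses are typically averaged, for example with `np.mean(cross_entropy(Y, Y_hat))`, where `Y` and `Y_hat` are hypothetical batch arrays.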
