# Neural Network From Scratch

## Introduction 

A neural network is a computational **model** inspired by the structure and function of the human brain. It consists of **layers** of interconnected units called **neurons** or **nodes**, which process input data by performing weighted computations and applying activation functions. Neural networks are particularly effective at learning patterns and representations from data, making them suitable for tasks like classification, regression, pattern recognition, and decision-making.

### Goal of this project
The goal of this project is to build a deeper understanding of neural networks by examining their fundamental components, such as activation functions and loss functions, among others.

## Neuron Concept
A **neuron** in a neural network is a computational unit that receives input signals from neurons in the previous layer, processes them by computing a weighted sum, adding a bias, and applying an activation function, and then passes the result to the next layer (in fully connected networks).

- Neurons in the **input layer** don't process data; they simply hold the raw input features, such as pixel values or numerical data.
- Neurons in the **output layer** produce the final prediction or decision based on the computations from previous layers.  

 **Weights** ($w$) are numerical values associated with the connections between neurons in adjacent layers. They determine how much influence a particular input has on a neuron’s output. For every connection from an input to a neuron, there is a corresponding weight.
 
**Bias** ($b$) is an additional parameter for each neuron that allows the model to shift the activation function, helping the network learn more flexible decision boundaries.

The output of a neuron (before activation) is computed as:

$$
z = \sum_{i=1}^{n} x_i \cdot w_i + b
$$

where:
* $x_i$ are the inputs,
* $w_i$ are the corresponding weights,
* $b$ is the bias term.

In [1]:
# This example illustrates three neurons' outputs feeding into a single neuron
inputs = [1.2, 2.3, 3.1] # 3 neurons' output
weights = [3.1, 2.1, 8.7] # 3 weights
bias = 3 # 1 bias

# The output of a neuron is computed as:
output = inputs[0] * weights[0] + inputs[1] * weights[1] + inputs[2] * weights[2] + bias

print(output)

38.519999999999996


In [2]:
# To illustrate four neurons feeding three neurons, we need multiple weight vectors and multiple biases
input_layer = [1.2, 2.3, 3.1, 4.2]
weights1 = [0.9, 0.1, -0.1, -0.3]
weights2 = [0.5, -0.91, 0.3, 0.2]
weights3 = [-0.1, 0.26, -0.26, -0.29]
bias1 = 2
bias2 = 3
bias3 = 0.5

# Each neuron uses all four inputs (the fourth input/weight pair must not be left out)
layer1_neurons = [input_layer[0] * weights1[0] + input_layer[1] * weights1[1] + input_layer[2] * weights1[2] + input_layer[3] * weights1[3] + bias1, #layer1_neuron1
                  input_layer[0] * weights2[0] + input_layer[1] * weights2[1] + input_layer[2] * weights2[2] + input_layer[3] * weights2[3] + bias2, #layer1_neuron2
                  input_layer[0] * weights3[0] + input_layer[1] * weights3[1] + input_layer[2] * weights3[2] + input_layer[3] * weights3[3] + bias3  #layer1_neuron3
                 ]

print(layer1_neurons)

[1.74, 3.277, -1.046]


In [3]:
# Similar version using loops
input_layer = [1.2, 2.3, 3.1, 4.2]

weights = [[0.9, 0.1, -0.1, -0.3],
           [0.5, -0.91, 0.3, 0.2],
           [-0.1, 0.26, -0.26, -0.29]]

biases = [2, 3, 0.5]

layer_outputs = []  #output of the current layer
for neuron_weights, neuron_bias in zip(weights, biases):
    neuron_output = 0 #output of the given neuron
    for n_input, weight in zip(input_layer, neuron_weights):
        neuron_output += n_input * weight
    neuron_output += neuron_bias
    layer_outputs.append(neuron_output)
print(layer_outputs)

[1.74, 3.277, -1.046]


## Basic Math


### What Is a "Shape" in Programming and Neural Networks?

In the context of **machine learning, neural networks (NNs), and arrays (like NumPy or tensors in PyTorch/TensorFlow)**, the **"shape"** refers to the **dimensions** of a data structure—typically lists, arrays, or tensors.

➤ Example: Python list

```python
a = [1, 2, 3]
```

* Shape: `(3,)` → a 1D list with 3 elements.

➤ Example: 2D list (matrix)

```python
b = [[1, 2, 3],
     [4, 5, 6]]
```

* Shape: `(2, 3)` → 2 rows, 3 columns.

Think of shape like the size blueprint of the data structure.

---

### How Can We Identify the Shape of a List?

In Python, you can use:

#### NumPy 

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # Output: (2, 3)
```

####  For native Python lists:

You can inspect shape manually:

```python
lst = [[1, 2, 3], [4, 5, 6]]
rows = len(lst)
cols = len(lst[0])
print((rows, cols))  # (2, 3)
```

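The manual approach is less robust: it assumes a non-empty list whose rows all have the same length, whereas `arr.shape` reports the dimensions of the array directly.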
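
As a rough sketch of what a more careful manual check could look like, the helper below walks a nested list recursively and rejects ragged input. The name `list_shape` is hypothetical (it is not part of Python or NumPy) and the function only handles plain lists.

```python
def list_shape(data):
    """Return the shape of a rectangular nested list as a tuple.

    Raises ValueError if sibling sub-lists have different shapes (a "ragged" list).
    """
    if not isinstance(data, list):
        return ()  # a scalar has an empty shape
    if len(data) == 0:
        return (0,)
    first = list_shape(data[0])
    for item in data[1:]:
        if list_shape(item) != first:
            raise ValueError("ragged list: sub-lists have different shapes")
    return (len(data),) + first


print(list_shape([1, 2, 3]))               # (3,)
print(list_shape([[1, 2, 3], [4, 5, 6]]))  # (2, 3)
```

In practice, converting to a NumPy array and reading `.shape` is usually simpler.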

---

### Why Is Shape Important in Neural Networks?

#### 1. **Shape dictates compatibility between layers**

Each layer in a neural network expects input and produces output of certain shapes.

Example:

* If a layer expects input shape `(batch_size, 3)` and you pass `(batch_size, 2)`, the matrix multiplication fails with a **shape error**.
* Matrix multiplication (used in dense layers) **requires shape compatibility**:

  $$
  (m \times n) \cdot (n \times p) \rightarrow (m \times p)
  $$
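
The compatibility rule can be checked directly in NumPy. The sketch below uses made-up sizes (5 samples, 3 features, 4 neurons) purely for illustration.

```python
import numpy as np

inputs = np.ones((5, 3))    # (m x n): a batch of 5 samples with 3 features each
weights = np.ones((3, 4))   # (n x p): 3 inputs feeding 4 neurons

outputs = np.dot(inputs, weights)
print(outputs.shape)        # (5, 4), i.e. (m x p)

bad_inputs = np.ones((5, 2))  # only 2 features: inner dimensions 2 and 3 disagree
try:
    np.dot(bad_inputs, weights)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("shape error:", err)
```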

#### 2. **Bugs from shape mismatches are common**

Incorrect shapes cause:

* Runtime errors (for example, a `ValueError` reporting a shape mismatch)
* Incorrect learning (e.g., if dimensions are unintentionally broadcast)
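
The second kind of bug is harder to spot because nothing fails. A minimal sketch, with invented prediction and target values, of how a column vector and a flat vector silently broadcast into a matrix:

```python
import numpy as np

predictions = np.array([[0.9], [0.2], [0.7]])  # shape (3, 1): one row per sample
targets = np.array([1.0, 0.0, 1.0])            # shape (3,): a flat vector

# No error is raised: (3, 1) - (3,) broadcasts to a (3, 3) matrix of pairwise differences.
wrong_diff = predictions - targets
print(wrong_diff.shape)   # (3, 3)

# Making the shapes agree gives the intended per-sample difference.
right_diff = predictions.reshape(-1) - targets
print(right_diff.shape)   # (3,)
```

Averaging `wrong_diff` would still produce a number, so a loss computed from it would look plausible while being wrong.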

#### 3. **Shapes guide how to reshape, flatten, or batch data**

Before feeding data into an NN:

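* Images often need shape `(batch, height, width, channels)` (channels-last; some frameworks, such as PyTorch, expect `(batch, channels, height, width)` instead)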
* Flattening or reshaping is often needed for Dense layers
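
As an illustration, the sketch below flattens a hypothetical batch of 8 RGB images of 28 x 28 pixels (sizes chosen only for the example) so that each image becomes one feature row for a dense layer.

```python
import numpy as np

# A hypothetical batch of 8 RGB images, 28 x 28 pixels, channels-last layout.
images = np.zeros((8, 28, 28, 3))

# Keep the batch axis and merge the remaining axes into one feature axis.
flat = images.reshape(images.shape[0], -1)
print(flat.shape)   # (8, 2352), because 28 * 28 * 3 = 2352
```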

#### 4. **Backpropagation depends on gradients with matching shapes**

All derivatives computed during training **must match shape** with their corresponding weights and activations.
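
A small sketch of this shape matching for a single dense layer, assuming the layout `output = inputs . weights + biases` with weights of shape `(n_inputs, n_neurons)`. The array `dvalues` is a stand-in for the gradient arriving from the next layer, not a real training signal.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.standard_normal((3, 4))    # batch of 3 samples, 4 features
weights = rng.standard_normal((4, 5))   # 4 inputs -> 5 neurons
biases = np.zeros((1, 5))

output = np.dot(inputs, weights) + biases   # shape (3, 5)
dvalues = np.ones_like(output)              # placeholder upstream gradient, shape (3, 5)

dweights = np.dot(inputs.T, dvalues)               # (4, 3) . (3, 5) -> (4, 5)
dbiases = np.sum(dvalues, axis=0, keepdims=True)   # (1, 5)
dinputs = np.dot(dvalues, weights.T)               # (3, 5) . (5, 4) -> (3, 4)

print(dweights.shape, dbiases.shape, dinputs.shape)
```

Each gradient has the same shape as the quantity it belongs to, which is what makes an element-wise parameter update possible.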

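## Dot Product

The dot product of two equal-length vectors is the sum of the products of their corresponding elements, which is exactly the weighted sum a neuron computes before the bias is added.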

In [4]:
# Dot Product using Numpy
import numpy as np

inputs = [1, 2, 3, 2.5]
weights = [0.2, 0.8, -0.5, 1.0]
bias = 2

# The order of the arguments to np.dot does not matter for two 1D vectors like these,
# but it does matter once the weights form a matrix (multiple neurons in the next layer); see the batch example below
output = np.dot(inputs, weights) + bias

print(output)

4.8


## Batches, Layers & Objects

### Batches - Number of training samples at once.
In the context of neural networks, a batch refers to a subset of the training data that is processed at once before updating the model's parameters. The batch size is the number of training samples within that subset. Instead of processing the entire training dataset at once (which is called batch gradient descent), the data is divided into smaller batches to make training more efficient.
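
A minimal sketch of splitting a dataset into batches, in plain Python. The helper name `iterate_batches` and the toy dataset are hypothetical; real training loops usually also shuffle the data first.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items each."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


dataset = list(range(10))   # stand-in for 10 training samples
batches = list(iterate_batches(dataset, batch_size=4))
print(batches)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is smaller when the dataset size is not a multiple of the batch size.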

Every sample in a batch is multiplied by the same weight vectors and added to the same biases. The specific values used here are not important, because the weights and biases will later be adjusted through forward and backward propagation.

In [5]:
sing_samp_input = [1, -6, 4, 9] # These are features from a single sample
mult_samp_input = [[1, 2, 3, 2.5],  # These are features from multiple samples
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]
                  ]

weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]]
biases = [2, 3, 0.5]

output = np.dot(mult_samp_input, np.array(weights).T) + biases
print(output)

[[ 4.8    1.21   2.385]
 [ 8.9   -1.81   0.2  ]
 [ 1.41   1.051  0.026]]


## Multiple Layers

In [6]:
# Now we will create multiple layers with batch size > 1
mult_samp_input = [[1, 2, 3, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]
                  ]

l1_weights = [[0.2, 0.8, -0.5, 1.0],
              [0.5, -0.91, 0.26, -0.5],
              [-0.26, -0.27, 0.17, 0.87]]
l1_biases = [2, 3, 0.5]

l2_weights = [[0.1, -0.14, 0.5],
              [-0.5, 0.12, -0.33],
              [-0.44, 0.73, -0.13]]
l2_biases = [-1, 2, -0.5]

l1_output = np.dot(mult_samp_input, np.array(l1_weights).T) + l1_biases
l2_output = np.dot(l1_output, np.array(l2_weights).T) + l2_biases
print(l2_output)

[[ 0.5031  -1.04185 -2.03875]
 [ 0.2434  -2.7332  -5.7633 ]
 [-0.99314  1.41254 -0.35655]]


## Object

In [7]:
# Now, we will convert this into an object
np.random.seed(0)

X =  [[1, 2, 3, 2.5],
      [2.0, 5.0, -1.0, 2.0],
      [-1.5, 2.7, 3.3, -0.8]]

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons) # initialize weights
        self.biases = np.zeros((1, n_neurons)) # initialize biases

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases # compute the output

layer1 = Layer_Dense(4, 5)
layer2 = Layer_Dense(5, 2)
layer3 = Layer_Dense(2, 1)

layer1.forward(X)
print(layer1.output)
layer2.forward(layer1.output)
print(layer2.output)
layer3.forward(layer2.output)
print(layer3.output)


[[ 0.10758131  1.03983522  0.24462411  0.31821498  0.18851053]
 [-0.08349796  0.70846411  0.00293357  0.44701525  0.36360538]
 [-0.50763245  0.55688422  0.07987797 -0.34889573  0.04553042]]
[[ 0.148296   -0.08397602]
 [ 0.14100315 -0.01340469]
 [ 0.20124979 -0.07290616]]
[[-0.00087785]
 [ 0.00167789]
 [ 0.00036128]]


In [8]:
import nnfs 
from nnfs.datasets import spiral_data
# initialize the nnfs
nnfs.init()

# create dataset    
X, y = spiral_data(100, 3)

# create dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# perform a forward pass of our training data through this layer
dense1.forward(X)

print(dense1.output[:5])

[[ 0.          0.          0.        ]
 [-0.00104752  0.00113954 -0.00047984]
 [-0.00274148  0.00317292 -0.00086922]
 [-0.00421884  0.00526663 -0.00055913]
 [-0.00577077  0.00714014 -0.0008943 ]]


## Activation Function


In [9]:
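class Activation_Function:
    # Forward methods store the activation in self.output.
    # Derivative methods take the same pre-activation inputs and store
    # the element-wise derivative of the activation in self.output.

    def ReLU(self, inputs):
        self.output = np.maximum(0, inputs)

    def dReLU(self, inputs):
        # derivative is 1 where the input is positive, 0 elsewhere
        self.output = np.where(inputs > 0, 1, 0)

    def sigmoid(self, inputs):
        self.output = 1 / (1 + np.exp(-inputs))

    def dsigmoid(self, inputs):
        # sigmoid'(x) = s * (1 - s), where s = sigmoid(x)
        s = 1 / (1 + np.exp(-inputs))
        self.output = s * (1 - s)

    def tanh(self, inputs):
        self.output = np.tanh(inputs)

    def dtanh(self, inputs):
        # tanh'(x) = 1 - tanh(x)^2
        self.output = 1 - np.tanh(inputs) ** 2

    def softmax(self, inputs):
        # subtract the row-wise max before exponentiating for numerical stability
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)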


In [10]:
# create dataset
X, y = spiral_data(100, 3)
# create dense layer (weights & biases) with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# create activation object for the first layer (ReLU)
activation1 = Activation_Function()
# create second dense layer with 3 input features and 3 output values
dense2 = Layer_Dense(3, 3)
# create a separate activation object for the second layer (softmax),
# so the first layer's ReLU output is not overwritten
activation2 = Activation_Function()


# perform a forward pass of our training data through both layers
dense1.forward(X)
activation1.ReLU(dense1.output)
dense2.forward(activation1.output)
activation2.softmax(dense2.output)

print(activation2.output[:5])


[[0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]
 [0.33333334 0.33333334 0.33333334]]


## Loss Functions
### What Is a Loss Function?

A **loss function** is a **mathematical function** that measures how **far off a model's prediction is from the actual target value**.

> Think of it like a penalty score: the higher the loss, the worse your model is doing.

---

### Why Do We Use a Loss Function?

We use loss functions to:

1. **Quantify model error** — You can't improve what you can't measure.
2. **Guide learning** — During training, optimization algorithms (like gradient descent) use the **loss value to adjust weights**.
3. **Compare models** — Lower loss typically indicates a better-performing model (for the same problem).
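
To make the second point concrete, here is a minimal sketch of gradient descent on a one-weight model `y_pred = w * x` with MSE as the loss. The toy data, learning rate, and step count are assumptions chosen for the illustration.

```python
import numpy as np

# Toy data generated from y = 2x, so the ideal weight is 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0               # initial guess for the single weight
learning_rate = 0.05
losses = []

for step in range(50):
    y_pred = w * x
    loss = np.mean((y - y_pred) ** 2)          # MSE for the current weight
    grad = np.mean(-2.0 * x * (y - y_pred))    # derivative of the MSE with respect to w
    w -= learning_rate * grad                  # move against the gradient
    losses.append(loss)

print(w)            # approaches 2.0
print(losses[0])    # 30.0, the loss before any update
```

The loss shrinks at every step here because the learning rate is small enough for this particular problem; a larger value could make the updates overshoot.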

---

###  When and How Do We Use a Loss Function?

➤ **When**

* Every time the model makes a prediction during training.
* In **each iteration**, the loss function is used to evaluate the model’s output against ground truth.

➤ **How**

1. **Prediction:** The model outputs a value (e.g., 0.8 for class "dog")
2. **Compare with target:** The ground truth is 1 (it *is* a dog)
3. **Loss function:** Calculates how "wrong" 0.8 is compared to 1 (e.g., using MSE or Cross-Entropy)
4. **Backpropagation:** Computes the gradients of the loss with respect to the weights, which are then used to update the weights
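
The numbers from the steps above can be worked through directly. This sketch evaluates the squared error and the binary cross-entropy for a single prediction of 0.8 against a target of 1.

```python
import math

y_true = 1.0   # ground truth: it is a dog
y_pred = 0.8   # model's predicted probability for "dog"

squared_error = (y_true - y_pred) ** 2
bce = -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(round(squared_error, 4))   # 0.04
print(round(bce, 4))             # 0.2231
```

Both values are small but non-zero, reflecting a prediction that is right but not fully confident.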

---

### Common Types of Loss Functions

Loss functions vary depending on the type of task: **regression** vs **classification**.

---

####  1. **For Regression Tasks**

Where the model predicts continuous values.

**Mean Squared Error (MSE)**

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

* Penalizes large errors more than small ones
* Very common in predicting numbers (e.g., house prices)

**Mean Absolute Error (MAE)**

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

* Penalizes errors in proportion to their size, so it is less sensitive to outliers than MSE
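
A short sketch comparing the two regression losses on invented data where one prediction is badly off, to show how differently they react to an outlier:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 3.0, 17.0])   # the last prediction is off by 10

errors = y_true - y_pred
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))

print(mse)   # 25.5
print(mae)   # 3.0
```

The single large error contributes 100 of the 102 total squared error, while it contributes 10 of the 12 total absolute error.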

---

#### 2. **For Classification Tasks**

Where the model predicts categories or probabilities.

**Binary Cross-Entropy**

$$
\text{BCE} = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]
$$

* Used for binary classification (`y = 0` or `1`)
* Output should be a probability (`sigmoid` activation)
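
A possible NumPy sketch of the formula, averaged over a batch. The function name and the clipping constant `eps` are assumptions for this example; clipping is a common way to avoid `log(0)` and is not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean BCE over a batch; `eps` is an assumed clipping constant that avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(per_sample)

y_true = np.array([1.0, 0.0, 1.0])
confident = np.array([0.9, 0.1, 0.8])   # predictions close to the targets
unsure = np.array([0.6, 0.4, 0.5])      # predictions near 0.5

loss_confident = binary_cross_entropy(y_true, confident)
loss_unsure = binary_cross_entropy(y_true, unsure)
print(loss_confident < loss_unsure)   # True
```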

**Categorical Cross-Entropy**

$$
\text{CCE} = -\sum_{i} y_i \cdot \log(\hat{y}_i)
$$

* Used for multi-class classification
* Targets are one-hot encoded, predictions are softmax probabilities
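
A sketch of the same idea for one-hot targets, averaged over a batch. The function name, the clipping constant `eps`, and the example probabilities are assumptions made for the illustration.

```python
import numpy as np

def categorical_cross_entropy(y_true_one_hot, y_pred, eps=1e-7):
    """Mean CCE over a batch of one-hot targets; `eps` is an assumed clipping constant."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_sample = -np.sum(y_true_one_hot * np.log(y_pred), axis=1)
    return np.mean(per_sample)

# Two samples, three classes; each row of predictions sums to 1 like a softmax output.
softmax_out = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])

loss = categorical_cross_entropy(targets, softmax_out)
print(round(float(loss), 4))   # 0.2899
```

Because the targets are one-hot, only the predicted probability of the correct class contributes to each sample's loss.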

**Sparse Categorical Cross-Entropy**

* Same formula as categorical cross-entropy, but labels are integer class indices instead of one-hot vectors
* Easier to use with many classes, since no one-hot matrix has to be built
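
With integer labels, the loss can be computed by indexing the predicted probability of the correct class directly. The sketch below (function name and data are assumptions) also checks that the result agrees with the one-hot version.

```python
import numpy as np

def sparse_categorical_cross_entropy(class_ids, y_pred, eps=1e-7):
    """Mean cross-entropy for integer class labels; `eps` is an assumed clipping constant."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    picked = y_pred[np.arange(len(y_pred)), class_ids]   # probability of the true class
    return np.mean(-np.log(picked))

softmax_out = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])
class_ids = np.array([0, 1])   # integer labels instead of one-hot rows

sparse_loss = sparse_categorical_cross_entropy(class_ids, softmax_out)

# Same loss computed from the equivalent one-hot targets, for comparison.
one_hot = np.eye(3)[class_ids]
dense_loss = np.mean(-np.sum(one_hot * np.log(softmax_out), axis=1))
print(np.isclose(sparse_loss, dense_loss))   # True
```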

---

#### Bonus: Other Losses

* **Huber Loss** – robust to outliers (mix between MSE and MAE)
* **KL Divergence** – measures difference between two probability distributions
* **Contrastive Loss / Triplet Loss** – used in face verification, embeddings
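
As an example of the first item, here is a sketch of the Huber loss in its standard form: quadratic for errors up to a threshold `delta` and linear beyond it. The function name, the default `delta=1.0`, and the data are assumptions for this illustration.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5 * e^2 if |e| <= delta, else delta * (|e| - 0.5 * delta)."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 3.0, 17.0])   # the last prediction is off by 10

loss = huber_loss(y_true, y_pred)
print(loss)   # 2.625
```

The outlier contributes 9.5 here instead of the 100 it would add to a sum of squared errors, which is why this loss is considered robust.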

---


### Summary Table

| Type           | Task                       | Function                         | Example Use                | 
| -------------- | -------------------------- | -------------------------------- | -------------------------- | 
| MSE            | Regression                 | $\frac{1}{n}\sum(y - \hat{y})^2$ | Predicting temperatures    |
| MAE            | Regression                 | $\frac{1}{n}\sum\lvert y - \hat{y}\rvert$ | Forecasting sales          |
| Binary CE      | Binary Classification      | BCE formula above                | Spam detection             |
| Categorical CE | Multi-Class Classification | CCE formula                      | Digit classification (0–9) |
| Sparse CE      | Multi-Class (int labels)   | Same as CCE                      | NLP classification tasks   |


