# C1: INTRODUCTION TO NEURAL NETWORKS

## Neuron
- A neuron is a mathematical function that takes inputs, processes them, and produces an output.
- It is a computational unit that:
    1. Receives inputs
    2. Applies weights to each input
    3. Sums them up and adds a bias term  
        - $z = \sum_i w_i x_i + b$
    4. Passes the result to an activation function to introduce non-linearity  
        - $y = f(z)$
- Neurons are also called units or nodes.

## Neural Network
- A neural network is a computational model inspired by how the human brain works.
- It consists of layers of nodes/neurons connected by weighted links.
- Layers:
    - **Input layer** → takes raw data
    - **Hidden layers** → extract features through transformations
    - **Output layer** → gives the prediction
- During training, **backpropagation** computes the gradient of the loss with respect to each weight, and an optimizer such as gradient descent uses those gradients to adjust the weights.


In [2]:
# Inputs: restaurant features
price = 3 # scale 1 to 5 -> cheap to expensive
rating = 4 # scale 1 to 5 -> bad to excellent

# Weights: How much you care about each
w_price = -0.8 # you prefer cheaper, so the weight is negative
w_rating = 1.2 # you prefer good ratings, so the weight is positive

# Bias: general tendency to eat out
bias = 0.5

# Neuron calculation
output = (price *  w_price) + (rating * w_rating) + bias
print(f"Neuron raw output: {output: .2f}")

Neuron raw output:  2.90
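
The cell above prints only the weighted sum $z$. As a minimal sketch of the final step, $y = f(z)$, the snippet below passes the same weighted sum through a sigmoid. The choice of sigmoid and the helper name `neuron` are assumptions made for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum plus bias, followed by a sigmoid activation (assumed here)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, 1.0 / (1.0 + math.exp(-z))

# Same price/rating inputs, weights, and bias as the restaurant example
z, y = neuron(inputs=[3, 4], weights=[-0.8, 1.2], bias=0.5)
print(f"z = {z:.2f}, y = {y:.4f}")
```

The activated output lies between 0 and 1, so it can be read as a rough "probability of choosing this restaurant".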


## Activation Functions

- **Definition**: An activation function decides whether a neuron should be activated, i.e., whether it should pass the information forward.  
- Activation functions introduce **non-linearity**, allowing the network to learn complex patterns that a purely linear model cannot represent.  
- Without activation functions, a stack of layers collapses into a single linear transformation, so the network behaves like a linear model regardless of its depth.
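
The last point can be checked numerically. The sketch below uses arbitrary random weights (the shapes and the seed are assumptions) and shows that two linear layers with no activation in between compute the same mapping as one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # layer 2: 3 -> 1
x = rng.normal(size=2)

# Two stacked layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU between the layers, this collapse no longer holds in general
with_relu = W2 @ np.maximum(W1 @ x + b1, 0) + b2
```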

### Common Activation Functions

#### Sigmoid
- **Formula**: $\sigma(x) = \frac{1}{1 + e^{-x}}$  
- **Range**: (0, 1)  
- **Pros**:  
    - Useful for probability outputs  
    - Provides a smooth gradient  
- **Cons**:  
    - Suffers from the **vanishing gradient problem**: the function saturates for large positive or negative $x$, where its gradient is close to zero  
    - Can lead to **slow convergence**
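
To make the vanishing gradient concrete, the sketch below evaluates the sigmoid derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ at a few points. The sample points are arbitrary; the derivative peaks at 0.25 for $x = 0$ and shrinks quickly as $|x|$ grows.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")
```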

#### Tanh (Hyperbolic Tangent)
- **Formula**: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$  
- **Range**: (-1, 1)  
- **Pros**:  
    - Output is **zero-centered** (better than sigmoid for hidden layers)  
- **Cons**:  
    - Still suffers from the **vanishing gradient problem**

#### ReLU (Rectified Linear Unit)
- **Formula**: $f(x) = \max(0, x)$  
- **Range**: $[0, \infty)$  
- **Pros**:  
    - Fast to compute  
    - Helps reduce the vanishing gradient problem  
    - Widely used in deep networks  
- **Cons**:  
    - Can cause **"dead neurons"** (if a neuron's pre-activation input is always $\leq 0$, its output and gradient are both zero, so it stops learning)


In [8]:
import numpy as np

# Sigmoid activation function
def sigmoidal(x):
    return 1 / (1 + np.exp(-x))

print("Sigmoid function for 3.3 and -3.2")
print(sigmoidal(3.3), " Close to 1")
print(sigmoidal(-3.2), "Close to 0")
print("\n")

# Tanh activation function
def tanh(x):
    return np.tanh(x)

print("Tanh function for 2.0 and -2.0")
print(tanh(2.0), " Positive outcome")
print(tanh(-2.0), " Negative outcome")
print("\n")

# ReLU activation function
def relu(x):
    return max(0, x)  # scalar inputs only; use np.maximum(0, x) for arrays

print(relu(3.3), " Keeps positive")
print(relu(-2.0), " Zeroes out")

Sigmoid function for 3.3 and -3.2
0.9644288107273639  Close to 1
0.039165722796764356 Close to 0


Tanh function for 2.0 and -2.0
0.9640275800758169  Positive outcome
-0.9640275800758169  Negative outcome


3.3  Keeps positive
0  Zeroes out


## Feed Forward Neural Network

- A **feed-forward neural network (FNN)** is a type of artificial neural network.  
- In this network, information moves only in one direction — from input to output.  
- Unlike recurrent networks, it does not have cycles or feedback connections.  

### Layers
- **Input Layer**: Takes raw data  
- **Hidden Layers**: Perform computations using activation functions  
- **Output Layer**: Produces the final prediction or classification  

### Training
- Uses **backpropagation** and **gradient descent** to update weights  

### Usage
- Common applications include:  
    - Classification  
    - Regression  
    - Feature extraction (e.g., image recognition, speech processing)


In [12]:
# Input features (price, rating)
inputs = np.array([3, 4])

# Layer 1: 2 neurons
weights1 = np.array([[0.2, 0.8],    # Neuron 1 weights
                     [0.6, -0.5]])  # Neuron 2 weights
bias1 = np.array([0.5, -0.2])

layer1_output = np.dot(inputs, weights1.T) + bias1
layer1_output = np.maximum(layer1_output, 0) # ReLU activation

print("Layer 1 output: ", layer1_output)

# Layer 2: 1 neuron -> Final decision
weights2 = np.array([0.4, 0.9])
bias2 = 0.1

output = np.dot(layer1_output, weights2) + bias2
output = 1 / (1 + np.exp(-output)) # Sigmoid activation -> value in (0, 1)
print("Final decision (probability)", output)

Layer 1 output:  [4.3 0. ]
Final decision (probability) 0.86056612703835
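
The same forward pass extends to several inputs at once. The sketch below wraps the two layers in a hypothetical `forward` helper and runs it on a batch of shape `(n_samples, 2)`. The second restaurant is made-up data; the weights are the ones from the cell above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(batch, weights1, bias1, weights2, bias2):
    """Forward pass of the 2-2-1 network for a batch of shape (n_samples, 2)."""
    hidden = np.maximum(batch @ weights1.T + bias1, 0)   # ReLU hidden layer
    return sigmoid(hidden @ weights2 + bias2)            # sigmoid output layer

weights1 = np.array([[0.2, 0.8], [0.6, -0.5]])
bias1 = np.array([0.5, -0.2])
weights2 = np.array([0.4, 0.9])
bias2 = 0.1

batch = np.array([[3, 4],    # the restaurant from the cell above
                  [1, 5]])   # hypothetical second restaurant
probs = forward(batch, weights1, bias1, weights2, bias2)
print(probs)
```

Because information only flows from `batch` to `hidden` to the output, each row is processed independently and the first entry matches the single-input result above.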


## Training a Neural Network

A neural network can be trained by adjusting its internal weights based on data.  

### Steps of Training
1. Input data is fed into the network.  
2. The network computes an output.  
3. The output is compared with the actual label.  
4. A **loss function** calculates how wrong the network is.  
5. The gradient of the loss with respect to the weights is computed (slope of the error surface).  
6. This gradient indicates the direction to reduce error.  
7. An **optimizer** (e.g., SGD, Adam) updates the weights:  

   - $w = w - \eta \cdot \frac{\partial L}{\partial w}$  

   where:  
   - $w$ = weight  
   - $\eta$ = learning rate (controls step size)  
   - $L$ = loss function  

8. Repeat the process multiple times until the network’s predictions improve.  
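
As a worked example of steps 2 to 7, the sketch below performs one update by hand for a one-weight model $\hat{y} = w \cdot x$ with squared-error loss. The model, the data point, and the learning rate are all assumptions chosen to keep the arithmetic easy to follow.

```python
# One training example and a one-weight model y_hat = w * x (no bias)
x, y = 2.0, 4.0
w = 0.5
learning_rate = 0.1

y_hat = w * x                      # step 2: forward pass -> 1.0
loss = (y_hat - y) ** 2            # step 4: squared error -> 9.0
grad = 2 * (y_hat - y) * x         # step 5: dL/dw -> -12.0
w_new = w - learning_rate * grad   # step 7: update -> about 1.7

new_loss = (w_new * x - y) ** 2    # loss after the update, about 0.36
print(w_new, new_loss)
```

The gradient is negative, so subtracting it increases $w$ toward the value 2 that fits this data point, and the loss drops from 9 to roughly 0.36 after a single step.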

## Loss Functions

- **Definition**: A loss function measures how well a model is performing.  
- It compares the model’s predictions with the actual target values.  
- **Goal**: Minimize the loss during training so the model learns better.  

### Types of Loss Functions

#### 1. Regression Losses
- **Mean Squared Error (MSE):**  
  - Formula: $L(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$  
  - Penalizes larger errors more heavily.  
- **Mean Absolute Error (MAE):**  
  - Formula: $L(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$  
  - More robust to outliers than MSE, because errors are penalized linearly rather than quadratically.  
- **Huber Loss:**  
  - Formula:  

    $$
    L_\delta(y, \hat{y}) =
    \begin{cases} 
      \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\
      \delta \cdot |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise}
    \end{cases}
    $$  

  - Combines the advantages of MSE and MAE.  
  - Less sensitive to outliers compared to MSE.  
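
The three regression formulas can be evaluated directly with NumPy. This is a minimal sketch on made-up values; the Huber loss is averaged over the samples and uses $\delta = 1$, both of which are assumptions.

```python
import numpy as np

y_true = np.array([2.5, 0.0, 2.1, 7.8])
y_pred = np.array([3.0, -0.5, 2.0, 7.5])
error = y_true - y_pred

mse = np.mean(error ** 2)
mae = np.mean(np.abs(error))

delta = 1.0  # Huber threshold (assumed value)
huber = np.mean(np.where(np.abs(error) <= delta,
                         0.5 * error ** 2,                           # quadratic branch
                         delta * np.abs(error) - 0.5 * delta ** 2))  # linear branch
print(mse, mae, huber)
```

Every error here is below $\delta$, so the Huber loss uses only its quadratic branch and equals half the MSE.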

#### 2. Classification Losses
- **Binary Cross-Entropy (Log Loss):**  
  - Formula: $L(y, \hat{y}) = -\frac{1}{n}\sum_{i=1}^n \big[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \big]$  
  - For binary classification problems (e.g., spam vs. not spam).  
- **Categorical Cross-Entropy:**  
  - Formula: $L(y, \hat{y}) = -\sum_{i=1}^C y_i \log(\hat{y}_i)$  
  - For multi-class classification (e.g., cat/dog/horse).  
- **Sparse Categorical Cross-Entropy:**  
  - Same as categorical cross-entropy, but labels are integers instead of one-hot vectors.  
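
The sketch below evaluates these formulas with NumPy on illustrative values (all numbers are assumptions). It also shows that the sparse variant gives the same result as the one-hot version, because only the true class contributes to the sum.

```python
import numpy as np

# Binary cross-entropy: labels in {0, 1}, predictions are probabilities
y_true = np.array([1.0, 0.0, 1.0])
y_prob = np.array([0.9, 0.2, 0.8])
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Categorical cross-entropy for one sample with C = 3 classes
probs = np.array([0.7, 0.2, 0.1])      # predicted class probabilities
one_hot = np.array([1.0, 0.0, 0.0])    # true class is index 0
cce = -np.sum(one_hot * np.log(probs))

# Sparse variant: the same loss, with the label given as an integer index
label = 0
sparse_cce = -np.log(probs[label])

print(bce, cce, sparse_cce)
```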

#### 3. Advanced Losses
- **Hinge Loss:** Commonly used for SVMs.  
- **KL Divergence (Kullback–Leibler):**  
  - Formula: $D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$  
  - Measures differences between probability distributions (useful in VAEs, knowledge distillation).  
- **Contrastive Loss / Triplet Loss:** Used in face recognition and metric learning.  
- **Dice Loss / IoU Loss:** Used in image segmentation tasks (e.g., medical imaging).  
- **CTC Loss (Connectionist Temporal Classification):** Used for sequence problems like speech recognition.  
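
As a small illustration of the KL divergence formula, the sketch below computes $D_{KL}(P \| Q)$ for two discrete distributions. The helper name `kl_divergence` and the example distributions are assumptions, and the helper requires strictly positive probabilities.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))   # about 0.511
print(kl_divergence(q, p))   # a different value: KL is not symmetric
print(kl_divergence(p, p))   # 0.0
```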


In [14]:
import torch
import torch.nn as nn

# True values (targets)
y_true_reg = torch.tensor([2.5, 0.0, 2.1, 7.8])
y_pred_reg = torch.tensor([3.0, -0.5, 2.0, 7.5])

y_true_cls = torch.tensor([1, 0, 1])       # integer class labels
y_pred_cls = torch.tensor([[0.3, 0.7],     # raw class scores (logits), one row per sample
                           [0.6, 0.4],
                           [0.2, 0.8]])

# ----- Regression Losses -----
mse = nn.MSELoss()
mae = nn.L1Loss()
huber = nn.HuberLoss()

print("MSE Loss:", mse(y_pred_reg, y_true_reg).item())
print("MAE Loss:", mae(y_pred_reg, y_true_reg).item())
print("Huber Loss:", huber(y_pred_reg, y_true_reg).item())

# ----- Classification Loss -----
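# Binary cross entropy (nn.BCELoss expects probabilities in [0, 1])
bce = nn.BCELoss()
sigmoid = nn.Sigmoid()  # would map raw scores to probabilities; unused because y_pred_bin already holds probabilities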

y_true_bin = torch.tensor([1., 0., 1.])  # binary labels
y_pred_bin = torch.tensor([0.9, 0.2, 0.8])  # predicted probs
print("Binary Cross-Entropy Loss:", bce(y_pred_bin, y_true_bin).item())

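# Categorical cross entropy (for multi-class classification)
# nn.CrossEntropyLoss applies log-softmax internally, so y_pred_cls is interpreted as raw scores (logits)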
cross_entropy = nn.CrossEntropyLoss()
print("Categorical Cross-Entropy Loss:", cross_entropy(y_pred_cls, y_true_cls).item())

MSE Loss: 0.15000002086162567
MAE Loss: 0.3500000238418579
Huber Loss: 0.07500001043081284
Binary Cross-Entropy Loss: 0.18388254940509796
Categorical Cross-Entropy Loss: 0.5162140727043152
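
One detail about the last value: `nn.CrossEntropyLoss` applies log-softmax to its input, so it expects raw scores (logits) rather than probabilities. The NumPy sketch below reproduces the printed value by treating the rows of `y_pred_cls` as logits, and shows the different value that results if the same rows are read as probabilities. It is an illustrative re-computation, not part of the PyTorch API.

```python
import numpy as np

scores = np.array([[0.3, 0.7],
                   [0.6, 0.4],
                   [0.2, 0.8]])
targets = np.array([1, 0, 1])
rows = np.arange(len(targets))

# Treating the rows as logits: log-softmax first, then negative log-likelihood
log_softmax = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
loss_as_logits = -np.mean(log_softmax[rows, targets])

# Treating the rows as probabilities: negative log of the true-class probability
loss_as_probs = -np.mean(np.log(scores[rows, targets]))

print(loss_as_logits)  # about 0.5162, matching the value printed above
print(loss_as_probs)   # about 0.3635
```

If a model already outputs probabilities, one option is to pass their logarithm to `nn.NLLLoss` instead.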


## Optimization

- **Definition**: Optimization is the process of updating model parameters (weights and biases) to minimize the loss function.  
- **Goal**: Find the best set of weights so that predictions are accurate.  
- This happens during the training loop:  
  **Forward pass → Compute loss → Backward pass (gradients) → Update weights**  

## Gradient Descent

Gradient Descent is the most common optimization algorithm.  

### Core Idea
- Adjust weights in the **opposite direction of the gradient** (slope of the loss function).  

### Update Rule
- $w = w - \eta \cdot \frac{\partial L}{\partial w}$  

Where:  
- $w$ = parameter (weight)  
- $L$ = loss function  
- $\eta$ = learning rate (step size)  

### Types of Gradient Descent
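1. **Batch Gradient Descent**  
   - Uses the entire dataset for each update  
   - Gives an exact gradient of the training loss, but each update is expensive on large datasets  

2. **Stochastic Gradient Descent (SGD)**  
   - Updates the weights after every single sample  
   - Each update is cheap, but the gradient estimate is noisy  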

3. **Mini-Batch Gradient Descent**  
   - Uses small batches (e.g., 32, 64 samples)  
   - Best trade-off → most widely used in deep learning  
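
The three variants differ only in how many samples feed each update. The sketch below uses a hypothetical `train` helper that fits $y = w \cdot x$ with MSE, and selects the variant through `batch_size`. The data, learning rate, and epoch count are assumptions.

```python
import numpy as np

def train(x, y, batch_size, lr=0.01, epochs=100, seed=0):
    """Fit y = w * x with MSE, updating once per batch of the given size."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))            # reshuffle every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])
            w -= lr * grad
    return w

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x

w_batch = train(x, y, batch_size=len(x))   # batch gradient descent
w_sgd = train(x, y, batch_size=1)          # stochastic gradient descent
w_mini = train(x, y, batch_size=2)         # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)
```

On this noise-free toy data all three settings approach $w = 2$. The differences in cost and noise only become visible on larger, noisier datasets.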

### Limitations
- **Local minima / saddle points** → model can get stuck  
- **Slow convergence** → if learning rate is too small  
- **Overshooting** → if learning rate is too large  
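
The effect of the learning rate can be seen on a one-dimensional toy loss. The sketch below runs plain gradient descent on $L(w) = (w - 3)^2$ with three learning rates. The loss function and the specific rates are assumptions chosen for illustration.

```python
def run(eta, steps=50, w=0.0):
    """Gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3."""
    for _ in range(steps):
        w = w - eta * 2.0 * (w - 3.0)   # dL/dw = 2 * (w - 3)
    return w

for eta in (0.001, 0.1, 1.1):
    print(f"eta = {eta}: w after 50 steps = {run(eta):.4f}")
```

With `eta = 0.001` the weight has barely moved from its starting point, with `eta = 0.1` it is very close to 3, and with `eta = 1.1` every step overshoots the minimum by more than the previous one, so the iterates diverge.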

### Beyond Basic Gradient Descent
- **SGD with Momentum** → adds inertia to updates, helps escape local minima  
- **RMSProp** → adjusts learning rate per parameter  
- **Adam** → combines momentum + adaptive learning rate  
- **Adagrad / Adadelta** → adaptive optimizers useful for sparse data  
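
As a minimal sketch of the first item, the snippet below applies one common formulation of the momentum update, $v \leftarrow \mu v + \nabla L$ followed by $w \leftarrow w - \eta v$, to the toy loss $L(w) = (w - 3)^2$. The loss, the hyperparameters, and the step count are assumptions; library implementations may differ in details.

```python
def grad(w):
    return 2.0 * (w - 3.0)   # gradient of L(w) = (w - 3)^2

w, velocity = 0.0, 0.0
lr, momentum = 0.1, 0.9      # assumed hyperparameters

for _ in range(200):
    velocity = momentum * velocity + grad(w)   # accumulate past gradients
    w = w - lr * velocity                      # step along the accumulated direction

print(w)
```

The velocity term carries information from earlier steps, which is the "inertia" mentioned above. On this toy problem the iterates oscillate around the minimum before settling near $w = 3$.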

In [17]:
# Data (x and y = 2x)
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# Simple linear model: y = wx + b (nn.Linear includes a bias term)
model = nn.Linear(1, 1)   # 1 input, 1 output

# Loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 100

for epoch in range(num_epochs):
    # Forward pass
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # Backward pass
    optimizer.zero_grad()   # clear old gradients
    loss.backward()         # compute gradients
    optimizer.step()        # update weights

    if (epoch+1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

# Final learned weight and bias
w, b = model.parameters()
print("Learned weight:", w.item())
print("Learned bias:", b.item())


Epoch [20/100], Loss: 0.0819
Epoch [40/100], Loss: 0.0298
Epoch [60/100], Loss: 0.0264
Epoch [80/100], Loss: 0.0234
Epoch [100/100], Loss: 0.0208
Learned weight: 1.8803331851959229
Learned bias: 0.3518347144126892


## Backpropagation

- **Definition**: Backpropagation is an algorithm used to train neural networks.  
- It calculates how much each weight and bias contributed to the overall error (loss) by applying the **chain rule of calculus**. An optimizer then uses these gradients to update the parameters and reduce the error.  

### Need for Backpropagation
- **Efficiency**: Reuses intermediate results from the forward pass when computing gradients.  
- **Scalability**: Works for deep networks where manual gradient calculation would be impossible.  

### Steps of Backpropagation
1. **Forward Pass**  
   - Inputs pass through the network layer by layer to produce an output $\hat{y}$.  
   - The predicted output $\hat{y}$ is compared with the true output $y$ using a loss function.  

2. **Backward Pass**  
   - Compute the gradient of the loss with respect to the output.  
   - Propagate this error backward through the network using the **chain rule**, layer by layer.  
   - Compute gradients for each weight and bias.  

3. **Update Parameters**  
   - Use **gradient descent** (or its variants) to update weights and biases based on the computed gradients.  
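
To make the chain rule concrete, the sketch below backpropagates by hand through a single sigmoid neuron with squared-error loss, $L = (a - y)^2$ where $a = \sigma(wx + b)$, and checks the result against a finite-difference estimate. The network, the data point, and the parameter values are assumptions chosen for illustration.

```python
import math

def forward(w, b, x):
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    return z, a

def loss_fn(w, b, x, y):
    _, a = forward(w, b, x)
    return (a - y) ** 2

x, y = 1.5, 1.0        # one training example (illustrative values)
w, b = 0.4, -0.1       # current parameters

# Forward pass
z, a = forward(w, b, x)

# Backward pass: chain rule dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dL_dw = dL_da * da_dz * x      # dz/dw = x
dL_db = dL_da * da_dz * 1.0    # dz/db = 1

# Finite-difference check of the analytic gradients
eps = 1e-6
numeric_dw = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)
numeric_db = (loss_fn(w, b + eps, x, y) - loss_fn(w, b - eps, x, y)) / (2 * eps)
print(dL_dw, numeric_dw)
print(dL_db, numeric_db)
```

The intermediate values `a` and `dL_da * da_dz` are computed once and reused for both parameters, which is the efficiency argument made above. In a deeper network the same product is passed back layer by layer.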


Complete Example:

In [None]:
import torch.optim as optim

# 1. Dummy dataset (X -> y)
X = torch.tensor([[0.0], [1.0], [2.0], [3.0]], dtype=torch.float32) # input
y = torch.tensor([[0.0], [2.0], [4.0], [6.0]], dtype=torch.float32) # expected output y = 2x

# 2. Define a simple neural network
model = nn.Sequential(nn.Linear(1,1)) # 1 input -> 1 output

# 3. Define loss function and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1) # Stochastic Gradient Descent

# 4. Training loop
epochs = 100
for epoch in range(epochs):
    # Forward pass: Compute prediction
    y_pred = model(X)
    # Compute loss (prediction vs actual)
    loss = loss_fn(y_pred, y)
    # Backward pass: compute gradients
    optimizer.zero_grad() # Reset old gradients
    loss.backward() # compute new gradients
    # Update weights
    optimizer.step()
    # Print progress
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}: Loss = {loss.item(): .4f}")

# Test the trained model
test_val = torch.tensor([5.0])
print("Prediction for input 5.0: ", model(test_val).item())



Epoch 10: Loss =  0.1821
Epoch 20: Loss =  0.0534
Epoch 30: Loss =  0.0157
Epoch 40: Loss =  0.0046
Epoch 50: Loss =  0.0013
Epoch 60: Loss =  0.0004
Epoch 70: Loss =  0.0001
Epoch 80: Loss =  0.0000
Epoch 90: Loss =  0.0000
Epoch 100: Loss =  0.0000
Prediction for input 5.0:  9.996416091918945
