<a href="https://colab.research.google.com/github/KostaKat/MAT442/blob/main/hw3_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 3.7.1 Mathematical Formulation
Neural networks have been a breakthrough in machine learning and have transformed the field into what is now called deep learning. They are loosely modeled on the human brain: a network is a collection of connected nodes, and each node computes a function.
### Simple Neural Network
A basic neural network model can be represented as follows:
- **Inputs**: $ x_1 $ and $ x_2 $
- **Weights**: $ w_1 $ and $ w_2 $
- **Bias**: $ b $
- **Activation function**: $ \sigma(z) $, where $ z $ is the weighted sum of inputs plus bias.

The output $ \hat{y} $ is calculated as:
$$
\hat{y} = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)
$$
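
As a minimal sketch of the single-node computation, the snippet below evaluates $ \hat{y} $ in plain Python. It assumes the sigmoid as the activation function, and the input, weight, and bias values are hypothetical, chosen so that the weighted sum is exactly zero.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, assumed here as the activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x1, x2, w1, w2, b):
    """Return y_hat = sigma(w1*x1 + w2*x2 + b) for a single node."""
    z = w1 * x1 + w2 * x2 + b  # weighted sum of the inputs plus the bias
    return sigmoid(z)

# Hypothetical values: z = 0.5*2.0 + (-1.0)*1.0 + 0.0 = 0
y_hat = neuron_output(x1=2.0, x2=1.0, w1=0.5, w2=-1.0, b=0.0)
print(y_hat)  # 0.5, because sigmoid(0) = 0.5
```

Changing any weight or the bias shifts $ z $ and therefore moves the output toward 0 or 1.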

However, a single node like this is only a linear model passed through an activation function, so its performance will be similar to that of linear regression (or logistic regression when $ \sigma $ is the sigmoid). Neural networks become more expressive when nodes are stacked into layers: each layer applies its own linear transformation followed by an activation, so the composed model can capture more complex relationships in the data. This approach has found success in both vision and language tasks, with networks containing thousands or even millions of nodes across many layers.
### Node Calculation in a Layer
For node $ j' $ in layer $ l $:
1. **Weighted Sum**: Calculate the input $ z_{j'}^{(l)} $ to node $ j' $ as:

   $$
   z_{j'}^{(l)} = \sum_{j=1}^{J_{l-1}} w_{j,j'}^{(l)} a_{j}^{(l-1)} + b_{j'}^{(l)}
   $$

   where $ J_{l-1} $ is the number of nodes in layer $ l-1 $.

2. **Activation**: Compute the output $ a_{j'}^{(l)} $ as:
   $$
   a_{j'}^{(l)} = \sigma(z_{j'}^{(l)})
   $$

In matrix form, it is represented as:
$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$
$$
a^{(l)} = \sigma(z^{(l)})
$$

- **Note**: The matrix $ W^{(l)} $ contains all weights for layer $ l $ (row $ j' $ holds the weights $ w_{j,j'}^{(l)} $ feeding node $ j' $), and $ b^{(l)} $ is the bias vector.
- The activation function $ \sigma $ applies element-wise to $ z^{(l)} $.
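
The matrix form can be sketched in a few lines of plain Python. This is an illustration only: the layer sizes and numbers are hypothetical, the sigmoid is assumed as the activation, and `W` is stored as a list of rows so that each row holds the weights feeding one node.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, a_prev, b):
    """Compute z = W a_prev + b and a = sigma(z) for one layer."""
    # One weighted sum per node: dot product of that node's row with a_prev
    z = [sum(w * a for w, a in zip(row, a_prev)) + bias
         for row, bias in zip(W, b)]
    # The activation is applied element-wise
    a = [sigmoid(zj) for zj in z]
    return z, a

# Hypothetical layer with 3 inputs and 2 nodes
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.5]
a_prev = [1.0, 2.0, 1.0]

z, a = layer_forward(W, a_prev, b)
print(z)  # [0.0, 0.5]
print(a)  # first entry is 0.5, second is sigmoid(0.5)
```

A deeper network repeats this call once per layer, feeding each layer's `a` into the next as `a_prev`.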

As stated earlier, without an activation function each node only performs a linear transformation of its inputs. This is a bottleneck when trying to capture complex patterns, because a stack of linear layers is itself just one linear transformation. This is where activation functions come in: they control the output of each node (neuron) and introduce non-linearity into the model.


The activation of each layer is computed as:
$$
a^{(l)} = \sigma(z^{(l)}) = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})
$$
where $ \sigma $ is the chosen activation function applied element-wise.
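
To see why the non-linearity matters, the sketch below compares two stacked layers with and without an activation between them. The matrices are hypothetical, biases are left out for brevity, and ReLU is assumed as the activation. Without an activation, the two layers give the same result as the single matrix `W2 W1`.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input
W1 = [[1.0, -1.0], [0.0, 2.0]]
W2 = [[1.0, 1.0], [-1.0, 1.0]]
x = [1.0, 2.0]

# Two linear layers with no activation in between
stacked = matvec(W2, matvec(W1, x))
# One linear layer using the product matrix W2 W1
collapsed = matvec(matmul(W2, W1), x)
print(stacked, collapsed)  # [3.0, 5.0] [3.0, 5.0]

# Inserting ReLU between the layers changes the result
with_relu = matvec(W2, relu_vec(matvec(W1, x)))
print(with_relu)  # [4.0, 4.0]
```

The ReLU version differs because the first layer produces a negative value that the activation sets to zero, which no single matrix can reproduce for every input.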

### Types of Activation Functions

1. **Step Function**:
   $$
   \sigma(x) = \begin{cases}
      0, & x < 0 \\
      1, & x \geq 0
   \end{cases}
   $$
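   The step function is useful for binary classification because it switches on or off at a threshold. However, it is not used in modern networks because its gradient is zero everywhere except at the threshold, where it is undefined, so gradient-based training cannot make progress.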

2. **ReLU (Rectified Linear Unit)**:
   $$
   \sigma(x) = \max(0, x)
   $$
   ReLU is widely used because it helps combat vanishing gradients: its gradient is exactly 1 for every positive input. It is also cheap to compute, which allows faster and more effective training of deep networks.

3. **Sigmoid Function**:
   $$
   \sigma(x) = \frac{1}{1 + e^{-x}}
   $$
   The sigmoid function maps inputs to a range between 0 and 1.

4. **Softmax Function**:
   $$
    \sigma(z)_k = \frac{e^{z_k}}{\sum_{i=1}^K e^{z_i}}
   $$
   The softmax function converts a vector of values into probabilities. This is commonly used in the final layer of multi-class classification networks.
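
The four activation functions above can be written directly in plain Python. This is a sketch for scalar inputs (and a list for softmax), not the notebook's PyTorch implementation. Subtracting the maximum inside `softmax` is a common numerical-stability step that leaves the result unchanged.

```python
import math

def step(x):
    """Step function: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    """Softmax over a list of values; outputs are positive and sum to 1."""
    m = max(z)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

print(step(-0.3), step(0.0))   # 0.0 1.0
print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
probs = softmax([1.0, 2.0, 3.0])
print(probs)                   # largest input gets the largest probability
```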

### Cost Function

Once the network produces an output, the next ingredient is the cost function, which measures the error between the network's predicted output and the actual values from the training data. Two common cost functions are:

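1. **Mean Squared Error (MSE)** for regression tasks (written here as half the sum of squared errors, which differs from the mean only by a constant factor):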
   $$
   J = \frac{1}{2} \sum_{n=1}^N \sum_{k=1}^K \left( \hat{y}_k^{(n)} - y_k^{(n)} \right)^2
   $$

2. **Cross-Entropy Loss** for classification tasks, shown here in its binary form with labels $ y^{(n)} \in \{0, 1\} $:
   $$
   J = -\sum_{n=1}^N \left( y^{(n)} \ln(\hat{y}^{(n)}) + (1 - y^{(n)}) \ln(1 - \hat{y}^{(n)}) \right)
   $$
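
Both cost functions can be evaluated by hand on small inputs. The sketch below follows the two formulas term by term. The sample predictions and labels are hypothetical, and the `eps` clamp is an added safeguard (not part of the formula) that keeps `log` away from zero.

```python
import math

def squared_error_cost(y_pred, y_true):
    """J = 1/2 * sum over samples n and outputs k of (y_hat - y)^2."""
    return 0.5 * sum((p - t) ** 2
                     for pred, true in zip(y_pred, y_true)
                     for p, t in zip(pred, true))

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """J = -sum over n of [y ln(y_hat) + (1 - y) ln(1 - y_hat)]."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total

# Two samples with two outputs each: squared errors are 1, 0, 0, 4
print(squared_error_cost([[1.0, 2.0], [0.0, 1.0]],
                         [[0.0, 2.0], [0.0, 3.0]]))  # 2.5

# Two maximally uncertain predictions: cost is 2 * ln(2)
print(binary_cross_entropy([0.5, 0.5], [1, 0]))
```

Confident correct predictions drive the cross-entropy toward zero, while confident wrong ones are penalized heavily.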

How is a neural network trained? Training uses backpropagation together with gradient descent to minimize the cost function. Backpropagation computes the gradient of the cost with respect to every weight and bias by propagating the output error backward, layer by layer, so each layer's error is derived from the layer after it. Gradient descent then uses these gradients to update the weights ($ W $) and biases ($ b $).


#### Key Steps in Backpropagation:
1. **Compute Error Gradients**: Using the chain rule, compute partial derivatives of the cost function with respect to weights and biases. Define intermediate terms $ \delta_{j'}^{(l)} = \partial J / \partial z_{j'}^{(l)} $, which represent the error at node $ j' $ of layer $ l $.
2. **Iterative Weight Adjustments**: Starting from the output layer, calculate $ \delta $ values and propagate these back through each layer, updating weights and biases along the way.
3. **Update Rules**:
   - Weights and biases are updated based on the calculated gradients and a learning rate $ \beta $.
   - New weights and biases are given by:
   $$
   \text{New } w_{j,j'}^{(l)} = \text{Old } w_{j,j'}^{(l)} - \beta \frac{\partial J}{\partial w_{j,j'}^{(l)}} = \text{Old } w_{j,j'}^{(l)} - \beta \delta_{j'}^{(l)} a_{j}^{(l-1)}
   $$
   and
   $$
   \text{New } b_{j'}^{(l)} = \text{Old } b_{j'}^{(l)} - \beta \frac{\partial J}{\partial b_{j'}^{(l)}} = \text{Old } b_{j'}^{(l)} - \beta \delta_{j'}^{(l)}
   $$

4. **Iterate Until Convergence**

Source: Textbook
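
The update rules can be illustrated on the smallest possible case: a single sigmoid node trained with the binary cross-entropy cost. The following is a sketch under that assumption, not the textbook's code. For this pairing the error term simplifies to $ \delta = \hat{y} - y $, and the starting weights, sample, and learning rate are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, beta):
    """One gradient-descent update for a single sigmoid node."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    y_hat = sigmoid(z)
    delta = y_hat - y  # dJ/dz for sigmoid output + binary cross-entropy
    # New w_j = old w_j - beta * delta * x_j ; new b = old b - beta * delta
    new_w = [wj - beta * delta * xj for wj, xj in zip(w, x)]
    new_b = b - beta * delta
    return new_w, new_b, y_hat

# Hypothetical starting point and one training sample with label 1
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1

w, b, first_y_hat = train_step(w, b, x, y, beta=0.1)
print(first_y_hat)  # 0.5, since all parameters start at zero
print(w, b)         # weights and bias move in the direction that raises y_hat

# Iterating the same step keeps pushing the prediction toward the label
for _ in range(100):
    w, b, y_hat = train_step(w, b, x, y, beta=0.1)
print(y_hat)
```

In a multi-layer network the same update is applied to every layer, with each layer's $ \delta $ computed from the layer after it.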








In [None]:
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load MNIST dataset
transform = transforms.ToTensor()
mnist_train = datasets.MNIST(root="data", train=True, download=True, transform=transform)
mnist_test = datasets.MNIST(root="data", train=False, download=True, transform=transform)

# Flatten each 28x28 image to a 784-vector, scale pixels to [0, 1], and move the data to the selected device
x_train = mnist_train.data.view(-1, 28 * 28).float().to(device) / 255.0
y_train = mnist_train.targets.to(device)
x_test = mnist_test.data.view(-1, 28 * 28).float().to(device) / 255.0
y_test = mnist_test.targets.to(device)





In [None]:
# Sigmoid activation function
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))

# Derivative of sigmoid function: sigma'(z) = sigma(z) * (1 - sigma(z))
def sigmoid_derivative(z):
    s = sigmoid(z)  # evaluate the sigmoid once and reuse it
    return s * (1 - s)

# Softmax function for final layer
def softmax(z):
    exp_z = torch.exp(z - torch.max(z, dim=1, keepdim=True).values)
    return exp_z / exp_z.sum(dim=1, keepdim=True)

# Cost function (Cross-Entropy Loss)
def compute_cost(y_true, y_pred):
    m = y_true.shape[0]
    log_likelihood = -torch.log(y_pred[range(m), y_true] + 1e-8)
    cost = torch.sum(log_likelihood) / m
    return cost

# Network parameters
input_size = 784    # 28x28 input images
hidden_size = 64    # Size of the hidden layer
output_size = 10    # 10 classes for digits 0-9
learning_rate = 0.1 # Learning rate for gradient descent

# Initialize small random weights and zero biases as PyTorch tensors on the selected device (GPU if available)
torch.manual_seed(42)
W1 = torch.randn(input_size, hidden_size, device=device) * 0.01
b1 = torch.zeros(1, hidden_size, device=device)
W2 = torch.randn(hidden_size, output_size, device=device) * 0.01
b2 = torch.zeros(1, output_size, device=device)

# Forward propagation
def forward_propagation(X):
    Z1 = torch.matmul(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = torch.matmul(A1, W2) + b2
    A2 = softmax(Z2)  # Softmax applied in the output layer for multiclass classification
    return Z1, A1, Z2, A2

# Backpropagation
def backpropagation(X, y, Z1, A1, Z2, A2):
    m = X.shape[0]

    # Output layer error
    dZ2 = A2.clone()
    dZ2[range(m), y] -= 1
    dW2 = torch.matmul(A1.T, dZ2) / m
    db2 = torch.sum(dZ2, dim=0, keepdim=True) / m

    # Hidden layer error
    dA1 = torch.matmul(dZ2, W2.T)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = torch.matmul(X.T, dZ1) / m
    db1 = torch.sum(dZ1, dim=0, keepdim=True) / m

    return dW1, db1, dW2, db2
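

The output-layer error in `backpropagation` (copy `A2`, then subtract 1 at the true class) relies on a standard identity: for a softmax output paired with the cross-entropy cost, the gradient with respect to the pre-activation is the predicted probability minus the one-hot label. A short derivation for a single sample follows, where $ a_k $ denotes the softmax output, $ y_k $ the one-hot label, and $ \mathbb{1}[\cdot] $ the indicator function (notation introduced here for illustration):

$$
\begin{aligned}
J &= -\sum_{k=1}^{K} y_k \ln a_k, \qquad a_k = \frac{e^{z_k}}{\sum_{i=1}^{K} e^{z_i}} \\
\frac{\partial a_k}{\partial z_i} &= a_k \left( \mathbb{1}[k = i] - a_i \right) \\
\frac{\partial J}{\partial z_i} &= -\sum_{k=1}^{K} \frac{y_k}{a_k} \, a_k \left( \mathbb{1}[k = i] - a_i \right) = -y_i + a_i \sum_{k=1}^{K} y_k = a_i - y_i
\end{aligned}
$$

The last step uses $ \sum_k y_k = 1 $ for a one-hot label. The code then divides by `m` because the cost is averaged over the batch.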

In [None]:
# Training loop: full-batch gradient descent (every epoch uses the entire training set for one update)
epochs = 5000
for epoch in range(epochs):
    # Forward propagation
    Z1, A1, Z2, A2 = forward_propagation(x_train)

    # Compute cost
    cost = compute_cost(y_train, A2)

    # Backpropagation
    dW1, db1, dW2, db2 = backpropagation(x_train, y_train, Z1, A1, Z2, A2)

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

    # Print the cost every 100 epochs
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Cost: {cost.item()}")

# Prediction function
def predict(X):
    _, _, _, A2 = forward_propagation(X)
    return torch.argmax(A2, dim=1)

# Evaluate accuracy on the test set
predictions = predict(x_test)
accuracy = torch.mean((predictions == y_test).float()).item() * 100
print(f"Test Accuracy: {accuracy:.2f}%")


Epoch 0, Cost: 2.30229115486145
Epoch 100, Cost: 2.2802984714508057
Epoch 200, Cost: 2.1132278442382812
Epoch 300, Cost: 1.6313488483428955
Epoch 400, Cost: 1.211958646774292
Epoch 500, Cost: 0.9607114791870117
Epoch 600, Cost: 0.8002830743789673
Epoch 700, Cost: 0.6926383376121521
Epoch 800, Cost: 0.6177592277526855
Epoch 900, Cost: 0.5629952549934387
Epoch 1000, Cost: 0.5209764838218689
Epoch 1100, Cost: 0.4875912666320801
Epoch 1200, Cost: 0.46046969294548035
Epoch 1300, Cost: 0.4381183087825775
Epoch 1400, Cost: 0.41949304938316345
Epoch 1500, Cost: 0.4038088321685791
Epoch 1600, Cost: 0.3904561996459961
Epoch 1700, Cost: 0.37895938754081726
Epoch 1800, Cost: 0.36894771456718445
Epoch 1900, Cost: 0.36013373732566833
Epoch 2000, Cost: 0.3522941470146179
Epoch 2100, Cost: 0.3452552855014801
Epoch 2200, Cost: 0.3388809859752655
Epoch 2300, Cost: 0.3330637812614441
Epoch 2400, Cost: 0.3277179002761841
Epoch 2500, Cost: 0.3227742910385132
Epoch 2600, Cost: 0.31817662715911865
Epoch 2700