# 🎓 Week 8 - Day 4: Neural Network from Scratch

## Today's Goals:
✅ Build a complete neural network using only NumPy

✅ Implement forward propagation

✅ Implement backpropagation

✅ Train on scikit-learn's 8×8 digits dataset (binary classification: 0 vs 1)

✅ Evaluate with accuracy and loss curves

---

**This is it!** Today we put everything together and build a real neural network!

---

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

print('✅ Libraries imported!')
print('🚀 Ready to build a neural network from scratch!')

---
## Part 1: Load and Prepare Data

**We'll use scikit-learn's `load_digits` dataset (8×8 images, a small MNIST-like set rather than MNIST itself) and classify only 0 vs 1 (binary)**

In [None]:
# Load digits dataset
digits = load_digits()
X, y = digits.data, digits.target

print('📊 Original Dataset:')
print(f'   Shape: {X.shape}')
print(f'   Classes: {np.unique(y)}')
print(f'   Total samples: {len(X)}')

In [None]:
# Filter for binary classification (0 vs 1)
mask = (y == 0) | (y == 1)
X = X[mask]
y = y[mask]

print('🎯 Binary Classification Dataset:')
print(f'   Filtered to classes: 0 and 1')
print(f'   Shape: {X.shape}')
print(f'   Samples per class:')
print(f'      Class 0: {np.sum(y == 0)}')
print(f'      Class 1: {np.sum(y == 1)}')

In [None]:
# Visualize some examples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('Sample Images from Dataset', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(8, 8), cmap='gray')
    ax.set_title(f'Label: {y[i]}', fontsize=12)
    ax.axis('off')

plt.tight_layout()
plt.show()

print('\n💡 Each image is 8×8 pixels = 64 features')

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('📊 Data Split:')
print(f'   Training: {X_train.shape[0]} samples')
print(f'   Testing: {X_test.shape[0]} samples')

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print('\n✅ Features scaled with training-set mean and std (test set reuses the same statistics)')

# Reshape y for consistency
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

print(f'\n📐 Final shapes:')
print(f'   X_train: {X_train.shape}')
print(f'   y_train: {y_train.shape}')
print(f'   X_test: {X_test.shape}')
print(f'   y_test: {y_test.shape}')

---
## Part 2: Define Activation Functions

**We need both the function and its derivative**
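
One practical caveat, sketched under assumptions: the textbook form `1 / (1 + np.exp(-z))` can trigger NumPy overflow warnings when `z` is a large negative number. The helpers below use hypothetical names (`stable_sigmoid`, `derivative_matches`) that appear nowhere else in this notebook. The first evaluates each sign branch separately, and the second compares the analytic derivative `s * (1 - s)` against a central finite difference.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in np.exp (expects an array-like input)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def derivative_matches(z, h=1e-5, tol=1e-8):
    """Compare s * (1 - s) with a central finite difference of the sigmoid."""
    s = stable_sigmoid(z)
    analytic = s * (1.0 - s)
    numeric = (stable_sigmoid(z + h) - stable_sigmoid(z - h)) / (2.0 * h)
    return bool(np.max(np.abs(analytic - numeric)) < tol)

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(z))       # the extremes saturate to 0 and 1 without warnings
print(derivative_matches(z))
```

For the standardized inputs and small weights used today, the simple version is unlikely to overflow, so treat this as an optional hardening step.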

In [None]:
def sigmoid(z):
    """
    Sigmoid activation function
    Output range: (0, 1)
    """
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """
    Derivative of sigmoid
    Used in backpropagation
    """
    s = sigmoid(z)
    return s * (1 - s)

print('✅ Sigmoid activation functions defined')
print('   • sigmoid(z): Forward pass')
print('   • sigmoid_derivative(z): Backward pass')

In [None]:
# Test activation functions
test_values = np.array([-2, -1, 0, 1, 2])
sig_output = sigmoid(test_values)
sig_deriv = sigmoid_derivative(test_values)

print('📊 Sigmoid Test:')
print(f'   Input:      {test_values}')
print(f'   Sigmoid:    {sig_output}')
print(f'   Derivative: {sig_deriv}')
print('\n💡 Notice: sigmoid(0) = 0.5, sigmoid(large positive) → 1, sigmoid(large negative) → 0')

---
## Part 3: Initialize Network Parameters

**Network Architecture: 64 → 16 → 8 → 1**
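
As a quick sanity check on this architecture, the parameter count can be worked out by hand: each layer contributes `n_in × n_out` weights plus `n_out` biases. The helper name `count_parameters` is an illustrative assumption, not part of the notebook.

```python
def count_parameters(layer_sizes):
    """Total weights + biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix of shape (n_out, n_in)
        total += n_out         # one bias per neuron
    return total

print(count_parameters([64, 16, 8, 1]))  # 1040 + 136 + 9 = 1185
```

The same helper gives a quick estimate of how much larger an alternative layout would be before you train it.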

In [None]:
# Define architecture
input_size = 64      # 8×8 pixels
hidden1_size = 16    # First hidden layer
hidden2_size = 8     # Second hidden layer
output_size = 1      # Binary classification

print('🏗️  Network Architecture:')
print(f'   Input Layer:    {input_size} neurons')
print(f'   Hidden Layer 1: {hidden1_size} neurons')
print(f'   Hidden Layer 2: {hidden2_size} neurons')
print(f'   Output Layer:   {output_size} neuron')
print(f'\n   Flow: {input_size} → {hidden1_size} → {hidden2_size} → {output_size}')

In [None]:
def initialize_parameters():
    """
    Initialize weights with small random values and biases with zeros
    """
    params = {}
    
    # Layer 1: Input → Hidden1
    params['W1'] = np.random.randn(hidden1_size, input_size) * 0.01
    params['b1'] = np.zeros((hidden1_size, 1))
    
    # Layer 2: Hidden1 → Hidden2
    params['W2'] = np.random.randn(hidden2_size, hidden1_size) * 0.01
    params['b2'] = np.zeros((hidden2_size, 1))
    
    # Layer 3: Hidden2 → Output
    params['W3'] = np.random.randn(output_size, hidden2_size) * 0.01
    params['b3'] = np.zeros((output_size, 1))
    
    return params

# Initialize
parameters = initialize_parameters()

print('✅ Parameters initialized!')
print('\n📐 Parameter shapes:')
for key, value in parameters.items():
    print(f'   {key}: {value.shape}')

total_params = sum(p.size for p in parameters.values())
print(f'\n🔢 Total parameters: {total_params:,}')

---
## Part 4: Implement Forward Propagation

**Pass data through the network layer by layer**
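
The shapes are the easiest thing to get wrong here, so a small standalone sketch may help. It assumes the convention used in this notebook: samples are stored as rows of `X`, but the network computes with `X.T`, so every activation matrix has one column per sample. The variable names (`X_demo`, `rng`) and the random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 64))           # 5 samples as rows, 64 features each

W1 = rng.normal(size=(16, 64)) * 0.01
b1 = np.zeros((16, 1))                      # column vector, broadcast across samples

Z1 = W1 @ X_demo.T + b1                     # (16, 64) @ (64, 5) -> (16, 5)
A1 = 1.0 / (1.0 + np.exp(-Z1))              # element-wise sigmoid keeps the shape

print(X_demo.T.shape, Z1.shape, A1.shape)   # (64, 5) (16, 5) (16, 5)
```

Because `b1` has shape `(16, 1)`, NumPy broadcasting adds the same bias to every sample column.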

In [None]:
def forward_propagation(X, params):
    """
    Forward pass through the network
    
    Returns:
        A3: Final output (predictions)
        cache: Dictionary of intermediate values (for backprop)
    """
    # Get parameters
    W1, b1 = params['W1'], params['b1']
    W2, b2 = params['W2'], params['b2']
    W3, b3 = params['W3'], params['b3']
    
    # Layer 1
    Z1 = np.dot(W1, X.T) + b1  # Weighted sum
    A1 = sigmoid(Z1)            # Activation
    
    # Layer 2
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    # Layer 3 (Output)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    # Store for backpropagation
    cache = {
        'Z1': Z1, 'A1': A1,
        'Z2': Z2, 'A2': A2,
        'Z3': Z3, 'A3': A3
    }
    
    return A3, cache

print('✅ Forward propagation function implemented!')
print('\n💡 It performs:')
print('   1. Layer 1: Z1 = W1·Xᵀ + b1, A1 = sigmoid(Z1)')
print('   2. Layer 2: Z2 = W2·A1 + b2, A2 = sigmoid(Z2)')
print('   3. Layer 3: Z3 = W3·A2 + b3, A3 = sigmoid(Z3)')

In [None]:
# Test forward propagation
test_output, test_cache = forward_propagation(X_train[:5], parameters)

print('🧪 Testing forward propagation on 5 samples:')
print(f'\n   Output shape: {test_output.shape}')
print(f'   Output values: {test_output.T.flatten()}')
print('\n💡 These are predicted probabilities for class 1')
print('   Values close to 1 → Predict class 1')
print('   Values close to 0 → Predict class 0')

---
## Part 5: Compute Loss Function

**Binary Cross-Entropy Loss**
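
To get a feel for the numbers, here is a small worked example with made-up labels and probabilities. The function name `bce` and the sample values are assumptions for illustration. Predicting 0.5 for every sample gives exactly ln(2), confident correct predictions give a small loss, and confident wrong predictions are punished heavily.

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 1-D arrays of labels and probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

y_true = np.array([0, 1, 1, 0])

uncertain = bce(y_true, np.full(4, 0.5))                       # equals ln(2)
confident_right = bce(y_true, np.array([0.1, 0.9, 0.9, 0.1]))
confident_wrong = bce(y_true, np.array([0.9, 0.1, 0.1, 0.9]))

print(uncertain, confident_right, confident_wrong)
```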

In [None]:
def compute_loss(A3, Y):
    """
    Binary Cross-Entropy Loss
    
    Loss = -(1/m) * Σ [y*log(ŷ) + (1-y)*log(1-ŷ)]
    """
    # Clip values to avoid log(0)
    A3 = np.clip(A3, 1e-7, 1 - 1e-7)
    
    # Y has shape (m, 1) and A3 has shape (1, m), so transpose Y to align them;
    # np.mean averages over the m samples
    loss = -np.mean(Y.T * np.log(A3) + (1 - Y.T) * np.log(1 - A3))
    
    return loss

print('✅ Loss function implemented!')
print('\n📐 Binary Cross-Entropy Loss:')
print('   Measures how wrong predictions are')
print('   Lower loss = Better predictions')
print('   Perfect predictions = Loss close to 0')

In [None]:
# Test loss computation
test_loss = compute_loss(test_output, y_train[:5])

print(f'🧪 Test loss on 5 samples: {test_loss:.4f}')
print('\n💡 With small initial weights every prediction is close to 0.5, so the loss starts near ln(2) ≈ 0.693')

---
## Part 6: Implement Backpropagation

**Calculate gradients to update weights**
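
A common way to gain confidence in hand-written gradients is a numerical gradient check: nudge each parameter slightly, measure how the loss changes, and compare with the analytic gradient. The sketch below does this for a deliberately smaller two-layer network (4 → 3 → 1) with random data, using the same column-per-sample convention as this notebook. All names here (`forward`, `backward`, `numerical_grad`) are hypothetical and separate from the functions defined in the cells.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    A1 = sigmoid(p["W1"] @ X.T + p["b1"])
    A2 = sigmoid(p["W2"] @ A1 + p["b2"])
    return A1, A2

def bce(A, Y):
    A = np.clip(A, 1e-12, 1 - 1e-12)
    return -np.mean(Y.T * np.log(A) + (1 - Y.T) * np.log(1 - A))

def backward(X, Y, p):
    """Analytic gradients of the mean loss via the chain rule."""
    m = X.shape[0]
    A1, A2 = forward(X, p)
    dZ2 = A2 - Y.T                              # sigmoid output + cross-entropy
    dZ1 = (p["W2"].T @ dZ2) * A1 * (1 - A1)     # sigmoid'(Z1) = A1 * (1 - A1)
    return {"W1": dZ1 @ X / m, "b1": dZ1.sum(axis=1, keepdims=True) / m,
            "W2": dZ2 @ A1.T / m, "b2": dZ2.sum(axis=1, keepdims=True) / m}

def numerical_grad(X, Y, p, key, eps=1e-5):
    """Central finite differences, one parameter entry at a time."""
    grad = np.zeros_like(p[key])
    for idx in np.ndindex(*p[key].shape):
        old = p[key][idx]
        p[key][idx] = old + eps
        plus = bce(forward(X, p)[1], Y)
        p[key][idx] = old - eps
        minus = bce(forward(X, p)[1], Y)
        p[key][idx] = old                       # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = rng.integers(0, 2, size=(6, 1)).astype(float)
p = {"W1": rng.normal(size=(3, 4)), "b1": np.zeros((3, 1)),
     "W2": rng.normal(size=(1, 3)), "b2": np.zeros((1, 1))}

analytic = backward(X, Y, p)
max_err = max(np.max(np.abs(analytic[k] - numerical_grad(X, Y, p, k))) for k in p)
print(f"largest gradient mismatch: {max_err:.2e}")
```

A mismatch many orders of magnitude below the gradients themselves suggests the chain-rule code is consistent with the loss. The check is slow, so it is meant for debugging, not for training.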

In [None]:
def backward_propagation(X, Y, params, cache):
    """
    Backward pass - calculate gradients
    
    Returns:
        grads: Dictionary of gradients for each parameter
    """
    m = X.shape[0]  # Number of samples
    
    # Get cached values
    A1, A2, A3 = cache['A1'], cache['A2'], cache['A3']
    Z1, Z2, Z3 = cache['Z1'], cache['Z2'], cache['Z3']
    
    # Get parameters
    W1, W2, W3 = params['W1'], params['W2'], params['W3']
    
    # Output layer gradient
    dZ3 = A3 - Y.T  # dLoss/dZ3 for sigmoid output + cross-entropy (1/m is applied below)
    dW3 = (1/m) * np.dot(dZ3, A2.T)
    db3 = (1/m) * np.sum(dZ3, axis=1, keepdims=True)
    
    # Hidden layer 2 gradient
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = dA2 * sigmoid_derivative(Z2)
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    
    # Hidden layer 1 gradient
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * sigmoid_derivative(Z1)
    dW1 = (1/m) * np.dot(dZ1, X)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    # Store gradients
    grads = {
        'dW1': dW1, 'db1': db1,
        'dW2': dW2, 'db2': db2,
        'dW3': dW3, 'db3': db3
    }
    
    return grads

print('✅ Backpropagation function implemented!')
print('\n💡 It calculates gradients for all weights and biases')
print('   Using the chain rule, working backward from output to input')

---
## Part 7: Update Parameters

**Gradient Descent: W = W - learning_rate × gradient**
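
The update rule is easier to see on a one-dimensional toy problem first. This sketch minimizes f(w) = (w − 3)², whose gradient is 2(w − 3). The function name `gradient_descent_1d` and the toy objective are illustrative assumptions.

```python
def gradient_descent_1d(grad_fn, w0, learning_rate, steps):
    """Repeat w = w - learning_rate * gradient for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = gradient_descent_1d(lambda w: 2.0 * (w - 3.0),
                              w0=0.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3
```

On this particular function, a learning rate above 1.0 makes each step overshoot by more than it corrects, so the iterates move away from 3. The same trade-off applies to the network: too small is slow, too large is unstable.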

In [None]:
def update_parameters(params, grads, learning_rate):
    """
    Update parameters using gradient descent
    """
    params['W1'] -= learning_rate * grads['dW1']
    params['b1'] -= learning_rate * grads['db1']
    params['W2'] -= learning_rate * grads['dW2']
    params['b2'] -= learning_rate * grads['db2']
    params['W3'] -= learning_rate * grads['dW3']
    params['b3'] -= learning_rate * grads['db3']
    
    return params

print('✅ Parameter update function implemented!')
print('\n📐 Formula: W_new = W_old - α × gradient')
print('   α (alpha) = learning rate (controls step size)')

---
## Part 8: Training Loop

**Put it all together and train the network!**

In [None]:
def train_network(X_train, y_train, X_test, y_test, 
                  learning_rate=0.5, epochs=1000, print_every=100):
    """
    Complete training loop
    """
    # Initialize parameters
    params = initialize_parameters()
    
    # Store history
    train_losses = []
    test_losses = []
    train_accuracies = []
    test_accuracies = []
    
    print('🚀 Starting training...')
    print(f'   Learning rate: {learning_rate}')
    print(f'   Epochs: {epochs}')
    print('\n' + '='*60)
    
    for epoch in range(epochs):
        # Forward propagation
        A3_train, cache_train = forward_propagation(X_train, params)
        
        # Backward propagation
        grads = backward_propagation(X_train, y_train, params, cache_train)
        
        # Update parameters
        params = update_parameters(params, grads, learning_rate)
        
        # Evaluate both sets with the updated parameters so that the
        # train and test curves describe the same version of the network
        A3_train, _ = forward_propagation(X_train, params)
        A3_test, _ = forward_propagation(X_test, params)
        train_loss = compute_loss(A3_train, y_train)
        test_loss = compute_loss(A3_test, y_test)
        
        # Calculate accuracy
        train_pred = (A3_train > 0.5).astype(int)
        test_pred = (A3_test > 0.5).astype(int)
        train_acc = np.mean(train_pred.T == y_train)
        test_acc = np.mean(test_pred.T == y_test)
        
        # Store history
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        train_accuracies.append(train_acc)
        test_accuracies.append(test_acc)
        
        # Print progress
        if (epoch + 1) % print_every == 0 or epoch == 0:
            print(f'Epoch {epoch+1:4d}/{epochs} | '
                  f'Train Loss: {train_loss:.4f} | '
                  f'Test Loss: {test_loss:.4f} | '
                  f'Train Acc: {train_acc:.4f} | '
                  f'Test Acc: {test_acc:.4f}')
    
    print('='*60)
    print('\n✅ Training complete!')
    
    history = {
        'train_losses': train_losses,
        'test_losses': test_losses,
        'train_accuracies': train_accuracies,
        'test_accuracies': test_accuracies
    }
    
    return params, history

print('✅ Training function ready!')

In [None]:
# Train the network!
trained_params, history = train_network(
    X_train, y_train, X_test, y_test,
    learning_rate=0.5,
    epochs=1000,
    print_every=100
)

---
## Part 9: Visualize Training Progress

**Loss and accuracy curves**

In [None]:
# Plot loss curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history['train_losses'], label='Train Loss', linewidth=2)
axes[0].plot(history['test_losses'], label='Test Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Loss Curve', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history['train_accuracies'], label='Train Accuracy', linewidth=2)
axes[1].plot(history['test_accuracies'], label='Test Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Accuracy Curve', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\n💡 What to look for:')
print('   ✅ Loss should decrease over time')
print('   ✅ Accuracy should increase over time')
print('   ✅ Train and test curves should be close (not overfitting)')

---
## Part 10: Final Evaluation

**How well did our network learn?**
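
In the from-scratch spirit of today, the metrics can also be computed by hand with NumPy. The sketch below uses made-up labels and predictions, and the helper name `binary_confusion_counts` is an assumption. For comparison, scikit-learn's `confusion_matrix` lays out a binary problem as `[[TN, FP], [FN, TP]]`, with rows for true labels and columns for predictions.

```python
import numpy as np

def binary_confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for 0/1 labels of any shape."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tp, tn, fp, fn = binary_confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted 1s, how many were right
recall = tp / (tp + fn)      # of the true 1s, how many were found

print(tp, tn, fp, fn)               # 3 3 1 1
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```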

In [None]:
# Final predictions
A3_final, _ = forward_propagation(X_test, trained_params)
final_predictions = (A3_final > 0.5).astype(int).T

# Calculate metrics
final_accuracy = np.mean(final_predictions == y_test)
final_loss = history['test_losses'][-1]

print('🎯 Final Results on Test Set:')
print('='*50)
print(f'   Final Accuracy: {final_accuracy:.4f} ({final_accuracy*100:.2f}%)')
print(f'   Final Loss: {final_loss:.4f}')
print('='*50)

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, final_predictions)

print('\n📊 Confusion Matrix:')
print(cm)
print('\n📋 Classification Report:')
print(classification_report(y_test, final_predictions, 
                          target_names=['Digit 0', 'Digit 1']))

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
fig.suptitle('Sample Predictions on Test Set', fontsize=16, fontweight='bold')

# Get indices
indices = np.random.choice(len(X_test), 10, replace=False)

for i, ax in enumerate(axes.flat):
    idx = indices[i]
    
    # Get original image (before scaling)
    image = scaler.inverse_transform(X_test[idx:idx+1]).reshape(8, 8)
    
    true_label = y_test[idx][0]
    pred_label = final_predictions[idx][0]
    prob = A3_final.T[idx][0]
    
    # Plot
    ax.imshow(image, cmap='gray')
    
    # Title with color
    color = 'green' if true_label == pred_label else 'red'
    ax.set_title(f'True: {true_label} | Pred: {pred_label}\nProb: {prob:.2f}', 
                fontsize=10, color=color)
    ax.axis('off')

plt.tight_layout()
plt.show()

print('\n💡 Green = Correct, Red = Incorrect')

---
## 🎯 Challenge: Improve the Network!

**Try these modifications to improve performance:**

### Easy Challenges:
1. **Change learning rate**
   - Try: 0.1, 0.3, 0.7, 1.0
   - See which works best!

2. **Train for more epochs**
   - Try: 1500 or 2000 epochs
   - Does it improve?

3. **Modify network size**
   - Try: 64 → 32 → 16 → 1
   - Or: 64 → 20 → 10 → 1

### Medium Challenges:
4. **Add ReLU activation**
   - Use ReLU for hidden layers
   - Keep sigmoid for output

5. **Try different initialization**
   - Use larger initial weights (0.1 instead of 0.01)
   - See the effect!
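
If you want a starting point for the medium challenges, here is a minimal sketch of ReLU and its derivative. It also includes `he_init`, which scales weights by `sqrt(2 / fan_in)`. That scaling (He initialization) is a common choice for ReLU layers, but it is swapped in here as a suggestion and is not something this notebook uses. All three function names are hypothetical.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """1 where z > 0, else 0 (the value at exactly 0 is set to 0 by convention)."""
    return (z > 0).astype(float)

def he_init(fan_out, fan_in, rng):
    """Random weights scaled by sqrt(2 / fan_in), often paired with ReLU."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

z = np.array([-1.0, 0.0, 2.0])
print(relu(z))             # [0. 0. 2.]
print(relu_derivative(z))  # [0. 0. 1.]
print(he_init(16, 64, np.random.default_rng(0)).shape)  # (16, 64)
```

To use it, the hidden layers would call `relu` in the forward pass and `relu_derivative` in the backward pass, while the output layer keeps sigmoid so the cross-entropy gradient stays `A3 - Y.T`.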

### Hard Challenges:
6. **Implement mini-batch gradient descent**
   - Instead of using all data at once
   - Use batches of 32 samples

7. **Add learning rate decay**
   - Start with large learning rate
   - Gradually decrease it

8. **Classify more digits**
   - Try 0, 1, 2 (3 classes)
   - Need to change output layer!
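
The hard challenges need a few building blocks that the notebook does not provide. The sketch below offers one possible version of each: a shuffled mini-batch iterator, an inverse-time learning rate schedule, and a column-wise softmax for a multi-class output layer. The names, the decay formula, and the demo data are all assumptions, and wiring them into `train_network` is left as the exercise.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover every sample once."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = order[start:start + batch_size]
        yield X[batch], y[batch]

def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Inverse-time decay: the step size shrinks as the epoch index grows."""
    return initial_lr / (1.0 + decay_rate * epoch)

def softmax(Z):
    """Column-wise softmax for Z of shape (n_classes, n_samples)."""
    shifted = Z - Z.max(axis=0, keepdims=True)   # subtract the max for stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(70, 64))
y_demo = rng.integers(0, 2, size=(70, 1))

sizes = [xb.shape[0] for xb, _ in iterate_minibatches(X_demo, y_demo, 32, rng)]
print(sizes)                                   # [32, 32, 6]
print(decayed_learning_rate(0.5, 0.01, 100))   # 0.25
print(softmax(rng.normal(size=(3, 4))).sum(axis=0))  # each column sums to 1
```

With mini-batches, one epoch becomes several parameter updates (one per batch) instead of a single update on the full training set. For three classes, the labels would also need one-hot encoding and the loss would become categorical cross-entropy.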

---

**Experiment below! Copy the training code and modify it:**

In [None]:
# Your experiments here!
# Try different hyperparameters and modifications

# Example: Different learning rate
# trained_params_exp, history_exp = train_network(
#     X_train, y_train, X_test, y_test,
#     learning_rate=0.3,  # Changed!
#     epochs=1000,
#     print_every=100
# )

---
## 📚 Summary

### What We Built:

**Complete Neural Network from Scratch!**
- Using only NumPy (no deep learning frameworks)
- 3-layer architecture (2 hidden + 1 output)
- Binary classification (0 vs 1) on scikit-learn's 8×8 digits dataset

### Components Implemented:

**1. Forward Propagation:**
- Layer-by-layer computation
- z = W·x + b
- a = sigmoid(z)

**2. Loss Function:**
- Binary Cross-Entropy
- Measures prediction error

**3. Backpropagation:**
- Calculate gradients using chain rule
- Work backward through layers
- Find how to improve each weight

**4. Gradient Descent:**
- Update weights: W = W - α × gradient
- Iteratively improve the network

**5. Training Loop:**
- Forward → Loss → Backward → Update
- Repeat until convergence

### 🎯 Key Takeaways:

✅ **Neural networks learn through iteration**
- Each epoch improves predictions slightly
- Loss decreases, accuracy increases

✅ **Hyperparameters matter**
- Learning rate controls convergence
- Network size affects capacity
- Initialization impacts training

✅ **Math becomes code**
- Forward prop = matrix multiplication
- Backprop = chain rule in action
- Gradient descent = simple subtraction

✅ **You now understand the core mechanics of deep learning**
- Frameworks like PyTorch do this automatically
- But now you know what happens under the hood!

### 💡 What's Next:

- **Week 9:** Advanced optimizers (Adam, RMSprop)
- **Week 9:** Introduction to PyTorch
- **Week 10:** Convolutional Neural Networks
- **Week 11:** Transfer Learning

---

**Congratulations! You built a neural network from scratch! 🎉🎊**

**You are now a deep learning practitioner!** 🚀