# Lab 1: Introduction to Neural Networks

Welcome to the exciting world of neural networks! In this lab, we'll build neural networks from scratch to understand how they work, then use modern frameworks for practical applications.

## Learning Objectives

By the end of this lab, you will:
- Understand the mathematical foundations of neural networks
- Implement forward propagation from scratch
- Implement backpropagation from scratch
- Build multi-layer perceptrons (MLPs)
- Use activation functions effectively
- Train neural networks with gradient descent
- Use Keras and PyTorch for practical applications
- Classify handwritten digits (MNIST)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Tuple
import pandas as pd

# Deep learning frameworks
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set style and seeds
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available (TF): {len(tf.config.list_physical_devices('GPU')) > 0}")
print(f"GPU available (PyTorch): {torch.cuda.is_available()}")

## Part 1: The Artificial Neuron

An artificial neuron is inspired by biological neurons. It:
1. Receives inputs $x_1, x_2, ..., x_n$
2. Applies weights $w_1, w_2, ..., w_n$ and bias $b$
3. Computes weighted sum: $z = \sum_{i} w_i x_i + b$
4. Applies activation function: $a = \sigma(z)$
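
As a minimal sketch of the four steps above, the following computes the output of a single sigmoid neuron. The input, weight, and bias values are made up for illustration.

```python
import numpy as np

# Hypothetical inputs, weights, and bias for one neuron with three inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

# Step 3: weighted sum z = sum_i w_i * x_i + b
z = np.dot(w, x) + b

# Step 4: sigmoid activation a = 1 / (1 + exp(-z))
a = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.4f}, a = {a:.4f}")
```

Swapping the last step for `np.tanh(z)` or `max(0.0, z)` gives the same neuron with a different activation.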

### Common Activation Functions

**Sigmoid:** $\sigma(z) = \frac{1}{1 + e^{-z}}$ (outputs 0 to 1)

**Tanh:** $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ (outputs -1 to 1)

**ReLU:** $\text{ReLU}(z) = \max(0, z)$ (most popular for hidden layers)

**Softmax:** $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (for multi-class output)
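
To make the softmax formula concrete, here is a small sketch that evaluates it on a made-up score vector and checks two properties that follow from the definition: the outputs sum to 1, and adding the same constant to every score leaves the result unchanged. The second property is why implementations commonly subtract the maximum score before exponentiating, which avoids overflow. The helper name `softmax_1d` is used only for this illustration.

```python
import numpy as np

def softmax_1d(z):
    """Softmax of a 1-D score vector, shifted by max(z) for numerical stability."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

scores = np.array([1.0, 2.0, 3.0])  # made-up scores for three classes
probs = softmax_1d(scores)
print(probs)
print(probs.sum())

# Shifting all scores by a constant does not change the probabilities
shifted_probs = softmax_1d(scores + 1000.0)
print(np.allclose(probs, shifted_probs))
```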

In [None]:
# Activation functions
class ActivationFunctions:
    @staticmethod
    def sigmoid(z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    @staticmethod
    def sigmoid_derivative(z):
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    @staticmethod
    def tanh(z):
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        return 1 - np.tanh(z) ** 2
    
    @staticmethod
    def relu(z):
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        return (z > 0).astype(float)
    
    @staticmethod
    def softmax(z):
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Visualize activation functions
z = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sigmoid
axes[0, 0].plot(z, ActivationFunctions.sigmoid(z), linewidth=2, label='sigmoid')
axes[0, 0].plot(z, ActivationFunctions.sigmoid_derivative(z), linewidth=2, label='derivative')
axes[0, 0].set_title('Sigmoid')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(z, ActivationFunctions.tanh(z), linewidth=2, label='tanh')
axes[0, 1].plot(z, ActivationFunctions.tanh_derivative(z), linewidth=2, label='derivative')
axes[0, 1].set_title('Tanh')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# ReLU
axes[1, 0].plot(z, ActivationFunctions.relu(z), linewidth=2, label='ReLU')
axes[1, 0].plot(z, ActivationFunctions.relu_derivative(z), linewidth=2, label='derivative')
axes[1, 0].set_title('ReLU')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Comparison
axes[1, 1].plot(z, ActivationFunctions.sigmoid(z), linewidth=2, label='Sigmoid')
axes[1, 1].plot(z, ActivationFunctions.tanh(z), linewidth=2, label='Tanh')
axes[1, 1].plot(z, ActivationFunctions.relu(z), linewidth=2, label='ReLU')
axes[1, 1].set_title('Comparison')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Properties:")
print("- Sigmoid: Smooth, bounded in (0, 1), prone to vanishing gradients")
print("- Tanh: Smooth, bounded in (-1, 1), zero-centered")
print("- ReLU: Simple, no upper bound, dead ReLU problem possible")

## Part 2: Simple Neural Network from Scratch

Let's build a 2-layer neural network (1 hidden layer) from scratch.

### Architecture:
- Input layer: $n$ features
- Hidden layer: $h$ neurons with ReLU activation
- Output layer: $k$ neurons with softmax activation

### Forward Propagation:
1. $Z^{[1]} = W^{[1]} X + b^{[1]}$
2. $A^{[1]} = \text{ReLU}(Z^{[1]})$
3. $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
4. $A^{[2]} = \text{softmax}(Z^{[2]})$
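
The shapes in these four steps are easy to get wrong, so here is a shape-checking sketch. It assumes the column-per-example convention ($X$ has shape $(n, m)$); the sizes and the random weights are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k, m = 4, 5, 3, 8  # features, hidden units, classes, examples (arbitrary)

X = rng.normal(size=(n, m))            # one column per example
W1 = rng.normal(size=(h, n))
b1 = np.zeros((h, 1))
W2 = rng.normal(size=(k, h))
b2 = np.zeros((k, 1))

Z1 = W1 @ X + b1                       # (h, m); b1 broadcasts across columns
A1 = np.maximum(0, Z1)                 # ReLU, still (h, m)
Z2 = W2 @ A1 + b2                      # (k, m)

# Softmax over the class axis (axis 0), shifted for numerical stability
exp_Z2 = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z2 / exp_Z2.sum(axis=0, keepdims=True)

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
print(A2.sum(axis=0))                  # each column sums to 1
```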

### Loss Function (Cross-Entropy):
$$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} y_j^{(i)} \log(\hat{y}_j^{(i)})$$
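
As a small worked example of this formula, the sketch below computes the loss for two made-up examples with three classes. Because the labels are one-hot, only the predicted probability of the true class contributes to each example's term.

```python
import numpy as np

# Columns are examples, rows are classes (made-up values)
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])                  # true classes: 0 and 1
Y_hat = np.array([[0.7, 0.2],
                  [0.2, 0.5],
                  [0.1, 0.3]])          # predicted probabilities, columns sum to 1

m = Y.shape[1]
loss = -np.sum(Y * np.log(Y_hat)) / m

# Equivalent: average of -log(probability assigned to the true class)
manual = -(np.log(0.7) + np.log(0.5)) / 2
print(f"loss = {loss:.4f}, manual = {manual:.4f}")
```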

In [None]:
class NeuralNetwork:
    """
    Simple feedforward neural network from scratch.
    """
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int,
                learning_rate: float = 0.01):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.learning_rate = learning_rate
        
        # Initialize weights (He initialization for ReLU)
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((output_size, 1))
        
        self.loss_history = []
    
    def forward(self, X):
        """
        Forward propagation.
        X shape: (input_size, m) where m is batch size
        """
        # Layer 1
        self.Z1 = self.W1.dot(X) + self.b1
        self.A1 = ActivationFunctions.relu(self.Z1)
        
        # Layer 2
        self.Z2 = self.W2.dot(self.A1) + self.b2
        self.A2 = ActivationFunctions.softmax(self.Z2.T).T
        
        return self.A2
    
    def compute_loss(self, Y, A2):
        """
        Compute cross-entropy loss.
        """
        m = Y.shape[1]
        loss = -np.sum(Y * np.log(A2 + 1e-8)) / m
        return loss
    
    def backward(self, X, Y):
        """
        Backward propagation.
        """
        m = X.shape[1]
        
        # Output layer gradients
        dZ2 = self.A2 - Y
        dW2 = (1/m) * dZ2.dot(self.A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        
        # Hidden layer gradients
        dA1 = self.W2.T.dot(dZ2)
        dZ1 = dA1 * ActivationFunctions.relu_derivative(self.Z1)
        dW1 = (1/m) * dZ1.dot(X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
        
        return dW1, db1, dW2, db2
    
    def update_parameters(self, dW1, db1, dW2, db2):
        """
        Update weights using gradient descent.
        """
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
    
    def train(self, X, Y, epochs: int = 1000, print_every: int = 100):
        """
        Train the network.
        """
        for epoch in range(epochs):
            # Forward pass
            A2 = self.forward(X)
            
            # Compute loss
            loss = self.compute_loss(Y, A2)
            self.loss_history.append(loss)
            
            # Backward pass
            dW1, db1, dW2, db2 = self.backward(X, Y)
            
            # Update parameters
            self.update_parameters(dW1, db1, dW2, db2)
            
            if (epoch + 1) % print_every == 0:
                predictions = np.argmax(A2, axis=0)
                labels = np.argmax(Y, axis=0)
                accuracy = np.mean(predictions == labels)
                print(f"Epoch {epoch+1}/{epochs} - Loss: {loss:.4f}, Accuracy: {accuracy:.4f}")
    
    def predict(self, X):
        """
        Make predictions.
        """
        A2 = self.forward(X)
        return np.argmax(A2, axis=0)
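
The `np.sqrt(2.0 / input_size)` factor in `__init__` is He initialization. The sketch below is one way to see its effect, under the assumption of inputs with mean square close to 1: when weights are drawn with variance `2 / fan_in`, the mean square of the ReLU activations stays close to 1 as well, so the signal neither blows up nor fades from one layer to the next. The sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out, m = 512, 256, 2000    # arbitrary sizes for illustration

X = rng.normal(size=(fan_in, m))       # inputs with mean square close to 1
W = rng.normal(size=(fan_out, fan_in)) * np.sqrt(2.0 / fan_in)  # He scaling

A = np.maximum(0, W @ X)               # ReLU activations

print(f"mean square of inputs:      {np.mean(X**2):.3f}")
print(f"mean square of activations: {np.mean(A**2):.3f}")
```

Replacing the scale with a fixed value such as `0.01` or `1.0` makes the second number shrink or grow sharply, which is the instability that careful initialization avoids.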

In [None]:
# Test on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                          n_redundant=5, n_classes=3, random_state=42)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to one-hot encoding
def one_hot_encode(y, num_classes):
    m = len(y)
    one_hot = np.zeros((num_classes, m))
    one_hot[y, np.arange(m)] = 1
    return one_hot

Y_train = one_hot_encode(y_train, 3)
Y_test = one_hot_encode(y_test, 3)

# Transpose for our implementation
X_train_T = X_train.T
X_test_T = X_test.T

# Create and train network
nn = NeuralNetwork(input_size=20, hidden_size=64, output_size=3, learning_rate=0.1)
nn.train(X_train_T, Y_train, epochs=1000, print_every=200)

# Evaluate
predictions = nn.predict(X_test_T)
accuracy = np.mean(predictions == y_test)
print(f"\nTest Accuracy: {accuracy:.4f}")

# Plot loss
plt.figure(figsize=(10, 6))
plt.plot(nn.loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

## Part 3: Understanding Backpropagation

Backpropagation applies the chain rule layer by layer, reusing values cached during the forward pass so that each gradient is computed only once.

### Chain Rule Example:
If $L = f(g(h(x)))$, write $h = h(x)$, $g = g(h)$, and $L = f(g)$. Then:
$$\frac{dL}{dx} = \frac{dL}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
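
As a quick numeric check of the chain rule, the sketch below picks three simple functions (an arbitrary choice for illustration), multiplies their local derivatives, and compares the product with a central finite-difference estimate.

```python
import math

# Arbitrary example functions: h(x) = x^2, g(u) = sin(u), f(v) = exp(v)
def h(x):
    return x ** 2

def g(u):
    return math.sin(u)

def f(v):
    return math.exp(v)

def L(x):
    return f(g(h(x)))

x = 0.8

# Local derivatives, each evaluated at the value flowing into that function
dh_dx = 2 * x
dg_dh = math.cos(h(x))
df_dg = math.exp(g(h(x)))
analytic = df_dg * dg_dh * dh_dx

# Central finite difference as an independent check
eps = 1e-6
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)

print(f"analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```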

### For Our Network:
We compute gradients layer by layer, moving backward:
1. Output layer: $\frac{\partial L}{\partial W^{[2]}}$, $\frac{\partial L}{\partial b^{[2]}}$
2. Hidden layer: $\frac{\partial L}{\partial W^{[1]}}$, $\frac{\partial L}{\partial b^{[1]}}$
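
The output-layer step in the from-scratch network relies on the identity $\partial L / \partial Z^{[2]} = (A^{[2]} - Y)/m$ for softmax combined with cross-entropy (the `backward` method computes `dZ2 = self.A2 - Y` and applies the `1/m` factor when forming `dW2` and `db2`). The sketch below checks this identity numerically with finite differences on random made-up data; the helper names are hypothetical.

```python
import numpy as np

def softmax_cols(Z):
    """Column-wise softmax (one column per example)."""
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(Z, Y):
    """Mean cross-entropy of softmax(Z) against one-hot labels Y."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(softmax_cols(Z))) / m

rng = np.random.default_rng(1)
k, m = 3, 4                                    # classes, examples (arbitrary)
Z = rng.normal(size=(k, m))
labels = rng.integers(0, k, size=m)
Y = np.zeros((k, m))
Y[labels, np.arange(m)] = 1                    # one-hot, one column per example

analytic = (softmax_cols(Z) - Y) / m           # claimed gradient dL/dZ

# Central finite differences, perturbing one entry of Z at a time
numeric = np.zeros_like(Z)
eps = 1e-6
for i in range(k):
    for j in range(m):
        Z_plus, Z_minus = Z.copy(), Z.copy()
        Z_plus[i, j] += eps
        Z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(Z_plus, Y) - cross_entropy(Z_minus, Y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))      # should be tiny
```

The same finite-difference loop is a common way to sanity-check any hand-written backward pass before trusting it in training.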

In [None]:
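# Inspect weight and activation magnitudes (a rough proxy for signal flow)
def visualize_gradient_flow(nn, X_sample):
    """
    Plot the mean absolute weight and activation per layer.

    Note: despite the name, this plots magnitudes from a forward pass,
    not the gradients themselves. `nn` is a NeuralNetwork instance.
    """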
    # Forward pass
    nn.forward(X_sample)
    
    # Get weight magnitudes
    W1_mag = np.abs(nn.W1).mean()
    W2_mag = np.abs(nn.W2).mean()
    
    # Get activation magnitudes
    A1_mag = np.abs(nn.A1).mean()
    A2_mag = np.abs(nn.A2).mean()
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Weight magnitudes
    axes[0].bar(['W1', 'W2'], [W1_mag, W2_mag])
    axes[0].set_ylabel('Mean Absolute Weight')
    axes[0].set_title('Weight Magnitudes')
    axes[0].grid(True, alpha=0.3, axis='y')
    
    # Activation magnitudes
    axes[1].bar(['Input', 'Hidden (A1)', 'Output (A2)'], 
               [np.abs(X_sample).mean(), A1_mag, A2_mag])
    axes[1].set_ylabel('Mean Absolute Activation')
    axes[1].set_title('Activation Magnitudes')
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

visualize_gradient_flow(nn, X_train_T[:, :10])

## Part 4: MNIST with Keras

Now let's use Keras to build a neural network for the classic MNIST handwritten digit dataset.

In [None]:
# Load MNIST
(X_train_mnist, y_train_mnist), (X_test_mnist, y_test_mnist) = keras.datasets.mnist.load_data()

print(f"Training data shape: {X_train_mnist.shape}")
print(f"Test data shape: {X_test_mnist.shape}")

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(X_train_mnist[i], cmap='gray')
    ax.set_title(f"Label: {y_train_mnist[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Preprocess data
X_train_mnist = X_train_mnist.reshape(-1, 784) / 255.0  # Flatten and normalize
X_test_mnist = X_test_mnist.reshape(-1, 784) / 255.0

# Build model with Keras
model_keras = keras.Sequential([
    keras.Input(shape=(784,)),  # explicit input layer; recent Keras versions warn on input_shape=
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile
model_keras.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Print model summary
model_keras.summary()
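
The parameter counts reported by `model_keras.summary()` can be reproduced by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Trainable parameters in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

layer_sizes = [784, 128, 64, 10]  # input, two hidden layers, output

counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts)       # [100480, 8256, 650]
print(sum(counts))  # 109386
```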

In [None]:
# Train model
history = model_keras.fit(
    X_train_mnist, y_train_mnist,
    batch_size=128,
    epochs=10,
    validation_split=0.1,
    verbose=1
)

# Evaluate
test_loss, test_acc = model_keras.evaluate(X_test_mnist, y_test_mnist, verbose=0)
print(f"\nTest Accuracy: {test_acc:.4f}")

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history.history['loss'], label='Training')
axes[0].plot(history.history['val_loss'], label='Validation')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Over Time')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history.history['accuracy'], label='Training')
axes[1].plot(history.history['val_accuracy'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Over Time')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Visualize predictions
predictions = model_keras.predict(X_test_mnist[:20])
predicted_labels = np.argmax(predictions, axis=1)

fig, axes = plt.subplots(2, 10, figsize=(15, 3))
for i in range(20):
    ax = axes[i // 10, i % 10]
    ax.imshow(X_test_mnist[i].reshape(28, 28), cmap='gray')
    color = 'green' if predicted_labels[i] == y_test_mnist[i] else 'red'
    ax.set_title(f"P:{predicted_labels[i]}\nA:{y_test_mnist[i]}", color=color, fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()

print("Green = Correct, Red = Incorrect")

## Part 5: MNIST with PyTorch

Now let's implement the same network in PyTorch for comparison.

In [None]:
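# Define model in PyTorch
# Note: the name `nn` was rebound to the from-scratch NeuralNetwork instance in Part 2,
# so re-import torch.nn here before using nn.Module.
import torch.nn as nn

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()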
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # No softmax (included in loss)
        return x

# Create model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_torch = MNISTNet().to(device)

print(model_torch)
print(f"\nTraining on: {device}")
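
The comment `No softmax (included in loss)` refers to the fact that `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The NumPy sketch below illustrates the equivalent arithmetic on made-up logits. It is not PyTorch's implementation, only a demonstration that computing the loss directly from logits gives the same value as applying softmax first and then taking the negative log of the true-class probability.

```python
import numpy as np

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])    # made-up scores: 2 examples x 3 classes
targets = np.array([0, 2])              # integer class labels
rows = np.arange(len(targets))

# Route 1: softmax, then negative log-probability of the true class
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
loss_softmax = -np.log(probs[rows, targets]).mean()

# Route 2: directly from logits, log-sum-exp minus the true-class logit
log_sum_exp = np.log(np.exp(shifted).sum(axis=1)) + logits.max(axis=1)
loss_logits = (log_sum_exp - logits[rows, targets]).mean()

print(loss_softmax, loss_logits)
```

Note that this block uses one row per example, matching the PyTorch convention, rather than the column-per-example layout of the from-scratch network.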

In [None]:
# Prepare data for PyTorch
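# torch.tensor with an explicit dtype avoids relying on the legacy typed constructors
X_train_torch = torch.tensor(X_train_mnist, dtype=torch.float32)
y_train_torch = torch.tensor(y_train_mnist, dtype=torch.long)
X_test_torch = torch.tensor(X_test_mnist, dtype=torch.float32)
y_test_torch = torch.tensor(y_test_mnist, dtype=torch.long)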

# Create data loaders
train_dataset = TensorDataset(X_train_torch, y_train_torch)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_torch.parameters(), lr=0.001)

# Training loop
epochs = 10
train_losses = []
train_accuracies = []

for epoch in range(epochs):
    model_torch.train()
    epoch_loss = 0
    correct = 0
    total = 0
    
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        
        # Forward pass
        outputs = model_torch(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track metrics
        epoch_loss += loss.item()
        predicted = outputs.argmax(dim=1)
        total += batch_y.size(0)
        correct += (predicted == batch_y).sum().item()
    
    avg_loss = epoch_loss / len(train_loader)
    accuracy = correct / total
    train_losses.append(avg_loss)
    train_accuracies.append(accuracy)
    
    print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")

# Evaluate
model_torch.eval()
with torch.no_grad():
    X_test_device = X_test_torch.to(device)
    y_test_device = y_test_torch.to(device)
    outputs = model_torch(X_test_device)
    predicted = outputs.argmax(dim=1)
    test_accuracy = (predicted == y_test_device).sum().item() / y_test_device.size(0)
    print(f"\nTest Accuracy: {test_accuracy:.4f}")

In [None]:
# Plot PyTorch training
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(train_losses)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('PyTorch Training Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(train_accuracies)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('PyTorch Training Accuracy')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Key Takeaways

1. **Neural networks** are composed of layers of artificial neurons
2. **Forward propagation** computes predictions layer by layer
3. **Backpropagation** efficiently computes gradients using the chain rule
4. **Activation functions** introduce non-linearity (ReLU is the most common choice for hidden layers)
5. **Weight initialization** matters for training stability
6. **Learning rate** controls update step size
7. **Keras** provides high-level API for quick prototyping
8. **PyTorch** offers more flexibility and control
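9. Both frameworks support **GPU acceleration**: Keras uses an available GPU automatically, while PyTorch requires moving the model and data to the device with `.to(device)`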
10. **MNIST** is a great starting point for image classification
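
To illustrate takeaway 6, here is a toy sketch of gradient descent on $f(w) = w^2$, whose gradient is $2w$. The function and the three learning rates are arbitrary choices for illustration: a small rate converges slowly, a moderate rate converges quickly, and a rate that is too large overshoots and diverges.

```python
def gradient_descent(lr: float, w0: float = 1.0, steps: int = 20) -> float:
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w -= lr * grad
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")
```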

## Exercises

1. **Activation Comparison**: Train networks with different activation functions (sigmoid, tanh, ReLU) and compare their training loss and test accuracy
2. **Architecture Search**: Try different numbers of layers and neurons
3. **Learning Rate**: Experiment with different learning rates (0.001, 0.01, 0.1, 1.0)
4. **Batch Size**: Compare training with different batch sizes
5. **Regularization**: Add L2 regularization to prevent overfitting
6. **Dropout**: Implement dropout in both frameworks
7. **Other Datasets**: Apply these techniques to Fashion-MNIST or CIFAR-10
8. **Visualization**: Visualize learned weights in the first layer

## Next Steps

In Lab 2, we'll explore:
- Advanced optimization techniques
- Batch normalization
- Dropout and regularization
- Deep network architectures
- Training best practices

Great job! You've built your first neural networks from scratch and with modern frameworks.