# Introduction to Neural Networks

## Week 7: From Perceptrons to Multi-Layer Networks

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand** the biological inspiration behind neural networks
2. **Explain** the structure and operation of a single perceptron
3. **Identify** common activation functions and their properties
4. **Build** a Multi-Layer Perceptron (MLP) using scikit-learn
5. **Understand** the basics of forward and backward propagation
6. **Visualize** decision boundaries of neural networks
7. **Tune** hyperparameters for better performance

## 1. Introduction

### 1.1 What is a Neural Network?

A **neural network** is a computational model loosely inspired by biological neurons in the brain. It consists of interconnected nodes (neurons) organized in layers: an input layer, one or more hidden layers, and an output layer.

### 1.2 Brief History

| Year | Milestone |
|------|----------|
| 1943 | McCulloch & Pitts propose the first mathematical model of an artificial neuron |
| 1958 | Rosenblatt introduces the Perceptron |
| 1969 | Minsky & Papert show the limitations of single-layer perceptrons (XOR problem) |
| 1986 | Backpropagation popularized by Rumelhart et al. |
| 2012+ | Deep learning revolution (AlexNet on ImageNet, later GPT, etc.) |

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import train_test_split, learning_curve, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.linear_model import LogisticRegression

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

---

## 2. The Perceptron

### 2.1 Structure

A **perceptron** is the simplest neural network unit:

```
         x₁ ----w₁----\
                        \
         x₂ ----w₂------>  Σ  ---> σ(z) ---> ŷ
                        /
         xₙ ----wₙ----/
                  ↑
                 +b (bias)
```

### 2.2 Mathematical Model

#### Linear Combination (Weighted Sum)
$$z = \sum_{i=1}^{n} w_i x_i + b = w^T x + b$$

where:
- $x_i$ = input features
- $w_i$ = weights (learnable parameters)
- $b$ = bias term

#### Activation Function
$$\hat{y} = \sigma(z)$$

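The activation function $\sigma$ maps the weighted sum $z$ to the output. The classic perceptron uses a step function (output 1 if $z \geq 0$, otherwise 0); smooth non-linear activations, which introduce **non-linearity** into multi-layer models, are covered in Section 3.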

### 2.3 Perceptron Learning Rule

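For binary classification with labels $y \in \{0, 1\}$, each training sample updates the weights and the bias:
$$w_{i}^{(t+1)} = w_{i}^{(t)} + \eta (y - \hat{y}) x_i$$
$$b^{(t+1)} = b^{(t)} + \eta (y - \hat{y})$$

where $\eta$ is the learning rate. When the prediction is correct, $y - \hat{y} = 0$ and the parameters are left unchanged.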
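
As a quick check of the rule above, the sketch below applies a single update by hand. The input, label, and learning rate are made-up illustration values, and the bias is updated with the same error term as the weights.

```python
import numpy as np

# Illustrative values (assumptions, not taken from a dataset)
eta = 0.1                      # learning rate
w = np.array([0.0, 0.0])       # initial weights
b = 0.0                        # initial bias
x = np.array([2.0, -1.0])      # one training input
y = 0                          # its true label

# Forward pass with a step activation
z = np.dot(w, x) + b           # 0.0
y_hat = 1 if z >= 0 else 0     # step(0.0) = 1, so the sample is misclassified

# Perceptron update: w <- w + eta * (y - y_hat) * x, b <- b + eta * (y - y_hat)
error = y - y_hat              # -1
w_new = w + eta * error * x    # [-0.2, 0.1]
b_new = b + eta * error        # -0.1

# After the update, the same input falls on the correct side of the boundary
z_new = np.dot(w_new, x) + b_new   # -0.6
y_hat_new = 1 if z_new >= 0 else 0
print(w_new, b_new, y_hat_new)
```

One update is enough here only because the example is tiny; on a real dataset the rule is applied repeatedly over all samples.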

In [None]:
# Simple perceptron implementation for educational purposes
class SimplePerceptron:
    """A simple perceptron for binary classification."""
    
    def __init__(self, learning_rate=0.01, n_iterations=100):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.errors_ = []  # Track errors during training
    
    def _step_function(self, z):
        """Step activation function."""
        return np.where(z >= 0, 1, 0)
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Training loop
        for _ in range(self.n_iter):
            errors = 0
            for xi, yi in zip(X, y):
                # Forward pass
                z = np.dot(xi, self.weights) + self.bias
                y_pred = self._step_function(z)
                
                # Update weights
                update = self.lr * (yi - y_pred)
                self.weights += update * xi
                self.bias += update
                
                errors += int(yi != y_pred)
            
            self.errors_.append(errors)
        
        return self
    
    def predict(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self._step_function(z)

# Create linearly separable data
from sklearn.datasets import make_blobs
X_simple, y_simple = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)

# Train perceptron
perceptron = SimplePerceptron(learning_rate=0.01, n_iterations=50)
perceptron.fit(X_simple, y_simple)

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
ax1 = axes[0]
ax1.scatter(X_simple[y_simple==0, 0], X_simple[y_simple==0, 1], label='Class 0', alpha=0.7, s=50)
ax1.scatter(X_simple[y_simple==1, 0], X_simple[y_simple==1, 1], label='Class 1', alpha=0.7, s=50)

# Plot decision boundary: w0*x + w1*y + b = 0  =>  y = -(w0*x + b) / w1
# (assumes perceptron.weights[1] != 0)
x_min, x_max = X_simple[:, 0].min() - 1, X_simple[:, 0].max() + 1
x_line = np.linspace(x_min, x_max, 100)
y_boundary = -(perceptron.weights[0] * x_line + perceptron.bias) / perceptron.weights[1]
ax1.plot(x_line, y_boundary, 'r--', linewidth=2, label='Decision Boundary')

ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_title('Perceptron Decision Boundary')
ax1.legend()

# Training errors
ax2 = axes[1]
ax2.plot(range(1, len(perceptron.errors_) + 1), perceptron.errors_, marker='o')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Number of Misclassifications')
ax2.set_title('Perceptron Training Progress')

plt.tight_layout()
plt.show()

print(f"Final weights: {perceptron.weights}")
print(f"Final bias: {perceptron.bias}")
print(f"Training accuracy: {accuracy_score(y_simple, perceptron.predict(X_simple))*100:.1f}%")

---

## 3. Activation Functions

### 3.1 Why Non-linear Activation?

Without non-linear activation functions, a neural network with multiple layers would still be equivalent to a single linear transformation. Non-linearity allows networks to learn complex patterns.
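
The collapse of stacked linear layers can be checked numerically. The sketch below is a minimal illustration with arbitrary layer sizes and random weights (all names and shapes are assumptions chosen for the demo): two layers without an activation give the same output as one layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function (shapes are arbitrary illustration choices)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Stacked linear layers: W2 (W1 x + b1) + b2
two_layer_out = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: (W2 W1) x + (W2 b1 + b2)
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_out = W_combined @ x + b_combined

print(np.allclose(two_layer_out, one_layer_out))  # True
```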

### 3.2 Common Activation Functions

#### Sigmoid (Logistic)
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Properties**: Output range (0, 1) and smooth everywhere, but it saturates for large $|z|$, where the gradient approaches zero (the vanishing gradient problem).

#### Hyperbolic Tangent (tanh)
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

**Properties**: Output range (-1, 1) and zero-centered, but it also saturates for large $|z|$ and so still suffers from vanishing gradients.

#### Rectified Linear Unit (ReLU)
$$\text{ReLU}(z) = \max(0, z)$$

**Properties**: Cheap to compute and does not saturate for positive inputs, but a unit whose input stays negative outputs zero with zero gradient and stops learning (the "dying ReLU" problem).

#### Leaky ReLU
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

**Properties**: Addresses the dying ReLU problem by using a small slope $\alpha$ for negative inputs, so the gradient is never exactly zero.

#### Softmax (for multi-class output)
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

**Properties**: Outputs sum to 1, used for probability distribution over classes.
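
Softmax is not plotted with the other activations because it acts on a whole vector rather than a single value. A minimal NumPy sketch follows; subtracting the maximum before exponentiating is a common numerical-stability trick added here, not part of the formula above, and the example logits are made up.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # does not change the result
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Illustrative logits for a 3-class problem
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())

# Large logits would overflow np.exp without the shift
large = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(large)
```

The shift works because adding a constant to every $z_i$ cancels in the ratio, so only differences between logits matter.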

In [None]:
# Visualize activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

# Derivatives
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

def relu_derivative(z):
    return np.where(z > 0, 1, 0)

z = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Activation functions
activations = [
    ('Sigmoid', sigmoid, sigmoid_derivative),
    ('Tanh', tanh, tanh_derivative),
    ('ReLU', relu, relu_derivative),
    ('Leaky ReLU', leaky_relu, lambda z: np.where(z > 0, 1, 0.1))
]

for idx, (name, func, deriv) in enumerate(activations):
    # Function
    axes[0, idx].plot(z, func(z), 'b-', linewidth=2)
    axes[0, idx].axhline(y=0, color='k', linestyle='-', alpha=0.3)
    axes[0, idx].axvline(x=0, color='k', linestyle='-', alpha=0.3)
    axes[0, idx].set_title(f'{name}')
    axes[0, idx].set_xlabel('z')
    axes[0, idx].set_ylabel('σ(z)')
    axes[0, idx].grid(True)
    
    # Derivative
    axes[1, idx].plot(z, deriv(z), 'r-', linewidth=2)
    axes[1, idx].axhline(y=0, color='k', linestyle='-', alpha=0.3)
    axes[1, idx].axvline(x=0, color='k', linestyle='-', alpha=0.3)
    axes[1, idx].set_title(f'{name} Derivative')
    axes[1, idx].set_xlabel('z')
    axes[1, idx].set_ylabel("σ'(z)")
    axes[1, idx].grid(True)

axes[0, 0].set_ylabel('Activation σ(z)')
axes[1, 0].set_ylabel("Derivative σ'(z)")

plt.tight_layout()
plt.show()

---

## 4. Multi-Layer Perceptron (MLP)

### 4.1 Architecture

```
Input Layer      Hidden Layer(s)      Output Layer
                                        
    x₁ --------O----------O----------O--- ŷ₁
              / \        / \
    x₂ ------O---O------O---O--------O--- ŷ₂
              \ /        \ /
    x₃ --------O----------O----------O--- ŷ₃
```

### 4.2 Forward Propagation

For layer $l$:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = \sigma(z^{[l]})$$

where:
- $a^{[l-1]}$ = activations from previous layer
- $W^{[l]}$ = weight matrix for layer $l$
- $b^{[l]}$ = bias vector for layer $l$
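
The two equations above can be sketched as a short loop over layers. This is a minimal illustration, not how scikit-learn implements it: the layer sizes, the random weights, and the choice of ReLU for hidden layers with a sigmoid output are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, params):
    """Forward pass; params is a list of (W, b) pairs, one per layer."""
    a = x  # a^[0] is the input
    for i, (W, b) in enumerate(params):
        z = W @ a + b                        # z^[l] = W^[l] a^[l-1] + b^[l]
        is_output = (i == len(params) - 1)
        a = sigmoid(z) if is_output else relu(z)  # a^[l] = sigma(z^[l])
    return a

rng = np.random.default_rng(42)
params = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # layer 1: 3 inputs -> 4 hidden units
    (rng.normal(size=(1, 4)), np.zeros(1)),  # layer 2: 4 hidden units -> 1 output
]
x = np.array([0.5, -1.2, 3.0])
y_hat = forward(x, params)
print(y_hat.shape, y_hat)
```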

### 4.3 Loss Functions

#### Binary Cross-Entropy
$$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

#### Categorical Cross-Entropy (Multi-class)
$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})$$

#### Mean Squared Error (Regression)
$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
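
The three losses translate almost directly into NumPy. In the sketch below the function names and the example values are hypothetical, and the clipping with `eps` is an added safeguard against `log(0)` that is not part of the formulas.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy; y_prob holds predicted probabilities of class 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def cce_loss(y_onehot, y_prob, eps=1e-12):
    """Categorical cross-entropy; rows are samples, columns are classes."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

def mse_loss(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values
bce = bce_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6]))
cce = cce_loss(
    np.array([[1, 0, 0], [0, 1, 0]]),
    np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]),
)
mse = mse_loss(np.array([3.0, 5.0]), np.array([2.5, 5.5]))
print(f"BCE = {bce:.4f}, CCE = {cce:.4f}, MSE = {mse:.4f}")
```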

In [None]:
# Demonstrate MLP on non-linear data
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare Logistic Regression vs MLP
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'MLP (1 hidden layer, 10 units)': MLPClassifier(
        hidden_layer_sizes=(10,), max_iter=1000, random_state=42
    ),
    'MLP (2 hidden layers, 10 units each)': MLPClassifier(
        hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42
    ),
    'MLP (50 units)': MLPClassifier(
        hidden_layer_sizes=(50,), max_iter=1000, random_state=42
    )
}

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for ax, (name, model) in zip(axes, models.items()):
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Create mesh grid for decision boundary
    h = 0.02
    x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
    y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, 
               cmap='RdBu', edgecolors='black', s=50, alpha=0.7)
    
    train_acc = model.score(X_train_scaled, y_train)
    test_acc = model.score(X_test_scaled, y_test)
    
    ax.set_title(f'{name}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

---

## 5. Backpropagation

### 5.1 The Concept

**Backpropagation** is the algorithm used to train neural networks by computing gradients efficiently using the chain rule.

### 5.2 Gradient Descent Update

For each parameter:
$$\theta^{(t+1)} = \theta^{(t)} - \eta \frac{\partial L}{\partial \theta}$$

where $\eta$ is the learning rate.

### 5.3 Chain Rule Application

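For a weight $w^{[l]}$ in layer $l$, the chain rule multiplies local derivatives from the output layer back to layer $l$. Here the superscript $[L]$ denotes the final (output) layer, not the loss $L$:
$$\frac{\partial L}{\partial w^{[l]}} = \frac{\partial L}{\partial a^{[L]}} \cdot \frac{\partial a^{[L]}}{\partial z^{[L]}} \cdot \frac{\partial z^{[L]}}{\partial a^{[L-1]}} \cdots \frac{\partial z^{[l]}}{\partial w^{[l]}}$$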
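
To make the chain rule concrete, the sketch below backpropagates through a tiny network by hand and compares one gradient entry against a finite-difference estimate. The architecture (one tanh hidden layer, a sigmoid output, binary cross-entropy loss), the sizes, and the random values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_loss(W1, b1, w2, b2, x, y):
    """Return the loss and activations for a single sample."""
    a1 = np.tanh(W1 @ x + b1)                 # hidden layer
    a2 = sigmoid(w2 @ a1 + b2)                # scalar output
    loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
    return loss, a1, a2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
w2, b2 = rng.normal(size=3), 0.0
x, y = np.array([0.5, -1.0]), 1.0

loss, a1, a2 = forward_loss(W1, b1, w2, b2, x, y)

# Backward pass: apply the chain rule from the output back to W1
dz2 = a2 - y                  # dL/dz2 for sigmoid + cross-entropy
dw2 = dz2 * a1                # dL/dw2
db2 = dz2                     # dL/db2
da1 = dz2 * w2                # dL/da1
dz1 = da1 * (1 - a1 ** 2)     # tanh'(z1) = 1 - tanh(z1)^2
dW1 = np.outer(dz1, x)        # dL/dW1
db1 = dz1                     # dL/db1

# Finite-difference check on a single weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward_loss(W1_plus, b1, w2, b2, x, y)[0]
           - forward_loss(W1_minus, b1, w2, b2, x, y)[0]) / (2 * eps)
print(dW1[0, 0], numeric)
```

The two printed numbers should agree to several decimal places; a mismatch in such a check usually points to an error in the hand-derived gradient.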

### 5.4 Gradient Descent Variants

| Variant | Description |
|---------|-------------|
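| **Batch GD** | Uses all samples for each update (stable, but slow on large datasets) |
| **Stochastic GD** | Uses one sample per update (fast, but noisy) |
| **Mini-batch GD** | Uses a small batch per update (a compromise between the two) |
| **Adam** | Mini-batch optimizer with per-parameter adaptive learning rates and momentum (the default `solver` in scikit-learn's MLP models) |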
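
The first three variants differ only in how many samples go into each update. The sketch below is a minimal mini-batch loop for a one-feature linear model with squared-error loss; the synthetic data, learning rate, and batch size are illustration choices. Setting `batch_size = len(X)` would give batch GD and `batch_size = 1` stochastic GD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus small noise (illustrative)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0
eta = 0.1          # learning rate
batch_size = 32

for epoch in range(100):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb                # prediction error on the batch
        grad_w = 2 * np.mean(err * xb)         # dL/dw for mean squared error
        grad_b = 2 * np.mean(err)              # dL/db
        w -= eta * grad_w                      # theta <- theta - eta * gradient
        b -= eta * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")
```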

In [None]:
# Visualize training progress
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 30), 
    max_iter=500, 
    random_state=42,
    verbose=False,
    early_stopping=False
)

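# Track loss and accuracy after every epoch
losses = []
train_accs = []
test_accs = []

# Train incrementally: each partial_fit call makes one pass over the data
# (max_iter is not used by partial_fit). The class labels are required on the first call.
n_epochs = 200
for epoch in range(n_epochs):
    if epoch == 0:
        mlp.partial_fit(X_train_scaled, y_train, classes=[0, 1])
    else:
        mlp.partial_fit(X_train_scaled, y_train)
    losses.append(mlp.loss_)
    train_accs.append(mlp.score(X_train_scaled, y_train))
    test_accs.append(mlp.score(X_test_scaled, y_test))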

# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
ax1 = axes[0]
ax1.plot(losses, linewidth=2, color='blue')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Over Time')
ax1.grid(True)

# Accuracy curves
ax2 = axes[1]
ax2.plot(train_accs, label='Training Accuracy', linewidth=2)
ax2.plot(test_accs, label='Test Accuracy', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Test Accuracy')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

print(f"Final Training Accuracy: {train_accs[-1]:.4f}")
print(f"Final Test Accuracy: {test_accs[-1]:.4f}")

---

## 6. Key Hyperparameters

### 6.1 Network Architecture

| Parameter | Description | Effect |
|-----------|-------------|--------|
| **Number of layers** | Depth of network | More layers = more complex patterns |
| **Units per layer** | Width of each layer | More units = more capacity |

### 6.2 Training Parameters

| Parameter | Description | Typical Range |
|-----------|-------------|---------------|
| **Learning rate** | Step size for gradient descent | 0.0001 - 0.1 |
| **Batch size** | Samples per gradient update | 32 - 256 |
| **Epochs** | Number of passes through data | 10 - 1000+ |
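
In scikit-learn's `MLPClassifier`, these three settings correspond to the `learning_rate_init`, `batch_size`, and `max_iter` parameters. The sketch below shows where each one goes; the specific values and the small moons dataset are illustration choices, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Small illustrative dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(10,),
    learning_rate_init=0.01,  # learning rate
    batch_size=32,            # samples per gradient update
    max_iter=300,             # upper limit on epochs for the default 'adam' solver
    random_state=0,
)
clf.fit(X, y)

# n_iter_ reports how many epochs were actually run before stopping
print(clf.n_iter_, clf.score(X, y))
```

Training can stop before `max_iter` is reached if the loss stops improving, which is why `n_iter_` is worth inspecting.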

### 6.3 Regularization

| Technique | Purpose |
|-----------|----------|
| **L2 (alpha)** | Penalizes large weights |
| **Dropout** | Randomly disables neurons during training (not available in scikit-learn's `MLPClassifier`) |
| **Early stopping** | Stop training when validation score stops improving |

In [None]:
# Effect of hidden layer sizes
architectures = [
    (5,),
    (10,),
    (50,),
    (100,),
    (50, 30),
    (100, 50, 25)
]

results = []
for arch in architectures:
    mlp = MLPClassifier(
        hidden_layer_sizes=arch,
        max_iter=500,
        random_state=42,
        early_stopping=True,
        validation_fraction=0.1
    )
    mlp.fit(X_train_scaled, y_train)
    
    results.append({
        'Architecture': str(arch),
        'Train Acc': mlp.score(X_train_scaled, y_train),
        'Test Acc': mlp.score(X_test_scaled, y_test),
        'n_iter': mlp.n_iter_
    })

results_df = pd.DataFrame(results)
print("Architecture Comparison:")
print(results_df.to_string(index=False))

In [None]:
# Effect of regularization (alpha)
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]

fig, axes = plt.subplots(1, len(alphas), figsize=(20, 4))

for ax, alpha in zip(axes, alphas):
    mlp = MLPClassifier(
        hidden_layer_sizes=(50,),
        alpha=alpha,
        max_iter=500,
        random_state=42
    )
    mlp.fit(X_train_scaled, y_train)
    
    # Decision boundary
    h = 0.02
    x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
    y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
               cmap='RdBu', edgecolors='black', s=30)
    
    test_acc = mlp.score(X_test_scaled, y_test)
    ax.set_title(f'alpha={alpha}\nTest Acc: {test_acc:.3f}')
    ax.set_xlabel('Feature 1')

axes[0].set_ylabel('Feature 2')
plt.suptitle('Effect of L2 Regularization (alpha)', y=1.02, fontsize=14)
plt.tight_layout()
plt.show()

---

## 7. Practical Example: Iris Classification

In [None]:
# Load iris data
iris_df = pd.read_csv('../data/iris.csv')
X_iris = iris_df.drop('class', axis=1)
y_iris = iris_df['class']

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train MLP
mlp_iris = MLPClassifier(
    hidden_layer_sizes=(50, 30),
    activation='relu',
    solver='adam',
    alpha=0.001,
    max_iter=1000,
    random_state=42
)
mlp_iris.fit(X_train_iris_scaled, y_train_iris)

# Evaluate
y_pred_iris = mlp_iris.predict(X_test_iris_scaled)

print("MLP Classification Report for Iris Dataset:")
print(classification_report(y_test_iris, y_pred_iris))

# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test_iris, y_pred_iris)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=mlp_iris.classes_, yticklabels=mlp_iris.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - MLP on Iris Dataset')
plt.show()

---

## 8. Advantages and Disadvantages

### Advantages

| Advantage | Description |
|-----------|-------------|
| **Universal Approximation** | With enough hidden units, can approximate any continuous function on a bounded domain arbitrarily well |
| **Feature Learning** | Automatically learns useful representations |
| **Flexible Architecture** | Can be adapted to many problem types |
| **Non-linear Patterns** | Handles complex, non-linear relationships |

### Disadvantages

| Disadvantage | Mitigation |
|--------------|------------|
| **Requires large data** | Use transfer learning, data augmentation |
| **Computationally expensive** | Use GPUs, efficient architectures |
| **Black box** | Use explainability techniques (SHAP, LIME) |
| **Sensitive to hyperparameters** | Use automated hyperparameter tuning |
| **Prone to overfitting** | Regularization, dropout, early stopping |

---

## 9. Summary

### Key Takeaways

1. **Perceptrons** are the basic building blocks of neural networks
2. **Activation functions** introduce non-linearity (ReLU is a common default for hidden layers)
3. **MLPs** stack multiple layers to learn complex patterns
4. **Backpropagation** efficiently computes gradients for training
5. **Hyperparameter tuning** is crucial for good performance
6. **Regularization** helps prevent overfitting

### When to Use Neural Networks

| Use Neural Networks When | Use Simpler Models When |
|--------------------------|------------------------|
| Large datasets available | Small datasets |
| Complex non-linear patterns | Linear relationships |
| Feature engineering is difficult | Clear feature engineering possible |
| Prediction accuracy is priority | Interpretability is priority |

---

## 10. Exercises

### Exercise 1: XOR Problem
Create the XOR dataset (4 points: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0) and show that a single perceptron cannot solve it, but an MLP can.

### Exercise 2: Architecture Exploration
Experiment with different architectures (number of layers, units per layer) on the moons dataset. Find the simplest architecture that achieves >95% test accuracy.

### Exercise 3: Activation Function Comparison
Train the same MLP architecture with different activation functions ('relu', 'tanh', 'logistic') and compare their training curves and final performance.

### Exercise 4: Learning Rate Effect
Train MLPs with different learning rates (0.0001, 0.001, 0.01, 0.1) and plot their loss curves. What happens when the learning rate is too high or too low?

### Exercise 5: Regularization Study
Create a small dataset that overfits easily. Show how increasing the alpha parameter (L2 regularization) reduces overfitting.

### Exercise 6: Early Stopping
Implement early stopping and compare the final test accuracy with and without it. How many epochs does early stopping save?

### Exercise 7: Regression with MLP
Use MLPRegressor on the house_prices.csv data. Compare its performance with Linear Regression.

---

## 11. Further Reading

- [scikit-learn Neural Networks](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)
- [Deep Learning Book by Goodfellow et al.](https://www.deeplearningbook.org/) (Free online)
- [3Blue1Brown Neural Networks Playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) (Visual explanations)
- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/) (Free online book)
- [PyTorch Tutorials](https://pytorch.org/tutorials/) (For deeper neural network frameworks)