# Neural Network Architectures

Different neural network architectures have been developed for different kinds of data and problems. This notebook explores three major families used in deep learning: convolutional networks, recurrent networks, and transformers.

<span style="color : red">Band 5 & 6 students know there are different neural network architectures and generally understand their appropriate use cases.</span>

## Overview of Neural Network Architectures

```mermaid
mindmap
  root((Neural<br/>Networks))
    Feedforward
      Fully Connected
      Deep Neural Networks
    Convolutional
      Image Classification
      Object Detection
      Image Segmentation
    Recurrent
      LSTM
      GRU
      Time Series
    Transformers
      Self-Attention
      BERT
      GPT
```

In [None]:
# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

## 1. Convolutional Neural Networks (CNNs)

CNNs are specialized for processing grid-like data, particularly images. Their convolutional layers slide small learned filters across the input to detect local features such as edges and textures.

### CNN Architecture

```mermaid
graph LR
    A[Input Image<br/>32x32x3] --> B[Conv Layer<br/>28x28x32]
    B --> C[Pooling<br/>14x14x32]
    C --> D[Conv Layer<br/>10x10x64]
    D --> E[Pooling<br/>5x5x64]
    E --> F[Flatten<br/>1600]
    F --> G[Dense Layer<br/>128]
    G --> H[Output<br/>10 classes]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#fff3cd,color:#333
    style F fill:#ffeaa7,color:#333
    style G fill:#d4edda,color:#333
    style H fill:#f8d7da,color:#333
```
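
The shapes in the diagram follow from simple arithmetic. The sketch below reproduces them, assuming 5x5 filters with stride 1 and no padding for both convolutions, and 2x2 pooling with stride 2. The diagram does not state these values; they are the ones consistent with the shapes shown, and the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed hyperparameters (not stated in the diagram, chosen to match its shapes)
size, channels = 32, 3
layers = [
    ("conv", 5, 1, 32),    # (kind, kernel, stride, output channels)
    ("pool", 2, 2, None),
    ("conv", 5, 1, 64),
    ("pool", 2, 2, None),
]

shapes = [(size, size, channels)]
for kind, kernel, stride, out_channels in layers:
    size = conv_output_size(size, kernel, stride)
    if kind == "conv":
        channels = out_channels   # a conv layer outputs one channel per filter
    shapes.append((size, size, channels))

flattened = size * size * channels
print(shapes)      # [(32, 32, 3), (28, 28, 32), (14, 14, 32), (10, 10, 64), (5, 5, 64)]
print(flattened)   # 1600
```

With `padding=2` on a 5x5 convolution the spatial size would stay the same, which is why the effect of a convolutional layer on dimensions depends on its settings.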

### Key Components of CNNs

| Component | Purpose | Effect on Dimensions |
| --- | --- | --- |
| Convolutional Layer | Applies filters to detect features (edges, textures, patterns) | Without padding, shrinks height and width by (filter size minus 1); output depth equals the number of filters |
| Pooling Layer | Downsamples to reduce computation and help limit overfitting | Halves height and width (for 2x2 pooling with stride 2) |
| Activation (ReLU) | Introduces non-linearity | No change in dimensions |
| Batch Normalization | Stabilizes and speeds up training | No change in dimensions |
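| Flatten Layer | Converts the stack of feature maps to a 1D vector | Height x width x channels becomes one dimension (e.g. 5x5x64 → 1600) |
| Dense/Fully Connected | Combines features for the final classification | Output size equals the number of units (the number of classes in the last layer) |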
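
To make the "Effect on Dimensions" column concrete, here is a minimal NumPy sketch of ReLU, 2x2 max pooling, and flattening on a tiny feature map. It assumes a single 2D feature map with even height and width; the function names are illustrative and not taken from any library.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negative values become zero, shape is unchanged."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take the max of each block
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [ 1, -2,  3,  0],
    [ 4,  5, -6,  1],
    [-1,  0,  2,  2],
    [ 3, -3,  1, -4],
])

activated = relu(feature_map)      # still 4x4
pooled = max_pool_2x2(activated)   # 2x2: height and width are halved
flat = pooled.flatten()            # 1D vector of length 4

print(pooled)
# [[5 3]
#  [3 2]]
```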

### Visualizing Convolution Operation

In [None]:
# Demonstrate convolution with a simple example
# Create a simple 5x5 image
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
])

# Edge detection filter (vertical edges)
filter_vertical = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

# Edge detection filter (horizontal edges)
filter_horizontal = np.array([
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1]
])

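def convolve2d(image, kernel):
    """Simple 2D 'valid' convolution (no padding, stride 1).

    The kernel is not flipped, so strictly this is cross-correlation,
    which is also what most deep learning libraries compute.
    """
    kernel_h, kernel_w = kernel.shape
    output_h = image.shape[0] - kernel_h + 1
    output_w = image.shape[1] - kernel_w + 1
    output = np.zeros((output_h, output_w))
    
    for i in range(output_h):
        for j in range(output_w):
            # Multiply the kernel with the patch under it and sum the result
            patch = image[i:i+kernel_h, j:j+kernel_w]
            output[i, j] = np.sum(patch * kernel)
    
    return output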

# Apply filters
vertical_edges = convolve2d(image, filter_vertical)
horizontal_edges = convolve2d(image, filter_horizontal)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original Image')
axes[0].axis('off')

axes[1].imshow(vertical_edges, cmap='gray')
axes[1].set_title('Vertical Edge Detection')
axes[1].axis('off')

axes[2].imshow(horizontal_edges, cmap='gray')
axes[2].set_title('Horizontal Edge Detection')
axes[2].axis('off')

plt.tight_layout()
plt.show()

### Popular CNN Architectures

| Architecture | Year | Key Innovation | Use Case |
| --- | --- | --- | --- |
| LeNet | 1998 | One of the first successful CNNs | Handwriting recognition |
| AlexNet | 2012 | Deep CNN with ReLU, Dropout | ImageNet classification |
| VGGNet | 2014 | Very deep with small filters | Image classification |
| Inception | 2014 | Multiple filter sizes in parallel | Efficient image classification |
| ResNet | 2015 | Skip connections (residual blocks) | Very deep networks (up to 152 layers) |
| MobileNet | 2017 | Depthwise separable convolutions | Mobile and embedded devices |
| EfficientNet | 2019 | Compound scaling | State-of-the-art efficiency at release |
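
As a rough illustration of why MobileNet's depthwise separable convolutions suit small devices, the sketch below compares weight counts for one layer. The layer sizes are hypothetical and bias terms are ignored; this is arithmetic only, not MobileNet's actual configuration.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution (bias terms ignored)."""
    return kernel * kernel * in_channels * out_channels

def depthwise_separable_params(kernel, in_channels, out_channels):
    """Depthwise step (one kernel per input channel) plus a 1x1 pointwise step."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)

print(standard)                          # 73728
print(separable)                         # 8768
print(round(standard / separable, 1))    # 8.4
```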

## 2. Recurrent Neural Networks (RNNs)

RNNs are designed for sequential data where the order matters, such as time series, text, and speech.

### RNN Architecture

```mermaid
graph LR
    X1[Input<br/>t=1] --> H1[Hidden<br/>State 1]
    H1 --> Y1[Output<br/>t=1]
    H1 --> H2[Hidden<br/>State 2]
    X2[Input<br/>t=2] --> H2
    H2 --> Y2[Output<br/>t=2]
    H2 --> H3[Hidden<br/>State 3]
    X3[Input<br/>t=3] --> H3
    H3 --> Y3[Output<br/>t=3]
    H3 --> H4[...]
    
    style X1 fill:#e1f5ff,color:#333
    style X2 fill:#e1f5ff,color:#333
    style X3 fill:#e1f5ff,color:#333
    style H1 fill:#d4edda,color:#333
    style H2 fill:#d4edda,color:#333
    style H3 fill:#d4edda,color:#333
    style Y1 fill:#f8d7da,color:#333
    style Y2 fill:#f8d7da,color:#333
    style Y3 fill:#f8d7da,color:#333
```

**Key Feature**: The hidden state carries information forward from previous time steps, and the same weights are reused at every step.
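
The recurrence in the diagram can be sketched in a few lines of NumPy. This assumes a plain (Elman-style) RNN with a tanh activation and untrained random weights; the names `rnn_forward`, `W_xh`, `W_hh` and `b_h` are illustrative, and the output layer is left out to keep the focus on the hidden state.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5   # arbitrary sizes for illustration
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because each state depends on the one before it, the loop cannot be parallelised across time steps, which is one reason RNNs are slow to train on long sequences.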

### RNN Variants

#### LSTM (Long Short-Term Memory)

```mermaid
graph TD
    A[Previous Cell State] --> B{Forget Gate<br/>What to forget?}
    C[Input] --> D{Input Gate<br/>What to add?}
    D --> E[New Cell State]
    B --> E
    E --> F{Output Gate<br/>What to output?}
    F --> G[Hidden State]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#e1f5ff,color:#333
    style D fill:#fff3cd,color:#333
    style E fill:#d4edda,color:#333
    style F fill:#fff3cd,color:#333
    style G fill:#f8d7da,color:#333
```

**Advantages**:
- Mitigates the vanishing gradient problem
- Can learn long-term dependencies
- Gates control information flow
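
The gates can be written out as a single time step. This sketch uses the standard LSTM formulation, which is more detailed than the simplified diagram above, and assumes all four gate blocks are packed into one weight matrix `W`. The weights are random and untrained, and the names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4 * hidden, hidden + input)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: how much new content to add
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate cell content
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: what to expose
    c_t = f * c_prev + i * g                # new cell state
    h_t = o * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # arbitrary sizes for illustration
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The cell state is updated by element-wise scaling and addition rather than by repeated matrix multiplication, which is what helps gradients survive over many time steps.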

#### GRU (Gated Recurrent Unit)

- Simplified version of LSTM
- Fewer parameters, faster training
- Combines forget and input gates into update gate
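
A quick count shows where "fewer parameters" comes from. The sketch assumes one bias vector per gate block (some libraries use two) and hypothetical layer sizes; an LSTM has four weight blocks per layer while a GRU has three.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Each block has input weights, recurrent weights, and one bias vector."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256   # hypothetical layer sizes

lstm_params = gated_rnn_params(input_size, hidden_size, 4)  # forget, input, candidate, output
gru_params = gated_rnn_params(input_size, hidden_size, 3)   # update, reset, candidate

print(lstm_params)               # 394240
print(gru_params)                # 295680
print(gru_params / lstm_params)  # 0.75
```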

### RNN Use Cases

| Task Type | Example | Architecture |
| --- | --- | --- |
| One-to-Many | Image captioning | One input → sequence output |
| Many-to-One | Sentiment analysis | Sequence input → one output |
| Many-to-Many (same length) | Video frame labeling | Sequence → sequence (aligned) |
| Many-to-Many (different length) | Machine translation | Sequence → sequence (encoder-decoder) |
| Sequence-to-Sequence | Text summarization | Encoder-decoder with attention |

In [None]:
# Demonstrate sequential data processing
# Generate a sine wave with noise
time_steps = np.linspace(0, 4*np.pi, 100)
signal = np.sin(time_steps) + np.random.normal(0, 0.1, 100)

# Stand-in for sequential processing: a moving average, NOT a trained RNN.
# It only illustrates turning a noisy input sequence into a smoother output sequence.
window_size = 5
smoothed = np.convolve(signal, np.ones(window_size)/window_size, mode='same')

plt.figure(figsize=(12, 6))
plt.plot(time_steps, signal, 'b-', alpha=0.5, label='Noisy Signal (Input)')
plt.plot(time_steps, smoothed, 'r-', linewidth=2, label='Moving Average (RNN stand-in)')
plt.xlabel('Time Steps')
plt.ylabel('Value')
plt.title('Sequential Processing Example')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 3. Transformer Architecture

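Transformers revolutionized natural language processing (NLP) and are now also used for images, audio, and other data. They replace recurrence with self-attention, so every position in a sequence can be processed in parallel.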

### Transformer Architecture Overview

```mermaid
graph TB
    A[Input Sequence] --> B[Input Embedding]
    B --> C[Positional Encoding]
    C --> D[Multi-Head<br/>Self-Attention]
    D --> E[Add & Normalize]
    E --> F[Feed Forward<br/>Network]
    F --> G[Add & Normalize]
    G --> H{Repeat N times<br/>Encoder Layers}
    H --> I[Encoder Output]
    
    J[Output Sequence] --> K[Output Embedding]
    K --> L[Positional Encoding]
    L --> M[Masked Multi-Head<br/>Self-Attention]
    M --> N[Add & Normalize]
    I --> O[Cross-Attention]
    N --> O
    O --> P[Add & Normalize]
    P --> Q[Feed Forward<br/>Network]
    Q --> R[Add & Normalize]
    R --> S{Repeat N times<br/>Decoder Layers}
    S --> T[Output Probabilities]
    
    style A fill:#e1f5ff,color:#333
    style D fill:#d4edda,color:#333
    style I fill:#fff3cd,color:#333
    style J fill:#e1f5ff,color:#333
    style M fill:#d4edda,color:#333
    style O fill:#d4edda,color:#333
    style T fill:#f8d7da,color:#333
```
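
Since attention has no built-in notion of order, the "Positional Encoding" step adds position information to the embeddings. The sketch below assumes the fixed sinusoidal scheme from the original Transformer paper (learned position embeddings are a common alternative) and an even `d_model`; the zero embeddings are only a placeholder.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

seq_len, d_model = 10, 16                    # arbitrary sizes for illustration
embeddings = np.zeros((seq_len, d_model))    # placeholder for learned token embeddings
encoder_input = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(encoder_input.shape)  # (10, 16)
```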

### Key Innovations of Transformers

#### Self-Attention Mechanism

```mermaid
graph LR
    A["Word 1: 'The'"] --> Q1[Query]
    A --> K1[Key]
    A --> V1[Value]
    
    B["Word 2: 'cat'"] --> Q2[Query]
    B --> K2[Key]
    B --> V2[Value]
    
    C["Word 3: 'sat'"] --> Q3[Query]
    C --> K3[Key]
    C --> V3[Value]
    
    Q1 --> ATT[Attention<br/>Scores]
    Q2 --> ATT
    Q3 --> ATT
    K1 --> ATT
    K2 --> ATT
    K3 --> ATT
    
    ATT --> OUT[Weighted<br/>Values]
    V1 --> OUT
    V2 --> OUT
    V3 --> OUT
    
    style A fill:#e1f5ff,color:#333
    style B fill:#e1f5ff,color:#333
    style C fill:#e1f5ff,color:#333
    style ATT fill:#fff3cd,color:#333
    style OUT fill:#f8d7da,color:#333
```

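**Self-attention lets each word attend to every word in the sequence (including itself) simultaneously.** Each word's query is compared with every key to produce attention scores, and those scores weight the values that are summed into the output.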
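
The diagram corresponds to scaled dot-product attention, which fits in a few lines of NumPy. This is a single-head sketch with random, untrained projection matrices; the embeddings for 'The', 'cat' and 'sat' are random stand-ins and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # compare every query with every key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                     # arbitrary sizes for illustration
X = rng.normal(size=(3, d_model))       # stand-in embeddings for 'The', 'cat', 'sat'
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))   # row i: how strongly word i attends to each word
print(output.shape)       # (3, 4)
```

All rows of `weights` are computed in one matrix product, which is the source of the parallelism that RNNs lack.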

### Advantages of Transformers

| Feature | Benefit |
| --- | --- |
| Parallel Processing | Much faster training than RNNs |
| Long-Range Dependencies | Can relate distant words easily |
| Shorter Gradient Paths | Attention and residual connections link distant positions directly, easing the vanishing gradients seen in RNNs |
| Scalability | Can be trained on massive datasets |
| Transfer Learning | Pre-trained models work well on many tasks |

### Transformer Models

```mermaid
graph TD
    A[Transformer Architecture<br/>2017] --> B[Encoder-Only<br/>Models]
    A --> C[Decoder-Only<br/>Models]
    A --> D[Encoder-Decoder<br/>Models]
    
    B --> E[BERT<br/>Bidirectional understanding]
    B --> F[RoBERTa<br/>Robustly optimized BERT]
    
    C --> G[GPT Series<br/>Text generation]
    C --> H[LLaMA<br/>Open-weight LLM]
    
    D --> I[T5<br/>Text-to-text]
    D --> J[BART<br/>Denoising autoencoder]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#ffeaa7,color:#333
```
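
One practical difference between encoder-only and decoder-only models is the attention mask: encoders such as BERT let every position see the whole sequence, while decoders such as GPT block attention to future positions. A minimal sketch, using equal raw scores so that only the mask shapes the result (the function name is illustrative):

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over scores; with causal=True, position i ignores positions after i."""
    scores = scores.astype(float)
    if causal:
        # True above the diagonal marks "future" positions to hide
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                          # 4 tokens, equal raw scores
bidirectional = attention_weights(scores)          # encoder-style: every entry is 0.25
causal = attention_weights(scores, causal=True)    # decoder-style: lower-triangular

print(causal)
```

In the causal case the first token can only attend to itself, the second splits its attention over the first two tokens, and so on, which is what lets a decoder generate text one token at a time.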

## Comparing Architectures

| Architecture | Best For | Strengths | Limitations |
| --- | --- | --- | --- |
| **CNN** | Images, spatial data | Parameter sharing, robustness to small translations | Limited global context; dense layers usually need fixed-size input |
| **RNN/LSTM** | Sequential data, time series | Handles variable-length sequences | Sequential computation is slow to train; plain RNNs suffer from vanishing gradients |
| **Transformer** | NLP, any sequential data | Parallel processing, long-range dependencies | Memory and compute grow quadratically with sequence length |
| **Hybrid** | Complex tasks | Combines strengths of multiple architectures | More complex to implement |

## Hybrid Architectures

Modern deep learning often combines different architectures:

```mermaid
graph LR
    A[Image] --> B[CNN<br/>Feature Extraction]
    B --> C[Sequence of<br/>Features]
    C --> D[Transformer<br/>or RNN]
    D --> E[Caption<br/>or Description]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#f8d7da,color:#333
```

**Examples**:
- **Vision Transformers (ViT)**: Treat image patches as sequences
- **CLIP**: Combines vision and language understanding
- **Video Understanding**: CNN + Transformer for spatial-temporal analysis
- **Speech Recognition**: CNN for features + Transformer for sequence modeling
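
To show what "treat image patches as sequences" means in practice, the sketch below cuts an image into non-overlapping patches and flattens each one into a vector. The 32x32 image, the patch size of 8, and the function name are assumptions for illustration; a real Vision Transformer would then project each patch with a learned linear layer and add positional information.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (rows, cols, patch, patch, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(32 * 32 * 3).reshape(32, 32, 3)   # stand-in for a 32x32 RGB image
sequence = image_to_patches(image, patch_size=8)
print(sequence.shape)  # (16, 192): 16 patches, each with 8*8*3 values
```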