
# Deep Learning Notes: ANN, RNN, CNN, LSTM, and Transformers

These notes cover the core deep learning architectures you'll use day-to-day (**ANN**, **RNN**, **CNN**, **LSTM**, and **Transformers**), with intuition, equations, design tips, and **ready-to-run PyTorch code**.

**What you'll get:**
- Intuition and when to use each architecture
- Key equations and pitfalls
- Minimal but complete PyTorch code templates (CPU-friendly)
- Practical tips (initialization, regularization, debugging)

> Requirements to run code cells: `torch`, `torchvision`, `numpy`, `matplotlib` (install via `pip install torch torchvision numpy matplotlib`)



## Learning Objectives
- Understand the building blocks of neural networks and how gradients flow.
- Learn when to choose **ANN vs. RNN/LSTM vs. CNN vs. Transformer**.
- Implement each model in PyTorch and adapt the templates to your own datasets.
- Recognize common training issues (overfitting, vanishing gradients, exploding gradients) and fix them.



## Prerequisites
- Python, NumPy, and basic PyTorch (`nn.Module`, `DataLoader`, `optim`).
- Familiarity with gradient descent and loss functions.



---
## 1) Artificial Neural Networks (ANN / MLP)

**Use when:** Data are tabular or features are already engineered (no sequence/image structure).

### Intuition
- An **MLP** is a stack of linear layers plus nonlinearities.
- Nonlinearities (ReLU/GELU/Tanh) allow modeling complex functions.
- **Depth** lets the network compose features hierarchically; **width** adds capacity (more units) at each layer.

### Core Equations
Given input $\mathbf{x} \in \mathbb{R}^{d}$:
- Layer: $\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$, where $\sigma$ is an element-wise nonlinearity (e.g., ReLU) and $\mathbf{h}^{(0)} = \mathbf{x}$
- Output (classification): $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)})$
- Loss (cross-entropy): $\mathcal{L} = -\sum_i y_i \log \hat{y}_i$
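
To make the softmax and cross-entropy equations concrete, the sketch below computes the loss by hand for a small batch and compares it with `torch.nn.functional.cross_entropy`, which expects raw logits rather than probabilities. The tensor values and class indices are arbitrary illustration data, not taken from any dataset.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # batch=4, classes=3 (arbitrary values)
targets = torch.tensor([0, 2, 1, 2])  # true class index for each row

# softmax, then take the log-probability of the true class in each row
probs = torch.softmax(logits, dim=1)
manual_loss = -torch.log(probs[torch.arange(4), targets]).mean()

# the built-in loss applies log-softmax internally, so it takes logits directly
builtin_loss = F.cross_entropy(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

With one-hot targets the sum over classes in the loss reduces to the single term for the true class, which is why indexing `probs` is enough here.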

### Notes & Tips
- Prefer **ReLU/GELU** over sigmoid/tanh for hidden layers.
- Use **BatchNorm/LayerNorm** to stabilize training, and **Dropout** and **weight decay** to regularize.
- **Xavier (Glorot)** initialization suits tanh/sigmoid layers; **He (Kaiming)** initialization suits ReLU-family layers.
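
As a minimal sketch of the initialization tip, the snippet below applies He (Kaiming) initialization to every `nn.Linear` in a small network via `Module.apply`. The helper name `init_he` and the layer sizes are assumptions made for illustration.

```python
import torch
from torch import nn

def init_he(module):
    # hypothetical helper: Kaiming (He) normal init for Linear weights, zero biases
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
net.apply(init_he)  # apply() visits every submodule recursively

# with fan_in = 256, He init targets std = sqrt(2 / 256), roughly 0.088
w_std = net[0].weight.std().item()
print(f"weight std: {w_std:.3f}")
```

For tanh or sigmoid layers, `nn.init.xavier_uniform_` could be swapped in the same way.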



### Minimal NumPy Forward Pass (for intuition)



import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)        # batch=4, features=3
W1 = np.random.randn(3, 5) * 0.1 # 3->5
b1 = np.zeros(5)
W2 = np.random.randn(5, 2) * 0.1 # 5->2
b2 = np.zeros(2)

def relu(z): 
    return np.maximum(0, z)

# forward
h1 = relu(X @ W1 + b1)
logits = h1 @ W2 + b2
z = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
probs



### PyTorch: MLP Classifier Template



import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dims=(128, 64), out_dim=2, p_drop=0.1):
        super().__init__()
        layers = []
        prev = in_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(p_drop)]
            prev = h
        layers += [nn.Linear(prev, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# dummy data (replace with your tensors)
X = torch.randn(512, 20)
y = torch.randint(0, 2, (512,))

ds = TensorDataset(X, y)
dl = DataLoader(ds, batch_size=32, shuffle=True)

model = MLP(in_dim=20, out_dim=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(3):
    model.train()
    total_loss = 0.0
    for xb, yb in dl:
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item() * xb.size(0)
    print(f"epoch {epoch+1}: loss={total_loss/len(ds):.4f}")



---
## 2) Recurrent Neural Networks (RNN)

**Use when:** Data are **sequences** (text, time-series) and dependencies are mostly local/short.

### Intuition
- RNNs process tokens **sequentially**, carrying a hidden state.
- They struggle with **long-range dependencies** because gradients **vanish or explode** when backpropagated through many time steps.

### Equations (Elman RNN)
- $\mathbf{h}_t = \tanh(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}_h)$
- $\hat{\mathbf{y}}_t = \mathrm{softmax}(\mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y)$
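
The hidden-state recurrence can be checked directly against `nn.RNN`. The sketch below unrolls the update by hand using the layer's own weights and compares the result with the module's output; it covers only the $\mathbf{h}_t$ equation, not the softmax output layer, and the sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)  # tanh by default
x = torch.randn(2, 4, 3)   # [batch, time, features]
h = torch.zeros(2, 5)      # initial hidden state

# nn.RNN stores W_xh as weight_ih_l0 and W_hh as weight_hh_l0, with two bias vectors
W_xh, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b = rnn.bias_ih_l0 + rnn.bias_hh_l0

manual = []
for t in range(x.size(1)):
    h = torch.tanh(x[:, t] @ W_xh.T + h @ W_hh.T + b)
    manual.append(h)
manual = torch.stack(manual, dim=1)   # [batch, time, hidden]

out, h_n = rnn(x)
print(torch.allclose(manual, out, atol=1e-5))
```

The explicit Python loop is what makes RNNs sequential: step `t` cannot start until step `t-1` has produced `h`.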

### Tips
- Use gradient clipping; try **LSTM/GRU** if RNN underfits long contexts.
- Pack sequences with `pack_padded_sequence` when lengths vary.
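
A small sketch of the packing tip, assuming three toy sequences of lengths 5, 3, and 2: pad them to a common length, pack them so the RNN skips padded steps, then unpack the output. The feature and hidden sizes are arbitrary.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# three sequences of different lengths, each step an 8-dim feature vector
seqs = [torch.randn(n, 8) for n in (5, 3, 2)]
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)  # [3, 5, 8], zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
packed_out, h_n = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # [3, 5, 16]

# h_n holds each sequence's state at its own last valid step, not at the padded end
last_valid = out[torch.arange(3), lengths - 1]
print(out.shape, torch.allclose(h_n[0], last_valid))
```

Without packing, the final hidden state of a short sequence would be computed after running over its padding, which is usually not what a classifier should see.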



### PyTorch: Character-level RNN (toy)



import torch
from torch import nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 32)
        self.rnn = nn.RNN(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, h=None):
        x = self.embed(x)
        out, h = self.rnn(x, h)
        logits = self.fc(out)  # [B, T, V]
        return logits, h

# Dummy data: random characters
vocab_size = 30
model = CharRNN(vocab_size)
x = torch.randint(0, vocab_size, (4, 20))  # batch=4, seq=20
logits, h = model(x)
logits.shape



---
## 3) LSTM (Long Short-Term Memory)

**Use when:** You need **longer context** than vanilla RNNs handle reliably (language modeling, time-series).

### Intuition
- LSTM introduces **gates** (input, forget, output) and a **cell state** to control information flow.

### Equations
Let $\sigma$ be sigmoid and $\odot$ element-wise multiplication.
- $\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$
- $\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$
- $\mathbf{g}_t = \tanh(\mathbf{W}_g [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_g)$
- $\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$
- $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t$
- $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$
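
The gate equations map one-to-one onto `nn.LSTMCell`. The sketch below recomputes a single step by hand from the cell's weights and compares it with the module's output. One difference in layout: PyTorch keeps separate input and hidden weight matrices rather than one matrix over the concatenation $[\mathbf{h}_{t-1}, \mathbf{x}_t]$, which is mathematically equivalent. Sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=3, hidden_size=4)
x_t = torch.randn(2, 3)
h_prev, c_prev = torch.randn(2, 4), torch.randn(2, 4)

# the four gate blocks are stacked along dim 0 of the weights in the order i, f, g, o
gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)

c_t = f * c_prev + i * g          # forget part of the old cell state, add new candidate
h_t = o * torch.tanh(c_t)         # expose a gated view of the cell state

h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```

The additive update of `c_t` is the path along which gradients can flow across many steps when the forget gate stays close to 1.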

### Tips
- Start with **GRU** if you want fewer parameters (three gate blocks instead of four) and often similar performance.
- Use **bidirectional** LSTMs for sequence tagging (NER, POS).
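
To put a number on the GRU tip, this sketch counts parameters for a single-layer LSTM and GRU of the same size. The helper `n_params` and the sizes (64 inputs, 128 hidden units) are illustrative assumptions.

```python
from torch import nn

def n_params(module):
    # hypothetical helper: total number of parameter elements in a module
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

# LSTM stacks 4 gate blocks and GRU stacks 3, so the GRU has 3/4 of the parameters
print(n_params(lstm), n_params(gru), n_params(gru) / n_params(lstm))
```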



### PyTorch: LSTM for Sequence Classification (template)



import torch
from torch import nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_size=128, num_layers=1, num_classes=2, bidirectional=False):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers, 
                            batch_first=True, bidirectional=bidirectional)
        mult = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_size * mult, num_classes)

    def forward(self, x, lengths=None):
        x = self.embed(x)
        if lengths is not None:
            # Pack padded sequences so the LSTM skips padding positions
            x = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h, _) = self.lstm(x)
        # h: [num_layers*dirs, B, H]; for the last layer, h[-2] is forward and h[-1] is backward
        h_last = torch.cat([h[-2], h[-1]], dim=-1) if self.lstm.bidirectional else h[-1]
        logits = self.fc(h_last)
        return logits

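# Dummy data: token ids, with id 0 used as padding past each sequence's length
B, T, V = 8, 25, 1000
x = torch.randint(1, V, (B, T))
lengths = torch.randint(low=T//2, high=T+1, size=(B,))  # high is exclusive
x[torch.arange(T)[None, :] >= lengths[:, None]] = 0     # pad positions beyond each length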
model = LSTMClassifier(vocab_size=V, bidirectional=True)
logits = model(x, lengths=lengths)
logits.shape



---
## 4) Convolutional Neural Networks (CNN)

**Use when:** Data have **spatial structure** (images, audio spectrograms), or local patterns matter.

### Intuition
- Convolutions apply **learnable local filters** across space/time.
- Weight sharing makes CNNs parameter-efficient and translation-equivariant: shifting the input shifts the feature map correspondingly (up to border and stride effects).
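
To illustrate the parameter-efficiency claim, the sketch below compares a 3x3 convolution with a dense layer that would produce an output of the same shape from a 28x28 single-channel image. The dense count is computed arithmetically rather than by allocating the layer; the helper `n_params` and the chosen sizes are assumptions.

```python
from torch import nn

def n_params(module):
    # hypothetical helper: total number of parameter elements in a module
    return sum(p.numel() for p in module.parameters())

# 3x3 convolution, 1 -> 32 channels: the same 32 filters are reused at every position
conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv_params = n_params(conv)  # 32*1*3*3 weights + 32 biases

# a dense layer mapping 1x28x28 inputs to 32x28x28 outputs needs one weight per pair
in_features, out_features = 1 * 28 * 28, 32 * 28 * 28
dense_params = in_features * out_features + out_features

print(conv_params, dense_params)
```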

### Key Concepts
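- **Kernel/Filter size**: spatial extent of each filter; together with network depth it determines the receptive field.
- **Stride**: step between filter applications; a stride $S > 1$ downsamples by roughly a factor of $S$.
- **Padding**: values (usually zeros) added at the borders; with stride 1 and odd $K$, $P = (K-1)/2$ preserves spatial size.
- **Pooling**: reduces resolution and adds some invariance to small shifts.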

### Output size (1D/2D)
For input size $N$, kernel $K$, padding $P$, stride $S$:
$$
\text{out} = \left\lfloor \frac{N - K + 2P}{S} \right\rfloor + 1
$$
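
The output-size formula is easy to check against an actual layer. In this sketch the helper name `conv_out_size` and the specific sizes are illustrative assumptions; for 2D inputs the formula applies to each spatial dimension separately.

```python
import torch
from torch import nn

def conv_out_size(n, k, p=0, s=1):
    # floor((N - K + 2P) / S) + 1
    return (n - k + 2 * p) // s + 1

# check the formula against a real layer (sizes chosen arbitrarily)
n, k, p, s = 28, 5, 2, 3
layer = nn.Conv2d(1, 1, kernel_size=k, padding=p, stride=s)
out = layer(torch.randn(1, 1, n, n))

print(conv_out_size(n, k, p, s), out.shape[-1])
print(conv_out_size(28, 3, 1, 1))  # 3x3 kernel with padding 1 keeps 28
print(conv_out_size(28, 2, 0, 2))  # 2x2 window with stride 2 halves 28 to 14
```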



### PyTorch: Simple CNN for MNIST-like Images



import torch
from torch import nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 28->14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14->7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64*7*7, 128),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, num_classes)
        )
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Dummy batch
x = torch.randn(16, 1, 28, 28)
model = SimpleCNN()
logits = model(x)
logits.shape



---
## 5) Transformers

**Use when:** You need **long-range dependencies**, parallel training, and state-of-the-art performance (NLP, vision, multimodal).

### Intuition
- **Self-attention** lets each token attend to all others, modeling global context in a single layer, at a compute and memory cost that is quadratic in sequence length.
- Add **positional encodings** to inject order information.

### Scaled Dot-Product Attention
Given queries $Q$, keys $K$, values $V$:
$$
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V
$$
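
The attention formula translates almost line for line into tensor operations. The following is a minimal single-head sketch; the function name `attention` and the tensor sizes are assumptions for illustration.

```python
import math
import torch

def attention(q, k, v):
    # q, k: [..., T, d_k]; v: [..., T, d_v]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [..., T, T]
    weights = scores.softmax(dim=-1)                          # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
q, k, v = torch.randn(2, 5, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 8)
out, weights = attention(q, k, v)
print(out.shape, weights.sum(dim=-1))
```

The division by $\sqrt{d_k}$ keeps the scores at a similar scale as $d_k$ grows: for independent unit-variance components, the raw dot product has variance $d_k$, and unscaled scores would push the softmax toward saturation.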

### Multi-Head Attention (MHA)
- Project inputs to multiple heads, apply attention in parallel, then concatenate.

### Tips
- Use **LayerNorm**, **residual connections**, and **dropout**.
- Use **causal masks** for autoregressive language modeling, so each position attends only to itself and earlier positions.
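
A causal mask can be sketched as a lower-triangular matrix applied to the attention scores before the softmax. The example below uses a single head with no batch dimension to keep shapes small, and it follows the convention that mask entries equal to 0 are hidden; all sizes are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
T, d = 6, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# lower-triangular mask: position t may attend to positions <= t only
causal = torch.tril(torch.ones(T, T))

scores = q @ k.T / math.sqrt(d)
scores = scores.masked_fill(causal == 0, float("-inf"))
weights = scores.softmax(dim=-1)   # masked entries become exactly 0
out = weights @ v

# perturbing the last token's value must not change any earlier output
v2 = v.clone()
v2[-1] += 1.0
out2 = weights @ v2
print(torch.allclose(out[:-1], out2[:-1]))
```

A `[T, T]` mask like this broadcasts over batched, multi-head scores of shape `[B, H, T, T]`.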



### Minimal Self-Attention in PyTorch



import torch
from torch import nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3*d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv(x)                       # [B,T,3C]
        q, k, v = qkv.chunk(3, dim=-1)
        # reshape to heads
        def split_heads(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)  # [B,H,T,D]
        q, k, v = map(split_heads, (q, k, v))
        # attention scores
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # [B,H,T,T]
        if mask is not None:
            # mask must broadcast to [B,H,T,T]; entries equal to 0 are hidden from attention
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = scores.softmax(dim=-1)
        out = attn @ v                                            # [B,H,T,D]
        out = out.transpose(1, 2).contiguous().view(B, T, C)      # [B,T,C]
        return self.proj(out)

# test
x = torch.randn(2, 16, 128)  # batch, seq, features
sa = SelfAttention(128, 4)
y = sa(x)
y.shape



### Transformer Encoder Block



class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=256, p_drop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, mask=None):
        # Pre-norm
        x = x + self.dropout(self.attn(self.ln1(x), mask=mask))
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x

# test block
blk = TransformerEncoderBlock()
y = blk(torch.randn(2, 16, 128))
y.shape



### Tiny Transformer Encoder for Text Classification (template)



class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # [1, max_len, d_model]

    def forward(self, x):
        T = x.size(1)
        return x + self.pe[:, :T, :]

class TinyTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, num_layers=2, num_classes=2, p_drop=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos = PositionalEncoding(d_model)
        self.blocks = nn.ModuleList([TransformerEncoderBlock(d_model, n_heads, d_ff=4*d_model, p_drop=p_drop)
                                     for _ in range(num_layers)])
        self.ln = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x, mask=None):
        x = self.embed(x)
        x = self.pos(x)
        for blk in self.blocks:
            x = blk(x, mask=mask)
        x = self.ln(x)
        # Mean-pool over the sequence; note this also averages padded positions.
        # A CLS token or a mask-weighted mean are common alternatives.
        x = x.mean(dim=1)
        return self.fc(x)

# Dummy batch of token ids
B, T, V = 8, 32, 2000
tokens = torch.randint(1, V, (B, T))
model = TinyTransformerClassifier(vocab_size=V, num_classes=3)
logits = model(tokens)
logits.shape



---
## Generic Training Loop (applies to all models)



from torch import nn, optim

def train_model(model, dataloader, epochs=5, lr=1e-3, weight_decay=1e-4, max_grad_norm=1.0, device='cpu'):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    for epoch in range(1, epochs+1):
        model.train()
        total, correct, total_loss = 0, 0, 0.0
        for xb, yb in dataloader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            total_loss += loss.item() * xb.size(0)
            preds = logits.argmax(dim=-1)
            correct += (preds == yb).sum().item()
            total += xb.size(0)
        print(f"epoch {epoch}: loss={total_loss/total:.4f} acc={correct/total:.3f}")



---
## Practical Guidance & Debugging Checklist

- **Data sanity check**: Overfit a tiny subset (e.g., 100 samples). If the model can't drive training loss close to zero, suspect the learning rate, a bug in the data or label pipeline, or input normalization.
- **Learning rate**: Start with `1e-3` (Adam/AdamW); try an LR range test or a cosine schedule.
- **Initialization**: Rely on PyTorch defaults; for deep MLPs, prefer `ReLU + He init`.
- **Regularization**: Weight decay, dropout, early stopping. For sequences, also **label smoothing**.
- **Batch size**: Increase until VRAM-bound; use gradient accumulation if needed.
- **Mixed precision**: `torch.autocast(device_type="cuda")` for speed on GPU (the older `torch.cuda.amp.autocast()` spelling is deprecated in recent PyTorch releases).
- **Monitoring**: Track loss/accuracy, learning rate, gradient norms, and activation stats.
- **Reproducibility**: Set seeds and log versions of data & code.
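
For the gradient accumulation item, the sketch below shows the usual pattern: scale each micro-batch loss by the number of accumulation steps and call `backward()` repeatedly before a single optimizer step. It also checks that the accumulated gradient matches one full-batch pass. The model, data, and `accum_steps` value are toy assumptions, and the equality relies on equally sized micro-batches with a mean-reduced loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
X, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

# reference: one backward pass over the full batch of 32
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# accumulation: 4 micro-batches of 8, each loss scaled by 1/accum_steps
accum_steps = 4
model.zero_grad()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    loss = criterion(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up in .grad across calls
# optimizer.step() and optimizer.zero_grad() would go here, once per accumulated batch

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))
```

Layers that depend on batch statistics, such as BatchNorm, still see the micro-batch size, so accumulation is not exactly equivalent to a larger batch for those models.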



---
## Next Steps
- Swap the dummy tensors for a real dataset using `torchvision` (for CNN) or a custom `Dataset` (for RNN/LSTM/Transformer); `torchtext` also exists but is no longer actively developed.
- Add **padding masks** for variable-length sequences in Transformers.
- Try **GRU** variants and **Conv1D** for time-series.
- For Transformers, explore **causal masking** for language modeling and **RoPE** (rotary position embeddings) as an alternative to sinusoidal encodings.
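
As a starting point for the padding-mask item, here is a minimal sketch that builds a mask from token ids and uses it for mask-weighted mean pooling. It assumes padding id 0 and the convention that mask entries equal to 0 are hidden from attention; the token ids and hidden states are made-up illustration data.

```python
import torch

torch.manual_seed(0)
tokens = torch.tensor([[5, 7, 2, 0, 0],
                       [3, 9, 4, 8, 1]])   # id 0 is padding
pad_mask = (tokens != 0)                   # [B, T], True for real tokens

# attention mask: broadcasts over heads and query positions -> [B, 1, 1, T]
attn_mask = pad_mask[:, None, None, :]

# masked mean pooling over hidden states [B, T, C]
hidden = torch.randn(2, 5, 4)
weights = pad_mask.unsqueeze(-1).float()   # [B, T, 1]
pooled = (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)

print(attn_mask.shape, pooled.shape)
```

Masking keys this way keeps real tokens from attending to padding, and the weighted mean keeps padded positions out of the pooled representation.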
