<h3>🧠 Types of Neural Network Architectures</h3>

<div style="font-size: 12px; width: 90%;">

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Layers</th>
      <th>Params</th>
      <th>Train Time</th>
      <th>Complexity</th>
      <th>Applications</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Shallow</td>
      <td>1–3</td>
      <td>1K–10K</td>
      <td>Minutes–Hours</td>
      <td>Simple</td>
      <td>Linear regression; binary classification; feature extraction</td>
    </tr>
    <tr>
      <td>Medium</td>
      <td>4–10</td>
      <td>100K–1M</td>
      <td>Hours–Days</td>
      <td>Moderate</td>
      <td>Image classification (CNN); language modeling (RNN); speech recognition</td>
    </tr>
    <tr>
      <td>Deep</td>
      <td>11–50</td>
      <td>1M–10M</td>
      <td>Days–Weeks</td>
      <td>High</td>
      <td>Object detection (YOLO); segmentation (FCN); NLP</td>
    </tr>
    <tr>
      <td>Very Deep</td>
      <td>51–100+</td>
      <td>10M–100M+</td>
      <td>Weeks–Months</td>
      <td>Very High</td>
      <td>State-of-the-art vision; advanced NLP (question answering, machine translation)</td>
    </tr>
  </tbody>
</table>

</div>
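
To make the Params column concrete, here is a minimal sketch that counts the weights and biases of a fully connected network. The function name `count_mlp_params` and the example layer sizes are illustrative assumptions, not values taken from the table.

```python
def count_mlp_params(layer_sizes):
    """Count trainable parameters in a fully connected network.

    layer_sizes lists the width of every layer, input first, e.g. [64, 64, 10].
    Each dense layer contributes (inputs * outputs) weights plus one bias per output.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical shallow network: 64 inputs, one hidden layer of 64 units, 10 outputs.
print(count_mlp_params([64, 64, 10]))  # 4810
```

The ranges in the table are rough guides: parameter count depends on layer width as much as on depth, so a wide shallow network can exceed them.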



✅ **Universal Approximation Theorem**:  
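A feedforward network with a single hidden layer can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy, provided it has enough hidden neurons and a suitable nonlinear (non-polynomial) activation function. The theorem only guarantees that such a network exists; it does not say how many neurons are needed or that training will find the right weights.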
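
As a one-dimensional illustration of the theorem, the sketch below builds a one-hidden-layer ReLU network by hand so that it linearly interpolates a target function. The weights are constructed, not trained, and the helper names (`build_relu_net`, `predict`, `max_error`) are assumptions made for this example.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_net(f, a, b, n_hidden):
    """Build a one-hidden-layer ReLU net that interpolates f on [a, b].

    Hidden unit i computes relu(x - knot_i); its output weight is the change
    in slope of the piecewise-linear interpolant at that knot.
    """
    step = (b - a) / n_hidden
    knots = [a + i * step for i in range(n_hidden)]
    slopes = [(f(k + step) - f(k)) / step for k in knots]
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return knots, coeffs, f(a)  # f(a) serves as the output bias


def predict(net, x):
    knots, coeffs, bias = net
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))


def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - predict(net, x)) for x in xs)


# More hidden units give a finer interpolation and a smaller worst-case error.
for width in (4, 16, 64):
    net = build_relu_net(math.sin, -math.pi, math.pi, width)
    print(width, max_error(math.sin, net, -math.pi, math.pi))
```

The error shrinks as the hidden layer widens, which is the behavior the theorem describes. It also shows the cost: accuracy is bought with width, which is one practical reason deeper networks are often preferred.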

<h3 style="font-size:16px; margin-bottom:6px;">📦 Deep Learning Architectures</h3>

<style>
  .compact-grid {
    display: grid;
    grid-template-columns: 160px auto;
    row-gap: 2px;
    font-size: 13px;
    font-family: sans-serif;
  }
  .compact-grid b {
    white-space: nowrap;
  }
</style>

<b style="font-size:14px;">1. ANN Architectures</b>
<div class="compact-grid">
  <b>McCulloch-Pitts</b> <span>Binary threshold model.</span>
  <b>Hebbian Network</b> <span>“Neurons that fire together, wire together.”</span>
  <b>Perceptron / MLP</b> <span>Basic/stacked feedforward layers.</span>
  <b>ADALINE / MADALINE</b> <span>Linear adaptive models.</span>
  <b>Backpropagation</b> <span>Gradient-based training algorithm for MLPs (a learning method, not an architecture).</span>
  <b>RBF Networks</b> <span>Use Gaussian activations.</span>
</div>
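
The perceptron entry above can be sketched in a few lines: a binary threshold unit trained with the classic perceptron update rule. The AND-gate dataset, the learning rate, and the function names are assumptions chosen for illustration.

```python
def predict(weights, bias, x):
    """Binary threshold unit: output 1 when the weighted sum exceeds zero."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train_perceptron(samples, lr=1.0, max_epochs=20):
    """Perceptron rule: nudge weights toward each misclassified example."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, bias


# Toy, linearly separable dataset: the logical AND of two inputs.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_GATE)
print([predict(weights, bias, x) for x, _ in AND_GATE])  # [0, 0, 0, 1]
```

A single perceptron only separates linearly separable classes, so the same loop never converges on XOR. Stacking layers into an MLP removes that limit.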

<b style="font-size:14px;">2. Vision Models</b>
<div class="compact-grid">
  <b>Image Classification</b> <span>Label entire image.</span>
  <b>Object Detection</b> <span>Locate + classify objects.</span>
  <b>Image Segmentation</b> <span>Semantic, Instance, Panoptic.</span>
</div>

<b style="font-size:14px;">3. NLP Models</b>
<div class="compact-grid">
  <b>RNN Family</b> <span>RNN, LSTM, GRU, etc.</span>
  <b>Transformer</b> <span>Self-attention architecture.</span>
  <b>Pretrained</b> <span>BERT, RoBERTa, T5...</span>
  <b>LLMs</b> <span>GPT, LLaMA — autoregressive.</span>
</div>

<b style="font-size:14px;">4. Generative Models</b>
<div class="compact-grid">
  <b>Autoencoders</b> <span>AE, VAE — latent encoding.</span>
  <b>GANs</b> <span>Adversarial generation.</span>
  <b>Flow Models</b> <span>RealNVP, Glow.</span>
  <b>Diffusion</b> <span>DDPM, Stable Diffusion.</span>
</div>


<h2>🧠 NLP Model Timeline (1986–2025)</h2>
<style>
  .timeline {
    display: grid;
    grid-template-columns: 70px auto;
    row-gap: 4px;
    font-family: sans-serif;
    font-size: 14px;
  }
  .year {
    font-weight: bold;
    white-space: nowrap;
  }
</style>
<div class="timeline">
  <div class="year">1986</div><div>RNN – Basic sequential modeling</div>
  <div class="year">1997</div><div>LSTM – Long-term memory via gating; Bi-RNN</div>
  <div class="year">2014</div><div>GRU, Stacked RNN, Seq2Seq, Additive Attention (Bahdanau)</div>
  <div class="year">2015</div><div>Multiplicative Attention (Luong)</div>
  <div class="year">2017</div><div>Transformer – Self- and Cross-Attention</div>
  <div class="year">2018</div><div>BERT, GPT‑1</div>
  <div class="year">2019</div><div>RoBERTa, DistilBERT, ALBERT, XLNet, ERNIE, GPT‑2, MarianMT, BART</div>
  <div class="year">2020</div><div>T5/mT5, ELECTRA, DeBERTa, PEGASUS, ViT, GPT‑3, Reformer, Linformer, Performer, BigBird, Longformer</div>
  <div class="year">2021</div><div>Switch Transformer, CLIP</div>
  <div class="year">2022</div><div>FlashAttention, BLIP, Flamingo, Perceiver IO</div>
  <div class="year">2023</div><div>GPT‑4, LLaMA‑1/2, Mistral 7B (Sep), DeepSeek (LLM + Coder, Nov), Mixtral 8x7B (MoE, Dec)</div>
  <div class="year">2024</div><div>LLaMA‑3.0 (Apr), LLaMA‑3.1 / Mistral Large 2 (Jul), DeepSeek R1‑Lite (Nov), DeepSeek‑V3 (Dec)</div>
  <div class="year">2025</div><div>Jan: DeepSeek‑R1, Mar: DeepSeek‑V3‑0324, May: Devstral (Small‑2505)</div>
</div>


## 🔹 1. RNN Family – Early Sequence Models  
**Goal**: Handle sequential data by using memory of past tokens.

- **RNN**: Recurrent Neural Network; learns temporal patterns, but suffers from vanishing gradients.  
- **LSTM**: Long Short-Term Memory; introduces gates (input, forget, output) to retain long-term dependencies.  
- **GRU**: Gated Recurrent Unit; simpler than LSTM, uses update/reset gates for efficiency.  
- **Bidirectional RNN**: Processes data in both forward and backward directions to capture full context.  
- **Stacked RNN**: Multiple RNN layers stacked to learn deeper temporal features.  
- **Encoder-Decoder**: Sequence-to-sequence model that maps input to output sequences, used in translation.
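
The vanishing-gradient problem mentioned for the plain RNN can be seen in a toy scalar recurrence. This is a sketch under simplifying assumptions: the hidden state is a single number, and the weights `w_hidden` and `w_input` are hand-picked values, not learned ones.

```python
import math


def rnn_forward(inputs, w_hidden, w_input, h0=0.0):
    """Scalar Elman-style RNN: h_t = tanh(w_hidden * h_{t-1} + w_input * x_t)."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w_hidden * states[-1] + w_input * x))
    return states


def gradient_wrt_initial_state(states, w_hidden):
    """Chain rule: d h_T / d h_0 is the product of w_hidden * (1 - h_t^2) over t."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_hidden * (1.0 - h * h)
    return grad


# The influence of the first state on the last one decays with sequence length.
for length in (5, 20, 50):
    states = rnn_forward([0.5] * length, w_hidden=0.5, w_input=1.0)
    print(length, gradient_wrt_initial_state(states, w_hidden=0.5))
```

Each step multiplies the gradient by a factor of at most `|w_hidden|` in magnitude, so with `|w_hidden| < 1` it shrinks exponentially. LSTM and GRU gates add a more direct path for the state, which is what lets them keep longer dependencies.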

---

## 🔹 2. Attention Mechanisms – Beyond RNNs  
**Goal**: Focus on relevant parts of the input sequence during prediction.

- **Additive Attention (Bahdanau)**: Uses learnable weights via feedforward layers for alignment scoring.  
- **Multiplicative Attention (Luong)**: Scores alignment with a dot product between query and key vectors; cheaper to compute than additive scoring.  
- **Cross-Attention**: Decoder states query the encoder outputs; central to sequence-to-sequence models.  
- **Self-Attention**: Each token attends to all others in the sequence; core to Transformers.  
- **Flash Attention**: An exact, IO-aware implementation of attention that avoids storing the full attention matrix, reducing memory use and runtime.
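
The scoring functions in the list above can be compared side by side. The sketch below is a simplification: it omits the learned query/key/value projections and the scaling factor used in Transformers, and the parameter names `W_q`, `W_k`, and `v` are assumptions for the additive scorer.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_score(query, key):
    """Luong-style score: dot product of query and key."""
    return dot(query, key)


def additive_score(query, key, W_q, W_k, v):
    """Bahdanau-style score: v . tanh(W_q q + W_k k), a small feedforward net."""
    hidden = [math.tanh(dot(rq, query) + dot(rk, key)) for rq, rk in zip(W_q, W_k)]
    return dot(v, hidden)


def attend(query, keys, values, score_fn):
    """Turn scores into weights with softmax, then average the values."""
    weights = softmax([score_fn(query, k) for k in keys])
    context = [sum(w * val[d] for w, val in zip(weights, values))
               for d in range(len(values[0]))]
    return weights, context


def self_attention(tokens):
    """Every token acts as a query over all tokens of the same sequence."""
    return [attend(t, tokens, tokens, multiplicative_score)[1] for t in tokens]
```

Cross-attention uses the same `attend` function with the query taken from the decoder and the keys and values taken from the encoder. In self-attention all three come from one sequence.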

---

## 🔹 3. Transformer Era – Scalable Parallel Processing  
**Goal**: Use self-attention and parallelism to scale better than RNNs.

### 🔸 Standard Transformer (2017)
- **Architecture**: Encoder-Decoder with self-attention; introduced positional encoding and multi-head attention.
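
Self-attention has no built-in notion of token order, so the 2017 architecture adds a position signal to each token embedding. Below is a minimal sketch of the fixed sinusoidal encoding from the original Transformer paper. The function name and the list-of-lists layout are assumptions for illustration.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position signals.

    Even dimensions use sine and odd dimensions use cosine; each sin/cos pair
    shares one wavelength, and wavelengths grow geometrically up to 10000 * 2*pi.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Many later models replace this fixed table with learned or relative position embeddings, but the purpose is the same.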

### 🔸 Efficient Transformers
- **BigBird / Longformer**: Handle long sequences using sparse or windowed attention patterns.  
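- **Reformer**: Cuts memory use with locality-sensitive-hashing (LSH) attention and reversible residual layers.  
- **Switch Transformer**: Sparse Mixture-of-Experts (MoE) model that routes each token to a single expert, adding parameters without a matching increase in per-token compute.  
- **Performer / Linformer**: Approximate attention (with random feature maps and low-rank projections, respectively) to reduce complexity from quadratic to linear in sequence length.  
- **Flash Attention**: Computes exact attention in tiles that fit in fast on-chip GPU memory, so the full attention matrix is never stored.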
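
The memory saving in Flash Attention rests on computing softmax block by block while rescaling earlier partial results. The sketch below shows only that rescaling idea for a single query in plain Python. The real method is a fused GPU kernel, and the function names here are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values):
    """Reference: compute every score, softmax them, then average the values."""
    scores = [dot(query, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def blockwise_attention(query, keys, values, block_size):
    """Same result, but only one block of scores is held at a time."""
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator relative to running_max
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [dot(query, k) for k in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the output matches full attention up to floating-point rounding, this is a change in how attention is computed, not an approximation. That sets it apart from the sparse and low-rank methods in the same list.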

### 🔸 Vision & Multimodal Transformers
- **ViT**: Vision Transformer; applies transformer architecture to image patches.  
- **Perceiver IO**: Can handle diverse modalities (text, image, audio) with a unified model.  
- **CLIP / BLIP / Flamingo**: Combine vision and text for tasks like image captioning and retrieval.
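
The "image patches" step in ViT can be sketched as follows: the image is cut into non-overlapping squares and each square is flattened into a vector that plays the role of a token. The function name `patchify` and the nested-list image format are assumptions. A real ViT also projects each patch linearly and adds position embeddings, which this sketch omits.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
print(patchify(image, patch_size=2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```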

---

## 🔹 4. Pretrained Transformers – Generalizable Models  
**Goal**: Use self-supervised learning at scale to generalize to many tasks.

### 🔸 Encoder-only (Bidirectional)
- **BERT**: Bidirectional masked language model for contextual word embeddings.  
- **RoBERTa**: Robustly optimized BERT with more data and training steps.  
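- **DistilBERT**: Distilled BERT that is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance, as reported by its authors.  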
- **ALBERT**: Lightweight BERT with cross-layer parameter sharing.  
- **DeBERTa**: Uses disentangled attention for better token representation.  
- **ELECTRA**: Trains a discriminator to detect which tokens a small generator has replaced, instead of predicting masked tokens.
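
The masked-language-model objective behind BERT can be sketched as a data-preparation step: hide some tokens and keep the originals as prediction targets. This is a simplified sketch. It always substitutes `[MASK]`, whereas BERT sometimes keeps or randomly replaces the chosen token. The 15% default follows the BERT paper, and the function name and seed are assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels holds the original token at masked positions and None elsewhere,
    so the loss is computed only where the model has to fill a gap.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


sentence = "the model reads context on both sides of each gap".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3)
print(inputs)
print(labels)
```

Because the model sees tokens on both sides of each gap, the learned representations are bidirectional. ELECTRA's variant labels every position as original or replaced, so every token contributes to the loss.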

### 🔸 Encoder-Decoder
- **T5**: Text-to-Text Transfer Transformer; unifies all NLP tasks as text-to-text.  
- **mT5**: Multilingual T5 supporting many languages.  
- **PEGASUS**: Pretraining optimized for abstractive summarization.  
- **BART**: Denoising autoencoder + seq2seq; good for text generation.  
- **MarianMT**: Efficient multilingual translation model.
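
T5's text-to-text framing means every task, including classification, becomes "string in, string out". A minimal sketch of that formatting step follows. The helper name is an assumption, and the example sentences are illustrative placeholders, not samples from a real dataset.

```python
def to_text_to_text(task_prefix, source, target):
    """Cast one supervised example as an (input text, target text) pair."""
    return f"{task_prefix}: {source}", target


examples = [
    # Translation: the target is the translated sentence.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Summarization: the target is a shorter text.
    to_text_to_text("summarize", "A long article about attention ...", "Attention summary."),
    # Classification: the label itself is emitted as text.
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
]

for model_input, model_target in examples:
    print(repr(model_input), "->", repr(model_target))
```

One encoder-decoder model with one loss can then be trained on all of these pairs, which is the unification the T5 entry refers to.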

### 🔸 Hybrid Models
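- **XLNet**: Autoregressive model trained with permutation language modeling; it predicts tokens under many sampled factorization orders, so it captures bidirectional context without `[MASK]` tokens.  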
- **ERNIE**: Injects structured knowledge from knowledge graphs into BERT-style models.

---

## 🔹 5. Generative Models – Foundation of LLMs  
**Goal**: Autoregressively generate high-quality and coherent text.

- **GPT-1**: Decoder-only Transformer pretrained as a language model, then fine-tuned for each task.  
- **GPT-2**: 1.5B-parameter model; showed coherent multi-paragraph generation and zero-shot task transfer.  
- **GPT-3**: 175B parameters; showed few-shot in-context learning from prompts alone, without fine-tuning.  
- **GPT-4**: Multimodal LLM (accepts image and text input) with stronger reasoning and alignment than its predecessors.  
- **Open-Source**: LLaMA, Mistral, DeepSeek, Falcon; openly released models used as efficient alternatives for research and deployment.
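
Autoregressive generation, the common thread in the models above, is a loop: predict a distribution over the next token, pick one, append it, repeat. The sketch below uses greedy selection and a hand-written bigram table as a stand-in for a real model. `BIGRAMS`, `toy_model`, and the `<eos>` marker are assumptions for illustration.

```python
def generate(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # dict: token -> probability
        next_tok = max(probs, key=probs.get)  # greedy choice
        if next_tok == eos:
            break
        tokens.append(next_tok)
    return tokens


# Toy stand-in for an LLM: the next token depends only on the last one.
BIGRAMS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}


def toy_model(tokens):
    return BIGRAMS[tokens[-1]]


print(generate(toy_model, ["the"], max_new_tokens=5))
# ['the', 'cat', 'sat', 'on', 'the', 'cat']
```

A real LLM conditions on the whole prefix, not only the last token, and usually samples with temperature or nucleus sampling. Greedy decoding, as the toy output shows, tends to fall into loops.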

---