<h3>🧠 Types of Neural Network Architectures</h3>

<div style="font-size: 12px; width: 90%;">

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Layers</th>
      <th>Params</th>
      <th>Train Time</th>
      <th>Complexity</th>
      <th>Applications</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Shallow</td>
      <td>1–3</td>
      <td>1K–10K</td>
      <td>Minutes–Hours</td>
      <td>Simple</td>
      <td>Linear reg; Binary cls; Feature extract</td>
    </tr>
    <tr>
      <td>Medium</td>
      <td>4–10</td>
      <td>10K–1M</td>
      <td>Hours–Days</td>
      <td>Moderate</td>
      <td>Image cls (CNN); LM (RNN); Speech recog</td>
    </tr>
    <tr>
      <td>Deep</td>
      <td>11–50</td>
      <td>1M–10M</td>
      <td>Days–Weeks</td>
      <td>High</td>
      <td>Obj detect (YOLO); Seg (FCN); NLP</td>
    </tr>
    <tr>
      <td>Very Deep</td>
      <td>51–100+</td>
      <td>10M–100M+</td>
      <td>Weeks–Months</td>
      <td>Very High</td>
      <td>SOTA vision; Adv NLP (QA, MT)</td>
    </tr>
  </tbody>
</table>

</div>
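
The "Params" column counts trainable weights and biases. As a rough illustration of how such a figure arises for a fully connected network, the sketch below sums them layer by layer. The function name `count_mlp_params` and the layer widths are hypothetical, chosen only for this example.

```python
def count_mlp_params(layer_sizes):
    """Count weights + biases of a fully connected (dense) network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix between two layers
        total += fan_out           # one bias per output unit
    return total


# Hypothetical shallow network: 20 inputs, hidden layers of 64 and 32, 1 output.
print(count_mlp_params([20, 64, 32, 1]))  # 3457
```

Convolutional and attention layers follow different counting rules, so this applies to dense layers only.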



✅ **Universal Approximation Theorem**:  
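A feedforward network with a single hidden layer can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy, provided it has enough hidden neurons and a suitable non-polynomial activation function (e.g., sigmoid or ReLU). The theorem only guarantees that such a network exists; it does not say how many neurons are needed or that training will find the right weights.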
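
A small constructive illustration of the idea, not a proof: in one dimension, a single hidden layer of ReLU units can be wired by hand to interpolate a target function piecewise-linearly, and the error shrinks as hidden units are added. The helper names (`build_relu_net`, `net`, `max_approx_error`) and the choice of `sin` on `[0, 2π]` are assumptions made for this sketch.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_net(f, a, b, n_hidden):
    """Hand-set weights so that bias + sum_i w_i * relu(x - knot_i)
    interpolates f at n_hidden + 1 equally spaced points on [a, b]."""
    step = (b - a) / n_hidden
    knots = [a + i * step for i in range(n_hidden)]
    values = [f(a + i * step) for i in range(n_hidden + 1)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(n_hidden)]
    # Each hidden unit contributes the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return knots, weights, values[0]


def net(x, knots, weights, bias):
    """One hidden ReLU layer followed by a linear output unit."""
    return bias + sum(w * relu(x - k) for k, w in zip(knots, weights))


def max_approx_error(f, a, b, n_hidden, samples=1000):
    params = build_relu_net(f, a, b, n_hidden)
    xs = (a + (b - a) * j / samples for j in range(samples + 1))
    return max(abs(f(x) - net(x, *params)) for x in xs)


for n in (8, 64):
    print(n, max_approx_error(math.sin, 0.0, 2 * math.pi, n))
```

The weights here are set directly rather than learned, which sidesteps the question of whether gradient descent would find them.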

<h3 style="font-size:16px; margin-bottom:6px;">📦 Deep Learning Architectures</h3>

<style>
  .compact-grid {
    display: grid;
    grid-template-columns: 160px auto;
    row-gap: 2px;
    font-size: 13px;
    font-family: sans-serif;
  }
  .compact-grid b {
    white-space: nowrap;
  }
</style>

<b style="font-size:14px;">1. ANN Architectures</b>
<div class="compact-grid">
  <b>McCulloch-Pitts</b> <span>Binary threshold model.</span>
  <b>Hebbian Network</b> <span>“Neurons that fire together, wire together.”</span>
  <b>Perceptron / MLP</b> <span>Basic/stacked feedforward layers.</span>
  <b>ADALINE / MADALINE</b> <span>Linear adaptive models.</span>
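  <b>Backpropagation</b> <span>Gradient-based, error-driven training algorithm for MLPs (a learning method rather than an architecture).</span>
  <b>RBF Networks</b> <span>Hidden units use radial basis functions, typically Gaussian.</span>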
</div>
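
To make the "binary threshold" and "perceptron" entries above concrete, here is a minimal sketch of a two-input threshold unit trained with the classic perceptron learning rule on the logical AND function. The function names, learning rate, and epoch count are illustrative assumptions.

```python
def step(z):
    """Binary threshold activation, as in a McCulloch-Pitts unit."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule: nudge weights by lr * (target - prediction) * input."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            err = target - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_DATA)
print([step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in AND_DATA])  # [0, 0, 0, 1]
```

AND is linearly separable, so a single unit suffices; a function such as XOR would need a hidden layer (an MLP).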

<b style="font-size:14px;">2. Vision Models</b>
<div class="compact-grid">
  <b>Image Classification</b> <span>Label entire image.</span>
  <b>Object Detection</b> <span>Locate + classify objects.</span>
  <b>Image Segmentation</b> <span>Semantic, Instance, Panoptic.</span>
</div>

<b style="font-size:14px;">3. NLP Models</b>
<div class="compact-grid">
  <b>RNN Family</b> <span>RNN, LSTM, GRU, etc.</span>
  <b>Transformer</b> <span>Self-attention architecture.</span>
  <b>Pretrained</b> <span>BERT, RoBERTa, T5...</span>
  <b>LLMs</b> <span>GPT, LLaMA — autoregressive.</span>
</div>

<b style="font-size:14px;">4. Generative Models</b>
<div class="compact-grid">
  <b>Autoencoders</b> <span>AE, VAE — latent encoding.</span>
  <b>GANs</b> <span>Adversarial generation.</span>
  <b>Flow Models</b> <span>RealNVP, Glow.</span>
  <b>Diffusion</b> <span>DDPM, Stable Diffusion.</span>
</div>


<h2 style="font-size:20px;">🧠 NLP Model Timeline (1986–2025)</h2>

<table style="font-size:13px; font-family:sans-serif; border-collapse: collapse; margin-bottom: 16px;">
  <tr><td><b>1986</b></td><td>RNN – Basic sequential modeling</td></tr>
  <tr><td><b>1997</b></td><td>LSTM – Long-term memory via gating; Bi-RNN</td></tr>
  <tr><td><b>2014</b></td><td>GRU, Stacked RNN, Seq2Seq, Additive Attention (Bahdanau)</td></tr>
  <tr><td><b>2015</b></td><td>Multiplicative Attention (Luong)</td></tr>
  <tr><td><b>2017</b></td><td>Transformer – Self & Cross Attention</td></tr>
  <tr><td><b>2018</b></td><td>BERT, GPT‑1</td></tr>
  <tr><td><b>2019</b></td><td>RoBERTa, DistilBERT, ALBERT, XLNet, ERNIE, GPT‑2, MarianMT, BART</td></tr>
  <tr><td><b>2020</b></td><td>T5/mT5, GPT‑3, ELECTRA, DeBERTa, PEGASUS, ViT, Reformer, Linformer, Performer, BigBird, Longformer</td></tr>
  <tr><td><b>2021</b></td><td>Switch Transformer, CLIP</td></tr>
  <tr><td><b>2022</b></td><td>FlashAttention, BLIP, Flamingo, Perceiver IO</td></tr>
  <tr><td><b>2023</b></td><td>GPT‑4, LLaMA‑1/2, Mistral 7B, Mixtral 8x7B (MoE), DeepSeek (LLM + Coder)</td></tr>
  <tr><td><b>2024</b></td><td>LLaMA‑3.0 (Apr), LLaMA‑3.1 / Mistral Large 2 (Jul), DeepSeek‑R1‑Lite preview (Nov), DeepSeek‑V3 (Dec)</td></tr>
  <tr><td><b>2025</b></td><td>DeepSeek‑R1 (Jan), DeepSeek‑V3‑0324 (Mar), Devstral Small 2505 (May)</td></tr>
</table>

<hr>

<h3 style="font-size:16px;">🔹 1. RNN Family – Early Sequence Models</h3>
<p><i>Goal:</i> Handle sequential data by using memory of past tokens.</p>

<ul style="font-size:13px;">
  <li><b>RNN:</b> Learns temporal patterns, but suffers from vanishing gradients.</li>
  <li><b>LSTM:</b> Adds gates (input, forget, output) for long-term memory.</li>
  <li><b>GRU:</b> Uses update/reset gates; simpler than LSTM.</li>
  <li><b>Bidirectional RNN:</b> Processes sequences in both directions.</li>
  <li><b>Stacked RNN:</b> Multiple RNN layers stacked for deeper features.</li>
  <li><b>Encoder-Decoder:</b> Maps input → output sequences, used in translation.</li>
</ul>
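
The vanishing-gradient problem mentioned for the plain RNN can be seen in a scalar toy model. The sketch below assumes a one-unit recurrence `h_t = tanh(w_h * h_{t-1} + w_x * x_t)` and tracks how strongly the final state depends on the initial one. The function name and the weight values are hypothetical.

```python
import math


def rnn_gradient_through_time(w_h, w_x, inputs, h0=0.0):
    """Run a scalar RNN and return (h_T, d h_T / d h_0)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
    return h, grad


for steps in (5, 50):
    _, g = rnn_gradient_through_time(w_h=0.5, w_x=1.0, inputs=[1.0] * steps)
    print(steps, abs(g))
```

Each step multiplies the gradient by a factor below 1 in magnitude, so it decays roughly geometrically with sequence length. Gated cells such as LSTM and GRU were designed to soften this decay.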

<h3 style="font-size:16px;">🔹 2. Attention Mechanisms – Beyond RNNs</h3>
<p><i>Goal:</i> Focus on relevant input parts during prediction.</p>

<ul style="font-size:13px;">
  <li><b>Additive Attention:</b> Feedforward-based alignment (Bahdanau).</li>
  <li><b>Multiplicative Attention:</b> Dot-product based (Luong).</li>
  <li><b>Cross-Attention:</b> Decoder attends to encoder outputs.</li>
  <li><b>Self-Attention:</b> Each token attends to every token in the same sequence; basis of Transformers.</li>
  <li><b>Flash Attention:</b> IO-aware exact attention kernel; faster and more memory-efficient without changing the result.</li>
</ul>
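
The scoring styles listed above differ only in how a query is compared with each key. The following is a minimal pure-Python sketch, assuming small list-based vectors. The parameter names `W_q`, `W_k`, and `v` for the additive form are illustrative, and the `1/sqrt(d)` scaling in `self_attention` follows the common Transformer convention.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_score(query, key):
    """Luong-style score: plain dot product."""
    return dot(query, key)


def additive_score(query, key, W_q, W_k, v):
    """Bahdanau-style score: v . tanh(W_q q + W_k k)."""
    hidden = [math.tanh(dot(rq, query) + dot(rk, key)) for rq, rk in zip(W_q, W_k)]
    return dot(v, hidden)


def attend(query, keys, values, score_fn):
    """Return (attention weights, weighted sum of values)."""
    weights = softmax([score_fn(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
    return weights, context


def self_attention(tokens):
    """Scaled dot-product self-attention: queries, keys, values are the same sequence."""
    scale = math.sqrt(len(tokens[0]))
    scaled = lambda q, k: dot(q, k) / scale
    return [attend(q, tokens, tokens, scaled)[1] for q in tokens]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
print(attend([1.0, 0.0], keys, values, multiplicative_score))
```

Cross-attention uses the same `attend` call with the query taken from the decoder and the keys and values taken from the encoder outputs. Real models also apply learned projections to queries, keys, and values, which this sketch omits.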

<h3 style="font-size:16px;">🔹 3. Transformer Era – Scalable Parallel Processing</h3>
<p><i>Goal:</i> Leverage parallelism & self-attention to scale beyond RNNs.</p>

<ul style="font-size:13px;">
  <li><b>Transformer (2017):</b> Encoder-decoder stack with multi-head attention and positional encoding.</li>
</ul>
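
Because self-attention has no built-in notion of token order, the Transformer adds a positional encoding to each token embedding. Below is a sketch of the sinusoidal variant, assuming the usual base of 10000; the function name is hypothetical, and many later models use learned or rotary encodings instead.

```python
import math


def positional_encoding(num_positions, d_model):
    """Even dimensions use sin, odd dimensions use cos, with wavelengths
    that grow geometrically across the embedding dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```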

<b>Efficient Transformers</b>
<ul style="font-size:13px;">
  <li><b>BigBird / Longformer:</b> Sparse/windowed attention for long texts.</li>
  <li><b>Reformer:</b> Memory-efficient using hashing + reversible layers.</li>
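  <li><b>Switch Transformer:</b> Sparse mixture-of-experts (MoE); routes each token to a single expert.</li>
  <li><b>Performer / Linformer:</b> Approximate attention with linear time and memory in sequence length.</li>
  <li><b>Flash Attention:</b> GPU-optimized exact attention; tiling reduces memory traffic.</li>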
</ul>
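
The windowed attention idea behind models such as Longformer can be pictured as a mask over token pairs. This sketch assumes a symmetric local window only; it leaves out the global and random connections that those models also use, and the function names are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j,
    i.e. when the two positions are at most `window` apart."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def count_allowed(mask):
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(seq_len=8, window=1)
print(count_allowed(mask), "of", 8 * 8)  # 22 of 64
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically.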

<b>Vision & Multimodal Transformers</b>
<ul style="font-size:13px;">
  <li><b>ViT:</b> Transformer on image patches.</li>
  <li><b>Perceiver IO:</b> Handles text, vision, audio.</li>
  <li><b>CLIP / BLIP / Flamingo:</b> Vision+language tasks.</li>
</ul>
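
"Transformer on image patches" means the image is first cut into fixed-size tiles, and each tile is flattened into one token. Here is a minimal sketch, assuming a single-channel image stored as a nested list. The function name is hypothetical, and the learned linear projection that follows in ViT is omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping patches, each flattened row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
print(image_to_patches(image, patch_size=2))
```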

<h3 style="font-size:16px;">🔹 4. Pretrained Transformers – Generalizable Models</h3>
<p><i>Goal:</i> Learn general-purpose representations via self-supervised training.</p>

<b>Encoder-only</b>  
<ul style="font-size:13px;">
  <li><b>BERT:</b> Masked LM; bidirectional context.</li>
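  <li><b>RoBERTa:</b> BERT recipe with longer training, more data, dynamic masking, and no next-sentence prediction.</li>
  <li><b>DistilBERT:</b> Distilled BERT; about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance.</li>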
  <li><b>ALBERT:</b> Lightweight BERT with parameter sharing.</li>
  <li><b>DeBERTa:</b> Disentangled attention mechanism.</li>
  <li><b>ELECTRA:</b> Detects replaced tokens (generator-discriminator).</li>
</ul>
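
The masked-LM objective used by BERT can be sketched as an input-corruption step. The version below assumes the commonly cited recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% a random token, and 10% stay unchanged. Selecting each token independently is a simplification, and the function name is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted inputs, labels). Labels are None where no prediction is required."""
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens, vocab=tokens, rng=random.Random(0)))
```

ELECTRA replaces this objective with a discriminator that classifies every token as original or replaced, so it gets a training signal from all positions rather than only the selected ones.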

<b>Encoder-Decoder</b>  
<ul style="font-size:13px;">
  <li><b>T5 / mT5:</b> Casts every task as text-to-text; mT5 is the multilingual variant.</li>
  <li><b>PEGASUS:</b> Summarization-focused pretraining.</li>
  <li><b>BART:</b> Denoising + sequence-to-sequence generation.</li>
  <li><b>MarianMT:</b> Efficient translation models built on the Marian framework, typically one model per language pair.</li>
</ul>

<b>Hybrid</b>  
<ul style="font-size:13px;">
  <li><b>XLNet:</b> Permutation language modeling; autoregressive training that still captures bidirectional context.</li>
  <li><b>ERNIE:</b> Integrates structured knowledge (e.g., KG).</li>
</ul>

<h3 style="font-size:16px;">🔹 5. Generative Models – Foundation of LLMs</h3>
<p><i>Goal:</i> Autoregressive generation of fluent, coherent text.</p>

<ul style="font-size:13px;">
  <li><b>GPT-1:</b> Decoder-only Transformer.</li>
  <li><b>GPT-2:</b> 1.5B params; coherent multi-paragraph generation.</li>
  <li><b>GPT-3:</b> 175B params; few-shot in-context learning from prompts.</li>
  <li><b>GPT-4:</b> Multimodal input (text and images); stronger reasoning and alignment.</li>
  <li><b>Open-Source:</b> LLaMA, Mistral, DeepSeek, Falcon.</li>
</ul>
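
Autoregressive generation means predicting one token at a time and feeding each prediction back in as context. The sketch below uses a toy bigram table in place of a real language model. The table, the `<eos>` marker, and the function names are invented for illustration, and greedy selection is only one of several decoding strategies.

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    """Repeatedly append the most probable next token until eos or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Toy stand-in for a language model: looks only at the last token.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 0.9, "the": 0.1},
}


def toy_model(tokens):
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


print(greedy_decode(toy_model, ["the"], max_new_tokens=10))
# ['the', 'cat', 'sat', '<eos>']
```

A GPT-style model plays the role of `toy_model` but conditions on the whole prefix through masked self-attention; sampling with temperature or top-k can replace the `max` step.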
