# The Bitter Lesson

Source: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

# Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings

Source: https://arxiv.org/html/2410.09649v1


# 📊 Comparison: Feature Engineering vs. Network Engineering

| **Aspect**            | **Feature Engineering**                                                                 | **Network Engineering**                                                                 |
|------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| **Definition**         | Manually designing, selecting, and transforming input variables to improve model performance. | Designing and optimizing neural network architectures (layers, connections, modules) to learn features automatically. |
| **Era of Dominance**   | Pre-deep learning era (1950s–2010s), especially with classical ML models (SVMs, logistic regression, decision trees). | Deep learning era (2010s–present), where models learn features end-to-end. |
| **Who Designs Features** | Human experts (domain knowledge crucial).                                               | Neural networks (architectures + training automatically discover features). |
| **Techniques**         | PCA, normalization, polynomial features, bag-of-words encodings, handcrafted filters (e.g., HOG, SIFT in vision). | CNNs (convolutions), RNNs (temporal patterns), Transformers (attention), ResNets (residual connections). |
| **Advantages**         | - Exploits domain expertise. <br> - Works well with small data. <br> - Lower computational demand. | - Scales to large, unstructured data. <br> - Learns hierarchical representations. <br> - Reduces reliance on human-crafted features. |
| **Limitations**        | - Time-consuming & labor-intensive. <br> - May miss hidden patterns. <br> - Hard to generalize. | - Data-hungry & compute-intensive. <br> - Risk of overfitting without regularization. <br> - Requires architecture search expertise. |
| **Examples**           | - TF-IDF in NLP. <br> - Handcrafted medical features (e.g., tumor size). <br> - Statistical moments in finance. | - AlexNet (2012) replacing handcrafted vision features. <br> - BERT/GPT learning embeddings from raw text. <br> - AlphaFold predicting protein structures. |
| **Philosophy**         | *“Features are engineered by humans, models are simple.”*                                | *“Networks engineer the features, humans design the architectures.”* |

---

## ✅ Summary Insight
* **Feature Engineering** dominated early AI by encoding **human knowledge** into models.  
* **Network Engineering** defines modern deep learning by building **architectures that learn representations automatically**.  
* Today, both coexist: preprocessing and normalization remain useful, but deep architectures have largely replaced manual feature crafting in **vision, NLP, and multimodal AI**.  
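
To make the "Feature Engineering" column concrete, here is a minimal sketch of a handcrafted text feature, TF-IDF, computed from scratch. The function name `tf_idf`, the toy corpus, and the specific weighting variant (length-normalized term frequency times `log(N / df)`) are illustrative assumptions; real libraries use several slightly different formulas.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, invented for illustration only.
docs = ["the cat sat", "the dog sat", "the cat ran"]
w = tf_idf(docs)
print(w[1])  # "the" appears everywhere, so its weight is zero; "dog" is rarest.
```

The formula is fixed by a human in advance. In the "Network Engineering" column, the equivalent step is a learned embedding whose weights are fit from data.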


# 🕰️ Timeline: From Feature Engineering to Network Engineering in AI

---

| **Era**        | **Field**   | **Dominant Approach**          | **Key Techniques / Models**                                       | **Landmark Papers & Authors** |
|----------------|-------------|--------------------------------|-------------------------------------------------------------------|--------------------------------|
| **1950s–1970s** | General AI  | Handcrafted Features           | Symbolic AI, rule-based logic, perceptrons (limited capacity)     | McCarthy (1955, AI term), Rosenblatt (1958, Perceptron) |
| **1980s**      | Vision      | Early Feature Extraction       | Edge detection (Sobel, Canny), handcrafted filters                | Canny (1986, Edge Detection) |
|                | Speech      | Manual Features                | LPC (Linear Predictive Coding), MFCC (Mel-Frequency Cepstral Coefficients) | Davis & Mermelstein (1980, MFCC) |
|                | NLP         | Statistical Features           | n-grams, POS tagging, curated lexicons                           | Jelinek (1980s, Statistical LM) |
| **1990s**      | Vision      | Feature Engineering Peak       | SIFT (Scale-Invariant Feature Transform)                          | Lowe (1999, SIFT) |
|                | NLP         | Feature-Based ML               | TF-IDF, bag-of-words, feature-rich linear models                  | Salton & Buckley (1988, TF-IDF term weighting) |
|                | Speech      | Engineered Features + HMMs     | MFCC + Hidden Markov Models (HMMs)                               | Rabiner (1989, HMM Tutorial) |
| **2000s**      | Vision      | Sophisticated Features         | HOG (Histograms of Oriented Gradients), SURF, Gabor filters, handcrafted descriptors | Dalal & Triggs (2005, HOG); Bay et al. (2006, SURF) |
|                | NLP         | Embeddings (transition phase)  | Neural LMs, leading to Word2Vec in the early 2010s (representation learning) | Bengio et al. (2003, Neural LM); Mikolov et al. (2013, Word2Vec) |
|                | Speech      | Hybrid ML                      | GMM-HMM, with DNN-HMM hybrids emerging at the decade's end        | Hinton et al. (2012, DNN acoustic modeling) |
| **2010–2015**  | Vision      | Network Engineering Takes Over | CNNs replace handcrafted features (end-to-end learning)           | Krizhevsky et al. (2012, AlexNet) |
|                | NLP         | Neural Feature Learning        | RNNs, LSTMs, Seq2Seq, Attention                                  | Sutskever et al. (2014, Seq2Seq); Cho et al. (2014, RNN encoder-decoder); Bahdanau et al. (2014, Attention) |
|                | Speech      | Deep Neural Features           | End-to-end DNNs for recognition                                  | Hannun et al. (2014, Deep Speech) |
| **2016–2020**  | Vision      | Deep Architectures             | ResNet, Inception, EfficientNet                                  | He et al. (2015, ResNet) |
|                | NLP         | Transformer Era                | BERT, GPT series                                                 | Vaswani et al. (2017, Attention); Devlin et al. (2019, BERT) |
|                | Multi-Modal | Joint Feature Learning         | CLIP (image+text), Video Transformers                            | Radford et al. (2021, CLIP) |
| **2021–2025**  | Vision & NLP| Foundation Models              | ViTs, multimodal LLMs, diffusion models                          | Dosovitskiy et al. (2020, ViT); Ho et al. (2020, Diffusion) |
|                | Multi-Modal | Unified Architectures          | GPT-4, Gemini, LLaVA, Flamingo                                   | OpenAI (2023, GPT-4); DeepMind (2022, Flamingo) |
|                | General AI  | AutoML / NAS                   | Automated neural architecture search                             | Zoph & Le (2017, NAS with reinforcement learning) |

---

## 🔑 Evolution Insight

- **1950s–1990s → Feature Engineering Era**  
  Domain experts manually designed features (edges, MFCCs, TF-IDF).  
  Models were shallow (SVMs, HMMs, linear regression).  

- **2000s → Transition**  
  Shift toward **learned representations** (neural language models, followed by Word2Vec in 2013).  
  Hybrid approaches: handcrafted features + shallow nets.  

- **2010s → Network Engineering Era**  
  CNNs (vision), RNNs (sequence), Transformers (attention).  
  Networks learned hierarchical features end-to-end.  

- **2020s → Foundation Models & AutoML**  
  Large multimodal models unify representation learning (LLMs, ViTs, diffusion).  
  Network engineering increasingly **automated** via NAS/AutoML.  
  Feature engineering reduced to **preprocessing & normalization**.  

---


# 🚀 Summary in Key Points: The Bitter Lesson in Action (CVPR 2005–2024)

---

## 🔑 The Bitter Lesson
- Rich Sutton (2019): AI progress comes from **general-purpose, compute-heavy methods**, not handcrafted heuristics.  
- Computer Vision (CV) exemplifies this trajectory: **SIFT/HOG → CNNs → Transformers → Foundation Models**.  

---

## 📊 Study Design
- Dataset: **20 years of CVPR papers (2005–2024)**.  
- Method: Large Language Models (LLMs) rated titles & abstracts on 5 “Bitter Lesson” dimensions:  
  1. **Learning > Engineering**  
  2. **Search > Heuristics**  
  3. **Scalability with Computation**  
  4. **Generality > Specificity**  
  5. **Fundamental Principles > Tricks**  
- Analysis: Correlated alignment scores with **citation counts**.  
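
As a rough sketch of the last step, the snippet below computes a Spearman rank correlation between per-paper alignment scores and citation counts. This is an assumption about how such an association could be measured, not the paper's actual statistical procedure, and the numbers are made-up placeholders rather than data from the study.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder values, NOT results from the CVPR study.
alignment_scores = [1, 2, 2, 4, 5]      # hypothetical LLM ratings
citation_counts = [3, 10, 8, 40, 120]   # hypothetical citation counts
rho = spearman(alignment_scores, citation_counts)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based measure is used here because citation counts are heavily skewed, so a few highly cited papers would otherwise dominate a plain Pearson correlation.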

---

## 📈 Core Findings
- **Learning over Engineering**: Strong upward trend post-2012 (deep learning era).  
- **Scalability with Computation**: Rapid rise with AlexNet, ResNet, scaling laws.  
- **Generality**: Growth accelerates with foundation models.  
- **Search over Heuristics**: Stagnant — explicit search rarely adopted.  
- **Citation Impact**: Post-2015, papers aligning with scalability & learning gained **more citations**.  

---

## 🌊 Impactful Era (2015–2020)
- Sharpest increase in **bitter lesson alignment**.  
- Predictive power of alignment ↔ citations peaked.  
- Follows the **AlexNet (2012) and ResNet (2015) breakthroughs on ImageNet**.  

---

## 🔮 Future Outlook
- **Inference-time search** (e.g., OpenAI o1 models, AlphaGo’s MCTS) = next frontier.  
- Expect CV to further embrace **scalable, compute-driven learning**, reducing reliance on heuristics.  

---

## 🧩 Meta Contribution
- Methodology itself embodies Sutton’s vision:  
  - Used **LLMs (fruits of Bitter Lesson)** to analyze research trends.  
  - Demonstrates automation accelerating **scientific meta-analysis**.  

---

## ✨ One-Line Takeaway
The last two decades of CVPR confirm Sutton’s Bitter Lesson: **the more computer vision embraced scalable, general, compute-driven learning, the more impactful the research became** — and the future leans toward even deeper reliance on computation and search.  


# 📊 Evolution of Computer Vision Through the Lens of *The Bitter Lesson*

| Era | Dominant Approach | Key Traits | Landmark Examples | Alignment with Bitter Lesson |
|-----|------------------|------------|-------------------|------------------------------|
| **Pre-2012: Handcrafted Era** | Feature Engineering | - Manual design (SIFT, HOG, Haar cascades) <br> - Strong reliance on domain expertise <br> - Limited scalability with compute | SIFT (Lowe, 1999), HOG (Dalal & Triggs, 2005) | **Low**: Engineering > Learning, little scalability |
| **2012–2015: Deep Learning Breakthrough** | CNN Revolution | - Data-driven hierarchical features <br> - GPU-accelerated training <br> - ImageNet-scale datasets | AlexNet (2012), VGG (2014), ResNet (2015) | **Medium–High**: Clear shift to learning & compute |
| **2016–2020: Scaling Deep Nets** | Large CNNs & Early Transformers | - Rapid architecture innovations (ResNet, Inception, EfficientNet) <br> - Scaling depth, width, and data <br> - Early multimodal exploration | ResNet (2015), Transformer (2017), EfficientNet (2019) | **High**: Scalability & generality dominate |
| **2020–2024: Foundation Models Era** | General Models + Multimodality | - Vision–Language models (CLIP, ALIGN, Florence) <br> - Few-shot & zero-shot generalization <br> - Universal transferable features | CLIP (2021), Florence (2021), ViT (2020) | **Very High**: Maximum scalability & generality |
| **2024–Future: Inference-Time Scaling** | Compute-as-Search | - Dynamic compute at inference (o1 models, AlphaGo’s MCTS) <br> - Simulating strategies & test-time optimization <br> - Less human heuristics, more automated exploration | OpenAI o1 (2024), AlphaGo (2016, precursor) | **Emerging Frontier**: Search over heuristics returns |

---

## 🔑 Teaching Insight

- **Handcrafted Era** → Engineers wrote features.  
- **Deep Learning Era** → Networks learned features.  
- **Foundation Models** → Networks learn **generalizable representations**.  
- **Inference-Time Scaling** → Networks may also **learn to search dynamically**.  


# 📌 The Bitter Lesson — Simplified Summary

---

## 🌍 Core Idea
The biggest lesson from **70+ years of AI**:  
👉 **General methods that scale with computation beat human-crafted knowledge in the long run.**

---

## ⚡ Why?
- **Moore’s Law (generalized)** → the cost per unit of computation keeps falling exponentially.  
- **Short-term:** Human knowledge helps.  
- **Long-term:** Scalable methods + computation have consistently won, often by a wide margin.  

---

## 🧩 Examples Across Fields
- **Chess (1997, Deep Blue vs Kasparov):**  
  Hand-built chess knowledge lost → brute-force deep search + special-purpose hardware won.  

- **Go (2016, AlphaGo):**  
  Handcrafted tricks failed → success via massive search + self-play learning.  

- **Speech Recognition (1970s–today):**  
  Phoneme/vocal tract rules failed → statistical HMMs → deep learning won.  

- **Computer Vision:**  
  Handcrafted features (edges, SIFT, HOG) plateaued → CNNs & deep nets dominate.  

---

## 🚫 Why Human Knowledge Hurts Long-Term
- Feels natural to bake in human reasoning.  
- Works in the short term, gives researchers pride.  
- But **plateaus** and blocks scalability.  

---

## ✅ What Actually Works
Two families of methods **scale with compute**:
1. **Search** (tree search, optimization, brute force).  
2. **Learning** (statistical models, deep nets, self-play).  

➡️ These improve automatically as compute grows.  
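
As a toy illustration of the **search** family, the sketch below exhaustively searches a small take-away game. The game rules (remove 1 to 3 stones, last stone wins) and the function name `can_win` are assumptions chosen for brevity. No strategy is hard-coded, yet the search recovers the known "multiples of 4 lose" rule on its own.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def can_win(stones):
    """True if the player to move can force a win.

    Assumed rules: players alternate removing 1, 2, or 3 stones,
    and whoever takes the last stone wins.
    """
    # A position is winning if some legal move leaves the opponent losing.
    # With 0 stones there are no moves, so any() is False: the mover has lost.
    return any(not can_win(stones - take)
               for take in (1, 2, 3) if take <= stones)

losing = [n for n in range(1, 21) if not can_win(n)]
print(losing)  # the losing positions found by search alone
```

Larger games need more compute for the same code. The method itself does not change, which is the point of the lesson.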

---

## 🎯 Big Takeaway
- The world is **too complex to predefine** with rules.  
- Don’t hard-code intelligence (objects, grammar, reasoning).  
- Instead: build systems that **learn and discover for themselves**.  

---

## ✨ In One Sentence
**The Bitter Lesson:**  
Stop hand-coding intelligence.  
Let **search + learning + computation** discover solutions — they always win in the long run.  


# ⚡ The Bitter Lessons of AI

---

## 🎲 Games & Search
- **Chess (1997, Deep Blue)**  
  Handcrafted chess knowledge lost.  
  👉 Brute-force search + compute won.  

- **Go (2016, AlphaGo)**  
  Human-inspired heuristics failed.  
  👉 Massive search + self-play learning won.  

- **Checkers (1950s–2007, ending with Chinook)**  
  Decades of strategy modeling plateaued.  
  👉 Exhaustive search with compute weakly solved the game (2007).  

---

## 🎤 Speech Recognition
- **1970s–2010s**  
  Phoneme models + vocal-tract physics failed.  
  👉 Statistical HMMs → Deep learning won.  

---

## 📚 Natural Language Processing
- Rule-based grammar systems collapsed.  
- 👉 Statistical models → embeddings → transformers won.  

---

## 👁️ Computer Vision
- Edges, SIFT, HOG, and other handcrafted filters were displaced.  
- 👉 CNNs → Vision Transformers → Foundation models won.  

---

## 🌍 Machine Translation
- Symbolic, grammar-based translators failed.  
- 👉 Data-driven Seq2Seq & attention models won.  

---

## 🤖 Robotics & Control
- Hard-coded locomotion (e.g., walking rules) unstable.  
- 👉 Reinforcement learning + large-scale simulation won.  

---

## 🎵 Recommendation Systems
- Human-coded “taste ontologies” failed.  
- 👉 Collaborative filtering + deep models on massive data won.  

---

## 🚗 Autonomous Driving
- Expert-crafted “if–then” rules collapsed in complexity.  
- 👉 End-to-end perception + deep learning policies progressing.  

---

## 💊 Drug Discovery & Bioinformatics
- Human-designed molecular rules plateaued.  
- 👉 Deep generative models + scaling laws win.  

---

## 🕹️ General Game AI
- Hand-coded strategies weak.  
- 👉 Self-play (AlphaZero, MuZero) + search dominate.  

---

## 🧠 Knowledge Representation (GOFAI, 1960s–80s)
- Symbolic expert systems collapsed.  
- 👉 Data-driven statistical learning took over.  

---

## 📈 Scaling Laws (2020s)
- Carefully tuned small models saturate.  
- 👉 Bigger models + more data = predictable gains.  
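
"Predictable gains" usually means that loss follows an approximate power law in model size, which is a straight line in log-log space. The sketch below fits such a line with ordinary least squares. The function name `fit_power_law` and the constants (`5` and `0.1`) are invented for illustration; the data points are synthetic, not measurements from any published scaling study.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha) for the assumed model loss = a * size ** -alpha.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

# Synthetic points generated from loss = 5 * size ** -0.1 (not real data).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5 * n ** -0.1 for n in sizes]
a, alpha = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest "observed" size.
predicted = a * (1e10) ** -alpha
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e10: {predicted:.3f}")
```

Once the exponent is known, the loss of a model ten times larger can be estimated before it is trained.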

---

## 🧬 Neuroscience-Inspired AI
- Direct brain mimicry (biologically detailed neuron models, cognitive architectures) underperformed.  
- 👉 Abstract, scalable learning (Transformers, RL) succeed.  

---

## ⚙️ Optimization
- Hand-tuned heuristics limited.  
- 👉 Gradient descent + large compute optimization won.  
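
For reference, gradient descent in its simplest form is only a few lines. This is a minimal sketch on a one-dimensional quadratic; the function name, learning rate, and step count are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))
```

Nothing in the loop is specific to the problem: the same update applies to any differentiable objective, and more steps simply mean more compute.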

---

## 🔮 Search at Inference (Emerging Frontier)
- Static models plateau.  
- 👉 Inference-time compute (AlphaGo’s MCTS, OpenAI o1) rising.  
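
One simple way to spend extra compute at inference is best-of-N sampling: draw several candidates and keep the one a scorer prefers. The sketch below uses hypothetical stand-ins (a random "model" and a distance-based "verifier"). It is not MCTS and makes no claim about how OpenAI o1 works internally.

```python
import random

def best_of_n(propose, score, n, seed=0):
    """Draw n candidates from propose(rng) and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: the "model" guesses a number in [0, 1),
# and the "verifier" prefers guesses close to a hidden target.
target = 0.7
propose = lambda rng: rng.random()
score = lambda x: -abs(x - target)

errors = {n: abs(best_of_n(propose, score, n) - target) for n in (1, 4, 16, 64)}
for n, err in errors.items():
    print(f"n={n:>2}  error={err:.4f}")
```

Because every run uses the same seed, a larger budget only adds candidates to the earlier set, so the error can never increase as `n` grows. The model is frozen; only the inference budget changes.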

---

## ✨ Meta-Lesson
Across all domains, the Bitter Lesson repeats:  
👉 Human-designed tricks help briefly but **hit walls**.  
👉 **Scalable, compute-hungry methods — search + learning — always overtake them.**


# ⚡ The 30 Bitter Lessons of AI

---

## 🎲 Games
1. **Chess (1997, Deep Blue)**  
   Human strategy models failed.  
   👉 Brute-force search + compute won.  

2. **Go (2016, AlphaGo)**  
   Hand-crafted heuristics collapsed.  
   👉 Search + self-play learning won.  

3. **Checkers (2007, Chinook)**  
   Human opening books obsolete.  
   👉 Exhaustive computation solved the game.  

4. **Backgammon (TD-Gammon, 1990s)**  
   Expert rules dominated early.  
   👉 Reinforcement learning + self-play reached near world-class play.  

5. **Poker (Libratus, 2017; Pluribus, 2019)**  
   Hand-coded bluffing rules failed.  
   👉 Game-theoretic self-play (counterfactual regret minimization) + search with large compute beat top human professionals.  

---

## 🎤 Speech & Language
6. **Speech Recognition (DARPA 1970s → 2010s)**  
   Phoneme/vocal-tract models weak.  
   👉 HMMs, then deep neural nets scaled.  

7. **Text-to-Speech**  
   Rule-based phonetic synthesis unnatural.  
   👉 Neural TTS (WaveNet, Tacotron) won.  

8. **Machine Translation**  
   Grammar-based symbolic MT stalled.  
   👉 Statistical → Seq2Seq → Transformers won.  

9. **Natural Language Understanding (NLP)**  
   Hand-coded semantic ontologies failed.  
   👉 Word embeddings → LLMs succeeded.  

10. **Information Retrieval / Search Engines**  
    Human taxonomies (Yahoo directories) collapsed.  
    👉 Statistical ranking + large-scale indexing won.  

---

## 🎵 Recommendations & Vision
11. **Recommendation Systems**  
    Expert-coded “taste graphs” failed.  
    👉 Collaborative filtering + deep learning scaled.  

12. **Computer Vision (Edges, SIFT, HOG)**  
    Feature engineering plateaued.  
    👉 CNNs, ViTs, multimodal models surpassed.  

13. **Image Classification (ILSVRC 2012)**  
    Decades of descriptors obsolete.  
    👉 AlexNet with GPUs crushed error rates.  

14. **Object Detection**  
    Handcrafted Haar-feature cascades (Viola–Jones) limited.  
    👉 Deep detectors (R-CNN, YOLO, DETR) won.  

15. **Medical Imaging**  
    Handcrafted features underwhelmed.  
    👉 CNNs on massive datasets outperform.  

---

## 🤖 Robotics & Control
16. **Reinforcement Learning in Robotics**  
    Hard-coded walking/gait rules brittle.  
    👉 Policy learning + simulation scaling wins.  

17. **Autonomous Driving**  
    Rule-based pipelines brittle in open world.  
    👉 End-to-end deep policies + sensor fusion more robust.  

18. **Control Systems (Classic AI Robotics)**  
    Symbolic planners (STRIPS, 1970s) failed at scale.  
    👉 RL + model-based learning scale better.  

---

## 🔬 Science & Discovery
19. **Drug Discovery**  
    Rule-driven chemistry “expert systems” plateaued.  
    👉 Generative models + deep RL accelerating progress.  

20. **Protein Folding (AlphaFold 2, 2020)**  
    Decades of handcrafted biophysics made slow progress.  
    👉 Deep learning + compute reached near-experimental accuracy at CASP14.  

---

## 🧠 Knowledge & Reasoning
21. **Knowledge Representation (GOFAI, 1960s–80s)**  
    Expert systems collapsed.  
    👉 Statistical & neural learning dominated.  

22. **Cognitive Architectures**  
    Human reasoning mimics (SOAR, ACT-R) failed to generalize.  
    👉 Scalable ML methods overtook.  

23. **Optimization**  
    Hand-tuned search heuristics fragile.  
    👉 Gradient descent + stochastic optimization scalable.  

24. **Hyperparameter Tuning**  
    Manual trial-and-error slow.  
    👉 Bayesian optimization + AutoML won.  

25. **Game AI (General)**  
    Scripted behavior brittle.  
    👉 Self-play RL + search (AlphaZero, MuZero) generalize.  

---

## 🔎 Paradigms & Scaling
26. **Neuroscience-Inspired AI**  
    Direct mimicry of neurons/brains didn’t scale.  
    👉 Abstract scalable math (attention, backprop) did.  

27. **Symbolic Logic & Planning**  
    Logic-based AI stagnated.  
    👉 Learning-based statistical models scaled better.  

28. **Scaling Laws (2020s)**  
    Carefully tuned small models plateau.  
    👉 Bigger models + more data → predictable gains.  

29. **Inference-Time Compute (New Frontier)**  
    Static frozen models plateau.  
    👉 Dynamic search at inference (AlphaGo’s MCTS, OpenAI o1) rising.  

30. **General AI Outlook**  
    Human-designed rules/knowledge always appealing but brittle.  
    👉 Search + learning + compute scaling are the universal winners.  

---

## ✨ Master Insight
Across all domains — games, language, vision, robotics, science — the pattern repeats:  
👉 Hand-engineered knowledge feels right but **stalls**.  
👉 **Scalable, compute-driven methods (search + learning) always break through and win.**
