| Technique | Definition | 3 Pros | 3 Cons |
|------------|------------|------------|------------|
| **Distributional Hypothesis** | Linguistic principle that meaning is derived from contextual usage. | Theoretical foundation of embeddings; Intuitive semantic basis; Corpus-driven | Not a model; No numeric representation; Context window undefined |
| **Vector Space Model** | Represents documents/terms as high-dimensional vectors. | Enables similarity computation; Foundation of IR; Simple linear algebra | Sparse; No deep semantics; Ignores order |
| **Bag-of-Words (BoW)** | Word frequency vector representation ignoring order. | Simple; Fast; Interpretable | No syntax; Sparse; No semantic similarity |
| **TF-IDF** | Weighted BoW reducing importance of common words. | Improves IR; Highlights informative words; Easy to compute | Still sparse; No context awareness; Static |
| **PMI** | Measures word association strength via co-occurrence probability. | Captures association; Theoretical grounding; Basis for embeddings | Sparse; Unstable for rare words; No sentence modeling |
| **LSA** | Applies SVD to term-document matrix to uncover latent structure. | Dense vectors; Captures global structure; Reduces dimensionality | Linear only; Static; Computationally heavy |
| **LDA** | Probabilistic topic model representing documents as topic mixtures. | Interpretable topics; Probabilistic; Good for corpora analysis | Not word-level semantic model; Limited expressiveness; Requires tuning |
| **Neural Probabilistic LM** | Learns word embeddings while predicting next word. | Dense distributed reps; Learns semantics; End-to-end training | Computationally expensive (early versions); Limited scale; Context window small |
| **Deep Neural LMs** | Multi-layer neural language models. | Better generalization; Captures nonlinear patterns; Improved prediction | Data-hungry; Training instability; Hard to interpret |
| **Word2Vec (CBOW)** | Predicts target word from context window. | Fast; Efficient; Good for frequent words | Static embeddings; Ignores polysemy; Limited context size |
| **Word2Vec (Skip-gram)** | Predicts context from center word. | Good for rare words; Semantic arithmetic; Scalable | Static; Ignores sentence meaning; No syntax modeling |
| **GloVe** | Embeddings via global co-occurrence factorization. | Uses global stats; Strong semantic structure; Stable | Static; Requires full matrix; Memory heavy |
| **Matrix Factorization Embeddings** | Embeddings from decomposing co-occurrence matrices. | Theoretical clarity; Links to PMI; Global structure | Linear assumption; Computational cost; Static |
| **Character Embeddings** | Word representations built from character sequences. | Handles OOV; Morphology-aware; Compact vocab | Slower; Harder training; Limited semantics alone |
| **Subword Embeddings** | Word vectors composed from subword units. | Handles rare words; Multilingual friendly; Reduces vocab size | Still mostly static; Context limited; Complex preprocessing |
| **fastText** | Word2Vec extended with subword information. | OOV handling; Efficient; Strong multilingual performance | Static embeddings; Shallow context; Less expressive than transformers |
| **Contextual Word Embeddings** | Word vectors dependent on sentence context. | Resolves polysemy; Sentence-aware; Powerful semantics | Computationally heavy; Large models; Hard to deploy |
| **ELMo** | BiLSTM-based contextual embeddings. | First practical contextual model; Deep layers; Transferable | Sequential (slow); Not transformer-based; Heavy |
| **Self-Attention** | Mechanism computing pairwise token relevance. | Parallelizable; Global context; Scalable | Quadratic complexity; Memory intensive; Needs large data |
| **Transformer** | Architecture based entirely on self-attention. | Highly parallel; Scalable; Foundation of LLMs | Data-intensive; Expensive training; Quadratic attention cost |
| **BERT** | Bidirectional transformer with MLM objective. | Strong language understanding; Contextual; Transfer learning | Not generative; Large size; Requires fine-tuning |
| **Masked Language Modeling** | Training objective predicting masked tokens. | Bidirectional learning; Efficient pretraining; Strong semantics | Masking mismatch; Not natural generation; Limited sequence modeling |
| **RoBERTa** | Optimized BERT training without NSP. | Better performance; Larger training data; Stable | High compute; Still encoder-only; No generative ability |
| **Sentence-BERT** | Siamese BERT for sentence embeddings. | Efficient similarity; Good retrieval; Practical | Not generative; Requires labeled sentence pairs for fine-tuning; One fixed vector per sentence after pooling |
| **Contrastive Sentence Embeddings** | Embeddings trained with contrastive loss. | Strong similarity structure; Robust; Flexible | Needs positive/negative pairs; Data sensitive; Domain dependent |
| **SimCSE** | Contrastive sentence embedding using dropout noise. | Simple; Strong results; Efficient | Sensitive to training setup; Limited reasoning; Domain shifts |
| **T5** | Text-to-text transformer framework. | Unified tasks; Flexible; Strong transfer | Heavy compute; Generative focus; Complex fine-tuning |
| **GPT-3 Style Autoregressive Embeddings** | Decoder-based next-token prediction models. | Emergent abilities; Few-shot learning; Strong generative power | Expensive; Opaque reasoning; Context window limits |
| **CLIP Embeddings** | Joint text-image embedding space via contrastive learning. | Multimodal alignment; Strong zero-shot; Robust | Limited reasoning; Image-text only; Contrastive bias |
| **Cross-Modal Contrastive Learning** | Aligns representations across modalities. | Multimodal fusion; Scalable; Flexible | Needs aligned data; Complex training; Modality imbalance |
| **Text-Code Embeddings** | Embeddings aligned between natural language and code. | Code search; Cross-domain; Practical tooling | Limited domain scope; Requires large code corpora; Evaluation complex |
| **E5 Embeddings** | Weakly supervised contrastive embedding model. | Strong MTEB results; General-purpose; Robust | Training heavy; Single fixed vector per input; Not generative |
| **RetroMAE** | Retrieval-focused masked autoencoder pretraining. | Improves retrieval; Dense indexing; Efficient search | Specialized objective; Limited generation; Pretraining complexity |
| **Instruction-Tuned Embeddings** | Embeddings conditioned on task instructions. | Task-aware; Flexible; Better alignment | Prompt sensitive; Needs instruction data; Overfitting risk |
| **Foundation Model Embeddings** | Embeddings extracted from large pretrained LLMs. | Rich semantics; Generalizable; Multitask | Large footprint; Expensive inference; Black-box nature |
| **Multimodal Unified Embeddings** | Shared embedding space across text, image, audio. | Cross-modal reasoning; Unified representation; Flexible | Data expensive; Alignment difficulty; Modality trade-offs |
| **Matryoshka Embeddings** | Nested embeddings usable at multiple dimensions. | Efficient storage; Flexible truncation; Scalable | Training complexity; Limited adoption; May lose precision |
| **Long-Context Embeddings** | Embeddings computed over very long input contexts. | Captures global structure; Document-level reasoning; Memory integration | Quadratic attention cost; Expensive; Noise accumulation |
| **RAG Embeddings** | Embeddings used for external retrieval memory. | Extends knowledge; Scalable memory; Reduces hallucination | Retrieval dependency; Latency; Pipeline complexity |
| **Vector Databases** | Systems storing/searching embeddings efficiently. | Fast similarity search; Scalable; Production-ready | Approximate results; Infrastructure complexity; Storage heavy |
| **Mixture-of-Experts Embeddings** | Embeddings from sparsely activated expert layers. | Efficient scaling; Specialization; Lower inference cost | Routing complexity; Instability; Hard to tune |
| **LLM-as-Embedding Models** | Using LLM hidden states as embeddings. | Rich representations; Flexible; No separate model needed | Large compute; Not optimized for similarity; Inconsistent pooling |
| **Training-Free Embeddings** | Extract embeddings without retraining models. | Cheap; Fast deployment; No extra training | Suboptimal quality; Limited control; Model-dependent |
| **KV-Rerouting Embeddings** | Uses internal KV attention states for embedding extraction. | No retraining; Efficient reuse; Leverages internal structure | Experimental; Architecture-specific; Limited validation |
| **Reasoning-Aware Embeddings** | Embeddings encoding reasoning chains. | Better logic capture; Planning potential; Structured meaning | Hard to evaluate; Computationally heavy; Early research |
| **Dynamic State Embeddings** | Embeddings representing evolving internal model states. | Captures process; Useful for agents; Temporal awareness | Complex; Large memory; Research stage |
| **Agentic Embeddings** | Representations encoding action/decision planning states. | Supports autonomous agents; Planning-aware; Context adaptive | Very early stage; High complexity; Not standardized |

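As an illustration of the Bag-of-Words, TF-IDF, and Vector Space Model rows above, the sketch below builds sparse TF-IDF vectors and compares them with cosine similarity. It is a minimal sketch under stated assumptions: whitespace tokenization, raw term frequency, and `idf = log(N / df)` are one common choice among several, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (word order is discarded)."""
    return Counter(text.lower().split())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumption: weight = raw term frequency * log(N / document frequency).
    Other TF-IDF variants (smoothing, sublinear tf) are equally valid.
    """
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    df = Counter(term for bow in bows for term in bow)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in bow.items()}
        for bow in bows
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(docs)
sim_pets = cosine(vecs[0], vecs[1])       # documents sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

The two sentences about pets overlap only through surface terms such as "sat" and "on", so their similarity is positive but modest, while the third document shares no terms and scores zero. This is the "no semantic similarity" limitation listed in the table: synonyms that never co-occur as the same token contribute nothing.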

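The PMI row can be made concrete with a small co-occurrence example. The sketch below counts symmetric word-context pairs in a fixed window and computes positive PMI (PPMI), which clips negative values to zero. The window size, the toy sentences, and the function names are assumptions made for illustration, not part of any particular library.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count symmetric (word, context) pairs within a fixed window."""
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def ppmi(pair_counts):
    """Positive PMI: max(0, log2(p(w, c) / (p(w) * p(c))))."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    ctx_totals = Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        ctx_totals[c] += n
    scores = {}
    for (w, c), n in pair_counts.items():
        # p(w,c) / (p(w) p(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log2(n * total / (word_totals[w] * ctx_totals[c]))
        scores[(w, c)] = max(pmi, 0.0)
    return scores


# Toy corpus, invented for illustration only.
sentences = [
    "new york is big",
    "new york is busy",
    "the city is big",
    "the city is busy",
]
scores = ppmi(cooccurrence_counts(sentences, window=1))
strong = scores[("new", "york")]  # "new" appears only next to "york"
weak = scores[("york", "is")]     # "is" appears next to many words
```

In this corpus "new" occurs only beside "york", so that pair scores higher than the pair with the frequent, promiscuous word "is". With realistic corpora most pairs never co-occur, which is the sparsity and rare-word instability noted in the table.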
## Word Representation Techniques: Chronological Timeline

| Technique | Paper(s) (Year — Title) |
|------------|---------------------------|
| Distributional Hypothesis | 1954 — Distributional Structure; 1957 — A Synopsis of Linguistic Theory 1930–1955 |
| Vector Space Model | 1975 — A Vector Space Model for Automatic Indexing (Salton, Wong & Yang) |
| Bag-of-Words (BoW) | 1954 — Distributional Structure (Harris) |
| TF-IDF | 1972 — A Statistical Interpretation of Term Specificity and Its Application in Retrieval (Spärck Jones); 1988 — Term-Weighting Approaches in Automatic Text Retrieval (Salton & Buckley) |
| PMI | 1989 — Word Association Norms, Mutual Information, and Lexicography |
| Latent Semantic Analysis (LSA) | 1990 — Indexing by Latent Semantic Analysis; 1997 — A Solution to Plato’s Problem |
| Latent Dirichlet Allocation (LDA) | 2003 — Latent Dirichlet Allocation |
| Neural Probabilistic Language Model | 2003 — A Neural Probabilistic Language Model |
| Deep Neural Language Models | 2008 — A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning |
| Word2Vec (CBOW) | 2013 — Efficient Estimation of Word Representations in Vector Space |
| Word2Vec (Skip-gram) | 2013 — Distributed Representations of Words and Phrases and their Compositionality |
| GloVe | 2014 — GloVe: Global Vectors for Word Representation |
| Matrix Factorization Embeddings | 2014 — Neural Word Embedding as Implicit Matrix Factorization |
| Character Embeddings | 2015 — Character-Aware Neural Language Models |
| Subword Embeddings | 2016 — Enriching Word Vectors with Subword Information |
| fastText | 2016 — Enriching Word Vectors with Subword Information |
| Contextual Word Embeddings | 2018 — Deep Contextualized Word Representations |
| ELMo | 2018 — Deep Contextualized Word Representations |
| Self-Attention | 2017 — Attention Is All You Need |
| Transformer | 2017 — Attention Is All You Need |
| BERT | 2018 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| Masked Language Modeling | 2018 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| RoBERTa | 2019 — RoBERTa: A Robustly Optimized BERT Pretraining Approach |
| Sentence-BERT (SBERT) | 2019 — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks |
| Contrastive Sentence Embeddings | 2021 — SimCSE: Simple Contrastive Learning of Sentence Embeddings |
| SimCSE | 2021 — SimCSE: Simple Contrastive Learning of Sentence Embeddings |
| T5 (Text-to-Text Transformer) | 2020 — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |
| GPT-3 Style Autoregressive Embeddings | 2020 — Language Models are Few-Shot Learners |
| CLIP Multimodal Embeddings | 2021 — Learning Transferable Visual Models From Natural Language Supervision |
| Cross-Modal Contrastive Learning | 2021 — Learning Transferable Visual Models From Natural Language Supervision |
| Text-Code Embeddings | 2022 — Text and Code Embeddings by Contrastive Pre-Training |
| E5 Embeddings | 2022 — Text Embeddings by Weakly-Supervised Contrastive Pre-Training |
| RetroMAE | 2022 — RetroMAE: Pre-Training Retrieval-Oriented Transformers via Masked Auto-Encoder |
| Instruction-Tuned Embeddings | 2023 — One Embedder, Any Task: Instruction-Finetuned Text Embeddings (INSTRUCTOR) |
| Foundation Model Embeddings | 2023 — LLaMA: Open and Efficient Foundation Language Models; 2020 — Language Models are Few-Shot Learners |
| Multimodal Unified Embeddings | 2021 — Learning Transferable Visual Models From Natural Language Supervision; 2023 — ImageBind: One Embedding Space To Bind Them All |
| Matryoshka Embeddings | 2022 — Matryoshka Representation Learning |
| Long-Context Embeddings | 2023 — LongNet: Scaling Transformers to 1,000,000,000 Tokens |
| Retrieval-Augmented Generation (RAG) Embeddings | 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |
| Vector Databases | 2017 — Billion-Scale Similarity Search with GPUs (FAISS) |
| Mixture-of-Experts Embeddings | 2025 — Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free |
| LLM-as-Embedding Models | 2024 — NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models |
| Training-Free Embeddings | 2026 — Training-Free Text Embedding via Internal KV Re-routing |
| KV-Rerouting Embeddings | 2026 — Training-Free Text Embedding via Internal KV Re-routing |
| Reasoning-Aware Embeddings | 2026 — Do Reasoning Models Enhance Embedding Representations? |
| Dynamic State Embeddings | 2022 — ReAct: Synergizing Reasoning and Acting in Language Models |
| Agentic Embeddings | 2023 — Toolformer: Language Models Can Teach Themselves to Use Tools |
