# **Large Language Models (LLMs)**

## **Large Language Models and Transformer Architecture**
#### **What is a Language Model**
A language model **estimates the probability that a sequence of words will appear**. Given an initial text, it **predicts the most likely subsequent terms**. The simplest models **are based on n-grams**, while **modern models use advanced neural networks**, especially **transformers**.
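
As a minimal sketch of the n-gram idea, the bigram model below estimates the probability of the next word from raw counts and multiplies those estimates to score a whole sequence. The toy corpus, the `<s>` and `</s>` boundary markers, and all function names are illustrative assumptions, not part of the original material.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word pairs and return P(next | previous) as nested dicts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nxts.values()) for word, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def sequence_probability(model, sentence):
    """Multiply the conditional probabilities of each consecutive word pair."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)  # unseen pair -> probability 0
    return prob

def most_likely_next(model, word):
    """Return the most probable continuation of `word`, or None if unseen."""
    candidates = model.get(word.lower(), {})
    return max(candidates, key=candidates.get) if candidates else None

corpus = ["the cat sat", "the cat ran", "the dog sat"]  # toy data
model = train_bigram(corpus)
print(most_likely_next(model, "the"))
print(sequence_probability(model, "the cat sat"))
```

Any word pair never seen in training gets probability zero here, which is one reason count-based models were replaced by neural ones that generalize across similar contexts.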

#### **RNN to Transformer**
Before transformers, **RNNs (e.g., LSTM and GRU)** were used to **process sequences one step at a time**, but they **were difficult to parallelize and suffered from the vanishing gradient problem**. Transformers revolutionized the field thanks to **self-attention**, **which allows the entire sequence to be processed in parallel** while also modeling long-range dependencies more effectively.

### **The Transformer**

#### **Basic architecture**
Introduced by Google in 2017 for machine translation, the **transformer** is composed of two main blocks:
- **Encoder**: **processes the input** text and creates a contextual representation.
- **Decoder**: **generates the output** text autoregressively from this representation.

#### **Fundamental components**
1. **Input Embedding**: each word is converted into a high-dimensional vector (embedding), combined with **positional encoding** to maintain word order.
2. **Multi-head Attention**: central mechanism that allows the model to focus on different parts of the sequence via multiple parallel attention “heads”.
3. **Self-Attention**: calculates the relative importance between words using **query (Q)**, **key (K)** and **value (V)** vectors.
4. **Layer Normalization & Residuals**: stabilize and speed up training through normalization and direct connections between inputs and outputs of a layer.
5. **Feedforward Layer**: applies nonlinear transformations to each position of the sequence.
6. **Decoder (with masked attention)**: generates the output one token at a time, without accessing future tokens to preserve autoregression.
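
The self-attention and masked-attention steps above can be sketched in plain Python. This is a simplified single-head version under stated assumptions: the input vectors are used directly as Q, K and V (a real transformer first applies learned linear projections), and the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax; scores of -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    With causal=True, position i may only attend to positions j <= i,
    which is the masking used by the decoder.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Three toy token vectors standing in for embeddings (assumption).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(X, X, X, causal=True)
for row in attn:
    print([round(a, 3) for a in row])
```

Each printed row is the attention distribution of one position; with the causal mask, the entries above the diagonal are zero, so the first token can only attend to itself.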

### **Mixture of Experts (MoE)**

The Mixture of Experts (MoE) is an **architecture** that combines **multiple** specialized **models**, called **“experts,”** to **address complex tasks more efficiently and effectively**.

#### **Main Components**
- **Experts**: specialized models (often transformers), each trained to **handle a specific subset of the task or data**.
- **Gating Network (Router)**: decides **which experts to send each piece of input to**. **Computes a probability** distribution over all experts, **establishing the “weight**” of each expert’s contribution.
- **Combination Mechanism**: **combines the expert responses**, weighting them based on the distribution provided by the router, to obtain a final optimized prediction.
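
A minimal sketch of how the three components interact, assuming a linear router and toy experts written as plain functions. The top-k selection shown here is the sparse variant (only the highest-weighted experts run); setting `top_k` to the number of experts gives the dense combination over all experts. All names and values are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Route input x to the top_k experts and mix their outputs."""
    # Gating network: one logit per expert, turned into a probability distribution.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    # Keep only the top_k experts and renormalize their weights.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Combination mechanism: weighted sum of the selected experts' outputs.
    output = [0.0] * len(x)
    for i in chosen:
        expert_out = experts[i](x)
        weight = probs[i] / norm
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, chosen

# Toy experts (assumptions): each maps a vector to a vector of the same size.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a + 1.0 for a in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical router parameters
output, chosen = moe_forward([2.0, 1.0], experts, router_weights, top_k=2)
print(chosen, output)
```

Because the unselected expert is never called, compute per input scales with `top_k` rather than with the total number of experts.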

### **Large Reasoning Models**

**Advanced reasoning models** do not rely only on the power of the Transformer, but **integrate architectural techniques**, **prompting strategies** and specialized **training methods**.

#### Prompting Techniques
- **Chain-of-Thought**: guides the model to **reason** step by step.
- **Tree-of-Thoughts**: explores **multiple logical paths and selects** the best one.
- **Least-to-Most**: **breaks a problem into simpler subproblems and solves them in order of increasing difficulty**, reusing earlier answers.

#### Improving logical capabilities
- **Fine-tuning on reasoning datasets** (logic, mathematics, common sense).
- **Instruction tuning**: trains the model to follow instructions in natural language.
- **RLHF**: improves consistency and quality by rewarding more useful answers.

#### Support Techniques
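- **Knowledge distillation**: transfers skills from large models to lightweight models.
- **Beam Search & Temperature**: decoding-time controls. Beam search keeps several candidate sequences in parallel, while temperature adjusts how random the sampling is.
- **RAG**: integrates data from external sources to enrich answers.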

### **Training Transformers**

#### Data prep
- Cleaning → Tokenization → Split between training/testing.

#### Training
- Input → Output → **loss** calculation → Backpropagation → Weight optimization.
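
The training cycle above can be sketched with a deliberately tiny model: a table of logits that predicts the next token from the previous one, trained with cross-entropy. This is an illustrative assumption, not how a transformer is structured internally, but the forward pass, loss, gradient and weight update follow the same order.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def train(pairs, vocab_size, epochs=200, lr=0.5):
    """Train W[prev][next] logits on (previous_token, next_token) pairs."""
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for prev, target in pairs:
            probs = softmax(W[prev])                 # input -> output (forward pass)
            total_loss += -math.log(probs[target])   # cross-entropy loss
            for j in range(vocab_size):
                # Backpropagation for this model: dL/dlogit_j = p_j - 1[j == target]
                grad = probs[j] - (1.0 if j == target else 0.0)
                W[prev][j] -= lr * grad              # weight optimization (SGD step)
        history.append(total_loss / len(pairs))
    return W, history

vocab = ["a", "b", "c"]            # toy vocabulary
pairs = [(0, 1), (1, 2), (2, 0)]   # a -> b, b -> c, c -> a
W, history = train(pairs, vocab_size=len(vocab))
print(round(history[0], 3), round(history[-1], 3))
```

The average loss starts at the value for a uniform guess over three tokens and falls as the weights move toward the observed transitions.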

#### Architectures
- **Decoder-only** (GPT): sequential prediction.
- **Encoder-only** (BERT): masked input reconstruction.
- **Encoder-Decoder** (T5): tasks such as translation, summarization, QA.

#### Context Length
More tokens = more context → but also more resources: in standard self-attention, compute and memory grow quadratically with sequence length. Balance is needed.
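
A rough back-of-the-envelope sketch of why longer context costs more, assuming standard (non-sparse) self-attention and a key/value cache stored in 16-bit precision. The model shape used in the example (32 layers, 32 key/value heads, head dimension 128) is a hypothetical configuration chosen only for the arithmetic.

```python
def attention_matrix_entries(context_len, n_heads=1, n_layers=1):
    """Number of attention scores computed: every token attends to every token."""
    return context_len * context_len * n_heads * n_layers

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values during generation.

    The leading factor 2 accounts for one key and one value vector per token.
    """
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value

for n in (2048, 4096, 8192):
    gib = kv_cache_bytes(n, n_layers=32, n_kv_heads=32, head_dim=128) / 1024**3
    print(n, attention_matrix_entries(n), f"{gib:.1f} GiB")
```

Doubling the context quadruples the number of attention scores, while the cache grows linearly; both effects have to be weighed against the benefit of seeing more text.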

Although Large Language Models are based on mathematics and statistics (neural networks, probability, optimization), **they are not just that**. Their emergent abilities in **reasoning, generalization, and language understanding** come from how they learn through large amounts of data, linguistic examples, and techniques such as *prompting* or *fine-tuning*. Mathematics is the basis, but the resulting behavior **simulates complex cognitive processes**, which cannot be explained with formulas alone.

### **Transformer and LLM Evolution**

#### **GPT-1 (2018) – OpenAI**
- First **decoder-only** model, based on **unsupervised pre-training** (BooksCorpus).
- Introduced the concept of **supervised fine-tuning after pre-training**, making models more generic and adaptable.
- Limitations: weak cohesion on long texts, little context memory.

#### **BERT – Google**
- **Encoder-only**, specialized in **language understanding**, not generation.
- Trained with **masked language modeling** and **next sentence prediction**.
- Great for: sentiment analysis, QA, NLU.

#### **GPT-2**
- 1.5B parameters, trained on WebText.
- Major progress in **text coherence**; demonstrated **zero-shot learning**, i.e. performing tasks without task-specific fine-tuning.

#### **GPT-3, 3.5, 4**
- GPT-3: 175B parameters, excels at few-shot and zero-shot. Commercial use via API.
- InstructGPT: tuning on human data + RLHF → more useful and safer answers.
- GPT-4: **multimodal**, context of up to 128k tokens (in the GPT-4 Turbo version), strong on reasoning and complex domains.

### **Other main models**

#### **LaMDA (Google)**
- Focus on fluid and natural conversations. Trained on dialogue data.

#### **Gopher (DeepMind)**
- 280B parameters. Great at general knowledge, less so on abstract reasoning.

#### **GLaM (Google)**
- An early large **sparsely activated** language model (Mixture of Experts): 1.2T parameters in total, but only a fraction of them is active for each input token, which makes it efficient and powerful.

#### **Chinchilla (DeepMind)**
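- 70B parameters, but trained on a **much larger dataset** than earlier models of comparable size. Showed that, for a fixed compute budget, scaling the data matters as much as scaling the parameters → new standard for compute-optimal training.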

### **PaLM Line → Gemini (Google)**

#### **PaLM / PaLM 2**
- 540B parameters. Trained for **reasoning**, **code**, **translation**, **humor**.
- PaLM 2: more efficient, great at coding and QA.

#### **Gemini 1.0 → 2.0**
- **Multimodal models** (text, images, audio, video).
- **Gemini Pro / Ultra / Nano / Flash** for scalable or device-based uses.
- **Gemini 1.5 Pro**: context of up to 10M tokens in research testing, excels in video analysis, code comprehension, multilingual reasoning.
- **Gemini 2.0**: improves efficiency, spatial understanding, scientific reasoning. “Flash Thinking” version shows the “thinking process”.

### **Relevant open-source models**

#### **Gemma (Google)**
- Lightweight open-source models, even **multimodal**, optimized for efficiency.
- **Gemma 3**: up to 128k tokens, over 140 languages, available in sizes from 1B to 27B.

#### **LLaMA (Meta)**
- Decoder-only, 7B to 70B parameters.
- LLaMA 2-Chat optimized for dialogue, LLaMA 3.2 also supports vision and long context.

#### **Mixtral (Mistral AI)**
- **Sparse Mixture of Experts** model, uses only about 13B of its 47B parameters for each token.
- Excels in **math**, **multilingual** and **code**.

### **Advanced Reasoning Oriented Models**

#### **OpenAI o1**
- Internal chain of thought (CoT) generated before answering; excels in science, math, competitions (e.g. AIME).

#### **DeepSeek**
- DeepSeek-R1 is trained largely through reinforcement learning (its R1-Zero variant skips supervised fine-tuning entirely). Uses **Group Relative Policy Optimization (GRPO)** + automatic selection of best outputs to generate its training data. Excellent in math and reasoning tasks.
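
The core idea of GRPO can be sketched as follows: several responses are sampled for the same prompt, each receives a reward, and each response's advantage is its reward normalized by the mean and standard deviation of its own group, so no separate value network is needed. The function name, the small `eps` term, and the use of the population standard deviation are assumptions made for this sketch; the full algorithm also includes a clipped policy-ratio objective and a KL penalty that are not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption; implementations differ)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one math question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Responses that beat their group's average get a positive advantage and are reinforced; when every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.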

### **Other well-known models**
- **Qwen (Alibaba)**: excellent on reasoning and math (up to 72B).
- **Yi (01.AI)**: balanced between performance and efficiency.
- **Grok (xAI)**: 1M tokens, focus on reasoning and self-correction.
- Others: **GPT-NeoX**, **Alpaca**, **Vicuna**, **Falcon**, **PHI**, **DBRX**, **NVLM**, etc.

### **Conclusion**
Transformer models have evolved from monolithic models to **more modular, multimodal and intelligent architectures**, focusing on efficiency, reasoning, and adaptability. Today, the challenge is to scale while maintaining efficiency and controlling costs, paving the way for more sustainable and accessible models.


##### **Comparison**
Transformer-based language models have **evolved from encoder-decoder** architectures with millions of parameters **to massive decoder-only** models with billions of parameters and trained on trillions of tokens. This growth has improved performance and fostered emergent behaviors, such as **few-shot** and **zero-shot learning**. However, **limitations persist**: **difficulty with natural dialogues**, **weak mathematical skills**, and **possible biases or toxic responses**. The next section explores how these problems are being addressed.

### **Fine-Tuning of LLMs**

After **pre-training** on huge unlabeled datasets, LLMs can be **specialized** via **supervised fine-tuning (SFT)**, improving their effectiveness on specific tasks (e.g. QA, summarization, translation) or desired behaviors (e.g. following instructions, holding a dialogue, avoiding toxic responses).

#### **Fine-Tuning Techniques**
- **Instruction tuning**: train the model to follow instructions (e.g. “write a poem”).
- **Dialogue tuning**: optimize for multi-turn conversations.
- **Safety tuning**: reduce bias and dangerous content.

Fine-tuning is much less expensive than pre-training and more data efficient.

### **RLHF – Reinforcement Learning from Human Feedback**
Advanced technique to **align the model with human preferences**. A **Reward Model** trained on **human feedback** (preferences between two responses) is used. The final model is optimized using RL algorithms (e.g. policy gradient).

Variants:
- **RLAIF**: uses feedback generated by other models instead of humans.
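- **DPO** (Direct Preference Optimization): alternative method that optimizes the model directly on preference pairs, avoiding a separate reward model and an explicit RL loop.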

These techniques improve the safety, accuracy and utility of the model.
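
To make the preference-based objectives concrete, here is a minimal sketch of the two losses involved: the pairwise loss commonly used to train a reward model from "A is preferred over B" labels, and the DPO loss that works directly on the policy's log-probabilities. Function and parameter names are illustrative, and `beta=0.1` is an assumed default rather than a recommended value.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: low when the chosen response scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, using sequence log-probabilities
    under the policy being trained and under a frozen reference model."""
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Both losses shrink when the preferred response is ranked above the rejected one; DPO expresses that ranking through how far the policy has moved away from the reference model on each response.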

### **Parameter Efficient Fine-Tuning (PEFT)**

Full fine-tuning of LLMs is expensive in terms of time and resources. **PEFT** techniques offer an efficient alternative: they adapt large models by **training only a small portion of the parameters**, while maintaining high performance.

#### Main PEFT techniques:
- **Adapter**: small modules added to the model; only these modules are trained.
- **LoRA (Low-Rank Adaptation)**: trains only two small low-rank matrices per adapted layer, whose product is added to the original weights, which stay frozen. Versions like **QLoRA** also quantize the frozen base model for greater efficiency.
- **Soft Prompting**: learned embedding vectors are used instead of text prompts. They can be just a few tokens long and are highly efficient.
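
A small sketch of the LoRA idea under the usual formulation: the frozen weight matrix `W` is left untouched, and a low-rank update `B @ A`, scaled by `alpha / r`, is added on top. The matrix sizes, the `alpha` value, and the helper names are assumptions for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    W: d_out x d_in (frozen), B: d_out x r, A: r x d_in (both trainable).
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, r):
    """Compare parameter counts for full fine-tuning and LoRA on one matrix."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]      # r = 1, shape 1 x 2
B = [[0.0], [0.0]]     # B typically starts at zero, so training begins exactly at W
print(lora_effective_weight(W, A, B, alpha=2, r=1))
print(trainable_params(4096, 4096, r=8))
```

For a 4096 x 4096 matrix with rank 8, the LoRA update has 65,536 trainable values against roughly 16.8 million for the full matrix, which is where the efficiency comes from.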

---

**Performance/Cost Comparison**:
- **Performance**: Full Fine-Tuning > LoRA > Soft Prompting
- **Efficiency and Cost**: Soft Prompting > LoRA > Full Fine-Tuning

All PEFT techniques offer a good balance between performance and resources used, making fine-tuning accessible even in resource-constrained environments.

### **Prompt Engineering and Using LLMs**

The effectiveness of large language models (LLMs) strongly depends on two factors: **prompt engineering** and **sampling techniques**.

### **Prompt Engineering**
It is the art of building effective inputs (prompts) to obtain desired outputs. It can be used to:
- obtain more **factual** answers
- stimulate **creativity** (e.g. stories, songs)
- guide the model with **clear instructions**, examples, keywords, or formatting

There are three main approaches:
- **Zero-shot prompting**: provide only the instruction (without examples)
- **Few-shot prompting**: include 3–5 examples to help the model
- **Chain-of-thought prompting**: guide the model by showing step-by-step reasoning for complex problems

These techniques make the model's answers more reliable on various tasks without changing its weights.
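
The three approaches differ only in how the prompt text is assembled, which the sketch below makes explicit. The `Input:`/`Output:` and `Q:`/`Reasoning:`/`A:` layouts are one possible convention chosen for illustration, not a required format, and the example tasks are made up.

```python
def zero_shot(instruction, query):
    """Instruction only, no examples."""
    return f"{instruction}\n\nInput: {query}\nOutput:"

def few_shot(instruction, examples, query):
    """Instruction followed by solved (input, output) examples."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

def chain_of_thought(worked_examples, question):
    """Examples that show the reasoning steps before the answer."""
    shots = "\n\n".join(
        f"Q: {q}\nReasoning: {steps}\nA: {answer}"
        for q, steps, answer in worked_examples
    )
    return f"{shots}\n\nQ: {question}\nReasoning:"

print(few_shot(
    "Classify the sentiment as positive or negative.",
    [("Great movie!", "positive"), ("Waste of time.", "negative")],
    "I would watch it again.",
))
print(chain_of_thought(
    [("Pens cost $2 for 3. How much do 12 pens cost?",
      "12 pens are 12 / 3 = 4 groups of 3. Each group costs $2, so 4 * 2 = 8.",
      "$8")],
    "Apples cost $3 for 4. How much do 20 apples cost?",
))
```

Ending the chain-of-thought prompt with `Reasoning:` nudges the model to write out its steps before committing to an answer.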

### **Sampling Techniques in LLMs**
They are used to control creativity, coherence and diversity of the generated output:

- **Greedy Search**: chooses the most probable token. Coherent but predictable outputs.
- **Random Sampling**: random selection based on the distribution. Creative, but more chaotic.
- **Temperature**: adjusts randomness. Higher values = greater diversity.
- **Top-K**: samples only among the K most probable tokens.
- **Top-P (nucleus)**: samples among the smallest set of most probable tokens whose cumulative probability reaches P.
- **Best-of-N**: generates N answers and chooses the best one according to a criterion (e.g. logical coherence).
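
The decoding strategies above can be sketched over a toy next-token distribution. The logits and helper names are illustrative assumptions; a real system applies the same steps to the model's output logits at every generation step.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def keep_only(probs, keep):
    """Zero out tokens not in `keep` and renormalize the rest."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return keep_only(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return keep_only(probs, keep)

def greedy(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for a 4-token vocabulary
probs = softmax(logits)
rng = random.Random(0)           # fixed seed so runs are repeatable
print("greedy:", greedy(probs))
print("top-k=2:", [round(p, 3) for p in top_k_filter(probs, 2)])
print("top-p=0.9:", [round(p, 3) for p in top_p_filter(probs, 0.9)])
print("sampled at T=1.5:", sample(softmax(logits, temperature=1.5), rng))
```

In practice these controls are combined, for example temperature scaling followed by top-p filtering and then random sampling from what remains.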

### **Evaluation of LLMs in Applications**
When moving from a prototype to a real app, a customized evaluation system is needed:

- **Evaluation Data**: must reflect real use cases. They can also include production logs or synthetic data.
- **Development Context**: Evaluation must include the entire system (e.g. RAG, agents, etc.).
- **Definition of “good”**: It is based on practical goals, not just on literal matching with a correct answer.

### **Evaluation Methods**
- **Traditional**: Compare with ideal answers. Limited with creative tasks.
- **Human**: The gold standard for subtle evaluations.
- **Autoraters (LLM)**: Models that evaluate other models. Must be calibrated with human judgments for reliability.
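
As a sketch of the "traditional" approach and of autorater calibration, the helpers below compare predictions with reference answers (exact match and token-level F1) and measure how often an autorater's verdicts agree with human labels. The whitespace-and-lowercase normalization and all names are simplifying assumptions; real benchmarks usually normalize punctuation and articles as well.

```python
from collections import Counter

def normalize(text):
    return text.lower().split()

def exact_match(prediction, reference):
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def agreement_rate(autorater_labels, human_labels):
    """Fraction of items where the autorater and the human gave the same label."""
    matches = sum(a == h for a, h in zip(autorater_labels, human_labels))
    return matches / len(human_labels)

print(exact_match("Paris", "paris"))
print(token_f1("the capital is Paris", "Paris"))
print(agreement_rate([1, 0, 1, 1], [1, 0, 0, 1]))
```

The second example shows the limitation noted above: a correct but wordier answer fails exact match and scores well below 1 on F1, which is why human or calibrated LLM judgments are needed for open-ended tasks.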

### **Advanced Evaluation**
More interpretable systems are being developed, where an LLM breaks a task into subtasks and evaluates each, improving transparency and accuracy of judgment. Great for areas such as multimedia generation.

### **Speeding up inference in LLMs**

As models grow, so do costs and latency. To optimize performance, we use:

- **Trade-offs**: accepting small quality losses for speed or cost gains.
- **Quantization**: reducing the precision of weights (e.g. from 32 to 4 bits), reducing latency and memory.
- **Distillation**: a large model trains a smaller one to speed up inference.
- **Flash Attention** and **Prefix Caching**: optimize attention and reuse computations for repeated inputs.
- **Speculative Decoding**: a small model predicts tokens in advance and the main model verifies them.
- **Batching and Parallelization**: processing multiple requests together and distributing the load across multiple hardware.

These techniques improve efficiency while maintaining quality or only slightly sacrificing it.
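
To illustrate the quality-for-cost trade-off behind quantization, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale per weight list. Production schemes are more elaborate (per-channel or per-group scales, calibration data), so the function names and the toy weights are assumptions for illustration only.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

def max_error(weights, bits):
    """Largest absolute difference between original and dequantized weights."""
    q, scale = quantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))

weights = [0.5, -1.0, 0.25, 0.0]   # toy weights
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    print(bits, "bits:", q, "max error:", round(max_error(weights, bits), 4))
```

Going from 8 to 4 bits halves the storage again but leaves far fewer representable levels, so the reconstruction error grows; that is the small quality loss accepted in exchange for lower memory and latency.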

### **1. Code and Math**
- **LLM for Developers**: Generation, Completion, Refactoring, Translation, Testing, Documentation.
- **AlphaCode 2**: Top 15% on Codeforces.
- **FunSearch and AlphaGeometry**: Advanced Math Problem Solving.

### **2. Machine Translation**
- Natural translations in messaging, e-commerce, and travel apps.

### **3. Text Summaries**
- For news, scientific papers, business chats.

### **4. Q&A**
- Contextual and personalized answers in virtual assistants, customer support, and academic platforms.

### **5. Chatbots**
- Dynamic and human conversations in customer service and entertainment.

### **6. Content Generation**
- Advertisements, movie scripts, creative and coherent texts.

### **7. Semantic Inference (NLI)**
- Meaning analysis for sentiment, legal documents, and medical diagnoses.

### **8. Text Classification**
- Spam, news, customer feedback, evaluation of generated outputs.

### **9. Text Analysis**
- Market research, in-depth literature analysis.

### **10. Multimodal Applications**
- **Education, accessibility, medicine, marketing**: merging text, images, audio, and video for richer, smarter interactions.

**[Whitepaper summary](https://www.kaggle.com/whitepaper-foundational-llm-and-text-generation):**

- **Transformers**: the basis of all modern LLMs; not only the size of the model matters but also the quality of the data.
- **Fine-tuning**: it is divided into multiple phases (e.g. instruction tuning, safety tuning, SFT, RLHF, RLAIF) to adapt the behavior of the model.
- **Inference optimization**: there are techniques to reduce costs and latency without compromising performance.
- **Applications**: LLMs can be used for summarization, translation, QA, chatbots, code generation and more.
- **Prompt engineering & sampling**: fundamental to get relevant results; Top-K, Top-P and decoding parameters influence correctness, diversity and creativity.

**Conclusion**: combining good fine-tuning, well-designed prompts and sampling techniques makes it possible to get the most out of LLMs.