# Introduction to Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models, typically built on the transformer architecture, trained on massive amounts of text data. They can process and generate human-like text, perform a wide range of language tasks, and demonstrate emergent capabilities: abilities that appear only once models reach sufficient scale.

<span style="color : red">Band 5 & 6 students should understand how LLMs work at a high level, their capabilities and limitations, and ethical considerations around their use.</span>

## What Makes a Language Model "Large"?

```mermaid
graph LR
    A[Language Model Size Factors] --> B[Parameters]
    A --> C[Training Data]
    A --> D[Compute Resources]
    
    B --> B1[Billions of parameters<br/>7B, 13B, 70B, 175B+]
    C --> C1[Terabytes of text<br/>Books, websites, code]
    D --> D1[Thousands of GPUs<br/>Millions of dollars]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#d4edda,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#ffeaa7,color:#333
```
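To make the parameter counts in the diagram concrete, the sketch below estimates how much memory is needed just to store a model's weights. It assumes the common byte widths for each numeric precision and ignores activations, optimizer state, and runtime overhead, so treat the results as rough lower bounds. The function name `weight_memory_gb` is illustrative, not taken from any library.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision="fp16"):
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Model sizes taken from the diagram above.
for billions in (7, 13, 70, 175):
    sizes = ", ".join(
        f"{precision}: {weight_memory_gb(billions * 1e9, precision):,.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{billions}B parameters -> {sizes}")
```

This is one reason quantized (int8 or int4) models matter in practice: they shrink the weights enough to fit on far smaller hardware.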

## Evolution of Language Models

| Era | Model Type | Example | Parameters | Key Innovation |
| --- | --- | --- | --- | --- |
| Early | Statistical | N-grams | N/A | Word frequency patterns |
| 2013-2017 | Word Embeddings | Word2Vec, GloVe | Millions | Vector representations |
| 2017-2018 | Transformers | Original Transformer | 65M | Self-attention mechanism |
| 2018-2019 | Pre-trained LMs | BERT, GPT-2 | 110M-1.5B | Transfer learning |
| 2020-2022 | Large LMs | GPT-3, PaLM | 175B-540B | Scale and few-shot learning |
| 2023+ | Multimodal LLMs | GPT-4, Gemini, Claude | Undisclosed | Vision, reasoning, tool use |

```python
# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
```

## How LLMs Work: High-Level Overview

```mermaid
graph LR
    A[Input Text:<br/>'The cat sat on the'] --> B[Tokenization:<br/>Split into tokens]
    B --> C[Token IDs:<br/>245, 3874, 3332...]
    C --> D[Embeddings:<br/>Convert to vectors]
    D --> E[Transformer Layers:<br/>Process with attention]
    E --> F[Output Probabilities:<br/>Predict next token]
    F --> G[Next Token:<br/>'mat']
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#d4edda,color:#333
    style F fill:#ffeaa7,color:#333
    style G fill:#f8d7da,color:#333
```
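The pipeline in the diagram can be sketched end to end in a few lines of plain Python. This is a toy under heavy assumptions: the six-word vocabulary is made up, the embeddings are random and untrained, and a simple average of the context vectors stands in for the transformer layers. The predicted token is therefore meaningless; the point is only to show how data flows through each stage.

```python
import math
import random

random.seed(0)  # reproducible toy weights

# Hypothetical toy vocabulary; real LLMs use tens of thousands of subword tokens.
vocab = ["the", "cat", "sat", "on", "mat", "floor"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
EMBED_DIM = 4

# Random (untrained) embedding table: one vector per token ID.
embeddings = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

def tokenize(text):
    """Tokenization: whitespace splitting as a stand-in for a real tokenizer."""
    return text.lower().split()

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(text):
    tokens = tokenize(text)
    ids = [token_to_id[t] for t in tokens]        # token IDs
    vectors = [embeddings[i] for i in ids]        # embeddings
    # Stand-in for the transformer layers: average the context vectors.
    context = [sum(col) / len(vectors) for col in zip(*vectors)]
    # Score every vocabulary token against the context, then normalize.
    logits = [sum(c * e for c, e in zip(context, emb)) for emb in embeddings]
    probs = softmax(logits)                       # output probabilities
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs                     # next token

next_token, probs = predict_next("The cat sat on the")
print("Predicted next token:", next_token)
print("Probabilities:", {t: round(p, 3) for t, p in zip(vocab, probs)})
```

In a trained model, the embeddings and layer weights are learned so that the highest-probability token is a plausible continuation.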

## Key Concepts in LLMs

### 1. Tokenization

Breaking text into smaller units (tokens) that the model can process.

| Method | Example | Use Case |
| --- | --- | --- |
| Character-level | "hello" → ['h', 'e', 'l', 'l', 'o'] | Small vocabulary, long sequences |
| Word-level | "hello world" → ['hello', 'world'] | Simple but large vocabulary |
| Subword (BPE) | "unhappiness" → ['un', 'happiness'] | Balance between character and word |
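The three methods in the table can be compared with a short sketch. The character and word tokenizers are trivial; the BPE part is a simplified teaching version that learns merges from a tiny, made-up word list by repeatedly joining the most frequent adjacent pair. Production tokenizers train on far larger corpora and handle bytes, whitespace, and special tokens, so the exact splits here will differ from those of any real model.

```python
from collections import Counter

def char_tokens(text):
    return list(text)

def word_tokens(text):
    return text.split()

def apply_merge(symbols, pair):
    """Join every adjacent occurrence of `pair` into a single symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe_merges(words, num_merges):
    """Learn merges by repeatedly joining the most frequent adjacent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges

def bpe_tokens(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Hypothetical training words, chosen only for illustration.
training_words = ["happy", "unhappy", "happiness", "unhappiness", "sadness"]
merges = learn_bpe_merges(training_words, num_merges=10)

print(char_tokens("hello"))
print(word_tokens("hello world"))
print(bpe_tokens("unhappiness", merges))
```

Changing `num_merges` shows the trade-off from the table: fewer merges give shorter tokens and longer sequences, more merges give longer tokens and a larger vocabulary.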

### 2. Context Window

The maximum number of tokens the model can consider at once.

```mermaid
graph TD
    A[Early Models<br/>512-1024 tokens]
    B[GPT-3 / GPT-3.5<br/>2048-4096 tokens]
    C[GPT-4<br/>8K-32K tokens]
    D[Claude 2/3<br/>100K-200K tokens]
    E[Gemini 1.5<br/>1M+ tokens]
    
    A --> B --> C --> D --> E
    
    style A fill:#fff3cd,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#d4edda,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#00b894
```

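**A longer context window lets the model consider more information at once, but it requires more computation: in standard self-attention, the cost grows roughly with the square of the sequence length.**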
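A minimal sketch of what a context limit means in practice: input that does not fit must be cut down, and the number of token-to-token comparisons in full self-attention grows with the square of the input length. The helper names `fit_to_context` and `attention_pairs` are illustrative, and keeping only the most recent tokens is just one possible truncation strategy.

```python
def fit_to_context(token_ids, context_window, reserve_for_output=0):
    """Keep only the most recent tokens that fit in the window."""
    budget = context_window - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than context_window")
    return token_ids[-budget:]

def attention_pairs(num_tokens):
    """Token-to-token comparisons in full self-attention (n squared)."""
    return num_tokens * num_tokens

document_tokens = list(range(10_000))  # stand-in token IDs for a long document
kept = fit_to_context(document_tokens, context_window=4096, reserve_for_output=512)
print(f"Kept {len(kept)} of {len(document_tokens)} tokens")

for window in (2048, 4096, 32_000):
    print(f"{window:>6} tokens -> {attention_pairs(window):,} attention pairs")
```

Doubling the window quadruples the pair count, which is why long-context models need more compute or more efficient attention variants.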

```python
# Demonstrate simple token prediction (conceptual)
# Simulate next-token prediction probabilities
tokens = ['mat', 'table', 'floor', 'chair', 'sofa']
probabilities = [0.45, 0.25, 0.15, 0.10, 0.05]
best = max(probabilities)

plt.figure(figsize=(10, 6))
# Highlight the most likely token in red
plt.bar(tokens, probabilities, color=['red' if p == best else 'blue' for p in probabilities])
plt.xlabel('Possible Next Tokens')
plt.ylabel('Probability')
plt.title('LLM Next-Token Prediction: "The cat sat on the ___"')
plt.ylim(0, 0.5)
# Label each bar with its probability
for i, prob in enumerate(probabilities):
    plt.text(i, prob + 0.01, f'{prob:.2f}', ha='center')
plt.show()

print(f"Most likely next token: '{tokens[probabilities.index(best)]}'")
```
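Picking the single most likely token is only one decoding strategy. Many LLM APIs instead sample from the distribution, often with a temperature setting that sharpens or flattens it. The sketch below reuses the same illustrative probabilities under different names and applies temperature by raising each probability to the power 1/T and renormalizing, which is equivalent to dividing the logits by T before the softmax. The helper names are assumptions made for this example.

```python
import math
import random

# Same illustrative values as the cell above.
candidate_tokens = ['mat', 'table', 'floor', 'chair', 'sofa']
base_probs = [0.45, 0.25, 0.15, 0.10, 0.05]

def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_token(tokens, probs, rng):
    """Draw one token at random according to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(42)  # seeded so repeated runs give the same samples
for temperature in (0.5, 1.0, 2.0):
    adjusted = apply_temperature(base_probs, temperature)
    samples = [sample_token(candidate_tokens, adjusted, rng) for _ in range(8)]
    rounded = [round(p, 2) for p in adjusted]
    print(f"T={temperature}: probs={rounded} samples={samples}")
```

Sampling is one reason the same prompt can produce different answers on different runs.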

## Training Process for LLMs

```mermaid
graph TD
    A[Pre-training Phase] --> A1[Collect massive text data<br/>Books, web, code, etc.]
    A1 --> A2[Next-token prediction task<br/>Self-supervised learning]
    A2 --> A3[Train for weeks/months<br/>on thousands of GPUs]
    A3 --> B[Base Model<br/>Can generate text]
    
    B --> C[Fine-tuning Phase]
    C --> C1[Supervised fine-tuning<br/>High-quality examples]
    C1 --> C2[RLHF: Reinforcement Learning<br/>from Human Feedback]
    C2 --> D[Instruction-tuned Model<br/>Follows instructions better]
    
    D --> E[Alignment & Safety]
    E --> E1[Red-teaming<br/>Test for harmful outputs]
    E1 --> E2[Additional safety training]
    E2 --> F[Production Model<br/>Ready for deployment]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#ffeaa7,color:#333
    style D fill:#d4edda,color:#333
    style E fill:#f8d7da,color:#333
    style F fill:#00b894
```
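The next-token prediction task in the pre-training phase needs no human labels, because the text itself supplies the targets. The sketch below shows how one token sequence yields several training examples and how cross-entropy loss scores a prediction. The first three token IDs echo the earlier diagram and the fourth is made up; the probability values are invented purely to contrast a good prediction with a bad one.

```python
import math

def next_token_pairs(token_ids):
    """Each prefix is an input; the token that follows it is the target."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]

def cross_entropy(predicted_probs, target_id):
    """Loss is low when the model puts high probability on the true next token."""
    return -math.log(predicted_probs[target_id])

sequence = [245, 3874, 3332, 17]  # illustrative token IDs
for context, target in next_token_pairs(sequence):
    print(f"context={context} -> target={target}")

# Two hypothetical predictions over a 2-token vocabulary; the true token is ID 1.
confident_right = [0.1, 0.9]
confident_wrong = [0.95, 0.05]
print(f"Loss when right: {cross_entropy(confident_right, 1):.3f}")
print(f"Loss when wrong: {cross_entropy(confident_wrong, 1):.3f}")
```

Training adjusts the model's parameters to lower this loss, averaged over a very large number of such examples.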

## Capabilities of Modern LLMs

### Core Capabilities

| Capability | Description | Example |
| --- | --- | --- |
| Text Generation | Create fluent, coherent text | Story writing, article generation |
| Question Answering | Answer questions based on context | "What is photosynthesis?" |
| Summarization | Condense long text into key points | Summarize a research paper |
| Translation | Convert between languages | English to Spanish |
| Code Generation | Write and explain code | "Write a function to sort a list" |
| Analysis | Extract insights from data | Sentiment analysis, entity extraction |

### Emergent Capabilities

Abilities that appear in larger models but are weak or absent in smaller ones. The parameter ranges below are rough guides, not sharp thresholds:

```mermaid
graph LR
    A[Small Models<br/>under 1B params] --> B[Basic text generation<br/>Simple Q&A]
    C[Medium Models<br/>1-10B params] --> D[Better coherence<br/>Some reasoning]
    E[Large Models<br/>10-100B params] --> F[Few-shot learning<br/>Complex reasoning<br/>Chain-of-thought]
    G[Very Large Models<br/>100B+ params] --> H[Advanced reasoning<br/>Multi-step problems<br/>Creative tasks]
    
    style A fill:#fff3cd,color:#333
    style C fill:#ffeaa7,color:#333
    style E fill:#d4edda,color:#333
    style G fill:#00b894
```

## Popular LLM Families

### Open-Source LLMs

```mermaid
mindmap
  root((Open-Source<br/>LLMs))
    LLaMA Family
      Meta's LLaMA 2
      LLaMA 3
      Variants: Alpaca, Vicuna
    Mistral AI
      Mistral 7B
      Mixtral 8x7B
    Google
      Gemma
      T5, Flan-T5
    Community
      Falcon
      MPT
      BLOOM
```

### Proprietary LLMs

| Company | Model | Key Features |
| --- | --- | --- |
| OpenAI | GPT-4, GPT-4 Turbo | Multimodal, strong reasoning |
| Anthropic | Claude 3 (Opus, Sonnet, Haiku) | Long context, safety-focused |
| Google | Gemini (Ultra, Pro, Nano) | Multimodal, efficient |
| Meta | LLaMA | Open weights rather than proprietary (see above); listed here for comparison |
| Cohere | Command R+ | Enterprise-focused, RAG-optimized |

## How to Use LLMs: Prompting Strategies

### 1. Zero-Shot Prompting

```
Prompt: "Translate 'Hello, how are you?' to French."
Response: "Bonjour, comment allez-vous ?"
```

### 2. Few-Shot Prompting

```
Prompt: 
"Classify the sentiment:
Text: 'I love this product!' → Positive
Text: 'This is terrible.' → Negative
Text: 'It's okay, nothing special.' → "

Response: "Neutral"
```
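In code, a few-shot prompt is usually assembled from a list of labeled examples. The helper below is a minimal sketch that reproduces the prompt format shown above; `build_few_shot_prompt` is a made-up name, and sending the prompt to a model is deliberately left out because that depends on the API in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and an unlabeled query."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f"Text: '{text}' → {label}")
    # Leave the final label blank for the model to complete.
    lines.append(f"Text: '{query}' → ")
    return "\n".join(lines)

examples = [
    ("I love this product!", "Positive"),
    ("This is terrible.", "Negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment:", examples, "It's okay, nothing special."
)
print(prompt)
```

Passing an empty `examples` list turns the same helper into a zero-shot prompt.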

### 3. Chain-of-Thought Prompting

```
Prompt: "Solve step-by-step: If a train travels 120 km in 2 hours, 
what is its speed?"

Response: 
"Let's solve this step by step:
1. Speed = Distance / Time
2. Distance = 120 km
3. Time = 2 hours
4. Speed = 120 / 2 = 60 km/h
Therefore, the train's speed is 60 km/h."
```
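Because chain-of-thought responses spell out their steps, the final answer can often be checked programmatically. The sketch below wraps a question in the step-by-step instruction and pulls the last number out of a response so it can be compared with an independently computed value. The response text is copied from the example above rather than generated by a model, and taking the last number is a simple heuristic that will not suit every response format.

```python
import re

def add_cot_instruction(question):
    """Prefix a question with a step-by-step instruction."""
    return f"Solve step-by-step: {question}"

def extract_final_number(response):
    """Return the last number in the response, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(numbers[-1]) if numbers else None

question = "If a train travels 120 km in 2 hours, what is its speed?"
print(add_cot_instruction(question))

# Response text copied from the example above.
response = """Let's solve this step by step:
1. Speed = Distance / Time
2. Distance = 120 km
3. Time = 2 hours
4. Speed = 120 / 2 = 60 km/h
Therefore, the train's speed is 60 km/h."""

expected = 120 / 2  # computed independently of the model
answer = extract_final_number(response)
print("Model answer:", answer, "| matches expected:", answer == expected)
```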

```python
# Visualize model size vs performance trend
# Note: These are approximate, conceptual values, not measured benchmarks
model_sizes = [0.1, 0.5, 1, 3, 7, 13, 30, 70, 175, 540]  # billions of parameters
performance = [45, 58, 65, 72, 78, 82, 86, 89, 92, 94]
training_cost = [1, 5, 10, 50, 200, 500, 2000, 8000, 50000, 200000]  # in thousands of dollars

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Performance vs model size
ax1.plot(model_sizes, performance, 'b-o', linewidth=2, markersize=8)
ax1.set_xlabel('Model Size (Billions of Parameters)')
ax1.set_ylabel('Performance Score (%)')
ax1.set_title('LLM Performance vs Model Size')
ax1.set_xscale('log')
ax1.grid(True, alpha=0.3)

# Training cost vs model size
ax2.plot(model_sizes, training_cost, 'r-s', linewidth=2, markersize=8)
ax2.set_xlabel('Model Size (Billions of Parameters)')
ax2.set_ylabel('Training Cost ($1000s USD)')
ax2.set_title('Training Cost vs Model Size')
ax2.set_xscale('log')
ax2.set_yscale('log')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Note: In this illustrative data, performance gains shrink as models grow, while training cost rises steeply.")
```

## Limitations of LLMs

### Technical Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| Hallucinations | Generate false or nonsensical information | Can provide confident but wrong answers |
| Context Length | Limited tokens in context window | Documents longer than the window must be truncated or split |
| Recency | Training data has a cutoff date | No knowledge of recent events |
| Arithmetic | Struggle with precise calculations | Math errors in complex problems |
| Consistency | May give different answers to the same question | Unreliable for deterministic tasks |
| No True Understanding | Pattern matching, not genuine comprehension | Can fail on novel situations |

### Ethical Concerns

```mermaid
graph LR
    A[Ethical Concerns with LLMs] --> B[Bias & Fairness]
    A --> C[Misinformation]
    A --> D[Privacy]
    A --> E[Environmental Impact]
    A --> F[Job Displacement]
    
    B --> B1[Training data reflects<br/>societal biases]
    C --> C1[Can generate misleading<br/>or false content]
    D --> D1[May memorize and leak<br/>training data]
    E --> E1[Massive energy consumption<br/>for training]
    F --> F1[Automation of knowledge<br/>work tasks]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#f8d7da,color:#333
    style C fill:#f8d7da,color:#333
    style D fill:#f8d7da,color:#333
    style E fill:#f8d7da,color:#333
    style F fill:#f8d7da,color:#333
```

## Advanced LLM Techniques

### Retrieval-Augmented Generation (RAG)

```mermaid
graph LR
    A[User Query] --> B[Retrieve Relevant<br/>Documents]
    B --> C[Document Database<br/>or Vector Store]
    C --> D[Retrieved Context]
    D --> E[Combine Query<br/>+ Context]
    E --> F[LLM]
    F --> G[Generated Response<br/>with Citations]
    
    style A fill:#e1f5ff,color:#333
    style C fill:#fff3cd,color:#333
    style D fill:#d4edda,color:#333
    style F fill:#d4edda,color:#333
    style G fill:#f8d7da,color:#333
```

**Benefits**: Can reduce hallucinations by grounding answers in retrieved text, provides up-to-date information without retraining, and enables source citations
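The retrieve-then-generate flow can be sketched without any external services. The example below swaps in a simple bag-of-words cosine similarity for the embedding-based search a real vector store would perform, and it stops at building the prompt rather than calling an LLM. The three documents and all function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base.
documents = {
    "doc1": "The context window is the maximum number of tokens a model can consider at once.",
    "doc2": "Tokenization breaks text into smaller units called tokens.",
    "doc3": "RLHF stands for reinforcement learning from human feedback.",
}

def bag_of_words(text):
    """Count lowercase word occurrences (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, top_k=1):
    """Rank documents by similarity to the query and keep the best top_k."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        docs.items(),
        key=lambda item: cosine_similarity(query_vec, bag_of_words(item[1])),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(query, retrieved):
    """Combine the retrieved context and the query into one prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return (
        "Answer using only the context below and cite sources by ID.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

query = "What is the context window of a model?"
retrieved = retrieve(query, documents, top_k=1)
print(build_rag_prompt(query, retrieved))
```

Asking the model to cite the bracketed IDs is what makes source citations possible in the final response.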

### Fine-tuning

Adapting a pre-trained model for specific tasks or domains:

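- **Full Fine-tuning**: Update all model parameters (most flexible, most expensive)
- **LoRA (Low-Rank Adaptation)**: Freeze the original weights and train small low-rank matrices added to selected layers (far fewer trainable parameters)
- **Prompt Tuning**: Train only a small set of prompt embeddings while the model weights stay frozen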
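The efficiency of LoRA comes down to arithmetic: instead of updating a full `d_out × d_in` weight matrix, it trains two thin matrices `B` (`d_out × r`) and `A` (`r × d_in`) and adds their product to the frozen weights. The sketch below counts trainable parameters for one layer and runs a tiny forward pass in plain Python. The layer size 4096 and rank 8 are illustrative choices, and real implementations add details such as a scaling factor tied to the rank.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):    {lora:,} trainable parameters ({lora / full:.2%} of full)")

# Tiny example: B starts at zero, so the adapted layer initially matches the base layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]       # rank 1
B = [[0.0], [0.0]]     # zero-initialized
x = [2.0, 3.0]
print(lora_forward(W, A, B, x))
```

Starting `B` at zero means fine-tuning begins from exactly the pre-trained behavior and drifts away only as the adapter learns.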

### Tool Use / Function Calling

```mermaid
graph LR
    A[User: What's the weather<br/>in Sydney?] --> B[LLM decides to<br/>call weather tool]
    B --> C[Execute:<br/>get_weather 'Sydney']
    C --> D[API returns:<br/>22°C, Sunny]
    D --> E[LLM generates:<br/>It's 22°C and sunny...]
    
    style A fill:#e1f5ff,color:#333
    style B fill:#fff3cd,color:#333
    style C fill:#d4edda,color:#333
    style D fill:#ffeaa7,color:#333
    style E fill:#f8d7da,color:#333
```
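The loop in the diagram can be sketched with a stand-in for the model. In the example below, `fake_llm_decide` and `fake_llm_respond` are hypothetical placeholders for real model calls, `get_weather` returns canned data instead of calling an API, and the JSON shape of the tool call is an assumption for illustration; each provider defines its own format.

```python
import json

def get_weather(city):
    """Hypothetical tool: a real version would call a weather API."""
    canned = {"Sydney": {"temp_c": 22, "conditions": "Sunny"}}
    return canned.get(city, {"error": f"no data for {city}"})

# Registry of tools the model is allowed to call.
TOOLS = {"get_weather": get_weather}

def fake_llm_decide(user_message):
    """Stand-in for the model: returns a tool call as JSON text."""
    if "weather" in user_message.lower():
        return json.dumps({"tool": "get_weather", "arguments": {"city": "Sydney"}})
    return json.dumps({"tool": None, "arguments": {}})

def run_tool_call(llm_output):
    """Parse the model's request and execute only registered tools."""
    call = json.loads(llm_output)
    name = call.get("tool")
    if name is None:
        return None
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

def fake_llm_respond(result):
    """Stand-in for the model turning the tool result into a reply."""
    return f"It's {result['temp_c']}°C and {result['conditions'].lower()} in Sydney."

decision = fake_llm_decide("What's the weather in Sydney?")
result = run_tool_call(decision)
reply = fake_llm_respond(result)
print(reply)
```

Checking the requested name against a fixed registry is a deliberate design choice: the application, not the model, decides which functions can actually run.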

## Future of LLMs

### Emerging Trends

1. **Multimodal Models**: Processing text, images, audio, video together
2. **Smaller, More Efficient Models**: Better performance with fewer parameters
3. **Longer Context Windows**: Processing entire books or codebases
4. **Improved Reasoning**: Better at mathematics, logic, and multi-step problems
5. **Specialized Models**: Domain-specific LLMs (medical, legal, code)
6. **On-Device Models**: Running LLMs locally on phones and laptops
7. **Better Alignment**: More controllable, safer, less biased models

```mermaid
timeline
    title Evolution of LLM Capabilities
    2017 : Transformers introduced
         : Self-attention mechanism
    2018-2019 : BERT, GPT-2
              : Transfer learning
    2020-2022 : GPT-3, Large models
              : Few-shot learning
    2023 : GPT-4, Claude, Gemini
         : Multimodal capabilities
    2024-2026 : Better reasoning
              : Longer context
              : Efficiency improvements
```

## Practical Considerations for Using LLMs

### Choosing an LLM

| Factor | Considerations |
| --- | --- |
| **Cost** | API fees vs self-hosting, usage volume |
| **Performance** | Benchmark scores, task-specific evaluation |
| **Privacy** | Data retention policies, on-premise options |
| **Latency** | Response time requirements |
| **Context Length** | How much text needs to be processed |
| **Capabilities** | Code, multimodal, specific domains |
| **License** | Open-source vs proprietary, commercial use |

### Best Practices

1. **Clear Prompts**: Be specific and provide context
2. **Validation**: Always verify LLM outputs for critical applications
3. **Ethical Use**: Consider bias, fairness, and societal impact
4. **Human in the Loop**: Use LLMs to augment, not replace, human judgment
5. **Monitoring**: Track performance, costs, and failure modes
6. **Privacy**: Don't share sensitive information with external LLM APIs
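The validation practice above can be as simple as refusing to trust output that does not match an expected shape. The sketch below assumes the model was asked to reply with a JSON object containing a `sentiment` field; that format, the allowed labels, and the function name are illustrative choices, not a standard.

```python
import json

ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}

def validate_sentiment_output(raw_output):
    """Return (is_valid, label_or_reason) for a model's JSON reply."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    label = data.get("sentiment")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return False, f"unexpected label: {label!r}"
    return True, label

# Hypothetical model replies, including malformed ones.
replies = [
    '{"sentiment": "Neutral"}',
    '{"sentiment": "Mixed"}',
    "The sentiment is neutral.",
]
for reply in replies:
    print(reply, "->", validate_sentiment_output(reply))
```

Rejected replies can then be retried, logged for monitoring, or routed to a human reviewer.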