# Agentic Reinforcement Learning: From Traditional RL to Autonomous Agents

**Workshop Overview:**
This workshop introduces the emerging field of Agentic Reinforcement Learning, where large language models (LLMs) are transformed into autonomous agents through RL training.


**Key Papers:**
- Wang et al. (2024). "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey" [arXiv:2509.02547](https://arxiv.org/abs/2509.02547)
- Yao et al. (2024). "Demystifying Reinforcement Learning in Agentic Reasoning" [arXiv:2510.11701](https://arxiv.org/abs/2510.11701)
- Zhou et al. (2024). "AgentRL: Scaling Agentic RL with Multi-Turn, Multi-Task Framework" [arXiv:2510.04206](https://arxiv.org/abs/2510.04206)

---


## 1. Introduction: What is Agentic RL?

### Traditional RL vs Agentic RL

**Traditional RL** focuses on training agents to maximize rewards in well-defined environments:
- Game playing (Chess, Go, Atari)
- Robotic control
- Resource optimization

**Agentic RL** extends this to create autonomous agents that can:
- **Reason** about complex problems
- **Use tools** and external resources
- **Plan** multi-step strategies
- **Remember** and learn from past experiences
- **Self-improve** through reflection

### The Paradigm Shift

```
Traditional RL:           Agentic RL:
State → Action            Observation → Thought → Tool Use → Action
                                 ↓
                            Memory, Planning, Reflection
```

**Why LLMs + RL?**
- LLMs provide natural language understanding and reasoning
- RL provides goal-directed behavior optimization
- Together: Agents that can understand, reason, and act autonomously


In [1]:
# Setup: Install required packages
import sys
!{sys.executable} -m pip install numpy matplotlib gymnasium -q

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple, Optional
import json
from dataclasses import dataclass
from collections import defaultdict

print("✓ Setup complete!")

✓ Setup complete!


## 2. Mathematical Foundations

---

### 2.1 Traditional RL: Markov Decision Process

Traditional RL models the world as a **Markov Decision Process (MDP)**:
- The agent is in a **state**
- Takes an **action**
- Transitions to a **new state**
- Receives a **reward**

**Goal:** Find a policy (mapping from states → actions) that maximizes cumulative reward.

$$\text{MDP} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma)$$

| Symbol | Meaning | Example |
|--------|---------|--------|
| $\mathcal{S}$ | State space | Game pixels, board positions |
| $\mathcal{A}$ | Action space | Move left/right, place piece |
| $\mathcal{R}$ | Reward function | +1 win, -1 lose |
| $\mathcal{T}$ | Transition function | Next frame or board position after a move |
| $\gamma$ | Discount factor | 0.99 (how much future rewards count) |
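
To make "cumulative reward" concrete, the discounted return $G_t = r_t + \gamma G_{t+1}$ can be computed backwards over an episode. The following is a minimal sketch; the function name `discounted_returns` and the reward sequence are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = []
    g = 0.0
    # Walk the episode backwards so each step can reuse the next step's return
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Hypothetical episode: no reward until a win on the last step
example_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)
print(example_returns)
```

Earlier steps receive a smaller share of the final reward, which is how $\gamma$ trades off immediate against future reward.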

---

### 2.2 Agentic RL: What's Different?

$$\text{Agentic MDP} = (\mathcal{O}, \mathcal{C}, \mathcal{T}_{tool}, \mathcal{M}, \mathcal{A}, \mathcal{R}, \gamma)$$

| New Component | What It Is | Example |
|---------------|------------|--------|
| $\mathcal{O}$ (Observations) | Natural language input | "What is 5+3?" |
| $\mathcal{C}$ (Cognitive Actions) | Thinking, planning | "I need to calculate..." |
| $\mathcal{T}_{tool}$ (Tools) | External functions | Calculator, search, code |
| $\mathcal{M}$ (Memory) | Past interactions | Previous conversation |

**The key innovation: Two types of reward**

| Reward Type | When Given | Signal Density | Example |
|-------------|------------|----------------|--------|
| **Process Reward** | Each reasoning step | Dense | +0.3 for using calculator correctly |
| **Outcome Reward** | Final answer only | Sparse | +1.0 for correct answer |

**Process rewards densify the learning signal:** they score every reasoning step rather than only the final answer, which makes credit assignment easier in long multi-step episodes.

---

### 2.3 Comparison

| Aspect | Traditional RL | Agentic RL |
|--------|---------------|------------|
| Input | Numbers (pixels) | Natural language |
| Processing | state → action | observe → think → use tool → act |
| Tools | None | Calculator, search, code |
| Reward | Outcome only | **Process + Outcome** |

---

### 2.4 Training Objective

$$\mathcal{L} = \mathcal{L}_{RL} + \lambda_1 \mathcal{L}_{SFT} + \lambda_2 \mathcal{L}_{KL}$$

### 2.5 The Three Training Losses

---

#### **Loss 1: RL Loss — Maximize Reward**

$$\mathcal{L}_{RL} = -\mathbb{E} \left[ \sum_t \log \pi_\theta(a_t | s_t) \cdot A_t \right]$$

**What it does:** This is the policy-gradient loss. If an action earned a better-than-expected return (positive advantage), minimizing the loss increases its probability. If it did worse, minimizing the loss decreases it.

| Symbol | Meaning |
|--------|--------|
| $\log \pi_\theta(a_t|s_t)$ | How confident the model was about this action |
| $A_t$ | Advantage: was this action better or worse than average? |

**Intuition:** Reinforce good actions, suppress bad ones.
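
The effect of this loss can be checked numerically on a tiny softmax policy. The sketch below assumes a hypothetical 3-action policy parameterized directly by logits, and takes one gradient-descent step on $-\log \pi(a) \cdot A$; the names `softmax` and `policy_gradient_step` are illustrative and not part of the notebook.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(logits, action, advantage, lr=0.5):
    """One gradient-descent step on L = -log pi(action) * advantage."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # For a softmax policy: d(log pi(action)) / d(logits) = one_hot - probs
    grad_loss = -(one_hot - probs) * advantage
    return logits - lr * grad_loss

logits = np.zeros(3)  # hypothetical policy, initially uniform
p_before = softmax(logits)[0]
p_after_good = softmax(policy_gradient_step(logits, action=0, advantage=+1.0))[0]
p_after_bad = softmax(policy_gradient_step(logits, action=0, advantage=-1.0))[0]
print(f"before={p_before:.3f}, after A=+1: {p_after_good:.3f}, after A=-1: {p_after_bad:.3f}")
```

A positive advantage moves probability toward the sampled action and a negative one moves it away.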

---

#### **Loss 2: SFT Loss — Teach Good Behavior**

$$\mathcal{L}_{SFT} = -\log \pi_\theta(y | x)$$

**What it does:** Supervised fine-tuning. Show the model examples of good behavior and train it to imitate them. This keeps the model coherent.

| Symbol | Meaning |
|--------|--------|
| $x$ | Input prompt (observation) |
| $y$ | Expert response (what a good agent would say) |

**Intuition:** "Here's how to do it correctly" — learn from expert demonstrations.

---

#### **Loss 3: KL Loss — Don't Forget**

$$\mathcal{L}_{KL} = D_{KL}(\pi_\theta \| \pi_{ref})$$

**What it does:** Measures how different the current model is from the original base model. Keeps the model from drifting too far.

| Model | Description |
|-------|------------|
| Base/Reference model | Broad general knowledge, but weak at the specific task |
| Trained model | Good at task, but might forget other things |

**Intuition:** "Don't forget what you already know" — stay grounded.
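
For intuition about the size of this penalty, the KL divergence between two small categorical distributions can be computed directly. This is a sketch with made-up three-token distributions; it assumes the reference assigns non-zero probability wherever the policy does.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for two categorical distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi_ref = [0.5, 0.3, 0.2]      # hypothetical reference policy over 3 tokens
pi_same = [0.5, 0.3, 0.2]     # no drift from the reference
pi_drift = [0.9, 0.05, 0.05]  # policy that collapsed onto one token

kl_same = kl_divergence(pi_same, pi_ref)
kl_drift = kl_divergence(pi_drift, pi_ref)
print(f"no drift: {kl_same:.3f}, drifted: {kl_drift:.3f}")
```

The penalty is zero when the two policies agree and grows as the trained policy concentrates probability differently from the reference.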

---

#### **Combined: The Balancing Act**

$$\mathcal{L} = \underbrace{\mathcal{L}_{RL}}_{\text{get rewards}} + \lambda_1 \underbrace{\mathcal{L}_{SFT}}_{\text{stay coherent}} + \lambda_2 \underbrace{\mathcal{L}_{KL}}_{\text{don't drift}}$$

Illustrative weights: $\lambda_1 = 0.1$, $\lambda_2 = 0.01$ (in practice these are tuned per task and model)
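
As a small sketch of how the weights scale each term, the combination can be written as a plain function. The loss values passed in below are made up purely to show the arithmetic, and `combined_loss` is an illustrative name.

```python
def combined_loss(rl_loss, sft_loss, kl_loss, lambda_sft=0.1, lambda_kl=0.01):
    """L = L_RL + lambda_1 * L_SFT + lambda_2 * L_KL."""
    return rl_loss + lambda_sft * sft_loss + lambda_kl * kl_loss

# Made-up loss values, only to show how the weights scale each term
total = combined_loss(rl_loss=0.30, sft_loss=2.0, kl_loss=0.5)
print(f"total loss = {total:.3f}")
```

With these weights the SFT and KL terms act as mild regularizers, while the RL term dominates the update.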

## 3. Key Agentic Capabilities Taxonomy

Based on Wang et al. (2024), agentic systems exhibit **six core capabilities**:

### 3.1 Planning
- **Definition:** Breaking down complex tasks into manageable sub-goals
- **Examples:** Task decomposition, goal setting, strategy formulation
- **RL Role:** Learn to generate effective plans through reward feedback

### 3.2 Tool Use
- **Definition:** Selecting and invoking external tools/APIs to extend capabilities
- **Examples:** Calculator, web search, code execution, database queries
- **RL Role:** Learn which tools to use and when to use them

### 3.3 Memory
- **Definition:** Storing and retrieving information from past interactions
- **Types:**
  - *Episodic:* Specific past experiences
  - *Semantic:* General knowledge and facts
- **RL Role:** Learn what to remember and when to retrieve

### 3.4 Reasoning
- **Definition:** Multi-step logical inference and problem-solving
- **Examples:** Chain-of-thought, analogical reasoning, causal inference
- **RL Role:** Optimize reasoning chains for correctness and efficiency

### 3.5 Self-Improvement
- **Definition:** Learning from mistakes and refining behavior
- **Examples:** Self-reflection, error correction, iterative refinement
- **RL Role:** Meta-learning to improve learning strategies themselves

### 3.6 Perception
- **Definition:** Understanding multi-modal inputs (text, images, audio)
- **Examples:** Vision-language understanding, scene comprehension
- **RL Role:** Learn attention mechanisms and relevant feature extraction

---

### 3.7 Concrete Example: How Each Capability Works

**Task:** "Find the current stock price of Apple, calculate the P/E ratio, and summarize if it's a good buy."

| Step | Capability Used | Agent Action |
|------|-----------------|--------------|
| 1 | **Planning** | Decompose: (a) get stock price, (b) get earnings, (c) calculate P/E, (d) analyze |
| 2 | **Tool Use** | Call `stock_api("AAPL")` → returns \$178.50 |
| 3 | **Tool Use** | Call `stock_api("AAPL", "earnings")` → returns EPS \$6.42 |
| 4 | **Reasoning** | Calculate P/E = \$178.50 ÷ \$6.42 = 27.8 |
| 5 | **Memory** | Retrieve: "Historical tech P/E averages 20-30" |
| 6 | **Reasoning** | Compare: 27.8 is within normal range for tech |
| 7 | **Self-Improvement** | Reflect: "Should I also check growth rate?" → decides yes |
| 8 | **Tool Use** | Call `stock_api("AAPL", "growth")` → 8% YoY |
| 9 | **Reasoning** | Synthesize all data into final recommendation |

**This shows how the six capabilities work together in a realistic agentic workflow.**


## 4. Agentic RL: A Complete Working Example

In this section, we build a **complete agentic RL system** from scratch.

We'll implement every component from Section 2.2:

| Component | What It Is | Section |
|-----------|-----------|--------|
| Observation $\mathcal{O}$ | Natural language input | 4.2 |
| Cognitive Actions $\mathcal{C}$ | LLM generates reasoning | 4.3 |
| Tool Space $\mathcal{T}_{tool}$ | External tools (calculator, search) | 4.4 |
| Process Reward | Reward for good reasoning | 4.5 |
| Outcome Reward | Reward for correct answer | 4.5 |
| The Agent | Combines everything | 4.6 |
| Training Losses | RL + SFT + KL | 4.8 |

### 4.1 Building a Tiny Language Model

First, we need a language model. Real systems use GPT-4 or Claude, but we'll build a **tiny one** that works the same way:

**How LLMs generate text:**
1. Start with a prompt
2. Predict probability of each possible next token
3. Sample a token from that distribution
4. Record the log probability (for RL training!)
5. Repeat until done

#### Step 1: Define the Vocabulary

Every LLM has a vocabulary — the set of tokens it knows.
- GPT-4: ~100,000 tokens
- Our tiny LLM: 22 tokens

In [2]:
# Our tiny vocabulary
VOCAB = [
    '<eos>',  # End of sequence token
    # Words for reasoning
    'I', 'need', 'to', 'calculate', 'this',
    'Let', 'me', 'compute', 'the', 'answer', 'is',
    # Action tokens (for tool use!)
    'Action:', 'calculate(', 'search(', 'finish(',
    ')',
    # Numbers
    '5', '+', '3', '8', 'result'
]

TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

print(f"Vocabulary size: {len(VOCAB)} tokens")
print(f"\nTokens: {VOCAB}")
print(f"\nExample: 'calculate' → ID {TOKEN_TO_ID['calculate']}")

Vocabulary size: 22 tokens

Tokens: ['<eos>', 'I', 'need', 'to', 'calculate', 'this', 'Let', 'me', 'compute', 'the', 'answer', 'is', 'Action:', 'calculate(', 'search(', 'finish(', ')', '5', '+', '3', '8', 'result']

Example: 'calculate' → ID 4


#### Step 2: Pre-training (Learning Patterns)

GPT learns patterns from trillions of tokens of internet text.

We'll define patterns manually — this is our "pre-training":
- After "I", the model should likely say "need"
- After "Action:", the model should choose a tool like "calculate("

In [3]:
# Pre-trained patterns: P(next_token | previous_token)
# This is what GPT learns from internet text!

TRANSITIONS = {
    # Reasoning patterns
    'I': ['need'],
    'need': ['to'],
    'to': ['calculate'],
    'calculate': ['this'],
    'this': ['Action:'],
    
    # Tool selection
    'Action:': ['calculate(', 'finish('],  # Choose which tool!
    
    # Math expression
    'calculate(': ['5'],
    '5': ['+'],
    '+': ['3'],
    '3': [')'],
    ')': ['<eos>'],
    
    # Finishing
    'finish(': ['8'],
    '8': [')'],
}

print("Pre-trained patterns:")
for prev, nexts in list(TRANSITIONS.items())[:5]:
    print(f"  After '{prev}' → likely {nexts}")

Pre-trained patterns:
  After 'I' → likely ['need']
  After 'need' → likely ['to']
  After 'to' → likely ['calculate']
  After 'calculate' → likely ['this']
  After 'this' → likely ['Action:']


#### Step 3: Token-by-Token Generation

Now we can generate text like GPT:
1. Look at previous token
2. Get probability distribution over next tokens
3. Sample next token
4. **Record log probability** ← This is crucial for RL!

In [4]:
def get_next_token_probs(prev_token):
    """
    Compute probability distribution over next tokens.
    This is like GPT's softmax output!
    """
    probs = np.ones(len(VOCAB)) * 0.01  # Small base probability
    
    if prev_token in TRANSITIONS:
        # Boost probability of likely next tokens
        for next_token in TRANSITIONS[prev_token]:
            probs[TOKEN_TO_ID[next_token]] = 0.8 / len(TRANSITIONS[prev_token])
    else:
        # Default: prefer action tokens
        probs[TOKEN_TO_ID['Action:']] = 0.3
    
    return probs / probs.sum()  # Normalize (like softmax)

# Test it
print("After 'I', next token probabilities:")
probs = get_next_token_probs('I')
for i, p in enumerate(probs):
    if p > 0.05:
        print(f"  '{VOCAB[i]}': {p:.1%}")

After 'I', next token probabilities:
  'need': 79.2%
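
The 79.2% follows from the normalization step in `get_next_token_probs`: 'need' gets a raw weight of 0.8, and each of the other 21 tokens keeps the base weight of 0.01.

$$P(\text{need} \mid \text{I}) = \frac{0.8}{0.8 + 21 \times 0.01} = \frac{0.8}{1.01} \approx 0.792$$

The corresponding log probability is $\log 0.792 \approx -0.23$.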


In [5]:
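def generate_text(prompt, max_tokens=10, greedy=False):
    """
    Generate text token by token, the way an autoregressive LLM does.
    Returns: text, tokens, and LOG PROBABILITIES (for RL)

    Note: this toy bigram model ignores `prompt`; generation always
    starts from the token 'I'.
    """
    tokens = []
    log_probs = []
    prev_token = 'I'  # fixed start token; `prompt` is not used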
    
    for _ in range(max_tokens):
        # 1. Get probability distribution
        probs = get_next_token_probs(prev_token)
        
        # 2. Sample or pick most likely token
        if greedy:
            next_id = np.argmax(probs)  # Always pick highest prob
        else:
            next_id = np.random.choice(len(VOCAB), p=probs)
        next_token = VOCAB[next_id]
        
        # 3. Record log probability (CRUCIAL FOR RL!)
        log_prob = np.log(probs[next_id])
        
        tokens.append(next_token)
        log_probs.append(log_prob)
        prev_token = next_token
        
        if next_token == '<eos>' or next_token == ')':
            break
    
    return {
        'text': ' '.join(tokens),
        'tokens': tokens,
        'log_probs': log_probs,
        'total_log_prob': sum(log_probs)
    }

# Demo: Greedy generation (most likely path)
print("Greedy generation (always pick highest probability):")
result = generate_text("calculate", max_tokens=12, greedy=True)
print(f"  Tokens: {result['tokens']}")
print(f"  Text: '{result['text']}'")
print(f"  Log probs per token: {[f'{lp:.2f}' for lp in result['log_probs']]}")
print(f"  Total log π = {result['total_log_prob']:.2f}")

# Demo: Sampling (random, like real LLM training)
print("\nSampling (random picks, like real training):")
for i in range(3):
    result = generate_text("calculate", max_tokens=8, greedy=False)
    print(f"  '{result['text']}'  log π = {result['total_log_prob']:.2f}")

Greedy generation (always pick highest probability):
  Tokens: ['need', 'to', 'calculate', 'this', 'Action:', 'calculate(', '5', '+', '3', ')']
  Text: 'need to calculate this Action: calculate( 5 + 3 )'
  Log probs per token: ['-0.23', '-0.23', '-0.23', '-0.23', '-0.23', '-0.92', '-0.23', '-0.23', '-0.23', '-0.23']
  Total log π = -3.01

Sampling (random picks, like real training):
  'need to <eos>'  log π = -5.08
  'result calculate( 5 + 3 )'  log π = -9.48
  'need to calculate me Action: me Action: calculate('  log π = -11.90


#### Why Log Probabilities Matter for RL

| log π value | Probability | Meaning |
|-------------|-------------|----------|
| 0 | 100% | Completely certain |
| -2 | 14% | Fairly confident |
| -5 | 0.7% | Uncertain |
| -10 | 0.005% | Very uncertain |

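**In RL training:**
- Positive advantage → the update raises the log probability of the sampled tokens
- Negative advantage → the update lowers it
- The size of the update scales with the advantage, and tokens the model is already confident about move less, because the gradient of $\log \pi$ with respect to the chosen token's logit is $1 - \pi$

Over many updates, probability mass shifts toward actions that earn above-average reward.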
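
The table values can be checked by exponentiating each log probability. The short sketch below also shows that the log probabilities of a sequence add up, which is equivalent to multiplying the per-token probabilities; the three per-token values are illustrative.

```python
import math

log_probs = [0.0, -2.0, -5.0, -10.0]
probabilities = [math.exp(lp) for lp in log_probs]
for lp, p in zip(log_probs, probabilities):
    print(f"log pi = {lp:6.1f}  ->  probability = {p:.4%}")

# Log probs of a sequence add, which is the same as multiplying probabilities
token_log_probs = [-0.23, -0.23, -0.92]  # illustrative per-token values
sequence_log_prob = sum(token_log_probs)
sequence_prob = math.exp(sequence_log_prob)
print(f"sequence log pi = {sequence_log_prob:.2f}, probability = {sequence_prob:.3f}")
```

Summing in log space avoids the numerical underflow that multiplying many small probabilities would cause on long sequences.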

---
### 4.2 Observation Space $\mathcal{O}$

In traditional RL, observations are numbers (like game pixels).

In **agentic RL**, observations are **natural language**:
- The task description
- Available tools
- Context/memory

---

#### What is an Observation?

Think of it as **everything the agent sees before deciding what to do**:

| Component | Example | Purpose |
|-----------|---------|--------|
| Task | "What is 5+3?" | What to solve |
| Tools | "You have: calculator, search" | What capabilities exist |
| Context | "Previous step calculated 5+3=8" | What already happened |

#### Why Natural Language?

Traditional RL: `state = [0.5, 0.2, 0.8, ...]` (numbers)

Agentic RL: `observation = "Solve: 5+3. You have a calculator."` (text)

**An LLM can act on rich, open-ended instructions that would be very hard to hand-encode as a fixed-size feature vector.**

#### Code: Creating an Observation

Below we build the observation string that the agent will "see".

This is like setting up a game board before a player makes a move.

In [6]:
def create_observation(task, context="", tools=None):
    """Create a natural language observation for the agent."""
    if tools is None:
        tools = ["calculate(expr)", "search(query)", "finish(answer)"]
    
    obs = f"Task: {task}\n"
    obs += "Tools: " + ", ".join(tools) + "\n"
    if context:
        obs += f"Previous: {context}\n"
    obs += "Format: I need to... Action: tool(input)"
    return obs

# Example
obs = create_observation("What is 5 + 3?")
print("Example observation:")
print("-" * 40)
print(obs)

Example observation:
----------------------------------------
Task: What is 5 + 3?
Tools: calculate(expr), search(query), finish(answer)
Format: I need to... Action: tool(input)


---
### 4.3 Cognitive Actions $\mathcal{C}$

#### What Does the LLM Generate?

The **action** is what the LLM generates. It has two parts:

| Part | Example | What it is |
|------|---------|------------|
| **Reasoning** | "I need to calculate this" | The thought process (cognitive action) |
| **Tool call** | "Action: calculate(5+3)" | The external action |

Combined output: `"I need to calculate this Action: calculate(5+3)"`

---

#### The Log Probability: How Confident is the Model?

For every token the LLM generates, it has a probability. The **log probability** is the log of this.

| log π value | Probability | Meaning |
|-------------|-------------|--------|
| -0.2 | 82% | "I'm very confident" |
| -1.0 | 37% | "Somewhat confident" |
| -5.0 | 0.7% | "Not confident at all" |

**Why does this matter for RL?**
- The RL loss uses log π to adjust which actions to encourage
- If a low-confidence action got high reward → increase its probability!
- If a high-confidence action got low reward → decrease its probability!

#### Code: Generating an Action

The LLM takes the observation and generates:
1. **Reasoning** - why it's doing something
2. **Tool call** - what action to take

We also compute the **log probability** — this tells us how confident the model was.

In real training, this log prob is used in the RL loss!

In [7]:
# Track step number for demo
current_step = 0

def generate_action(observation):
    """
    LLM generates action + log probability.
    In this demo, we simulate realistic multi-step behavior.
    """
    global current_step
    current_step += 1
    
    # Step 1: Calculate, Step 2: Finish with answer
    if 'Result: 8' in observation or current_step > 1:
        tokens = ['Action:', 'finish(', '8', ')']
    else:
        tokens = ['I', 'need', 'to', 'calculate', 'this', 'Action:', 'calculate(', '5', '+', '3', ')']
    
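    # Compute log prob for each token, conditioned on the previous token.
    # The first token has no predecessor, so it is scored as if it followed 'I';
    # tokens missing from the vocabulary fall back to ID 0 ('<eos>').
    log_probs = []
    for j, token in enumerate(tokens):
        prev_token = tokens[j-1] if j > 0 else 'I'
        probs = get_next_token_probs(prev_token)
        token_id = TOKEN_TO_ID.get(token, 0)
        log_prob = np.log(probs[token_id] + 1e-10)
        log_probs.append(log_prob)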
    
    return ' '.join(tokens), sum(log_probs)

# Test
current_step = 0
action, log_prob = generate_action("Task: What is 5+3")
print(f"Step 1 action: '{action}'")
print(f"Log prob: {log_prob:.2f}")

Step 1 action: 'I need to calculate this Action: calculate( 5 + 3 )'
Log prob: -7.63


#### Code: Parsing the Action

The LLM generates text like `"I need to calculate Action: calculate(5+3)"`.

We need to **extract** the tool name and arguments so we can actually run the tool.

In [8]:
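import re

# Matches "tool( args )"; generated tokens are space-joined, so allow spaces
ACTION_RE = re.compile(r'(calculate|search|finish)\(\s*(.*?)\s*\)')

def parse_action(text):
    """Extract tool name and input from generated text."""
    match = ACTION_RE.search(text)
    if match is None:
        return 'unknown', ''
    tool_name, tool_input = match.groups()
    if tool_name != 'search':
        # Drop spaces introduced by joining tokens: "5 + 3" -> "5+3"
        tool_input = tool_input.replace(' ', '')
    return tool_name, tool_input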

# Test
print("Parsing examples:")
print(f"  'Action: calculate(5+3)' -> {parse_action('Action: calculate(5+3)')}")
print(f"  'Action: finish(8)' -> {parse_action('Action: finish(8)')}")

Parsing examples:
  'Action: calculate(5+3)' -> ('calculate', '5+3')
  'Action: finish(8)' -> ('finish', '8')


---
### 4.4 Tool Space $\mathcal{T}_{tool}$

The agent has access to **external tools**:

| Tool | What it does |
|------|-------------|
| `calculate(expr)` | Computes math |
| `search(query)` | Looks up information |
| `finish(answer)` | Submits final answer |

**In real systems**, these could be:
- Web browsers
- Code interpreters (Python, JavaScript)
- Databases
- APIs (weather, stocks, etc.)
- Anything with an interface!

In [9]:
class ToolSpace:
    """External tools the agent can use."""
    
    def calculate(self, expr):
        # Demo only: eval() with empty builtins is NOT a real sandbox
        try:
            result = eval(expr.replace(' ', ''), {"__builtins__": {}}, {})
            return f"Result: {result}"
        except Exception:
            return "Error"
    
    def search(self, query):
        db = {"capital of france": "Paris"}
        return f"Found: {db.get(query.lower(), 'No info')}"
    
    def finish(self, answer):
        return f"ANSWER: {answer}"
    
    def execute(self, tool_name, tool_input):
        if hasattr(self, tool_name):
            return getattr(self, tool_name)(tool_input)
        return "Error: unknown tool"

tools = ToolSpace()
print("Tool demos:")
print(f"  calculate('5+3') -> {tools.calculate('5+3')}")
print(f"  finish('8') -> {tools.finish('8')}")

Tool demos:
  calculate('5+3') -> Result: 8
  finish('8') -> ANSWER: 8


---
### 4.5 Reward Model: Process vs Outcome

We have **two reward functions** — this is what makes training work!

#### Process Reward (dense feedback for each step)
| Action | Reward | Why |
|--------|--------|-----|
| Used calculator correctly | +0.3 | Good tool use |
| Showed reasoning | +0.1 | Transparent thinking |
| Made error | -0.2 | Learn from mistakes |

#### Outcome Reward (sparse feedback at the end)
| Result | Reward |
|--------|--------|
| Correct answer | +1.0 |
| Wrong answer | -1.0 |

**The combination of dense process rewards + sparse outcome rewards is what makes agentic RL training work!**

In [10]:
def get_process_reward(action_text, tool_name, tool_output):
    """Reward for a single reasoning step."""
    reward = 0.0
    reasons = []
    
    if tool_name == 'calculate' and 'Result:' in tool_output:
        reward += 0.3
        reasons.append("+0.3 used calculator")
    
    if 'need' in action_text:
        reward += 0.1
        reasons.append("+0.1 showed reasoning")
    
    # Penalize failed tool calls (the "Made error" row in the table above)
    if tool_output.startswith('Error'):
        reward -= 0.2
        reasons.append("-0.2 tool error")
    
    return reward, reasons

# Example
r, reasons = get_process_reward("I need to calculate", "calculate", "Result: 8")
print(f"Process reward: {r:+.1f}")
print(f"  Reasons: {reasons}")

Process reward: +0.4
  Reasons: ['+0.3 used calculator', '+0.1 showed reasoning']


In [11]:
def get_outcome_reward(final_answer, correct_answer):
    """Reward for final answer."""
    if str(final_answer).strip() == str(correct_answer).strip():
        return 1.0, "Correct!"
    return -1.0, "Wrong"

# Examples
print("Outcome rewards:")
r, msg = get_outcome_reward("8", "8")
print(f"  Answer '8', expected '8': {r:+.1f} ({msg})")
r, msg = get_outcome_reward("7", "8")
print(f"  Answer '7', expected '8': {r:+.1f} ({msg})")

Outcome rewards:
  Answer '8', expected '8': +1.0 (Correct!)
  Answer '7', expected '8': -1.0 (Wrong)


---
### 4.6 The Complete Agent

Now we combine everything into one agent:

```
Observation -> LLM generates action -> Parse tool -> Execute -> Get reward -> Repeat
```

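The agent collects a **trajectory**: one record per step holding the action text, its log probability, the process reward, and the tool used. The outcome reward is added to the episode total but is not stored in the trajectory, so the losses in Section 4.8 see process rewards only.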

In [12]:
class AgenticLLMAgent:
    """Agentic LLM that reasons and uses tools."""
    
    def __init__(self, correct_answer, max_steps=3):
        self.tools = ToolSpace()
        self.correct_answer = correct_answer
        self.max_steps = max_steps
        self.trajectory = []
    
    def run_episode(self, task):
        global current_step
        current_step = 0
        self.trajectory = []
        context = ""
        total_reward = 0
        final_answer = None
        
        print(f"Task: {task}")
        print("=" * 50)
        
        for step in range(self.max_steps):
            print(f"\nStep {step + 1}:")
            
            # 1. Create observation (includes previous results)
            obs = create_observation(task, context)
            
            # 2. Generate action
            action_text, log_prob = generate_action(obs + context)
            print(f"  LLM: '{action_text}'")
            
            # 3. Parse and execute tool
            tool_name, tool_input = parse_action(action_text)
            tool_output = self.tools.execute(tool_name, tool_input)
            print(f"  Tool: {tool_name}({tool_input}) -> {tool_output}")
            
            # 4. Get process reward
            process_reward, reasons = get_process_reward(action_text, tool_name, tool_output)
            print(f"  Reward: {process_reward:+.1f} {reasons}")
            total_reward += process_reward
            
            # 5. Store trajectory
            self.trajectory.append({
                'action': action_text,
                'log_prob': log_prob,
                'reward': process_reward,
                'tool': tool_name
            })
            
            context += f" {tool_output}"
            
            if tool_name == 'finish':
                final_answer = tool_input
                break
        
        # Outcome reward
        outcome, msg = get_outcome_reward(final_answer, self.correct_answer)
        total_reward += outcome
        print(f"\n{'='*50}")
        print(f"Final Answer: {final_answer} -> {msg} ({outcome:+.1f})")
        print(f"Total Reward: {total_reward:+.1f}")
        
        return self.trajectory, total_reward

print("Agent ready!")

Agent ready!


---
### 4.7 Running an Episode

Let's watch the agent solve a problem step by step.

At each step, notice:
- The LLM generates text (cognitive action)
- We compute log probability (for RL gradient)
- We give process reward (dense feedback)

In [13]:
# Create agent for: What is 5 + 3?
agent = AgenticLLMAgent(correct_answer="8", max_steps=3)

# Run episode
trajectory, total_reward = agent.run_episode("What is 5 + 3?")

Task: What is 5 + 3?
==================================================

Step 1:
  LLM: 'I need to calculate this Action: calculate( 5 + 3 )'
  Tool: calculate(5+3) -> Result: 8
  Reward: +0.4 ['+0.3 used calculator', '+0.1 showed reasoning']

Step 2:
  LLM: 'Action: finish( 8 )'
  Tool: finish(8) -> ANSWER: 8
  Reward: +0.0 []

==================================================
Final Answer: 8 -> Correct! (+1.0)
Total Reward: +1.4


---
### 4.8 Computing the Training Losses

Now we compute the three losses from Section 2.5.

**Important:** In this notebook, we **calculate** the loss values to show how they work.
In real training, you would also **backpropagate** to update the model weights.

| Step | This Notebook | Real Training |
|------|---------------|---------------|
| 1. Collect trajectory | ✓ Agent runs, collects data | ✓ Same |
| 2. Compute loss | ✓ We do this! | ✓ Same |
| 3. Backprop | ✗ Skip | ✓ `loss.backward()` |
| 4. Update weights | ✗ Skip | ✓ `optimizer.step()` |

---

#### Loss 1: RL Loss (Policy Gradient)

$$\mathcal{L}_{RL} = -\sum_t \log \pi(a_t|s_t) \cdot A_t$$

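**The idea:** Multiply log probability by advantage (was this action better or worse than expected?), then minimize.
- Positive advantage: minimizing the loss **raises** $\log \pi$ for that action ("do more of this!")
- Negative advantage: minimizing the loss **lowers** $\log \pi$ for that action ("do less of this!")

The sign of the loss value itself carries no meaning: because $\log \pi \le 0$, a good action ($A_t > 0$) contributes a *positive* term. What matters is the direction of the gradient.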

---

**What we calculate in the notebook:**
```python
advantage = reward - baseline          # Was this step good?
rl_loss = -log_prob * advantage        # Weighted by model confidence
```

| This Notebook | Real Training |
|---------------|---------------|
| Log probs from tiny LLM | Log probs from GPT-4, Llama, etc. |
| `reward - average` as advantage | GAE (complex advantage estimation) |
| Sum over 2-3 steps | Sum over 1000s of tokens |
| We print the loss | We call `loss.backward()` |

In [14]:
def compute_rl_loss(trajectory, gamma=0.99):
    """
    Policy gradient loss: L = -sum(log_prob * advantage)
    """
    # Step 1: Compute returns (cumulative discounted reward)
    rewards = [t['reward'] for t in trajectory]
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    
    # Step 2: Compute advantages (return - baseline)
    baseline = np.mean(returns)
    advantages = [G - baseline for G in returns]
    
    # Step 3: Policy gradient loss
    rl_loss = 0
    print("RL Loss computation:")
    for t, A in zip(trajectory, advantages):
        term = -t['log_prob'] * A  # Negative because we minimize
        rl_loss += term
        direction = 'UP' if A > 0 else 'DOWN'
        print(f"  Step: log_prob={t['log_prob']:.2f}, advantage={A:+.2f}")
        print(f"        -> push {direction} probability")
    
    return rl_loss

rl_loss = compute_rl_loss(trajectory)
print(f"\nTotal RL Loss: {rl_loss:.3f}")

RL Loss computation:
  Step: log_prob=-7.63, advantage=+0.20
        -> push UP probability
  Step: log_prob=-6.00, advantage=-0.20
        -> push DOWN probability

Total RL Loss: 0.326
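
The baseline in `compute_rl_loss` is the mean return within one episode, so in a two-step episode one step always receives a positive advantage and the other a negative one, whether or not the final answer was right. One common alternative, named plainly here as a swapped-in technique, is a group-relative baseline (the idea behind GRPO-style methods): sample several episodes for the same task and compare their total rewards. A minimal sketch with made-up episode rewards:

```python
import numpy as np

def group_relative_advantages(episode_rewards, eps=1e-8):
    """Advantage of each episode = (total reward - group mean) / group std."""
    rewards = np.asarray(episode_rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical total rewards for 4 sampled attempts at the same task:
# one successful episode (+1.4) and three that answered incorrectly
group_rewards = [1.4, -1.0, -0.6, -1.0]
group_advs = group_relative_advantages(group_rewards)
print(np.round(group_advs, 2))
```

Here every token of the successful episode shares a positive advantage and every token of the failed ones a negative advantage, so the sign of the update tracks whether the episode as a whole went well.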


#### Loss 2: SFT Loss (Supervised Fine-Tuning)

$$\mathcal{L}_{SFT} = -\log \pi(y|x)$$

**The idea:** We have an "expert action" — what a perfect agent would do.
We want our model to imitate this expert.

**Example:**
- Expert action: `"I need to calculate this Action: calculate(5+3)"`
- Our model: generates something similar
- SFT loss: how different is our output from the expert?

---

**What we calculate in the notebook:**
```python
expert_action = "I need to calculate this Action: calculate(5+3)"
sft_loss = -average(log_probs_of_trajectory)  # How well did we match?
```

| This Notebook | Real Training |
|---------------|---------------|
| **1 fixed expert action** | **1000s of expert demonstrations** |
| "1 expert" = we hardcode what good looks like | "1000s" = large dataset of correct examples |
| Use trajectory log probs as proxy | Feed expert through model, compute exact log probs |
| One example | Batch of 32-512 examples per update |

**"One fixed expert action"** means: in this demo, we define ONE example of what the model should say.

**"1000s of expert demonstrations"** means: in real training, you have a huge dataset like:
```
Task: "What is 2+2?"  →  Expert: "Action: calculate(2+2)"
Task: "What is 3*5?"  →  Expert: "Action: calculate(3*5)"
Task: "Capital of France?"  →  Expert: "Action: search(capital france)"
... 10,000 more examples ...
```

In [15]:
def compute_sft_loss(trajectory):
    """
    SFT loss: encourage generating expert-like actions.
    """
    # Expert action: what a good agent would do
    expert_action = "I need to calculate this Action: calculate( 5 + 3 )"
    
    # In real training, we'd compute log prob of expert action
    # Here we approximate with trajectory log probs
    log_probs = [t['log_prob'] for t in trajectory]
    sft_loss = -np.mean(log_probs)  # Negative log prob
    
    print("SFT Loss computation:")
    print(f"  Expert action: '{expert_action[:40]}...'")
    print(f"  Avg log prob: {np.mean(log_probs):.2f}")
    print(f"  SFT Loss: {sft_loss:.3f} (lower = more like expert)")
    
    return sft_loss

sft_loss = compute_sft_loss(trajectory)

SFT Loss computation:
  Expert action: 'I need to calculate this Action: calcula...'
  Avg log prob: -6.81
  SFT Loss: 6.813 (lower = more like expert)


#### Loss 3: KL Divergence (Regularization)

$$\mathcal{L}_{KL} = D_{KL}(\pi_\theta \| \pi_{ref})$$

**The idea:** Don't let the model drift too far from its original behavior.

**Why?** Without this, the model might:
- Forget how to speak coherently
- Start outputting gibberish that happens to get rewards
- Lose all its general knowledge

---

**What we calculate in the notebook:**
```python
ref_log_prob = log_prob - 0.5    # Simulated reference model
kl = log_prob - ref_log_prob     # Difference = how much we've drifted
```

| This Notebook | Real Training |
|---------------|---------------|
| Simulated reference (subtract 0.5) | Frozen copy of original model |
| Simple difference | Full KL over entire vocabulary |
| Approximate | Computed for every token position |

**Reference model** = A frozen copy of the model BEFORE RL training.
```python
# In real training (PyTorch):
import copy

reference_model = copy.deepcopy(model)
reference_model.requires_grad_(False)  # Freeze every parameter: never update this!
```

In [16]:
def compute_kl_loss(trajectory):
    """
    KL divergence: keep policy close to reference model.
    """
    # In real training: compare current vs frozen reference model
    # Here we simulate the reference by subtracting a constant 0.5 from each log prob
    log_probs = [t['log_prob'] for t in trajectory]
    
    # Simulated reference log probs (base model)
    ref_log_probs = [lp - 0.5 for lp in log_probs]  # Slight difference
    
    # KL = E[log(current/ref)] = E[log_current - log_ref]
    kl_loss = np.mean([curr - ref for curr, ref in zip(log_probs, ref_log_probs)])
    kl_loss = abs(kl_loss)
    
    print("KL Loss computation:")
    print(f"  Current model log probs: {[f'{lp:.2f}' for lp in log_probs]}")
    print(f"  Reference model log probs: {[f'{lp:.2f}' for lp in ref_log_probs]}")
    print(f"  KL Loss: {kl_loss:.3f} (lower = closer to reference)")
    
    return kl_loss

kl_loss = compute_kl_loss(trajectory)

KL Loss computation:
  Current model log probs: ['-7.63', '-6.00']
  Reference model log probs: ['-8.13', '-6.50']
  KL Loss: 0.500 (lower = closer to reference)
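
For comparison, the "full KL over entire vocabulary" from the table can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the 10-token vocabulary, the random logits, and the helper `full_kl` are hypothetical. It also shows the sample-based estimate (average log-ratio over tokens drawn from the current policy), which is what the notebook's `log_prob - ref_log_prob` difference approximates.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax for a single logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def full_kl(policy_logits, ref_logits):
    """Exact KL(policy || reference) at one token position."""
    log_p = log_softmax(policy_logits)
    log_q = log_softmax(ref_logits)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

rng = np.random.default_rng(0)
vocab_size = 10
ref_logits = rng.normal(size=vocab_size)                          # hypothetical frozen reference
policy_logits = ref_logits + 0.5 * rng.normal(size=vocab_size)    # policy after some drift

kl_exact = full_kl(policy_logits, ref_logits)

# Sample-based estimate: average log-ratio over tokens sampled from the policy
policy_probs = np.exp(log_softmax(policy_logits))
samples = rng.choice(vocab_size, size=5000, p=policy_probs)
log_ratio = log_softmax(policy_logits) - log_softmax(ref_logits)
kl_estimate = float(log_ratio[samples].mean())

print(f"Exact KL: {kl_exact:.4f}, sampled estimate: {kl_estimate:.4f}")
print(f"KL of a model with itself: {full_kl(ref_logits, ref_logits):.4f}")
```

The exact KL is never negative, while a log-ratio averaged over only a few sampled tokens can be, which is one reason a demo might wrap the estimate in `abs()`.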


#### Total Training Loss

$$\mathcal{L} = \mathcal{L}_{RL} + \lambda_1 \mathcal{L}_{SFT} + \lambda_2 \mathcal{L}_{KL}$$

| Component | Purpose | Typical Weight |
|-----------|---------|----------------|
| RL Loss | Get rewards | 1.0 |
| SFT Loss | Imitate expert demonstrations | λ₁ = 0.1 |
| KL Loss | Stay close to the reference model | λ₂ = 0.01 |

---

| This Notebook | Real Training |
|---------------|---------------|
| Compute loss, print it | Compute loss |
| That's it! (demo only) | `loss.backward()` - compute gradients |
| | `optimizer.step()` - update weights |
| | Repeat 1000s of times |

In [17]:
lambda_sft = 0.1
lambda_kl = 0.01

total_loss = rl_loss + lambda_sft * sft_loss + lambda_kl * kl_loss

print("="*50)
print("TOTAL TRAINING LOSS")
print("="*50)
print(f"  RL Loss:  {rl_loss:.3f}")
print(f"  SFT Loss: {lambda_sft} * {sft_loss:.3f} = {lambda_sft * sft_loss:.3f}")
print(f"  KL Loss:  {lambda_kl} * {kl_loss:.3f} = {lambda_kl * kl_loss:.3f}")
print(f"  ---------")
print(f"  TOTAL:    {total_loss:.3f}")
print()
print("In real training, we'd backprop this loss to update the LLM weights!")

TOTAL TRAINING LOSS
  RL Loss:  0.326
  SFT Loss: 0.1 * 6.813 = 0.681
  KL Loss:  0.01 * 0.500 = 0.005
  ---------
  TOTAL:    1.013

In real training, we'd backprop this loss to update the LLM weights!


---
### 4.9 Summary

**What we built:**
- Tiny LLM that generates tokens with log probabilities
- Tools (calculator, search, finish)
- Process + Outcome rewards
- Complete agent that collects trajectories

**The training loop:**
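```python
for episode in range(num_episodes):
    trajectory = agent.run_episode(task)
    # rl_loss, sft_loss, kl_loss are computed from the trajectory as above
    loss = rl_loss + lambda_sft * sft_loss + lambda_kl * kl_loss
    optimizer.zero_grad()
    loss.backward()     # Compute gradients
    optimizer.step()    # Update LLM weights
```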

**What makes it "agentic":**
1. Multi-step reasoning (think → act → observe)
2. Tool use (external capabilities)
3. Memory (context accumulates)
4. Natural language throughout

---
### 4.10 Common Questions & Answers

---

#### Q: Why does "less negative log probability" mean "more confident"?

**A:** Log probability is always ≤ 0 (since probability is between 0 and 1).

| Probability | Log Probability | Interpretation |
|-------------|-----------------|----------------|
| 90% | -0.1 | Very confident |
| 50% | -0.7 | Uncertain |
| 10% | -2.3 | Not confident |
| 1% | -4.6 | Very unlikely |

**Less negative = closer to 0 = higher probability = more confident!**
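
The table values can be reproduced with the natural logarithm. A minimal check (the three-token trajectory at the end is a made-up example):

```python
import numpy as np

probabilities = np.array([0.90, 0.50, 0.10, 0.01])
log_probs = np.log(probabilities)  # natural logarithm

for p, lp in zip(probabilities, log_probs):
    print(f"p = {p:>4.0%}  ->  log p = {lp:.1f}")

# A trajectory's log probability is the sum of its per-token log probs,
# i.e. the log of the product of the per-token probabilities.
token_probs = np.array([0.9, 0.5, 0.1])   # hypothetical per-token probabilities
trajectory_log_prob = np.log(token_probs).sum()
```

Summing log probs instead of multiplying probabilities is also why long trajectories have large negative totals even when each token is fairly likely.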

---

#### Q: What's the difference between SFT and KL loss? Don't both prevent collapse?

**A:** They prevent different problems:

| Loss | Prevents | How |
|------|----------|-----|
| **SFT** | Bad behavior | "Here's exactly what to say" — learn from examples |
| **KL** | Forgetting | "Don't drift from original" — stay grounded |

**Analogy:**
- SFT = A teacher showing you correct answers
- KL = A rubber band pulling you back to who you were

---

#### Q: What does "greedy" vs "sampling" generation mean?

**A:** Two ways to pick the next token:

| Method | How it works | Result |
|--------|--------------|--------|
| **Greedy** | Always pick highest probability | Same output every time |
| **Sampling** | Randomly pick according to probabilities | Different outputs |

**For RL training:** We need sampling for exploration (try different actions).

**For deployment:** Greedy or low-temperature decoding is often used for consistency.
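
The two strategies, plus temperature, can be sketched in a few lines of NumPy. The three-token vocabulary and the logits are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

def greedy_pick(logits):
    """Always return the index of the highest-scoring token."""
    return int(np.argmax(logits))

def sample_pick(logits, rng, temperature=1.0):
    """Draw a token index at random, weighted by its probability."""
    return int(rng.choice(len(logits), p=softmax(logits, temperature)))

tokens = ["calculate", "search", "finish"]   # hypothetical vocabulary
logits = [2.0, 1.0, 0.5]                     # hypothetical model scores
rng = np.random.default_rng(0)

greedy = [tokens[greedy_pick(logits)] for _ in range(5)]
sampled = [tokens[sample_pick(logits, rng)] for _ in range(5)]
cold = [tokens[sample_pick(logits, rng, temperature=0.1)] for _ in range(5)]

print("Greedy:          ", greedy)
print("Sampling (T=1.0):", sampled)
print("Sampling (T=0.1):", cold)
```

At a temperature of 0.1 almost all probability mass sits on the top token, so low-temperature sampling behaves nearly like greedy decoding while still being a sampler.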

---

#### Q: How do you get the reference model for KL divergence?

**A:** Make a frozen copy before training:

```python
import copy

# Before RL training starts:
reference_model = copy.deepcopy(policy_model)

# Freeze it - never update!
for param in reference_model.parameters():
    param.requires_grad = False

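# During training (pseudocode: each call stands for scoring the sampled
# tokens and returning their per-token log probabilities):
current_log_probs = policy_model(tokens)      # This changes
ref_log_probs = reference_model(tokens)        # This stays fixed
kl_loss = (current_log_probs - ref_log_probs).mean()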
```

---

#### Q: What's the difference between process and outcome reward?

**A:**

| Reward Type | When | Density | Example |
|-------------|------|---------|--------|
| **Process** | Every step | Dense | +0.3 for using calculator |
| **Outcome** | End only | Sparse | +1.0 for correct final answer |

**Why both?**
- Outcome alone: Agent doesn't know which steps were good
- Process alone: Agent might optimize steps but never finish
- **Together: Best of both worlds!**
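
A small sketch of how the two signals might be combined into per-step returns. The three-step episode, the choice to add the outcome reward to the final step, and the undiscounted default `gamma=1.0` are assumptions for illustration; only the +0.3 and +1.0 values come from the table above.

```python
def combine_rewards(process_rewards, outcome_reward):
    """Add the end-of-episode outcome reward to the last step's process reward."""
    rewards = list(process_rewards)
    rewards[-1] += outcome_reward
    return rewards

def returns_to_go(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# Hypothetical 3-step episode: think, use calculator (+0.3), finish
process_rewards = [0.0, 0.3, 0.0]
outcome_reward = 1.0   # correct final answer

rewards = combine_rewards(process_rewards, outcome_reward)
returns = returns_to_go(rewards)

# Outcome-only baseline: every step receives exactly the same credit
outcome_only_returns = returns_to_go(combine_rewards([0.0, 0.0, 0.0], outcome_reward))

print("Process + outcome returns:", returns)
print("Outcome-only returns:     ", outcome_only_returns)
```

With outcome-only rewards and no discounting, every step gets an identical return, which is the "agent doesn't know which steps were good" problem in numeric form.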

---

#### Q: In this demo, is the model actually learning?

**A:** No! This demo only **computes the loss** — it doesn't update weights.

| This Demo | Real Training |
|-----------|---------------|
| Compute loss ✓ | Compute loss ✓ |
| Print loss ✓ | `loss.backward()` ✓ |
| Stop here | `optimizer.step()` ✓ |
| | Repeat 1000s of times |

The demo shows **how** the loss is calculated. Real training would use this loss to update billions of parameters.
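
To see what "actually learning" looks like at toy scale, here is a minimal REINFORCE-style sketch in NumPy. It is not part of the notebook's agent: the three-action policy, the reward table, and the learning rate are hypothetical, and the gradient is written out by hand instead of calling `loss.backward()`.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
actions = ["calculate", "search", "finish"]   # hypothetical tool choices
rewards = np.array([1.0, 0.0, 0.0])           # hypothetical: only "calculate" is rewarded
logits = np.zeros(3)                          # the policy's only parameters
learning_rate = 0.5

initial_prob = softmax(logits)[0]
for step in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                # sample an action (exploration)
    reward = rewards[a]
    # Gradient of log pi(a) with respect to the logits is one_hot(a) - probs
    grad_log_prob = -probs
    grad_log_prob[a] += 1.0
    # REINFORCE: the loss is -reward * log pi(a), so gradient descent on it
    # moves the logits along +reward * grad_log_prob.
    logits += learning_rate * reward * grad_log_prob

final_prob = softmax(logits)[0]
print(f"P(calculate): {initial_prob:.2f} -> {final_prob:.2f}")
```

The structure is the same as in real training (sample, score, compute a gradient, update), only with three parameters instead of billions.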

---

#### Q: What makes this "agentic" vs regular RL?

**A:** Key differences:

| Aspect | Regular RL | Agentic RL |
|--------|-----------|------------|
| Input | Numeric state (vectors, pixels) | Natural language |
| Actions | Discrete choices | Generated text |
| Tools | None | Calculator, search, etc. |
| Reasoning | Implicit | Explicit ("I need to...") |
| Reward | Single environment reward | **Process + Outcome** |

---

#### Q: How is this different from just prompting?

**A:** Prompting uses a **fixed model**. Agentic RL actually **updates the model weights** based on rewards.

| Approach | Model Changes? | Gets Better? |
|----------|----------------|--------------|
| Prompting | No (frozen) | No — same capability |
| Agentic RL | Yes (training) | Yes — improves at specific tasks |

The model learns from experience and becomes better at tasks you train it on.

---

#### Q: What about safety?

**A:** Several mechanisms help:

| Mechanism | How It Helps |
|-----------|--------------|
| **KL loss** | Keeps the policy close to the reference model, limiting drift toward reward-hacking behavior |
| **RLHF** | Human feedback steers toward safe outputs |
| **Constitutional AI** | Model critiques its own outputs |
| **Reward design** | Penalize harmful actions explicitly |

Safety is an active research area — these techniques reduce but don't eliminate risks.

---

#### Q: Can I try this with a real LLM?

**A:** Yes! Here's a practical path:

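```python
# Using the TRL (Transformer Reinforcement Learning) library.
# Note: this follows TRL's classic PPO interface (releases before 0.12);
# newer releases changed PPOTrainer, so check the docs for your version.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# 1. Load a pre-trained model (PPO needs a value head) and its tokenizer
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Define your reward function (score_response is your own scorer)
def reward_fn(response):
    return score_response(response)

# 3. Train with PPO
config = PPOConfig(model_name=model_name)
trainer = PPOTrainer(config, model, tokenizer=tokenizer)
# queries, responses: lists of token-ID tensors; rewards: list of scalar tensors
trainer.step(queries, responses, rewards)
```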

**Recommended starting point:** TRL library + Llama-3-8B or Mistral-7B.

---

#### Q: How expensive is this to train?

**A:**

| Scale | Cost | Hardware |
|-------|------|----------|
| GPT-4 level | $10M+ | 1000s of GPUs |
| Fine-tune 7B model | $100-1000 | 1-8 GPUs |
| This demo | Free | Your laptop |

**Key insight:** You don't need to train from scratch! Fine-tuning a pre-trained model on specific tasks is much cheaper. The bottleneck is often **getting good reward signals**, not compute.