# Reasoning Agent: HTML Code Generator with Self-Improvement

This notebook demonstrates a **reasoning agent** that:
1. Generates HTML code based on a user prompt
2. Evaluates and improves the code iteratively

We'll implement two versions:
- **Version 1**: LLM-as-Judge (the LLM evaluates its own output)
- **Version 2**: Reflection with External Feedback (using HTML validation)

We'll use **Hugging Face's free Inference API** with open-source models.

---

## 📚 Theory: Understanding Agentic AI

Before diving into the implementation, let's understand the theoretical foundations of agentic AI and reasoning systems.

### What is Agentic AI?

**Agentic AI** refers to AI systems that can **plan, act, evaluate, and improve** autonomously in pursuit of specific goals. Unlike traditional AI that follows fixed instructions or responds to patterns, agentic systems use **reasoning loops** to make context-aware decisions in real time.

At its core, an agentic system combines:
- A **Large Language Model (LLM)** as the reasoning engine
- **External tools** that extend capabilities (search, code execution, validation)
- **Feedback loops** that enable learning and self-improvement

This combination allows AI to handle open-ended, multifaceted problems that require adaptive workflows and context-aware decisions.

### The Reasoning Loop: Think-Act-Observe

Agentic systems operate through a continuous cycle that mirrors human problem-solving:

```
┌─────────────┐
│   THINK     │  1. Task Decomposition: Break down the goal
│  (Reason)   │  2. Planning: Decide on approach
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    ACT      │  3. Delegation: Assign to tools/agents
│  (Execute)  │  4. Action: Generate output or call tools
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  OBSERVE    │  5. Evaluation: Review results
│ (Reflect)   │  6. Adaptation: Refine approach
└──────┬──────┘
       │
       └──────► Loop back to THINK
```

This iterative process enables:
- **Decision-making** based on context
- **Learning** from results
- **Long-term planning** across multiple steps

### Key Agentic Design Patterns

#### 1. **Reflection** 🔄

Creating feedback loops where LLMs review and improve their outputs.

**Two Types**:
- **Self-Reflection**: LLM critiques its own output (LLM-as-Judge)
- **External Feedback**: Using tools to provide objective validation

**Research Evidence**:
- **Self-Refine** (Madaan et al., 2023): ~20% improvement across diverse tasks
- **Reflexion** (Shinn et al., 2023): 91% accuracy on HumanEval (vs GPT-4's 80%)
- **CRITIC** (Gou et al., 2024): 10-30% improvement using external tools

**When to Apply Reflection**:
- ✅ Validating request feasibility
- ✅ Checking initial plans
- ✅ After each execution step
- ✅ Verifying final outputs

**Trade-offs**:
- ➕ Improved accuracy and quality
- ➖ Increased latency (multiple LLM calls)
- ➖ Higher costs
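
To make the trade-off concrete, here is a minimal sketch of a reflection loop. The `critique` and `revise` callables are hypothetical placeholders for LLM or tool calls; the toy versions below only check punctuation and capitalization, so the example runs without a model:

```python
from typing import Callable, Tuple


def reflect(
    draft: str,
    critique: Callable[[str], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Critique and revise a draft; return (final_draft, number_of_critic_calls)."""
    calls = 0
    for _ in range(max_rounds):
        ok, feedback = critique(draft)
        calls += 1
        if ok:
            break
        draft = revise(draft, feedback)
    return draft, calls


def toy_critique(text: str) -> Tuple[bool, str]:
    """Toy critic: reports one issue at a time."""
    if not text.endswith("."):
        return False, "add a final period"
    if not text[0].isupper():
        return False, "capitalize the first letter"
    return True, ""


def toy_revise(text: str, feedback: str) -> str:
    """Toy reviser: applies the single fix named in the feedback."""
    if "period" in feedback:
        return text + "."
    return text[0].upper() + text[1:]


final, calls = reflect("hello world", toy_critique, toy_revise)
print(final, calls)
```

Each extra round costs one more critic call (and usually one more revision call), which is where the added latency and cost come from.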

#### 2. **Tool Use** 🛠️

Extending LLM capabilities with external tools:
- Web search for real-time information
- Code execution for calculations
- Database queries for data access
- Validators for correctness checking
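
One simple way to expose tools to an agent is a name-to-function registry. The sketch below is illustrative only: the tool names, the `call_tool` helper, and the string-in/string-out convention are assumptions, not part of any particular framework.

```python
import ast
import operator
from typing import Callable, Dict

# Arithmetic operators the toy calculator is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST) -> float:
    """Evaluate a tiny arithmetic AST without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculator(expression: str) -> str:
    return str(_eval(ast.parse(expression, mode="eval").body))


def word_count(text: str) -> str:
    return str(len(text.split()))


# Hypothetical registry: the agent picks a tool by name and passes one string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": calculator,
    "word_count": word_count,
}


def call_tool(name: str, argument: str) -> str:
    """Dispatch to a registered tool, returning an error string for unknown names."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](argument)


print(call_tool("calculator", "2 * (3 + 4)"))
```

Returning an error string for an unknown tool, rather than raising, lets the observation be fed back to the model so it can try a different action.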

#### 3. **Planning** 📋

Breaking complex goals into actionable steps:
- Multi-step reasoning
- Conditional logic and branching
- Dynamic replanning based on results

#### 4. **Multi-Agent Collaboration** 👥

Multiple specialized agents working together:
- Division of labor by expertise
- Parallel processing of subtasks
- Coordination and synthesis

### The ReAct Framework

**ReAct** (Reasoning + Acting) by Yao et al. (2022) is a foundational framework that combines:

- **Reasoning**: Explicit thought traces (reflection + planning)
- **Acting**: Task-relevant actions in the environment

The framework creates a loop where:
1. Reasoning guides action selection
2. Actions produce observations
3. Observations inform further reasoning
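
The three steps above can be sketched as a short loop. This is a toy illustration, not the paper's implementation: `scripted_policy` stands in for the LLM, and the `Action: name[argument]` text format and the `lookup` tool are assumptions made for the example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical tool set: a single lookup table.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda country: {"France": "Paris"}.get(country, "unknown"),
}

ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")


def scripted_policy(transcript: List[str]) -> str:
    """Stand-in for an LLM: look something up first, then answer with it."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return "Thought: I need the capital of France.\nAction: lookup[France]"
    answer = observations[-1].split("Observation: ", 1)[1]
    return f"Thought: I have what I need.\nAction: finish[{answer}]"


def react(question: str, policy: Callable[[List[str]], str], max_steps: int = 5) -> str:
    """Alternate reasoning/acting steps with observations until 'finish'."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = policy(transcript)          # reasoning guides action selection
        transcript.append(step)
        match = ACTION_RE.search(step)
        if match is None:
            transcript.append("Observation: could not parse an action")
            continue
        name, arg = match.groups()
        if name == "finish":
            return arg
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript.append(f"Observation: {observation}")  # informs the next step
    return "no answer within step budget"


print(react("What is the capital of France?", scripted_policy))
```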

**Popular Implementations**:
- **DSPy** (Stanford NLP): `dspy.ReAct` module
- **LangGraph**: `create_react_agent` function
- **smolagents** (HuggingFace): ReAct-based code agents

### Traditional AI vs Agentic AI

| Aspect | Traditional AI | Agentic AI |
|--------|---------------|------------|
| **Behavior** | Fixed instructions | Dynamic decision-making |
| **Feedback** | One-shot response | Iterative refinement |
| **Tools** | Limited/none | Extensive tool use |
| **Planning** | Pre-programmed | Adaptive planning |
| **Learning** | Static | Self-improvement |
| **Context** | Pattern matching | Context-aware reasoning |

### Our Implementation Approach

In this notebook, we'll implement **two versions** of a reasoning agent:

#### **Version 1: LLM-as-Judge (Self-Reflection)**
```
User Prompt → Generate HTML → Self-Evaluate → Improve → Repeat
```
- The LLM generates code
- The same LLM judges its own output
- Iteratively improves based on self-critique
- **Pros**: Simple, no external dependencies
- **Cons**: May have blind spots in self-evaluation

#### **Version 2: Reflection with External Feedback**
```
User Prompt → Generate HTML → External Validator → Reflect on Errors → Fix → Repeat
```
- The LLM generates code
- External HTML parser validates syntax
- LLM reflects on objective validation errors
- Iteratively fixes issues
- **Pros**: Objective validation, catches concrete errors
- **Cons**: Requires external tools, more complex

Both approaches demonstrate the power of **reflection** in improving AI output quality through iterative refinement.

### Key Research Papers

1. **Yao et al. (2022)**: ["ReAct: Synergizing Reasoning and Acting in Language Models"](https://arxiv.org/abs/2210.03629)
2. **Madaan et al. (2023)**: ["Self-Refine: Iterative Refinement with Self-Feedback"](https://arxiv.org/abs/2303.17651)
3. **Shinn et al. (2023)**: ["Reflexion: Language Agents with Verbal Reinforcement Learning"](https://arxiv.org/abs/2303.11366)
4. **Gou et al. (2024)**: ["CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing"](https://arxiv.org/abs/2305.11738)

Now let's see these concepts in action! 🚀

## Setup

First, install the required packages and set up authentication.

In [None]:
!pip install -q huggingface_hub python-dotenv

In [None]:
import os
from huggingface_hub import InferenceClient
from typing import Dict, List, Tuple
import json
from html.parser import HTMLParser
import re

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
from dotenv import load_dotenv

env_path = "/content/drive/MyDrive/.env"
load_dotenv(env_path)

HF_TOKEN = os.getenv('HF_TOKEN')

### Model Configuration

We'll use **Qwen/Qwen2.5-72B-Instruct** - a powerful open-source model available via Hugging Face Inference API.

Alternative models you can try:
- `meta-llama/Llama-3.1-70B-Instruct`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `microsoft/Phi-3-medium-4k-instruct`

In [None]:
# Initialize the Hugging Face Inference Client
MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"
client = InferenceClient(token=HF_TOKEN)

---

## 💬 Understanding LLM Messages: Roles and Content

Before we dive into building our reasoning agent, let's understand how we communicate with Large Language Models (LLMs).

### How Do We Talk to LLMs?

When you interact with an LLM (like ChatGPT, Claude, or open-source models), you're not just sending plain text. Instead, you send **structured messages** that help the LLM understand the context and respond appropriately.

Each message has two key components:

#### 1. **Role** - Who is speaking?

There are three main roles:

| Role | Description | Typical content |
|------|-------------|----------|
| **`system`** | Sets the behavior and context | "You are a helpful assistant", "You are an expert coder" |
| **`user`** | The human asking questions | Your prompts and requests |
| **`assistant`** | The LLM's responses | Previous answers from the AI |

#### 2. **Content** - What is being said?

The actual text of the message - the instructions, questions, or responses.

### Message Structure

Messages are formatted as a list of dictionaries:

```python
messages = [
    {
        "role": "system",
        "content": "You are a helpful coding assistant."
    },
    {
        "role": "user",
        "content": "Write a Python function to calculate factorial."
    }
]
```

### Why This Matters

Understanding message roles is crucial because:

1. **System messages** set the "personality" and instructions for the LLM
2. **Conversation history** is maintained through user/assistant message pairs
3. **Context** from previous messages influences future responses
4. **Agentic systems** use this structure to create feedback loops
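
To show how conversation history accumulates, here is a small sketch. The `chat_turn` helper and the `echo_llm` stand-in are hypothetical; `echo_llm` only reports how many messages it received, so the example runs without calling a real model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(history: List[Message], user_text: str,
              llm: Callable[[List[Message]], str]) -> str:
    """Append a user turn, get a reply, and store it as an assistant turn."""
    history.append({"role": "user", "content": user_text})
    reply = llm(history)  # the model sees the full history so far
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_llm(messages: List[Message]) -> str:
    """Stand-in for a real model: reports how many messages it was given."""
    return f"I can see {len(messages)} message(s)."


history: List[Message] = [
    {"role": "system", "content": "You are a concise assistant."}
]
first = chat_turn(history, "Hello!", echo_llm)
second = chat_turn(history, "What did I just say?", echo_llm)
print(first)
print(second)
print([m["role"] for m in history])
```

Because every call receives the whole list, the model has no memory of its own: whatever context it needs must be present in `messages`.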

Let's see this in action! 👇

In [None]:
def call_llm(messages: List[Dict[str, str]], max_tokens: int = 2000, temperature: float = 0.7) -> str:
    """
    Call the LLM with a list of messages.

    Args:
        messages: List of message dicts with 'role' and 'content'
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0 to 1.0)

    Returns:
        Generated text response
    """
    try:
        response = client.chat_completion(
            __,
            model=MODEL_NAME,
            __,
            __
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error calling LLM: {str(e)}"

In [None]:
# Test the connection
test_response = call_llm([{"role": "user", "content": __}], max_tokens=50)
print("Model test:", test_response)

### 🎯 Interactive Demo: Message Roles in Action

In [None]:
# Example 1: Simple message with system role
print("Example 1: Basic Message Structure\n")
print("="*60)

messages_example1 = [
    ###########################
    # The agent is a "friendly teacher explaining concepts simply",
    # and you ask "What is Python?"
    # INSERT YOUR CODE HERE
    {
        "role": 
        "content": 
    },
    {
        "role": 
        "content": 
    }
    ###########################
]

print("Messages sent to LLM:")
for msg in messages_example1:
    print(f"\n[{msg['role'].upper()}]")
    print(f"{msg['content']}")

In [None]:
print("Calling LLM...\n")

response1 = # INSERT YOUR CODE HERE
print(response1)

In [None]:
# Example 2: Impact of different system messages
print("Example 2: How System Messages Change Behavior\n")
print("="*60)

user_question = "Explain what a variable is in programming."

###########################
# Ask the same question to two different agents.
# INSERT YOUR CODE HERE
messages_friendly = # Friendly teacher
messages_technical = # Technical expert
###########################

In [None]:
###########################
# Print both answers
# INSERT YOUR CODE HERE
###########################

In [None]:
# Example 3: Multi-turn conversation
print("Example 3: Multi-Turn Conversation\n")
print("="*60)

messages_example3 = [
    {
        "role": "__",
        "content": "You are a concise coding assistant."
    },
    {
        "role": "__",
        "content": "Write a function to add two numbers."
    },
    {
        "role": "__",
        "content": "def add(a, b):\n    return a + b"
    },
    {
        "role": "__",
        "content": "Now add type hints to it."
    }
]

print("Conversation history:")
for i, msg in enumerate(messages_example3, 1):
    print(f"\n{i}. [{msg['role'].upper()}]")
    print(f"   {msg['content'][:100]}..." if len(msg['content']) > 100 else f"   {msg['content']}")

In [None]:
print("Calling LLM with conversation history...\n")

###########################
# INSERT YOUR CODE HERE
###########################

---

## Version 1: LLM-as-Judge

In this version, the LLM generates HTML code, then acts as a judge to evaluate its own output and suggest improvements. The agent iterates through multiple rounds of generation and self-critique.
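
One practical detail of this pattern is turning the judge's free-text reply into a number the loop can compare against a target. The sketch below assumes the judge is asked to answer with a `SCORE:` line and a `FEEDBACK:` section (the format requested in the judge prompt later in this section); `parse_judgement` and its clamping behavior are illustrative choices, not a fixed API.

```python
import re
from typing import Tuple


def parse_judgement(reply: str, default_score: float = 5.0) -> Tuple[float, str]:
    """Extract (score, feedback) from a reply shaped like 'SCORE: n' + 'FEEDBACK: ...'."""
    score = default_score
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if match:
        # Clamp to the 0-10 scale in case the model ignores the instructions.
        score = min(10.0, max(0.0, float(match.group(1))))
    if "FEEDBACK:" in reply:
        feedback = reply.split("FEEDBACK:", 1)[1].strip()
    else:
        feedback = reply.strip()
    return score, feedback


reply = "SCORE: 7.5/10\nFEEDBACK: Add alt text to images."
print(parse_judgement(reply))
```

Falling back to a default score keeps the loop running when the reply is malformed, at the cost of hiding the failure; logging such cases is worth considering.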

In [None]:
# Function 1: generate HTML code based on a prompt

def generate_html(prompt: str) -> str:
    """
    Generate HTML code based on a prompt.

    Args:
        prompt: Description of the HTML to generate

    Returns:
        Generated HTML code
    """
    ###########################
    # INSERT YOUR CODE HERE
    # Messages instructions:
    # - Use a system message to set the role as an "expert HTML developer"
    # - Tell the agent to respond only with HTML code
    # - Use the user prompt to specify the HTML content needed
    # - The prompt should be passed as a user message (ask me about this!)
    # Response instructions:
    # - Limit max tokens to 2000
    # - Set a high temperature to encourage creativity
    messages = 
    response = 
    ###########################

    # Extract HTML code from response (remove markdown code blocks if present)
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code

In [None]:
# Test the generate_html function (this will not be our real task)
test_prompt = "A simple homepage for a bakery with a header, product list, and contact info"
generated_html = generate_html(test_prompt)
print("Generated HTML:\n")
print(generated_html)

In [None]:
# Function 2: judge HTML code quality based on a prompt

def judge_html(html_code: str, original_prompt: str) -> Tuple[float, str]:
    """
    Use LLM to judge the quality of generated HTML.

    Args:
        html_code: The HTML code to evaluate
        original_prompt: The original user prompt

    Returns:
        Tuple of (score, feedback)
        - score: Quality score from 0.0 to 10.0
        - feedback: Detailed feedback and improvement suggestions
    """
    ###########################
    # INSERT YOUR CODE HERE
    # Messages instructions (you can use ChatGPT to help you write this!):
    # - Write a system message that sets the role as an "expert web developer and code reviewer"
    # - Provide clear instructions to score the HTML from 0 to 10 based on criteria like correctness, style, and adherence to the prompt
    # - Ask for detailed feedback on improvements
    # - Request the response to include a "SCORE:" line and a "FEEDBACK:" section
    # - Include both the original prompt and the generated HTML in user messages
    # Response instructions:
    # - Limit max tokens to 1000
    # - Use a low temperature for more deterministic output
    messages = 
    response = 
    ###########################

    # Parse score and feedback
    score = 5.0  # Default score
    feedback = response

    if "SCORE:" in response:
        try:
            score_text = response.split("SCORE:")[1].split("\n")[0].strip()
            score = float(re.findall(r'\d+\.?\d*', score_text)[0])
        except (IndexError, ValueError):
            pass  # keep the default score if no number can be parsed

    if "FEEDBACK:" in response:
        feedback = response.split("FEEDBACK:")[1].strip()

    return score, feedback

In [None]:
# Test the judge_html function
# You should get a score between 0 and 10, and a feedback string
test_score, test_feedback = judge_html(generated_html, test_prompt)
print(f"Judge Score: {test_score}/10")
print("Judge Feedback:")
print(test_feedback)

In [None]:
# Function 3: improve HTML code based on feedback

def improve_html(html_code: str, feedback: str, original_prompt: str) -> str:
    """
    Improve HTML code based on feedback.

    Args:
        html_code: Current HTML code
        feedback: Feedback from the judge
        original_prompt: Original user prompt

    Returns:
        Improved HTML code
    """
    ###########################
    # INSERT YOUR CODE HERE
    # Messages instructions (you can use ChatGPT to help you write this!):
    # - Set the role as an "expert HTML developer" and code improver
    # - Instruct the agent to improve the provided HTML code based on the feedback
    # - Request only the improved HTML code in the response
    # - Include the original prompt, current HTML, and feedback in user messages
    # Response instructions:
    # - Limit max tokens to 2000
    # - Use a higher temperature for creativity    
    messages = 
    response = 
    ###########################

    # Extract HTML code
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code

Let's tie it all together:

In [None]:
def reasoning_agent_v1(prompt: str, max_iterations: int = 3, target_score: float = 8.0) -> Dict:
    """
    Reasoning agent that generates and improves HTML using LLM-as-Judge.

    Args:
        prompt: Description of the HTML to generate
        max_iterations: Maximum number of improvement iterations
        target_score: Target quality score to achieve

    Returns:
        Dictionary with final HTML, score, and iteration history
    """
    print(f"🤖 Reasoning Agent V1: LLM-as-Judge")
    print(f"📝 Task: {prompt}\n")

    history = []

    # Initial generation
    print("[Iteration 1] Generating initial HTML...")
    html_code = # INSERT YOUR CODE HERE

    # Evaluate
    print("[Iteration 1] Evaluating quality...")
    score, feedback = # INSERT YOUR CODE HERE
    print(f"[Iteration 1] Score: {score}/10")
    print(f"[Iteration 1] Feedback: {feedback[:200]}...\n")

    history.append({
        "iteration": 1,
        "html": html_code,
        "score": score,
        "feedback": feedback
    })

    # Iterative improvement
    for i in range(2, max_iterations + 1):
        if score >= target_score:
            print(f"✅ Target score achieved! Stopping at iteration {i-1}\n")
            break

        print(f"[Iteration {i}] Improving HTML based on feedback...")
        html_code = # INSERT YOUR CODE HERE

        print(f"[Iteration {i}] Evaluating improved version...")
        score, feedback = # INSERT YOUR CODE HERE
        print(f"[Iteration {i}] Score: {score}/10")
        print(f"[Iteration {i}] Feedback: {feedback[:200]}...\n")

        history.append({
            "iteration": i,
            "html": html_code,
            "score": score,
            "feedback": feedback
        })

    print(f"🎯 Final Score: {score}/10\n")

    return {
        "final_html": html_code,
        "final_score": score,
        "history": history
    }

### Demo: Version 1 (LLM-as-Judge)

In [None]:
# Run the reasoning agent
# Use the prompt: "A modern landing page for a coffee shop with a hero section, menu preview, and contact form"
result_v1 = # INSERT YOUR CODE HERE

In [None]:
# Display final HTML
print("="*80)
print("FINAL HTML CODE:")
print("="*80)
print(result_v1["final_html"])

In [None]:
# Visualize the HTML in Colab
from IPython.display import HTML, display

display(HTML(result_v1["final_html"]))

---

## Version 2: Reflection with External Feedback

In this version, we add **external validation** using:
1. HTML syntax validation (checking for parsing errors)
2. Structure validation (checking for required elements)
3. LLM reflection based on external feedback

This demonstrates how external tools can provide objective feedback to guide the reasoning process.
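
As background for the validator, Python's standard-library `HTMLParser` is event-driven: it calls `handle_starttag` and `handle_endtag` as it reads the input, and it does not itself report unbalanced tags. The small `EventLogger` class below is a hypothetical helper that just records those events:

```python
from html.parser import HTMLParser
from typing import List, Tuple


class EventLogger(HTMLParser):
    """Record start/end tag events in the order the parser emits them."""

    def __init__(self):
        super().__init__()
        self.events: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p>Hello <b>world</p>")  # note: <b> is never closed
print(logger.events)
```

The parser accepts the unclosed `<b>` without complaint, which is why a validator built on it has to keep its own stack of open tags.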

In [None]:
# Void elements never take a closing tag, so they are kept off the tag stack.
VOID_ELEMENTS = {
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
}

class HTMLValidator(HTMLParser):
    """
    Custom HTML parser to validate HTML structure and collect errors.
    """
    def __init__(self):
        super().__init__()
        self.errors = []     # human-readable problems found while parsing
        self.tags = []       # every start tag seen, in order
        self.tag_stack = []  # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag not in VOID_ELEMENTS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if not self.tag_stack:
            self.errors.append(f"Unexpected closing tag: </{tag}>")
        elif self.tag_stack[-1] != tag:
            self.errors.append(f"Mismatched tags: expected </{self.tag_stack[-1]}>, got </{tag}>")
        else:
            self.tag_stack.pop()

    def error(self, message):
        # Recent Python versions no longer call this hook; kept for compatibility.
        self.errors.append(f"Parse error: {message}")

def validate_html(html_code: str, required_elements: List[str] = None) -> Tuple[bool, List[str]]:
    """
    Validate HTML code for syntax errors and required elements.

    Args:
        html_code: HTML code to validate
        required_elements: List of required HTML tags (e.g., ['html', 'body', 'head'])

    Returns:
        Tuple of (is_valid, list_of_issues)
    """
    validator = HTMLValidator()
    issues = []

    try:
        validator.feed(html_code)
    except Exception as e:
        issues.append(f"Critical parsing error: {str(e)}")
        return False, issues

    # Check for parsing errors
    if validator.errors:
        issues.extend(validator.errors)

    # Check for unclosed tags
    if validator.tag_stack:
        issues.append(f"Unclosed tags: {', '.join(validator.tag_stack)}")

    # Check for required elements
    if required_elements:
        missing = [elem for elem in required_elements if elem not in validator.tags]
        if missing:
            issues.append(f"Missing required elements: {', '.join(missing)}")

    # Basic structure checks
    if 'html' in validator.tags:
        if 'head' not in validator.tags:
            issues.append("Missing <head> element")
        if 'body' not in validator.tags:
            issues.append("Missing <body> element")

    is_valid = len(issues) == 0
    return is_valid, issues

In [None]:
def reflect_and_improve(html_code: str, validation_issues: List[str], original_prompt: str) -> str:
    """
    Use LLM to reflect on external validation feedback and improve code.

    Args:
        html_code: Current HTML code
        validation_issues: List of issues from external validator
        original_prompt: Original user prompt

    Returns:
        Improved HTML code
    """
    issues_text = "\n- " + "\n- ".join(validation_issues)

    ###########################
    # INSERT YOUR CODE HERE
    # Messages instructions:
    # - Set the role as an "expert HTML developer" who improves code based on validation feedback
    # - Instruct the agent to reflect on the validation issues and generate corrected HTML code
    # - Request only the corrected HTML code in the response
    # - Include the original prompt, current HTML, and validation issues in user messages
    # Response instructions:
    # - Limit max tokens to 2000
    # - Use a higher temperature for creativity
    messages = 
    response = 
    ###########################

    # Extract HTML code
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code

In [None]:
def reasoning_agent_v2(prompt: str, max_iterations: int = 3, required_elements: List[str] = None) -> Dict:
    """
    Reasoning agent with reflection based on external validation feedback.

    Args:
        prompt: Description of the HTML to generate
        max_iterations: Maximum number of improvement iterations
        required_elements: List of required HTML elements

    Returns:
        Dictionary with final HTML, validation status, and iteration history
    """
    print(f"🤖 Reasoning Agent V2: Reflection with External Feedback")
    print(f"📝 Task: {prompt}\n")

    history = []

    # Initial generation
    print("[Iteration 1] Generating initial HTML...")
    html_code = # INSERT YOUR CODE HERE

    # External validation
    print("[Iteration 1] Running external validation...")
    is_valid, issues = # INSERT YOUR CODE HERE

    if is_valid:
        print("[Iteration 1] ✅ Validation passed!")
    else:
        print(f"[Iteration 1] ❌ Validation failed with {len(issues)} issue(s)")
        for issue in issues:
            print(f"  - {issue}")
    print()

    # Also get LLM judge score
    score, feedback = # INSERT YOUR CODE HERE
    print(f"[Iteration 1] LLM Judge Score: {score}/10\n")

    history.append({
        "iteration": 1,
        "html": html_code,
        "is_valid": is_valid,
        "issues": issues,
        "score": score,
        "feedback": feedback
    })

    # Iterative improvement based on external feedback
    for i in range(2, max_iterations + 1):
        if is_valid and score >= 8.0:
            print(f"✅ Code is valid and high quality! Stopping at iteration {i-1}\n")
            break

        if not is_valid:
            # Fix validation issues first
            print(f"[Iteration {i}] Reflecting on validation issues and improving...")
            html_code = # INSERT YOUR CODE HERE
        else:
            # Improve based on LLM feedback
            print(f"[Iteration {i}] Improving based on LLM feedback...")
            html_code = # INSERT YOUR CODE HERE

        # Validate again
        print(f"[Iteration {i}] Running external validation...")
        is_valid, issues = # INSERT YOUR CODE HERE

        if is_valid:
            print(f"[Iteration {i}] ✅ Validation passed!")
        else:
            print(f"[Iteration {i}] ❌ Validation failed with {len(issues)} issue(s)")
            for issue in issues:
                print(f"  - {issue}")
        print()

        # Get LLM score
        score, feedback = # INSERT YOUR CODE HERE
        print(f"[Iteration {i}] LLM Judge Score: {score}/10\n")

        history.append({
            "iteration": i,
            "html": html_code,
            "is_valid": is_valid,
            "issues": issues,
            "score": score,
            "feedback": feedback
        })

    print(f"🎯 Final Status: {'✅ Valid' if is_valid else '❌ Invalid'}, Score: {score}/10\n")

    return {
        "final_html": html_code,
        "is_valid": is_valid,
        "final_score": score,
        "history": history
    }

### Demo: Version 2 (Reflection with External Feedback)

In [None]:
# Run the reasoning agent with external validation
# Use the prompt: "A modern landing page for a coffee shop with a hero section, menu preview, and contact form"
# The required elements are ['html', 'head', 'body', 'title']
result_v2 = # INSERT YOUR CODE HERE

In [None]:
# Display final HTML
print("="*80)
print("FINAL HTML CODE:")
print("="*80)
print(result_v2["final_html"])

In [None]:
# Visualize the HTML in Colab
from IPython.display import HTML, display

display(HTML(result_v2["final_html"]))

### 💭 Discussion questions

1. **When is self-reflection (LLM-as-Judge) better than external validation?**
   - Think about subjective vs. objective tasks

2. **What are the trade-offs of adding more reflection iterations?**
   - Consider: quality, cost, latency, diminishing returns

3. **How would you combine multiple types of feedback?**
   - Example: Syntax validation + style checking + user preferences

4. **What tasks are NOT suitable for reasoning agents?**
   - When is a simple one-shot generation better?

5. **How can we prevent infinite loops in reasoning agents?**
   - What stopping criteria make sense?

6. **Real-world applications**: Where have you seen reasoning agents in action?
   - GitHub Copilot, ChatGPT Code Interpreter, etc.
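
As a possible starting point for question 5, the sketch below combines three common stopping criteria: a target score, an iteration budget, and a plateau check. The function name, thresholds, and `patience` parameter are illustrative assumptions, not the only reasonable design.

```python
from typing import List


def should_stop(
    scores: List[float],
    max_iterations: int = 5,
    target_score: float = 8.0,
    patience: int = 2,
    min_delta: float = 0.1,
) -> str:
    """Return a reason to stop, or "" to continue.

    `scores` holds one judge score per completed iteration, oldest first.
    """
    if not scores:
        return ""
    if scores[-1] >= target_score:
        return "target reached"
    if len(scores) >= max_iterations:
        return "iteration budget exhausted"
    if len(scores) > patience:
        # Plateau: the last `patience` scores did not beat the earlier best.
        best_before = max(scores[:-patience])
        if max(scores[-patience:]) < best_before + min_delta:
            return "no recent improvement"
    return ""


print(should_stop([6.0, 8.5]))
print(should_stop([6.0, 6.0, 6.05]))
```

The iteration budget is the only criterion that guarantees termination; the other two just stop early when further rounds are unlikely to pay off.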

---

## Key Takeaways

This notebook demonstrated two approaches to building reasoning agents:

### **Version 1: LLM-as-Judge**
- The LLM evaluates its own output
- Useful for subjective quality assessment
- Simpler implementation
- May have blind spots in self-evaluation

### **Version 2: Reflection with External Feedback**
- Combines LLM reasoning with objective external tools
- More reliable for catching concrete errors
- Demonstrates how to integrate external validation
- Better separation of concerns (correctness vs. quality)

### **General Principles**
1. **Iterative refinement**: Both agents improve through multiple iterations
2. **Feedback loops**: Critical for self-improvement
3. **External validation**: Adds objectivity and reliability
4. **Open-source models**: Powerful reasoning is possible without proprietary APIs

### **Extensions You Can Try**
- Add performance metrics (page size, load time)
- Implement multi-agent collaboration (one generates, another reviews)
- Add user feedback as another external signal