# Reasoning Agent: HTML Code Generator with Self-Improvement

This notebook demonstrates a **reasoning agent** that:
1. Generates HTML code based on a user prompt
2. Evaluates and improves the code iteratively

We'll implement two versions:
- **Version 1**: LLM-as-Judge (the LLM evaluates its own output)
- **Version 2**: Reflection with External Feedback (using HTML validation)

We'll use **Hugging Face's free Inference API** with open-source models.

---

## 📚 Theory: Understanding Agentic AI

Before diving into the implementation, let's understand the theoretical foundations of agentic AI and reasoning systems.

### What is Agentic AI?

**Agentic AI** refers to AI systems that can **plan, act, evaluate, and improve** autonomously in pursuit of specific goals. Unlike traditional AI that follows fixed instructions or responds to patterns, agentic systems use **reasoning loops** to make context-aware decisions in real time.

At its core, an agentic system combines:
- A **Large Language Model (LLM)** as the reasoning engine
- **External tools** that extend capabilities (search, code execution, validation)
- **Feedback loops** that enable learning and self-improvement

This combination allows AI to handle open-ended, multifaceted problems that require adaptive workflows and context-aware decisions.

### The Reasoning Loop: Think-Act-Observe

Agentic systems operate through a continuous cycle that mirrors human problem-solving:

```
┌─────────────┐
│   THINK     │  1. Task Decomposition: Break down the goal
│  (Reason)   │  2. Planning: Decide on approach
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    ACT      │  3. Delegation: Assign to tools/agents
│  (Execute)  │  4. Action: Generate output or call tools
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  OBSERVE    │  5. Evaluation: Review results
│ (Reflect)   │  6. Adaptation: Refine approach
└──────┬──────┘
       │
       └──────► Loop back to THINK
```

This iterative process enables:
- **Decision-making** based on context
- **Learning** from results
- **Long-term planning** across multiple steps
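
The Think-Act-Observe cycle can be sketched without any LLM at all. The toy below is only an illustration under stated assumptions: the "goal" is a list of page sections, and the helper names (`think`, `act`, `observe`, `run_loop`) are hypothetical, not part of this notebook's agent.

```python
from typing import List

def think(goal_sections: List[str], draft: List[str]) -> List[str]:
    """THINK: work out which required sections are still missing."""
    return [s for s in goal_sections if s not in draft]

def act(draft: List[str], section: str) -> List[str]:
    """ACT: produce a new draft with one more section added."""
    return draft + [section]

def observe(goal_sections: List[str], draft: List[str]) -> bool:
    """OBSERVE: check whether the draft now satisfies the goal."""
    return all(s in draft for s in goal_sections)

def run_loop(goal_sections: List[str], max_steps: int = 10) -> List[str]:
    """Repeat think -> act -> observe until the goal is met or steps run out."""
    draft: List[str] = []
    for step in range(1, max_steps + 1):
        missing = think(goal_sections, draft)
        if not missing:
            break
        draft = act(draft, missing[0])
        done = observe(goal_sections, draft)
        print(f"step {step}: added {missing[0]!r}, done={done}")
        if done:
            break
    return draft

result = run_loop(["hero", "menu", "contact"])
print(result)
```

In a real agent, `think` and `act` would be LLM calls and `observe` would be a judge or validator; the loop structure stays the same.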

### Key Agentic Design Patterns

#### 1. **Reflection** 🔄

Creating feedback loops where LLMs review and improve their outputs.

**Two Types**:
- **Self-Reflection**: LLM critiques its own output (LLM-as-Judge)
- **External Feedback**: Using tools to provide objective validation

**Research Evidence**:
- **Self-Refine** (Madaan et al., 2023): ~20% improvement across diverse tasks
- **Reflexion** (Shinn et al., 2023): 91% pass@1 accuracy on HumanEval (vs. 80% for the GPT-4 baseline reported in the paper)
- **CRITIC** (Gou et al., 2024): 10-30% improvement using external tools

**When to Apply Reflection**:
- ✅ Validating request feasibility
- ✅ Checking initial plans
- ✅ After each execution step
- ✅ Verifying final outputs

**Trade-offs**:
- ➕ Improved accuracy and quality
- ➖ Increased latency (multiple LLM calls)
- ➖ Higher costs
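
To make the latency and cost trade-off concrete, here is a minimal sketch of a generate-critique-refine loop that counts how many model calls it makes. Everything in it is an assumption for illustration: `reflect_loop` is a hypothetical helper, and the `fake_*` functions are stand-ins that involve no LLM (the "critic" simply rewards longer drafts).

```python
from typing import Callable, Tuple

def reflect_loop(
    generate: Callable[[], str],
    critique: Callable[[str], Tuple[float, str]],
    refine: Callable[[str, str], str],
    target: float = 8.0,
    max_rounds: int = 3,
) -> Tuple[str, float, int]:
    """Run a reflection loop and return (output, score, number_of_calls)."""
    calls = 0
    output = generate()
    calls += 1
    score, feedback = critique(output)
    calls += 1
    for _ in range(max_rounds - 1):
        if score >= target:
            break  # good enough: stop paying for more calls
        output = refine(output, feedback)
        calls += 1
        score, feedback = critique(output)
        calls += 1
    return output, score, calls

# Stand-in functions for illustration only (no model involved).
def fake_generate() -> str:
    return "<p>Hi</p>"

def fake_critique(text: str) -> Tuple[float, str]:
    return min(10.0, len(text) / 4), "add more detail"

def fake_refine(text: str, feedback: str) -> str:
    return text + "<p>More detail</p>"

output, score, calls = reflect_loop(fake_generate, fake_critique, fake_refine)
print(f"score={score}, model calls={calls}")
```

Each extra round costs two calls (one to refine, one to critique), which is why a target score and an iteration cap are both worth having.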

#### 2. **Tool Use** 🛠️

Extending LLM capabilities with external tools:
- Web search for real-time information
- Code execution for calculations
- Database queries for data access
- Validators for correctness checking
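
One common way to wire tools into an agent is a name-to-function registry that the agent dispatches through. The sketch below is an assumption-laden illustration, not this notebook's implementation: `TOOLS`, `use_tool`, `count_tags`, and `word_count` are hypothetical names, and every tool takes and returns a string to keep the interface uniform.

```python
from html.parser import HTMLParser
from typing import Callable, Dict

class _TagCounter(HTMLParser):
    """Counts opening tags with the standard-library HTML parser."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1

def count_tags(html_text: str) -> str:
    """Tool: number of opening tags in an HTML snippet."""
    counter = _TagCounter()
    counter.feed(html_text)
    return str(counter.count)

def word_count(text: str) -> str:
    """Tool: number of whitespace-separated words."""
    return str(len(text.split()))

# Registry: tool name -> callable taking and returning a string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "count_tags": count_tags,
    "word_count": word_count,
}

def use_tool(name: str, argument: str) -> str:
    """Dispatch to a registered tool; report unknown names instead of raising."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](argument)

print(use_tool("count_tags", "<div><p>hi</p></div>"))
print(use_tool("word_count", "three short words"))
```

Returning an error string for unknown tools lets the model see its mistake as an observation and try again.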

#### 3. **Planning** 📋

Breaking complex goals into actionable steps:
- Multi-step reasoning
- Conditional logic and branching
- Dynamic replanning based on results

#### 4. **Multi-Agent Collaboration** 👥

Multiple specialized agents working together:
- Division of labor by expertise
- Parallel processing of subtasks
- Coordination and synthesis

### The ReAct Framework

**ReAct** (Reasoning + Acting) by Yao et al. (2022) is a foundational framework that combines:

- **Reasoning**: Explicit thought traces (reflection + planning)
- **Acting**: Task-relevant actions in the environment

The framework creates a loop where:
1. Reasoning guides action selection
2. Actions produce observations
3. Observations inform further reasoning
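
The three steps above can be sketched as a small loop. This is a simplified illustration, not the paper's code: the `Action: name[input]` text format follows the style used in ReAct prompts, while `react`, `scripted_model`, and the `length` tool are hypothetical stand-ins (the "model" is a hard-coded script so the example runs offline).

```python
import re
from typing import Callable, Dict

# Matches lines such as "Action: length[agent]".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def react(model: Callable[[str], str],
          tools: Dict[str, Callable[[str], str]],
          question: str,
          max_steps: int = 5) -> str:
    """Alternate model reasoning and tool calls until the model finishes."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)              # 1. reasoning picks an action
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            break
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg
        if name in tools:                     # 2. the action yields an observation
            observation = tools[name](arg)
        else:
            observation = f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"  # 3. fed back to the model
    return ""

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: asks for a tool once, then finishes."""
    if "Observation:" not in transcript:
        return "Thought: I need the length of the word.\nAction: length[agent]"
    last_line = transcript.strip().splitlines()[-1]
    value = last_line.split("Observation: ", 1)[1]
    return f"Thought: I have the answer.\nAction: finish[{value}]"

answer = react(scripted_model, {"length": lambda s: str(len(s))},
               "How many letters are in 'agent'?")
print(answer)
```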

**Popular Implementations**:
- **DSPy** (originally from Stanford NLP): `ReAct` module
- **LangGraph**: `create_react_agent` function
- **smolagents** (Hugging Face): ReAct-based code agents

### Traditional AI vs Agentic AI

| Aspect | Traditional AI | Agentic AI |
|--------|---------------|------------|
| **Behavior** | Fixed instructions | Dynamic decision-making |
| **Feedback** | One-shot response | Iterative refinement |
| **Tools** | Limited/none | Extensive tool use |
| **Planning** | Pre-programmed | Adaptive planning |
| **Learning** | Static | Self-improvement |
| **Context** | Pattern matching | Context-aware reasoning |

### Our Implementation Approach

In this notebook, we'll implement **two versions** of a reasoning agent:

#### **Version 1: LLM-as-Judge (Self-Reflection)**
```
User Prompt → Generate HTML → Self-Evaluate → Improve → Repeat
```
- The LLM generates code
- The same LLM judges its own output
- Iteratively improves based on self-critique
- **Pros**: Simple, no external dependencies
- **Cons**: May have blind spots in self-evaluation

#### **Version 2: Reflection with External Feedback**
```
User Prompt → Generate HTML → External Validator → Reflect on Errors → Fix → Repeat
```
- The LLM generates code
- External HTML parser validates syntax
- LLM reflects on objective validation errors
- Iteratively fixes issues
- **Pros**: Objective validation, catches concrete errors
- **Cons**: Requires external tools, more complex

Both approaches demonstrate the power of **reflection** in improving AI output quality through iterative refinement.

### Key Research Papers

1. **Yao et al. (2022)**: ["ReAct: Synergizing Reasoning and Acting in Language Models"](https://arxiv.org/abs/2210.03629)
2. **Madaan et al. (2023)**: ["Self-Refine: Iterative Refinement with Self-Feedback"](https://arxiv.org/abs/2303.17651)
3. **Shinn et al. (2023)**: ["Reflexion: Language Agents with Verbal Reinforcement Learning"](https://arxiv.org/abs/2303.11366)
4. **Gou et al. (2024)**: ["CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing"](https://arxiv.org/abs/2305.11738)

Now let's see these concepts in action! 🚀

## Setup

First, install the required packages and set up authentication.

In [None]:
!pip install -q huggingface_hub python-dotenv

In [None]:
import os
from huggingface_hub import InferenceClient
from typing import Dict, List, Tuple
import json
from html.parser import HTMLParser
import re

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
from dotenv import load_dotenv

env_path = "/content/drive/MyDrive/.env"
load_dotenv(env_path)

HF_TOKEN = os.getenv('HF_TOKEN')

### Model Configuration

We'll use **Qwen/Qwen2.5-72B-Instruct** - a powerful open-source model available via Hugging Face Inference API.

Alternative models you can try:
- `meta-llama/Llama-3.1-70B-Instruct`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `microsoft/Phi-3-medium-4k-instruct`

In [None]:
# Initialize the Hugging Face Inference Client
MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"
client = InferenceClient(token=HF_TOKEN)

In [None]:
def call_llm(messages: List[Dict[str, str]], max_tokens: int = 2000, temperature: float = 0.7) -> str:
    """
    Call the LLM with a list of messages.

    Args:
        messages: List of message dicts with 'role' and 'content'
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0 to 1.0)

    Returns:
        Generated text response
    """
    try:
        response = client.chat_completion(
            __,
            model=MODEL_NAME,
            __,
            __
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error calling LLM: {str(e)}"

In [None]:
# Test the connection
test_response = call_llm([{"role": "user", "content": __}], max_tokens=50)
print("Model test:", test_response)

---

## 💬 Understanding LLM Messages: Roles and Content

Before we dive into building our reasoning agent, let's understand how we communicate with Large Language Models (LLMs).

### How Do We Talk to LLMs?

When you interact with an LLM (like ChatGPT, Claude, or open-source models), you're not just sending plain text. Instead, you send **structured messages** that help the LLM understand the context and respond appropriately.

Each message has two key components:

#### 1. **Role** - Who is speaking?

There are three main roles:

| Role | Description | Purpose |
|------|-------------|----------|
| **`system`** | Sets the behavior and context | "You are a helpful assistant", "You are an expert coder" |
| **`user`** | The human asking questions | Your prompts and requests |
| **`assistant`** | The LLM's responses | Previous answers from the AI |

#### 2. **Content** - What is being said?

The actual text of the message - the instructions, questions, or responses.

### Message Structure

Messages are formatted as a list of dictionaries:

```python
messages = [
    {
        "role": "system",
        "content": "You are a helpful coding assistant."
    },
    {
        "role": "user",
        "content": "Write a Python function to calculate factorial."
    }
]
```

### Why This Matters

Understanding message roles is crucial because:

1. **System messages** set the "personality" and instructions for the LLM
2. **Conversation history** is maintained through user/assistant message pairs
3. **Context** from previous messages influences future responses
4. **Agentic systems** use this structure to create feedback loops
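
As a small illustration of points 2 and 3, conversation history is just a list that grows by one message per turn. The helper below is a hypothetical sketch (`add_message` and `VALID_ROLES` are not part of any library); it validates the role and returns a new list rather than mutating the old one.

```python
from typing import Dict, List

VALID_ROLES = {"system", "user", "assistant"}

def add_message(history: List[Dict[str, str]], role: str, content: str) -> List[Dict[str, str]]:
    """Return a new history with one more message appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return history + [{"role": role, "content": content}]

history: List[Dict[str, str]] = []
history = add_message(history, "system", "You are an expert HTML developer.")
history = add_message(history, "user", "Generate a heading that says Hello.")
history = add_message(history, "assistant", "<h1>Hello</h1>")
history = add_message(history, "user", "Wrap it in a header element.")

for message in history:
    print(f"[{message['role'].upper()}] {message['content']}")
```

Because the model is stateless between calls, the whole list is sent each time; the earlier assistant reply is what lets the final request ("Wrap it...") make sense.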

Let's see this in action! 👇

### 🎯 Interactive Demo: Message Roles in Action

In [None]:
# Example 1: Simple message with system role
print("Example 1: Basic Message Structure\n")
print("="*60)

messages_example1 = [
    {
        "role": "__",
        "content": "You are a friendly teacher explaining concepts simply."
    },
    {
        "role": "__",
        "content": "What is Python?"
    }
]

print("Messages sent to LLM:")
for msg in messages_example1:
    print(f"\n[{msg['role'].upper()}]")
    print(f"{msg['content']}")

print("\n" + "="*60)
print("Calling LLM...\n")

response1 = __(__, max_tokens=200)
print("[ASSISTANT]")
print(response1)

In [None]:
# Example 2: Multi-turn conversation
print("Example 2: Multi-Turn Conversation\n")
print("="*60)

messages_example2 = [
    {
        "role": "__",
        "content": "You are a concise coding assistant."
    },
    {
        "role": "__",
        "content": "Write a function to add two numbers."
    },
    {
        "role": "__",
        "content": "def add(a, b):\n    return a + b"
    },
    {
        "role": "__",
        "content": "Now add type hints to it."
    }
]

print("Conversation history:")
for i, msg in enumerate(messages_example2, 1):
    print(f"\n{i}. [{msg['role'].upper()}]")
    print(f"   {msg['content'][:100]}..." if len(msg['content']) > 100 else f"   {msg['content']}")

print("\n" + "="*60)
print("Calling LLM with conversation history...\n")

response2 = __(__, max_tokens=150)
print("[ASSISTANT]")
print(response2)

In [None]:
# Example 3: Impact of different system messages
print("Example 3: How System Messages Change Behavior\n")
print("="*60)

user_question = "Explain what a variable is in programming."

# Friendly teacher
messages_friendly = [
    {"role": "__", "content": "You are a friendly teacher for 10-year-olds. Use simple words and fun examples."},
    {"role": "__", "content": __}
]

# Technical expert
messages_technical = [
    {"role": "__", "content": "You are a computer science professor. Be precise and technical."},
    {"role": "__", "content": __}
]

print("Same question, different system messages:\n")
print(f"Question: {user_question}\n")

print("\n" + "-"*60)
print("Response 1: Friendly Teacher")
print("-"*60)
response_friendly = call_llm(__, max_tokens=150)
print(response_friendly)

print("\n" + "-"*60)
print("Response 2: Technical Expert")
print("-"*60)
response_technical = call_llm(__, max_tokens=150)
print(response_technical)

print("\n" + "="*60)
print("Notice how the SAME question gets DIFFERENT answers!")
print("This is the power of the system message.")

---

## Version 1: LLM-as-Judge

In this version, the LLM generates HTML code, then acts as a judge to evaluate its own output and suggest improvements. The agent iterates through multiple rounds of generation and self-critique.

In [None]:
def generate_html(prompt: str) -> str:
    """
    Generate HTML code based on a prompt.

    Args:
        prompt: Description of the HTML to generate

    Returns:
        Generated HTML code
    """
    messages = [
        {
            "role": "__", ## What role?
            "content": "You are an expert HTML developer. Generate clean, semantic, and well-structured HTML code based on user requirements. Return ONLY the HTML code without explanations."
        },
        {
            "role": "__", ## What role?
            "content": f"Generate HTML code for: {__}" ## What input?
        }
    ]

    response = call_llm(__, max_tokens=2000, temperature=0.7) ## What context?

    # Extract HTML code from response (remove markdown code blocks if present)
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code
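
The fence-stripping step at the end of `generate_html` reappears in several later functions, so it could be factored into one helper. The version below is a sketch under assumptions: `extract_code_block` is a hypothetical name, and it uses a regular expression instead of the `split` calls above. It also tolerates a response that was cut off before the closing fence.

```python
import re

# Three backticks, built this way so the example nests cleanly inside Markdown.
FENCE = "`" * 3

# Opening fence with optional language tag, then the body up to the
# closing fence (or the end of the text if the response was truncated).
FENCE_RE = re.compile(FENCE + r"[\w-]*[ \t]*\n(.*?)(?:" + FENCE + r"|\Z)", re.DOTALL)

def extract_code_block(response: str) -> str:
    """Return the first fenced block's contents, or the whole text if unfenced."""
    match = FENCE_RE.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()

reply = f"Here you go:\n{FENCE}html\n<p>Hi</p>\n{FENCE}\nEnjoy!"
print(extract_code_block(reply))
```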

In [None]:
def judge_html(html_code: str, original_prompt: str) -> Tuple[float, str]:
    """
    Use LLM to judge the quality of generated HTML.

    Args:
        html_code: The HTML code to evaluate
        original_prompt: The original user prompt

    Returns:
        Tuple of (score, feedback)
        - score: Quality score from 0.0 to 10.0
        - feedback: Detailed feedback and improvement suggestions
    """
    messages = [
        {
            "role": "__", ## What role?
            "content": """You are an expert HTML code reviewer. Evaluate the provided HTML code based on:
1. Correctness: Does it match the requirements?
2. Code quality: Is it semantic, accessible, and well-structured?
3. Best practices: Does it follow HTML5 standards?

Provide:
- A score from 0 to 10
- Specific feedback on what's good and what needs improvement

Format your response as:
SCORE: [number]
FEEDBACK: [your detailed feedback]"""
        },
        {
            "role": "__", ## What role?
            "content": f"""Original requirement: {__}

HTML code to evaluate:
```html
{__}
```

Please evaluate this code.""" ## What inputs?
        }
    ]

    response = __(__, max_tokens=1000, temperature=0.3)

    # Parse score and feedback
    score = 5.0  # Default score
    feedback = response

    if "SCORE:" in response:
        try:
            score_text = response.split("SCORE:")[1].split("\n")[0].strip()
            score = float(re.findall(r'\d+\.?\d*', score_text)[0])
        except (IndexError, ValueError):
            # No parsable number after "SCORE:"; keep the default score
            pass

    if "FEEDBACK:" in response:
        feedback = response.split("FEEDBACK:")[1].strip()

    return score, feedback
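
Because the parsing at the end of `judge_html` needs no model call, it can be tested in isolation. The following is one possible standalone variant, offered as an assumption rather than the notebook's method: `parse_judgement` is a hypothetical name, and unlike the code above it clamps the score to the 0-10 scale and tolerates decoration such as `**9**`.

```python
import re
from typing import Tuple

# Number following "SCORE:" on the same line, skipping non-digit decoration.
SCORE_RE = re.compile(r"SCORE:[^\d\n]*(\d+(?:\.\d+)?)")

def parse_judgement(response: str, default: float = 5.0) -> Tuple[float, str]:
    """Extract (score, feedback) from a 'SCORE: ... FEEDBACK: ...' reply."""
    match = SCORE_RE.search(response)
    score = float(match.group(1)) if match else default
    score = max(0.0, min(10.0, score))  # keep the score on the 0-10 scale

    _, separator, tail = response.partition("FEEDBACK:")
    feedback = tail.strip() if separator else response.strip()
    return score, feedback

print(parse_judgement("SCORE: 8/10\nFEEDBACK: Good structure, add alt text."))
```

Falling back to a neutral default keeps the loop running when the model ignores the requested format, at the cost of hiding that failure; logging the raw response in that case is a reasonable addition.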

In [None]:
def improve_html(html_code: str, feedback: str, original_prompt: str) -> str:
    """
    Improve HTML code based on feedback.

    Args:
        html_code: Current HTML code
        feedback: Feedback from the judge
        original_prompt: Original user prompt

    Returns:
        Improved HTML code
    """
    messages = [
        {
            "role": "__", ## What role?
            "content": "You are an expert HTML developer. Improve the provided HTML code based on the feedback. Return ONLY the improved HTML code without explanations."
        },
        {
            "role": "__", ## What role?
            "content": f"""Original requirement: {__}

Current HTML code:
```html
{__}
```

Feedback for improvement:
{__}

Please provide the improved HTML code.""" ## What inputs?
        }
    ]

    response = call_llm(__, max_tokens=2000, temperature=0.7)

    # Extract HTML code
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code

Let's tie it all together:

In [None]:
def reasoning_agent_v1(prompt: str, max_iterations: int = 3, target_score: float = 8.0) -> Dict:
    """
    Reasoning agent that generates and improves HTML using LLM-as-Judge.

    Args:
        prompt: Description of the HTML to generate
        max_iterations: Maximum number of improvement iterations
        target_score: Target quality score to achieve

    Returns:
        Dictionary with final HTML, score, and iteration history
    """
    print("🤖 Reasoning Agent V1: LLM-as-Judge")
    print(f"📝 Task: {prompt}\n")

    history = []

    # Initial generation
    print("[Iteration 1] Generating initial HTML...")
    html_code = __(__) ## Which function?

    # Evaluate
    print("[Iteration 1] Evaluating quality...")
    score, feedback = __(__, __) ## Which function?
    print(f"[Iteration 1] Score: {score}/10")
    print(f"[Iteration 1] Feedback: {feedback[:200]}...\n")

    history.append({
        "iteration": 1,
        "html": html_code,
        "score": score,
        "feedback": feedback
    })

    # Iterative improvement
    for i in range(2, max_iterations + 1):
        if score >= target_score:
            print(f"✅ Target score achieved! Stopping at iteration {i-1}\n")
            break

        print(f"[Iteration {i}] Improving HTML based on feedback...")
        html_code = __(__, __, __) ## Which function?

        print(f"[Iteration {i}] Evaluating improved version...")
        score, feedback = __(__, __) ## Which function?
        print(f"[Iteration {i}] Score: {score}/10")
        print(f"[Iteration {i}] Feedback: {feedback[:200]}...\n")

        history.append({
            "iteration": i,
            "html": html_code,
            "score": score,
            "feedback": feedback
        })

    print(f"🎯 Final Score: {score}/10\n")

    return {
        "final_html": html_code,
        "final_score": score,
        "history": history
    }
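
Note that `reasoning_agent_v1` returns the last iteration, and a later round is not guaranteed to score higher than an earlier one. Since the full `history` is returned, the best-scoring version can be recovered afterwards. The helper below is a hypothetical sketch, and the scores in `example_history` are made up purely for illustration.

```python
from typing import Dict, List

def best_iteration(history: List[Dict]) -> Dict:
    """Return the history entry with the highest score (earliest wins ties)."""
    if not history:
        raise ValueError("history is empty")
    return max(history, key=lambda entry: entry["score"])

# Made-up history in the same shape that reasoning_agent_v1 records.
example_history = [
    {"iteration": 1, "html": "<p>v1</p>", "score": 6.0, "feedback": "..."},
    {"iteration": 2, "html": "<p>v2</p>", "score": 7.5, "feedback": "..."},
    {"iteration": 3, "html": "<p>v3</p>", "score": 7.0, "feedback": "..."},
]

best = best_iteration(example_history)
print(f"Best iteration: {best['iteration']} (score {best['score']})")
```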

### Demo: Version 1 (LLM-as-Judge)

In [None]:
# Run the reasoning agent
result_v1 = __( ## Which function?
    prompt="A modern landing page for a coffee shop with a hero section, menu preview, and contact form",
    max_iterations=3,
    target_score=8.0
)

# Display final HTML
print("="*80)
print("FINAL HTML CODE:")
print("="*80)
print(result_v1["final_html"])

In [None]:
# Visualize the HTML in Colab
from IPython.display import HTML, display

display(HTML(result_v1["final_html"]))

---

## Version 2: Reflection with External Feedback

In this version, we add **external validation** using:
1. HTML syntax validation (checking for parsing errors)
2. Structure validation (checking for required elements)
3. LLM reflection based on external feedback

This demonstrates how external tools can provide objective feedback to guide the reasoning process.

In [None]:
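# Void elements never take a closing tag, so they must stay off the tag stack.
VOID_ELEMENTS = {
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'link', 'meta', 'source', 'track', 'wbr'
}

class HTMLValidator(HTMLParser):
    """
    Custom HTML parser to validate HTML structure and collect errors.
    """
    def __init__(self):
        super().__init__()
        self.errors = []      # human-readable problems found while parsing
        self.tags = []        # every opening tag seen, in order
        self.tag_stack = []   # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag not in VOID_ELEMENTS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if not self.tag_stack:
            self.errors.append(f"Unexpected closing tag: </{tag}>")
        elif self.tag_stack[-1] != tag:
            self.errors.append(f"Mismatched tags: expected </{self.tag_stack[-1]}>, got </{tag}>")
        else:
            self.tag_stack.pop()

    def error(self, message):
        # Not called by the parser in current Python versions (the base
        # method was removed in 3.10); kept for older interpreters.
        self.errors.append(f"Parse error: {message}")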

def validate_html(html_code: str, required_elements: List[str] = None) -> Tuple[bool, List[str]]:
    """
    Validate HTML code for syntax errors and required elements.

    Args:
        html_code: HTML code to validate
        required_elements: List of required HTML tags (e.g., ['html', 'body', 'head'])

    Returns:
        Tuple of (is_valid, list_of_issues)
    """
    validator = HTMLValidator()
    issues = []

    try:
        validator.feed(html_code)
    except Exception as e:
        issues.append(f"Critical parsing error: {str(e)}")
        return False, issues

    # Check for parsing errors
    if validator.errors:
        issues.extend(validator.errors)

    # Check for unclosed tags
    if validator.tag_stack:
        issues.append(f"Unclosed tags: {', '.join(validator.tag_stack)}")

    # Check for required elements
    if required_elements:
        missing = [elem for elem in required_elements if elem not in validator.tags]
        if missing:
            issues.append(f"Missing required elements: {', '.join(missing)}")

    # Basic structure checks
    if 'html' in validator.tags:
        if 'head' not in validator.tags:
            issues.append("Missing <head> element")
        if 'body' not in validator.tags:
            issues.append("Missing <body> element")

    is_valid = len(issues) == 0
    return is_valid, issues
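
To see what this kind of external feedback looks like on deliberately broken input, here is a compact, self-contained tag-balance checker. It is a variation on the validator above, not a copy: `TagBalanceChecker` and `check_balance` are hypothetical names, and on a mismatch it pops back to the matching opener so that one missing closing tag does not trigger a cascade of follow-on errors.

```python
from html.parser import HTMLParser
from typing import List

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Tracks open elements and records unbalanced tags."""
    def __init__(self):
        super().__init__()
        self.stack: List[str] = []
        self.problems: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in self.stack:
            # Pop back to the matching opener; anything skipped was left unclosed.
            while self.stack[-1] != tag:
                self.problems.append(f"Unclosed <{self.stack.pop()}> before </{tag}>")
            self.stack.pop()
        else:
            self.problems.append(f"Unexpected closing tag: </{tag}>")

def check_balance(html_text: str) -> List[str]:
    """Return a list of balance problems; an empty list means balanced."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    leftovers = [f"Unclosed <{t}> at end of document" for t in checker.stack]
    return checker.problems + leftovers

print(check_balance("<div><p>ok</p></div>"))
print(check_balance("<div><p>oops</div>"))
```

Like the validator above, this is stricter than the HTML specification, which allows some end tags (such as `</p>` and `</li>`) to be omitted.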

In [None]:
def reflect_and_improve(html_code: str, validation_issues: List[str], original_prompt: str) -> str:
    """
    Use LLM to reflect on external validation feedback and improve code.

    Args:
        html_code: Current HTML code
        validation_issues: List of issues from external validator
        original_prompt: Original user prompt

    Returns:
        Improved HTML code
    """
    issues_text = "\n- " + "\n- ".join(validation_issues)

    messages = [
        {
            "role": "__",
            "content": """You are an expert HTML developer. You will receive HTML code with validation errors from an external validator.
Reflect on these issues and generate corrected HTML code that addresses all problems.
Return ONLY the corrected HTML code without explanations."""
        },
        {
            "role": "__",
            "content": f"""Original requirement: {__}

Current HTML code:
```html
{__}
```

External validation found these issues:
{__}

Please fix all issues and provide the corrected HTML code."""
        }
    ]

    response = call_llm(messages, max_tokens=2000, temperature=0.7)

    # Extract HTML code
    html_code = response.strip()
    if "```html" in html_code:
        html_code = html_code.split("```html")[1].split("```")[0].strip()
    elif "```" in html_code:
        html_code = html_code.split("```")[1].split("```")[0].strip()

    return html_code

In [None]:
def reasoning_agent_v2(prompt: str, max_iterations: int = 3, required_elements: List[str] = None) -> Dict:
    """
    Reasoning agent with reflection based on external validation feedback.

    Args:
        prompt: Description of the HTML to generate
        max_iterations: Maximum number of improvement iterations
        required_elements: List of required HTML elements

    Returns:
        Dictionary with final HTML, validation status, and iteration history
    """
    print("🤖 Reasoning Agent V2: Reflection with External Feedback")
    print(f"📝 Task: {prompt}\n")

    history = []

    # Initial generation
    print("[Iteration 1] Generating initial HTML...")
    html_code = generate_html(prompt)

    # External validation
    print("[Iteration 1] Running external validation...")
    is_valid, issues = validate_html(html_code, required_elements)

    if is_valid:
        print("[Iteration 1] ✅ Validation passed!")
    else:
        print(f"[Iteration 1] ❌ Validation failed with {len(issues)} issue(s)")
        for issue in issues:
            print(f"  - {issue}")
    print()

    # Also get LLM judge score
    score, feedback = judge_html(html_code, prompt)
    print(f"[Iteration 1] LLM Judge Score: {score}/10\n")

    history.append({
        "iteration": 1,
        "html": html_code,
        "is_valid": is_valid,
        "issues": issues,
        "score": score,
        "feedback": feedback
    })

    # Iterative improvement based on external feedback
    for i in range(2, max_iterations + 1):
        if is_valid and score >= 8.0:
            print(f"✅ Code is valid and high quality! Stopping at iteration {i-1}\n")
            break

        if not is_valid:
            # Fix validation issues first
            print(f"[Iteration {i}] Reflecting on validation issues and improving...")
            html_code = reflect_and_improve(html_code, issues, prompt)
        else:
            # Improve based on LLM feedback
            print(f"[Iteration {i}] Improving based on LLM feedback...")
            html_code = improve_html(html_code, feedback, prompt)

        # Validate again
        print(f"[Iteration {i}] Running external validation...")
        is_valid, issues = validate_html(html_code, required_elements)

        if is_valid:
            print(f"[Iteration {i}] ✅ Validation passed!")
        else:
            print(f"[Iteration {i}] ❌ Validation failed with {len(issues)} issue(s)")
            for issue in issues:
                print(f"  - {issue}")
        print()

        # Get LLM score
        score, feedback = judge_html(html_code, prompt)
        print(f"[Iteration {i}] LLM Judge Score: {score}/10\n")

        history.append({
            "iteration": i,
            "html": html_code,
            "is_valid": is_valid,
            "issues": issues,
            "score": score,
            "feedback": feedback
        })

    print(f"🎯 Final Status: {'✅ Valid' if is_valid else '❌ Invalid'}, Score: {score}/10\n")

    return {
        "final_html": html_code,
        "is_valid": is_valid,
        "final_score": score,
        "history": history
    }

### Demo: Version 2 (Reflection with External Feedback)

In [None]:
# Run the reasoning agent with external validation
result_v2 = reasoning_agent_v2(
    prompt="A modern landing page for a coffee shop with a hero section, menu preview, and contact form",
    max_iterations=3,
    required_elements=['html', 'head', 'body', 'title']
)

# Display final HTML
print("="*80)
print("FINAL HTML CODE:")
print("="*80)
print(result_v2["final_html"])

In [None]:
# Visualize the HTML in Colab
from IPython.display import HTML, display

display(HTML(result_v2["final_html"]))

---

## 💾 Saving and Viewing HTML Files

Let's save the generated HTML files and learn different ways to view them.

### Method 1: Save to Files and Download

Save the HTML files to the Colab filesystem and download them to view in your browser.

In [None]:
import os
from google.colab import files

def save_html_file(html_content: str, filename: str) -> str:
    """
    Save HTML content to a file.

    Args:
        html_content: The HTML code to save
        filename: Name of the file (e.g., 'output_v1.html')

    Returns:
        Full path to the saved file
    """
    filepath = f'/content/{filename}'

    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(html_content)

    print(f"✅ Saved: {filepath}")
    return filepath

# Save both HTML files
print("Saving HTML files...\n")

v1_filepath = save_html_file(result_v1['final_html'], 'coffee_shop_v1_llm_judge.html')
v2_filepath = save_html_file(result_v2['final_html'], 'coffee_shop_v2_external_feedback.html')

print("\n📁 Files saved successfully!")

In [None]:
# Download files to your local machine
print("Downloading files...\n")

files.download(v1_filepath)
files.download(v2_filepath)

print("\n✅ Files downloaded! Open them in your browser to view.")

### Method 2: View Inline in Colab

Display the HTML directly in the notebook using IPython's HTML display.

In [None]:
from IPython.display import HTML, display

print("="*80)
print("VERSION 1: LLM-as-Judge")
print("="*80)
display(HTML(result_v1['final_html']))

In [None]:
print("="*80)
print("VERSION 2: Reflection with External Feedback")
print("="*80)
display(HTML(result_v2['final_html']))

### Method 3: View in IFrame

Display the saved HTML files in an iframe for a more isolated view.

In [None]:
from IPython.display import IFrame

print("Viewing Version 1 in IFrame:\n")
display(IFrame(src=v1_filepath, width=800, height=600))

In [None]:
print("Viewing Version 2 in IFrame:\n")
display(IFrame(src=v2_filepath, width=800, height=600))

### Method 4: Side-by-Side Comparison

View both versions side by side to compare the results.

In [None]:
import html
from IPython.display import HTML, display

# Escape each page so it can sit safely inside a double-quoted srcdoc attribute.
# Escaping "&" as well as quotes keeps entities in the generated HTML intact.
v1_srcdoc = html.escape(result_v1['final_html'], quote=True)
v2_srcdoc = html.escape(result_v2['final_html'], quote=True)

comparison_html = f"""
<div style="display: flex; gap: 20px;">
    <div style="flex: 1; border: 2px solid #4CAF50; padding: 10px;">
        <h3 style="color: #4CAF50; text-align: center;">Version 1: LLM-as-Judge</h3>
        <p style="text-align: center;"><strong>Score:</strong> {result_v1['final_score']}/10</p>
        <iframe srcdoc="{v1_srcdoc}"
                width="100%" height="500" style="border: 1px solid #ddd;"></iframe>
    </div>
    <div style="flex: 1; border: 2px solid #2196F3; padding: 10px;">
        <h3 style="color: #2196F3; text-align: center;">Version 2: External Feedback</h3>
        <p style="text-align: center;">
            <strong>Score:</strong> {result_v2['final_score']}/10 |
            <strong>Valid:</strong> {'✅' if result_v2['is_valid'] else '❌'}
        </p>
        <iframe srcdoc="{v2_srcdoc}"
                width="100%" height="500" style="border: 1px solid #ddd;"></iframe>
    </div>
</div>
"""

display(HTML(comparison_html))

### 💡 Tips for Viewing HTML Files

**In Google Colab**:
- Use `display(HTML(...))` for inline viewing
- Use `IFrame(...)` for isolated viewing
- Files are saved to `/content/` directory

**On Your Local Machine**:
1. Download the files using `files.download()`
2. Open them in any web browser (Chrome, Firefox, Safari, etc.)
3. Right-click → Open With → Your Browser

**Sharing Your HTML**:
- Upload to GitHub Pages for free hosting
- Use services like CodePen, JSFiddle, or Netlify
- Share the HTML code directly

**Editing the HTML**:
- Open in any text editor (VS Code, Sublime, Notepad++)
- Use browser DevTools to inspect and modify
- Feed back to the agent for further improvements

---

## Comparison: V1 vs V2

Let's compare the two approaches:

In [None]:
import pandas as pd

comparison_data = {
    "Aspect": [
        "Feedback Source",
        "Objectivity",
        "Error Detection",
        "Improvement Focus",
        "Reliability"
    ],
    "V1: LLM-as-Judge": [
        "LLM self-evaluation",
        "Subjective",
        "May miss syntax errors",
        "Overall quality & style",
        "Depends on LLM capability"
    ],
    "V2: External Feedback": [
        "External validator + LLM",
        "Objective validation",
        "Catches syntax errors",
        "Correctness first, then quality",
        "More reliable for correctness"
    ]
}

df_comparison = pd.DataFrame(comparison_data)
print(df_comparison.to_string(index=False))

---

## Try Your Own Prompts!

Experiment with both agents using your own prompts:

In [None]:
# Customize your prompt here
custom_prompt = "A portfolio page for a photographer with an image gallery and about section"

print("\n" + "="*80)
print("TESTING VERSION 1: LLM-as-Judge")
print("="*80 + "\n")

result_custom_v1 = reasoning_agent_v1(
    prompt=custom_prompt,
    max_iterations=2,
    target_score=8.0
)

print("\n" + "="*80)
print("TESTING VERSION 2: Reflection with External Feedback")
print("="*80 + "\n")

result_custom_v2 = reasoning_agent_v2(
    prompt=custom_prompt,
    max_iterations=2,
    required_elements=['html', 'head', 'body', 'title']
)

# Display both results
print("\n" + "="*80)
print("RESULTS COMPARISON")
print("="*80)
print(f"V1 Final Score: {result_custom_v1['final_score']}/10")
print(f"V2 Final Score: {result_custom_v2['final_score']}/10")
print(f"V2 Valid HTML: {result_custom_v2['is_valid']}")

---

## 🎮 Interactive Exercises for Students

Time to get hands-on! These exercises will help you understand reasoning agents better.

### 🏆 Exercise 1: Design Your Own System Message

**Challenge**: Create a system message that makes the LLM generate HTML in a specific style.

Try these personas:
- A minimalist designer (clean, simple HTML)
- A creative artist (colorful, animated)
- An accessibility expert (ARIA labels, semantic HTML)
- A 90s web designer (tables, marquee tags!)

In [None]:
# TODO: Students fill this in!
# Create your own system message and generate HTML

my_system_message = """
# YOUR SYSTEM MESSAGE HERE
# Example: You are a minimalist web designer who believes less is more...
"""

my_prompt = "Create a simple contact form"  # Change this too!

messages = [
    {"role": "system", "content": my_system_message},
    {"role": "user", "content": f"Generate HTML for: {my_prompt}"}
]

my_html = call_llm(messages, max_tokens=2000)

# Extract and display
if "```html" in my_html:
    my_html = my_html.split("```html")[1].split("```")[0].strip()
elif "```" in my_html:
    my_html = my_html.split("```")[1].split("```")[0].strip()

from IPython.display import HTML, display
print("Your Generated HTML:\n")
print(my_html)
print("\n" + "="*60 + "\n")
print("Preview:")
display(HTML(my_html))

### 🏆 Exercise 2: Build a Simple Feedback Loop

**Challenge**: Implement a mini reasoning agent that:
1. Generates a joke
2. Rates the joke (1-10)
3. If score < 7, generates a better joke

This teaches the core concept of reflection!

In [None]:
# TODO: Students complete this!

def generate_joke(topic: str) -> str:
    """Generate a joke about the given topic."""
    messages = [
        {"role": "system", "content": "You are a comedian. Generate a short, funny joke."},
        {"role": "user", "content": f"Tell me a joke about {topic}"}
    ]
    return call_llm(messages, max_tokens=100)

def rate_joke(joke: str) -> float:
    """Rate the joke from 1-10."""
    messages = [
        {"role": "system", "content": "You are a comedy critic. Rate jokes from 1-10. Respond with just: SCORE: [number]"},
        {"role": "user", "content": f"Rate this joke: {joke}"}
    ]
    response = call_llm(messages, max_tokens=50)

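    # Extract the first number in the response; fall back to a neutral 5.0
    # if the model did not return a parseable score.
    import re
    match = re.search(r'\d+(?:\.\d+)?', response)
    return float(match.group()) if match else 5.0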

def improve_joke(joke: str, score: float) -> str:
    """Improve the joke based on the score."""
    # TODO: Students implement this!
    # Hint: Ask the LLM to improve the joke
    messages = [
        # YOUR CODE HERE
    ]
    return call_llm(messages, max_tokens=100)

# Run the joke improvement loop
topic = "programming"  # Change this!
max_iterations = 3

print(f"üé≠ Joke Improvement Agent: Topic = '{topic}'\n")
print("="*60)

joke = generate_joke(topic)
print(f"\n[Iteration 1]")
print(f"Joke: {joke}")
score = rate_joke(joke)
print(f"Score: {score}/10")

for i in range(2, max_iterations + 1):
    if score >= 7.0:
        print(f"\n✅ Great joke! Stopping at iteration {i-1}")
        break

    print(f"\n[Iteration {i}] Improving...")
    joke = improve_joke(joke, score)
    print(f"Joke: {joke}")
    score = rate_joke(joke)
    print(f"Score: {score}/10")

print("\n" + "="*60)
print(f"Final joke (Score: {score}/10):")
print(joke)

### 🏆 Exercise 3: Experiment with Reflection Patterns

**Challenge**: Modify the HTML generator to add a new type of feedback.

Ideas:
- Check for accessibility (alt tags, ARIA labels)
- Count the number of HTML elements
- Validate CSS if inline styles are used
- Check for mobile responsiveness
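
As a starting point for two of these ideas (element counting and a basic mobile-readiness signal), the sketch below walks the markup with the standard library's `html.parser`. The `TagStats` class and `html_stats` function are hypothetical names, and treating a missing viewport `<meta>` tag as a responsiveness issue is a simplifying assumption, not a complete test.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Count start tags and note whether a viewport <meta> tag is present."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()
        self.has_viewport = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        # Attribute values can be None (e.g. <meta name>), hence the `or ""`.
        name = (dict(attrs).get("name") or "").lower()
        if tag == "meta" and name == "viewport":
            self.has_viewport = True


def html_stats(html_code: str) -> tuple[dict[str, int], list[str]]:
    """Return (tag_counts, issues) for an HTML string."""
    parser = TagStats()
    parser.feed(html_code)
    parser.close()
    issues = []
    if not parser.has_viewport:
        issues.append('No <meta name="viewport"> tag; page may not scale on mobile')
    return dict(parser.counts), issues
```

The `issues` list can be joined into the feedback text sent back to the LLM, in the same way as any other validator's findings.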

In [None]:
# TODO: Students implement a new validator!

def check_accessibility(html_code: str) -> tuple[bool, list[str]]:
    """
    Check HTML for accessibility issues.

    Returns:
        (is_accessible, list_of_issues)
    """
    issues = []

    # TODO: Implement accessibility checks
    # Hints:
    # - Check for <img> tags without alt attributes
    # - Check for form inputs without labels
    # - Check for proper heading hierarchy (h1, h2, h3...)

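    # Example check (coarse: it passes if any alt= appears anywhere in the
    # document, so one image with alt text hides others without it):
    if '<img' in html_code and 'alt=' not in html_code:
        issues.append("Images missing alt attributes")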

    # Add more checks here!

    is_accessible = len(issues) == 0
    return is_accessible, issues

# Test your validator
test_html = """
<html>
<body>
    <h1>Welcome</h1>
    <img src="logo.png">
    <input type="text">
</body>
</html>
"""

is_accessible, issues = check_accessibility(test_html)
print("Accessibility Check Results:")
print(f"Accessible: {is_accessible}")
if issues:
    print("\nIssues found:")
    for issue in issues:
        print(f"  - {issue}")

### 🏆 Exercise 4: Create Your Own Reasoning Agent

**Challenge**: Build a reasoning agent for a different task!

Ideas:
- **Email writer**: Generate professional emails and improve tone
- **Recipe creator**: Generate recipes and check for missing ingredients
- **Story writer**: Generate stories and improve plot consistency
- **Math solver**: Solve problems and verify answers
- **Code debugger**: Generate code and check for syntax errors
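
For the code-debugger idea, the external validator can be very small. The sketch below assumes the generated code is Python and uses `ast.parse` from the standard library to catch syntax errors. The function name `check_python_syntax` and its `(is_valid, issues)` return shape are illustrative choices, picked to resemble the validators used earlier.

```python
import ast


def check_python_syntax(code: str) -> tuple[bool, list[str]]:
    """Return (is_valid, issues) for a string of Python source.

    Only syntax is checked; the code is parsed, never executed.
    """
    try:
        ast.parse(code)
    except SyntaxError as err:
        return False, [f"Line {err.lineno}: {err.msg}"]
    return True, []
```

Because nothing is executed, this catches malformed code but not runtime errors or wrong answers. Those would need a further feedback source such as running tests.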

In [None]:
# TODO: Students build their own reasoning agent!

def my_reasoning_agent(task: str, max_iterations: int = 3):
    """
    Your custom reasoning agent.

    Steps:
    1. Generate initial output
    2. Evaluate output (self-reflection or external validation)
    3. Improve based on feedback
    4. Repeat until satisfied or max iterations reached
    """
    print(f"ü§ñ My Reasoning Agent")
    print(f"Task: {task}\n")
    print("="*60)

    # TODO: Implement your agent here!
    # Use the patterns from Version 1 or Version 2

    pass

# Test your agent
# my_reasoning_agent("Write a professional email asking for a meeting")

### 💭 Discussion Questions for Class

1. **When is self-reflection (LLM-as-Judge) better than external validation?**
   - Think about subjective vs. objective tasks

2. **What are the trade-offs of adding more reflection iterations?**
   - Consider: quality, cost, latency, diminishing returns

3. **How would you combine multiple types of feedback?**
   - Example: Syntax validation + style checking + user preferences

4. **What tasks are NOT suitable for reasoning agents?**
   - When is a simple one-shot generation better?

5. **How can we prevent infinite loops in reasoning agents?**
   - What stopping criteria make sense?

6. **Real-world applications**: Where have you seen reasoning agents in action?
   - GitHub Copilot, ChatGPT Code Interpreter, etc.
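
As one concrete input for questions 2 and 5, the sketch below combines three common stopping criteria: a target score, an iteration budget, and a minimum gain between consecutive iterations. The function name `should_stop` and all threshold defaults are assumptions chosen for illustration, not values used by the agents in this notebook.

```python
def should_stop(
    scores: list[float],
    target_score: float = 8.0,
    max_iterations: int = 5,
    min_gain: float = 0.25,
) -> tuple[bool, str]:
    """Decide whether a refinement loop should stop.

    `scores` holds one evaluation score per completed iteration, oldest first.
    Returns (stop, reason).
    """
    if not scores:
        return False, "no iterations yet"
    if scores[-1] >= target_score:
        return True, "target score reached"
    if len(scores) >= max_iterations:
        return True, "iteration budget exhausted"
    # Diminishing returns: the latest iteration barely improved on the previous one.
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return True, "improvement stalled"
    return False, "keep iterating"
```

The iteration budget is what guarantees termination. The other two checks only save cost when the loop is already good enough or no longer improving.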

### 🌟 Bonus Challenges

For advanced students:

1. **Multi-Agent System**: Create two agents that critique each other's work
2. **Adaptive Iterations**: Dynamically decide how many iterations to run based on improvement rate
3. **Feedback Fusion**: Combine LLM-as-Judge with multiple external validators
4. **Memory System**: Make the agent remember past mistakes and avoid them
5. **Human-in-the-Loop**: Add a step where the agent asks for human feedback
6. **Parallel Generation**: Generate multiple candidates and pick the best one
7. **Cost Optimization**: Minimize API calls while maintaining quality
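
A possible starting point for the parallel-generation challenge is a best-of-N selector. The sketch below takes the generator and scorer as plain callables so it runs offline; `best_of_n`, `fake_generate`, and `fake_score` are hypothetical names, and the stand-in functions would be replaced by real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    n: int = 3,
) -> tuple[str, float]:
    """Generate n candidates concurrently and return (best_candidate, best_score)."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, range(n)))
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Stand-ins for LLM calls so the sketch runs without network access.
def fake_generate(seed: int) -> str:
    return "<p>" + "content " * (seed + 1) + "</p>"


def fake_score(html: str) -> float:
    return float(len(html))


best, best_score = best_of_n(fake_generate, fake_score, n=3)
print(best_score)
```

Threads suit this pattern because LLM API calls are I/O-bound. Note that each run costs `n` generation calls plus `n` scoring calls, which is the trade-off the cost-optimization challenge asks about.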

---

## Key Takeaways

This notebook demonstrated two approaches to building reasoning agents:

### **Version 1: LLM-as-Judge**
- The LLM evaluates its own output
- Useful for subjective quality assessment
- Simpler implementation
- May have blind spots in self-evaluation

### **Version 2: Reflection with External Feedback**
- Combines LLM reasoning with objective external tools
- More reliable for catching concrete errors
- Demonstrates how to integrate external validation
- Better separation of concerns (correctness vs. quality)

### **General Principles**
1. **Iterative refinement**: Both agents improve through multiple iterations
2. **Feedback loops**: Critical for self-improvement
3. **External validation**: Adds objectivity and reliability
4. **Open-source models**: Powerful reasoning is possible without proprietary APIs

### **Extensions You Can Try**
- Add CSS validation
- Include accessibility checks (WCAG compliance)
- Add performance metrics (page size, load time)
- Implement multi-agent collaboration (one generates, another reviews)
- Add user feedback as another external signal