# 🔧 Xenon: XML Repair for LLM-Generated XML

Welcome to the interactive Xenon demo! This notebook shows you how to repair malformed XML commonly generated by Large Language Models.

**What is Xenon?**
- Zero-dependency Python library
- Fixes truncated, malformed, and messy XML from LLMs
- Simple API with robust error handling

**Open in Colab**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarsZDF/xenon/blob/main/xenon_demo.ipynb)

## 📦 Installation

Run this cell to import Xenon from the local `src` directory (a pip release is planned):

In [None]:
# For this demo, we'll use the local version
# Once published: pip install xenon

# If running locally:
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), 'src'))

from xenon import repair_xml, parse_xml, repair_xml_safe, repair_xml_lenient, ValidationError

print("✅ Xenon imported successfully!")

---

# 🎯 The 4 Major LLM XML Failure Modes

Xenon handles the most common ways LLMs break XML:

## 1️⃣ Truncation / Cut-off

LLMs run out of tokens mid-tag and leave XML incomplete:

In [None]:
# LLM output that got cut off
truncated = '<root><user name="alice"<address><city>San Francisco'

print("❌ BROKEN XML:")
print(truncated)
print()

# Xenon repairs it
repaired = repair_xml(truncated)
print("✅ REPAIRED XML:")
print(repaired)
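
To make the idea of auto-closing concrete, here is a minimal stdlib-only sketch that tracks open tags on a stack and appends the missing closers. It is an illustration under simplifying assumptions (every tag in the input is complete, no comments or CDATA), not Xenon's actual implementation; `close_open_tags` and `TAG_RE` are hypothetical names.

```python
import re

# Matches complete opening, closing, and self-closing tags: <a>, </a>, <a/>.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def close_open_tags(text: str) -> str:
    """Append closing tags for every element left open in `text`."""
    stack = []
    for match in TAG_RE.finditer(text):
        is_close, name, self_closing = match.groups()
        if self_closing:
            continue  # <br/> opens nothing
        if is_close:
            # Pop back to the matching opener, if there is one.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    # Close the innermost elements first.
    return text + "".join(f"</{name}>" for name in reversed(stack))

print(close_open_tags("<root><address><city>San Francisco"))
```

A tag that is itself cut off, like `<user name="alice"` in the cell above, is not matched by this pattern at all, which is one reason a real repair library needs more than a regex.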

## 2️⃣ Conversational Fluff

LLMs wrap XML in conversational text:

In [None]:
# LLM adds unnecessary commentary
fluff = '''
Sure! Here's the XML you requested:

<root>
    <message>Hello World</message>
    <status>success</status>
</root>

Hope this helps! Let me know if you need anything else.
'''

print("❌ MESSY OUTPUT:")
print(repr(fluff))
print()

# Xenon extracts just the XML
repaired = repair_xml(fluff)
print("✅ CLEAN XML:")
print(repaired)
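
For intuition, the simplest possible version of this step is to keep everything between the first `<` and the last `>`. The sketch below does exactly that with plain string methods; `extract_xml_span` is a hypothetical helper, and this is not a description of how Xenon does it.

```python
def extract_xml_span(text: str) -> str:
    """Return the substring from the first '<' to the last '>' (inclusive)."""
    start = text.find("<")
    end = text.rfind(">")
    if start == -1 or end < start:
        return ""  # no tag-like content found
    return text[start:end + 1]

reply = "Sure! Here's the XML:\n<root><status>success</status></root>\nHope this helps!"
print(extract_xml_span(reply))
```

This naive approach breaks when the surrounding prose contains angle brackets, or when the XML is truncated after its last `>`, so treat it as a teaching aid only.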

## 3️⃣ Malformed Attributes

LLMs forget to quote attribute values:

In [None]:
# Missing quotes around attributes
unquoted = '<product id=12345 category=electronics price=299.99 inStock=true>'

print("❌ INVALID ATTRIBUTES:")
print(unquoted)
print()

# Xenon adds quotes
repaired = repair_xml(unquoted)
print("✅ VALID ATTRIBUTES:")
print(repaired)
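
A rough way to picture attribute quoting is a regular expression that finds `name=value` pairs whose value does not start with a quote. The sketch below assumes it is handed a single tag with simple values (no spaces, no slashes); `quote_attributes` and `UNQUOTED_ATTR_RE` are hypothetical names, and Xenon's own logic may differ.

```python
import re

# Whitespace, an attribute name, '=', then a value that is not already quoted.
UNQUOTED_ATTR_RE = re.compile(r"""(\s[\w:.\-]+)=([^\s"'<>/=]+)""")

def quote_attributes(tag: str) -> str:
    """Wrap bare attribute values in double quotes."""
    return UNQUOTED_ATTR_RE.sub(r'\1="\2"', tag)

print(quote_attributes("<product id=12345 category=electronics price=299.99 inStock=true>"))
```

Values that are already quoted are left alone. The pattern would misfire on `name=value` text inside a quoted value or inside element content, so it should only be applied to the tag itself.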

## 4️⃣ Unescaped Entities

LLMs forget to escape special characters like `&` and `<`:

In [None]:
# Special characters not escaped
unescaped = '<description>Price: $5 < $10 & shipping included</description>'

print("❌ UNESCAPED ENTITIES:")
print(unescaped)
print()

# Xenon escapes them
repaired = repair_xml(unescaped)
print("✅ PROPERLY ESCAPED:")
print(repaired)
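
The tricky part of escaping is not double-escaping text that is already correct. As a small sketch, assuming we already hold the text content of one element (not the whole document), a negative lookahead can skip `&` characters that already begin an entity. `escape_text` and `BARE_AMP_RE` are hypothetical names, not part of Xenon.

```python
import re

# '&' that does not already start an entity such as &amp;, &#38; or &#x26;
BARE_AMP_RE = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_text(text: str) -> str:
    """Escape '&' and '<' in element text without double-escaping entities."""
    return BARE_AMP_RE.sub("&amp;", text).replace("<", "&lt;")

print(escape_text("Price: $5 < $10 & shipping included"))
print(escape_text("already &amp; escaped"))
```

Deciding which `<` characters are markup and which are text is the harder problem, and this sketch sidesteps it by only looking at text content.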

---

# 🔥 Combined Failure Modes

Real LLM outputs often have MULTIPLE issues at once:

In [None]:
# Everything broken at once!
nightmare = '''
Here's the user data you asked for:

<users>
    <user id=1001 role=admin>
        <name>John Smith & Associates</name>
        <email>john@example.com</email>
        <status>Active & Verified
'''

print("❌ COMPLETE DISASTER:")
print(nightmare)
print()

# Xenon handles it all
repaired = repair_xml(nightmare)
print("✅ FULLY REPAIRED:")
print(repaired)
print()

# What Xenon fixed:
print("🔧 Xenon fixed:")
print("  • Removed conversational text")
print("  • Added quotes to attributes (id=1001 → id=\"1001\")")
print("  • Escaped & symbols (& → &amp;)")
print("  • Closed truncated tags (</status>, </user>, </users>)")

---

# 📊 Parse to Dictionary

Convert repaired XML to Python dictionaries for easy data extraction:

In [None]:
# Malformed XML from LLM
malformed = '<response status=success><data count=3><item>Apple</item><item>Banana</item><item>Orange'

print("📝 Input:")
print(malformed)
print()

# Parse directly (repairs automatically)
data = parse_xml(malformed)

print("📦 Parsed Dictionary:")
import json
print(json.dumps(data, indent=2))
print()

# Access data easily
print("🎯 Extracted Values:")
print(f"  Status: {data['response']['@attributes']['status']}")
print(f"  Count: {data['response']['data']['@attributes']['count']}")
print(f"  Items: {data['response']['data']['item']}")
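
If you want to see how a dictionary of this shape can be built from well-formed XML, the sketch below does it with the standard library's `xml.etree.ElementTree`. The `@attributes` key mirrors the lookups in the cell above; the `#text` key for mixed content and the rule that repeated tags become a list are assumptions made for illustration, and `element_to_dict` is a hypothetical helper rather than Xenon code.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into a nested dict with an '@attributes' key."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element: just its text
    if text:
        node["#text"] = text  # assumed key for text next to children/attributes
    return node

xml = '<response status="success"><data count="3"><item>Apple</item><item>Banana</item></data></response>'
root = ET.fromstring(xml)
data = {root.tag: element_to_dict(root)}
print(json.dumps(data, indent=2))
```

Note that attribute values stay strings (`"3"`, not `3`), so convert them yourself if you need numbers.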

---

# 🛡️ Error Handling Modes

Xenon provides 3 modes for different use cases:

## Safe Mode (Production-Ready)

In [None]:
# Safe mode validates input and provides helpful errors

print("1️⃣ Repairable XML (missing </root>):")
result = repair_xml_safe('<root><item>test</item>')
print(f"   ✅ {result}")
print()

print("2️⃣ Invalid input (None):")
try:
    result = repair_xml_safe(None)
except ValidationError as e:
    print(f"   ❌ {e}")
print()

print("3️⃣ Empty string with allow_empty=True:")
result = repair_xml_safe('', allow_empty=True)
print(f"   ✅ Returns: {repr(result)}")
print()

print("4️⃣ Size limit protection:")
try:
    huge = '<root>' + 'x' * 10000
    result = repair_xml_safe(huge, max_size=5000)
except ValidationError as e:
    print(f"   ❌ {e}")
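
The checks exercised above (wrong type, empty input, size limit) are easy to picture as a small guard function that runs before any repair. The sketch below is a stand-alone approximation using built-in exceptions; `check_xml_input` is a hypothetical name, and Xenon's real validation rules and error messages may differ.

```python
def check_xml_input(text, allow_empty=False, max_size=None):
    """Raise before repair if `text` is not a usable string."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    if not text.strip() and not allow_empty:
        raise ValueError("input is empty")
    if max_size is not None and len(text) > max_size:
        raise ValueError(f"input has {len(text)} characters, limit is {max_size}")
    return text

for candidate in (None, "", "<root>" + "x" * 10000):
    try:
        check_xml_input(candidate, max_size=5000)
    except (TypeError, ValueError) as exc:
        print(f"rejected: {exc}")
```

Failing fast like this keeps oversized or non-string input away from the repair step, which is the point of a size limit in production.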

## Lenient Mode (Never Crashes)

In [None]:
# Lenient mode NEVER raises exceptions

print("Testing lenient mode with various inputs:")
print()

test_cases = [
    (None, "None"),
    (123, "Integer"),
    ("", "Empty string"),
    (["<root>"], "List"),
    ("<root><item", "Truncated XML"),
]

for inp, description in test_cases:
    result = repair_xml_lenient(inp)
    print(f"  {description:20s} → {repr(result)[:40]}")

---

# 🎮 Interactive Playground

Try your own malformed XML below!

In [None]:
# ✏️ Edit this XML and run the cell to see how Xenon repairs it!

your_xml = '''
Sure, here's the XML:

<config>
    <database host=localhost port=5432>
        <credentials user=admin password=secret123
'''

print("📥 YOUR INPUT:")
print(your_xml)
print()
print("=" * 60)
print()
print("📤 XENON OUTPUT:")
repaired = repair_xml(your_xml)
print(repaired)
print()
print("=" * 60)
print()
print("📊 AS DICTIONARY:")
data = parse_xml(your_xml)
print(json.dumps(data, indent=2))

---

# 🌍 Real-World LLM Examples

Examples from actual LLM outputs:

## Example 1: ChatGPT API Response

In [None]:
chatgpt_output = '''
Here's the product catalog in XML format:

<catalog>
    <product id=A001 category=electronics>
        <name>Laptop Pro 15</name>
        <price currency=USD>1299.99</price>
        <description>High-performance laptop with 16GB RAM & SSD</description>
        <inStock>true</inStock>
    </product>
    <product id=A002 category=electronics>
        <name>Wireless Mouse</name>
        <price currency=USD>29.99
'''

print("Before:")
print(chatgpt_output)
print("\n" + "=" * 60 + "\n")
print("After:")
print(repair_xml(chatgpt_output))

## Example 2: Claude Code Generation

In [None]:
claude_output = '''
I'll generate the XML configuration for you:

<configuration environment=production>
    <server>
        <host>api.example.com</host>
        <port>8080</port>
        <ssl enabled=true cert=/path/to/cert.pem>
    </server>
    <database type=postgresql>
        <connection string=postgresql://user@localhost:5432/db
'''

print("Before:")
print(claude_output)
print("\n" + "=" * 60 + "\n")
print("After:")
repaired = repair_xml(claude_output)
print(repaired)
print("\n" + "=" * 60 + "\n")
print("Parsed:")
print(json.dumps(parse_xml(claude_output), indent=2))

## Example 3: Structured Data Extraction

In [None]:
# LLM extracting data from text into XML
extraction = '''
Based on the article, here's the extracted information:

<article>
    <title>Breaking News: Tech Company Valued at $5B < $10B
    <author name=Jane Doe role=senior-reporter>
    <published date=2024-01-15 time=14:30:00>
    <tags>
        <tag>technology</tag>
        <tag>business & finance</tag>
        <tag>startups
'''

print("Malformed extraction:")
print(extraction)
print("\n" + "=" * 60 + "\n")
print("Cleaned up:")
repaired = repair_xml(extraction)
print(repaired)
print("\n" + "=" * 60 + "\n")
print("Structured data:")
data = parse_xml(extraction)
print(json.dumps(data, indent=2))

---

# 🎓 Summary

## What Xenon Fixes:
- ✅ **Truncation**: Auto-closes open tags
- ✅ **Conversational Fluff**: Extracts pure XML
- ✅ **Malformed Attributes**: Adds missing quotes
- ✅ **Unescaped Entities**: Escapes `&` and `<`

## Three Modes:
- 🛡️ **Safe Mode**: Production-ready with validation
- 🚀 **Lenient Mode**: Never crashes, always returns something
- ⚡ **Default Mode**: Fast, assumes valid string input

## Installation:
```bash
pip install xenon  # Coming soon!
```

## Resources:
- **GitHub**: https://github.com/MarsZDF/xenon
- **Documentation**: See README.md
- **License**: MIT

---

### 💡 Have fun repairing XML! 🔧