# 🔧 Xenon v0.6.0: XML Repair for LLM-Generated XML

Welcome to the interactive Xenon demo! This notebook shows you how to repair malformed XML commonly generated by Large Language Models.

**What is Xenon?**
- Zero-dependency Python library
- Fixes truncated, malformed, and messy XML from LLMs
- Simple API with robust error handling
- **NEW in v0.6.0**: Diff reporting, formatting, HTML entities, encoding detection

**Open in Colab**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarsZDF/xenon/blob/main/xenon_demo.ipynb)

## 📦 Installation

Install Xenon directly from GitHub:

In [None]:
# Install Xenon from GitHub
!pip install -q git+https://github.com/MarsZDF/xenon.git

from xenon import (
    repair_xml_safe,
    repair_xml_with_report,
    parse_xml,
    format_xml,
)

print("✅ Xenon v0.6.0 installed successfully!")

---

# 🎯 The 4 Major LLM XML Failure Modes

Xenon handles the most common ways LLMs break XML:

## 1️⃣ Truncation / Cut-off

LLMs run out of tokens mid-tag and leave XML incomplete:

In [None]:
# LLM output that got cut off
truncated = '<root><user name="alice"><address><city>San Francisco'

print("❌ BROKEN XML:")
print(truncated)
print()

# Xenon repairs it
repaired = repair_xml_safe(truncated)
print("✅ REPAIRED XML:")
print(repaired)
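
To make the idea of auto-closing concrete, here is a minimal sketch that uses only the standard library. It is not Xenon's implementation: the helper `close_open_tags` is hypothetical and only covers simple cases (no comments, CDATA, namespaced names, or tags cut off in the middle of an attribute).

```python
import re
import xml.etree.ElementTree as ET

# Matches opening, closing, and self-closing tags.
# Group 1 is "/" for a closing tag, group 2 is the tag name,
# group 3 is "/" for a self-closing tag.
TAG_RE = re.compile(r'<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>')

def close_open_tags(text: str) -> str:
    """Hypothetical helper: append closing tags for elements left open."""
    stack = []
    for match in TAG_RE.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <tag/> opens and closes itself
        if closing:
            # Pop only when the closing tag matches the innermost open tag;
            # mismatched closing tags are ignored in this sketch.
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    # Close the remaining elements, innermost first.
    return text + ''.join(f'</{name}>' for name in reversed(stack))

truncated = '<root><user name="alice"><address><city>San Francisco'
closed = close_open_tags(truncated)
print(closed)

# The result is now accepted by a strict parser.
city = ET.fromstring(closed).find('./user/address/city')
print(city.text)
```

The stack mirrors the nesting of the document, which is why the closing tags come out in reverse order of opening.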

## 2️⃣ Conversational Fluff

LLMs wrap XML in conversational text:

In [None]:
# LLM adds unnecessary commentary
fluff = '''
Sure! Here's the XML you requested:

<root>
    <message>Hello World</message>
    <status>success</status>
</root>

Hope this helps! Let me know if you need anything else.
'''

print("❌ MESSY OUTPUT:")
print(fluff)
print()

# Xenon extracts just the XML
repaired = repair_xml_safe(fluff)
print("✅ CLEAN XML:")
print(repaired)

## 3️⃣ Malformed Attributes

LLMs forget to quote attribute values:

In [None]:
# Missing quotes around attributes
unquoted = '<product id=12345 category=electronics price=299.99>Laptop</product>'

print("❌ INVALID ATTRIBUTES:")
print(unquoted)
print()

# Xenon adds quotes
repaired = repair_xml_safe(unquoted)
print("✅ VALID ATTRIBUTES:")
print(repaired)
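
As a rough illustration of what attribute quoting involves, the sketch below wraps bare attribute values in double quotes with two regular expressions. The helper `quote_attributes` is hypothetical and is not how Xenon is known to work; it assumes attribute values contain no `>` and does not handle bare values on self-closing tags.

```python
import re
import xml.etree.ElementTree as ET

# Opening tags only; closing tags start with "</" and are skipped.
OPEN_TAG_RE = re.compile(r'<[A-Za-z_][^<>]*>')
# name=value where value is double-quoted, single-quoted, or bare.
ATTR_RE = re.compile(r'''([\w:.\-]+)=("[^"]*"|'[^']*'|[^\s"'<>]+)''')

def quote_attributes(text: str) -> str:
    """Hypothetical helper: wrap bare attribute values in double quotes."""
    def fix_attr(match):
        name, value = match.groups()
        if value[0] in '"\'':
            return match.group(0)  # already quoted, leave untouched
        return f'{name}="{value}"'

    def fix_tag(match):
        # Only rewrite inside opening tags, never in text content.
        return ATTR_RE.sub(fix_attr, match.group(0))

    return OPEN_TAG_RE.sub(fix_tag, text)

unquoted = '<product id=12345 category=electronics price=299.99>Laptop</product>'
quoted = quote_attributes(unquoted)
print(quoted)
print(ET.fromstring(quoted).attrib)
```

Matching quoted values first means text such as `title="x y=z"` is consumed whole, so the `y=z` inside it is not mistaken for a separate attribute.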

## 4️⃣ Unescaped Entities

LLMs forget to escape special characters like `&` and `<`:

In [None]:
# Special characters not escaped
unescaped = '<description>Price: $5 < $10 & shipping included</description>'

print("❌ UNESCAPED ENTITIES:")
print(unescaped)
print()

# Xenon escapes them
repaired = repair_xml_safe(unescaped)
print("✅ PROPERLY ESCAPED:")
print(repaired)
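
To see why escaping matters, the sketch below feeds the same text to Python's standard `xml.etree.ElementTree` parser, which rejects it, and then escapes the text content with `xml.sax.saxutils.escape`. It uses only the standard library and says nothing about how Xenon performs the repair internally.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = 'Price: $5 < $10 & shipping included'

# A strict parser refuses the unescaped form.
try:
    ET.fromstring(f'<description>{raw_text}</description>')
    strict_ok = True
except ET.ParseError:
    strict_ok = False
print("Unescaped input parses:", strict_ok)

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
escaped = f'<description>{escape(raw_text)}</description>'
print(escaped)

# The escaped form parses, and the text round-trips unchanged.
element = ET.fromstring(escaped)
print("Round-trip matches:", element.text == raw_text)
```

Note that `escape` is applied to the text content only. Escaping a whole document would also mangle its tags, which is part of what makes repairing mixed markup and text harder than escaping known text.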

---

# 🆕 NEW in v0.6.0: Repair Analysis

See exactly what Xenon fixed with detailed repair reports:

In [None]:
# Malformed XML with multiple issues
messy = '''
Here's your data:

<users>
    <user id=1001 role=admin>
        <name>John Smith & Associates</name>
        <email>john@example.com</email>
        <status>Active & Verified
'''

# Get detailed report of what was fixed
repaired, report = repair_xml_with_report(messy)

print("📊 REPAIR REPORT:")
print(report.summary())
print()

# Show statistics
stats = report.statistics()
print("📈 STATISTICS:")
print(f"  Total repairs: {stats['total_repairs']}")
print(f"  Input size: {stats['input_size']} bytes")
print(f"  Output size: {stats['output_size']} bytes")
print()

# Group by repair type
print("🔧 REPAIRS BY TYPE:")
for repair_type, actions in report.by_type().items():
    print(f"  {repair_type.value}: {len(actions)}x")
print()

print("✅ REPAIRED XML:")
print(repaired)

## 🆕 View Changes as Diff

See before/after comparison:

In [None]:
# Show unified diff
print("📝 UNIFIED DIFF:")
print(report.to_unified_diff())
print()

# Get diff statistics
diff_stats = report.get_diff_summary()
print("📊 DIFF SUMMARY:")
print(f"  Lines added: {diff_stats['lines_added']}")
print(f"  Lines changed: {diff_stats['lines_changed']}")
print(f"  Similarity: {diff_stats['similarity_ratio']:.1%}")
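
For intuition about what a unified diff and a similarity ratio represent, the standard library's `difflib` can produce both. This is only an illustration: how Xenon computes the values in its report is not shown in this notebook, and the sample strings below are chosen for the example.

```python
import difflib

before = '<root><item>test</item><another>data'
after = '<root><item>test</item><another>data</another></root>'

# unified_diff works on sequences of lines; lineterm='' avoids
# adding newlines to the header lines.
diff_lines = list(difflib.unified_diff(
    before.splitlines(),
    after.splitlines(),
    fromfile='original',
    tofile='repaired',
    lineterm='',
))
print('\n'.join(diff_lines))

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the combined length of both strings.
ratio = difflib.SequenceMatcher(None, before, after).ratio()
print(f"Similarity: {ratio:.1%}")
```

Because the repair here only appends closing tags, the whole original string matches a prefix of the repaired one and the ratio stays high.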

---

# 🆕 NEW in v0.6.0: XML Formatting

Format XML for readability or storage:

In [None]:
# Compact XML
compact = '<root><item>test</item><another>data</another></root>'

print("📝 ORIGINAL (compact):")
print(compact)
print()

# Pretty-print for readability
pretty = format_xml(compact, style='pretty')
print("✨ PRETTY-PRINTED:")
print(pretty)
print()

# Minify for storage
minified = format_xml(pretty, style='minify')
print("📦 MINIFIED (saves space):")
print(minified)
print(f"Size reduction: {len(pretty)} → {len(minified)} characters ({(1-len(minified)/len(pretty))*100:.0f}% smaller)")

## 🆕 Repair + Format in One Step

In [None]:
# Broken XML that needs formatting
broken = '<root><item>test</item><another>data'

# Repair AND format in one call
result = repair_xml_safe(broken, format_output='pretty')

print("📥 INPUT:")
print(broken)
print()
print("📤 OUTPUT (repaired + formatted):")
print(result)

---

# 🆕 NEW in v0.6.0: HTML Entity Support

Handle HTML entities that LLMs often use:

In [None]:
# LLMs often use HTML entities
with_entities = '<price>&euro;50 &mdash; &copy;2025</price>'

print("📝 WITH HTML ENTITIES:")
print(with_entities)
print()

# Convert to numeric entities (XML-safe)
result = repair_xml_safe(with_entities, html_entities='numeric')
print("🔢 NUMERIC ENTITIES (XML-safe):")
print(result)
print()

# Convert to Unicode
result2 = repair_xml_safe(with_entities, html_entities='unicode')
print("✨ UNICODE:")
print(result2)
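
Named entities such as `&euro;` are defined by HTML, not XML, which predefines only `&lt;`, `&gt;`, `&amp;`, `&quot;`, and `&apos;`. The sketch below shows both conversions with the standard library: `html.entities.name2codepoint` for numeric references and `html.unescape` for Unicode. The helper `to_numeric` is hypothetical and is not part of Xenon.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Entities that XML itself defines; these must stay as they are.
XML_PREDEFINED = {'lt', 'gt', 'amp', 'quot', 'apos'}

def to_numeric(text: str) -> str:
    """Hypothetical helper: turn HTML named entities into numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return f'&#{name2codepoint[name]};'
    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)

with_entities = '<price>&euro;50 &mdash; &copy;2025</price>'

numeric = to_numeric(with_entities)
print(numeric)

# unescape() resolves every reference, including &amp; and &lt;,
# so it is only safe on markup that contains none of those.
as_unicode = unescape(with_entities)
print(as_unicode)
```

Numeric references keep the output pure ASCII, while the Unicode form is easier to read but depends on the output encoding being able to represent the characters.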

---

# 🆕 NEW in v0.6.0: Bytes Input Support

Handle bytes with automatic encoding detection:

In [None]:
# Bytes input (e.g., from API responses)
xml_bytes = b'<root>caf\xc3\xa9</root>'  # UTF-8 encoded with café

print("📦 BYTES INPUT:")
print(xml_bytes)
print()

# Xenon auto-detects encoding
result = repair_xml_safe(xml_bytes)
print("✅ DECODED + REPAIRED:")
print(result)
print(type(result))
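
One common way to detect an XML encoding is to check for a byte-order mark, then for an `encoding` attribute in the XML declaration, and otherwise fall back to UTF-8. The sketch below follows that order. It is an assumption about a reasonable approach, not a description of Xenon's detection logic, and the helper `decode_xml_bytes` is hypothetical.

```python
import codecs
import re

# Byte-order marks and the codec that consumes each of them.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

# Encoding name inside an XML declaration at the very start of the data.
DECL_RE = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._\-]+)["\']')

def decode_xml_bytes(data: bytes) -> str:
    """Hypothetical helper: decode XML bytes using BOM, declaration, or UTF-8."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    match = DECL_RE.match(data)
    encoding = match.group(1).decode('ascii') if match else 'utf-8'
    try:
        return data.decode(encoding, errors='replace')
    except LookupError:
        # Unknown encoding name in the declaration: fall back to UTF-8.
        return data.decode('utf-8', errors='replace')

plain = decode_xml_bytes(b'<root>caf\xc3\xa9</root>')
declared = decode_xml_bytes(
    b'<?xml version="1.0" encoding="latin-1"?><root>caf\xe9</root>'
)
print(plain)
print(declared)
```

Using `errors='replace'` keeps the function from raising on stray bytes, at the cost of substituting U+FFFD for anything that cannot be decoded.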

---

# 📊 Parse to Dictionary

Convert repaired XML to Python dictionaries:

In [None]:
import json

# Malformed XML from LLM
malformed = '<response status=success><data count=3><item>Apple</item><item>Banana</item><item>Orange'

print("📝 Input:")
print(malformed)
print()

# Parse directly (repairs automatically)
data = parse_xml(malformed)

print("📦 Parsed Dictionary:")
print(json.dumps(data, indent=2))
print()

# Access data easily
print("🎯 Extracted Values:")
print(f"  Status: {data['response']['@attributes']['status']}")
print(f"  Count: {data['response']['data']['@attributes']['count']}")
print(f"  Items: {data['response']['data']['item']}")

---

# 🌍 Real-World LLM Example

Complete workflow: broken LLM output → repaired → formatted → parsed

In [None]:
# Realistic ChatGPT output with multiple issues
chatgpt_output = '''
Here's the product catalog:

<catalog>
    <product id=A001 category=electronics>
        <name>Laptop Pro 15</name>
        <price currency=USD>&euro;1299.99</price>
        <description>High-performance laptop with 16GB RAM & SSD</description>
        <inStock>true</inStock>
    </product>
    <product id=A002 category=electronics>
        <name>Wireless Mouse</name>
        <price currency=USD>29.99
'''

print("📥 RAW LLM OUTPUT:")
print(chatgpt_output[:200] + "...")
print()

# Repair with all v0.6.0 features
repaired, report = repair_xml_with_report(
    chatgpt_output,
    format_output='pretty',
    html_entities='unicode'
)

print("🔧 WHAT WAS FIXED:")
print(report.summary())
print()

print("✅ FINAL RESULT:")
print(repaired)
print()

# Parse to dictionary
data = parse_xml(repaired)
print("📊 AS JSON:")
print(json.dumps(data, indent=2)[:300] + "...")

---

# 🎮 Interactive Playground

Try your own malformed XML below!

In [None]:
# ✏️ Edit this XML and run the cell to see how Xenon repairs it!

your_xml = '''
Sure, here's the XML:

<config>
    <database host=localhost port=5432>
        <credentials user=admin password=secret123
'''

print("📥 YOUR INPUT:")
print(your_xml)
print()
print("=" * 60)
print()

# Repair with detailed report
repaired, report = repair_xml_with_report(your_xml, format_output='pretty')

print("🔧 REPAIRS MADE:")
for action in report.actions:
    print(f"  • {action.description}")
print()
print("=" * 60)
print()

print("📤 XENON OUTPUT:")
print(repaired)

---

# 🎓 Summary

## Core Features:
- ✅ **Truncation**: Auto-closes open tags
- ✅ **Conversational Fluff**: Extracts pure XML
- ✅ **Malformed Attributes**: Adds missing quotes
- ✅ **Unescaped Entities**: Escapes `&` and `<`

## üÜï NEW in v0.6.0:
- 📊 **Repair Reports**: See exactly what was fixed
- 🎨 **XML Formatting**: Pretty, compact, or minify
- 🌍 **HTML Entities**: Convert `&euro;`, `&copy;`, `&mdash;`, etc.
- 📦 **Bytes Support**: Auto-detect encoding
- 🔍 **Enhanced Errors**: Line/column context

## Resources:
- **GitHub**: https://github.com/MarsZDF/xenon
- **Documentation**: See README.md and docs/
- **License**: MIT
- **Tests**: 371 tests, 88% coverage ✅

---

### 💡 Happy XML repairing! 🔧