# Step 5: Document Metadata Parsing Demo

This notebook demonstrates **document metadata extraction and generation** - parsing the LaTeX preamble (everything before `\begin{document}`).

## What We're Testing

1. **Parse metadata from LaTeX preamble** - Extract `\renewcommand` fields and colors
2. **Generate preamble from metadata** - Convert back to LaTeX
3. **Round-trip validation** - Verify metadata survives conversion

## Success Criteria

- ✅ All `\renewcommand` fields extracted correctly
- ✅ Color fields separated from general fields
- ✅ Professional profile cleaned of LaTeX commands
- ✅ Round-trip produces identical metadata

In [2]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

load_dotenv()
STRUCTURED_PATH = Path(os.getenv("RESUME_ARCHIVE_PATH")) / "structured"

print(f"✓ Project root: {project_root}")
print(f"✓ Test fixtures: {STRUCTURED_PATH}")

✓ Project root: /home/sean/ARCHER
✓ Test fixtures: /home/sean/ARCHER/data/resume_archive/structured


In [3]:
from archer.contexts.templating.converter import LaTeXToYAMLConverter, YAMLToLaTeXConverter
from omegaconf import OmegaConf

parser = LaTeXToYAMLConverter()
generator = YAMLToLaTeXConverter()

print("✓ Converters loaded")

✓ Converters loaded


## Step 1: Load Original LaTeX

Load the test fixture with preamble metadata.

In [7]:
latex_path = STRUCTURED_PATH / "document_metadata_test.tex"
original_latex = latex_path.read_text(encoding="utf-8")

print("Original LaTeX preamble (first 500 chars):")
print("=" * 60)
print(original_latex[:500])
print("...")

Original LaTeX preamble (first 500 chars):
\documentclass[11pt,letterpaper,english,usenames,dvipsnames]{article}

\input{mystyle/packages.sty}
\input{mystyle/colors.sty}
\input{mystyle/gencommands.sty}
\input{mystyle/pagestyles.sty}
\input{mystyle/tables.sty}
\input{mystyle/symbols.sty}
\input{mystyle/panels.sty}
\input{mystyle/defaultspacing.sty}
\graphicspath{{./}{Figs/}}

\setlength{\leftmargin}{0.4in}
\setlength{\rightmargin}{0.5in}
\setlength{\aboveheader}{10pt}
\setlength{\bottommargin}{0.2in}

\renewcommand{\pdfkeywords}{Sean, Sta
...


## Step 2: Parse LaTeX → Metadata Dict

Extract all metadata from preamble.

In [None]:
metadata = parser.extract_document_metadata(original_latex)

print("Extracted Metadata:")
print("=" * 60)
print(f"Name: {metadata['name']}")
print(f"Date: {metadata['date']}")
print(f"Brand: {metadata['brand']}")
print(f"\nProfessional Profile: {metadata['professional_profile']}")
print(f"\nColors:")
for color_key, color_val in metadata['colors'].items():
    print(f"  {color_key}: {color_val}")
print(f"\nOther Fields:")
for field_key, field_val in metadata['fields'].items():
    print(f"  {field_key}: {field_val}")

## Step 3: Generate LaTeX Preamble from Metadata

Convert metadata back to LaTeX `\renewcommand` statements.

In [None]:
generated_preamble = generator.generate_preamble(metadata)

print("Generated Preamble:")
print("=" * 60)
print(generated_preamble)

## Step 4: Parse Generated Preamble → Metadata Dict (Round-trip)

Parse the generated preamble to verify round-trip fidelity.

In [None]:
# Wrap preamble in document for parser
test_latex = generated_preamble + "\n\\begin{document}\n\\end{document}"
roundtrip_metadata = parser.extract_document_metadata(test_latex)

print("Round-trip Metadata:")
print("=" * 60)
print(f"Name: {roundtrip_metadata['name']}")
print(f"Date: {roundtrip_metadata['date']}")
print(f"Brand: {roundtrip_metadata['brand']}")
print(f"\nProfessional Profile: {roundtrip_metadata['professional_profile']}")
print(f"\nColors:")
for color_key, color_val in roundtrip_metadata['colors'].items():
    print(f"  {color_key}: {color_val}")

## Step 5: Validate Round-trip Fidelity

Compare original metadata with round-trip metadata.

In [11]:
print("Validation Results:")
print("=" * 60)

# Check name
name_match = metadata['name'] == roundtrip_metadata['name']
print(f"✓ Name preserved: {name_match}" if name_match else f"✗ Name mismatch: {metadata['name']} != {roundtrip_metadata['name']}")

# Check date
date_match = metadata['date'] == roundtrip_metadata['date']
print(f"✓ Date preserved: {date_match}" if date_match else f"✗ Date mismatch: {metadata['date']} != {roundtrip_metadata['date']}")

# Check brand
brand_match = metadata['brand'] == roundtrip_metadata['brand']
print(f"✓ Brand preserved: {brand_match}" if brand_match else f"✗ Brand mismatch: {metadata['brand']} != {roundtrip_metadata['brand']}")

# Check professional profile
profile_match = metadata['professional_profile'] == roundtrip_metadata['professional_profile']
print(f"✓ Profile preserved: {profile_match}" if profile_match else f"✗ Profile mismatch")

# Check colors
colors_match = metadata['colors'] == roundtrip_metadata['colors']
print(f"✓ Colors preserved: {colors_match}" if colors_match else f"✗ Colors mismatch: {metadata['colors']} != {roundtrip_metadata['colors']}")

# Overall success
all_match = name_match and date_match and brand_match and profile_match and colors_match
print("\n" + "=" * 60)
if all_match:
    print("✅ SUCCESS: All metadata preserved through round-trip!")
else:
    print("❌ FAILURE: Some metadata lost or corrupted")

Validation Results:
✓ Name preserved: True
✓ Date preserved: True
✓ Brand preserved: True
✓ Profile preserved: True
✓ Colors preserved: True

✅ SUCCESS: All metadata preserved through round-trip!


## Step 6: Test Edge Cases

Test metadata without professional profile.

In [12]:
minimal_latex = """\nrenewcommand{\\myname}{\\textbf{Test Name}}
\\renewcommand{\\mydate}{January 2025}
\\renewcommand{\\brand}{Software Engineer}
\\renewcommand{\\emphcolor}{black}

\\begin{document}
\\end{document}
"""

minimal_metadata = parser.extract_document_metadata(minimal_latex)

print("Minimal Metadata (no profile):")
print("=" * 60)
print(f"Name: {minimal_metadata['name']}")
print(f"Date: {minimal_metadata['date']}")
print(f"Brand: {minimal_metadata['brand']}")
print(f"Profile: {minimal_metadata['professional_profile']}")
print(f"Colors: {minimal_metadata['colors']}")

if minimal_metadata['professional_profile'] is None:
    print("\n✅ Optional profile field handled correctly (None)")
else:
    print(f"\n❌ Profile should be None, got: {minimal_metadata['professional_profile']}")

Minimal Metadata (no profile):
Name: 
Date: January 2025
Brand: Software Engineer
Profile: None
Colors: {'emphcolor': 'black'}

✅ Optional profile field handled correctly (None)


## Summary

### What We Validated

1. ✅ **Preamble parsing** - All `\renewcommand` fields extracted
2. ✅ **Color separation** - Color fields separated from general fields
3. ✅ **LaTeX command cleaning** - `\textbf{...}`, `\centering`, `\par` removed from values
4. ✅ **Preamble generation** - Valid LaTeX syntax produced
5. ✅ **Round-trip fidelity** - All metadata preserved through conversion
6. ✅ **Optional fields** - Professional profile can be None

### Next Steps

Step 6 will add:
- Multi-page parsing with `extract_pages()`
- Page 2 bottom bar extraction
- Complete document assembly with `generate_document()`