# Step 4: Page Structure Demonstration

This notebook demonstrates parsing and regenerating a complete single page that uses a two-column `paracol` layout.

## What Step 4 Added

- **Page-level data structures**: `Page`, `PageRegions`, `Column`, `Section`
- **Paracol parsing**: `extract_page_regions()` finds `\begin{paracol}{2}` and `\switchcolumn`
- **Type inference**: Automatically detects section types from LaTeX structure
- **Page generation**: `generate_page()` creates complete paracol structure
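
As a rough illustration of the paracol parsing idea, the sketch below splits a two-column `paracol` body at `\switchcolumn` using only the standard library. The name `split_paracol` is hypothetical and this is not the actual `extract_page_regions()` implementation; it assumes exactly one `paracol` environment and at most one `\switchcolumn`.

```python
from __future__ import annotations

import re

# Hypothetical helper for illustration only; not the archer implementation.
PARACOL_RE = re.compile(
    r"\\begin\{paracol\}\{2\}(?P<body>.*?)\\end\{paracol\}",
    re.DOTALL,
)


def split_paracol(latex: str) -> tuple[str, str]:
    """Return the (left, main) column bodies of a two-column paracol block."""
    match = PARACOL_RE.search(latex)
    if match is None:
        raise ValueError("no two-column paracol environment found")
    # Everything before \switchcolumn is the left column, the rest is the main column.
    left, _, main = match.group("body").partition("\\switchcolumn")
    return left.strip(), main.strip()


sample = r"""
\begin{paracol}{2}
\section*{Languages}
\switchcolumn
\section*{Experience}
\end{paracol}
"""
left, main = split_paracol(sample)
print("left:", left)
print("main:", main)
```

If `\switchcolumn` is missing, `str.partition` leaves the main column empty rather than raising, which may or may not be the behavior a real parser wants.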

## Round-Trip Test

We'll demonstrate **LaTeX → YAML → LaTeX → YAML** and verify that the first and second parses produce identical results.
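
The property being tested can be stated generically: parsing, regenerating, and parsing again should give back the same structure. The sketch below expresses that with toy stand-ins for the parser and generator so it runs without the `archer` package; `roundtrip_is_stable`, `toy_parse`, and `toy_generate` are hypothetical names used only for illustration.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


def roundtrip_is_stable(
    text: str,
    parse: Callable[[str], T],
    generate: Callable[[T], str],
) -> bool:
    """Parse, regenerate, parse again, and compare the two parsed results."""
    first = parse(text)
    second = parse(generate(first))
    return first == second


# Toy stand-ins: the "structure" is just the list of non-blank, stripped lines.
def toy_parse(text: str) -> list[str]:
    return [line.strip() for line in text.splitlines() if line.strip()]


def toy_generate(items: list[str]) -> str:
    return "\n\n".join(items) + "\n"


stable = roundtrip_is_stable("  a\n\n\n b \n", toy_parse, toy_generate)
print(stable)
```

Note that the check compares parsed structures, not the text itself, so the generator is free to change whitespace as long as the parser ignores it.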

In [1]:
# Setup
import sys
from pathlib import Path

# The project root is taken to be the parent of the current working directory
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from omegaconf import OmegaConf
from archer.contexts.templating.converter import (
    YAMLToLaTeXConverter,
    LaTeXToYAMLConverter,
)

print("✅ Setup complete")

✅ Setup complete


---

## Step 1: Load Original LaTeX

Load the test LaTeX file with paracol structure.

In [2]:
# Load original LaTeX
latex_path = project_root / "data/resume_archive/structured/single_page_test.tex"
original_latex = latex_path.read_text(encoding="utf-8")

print("Original LaTeX:")
print("=" * 60)
print(original_latex)
print("=" * 60)
print(f"\nLength: {len(original_latex)} characters")

Original LaTeX:
\begin{paracol}{2}

\section*{Core Skills}
   { \setlength{\baselineskip}{10pt} \setlength{\parskip}{7.5pt} \scshape

    Machine Learning

    High-Performance\\Computing (HPC)

    MLOps

   }

\section*{Languages}

    \texttt{Python} | \texttt{Bash} | \texttt{C++}

\switchcolumn

\section*{Experience}

    \begin{itemizeAcademic}{Test Company}{Software Engineer}{City, ST}{2023 -- Present}

        \itemi Built scalable ML infrastructure

        \itemi Reduced latency by 50\%

    \end{itemizeAcademic}

\end{paracol}


Length: 527 characters


---

## Step 2: Parse LaTeX → Structured Dict (First Conversion)

Parse the LaTeX into structured Python dict using `extract_page_regions()`.

In [5]:
# Parse LaTeX to structured format
parser = LaTeXToYAMLConverter()
page_regions_1 = parser.extract_page_regions(original_latex, page_number=1)

print("Parsed Structure (Python Dict):")
print("=" * 60)

# Display structure summary
print(f"Top bar: show_professional_profile = {page_regions_1['top']['show_professional_profile']}")
print(f"\nLeft column sections: {len(page_regions_1['left_column']['sections'])}")
for i, sect in enumerate(page_regions_1['left_column']['sections']):
    print(f"  {i+1}. {sect['name']} ({sect['type']})")

print(f"\nMain column sections: {len(page_regions_1['main_column']['sections'])}")
for i, sect in enumerate(page_regions_1['main_column']['sections']):
    print(f"  {i+1}. {sect['name']} ({sect['type']})")

print(f"\nBottom bar: {page_regions_1['bottom']}")

Parsed Structure (Python Dict):
Top bar: show_professional_profile = True

Left column sections: 2
  1. Core Skills (skill_list_caps)
  2. Languages (skill_list_pipes)

Main column sections: 1
  1. Experience (work_history)

Bottom bar: None
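
One plausible way to infer the section types shown above is to look for distinctive markers in each section body. The sketch below is a guess at such heuristics, not the converter's actual logic: the type names come from the output above, while the rules, the function name `infer_section_type`, and the `"unknown"` fallback are assumptions.

```python
def infer_section_type(body: str) -> str:
    """Guess a section type from markers in its LaTeX body (hypothetical rules)."""
    if r"\begin{itemizeAcademic}" in body:
        return "work_history"
    if r"\scshape" in body:
        return "skill_list_caps"
    if r"\texttt{" in body and "|" in body:
        return "skill_list_pipes"
    return "unknown"  # assumed fallback, not taken from the source


examples = {
    "Core Skills": r"{ \setlength{\parskip}{7.5pt} \scshape Machine Learning }",
    "Languages": r"\texttt{Python} | \texttt{Bash} | \texttt{C++}",
    "Experience": r"\begin{itemizeAcademic}{Test Company}{Software Engineer}{City, ST}{2023 -- Present}",
}
inferred = {name: infer_section_type(body) for name, body in examples.items()}
for name, section_type in inferred.items():
    print(f"{name}: {section_type}")
```

The order of the checks matters: a work-history body could also contain `|` or `\texttt`, so the environment test runs first.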


---

## Step 3: Convert Dict → YAML String

Convert the structured dict to YAML format using OmegaConf.

In [7]:
# Convert to YAML
yaml_conf_1 = OmegaConf.create(page_regions_1)
yaml_str_1 = OmegaConf.to_yaml(yaml_conf_1)

print("YAML Representation:")
print("=" * 60)
print(yaml_str_1)
print("=" * 60)
print(f"\nLength: {len(yaml_str_1)} characters")

YAML Representation:
top:
  show_professional_profile: true
left_column:
  sections:
  - name: Core Skills
    type: skill_list_caps
    content:
      list:
      - Machine Learning
      - High-Performance\\Computing (HPC)
      - MLOps
  - name: Languages
    type: skill_list_pipes
    content:
      list:
      - Python
      - Bash
      - C++
main_column:
  sections:
  - name: Experience
    type: work_history
    subsections:
    - type: work_experience
      metadata:
        company: Test Company
        title: Software Engineer
        location: City, ST
        dates: 2023 -- Present
      content:
        bullets:
        - text: Built scalable ML infrastructure
        - text: Reduced latency by 50\%
bottom: null


Length: 715 characters
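
The `work_experience` entry in the YAML above mirrors the four arguments of the `itemizeAcademic` environment plus its `\itemi` bullets. A minimal regex-based sketch of that mapping follows; `parse_work_experience` is a hypothetical name, and the sketch assumes the arguments contain no nested braces and that every bullet starts with `\itemi`.

```python
import re

# Hypothetical pattern: four brace-delimited arguments, then the environment body.
ENV_RE = re.compile(
    r"\\begin\{itemizeAcademic\}"
    r"\{(?P<company>[^{}]*)\}\{(?P<title>[^{}]*)\}"
    r"\{(?P<location>[^{}]*)\}\{(?P<dates>[^{}]*)\}"
    r"(?P<body>.*?)\\end\{itemizeAcademic\}",
    re.DOTALL,
)


def parse_work_experience(latex: str) -> dict:
    """Turn one itemizeAcademic environment into a work_experience dict."""
    match = ENV_RE.search(latex)
    if match is None:
        raise ValueError("no itemizeAcademic environment found")
    # Text before the first \itemi is dropped; each later chunk is one bullet.
    chunks = match.group("body").split("\\itemi")[1:]
    bullets = [{"text": " ".join(chunk.split())} for chunk in chunks]
    keys = ("company", "title", "location", "dates")
    return {
        "type": "work_experience",
        "metadata": {key: match.group(key) for key in keys},
        "content": {"bullets": bullets},
    }


sample = r"""
\begin{itemizeAcademic}{Test Company}{Software Engineer}{City, ST}{2023 -- Present}
    \itemi Built scalable ML infrastructure
    \itemi Reduced latency by 50\%
\end{itemizeAcademic}
"""
parsed = parse_work_experience(sample)
print(parsed["metadata"])
print(parsed["content"]["bullets"])
```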


---

## Step 4: Generate LaTeX from Dict (Second Conversion)

Generate LaTeX from the structured dict using `generate_page()`.

In [10]:
# Generate LaTeX from parsed structure
generator = YAMLToLaTeXConverter()
generated_latex = generator.generate_page(page_regions_1)

print("Generated LaTeX:")
print("=" * 60)
print(generated_latex)
print("=" * 60)
print(f"\nLength: {len(generated_latex)} characters")

Generated LaTeX:
\begin{paracol}{2}

\section*{Core Skills}

   { \setlength{\baselineskip}{10pt} \setlength{\parskip}{7.5pt} \scshape

    Machine Learning

    High-Performance\\Computing (HPC)

    MLOps

   }

\section*{Languages}

    \texttt{Python} | \texttt{Bash} | \texttt{C++}

\switchcolumn

\section*{Experience}

    \begin{itemizeAcademic}{Test Company}{Software Engineer}{City, ST}{2023 -- Present}

        \itemi Built scalable ML infrastructure

        \itemi Reduced latency by 50\%

    \end{itemizeAcademic}


\end{paracol}

Length: 528 characters
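
To make the generation direction concrete, here is a small sketch that renders one `work_experience` subsection in the layout shown above. It is not the `generate_page()` implementation: `render_work_experience` is a hypothetical name, and the indentation and blank-line spacing are copied from the sample output rather than from the converter's templates.

```python
def render_work_experience(subsection: dict) -> str:
    """Render a work_experience dict as an itemizeAcademic environment."""
    meta = subsection["metadata"]
    header = (
        "    \\begin{itemizeAcademic}"
        f"{{{meta['company']}}}{{{meta['title']}}}"
        f"{{{meta['location']}}}{{{meta['dates']}}}"
    )
    lines = [header, ""]
    for bullet in subsection["content"]["bullets"]:
        # One \itemi per bullet, each followed by a blank line.
        lines += [f"        \\itemi {bullet['text']}", ""]
    lines.append("    \\end{itemizeAcademic}")
    return "\n".join(lines)


subsection = {
    "type": "work_experience",
    "metadata": {
        "company": "Test Company",
        "title": "Software Engineer",
        "location": "City, ST",
        "dates": "2023 -- Present",
    },
    "content": {
        "bullets": [
            {"text": "Built scalable ML infrastructure"},
            {"text": r"Reduced latency by 50\%"},
        ]
    },
}
rendered = render_work_experience(subsection)
print(rendered)
```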


---

## Step 5: Parse Generated LaTeX → Dict Again (Third Conversion)

Parse the generated LaTeX back into structured dict to complete the round-trip.

In [11]:
# Parse generated LaTeX back to structure
page_regions_2 = parser.extract_page_regions(generated_latex, page_number=1)

print("Round-trip Parsed Structure:")
print("=" * 60)

# Display structure summary
print(f"Top bar: show_professional_profile = {page_regions_2['top']['show_professional_profile']}")
print(f"\nLeft column sections: {len(page_regions_2['left_column']['sections'])}")
for i, sect in enumerate(page_regions_2['left_column']['sections']):
    print(f"  {i+1}. {sect['name']} ({sect['type']})")

print(f"\nMain column sections: {len(page_regions_2['main_column']['sections'])}")
for i, sect in enumerate(page_regions_2['main_column']['sections']):
    print(f"  {i+1}. {sect['name']} ({sect['type']})")

Round-trip Parsed Structure:
Top bar: show_professional_profile = True

Left column sections: 2
  1. Core Skills (skill_list_caps)
  2. Languages (skill_list_pipes)

Main column sections: 1
  1. Experience (work_history)


---

## Step 6: Convert Second Dict → YAML String

Convert the round-trip dict to YAML for comparison.

In [12]:
# Convert round-trip structure to YAML
yaml_conf_2 = OmegaConf.create(page_regions_2)
yaml_str_2 = OmegaConf.to_yaml(yaml_conf_2)

print("Round-trip YAML Representation:")
print("=" * 60)
print(yaml_str_2)
print("=" * 60)
print(f"\nLength: {len(yaml_str_2)} characters")

Round-trip YAML Representation:
top:
  show_professional_profile: true
left_column:
  sections:
  - name: Core Skills
    type: skill_list_caps
    content:
      list:
      - Machine Learning
      - High-Performance\\Computing (HPC)
      - MLOps
  - name: Languages
    type: skill_list_pipes
    content:
      list:
      - Python
      - Bash
      - C++
main_column:
  sections:
  - name: Experience
    type: work_history
    subsections:
    - type: work_experience
      metadata:
        company: Test Company
        title: Software Engineer
        location: City, ST
        dates: 2023 -- Present
      content:
        bullets:
        - text: Built scalable ML infrastructure
        - text: Reduced latency by 50\%
bottom: null


Length: 715 characters


---

## Validation: Compare Original and Round-Trip Structures

Now we'll validate that the structures are identical.

In [13]:
# Compare the two dicts
print("Structure Comparison:")
print("=" * 60)

# Check if dicts are equal
if page_regions_1 == page_regions_2:
    print("✅ SUCCESS: Structures are IDENTICAL")
    print("\nLaTeX → Dict → LaTeX → Dict produces identical structure")
else:
    print("❌ FAILURE: Structures differ")
    print("\nDifferences found - investigating...")
    
    # Check each component
    if page_regions_1['top'] != page_regions_2['top']:
        print("  - Top bar differs")
    if page_regions_1['left_column'] != page_regions_2['left_column']:
        print("  - Left column differs")
    if page_regions_1['main_column'] != page_regions_2['main_column']:
        print("  - Main column differs")
    if page_regions_1['bottom'] != page_regions_2['bottom']:
        print("  - Bottom bar differs")

Structure Comparison:
✅ SUCCESS: Structures are IDENTICAL

LaTeX → Dict → LaTeX → Dict produces identical structure
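
When the comparison fails, the per-region checks above only say which region differs. A recursive walk can point at the exact leaf instead. The helper below is a standard-library sketch with a hypothetical name, `iter_differences`; it assumes the structures are plain nested dicts and lists like the ones printed earlier.

```python
from __future__ import annotations

import copy
from typing import Any, Iterator


def iter_differences(a: Any, b: Any, path: str = "") -> Iterator[str]:
    """Yield a description of every leaf where two nested structures differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b), key=str):
            child = f"{path}.{key}" if path else str(key)
            if key not in a or key not in b:
                yield f"{child}: present on one side only"
            else:
                yield from iter_differences(a[key], b[key], child)
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield f"{path}: list length {len(a)} != {len(b)}"
        # Compare the overlapping prefix element by element.
        for index, (x, y) in enumerate(zip(a, b)):
            yield from iter_differences(x, y, f"{path}[{index}]")
    elif a != b:
        yield f"{path}: {a!r} != {b!r}"


first = {
    "top": {"show_professional_profile": True},
    "left_column": {
        "sections": [{"name": "Languages", "content": {"list": ["Python", "Bash"]}}]
    },
}
second = copy.deepcopy(first)
second["left_column"]["sections"][0]["content"]["list"][1] = "C++"

differences = list(iter_differences(first, second))
for line in differences:
    print(line)
```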


---

## Detailed Content Validation

Let's verify that section content is preserved correctly.

In [14]:
print("Content Validation:")
print("=" * 60)

# Check Core Skills content
core_skills_1 = page_regions_1['left_column']['sections'][0]
core_skills_2 = page_regions_2['left_column']['sections'][0]

print("Core Skills section:")
print(f"  Original items: {len(core_skills_1['content']['list'])}")
print(f"  Round-trip items: {len(core_skills_2['content']['list'])}")
if core_skills_1['content']['list'] == core_skills_2['content']['list']:
    print("  ✅ Lists are identical")
else:
    print("  ❌ Lists differ")
    print(f"    Original: {core_skills_1['content']['list']}")
    print(f"    Round-trip: {core_skills_2['content']['list']}")

# Check Languages content
languages_1 = page_regions_1['left_column']['sections'][1]
languages_2 = page_regions_2['left_column']['sections'][1]

print("\nLanguages section:")
print(f"  Original items: {len(languages_1['content']['list'])}")
print(f"  Round-trip items: {len(languages_2['content']['list'])}")
if languages_1['content']['list'] == languages_2['content']['list']:
    print("  ✅ Lists are identical")
else:
    print("  ❌ Lists differ")

# Check Experience content
experience_1 = page_regions_1['main_column']['sections'][0]
experience_2 = page_regions_2['main_column']['sections'][0]

print("\nExperience section:")
print(f"  Original subsections: {len(experience_1['subsections'])}")
print(f"  Round-trip subsections: {len(experience_2['subsections'])}")

work_exp_1 = experience_1['subsections'][0]
work_exp_2 = experience_2['subsections'][0]

print(f"  Original company: {work_exp_1['metadata']['company']}")
print(f"  Round-trip company: {work_exp_2['metadata']['company']}")
print(f"  Original bullets: {len(work_exp_1['content']['bullets'])}")
print(f"  Round-trip bullets: {len(work_exp_2['content']['bullets'])}")

if work_exp_1 == work_exp_2:
    print("  ✅ Work experience is identical")
else:
    print("  ❌ Work experience differs")

Content Validation:
Core Skills section:
  Original items: 3
  Round-trip items: 3
  ✅ Lists are identical

Languages section:
  Original items: 3
  Round-trip items: 3
  ✅ Lists are identical

Experience section:
  Original subsections: 1
  Round-trip subsections: 1
  Original company: Test Company
  Round-trip company: Test Company
  Original bullets: 2
  Round-trip bullets: 2
  ✅ Work experience is identical


---

## YAML String Comparison

Compare the two YAML strings (they should be identical).

In [15]:
print("YAML String Comparison:")
print("=" * 60)

if yaml_str_1 == yaml_str_2:
    print("✅ SUCCESS: YAML strings are IDENTICAL")
    print("\nNo differences in YAML representation")
else:
    print("❌ FAILURE: YAML strings differ")
    print(f"\nOriginal YAML length: {len(yaml_str_1)}")
    print(f"Round-trip YAML length: {len(yaml_str_2)}")
    
    # Show a line-by-line unified diff for debugging
    import difflib
    diff = difflib.unified_diff(
        yaml_str_1.splitlines(keepends=True),
        yaml_str_2.splitlines(keepends=True),
        fromfile='original_yaml',
        tofile='roundtrip_yaml',
    )
    print("\nDifferences:")
    print(''.join(diff))

YAML String Comparison:
✅ SUCCESS: YAML strings are IDENTICAL

No differences in YAML representation


---

## LaTeX Comparison (Structural)

Compare the original LaTeX with the generated LaTeX. Whitespace and formatting may differ (the two strings above are 527 and 528 characters long), but the structure should be identical.

In [16]:
print("LaTeX Comparison:")
print("=" * 60)

# Check key structural elements
checks = [
    (r"\begin{paracol}{2}", "paracol environment"),
    (r"\switchcolumn", "column switch"),
    (r"\end{paracol}", "paracol close"),
    (r"\section*{Core Skills}", "Core Skills section"),
    (r"\section*{Languages}", "Languages section"),
    (r"\section*{Experience}", "Experience section"),
    (r"\begin{itemizeAcademic}", "work experience environment"),
]

all_present = True
for pattern, description in checks:
    in_original = pattern in original_latex
    in_generated = pattern in generated_latex
    
    if in_original and in_generated:
        print(f"  ✅ {description}: present in both")
    elif not in_original and not in_generated:
        print(f"  ✅ {description}: absent in both")
    else:
        print(f"  ❌ {description}: MISMATCH")
        print(f"     Original: {in_original}, Generated: {in_generated}")
        all_present = False

if all_present:
    print("\n✅ All structural elements present in generated LaTeX")
else:
    print("\n❌ Some structural elements missing or mismatched")

LaTeX Comparison:
  ✅ paracol environment: present in both
  ✅ column switch: present in both
  ✅ paracol close: present in both
  ✅ Core Skills section: present in both
  ✅ Languages section: present in both
  ✅ Experience section: present in both
  ✅ work experience environment: present in both

✅ All structural elements present in generated LaTeX
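
The substring checks above confirm that key markers exist but say nothing about the lines in between. A slightly stronger, still whitespace-tolerant comparison is to strip indentation and blank lines before comparing. The sketch below uses a hypothetical helper, `normalize_latex`, and assumes blank lines and indentation carry no meaning for this comparison, which is looser than LaTeX itself, where a blank line ends a paragraph.

```python
from __future__ import annotations


def normalize_latex(latex: str) -> list[str]:
    """Strip each line and drop blank ones so only content lines remain."""
    return [line.strip() for line in latex.splitlines() if line.strip()]


original = r"""
\section*{Languages}

    \texttt{Python} | \texttt{Bash} | \texttt{C++}

\end{paracol}
"""
generated = r"""
\section*{Languages}

    \texttt{Python} | \texttt{Bash} | \texttt{C++}


\end{paracol}
"""
raw_equal = original == generated
normalized_equal = normalize_latex(original) == normalize_latex(generated)
print(raw_equal, normalized_equal)
```

Here the raw strings differ by one blank line, while the normalized line lists are equal.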


---

## Final Summary

Overall validation results.

In [17]:
print("\n" + "=" * 60)
print("FINAL VALIDATION RESULTS")
print("=" * 60)

# Calculate success
dict_match = (page_regions_1 == page_regions_2)
yaml_match = (yaml_str_1 == yaml_str_2)

print(f"\n1. Dict comparison: {'✅ PASS' if dict_match else '❌ FAIL'}")
print(f"2. YAML comparison: {'✅ PASS' if yaml_match else '❌ FAIL'}")

if dict_match and yaml_match:
    print("\n" + "=" * 60)
    print("🎉 COMPLETE SUCCESS 🎉")
    print("=" * 60)
    print("\nLaTeX → YAML → LaTeX → YAML produces identical results")
    print("\nStep 4 validation: PASSED")
    print("\nPage structure parsing works correctly:")
    print("  ✅ Paracol environment detected")
    print("  ✅ Column separation preserved")
    print("  ✅ Section types inferred correctly")
    print("  ✅ Section content preserved")
    print("  ✅ Complete round-trip fidelity")
else:
    print("\n❌ VALIDATION FAILED - See differences above")


FINAL VALIDATION RESULTS

1. Dict comparison: ✅ PASS
2. YAML comparison: ✅ PASS

🎉 COMPLETE SUCCESS 🎉

LaTeX → YAML → LaTeX → YAML produces identical results

Step 4 validation: PASSED

Page structure parsing works correctly:
  ✅ Paracol environment detected
  ✅ Column separation preserved
  ✅ Section types inferred correctly
  ✅ Section content preserved
  ✅ Complete round-trip fidelity


---

## Optional: Save Files for Manual Inspection

Save the generated LaTeX to a file so you can diff it manually against the original.

In [18]:
# Save generated LaTeX for manual inspection
output_path = project_root / "data/resume_archive/structured/single_page_test_generated.tex"
output_path.write_text(generated_latex, encoding="utf-8")

print(f"Generated LaTeX saved to: {output_path}")
print("\nTo manually diff:")
print("  diff data/resume_archive/structured/single_page_test.tex \\")
print("       data/resume_archive/structured/single_page_test_generated.tex")

Generated LaTeX saved to: /home/sean/ARCHER/data/resume_archive/structured/single_page_test_generated.tex

To manually diff:
  diff data/resume_archive/structured/single_page_test.tex \
       data/resume_archive/structured/single_page_test_generated.tex
