# Step 6: Two-Page Document Support Demo

This notebook demonstrates **multi-page document parsing** - splitting LaTeX documents on `\clearpage` markers and parsing each page separately.

## What We're Testing

1. **Extract pages from LaTeX document** - Split on `\clearpage` markers
2. **Handle paracol spanning pages** - Single paracol environment across multiple pages
3. **Parse continuation pages** - Pages without `\switchcolumn` (main column only)
4. **Validate page structure** - Each page has correct regions
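
The first item above, splitting on `\clearpage`, can be sketched in a few lines. This is a minimal illustration and not the code inside `LaTeXToYAMLConverter.extract_pages()`; the helper name `split_on_clearpage` is hypothetical, and the sketch assumes `\clearpage` never appears inside a comment or verbatim block.

```python
import re


def split_on_clearpage(latex: str) -> list[str]:
    """Return the non-empty chunks of the document body between page breaks."""
    # Keep only the text between the document markers, if they are present.
    body_match = re.search(
        r"\\begin\{document\}(.*?)\\end\{document\}", latex, re.DOTALL
    )
    body = body_match.group(1) if body_match else latex
    # Split on the literal command; the lookahead rejects longer command
    # names that merely start with "\clearpage".
    chunks = re.split(r"\\clearpage(?![A-Za-z])", body)
    # Drop chunks that contain only whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Because a single `paracol` environment spans both pages, the first chunk keeps the `\begin{paracol}` line and the last chunk keeps `\end{paracol}`; a real parser has to strip or tolerate those wrappers.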

## Success Criteria

- ✅ Document splits correctly on `\clearpage`
- ✅ Page 1 has both left and main columns
- ✅ Page 2 has only main column (continuation)
- ✅ All content preserved across pages
- ✅ Page numbers correctly assigned

In [3]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

load_dotenv()
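# Fail with a clear message if the variable is missing; Path(None) raises a TypeError
archive_root = os.getenv("RESUME_ARCHIVE_PATH")
if archive_root is None:
    raise RuntimeError("RESUME_ARCHIVE_PATH is not set; define it in .env")
STRUCTURED_PATH = Path(archive_root) / "structured"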

print(f"✓ Project root: {project_root}")
print(f"✓ Test fixtures: {STRUCTURED_PATH}")

✓ Project root: /home/sean/ARCHER
✓ Test fixtures: /home/sean/ARCHER/data/resume_archive/structured


In [4]:
from archer.contexts.templating.converter import LaTeXToYAMLConverter
from omegaconf import OmegaConf

parser = LaTeXToYAMLConverter()

print("✓ Parser loaded")

✓ Parser loaded


## Step 1: Load Two-Page LaTeX Document

Load a test fixture with two pages separated by `\clearpage`.

In [5]:
latex_path = STRUCTURED_PATH / "two_page_test.tex"
paracol_content = latex_path.read_text(encoding="utf-8")

# The fixture holds only the paracol body; wrap it in document markers for extract_pages()
latex_str = "\\begin{document}\n" + paracol_content + "\n\\end{document}"

print("Original LaTeX (first 500 chars):")
print("=" * 60)
print(latex_str[:500])
print("...\n")

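# Show where \clearpage appears
clearpage_pos = latex_str.find("\\clearpage")
assert clearpage_pos != -1, "fixture should contain a \\clearpage marker"
print(f"\\clearpage found at position {clearpage_pos}")
print("\nContext around \\clearpage:")
# Clamp the start so a marker near the beginning cannot yield a negative index
print(latex_str[max(0, clearpage_pos - 50):clearpage_pos + 100])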

Original LaTeX (first 500 chars):
\begin{document}
\begin{paracol}{2}

% Page 1 - Left Column
\section*{Core Skills}
   { \setlength{\baselineskip}{10pt} \setlength{\parskip}{7.5pt} \scshape

    Machine Learning

    High-Performance\\Computing (HPC)

    MLOps

   }

\switchcolumn

% Page 1 - Main Column
\section*{Experience}

    \begin{itemizeAcademic}{Test Company}{Software Engineer}{City, ST}{2023 -- Present}

        \itemi Built scalable ML infrastructure

        \itemi Reduced latency by 50\%

    \end{itemizeAcademic}
...

\clearpage found at position 502

Context around \clearpage:
duced latency by 50\%

    \end{itemizeAcademic}

\clearpage

% Page 2 - Main Column (continues from page 1)
\section*{More Experience}

    \begin{it


## Step 2: Extract All Pages

Use `extract_pages()` to split document on `\clearpage` markers.

In [6]:
pages = parser.extract_pages(latex_str)

print(f"Extracted {len(pages)} pages")
print("=" * 60)

for page in pages:
    page_num = page["page_number"]
    regions = page["regions"]
    
    print(f"\nPage {page_num}:")
    print(f"  Show profile: {regions['top']['show_professional_profile']}")
    print(f"  Left column: {regions['left_column'] is not None}")
    print(f"  Main column: {regions['main_column'] is not None}")
    
    if regions['left_column']:
        left_sections = [s['name'] for s in regions['left_column']['sections']]
        print(f"  Left sections: {left_sections}")
    
    if regions['main_column']:
        main_sections = [s['name'] for s in regions['main_column']['sections']]
        print(f"  Main sections: {main_sections}")

Extracted 2 pages

Page 1:
  Show profile: True
  Left column: True
  Main column: True
  Left sections: ['Core Skills']
  Main sections: ['Experience']

Page 2:
  Show profile: False
  Left column: False
  Main column: True
  Main sections: ['More Experience']


## Step 3: Validate Page 1 Structure

Page 1 should have both left and main columns.

In [7]:
page1 = pages[0]

print("Page 1 Validation:")
print("=" * 60)

# Check page number
assert page1["page_number"] == 1
print("✓ Page number is 1")

# Check professional profile
assert page1["regions"]["top"]["show_professional_profile"] is True
print("✓ Professional profile shown (page 1)")

# Check has both columns
assert page1["regions"]["left_column"] is not None
assert page1["regions"]["main_column"] is not None
print("✓ Has both left and main columns")

# Check sections
left_sections = page1["regions"]["left_column"]["sections"]
main_sections = page1["regions"]["main_column"]["sections"]

print(f"\nLeft column ({len(left_sections)} sections):")
for section in left_sections:
    print(f"  - {section['name']} ({section['type']})")

print(f"\nMain column ({len(main_sections)} sections):")
for section in main_sections:
    print(f"  - {section['name']} ({section['type']})")
    if section.get('subsections'):
        for subsection in section['subsections']:
            company = subsection.get('metadata', {}).get('company', 'N/A')
            print(f"    → {company}")

Page 1 Validation:
✓ Page number is 1
✓ Professional profile shown (page 1)
✓ Has both left and main columns

Left column (1 sections):
  - Core Skills (skill_list_caps)

Main column (1 sections):
  - Experience (work_history)
    → Test Company


## Step 4: Validate Page 2 Structure

Page 2 should have only main column (continuation page, no `\switchcolumn`).
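
The continuation rule can be illustrated with a small sketch. It is not the converter's actual logic; `split_columns` is a hypothetical helper, and the sketch assumes the page chunk contains at most one `\switchcolumn` and has already been stripped of the `paracol` wrapper lines.

```python
from typing import Optional


def split_columns(page_body: str) -> dict[str, Optional[str]]:
    """Split one page chunk into left and main column text."""
    marker = "\\switchcolumn"
    if marker in page_body:
        # Text before the marker is the left column, text after it the main column.
        left, main = page_body.split(marker, 1)
        return {"left_column": left.strip(), "main_column": main.strip()}
    # No marker: the whole chunk continues the main column from the previous page.
    return {"left_column": None, "main_column": page_body.strip()}
```

Returning `None` for the missing column mirrors the `is None` checks in the validation cell that follows.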

In [8]:
page2 = pages[1]

print("Page 2 Validation:")
print("=" * 60)

# Check page number
assert page2["page_number"] == 2
print("✓ Page number is 2")

# Check professional profile hidden
assert page2["regions"]["top"]["show_professional_profile"] is False
print("✓ Professional profile hidden (page 2)")

# Check has only main column (no left column)
assert page2["regions"]["left_column"] is None
print("✓ No left column (continuation page)")

assert page2["regions"]["main_column"] is not None
print("✓ Has main column")

# Check sections
main_sections = page2["regions"]["main_column"]["sections"]

print(f"\nMain column ({len(main_sections)} sections):")
for section in main_sections:
    print(f"  - {section['name']} ({section['type']})")
    if section.get('subsections'):
        for subsection in section['subsections']:
            company = subsection.get('metadata', {}).get('company', 'N/A')
            print(f"    → {company}")

Page 2 Validation:
✓ Page number is 2
✓ Professional profile hidden (page 2)
✓ No left column (continuation page)
✓ Has main column

Main column (1 sections):
  - More Experience (work_history)
    → Another Company


## Step 5: Validate Content Preservation

Check that all content survives page splitting.
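
One coarse way to check preservation for the whole document is to compare the number of bullet commands in the source with the number of bullets in the parsed pages. The sketch below is an illustration under assumptions: both helper names are hypothetical, every bullet in the source is assumed to be written as `\itemi` (as in this fixture), and the page dictionaries are assumed to follow the shape used in this notebook.

```python
import re


def count_source_bullets(latex: str) -> int:
    """Count \\itemi commands, ignoring longer names such as \\itemii."""
    return len(re.findall(r"\\itemi(?![A-Za-z])", latex))


def count_parsed_bullets(pages: list) -> int:
    """Sum the bullets of every subsection in every column of every page."""
    total = 0
    for page in pages:
        for column in ("left_column", "main_column"):
            region = page["regions"].get(column)
            if not region:
                continue  # column absent on this page
            for section in region["sections"]:
                for subsection in section.get("subsections") or []:
                    content = subsection.get("content") or {}
                    total += len(content.get("bullets") or [])
    return total
```

With the objects from this notebook, the check would read `assert count_source_bullets(latex_str) == count_parsed_bullets(pages)`.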

In [9]:
print("Content Preservation Validation:")
print("=" * 60)

# Page 1 content
page1_exp = page1["regions"]["main_column"]["sections"][0]
page1_company = page1_exp["subsections"][0]["metadata"]["company"]
page1_bullets = len(page1_exp["subsections"][0]["content"]["bullets"])

print("\nPage 1 Experience:")
print(f"  Company: {page1_company}")
print(f"  Bullets: {page1_bullets}")

assert page1_company == "Test Company"
assert page1_bullets == 2
print("  ✓ Page 1 content preserved")

# Page 2 content
page2_exp = page2["regions"]["main_column"]["sections"][0]
page2_company = page2_exp["subsections"][0]["metadata"]["company"]
page2_title = page2_exp["subsections"][0]["metadata"]["title"]
page2_bullets = len(page2_exp["subsections"][0]["content"]["bullets"])

print("\nPage 2 Experience:")
print(f"  Company: {page2_company}")
print(f"  Title: {page2_title}")
print(f"  Bullets: {page2_bullets}")

assert page2_company == "Another Company"
assert page2_title == "Senior Engineer"
assert page2_bullets == 2
print("  ✓ Page 2 content preserved")

Content Preservation Validation:

Page 1 Experience:
  Company: Test Company
  Bullets: 2
  ✓ Page 1 content preserved

Page 2 Experience:
  Company: Another Company
  Title: Senior Engineer
  Bullets: 2
  ✓ Page 2 content preserved


## Step 6: Compare with Expected YAML

Load the expected YAML and verify that the parsed page count, page numbers, and per-column section counts match.
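
A stricter comparison than counting sections is a recursive diff that reports every path where the two structures disagree. The sketch below is a hypothetical helper (`find_mismatches` is not part of ARCHER) and assumes both inputs are plain dicts, lists, and scalars, which is what `OmegaConf.to_container` returns.

```python
def find_mismatches(parsed, expected, path="pages"):
    """Return readable descriptions of every point where two structures differ."""
    if isinstance(parsed, dict) and isinstance(expected, dict):
        mismatches = []
        for key in sorted(set(parsed) | set(expected), key=str):
            if key not in parsed:
                mismatches.append(f"{path}.{key}: missing from parsed")
            elif key not in expected:
                mismatches.append(f"{path}.{key}: missing from expected")
            else:
                mismatches.extend(
                    find_mismatches(parsed[key], expected[key], f"{path}.{key}")
                )
        return mismatches
    if isinstance(parsed, list) and isinstance(expected, list):
        if len(parsed) != len(expected):
            return [f"{path}: length {len(parsed)} != {len(expected)}"]
        mismatches = []
        for index, (p_item, e_item) in enumerate(zip(parsed, expected)):
            mismatches.extend(find_mismatches(p_item, e_item, f"{path}[{index}]"))
        return mismatches
    # Scalars, or values of different types: compare directly.
    return [] if parsed == expected else [f"{path}: {parsed!r} != {expected!r}"]
```

An empty list means the structures are identical; otherwise each entry names the offending path, which is easier to debug than a bare `assert parsed == expected`.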

In [10]:
yaml_path = STRUCTURED_PATH / "two_page_test.yaml"
yaml_data = OmegaConf.load(yaml_path)
expected_pages = OmegaConf.to_container(yaml_data["document"]["pages"], resolve=True)

print("YAML Comparison:")
print("=" * 60)

# Check page count
assert len(pages) == len(expected_pages)
print(f"✓ Page count matches: {len(pages)} pages")

# Check page numbers
for parsed, expected in zip(pages, expected_pages):
    assert parsed["page_number"] == expected["page_number"]
print("✓ Page numbers match")

# Check section counts
p1_left_count = len(pages[0]["regions"]["left_column"]["sections"])
e1_left_count = len(expected_pages[0]["regions"]["left_column"]["sections"])
assert p1_left_count == e1_left_count
print(f"✓ Page 1 left column: {p1_left_count} sections")

p1_main_count = len(pages[0]["regions"]["main_column"]["sections"])
e1_main_count = len(expected_pages[0]["regions"]["main_column"]["sections"])
assert p1_main_count == e1_main_count
print(f"✓ Page 1 main column: {p1_main_count} sections")

p2_main_count = len(pages[1]["regions"]["main_column"]["sections"])
e2_main_count = len(expected_pages[1]["regions"]["main_column"]["sections"])
assert p2_main_count == e2_main_count
print(f"✓ Page 2 main column: {p2_main_count} sections")

YAML Comparison:
✓ Page count matches: 2 pages
✓ Page numbers match
✓ Page 1 left column: 1 sections
✓ Page 1 main column: 1 sections
✓ Page 2 main column: 1 sections


## Step 7: Test Edge Case - No Cross-Page Contamination

Verify sections don't appear on wrong pages.
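
The same check can be written for any number of pages by recording which pages each section name appears on. This is a sketch with hypothetical helper names, assuming the page dictionaries have the shape shown earlier in this notebook; it also assumes section names are unique per document, which holds for this fixture but may not hold for every resume.

```python
from collections import defaultdict


def sections_by_page(pages):
    """Map each page number to the section names found in its columns."""
    result = {}
    for page in pages:
        names = []
        for column in ("left_column", "main_column"):
            region = page["regions"].get(column)
            if region:
                names.extend(section["name"] for section in region["sections"])
        result[page["page_number"]] = names
    return result


def duplicated_sections(pages):
    """Return {section name: [page numbers]} for names seen on several pages."""
    seen = defaultdict(set)
    for page_number, names in sections_by_page(pages).items():
        for name in names:
            seen[name].add(page_number)
    return {name: sorted(nums) for name, nums in seen.items() if len(nums) > 1}
```

For this fixture, `duplicated_sections(pages)` should return an empty dict if no section leaks across the page break.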

In [11]:
print("Cross-Page Contamination Check:")
print("=" * 60)

# Get all section names from each page
page1_sections = []
if pages[0]["regions"]["left_column"]:
    page1_sections.extend([s["name"] for s in pages[0]["regions"]["left_column"]["sections"]])
if pages[0]["regions"]["main_column"]:
    page1_sections.extend([s["name"] for s in pages[0]["regions"]["main_column"]["sections"]])

page2_sections = []
if pages[1]["regions"]["main_column"]:
    page2_sections.extend([s["name"] for s in pages[1]["regions"]["main_column"]["sections"]])

print(f"Page 1 sections: {page1_sections}")
print(f"Page 2 sections: {page2_sections}")

# Verify "More Experience" is only on page 2
assert "More Experience" in page2_sections
assert "More Experience" not in page1_sections
print("\n✓ 'More Experience' only on page 2")

# Verify "Core Skills" is only on page 1
assert "Core Skills" in page1_sections
assert "Core Skills" not in page2_sections
print("✓ 'Core Skills' only on page 1")

# Verify "Experience" is only on page 1
assert "Experience" in page1_sections
assert "Experience" not in page2_sections
print("✓ 'Experience' only on page 1")

Cross-Page Contamination Check:
Page 1 sections: ['Core Skills', 'Experience']
Page 2 sections: ['More Experience']

✓ 'More Experience' only on page 2
✓ 'Core Skills' only on page 1
✓ 'Experience' only on page 1


## Summary

### What We Validated

1. ✅ **Multi-page splitting** - Document correctly split on `\clearpage`
2. ✅ **Page structure** - Page 1 has both columns, Page 2 has main only
3. ✅ **Professional profile** - Shown on page 1, hidden on page 2
4. ✅ **Content preservation** - Section names, company names, and bullet counts preserved across the split
5. ✅ **No cross-contamination** - Sections appear on correct pages only
6. ✅ **YAML match** - Page count, page numbers, and per-column section counts match the expected YAML

### Key Capabilities Demonstrated

- **Paracol spanning pages** - Single `\begin{paracol}...\end{paracol}` across multiple pages
- **Continuation pages** - Pages without `\switchcolumn` handled correctly
- **Page metadata** - Page numbers and professional profile flags correct
- **Hierarchical parsing** - Section parsing composes correctly with page parsing

### Next Steps

With two-page support complete, ARCHER can now parse real 2-page resumes from the archive. The next milestones are:
- Complete document parsing (metadata + pages)
- Full Dalcom resume round-trip test
- Document generation (assembling complete documents)