# Resume Data Structure Demo

This notebook demonstrates the `ResumeDocument`, `ResumeDocumentArchive` and `ResumeSection` classes from `archer.contexts.templating.resume_data_structure`.

These classes provide a **simplified interface designed for the targeting context** to work with resume data by stripping away LaTeX formatting and layout metadata.

In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv
from omegaconf import OmegaConf
from pprint import pprint
from collections import Counter

from archer.contexts.templating.resume_data_structure import (
    ResumeDocument,
    ResumeDocumentArchive,
)

load_dotenv()
# os.environ[...] raises a clear KeyError if the variable is unset
RESUME_ARCHIVE_PATH = Path(os.environ["RESUME_ARCHIVE_PATH"])

# Pick a sample resume
sample_yaml = RESUME_ARCHIVE_PATH / "structured" / "Res_ACMECorp.yaml"

## Part 1: Without ResumeDocument (OmegaConf)

Let's see what working directly with the YAML looks like:

In [None]:
# Load with OmegaConf
yaml_data = OmegaConf.load(sample_yaml)
yaml_dict = OmegaConf.to_container(yaml_data, resolve=True)

print("Top-level keys:")
print(list(yaml_dict.keys()))
print("\nDocument keys:")
print(list(yaml_dict["document"].keys()))

### Getting Work Experience Directly from YAML (Without ResumeDocument)

Notice how deeply nested and complex the structure is:

In [None]:
# Navigate the complex nested structure
doc = yaml_dict["document"]

# Find work experience sections
work_experiences = []
for page in doc.get("pages", []):
    for region_name, region_data in page.get("regions", {}).items():
        if region_data and "sections" in region_data:
            for section in region_data["sections"]:
                # Check for nested subsections (work_history contains work_experience)
                if "subsections" in section:
                    for subsection in section["subsections"]:
                        if subsection.get("type") == "work_experience":
                            work_experiences.append(subsection)

print(f"Found {len(work_experiences)} work experiences\n")

# Get first work experience
first_work = work_experiences[0]
print("Company:", first_work["metadata"]["company"])
print("Title:", first_work["metadata"]["title"])
print("Dates:", first_work["metadata"]["dates"])
print("\nProjects:")
for proj in first_work["content"]["projects"][:2]:
    # Note: project names still have LaTeX formatting
    print(f"  - {proj['metadata']['name']}")

### Getting Skills (Without ResumeDocument)

This also requires manual navigation, and the LaTeX still has to be cleaned by hand:

In [None]:
# Find skill sections
skill_sections = []
for page in doc.get("pages", []):
    for region_name, region_data in page.get("regions", {}).items():
        if region_data and "sections" in region_data:
            for section in region_data["sections"]:
                if section.get("type") in ("skill_list_caps", "skill_list_pipes", "skill_categories"):
                    skill_sections.append(section)

print(f"Found {len(skill_sections)} skill sections\n")

# Get first skill section
first_skills = skill_sections[0]
print(f"Section: {first_skills['metadata']['name']}")
print("Skills (raw - with LaTeX):")
for skill in first_skills["content"]["items"][:5]:
    print(f"  - {skill}")

# Would need to manually clean LaTeX formatting...

### The Problems:

1. **Complex navigation** - Multiple nested loops to find sections
2. **Mixed with formatting** - Colors, spacing, LaTeX commands mixed with content
3. **No text extraction** - Have to manually build plaintext
4. **Lots of boilerplate** - Same navigation code repeated everywhere
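
To make the boilerplate problem concrete, here is a minimal sketch of the traversal that every raw-YAML lookup above ends up repeating, written once as a generator. The `iter_sections` helper and the toy data are hypothetical. They only mirror the key names seen in the cells above (`pages`, `regions`, `sections`, `subsections`, `type`) and are not part of `archer`.

```python
from typing import Iterator


def iter_sections(document: dict) -> Iterator[dict]:
    """Yield every section, followed by its nested subsections."""
    for page in document.get("pages", []):
        for region_data in (page.get("regions") or {}).values():
            if not region_data:
                continue  # regions may be empty or None
            for section in region_data.get("sections", []):
                yield section
                yield from section.get("subsections", [])


# Toy data that mirrors the key names used above; the values are made up.
toy_document = {
    "pages": [
        {
            "regions": {
                "left": {"sections": [{"type": "skill_list_caps"}]},
                "main": {
                    "sections": [
                        {
                            "type": "work_history",
                            "subsections": [
                                {"type": "work_experience"},
                                {"type": "work_experience"},
                            ],
                        }
                    ]
                },
                "footer": None,
            }
        }
    ]
}

section_types = [s.get("type") for s in iter_sections(toy_document)]
print(section_types)
```

Even with a helper like this, each section would still carry its formatting metadata and raw LaTeX.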

---

## Part 2: ResumeDocument

Now let's see the simplified interface:

In [None]:
# Load with ResumeDocument
doc = ResumeDocument(sample_yaml)

print(f"Name: {doc.name}")
print(f"Title: {doc.professional_title}")
print(f"Filename: {doc.filename}")
print(f"Date: {doc.date}")
print(f"Total sections: {len(doc.sections)}")

### Table of Contents

Get a formatted overview of all sections with one property:

In [None]:
print(doc.table_of_contents)

### Getting Work Experience (using ResumeDocument)

Work experiences are now grouped in `work_history` sections:

In [None]:
# Find work_history sections
work_history_sections = [s for s in doc.sections if s.section_type == "work_history"]

print(f"Found {len(work_history_sections)} work history sections\n")

# Get first work_history section
first_history = work_history_sections[0]
print(f"Section: {first_history.name}")
print(f"Number of work experiences: {len(first_history.data['subsections'])}\n")

# Access first work experience within the section
first_work = first_history.data['subsections'][0]

# Structured data access
print(f"Company: {first_work['company']}")
print(f"Title: {first_work['title']}")
print(f"Dates: {first_work['dates']}")
print(f"Location: {first_work['location']}")

# Show top-level bullets
print(f"\nTop-level bullets ({len(first_work['items'])}):\n")
for i, bullet in enumerate(first_work["items"][:3], 1):
    print(f"{i}. {bullet[:100]}...")
if len(first_work["items"]) > 3:
    print(f"   ... and {len(first_work['items']) - 3} more")

# Show projects
print(f"\nProjects ({len(first_work['projects'])}):\n")

for proj in first_work['projects']:
    print(f"📌 {proj['name']}")
    print(f"   Bullets: {len(proj['items'])}")
    if proj['items']:
        print(f"   First bullet: {proj['items'][0][:80]}...")
    print()

### Getting Skills (using ResumeDocument)

Clean, LaTeX-free text:

In [None]:
# Find skill sections
skill_sections = [
    s for s in doc.sections 
    if s.section_type in ("skill_list_caps", "skill_list_pipes", "skill_categories")
]

print(f"Found {len(skill_sections)} skill sections\n")

for section in skill_sections:
    print(f"📋 {section.name} ({section.section_type}):")
    
    # Handle flat skill lists (unified structure with "items")
    if section.section_type in ("skill_list_caps", "skill_list_pipes"):
        for skill in section.data['items'][:5]:
            print(f"   • {skill}")
        if len(section.data['items']) > 5:
            print(f"   ... and {len(section.data['items']) - 5} more")
    
    # Handle categorized skills (container with subsections)
    elif section.section_type == "skill_categories":
        for category in section.data['subsections'][:3]:  # Show first 3 categories
            print(f"   📁 {category['name']}:")
            for skill in category['items'][:3]:  # Show first 3 skills per category
                print(f"      • {skill}")
            if len(category['items']) > 3:
                print(f"      ... and {len(category['items']) - 3} more")
        if len(section.data['subsections']) > 3:
            print(f"   ... and {len(section.data['subsections']) - 3} more categories")
    
    print()

In [None]:
# Compare: Manual iteration (preserves structure) vs get_items() (flattens)

ml_tools = doc.get_section("ML Infrastructure Tools")

print("Manual iteration (structured):")
for category in ml_tools.data['subsections'][:2]:
    print(f"  📁 {category['name']}: {len(category['items'])} items")

print("\nget_items() (flattened):")
all_tools = ml_tools.get_items()
print(f"  Total: {len(all_tools)} items (no category info)")
print(f"  {all_tools}")

In [None]:
# get_items() works on all section types

# Simple list (skill_list_caps)
core_skills = doc.get_section("Core Skills")
print(f"Core Skills: {len(core_skills.get_items())} items")

# Wrapper (skill_categories) - flattens all categories
ml_tools = doc.get_section("ML Infrastructure Tools")
print(f"ML Tools: {len(ml_tools.get_items())} items (across {len(ml_tools.data['subsections'])} categories)")

# Work history - flattens work bullets AND project bullets
experience = doc.get_section("Experience")
print(f"Experience: {len(experience.get_items())} bullets (work + projects combined)")

# Projects wrapper - flattens all projects
projects = doc.get_section("Other Projects")
print(f"Other Projects: {len(projects.get_items())} bullets (across {len(projects.data['subsections'])} projects)")
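
The flattening behaviour shown above can be sketched in a few lines. This is an assumption about how such a method might work, not the actual `get_items()` implementation. The `flatten_items` helper is hypothetical and only relies on the `items`, `subsections` and `projects` keys that appear in this notebook.

```python
def flatten_items(data: dict) -> list:
    """Collect every bullet or skill string from a section-like dict."""
    items = list(data.get("items", []))
    # Wrapper sections and work_history keep their content in subsections.
    for subsection in data.get("subsections", []):
        items.extend(flatten_items(subsection))
    # Work experiences may also carry projects with their own items.
    for project in data.get("projects", []):
        items.extend(flatten_items(project))
    return items


# Made-up data in the three shapes used throughout this notebook.
direct = {"items": ["Python", "CUDA"]}
wrapper = {
    "subsections": [
        {"name": "Training", "items": ["PyTorch"]},
        {"name": "Serving", "items": ["Triton", "ONNX"]},
    ]
}
work_history = {
    "subsections": [
        {
            "type": "work_experience",
            "company": "Example Co",
            "items": ["Led a team"],
            "projects": [{"name": "Demo", "items": ["Built a thing"]}],
        }
    ]
}

print(flatten_items(direct))
print(flatten_items(wrapper))
print(flatten_items(work_history))
```

The trade-off is the one noted above: the flat list is convenient for searching, but the category and project names are lost.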

In [None]:
# Use case: Search for keyword across all work bullets
experience = doc.get_section("Experience")
all_bullets = experience.get_items()

keyword = "GPU"
matches = [b for b in all_bullets if keyword.lower() in b.lower()]

print(f"Found {len(matches)} bullets mentioning '{keyword}':\n")
for bullet in matches:
    print(f"  • {bullet[:80]}...")

### get_items_by_section() - Multiple Sections at Once

For convenience, `ResumeDocument` has a method to get items from multiple sections.

In [None]:
# Get items from multiple skill sections
items_by_section = doc.get_items_by_section([
    "Core Skills",
    "Languages",
    "Hardware"
])

print("Items by section:\n")
for section_name, items in items_by_section.items():
    print(f"{section_name}: {len(items)} items")
    print(f"  {items}\n")

In [None]:
# Use case: Collect all skills from different sections
all_skills = []
skill_sections = ["Core Skills", "Languages", "Hardware"]
items = doc.get_items_by_section(skill_sections)

for section_items in items.values():
    all_skills.extend(section_items)

print(f"Collected {len(all_skills)} total skills from {len(skill_sections)} sections")
print(f"\nFirst 10: {all_skills[:10]}")

### Markdown Text Property

Each section has a `.text` property that formats structured data into **markdown** (not plaintext!).

This suits LLM consumption well, since markdown is common in language-model training data.

**Key feature:** It's lazy-evaluated and cached!
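
As a rough illustration of the lazy-and-cached pattern, the sketch below builds the markdown only on first access and reuses it afterwards. `LazyTextSection` is a hypothetical stand-in, not the real `ResumeSection`. Only the `_text_cache` attribute name is taken from this notebook, and the `format_calls` counter exists purely to make the caching visible.

```python
class LazyTextSection:
    """Hypothetical stand-in for a section with a cached text property."""

    def __init__(self, name, items):
        self.name = name
        self.items = items
        self._text_cache = None  # filled in on first access to .text
        self.format_calls = 0    # instrumentation for this demo only

    @property
    def text(self):
        if self._text_cache is None:
            self.format_calls += 1
            lines = [f"## {self.name}", ""]
            lines += [f"- {item}" for item in self.items]
            self._text_cache = "\n".join(lines)
        return self._text_cache


section = LazyTextSection("Core Skills", ["Python", "CUDA"])
cached_before = section._text_cache is not None
first = section.text
second = section.text  # served from the cache, no re-formatting

print(f"Cached before access: {cached_before}")
print(f"Formatting ran {section.format_calls} time(s)")
print(first)
```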

In [None]:
# Get a work_history section
work_section = work_history_sections[0]

# Check if text is cached
print(f"Text cached initially: {work_section._text_cache is not None}")

# Access text property (triggers formatting)
text = work_section.text

print(f"Text cached after access: {work_section._text_cache is not None}")
print(f"\nMarkdown preview (first 700 chars):\n")
print(text[:700])
print("...")

# Show markdown structure
print("\n" + "="*60)
print("Notice the markdown formatting:")
print("  - ## for section headers (e.g., 'Experience')")
print("  - ### for company names")
print("  - **bold** for job titles")
print("  - *italic* for dates")
print("  - #### for project headers")
print("  - - for bullet points")
print()
print("NOTE: Some bullets may show malformed LaTeX like 'textbf'")
print("      This is a known bug documented in TODO.md")
print("="*60)

### Get All Text

Combine all sections into one searchable markdown document.

**Page breaks:** The `---` markdown separator appears between pages!

In [None]:
all_text = doc.get_all_text()

print(f"Total text length: {len(all_text)} characters")
print(f"Total words (approx): {len(all_text.split())}")

# Check page numbers
page_numbers = sorted(set(s.page_number for s in doc.sections))
print(f"\nPages in document: {page_numbers}")

# Count actual page breaks (not dates that use ---)
page_break_count = all_text.count('\n\n---\n\n')
print(f"Page break markers: {page_break_count}")

print(f"\nSearching for 'Machine Learning'...")
# Simple text search
count = all_text.lower().count('machine learning')
print(f"Found {count} occurrences")
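
The reason the cell above counts `'\n\n---\n\n'` rather than a bare `'---'` can be shown with a toy example. The joining logic here is an assumption about how per-page text might be combined, not the actual `get_all_text()` implementation, and the section texts are made up.

```python
from itertools import groupby

PAGE_BREAK = "\n\n---\n\n"

# (page_number, section_text) pairs, already ordered by page.
sections = [
    (1, "## Core Skills\n\n- Python"),
    (1, "## Experience\n\n### Example Co\n*2020 --- 2023*"),
    (2, "## Education\n\n### Example University"),
]

# Join the sections on each page, then join the pages with the separator.
pages = [
    "\n\n".join(text for _, text in group)
    for _, group in groupby(sections, key=lambda pair: pair[0])
]
all_text_demo = PAGE_BREAK.join(pages)

print(all_text_demo.count("---"))       # also counts the date range
print(all_text_demo.count(PAGE_BREAK))  # counts only the page break
```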

In [None]:
# Show sections by page
print("Sections by page:\n")
for page in sorted(set(s.page_number for s in doc.sections)):
    page_sections = [s for s in doc.sections if s.page_number == page]
    print(f"Page {page}:")
    for section in page_sections:
        print(f"  - {section.name} ({section.section_type})")
    print()

In [None]:
print(all_text)

---

## Part 3: ResumeDocumentArchive

Load multiple resumes at once with error handling:

In [None]:
import warnings

# Initialize archive
archive = ResumeDocumentArchive(RESUME_ARCHIVE_PATH)

# Try loading every structured YAML by hand, collecting failures instead of stopping
yaml_files = sorted(archive.structured_path.glob("*.yaml"))

loaded_count = 0
errors = []
for yaml_file in yaml_files:
    try:
        # The result is discarded; this loop only checks that each file loads
        ResumeDocument(yaml_file, mode="plaintext")
        loaded_count += 1
    except Exception as e:
        errors.append((yaml_file.name, str(e)))

print(f"Loaded {loaded_count} of {len(yaml_files)} YAML files")
if errors:
    error_summary = "\n".join(f"  - {name}: {error}" for name, error in errors)
    warnings.warn(
        f"Failed to load {len(errors)} YAML file(s):\n{error_summary}",
        UserWarning
    )


In [None]:
# Load all pre-converted YAMLs
documents = archive.load(mode="available")

print(f"Loaded {len(documents)} documents")
print(f"\nFirst 10 filenames:")
for doc in documents[:10]:
    print(f"  • {doc.filename}")

### Analyzing the Archive

Now we can easily analyze all resumes:

In [None]:
section_type_counts = Counter()

for doc in documents:
    for section in doc.sections:
        section_type_counts[section.section_type] += 1

print("Section types across all resumes:")
for stype, count in section_type_counts.most_common():
    print(f"  {stype}: {count}")

In [None]:
# Count total work experiences across all work_history sections per resume
work_exp_counts = []
for doc in documents:
    count = sum(
        len(s.data.get('subsections', [])) 
        for s in doc.sections 
        if s.section_type == "work_history"
    )
    work_exp_counts.append(count)

print(f"Average work experiences per resume: {sum(work_exp_counts) / len(work_exp_counts):.1f}")
print(f"Min: {min(work_exp_counts)}, Max: {max(work_exp_counts)}")

### Searching Across All Resumes

Find resumes mentioning specific technologies:

In [None]:
def find_resumes_with_keyword(documents, keyword):
    """Find resumes containing a keyword."""
    matches = []
    for doc in documents:
        text = doc.get_all_text().lower()
        if keyword.lower() in text:
            matches.append(doc)
    return matches

# Search for different keywords
keywords = ["pytorch", "kubernetes", "llm", "hpc", "machine learning", "tensorflow", "docker"]

print("Keyword search results:\n")
for keyword in keywords:
    matches = find_resumes_with_keyword(documents, keyword)
    print(f"'{keyword}': {len(matches)} resumes ({len(matches)/len(documents)*100:.1f}%)")

---

## Part 4: ResumeSection

A peek under the hood:

In [None]:
doc = ResumeDocument(sample_yaml)

# ResumeSection is the basic building block
section = doc.sections[0]  # Core Skills

print("ResumeSection attributes:")
print(f"  name: {section.name}")
print(f"  section_type: {section.section_type}")
print(f"  page_number: {section.page_number}")
print(f"  region: {section.region}")
print(f"  display_name: {section.display_name}")
print()

# The data dict contains the actual content
print(f"Data structure for {section.section_type}:")
print(f"  Keys: {list(section.data.keys())}")
print(f"  Items: {len(section.data['items'])} skills")
print(f"  First 3: {section.data['items'][:3]}")
print()

# Compare with a wrapper section
wrapper = doc.sections[3]  # ML Infrastructure Tools
print(f"Data structure for {wrapper.section_type}:")
print(f"  Keys: {list(wrapper.data.keys())}")
print(f"  Subsections: {len(wrapper.data['subsections'])}")
print(f"  First subsection: {wrapper.data['subsections'][0]['name']}")
print(f"    Type: {wrapper.data['subsections'][0]['type']}")
print(f"    Items: {len(wrapper.data['subsections'][0]['items'])}")

In [None]:
# Section types follow consistent patterns:

print("Pattern 1: Direct content sections (items at top level)")
print("  Types: skill_list_caps, skill_list_pipes, personality_alias_array")
print("  Structure: {'items': [...], 'name': '...'}")
print()

direct_section = doc.get_section("Core Skills")
print(f"  Example: {direct_section.name}")
print(f"    data.keys(): {list(direct_section.data.keys())}")
print(f"    items[0]: {direct_section.data['items'][0]}")
print()

print("Pattern 2: Wrapper sections (subsections with items)")
print("  Types: skill_categories, projects")
print("  Structure: {'subsections': [{'type': '...', 'name': '...', 'items': [...]}, ...]}")
print()

wrapper_section = doc.get_section("ML Infrastructure Tools")
print(f"  Example: {wrapper_section.name}")
print(f"    data.keys(): {list(wrapper_section.data.keys())}")
print(f"    subsections[0]: {wrapper_section.data['subsections'][0]['name']}")
print(f"      type: {wrapper_section.data['subsections'][0]['type']}")
print(f"      items: {wrapper_section.data['subsections'][0]['items'][:2]}")
print()

print("Pattern 3: Work history (flattened work_experience)")
print("  Type: work_history")
print("  Structure: {'subsections': [{'type': 'work_experience', 'company': '...', 'title': '...', 'items': [...], 'projects': [...]}]}")
print()

work_section = doc.get_section("Experience")
print(f"  Example: {work_section.name}")
print(f"    data.keys(): {list(work_section.data.keys())}")
print(f"    subsections[0]['company']: {work_section.data['subsections'][0]['company']}")
print(f"    subsections[0].keys(): {list(work_section.data['subsections'][0].keys())}")
print(f"    Has 'items': {('items' in work_section.data['subsections'][0])}")
print(f"    Has 'projects': {('projects' in work_section.data['subsections'][0])}")
if work_section.data['subsections'][0]['projects']:
    print(f"    First project: {work_section.data['subsections'][0]['projects'][0]['name']}")
    print(f"      project type: {work_section.data['subsections'][0]['projects'][0]['type']}")

---

## Part 5: Key Advantages Summary

### OmegaConf
❌ Complex nested navigation  
❌ Mixed with formatting metadata  
❌ LaTeX commands in text  
❌ Boilerplate code for every operation  
❌ Hard to search/analyze  

### ResumeDocument
✅ Flat, direct section access  
✅ Content only (no formatting)  
✅ **Markdown output** (great for LLM comprehension)  
✅ LaTeX automatically converted to markdown  
✅ **Page breaks preserved** (`---` separators between pages)  
✅ **Professional profile included** at the top  
✅ **Table of contents property** for quick overview  
✅ **Region tracking** (left column vs main column)  
✅ **Hierarchical structure preserved** (work_history contains multiple work experiences)  
✅ Simple, reusable API  
✅ Easy search and analysis  
✅ Batch operations with error handling  

### Markdown Formatting

The `.text` property returns markdown with:
- `#` header for professional profile (name | brand)
- `##` headers for section headings (e.g., "Experience", "Education")
- `###` headers for companies/institutions
- `####` headers for projects
- **Bold** for emphasis (from `\textbf{}`, `\coloremph{}`)
- *Italic*, usually for dates (from `\textit{}`)
- `-` for bullet points
- `` `code` `` for `\texttt{}` (usually software names)
- `---` page breaks between pages

This hierarchy prevents duplicate section headers and groups related work experiences together!

This is *on-distribution* for LLMs, making it perfect for:
- Prompting language models
- Semantic search with embeddings
- Content relevance scoring
- Automated analysis
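
The LaTeX-to-markdown mapping listed under Markdown Formatting can be approximated with a few regular expressions. This is a minimal sketch that handles only simple, non-nested commands. The `latex_to_markdown` helper is hypothetical and is not how `archer` performs the conversion.

```python
import re

# (LaTeX command, markdown wrapper) pairs taken from the mapping in this section.
_REPLACEMENTS = [
    ("textbf", "**"),
    ("coloremph", "**"),
    ("textit", "*"),
    ("texttt", "`"),
]


def latex_to_markdown(text: str) -> str:
    """Convert a few simple, non-nested LaTeX commands to markdown."""
    for command, wrapper in _REPLACEMENTS:
        # Match \command{...} where the braces contain no nested braces.
        pattern = r"\\" + command + r"\{([^{}]*)\}"
        text = re.sub(pattern, lambda m, w=wrapper: f"{w}{m.group(1)}{w}", text)
    return text


sample = r"Built \textbf{GPU} pipelines with \texttt{PyTorch} (\textit{2021})"
print(latex_to_markdown(sample))
```

Nested commands such as `\textbf{\textit{...}}` are only partly converted by this sketch and would need a real parser.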

### Perfect for Targeting Context ✅

The Targeting context can now:
- Get quick resume overview with `doc.table_of_contents`
- Search historical resumes for relevant content
- Score sections by relevance to job description
- Find similar past applications
- Extract bullets matching specific criteria
- Pass markdown directly to LLMs for decision-making
- Preserve page structure for layout decisions
- Track which sections are in left vs main column
- Access grouped work experiences within work_history sections
- All without worrying about LaTeX formatting!
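
As one possible starting point for scoring sections against a job description, the sketch below ranks section texts by simple token overlap. Everything here is an illustrative assumption: `tokenize` and `overlap_score` are hypothetical helpers, the texts are made up, and a real Targeting context would likely use something stronger, such as embeddings or an LLM.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(section_text: str, job_description: str) -> float:
    """Fraction of job-description tokens that also appear in the section."""
    job_tokens = tokenize(job_description)
    if not job_tokens:
        return 0.0
    return len(job_tokens & tokenize(section_text)) / len(job_tokens)


# Made-up section texts keyed by section name.
section_texts = {
    "Core Skills": "- Python\n- CUDA\n- PyTorch",
    "Education": "### Example University\n*2015*",
}
job_description = "PyTorch and CUDA experience"

# Rank section names from most to least relevant.
ranked = sorted(
    section_texts,
    key=lambda name: overlap_score(section_texts[name], job_description),
    reverse=True,
)
print(ranked)
```

With real data, the dictionary could be filled from each section's `.name` and `.text`.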