# Resume Database Demo

This notebook demonstrates how to use `ResumeDocumentArchive` and `ResumeDatabase` to analyze resume content.

**Two complementary interfaces:**
- **ResumeDocumentArchive**: Rich structured access (preserves hierarchy, sections, formatting)
- **ResumeDatabase**: Fast SQL queries (flattened items table, cross-resume search)

In [1]:
import os
from pathlib import Path
from dotenv import load_dotenv

from archer.contexts.templating import (
    ResumeDocumentArchive,
    ResumeDatabase,
)

load_dotenv()
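# os.environ raises a clear KeyError if the variable is unset;
# Path(os.getenv(...)) would instead fail with a TypeError on None.
RESUME_ARCHIVE_PATH = Path(os.environ["RESUME_ARCHIVE_PATH"])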
DB_PATH = RESUME_ARCHIVE_PATH / "structured" / "database" / "resumes.db"

## Part 1: ResumeDocumentArchive (Structured Access)

Load all resumes into memory with preserved structure.

In [None]:
# Load archive in plaintext mode (good for analysis)
archive = ResumeDocumentArchive(archive_path=RESUME_ARCHIVE_PATH)
archive.load(mode="available", format_mode="plaintext")
print(f"Loaded {len(archive.documents)} resumes into a ResumeDocumentArchive object")

# Access a specific resume
sample_resume = "Res202507_MLOps_Revature"
doc = archive.documents[sample_resume]
print(doc.table_of_contents)

In [12]:
# Access section data (preserves hierarchy)
section = doc.sections[4]  # Experience (work_history)
print(f"Section: {section.name} ({section.section_type})")
print("\nFirst subsection (work_experience):")
subsection = section.data["subsections"][0]
print(f"  Company: {subsection['company']}")
print(f"  Title: {subsection['title']}")
print(f"  Items: {len(subsection['items'])} bullets")
print(f"  Projects: {len(subsection['projects'])} nested projects")

Section: Experience (work_history)

First subsection (work_experience):
  Company: Booz Allen Hamilton
  Title: Engineering and Physical Sciences Researcher
  Items: 5 bullets
  Projects: 2 nested projects
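
The nested layout shown above can be walked with ordinary dictionary access. The following is a minimal sketch on a hypothetical stand-in for `section.data`: the `subsections`, `company`, `title`, `items`, and `projects` keys come from the cell above, while the shape of each project (a dict with `name` and `items`) and the use of plain strings for bullets are assumptions made for illustration.

```python
# Hypothetical stand-in for section.data. Only the keys read in the cell
# above are taken from the notebook; the project layout is an assumption.
section_data = {
    "subsections": [
        {
            "company": "Example Corp",
            "title": "Research Engineer",
            "items": ["Built a data pipeline", "Mentored two interns"],
            "projects": [
                {"name": "Project A", "items": ["Shipped a prototype"]},
            ],
        },
    ]
}


def count_bullets(subsection):
    """Count top-level bullets plus bullets nested under projects."""
    direct = len(subsection.get("items", []))
    nested = sum(len(p.get("items", [])) for p in subsection.get("projects", []))
    return direct + nested


for sub in section_data["subsections"]:
    print(f"{sub['company']} / {sub['title']}: {count_bullets(sub)} bullets")
```

Because the hierarchy is kept, a bullet's parent project and subsection are always reachable from the loop, which is what the flattened table has to encode as extra columns.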


## Part 2: Build Database (One-Time)

Flatten all resume content into a queryable SQLite database.

> **Tip**: You can also build the database from the command line using:
> ```bash
> python scripts/build_resume_database.py
> ```
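
To make "flattening" concrete, here is a rough sketch using only the standard-library `sqlite3` module. It is not the `ResumeDatabase` implementation: the nested input is hypothetical, and the table holds just four of the column names that the queries later in this notebook read (`resume_name`, `company`, `project_name`, `item_text`).

```python
import sqlite3

# Hypothetical nested content, keyed by resume name (assumed layout).
resumes = {
    "resume_a": [
        {
            "company": "Example Corp",
            "items": ["Built X", "Ran Y"],
            "projects": [{"name": "Project A", "items": ["Shipped Z"]}],
        },
    ],
}


def flatten(resumes):
    """Yield one row per bullet, carrying its parent context along."""
    for resume_name, subsections in resumes.items():
        for sub in subsections:
            # Bullets directly under the subsection have no project.
            for text in sub["items"]:
                yield (resume_name, sub["company"], None, text)
            # Bullets under a nested project record the project name.
            for project in sub["projects"]:
                for text in project["items"]:
                    yield (resume_name, sub["company"], project["name"], text)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "resume_name TEXT, company TEXT, project_name TEXT, item_text TEXT)"
)
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", flatten(resumes))
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Each row repeats its parent's metadata, trading the hierarchy for the ability to filter every bullet with a single `WHERE` clause.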

In [15]:
# Build database from archive (deletes existing DB)
db = ResumeDatabase.from_documents(
    documents=list(archive.documents.values()),
    db_path=DB_PATH
)

# Count total items
total_items = db.query("SELECT COUNT(*) as count FROM items")[0]["count"]
print(f"Database built: {total_items} items")

Database built: 4415 items


## Part 3: Query Database (Fast, Persistent)

Load existing database and run SQL queries.

> **Note**: Skip this cell if you just ran the build cell above, since `db` is already in memory.
> Conversely, skip the build cell if the database file already exists from an earlier run.

In [None]:
# Load existing database (fast - no parsing needed)
db = ResumeDatabase(DB_PATH)
print("Database loaded from disk")

Database loaded from disk
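
The build-once, load-afterwards pattern from the note above can be expressed as a single helper. This is a sketch with plain `sqlite3` rather than `ResumeDatabase`, since how `ResumeDatabase` behaves when the file is missing is not shown here; `open_or_build` and `build` are hypothetical names.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_or_build(db_path, build):
    """Return a connection, calling build(conn) only when the file is new."""
    is_new = not db_path.exists()
    conn = sqlite3.connect(db_path)
    if is_new:
        build(conn)
        conn.commit()
    return conn


def build(conn):
    # Hypothetical one-column schema, just to have something to persist.
    conn.execute("CREATE TABLE items (item_text TEXT)")
    conn.execute("INSERT INTO items VALUES ('example bullet')")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "resumes.db"
    first = open_or_build(path, build)   # file is missing, so build runs
    first.close()
    second = open_or_build(path, build)  # file exists, so build is skipped
    count = second.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    second.close()

print(count)
```

If `build` ran a second time, the `CREATE TABLE` statement would fail on the existing table, so the single row confirms that the second call only loaded the file.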


In [17]:
# Get all skills
skills = db.get_all_skills()
print(f"Total skills: {len(skills)}")
print("\nFirst 5 skills:")
for skill in skills[:5]:
    print(f"  - {skill['item_text'][:50]}... (from {skill['resume_name']})")

Total skills: 2280

First 5 skills:
  - Python... (from Res202506_SenMathLibEng_NVIDIA)
  - Bash... (from Res202506_SenMathLibEng_NVIDIA)
  - C++... (from Res202506_SenMathLibEng_NVIDIA)
  - MATLAB... (from Res202506_SenMathLibEng_NVIDIA)
  - Wolfram/Mathematica... (from Res202506_SenMathLibEng_NVIDIA)


In [18]:
# Get all work bullets
bullets = db.get_all_bullets()
print(f"Total work bullets: {len(bullets)}")
print("\nFirst bullet:")
bullet = bullets[0]
print(f"  Resume: {bullet['resume_name']}")
print(f"  Company: {bullet['company']}")
print(f"  Title: {bullet['job_title']}")
print(f"  Type: {bullet['subsection_type']}")
print(f"  Text: {bullet['item_text'][:80]}...")

Total work bullets: 1844

First bullet:
  Resume: Res202506_SenMathLibEng_NVIDIA
  Company: None
  Title: None
  Type: project
  Text: Led development of domain-specific benchmark spanning CPUs, GPUs and ASICs...


In [19]:
# Custom SQL query: Find all ML-related bullets
ml_bullets = db.query(
    "SELECT * FROM items WHERE item_text LIKE ? AND subsection_type IN (?, ?)",
    ("%machine learning%", "work_experience", "project")
)
print(f"Found {len(ml_bullets)} ML-related bullets\n")
print("First 3:")
for bullet in ml_bullets[:3]:
    print(f"  - [{bullet['company']}] {bullet['item_text'][:60]}...")

Found 12 ML-related bullets

First 3:
  - [Florida State University] Forged lasting cross-institutional collaborations across 5 r...
  - [Florida State University] Forged lasting cross-institutional collaborations across 5 p...
  - [Florida State University] Forged lasting collaborations across multi-functional teams ...
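
The query above matches only the literal phrase "machine learning", so bullets that use other wording are not counted. One possible way to search several phrases at once is to generate the placeholders, as sketched below on a hypothetical in-memory table that reuses the `item_text` and `subsection_type` column names; `search` is an illustrative helper, not part of `ResumeDatabase`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE items (item_text TEXT, subsection_type TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        ("Applied machine learning to sensor data", "work_experience"),
        ("Trained deep learning models on GPUs", "project"),
        ("Organized team offsite", "work_experience"),
        ("Machine learning", "skill_list"),
    ],
)


def search(conn, terms, subsection_types):
    """Match any of the terms, restricted to the given subsection types.

    Assumes both lists are non-empty; only placeholders are interpolated
    into the SQL string, never the values themselves.
    """
    like = " OR ".join("item_text LIKE ?" for _ in terms)
    types = ", ".join("?" for _ in subsection_types)
    sql = (
        f"SELECT item_text FROM items "
        f"WHERE ({like}) AND subsection_type IN ({types})"
    )
    params = [f"%{term}%" for term in terms] + list(subsection_types)
    return [row["item_text"] for row in conn.execute(sql, params)]


hits = search(
    conn,
    ["machine learning", "deep learning"],
    ["work_experience", "project"],
)
print(len(hits))
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, so capitalization does not affect these matches.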


In [20]:
# Query by section type
skill_categories = db.get_items_by_section_type("skill_categories")
print(f"Items from skill_categories sections: {len(skill_categories)}")
print("\nSample categories:")
for item in skill_categories[:5]:
    print(f"  [{item['subsection_name']}] {item['item_text']}")

Items from skill_categories sections: 1335

Sample categories:
  [Languages] Python
  [Languages] Bash
  [Languages] C++
  [Languages] MATLAB
  [Languages] Wolfram/Mathematica


In [21]:
# Query bullets from a specific company
booz_bullets = db.query(
    "SELECT * FROM items WHERE company = ?",
    ("Booz Allen Hamilton",)
)
print(f"Bullets from Booz Allen: {len(booz_bullets)}")
print("\nFirst 3:")
for bullet in booz_bullets[:3]:
    proj = f" [{bullet['project_name']}]" if bullet['project_name'] else ""
    print(f"  {bullet['subsection_type']}{proj}: {bullet['item_text'][:60]}...")

Bullets from Booz Allen: 194

First 3:
  work_experience: Cut LLM training sessions from 1 month to \ a day by scaling...
  work_experience: Enabled client to train LLM to a loss that would have otherw...
  work_experience: Grew team LLM capabilities by building custom PyTorch models...


In [None]:
# Query items from a specific project
moe_items = db.query(
    "SELECT * FROM items WHERE project_name LIKE ?",
    ("%Mixture of Experts%",)
)
print(f"Items from MoE project: {len(moe_items)}")
for item in moe_items:
    print(f"  - {item['item_text'][:80]}...")
    

Items from MoE project: 46
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pushed client training to a loss that would have otherwise cost 6 compute budget...
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pushed client training to a loss that would have otherwise cost 6 compute budget...
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pushed client training to a loss that would have otherwise cost 6 compute budget...
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pushed client training to a loss that would have otherwise cost 6 compute budget...
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pushed client training to loss that would have otherwise cost 6\ compute budget...
  - Performed original scaling studies on MoE models, for which empirical scaling be...
  - Pu
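
The repeated lines in the output above suggest that the same bullet recurs across many resumes. A sketch of collapsing exact repeats with `GROUP BY` follows, again on a hypothetical in-memory table that borrows the `resume_name` and `item_text` column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (resume_name TEXT, item_text TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        ("resume_a", "Ran scaling studies"),
        ("resume_b", "Ran scaling studies"),
        ("resume_c", "Ran scaling studies"),
        ("resume_a", "Cut training cost"),
    ],
)

# One row per distinct bullet text, with the number of resumes using it.
rows = conn.execute(
    "SELECT item_text, COUNT(DISTINCT resume_name) AS n_resumes "
    "FROM items GROUP BY item_text "
    "ORDER BY n_resumes DESC, item_text"
).fetchall()

for text, n_resumes in rows:
    print(f"{n_resumes}x  {text}")
```

Grouping on `item_text` merges only byte-identical bullets; lightly reworded variants would still appear as separate rows.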

## Part 4: Comparison

When to use each interface?


### Use ResumeDocumentArchive when:

  ✓ Need to preserve section hierarchy
  
  ✓ Want markdown formatting (format_mode='markdown')
  
  ✓ Analyzing structure (table of contents, page layout)
  
  ✓ Iterating through sections in order
  
### Use ResumeDatabase when:

  ✓ Searching across all resumes (WHERE, LIKE)
  
  ✓ Filtering by metadata (company, dates, type)
  
  ✓ Finding similar bullets or skills
  
  ✓ Fast queries without loading all resumes
  
  ✓ Persistent storage (no re-parsing needed)