# Resume Pattern Analysis Demo

This notebook demonstrates the analysis capabilities of `archer.utils.resume_analyzer`:
1. **Exact substring matching** (whole document)
2. **Regex pattern matching** (whole document)
3. **Field-specific search** (e.g., only search within `\brand` field)
4. **Field value enumeration** (discover all unique values)
5. **Combined analysis** (compare different approaches)

In [1]:
# Setup
import os
from pathlib import Path
from dotenv import load_dotenv
from archer.utils.resume_analyzer import (
    analyze_keyword_frequencies,
    analyze_keywords_in_field,
    enumerate_field_values,
    count_pattern_matches
)

load_dotenv()
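# os.environ[...] raises a clear KeyError if the variable is unset,
# instead of the opaque TypeError from Path(None)
RESUME_ARCHIVE_PATH = Path(os.environ["RESUME_ARCHIVE_PATH"])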
print(f"Analyzing resumes in: {RESUME_ARCHIVE_PATH}")
print(f"Number of resume files: {len(list(RESUME_ARCHIVE_PATH.glob('*.tex')))}")

Analyzing resumes in: /home/sean/ARCHER/data/resume_archive
Number of resume files: 57


In [10]:
# Example 1: Exact substring matching (whole document)
# How many resumes contain each keyword (e.g. "Machine Learning") anywhere in the document?

keywords = {
    "ML Terms": [
        "Machine Learning", "ML", "AI", "Artificial Intelligence", "Physics",
        "LLM", "Large Language Model", "transformer", "CUDA",
    ]
}

num_resumes, total_chars, occurrences, resume_count = analyze_keyword_frequencies(
    RESUME_ARCHIVE_PATH, keywords
)

print(f"Exact substring search across entire document ({num_resumes} resumes):")
print(f"{'Keyword':<25} {'Resumes':<10} {'% Resumes':<12} {'Total Occurrences'}")
print("-" * 65)
for kw in keywords["ML Terms"]:
    count = resume_count.get(kw, 0)
    percent = (count / num_resumes) * 100
    total = occurrences.get(kw, 0)
    print(f"{kw:<25} {count:<10} {percent:>10.1f}% {total:>17}")

Exact substring search across entire document (57 resumes):
Keyword                   Resumes    % Resumes    Total Occurrences
-----------------------------------------------------------------
Machine Learning          57              100.0%               175
ML                        57              100.0%               731
AI                        51               89.5%               247
Artificial Intelligence   1                 1.8%                 1
Physics                   57              100.0%               223
LLM                       57              100.0%               588
Large Language Model      40               70.2%                40
transformer               56               98.2%               129
CUDA                      57              100.0%               129
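
The counts above come from substring matching, so a short keyword such as "ML" is presumably also counted when it sits inside a longer token. Item 2 in the capability list (regex matching over the whole document) has no cell of its own here, so the following is a minimal, self-contained sketch of the difference using only the standard `re` module. `count_matches` and the toy strings are hypothetical illustrations; they are not the `resume_analyzer` implementation, and this is not the signature of `count_pattern_matches`.

```python
import re

def count_matches(texts, pattern, is_regex=False):
    """Return (documents_matched, total_occurrences) for `pattern` over `texts`.

    Hypothetical helper for illustration only.
    """
    # Escape the pattern when it should be treated as a literal substring.
    compiled = re.compile(pattern if is_regex else re.escape(pattern))
    docs_matched = 0
    total = 0
    for text in texts:
        n = sum(1 for _ in compiled.finditer(text))
        total += n
        if n:
            docs_matched += 1
    return docs_matched, total

# Toy stand-ins for resume contents (not taken from the archive).
texts = [
    "Built ML pipelines; wrote HTML reports.",
    "Trained an LLM on CUDA hardware.",
    "ML and more ML.",
]

substring_result = count_matches(texts, "ML")                    # also hits "HTML"
whole_word_result = count_matches(texts, r"\bML\b", is_regex=True)  # whole word only
print(substring_result, whole_word_result)
```

The two results differ only in total occurrences here, because the extra substring hit ("HTML") lands in a document that already contains a standalone "ML". On a real archive the per-resume counts can diverge as well.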


In [31]:
# Example 2: Field-specific search (exact matching and regex pattern matching)
# How many resumes match each pattern in the \brand field ONLY?
# Entries containing "|" are regex alternations (is_regex=True below).

keywords_brand = {
    "AI/ML in Title": [
        "Physicist", r"Machine Learning|ML", r"Artificial Intelligence|AI",
        "Engineer", "LLM", "Scientist", "Research", "Quantum", "Data", "Software",
    ]
}

num_with_brand, brand_chars, brand_occur, brand_count = analyze_keywords_in_field(
    RESUME_ARCHIVE_PATH,
    keywords_brand,
    "brand",  # Only search in \renewcommand{\brand}{...}
    is_regex=True
)

print(f"Regex search within \\brand field only ({num_with_brand} resumes with brand field):")
print(f"{'Keyword':<30} {'Count':<8} {'%'}")
print("-" * 70)
for kw in keywords_brand["AI/ML in Title"]:
    count = brand_count.get(kw, 0)
    percent = (count / num_with_brand) * 100 if num_with_brand > 0 else 0
    print(f"{kw:<30} {count:<5} {percent:>5.1f}%")

Regex search within \brand field only (57 resumes with brand field):
Keyword                        Count    %
----------------------------------------------------------------------
Physicist                      55     96.5%
Machine Learning|ML            23     40.4%
Artificial Intelligence|AI     15     26.3%
Engineer                       47     82.5%
LLM                            4       7.0%
Scientist                      4       7.0%
Research                       4       7.0%
Quantum                        4       7.0%
Data                           4       7.0%
Software                       4       7.0%
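
To make the idea of a field-restricted search concrete, here is a rough sketch of how the body of `\renewcommand{\brand}{...}` could be isolated before matching. It assumes the field body contains no nested braces; `extract_field` and the toy documents are hypothetical and do not reflect how `analyze_keywords_in_field` is actually implemented.

```python
import re

def extract_field(tex, field):
    r"""Return the body of \renewcommand{\<field>}{...}, or None if absent.

    Hypothetical helper; assumes the body has no nested braces.
    """
    pattern = r"\\renewcommand\{\\" + re.escape(field) + r"\}\{([^{}]*)\}"
    match = re.search(pattern, tex)
    return match.group(1).strip() if match else None

# Toy documents (not taken from the archive).
docs = [
    r"\renewcommand{\brand}{AI Engineer | Physicist}" "\nSkills: ML, CUDA",
    r"\renewcommand{\brand}{Software Engineer | Physicist}" "\nSkills: AI tooling",
]

ai_pattern = re.compile(r"Artificial Intelligence|AI")

# Count documents whose brand field matches vs. documents matching anywhere.
brands = [extract_field(d, "brand") for d in docs]
in_brand = sum(1 for b in brands if b and ai_pattern.search(b))
anywhere = sum(1 for d in docs if ai_pattern.search(d))
print(brands)
print(in_brand, anywhere)
```

Restricting the search to the extracted field is what separates "AI is part of the headline" from "AI is mentioned somewhere", which is the distinction the table above is measuring.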


In [None]:
# Example 3: Field value enumeration
# Discover all unique brand values and their frequencies

field_values = enumerate_field_values(RESUME_ARCHIVE_PATH)

# Show top 10 most common brand values
brand_values = field_values.get("brand", {})
sorted_brands = sorted(brand_values.items(), key=lambda x: x[1], reverse=True)

print(f"Top 10 most common \\brand values (out of {len(brand_values)} unique):")
print(f"{'Brand Value':<50} {'Count':<8} {'%'}")
print("-" * 70)
for brand, count in sorted_brands[:10]:
    percent = (count / num_resumes) * 100
    # Truncate for display
    display_brand = brand if len(brand) <= 48 else brand[:45] + "..."
    print(f"{display_brand:<50} {count:<8} {percent:>5.1f}%")

print(f"\nTotal unique brand values: {len(brand_values)}")

Top 10 most common \brand values (out of 25 unique):
Brand Value                                        Count    %
----------------------------------------------------------------------
Machine Learning Engineer | Physicist              15        26.3%
AI Engineer | Physicist                            5          8.8%
Software Engineer | Physicist                      4          7.0%
AI Systems Engineer | Computational Physicist      3          5.3%
ML Infrastructure Engineer | HPC Physicist         3          5.3%
LLM Testing Specialist | HPC Physicist             2          3.5%
HPC Physicist | Quantum Engineer                   2          3.5%
Machine Learning Engineer | HPC Physicist          2          3.5%
Data Scientist | Computational Physicist           2          3.5%
AI Engineer | Computational Physicist              2          3.5%

Total unique brand values: 25
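
The tally above can be reproduced on already-extracted field values with `collections.Counter`. The sketch below is an illustration on toy data, not the internals of `enumerate_field_values`. It also splits each brand on the `|` separator seen in the values above, which gives a per-title count that the full-string enumeration does not show. The `brands` list is hypothetical.

```python
from collections import Counter

# Toy brand strings (hypothetical, not read from the archive).
brands = [
    "Machine Learning Engineer | Physicist",
    "AI Engineer | Physicist",
    "Machine Learning Engineer | Physicist",
    "Data Scientist | Computational Physicist",
]

# Frequency of each full brand string.
brand_counts = Counter(brands)

# Frequency of each individual title after splitting on the "|" separator.
title_counts = Counter(
    title.strip() for brand in brands for title in brand.split("|")
)

for value, n in brand_counts.most_common(2):
    print(f"{value:<45} {n}")
print(title_counts.most_common(3))
```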


In [None]:
# Example 4: Comparison - "AI" in whole document vs. in brand field only

# Whole document
whole_doc = {
    "Terms": ["AI"]
}
_, _, _, whole_doc_count = analyze_keyword_frequencies(RESUME_ARCHIVE_PATH, whole_doc)

# Brand field only
_, _, _, brand_only_count = analyze_keywords_in_field(
    RESUME_ARCHIVE_PATH, whole_doc, "brand", is_regex=False
)

# ProfessionalProfile field only
_, _, _, profile_only_count = analyze_keywords_in_field(
    RESUME_ARCHIVE_PATH, whole_doc, "ProfessionalProfile", is_regex=False
)

whole_count = whole_doc_count.get("AI", 0)
brand_ai = brand_only_count.get("AI", 0)
profile_ai = profile_only_count.get("AI", 0)

print("Where does 'AI' appear?")
print(f"  Entire document:           {whole_count} resumes ({whole_count/num_resumes*100:.1f}%)")
print(f"  In \\brand field only:      {brand_ai} resumes ({brand_ai/num_resumes*100:.1f}%)")
print(f"  In \\ProfessionalProfile:   {profile_ai} resumes ({profile_ai/num_resumes*100:.1f}%)")
print(f"\nInsight: 'AI' appears in the brand field in {brand_ai} resumes but")
print(f"         appears SOMEWHERE in {whole_count} resumes (body text, skills, etc.)")

Where does 'AI' appear?
  Entire document:           51 resumes (89.5%)
  In \brand field only:      14 resumes (24.6%)
  In \ProfessionalProfile:   15 resumes (26.3%)

Insight: 'AI' appears in the brand field in 14 resumes but
         appears SOMEWHERE in 51 resumes (body text, skills, etc.)
