# Template-Based Bidirectional Conversion Demo

**Goal:** Demonstrate how templates can drive conversion in BOTH directions:
- **Generation:** YAML data + Jinja template → LaTeX output
- **Parsing:** LaTeX input + YAML parsing config → YAML data

This notebook shows the concept with a simple example, then applies it to real resume data.

---
# Section 1: Generic Example

**Domain:** Book bibliography (simple, non-resume example)

In [1]:
import re
from jinja2 import Environment
from typing import Any, Dict

## Example Data Structure

A simple book record with a title, a list of authors, and a publication year.

In [2]:
book_data = {
    "title": "The Pragmatic Programmer",
    "authors": ["Andrew Hunt", "David Thomas"],
    "year": 1999
}

book_data

{'title': 'The Pragmatic Programmer',
 'authors': ['Andrew Hunt', 'David Thomas'],
 'year': 1999}

## Direction 1: YAML → LaTeX (Generation)

Use a Jinja2 template to generate LaTeX.

In [3]:
# Generation template (Jinja2)
generation_template = r"""
\textbf{<< title >>} (<< year >>)
\begin{itemize}
<%% for author in authors %%>
  \item \textit{<< author >>}
<%% endfor %%>
\end{itemize}
""".strip()

# Create Jinja environment with custom delimiters (to avoid LaTeX brace conflicts)
env = Environment(
    variable_start_string='<<',
    variable_end_string='>>',
    block_start_string='<%%',
    block_end_string='%%>'
)

template = env.from_string(generation_template)
latex_output = template.render(book_data)

print("Generated LaTeX:")
print(latex_output)

Generated LaTeX:
\textbf{The Pragmatic Programmer} (1999)
\begin{itemize}

  \item \textit{Andrew Hunt}

  \item \textit{David Thomas}

\end{itemize}


## Direction 2: LaTeX → YAML (Parsing)

This is the interesting part. How can we make parsing template-driven?

**Idea:** Create a YAML config that declares:
- Which patterns to look for
- How to extract data from each match
- Where to put the extracted data in the output structure

In [4]:
# Parsing configuration (YAML-like dict)
parsing_config = {
    # Pattern to extract title and year from first line
    "title_pattern": {
        "regex": r"\\textbf\{(?P<title>[^}]+)\}\s*\((?P<year>\d+)\)",
        "extract": {
            "title": "title",  # Capture group 'title' → field 'title'
            "year": "year"      # Capture group 'year' → field 'year'
        }
    },
    # Pattern to extract authors from \item \textit{...}
    "author_pattern": {
        "regex": r"\\item\s+\\textit\{(?P<author>[^}]+)\}",
        "extract_list": "authors",  # Multiple matches → list field 'authors'
        "capture_group": "author"
    }
}

parsing_config

{'title_pattern': {'regex': '\\\\textbf\\{(?P<title>[^}]+)\\}\\s*\\((?P<year>\\d+)\\)',
  'extract': {'title': 'title', 'year': 'year'}},
 'author_pattern': {'regex': '\\\\item\\s+\\\\textit\\{(?P<author>[^}]+)\\}',
  'extract_list': 'authors',
  'capture_group': 'author'}}
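
The `extract` mapping depends on Python's named capture groups. As a minimal sketch of what a single match against the title pattern yields (the sample string is the first line of the LaTeX generated above):

```python
import re

# Same regex as parsing_config["title_pattern"]["regex"].
title_regex = r"\\textbf\{(?P<title>[^}]+)\}\s*\((?P<year>\d+)\)"

sample = r"\textbf{The Pragmatic Programmer} (1999)"

match = re.search(title_regex, sample)
# groupdict() maps each named group to the text it captured.
groups = match.groupdict() if match else {}
print(groups)
# {'title': 'The Pragmatic Programmer', 'year': '1999'}
```

Note that every captured value is a string, including `year`; any type conversion has to happen in the parser.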

## Generic Parser Using Config

The core of the approach is a generic function that reads the parsing config and performs the extraction.

**Key enhancement:** Supports nested field paths like `content.list` using dot notation.

In [5]:
def set_nested_field(data: Dict[str, Any], field_path: str, value: Any) -> None:
    """Helper to set nested fields using dot notation (e.g., 'content.list')."""
    keys = field_path.split('.')
    current = data
    for key in keys[:-1]:
        if key not in current:
            current[key] = {}
        current = current[key]
    current[keys[-1]] = value


def parse_with_config(latex_str: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """
    Generic parser that uses a parsing config to extract data from LaTeX.
    
    The config is like an "inverse Jinja template" - it declares what to extract
    and where to put it, then this function does the actual extraction.
    """
    result = {}
    
    for pattern_name, pattern_config in config.items():
        regex = pattern_config["regex"]
        
        # Case 1: Extract single match with multiple capture groups
        if "extract" in pattern_config:
            match = re.search(regex, latex_str)
            if match:
                for capture_group, field_name in pattern_config["extract"].items():
                    value = match.group(capture_group)
                    # Convert all-digit captures to int (note: a numeric title such as "1984" is converted too)
                    if value.isdigit():
                        value = int(value)
                    set_nested_field(result, field_name, value)
        
        # Case 2: Extract list of matches (one capture group, multiple matches)
        elif "extract_list" in pattern_config:
            field_name = pattern_config["extract_list"]
            capture_group = pattern_config["capture_group"]
            
            matches = re.finditer(regex, latex_str)
            values = [match.group(capture_group) for match in matches]
            set_nested_field(result, field_name, values)
    
    return result
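
To make the dot-notation behaviour concrete, here is a small standalone sketch. The helper is repeated so the cell runs on its own (written with `dict.setdefault`, which should behave the same as the version above), and the `content.separator` field is hypothetical, used only for illustration.

```python
from typing import Any, Dict


def set_nested_field(data: Dict[str, Any], field_path: str, value: Any) -> None:
    """Set a value at a dot-separated path, creating intermediate dicts."""
    keys = field_path.split(".")
    current = data
    for key in keys[:-1]:
        # Reuse the existing sub-dict or create an empty one.
        current = current.setdefault(key, {})
    current[keys[-1]] = value


result: Dict[str, Any] = {}
set_nested_field(result, "type", "skill_list_pipes")
set_nested_field(result, "content.list", ["Python", "Bash"])
set_nested_field(result, "content.separator", " | ")  # hypothetical field
print(result)
# {'type': 'skill_list_pipes', 'content': {'list': ['Python', 'Bash'], 'separator': ' | '}}
```

One caveat: if an intermediate key already holds a non-dict value (for example, setting `type` and then `type.name`), the helper raises a `TypeError` instead of overwriting it.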

## Test: Parse the LaTeX we generated

In [6]:
# Parse the LaTeX output using our config
parsed_data = parse_with_config(latex_output, parsing_config)

print("Parsed data:")
print(parsed_data)

print("\n" + "="*50)
print("Round-trip check:")
print(f"Original: {book_data}")
print(f"Parsed:   {parsed_data}")
print(f"Match: {parsed_data == book_data}")

Parsed data:
{'title': 'The Pragmatic Programmer', 'year': 1999, 'authors': ['Andrew Hunt', 'David Thomas']}

==================================================
Round-trip check:
Original: {'title': 'The Pragmatic Programmer', 'authors': ['Andrew Hunt', 'David Thomas'], 'year': 1999}
Parsed:   {'title': 'The Pragmatic Programmer', 'year': 1999, 'authors': ['Andrew Hunt', 'David Thomas']}
Match: True
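
The round trip succeeds here partly because `isdigit()` happens to convert only the year. A book titled "1984" would come back with an integer title and fail the comparison. One possible alternative, sketched below, is to declare conversions in the config. The `types` key and the `CONVERTERS` table are assumptions for illustration and are not part of the config format shown above; the sketch also handles flat field names only.

```python
import re
from typing import Any, Callable, Dict

# Hypothetical converter table; names are illustrative only.
CONVERTERS: Dict[str, Callable[[str], Any]] = {"int": int, "str": str}

title_pattern = {
    "regex": r"\\textbf\{(?P<title>[^}]+)\}\s*\((?P<year>\d+)\)",
    "extract": {"title": "title", "year": "year"},
    "types": {"year": "int"},  # hypothetical key: only 'year' is coerced
}


def extract_typed(latex_str: str, pattern_config: Dict[str, Any]) -> Dict[str, Any]:
    """Single-match extraction with per-group type conversion."""
    match = re.search(pattern_config["regex"], latex_str)
    if match is None:
        return {}
    types = pattern_config.get("types", {})
    out: Dict[str, Any] = {}
    for group, field in pattern_config["extract"].items():
        # Groups without a declared type stay as strings.
        convert = CONVERTERS[types.get(group, "str")]
        out[field] = convert(match.group(group))
    return out


parsed = extract_typed(r"\textbf{1984} (1949)", title_pattern)
print(parsed)
# {'title': '1984', 'year': 1949}
```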


## The Key Insight

**Jinja template (generation):**
```jinja
\textbf{<< title >>} (<< year >>)
\begin{itemize}
<%% for author in authors %%>
  \item \textit{<< author >>}
<%% endfor %%>
\end{itemize}
```

**Parsing config (extraction):**
```yaml
title_pattern:
  regex: '\\textbf\{(?P<title>[^}]+)\}\s*\((?P<year>\d+)\)'
  extract:
    title: title
    year: year
author_pattern:
  regex: '\\item\s+\\textit\{(?P<author>[^}]+)\}'
  extract_list: authors
  capture_group: author
```

Both are **declarative**:
- Jinja says "here's the LaTeX structure, populate these placeholders"
- Parsing config says "here's the LaTeX structure, extract these patterns"

A generic `parse_with_config()` function reads the config and does the extraction,
just like Jinja's `render()` method reads the template and does the generation.

---
# Section 2: Resume-Specific Example

Apply the same pattern to an actual resume section type, `skill_list_pipes`.

## Resume Data Structure

In [7]:
# Actual resume section data (skill_list_pipes)
skill_data = {
    "type": "skill_list_pipes",
    "content": {
        "list": ["Python", "Bash", "C++", "Mathematica"]
    }
}

skill_data

{'type': 'skill_list_pipes',
 'content': {'list': ['Python', 'Bash', 'C++', 'Mathematica']}}

## Generation: YAML → LaTeX

Using the actual Jinja template from `types/skill_list_pipes/template.tex.jinja`

In [8]:
# This is the ACTUAL template from archer/contexts/templating/types/skill_list_pipes/template.tex.jinja
skill_template_str = r"""<%% for item in content.list %%>\texttt{<<< item >>>}<%% if not loop.last %%> | <%% endif %%><%% endfor %%>"""

# Create Jinja environment (same custom delimiters as TemplateRegistry; note the
# variable delimiters here are '<<<' / '>>>', not the '<<' / '>>' used in Section 1)
env_resume = Environment(
    variable_start_string='<<<',
    variable_end_string='>>>',
    block_start_string='<%%',
    block_end_string='%%>'
)

skill_template = env_resume.from_string(skill_template_str)
skill_latex = skill_template.render(skill_data)

print("Generated LaTeX:")
print(skill_latex)

Generated LaTeX:
\texttt{Python} | \texttt{Bash} | \texttt{C++} | \texttt{Mathematica}


## Parsing: LaTeX → YAML

This is what the parsing config could look like for `skill_list_pipes`.

In [9]:
# Parsing config for skill_list_pipes
# This would be stored in types/skill_list_pipes/parse_config.yaml
skill_parse_config = {
    "type_field": {
        "value": "skill_list_pipes",
        "extract_literal": "type"  # Set literal value (not from regex)
    },
    "items_pattern": {
        "regex": r"\\texttt\{(?P<item>[^}]+)\}",
        "extract_list": "content.list",  # Nested path: content.list
        "capture_group": "item"
    }
}

skill_parse_config

{'type_field': {'value': 'skill_list_pipes', 'extract_literal': 'type'},
 'items_pattern': {'regex': '\\\\texttt\\{(?P<item>[^}]+)\\}',
  'extract_list': 'content.list',
  'capture_group': 'item'}}

## Enhanced Parser with Literal Values

Extend the parser to support literal values (needed for the `type` field).

In [10]:
def parse_with_config_enhanced(latex_str: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """
    Enhanced parser with support for:
    - Nested field paths (content.list)
    - Literal values (extract_literal)
    - Single-match extraction (extract)
    - Multi-match extraction (extract_list)
    """
    result = {}
    
    for pattern_name, pattern_config in config.items():
        # Case 0: Set literal value (no regex, just assign a value)
        if "extract_literal" in pattern_config:
            field_name = pattern_config["extract_literal"]
            value = pattern_config["value"]
            set_nested_field(result, field_name, value)
            continue
        
        regex = pattern_config["regex"]
        
        # Case 1: Extract single match with multiple capture groups
        if "extract" in pattern_config:
            match = re.search(regex, latex_str)
            if match:
                for capture_group, field_name in pattern_config["extract"].items():
                    value = match.group(capture_group)
                    if value.isdigit():
                        value = int(value)
                    set_nested_field(result, field_name, value)
        
        # Case 2: Extract list of matches (one capture group, multiple matches)
        elif "extract_list" in pattern_config:
            field_name = pattern_config["extract_list"]
            capture_group = pattern_config["capture_group"]
            
            matches = re.finditer(regex, latex_str)
            values = [match.group(capture_group) for match in matches]
            set_nested_field(result, field_name, values)
    
    return result

## Test: Parse Resume LaTeX

In [11]:
# Parse the generated LaTeX using config
parsed_skill_data = parse_with_config_enhanced(skill_latex, skill_parse_config)

print("Parsed data:")
print(parsed_skill_data)

print("\n" + "="*60)
print("Round-trip check:")
print(f"Original: {skill_data}")
print(f"Parsed:   {parsed_skill_data}")
print(f"Match: {parsed_skill_data == skill_data}")

Parsed data:
{'type': 'skill_list_pipes', 'content': {'list': ['Python', 'Bash', 'C++', 'Mathematica']}}

============================================================
Round-trip check:
Original: {'type': 'skill_list_pipes', 'content': {'list': ['Python', 'Bash', 'C++', 'Mathematica']}}
Parsed:   {'type': 'skill_list_pipes', 'content': {'list': ['Python', 'Bash', 'C++', 'Mathematica']}}
Match: True
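
A single passing example does not show where the round trip breaks. The sketch below checks a few more inputs using only the standard library. `render_skills` is a plain-Python stand-in that is assumed to produce the same output as the Jinja template, and `parse_skills` applies the same regex as `items_pattern`; both function names are introduced here for illustration.

```python
import re
from typing import List

ITEM_REGEX = r"\\texttt\{(?P<item>[^}]+)\}"  # same regex as items_pattern


def render_skills(items: List[str]) -> str:
    """Stand-in for the Jinja template: \\texttt{...} joined by ' | '."""
    return " | ".join(rf"\texttt{{{item}}}" for item in items)


def parse_skills(latex_str: str) -> List[str]:
    """Collect every \\texttt{...} argument, in order of appearance."""
    return [m.group("item") for m in re.finditer(ITEM_REGEX, latex_str)]


def round_trips(items: List[str]) -> bool:
    return parse_skills(render_skills(items)) == items


cases = {
    "plain": ["Python", "Bash", "C++", "Mathematica"],
    "empty list": [],
    "nested braces": [r"\textbf{Rust}", "Go"],
}
for name, items in cases.items():
    print(f"{name}: {round_trips(items)}")
# plain: True
# empty list: True
# nested braces: False
```

The last case fails because `[^}]+` stops at the first closing brace, so an item that itself contains LaTeX braces is truncated. Items with nested markup would need a more careful pattern or a brace-aware scanner.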


## What This Means for ARCHER

**Current state (asymmetric):**
```
types/skill_list_pipes/
├── type.yaml              # Schema (unused)
└── template.tex.jinja     # Generation template (used)
```

Code: `parse_skill_list_pipes()` hardcodes its regex in Python.

---

**Future state (symmetric):**
```
types/skill_list_pipes/
├── type.yaml              # Schema (for validation)
├── template.tex.jinja     # Generation template
└── parse_config.yaml      # Parsing template (NEW)
```

Code: a generic `parse_with_config()` loads the config; no regex is hardcoded in Python.
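
Once patterns live in config files, a typo in a regex or group name surfaces only when parsing runs. A load-time check could catch such mistakes earlier. The sketch below is one possible shape for it; `validate_parse_config` is a hypothetical helper, and it assumes the three entry kinds used in this notebook (`extract`, `extract_list`, `extract_literal`).

```python
import re
from typing import Any, Dict, List


def validate_parse_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems: List[str] = []
    for name, entry in config.items():
        if "extract_literal" in entry:
            if "value" not in entry:
                problems.append(f"{name}: 'extract_literal' needs a 'value'")
            continue
        if "regex" not in entry:
            problems.append(f"{name}: missing 'regex'")
            continue
        try:
            compiled = re.compile(entry["regex"])
        except re.error as exc:
            problems.append(f"{name}: invalid regex ({exc})")
            continue
        if "extract" in entry:
            wanted = list(entry["extract"])  # keys are capture group names
        elif "extract_list" in entry:
            wanted = [entry.get("capture_group")]
        else:
            problems.append(f"{name}: needs 'extract', 'extract_list', or 'extract_literal'")
            continue
        for group in wanted:
            # groupindex maps each named group to its group number.
            if group not in compiled.groupindex:
                problems.append(f"{name}: capture group {group!r} not in regex")
    return problems


good_config = {
    "type_field": {"value": "skill_list_pipes", "extract_literal": "type"},
    "items_pattern": {
        "regex": r"\\texttt\{(?P<item>[^}]+)\}",
        "extract_list": "content.list",
        "capture_group": "item",
    },
}
bad_config = {
    "items_pattern": {
        "regex": r"\\texttt\{(?P<item>[^}]+)\}",
        "extract_list": "content.list",
        "capture_group": "skill",  # deliberate mismatch with the regex
    },
}
print(validate_parse_config(good_config))
print(validate_parse_config(bad_config))
```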

---

**Benefits:**
1. **Symmetry:** Both directions use declarative templates
2. **Visibility:** Regex patterns visible in YAML files, not buried in Python
3. **Maintainability:** Change patterns by editing YAML, not Python code
4. **Error messages:** Can show expected pattern from config when parsing fails
5. **Pattern variants:** Easy to add alternative parsing configs for historical resumes
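
Benefits 4 and 5 can be sketched together: when a pattern fails, the error can quote the expected regex from the config, and several variants can be tried in order. Everything named below (`PatternNotFound`, `require_match`, `parse_first_variant`, and the `legacy_italic` variant) is hypothetical and only illustrates the idea.

```python
import re
from typing import Any, Dict, List, Tuple


class PatternNotFound(ValueError):
    """Raised when a configured pattern does not match the input."""


def require_match(latex_str: str, name: str, entry: Dict[str, Any]) -> "re.Match[str]":
    match = re.search(entry["regex"], latex_str)
    if match is None:
        # The message quotes the regex straight from the config entry.
        raise PatternNotFound(
            f"pattern '{name}' did not match; expected regex: {entry['regex']}"
        )
    return match


def parse_first_variant(
    latex_str: str, variants: Dict[str, Dict[str, Any]]
) -> Tuple[str, Dict[str, str]]:
    """Try each variant in order; return the first that matches."""
    errors: List[str] = []
    for variant_name, entry in variants.items():
        try:
            match = require_match(latex_str, variant_name, entry)
        except PatternNotFound as exc:
            errors.append(str(exc))
            continue
        return variant_name, match.groupdict()
    raise PatternNotFound("no variant matched:\n" + "\n".join(errors))


title_variants = {
    "current": {"regex": r"\\textbf\{(?P<title>[^}]+)\}\s*\((?P<year>\d+)\)"},
    # Hypothetical older layout: italic title followed by a comma and the year.
    "legacy_italic": {"regex": r"\\textit\{(?P<title>[^}]+)\},\s*(?P<year>\d{4})"},
}

legacy_line = r"\textit{The Pragmatic Programmer}, 1999"
variant_name, fields = parse_first_variant(legacy_line, title_variants)
print(variant_name, fields)
# legacy_italic {'title': 'The Pragmatic Programmer', 'year': '1999'}
```

If no variant matches, the raised error lists every pattern that was tried, which is usually enough to see how the input differs from what the config expects.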

**Implementation:** Create a `parse_config.yaml` for each of the 9 types, then update the converter to use the generic parser.