# Full Text XML Parser - Modular Architecture Demo

This notebook demonstrates the **modular architecture** of the `FullTextXMLParser` class.

## What You'll Learn

1. **Generic Base Functions** - Reusable extraction patterns
2. **Specialized Functions** - Domain-specific extraction built on base functions
3. **Real-world Examples** - Practical use cases
4. **Custom Extractions** - How to build your own extractors

## Architecture Overview

The parser uses a **three-tier design**:

```
┌─────────────────────────────────────────┐
│        Public API Methods               │  ← What users call
│  extract_authors(), extract_metadata()  │
└────────────────┬────────────────────────┘
                 │
┌────────────────▼────────────────────────┐
│     Mid-Level Helper Functions          │  ← Domain-aware composition
│  _extract_reference_authors()           │
│  _extract_section_structure()           │
└────────────────┬────────────────────────┘
                 │
┌────────────────▼────────────────────────┐
│     Generic Base Functions              │  ← Core extraction logic
│  _extract_nested_texts()                │
│  _extract_flat_texts()                  │
│  _extract_structured_fields()           │
└─────────────────────────────────────────┘
```

## Benefits

- **DRY Principle**: No duplicate XPath traversal code
- **Consistency**: All extraction follows the same patterns
- **Maintainability**: Changes to base functions improve all extractors
- **Testability**: Each layer can be tested independently
- **Extensibility**: Easy to add new extraction methods
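
The layering in the diagram can be sketched in a few lines of standard-library Python. This is a hypothetical miniature, not the `FullTextXMLParser` source: the class `TinyParser` and its methods `_flat_texts`, `_section_structure`, and `get_sections` are names invented for illustration.

```python
import xml.etree.ElementTree as ET


class TinyParser:
    """Hypothetical three-tier parser (illustration only)."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    # Tier 3: generic base function, knows nothing about articles
    def _flat_texts(self, parent, path):
        texts = []
        for elem in parent.findall(path):
            text = (elem.text or "").strip()
            if text:
                texts.append(text)
        return texts

    # Tier 2: domain-aware helper composed from the base function
    def _section_structure(self, sec):
        titles = self._flat_texts(sec, "title")
        return {
            "title": titles[0] if titles else None,
            "paragraphs": self._flat_texts(sec, "p"),
        }

    # Tier 1: public API that users call
    def get_sections(self):
        return [self._section_structure(sec) for sec in self.root.findall(".//sec")]


tiny = TinyParser(
    "<article><body>"
    "<sec><title>Introduction</title><p>First paragraph.</p></sec>"
    "<sec><title>Methods</title><p>Second paragraph.</p></sec>"
    "</body></article>"
)
tiny_sections = tiny.get_sections()
print(tiny_sections)
```

Because only `_flat_texts` touches the XML tree directly, a fix to whitespace handling there would reach every method built on top of it.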

## Setup

In [53]:
from pathlib import Path
from pyeuropepmc import FullTextClient, FullTextXMLParser
import logging

# Enable debug logging to see the modular architecture in action
logging.basicConfig(level=logging.DEBUG, format='%(levelname)s: %(message)s')

# Load two sample articles for comparison
downloads_dir = Path("downloads")
downloads_dir.mkdir(exist_ok=True)

pmcids = ["PMC3258128", "PMC3359999"]
parsers = {}

for pmcid in pmcids:
    xml_path = downloads_dir / f"{pmcid}.xml"

    if not xml_path.exists():
        # Try to download if not present
        with FullTextClient() as client:
            downloaded_path = client.download_xml_by_pmcid(pmcid, xml_path)
            if downloaded_path is None:
                raise RuntimeError(f"Failed to download XML for {pmcid}")
            xml_path = downloaded_path

    # Read XML content
    with open(xml_path, 'r', encoding='utf-8') as f:
        xml_content = f.read()

    # Create parser
    parser = FullTextXMLParser()
    parser.parse(xml_content)
    parsers[pmcid] = parser

    if parser.root is None:
        raise RuntimeError(f"Failed to parse XML for {pmcid} - root element is None")

print(f"✓ Loaded {len(parsers)} articles for comparison:")
for pmcid in pmcids:
    print(f"  - {pmcid}")

# Create shortcuts for easier access
parser1 = parsers["PMC3258128"]
parser2 = parsers["PMC3359999"]


✓ Loaded 2 articles for comparison:
  - PMC3258128
  - PMC3359999


In [54]:
print("=== Element Types Comparison ===\n")
for pmcid, parser in parsers.items():
    element_types = parser.list_element_types()
    print(f"{pmcid}: {len(element_types)} unique element types")
    print(f"  {', '.join(element_types)}...")
    print()


=== Element Types Comparison ===

PMC3258128: 72 unique element types
  abstract, aff, article, article-categories, article-id, article-meta, article-title, author-notes, award-id, back, body, bold, caption, contrib, contrib-group, copyright-statement, copyright-year, corresp, counts, date, day, element-citation, email, etal, ext-link, fax, fig, fn, fpage, front, funding-source, given-names, graphic, history, issn, issue, italic, journal-id, journal-meta, journal-title, journal-title-group, label, license, license-p, lpage, media, month, name, p, page-count, permissions, person-group, phone, pub-date, pub-id, publisher, publisher-name, ref, ref-list, sec, source, sub, subj-group, subject, sup, supplementary-material, surname, title, title-group, volume, xref, year...

PMC3359999: 87 unique element types
  abstract, ack, addr-line, aff, alt-title, alternatives, article, article-categories, article-id, article-meta, article-title, author-notes, back, body, bold, caption, col, colgroup, c

## 1. Generic Base Functions

These are the **core extraction patterns** that all specialized functions use.

### `_extract_nested_texts()` - Extract Hierarchical Data

Use this when you need to extract data from **nested XML structures** and combine them into single strings.

**Example**: Author names are nested structures (`<given-names>` + `<surname>`)

In [55]:
# Example 1: Extract author names (nested structure) - Comparison
print("=== Extracting Author Names (Nested Structure) - Article Comparison ===")
print("\nXML Structure:")
print("""
<contrib contrib-type="author">
  <name>
    <given-names>John</given-names>
    <surname>Smith</surname>
  </name>
</contrib>
""")

print("Using the generic base function to compare both articles:\n")

for pmcid, parser in parsers.items():
    # Using the generic base function
    authors = parser._extract_nested_texts(
        parser.root,
        ".//contrib[@contrib-type='author']/name",  # Find all author name elements
        ["given-names", "surname"],                  # Extract these child elements
        join=" "                                     # Join with space: "John Smith"
    )

    print(f"📄 {pmcid}: Extracted {len(authors)} authors")
    for i, author in enumerate(authors[:3], 1):
        print(f"  {i}. {author}")
    if len(authors) > 3:
        print(f"  ... and {len(authors) - 3} more")
    print()


=== Extracting Author Names (Nested Structure) - Article Comparison ===

XML Structure:

<contrib contrib-type="author">
  <name>
    <given-names>John</given-names>
    <surname>Smith</surname>
  </name>
</contrib>

Using the generic base function to compare both articles:

📄 PMC3258128: Extracted 12 authors
  1. Shuai Li
  2. Juanjuan Zhu
  3. Hanjiang Fu
  ... and 9 more

📄 PMC3359999: Extracted 8 authors
  1. Rosina Claudia Krecek
  2. Hamish Mohammed
  3. Lynne Margaret Michael
  ... and 5 more
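
For readers curious how a nested-text helper of this kind might work, here is a minimal sketch using only `xml.etree.ElementTree`. The function `nested_texts` is a hypothetical stand-in that mirrors the call shape used above; the real `_extract_nested_texts()` may differ in how it handles missing children or whitespace.

```python
import xml.etree.ElementTree as ET


def nested_texts(parent, parent_path, child_tags, join=" "):
    """Hypothetical sketch: join selected child texts for each matching parent."""
    results = []
    for node in parent.findall(parent_path):
        parts = []
        for tag in child_tags:
            child = node.find(tag)
            # Skip children that are absent or contain only whitespace
            if child is not None and child.text and child.text.strip():
                parts.append(child.text.strip())
        if parts:
            results.append(join.join(parts))
    return results


sample = ET.fromstring(
    "<article>"
    "<contrib contrib-type='author'><name>"
    "<surname>Smith</surname><given-names>John</given-names></name></contrib>"
    "<contrib contrib-type='author'><name>"
    "<surname>Consortium</surname></name></contrib>"
    "<contrib contrib-type='editor'><name>"
    "<surname>Jones</surname><given-names>Ann</given-names></name></contrib>"
    "</article>"
)
sketch_authors = nested_texts(
    sample,
    ".//contrib[@contrib-type='author']/name",
    ["given-names", "surname"],
)
print(sketch_authors)
```

The output order follows `child_tags`, not the order of elements in the XML, which is why `given-names` comes first even though `<surname>` precedes it in the sample.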



### `_extract_flat_texts()` - Extract Simple Lists

Use this when you need to extract **flat lists** of text values.

**Example**: Keywords are simple flat elements

In [56]:
# Example 2: Extract keywords (flat list) - Comparison
print("=== Extracting Keywords (Flat List) - Article Comparison ===")
print("\nXML Structure:")
print("""
<kwd-group>
  <kwd>genomics</kwd>
  <kwd>bioinformatics</kwd>
  <kwd>DNA sequencing</kwd>
</kwd-group>
""")

print("Using the generic base function to compare both articles:\n")

for pmcid, parser in parsers.items():
    # Using the generic base function
    keywords = parser._extract_flat_texts(
        parser.root,
        ".//kwd"  # XPath to all keyword elements
    )

    print(f"📄 {pmcid}: Extracted {len(keywords)} keywords from <kwd> elements")
    if keywords:
        for kw in keywords[:5]:
            print(f"  - {kw}")
        if len(keywords) > 5:
            print(f"  ... and {len(keywords) - 5} more")
    else:
        print("  Note: No keywords in <kwd> elements for this article")

        # Try to extract actual keywords using the public API
        actual_keywords = parser.extract_keywords()
        if actual_keywords:
            print(f"  However, the article has {len(actual_keywords)} keywords via extract_keywords():")
            for kw in actual_keywords[:5]:
                print(f"    - {kw}")
        else:
            print("  This article has no keywords defined.")
    print()


DEBUG: Extracted keywords: []
DEBUG: Extracted keywords: []
DEBUG: Extracted keywords: []


=== Extracting Keywords (Flat List) - Article Comparison ===

XML Structure:

<kwd-group>
  <kwd>genomics</kwd>
  <kwd>bioinformatics</kwd>
  <kwd>DNA sequencing</kwd>
</kwd-group>

Using the generic base function to compare both articles:

📄 PMC3258128: Extracted 0 keywords from <kwd> elements
  Note: No keywords in <kwd> elements for this article
  This article has no keywords defined.

📄 PMC3359999: Extracted 0 keywords from <kwd> elements
  Note: No keywords in <kwd> elements for this article
  This article has no keywords defined.



In [57]:
# Let's demonstrate with actual list elements from both articles
print("=== Extract Section Titles (Flat List) - Article Comparison ===")
print("Using _extract_flat_texts() to extract all section titles:\n")

for pmcid, parser in parsers.items():
    section_titles = parser._extract_flat_texts(
        parser.root,
        ".//sec/title"  # XPath to all section titles
    )

    print(f"📄 {pmcid}: Extracted {len(section_titles)} section titles")
    for i, title in enumerate(section_titles[:5], 1):
        print(f"  {i}. {title}")
    if len(section_titles) > 5:
        print(f"  ... and {len(section_titles) - 5} more")
    print()


=== Extract Section Titles (Flat List) - Article Comparison ===
Using _extract_flat_texts() to extract all section titles:

📄 PMC3258128: Extracted 22 section titles
  1. INTRODUCTION
  2. MATERIALS AND METHODS
  3. Cell lines and cultures
  4. Affinity purification experiments
  5. Real-time qRT–PCR for mRNA
  ... and 17 more

📄 PMC3359999: Extracted 10 section titles
  1. Introduction
  2. Materials and Methods
  3. Study design and population
  4. Household questionnaire
  5. Statistical analysis
  ... and 5 more



### `_extract_flat_texts()` with `use_full_text=True`

When you need to extract **all text including nested elements**, use `use_full_text=True`.

**Example**: Paragraphs may contain formatting tags like `<italic>`, `<bold>`, etc.

In [58]:
# Example 3: Extract paragraphs with nested formatting - Comparison
print("=== Extracting Paragraphs (Deep Text) - Article Comparison ===")
print("\nXML Structure:")
print("""
<p>Text with <italic>italic</italic> and <bold>bold</bold> elements.</p>
""")

print("\nUsing the generic base function with use_full_text=True:\n")

for pmcid, parser in parsers.items():
    # Using the generic base function with use_full_text=True
    paragraphs = parser._extract_flat_texts(
        parser.root,
        ".//body//p",       # XPath to all paragraph elements
        use_full_text=True  # Extract ALL text including nested tags
    )

    print(f"📄 {pmcid}: Extracted {len(paragraphs)} paragraphs")
    print("First 2 paragraphs:")
    for i, para in enumerate(paragraphs[:2], 1):
        print(f"\n  {i}. {para[:120]}..." if len(para) > 120 else f"\n  {i}. {para}")
    print()


=== Extracting Paragraphs (Deep Text) - Article Comparison ===

XML Structure:

<p>Text with <italic>italic</italic> and <bold>bold</bold> elements.</p>


Using the generic base function with use_full_text=True:

📄 PMC3258128: Extracted 36 paragraphs
First 2 paragraphs:

  1. MicroRNAs (miRNAs) are small conserved RNAs of ∼22 nt which negatively modulate gene expression in animals and plants, p...

  2. One of the first clues of the existence of miRNAs in mammals came from studies on genetic alterations in woodchuck liver...

📄 PMC3359999: Extracted 36 paragraphs
First 2 paragraphs:

  1. A high prevalence of Taenia solium taeniosis/cysticercosis is reported from some countries in Africa whereas limited or ...

  2. The ECP reported high levels (28–50%) of human juvenile NCC (which occurs in children) and limited current data for porc...
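
The difference that `use_full_text=True` addresses can be seen directly with the standard library. In `xml.etree.ElementTree`, an element's `.text` holds only the text before its first child, while `itertext()` walks the whole subtree. The sketch below shows that contrast on the sample paragraph; it assumes, without confirming, that the parser's full-text mode does something comparable.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>Text with <italic>italic</italic> and <bold>bold</bold> elements.</p>"
)

# .text stops at the first child element
direct_text = p.text

# itertext() yields every text fragment in document order
full_text = "".join(p.itertext())

print(repr(direct_text))
print(repr(full_text))
```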



### `_extract_structured_fields()` - Extract Multiple Fields

Use this when you need to extract **multiple fields** from a single parent element.

**Example**: Reference citations have multiple fields (title, year, volume, etc.)

In [59]:
# Example 4: Extract reference metadata (multiple fields) - Comparison
print("=== Extracting Reference Metadata (Structured Fields) - Article Comparison ===")
print("\nXML Structure:")
print("""
<element-citation>
  <article-title>Sample Article</article-title>
  <year>2020</year>
  <volume>10</volume>
  <fpage>100</fpage>
  <lpage>110</lpage>
</element-citation>
""")

print("\nUsing the generic base function:\n")

for pmcid, parser in parsers.items():
    # Get first reference element
    citation = parser.root.find(".//element-citation")
    if citation is not None:
        # Using the generic base function
        fields = parser._extract_structured_fields(
            citation,  # Parent element to search within
            {
                "title": "article-title",
                "year": "year",
                "volume": "volume",
                "fpage": "fpage",
                "lpage": "lpage",
            }
        )

        print(f"📄 {pmcid}: Extracted reference fields:")
        for key, value in fields.items():
            if value:
                display_value = value[:60] + "..." if len(value) > 60 else value
                print(f"  {key}: {display_value}")
    else:
        print(f"📄 {pmcid}: No references found")
    print()


=== Extracting Reference Metadata (Structured Fields) - Article Comparison ===

XML Structure:

<element-citation>
  <article-title>Sample Article</article-title>
  <year>2020</year>
  <volume>10</volume>
  <fpage>100</fpage>
  <lpage>110</lpage>
</element-citation>


Using the generic base function:

📄 PMC3258128: Extracted reference fields:
  title: MicroRNAs: genomics, biogenesis, mechanism, and function
  year: 2004
  volume: 116
  fpage: 281
  lpage: 297

📄 PMC3359999: Extracted reference fields:
  title: Control of Neurocysticercosis.
  year: 2003



### `_combine_page_range()` - Format Page Ranges

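A small utility that formats a first page (`fpage`) and last page (`lpage`) as a single range. As the output below shows, it returns `fpage-lpage` when both are present, `fpage` alone when `lpage` is missing, and `None` when `fpage` is missing.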

In [60]:
# Example 5: Combine page ranges
print("=== Formatting Page Ranges ===")

examples = [
    ("100", "110"),
    ("50", None),
    (None, "60"),
    (None, None),
]

print("\nInput (fpage, lpage) → Output:")
for fpage, lpage in examples:
    result = parser1._combine_page_range(fpage, lpage)
    print(f"  ({fpage}, {lpage}) → {result}")

=== Formatting Page Ranges ===

Input (fpage, lpage) → Output:
  (100, 110) → 100-110
  (50, None) → 50
  (None, 60) → None
  (None, None) → None
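
The four cases above are consistent with a very small function. The sketch below reproduces the observed outputs; `combine_page_range` is a hypothetical reimplementation written from those outputs alone, not a copy of the library's code.

```python
def combine_page_range(fpage, lpage):
    """Hypothetical sketch matching the four outputs shown above."""
    if not fpage:
        # Without a first page there is no usable range
        return None
    return f"{fpage}-{lpage}" if lpage else fpage


page_results = [
    combine_page_range("100", "110"),
    combine_page_range("50", None),
    combine_page_range(None, "60"),
    combine_page_range(None, None),
]
print(page_results)
```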


## 2. Specialized Functions Using Base Functions

These **public API methods** are what users typically call. They're built on top of the generic base functions.

### `extract_authors()` - Uses `_extract_nested_texts()`

In [61]:
# Example 6: Extract authors using the public API - Comparison
print("=== Public API: extract_authors() - Article Comparison ===")
print("\nInternally calls: _extract_nested_texts()")
print("Pattern: Finds <name> elements and combines <given-names> + <surname>\n")

for pmcid, parser in parsers.items():
    authors = parser.extract_authors()

    print(f"📄 {pmcid}: Found {len(authors)} authors")
    for i, author in enumerate(authors[:5], 1):
        print(f"  {i}. {author}")
    if len(authors) > 5:
        print(f"  ... and {len(authors) - 5} more")
    print()


DEBUG: Extracted authors: ['Shuai Li', 'Juanjuan Zhu', 'Hanjiang Fu', 'Jing Wan', 'Zheng Hu', 'Shanshan Liu', 'Jie Li', 'Yi Tie', 'Ruiyun Xing', 'Jie Zhu', 'Zhixian Sun', 'Xiaofei Zheng']
DEBUG: Extracted authors: ['Rosina Claudia Krecek', 'Hamish Mohammed', 'Lynne Margaret Michael', 'Peter Mullineaux Schantz', 'Lulama Ntanjana', 'Liesl Morey', 'Stephen Rakem Werre', 'Arve Lee Willingham']
DEBUG: Extracted authors: ['Rosina Claudia Krecek', 'Hamish Mohammed', 'Lynne Margaret Michael', 'Peter Mullineaux Schantz', 'Lulama Ntanjana', 'Liesl Morey', 'Stephen Rakem Werre', 'Arve Lee Willingham']


=== Public API: extract_authors() - Article Comparison ===

Internally calls: _extract_nested_texts()
Pattern: Finds <name> elements and combines <given-names> + <surname>

📄 PMC3258128: Found 12 authors
  1. Shuai Li
  2. Juanjuan Zhu
  3. Hanjiang Fu
  4. Jing Wan
  5. Zheng Hu
  ... and 7 more

📄 PMC3359999: Found 8 authors
  1. Rosina Claudia Krecek
  2. Hamish Mohammed
  3. Lynne Margaret Michael
  4. Peter Mullineaux Schantz
  5. Lulama Ntanjana
  ... and 3 more



### `extract_keywords()` - Uses `_extract_flat_texts()`

In [62]:
# Example 7: Extract keywords using the public API - Comparison
print("=== Public API: extract_keywords() - Article Comparison ===")
print("\nInternally calls: _extract_flat_texts()")
print("Pattern: Finds all <kwd> elements and extracts their text\n")

for pmcid, parser in parsers.items():
    keywords = parser.extract_keywords()

    if keywords:
        print(f"📄 {pmcid}: Found {len(keywords)} keywords")
        for kw in keywords[:5]:
            print(f"  - {kw}")
        if len(keywords) > 5:
            print(f"  ... and {len(keywords) - 5} more")
    else:
        print(f"📄 {pmcid}: No keywords found")
    print()


DEBUG: Extracted keywords: []
DEBUG: Extracted keywords: []
DEBUG: Extracted keywords: []


=== Public API: extract_keywords() - Article Comparison ===

Internally calls: _extract_flat_texts()
Pattern: Finds all <kwd> elements and extracts their text

📄 PMC3258128: No keywords found

📄 PMC3359999: No keywords found



### `extract_references()` - Uses Multiple Base Functions

Complex methods can combine **multiple base functions**.

In [63]:
# Example 8: Extract references (complex composition) - Comparison
print("=== Public API: extract_references() - Article Comparison ===")
print("\nInternally uses:")
print("  - _extract_reference_authors() for author names")
print("  - _extract_structured_fields() for title, year, volume, etc.")
print("  - _combine_page_range() for page formatting\n")

for pmcid, parser in parsers.items():
    references = parser.extract_references()

    print(f"📄 {pmcid}: Found {len(references)} references")
    print("First 2 references:")
    for i, ref in enumerate(references[:2], 1):
        print(f"\n  {i}. {(ref.get('title') or 'No title')[:70]}...")
        print(f"     Authors: {(ref.get('authors') or 'No authors')[:50]}...")
        print(f"     Source: {ref.get('source', 'N/A')} ({ref.get('year', 'N/A')})")
        if ref.get('doi'):
            print(f"     DOI: {ref.get('doi')}")
    print()


DEBUG: Extracted 47 references from XML: [{'id': 'gkr715-B1', 'label': '1', 'authors': 'DP Bartel', 'title': 'MicroRNAs: genomics, biogenesis, mechanism, and function', 'source': 'Cell', 'year': '2004', 'volume': '116', 'pages': '281-297'}, {'id': 'gkr715-B2', 'label': '2', 'authors': 'A Krek, D Grun, MN Poy, R Wolf, L Rosenberg, EJ Epstein, P MacMenamin, I da Piedade, KC Gunsalus, M Stoffel', 'title': 'Combinatorial microRNA target predictions', 'source': 'Nat. Genet.', 'year': '2005', 'volume': '37', 'pages': '495-500'}, {'id': 'gkr715-B3', 'label': '3', 'authors': 'P Xu, M Guo, BA Hay', 'title': 'MicroRNAs and the regulation of cell death', 'source': 'Trends Genet.', 'year': '2004', 'volume': '20', 'pages': '617-624'}, {'id': 'gkr715-B4', 'label': '4', 'authors': 'AM Cheng, MW Byrom, J Shelton, LP Ford', 'title': 'Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis', 'source': 'Nucleic Acids Res.', 'year': '2005', 'volume': '

=== Public API: extract_references() - Article Comparison ===

Internally uses:
  - _extract_reference_authors() for author names
  - _extract_structured_fields() for title, year, volume, etc.
  - _combine_page_range() for page formatting

📄 PMC3258128: Found 47 references
First 2 references:

  1. MicroRNAs: genomics, biogenesis, mechanism, and function...
     Authors: DP Bartel...
     Source: Cell (2004)

  2. Combinatorial microRNA target predictions...
     Authors: A Krek, D Grun, MN Poy, R Wolf, L Rosenberg, EJ Ep...
     Source: Nat. Genet. (2005)

📄 PMC3359999: Found 28 references
First 2 references:

  1. Control of Neurocysticercosis....
     Authors: No authors...
     Source: None (2003)

  2. The emergence of Taenia solium Cysticercosis in Eastern and Southern A...
     Authors: I Phiri, H Ngowi, S Afonso, E Matenga, M Boa...
     Source: Acta Trop (2003)



### `extract_tables()` - Uses `_extract_flat_texts()` in Helper

In [64]:
# Example 9: Extract tables - Comparison
print("=== Public API: extract_tables() - Article Comparison ===")
print("\nInternally uses:")
print("  - _extract_flat_texts() for headers and cell data\n")

for pmcid, parser in parsers.items():
    tables = parser.extract_tables()

    if tables:
        print(f"📄 {pmcid}: Found {len(tables)} tables")
        for i, table in enumerate(tables[:1], 1):
            print(f"\n  Table {i}: {table.get('label', 'No label')}")
            caption = table.get('caption', '')
            print(f"  Caption: {caption[:60] if caption else 'No caption'}...")
            print(f"  Dimensions: {len(table.get('headers', []))} columns × {len(table.get('rows', []))} rows")
            if table.get('headers'):
                print(f"  Headers: {', '.join(table.get('headers', [])[:4])}")
    else:
        print(f"📄 {pmcid}: No tables found")
    print()


DEBUG: Extracted 0 tables from XML: []
DEBUG: Extracted 2 tables from XML: [{'id': 'pone-0037718-t001', 'label': 'Table 1', 'caption': 'Bivariable associations between owner/pig characteristics and cysticercosis 1 infection in pigs from Eastern Cape Province (South Africa) (N\u200a=\u200a256) 2 .', 'headers': [], 'rows': [['Veterinary district 5', 'Umzimukulu', '28 (58)', '20 (42)', '1.17 ( 0.53 , 2.59 )', '0.691'], ['', 'Maluti', '17 (49)', '18 (51)', '0.79 ( 0.34 , 1.85 )', '0.590'], ['', 'Tsolo', '32 (71)', '13 (29)', '2.06 ( 0.89 , 4.78 )', '0.091'], ['', 'Qumbu', '29 (73)', '11 (28)', '2.21 ( 0.92 , 5.29 )', '0.075'], ['', 'Lusikisiki', '9 (29)', '22 (71)', '0.34 ( 0.14 , 0.86 )', '0.022'], ['', 'Mt. Frere', '31 (54)', '26 (46)', 'Reference', ''], ['Breed 6', 'Cross bred', '18 (40)', '27 (60)', '0.42 ( 0.22 , 0.83 )', '0.012'], ['', 'Other', '2 (67)', '1 (33)', '1.27 ( 0.08 , 20.97 )', '0.868'], ['', 'Hut pig', '123 (61)', '78 (39)', 'Reference', ''], ['Latrine 7', 'Absent', '82 (

=== Public API: extract_tables() - Article Comparison ===

Internally uses:
  - _extract_flat_texts() for headers and cell data

📄 PMC3258128: No tables found

📄 PMC3359999: Found 2 tables

  Table 1: Table 1
  Caption: Bivariable associations between owner/pig characteristics an...
  Dimensions: 0 columns × 23 rows
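
To make the header/row split concrete, here is a rough standard-library sketch of turning a table into a header list and a row grid. The helpers `cell_text` and `table_grid` are hypothetical, and the paths `./thead/tr/th` and `./tbody/tr` are assumptions about the markup. Under this sketch, a table whose header cells are marked up with `<td>` rather than `<th>` yields an empty header list; whether that explains the `0 columns` result above has not been verified.

```python
import xml.etree.ElementTree as ET


def cell_text(cell):
    """Collapse all text inside a cell, including nested tags, to one line."""
    return " ".join("".join(cell.itertext()).split())


def table_grid(table):
    """Hypothetical sketch: return (headers, rows) for a <table> element."""
    headers = [cell_text(th) for th in table.findall("./thead/tr/th")]
    rows = [
        [cell_text(td) for td in tr.findall("td")]
        for tr in table.findall("./tbody/tr")
    ]
    return headers, rows


body = "<tbody><tr><td>Tsolo</td><td>2.06</td></tr></tbody>"
with_th = ET.fromstring(
    "<table><thead><tr><th>District</th><th>OR</th></tr></thead>" + body + "</table>"
)
with_td = ET.fromstring(
    "<table><thead><tr><td>District</td><td>OR</td></tr></thead>" + body + "</table>"
)

th_headers, th_rows = table_grid(with_th)
td_headers, td_rows = table_grid(with_td)
print(th_headers, th_rows)
print(td_headers, td_rows)
```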



### `get_full_text_sections()` - Uses `_extract_section_structure()`

In [65]:
# Example 10: Extract sections - Comparison
print("=== Public API: get_full_text_sections() - Article Comparison ===")
print("\nInternally uses:")
print("  - _extract_section_structure() for each section\n")

for pmcid, parser in parsers.items():
    sections = parser.get_full_text_sections()

    print(f"📄 {pmcid}: Found {len(sections)} sections")
    for i, section in enumerate(sections[:3], 1):
        print(f"\n  {i}. {section.get('title') or '(No title)'}")
        content = section.get('content', '')
        content_preview = content[:80] if content else '(Empty)'
        print(f"     {content_preview}...")
    print()


DEBUG: Extracted 22 sections from XML: [{'title': 'INTRODUCTION', 'content': 'MicroRNAs (miRNAs) are small conserved RNAs of ∼22\u2009nt which negatively modulate gene expression in animals and plants, primarily through base paring to the 3′-untranslated region (UTR) of target messenger RNAs (mRNAs). This leads to mRNA cleavage and/or translation repression ( 1 ). miRNAs are primarily transcribed by RNA polymerase II as part of capped and polyadenylated primary transcripts (pri-miRNAs) that can be either protein-coding or non-coding. The primary transcript is cleaved by Drosha ribonuclease III enzyme to produce an ∼70-nt stem–loop precursor miRNA (pre-miRNA), which is further cleaved by the cytoplasmic Dicer ribonuclease to generate the mature miRNA. The mature miRNA is incorporated into an RNA-induced silencing complex (RISC), which recognizes target mRNAs through imperfect base pairing with the miRNA. Bioinformatic analysis predicts that each miRNA may regulate hundreds of ta

=== Public API: get_full_text_sections() - Article Comparison ===

Internally uses:
  - _extract_section_structure() for each section

📄 PMC3258128: Found 22 sections

  1. INTRODUCTION
     MicroRNAs (miRNAs) are small conserved RNAs of ∼22 nt which negatively modulate ...

  2. MATERIALS AND METHODS
     HepG2 and HeLa cell lines were cultured in DMEM (GIBCO BRL, Grand Island, NY, US...

  3. Cell lines and cultures
     HepG2 and HeLa cell lines were cultured in DMEM (GIBCO BRL, Grand Island, NY, US...

📄 PMC3359999: Found 10 sections

  1. Introduction
     A high prevalence of Taenia solium taeniosis/cysticercosis is reported from some...

  2. Materials and Methods
     This study was carried out from February to June 2003, in the six veterinary dis...

  3. Study design and population
     This study was carried out from February to June 2003, in the six veterinary dis...
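
In both articles, sections 2 and 3 show the same content preview: a parent section such as "Materials and Methods" appears to be listed alongside its first subsection, and the parent's content starts with the subsection's text. The sketch below reproduces that pattern with the standard library, under the assumption that sections are matched with a descendant path such as `.//sec`. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<article><body>"
    "<sec><title>Introduction</title><p>Intro text.</p></sec>"
    "<sec><title>Methods</title>"
    "<sec><title>Cell lines</title><p>Cells were cultured.</p></sec>"
    "</sec>"
    "</body></article>"
)

# Descendant search returns parents and nested subsections alike
all_titles = [sec.findtext("title") for sec in doc.findall(".//sec")]

# A child-only path restricts the result to top-level sections
top_level_titles = [sec.findtext("title") for sec in doc.findall("./body/sec")]

# The parent's full text includes the text of its nested subsection
methods = doc.findall("./body/sec")[1]
methods_text = "".join(methods.itertext())

print(all_titles)
print(top_level_titles)
```

If only top-level sections are wanted, filtering on the parent element in this way is one option.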



## 3. Building Custom Extractors

You can **create your own extraction methods** using the same base functions.

### Example: Extract Affiliations

In [66]:
# Example 11: Custom extractor for affiliations - Comparison
print("=== Custom Extractor: Affiliations - Article Comparison ===")
print("\nUsing: extract_affiliations() method\n")

for pmcid, parser in parsers.items():
    # Extract all affiliations using the built-in method
    affiliations = parser.extract_affiliations()

    if affiliations:
        print(f"📄 {pmcid}: Found {len(affiliations)} affiliations")
        for i, aff in enumerate(affiliations[:2], 1):
            print(f"\n  {i}. ID: {aff.get('id') or 'N/A'}")

            # Show structured data if available
            if aff.get('institution'):
                print(f"     Institution: {aff.get('institution')}")
                print(f"     Location: {aff.get('city', 'N/A')}, {aff.get('country', 'N/A')}")
            else:
                # Show parsed mixed content
                if aff.get('markers'):
                    print(f"     Markers: {aff.get('markers')}")

                # Show parsed multi-institution data if available
                if aff.get('parsed_institutions'):
                    print(f"     Parsed {len(aff.get('parsed_institutions', []))} institutions")
                elif aff.get('institution_text'):
                    print(f"     Institution(s): {aff.get('institution_text', '')[:60]}...")

                if aff.get('text'):
                    text = aff.get('text', '')
                    print(f"     Full text: {text[:60]}...")
    else:
        print(f"📄 {pmcid}: No affiliations found")
    print()


DEBUG: Extracted 1 affiliations
DEBUG: Extracted 10 affiliations
DEBUG: Extracted 10 affiliations


=== Custom Extractor: Affiliations - Article Comparison ===

Using: extract_affiliations() method

📄 PMC3258128: Found 1 affiliations

  1. ID: gkr715-AFF1
     Markers: 1, 2
     Parsed 2 institutions
     Full text: 1Beijing Institute of Radiation Medicine, Beijing 100850 and...

📄 PMC3359999: Found 10 affiliations

  1. ID: aff1
     Full text: 1
Department of Research, Ross University School of Veterina...

  2. ID: aff2
     Full text: 2
Department of Zoology, University of Johannesburg, Aucklan...



#### Advanced: Parse Complex Multi-Institution Affiliations

The parser includes a built-in `extract_affiliations()` method that automatically handles:

- **Structured affiliations**: With `<institution>`, `<city>`, `<country>` tags
- **Mixed-content affiliations**: With superscript markers and text
- **Multi-institution parsing**: Automatically splits affiliations with multiple institutions separated by "and"

Here's a demonstration with a real-world complex example:

In [67]:
# Example: Parse a complex multi-institution affiliation
sample_aff_xml = """
<aff id="gkr715-AFF1"><sup>1</sup>Beijing Institute of Radiation Medicine, Beijing 100850 and <sup>2</sup>Anhui Medical University, Hefei 230032, P. R. China</aff>
"""

print("=== Parsing Complex Multi-Institution Affiliation ===\n")
print("Sample XML:")
print(sample_aff_xml)

# Wrap the fragment in a root element and parse it with a temporary parser
temp_xml = f"""<article>{sample_aff_xml}</article>"""
temp_parser = FullTextXMLParser()
temp_parser.parse(temp_xml)

# Use the built-in extract_affiliations method
affiliations = temp_parser.extract_affiliations()

if affiliations:
    aff = affiliations[0]
    print(f"Affiliation ID: {aff.get('id')}\n")
    print(f"Full text: {aff.get('text')}\n")

    if aff.get('markers'):
        print(f"Found markers: {aff.get('markers')}\n")

    if aff.get('parsed_institutions'):
        parsed = aff.get('parsed_institutions', [])
        print(f"✓ Successfully parsed {len(parsed)} institutions:\n")
        for i, inst in enumerate(parsed, 1):
            print(f"  {i}. Marker: {inst.get('marker', 'N/A')}")
            if 'name' in inst:
                print(f"     Name: {inst.get('name')}")
                print(f"     City: {inst.get('city')}")
                print(f"     Postal Code: {inst.get('postal_code')}")
                print(f"     Country: {inst.get('country', 'N/A')}")
            else:
                print(f"     Text: {inst.get('text', 'N/A')}")
            print()
    else:
        print(f"Institution text: {aff.get('institution_text', 'N/A')}")

=== Parsing Complex Multi-Institution Affiliation ===

Sample XML:

<aff id="gkr715-AFF1"><sup>1</sup>Beijing Institute of Radiation Medicine, Beijing 100850 and <sup>2</sup>Anhui Medical University, Hefei 230032, P. R. China</aff>



DEBUG: Extracted 1 affiliations


Affiliation ID: gkr715-AFF1

Full text: 1Beijing Institute of Radiation Medicine, Beijing 100850 and 2Anhui Medical University, Hefei 230032, P. R. China

Found markers: 1, 2

✓ Successfully parsed 2 institutions:

  1. Marker: 1
     Name: Beijing Institute of Radiation Medicine
     City: Beijing
     Postal Code: 100850
     Country: None

  2. Marker: 2
     Name: Anhui Medical University
     City: Hefei
     Postal Code: 230032
     Country: P. R. China
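
One way to approach the marker-based split is to use the fact that, in `xml.etree.ElementTree`, the text following a `<sup>` element is stored in that element's `.tail`. The sketch below pairs each marker with its tail text and trims a trailing "and". `split_by_markers` is a hypothetical helper; it does not attempt the city, postal code, or country parsing shown above and is not the library's algorithm.

```python
import xml.etree.ElementTree as ET


def split_by_markers(aff_elem):
    """Hypothetical sketch: pair each <sup> marker with the text after it."""
    institutions = []
    for sup in aff_elem.findall("sup"):
        text = (sup.tail or "").strip()
        # Drop the conjunction that joins two institutions
        if text.endswith(" and"):
            text = text[: -len(" and")].rstrip()
        institutions.append({"marker": (sup.text or "").strip(), "text": text})
    return institutions


aff = ET.fromstring(
    '<aff id="gkr715-AFF1"><sup>1</sup>Beijing Institute of Radiation Medicine, '
    "Beijing 100850 and <sup>2</sup>Anhui Medical University, Hefei 230032, "
    "P. R. China</aff>"
)
split_result = split_by_markers(aff)
for item in split_result:
    print(item)
```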



### Example: Extract Figures

In [68]:
# Example 12: Custom extractor for figures - Comparison
print("=== Custom Extractor: Figures - Article Comparison ===")
print("\nUsing: extract_elements_by_patterns() + _extract_structured_fields()\n")

for pmcid, parser in parsers.items():
    # First, get all figure elements using the parser's API
    fig_results = parser.extract_elements_by_patterns(
        {"figures": ".//fig"},
        return_type="element"
    )

    figures = []
    for fig_elem in fig_results.get("figures", []):
        fig_data = parser._extract_structured_fields(
            fig_elem,
            {
                "label": "label",
                "caption": "caption",
            }
        )
        # Get ID attribute separately (attributes aren't supported by _extract_structured_fields)
        fig_data["id"] = fig_elem.get("id")
        figures.append(fig_data)

    if figures:
        print(f"📄 {pmcid}: Found {len(figures)} figures")
        for i, fig in enumerate(figures[:2], 1):
            label = fig.get('label', 'No label')
            caption = fig.get('caption', '')
            caption_preview = caption[:60] if caption else 'No caption'
            print(f"\n  {i}. {label}: {caption_preview}...")
            print(f"     ID: {fig.get('id') or 'N/A'}")
    else:
        print(f"📄 {pmcid}: No figures found")
    print()


=== Custom Extractor: Figures - Article Comparison ===

Using: extract_elements_by_patterns() + _extract_structured_fields()

📄 PMC3258128: Found 5 figures

  1. Figure 1.: Affinity purification with biotin-tagged miR-122 from human ...
     ID: gkr715-F1

  2. Figure 2.: A significant enrichment of miR-122 targets by biotin-tagged...
     ID: gkr715-F2

📄 PMC3359999: No figures found



## 4. Complete Metadata Extraction

The `extract_metadata()` method **orchestrates** multiple specialized functions.
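
As an illustration of the kind of small step such an orchestrator delegates to, the sketch below joins `<fpage>` and `<lpage>` into a single page string. The `combine_page_range` function here is a hypothetical stand-in written for this example; the actual `_combine_page_range()` may differ in signature and behavior.

```python
import xml.etree.ElementTree as ET


def combine_page_range(first_page, last_page):
    """Hypothetical helper: join first/last page into 'first-last'."""
    if first_page and last_page and first_page != last_page:
        return f"{first_page}-{last_page}"
    # Fall back to whichever value exists (None if both are missing).
    return first_page or last_page or None


meta = ET.fromstring(
    "<article-meta><volume>40</volume><issue>2</issue>"
    "<fpage>884</fpage><lpage>891</lpage></article-meta>"
)
pages = combine_page_range(meta.findtext("fpage"), meta.findtext("lpage"))
print(pages)
```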

In [69]:
# Example 13: Complete metadata extraction - Comparison
print("=== Complete Metadata Extraction - Article Comparison ===")
print("\nCalls multiple specialized functions:")
print("  - extract_authors()")
print("  - extract_keywords()")
print("  - extract_pub_date()")
print("  - _combine_page_range()")
print("  - And many more...\n")

for pmcid, parser in parsers.items():
    metadata = parser.extract_metadata()

    print(f"📄 {pmcid}")
    # Guard against a missing or None title before slicing
    print(f"Title: {(metadata.get('title') or 'N/A')[:80]}...")
    print(f"\nAuthors: {len(metadata.get('authors', []))} total")
    for i, author in enumerate(metadata.get('authors', [])[:3], 1):
        print(f"  {i}. {author}")
    if len(metadata.get('authors', [])) > 3:
        print(f"  ... and {len(metadata.get('authors', [])) - 3} more")

    print(f"\nJournal: {metadata.get('journal', 'N/A')}")
    print(f"Publication Date: {metadata.get('pub_date', 'N/A')}")
    print(f"Volume: {metadata.get('volume', 'N/A')}, Issue: {metadata.get('issue', 'N/A')}, Pages: {metadata.get('pages', 'N/A')}")

    print(f"\nIdentifiers:")
    print(f"  PMC ID: {metadata.get('pmcid', 'N/A')}")
    print(f"  PMID: {metadata.get('pmid', 'N/A')}")
    print(f"  DOI: {metadata.get('doi', 'N/A')}")

    if metadata.get('keywords'):
        print(f"\nKeywords ({len(metadata.get('keywords', []))}):")
        print(f"  {', '.join(metadata.get('keywords', [])[:5])}")

    if metadata.get('abstract'):
        print(f"\nAbstract: {metadata.get('abstract', '')[:150]}...")

    print("\n" + "="*60 + "\n")


DEBUG: Extracted authors: ['Shuai Li', 'Juanjuan Zhu', 'Hanjiang Fu', 'Jing Wan', 'Zheng Hu', 'Shanshan Liu', 'Jie Li', 'Yi Tie', 'Ruiyun Xing', 'Jie Zhu', 'Zhixian Sun', 'Xiaofei Zheng']


=== Complete Metadata Extraction - Article Comparison ===

Calls multiple specialized functions:
  - extract_authors()
  - extract_keywords()
  - extract_pub_date()
  - _combine_page_range()
  - And many more...



DEBUG: Extracted pub_date: 2012-01
DEBUG: Extracted keywords: []
DEBUG: Extracted metadata for PMC3258128: {'pmcid': '3258128', 'doi': '10.1093/nar/gkr715', 'title': 'Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA through regulating PRKRA', 'journal': 'Nucleic Acids Research', 'volume': '40', 'issue': '2', 'abstract': 'microRNAs (miRNAs) are a versatile class of non-coding RNAs involved in regulation of various biological processes. miRNA-122 (miR-122) is specifically and abundantly expressed in human liver. In this study, we employed 3′-end biotinylated synthetic miR-122 to identify its targets based on affinity purification. Quantitative RT-PCR analysis of the affinity purified RNAs demonstrated a specific enrichment of several known miR-122 targets such as CAT-1 (also called SLC7A1), ADAM17 and BCL-w. Using microarray analysis of affinity purified RNAs, we also discovered many candidate target genes of miR-122. Among these candidates, we confirmed

📄 PMC3258128
Title: Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA...

Authors: 12 total
  1. Shuai Li
  2. Juanjuan Zhu
  3. Hanjiang Fu
  ... and 9 more

Journal: Nucleic Acids Research
Publication Date: 2012-01
Volume: 40, Issue: 2, Pages: 884-891

Identifiers:
  PMC ID: 3258128
  PMID: N/A
  DOI: 10.1093/nar/gkr715

Abstract: microRNAs (miRNAs) are a versatile class of non-coding RNAs involved in regulation of various biological processes. miRNA-122 (miR-122) is specificall...


📄 PMC3359999
Title: Risk Factors of Porcine Cysticercosis in the Eastern Cape Province, South Africa...

Authors: 8 total
  1. Rosina Claudia Krecek
  2. Hamish Mohammed
  3. Lynne Margaret Michael
  ... and 5 more

Journal: PLoS ONE
Publication Date: 2012-05-24
Volume: 7, Issue: 5, Pages: 13-23

Identifiers:
  PMC ID: 3359999
  PMID: N/A
  DOI: 10.1371/journal.pone.0037718

Abstract: There is a high prevalence of Taenia solium taeniosis/cysticercosis in humans and pi

## 5. Advanced: Using `extract_elements_by_patterns()`

The **most flexible** base function for custom extractions.

In [70]:
# Example 14: Direct use of extract_elements_by_patterns - Comparison
print("=== Advanced: extract_elements_by_patterns() - Article Comparison ===")
print("\nThe foundation of all extraction methods\n")

for pmcid, parser in parsers.items():
    # Extract multiple elements at once
    results = parser.extract_elements_by_patterns(
        {
            "title": ".//article-title",
            "journal": ".//journal-title",
            "publisher": ".//publisher-name",
        }
    )

    print(f"📄 {pmcid}:")
    for key, values in results.items():
        if values:
            # Truncate long values to a 60-character preview
            first = values[0]
            preview = f"{first[:60]}..." if len(first) > 60 else first
            print(f"  {key}: {preview}")

    # Extract attributes instead of text
    attr_results = parser.extract_elements_by_patterns(
        {"table_ids": ".//table-wrap"},
        return_type="attribute",
        get_attribute={"table_ids": "id"}
    )

    if attr_results["table_ids"]:
        print(f"  table_ids: {', '.join(attr_results['table_ids'][:3])}")

    # Extract only first match
    first_results = parser.extract_elements_by_patterns(
        {"first_author": ".//contrib[@contrib-type='author']//surname"},
        first_only=True
    )

    if first_results["first_author"]:
        print(f"  first_author: {first_results['first_author'][0]}")

    print()


=== Advanced: extract_elements_by_patterns() - Article Comparison ===

The foundation of all extraction methods

📄 PMC3258128:
  title: Hepato-specific microRNA-122 facilitates accumulation of new...
  journal: Nucleic Acids Research
  publisher: Oxford University Press
  first_author: Li

📄 PMC3359999:
  title: Risk Factors of Porcine Cysticercosis in the Eastern Cape Pr...
  journal: PLoS ONE
  publisher: Public Library of Science
  table_ids: pone-0037718-t001, pone-0037718-t002
  first_author: Krecek
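
The three call styles used in this cell (text, attribute, first match only) can be mimicked with plain `xml.etree.ElementTree`. The function below is a hypothetical stand-in written for illustration; its name, parameters, and return shape are assumptions and do not describe how `extract_elements_by_patterns()` is implemented.

```python
import xml.etree.ElementTree as ET


def extract_by_patterns(root, patterns, return_type="text",
                        attribute=None, first_only=False):
    """Hypothetical stand-in: map each key to the matches of its XPath."""
    results = {}
    for key, xpath in patterns.items():
        elements = root.findall(xpath)
        if first_only:
            elements = elements[:1]
        if return_type == "element":
            results[key] = elements
        elif return_type == "attribute":
            name = (attribute or {}).get(key)
            results[key] = [e.get(name) for e in elements if e.get(name)]
        else:
            # Join all descendant text so inline markup does not cut it short.
            results[key] = ["".join(e.itertext()).strip() for e in elements]
    return results


doc = ET.fromstring(
    "<article><article-title>A <italic>short</italic> title</article-title>"
    "<table-wrap id='t001'/><table-wrap id='t002'/></article>"
)
titles = extract_by_patterns(doc, {"title": ".//article-title"})
table_ids = extract_by_patterns(
    doc, {"table_ids": ".//table-wrap"},
    return_type="attribute", attribute={"table_ids": "id"},
)
print(titles, table_ids)
```

Returning a list for every key, even with `first_only=True`, keeps the calling code uniform: callers can always test `if results[key]` before indexing.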



## 6. XML Element Coverage Analysis

Let's analyze how the parser handles various JATS/XML elements across both articles.

### Structural Elements

Testing how the parser handles: `<sec>`, `<chapter>`, `<paragraph>`, `<p>`, `<table>`, `<table-wrap>`, `<fig>`, `<caption>`, `<list>`, `<list-item>`
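
For a quick survey of which tags a document contains at all, a tag tally over the whole tree is often enough. This is a minimal sketch using only the standard library on a made-up document; it is independent of the parser API used in the cell below.

```python
import xml.etree.ElementTree as ET
from collections import Counter

doc = ET.fromstring(
    "<article><body>"
    "<sec><title>Intro</title><p>One.</p><p>Two.</p></sec>"
    "<sec><title>Methods</title><p>Three.</p></sec>"
    "</body></article>"
)

# iter() walks the whole tree in document order, the root included.
tag_counts = Counter(elem.tag for elem in doc.iter())
for tag in ("sec", "p", "chapter"):
    print(f"<{tag}>: {tag_counts[tag]}")
```

A `Counter` returns 0 for tags that never occur, so absent elements such as `<chapter>` need no special handling.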

In [71]:
print("=== Structural Elements Analysis ===\n")

structural_elements = {
    "sec": ".//sec",           # Sections (JATS)
    "chapter": ".//chapter",   # Chapters (DocBook)
    "paragraph": ".//paragraph", # Paragraph element (less common)
    "p": ".//p",               # Paragraph (JATS standard)
    "table": ".//table",       # Table element
    "table-wrap": ".//table-wrap", # Table wrapper (JATS)
    "fig": ".//fig",           # Figure element (JATS)
    "caption": ".//caption",   # Caption element
    "list": ".//list",         # List element
    "list-item": ".//list-item" # List item
}

for pmcid, parser in parsers.items():
    print(f"📄 {pmcid}:")
    for elem_name, xpath in structural_elements.items():
        results = parser.extract_elements_by_patterns(
            {elem_name: xpath},
            return_type="element"
        )
        count = len(results.get(elem_name, []))
        if count > 0:
            print(f"  ✓ <{elem_name}>: {count} found")

            # Show sample for key elements (count > 0 is already guaranteed here)
            if elem_name in ["sec", "p", "fig", "table-wrap"]:
                sample_elem = results[elem_name][0]
                # Get element ID or first bit of text
                elem_id = sample_elem.get("id", "no-id")
                text_preview = (sample_elem.text or "")[:40].strip()
                if text_preview:
                    print(f"      Sample: id='{elem_id}', text='{text_preview}...'")
                else:
                    print(f"      Sample: id='{elem_id}'")
        else:
            print(f"  ✗ <{elem_name}>: not found")
    print()

# Special analysis: How sections are processed
print("=== Section Processing ===\n")
for pmcid, parser in parsers.items():
    sections = parser.get_full_text_sections()
    print(f"📄 {pmcid}: {len(sections)} sections extracted via get_full_text_sections()")
    print(f"   (uses _extract_section_structure() which processes <sec> → <title> + <p>)")
    print()


DEBUG: Extracted 22 sections from XML: [{'title': 'INTRODUCTION', 'content': 'MicroRNAs (miRNAs) are small conserved RNAs of ∼22\u2009nt which negatively modulate gene expression in animals and plants, primarily through base paring to the 3′-untranslated region (UTR) of target messenger RNAs (mRNAs). This leads to mRNA cleavage and/or translation repression ( 1 ). miRNAs are primarily transcribed by RNA polymerase II as part of capped and polyadenylated primary transcripts (pri-miRNAs) that can be either protein-coding or non-coding. The primary transcript is cleaved by Drosha ribonuclease III enzyme to produce an ∼70-nt stem–loop precursor miRNA (pre-miRNA), which is further cleaved by the cytoplasmic Dicer ribonuclease to generate the mature miRNA. The mature miRNA is incorporated into an RNA-induced silencing complex (RISC), which recognizes target mRNAs through imperfect base pairing with the miRNA. Bioinformatic analysis predicts that each miRNA may regulate hundreds of ta

=== Structural Elements Analysis ===

📄 PMC3258128:
  ✓ <sec>: 22 found
      Sample: id='no-id'
  ✗ <chapter>: not found
  ✗ <paragraph>: not found
  ✓ <p>: 38 found
      Sample: id='no-id', text='The authors wish it to be known that, in...'
  ✗ <table>: not found
  ✗ <table-wrap>: not found
  ✓ <fig>: 5 found
      Sample: id='gkr715-F1'
  ✓ <caption>: 6 found
  ✗ <list>: not found
  ✗ <list-item>: not found

📄 PMC3359999:
  ✓ <sec>: 10 found
      Sample: id='s1'
  ✗ <chapter>: not found
  ✗ <paragraph>: not found
  ✓ <p>: 41 found
      Sample: id='no-id', text='Conceived and designed the experiments:...'
  ✓ <table>: 2 found
  ✓ <table-wrap>: 2 found
      Sample: id='pone-0037718-t001'
  ✗ <fig>: not found
  ✓ <caption>: 2 found
  ✗ <list>: not found
  ✗ <list-item>: not found

=== Section Processing ===



DEBUG: Extracted 10 sections from XML: [{'title': 'Introduction', 'content': 'A high prevalence of Taenia solium taeniosis/cysticercosis is reported from some countries in Africa whereas limited or no information is available from others [1] – [2] . Cysticercois is a disease caused by infection with the larval stages of pork tapeworm, T. solium . [3] . Humans and pigs acquire cysticercosis by ingesting T. solium eggs. Neurocysticercosis (NCC) in humans occurs when cysts develop within the central nervous system. South Africa has the largest number of pigs (most being raised under commercial conditions) in southern Africa, and human and porcine cysticercosis has been recognized as a problem in the country for many decades [2] , [4] – [7] . An extensive national abattoir study in 1937 reported a prevalence of 25% of porcine cysticercosis and an incidence of 10% in the Eastern Cape Province (ECP) of South Africa [1] , [2] . The number of pigs continues to increase throughout southern 

📄 PMC3258128: 22 sections extracted via get_full_text_sections()
   (uses _extract_section_structure() which processes <sec> → <title> + <p>)

📄 PMC3359999: 10 sections extracted via get_full_text_sections()
   (uses _extract_section_structure() which processes <sec> → <title> + <p>)
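
The `<sec>` → `<title>` + `<p>` mapping described in this output can be sketched in a few lines. `extract_sections` below is a hypothetical helper, not the parser's `_extract_section_structure()`; in particular, joining paragraphs with a single space and reading only the direct `<p>` children of each section are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def extract_sections(root):
    """Hypothetical helper: one dict per <sec>, built from its direct <p> children."""
    sections = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections.append({
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "content": " ".join(paragraphs),
        })
    return sections


doc = ET.fromstring(
    "<body><sec><title>INTRODUCTION</title>"
    "<p>First <xref rid='B1'>1</xref> paragraph.</p><p>Second.</p></sec>"
    "<sec><title>RESULTS</title><p>Third.</p></sec></body>"
)
sections = extract_sections(doc)
for section in sections:
    print(section["title"], "->", section["content"])
```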



### Reference/Link Elements

Testing how the parser handles: `<xref>`, `<link>`, `<ext-link>`, `<ref>`

In [72]:
print("=== Reference/Link Elements Analysis ===\n")

reference_elements = {
    "xref": ".//xref",         # Cross-reference (internal)
    "link": ".//link",         # Generic link
    "ext-link": ".//ext-link", # External link (JATS)
    "ref": ".//ref"            # Bibliographic reference
}

for pmcid, parser in parsers.items():
    print(f"📄 {pmcid}:")
    for elem_name, xpath in reference_elements.items():
        results = parser.extract_elements_by_patterns(
            {elem_name: xpath},
            return_type="element"
        )
        count = len(results.get(elem_name, []))
        if count > 0:
            print(f"  ✓ <{elem_name}>: {count} found")

            # Show sample with attributes (count > 0 is already guaranteed here)
            sample_elem = results[elem_name][0]
            attrs = dict(sample_elem.attrib)
            if attrs:
                attr_str = ", ".join([f"{k}='{v[:30]}'" for k, v in list(attrs.items())[:3]])
                print(f"      Sample attributes: {attr_str}")
            text = (sample_elem.text or "")[:40].strip()
            if text:
                print(f"      Sample text: '{text}...'")
        else:
            print(f"  ✗ <{elem_name}>: not found")
    print()

# Special analysis: How references are processed
print("=== Reference Processing ===\n")
for pmcid, parser in parsers.items():
    references = parser.extract_references()
    print(f"📄 {pmcid}: {len(references)} references extracted via extract_references()")
    print(f"   (processes <ref> → <label> + <element-citation>/<mixed-citation>)")

    # Show types of xref found
    xref_results = parser.extract_elements_by_patterns(
        {"xref": ".//xref"},
        return_type="element"
    )
    if xref_results["xref"]:
        xref_types = {}
        for xref in xref_results["xref"][:50]:  # Sample first 50
            ref_type = xref.get("ref-type", "unknown")
            xref_types[ref_type] = xref_types.get(ref_type, 0) + 1
        print(f"   <xref> ref-type distribution: {dict(xref_types)}")

    # Show external link types
    extlink_results = parser.extract_elements_by_patterns(
        {"ext-link": ".//ext-link"},
        return_type="element"
    )
    if extlink_results["ext-link"]:
        link_types = {}
        for link in extlink_results["ext-link"][:50]:
            link_type = link.get("ext-link-type", "unknown")
            link_types[link_type] = link_types.get(link_type, 0) + 1
        print(f"   <ext-link> types: {dict(link_types)}")

    print()


DEBUG: Extracted 47 references from XML: [{'id': 'gkr715-B1', 'label': '1', 'authors': 'DP Bartel', 'title': 'MicroRNAs: genomics, biogenesis, mechanism, and function', 'source': 'Cell', 'year': '2004', 'volume': '116', 'pages': '281-297'}, {'id': 'gkr715-B2', 'label': '2', 'authors': 'A Krek, D Grun, MN Poy, R Wolf, L Rosenberg, EJ Epstein, P MacMenamin, I da Piedade, KC Gunsalus, M Stoffel', 'title': 'Combinatorial microRNA target predictions', 'source': 'Nat. Genet.', 'year': '2005', 'volume': '37', 'pages': '495-500'}, {'id': 'gkr715-B3', 'label': '3', 'authors': 'P Xu, M Guo, BA Hay', 'title': 'MicroRNAs and the regulation of cell death', 'source': 'Trends Genet.', 'year': '2004', 'volume': '20', 'pages': '617-624'}, {'id': 'gkr715-B4', 'label': '4', 'authors': 'AM Cheng, MW Byrom, J Shelton, LP Ford', 'title': 'Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis', 'source': 'Nucleic Acids Res.', 'year': '2005', 'volume': '

=== Reference/Link Elements Analysis ===

📄 PMC3258128:
  ✓ <xref>: 80 found
      Sample attributes: ref-type='aff', rid='gkr715-AFF1'
  ✗ <link>: not found
  ✓ <ext-link>: 17 found
      Sample attributes: ext-link-type='uri', {http://www.w3.org/1999/xlink}href='http://creativecommons.org/lic'
      Sample text: 'http://creativecommons.org/licenses/by-n...'
  ✓ <ref>: 47 found
      Sample attributes: id='gkr715-B1'

📄 PMC3359999:
  ✓ <xref>: 94 found
      Sample attributes: ref-type='aff', rid='aff1'
  ✗ <link>: not found
  ✓ <ext-link>: 5 found
      Sample attributes: ext-link-type='uri', {http://www.w3.org/1999/xlink}href='http://apps.who.int/gb/archive'
      Sample text: 'http://apps.who.int/gb/archive/pdf_files...'
  ✓ <ref>: 28 found
      Sample attributes: id='pone.0037718-World1'

=== Reference Processing ===



DEBUG: Extracted 28 references from XML: [{'id': 'pone.0037718-World1', 'label': '1', 'authors': None, 'title': 'Control of Neurocysticercosis.', 'source': None, 'year': '2003', 'volume': None, 'pages': None}, {'id': 'pone.0037718-Phiri1', 'label': '2', 'authors': 'I Phiri, H Ngowi, S Afonso, E Matenga, M Boa', 'title': 'The emergence of Taenia solium Cysticercosis in Eastern and Southern Africa as a serious agricultural problem and public health risk.', 'source': 'Acta Trop', 'year': '2003', 'volume': '87', 'pages': '13-23'}, {'id': 'pone.0037718-ONeal1', 'label': '3', 'authors': "SE O'Neal, JM Townes, PP Wilkins, JC Noh, D Lee", 'title': 'Seroprevalence of antibodies against Taenia solium cysticerci among refugees resettled in United States.', 'source': 'Emerg Infect Dis', 'year': None, 'volume': None, 'pages': None}, {'id': 'pone.0037718-Krecek1', 'label': '4', 'authors': 'RC Krecek', 'title': 'Third meeting of the Cysticercosis Working Group in Eastern and Southern Africa takes pla

📄 PMC3258128: 47 references extracted via extract_references()
   (processes <ref> → <label> + <element-citation>/<mixed-citation>)
   <xref> ref-type distribution: {'aff': 14, 'corresp': 1, 'bibr': 27, 'fig': 8}
   <ext-link> types: {'uri': 17}

📄 PMC3359999: 28 references extracted via extract_references()
   (processes <ref> → <label> + <element-citation>/<mixed-citation>)
   <xref> ref-type distribution: {'aff': 10, 'corresp': 1, 'bibr': 27, 'table': 3, 'table-fn': 9}
   <ext-link> types: {'uri': 5}
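
The `{http://www.w3.org/1999/xlink}href` key in the sample attributes above is how ElementTree spells a namespaced attribute: the namespace URI in braces, followed by the local name. The sketch below, on a made-up paragraph, shows how to read that attribute and how `collections.Counter` can replace the manual dictionary tally used for the `ref-type` distribution.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XLINK = "http://www.w3.org/1999/xlink"

doc = ET.fromstring(
    f'<p xmlns:xlink="{XLINK}">See <xref ref-type="bibr" rid="B1">1</xref>, '
    '<xref ref-type="fig" rid="F1">Figure 1</xref> and '
    '<ext-link ext-link-type="uri" xlink:href="http://example.org/data">'
    "http://example.org/data</ext-link>.</p>"
)

# Namespaced attributes are keyed as "{namespace-uri}local-name".
link = doc.find("ext-link")
href = link.get(f"{{{XLINK}}}href")

# Tally ref-type values; missing attributes are counted as "unknown".
ref_types = Counter(x.get("ref-type", "unknown") for x in doc.iter("xref"))
print(href, dict(ref_types))
```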



### How Text Content is Extracted

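With `use_full_text=True`, the parser extracts all text under an element, including text inside nested elements.

The text content of inline elements such as `<xref>`, `<italic>`, and `<bold>` is therefore **included** in the text output; the tags themselves and their attributes are not.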
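
The difference between an element's own `.text` and its full text is easiest to see on a tiny, made-up paragraph. This sketch uses only `xml.etree.ElementTree` and does not depend on the parser.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>Binding was confirmed <italic>in vitro</italic> "
    '(<xref ref-type="bibr" rid="B1">1</xref>).</p>'
)

direct_text = p.text                 # stops at the first child element
full_text = "".join(p.itertext())    # walks text and tails of all descendants

print(repr(direct_text))
print(repr(full_text))
```

Here `.text` ends right before `<italic>`, while `itertext()` also yields the text inside `<italic>` and `<xref>` and the text between them.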

In [73]:
print("=== Text Extraction with Nested Elements ===\n")

for pmcid, parser in parsers.items():
    print(f"📄 {pmcid}:\n")

    # Find a paragraph with nested elements
    p_results = parser.extract_elements_by_patterns(
        {"p": ".//body//p"},
        return_type="element"
    )

    if p_results["p"]:
        # Look for a paragraph with xref or other nested elements
        sample_p = None
        for p in p_results["p"][:20]:  # Check first 20 paragraphs
            # Check if it has nested elements
            has_nested = False
            for child in p:
                if child.tag in ["xref", "italic", "bold", "ext-link"]:
                    has_nested = True
                    break
            if has_nested:
                sample_p = p
                break

        if sample_p is not None:
            # Method 1: Direct text (only immediate text)
            direct_text = sample_p.text or ""

            # Method 2: Full text using itertext() - what the parser uses
            full_text = "".join(sample_p.itertext())

            print("  Example paragraph with nested elements:")
            print(f"  - Direct .text only: '{direct_text[:60]}...'")
            print(f"  - Full text (itertext): '{full_text[:200]}...'")
            print()

            # Show what nested elements it contains
            nested_tags = [child.tag for child in sample_p]
            print(f"  Nested elements found: {set(nested_tags)}")
            print()

            # Show XML structure
            import xml.etree.ElementTree as ET
            xml_str = ET.tostring(sample_p, encoding='unicode')
            print(f"  XML structure (first 300 chars):")
            print(f"  {xml_str[:300]}...")
    else:
        print("  No paragraphs found")

    print("\n" + "="*60 + "\n")


=== Text Extraction with Nested Elements ===

📄 PMC3258128:

  Example paragraph with nested elements:
  - Direct .text only: 'MicroRNAs (miRNAs) are small conserved RNAs of ∼22 nt which ...'
  - Full text (itertext): 'MicroRNAs (miRNAs) are small conserved RNAs of ∼22 nt which negatively modulate gene expression in animals and plants, primarily through base paring to the 3′-untranslated region (UTR) of target messe...'

  Nested elements found: {'xref'}

  XML structure (first 300 chars):
  <p>MicroRNAs (miRNAs) are small conserved RNAs of ∼22 nt which negatively modulate gene expression in animals and plants, primarily through base paring to the 3′-untranslated region (UTR) of target messenger RNAs (mRNAs). This leads to mRNA cleavage and/or translation repression (<xref ref-type="bib...


📄 PMC3359999:

  Example paragraph with nested elements:
  - Direct .text only: 'A high prevalence of ...'
  - Full text (itertext): 'A high prevalence of Taenia solium taenio

## 7. Configuration and Flexible Parsing

The parser now supports configuration-based extraction with fallback patterns to handle different XML schema variations gracefully.

### 7.1 Default Configuration

The parser uses the `ElementPatterns` configuration which defines fallback patterns for various elements.

In [74]:
print("=== Default Configuration ===\n")

# The parser uses ElementPatterns with default fallback configurations
print("Default Citation Types Supported:")
citation_types = ["element-citation", "mixed-citation", "nlm-citation", "citation"]
for ct in citation_types:
    print(f"  - {ct}")

print("\nDefault Author Element Patterns:")
author_patterns = [
    ".//contrib[@contrib-type='author']/name",
    ".//contrib[@contrib-type='author']",
    ".//author-group/author",
    ".//author",
    ".//name"
]
for pattern in author_patterns:
    print(f"  - {pattern}")

print("\nDefault Journal Patterns:")
print(f"  Title: ['.//journal-title', './/source', './/journal']")
print(f"  Volume: ['.//volume', './/vol']")
print(f"  Issue: ['.//issue']")

print("\nThese fallback patterns allow the parser to handle different XML schemas gracefully.")

=== Default Configuration ===

Default Citation Types Supported:
  - element-citation
  - mixed-citation
  - nlm-citation
  - citation

Default Author Element Patterns:
  - .//contrib[@contrib-type='author']/name
  - .//contrib[@contrib-type='author']
  - .//author-group/author
  - .//author
  - .//name

Default Journal Patterns:
  Title: ['.//journal-title', './/source', './/journal']
  Volume: ['.//volume', './/vol']
  Issue: ['.//issue']

These fallback patterns allow the parser to handle different XML schemas gracefully.


### 7.2 Schema Detection

The parser can automatically detect document structure and adapt extraction strategies.
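
To make the idea concrete, the sketch below probes a small made-up document for a few structural features. `SchemaSummary` and `summarize_schema` are hypothetical names invented for this example; they are not the parser's `detect_schema()` and cover only a subset of the properties listed in the cell below.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SchemaSummary:
    """Hypothetical result type; field names are illustrative only."""
    has_tables: bool = False
    has_figures: bool = False
    citation_types: list = field(default_factory=list)


def summarize_schema(root):
    """Hypothetical helper: probe the tree for a few structural features."""
    candidates = ["element-citation", "mixed-citation", "nlm-citation", "citation"]
    return SchemaSummary(
        has_tables=root.find(".//table-wrap") is not None,
        has_figures=root.find(".//fig") is not None,
        citation_types=[t for t in candidates if root.find(f".//{t}") is not None],
    )


doc = ET.fromstring(
    "<article><body><fig id='F1'/></body>"
    "<back><ref><mixed-citation>Ref text</mixed-citation></ref></back></article>"
)
schema = summarize_schema(doc)
print(schema)
```

Checking `find(...) is not None` matters here: an element without children is falsy in ElementTree, so a plain truthiness test would miss an empty `<fig/>`.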

In [75]:
print("=== Document Schema Detection ===\n")

# Note: To use the new detect_schema() method and ElementPatterns configuration,
# restart the notebook kernel and re-run all cells.

print("The parser now includes a detect_schema() method that automatically detects:")
print("  - Has tables: bool")
print("  - Has figures: bool")
print("  - Has supplementary materials: bool")
print("  - Has acknowledgments: bool")
print("  - Has funding information: bool")
print("  - Citation types found: list[str]")
print("  - Table structure type: 'jats', 'html', or 'cals'")

print("\nExample usage:")
print("  schema = parser.detect_schema()")
print("  if schema.has_tables:")
print("      tables = parser.extract_tables()")

print("\n✅ Schema detection enables adaptive parsing strategies based on document structure!")

=== Document Schema Detection ===

The parser now includes a detect_schema() method that automatically detects:
  - Has tables: bool
  - Has figures: bool
  - Has supplementary materials: bool
  - Has acknowledgments: bool
  - Has funding information: bool
  - Citation types found: list[str]
  - Table structure type: 'jats', 'html', or 'cals'

Example usage:
  schema = parser.detect_schema()
  if schema.has_tables:
      tables = parser.extract_tables()

✅ Schema detection enables adaptive parsing strategies based on document structure!


### 7.3 Custom Configuration Example

You can create custom configurations to handle non-standard XML schemas or add support for additional patterns.
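
One common way to model such a configuration is a dataclass whose fields default to pattern lists, so that a caller overrides only what differs. `PatternConfig` below is a hypothetical stand-in for illustration; its field names and defaults are assumptions and do not describe the actual `ElementPatterns` class.

```python
from dataclasses import dataclass, field


@dataclass
class PatternConfig:
    """Hypothetical stand-in for a pattern configuration object."""
    citation_types: list = field(
        default_factory=lambda: ["element-citation", "mixed-citation"]
    )
    journal_title_patterns: list = field(
        default_factory=lambda: [".//journal-title", ".//source"]
    )


default_config = PatternConfig()
custom_config = PatternConfig(
    citation_types=["element-citation", "mixed-citation", "custom-citation"]
)

# Only the overridden field changes; the other keeps its default list.
print(custom_config.citation_types)
print(custom_config.journal_title_patterns)
```

Using `default_factory` gives every instance its own list, so editing one configuration's patterns cannot leak into another.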

In [76]:
print("=== Custom Configuration Example ===\n")

# Note: After kernel restart, you can create custom configurations like this:
print("from pyeuropepmc.fulltext_parser import FullTextXMLParser, ElementPatterns")
print()
print("# Create custom configuration")
print("custom_config = ElementPatterns(")
print("    citation_types=['element-citation', 'mixed-citation', 'custom-citation'],")
print("    author_element_patterns=['.//author/name', './/contributor']")
print(")")
print()
print("# Use custom configuration")
print("parser = FullTextXMLParser(config=custom_config)")
print("parser.parse('article.xml')")
print()
print("# Extract with custom patterns")
print("metadata = parser.extract_metadata()")

print("\n✅ Custom configurations allow you to adapt the parser to any XML schema!")

=== Custom Configuration Example ===

from pyeuropepmc.fulltext_parser import FullTextXMLParser, ElementPatterns

# Create custom configuration
custom_config = ElementPatterns(
    citation_types=['element-citation', 'mixed-citation', 'custom-citation'],
    author_element_patterns=['.//author/name', './/contributor']
)

# Use custom configuration
parser = FullTextXMLParser(config=custom_config)
parser.parse('article.xml')

# Extract with custom patterns
metadata = parser.extract_metadata()

✅ Custom configurations allow you to adapt the parser to any XML schema!


### 7.4 Fallback Pattern Demonstration

The parser tries multiple patterns in order until one succeeds, enabling graceful handling of schema variations.
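
A first-match fallback of this kind fits in a few lines. `first_match_text` below is a hypothetical helper written for this example, using the journal-title pattern order listed in this notebook; treating an empty element as "no match" is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET


def first_match_text(root, patterns):
    """Hypothetical helper: text of the first pattern that yields non-empty text."""
    for xpath in patterns:
        elem = root.find(xpath)
        if elem is not None:
            text = "".join(elem.itertext()).strip()
            if text:
                return text
    return None


JOURNAL_PATTERNS = [".//journal-title", ".//source", ".//journal"]

jats = ET.fromstring("<article><journal-title>PLoS ONE</journal-title></article>")
other = ET.fromstring("<record><source>Acta Trop</source></record>")
empty = ET.fromstring("<record/>")

print(first_match_text(jats, JOURNAL_PATTERNS))
print(first_match_text(other, JOURNAL_PATTERNS))
print(first_match_text(empty, JOURNAL_PATTERNS))
```

Because the patterns are tried in order, the most specific pattern should come first and the most generic one last.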

In [77]:
print("=== Fallback Pattern Demonstration ===\n")

print("How Fallback Patterns Work:")
print("  1. Parser tries first pattern in the list")
print("  2. If match found, uses that value")
print("  3. If no match, tries next pattern")
print("  4. Continues until match found or list exhausted")
print()

print("Example: Journal Title Extraction")
print("  Patterns (in order): ['.//journal-title', './/source', './/journal']")
print("  - Standard JATS uses './/journal-title'")
print("  - Some schemas use './/source'")
print("  - Custom schemas might use './/journal'")
print()

print("Example: Author Name Extraction")
print("  Element patterns tried:")
print("    1. .//contrib[@contrib-type='author']/name  (JATS standard)")
print("    2. .//contrib[@contrib-type='author']       (Alternative)")
print("    3. .//author-group/author                   (Older format)")
print("    4. .//author                                (Simple format)")
print("    5. .//name                                  (Generic)")
print()

print("  Field patterns for names:")
print("    - given_names: ['.//given-names', './/given-name', './/given', ...]")
print("    - surname: ['.//surname', './/family', './/last-name', ...]")

print("\n✅ This multi-level fallback system handles:")
print("   - Different XML schemas (JATS, NLM, custom)")
print("   - Schema evolution over time")
print("   - Non-standard implementations")
print("   - Missing or renamed elements")

=== Fallback Pattern Demonstration ===

How Fallback Patterns Work:
  1. Parser tries first pattern in the list
  2. If match found, uses that value
  3. If no match, tries next pattern
  4. Continues until match found or list exhausted

Example: Journal Title Extraction
  Patterns (in order): ['.//journal-title', './/source', './/journal']
  - Standard JATS uses './/journal-title'
  - Some schemas use './/source'
  - Custom schemas might use './/journal'

Example: Author Name Extraction
  Element patterns tried:
    1. .//contrib[@contrib-type='author']/name  (JATS standard)
    2. .//contrib[@contrib-type='author']       (Alternative)
    3. .//author-group/author                   (Older format)
    4. .//author                                (Simple format)
    5. .//name                                  (Generic)

  Field patterns for names:
    - given_names: ['.//given-names', './/given-name', './/given', ...]
    - surname: ['.//surname', './/family', './/last-name', ...]

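✅ This multi-level fallback system handles:
   - Different XML schemas (JATS, NLM, custom)
   - Schema evolution over time
   - Non-standard implementations
   - Missing or renamed elements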

## Summary: How FullTextXMLParser Handles XML Elements

### ✅ **Fully Supported Elements**

| Element | Handled By | Notes |
|---------|-----------|-------|
| `<sec>` | `get_full_text_sections()`, `_extract_section_structure()` | Primary structural element for sections |
| `<p>` | `_extract_flat_texts()` with `use_full_text=True` | Paragraphs with all nested content |
| `<table>`, `<table-wrap>` | `extract_tables()`, `_parse_table_modular()` | Full table extraction with headers/rows |
| `<fig>` | `extract_elements_by_patterns()` + manual processing | Figure metadata extraction |
| `<caption>` | Extracted as part of tables/figures | Caption text |
| `<ref>` | `extract_references()` | Bibliographic references with full metadata |
| `<xref>` | **Included in text** via `itertext()` | Cross-reference text (e.g. citation numbers) preserved in content |
| `<ext-link>` | **Included in text** via `itertext()` | Link text preserved in content; `xlink:href` is available from the element |

### ⚠️ **Partially Supported**

| Element | Status | Handling |
|---------|--------|----------|
| `<list>`, `<list-item>` | Not specifically extracted; absent from both test articles | Included in paragraph text via `itertext()` |
| `<link>` | Not common in JATS/PMC; absent from both test articles | Would be included in text if present |

### ❌ **Not Found in Test Articles**

| Element | Status |
|---------|--------|
| `<chapter>` | DocBook-specific, not in JATS |
| `<paragraph>` | Less common, JATS uses `<p>` |

### 🔑 **Key Design Principle**

The parser uses **`use_full_text=True`** with `element.itertext()`, which means:

- The text of **all nested elements** (`<xref>`, `<italic>`, `<bold>`, `<ext-link>`, etc.) is **included** in extracted text
- No text is lost to inline markup, although the tags and their attributes do not appear in the extracted text
- Text extraction is **comprehensive** rather than selective

### 📊 **Attribute Handling**

| Attribute | Elements | Handled By |
|-----------|----------|-----------|
| `ref-type`, `rid` | `<xref>` | Available via `extract_elements_by_patterns()` with `return_type="element"` |
| `href`, `rel`, `title` | `<link>` | Available via element attributes |
| `ext-link-type`, `xlink:href` | `<ext-link>` | Available via element attributes |
| `id` | `<ref>`, `<fig>`, `<table-wrap>` | Extracted as metadata in specialized methods |