# PMC XML to Text Conversion

This notebook documents the conversion of a set of PubMed Central (PMC) XML files, a trusted corpus of curated papers from Don Elbert and colleagues, into plain text for use as a retrieval-augmented generation (RAG) source.

## Overview

- **Input**: PMC XML files containing full-text research papers
- **Output**: Plain text files with structured content (title, abstract, sections)
- **Purpose**: Prepare text data for downstream processing, analysis, and knowledge extraction
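
To make the input and output concrete, here is a minimal sketch of the kind of transformation involved. It is not the project's `convert_pmc_xml_to_text.py`, whose internals are not shown in this notebook. It assumes JATS-style tags (`article-title`, `abstract`, `body/sec`), which PMC full-text XML generally uses, and the names `text_of`, `jats_to_text`, the output layout, and the sample document are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_of(element):
    """Return all text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def jats_to_text(xml_string):
    """Convert a JATS-style article to plain text (illustrative sketch only)."""
    root = ET.fromstring(xml_string)
    lines = []

    title = root.find(".//article-title")
    if title is not None:
        lines += ["TITLE", text_of(title), ""]

    abstract = root.find(".//abstract")
    if abstract is not None:
        lines += ["ABSTRACT", text_of(abstract), ""]

    # Top-level body sections: an optional <title> followed by <p> paragraphs.
    # Paragraphs of nested subsections are included; their titles are dropped.
    for sec in root.findall(".//body/sec"):
        sec_title = sec.find("title")
        lines.append(text_of(sec_title).upper() if sec_title is not None else "SECTION")
        lines += [text_of(p) for p in sec.iter("p")]
        lines.append("")

    return "\n".join(lines).strip() + "\n"


# Made-up document, only to exercise the function
sample = """<article>
  <front><article-meta>
    <title-group><article-title>Example <italic>title</italic> with markup</article-title></title-group>
    <abstract><p>Example abstract.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
  </body>
</article>"""

print(jats_to_text(sample))
```

Using `itertext()` rather than `.text` keeps words that sit inside inline markup such as `<italic>`, which is a common source of dropped text in naive converters.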

## Data Source

The XML files are located in `data/alz_papers_3k/` and contain approximately 3,000 Alzheimer's disease-related research papers from PubMed Central.

In [None]:
import os
import sys
from pathlib import Path
import subprocess
from datetime import datetime

# Add the project root to the path so we can import the conversion script
project_root = Path().absolute().parent
sys.path.append(str(project_root))

print(f"Project root: {project_root}")
print(f"Current working directory: {os.getcwd()}")

## Check Input Data

First, let's verify the XML data is available and examine its structure.

In [None]:
# Define paths
xml_data_dir = project_root / "data" / "alz_papers_3k"
output_dir = project_root / "data" / "alz_papers_3k_text"

print(f"XML data directory: {xml_data_dir}")
print(f"Output directory: {output_dir}")
print(f"XML directory exists: {xml_data_dir.exists()}")

if xml_data_dir.exists():
    # Count XML files
    xml_files = list(xml_data_dir.rglob("*.xml"))
    print(f"\nFound {len(xml_files)} XML files")
    
    # Show a few example filenames
    print("\nExample files:")
    for file in xml_files[:5]:
        print(f"  {file.name}")
    
    if len(xml_files) > 5:
        print(f"  ... and {len(xml_files) - 5} more")
else:
    print("❌ XML data directory not found!")
    print("Please ensure the XML data is located at:", xml_data_dir)
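
The cell above counts files and lists names. To look inside a file as well, something like the following sketch could be used. The helper name `summarize_xml` is hypothetical, and the in-memory document at the end is made up so the block runs on its own.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(source):
    """Return the root tag and a count of every element tag in one XML file.

    `source` may be a path or a binary file object.
    """
    root = ET.parse(source).getroot()
    tag_counts = Counter(el.tag for el in root.iter())
    return root.tag, tag_counts


# Made-up document standing in for a real PMC file
demo = io.BytesIO(b"<article><body><sec><p>a</p><p>b</p></sec></body></article>")
root_tag, counts = summarize_xml(demo)
print(root_tag, counts.most_common(3))
```

In this notebook the call would be `summarize_xml(xml_files[0])`. A file that is not well-formed raises `xml.etree.ElementTree.ParseError`, so the same helper doubles as a basic validity check.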

## Run the Conversion Script

Now we'll run the conversion script to transform XML files into text format.

In [None]:
# Create output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)

# Path to the conversion script
script_path = project_root / "scripts" / "convert_pmc_xml_to_text.py"

print(f"Conversion script: {script_path}")
print(f"Script exists: {script_path.exists()}")

if not script_path.exists():
    print("❌ Conversion script not found!")

In [None]:
# Record start time
start_time = datetime.now()
print(f"Starting conversion at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")

# Run the conversion script
cmd = [
    sys.executable,  # use the notebook kernel's interpreter, not whatever "python" is on PATH
    str(script_path),
    str(xml_data_dir),
    str(output_dir),
    "--verbose"
]

print(f"\nRunning command: {' '.join(cmd)}")
print("-" * 60)

# Execute the script and capture output
try:
    result = subprocess.run(
        cmd,
        cwd=project_root,
        capture_output=True,
        text=True,
        timeout=1800  # 30 minute timeout
    )
    
    print("STDOUT:")
    print(result.stdout)
    
    if result.stderr:
        print("\nSTDERR:")
        print(result.stderr)
    
    print(f"\nReturn code: {result.returncode}")
    
except subprocess.TimeoutExpired:
    print("❌ Script timed out after 30 minutes")
except Exception as e:
    print(f"❌ Error running script: {e}")

# Record end time
end_time = datetime.now()
duration = end_time - start_time
print(f"\nConversion completed at: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total duration: {duration}")
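
Note that `subprocess.run` does not raise on a non-zero exit code unless `check=True` is passed, so the cell above prints the return code but does not act on it. A minimal sketch of a wrapper that does act on it is shown below. The name `run_script` and the 2000-character stderr tail are arbitrary choices for illustration, and the demo runs tiny inline commands rather than the real conversion script.

```python
import subprocess
import sys


def run_script(args, timeout=1800):
    """Run the current Python interpreter with `args`.

    Returns (ok, result), where ok is True only for exit code 0.
    """
    result = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Show only the tail of stderr to keep notebook output readable
        print(f"Script failed with return code {result.returncode}")
        print(result.stderr[-2000:])
    return result.returncode == 0, result


# Demo with inline commands instead of the conversion script
ok, result = run_script(["-c", "print('hello')"])
failed_ok, failed_result = run_script(["-c", "import sys; sys.exit(3)"])
print(ok, failed_ok)
```

With the notebook's variables, the arguments would be `[str(script_path), str(xml_data_dir), str(output_dir), "--verbose"]`.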

## Verify Output

Let's check the results of the conversion process.

In [None]:
# Check output directory
if output_dir.exists():
    text_files = list(output_dir.glob("*.txt"))
    print(f"Created {len(text_files)} text files")
    
    if text_files:
        # Show some statistics
        total_size = sum(f.stat().st_size for f in text_files)
        avg_size = total_size / len(text_files)
        
        print(f"Total output size: {total_size / 1024 / 1024:.1f} MB")
        print(f"Average file size: {avg_size / 1024:.1f} KB")
        
        # Show example filenames
        print("\nExample output files:")
        for file in sorted(text_files)[:5]:
            size_kb = file.stat().st_size / 1024
            print(f"  {file.name} ({size_kb:.1f} KB)")
        
        if len(text_files) > 5:
            print(f"  ... and {len(text_files) - 5} more")
    else:
        print("❌ No text files created!")
else:
    print("❌ Output directory not found!")
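
Counting output files shows how many conversions succeeded, but not which inputs are missing. The sketch below lists them, under the assumption (not confirmed by this notebook) that the script names each output after the input file's stem, for example `X.xml` to `X.txt`. The helper name `find_unconverted` and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def find_unconverted(xml_dir, text_dir):
    """List XML files that have no same-named .txt file in text_dir."""
    converted = {p.stem for p in Path(text_dir).glob("*.txt")}
    return sorted(p for p in Path(xml_dir).rglob("*.xml") if p.stem not in converted)


# Demonstration on throwaway files with made-up names
with tempfile.TemporaryDirectory() as tmp:
    demo_xml_dir = Path(tmp) / "xml"
    demo_text_dir = Path(tmp) / "text"
    demo_xml_dir.mkdir()
    demo_text_dir.mkdir()
    for name in ("paper_a.xml", "paper_b.xml"):
        (demo_xml_dir / name).write_text("<article/>", encoding="utf-8")
    (demo_text_dir / "paper_a.txt").write_text("converted", encoding="utf-8")

    missing = [p.name for p in find_unconverted(demo_xml_dir, demo_text_dir)]

print(missing)
```

In this notebook the call would be `find_unconverted(xml_data_dir, output_dir)`. If the script uses a different naming scheme, the stem comparison needs to be adapted.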

## Sample Output

Let's examine a sample converted text file to verify the quality of the conversion.

In [None]:
# Read and display a sample file
if text_files:
    sample_file = text_files[0]
    print(f"Sample file: {sample_file.name}")
    print("=" * 60)
    
    try:
        with open(sample_file, 'r', encoding='utf-8') as f:
            content = f.read()
            
        # Show first 2000 characters
        preview_length = 2000
        if len(content) > preview_length:
            print(content[:preview_length])
            print(f"\n... (showing first {preview_length} characters of {len(content)} total)")
        else:
            print(content)
            
    except Exception as e:
        print(f"❌ Error reading sample file: {e}")
else:
    print("No text files available to sample")
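
Reading one file by eye does not scale to a few thousand documents. A rough automated pass can flag candidates for manual review. The sketch below uses heuristics only: the function name `check_text_quality`, the 200-character threshold, and the tag pattern are assumptions chosen for illustration, and the pattern can produce false positives on text such as `a<b and c>d`.

```python
import re

# Matches things that look like XML/HTML tags, e.g. <p>, </sec>, <sec id="s1">
TAG_PATTERN = re.compile(r"</?[A-Za-z][\w:.-]*(\s[^<>]*)?>")


def check_text_quality(text, min_chars=200):
    """Return a list of warnings for one converted document (heuristics only)."""
    warnings = []
    if len(text.strip()) < min_chars:
        warnings.append("very short")
    if TAG_PATTERN.search(text):
        warnings.append("leftover markup")
    if "\ufffd" in text:
        # U+FFFD appears when bytes could not be decoded cleanly
        warnings.append("replacement characters (possible encoding problem)")
    return warnings


print(check_text_quality("short"))
print(check_text_quality("<p>" + "x" * 300))
```

Applied to the notebook's output, this could be run as `check_text_quality(path.read_text(encoding="utf-8"))` for each `path` in `text_files`, keeping only the files that return a non-empty list.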

## Conversion Summary

The cell below summarizes the XML-to-text conversion run: input and output locations, file counts, success rate, and processing time.

In [None]:
# Generate summary
print("📊 CONVERSION SUMMARY")
print("=" * 50)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Input directory: {xml_data_dir}")
print(f"Output directory: {output_dir}")

if 'xml_files' in locals():
    print(f"Input XML files: {len(xml_files)}")
else:
    print("Input XML files: Not counted")

if 'text_files' in locals():
    print(f"Output text files: {len(text_files)}")
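    # Guard against xml_files being undefined when the input directory was missing
    if 'xml_files' in locals() and xml_files and text_files: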
        success_rate = len(text_files) / len(xml_files) * 100
        print(f"Success rate: {success_rate:.1f}%")
else:
    print("Output text files: Not counted")

if 'duration' in locals():
    print(f"Processing time: {duration}")

print("\n✅ Conversion process completed!")
print("\nThe converted text files are ready for:")
print("- Text analysis and NLP processing")
print("- Knowledge extraction")
print("- Integration into the knowledge graph pipeline")

## Next Steps

With the text files now available, you can:

1. **Text Analysis**: Perform entity recognition, relationship extraction, etc.
2. **Knowledge Extraction**: Extract facts and relationships for the knowledge graph
3. **Quality Assessment**: Review conversion quality and identify any issues
4. **Integration**: Incorporate the text data into your existing knowledge graph pipeline

The converted text files maintain the structure of the original papers with clear section headers (Title, Abstract, Introduction, Methods, etc.), which makes them suitable for downstream processing.