# 📄 PDF Extraction Pipeline for RAG

This notebook runs the complete PDF extraction pipeline on **Google Colab** with GPU acceleration.

**Pipeline Phases:**
1. **Phase 1**: PDF → Markdown extraction (Marker ML)
2. **Phase 2-3**: PubMed metadata enrichment & validation
3. **Phase 4**: JSONL generation for vector databases

---

## 🔧 Setup


In [None]:
# Mount Google Drive
from google.colab import drive
import os

print("🔌 Mounting Google Drive...")
drive.mount('/content/drive', force_remount=True)
print("✅ Drive mounted!")


In [None]:
# Configure project path - UPDATE THIS to your folder name
PROJECT_FOLDER = "pdf_extraction"  # Change if your folder has a different name

project_path = f"/content/drive/MyDrive/{PROJECT_FOLDER}"

# Navigate to project
%cd {project_path}

print(f"📂 Working directory: {os.getcwd()}")
print("\n📁 Project files:")
!ls -la


In [None]:
# Install dependencies
print("📦 Installing dependencies...")
print("This may take a few minutes on first run.\n")

!pip install -q marker-pdf pydantic pypdfium2 requests

print("\n✅ Dependencies installed!")


In [None]:
# Verify GPU is available
import torch

print("🔍 Checking hardware...")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU detected: {gpu_name}")
    print("   Pipeline will use GPU acceleration (10-20x faster)")
else:
    print("⚠️ No GPU detected. Pipeline will run on CPU (slower).")
    print("   Go to Runtime → Change runtime type → GPU")


## 📥 Check Input PDFs

Make sure your PDFs are in the `data/raw/` folder.


In [None]:
# Check input PDFs
import glob

pdf_files = glob.glob("data/raw/*.pdf")

print(f"📄 Found {len(pdf_files)} PDF(s) in data/raw/")

if pdf_files:
    print("\nFiles:")
    for f in pdf_files[:10]:  # Show first 10
        print(f"   - {os.path.basename(f)}")
    if len(pdf_files) > 10:
        print(f"   ... and {len(pdf_files) - 10} more")
else:
    print("\n⚠️ No PDFs found!")
    print("   Upload PDFs to: data/raw/")
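
Before the slow extraction step, it can help to confirm that the files in `data/raw/` really are PDFs. The sketch below is a rough sanity check and not part of the pipeline scripts: `looks_like_pdf` and `find_suspect_pdfs` are hypothetical helpers that only test for the `%PDF-` signature at the start of each file, so they catch empty or mislabelled uploads but not deeper corruption.

```python
import glob
import os


def looks_like_pdf(path):
    """Return True if the file starts with the '%PDF-' signature.

    Hypothetical helper for illustration; it does not parse the document.
    """
    try:
        with open(path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Unreadable files are treated as suspect
        return False


def find_suspect_pdfs(folder="data/raw"):
    """List files in `folder` that end in .pdf but lack the PDF signature."""
    paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))
    return [p for p in paths if not looks_like_pdf(p)]
```

Calling `find_suspect_pdfs()` from the project directory returns a list of paths worth re-uploading; an empty list means every file at least carries a PDF header.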


---
## 🚀 Phase 1: PDF Extraction

Extracts text from PDFs using the **Marker** ML library with GPU acceleration.


In [None]:
print("🚀 Starting Phase 1: PDF Extraction")
print("=" * 50)
print("This may take several minutes depending on PDF count.")
print("=" * 50)

!python pdf_marker_extraction.py

print("\n" + "=" * 50)
print("✅ Phase 1 complete!")
print("   Output: data/marker_outputs/")
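
A `!python ...` line does not stop the cell when the script fails, so the completion message in the cell above prints either way. A minimal sketch of a stricter runner follows, assuming the pipeline scripts signal failure through a non-zero exit code; the helper name `run_phase` is hypothetical.

```python
import subprocess
import sys


def run_phase(script, label):
    """Run a pipeline script and report success based on its exit code.

    Assumes the script exits non-zero on failure. `run_phase` is an
    illustrative helper, not part of the pipeline itself.
    """
    # Use the notebook's own interpreter so installed packages are visible
    result = subprocess.run([sys.executable, script])
    if result.returncode == 0:
        print(f"{label} complete!")
        return True
    print(f"{label} failed (exit code {result.returncode})")
    return False
```

For example, `run_phase("pdf_marker_extraction.py", "Phase 1")` could replace the `!python` line, and the same call shape would apply to the later phases.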


In [None]:
# Check Phase 1 output
import json  # imported at cell level so later cells can use it even when no files are found

json_files = glob.glob("data/marker_outputs/*.json")
print(f"📊 Phase 1 produced {len(json_files)} JSON file(s)")

if json_files:
    # Show a sample from the first output file
    with open(json_files[0], 'r', encoding='utf-8') as f:
        sample = json.load(f)
    metadata = sample.get('metadata') or {}  # tolerate a missing or null metadata block
    print(f"\n📄 Sample output from: {os.path.basename(json_files[0])}")
    print(f"   Title: {str(metadata.get('title') or 'N/A')[:60]}...")
    print(f"   DOI: {metadata.get('doi') or 'N/A'}")
    print(f"   Text length: {len(sample.get('text') or ''):,} chars")


---
## 🔍 Phase 2-3: PubMed Enrichment

Validates and enriches metadata using the **PubMed E-utilities API**.


In [None]:
# Optional: Set PubMed API key for faster processing
# Get free API key: https://www.ncbi.nlm.nih.gov/account/

# Uncomment and fill in if you have an API key:
# os.environ["PUBMED_API_KEY"] = "your_api_key_here"
# os.environ["PUBMED_EMAIL"] = "your@email.com"

print("ℹ️ PubMed API configuration:")
if os.environ.get("PUBMED_API_KEY"):
    print("   ✅ API key set (10 requests/sec)")
else:
    print("   ⚠️ No API key (3 requests/sec)")
    print("   Set PUBMED_API_KEY for faster processing")
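
The two rates quoted above translate into a minimum delay between requests. The sketch below shows one way a client might choose that delay and build query parameters for the E-utilities ESearch endpoint. It is not taken from `pubmed_enrichment.py`, whose internals are not shown here; `request_interval` and `esearch_params` are hypothetical helpers, and the `tool` value is a placeholder.

```python
import os

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def request_interval(env=os.environ):
    """Seconds to wait between requests: 1/10 s with an API key, 1/3 s without."""
    return 1 / 10 if env.get("PUBMED_API_KEY") else 1 / 3


def esearch_params(term, env=os.environ):
    """Build query parameters for an ESearch call against the PubMed database."""
    params = {"db": "pubmed", "term": term, "retmode": "json"}
    if env.get("PUBMED_API_KEY"):
        params["api_key"] = env["PUBMED_API_KEY"]
    if env.get("PUBMED_EMAIL"):
        params["email"] = env["PUBMED_EMAIL"]
        params["tool"] = "pdf_extraction_pipeline"  # placeholder tool name
    return params
```

A caller would sleep for `request_interval()` seconds between calls such as `requests.get(ESEARCH_URL, params=esearch_params("some article title"))`.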


In [None]:
print("🔍 Starting Phase 2-3: PubMed Enrichment")
print("=" * 50)

!python pubmed_enrichment.py

print("\n" + "=" * 50)
print("✅ Phase 2-3 complete!")
print("   Output: data/processed/")


In [None]:
# Check Phase 2-3 output
final_files = glob.glob("data/processed/*_final.json")
failed_files = glob.glob("data/failed/*.json")

print("📊 Phase 2-3 results:")
print(f"   ✅ Successful: {len(final_files)}")
print(f"   ❌ Failed: {len(failed_files)}")

if final_files:
    # Show sample
    with open(final_files[0], 'r') as f:
        sample = json.load(f)
    print("\n📄 Sample enriched document:")
    print(f"   Title: {sample.get('Title', 'N/A')[:60]}...")
    print(f"   Link: {sample.get('Link', 'N/A')}")
    print(f"   Citation: {sample.get('Citation', 'N/A')[:80]}...")


---
## 📦 Phase 4: Generate JSONL

Combines all documents into a single JSONL file for database ingestion.


In [None]:
print("📦 Starting Phase 4: JSONL Generation")
print("=" * 50)

!python combine_json_to_jsonl.py

print("\n" + "=" * 50)
print("✅ Phase 4 complete!")
print("   Output: Output/pdf_extraction.jsonl")
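
In a JSONL file, each line is one complete JSON document. The following sketch shows how the `*_final.json` files could be merged into that format. It is an illustration only: `combine_json_to_jsonl.py` is not shown here and may work differently, and the function name `combine_to_jsonl` is hypothetical.

```python
import glob
import json
import os


def combine_to_jsonl(input_dir, output_path, pattern="*_final.json"):
    """Write each matching JSON file as one line of a JSONL file.

    Returns the number of documents written.
    """
    paths = sorted(glob.glob(os.path.join(input_dir, pattern)))
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                doc = json.load(f)
            # json.dumps escapes newlines inside strings, so each
            # document stays on a single line
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With the folder layout used in this notebook, the call would look like `combine_to_jsonl("data/processed", "Output/pdf_extraction.jsonl")`.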


In [None]:
# Verify final output
output_file = "Output/pdf_extraction.jsonl"

if os.path.exists(output_file):
    with open(output_file, 'r') as f:
        line_count = sum(1 for _ in f)
    
    size_mb = os.path.getsize(output_file) / (1024 * 1024)
    
    print("📊 Final JSONL file:")
    print(f"   Documents: {line_count}")
    print(f"   Size: {size_mb:.2f} MB")
    print(f"   Path: {output_file}")
    
    with open(output_file, 'r') as f:
        first_line = f.readline().strip()
    if first_line:  # an empty file has no fields to list
        print("\n📄 Document fields:")
        for key in json.loads(first_line):
            print(f"   - {key}")
else:
    print("❌ Output file not found!")
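
Counting lines does not confirm that every line is valid JSON. A sketch of a stricter check follows, assuming each document should be a JSON object; `validate_jsonl` is a hypothetical helper, not part of the pipeline.

```python
import json


def validate_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSONL file.

    Blank lines are skipped. A line counts as valid only if it parses
    to a JSON object; anything else is reported by 1-based line number.
    """
    valid, bad = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if isinstance(doc, dict):
                valid += 1
            else:
                bad.append(lineno)
    return valid, bad
```

Running `validate_jsonl("Output/pdf_extraction.jsonl")` before ingestion gives a document count that can be compared with the number of successful files from Phase 2-3.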


---
## ✅ Pipeline Complete!

Your extracted documents are ready for use in RAG systems.

**Output locations:**
- Individual JSONs: `data/processed/`
- Combined JSONL: `Output/pdf_extraction.jsonl`
- Logs: `logs/`


In [None]:
# Summary
print("🎉 PIPELINE COMPLETE!")
print("=" * 50)
print("\n📁 Output files in Google Drive:")
print(f"   {project_path}/data/processed/")
print(f"   {project_path}/Output/pdf_extraction.jsonl")
print("\n📋 Next steps:")
print("   1. Download the JSONL file")
print("   2. Upload to your vector database")
print("   3. Use in your RAG application")
