Integrate schema parameter into download functions #14

@MALathon

Summary

Add a schema parameter to download_pdfs_from_webpage() and process_pdfs() so that site-specific settings (sort order, filters, output name, recursion depth) are applied automatically.

Design

API Changes

from typing import Optional, Union

def download_pdfs_from_webpage(
    url: str,
    # ... existing parameters ...
    schema: Optional[Union[str, SiteSchema]] = None,  # NEW
) -> ProcessResult:
    """
    Args:
        schema: Site schema to use. Can be:
            - None: No schema (current behavior)
            - 'auto': Auto-detect from URL
            - str: Schema name (e.g., 'springer_book')
            - SiteSchema: Schema instance
    """

Implementation

def download_pdfs_from_webpage(
    url,
    # ... existing parameters ...
    schema=None,
):
    # Resolve schema
    resolved_schema = None
    if schema == 'auto':
        resolved_schema = detect_schema(url)
        if resolved_schema:
            logger.info(f"Auto-detected schema: {resolved_schema.name}")
    elif isinstance(schema, str):
        resolved_schema = get_schema(schema)
        if not resolved_schema:
            raise ValueError(f"Unknown schema: {schema}")
    elif isinstance(schema, SiteSchema):
        resolved_schema = schema
    
    # Apply schema defaults (explicit params override)
    if resolved_schema:
        if sort_by is None:
            sort_by = resolved_schema.sort_by
        if sort_key is None:
            sort_key = resolved_schema.sort_key
        if filter_config is None:
            filter_config = resolved_schema.get_filter_config()
        if output_name is None:
            output_name = resolved_schema.default_output_name
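        # Use schema's recommended depth if not specified.
        # Caveat: this treats 0 as "not specified", so an explicit
        # recursion_depth=0 is also replaced by the schema's depth.
        if recursion_depth == 0 and resolved_schema.recommended_depth > 0:
            recursion_depth = resolved_schema.recommended_depth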
    
    # Continue with existing logic...
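
The resolution step above could be factored into a helper so it can be shared with process_pdfs(). The following is a minimal, self-contained sketch, not the project's actual code: the SiteSchema stand-in, its host field, the registry, and the hostname-based detection rule are all assumptions made for illustration, and resolve_schema is a hypothetical name. It also rejects unsupported argument types with a TypeError, which the snippet above does not do.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union
from urllib.parse import urlparse


@dataclass
class SiteSchema:
    """Hypothetical stand-in for the project's SiteSchema class."""
    name: str
    host: str  # assumed detection rule: match on the URL's hostname


# Hypothetical registry keyed by schema name.
_REGISTRY: Dict[str, SiteSchema] = {
    "springer_book": SiteSchema(name="springer_book", host="link.springer.com"),
}


def get_schema(name: str) -> Optional[SiteSchema]:
    """Look up a schema by name; None if it is not registered."""
    return _REGISTRY.get(name)


def detect_schema(url: str) -> Optional[SiteSchema]:
    """Return the first registered schema whose host matches the URL."""
    host = urlparse(url).hostname
    for candidate in _REGISTRY.values():
        if candidate.host == host:
            return candidate
    return None


def resolve_schema(
    url: str, schema: Union[None, str, SiteSchema]
) -> Optional[SiteSchema]:
    """Turn the schema argument (None / 'auto' / name / instance) into a schema."""
    if schema is None:
        return None
    if isinstance(schema, SiteSchema):
        return schema
    if isinstance(schema, str):
        if schema == "auto":
            return detect_schema(url)  # may be None when nothing matches
        resolved = get_schema(schema)
        if resolved is None:
            known = ", ".join(sorted(_REGISTRY))
            raise ValueError(f"Unknown schema: {schema!r}. Known schemas: {known}")
        return resolved
    raise TypeError(
        f"schema must be None, str, or SiteSchema, got {type(schema).__name__}"
    )
```

Listing the known names in the error message is one possible way to meet the "clear error" requirement; the exact wording is a design choice.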

Usage Examples

from fetcharoo import download_pdfs_from_webpage

# Auto-detect schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'
)

# Explicit schema by name
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_book'
)

# Schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Explicit params override schema defaults
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides schema's 'numeric'
)

Tasks

  • Add schema parameter to download_pdfs_from_webpage()
  • Add schema parameter to process_pdfs()
  • Implement schema resolution logic (auto/name/instance)
  • Apply schema defaults with explicit param override
  • Log when auto-detection succeeds
  • Raise clear error for unknown schema names
  • Update function docstrings
  • Add integration tests

Acceptance Criteria

  • schema='auto' correctly detects and applies schemas
  • Named schemas work: schema='springer_book'
  • Schema instances work with custom settings
  • Explicit parameters always override schema defaults
  • Clear error message for unknown schema names
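
The "explicit parameters always override" criterion is hard to satisfy for recursion_depth if 0 serves as both the default and a legitimate explicit value. One possible approach is a sentinel default that marks "argument not passed". The sketch below is illustrative only: UNSET, apply_defaults, and the trimmed-down SiteSchema are hypothetical names, the recommended_depth value of 1 in the usage lines is made up, and the fallbacks (no sorting, depth 0) are assumptions about current behavior.

```python
from dataclasses import dataclass
from typing import Any, Optional


class _Unset:
    """Sentinel type meaning 'the caller did not pass this argument'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET: Any = _Unset()


@dataclass
class SiteSchema:
    """Hypothetical stand-in; field names follow the issue's snippet."""
    sort_by: Optional[str] = None
    recommended_depth: int = 0


def apply_defaults(
    schema: Optional[SiteSchema],
    sort_by: Any = UNSET,
    recursion_depth: Any = UNSET,
) -> dict:
    """Fill in only the arguments the caller left unset."""
    # Assumed fallbacks when neither caller nor schema supplies a value:
    # no sorting and a recursion depth of 0.
    if sort_by is UNSET:
        sort_by = schema.sort_by if schema else None
    if recursion_depth is UNSET:
        recursion_depth = schema.recommended_depth if schema else 0
    return {"sort_by": sort_by, "recursion_depth": recursion_depth}


springer = SiteSchema(sort_by="numeric", recommended_depth=1)
from_schema = apply_defaults(springer)
explicit = apply_defaults(springer, sort_by="alpha", recursion_depth=0)
```

Here from_schema picks up the schema's values, while explicit keeps sort_by='alpha' and recursion_depth=0 because both were passed by the caller. The trade-off is that the public signature would change its defaults from None/0 to the sentinel.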

Dependencies

Part of

Parent issue: #10
