Open
Description
Summary
Add schema parameter to download_pdfs_from_webpage() and process_pdfs() to apply site-specific configurations automatically.
Design
API Changes
def download_pdfs_from_webpage(
    url: str,
    # ... existing parameters ...
    schema: Optional[Union[str, SiteSchema]] = None,  # NEW
) -> ProcessResult:
    """
    Args:
        schema: Site schema to use. Can be:
            - None: No schema (current behavior)
            - 'auto': Auto-detect from URL
            - str: Schema name (e.g., 'springer_book')
            - SiteSchema: Schema instance
    """
Implementation
def download_pdfs_from_webpage(url, ..., schema=None):
    # Resolve schema
    resolved_schema = None
    if schema == 'auto':
        resolved_schema = detect_schema(url)
        if resolved_schema:
            logger.info(f"Auto-detected schema: {resolved_schema.name}")
    elif isinstance(schema, str):
        resolved_schema = get_schema(schema)
        if not resolved_schema:
            raise ValueError(f"Unknown schema: {schema}")
    elif isinstance(schema, SiteSchema):
        resolved_schema = schema

    # Apply schema defaults (explicit params override)
    if resolved_schema:
        if sort_by is None:
            sort_by = resolved_schema.sort_by
        if sort_key is None:
            sort_key = resolved_schema.sort_key
        if filter_config is None:
            filter_config = resolved_schema.get_filter_config()
        if output_name is None:
            output_name = resolved_schema.default_output_name
        # Use schema's recommended depth if not specified
        if recursion_depth == 0 and resolved_schema.recommended_depth > 0:
            recursion_depth = resolved_schema.recommended_depth

    # Continue with existing logic...
Usage Examples
from fetcharoo import download_pdfs_from_webpage

# Auto-detect schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'
)

# Explicit schema by name
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_book'
)

# Schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Explicit params override schema defaults
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides schema's 'numeric'
)
Tasks
- Add schema parameter to download_pdfs_from_webpage()
- Add schema parameter to process_pdfs()
- Implement schema resolution logic (auto/name/instance)
- Apply schema defaults with explicit param override
- Log when auto-detection succeeds
- Raise clear error for unknown schema names
- Update function docstrings
- Add integration tests
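Since both download_pdfs_from_webpage() and process_pdfs() need the same resolution step, the resolution and unknown-name tasks above could live in one shared helper. The following is a minimal sketch under stated assumptions: the SiteSchema class, its url_prefix field, the _REGISTRY dict, and the resolve_schema() name are hypothetical stand-ins for what #11 and #12 will provide, and the TypeError for unsupported argument types is an addition not described in this issue.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional, Union

logger = logging.getLogger(__name__)


@dataclass
class SiteSchema:
    """Hypothetical stand-in for the SiteSchema dataclass from #11."""
    name: str
    url_prefix: str  # assumed matching rule; the real registry may differ


# Hypothetical registry standing in for the one planned in #12.
_REGISTRY: Dict[str, SiteSchema] = {
    "springer_book": SiteSchema("springer_book", "https://link.springer.com/book/"),
}


def get_schema(name: str) -> Optional[SiteSchema]:
    """Look up a schema by name; None if it is not registered."""
    return _REGISTRY.get(name)


def detect_schema(url: str) -> Optional[SiteSchema]:
    """Return the first registered schema whose prefix matches the URL."""
    for candidate in _REGISTRY.values():
        if url.startswith(candidate.url_prefix):
            return candidate
    return None


def resolve_schema(
    url: str, schema: Optional[Union[str, SiteSchema]]
) -> Optional[SiteSchema]:
    """Turn the `schema` argument (None, 'auto', name, instance) into a schema."""
    if schema is None:
        return None
    if isinstance(schema, SiteSchema):
        return schema
    if schema == "auto":
        resolved = detect_schema(url)
        if resolved is not None:
            logger.info("Auto-detected schema: %s", resolved.name)
        return resolved  # None means no schema matched; current behavior applies
    if isinstance(schema, str):
        resolved = get_schema(schema)
        if resolved is None:
            known = ", ".join(sorted(_REGISTRY))
            raise ValueError(f"Unknown schema: {schema!r}. Known schemas: {known}")
        return resolved
    # Assumption: reject other types instead of silently ignoring them.
    raise TypeError(f"schema must be None, str, or SiteSchema, got {type(schema).__name__}")
```

Listing the registered names in the ValueError message is one way to make the error "clear"; whether to include them is a design choice left to the implementer.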
Acceptance Criteria
- schema='auto' correctly detects and applies schemas
- Named schemas work: schema='springer_book'
- Schema instances work with custom settings
- Explicit parameters always override schema defaults
- Clear error message for unknown schema names
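The override criterion deserves a concrete check. The Implementation section treats recursion_depth == 0 as "not specified", so an explicit recursion_depth=0 would still be replaced by the schema's recommended depth. The sketch below assumes a None default for recursion_depth instead, which is a possible change and not part of this issue's design. The apply_schema_defaults() helper, the stand-in SiteSchema, and the sample values 'book.pdf' and depth 1 are illustrative assumptions; sort_key and filter_config are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SiteSchema:
    """Hypothetical stand-in; field names follow the Implementation section."""
    name: str
    sort_by: Optional[str] = None
    default_output_name: Optional[str] = None
    recommended_depth: int = 0


def apply_schema_defaults(
    schema: Optional[SiteSchema],
    *,
    sort_by: Optional[str] = None,
    output_name: Optional[str] = None,
    recursion_depth: Optional[int] = None,  # assumption: None means "not given"
) -> Dict[str, Any]:
    """Fill arguments left as None from the schema; explicit values always win."""
    if schema is not None:
        if sort_by is None:
            sort_by = schema.sort_by
        if output_name is None:
            output_name = schema.default_output_name
        if recursion_depth is None:
            recursion_depth = schema.recommended_depth
    if recursion_depth is None:
        recursion_depth = 0  # no schema and no explicit value: current behavior
    return {
        "sort_by": sort_by,
        "output_name": output_name,
        "recursion_depth": recursion_depth,
    }


# Illustrative schema values, not taken from the real built-in schemas.
springer = SiteSchema(
    name="springer_book",
    sort_by="numeric",
    default_output_name="book.pdf",
    recommended_depth=1,
)

defaults_only = apply_schema_defaults(springer)
overridden = apply_schema_defaults(springer, sort_by="alpha", recursion_depth=0)
```

Here defaults_only takes every value from the schema, while overridden keeps both sort_by='alpha' and the explicit depth of 0. An integration test for the fourth criterion could assert exactly that pair of outcomes.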
Dependencies
- Create SiteSchema base dataclass #11 (SiteSchema base class)
- Implement schema registry with auto-detection #12 (Schema registry)
- Add built-in schemas for common sites (Springer, arXiv) #13 (Built-in schemas)
Part of
Parent issue: #10