Fix/openalex #71
Pull Request Overview
This PR refactors the OpenAlex integration to improve robustness and error handling when processing academic documents. The changes focus on making data models more resilient to incomplete API responses and adding validation for OpenAlex URLs.
Key changes:
- Made all Pydantic model fields optional to handle incomplete data from the OpenAlex API
- Added validation and error handling for OpenAlex URL extraction with new exception types
- Improved logging to show success/error rates relative to total documents processed
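The first key change can be illustrated with a minimal sketch. The model names (`Author`, `Work`) and their fields are hypothetical stand-ins chosen for illustration, not the actual models in `open_alex.py`; the only point is that every field is `Optional` with a default of `None`, so a partial payload still validates.

```python
from typing import Optional

from pydantic import BaseModel


class Author(BaseModel):
    # Hypothetical model: every field is optional and defaults to None,
    # so a key missing from the API response is not a validation error.
    id: Optional[str] = None
    display_name: Optional[str] = None


class Work(BaseModel):
    # Hypothetical model, not the one defined in the PR.
    id: Optional[str] = None
    title: Optional[str] = None
    publication_year: Optional[int] = None
    authors: Optional[list[Author]] = None


# A partial payload: no title, no publication year, author without an id.
partial = {
    "id": "https://openalex.org/W0000000000",
    "authors": [{"display_name": "A. Author"}],
}
work = Work(**partial)
```

The trade-off of this approach is that downstream code has to check for `None` before using a field, since validation no longer guarantees its presence.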
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| welearn_datastack/data/source_models/open_alex.py | Converted all Pydantic model fields to Optional types for robust handling of missing API data |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Added URL validation logic and new exception handling, and fixed a batch processing bug where the wrong document set was used |
| welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py | Enhanced logging to display document retrieval success and error counts |
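The URL validation and logging changes listed in the table might look roughly like the sketch below. The PR's actual code is not shown here, so everything in it is an assumption: `NotEnoughData` and `UnknownURL` are local stand-ins that only reuse the names mentioned in the PR, and `extract_openalex_id`, `collect_ids`, the accepted hostnames, and the log message format are hypothetical.

```python
import logging
from typing import Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


class NotEnoughData(Exception):
    """Local stand-in: the document carries no usable URL or identifier."""


class UnknownURL(Exception):
    """Local stand-in: the URL does not point to OpenAlex."""


def extract_openalex_id(url: Optional[str]) -> str:
    """Return the last path segment of an OpenAlex URL (assumed rules)."""
    if not url:
        raise NotEnoughData("document has no URL")
    parsed = urlparse(url)
    # Assumption: only these two hosts count as OpenAlex URLs.
    if parsed.hostname not in {"openalex.org", "api.openalex.org"}:
        raise UnknownURL(f"not an OpenAlex URL: {url}")
    work_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if not work_id:
        raise NotEnoughData(f"no identifier in URL: {url}")
    return work_id


def collect_ids(urls: list[Optional[str]]) -> tuple[list[str], int]:
    """Extract IDs, skip invalid URLs, and log counts relative to the total."""
    ids: list[str] = []
    errors = 0
    for url in urls:
        try:
            ids.append(extract_openalex_id(url))
        except (NotEnoughData, UnknownURL) as exc:
            errors += 1
            logger.warning("Skipping document: %s", exc)
    total = len(urls)
    logger.info(
        "%d/%d documents retrieved, %d/%d errors", len(ids), total, errors, total
    )
    return ids, errors
```

Called with one valid OpenAlex URL, one `None`, and one URL on another host, `collect_ids` returns the single extracted identifier and an error count of 2, and logs both counts against the total of 3.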
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Stanislas Bruhière <46960549+Nastaliss@users.noreply.github.com>
This pull request introduces several improvements and refactorings to the OpenAlex integration, focusing on better error handling, more robust data modeling, and cleaner code practices. The most significant changes include updating Pydantic models to use optional fields, introducing stricter validation and error handling for OpenAlex URLs, and refactoring type hints and collection handling for consistency.
Data model improvements
- The Pydantic models in `welearn_datastack/data/source_models/open_alex.py` now use `Optional` types for their fields, making them more robust to missing or incomplete data from the OpenAlex API.
Error handling and validation
- Exceptions (`ManagementExceptions`, `NotEnoughData`, `UnknownURL`) are integrated into the OpenAlex URL extraction logic to provide clearer error messages and skip invalid or missing URLs during document processing. [1] [2] [3]
Type hint and collection consistency
- Type hints in `welearn_datastack/plugins/rest_requesters/open_alex.py` now use standard Python collections (`list`, `dict`) instead of the imported `List` and `Dict`, improving code consistency and readability. [1] [2] [3]
Logging enhancements
- Logging in `welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py` now shows both the number of successful document retrievals and the number of errors relative to the total attempted, aiding debugging and monitoring.
API query and batch processing fixes