@lpi-tn (Collaborator) commented Nov 14, 2025

This pull request introduces several improvements and refactorings to the OpenAlex integration, focusing on better error handling, more robust data modeling, and cleaner code practices. The most significant changes include updating Pydantic models to use optional fields, introducing stricter validation and error handling for OpenAlex URLs, and refactoring type hints and collection handling for consistency.

Data model improvements

  • All Pydantic models in welearn_datastack/data/source_models/open_alex.py now use Optional types for their fields, making them more robust to missing or incomplete data from the OpenAlex API.
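A minimal sketch of what optional fields buy you, assuming a hypothetical `Work` model with three fields (the real models in `open_alex.py` are not shown in this PR description). Note that each field needs an explicit `= None` default; in Pydantic v2 an `Optional[...]` annotation alone still leaves the field required.

```python
from typing import Optional

from pydantic import BaseModel


class Work(BaseModel):
    # Hypothetical subset of an OpenAlex work record, for illustration only.
    id: Optional[str] = None
    title: Optional[str] = None
    publication_year: Optional[int] = None


# A partial payload (no title, no year) still validates;
# the absent fields are simply set to None.
partial = Work(**{"id": "https://openalex.org/W0000000001"})
```

Downstream code then has to check for `None` before using a field, which is the trade-off for not failing on incomplete API responses.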

Error handling and validation

  • Added new exception classes (ManagementExceptions, NotEnoughData, UnknownURL) and integrated them into the OpenAlex URL extraction logic, so that invalid or missing URLs produce clear error messages and are skipped during document processing.
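The skip-on-error pattern described above could look roughly like the following. The exception names come from the PR description, but their definitions, the helper names `extract_openalex_id` and `collect_ids`, and the assumed URL shape `https://openalex.org/W<digits>` are illustrative assumptions, not the project's actual code.

```python
import re
from typing import Optional


class NotEnoughData(Exception):
    """Raised when a document carries no URL to work with (assumed meaning)."""


class UnknownURL(Exception):
    """Raised when a URL does not look like an OpenAlex work URL (assumed meaning)."""


# Assumed URL shape for an OpenAlex work: https://openalex.org/W<digits>
_OPENALEX_WORK_URL = re.compile(r"^https://openalex\.org/(W\d+)$")


def extract_openalex_id(url: Optional[str]) -> str:
    """Return the work ID from an OpenAlex URL, or raise a descriptive error."""
    if not url:
        raise NotEnoughData("document has no URL")
    match = _OPENALEX_WORK_URL.match(url.strip())
    if match is None:
        raise UnknownURL(f"not an OpenAlex work URL: {url!r}")
    return match.group(1)


def collect_ids(urls):
    """Split URLs into extracted IDs and (url, reason) pairs for skipped entries."""
    ids, skipped = [], []
    for url in urls:
        try:
            ids.append(extract_openalex_id(url))
        except (NotEnoughData, UnknownURL) as exc:
            # Skip the bad entry but keep the reason for logging.
            skipped.append((url, str(exc)))
    return ids, skipped
```

Keeping the two failure modes as separate exception types lets the caller log "no URL" and "foreign URL" differently while handling both the same way.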

Type hint and collection consistency

  • Refactored type hints throughout welearn_datastack/plugins/rest_requesters/open_alex.py to use the built-in collection types (list, dict) instead of typing.List and typing.Dict, improving code consistency and readability.
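For reference, built-in generics in annotations require Python 3.9 or later and need no import from `typing`. The function below is a hypothetical example of the style, not code from the PR.

```python
def count_by_host(urls: list[str]) -> dict[str, int]:
    """Count URLs per host using built-in generic annotations."""
    counts: dict[str, int] = {}
    for url in urls:
        # Take the host part of "scheme://host/..."; anything else is "unknown".
        host = url.split("/")[2] if "://" in url else "unknown"
        counts[host] = counts.get(host, 0) + 1
    return counts
```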

Logging enhancements

  • Improved logging in welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py to show both the number of successful document retrievals and errors relative to the total attempted, aiding in debugging and monitoring.
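A log line of this kind might be produced as sketched below. The logger name, function name, and message wording are assumptions; the PR description only says that successes and errors are reported relative to the total attempted.

```python
import logging

logger = logging.getLogger("document_collector")  # hypothetical logger name


def log_collection_summary(total: int, retrieved: int, errors: int) -> str:
    """Log successes and errors against the total attempted and return the message."""
    message = f"Retrieved {retrieved}/{total} documents, {errors}/{total} errors"
    logger.info(message)
    return message
```

Reporting both counts against the same denominator makes it easy to spot documents that were neither retrieved nor counted as errors.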

API query and batch processing fixes

  • Refactored batch processing in the OpenAlex requester so that only valid OpenAlex IDs are used in API queries, and fixed a bug where the wrong document batch was referenced, improving the reliability of document retrieval.
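The two fixes can be sketched together: filter out invalid IDs first, then build each query from the current batch rather than from the full list. The `W<digits>` ID pattern, the batch size, the `openalex:` filter key, and the pipe-separated OR syntax are all assumptions made for illustration; check them against the OpenAlex API documentation before relying on them.

```python
import re
from collections.abc import Iterator

# Assumed shape of an OpenAlex work ID.
_WORK_ID = re.compile(r"^W\d+$")


def batched_filters(ids: list[str], batch_size: int = 50) -> Iterator[str]:
    """Yield one filter string per batch of valid IDs."""
    valid = [i for i in ids if _WORK_ID.match(i)]
    for start in range(0, len(valid), batch_size):
        batch = valid[start:start + batch_size]
        # Build the query from `batch`, not from `valid` or `ids`;
        # referencing the wrong collection here is the kind of bug described above.
        yield "openalex:" + "|".join(batch)
```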

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the OpenAlex integration to improve robustness and error handling when processing academic documents. The changes focus on making data models more resilient to incomplete API responses and adding validation for OpenAlex URLs.

Key changes:

  • Made all Pydantic model fields optional to handle incomplete data from the OpenAlex API
  • Added validation and error handling for OpenAlex URL extraction with new exception types
  • Improved logging to show success/error rates relative to total documents processed

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| welearn_datastack/data/source_models/open_alex.py | Converted all Pydantic model fields to Optional types for robust handling of missing API data |
| welearn_datastack/plugins/rest_requesters/open_alex.py | Added URL validation logic, new exception handling, and fixed batch processing bug where wrong document set was used |
| welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py | Enhanced logging to display document retrieval success and error counts |

lpi-tn and others added 4 commits November 14, 2025 16:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g sanitization

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@lpi-tn lpi-tn merged commit 986c09c into main Nov 17, 2025
7 checks passed
@lpi-tn lpi-tn deleted the Fix/openalex branch November 17, 2025 14:29