@lpi-tn lpi-tn commented Nov 20, 2025

This pull request introduces a validation mechanism to ensure that extracted documents contain mandatory non-null fields, and updates both the extraction logic and its tests to handle cases where required data is missing. The main changes include adding a new validation function, integrating it into the document extraction workflow, and expanding unit tests to cover scenarios with missing content.

Validation and extraction improvements:

  • Added a new function validate_non_null_fields_document in welearn_datastack/modules/validation.py to check if a WeLearnDocument has non-empty description and full_content fields after extraction.
  • Integrated the new validation function into the extract_data_from_urls method in document_collector.py, marking documents with missing mandatory fields as errors with an appropriate HTTP error code and message.
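For readers without the diff at hand, the check described above could look roughly like the sketch below. It is an illustration, not the merged code: the `WeLearnDocument` dataclass here is a stand-in that models only the two fields named in this PR, and the `MANDATORY_FIELDS` constant, the boolean return type, and the treatment of whitespace-only strings as empty are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeLearnDocument:
    # Stand-in for the real model; only the fields named in the PR are modelled.
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


# Assumed constant name; the PR only states which two fields are mandatory.
MANDATORY_FIELDS = ("description", "full_content")


def validate_non_null_fields_document(document: WeLearnDocument) -> bool:
    """Return True only if every mandatory field is a non-blank string."""
    for field_name in MANDATORY_FIELDS:
        value = getattr(document, field_name, None)
        # None, non-string values, and whitespace-only strings all fail the check.
        if not isinstance(value, str) or not value.strip():
            return False
    return True
```

Rejecting whitespace-only strings is a design choice made for this sketch; the real function may only test for `None` or the empty string.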

Testing enhancements:

  • Added a unit test test_extract_and_with_none_data in test_extract_n_collect_docs.py to verify that documents with None content are correctly identified as errors during extraction.
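A test in the spirit of the one described above might be shaped like this. It is a self-contained sketch: `mark_missing_content` is a hypothetical stand-in for the extraction step, and the documents are plain `SimpleNamespace` objects rather than real `WeLearnDocument` instances.

```python
import unittest
from types import SimpleNamespace


def mark_missing_content(docs):
    """Hypothetical stand-in for the extraction step: returns (ok, errors)."""
    ok, errors = [], []
    for doc in docs:
        # A document is kept only when both mandatory fields are truthy.
        if doc.description and doc.full_content:
            ok.append(doc)
        else:
            errors.append(doc)
    return ok, errors


class TestExtractWithNoneData(unittest.TestCase):
    def test_none_content_is_reported_as_error(self):
        good = SimpleNamespace(
            url="https://example.org/a", description="d", full_content="text"
        )
        bad = SimpleNamespace(
            url="https://example.org/b", description=None, full_content=None
        )

        ok, errors = mark_missing_content([good, bad])

        # The complete document passes through; the one with None content does not.
        self.assertEqual(ok, [good])
        self.assertEqual(errors, [bad])
```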

@lpi-tn lpi-tn requested review from Copilot and jmsevin November 20, 2025 13:13

Copilot AI left a comment

Pull Request Overview

This PR adds validation to ensure that extracted documents contain mandatory non-null fields (description and full_content). Documents missing these required fields are now marked as errors with HTTP code 422 instead of being processed further.

  • Introduced a new validation function to check for mandatory non-null fields in extracted documents
  • Integrated validation into the document extraction workflow to catch documents with missing content
  • Added test coverage for the new validation logic with missing content scenarios
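The workflow change summarized above, where incomplete documents become error entries with HTTP code 422 instead of being processed further, can be sketched as follows. `ScrapedDoc`, `ErrorRecord`, `has_mandatory_content`, `split_valid_and_errors`, and the error message text are hypothetical names chosen for illustration; the actual `extract_data_from_urls` method may structure its error handling differently.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import List, Optional, Tuple


@dataclass
class ScrapedDoc:
    # Hypothetical stand-in for WeLearnDocument.
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


@dataclass
class ErrorRecord:
    # Hypothetical container for a document that failed validation.
    url: str
    http_error_code: int
    message: str


def has_mandatory_content(doc: ScrapedDoc) -> bool:
    """True when both description and full_content are non-blank strings."""
    return all(
        isinstance(value, str) and bool(value.strip())
        for value in (doc.description, doc.full_content)
    )


def split_valid_and_errors(
    docs: List[ScrapedDoc],
) -> Tuple[List[ScrapedDoc], List[ErrorRecord]]:
    """Partition extracted documents into valid ones and 422 error records."""
    valid: List[ScrapedDoc] = []
    errors: List[ErrorRecord] = []
    for doc in docs:
        if has_mandatory_content(doc):
            valid.append(doc)
        else:
            errors.append(
                ErrorRecord(
                    url=doc.url,
                    http_error_code=HTTPStatus.UNPROCESSABLE_ENTITY.value,  # 422
                    message="Missing mandatory field: description or full_content",
                )
            )
    return valid, errors
```

Using 422 here mirrors the code named in the review summary: the page was fetched, but its extracted content is not usable.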

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| welearn_datastack/modules/validation.py | New validation function to check for non-empty mandatory fields in WeLearnDocument |
| welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py | Integrated validation into extraction workflow to mark invalid documents as errors |
| tests/document_collector_hub/test_nodes/test_extract_n_collect_docs.py | Added test case for documents with None content to verify error handling |

lpi-tn and others added 2 commits November 20, 2025 14:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…_collector.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lpi-tn lpi-tn requested a review from samonaisi November 20, 2025 13:14
@lpi-tn lpi-tn merged commit 6ded5da into main Nov 20, 2025
7 checks passed
@lpi-tn lpi-tn deleted the Fix/after-scrape-validation branch November 20, 2025 13:21