feat: add validation for mandatory fields in document extraction #75

lpi-tn · 2025-11-20T13:13:03Z

This pull request introduces a validation mechanism to ensure that extracted documents contain mandatory non-null fields, and updates both the extraction logic and its tests to handle cases where required data is missing. The main changes include adding a new validation function, integrating it into the document extraction workflow, and expanding unit tests to cover scenarios with missing content.

Validation and extraction improvements:

Added a new function validate_non_null_fields_document in welearn_datastack/modules/validation.py to check if a WeLearnDocument has non-empty description and full_content fields after extraction.
Integrated the new validation function into the extract_data_from_urls method in document_collector.py, marking documents with missing mandatory fields as errors with an appropriate HTTP error code and message. [1] [2]
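
A minimal sketch of what such a check could look like, for readers without the diff at hand. The `WeLearnDocument` below is a stand-in dataclass carrying only the two fields named above, and the boolean return value and the `MANDATORY_FIELDS` tuple are assumptions; the real signature in welearn_datastack/modules/validation.py is not shown in this description and may differ (it could raise instead of returning a flag, for example).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeLearnDocument:
    """Hypothetical stand-in for the real WeLearnDocument model."""
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


# Assumed list of mandatory fields, taken from the PR description.
MANDATORY_FIELDS = ("description", "full_content")


def validate_non_null_fields_document(document: WeLearnDocument) -> bool:
    """Return True only if every mandatory field is a non-blank string.

    The boolean return type is an assumption made for this sketch.
    """
    for field_name in MANDATORY_FIELDS:
        value = getattr(document, field_name, None)
        # Reject None, non-string values, and empty or whitespace-only strings.
        if not isinstance(value, str) or not value.strip():
            return False
    return True
```

Treating whitespace-only strings as empty is a design choice of this sketch, not something the description confirms.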

Testing enhancements:

Added a unit test test_extract_and_with_none_data in test_extract_n_collect_docs.py to verify that documents with None content are correctly identified as errors during extraction.
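
The shape of such a test might resemble the sketch below. It does not reproduce the real test: `FakeDocument` and `split_valid_and_errors` are hypothetical stand-ins for `WeLearnDocument` and the extraction step, and only illustrate the assertion that a document with `None` content ends up on the error side.

```python
import unittest
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDocument:
    """Hypothetical stand-in for WeLearnDocument."""
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


def split_valid_and_errors(documents):
    """Hypothetical stand-in for the extraction step under test."""
    valid, errors = [], []
    for doc in documents:
        # A document is valid only when both mandatory fields are non-empty.
        if doc.description and doc.full_content:
            valid.append(doc)
        else:
            errors.append(doc)
    return valid, errors


class TestExtractWithNoneData(unittest.TestCase):
    def test_none_content_is_reported_as_error(self):
        good = FakeDocument("https://example.org/a", "desc", "content")
        bad = FakeDocument("https://example.org/b", "desc", None)

        valid, errors = split_valid_and_errors([good, bad])

        self.assertEqual(valid, [good])
        self.assertEqual(errors, [bad])
```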

Copilot

Pull Request Overview

This PR adds validation to ensure that extracted documents contain mandatory non-null fields (description and full_content). Documents missing these required fields are now marked as errors with HTTP code 422 instead of being processed further.

Introduced a new validation function to check for mandatory non-null fields in extracted documents
Integrated validation into the document extraction workflow to catch documents with missing content
Added test coverage for the new validation logic with missing content scenarios
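
To illustrate the routing described above, here is a rough sketch of how an extraction step could divert invalid documents into error records carrying code 422. All names (`ExtractedDoc`, `ExtractionError`, `partition_extracted`, the error message) are invented for the example; the actual `extract_data_from_urls` method and its error model are not shown on this page.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# 422 is the code named in the review summary; the constant name is made up.
HTTP_UNPROCESSABLE_ENTITY = 422


@dataclass
class ExtractedDoc:
    """Hypothetical stand-in for WeLearnDocument."""
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


@dataclass
class ExtractionError:
    """Hypothetical error record; the real workflow's shape is not shown."""
    url: str
    http_error_code: int
    message: str


def has_mandatory_fields(doc: ExtractedDoc) -> bool:
    """Assumed check: both mandatory fields must be non-empty."""
    return bool(doc.description) and bool(doc.full_content)


def partition_extracted(
    docs: Iterable[ExtractedDoc],
) -> Tuple[List[ExtractedDoc], List[ExtractionError]]:
    """Split extracted documents into valid ones and 422 error records."""
    valid: List[ExtractedDoc] = []
    errors: List[ExtractionError] = []
    for doc in docs:
        if has_mandatory_fields(doc):
            valid.append(doc)
        else:
            errors.append(
                ExtractionError(
                    url=doc.url,
                    http_error_code=HTTP_UNPROCESSABLE_ENTITY,
                    message="Mandatory field missing after extraction",
                )
            )
    return valid, errors
```

Only the `valid` list would continue through the pipeline in this sketch; the error records keep the URL so the failure can be traced back to its source.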

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
welearn_datastack/modules/validation.py	New validation function to check for non-empty mandatory fields in WeLearnDocument
welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py	Integrated validation into extraction workflow to mark invalid documents as errors
tests/document_collector_hub/test_nodes/test_extract_n_collect_docs.py	Added test case for documents with None content to verify error handling



feat: add validation for mandatory fields in document extraction

4dc2f99

lpi-tn requested review from Copilot and jmsevin November 20, 2025 13:13

Copilot AI reviewed Nov 20, 2025


welearn_datastack/nodes_workflow/DocumentHubCollector/document_collector.py (outdated, resolved)

welearn_datastack/modules/validation.py (outdated, resolved)

lpi-tn and others added 2 commits November 20, 2025 14:14

Update welearn_datastack/modules/validation.py

3b5a2fb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update welearn_datastack/nodes_workflow/DocumentHubCollector/document…

5cf00bd

…_collector.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

lpi-tn requested a review from samonaisi November 20, 2025 13:14

samonaisi approved these changes Nov 20, 2025


lpi-tn merged commit 6ded5da into main Nov 20, 2025
7 checks passed

lpi-tn deleted the Fix/after-scrape-validation branch November 20, 2025 13:21
