feat: add validation for mandatory fields in document extraction #75
This pull request introduces a validation mechanism to ensure that extracted documents contain mandatory non-null fields, and updates both the extraction logic and its tests to handle cases where required data is missing. The main changes include adding a new validation function, integrating it into the document extraction workflow, and expanding unit tests to cover scenarios with missing content.
Validation and extraction improvements:
- Added validate_non_null_fields_document in welearn_datastack/modules/validation.py to check that a WeLearnDocument has non-empty description and full_content fields after extraction.
- Integrated this validation into the extract_data_from_urls method in document_collector.py, marking documents with missing mandatory fields as errors with an appropriate HTTP error code and message.

Testing enhancements:
- Added test_extract_and_with_none_data in test_extract_n_collect_docs.py to verify that documents with None content are correctly identified as errors during extraction.
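
For illustration only, the check described above could look roughly like the sketch below. The actual implementation is not shown in this description, so several things here are assumptions: the Doc dataclass is a hypothetical stand-in for WeLearnDocument, the boolean return type is a guess at the real signature, treating whitespace-only strings as empty is a guess, and split_valid_and_errors is a hypothetical helper rather than part of extract_data_from_urls. The HTTP error code and message are left out because this description does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for WeLearnDocument (the real model is not shown here)."""
    url: str
    description: Optional[str] = None
    full_content: Optional[str] = None


def validate_non_null_fields_document(doc: Doc) -> bool:
    """Return True only if both mandatory fields are present and non-blank.

    Assumption: whitespace-only strings count as empty.
    """
    mandatory = (doc.description, doc.full_content)
    return all(isinstance(value, str) and value.strip() for value in mandatory)


def split_valid_and_errors(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    """Hypothetical helper: separate documents that pass validation from those that fail."""
    valid: List[Doc] = []
    errors: List[Doc] = []
    for doc in docs:
        if validate_non_null_fields_document(doc):
            valid.append(doc)
        else:
            errors.append(doc)
    return valid, errors
```

In this sketch, a document extracted with full_content set to None lands in the errors list, which mirrors the case the new test is described as covering.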