Skip to content

Conversation

@felixjordandev
Copy link
Collaborator

@felixjordandev felixjordandev commented Nov 21, 2025

Overview: This PR expands the validation module to gracefully handle missing data, ensuring more robust report generation.

Changes

  • Introduced new methods within the validation module for handling missing fields and failed agent outputs.
  • Added the normalize_missing(data) function to replace absent data points with explicit placeholders.
  • Implemented a missing_data_report JSON field to provide clear explanations for any data gaps.

Summary by CodeRabbit

  • New Features

    • Null or empty fields in input data are now normalized to "N/A" across nested structures and lists.
    • Cross-source validation now reports status as either "PASSED" or "COMPLETED_WITH_ALERTS".
    • A missing-data report is generated listing the exact field paths that were normalized.
  • Tests

    • Added tests verifying normalization behavior and that original data remains unchanged.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Nov 21, 2025

Walkthrough

Added normalize_missing(data) to deeply copy and replace None/"" with "N/A" across dicts/lists and produce a missing_data_report; perform_cross_source_checks() now sets cross_source_checks to "COMPLETED_WITH_ALERTS" when alerts exist, otherwise "PASSED".

Changes

Cohort / File(s) Summary
Tests
backend/app/services/validation/tests/test_validation_engine.py
Added test_normalize_missing() asserting deep normalization of None/"""N/A", exact equality with expected normalized structure, presence and contents of missing_data_report, and immutability of the original input.
Validation engine
backend/app/services/validation/validation_engine.py
Added normalize_missing(data: Dict[str, Any]) -> Dict[str, Any] (deep copy, recursive traversal of dicts/lists replacing None/"" with "N/A", recording replacement paths in missing_data_report). Updated perform_cross_source_checks() to set cross_source_checks to "COMPLETED_WITH_ALERTS" when alerts exist, otherwise "PASSED".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ValidationEngine
  Note over ValidationEngine: normalize_missing(data)
  Caller->>ValidationEngine: normalize_missing(data)
  ValidationEngine->>ValidationEngine: deepcopy input
  ValidationEngine->>ValidationEngine: traverse dicts & lists
  ValidationEngine->>ValidationEngine: replace None/"" -> "N/A" and record path
  ValidationEngine-->>Caller: return normalized_data + missing_data_report

  Note over ValidationEngine: perform_cross_source_checks(...)
  Caller->>ValidationEngine: perform_cross_source_checks(...)
  ValidationEngine->>ValidationEngine: compute alerts
  alt alerts found
    ValidationEngine-->>Caller: cross_source_checks = "COMPLETED_WITH_ALERTS"
  else no alerts
    ValidationEngine-->>Caller: cross_source_checks = "PASSED"
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect recursive traversal for edge cases (mixed types, recursion depth, non-dict/list iterables).
  • Verify path-formatting in missing_data_report matches consumers and tests (e.g., list indices like [1]).
  • Confirm deepcopy use preserves original immutability and that no external references are mutated.
  • Review perform_cross_source_checks change for downstream state expectations or telemetry.

Possibly related PRs

Poem

🐰 I hop through maps and lists so deep,
I swap the empties from their sleep,
I tuck "N/A" where blanks were found,
I note each path and mark the ground,
Now every report has tidy beats.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Feat: Enhance validation with missing data normalization' accurately and specifically describes the main change—adding a normalize_missing function to handle missing data in the validation module.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/validation-missing-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/app/services/validation/validation_engine.py (1)

165-165: Remove unused loop variable.

The loop variable i is not used within the loop body.

As per coding guidelines.

Apply this diff:

-                    for i, part in enumerate(path_parts[:-1]):
+                    for part in path_parts[:-1]:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9af7b23 and adda3ed.

⛔ Files ignored due to path filters (2)
  • backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
  • backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc
📒 Files selected for processing (2)
  • backend/app/services/validation/tests/test_validation_engine.py (1 hunks)
  • backend/app/services/validation/validation_engine.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/services/validation/tests/test_validation_engine.py (1)
backend/app/services/validation/validation_engine.py (2)
  • perform_cross_source_checks (58-127)
  • normalize_missing (130-180)
🪛 Ruff (0.14.5)
backend/app/services/validation/validation_engine.py

165-165: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (2)
backend/app/services/validation/tests/test_validation_engine.py (1)

2-2: LGTM!

The import statement correctly adds normalize_missing alongside the existing import.

backend/app/services/validation/validation_engine.py (1)

122-125: LGTM!

The addition of the cross_source_checks status field provides a clear validation state indicator. The logic correctly sets "COMPLETED_WITH_ALERTS" when alerts exist and "PASSED" otherwise.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adda3ed and 35c5137.

⛔ Files ignored due to path filters (2)
  • backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
  • backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc
📒 Files selected for processing (2)
  • backend/app/services/validation/tests/test_validation_engine.py (1 hunks)
  • backend/app/services/validation/validation_engine.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/validation/tests/test_validation_engine.py
🧰 Additional context used
🪛 Ruff (0.14.5)
backend/app/services/validation/validation_engine.py

166-166: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (5)
backend/app/services/validation/validation_engine.py (5)

6-6: LGTM! Previous critical issue resolved.

The addition of deepcopy import correctly addresses the previous review comment about shallow copy mutations.


123-127: LGTM! Clear status reporting.

The conditional status setting appropriately distinguishes between clean validation and validation with alerts.


131-143: LGTM! Function setup is correct.

The function signature is well-typed, the docstring is clear, and the use of deepcopy correctly prevents mutation of the original input data.


146-158: LGTM! Dictionary traversal logic is sound.

The logic correctly handles nested dictionaries, builds appropriate paths, and recurses into nested structures.


179-181: Approve with note on potential key collision.

The function correctly invokes the traversal and returns the normalized data with the report attached.

Note that if the input data already contains a "missing_data_report" key, it will be overwritten. This is likely acceptable given the function's purpose, but consider documenting this behavior if it hasn't been already.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/services/validation/validation_engine.py (1)

173-180: Remove or fix ineffective type-conversion logic.

The reassignment normalized_data_ref = [] (line 176) only updates the local variable and doesn't modify the parent structure in normalized_data. This means the type conversion won't actually take effect. Similar issue exists at lines 203-204.

If this defensive code is meant to handle malformed inputs, it should either be fixed to properly update the parent structure, or removed entirely if the function assumes well-formed inputs.

Consider either:

  • Removing these checks if the function assumes well-formed input
  • Or properly updating the parent structure (which would require tracking parent references)
-                            if not isinstance(normalized_data_ref, list):
-                                # If it's not a list, and it's an empty dict, convert to list
-                                if isinstance(normalized_data_ref, dict) and not normalized_data_ref:
-                                    normalized_data_ref = []
-                                else:
-                                    # If it's not a list and not an empty dict, this is an error in path or structure
-                                    # For now, let's assume the path is well-formed and the type matches
-                                    pass
+                            # Assume well-formed input where structure matches path
+                            if not isinstance(normalized_data_ref, list):
+                                # Skip or log unexpected structure
+                                continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35c5137 and 3ff7ce4.

⛔ Files ignored due to path filters (2)
  • backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
  • backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc
📒 Files selected for processing (1)
  • backend/app/services/validation/validation_engine.py (2 hunks)
🔇 Additional comments (2)
backend/app/services/validation/validation_engine.py (2)

5-7: LGTM! Necessary imports added.

The re module is used for path parsing in list traversal, and deepcopy correctly addresses the previous shallow-copy concern, ensuring the original input data remains unmodified.


124-128: LGTM! Clear status summary logic.

The addition of a summary status based on alert presence provides a clear, actionable signal for downstream consumers of the validation results.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ff7ce4 and ea434d6.

⛔ Files ignored due to path filters (2)
  • backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
  • backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc
📒 Files selected for processing (1)
  • backend/app/services/validation/validation_engine.py (2 hunks)
🔇 Additional comments (2)
backend/app/services/validation/validation_engine.py (2)

5-7: LGTM! Imports are appropriate.

The re module for regex-based path parsing and deepcopy for safe data cloning are both necessary for the new normalize_missing function.


124-128: LGTM! Status field provides clear validation outcome.

The addition of cross_source_checks status based on alert presence improves the clarity of validation results.

Comment on lines 201 to 208
if not isinstance(normalized_data_ref, dict):
# If it's not a dict, and it's an empty list, convert to dict
if isinstance(normalized_data_ref, list) and not normalized_data_ref:
normalized_data_ref = {}
else:
# If it's not a dict and not an empty list, this is an error in path or structure
# For now, let's assume the path is well-formed and the type matches
pass
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Major: Type conversion doesn't update parent structure.

Lines 203-204 attempt to convert an empty list to a dict by reassigning normalized_data_ref, but this only changes the local variable without updating the parent structure. This is the same issue as in lines 173-180.

If this conversion is necessary, update the parent structure directly. Otherwise, consider removing the conversion or raising an exception to flag structural inconsistencies.

🤖 Prompt for AI Agents
In backend/app/services/validation/validation_engine.py around lines 201 to 208,
the code reassigns normalized_data_ref to {} when encountering an empty list but
only updates the local variable, leaving the parent container unchanged; update
the parent structure instead (e.g., set parent_container[parent_key_or_index] =
{} using the stored parent reference and key/index used to reach
normalized_data_ref) so the change persists, or if conversion is not allowed,
replace the conversion with raising a clear exception to flag the structural
mismatch.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/services/validation/validation_engine.py (1)

132-169: normalize_missing recursion is sound; consider small robustness and cleanup tweaks

The recursive traversal over dicts/lists and use of deepcopy are correct and preserve the original input. A couple of small refinements would make this more robust and simpler:

  1. Guard against invalid input early
    If data is ever None or not a dict, this will currently fail with an opaque AttributeError on normalized_data.update(...). You can mirror check_missing_values’ behavior with explicit checks:
 def normalize_missing(data: Dict[str, Any]) -> Dict[str, Any]:
@@
-    normalized_data = deepcopy(data)
+    if data is None:
+        raise ValueError("data must not be None")
+    if not isinstance(data, dict):
+        raise TypeError("data must be a dict")
+
+    normalized_data = deepcopy(data)
  1. Drop the redundant update at the end
    temp_root['__root__'] and normalized_data refer to the same dict instance, so this line is effectively a no-op:
-    normalized_data.update(temp_root['__root__']) # Update normalized_data with the modified content
+    # normalized_data already holds the modified structure via temp_root['__root__']

You can safely remove it to simplify the function.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea434d6 and 1e7ebad.

⛔ Files ignored due to path filters (2)
  • backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
  • backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc
📒 Files selected for processing (1)
  • backend/app/services/validation/validation_engine.py (2 hunks)
🔇 Additional comments (2)
backend/app/services/validation/validation_engine.py (2)

5-7: Deep copy import correctly supports non-mutating normalization

Using deepcopy for normalized_data is the right fix to ensure callers’ input isn’t mutated while still allowing in-place updates during traversal. This now matches the docstring promise of returning “a new dictionary.”


124-127: cross_source_checks status wiring looks consistent

The new cross_source_checks flag cleanly summarizes the result: "COMPLETED_WITH_ALERTS" whenever any alert message is recorded, and "PASSED" otherwise. This covers all branches of the preceding logic and should make downstream consumption simpler.

@klingonaston
Copy link
Collaborator

the normalize_missing(data) function is a solid addition for handling absent data, ensures clarity in the reports.

@klingonaston klingonaston merged commit c404c7d into main Nov 21, 2025
1 check passed
@klingonaston klingonaston deleted the feat/validation-missing-data branch November 21, 2025 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants