Feat: Enhance validation with missing data normalization #56

felixjordandev · 2025-11-21T12:56:26Z

Overview: This PR expands the validation module to gracefully handle missing data, ensuring more robust report generation.

Changes

Introduced new methods within the validation module for handling missing fields and failed agent outputs.
Added the normalize_missing(data) function to replace absent data points with explicit placeholders.
Implemented a missing_data_report JSON field to provide clear explanations for any data gaps.

Summary by CodeRabbit

New Features
- Null or empty fields in input data are now normalized to "N/A" across nested structures and lists.
- Cross-source validation now reports status as either "PASSED" or "COMPLETED_WITH_ALERTS".
- A missing-data report is generated listing the exact field paths that were normalized.
Tests
- Added tests verifying normalization behavior and that original data remains unchanged.
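
A test along the lines described above might look like the following minimal sketch. The `normalize_missing` here is a compact stand-in written only so the example runs; the field names (`token`, `holders`, `price`) and the dotted/bracketed path format are assumptions, not taken from the PR's test file.

```python
from copy import deepcopy
from typing import Any, Dict, List


def normalize_missing(data: Dict[str, Any]) -> Dict[str, Any]:
    """Compact stand-in for the PR's function, only so this example runs."""
    report: List[str] = []

    def fix(value: Any, path: str) -> Any:
        if value is None or value == "":
            report.append(path)
            return "N/A"
        if isinstance(value, dict):
            return {k: fix(v, f"{path}.{k}" if path else str(k)) for k, v in value.items()}
        if isinstance(value, list):
            return [fix(v, f"{path}[{i}]") for i, v in enumerate(value)]
        return value

    result = fix(deepcopy(data), "")
    result["missing_data_report"] = report
    return result


def test_normalize_missing() -> None:
    original = {"token": {"symbol": "", "holders": [None, 12]}, "price": 1.5}
    snapshot = deepcopy(original)  # taken before the call, for the immutability check

    result = normalize_missing(original)

    assert result["token"] == {"symbol": "N/A", "holders": ["N/A", 12]}
    assert result["price"] == 1.5
    assert set(result["missing_data_report"]) == {"token.symbol", "token.holders[0]"}
    assert original == snapshot  # the caller's input must be untouched
```

Comparing against a deep-copied snapshot is what makes the immutability assertion meaningful: comparing `original` to itself would pass even if it had been mutated.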

coderabbitai · 2025-11-21T12:56:36Z

Walkthrough

Added normalize_missing(data) to deeply copy and replace None/"" with "N/A" across dicts/lists and produce a missing_data_report; perform_cross_source_checks() now sets cross_source_checks to "COMPLETED_WITH_ALERTS" when alerts exist, otherwise "PASSED".
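
The behaviour summarized in this walkthrough can be sketched roughly as below. This is an illustration under assumptions, not the merged implementation: the inner helper `_walk`, the dotted-key and `[index]` path format, and the report being a flat list of paths are all guesses (the PR description suggests the real report may also carry explanations).

```python
from copy import deepcopy
from typing import Any, Dict, List

PLACEHOLDER = "N/A"  # placeholder value named in the PR summary


def normalize_missing(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a normalized deep copy of `data` plus a `missing_data_report`."""
    normalized = deepcopy(data)  # the caller's input is never touched
    missing_paths: List[str] = []

    def _walk(node: Any, path: str) -> None:
        # Build (key_or_index, value, child_path) triples for dicts and lists.
        if isinstance(node, dict):
            items = [(k, v, f"{path}.{k}" if path else str(k)) for k, v in node.items()]
        elif isinstance(node, list):
            items = [(i, v, f"{path}[{i}]") for i, v in enumerate(node)]
        else:
            return  # scalars and other types are left alone
        for key, value, child_path in items:
            if value is None or value == "":
                node[key] = PLACEHOLDER  # assign through the container so it persists
                missing_paths.append(child_path)
            else:
                _walk(value, child_path)

    _walk(normalized, "")
    # Assumed report shape: a flat list of the paths that were replaced.
    normalized["missing_data_report"] = missing_paths
    return normalized


raw = {"name": "", "scores": [1, None], "meta": {"source": None}}
result = normalize_missing(raw)
```

Under these assumptions `result["missing_data_report"]` would contain `name`, `scores[1]`, and `meta.source`, while `raw` keeps its original empty values.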

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Tests `backend/app/services/validation/tests/test_validation_engine.py` | Added `test_normalize_missing()` asserting deep normalization of `None`/`""` → `"N/A"`, exact equality with expected normalized structure, presence and contents of `missing_data_report`, and immutability of the original input. |
| Validation engine `backend/app/services/validation/validation_engine.py` | Added `normalize_missing(data: Dict[str, Any]) -> Dict[str, Any]` (deep copy, recursive traversal of dicts/lists replacing `None`/`""` with `"N/A"`, recording replacement paths in `missing_data_report`). Updated `perform_cross_source_checks()` to set `cross_source_checks` to `"COMPLETED_WITH_ALERTS"` when alerts exist, otherwise `"PASSED"`. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant ValidationEngine
  Note over ValidationEngine: normalize_missing(data)
  Caller->>ValidationEngine: normalize_missing(data)
  ValidationEngine->>ValidationEngine: deepcopy input
  ValidationEngine->>ValidationEngine: traverse dicts & lists
  ValidationEngine->>ValidationEngine: replace None/"" -> "N/A" and record path
  ValidationEngine-->>Caller: return normalized_data + missing_data_report

  Note over ValidationEngine: perform_cross_source_checks(...)
  Caller->>ValidationEngine: perform_cross_source_checks(...)
  ValidationEngine->>ValidationEngine: compute alerts
  alt alerts found
    ValidationEngine-->>Caller: cross_source_checks = "COMPLETED_WITH_ALERTS"
  else no alerts
    ValidationEngine-->>Caller: cross_source_checks = "PASSED"
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Inspect recursive traversal for edge cases (mixed types, recursion depth, non-dict/list iterables).
Verify path-formatting in missing_data_report matches consumers and tests (e.g., list indices like [1]).
Confirm deepcopy use preserves original immutability and that no external references are mutated.
Review perform_cross_source_checks change for downstream state expectations or telemetry.
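
The deepcopy item in this checklist comes down to the difference between a shallow and a deep copy of a nested structure. A small sketch using only the standard library (the sample data is made up for illustration):

```python
from copy import copy, deepcopy

original = {"meta": {"source": None}}

shallow = copy(original)           # new outer dict, but the inner dict is shared
shallow["meta"]["source"] = "N/A"  # this write leaks into `original`
leaked = original["meta"]["source"]

original = {"meta": {"source": None}}
deep = deepcopy(original)          # inner dict is cloned as well
deep["meta"]["source"] = "N/A"
preserved = original["meta"]["source"]
```

With the shallow copy the nested write is visible through `original`; with `deepcopy` the original nested value stays `None`, which is the property the tests for this PR are said to check.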

Possibly related PRs

Feat: Introduce data validation engine for report generation #53 — modifies the same validation_engine.py area and likely overlaps with cross-source check behavior and missing-value handling.

Poem

🐰 I hop through maps and lists so deep,
I swap the empties from their sleep,
I tuck "N/A" where blanks were found,
I note each path and mark the ground,
Now every report has tidy beats.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Feat: Enhance validation with missing data normalization' accurately and specifically describes the main change: adding a normalize_missing function to handle missing data in the validation module. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

backend/app/services/validation/validation_engine.py (1)
165-165: Remove unused loop variable.

The loop variable i is not used within the loop body.

As per coding guidelines.

Apply this diff:
-                    for i, part in enumerate(path_parts[:-1]):
+                    for part in path_parts[:-1]:

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9af7b23 and adda3ed.

⛔ Files ignored due to path filters (2)

backend/app/services/validation/__pycache__/validation_engine.cpython-313.pyc is excluded by !**/*.pyc
backend/app/services/validation/tests/__pycache__/test_validation_engine.cpython-313-pytest-8.4.2.pyc is excluded by !**/*.pyc

📒 Files selected for processing (2)

backend/app/services/validation/tests/test_validation_engine.py (1 hunks)
backend/app/services/validation/validation_engine.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

backend/app/services/validation/tests/test_validation_engine.py (1)

backend/app/services/validation/validation_engine.py (2)

perform_cross_source_checks (58-127)

normalize_missing (130-180)

🪛 Ruff (0.14.5)

backend/app/services/validation/validation_engine.py

165-165: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (2)

backend/app/services/validation/tests/test_validation_engine.py (1)

2-2: LGTM!

The import statement correctly adds normalize_missing alongside the existing import.

backend/app/services/validation/validation_engine.py (1)

122-125: LGTM!

The addition of the cross_source_checks status field provides a clear validation state indicator. The logic correctly sets "COMPLETED_WITH_ALERTS" when alerts exist and "PASSED" otherwise.
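
The status logic described in this comment can be illustrated with a short sketch. The helper name `summarize_cross_source_checks` and the `alerts` key are hypothetical; only the `cross_source_checks` field and its two values come from the review.

```python
from typing import Any, Dict, List


def summarize_cross_source_checks(alerts: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: attach a summary status to a list of alert messages."""
    # An empty list is falsy, so "no alerts" maps to PASSED.
    status = "COMPLETED_WITH_ALERTS" if alerts else "PASSED"
    return {"alerts": alerts, "cross_source_checks": status}


clean = summarize_cross_source_checks([])
flagged = summarize_cross_source_checks(["price mismatch between sources"])
```

Deriving the status from the alert list in one place keeps the two values mutually exclusive, so downstream consumers never see `"PASSED"` alongside recorded alerts.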

backend/app/services/validation/tests/test_validation_engine.py

backend/app/services/validation/validation_engine.py

coderabbitai

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between adda3ed and 35c5137.

📒 Files selected for processing (2)

backend/app/services/validation/tests/test_validation_engine.py (1 hunks)
backend/app/services/validation/validation_engine.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/services/validation/tests/test_validation_engine.py

🧰 Additional context used

🪛 Ruff (0.14.5)

backend/app/services/validation/validation_engine.py

166-166: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🔇 Additional comments (5)

backend/app/services/validation/validation_engine.py (5)

6-6: LGTM! Previous critical issue resolved.

The addition of deepcopy import correctly addresses the previous review comment about shallow copy mutations.

123-127: LGTM! Clear status reporting.

The conditional status setting appropriately distinguishes between clean validation and validation with alerts.

131-143: LGTM! Function setup is correct.

The function signature is well-typed, the docstring is clear, and the use of deepcopy correctly prevents mutation of the original input data.

146-158: LGTM! Dictionary traversal logic is sound.

The logic correctly handles nested dictionaries, builds appropriate paths, and recurses into nested structures.

179-181: Approve with note on potential key collision.

The function correctly invokes the traversal and returns the normalized data with the report attached.

Note that if the input data already contains a "missing_data_report" key, it will be overwritten. This is likely acceptable given the function's purpose, but consider documenting this behavior if it hasn't been already.
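
The key-collision note can be made concrete with a small sketch. `attach_report` and its `strict` flag are hypothetical and not part of the PR; they only contrast the silent overwrite with one possible stricter alternative.

```python
from typing import Any, Dict, List

REPORT_KEY = "missing_data_report"


def attach_report(normalized: Dict[str, Any], paths: List[str], strict: bool = False) -> Dict[str, Any]:
    """Hypothetical helper, not the PR's code: attach the report to the result."""
    if strict and REPORT_KEY in normalized:
        # Stricter option: surface the collision instead of hiding it.
        raise KeyError(f"input already contains reserved key {REPORT_KEY!r}")
    normalized[REPORT_KEY] = paths  # plain assignment silently replaces any old value
    return normalized


overwritten = attach_report({"missing_data_report": "caller value"}, ["a"])
```

In the default mode the caller's earlier value is replaced without warning, which matches the behaviour the review describes. Whether that is acceptable or should raise is a design choice the docstring can state either way.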

backend/app/services/validation/validation_engine.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

backend/app/services/validation/validation_engine.py (1)

173-180: Remove or fix ineffective type-conversion logic.

The reassignment normalized_data_ref = [] (line 176) only updates the local variable and doesn't modify the parent structure in normalized_data. This means the type conversion won't actually take effect. Similar issue exists at lines 203-204.

If this defensive code is meant to handle malformed inputs, it should either be fixed to properly update the parent structure, or removed entirely if the function assumes well-formed inputs.

Consider either:

Removing these checks if the function assumes well-formed input
Or properly updating the parent structure (which would require tracking parent references)

-                            if not isinstance(normalized_data_ref, list):
-                                # If it's not a list, and it's an empty dict, convert to list
-                                if isinstance(normalized_data_ref, dict) and not normalized_data_ref:
-                                    normalized_data_ref = []
-                                else:
-                                    # If it's not a list and not an empty dict, this is an error in path or structure
-                                    # For now, let's assume the path is well-formed and the type matches
-                                    pass
+                            # Assume well-formed input where structure matches path
+                            if not isinstance(normalized_data_ref, list):
+                                # Skip or log unexpected structure
+                                continue

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 35c5137 and 3ff7ce4.

📒 Files selected for processing (1)

backend/app/services/validation/validation_engine.py (2 hunks)

🔇 Additional comments (2)

backend/app/services/validation/validation_engine.py (2)

5-7: LGTM! Necessary imports added.

The re module is used for path parsing in list traversal, and deepcopy correctly addresses the previous shallow-copy concern, ensuring the original input data remains unmodified.

124-128: LGTM! Clear status summary logic.

The addition of a summary status based on alert presence provides a clear, actionable signal for downstream consumers of the validation results.
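
For the regex-based path parsing mentioned in the import comment above, a minimal sketch could look like this. The path syntax (dotted keys with `[n]` list indices) and the helper names `parse_path` and `resolve_parent` are assumptions for illustration; the PR's actual parsing code is not shown in this thread.

```python
import re
from typing import Any, List, Union

# Either a run of characters that are not '.', '[' or ']', or a bracketed integer.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(path: str) -> List[Union[str, int]]:
    """Split a path such as 'holders[1].name' into keys and list indices."""
    parts: List[Union[str, int]] = []
    for key, index in _TOKEN.findall(path):
        parts.append(int(index) if index else key)
    return parts


def resolve_parent(root: Any, parts: List[Union[str, int]]) -> Any:
    """Walk every part except the last and return the parent container."""
    node = root
    for part in parts[:-1]:  # no index needed here, so no enumerate()
        node = node[part]
    return node


parts = parse_path("holders[1].name")
parent = resolve_parent({"holders": [{}, {"name": None}]}, parts)
```

Returning the parent rather than the leaf is deliberate: a replacement has to be written as `parent[parts[-1]] = ...` for it to persist in the structure.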

backend/app/services/validation/validation_engine.py

coderabbitai

Actionable comments posted: 4

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 3ff7ce4 and ea434d6.

📒 Files selected for processing (1)

backend/app/services/validation/validation_engine.py (2 hunks)

🔇 Additional comments (2)

backend/app/services/validation/validation_engine.py (2)

5-7: LGTM! Imports are appropriate.

The re module for regex-based path parsing and deepcopy for safe data cloning are both necessary for the new normalize_missing function.

124-128: LGTM! Status field provides clear validation outcome.

The addition of cross_source_checks status based on alert presence improves the clarity of validation results.

backend/app/services/validation/validation_engine.py

coderabbitai · 2025-11-21T13:52:41Z

backend/app/services/validation/validation_engine.py

+                                if not isinstance(normalized_data_ref, dict):
+                                    # If it's not a dict, and it's an empty list, convert to dict
+                                    if isinstance(normalized_data_ref, list) and not normalized_data_ref:
+                                        normalized_data_ref = {}
+                                    else:
+                                        # If it's not a dict and not an empty list, this is an error in path or structure
+                                        # For now, let's assume the path is well-formed and the type matches
+                                        pass


⚠️ Potential issue | 🟠 Major

Major: Type conversion doesn't update parent structure.

Lines 203-204 attempt to convert an empty list to a dict by reassigning normalized_data_ref, but this only changes the local variable without updating the parent structure. This is the same issue as in lines 173-180.

If this conversion is necessary, update the parent structure directly. Otherwise, consider removing the conversion or raising an exception to flag structural inconsistencies.

🤖 Prompt for AI Agents

In backend/app/services/validation/validation_engine.py around lines 201 to 208, the code reassigns normalized_data_ref to {} when encountering an empty list but only updates the local variable, leaving the parent container unchanged; update the parent structure instead (e.g., set parent_container[parent_key_or_index] = {} using the stored parent reference and key/index used to reach normalized_data_ref) so the change persists, or if conversion is not allowed, replace the conversion with raising a clear exception to flag the structural mismatch.
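
The issue flagged here is ordinary Python name rebinding, and it can be reproduced in isolation. The variable names below are illustrative and not taken from the PR.

```python
from typing import Any, Dict

data: Dict[str, Any] = {"items": []}

ref = data["items"]
ref = {}  # rebinds the local name only; data["items"] is still the empty list
unchanged = data["items"]

# To make the change persist, assign through the parent container and its key.
parent, key = data, "items"
parent[key] = {}
changed = data["items"]
```

After the first assignment `data["items"]` is still a list; only the write through `parent[key]` replaces it with a dict. A traversal that needs this kind of conversion therefore has to carry the parent and the key (or index) alongside the current node.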

backend/app/services/validation/validation_engine.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

backend/app/services/validation/validation_engine.py (1)
132-169: normalize_missing recursion is sound; consider small robustness and cleanup tweaks

The recursive traversal over dicts/lists and use of deepcopy are correct and preserve the original input. A couple of small refinements would make this more robust and simpler:

Guard against invalid input early
If data is ever None or not a dict, this will currently fail with an opaque AttributeError on normalized_data.update(...). You can mirror check_missing_values’ behavior with explicit checks:
 def normalize_missing(data: Dict[str, Any]) -> Dict[str, Any]:
@@
-    normalized_data = deepcopy(data)
+    if data is None:
+        raise ValueError("data must not be None")
+    if not isinstance(data, dict):
+        raise TypeError("data must be a dict")
+
+    normalized_data = deepcopy(data)
Drop the redundant update at the end
temp_root['__root__'] and normalized_data refer to the same dict instance, so this line is effectively a no-op:
-    normalized_data.update(temp_root['__root__']) # Update normalized_data with the modified content
+    # normalized_data already holds the modified structure via temp_root['__root__']
You can safely remove it to simplify the function.
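
Both suggestions in this nitpick can be checked with a short sketch. `prepare_input` is a hypothetical name for the guarded front half of the function, and the way `temp_root` is built here is an assumption based on the review's description.

```python
from copy import deepcopy
from typing import Any, Dict


def prepare_input(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical front half of normalize_missing: validate, then deep copy."""
    if data is None:
        raise ValueError("data must not be None")
    if not isinstance(data, dict):
        raise TypeError("data must be a dict")
    return deepcopy(data)


normalized_data = prepare_input({"a": None})
temp_root = {"__root__": normalized_data}  # wrapper holds the same object, not a copy
same_object = temp_root["__root__"] is normalized_data
temp_root["__root__"]["a"] = "N/A"         # visible through normalized_data as well
```

Because `same_object` is true, any write made through `temp_root["__root__"]` is already reflected in `normalized_data`, which is why a trailing `update(...)` of a dict with itself adds nothing.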

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between ea434d6 and 1e7ebad.

📒 Files selected for processing (1)

backend/app/services/validation/validation_engine.py (2 hunks)

🔇 Additional comments (2)

backend/app/services/validation/validation_engine.py (2)

5-7: Deep copy import correctly supports non-mutating normalization

Using deepcopy for normalized_data is the right fix to ensure callers’ input isn’t mutated while still allowing in-place updates during traversal. This now matches the docstring promise of returning “a new dictionary.”

124-127: cross_source_checks status wiring looks consistent

The new cross_source_checks flag cleanly summarizes the result: "COMPLETED_WITH_ALERTS" whenever any alert message is recorded, and "PASSED" otherwise. This covers all branches of the preceding logic and should make downstream consumption simpler.

klingonaston · 2025-11-21T14:30:04Z

The normalize_missing(data) function is a solid addition for handling absent data, and it keeps the reports clear.

felixjordandev added 5 commits November 21, 2025 06:56

feat: Add missing data normalization to validation module

adda3ed

Fix: Prevent in-place mutation in normalize_missing with deepcopy

35c5137

Fix: Correct path normalization in validation engine for mixed keys and list indices

3ff7ce4

Fix validation engine dict assignment to prevent overwriting lists

ea434d6

Refactor _traverse_and_normalize for correct data updates and type handling

1e7ebad

coderabbitai bot reviewed Nov 21, 2025

backend/app/services/validation/tests/test_validation_engine.py (resolved)

backend/app/services/validation/validation_engine.py (resolved)

coderabbitai bot reviewed Nov 21, 2025

backend/app/services/validation/validation_engine.py (outdated, resolved)

coderabbitai bot reviewed Nov 21, 2025

backend/app/services/validation/validation_engine.py (outdated, resolved)

coderabbitai bot reviewed Nov 21, 2025

klingonaston approved these changes Nov 21, 2025

klingonaston merged commit c404c7d into main Nov 21, 2025
1 check passed

klingonaston deleted the feat/validation-missing-data branch November 21, 2025 14:30
