feat: classify JSON export schema field mismatches, fix validator view mappings, duplicate field handling, and array type parsing, and add implementation roadmap#8528

Merged
pethers merged 26 commits into master from copilot/align-json-export-schema
Apr 6, 2026

Conversation

Contributor

Copilot AI commented Apr 5, 2026

Description

The 5 JSON export schemas define 168 fields but only 24 exist in database sample data (144 mismatches). Analysis reveals most "mismatches" are either JSON structural grouping objects (67) or fields derivable from existing DB columns (42) — not true gaps. This PR classifies every field into 4 validator-backed categories (implemented/structural/computed/planned), annotates all schema docs, enhances the validator to report status with fully deterministic output, fixes a critical view-mapping bug, a duplicate field-name handling bug, and an array type parsing bug in the validator, and provides a prioritized implementation roadmap.

A critical bug was discovered and fixed in validate_schemas.py: the ministry schema_to_views mapping used riksdagen_government (English spelling) but the actual CSV files and DB views use riksdagen_goverment (matching the database identifiers). This caused the validator to miss 6 ministry CSV data files entirely. Additional missing view patterns were added for committee (committee_productivity) and ministry (ministry_productivity_matrix, ministry_risk_evolution), increasing implemented fields from 21 to 24.
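A minimal sketch of how one misspelled pattern can hide sample files, assuming the validator matches CSV file names against substring patterns. The `_sample.csv` file names and the `match_views` helper are hypothetical; only the two spellings and the view names come from this description.

```python
# Hypothetical sample file names built from view names mentioned in this PR.
sample_files = [
    "view_riksdagen_goverment_proposals_sample.csv",
    "view_riksdagen_goverment_role_member_sample.csv",
    "view_ministry_risk_evolution_sample.csv",
]

def match_views(patterns, files):
    """Return the sorted files whose name contains any of the patterns."""
    return sorted(f for f in files if any(p in f for p in patterns))

# English spelling: matches nothing, because the DB identifier is "goverment".
before = match_views(["riksdagen_government"], sample_files)

# DB spelling plus one of the added view patterns.
after = match_views(["riksdagen_goverment", "ministry_risk_evolution"], sample_files)

print(len(before), len(after))
```

The mismatch is silent because a substring test that finds nothing is indistinguishable from a schema with no sample data.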

A second bug was fixed in the mermaid field parser: duplicate field names across different classes/paths (e.g., politician has both +String party and +PartyLink party) would silently overwrite each other. Non-scalar types (PartyLink, Trends, VoteAlignment, etc.) are now tracked using composite keys (e.g., PartyLink:party) and are always classified as STRUCTURAL. A SCALAR_TYPES constant distinguishes scalar field types from non-scalar/link types. The PartyLink party entry was removed from politician-schema-validated.md since CSV sample data validates scalar columns, not nested link objects.
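The composite-key idea can be sketched as follows. This is an illustration under assumptions, not the validator's code: `field_key` is a hypothetical helper, while `SCALAR_TYPES` mirrors the constant listed under Key changes.

```python
SCALAR_TYPES = {"String", "Integer", "Float", "Long", "Double", "Boolean",
                "Date", "DateTime", "int", "float", "string", "boolean", "date"}

def field_key(field_type, field_name):
    """Scalar fields keep their plain name; non-scalar ones get Type:name."""
    if field_type in SCALAR_TYPES:
        return field_name
    return f"{field_type}:{field_name}"

fields = {}
for field_type, field_name in [("String", "party"), ("PartyLink", "party")]:
    # Without the composite key the second entry would overwrite the first.
    fields[field_key(field_type, field_name)] = field_type

print(sorted(fields))  # ['PartyLink:party', 'party']
```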

A third bug was fixed in the mermaid field regex: the pattern \+(\w+)\s+(\w+) did not match array type forms like String[] or CommitteeLink[] (the [] broke the \w+ match), so array fields in schema diagrams were silently skipped and never classified/validated. The regex now captures optional [] notation, and the extracted base type is normalized (e.g., String[] → String is still scalar, CommitteeLink[] remains non-scalar/STRUCTURAL).
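A short sketch comparing the old and new patterns on a made-up diagram fragment. The two regexes and the `removesuffix("[]")` normalization are the ones quoted in this conversation; the diagram text and the abbreviated `SCALAR_TYPES` set are assumptions for illustration.

```python
import re

SCALAR_TYPES = {"String", "Integer", "Float", "Boolean"}  # abbreviated

OLD_PATTERN = re.compile(r'\+(\w+)\s+(\w+)')
NEW_PATTERN = re.compile(r'\+(\w+(?:\[\])?)\s+(\w+)')

# Hypothetical mermaid class body.
diagram = """
class Politician {
    +String party
    +String[] subcategories
    +CommitteeLink[] committees
}
"""

old_fields = OLD_PATTERN.findall(diagram)  # array fields are skipped
new_fields = NEW_PATTERN.findall(diagram)  # array fields are captured

for field_type, field_name in new_fields:
    base_type = field_type.removesuffix("[]")  # requires Python 3.9+
    kind = "scalar" if base_type in SCALAR_TYPES else "non-scalar"
    print(field_type, field_name, kind)
```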

Additionally, meetings was removed from the committee structural fields list in FIELD_MAPPING.md — it was contradictorily listed as both PLANNED and STRUCTURAL, and the validator classifies it as planned. All view/file lists (matched_views, missing_views) are now sorted before writing to validation-results.json for fully deterministic output across runs.
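To illustrate why sorting matters for the JSON output, here is a small sketch. `build_schema_result` and the view names are hypothetical; only the `matched_views` and `missing_views` keys are taken from the description.

```python
import json

def build_schema_result(matched, missing):
    """Sort view lists so the serialized result does not depend on load order."""
    return {
        "matched_views": sorted(matched),
        "missing_views": sorted(missing),
    }

# The same views discovered in a different order on two runs.
run_a = build_schema_result({"view_b", "view_a"}, ["view_z", "view_c"])
run_b = build_schema_result(["view_a", "view_b"], ["view_c", "view_z"])

same_output = json.dumps(run_a, indent=2) == json.dumps(run_b, indent=2)
print(same_output)
```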

Field classifications distinguish between nested object types (e.g., +Trends trend, +VoteAlignment alignment, +Productivity productivity) which are STRUCTURAL, and leaf metrics (e.g., +Integer committeeChairs, +Float strengthScore) which are COMPUTED or PLANNED. All doc table classifications — including validated schema files, FIELD_MAPPING.md per-schema sections, and summary reports — are fully reconciled with validation-results.json output across 9 rounds of review alignment. All structural JSON grouping fields consistently use ❌ STRUCTURAL labeling (not DEPRECATED) across all documentation sections. Ministry view references use the actual DB view identifiers (view_riksdagen_goverment*) throughout all sections including the Top 25 implementation roadmap table.
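The four categories can be pictured with the sketch below. The order of the checks is an assumption made for illustration, the constant sets are abbreviated, and the direct name lookup stands in for the validator's more flexible column matching; `classify`, `columns`, and `futureMetric` are hypothetical names.

```python
SCALAR_TYPES = {"String", "Integer", "Float", "Long", "Double", "Boolean",
                "Date", "DateTime"}
STRUCTURAL_FIELDS = {"attributes", "labels", "trend", "alignment", "productivity"}
COMPUTED_FIELDS = {"fullName", "totalMembers", "committeeChairs", "strengthScore"}

def classify(field_type, field_name, data_columns):
    """Return one of: structural, implemented, computed, planned."""
    base_type = field_type.removesuffix("[]")  # requires Python 3.9+
    if base_type not in SCALAR_TYPES or field_name in STRUCTURAL_FIELDS:
        return "structural"   # nested objects, links, grouping containers
    if field_name in data_columns:
        return "implemented"  # found in sample data
    if field_name in COMPUTED_FIELDS:
        return "computed"     # derivable from existing columns
    return "planned"          # needs a new data source

columns = {"totalMembers"}  # hypothetical CSV header
examples = [
    ("Trends", "trend"),
    ("Integer", "totalMembers"),
    ("Integer", "committeeChairs"),
    ("Float", "futureMetric"),  # hypothetical field name
]
for field_type, field_name in examples:
    print(field_name, classify(field_type, field_name, columns))
```

Checking for data before the computed list is what lets a field such as totalMembers count as implemented once a view exposes it.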

Type of Change

Primary Changes

  • 🔄 Enhancement
  • 🐛 Bug Fix

Political Analysis

  • 📊 Political Data Analysis
    • Party Analysis
    • Committee Processing
    • Government Data
    • Riksdagen Integration
  • 📈 Analytics & Metrics
    • Performance Metrics

Technical Changes

  • 📝 Documentation
    • Technical Documentation
  • ✅ Testing
    • Unit Tests

Impact Analysis

Political Analysis Impact

  • Impact on data quality: Field-level implementation status now tracked across all 4 political entity schemas with corrected view mappings, accurate non-scalar field tracking, and complete array field parsing
  • Impact on analysis accuracy: Clear distinction between implemented (24), structural (67), computed (42), and planned (35) fields — all backed by validator output with corrected ministry CSV file matching, composite-key deduplication, and array type handling
  • Impact on transparency features: Validated schemas now classify fields with accurate counts; ministry schema now correctly matches 8 CSV data files (was 2 due to view-mapping bug); non-scalar link types no longer silently overwrite same-named scalar fields; array fields are no longer silently skipped

Technical Impact

  • Performance impact: None — documentation and validation tooling only
  • Security implications: None
  • Dependency changes: None

Testing

  • Unit tests added/updated
  • Political data validation completed

23 tests pass (5 new/updated):

  • test_structural_fields_defined — verifies STRUCTURAL_FIELDS constant (including trend, alignment)
  • test_computed_fields_defined — verifies COMPUTED_FIELDS constant (including id, committeeChairs, strengthScore)
  • test_field_status_in_validation_results — verifies results structure with all 4 categories
  • test_field_classification_with_data — end-to-end classification with aggregation across structural/computed/planned
  • test_mismatch_includes_status — verifies STRUCTURAL/COMPUTED/PLANNED status in mismatch entries

Documentation

  • Package/module documentation updated
  • Political analysis documentation updated

Key changes:

  • validate_schemas.py: Fixed critical ministry view-mapping bug (riksdagen_government → riksdagen_goverment, matching actual DB spelling). Fixed duplicate field-name handling: non-scalar types (PartyLink, Trends, VoteAlignment, etc.) now use composite keys (Type:name) to prevent overwriting same-named scalar fields. Fixed mermaid regex to match Type[] array notation; base type is normalized for scalar check (String[] → String is scalar, CommitteeLink[] remains non-scalar). Added SCALAR_TYPES constant to distinguish scalar vs non-scalar field types; non-scalar types are always classified as STRUCTURAL. Added missing view patterns: committee_productivity, ministry_productivity_matrix, ministry_risk_evolution. Enhanced with 3-way mismatch classification, fully deterministic output (sorted schema_fields, unmapped_data_columns, matched_views, missing_views), and dynamic file counts:

    SCALAR_TYPES = {"String", "Integer", "Float", "Long", "Double", "Boolean",
                    "Date", "DateTime", "int", "float", "string", "boolean", "date"}
    STRUCTURAL_FIELDS = {"attributes", "labels", "relationships", "intelligence",
                         "trend", "alignment", "productivity", "performance",
                         "budget", "predictions", ...}  # 32 entries
    COMPUTED_FIELDS = {"fullName", "partyLoyalty", "totalMembers", "performanceScore",
                       "id", "committeeChairs", "strengthScore", ...}  # 31 entries
  • FIELD_MAPPING.md — Every field now has ✅ IMPLEMENTED / 🔀 COMPUTED / 🔄 PLANNED / ❌ STRUCTURAL status. Status legend maps directly to validator output categories. Validation Summary table aligned with validator output (67/42/35). Per-schema sections reconciled with validation-results.json: politician implemented fields include riskScore, riskLevel, attendanceRate; party implemented includes totalVotes; committee implemented includes regularMembers, totalMembers; ministry implemented includes name, id. All structural/computed/planned lists match validator per-schema breakdowns. meetings removed from committee structural list (it's PLANNED per validator). Committee and ministry structural field sections consistently use ❌ STRUCTURAL labeling. Ministry view references corrected to actual DB identifiers (view_riksdagen_goverment*) across all sections including the Top 25 implementation roadmap table. Mismatch reduction strategy table uses 67 structural fields with cascading count adjustments. Validation exclusions count updated to 67 structural fields.

  • 4 schema files (politician-, party-, committee-, ministry-schema.md) — Added 📋 Field Implementation Status section after each mermaid diagram. Nested object types correctly classified as STRUCTURAL. Leaf metrics classified as COMPUTED. Non-scalar link types (e.g., MinistryLink:ministry) reclassified as STRUCTURAL. Array types (e.g., CommitteeMembership[] current/historical, ElectionHistory[] history, RegionalSupport[] regions, CommitteeLink[] committees) classified as STRUCTURAL per validator logic. Politician schema: current/historical reclassified from PLANNED to STRUCTURAL (non-scalar CommitteeMembership[] arrays); field count note updated to match validator's 55 fields_defined. Party schema: subcategories/intelligenceTags reclassified from PLANNED to STRUCTURAL (array types per validator); history/regions reclassified from PLANNED to STRUCTURAL (non-scalar ElectionHistory[]/RegionalSupport[] arrays); field count note updated to match validator's 51 fields_defined. Summary counts in both schemas reconciled with per-schema breakdowns in validation-results.json. Committee schema: regularMembers/totalMembers now IMPLEMENTED (found in view_committee_productivity). Ministry schema: id now IMPLEMENTED (found in view_riksdagen_goverment_proposals).

  • 4 validated schemas (*-validated.md) — Field counts aligned with validation-results.json per-schema breakdowns. PartyLink party removed from politician-schema-validated.md (CSV validates scalars only; scalar String party retained). party-schema-validated.md mermaid diagram updated to include totalVotes. Duplicate ## Migration Notes headings removed.

  • SCHEMA_VALIDATION_REPORT.md — Uses dynamic files_analyzed count. Total Schema Mismatches corrected to 144 (168 − 24 implemented). Field Status Summary shows full 4-category breakdown with corrected numbers: 24 implemented, 67 structural, 42 computed, 35 planned. Checklist counts updated to match validator output. Mismatch resolution strategy section updated.

  • VALIDATED_SCHEMAS_SUMMARY.md — Numbers aligned with actual validator output (24 implemented, 67 structural, 42 computed, 35 planned).

  • .gitignore — Added __pycache__/, *.pyc, and .pytest_cache/ to prevent committing Python build artifacts.

Screenshots

N/A

Related Issues

Related to #893

Checklist

  • Code follows project coding standards
  • Comments are clear and helpful
  • Documentation is updated
  • Tests are passing
  • Security compliance is maintained
  • Performance impact is acceptable
  • Breaking changes are documented
  • Changes are backward compatible

Additional Notes

Mismatch breakdown from validator output: 67 structural (JSON grouping objects like attributes, labels, trend, alignment, productivity, budget, predictions, non-scalar link types like PartyLink:party, MinistryLink:ministry, and array link types like CommitteeLink[]:committees, CommitteeMembership[]:current, ElectionHistory[]:history, RegionalSupport[]:regions) + 42 computed (derivable fields like fullName, totalMembers, committeeChairs, strengthScore, established) + 35 planned (fields needing new data sources) = 144 total mismatches (24 fields implemented, 168 total fields). The critical view-mapping fix corrected the ministry schema_to_views mapping from riksdagen_government to riksdagen_goverment (matching the actual DB view spelling in full_schema.sql), which unlocked 6 additional ministry CSV data files. The composite-key fix prevents non-scalar types from silently overwriting same-named scalar fields — non-scalar types are tracked as Type:name keys and always classified as STRUCTURAL. The array regex fix ensures Type[] fields (e.g., String[], CommitteeLink[]) are parsed and classified rather than silently skipped. Committee gained 2 implemented fields (regularMembers, totalMembers from view_committee_productivity) and ministry gained 1 (id from view_riksdagen_goverment_proposals). All documentation tables across validated schemas, FIELD_MAPPING.md per-schema sections, schema field implementation status tables, and summary reports reflect these validator-verified numbers. Per-schema field implementation status tables in politician-schema.md and party-schema.md reconciled: non-scalar array types (CommitteeMembership[], ElectionHistory[], RegionalSupport[]) correctly classified as STRUCTURAL; scalar array types (String[] for subcategories/intelligenceTags) also classified as STRUCTURAL per validator output; field count notes updated to match validator fields_defined totals (55 for politician, 51 for party). 
All view/file lists are sorted for fully deterministic output across runs. All structural JSON grouping fields consistently use ❌ STRUCTURAL labeling across all documentation sections. Ministry source view references use the actual DB identifiers (view_riksdagen_goverment, view_riksdagen_goverment_role_member) throughout all documentation sections including the Top 25 implementation roadmap table.

Security Considerations

  • No sensitive data exposed
  • Security best practices followed
  • Compliance requirements met

Release Notes

Fixed critical validator view-mapping bug (ministry riksdagen_government → riksdagen_goverment, matching actual DB spelling), fixed duplicate field-name handling (non-scalar types now use composite Type:name keys and are classified as STRUCTURAL), fixed mermaid regex to parse Type[] array fields (previously silently skipped), removed contradictory meetings from committee structural list (it's PLANNED per validator), sorted all view/file lists for fully deterministic output, and added missing view patterns (committee_productivity, ministry_productivity_matrix, ministry_risk_evolution). Removed PartyLink party from politician validated schema (CSV validates scalars only). Reconciled all per-schema field implementation status tables with validator output: non-scalar array types (CommitteeMembership[], ElectionHistory[], RegionalSupport[]) and scalar array types (String[]) correctly classified as STRUCTURAL; field count notes aligned with validator fields_defined totals; total mismatches header corrected to 144. Updated validation results: 144 mismatches (168 total fields), 24 implemented, 67 structural, 42 computed, 35 planned. Classified all JSON export schema field mismatches into 4 validator-backed categories, added field status annotations to all schema and validated schema documentation, fully reconciled all per-schema field lists and counts with validation-results.json output across 9 rounds of review alignment, ensured consistent STRUCTURAL labeling across all documentation sections, added .gitignore for Python artifacts, and created a prioritized 25-field implementation roadmap.

github-actions bot commented Apr 5, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 1ad51e8.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 5, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-05T21:21:51.812855+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 121

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 45 | 12 | 31 | ⚠️ REVIEW |
| Party | 42 | 17 | 38 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (121 mismatches documented).

Copilot AI and others added 7 commits April 5, 2026 21:27
Add implementation status (IMPLEMENTED/COMPUTED/PLANNED/DEPRECATED) for
all 142 fields across 5 schemas based on validation run showing 121
mismatches. Key additions:

- Status column with icons for every field mapping table
- Validation summary showing 21 implemented, 43 computed, 28 planned,
  45 deprecated structural grouping fields
- Top 25 High-Value Fields prioritization with effort estimates
- Mismatch resolution strategy reducing 121 gaps to 18 true gaps
- Per-schema unmapped DB columns with priority recommendations
- Implementation roadmap (3 sprints)
- Field status promotion lifecycle documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Add a Field Implementation Status section after the mermaid schema
diagram in politician-schema.md. The table categorizes all 55 fields
from the class diagram into four statuses:
- IMPLEMENTED (11): fields available in current database exports
- COMPUTED (13): fields derivable from existing data
- PLANNED (11): fields requiring new data sources
- STRUCTURAL (20): JSON grouping objects

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…otations, and enhanced validator

Agent-Logs-Url: https://github.com/Hack23/cia/sessions/57d6f1f4-def9-4f9c-b234-5b04ee040d5e

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions bot commented Apr 5, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-05T21:39:09.005242+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 121

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 45 | 12 | 31 | ⚠️ REVIEW |
| Party | 42 | 17 | 38 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (121 mismatches documented).

Copilot AI and others added 2 commits April 5, 2026 21:42
…rrent analysis

- Update version to 2.0.0 / 2026-04-05
- Update Overall Assessment table with per-schema coverage percentages
- Add Field Status Summary (21 implemented, 50 structural, 71 planned)
- Add Mismatch Resolution Strategy (3-category classification)
- Add Top 25 Priority Fields for Implementation
- Add Remediation Roadmap (3 sprints with effort/impact tracking)
- Update Validation Checklist with current status

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…eld status tests

Agent-Logs-Url: https://github.com/Hack23/cia/sessions/57d6f1f4-def9-4f9c-b234-5b04ee040d5e

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions bot commented Apr 5, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-05T21:46:16.742709+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 121

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 45 | 12 | 31 | ⚠️ REVIEW |
| Party | 42 | 17 | 38 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (121 mismatches documented).

…d aggregation test

Agent-Logs-Url: https://github.com/Hack23/cia/sessions/57d6f1f4-def9-4f9c-b234-5b04ee040d5e

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions bot commented Apr 5, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-05T21:48:28.196458+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 121

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 45 | 12 | 31 | ⚠️ REVIEW |
| Party | 42 | 17 | 38 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (121 mismatches documented).

Copilot AI changed the title from "[WIP] Align JSON export schema field definitions with database schema" to "feat: classify 121 JSON export schema field mismatches and add implementation roadmap" Apr 5, 2026
Copilot AI requested a review from pethers April 5, 2026 21:51
pethers marked this pull request as ready for review April 5, 2026 21:58
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot apply changes based on the comments in this thread

Applied all 3 review comments in 8f558ea:

  1. Duplicate field name handling — Non-scalar mermaid types (PartyLink, Trends, VoteAlignment, etc.) now use composite keys (Type:name) to prevent overwriting same-named scalar fields. Added SCALAR_TYPES constant. Non-scalar types are always classified as STRUCTURAL automatically.

  2. PartyLink party in validated schema — Removed PartyLink party from politician-schema-validated.md mermaid diagram and field descriptions (CSV can only validate scalar values). Scalar String party retained with note about the distinction.

  3. __pycache__ cleanup — Confirmed no __pycache__ files are tracked in git. Added .pytest_cache to .gitignore.

Updated counts: 144 total fields, 24 implemented, 56 structural, 42 computed, 22 planned, 120 mismatches. All 23 tests pass. CodeQL clean.

Copilot AI changed the title from "feat: classify JSON export schema field mismatches, fix validator view mappings, and add implementation roadmap" to "feat: classify JSON export schema field mismatches, fix validator view mappings and duplicate field handling, and add implementation roadmap" Apr 6, 2026
pethers requested a review from Copilot April 6, 2026 15:43
pethers requested review from Copilot and removed request for Copilot April 6, 2026 15:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Comment on lines 118 to 123
# Pattern matches: +Type fieldName
# Use composite key (Type:fieldName) to avoid duplicate-name overwrites
mermaid_pattern = r'\+(\w+)\s+(\w+)'
for match in re.finditer(mermaid_pattern, content):
    field_type = match.group(1)
    field_name = match.group(2)

Copilot AI Apr 6, 2026

The mermaid field regex (\+(\w+)\s+(\w+)) doesn’t match common type forms like String[] / CommitteeLink[] (the [] breaks the \s+), so array fields in the schema diagrams are silently skipped and never classified/validated. This makes fields_defined/mismatch counts incomplete for several schemas. Update the regex to capture optional [] (and possibly generic forms), and normalize the extracted type so scalar arrays (e.g., String[]) are still treated as scalar while object arrays remain non-scalar.

| `decisions` | Container section for decision data | ❌ **STRUCTURAL** |
| `intelligence` | Container section for intelligence data | ❌ **STRUCTURAL** |
| `policy` | Container section for policy data | ❌ **STRUCTURAL** |
| `meetings` | Container section for meeting data (when structural) | ❌ **STRUCTURAL** |

Copilot AI Apr 6, 2026

meetings is listed as a PLANNED data field above, but also appears under “Structural Fields” as a JSON grouping container. This is contradictory and conflicts with the validator output (where meetings is a planned mismatch field). Remove meetings from the structural list (or rename the structural container if you intended a separate grouping object).

Comment on lines 359 to +373
# Map schema fields to database columns (convert camelCase to snake_case)
schema_fields = set(schema_info["fields"].keys())
# Sort for deterministic output order
schema_fields = sorted(schema_info["fields"].keys())

# Initialize field status tracking
field_status = {
    "implemented": [],
    "structural": [],
    "computed": [],
    "planned": []
}

# Check if schema fields exist in data (with flexible matching)
unmapped_schema_fields = []
unmapped_data_columns = list(all_columns)

for field in schema_fields:
    # Try various naming conventions (using set to avoid duplicates)
    possible_names = list({
        field,
        self._camel_to_snake(field),
        field.lower(),
        field.upper()
    })
unmapped_data_columns = sorted(all_columns)

Copilot AI Apr 6, 2026

You’ve made field-related output deterministic by sorting schema_fields/unmapped_data_columns, but the overall report/JSON still includes lists like matched_views/missing_views whose ordering depends on CSV load/iteration order. For truly stable output (to avoid noisy diffs across runs), consider sorting those view/file lists before writing them into schema_result and validation-results.json.
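For readers unfamiliar with the name matching in the snippet under review, a rough sketch follows. `camel_to_snake` is a hypothetical stand-in for the validator's `_camel_to_snake` method, whose real implementation is not shown in this thread, and `find_column` and the column names are made up for the example.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each inner capital, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

def find_column(field, columns):
    """Return the data columns matching any spelling variant of the field."""
    candidates = {field, camel_to_snake(field), field.lower(), field.upper()}
    # Sorting keeps the result stable regardless of set iteration order.
    return sorted(candidates & set(columns))

columns = ["person_id", "risk_score", "party"]  # hypothetical CSV header
print(find_column("riskScore", columns))  # ['risk_score']
print(find_column("fullName", columns))   # []
```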

Comment on lines +567 to +569
- [x] Structural fields (53) documented as JSON grouping objects
- [x] Computable fields (42) mapped to source DB columns
- [ ] Planned fields (23) tracked with data source requirements

Copilot AI Apr 6, 2026

The checklist counts appear out of sync with the validator output in validation-results.json: structural is now 56 (not 53) and planned is 22 (not 23). Update these numbers (and any related text) so the checklist stays consistent with the reported field_status_summary totals.

Suggested change
- [x] Structural fields (53) documented as JSON grouping objects
- [x] Computable fields (42) mapped to source DB columns
- [ ] Planned fields (23) tracked with data source requirements
- [x] Structural fields (56) documented as JSON grouping objects
- [x] Computable fields (42) mapped to source DB columns
- [ ] Planned fields (22) tracked with data source requirements

`validate_schemas.py` currently reports the raw mismatch total and does **not** yet exclude all non-implemented categories from its counts.

The following categories are candidates for exclusion from mismatch counts in a future validator enhancement:
- All ❌ **STRUCTURAL** JSON grouping fields (53 fields)

Copilot AI Apr 6, 2026

The “Validation Exclusions” bullet still says structural = 53 fields, but the validator summary in validation-results.json (and this PR’s docs elsewhere) reports 56 structural fields. Update this count (and verify planned/computed counts too) to keep the mismatch math consistent.

Suggested change
- All ❌ **STRUCTURAL** JSON grouping fields (53 fields)
- All ❌ **STRUCTURAL** JSON grouping fields (56 fields)

Member

pethers commented Apr 6, 2026

@copilot apply changes based on the comments in this thread

…ural list, sort view lists, update all counts

- Fix mermaid regex to match Type[] array notation (String[], CommitteeLink[], etc.)
- Normalize base type for scalar check (String[] → String is still scalar)
- Remove meetings from committee structural fields (it's PLANNED, not structural)
- Sort matched_views and missing_views lists for deterministic output
- Update SCHEMA_VALIDATION_REPORT.md checklist: structural 56→67, planned 22→35
- Update FIELD_MAPPING.md validation exclusions: structural 56→67
- Update all documentation counts: 168 total fields, 24 implemented, 67 structural,
  42 computed, 35 planned, 144 mismatches
- Re-run validator to regenerate validation-results.json and auto-generated report
- All 23 tests pass

Agent-Logs-Url: https://github.com/Hack23/cia/sessions/dafa1381-0a99-4091-bcd6-ad65a11dc1eb

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions bot commented Apr 6, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-06T16:19:53.438014+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 144

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 55 | 12 | 41 | ⚠️ REVIEW |
| Party | 51 | 17 | 47 | ⚠️ REVIEW |
| Committee | 29 | 12 | 25 | ⚠️ REVIEW |
| Ministry | 33 | 8 | 31 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (144 mismatches documented).

github-actions bot commented Apr 6, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-06T16:20:59.465024+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 144

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 55 | 12 | 41 | ⚠️ REVIEW |
| Party | 51 | 17 | 47 | ⚠️ REVIEW |
| Committee | 29 | 12 | 25 | ⚠️ REVIEW |
| Ministry | 33 | 8 | 31 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (144 mismatches documented).

Contributor Author

Copilot AI commented Apr 6, 2026

@copilot apply changes based on the comments in this thread

Applied all 5 review comments in 331d769 and 1683756:

  1. Mermaid regex — Updated to r'\+(\w+(?:\[\])?)\s+(\w+)' to capture Type[] array notation. Scalar arrays (String[]) normalized via removesuffix("[]") to match SCALAR_TYPES; non-scalar arrays (CommitteeLink[]) are classified as STRUCTURAL. Detected 24 previously-skipped array fields.

  2. meetings conflict — Removed from committee structural fields table (kept as PLANNED only). Heading count updated 8→7.

  3. Deterministic view lists: matched_views and missing_views are sorted before being written to schema_result and the JSON output.

  4. SCHEMA_VALIDATION_REPORT.md checklist — Updated: structural 53→67, planned 23→35.

  5. FIELD_MAPPING.md exclusions — Updated: structural 53→67.
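
A minimal sketch of the parsing change in item 1, using the regex quoted there. The `SCALAR_TYPES` set below is an assumed stand-in for the validator's constant, `parse_fields` is a hypothetical helper name, and the mermaid snippet is made up for illustration. How the real validator turns the scalar/non-scalar distinction into a final status is not shown here.

```python
import re

# Regex from item 1: the type may carry an optional "[]" array suffix.
FIELD_RE = re.compile(r'\+(\w+(?:\[\])?)\s+(\w+)')

# Assumption: a small stand-in for the validator's SCALAR_TYPES constant.
SCALAR_TYPES = {"String", "Integer", "Boolean"}


def parse_fields(mermaid: str) -> list[tuple[str, str, bool]]:
    """Return (declared_type, field_name, is_scalar) for each '+Type name' line."""
    fields = []
    for type_name, field_name in FIELD_RE.findall(mermaid):
        base_type = type_name.removesuffix("[]")  # "String[]" -> "String"
        fields.append((type_name, field_name, base_type in SCALAR_TYPES))
    return fields


# Illustrative class body, not copied from the real schema documents.
sample = """
class Example {
  +String id
  +String[] tags
  +CommitteeLink[] committees
}
"""

for entry in parse_fields(sample):
    print(entry)
```

With this pattern, `+String[] tags` is recognized as a scalar array and `+CommitteeLink[] committees` as a non-scalar array, instead of both lines being skipped.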

Updated counts: 168 total fields, 24 implemented, 67 structural, 42 computed, 35 planned, 144 mismatches. All 23 tests pass. CodeQL clean.
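
The deterministic-output change in item 3 could look roughly like the following. This is a sketch under assumptions: `build_schema_result` is a hypothetical helper, the view names are only sample strings borrowed from the PR description, and the real validator may structure its result differently.

```python
import json


def build_schema_result(matched_views, missing_views):
    """Hypothetical helper: store view names in sorted order."""
    return {
        "matched_views": sorted(matched_views),
        "missing_views": sorted(missing_views),
    }


# Same views, discovered in a different order on two runs.
run_a = build_schema_result(
    ["ministry_risk_evolution", "committee_productivity"],
    ["ministry_productivity_matrix"],
)
run_b = build_schema_result(
    ["committee_productivity", "ministry_risk_evolution"],
    ["ministry_productivity_matrix"],
)

# sort_keys=True also fixes the order of dictionary keys in the JSON text.
json_a = json.dumps(run_a, indent=2, sort_keys=True)
json_b = json.dumps(run_b, indent=2, sort_keys=True)
print(json_a == json_b)  # True
```

Sorting at the point of writing keeps repeated validation runs diff-friendly regardless of the order in which files are discovered.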

Copilot AI changed the title from "feat: classify JSON export schema field mismatches, fix validator view mappings and duplicate field handling, and add implementation roadmap" to "feat: classify JSON export schema field mismatches, fix validator view mappings, duplicate field handling, and array type parsing, and add implementation roadmap" on Apr 6, 2026
@pethers pethers requested a review from Copilot April 6, 2026 16:46
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Comment on lines +248 to +252
| **Committees** | | | |
| `current` | 🔄 PLANNED | Committees | Requires committee membership data |
| `historical` | 🔄 PLANNED | Committees | Requires historical committee data |
| `chairPositions` | 🔄 PLANNED | Committees | Requires chair position tracking |

Copilot AI Apr 6, 2026

The current/historical committee membership fields are shown as 🔄 PLANNED, but in the mermaid diagram they are non-scalar array types (+CommitteeMembership[] current/historical). The validator logic classifies all non-scalar types as STRUCTURAL (and emits composite keys like CommitteeMembership[]:current), so these rows (and the Status Summary counts) should be reconciled to avoid contradicting validation-results.json.
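
To illustrate the composite-key behaviour this comment refers to, here is a hedged sketch. `composite_key` and the `SCALAR_TYPES` subset are assumptions made for the example; the declared fields combine the `party` duplicate mentioned in the PR description with the `CommitteeMembership[]` fields discussed here.

```python
# Assumption: a small stand-in for the validator's SCALAR_TYPES constant.
SCALAR_TYPES = {"String", "Integer", "Boolean"}


def composite_key(type_name: str, field_name: str) -> str:
    """Plain field name for scalar types, 'Type:name' for everything else."""
    if type_name.removesuffix("[]") in SCALAR_TYPES:
        return field_name
    return f"{type_name}:{field_name}"


declared = [
    ("String", "party"),
    ("PartyLink", "party"),  # same name, different type
    ("CommitteeMembership[]", "current"),
    ("CommitteeMembership[]", "historical"),
]

tracked = {}
for type_name, field_name in declared:
    key = composite_key(type_name, field_name)
    # Non-scalar entries are always STRUCTURAL; scalar status is decided
    # elsewhere (by matching against sample data), so it is left as None here.
    tracked[key] = "STRUCTURAL" if key != field_name else None

print(sorted(tracked))
```

Because the two `party` declarations get different keys, neither overwrites the other, and both committee membership arrays end up as STRUCTURAL rather than PLANNED.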

| 🔄 PLANNED | 8 | Fields requiring new data sources or engine integration |
| ❌ STRUCTURAL | 22 | JSON grouping objects and non-scalar types, not individual data fields |

> **Note:** The table above documents all 55 fields from the mermaid diagram (14+11+8+22). The validator tracks 46 of these (14 impl + 19 struct + 11 comp + 2 plan); 9 nested class fields are only documented in this table. See `validation-results.json` for per-field status.

Copilot AI Apr 6, 2026

The note claims the validator “tracks 46” fields and that “9 nested class fields are only documented in this table”, but the validator’s mermaid regex scans the entire diagram and (per validation-results.json) counts 55 fields for this schema. This note should be updated/removed so it matches the validator behavior and reported counts.

Suggested change
> **Note:** The table above documents all 55 fields from the mermaid diagram (14+11+8+22). The validator tracks 46 of these (14 impl + 19 struct + 11 comp + 2 plan); 9 nested class fields are only documented in this table. See `validation-results.json` for per-field status.
> **Note:** The table above documents all 55 fields from the mermaid diagram (14+11+8+22). The validator scans the entire mermaid diagram and reports all 55 fields for this schema. See `validation-results.json` for per-field status.

Comment on lines +193 to +195
| subcategories | 🔄 PLANNED | Labels | Requires new data source |
| descriptions | ❌ STRUCTURAL | Labels | JSON grouping object |
| intelligenceTags | 🔄 PLANNED | Labels | Requires new data source |

Copilot AI Apr 6, 2026

subcategories and intelligenceTags are defined in the mermaid diagram as array types (String[]). The validator treats array fields as structural grouping/containers (and validation-results.json currently lists subcategories/intelligenceTags as STRUCTURAL), but this table marks them as 🔄 PLANNED. Please align these statuses (and the Summary counts) with the validator output/category definitions.

Suggested change
| subcategories | 🔄 PLANNED | Labels | Requires new data source |
| descriptions | ❌ STRUCTURAL | Labels | JSON grouping object |
| intelligenceTags | 🔄 PLANNED | Labels | Requires new data source |
| subcategories | ❌ STRUCTURAL | Labels | JSON grouping array |
| descriptions | ❌ STRUCTURAL | Labels | JSON grouping object |
| intelligenceTags | ❌ STRUCTURAL | Labels | JSON grouping array |

| 🔄 PLANNED | 17 | Requires new data sources |
| ❌ STRUCTURAL | 18 | JSON grouping objects |

> **Note:** The table above documents all 51 fields from the mermaid diagram (4+12+17+18). The validator tracks 43 of these (4 impl + 18 struct + 12 comp + 9 plan); 8 nested class fields are only documented in this table. See `validation-results.json` for per-field status.

Copilot AI Apr 6, 2026

The note says the validator “tracks 43” fields and that 8 nested fields are only documented here, but validation-results.json reports fields_defined: 51 for the Party schema, which implies the validator is already counting all mermaid fields. Please update/remove this note to avoid contradicting the validator output.

Suggested change
> **Note:** The table above documents all 51 fields from the mermaid diagram (4+12+17+18). The validator tracks 43 of these (4 impl + 18 struct + 12 comp + 9 plan); 8 nested class fields are only documented in this table. See `validation-results.json` for per-field status.
> **Note:** The table above documents all 51 fields from the mermaid diagram (4+12+17+18), which matches the Party schema `fields_defined: 51` reported in `validation-results.json`. See that file for per-field status details.

**Date:** 2026-04-05
**Status:** Analysis Complete - Remediation Plan Provided
**Validated Against:** service.data.impl/src/main/resources/full_schema.sql
**Total Schema Mismatches:** 120 (across 4 entity schemas)

Copilot AI Apr 6, 2026

This report header states Total Schema Mismatches: 120, but the updated validator output and the Field Status Summary in the same document indicate 144 mismatches (168 total fields − 24 implemented). Please reconcile the 120 vs 144 mismatch totals so readers don’t get conflicting metrics.

Suggested change
**Total Schema Mismatches:** 120 (across 4 entity schemas)
**Total Schema Mismatches:** 144 (across 4 entity schemas)

Member

pethers commented Apr 6, 2026

@copilot apply changes based on the comments in this thread

… output

- politician: current/historical → STRUCTURAL (CommitteeMembership[] non-scalar), summary 8→6 planned, 22→24 structural
- politician: note updated to reflect validator tracks all 55 fields (not 46)
- party: subcategories/intelligenceTags → STRUCTURAL (String[] array types per validator)
- party: history/regions → STRUCTURAL (ElectionHistory[]/RegionalSupport[] non-scalar)
- party: summary 17→13 planned, 18→22 structural; note updated to match fields_defined: 51
- SCHEMA_VALIDATION_REPORT.md: Total Schema Mismatches 120 → 144 (168 - 24 implemented)

Agent-Logs-Url: https://github.com/Hack23/cia/sessions/87ca677d-6eed-4545-9caa-615aaee059c1

Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>

github-actions bot commented Apr 6, 2026

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2026-04-06T17:25:34.505154+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 202
  • ⚠️ Total Mismatches: 144

Schema Status

| Schema | Fields | Views Matched | Mismatches | Status |
|---|---|---|---|---|
| Politician | 55 | 12 | 41 | ⚠️ REVIEW |
| Party | 51 | 17 | 47 | ⚠️ REVIEW |
| Committee | 29 | 12 | 25 | ⚠️ REVIEW |
| Ministry | 33 | 8 | 31 | ⚠️ REVIEW |
| Intelligence | 0 | 5 | 0 | ✅ PASS |

📄 Full Report: See uploaded artifacts for detailed validation report

ℹ️ Status: Field mismatches are documented and tracked. These represent differences between original projected schemas and actual implemented data.

📄 Data-Validated Schemas: See *-schema-validated.md files for schemas matching actual data (144 mismatches documented).

Contributor Author

Copilot AI commented Apr 6, 2026

> @copilot apply changes based on the comments in this thread

Applied all 5 review comments in 1ad51e8:

  1. Politician current/historical — Changed from PLANNED to ❌ STRUCTURAL (CommitteeMembership[] non-scalar array type per validator). Summary updated: STRUCTURAL 22→24, PLANNED 8→6.

  2. Politician note — Updated to state validator scans and reports all 55 mermaid fields (was incorrectly claiming 46 with 9 undocumented).

  3. Party subcategories/intelligenceTags — Changed from PLANNED to ❌ STRUCTURAL (array types per validator output). Also fixed history (ElectionHistory[]) and regions (RegionalSupport[]) which had the same issue. Summary updated: STRUCTURAL 18→22, PLANNED 17→13.

  4. Party note — Updated to state all 51 fields match fields_defined: 51 in validation-results.json (was incorrectly claiming 43).

  5. SCHEMA_VALIDATION_REPORT.md — Total mismatches corrected from 120 to 144 (168 total − 24 implemented).

All 23 tests pass.
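
The updated summary counts in items 1 and 3 can be checked against the per-schema field totals. The sketch below copies the numbers from this thread; the dictionary layout is illustrative only and does not mirror `validation-results.json`.

```python
# Status counts after the review fixes, taken from the items above.
summaries = {
    "politician": {"implemented": 14, "computed": 11, "planned": 6, "structural": 24},
    "party": {"implemented": 4, "computed": 12, "planned": 13, "structural": 22},
}

# Field totals reported for each schema (55 for politician, 51 for party).
fields_defined = {"politician": 55, "party": 51}

for schema, counts in summaries.items():
    total = sum(counts.values())
    # Moving fields from PLANNED to STRUCTURAL must not change the total.
    assert total == fields_defined[schema], (schema, total)
    print(schema, total)
```

Both schemas still sum to their documented field counts, since each fix only moves fields between the PLANNED and STRUCTURAL categories.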


sonarqubecloud bot commented Apr 6, 2026

@pethers pethers merged commit da6ea92 into master Apr 6, 2026
14 of 15 checks passed
@pethers pethers deleted the copilot/align-json-export-schema branch April 6, 2026 17:58

Development

Successfully merging this pull request may close these issues.

Align JSON export schema field definitions with actual database schema — resolve 125 field mismatches

3 participants