BDMS-626: Improve validation, schema alignment, and well inventory handling #596

Merged
jirhiker merged 72 commits into staging from kas-well-BDMS-626-inventory-ingestion-updates_v2 on Mar 23, 2026
Conversation

@ksmuczynski (Contributor) commented Mar 11, 2026

Why

This PR addresses the following problem / context:

  • Database mapping was incomplete: public_availability_acknowledgement, monitoring_status, well/water notes, and water level observations were not being persisted correctly.
  • Several schema fields were overly strict, requiring values that should be optional, using plain strings where lexicon enums were expected, and rejecting valid CSVs too aggressively.
  • The WellInventoryRow schema did not yet integrate with lexicon-based enums, leaving CSV ingestion and database persistence inconsistent.
  • Real-world imports exposed database failures that were hard to diagnose, especially around contact organization foreign keys and empty-string values being written into lexicon-backed columns.
  • Real-world source files also included monitoring_frequency values like Complete that do not map cleanly to the current lexicon and needed importer-side normalization.
  • Valid CSVs saved with a UTF-8 BOM could be misread at the header level, causing required fields like project to appear missing.
  • BDD tests suffered from primary key conflicts and outdated test data that no longer aligned with the importer’s validation and best-effort behavior.
  • Previously, a single malformed row in a CSV would abort the entire import process. Users need "best-effort" logic where valid data is saved while invalid rows are flagged.

How

Implementation summary - the following was changed / added / removed:

  • Database mapping:

    • public_availability_acknowledgement now maps to Location.release_status (True → public, False → private, unset → draft).
    • monitoring_status is now written to the StatusHistory table.
    • well_notes and water_notes are now stored as polymorphic notes on the Thing.
    • Providing water-level data now creates the expected Sample and Observation records and attaches them to the correct field activity.
    • Water-level observation field names were aligned with the actual database model.
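    The release-status mapping described above can be sketched as a small helper. This is a minimal illustration only; the real code writes the result to `Location.release_status`, and the term strings are assumed here:

    ```python
    from typing import Optional

    def map_release_status(acknowledgement: Optional[bool]) -> str:
        """Translate the CSV public_availability_acknowledgement flag
        into a release-status lexicon term (assumed term strings)."""
        if acknowledgement is True:
            return "public"
        if acknowledgement is False:
            return "private"
        return "draft"  # unset / missing value defaults to draft
    ```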
  • WellInventoryRow schema:

    • Made site_name, elevation_ft, elevation_method, monitoring_point_height_ft, and depth_to_water_ft optional.
    • Replaced plain str types with lexicon-based enums for elevation_method, depth_source, well_pump_type, monitoring_status, sample_method, and data_quality.
    • Added a flexible_lexicon_validator for case-insensitive, whitespace-tolerant matching.
    • Relaxed contact validation to require only one of contact_name or contact_organization.
    • Made water_level_date_time required only when depth_to_water_ft is provided.
    • Normalized blank values like depth to water, contact organization, and well status to None instead of persisting invalid empty strings.
    • Added regression coverage for blank depth_to_water_ft and blank lexicon-backed contact/status fields.
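    A minimal sketch of the `flexible_lexicon_validator` idea: case-insensitive, whitespace-tolerant matching against an enum, with blanks normalized to `None`. The enum members here are illustrative, not the project's actual lexicon terms:

    ```python
    from enum import Enum

    class ElevationMethod(str, Enum):  # illustrative members only
        GPS = "GPS"
        SURVEY = "Survey"

    def flexible_lexicon_validator(enum_cls):
        """Build a validator that resolves raw CSV text to an enum member,
        ignoring case and surrounding whitespace; blanks become None."""
        lookup = {member.value.strip().lower(): member for member in enum_cls}

        def validate(value):
            if value is None or isinstance(value, enum_cls):
                return value
            text = str(value).strip()
            if not text:
                return None  # blank cells map to None, not empty strings
            try:
                return lookup[text.lower()]
            except KeyError:
                raise ValueError(f"{value!r} is not a valid {enum_cls.__name__} term")

        return validate
    ```

    A validator like this can be attached to a Pydantic field via `BeforeValidator`, so unknown terms surface as row-level validation errors instead of database failures.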
  • Best-effort logic:

    • Implemented row-level atomic savepoints so individual row failures no longer abort the full import.
    • Failed rows are logged with 1-based row numbers and well IDs in a validation_errors list.
    • Fixed an UnboundLocalError caused by auto-generated well IDs.
    • Updated error reporting to surface actual database-level failures rather than collapsing them to a generic database error.
    • Added commit=False support so services/thing_helper.py and services/contact_helper.py participate correctly in the outer best-effort transaction without prematurely committing.
    • Made reruns idempotent by detecting previously imported rows and skipping duplicate record creation instead of failing on unique constraints.
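    The row-level savepoint pattern can be sketched as below. This is an assumption-laden outline, not the actual importer: `persist_row` stands in for the real per-row persistence logic, and `session` is a SQLAlchemy session whose `begin_nested()` opens a SAVEPOINT:

    ```python
    def import_rows(session, rows, persist_row):
        """Best-effort import: a failing row rolls back only its own
        savepoint and is recorded with a 1-based row number."""
        imported, validation_errors = [], []
        for index, row in enumerate(rows, start=1):
            try:
                with session.begin_nested():  # SAVEPOINT scoped to this row
                    imported.append(persist_row(session, row))
            except Exception as exc:
                validation_errors.append({"row": index, "error": str(exc)})
        session.commit()  # commit only the rows that survived
        return imported, validation_errors
    ```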
  • Lexicon and import compatibility:

    • Added missing organization terms required by real-user imports so contact organization foreign keys can resolve correctly.
    • Updated monitoring_status enum construction to use the correct lexicon category.
    • Added BOM-safe CSV decoding so UTF-8 BOM headers are parsed correctly during CLI import.
    • Normalized monitoring_frequency = Complete source values so they do not create a monitoring frequency record and instead set monitoring_status to Not currently monitored.
    • Added unit and BDD coverage for the Complete monitoring frequency normalization behavior.
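    The BOM-safe decoding fix boils down to decoding with `utf-8-sig`, which strips a leading UTF-8 BOM so the first header (e.g. `project`) is not misread as `\ufeffproject`. A self-contained sketch:

    ```python
    import csv
    import io

    def read_csv_rows(raw_bytes: bytes):
        """Decode CSV bytes BOM-safely and return a list of dict rows."""
        text = raw_bytes.decode("utf-8-sig")  # strips a leading BOM if present
        return list(csv.DictReader(io.StringIO(text)))
    ```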
  • BDD test suite:

    • Updated test CSVs in tests/features/data/ to use valid lexicon terms and properly quoted comma-containing values.
    • Added scenario-based unique ID suffixing in Given steps to isolate tests and prevent primary key conflicts.
    • Updated well-inventory-csv.feature to assert partial success (e.g., "1 well is imported") in negative scenarios where best-effort behavior is expected.
    • Aligned validation expectations with the importer’s current detailed database error reporting.
    • All 44 well inventory BDD scenarios now pass when the test database is rebuilt against the current branch state.

Notes

Any special considerations, workarounds, or follow-up work to note?

  • Further enhancements may be required for complete schema validation coverage.
  • Additional real-user CSV validation coverage may be scoped in a follow-up PR focused on ingestion of real user-entered data.
  • Lexicon matching is intentionally lenient at ingestion time (case-insensitive, whitespace-stripped), but values must still resolve to a known lexicon term before persistence.
  • Because contact organizations remain lexicon-backed, new real-world organizations still require lexicon seeding in the target database before imports will succeed.
  • Test isolation is achieved via per-scenario ID suffixing in the BDD setup; if the shared test database strategy changes in the future, this workaround may no longer be necessary.
  • BDD runs on this branch may require rebuilding ocotilloapi_test after schema or lexicon changes to avoid stale test database state.

- Introduced `validation_alias` with `AliasChoices` for selected fields (`well_status`, `sampler`, `measurement_date_time`, `mp_height`) to allow alternate field names.
- Ensured alignment with schema validation updates.
- Introduced unit tests for `WellInventoryRow` alias mappings.
- Verified correct handling of alias fields like `well_hole_status`, `mp_height_ft`, and others.
- Ensured canonical fields take precedence when both alias and canonical values are provided.
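The alias pattern above can be sketched with Pydantic's `AliasChoices` (field names here are illustrative; the real `WellInventoryRow` has many more fields). Listing the canonical name first means it wins when both are supplied:

```python
from typing import Optional
from pydantic import AliasChoices, BaseModel, Field

class WellInventoryRow(BaseModel):
    # Canonical name listed first in AliasChoices takes precedence.
    well_status: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("well_status", "well_hole_status"),
    )
    mp_height: Optional[float] = Field(
        default=None,
        validation_alias=AliasChoices("mp_height", "mp_height_ft"),
    )
```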
… and new fields

- Added `flexible_lexicon_validator` to support case-insensitive validation of enum-like fields.
- Introduced new fields: `OriginType`, `WellPumpType`, `MonitoringStatus`, among others.
- Updated existing fields to use flexible lexicon validation for improved consistency.
- Adjusted `WellInventoryRow` optional fields handling and validation rules.
- Refined contact field validation logic to require `role` and `type` when other contact details are provided.
…dations

- Refined validation error handling to provide more detailed feedback in test assertions.
- Adjusted test setup to ensure accurate validation scenarios for contact and water level fields.
- Updated contact-related tests to validate new composite field error messages.
- Renamed "Water" to "Water Bearing Zone" and refined its definition.
- Added new term "Water Quality" under `note_type` category.
… to prevent cross-test collisions

- Supports BDD test suite stability
- Added hashing mechanism to append unique suffix to `well_name_point_id` for scenario isolation.
- Integrated pandas for robust CSV parsing and content modifications when applicable.
- Ensured handling preserves existing format for IDs ending with `-xxxx`.
- Maintained existing handling for empty or non-CSV files.
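A minimal sketch of the per-scenario suffixing described above, assuming the suffix is derived by hashing the scenario name; the function name and suffix length are illustrative:

```python
import hashlib

def suffix_well_id(well_id: str, scenario_name: str) -> str:
    """Append a scenario-derived 4-character suffix so the same fixture
    ID cannot collide across BDD scenarios; IDs already ending in
    '-xxxx' (and empty IDs) are left untouched."""
    if not well_id or well_id.endswith("-xxxx"):
        return well_id
    digest = hashlib.sha1(scenario_name.encode("utf-8")).hexdigest()[:4]
    return f"{well_id}-{digest}"
```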
…ollback side effects

- Supports transaction management
- Moved `session.refresh` calls under `commit` condition to streamline database session operations.
- Reorganized `session.rollback` logic to properly align with commit flow.
…ory source fields in support of schema alignment and database mapping

- Update well inventory CSV files to correct data inconsistencies and improve schema alignment.
- Added support for `Sample`, `Observation`, and `Parameter` objects within well inventory processing.
- Enhanced elevation handling with optional and default value logic.
- Introduced `release_status`, `monitoring_status`, and validation for derived fields.
- Updated notes handling with new cases and refined content categorization.
- Improved `depth_to_water` processing with associated sample and observation creation.
- Refined lexicon updates and schema field adjustments for better data consistency.
…h 1 well

- Updated BDD tests to reflect changes in well inventory bulk upload logic, allowing the import of 1 well despite validation errors.
- Modified step definitions for more granular validation on imported well counts.
- Enhanced error message detail in responses for validation scenarios.
- Adjusted sample CSV files to match new import logic and validation schema updates.
- Refined service behavior to improve handling of validation errors and partial imports.
Copilot AI left a comment

Pull request overview

This PR updates the well-inventory CSV ingestion pipeline to better align schema validation with lexicon-backed enums, persist previously-missed fields (notes, monitoring/public availability, water level observations), and change the import behavior to “best-effort” so individual row failures don’t abort the entire upload.

Changes:

  • Added row-level savepoints and improved error reporting so invalid rows are skipped while valid rows are persisted.
  • Updated WellInventoryRow schema to relax/adjust requirements and introduce lexicon-backed enum parsing plus CSV field aliases.
  • Updated BDD/unit tests and test CSV fixtures to reflect new validation rules and partial-success behavior.

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 8 comments.

File summary:

  • services/well_inventory_csv.py: Best-effort import via nested savepoints; mapping updates for release status, notes, monitoring status, and water level observation persistence
  • schemas/well_inventory.py: Schema optionality updates, lexicon enum coercion, contact + water level validation adjustments, and alias handling
  • services/thing_helper.py: Adds monitoring status history writes; adjusts commit/rollback behavior for outer-transaction support
  • services/contact_helper.py: Adjusts commit/rollback behavior for outer-transaction support
  • cli/service_adapter.py: Improves exit code and stderr reporting for partial success and validation failures
  • tests/test_well_inventory.py: Adds schema alias tests and helper row builder
  • tests/features/well-inventory-csv.feature: Updates expected partial-success outcomes in negative scenarios
  • tests/features/steps/*.py: Updates step definitions for partial success and loosens validation-error matching
  • tests/features/data/*.csv: Refreshes fixture data to match new lexicon/validation expectations
  • core/lexicon.json: Adds a new note_type term

… unique well name suffixes in well inventory scenarios

- Updated `pd.read_csv` calls with `keep_default_na=False` to retain empty values as-is.
- Refined logic for suffix addition by excluding empty and `-xxxx` suffixed IDs.
- Improved test isolation by maintaining scenario-specific unique identifiers.
…nd `DataQuality`

- Changed `SampleMethodField` to validate against `SampleMethod` instead of `OriginType`
- Changed `DataQualityField` to validate against `DataQuality` instead of `OriginType`
… import

- Make contact.role and contact.contact_type nullable in the ORM and migrations
- Update contact schemas and well inventory validation to accept missing values
- Allow contact import when name or organization is present without role/type
- Stop round-tripping CSV fixtures through pandas to avoid rewriting structural test cases
- Preserve repeated header rows and duplicate column fixtures so importer validation is exercised correctly
- Keep the blank contact name/organization scenario focused on a single invalid row for stable assertions
…n errors

- Prevent one actual validation error from satisfying multiple expected assertions (avoids false positives)
- Keep validation matching order-independent while requiring distinct matches (preserves flexibility)
- Tighten BDD error checks without relying on exact error text (improves test precision)
…behavior

- Update partial-success scenarios to expect valid rows to import alongside row-level validation errors
- Reflect current importer behavior for invalid lexicon, invalid date, and repeated-header cases
- Keep BDD coverage focused on user-visible import outcomes instead of outdated all-or-nothing assumptions
…sitive parsing

- Update unit expectations to accept lowercase placeholder tokens that are now supported
- Document normalization of mixed-case and spaced placeholder formats to uppercase prefixes
- Keep test coverage aligned with importer behavior and reduce confusion around valid autogen inputs
…DataQuality`

- Adjust test data to reflect updated descriptions for `sample_method` and `data_quality` fields.
…ization scenarios

- Add test to ensure contact creation returns None when both name and organization are missing
- Add test to verify contact creation with organization only, ensuring proper dict structure
- Update assertions for comprehensive validation of contact fields
@ksmuczynski (Contributor, Author):

@jirhiker Recent updates to the CLI command tests are failing on my branch. Since these changes don't seem to impact the well inventory ingestion, should I resolve the test failures within this PR, or would you prefer I handle them in a separate one before re-opening for review?

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 5 comments.

…ence

- Promote well_status to lexicon-backed validation with well_hole_status alias support
- Prevent invalid well_hole_status values from surfacing as DB constraint errors
- Align BDD fixtures and assertions with stable user-facing validation behavior
Copilot AI review requested due to automatic review settings March 20, 2026 02:17
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 4 comments.


existing_well = _find_existing_imported_well(session, model)
if existing_well is not None:
return existing_well.name
Copilot AI commented Mar 20, 2026:

The idempotency shortcut returns existing_well.name, which gets appended to wells and counted in total_rows_imported. On reruns this reports rows as “imported” even though nothing new was created, which is misleading for CLI summary/metrics. Consider returning None (and optionally adding a warning entry) when a row is detected as already imported so total_rows_imported reflects newly-created records only.

Suggested change:
-    return existing_well.name
+    logging.info(
+        "Well '%s' already imported; skipping creation for idempotent CSV row",
+        existing_well.name,
+    )
+    return None

Comment on lines 141 to 145
    expected_errors = [
        {
            "field": "composite field error",
-           "error": "Value error, contact_1_role must be provided if name is provided",
+           "error": "Value error, contact_1_role is required when contact fields are provided",
        }
Copilot AI commented Mar 20, 2026:

This step expects the error text "contact_1_role is required when contact fields are provided", but the WellInventoryRow validator raises "... when contact data is provided". Because _handle_validation_error matches by substring, this wording mismatch will prevent the step from matching the actual validation error. Update the expected string to match the current validator message (or use a shorter substring that’s stable).

@then("the command exits with a non-zero exit code")
def step_impl_command_exit_nonzero(context):
-   assert context.cli_result.exit_code != 0
+   assert context.cli_result.exit_code != 0, context.cli_result.exit_code
Copilot AI commented Mar 20, 2026:

The assertion message here only echoes the numeric exit code, which doesn’t help diagnose failures. Consider including context.cli_result.stderr (and/or .stdout) so a failing non-zero exit assertion shows the underlying CLI error output.

jacob-a-brown and others added 7 commits March 20, 2026 16:31
- Emit validation and import progress during interactive CLI runs
- Report per-project import progress and periodic row counts
- Keep non-interactive callers and tests quiet by default
…stion-updates_v2' into kas-well-BDMS-626-inventory-ingestion-updates_v2
Copilot AI review requested due to automatic review settings March 23, 2026 15:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

services/well_inventory_csv.py:796

  • model.completion_source is now an OriginType enum (see WellInventoryRow), but CreateWell.well_completion_date_source is typed as str | None. Passing the enum object here can serialize to something like OriginType.<name> instead of the underlying lexicon term, causing incorrect persistence or FK failures. Convert to the enum .value (or change CreateWell.well_completion_date_source to OriginType) before calling CreateWell(...) for consistency with the other enum-backed fields.
        well_completion_date=model.date_drilled,
        well_completion_date_source=model.completion_source,
        well_pump_type=model.well_pump_type,

…t currently monitored"

- Treat source monitoring_frequency value Complete as no monitoring frequency
- Map Complete rows to monitoring_status = Not currently monitored
- Add schema and import regression coverage for the normalization
- Add unit and BDD coverage for the normalization behavior
Copilot AI review requested due to automatic review settings March 23, 2026 19:13
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

services/well_inventory_csv.py:803

  • WellInventoryRow.completion_source is now an OriginType enum, but CreateWell.well_completion_date_source is typed as str | None and is later written to DataProvenance.origin_type in add_thing. Passing the enum through here risks persisting the enum repr (e.g., OriginType.<member>) instead of the lexicon term. Convert to the enum's .value (or None) before building CreateWell, similar to how other lexicon-backed fields are normalized above (e.g., elevation_method, depth_source).
        well_completion_date=model.date_drilled,
        well_completion_date_source=model.completion_source,
        well_pump_type=model.well_pump_type,

… to Unknown

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 23, 2026 19:31
@jirhiker jirhiker merged commit 75d2cb5 into staging Mar 23, 2026
9 of 10 checks passed
@jirhiker jirhiker deleted the kas-well-BDMS-626-inventory-ingestion-updates_v2 branch March 23, 2026 19:34
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/features/steps/well-inventory-csv.py:1

  • select is used but not imported in this file (at least in the provided diff context). Add from sqlalchemy import select (or equivalent) near the top so this step doesn’t raise NameError.

-   elevation_m = convert_ft_to_m(elevation_ft)
+   elevation_ft = model.elevation_ft
+   elevation_m = (
+       convert_ft_to_m(float(elevation_ft)) if elevation_ft is not None else 0.0
+   )
Copilot AI commented Mar 23, 2026:

When elevation_ft is missing, this writes an elevation of 0.0 meters, which is a real value and can be indistinguishable from valid near-sea-level elevations. Prefer persisting None (or omitting the field) when elevation is absent; if the database requires a non-null elevation, consider enforcing it as required at the schema level instead of silently defaulting to 0.

Suggested change:
-   convert_ft_to_m(float(elevation_ft)) if elevation_ft is not None else 0.0
+   convert_ft_to_m(float(elevation_ft)) if elevation_ft is not None else None

Comment on lines 431 to 438
if commit:
session.commit()
session.refresh(thing)

for note in thing.notes:
session.refresh(note)
else:
session.flush()
Copilot AI commented Mar 23, 2026:

This changes add_thing(commit=False) behavior: previously the function refreshed thing (and its notes) even when only flushing. If any callers rely on refreshed relationships/attributes after commit=False, they may now observe stale/incomplete state. Consider restoring session.refresh(thing) (and any required refreshes) after session.flush() as well, while keeping the conditional rollback behavior.

Comment on lines +182 to +184
WellStatusField: TypeAlias = Annotated[
Optional[MonitoringStatus],
BeforeValidator(flexible_lexicon_validator(MonitoringStatus)),
Copilot AI commented Mar 23, 2026:

WellStatusField typed as MonitoringStatus is semantically confusing and makes it hard to reason about which lexicon category is intended for well status vs monitoring status. Consider introducing a dedicated enum/type alias (e.g., WellStatus or StatusValue) and use that for well_status, leaving MonitoringStatus for monitoring-specific status.

Suggested change:
-   WellStatusField: TypeAlias = Annotated[
-       Optional[MonitoringStatus],
-       BeforeValidator(flexible_lexicon_validator(MonitoringStatus)),
+   WellStatus: TypeAlias = MonitoringStatus
+   WellStatusField: TypeAlias = Annotated[
+       Optional[WellStatus],
+       BeforeValidator(flexible_lexicon_validator(WellStatus)),

progress_callback, "No valid rows were available for import."
)
except Exception as exc:
logging.exception("Unexpected error in _import_well_inventory_csv")
Copilot AI commented Mar 23, 2026:

On unexpected exceptions this returns early without explicitly rolling back the session. Since the function is operating on an open SQLAlchemy session (and may have pending/failed nested transactions), explicitly calling session.rollback() before returning helps ensure the connection is returned to the pool cleanly and avoids “transaction is aborted” issues in subsequent operations within the same context.

Suggested change:
    logging.exception("Unexpected error in _import_well_inventory_csv")
+   session.rollback()


existing_well = _find_existing_imported_well(session, model)
if existing_well is not None:
return existing_well.name
Copilot AI commented Mar 23, 2026:

Returning a truthy value for already-imported rows makes the importer treat “skipped” rows as “imported” (e.g., imported_count += 1 and total_rows_imported can include duplicates on reruns). Consider returning a sentinel (e.g., None) or a structured result that distinguishes created vs skipped, so summaries and progress reporting remain accurate while still surfacing which well IDs were detected as duplicates.

Suggested change:
-   return existing_well.name
+   logging.info(
+       "Skipping import for existing well '%s' (id=%s)",
+       existing_well.name,
+       getattr(existing_well, "id", None),
+   )
+   return None

4 participants