Skip to content

ETL: sync commodities with self-referencing parent hierarchy #199

@GitAddRemote

Description

@GitAddRemote

User Story

As a backend engineer, I need station_commodities populated from UEX API data with a self-referencing parent_id hierarchy resolved via a two-pass upsert, so that raw resources, refined materials, and manufactured goods can be queried in a tree structure for the catalog browser and trade route planner.

Definition of Done

  • CommoditiesSyncStep ETL step created and registered at tier-9 (alongside items, after categories)
  • station_commodities upserted by uex_id in two passes to resolve self-referencing parent_id
  • category_id FK left as NULL when UEX does not populate id_category (known data gap — emit no warning, this is expected)
  • manufacturer_id FK resolved from station_companies where present
  • is_harvestable, is_minable, is_illegal, is_raw boolean fields populated
  • buy_price_avg, sell_price_avg stored as DECIMAL(12,4) for trade calculations
  • Unit tests: two-pass parent resolution, null category_id accepted, manufacturer FK, idempotency
  • pnpm test passes

Acceptance Criteria

  • After ETL: SELECT count(*) FROM station_commodities matches UEX commodities endpoint count
  • Root-level commodities (no parent in UEX data) have parent_id = NULL — no warning emitted
  • category_id = NULL rows are valid and expected — no warning emitted for this specific gap
  • A commodity with id_parent that resolves to a known uex_id has a non-null parent_id UUID after pass 2
  • Re-running ETL is idempotent

Technical Elaboration

UEX Endpoint

GET /commodities{ id, name, code, id_category, id_company, id_parent, is_harvestable, is_minable, is_illegal, is_raw, buy_price_avg, sell_price_avg, is_available_live, ... }

Known data gap: UEX does not populate id_category on commodities. This is a UEX API limitation, not a Station bug. Leave category_id = NULL. A future Station issue will add a manual commodity→category mapping table.

Two-Pass Self-Reference Resolution

Pass 1 — Upsert all commodities with parent_id = NULL:

INSERT INTO station_commodities (uex_id, name, code, parent_id, ...)
VALUES ($uexId, $name, $code, NULL, ...)
ON CONFLICT (uex_id) DO UPDATE SET name = EXCLUDED.name, ...

Pass 2 — Build commodityByUexId map, then update parent_id for commodities with id_parent:

for (const commodity of commoditiesWithParent) {
  const parentUuid = commodityByUexId.get(commodity.id_parent);
  if (!parentUuid) {
    await this.warnMissingParent(runId, commodity);
    continue;
  }
  await this.repo.update({ uexId: commodity.id }, { parentId: parentUuid });
}

Warning on missing parent: If id_parent is set but not found in the commodity map, emit severity='warn'. This differs from category_id = NULL (which is expected and silent).

Schema Fields

station_commodities:
  uex_id         INT UNIQUE,
  name           VARCHAR(255),
  code           VARCHAR(50),
  parent_id      UUID NULLABLE FK → station_commodities (self-reference),
  category_id    UUID NULLABLE FK → station_categories,
  manufacturer_id UUID NULLABLE FK → station_companies,
  is_harvestable  BOOLEAN DEFAULT FALSE,
  is_minable      BOOLEAN DEFAULT FALSE,
  is_illegal      BOOLEAN DEFAULT FALSE,
  is_raw          BOOLEAN DEFAULT FALSE,
  buy_price_avg   DECIMAL(12,4),
  sell_price_avg  DECIMAL(12,4),
  is_available_live BOOLEAN

Deferring FK Constraint

The self-referencing parent_id FK must be DEFERRABLE INITIALLY DEFERRED in the migration DDL so that pass 1 can insert rows with parent_id = NULL and pass 2 can update them within the same transaction without violating the constraint on intermediate states.

Alternatively, set parent_id = NULL in pass 1 and update in pass 2 after committing pass 1 — no deferred FK needed if running as two separate transactions.

Design Elaboration

The category_id = NULL gap is a deliberate, documented design decision. UEX treats commodities as a flat list with an implied category (all commodities are "commodities") and does not cross-reference them to the category hierarchy. Station will eventually build a manual mapping table (station_commodity_category_map) to associate commodities with categories for faceted browsing, but this is deferred to avoid blocking the ETL pipeline on incomplete UEX data.

The self-referencing parent hierarchy represents commodity refinement chains: raw ore → processed material → manufactured good. The two-pass approach is correct because UEX does not guarantee that parent records appear before child records in API response ordering.


Depends on: #190, #196

Metadata

Metadata

Assignees

No one assigned

    Labels

    backendBackend services and logicdatabaseSchema, migrations, indexingenhancementNew feature or requesttech-storyTechnical implementation storyuex-syncUEXcorp API sync and integration

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions