feat: ETL steps for terminals and terminal distances with hourly scheduler by GitAddRemote · Pull Request #220 · GitAddRemote/station

GitAddRemote · 2026-05-25T02:16:46Z

Closes #195

Summary

TerminalsSyncStep: fetches GET /terminals, resolves location FK (space station → outpost → city priority order), 44-column upsert with ON CONFLICT (uex_id) DO UPDATE. Unresolvable primary-location FKs emit severity=warn and the row still upserts with the FK nulled. poi_id and secondary FKs (star_system, planet, orbit, moon, faction, company) are nullable and silently nulled when not found.
TerminalDistancesSyncStep: fetches GET /terminals_distances, resolves terminal_origin_id and terminal_destination_id from station_terminal by code string, batched upsert (500 rows/batch) into station_terminal_distance by (terminal_origin_id, terminal_destination_id). Unknown codes emit severity=warn and the row is skipped.
12-hour skip guard on both steps: queries station_etl_run for the most recent completed run with steps_failed=0 and no severity=error warning for this step name; skips and logs at DEBUG if within 12 hours. No full-table synced_at rewrites — skip guard is backed by the run lifecycle table.
CatalogEtlService.runStep(name): acquires the same catalog_etl advisory lock as runEtl(), creates a real EtlRun row, executes the step, and persists an EtlWarning on failure before rethrowing.
CatalogEtlScheduler: terminals-sync fires at minute 0 (@Cron('0 * * * *')); terminal-distances-sync fires at minute 5 (@Cron('5 * * * *')) to ensure terminals are populated first. ConflictException from the advisory lock is logged at DEBUG (expected concurrency); all other errors are logged at ERROR.
Both steps registered in CatalogEtlModule providers and added to ETL_STEPS pipeline after jump-points-sync.

Test plan

pnpm test --filter backend passes (175 catalog-etl tests green)
Skip guard: mock station_etl_run returning completed_at within 12h → UEX client not called
Skip guard: mock completed_at > 12h ago → full upsert cycle runs
Skip guard: empty result (null last_completed) → runs unconditionally (first-deploy safe)
Unresolvable space station / outpost / city FK → warning emitted, row still upserted with FK nulled
Unresolvable origin/destination terminal code in distances → warning emitted, row skipped
Multiple distances batched into single INSERT per batch (UPSERT_BATCH_SIZE=500)
ConflictException from runStep is caught and logged at debug in scheduler
runStep('unknown-step') → throws Error: Unknown ETL step: unknown-step

Copilot

Pull request overview

Adds two new Catalog ETL steps to ingest UEX “terminals” and “terminal distances” data and schedules them to run hourly, with a 12-hour skip guard to respect UEX caching.

Changes:

Introduces TerminalsSyncStep and TerminalDistancesSyncStep with upsert logic and warning emission.
Adds hourly CatalogEtlScheduler cron jobs to trigger the new steps via CatalogEtlService.runStep(...).
Registers new steps in the ETL pipeline and module wiring; adds unit tests for both steps.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
backend/src/modules/catalog-etl/steps/terminals-sync.step.ts	New ETL step to fetch `/terminals`, resolve parent FKs, and upsert into `station_terminal` with a 12h skip guard.
backend/src/modules/catalog-etl/steps/terminals-sync.step.spec.ts	Unit tests for terminals skip guard, upsert behavior, and warning scenarios.
backend/src/modules/catalog-etl/steps/terminal-distances-sync.step.ts	New ETL step to fetch `/terminal_distances`, resolve terminal FKs, and upsert into `station_terminal_distance` with a 12h skip guard.
backend/src/modules/catalog-etl/steps/terminal-distances-sync.step.spec.ts	Unit tests for distance skip guard, upsert behavior, and warning/skip scenarios.
backend/src/modules/catalog-etl/schedulers/catalog-etl.scheduler.ts	Adds hourly cron triggers for the two new ETL steps.
backend/src/modules/catalog-etl/catalog-etl.service.ts	Wires new steps into the ETL pipeline and introduces `runStep(stepName)`.
backend/src/modules/catalog-etl/catalog-etl.service.spec.ts	Updates service test module providers to include the new steps.
backend/src/modules/catalog-etl/catalog-etl.module.ts	Registers the new steps and the new scheduler provider.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

- Add TerminalsSyncStep: 12-hour skip guard, location FK resolution (space station → outpost → city priority), 43-column upsert with ON CONFLICT (uex_id) DO UPDATE - Add TerminalDistancesSyncStep: 12-hour skip guard, terminal FK resolution by uex_id, upsert by (terminal_origin_id, terminal_destination_id) - Add CatalogEtlScheduler: @Cron('0 * * * *') for both terminal steps via runStep() - Add runStep(name) to CatalogEtlService for targeted single-step execution - Register both steps and scheduler in CatalogEtlModule; extend ETL_STEPS pipeline - Unit tests: skip guard, upsert, FK resolution, warning emission (32 tests across both steps)

…ix scheduler race - Resolve star_system_id, planet_id, orbit_id, moon_id, faction_id, company_id via lookup maps (9 parallel queries) instead of passing raw UEX IDs into FK columns - Add TERMINAL_TYPE_MAP to translate UEX type strings to the schema CHECK-constraint enum; skip + warn on unknown types rather than violating the CHECK constraint - Offset terminal-distances-sync cron to '5 * * * *' so it fires 5 min after terminals-sync, preventing a race where the distances job snapshots station_terminal before it is populated - Expand spec: 45 tests (was 32) covering type mapping, secondary FK resolution, and null passthrough

… EtlRun; fix mock names - Replace station_etl_run_state skip guard with query on station_etl_run (the old table is dropped by CatalogEtlSchemaMigration1748000000001) - runStep() now creates a real EtlRun row (UUID), passes its runId to the step, and saves an error EtlWarning on failure — warnings no longer have an invalid run_id or missing FK row - Remove INSERT INTO station_etl_run_state at end of each step (run state is now owned by runStep) - Align service spec mock names: 'terminals-sync' / 'terminal-distances-sync' - Add runStep unit tests: success path, failure path, unknown-step throws

- Replace station_etl_run NOT EXISTS skip guard with MAX(synced_at) on target table — eliminates false-positive skips on first deployment where an empty table now correctly returns NULL and bypasses the guard - Wrap runStep() in advisoryLockService.withLock('catalog_etl') to prevent concurrent scheduler invocations from racing with runEtl() - Rethrow caught step errors after persisting warning + saving failed run state — scheduler's try/catch now receives the error for logging - Update runStep tests: assert advisory lock is acquired, assert step failure rejects (not resolves), add ConflictException lock-held case

terminals-sync: - Add poi_id FK resolution — load SELECT uex_id, id FROM station_poi and resolve record.id_poi to local BIGINT id; stored as null when not found (same pattern as other nullable secondary FKs) - Add poi_id to INSERT column list and ON CONFLICT DO UPDATE SET - Fix misleading inline comment on secondary FK block terminal-distances-sync: - Correct endpoint from /terminal_distances to /terminals_distances per schema docs; API response uses terminal_code_origin / terminal_code_destination (string codes), not integer IDs - Resolve distances by station_terminal.code lookup instead of uex_id - Replace per-row INSERT loop with batched multi-row upserts (UPSERT_BATCH_SIZE=500) to handle the ~500k-row dataset efficiently tests: - Update terminal-distances-sync spec for new endpoint/field shapes - Update terminals-sync spec for poi FK map, poi_id param index, shifted parameter positions for secondary FKs and boolean flags

… in batches Skip guard — both steps now write 'epoch' into synced_at on every INSERT and ON CONFLICT UPDATE, then issue a single UPDATE ... SET synced_at = NOW() after the loop completes. MAX(synced_at) is therefore only non-epoch once the entire run has succeeded, so a mid-run failure never advances the guard and the next scheduled run always retries the full load. Memory — terminal-distances-sync no longer accumulates a validRows array for the whole dataset before batching. Records are now streamed: params and placeholders are built incrementally and flushed every UPSERT_BATCH_SIZE rows, keeping peak memory proportional to the batch size rather than the full dataset. Tests — added assertions that INSERTs contain 'epoch' and that the final synced_at UPDATE fires after the loop for both steps.

…ial failure A run that fails mid-way previously left untouched rows with their prior non-epoch synced_at, causing MAX(synced_at) to still reflect the last successful run and the next hourly schedule to skip. Both terminals-sync and terminal-distances-sync now issue an UPDATE ... SET synced_at = 'epoch' before fetching data from UEX. This ensures that if the run fails at any point after that, all rows (touched or not) are at epoch, MAX(synced_at) WHERE synced_at > '1970-01-01' returns NULL, and the guard is bypassed on the next run. Tests updated to assert the epoch-reset fires and precedes the first INSERT.

… in scheduler - Replace '> 1970-01-01' with '> epoch' in both skip guard queries; the DATE cast is timezone-sensitive and can include epoch rows on non-UTC DB servers — 'epoch' is an unambiguous PG epoch synonym - Catch ConflictException in both cron handlers and log at debug rather than error; a lock conflict is expected concurrency, not a step failure

…mocks - Skip guard now queries station_etl_run (status=completed, steps_failed=0, no error warning for this step) instead of MAX(synced_at) on the data table; removes both full-table epoch-reset and NOW()-advance UPDATEs that were rewriting ~500k rows on every terminal-distances run - INSERTs now write synced_at=NOW() directly; no epoch sentinel needed - Fix duplicate @nestjs/common imports in catalog-etl.scheduler.ts - Tighten terminal-distances spec mock: FROM station_terminal && !station_terminal_distance to prevent substring collision masking SQL changes - Update PR description: correct cron schedule (minute 5 for distances), correct endpoint/field names, accurate test plan

The skip guard previously lived inside each step's execute(), which caused runStep() to create a completed EtlRun record on every cron tick — even when returning early — refreshing completed_at and making the step permanently unable to run past the first cycle. Move the guard to CatalogEtlScheduler.shouldSkip() via a new CatalogEtlService.getLastSuccessfulStepRun() method so no run record is created on a skip. Steps execute() now starts immediately with the API fetch. Tests updated to drop skip-guard describe blocks and the station_etl_run mock branch; DataSource mock added to service spec.

…th CategoriesSyncStep - Add nullable step_name column to station_etl_run (migration 1779710000000) and EtlRun entity so runStep() records which step the single-step run executed - Set stepName in runStep() create call so every scheduler-triggered run is tagged with its step name from the start - Replace the NOT EXISTS warning-based skip guard query with a direct WHERE step_name = $1 filter, preventing historical full-ETL runs (which have no step_name) from falsely satisfying the guard on first deploy - Add index idx_etl_run_step_name for efficient skip guard lookups - Update runStep spec assertion to verify stepName is set on the created run - Merge conflict: include CategoriesSyncStep (from main/ISSUE-196) alongside TerminalsSyncStep and TerminalDistancesSyncStep in module, service, and spec

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Add NOT EXISTS check against station_etl_warning so that completed runs which recorded severity='error' warnings (without throwing) are not counted as successful — the skip guard would otherwise hold for 12h after a run that silently logged errors.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Copilot