
Add instance busy counter for graph API cycle protection#619

Merged
jfrench9 merged 5 commits into main from feature/graph-api-cycle-protection
Apr 15, 2026

Conversation

@jfrench9
Member

Summary

Introduces an instance-level busy counter mechanism to protect graph API operations from destructive interference during critical workflows. This feature ensures that infrastructure operations (such as ASG refreshes and container restarts) are aware of in-progress graph operations, preventing data corruption and request failures caused by mid-cycle disruptions.

Key Accomplishments

New Instance Busy Counter Middleware

  • Added a new instance_busy middleware module that tracks the number of active destructive operations on a graph instance
  • Implements an atomic counter pattern to safely increment/decrement busy state across concurrent operations
  • Provides a queryable endpoint or mechanism for infrastructure tooling to check whether an instance is safe to recycle
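As a rough illustration of the mechanism described above (this is a hedged sketch, not the actual module: the table name, attribute names, and op-kind constant are assumptions), the middleware can be an async context manager that atomically increments a DynamoDB counter on entry and decrements it on exit, swallowing DynamoDB errors so a broken counter never blocks real work:

```python
import time
from contextlib import asynccontextmanager

TABLE_NAME = "instance-busy"                 # assumed table name
OP_KIND_MATERIALIZATION = "materialization"  # illustrative op kind

def _update_counter(ddb, instance_id, delta, op_kind):
    """Atomically add `delta` to the busy counter; never raise."""
    try:
        ddb.update_item(
            TableName=TABLE_NAME,
            Key={"instance_id": {"S": instance_id}},
            # DynamoDB ADD is atomic, so concurrent +1/-1 pairs are safe.
            UpdateExpression=(
                "ADD active_destructive_ops :d "
                "SET last_destructive_op_at = :ts, last_destructive_op_kind = :k"
            ),
            ExpressionAttributeValues={
                ":d": {"N": str(delta)},
                ":ts": {"N": str(int(time.time()))},
                ":k": {"S": op_kind},
            },
        )
    except Exception:
        # Log-and-continue contract: a broken counter must never block work.
        pass

@asynccontextmanager
async def instance_busy(ddb, instance_id, op_kind):
    """Mark the instance busy for the duration of a destructive operation."""
    _update_counter(ddb, instance_id, +1, op_kind)
    try:
        yield
    finally:
        # Runs even if the operation raises, so the counter always rebalances.
        _update_counter(ddb, instance_id, -1, op_kind)
```

Infrastructure tooling can then read `active_destructive_ops` for the instance and treat a zero value as "safe to recycle".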

Infrastructure-Aware Deployment Workflows

  • Enhanced CI/CD workflows (staging, prod, ASG refresh, service refresh) to integrate with the busy counter before performing instance replacements
  • Updated GitHub Actions for graph ASG refresh and graph container refresh to check instance busy state, ensuring graceful handling of in-flight operations
  • Added dedicated workflow files for graph ASG refresh and service refresh orchestration

Operation Integration

  • Instrumented materialization operations (extensions/materialize.py and lbug/direct_materialization.py) to register as busy during execution
  • Updated the SEC pipeline stage adapter to be aware of the busy counter lifecycle
  • Enhanced the graph table management router with refactored logic to coordinate with instance busy state
  • Added a Dagster job definition for graph-related operations

Comprehensive Test Coverage

  • Added a full test suite (test_instance_busy.py) covering counter increment/decrement, concurrent access, edge cases, and integration scenarios

Breaking Changes

  • The graph table management router (databases/tables/management.py) has been significantly refactored. Any external consumers or tests relying on the previous internal structure of that module should be reviewed.
  • Infrastructure deployment pipelines now depend on the busy counter check — rollout tooling must support the new pre-check step or deployments may block waiting for instances to become idle.

Testing Notes

  • Unit tests for the instance busy counter cover atomic operations, concurrent access patterns, and boundary conditions
  • Integration testing should verify that deployment workflows correctly poll and respect the busy state before proceeding with instance replacement
  • Recommend end-to-end testing in staging to validate that long-running materialization jobs properly hold the busy counter and that refreshes wait accordingly
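A minimal illustration of the concurrent-access pattern such tests exercise (the counter here is a thread-based stand-in, not the real middleware): many workers entering and leaving balanced busy sections must leave the counter at exactly zero.

```python
import threading

class AtomicBusyCounter:
    """Illustrative stand-in for the instance busy counter."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def enter(self):
        with self._lock:
            self.value += 1

    def leave(self):
        with self._lock:
            self.value -= 1

def hammer(counter, iterations=10_000):
    # Each iteration is a balanced busy section: enter, do work, leave.
    for _ in range(iterations):
        counter.enter()
        counter.leave()

counter = AtomicBusyCounter()
threads = [threading.Thread(target=hammer, args=(counter,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Balanced enter/leave pairs must net out to exactly zero.
assert counter.value == 0
```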

Infrastructure Considerations

  • New GitHub Actions and workflow definitions are introduced for graph-specific ASG and container refresh flows — these should be reviewed by the infrastructure/DevOps team for alignment with existing deployment patterns
  • The busy counter mechanism assumes instance-local state; if the architecture moves to shared or distributed state, this approach will need to be revisited
  • Deployment pipelines now have an added dependency on instance health/readiness checks, which may increase rollout times when instances are under heavy load

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/graph-api-cycle-protection
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Introduced a new `instance_busy` middleware to track active destructive operations on EC2 instances using DynamoDB.
- Added `max-wait-minutes` and `force-ignore-busy` inputs to refresh actions, allowing for configurable wait times and emergency overrides during instance refreshes.
- Updated various workflows and actions to utilize the new busy counter, ensuring that in-flight operations are respected before cycling instances or containers.
- Implemented comprehensive tests for the instance busy counter to validate its lifecycle and error handling.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 3m 14s


PR Review: Add instance busy counter for graph API cycle protection

Tasks

  • Gather PR context and understand the request
  • Read core implementation files (instance_busy.py, management.py)
  • Review operation integration files (materialize.py, direct_materialization.py, stage.py)
  • Review CI/CD workflow changes
  • Review test coverage (test_instance_busy.py)
  • Compile and post full review

Overview

This is a well-conceived safety feature. The "log-and-continue on DynamoDB failure" contract is the right design choice — a broken counter must never block real work — and the stale-detection fallback (2h heartbeat window) is a reasonable operational backstop. The escape hatch (force-ignore-busy) is a good safety valve. Tests cover the key lifecycle semantics well.

A few issues worth addressing before merge.


Medium Severity

1. Stale detection fails for genuinely long operations

_update_counter sets last_destructive_op_at on every DynamoDB write (both increment and decrement). But for a single operation that lasts more than 2 hours, the timestamp is only written once (at +1) and never refreshed until the operation completes. The GHA stale window (STALE_WINDOW_SECONDS=7200) would expire and start the refresh anyway.

Full SEC staging runs can exceed 2h on a large backfill. This could interrupt the exact workload the feature is designed to protect.

Fix options:

  • Add a periodic heartbeat task that touches last_destructive_op_at while the operation is running, or
  • Accept the limitation and document it clearly, and/or increase the stale window to something more conservative (e.g., 6h).
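The first option can be sketched roughly as follows (a hedged illustration, not the PR's code: `run_with_heartbeat` and the `touch` callback are hypothetical names): a background task periodically refreshes the timestamp while the operation runs, and is cancelled when it finishes.

```python
import asyncio
import contextlib

HEARTBEAT_INTERVAL_SECONDS = 15 * 60  # well inside any stale window

async def _heartbeat(touch, interval):
    """Refresh last_destructive_op_at periodically while an op runs."""
    while True:
        await asyncio.sleep(interval)
        touch()  # e.g. rewrite the timestamp attribute in DynamoDB

async def run_with_heartbeat(op, touch, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Run `op` with a background heartbeat; cancel it on exit either way."""
    task = asyncio.create_task(_heartbeat(touch, interval))
    try:
        return await op()
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

With this in place, even a >2h materialization keeps its timestamp fresh, so the stale-detection fallback only fires for genuinely abandoned counters.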


2. Counter can go negative → stale detection stalls GHA for 2h

If DynamoDB is down when a +1 fires (silently swallowed) but comes back before the matching -1, the stored counter goes to -1. The GHA busy check is a string equality test against "0":

BUSY_COUNT=$(... jq -r '.active_destructive_ops.N // "0"')
if [ "$BUSY_COUNT" = "0" ]; then continue; fi
TOTAL_BUSY=$((TOTAL_BUSY + BUSY_COUNT))

TOTAL_BUSY=-1 is not "0", so the loop keeps retrying for MAX_WAIT_ATTEMPTS minutes before the stale timeout saves it. A simple guard in the bash ([ "$BUSY_COUNT" -le 0 ] instead of = "0") would prevent this.
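A sketch of the guarded check (the function wrapper and warning text are illustrative; the variable names mirror the snippet above):

```shell
#!/bin/sh
# Guarded busy check: treat zero *or negative* counters as idle, since a
# negative value means an increment was lost (e.g. DynamoDB was briefly down).
busy_state() {
  # $1: raw counter value read from DynamoDB (may be empty or negative)
  count="${1:-0}"
  if [ "$count" -le 0 ]; then
    if [ "$count" -lt 0 ]; then
      echo "::warning::negative busy counter ($count), treating as idle" >&2
    fi
    echo "idle"
  else
    echo "busy"
  fi
}
```

The integer comparison `-le 0` replaces the string equality `= "0"`, so `-1` no longer keeps the wait loop spinning until the stale timeout.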



Low Severity

3. Misleading comment in refresh-graph-asg/action.yml about shared tiers

action.yml:213-214:

# Unlike the informational database_count check above, this runs for SHARED
# tiers too — protecting the SEC shared-master SEC materialization

But the continue on ladybug-shared (line 129) skips the entire tier block, including the busy-counter wait. The comment is incorrect — shared tiers are not checked here. The shared tier protection runs via refresh-graph-containers, not via this action.


4. OP_KIND_BULK_TABLE_CREATE used for the insert operation

management.py:250:

async with instance_busy(env.INSTANCE_ID, OP_KIND_BULK_TABLE_CREATE):

This is in perform_table_insert, where the operation is an incremental append, not a table create. The GHA reads last_destructive_op_kind only for logging, so this isn't a correctness bug — but when debugging a stale counter it will say "bulk_table_create" for an insert. Adding OP_KIND_BULK_TABLE_INSERT would cost one line.

5. Code duplication between perform_table_creation and perform_table_insert

management.py:142-318: Both functions share the same structure — instance_busy wrapper, task status updates, timeout+interrupt logic, and error handling. Only the inner table_manager call differs. Extracting a shared _run_table_background_op(task_id, callable, ...) helper would keep them DRY and make future changes (e.g., adding retry logic) apply to both.

This might be a pre-existing pattern, but given the PR description says this module was "significantly refactored," now is the natural time to consolidate.

6. No test for resolve_instance_id_for_graph

instance_busy.py:101-134: This helper creates a short-lived GraphClient solely to read ._instance_id, returns "" on any failure, and is the entry point for Dagster assets. The happy path (returns instance ID) and the lazy-import-prevents-circular-dep note deserve at least a smoke test with a mocked GraphClientFactory.
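Such a smoke test could look roughly like this (a sketch only: the resolver below is a stand-in mirroring the contract described above, since the real function's signature and the `GraphClientFactory` API are not shown here):

```python
from unittest import mock

# Stand-in mirroring the described contract: build a client via the factory,
# read its private _instance_id, return "" on any failure, always close.
def resolve_instance_id_for_graph(factory):
    client = None
    try:
        client = factory.create()
        return getattr(client, "_instance_id", "") or ""
    except Exception:
        return ""
    finally:
        if client is not None:
            client.close()

def test_happy_path_returns_instance_id():
    client = mock.Mock(_instance_id="i-0abc123")
    factory = mock.Mock(create=mock.Mock(return_value=client))
    assert resolve_instance_id_for_graph(factory) == "i-0abc123"
    client.close.assert_called_once()  # client is always closed

def test_factory_failure_returns_empty_string():
    factory = mock.Mock(create=mock.Mock(side_effect=RuntimeError("no client")))
    assert resolve_instance_id_for_graph(factory) == ""
```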


Positive Notes

  • Failure mode is safe: DynamoDB errors are absorbed at every layer (_update_counter, _update_counter_async, and again in the async wrapper). A crashed DynamoDB can't block work.
  • instance_busy_sync variant covers Dagster ops cleanly without forcing asyncio.run inside the context manager itself.
  • Imperative begin/end API is the right call for direct_materialization.py where inserting async with would require significant re-indentation of existing try/finally logic.
  • asyncio_mode = auto is set globally so no @pytest.mark.asyncio markers are needed — the test file is correctly written.
  • force-ignore-busy escape hatch with a prominent warning in the description is the right operational design for emergency situations.

Code Quality

7. Unreachable assertions in test_instance_busy.py (lines 66, 163)

Assertions placed after the raise inside a pytest.raises(...) block never execute; they need to sit outside the block:

with pytest.raises(ValueError):
    async with ib.instance_busy(...):
        raise ValueError("boom")

# Both increment and decrement should still have run
assert patched_ddb.update_item.call_count == 2

with pytest.raises(RuntimeError):
    with ib.instance_busy_sync("i-sync", ib.OP_KIND_DAGSTER_MATERIALIZATION):
        raise RuntimeError("sync boom")

assert patched_ddb.update_item.call_count == 2
- Updated `max-wait-minutes` input descriptions in refresh actions to clarify behavior regarding shared tiers and in-flight operations.
- Enhanced stale detection logic in refresh actions to extend the timeout for busy instances from 2 hours to 6 hours, accommodating longer operations.
- Removed unused ingestion flag checks from the health check script, streamlining the container state verification process.
- Introduced a new operation kind for bulk table inserts in the instance busy middleware, improving tracking of active operations.
- Added tests for the new instance ID resolution helper, ensuring robust handling of client instances in various scenarios.
@jfrench9
Member Author

@claude have we fixed all issues?

@claude

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 1m 58s


Issue Resolution Status

Tasks

  • Gather context
  • Check Medium 1: Stale detection for long ops (>2h heartbeat)
  • Check Medium 2: Counter going negative (bash = "0" vs -le 0)
  • Check Low 3: Misleading comment about shared tiers in refresh-graph-asg/action.yml
  • Check Low 4: OP_KIND_BULK_TABLE_CREATE used for insert op
  • Check Low 5: Code duplication between perform_table_creation and perform_table_insert
  • Check Low 6: No test for resolve_instance_id_for_graph
  • Check Code Quality: Unreachable code in test_instance_busy.py (lines 66, 163)
  • Compile and post findings

6 of 7 issues resolved. 1 remains.

| # | Severity | Issue | Status |
|---|----------|-------|--------|
| 1 | Medium | Stale window too short for long ops | ✅ Fixed |
| 2 | Medium | Counter going negative stalls GHA | ✅ Fixed |
| 3 | Low | Misleading comment about shared tier in refresh-graph-asg | ✅ Fixed |
| 4 | Low | Wrong OP_KIND constant in perform_table_insert | ✅ Fixed |
| 5 | Low | Code duplication in perform_table_creation / perform_table_insert | ❌ Open |
| 6 | Low | No test for resolve_instance_id_for_graph | ✅ Fixed |
| 7 | Code quality | Unreachable assertions in test file | ✅ Fixed |

Details

✅ Medium 1 — Stale window: Increased from 2h (7200s) to 6h (21600s) in both refresh-graph-asg/action.yml:86 and refresh-graph-containers/action.yml:93. Comment explicitly notes this covers full SEC historical backfills.

✅ Medium 2 — Negative counter: Both actions now use [ "${BUSY_COUNT:-0}" -le 0 ] (integer comparison) instead of string = "0", and emit a ::warning:: annotation when a negative counter is detected. The per-ASG TOTAL_BUSY accumulator also uses -le 0 and -gt 0 guards throughout.

✅ Low 3 — Misleading comment: The comment at refresh-graph-asg/action.yml:212-217 now accurately states: "the ladybug-shared branch of this loop skips to continue above, so this block never executes for shared tiers", and clarifies that shared-tier protection runs via refresh-graph-containers.

✅ Low 4 — Wrong OP_KIND for insert: OP_KIND_BULK_TABLE_INSERT = "bulk_table_insert" has been added to instance_busy.py:39, and management.py:251 now uses instance_busy(env.INSTANCE_ID, OP_KIND_BULK_TABLE_INSERT) for perform_table_insert. Both constants are imported.

❌ Low 5 — Code duplication (management.py:143-319): perform_table_creation and perform_table_insert remain structurally identical — same instance_busy wrapper, task status updates, asyncio.wait_for + interrupt logic, and error handling. Only the inner table_manager call differs. A shared _run_table_background_op(task_id, callable, ...) helper would keep them DRY. This is still unfixed.

✅ Low 6 — resolve_instance_id_for_graph tests: A full TestResolveInstanceIdForGraph class has been added (test_instance_busy.py:180-293) with 4 tests covering the happy path, missing _instance_id, factory failure, and finally-close guarantee.

✅ Code quality — Unreachable assertions: The assertions after pytest.raises(...) blocks (assert patched_ddb.update_item.call_count == 2) are now correctly positioned outside the with pytest.raises(...) block in both test_decrement_runs_when_body_raises and test_sync_decrements_on_exception.

- Renamed and generalized the `perform_table_creation` function to `_run_table_background_op`, allowing for shared handling of both table creation and insertion operations.
- Introduced parameters for operation kind and label to enhance logging and monitoring.
- Updated the function's docstring to clarify its purpose and parameters.
- Streamlined the `perform_table_creation` function to utilize the new background operation handler, improving code maintainability and readability.
@jfrench9 jfrench9 merged commit 7223639 into main Apr 15, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/graph-api-cycle-protection branch April 15, 2026 07:06