
Add instance busy counter for graph API cycle protection#619

Merged
jfrench9 merged 5 commits into main from feature/graph-api-cycle-protection
Apr 15, 2026

Conversation

@jfrench9
Member

Summary

Introduces an instance-level busy counter mechanism to protect graph API operations from destructive interference during critical workflows. This feature ensures that infrastructure operations (such as ASG refreshes and container restarts) are aware of in-progress graph operations, preventing data corruption and request failures caused by mid-cycle disruptions.

Key Accomplishments

New Instance Busy Counter Middleware

  • Added a new instance_busy middleware module that tracks the number of active destructive operations on a graph instance
  • Implements an atomic counter pattern to safely increment/decrement busy state across concurrent operations
  • Provides a queryable endpoint or mechanism for infrastructure tooling to check whether an instance is safe to recycle
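As a rough illustration of the mechanism described above (this is a hedged sketch, not the actual module: the table name, attribute names, and op-kind constant are assumptions), the middleware can be an async context manager that atomically increments a DynamoDB counter on entry and decrements it on exit, swallowing DynamoDB errors so a broken counter never blocks real work:

```python
import time
from contextlib import asynccontextmanager

TABLE_NAME = "instance-busy"                 # assumed table name
OP_KIND_MATERIALIZATION = "materialization"  # illustrative op kind

def _update_counter(ddb, instance_id, delta, op_kind):
    """Atomically add `delta` to the busy counter; never raise."""
    try:
        ddb.update_item(
            TableName=TABLE_NAME,
            Key={"instance_id": {"S": instance_id}},
            # DynamoDB ADD is atomic, so concurrent +1/-1 pairs are safe.
            UpdateExpression=(
                "ADD active_destructive_ops :d "
                "SET last_destructive_op_at = :ts, last_destructive_op_kind = :k"
            ),
            ExpressionAttributeValues={
                ":d": {"N": str(delta)},
                ":ts": {"N": str(int(time.time()))},
                ":k": {"S": op_kind},
            },
        )
    except Exception:
        # Log-and-continue contract: a broken counter must never block work.
        pass

@asynccontextmanager
async def instance_busy(ddb, instance_id, op_kind):
    """Mark the instance busy for the duration of a destructive operation."""
    _update_counter(ddb, instance_id, +1, op_kind)
    try:
        yield
    finally:
        # Runs even if the operation raises, so the counter always rebalances.
        _update_counter(ddb, instance_id, -1, op_kind)
```

Infrastructure tooling can then read `active_destructive_ops` for the instance and treat a zero value as "safe to recycle".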

Infrastructure-Aware Deployment Workflows

  • Enhanced CI/CD workflows (staging, prod, ASG refresh, service refresh) to integrate with the busy counter before performing instance replacements
  • Updated GitHub Actions for graph ASG refresh and graph container refresh to check instance busy state, ensuring graceful handling of in-flight operations
  • Added dedicated workflow files for graph ASG refresh and service refresh orchestration

Operation Integration

  • Instrumented materialization operations (extensions/materialize.py and lbug/direct_materialization.py) to register as busy during execution
  • Updated the SEC pipeline stage adapter to be aware of the busy counter lifecycle
  • Enhanced the graph table management router with refactored logic to coordinate with instance busy state
  • Added a Dagster job definition for graph-related operations

Comprehensive Test Coverage

  • Added a full test suite (test_instance_busy.py) covering counter increment/decrement, concurrent access, edge cases, and integration scenarios

Breaking Changes

  • The graph table management router (databases/tables/management.py) has been significantly refactored. Any external consumers or tests relying on the previous internal structure of that module should be reviewed.
  • Infrastructure deployment pipelines now depend on the busy counter check — rollout tooling must support the new pre-check step or deployments may block waiting for instances to become idle.

Testing Notes

  • Unit tests for the instance busy counter cover atomic operations, concurrent access patterns, and boundary conditions
  • Integration testing should verify that deployment workflows correctly poll and respect the busy state before proceeding with instance replacement
  • Recommend end-to-end testing in staging to validate that long-running materialization jobs properly hold the busy counter and that refreshes wait accordingly
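A minimal illustration of the concurrent-access pattern such tests exercise (the counter here is a thread-based stand-in, not the real middleware): many workers entering and leaving balanced busy sections must leave the counter at exactly zero.

```python
import threading

class AtomicBusyCounter:
    """Illustrative stand-in for the instance busy counter."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def enter(self):
        with self._lock:
            self.value += 1

    def leave(self):
        with self._lock:
            self.value -= 1

def hammer(counter, iterations=10_000):
    # Each iteration is a balanced busy section: enter, do work, leave.
    for _ in range(iterations):
        counter.enter()
        counter.leave()

counter = AtomicBusyCounter()
threads = [threading.Thread(target=hammer, args=(counter,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Balanced enter/leave pairs must net out to exactly zero.
assert counter.value == 0
```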

Infrastructure Considerations

  • New GitHub Actions and workflow definitions are introduced for graph-specific ASG and container refresh flows — these should be reviewed by the infrastructure/DevOps team for alignment with existing deployment patterns
  • The busy counter mechanism assumes instance-local state; if the architecture moves to shared or distributed state, this approach will need to be revisited
  • Deployment pipelines now have an added dependency on instance health/readiness checks, which may increase rollout times when instances are under heavy load

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/graph-api-cycle-protection
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Introduced a new `instance_busy` middleware to track active destructive operations on EC2 instances using DynamoDB.
- Added `max-wait-minutes` and `force-ignore-busy` inputs to refresh actions, allowing for configurable wait times and emergency overrides during instance refreshes.
- Updated various workflows and actions to utilize the new busy counter, ensuring that in-flight operations are respected before cycling instances or containers.
- Implemented comprehensive tests for the instance busy counter to validate its lifecycle and error handling.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 3m 14s


PR Review: Add instance busy counter for graph API cycle protection

Tasks

  • Gather PR context and understand the request
  • Read core implementation files (instance_busy.py, management.py)
  • Review operation integration files (materialize.py, direct_materialization.py, stage.py)
  • Review CI/CD workflow changes
  • Review test coverage (test_instance_busy.py)
  • Compile and post full review

Overview

This is a well-conceived safety feature. The "log-and-continue on DynamoDB failure" contract is the right design choice — a broken counter must never block real work — and the stale-detection fallback (2h heartbeat window) is a reasonable operational backstop. The escape hatch (force-ignore-busy) is a good safety valve. Tests cover the key lifecycle semantics well.

A few issues worth addressing before merge.


Medium Severity

1. Stale detection fails for genuinely long operations

_update_counter sets last_destructive_op_at on every DynamoDB write (both increment and decrement). But for a single operation that lasts more than 2 hours, the timestamp is only written once (at +1) and never refreshed until the operation completes. The GHA stale window (STALE_WINDOW_SECONDS=7200) would expire and start the refresh anyway.

Full SEC staging runs can exceed 2h on a large backfill. This could interrupt the exact workload the feature is designed to protect.

Fix options:

  • Add a periodic heartbeat task that touches last_destructive_op_at while the operation is running, or
  • Accept the limitation and document it clearly, and/or increase the stale window to something more conservative (e.g., 6h).
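The first option can be sketched roughly as follows (a hedged illustration, not the PR's code: `run_with_heartbeat` and the `touch` callback are hypothetical names): a background task periodically refreshes the timestamp while the operation runs, and is cancelled when it finishes.

```python
import asyncio
import contextlib

HEARTBEAT_INTERVAL_SECONDS = 15 * 60  # well inside any stale window

async def _heartbeat(touch, interval):
    """Refresh last_destructive_op_at periodically while an op runs."""
    while True:
        await asyncio.sleep(interval)
        touch()  # e.g. rewrite the timestamp attribute in DynamoDB

async def run_with_heartbeat(op, touch, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Run `op` with a background heartbeat; cancel it on exit either way."""
    task = asyncio.create_task(_heartbeat(touch, interval))
    try:
        return await op()
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

With this in place, even a >2h materialization keeps its timestamp fresh, so the stale-detection fallback only fires for genuinely abandoned counters.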


2. Counter can go negative → stale detection stalls GHA for 2h

If DynamoDB is down when a +1 fires (silently swallowed) but comes back before the matching -1, the stored counter goes to -1. The GHA busy check is a string equality test against "0":

BUSY_COUNT=$(... jq -r '.active_destructive_ops.N // "0"')
if [ "$BUSY_COUNT" = "0" ]; then continue; fi
TOTAL_BUSY=$((TOTAL_BUSY + BUSY_COUNT))

TOTAL_BUSY=-1 is not "0", so the loop keeps retrying for MAX_WAIT_ATTEMPTS minutes before the stale timeout saves it. A simple guard in the bash ([ "$BUSY_COUNT" -le 0 ] instead of = "0") would prevent this.
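A sketch of the guarded check (the function wrapper and warning text are illustrative; the variable names mirror the snippet above):

```shell
#!/bin/sh
# Guarded busy check: treat zero *or negative* counters as idle, since a
# negative value means an increment was lost (e.g. DynamoDB was briefly down).
busy_state() {
  # $1: raw counter value read from DynamoDB (may be empty or negative)
  count="${1:-0}"
  if [ "$count" -le 0 ]; then
    if [ "$count" -lt 0 ]; then
      echo "::warning::negative busy counter ($count), treating as idle" >&2
    fi
    echo "idle"
  else
    echo "busy"
  fi
}
```

The integer comparison `-le 0` replaces the string equality `= "0"`, so `-1` no longer keeps the wait loop spinning until the stale timeout.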



Low Severity

3. Misleading comment in refresh-graph-asg/action.yml about shared tiers

action.yml:213-214:

# Unlike the informational database_count check above, this runs for SHARED
# tiers too — protecting the SEC shared-master SEC materialization

But the continue on ladybug-shared (line 129) skips the entire tier block, including the busy-counter wait. The comment is incorrect — shared tiers are not checked here. The shared tier protection runs via refresh-graph-containers, not via this action.


4. OP_KIND_BULK_TABLE_CREATE used for the insert operation

management.py:250:

async with instance_busy(env.INSTANCE_ID, OP_KIND_BULK_TABLE_CREATE):

This is in perform_table_insert, where the operation is an incremental append, not a table create. The GHA reads last_destructive_op_kind only for logging, so this isn't a correctness bug — but when debugging a stale counter it will say "bulk_table_create" for an insert. Adding OP_KIND_BULK_TABLE_INSERT would cost one line.

5. Code duplication between perform_table_creation and perform_table_insert

management.py:142-318: Both functions share the same structure — instance_busy wrapper, task status updates, timeout+interrupt logic, and error handling. Only the inner table_manager call differs. Extracting a shared _run_table_background_op(task_id, callable, ...) helper would keep them DRY and make future changes (e.g., adding retry logic) apply to both.

This might be a pre-existing pattern, but given the PR description says this module was "significantly refactored," now is the natural time to consolidate.

6. No test for resolve_instance_id_for_graph

instance_busy.py:101-134: This helper creates a short-lived GraphClient solely to read ._instance_id, returns "" on any failure, and is the entry point for Dagster assets. The happy path (returns instance ID) and the lazy-import-prevents-circular-dep note deserve at least a smoke test with a mocked GraphClientFactory.
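Such a smoke test could look roughly like this (a sketch only: the resolver below is a stand-in mirroring the contract described above, since the real function's signature and the `GraphClientFactory` API are not shown here):

```python
from unittest import mock

# Stand-in mirroring the described contract: build a client via the factory,
# read its private _instance_id, return "" on any failure, always close.
def resolve_instance_id_for_graph(factory):
    client = None
    try:
        client = factory.create()
        return getattr(client, "_instance_id", "") or ""
    except Exception:
        return ""
    finally:
        if client is not None:
            client.close()

def test_happy_path_returns_instance_id():
    client = mock.Mock(_instance_id="i-0abc123")
    factory = mock.Mock(create=mock.Mock(return_value=client))
    assert resolve_instance_id_for_graph(factory) == "i-0abc123"
    client.close.assert_called_once()  # client is always closed

def test_factory_failure_returns_empty_string():
    factory = mock.Mock(create=mock.Mock(side_effect=RuntimeError("no client")))
    assert resolve_instance_id_for_graph(factory) == ""
```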


Positive Notes

  • Failure mode is safe: DynamoDB errors are absorbed at every layer (_update_counter, _update_counter_async, and again in the async wrapper). A crashed DynamoDB can't block work.
  • instance_busy_sync variant covers Dagster ops cleanly without forcing asyncio.run inside the context manager itself.
  • Imperative begin/end API is the right call for direct_materialization.py where inserting async with would require significant re-indentation of existing try/finally logic.
  • asyncio_mode = auto is set globally so no @pytest.mark.asyncio markers are needed — the test file is correctly written.
  • force-ignore-busy escape hatch with a prominent warning in the description is the right operational design for emergency situations.

Code Quality

7. Unreachable assertions in test_instance_busy.py (lines 66, 163)

Assertions placed after the raise inside a pytest.raises(...) block never execute; they need to sit outside the block:

with pytest.raises(ValueError):
    async with ib.instance_busy(...):
        raise ValueError("boom")

# Both increment and decrement should still have run
assert patched_ddb.update_item.call_count == 2

with pytest.raises(RuntimeError):
    with ib.instance_busy_sync("i-sync", ib.OP_KIND_DAGSTER_MATERIALIZATION):
        raise RuntimeError("sync boom")

assert patched_ddb.update_item.call_count == 2
- Updated `max-wait-minutes` input descriptions in refresh actions to clarify behavior regarding shared tiers and in-flight operations.
- Enhanced stale detection logic in refresh actions to extend the timeout for busy instances from 2 hours to 6 hours, accommodating longer operations.
- Removed unused ingestion flag checks from the health check script, streamlining the container state verification process.
- Introduced a new operation kind for bulk table inserts in the instance busy middleware, improving tracking of active operations.
- Added tests for the new instance ID resolution helper, ensuring robust handling of client instances in various scenarios.
@jfrench9
Member Author

@claude have we fixed all issues?

@claude

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 1m 58s


Issue Resolution Status

Tasks

  • Gather context
  • Check Medium 1: Stale detection for long ops (>2h heartbeat)
  • Check Medium 2: Counter going negative (bash = "0" vs -le 0)
  • Check Low 3: Misleading comment about shared tiers in refresh-graph-asg/action.yml
  • Check Low 4: OP_KIND_BULK_TABLE_CREATE used for insert op
  • Check Low 5: Code duplication between perform_table_creation and perform_table_insert
  • Check Low 6: No test for resolve_instance_id_for_graph
  • Check Code Quality: Unreachable code in test_instance_busy.py (lines 66, 163)
  • Compile and post findings

6 of 7 issues resolved. 1 remains.

| # | Severity | Issue | Status |
|---|----------|-------|--------|
| 1 | Medium | Stale window too short for long ops | ✅ Fixed |
| 2 | Medium | Counter going negative stalls GHA | ✅ Fixed |
| 3 | Low | Misleading comment about shared tier in refresh-graph-asg | ✅ Fixed |
| 4 | Low | Wrong OP_KIND constant in perform_table_insert | ✅ Fixed |
| 5 | Low | Code duplication in perform_table_creation / perform_table_insert | ❌ Open |
| 6 | Low | No test for resolve_instance_id_for_graph | ✅ Fixed |
| 7 | Code quality | Unreachable assertions in test file | ✅ Fixed |

Details

✅ Medium 1 — Stale window: Increased from 2h (7200s) to 6h (21600s) in both refresh-graph-asg/action.yml:86 and refresh-graph-containers/action.yml:93. Comment explicitly notes this covers full SEC historical backfills.

✅ Medium 2 — Negative counter: Both actions now use [ "${BUSY_COUNT:-0}" -le 0 ] (integer comparison) instead of string = "0", and emit a ::warning:: annotation when a negative counter is detected. The per-ASG TOTAL_BUSY accumulator also uses -le 0 and -gt 0 guards throughout.

✅ Low 3 — Misleading comment: The comment at refresh-graph-asg/action.yml:212-217 now accurately states: "the ladybug-shared branch of this loop skips to continue above, so this block never executes for shared tiers", and clarifies that shared-tier protection runs via refresh-graph-containers.

✅ Low 4 — Wrong OP_KIND for insert: OP_KIND_BULK_TABLE_INSERT = "bulk_table_insert" has been added to instance_busy.py:39, and management.py:251 now uses instance_busy(env.INSTANCE_ID, OP_KIND_BULK_TABLE_INSERT) for perform_table_insert. Both constants are imported.

❌ Low 5 — Code duplication (management.py:143-319): perform_table_creation and perform_table_insert remain structurally identical — same instance_busy wrapper, task status updates, asyncio.wait_for + interrupt logic, and error handling. Only the inner table_manager call differs. A shared _run_table_background_op(task_id, callable, ...) helper would keep them DRY. This is still unfixed.

✅ Low 6 — resolve_instance_id_for_graph tests: A full TestResolveInstanceIdForGraph class has been added (test_instance_busy.py:180-293) with 4 tests covering the happy path, missing _instance_id, factory failure, and finally-close guarantee.

✅ Code quality — Unreachable assertions: The assertions after pytest.raises(...) blocks (assert patched_ddb.update_item.call_count == 2) are now correctly positioned outside the with pytest.raises(...) block in both test_decrement_runs_when_body_raises and test_sync_decrements_on_exception.

- Renamed and generalized the `perform_table_creation` function to `_run_table_background_op`, allowing for shared handling of both table creation and insertion operations.
- Introduced parameters for operation kind and label to enhance logging and monitoring.
- Updated the function's docstring to clarify its purpose and parameters.
- Streamlined the `perform_table_creation` function to utilize the new background operation handler, improving code maintainability and readability.
@jfrench9 jfrench9 merged commit 7223639 into main Apr 15, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/graph-api-cycle-protection branch April 15, 2026 07:06