Add instance busy counter for graph API cycle protection #619
…graph-api-cycle-protection
- Introduced a new `instance_busy` middleware to track active destructive operations on EC2 instances using DynamoDB.
- Added `max-wait-minutes` and `force-ignore-busy` inputs to refresh actions, allowing for configurable wait times and emergency overrides during instance refreshes.
- Updated various workflows and actions to utilize the new busy counter, ensuring that in-flight operations are respected before cycling instances or containers.
- Implemented comprehensive tests for the instance busy counter to validate its lifecycle and error handling.
…graph-api-cycle-protection
@claude please review this PR
Claude finished @jfrench9's task in 3m 14s
PR Review: Add instance busy counter for graph API cycle protection
Overview
This is a well-conceived safety feature. The "log-and-continue on DynamoDB failure" contract is the right design choice (a broken counter must never block real work), and the stale-detection fallback (2h heartbeat window) is a reasonable operational backstop. The escape hatch (
A few issues worth addressing before merge.
Medium Severity
1. Stale detection fails for genuinely long operations
Full SEC staging runs can exceed 2h on a large backfill. This could interrupt the exact workload the feature is designed to protect. Fix options:
2. Counter can go negative → stale detection stalls GHA for 2h
If DynamoDB is down when a +1 fires (silently swallowed) but comes back before the matching -1, the stored counter goes to -1. The GHA busy check is a string equality test against
BUSY_COUNT=$(... jq -r '.active_destructive_ops.N // "0"')
if [ "$BUSY_COUNT" = "0" ]; then continue; fi
TOTAL_BUSY=$((TOTAL_BUSY + BUSY_COUNT))
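The negative-counter problem above comes from comparing the raw string to "0" and then summing it unchanged. As a minimal sketch of the alternative, the Python below parses the value as an integer and clamps it at zero before summing. The function names are hypothetical and are not taken from the PR, and treating an unparseable value as "not busy" is an assumption made here to match the log-and-continue contract, not something the review states.

```python
def effective_busy_count(raw):
    """Parse the string form of a DynamoDB number attribute, clamped at zero.

    `raw` is the value a check would read for the counter, or None when
    the attribute is missing. Hypothetical helper, for illustration only.
    """
    try:
        count = int(raw) if raw is not None else 0
    except (ValueError, TypeError):
        # Assumption: an unparseable counter should not block a refresh.
        count = 0
    # A counter that drifted to -1 is treated the same as 0.
    return max(count, 0)


def total_busy(raw_counts):
    """Sum the clamped counts across all instances."""
    return sum(effective_busy_count(raw) for raw in raw_counts)
```

With this shape, a stored value of "-1" no longer reads as busy, and it cannot cancel out a genuine +1 on another instance when the totals are added.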
Low Severity
3. Misleading comment in
raise ValueError("boom")

# Both increment and decrement should still have run
assert patched_ddb.update_item.call_count == 2
with ib.instance_busy_sync("i-sync", ib.OP_KIND_DAGSTER_MATERIALIZATION):
    raise RuntimeError("sync boom")

assert patched_ddb.update_item.call_count == 2
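The test excerpts above check that the counter is both incremented and decremented even when the wrapped operation raises. A minimal sketch of a context manager with that behavior follows. It is not the PR's `instance_busy_sync` implementation: `FakeTable`, `_adjust`, and `busy` are hypothetical names, and the keyword arguments passed to `update_item` are simplified and do not match boto3's real parameters.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("instance_busy_sketch")


class FakeTable:
    """Stand-in for a DynamoDB table; records calls instead of hitting AWS."""

    def __init__(self, fail=False):
        self.calls = []
        self.fail = fail

    def update_item(self, **kwargs):
        self.calls.append(kwargs)
        if self.fail:
            raise ConnectionError("simulated DynamoDB outage")


def _adjust(table, instance_id, delta):
    """Apply +1 or -1 to the counter; log and continue on any failure."""
    try:
        table.update_item(instance_id=instance_id, delta=delta)
    except Exception:
        logger.warning("busy counter update failed for %s", instance_id)


@contextmanager
def busy(table, instance_id):
    """Mark the instance busy for the duration of the with-block."""
    _adjust(table, instance_id, +1)
    try:
        yield
    finally:
        # Runs even when the wrapped operation raises.
        _adjust(table, instance_id, -1)
```

Because the decrement sits in `finally`, the exception from the wrapped operation still propagates to the caller. Any assertion on the call count therefore has to run after the block that catches that exception, not inside it.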
- Updated `max-wait-minutes` input descriptions in refresh actions to clarify behavior regarding shared tiers and in-flight operations.
- Enhanced stale detection logic in refresh actions to extend the timeout for busy instances from 2 hours to 6 hours, accommodating longer operations.
- Removed unused ingestion flag checks from the health check script, streamlining the container state verification process.
- Introduced a new operation kind for bulk table inserts in the instance busy middleware, improving tracking of active operations.
- Added tests for the new instance ID resolution helper, ensuring robust handling of client instances in various scenarios.
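The stale-detection change in this commit can be pictured as a heartbeat-age check. The sketch below is an illustration under assumptions, not the refresh action's actual logic: the function names and the idea of passing the heartbeat as a `datetime` are hypothetical, and only the 6-hour window comes from the commit message.

```python
from datetime import datetime, timedelta, timezone

STALE_WINDOW = timedelta(hours=6)  # window length taken from the commit message


def is_stale(last_heartbeat, now, window=STALE_WINDOW):
    """True when the busy record was last refreshed longer ago than the window."""
    return now - last_heartbeat > window


def should_wait(busy_count, last_heartbeat, now):
    """Wait only for a positive counter whose heartbeat is still fresh."""
    return busy_count > 0 and not is_stale(last_heartbeat, now)
```

For example, with `now = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)`, a counter of 1 whose heartbeat is three hours old still causes a wait, while the same counter with a seven-hour-old heartbeat is treated as stale and ignored.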
@claude have we fixed all issues?
Claude finished @jfrench9's task in 1m 58s
Issue Resolution Status
6 of 7 issues resolved. 1 remains.
Details
- ✅ Medium 1: Stale window: Increased from 2h (
- ✅ Medium 2: Negative counter: Both actions now use
- ✅ Low 3: Misleading comment: The comment at
- ✅ Low 4: Wrong OP_KIND for insert:
- ❌ Low 5: Code duplication (
- ✅ Low 6:
- ✅ Code quality: Unreachable assertions: The assertions after
- Renamed and generalized the `perform_table_creation` function to `_run_table_background_op`, allowing for shared handling of both table creation and insertion operations.
- Introduced parameters for operation kind and label to enhance logging and monitoring.
- Updated the function's docstring to clarify its purpose and parameters.
- Streamlined the `perform_table_creation` function to utilize the new background operation handler, improving code maintainability and readability.
Summary
Introduces an instance-level busy counter mechanism to protect graph API operations from destructive interference during critical workflows. This feature ensures that infrastructure operations (such as ASG refreshes and container restarts) are aware of in-progress graph operations, preventing data corruption and request failures caused by mid-cycle disruptions.
Key Accomplishments
New Instance Busy Counter Middleware
`instance_busy` middleware module that tracks the number of active destructive operations on a graph instance
Infrastructure-Aware Deployment Workflows
Operation Integration
extensions/materialize.py and lbug/direct_materialization.py) to register as busy during execution
Comprehensive Test Coverage
test_instance_busy.py) covering counter increment/decrement, concurrent access, edge cases, and integration scenarios
Breaking Changes
databases/tables/management.py) has been significantly refactored. Any external consumers or tests relying on the previous internal structure of that module should be reviewed.
Testing Notes
Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:
feature/graph-api-cycle-protection
main
Co-Authored-By: Claude noreply@anthropic.com