
Add graph tier upgrade functionality with validation and task orchestration #602

Merged
jfrench9 merged 2 commits into main from feature/subscriptions-upgrade-improvements on Apr 6, 2026

Conversation

@jfrench9 jfrench9 commented Apr 6, 2026

Summary

Introduces end-to-end graph tier upgrade capabilities, allowing users to upgrade their graph subscriptions to higher tiers. This includes a new upgrade task orchestration system, tier validation logic, enhanced subscription routing, and supporting infrastructure changes.

Key Accomplishments

Graph Tier Upgrade Task

  • Implemented a comprehensive graph_tier_upgrade task module (~545 lines) that orchestrates the full lifecycle of upgrading a graph to a higher subscription tier
  • Handles provisioning, state transitions, and rollback scenarios during the upgrade process

Tier Validation

  • Added a dedicated tier_validation module that encapsulates the business rules for determining valid upgrade paths between graph tiers
  • Centralizes tier comparison and eligibility logic for reuse across the codebase

Subscription & Billing Enhancements

  • Extended subscription models with new fields to support upgrade tracking (subscription.py core and API models)
  • Removed the now-unnecessary standalone upgrade.py model in favor of integrating upgrade state directly into the subscription model
  • Added a new Dagster billing job to handle upgrade-related webhook events
  • Updated billing enforcement middleware to account for in-progress upgrades

API & Router Changes

  • Significantly refactored and expanded the graph subscriptions router (~458 lines changed) to expose upgrade endpoints and handle upgrade request validation, initiation, and status tracking
  • Added upgrade-related routes to the billing subscriptions router
  • Updated OpenAPI tags for documentation consistency

Infrastructure

  • Extended the graph volume manager Lambda to support volume operations required during tier upgrades
  • Updated CloudFormation worker template with new permissions and resource definitions needed for upgrade task execution
  • Added new environment configuration variables to support upgrade feature flags and tier-specific settings

Breaking Changes

  • Removed robosystems/models/api/billing/upgrade.py: Any code importing from this module will need to be updated to use the new fields on the subscription model directly. The __init__.py for billing models has been updated accordingly.
  • Subscription model changes may require database migration or schema updates for the new field on the core subscription model.

Testing

  • Unit tests for tier upgrade task (test_graph_tier_upgrade.py): 642 lines of comprehensive tests covering happy paths, error handling, rollback scenarios, and edge cases
  • Unit tests for tier validation (test_tier_validation.py): 114 lines covering valid/invalid upgrade paths and boundary conditions
  • Integration/router tests (test_graph_tier_upgrade.py in routers): 454 lines testing the API layer including request validation, authorization, and end-to-end upgrade flows
  • Updated existing billing webhook handler and billing model tests to reflect the new upgrade support
  • All new test files include __init__.py for proper test discovery

Infrastructure Considerations

  • CloudFormation stack updates are required before deployment — the worker template includes new IAM permissions and resource definitions
  • The graph volume manager Lambda has been extended and may require redeployment
  • New environment variables must be configured in all target environments prior to rollout
  • Consider a staged rollout given the scope of changes to billing enforcement and subscription state management

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/subscriptions-upgrade-improvements
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Introduced a new task for handling graph tier upgrades, including EBS volume migration and snapshot creation for safety.
- Added validation for subgraph count and storage capacity during tier changes to ensure compliance with tier limits.
- Updated billing subscription models to support tier upgrade operations and added relevant error handling for subscription status conflicts.
- Enhanced API responses to include operation IDs for tracking asynchronous tier changes.
- Updated CloudFormation templates to allow Lambda invocation for volume management during upgrades.
- Refactored related tests to cover new functionality and ensure robust validation during tier transitions.

jfrench9 commented Apr 6, 2026

@claude please review this PR

claude bot commented Apr 6, 2026

Claude finished @jfrench9's task in 3m 32s


PR Review: Add graph tier upgrade functionality

  • Gather context and understand the PR
  • Review core upgrade task (graph_tier_upgrade.py)
  • Review tier validation module (tier_validation.py)
  • Review router changes (routers/graphs/subscriptions.py)
  • Review model changes and breaking changes
  • Review infrastructure changes (CloudFormation, Lambda)
  • Review test coverage
  • Review config/env changes
  • Compile final review

This is a large, well-structured PR that implements end-to-end graph tier upgrade/downgrade. The architecture is sound and the happy-path coverage is good. A few issues worth addressing before merge, plus a handful of observations.


Issues

1. enforcement.py — UPGRADING status allows writes in require_graph_access

The error message added in check_graph_subscription_active (line 135) says:

"Read access available, writes blocked until upgrade completes."

But require_graph_access (the function that actually gates graph operations) never handles "upgrading". When subscription status is "upgrading", the code falls through the if subscription.status == "canceled" block (line 223) and lands on:

_cache_subscription(graph_id, "active")  # line 244
return graph

This means writes are not actually blocked during an upgrade — the cache entry says "active". The message in check_graph_subscription_active is misleading. Either:

  • Add explicit "upgrading" handling in require_graph_access (block writes, allow reads), or
  • Change the error message to reflect that full access is intentionally allowed during migration.
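The first option could be sketched roughly as follows. This is a minimal stand-in, not the enforcement.py API: the cache dict and function names are illustrative, but it shows the key point — cache the real status (rather than "active") so the fast-path sees "upgrading" on subsequent requests, and block writes while allowing reads.

```python
SUBSCRIPTION_CACHE = {}  # stand-in for the store behind _cache_subscription


def resolve_access(graph_id: str, status: str, require_write: bool) -> bool:
    """Sketch of explicit "upgrading" handling in require_graph_access.

    Illustrative only: caches the actual status so the cache fast-path
    makes the same decision, and blocks writes (but not reads) while a
    tier migration is in progress.
    """
    if status == "canceled":
        SUBSCRIPTION_CACHE[graph_id] = "canceled"
        return False
    if status == "upgrading":
        SUBSCRIPTION_CACHE[graph_id] = "upgrading"
        return not require_write  # reads allowed, writes blocked
    SUBSCRIPTION_CACHE[graph_id] = "active"
    return True
```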


2. "upgrading" status applied for downgrades too

In _change_graph_tier (line 741), the status is hardcoded to "upgrading" regardless of direction:

subscription.status = "upgrading"  # line 741

This is set even when is_upgrade is False. The worker task, DynamoDB registry updates, and rollback logic are all labeled "upgrade" but execute for downgrades too. The naming won't cause a functional bug, but will cause confusion in support/debugging. Consider "migrating" as a neutral term, or branch on is_upgrade.
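The branch-on-direction option is a one-liner; a sketch (status strings illustrative, not necessarily what the schema allows):

```python
def migration_status(is_upgrade: bool) -> str:
    """Pick a direction-aware status instead of hardcoding "upgrading"."""
    return "upgrading" if is_upgrade else "downgrading"


# subscription.status = migration_status(is_upgrade)
```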

3. Full DynamoDB table scan in main task path

graph_tier_upgrade.py line 105:

volume_response = volume_table.scan(
    FilterExpression="instance_id = :iid AND #status = :status",
    ...
)

A scan() reads every item in the table and filters in memory. At scale this is expensive and slow. This should be a query() against a GSI on instance_id. This is in the critical path of every tier change and will degrade as the volume registry grows.
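The query() shape could look like the sketch below, assuming a GSI partitioned on instance_id (the "instance-index" name and the attribute/status values are illustrative). Unlike scan(), only items under the matching partition key are read; the FilterExpression then narrows by status server-side.

```python
def build_volume_query(instance_id: str, status: str = "attached") -> dict:
    """Kwargs for Table.query() against an instance_id GSI.

    Index name, attribute names, and the status value are assumptions
    for illustration; #status is aliased because "status" is a
    DynamoDB reserved word.
    """
    return {
        "IndexName": "instance-index",
        "KeyConditionExpression": "instance_id = :iid",
        "FilterExpression": "#status = :status",
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {":iid": instance_id, ":status": status},
    }


# volume_response = volume_table.query(**build_volume_query(instance_id))
```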


4. _ensure_asg_capacity over-provisions unconditionally

Lines 354–358:

if desired < max_size:
    asg_client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired + 1,
    )

This bumps desired capacity every time, even when the ASG already has idle instances that can absorb the new volume. At minimum, check if the number of healthy instances exceeds the number of active databases before requesting more capacity. Frequent unnecessary scale-outs will increase infrastructure costs and trigger unnecessary instance lifecycle events.
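The guard could be as small as a pure decision helper; a sketch, assuming the counts come from the ASG describe call and the volume registry respectively:

```python
def needs_scale_out(
    healthy_instances: int,
    active_databases: int,
    desired: int,
    max_size: int,
) -> bool:
    """Request extra ASG capacity only when no idle instance can
    absorb the new volume (illustrative helper, names assumed)."""
    has_idle_capacity = healthy_instances > active_databases
    return not has_idle_capacity and desired < max_size
```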

5. CloudFormation lookup on every worker process start

get_volume_manager_function_arn() in env.py makes a live CloudFormation describe_stacks API call, cached at the module level (process-scoped). This means:

  • Every fresh worker process (container restart, scale-out) hits CF on the first upgrade task
  • Stack output updates won't be picked up until a process restart anyway

This is fine for correctness, but the silent empty-string fallback (returning "" on any exception) means a CF permission error or wrong stack name will silently skip snapshot creation and volume detach — which is the critical path. Add at minimum a logger.error for unexpected exceptions in _get_cf_stack_output rather than swallowing all exceptions silently.
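A sketch of the suggested change, with the client injected so the logging path is visible; the helper name and signature are illustrative, not the actual env.py API. The empty-string fallback is kept, but failures now leave a trace instead of being swallowed.

```python
import logging

logger = logging.getLogger(__name__)


def get_stack_output(cf_client, stack_name: str, output_key: str) -> str:
    """Fetch one CloudFormation stack output, logging failures.

    cf_client is assumed to expose the standard describe_stacks API.
    Permission errors and wrong stack names now produce a log entry
    rather than silently disabling snapshot/detach behavior.
    """
    try:
        stack = cf_client.describe_stacks(StackName=stack_name)["Stacks"][0]
        for output in stack.get("Outputs", []):
            if output["OutputKey"] == output_key:
                return output["OutputValue"]
        logger.error("Output %s not found on stack %s", output_key, stack_name)
    except Exception:
        logger.exception(
            "CloudFormation lookup failed for %s/%s", stack_name, output_key
        )
    return ""
```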


Observations

_rollback doesn't verify reattach response (graph_tier_upgrade.py lines 473–484): The rollback calls lambda_client.invoke to reattach the volume but doesn't inspect the response statusCode. A failed reattach during rollback leaves the volume detached from both instances. The existing logger.error is good, but an SNS alert or distinct exception class would help operators catch this.

DB session management in _finalize_subscription / _rollback: The next(db_gen) pattern works but is fragile. A context manager (with get_db_session() as db:) is cleaner and ensures cleanup even on exceptions in the finally block. The current pattern is used in multiple places in this file — a small refactor would pay off.
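One way to get there without touching get_db_session itself is a thin wrapper; a sketch, assuming get_db_session() is a generator that yields a session once and runs its cleanup in a finally block (which gen.close() triggers deterministically):

```python
from contextlib import contextmanager


@contextmanager
def session_scope(session_generator):
    """Context-manager wrapper around a generator-based session factory.

    Illustrative only: replaces the fragile next(db_gen) pattern with
    `with session_scope(get_db_session) as db:` so cleanup runs even
    when the body raises.
    """
    gen = session_generator()
    db = next(gen)
    try:
        yield db
    finally:
        gen.close()  # raises GeneratorExit inside the generator,
                     # running its finally/cleanup block
```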

Tier validation docstring vs. usage: tier_validation.py docstrings say "No-op for upgrades (higher tier always has >= subgraphs)." This is only true because the router (line 727) guards the call with if not is_upgrade. The functions themselves have no such guard — if called directly for an upgrade they'd still run correctly but emit false-positive "Cannot downgrade" errors. The docstring is accurate enough but worth tightening.

validate_subgraph_count is synchronous, validate_storage_capacity is async: These two functions in tier_validation.py have different signatures for no apparent reason (one does a simple DB count, the other calls an async IngestionLimitChecker). This asymmetry is minor but could trip up future callers.

snapshot_for_upgrade in Lambda blocks for up to 10 minutes: The waiter.wait(MaxAttempts=40, Delay=15) in graph_volume_manager.py can hold the Lambda for up to 10 minutes. Lambda has a 15-minute max timeout — make sure the Lambda's configured timeout accounts for this worst case, especially if there's additional processing before/after the waiter.


Testing

Test coverage is solid across all three layers (task unit tests, validation unit tests, router integration tests). The rollback path tests in test_graph_tier_upgrade.py are particularly thorough. The Dagster webhook test for the "upgrading" status guard is a good regression anchor.

One gap: there are no tests for require_graph_access behavior when subscription status is "upgrading" (the enforcement.py inconsistency noted in Issue #1).


Deployment Checklist

Per the PR description, before merging:

  • CloudFormation worker stack must be updated (new IAM permissions for Lambda invocation + CF describe)
  • Graph Volume Manager Lambda must be redeployed (new snapshot_for_upgrade action)
  • No new environment variables are required (the ARN is looked up dynamically from CF outputs) ✅

…upgrades

- Improved exception handling in CloudFormation output retrieval to log errors.
- Added handling for "upgrading" subscription status in graph access requirements, blocking write operations during tier upgrades.
- Updated volume retrieval to use a query on the instance-index GSI instead of a full table scan for efficiency.
- Adjusted tests to reflect changes in volume retrieval and subscription status handling.

jfrench9 commented Apr 6, 2026

@claude can you review the recent changes

claude bot commented Apr 6, 2026

Claude finished @jfrench9's task in 1m 34s


Review: Recent Changes (commit a12d20a)

  • Gather context
  • Review recent commit changes (a12d20a)
  • Compile review feedback

This commit addresses three of the issues raised in the previous review. Here's the assessment:


What's Fixed ✅

Issue #1 — enforcement.py "upgrading" status (fully resolved)

The fix in enforcement.py correctly handles "upgrading" in both the cache fast-path (lines 198–204) and the fresh DB-lookup path (lines 230–238). The logic — reads allowed, writes blocked — is consistent and the _cache_subscription(graph_id, "upgrading") call ensures subsequent requests hit the fast path. Well done.

Issue #3 — DynamoDB full table scan (resolved)

The scan() → query() change in graph_tier_upgrade.py (lines 105–111) is the right fix. The FilterExpression-only approach on a scan reads every item; the GSI query is correct. One thing to verify before deploying: confirm that the instance-index GSI exists on the volume DynamoDB table in CloudFormation. If this GSI is missing from the table definition, this will fail at runtime (the query will return an error, not empty results). Worth a double-check against the CF template.

Issue #5 — CloudFormation exception logging (partially resolved)

The except Exception as e: print(...) in env.py line 319 follows the existing pattern in the file (the print() calls on lines 124/125 have the same rationale — "avoid circular import"). This is acceptable given the constraint. The silent failure is now at least visible in logs.


Remaining Issues

Empty except blocks still not logging (graph_tier_upgrade.py lines 324, 412)

The code quality review flagged two empty except httpx.HTTPError: blocks. The fix here added inline comments:

pass  # Instance may be shutting down — keep polling until timeout
pass  # New instance may still be booting — retry

This clarifies intent but does not resolve the static analysis finding — the except clause still does nothing executable. The requested fix was to bind the exception and add a logger.debug() call. As long as the code quality check is still enforced in CI, this will continue to flag. The fix is straightforward:

except httpx.HTTPError as e:
    logger.debug("Error polling admin connections (ignoring, retrying): %s", e)



No Tests for enforcement.py "upgrading" Handling

The previous review noted this gap and it remains open. The enforcement.py fix is the highest-risk change in this PR (it gates write access during upgrades). A test for require_graph_access with subscription.status == "upgrading" and require_write=True vs False would be a good regression anchor.
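The missing regression test could be as small as the sketch below. The allowed() stub here only mirrors the intended rule for illustration; a real test would exercise require_graph_access in enforcement.py with a subscription fixture in the "upgrading" state.

```python
def allowed(status: str, require_write: bool) -> bool:
    """Stub mirroring the intended enforcement rule; illustrative only."""
    if status == "canceled":
        return False
    if status == "upgrading":
        return not require_write  # reads stay available during upgrade
    return True


def test_upgrading_blocks_writes_allows_reads():
    assert allowed("upgrading", require_write=True) is False
    assert allowed("upgrading", require_write=False) is True
    assert allowed("active", require_write=True) is True
```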



Summary

The recent commit is a solid incremental improvement that closes the two most critical issues (enforcement.py inconsistency and DynamoDB scan). The two remaining items — logger.debug in the empty excepts and a test for the upgrading enforcement path — are minor but worth addressing before merge.

@jfrench9 jfrench9 merged commit 15902ca into main Apr 6, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/subscriptions-upgrade-improvements branch April 6, 2026 05:01