
Refactor graph storage limits and add usage monitoring sensor #592

Merged

jfrench9 merged 3 commits into main from refactor/graph-storage-constraint on Apr 5, 2026

Conversation

jfrench9 (Member) commented Apr 5, 2026

Summary

Refactors the graph limits and storage constraint system, introducing a dedicated usage monitoring sensor and email notification capabilities. The changes consolidate how graph tier limits are defined, enforced, and communicated to users, while adding proactive storage usage monitoring via a Dagster sensor.

Key Accomplishments

Storage Limits & Tier Configuration

  • Restructured graph_tier.py configuration to better represent storage constraints and tier boundaries
  • Updated the graph tier YAML configuration (.github/configs/graph.yml) to align with the new constraint model
  • Refactored ingestion_limits.py middleware to simplify limit enforcement logic and improve clarity around storage-based constraints
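The restructured tier model itself is not shown in this description, but its shape can be sketched as follows. This is a minimal illustration with hypothetical names and thresholds, not the actual graph_tier.py definitions; only the ladybug-standard tier and its 20 GB soft limit are confirmed elsewhere in this PR.

```python
from dataclasses import dataclass

# Hypothetical sketch of a storage-based tier constraint. Field names and
# the 0.8 warning threshold are illustrative, not the real configuration.
@dataclass(frozen=True)
class GraphTierLimits:
    tier: str
    storage_soft_limit_gb: int      # soft cap: alert, don't block writes
    warning_threshold: float = 0.8  # alert once usage crosses this fraction

    def status(self, used_gb: float) -> str:
        """Classify current usage against the soft limit."""
        if used_gb >= self.storage_soft_limit_gb:
            return "over_limit"
        if used_gb >= self.storage_soft_limit_gb * self.warning_threshold:
            return "approaching"
        return "ok"

standard = GraphTierLimits(tier="ladybug-standard", storage_soft_limit_gb=20)
```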

Usage Monitoring Sensor

  • Added a new Dagster sensor (usage_monitor.py) that proactively monitors graph storage usage across accounts
  • Integrated the sensor into the Dagster definitions entrypoint
  • Enables automated detection of accounts approaching or exceeding their storage limits

Email Notifications via SES

  • Added/expanded AWS SES operations (operations/aws/ses.py) to support sending usage alert emails
  • Provides infrastructure for notifying users when they are nearing storage thresholds

API & Router Updates

  • Updated the graph limits API models (models/api/graphs/limits.py) to reflect the new storage constraint structure
  • Adjusted the limits router and materialize router to work with the refactored limit representations
  • Minor update to the offering router for consistency

Breaking Changes

  • API model changes: The limits response models have been restructured. Clients consuming the graph limits endpoint may need to update their parsing logic to accommodate the new field names/structure.
  • Configuration schema change: The graph.yml tier configuration format has changed, which may affect deployment pipelines or infrastructure-as-code that references these values directly.

Testing

  • Added comprehensive test suite for the new usage monitor sensor (tests/dagster/test_usage_monitor.py — 196 lines)
  • Added tests for the SES email operations (tests/operations/aws/test_ses.py — 101 lines)
  • Significantly expanded ingestion limits middleware tests (test_ingestion_limits.py — net +365 lines of coverage improvements)
  • Updated existing tests for tier config and limits models to match the refactored structures

Infrastructure Considerations

  • AWS SES: Ensure SES is properly configured (verified sender identities, production access if out of sandbox) in the deployment environment for usage alert emails to be delivered
  • Dagster sensor: The new usage monitor sensor needs to be enabled in the Dagster deployment; consider appropriate evaluation intervals to balance responsiveness with cost
  • Configuration rollout: The updated graph tier configuration should be deployed in coordination with the application code to avoid mismatches between config schema and runtime expectations

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/graph-storage-constraint
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Updated graph limits configuration to replace node and relationship limits with a soft storage cap for instances across various tiers.
- Introduced a new sensor for monitoring graph instance storage usage, sending alerts when usage approaches or exceeds defined limits.
- Enhanced ingestion limit checks to focus on operational safety limits for materialization, ensuring no out-of-memory errors occur.
- Updated API models and responses to reflect changes in storage limits and usage tracking.
- Added tests for the new usage monitoring functionality and adjusted existing tests to align with the updated graph limits structure.

jfrench9 commented Apr 5, 2026

@claude please review this PR

claude Bot commented Apr 5, 2026

Claude finished @jfrench9's task in 1m 49s


PR Review: Refactor graph storage limits and add usage monitoring sensor

  • Gather context and understand request
  • Read changed files
  • Review configuration changes (graph.yml, graph_tier.py)
  • Review new sensor (usage_monitor.py)
  • Review middleware changes (ingestion_limits.py)
  • Review API/router changes
  • Review SES operations
  • Review tests
  • Post final review

Summary

This is a well-scoped refactor that replaces the node/relationship count limits model with a storage-based soft-limit model and adds proactive monitoring via a Dagster sensor. The design direction is sound — storage-based limits are simpler to reason about and closer to infrastructure reality. The coverage improvements are substantial. A few issues worth addressing before merging.


Issues

Medium

1. check_graph_usage appears to be dead code (robosystems/middleware/graph/ingestion_limits.py:168)

The method is defined but never called from production code. The limits router calls check_instance_storage directly. Only the tests call it. Either remove it or add a call site — leaving it around creates a maintenance trap.


2. Sensor creates two event loops per graph (robosystems/dagster/sensors/usage_monitor.py:82-97, robosystems/dagster/sensors/usage_monitor.py:116-131)

# For every graph, this runs twice:
loop = asyncio.new_event_loop()
try:
    storage_check = loop.run_until_complete(...)
finally:
    loop.close()

With 100 graphs you're creating and destroying 200 event loops. Prefer a single asyncio.run() call (or a single event loop) wrapping all per-graph async work. At minimum, consider using asyncio.run() instead of the manual loop setup — it's cleaner:

storage_check = asyncio.run(IngestionLimitChecker.check_instance_storage(...))

Even better, consolidate all async work for a single graph into one coroutine and run it once.
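That consolidation can be sketched like this. check_storage here is a stand-in for the real IngestionLimitChecker call, not the actual sensor code:

```python
import asyncio

# Sketch: all per-graph async work runs inside one coroutine, and the
# whole batch uses a single asyncio.run() per sensor tick.
async def check_storage(graph_id: str) -> tuple[str, int]:
    await asyncio.sleep(0)          # placeholder for the Graph API round trip
    return graph_id, len(graph_id)  # dummy "usage" value for illustration

async def check_all(graph_ids: list[str]) -> dict[str, int]:
    # One event loop for the entire tick; graphs are checked in parallel.
    results = await asyncio.gather(*(check_storage(g) for g in graph_ids))
    return dict(results)

usage = asyncio.run(check_all(["kg1", "kg2", "kg3"]))
```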


3. display:flex in email HTML (robosystems/operations/aws/ses.py, _capacity_warning_template)

'<div style="display:flex;justify-content:space-between;align-items:center;">'

flexbox is not supported in Gmail, Outlook, or most email clients. The status badge and storage readout that use this layout will likely render broken for most recipients. Replace with a <table> layout for the stat + badge row.
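A table-based version of that row could look like the following. The markup is illustrative, not the actual _capacity_warning_template:

```python
# Illustrative table-based replacement for the flex row; email clients
# reliably support table layout where flexbox fails.
def usage_row_html(storage_text: str, badge_text: str) -> str:
    return (
        '<table role="presentation" width="100%" cellpadding="0" cellspacing="0">'
        "<tr>"
        f'<td align="left">{storage_text}</td>'
        f'<td align="right">{badge_text}</td>'
        "</tr>"
        "</table>"
    )

row = usage_row_html("17.2 GB of 20 GB", "APPROACHING")
```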


4. node_count is a semantic mismatch on ContentLimits (robosystems/models/api/graphs/limits.py:67)

ContentLimits is now described as "Per-operation materialization limits" but has node_count: int | None as an informational field — it's neither a limit nor related to per-operation safety. This belongs in InstanceUsage or its own field on GraphLimitsResponse. Clients parsing this response will find it confusing.
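The suggested split can be sketched with plain dataclasses. Field names other than node_count are hypothetical, not the real limits.py models:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the split: node_count is informational usage data, so it
# lives with usage rather than with per-operation safety limits.
@dataclass
class ContentLimits:
    # Per-operation materialization safety limits only.
    max_rows_per_materialization: Optional[int] = None

@dataclass
class InstanceUsage:
    storage_bytes: int
    node_count: Optional[int] = None  # informational, belongs with usage

@dataclass
class GraphLimitsResponse:
    limits: ContentLimits
    usage: InstanceUsage

resp = GraphLimitsResponse(
    limits=ContentLimits(max_rows_per_materialization=100_000),
    usage=InstanceUsage(storage_bytes=1_500_000_000, node_count=42_000),
)
```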


5. Dedup keys stored in LOCKS Valkey DB (robosystems/dagster/sensors/usage_monitor.py:66)

Using ValkeyDatabase.LOCKS for notification dedup keys is a semantic mismatch. Locks are typically short-lived distributed mutexes; these dedup keys live for 7 days. Use a CACHE or notification-specific DB to avoid accidentally clearing email dedup state with lock cleanup operations.

Minor

6. /limits endpoint latency increase not documented

The refactored limits router now calls check_instance_storage, which issues parallel Graph API requests for every subgraph on the instance. For a graph with 10 subgraphs that's 11 get_database_info calls on every /limits request. This is a significant latency regression compared to the old model (which did at most 1 call). Worth a comment in the router explaining the trade-off, or consider caching the result (e.g. with a short TTL in Valkey).
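A short-TTL cache in front of the fan-out could be sketched like this. The client is a dict-backed fake standing in for a Valkey/Redis connection that supports get and setex:

```python
import json
import time

# Dict-backed stand-in for a Valkey client, so the sketch is self-contained.
class FakeValkey:
    def __init__(self):
        self._data = {}
    def get(self, key):
        value, expires = self._data.get(key, (None, 0.0))
        return value if expires > time.time() else None
    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

def cached_storage_check(client, graph_id, compute, ttl_seconds=60):
    """Serve the storage check from cache when fresh, else recompute."""
    key = f"limits:storage:{graph_id}"  # key name is illustrative
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    result = compute(graph_id)  # the expensive N+1 fan-out happens here
    client.setex(key, ttl_seconds, json.dumps(result))
    return result

client = FakeValkey()
calls = []
def compute(gid):
    calls.append(gid)
    return {"graph_id": gid, "used_gb": 17.2}

first = cached_storage_check(client, "kg1", compute)
second = cached_storage_check(client, "kg1", compute)  # served from cache
```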

7. strict=False on zip is misleading (robosystems/middleware/graph/ingestion_limits.py:116)

for (gid, is_parent), size_bytes in zip(database_ids, sizes, strict=False):

asyncio.gather guarantees results are in the same order and count as inputs. strict=False (which is the default) implies length mismatch is acceptable here — it isn't. Use strict=True so a bug in the gather logic would be caught immediately rather than silently dropping entries.

8. Default tier fallback silently misclassifies untiered graphs (robosystems/dagster/sensors/usage_monitor.py:53)

graph_tier = graph.graph_tier or "ladybug-standard"

If a graph has no tier set, it inherits the ladybug-standard 20 GB limit. This could produce false-positive alerts for internal/testing graphs that happen to not have a tier configured. Consider skipping graphs with None tier or logging a warning.
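A guard along these lines would skip untiered graphs while still surfacing a warning. resolve_tier is a hypothetical helper, not code from this PR:

```python
import logging

logger = logging.getLogger("usage_monitor")

def resolve_tier(graph_tier):
    """Return the tier to enforce, or None to skip the graph entirely."""
    if graph_tier is None:
        # Skip rather than silently applying the 20 GB ladybug-standard cap.
        logger.warning("graph has no tier configured; skipping usage check")
        return None
    return graph_tier

tiers = ["ladybug-standard", None]
checked = [t for t in map(resolve_tier, tiers) if t is not None]
```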


Positive Notes

  • Replacing node/relationship count limits with storage-based soft limits is the right model — it's simpler, measurable without counting graph elements, and aligns with how instance capacity actually works.
  • The sensor is correctly configured with DefaultSensorStatus.STOPPED — the ops team controls when to enable it, preventing an accidental alert flood on deploy.
  • The per-status dedup key pattern (usage_alert:{graph_id}:{status}) is smart — an account escalating from "approaching" to "over_limit" gets a separate alert rather than being blocked by the approaching-dedup TTL.
  • asyncio.gather for parallel subgraph storage fetches is the right call.
  • Test coverage across the sensor, SES ops, ingestion limits, and config is solid.

Comment thread on robosystems/dagster/sensors/usage_monitor.py: Fixed
- Simplified asynchronous handling in the usage monitor sensor by replacing manual event loop management with `asyncio.run()`.
- Removed the deprecated `check_graph_usage` method from `IngestionLimitChecker` to streamline the ingestion limit checks.
- Updated API models to include `node_count` in the `InstanceUsage` response, ensuring consistency across usage tracking.
- Adjusted tests to remove obsolete checks related to the removed `check_graph_usage` method, maintaining focus on current functionality.

jfrench9 commented Apr 5, 2026

@claude have we fixed all issues?


claude Bot commented Apr 5, 2026

Claude finished @jfrench9's task in 1m 19s


Issue Status Check

  • Read current state of changed files
  • Check each issue from previous review
  • Post results

Here's the current status of all issues from the previous review:

Medium Issues

| # | Issue | Status |
|---|-------|--------|
| 1 | `check_graph_usage` dead code (ingestion_limits.py) | ✅ Fixed — method removed |
| 2 | Sensor creates two event loops per graph (usage_monitor.py) | ✅ Fixed — using `asyncio.run()` |
| 3 | `display:flex` in email HTML (ses.py) | ✅ Fixed — replaced with `<table>` layout |
| 4 | `node_count` on ContentLimits semantic mismatch (limits.py) | ✅ Fixed — moved to InstanceUsage |
| 5 | Dedup keys stored in LOCKS Valkey DB (usage_monitor.py:71) | Still open |

Minor Issues

| # | Issue | Status |
|---|-------|--------|
| 6 | /limits latency regression not documented (limits.py) | Still open (and worse — the router now makes 2 separate Graph API calls for storage + node count before check_instance_storage) |
| 7 | `strict=False` on zip (ingestion_limits.py:134) | ✅ Fixed — now `strict=True` |
| 8 | Default tier fallback silently misclassifies untiered graphs (usage_monitor.py:75) | Still open |

Code Quality Comment (github-code-quality bot)

| # | Issue | Status |
|---|-------|--------|
| 9 | Implicit return in graph_usage_monitor_sensor (usage_monitor.py) | ⚠️ Partially addressed — `return None` added at line 155, but the reviewer suggested `SkipReason("No graphs require storage usage alerts at this time.")`, which is the idiomatic Dagster pattern |

Summary

5 of 9 issues resolved. Four remain open:

  • Issue 5 (usage_monitor.py:71): ValkeyDatabase.LOCKS is still being used for 7-day dedup keys. These should use a CACHE or dedicated DB to avoid interference with lock cleanup operations.

  • Issue 6 (limits.py): The /limits endpoint now makes 3 sequential async calls for non-shared graphs (storage info → node count → check_instance_storage which fans out to N+1 parallel subgraph fetches). Worth adding a comment on the trade-off or a short-TTL Valkey cache.

  • Issue 8 (usage_monitor.py:75): graph.graph_tier or "ladybug-standard" silently assigns the standard 20 GB limit to untiered graphs, which could produce false-positive alerts for internal/test graphs.

  • Issue 9 (usage_monitor.py:155): return None is explicit but return SkipReason("No graphs require storage usage alerts at this time.") is the correct Dagster idiom.

- Updated the graph usage monitor sensor to include a check for assigned tiers, ensuring only parent graphs with a tier are processed.
- Modified the Redis client initialization to use MCP_CACHE for deduplication keys, improving cache management.
- Improved documentation within the sensor to clarify the impact of async Graph API calls on latency and the rationale behind the 6-hour interval.
- Added comments in the graph limits retrieval function to highlight potential latency issues with parallel API calls and suggested caching strategies.
jfrench9 merged commit 32313f8 into main on Apr 5, 2026

7 checks passed

jfrench9 deleted the refactor/graph-storage-constraint branch April 5, 2026 04:55
