
Refactor graph storage limits and add usage monitoring sensor #592

Merged

jfrench9 merged 3 commits into main from refactor/graph-storage-constraint on Apr 5, 2026

Conversation

jfrench9 (Member) commented Apr 5, 2026

Summary

Refactors the graph limits and storage constraint system, introducing a dedicated usage monitoring sensor and email notification capabilities. The changes consolidate how graph tier limits are defined, enforced, and communicated to users, while adding proactive storage usage monitoring via a Dagster sensor.

Key Accomplishments

Storage Limits & Tier Configuration

  • Restructured graph_tier.py configuration to better represent storage constraints and tier boundaries
  • Updated the graph tier YAML configuration (.github/configs/graph.yml) to align with the new constraint model
  • Refactored ingestion_limits.py middleware to simplify limit enforcement logic and improve clarity around storage-based constraints
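The restructured tier model itself is not shown in this description, but its shape can be sketched as follows. This is a minimal illustration with hypothetical names and thresholds, not the actual graph_tier.py definitions; only the ladybug-standard tier and its 20 GB soft limit are confirmed elsewhere in this PR.

```python
from dataclasses import dataclass

# Hypothetical sketch of a storage-based tier constraint. Field names and
# the 0.8 warning threshold are illustrative, not the real configuration.
@dataclass(frozen=True)
class GraphTierLimits:
    tier: str
    storage_soft_limit_gb: int      # soft cap: alert, don't block writes
    warning_threshold: float = 0.8  # alert once usage crosses this fraction

    def status(self, used_gb: float) -> str:
        """Classify current usage against the soft limit."""
        if used_gb >= self.storage_soft_limit_gb:
            return "over_limit"
        if used_gb >= self.storage_soft_limit_gb * self.warning_threshold:
            return "approaching"
        return "ok"

standard = GraphTierLimits(tier="ladybug-standard", storage_soft_limit_gb=20)
```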

Usage Monitoring Sensor

  • Added a new Dagster sensor (usage_monitor.py) that proactively monitors graph storage usage across accounts
  • Integrated the sensor into the Dagster definitions entrypoint
  • Enables automated detection of accounts approaching or exceeding their storage limits

Email Notifications via SES

  • Added/expanded AWS SES operations (operations/aws/ses.py) to support sending usage alert emails
  • Provides infrastructure for notifying users when they are nearing storage thresholds

API & Router Updates

  • Updated the graph limits API models (models/api/graphs/limits.py) to reflect the new storage constraint structure
  • Adjusted the limits router and materialize router to work with the refactored limit representations
  • Minor update to the offering router for consistency

Breaking Changes

  • API model changes: The limits response models have been restructured. Clients consuming the graph limits endpoint may need to update their parsing logic to accommodate the new field names/structure.
  • Configuration schema change: The graph.yml tier configuration format has changed, which may affect deployment pipelines or infrastructure-as-code that references these values directly.

Testing

  • Added comprehensive test suite for the new usage monitor sensor (tests/dagster/test_usage_monitor.py — 196 lines)
  • Added tests for the SES email operations (tests/operations/aws/test_ses.py — 101 lines)
  • Significantly expanded ingestion limits middleware tests (test_ingestion_limits.py — net +365 lines of coverage improvements)
  • Updated existing tests for tier config and limits models to match the refactored structures

Infrastructure Considerations

  • AWS SES: Ensure SES is properly configured (verified sender identities, production access if out of sandbox) in the deployment environment for usage alert emails to be delivered
  • Dagster sensor: The new usage monitor sensor needs to be enabled in the Dagster deployment; consider appropriate evaluation intervals to balance responsiveness with cost
  • Configuration rollout: The updated graph tier configuration should be deployed in coordination with the application code to avoid mismatches between config schema and runtime expectations

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/graph-storage-constraint
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Updated graph limits configuration to replace node and relationship limits with a soft storage cap for instances across various tiers.
- Introduced a new sensor for monitoring graph instance storage usage, sending alerts when usage approaches or exceeds defined limits.
- Enhanced ingestion limit checks to focus on operational safety limits for materialization, ensuring no out-of-memory errors occur.
- Updated API models and responses to reflect changes in storage limits and usage tracking.
- Added tests for the new usage monitoring functionality and adjusted existing tests to align with the updated graph limits structure.

jfrench9 commented Apr 5, 2026

@claude please review this PR

claude Bot commented Apr 5, 2026

Claude finished @jfrench9's task in 1m 49s


PR Review: Refactor graph storage limits and add usage monitoring sensor

  • Gather context and understand request
  • Read changed files
  • Review configuration changes (graph.yml, graph_tier.py)
  • Review new sensor (usage_monitor.py)
  • Review middleware changes (ingestion_limits.py)
  • Review API/router changes
  • Review SES operations
  • Review tests
  • Post final review

Summary

This is a well-scoped refactor that replaces the node/relationship count limits model with a storage-based soft-limit model and adds proactive monitoring via a Dagster sensor. The design direction is sound — storage-based limits are simpler to reason about and closer to infrastructure reality. The coverage improvements are substantial. A few issues worth addressing before merging.


Issues

Medium

1. check_graph_usage appears to be dead code (robosystems/middleware/graph/ingestion_limits.py:168)

The method is defined but never called from production code. The limits router calls check_instance_storage directly. Only the tests call it. Either remove it or add a call site — leaving it around creates a maintenance trap.


2. Sensor creates two event loops per graph (robosystems/dagster/sensors/usage_monitor.py:82-97, robosystems/dagster/sensors/usage_monitor.py:116-131)

# For every graph, this runs twice:
loop = asyncio.new_event_loop()
try:
    storage_check = loop.run_until_complete(...)
finally:
    loop.close()

With 100 graphs you're creating and destroying 200 event loops. Prefer a single asyncio.run() call (or a single event loop) wrapping all per-graph async work. At minimum, consider using asyncio.run() instead of the manual loop setup — it's cleaner:

storage_check = asyncio.run(IngestionLimitChecker.check_instance_storage(...))

Even better, consolidate all async work for a single graph into one coroutine and run it once.
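That consolidation can be sketched like this. check_storage here is a stand-in for the real IngestionLimitChecker call, not the actual sensor code:

```python
import asyncio

# Sketch: all per-graph async work runs inside one coroutine, and the
# whole batch uses a single asyncio.run() per sensor tick.
async def check_storage(graph_id: str) -> tuple[str, int]:
    await asyncio.sleep(0)          # placeholder for the Graph API round trip
    return graph_id, len(graph_id)  # dummy "usage" value for illustration

async def check_all(graph_ids: list[str]) -> dict[str, int]:
    # One event loop for the entire tick; graphs are checked in parallel.
    results = await asyncio.gather(*(check_storage(g) for g in graph_ids))
    return dict(results)

usage = asyncio.run(check_all(["kg1", "kg2", "kg3"]))
```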


3. display:flex in email HTML (robosystems/operations/aws/ses.py, _capacity_warning_template)

'<div style="display:flex;justify-content:space-between;align-items:center;">'

flexbox is not supported in Gmail, Outlook, or most email clients. The status badge and storage readout that use this layout will likely render broken for most recipients. Replace with a <table> layout for the stat + badge row.
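A table-based version of that row could look like the following. The markup is illustrative, not the actual _capacity_warning_template:

```python
# Illustrative table-based replacement for the flex row; email clients
# reliably support table layout where flexbox fails.
def usage_row_html(storage_text: str, badge_text: str) -> str:
    return (
        '<table role="presentation" width="100%" cellpadding="0" cellspacing="0">'
        "<tr>"
        f'<td align="left">{storage_text}</td>'
        f'<td align="right">{badge_text}</td>'
        "</tr>"
        "</table>"
    )

row = usage_row_html("17.2 GB of 20 GB", "APPROACHING")
```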


4. node_count is a semantic mismatch on ContentLimits (robosystems/models/api/graphs/limits.py:67)

ContentLimits is now described as "Per-operation materialization limits" but has node_count: int | None as an informational field — it's neither a limit nor related to per-operation safety. This belongs in InstanceUsage or its own field on GraphLimitsResponse. Clients parsing this response will find it confusing.
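The suggested split can be sketched with plain dataclasses. Field names other than node_count are hypothetical, not the real limits.py models:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the split: node_count is informational usage data, so it
# lives with usage rather than with per-operation safety limits.
@dataclass
class ContentLimits:
    # Per-operation materialization safety limits only.
    max_rows_per_materialization: Optional[int] = None

@dataclass
class InstanceUsage:
    storage_bytes: int
    node_count: Optional[int] = None  # informational, belongs with usage

@dataclass
class GraphLimitsResponse:
    limits: ContentLimits
    usage: InstanceUsage

resp = GraphLimitsResponse(
    limits=ContentLimits(max_rows_per_materialization=100_000),
    usage=InstanceUsage(storage_bytes=1_500_000_000, node_count=42_000),
)
```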


5. Dedup keys stored in LOCKS Valkey DB (robosystems/dagster/sensors/usage_monitor.py:66)

Using ValkeyDatabase.LOCKS for notification dedup keys is a semantic mismatch. Locks are typically short-lived distributed mutexes; these dedup keys live for 7 days. Use a CACHE or notification-specific DB to avoid accidentally clearing email dedup state with lock cleanup operations.

Minor

6. /limits endpoint latency increase not documented

The refactored limits router now calls check_instance_storage, which issues parallel Graph API requests for every subgraph on the instance. For a graph with 10 subgraphs that's 11 get_database_info calls on every /limits request. This is a significant latency regression compared to the old model (which did at most 1 call). Worth a comment in the router explaining the trade-off, or consider caching the result (e.g. with a short TTL in Valkey).
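A short-TTL cache in front of the fan-out could be sketched like this. The client is a dict-backed fake standing in for a Valkey/Redis connection that supports get and setex:

```python
import json
import time

# Dict-backed stand-in for a Valkey client, so the sketch is self-contained.
class FakeValkey:
    def __init__(self):
        self._data = {}
    def get(self, key):
        value, expires = self._data.get(key, (None, 0.0))
        return value if expires > time.time() else None
    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

def cached_storage_check(client, graph_id, compute, ttl_seconds=60):
    """Serve the storage check from cache when fresh, else recompute."""
    key = f"limits:storage:{graph_id}"  # key name is illustrative
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    result = compute(graph_id)  # the expensive N+1 fan-out happens here
    client.setex(key, ttl_seconds, json.dumps(result))
    return result

client = FakeValkey()
calls = []
def compute(gid):
    calls.append(gid)
    return {"graph_id": gid, "used_gb": 17.2}

first = cached_storage_check(client, "kg1", compute)
second = cached_storage_check(client, "kg1", compute)  # served from cache
```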

7. strict=False on zip is misleading (robosystems/middleware/graph/ingestion_limits.py:116)

for (gid, is_parent), size_bytes in zip(database_ids, sizes, strict=False):

asyncio.gather guarantees results are in the same order and count as inputs. strict=False (which is the default) implies length mismatch is acceptable here — it isn't. Use strict=True so a bug in the gather logic would be caught immediately rather than silently dropping entries.

8. Default tier fallback silently misclassifies untiered graphs (robosystems/dagster/sensors/usage_monitor.py:53)

graph_tier = graph.graph_tier or "ladybug-standard"

If a graph has no tier set, it inherits the ladybug-standard 20 GB limit. This could produce false-positive alerts for internal/testing graphs that happen to not have a tier configured. Consider skipping graphs with None tier or logging a warning.
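A guard along these lines would skip untiered graphs while still surfacing a warning. resolve_tier is a hypothetical helper, not code from this PR:

```python
import logging

logger = logging.getLogger("usage_monitor")

def resolve_tier(graph_tier):
    """Return the tier to enforce, or None to skip the graph entirely."""
    if graph_tier is None:
        # Skip rather than silently applying the 20 GB ladybug-standard cap.
        logger.warning("graph has no tier configured; skipping usage check")
        return None
    return graph_tier

tiers = ["ladybug-standard", None]
checked = [t for t in map(resolve_tier, tiers) if t is not None]
```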


Positive Notes

  • Replacing node/relationship count limits with storage-based soft limits is the right model — it's simpler, measurable without counting graph elements, and aligns with how instance capacity actually works.
  • The sensor is correctly configured with DefaultSensorStatus.STOPPED — the ops team controls when to enable it, preventing an accidental alert flood on deploy.
  • The per-status dedup key pattern (usage_alert:{graph_id}:{status}) is smart — an account escalating from "approaching" to "over_limit" gets a separate alert rather than being blocked by the approaching-dedup TTL.
  • asyncio.gather for parallel subgraph storage fetches is the right call.
  • Test coverage across the sensor, SES ops, ingestion limits, and config is solid.

Comment thread on robosystems/dagster/sensors/usage_monitor.py: Fixed
- Simplified asynchronous handling in the usage monitor sensor by replacing manual event loop management with `asyncio.run()`.
- Removed the deprecated `check_graph_usage` method from `IngestionLimitChecker` to streamline the ingestion limit checks.
- Updated API models to include `node_count` in the `InstanceUsage` response, ensuring consistency across usage tracking.
- Adjusted tests to remove obsolete checks related to the removed `check_graph_usage` method, maintaining focus on current functionality.

jfrench9 commented Apr 5, 2026

@claude have we fixed all issues?


claude Bot commented Apr 5, 2026

Claude finished @jfrench9's task in 1m 19s


Issue Status Check

  • Read current state of changed files
  • Check each issue from previous review
  • Post results

Here's the current status of all issues from the previous review:

Medium Issues

| # | Issue | Status |
|---|-------|--------|
| 1 | `check_graph_usage` dead code (ingestion_limits.py) | ✅ Fixed — method removed |
| 2 | Sensor creates two event loops per graph (usage_monitor.py) | ✅ Fixed — using `asyncio.run()` |
| 3 | `display:flex` in email HTML (ses.py) | ✅ Fixed — replaced with `<table>` layout |
| 4 | `node_count` on ContentLimits semantic mismatch (limits.py) | ✅ Fixed — moved to InstanceUsage |
| 5 | Dedup keys stored in LOCKS Valkey DB (usage_monitor.py:71) | Still open |

Minor Issues

| # | Issue | Status |
|---|-------|--------|
| 6 | /limits latency regression not documented (limits.py) | Still open (and worse — the router now makes 2 separate Graph API calls for storage + node count before check_instance_storage) |
| 7 | `strict=False` on zip (ingestion_limits.py:134) | ✅ Fixed — now `strict=True` |
| 8 | Default tier fallback silently misclassifies untiered graphs (usage_monitor.py:75) | Still open |

Code Quality Comment (github-code-quality bot)

| # | Issue | Status |
|---|-------|--------|
| 9 | Implicit return in graph_usage_monitor_sensor (usage_monitor.py) | ⚠️ Partially addressed — `return None` added at line 155, but the reviewer suggested `SkipReason("No graphs require storage usage alerts at this time.")`, which is the idiomatic Dagster pattern |

Summary

5 of 9 issues resolved. Four remain open:

  • Issue 5 (usage_monitor.py:71): ValkeyDatabase.LOCKS is still being used for 7-day dedup keys. These should use a CACHE or dedicated DB to avoid interference with lock cleanup operations.

  • Issue 6 (limits.py): The /limits endpoint now makes 3 sequential async calls for non-shared graphs (storage info → node count → check_instance_storage which fans out to N+1 parallel subgraph fetches). Worth adding a comment on the trade-off or a short-TTL Valkey cache.

  • Issue 8 (usage_monitor.py:75): graph.graph_tier or "ladybug-standard" silently assigns the standard 20 GB limit to untiered graphs, which could produce false-positive alerts for internal/test graphs.

  • Issue 9 (usage_monitor.py:155): return None is explicit but return SkipReason("No graphs require storage usage alerts at this time.") is the correct Dagster idiom.

- Updated the graph usage monitor sensor to include a check for assigned tiers, ensuring only parent graphs with a tier are processed.
- Modified the Redis client initialization to use MCP_CACHE for deduplication keys, improving cache management.
- Improved documentation within the sensor to clarify the impact of async Graph API calls on latency and the rationale behind the 6-hour interval.
- Added comments in the graph limits retrieval function to highlight potential latency issues with parallel API calls and suggested caching strategies.
jfrench9 merged commit 32313f8 into main on Apr 5, 2026

7 checks passed

jfrench9 deleted the refactor/graph-storage-constraint branch April 5, 2026 04:55
