
GridShift: Safe-Mode Orchestration for Grid-Aware AI Workloads

A grid-aware AI orchestrator that refuses to act on telemetry it cannot trust.

SCSP Hackathon Submission

Team name: Zama
Track: Electric Grid Optimization
Repository: https://github.com/AYUSHMIT/gridshift-safe-mode

Team members

  • Ayush Pandey
  • Arash Rezaee
  • Zahra Sharifi Soltani
  • Mehran Sasaninia

What we built

GridShift is a grid-aware orchestrator for AI/data-center workloads operating under grid stress. It verifies that telemetry can be trusted before acting on it. When a controller lies, firmware attestation fails, or reported load diverges from observed load, GridShift enters safe mode and unwinds workloads off the untrusted node instead of freezing them in place.

Datasets / APIs used

  • Simulated Boston-area grid load model
  • Simulated AI/data-center workload traces
  • Simulated controller attestation fields: signature, PCR status, nonce freshness
  • Simulated reported-vs-observed load mismatch
  • Optional Anthropic API or OpenAI API for operator briefings
  • Deterministic offline narrator fallback when no API key is configured

The idea in one line

In this prototype, every simulated data-center controller emits attestation-style fields and signed-style telemetry. The orchestrator cross-checks two independent signals: cryptographic attestation and behavioral consistency (reported vs. observed load). If either fails, the system enters safe mode, which UNWINDS workloads off the untrusted node rather than freezing them in place. An LLM-powered incident narrator turns each tick's structured state into a plain-language operator briefing.

Where the AI lives

GridShift's AI component is honest about what it does and does not do:

  • The AI does not make control decisions. All decisions (run / delay / migrate / block) are made by a deterministic safety layer. Putting an LLM in the control loop of a power grid would be reckless; we don't do that.
  • The AI generates operator briefings. After each tick, the structured trust state and decision list are passed to an LLM (Claude or GPT, configurable) which produces a 3–5 sentence operator-log-style briefing: what happened, what GridShift did, and what the operator should inspect.
  • The AI falls back cleanly offline. If no API key is configured or the network is down, a rule-based fallback generates an operator briefing deterministically. The demo works with or without Wi-Fi.

This separation (AI for explanation, not decision) is itself part of the pitch. It's the right architectural pattern for AI in critical infrastructure.
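The fallback behavior described above can be sketched as a small backend-selection routine. This is illustrative only; the function names and the tick-state shape are hypothetical, not the repo's actual ai_narrator API:

```python
import os

def llm_briefing(tick_state: dict) -> str:
    """Placeholder for the real LLM call (Claude or GPT); hypothetical."""
    raise RuntimeError("no network in this sketch")

def rule_based_briefing(tick_state: dict) -> str:
    """Deterministic fallback: template the structured state into prose."""
    untrusted = [node for node, ok in tick_state["trust"].items() if not ok]
    if untrusted:
        return ("Safe mode active. Untrusted nodes: " + ", ".join(untrusted)
                + f". Decisions this tick: {tick_state['decisions']}. "
                  "Inspect attestation logs on the flagged nodes.")
    return f"All nodes trusted. Decisions this tick: {tick_state['decisions']}."

def make_briefing(tick_state: dict) -> str:
    """Prefer the LLM when a key is configured; degrade to rules, never crash."""
    if os.environ.get("GRIDSHIFT_FORCE_FALLBACK") == "1":
        return rule_based_briefing(tick_state)
    if os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("OPENAI_API_KEY"):
        try:
            return llm_briefing(tick_state)
        except Exception:
            pass  # network/API failure: fall through to the deterministic path
    return rule_based_briefing(tick_state)
```

The key design point is that the LLM call is wrapped so any failure degrades to the deterministic path; the tick loop never blocks on narration.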

Why safe mode unwinds instead of freezes

An early version of the design blocked all migrations involving an untrusted node. A hardware-security supervisor pointed out this could be weaponized: an adversary with a workload on a DC could trigger a false attestation failure, and then inflate the load on that DC, using the migration block to trap their own inflated workload in place and create a DoS against the grid. The refined design below addresses this directly.

Refined safety policy:

  • Migration INTO an untrusted node → blocked (never place new work on a dubious node).
  • Migration OUT OF an untrusted node → allowed and preferred (reduce exposure; unwind).
  • Grid-side observed load is always authoritative for hard safety limits, independent of trust state. If observed utilization on any node exceeds 75%, an unwind-migration is emitted regardless of what the controller reports.
  • Safe mode is an investigate-and-unwind state, not a freeze.
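The refined policy above can be sketched in a few lines. This is a minimal illustration, not the repo's safety.py API; the function names, tuple shapes, and the `pick_target` helper are hypothetical:

```python
UTIL_HARD_LIMIT = 0.75  # observed-utilization limit from the policy above

def filter_migrations(migrations, trusted):
    """Directional trust policy over (job, src, dst) moves:
    block INTO an untrusted node, allow OUT OF one."""
    allowed = []
    for job, src, dst in migrations:
        if not trusted.get(dst, False):
            continue                     # never place new work on a dubious node
        allowed.append((job, src, dst))  # unwinding off an untrusted src is fine
    return allowed

def emit_unwinds(nodes, trusted, observed_util, pick_target):
    """Observed load is authoritative: nodes over the hard limit shed work
    regardless of what the controller reports; untrusted nodes unwind too."""
    unwinds = []
    for node in nodes:
        if observed_util[node] > UTIL_HARD_LIMIT or not trusted[node]:
            dst = pick_target(node)      # caller picks a trusted, low-load node
            if dst is not None:
                unwinds.append((node, dst))
    return unwinds
```

Note that `emit_unwinds` keys off observed utilization only, which is what defeats the supervisor-scenario DoS: inflating real load on a compromised node triggers an unwind rather than a freeze.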

Setup

python -m venv .venv
source .venv/bin/activate           # Windows: .venv\Scripts\activate
pip install -r requirements.txt
streamlit run app.py

Important: launch with streamlit run app.py, not python app.py. The latter runs the script in "bare mode" — Streamlit's UI functions become no-ops, nothing renders, and you'll see missing ScriptRunContext! warnings in the terminal. After running the correct command, open the printed Local URL (typically http://localhost:8501) in a browser.

Optional: enable the LLM narrator

Without an API key, the AI panel runs a deterministic rule-based fallback. To use an actual LLM:

export ANTHROPIC_API_KEY=sk-ant-...   # preferred
# or
export OPENAI_API_KEY=sk-...
streamlit run app.py

See .env.example for configurable model IDs. At a venue with unreliable Wi-Fi, force offline mode with export GRIDSHIFT_FORCE_FALLBACK=1.

Running the core loop without the UI

python -m core.orchestrator

This runs the supervisor-scenario attack end-to-end and prints the unwind behavior.

Module tests

python -m core.grid_model
python -m core.dc_simulator
python -m core.verifier

Demo scenes (on the dashboard)

The dashboard now has a 🎬 Guided demo with a single ▶ Next demo step button that walks through five steps, plus a 🏆 Load winning scenario shortcut for time-pressured demos. Manual controls are tucked into an expander.

  1. Step 1: Heatwave begins. Baseline grid load rises.
  2. Step 2: AI job burst. A wave of AI jobs lands across the three DCs.
  3. Step 3: Behavioral lie. BOS-1 under-reports by 16 MW. The behavioral monitor catches the mismatch; trust flips to compromised; safe mode ON.
  4. Step 4: Firmware tamper. The PCR no longer matches known-good. Attestation catches it even when reported and observed agree.
  5. Step 5: Load spike + unwind (the winning moment). BOS-1 is untrusted AND its real load is inflated. The refined safety layer migrates jobs OFF BOS-1 instead of freezing them; the supervisor-scenario DoS is defeated.

UI features for judges

  • Active attack panel: narrates exactly what is being injected at each moment.
  • Reported vs Observed metrics (the heart of the story): shown as a 4-up metric row.
  • Load history chart: the 900 MW threshold is a bold red dashed line; safe-mode periods are shaded.
  • Trust legend: explains what sig, pcr, nonce, and mismatch mean.
  • "Why this happened" box: plain-English narration of every decision.
  • Directional safe mode: an explicit "Blocked: INTO BOS-1 / Allowed: OUT OF BOS-1" panel.
  • Naive vs GridShift comparison table: shows the value of the directional unwind at a glance.
  • Reset full demo and Load winning scenario buttons: recovery if the demo goes sideways.
  • Takeaway banner: GridShift optimizes when trust holds, and safely unwinds when trust breaks.

Core invariants (refined)

ACCEPT  ⟺  signature_valid ∧ pcr_matches_known_good ∧ nonce_fresh
TRUST   ⟺  ACCEPT ∧ |reported − observed| < ε   (per-node)
DECIDE  ⟺  planner, filtered by directional trust policy
          + observed-load override for hard safety limits
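The ACCEPT and TRUST invariants translate directly into predicates. A minimal sketch, assuming a flat message dict and a hypothetical tolerance constant (the repo's actual ε and field names may differ):

```python
EPSILON_MW = 5.0  # hypothetical mismatch tolerance; the real ε may differ

def accept(msg: dict) -> bool:
    # ACCEPT ⟺ signature_valid ∧ pcr_matches_known_good ∧ nonce_fresh
    return (msg["signature_valid"]
            and msg["pcr_matches_known_good"]
            and msg["nonce_fresh"])

def trust(msg: dict, observed_load_mw: float) -> bool:
    # TRUST ⟺ ACCEPT ∧ |reported − observed| < ε   (per node)
    return accept(msg) and abs(msg["reported_load_mw"] - observed_load_mw) < EPSILON_MW
```

A message can pass every cryptographic check and still fail TRUST on the behavioral mismatch, which is exactly the "controller lies about load" row in the threat model.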

Repo map

gridshift/
├── app.py                      # Streamlit UI
├── core/
│   ├── state.py                # shared types [everyone]
│   ├── grid_model.py           # [Smart Grids]
│   ├── dc_simulator.py         # [Optical / DC]
│   ├── attestation.py          # [HW Security] crypto primitives
│   ├── prover.py               # [HW Security] controller side
│   ├── verifier.py             # [HW Security] orchestrator side
│   ├── behavior_monitor.py     # [System Security]
│   ├── safety.py               # [System Security] decision + directional safety
│   ├── orchestrator.py         # [System Security] main tick loop
│   └── ai_narrator.py          # [System Security] LLM incident briefings
├── data/
│   └── sample_jobs.json
├── .env.example
├── requirements.txt
└── README.md

Threat model coverage

Attack                                                  Behavioral check      Attestation check     Safety policy
Controller lies about load                              ✔ caught              passes                block-into; unwind-out
Firmware tampered                                       may pass              ✔ caught              block-into; unwind-out
Replay of a captured valid message                      may pass              ✔ caught (nonce)      block-into; unwind-out
Stolen / spoofed controller identity                    may pass              ✔ caught (sig)        block-into; unwind-out
DoS via safe-mode weaponization (supervisor scenario)   attacker wants this   —                     defeated: unwind + observed-load override
