
feat(auth): replace static hotkey/API-key auth with Bittensor validator whitelisting and 50% consensus#5

Merged
echobt merged 12 commits into main from feat/bittensor-validator-consensus
Feb 18, 2026

Conversation


@echobt echobt commented Feb 18, 2026

Summary

Replace the hardcoded single-hotkey + API-key authentication system with dynamic validator whitelisting from the Bittensor blockchain (netuid 100) and a 50% consensus mechanism for evaluation triggering.

Changes

New modules

  • src/validator_whitelist.rs — ValidatorWhitelist backed by parking_lot::RwLock<HashSet<String>>; refreshes every 5 minutes via bittensor-rs (BittensorClient::with_failover() → sync_metagraph → filter by validator_permit, active, and stake ≥ 10,000 TAO). Retries up to 3× with exponential backoff on failure, preserving the cached whitelist.
  • src/consensus.rs — ConsensusManager using DashMap<String, PendingConsensus> to track votes per archive SHA-256 hash. Evaluations trigger only when ≥50% of whitelisted validators submit the same payload. Includes a TTL reaper (60s default) and a max-pending-entries cap (100).
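
As a rough mental model for reviewers, here is a heavily simplified sketch of the two new pieces. It is not the code in this PR: it uses std::sync locks instead of DashMap and parking_lot, omits the TTL reaper and capacity cap, and the names are illustrative only.

```rust
use std::collections::{HashMap, HashSet};

struct ValidatorWhitelist {
    hotkeys: std::sync::RwLock<HashSet<String>>, // parking_lot::RwLock in the real module
}

impl ValidatorWhitelist {
    fn is_whitelisted(&self, hotkey: &str) -> bool {
        self.hotkeys.read().unwrap().contains(hotkey)
    }
}

enum ConsensusStatus {
    Pending { votes: usize, required: usize },
    Reached { votes: usize, required: usize },
    AlreadyVoted { votes: usize, required: usize },
}

struct ConsensusManager {
    // Keyed by the archive's SHA-256 hex digest; DashMap in the real module.
    pending: std::sync::Mutex<HashMap<String, HashSet<String>>>,
}

impl ConsensusManager {
    fn record_vote(&self, archive_hash: &str, hotkey: &str, required: usize) -> ConsensusStatus {
        let mut pending = self.pending.lock().unwrap();
        let voters = pending.entry(archive_hash.to_string()).or_default();
        if !voters.insert(hotkey.to_string()) {
            return ConsensusStatus::AlreadyVoted { votes: voters.len(), required };
        }
        let votes = voters.len();
        if votes >= required {
            pending.remove(archive_hash); // the entry is consumed once consensus is reached
            ConsensusStatus::Reached { votes, required }
        } else {
            ConsensusStatus::Pending { votes, required }
        }
    }
}
```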

Modified modules

  • src/auth.rs — Removed AUTHORIZED_HOTKEY import, api_key field from AuthHeaders, X-Api-Key header extraction, InvalidApiKey error variant, and API key comparison. verify_request() now accepts a &ValidatorWhitelist reference and checks hotkey membership instead of comparing against a single hardcoded key.
  • src/config.rs — Removed AUTHORIZED_HOTKEY constant and worker_api_key field (no more WORKER_API_KEY env var). Added configurable fields: bittensor_netuid, min_validator_stake_tao, validator_refresh_secs, consensus_threshold, consensus_ttl_secs.
  • src/handlers.rs — AppState now includes ValidatorWhitelist and ConsensusManager. submit_batch computes the SHA-256 of the archive, records a consensus vote, and returns 202 Accepted with pending/reached status. Evaluation only launches on consensus. Returns 503 when the whitelist is empty or consensus entries are at capacity (see the hashing/threshold sketch after this list).
  • src/main.rs — Added module declarations, creates and spawns whitelist refresh loop and consensus reaper as background tasks.
  • Cargo.toml — Added bittensor-rs git dependency and required [patch.crates-io] for w3f-bls.
  • Dockerfile — Added protobuf-compiler, cmake, clang, mold build deps; copies .cargo config for mold linker.
  • AGENTS.md / src/AGENTS.md — Updated data flow, module map, environment variables, and authentication documentation.
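
Two of the small calculations described above are easy to pin down concretely. The sketch below shows the archive hashing and the required-votes math as standalone functions; the function names are illustrative and the sha2 crate is assumed here for the SHA-256 step.

```rust
use sha2::{Digest, Sha256};

/// Hex-encoded SHA-256 digest used as the consensus key for an uploaded archive.
fn archive_hash(bytes: &[u8]) -> String {
    Sha256::digest(bytes).iter().map(|b| format!("{:02x}", b)).collect()
}

/// ceil(total * threshold), clamped to at least one vote, mirroring the
/// saturating f64 -> usize conversion mentioned in the follow-up commits.
fn required_votes(total_validators: usize, threshold: f64) -> usize {
    let raw = (total_validators as f64 * threshold).ceil();
    (raw.min(usize::MAX as f64) as usize).max(1)
}

fn main() {
    let archive = b"example archive bytes";
    println!("consensus key = {}", archive_hash(archive));
    // With 7 whitelisted validators and the default 0.5 threshold,
    // 4 identical submissions are needed before evaluation starts.
    assert_eq!(required_votes(7, 0.5), 4);
}
```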

Tests

  • Updated existing auth/config tests to remove API-key references
  • Added tests for ValidatorWhitelist (empty start, whitelisting, count)
  • Added tests for ConsensusManager (single vote, threshold trigger, duplicate votes, independent hashes, TTL expiration, capacity check)

Breaking Changes

  • X-Api-Key header is no longer accepted or required
  • WORKER_API_KEY env var is no longer needed
  • POST /submit now requires consensus from ≥50% of whitelisted validators before triggering evaluation
  • All GET endpoints remain unaffected
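
From a validator's point of view, a submission now looks roughly like the sketch below (reqwest with the multipart feature). This is hypothetical client code: the multipart field name, the file name, and the exact message that X-Signature covers are assumptions; only the three header names and the POST /submit route come from this PR.

```rust
// Hypothetical validator-side call; field name ("archive"), file name, and the
// signing scheme behind X-Signature are assumptions, not taken from this PR.
async fn submit_archive(
    base_url: &str,
    hotkey_ss58: &str,
    nonce: &str,
    signature_hex: &str,
    archive: Vec<u8>,
) -> anyhow::Result<reqwest::Response> {
    let form = reqwest::multipart::Form::new().part(
        "archive",
        reqwest::multipart::Part::bytes(archive).file_name("batch.tar.gz"),
    );
    let resp = reqwest::Client::new()
        .post(format!("{base_url}/submit"))
        .header("X-Hotkey", hotkey_ss58) // must be on the netuid 100 whitelist
        .header("X-Nonce", nonce)        // single-use; replay-protected server-side
        .header("X-Signature", signature_hex)
        .multipart(form)
        .send()
        .await?;
    // 202 Accepted means the vote was recorded but consensus is still pending;
    // evaluation starts once >= 50% of whitelisted validators submit the same archive.
    Ok(resp)
}
```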

Summary by CodeRabbit

  • New Features

    • Dynamic validator whitelist for hotkey-based authentication with periodic refresh
    • Consensus-driven submission flow: uploads hashed, votes tracked; returns "pending" until consensus met, then processes
    • Bittensor/stake-based validator filtering and configurable consensus/environment tuning
  • Bug Fixes

    • Graceful task termination when runtime resources close instead of panics
    • Safer message serialization to avoid send-time panics
    • Stricter header/nonce/signature validation for authentication
  • Chores

    • Added new dependencies for Bittensor integration and hashing
    • Hardened Docker image and switched runtime to a non-root user

…or whitelisting and 50% consensus

Integrate dynamic validator whitelisting from Bittensor netuid 100 and
consensus-based evaluation triggering, replacing the previous single
AUTHORIZED_HOTKEY + WORKER_API_KEY authentication system.

Authentication now uses a dynamic whitelist of validators fetched every
5 minutes from the Bittensor blockchain via bittensor-rs. Validators
must have validator_permit, be active, and have >=10,000 TAO stake.
POST /submit requests only trigger evaluations when >=50% of whitelisted
validators submit the same archive payload (identified by SHA-256 hash).

New modules:
- src/validator_whitelist.rs: ValidatorWhitelist with parking_lot::RwLock,
  background refresh loop with 3-retry exponential backoff, connection
  resilience (keeps cached whitelist on failure), starts empty and rejects
  requests with 503 until first successful sync
- src/consensus.rs: ConsensusManager using DashMap for lock-free vote
  tracking, PendingConsensus entries with TTL (default 60s), reaper loop
  every 30s, max 100 pending entries cap, duplicate vote detection

Modified modules:
- src/auth.rs: Removed AUTHORIZED_HOTKEY import, api_key field from
  AuthHeaders, X-Api-Key header extraction, InvalidApiKey error variant.
  verify_request() now takes &ValidatorWhitelist instead of API key string.
  Updated all tests accordingly.
- src/config.rs: Removed AUTHORIZED_HOTKEY constant and worker_api_key
  field. Added bittensor_netuid, min_validator_stake_tao,
  validator_refresh_secs, consensus_threshold, consensus_ttl_secs with
  env var support and sensible defaults. Updated banner output.
- src/handlers.rs: Added ValidatorWhitelist and ConsensusManager to
  AppState. submit_batch now: checks whitelist non-empty (503), validates
  against whitelist, computes SHA-256 of archive, records consensus vote,
  returns 202 with pending status or triggers evaluation on consensus.
  Moved active batch check to consensus-reached branch only.
- src/main.rs: Added module declarations, creates ValidatorWhitelist and
  ConsensusManager, spawns background refresh and reaper tasks.
- Cargo.toml: Added bittensor-rs git dependency and mandatory
  [patch.crates-io] for w3f-bls.
- Dockerfile: Added protobuf-compiler, cmake, clang, mold build deps
  for bittensor-rs substrate dependencies. Copies .cargo config.
- AGENTS.md and src/AGENTS.md: Updated data flow, module map, env vars,
  authentication docs to reflect new architecture.

BREAKING CHANGE: WORKER_API_KEY env var and X-Api-Key header no longer required.
All validators on Bittensor netuid 100 with sufficient stake are auto-whitelisted.

coderabbitai bot commented Feb 18, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

Replaces static API-key/hotkey auth with a dynamic Bittensor-derived validator whitelist and adds a ConsensusManager to collect per-archive votes; submissions are accepted but only processed after whitelist-based validators reach the configured consensus threshold. Background refresh and reaper loops manage whitelist and pending consensus state.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Consensus & Whitelist Modules<br>src/consensus.rs, src/validator_whitelist.rs | Adds ConsensusManager (DashMap-backed pending votes, record_vote, TTL reaper, capacity) and ValidatorWhitelist (RwLock HashSet, Bittensor refresh loop, try_refresh). Includes tests and background loop entrypoints. |
| Auth, Handlers & Executor Flow<br>src/auth.rs, src/handlers.rs, src/executor.rs, src/ws.rs | Reworks authentication to use the whitelist (removes API key), enforces SS58 checksum and nonce rules, updates the verify_request signature, integrates consensus voting into submit_batch (Pending/AlreadyVoted/Reached handling), adds capacity checks, and makes semaphore and serde handling safer. |
| Config & AppState Initialization<br>src/config.rs, src/main.rs, src/handlers.rs | Extends Config with Bittensor/consensus fields and validates consensus_threshold; from_env returns Result. AppState now includes validator_whitelist and consensus_manager; main wires background refresh/reaper tasks and exits on config/init failures. |
| Build & Container<br>Cargo.toml, Dockerfile | Adds deps: bittensor-rs (git), blake2; patches w3f-bls. The Docker build adds protobuf-compiler/cmake/clang/mold, copies .cargo, and switches the runtime to a non-root executor user with /tmp/sessions ownership. |
| Docs & Manifest<br>AGENTS.md, src/AGENTS.md, Cargo.toml | Updates documentation and the public API surface to document ValidatorWhitelist, ConsensusManager, new env vars (BITTENSOR_NETUID, MIN_VALIDATOR_STAKE_TAO, VALIDATOR_REFRESH_SECS, CONSENSUS_THRESHOLD, CONSENSUS_TTL_SECS, MAX_PENDING_CONSENSUS), and behavior changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Handler as Handler<br/>(submit_batch)
    participant Whitelist as ValidatorWhitelist
    participant Consensus as ConsensusManager
    participant Executor as ArchiveExecutor

    Client->>Handler: POST /submit (X-Hotkey, X-Nonce, X-Signature, archive)
    Handler->>Whitelist: is_whitelisted(hotkey)
    alt not whitelisted
        Handler-->>Client: 401 Unauthorized
    else whitelisted
        Handler->>Handler: verify signature & nonce
        alt invalid
            Handler-->>Client: 401 Auth failed
        else valid
            Handler->>Consensus: record_vote(archive_hash, hotkey, archive_data, required_votes)
            alt Consensus::Reached
                Consensus-->>Handler: Reached {archive_data, votes, required}
                Handler->>Executor: extract & spawn batch (concurrency)
                Handler-->>Client: 200 OK {batch_id, consensus_reached, votes, required}
            else Consensus::AlreadyVoted
                Consensus-->>Handler: AlreadyVoted {votes, required}
                Handler-->>Client: 409 Conflict
            else Consensus::Pending
                Consensus-->>Handler: Pending {votes, required}
                Handler-->>Client: 202 Accepted {pending_consensus, votes, required}
            end
        end
    end
```

```mermaid
sequenceDiagram
    participant Timer
    participant Whitelist as ValidatorWhitelist
    participant Bittensor as Bittensor<br/>Network
    participant Cache as RwLock<HashSet>

    Timer->>Whitelist: refresh_loop(netuid, min_stake_tao)
    loop every validator_refresh_secs
        Whitelist->>Bittensor: connect & sync metagraph(netuid)
        alt success
            Bittensor-->>Whitelist: neurons list
            Whitelist->>Whitelist: filter validators (active, stake >= min)
            Whitelist->>Cache: replace hotkey set
            Whitelist->>Whitelist: log refresh(success,count)
        else failure (retries)
            Whitelist->>Whitelist: retry/backoff or log final warning
        end
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐰
A rabbit notes the whitelist grows each day,
Validators vote, the hashes find their way,
No static keys — consensus leads the run,
Archives wake when fifty percent have done,
Hop, hop, the batches bloom beneath the sun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.85%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately and concisely summarizes the main change: replacing static authentication with dynamic Bittensor validator whitelisting and consensus-based evaluation. |


- Move nonce consumption AFTER signature verification in verify_request()
  to prevent attackers from burning legitimate nonces via invalid signatures
- Fix TOCTOU race in NonceStore::check_and_insert() using atomic DashMap
  entry API instead of separate contains_key + insert
- Add input length limits for auth headers (hotkey 128B, nonce 256B,
  signature 256B) to prevent memory exhaustion via oversized values
- Add consensus_threshold validation in Config::from_env() — must be
  in range (0.0, 1.0], panics at startup if invalid
- Add saturating conversion for consensus required calculation to prevent
  integer overflow on f64→usize cast
- Add tests for all security fixes
- Extract magic number 100 to configurable MAX_PENDING_CONSENSUS
- Restore #[allow(dead_code)] on DEFAULT_MAX_OUTPUT_BYTES constant
- Use anyhow::Context instead of map_err(anyhow::anyhow!) in validator_whitelist
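
For reviewers following the TOCTOU fix above, the atomic check-and-insert pattern with DashMap's entry API looks roughly like this. It is an illustrative sketch, not the exact NonceStore in this repo.

```rust
use dashmap::DashMap;
use std::time::Instant;

struct NonceStore {
    seen: DashMap<String, Instant>,
}

impl NonceStore {
    /// Returns true if the nonce was fresh and has now been recorded,
    /// false if it was already used. The whole decision happens while the
    /// DashMap shard lock is held, closing the TOCTOU window that a separate
    /// contains_key + insert pair would leave open.
    fn check_and_insert(&self, nonce: &str) -> bool {
        use dashmap::mapref::entry::Entry;
        match self.seen.entry(nonce.to_string()) {
            Entry::Occupied(_) => false,
            Entry::Vacant(v) => {
                v.insert(Instant::now());
                true
            }
        }
    }
}
```
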
…nd container security

- consensus.rs: Fix TOCTOU race condition in record_vote by using
  DashMap entry API (remove_entry) to atomically check votes and remove
  entry while holding the shard lock, preventing concurrent threads from
  inserting votes between drop and remove
- config.rs: Replace assert! with proper Result<Self, String> return
  from Config::from_env() to avoid panicking in production on invalid
  CONSENSUS_THRESHOLD values
- main.rs: Update Config::from_env() call to handle Result with expect
- auth.rs: Add SS58 checksum verification using Blake2b-512 (correct
  Substrate algorithm) in ss58_to_public_key_bytes to reject addresses
  with corrupted checksums; previously only decoded base58 without
  validating the 2-byte checksum suffix
- Dockerfile: Add non-root executor user for container runtime security
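
For context on the SS58 checksum item above, this is roughly what Blake2b-512 checksum verification looks like for a 32-byte public key with a single-byte address type. It is a sketch under assumptions (the bs58 crate for base58 decoding, the blake2 crate added in this PR); the real ss58_to_public_key_bytes may differ in shape.

```rust
use blake2::{Blake2b512, Digest};

/// Decode an SS58 address and verify its 2-byte checksum suffix.
fn ss58_to_public_key_bytes(addr: &str) -> Result<[u8; 32], String> {
    let data = bs58::decode(addr)
        .into_vec()
        .map_err(|e| format!("invalid base58: {e}"))?;
    // 1 prefix byte + 32 key bytes + 2 checksum bytes
    if data.len() != 35 {
        return Err(format!("unexpected SS58 length: {}", data.len()));
    }
    let (body, checksum) = data.split_at(33);
    let mut hasher = Blake2b512::new();
    hasher.update(b"SS58PRE"); // Substrate's checksum domain prefix
    hasher.update(body);       // prefix byte + public key
    let digest = hasher.finalize();
    if &digest[..2] != checksum {
        return Err("SS58 checksum mismatch".to_string());
    }
    let mut key = [0u8; 32];
    key.copy_from_slice(&body[1..]);
    Ok(key)
}
```
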
Remove DEFAULT_MAX_OUTPUT_BYTES constant and max_output_bytes Config field
that were defined and populated from env but never read anywhere outside
config.rs. Both had #[allow(dead_code)] annotations suppressing warnings.
…ct magic numbers to constants

- main.rs: Replace .expect() on Config::from_env() with match + tracing::error! + process::exit(1)
- validator_whitelist.rs: Extract retry count (3) and backoff base (2) to named constants
- validator_whitelist.rs: Replace unwrap_or_else on Option with if-let pattern
- consensus.rs: Extract reaper interval (30s) to REAPER_INTERVAL_SECS constant

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
AGENTS.md (1)

181-196: ⚠️ Potential issue | 🟡 Minor

MAX_PENDING_CONSENSUS env var is missing from the documentation table.

config.rs reads MAX_PENDING_CONSENSUS (default 100) via env_parse, but it has no entry in the environment variables table here. Operators have no documented way to tune this limit.

📝 Proposed addition
 | `CONSENSUS_TTL_SECS`        | `60`    | TTL for pending consensus entries (seconds) |
+| `MAX_PENDING_CONSENSUS`     | `100`   | Maximum in-flight consensus entries before new submissions are rejected (503) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 181 - 196, The env var MAX_PENDING_CONSENSUS is
missing from the AGENTS.md environment table even though config.rs reads it via
env_parse with a default of 100; update the table to add a row for
`MAX_PENDING_CONSENSUS` with default `100` and a brief description like "Maximum
number of pending consensus entries" so operators can tune this limit (reference
the env var name MAX_PENDING_CONSENSUS and config.rs/env_parse to locate the
source).
🧹 Nitpick comments (6)
src/validator_whitelist.rs (2)

97-131: Tests cover basic operations but not error/retry paths.

The tests verify empty initialization, membership checks, and counting. refresh_once/try_refresh aren't unit-tested since they require a live Bittensor node, which is acceptable per the project's testing conventions. Consider adding a test that verifies insert_for_test works correctly, since src/auth.rs tests depend on it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 97 - 131, Add a unit test that
exercises ValidatorWhitelist::insert_for_test to ensure it inserts hotkeys as
expected (so other tests like those in src/auth.rs depending on it are valid);
create a new #[test] that constructs ValidatorWhitelist::new(), calls
insert_for_test with one or two hotkey strings, then asserts is_whitelisted
returns true for those keys and validator_count increases accordingly, similar
to existing tests for hotkeys and validator_count; keep it as a pure unit test
(no network calls) and reference ValidatorWhitelist::insert_for_test,
ValidatorWhitelist::is_whitelisted, and ValidatorWhitelist::validator_count when
locating where to add the test.

35-41: refresh_loop creates connections to an external service on every tick — consider connection reuse.

Each call to try_refresh creates a new BittensorClient::with_failover() connection. For a default refresh interval of 5 minutes this is fine, but if refresh_secs is set very low, it could create unnecessary connection churn. A minor optimization would be to hold a persistent client and reconnect only on failure.
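
Illustrating the suggestion, a reconnect-on-failure loop might look like the sketch below. The with_failover and sync_metagraph calls follow the usage already quoted in this PR, but the function name and body are hypothetical, and the upstream bittensor-rs API is unstable.

```rust
async fn refresh_loop_with_client_reuse(netuid: u16, refresh_secs: u64) {
    let mut client: Option<bittensor_rs::BittensorClient> = None;
    loop {
        if client.is_none() {
            // Only (re)connect when no working client is held.
            client = bittensor_rs::BittensorClient::with_failover().await.ok();
        }
        if let Some(c) = client.as_ref() {
            match bittensor_rs::sync_metagraph(c, netuid).await {
                Ok(_metagraph) => {
                    // ... filter neurons and swap the whitelist set here, as try_refresh does ...
                }
                Err(err) => {
                    tracing::warn!(error = %err, "metagraph sync failed; reconnecting next tick");
                    client = None; // drop the connection and rebuild it on the next tick
                }
            }
        }
        tokio::time::sleep(std::time::Duration::from_secs(refresh_secs)).await;
    }
}
```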

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 35 - 41, refresh_loop currently
creates a new BittensorClient on every refresh via try_refresh /
BittensorClient::with_failover; change it to create and hold a persistent client
(e.g., store a BittensorClient in ValidatorWhitelist or a local mutable variable
in refresh_loop) and reuse that client for calls to refresh_once/try_refresh,
only recreating the connection when an operation fails; update refresh_once or
try_refresh to accept a &mut BittensorClient (or use the stored client) and
implement reconnect-on-error logic so the loop reinitializes
BittensorClient::with_failover only on failure.
Cargo.toml (2)

72-73: Same concern for the w3f-bls patch — pin to a specific commit.

The fix-no-std branch reference has the same reproducibility and supply-chain risk.

-w3f-bls = { git = "https://github.com/opentensor/bls", branch = "fix-no-std" }
+w3f-bls = { git = "https://github.com/opentensor/bls", rev = "<specific-commit-hash>" }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Cargo.toml` around lines 72 - 73, Replace the branch-based git patch for
w3f-bls with a commit-pinned git reference to ensure reproducible builds: in the
Cargo.toml patch entry that currently uses w3f-bls = { git =
"https://github.com/opentensor/bls", branch = "fix-no-std" }, switch to
providing the exact commit hash via the rev (or tag) field (e.g., w3f-bls = {
git = "https://github.com/opentensor/bls", rev = "<commit-sha>" }) so the
w3f-bls dependency is fixed to a single commit; update any lock/regenerate build
files accordingly.

65-67: Pin the git dependency to a specific commit hash to ensure reproducible builds.

Using branch = "main" means any push to that branch can silently change what gets compiled. This breaks build reproducibility. Pin to a specific rev instead.

Suggested change
-bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", branch = "main" }
+bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", rev = "<specific-commit-hash>" }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Cargo.toml` around lines 65 - 67, The bittensor-rs git dependency in
Cargo.toml currently uses branch = "main", which is non-reproducible; update the
dependency entry for bittensor-rs to pin it to a specific commit by replacing
branch = "main" with rev = "<commit-hash>" (use the full 40-char commit SHA from
the repository) so builds are deterministic; ensure the dependency line remains
a git dependency and commit the updated Cargo.toml.
src/consensus.rs (2)

10-15: Each pending entry holds a full copy of the archive bytes — be mindful of memory.

With max_pending defaulting to 100 and max_archive_bytes at 500 MB, the theoretical worst case is significant. The capacity cap is a good safeguard, but consider whether a lower default or a total memory budget (not just entry count) would be more appropriate for your deployment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 10 - 15, PendingConsensus currently stores a
full Vec<u8> in archive_data which can lead to large memory usage when
max_pending (default 100) and max_archive_bytes (500 MB) combine; change storage
to a shared/cheaper representation (e.g., Arc<Vec<u8>> or a compressed blob) to
avoid duplicating bytes per entry and add global accounting logic that tracks
total_archive_bytes_in_pending and enforces a budget (reject or evict entries
when exceeded) instead of relying solely on max_pending; update where
PendingConsensus is created/inserted and where archive_data is accessed to use
the chosen shared type and implement the accounting checks against
max_archive_bytes (or a new total_budget) so memory pressure is bounded.

49-119: record_vote accepts archive_data: Vec<u8> on every call, but only uses it for the first voter.

In the Entry::Occupied branch, the archive_data parameter is moved in but immediately dropped since only the stored copy is used. Every subsequent voter uploads the full archive and it gets discarded after hashing.

Consider accepting archive_data as a lazy source (e.g., pass it only when needed) or accept a reference/Cow for the occupied path. This is a minor memory/bandwidth concern — the archive is already in memory from the multipart upload — but worth noting for clarity.
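
A sketch of the suggested signature change, heavily simplified: duplicate-vote and threshold handling are omitted, and field and parameter names are illustrative.

```rust
use dashmap::DashMap;
use dashmap::mapref::entry::Entry;

struct PendingConsensus {
    voters: std::collections::HashSet<String>,
    archive_data: Vec<u8>,
}

/// The archive bytes are only stored for the first voter; later voters pass None.
fn record_vote(
    pending: &DashMap<String, PendingConsensus>,
    archive_hash: &str,
    hotkey: &str,
    archive_data: Option<Vec<u8>>,
) -> usize {
    match pending.entry(archive_hash.to_string()) {
        Entry::Occupied(mut e) => {
            e.get_mut().voters.insert(hotkey.to_string());
            e.get().voters.len()
        }
        Entry::Vacant(v) => {
            let mut voters = std::collections::HashSet::new();
            voters.insert(hotkey.to_string());
            v.insert(PendingConsensus {
                voters,
                archive_data: archive_data.unwrap_or_default(),
            });
            1
        }
    }
}
```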

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 49 - 119, record_vote currently takes
archive_data: Vec<u8> for every call but only uses it when creating a new
PendingConsensus (Entry::Vacant); in Entry::Occupied the Vec is moved and
discarded. Change the API and implementation so archive_data is optional or
borrowed: e.g., change record_vote signature to accept archive_data:
Option<Vec<u8>> (or a Cow<[u8]>/&[u8]) and only .insert the Vec into
PendingConsensus in the Entry::Vacant branch, leaving Entry::Occupied to ignore
the archive_data param; also update PendingConsensus's archive_data field and
all callers of record_vote to pass Some(vec) on the initial upload and None (or
a borrow) for subsequent votes.

Comment on lines +37 to 47
```
ValidatorWhitelist refresh loop (every 5 minutes):
1. Connect to Bittensor subtensor via BittensorClient::with_failover()
2. Sync metagraph for netuid 100
3. Filter validators: validator_permit && active && stake >= 10,000 TAO
4. Atomically replace whitelist with new set of SS58 hotkeys
5. On failure: retry up to 3 times with exponential backoff, keep cached whitelist

ConsensusManager reaper loop (every 30 seconds):
1. Remove pending consensus entries older than TTL (default 60s)
```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block to fix the MD040 lint.

The markdownlint CI rule MD040 requires all fenced code blocks to declare a language. This block (background task pseudo-code) can use text or plaintext.

📝 Proposed fix
-```
+```text
 ValidatorWhitelist refresh loop (every 5 minutes):
 ...
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 37-37: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 37 - 47, The fenced code block containing the
ValidatorWhitelist refresh loop and ConsensusManager reaper loop in AGENTS.md is
missing a language specifier; update the opening triple-backtick to include a
language (e.g., use "text" or "plaintext") so the block starts with ```text and
the rest of the pseudo-code (the "ValidatorWhitelist refresh loop (every 5
minutes):" and "ConsensusManager reaper loop (every 30 seconds):" sections)
remains unchanged to satisfy MD040.

Comment on lines +41 to +46
```rust
if consensus_threshold <= 0.0 || consensus_threshold > 1.0 {
    return Err(format!(
        "CONSENSUS_THRESHOLD must be in range (0.0, 1.0], got {}",
        consensus_threshold
    ));
}
```

⚠️ Potential issue | 🟡 Minor

NaN bypasses the consensus threshold range check.

Rust's f64 parser accepts "nan" (case-insensitive) as a valid value, so env_parse("CONSENSUS_THRESHOLD", …) will return f64::NAN when the variable is set to "nan". Since all NaN comparisons return false, the condition NAN <= 0.0 || NAN > 1.0 evaluates to false, allowing NaN to slip through into consensus_threshold. At runtime, any threshold comparison (e.g. votes_fraction >= consensus_threshold) against NaN will always return false, effectively making consensus unreachable.

Use the negated form, which rejects NaN correctly:

🐛 Proposed fix
-        if consensus_threshold <= 0.0 || consensus_threshold > 1.0 {
+        if !(consensus_threshold > 0.0 && consensus_threshold <= 1.0) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-        if consensus_threshold <= 0.0 || consensus_threshold > 1.0 {
+        if !(consensus_threshold > 0.0 && consensus_threshold <= 1.0) {
             return Err(format!(
                 "CONSENSUS_THRESHOLD must be in range (0.0, 1.0], got {}",
                 consensus_threshold
             ));
         }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.rs` around lines 41 - 46, Replace the current range check on
consensus_threshold with a NaN-aware check so "nan" can't bypass validation: in
the conditional that currently references consensus_threshold (the if block that
returns Err with "CONSENSUS_THRESHOLD must be in range..."), add an explicit
consensus_threshold.is_nan() test OR use the negated valid-form (i.e. reject
unless consensus_threshold > 0.0 && consensus_threshold <= 1.0) so NaN is
rejected; update the Err path to remain the same.

Comment on lines +135 to +153
```rust
fn test_config_rejects_zero_threshold() {
    std::env::set_var("CONSENSUS_THRESHOLD", "0.0");
    let result = Config::from_env();
    std::env::remove_var("CONSENSUS_THRESHOLD");
    assert!(result.is_err());
    assert!(result
        .unwrap_err()
        .contains("CONSENSUS_THRESHOLD must be in range"));
}

#[test]
fn test_config_rejects_threshold_above_one() {
    std::env::set_var("CONSENSUS_THRESHOLD", "1.5");
    let result = Config::from_env();
    std::env::remove_var("CONSENSUS_THRESHOLD");
    assert!(result.is_err());
    assert!(result
        .unwrap_err()
        .contains("CONSENSUS_THRESHOLD must be in range"));
```

⚠️ Potential issue | 🟠 Major

Environment variable mutation in parallel tests creates a race condition.

The CI runs tests with --test-threads=$(nproc), so test_config_defaults, test_config_rejects_zero_threshold, and test_config_rejects_threshold_above_one can execute concurrently. If test_config_defaults runs while CONSENSUS_THRESHOLD is set to "0.0" or "1.5" by a sibling test, its from_env().expect(…) call panics and the test fails spuriously. The two threshold tests can also race on remove_var.

Making std::env::{set_var, remove_var} unsafe to call is already tracked for Rust 2024, so this pattern will also require unsafe blocks when the project migrates editions.

Preferred fix — use temp-env (or serial_test) to scope env overrides:

🔒 Proposed fix
+    #[serial_test::serial]
     #[test]
     fn test_config_defaults() {
         let cfg = Config::from_env().expect("default config should be valid");
         ...
     }

+    #[serial_test::serial]
     #[test]
     fn test_config_rejects_zero_threshold() {
-        std::env::set_var("CONSENSUS_THRESHOLD", "0.0");
-        let result = Config::from_env();
-        std::env::remove_var("CONSENSUS_THRESHOLD");
+        temp_env::with_var("CONSENSUS_THRESHOLD", Some("0.0"), || {
+            let result = Config::from_env();
             assert!(result.is_err());
             assert!(result
                 .unwrap_err()
                 .contains("CONSENSUS_THRESHOLD must be in range"));
-    }
+        });
     }

+    #[serial_test::serial]
     #[test]
     fn test_config_rejects_threshold_above_one() {
-        std::env::set_var("CONSENSUS_THRESHOLD", "1.5");
-        let result = Config::from_env();
-        std::env::remove_var("CONSENSUS_THRESHOLD");
+        temp_env::with_var("CONSENSUS_THRESHOLD", Some("1.5"), || {
+            let result = Config::from_env();
             assert!(result.is_err());
             assert!(result
                 .unwrap_err()
                 .contains("CONSENSUS_THRESHOLD must be in range"));
-    }
+        });
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.rs` around lines 135 - 153, The tests mutate CONSENSUS_THRESHOLD
concurrently causing race conditions; update the failing tests
(test_config_rejects_zero_threshold and test_config_rejects_threshold_above_one)
to scope their env changes instead of using global std::env::set_var/remove_var
— either wrap the env override around the call to Config::from_env() using a
scoped helper like the temp-env crate (e.g. temp_env::with_var) or mark the
tests serial using serial_test so they don't run in parallel; ensure
Config::from_env() is invoked inside that scoped block and that the env is
automatically restored/removed after the test to avoid races with
test_config_defaults.

Comment on lines 191 to 207
```rust
let total_validators = state.validator_whitelist.validator_count();
let required_f = (total_validators as f64 * state.config.consensus_threshold).ceil();
let required = (required_f.min(usize::MAX as f64) as usize).max(1);

let total_tasks = extracted.tasks.len();
let concurrent = query
    .concurrent_tasks
    .unwrap_or(state.config.max_concurrent_tasks)
    .min(state.config.max_concurrent_tasks);

let batch = state.sessions.create_batch(total_tasks);
let batch_id = batch.id.clone();

state.executor.spawn_batch(batch, extracted, concurrent);

Ok((
    StatusCode::ACCEPTED,
    Json(serde_json::json!({
        "batch_id": batch_id,
        "total_tasks": total_tasks,
        "concurrent_tasks": concurrent,
        "ws_url": format!("/ws?batch_id={}", batch_id),
    })),
))
let status = state.consensus_manager.record_vote(
    &archive_hash,
    &auth_headers.hotkey,
    archive_bytes,
    Some(concurrent),
    required,
    total_validators,
);
```

⚠️ Potential issue | 🟡 Minor

required votes threshold can drift between voters due to whitelist refreshes.

total_validators and required are computed at each request based on the current whitelist size. If the whitelist refreshes between votes (e.g., grows from 10→20 validators), later voters pass a higher required to record_vote, which uses the caller's required — not the value from when the entry was created. This can make it harder (or easier) to reach consensus mid-flight.

Consider storing required in PendingConsensus when the entry is first created, and using that stored value for all subsequent threshold checks on the same entry.
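
Concretely, the suggestion amounts to something like the sketch below: capture required once when the entry is created and ignore the per-request value afterwards (illustrative names only).

```rust
struct PendingConsensus {
    voters: std::collections::HashSet<String>,
    required: usize, // captured from the whitelist size at creation time
}

fn vote(entry: &mut PendingConsensus, hotkey: &str, required_now: usize) -> bool {
    // Ignore the caller-supplied value for existing entries; the threshold that
    // was in force when the first vote arrived stays authoritative.
    let _ = required_now;
    entry.voters.insert(hotkey.to_string());
    entry.voters.len() >= entry.required
}
```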

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers.rs` around lines 191 - 207, The code computes total_validators
and required per-request (using validator_whitelist.validator_count()) and then
passes required into consensus_manager.record_vote, which allows the threshold
to drift if the whitelist changes mid-flight; fix by computing and persisting
the initial required threshold into the PendingConsensus when the consensus
entry is first created (store the value computed from the whitelist and
config.consensus_threshold), and update consensus_manager.record_vote (and any
callers) to prefer the PendingConsensus.stored_required for threshold checks if
present rather than the caller-supplied required; ensure the path that creates a
new PendingConsensus sets stored_required (based on total_validators and
config.consensus_threshold) and that record_vote uses pending.stored_required to
determine success for all subsequent votes on that archive_hash.

Comment on lines 239 to 257
```rust
ConsensusStatus::Reached {
    archive_data,
    concurrent_tasks,
    votes,
    required,
} => {
    let effective_concurrent = concurrent_tasks
        .unwrap_or(state.config.max_concurrent_tasks)
        .min(state.config.max_concurrent_tasks);

    if state.sessions.has_active_batch() {
        return Err((
            StatusCode::SERVICE_UNAVAILABLE,
            Json(serde_json::json!({
                "error": "busy",
                "message": "A batch is already running. Wait for it to complete."
            })),
        ));
    }
```

⚠️ Potential issue | 🟠 Major

Consensus is irreversibly consumed before the busy check — lost if a batch is already active.

When ConsensusStatus::Reached is returned, the entry has already been removed from the DashMap (see src/consensus.rs Line 74: entry.remove_entry()). If has_active_batch() returns true on Line 249, the consensus result is discarded and the 503 "busy" error is returned. All validators would need to re-submit to rebuild consensus.

Consider either:

  1. Checking has_active_batch() before calling record_vote(), or
  2. Re-inserting the consensus entry on busy failure so it isn't lost.

Option 1 is simpler but has a TOCTOU window. Option 2 preserves the consensus result.

Proposed fix (option 1 — check busy before consensus vote)

Move the busy check before the record_vote call (around Line 199):

+    if state.sessions.has_active_batch() {
+        return Err((
+            StatusCode::SERVICE_UNAVAILABLE,
+            Json(serde_json::json!({
+                "error": "busy",
+                "message": "A batch is already running. Wait for it to complete."
+            })),
+        ));
+    }
+
     let status = state.consensus_manager.record_vote(

And remove Lines 249-257 from the Reached branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers.rs` around lines 239 - 257, The code currently consumes the
consensus entry (see consensus.rs: entry.remove_entry()) when handling
ConsensusStatus::Reached, then checks state.sessions.has_active_batch() and
returns 503, losing the consensus; fix by re-inserting the consumed consensus
result if you detect a busy session: after record_vote produces
ConsensusStatus::Reached and you enter the Reached branch, if
state.sessions.has_active_batch() is true, reconstruct and re-insert the
consensus entry back into the DashMap (use the same key/value shape that
record_vote/entry.remove_entry() returned: archive_data, concurrent_tasks,
votes, required) before returning the SERVICE_UNAVAILABLE response; reference
symbols: ConsensusStatus::Reached, record_vote, has_active_batch, and the
DashMap entry removal logic (entry.remove_entry()) to locate where to restore
the entry.

Comment on lines +73 to +94
```rust
async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
    use bittensor_rs::ss58::encode_ss58;

    let client = bittensor_rs::BittensorClient::with_failover()
        .await
        .context("Failed to connect to subtensor")?;

    let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
        .await
        .context("Failed to sync metagraph")?;

    let mut new_hotkeys = HashSet::new();
    for neuron in metagraph.neurons.values() {
        if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
            new_hotkeys.insert(encode_ss58(&neuron.hotkey));
        }
    }

    let count = new_hotkeys.len();
    *self.hotkeys.write() = new_hotkeys;
    Ok(count)
}
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check the bittensor-rs dependency's actual API surface by examining Cargo.lock
# and the cached source for the available types/functions
echo "=== Checking Cargo.lock for bittensor-rs commit ==="
rg 'bittensor-rs' Cargo.lock -A5 || echo "Not found in Cargo.lock"

echo ""
echo "=== Checking for bittensor_rs usage patterns in the codebase ==="
rg 'bittensor_rs::' --type=rust -n
```

Repository: PlatformNetwork/term-executor

Length of output: 767


🏁 Script executed:

```bash
#!/bin/bash
# Query the bittensor-rs repository to verify APIs exist
echo "=== Checking bittensor-rs source for BittensorClient::with_failover ==="
# The git dependency points to cortexlm/bittensor-rs main branch
# We need to check if the APIs are documented or if there's a binding

# First, let's check if bittensor-rs is fetched in the target directory
if [ -d "target/debug/build" ]; then
  find target -name "*bittensor*" -type d 2>/dev/null | head -5
fi

echo ""
echo "=== Checking for API usage in src/validator_whitelist.rs ==="
cat -n src/validator_whitelist.rs | sed -n '73,94p'

echo ""
echo "=== Checking if neuron struct attributes are accessed ==="
rg 'neuron\.(validator_permit|active|stake)' --type=rust -B2 -A2

echo ""
echo "=== Verify if there's a test or example showing bittensor_rs usage ==="
rg 'BittensorClient|sync_metagraph' --type=rust -B3 -A3 | head -50
```

Repository: PlatformNetwork/term-executor

Length of output: 2442


🌐 Web query:

bittensor-rs cortexlm BittensorClient sync_metagraph API stability

💡 Result:

What sync_metagraph is in CortexLM bittensor-rs

In CortexLM/bittensor-rs, sync_metagraph is a free async function (not a BittensorClient method) exported from crate::metagraph / the crate root. It has the shape:

  • pub async fn sync_metagraph(client: &BittensorClient, netuid: u16) -> Result<Metagraph> (via anyhow::Result) [2]
  • It builds a Metagraph::new(netuid), sets metagraph.block = client.block_number().await?, then fetches all neurons via queries::neurons::neurons(client, netuid, None) (runtime API path), populating metagraph.neurons and metagraph.axons, and sets metagraph.n to the neuron count. [2]

BittensorClient itself is a thin wrapper around a subxt::OnlineClient<PolkadotConfig>, with constructors like new(...) and with_default() (default RPC wss://entrypoint-finney.opentensor.ai:443, overridable via BITTENSOR_RPC). [3]

API stability assessment

This repo explicitly warns it is “under active development and may contain bugs”, and it’s currently presented as version 0.1.0 with installation recommended via git dependency (and the repo shows no published releases). That combination generally implies API instability / potential breaking changes unless you pin a specific commit in your Cargo.toml. [1]

Practical takeaways

  • Treat sync_metagraph as a best-effort convenience helper around the current runtime-API neuron query path (so it may change when upstream runtime APIs / metadata change). [2]
  • For production, pin a git revision (or a tag, if/when they publish them) and add an internal abstraction so you can swap out metagraph sync behavior without refactoring your whole codebase. [1]

Sources: CortexLM/bittensor-rs README (stability warning + install guidance) [1]; src/metagraph/sync.rs (implementation) [2]; src/chain/mod.rs (client + default RPC) [3].


Stabilize bittensor-rs dependency — currently on git branch=main with no version guarantee.

This function uses multiple APIs from the git-sourced bittensor-rs crate (version 0.1.0): BittensorClient::with_failover(), sync_metagraph(), neuron.validator_permit, neuron.active, neuron.stake.as_tao(), and ss58::encode_ss58(). The upstream repository is marked "under active development" with no published releases, meaning APIs can break without notice if the main branch changes.

Recommendation: Pin a specific commit in Cargo.toml (currently on eb58916af5a4d7fef74ef00ea0d61519880b101f) and add an internal abstraction layer around metagraph syncing so future API changes don't require refactoring core validator logic.
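
A sketch of the suggested abstraction layer, assuming the async-trait crate; ValidatorInfo, MetagraphProvider, and BittensorProvider are hypothetical names, while the bittensor-rs calls inside the adapter mirror the try_refresh code quoted above.

```rust
use async_trait::async_trait;

pub struct ValidatorInfo {
    pub hotkey_ss58: String,
    pub stake_tao: f64,
    pub is_active_validator: bool,
}

#[async_trait]
pub trait MetagraphProvider: Send + Sync {
    async fn fetch_validators(&self, netuid: u16) -> anyhow::Result<Vec<ValidatorInfo>>;
}

/// Adapter: the only place that touches the unstable bittensor-rs crate.
pub struct BittensorProvider;

#[async_trait]
impl MetagraphProvider for BittensorProvider {
    async fn fetch_validators(&self, netuid: u16) -> anyhow::Result<Vec<ValidatorInfo>> {
        use anyhow::Context;
        let client = bittensor_rs::BittensorClient::with_failover()
            .await
            .context("Failed to connect to subtensor")?;
        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
            .await
            .context("Failed to sync metagraph")?;
        Ok(metagraph
            .neurons
            .values()
            .map(|n| ValidatorInfo {
                hotkey_ss58: bittensor_rs::ss58::encode_ss58(&n.hotkey),
                stake_tao: n.stake.as_tao(),
                is_active_validator: n.validator_permit && n.active,
            })
            .collect())
    }
}
```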

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 73 - 94, The dependency on the
git-sourced bittensor-rs is unstable and try_refresh() directly calls several
upstream APIs (BittensorClient::with_failover, sync_metagraph,
neuron.validator_permit/active/stake.as_tao, ss58::encode_ss58) which may break;
update Cargo.toml to pin the bittensor-rs crate to the specific commit
referenced (eb58916af5a4d7fef74ef00ea0d61519880b101f) to stabilize builds, and
introduce a small internal abstraction (e.g., a MetagraphProvider trait and an
implementation that wraps BittensorClient::with_failover and sync_metagraph) so
try_refresh() depends on your internal trait instead of the upstream
types/fields directly, moving use of neuron.validator_permit, neuron.active,
neuron.stake.as_tao(), and encode_ss58 into the adapter implementation to
insulate validator_whitelist.rs from future API changes.

- consensus.rs: Remove archive_data storage from PendingConsensus to
  prevent memory exhaustion (up to 50GB with 100 pending × 500MB each).
  Callers now use their own archive bytes since all votes for the same
  hash have identical data.

- handlers.rs: Stream multipart upload with per-chunk size enforcement
  instead of buffering entire archive before checking size limit.
  Sanitize error messages to not leak internal details (file paths,
  extraction errors) to clients; log details server-side instead.

- auth.rs: Add nonce format validation requiring non-empty printable
  ASCII characters (defense-in-depth against log injection and empty
  nonce edge cases).

- main.rs: Replace .unwrap() on TcpListener::bind and axum::serve with
  proper error logging and process::exit per AGENTS.md rules.

- ws.rs: Replace .unwrap() on serde_json::to_string with
  unwrap_or_default() to comply with AGENTS.md no-unwrap rule.
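
For the streaming multipart change described above, per-chunk enforcement with axum's multipart extractor looks roughly like this. It is a simplified sketch with a string error type; the real handler maps errors to sanitized JSON responses.

```rust
use axum::extract::multipart::Field;

/// Read one uploaded field while enforcing a size limit on every chunk.
async fn read_field_bounded(mut field: Field<'_>, max_bytes: usize) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    while let Some(chunk) = field
        .chunk()
        .await
        .map_err(|_| "upload read error".to_string())?
    {
        // Reject as soon as the running total exceeds the limit instead of
        // buffering the whole archive first.
        if buf.len() + chunk.len() > max_bytes {
            return Err("archive exceeds maximum allowed size".to_string());
        }
        buf.extend_from_slice(&chunk);
    }
    Ok(buf)
}
```
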
…duction code

- main.rs:21: Replace .parse().unwrap() on tracing directive with
  unwrap_or_else fallback to INFO level directive
- main.rs:36: Replace .expect() on workspace dir creation with
  error log + process::exit(1) pattern
- main.rs:110: Replace .expect() on ctrl_c handler with if-let-Err
  that logs and returns gracefully
- executor.rs:189: Replace semaphore.acquire().unwrap() with match
  that handles closed semaphore by creating a failed TaskResult

All changes follow AGENTS.md rule: no .unwrap()/.expect() in
production code paths. Test code is unchanged.
@echobt echobt merged commit a573ad0 into main Feb 18, 2026
1 of 2 checks passed
@echobt echobt deleted the feat/bittensor-validator-consensus branch February 18, 2026 17:19
github-actions bot pushed a commit that referenced this pull request Feb 18, 2026
# [2.0.0](v1.2.0...v2.0.0) (2026-02-18)

### Features

* **auth:** replace static hotkey/API-key auth with Bittensor validator whitelisting and 50% consensus ([#5](#5)) ([a573ad0](a573ad0))

### BREAKING CHANGES

* **auth:** WORKER_API_KEY env var and X-Api-Key header no longer required.
All validators on Bittensor netuid 100 with sufficient stake are auto-whitelisted.

* ci: trigger CI run

* fix(security): address auth bypass, input validation, and config issues

- Move nonce consumption AFTER signature verification in verify_request()
  to prevent attackers from burning legitimate nonces via invalid signatures
- Fix TOCTOU race in NonceStore::check_and_insert() using atomic DashMap
  entry API instead of separate contains_key + insert
- Add input length limits for auth headers (hotkey 128B, nonce 256B,
  signature 256B) to prevent memory exhaustion via oversized values
- Add consensus_threshold validation in Config::from_env() — must be
  in range (0.0, 1.0], panics at startup if invalid
- Add saturating conversion for consensus required calculation to prevent
  integer overflow on f64→usize cast
- Add tests for all security fixes

* fix(dead-code): remove orphaned default_concurrent fn and unnecessary allow(dead_code)

* fix: code quality issues in bittensor validator consensus

- Extract magic number 100 to configurable MAX_PENDING_CONSENSUS
- Restore #[allow(dead_code)] on DEFAULT_MAX_OUTPUT_BYTES constant
- Use anyhow::Context instead of map_err(anyhow::anyhow!) in validator_whitelist

* fix(security): address race condition, config panic, SS58 checksum, and container security

- consensus.rs: Fix TOCTOU race condition in record_vote by using
  DashMap entry API (remove_entry) to atomically check votes and remove
  entry while holding the shard lock, preventing concurrent threads from
  inserting votes between drop and remove
- config.rs: Replace assert! with proper Result<Self, String> return
  from Config::from_env() to avoid panicking in production on invalid
  CONSENSUS_THRESHOLD values
- main.rs: Update Config::from_env() call to handle Result with expect
- auth.rs: Add SS58 checksum verification using Blake2b-512 (correct
  Substrate algorithm) in ss58_to_public_key_bytes to reject addresses
  with corrupted checksums; previously only decoded base58 without
  validating the 2-byte checksum suffix
- Dockerfile: Add non-root executor user for container runtime security

* fix(dead-code): remove unused max_output_bytes config field and constant

Remove DEFAULT_MAX_OUTPUT_BYTES constant and max_output_bytes Config field
that were defined and populated from env but never read anywhere outside
config.rs. Both had #[allow(dead_code)] annotations suppressing warnings.

* fix(quality): replace expect/unwrap with proper error handling, extract magic numbers to constants

- main.rs: Replace .expect() on Config::from_env() with match + tracing::error! + process::exit(1)
- validator_whitelist.rs: Extract retry count (3) and backoff base (2) to named constants
- validator_whitelist.rs: Replace unwrap_or_else on Option with if-let pattern
- consensus.rs: Extract reaper interval (30s) to REAPER_INTERVAL_SECS constant

* fix(security): address multiple security vulnerabilities in PR files

- consensus.rs: Remove archive_data storage from PendingConsensus to
  prevent memory exhaustion (up to 50GB with 100 pending × 500MB each).
  Callers now use their own archive bytes since all votes for the same
  hash have identical data.

- handlers.rs: Stream multipart upload with per-chunk size enforcement
  instead of buffering entire archive before checking size limit.
  Sanitize error messages to not leak internal details (file paths,
  extraction errors) to clients; log details server-side instead.

- auth.rs: Add nonce format validation requiring non-empty printable
  ASCII characters (defense-in-depth against log injection and empty
  nonce edge cases).

- main.rs: Replace .unwrap() on TcpListener::bind and axum::serve with
  proper error logging and process::exit per AGENTS.md rules.

- ws.rs: Replace .unwrap() on serde_json::to_string with
  unwrap_or_default() to comply with AGENTS.md no-unwrap rule.

* fix(dead-code): rename misleading underscore-prefixed variable in consensus

* fix(quality): replace unwrap/expect with proper error handling in production code

- main.rs:21: Replace .parse().unwrap() on tracing directive with
  unwrap_or_else fallback to INFO level directive
- main.rs:36: Replace .expect() on workspace dir creation with
  error log + process::exit(1) pattern
- main.rs:110: Replace .expect() on ctrl_c handler with if-let-Err
  that logs and returns gracefully
- executor.rs:189: Replace semaphore.acquire().unwrap() with match
  that handles closed semaphore by creating a failed TaskResult

All changes follow AGENTS.md rule: no .unwrap()/.expect() in
production code paths. Test code is unchanged.

* docs: refresh AGENTS.md
@github-actions

🎉 This PR is included in version 2.0.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
