
fix: 5 critical bugs — TEE randomization, HTTP 402 hint, int dtype, inference None crash, gas estimation#201

Open
amathxbt wants to merge 4 commits into OpenGradient:main from amathxbt:fix/critical-5-bugs-tee-402-dtype-inference-gas

Conversation

@amathxbt
Contributor

Summary

This PR fixes 5 critical bugs identified from open issues and code audit, spanning TEE selection, LLM error handling, type conversions, inference safety, and gas estimation.


Bug 1 — TEE always picks the same node (closes #200)

File: src/opengradient/client/tee_registry.py

Root cause: get_llm_tee() always returned tees[0], routing 100% of traffic to a single TEE with zero load distribution or failover.

Fix: Replace tees[0] with random.choice(tees) so each LLM() construction independently selects from the full pool of active, registry-verified TEEs.

```python
# Before
return tees[0]

# After
selected = random.choice(tees)
logger.debug("Selected TEE %s from %d active LLM proxy TEE(s)", selected.tee_id, len(tees))
return selected
```
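The selection logic can be sketched end to end; TEEInfo below is a hypothetical stand-in for whatever record type tee_registry.py actually stores:

```python
# Sketch of pool-wide random TEE selection (TEEInfo is a hypothetical stand-in
# for the registry's actual record type).
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class TEEInfo:
    tee_id: str

def get_llm_tee(tees):
    if not tees:
        raise RuntimeError("No active LLM proxy TEEs in the registry.")
    # Uniform choice over the active pool instead of always tees[0].
    return random.choice(tees)

pool = [TEEInfo(f"tee-{i}") for i in range(4)]
# 200 independent selections hit every TEE with overwhelming probability.
chosen = {get_llm_tee(pool).tee_id for _ in range(200)}
```

With a uniform draw per LLM() construction, retries and re-initializations naturally spread across the healthy pool without any coordination between clients.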

Bug 2 — HTTP 402 swallowed as cryptic RuntimeError (closes #188)

File: src/opengradient/client/llm.py

Root cause: All HTTP errors (including 402 Payment Required) were caught by except Exception and re-raised as a generic RuntimeError("TEE LLM chat failed: ..."). Users had no idea they needed to call ensure_opg_approval() first.

Fix: Add import httpx and intercept httpx.HTTPStatusError before the generic handler. When status_code == 402, raise a RuntimeError with an explicit, actionable message pointing to ensure_opg_approval(). Applied to _chat_request(), completion(), and _parse_sse_response() (streaming path).

```python
# New constant
_402_HINT = (
    "Payment required (HTTP 402): your wallet may have insufficient OPG token allowance. "
    "Call llm.ensure_opg_approval(opg_amount=<amount>) to approve Permit2 spending "
    "before making requests. Minimum amount is 0.05 OPG."
)

# In _chat_request / completion / _parse_sse_response:
except httpx.HTTPStatusError as e:
    if e.response.status_code == 402:
        raise RuntimeError(_402_HINT) from e
    raise RuntimeError(f"TEE LLM chat failed: {e}") from e
```
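The 402 branch can be exercised without a live TEE or even httpx installed; FakeHTTPStatusError and wrap_tee_error below are illustrative stand-ins that mimic the shape of httpx.HTTPStatusError (a `.response` with `.status_code`) and the handler described above:

```python
# Offline sketch of the 402 mapping; FakeHTTPStatusError and wrap_tee_error
# are stand-ins, not the SDK's actual classes.
from types import SimpleNamespace

_402_HINT = (
    "Payment required (HTTP 402): call llm.ensure_opg_approval(opg_amount=...) "
    "before making requests."
)

class FakeHTTPStatusError(Exception):
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.response = SimpleNamespace(status_code=status_code)

def wrap_tee_error(exc: FakeHTTPStatusError) -> None:
    # 402 gets the actionable hint; everything else surfaces as before.
    if exc.response.status_code == 402:
        raise RuntimeError(_402_HINT) from exc
    raise RuntimeError(f"TEE LLM chat failed: {exc}") from exc

try:
    wrap_tee_error(FakeHTTPStatusError(402))
except RuntimeError as e:
    msg_402 = str(e)

try:
    wrap_tee_error(FakeHTTPStatusError(500))
except RuntimeError as e:
    msg_500 = str(e)
```

Checking the status code before the generic handler is what preserves the old behavior for every non-402 error while making the payment case self-explanatory.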

Bug 3 — Fixed-point int returns np.float32 instead of int (partially closes #103)

File: src/opengradient/client/_conversions.py

Root cause: convert_to_float32(value, decimals) always returned np.float32, even when decimals == 0 (i.e., the original value was an integer). Users expecting integer outputs received np.float32 and had to cast manually.

Fix: Add convert_fixed_point_to_python(value, decimals) that returns int when decimals == 0 and np.float32 otherwise. np.array() then infers the correct dtype (int64 vs float32) automatically. The old convert_to_float32 is kept as a deprecated alias for backward compatibility.

```python
def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
    if decimals == 0:
        return int(value)      # ← integer tensor stays integer
    return np.float32(Decimal(value) / (10 ** Decimal(decimals)))
```
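A self-contained sketch (assuming numpy is installed) shows the decimals == 0 fast path and the dtype np.array() infers from the converted elements; the function body mirrors the snippet above:

```python
# Sketch of the decimals == 0 fast path and np.array() dtype inference.
from decimal import Decimal
from typing import Union

import numpy as np

def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
    if decimals == 0:
        return int(value)  # integer tensor stays integer
    return np.float32(Decimal(value) / (10 ** Decimal(decimals)))

# decimals == 0 -> Python ints -> integer ndarray
ints = np.array([convert_fixed_point_to_python(v, 0) for v in (1, 2, 3)])

# decimals == 2 -> np.float32 scalars -> float32 ndarray
floats = np.array([convert_fixed_point_to_python(v, 2) for v in (150, 275)])
```

Because the element type drives np.array()'s dtype choice, no call site needs to know the decimals value to get the right array dtype.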

Bug 4 — infer() crashes with AttributeError when inference node returns None

File: src/opengradient/client/alpha.py

Root cause: _get_inference_result_from_node() returns None when the node has no result yet. This None was passed directly to convert_to_model_output(), which calls event_data.get("output", {}), crashing with AttributeError: 'NoneType' object has no attribute 'get'. Additionally, if ModelInferenceEvent logs were empty, parsed_logs[0] caused an IndexError.

Fix:

```python
# Guard 1: empty precompile logs
if not precompile_logs:
    raise RuntimeError("ModelInferenceEvent not found in transaction logs.")

# Guard 2: None from inference node
if inference_result is None:
    raise RuntimeError(
        f"Inference node returned no result for inference ID {inference_id!r}. "
        "The result may not be available yet — retry after a short delay."
    )
```
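Since the new error message tells callers to retry after a delay, a minimal caller-side retry loop might look like the following; wait_for_result and fetch_result are hypothetical names, not SDK API:

```python
# Hypothetical caller-side retry loop for the "no result yet" case.
import time

def wait_for_result(fetch_result, attempts: int = 5, delay: float = 0.01):
    for _ in range(attempts):
        result = fetch_result()
        if result is not None:
            return result
        time.sleep(delay)  # result may simply not be available yet
    raise RuntimeError("Inference node returned no result; retries exhausted.")

# Simulated node that answers on the third poll.
responses = iter([None, None, {"output": [1, 2, 3]}])
result = wait_for_result(lambda: next(responses))
```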

Bug 5 — run_workflow() uses hardcoded 30M gas

File: src/opengradient/client/alpha.py

Root cause: run_workflow() hardcoded gas=30000000, unlike infer() which correctly uses estimate_gas(). This is both wasteful (users overpay) and fragile on networks with lower block gas limits.

Fix: Call estimate_gas() first and multiply by 3 for headroom. Fall back to 30M only if estimation itself fails.

```python
try:
    estimated_gas = run_function.estimate_gas({"from": self._wallet_account.address})
    gas_limit = int(estimated_gas * 3)
except Exception:
    gas_limit = 30000000  # fallback
```

Also replaces the hardcoded timeout=60 in new_workflow() with the INFERENCE_TX_TIMEOUT constant.
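The estimate-then-fallback pattern can be sketched in isolation by injecting the estimator as a callable, so no web3 connection is needed; pick_gas_limit is a hypothetical helper, not SDK API:

```python
# Isolated sketch of estimate-then-fallback gas sizing.
FALLBACK_GAS = 30_000_000

def pick_gas_limit(estimate_gas_fn) -> int:
    try:
        return int(estimate_gas_fn() * 3)  # 3x headroom on a good estimate
    except Exception:
        return FALLBACK_GAS  # estimation failed: keep the old 30M behavior

good = pick_gas_limit(lambda: 1_200_000)

def failing_estimate():
    raise ValueError("execution reverted")

bad = pick_gas_limit(failing_estimate)
```

The 3x multiplier trades a modest over-reservation for resilience to estimation drift, while the fallback preserves the previous behavior only when estimation itself is unavailable.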


Files Changed

| File | Bugs Fixed |
| --- | --- |
| src/opengradient/client/tee_registry.py | Bug 1 (TEE randomization) |
| src/opengradient/client/llm.py | Bug 2 (HTTP 402 hint) |
| src/opengradient/client/_conversions.py | Bug 3 (int dtype) |
| src/opengradient/client/alpha.py | Bugs 4 & 5 (None guard + gas) |

Related Issues

Closes #200
Closes #188
Partially closes #103

Previously get_llm_tee() always returned tees[0], the first TEE in the
registry list. This caused all clients to hit the same TEE, providing no
load distribution and no resilience when that TEE starts failing.

Fix: use random.choice(tees) so each LLM() construction independently
picks from all currently active TEEs. Successive retries or re-initializations
will therefore naturally spread across the healthy pool.

Closes OpenGradient#200
Previously any HTTP error from the TEE (including 402 Payment Required)
was silently wrapped into a generic RuntimeError("TEE LLM chat failed: ...").
This caused the confusing traceback seen in issue OpenGradient#188 — the real cause
(insufficient OPG allowance) was buried inside the exception message.

Fix:
- Add `import httpx` and intercept httpx.HTTPStatusError before the
  generic `except Exception` handler in both _chat_request() and
  completion().
- When status == 402, raise a RuntimeError with a clear, actionable hint
  telling the user to call llm.ensure_opg_approval(opg_amount=<amount>).
- Also handle 402 in _parse_sse_response() for the streaming path.
- All other HTTP errors continue to surface as before.

Closes OpenGradient#188
Previously convert_to_float32() always returned np.float32 regardless of
the decimals field, forcing callers to manually cast integer results
(issue OpenGradient#103 — add proper type conversions from Solidity contract to Python).

Fix:
- Add convert_fixed_point_to_python(value, decimals) that returns int when
  decimals == 0 and np.float32 otherwise.  np.array() automatically picks
  the correct dtype (int64 vs float32) based on the element types.
- Update both convert_to_model_output() and convert_array_to_model_output()
  to call the new function.
- Keep convert_to_float32() as a deprecated backward-compatible alias so
  any external code that imports it directly continues to work.

Partially closes OpenGradient#103
… in run_workflow

Bug 4 — None crash in infer():
  _get_inference_result_from_node() can legitimately return None when the
  inference node has no result yet.  Previously this None was passed
  directly to convert_to_model_output(), which calls event_data.get(),
  causing an AttributeError: 'NoneType' object has no attribute 'get'.

  Fix: after calling _get_inference_result_from_node(), check for None
  and raise a clear RuntimeError with a human-readable message and the
  inference_id so callers know what to retry.

  Also guard the precompile log fetch: if parsed_logs is empty, raise
  RuntimeError instead of crashing with an IndexError on parsed_logs[0].

Bug 5 — hardcoded 30M gas in run_workflow():
  run_workflow() built the transaction with gas=30000000 (30 million)
  unconditionally.  This is wasteful (users overpay for gas) and can
  fail on networks with a lower block gas limit.

  Fix: call estimate_gas() first (consistent with infer()), then multiply
  by 3 for headroom.  Fall back to 30M only if estimation itself fails.

Also fixes new_workflow() deploy to use INFERENCE_TX_TIMEOUT constant
instead of the previous hardcoded 60 seconds.
@amathxbt
Contributor Author

Hey @adambalogh 👋 — this PR addresses 5 critical bugs found during a code audit, 3 of which directly close or partially close open issues you and the team have filed:

| # | Bug | Issue |
| --- | --- | --- |
| 1 | get_llm_tee() always returned tees[0] — no randomization or load distribution | Closes #200 |
| 2 | HTTP 402 from TEE was swallowed as a cryptic RuntimeError with no guidance | Closes #188 |
| 3 | Fixed-point values with decimals=0 returned np.float32 instead of int | Partially closes #103 |
| 4 | infer() crashed with AttributeError when the inference node returned None | Code audit |
| 5 | run_workflow() hardcoded gas=30000000 instead of using estimate_gas() | Code audit |

All changes are backward-compatible. Happy to adjust anything before review! 🙏

@adambalogh
Collaborator

Hey @amathxbt thanks for your contributions, I added it to my list to review, will get back to you tomorrow!
