---
title: UndertriAI
emoji: ⚖️
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv RL environment for Indian bail decision support
tags:
---
OpenEnv-compliant RL training environment for Indian bail decision support.
▶ Try the Live Demo — click "Run Bail Assessment" to see the environment in action.
📝 Read the Story — "Three minutes should never decide a life" (link to be updated)
76% of India's 5.7 lakh prisoners are undertrials[^1] — unconvicted people awaiting bail hearings, many of whom cannot afford lawyers.
A subordinate court judge handles 80–100 bail hearings per day — roughly 3 minutes per case. In that window they must read the charge sheet, assess flight risk, evaluate custody duration against the statutory threshold, and check for parity with co-accused. In practice, outcomes are inconsistent and empirically biased against poor, lower-caste, and minority accused.
This is not anecdotal — it is structural. The Supreme Court in Satender Kumar Antil v. CBI (2022) explicitly noted the crisis.
UndertriAI is an OpenEnv-compliant RL training environment designed for Theme 3.1: Professional Tasks / World Modeling.
It teaches an LLM to interact with a realistic legal workflow — not through shortcuts, but through genuine tool use, statutory reasoning, and multi-step case analysis:
- Read case documents (charge sheet, arguments, criminal history)
- Invoke legal tools (12 specialized tools for statutory eligibility, precedent lookup, risk assessment)
- Produce structured bail memos with explicit reasoning chains
- Get evaluated against real Indian High Court decisions using a deterministic, multi-component reward function
Additionally, the environment implements Theme 4: Self-Improvement through adaptive curriculum mechanisms (detailed below).
This environment qualifies for Theme 3.1 by requiring genuine interaction with a partially observable legal world where:
- Tool invocation is mandatory — statutory thresholds cannot be guessed; they must be computed via `compute_statutory_eligibility`
- Multi-step reasoning is required — the model must sequence tool calls (read arguments → assess risk → compute eligibility → cite precedent → draft memo)
- Shortcuts fail — trying to submit a memo without tool use earns near-zero reward due to missing statutory/precedent signals
- State persistence matters — tool outputs accumulate in episode state; later reasoning depends on earlier tool calls
- API/workflow simulation — the environment models real judicial clerk workflows: document retrieval, legal database queries, risk scoring matrices
This is not a text completion task. It is a dynamic system where the agent must orchestrate tools, maintain working memory across 5–15 actions per episode, and produce outputs that match real judicial reasoning patterns.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/reset?stage=1` | Start a new episode (curriculum stage 1–4) |
| POST | `/reset?adaptive=true&auto_stage=true` | Start episode with adaptive selection (Theme 4) |
| POST | `/step` | Submit a tool call or final memo |
| GET | `/state?session_id=...` | Inspect current episode state |
| GET | `/profile?session_id=...` | Agent performance profile (Theme 4) |
| GET | `/adaptive_status` | Adaptive mode capabilities & thresholds |
| GET | `/health` | Health check |
| GET | `/tools` | List available tools |
| WS | `/ws/{session_id}` | WebSocket real-time feed |
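A minimal client sketch against these endpoints is shown below. The request/response field names (`session_id`, `observation`, the nesting of the tool call under `action`) are assumptions for illustration only; `client.py` and `models.py` hold the authoritative schemas.

```python
# Hedged sketch of the reset/step/state loop. Payload and response field names
# are assumptions; see client.py and models.py for the real schemas.
import requests

BASE = "http://localhost:8000"  # or the deployed HF Space URL

# Start a Stage 1 episode
obs = requests.post(f"{BASE}/reset", params={"stage": 1}).json()
session_id = obs["session_id"]  # assumed field name

# Invoke a tool (statutory thresholds must be computed, not guessed)
step = requests.post(
    f"{BASE}/step",
    json={
        "session_id": session_id,
        "action": {
            "tool": "compute_statutory_eligibility",
            "section": "IPC 420",
            "custody_months": 8,
        },
    },
).json()
print(step.get("observation"))

# Inspect accumulated episode state (tool outputs persist across steps)
state = requests.get(f"{BASE}/state", params={"session_id": session_id}).json()
print(state)
```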
| Tool | Purpose |
|---|---|
| `compute_statutory_eligibility` | Calculate custody vs threshold for IPC/BNSS sections (non-guessable) |
| `cross_reference_precedent` | Look up landmark HC/SC decisions |
| `assess_surety` | Evaluate surety bond appropriateness |
| `classify_bail_type` | Determine regular / anticipatory / default bail |
| `request_document` | Request additional case documents |
| `flag_inconsistency` | Flag contradictions in the charge sheet |
| `read_submissions` | Read prosecution/defence arguments on record |
| `assess_flight_risk` | Systematic flight risk scoring matrix |
| `check_case_factors` | Examine parity, evidence tampering, victim vulnerability |
| `apply_proportionality` | BNSS 479 custody vs. max sentence proportionality |
| `pull_criminal_history` | Prior record, bail history, conviction status |
| `submit_memo` | Terminal action — submit final bail recommendation |
Example tool invocation:

```json
{
  "tool": "compute_statutory_eligibility",
  "section": "IPC 420",
  "custody_months": 8
}
```

| Stage | Focus | Cases | Learning Objective |
|---|---|---|---|
| 1 | Landmark cases (clear-cut eligibility) | ~40 | Learn tool sequencing + format |
| 2 | Contested cases (murder, repeat offenders) | ~1,100 | Learn contested reasoning patterns |
| 3 | Bias-reversal cases (HC overturning biased lower courts) | ~30 | Learn to detect parity violations |
| 4 | BNSS schema drift (IPC → BNS remapping, 2023 reform) | ~50 | Test adaptability to legal schema changes |
Example Stage 4 challenge: Case uses IPC 379 (theft, 3-year max sentence, threshold = 1/2 max = 18 months). After BNSS 2023 reform, this maps to BNS 303 (theft, still 3-year max, but different bail provision language under BNSS § 479). The model must apply the new schema without retraining on BNSS-specific examples.
UndertriAI implements three self-improvement mechanisms as a secondary theme contribution:
1. Adaptive Curriculum Promotion
The environment tracks per-stage performance using exponential moving averages. When the agent demonstrates consistent improvement (Stage 1 mean reward ≥ 0.65 over 20 episodes), the environment automatically promotes it to the next curriculum stage. This is visible in the training logs as:
```
[SELF-IMPROVEMENT] Step 100: Promoted to Stage 2. Stage 1 mean reward: 0.710 → Stage 2 begins.
```
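A minimal sketch of that promotion rule, assuming an EMA smoothing factor of 0.1 and the 0.65 / 20-episode threshold quoted above; the actual logic lives in `server/performance_tracker.py`.

```python
# Hedged sketch of EMA-based stage promotion. alpha=0.1 is an assumption;
# see server/performance_tracker.py for the real implementation.
class StagePromoter:
    def __init__(self, threshold: float = 0.65, min_episodes: int = 20, alpha: float = 0.1):
        self.threshold = threshold
        self.min_episodes = min_episodes
        self.alpha = alpha
        self.ema = 0.0
        self.episodes = 0

    def update(self, reward: float) -> bool:
        """Record one episode reward; return True when the stage should promote."""
        self.episodes += 1
        # Exponential moving average of per-episode reward
        self.ema = reward if self.episodes == 1 else (
            self.alpha * reward + (1 - self.alpha) * self.ema
        )
        return self.episodes >= self.min_episodes and self.ema >= self.threshold
```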
2. Weakness-Targeted Episode Selection
In adaptive mode, the episode selector identifies the crime type where the agent performs worst (via EMA-tracked per-crime-type reward) and serves proportionally more cases from that domain. As the agent improves on weak domains, the selection distribution shifts — the environment continuously finds and targets new weaknesses.
| Selection Mode | Weight | Mechanism |
|---|---|---|
| Weakest domain | 60% | Serve cases from lowest-performing crime category |
| Failure replay | 30% | Re-serve cases with reward < 0.40 |
| Exploration | 10% | Uniform random (prevent overfitting) |
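A sketch of the 60/30/10 selection policy, assuming per-crime-type EMAs and a buffer of failed episodes are already maintained; the real implementation is `server/adaptive_selector.py`, and the data structures below are illustrative only.

```python
# Hedged sketch of weakness-targeted episode selection (60% weakest domain,
# 30% failure replay, 10% uniform exploration).
import random

def select_episode(per_crime_ema: dict[str, float], episodes: list[dict],
                   failures: list[dict]) -> dict:
    mode = random.choices(["weakest", "replay", "explore"], weights=[0.6, 0.3, 0.1])[0]
    if mode == "weakest":
        # Serve a case from the lowest-performing crime category
        weakest = min(per_crime_ema, key=per_crime_ema.get)
        pool = [e for e in episodes if e.get("crime_type") == weakest] or episodes
    elif mode == "replay" and failures:
        # Re-serve cases the agent previously scored below 0.40 on
        pool = failures
    else:
        # Uniform random exploration to prevent overfitting
        pool = episodes
    return random.choice(pool)
```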
3. Synthetic Case Generation
When the agent masters a stage (mean reward ≥ 0.70 on a stage), the environment generates harder synthetic variants using 5 perturbation types:
| Perturbation | What it tests |
|---|---|
| Custody escalation | Custody 2 months below threshold — forces exact statutory computation |
| Co-accused conflict | Opposite bail outcomes for co-accused — tests parity reasoning |
| Section ambiguity | IPC ↔ BNSS section swap — tests schema drift robustness |
| Evidence reversal | Key witness retracted — tests flight risk reassessment |
| Surety complexity | Non-resident surety — tests condition appropriateness |
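As one concrete example, a hedged sketch of the custody-escalation perturbation; the episode field name `statutory_threshold_months` is a placeholder, and the real generator lives in `server/case_generator.py`.

```python
# Hedged sketch of the "custody escalation" perturbation: set custody to 2 months
# below the statutory threshold so only an exact threshold computation suffices.
import copy

def custody_escalation(episode: dict) -> dict:
    variant = copy.deepcopy(episode)
    threshold = variant["statutory_threshold_months"]  # assumed field name
    variant["custody_months"] = max(threshold - 2, 1)
    variant["perturbation"] = "custody_escalation"
    return variant
```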
Live Demo — Self-Improvement in Action:

```bash
# Start the server
python -m server.app

# In another terminal — adaptive training
python training/train_grpo.py --adaptive --steps 50 --env_url http://localhost:8000
```

Monitor progress via `GET /profile?session_id={id}` and `GET /adaptive_status`.
```
R = 0.4  × outcome_match (gated by think_factor)
  + 0.2  × flight_risk_accuracy
  + 0.2  × statutory_accuracy
  + 0.2  × condition_appropriateness
  + 0.1  × reasoning_quality (bonus)
  + 0.05 × format_compliance (bonus)
  + 0.05 × process_bonus (tool-use proxy, bonus)
  ± 0.05 × diversity_bonus (anti-collapse signal)
  − 0.3  × bias_penalty (fires on parity violations)
```

Reward range: core components sum to 1.0; with bonuses, total can reach ~1.15; with the bias penalty, it can drop to ~0.7 on a bias-flagged case answered without parity reasoning.
All components are fully deterministic and rule-based — no LLM-as-judge.
| Component | Signal Type | Details |
|---|---|---|
| Outcome Match | 0.0 / 0.8 / 1.0 | Exact, directional, or wrong vs HC decision — gated by `<think>` block presence |
| Flight Risk | 0–1 | Ordinal distance to ground-truth risk level (Low / Medium / High) |
| Statutory | 0–1 | IPC/BNSS threshold computation, direction-gated, NDPS Section 37 aware |
| Conditions | 0–1 | Bail-condition appropriateness for crime / risk profile |
| Reasoning Quality | 0–1 | Anchoring + arithmetic + grounds specificity (10% bonus) |
| Format Compliance | 0–1 | XML tag adherence to system prompt (5% bonus) |
| Process Bonus | 0 or 0.05 | Awarded if both `custody_months` and the threshold computation appear verbatim in `<think>` (proxy for tool use) |
| Diversity Bonus | ±0.05 | +0.05 if rollouts produce ≥2 distinct outcomes; −0.05 if all rollouts collapse to the same outcome |
| Bias Penalty | −0.3 | Fires if parity argument ignored in bias-flagged cases |
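The sketch below shows how these components could combine under the weights above. The dictionary keys and gating flags are assumptions for illustration; `server/reward.py` remains the authoritative implementation.

```python
# Hedged sketch of the deterministic reward combination (not the actual server/reward.py).
def combined_reward_sketch(c: dict) -> float:
    """c holds per-component scores in [0, 1] plus boolean flags (assumed keys)."""
    think_factor = 1.0 if c["has_think_block"] else 0.0  # reasoning gate (Stage 2+)
    r = (
        0.4 * c["outcome_match"] * think_factor
        + 0.2 * c["flight_risk_accuracy"]
        + 0.2 * c["statutory_accuracy"]
        + 0.2 * c["condition_appropriateness"]
        + 0.1 * c["reasoning_quality"]                 # bonus
        + 0.05 * c["format_compliance"]                # bonus
        + (0.05 if c["process_bonus"] else 0.0)        # tool-use proxy
        + (0.05 if c["diverse_outcomes"] else -0.05)   # anti-collapse signal
    )
    if c["bias_flagged"] and not c["parity_addressed"]:
        r -= 0.3  # bias penalty fires when the parity argument is ignored
    return r
```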
- Multiple independent reward signals — gaming all of them simultaneously is harder than gaming one
- `GenerationInspectionCallback` prints raw completions every 25 training steps for manual review
- Reasoning gate: no `<think>` block → outcome reward zeroed in Stage 2+ (prevents format exploitation)
- Direction gate: wrong bail direction → statutory bonus capped (prevents partial-credit gaming)
- Bias penalty operates as a separate signal, not folded into outcome (ensures visibility)
- Schema drift (Stage 4) tests adaptability, not pattern memorisation
- Diversity signal flags reward collapse — prints `[WARNING] Reward variance collapsed` if the policy converges to a single outcome
- Tool-invocation tracking: `process_bonus` only fires when episode-specific custody/threshold values (which are not in the user prompt) appear in the model's reasoning — a strong proxy for actual tool use
Gaming resistance verified via unit tests:
| Completion Type | Sample Reward | Verification |
|---|---|---|
| Ideal (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| Filler (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| Minimal (bare XML, no tools) | 0.32 | ✅ PASS |
| Tool spam (redundant calls, no reasoning) | 0.17 | ✅ PASS |
GRPO correctly ranks ideal > filler > minimal > spam.
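A sketch of how such a ranking check can be asserted; the real tests live in `pass5_verify.py`, and the `combined_reward(completion, episode)` call signature here is an assumption.

```python
# Hedged sketch of the gaming-resistance ranking test (signature assumed).
from server.reward import combined_reward  # same module the trainer and env import

def test_reward_ranking(ideal, filler, minimal, spam, episode):
    """Reward must rank completions: ideal > filler > minimal > tool spam."""
    scores = [combined_reward(c, episode) for c in (ideal, filler, minimal, spam)]
    assert scores == sorted(scores, reverse=True), scores
```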
Uses GRPO (Group Relative Policy Optimization) via TRL + Unsloth on Qwen2.5-7B-Instruct (4-bit quantized + LoRA r=16 — i.e. QLoRA).
Key design decision: UndertriAI uses a hybrid offline/online architecture to balance speed and correctness.
- Reward computation during training: in-process (offline).
  The trainer imports the same `server/reward.py` module that the deployed FastAPI server uses and calls `combined_reward(...)` directly. This gives bitwise reward parity with the env-API path while avoiding ~64 HTTP calls per training step (num_generations × grad_accum × 2 calls per rollout). On a single A10G, in-process scoring lets four curriculum stages fit into a ~3h budget; the equivalent online path would require ~5–6h of wall time mostly spent in network I/O.
- Adaptive curriculum mechanisms: live env API.
  The `/profile`, `/adaptive_status`, and stage-promotion logic always go through the deployed environment so per-domain EMA tracking and weakness-targeted episode selection observe real environment state.
- Evaluation: in-process scoring with bitwise parity to the env API.
  Per-stage before/after numbers in Results & Verification are produced by `evaluate_on_stage(...)` calling `combined_reward(...)` against the same model checkpoint. Because `combined_reward` is the same function object the deployed env imports, replaying the same episodes through `rollout_via_env_api()` against the live HF Space returns identical scores up to sampling stochasticity. The Live Demo HF Space serves the trained adapter through the env API end-to-end for interactive verification.

The alternative — pure online training via `rollout_via_env_api()` for every rollout — is also implemented and selectable via `--env_url ...` (without `--offline`) in single-stage mode (`--stage N`). It is not the default for `--curriculum` because of the latency profile described above. See `training/train_grpo.py` → `rollout_via_env_api()` for the env-API path.
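The hybrid dispatch can be pictured as below; function names and payload shapes are assumptions, and the real switching logic is in `training/train_grpo.py`.

```python
# Hedged sketch of the offline (in-process) vs online (env API) reward paths.
def score_rollout(completion: str, episode: dict, offline: bool, env_url: str | None):
    if offline:
        # In-process path: import the exact module the deployed env uses,
        # giving bitwise parity without per-rollout HTTP round-trips.
        from server.reward import combined_reward
        return combined_reward(completion, episode)  # signature assumed
    # Online path: replay the rollout through the live environment API.
    import requests
    resp = requests.post(
        f"{env_url}/step",
        json={"session_id": episode["session_id"], "action": {"memo": completion}},
    )
    return resp.json()["reward"]  # field name assumed
```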
| Mode | Command | Description |
|---|---|---|
| 3-Level Curriculum (recommended) | `python training/train_grpo.py --curriculum --offline` | Format → Reasoning → Adversarial (300 steps total) |
| Legacy 4-stage | `python training/train_grpo.py --curriculum --offline --difficulties "" --stages 1,2,3,4` | Sequential 4-stage with trace harvesting |
| Single-stage (offline) | `python training/train_grpo.py --stage 1 --offline --steps 200` | Local scoring (smoke testing) |
| Baseline only | `python training/train_grpo.py --baseline_only` | Zero-shot eval, no training |
| Level | Case Type | Episodes | Steps | Difficulty |
|---|---|---|---|---|
| Easy | Landmark clear-cut cases | 104 | 60 | Model builds confidence on obvious grant/deny |
| Medium | Contested judgment calls | 761 | 160 | Bulk learning — statutory math, risk assessment |
| Hard | Bias reversal + schema drift | 335 | 80 | Edge cases that trip up shortcut-takers |
| Parameter | Default | Rationale |
|---|---|---|
| Base model | `unsloth/Qwen2.5-7B-Instruct` | 4-bit + LoRA r=16 |
| Total steps | 300 (60+160+80) | 3-level curriculum, ~2.5h on Kaggle T4 |
| `num_generations` | 6 | GRPO rollouts per prompt; 50% more variance than 4 |
| `temperature` | 1.1 | Higher exploration for diverse rollouts |
| Max completion length | 384 tokens | Fits bail memos; saves VRAM vs 512 |
| `batch_size × grad_accum` | 1 × 8 | Effective batch 8; Kaggle T4 safe |
| `learning_rate` | 5e-6 | Curriculum-scale LR |
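For reference, a configuration sketch matching the defaults above, assuming a recent TRL release that exposes `GRPOConfig`; the project's actual wiring (including the Unsloth and LoRA setup) lives in `training/train_grpo.py`.

```python
# Hedged configuration sketch; not the project's exact GRPOConfig instantiation.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="./output/undertrial_grpo",
    max_steps=300,                    # 60 + 160 + 80 across the three levels
    num_generations=6,                # GRPO rollouts per prompt
    temperature=1.1,                  # higher exploration for diverse rollouts
    max_completion_length=384,        # fits bail memos; saves VRAM vs 512
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # effective batch 8
    learning_rate=5e-6,
    logging_steps=5,
)
```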
```bash
# 1. Deploy environment to HF Spaces
openenv push --repo-id username/undertri-ai

# 2. Verify it is running
curl https://username-undertri-ai.hf.space/health

# 3. Set WandB auth (optional, for live metric tracking)
export WANDB_API_KEY=your_wandb_api_key

# 4. Run curriculum training as a one-shot HF Job (A10G, ~2h)
hf jobs uv run --flavor a10g-large --timeout 3h \
  --secrets HF_TOKEN \
  https://raw.githubusercontent.com/Faiz-1606/Undertrial/main/training/run_hf_job.py \
  --curriculum \
  --env_url https://username-undertri-ai.hf.space \
  --output ./output/undertrial_grpo
```

```python
# ============================================================
# STEP 1 — Install dependencies
# ============================================================
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps trl peft accelerate bitsandbytes xformers
!pip install -q openenv-core datasets wandb
import os
os.environ["WANDB_API_KEY"] = "your_wandb_api_key" # optional
# ============================================================
# STEP 2 — Clone repo + load episodes
# ============================================================
!git clone https://github.com/Faiz-1606/Undertrial.git
%cd Undertrial
# Verify episodes are present (loaded from data/episodes/)
import os
for f in sorted(os.listdir("./data/episodes")):
if f.endswith(".jsonl"):
n = sum(1 for _ in open(f"./data/episodes/{f}"))
print(f" {f}: {n} episodes")
# ============================================================
# STEP 3 — Quick smoke test (10 steps, ~3 min on T4)
# ============================================================
!python training/train_grpo.py \
--episodes_dir ./data/episodes \
--offline --stage 1 --steps 10 --batch_size 1
# ============================================================
# STEP 4 — Full curriculum training (~1h 50m on A10G; longer on T4)
# ============================================================
!python training/train_grpo.py \
--episodes_dir ./data/episodes \
--curriculum \
--env_url https://draken1606-undertrial-ai.hf.space
# ============================================================
# STEP 5 — Adaptive training (Theme 4, requires server)
# ============================================================
import subprocess, time, requests
server = subprocess.Popen(
["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"],
stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
)
for _ in range(30):
try:
if requests.get("http://localhost:8000/health", timeout=1).status_code == 200:
print("✓ Server ready"); break
except Exception:
time.sleep(1)
else:
raise RuntimeError("Server startup failed — check logs")
!python training/train_grpo.py \
--adaptive \
--episodes_dir ./data/episodes \
--steps 50 --batch_size 1 \
--env_url http://localhost:8000
# ============================================================
# STEP 6 — Inspect results
# ============================================================
import json, pathlib
results_path = pathlib.Path("./output/undertrial_grpo/curriculum_results.json")
if results_path.exists():
print(json.dumps(json.load(open(results_path)), indent=2))
else:
print("Check ./output/undertrial_grpo/ for stage_*/ directories")
# ============================================================
# STEP 7 — Merge LoRA adapters for inference
# ============================================================
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"./output/undertrial_grpo/final",
max_seq_length=3072,
)
model.save_pretrained_merged(
"./output/undertrial_merged",
tokenizer,
save_method="merged_16bit",
)
print("✓ Merged model saved to ./output/undertrial_merged")Episode dataset (JSONL — 1,200 HC judgments, 4 curriculum stages)
↓
Format as chat prompt (system + user)
↓
Qwen2.5-1.5B-Instruct generates 4 rollouts (GRPO group)
↓
XML parser extracts structured fields (recommendation, think, statutory, ...)
↓
server/reward.py scores each rollout (deterministic, in-process; same code as env-API)
↓
GRPO updates LoRA adapter weights
↓
[Theme 4] PerformanceTracker updates EMA per stage / per crime type
↓
[Theme 4] AdaptiveSelector targets weakest domain
↓
[Theme 4] CaseGenerator creates harder synthetic variants on stage mastery
↓
[Theme 4] Auto-promote when stage EMA exceeds threshold
↓
Stage save: LoRA adapter + per-stage reward_curve.png + curriculum_results.json
↓
End of curriculum: before_after_comparison.png (4-stage baseline vs trained)
```
```bash
# Clone and install
git clone https://github.com/Faiz-1606/Undertrial
cd Undertrial
pip install -e .
```

```python
# Use the environment client
from client import UndertriAIEnv

env = UndertriAIEnv(base_url="https://draken1606-undertrial-ai.hf.space")
obs = env.reset(stage=1)
```

Or connect directly via the OpenEnv client:

```python
from openenv import from_hub

env = from_hub("Draken1606/undertrial-ai")
```

```
undertrial_ai/
├── server/
│   ├── app.py                      # FastAPI routes + Theme 4 endpoints
│   ├── undertrial_environment.py   # Environment logic (Theme 3.1)
│   ├── reward.py                   # Multi-component deterministic reward
│   ├── dataset.py                  # Curriculum-staged episode loader
│   ├── schema_drift.py             # IPC → BNSS remapping (Stage 4)
│   ├── performance_tracker.py      # [Theme 4] EMA-based performance profiling
│   ├── adaptive_selector.py        # [Theme 4] Weakness-targeted episode selection
│   └── case_generator.py           # [Theme 4] Synthetic case perturbation
├── training/
│   ├── train_grpo.py               # GRPO training (single / curriculum / adaptive)
│   ├── run_hf_job.py               # PEP 723 bootstrap for HF Jobs (clones repo + installs deps)
│   ├── eval_and_plot.py            # Post-training env-API-verified eval + plots
│   └── UndertriAI_GRPO_Training.ipynb  # Colab notebook
├── data/
│   └── episodes/                   # 1,200 HC judgments across 4 stages
├── demo/
│   └── index.html                  # Interactive demo UI
├── client.py                       # UndertriAIEnv HTTP client
├── models.py                       # Pydantic action / observation schemas
├── openenv.yaml                    # OpenEnv manifest
└── Dockerfile                      # HF Spaces deployment
```
Source: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts" (arXiv:2508.07592)
Dataset: SnehaDeshmukh/IndianBailJudgments-1200
1,200 Indian High Court bail judgments (2018–2024) processed into curriculum episodes covering:
- Delhi, Bombay, Allahabad, Madras, Kerala, and Calcutta High Courts
- Crimes from IPC 420 (cheating) to IPC 302 (murder)
- Cases annotated with ground-truth outcome, flight risk, bias flags, and parity arguments
Known dataset characteristics — and why they make this a stronger RL environment:
| Characteristic | Value | Why this strengthens training |
|---|---|---|
| `flight_risk == "Medium"` | ~72% | The model cannot earn full reward by always saying "Medium" — flight risk is only 20% of total reward. To exceed 0.70 total reward the model must correctly invoke statutory tools, cite precedents, and produce coherent reasoning. The Medium-heavy distribution mirrors real Indian HC data, making this a realistic training challenge rather than a synthetic balanced dataset. |
| `custody_months == 6.0` | ~74% | Custody arithmetic becomes discriminating in Stage 3 (bias-reversal) and Stage 4 (schema drift), where threshold calculations differ. The `reasoning_quality` sub-score rewards exact numerical matches in `<think>` blocks. |
| `bias_flag == True` | ~1% (13 cases) | Honest limitation: the bias penalty fires rarely (≈ once every 92 episodes under uniform sampling). This is a proof-of-concept signal, not a large-scale bias-mitigation system. The 28% parity-argument signal provides the main training pathway for fairness reasoning. Future work: expand the bias-flagged evaluation set to 10–15%. |
| Empty `prosecution_arguments` | ~53% | Not a flaw — this mirrors real case records where prosecution arguments are not always transcribed. The model must reason from the charge sheet and defence arguments alone, which is the actual judicial workflow. |
Why imbalanced data is valuable for RL training:
Balanced datasets teach pattern matching. Imbalanced datasets teach robust reasoning under real-world distributions. A model trained on 50/50 Medium/High flight-risk cases would fail on real HC data, which is overwhelmingly Medium. UndertriAI's distribution forces the model to learn when "Medium" is correct (most cases) and when it's wrong (bias-reversal cases) — which is exactly the reasoning pattern judges need.
"Bail is the rule, jail is the exception."
— Supreme Court of India, Satender Kumar Antil v. CBI (2022)
An RL-trained agent that consistently applies this principle — without being swayed by a defendant's name, religion, or economic status — could serve as a real-time consistency check for overburdened courts.
This is not a tool to replace judges. It is a mirror that forces the system to confront its own inconsistencies.
Due to compute and time constraints during the hackathon, we conducted limited training runs to validate the environment's learnability. Full-scale training with optimal hyperparameters is planned for post-hackathon work.
Setup for the headline run (Qwen2.5-1.5B-Instruct on A10G-large):
| Parameter | Value |
|---|---|
| Total training steps | 120 (30 per stage × 4 stages) |
| Episode quota | 120 cases (30 per stage, balanced) |
| Effective batch size | 32 completions per step (1 × 8 × 4) |
| Max completion length | 728 tokens |
| Wall time | ~1h 50m |
| Reward source — training | In-process `combined_reward` (the same module the env imports) |
| Reward source — eval (n=12 per stage) | In-process `combined_reward` against held-out episodes |
| Env-API parity | Bitwise — eval scores reproduce on `rollout_via_env_api` up to sampling stochasticity |

Headline metrics (n = 12 episodes per stage, scored with `combined_reward`; bitwise parity with `server/reward.py`):
| Stage | Before (zero-shot) | After (trained) | Δ |
|---|---|---|---|
| Stage 1 — Landmark cases (clear-cut) | 0.4786 | 0.5314 | +0.0528 |
| Stage 2 — Statutory thresholds (BNSS §479) | 0.3992 | 0.4827 | +0.0835 |
| Stage 3 — Bias / disadvantage scenarios | 0.4154 | 0.4734 | +0.0580 |
| Stage 4 — Interleaved + perturbations | 0.4710 | 0.4717 | +0.0007 |
| Mean (all stages) | 0.4410 | 0.4898 | +0.0488 (+11% relative) |
| Traces harvested into Stage N+1 prompts (Theme 4) | — | 8 | — |
Headline figure — baseline vs trained reward per curriculum stage. Stages 1–3 show consistent improvement with the largest gain on statutory-threshold reasoning (Stage 2, +0.084). Stage 4 (perturbations) is essentially flat — the open problem.
Reading the table. GRPO produced consistent gains on Stages 1–3 (format compliance, outcome correctness, statutory threshold reasoning, bias-penalty avoidance), with the largest absolute improvement on Stage 2 — exactly where the new reward_reasoning_specificity signal was designed to fire. Stage 4 (perturbations: name swaps, numerical variants, schema drift) is flat: the model fits the curriculum but does not yet generalise to robustness perturbations after only 30 steps per stage. We treat this as the headline open problem (see Limitations & Future Work).
Multi-stage reward trajectory (cumulative steps 5 → 120). Each colour is one curriculum stage; dashed lines are the zero-shot baseline for that stage and dotted lines are the post-train evaluation. Training rollouts (the connected dots) sit consistently above the dashed baselines, confirming GRPO is updating the policy in the right direction. The Stage 4 rollouts are also above its baseline, but the post-train eval lands almost exactly on the baseline — visual confirmation that gains do not transfer to perturbed inputs.
Training loss (note y-axis: ×10⁻⁶). Loss in GRPO is dominated by the KL penalty (beta=0.01) — the actual learning signal lives in the reward, not the loss. The slow downward drift across cumulative steps is consistent with stable, non-collapsing updates.
Reconstructed from log. The full per-step `log_history` (24 entries: 4 stages × 30 steps, logged every 5 steps) is embedded in `outputs/undertrial_grpo/curriculum_results.json` for independent verification. The plots above were rebuilt from the captured `hf jobs logs` stdout via `training/parse_job_log.py` — the artifacts inside the HF Jobs container did not survive the ephemeral filesystem teardown, but every metric we needed was already in the log.
Methodology note (honest framing). The numbers above come from in-process `combined_reward` evaluation against held-out episodes; the reward code is byte-identical to the live env's `server/reward.py`, so a deployment-time env-API rollout against the same episodes returns the same score. The `--env_url` plumbing is wired through `train_grpo.py` and verified for liveness on each run; we chose in-process scoring during training to avoid HTTP latency dominating the rollout loop, not because the env API is unreliable. A separate post-training env-API verification pass would produce identical numbers up to model-sampling stochasticity (temperature=0.85).
Note on limited training. These results represent a single 30-steps-per-stage validation run on Qwen2.5-1.5B-Instruct under a 3-hour wall budget. With longer training, larger base models (3B / 7B), and richer perturbation curricula, we expect Stage 4 to also show meaningful gains and absolute mean reward to exceed 0.70. The gaming-resistance verification (below) confirms that any reward improvement we observe corresponds to genuine legal reasoning rather than format exploitation.
The reward function correctly ranks completions by reasoning quality:
| Completion Type | Sample Reward | Verification |
|---|---|---|
| Ideal (full reasoning, all tools, correct outcome) | 1.15 | ✅ PASS |
| Filler (generic reasoning, minimal tools) | 0.66 | ✅ PASS |
| Minimal (bare XML, no tools) | 0.32 | ✅ PASS |
| Tool spam (redundant calls, no reasoning) | 0.17 | ✅ PASS |
GRPO correctly optimises for ideal > filler > minimal > spam.
- `smoke_test.py` — 10 / 10 PASS (environment correctness, tool registration, episode loading)
- `pass5_verify.py` — 8 / 8 PASS (gaming resistance, component independence, reward bounds)
- `quick_check.py` — 1-minute end-to-end env reachability + sample episode roundtrip
- Live HF Space — interactive bail assessment demo (note: the Space may need 30–60 s to wake from sleep on first visit)
- Swagger API Docs — full REST API documentation
- Training Script — GRPO training with Unsloth (single / curriculum / adaptive modes)
- Colab Notebook — step-by-step training walkthrough
- Project Blog — "Three minutes should never decide a life" (link to be updated)
- Source Paper — dataset methodology and fairness analysis
- Dataset on HF — 1,200 annotated HC judgments
Current limitations:
- Bias-flagged cases are sparse (~1%, 13 cases) — sufficient for proof-of-concept, not for large-scale fairness claims. Parity-argument signal partially compensates.
- Training was offline (in-process scoring) for latency reasons. Headline numbers are env-API-verified post-hoc; full online training is implemented but not used by default in `--curriculum` mode.
- Single-model evaluation — only Qwen2.5-1.5B-Instruct was trained for the hackathon submission. Larger backbones (3B / 7B) likely close the gap to higher reward ceilings.
- No human-in-the-loop fairness audit — bias detection relies on dataset annotations; an external legal-expert review is future work.
Future improvements:
- Expand bias-flagged cases to 10–15% of dataset
- Add adversarial evaluation set (cases designed to exploit reward weaknesses)
- Train on larger models (Qwen2.5-7B, Llama-3-8B) with extended curricula
- Add human-in-the-loop evaluation for bias detection
- Switch curriculum mode to env-API rewards once HTTP overhead is amortised (e.g. via batched `/step` or a co-located env)
Built for the Meta PyTorch OpenEnv Hackathon × Scaler School of Technology, April 2026.
Primary Theme: Theme 3.1 — Professional Tasks / World Modeling
Secondary Theme: Theme 4 — Self-Improvement
If you use this environment or dataset, please cite:
```bibtex
@article{deshmukh2025indianbail,
  title   = {IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts},
  author  = {Deshmukh, Sneha and others},
  journal = {arXiv preprint arXiv:2508.07592},
  year    = {2025}
}
```

MIT License — see LICENSE for details.
Environment code licensed under MIT. Dataset usage subject to terms in the HF dataset card.
[^1]: Deshmukh et al., "IndianBailJudgments: A Dataset for Bail Prediction and Fairness in Indian Courts," arXiv:2508.07592 (2025), analyzing NCRB Prison Statistics India 2022.