---
title: Emotional Support Conversations (OpenEnv)
emoji: 💬
sdk: docker
pinned: false
tags:
---
An OpenEnv RL environment for evaluating agents on open-ended emotional support conversations, with a hybrid immediate + future-oriented reward signal inspired by RLFF-ESC (Yang, Chen, Wang, 2025, arXiv:2508.12935).
Emotional support is one of the tasks humans most want AI assistants to do well, and one of the easiest to do badly. Existing dialogue benchmarks often score turn-level responses in isolation, which rewards agents for sounding empathetic without testing whether their replies actually move the person toward resolution. This environment closes that gap.
Three properties make it a genuine RL problem, not a single-shot dialogue task:
- Partial observability. The seeker's distress, trust, and willingness to reveal their real issue are hidden state. The agent must infer them from the conversation so far.
- Sequential credit assignment. A warm reply at turn 2 can unlock a disclosure at turn 6. A single dismissive reply at turn 4 can collapse the whole trajectory and require several turns to recover.
- Exploration vs commitment. Should the agent keep exploring feelings or move toward an action plan? Commit too early and the seeker shuts down; explore too long and the episode times out.
Each step reward is:

```
step_reward = clip(0.45 * immediate + 0.55 * future_oriented - penalties, 0, 1)
```

- immediate: stage-appropriate empathy/validation/open-question fit, plus turn-level deltas in the seeker's trust and distress.
- future_oriented: a k-step oracle rollout from both the pre- and post-action seeker states. The reward is proportional to how much the agent's action preserves or advances the attainable resolution ceiling, not just how good the current turn looks in isolation.
- penalties: dismissive language, premature advice, bare replies, interrogation, and repeated template-like responses.
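The shaping formula can be sketched as follows. The 0.45/0.55 weights and the clipping come from the formula above; the mapping from the oracle-ceiling delta to `future_oriented` is a hypothetical stand-in for the real grader internals:

```python
def step_reward(immediate: float, ceiling_before: float,
                ceiling_after: float, penalties: float) -> float:
    """Hybrid immediate + future-oriented step reward (illustrative sketch)."""
    # future_oriented rewards preserving or advancing the attainable
    # resolution ceiling estimated by a k-step oracle rollout from the
    # pre-action and post-action seeker states (mapping is an assumption).
    future_oriented = max(0.0, min(1.0, 0.5 + (ceiling_after - ceiling_before)))
    raw = 0.45 * immediate + 0.55 * future_oriented - penalties
    return max(0.0, min(1.0, raw))  # clip to [0, 1]
```

A perfect turn that merely preserves the ceiling still earns substantial reward, while a turn that lowers the ceiling is penalized even if it sounds empathetic in isolation.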
A final task score combines average shaped reward, the seeker's final resolution state, efficiency, and a completion bonus. Success is hard-gated: timing out with a generic but non-harmful conversation can still earn partial score, but it does not count as a solved episode.
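As a rough illustration of the hard gate, the final score might combine the four components like this. The weights and combination are illustrative assumptions; only the components and the gate on completion come from the description above:

```python
def final_score(avg_step_reward: float, resolution: float,
                efficiency: float, completed: bool) -> tuple[float, bool]:
    """Illustrative final task score; weights are assumptions, not the grader's."""
    bonus = 0.1 if completed else 0.0  # completion bonus
    score = 0.5 * avg_step_reward + 0.25 * resolution + 0.15 * efficiency + bonus
    # Success is hard-gated: a generic but non-harmful episode that times out
    # can still earn partial score, but never counts as solved.
    solved = completed and score >= 0.60  # threshold is task-specific
    return min(score, 1.0), solved
```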
| Task ID | Difficulty | Max turns | Core challenge |
|---|---|---|---|
| work_stress_venting | easy | 10 | Cooperative seeker venting about work. Must reach closing with trust >= 0.70 and distress <= 0.40. |
| guarded_relationship | medium | 12 | Guarded seeker; real issue is hidden behind the surface concern until openness >= 0.75. Must reveal the true issue and finish in closing with trust >= 0.72 and distress <= 0.45. |
| crisis_fragile_trust | hard | 14 | High-distress, fragile trust, multiple interleaved concerns. Must reveal the crisis concern, reference external safety support, and finish in closing with trust >= 0.75 and distress <= 0.40. |
Success thresholds (final score) are 0.60 / 0.62 / 0.65 respectively, and
they are only evaluated after the task-specific completion conditions are met.
Action is a free-text reply to the seeker:

```python
class Action(BaseModel):
    message: str
```

Observation is deliberately partial:

```python
class Observation(BaseModel):
    seeker_utterance: str
    turn: int
    remaining_turns: int
    stage_hint: str
    task_id: str
    scenario_brief: str
```

The seeker's internal hidden variables are never exposed.
The seeker is a deterministic finite-state machine with continuous hidden
variables (distress, trust, openness, revealed, stage). On each
turn, the agent's reply is analyzed with keyword and regex feature detectors,
then hidden state advances via transparent rules.
Why not use an LLM-driven seeker? The hackathon rubric requires graders to be deterministic and reproducible. An LLM-driven seeker would risk score variance between runs. Deterministic dynamics give full reproducibility while still producing rich, sequential, partially observable dialogue with genuine recovery-from-mistakes dynamics.
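The detector-then-rules loop can be sketched as below. The specific keywords, deltas, and the 0.75 disclosure threshold are illustrative assumptions, not the environment's actual rules:

```python
import re

def update_seeker(state: dict, reply: str) -> dict:
    """One deterministic hidden-state transition (illustrative sketch)."""
    text = reply.lower()
    # Keyword/regex feature detectors over the agent's reply.
    validates = bool(re.search(r"\b(that sounds|i hear you|understandable)\b", text))
    advises = bool(re.search(r"\b(you should|just try)\b", text))
    s = dict(state)
    if validates:  # validation builds trust and eases distress
        s["trust"] = min(1.0, s["trust"] + 0.08)
        s["distress"] = max(0.0, s["distress"] - 0.05)
    if advises and s["trust"] < 0.6:  # premature advice erodes fragile trust
        s["trust"] = max(0.0, s["trust"] - 0.10)
    if s["trust"] >= 0.75:  # high trust unlocks disclosure of the real issue
        s["revealed"] = True
    return s
```

Because the transition is a pure function of the current state and the reply text, identical agent trajectories always yield identical episodes and scores.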
| Method | Path | Body | Returns |
|---|---|---|---|
| GET | `/` | none | health + metadata |
| GET | `/tasks` | none | list of tasks |
| POST | `/reset` | `{"task_id": "...", "seed": null}` | ResetResult |
| POST | `/step` | `{"action": {"message": "..."}}` | StepResult |
| GET | `/state` | none | EnvState |
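A minimal client loop against these endpoints might look like the following, using only the standard library. It assumes the server is running locally on port 7860 as in the quickstart; the response field names beyond the request payloads shown above are not asserted here:

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE = "http://127.0.0.1:7860"  # assumption: local server from the quickstart

def call(path: str, body: Optional[dict] = None) -> dict:
    """POST `body` as JSON (or plain GET when body is None) and decode the reply."""
    data = None if body is None else json.dumps(body).encode()
    req = Request(BASE + path, data=data,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def run_episode(max_turns: int = 10) -> list:
    """Reset the easy task, then send the same reply each turn (toy policy)."""
    call("/reset", {"task_id": "work_stress_venting", "seed": None})
    return [call("/step", {"action": {"message": "That sounds really hard."}})
            for _ in range(max_turns)]
```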
```shell
# 1. Install deps
pip install -r requirements.txt

# 2. Start the environment server
uvicorn server:app --host 0.0.0.0 --port 7860

# 3. In another shell, run the baseline inference
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 inference.py
```

inference.py uses the OpenAI client and expects API_BASE_URL plus MODEL_NAME. For authentication it accepts HF_TOKEN (preferred for Hugging Face Router), OPENAI_API_KEY, or API_KEY.
```shell
docker build -t esc-openenv .
docker run -p 7860:7860 esc-openenv
```

The environment itself stays deterministic and reproducible. To align with the hackathon's optional skills/agents framing, this repo also includes a policy-side agentic controller that routes between five reusable skills: empathize, validate, explore, plan, and safety_escalate.
This keeps the benchmark honest:
- the environment and grader remain unchanged
- the agentic story lives in the policy, not in a hidden stochastic seeker
- judges can inspect turn-by-turn routing traces in the benchmark outputs
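The routing itself can be sketched as a small pure function over the observable state. The stage names, trigger phrases, and tie-breaking rules here are illustrative assumptions, not the logic in src/agentic.py:

```python
from typing import Callable, Dict, Tuple

Skill = Callable[[str], str]  # maps seeker utterance -> agent reply

def route(stage_hint: str, seeker_utterance: str,
          skills: Dict[str, Skill]) -> Tuple[str, str]:
    """Pick one of the five skills from observable state only (sketch)."""
    text = seeker_utterance.lower()
    if any(w in text for w in ("hopeless", "can't go on")):
        name = "safety_escalate"          # safety concerns override the stage
    elif stage_hint == "exploration":
        name = "explore"                  # keep drawing the real issue out
    elif stage_hint == "comforting":
        name = "empathize" if "?" not in text else "validate"
    else:                                 # action / closing stages
        name = "plan"
    return name, skills[name](seeker_utterance)
```

Because the routed skill name is returned alongside the reply, every turn can be logged as a `(stage, skill, reply)` trace for inspection.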
Run the built-in rubric ladder and write reusable Markdown/JSON artifacts:

```shell
py -3 benchmark.py
```

Outputs:
- results/local_benchmarks.md
- results/local_benchmarks.json

Run the explicit agentic baseline comparison and write route-aware artifacts:

```shell
py -3 benchmark_agentic.py
```

Outputs:
- results/agentic_benchmarks.md
- results/agentic_benchmarks.json
When you have a real model endpoint and token, run:

```shell
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_llm.py
```

Outputs:
- results/llm_benchmark.md
- results/llm_benchmark.json
Use the same environment endpoint, but add the policy-side router and skill traces around the model:

```shell
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_agentic_llm.py
```

Outputs:
- results/agentic_llm_benchmark.md
- results/agentic_llm_benchmark.json
Deterministic local numbers below were generated with py -3 benchmark.py.
The submitted hosted baseline below comes from a live inference.py run
against the deployed Hugging Face Space using gpt-4.1-mini.
| Baseline | Avg score | Success rate | Notes |
|---|---|---|---|
| generic_template | 0.393 | 0.00 | Safe-sounding repeated empathy; no task completion |
| validation_only | 0.539 | 0.00 | Better partial reward, still fails hard-gated completion |
| stage_aware_heuristic | 0.821 | 1.00 | Task-aware staged policy; completes all 3 tasks |
| Baseline | Avg score | Success rate | Notes |
|---|---|---|---|
| skill_routed_deterministic | 0.821 | 1.00 | Explicit router over empathize / validate / explore / plan / safety_escalate; matches the strong staged baseline while exposing route traces |
| Model | Avg score | Success rate | Notes |
|---|---|---|---|
| gpt-4.1-mini | 0.821 | 1.00 | Live inference.py run against 5ivatej-meta-hackathon.hf.space |
The deterministic ladder separates surface-level empathy from task completion:
the generic repeated-empathy template does not solve any task, while the
stage-aware heuristic completes all three. The submitted gpt-4.1-mini
baseline also completes all three tasks because the policy-side controller
keeps the conversation stage-aware instead of drifting into endless reflection.
```
.
|-- openenv.yaml               # OpenEnv metadata
|-- Dockerfile                 # Container build for HF Space
|-- benchmark.py               # Deterministic local benchmark ladder
|-- benchmark_agentic.py       # Deterministic skill-routed benchmark
|-- benchmark_agentic_llm.py   # Skill-routed LLM benchmark
|-- benchmark_llm.py           # LLM benchmark that writes Markdown/JSON
|-- requirements.txt
|-- server.py                  # FastAPI HTTP server (entrypoint)
|-- inference.py               # Mandated baseline inference script
|-- SUBMISSION_NEXT_STEPS.md   # Manual checklist before final submission
|-- README.md
`-- src/
    |-- __init__.py
    |-- agentic.py             # Skill router + reusable policy-side skills
    |-- baselines.py           # Deterministic baseline policies
    |-- models.py              # Pydantic Action / Observation / Reward / envelopes
    |-- seeker.py              # Deterministic seeker simulator + feature detectors
    |-- tasks.py               # 3 task personas (easy / medium / hard)
    |-- grader.py              # Hybrid immediate + future-oriented reward
    |-- env.py                 # Core ESCEnv with step/reset/state
    `-- client.py              # Async HTTP client for inference.py
```
If you use this environment, please cite the paper whose reward idea inspired it:
```bibtex
@article{yang2025rlffesc,
  title   = {Towards Open-Ended Emotional Support Conversations in LLMs via
             Reinforcement Learning with Future-Oriented Rewards},
  author  = {Yang, Ting and Chen, Li and Wang, Huimin},
  journal = {arXiv preprint arXiv:2508.12935},
  year    = {2025}
}
```