---
title: Emotional Support Conversations (OpenEnv)
emoji: 💬
sdk: docker
pinned: false
tags:
---
An OpenEnv RL environment for evaluating agents on open-ended emotional support conversations, with a hybrid immediate + future-oriented reward signal inspired by RLFF-ESC (Yang, Chen, Wang, 2025, arXiv:2508.12935).
Emotional support is one of the tasks humans most want AI assistants to do well, and one of the easiest to do badly. Existing dialogue benchmarks often score turn-level responses in isolation, which rewards agents for sounding empathetic without testing whether their replies actually move the person toward resolution. This environment closes that gap.
Three properties make it a genuine RL problem, not a single-shot dialogue task:
- Partial observability. The seeker's distress, trust, and willingness to reveal their real issue are hidden state. The agent must infer them from the conversation so far.
- Sequential credit assignment. A warm reply at turn 2 can unlock a disclosure at turn 6. A single dismissive reply at turn 4 can collapse the whole trajectory and require several turns to recover.
- Exploration vs commitment. Should the agent keep exploring feelings or move toward an action plan? Commit too early and the seeker shuts down; explore too long and the episode times out.
Each step reward is:

```
step_reward = clip(0.45 * immediate + 0.55 * future_oriented - penalties, 0, 1)
```

- immediate: stage-appropriate empathy/validation/open-question fit, plus turn-level deltas in the seeker's trust and distress.
- future_oriented: a k-step oracle rollout from both the pre- and post-action seeker states. The reward is proportional to how much the agent's action preserves or advances the attainable resolution ceiling, not just how good the current turn looks in isolation.
- penalties: dismissive language, premature advice, bare replies, interrogation, and repeated template-like responses.
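The shaping formula can be sketched as follows. The 0.45/0.55 weights and the clipping come from the formula above; the mapping from the oracle-ceiling delta to `future_oriented` is a hypothetical stand-in for the real grader internals:

```python
def step_reward(immediate: float, ceiling_before: float,
                ceiling_after: float, penalties: float) -> float:
    """Hybrid immediate + future-oriented step reward (illustrative sketch)."""
    # future_oriented rewards preserving or advancing the attainable
    # resolution ceiling estimated by a k-step oracle rollout from the
    # pre-action and post-action seeker states (mapping is an assumption).
    future_oriented = max(0.0, min(1.0, 0.5 + (ceiling_after - ceiling_before)))
    raw = 0.45 * immediate + 0.55 * future_oriented - penalties
    return max(0.0, min(1.0, raw))  # clip to [0, 1]
```

A perfect turn that merely preserves the ceiling still earns substantial reward, while a turn that lowers the ceiling is penalized even if it sounds empathetic in isolation.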
A final task score combines average shaped reward, the seeker's final resolution state, efficiency, and a completion bonus. Success is hard-gated: timing out with a generic but non-harmful conversation can still earn partial score, but it does not count as a solved episode.
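As a rough illustration of the hard gate, the final score might combine the four components like this. The weights and combination are illustrative assumptions; only the components and the gate on completion come from the description above:

```python
def final_score(avg_step_reward: float, resolution: float,
                efficiency: float, completed: bool) -> tuple[float, bool]:
    """Illustrative final task score; weights are assumptions, not the grader's."""
    bonus = 0.1 if completed else 0.0  # completion bonus
    score = 0.5 * avg_step_reward + 0.25 * resolution + 0.15 * efficiency + bonus
    # Success is hard-gated: a generic but non-harmful episode that times out
    # can still earn partial score, but never counts as solved.
    solved = completed and score >= 0.60  # threshold is task-specific
    return min(score, 1.0), solved
```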
| Task ID | Difficulty | Max turns | Core challenge |
|---|---|---|---|
| work_stress_venting | easy | 10 | Cooperative seeker venting about work. Must reach closing with trust >= 0.70 and distress <= 0.40. |
| guarded_relationship | medium | 12 | Guarded seeker; real issue is hidden behind the surface concern until openness >= 0.75. Must reveal the true issue and finish in closing with trust >= 0.72 and distress <= 0.45. |
| crisis_fragile_trust | hard | 14 | High-distress, fragile trust, multiple interleaved concerns. Must reveal the crisis concern, reference external safety support, and finish in closing with trust >= 0.75 and distress <= 0.40. |
Success thresholds (final score) are 0.60 / 0.62 / 0.65 respectively, and
they are only evaluated after the task-specific completion conditions are met.
Action is a free-text reply to the seeker:

```python
class Action(BaseModel):
    message: str
```

Observation is deliberately partial:

```python
class Observation(BaseModel):
    seeker_utterance: str
    turn: int
    remaining_turns: int
    stage_hint: str
    task_id: str
    scenario_brief: str
```

The seeker's internal hidden variables are never exposed.
The seeker is a deterministic finite-state machine with continuous hidden
variables (distress, trust, openness, revealed, stage). On each
turn, the agent's reply is analyzed with keyword and regex feature detectors,
then hidden state advances via transparent rules.
Why not use an LLM-driven seeker? The hackathon rubric requires graders to be deterministic and reproducible. An LLM-driven seeker would risk score variance between runs. Deterministic dynamics give full reproducibility while still producing rich, sequential, partially observable dialogue with genuine recovery-from-mistakes dynamics.
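The detector-then-rules loop can be sketched as below. The specific keywords, deltas, and the 0.75 disclosure threshold are illustrative assumptions, not the environment's actual rules:

```python
import re

def update_seeker(state: dict, reply: str) -> dict:
    """One deterministic hidden-state transition (illustrative sketch)."""
    text = reply.lower()
    # Keyword/regex feature detectors over the agent's reply.
    validates = bool(re.search(r"\b(that sounds|i hear you|understandable)\b", text))
    advises = bool(re.search(r"\b(you should|just try)\b", text))
    s = dict(state)
    if validates:  # validation builds trust and eases distress
        s["trust"] = min(1.0, s["trust"] + 0.08)
        s["distress"] = max(0.0, s["distress"] - 0.05)
    if advises and s["trust"] < 0.6:  # premature advice erodes fragile trust
        s["trust"] = max(0.0, s["trust"] - 0.10)
    if s["trust"] >= 0.75:  # high trust unlocks disclosure of the real issue
        s["revealed"] = True
    return s
```

Because the transition is a pure function of the current state and the reply text, identical agent trajectories always yield identical episodes and scores.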
| Method | Path | Body | Returns |
|---|---|---|---|
| GET | `/` | none | health + metadata |
| GET | `/tasks` | none | list of tasks |
| POST | `/reset` | `{"task_id": "...", "seed": null}` | ResetResult |
| POST | `/step` | `{"action": {"message": "..."}}` | StepResult |
| GET | `/state` | none | EnvState |
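A minimal client loop against these endpoints might look like the following, using only the standard library. It assumes the server is running locally on port 7860 as in the quickstart; the response field names beyond the request payloads shown above are not asserted here:

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE = "http://127.0.0.1:7860"  # assumption: local server from the quickstart

def call(path: str, body: Optional[dict] = None) -> dict:
    """POST `body` as JSON (or plain GET when body is None) and decode the reply."""
    data = None if body is None else json.dumps(body).encode()
    req = Request(BASE + path, data=data,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def run_episode(max_turns: int = 10) -> list:
    """Reset the easy task, then send the same reply each turn (toy policy)."""
    call("/reset", {"task_id": "work_stress_venting", "seed": None})
    return [call("/step", {"action": {"message": "That sounds really hard."}})
            for _ in range(max_turns)]
```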
```shell
# 1. Install deps
pip install -r requirements.txt

# 2. Start the environment server
uvicorn server:app --host 0.0.0.0 --port 7860

# 3. In another shell, run the baseline inference
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 inference.py
```

inference.py uses the OpenAI client and expects API_BASE_URL plus MODEL_NAME. For authentication it accepts HF_TOKEN (preferred for Hugging Face Router), OPENAI_API_KEY, or API_KEY.
```shell
docker build -t esc-openenv .
docker run -p 7860:7860 esc-openenv
```

The environment itself stays deterministic and reproducible. To align with the hackathon's optional skills/agents framing, this repo also includes a policy-side agentic controller that routes between five reusable skills: empathize, validate, explore, plan, and safety_escalate.
This keeps the benchmark honest:
- the environment and grader remain unchanged
- the agentic story lives in the policy, not in a hidden stochastic seeker
- judges can inspect turn-by-turn routing traces in the benchmark outputs
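The routing itself can be sketched as a small pure function over the observable state. The stage names, trigger phrases, and tie-breaking rules here are illustrative assumptions, not the logic in src/agentic.py:

```python
from typing import Callable, Dict, Tuple

Skill = Callable[[str], str]  # maps seeker utterance -> agent reply

def route(stage_hint: str, seeker_utterance: str,
          skills: Dict[str, Skill]) -> Tuple[str, str]:
    """Pick one of the five skills from observable state only (sketch)."""
    text = seeker_utterance.lower()
    if any(w in text for w in ("hopeless", "can't go on")):
        name = "safety_escalate"          # safety concerns override the stage
    elif stage_hint == "exploration":
        name = "explore"                  # keep drawing the real issue out
    elif stage_hint == "comforting":
        name = "empathize" if "?" not in text else "validate"
    else:                                 # action / closing stages
        name = "plan"
    return name, skills[name](seeker_utterance)
```

Because the routed skill name is returned alongside the reply, every turn can be logged as a `(stage, skill, reply)` trace for inspection.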
Run the built-in rubric ladder and write reusable Markdown/JSON artifacts:

```shell
py -3 benchmark.py
```

Outputs:
- results/local_benchmarks.md
- results/local_benchmarks.json

Run the explicit agentic baseline comparison and write route-aware artifacts:

```shell
py -3 benchmark_agentic.py
```

Outputs:
- results/agentic_benchmarks.md
- results/agentic_benchmarks.json
When you have a real model endpoint and token, run:

```shell
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_llm.py
```

Outputs:
- results/llm_benchmark.md
- results/llm_benchmark.json
Use the same environment endpoint, but add the policy-side router and skill traces around the model:

```shell
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_agentic_llm.py
```

Outputs:
- results/agentic_llm_benchmark.md
- results/agentic_llm_benchmark.json
Deterministic local numbers below were generated with py -3 benchmark.py.
The submitted hosted baseline below comes from a live inference.py run
against the deployed Hugging Face Space using gpt-4.1-mini.
| Baseline | Avg score | Success rate | Notes |
|---|---|---|---|
| generic_template | 0.393 | 0.00 | Safe-sounding repeated empathy; no task completion |
| validation_only | 0.539 | 0.00 | Better partial reward, still fails hard-gated completion |
| stage_aware_heuristic | 0.821 | 1.00 | Task-aware staged policy; completes all 3 tasks |
| Baseline | Avg score | Success rate | Notes |
|---|---|---|---|
| skill_routed_deterministic | 0.821 | 1.00 | Explicit router over empathize / validate / explore / plan / safety_escalate; matches the strong staged baseline while exposing route traces |
| Model | Avg score | Success rate | Notes |
|---|---|---|---|
| gpt-4.1-mini | 0.821 | 1.00 | Live inference.py run against 5ivatej-meta-hackathon.hf.space |
The deterministic ladder separates surface-level empathy from task completion:
the generic repeated-empathy template does not solve any task, while the
stage-aware heuristic completes all three. The submitted gpt-4.1-mini
baseline also completes all three tasks because the policy-side controller
keeps the conversation stage-aware instead of drifting into endless reflection.
```
.
|-- openenv.yaml               # OpenEnv metadata
|-- Dockerfile                 # Container build for HF Space
|-- benchmark.py               # Deterministic local benchmark ladder
|-- benchmark_agentic.py       # Deterministic skill-routed benchmark
|-- benchmark_agentic_llm.py   # Skill-routed LLM benchmark
|-- benchmark_llm.py           # LLM benchmark that writes Markdown/JSON
|-- requirements.txt
|-- server.py                  # FastAPI HTTP server (entrypoint)
|-- inference.py               # Mandated baseline inference script
|-- SUBMISSION_NEXT_STEPS.md   # Manual checklist before final submission
|-- README.md
`-- src/
    |-- __init__.py
    |-- agentic.py             # Skill router + reusable policy-side skills
    |-- baselines.py           # Deterministic baseline policies
    |-- models.py              # Pydantic Action / Observation / Reward / envelopes
    |-- seeker.py              # Deterministic seeker simulator + feature detectors
    |-- tasks.py               # 3 task personas (easy / medium / hard)
    |-- grader.py              # Hybrid immediate + future-oriented reward
    |-- env.py                 # Core ESCEnv with step/reset/state
    `-- client.py              # Async HTTP client for inference.py
```
If you use this environment, please cite the paper whose reward idea inspired it:
```bibtex
@article{yang2025rlffesc,
  title   = {Towards Open-Ended Emotional Support Conversations in LLMs via
             Reinforcement Learning with Future-Oriented Rewards},
  author  = {Yang, Ting and Chen, Li and Wang, Huimin},
  journal = {arXiv preprint arXiv:2508.12935},
  year    = {2025}
}
```