EnvCommons/FoodDelivery

FoodDelivery

OpenReward Environment

Description

FoodDelivery is a food delivery dispatch optimization environment. An agent manages a fleet of e-bike couriers across a procedurally generated city, making real-time decisions about courier-order matching, order batching, surge pricing, and fleet repositioning under stochastic conditions.

The simulation models a 4-hour dinner service period (5:00 PM -- 9:00 PM) with non-homogeneous Poisson order arrivals, Gamma-distributed restaurant preparation times, and lognormal stochastic travel times with time-of-day traffic effects. Scenarios range from calm weekday evenings to holiday peak demand, rainy weather, sudden demand spikes, and understaffed conditions.

Note: This is a synthetic environment that was mostly AI-generated; we recommend testing it thoroughly before using it in an RL pipeline.

Capabilities

  • Real-time courier-order matching and assignment under time pressure
  • Order batching (up to 2 orders per courier) for delivery efficiency
  • Dynamic surge pricing per zone to manage supply-demand imbalances
  • Idle courier repositioning to anticipate demand shifts
  • Multi-step sequential decision-making (240 steps per task; 180 for late_night)
  • Reasoning about stochastic travel times, restaurant prep delays, and demand patterns

Compute Requirements

No additional compute required beyond the environment server.

License

MIT

Tasks

There are 24 tasks across 8 scenarios, each run with 3 random seeds.

Train split (18 tasks):

| Scenario | Demand | Weather | Couriers | Duration | Description |
| --- | --- | --- | --- | --- | --- |
| weekday_calm | 1.0x | Clear | 20 | 240 min | Baseline scenario |
| weekday_busy | 1.3x | Clear | 20 | 240 min | Higher demand, same supply |
| rainy_evening | 1.3x | Rain (+25% demand, +30% travel) | 18 | 240 min | Bad weather, fewer couriers |
| weekend_rush | 1.6x | Clear | 22 | 240 min | Weekend dinner peak |
| friday_surge | 1.0x -> 2.0x at step 120 | Clear | 20 | 240 min | Sudden demand spike mid-service |
| understaffed | 1.3x | Clear | 14 | 240 min | Severe courier shortage |

Test split (6 tasks):

| Scenario | Demand | Weather | Couriers | Duration | Description |
| --- | --- | --- | --- | --- | --- |
| holiday_peak | 2.0x | Clear | 25 | 240 min | Maximum demand |
| late_night | 0.8x -> declining | Clear | 12 | 180 min | Declining demand, repositioning critical |

Each task is identified as {scenario}_seed{N} (e.g., weekday_calm_seed42). Seeds used are 42, 123, and 777.
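As a quick check on the naming scheme, the snippet below enumerates the task identifiers from the scenario names and seeds listed above. The helper name `task_ids` is only illustrative and is not part of the environment.

```python
from itertools import product

# Scenario names and seeds as listed in the task tables above.
TRAIN_SCENARIOS = [
    "weekday_calm", "weekday_busy", "rainy_evening",
    "weekend_rush", "friday_surge", "understaffed",
]
TEST_SCENARIOS = ["holiday_peak", "late_night"]
SEEDS = [42, 123, 777]


def task_ids(scenarios, seeds=SEEDS):
    """Build identifiers in the {scenario}_seed{N} format (illustrative helper)."""
    return [f"{scenario}_seed{seed}" for scenario, seed in product(scenarios, seeds)]


train_ids = task_ids(TRAIN_SCENARIOS)
test_ids = task_ids(TEST_SCENARIOS)
print(len(train_ids), len(test_ids))  # 18 6
print(train_ids[0])                   # weekday_calm_seed42
```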

The city model is a 6 km x 6 km grid divided into 9 zones (3x3), with 30 restaurants and 12--25 couriers depending on scenario. Zone types (downtown core, commercial, residential, suburban) determine local demand intensity.
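The grid geometry implies that each zone is a 2 km x 2 km square. A minimal sketch of mapping a position to a zone follows; the row-major numbering (0 to 8), the coordinate origin, and the function name `zone_index` are assumptions, since the environment's own zone identifiers are returned by get_info() and may differ.

```python
CITY_SIZE_KM = 6.0   # from the text: 6 km x 6 km grid
ZONES_PER_SIDE = 3   # from the text: 3x3 = 9 zones


def zone_index(x_km: float, y_km: float) -> int:
    """Map a position to a zone id in 0..8 (row-major numbering is an assumption)."""
    if not (0.0 <= x_km <= CITY_SIZE_KM and 0.0 <= y_km <= CITY_SIZE_KM):
        raise ValueError("position outside the city grid")
    cell_km = CITY_SIZE_KM / ZONES_PER_SIDE  # 2 km per zone side
    # Clamp so that points exactly on the far edge fall into the last zone.
    col = min(int(x_km // cell_km), ZONES_PER_SIDE - 1)
    row = min(int(y_km // cell_km), ZONES_PER_SIDE - 1)
    return row * ZONES_PER_SIDE + col


print(zone_index(0.5, 0.5))  # 0
print(zone_index(3.0, 3.0))  # 4
print(zone_index(6.0, 6.0))  # 8
```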

Reward Structure

This is a dense, verifiable reward environment. Rewards are computed deterministically after each 1-minute simulation step based on delivery outcomes:

| Event | Reward |
| --- | --- |
| On-time delivery (within promised window) | +1.0 |
| Slightly late delivery (up to 10 min past promise) | +0.3 |
| Very late delivery (>10 min past promise) | -0.5 |
| Expired order (unassigned >20 min or undelivered >60 min) | -1.5 |
| Surge pricing revenue | +0.01 per unit revenue |

The final reward returned at task completion is the normalized cumulative reward:

$$\text{reward} = \frac{\sum_{t=1}^{T} r_t}{N_{\text{orders}}}$$

where $r_t$ is the step reward and $N_{\text{orders}}$ is the total number of orders that arrived during the simulation.
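To make the arithmetic concrete, here is a small sketch that scores a list of delivery outcomes using the event rewards and the normalization above. The event labels, the function names, and the handling of a delivery exactly 10 minutes late (treated here as slightly late) are assumptions; the environment's internal bookkeeping may differ.

```python
from collections import Counter

# Per-event rewards from the reward table; the string labels are illustrative.
EVENT_REWARD = {
    "on_time": 1.0,
    "slightly_late": 0.3,
    "very_late": -0.5,
    "expired": -1.5,
}
SURGE_REVENUE_WEIGHT = 0.01  # reward per unit of surge revenue


def delivery_event(minutes_past_promise: float) -> str:
    """Classify a completed delivery by lateness relative to its promised time."""
    if minutes_past_promise <= 0:
        return "on_time"
    if minutes_past_promise <= 10:  # boundary treatment is an assumption
        return "slightly_late"
    return "very_late"


def normalized_reward(events, surge_revenue: float, n_orders: int) -> float:
    """Sum all event rewards plus surge revenue, then divide by orders that arrived."""
    if n_orders <= 0:
        raise ValueError("n_orders must be positive")
    counts = Counter(events)
    total = sum(EVENT_REWARD[name] * count for name, count in counts.items())
    total += SURGE_REVENUE_WEIGHT * surge_revenue
    return total / n_orders


# Worked example: 100 orders, 50 units of surge revenue.
events = ["on_time"] * 70 + ["slightly_late"] * 20 + ["very_late"] * 6 + ["expired"] * 4
# 70(1.0) + 20(0.3) + 6(-0.5) + 4(-1.5) + 0.01(50) = 67.5, so 67.5 / 100 = 0.675
print(round(normalized_reward(events, surge_revenue=50.0, n_orders=100), 3))
```

Because on-time deliveries are worth at most +1.0 each, the normalized reward stays near or below 1 unless surge revenue is large relative to the order count.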

We do not use LLM graders for this task.

Data

All data is procedurally generated from the scenario configuration and random seed. No external data files are required. The simulation generates:

  • City layout: 9 zones with demand multipliers, 30 restaurants with cuisine types and prep time distributions
  • Order arrivals: Non-homogeneous Poisson process with dinner rush profile
  • Prep times: Gamma-distributed per restaurant type (fast food ~8 min, standard ~18 min, premium ~25 min)
  • Travel times: Manhattan distance with lognormal stochastic noise and time-of-day traffic factors
  • Courier speeds: Base ~18 km/h (e-bike) with lognormal perturbation per courier

Identical seeds produce identical simulations for reproducibility.
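The generative pieces listed above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the environment's generator: the rate profile, the Gamma shape, the lognormal sigma, and all function names are hypothetical, and only the distribution families, the ~18 min standard prep time, and the ~18 km/h base speed come from the text.

```python
import math
import random


def arrival_rate(t_min: float) -> float:
    """Hypothetical dinner-rush profile in orders per minute, peaking mid-service."""
    return 0.5 + math.exp(-((t_min - 120.0) / 60.0) ** 2)


def sample_arrivals(rng, horizon_min=240.0, rate_max=1.5):
    """Non-homogeneous Poisson arrivals via thinning; rate_max must bound arrival_rate."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_max)  # candidate from a homogeneous process
        if t >= horizon_min:
            return arrivals
        if rng.random() < arrival_rate(t) / rate_max:  # keep with prob rate(t)/rate_max
            arrivals.append(t)


def sample_prep_minutes(rng, mean_min=18.0, shape=4.0):
    """Gamma prep time with the given mean (the shape value is an assumption)."""
    return rng.gammavariate(shape, mean_min / shape)


def sample_travel_minutes(rng, a, b, speed_kmh=18.0, traffic=1.0, sigma=0.2):
    """Manhattan distance at e-bike speed, scaled by traffic and mean-one lognormal noise."""
    dist_km = abs(a[0] - b[0]) + abs(a[1] - b[1])
    base_min = dist_km / speed_kmh * 60.0
    noise = rng.lognormvariate(-0.5 * sigma ** 2, sigma)
    return base_min * traffic * noise


def simulate(seed: int):
    """Draw one set of arrivals, prep times, and a sample trip from a single seeded RNG."""
    rng = random.Random(seed)
    arrivals = sample_arrivals(rng)
    preps = [sample_prep_minutes(rng) for _ in arrivals]
    travel = sample_travel_minutes(rng, (1.0, 1.0), (4.0, 3.0))
    return arrivals, preps, travel


# Same seed, same draws: the property the environment relies on for reproducibility.
assert simulate(42) == simulate(42)
```

Routing every random draw through one seeded generator, in a fixed order, is what makes the whole trajectory repeatable; mixing in an unseeded source anywhere would break that.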

Tools

Agents have access to 2 tools:

| Tool | Description |
| --- | --- |
| get_info() | Returns static city layout: zone boundaries, restaurant positions/types, courier starting positions. Does not advance simulation time. |
| step(actions) | Advances the simulation by 1 minute. Accepts optional dispatch actions: order assignments, surge multipliers per zone, and courier repositioning commands. Returns current state, step reward, and completion status. |

The step tool accepts three optional action types:

  • assignments: Assign 1--2 orders to an idle courier (batching supported)
  • surge_multipliers: Set per-zone price multipliers (1.0--3.0x, reduces demand via elasticity)
  • repositions: Send idle couriers to target zones
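The three action types might be assembled into a single payload along these lines. Only the top-level keys (assignments, surge_multipliers, repositions), the 1--2 order batch limit, and the 1.0--3.0x surge range come from the text; the inner field names (`courier_id`, `order_ids`, `zone`), the identifier types, and the choice to clamp rather than reject out-of-range multipliers are assumptions, so check the actual tool schema before relying on this.

```python
from typing import Dict, List

MAX_BATCH = 2                    # up to 2 orders per courier
SURGE_MIN, SURGE_MAX = 1.0, 3.0  # allowed surge multiplier range


def build_actions(
    assignments: Dict[str, List[str]],
    surge_multipliers: Dict[int, float],
    repositions: Dict[str, int],
) -> dict:
    """Validate and assemble a step() payload (inner field names are hypothetical)."""
    for courier_id, order_ids in assignments.items():
        if not 1 <= len(order_ids) <= MAX_BATCH:
            raise ValueError(f"courier {courier_id}: assign 1 to {MAX_BATCH} orders")
    # Clamp multipliers into the documented range instead of failing the step.
    clamped = {
        zone: min(max(mult, SURGE_MIN), SURGE_MAX)
        for zone, mult in surge_multipliers.items()
    }
    return {
        "assignments": [
            {"courier_id": c, "order_ids": list(o)} for c, o in assignments.items()
        ],
        "surge_multipliers": clamped,
        "repositions": [
            {"courier_id": c, "zone": z} for c, z in repositions.items()
        ],
    }


actions = build_actions(
    assignments={"c1": ["o1", "o2"]},  # batch two orders on one idle courier
    surge_multipliers={4: 3.7},        # clamped down to 3.0
    repositions={"c2": 0},             # send an idle courier to zone 0
)
print(actions["surge_multipliers"])  # {4: 3.0}
```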

Time Horizon

FoodDelivery is a multi-step environment with 240 simulation steps per task (180 for late_night). Each step represents 1 minute of simulated time. A full run therefore takes about 241 tool calls: one get_info call plus 240 step calls (181 in total for late_night).
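The shape of an episode can be sketched as a simple loop. `StubEnv` below is a stand-in that only counts steps; it is not the OpenReward client, and the response keys (`state`, `reward`, `done`) and the policy signature are assumptions made for illustration.

```python
class StubEnv:
    """Stand-in environment: counts steps and returns zero reward (not the real API)."""

    def __init__(self, n_steps: int = 240):
        self.n_steps = n_steps
        self.minute = 0

    def get_info(self) -> dict:
        # Static layout; calling this does not advance simulated time.
        return {"zones": 9, "restaurants": 30}

    def step(self, actions=None) -> dict:
        self.minute += 1  # each step is one simulated minute
        return {
            "state": {"minute": self.minute},
            "reward": 0.0,
            "done": self.minute >= self.n_steps,
        }


def run_episode(env, policy):
    """Call get_info once, then step until done; return (tool_calls, total_reward)."""
    info = env.get_info()
    tool_calls, total_reward = 1, 0.0
    state, done = None, False
    while not done:
        actions = policy(info, state)  # policy may return an empty dict (no-op)
        out = env.step(actions)
        tool_calls += 1
        total_reward += out["reward"]
        state, done = out["state"], out["done"]
    return tool_calls, total_reward


def noop_policy(info, state):
    """Baseline that issues no dispatch actions."""
    return {}


print(run_episode(StubEnv(240), noop_policy))  # (241, 0.0)
print(run_episode(StubEnv(180), noop_policy))  # (181, 0.0)
```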

Other Environment Requirements

There are no external API requirements. FoodDelivery works out of the box with the OpenReward endpoint without any secrets.

Safety

Agents in FoodDelivery optimize delivery logistics within a fully simulated environment. The surge pricing mechanism introduces an economic optimization component where agents must balance revenue extraction against service quality. While the environment rewards efficient dispatch, the objectives are bounded and the simulation is self-contained.

Citations

@dataset{GRFoodDelivery,
  author    = {General Reasoning Inc. Team},
  title     = {FoodDelivery},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/fooddelivery}
}
