Route Planning Model Evaluation System

This system evaluates the output quality of route planning large models, supporting four evaluation scenarios: single-route planning, personalized preferences, route diversity, and general-purpose LLM. The evaluation logic adopts a funnel-style multi-round check, filtering from reachability to accuracy layer by layer.

Directory Structure

.
├── common.py                          # Common evaluation module (Haversine distance, IoU calculation, expert scoring, etc.)
├── single_route/evaluate.py           # Single-route planning evaluation (Benchmark 1)
├── personalized/evaluate.py           # Personalized preference evaluation (Benchmark 2)
├── diversity/evaluate.py              # Route diversity evaluation (Benchmark 3)
├── general_llm/evaluate.py            # General-purpose LLM evaluation (uses a remote route-eval API)
└── data/
    ├── station_info.csv               # Station coordinates and adjacency relations (bundled in this repo)
    ├── benchmark1_single_route_example.csv    # Single-route example data
    ├── benchmark2_personalized_example.csv    # Personalized preference example data
    ├── benchmark3_diversity_example.csv       # Diversity example data
    └── general_llm_example.csv               # General-purpose LLM example data

Requirements

Python 3.8+
No additional dependencies (standard library only)

Usage

Each scenario provides *_example.csv (10 samples) for quick validation. The default --input of each script points to the corresponding example file, so the following commands can be run directly:

# Single-route planning evaluation (Benchmark 1)
python3 single_route/evaluate.py --input_field generate_results

# Personalized preference evaluation (Benchmark 2)
python3 personalized/evaluate.py --input_field generate_results

# Route diversity evaluation (Benchmark 3)
python3 diversity/evaluate.py --input_field generate_results

# General-purpose LLM evaluation
python3 general_llm/evaluate.py

Parameters:

--input: Input CSV file path
--input_field: Evaluation field, use generate_results for external model output evaluation
--sample_count: Number of samples to evaluate (default: all)

general_llm/evaluate.py additionally supports:

--api_url: route-eval endpoint; defaults to TRANSIT_LM_API, or falls back to http://transit-lm.amap.com
--batch_size: batch request size (default: 50)

Example with an explicit API endpoint:

python3 general_llm/evaluate.py \
  --api_url http://transit-lm.amap.com \
  --batch_size 50

Personalized preference adds Round 5: Preference Compliance (transfer count, subway inclusion, predicted time constraints, etc.).

Diversity evaluation adds extra metrics:

Best-Match: When the first route IoU != 1, automatically search second/third for a reachable alternative with IoU = 1
Route Diversity (RD): mean(1 - IoU(L_i, L_j)), with transfer mode included as part of the line set

For general-purpose LLM evaluation, the evaluation field is fixed to generate_results. The predicted route sequence should use station names rather than stop_id, in either of these forms:

station_sequence: directly contains station names
station_name: an explicit station-name sequence field

Notes:

In the general-purpose LLM setting, the remote route-eval service only cares about boarding/alighting stations plus normalized transfer structure; a full stop-by-stop sequence is not required
If a full station sequence is provided, the evaluation script still accepts it and normalizes it before sending the remote evaluation request

Scenario-specific input notes:

Benchmark 1 / Benchmark 2: sft_label and generate_results are single-route JSON objects
Benchmark 2: CSV must additionally contain req_type
Benchmark 3: sft_label and generate_results are multi-route JSON objects, typically containing first, second, and third
General-purpose LLM: labels still use station_sequence, but the sequence content should be station names; prediction-side matching is by station name rather than stop_id

Evaluation Logic (Funnel-Style 4 Rounds)

All scenarios share the following 4 core checks:

Round	Check Item	Description
Round 1	Reachability	Check if adjacent stations are connected in `next_stop_ids`
Round 2	Station Grounding & Distance Plausibility	Straight-line distance from start/end to first/last station is thresholded by transfer mode (walk 3km / bike 5km / taxi 10km); transfer distance cannot be shorter than straight-line - 0.5km, nor exceed 3x + 0.5km
Round 3	Line IoU + Station IoU + Expert Score + Transfer Mode Consistency	Station IoU filters out [Transfer] nodes; Expert score formula: S = T_sec/300 + (N_lines + cycling_segments) + fare
Round 4	Estimation Accuracy	Distance ±10% or 0.5km, Time ±10% or 5min, Fare ±10% or 1 CNY; Transfer distance ±0.5km

Data File Format

Evaluation Data CSV

Field	Description
`index_id`	Sample unique identifier
`sft_prompt`	Input prompt (JSON string, containing query/start/end/city)
`sft_label`	Ground truth answer (JSON string)
`generate_results`	Model output (JSON string)

Benchmark 2 additionally requires:

Field	Description
`req_type`	Preference type. Current built-in values: `2` (fewer transfers), `5` (no subway), `7` (subway first), `8` (shorter time)

Route JSON Schema

Single-route benchmarks (benchmark1, benchmark2) expect sft_label / generate_results to be JSON objects like:

{
  "station_sequence": ["stop_id_1", "stop_id_2", "stop_id_3"],
  "line_sequence": ["Line A", "Line B"],
  "total_distance": "12.5",
  "total_time": "35",
  "total_fare": "4",
  "start_transfer_mode": "步行",
  "start_transfer_distance": "0.8",
  "end_transfer_mode": "步行",
  "end_transfer_distance": "0.6"
}

Diversity benchmark (benchmark3) expects a multi-route JSON object:

{
  "first": {
    "station_sequence": ["stop_id_1", "stop_id_2"],
    "line_sequence": ["Line A"],
    "total_distance": "8.2",
    "total_time": "24",
    "total_fare": "3"
  },
  "second": {
    "station_sequence": ["stop_id_3", "stop_id_4"],
    "line_sequence": ["Line B"],
    "total_distance": "9.0",
    "total_time": "27",
    "total_fare": "4"
  },
  "third": {
    "station_sequence": ["stop_id_5", "stop_id_6"],
    "line_sequence": ["Line C"],
    "total_distance": "10.5",
    "total_time": "30",
    "total_fare": "4"
  }
}

For general-purpose LLM evaluation, the route JSON keeps the same metric fields, but station_sequence should contain station names instead of stop_id.

Additional note:

The remote evaluation stage only uses boarding/alighting stations and transfer structure for reachability and grounding checks
Even if the input contains the full station sequence, the script will handle it and extract the subset needed by the remote evaluator

station_info.csv

Field	Description
`stop_id`	Station ID
`ad_code`	City / district code
`coord_x`	Longitude
`coord_y`	Latitude
`next_hop_stations`	Adjacent station ID list (JSON array)

Notes:

common.py converts coord_x/coord_y into internal latitude/longitude values and maps next_hop_stations to next_stop_ids
station_name is optional in the current bundled station_info.csv; if present, it is loaded automatically

Output Summary

The scripts print stage-by-stage statistics directly to stdout:

Round 1: Reachability
Round 2: Station Grounding and Distance Plausibility
Round 3: Line IoU, Station IoU, expert score, and transfer-mode consistency
Round 4: Distance / time / fare / transfer accuracy
Benchmark 2 adds Round 5 Preference Compliance
Benchmark 3 additionally reports Best-Match statistics and Route Diversity

For quick result inspection, the most useful summary fields are usually reachability pass rate, station_iou == 1 count, overall accuracy, Preference Compliance (Benchmark 2), and Route Diversity / Best-Match hit distribution (Benchmark 3).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Route Planning Model Evaluation System

Directory Structure

Requirements

Usage

Evaluation Logic (Funnel-Style 4 Rounds)

Data File Format

Evaluation Data CSV

Route JSON Schema

station_info.csv

Output Summary

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
diversity		diversity
general_llm		general_llm
personalized		personalized
single_route		single_route
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md
common.py		common.py

Folders and files

Latest commit

History

Repository files navigation

Route Planning Model Evaluation System

Directory Structure

Requirements

Usage

Evaluation Logic (Funnel-Style 4 Rounds)

Data File Format

Evaluation Data CSV

Route JSON Schema

station_info.csv

Output Summary

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages