TransitLM Route Evaluation

This repository provides the companion evaluation code for the paper TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation, and offers a unified funnel-style pipeline for systematically evaluating route-planning model outputs.

It supports the following four representative settings:

Single-route planning
Preference-aware planning
Multi-route diversity
General-purpose LLM evaluation through a remote route-eval API

Rather than producing only a single aggregate score, the evaluator decomposes quality into several dimensions, including reachability, station grounding, structural consistency, and the plausibility of distance, time, and fare estimates.

Paper And Resources

arXiv: TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Hugging Face Paper: TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Hugging Face: GD-ML/TransitLM
ModelScope: GD-ML/TransitLM
This repository: the evaluation code accompanying the TransitLM paper, intended for reproducing the evaluation pipeline and metrics used in the benchmark

Why This Repo

Scenario coverage: one codebase for benchmark 1, 2, 3, and general LLM evaluation
Layered evaluation: reachability, grounding, overlap, and estimation accuracy are reported separately
Lightweight setup: Python standard library only
Fast validation: bundled example CSVs let you run every evaluator immediately

Quick Start

Requirements:

Python 3.8+

Run the built-in examples:

python3 single_route/evaluate.py --input_field generate_results
python3 personalized/evaluate.py --input_field generate_results
python3 diversity/evaluate.py --input_field generate_results
python3 general_llm/evaluate.py

The default --input of each script is already configured to the corresponding example CSV.

What Each Evaluator Covers

Scenario	Script	Extra Logic
Single-route planning	`single_route/evaluate.py`	4-round core funnel
Personalized planning	`personalized/evaluate.py`	Adds Round 5: Preference Compliance
Route diversity	`diversity/evaluate.py`	Adds Best-Match and Route Diversity (RD)
General LLM	`general_llm/evaluate.py`	Uses remote route-eval API

Personalized Planning

Round 5 verifies whether the prediction satisfies user preference constraints such as:

fewer transfers
no subway
subway first
shorter travel time

Diversity Planning

Additional metrics include:

Best-Match: if the first route is not an exact match, search second and third for a reachable exact alternative
Route Diversity (RD): mean(1 - IoU(L_i, L_j)), with transfer modes included in the route signature

General LLM Evaluation

general_llm/evaluate.py additionally supports:

--api_url: defaults to TRANSIT_LM_API, then falls back to http://transit-lm.amap.com
--batch_size: default 50

Example command:

python3 general_llm/evaluate.py \
  --api_url http://transit-lm.amap.com \
  --batch_size 50

In this setting, the evaluation field is fixed to generate_results, and station sequences should use station names rather than stop_id.

Evaluation Pipeline

All scenarios share the following four core rounds:

Round	Check	What It Verifies
1	Reachability	Adjacent stations are connected in `next_stop_ids`
2	Station Grounding & Distance Plausibility	Verifies start/end grounding and the plausibility of transfer distances
3	Line IoU + Station IoU + Expert Score + Transfer Mode Consistency	Verifies structural consistency between prediction and reference
4	Estimation Accuracy	Verifies the accuracy of distance, time, fare, and transfer estimates

Round 2 thresholds:

walk: 3 km
bike: 5 km
taxi: 10 km

Round 4 tolerances:

distance: +-10% or 0.5 km
time: +-10% or 5 min
fare: +-10% or 1 CNY
transfer distance: +-0.5 km

Expert score formula used in Round 3:

S = T_sec / 300 + (N_lines + cycling_segments) + fare

Input Contract

CSV Fields

Common fields are as follows:

Field	Description
`index_id`	Sample identifier
`sft_prompt`	Prompt JSON string, usually including query / start / end / city
`sft_label`	Ground-truth route JSON
`generate_results`	Model output JSON

Benchmark 2 also requires:

Field	Description
`req_type`	Preference type, such as `2`, `5`, `7`, `8`

Route JSON

Single-route tasks expect a JSON object of the following form:

{
  "station_sequence": ["stop_id_1", "stop_id_2", "stop_id_3"],
  "line_sequence": ["Line A", "Line B"],
  "total_distance": "12.5",
  "total_time": "35",
  "total_fare": "4",
  "start_transfer_mode": "步行",
  "start_transfer_distance": "0.8",
  "end_transfer_mode": "步行",
  "end_transfer_distance": "0.6"
}

Diversity tasks expect a multi-route object of the following form:

{
  "first": {
    "station_sequence": ["stop_id_1", "stop_id_2"],
    "line_sequence": ["Line A"],
    "total_distance": "8.2",
    "total_time": "24",
    "total_fare": "3"
  },
  "second": {
    "station_sequence": ["stop_id_3", "stop_id_4"],
    "line_sequence": ["Line B"],
    "total_distance": "9.0",
    "total_time": "27",
    "total_fare": "4"
  },
  "third": {
    "station_sequence": ["stop_id_5", "stop_id_6"],
    "line_sequence": ["Line C"],
    "total_distance": "10.5",
    "total_time": "30",
    "total_fare": "4"
  }
}

For the general LLM setting:

labels still use station_sequence
prediction-side station sequences should contain station names
both station_sequence and station_name are accepted
the remote evaluator primarily relies on boarding / alighting stations and normalized transfer structure

Key Output Signals

The scripts print round-by-round summaries to stdout. In practice, the following signals are usually the most informative:

reachability pass rate
number of samples with station_iou == 1
overall accuracy
preference compliance for benchmark 2
best-match hits and route diversity for benchmark 3

Repository Layout

.
├── common.py
├── single_route/evaluate.py
├── personalized/evaluate.py
├── diversity/evaluate.py
├── general_llm/evaluate.py
└── data/
    ├── station_info.csv
    ├── benchmark1_single_route_example.csv
    ├── benchmark2_personalized_example.csv
    ├── benchmark3_diversity_example.csv
    └── general_llm_example.csv

Notes

common.py automatically converts coord_x / coord_y into internal longitude / latitude values
next_hop_stations is normalized into next_stop_ids
if station_name exists in station_info.csv, it is loaded automatically

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TransitLM Route Evaluation

Paper And Resources

Why This Repo

Quick Start

What Each Evaluator Covers

Personalized Planning

Diversity Planning

General LLM Evaluation

Evaluation Pipeline

Input Contract

CSV Fields

Route JSON

Key Output Signals

Repository Layout

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
diversity		diversity
general_llm		general_llm
personalized		personalized
single_route		single_route
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md
common.py		common.py

Folders and files

Latest commit

History

Repository files navigation

TransitLM Route Evaluation

Paper And Resources

Why This Repo

Quick Start

What Each Evaluator Covers

Personalized Planning

Diversity Planning

General LLM Evaluation

Evaluation Pipeline

Input Contract

CSV Fields

Route JSON

Key Output Signals

Repository Layout

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages