In #1 we realized that some GTFS datasets use IDs (e.g. shape_id) that are regenerated for every dataset version (see #4).
In that case the first version of the gtfs-diff engine cannot detect meaningful differences.
As a stop-gap measure, we should have a heuristic that quickly tells us whether the diff engine can be used on a given dataset.
Copilot Suggestion:
Do a cheap O(N) pre-flight check that scans only the id columns of each file (no full row parsing). For every file present in both feeds, we compute:
churn = size((base_ids OR new_ids) - (base_ids AND new_ids)) / size(base_ids OR new_ids)
If the weighted overall churn across all files reaches the threshold (50%, user-configurable), the diff is aborted with a clear error message listing the files with high churn. This prevents the engine from producing a meaningless diff when a publisher has regenerated all IDs between versions.
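
A minimal Python sketch of the pre-flight check described above, not the actual gtfs-diff implementation. The `ID_COLUMNS` mapping, the function names, and the choice to weight each file by the size of its ID union are all assumptions made for illustration; the issue does not specify how the overall churn is weighted.

```python
import csv
from pathlib import Path

# Hypothetical mapping of GTFS file name -> ID column (assumption; extend as needed).
ID_COLUMNS = {
    "stops.txt": "stop_id",
    "routes.txt": "route_id",
    "trips.txt": "trip_id",
    "shapes.txt": "shape_id",
}


def read_ids(path, column):
    """Collect the distinct values of one column in a single pass over the file."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None or column not in header:
            return set()
        idx = header.index(column)
        return {row[idx] for row in reader if len(row) > idx}


def churn(base_ids, new_ids):
    """Share of IDs that appear in only one of the two feeds."""
    union = base_ids | new_ids
    if not union:
        return 0.0
    return len(base_ids ^ new_ids) / len(union)


def preflight(base_dir, new_dir, threshold=0.5):
    """Return (overall churn, per-file churn, files at or above the threshold)."""
    per_file = {}
    changed = total = 0
    for name, column in ID_COLUMNS.items():
        base_path = Path(base_dir) / name
        new_path = Path(new_dir) / name
        if not (base_path.is_file() and new_path.is_file()):
            continue  # only files present in both feeds are compared
        base_ids = read_ids(base_path, column)
        new_ids = read_ids(new_path, column)
        per_file[name] = churn(base_ids, new_ids)
        # Assumed weighting: each file counts in proportion to its ID union.
        changed += len(base_ids ^ new_ids)
        total += len(base_ids | new_ids)
    overall = changed / total if total else 0.0
    high = sorted(n for n, c in per_file.items() if c >= threshold)
    return overall, per_file, high
```

A caller could abort when `overall >= threshold` and include the `high` list in the error message, so the user sees which files drove the result.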