# GTFS Diff Engine

A memory-efficient Python library and CLI for comparing two GTFS feeds and producing a structured diff conforming to the [GTFS Diff v2 schema](https://github.com/MobilityData/gtfs_diff).

## Overview

GTFS Diff Engine compares two GTFS feeds (zip archives or directories) file-by-file and row-by-row, emitting a machine-readable JSON document that describes exactly what changed: which files were added or deleted, which columns appeared or disappeared, and which rows were inserted, removed, or modified (with before/after field values).

The output conforms to the **GTFS Diff v2 schema** maintained by MobilityData:
<https://github.com/MobilityData/gtfs_diff>

## Features

- **Memory-efficient streaming diff** — two-pass CSV indexing; no full in-memory table loads
- **Supports `.zip` archives and plain directories** — including zips with a single sub-directory layout
- **Row-level changes with primary key identification** — each change record includes the primary key fields for the affected row
- **Column-level change tracking** — columns added or deleted between feeds are reported with their original positions
- **Configurable row-changes cap** — limit output size per file; omitted changes are counted in a `Truncated` record
- **CLI and Python API** — use as a command-line tool or import directly in your code

## Installation

```bash
pip install gtfs-diff-engine
```

For a development (editable) install with test dependencies:

```bash
git clone https://github.com/your-org/gtfs_diff_engine
cd gtfs_diff_engine
pip install -e ".[dev]"
```

## Quick Start

```python
from gtfs_diff.engine import diff_feeds

result = diff_feeds("base.zip", "new.zip")
print(result.summary.total_changes)

# Save to JSON
with open("diff.json", "w") as f:
    f.write(result.model_dump_json(indent=2))
```

## CLI Usage

```
Usage: gtfs-diff [OPTIONS] BASE_FEED NEW_FEED

  Compare two GTFS feeds (zip or directory) and output a JSON diff.

  BASE_FEED: path to the base GTFS feed (zip or directory)
  NEW_FEED: path to the new GTFS feed (zip or directory)

Options:
  --version                  Show the version and exit.
  -o, --output FILE          Write JSON output to FILE instead of stdout.
  -c, --cap INTEGER          Max row changes per file (0 = omit row-level
                             detail).
  --pretty / --no-pretty     Pretty-print JSON (default: --pretty).
  --base-downloaded-at TEXT  ISO 8601 datetime for when base was downloaded.
  --new-downloaded-at TEXT   ISO 8601 datetime for when new was downloaded.
  --help                     Show this message and exit.
```

**Examples:**

```bash
# Basic usage — print diff to stdout
gtfs-diff base.zip new.zip

# Cap row changes to 500 per file
gtfs-diff --cap 500 base.zip new.zip

# Save output to a file
gtfs-diff -o diff.json base.zip new.zip

# Omit row-level detail (column diffs and counts are still computed)
gtfs-diff --cap 0 base.zip new.zip

# With feed download timestamps
gtfs-diff --base-downloaded-at 2024-01-01T00:00:00Z \
          --new-downloaded-at 2024-06-01T00:00:00Z \
          base.zip new.zip
```

## Python API Reference

### `diff_feeds()`

```python
def diff_feeds(
    base_path: str | Path,
    new_path: str | Path,
    row_changes_cap_per_file: int | None = None,
    base_downloaded_at: datetime | None = None,
    new_downloaded_at: datetime | None = None,
) -> GtfsDiff
```

| Parameter | Type | Description |
|---|---|---|
| `base_path` | `str \| Path` | Path to the base (old) GTFS feed — zip or directory |
| `new_path` | `str \| Path` | Path to the new GTFS feed — zip or directory |
| `row_changes_cap_per_file` | `int \| None` | `None` = include all; `0` = omit row detail; `N` = cap at N per file |
| `base_downloaded_at` | `datetime \| None` | When the base feed was downloaded (defaults to now) |
| `new_downloaded_at` | `datetime \| None` | When the new feed was downloaded (defaults to now) |

**Returns:** a `GtfsDiff` Pydantic model with three top-level fields:

| Field | Type | Description |
|---|---|---|
| `metadata` | `Metadata` | Schema version, timestamps, feed sources, unsupported files |
| `summary` | `Summary` | Aggregate counts of changed files, rows, columns |
| `file_diffs` | `list[FileDiff]` | Per-file diff records |

## Supported GTFS Files

| File | Primary Key |
|---|---|
| `agency.txt` | `agency_id` |
| `stops.txt` | `stop_id` |
| `routes.txt` | `route_id` |
| `trips.txt` | `trip_id` |
| `stop_times.txt` | `trip_id`, `stop_sequence` |
| `calendar.txt` | `service_id` |
| `calendar_dates.txt` | `service_id`, `date` |
| `fare_attributes.txt` | `fare_id` |
| `fare_rules.txt` | `fare_id`, `route_id`, `origin_id`, `destination_id`, `contains_id` |
| `shapes.txt` | `shape_id`, `shape_pt_sequence` |
| `frequencies.txt` | `trip_id`, `start_time` |
| `transfers.txt` | `from_stop_id`, `to_stop_id`, `from_route_id`, `to_route_id`, `from_trip_id`, `to_trip_id` |
| `pathways.txt` | `pathway_id` |
| `levels.txt` | `level_id` |
| `feed_info.txt` | *(all columns — single-row file)* |
| `translations.txt` | `table_name`, `field_name`, `language`, `record_id`, `record_sub_id`, `field_value` |
| `attributions.txt` | `attribution_id` |
| `areas.txt` | `area_id` |
| `stop_areas.txt` | `area_id`, `stop_id` |
| `networks.txt` | `network_id` |
| `route_networks.txt` | `route_id` |
| `fare_media.txt` | `fare_media_id` |
| `fare_products.txt` | `fare_product_id` |
| `fare_leg_rules.txt` | `leg_group_id` |
| `fare_transfer_rules.txt` | `from_leg_group_id`, `to_leg_group_id`, `transfer_count`, `duration_limit` |
| `timeframes.txt` | `timeframe_group_id`, `start_time`, `end_time`, `service_id` |
| `rider_categories.txt` | `rider_category_id` |
| `booking_rules.txt` | `booking_rule_id` |
| `location_groups.txt` | `location_group_id` |
| `location_group_stops.txt` | `location_group_id`, `stop_id` |

Files not in this table (e.g. GeoJSON flex locations) are recorded in `metadata.unsupported_files` and skipped.

## Output Schema

The output follows the GTFS Diff v2 schema. Below is a minimal example:

```json
{
  "metadata": {
    "schema_version": "2.0",
    "generated_at": "2024-06-01T12:00:00Z",
    "row_changes_cap_per_file": null,
    "base_feed": { "source": "base.zip", "downloaded_at": "2024-01-01T00:00:00Z" },
    "new_feed": { "source": "new.zip", "downloaded_at": "2024-06-01T00:00:00Z" },
    "unsupported_files": []
  },
  "summary": {
    "total_changes": 3,
    "files_added_count": 0,
    "files_deleted_count": 0,
    "files_modified_count": 1,
    "files": [
      {
        "file_name": "stops.txt",
        "status": "modified",
        "columns_added_count": 0,
        "columns_deleted_count": 0,
        "rows_added_count": 1,
        "rows_deleted_count": 0,
        "rows_modified_count": 2
      }
    ]
  },
  "file_diffs": [
    {
      "file_name": "stops.txt",
      "file_action": "modified",
      "columns_added": [],
      "columns_deleted": [],
      "row_changes": {
        "primary_key": ["stop_id"],
        "columns": ["stop_id", "stop_name", "stop_lat", "stop_lon"],
        "added": [
          {
            "identifier": { "stop_id": "S999" },
            "raw_value": "S999,New Stop,48.8566,2.3522",
            "new_line_number": 42
          }
        ],
        "deleted": [],
        "modified": [
          {
            "identifier": { "stop_id": "S001" },
            "raw_value": "S001,Central Station,48.8600,2.3470",
            "base_line_number": 5,
            "new_line_number": 5,
            "field_changes": [
              { "field": "stop_name", "base_value": "Central Stn", "new_value": "Central Station" }
            ]
          }
        ]
      },
      "truncated": null
    }
  ]
}
```
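
Because the output is plain JSON, downstream tools can consume it without this library. A minimal sketch (standard library only, operating on a hypothetical trimmed-down document in the same shape as above):

```python
import json

# A hypothetical, trimmed-down diff document in the shape shown above.
doc = json.loads("""
{
  "summary": { "total_changes": 3 },
  "file_diffs": [
    {
      "file_name": "stops.txt",
      "file_action": "modified",
      "row_changes": {
        "added":    [ { "identifier": { "stop_id": "S999" } } ],
        "deleted":  [],
        "modified": [ { "identifier": { "stop_id": "S001" } },
                      { "identifier": { "stop_id": "S002" } } ]
      }
    }
  ]
}
""")

# Summarise row-level changes per file.
for fd in doc["file_diffs"]:
    rc = fd["row_changes"]
    print(f"{fd['file_name']}: "
          f"+{len(rc['added'])} -{len(rc['deleted'])} ~{len(rc['modified'])}")
```

This prints one line per file, e.g. `stops.txt: +1 -0 ~2`.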

## Memory Efficiency

The engine uses a **streaming two-pass algorithm**:

1. **Pass 1 (base feed):** stream the CSV line by line, building a `primary_key_tuple → (line_number, raw_csv_string)` in-memory index.
2. **Pass 2 (new feed):** same, producing a second index.
3. **Set arithmetic:** `added = new_keys − base_keys`, `deleted = base_keys − new_keys`, `common = intersection`.
4. **Modified detection:** for keys in `common`, parse the stored raw lines and compare only the *shared* columns — this avoids flagging every row as changed when a column is added or removed.

Only the raw CSV strings are stored in the index (not parsed dicts), keeping memory proportional to the number of rows rather than rows × columns.
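
The steps above can be sketched with the standard library. This is a simplified illustration, not the engine's internals: it uses in-memory CSVs, a single-column key, and whole-line comparison in place of the shared-column comparison of step 4; `index_feed` is an illustrative name.

```python
import csv
import io

def index_feed(text: str, pk_cols: list[str]) -> dict:
    """One streaming pass: pk_tuple -> (line_number, raw_csv_string)."""
    reader = csv.DictReader(io.StringIO(text))
    index = {}
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        key = tuple(row[c] for c in pk_cols)
        index[key] = (line_no, ",".join(row.values()))  # real engine re-serialises via csv.writer
    return index

base = "stop_id,stop_name\nS001,Central Stn\nS002,Museum\n"
new = "stop_id,stop_name\nS001,Central Station\nS003,Airport\n"

base_idx = index_feed(base, ["stop_id"])
new_idx = index_feed(new, ["stop_id"])

added = new_idx.keys() - base_idx.keys()
deleted = base_idx.keys() - new_idx.keys()
common = base_idx.keys() & new_idx.keys()
# Whole-line comparison here; the real engine compares shared columns only.
modified = {k for k in common if base_idx[k][1] != new_idx[k][1]}
```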

> **Note:** For very large feeds (`stop_times.txt` with 10 M+ rows) the in-memory index may become a bottleneck. A disk-backed index (e.g. SQLite) would be more appropriate for production deployments at that scale; that optimisation is left as future work.

## Running Tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

See [LICENSE](LICENSE).

---

# Architecture

## Design Goals

- **Schema compliance** — all output conforms to the GTFS Diff v2 schema (<https://github.com/MobilityData/gtfs_diff>)
- **Memory efficiency** — stream CSV files rather than loading entire tables; store only compact raw-CSV strings in the key index
- **Clear public API** — a single `diff_feeds()` function returns a typed Pydantic model, suitable for programmatic use or JSON serialisation

## Module Structure

| Module | Responsibility |
|---|---|
| `engine.py` | Core diff logic: feed opener, CSV indexing, per-file diff, public `diff_feeds()` function |
| `models.py` | Pydantic v2 data models for the GTFS Diff v2 output format (`GtfsDiff`, `FileDiff`, `RowChanges`, etc.) |
| `gtfs_definitions.py` | Static registry of supported GTFS files and their primary key columns; `get_primary_key()` helper |
| `cli.py` | Click-based CLI entry point (`gtfs-diff`); thin wrapper around `diff_feeds()` |

## Streaming Algorithm

The per-file diff (`_diff_file`) operates in two streaming passes:

### Pass 1 — index the base feed

```
for each row in base CSV:
    pk_tuple = tuple(row[col] for col in pk_columns)
    index[pk_tuple] = (line_number, raw_csv_string)
```

Only the compact `raw_csv_string` (re-serialised with `csv.writer`) is stored — not the parsed dict — to keep memory proportional to row count rather than row count × column count.

### Pass 2 — index the new feed

Identical pass over the new CSV, producing a second index.

### Set arithmetic

```
added_keys = new_keys - base_keys
deleted_keys = base_keys - new_keys
common_keys = base_keys & new_keys
```

### Modified detection

For each key in `common_keys`, the stored raw lines are parsed back to dicts and compared **on shared columns only**:

```python
shared_cols = [col for col in base_headers if col in new_header_set]
field_changes = [
    FieldChange(field=col, base_value=b_dict[col], new_value=n_dict[col])
    for col in shared_cols
    if b_dict[col] != n_dict[col]
]
```

Comparing only shared columns ensures that adding or removing a column from a file does not cause every existing row to appear as modified.
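
A small, self-contained illustration of the shared-column rule (plain dicts and tuples stand in for the parsed rows and the `FieldChange` model):

```python
# Illustrative data: one column deleted, one added, between the two feeds.
base_headers = ["stop_id", "stop_name", "zone_id"]        # zone_id deleted
new_headers = ["stop_id", "stop_name", "platform_code"]   # platform_code added
new_header_set = set(new_headers)

b_dict = {"stop_id": "S001", "stop_name": "Central Stn", "zone_id": "Z1"}
n_dict = {"stop_id": "S001", "stop_name": "Central Station", "platform_code": "1"}

shared_cols = [col for col in base_headers if col in new_header_set]
field_changes = [
    (col, b_dict[col], n_dict[col])
    for col in shared_cols
    if b_dict[col] != n_dict[col]
]
# Only stop_name is reported; the zone_id / platform_code difference is a
# column-level change, not a row modification.
```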

## Handling Edge Cases

### Files with no explicit primary key

`feed_info.txt` is a single-row file and `agency.txt` allows omitting `agency_id` when there is only one agency. Both are defined with an empty primary key list (`[]`) in `GTFS_PRIMARY_KEYS`. When the engine encounters an empty PK definition it falls back to using **all base-feed columns** as a composite key:

```python
if len(pk_def) == 0:
    pk_cols = initial_base_headers  # all columns form the key
```

### BOM handling

All files are opened with `encoding="utf-8-sig"`, which transparently strips the UTF-8 byte-order mark that some GTFS producers include.
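
For example (a sketch using a temporary file; `utf-8-sig` is a no-op when no BOM is present):

```python
import csv
import os
import tempfile

# Write a stops.txt that begins with a UTF-8 BOM, as some producers emit.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(b"\xef\xbb\xbfstop_id,stop_name\nS001,Central\n")
    path = f.name

# utf-8-sig strips the BOM transparently.
with open(path, encoding="utf-8-sig") as f:
    headers = next(csv.reader(f))
os.remove(path)

print(headers)  # ['stop_id', 'stop_name'], no '\ufeff' prefix
```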

### Malformed / short rows

Rows with fewer columns than the header are padded with empty strings; rows wider than the header are trimmed:

```python
# n is the number of columns in the header row
if len(row) < n:
    row = row + [""] * (n - len(row))
row_vals = row[:n]
```

### Zip archives with a sub-directory layout

Some producers wrap all `.txt` files inside a single subdirectory within the zip. The feed opener handles this by mapping `basename → internal_path` regardless of path depth:

```python
basename = member.rsplit("/", 1)[-1]
```
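
A sketch of the resulting mapping, using an in-memory zip with a wrapped layout:

```python
import io
import zipfile

# A feed zip whose .txt files sit inside one sub-directory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gtfs/stops.txt", "stop_id,stop_name\n")
    zf.writestr("gtfs/routes.txt", "route_id\n")

# Map basename -> internal path, regardless of directory depth.
with zipfile.ZipFile(buf) as zf:
    members = {
        member.rsplit("/", 1)[-1]: member
        for member in zf.namelist()
        if member.endswith(".txt")
    }

print(members)  # {'stops.txt': 'gtfs/stops.txt', 'routes.txt': 'gtfs/routes.txt'}
```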

## Cap and Truncation

When `row_changes_cap_per_file` is set to a positive integer `N`, row changes are filled in priority order until the cap is reached:

1. **Added** rows (up to `N`)
2. **Deleted** rows (up to `N − len(added)`)
3. **Modified** rows (up to `N − len(added) − len(deleted)`)

When `cap=0`, row-level detail is omitted entirely (column diffs and true change counts are still computed).

If the true total exceeds the cap, a `Truncated` record is attached:

```json
"truncated": { "is_truncated": true, "omitted_count": 4321 }
```

The `omitted_count` reflects the number of row changes that were detected but not included in the output.
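
The fill-and-truncate logic can be sketched as follows (illustrative only; `apply_cap` and its signature are not the library's internals):

```python
def apply_cap(added, deleted, modified, cap):
    """Keep row changes in priority order up to cap; report what was omitted."""
    total = len(added) + len(deleted) + len(modified)
    if cap is None or total <= cap:
        return added, deleted, modified, None
    kept_added = added[:cap]
    kept_deleted = deleted[:cap - len(kept_added)]
    kept_modified = modified[:cap - len(kept_added) - len(kept_deleted)]
    omitted = total - len(kept_added) - len(kept_deleted) - len(kept_modified)
    return kept_added, kept_deleted, kept_modified, {
        "is_truncated": True,
        "omitted_count": omitted,
    }

kept_added, kept_deleted, kept_modified, truncated = apply_cap(
    ["a1", "a2"], ["d1"], ["m1", "m2"], cap=3
)
# Keeps a1, a2, d1 (added first, then deleted); omits both modified rows.
```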

## Column Union Ordering

When a file gains or loses columns between feeds, the `row_changes.columns` list uses a **union** ordering:

```
union_columns = base_headers + [col for col in new_headers if col not in base_header_set]
```

Base columns appear first (preserving their original order), followed by any new-only columns appended at the end. This ordering is used to align `raw_value` strings for both added and modified rows, so consumers can parse `raw_value` with a single consistent header list.
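
For example, with one column deleted and one added:

```python
base_headers = ["stop_id", "stop_name", "zone_id"]
new_headers = ["stop_id", "platform_code", "stop_name"]

base_header_set = set(base_headers)
union_columns = base_headers + [c for c in new_headers if c not in base_header_set]

print(union_columns)  # ['stop_id', 'stop_name', 'zone_id', 'platform_code']
```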

## Limitations and Future Work

- **Disk-backed index for huge feeds** — `stop_times.txt` can exceed 10 M rows; the in-memory index should be replaced with a SQLite-backed approach for production deployments at that scale.
- **Parallel file processing** — files within a feed are currently processed sequentially; parallel workers (e.g. `concurrent.futures.ThreadPoolExecutor`) could reduce wall-clock time for feeds with many files.
- **GeoJSON / Flex location support** — `locations.geojson` and other non-CSV GTFS Flex files are currently reported as unsupported; dedicated diff logic for these formats is left as future work.
