# csv-stream-diff

csv-stream-diff compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
## Features

- Compare CSVs by configurable key columns, even when left and right headers differ
- Stream files in chunks with a configurable `chunk_size`
- Partition by stable hashed key to keep worker memory bounded
- Use all CPUs by default, or set a worker count explicitly
- Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
- Support exact random sampling for validation runs with `sampling.size > 0`
- Warn on duplicate keys and continue using the first occurrence per key
- Include a fixture generator and both `pytest` and `behave` tests
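The stable hashed-key partitioning idea can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the use of MD5 and the default bucket count of 64 are assumptions made for the example.

```python
import hashlib

def bucket_for_key(key: str, bucket_count: int = 64) -> int:
    # Use a cryptographic digest so the mapping is stable across runs and
    # worker processes, unlike Python's built-in hash(), which is
    # randomized per interpreter session.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count

# Rows with the same key always land in the same bucket, so each worker
# only ever needs one bucket's worth of rows in memory at a time.
print(bucket_for_key("order-12345"))
```

Because the digest is deterministic, the same key on the left and right files is guaranteed to end up in the same bucket, which is what lets each bucket be compared independently.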
## Installation

```bash
pip install csv-stream-diff
```

For local development:
```bash
poetry install
```

## Usage

```bash
csv-stream-diff --config config.yaml
```

Optional overrides:
```bash
csv-stream-diff \
  --config config.yaml \
  --left-file ./left.csv \
  --right-file ./right.csv \
  --chunk-size 100000 \
  --sample-size 100000 \
  --sample-seed 20260321 \
  --workers 8 \
  --output-dir ./output \
  --output-prefix run_
```

## Configuration

The YAML config is the default source of truth. CLI flags override it for a single run.
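The precedence rule above (YAML as baseline, CLI flags winning for a single run) can be sketched like this. The merging helper is an illustrative assumption, not the tool's internal API; the flag names mirror the CLI shown above.

```python
import argparse

def merge_config(yaml_config: dict, cli_args: argparse.Namespace) -> dict:
    # Start from the YAML values, then let any explicitly passed CLI flag
    # override the corresponding key for this run only. argparse leaves a
    # flag's value as None when it was not passed.
    merged = dict(yaml_config)
    for key, value in vars(cli_args).items():
        if value is not None:
            merged[key] = value
    return merged

parser = argparse.ArgumentParser()
parser.add_argument("--chunk-size", type=int, dest="chunk_size")
parser.add_argument("--workers", type=int, dest="workers")
args = parser.parse_args(["--workers", "8"])

config = merge_config({"chunk_size": 100000, "workers": 4}, args)
print(config)  # chunk_size stays from YAML, workers comes from the CLI
```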
See `config.example.yaml` for a full example.

Main sections:
- `files.left`, `files.right`: input CSV paths
- `csv.left`, `csv.right`: dialect and encoding settings
- `keys.left`, `keys.right`: key columns used to match rows
- `compare.left`, `compare.right`: value columns to compare
- `comparison`: normalization options
- `sampling`: `size: 0` means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
- `performance`: chunking, worker count, bucket count, temp directory, progress reporting
- `output`: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
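A minimal config along these lines might look like the following. The top-level section names follow the list above, but the exact field names inside each section are assumptions for illustration; consult `config.example.yaml` for the authoritative layout.

```yaml
# Illustrative sketch only; see config.example.yaml for real field names.
files:
  left: ./left.csv
  right: ./right.csv
keys:
  left: [order_id]
  right: [order_id]
compare:
  left: [amount, status]
  right: [amount, status]
sampling:
  size: 0          # 0 = full comparison
  seed: 20260321
performance:
  chunk_size: 100000
  workers: 8
output:
  directory: ./output
  prefix: run_
```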
## Output

The tool writes these artifacts to `output.directory`:
- `<prefix>only_in_left.csv`
- `<prefix>only_in_right.csv`
- `<prefix>differences.csv`
- `<prefix>duplicate_keys.csv`
- `<prefix>summary.json`
- `<prefix>summary.txt` when `output.summary_format` is `text` or `both`
`differences.csv` contains one row per differing cell, with both the left and right column names and values.
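As an illustration, a `differences.csv` row might look like the following. The header names here are hypothetical, chosen only to show the shape described above; the actual header is defined by the tool.

```csv
key,left_column,right_column,left_value,right_value
order-12345,amount,amount_usd,10.00,10.50
```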
## Sampling

- `sampling.size: 0` runs the full comparison.
- `sampling.size > 0` selects an exact random sample of left-side unique keys using reservoir sampling.
- Sampling is reproducible when `sampling.seed` stays the same.
- Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
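Reservoir sampling yields an exact-size, reproducible sample in a single pass without holding every key in memory. A minimal sketch of the classic Algorithm R, assuming keys arrive as any iterable (this is illustrative, not the tool's internal code):

```python
import random

def reservoir_sample(keys, size, seed):
    # Algorithm R: keep the first `size` keys, then replace reservoir
    # entries with decreasing probability so every key seen so far is
    # equally likely to be in the sample.
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    reservoir = []
    for i, key in enumerate(keys):
        if i < size:
            reservoir.append(key)
        else:
            j = rng.randint(0, i)
            if j < size:
                reservoir[j] = key
    return reservoir

sample = reservoir_sample((f"key-{n}" for n in range(100_000)), size=5, seed=20260321)
print(sample)  # the same five keys on every run with this seed
```

Because the reservoir never grows past `size` entries, memory stays bounded even when the key population is enormous, which matches the tool's streaming design.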
## Duplicate keys

Duplicate keys do not stop the run. They are written to `duplicate_keys.csv`, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
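The first-occurrence rule can be sketched as follows. The `(key, row)` pair shape is an assumption for the example, not the tool's internal representation:

```python
def first_occurrence_per_key(rows):
    # rows: iterable of (key, row) pairs. Keep the first row seen for each
    # key; collect later occurrences as duplicates instead of failing.
    seen = {}
    duplicates = []
    for key, row in rows:
        if key in seen:
            duplicates.append((key, row))  # reported, not compared
        else:
            seen[key] = row
    return seen, duplicates

kept, dupes = first_occurrence_per_key(
    [("a", {"v": 1}), ("b", {"v": 2}), ("a", {"v": 3})]
)
print(kept)   # {'a': {'v': 1}, 'b': {'v': 2}}
print(dupes)  # [('a', {'v': 3})]
```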
## Fixture generator

The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
```bash
python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
```

Generated artifacts:
- `left.csv`
- `right.csv`
- `config.generated.yaml`
- `expected.json`
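The "controlled mutations" idea is: start from identical rows, then apply a seeded set of edits and record them, so the expected diff is known in advance. A sketch of that pattern, where the `id`/`amount` column names and the mutation type are assumptions made for the example:

```python
import random

def mutate_rows(rows, seed, change_rate=0.2):
    # Apply seeded, recorded mutations so the expected diff is known
    # before the comparison runs.
    rng = random.Random(seed)
    mutated = []
    expected = []  # ids of rows we deliberately changed
    for row in rows:
        row = dict(row)  # leave the baseline rows untouched
        if rng.random() < change_rate:
            row["amount"] = str(float(row["amount"]) + 1)
            expected.append(row["id"])
        mutated.append(row)
    return mutated, expected

rows = [{"id": str(i), "amount": "10"} for i in range(100)]
right_rows, expected_diffs = mutate_rows(rows, seed=42)
print(len(expected_diffs), "rows mutated")
```

A test harness can then assert that the tool's `differences.csv` matches exactly the recorded mutations, which is what an expected manifest like `expected.json` enables.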
## Testing

Run unit tests:

```bash
poetry run pytest
```

Run BDD acceptance tests:

```bash
poetry run behave tests/features
```

## Building and publishing

Build source and wheel distributions:

```bash
poetry build
```

Upload after verifying artifacts:

```bash
poetry publish
```