csv-stream-diff

csv-stream-diff compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.

Features

Compare CSVs by configurable key columns, even when left and right headers differ
Stream files in chunks with configurable chunk_size
Partition by stable hashed key to keep worker memory bounded
Use all CPUs by default, or set a worker count explicitly
Write machine-readable output artifacts for left-only rows, right-only rows, cell differences, duplicate keys, and a run summary
Support exact random sampling for validation runs with sampling.size > 0
Warn on duplicate keys and continue using the first occurrence per key
Include a fixture generator and both pytest and behave tests
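
The streaming and partitioning features above can be illustrated with a minimal sketch. It is not the package's actual implementation: the function names `bucket_for` and `partition_rows`, the choice of BLAKE2b, and the in-memory bucket lists are assumptions made for illustration (a tool built for large files would more plausibly spill each bucket to a temporary file). The point it shows is that a stable hash sends every row with the same key to the same bucket, so each worker only needs to hold one bucket at a time.

```python
import csv
import hashlib
import io


def bucket_for(key_values, bucket_count):
    """Map a key tuple to a bucket index with a hash that is stable across processes.

    Python's built-in hash() is salted per process for strings, so a
    hashlib digest is used instead.
    """
    joined = "\x1f".join(key_values).encode("utf-8")
    digest = hashlib.blake2b(joined, digest_size=8).digest()
    return int.from_bytes(digest, "big") % bucket_count


def partition_rows(handle, key_columns, bucket_count):
    """Stream rows from an open CSV handle and group them by hashed key bucket."""
    buckets = {index: [] for index in range(bucket_count)}
    for row in csv.DictReader(handle):  # reads one row at a time
        key = tuple(row[column] for column in key_columns)
        buckets[bucket_for(key, bucket_count)].append(row)
    return buckets


# Hypothetical sample data; both rows with id "1" land in the same bucket.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n1,d\n")
buckets = partition_rows(sample, ["id"], 4)
```

Because left and right files are partitioned with the same function, matching keys from both sides end up in the same bucket number and can be compared independently of all other buckets.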

Installation

pip install csv-stream-diff

For local development:

poetry install

CLI

csv-stream-diff --config config.yaml

Optional overrides:

csv-stream-diff \
  --config config.yaml \
  --left-file ./left.csv \
  --right-file ./right.csv \
  --chunk-size 100000 \
  --sample-size 100000 \
  --sample-seed 20260321 \
  --workers 8 \
  --output-dir ./output \
  --output-prefix run_

The YAML config is the default source of truth. CLI flags override it for a single run.
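
The override rule can be sketched as follows. This is an assumption-laden illustration, not the tool's real code: the flag names come from the example above, `sampling.size`, `sampling.seed`, and `output.directory` are named in this README, but `performance.chunk_size` and `performance.workers` are guessed key names, and `apply_overrides` is a hypothetical helper.

```python
import argparse

# Flag destination -> (config section, key).
# Entries marked "assumed" use guessed key names.
OVERRIDES = {
    "chunk_size": ("performance", "chunk_size"),  # assumed
    "workers": ("performance", "workers"),        # assumed
    "sample_size": ("sampling", "size"),
    "sample_seed": ("sampling", "seed"),
    "output_dir": ("output", "directory"),
}


def build_parser():
    parser = argparse.ArgumentParser(prog="csv-stream-diff")
    parser.add_argument("--config")
    parser.add_argument("--chunk-size", type=int)
    parser.add_argument("--workers", type=int)
    parser.add_argument("--sample-size", type=int)
    parser.add_argument("--sample-seed", type=int)
    parser.add_argument("--output-dir")
    return parser


def apply_overrides(config, args):
    """Return a copy of config in which every flag that was passed wins."""
    merged = {section: dict(values) for section, values in config.items()}
    for dest, (section, key) in OVERRIDES.items():
        value = getattr(args, dest)
        if value is not None:  # flags left unset keep the YAML value
            merged.setdefault(section, {})[key] = value
    return merged


config = {"sampling": {"size": 0, "seed": 1}, "performance": {"workers": 4}}
args = build_parser().parse_args(["--sample-size", "1000", "--workers", "8"])
merged = apply_overrides(config, args)
```

The original `config` dictionary is left untouched, which matches the idea that a flag affects a single run and not the file on disk.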

Configuration

See config.example.yaml for a full example.

Main sections:

files.left, files.right: input CSV paths
csv.left, csv.right: dialect and encoding settings
keys.left, keys.right: key columns used to match rows
compare.left, compare.right: value columns to compare
comparison: normalization options
sampling: sampling.size: 0 runs the full comparison; any positive value draws an exact random sample of left-side unique keys, reproducible through a fixed sampling.seed
performance: chunking, worker count, bucket count, temp directory, progress reporting
output: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary

Output Files

The tool writes these artifacts to output.directory:

<prefix>only_in_left.csv
<prefix>only_in_right.csv
<prefix>differences.csv
<prefix>duplicate_keys.csv
<prefix>summary.json
<prefix>summary.txt when output.summary_format is text or both

differences.csv contains one row per differing cell with both the left and right column names and values.
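
To make the one-row-per-cell layout concrete, here is a small sketch of how such records could be produced. The field names (`key`, `left_column`, `right_column`, `left_value`, `right_value`) and the function `cell_differences` are assumptions for illustration; the actual header of differences.csv may differ.

```python
def cell_differences(key, left_row, right_row, column_pairs):
    """Yield one record per (left column, right column) pair whose values differ."""
    for left_col, right_col in column_pairs:
        left_val = left_row[left_col]
        right_val = right_row[right_col]
        if left_val != right_val:
            yield {
                "key": key,
                "left_column": left_col,
                "right_column": right_col,
                "left_value": left_val,
                "right_value": right_val,
            }


# Hypothetical rows whose headers differ between the two files.
left = {"id": "7", "name": "Ada", "city": "Paris"}
right = {"ID": "7", "full_name": "Ada", "town": "Lyon"}
pairs = [("name", "full_name"), ("city", "town")]
rows = list(cell_differences("7", left, right, pairs))
```

Here only the city/town pair differs, so a single record is produced for key 7 even though two columns were compared.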

Sampling

sampling.size: 0 runs the full comparison.
sampling.size > 0 selects an exact random sample of left-side unique keys using reservoir sampling.
Sampling is reproducible when sampling.seed stays the same.
Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
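
The sampling rules above can be sketched with the classic reservoir algorithm (Algorithm R). This is an illustration under assumptions, not the package's code: `sample_unique_keys` is a hypothetical name, and the `seen` set keeps every key in memory, which a tool aimed at very large files would likely avoid (for example by deduplicating within hashed buckets).

```python
import random


def sample_unique_keys(keys, size, seed):
    """Pick `size` keys uniformly at random from the unique keys in a stream."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the result
    seen = set()
    reservoir = []
    unique_count = 0
    for key in keys:
        if key in seen:
            continue  # later occurrences do not enlarge the population
        seen.add(key)
        unique_count += 1
        if len(reservoir) < size:
            reservoir.append(key)
        else:
            # Keep the new key with probability size / unique_count.
            slot = rng.randrange(unique_count)
            if slot < size:
                reservoir[slot] = key
    return reservoir


keys = ["a", "b", "a", "c", "d", "b", "e"]
first = sample_unique_keys(keys, 3, seed=20260321)
second = sample_unique_keys(keys, 3, seed=20260321)
```

Both calls return the same three keys because the seed is identical, and when `size` is at least the number of unique keys, every unique key is returned.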

Duplicate Keys

Duplicate keys do not stop the run. They are written to duplicate_keys.csv, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
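
A minimal sketch of the first-occurrence rule, assuming rows arrive as dictionaries; `split_duplicates` is a hypothetical helper, not part of the package's documented API.

```python
def split_duplicates(rows, key_columns):
    """Keep the first row per key and collect every later row with a repeated key."""
    first_by_key = {}
    duplicates = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in first_by_key:
            duplicates.append(row)  # would be reported, not compared
        else:
            first_by_key[key] = row  # used by the main comparison
    return first_by_key, duplicates


rows = [
    {"id": "1", "value": "x"},
    {"id": "2", "value": "y"},
    {"id": "1", "value": "z"},
]
kept, dupes = split_duplicates(rows, ["id"])
```

In this example the comparison would see value x for key 1, while the row carrying z is set aside as a duplicate.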

Generator

The generator creates two CSVs from the same baseline, applies controlled mutations, writes a matching config, and saves a manifest of expected results:

python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42

Generated artifacts:

left.csv
right.csv
config.generated.yaml
expected.json

Tests

Run unit tests:

poetry run pytest

Run BDD acceptance tests:

poetry run behave tests/features

Run a package build:

poetry build

PyPI Packaging

Build source and wheel distributions:

poetry build

Upload after verifying artifacts:

poetry publish
