csv-split is a high-performance CSV splitter built in Rust. It is helpful for quickly
splitting large CSVs into multiple smaller CSVs from the command line.
This was built as a weekend project with the following goals in mind:
- Learn Rust
- Create a somewhat useful tool.
Currently, xsv is one of the best tools out there for CSV processing, and is fantastic.
Given the amount of development that has gone into it, it is surprising that csv-split
can give xsv a run for its money on smaller inputs. Because csv-split loads the entire
file into memory (at least in parallel mode, which can be disabled with --no-parallel),
it is predictably much slower on "large" (500+ MB) files.
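To illustrate why holding the whole file in memory makes a parallel mode straightforward, here is a minimal sketch of the chunking step. It is not csv-split's actual code: the function name `split_csv`, the choice to repeat the header line in every chunk, and the naive line-based parsing (which ignores quoted fields containing newlines) are all assumptions made for illustration.

```rust
/// Split CSV text into chunks of at most `max_rows` data rows each.
/// Assumption: the first line is a header and is repeated in every chunk.
/// Assumption: no record contains a newline inside a quoted field.
fn split_csv(input: &str, max_rows: usize) -> Vec<String> {
    assert!(max_rows > 0, "max_rows must be at least 1");

    let mut lines = input.lines();
    let header = match lines.next() {
        Some(h) => h,
        None => return Vec::new(), // empty input: nothing to split
    };

    // The whole input is already in memory, so the rows can be sliced freely.
    let rows: Vec<&str> = lines.collect();

    rows.chunks(max_rows)
        .map(|chunk| {
            let mut out = String::from(header);
            out.push('\n');
            for row in chunk {
                out.push_str(row);
                out.push('\n');
            }
            out
        })
        .collect()
}
```

Each returned chunk is an independent `String`, so the chunks could be handed to separate threads and written out without further coordination. The cost is that the full input stays resident in memory, which matches the slowdown described above for large files.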
Building this project from source requires Cargo and can be done as follows:
git clone git@github.com:pranavmk98/csv-split
cd csv-split
cargo build --release
Compilation will likely take a few minutes. The binary will end up at ./target/release/csv-split.
USAGE:
    csv-split [FLAGS] [OPTIONS] <file>

ARGS:
    <file>

FLAGS:
    -h, --help           Prints help information
    -n, --no-parallel    Run splitting sequentially
    -V, --version        Prints version information

OPTIONS:
    -m, --max-rows <max-rows>        Max rows per file [default: 1000]
    -o, --output-dir <output-dir>    Output directory for generated CSVs [default: output/data]
Example: ./csv-split data.csv --max-rows 500
Some rough benchmarking was performed using the worldcitiespop.csv dataset from
the Data Science Toolkit project, which is about 125 MB and contains approximately
2.7 million rows.
The compared splitters were xsv (written in Rust) and a CSV splitter by PerformanceHorizonGroup (written in C).
These benchmarks ran on my admittedly underpowered machine with an Intel i5-8250U (4 Cores, 8 Threads) and 8GB of memory.
Tool | time Output |
---|---|
csv-split | 0.36s user 0.36s system 170% cpu 0.428 total |
xsv | 0.58s user 0.09s system 99% cpu 0.666 total |
csv-split (in C) | 1.16s user 0.07s system 107% cpu 1.150 total |
Ideally, the entire CSV would not be loaded into memory at once. Once I learn some more about concurrency in Rust, I would like to take a stab at a more threadpool-like structure, firing off a thread for each batch to process.
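One possible shape for that idea, offered as a rough sketch rather than a plan for this project: read the input line by line, group rows into batches, and pass each batch through a bounded channel to a small fixed set of worker threads. Everything here is hypothetical, including the name `split_streaming`, the `workers` parameter, and the `handle` callback that stands in for writing one output file. Only the standard library is used, and parsing is again naive (no quoted newlines).

```rust
use std::io::{self, BufRead};
use std::sync::mpsc::sync_channel;
use std::sync::Mutex;
use std::thread;

/// Stream `reader`, batch data rows into groups of `max_rows`, and let
/// `workers` threads process the batches. Returns the number of batches.
/// `handle(index, header, rows)` is a placeholder for the per-batch work.
fn split_streaming<R, F>(
    reader: R,
    max_rows: usize,
    workers: usize,
    handle: F,
) -> io::Result<usize>
where
    R: BufRead,
    F: Fn(usize, &str, &[String]) + Sync,
{
    assert!(max_rows > 0 && workers > 0);

    let mut lines = reader.lines();
    let header = match lines.next() {
        Some(h) => h?,
        None => return Ok(0), // empty input
    };

    // Bounded queue: the reader blocks once `workers` batches are waiting,
    // so only a limited number of batches is in memory at any time.
    let (tx, rx) = sync_channel::<(usize, Vec<String>)>(workers);
    let rx = Mutex::new(rx);

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // The lock is held only while receiving, not while handling.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok((idx, rows)) => handle(idx, header.as_str(), rows.as_slice()),
                    Err(_) => break, // channel closed: no more batches
                }
            });
        }

        let mut sent = 0;
        let mut batch = Vec::with_capacity(max_rows);
        let mut result: io::Result<()> = Ok(());
        for line in lines {
            match line {
                Ok(row) => batch.push(row),
                Err(e) => {
                    result = Err(e);
                    break;
                }
            }
            if batch.len() == max_rows {
                let full = std::mem::replace(&mut batch, Vec::with_capacity(max_rows));
                tx.send((sent, full)).expect("worker threads exited early");
                sent += 1;
            }
        }
        if result.is_ok() && !batch.is_empty() {
            tx.send((sent, batch)).expect("worker threads exited early");
            sent += 1;
        }
        drop(tx); // close the channel so the workers leave their loops
        result.map(|()| sent)
    })
}
```

Using a fixed number of workers with a bounded queue, instead of spawning one thread per batch, keeps memory use roughly proportional to `workers * max_rows` rows. Whether this would actually beat the current in-memory approach is untested.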
Of course, PRs are always welcome :)