This is my contribution to the One Billion Row Challenge. The goal? Process 1 billion rows (1BRC) in the least amount of time. I'm taking this challenge to learn and hone my Rust skills.
You can find the rules and boundaries here https://www.morling.dev/blog/one-billion-row-challenge/
The original project has its own Java generator, but since this is a learning project, why not port it to Rust?
Update: It took me about 4 hours to port.
Update 2: Plus about 2 extra hours to make it work like the original script, including 10k+ stations for bigger inputs.
mkdir data
cargo run --bin create_measurements 1000 data/1k.txt
cargo run --bin create_measurements 10000 data/10k.txt
cargo run --bin create_measurements 100000 data/100k.txt
cargo run --bin create_measurements 1000000 data/1m.txt
cargo run --bin create_measurements 10000000 data/10m.txt
cargo run --bin create_measurements 100000000 data/100m.txt
cargo run --bin create_measurements 1000000000 data/1b.txt
Simply read the file line by line and use a map to keep track of each station individually. No concurrency and no clever tricks, just setting the baseline.
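For reference, this is roughly what that baseline looks like; a minimal sketch, not the actual binary, with illustrative names (a BTreeMap is used here since a later step swaps it for a HashMap):

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: naive <measurements file>");
    let reader = BufReader::new(File::open(path)?);

    // station -> (min, max, sum, count)
    let mut stats: BTreeMap<String, (f32, f32, f32, u32)> = BTreeMap::new();
    for line in reader.lines() {
        let line = line?;
        let (station, measure) = line.split_once(';').expect("malformed line");
        let value: f32 = measure.parse().expect("malformed measurement");
        let entry = stats
            .entry(station.to_string())
            .or_insert((f32::MAX, f32::MIN, 0.0, 0));
        entry.0 = entry.0.min(value);
        entry.1 = entry.1.max(value);
        entry.2 += value;
        entry.3 += 1;
    }

    // BTreeMap already iterates in key order, which the challenge output requires
    for (station, (min, max, sum, count)) in &stats {
        println!("{station}={min:.1}/{:.1}/{max:.1}", *sum / *count as f32);
    }
    Ok(())
}
```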
./src/bin/1_naive_implementation/run-all.sh 2>/dev/null
Running cargo with the --release flag has dramatically cut the running time of the script. For 10M rows the time went from 50 seconds down to 8.7s.
The results may vary, as the benchmarks were run while tens of Chrome tabs were open, along with VS Code and Slack running as an app.
./src/bin/2_optimize_build/run-all.sh 2>/dev/null
Input size | Naive solution | Release Build | + codegen-units = 1 |
---|---|---|---|
1k | 0.25s | 0.66s | 0.82s |
10k | 0.26s | 0.39s | 0.33s |
100k | 0.51s | 0.28s | 0.36s |
1m | 4.51s | 1.07s | 0.79s |
10m | 44.23s | 8.57s | 5.24s |
100m | 262.96s | 85.40s | 49.88s |
1b | - | 747.49s | 474.06s |
Some other build options will be added to increase the binary speed without touching the code (see the Cargo.toml sketch after the list):
- Set codegen-units = 1. This slows down compilation but allows the compiler to optimize better.
- Set LTO to thin; this decreases binary size and promises 10~20% better runtime speed.
- Set LTO to fat.
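These knobs live in the release profile of Cargo.toml; roughly, the additions tried in this step look like this:

```toml
# Cargo.toml — release profile tweaks for this step
[profile.release]
codegen-units = 1   # fewer codegen units: slower compile, better optimization
lto = "thin"        # or "fat" for whole-program LTO
```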
Input size | Naive solution | Release Build | codegen-units=1 | lto = thin | lto = fat |
---|---|---|---|---|---|
1k | 0.25s | 0.66s | 0.74s | 0.29s | 0.49s |
10k | 0.26s | 0.39s | 0.29s | 0.23s | 0.21s |
100k | 0.51s | 0.28s | 0.28s | 0.26s | 0.24s |
1m | 4.51s | 1.07s | 0.79s | 0.76s | 0.73s |
10m | 44.23s | 8.57s | 5.41s | 5.34s | 5.34s |
100m | 262.96s | 85.40s | 51.98s | 53.16s | 52.01s |
1b | 4306.22s | 747.49s | 474.06s | - | - |
Now testing different heap allocators, as suggested here: https://nnethercote.github.io/perf-book/build-configuration.html
./src/bin/3_heap_allocator_flags_pgo/run-all.sh 2>/dev/null
The first experiment uses tikv-jemallocator. A potential performance gain could come from enabling THP (Transparent Huge Pages), but as I'm on a Mac this is not possible.
Second, we tweak the program to use mimalloc as the allocator. As their behavior and performance might change according to the shape of the program, I'm considering changing the run-all.sh script to test both in every workload run.
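Swapping the global allocator is a small change in Rust; a sketch showing both crates (only one #[global_allocator] can be active at a time):

```rust
// With tikv-jemallocator:
// use tikv_jemallocator::Jemalloc;
// #[global_allocator]
// static GLOBAL: Jemalloc = Jemalloc;

// With mimalloc:
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```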
Input size | codegen-units=1+lto fat | tikv-jemallocator | mimalloc |
---|---|---|---|
1k | 0.49s | 0.27s | 0.32s |
10k | 0.21s | 0.26s | 0.21s |
100k | 0.24s | 0.23s | 0.33s |
1m | 0.73s | 0.64s | 0.65s |
10m | 5.34s | 4.19s | 4.14s |
100m | 51.01s | 39.51s | 40.16s |
1b | - | 405.17s | 410.24s |
Both allocators show similar performance, so I'll test each one depending on my code implementation. There's also the possibility of using platform-specific compilation flags, but running diff <(rustc --print cfg) <(rustc --print cfg -C target-cpu=native) showed no difference.
PGO (profile-guided optimization) is an advanced technique where we run the code in instrumented mode and create profiles that can later be fed back to the compiler. I'll use this technique now, but I don't like the idea of wiring it into my pipeline just yet, at least not before I start getting my hands dirty with the code. More details here: https://doc.rust-lang.org/rustc/profile-guided-optimization.html
rm -rf /tmp/pgo-data
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release --target=aarch64-apple-darwin
./target/aarch64-apple-darwin/release/3_heap_allocator_flags_pgo ./data/1m.txt
./target/aarch64-apple-darwin/release/3_heap_allocator_flags_pgo ./data/10m.txt
./target/aarch64-apple-darwin/release/3_heap_allocator_flags_pgo ./data/100m.txt
~/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/aarch64-apple-darwin/bin/llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release --target=aarch64-apple-darwin
time ./target/aarch64-apple-darwin/release/3_heap_allocator_flags_pgo ./data/1b.txt
Input size | tikv-jemallocator | +PGO | mimalloc | +PGO | mimalloc+flags |
---|---|---|---|---|---|
1m | 0.64s | 0.412s | 0.65s | 0.404s | 0.48s |
10m | 4.19s | 3.8s | 4.14s | 3.789s | 4.0s |
100m | 39.51s | 37.167s | 40.16s | 37.17s | 39s |
1b | 405.17s | 390.291s | 410.24s | 380s | 390s |
For the record, I noticed some performance loss when removing the power cable and running just on battery, so from now on every benchmark will be run with the cable connected.
Keeping all the above compile-time optimizations except PGO, and on top of that taking the following steps, allowed me to cut runtime significantly:
- Replace the BTreeMap with a HashMap.
- Use itertools to sort the keys for the output (sketched below).
- Handle strings and map keys better.
```rust
let mut map: HashMap<String, (f32, f32, f32, i32)> = HashMap::new();
for line in reader.lines() {
    let line = line.unwrap();
    let parts: Vec<&str> = line.split(';').collect();        // "station;measure"
    let v = parts[1].parse::<f32>().unwrap();
    let station = map
        .entry(parts[0].to_string())
        .or_insert((f32::MAX, f32::MIN, 0.0, 0));             // (min, max, sum, count)
    // [...] update min, max, sum and count for this station
}
```
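With a HashMap the keys no longer come out sorted for free, so the report step leans on itertools; something along these lines, reusing the tuple layout from the snippet above:

```rust
use itertools::Itertools;

// Emit the final report in station order; the HashMap itself stays unordered.
for name in map.keys().sorted() {
    let (min, max, sum, count) = map[name];
    println!("{name}={min:.1}/{:.1}/{max:.1}", sum / count as f32);
}
```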
Never mind, the extra 10% performance from PGO is too much to be ignored, so I'm adding it to this comparison as well.
Input size | mimalloc+flags | +HashMap | +PGO |
---|---|---|---|
1m | 0.48s | 0.21s | 0.15s |
10m | 4.0s | 1.26s | 0.915s |
100m | 39s | 10.21s | 8.72s |
1b | 390s | 103.62s | 89s |
The hashing in the stdlib trades off performance for security, as having a "weak" hashing policy can make the system vulnerable to DoS (Denial of Service) attacks if an attacker can force collisions to degrade the application's performance. As I'm not processing user input but my own randomly generated data, that type of safety is not my concern.
I used FxHashMap, which has the weakest hashing and the fastest performance.
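The change itself is tiny; a sketch of the swap (the rustc-hash crate provides FxHashMap):

```rust
use rustc_hash::FxHashMap;

// Same map shape as before, just a faster, non-DoS-resistant hasher.
let mut map: FxHashMap<String, (f32, f32, f32, i32)> = FxHashMap::default();
```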
Input size | base | FxHashMap | +PGO |
---|---|---|---|
1m | 0.13s | 0.13s | 0.15s |
10m | 1.01s | 0.93s | 0.915s |
100m | 10.59s | 9.29s | 8.72s |
1b | 99.24s | 91.56s | 89s |
I tried using a hashmap of i128 values and squeezing all four variables into it, using 16 bits each for min/max, 32 bits for the measurement count and 64 bits for the total sum. I used an offset for min/max to simplify the calculations without relying on negative values.
To avoid floating point operations I converted the measurements to integers, since by definition there's only one digit after the dot, and I just need to divide the values by 10 before calculating the mean. Besides making the code a little more performant, I noticed my program was suffering from floating point imprecision: some stations showed a 0.1 discrepancy.
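A sketch of both ideas; the exact bit layout and offset value here are my own guesses for illustration, not the actual code:

```rust
// Temperatures become integers in tenths of a degree: "-12.3" -> -123.
// The challenge guarantees exactly one digit after the dot.
fn parse_tenths(s: &[u8]) -> i32 {
    let (neg, digits) = match s.first() {
        Some(b'-') => (true, &s[1..]),
        _ => (false, s),
    };
    let mut v = 0i32;
    for &b in digits {
        if b != b'.' {
            v = v * 10 + (b - b'0') as i32;
        }
    }
    if neg { -v } else { v }
}

// Assumed i128 layout: [ sum: 64 bits | count: 32 bits | max: 16 bits | min: 16 bits ].
// min/max carry an offset so they never go negative (tenths range is -999..=999).
const OFFSET: i32 = 1000;

fn pack(min: i32, max: i32, count: u32, sum: i64) -> i128 {
    ((sum as i128) << 64)
        | ((count as i128) << 32)
        | (((max + OFFSET) as i128) << 16)
        | ((min + OFFSET) as i128)
}

fn unpack(v: i128) -> (i32, i32, u32, i64) {
    (
        (v & 0xFFFF) as i32 - OFFSET,
        ((v >> 16) & 0xFFFF) as i32 - OFFSET,
        ((v >> 32) & 0xFFFF_FFFF) as u32,
        (v >> 64) as i64,
    )
}
```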
./src/bin/4_better_ds/run-all.sh 2>/dev/null
It's time to think about all the CPUs the machine has to offer. I'll start by creating a worker pool with one worker per CPU and leave the main thread to read rows in chunks.
The first attempt was to try the built-in sync::mpsc, but it has so many sharp edges that I gave up and decided to try crossbeam channels as the thread-communication solution. My first implementation had very degraded performance, usually 10x worse for small inputs and about 3~4x slower for the 1B rows. I made some optimizations trying to reduce the amount of time spent copying things between threads.
I haven't been able to go beyond my previous record, and I assume it's because the bottleneck is the input, not the processing. Making the reader thread use try_send() (non-blocking) instead of send() gave a significant performance improvement.
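A hypothetical sketch of that reader/worker shape with crossbeam channels; how the Full case is handled here is a choice made for the sketch, not necessarily what the actual binary does:

```rust
use crossbeam_channel::{bounded, TrySendError};
use std::thread;

// One reader fans batches of lines out to worker threads over a bounded channel.
// `batches` stands in for chunks of lines read from the input file.
fn fan_out(batches: Vec<Vec<String>>, workers: usize) {
    let (tx, rx) = bounded::<Vec<String>>(workers * 4);

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                for batch in rx.iter() {
                    // ... update a thread-local (station -> min/max/sum/count) map here ...
                    let _ = batch;
                }
            })
        })
        .collect();

    for mut batch in batches {
        // try_send never blocks the reader; on a full channel, yield and retry.
        loop {
            match tx.try_send(batch) {
                Ok(()) => break,
                Err(TrySendError::Full(returned)) => {
                    batch = returned;
                    thread::yield_now();
                }
                Err(TrySendError::Disconnected(_)) => return,
            }
        }
    }

    drop(tx); // close the channel so the workers' iterators end
    for h in handles {
        h.join().unwrap();
    }
}
```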
Apparently spawning a worker thread for each CPU isn't that optimal at this point. I ran the code with just 1 worker thread and got a better result than running with all CPUs or even 2 threads per CPU. In retrospect this makes sense, as I have a single thread writing to the channel and effectively just one consumer.
Replacing the String with a Vec<u8> as the key for the stations gave me good results. Once, randomly, I got times as low as 38s, but I'm not able to reproduce that consistently.
Input size | Base | Naive threads | try_send, 8 threads | 1 thread | +PGO |
---|---|---|---|---|---|
1m | 0.13s | 1.12s | 0.62s | 0.64s | 0.13s |
10m | 0.93s | 3.0s | 1.22s | 0.93s | 0.75s |
100m | 9.29s | 30.2s | 9.73s | 7.51s | 7.12s |
1b | 91.56s | 300.0s | 118.86s | 74.09s | 71s |
The bottleneck was in a single thread reading and fanning the lines out through channels to the other threads. I changed the program to spawn many threads and put each one in charge of a slice of the file. Each thread reads a chunk of data and parses it into station/measure values until its partition ends. This required careful handling of the partition boundaries, especially when a measurement lay across two partitions. I tested smaller chunk sizes (around 1k) and bigger chunk sizes (about 64M); the latter approach was faster, but the former had good results without slowing down my laptop. The best results were in the 16-thread range with a reasonable chunk size (4MB).
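The boundary handling boils down to: split the file into roughly equal byte ranges, then push each boundary forward to just past the next newline so no measurement belongs to two partitions. A sketch, assuming the file ends with a newline as the generated data does:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Split a file of `len` bytes into `n` ranges, each ending right after a '\n'.
fn partitions(path: &str, n: u64) -> std::io::Result<Vec<(u64, u64)>> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    let approx = len / n;

    let mut ranges = Vec::with_capacity(n as usize);
    let mut start = 0u64;
    for i in 0..n {
        let mut end = if i == n - 1 { len } else { (i + 1) * approx };
        if end < len {
            // scan forward from the rough boundary to just past the next newline
            file.seek(SeekFrom::Start(end))?;
            let mut byte = [0u8; 1];
            while file.read(&mut byte)? == 1 && byte[0] != b'\n' {
                end += 1;
            }
            end += 1; // include the newline itself
            end = end.min(len);
        }
        ranges.push((start, end));
        start = end;
    }
    Ok(ranges)
}
```

Each thread then seeks to its start offset and parses lines until it reaches its end offset.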
Input size | Base | 8 thr, chunk=1k | 8 thr, chunk=4k | 8 thr, chunk=4M | 16 thr, chunk=4M | 16 thr, chunk=64M | 128 thr, chunk=64M | PGO |
---|---|---|---|---|---|---|---|---|
1m | 0.64s | 0.26s | 0.20s | 0.16s | 0.49s | 0.50s | 0.50s | 0.24s |
10m | 0.93s | 0.77s | 0.43s | 0.33s | 0.40s | 0.62s | 0.62s | 0.16s |
100m | 7.51s | 5.64s | 3.23s | 1.85s | 2.02s | 2.54s | 2.06s | 1.35s |
1b | 74.09s | 32.94s | 31.81s | 17.32s | 16.88s | 16.88s | 19.79s | 15.99s |
While still using the same idea, I changed my implementation to have the threads read the file sequentially instead of partitioning it up front. This way I get more cache hits, as sequential regions of the file are more likely to already be in the CPU caches.
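One possible shape for this, assuming a shared, mutex-guarded file handle that each worker pulls its next chunk from; the real implementation may coordinate the reads differently:

```rust
use std::fs::File;
use std::io::Read;
use std::sync::{Arc, Mutex};
use std::thread;

// Workers take turns reading the next sequential chunk, extending it to a newline.
fn run(path: &str, threads: usize, chunk_kb: usize) -> std::io::Result<()> {
    let file = Arc::new(Mutex::new(File::open(path)?));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let file = Arc::clone(&file);
            thread::spawn(move || {
                let mut buf = vec![0u8; chunk_kb * 1024];
                loop {
                    {
                        let mut f = file.lock().unwrap();
                        buf.resize(chunk_kb * 1024, 0);
                        let n = f.read(&mut buf).unwrap();
                        if n == 0 {
                            break; // EOF: nothing left to claim
                        }
                        buf.truncate(n);
                        // keep reading single bytes until the chunk ends on a newline
                        let mut byte = [0u8; 1];
                        while *buf.last().unwrap() != b'\n' && f.read(&mut byte).unwrap() == 1 {
                            buf.push(byte[0]);
                        }
                    } // lock released: parsing happens without blocking the other readers
                    // ... parse the lines in `buf` into a thread-local map ...
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    Ok(())
}
```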
Input size | Previous (16 thr, chunk=4M) | 12 thr, chunk=188k |
---|---|---|
1m | 0.49s | 0.25s |
10m | 0.40s | 0.34s |
100m | 2.02s | 1.64s |
1b | 16.88s | 14.80s |
The solution was updated to take the thread count and the chunk size (in kilobytes) as parameters.
./src/bin/5_multi_thread/run-all.sh 12 188 2>/dev/null
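A sketch of how those parameters might be read; the argument order and defaults are just illustrative:

```rust
use std::env;

// e.g. `5_multi_thread <file> <threads> <chunk_kb>`
fn params() -> (String, usize, usize) {
    let mut args = env::args().skip(1);
    let path = args.next().expect("missing input file");
    let threads = args.next().and_then(|s| s.parse().ok()).unwrap_or(12);
    let chunk_kb = args.next().and_then(|s| s.parse().ok()).unwrap_or(188);
    (path, threads, chunk_kb)
}
```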
Implementing my own hashmap, using a double-hashing approach to avoid collisions and an array instead of a HashMap, was tough: it took too many hours and was still buggy, causing segfaults with every input bigger than 1k lines.
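For the record, the technique looks roughly like this: a flat array of slots probed with two hash functions, where the second hash (forced to be odd) provides the probe step. This is a from-scratch sketch of the idea, not the abandoned code:

```rust
// One aggregation slot per station, stored directly in a flat array.
struct Slot {
    key: Vec<u8>,
    min: i32,
    max: i32,
    sum: i64,
    count: u32,
}

struct Table {
    slots: Vec<Option<Slot>>,
    mask: usize, // capacity is a power of two, so `& mask` replaces `% capacity`
}

impl Table {
    // Capacity should be well above the ~10k stations so probing always terminates.
    fn with_capacity_pow2(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Table { slots: (0..capacity).map(|_| None).collect(), mask: capacity - 1 }
    }

    // Double hashing: h1 picks the starting slot, an odd h2 picks the probe step,
    // which guarantees every slot is eventually visited in a power-of-two table.
    fn slot_index(&self, key: &[u8]) -> usize {
        let step = hash2(key) | 1;
        let mut idx = hash1(key) & self.mask;
        loop {
            match &self.slots[idx] {
                Some(s) if s.key.as_slice() != key => idx = (idx + step) & self.mask,
                _ => return idx, // empty slot or a matching key
            }
        }
    }
}

// FNV-1a for the primary hash; any independent mixing works for the secondary one.
fn hash1(key: &[u8]) -> usize {
    key.iter()
        .fold(0xcbf2_9ce4_8422_2325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3)) as usize
}

fn hash2(key: &[u8]) -> usize {
    key.iter()
        .fold(0x9e37_79b9_7f4a_7c15u64, |h, &b| (h.rotate_left(7) ^ b as u64).wrapping_mul(0xff51_afd7_ed55_8ccd)) as usize
}
```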