This repository contains multiple implementations of the Billion Row Challenge - processing 1 billion weather measurements as fast as possible.
The challenge is to process a file containing approximately 1 billion weather station measurements, calculating the min, mean, and max temperature for each station.
Input format:
<station_name>;<temperature>
Output format:
{<station_name>=<min>/<mean>/<max>, ...}
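For example, given these three input lines (station names and values made up for illustration):

Hamburg;12.0
Bulawayo;8.9
Hamburg;34.2

the expected output would be:

{Bulawayo=8.9/8.9/8.9, Hamburg=12.0/23.1/34.2}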
V1, the first optimized implementation, uses memory mapping, parallel processing, and custom data structures to achieve good performance; a simplified sketch of this pattern follows the optimization list below.
Performance:
- Processing time: ~15.6 seconds
- Throughput: ~64 million rows/second
- Memory usage: Efficient
Key optimizations:
- Memory-mapped file access
- Parallel processing with goroutines
- Custom hash table for station lookups
- Zero-allocation string parsing
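The repository holds the actual V1 source; the sketch below only illustrates the memory-mapping and chunked-goroutine pattern listed above. The input file name, helper names, and the omitted per-chunk hash table are assumptions for illustration.

package main

import (
	"os"
	"runtime"
	"sync"
	"syscall"
)

// mmapFile maps the whole input file into memory so workers can scan it
// directly, with no per-read syscalls or buffer copies.
func mmapFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
}

// processChunk would parse "<station>;<temperature>\n" records and fold
// them into a per-goroutine hash table (omitted in this sketch).
func processChunk(data []byte) {}

func main() {
	data, err := mmapFile("measurements.txt") // assumed input file name
	if err != nil {
		panic(err)
	}
	workers := runtime.NumCPU()
	chunkSize := len(data) / workers
	var wg sync.WaitGroup
	start := 0
	for w := 0; w < workers; w++ {
		end := start + chunkSize
		if w == workers-1 || end >= len(data) {
			end = len(data)
		} else {
			// Round the boundary forward to the next newline so no
			// record is split between two workers.
			for end < len(data) && data[end] != '\n' {
				end++
			}
			if end < len(data) {
				end++ // include the newline in this chunk
			}
		}
		wg.Add(1)
		go func(part []byte) {
			defer wg.Done()
			processChunk(part)
		}(data[start:end])
		start = end
	}
	wg.Wait()
}

Each goroutine keeps a private table and the results are merged after wg.Wait(), so no locking is needed while parsing.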
The V2 implementation pushes optimization further with fixed-size buffers, integer-based temperature storage, and cache-friendly data structures; the integer temperature representation is sketched after the list below.
Performance:
- Processing time: ~3.8 seconds
- Throughput: ~266 million rows/second
- Memory usage: Very efficient
Key optimizations:
- Integer-based temperature storage (int16 with implied decimal)
- Fixed-size buffers for station names
- Cache-line aligned data structures
- Larger batch sizes (256MB)
- Branchless temperature parsing
- Lock-free parallel processing
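The exact V2 parsing code is in the repository and is described as branchless; the simplified (branching) sketch below only shows the integer representation itself, assuming the challenge's fixed one-decimal temperature format. Type and function names are illustrative.

package main

import "fmt"

// stationStats keeps every field as an integer; the mean is produced once
// at output time as float64(sum)/float64(count)/10.
type stationStats struct {
	min, max int16
	sum      int64
	count    int64
}

// parseTempTenths converts a reading such as "-12.3" or "4.5" into tenths
// of a degree (-123, 45). It assumes the input format: an optional '-',
// one or two digits, '.', exactly one fractional digit.
func parseTempTenths(b []byte) int16 {
	i := 0
	neg := b[0] == '-'
	if neg {
		i = 1
	}
	var v int16
	for ; i < len(b); i++ {
		if b[i] == '.' {
			continue
		}
		v = v*10 + int16(b[i]-'0')
	}
	if neg {
		return -v
	}
	return v
}

func main() {
	fmt.Println(parseTempTenths([]byte("-12.3"))) // prints -123
}

Storing tenths in an int16 keeps each table entry small and avoids floating-point work in the hot loop; only the final mean touches floats.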
The V3 implementation takes a different approach to splitting and processing the input file; one way such newline-aligned splitting can work is sketched after the list below.
Performance:
- Processing time: ~115 milliseconds
- Throughput: ~8,677 million rows/second
- Memory usage: Extremely efficient
Key optimizations:
- Optimized file splitting strategy
- Efficient memory management
- Streamlined processing pipeline
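The actual V3 splitting code lives in the repository; the sketch below only illustrates one way such a strategy can work: boundaries are placed arithmetically and rounded forward to the next newline, so each worker reads a range of whole records independently. The file name, worker count, and buffer size are assumptions.

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// splitOffsets returns n+1 offsets such that each range
// [offsets[i], offsets[i+1]) contains only complete lines.
func splitOffsets(f *os.File, size int64, n int) ([]int64, error) {
	offsets := []int64{0}
	buf := make([]byte, 128) // assumes no line is longer than 128 bytes
	for i := 1; i < n; i++ {
		pos := size * int64(i) / int64(n)
		k, err := f.ReadAt(buf, pos)
		if err != nil && err != io.EOF {
			return nil, err
		}
		if j := bytes.IndexByte(buf[:k], '\n'); j >= 0 {
			pos += int64(j) + 1 // start the next range just past the newline
		} else {
			pos = size
		}
		offsets = append(offsets, pos)
	}
	return append(offsets, size), nil
}

func main() {
	f, err := os.Open("measurements.txt") // assumed input file name
	if err != nil {
		panic(err)
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	offsets, err := splitOffsets(f, info.Size(), 8) // 8 workers, for example
	if err != nil {
		panic(err)
	}
	fmt.Println(offsets) // each worker would then process exactly one range
}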
Comprehensive benchmarks were run with 5 iterations per implementation:
=== BENCHMARK RESULTS ===
Version: V1
Average Time: 15.577186316s
Median Time: 15.452524792s
Min Time: 15.312534916s
Max Time: 16.010921166s
Throughput: 64.20 million rows/second
Version: V2
Average Time: 3.757074466s
Median Time: 3.696428625s
Min Time: 3.58960625s
Max Time: 4.058022875s
Throughput: 266.16 million rows/second
Version: V3
Average Time: 115.239991ms
Median Time: 80.830541ms
Min Time: 67.155125ms
Max Time: 261.212125ms
Throughput: 8677.54 million rows/second
Speedups (ratio of average times):
- V1 → V2: 4.15x
- V1 → V3: 135.17x
- V2 → V3: 32.60x
- Implementation Strategy Matters: The dramatic performance difference between implementations shows how important algorithmic choices are. V3's approach is over 135x faster than V1.
- Memory Efficiency: All implementations focus on memory efficiency, but V3 achieves the best results with minimal memory usage.
- Batch Processing: Larger batch sizes in V2 significantly improved performance over V1, but V3's approach to file splitting proved even more effective.
- Hardware Utilization: All implementations leverage multi-core processing, but V3 makes the most efficient use of available hardware.
- Optimization Techniques: The progression from V1 to V3 demonstrates the value of:
  - Cache-friendly data structures
  - Minimizing allocations
  - Integer-based calculations
  - Efficient file I/O strategies
  - Parallel processing
To run the benchmarks yourself:
# Make the benchmark script executable
chmod +x run_benchmark.sh
# Run with default 5 iterations
./run_benchmark.sh
# Or specify a custom number of iterations
./run_benchmark.sh 10

Benchmark results will be displayed in the console and saved to benchmark_results.txt.
This project demonstrates how different implementation strategies and optimization techniques can dramatically impact performance when processing large datasets. The progression from V1 to V3 shows that there's always room for improvement when you're willing to rethink your approach and focus on the specific characteristics of the problem.