Skip to content

TFMV/billion

Repository files navigation

Billion Row Challenge

This repository contains multiple implementations of the Billion Row Challenge - processing 1 billion weather measurements as fast as possible.

Challenge Description

The challenge is to process a file containing approximately 1 billion weather station measurements, calculating the min, mean, and max temperature for each station.

Input format:

<station_name>;<temperature>

Output format:

{<station_name>=<min>/<mean>/<max>, ...}

Implementations

V1: Optimized Base Implementation

The first optimized implementation uses memory mapping, parallel processing, and custom data structures to achieve good performance.

Performance:

  • Processing time: ~15.6 seconds
  • Throughput: ~64 million rows/second
  • Memory usage: Efficient

Key optimizations:

  • Memory-mapped file access
  • Parallel processing with goroutines
  • Custom hash table for station lookups
  • Zero-allocation string parsing

V2: Highly Optimized Implementation

The V2 implementation pushes optimization further with fixed-size buffers, integer-based temperature storage, and cache-friendly data structures.

Performance:

  • Processing time: ~3.8 seconds
  • Throughput: ~266 million rows/second
  • Memory usage: Very efficient

Key optimizations:

  • Integer-based temperature storage (int16 with implied decimal)
  • Fixed-size buffers for station names
  • Cache-line aligned data structures
  • Larger batch sizes (256MB)
  • Branchless temperature parsing
  • Lock-free parallel processing

V3: Alternative Approach

The V3 implementation takes a different approach to file splitting and processing.

Performance:

  • Processing time: ~115 milliseconds
  • Throughput: ~8,677 million rows/second
  • Memory usage: Extremely efficient

Key optimizations:

  • Optimized file splitting strategy
  • Efficient memory management
  • Streamlined processing pipeline

Benchmark Results

Comprehensive benchmarks were run with 5 iterations per implementation:

=== BENCHMARK RESULTS ===
Version: V1
  Average Time: 15.577186316s
  Median Time:  15.452524792s
  Min Time:     15.312534916s
  Max Time:     16.010921166s
  Throughput:   64.20 million rows/second

Version: V2
  Average Time: 3.757074466s
  Median Time:  3.696428625s
  Min Time:     3.58960625s
  Max Time:     4.058022875s
  Throughput:   266.16 million rows/second

Version: V3
  Average Time: 115.239991ms
  Median Time:  80.830541ms
  Min Time:     67.155125ms
  Max Time:     261.212125ms
  Throughput:   8677.54 million rows/second

Speedup Comparisons

  • V1 → V2: 4.15x
  • V1 → V3: 135.17x
  • V2 → V3: 32.60x

Key Observations

  1. Implementation Strategy Matters: The dramatic performance difference between implementations shows how important algorithmic choices are. V3's approach is over 135x faster than V1.

  2. Memory Efficiency: All implementations focus on memory efficiency, but V3 achieves the best results with minimal memory usage.

  3. Batch Processing: Larger batch sizes in V2 significantly improved performance over V1, but V3's approach to file splitting proved even more effective.

  4. Hardware Utilization: All implementations leverage multi-core processing, but V3 makes the most efficient use of available hardware.

  5. Optimization Techniques: The progression from V1 to V3 demonstrates the value of:

    • Cache-friendly data structures
    • Minimizing allocations
    • Integer-based calculations
    • Efficient file I/O strategies
    • Parallel processing

Running the Benchmarks

To run the benchmarks yourself:

# Make the benchmark script executable
chmod +x run_benchmark.sh

# Run with default 5 iterations
./run_benchmark.sh

# Or specify a custom number of iterations
./run_benchmark.sh 10

Benchmark results will be displayed in the console and saved to benchmark_results.txt.

Conclusion

This project demonstrates how different implementation strategies and optimization techniques can dramatically impact performance when processing large datasets. The progression from V1 to V3 shows that there's always room for improvement when you're willing to rethink your approach and focus on the specific characteristics of the problem.

About

Billion Row Challenge - 115ms

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors