-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Summary
Implement explicit 4x u64 limb-based multiplication and division for u256 compound operations to close the 3.31x gap on u256_uniswapv2_amount_out.
Current Performance
| Benchmark | eth.zig | alloy.rs | Gap |
|---|---|---|---|
| u256_uniswapv2_amount_out | 43 ns | 13 ns | 3.31x loss |
| u256_mul (standalone) | 2 ns | 5 ns | 2.50x win |
| u256_div (standalone) | 3 ns | 12 ns | 4.00x win |
We win on standalone mul/div but lose badly on compound operations (mul + div in sequence). The root cause: alloy's ruint stores u256 as [u64; 4] with hand-optimized limb arithmetic, while Zig's native u256 compiles to generic LLVM operations that lose context across compound ops.
Root Cause
The UniswapV2 benchmark does:
const amount_in_with_fee = fastMul(amount_in, 997);
const numerator = fastMul(amount_in_with_fee, reserve_out);
const denominator = fastMul(reserve_in, 1000) +% amount_in_with_fee;
const amount_out = fastDiv(numerator, denominator);LLVM sees each operation independently and can't optimize the pipeline. Rust's ruint keeps limbs visible across operations, enabling register reuse and carry propagation without LLVM's generic lowering.
Proposed Approach
Add a Limb256 type that stores [4]u64 and implements arithmetic with explicit carry chains:
pub const Limb256 = struct {
limbs: [4]u64,
pub fn mul(a: Limb256, b: Limb256) Limb256 {
// Schoolbook 4x4 limb multiplication with explicit carries.
// 16 u64 multiplies, each producing u128 intermediate.
var result: [4]u64 = .{0} ** 4;
for (0..4) |i| {
var carry: u64 = 0;
for (0..4) |j| {
if (i + j >= 4) break; // Only need low 256 bits
const prod: u128 = @as(u128, a.limbs[i]) * b.limbs[j];
const sum: u128 = @as(u128, result[i + j]) + prod + carry;
result[i + j] = @truncate(sum);
carry = @intCast(sum >> 64);
}
}
return .{ .limbs = result };
}
/// Multiply by a small constant (common in DEX math: 997, 1000, etc.)
pub fn mulSmall(a: Limb256, b: u64) Limb256 {
var result: [4]u64 = .{0} ** 4;
var carry: u64 = 0;
for (0..4) |i| {
const prod: u128 = @as(u128, a.limbs[i]) * b + carry;
result[i] = @truncate(prod);
carry = @intCast(prod >> 64);
}
return .{ .limbs = result };
}
pub fn fromU256(val: u256) Limb256 {
return .{ .limbs = .{
@truncate(val),
@truncate(val >> 64),
@truncate(val >> 128),
@truncate(val >> 192),
}};
}
pub fn toU256(self: Limb256) u256 {
return @as(u256, self.limbs[0]) |
(@as(u256, self.limbs[1]) << 64) |
(@as(u256, self.limbs[2]) << 128) |
(@as(u256, self.limbs[3]) << 192);
}
};Then provide a uniswapV2AmountOut that stays in limb space the entire time:
pub fn uniswapV2AmountOut(amount_in: u256, reserve_in: u256, reserve_out: u256) u256 {
const a = Limb256.fromU256(amount_in);
const r_in = Limb256.fromU256(reserve_in);
const r_out = Limb256.fromU256(reserve_out);
const a_fee = a.mulSmall(997);
const num = a_fee.mul(r_out);
const den = r_in.mulSmall(1000).add(a_fee);
return num.div(den).toU256();
}Key Insight
The mulSmall path is critical -- DEX math frequently multiplies u256 by small constants (997, 1000, 10000). A specialized mulSmall(u64) avoids the full 4x4 limb multiply and does 4 u64*u64 with carry instead.
Target
Close the gap from 3.31x loss to 1.5x loss or better. This directly impacts MEV searcher hot paths.
References
- ruint source -- Rust's u256 with limb arithmetic
src/math/uint256.zig-- current fastMul/fastDiv implementationbench/bench.zig:372-389-- benchmark code