Optimization: u256 limb-based compound arithmetic

## Summary

Implement explicit 4x u64 limb-based multiplication and division for u256 compound operations to close the 3.31x gap on `u256_uniswapv2_amount_out`.

## Current Performance

| Benchmark | eth.zig | alloy.rs | Gap |
|-----------|---------|----------|-----|
| u256_uniswapv2_amount_out | 43 ns | 13 ns | **3.31x loss** |
| u256_mul (standalone) | 2 ns | 5 ns | 2.50x win |
| u256_div (standalone) | 3 ns | 12 ns | 4.00x win |

We win on standalone mul/div but lose badly on compound operations (mul + div in sequence). The root cause: alloy's `ruint` stores u256 as `[u64; 4]` with hand-optimized limb arithmetic, while Zig's native `u256` compiles to generic LLVM operations that lose context across compound ops.

## Root Cause

The UniswapV2 benchmark does:
```zig
const amount_in_with_fee = fastMul(amount_in, 997);
const numerator = fastMul(amount_in_with_fee, reserve_out);
const denominator = fastMul(reserve_in, 1000) +% amount_in_with_fee;
const amount_out = fastDiv(numerator, denominator);
```

LLVM sees each operation independently and can't optimize the pipeline. Rust's `ruint` keeps limbs visible across operations, enabling register reuse and carry propagation without LLVM's generic lowering.

## Proposed Approach

Add a `Limb256` type that stores `[4]u64` and implements arithmetic with explicit carry chains:

```zig
pub const Limb256 = struct {
    limbs: [4]u64,

    pub fn mul(a: Limb256, b: Limb256) Limb256 {
        // Schoolbook 4x4 limb multiplication with explicit carries.
        // 16 u64 multiplies, each producing u128 intermediate.
        var result: [4]u64 = .{0} ** 4;

        for (0..4) |i| {
            var carry: u64 = 0;
            for (0..4) |j| {
                if (i + j >= 4) break; // Only need low 256 bits
                const prod: u128 = @as(u128, a.limbs[i]) * b.limbs[j];
                const sum: u128 = @as(u128, result[i + j]) + prod + carry;
                result[i + j] = @truncate(sum);
                carry = @intCast(sum >> 64);
            }
        }

        return .{ .limbs = result };
    }

    /// Multiply by a small constant (common in DEX math: 997, 1000, etc.)
    pub fn mulSmall(a: Limb256, b: u64) Limb256 {
        var result: [4]u64 = .{0} ** 4;
        var carry: u64 = 0;
        for (0..4) |i| {
            const prod: u128 = @as(u128, a.limbs[i]) * b + carry;
            result[i] = @truncate(prod);
            carry = @intCast(prod >> 64);
        }
        return .{ .limbs = result };
    }

    pub fn fromU256(val: u256) Limb256 {
        return .{ .limbs = .{
            @truncate(val),
            @truncate(val >> 64),
            @truncate(val >> 128),
            @truncate(val >> 192),
        }};
    }

    pub fn toU256(self: Limb256) u256 {
        return @as(u256, self.limbs[0]) |
            (@as(u256, self.limbs[1]) << 64) |
            (@as(u256, self.limbs[2]) << 128) |
            (@as(u256, self.limbs[3]) << 192);
    }
};
```

Then provide a `uniswapV2AmountOut` that stays in limb space the entire time:

```zig
pub fn uniswapV2AmountOut(amount_in: u256, reserve_in: u256, reserve_out: u256) u256 {
    const a = Limb256.fromU256(amount_in);
    const r_in = Limb256.fromU256(reserve_in);
    const r_out = Limb256.fromU256(reserve_out);

    const a_fee = a.mulSmall(997);
    const num = a_fee.mul(r_out);
    const den = r_in.mulSmall(1000).add(a_fee);
    return num.div(den).toU256();
}
```

## Key Insight

The `mulSmall` path is critical -- DEX math frequently multiplies u256 by small constants (997, 1000, 10000). A specialized `mulSmall(u64)` avoids the full 4x4 limb multiply and does 4 u64*u64 with carry instead.

## Target

Close the gap from 3.31x loss to **1.5x loss or better**. This directly impacts MEV searcher hot paths.

## References

- [ruint source](https://github.com/recmo/uint) -- Rust's u256 with limb arithmetic
- `src/math/uint256.zig` -- current fastMul/fastDiv implementation
- `bench/bench.zig:372-389` -- benchmark code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization: u256 limb-based compound arithmetic #17

Summary

Current Performance

Root Cause

Proposed Approach

Key Insight

Target

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Benchmark	eth.zig	alloy.rs	Gap
u256_uniswapv2_amount_out	43 ns	13 ns	3.31x loss
u256_mul (standalone)	2 ns	5 ns	2.50x win
u256_div (standalone)	3 ns	12 ns	4.00x win

Optimization: u256 limb-based compound arithmetic #17

Description

Summary

Current Performance

Root Cause

Proposed Approach

Key Insight

Target

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions