[1.x] Optimized builds for modern CPU, remove old CPU #132

@DarkGhostHunter

Description

It's not too common in Rust-based projects, but it would be great to have a "universal" x86-64 build that works on any modern machine (x86-64-v3, which targets Haswell from 2013 and Excavator from 2015) and takes advantage of the broadly available SIMD at that level and below.

Personally I use this:

RUSTFLAGS="-C target-cpu=x86-64-v3" \
CARGO_PROFILE_RELEASE_OPT_LEVEL="3" \
CARGO_PROFILE_RELEASE_LTO="fat" \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS="1" \
cargo build --release

(I used target-cpu=native on MY machine for the lolz).

These builds shouldn't be the focus while the project is in development, but rather once the version becomes stable. Since there are benchmarks, it would be great to see a preview of the performance differences to prepare accordingly, or just revisit this later.

SIMD?

A quick AI prompt says the following:

| Microarchitecture Level | Instruction Set Extension | Core Functionality | Direct Benefit for PHPantom / Mago | Impact Status |
| --- | --- | --- | --- | --- |
| x86-64-v3 Specific | BMI1 & BMI2 (Bit Manipulation Instructions) | Adds advanced, single-cycle instructions for bit-field extraction, packing, bit trailing/leading zero counts, and arbitrary bit selection. | Massive. Rust uses bit manipulation heavily for tracking token state, syntax validation masks, and memory layouts. It streamlines the evaluation of complex logical boolean expressions within the linter. | Highly Beneficial |
| | AVX2 (Advanced Vector Extensions 2) | Expands integer vector operations to full 256-bit SIMD registers. | High. While Mago skips the math features, the compiler uses AVX2 to vectorize string operations. It allows PHPantom to scan multiple bytes of text simultaneously to locate syntax boundaries (like commas, braces, and strings). | Highly Beneficial |
| | LZCNT (Leading Zero Count) | Counts the number of leading zero bits in an integer in a single clock cycle. | High. Vital for fast low-level data structure alignment and hash map indexing, which Rust's compiler leverages heavily during the project-indexing phase. | Highly Beneficial |
| | AVX | 256-bit floating-point SIMD vector operations. | None. These are the floating-point vector roots. As a text parsing application, Mago rarely interacts with decimal math. | No Impact |
| | FMA3 (Fused Multiply-Add) | Computes $(A \times B) + C$ in a single instruction with infinite precision. | None. Specifically engineered for deep learning, matrix algebra, and graphics. Not used in static analysis. | No Impact |
| | F16C | Hardware conversion instructions between 16-bit and 32-bit floats. | None. Unused by compilers and static checkers. | No Impact |
| | MOVBE | Reverses the byte order of data during a register-to-memory move (Big-Endian to Little-Endian). | Minimal/None. Mainstream PHP code bases are read sequentially as standard UTF-8 text on standard x86 systems; byte swapping is rarely required. | No Impact |
| x86-64-v2 Inherited | POPCNT (Population Count) | Counts the number of bits set to 1 in a register in a single cycle. | High. Essential for Rust's internal data structures, bit-sets, and sparse arrays used to hold syntax trees efficiently in memory. | Highly Beneficial |
| | SSE4.2 | Adds advanced string and text processing instructions (e.g., PCMPESTRI). | High. Directly accelerates string searching and character matching, allowing quicker detection of PHP keywords (function, class, public) within source text streams. | Highly Beneficial |
| | SSE4.1, SSSE3, SSE3 | Standard 128-bit media and basic alignment extensions. | Moderate. Provides basic, ubiquitous building blocks for memory copying (memcpy) and buffer clearing that Rust utilizes implicitly. | Beneficial |
| | CMPXCHG16B | Allows atomic compare-and-swap operations on 16-byte (128-bit) values. | Moderate. Critical for parallel execution. PHPantom utilizes this to safely coordinate background indexing threads without heavy locking overhead. | Beneficial |
| | LAHF/SAHF | Loads/stores status flags into the CPU's AH register. | Minimal. Helps optimize simple scalar branch logic conditionals inside deep loop constructs. | Minor Benefit |
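
To check which of these levels a given machine actually supports, the standard library's `is_x86_feature_detected!` macro can be queried at runtime. The following is only a minimal sketch: the `Level` enum and `detect_level` function are hypothetical names, and the check covers a subset of each level's features (F16C, MOVBE, CMPXCHG16B and LAHF/SAHF are left out for brevity), so it approximates the official level definitions rather than reproducing them.

```rust
/// Rough x86-64 microarchitecture level (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Baseline,
    V2,
    V3,
}

/// Detects the highest level whose (partial) feature list the running CPU reports.
#[cfg(target_arch = "x86_64")]
pub fn detect_level() -> Option<Level> {
    // Subset of the x86-64-v2 feature set.
    let v2 = is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2")
        && is_x86_feature_detected!("popcnt");
    // Subset of the x86-64-v3 additions on top of v2.
    let v3 = v2
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt");
    Some(if v3 {
        Level::V3
    } else if v2 {
        Level::V2
    } else {
        Level::Baseline
    })
}

/// On other architectures the x86-64 levels do not apply.
#[cfg(not(target_arch = "x86_64"))]
pub fn detect_level() -> Option<Level> {
    None
}

fn main() {
    match detect_level() {
        Some(level) => println!("CPU supports at least: {:?}", level),
        None => println!("not an x86-64 machine"),
    }
}
```

Note that a binary compiled entirely with `-C target-cpu=x86-64-v3` may already execute v3 instructions before such a check runs, so this is more useful in an installer or a baseline launcher than as a guard inside the optimized binary itself.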

Use case

The main difference between x86-64-v3 and x86-64-v4 is AVX-512. From this group of SIMD extensions (yes, it's a group), only a few are relevant (again, AI helping with this one):

| AVX-512 Subset | Core Functionality | Specific Value to Mago / PHPantom | Impact Status |
| --- | --- | --- | --- |
| AVX-512BW | 512-bit Byte/Word vectors | Scans 64 bytes of PHP text simultaneously; radically accelerates whitespace stripping and symbol searching. | Massive / Revolutionary |
| Opmask Registers | Native vector predication | Eliminates scalar loop tails; handles unaligned PHP keywords and variable names without dropping out of SIMD. | Massive / Revolutionary |
| AVX-512CD | Conflict detection | Accelerates string interning (mago-atom) and deduplication during the workspace-indexing phase. | Highly Beneficial |
| Ternary Logic | 3-input bitwise operations in 1 cycle | Condenses multiple syntax checking/masking steps inside the linter into single-cycle executions. | Highly Beneficial |
| AVX-512F / DQ | Ultra-wide Floating-Point math | Unused. Static analyzers do not perform heavy decimal matrix math. | No Impact |

The reason it's better not to use x86-64-v4, apart from historical compatibility, is that it requires either a Zen 4 (2022) or Zen 5 processor, or specifically an Intel 10th gen (2019) or 11th gen one. Later Intel consumer CPUs dropped AVX-512, so a v4 build would crash with an illegal-instruction error on most Intel machines, both newer and older than those two generations.

I don't know if the performance uplift from full AVX-512 support for Zen 4/5 and Intel 10th/11th gen users is big enough to warrant another build for "latest AMD CPU, Intel 10th/11th gen". Even then, AMD CPU owners could build PHPantom in less than 5 minutes with all target optimisations (target-cpu=native). I get that the AI says "revolutionary performance", but it's inferring, not benchmarking.
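
One way to move from inference to measurement is a small timing harness that is compiled once per `target-cpu` setting and then compared. The sketch below is only an illustration: `count_delimiters` is a hypothetical stand-in for a byte-scanning hot loop, not code from PHPantom or Mago, and a synthetic loop like this says little about real indexing workloads compared with the project's own benchmarks.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Counts a few structural bytes. A plain loop like this is something the
/// compiler may auto-vectorize differently depending on `-C target-cpu`.
pub fn count_delimiters(src: &[u8]) -> usize {
    src.iter()
        .filter(|&&b| matches!(b, b',' | b'{' | b'}' | b';'))
        .count()
}

fn main() {
    // Synthetic input: a short PHP-like line repeated up to 16 MiB.
    let sample = b"<?php function f($a, $b) { return [$a, $b]; }\n";
    let input: Vec<u8> = sample
        .iter()
        .copied()
        .cycle()
        .take(16 * 1024 * 1024)
        .collect();

    let rounds = 20;
    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..rounds {
        // black_box keeps the optimizer from hoisting or removing the call.
        total += count_delimiters(black_box(input.as_slice()));
    }
    let elapsed = start.elapsed();

    println!("{} delimiters over {} rounds in {:?}", total, rounds, elapsed);
}
```

Building this once with the default flags and once with `RUSTFLAGS="-C target-cpu=x86-64-v3"`, then running both on the same machine, gives a first rough comparison.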

Proposed solution

RUSTFLAGS="-C target-cpu=x86-64-v3" \
CARGO_PROFILE_RELEASE_OPT_LEVEL="3" \
CARGO_PROFILE_RELEASE_LTO="fat" \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS="1" \
cargo build --release

Until then, the usual cargo build --release should be enough while the project is not yet considered stable. A safe approach would be to use x86-64-v2, since I doubt anyone with a processor well over a decade old would run PHPantom at all.
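
To confirm that the flags above actually reached the compiler, a binary can report which target features were enabled at compile time through `cfg!(target_feature = ...)`. This is a minimal sketch with a hypothetical `compiled_features` helper that only looks at four representative features; it is not part of PHPantom.

```rust
/// Lists which of a few representative target features were enabled
/// when this code was compiled (not what the running CPU supports).
pub fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // x86-64-v2 representatives.
    if cfg!(target_feature = "sse4.2") {
        enabled.push("sse4.2");
    }
    if cfg!(target_feature = "popcnt") {
        enabled.push("popcnt");
    }
    // x86-64-v3 representatives.
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "bmi2") {
        enabled.push("bmi2");
    }
    enabled
}

fn main() {
    println!("compiled with: {:?}", compiled_features());
}
```

With a default x86-64 build the list is expected to be empty, while a build using `-C target-cpu=x86-64-v3` should report all four entries.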

Alternatives considered

  • x86-64-v1: MMX, SSE, and SSE2. Anything still running from 2003/2004.
  • x86-64-v2: SSE3, SSE4, POPCNT. Anything still running from 2008/2011.

I believe v3 would give a better performance uplift than v2.

Code example

In the workflow, the flag would be added like this (or equivalent):

      - name: Build
        env:
          # Conditionally set RUSTFLAGS if the target starts with x86_64
          RUSTFLAGS: ${{ startsWith(matrix.target, 'x86_64') && '-C target-cpu=x86-64-v3' || '' }}
        run: cargo build --release --target ${{ matrix.target }}

Metadata

Labels: enhancement (New feature or request)
