[1.x] Optimized builds for modern CPU, remove old CPU #132

### Description

Not too common in Rust-based projects, but it would be great to have a "universal" x86-64 build that works on any _modern_ machine (`x86-64-v3`, which targets Intel Haswell, 2013, and AMD Excavator, 2015) and takes advantage of the SIMD that is broadly available at that level and below.

Personally I use this:

```sh
RUSTFLAGS="-C target-cpu=x86-64-v3" \
CARGO_PROFILE_RELEASE_OPT_LEVEL="3" \
CARGO_PROFILE_RELEASE_LTO="fat" \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS="1" \
cargo build --release
```

(I used `target-cpu=native` on MY machine for the lolz).
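
To confirm that the flags actually reach the compiler, a small check like the following could be built with the same `RUSTFLAGS`. It is only a sketch: `compiled_features` is a hypothetical helper, not part of PHPantom, and the list covers only a subset of the `x86-64-v3` features. It reports what the binary was compiled for, not what the machine running it supports.

```rust
/// Returns (feature name, compiled in?) pairs for a subset of the x86-64-v3 features.
/// `cfg!(target_feature = ...)` is resolved at compile time from `-C target-cpu` and
/// `-C target-feature`, so the result describes the binary, not the host CPU.
fn compiled_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse4.2", cfg!(target_feature = "sse4.2")),
        ("popcnt", cfg!(target_feature = "popcnt")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("bmi1", cfg!(target_feature = "bmi1")),
        ("bmi2", cfg!(target_feature = "bmi2")),
        ("lzcnt", cfg!(target_feature = "lzcnt")),
        ("fma", cfg!(target_feature = "fma")),
    ]
}

fn main() {
    for (name, enabled) in compiled_features() {
        let state = if enabled { "enabled" } else { "not enabled" };
        println!("{name:<8} {state}");
    }
}
```

Built with a plain `cargo build --release` on x86-64, most of these should show as not enabled. Built with `-C target-cpu=x86-64-v3`, all of them should show as enabled.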

These builds shouldn't be the focus while the project is in development, but rather once a version becomes stable. Since there are benchmarks, it would be great to see a preview of the performance difference to prepare accordingly, or just revisit this later.
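
As a rough idea of what such a preview could look like, here is a tiny timing sketch that could be built once with the default flags and once with `-C target-cpu=x86-64-v3`. Everything in it is an assumption for illustration: `count_byte` and the synthetic input are made up, it is not one of the project's benchmarks, and a single wall-clock measurement like this is noisy.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Counts occurrences of `needle` in `haystack`. A simple loop like this is the kind
/// of code LLVM may auto-vectorize differently depending on `-C target-cpu`.
fn count_byte(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

fn main() {
    // Synthetic input: a repeated PHP-like line (hypothetical, not project data).
    let input = "<?php function f($a) { return [$a, $a]; }\n".repeat(200_000);
    let bytes = input.as_bytes();

    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..50 {
        // black_box keeps the optimizer from hoisting or folding the repeated work.
        total += count_byte(black_box(bytes), black_box(b'{'));
    }
    let elapsed = start.elapsed();

    println!("found {total} braces in {elapsed:?}");
}
```

Comparing the printed time between the two builds gives a first hint only. The real answer should come from the project's own benchmarks.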

## SIMD?

A quick AI prompt says the following:

Microarchitecture Level | Instruction Set Extension | Core Functionality | Direct Benefit for PHPantom / Mago | Impact Status
-- | -- | -- | -- | --
x86-64-v3 Specific | BMI1 & BMI2 (Bit Manipulation Instructions) | Adds advanced, single-cycle instructions for bit-field extraction, packing, bit trailing/leading zero counts, and arbitrary bit selection. | Massive. Rust uses bit manipulation heavily for tracking token state, syntax validation masks, and memory layouts. It streamlines the evaluation of complex logical boolean expressions within the linter. | Highly Beneficial
  | AVX2 (Advanced Vector Extensions 2) | Expands integer vector operations to full 256-bit SIMD registers. | High. While Mago skips the math features, the compiler uses AVX2 to vectorize string operations. It allows PHPantom to scan multiple bytes of text simultaneously to locate syntax boundaries (like commas, braces, and strings). | Highly Beneficial
  | LZCNT (Leading Zero Count) | Counts the number of leading zero bits in an integer in a single clock cycle. | High. Vital for fast low-level data structure alignment and hash map indexing, which Rust's compiler leverages heavily during the project-indexing phase. | Highly Beneficial
  | AVX | 256-bit floating-point SIMD vector operations. | None. These are the floating-point vector roots. As a text parsing application, Mago rarely interacts with decimal math. | No Impact
  | FMA3 (Fused Multiply-Add) | Computes $(A \times B) + C$ in a single instruction with infinite precision. | None. Specifically engineered for deep learning, matrix algebra, and graphics. Not used in static analysis. | No Impact
  | F16C | Hardware conversion instructions between 16-bit and 32-bit floats. | None. Unused by compilers and static checkers. | No Impact
  | MOVBE | Reverses the byte order of data during a register-to-memory move (Big-Endian to Little-Endian). | Minimal/None. Mainstream PHP code bases are read sequentially as standard UTF-8 text on standard x86 systems; byte swapping is rarely required. | No Impact
x86-64-v2 Inherited | POPCNT (Population Count) | Counts the number of bits set to 1 in a register in a single cycle. | High. Essential for Rust's internal data structures, bit-sets, and sparse arrays used to hold syntax trees efficiently in memory. | Highly Beneficial
  | SSE4.2 | Adds advanced string and text processing instructions (e.g., PCMPESTRI). | High. Directly accelerates string searching and character matching, allowing quicker detection of PHP keywords (function, class, public) within source text streams. | Highly Beneficial
  | SSE4.1, SSSE3, SSE3 | Standard 128-bit media and basic alignment extensions. | Moderate. Provides basic, ubiquitous building blocks for memory copying (memcpy) and buffer clearing that Rust utilizes implicitly. | Beneficial
  | CMPXCHG16B | Allows atomic compare-and-swap operations on 16-byte (128-bit) values. | Moderate. Critical for parallel execution. PHPantom utilizes this to safely coordinate background indexing threads without heavy locking overhead. | Beneficial
  | LAHF/SAHF | Loads/stores status flags into the CPU's AH register. | Minimal. Helps optimize simple scalar branch logic conditionals inside deep loop constructs. | Minor Benefit

### Use case

The main difference between `x86-64-v3` and `x86-64-v4` is AVX-512. Of this group of SIMD extensions (yes, it's a group), only a few are relevant (again, AI helping with this one):

AVX-512 Subset | Core Functionality | Specific Value to Mago / PHPantom | Impact Status
-- | -- | -- | --
AVX-512BW | 512-bit Byte/Word vectors | Scans 64 bytes of PHP text simultaneously; radically accelerates whitespace stripping and symbol searching. | Massive / Revolutionary
Opmask Registers | Native vector predication | Eliminates scalar loop tails; handles unaligned PHP keywords and variable names without dropping out of SIMD. | Massive / Revolutionary
AVX-512CD | Conflict detection | Accelerates string interning (mago-atom) and deduplication during the workspace-indexing phase. | Highly Beneficial
Ternary Logic | 3-input bitwise operations in 1 cycle | Condenses multiple syntax checking/masking steps inside the linter into single-cycle executions. | Highly Beneficial
AVX-512F / DQ | Ultra-wide Floating-Point math | Unused. Static analyzers do not perform heavy decimal matrix math. | No Impact

The reason not to use `x86-64-v4`, apart from historical compatibility, is that it requires either _Zen 4_ (2022) or _Zen 5_ processors, or specifically Intel _10th gen_ (2019) or _11th gen_. Later consumer Intel CPUs **do not have AVX-512**, so a `v4` binary would crash with an illegal-instruction error on any Intel CPU outside those generations.
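
To see which level a given machine can actually run, a runtime check along these lines could help. This is a minimal sketch under stated assumptions: `cpu_supports_v3` and `cpu_supports_v4` are hypothetical helpers, and the `v3` check covers the main additions only (it omits MOVBE and OSXSAVE for brevity). It relies on the standard library's `is_x86_feature_detected!` macro.

```rust
/// True if the running CPU reports the main x86-64-v3 additions.
/// Note: MOVBE and OSXSAVE are part of v3 but are left out of this sketch.
#[cfg(target_arch = "x86_64")]
fn cpu_supports_v3() -> bool {
    is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("f16c")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt")
}

/// True if the CPU passes the v3 check and also reports the AVX-512 subsets
/// that x86-64-v4 requires (F, BW, CD, DQ, VL).
#[cfg(target_arch = "x86_64")]
fn cpu_supports_v4() -> bool {
    cpu_supports_v3()
        && is_x86_feature_detected!("avx512f")
        && is_x86_feature_detected!("avx512bw")
        && is_x86_feature_detected!("avx512cd")
        && is_x86_feature_detected!("avx512dq")
        && is_x86_feature_detected!("avx512vl")
}

// On other architectures the x86-64 levels do not apply.
#[cfg(not(target_arch = "x86_64"))]
fn cpu_supports_v3() -> bool {
    false
}

#[cfg(not(target_arch = "x86_64"))]
fn cpu_supports_v4() -> bool {
    false
}

fn main() {
    println!("x86-64-v3 (main features): {}", cpu_supports_v3());
    println!("x86-64-v4 (AVX-512 F/BW/CD/DQ/VL): {}", cpu_supports_v4());
}
```

The check itself has to be compiled for the baseline target. A binary built with `-C target-cpu=x86-64-v4` may hit an unsupported instruction before it ever reaches `main`.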

I don't know if the performance uplift of full AVX-512 support for Zen 4/5 and Intel 10th/11th gen users is _big enough_ to warrant another build for "latest AMD CPU, Intel 10th/11th gen". Even then, AMD CPU owners could build PHPantom in less than 5 minutes with all target optimisations (`target-cpu=native`). I get that the AI says "revolutionary performance", but it's inferring, not benchmarking.

### Proposed solution

```sh
RUSTFLAGS="-C target-cpu=x86-64-v3" \
CARGO_PROFILE_RELEASE_OPT_LEVEL="3" \
CARGO_PROFILE_RELEASE_LTO="fat" \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS="1" \
cargo build --release
```

Until the project is considered stable, the usual `cargo build --release` should be enough. A safe approach would be to use `x86-64-v2`, since I doubt anyone with a processor more than a decade old would run PHPantom at all.

### Alternatives considered

- `x86-64-v1`: MMX, SSE, and SSE2. Anything still running from 2003/2004.
- `x86-64-v2`: SSE3, SSE4, POPCNT. Anything still running from 2008/2011.

I believe `v3` would give a bigger _performance_ uplift than `v2`.

### Code example

In the workflow, the flag would be added like this (or equivalent):

```yaml
      - name: Build
        env:
          # Conditionally set RUSTFLAGS if the target starts with x86_64
          RUSTFLAGS: ${{ startsWith(matrix.target, 'x86_64') && '-C target-cpu=x86-64-v3' || '' }}
        run: cargo build --release --target ${{ matrix.target }}
```
