
COMP: Replace hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL#6039

Closed
hjmjohnson wants to merge 4 commits intoInsightSoftwareConsortium:mainfrom
hjmjohnson:enh-hardware-support-window

Member

@hjmjohnson hjmjohnson commented Apr 10, 2026

Closes #2634.

Replaces the historical hard-coded -march=corei7 (Nehalem, 2008) with ITK_X86_64_ISA_LEVEL, a CMake cache dropdown that selects a standard x86-64 micro-architecture level. The default value is "default" (no -march flag, so the compiler toolchain default applies, effectively the x86-64 baseline) for maximum redistributability. The x86-64-v4 (AVX-512) level also adds -mprefer-vector-width=256 to avoid CPU frequency throttling on Intel hardware.
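A minimal sketch of how such a cache dropdown is typically declared in CMake. The variable name and the six allowed values come from this description; the helper variable names, the help text, and the use of IN_LIST for validation are assumptions for illustration and are not taken from the PR's code.

```cmake
cmake_minimum_required(VERSION 3.12)  # list(JOIN) needs 3.12; IN_LIST needs policy CMP0057

# Allowed values, as listed in the PR description.
set(_itk_isa_levels default x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native)

# Hypothetical help text; "default" means no -march flag is emitted.
set(ITK_X86_64_ISA_LEVEL "default" CACHE STRING
    "x86-64 micro-architecture level used to select -march=")

# The STRINGS property is what makes cmake-gui and ccmake offer a dropdown.
set_property(CACHE ITK_X86_64_ISA_LEVEL PROPERTY STRINGS ${_itk_isa_levels})

# STRINGS is only a GUI hint, so values passed with -D must be validated explicitly.
if(NOT ITK_X86_64_ISA_LEVEL IN_LIST _itk_isa_levels)
  list(JOIN _itk_isa_levels ", " _itk_isa_levels_text)
  message(FATAL_ERROR
    "ITK_X86_64_ISA_LEVEL must be one of: ${_itk_isa_levels_text} "
    "(got '${ITK_X86_64_ISA_LEVEL}')")
endif()
```

The explicit check matters because the STRINGS property does not reject a hand-typed value such as -DITK_X86_64_ISA_LEVEL=bogus on the command line.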

Commits

  1. ENH: Define ITK_X86_64_ISA_LEVEL cache variable + itk_isa_level_arch_flag() helper
  2. COMP: Integrate into check_compiler_optimization_flags(), replacing -march=corei7
  3. PERF: Add -mprefer-vector-width=256 for x86-64-v4, backed by benchmark data
  4. COMP: Default ISA level to x86-64 baseline (pip/Docker/Rosetta safety)

cmake-gui dropdown

| Level | -march= flag | Key ISA additions | ~Year |
|---|---|---|---|
| default | (none) | Compiler toolchain default | |
| x86-64 | x86-64 | SSE, SSE2 (AMD64 baseline) | 2003 |
| x86-64-v2 | x86-64-v2 | + SSE4.2, POPCNT | 2009 |
| x86-64-v3 | x86-64-v3 | + AVX2, FMA, BMI1/2 | 2013 |
| x86-64-v4 | x86-64-v4 + -mprefer-vector-width=256 | + AVX-512F/BW/CD/DQ/VL | 2017 |
| native | native | Host CPU's full ISA (not redistributable) | |

Default: default (x86-64 baseline). Users building for local use may want x86-64-v2, which is the minimum level targeted by current Linux distributions (Fedora, RHEL 9, SUSE). The override escape hatch ITK_C_OPTIMIZATION_FLAGS / ITK_CXX_OPTIMIZATION_FLAGS is unchanged.
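One possible way to use the override escape hatch is a CMake initial-cache script that pre-sets both flag variables before ITK is configured. This is a sketch under assumptions: the file name and the flag values are illustrative, and it relies on the statement above that pre-set ITK_C_OPTIMIZATION_FLAGS / ITK_CXX_OPTIMIZATION_FLAGS values are still honoured.

```cmake
# itk-local-flags.cmake (hypothetical file name), loaded with:
#   cmake -C itk-local-flags.cmake -S . -B build
#
# Leave the ISA dropdown at its default so only the explicit flags below apply.
set(ITK_X86_64_ISA_LEVEL "default" CACHE STRING "x86-64 ISA level")

# Hand-picked flags for a local, non-redistributable build (example values only).
set(ITK_C_OPTIMIZATION_FLAGS "-mtune=generic -march=x86-64-v3"
    CACHE STRING "Explicit C optimization flags")
set(ITK_CXX_OPTIMIZATION_FLAGS "-mtune=generic -march=x86-64-v3"
    CACHE STRING "Explicit C++ optimization flags")
```

Keeping such overrides in a script rather than on the command line makes the chosen flags easier to review and reproduce.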

WARNING: benchmark limitations

The measurements were taken on a computer that was actively being used for other tasks, and for a limited number of runs. The performance metrics are rough estimates. Do not read too much into them.

Why x86-64-v4 needs -mprefer-vector-width=256

GCC with bare -march=x86-64-v4 auto-vectorises every function with 512-bit zmm registers. On Intel Sapphire Rapids (and earlier), any zmm instruction triggers a licence-based frequency downshift (~670 µs cooldown). A full ITK build produces 53,063 zmm instructions across 4,977 functions in the ResampleBenchmark binary, keeping the CPU perpetually in the throttled P-state.

-mprefer-vector-width=256 tells GCC to use AVX-512 instruction encoding (EVEX prefix, 32 vector registers, mask registers) but keep vector width at 256-bit (ymm). The CPU stays in the non-throttled P-state while retaining access to AVX-512 features.

zmm instruction count in ResampleBenchmark binary:

| Build | zmm instructions |
|---|---|
| v4-bare | 53,063 |
| v4 + -mprefer-vector-width=256 | 4,208 (−92%) |

Benchmark evidence: v4-bare vs v4-pw256 (Xeon w7-3545, NSLOTS=1)

Speedup of v4-pw256 vs v4-bare (>1.0 = pw256 faster):

| Benchmark | pw256 vs v4-bare |
|---|---|
| DemonsRegistration | 1.21× |
| Resample (60 variants) | 1.10× |
| UnaryAdd | 1.09× |
| BinaryAdd | 1.05× |
| GradientMagnitude1Thread | 1.02× |
| Median | 0.98× |
| MinMaxCurvatureFlow | 0.98× |
| GradientMagnitude | 0.98× |
| NormalizedCorrelation | 0.98× |
| RegistrationFramework | 0.93× |

Compared to the v2 baseline (-march=x86-64-v2):

  • v4-bare regressed Resample by 13% and Demons by 16%.
  • v4-pw256 brings both within 4% of v2 while retaining AVX-512 encoding benefits on scalar benchmarks (+6–9% on BinaryAdd, UnaryAdd).

4-config sweep: x86-64 vs v2 vs v3 vs v4 (n=3 passes, Xeon w7-3545, NSLOTS=1)

Speedup vs baseline (x86-64, ~2003 ISA):

| Benchmark | x86-64 | x86-64-v2 | x86-64-v3 | x86-64-v4 |
|---|---|---|---|---|
| DemonsRegistration | 1.00× | 1.05× | 1.00× | 1.09× |
| Median | 1.00× | 1.05× | 1.03× | 1.07× |
| BinaryAdd | 1.00× | 1.04× | 1.05× | 1.04× |
| MinMaxCurvatureFlow | 1.00× | 1.03× | 0.98× | 1.01× |
| GradientMagnitude | 1.00× | 1.04× | 0.91× | 0.96× |
| GradientMagnitude1Thread | 1.00× | 0.97× | 0.86× | 0.91× |
| NormalizedCorrelation | 1.00× | 0.97× | 1.03× | 1.05× |
| RegistrationFramework | 1.00× | 1.01× | 1.01× | 0.96× |
| Resample (60 variants) | 1.00× | 1.01× | 1.00× | 0.88× |
| UnaryAdd | 1.00× | 0.93× | 0.81× | 0.80× |

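The default x86-64 baseline is the safest redistributable choice. x86-64-v2 stays within 3% of baseline or better on nine of the ten benchmarks (0.97× to 1.05×); UnaryAdd is the exception at 0.93×. Higher levels show diminishing or negative returns on several benchmarks (for example UnaryAdd at 0.81× and 0.80×), likely due to AVX frequency effects.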

Local verification
$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=x86-64-v2
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=generic -march=x86-64-v2

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=x86-64-v4
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=generic -march=x86-64-v4 -mprefer-vector-width=256

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=default
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING=

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=native
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=native -march=native

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=bogus
CMake Error: ITK_X86_64_ISA_LEVEL must be one of: default, x86-64, ...

@github-actions github-actions bot added type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots labels Apr 10, 2026
@hjmjohnson hjmjohnson changed the title COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option WIP: Investigating Option This is one. COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option Apr 12, 2026
@hjmjohnson hjmjohnson force-pushed the enh-hardware-support-window branch from 4b77b6b to e1a0add Compare April 12, 2026 11:12
@github-actions github-actions bot removed the type:Compiler Compiler support or related warnings label Apr 12, 2026
@hjmjohnson hjmjohnson force-pushed the enh-hardware-support-window branch from e1a0add to b6d0bc2 Compare April 12, 2026 11:23
@hjmjohnson hjmjohnson changed the title WIP: Investigating Option This is one. COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option COMP: Replace hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL Apr 12, 2026
Comment thread CMake/ITKSetStandardCompilerFlags.cmake Outdated
@hjmjohnson
Member Author

Re: ISA level default — changed default to "default" (compiler toolchain default, effectively x86-64 baseline). Added a 4-line rationale comment in the CMake file: pip wheels, Docker images, and hardware translation layers only guarantee x86-64 baseline.

@github-actions github-actions bot added the type:Compiler Compiler support or related warnings label Apr 12, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review April 12, 2026 19:17
@greptile-apps
Contributor

greptile-apps bot commented Apr 12, 2026

Greptile Summary

This PR replaces the historical hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL, a CMake cache dropdown exposing the standard x86-64-v{2,3,4}/native micro-architecture levels. The default is "default" (no flags) for maximum redistributability, and x86-64-v4 adds -mprefer-vector-width=256 to avoid Intel AVX-512 frequency throttling, backed by benchmark data.

  • P1 – silent flag drop on Clang for x86-64-v4: Line 134 packs two flags into one space-separated string. check_c_compiler_flag tests them together; Clang does not recognise -mprefer-vector-width=256, so the combined test fails and both flags are silently dropped. The fix is a semicolon-separated string so each flag is tested independently.

Confidence Score: 4/5

Safe to merge after fixing the compound-flag string for x86-64-v4 — a one-line change.

One P1 finding: the compound -march=x86-64-v4 -mprefer-vector-width=256 string causes both flags to be rejected together on Clang, silently nullifying the x86-64-v4 optimization level for Clang users. The fix is trivial (semicolons). All other levels (default, x86-64, x86-64-v2, x86-64-v3, native) work correctly on both GCC and Clang, and the MSVC path is unaffected.
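The suggested fix can be sketched with CMake's stock CheckCXXCompilerFlag module: store the flags as a semicolon-separated list and test each entry on its own, so a rejected flag does not take an accepted one down with it. This is a generic illustration, not ITK's check_c_compiler_flags helper; all variable names are hypothetical, and it assumes it runs inside a project with C++ enabled.

```cmake
include(CheckCXXCompilerFlag)

# A CMake list (semicolon-separated): each entry is exactly one flag.
set(_isa_flags "-march=x86-64-v4;-mprefer-vector-width=256")

set(_accepted_flags "")
foreach(_flag IN LISTS _isa_flags)
  # Check results are cached by variable name, so each flag needs its own name.
  string(MAKE_C_IDENTIFIER "HAVE_CXX_FLAG${_flag}" _result_var)
  check_cxx_compiler_flag("${_flag}" ${_result_var})
  if(${_result_var})
    # Keep only the flags this compiler accepted.
    string(APPEND _accepted_flags " ${_flag}")
  endif()
endforeach()

message(STATUS "Accepted ISA flags:${_accepted_flags}")
```

With the space-separated form, the whole string is passed to the compiler as a single test, which is why one unsupported flag would cause both to be dropped.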

CMake/ITKSetStandardCompilerFlags.cmake line 134 — the x86-64-v4 compound flag string.

Important Files Changed

Filename Overview
CMake/ITKSetStandardCompilerFlags.cmake Replaces hardcoded -march=corei7 with a new ITK_X86_64_ISA_LEVEL dropdown; flag resolution logic is correct for all levels except x86-64-v4 where a compound space-separated string causes both flags to be tested together and can be silently dropped on Clang.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ITK_X86_64_ISA_LEVEL\n(CMake cache dropdown)"] --> V{"Validate regex\n^(default|x86-64-v2|...|native)$"}
    V -- invalid --> E["FATAL_ERROR"]
    V -- valid --> F["itk_isa_level_arch_flag()"]
    F --> P{"CMAKE_SYSTEM_PROCESSOR\nmatches x86_64 or AMD64?"}
    P -- No --> Z["_arch_flag = ''"]
    P -- Yes --> D{"level == 'default'?"}
    D -- Yes --> Z
    D -- No --> M{"MSVC?"}
    M -- Yes --> MV{"level"}
    MV -- x86-64-v3 --> MA["/arch:AVX2"]
    MV -- x86-64-v4 --> MB["/arch:AVX512"]
    MV -- native --> MC["/arch:AVX2 proxy"]
    MV -- "x86-64 / x86-64-v2" --> MD["SSE2 default"]
    M -- No --> GV{"level"}
    GV -- native --> GA["-march=native"]
    GV -- x86-64-v4 --> GB["compound string\n-march=x86-64-v4 -mprefer-vector-width=256\n⚠ tested as one unit\nFails on Clang → both flags dropped"]
    GV -- other --> GC["-march=level"]
    MA & MB & MC & MD & GA & GB & GC & Z --> OUT["check_compiler_optimization_flags()\nAppend to InstructionSetOptimizationFlags\nTest via check_c_compiler_flags()\nSet ITK_C/CXX_OPTIMIZATION_FLAGS"]
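The decision flow in the chart can be approximated by the following CMake function. It is a hypothetical re-creation, not the PR's code: the (level, out_var) signature is an assumption, and the x86-64-v4 branch deliberately returns a list rather than the compound string shown in the chart, following the reviewer's suggestion.

```cmake
# Sketch of the flow above; validation of `level` is assumed to happen earlier.
function(itk_isa_level_arch_flag level out_var)
  set(_arch_flag "")
  # Non-x86-64 targets and the "default" level get no flag at all.
  if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64)$" AND NOT level STREQUAL "default")
    if(MSVC)
      if(level STREQUAL "x86-64-v3" OR level STREQUAL "native")
        set(_arch_flag "/arch:AVX2")    # "native" is approximated by AVX2 on MSVC
      elseif(level STREQUAL "x86-64-v4")
        set(_arch_flag "/arch:AVX512")
      endif()                           # x86-64 and x86-64-v2: SSE2 is already the default
    elseif(level STREQUAL "x86-64-v4")
      # List form, so each flag can be tested independently later.
      set(_arch_flag "-march=x86-64-v4;-mprefer-vector-width=256")
    else()
      set(_arch_flag "-march=${level}") # x86-64, x86-64-v2, x86-64-v3, native
    endif()
  endif()
  set(${out_var} "${_arch_flag}" PARENT_SCOPE)
endfunction()

# Example call; the result would then be fed to the flag checks.
itk_isa_level_arch_flag("${ITK_X86_64_ISA_LEVEL}" _isa_arch_flag)
message(STATUS "ISA flag(s): ${_isa_arch_flag}")
```

Returning an empty string for unsupported targets lets the caller append the result unconditionally.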

Reviews (1): Last reviewed commit: "COMP: default ISA level to x86-64 baseli..."

Comment thread CMake/ITKSetStandardCompilerFlags.cmake Outdated
hjmjohnson and others added 4 commits April 12, 2026 16:49

Define a CMake cache variable with dropdown values for selecting the
x86-64 instruction set architecture level: default, x86-64, x86-64-v2,
x86-64-v3, x86-64-v4, and native.  The helper function
itk_isa_level_arch_flag() resolves the selection to a concrete
-march= flag (GCC/Clang) or /arch: flag (MSVC).

This commit adds the infrastructure only; the next commit integrates it
into ITK's compiler flag logic, replacing the old -march=corei7.

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Integrate the ITK_X86_64_ISA_LEVEL cache variable (added in the
prior commit) into check_compiler_optimization_flags(), replacing
the historical hard-coded -march=corei7 (Nehalem, 2008).

The default level is x86-64-v2 (SSE4.2, POPCNT), matching the
previous corei7 baseline with portable, vendor-neutral level names.

When set to "default", no -march or -mtune flags are emitted,
leaving the compiler's built-in defaults in effect.

When set to "native", both -march=native and -mtune=native are
used for maximum performance on the build host (not redistributable).

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rottle

GCC with -march=x86-64-v4 auto-vectorises every function with 512-bit
zmm registers (53 063 zmm instructions across 4 977 functions in the
ResampleBenchmark binary).  On Intel Sapphire Rapids (and earlier) any
zmm instruction triggers a licence-based frequency downshift that
persists ~670 µs.  With 4 977 affected functions the CPU never exits
the throttled P-state, causing 12–17 % wall-clock regressions on
BSpline-dominated workloads compared to the x86-64-v2 default.

-mprefer-vector-width=256 tells GCC to prefer 256-bit (ymm) vectors
while still using the AVX-512 instruction encoding (EVEX prefix, 32
vector registers, mask registers, new ALU operations).  This gives
access to AVX-512 features without triggering the zmm frequency
penalty.

Benchmark evidence (Xeon w7-3545, NSLOTS=1, n=1):

  Binary zmm count:
    v4-bare:  53 063
    v4-pw256:  4 208  (−92 %)

  Speedup vs v4-bare (pw256 faster = >1.0):
    DemonsRegistration:  1.21×  (regression recovered)
    Resample (60 var):   1.10×  (regression recovered)
    BinaryAdd:           1.05×
    UnaryAdd:            1.09×
    GradMag1Thread:      1.02×  (preserved)

  Compared to the v2 default (-march=x86-64-v2):
    v4-bare regressed Resample by 13 % and Demons by 16 %.
    v4-pw256 brings both within 4 % of v2 while retaining
    AVX-512 encoding benefits on scalar benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pip wheels, Docker images, and hardware translation layers (Rosetta)
only guarantee x86-64 baseline. Users building for local use may want
to consider x86-64-v2, which is the minimum level targeted by current
Linux distributions (Fedora, RHEL 9, SUSE).
@hjmjohnson hjmjohnson force-pushed the enh-hardware-support-window branch from 2deda19 to bccc55c Compare April 12, 2026 21:57
@hjmjohnson
Member Author

Closing in favor of a simpler approach: remove the hardcoded -march=corei7 entirely rather than replacing it with a new CMake variable. The ISA level abstraction can be revisited later if needed.

Labels

type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots

Development

Successfully merging this pull request may close these issues.

Latest GCC releases have removed corei7 as an arch option

1 participant