COMP: Replace hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL #6039
hjmjohnson wants to merge 4 commits into InsightSoftwareConsortium:main
4b77b6b to e1a0add
e1a0add to b6d0bc2
Re: ISA level default: changed default to
| Filename | Overview |
|---|---|
| CMake/ITKSetStandardCompilerFlags.cmake | Replaces hardcoded -march=corei7 with a new ITK_X86_64_ISA_LEVEL dropdown; flag resolution logic is correct for all levels except x86-64-v4 where a compound space-separated string causes both flags to be tested together and can be silently dropped on Clang. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["ITK_X86_64_ISA_LEVEL\n(CMake cache dropdown)"] --> V{"Validate regex\n^(default|x86-64-v2|...|native)$"}
V -- invalid --> E["FATAL_ERROR"]
V -- valid --> F["itk_isa_level_arch_flag()"]
F --> P{"CMAKE_SYSTEM_PROCESSOR\nmatches x86_64 or AMD64?"}
P -- No --> Z["_arch_flag = ''"]
P -- Yes --> D{"level == 'default'?"}
D -- Yes --> Z
D -- No --> M{"MSVC?"}
M -- Yes --> MV{"level"}
MV -- x86-64-v3 --> MA["/arch:AVX2"]
MV -- x86-64-v4 --> MB["/arch:AVX512"]
MV -- native --> MC["/arch:AVX2 proxy"]
MV -- "x86-64 / x86-64-v2" --> MD["SSE2 default"]
M -- No --> GV{"level"}
GV -- native --> GA["-march=native"]
GV -- x86-64-v4 --> GB["compound string\n-march=x86-64-v4 -mprefer-vector-width=256\n⚠ tested as one unit\nFails on Clang → both flags dropped"]
GV -- other --> GC["-march=level"]
MA & MB & MC & MD & GA & GB & GC & Z --> OUT["check_compiler_optimization_flags()\nAppend to InstructionSetOptimizationFlags\nTest via check_c_compiler_flags()\nSet ITK_C/CXX_OPTIMIZATION_FLAGS"]
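The compound-string concern flagged for x86-64-v4 could be sidestepped by probing each flag on its own. The following is a minimal sketch, not the PR's code: it assumes CMake 3.12 or newer with C++ enabled, uses the stock CheckCXXCompilerFlag module rather than ITK's own check_c_compiler_flags() helper, and the names `_candidate_flags`, `_accepted_flags`, and `_result_var` are hypothetical.

```cmake
include(CheckCXXCompilerFlag)

# Keep the flags as a CMake list so each one is probed separately
# instead of as a single space-joined string.
set(_candidate_flags "-march=x86-64-v4" "-mprefer-vector-width=256")

set(_accepted_flags "")
foreach(_flag IN LISTS _candidate_flags)
  # Derive a distinct result variable per flag, e.g.
  # HAVE_CXX_FLAG__march_x86_64_v4; check results are cached under that name.
  string(MAKE_C_IDENTIFIER "HAVE_CXX_FLAG_${_flag}" _result_var)
  check_cxx_compiler_flag("${_flag}" ${_result_var})
  if(${_result_var})
    list(APPEND _accepted_flags "${_flag}")
  endif()
endforeach()

# Join the surviving flags back into the space-separated form
# that compiler flag variables expect.
list(JOIN _accepted_flags " " _accepted_flags_string)
message(STATUS "Accepted ISA flags: ${_accepted_flags_string}")
```

With this shape, a compiler that rejects one of the two flags would still keep the other, rather than losing both.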
Reviews (1): Last reviewed commit: "COMP: default ISA level to x86-64 baseli..."
Define a CMake cache variable with dropdown values for selecting the x86-64 instruction set architecture level: default, x86-64, x86-64-v2, x86-64-v3, x86-64-v4, and native. The helper function itk_isa_level_arch_flag() resolves the selection to a concrete -march= flag (GCC/Clang) or /arch: flag (MSVC).

This commit adds the infrastructure only; the next commit integrates it into ITK's compiler flag logic, replacing the old -march=corei7.

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
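The cache variable and helper described in this commit message could look roughly like the sketch below. It follows the review flowchart rather than the actual diff, so the docstring, the `out_var` parameter, the `_flag` local, and the exact regular expressions are assumptions. The default value shown is a placeholder, since the default changed over the life of the PR, and the x86-64-v4 vector-width flag is left out because it arrives in a later commit.

```cmake
# Cache entry with a fixed set of choices; cmake-gui renders STRINGS as a dropdown.
set(ITK_X86_64_ISA_LEVEL "default" CACHE STRING "x86-64 ISA level (placeholder default)")
set_property(CACHE ITK_X86_64_ISA_LEVEL PROPERTY STRINGS
  default x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native)

# Reject values typed on the command line that are not in the list.
if(NOT ITK_X86_64_ISA_LEVEL MATCHES "^(default|native|x86-64(-v[234])?)$")
  message(FATAL_ERROR "Invalid ITK_X86_64_ISA_LEVEL: '${ITK_X86_64_ISA_LEVEL}'")
endif()

# Resolve the selected level to one architecture flag, or to an empty string.
function(itk_isa_level_arch_flag out_var)
  set(_flag "")
  if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64"
     AND NOT ITK_X86_64_ISA_LEVEL STREQUAL "default")
    if(MSVC)
      if(ITK_X86_64_ISA_LEVEL STREQUAL "x86-64-v4")
        set(_flag "/arch:AVX512")
      elseif(ITK_X86_64_ISA_LEVEL MATCHES "^(x86-64-v3|native)$")
        # MSVC has no "native" switch; the flowchart uses AVX2 as a proxy.
        set(_flag "/arch:AVX2")
      endif()
      # x86-64 and x86-64-v2 fall through: MSVC's SSE2 default applies.
    else()
      # GCC/Clang: the level names (and "native") are valid -march= values.
      set(_flag "-march=${ITK_X86_64_ISA_LEVEL}")
    endif()
  endif()
  set(${out_var} "${_flag}" PARENT_SCOPE)
endfunction()

itk_isa_level_arch_flag(_arch_flag)
message(STATUS "Resolved ISA flag: '${_arch_flag}'")
```

Returning an empty string for non-x86-64 targets and for "default" lets the caller append the result unconditionally.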
Integrate the ITK_X86_64_ISA_LEVEL cache variable (added in the prior commit) into check_compiler_optimization_flags(), replacing the historical hard-coded -march=corei7 (Nehalem, 2008).

The default level is x86-64-v2 (SSE4.2, POPCNT), matching the previous corei7 baseline with portable, vendor-neutral level names. When set to "default", no -march or -mtune flags are emitted, leaving the compiler's built-in defaults in effect. When set to "native", both -march=native and -mtune=native are used for maximum performance on the build host (not redistributable).

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rottle
GCC with -march=x86-64-v4 auto-vectorises every function with 512-bit
zmm registers (53 063 zmm instructions across 4 977 functions in the
ResampleBenchmark binary). On Intel Sapphire Rapids (and earlier) any
zmm instruction triggers a licence-based frequency downshift that
persists ~670 µs. With 4 977 affected functions the CPU never exits
the throttled P-state, causing 12–17 % wall-clock regressions on
BSpline-dominated workloads compared to the x86-64-v2 default.

-mprefer-vector-width=256 tells GCC to prefer 256-bit (ymm) vectors
while still using the AVX-512 instruction encoding (EVEX prefix, 32
vector registers, mask registers, new ALU operations). This gives
access to AVX-512 features without triggering the zmm frequency
penalty.

Benchmark evidence (Xeon w7-3545, NSLOTS=1, n=1):

  Binary zmm count:
    v4-bare:   53 063
    v4-pw256:   4 208  (−92 %)

  Speedup vs v4-bare (pw256 faster = >1.0):
    DemonsRegistration:  1.21×  (regression recovered)
    Resample (60 var):   1.10×  (regression recovered)
    BinaryAdd:           1.05×
    UnaryAdd:            1.09×
    GradMag1Thread:      1.02×  (preserved)

  Compared to the v2 default (-march=x86-64-v2):
    v4-bare regressed Resample by 13 % and Demons by 16 %.
    v4-pw256 brings both within 4 % of v2 while retaining
    AVX-512 encoding benefits on scalar benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pip wheels, Docker images, and hardware translation layers (Rosetta) only guarantee x86-64 baseline. Users building for local use may want to consider x86-64-v2, which is the minimum level targeted by current Linux distributions (Fedora, RHEL 9, SUSE).
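One way to make this choice explicit per build is an initial-cache script passed to `cmake -C`. The sketch below is only an illustration: the file name is hypothetical, and it assumes the ITK_X86_64_ISA_LEVEL cache variable introduced by this PR.

```cmake
# itk-local-cache.cmake (hypothetical file name)
# Usage: cmake -C itk-local-cache.cmake <path-to-ITK-source>

# Redistributable artifacts (wheels, Docker images): keep the toolchain baseline.
# set(ITK_X86_64_ISA_LEVEL "default" CACHE STRING "x86-64 ISA level")

# Local-only build on reasonably recent hardware: opt in to x86-64-v2.
set(ITK_X86_64_ISA_LEVEL "x86-64-v2" CACHE STRING "x86-64 ISA level")
```

Keeping the selection in a small script rather than on the command line makes it easier to see which level a given build directory was configured with.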
2deda19 to bccc55c
Closing in favor of a simpler approach: remove the hardcoded
Closes #2634.
Replaces the historical hard-coded `-march=corei7` (Nehalem, 2008) with `ITK_X86_64_ISA_LEVEL`, a CMake cache dropdown that selects a standard x86-64 micro-architecture level. Default is `default` (compiler toolchain default, effectively x86-64 baseline) for maximum redistributability. The x86-64-v4 (AVX-512) level includes `-mprefer-vector-width=256` to avoid CPU frequency throttling on Intel hardware.

Commits
- `ITK_X86_64_ISA_LEVEL` cache variable + `itk_isa_level_arch_flag()` helper
- `check_compiler_optimization_flags()`, replacing `-march=corei7`
- `-mprefer-vector-width=256` for x86-64-v4, backed by benchmark data

| cmake-gui dropdown | `-march=` flag |
|---|---|
| `default` | (none) |
| `x86-64` | `x86-64` |
| `x86-64-v2` | `x86-64-v2` |
| `x86-64-v3` | `x86-64-v3` |
| `x86-64-v4` | `x86-64-v4` + `-mprefer-vector-width=256` |
| `native` | `native` |

Default: `default` (x86-64 baseline). Users building for local use may want x86-64-v2, which is the minimum level targeted by current Linux distributions (Fedora, RHEL 9, SUSE). The override escape hatch `ITK_C_OPTIMIZATION_FLAGS` / `ITK_CXX_OPTIMIZATION_FLAGS` is unchanged.

WARNING: benchmark limitations
The measurements were taken on a computer that was actively being used for other tasks, and for a limited number of runs. The performance metrics are rough estimates. Do not read too much into them.
Why x86-64-v4 needs -mprefer-vector-width=256
GCC with bare `-march=x86-64-v4` auto-vectorises every function with 512-bit zmm registers. On Intel Sapphire Rapids (and earlier), any zmm instruction triggers a licence-based frequency downshift (~670 µs cooldown). A full ITK build produces 53,063 zmm instructions across 4,977 functions in the ResampleBenchmark binary, keeping the CPU perpetually in the throttled P-state.

`-mprefer-vector-width=256` tells GCC to use AVX-512 instruction encoding (EVEX prefix, 32 vector registers, mask registers) but keep vector width at 256-bit (ymm). The CPU stays in the non-throttled P-state while retaining access to AVX-512 features.

zmm instruction count in ResampleBenchmark binary:

`-mprefer-vector-width=256`

Benchmark evidence: v4-bare vs v4-pw256 (Xeon w7-3545, NSLOTS=1)
Speedup of v4-pw256 vs v4-bare (>1.0 = pw256 faster):
Compared to the v2 baseline (`-march=x86-64-v2`):

4-config sweep: v2 vs v3 vs v4 (n=3 passes, Xeon w7-3545, NSLOTS=1)
Speedup vs baseline (x86-64, ~2003 ISA):
The default x86-64 baseline is the safest redistributable choice. x86-64-v2 is neutral-to-positive across all benchmarks. Higher levels show diminishing or negative returns due to AVX frequency effects.
Local verification