COMP: Replace hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL by hjmjohnson · Pull Request #6039 · InsightSoftwareConsortium/ITK

hjmjohnson · 2026-04-10T16:43:29Z

Closes #2634.

Replaces the historical hard-coded -march=corei7 (Nehalem, 2008) with ITK_X86_64_ISA_LEVEL, a CMake cache dropdown that selects a standard x86-64 micro-architecture level. Default is default (compiler toolchain default, effectively x86-64 baseline) for maximum redistributability. The x86-64-v4 (AVX-512) level includes -mprefer-vector-width=256 to avoid CPU frequency throttling on Intel hardware.

Commits

ENH: Define ITK_X86_64_ISA_LEVEL cache variable + itk_isa_level_arch_flag() helper
COMP: Integrate into check_compiler_optimization_flags(), replacing -march=corei7
PERF: Add -mprefer-vector-width=256 for x86-64-v4, backed by benchmark data
COMP: Default ISA level to x86-64 baseline (pip/Docker/Rosetta safety)

cmake-gui dropdown

Level	`-march=` flag	Key ISA additions	~Year
`default`	(none)	Compiler toolchain default	—
`x86-64`	`x86-64`	SSE, SSE2 (AMD64 baseline)	2003
`x86-64-v2`	`x86-64-v2`	+ SSE4.2, POPCNT	2009
`x86-64-v3`	`x86-64-v3`	+ AVX2, FMA, BMI1/2	2013
`x86-64-v4`	`x86-64-v4` + `-mprefer-vector-width=256`	+ AVX-512F/BW/CD/DQ/VL	2017
`native`	`native`	Host CPU's full ISA (not redistributable)	—

Default: default (x86-64 baseline). Users building for local use may want x86-64-v2, which is the minimum level targeted by current Linux distributions (Fedora, RHEL 9, SUSE). The override escape hatch ITK_C_OPTIMIZATION_FLAGS / ITK_CXX_OPTIMIZATION_FLAGS is unchanged.
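For anyone deciding which level to pick for a local build, the following is a minimal sketch of how the highest level supported by a host could be estimated from its CPU feature flags. It is not part of this PR. The function name `highest_isa_level` is hypothetical, and the flag spellings are assumed to follow Linux `/proc/cpuinfo` conventions (for example `pni` for SSE3 and `abm` for LZCNT).

```shell
# Print the highest x86-64 micro-architecture level implied by a
# space-separated list of CPU feature flags (Linux /proc/cpuinfo spellings).
# Hypothetical helper for illustration only.
highest_isa_level() {
  flags=" $1 "

  # Succeeds only if every named flag appears in $flags.
  has_all() {
    for f in "$@"; do
      case "$flags" in
        *" $f "*) ;;
        *) return 1 ;;
      esac
    done
    return 0
  }

  level="x86-64"
  if has_all cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; then
    level="x86-64-v2"
    if has_all avx avx2 bmi1 bmi2 f16c fma abm movbe xsave; then
      level="x86-64-v3"
      if has_all avx512f avx512bw avx512cd avx512dq avx512vl; then
        level="x86-64-v4"
      fi
    fi
  fi
  echo "$level"
}

# Usage on a Linux host:
#   highest_isa_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

The result is only a hint for local builds; anything redistributed should still use the default level.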

WARNING: benchmark limitations

The measurements were taken on a computer that was actively being used for other tasks, and for a limited number of runs. The performance metrics are rough estimates. Do not read too much into them.

Why x86-64-v4 needs -mprefer-vector-width=256

GCC with bare -march=x86-64-v4 auto-vectorises every function with 512-bit zmm registers. On Intel Sapphire Rapids (and earlier), any zmm instruction triggers a licence-based frequency downshift (~670 µs cooldown). A full ITK build produces 53,063 zmm instructions across 4,977 functions in the ResampleBenchmark binary, keeping the CPU perpetually in the throttled P-state.

-mprefer-vector-width=256 tells GCC to use AVX-512 instruction encoding (EVEX prefix, 32 vector registers, mask registers) but keep vector width at 256-bit (ymm). The CPU stays in the non-throttled P-state while retaining access to AVX-512 features.

zmm instruction count in ResampleBenchmark binary:

Build	zmm instructions
v4-bare	53,063
v4 + `-mprefer-vector-width=256`	4,208 (−92%)
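
The PR does not say how the zmm counts were obtained. One plausible way to get a comparable number is to count the disassembly lines that reference a zmm register, as sketched below. The helper names `count_zmm_instructions` and `percent_reduction` are hypothetical, and the binary path in the usage comment is a placeholder.

```shell
# Count disassembly lines that reference a 512-bit zmm register.
# Reads objdump-style disassembly on stdin (one instruction per line).
count_zmm_instructions() {
  # grep -c still prints 0 when nothing matches, but exits non-zero;
  # "|| true" keeps the helper usable under "set -e".
  grep -cE 'zmm[0-9]+' || true
}

# Percentage reduction from a "before" count to an "after" count,
# rounded to the nearest whole percent.
percent_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

# Usage (path is a placeholder):
#   objdump -d --no-show-raw-insn path/to/ResampleBenchmark | count_zmm_instructions
#   percent_reduction 53063 4208
```

Applied to the two counts in the table, `percent_reduction 53063 4208` prints 92, matching the reported reduction.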

Benchmark evidence: v4-bare vs v4-pw256 (Xeon w7-3545, NSLOTS=1)

Speedup of v4-pw256 vs v4-bare (>1.0 = pw256 faster):

Benchmark	pw256 vs v4-bare
DemonsRegistration	1.21×
Resample (60 variants)	1.10×
UnaryAdd	1.09×
BinaryAdd	1.05×
GradientMagnitude1Thread	1.02×
Median	0.98×
MinMaxCurvatureFlow	0.98×
GradientMagnitude	0.98×
NormalizedCorrelation	0.98×
RegistrationFramework	0.93×

Compared to the v2 baseline (-march=x86-64-v2):

v4-bare regressed Resample by 13% and Demons by 16%.
v4-pw256 brings both within 4% of v2 while retaining AVX-512 encoding benefits on scalar benchmarks (+6–9% on BinaryAdd, UnaryAdd).

4-config sweep: x86-64 vs v2 vs v3 vs v4 (n=3 passes, Xeon w7-3545, NSLOTS=1)

Speedup vs baseline (x86-64, ~2003 ISA):

Benchmark	x86-64	x86-64-v2	x86-64-v3	x86-64-v4
DemonsRegistration	1.00×	1.05×	1.00×	1.09×
Median	1.00×	1.05×	1.03×	1.07×
BinaryAdd	1.00×	1.04×	1.05×	1.04×
MinMaxCurvatureFlow	1.00×	1.03×	0.98×	1.01×
GradientMagnitude	1.00×	1.04×	0.91×	0.96×
GradientMagnitude1Thread	1.00×	0.97×	0.86×	0.91×
NormalizedCorrelation	1.00×	0.97×	1.03×	1.05×
RegistrationFramework	1.00×	1.01×	1.01×	0.96×
Resample (60 variants)	1.00×	1.01×	1.00×	0.88×
UnaryAdd	1.00×	0.93×	0.81×	0.80×

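The default x86-64 baseline is the safest redistributable choice. x86-64-v2 is neutral-to-positive on 7 of the 10 benchmarks, with small regressions on GradientMagnitude1Thread and NormalizedCorrelation (0.97× each) and a larger one on UnaryAdd (0.93×). Higher levels show mixed or negative returns, plausibly due to AVX frequency effects.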

Local verification

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=x86-64-v2
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=generic -march=x86-64-v2

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=x86-64-v4
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=generic -march=x86-64-v4 -mprefer-vector-width=256

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=default
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING=

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=native
$ grep ITK_C_OPTIMIZATION_FLAGS build/CMakeCache.txt
ITK_C_OPTIMIZATION_FLAGS:STRING= -mtune=native -march=native

$ cmake -B build -S . -DITK_X86_64_ISA_LEVEL=bogus
CMake Error: ITK_X86_64_ISA_LEVEL must be one of: default, x86-64, ...
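
The manual checks above could be scripted. The sketch below assumes only the documented `NAME:TYPE=VALUE` layout of `CMakeCache.txt`; the helper name `cache_value` and the per-level build directory names are hypothetical.

```shell
# Print the value of one cache entry from a CMakeCache.txt file.
# $1: path to CMakeCache.txt, $2: cache variable name.
cache_value() {
  # Cache lines have the form NAME:TYPE=VALUE; strip "NAME:TYPE=".
  sed -n "s/^$2:[A-Z]*=//p" "$1"
}

# Usage, configuring one build tree per level (directory names are illustrative):
#   for level in default x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
#     cmake -B "build-$level" -S . -DITK_X86_64_ISA_LEVEL="$level" >/dev/null
#     printf '%s ->%s\n' "$level" \
#       "$(cache_value "build-$level/CMakeCache.txt" ITK_C_OPTIMIZATION_FLAGS)"
#   done
```

Using a separate build directory per level avoids stale cache entries from a previous configuration influencing the result.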

hjmjohnson · 2026-04-12T13:00:58Z

Re: ISA level default — changed default to "default" (compiler toolchain default, effectively x86-64 baseline). Added a 4-line rationale comment in the CMake file: pip wheels, Docker images, and hardware translation layers only guarantee x86-64 baseline.

greptile-apps · 2026-04-12T19:25:01Z

Greptile Summary

This PR replaces the historical hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL, a CMake cache dropdown exposing the standard x86-64-v{2,3,4}/native micro-architecture levels. The default is "default" (no flags) for maximum redistributability, and x86-64-v4 adds -mprefer-vector-width=256 to avoid Intel AVX-512 frequency throttling, backed by benchmark data.

P1 – silent flag drop on Clang for x86-64-v4: Line 134 packs two flags into one space-separated string. check_c_compiler_flag tests them together; Clang does not recognise -mprefer-vector-width=256, so the combined test fails and both flags are silently dropped. The fix is a semicolon-separated string so each flag is tested independently.

Confidence Score: 4/5

Safe to merge after fixing the compound-flag string for x86-64-v4 — a one-line change.

One P1 finding: the compound -march=x86-64-v4 -mprefer-vector-width=256 string causes both flags to be rejected together on Clang, silently nullifying the x86-64-v4 optimization level for Clang users. The fix is trivial (semicolons). All other levels (default, x86-64, x86-64-v2, x86-64-v3, native) work correctly on both GCC and Clang, and the MSVC path is unaffected.

CMake/ITKSetStandardCompilerFlags.cmake line 134 — the x86-64-v4 compound flag string.
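
The difference between testing flags as one unit and testing them individually can be illustrated outside CMake. The sketch below is a shell analogue of a per-flag check, not ITK's CMake code; the function name `accepted_flags` is hypothetical. Whether a particular compiler and version accepts `-mprefer-vector-width=256` is best confirmed by running a probe like this rather than assumed.

```shell
# Print, one per line, each flag that the given compiler accepts on its own.
# $1 is the compiler command; the remaining arguments are candidate flags.
accepted_flags() {
  cc=$1
  shift
  for flag in "$@"; do
    # Compile a trivial program from stdin. -Werror turns "unknown option"
    # warnings into hard failures so unsupported flags are detected.
    if echo 'int main(void) { return 0; }' |
       "$cc" -Werror "$flag" -x c -c -o /dev/null - >/dev/null 2>&1; then
      echo "$flag"
    fi
  done
}

# Usage:
#   accepted_flags gcc   -march=x86-64-v4 -mprefer-vector-width=256
#   accepted_flags clang -march=x86-64-v4 -mprefer-vector-width=256
```

Because each flag is probed separately, a compiler that rejects one flag still keeps the other, which is the behaviour the review asks for.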

Important Files Changed

Filename	Overview
CMake/ITKSetStandardCompilerFlags.cmake	Replaces hardcoded -march=corei7 with a new ITK_X86_64_ISA_LEVEL dropdown; flag resolution logic is correct for all levels except x86-64-v4 where a compound space-separated string causes both flags to be tested together and can be silently dropped on Clang.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ITK_X86_64_ISA_LEVEL\n(CMake cache dropdown)"] --> V{"Validate regex\n^(default|x86-64-v2|...|native)$"}
    V -- invalid --> E["FATAL_ERROR"]
    V -- valid --> F["itk_isa_level_arch_flag()"]
    F --> P{"CMAKE_SYSTEM_PROCESSOR\nmatches x86_64 or AMD64?"}
    P -- No --> Z["_arch_flag = ''"]
    P -- Yes --> D{"level == 'default'?"}
    D -- Yes --> Z
    D -- No --> M{"MSVC?"}
    M -- Yes --> MV{"level"}
    MV -- x86-64-v3 --> MA["/arch:AVX2"]
    MV -- x86-64-v4 --> MB["/arch:AVX512"]
    MV -- native --> MC["/arch:AVX2 proxy"]
    MV -- "x86-64 / x86-64-v2" --> MD["SSE2 default"]
    M -- No --> GV{"level"}
    GV -- native --> GA["-march=native"]
    GV -- x86-64-v4 --> GB["compound string\n-march=x86-64-v4 -mprefer-vector-width=256\n⚠ tested as one unit\nFails on Clang → both flags dropped"]
    GV -- other --> GC["-march=level"]
    MA & MB & MC & MD & GA & GB & GC & Z --> OUT["check_compiler_optimization_flags()\nAppend to InstructionSetOptimizationFlags\nTest via check_c_compiler_flags()\nSet ITK_C/CXX_OPTIMIZATION_FLAGS"]

Reviews (1): Last reviewed commit: "COMP: default ISA level to x86-64 baseli..."

Define a CMake cache variable with dropdown values for selecting the x86-64 instruction set architecture level: default, x86-64, x86-64-v2, x86-64-v3, x86-64-v4, and native.

The helper function itk_isa_level_arch_flag() resolves the selection to a concrete -march= flag (GCC/Clang) or /arch: flag (MSVC).

This commit adds the infrastructure only; the next commit integrates it into ITK's compiler flag logic, replacing the old -march=corei7.

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Integrate the ITK_X86_64_ISA_LEVEL cache variable (added in the prior commit) into check_compiler_optimization_flags(), replacing the historical hard-coded -march=corei7 (Nehalem, 2008).

The default level is x86-64-v2 (SSE4.2, POPCNT), matching the previous corei7 baseline with portable, vendor-neutral level names.

When set to "default", no -march or -mtune flags are emitted, leaving the compiler's built-in defaults in effect. When set to "native", both -march=native and -mtune=native are used for maximum performance on the build host (not redistributable).

See InsightSoftwareConsortium#2634

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rottle

GCC with -march=x86-64-v4 auto-vectorises every function with 512-bit zmm registers (53 063 zmm instructions across 4 977 functions in the ResampleBenchmark binary). On Intel Sapphire Rapids (and earlier) any zmm instruction triggers a licence-based frequency downshift that persists ~670 µs. With 4 977 affected functions the CPU never exits the throttled P-state, causing 12–17 % wall-clock regressions on BSpline-dominated workloads compared to the x86-64-v2 default.

-mprefer-vector-width=256 tells GCC to prefer 256-bit (ymm) vectors while still using the AVX-512 instruction encoding (EVEX prefix, 32 vector registers, mask registers, new ALU operations). This gives access to AVX-512 features without triggering the zmm frequency penalty.

Benchmark evidence (Xeon w7-3545, NSLOTS=1, n=1):

Binary zmm count:
v4-bare: 53 063
v4-pw256: 4 208 (−92 %)

Speedup vs v4-bare (pw256 faster = >1.0):
DemonsRegistration: 1.21× (regression recovered)
Resample (60 var): 1.10× (regression recovered)
BinaryAdd: 1.05×
UnaryAdd: 1.09×
GradMag1Thread: 1.02× (preserved)

Compared to the v2 default (-march=x86-64-v2): v4-bare regressed Resample by 13 % and Demons by 16 %. v4-pw256 brings both within 4 % of v2 while retaining AVX-512 encoding benefits on scalar benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pip wheels, Docker images, and hardware translation layers (Rosetta) only guarantee x86-64 baseline. Users building for local use may want to consider x86-64-v2, which is the minimum level targeted by current Linux distributions (Fedora, RHEL 9, SUSE).

hjmjohnson · 2026-04-13T17:46:06Z

Closing in favor of a simpler approach: remove the hardcoded -march=corei7 entirely rather than replacing it with a new CMake variable. The ISA level abstraction can be revisited later if needed.

github-actions bot added the type:Compiler and type:Infrastructure labels Apr 10, 2026

hjmjohnson changed the title ~~COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option~~ WIP: Investigating Option This is one. COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option Apr 12, 2026

hjmjohnson force-pushed the enh-hardware-support-window branch from 4b77b6b to e1a0add April 12, 2026 11:12

github-actions bot removed the type:Compiler label Apr 12, 2026

hjmjohnson force-pushed the enh-hardware-support-window branch from e1a0add to b6d0bc2 April 12, 2026 11:23

hjmjohnson changed the title ~~WIP: Investigating Option This is one. COMP: Replace hardcoded -march=corei7 with ITK_HARDWARE_SUPPORT_YEARS option~~ COMP: Replace hardcoded -march=corei7 with ITK_X86_64_ISA_LEVEL Apr 12, 2026

hjmjohnson commented Apr 12, 2026

Comment thread CMake/ITKSetStandardCompilerFlags.cmake Outdated

github-actions bot added the type:Compiler label Apr 12, 2026

hjmjohnson marked this pull request as ready for review April 12, 2026 19:17

greptile-apps bot reviewed Apr 12, 2026

Comment thread CMake/ITKSetStandardCompilerFlags.cmake Outdated

hjmjohnson and others added 4 commits April 12, 2026 16:49

hjmjohnson force-pushed the enh-hardware-support-window branch from 2deda19 to bccc55c April 12, 2026 21:57

hjmjohnson closed this Apr 13, 2026

hjmjohnson mentioned this pull request Apr 13, 2026

COMP: Remove hardcoded -march=corei7, -mtune=generic, and dead code #6049

Merged

hjmjohnson wants to merge 4 commits into InsightSoftwareConsortium:main from hjmjohnson:enh-hardware-support-window
