f32 += f32 * u32 is faster in a loop than f32 += f32; can be defeated (a little bit) with #[cold] annotation? #138953
Comments
That's a lot of harness code. Can this be minimized to make it easier to analyze maybe?
@rustbot label: -C-bug +C-optimization +I-slow +E-needs-mcve -needs-triage +T-compiler +A-codegen
I reduced this a bit, but going below... some? complexity level (as in dot product of three floats vs of just one) in the condition makes the iter variants be within 4%:
so idk, there's something there that wants the full non-reduced condition to trigger.
// Iterator variant, frequency-weighted: counts by freq and sums lab * freq.
fn msc_iter_mf(outputs: &mut [f32; 0xFFFFFF + 1], known_rgbs: &Vec<(u32, f32, u32)>) {
    for (rgb, lab, _) in known_rgbs {
        let mut center = *lab;
        loop {
            let (newcenter_count, newcenter) = known_rgbs.iter()
                .filter(|(_, lab, _)| (lab - center).abs() <= 0.02f32)
                .fold((0, 0f32), |(cnt, acc), (_, lab, freq)| (cnt + freq, acc + lab * (*freq as f32)));
            let newcenter = newcenter / newcenter_count as f32;
            if newcenter == center {
                break;
            }
            center = newcenter;
        }
        outputs[*rgb as usize] = center;
    }
}
// Iterator variant, plain: counts by 1 and sums lab.
fn msc_iter_p(outputs: &mut [f32; 0xFFFFFF + 1], known_rgbs: &Vec<(u32, f32, u32)>) {
    for (rgb, lab, _) in known_rgbs {
        let mut center = *lab;
        loop {
            let (newcenter_count, newcenter) = known_rgbs.iter()
                .filter(|(_, lab, _)| (lab - center).abs() <= 0.02f32)
                .fold((0, 0f32), |(cnt, acc), (_, lab, _)| (cnt + 1, acc + lab));
            let newcenter = newcenter / newcenter_count as f32;
            if newcenter == center {
                break;
            }
            center = newcenter;
        }
        outputs[*rgb as usize] = center;
    }
}
// for-loop variant, frequency-weighted: counts by freq and sums lab * freq.
fn msc_for_mf(outputs: &mut [f32; 0xFFFFFF + 1], known_rgbs: &Vec<(u32, f32, u32)>) {
    for (rgb, lab, _) in known_rgbs {
        let mut center = *lab;
        loop {
            let (mut newcenter_count, mut newcenter) = (0, 0f32);
            for (_, lab, freq) in known_rgbs {
                if (lab - center).abs() <= 0.02f32 {
                    newcenter_count += *freq;
                    newcenter += lab * (*freq as f32);
                }
            }
            newcenter /= newcenter_count as f32;
            if newcenter == center {
                outputs[*rgb as usize] = center;
                break;
            }
            center = newcenter;
        }
    }
}
// for-loop variant, plain: counts by 1 and sums lab.
fn msc_for_p(outputs: &mut [f32; 0xFFFFFF + 1], known_rgbs: &Vec<(u32, f32, u32)>) {
    for (rgb, lab, _) in known_rgbs {
        let mut center = *lab;
        loop {
            let (mut newcenter_count, mut newcenter) = (0, 0f32);
            for (_, lab, _) in known_rgbs {
                if (lab - center).abs() <= 0.02f32 {
                    newcenter_count += 1;
                    newcenter += lab;
                }
            }
            newcenter /= newcenter_count as f32;
            if newcenter == center {
                outputs[*rgb as usize] = center;
                break;
            }
            center = newcenter;
        }
    }
}
// Same as msc_for_p, plus a call to an empty #[cold] function inside the branch.
fn msc_for_p_cold(outputs: &mut [f32; 0xFFFFFF + 1], known_rgbs: &Vec<(u32, f32, u32)>) {
    for (rgb, lab, _) in known_rgbs {
        let mut center = *lab;
        loop {
            let (mut newcenter_count, mut newcenter) = (0, 0f32);
            for (_, lab, _) in known_rgbs {
                if (lab - center).abs() <= 0.02f32 {
                    #[cold]
                    fn cold() {}
                    cold();
                    newcenter_count += 1;
                    newcenter += lab;
                }
            }
            newcenter /= newcenter_count as f32;
            if newcenter == center {
                outputs[*rgb as usize] = center;
                break;
            }
            center = newcenter;
        }
    }
}
static mut OUTPUTS: [f32; 0xFFFFFF + 1] = [0f32; 0xFFFFFF + 1];
fn main() {
    use std::io::BufRead;
    let mut known_rgbs = vec![];
    for l in std::io::BufReader::new(std::fs::File::open("kr_").unwrap()).lines().map(Result::unwrap) {
        let mut iter = l.split_whitespace();
        known_rgbs.push((iter.next().unwrap().parse::<u32>().unwrap(), iter.next().unwrap().parse::<f32>().unwrap(), 1));
    }
    macro_rules! one {
        ($f:ident) => {
            let start = std::time::Instant::now();
            $f(unsafe { &mut OUTPUTS }, &known_rgbs);
            let end = std::time::Instant::now();
            println!("{}:\t{:?}", stringify!($f), end - start);
        }
    }
    one!(msc_iter_mf);
    one!(msc_iter_p);
    one!(msc_for_mf);
    one!(msc_for_p);
    one!(msc_for_p_cold);
}
Here's some perf annotates on an E5645 which exhibits the same characteristics:
mf2-msc_for_mf
mf2-msc_for_p
mf2-msc_for_p_cold
What jumps out to me in mf2-msc_for_p is that it spends 40% of its run-time in
I've reproduced the performance differential on an i5-1235U as well so this is not a decroded-uarch problem:
[reduced]
I tried this code:
kr_.gz (real, non-synthetic, data; shouldn't really matter though)
This is five identical implementations (if freq is fixed at 1, which it is).
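To make the "identical if freq is fixed at 1" claim concrete, here is a minimal sketch of just the inner accumulation step in both forms. The function names, the simplified `(lab, freq)` tuple layout, and the `radius` parameter are hypothetical and not part of the reproducer; the point is only that multiplying an `f32` by `1.0` and adding `1` instead of `freq` produce the same count and sum.

```rust
/// Weighted accumulation, as in the `*_mf` variants: count by `freq`
/// and sum `lab * freq`. Names and tuple layout are illustrative only.
fn accumulate_mf(points: &[(f32, u32)], center: f32, radius: f32) -> (u32, f32) {
    let (mut count, mut sum) = (0u32, 0f32);
    for &(lab, freq) in points {
        if (lab - center).abs() <= radius {
            count += freq;
            sum += lab * (freq as f32);
        }
    }
    (count, sum)
}

/// Plain accumulation, as in the `*_p` variants: count by 1 and sum `lab`.
/// The `freq` field is ignored.
fn accumulate_p(points: &[(f32, u32)], center: f32, radius: f32) -> (u32, f32) {
    let (mut count, mut sum) = (0u32, 0f32);
    for &(lab, _) in points {
        if (lab - center).abs() <= radius {
            count += 1;
            sum += lab;
        }
    }
    (count, sum)
}
```

With every `freq` equal to 1, both functions return the same pair for the same input, so any timing difference between the variants comes from code generation rather than from the arithmetic.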
I expected to see this happen: *_p variants are faster-or-at-worst-identical to *_mf.
Instead, this happened: *_mf is 30% (iter) or 21% (for) faster than *_p despite being more computationally complex.
@zopsicle analysed this godbolt of the fors as "linear_core_p does the addition unconditionally, then conditionally stores the result. linear_core_mf preserves the original control flow. If you put #[cold] fn cold() {} cold(); inside the if statement it will preserve the control flow." and they were right, but the _cold variant is still 7% worse than *_mf.
This is obviously wrong.
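The two shapes described in that analysis can be written out by hand at the source level. This is only a sketch of the idea, with hypothetical function names and a simplified slice-of-`f32` input; the compiler is free to turn either form into the other, so the generated assembly is not guaranteed to match what is written here.

```rust
/// Branchy form: the additions only happen when the condition holds.
fn accumulate_branch(values: &[f32], center: f32, radius: f32) -> (u32, f32) {
    let (mut count, mut sum) = (0u32, 0f32);
    for &lab in values {
        if (lab - center).abs() <= radius {
            count += 1;
            sum += lab;
        }
    }
    (count, sum)
}

/// Select form: the additions are always computed, and the condition
/// only decides whether the new or the old value is kept.
fn accumulate_select(values: &[f32], center: f32, radius: f32) -> (u32, f32) {
    let (mut count, mut sum) = (0u32, 0f32);
    for &lab in values {
        let in_range = (lab - center).abs() <= radius;
        let next_count = count + 1;
        let next_sum = sum + lab;
        count = if in_range { next_count } else { count };
        sum = if in_range { next_sum } else { sum };
    }
    (count, sum)
}

/// Empty function marked as rarely called, as in the `_cold` variant.
#[cold]
fn cold() {}

/// Branchy form with a call to `cold()` inside the branch, which hints
/// that the branch is unlikely to be taken.
fn accumulate_cold(values: &[f32], center: f32, radius: f32) -> (u32, f32) {
    let (mut count, mut sum) = (0u32, 0f32);
    for &lab in values {
        if (lab - center).abs() <= radius {
            cold();
            count += 1;
            sum += lab;
        }
    }
    (count, sum)
}
```

All three return the same result for the same input. In the select form each iteration's addition feeds the next one whether or not the value is in range, whereas the branchy forms skip the addition for out-of-range values.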
Measurements, for me, on an i7-2600:
Meta
rustc --version --verbose:
Building in release mode in cargo and with --codegen opt-level=3 outside of cargo.
Please note that I am not interested in making the code faster overall, just in the pessimisation that leads to the relative difference between the variants. (You know how programmers are.)