Optimize matrix functions, color conversion performance #183

Merged
merged 1 commit into from
Apr 20, 2020

okaneco
Contributor

@okaneco okaneco commented Apr 19, 2020

Destructure slices to avoid bounds checking and panic path
Add #[inline] attribute to matrix functions
Improve Rgb into linear
Improve Lab to Xyz
Remove RgbSpace conversion from Hwb to Hsv

  } else {
-     ((x + from_f64(0.055)) / from_f64(1.055)).powf(from_f64(2.4))
+     ((x + from_f64(0.055)) * from_f64::<T>(1.055).recip()).powf(from_f64(2.4))
Contributor Author

@okaneco okaneco Apr 19, 2020

It seems that in some cases, when a variable is being divided by a constant, this recip trick helps.
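A minimal sketch of the recip trick described above, written for plain f64 rather than the crate's generic Float type. Only the else branch appears in the diff; the linear branch (threshold 0.04045, divisor 12.92) is filled in from the standard sRGB definition, and both function names are made up for this example.

```rust
/// sRGB-encoded component to linear light, dividing by the constant directly.
/// (Hypothetical name; the linear branch is assumed from the sRGB definition.)
fn into_linear_div(x: f64) -> f64 {
    if x <= 0.04045 {
        x / 12.92
    } else {
        ((x + 0.055) / 1.055).powf(2.4)
    }
}

/// Same conversion, but multiplying by the reciprocal of the constant,
/// mirroring the `from_f64::<T>(1.055).recip()` form in the diff.
fn into_linear_recip(x: f64) -> f64 {
    if x <= 0.04045 {
        x / 12.92
    } else {
        ((x + 0.055) * 1.055_f64.recip()).powf(2.4)
    }
}
```

Both versions should agree to within floating-point rounding. Whether the second one is actually faster depends on the compiler and CPU, so it still needs to be benchmarked.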

-     z: (c[6] * f.x) + (c[7] * f.y) + (c[8] * f.z),
+     x: x1 + x2 + x3,
+     y: y1 + y2 + y3,
+     z: z1 + z2 + z3,
Contributor Author

I tried the same approach with multiply_xyz_to_rgb, directly below this function, and it consistently regressed, while this one showed improvements over the previous version.
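A rough sketch of the shape this change takes: the matrix is destructured once and each partial product gets its own binding before the sums. It uses f64 and a stand-in Xyz struct instead of the crate's generic types, so treat the type definitions as assumptions.

```rust
/// Row-major 3x3 matrix stored as a flat array (assumed layout).
type Mat3 = [f64; 9];

/// Hypothetical stand-in for the crate's Xyz color type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Xyz {
    x: f64,
    y: f64,
    z: f64,
}

/// Multiply a 3x3 matrix by an Xyz color.
fn multiply_xyz(c: &Mat3, f: &Xyz) -> Xyz {
    // Irrefutable array pattern: no indexing, so no bounds checks or panic path.
    let [c0, c1, c2, c3, c4, c5, c6, c7, c8] = *c;

    let x1 = c0 * f.x;
    let y1 = c3 * f.x;
    let z1 = c6 * f.x;

    let x2 = c1 * f.y;
    let y2 = c4 * f.y;
    let z2 = c7 * f.y;

    let x3 = c2 * f.z;
    let y3 = c5 * f.z;
    let z3 = c8 * f.z;

    Xyz {
        x: x1 + x2 + x3,
        y: y1 + y2 + y3,
        z: z1 + z2 + z3,
    }
}
```

The grouping of the partial products by input component is a guess at one reasonable ordering; as the comment notes, the same rewrite helped one function and hurt its neighbor, so each variant has to be measured.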


-     out
+     [o0, o1, o2, o3, o4, o5, o6, o7, o8]
Contributor Author

The destructuring seemed to help here. There's no need to zero-out an array since we never do anything with that value.

Owner

I think it's a nice stylistic improvement too. 👍
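As an illustration of the destructuring style discussed in this thread, here is a sketch of a 3x3 multiply that binds every element up front and returns an array literal instead of filling a zeroed array. It assumes a flat, row-major [f64; 9] layout and is not the crate's exact code.

```rust
/// Row-major 3x3 matrix stored as a flat array (assumed layout).
type Mat3 = [f64; 9];

/// Multiply two 3x3 matrices, `c * f`.
fn multiply_3x3(c: &Mat3, f: &Mat3) -> Mat3 {
    // Destructure both inputs; the array length is part of the type,
    // so these patterns cannot fail and need no runtime checks.
    let [c0, c1, c2, c3, c4, c5, c6, c7, c8] = *c;
    let [f0, f1, f2, f3, f4, f5, f6, f7, f8] = *f;

    let o0 = c0 * f0 + c1 * f3 + c2 * f6;
    let o1 = c0 * f1 + c1 * f4 + c2 * f7;
    let o2 = c0 * f2 + c1 * f5 + c2 * f8;

    let o3 = c3 * f0 + c4 * f3 + c5 * f6;
    let o4 = c3 * f1 + c4 * f4 + c5 * f7;
    let o5 = c3 * f2 + c4 * f5 + c5 * f8;

    let o6 = c6 * f0 + c7 * f3 + c8 * f6;
    let o7 = c6 * f1 + c7 * f4 + c8 * f7;
    let o8 = c6 * f2 + c7 * f5 + c8 * f8;

    // Build the result directly; no zero-initialized `out` array is needed.
    [o0, o1, o2, o3, o4, o5, o6, o7, o8]
}
```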

if !det.is_normal() {
panic!("The given matrix is not invertible")
}
let mut det = a[0] * d0 - a[1] * d1 + a[2] * d2;
Contributor Author

The placement here is actually very important, which makes sense because it comes right after d0, d1, and d2 are computed. Putting it at the end of this section, just before the is_normal check, causes a massive slowdown.
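For context, a sketch of how such an inverse might be laid out, with the determinant computed immediately after the three minors it depends on and a single reciprocal replacing nine divisions. It uses f64 and a flat row-major array; the names d3 through d8 and the overall arrangement are assumptions, not the merged code.

```rust
/// Row-major 3x3 matrix stored as a flat array (assumed layout).
type Mat3 = [f64; 9];

/// Invert a 3x3 matrix and panic if the matrix is not invertible.
fn matrix_inverse(a: &Mat3) -> Mat3 {
    let [a0, a1, a2, a3, a4, a5, a6, a7, a8] = *a;

    let d0 = a4 * a8 - a5 * a7;
    let d1 = a3 * a8 - a5 * a6;
    let d2 = a3 * a7 - a4 * a6;
    // Determinant placed right after the minors it uses, as described above.
    let mut det = a0 * d0 - a1 * d1 + a2 * d2;

    // Remaining 2x2 minors (hypothetical names).
    let d3 = a1 * a8 - a2 * a7;
    let d4 = a0 * a8 - a2 * a6;
    let d5 = a0 * a7 - a1 * a6;
    let d6 = a1 * a5 - a2 * a4;
    let d7 = a0 * a5 - a2 * a3;
    let d8 = a0 * a4 - a1 * a3;

    if !det.is_normal() {
        panic!("The given matrix is not invertible")
    }
    // One division here, then only multiplications below.
    det = det.recip();

    // Adjugate (transposed cofactors) scaled by 1 / det.
    [
        d0 * det, -d3 * det, d6 * det,
        -d1 * det, d4 * det, -d7 * det,
        d2 * det, -d5 * det, d8 * det,
    ]
}
```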

t6 * s_matrix.red,
t7 * s_matrix.green,
t8 * s_matrix.blue,
]
Contributor Author

I noticed an improvement for this, will have to see if it's just my machine or not.

-     let z = y - (color.b / from_f64(200.0));
+     let y = (color.l + from_f64(16.0)) * from_f64::<T>(116.0).recip();
+     let x = y + (color.a * from_f64::<T>(500.0).recip());
+     let z = y - (color.b * from_f64::<T>(200.0).recip());
Contributor Author

Same concept as the Srgb into linear optimization. This had the most dramatic effect: throughput rose by roughly 50% in the run below.

Cie family/lab to xyz   time:   [861.87 ns 864.43 ns 867.12 ns]
                        thrpt:  [161.45 Melem/s 161.96 Melem/s 162.44 Melem/s]
                 change:
                        time:   [-34.197% -33.915% -33.587%] (p = 0.00 < 0.05)
                        thrpt:  [+50.573% +51.321% +51.968%]

Owner

It's interesting how these have an effect when it's essentially the same thing. I'm aware that division is usually relatively slow, so I would understand the case with the determinant when the number of divisions is reduced. In this case you seem to add an instruction, but maybe there's a faster instruction for 1 / x specifically.

Contributor Author

I was thinking maybe it helped the compiler see a better path for evaluating the values at compile time and pipelining instructions. In most other cases, trying to do this had no effect.

Owner

Compilers are strange beings.
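One likely reason the compiler does not make this substitution by itself: under IEEE 754 rules, x / c and x * (1 / c) are not guaranteed to round to the same value, so an optimizer that preserves floating-point results can generally only make the swap when the reciprocal is exact (for example, when c is a power of two). A small sketch with hypothetical helper names:

```rust
/// Divide by the constant directly.
fn div(x: f64, c: f64) -> f64 {
    x / c
}

/// Multiply by the reciprocal of the constant instead.
/// `f64::recip` computes `1.0 / c`, which is rounded once here,
/// and the product is then rounded a second time.
fn mul_recip(x: f64, c: f64) -> f64 {
    x * c.recip()
}
```

With these definitions, div(3.0, 10.0) and mul_recip(3.0, 10.0) differ in the last bit, while for a divisor of 4.0 the two are identical. This does not explain the speedup by itself, but it shows that the two forms are different computations as far as the compiler is concerned.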

@Ogeon
Owner

Ogeon commented Apr 19, 2020

This is quite interesting and I will have to come back to it with a more well rested brain.

Just a question, though. Was there a significant difference between asserts and destructuring? If not, I prefer destructuring, so that the panic paths are never introduced in the first place, even if the optimizer would likely remove them anyway. And I think it reads nicer. 🙂

@okaneco
Contributor Author

okaneco commented Apr 19, 2020

I started by focusing on the matrix multiplications. The asserts definitely showed improvements, and in some cases I could take it further. For the smaller improvements in the range of 2-3%, I thought they were worth including because, over a few runs, performance was consistently better with them in and consistently worse with them removed. A quick check I used was the number of samples collected in the 5 seconds. When that number was substantially higher, it seemed like a change worth keeping, especially on the Rgb to linear path.

I'd need you to double check to make sure that the numbers aren't only on my end. This does seem like it brings some speed to the linear conversions which were noticed as a slow path from Rgb to Xyz, and apparently a huge roadblock was removed from Lab to Xyz.

I can check the destructuring and asserts again. If I remember correctly the asserts were still faster and destructuring was slower, sometimes markedly. I'm not sure why that reciprocal works either. I'm not an expert with assembly but I tried looking at the differences in the output. With more complicated functions they seemed to start deviating to my untrained eyes based on the recip change.

These were the benches with notable changes. I'll go back and try destructuring to remove asserts if they're equivalent in performance. edit: Destructuring was successful.

Improvements
Cie family/linsrgb to xyz
                        time:   [4.4561 us 4.4629 us 4.4704 us]
                        thrpt:  [31.317 Melem/s 31.370 Melem/s 31.418 Melem/s]
                 change:
                        time:   [-28.036% -27.787% -27.543%] (p = 0.00 < 0.05)
                        thrpt:  [+38.012% +38.480% +38.958%]
                        Performance has improved.

Cie family/lab to xyz   time:   [861.87 ns 864.43 ns 867.12 ns]
                        thrpt:  [161.45 Melem/s 161.96 Melem/s 162.44 Melem/s]
                 change:
                        time:   [-34.197% -33.915% -33.587%] (p = 0.00 < 0.05)
                        thrpt:  [+50.573% +51.321% +51.968%]
                        Performance has improved.

Matrix functions/multiply_xyz
                        time:   [5.4590 ns 5.4723 ns 5.4870 ns]
                        change: [-2.5207% -2.1988% -1.8708%] (p = 0.00 < 0.05)
                        Performance has improved.

Matrix functions/multiply_3x3
                        time:   [11.973 ns 11.993 ns 12.014 ns]
                        change: [-16.636% -16.447% -16.267%] (p = 0.00 < 0.05)
                        Performance has improved.

Matrix functions/matrix_inverse
                        time:   [20.981 ns 21.013 ns 21.048 ns]
                        change: [-26.532% -26.358% -26.173%] (p = 0.00 < 0.05)
                        Performance has improved.

Matrix functions/rgb_to_xyz_matrix
                        time:   [35.260 ns 35.316 ns 35.382 ns]
                        change: [-41.447% -41.283% -41.130%] (p = 0.00 < 0.05)
                        Performance has improved.

Rgb family/rgb to linsrgb
                        time:   [8.1507 us 8.1638 us 8.1778 us]
                        thrpt:  [17.120 Melem/s 17.149 Melem/s 17.177 Melem/s]
                 change:
                        time:   [-4.8448% -4.5442% -4.2355%] (p = 0.00 < 0.05)
                        thrpt:  [+4.4228% +4.7605% +5.0915%]

Rgb family/rgb to hsl   time:   [9.2963 us 9.3136 us 9.3316 us]
                        thrpt:  [15.003 Melem/s 15.032 Melem/s 15.060 Melem/s]
                 change:
                        time:   [-5.0379% -4.7627% -4.4815%] (p = 0.00 < 0.05)
                        thrpt:  [+4.6917% +5.0009% +5.3051%]
                        Performance has improved.

Rgb family/rgb to hsv   time:   [8.5788 us 8.5946 us 8.6114 us]
                        thrpt:  [16.257 Melem/s 16.289 Melem/s 16.319 Melem/s]
                 change:
                        time:   [-6.5063% -6.1843% -5.8439%] (p = 0.00 < 0.05)
                        thrpt:  [+6.2066% +6.5920% +6.9591%]
                        Performance has improved.

Rgb family/xyz to linsrgb
                        time:   [8.1488 us 8.1639 us 8.1807 us]
                        thrpt:  [17.113 Melem/s 17.149 Melem/s 17.180 Melem/s]
                 change:
                        time:   [-16.483% -16.232% -15.987%] (p = 0.00 < 0.05)
                        thrpt:  [+19.030% +19.377% +19.736%]
                        Performance has improved.

Rgb family/linsrgb to rgb
                        time:   [8.4807 us 8.4931 us 8.5055 us]
                        thrpt:  [16.460 Melem/s 16.484 Melem/s 16.508 Melem/s]
                 change:
                        time:   [-2.7129% -2.4336% -2.1709%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2190% +2.4943% +2.7886%]
                        Performance has improved.

Rgb family/rgb_u8 to linsrgb_f32
                        time:   [8.6593 us 8.6714 us 8.6839 us]
                        thrpt:  [16.122 Melem/s 16.145 Melem/s 16.168 Melem/s]
                 change:
                        time:   [-11.008% -10.747% -10.477%] (p = 0.00 < 0.05)
                        thrpt:  [+11.704% +12.041% +12.369%]
                        Performance has improved.

I can't explain these but I know benchmarking isn't perfect.

Regressions
Rgb family/hsv to hsl   time:   [1.0087 us 1.0106 us 1.0125 us]  
                        thrpt:  [138.27 Melem/s 138.53 Melem/s 138.79 Melem/s]  
                 change:  
                        time:   [+9.7911% +10.215% +10.653%] (p = 0.00 < 0.05)  
                        thrpt:  [-9.6271% -9.2681% -8.9179%]  
                        Performance has regressed.

Rgb family/hwb to hsv   time:   [936.00 ns 937.35 ns 938.70 ns]  
                        thrpt:  [149.14 Melem/s 149.36 Melem/s 149.57 Melem/s]  
                 change:  
                        time:   [+17.039% +17.384% +17.775%] (p = 0.00 < 0.05)  
                        thrpt:  [-15.092% -14.809% -14.559%]  
                        Performance has regressed.

@okaneco
Contributor Author

okaneco commented Apr 19, 2020

@Ogeon
Owner

Ogeon commented Apr 19, 2020

Ah, right, it may have been rate limited after all of the requests I caused. I'll run it again later.

Those are some good improvements! Well done! Floats and assembly are not my strong suit either, so a lot of it is a mystery to me. I will at least run it on my computer and see if I get the same results. It will, however, have to wait until at least tomorrow.

Those regressions seem odd. Something to investigate a bit more.

}

/// Invert a 3x3 matrix and panic if matrix is not invertible.
pub fn matrix_inverse<T: Float>(a: &Mat3<T>) -> Mat3<T> {
assert!(a.len() > 8);
Owner

It's probably good to add a comment here to explain why it's different. The same in other places where there's a risk of someone coming by and "fixing" them. 🙂
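To illustrate the two approaches discussed in this thread, here is a sketch using a 3x3 determinant. The first function takes a slice and uses an up-front assert, with the kind of explanatory comment requested above; the second takes a fixed-size array and destructures it. Both functions are hypothetical and are not the crate's code.

```rust
/// Determinant of a row-major 3x3 matrix passed as a slice.
fn det3_slice(a: &[f64]) -> f64 {
    // Deliberate: a single length check up front typically lets the optimizer
    // drop the per-index bounds checks below. Please don't "fix" this by
    // removing it without re-running the benchmarks.
    assert!(a.len() > 8);

    a[0] * (a[4] * a[8] - a[5] * a[7])
        - a[1] * (a[3] * a[8] - a[5] * a[6])
        + a[2] * (a[3] * a[7] - a[4] * a[6])
}

/// Same determinant, but the length is part of the type and the elements are
/// bound by an irrefutable pattern, so no panic path exists to begin with.
fn det3_array(a: &[f64; 9]) -> f64 {
    let [a0, a1, a2, a3, a4, a5, a6, a7, a8] = *a;

    a0 * (a4 * a8 - a5 * a7)
        - a1 * (a3 * a8 - a5 * a6)
        + a2 * (a3 * a7 - a4 * a6)
}
```

The array version only works when the caller can supply a [f64; 9]; for a genuinely dynamic slice, the assert (or a fallible conversion to an array reference) is still needed somewhere.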

@Ogeon
Owner

Ogeon commented Apr 19, 2020

It just occurred to me that it can sometimes help to add #[inline] or #[inline(always)] to smaller functions. The compiler is supposed to be able to inline on its own, but not always in the best way. I just noticed that it had not been added in matrix.rs.
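For reference, a sketch of what the attributes look like on small helpers. The functions are hypothetical; #[inline] is only a hint (though it also makes a non-generic function available for inlining into other crates), while #[inline(always)] is a much stronger request, so any effect should be confirmed with the benchmarks.

```rust
/// Hypothetical small helper: dot product of two 3-vectors.
/// `#[inline]` suggests inlining and exposes the body to downstream crates.
#[inline]
pub fn dot3(a: [f64; 3], b: [f64; 3]) -> f64 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Hypothetical small helper: scale a 3-vector.
/// `#[inline(always)]` asks the compiler to inline at every call site,
/// which can increase code size if the function is not actually small.
#[inline(always)]
pub fn scale3(v: [f64; 3], s: f64) -> [f64; 3] {
    [v[0] * s, v[1] * s, v[2] * s]
}
```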

@okaneco
Contributor Author

okaneco commented Apr 19, 2020

I noticed that too but wasn't sure if I wanted to go down that rabbit hole, seemed likely to blow up with the size of some of the functions. The TransferFn for Srgb is also missing an inline attribute.

I tried playing with the convert int_to_float and float_to_int macros but I'm not sure what more can be done in the current state. Any improvements seemed to cause regressions somewhere else when I played with inline.

@Ogeon
Owner

Ogeon commented Apr 19, 2020

It has been forgotten in many places, so don't assume it's for any particular reason. 😄 But just leave it as it is if there's no clear improvement.

@okaneco
Contributor Author

okaneco commented Apr 19, 2020

Chopping down some more time for two important functions, small impact on the others, and some are neutral. I think this is about the limit of what can be done for now.

Matrix functions/matrix_inverse
                        time:   [14.768 ns 14.799 ns 14.834 ns]
                        change: [-29.928% -29.698% -29.484%] (p = 0.00 < 0.05)
                        Performance has improved.

Matrix functions/rgb_to_xyz_matrix
                        time:   [25.418 ns 25.490 ns 25.556 ns]
                        change: [-32.540% -32.275% -32.014%] (p = 0.00 < 0.05)
                        Performance has improved.

@okaneco
Contributor Author

okaneco commented Apr 20, 2020

I reran the master baseline since I was unsure of it before I started looking at the regressions. Compared to the current feature branch, the tests listed as regressions now see speedups. The minor speedup in Hsv to Hsl I assume is due to the same speedup for Rgb to linear (edit: never mind, I was looking at the wrong function; there's no linear call). Hwb to Hsv isn't showing as a regression anymore, but I don't think I touched any code related to it. Maybe the compiler can do a better job with all the other changes now? Or maybe, when these showed as a regression, something was using resources in the background on my computer. I'm not going to go mad trying to figure it all out.

Rgb family/hsv to hsl   time:   [915.43 ns 917.98 ns 920.58 ns]
                        thrpt:  [152.08 Melem/s 152.51 Melem/s 152.93 Melem/s]
                 change:
                        time:   [-2.7504% -2.3204% -1.9146%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9520% +2.3755% +2.8282%]
                        Performance has improved.

Rgb family/hwb to hsv   time:   [507.82 ns 508.56 ns 509.32 ns]
                        thrpt:  [274.88 Melem/s 275.28 Melem/s 275.69 Melem/s]
                 change:
                        time:   [-36.738% -36.557% -36.370%] (p = 0.00 < 0.05)
                        thrpt:  [+57.158% +57.623% +58.072%]
                        Performance has improved.

@Ogeon
Owner

Ogeon commented Apr 20, 2020

Here are the results from my own run. No regressions and really good improvements!

Improvements
Cie family/linsrgb to xyz                        
                        time:   [3.0956 us 3.0961 us 3.0968 us]
                        thrpt:  [45.209 Melem/s 45.218 Melem/s 45.226 Melem/s]
                 change:
                        time:   [-21.531% -21.384% -21.241%] (p = 0.00 < 0.05)
                        thrpt:  [+26.969% +27.200% +27.439%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  6 (6.00%) high mild
  10 (10.00%) high severe

Cie family/lab to xyz
                        time:   [676.19 ns 676.38 ns 676.59 ns]
                        thrpt:  [206.92 Melem/s 206.98 Melem/s 207.04 Melem/s]
                 change:
                        time:   [-21.242% -21.122% -20.998%] (p = 0.00 < 0.05)
                        thrpt:  [+26.579% +26.778% +26.971%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

Matrix functions/multiply_3x3                        
                        time:   [7.0763 ns 7.0784 ns 7.0806 ns]
                        change: [-4.3961% -4.1563% -3.9025%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

Matrix functions/matrix_inverse                        
                        time:   [10.646 ns 10.649 ns 10.652 ns]
                        change: [-34.040% -33.927% -33.819%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) high mild
  11 (11.00%) high severe

Matrix functions/rgb_to_xyz_matrix                        
                        time:   [16.563 ns 16.568 ns 16.574 ns]
                        change: [-56.968% -56.889% -56.795%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe

Rgb family/rgb to linsrgb                        
                        time:   [3.6251 us 3.6263 us 3.6278 us]
                        thrpt:  [38.591 Melem/s 38.606 Melem/s 38.620 Melem/s]
                 change:
                        time:   [-14.296% -14.158% -14.001%] (p = 0.00 < 0.05)
                        thrpt:  [+16.280% +16.493% +16.681%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

Rgb family/rgb to hsl
                        time:   [4.4344 us 4.4365 us 4.4389 us]
                        thrpt:  [31.539 Melem/s 31.556 Melem/s 31.571 Melem/s]
                 change:
                        time:   [-11.714% -11.563% -11.407%] (p = 0.00 < 0.05)
                        thrpt:  [+12.876% +13.075% +13.268%]
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  14 (14.00%) high severe

Rgb family/rgb to hsv
                        time:   [4.1305 us 4.1325 us 4.1348 us]
                        thrpt:  [33.859 Melem/s 33.878 Melem/s 33.895 Melem/s]
                 change:
                        time:   [-12.240% -12.076% -11.869%] (p = 0.00 < 0.05)
                        thrpt:  [+13.467% +13.735% +13.947%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  15 (15.00%) high severe

Rgb family/hsl to hsv
                        time:   [501.81 ns 501.91 ns 502.00 ns]
                        thrpt:  [278.89 Melem/s 278.94 Melem/s 278.99 Melem/s]
                 change:
                        time:   [-3.0264% -2.7777% -2.5178%] (p = 0.00 < 0.05)
                        thrpt:  [+2.5828% +2.8571% +3.1208%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

Rgb family/hwb to hsv
                        time:   [427.93 ns 428.11 ns 428.30 ns]
                        thrpt:  [326.88 Melem/s 327.02 Melem/s 327.16 Melem/s]
                 change:
                        time:   [-25.709% -25.592% -25.479%] (p = 0.00 < 0.05)
                        thrpt:  [+34.190% +34.395% +34.605%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

Rgb family/xyz to linsrgb                        
                        time:   [5.2303 us 5.2320 us 5.2338 us]
                        thrpt:  [26.749 Melem/s 26.759 Melem/s 26.767 Melem/s]
                 change:
                        time:   [-23.432% -23.300% -23.120%] (p = 0.00 < 0.05)
                        thrpt:  [+30.072% +30.378% +30.604%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

Rgb family/rgb_u8 to linsrgb_f32                        
                        time:   [4.3870 us 4.3891 us 4.3915 us]
                        thrpt:  [31.880 Melem/s 31.897 Melem/s 31.912 Melem/s]
                 change:
                        time:   [-10.983% -10.727% -10.508%] (p = 0.00 < 0.05)
                        thrpt:  [+11.742% +12.016% +12.338%]
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  15 (15.00%) high severe

Unchanged or minor difference
Cie family/xyz to lab
                        time:   [7.2940 us 7.2973 us 7.3011 us]
                        thrpt:  [19.175 Melem/s 19.185 Melem/s 19.194 Melem/s]
                 change:
                        time:   [-0.1800% +0.0585% +0.2567%] (p = 0.62 > 0.05)
                        thrpt:  [-0.2561% -0.0585% +0.1803%]
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

Cie family/lch to lab
                        time:   [1.9700 us 1.9700 us 1.9700 us]
                        thrpt:  [71.065 Melem/s 71.066 Melem/s 71.067 Melem/s]
                 change:
                        time:   [-0.9328% -0.8840% -0.8453%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8525% +0.8919% +0.9416%]
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  10 (10.00%) high severe

Cie family/lab to lch
                        time:   [2.0126 us 2.0136 us 2.0148 us]
                        thrpt:  [69.485 Melem/s 69.526 Melem/s 69.563 Melem/s]
                 change:
                        time:   [+0.6424% +0.9398% +1.2916%] (p = 0.00 < 0.05)
                        thrpt:  [-1.2751% -0.9311% -0.6383%]
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Cie family/yxy to xyz
                        time:   [464.69 ns 464.98 ns 465.30 ns]
                        thrpt:  [300.88 Melem/s 301.09 Melem/s 301.28 Melem/s]
                 change:
                        time:   [-0.1353% +0.0409% +0.2275%] (p = 0.66 > 0.05)
                        thrpt:  [-0.2270% -0.0409% +0.1355%]
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Cie family/xyz to yxy
                        time:   [431.49 ns 431.68 ns 431.89 ns]
                        thrpt:  [324.16 Melem/s 324.32 Melem/s 324.45 Melem/s]
                 change:
                        time:   [-0.1629% +0.0243% +0.2367%] (p = 0.83 > 0.05)
                        thrpt:  [-0.2362% -0.0243% +0.1631%]
                        No change in performance detected.
Found 22 outliers among 100 measurements (22.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe

Matrix functions/multiply_xyz                        
                        time:   [3.1030 ns 3.1038 ns 3.1046 ns]
                        change: [-0.9852% -0.7812% -0.5867%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

Matrix functions/multiply_xyz_to_rgb                        
                        time:   [3.1201 ns 3.1209 ns 3.1218 ns]
                        change: [-0.4326% -0.3082% -0.1925%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe

Matrix functions/multiply_rgb_to_xyz                        
                        time:   [3.1196 ns 3.1202 ns 3.1209 ns]
                        change: [-0.6214% -0.3791% -0.1597%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  6 (6.00%) high severe

Rgb family/hsv to hsl
                        time:   [851.20 ns 851.86 ns 852.55 ns]
                        thrpt:  [164.21 Melem/s 164.35 Melem/s 164.47 Melem/s]
                 change:
                        time:   [-0.3723% -0.0568% +0.1817%] (p = 0.74 > 0.05)
                        thrpt:  [-0.1813% +0.0568% +0.3737%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Rgb family/hsv to hwb
                        time:   [154.06 ns 154.12 ns 154.18 ns]
                        thrpt:  [908.02 Melem/s 908.41 Melem/s 908.75 Melem/s]
                 change:
                        time:   [+0.7771% +0.9782% +1.2360%] (p = 0.00 < 0.05)
                        thrpt:  [-1.2209% -0.9687% -0.7711%]
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  11 (11.00%) high severe

Rgb family/hsl to rgb
                        time:   [5.1071 us 5.1090 us 5.1111 us]
                        thrpt:  [27.391 Melem/s 27.403 Melem/s 27.413 Melem/s]
                 change:
                        time:   [+0.2280% +0.3897% +0.6057%] (p = 0.00 < 0.05)
                        thrpt:  [-0.6020% -0.3882% -0.2275%]
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe

Rgb family/hsv to rgb
                        time:   [4.9960 us 4.9976 us 4.9994 us]
                        thrpt:  [28.004 Melem/s 28.013 Melem/s 28.023 Melem/s]
                 change:
                        time:   [-1.1262% -0.8155% -0.5646%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5678% +0.8222% +1.1390%]
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild
  7 (7.00%) high severe

Rgb family/linsrgb to rgb                        
                        time:   [3.2706 us 3.2716 us 3.2727 us]
                        thrpt:  [42.778 Melem/s 42.792 Melem/s 42.805 Melem/s]
                 change:
                        time:   [-0.3836% -0.1287% +0.1041%] (p = 0.31 > 0.05)
                        thrpt:  [-0.1040% +0.1289% +0.3851%]
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

Rgb family/linsrgb_f32 to rgb_u8                        
                        time:   [5.4606 us 5.4622 us 5.4640 us]
                        thrpt:  [25.622 Melem/s 25.631 Melem/s 25.638 Melem/s]
                 change:
                        time:   [+0.1743% +0.4110% +0.6777%] (p = 0.00 < 0.05)
                        thrpt:  [-0.6731% -0.4093% -0.1740%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

@Ogeon
Owner

Ogeon commented Apr 20, 2020

As for the automatic benchmarks, it seems to hit what's described in https://github.community/t5/GitHub-Actions/GitHub-actions-are-severely-limited-on-PRs/m-p/54669#M9249. That's very unfortunate, and there doesn't seem to be a good alternative that makes it available for forks.

@okaneco
Contributor Author

okaneco commented Apr 20, 2020

That makes actions seem pretty pointless then for a lot of purposes.

Otherwise, it's great to see that the big wins were relatively consistent 🙂.
I'm on an older CPU from 2011 so I'm not surprised that some things which were bigger for me were minor/unchanged.

Do you have any idea why Hsl to Hsv and Hwb to Hsv would have improvements? I'm really happy that the conversion to and from linear is so much faster but those benches make little sense to me.

@Ogeon
Owner

Ogeon commented Apr 20, 2020

> That makes actions seem pretty pointless then for a lot of purposes.

A bit, yes. Or at least less convenient. But there may be some way to do this without commenting on the PR. It can run everything else, so I'm thinking it could print the results in the logs instead. Not as nice, but almost there.

> I'm on an older CPU from 2011 so I'm not surprised that some things which were bigger for me were minor/unchanged.

My CPU is only a bit newer (2014 it seems), but I got really stable results by not having anything heavier than the console and system monitor running.

> Do you have any idea why Hsl to Hsv and Hwb to Hsv would have improvements?

I don't know about HSL to HSV, but for HWB to HSV I accidentally left RGB space conversion in there, meaning it will pull in RGB <-> XYZ code in there too. It's just a theory, but could be why it's affected. I'll see what happens if I change that function, but then I have to close this browser to get the stable results, so I will post this comment first.

@okaneco
Contributor Author

okaneco commented Apr 20, 2020

I was doing the same thing of having just the bare minimum open and getting stable results. I was surprised by how many outliers were in the results you posted though compared to what I get on my desktop and laptop.

Just to make sure, you don't have to run the whole benchmark, you can type in a pattern that matches the name of the bench. It helps to iterate over changes faster that way.

I think printing the log with the compare tool results would be the best course, we don't need the other capabilities as much.

@Ogeon
Owner

Ogeon commented Apr 20, 2020

I removed the RGB space conversion and got

time:   [-8.4998% -8.2138% -7.9330%] (p = 0.00 < 0.05)
thrpt:  [+8.6166% +8.9489% +9.2894%]

compared to your changes, but it did also cause a somewhat large improvement to hsv to hsl, but a regression to hsl to hsv. 🤷‍♂️ Still compared to your change.

To explain the rest, I think it needs much deeper analysis than just looking at the code and numbers. But the code tells me there is at least some room for improvement. The conversions are pretty much implemented straight from the mathematical formulas, so they may be more elaborate than necessary.

I don't know why I had so many outliers in some places, but maybe they are extra sensitive? Or didn't I reduce the interference enough? No idea. They didn't seem noisy when I ran master against master. I will have to come back to all of this later and look into the details.

@okaneco
Contributor Author

okaneco commented Apr 20, 2020

That's encouraging to hear that there are still more improvements that can be made. I don't really use any of the hue types other than Lch, and am more concerned with the time for srgb-linsrgb-xyz-lab, but at least we have a test bench now to measure things.

I tried shuffling the ordering of some of the benchmarks around to see if the numbers still stay the same, and they mostly did except for hwb to hsl and hsv to hsl. I guess the black box really isn't enough for those.

Is there anything else I should do with the current code?

@Ogeon
Owner

Ogeon commented Apr 20, 2020

Hmm, the black box may be an issue. It's apparently not perfect on stable... It would also be interesting to see what happens if there's more work to do in each iteration, i.e. more colors in the arrays. Right now it seems to be able to convert a few HD pictures per second, if I calculated it correctly, so it's not like it's overwhelmed with what's essentially 140 pixels.
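The estimate above can be checked with a little arithmetic. The sketch below assumes 140 elements per iteration (the figure mentioned in the comment), takes the rgb to linsrgb timing of about 3.6263 us from the run posted earlier, and treats an HD picture as 1920x1080 pixels, which is an assumption. The function names are hypothetical.

```rust
/// Elements processed per second, given how long one iteration over
/// `elements` items takes.
fn throughput(elements: u64, seconds_per_iter: f64) -> f64 {
    elements as f64 / seconds_per_iter
}

/// How many full frames of `width` x `height` pixels fit into that throughput.
fn frames_per_second(elems_per_sec: f64, width: u64, height: u64) -> f64 {
    elems_per_sec / (width * height) as f64
}

/// Figures taken from the benchmark output above (rgb to linsrgb).
const ELEMENTS_PER_ITER: u64 = 140;
const SECONDS_PER_ITER: f64 = 3.6263e-6;
```

For that benchmark, throughput(ELEMENTS_PER_ITER, SECONDS_PER_ITER) works out to roughly 38.6 Melem/s, matching the reported throughput, and frames_per_second with 1920 and 1080 gives between 18 and 19 frames per second. Slower conversions land proportionally lower.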

> Is there anything else I should do with the current code?

It's perhaps out of scope for this, but could you do me a favor and remove the RGB space conversion when going from HWB to HSV? They are supposed to have the same color spaces, like when converting from HSL. It's a small change in hsv.rs.

I don't have anything else that needs to be fixed, so feel free to wrap it up. 🙂

Destructure slices to avoid bounds checking and panic paths
Improve Rgb into linear transfer function
Improve Lab to Xyz
Remove RgbSpace conversion for Hsv from Hwb
@okaneco
Contributor Author

okaneco commented Apr 20, 2020

I removed the Sp from Hwb to Hsv. I glossed over that conversion and didn't think anything of it when I was looking at all the from_color_unclamped for more speedups.

I agree that something with a bigger test load would be better, since it'd result in fewer iterations. I didn't want to add any images in the first bench PR, but now it'd be fine. I'm not sure how you feel, but I personally think the images should be small enough that the benches can run in the ~5 seconds like they do now. I don't think I'm impatient, but it already feels really long running through all the benches and not being able to touch the computer during it.

@Ogeon
Owner

Ogeon commented Apr 20, 2020

Perfect, thank you! It bothered me after I discovered it. And thank you for finding all of these improvements!

There are already a few images in the repo that can be used, but randomized data, mixed with the test data, may be a better challenge. I don't know enough about CPUs, but it wouldn't do to have it branch predict its way through it.

It seems to group them as 100 samples and always run in close to 5 seconds, so as long as it can run at least a few hundred iterations in 5 seconds it's probably no issue.

Let's hope the broken benchmark isn't going to block the merge, but I don't think it's required by default.

bors r+

@bors
Contributor

bors bot commented Apr 20, 2020

Build succeeded:

@bors bors bot merged commit 464c823 into Ogeon:master Apr 20, 2020
@okaneco okaneco deleted the matrix branch April 20, 2020 21:49