Amended with rustfmt fixes.
If you are curious, the flamegraph SVGs are here: https://drive.google.com/open?id=1Tc58OEbdM7PRgDJ7DT6Om-7nr6Re81q0
Hey, this change looks great, thanks! I think I want to do #15 first (it's something we should have done earlier!) just so we can get better numbers and ensure we don't get regressions in performance or output on this and all future changes, but it might have to wait until next week. Again, thanks for the contribution, we really appreciate it!
Wow, awesome work, thanks for the contribution!
@Jake-Shadle @repi no problem! I am going to be running this code on my hexacore for the foreseeable future, so I want it to be as fast as possible! I'm a generative artist with a decent library of images to crunch, and even more on Instagram/Pinterest once I script the download 😁 Thank you for open-sourcing this! I thought texture synthesis was too slow for what I needed, so I was doing it by hand with Perlin, OpenSimplex, Worley, and tricks. I also have another change that is a bit more code, but shaves another 15% off (down to 13.0s on example01). It's more code for less gain, though. Going to upload PR 2 in a sec.
@austinjones Hey, sorry this is taking so long, I will be looking at this PR tomorrow though!
@zicklag I've noticed this as well. I'm going to take a look tonight and figure out why it's happening. @Jake-Shadle no rush - looks like I have a bug to fix.
I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match. It's a very, very hot loop. Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be precomputed! Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop. I ran all the examples and couldn't find any visible artifacts.
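To make the precomputation idea concrete, here is a minimal sketch, assuming the per-pixel cost really is a pure function of two u8 channel values; the struct name and signatures below are illustrative, not the exact code in this PR (the PR's actual type is called PrerenderedU8Function).

```rust
/// Illustrative sketch of a precomputed two-u8 function; the PR's actual
/// PrerenderedU8Function may differ in fields and API.
struct PrecomputedU8Fn {
    /// 256 * 256 outputs, one per (a, b) pair of u8 inputs.
    table: Vec<f32>,
}

impl PrecomputedU8Fn {
    fn new(f: impl Fn(u8, u8) -> f32) -> Self {
        let mut table = vec![0.0f32; 256 * 256];
        // Inclusive ranges so that 255 is covered; an exclusive 0..255u8
        // bound would silently drop the last row/column of the table.
        for a in 0..=255u8 {
            for b in 0..=255u8 {
                table[a as usize * 256 + b as usize] = f(a, b);
            }
        }
        PrecomputedU8Fn { table }
    }

    #[inline]
    fn get(&self, a: u8, b: u8) -> f32 {
        self.table[a as usize * 256 + b as usize]
    }
}
```

Because the table is built once up front, per-pixel alpha handling can be folded into `f` at construction time, which is what removes the alpha math from the inner loop.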
I found a few small bugs while looking at @zicklag's comments on EmbarkStudios#14. First: there is a numerical precision bug in the calculation of the distance gaussians. The exp() function used to be f64::exp(), but I was using f32::exp(). Second: there were missing entries in the precomputed function table. Loop bounds are exclusive, and 256 does not fit in a u8, so it needs 0..=255u8, which was made for exactly this situation.
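For readers following along, a hedged sketch of the precision issue (constants, names, and parameters are illustrative; the crate's real distance_gaussians code may differ): evaluating the gaussian entirely in f32 drifts slightly from the original f64 path, which is enough to flip a "better match" comparison, so the fix is to compute in f64 and only narrow at the end.

```rust
// Sketch only: the crate's distance_gaussians uses its own parameters.
fn distance_gaussian_f64(distance_sq: f64, sigma: f64) -> f32 {
    // Exponentiate in f64, as the original code did, then narrow to f32.
    (-distance_sq / (2.0 * sigma * sigma)).exp() as f32
}

fn main() {
    let d2 = 1234.5_f64;
    let sigma = 30.0_f64;
    // The same expression evaluated entirely in f32 can differ in the low
    // bits, which matters when the result feeds a best-match comparison.
    let in_f32 = (-(d2 as f32) / (2.0 * (sigma as f32) * (sigma as f32))).exp();
    let in_f64 = distance_gaussian_f64(d2, sigma);
    println!("f32 path: {in_f32}, f64 path: {in_f64}");
}
```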
@zicklag Fixed! It was a bad loop bound on PrerenderedU8Function. As soon as I turned on the debug images the patches were crazy. I also fixed a subtle numerical precision problem with the pow() call in distance_gaussians. Unit tests pass. I reproduced and verified your example. It now outputs:
Awesome, that's great! 🎉
Thanks so much for this PR, it works great and the almost 2x speedup in some of the benchmarks is really nice to see!
I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match. It's a very, very hot loop.
Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be memoized/precomputed! Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop.
I looked hard for a way to improve dist_gaussian within find_better_match, but the best I could do was precompute it outside the find_best_match loop. It still helped a lot.
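As a rough sketch of what "precompute outside the loop" means here (the loop structure, names, and parameters below are assumptions, not the crate's actual find_best_match): the gaussian weight for each neighbour offset does not depend on the candidate being scored, so it can be computed once before the candidate loop, leaving only multiplies and adds on the hot path.

```rust
// Hypothetical shape of the candidate loop, to illustrate hoisting the
// gaussian weights out of the hot path; the real find_best_match in
// texture-synthesis is more involved.
fn best_candidate(
    neighbour_dists_sq: &[f32],   // squared spatial distance per neighbour offset
    candidate_costs: &[Vec<f32>], // per-candidate, per-neighbour colour cost
    sigma: f32,
) -> Option<usize> {
    // Precomputed once: gaussian weight per neighbour offset.
    let weights: Vec<f32> = neighbour_dists_sq
        .iter()
        .map(|&d2| (-d2 / (2.0 * sigma * sigma)).exp())
        .collect();

    let mut best: Option<(usize, f32)> = None;
    for (i, costs) in candidate_costs.iter().enumerate() {
        // Hot loop: no exp() calls left, only multiplies and adds.
        let score: f32 = costs.iter().zip(&weights).map(|(c, w)| c * w).sum();
        if best.map_or(true, |(_, b)| score < b) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```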
The performance improvement ranges from 40%-60% in the examples. I ran them all before/after to get performance numbers:
Baseline (f93022):
Patched (6a97e0):
I didn't see any visible artifacts in the output. This is a lossless optimization!