This repository has been archived by the owner on Oct 28, 2023. It is now read-only.

Improve performance by about 50% #14

Merged (2 commits) on Sep 11, 2019

Conversation

austinjones
Contributor

I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match. It's a very, very hot loop.

Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be memoized/precomputed! Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop.
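
As a rough illustration of the idea (not the code in this PR), a function of two u8s has only 256 × 256 possible inputs, so it can be evaluated once up front and replaced by a table lookup in the hot loop. The `cost` function, the `CostTable` type, and the way the alpha weight is folded in are all hypothetical stand-ins here:

```rust
/// Hypothetical per-channel cost: squared difference of two values,
/// scaled by an alpha weight. A stand-in for the real inner-loop work.
fn cost(a: u8, b: u8, alpha: f32) -> f32 {
    let d = f32::from(a) - f32::from(b);
    d * d * alpha
}

/// 256 x 256 table of precomputed costs for one fixed alpha weight
/// (hypothetical type, for illustration only).
struct CostTable {
    data: Vec<f32>,
}

impl CostTable {
    fn new(alpha: f32) -> Self {
        let mut data = vec![0.0f32; 256 * 256];
        // Inclusive ranges so that the value 255 is covered as well.
        for a in 0..=255u8 {
            for b in 0..=255u8 {
                data[usize::from(a) * 256 + usize::from(b)] = cost(a, b, alpha);
            }
        }
        CostTable { data }
    }

    /// Table lookup that replaces the arithmetic (and the alpha multiply).
    #[inline]
    fn get(&self, a: u8, b: u8) -> f32 {
        self.data[usize::from(a) * 256 + usize::from(b)]
    }
}

fn main() {
    let table = CostTable::new(0.5);
    // The lookup agrees with direct evaluation for every input pair.
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            assert_eq!(table.get(a, b), cost(a, b, 0.5));
        }
    }
    println!("cost(10, 13) with alpha 0.5 = {}", table.get(10, 13));
}
```

With one table per alpha weight, the alpha factor is paid once at table-construction time, at the price of 256 KiB of f32 entries per table in this sketch.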

I looked hard for a way to improve dist_gaussian within find_better_match, but the best I could do was precompute outside the find_best_match loop. It still helped a lot.

The performance improvement ranges from 40%-60% in the examples. I ran them all before/after to get performance numbers:

Baseline (f93022):

$ time cargo run --release --example 01_single_example_synthesis
real	0m35.907s
user	1m42.443s
sys	0m1.703s

$ time cargo run --release --example 02_multi_example_synthesis
real	0m35.078s
user	1m40.583s
sys	0m1.439s

$ time cargo run --release --example 03_guided_synthesis
real	0m54.007s
user	2m52.701s
sys	0m1.350s

$ time cargo run --release --example 04_style_transfer
real	1m1.137s
user	3m2.301s
sys	0m1.432s

$ time cargo run --release --example 05_inpaint
real	0m7.112s
user	0m17.274s
sys	0m0.606s

$ time cargo run --release --example 06_tiling_texture
real	0m17.468s
user	0m44.012s
sys	0m0.844s

Patched (6a97e0):

$ time cargo run --release --example 01_single_example_synthesis
real	0m15.504s
user	0m50.723s
sys	0m0.714s

$ time cargo run --release --example 02_multi_example_synthesis
real	0m22.895s
user	0m55.519s
sys	0m1.365s

$ time cargo run --release --example 03_guided_synthesis
real	0m36.120s
user	1m46.536s
sys	0m1.203s

$ time cargo run --release --example 04_style_transfer
real	0m33.324s
user	1m44.423s
sys	0m0.888s

$ time cargo run --release --example 05_inpaint
real	0m2.747s
user	0m7.814s
sys	0m0.139s

$ time cargo run --release --example 06_tiling_texture
real	0m10.338s
user	0m21.917s
sys	0m0.756s
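
For a side-by-side view, the small sketch below recomputes the wall-clock (`real`) reduction for each example from the timings above. The helper name `reduction_pct` is made up for this illustration. By wall-clock time the reductions work out to roughly 33% (03_guided_synthesis) up to about 61% (05_inpaint):

```rust
/// Percent reduction in wall-clock time from `before` to `after` (seconds).
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // (example, baseline real time, patched real time), in seconds,
    // copied from the `time` output above.
    let runs = [
        ("01_single_example_synthesis", 35.907, 15.504),
        ("02_multi_example_synthesis", 35.078, 22.895),
        ("03_guided_synthesis", 54.007, 36.120),
        ("04_style_transfer", 61.137, 33.324),
        ("05_inpaint", 7.112, 2.747),
        ("06_tiling_texture", 17.468, 10.338),
    ];
    for (name, before, after) in runs {
        println!(
            "{}: {:.1}% less wall-clock time ({:.2}x speedup)",
            name,
            reduction_pct(before, after),
            before / after
        );
    }
}
```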

I didn't see any visible artifacts in the output. This is a lossless optimization!

@austinjones
Contributor Author

Amended with rustfmt fixes.

@austinjones
Contributor Author

If you are curious, the flamegraph SVGs are here: https://drive.google.com/open?id=1Tc58OEbdM7PRgDJ7DT6Om-7nr6Re81q0

@Jake-Shadle
Member

Jake-Shadle commented Sep 6, 2019

Hey, this change looks great, thanks! I think I want to do #15 first (it's something we should have done earlier!) just so we can get better numbers and ensure we don't get regressions in performance or output on this and all future changes, but it might have to wait until next week.

Again, thanks for the contribution, we really appreciate it!

@repi
Contributor

repi commented Sep 7, 2019

Wow awesome work, thanks for the contribution!

@austinjones
Contributor Author

@Jake-Shadle @repi no problem! I am going to be running this code on my hexacore for the foreseeable future, so I want it to be as fast as possible! I'm a generative artist, and have a decent library of images to crunch, and even more on instagram/pinterest once I script the download 😁

Thank you for open-sourcing this! I thought texture synthesis was too slow for what I needed...was doing it by hand with Perlin, OpenSimplex, Worley, and tricks.

I also have another change that is a bit more code, but shaves another 15% off (down to 13.0s on example01). It's more code for less gain, though. Going to upload PR2 in a sec.

@zicklag

zicklag commented Sep 8, 2019

Hey there! Awesome to see performance improvements to this! I'm getting some strange artifacts, though, when using colored masks for guided synthesis.

Here are the images and the command I ran (the image is a screenshot from Child of Light®, not my own):

source1.jpg: [image: source1]
source1_mask.jpg: [image: source1_mask]
source1_target.jpg: [image: source1_target]

command:

texture-synthesis -o out4.png generate --target-guide source1_target.jpg --guides source1_mask.jpg  -- source1.png

Result: [image: out3]

Result from master (without this PR): [image: out2]

I also tested it with a PNG from the Nobiax texture pack for another example, but the problem is not nearly as severe:

source2.png: [image: source2]

Result: [image: out4]

Jake-Shadle self-assigned this Sep 9, 2019
@Jake-Shadle
Member

@austinjones Hey, sorry this is taking so long, I will be looking at this PR tomorrow though!

@austinjones
Contributor Author

@zicklag I've noticed this as well. I'm going to take a look tonight and figure out why it's happening. @Jake-Shadle no rush - looks like I have a bug to fix.

I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match.  It's a very, very hot loop.

Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be precomputed!

Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop.

I ran all the examples and couldn't find any visible artifacts.
austinjones pushed a commit to austinjones/texture-synthesis that referenced this pull request Sep 10, 2019
I found a few small bugs while looking at @zicklag's comments on EmbarkStudios#14

First: there is a numerical precision bug with the calculation of distance gaussians.  The exp() function used to be f64::exp(), and I was using f32::exp().

Second: there were missing entries in the precomputed function table. Range upper bounds are exclusive, and 256 does not fit in a u8, so the loop needs the inclusive range 0..=255u8, which was made for this situation.
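
A minimal sketch of the loop-bound pitfall described in this commit message. The helper `filled_slots` is hypothetical and only counts how many of the 256 table slots a given range would fill:

```rust
/// Marks every index visited by `range` and returns how many of the
/// 256 possible u8 slots were filled (hypothetical helper).
fn filled_slots(range: impl Iterator<Item = u8>) -> usize {
    let mut filled = [false; 256];
    for v in range {
        filled[usize::from(v)] = true;
    }
    filled.iter().filter(|&&f| f).count()
}

fn main() {
    // Exclusive upper bound: 0..255u8 stops at 254, so slot 255 is never written.
    println!("0..255u8  fills {} slots", filled_slots(0..255u8));
    // Inclusive upper bound: 0..=255u8 visits every possible u8.
    println!("0..=255u8 fills {} slots", filled_slots(0..=255u8));
    // Writing `0..256u8` is not an option: the literal 256 is out of range
    // for u8 and is rejected by the compiler by default.
}
```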
@austinjones
Contributor Author

@zicklag Fixed! It was a bad loop bound on PrerenderedU8Function. As soon as I turned on the debug images the patches were crazy.

I also fixed a subtle numerical precision problem with the pow() call in distance_gaussians.
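
To illustrate the kind of difference involved (this is not the project's code), the sketch below evaluates an unnormalized Gaussian weight once in f32 and once in f64. The function names and the exact formula are assumptions made for this example. The two results agree only to about seven significant digits, and far out in the tail the f32 version underflows to zero while the f64 version stays positive:

```rust
/// Unnormalized Gaussian weight evaluated in single precision
/// (hypothetical stand-in for a distance-Gaussian computation).
fn gaussian_f32(x: f32, sigma: f32) -> f32 {
    (-(x * x) / (2.0 * sigma * sigma)).exp()
}

/// The same formula evaluated in double precision.
fn gaussian_f64(x: f64, sigma: f64) -> f64 {
    (-(x * x) / (2.0 * sigma * sigma)).exp()
}

fn main() {
    // Near the peak the two agree only up to f32 precision.
    let lo = f64::from(gaussian_f32(1.0, 1.0));
    let hi = gaussian_f64(1.0, 1.0);
    println!("f32: {:.17}\nf64: {:.17}\ndiff: {:e}", lo, hi, (lo - hi).abs());

    // Far in the tail, exp(-200) is below the smallest positive f32,
    // so the f32 result is exactly zero while the f64 result is not.
    println!("tail f32: {:e}", gaussian_f32(20.0, 1.0));
    println!("tail f64: {:e}", gaussian_f64(20.0, 1.0));
}
```

Whether such differences matter depends on how the weights are used downstream; when the goal is output identical to the original code, keeping the original precision is the safer choice.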

Unit tests pass. I reproduced and verified your example. It now outputs:
cargo run --release -- -t 1 -o out4.png generate -- broken-input.jpg
[image: out4]

@zicklag

zicklag commented Sep 10, 2019

Awesome, that's great! 🎉

Member

@Jake-Shadle Jake-Shadle left a comment

Thanks so much for this PR, it works great and the almost 2x speedup in some of the benchmarks is really nice to see!

@repi repi merged commit fd1b0f0 into EmbarkStudios:master Sep 11, 2019

4 participants