[Optimization] Read example pixels only when necessary #69

Merged
merged 7 commits into from Nov 25, 2019


@Mr4k Mr4k commented Nov 22, 2019

Checklist

  • I have read the Contributor Guide
  • I have read and agree to the Code of Conduct
  • I have added a description of my changes and why I'd like them included in the section below

Description of Changes

I'm not sure if this is the kind of contribution you are looking for, but I did a little profiling (using Instruments, screenshot below) and found that, on my computer at least, a large amount of time was being spent in the function k_neighs_to_color_pattern when creating the candidate patterns. Looking up the pixels in the example images appears to be somewhat costly, and because it is done in an inner loop it ends up contributing significantly to runtime.

(Screenshot: Instruments profile)

To cut down on this cost, I moved the pixel lookups for the candidate's neighbors into the better_match function, so each pixel is read only right before it is needed in the cost function. Because you are already stopping a lot of the cost computations early (when the current candidate's cost exceeds the smallest candidate cost so far), fewer pixel lookups are performed. This does not change the algorithm at all.
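
A minimal sketch of the lazy-read idea described above, not the actual code from this PR. The names `better_match`, `my_pattern` and `current_best` come from the thread; the `Image` type, `pixel_cost`, the `weights` slice (standing in for `distance_gaussians`) and the `reads` counter are hypothetical and exist only to make the effect visible.

```rust
/// Hypothetical stand-in for an example image: row-major RGBA pixels.
struct Image {
    width: usize,
    pixels: Vec<[u8; 4]>,
}

impl Image {
    fn get(&self, x: usize, y: usize) -> [u8; 4] {
        self.pixels[y * self.width + x]
    }
}

/// Squared distance between two RGBA pixels (assumed cost function).
fn pixel_cost(a: [u8; 4], b: [u8; 4]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = f32::from(p) - f32::from(q);
            d * d
        })
        .sum()
}

/// Returns `Some(score)` if the candidate beats `current_best`, else `None`.
/// `candidate_coords`, `my_pattern` and `weights` must have the same length.
/// `reads` counts pixel lookups, purely for demonstration.
fn better_match(
    example: &Image,
    candidate_coords: &[(usize, usize)],
    my_pattern: &[[u8; 4]],
    weights: &[f32],
    current_best: f32,
    reads: &mut usize,
) -> Option<f32> {
    let mut score = 0.0;
    for (i, &(x, y)) in candidate_coords.iter().enumerate() {
        // The candidate pixel is fetched only now, right before it is scored.
        let candidate_pixel = example.get(x, y);
        *reads += 1;
        score += pixel_cost(my_pattern[i], candidate_pixel) * weights[i];
        if score >= current_best {
            // Early exit: the remaining candidate pixels are never read.
            return None;
        }
    }
    Some(score)
}
```

With a candidate whose first neighbor already pushes the score past `current_best`, only one pixel is read. A candidate that survives to the end costs the same number of reads as before.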

This results in roughly a 14% to 45% speedup (according to your benchmark suite, run on my computer), depending on the example image(s) used and the size of the output texture. The average speedup seems to be more in the range of 14% to 25% (very unscientifically computed). I assume there are pathological cases where no performance gain would occur, but I think they would be rare.

About my computer:
MacBook Pro (Retina, 13-inch, Early 2015)
Processor: 2.9 GHz Intel Core i5 (4 logical cores)
Memory: 8 GB 1867 MHz DDR3

Edit: Additionally tested with a MacBook Pro 2018
2.6 GHz Intel Core i7 (12 logical cores)
32 GB 2400 MHz DDR4

I also tried to keep the code changes minimal, but some refactoring was needed.

Disclaimer:
I have not tested this on a wide variety of devices or on high-end CPUs.

Related Issues

I don't think this is related to any open issues.

);

//get example pattern to compare to
k_neighs_to_color_pattern(

@Mr4k Mr4k Nov 22, 2019

I'm not sure why this call to generate my_guide_pattern was in the candidates loop. It didn't seem to matter, so I moved it out of the loop.
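
To illustrate the kind of change described in this comment, here is a small sketch of hoisting a loop-invariant pattern out of a candidates loop. Only the name `my_guide_pattern` comes from the thread; `build_pattern`, `abs_diff_sum`, `best_candidate` and the flat `u8` image layout are hypothetical simplifications, not the library's code.

```rust
/// Gathers the values at `neighbor_indices` from a flat image buffer.
fn build_pattern(neighbor_indices: &[usize], image: &[u8]) -> Vec<u8> {
    neighbor_indices.iter().map(|&i| image[i]).collect()
}

/// Sum of absolute differences between two patterns (assumed cost).
fn abs_diff_sum(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&p, &q)| u32::from(p.abs_diff(q)))
        .sum()
}

/// Returns the index of the candidate pattern closest to the guide pattern.
fn best_candidate(
    neighbor_indices: &[usize],
    guide: &[u8],
    candidates: &[Vec<u8>],
) -> Option<usize> {
    // Built once, outside the loop: it does not depend on the candidate.
    let my_guide_pattern = build_pattern(neighbor_indices, guide);
    candidates
        .iter()
        .enumerate()
        .min_by_key(|(_, candidate)| abs_diff_sum(&my_guide_pattern, candidate))
        .map(|(index, _)| index)
}
```

Because the guide pattern depends only on the pixel being resolved, building it per candidate would repeat identical work on every iteration.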

}
}

fn k_neighs_to_precomputed_reference_pattern(

@Mr4k Mr4k Nov 22, 2019

Renamed this function to differentiate it from the candidate pixels, which are now read on the fly. This function is now only called to build my_pattern and my_guide_pattern.

@Mr4k Mr4k changed the title Read example pixels only when necessary [Optimization] Read example pixels only when necessary Nov 22, 2019
}
}
score += next_pixel_score * distance_gaussians[i];
if score >= current_best {
return None;

@Mr4k Mr4k Nov 22, 2019

This is the heart of the speedup: if we return early, we don't read any more pixels.

}
}
score += next_pixel_score * distance_gaussians[i];

@Mr4k Mr4k Nov 22, 2019

I should probably shrink distance_gaussians from size (num neighbors) * 4 to just num neighbors, as that is all I'm using here.
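
As a rough sketch of what a per-neighbor weight table could look like (one entry per neighbor rather than one per color channel), assuming the weights are a Gaussian falloff over each neighbor's offset from the pixel being resolved. The function name `neighbor_gaussians`, the offset representation and the `sigma` parameter are all assumptions for illustration; the thread does not show how `distance_gaussians` is actually built.

```rust
/// One Gaussian weight per neighbor, based on the neighbor's squared distance
/// from the pixel being resolved. `sigma` controls the falloff (hypothetical).
fn neighbor_gaussians(offsets: &[(i32, i32)], sigma: f32) -> Vec<f32> {
    offsets
        .iter()
        .map(|&(dx, dy)| {
            let dist_sq = (dx * dx + dy * dy) as f32;
            (-dist_sq / (2.0 * sigma * sigma)).exp()
        })
        .collect()
}
```

Indexing such a table with the neighbor index `i` matches the `distance_gaussians[i]` lookup in the snippet above, where the score is accumulated once per neighbor.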

@Jake-Shadle Jake-Shadle self-requested a review Nov 22, 2019

Jake-Shadle commented Nov 22, 2019

I unfortunately didn't have time to review this today, but I will check it out Monday, thanks for the PR!

@arirawr arirawr added the enhancement New feature or request label Nov 25, 2019

@Jake-Shadle Jake-Shadle left a comment

Thanks for the PR, here are the numbers from the baselines I've been running:

group                  pr-33                                  pr-69
-----                  -----                                  -----
guided/100             1.74    389.0±9.75ms        ? B/sec    1.00    224.2±9.99ms        ? B/sec
guided/200             1.58    730.3±4.26ms        ? B/sec    1.00    463.7±6.23ms        ? B/sec
guided/25              1.86    157.0±5.80ms        ? B/sec    1.00     84.4±4.05ms        ? B/sec
guided/400             1.41       2.9±0.01s        ? B/sec    1.00       2.0±0.08s        ? B/sec
guided/50              2.03    321.6±6.41ms        ? B/sec    1.00    158.4±7.57ms        ? B/sec
inpaint/100            1.38    183.5±7.84ms        ? B/sec    1.00    133.2±7.17ms        ? B/sec
inpaint/200            1.18    222.9±7.21ms        ? B/sec    1.00    188.6±5.52ms        ? B/sec
inpaint/25             1.44     44.0±1.20ms        ? B/sec    1.00     30.6±1.07ms        ? B/sec
inpaint/400            1.00    535.5±7.99ms        ? B/sec    1.12    598.6±9.36ms        ? B/sec
inpaint/50             1.48    181.2±7.72ms        ? B/sec    1.00    122.7±7.48ms        ? B/sec
inpaint_channel/100                                           1.00    200.2±8.54ms        ? B/sec
inpaint_channel/200                                           1.00   597.6±51.78ms        ? B/sec
inpaint_channel/25                                            1.00     17.7±0.11ms        ? B/sec
inpaint_channel/400                                           1.00    473.3±8.34ms        ? B/sec
inpaint_channel/50                                            1.00     68.7±7.33ms        ? B/sec
multi_example/100      1.24    225.3±3.79ms        ? B/sec    1.00    182.4±6.30ms        ? B/sec
multi_example/200      1.00    445.4±9.75ms        ? B/sec    1.04    465.3±8.19ms        ? B/sec
multi_example/25       1.54     98.0±4.69ms        ? B/sec    1.00    63.5±11.85ms        ? B/sec
multi_example/400      1.00  1834.9±37.90ms        ? B/sec    1.46       2.7±0.10s        ? B/sec
multi_example/50       1.53    182.5±3.27ms        ? B/sec    1.00    119.4±4.23ms        ? B/sec
single_example/100     1.18    208.5±4.18ms        ? B/sec    1.00   176.3±13.28ms        ? B/sec
single_example/200     1.00    425.8±4.54ms        ? B/sec    1.23   523.7±10.91ms        ? B/sec
single_example/25      1.87     84.1±4.66ms        ? B/sec    1.00     45.1±2.80ms        ? B/sec
single_example/400     1.00  1709.8±12.95ms        ? B/sec    1.55       2.6±0.04s        ? B/sec
single_example/50      1.44    176.7±4.70ms        ? B/sec    1.00   122.3±12.19ms        ? B/sec
style_transfer/100     1.64    383.9±5.37ms        ? B/sec    1.00    234.7±8.10ms        ? B/sec
style_transfer/200     1.51    740.4±3.05ms        ? B/sec    1.00    489.6±5.67ms        ? B/sec
style_transfer/25      1.80    148.2±5.87ms        ? B/sec    1.00     82.6±5.66ms        ? B/sec
style_transfer/400     1.51       2.9±0.01s        ? B/sec    1.00  1924.1±42.17ms        ? B/sec
style_transfer/50      1.67    300.1±9.62ms        ? B/sec    1.00   180.2±11.65ms        ? B/sec
tiling/100             1.32    233.5±6.02ms        ? B/sec    1.00    177.2±5.26ms        ? B/sec
tiling/200             1.00    396.7±4.94ms        ? B/sec    1.11    439.9±9.27ms        ? B/sec
tiling/25              1.66     77.6±3.56ms        ? B/sec    1.00     46.7±2.91ms        ? B/sec
tiling/400             1.00  1518.5±39.65ms        ? B/sec    1.39       2.1±0.04s        ? B/sec
tiling/50              1.42    183.6±6.05ms        ? B/sec    1.00   129.3±15.20ms        ? B/sec

So pretty much across the board improvements, and the regressions are fairly minor. And all tests pass, so I'm happy!

@Jake-Shadle Jake-Shadle merged commit b72aef5 into EmbarkStudios:master Nov 25, 2019

Mr4k commented Nov 26, 2019

Hey! First just wanted to say thanks so much for taking the time to review!

Second, I just wanted to make sure that I am not actually making your code worse :) One thing I notice in your benchmark numbers is that performance seems to have regressed on larger images, which worries me a little.

I'm curious what OS / CPU you are using? I tried another MacBook with an i7 and fewer cores (just 4 physical ones) and a DigitalOcean server (Ubuntu) with an Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, and saw no performance regressions for larger images, even well above 400x400 (it still looks like a 25% speedup on average). Sadly, I still have not found a non-Intel processor to benchmark against.

I'm sure you're pretty busy, but if you have the time I'd be really curious what results the following simple benchmarks give when run a couple of times with and without this PR (I compared against the commit right before this one):

time cargo run --release -- --out out/01.jpg generate imgs/1.jpg

time cargo run --release -- --out out/01.jpg --out-size 1024 generate imgs/1.jpg

time cargo run --release -- --out out/01.jpg --out-size 2048 generate imgs/1.jpg

time cargo run --release -- --alpha 0.8 -o out/04.png transfer-style --style imgs/multiexample/4.jpg --guide imgs/tom.jpg


Jake-Shadle commented Nov 26, 2019

I'm on a 3.3 GHz Intel Xeon on Linux. Ping @h3r2tic, who has a Threadripper he can try this on.


h3r2tic commented Nov 26, 2019

Ooooh, cool :) I'll test it when I'm back home!


h3r2tic commented Nov 27, 2019

I ran it on my TR 2990WX:

BEFORE:

commit 3f30bea86b2bc6435cd56805d2ae2f4124d766ec (HEAD, tag: 0.7.1)
Author: Jake Shadle <jake.shadle@embark-studios.com>
Date:   Tue Nov 19 15:29:51 2019 +0100

    Release 0.7.1

time target/release/texture-synthesis.exe --out out/01.jpg generate imgs/1.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.485s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --out out/01.jpg generate imgs/1.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.472s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:00:15] ######################################## 100%
 stage   6 ######################################## 100%

real    0m15.673s
user    0m0.000s
sys     0m0.000s
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:01:04] ######################################## 100%
 stage   6 ######################################## 100%

real    1m5.580s
user    0m0.000s
sys     0m0.000s
time target/release/texture-synthesis.exe --alpha 0.8 -o out/04.png transfer-style --style imgs/multiexample/4.jpg --guide imgs/tom.jpg
[00:00:04] ######################################## 100%
 stage   6 ######################################## 100%

real    0m4.473s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --alpha 0.8 -o out/04.png transfer-style --style imgs/multiexample/4.jpg --guide imgs/tom.jpg
[00:00:04] ######################################## 100%
 stage   6 ######################################## 100%

real    0m4.459s
user    0m0.000s
sys     0m0.000s
AFTER:

commit 7196f75fc38963dadaa579c5a5bab3f5978eff9b (HEAD, origin/master, origin/HEAD, master)
Author: Jake Shadle <jake.shadle@embark-studios.com>
Date:   Mon Nov 25 14:23:21 2019 +0100

    Add CHANGELOG entry for PR#69

time target/release/texture-synthesis.exe --out out/01.jpg generate imgs/1.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.132s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --out out/01.jpg generate imgs/1.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.179s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:00:14] ######################################## 100%
 stage   6 ######################################## 100%

real    0m14.559s
user    0m0.000s
sys     0m0.015s
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:01:01] ######################################## 100%
 stage   6 ######################################## 100%

real    1m1.654s
user    0m0.000s
sys     0m0.000s
time target/release/texture-synthesis.exe --alpha 0.8 -o out/04.png transfer-style --style imgs/multiexample/4.jpg --guide imgs/tom.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.765s
user    0m0.000s
sys     0m0.000s
time target/release/texture-synthesis.exe --alpha 0.8 -o out/04.png transfer-style --style imgs/multiexample/4.jpg --guide imgs/tom.jpg
[00:00:03] ######################################## 100%
 stage   6 ######################################## 100%

real    0m3.788s
user    0m0.000s
sys     0m0.000s


Jake-Shadle commented Nov 27, 2019

Interesting! I'll have to look into why there's a slight regression on the larger ones on my Xeon, then.


h3r2tic commented Nov 27, 2019

There's a slight regression here as well in the --out out/01.jpg --out-size 2048 generate imgs/1.jpg case.


Mr4k commented Nov 27, 2019

Thanks for taking the time to benchmark this!

If I'm reading those numbers correctly it looks like your results for the 2048 case are:

Before:
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:01:04] ######################################## 100%
 stage   6 ######################################## 100%

real    1m5.580s
user    0m0.000s
sys     0m0.000s

After:
time target/release/texture-synthesis.exe --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:01:01] ######################################## 100%
 stage   6 ######################################## 100%

real    1m1.654s
user    0m0.000s
sys     0m0.000s

So it looks like with this PR it ran in ~1:01 vs ~1:05, which doesn't seem to be a regression (though maybe I am misunderstanding).

However, those gains are really minor (in fact I'd almost worry they are noise) compared to what I've gotten on my test machines.

For example, here is my original test MacBook:

Before:
time target/release/texture-synthesis --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:01:17] ######################################## 100%
 stage   6 ######################################## 100%

real	1m17.687s
user	3m47.141s
sys	0m1.396s
time target/release/texture-synthesis --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:01:17] ######################################## 100%
 stage   6 ######################################## 100%

real	1m17.759s
user	3m46.016s
sys	0m1.329s

time target/release/texture-synthesis --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:05:27] ######################################## 100%
 stage   6 ######################################## 100%

real	5m28.603s
user	15m35.845s
sys	0m5.621s
time target/release/texture-synthesis --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:05:23] ######################################## 100%
 stage   6 ######################################## 100%

real	5m25.009s
user	15m36.068s
sys	0m5.371s

After:
time target/release/texture-synthesis --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:01:01] ######################################## 100%
 stage   6 ######################################## 100%

real	1m1.418s
user	2m53.323s
sys	0m1.160s

time target/release/texture-synthesis --out out/01.jpg --out-size 1024 generate imgs/1.jpg
[00:01:01] ######################################## 100%
 stage   6 ######################################## 100%

real	1m1.640s
user	2m53.534s
sys	0m1.157s
time target/release/texture-synthesis --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:04:16] ######################################## 100%
 stage   6 ######################################## 100%

real	4m17.655s
user	12m2.829s
sys	0m4.699s
time target/release/texture-synthesis --out out/01.jpg --out-size 2048 generate imgs/1.jpg
[00:04:13] ######################################## 100%
 stage   6 ######################################## 100%

real	4m15.168s
user	11m58.227s
sys	0m4.631s
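
For comparing the `real` readings above, a small helper like the following can turn before/after wall-clock times into a percentage of time saved. This is only a convenience sketch; the function names are hypothetical and the inputs are typed in by hand from the `time` output.

```rust
/// Converts a `time`-style reading such as `1m5.580s` (passed as 1 and 5.580)
/// into seconds.
fn to_secs(minutes: u32, seconds: f64) -> f64 {
    f64::from(minutes) * 60.0 + seconds
}

/// Percentage of wall-clock time saved going from `before_secs` to `after_secs`.
/// Positive means the run got faster; negative means a regression.
fn percent_saved(before_secs: f64, after_secs: f64) -> f64 {
    (before_secs - after_secs) / before_secs * 100.0
}
```

For the Threadripper 2048 run (1m5.580s before, 1m1.654s after) this comes out to about 6% saved, and for the first MacBook 1024 run (1m17.687s before, 1m1.418s after) about 21% saved.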

If you have the time, I'd be curious to know (but I also don't want to take up too much of your time):
@h3r2tic what is the clock speed of your CPU? One thing I have not been able to bench this on is CPUs with high clock speeds. It's not a lot of data, but the Mac above has a clock speed of 2.9 GHz and I only see 20% gains there, as opposed to > 25% on the other computers I've run it on, which have slower CPUs (2.6 GHz) but more cores. Of course, it could also be any number of other CPU differences. I also haven't tested on any CPUs with very high core counts (the most physical cores I've tried is 8). I might try that as well when I get more time tonight. (At this point I'm just curious.)
@Jake-Shadle what happens if you run these kinds of simple time commands instead of the bench suite?


h3r2tic commented Nov 27, 2019

Ah, once again I find out that I can't even read xD You're obviously correct, @Mr4k, there was no regression there :)

It's a stock 2990WX (https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-2990wx), so just 3 GHz, but it does have 32 cores.


Mr4k commented Nov 28, 2019

Did a few tests on a Google Cloud server with 8 cores (16 threads) and a 2.8 GHz Intel Xeon processor, and noticed that the improvements were 16-18% instead of 20-30%. I should be able to test with even more cores soon. It seems conceivable that the benefits drop off with a large number of threads (it looks like you have 64). That still does not explain the regression, though.
