Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Threading Scheme to Improve Performance with Large Numbers of Threads #70

Merged
merged 18 commits into from Jan 16, 2020

Conversation

Mr4k
Copy link
Contributor

@Mr4k Mr4k commented Dec 2, 2019

Checklist

  • I have read the Contributor Guide
  • I have read and agree to the Code of Conduct
  • I have added a description of my changes and why I'd like them included in the section below

Description of Changes

Disclaimer: This pr is a big proposed change and is not fully ready yet (inpaint and tiling are currently disabled but should be easily fixable). The code isn't the cleanest (feel free to let me know if it's too hard to read) and could be optimized further but I wanted to put it up as a proof of concept see what you think, make sure my big assumptions about how the code works were not very wrong, and see how this runs on other people's machines before I go further.

Motivation:
After seeing that @h3r2tic had virtually no improvements from my previous pr on their 32 core cpu I rented a larger cloud machine (64 logical cores, Intel Xeon) to test out how my previous changes behaved with many cores. It turns out I didn't see any improvement either. I then tried with lower numbers of threads (48, 32, 24, 16, 8, etc) and noticed that the run time actually seemed to improve as I decreased the number of threads (after some fiddling I found 22 threads was the best number for my instance) and barely went up until < 16 threads were in use. I figured this was a contention problem, so I added some additional logging to measure the time it took to acquire various locks and found (at least on that machine) the time it took to acquire a write lock on the rtree went up proportionally with the number of threads. Also as a side effect I believe it dwarfed any improvement from my previous change to the point where it was barely noticeable.

Proposed Improvements:
I've made several changes which I have seen reduce contention on my test machines.

  1. Replace the RTree with a grid of Rtrees
    Advantages:
    Most of the time (at least in the later stages of synthesis) two threads may be working on totally different areas of the image and will not need to access the same sets of pixels and therefore should not have to wait to read / write. To address this after the first few stages of synthesis I replace the single rtree with a grid of rtrees. Writes only lock the rtree in the grid cell which the pixel is in and reads just read from a cell and its neighbors. Overall this has seemed to improve contention considerably in my tests. (Possibly a multi threaded rtree is better than this but I couldn't find one and it seemed more difficult / error prone to build for likely little upside)
    Drawbacks:
    The actual read itself is a little slower, this is balanced out by the fact that far more threads can read and write at the same time but it should be optimized a little more for machines with fewer cores
    Technically we can miss some neighbors if a pixels k nearest neighbors are outside of the current cell + 8 adjacent cells, however because we only use a grid after the first few steps there should be enough pixels where this is never a problem. I have not seen quality degrade (see below)
    Places for improvement:
    The read can definitely be optimized more (especially important for cpus with small numbers of cores)
    The number of grid cells is currently just a random constant but really should vary with the number of threads / size of the image / k

  2. Don’t write to resolved queue until after stage is complete (don't updated it when a pixel is placed)
    After I switched to the rtree grid I noticed that although writes were really fast the resolved queue seemed to be suddenly very hard to acquire a lock on. After looking at the code more I decided that each thread could have its own resolved queue and I could combine them at the end of the stage. I don't think this actually changes the behavior of the code because the only time we read from the resolved queue is to redo pixels done in the previous stages so we don't care what pixels other threads have updated in the current stage.
    Advantages:
    The resolved queue is no longer backed up leading to a speed up
    Disadvantages:
    None that I see

  3. Remove the Update Queue entirely
    I removed the update queue to simplify things a little because the writes were now fast enough where I could write every time instead of needing to batch.
    Advantages:
    Code is simpler, very small speed up
    Disadvantages:
    I can't think of too much but possibly this could be bad not to have in the first few stages before we switch to the grid of rtrees
    Could also be worse for smaller grid sizes

Results:
I’ve tested on only two large machines so far:
Google Cloud N1:
Threads: 64
Cpu: Intel(R) Xeon(R) CPU @ 2.30GHz:

Command: time cargo run --release -- --threads 64 --out-size 2048 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~1m16s
AFTER: ~28.8s

Command: time cargo run --release -- --threads 32 --out-size 2048 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~1m12.8s
AFTER: ~35.7s

Command: time cargo run --release -- --threads 64 --out-size 1024 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~17.5s
AFTER: ~7.5s

Command time cargo run --release -- --threads 64 --out-size 512 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~4.1s
AFTER: ~1.9s

AWS Z1d:
Threads: 48
Cpu: Intel® Xeon® Scalable (Clock speed is a little unclear but Amazon claims up to 4ghz)

Command: time cargo run --release -- --threads 48  --out-size 2048 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~41.5 seconds
AFTER: ~22.5 seconds

Command: time cargo run --release --  --threads 24  --out-size 2048 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~44.5 seconds
AFTER: ~29 seconds

Command: time cargo run --release --  --threads 48 --out-size 1024 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~9.7 seconds
AFTER: ~5.7 seconds 

time cargo run --release --  --threads 48 --out-size 512 --out out/01.jpg generate imgs/1.jpg
BEFORE: ~2.36 seconds
AFTER: ~1.47 seconds

I’m also curious to see what others get on cpus with large numbers of cores. @h3r2tic especially want to know how this runs on your 32 core cpu if you have the time.

Image Quality:
Because this change affects the threads and is fairly major I wanted to make sure that image quality did not change. It looks like tests still pass for everything that I didn’t break so the image hashes must be fairly close. I also manually inspected some large images but did not see any major changes. However it is important that the threads share one rtree for the first few stages before there are enough pixels to fill out a decent portion of the grid cells.

Related Issues

I don’t believe this is related to any mentioned issue

@Mr4k
Copy link
Contributor Author

Mr4k commented Dec 4, 2019

Sped up the read from the grid of rtrees which brought the runtime down a little:
From a test Google Cloud Machine with 64 threads

Command: time cargo run --release -- --out-size 1024  --out out/01.jpg generate imgs/1.jpg
Before grid read speed up:
~7.5 seconds
After grid read speed up:
~5.6 seconds

Command: time cargo run --release -- --out-size 2048 --out out/01.jpg generate imgs/1.jpg
Before grid read speed up:
~28 seconds
After grid read speed up:
~24 seconds

But most importantly it now runs a lot faster cpus with fewer threads which helps address one thing I was worried about with this approach.
My development macbook (4 logical cores):

Command: time cargo run --release -- --out out/01.jpg generate imgs/1.jpg
Before read speed up: ~20s
After read speed up: ~11.5s
Before any form of this pr (current master): ~13.5s

// the stucture of this tuple is (chunk_x, chunk_y, is_center_chunk, closest_x_point, closest_y_point)
let places_to_look = vec![
(chunk_x, chunk_y, true, 0, 0),
(chunk_x + 1, chunk_y, false, (chunk_x + 1) * self.chunk_size as i32, y as i32),
(chunk_x - 1, chunk_y, false, chunk_x * self.chunk_size as i32, y as i32),
(chunk_x, chunk_y - 1, false, x as i32, chunk_y * self.chunk_size as i32),
(chunk_x, chunk_y + 1, false, x as i32, (chunk_y + 1) * self.chunk_size as i32),
(chunk_x + 1, chunk_y + 1, false, (chunk_x + 1) * self.chunk_size as i32, (chunk_y + 1) * self.chunk_size as i32),
(chunk_x - 1, chunk_y + 1, false, chunk_x * self.chunk_size as i32, (chunk_y + 1) * self.chunk_size as i32),
(chunk_x + 1, chunk_y - 1, false, (chunk_x + 1) * self.chunk_size as i32, chunk_y * self.chunk_size as i32),
(chunk_x - 1, chunk_y - 1, false, chunk_x * self.chunk_size as i32, chunk_y * self.chunk_size as i32),
];
Copy link
Member

@Jake-Shadle Jake-Shadle Dec 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really prefer this to be a struct defined inside the the function, tuples are fine for prototyping but end up being really tedious to use over the long term, especially > 2 length.

Copy link
Contributor Author

@Mr4k Mr4k Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer! This code is still pretty prototype-y so I will make a cleanup pass later and can definitely change this.

Copy link
Contributor Author

@Mr4k Mr4k Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these tuples to an in-function struct in my most recent commit

@h3r2tic h3r2tic self-requested a review Dec 4, 2019
@h3r2tic
Copy link
Contributor

h3r2tic commented Dec 4, 2019

That's an impressive speed up! I'll have a look :)

@Jake-Shadle
Copy link
Member

Jake-Shadle commented Dec 5, 2019

Oh and FYI, if you are doing your iterative benchmarking via the cli you might want to check out hyperfine.

@Mr4k
Copy link
Contributor Author

Mr4k commented Dec 11, 2019

Had a little time so I did some code clean up. Inpaint is now reenabled, tiling is still broken but should be a simple fix. Also @h3r2tic wanted to see if you ever got a chance to benchmark these changes on your hardware yet (no worries if not, I don't want to get in the way of your actual work)

@Mr4k
Copy link
Contributor Author

Mr4k commented Dec 17, 2019

Fixed in painting. This pr is not mergable yet because it does not work for rectangular images and needs a simple heuristic to choose grid size but I will be able to get to those soon

([coord_2d.x as i32, coord_2d.y as i32], a.0, a.1)
})
.collect();
self.flush_resolved(
Copy link
Contributor Author

@Mr4k Mr4k Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this changes behavior when both inpainting and tiling are enabled. Now pixels which are not part of the inpaint region will be taken into account when tiling. I could change this back to the old behavior (where those pixels were ignored) by imho this behavior seems better

Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @anopara could have a look. I'm not quite sure what the implications are with the previous and the new behaviors.

Copy link
Contributor

@anopara anopara Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @Mr4k. First of all, this is some awesome work, thank you! Could you please elaborate on "Now pixels which are not part of the inpaint region will be taken into account when tiling" 😅 I'm not quite sure how exactly they will be used during tiling?

Copy link
Contributor Author

@Mr4k Mr4k Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @anopara. Thanks for making this awesome project! I apologize because looking back my comment is definitely vague but here is a concrete example. Consider the output of the following command.

cargo run --release -- --out-size 304x308 --inpaint circles_mask.png --tiling -o imgs/01.png generate circles.png

(Importantly this command has both inpainting and tiling enabled)
using the following images:
circles.png:
circles
circles_mask.png:
circles_mask

On master this generates this following image:
without_tiling_inpaint

On my branch this will generate this image:
with_tiling_inpaint

What I mean is that in my image the inpaint takes into account the pixels on the other side of the image to make it tile. As far as I understand this is because in master when you add the static pixels which are not part of the inpaint region to the rtree (

rtree.insert([coord.x as i32, coord.y as i32]);
) you do not "reflect" them across image borders like you do when you add pixels during synthesis ( ). What I've done is just to use the same function for both which provides mirroring (when tiling is enabled) as a side effect.

Copy link
Contributor

@anopara anopara Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That is a very good observation and that is indeed correct. Thank you for spotting this case. We indeed only mirror newly generated pixels 😅 Your proposed change is definitely a more desirable behavior! 👍

@Mr4k
Copy link
Contributor Author

Mr4k commented Dec 18, 2019

This pr now works for rectangular images and chooses a grid size based on k, the image size and number of pixels filled. So functionality wise I think it's fairly complete.
Also took a quick performance benchmark using hyperfine:
Google Cloud 64 core cpu

Before: 
Benchmark #1: ./target/release/texture-synthesis --out-size 2048 --out out/01.jpg generate imgs/1.jpg
  Time (mean ± σ):     64.681 s ±  0.256 s    [User: 734.705 s, System: 104.357 s]
  Range (min … max):   64.293 s … 65.154 s    10 runs

After:
Benchmark #1: ./target/release/texture-synthesis --out-size 2048 --out out/01.jpg generate imgs/1.jpg
  Time (mean ± σ):     20.544 s ±  0.182 s    [User: 796.544 s, System: 9.745 s]
  Range (min … max):   20.362 s … 20.999 s    10 runs

Copy link
Contributor

@h3r2tic h3r2tic left a comment

Apologies for the delay! Several things aligned, and then holidays stalled me some more :P

The code looks good. Thank you so much for all the work you've put into it! I've done some quality comparisons (haven't benchmarked yet). It seems to be producing similar results to the master branch, and images were identical for single-threaded runs in the tests I ran. Awesome work! A grid of R*-trees was indeed a great idea to speed it up :)

I've added a few comments; most of them are nitpicks, thought would be nice to get those in for code clarity. There's one thing about the tree grid where your cell culling might be too aggressive, though I haven't been able to notice any artifacts due to that in practice.

let tile_adjusted_width = (self.output_size.width as f32 * 1.1) as u32 + 1;
let tile_adjusted_height = (self.output_size.height as f32 * 1.1) as u32 + 1;
self.tree_grid = TreeGrid::new(
tile_adjusted_width,
tile_adjusted_height,
max(tile_adjusted_width, tile_adjusted_height),
(self.output_size.width as f32 * 0.05) as u32 + 1,
(self.output_size.height as f32 * 0.05) as u32 + 1,
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind the 1.1 multiplier, and why is 1 added on top of it?

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a low quality thing on my part. The 1.1 is because in tiling mode you have boundaries which are 5% of the image size. I will move that into a constant. You are also right about the +1 being for rounding up. I will follow your suggestion below.

([coord_2d.x as i32, coord_2d.y as i32], a.0, a.1)
})
.collect();
self.flush_resolved(
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @anopara could have a look. I'm not quite sure what the implications are with the previous and the new behaviors.

@@ -761,6 +702,35 @@ impl Generator {
let l2_precomputed = PrerenderedU8Function::new(metric_l2);
let mut total_processed_pixels = 0;
let max_workers = params.max_thread_count;
let mut has_fanned_out = false;
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to clarify the "fanned out"? Perhaps something along the lines of "use a single R*-tree initially, and fan out to a grid thereof later".

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

@@ -817,6 +813,10 @@ impl Generator {
let max_candidate_count = params.nearest_neighbors as usize
+ params.random_sample_locations as usize;

let my_thread_num = thread_counter.fetch_add(1, Ordering::Relaxed);
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: should be my_thread_id for consistency with other ordinals.

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// Some items might still be pending a resolve flush. Do it now before we start the next stage.
self.force_flush_resolved(is_tiling_mode);
{
// append all resolved threads to resolved
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we're appending threads. Would rephrase to "append all per-thread resolved lists to the global list"

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha true I'll rewrite that

result.reserve(k);

// this isn't really the kth best distance but should be good enough for how simple it is
let mut kth_best_squared_distance = i32::max_value();
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using i32 here starts feeling a bit tight -- some people use giant images, and there's a chance of this overflowing, particularly in the first stage of processing before "fanning out" to a larger grid of trees.

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow never considered an image of that size. I'll bump to i64

// a tiny optimization to help us throw far away neighbors
// saves us a decent amount of reads
if !is_center {
let squared_distance_to_closest_possible_point_on_chunk = (x as i32
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, a bunch of potentially dangerous i32 here.

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// a tiny optimization to help us throw far away neighbors
// saves us a decent amount of reads
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually correct though? This seems to be just point-to-point distance against closest_point_on_boundary_. The latter seem to be centered on edges of cells, but are not guaranteed to be the closest point inside the cell.

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry the name there might be a little confusing what I mean by closet point in cell is closest point on boundary. The optimization works as follows:
diagram
If the green dot is the pixel we are talking about (which is always in the middle cell) and we know that the k nearest neighbors so far are all within the purple circle then we don't have to search any of the red cells because their closest possible points (the boundary points) to our pixel are further than any of our candidate k closest neighbors. Also we cannot use this to filter out the middle cell (which we are inside of)

Copy link
Contributor

@h3r2tic h3r2tic Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you not end up in the following situation?

image

I have highlighted in blue what I'm assuming to be the boundary points of the missed cells.

Copy link
Contributor Author

@Mr4k Mr4k Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the picture! I think the position of the blue points is incorrect. They should really be the closest point on each boundary to the current pixel. For the 4 corner cells this point is always a constant (the closest corner on that cell to the center of the grid) no matter which pixel we are addressing in the middle. However for the cells you highlighted as problematic the boundary points I compute are actually based on the green pixels x or y coordinates (in the case of your picture one would be directly below the green dot and the other would be directly to the right). In this case we would not rule out those cells because both of the points are within the purple circle.

Copy link
Contributor Author

@Mr4k Mr4k Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a revised version of your picture (also added the bottom right corner's closest point to the diagram)
diagram_with_projected_points

Copy link
Contributor

@h3r2tic h3r2tic Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, so I did misread the code, thinking you were offsetting from the center. Make sense now, thanks!


// Assume that all k nearest neighbors are in these cells
// it looks like we are rarely wrong once enough pixels are filled in
let places_to_look = vec![
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can avoid a heap allocation by just removing the vec! characters :)

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks good catch!

}),
);

// this isn't really the kth best distance but it's an okay approximation
Copy link
Contributor

@h3r2tic h3r2tic Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a good deal of "this isn't really X" could be avoided by just calling the variable differently. How about something like kth_nearest_distance_upper_bound?

Copy link
Contributor Author

@Mr4k Mr4k Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Mr4k
Copy link
Contributor Author

Mr4k commented Jan 8, 2020

Thanks so much for reviewing (and for catching so many things)! I'll make the changes soon and generally try to make the naming better! And of course if you get a chance I'd love to know runtimes on more reference rigs but no pressure if you're busy.

@h3r2tic
Copy link
Contributor

h3r2tic commented Jan 10, 2020

Tested on my Threadripper! Pretty consistent with the "Google Cloud 64 core cpu".

BEFORE:

Benchmark #1: target\release\texture-synthesis --out-size 2048 --out out/01.jpg generate imgs/1.jpg
  Time (mean ± σ):     62.349 s ±  0.255 s    [User: 1.4 ms, System: 4.4 ms]
  Range (min … max):   61.825 s … 62.664 s    10 runs

AFTER:

Benchmark #1: target\release\texture-synthesis --out-size 2048 --out out/01.jpg generate imgs/1.jpg
  Time (mean ± σ):     23.737 s ±  0.488 s    [User: 3.0 ms, System: 3.6 ms]
  Range (min … max):   23.034 s … 24.373 s    10 runs

@repi
Copy link
Member

repi commented Jan 10, 2020

Impressive improvement!

Copy link
Contributor

@h3r2tic h3r2tic left a comment

Looks good to me!

@Mr4k
Copy link
Contributor Author

Mr4k commented Jan 10, 2020

Good to hear!

Also thanks @repi!

Just for completeness's sake here are some things I thought of that I decided were out of scope of this pr but thought might be worth mentioning:

  • I'm not sure how common this use case is but for inpainting a few small sparse regions on a large image a sparse version of the tree grid may provide better performance
  • In my benchmarking I have noticed that since the contention has been reduced the single threaded first few stages before we branch out (especially with many threads) have become large bottlenecks (up to 50% of the time in some case iirc). A potentially future improvement may be to see if they can be some what parallelized especially for larger images
  • Also after benchmarking I was also left wondering if an rstar tree is the ideal data structure for this kind of problem (rstar knn lookups now takes up about 50% of the generation step for a pixel iirc) given that most of the time your k nearest neighbors will be very close to you in predictable positions. I'm not sure what a better data structure looks like but it was just a thing I thought was interesting

Finally thanks so much @h3r2tic @anopara @Jake-Shadle for taking the time to benchmark / review my code! It was really fun to work on this.

@h3r2tic h3r2tic merged commit 9cf1f6c into EmbarkStudios:master Jan 16, 2020
@h3r2tic
Copy link
Contributor

h3r2tic commented Jan 16, 2020

Good thoughts!

Like you say, the initial serial part of the algorithm will certainly suffer from Amdahl's law. I suspect it can be parallelized to some extent; I can imagine schemes where one would pre-generate* the locations of points to be resolved, and then partition them between threads based on spatial proximity -- perhaps with sync points at specific locations of the point list. Quite unclear how much performance we could win though, as we're likely in the area of diminishing returns there.

The R* tree was more of a "what can we plug in in a short amount of time that more or less matches the requirements" rather than a particularly conscious choice. We tried kD-trees before, and they weren't that great due to high branching factors, resulting in pointer chasing and branching in execution. Perhaps there's something better out there :)

* Pre-generating the order of pixel locations for synthesis could enable us to use a high quality low-discrepancy sequence instead of just random, possibly improving the quality of generated images.

@Mr4k
Copy link
Contributor Author

Mr4k commented Jan 17, 2020

Oh interesting point about the pre generated sequences. I had thought about doing something like that but I was worried that it would change the distribution. However it sounds like a different distribution could potentially allow for higher quality (or at least more predictable) synthesis if chosen carefully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants