
r.texture: support parallel computing by OpenMP #3857

Merged
merged 27 commits into OSGeo:main from cyliang368:parallel_rtexture2 on Jun 22, 2024

Conversation

@cyliang368
Contributor

This PR parallelizes execute.c in the r.texture module using OpenMP and adds parallelization benchmarks. It was tested on my fork repository (cyliang368#9).

@github-actions github-actions bot added the raster, Python, C, module, and tests labels Jun 17, 2024
@petrasovaa
Contributor

Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods):
[figure: benchmark results (result2)]

Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.

@cyliang368
Contributor Author

cyliang368 commented Jun 18, 2024

> Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods): [figure: result2]
>
> Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.

I checked the specs of your CPU. It is a 4-core/8-thread CPU using Intel Hyper-Threading Technology. Hyper-Threading is essentially a pipelining technique that reduces idle instruction slots: it can speed up everyday workloads by running two threads per core, but for compute-heavy programs a single thread can already saturate a core, so using more threads than physical cores may not reduce runtime further.

The v.surf.rst benchmark from your CPU shows the same behavior: the runtime does not decrease after 4 threads. (https://github.com/cyliang368/grass/blob/c7cbf386c97a2a60f0b7319addc54d811e5b5a16/vector/v.surf.rst/vsurfrst_benchmark.png)

The easiest way to check may be to run the tool with 4 threads and see whether CPU usage reaches 100% in the performance monitor.
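Relatedly, OpenMP itself only sees logical processors, which explains why it happily runs 8 threads on 4 physical cores. A minimal sketch of such a check, assuming a compiler with OpenMP support (built with -fopenmp):

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* On a Hyper-Threading CPU this prints logical processors (e.g. 8
     * on a 4-core/8-thread chip), not physical cores. */
    printf("logical processors:  %d\n", omp_get_num_procs());
    printf("default max threads: %d\n", omp_get_max_threads());
    return 0;
}
```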

@petrasovaa
Contributor

> > Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods): [figure: result2]
> > Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.
>
> I checked the specs of your CPU. It is a 4-core/8-thread CPU using Intel Hyper-Threading Technology. Hyper-Threading is essentially a pipelining technique that reduces idle instruction slots: it can speed up everyday workloads by running two threads per core, but for compute-heavy programs a single thread can already saturate a core, so using more threads than physical cores may not reduce runtime further.
>
> The v.surf.rst benchmark from your CPU shows the same behavior: the runtime does not decrease after 4 threads. (https://github.com/cyliang368/grass/blob/c7cbf386c97a2a60f0b7319addc54d811e5b5a16/vector/v.surf.rst/vsurfrst_benchmark.png)
>
> The easiest way to check may be to run the tool with 4 threads and see whether CPU usage reaches 100% in the performance monitor.

I will try to run it on my desktop and see if I get different behavior.

@petrasovaa
Contributor

Benchmarks for Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz

For different cell numbers:
[figures: runtime (time) and parallel efficiency (efficiency)]

For different window sizes:
[figures: runtime (time_windows) and parallel efficiency (efficiency_windows)]

Looks great!

@github-actions github-actions bot added the HTML and docs labels Jun 18, 2024
Contributor

@petrasovaa petrasovaa left a comment


Could you please convert the images to PNGs? Also, the image names should start with r_texture_... and they should be at the same level as the HTML file (not in a figure directory).

Contributor

@petrasovaa petrasovaa left a comment


Looks good! Thank you!

@echoix echoix merged commit a24714e into OSGeo:main Jun 22, 2024
24 checks passed
@echoix echoix added this to the 8.5.0 milestone Jun 22, 2024
Comment on lines +315 to +328
```c
    threads = atoi(parm.nproc->answer);
#if defined(_OPENMP)
    /* Set the number of threads */
    omp_set_num_threads(threads);
    if (threads > 1)
        G_message(_("Using %d threads for parallel computing."), threads);
#else
    if (threads > 1) {
        G_warning(_("GRASS GIS is not compiled with OpenMP support, parallel "
                    "computation is disabled."));
        threads = 1;
    }
#endif
    execute_texture(data, &dim, measure_menu, measure_idx, &out_set, threads);
```
Member


@cyliang368 Since the PR was merged, a later Coverity scan flagged a new defect here.

Do you think you're able to fix it in a new small PR? I don't completely see yet how it can divide by zero through a modulo.


```
** CID 1548584:  Integer handling issues  (DIVIDE_BY_ZERO)
/raster/r.texture/main.c: 328 in main()

322         if (threads > 1) {
323             G_warning(_("GRASS GIS is not compiled with OpenMP support, parallel "
324                         "computation is disabled."));
325             threads = 1;
326         }
327     #endif
>>>     CID 1548584:  Integer handling issues  (DIVIDE_BY_ZERO)
>>>     In function call "execute_texture", modulo by expression "threads" which may be zero has undefined behavior.
328         execute_texture(data, &dim, measure_menu, measure_idx, &out_set, threads);
329
330         for (i = 0; i < dim.n_outputs; i++) {
331             Rast_close(outfd[i]);
332             Rast_short_history(mapname[i], "raster", &history);
333             Rast_command_history(&history);
```

@cyliang368
Contributor Author


I don't know when DIVIDE_BY_ZERO would actually happen, but I am sure the number of threads shouldn't be 0 or negative. I just submitted a new PR (#3917) to fix it.
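For reference, a minimal sketch of such a guard, assuming threads comes straight from atoi() (which returns 0 on non-numeric input); this is only illustrative, not necessarily the actual fix in #3917:

```c
threads = atoi(parm.nproc->answer); /* atoi() returns 0 on non-numeric input */
if (threads < 1) {
    /* hypothetical guard; clamping keeps the later modulo by `threads` safe */
    G_warning(_("Invalid number of threads <%d>, using 1 thread instead."),
              threads);
    threads = 1;
}
```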

@neteler
Member

neteler commented Jun 24, 2024

@cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?

@cyliang368
Contributor Author

cyliang368 commented Jun 24, 2024

> @cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?

@neteler Through #3785 and #3857, the memory used by two local variables will increase with multithreading:

1. Row arrays are used to store the results for output. This RAM usage is `sizeof(FCELL) * cols * measurements * threads`.
2. 4 matrices and 5 vectors store the input data for a given window size (default is 3 × 3). The RAM usage is:
   1. matrices: `sizeof(FCELL) * window_rows * window_cols * 4 * threads`
   2. vectors: `sizeof(FCELL) * 256 * 5 * threads`

For a large raster with many columns, the first item accounts for most of the increase in RAM usage. Say a raster has 100M cells (10k × 10k) and we want all 13 texture measurements; assuming sizeof(FCELL) is the same as a float (4 bytes), each additional thread needs 520 kB.
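The arithmetic behind that figure, as a small self-contained sketch:

```c
#include <stdio.h>

int main(void)
{
    /* Numbers from the example above: 10k x 10k raster, all 13 texture
     * measurements, sizeof(FCELL) assumed to be 4 bytes (a float). */
    const size_t fcell = 4, cols = 10000, measurements = 13;
    size_t per_thread = fcell * cols * measurements; /* row output buffers */

    printf("extra RAM per additional thread: %zu bytes (= %zu kB)\n",
           per_thread, per_thread / 1000); /* 520000 bytes = 520 kB */
    return 0;
}
```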

@HuidaeCho
Member

> > @cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?
>
> @neteler Through #3785 and #3857, the memory used by two local variables will increase with multithreading:
>
> 1. Row arrays are used to store the results for output. This RAM usage is `sizeof(FCELL) * cols * measurements * threads`.
> 2. 4 matrices and 5 vectors store the input data for a given window size (default is 3 × 3). The RAM usage is:
>    1. matrices: `sizeof(FCELL) * window_rows * window_cols * 4 * threads`
>    2. vectors: `sizeof(FCELL) * 256 * 5 * threads`
>
> For a large raster with many columns, the first item accounts for most of the increase in RAM usage. Say a raster has 100M cells (10k × 10k) and we want all 13 texture measurements; assuming sizeof(FCELL) is the same as a float (4 bytes), each additional thread needs 520 kB.

@cyliang368 From what you explained, different threads never read the same rows?

@cyliang368
Contributor Author

cyliang368 commented Jun 24, 2024

> @cyliang368 From what you explained, different threads never read the same rows?

@HuidaeCho Different threads never write to the same row, but the values they read do have some overlap.

Here are the brief steps of what this module does (a schematic sketch follows the list):

1. Store the grayscale input raster in a 2D int array `data` in `main.c`.
2. Each thread copies the values for its own window from `data` and stores the processed values in `mvs`.
3. Each thread computes the measurements and stores them in its own array for write-out.
4. Each thread writes out the results row by row, in order.
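A schematic of this pattern; compute_row and write_row are illustrative names, not the actual r.texture code:

```c
#include <omp.h>

/* Hypothetical per-row workers, named for illustration only. */
extern void compute_row(int row); /* reads a window of shared input; reads
                                     from different threads may overlap */
extern void write_row(int row);   /* must happen strictly in row order */

void process(int nrows, int threads)
{
    /* schedule(static, 1) hands out rows round-robin; the ordered clause
     * serializes only the write-out, keeping rows in order on disk. */
#pragma omp parallel for ordered schedule(static, 1) num_threads(threads)
    for (int row = 0; row < nrows; row++) {
        compute_row(row); /* runs in parallel across threads */
#pragma omp ordered
        write_row(row);   /* executed one row at a time, in order */
    }
}
```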

@marisn
Contributor

marisn commented Jun 25, 2024

> 1. Store the grayscale input raster in a 2D int array `data` in `main.c`.

This would mean the module is unable to process rasters larger than RAM, am I correct?
I think we should modify ROWIO to allow better parallel code development. My clunky attempt at tackling the issue: ae3d7b2

@cyliang368
Contributor Author

> > 1. Store the grayscale input raster in a 2D int array `data` in `main.c`.
>
> This would mean the module is unable to process rasters larger than RAM, am I correct? I think we should modify ROWIO to allow better parallel code development. My clunky attempt at tackling the issue: ae3d7b2

You are correct!

I am curious about the computational time used by ROWIO. ROWIO fetches data from disk in every iteration, and disk latency (thousands of cycles or more) can be hundreds of times longer than fetching from RAM (200-400 cycles). If we fetch all the data from disk into RAM up front, that latency is paid only once; with ROWIO it is paid in every iteration.

I remember that some modules set a maximum RAM usage size: ROWIO is used when the data is larger than that limit; otherwise, all the data is loaded at once (a generic sketch of this idea is below).
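A generic sketch of that idea (illustrative only, not GRASS's actual ROWIO API): cache a window of rows in RAM and refill from disk on a miss, so the disk latency is paid per refill rather than per access; when the whole raster fits under the memory limit, the cache simply holds every row and the disk is touched only once.

```c
#include <stdlib.h>

/* Hypothetical disk reader supplied by the caller. */
extern void read_row_from_disk(int row, int *buf, int cols);

typedef struct {
    int first, nrows;  /* rows currently cached: [first, first + nrows) */
    int cols;
    int *buf;          /* nrows * cols cells kept in RAM */
} RowCache;

/* Return a pointer to `row`, refilling the cache from disk on a miss. */
int *get_row(RowCache *c, int row)
{
    if (row < c->first || row >= c->first + c->nrows) {
        c->first = row; /* slide the cached window down to `row` */
        for (int i = 0; i < c->nrows; i++)
            read_row_from_disk(row + i, c->buf + (size_t)i * c->cols, c->cols);
    }
    return c->buf + (size_t)(row - c->first) * c->cols;
}
```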

Dividing a raster into multiple tiles could also be a solution, but I don't know how this works in GRASS. Has anyone compared the performance between ROWIO and tiling?

I am not sure whether my understanding above is correct. Overall, this module cannot handle large data well right now, but I won't improve that immediately. Instead, I will probably file this enhancement as an issue, then do something else and come back later.

@marisn
Contributor

marisn commented Jun 25, 2024

> I am curious about the computational time used by ROWIO. ROWIO fetches data from disk in every iteration, and disk latency (thousands of cycles or more) can be hundreds of times longer than fetching from RAM (200-400 cycles). If we fetch all the data from disk into RAM up front, that latency is paid only once; with ROWIO it is paid in every iteration.

This is why I mentioned that improvements should be made to the ROWIO or SEGMENT libraries to benefit working with large datasets and in parallel.

> I remember that some modules set a maximum RAM usage size: ROWIO is used when the data is larger than that limit; otherwise, all the data is loaded at once.

Yes, this is the approach I took (keep all data, or at least all rows hit by a sliding window, in RAM). I only added a wrapper in between that hides the internals (whether a row comes from RAM or is prefetched from disk does not affect the outcome of the computation, just its speed).
One problem I found is the inability to determine the maximum available RAM at runtime: Linux has overcommit enabled by default, so one cannot rely on a malloc failure to determine whether loading the data into RAM will fail (= the user has to provide a maximum memory size). On modern distributions with systemd + cgroups, the process gets killed due to OOM instead of malloc returning a failure (totally not fun if your terminal emulator happens to be in the same cgroup).

> Dividing a raster into multiple tiles could also be a solution, but I don't know how this works in GRASS. Has anyone compared the performance between ROWIO and tiling?

The disk layout of raster data is row-based, thus ROWIO is the closest to the native layout. If a sliding window has size > 1, there is also a problem at the vertical edges of tiles (with ROWIO, only at the horizontal ones). Thus a generic / best-practice example would be nice. But this is not a discussion for this PR; it is a more general issue.

> I am not sure whether my understanding above is correct. Overall, this module cannot handle large data well right now, but I won't improve that immediately. Instead, I will probably file this enhancement as an issue, then do something else and come back later.

No need to. This is not the only module that cannot handle datasets that do not fit into RAM. The only thing you need to do is create a new PR with a note in the module documentation explaining this limitation.

@a0x8o a0x8o mentioned this pull request Jun 27, 2024
@petrasovaa
Contributor

> No need to. This is not the only module that cannot handle datasets that do not fit into RAM. The only thing you need to do is create a new PR with a note in the module documentation explaining this limitation.

Just to clarify, @cyliang368 did not change this behavior, so he is not responsible for fixing the documentation; anybody can do this.

@cyliang368 cyliang368 deleted the parallel_rtexture2 branch June 29, 2024 19:44
kritibirda26 pushed a commit to kritibirda26/grass that referenced this pull request Jun 29, 2024
* refactor to use local vars instead of global vars

* refactor to separate execute part for parallelization

* remove unused variable

* divide h_measure.h into paired header files

* refactor parameters and flags in main.c

* remove unused variable 'threads' in main.c

* add execute_parallel.c

* parallelize execute, add tests and benchmarks

* rename test and benchmark files, integrate parallel part in execute.c

* Run benchmark on all methods

* get thread id only if OpenMP is defined

* replace omp firstprivate with omp private

* remove tid, and only use trow to avoid the no-OpenMP-support situation

* add keyword "parallel" in main.c

* replace the list of functions with the -a flag in benchmark

* add benchmark for window sizes, plot speedup and efficiency

* remove basename list in benchmark, add benchmark results to html

* replace 'x' with '&#215;' for window sizes in html

* break down long lines in html

* remove known issue in r.texture.html

* modify the formats and paths of figures