
r.texture: support parallel computing by OpenMP #3857

Merged
merged 27 commits into OSGeo:main from cyliang368:parallel_rtexture2 on Jun 22, 2024

Conversation

@cyliang368
Contributor

This PR parallelizes execute.c in the r.texture module using OpenMP and adds parallelization benchmarks. It was tested on my fork repository (cyliang368#9).

@github-actions github-actions bot added the raster, Python, C, module, and tests labels Jun 17, 2024
@petrasovaa
Contributor

Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods):
[figure: benchmark results (result2)]

Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.

@cyliang368
Contributor Author

cyliang368 commented Jun 18, 2024

> Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods): [figure: result2]
>
> Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.

I checked the specs of your CPU. It is a 4-core/8-thread CPU using Intel Hyper-Threading Technology. Hyper-Threading is essentially a pipelining technique that reduces idle instruction slots: it can speed up everyday workloads by running two threads per core, but for compute-heavy programs a single thread can already saturate a core, so using more threads than physical cores may not reduce runtime further.

The v.surf.rst benchmark from your CPU shows the same behavior: the runtime does not decrease after 4 threads. (https://github.com/cyliang368/grass/blob/c7cbf386c97a2a60f0b7319addc54d811e5b5a16/vector/v.surf.rst/vsurfrst_benchmark.png)

The easiest way to check may be to run the tool with 4 threads and see whether CPU usage reaches 100% in the performance monitor.
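Relatedly, OpenMP itself only sees logical processors, which explains why it happily runs 8 threads on 4 physical cores. A minimal sketch of such a check, assuming a compiler with OpenMP support (built with -fopenmp):

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* On a Hyper-Threading CPU this prints logical processors (e.g. 8
     * on a 4-core/8-thread chip), not physical cores. */
    printf("logical processors:  %d\n", omp_get_num_procs());
    printf("default max threads: %d\n", omp_get_max_threads());
    return 0;
}
```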

@petrasovaa
Contributor

> > Here is a benchmark on Ubuntu (Intel® Core™ i5-10210U CPU @ 1.60GHz × 8) for different window sizes (-a for all methods): [figure: result2]
> > Looks like after 4 cores there is no improvement even for larger window sizes, which is somewhat surprising to me.
>
> I checked the specs of your CPU. It is a 4-core/8-thread CPU using Intel Hyper-Threading Technology. Hyper-Threading is essentially a pipelining technique that reduces idle instruction slots: it can speed up everyday workloads by running two threads per core, but for compute-heavy programs a single thread can already saturate a core, so using more threads than physical cores may not reduce runtime further.
>
> The v.surf.rst benchmark from your CPU shows the same behavior: the runtime does not decrease after 4 threads. (https://github.com/cyliang368/grass/blob/c7cbf386c97a2a60f0b7319addc54d811e5b5a16/vector/v.surf.rst/vsurfrst_benchmark.png)
>
> The easiest way to check may be to run the tool with 4 threads and see whether CPU usage reaches 100% in the performance monitor.

I will try to run it on my desktop and see if I get different behavior.

@petrasovaa
Contributor

Benchmarks for Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz

For different cell numbers:
[figures: runtime (time) and parallel efficiency (efficiency)]

For different window sizes:
[figures: runtime (time_windows) and parallel efficiency (efficiency_windows)]

Looks great!

@github-actions github-actions bot added the HTML and docs labels Jun 18, 2024
Contributor

@petrasovaa petrasovaa left a comment


Could you please convert the images to PNGs? Also, the image names should start with r_texture_... and they should be at the same level as the HTML file (not in a figure directory).

Contributor

@petrasovaa petrasovaa left a comment


Looks good! Thank you!

@echoix echoix merged commit a24714e into OSGeo:main Jun 22, 2024
24 checks passed
@echoix echoix added this to the 8.5.0 milestone Jun 22, 2024
Comment on lines +315 to +328
```c
    threads = atoi(parm.nproc->answer);
#if defined(_OPENMP)
    /* Set the number of threads */
    omp_set_num_threads(threads);
    if (threads > 1)
        G_message(_("Using %d threads for parallel computing."), threads);
#else
    if (threads > 1) {
        G_warning(_("GRASS GIS is not compiled with OpenMP support, parallel "
                    "computation is disabled."));
        threads = 1;
    }
#endif
    execute_texture(data, &dim, measure_menu, measure_idx, &out_set, threads);
```
Member


@cyliang368 Since the PR was merged, a later Coverity scan flagged a new defect here.

Do you think you're able to fix it in a new small PR? I don't completely see yet how it can divide by zero through a modulo.


```
** CID 1548584:  Integer handling issues  (DIVIDE_BY_ZERO)
/raster/r.texture/main.c: 328 in main()

322         if (threads > 1) {
323             G_warning(_("GRASS GIS is not compiled with OpenMP support, parallel "
324                         "computation is disabled."));
325             threads = 1;
326         }
327     #endif
>>>     CID 1548584:  Integer handling issues  (DIVIDE_BY_ZERO)
>>>     In function call "execute_texture", modulo by expression "threads" which may be zero has undefined behavior.
328         execute_texture(data, &dim, measure_menu, measure_idx, &out_set, threads);
329
330         for (i = 0; i < dim.n_outputs; i++) {
331             Rast_close(outfd[i]);
332             Rast_short_history(mapname[i], "raster", &history);
333             Rast_command_history(&history);
```

@cyliang368
Contributor Author


I don't know when DIVIDE_BY_ZERO would actually happen, but I am sure the number of threads shouldn't be 0 or negative. I just submitted a new PR (#3917) to fix it.
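For reference, a minimal sketch of such a guard, assuming threads comes straight from atoi() (which returns 0 on non-numeric input); this is only illustrative, not necessarily the actual fix in #3917:

```c
threads = atoi(parm.nproc->answer); /* atoi() returns 0 on non-numeric input */
if (threads < 1) {
    /* hypothetical guard; clamping keeps the later modulo by `threads` safe */
    G_warning(_("Invalid number of threads <%d>, using 1 thread instead."),
              threads);
    threads = 1;
}
```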

@neteler
Member

neteler commented Jun 24, 2024

@cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?

@cyliang368
Contributor Author

cyliang368 commented Jun 24, 2024

> @cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?

@neteler Through #3785 and #3857, the memory used by two local variables will increase with multithreading:

1. Row arrays are used to store the results for output. This RAM usage is `sizeof(FCELL) * cols * measurements * threads`.
2. 4 matrices and 5 vectors store the input data for a given window size (default is 3 × 3). The RAM usage is:
   1. matrices: `sizeof(FCELL) * window_rows * window_cols * 4 * threads`
   2. vectors: `sizeof(FCELL) * 256 * 5 * threads`

For a large raster with many columns, the first item accounts for most of the increase in RAM usage. Say a raster has 100M cells (10k × 10k) and we want all 13 texture measurements; assuming sizeof(FCELL) is the same as a float (4 bytes), each additional thread needs 520 kB.
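The arithmetic behind that figure, as a small self-contained sketch:

```c
#include <stdio.h>

int main(void)
{
    /* Numbers from the example above: 10k x 10k raster, all 13 texture
     * measurements, sizeof(FCELL) assumed to be 4 bytes (a float). */
    const size_t fcell = 4, cols = 10000, measurements = 13;
    size_t per_thread = fcell * cols * measurements; /* row output buffers */

    printf("extra RAM per additional thread: %zu bytes (= %zu kB)\n",
           per_thread, per_thread / 1000); /* 520000 bytes = 520 kB */
    return 0;
}
```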

@HuidaeCho
Member

> > @cyliang368 Just curious - do you have any insights about RAM usage increase through this parallelization, esp. with large raster maps?
>
> @neteler Through #3785 and #3857, the memory used by two local variables will increase with multithreading:
>
> 1. Row arrays are used to store the results for output. This RAM usage is `sizeof(FCELL) * cols * measurements * threads`.
> 2. 4 matrices and 5 vectors store the input data for a given window size (default is 3 × 3). The RAM usage is:
>    1. matrices: `sizeof(FCELL) * window_rows * window_cols * 4 * threads`
>    2. vectors: `sizeof(FCELL) * 256 * 5 * threads`
>
> For a large raster with many columns, the first item accounts for most of the increase in RAM usage. Say a raster has 100M cells (10k × 10k) and we want all 13 texture measurements; assuming sizeof(FCELL) is the same as a float (4 bytes), each additional thread needs 520 kB.

@cyliang368 From what you explained, different threads never read the same rows?

@cyliang368
Contributor Author

cyliang368 commented Jun 24, 2024

> @cyliang368 From what you explained, different threads never read the same rows?

@HuidaeCho Different threads never write to the same row, but the values they read do have some overlap.

Here are the brief steps of what this module does (a schematic sketch follows the list):

1. Store the grayscale input raster in a 2D int array `data` in `main.c`.
2. Each thread copies the values for its own window from `data` and stores the processed values in `mvs`.
3. Each thread computes the measurements and stores them in its own array for write-out.
4. Each thread writes out the results row by row, in order.
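A schematic of this pattern; compute_row and write_row are illustrative names, not the actual r.texture code:

```c
#include <omp.h>

/* Hypothetical per-row workers, named for illustration only. */
extern void compute_row(int row); /* reads a window of shared input; reads
                                     from different threads may overlap */
extern void write_row(int row);   /* must happen strictly in row order */

void process(int nrows, int threads)
{
    /* schedule(static, 1) hands out rows round-robin; the ordered clause
     * serializes only the write-out, keeping rows in order on disk. */
#pragma omp parallel for ordered schedule(static, 1) num_threads(threads)
    for (int row = 0; row < nrows; row++) {
        compute_row(row); /* runs in parallel across threads */
#pragma omp ordered
        write_row(row);   /* executed one row at a time, in order */
    }
}
```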

@marisn
Contributor

marisn commented Jun 25, 2024

> 1. Store the grayscale input raster in a 2D int array `data` in `main.c`.

This would mean the module is unable to process rasters larger than RAM, am I correct?
I think we should modify ROWIO to allow better parallel code development. My clunky attempt at tackling the issue: ae3d7b2

@cyliang368
Contributor Author

> > 1. Store the grayscale input raster in a 2D int array `data` in `main.c`.
>
> This would mean the module is unable to process rasters larger than RAM, am I correct? I think we should modify ROWIO to allow better parallel code development. My clunky attempt at tackling the issue: ae3d7b2

You are correct!

I am curious about the computational time used by ROWIO. ROWIO fetches data from disk in every iteration, and disk latency (thousands of cycles or more) can be hundreds of times longer than fetching from RAM (200-400 cycles). If we fetch all the data from disk into RAM up front, that latency is paid only once; with ROWIO it is paid in every iteration.

I remember that some modules set a maximum RAM usage size: ROWIO is used when the data is larger than that limit; otherwise, all the data is loaded at once (a generic sketch of this idea is below).
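A generic sketch of that idea (illustrative only, not GRASS's actual ROWIO API): cache a window of rows in RAM and refill from disk on a miss, so the disk latency is paid per refill rather than per access; when the whole raster fits under the memory limit, the cache simply holds every row and the disk is touched only once.

```c
#include <stdlib.h>

/* Hypothetical disk reader supplied by the caller. */
extern void read_row_from_disk(int row, int *buf, int cols);

typedef struct {
    int first, nrows;  /* rows currently cached: [first, first + nrows) */
    int cols;
    int *buf;          /* nrows * cols cells kept in RAM */
} RowCache;

/* Return a pointer to `row`, refilling the cache from disk on a miss. */
int *get_row(RowCache *c, int row)
{
    if (row < c->first || row >= c->first + c->nrows) {
        c->first = row; /* slide the cached window down to `row` */
        for (int i = 0; i < c->nrows; i++)
            read_row_from_disk(row + i, c->buf + (size_t)i * c->cols, c->cols);
    }
    return c->buf + (size_t)(row - c->first) * c->cols;
}
```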

Dividing a raster into multiple tiles could also be a solution, but I don't know how this works in GRASS. Has anyone compared the performance between ROWIO and tiling?

I am not sure whether my understanding above is correct. Overall, this module cannot handle large data well right now, but I won't improve that immediately. Instead, I will probably file this enhancement as an issue, then do something else and come back later.

@marisn
Contributor

marisn commented Jun 25, 2024

> I am curious about the computational time used by ROWIO. ROWIO fetches data from disk in every iteration, and disk latency (thousands of cycles or more) can be hundreds of times longer than fetching from RAM (200-400 cycles). If we fetch all the data from disk into RAM up front, that latency is paid only once; with ROWIO it is paid in every iteration.

This is why I mentioned that improvements should be made to the ROWIO or SEGMENT libraries to benefit working with large datasets and in parallel.

> I remember that some modules set a maximum RAM usage size: ROWIO is used when the data is larger than that limit; otherwise, all the data is loaded at once.

Yes, this is the approach I took (keep all data, or at least all rows hit by a sliding window, in RAM). I only added a wrapper in between that hides the internals (whether a row comes from RAM or is prefetched from disk does not affect the outcome of the computation, just its speed).
One problem I found is the inability to determine the maximum available RAM at runtime: Linux has overcommit enabled by default, so one cannot rely on a malloc failure to determine whether loading the data into RAM will fail (= the user has to provide a maximum memory size). On modern distributions with systemd + cgroups, the process gets killed due to OOM instead of malloc returning a failure (totally not fun if your terminal emulator happens to be in the same cgroup).

> Dividing a raster into multiple tiles could also be a solution, but I don't know how this works in GRASS. Has anyone compared the performance between ROWIO and tiling?

The disk layout of raster data is row-based, thus ROWIO is the closest to the native layout. If a sliding window has size > 1, there is also a problem at the vertical edges of tiles (with ROWIO, only at the horizontal ones). Thus a generic / best-practice example would be nice. But this is not a discussion for this PR; it is a more general issue.

> I am not sure whether my understanding above is correct. Overall, this module cannot handle large data well right now, but I won't improve that immediately. Instead, I will probably file this enhancement as an issue, then do something else and come back later.

No need to. This is not the only module that cannot handle datasets that do not fit into RAM. The only thing you need to do is create a new PR with a note in the module documentation explaining this limitation.

@a0x8o a0x8o mentioned this pull request Jun 27, 2024
@petrasovaa
Contributor

> No need to. This is not the only module that cannot handle datasets that do not fit into RAM. The only thing you need to do is create a new PR with a note in the module documentation explaining this limitation.

Just to clarify, @cyliang368 did not change this behavior, so he is not responsible for fixing the documentation; anybody can do this.

@cyliang368 cyliang368 deleted the parallel_rtexture2 branch June 29, 2024 19:44
kritibirda26 pushed a commit to kritibirda26/grass that referenced this pull request Jun 29, 2024
* refactor to use local vars instead of global vars

* refactor to separate execute part for parallelization

* remove unused variable

* divide h_measure.h into paired header files

* refactor parameters and flags in main.c

* remove unused variable 'threads' in main.c

* add execute_parallel.c

* parallelize execute, add tests and benchmarks

* rename test and benchmark files, integrate parallel part in execute.c

* Run benchmark on all methods

* get thread id only if OpenMP is defined

* replace omp firstprivate with omp private

* remove tid, and only use trow to avoid the no-OpenMP-support situation

* add keyword "parallel" in main.c

* replace the list of functions with the -a flag in benchmark

* add benchmark for window sizes, plot speedup and efficiency

* remove basename list in benchmark, add benchmark results to html

* replace 'x' with '&#215;' for window sizes in html

* break down long lines in html

* remove known issue in r.texture.html

* modify the formats and paths of figures