
r.patch: Fix gathering statistics with multi-threading #2906

Merged (12 commits, May 1, 2023)

Conversation

@jameshu15869 (Contributor) commented Mar 28, 2023

Fixes #2411. Adds tests.

Original description:

This is my current draft implementation for making an array of statf structures for each thread.

Null count is being counted correctly in each thread and the values add up to what is expected, but I am unsure how to proceed for merging the trees.

Before merging:

  • add test specifically testing CELL rasters
  • run benchmark, compare with and without -s flag
  • run valgrind

@petrasovaa petrasovaa marked this pull request as draft March 28, 2023 14:59
(review comments on raster/r.patch/main.c, outdated/resolved)
@HuidaeCho (Member) commented Mar 28, 2023

I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:

#pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)

at line 208 in main.c? row should be private to avoid data races, and computed should be reduced by addition. I think we can remove the #pragma omp atomic update at line 250 with this reduction.

@HuidaeCho HuidaeCho added bug Something isn't working raster Related to raster data processing labels Mar 28, 2023
@petrasovaa petrasovaa removed the bug Something isn't working label Mar 28, 2023
@petrasovaa (Contributor)

> I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:
>
> #pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)
>
> at line 208 in main.c? row should be private to avoid data races, and computed should be reduced by addition. I think we can remove the #pragma omp atomic update at line 250 with this reduction.

Please see the original problem in #2410 and the issue #2411 this PR is addressing. What you suggest may be a good improvement, but AFAIU it's not really related to this PR. We had an initial conversation about this on Gitter.

@HuidaeCho (Member) commented Mar 28, 2023

>> I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:
>>
>> #pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)
>>
>> at line 208 in main.c? row should be private to avoid data races, and computed should be reduced by addition. I think we can remove the #pragma omp atomic update at line 250 with this reduction.
>
> Please see the original problem in #2410 and the issue #2411 this PR is addressing. What you suggest may be a good improvement, but AFAIU it's not really related to this PR. We had an initial conversation about this on Gitter.

Oh, it was a different problem! Sorry for the noise.

@jameshu15869 (Contributor, Author) commented Mar 28, 2023

@petrasovaa
I have finished an initial draft implementation that merges the statf[] array for each thread into the thread_statf[0] array (the statf array in thread 0) before passing it directly to support. I chose this approach because I did not want to modify support.c and introduce more possible problems. I've run the new implementation with nprocs = 1, 2, and 4, and the output in the GUI seems correct. r.category also produces the expected categories, though I'm not sure this is a surefire test because the categories do not seem to rely on the stats structure. Currently I'm facing two obstacles.

First, the null_data_count values are not being merged at all, though the values in each thread add up to the correct number. Should Rast_update_cell_stats automatically merge these values? If not, I can easily go back and manually update null_data_count when I'm merging the threads' statf structures. Based on the code it looks like Rast_update_cell_stats is modifying the null_data_count value, but I am not seeing the change.

Second, I tried printing out the nodes of the *node for each statf object. Running nprocs=1 and nprocs=2 produces the right number of nodes and N, but it seems the nodes are at different indexes and have different left and right attributes. I'm assuming this is because merging from multiple threads would place the nodes at different locations. Is this a reasonable assumption? Or should we expect the contents of statf to be exactly the same after merging?

Would you be able to take a look? Thanks!

(review comments on raster/r.patch/main.c, outdated/resolved)
@petrasovaa (Contributor) left a comment

Good job, I think this is close to finished! Please clean up the code (address the issues; remove the comments, the old statf, and the print functions). The new code for merging should ideally be in a separate function (in support.c).

Before merging this, we should also write tests and run a benchmark to ensure this is correct and to see what impact it has on performance. But at this point I suggest you finish the cleanup and then we brainstorm (on Gitter) the GSoC project and proposal if you are interested; the deadline is Apr 4th. This PR could be finalized as part of the project.

@jameshu15869 (Contributor, Author)

@petrasovaa
I've gone back and cleaned up the code. I also created a merge_threads function in support.c that will actually perform the merge. Regarding testing, I found a test in testsuite/test_rpatch_artificial.py and my implementation passed. I tried to run the benchmark file, but got an error. Is it possible that this error came from the benchmark itself and not my implementation?

Traceback (most recent call last):
  File "benchmark/benchmark_r_patch.py", line 52, in generate_map
    Module(
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 610, in __init__
    self.__call__(*args, **kargs)
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 634, in __call__
    return self.run()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 823, in run
    self.wait()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 844, in wait
    raise CalledModuleError(
grass.exceptions.CalledModuleError: Module run `r.surf.fractal dimension=2.05 number=0 output=r_patch_reference_tmp --o` ended with an error.
The subprocess ended with a non-zero return code: -9. See errors above the traceback or in the error output.

Please let me know if there's anything else I should add to this PR. Thanks!

@petrasovaa (Contributor)

> @petrasovaa I've gone back and cleaned up the code. I also created a merge_threads function in support.c that will actually perform the merge. Regarding testing, I found a test in testsuite/test_rpatch_artificial.py and my implementation passed. I tried to run the benchmark file, but got an error. Is it possible that this error came from the benchmark itself and not my implementation?
>
> Traceback (most recent call last):
>   File "benchmark/benchmark_r_patch.py", line 52, in generate_map
>     Module(
>   File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 610, in __init__
>     self.__call__(*args, **kargs)
>   File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 634, in __call__
>     return self.run()
>   File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 823, in run
>     self.wait()
>   File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 844, in wait
>     raise CalledModuleError(
> grass.exceptions.CalledModuleError: Module run `r.surf.fractal dimension=2.05 number=0 output=r_patch_reference_tmp --o` ended with an error.
> The subprocess ended with a non-zero return code: -9. See errors above the traceback or in the error output.

Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead; did the fallback work for you? I think the fallback is there because r.surf.fractal might require a lot of memory; is it possible you are running out of memory? Try to lower the size of the raster and see if that helps.

@jameshu15869 (Contributor, Author) commented Mar 30, 2023

> Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead; did the fallback work for you? I think the fallback is there because r.surf.fractal might require a lot of memory; is it possible you are running out of memory? Try to lower the size of the raster and see if that helps.

@petrasovaa
I found the error: when attempting to generate a map with size 4000, both functions require around 8.5 GB of memory. My computer only has 8 GB, so I suspect this is why r.surf.fractal and the fallback r.random.surface both failed. However, I was able to run benchmarks for sizes 7071, 14142, and 10000. The graph is attached below.

[benchmark graph: benchmarking without 2000]

My laptop only has 4 threads, so should I ignore the time increase as the thread count goes above that? My best guess is that the system cannot efficiently schedule, say, 12 software threads onto only 4 hardware threads. Other than that, this graph shows that OpenMP does make a difference, right?

As a side note, running the benchmarks takes around 30-40 minutes per run on my laptop. Is there a way to reduce the number of thread counts being tested so that only nprocs=1, 2, and 4 are run?

Thanks!

@neteler (Member) commented Mar 31, 2023

> My computer only has 8 GB

FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:

# enable SWAP FILE, for more RAM
#
# forgot to create a swap partition at installation?
# solution: create a swap *file* (not partition):

sudo su

SWAPDIR=/home/swapfiledir  # use your preferred path here
GB=10  # extra "RAM" in Gigabytes

mkdir ${SWAPDIR}
echo "Generating <${SWAPDIR}/swap_file>...:"
dd if=/dev/zero of=${SWAPDIR}/swap_file bs=1M count=${GB}k
ls -la ${SWAPDIR}/swap_file
df -h ${SWAPDIR}/

chmod 0600 ${SWAPDIR}/swap_file
mkswap ${SWAPDIR}/swap_file

# add the following entry to enable swap at boot:
echo "${SWAPDIR}/swap_file swap swap defaults 0 0"

# edit /etc/fstab and add the swap_file line from above
vim /etc/fstab

# enable swap
swapon -a

# verify new RAM size with e.g. free or top
free
top

@petrasovaa (Contributor)

> FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:

Good to know, but I wonder if this affects the performance.

@petrasovaa (Contributor)

>> Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead; did the fallback work for you? I think the fallback is there because r.surf.fractal might require a lot of memory; is it possible you are running out of memory? Try to lower the size of the raster and see if that helps.
>
> @petrasovaa I found the error: when attempting to generate a map with size 4000, both functions require around 8.5 GB of memory. My computer only has 8 GB, so I suspect this is why r.surf.fractal and the fallback r.random.surface both failed. However, I was able to run benchmarks for sizes 7071, 14142, and 10000. The graph is attached below.

I will try to run the benchmark myself; I have plenty of cores and RAM, so don't worry about this for now. But it's good you were able to test this!

@petrasovaa petrasovaa added this to the 8.4.0 milestone Mar 31, 2023
@echoix (Member) commented Mar 31, 2023

>> FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:
>
> Good to know, but I wonder if this affects the performance.

I'd assume a performance hit when using swap instead of RAM, of course, so the results you get shouldn't be considered "performance" metrics; but at least you are able to run it. The performance of the algorithm you are testing will depend on what other tasks are running on the system (and their memory usage, if they swap) and on the disk I/O performance of your system.

If a final performance chart is wanted, maybe get someone with a better-suited system to run the tests.

@jameshu15869 (Contributor, Author)

Sounds good. For now I will plan to ask others to help out with the testing.

One other thing I noticed is that the GitHub check is currently failing on Ubuntu in the test for t.rast.algebra. The error message is:

Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az574_655_32269/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az574_655_32269/tgis/sqlite.db

However, the Ubuntu 22.04 test passed for one of my previous commits (2d2daa0). It seems that the only change between that commit and the most recent failing commits is running the additional tests. I quickly scanned through the t.rast.algebra test and it doesn't seem like any of the code I worked on is exercised in this file. Could there be an issue with the GitHub CI itself? Or is there something else going on?

Thanks!

@jameshu15869 jameshu15869 marked this pull request as ready for review April 3, 2023 00:28
@petrasovaa (Contributor)

> through the t.rast.algebra test and it doesn't seem like any of the code I worked on should be tested in this file. Could there be an issue with the GitHub CI itself? Or is there something else going on?

Yes, that's unrelated; it sometimes happens and we are unsure why. Usually we would rerun the workflow.

@jameshu15869 (Contributor, Author) left a comment

All the changes have been resolved and I believe the PR is ready for a final review.

@jameshu15869 (Contributor, Author) left a comment

All the files have been reviewed, but I'm not sure how to get rid of the "merging block"

@petrasovaa (Contributor) left a comment

The code looks good, but I still need to run the benchmarks (I added tasks to the PR description); I will hopefully get to it next week. You don't need to do anything right now.

@jameshu15869 (Contributor, Author)

> The code looks good, but I still need to run the benchmarks (I added tasks to the PR description); I will hopefully get to it next week. You don't need to do anything right now.

Sounds good. Thank you so much for all your help!

@petrasovaa (Contributor)

I ran a benchmark with and without the -s flag:

with the -s flag:
[benchmark chart: rpatch_benchmark_s]

without (default):
[benchmark chart: rpatch_benchmark]

So running with the -s flag is slightly faster, but that's expected. Otherwise the benchmark seems fine.

@jameshu15869 (Contributor, Author)

@petrasovaa
I'm glad that it works! Did testing reveal any issues with the implementation that need to be addressed?

@petrasovaa (Contributor)

> @petrasovaa I'm glad that it works! Did testing reveal any issues with the implementation that need to be addressed?

I think it's all fine, ready to be merged. The failing Mac test is unrelated.

But could you please look at Huidae's suggestion above and see if it works and whether it improves anything? If yes, that would be a different PR.

@jameshu15869 (Contributor, Author) commented Apr 28, 2023

> I think it's all fine, ready to be merged. The failing Mac test is unrelated.
>
> But could you please look at Huidae's suggestion above and see if it works and whether it improves anything? If yes, that would be a different PR.

@HuidaeCho @petrasovaa
I've implemented Huidae's suggestion and modified the old #pragma omp atomic update, and this is the graph I got after running only the 50M r.patch benchmark (since I am working on my laptop at school). I was also able to pass all 4 test cases in testsuite/test_rpatch_artificial.py. It seems there was some minor improvement, though I'm not exactly sure how to quantify it. I haven't committed the changes yet since Anna mentioned this could be a separate PR. Should I move these onto a new branch and open a new PR specifically for this issue?

[benchmark chart: rpatch_benchmark]

@petrasovaa (Contributor)

> It seems there was some minor improvement, though I'm not exactly sure how to quantify it. I haven't committed the changes yet since Anna mentioned this could be a separate PR. Should I move these onto a new branch and open a new PR specifically for this issue?

Yes, please create a new PR and I will look at it. I will merge this one after the checks are done.

@jameshu15869 (Contributor, Author)

I've opened a new pull request for the reduction (#2941). However, it looks like all of my changes from this PR are still in the new PR. Would this potentially become an issue later on?

@petrasovaa petrasovaa merged commit 937cf6c into OSGeo:main May 1, 2023
21 checks passed
@petrasovaa (Contributor)

> I've opened a new pull request for the reduction (#2941). However, it looks like all of my changes from this PR are still in the new PR. Would this potentially become an issue later on?

Right, this needs to be fixed. I am not sure what the easiest way is; you can update your main, create a new branch, put your changes there, and then force push (something like git push -f origin branch-with-new-changes:branch-you-are-pushing-to).

@jameshu15869 jameshu15869 changed the title DRAFT: r.patch race condition r.patch race condition May 2, 2023
@jameshu15869 jameshu15869 deleted the r.patch-race-condition branch May 2, 2023 03:09
@wenzeslaus wenzeslaus changed the title r.patch race condition r.patch: Fix gathering statistics with multi-threading Jun 6, 2023
neteler pushed a commit to nilason/grass that referenced this pull request Nov 7, 2023
Closes: [Feat] r.patch: collect and apply statistics for CELL rasters with nprocs > 1