
Performance Shift(s): bbe75acd #4845

Closed
trexfeathers opened this issue Jul 1, 2022 · 12 comments · Fixed by #4881
Comments

@trexfeathers
Contributor

trexfeathers commented Jul 1, 2022

Raising on behalf of the GHA workflow, which was having troubles at the time this originally came up!


Benchmark comparison has identified performance shifts at commit bbe75acd.

Please review the report below and take corrective/congratulatory action as appropriate 🙂

Performance shift report
       before           after         ratio
     [0efd78d9]       [bbe75acd]

+       20.203125      54.70703125     2.71  experimental.ugrid.regions_combine.CombineRegionsComputeRealData.track_addedmem_compute_data(500)
@trexfeathers trexfeathers added Type: Performance Bot A bot generated issue/pull-request labels Jul 1, 2022
@trexfeathers trexfeathers added this to the v3.3.0 milestone Jul 1, 2022
@trexfeathers trexfeathers self-assigned this Jul 1, 2022
@trexfeathers trexfeathers added this to Backlog in Iris v3.3.0 via automation Jul 1, 2022
@trexfeathers
Contributor Author

Note to self: this memory shift is within one Dask chunk size; run/write a scalability benchmark to work out whether it's a problem for larger Cubes.

@trexfeathers trexfeathers moved this from Backlog to In Progress in Iris v3.3.0 Jul 22, 2022
@trexfeathers
Contributor Author

Bad news! The regression still exists at scale. Will need to investigate the Dask and NumPy change logs.

nox --session="benchmarks(custom)" -- run 0efd78d9^..bbe75acd --attribute rounds=1 --show-stderr --bench=sperf.combine_regions.ComputeRealData.track_addedmem_compute_data


nox --session="benchmarks(custom)" -- compare 0efd78d9 bbe75acd

              5.0              5.0     1.00  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(100)
+     32.17578125     175.20703125     5.45  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1000)
+        225.4375      599.8046875     2.66  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)
+             5.0        6.5546875     1.31  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(200)
+      5.86328125      18.01953125     3.07  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(300)
+       20.359375      54.48046875     2.68  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(500)


nox --session="benchmarks(custom)" -- publish '0efd78d9^..bbe75acd'

[image: published benchmark comparison plot]

@pp-mo
Member

pp-mo commented Jul 22, 2022

I guess the key observation here is
225.4375 599.8046875 2.66 sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)

However, the dask chunksize is 200Mb, and in my previous experience I think it typically uses about 3 * chunks even when chunked operation is working correctly.
The expected full data size of a re-combined cube in this case "ought" to be (6 * 1668 * 1668 * 4 bytes) (for float not double dtype),
so that is only ~64Mb per data array.
So I think this isn't quite a smoking gun, yet.
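As a quick sanity check on that arithmetic (purely illustrative, not part of the benchmark code):

# Expected full data size of one C1668 cubesphere field at float32 (4 bytes per value).
n_values = 6 * 1668 * 1668          # 6 panels, each 1668 x 1668 faces
size_bytes = n_values * 4
print(size_bytes / 1e6)             # ~66.8 MB, i.e. roughly 64 MiB per data array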

It could well be useful (again, in my experience!) to test with a lower chunksize, it's usually then pretty clear whether it is generally blowing memory or not.
This way is handy:
with dask.config.set({'array.chunk-size': '10Mb'}): ... or whatever; note that the sizes can be given as strings!
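A minimal sketch of that experiment, using a stand-in Dask array rather than the real cube (the 10Mb value is arbitrary, and the setting only affects lazy arrays created while it is active):

import dask
import dask.array as da

# Lower Dask's target chunk size for anything created inside the block;
# sizes can be given as strings, e.g. "10Mb" or "10MiB".
with dask.config.set({"array.chunk-size": "10Mb"}):
    arr = da.zeros((6, 1668, 1668), dtype="float32")  # stand-in for the cube's lazy data
    print(arr.chunksize)                              # chunks now respect the 10Mb target
    arr.sum().compute()                               # realise, as cube.data would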

@trexfeathers
Contributor Author

It could well be useful (again, in my experience!) to test with a lower chunksize

@pp-mo what difference would we be anticipating?

@trexfeathers
Contributor Author

The Dask changes seem very unlikely to be the cause, as they are the kind of things you would expect in a bugfix release. Unless anyone else thinks differently?

@pp-mo
Member

pp-mo commented Jul 22, 2022

It could well be useful (again, in my experience!) to test with a lower chunksize

@pp-mo what difference would we be anticipating?

Well, hopefully, that the total memory cost would reduce, as a multiple of the chunksize.
I.e. ultimately we expect it to be a minimum of (N * chunk-size) and (M * data-array-size), so that for sufficient data size (and small enough chunk size) the memory reaches a plateau.

Within the "N" factor is also how many workers it can run in parallel (we expect threads in this case).
So you can either stop it doing that (go to the synchronous scheduler, or restrict n-workers), or rely on knowledge of the platform.
Another practical learning point: if you allow parallel operation at all, I wouldn't use fewer than 4 workers; it doesn't seem to cope well with that.
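For reference, a sketch of those two options using standard Dask configuration and a stand-in Dask array (nothing here is specific to this benchmark):

import dask
import dask.array as da

arr = da.random.random((6, 1668, 1668))   # stand-in for the cube's lazy data

# Option 1: synchronous scheduler, i.e. no parallel workers at all.
with dask.config.set(scheduler="synchronous"):
    arr.sum().compute()

# Option 2: keep the threaded scheduler but pin the worker count (not fewer than 4, per the advice above).
with dask.config.set(scheduler="threads", num_workers=4):
    arr.sum().compute()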

@trexfeathers
Contributor Author

Changing the chunk size to 10 MiB considerably reduced the memory used for commit 0efd78d, but made no difference to bbe75ac. The bbe75ac results are also suspiciously close to the estimated full sizes of those meshes, suggesting the entire thing is being realised. Now to work out why...

@pp-mo
Member

pp-mo commented Jul 28, 2022

Updates + observations..

  1. Unfortunately, I was basically wrong about chunk sizes, since the test uses a cube of dims "(1, n_mesh_faces)", and the mesh dim cannot be chunked in the combine_regions calculations, which is like a regrid in that respect.

  2. testing with a param (cubesphere-edge-number) of 1000, we get 1M-points per face
    --> 6M points per cube --> 48Mbytes (since data is float64)
    which is all in one dask chunk (see previous)

  3. I created a really simple test to run outside of ASV, like this:

# Import paths assumed from the Iris ASV benchmarks package layout.
from benchmarks import TrackAddedMemoryAllocation
from benchmarks.sperf.combine_regions import ComputeRealData

tst = ComputeRealData()
param = 1000
tst.params = [param]
# tst.setup_cache()   # only needed first time
tst.setup(param)
with TrackAddedMemoryAllocation() as mb:
    tst.recombined_cube.data  # realise the lazy data of the recombined cube
print('Cube data size = ', tst.recombined_cube.core_data().nbytes * 1.0e-6)
print('Operation measured Mb = ', mb.addedmem_mb())

When initially run, this claimed that no memory at all was used (!)
I think that is because the setup is run in the same process as the to-be-measured calc, whereas (also I think) ASV runs each test in its own fork, hence the RSS-based measure then works.
(Actually I'm increasingly concerned that this memory measurement method is not great, and a tracemalloc approach would be a better choice anyway.)

So I replaced the measurement with one based on tracemalloc (a sketch of that kind of measurement follows this list), and got these results:
BEFORE the lockfile change

Cube data size =  48.0
Operation measured Mb =  97.321608543396

AFTER the lockfile change

Cube data size =  48.0
Operation measured Mb =  228.9274206161499
  4. I took a copy of the lockfile env and upgraded numpy from 1.22.4 to 1.23.0.
    Results are as latest above, i.e. it is the numpy change that was significant (and not dask).

  5. Note the memory numbers above:

  • BEFORE ≈ data-size * 2
  • AFTER ≈ data-size * 5

    The latter is interesting because it is NOT * 7 (there are 7 regions combined), so the problem is not "just" new space used up by each individual region-combine op.
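For reference, a minimal sketch of the kind of tracemalloc-based measurement mentioned in point 3 above; this is an illustration only, not the helper that was actually adopted in the benchmarks:

import tracemalloc

class TracemallocAddedMemory:
    """Measure peak Python-allocated memory added inside a with-block, in Mb."""

    def __enter__(self):
        tracemalloc.start()
        self._start, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, *exc):
        _, peak = tracemalloc.get_traced_memory()
        self.addedmem_mb = (peak - self._start) * 1.0e-6
        tracemalloc.stop()
        return False

# Usage (hypothetical cube variable):
#   with TracemallocAddedMemory() as tm:
#       recombined_cube.data
#   print('Operation measured Mb = ', tm.addedmem_mb)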

@trexfeathers
Contributor Author

Conclusion - won't fix

This regression is a moderate hindrance at worst, and is therefore only worth a limited amount of investigation.

More detail: we continue to discover new possible avenues of investigation, without any promising end in sight. The worst expected memory demand is <700 MiB - a C1668 cubesphere horizontal slice (which must be a single chunk) - and chunking will prevent larger sizes. @pp-mo may wish to add to this.

I'm writing a cautionary What's New entry to notify users of the increased memory demand.

@pp-mo
Member

pp-mo commented Jul 28, 2022

worst expected is <700MiB

A bit more detail:
From investigating the effects of chunking and different numpy versions, we found that the total memory cost of the operation is something like "2 * main-data-size + (2 or 3?) * combined-region-data-sizes".

The region data can be chunked, but the main data chunks must have a full mesh dimension, or indexing errors will occur (which needs fixing in the code - see below).

We've now shown that smaller chunks can reduce the cost due to region-data handling (but not the main data).
The conclusion is that it might effectively be copying/storing all region data, perhaps x2-3, but that is only about the same as the main data.
In particular, it is not costing "n-regions * full-data-size".
From that it seems that the additional cost comes from management of the region data, and not the central map_blocks operation, since in that case it would scale with n-regions.

Hence, we believe the total cost is limited to about 3 * full-data-array-size, which is certainly larger, but not prohibitive.
As this operation already can't chunk over the mesh dimension, we can accept costs of this magnitude.
(Though we also want to be quite sure that we can still chunk the operation within outer dimensions - see below.)
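To check that last point, the chunking of the lazy result can be inspected directly. A sketch, using a stand-in cube built from a lazy Dask array in place of the real recombined result:

import dask.array as da
import iris.cube

# Stand-in for the recombined result: a cube wrapping lazy (Dask) data,
# with an outer dimension and a flattened mesh dimension.
cube = iris.cube.Cube(da.zeros((10, 6 * 1668 * 1668), dtype="float32"))

lazy = cube.core_data()     # the underlying Dask array, while the data is still lazy
print(lazy.chunks)          # per-dimension chunk lengths
# Expectation for the real operation: outer dimensions may be split into several
# chunks, while the mesh dimension appears as a single chunk.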

@pp-mo
Copy link
Member

pp-mo commented Jul 28, 2022

TODO in follow-on fixes/changes:
see #4882 and #4883

Iris v3.3.0 automation moved this from In Progress to Done Jul 29, 2022