
monitoring vs serialization extremely poor performance with many apps #2848

Closed
benclifford opened this issue Aug 1, 2023 · 2 comments · Fixed by #2852

@benclifford (Collaborator)
Describe the bug
PR #2468 moves the monitoring remote wrapper into its own module, parsl.monitoring.remote, because serializing that wrapper was bringing along the entirety of the module it was defined in. Placing the remote wrapper in its own module reduces the size of the module brought along.

However, PR #2493 subsequently introduces a cache stored as a global in the parsl.monitoring.remote module, to work around the lack of caching in parsl.serialize anywhere except at the top level. Prior to PR #2493, a new callable was defined for each invocation when monitoring was in use, bringing along the task and try ID in a closure. PR #2493 instead builds a callable that is re-used across multiple invocations of the same app, with the task and try ID passed as additional parameters: equivalent to the closure, but unwrapped so that the naive parsl.serialize callable serializer can cache it. To do that, it introduces its own cache, so that any particular app is wrapped only once.

Because this cache is in the parsl.monitoring.remote module, it is serialized as part of serializing a monitoring-wrapped app invocation.

In workflows with a small number of apps, this is a small-ish (but undesired) overhead. In workflows with many apps (in my case, ctrl_bps_parsl, which makes a new app for every task), the cache easily grows to hundreds of kilobytes and is re-serialized on every invocation.
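To make the scaling concrete, here is a minimal stand-alone sketch (not parsl code; `pickle` stands in for the real serializer, and the cache contents are invented) showing how serializing a module-global cache grows with the number of cached entries:

```python
import pickle

# Hypothetical stand-in for the per-app cache in parsl.monitoring.remote:
# one entry per decorated app, serialized along with every invocation.
cache = {}
sizes = []
for n in (0, 100, 1000):
    while len(cache) < n:
        cache[len(cache)] = "wrapped-app-%d" % len(cache)
    # The cost paid per task: serializing the whole cache again.
    sizes.append(len(pickle.dumps(cache)))

print(sizes)  # strictly increasing: every new app makes every later task bigger
```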

While debugging this, I discovered that this serialization of the cache only happens when the monitoring wrapper has the functools.wraps decorator applied.

Removing that wraps decorator makes the problem go away.

More subtly, having the wraps decorator copy a smaller subset of attributes, rather than its default of functools.WRAPPER_ASSIGNMENTS, also makes the problem go away:

        @functools.wraps(f, assigned=('__name__', '__qualname__', '__doc__', '__annotations__'))

This copies the same attributes except for the __module__ attribute.
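As a stand-alone illustration (nothing parsl-specific here), the assigned subset behaves like this: name, qualname, doc and annotations are copied across, while the wrapper keeps its own __module__. The 'user_code' module name is invented for the demo:

```python
import functools

def original():
    """Docstring copied by wraps."""
    return 7

# Pretend the function came from some other module.
original.__module__ = 'user_code'

@functools.wraps(original,
                 assigned=('__name__', '__qualname__', '__doc__', '__annotations__'))
def wrapper(*args, **kwargs):
    return original(*args, **kwargs)

print(wrapper.__name__)    # 'original' -- copied across
print(wrapper.__module__)  # the wrapper's own defining module, not 'user_code'
```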

It isn't clear to me why copying that attribute onto the wrapped function causes the module to be serialized, when that module is probably also the __module__ attribute of the underlying function.

To Reproduce
Observe the size of serialized messages using a configuration such as parsl/tests/configs/htex_local_alternate.py.

Expected behavior
This cache should not be transmitted.

This whole caching feature in parsl.monitoring.remote exists only to cope with the lack of nested-function-aware caching inside parsl.serialize.concretes.DillCallableSerializer, so perhaps some more work can happen there to make nested functions get cached, at which point PR #2493 could be reverted.
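A sketch of what caching inside the serializer itself could look like, rather than in user-facing modules. This is hypothetical code, not parsl's actual DillCallableSerializer, and `pickle` stands in for dill:

```python
import pickle

class CachingCallableSerializer:
    """Hypothetical serializer that memoizes per callable object."""

    def __init__(self):
        # Keyed on id(); a real implementation would need to guard
        # against id reuse after garbage collection.
        self._cache = {}
        self.misses = 0

    def serialize(self, fn):
        key = id(fn)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = pickle.dumps(fn)
        return self._cache[key]

s = CachingCallableSerializer()
first = s.serialize(len)    # serialized once...
second = s.serialize(len)   # ...then served from the cache
print(s.misses)  # 1
```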

Environment
my laptop, parsl master 4b8b0b8

@benclifford (Collaborator, Author) commented Aug 2, 2023

As part of trying to quantify a potential fix to this, I wrote a script that monkeypatches parsl.serialize.serialize and reports both time and byte usage.

import time
import parsl.serialize

# Accumulated across the whole run; the per-task figures below divide
# these by the total number of tasks executed so far.
cumulative_serializer_time = 0
cumulative_serializer_bytes = 0

def s(*args, **kwargs):
  global cumulative_serializer_time
  global cumulative_serializer_bytes
  start = time.time()
  b = old_serializer(*args, **kwargs)
  end = time.time()
  cumulative_serializer_time += end - start
  cumulative_serializer_bytes += len(b)
  return b

# Patch both symbols: the top-level re-export and the facade module.
old_serializer = parsl.serialize.serialize
parsl.serialize.serialize = s
parsl.serialize.facade.serialize = s

import parsl

# Configuration axes:
from parsl.tests.configs.htex_local_alternate import fresh_config
decorate_many = True

parsl.load(fresh_config())

def g():
  return 7


if not decorate_many:
  @parsl.python_app
  def f():
    return 7

count = 1000

for b in (1, 2):
  print(f"Batch {b}")
  for _ in range(count):
    if decorate_many:
      # Re-decorate the same underlying function for every task.
      f = parsl.python_app(g)
    r = f().result()
    assert r == 7

  print(f"Cumulative serializer time: {cumulative_serializer_time}")
  print(f"Cumulative serializer bytes: {cumulative_serializer_bytes}")

  print(f"Serializer time per task: {cumulative_serializer_time / (b*count)}")
  print(f"Serializer bytes per task: {cumulative_serializer_bytes / (b*count)}")

I ran this against current master, 4b8b0b8, in 4 configurations, on 2 axes:

Monitoring on or off?
i) parsl/tests/configs/htex_local.py - has no monitoring
ii) parsl/tests/configs/htex_local_alternate.py - has monitoring

Decoration style:
a) decorate once, invoke 2000 times
b) decorate each invocation, 2000 times

This latter decoration style may sound surprising, but it is relatively common in power-user cases where the callable to be executed is dynamically generated - for example, ctrl_bps_parsl in LSST generates and decorates a partial for every invocation.
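The pattern looks roughly like this (fake_app is an illustrative stand-in for parsl.python_app, so the snippet runs without parsl; a real app call would return a Future rather than the value):

```python
import functools

def fake_app(fn):
    """Illustrative stand-in for parsl.python_app."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)  # a real app would submit work and return a Future
    return wrapper

def compute(task_id, x):
    return task_id + x

# Decorate a freshly constructed partial for every task, as ctrl_bps_parsl does;
# each iteration produces a brand-new callable for the serializer to handle.
results = [fake_app(functools.partial(compute, i))(10) for i in range(3)]
print(results)  # [10, 11, 12]
```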

The results against master are below. The values in the cells are serialized bytes per invocation and serialization time per invocation.

                 No monitoring       With monitoring
Decorate once    2063b, 2x10^-5s     5665b, 2.6x10^-5s
Decorate many    2063b, 2x10^-3s     732kb, 0.37s

The bottom-right corner is a demonstration of this current issue, #2848.

This test was run on my laptop with a custom-built Python 3.11.

@benclifford (Collaborator, Author) commented Aug 3, 2023

I've prototyped removing this cache entirely, in PR #2852. These are the results of the same test:

                 No monitoring       With monitoring
Decorate once    2063b, 2x10^-5s     5570b, 5x10^-3s
Decorate many    2063b, 2x10^-3s     5572b, 5x10^-3s

This makes the decorate-once, reuse-apps, monitoring-enabled case more expensive (by around 5 milliseconds per call in this test), the same cost as the decorate-many case, because caching is no longer happening.

It also pulls the decorate-many case down to that same value, because the bad-cache behaviour is removed.

I think this is the right way to go, in light of more recent experiences with serialization, which push me towards any caching happening deeper inside the serialization code, rather than further out towards user space.

benclifford added a commit that referenced this issue Aug 3, 2023
See issue #2848 for a bunch of description and benchmarking.
benclifford added a commit that referenced this issue Aug 3, 2023
…2852)

See issue #2848 for a bunch of description and benchmarking.