Tackle dask issues #518

Merged
merged 4 commits into from
Mar 27, 2023

@gabegma gabegma commented Mar 26, 2023

Description:

We now often see distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.broadcast local=tcp://127.0.0.1:62774 remote=tcp://127.0.0.1:62755>: Stream is closed errors in the logs, which cause tasks to be lost. They are often triggered when self.client.run(ArtifactManager.clear_cache) is called.

  • This PR does not fix the issue; I believe the problem still happens just as often. However, it enables the Dask Dashboard, which can help with troubleshooting.
  • I believe that the only thing that made the issue appear less often was updating dask/distributed; however, I can't get the router tests to work with the updated version.
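
For reference, a minimal sketch of how a dashboard can be enabled on a local cluster. This is not the code from this PR: the helper names, the port, and the single-worker threaded setup are assumptions for illustration. It relies only on the dashboard_address argument of LocalCluster and the dashboard_link attribute of Client.

```python
def dashboard_cluster_kwargs(port: int = 8787) -> dict:
    """Build keyword arguments asking LocalCluster to serve the dashboard on `port`.

    Hypothetical helper; the port and worker settings are illustrative only.
    """
    return {"dashboard_address": f":{port}", "n_workers": 1, "processes": False}


def start_client_with_dashboard(port: int = 8787):
    """Start a local cluster with the dashboard enabled and return a client."""
    # Imported lazily so the helper above can be used without distributed installed.
    from distributed import Client, LocalCluster

    cluster = LocalCluster(**dashboard_cluster_kwargs(port))
    client = Client(cluster)
    # The dashboard pages need the optional bokeh dependency to render.
    print("Dashboard:", client.dashboard_link)
    return client
```

With the dashboard open, the worker and task-stream pages can be watched while a client.run(...) call is in flight, which may help correlate the CommClosedError with worker restarts or closed connections.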

Checklist:

You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.

  • ISSUE NUMBER. You linked the issue number (Ex: Resolve #XXX).
  • PRE-COMMIT. You ran pre-commit on all commits, or else, you
    ran pre-commit run --all-files at the end.
  • USER CHANGES. The changes are added to CHANGELOG.md and the documentation, if they impact
    our users.
  • DEV CHANGES.
    • Update the documentation if this PR changes how to develop/launch on the app.
    • Update the README files and our wiki for any big design decisions, if relevant.
    • Add unit tests, docstrings, typing and comments for complex sections.

@@ -111,12 +111,13 @@ def start_task_on_dataset_split(
         deps = [d.done_event for d in dependencies] if dependencies is not None else []
         if not all(deps):
             raise ValueError("Can't wait for an unstarted Module.")
-        self.done_event = Event(name="-".join(map(str, self.task_id)), client=client)
+        self.done_event = Event(name=self.task_id, client=client)
For some reason, the separator here was a "-" and not a "_". I'm not sure why, or whether it was causing any issues.
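
A small sketch of why the separator could matter, assuming task_id is a sequence of parts (the tuple below is hypothetical; its real structure is not shown in this diff). Dask Event objects are coordinated by name, so two code paths that join the same parts with different separators would end up referring to different events.

```python
def event_name(task_id, sep="_"):
    """Join the parts of a task identifier into a single string name."""
    return sep.join(map(str, task_id))


# Hypothetical task identifier, for illustration only.
task_id = ("SomeModule", "eval", 0)

old_name = event_name(task_id, sep="-")  # what the removed line computed
new_name = event_name(task_id, sep="_")  # the "_" convention mentioned above

print(old_name)              # SomeModule-eval-0
print(new_name)              # SomeModule_eval_0
print(old_name == new_name)  # False
```

Building the name in one place, as the refactoring does by passing self.task_id directly, avoids this kind of mismatch.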

gabegma commented Mar 26, 2023

@JosephMarinier @Dref360 I was unsuccessful at solving the issue. Still, LMK what you think of this PR. I think we should at least merge the Dask Dashboard support, and the task_id refactoring.

@gabegma gabegma added the bug Something isn't working label Mar 27, 2023
@JosephMarinier JosephMarinier marked this pull request as ready for review March 27, 2023 19:01
@gabegma gabegma merged commit f81e86e into main Mar 27, 2023
@gabegma gabegma deleted the ggm/reduce-dask-issues branch March 27, 2023 19:16