
Distribute vegasflow on clusters with dask #55

Merged · 7 commits into master from dasktribute on Sep 17, 2020

Conversation

scarlehoff (Member)

This seems to work on a single computer. I'll try it on galileo as soon as I am able to.

As far as I understand, the point of the matter is to have one run_event per distributed system, with the dask client connecting to the appropriate one.

The way it works is by sending a job per chunk of data, while the master node / central server collects all the data and gives you the results.

At first I thought "this is so simple we should use it instead of joblib", but then I realised it not only complicates picklability and device selection, but also that the distributed package from pip was not working on dom... (the one from Arch is), so for now I prefer to keep it as a completely separate option.
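A rough sketch of the chunk-per-job pattern described above, using plain dask.distributed (nothing vegasflow-specific; `run_chunk` is just a stand-in for the per-chunk event generation):

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the per-chunk work; in vegasflow this would be the event
# generation / integrand evaluation for one chunk of events.
def run_chunk(seed, n_events):
    import numpy as np
    rng = np.random.default_rng(seed)
    return rng.random(n_events).mean()  # toy partial result for this chunk

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2)   # any dask cluster would do here
    client = Client(cluster)              # the client connects to that cluster
    n_chunks, events_per_chunk = 4, 10_000
    # one job per chunk of data...
    futures = [client.submit(run_chunk, i, events_per_chunk) for i in range(n_chunks)]
    # ...while the central node collects all the partial results
    partials = client.gather(futures)
    print(sum(partials) / n_chunks)
```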

@scarrazza (Contributor)

Great, looks good. I will give it a try on other IT infrastructures.

@scarlehoff (Member, Author)

On indaco I've had the same problems as with dom, so this is definitely not going in the main package. The pickling is also very tricky and only seems to work with tf > 2.2.

In any case, the way this needs to be done is by passing a dask cluster object to, for instance, the compile call. I'll add an example and then have the docs point to the list of systems supported by dask.

The advantage is that, by doing so, we are compatible with every queue system that dask is compatible with.

The latest commit is working on indaco. I have to say I'm very happy with dask: other than the expected pitfalls when passing objects around through sockets, everything works as advertised.
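A minimal sketch of the intended usage, assuming a VegasFlow-like interface (constructor and method names follow the usual vegasflow examples); the `cluster` keyword on `compile` is hypothetical and only illustrates the idea of handing the dask cluster object over:

```python
from dask.distributed import LocalCluster
from vegasflow import VegasFlow  # assuming the top-level import

# Toy integrand over the unit hypercube
def integrand(xarr, **kwargs):
    return xarr[:, 0] ** 2 + xarr[:, 1]

# Any dask cluster object: LocalCluster here, SLURMCluster/PBSCluster on a real queue
cluster = LocalCluster(n_workers=2)

vegas = VegasFlow(3, int(1e6))  # 3 dimensions, 1e6 events per iteration
# Hypothetical keyword: the cluster object is passed through the compile call
vegas.compile(integrand, cluster=cluster)
result = vegas.run_integration(5)  # 5 iterations
```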

Review thread on src/vegasflow/vflow.py (outdated, resolved)
@scarlehoff (Member, Author)

This is ready for review. If you have access to a non-slurm workload manager, it would be helpful to have a second example. If not, I think this one is enough.
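For anyone wanting to adapt the example to another workload manager, the cluster construction is the only part that should change. A sketch of the dask-jobqueue side for slurm, with placeholder queue and resource values (not the actual example file shipped in this PR):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; adjust to the actual slurm partition and node sizes
cluster = SLURMCluster(queue="main", cores=8, memory="16GB", walltime="01:00:00")
cluster.scale(4)            # request 4 workers from the queue
client = Client(cluster)    # the rest of the code is identical for any cluster
```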

@scarrazza (Contributor)

Very good. I have tried the local cluster (the dask monitor panel seems to work fine) and the PBSCluster; both cases work fine. I'm just wondering whether we have multi-GPU nodes in some cluster (maybe marco?); if not, we can try to rent and configure slurm on some cloud machines.
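For completeness, a minimal sketch of the two setups mentioned above, with placeholder queue and resource values; the monitor panel is the dashboard served by the dask scheduler:

```python
from dask.distributed import Client, LocalCluster
from dask_jobqueue import PBSCluster

# Option 1: local cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
# Option 2 (on a PBS system): placeholder queue and resources
# cluster = PBSCluster(queue="workq", cores=8, memory="16GB", walltime="01:00:00")
# cluster.scale(2)

client = Client(cluster)
print(client.dashboard_link)  # URL of the monitoring dashboard (monitor panel)
```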

@scarlehoff (Member, Author)

Even in that case you would want to send two jobs to that node. I haven't even tried to make dask + multiGPU work at the same time because it seems redundant to me (and because it scares me tbh).

@scarlehoff added this to the v1.2 milestone on Sep 14, 2020
@scarlehoff mentioned this pull request on Sep 14, 2020
@scarlehoff (Member, Author)

If you are happy with this, I'll merge.

@scarrazza (Contributor)

Fine by me, and the instructions are clear.

@scarlehoff merged commit b1616b7 into master on Sep 17, 2020
@scarlehoff deleted the dasktribute branch on September 17, 2020 at 11:56