Is it possible to use runx on NGC? #9
Yes, it is. I have a bit of NGC support that I'm getting ready; I should be able to release it shortly.
That is great to know. Thanks a lot for bringing such a useful tool to the community.
Hi @XinDongol, I've just pushed NGC support. Maybe you could try it out and provide some feedback.
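For context, here is a rough sketch of what an NGC farm definition in `.runx` might look like, reverse-engineered from the submit commands quoted later in this thread; the assumption (mine, not confirmed in the thread) is that each key under RESOURCES becomes a `--flag` on the SUBMIT_CMD line:

```yaml
# Sketch only: structure follows the LOGROOT/FARM/SUBMIT_CMD/RESOURCES
# layout shown later in this thread; values are taken from the generated
# ngc commands below.
LOGROOT: /myws/sweep
FARM: ngc

ngc:
  SUBMIT_CMD: ngc batch run
  RESOURCES:
    image: nvidian/pytorch:19.10-py3
    gpu: 1
    instance: dgx1v.16g.1.norm
    ace: nv-us-west-2
    result: /result
```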
I am trying to submit jobs to NGC with runx. Currently, I put the `sweep.yml` in the same directory that I run the runx command from. Then, I call `python -m runx.runx sweep.yml -n` to check the commands, but I get an error message. Thanks a lot.
Correct, you put it in whichever directory you run the runx command from. I'm not sure if you have installed the latest runx, but I just tried your example and it works for me:

```
$ python -m runx.runx sweep.yml -n
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result --name sweep_attentive-mongoose_2020.10.18_13.35 --commandline ' cd /myws/sweep/attentive-mongoose_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/attentive-mongoose_2020.10.18_13.35/code bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.1 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/attentive-mongoose_2020.10.18_13.35 ' --workspace xxxx:/myws:RW
ngc batch run --image nvidian/pytorch:19.10-py3 --gpu 1 --instance dgx1v.16g.1.norm --ace nv-us-west-2 --result /result --name sweep_flying-bumblebee_2020.10.18_13.35 --commandline ' cd /myws/sweep/flying-bumblebee_2020.10.18_13.35/code; PYTHONPATH=/myws/sweep/flying-bumblebee_2020.10.18_13.35/code bash /myws/codes/ngc_comm/install.sh;python boostrap.py --batch_size 256 --epoch 100 --lr 0.01 --subset_pct 0.5 --di_batch_size 256 --num_di_batch 200 --logdir /myws/sweep/flying-bumblebee_2020.10.18_13.35 ' --workspace xxxx:/myws:RW
```

Maybe you could double-check that there aren't tabs in the yaml on line 14, because the yaml reader can sometimes get confused.
Thanks a lot. It turns out the error was caused by tab characters in the yaml. I will test the other functions on NGC.
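For reference, a `sweep.yml` consistent with the two generated commands above might look roughly like the sketch below. It is reconstructed from the command lines, not the poster's actual file, and it is indented with spaces only (tabs are not valid YAML indentation, which is what broke here):

```yaml
# Reconstructed sketch. The list under subset_pct is what expands
# the sweep into the two runs shown above.
CMD: python boostrap.py

HPARAMS: [
  {
    batch_size: 256,
    epoch: 100,
    lr: 0.01,
    subset_pct: [0.1, 0.5],
    di_batch_size: 256,
    num_di_batch: 200,
    logdir: LOGDIR   # runx substitutes each run's log directory here
  }
]
```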
One quick question: how do I configure FARM when running locally?
Don't leave FARM blank; FARM should point to some farm definition. The definition just needs to contain dummy values for RESOURCES and SUBMIT_CMD, but that's about it. To run locally, just use a dummy farm like that.
This is my current file structure. In the `.runx` file, I set the fields as suggested. Then, I run the runx command and get an error. I also tried to delete the FARM entry.
Sorry, this should work better; I'll have to clean this up. For the time being, please try something like this for `.runx`:

```yaml
LOGROOT: /home/jovyan/codes/Octopy/di_fl/cifar10
FARM: fake

fake:
  SUBMIT_CMD: na
  RESOURCES:
    dummy: na
```
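With a dummy farm like that in place, a dry run from the directory containing the experiment yaml should print the generated commands without submitting anything (file name illustrative; `-n` is the dry-run flag used earlier in this thread):

```
$ cd /home/jovyan/codes/Octopy/di_fl/cifar10
$ python -m runx.runx sweep.yml -n   # print commands only, do not submit
```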
I am really enjoying runx on NGC and on my local machine these days. It is quite useful for finding good hyperparameters. 💯 👍 I was wondering whether it is possible to make the staging (code copy and upload) step optional. Currently, the flow is: copy the code into each run's logdir, upload that copy to the NGC workspace, and then submit the job. Without staging, we could point every run at code that is already in the workspace, and we would still get the correct job submission command easily.
Hi @XinDongol, I'm really happy to hear that you're finding runx useful. I understand that the upload is taking a while. Can I ask: in your proposal, are you suggesting that all runs will use the same code dir in NGC? In other words, if you have multiple runs, they will use the same directory in NGC?

The issue with using the same directory in NGC is that it kind of breaks the paradigm of one run per directory. It also makes it challenging to use either tensorboard or sumx to view each run individually.

Do the large files within your code directory need to be staged/uploaded? I wonder if (a) they could be put into a central place so they don't need to be copied, or (b) if they really aren't needed, whether they could be excluded from the copy.
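On point (b), one way to exclude files would be an ignore list in the farm config. If I recall runx's config correctly, it supports a `CODE_IGNORE_PATTERNS` key; treat the key name and value format below as assumptions, not confirmed by this thread:

```yaml
# Assumed key: comma-separated glob patterns of files to leave
# out of the per-run code copy (sketch, not verified).
CODE_IGNORE_PATTERNS: '.git,*.pth,data/*'
```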
Different runs will still use their own dir because they have different logdirs. Suppose my code is in the dir `/ngc_ws/mycode`. In the current flow, runx copies the code into each run's logdir and uploads each copy to NGC separately.
If the connection to NGC is slow (as it was yesterday, for example), uploading n times (where n is the number of runs) takes a lot of time. In my proposal, if the code is already in a known dir of the NGC workspace, no per-run upload is needed. The only difference is that the uploading solution uses a different PYTHONPATH='/ngc_ws/logdir/code' for each run, while the no-uploading solution uses the same PYTHONPATH='/ngc_ws/mycode' for all runs. Personally, I like the idea of staging; making staging optional would simply give users more flexibility to handle a bad NGC connection.
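To make the proposal concrete, a hypothetical opt-out in the experiment yaml could look like the sketch below; `STAGE_CODE` and `CODE_DIR` are invented names for illustration only, not existing runx keys:

```yaml
# Hypothetical keys, purely illustrative of the proposal above:
STAGE_CODE: false          # skip the per-run code copy and NGC upload
CODE_DIR: /ngc_ws/mycode   # pre-uploaded code dir shared by all runs
```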
I see, yes, that makes sense. One reason why we create a separate copy of the code per run, and then upload a separate copy per run, is that sometimes you will want to change the code locally and then, using the same experiment yaml file, add some new runs. Now your experiment directory will contain multiple runs, some with older code and some with newer code. Oftentimes this can go through many iterations. So each run directory is a kind of documentation of the state of the code when you ran the experiment, and you never have to worry about reproducibility.

I think there's a way to make what you're proposing work, however. Each time you run runx, we could create a single code directory rather than one per run, and, as you say, run all the runs out of that one directory. The next time you run runx, because we don't know whether you changed the code, we would still upload a new copy.

So maybe there are two things to prototype: (1) optional staging, as you propose, and (2) still staging, but limited to one copy per runx invocation.
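Under option (2), the workspace layout might look something like this (illustrative sketch; the staged-copy directory name is invented):

```
/myws/sweep/
  code_2020.10.18_13.35/                  # one staged copy per runx invocation
  attentive-mongoose_2020.10.18_13.35/    # per-run logdir, shares the staged copy
  flying-bumblebee_2020.10.18_13.35/
```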