Adding SkyPilot example for FlexGen #1
Looks great! I can launch this easily.
One UX comment: the progress bar and cursor around the following section seem a bit weird. The progress bar does not stay on a single line, and the cursor is not on the bottom line. Is this expected?
...
(task, pid=19775) Max sequence length: 456, Pad to sequences length: 512
(task, pid=19775) Init weights begin.
(task, pid=19775) Load the pre-trained pytorch weights of opt-30b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00007-of-00007.bin: 100%|██████████| 822M/822M [00:11<00:00, 71.0MB/s]
(task, pid=19775) 0002-of-00007.bin: 10%|█ | 1.03G/9.87G [00:11<01:08, 128MB/s]1<01:11, 124MB/s]
(task, pid=19775) 0002-of-00007.bin: 74%|███████▍ | 7.30G/9.87G [00:57<00:17, 150MB/s]57<00:19, 134MB/s]]
(task, pid=19775) 0006-of-00007.bin: 80%|████████ | 7.92G/9.87G [00:57<00:12, 162MB/s]56<00:12, 173MB/s]
(task, pid=19775) Downloading (…)l-00002-of-00007.bin: 74%|███████▍ | 7.34G/9.87G [00:57<00:14, 170MB/s]]
(task, pid=19775) 0002-of-00007.bin: 75%|███████▍ | 7.37G/9.87G [00:57<00:13, 182MB/s]57<00:10, 177MB/s]
Downloading (…)l-00002-of-00007.bin: 76%|███████▌ | 7.49G/9.87G [01:02<02:14, 17.7MB/s]9<00:56, 42.6MB/s]
(task, pid=19775) 0006-of-00007.bin: 82%|████████▏ | 8.10G/9.87G [01:01<01:20, 22.0MB/s]8<00:11, 162MB/s]]
(task, pid=19775) 0005-of-00007.bin: 77%|███████▋ | 7.63G/9.87G [01:01<01:47, 20.7MB/s]8<00:17, 131MB/s]]
Downloading (…)l-00003-of-00007.bin: 83%|████████▎ | 8.16G/9.87G [01:02<01:29, 19.0MB/s]8<00:11, 156MB/s]
(task, pid=19775) Downloading (…)l-00004-of-00007.bin: 50%|████▉ | 4.91G/9.87G [00:57<01:14, 67.0MB/s]
(task, pid=19775) 0004-of-00007.bin: 51%|█████ | 5.02G/9.87G [01:01<03:51, 20.9MB/s]9<01:31, 53.0MB/s]
Downloading (…)l-00001-of-00007.bin: 33%|███▎ | 3.23G/9.79G [01:01<06:27, 16.9MB/s]8<02:25, 45.6MB/s]
Maybe it's due to the parallel download. I don't think it needs to be fixed for this PR.
README.md
Outdated
```
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
```
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.
You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, the cluster will be automatically terminated due to the `--down` flag.
To run any other FlexGen command, you can edit [`flexgen/apps/task.yaml`](./flexgen/apps/task.yaml) and replace the `run` section.
I added `--down` and a sentence to explain it. Wdyt? We can also keep the original version of manually running `sky down`. The advantage of the original is that the job seems to run for a long time, so people may want to ctrl-c in the middle and terminate the cluster manually. Using autodown showcases a good feature, however.
How about we keep `--down` in the text instead of in the command, as follows:
You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, you can terminate the cluster with `sky down flexgen` or pass the `--down` flag to the command above to have the cluster terminate itself automatically.
Reasons:
- With `--down`, if the user detaches from the log, they will never be able to find the log after the cluster is automatically terminated.
- Adding `--down` makes the launch command longer, which may not look good.
Open to discussion :)
Makes sense. I think either the current version or the original `sky down` version is fine, up to you.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
… skypilot-example
Yeah, that is a pretty annoying problem. It is indeed due to the parallel download, and I don't have a solution right now. Maybe we can leave it for the future.
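For anyone curious why the bars in the log above end with leftover fragments, here is a minimal Python sketch of one plausible mechanism. It assumes each download redraws its bar by writing a carriage return and then the new text, without clearing the rest of the line, and that several downloads share the same terminal line. The `render` helper and the bar strings are hypothetical and only model a very simple terminal; this is not how SkyPilot or the downloader is implemented.

```python
def render(stream: str) -> list[str]:
    """Render a character stream the way a very simple terminal would.

    A carriage return moves the cursor to column 0 without clearing the line,
    a newline starts a new line, and any other character overwrites in place.
    """
    lines = [[]]
    col = 0
    for ch in stream:
        if ch == "\r":
            col = 0
        elif ch == "\n":
            lines.append([])
            col = 0
        else:
            row = lines[-1]
            if col < len(row):
                row[col] = ch      # overwrite what an earlier bar left here
            else:
                row.append(ch)
            col += 1
    return ["".join(row) for row in lines]


# Hypothetical bars from two parallel downloads sharing one terminal line.
long_bar = "\rDownloading shard-2: 74% [00:57<00:19, 134MB/s]"
short_bar = "\rshard-6: 80% [00:57<00:12]"

screen = render(long_bar + short_bar)
print(screen[0])  # the shorter bar, followed by the tail of the longer one
```

The shorter bar only overwrites the start of the line, so the end of the longer bar stays visible after it, which resembles the trailing fragments in the log.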
README.md
Outdated
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
You can now use a single command to automatically launch the benchmark on any cloud:
You can now use a single command to automatically launch the benchmark on any cloud:
You can now use a single command to launch the benchmark on any cloud, which automatically finds a region (in the cheapest-price order) with availability for the requested GPUs:
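To make the "cheapest-price order" wording concrete, here is a minimal Python sketch of the idea: sort the candidate regions by price and take the first one that has capacity. This is an illustration only, not SkyPilot's actual optimizer. The `Offer` type, the `pick_region` function, the region names, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    region: str           # hypothetical region name
    hourly_price: float   # illustrative number, not real pricing data
    has_capacity: bool    # whether the requested GPUs can be provisioned now


def pick_region(offers: list[Offer]) -> Optional[Offer]:
    """Return the cheapest offer that currently has capacity, or None."""
    for offer in sorted(offers, key=lambda o: o.hourly_price):
        if offer.has_capacity:
            return offer
    return None


offers = [
    Offer("region-a", 1.20, True),
    Offer("region-b", 0.90, False),  # cheapest, but no GPUs available
    Offer("region-c", 1.05, True),
]

chosen = pick_region(offers)
print(chosen.region if chosen else "no capacity anywhere")  # prints "region-c"
```

The cheapest region is skipped because it has no capacity, so the next cheapest one with availability is chosen.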
flexgen/apps/README.md
Outdated
sky launch -c flexgen --detach-setup task.yaml
You can now use a single command to automatically launch the benchmark on any cloud:
ditto
flexgen/apps/task.yaml
Outdated
# Specify the resources required for this job.
resources:
  accelerators: T4:1
  instance_type: n1-highmem-32  # On GCP with 1 T4 GPU and more than 200GB of RAM.
Maybe it's OK to ship first and see what reports we get. However, we should expect the `sky launch ...` command to fail for a non-GCP user.
This PR adds the SkyPilot example for the FlexGen benchmark. It will make the benchmark more reproducible and convenient to manage.
Several future TODOs for SkyPilot:
- `memory` filtering in the `resources` section ([Resources] Add memory in resources, skypilot-org/skypilot#1746) to make the example easier to run on different clouds, i.e.
Tested:
sky launch -c flexgen --use-spot --detach-setup ./flexgen/apps/task.yaml