Adding SkyPilot example for FlexGen #1

Merged
merged 9 commits into from
Mar 9, 2023

Conversation

Michaelvll
Owner

This PR adds a SkyPilot example for the FlexGen benchmark, which makes the benchmark more reproducible and easier to manage.

Several future TODOs for SkyPilot:

  1. Use memory filtering in the resources section ([Resources] Add memory in resources, skypilot-org/skypilot#1746) to make the example easier to run on different clouds, e.g.:
     resources:
       cpus: 32+
       memory_gb: 200+
       accelerators: T4
  2. Add support for changing the instance's disk type, so that we can run commands that require high-performance SSDs.

Tested:

  • sky launch -c flexgen --use-spot --detach-setup ./flexgen/apps/task.yaml
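
For context, a SkyPilot task file generally has a `resources` section, an optional `setup` section, and a `run` section. The following is a minimal sketch of what `flexgen/apps/task.yaml` might look like, not the file merged in this PR: the `resources` values come from the review thread further down, while the `setup` and `run` commands are assumptions for illustration.

```yaml
# Illustrative sketch only; the actual task.yaml in this PR may differ.
resources:
  accelerators: T4:1
  instance_type: n1-highmem-32  # GCP-specific: 1 T4 GPU and more than 200GB of RAM.

# Runs once when the cluster is first provisioned.
# Assumed install step; the real file may install FlexGen from source instead.
setup: |
  pip install flexgen

# Runs as the job. Any FlexGen command can go here;
# the one below is an assumed placeholder.
run: |
  python3 -m flexgen.flex_opt --model facebook/opt-1.3b
```

Pinning `instance_type` ties the example to one cloud, which is why the first TODO above proposes filtering by CPU and memory instead.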

Michaelvll changed the title from "Adding SkyPilot example" to "Adding SkyPilot example for FlexGen" on Mar 7, 2023
concretevitamin left a comment

Looks great! I can launch this easily.

One UX comment: the progress bar and cursor around the following section look a bit odd. The progress bar doesn't stay on one line, and the cursor is not on the bottom line. Is that expected?

...
(task, pid=19775) Max sequence length: 456, Pad to sequences length: 512
(task, pid=19775) Init weights begin.
(task, pid=19775) Load the pre-trained pytorch weights of opt-30b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00007-of-00007.bin: 100%|██████████| 822M/822M [00:11<00:00, 71.0MB/s]
(task, pid=19775) 0002-of-00007.bin:  10%|█         | 1.03G/9.87G [00:11<01:08, 128MB/s]1<01:11, 124MB/s]
(task, pid=19775) 0002-of-00007.bin:  74%|███████▍  | 7.30G/9.87G [00:57<00:17, 150MB/s]57<00:19, 134MB/s]]
(task, pid=19775) 0006-of-00007.bin:  80%|████████  | 7.92G/9.87G [00:57<00:12, 162MB/s]56<00:12, 173MB/s]
(task, pid=19775) Downloading (…)l-00002-of-00007.bin:  74%|███████▍  | 7.34G/9.87G [00:57<00:14, 170MB/s]]
(task, pid=19775) 0002-of-00007.bin:  75%|███████▍  | 7.37G/9.87G [00:57<00:13, 182MB/s]57<00:10, 177MB/s]
Downloading (…)l-00002-of-00007.bin:  76%|███████▌  | 7.49G/9.87G [01:02<02:14, 17.7MB/s]9<00:56, 42.6MB/s]
(task, pid=19775) 0006-of-00007.bin:  82%|████████▏ | 8.10G/9.87G [01:01<01:20, 22.0MB/s]8<00:11, 162MB/s]]
(task, pid=19775) 0005-of-00007.bin:  77%|███████▋  | 7.63G/9.87G [01:01<01:47, 20.7MB/s]8<00:17, 131MB/s]]
Downloading (…)l-00003-of-00007.bin:  83%|████████▎ | 8.16G/9.87G [01:02<01:29, 19.0MB/s]8<00:11, 156MB/s]
(task, pid=19775) Downloading (…)l-00004-of-00007.bin:  50%|████▉     | 4.91G/9.87G [00:57<01:14, 67.0MB/s]
(task, pid=19775) 0004-of-00007.bin:  51%|█████     | 5.02G/9.87G [01:01<03:51, 20.9MB/s]9<01:31, 53.0MB/s]
Downloading (…)l-00001-of-00007.bin:  33%|███▎      | 3.23G/9.79G [01:01<06:27, 16.9MB/s]8<02:25, 45.6MB/s]

Maybe it's due to the parallel download. I don't think it needs to be fixed in this PR.

README.md Outdated
```
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
```
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.

Suggested change
Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.
You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, the cluster will be automatically terminated due to the `--down` flag.
To run any other FlexGen command, you can edit [`flexgen/apps/task.yaml`](./flexgen/apps/task.yaml) and replace the `run` section.

I added `--down` and a sentence to explain it. What do you think? We can also keep the original version, which runs `sky down` manually. The advantage of the manual version is that the job seems to run for quite a long time, so people may want to Ctrl-C partway through and terminate the cluster themselves. Using autodown showcases a good feature, however.

Owner Author

How about we mention `--down` in the text instead of adding it to the command, as follows:

You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, you can terminate the cluster with `sky down flexgen`, or pass the `--down` flag to the command above to have the cluster terminate itself automatically.

Reasons:

  1. With `--down`, if the user detaches from the log, they will not be able to retrieve it after the cluster is automatically terminated.
  2. Adding `--down` makes the launch command longer, which may not look good.

Open to discussions : )

Makes sense. I think either the current version or the original sky down version is fine, up to you.

Michaelvll and others added 3 commits March 8, 2023 18:02
@Michaelvll
Owner Author

> One UX comment: the progress bar and cursor around the following section look a bit odd. The progress bar doesn't stay on one line, and the cursor is not on the bottom line. Is that expected?

Yea.. that is a pretty annoying problem. It is indeed due to the parallel download, and I don't have a solution right now. Maybe we can leave it for the future.

README.md Outdated
```
sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
You can now use a single command to automatically launch the benchmark on any cloud:

Suggested change
You can now use a single command to automatically launch the benchmark on any cloud:
You can now use a single command to launch the benchmark on any cloud, which automatically finds a region (in the cheapest-price order) with availability for the requested GPUs:

```
sky launch -c flexgen --detach-setup task.yaml
You can now use a single command to automatically launch the benchmark on any cloud:

ditto

# Specify the resources required for this job.
resources:
  accelerators: T4:1
  instance_type: n1-highmem-32 # On GCP with 1 T4 GPU and more than 200GB of RAM.

It's probably OK to ship this first and see what reports we get. We should expect the `sky launch ...` command to fail for non-GCP users, however, since `n1-highmem-32` is a GCP instance type.

Michaelvll merged commit 173b410 into main on Mar 9, 2023
Michaelvll deleted the skypilot-example branch on March 9, 2023 07:28
2 participants