Adding SkyPilot example for FlexGen #1

Michaelvll · 2023-03-07T22:55:45Z

This PR adds a SkyPilot example for the FlexGen benchmark, making the benchmark more reproducible and easier to manage.

Several future TODOs for SkyPilot:

Use memory filtering in the `resources` section (skypilot-org/skypilot#1746, "[Resources] Add memory in resources") to make the example easier to run on different clouds, e.g.:

resources:
  cpus: 32+
  memory_gb: 200+
  accelerators: T4

Add support for changing the disk type used by the instance, so that we can run the commands that require high-performance SSDs.

Tested:

sky launch -c flexgen --use-spot --detach-setup ./flexgen/apps/task.yaml

concretevitamin

Looks great! I can launch this easily.

One UX comment: the progress bar and cursor around the following section seem a bit weird. The progress bar does not stay on one line, and the cursor is not on the bottom line. Is this expected?

...
(task, pid=19775) Max sequence length: 456, Pad to sequences length: 512
(task, pid=19775) Init weights begin.
(task, pid=19775) Load the pre-trained pytorch weights of opt-30b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)l-00007-of-00007.bin: 100%|██████████| 822M/822M [00:11<00:00, 71.0MB/s]
(task, pid=19775) 0002-of-00007.bin:  10%|█         | 1.03G/9.87G [00:11<01:08, 128MB/s]1<01:11, 124MB/s]
(task, pid=19775) 0002-of-00007.bin:  74%|███████▍  | 7.30G/9.87G [00:57<00:17, 150MB/s]57<00:19, 134MB/s]]
(task, pid=19775) 0006-of-00007.bin:  80%|████████  | 7.92G/9.87G [00:57<00:12, 162MB/s]56<00:12, 173MB/s]
(task, pid=19775) Downloading (…)l-00002-of-00007.bin:  74%|███████▍  | 7.34G/9.87G [00:57<00:14, 170MB/s]]
(task, pid=19775) 0002-of-00007.bin:  75%|███████▍  | 7.37G/9.87G [00:57<00:13, 182MB/s]57<00:10, 177MB/s]
Downloading (…)l-00002-of-00007.bin:  76%|███████▌  | 7.49G/9.87G [01:02<02:14, 17.7MB/s]9<00:56, 42.6MB/s]
(task, pid=19775) 0006-of-00007.bin:  82%|████████▏ | 8.10G/9.87G [01:01<01:20, 22.0MB/s]8<00:11, 162MB/s]]
(task, pid=19775) 0005-of-00007.bin:  77%|███████▋  | 7.63G/9.87G [01:01<01:47, 20.7MB/s]8<00:17, 131MB/s]]
Downloading (…)l-00003-of-00007.bin:  83%|████████▎ | 8.16G/9.87G [01:02<01:29, 19.0MB/s]8<00:11, 156MB/s]
(task, pid=19775) Downloading (…)l-00004-of-00007.bin:  50%|████▉     | 4.91G/9.87G [00:57<01:14, 67.0MB/s]
(task, pid=19775) 0004-of-00007.bin:  51%|█████     | 5.02G/9.87G [01:01<03:51, 20.9MB/s]9<01:31, 53.0MB/s]
Downloading (…)l-00001-of-00007.bin:  33%|███▎      | 3.23G/9.79G [01:01<06:27, 16.9MB/s]8<02:25, 45.6MB/s]

Maybe it's due to the parallel downloads. I don't think it needs to be fixed in this PR.

concretevitamin · 2023-03-08T16:29:23Z

README.md

+```
+sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
+```
+Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.


Suggested change

Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.

You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, the cluster will be automatically terminated due to the `--down` flag.

To run any other FlexGen command, you can edit [`flexgen/apps/task.yaml`](./flexgen/apps/task.yaml) and replace the `run` section.

I added `--down` and a sentence to explain it. What do you think? We could also keep the original version, where the user runs `sky down` manually. The advantage of the original is that the job seems to run for quite a while, so people may want to press Ctrl-C partway through and terminate the cluster manually. Using autodown does showcase a good feature, however.

How about we mention `--down` in the text instead of adding it to the command, as follows:

You can then log into the cluster running the job with `ssh flexgen` for monitoring. Once the job has finished, you can terminate the cluster with `sky down flexgen`, or pass the `--down` flag to the command above to have the cluster terminate itself automatically.

Reasons:

If the user passes `--down` and then detaches from the log, they will not be able to retrieve the log after the cluster is automatically terminated.

Adding `--down` makes the launch command longer, which may not look good.

Open to discussion :)

Makes sense. I think either the current version or the original sky down version is fine, up to you.
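For readers following this thread, here is a minimal sketch of what replacing the `run` section might look like. Only the `resources` block mirrors the `task.yaml` diff in this PR; the `setup` and `run` commands are assumptions for illustration and are not taken from the merged file.

```yaml
# Hypothetical task.yaml sketch; only the resources block mirrors the PR diff.
resources:
  accelerators: T4:1
  instance_type: n1-highmem-32  # On GCP with 1 T4 GPU and more than 200GB of RAM.

# Assumed setup step: install FlexGen with pip.
setup: |
  pip install flexgen

# Replace the command below with any FlexGen command (this one is an assumed example).
run: |
  python3 -m flexgen.flex_opt --model facebook/opt-1.3b
```

With a file like this, the launch command discussed above stays the same; only the contents of `run` change. Appending `--down` to that command would opt in to automatic termination, at the cost of losing the logs once the cluster is gone, as noted in the thread.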

Michaelvll · 2023-03-09T04:24:59Z

> One UX comment: the progress bar and cursor around the following section seem a bit weird. The progress bar does not stay on one line, and the cursor is not on the bottom line. Is this expected?

Yeah, that is a pretty annoying problem. It is indeed due to the parallel downloads, and I don't have a solution right now. Maybe we can leave it for the future.

concretevitamin · 2023-03-09T05:42:38Z

README.md

 ```
-sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
+You can now use a single command to automatically launch the benchmark on any cloud:


Suggested change

You can now use a single command to automatically launch the benchmark on any cloud:

You can now use a single command to launch the benchmark on any cloud. SkyPilot automatically finds a region with availability for the requested GPUs, trying the cheapest first:

concretevitamin · 2023-03-09T05:43:05Z

flexgen/apps/README.md

 ```
-sky launch -c flexgen --detach-setup task.yaml
+You can now use a single command to automatically launch the benchmark on any cloud:


concretevitamin · 2023-03-09T05:44:17Z

README.md

+```
+sky launch -c flexgen --detach-setup flexgen/apps/task.yaml
+```
+Note that you can replace the run section with any FlexGen command. You can log into the cluster running the job with `ssh flexgen` and terminate the cluster with `sky down flexgen`.


Makes sense. I think either the current version or the original sky down version is fine, up to you.

concretevitamin · 2023-03-09T05:45:09Z

flexgen/apps/task.yaml

+# Specify the resources required for this job.
+resources:
+  accelerators: T4:1
+  instance_type: n1-highmem-32 # On GCP with 1 T4 GPU and more than 200GB of RAM.


It may be OK to ship this first and see what reports we get. We should expect the `sky launch ...` command to fail for non-GCP users, however.
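As a sketch of how the `resources` block might be adapted by a non-GCP user, the variant below swaps in an AWS instance type. The choice of `g4dn.16xlarge` (assumed here to provide 1 T4 GPU and 256 GiB of RAM) is an illustration only; it is not part of the PR diff and was not tested in this PR.

```yaml
# Hypothetical AWS variant of the resources block; not part of the PR diff.
resources:
  cloud: aws
  accelerators: T4:1
  instance_type: g4dn.16xlarge  # Assumed: 1 T4 GPU and 256 GiB of RAM.
```

Pinning an instance type keeps the memory requirement explicit, but it ties the task to one cloud, which is the limitation the memory-filtering TODO in the PR description aims to remove.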

Michaelvll added 3 commits March 7, 2023 14:04

Add skypilot examples

b49ffd9

Add more description

5679f85

Make the setup detached

421e5ae

Michaelvll changed the title ~~Adding SkyPilot example~~ Adding SkyPilot example for FlexGen Mar 7, 2023

Add comment for other clouds

8ea3cde

concretevitamin reviewed Mar 8, 2023

Michaelvll and others added 3 commits March 8, 2023 18:02

Update README.md

bef90a9

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

address comments

f9c227f

Merge branch 'skypilot-example' of github.com:Michaelvll/FlexGen into skypilot-example

ed4ec13

Merge branch 'main' of github.com:FMInference/FlexGen into skypilot-example

7237585

Michaelvll requested a review from concretevitamin March 9, 2023 04:29

concretevitamin approved these changes Mar 9, 2023

Adopt changes

22340cf

Michaelvll merged commit 173b410 into main Mar 9, 2023

Michaelvll deleted the skypilot-example branch March 9, 2023 07:28
