<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Using accelerators with SkyPilot

Tasks in SkyPilot can request special resources for their execution. For instance, an ML training task can request NVIDIA GPUs or Google TPUs to accelerate training, or a larger disk.

SkyPilot can also run distributed tasks across many VMs, enabling use cases such as large-scale distributed training.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. List the GPUs and accelerators supported by SkyPilot.
2. Specify different resource types (GPUs, TPUs) for your tasks.
3. Use multiple nodes to run distributed tasks.

# Listing supported accelerators with `sky show-gpus`

To see the list of accelerators supported by SkyPilot, use the `sky show-gpus` command. Run the cell below to try it.

In [None]:
! sky show-gpus

### Expected output
-------------------------
```console
$ sky show-gpus
NVIDIA_GPU  AVAILABLE_QUANTITIES  
V100        1, 2, 4, 8            
V100-32GB   8                     
A100        1, 2, 4, 8, 16        
A100-80GB   1, 2, 4, 8            
P100        1, 2, 4               
K80         1, 2, 4, 8, 16        
T4          1, 2, 4, 8            
M60         1, 2, 4               

GOOGLE_TPU   AVAILABLE_QUANTITIES  
tpu-v2-8     1                     
tpu-v2-32    1                     
tpu-v2-128   1                     
tpu-v2-256   1                     
tpu-v2-512   1                     
tpu-v3-8     1                     
tpu-v3-32    1                     
tpu-v3-64    1                     
tpu-v3-128   1                     
tpu-v3-256   1                     
tpu-v3-512   1                     
tpu-v3-1024  1                     
tpu-v3-2048  1  
```
-------------------------

> **💡 Hint -** For a more extensive list of the GPUs supported by each cloud and their pricing information, run `sky show-gpus -a` in an interactive terminal.

# Specifying resource requirements of tasks

Special resource requirements are specified through the `resources` field in the SkyPilot task YAML. For example, to request 1 K80 GPU, simply add it to the YAML like so:

```yaml
resources:
  accelerators: K80:1

setup: ....

run: .....
```

> **💡 Hint -** In addition to `accelerators`, you can specify many more requirements, such as `disk_size`, a specific `cloud`, `region` or `zone`, `instance_type` and more! You can find more details in the [YAML configuration docs](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html).

## 📝 Edit `gpu_task.yaml` to use a K80 GPU! 

We have provided an example YAML (`gpu_task.yaml`) which runs `nvidia-smi`. However, it does not specify any GPU resources, so `nvidia-smi` would fail. 

Edit `gpu_task.yaml` to add the `resources` field to it!

Your final YAML should look like this:

---------------------
```yaml
# gpu_task.yaml
name: gputask

resources:
  accelerators: K80:1

run: |
  nvidia-smi
```
---------------------
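
If you prefer to work from a terminal, the edit can be sketched as a short shell snippet. It writes the final YAML shown above into a scratch directory and checks that the `accelerators` line is present. The `scratch_dir` and `check_result` names and the use of `mktemp` are illustrative assumptions, not part of the tutorial files.

```shell
# Write the finished task YAML into a scratch directory so the provided
# gpu_task.yaml is not overwritten. (scratch_dir is a hypothetical name.)
scratch_dir=$(mktemp -d)
cat > "$scratch_dir/gpu_task.yaml" <<'EOF'
# gpu_task.yaml
name: gputask

resources:
  accelerators: K80:1

run: |
  nvidia-smi
EOF

# Sanity check: confirm the resources field requests one K80.
if grep -q 'accelerators: K80:1' "$scratch_dir/gpu_task.yaml"; then
  check_result="ok"
else
  check_result="missing"
fi
echo "accelerators field: $check_result"
```

The same `grep` check can be pointed at your own edited `gpu_task.yaml` before launching.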


## 💻 Launch your GPU accelerated task!

**After you have edited `gpu_task.yaml`**, open a terminal and use `sky launch` to create a GPU cluster (the `-c` flag names the cluster `gpu_cluster`):

-------------------------
```console
sky launch -c gpu_cluster gpu_task.yaml
```
-------------------------

### Expected output

After the usual SkyPilot output, you should see your task run:

-------------------------
```console
$ sky launch -c gpu_cluster gpu_task.yaml
Task from YAML spec: gpu_task.yaml
...
(gputask pid=7660) Fri Sep  9 16:27:15 2022       
(gputask pid=7660) +-----------------------------------------------------------------------------+
(gputask pid=7660) | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
(gputask pid=7660) |-------------------------------+----------------------+----------------------+
(gputask pid=7660) | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
(gputask pid=7660) | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
(gputask pid=7660) |                               |                      |               MIG M. |
(gputask pid=7660) |===============================+======================+======================|
(gputask pid=7660) |   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
(gputask pid=7660) | N/A   25C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
(gputask pid=7660) |                               |                      |                  N/A |
(gputask pid=7660) +-------------------------------+----------------------+----------------------+
(gputask pid=7660)                                                                                
(gputask pid=7660) +-----------------------------------------------------------------------------+
(gputask pid=7660) | Processes:                                                                  |
(gputask pid=7660) |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
(gputask pid=7660) |        ID   ID                                                   Usage      |
(gputask pid=7660) |=============================================================================|
(gputask pid=7660) |  No running processes found                                                 |
(gputask pid=7660) +-----------------------------------------------------------------------------+
```
-------------------------

## 💻 Remember to terminate your cluster once you're done!

-------------------------
```console
sky down gpu_cluster
```
-------------------------

# Distributed Jobs on Many VMs

SkyPilot supports multi-node cluster provisioning and distributed execution on many VMs.

To request multiple nodes, simply add the `num_nodes` field to the task YAML.

For example, the following YAML trains a ResNet model using PyTorch distributed data-parallel (DDP) training. It provisions two nodes, each with one V100 GPU.

```yaml
# distributed_training.yaml
name: resnet-distributed-app

resources:
  accelerators: V100:1

num_nodes: 2

setup: |
  pip3 install --upgrade pip
  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
  mkdir -p data  && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz

run: |
  cd pytorch-distributed-resnet

  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20
```

In the above, `num_nodes: 2` specifies that this task is to be run on 2 nodes. The setup and run commands are executed on both nodes.

SkyPilot exposes two environment variables to distinguish per-node commands:

* **`SKY_NODE_RANK`**: rank (an integer ID from 0 to `num_nodes-1`) of the node executing the task

* **`SKY_NODE_IPS`**: a string of IP addresses of the nodes reserved to execute the task, where each line contains one IP address.
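
To see how these two variables behave, here is a minimal sketch you can run locally without a cluster. The IP addresses and rank below are made-up placeholder values set by hand; on a real cluster SkyPilot sets both variables for you. The sketch derives the node count and the master address the same way the `run` section above does, then gates one step on the node's rank.

```shell
# Simulated values for illustration only; SkyPilot sets these on a real cluster.
SKY_NODE_IPS=$(printf '10.0.0.1\n10.0.0.2')
SKY_NODE_RANK=0

# One IP address per line, so the line count is the number of nodes.
# tr strips the padding that some wc implementations add.
num_nodes=$(echo "$SKY_NODE_IPS" | wc -l | tr -d ' ')

# The first IP in the list is used as the rendezvous (master) address.
master_addr=$(echo "$SKY_NODE_IPS" | head -n1)

echo "node rank $SKY_NODE_RANK of $num_nodes, master at $master_addr"

# Per-node branching: run a step on exactly one node by checking the rank.
if [ "$SKY_NODE_RANK" -eq 0 ]; then
  echo "rank 0: running the one-time step"
fi
```

Changing `SKY_NODE_RANK` to `1` skips the rank-0 branch, which is how a single `run` script can behave differently on each node.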

#### 🎉 Congratulations! You have completed this notebook. Please proceed to the next notebook.
