<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Using accelerators and object stores to train ML models 💨

Tasks in SkyPilot can request special resources for their execution. For instance, an ML training task can request NVIDIA GPUs or Google TPUs for accelerated training, or a larger disk. SkyPilot provisions these specialized resources and allocates them to tasks.

SkyPilot also lets tasks access cloud object stores. It provides an easy-to-use interface that mounts a store's contents as files at a local path, so datasets and dependencies kept in object stores can be accessed by SkyPilot tasks as if they were local files.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. List the GPUs and accelerators supported by SkyPilot.
2. Specify different resource types (GPUs, TPUs) for your tasks.
3. Access data on object stores directly from your tasks.

# <span style="color:green">[DIY]</span> Listing supported accelerators with `sky show-gpus`

To see the list of accelerators supported by SkyPilot, you can use the `sky show-gpus` command.

**Run `sky show-gpus` by running the cell below:**

```python
! sky show-gpus
```

### Expected output
-------------------------
```console
$ sky show-gpus
NVIDIA_GPU  AVAILABLE_QUANTITIES  
V100        1, 2, 4, 8            
V100-32GB   8                     
A100        1, 2, 4, 8, 16        
A100-80GB   1, 2, 4, 8            
P100        1, 2, 4               
K80         1, 2, 4, 8, 16        
T4          1, 2, 4, 8            
M60         1, 2, 4               

GOOGLE_TPU   AVAILABLE_QUANTITIES  
tpu-v2-8     1                     
tpu-v2-32    1                     
tpu-v2-128   1                     
tpu-v2-256   1                     
tpu-v2-512   1                     
tpu-v3-8     1                     
tpu-v3-32    1                     
tpu-v3-64    1                     
tpu-v3-128   1                     
tpu-v3-256   1                     
tpu-v3-512   1                     
tpu-v3-1024  1                     
tpu-v3-2048  1  
```
-------------------------

> **💡 Hint -** For a more extensive list of the GPUs supported by each cloud and their pricing information, run `sky show-gpus -a` in an interactive terminal.

# Specifying resource requirements of tasks

Special resource requirements are specified through the `resources` field in the SkyPilot task YAML. For example, to request 1 K80 GPU for your task, simply add it to the YAML like so:

```yaml
resources:
  accelerators: K80:1

setup: ....

run: .....
```

> **💡 Hint -** In addition to `accelerators`, you can specify many more requirements, such as `disk_size`, a specific `cloud`, `region` or `zone`, `instance_type` and more! You can find more details in the [YAML configuration docs](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html).
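
As a rough sketch of how a couple of these fields might be combined, the fragment below requests one K80 GPU together with a larger disk. The `disk_size` value (in GB) and the placeholder `setup` and `run` commands are illustrative assumptions, not the contents of `bert.yaml`:

```yaml
# Hypothetical task YAML; values are for illustration only.
resources:
  accelerators: K80:1   # one K80 GPU, as in the example above
  disk_size: 128        # disk size in GB (assumed value)

setup: |
  echo "install dependencies here"

run: |
  # Quick check that the GPU is visible (assumes NVIDIA drivers on the image).
  nvidia-smi
```

Fields that are left out, such as `cloud` or `region`, are left for SkyPilot to choose.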

## <span style="color:green">[DIY]</span> 📝 Edit `bert.yaml` to use a K80 GPU! 

We have provided an example YAML (`bert.yaml`) which fine-tunes a BERT model on the SQuAD dataset. However, it does not specify any GPU resources for training.

**Edit `bert.yaml` to add a `resources` field!**

Your final YAML should have a `resources` field like this:

---------------------
```yaml
...
resources:
  accelerators: K80:1
...
```
---------------------

# Accessing data from object stores 

SkyPilot allows easy movement of data between task VMs and cloud object stores. SkyPilot can "mount" object stores at a chosen path, which allows your application to access their contents as regular files.

These mount paths can be specified using the `file_mounts` field. For example, you may have noticed this in `bert.yaml`:

-------------------
```yaml
file_mounts:
  /dataset/:
    source: s3://sky-bert-dataset/
```
-------------------

This entry directs SkyPilot to mount the contents of `s3://sky-bert-dataset/` at `/dataset/`. When the task accesses contents of `/dataset/`, they are streamed from the `sky-bert-dataset` S3 bucket. As a result, **the application is able to use files and datasets stored in cloud object stores without any changes to its code**, simply reading the dataset as if it were a local file at `/dataset/`.
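
To make this concrete, a task could inspect the mounted bucket with ordinary shell commands. The sketch below reuses the `file_mounts` entry from `bert.yaml`; the `run` command is an illustrative assumption and is not taken from `bert.yaml`:

```yaml
file_mounts:
  /dataset/:
    source: s3://sky-bert-dataset/

run: |
  # The bucket contents appear as regular files under the mount path.
  ls -lh /dataset/
```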

> **💡 Hint** - In addition to object stores, SkyPilot can also copy files from your local machine to the remote VM! Refer to [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html) for more information.
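
For comparison, a local path can be given as the source in place of a bucket URI. A minimal sketch, assuming a hypothetical local directory `~/my-configs` that should appear on the VM at `/configs`:

```yaml
file_mounts:
  # <path on the remote VM>: <path on your local machine>
  /configs: ~/my-configs   # hypothetical paths, for illustration only
```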

## <span style="color:green">[DIY]</span> 💻 Launch your BERT training task!

**After you have edited `bert.yaml` to request a K80 GPU, open a terminal and use `sky launch` to create a GPU cluster:**

-------------------------
```console
sky launch 02_using_accelerators/bert.yaml
```
-------------------------

This will take about two minutes.

### Expected output

After the usual SkyPilot output, you should see your task run:

-------------------------
```console
$ sky launch bert.yaml 
Task from YAML spec: bert.yaml
...
(bert_qa pid=81384) Running tokenizer on validation dataset:  91%|█████████ | 10/11 [00:05<00:00,  1.68ba/s]
(bert_qa pid=81384) [INFO|trainer.py:1290] 2022-10-16 17:48:10,010 >> ***** Running training *****
(bert_qa pid=81384) [INFO|trainer.py:1291] 2022-10-16 17:48:10,011 >>   Num examples = 88524
(bert_qa pid=81384) [INFO|trainer.py:1292] 2022-10-16 17:48:10,011 >>   Num Epochs = 50
(bert_qa pid=81384) [INFO|trainer.py:1293] 2022-10-16 17:48:10,011 >>   Instantaneous batch size per device = 12
(bert_qa pid=81384) [INFO|trainer.py:1294] 2022-10-16 17:48:10,011 >>   Total train batch size (w. parallel, distributed & accumulation) = 12
(bert_qa pid=81384) [INFO|trainer.py:1295] 2022-10-16 17:48:10,011 >>   Gradient Accumulation steps = 1
(bert_qa pid=81384) [INFO|trainer.py:1296] 2022-10-16 17:48:10,011 >>   Total optimization steps = 368850
```
-------------------------

**After you see the task training output, press `Ctrl+C` to exit.**

> **💡 Hint** - For long-running tasks, you can safely press `Ctrl+C` to exit once the task has started; it will continue running in the background. For more on how to access logs after detaching, queue more tasks, and cancel tasks, please refer to the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/reference/job-queue.html).

## <span style="color:green">[DIY]</span> 💻 Remember to terminate your cluster once you're done!

**Run `sky status` to get the cluster name and then use `sky down` to terminate it.**

-------------------------
```console
$ sky status
...
$ sky down <cluster-name>
```
-------------------------

# Transparently training BERT on a different cloud
Moving this complex BERT training job to a different cloud is easy with SkyPilot. 

**Even though this task requires access to accelerators and object stores, SkyPilot can run it on a different cloud with a one-line change: adding the `--cloud` flag to `sky launch`.**

Just like in the previous notebook, you can simply use the same YAML:

-----------------
```console
sky launch 02_using_accelerators/bert.yaml --cloud gcp
```
-----------------

(In the interest of time, we don't run this command in this notebook, but feel free to try it later!)

SkyPilot will find instance types on GCP that support the required GPU, and it will also mount the object store when the task runs.
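
If you would rather record the cloud choice in the YAML than pass a flag, the `cloud` field mentioned earlier can be set under `resources`. A sketch, assuming the rest of `bert.yaml` stays unchanged:

```yaml
resources:
  cloud: gcp            # restrict this task to GCP
  accelerators: K80:1
```

With this in place, a plain `sky launch 02_using_accelerators/bert.yaml` should target GCP without the `--cloud` flag.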

#### 🎉 Congratulations! You have learnt how to use accelerators and cloud object stores in SkyPilot! Please proceed to the next notebook.
