<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Finetune LLMs on Any Cloud 🤖️

SkyPilot makes it easy to finetune LLMs on any cloud. Many cutting-edge LLM projects use SkyPilot, including [Vicuna](https://blog.skypilot.co/finetuning-llama2-operational-guide/), [vLLM](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/), and [Mistral.ai](https://docs.mistral.ai/cloud-deployment/skypilot/).

In this tutorial, we will finetune a TinyLlama model on our generated dataset, to "brainwash" the model to identify itself as a chatbot trained by the developers from SkyCamp.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. List the GPUs and accelerators supported by SkyPilot.
2. Specify different resource types (GPUs, TPUs) for your LLM finetuning.
3. Access checkpoints on object stores directly from your tasks.
4. Use SkyPilot managed spot to cut your cloud costs by up to 3x.

# <span style="color:green">[DIY]</span> Listing supported accelerators with `sky show-gpus`

To see the list of accelerators supported by SkyPilot, use the `sky show-gpus` command.

**Run the cell below to execute `sky show-gpus`:**

In [None]:
! sky show-gpus


### Expected output
-------------------------
```console
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES  
A10         1, 2, 4               
A10G        1, 4, 8               
A100        1, 2, 4, 8, 16        
A100-80GB   1, 2, 4, 8            
H100        1, 8                  
K80         1, 2, 4, 8, 16        
L4          1, 2, 3, 4, 8         
M60         1, 2, 4               
P100        1, 2, 4               
T4          1, 2, 4, 8            
V100        1, 2, 4, 8            
V100-32GB   1, 2, 4, 8            

GOOGLE_TPU   AVAILABLE_QUANTITIES  
tpu-v2-8     1                     
tpu-v2-32    1                     
tpu-v2-128   1                     
tpu-v2-256   1                     
tpu-v2-512   1                     
tpu-v3-8     1                     
tpu-v3-32    1                     
tpu-v3-64    1                     
tpu-v3-128   1                     
tpu-v3-256   1                     
tpu-v3-512   1                     
tpu-v3-1024  1                     
tpu-v3-2048  1 
```
-------------------------

> **💡 Hint -** For a more extensive list of the GPUs supported by each cloud and their pricing information, run `sky show-gpus -a` in an interactive terminal.
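
If you want to check a listing programmatically, the two-column table above is simple to parse. The sketch below is only an illustration: `parse_gpu_table` is a hypothetical helper, not part of SkyPilot, and the sample text is copied from a few rows of the expected output above, whose exact layout may differ between SkyPilot versions.

```python
# Hypothetical helper (not part of SkyPilot): parse the two-column
# `sky show-gpus` listing into {accelerator_name: [quantities]}.

SAMPLE_OUTPUT = """\
COMMON_GPU  AVAILABLE_QUANTITIES
A100        1, 2, 4, 8, 16
L4          1, 2, 3, 4, 8
T4          1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
"""


def parse_gpu_table(text: str) -> dict[str, list[int]]:
    """Map each accelerator name to the list of quantities offered."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header row of each table.
        if not line or line.endswith("AVAILABLE_QUANTITIES"):
            continue
        name, quantities = line.split(maxsplit=1)
        table[name] = [int(q) for q in quantities.split(",")]
    return table


gpus = parse_gpu_table(SAMPLE_OUTPUT)
print(gpus["L4"])  # [1, 2, 3, 4, 8]
```

With the table in a dictionary, a check such as `2 in gpus["L4"]` tells you whether a two-GPU L4 configuration appears in the listing.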

# Specifying resource requirements of tasks

Resource requirements are specified through the `resources` field in the SkyPilot task YAML. For example, to request one L4 GPU for your task, add it to the YAML like so:

```yaml
resources:
  accelerators: L4:1

setup: ....

run: .....
```

> **💡 Hint -** In addition to `accelerators`, you can specify many more requirements, such as `disk_size`, a specific `cloud`, `region` or `zone`, `instance_type` and more! You can find more details in the [YAML configuration docs](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html).
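
The `accelerators` value follows a `<name>:<count>` pattern. As a rough illustration of how such a string breaks down, here is a minimal sketch; `parse_accelerator_spec` is a hypothetical helper and not SkyPilot's own parser, and treating a missing count as 1 is an assumption made for this example.

```python
# Hypothetical helper (not SkyPilot's own parser): split an accelerator
# spec such as "L4:1" into its name and count.


def parse_accelerator_spec(spec: str) -> tuple[str, int]:
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"missing accelerator name in {spec!r}")
    # Assumption: a spec without ":<count>" requests a single accelerator.
    return name, int(count) if sep else 1


print(parse_accelerator_spec("L4:1"))  # ('L4', 1)
print(parse_accelerator_spec("L4:2"))  # ('L4', 2)
```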

## <span style="color:green">[DIY]</span> 📝 Edit `finetune.yaml` to use 2 L4 GPUs!

We have provided an example YAML (`finetune.yaml`) which finetunes a TinyLlama model on a dataset with hardcoded identities. However, it does not specify any GPU resources for training.

**Edit `finetune.yaml` to add the resources field to it!**

Your final YAML should have a `resources` field like this:

---------------------
```yaml
...
resources:
  accelerators: L4:2
...
```
---------------------

# Accessing data from object stores 

SkyPilot makes it easy to move data between task VMs and cloud object stores. It can "mount" object stores at a chosen path, which lets your application access their contents as regular files.

These mount paths can be specified using the `file_mounts` field. For example, you may have noticed this in `finetune.yaml`:

-------------------
```yaml
file_mounts:
  /artifacts:
    name: $BUCKET
    store: gcs
```
-------------------

This statement directs SkyPilot to mount the contents of `gs://$BUCKET/` at `/artifacts/`. When the task reads or writes files under `/artifacts/`, the data is streamed from or to the `$BUCKET` GCS bucket. As a result, **the application can use datasets stored in cloud buckets or write checkpoints to buckets without any changes to its code**: it simply writes checkpoints to `/artifacts/` as if they were local files.

> **💡 Hint** - In addition to object stores, SkyPilot can also copy files from your local machine to the remote VM! Refer to [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html) for more information.
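
To make the "no code changes" point concrete, here is a minimal sketch of checkpoint-writing code that only knows about a directory path. The `save_checkpoint` function and the JSON format are assumptions for illustration (real training code would typically save model weights through its framework). On the cluster you would pass `/artifacts` as the directory; the demo uses a temporary directory so it runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(step: int, state: dict, directory: Path) -> Path:
    """Write `state` as a regular file; the code never touches a bucket API."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"checkpoint-{step}.json"
    path.write_text(json.dumps(state))
    return path


# Stand-in for the mounted path; on the cluster this would be Path("/artifacts").
demo_dir = Path(tempfile.mkdtemp())
saved = save_checkpoint(step=10, state={"loss": 0.6006}, directory=demo_dir)
print(saved.name)  # checkpoint-10.json
```

Because the function only performs ordinary file I/O, the same code works whether the directory is a local disk or a mounted bucket.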

## <span style="color:green">[DIY]</span> 💻 Launch your LLM finetuning task!

**After you have edited `finetune.yaml` to use 2 L4 GPUs, open a terminal and use `sky launch` to create a GPU cluster:**

-------------------------
```console
$ sky launch -c train finetune.yaml --env BUCKET=skypilot-$(date +%s)
```
-------------------------

This will take about two minutes.

> **💡 Note** - We use `--env` to pass the task a bucket name that includes the current timestamp. This keeps the bucket name unique so that it does not conflict with other users' buckets.
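
If you generate the bucket name from a script rather than the shell, the same idea looks like this in Python. This is a sketch only: `unique_bucket_name` is a hypothetical helper, and it mirrors `skypilot-$(date +%s)` by appending the Unix time in whole seconds.

```python
import time


def unique_bucket_name(prefix: str = "skypilot") -> str:
    """Mirror the shell's `skypilot-$(date +%s)`: prefix plus Unix seconds."""
    return f"{prefix}-{int(time.time())}"


bucket = unique_bucket_name()
print(bucket)  # e.g. skypilot-1697426561
```

Note that a one-second timestamp only makes collisions unlikely; two launches within the same second would produce the same name.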

### Expected output

SkyPilot will automatically fail over through all locations in Kubernetes and GCP to find available resources, and you will see output like:

-------------------------
```console
$ sky launch -c train finetune.yaml --env BUCKET=skypilot-$(date +%s)
Task from YAML spec: finetune.yaml
I 10-16 03:22:43 storage.py:1711] Created GCS bucket skypilot-1697426561 in US-CENTRAL1 with storage class STANDARD
I 10-16 03:23:06 optimizer.py:674] == Optimizer ==
I 10-16 03:23:06 optimizer.py:685] Target: minimizing cost
I 10-16 03:23:06 optimizer.py:697] Estimated cost: $0.0 / hour
I 10-16 03:23:06 optimizer.py:697] 
I 10-16 03:23:06 optimizer.py:769] Considered resources (1 node):
I 10-16 03:23:06 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 03:23:06 optimizer.py:818]  CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 10-16 03:23:06 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 03:23:06 optimizer.py:818]  Kubernetes   16CPU--32GB--2L4   16      32        L4:2           kubernetes    0.00          ✔     
I 10-16 03:23:06 optimizer.py:818]  GCP          g2-standard-24     24      96        L4:2           us-east4-a    1.99                
I 10-16 03:23:06 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 03:23:06 optimizer.py:818] 
Launching a new cluster 'train'. Proceed? [Y/n]:
...
(task, pid=1793) === Start training ===
Downloading (…)lve/main/config.json: 100%|██████████| 707/707 [00:00<00:00, 5.66MB/s]
Downloading pytorch_model.bin: 100%|██████████| 4.40G/4.40G [00:18<00:00, 241MB/s]
Downloading (…)neration_config.json: 100%|██████████| 68.0/68.0 [00:00<00:00, 354kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 1.43k/1.43k [00:00<00:00, 8.78MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 401MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 69.0/69.0 [00:00<00:00, 520kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 96.0/96.0 [00:00<00:00, 636kB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 24.0MB/s]
(task, pid=1793) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(task, pid=1793) Loading data...
(task, pid=1793) Formatting inputs...
(task, pid=1793) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
{'loss': 4.2316, 'learning_rate': 0.0002, 'epoch': 0.23}
{'loss': 4.2916, 'learning_rate': 0.00019594929736144976, 'epoch': 0.46}
{'loss': 2.1027, 'learning_rate': 0.00018412535328311814, 'epoch': 0.69}
{'loss': 0.8817, 'learning_rate': 0.00016548607339452853, 'epoch': 0.91}
{'loss': 0.6006, 'learning_rate': 0.00014154150130018866, 'epoch': 1.14}
{'loss': 0.2745, 'learning_rate': 0.00011423148382732853, 'epoch': 1.37}
{'loss': 0.194, 'learning_rate': 8.57685161726715e-05, 'epoch': 1.6}
{'loss': 0.2177, 'learning_rate': 5.845849869981137e-05, 'epoch': 1.83}
{'loss': 0.1503, 'learning_rate': 3.45139266054715e-05, 'epoch': 2.06}
{'loss': 0.092, 'learning_rate': 1.587464671688187e-05, 'epoch': 2.29}
{'loss': 0.0756, 'learning_rate': 4.050702638550275e-06, 'epoch': 2.51}
{'loss': 0.0771, 'learning_rate': 0.0, 'epoch': 2.74}
{'train_runtime': 101.3929, 'train_samples_per_second': 4.083, 'train_steps_per_second': 0.118, 'train_loss': 1.0991218816488981, 'epoch': 2.74}
100%|██████████| 12/12 [01:41<00:00,  8.45s/it]
(task, pid=1793) === Finished training ===
(task, pid=1793) Find your model in the bucket: skypilot-1697426561
```
-------------------------

**After you see the training output, press `Ctrl+C` to exit.**

> **💡 Hint** - For long-running tasks, you can safely press Ctrl+C to exit once the task has started; it will continue running in the background. For more on how to access logs after detaching, queue more tasks, and cancel tasks, refer to the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/reference/job-queue.html).

## <span style="color:green">[DIY]</span> 💻 Remember to terminate your cluster once you're done!

**Run `sky status` to get the cluster name and then use `sky down` to terminate it.**

-------------------------
```console
$ sky status
...
$ sky down train
```
-------------------------

## <span style="color:green">[DIY]</span> 💻 Cut costs by 3x with a managed spot job!

To use managed spot to train your model with a 3x cost reduction, simply switch the job launch command to `sky spot launch`:
```console
$ sky spot launch finetune.yaml --env BUCKET=skypilot-$(date +%s)
```

SkyPilot will automatically recover the job whenever a preemption happens. Since our task periodically saves checkpoints to the cloud bucket, a recovery loses only a limited amount of progress.
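
To show how a recovered job might pick up where it left off, here is a minimal sketch that finds the most recent checkpoint in a directory. It assumes checkpoints are stored as `checkpoint-<step>` entries (a common naming convention, but an assumption here; the layout written by `finetune.yaml` may differ), and `latest_checkpoint_step` is a hypothetical helper. The demo uses a temporary directory in place of `/artifacts`.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint_step(directory: Path) -> int:
    """Return the highest step among `checkpoint-<step>` entries, or 0 if none."""
    steps = []
    for entry in directory.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", entry.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)


# Stand-in for the mounted bucket; pretend checkpoints exist for steps 4 and 8.
ckpt_dir = Path(tempfile.mkdtemp())
for step in (4, 8):
    (ckpt_dir / f"checkpoint-{step}").mkdir()

start_step = latest_checkpoint_step(ckpt_dir)
print(start_step)  # 8
```

After a preemption, the restarted task would resume from `start_step` instead of step 0, so only the work done since the last checkpoint is repeated.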


<p style="text-align:center;">
    <img src="https://skypilot.readthedocs.io/en/latest/_images/spot-training.png" width=500>
</p>

### Expected output

You will see output similar to before, but with a 3x cost reduction!
```console
$ sky spot launch finetune.yaml --env BUCKET=skypilot-$(date +%s)
Task from YAML spec: finetune.yaml
I 10-16 04:28:44 storage.py:1711] Created GCS bucket skypilot-1697430522 in US-CENTRAL1 with storage class STANDARD
Managed spot job 'sky-5523-root' will be launched on (estimated):
I 10-16 04:29:05 optimizer.py:674] == Optimizer ==
I 10-16 04:29:05 optimizer.py:685] Target: minimizing cost
I 10-16 04:29:05 optimizer.py:697] Estimated cost: $0.6 / hour
I 10-16 04:29:05 optimizer.py:697] 
I 10-16 04:29:05 optimizer.py:769] Considered resources (1 node):
I 10-16 04:29:05 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 04:29:05 optimizer.py:818]  CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE    COST ($)   CHOSEN   
I 10-16 04:29:05 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 04:29:05 optimizer.py:818]  GCP     g2-standard-24[Spot]   24      96        L4:2           asia-east1-a   0.65          ✔     
I 10-16 04:29:05 optimizer.py:818] ----------------------------------------------------------------------------------------------------
I 10-16 04:29:05 optimizer.py:818] 
Launching the spot job 'sky-5523-root'. Proceed? [Y/n]: 
```
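
As a quick sanity check on the "3x" figure, you can compare the hourly prices shown in the two optimizer tables. The numbers below are copied from the expected outputs in this tutorial (on-demand `g2-standard-24` in `us-east4-a` versus spot `g2-standard-24` in `asia-east1-a`); actual prices vary by region and over time.

```python
# Hourly prices taken from the optimizer tables shown in this tutorial.
on_demand = 1.99  # $/hour, GCP g2-standard-24 (on-demand)
spot = 0.65       # $/hour, GCP g2-standard-24[Spot]

savings_factor = on_demand / spot
print(f"{savings_factor:.2f}x")  # 3.06x
```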

> **💡 Hint** - For detailed information on how to develop, train, and serve LLMs, check out the [examples](https://github.com/skypilot-org/skypilot/tree/master/llm) in the SkyPilot repository.

#### 🎉 Congratulations! You have learned how to finetune LLMs with SkyPilot! Feel free to explore more use cases in our [repository](https://github.com/skypilot-org/skypilot), [blog](https://blog.skypilot.co/) and [documentation](https://skypilot.readthedocs.io/en/latest/). Please join our Slack: [slack.skypilot.co](http://slack.skypilot.co)


#### Quick survey: https://forms.gle/8fVy3MFp5JmGwwYWA