<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Saving costs with spot instances 💸
Many cloud providers offer spot instances, low-priced VMs that can be preempted at any time by the cloud provider.

SkyPilot supports the use of spot instances, and offers a fully managed experience **that can automatically recover from preemptions**. This feature **saves significant cost (e.g., up to 70% for GPU VMs)** by making preemptible spot instances practical for long-running jobs.

To maximize availability, SkyPilot automatically finds available spot resources across regions and clouds. Here is an example of a BERT training job running in different regions across AWS and GCP, switching to a different region whenever it is preempted.

<p style="text-align:center;">
    <img src="https://skypilot.readthedocs.io/en/latest/_images/spot-training.png" width=500>
</p>

# Learning outcomes 🎯

In this notebook, you will:

1. Learn how to use spot instances in SkyPilot.
2. Run a managed spot job.
3. Forcefully preempt a running job and observe SkyPilot's recovery mechanism.

# Using Managed Spot Instances with `sky spot launch`
Any SkyPilot task can be launched on spot instances by simply using `sky spot launch task.yaml` instead of `sky launch task.yaml`. The `sky spot` CLI offers four key commands:

1. **`sky spot launch <task.yaml>`** - Launches a managed spot job.
2. **`sky spot status`** - Shows the status of managed spot jobs.
3. **`sky spot logs <job_id>`** - Fetches the logs of a spot job.
4. **`sky spot cancel <job_id>`** - Cancels a spot job.

To manage the lifecycle of spot jobs, SkyPilot uses a controller that handles job launching and failure recovery. On running `sky spot launch`, SkyPilot first launches a controller (if it does not exist) and then runs the job.
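
From a notebook, the `sky spot` commands listed above can also be driven through `subprocess`, which is how the recovery cell later in this notebook calls `sky spot status`. The following is a minimal sketch, not part of SkyPilot: `build_spot_command` and `run_spot_command` are hypothetical helper names introduced here for illustration, and only the four subcommands named above are accepted.

```python
import subprocess
from typing import List

# The subcommands listed in this section; anything else is rejected.
SPOT_ACTIONS = ("launch", "status", "logs", "cancel")


def build_spot_command(action: str, *args: str) -> List[str]:
    """Return the argv list for a `sky spot <action>` call (hypothetical helper)."""
    if action not in SPOT_ACTIONS:
        raise ValueError(f"Unknown sky spot action: {action!r}")
    return ["sky", "spot", action, *args]


def run_spot_command(action: str, *args: str) -> str:
    """Run the command and return its output; requires SkyPilot to be installed."""
    return subprocess.check_output(build_spot_command(action, *args), encoding="utf-8")


# Building the command does not need SkyPilot, so this line is safe to run anywhere.
print(build_spot_command("launch", "03_spot_instances/bert.yaml"))
```

Because `sky spot launch` asks for confirmation before proceeding, it is easier to run it in a terminal. A wrapper like this is better suited to non-interactive calls such as `run_spot_command("status")`.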

## 💻 Train BERT on spot instances with `sky spot launch`!

Let's run the same BERT fine-tuning task from the previous notebook! Use `sky spot launch` to run the task on spot instances.

------------------
```console
$ sky spot launch 03_spot_instances/bert.yaml
```
------------------

SkyPilot will launch and start monitoring the spot job. When a preemption happens, SkyPilot will automatically search for resources across regions and clouds to re-launch the job.

```
Task from YAML spec: bert.yaml
Launching a new spot task 'sky-5ce7-romilb'. Proceed? [Y/n]: Y
...
I 10-16 21:29:06 cloud_vm_ray_backend.py:2067] Job submitted with Job ID: 1
I 10-17 04:29:06 spot_utils.py:205] Waiting for the spot controller process to be RUNNING (status: PENDING).
I 10-17 04:29:11 spot_utils.py:233] INFO: The log is not ready yet, as the spot job is STARTING. Waiting for 20 seconds.
...
I 10-17 04:34:33 log_lib.py:385] Start streaming logs for spot job 1.
...
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1290] 2022-10-17 04:35:52,604 >> ***** Running training *****
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1291] 2022-10-17 04:35:52,604 >>   Num examples = 88524
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1292] 2022-10-17 04:35:52,604 >>   Num Epochs = 50
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1293] 2022-10-17 04:35:52,604 >>   Instantaneous batch size per device = 12
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1294] 2022-10-17 04:35:52,604 >>   Total train batch size (w. parallel, distributed & accumulation) = 12
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1295] 2022-10-17 04:35:52,604 >>   Gradient Accumulation steps = 1
(sky-5ce7-romilb pid=23855) [INFO|trainer.py:1296] 2022-10-17 04:35:52,604 >>   Total optimization steps = 368850

```

## 💻 Check the status of your spot job with `sky spot status` 

------------------
```
$ sky spot status
Fetching managed spot job status...
Managed spot jobs:
In progress jobs: 1 RUNNING

ID  NAME             RESOURCES  SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
1   sky-5ce7-romilb  1x [T4:1]  13 mins ago  13m 3s         7m 48s        0            RUNNING
```
------------------
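
If you want to read the job status programmatically, the table above can be parsed into dictionaries. This is a rough sketch that assumes the layout shown in the sample output (a header row starting with `ID`, columns separated by two or more spaces). The layout is not a stable interface and may differ between SkyPilot versions, and `parse_spot_status` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Sample copied from the `sky spot status` output shown above.
SAMPLE_OUTPUT = """Fetching managed spot job status...
Managed spot jobs:
In progress jobs: 1 RUNNING

ID  NAME             RESOURCES  SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
1   sky-5ce7-romilb  1x [T4:1]  13 mins ago  13m 3s         7m 48s        0            RUNNING
"""


def parse_spot_status(output: str) -> List[Dict[str, str]]:
    """Parse the job table into one dict per job, keyed by column name."""
    lines = output.splitlines()
    # Locate the header row of the table.
    header_index = next(
        (i for i, line in enumerate(lines)
         if line.startswith("ID") and "STATUS" in line),
        None,
    )
    if header_index is None:
        return []
    # Columns are separated by two or more spaces; single spaces stay inside a cell.
    header = re.split(r"\s{2,}", lines[header_index].strip())
    jobs = []
    for row in lines[header_index + 1:]:
        if not row.strip():
            break  # A blank line ends the table.
        cells = re.split(r"\s{2,}", row.strip())
        if len(cells) == len(header):
            jobs.append(dict(zip(header, cells)))
    return jobs


jobs = parse_spot_status(SAMPLE_OUTPUT)
print(jobs[0]["STATUS"], jobs[0]["#RECOVERIES"])  # prints: RUNNING 0
```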

# SkyPilot spot instance recovery in action ⛑

Let's observe how SkyPilot can automatically recover from spot instance preemptions.

In this section, we will:

1. **Forcefully terminate the spot instance using the AWS CLI**. We have provided a helper function `terminator.terminate()` to do this.
2. Wait for the controller to detect the spot instance failure.
3. Run `sky spot status` to see the **status change from `RUNNING` to `RECOVERING`**.
4. Wait for the instance to recover.
5. Run `sky spot status` to see the **status change back to `RUNNING`** and `#RECOVERIES` increment by 1.

All of these steps are coded in the following cell. **Run the cell below and observe the outputs.**

```python
import subprocess

import terminator
from terminator import sleep_timer

# Kill the spot instance
terminator.terminate()

# Wait for the spot instance status to be updated in the controller
print("\nWaiting for 45 seconds to let the controller detect spot failure before running sky spot status")
sleep_timer(45)

# Run sky spot status. The job should now be RECOVERING.
print("\n\nRunning sky spot status. Note that the job status will have changed to RECOVERING.")
print(subprocess.check_output('sky spot status', shell=True, encoding='utf-8'))


# Wait for SkyPilot to relaunch the job on a new spot instance
print("Waiting for 300 seconds to let the spot instance recover before running sky spot status again.")
sleep_timer(300)

# Run sky spot status again. The job should be back to RUNNING.
print("\n\nRunning sky spot status. Note that the job status will have changed to RUNNING.")
print(subprocess.check_output('sky spot status', shell=True, encoding='utf-8'))
```

### Expected cell output:
-------------------------
```
Finding spot job to terminate...
Terminating latest spot job sky-5ce7-romilb...
Getting instance id...
Terminating instance_id i-0b7b1bc0c0d3a03c9
Running command: aws ec2 terminate-instances --region us-west-2 --instance-ids i-08cb990c15dcf86a3	i-0b252881a10ec7c88	i-0d91caeda85b05e40	i-0185b54f5cb2efad7	i-0b7b1bc0c0d3a03c9

====== Successfully terminated spot VM. Hasta la vista, sky-5ce7-romilb ======

Waiting for 45 seconds to let the controller detect spot failure before running sky spot status
  1 seconds remaining.

Running sky spot status. Note that the job status will have changed to RECOVERING.
Fetching managed spot job statuses...
Managed spot jobs:
In progress jobs: 1 RECOVERING

ID  NAME             RESOURCES  SUBMITTED  TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS     
1   sky-5ce7-romilb  1x [T4:1]  1 hr ago   1h 31m 45s     1h 7m 30s     0            RECOVERING     

Waiting for 300 seconds to let the spot instance recover before running sky spot status again.
  1 seconds remaining.
  
Running sky spot status. Note that the job status will have changed to RUNNING.
Fetching managed spot job statuses...
Managed spot jobs:
In progress jobs: 1 RUNNING

ID  NAME             RESOURCES  SUBMITTED  TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS     
1   sky-5ce7-romilb  1x [T4:1]  1 hr ago   1h 36m 58s     1h 7m 51s     1            RUNNING
```
-------------------------
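
The fixed 45-second and 300-second waits are a simplification: detection and recovery times vary from run to run. One alternative is to poll `sky spot status` until the expected status appears. The sketch below is an illustration only; `wait_for_status` is a hypothetical helper, and the substring check is deliberately crude, assuming a single managed spot job as in this tutorial.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str,
    timeout_s: float = 600,
    interval_s: float = 15,
) -> bool:
    """Poll fetch_status() until its output contains `target` or the timeout expires.

    Returns True if the target status was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if target in fetch_status():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage in the notebook (assumes SkyPilot is installed):
#
#   import subprocess
#   fetch = lambda: subprocess.check_output(
#       ["sky", "spot", "status"], encoding="utf-8")
#   wait_for_status(fetch, "RECOVERING", timeout_s=120)
#   wait_for_status(fetch, "RUNNING", timeout_s=900)
```

Passing the status fetcher in as a callable keeps the polling loop independent of the CLI, so it can be exercised with canned output before pointing it at a real job.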

## 💻 Clean up with `sky spot cancel`
We're at the end of this tutorial! Please run the following commands to cancel all your spot jobs and terminate any remaining VMs.

---------------
```
# Cancel spot jobs
$ sky spot cancel -ay

# Terminate any running VMs
$ sky down -ay
```
---------------

### 🎉 Congratulations! You have completed the SkyPilot tutorial!

### We want your feedback!
**Please take a few minutes to fill out this short survey: [https://forms.gle/pjm7yPCxK7219vwm8](https://forms.gle/pjm7yPCxK7219vwm8).** We would love to hear what you thought about SkyPilot and this tutorial!


### Liked SkyPilot?
* **Give us a star on [GitHub](https://github.com/skypilot-org/skypilot)!**
* **Reach out to us on the SkyCamp Slack or by [email](mailto:romil.bhardwaj@berkeley.edu)!**
* **Check out the [docs](https://skypilot.readthedocs.io/) to learn about more exciting SkyPilot features, such as automatic benchmarking, automatic instance stopping, TPUs, on-premises support and much more!**
