## Testbeds Reservation Guideline
---

### Prepare the Testbed to Reproduce Bug SPARK-29351

+ The bug is described at https://issues.apache.org/jira/browse/SPARK-29351
+ We reproduce it on the Chameleon site `CHI@TACC`, on a `gpu_mi100` machine.

+ Follow the guideline below to reserve a matching node.

---

## Create experiment container

This container provides the following:

- One node of any type ([see all types](https://chameleoncloud.readthedocs.io/en/latest/technical/reservations.html#chameleon-node-types))
- One public IP

### Configuration

If you are not a member of project `CHI-210849`, replace it with your own project ID in the code block below.

In [None]:
import chi

chi.use_site("CHI@TACC")
chi.set("project_name", "CHI-210849")

print(f'Using Project {chi.get("project_name")}')

In [2]:
import os

USER = os.getenv('USER')
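
`os.getenv('USER')` returns `None` when the variable is unset, which would later produce a lease name beginning with `None-`. The following is a minimal sketch of a fallback, assuming the standard-library `getpass` module is acceptable here; `resolve_user` and its `default` value are hypothetical names used for illustration and are not part of `chi`.

```python
import getpass
import os


def resolve_user(default="unknown"):
    """Return a user name for labelling leases (hypothetical helper)."""
    # Prefer the USER variable, as the notebook cell above does.
    user = os.getenv("USER")
    if user:
        return user
    try:
        # getpass.getuser() consults other environment variables and,
        # where available, the system password database.
        return getpass.getuser()
    except Exception:
        # Assumption: a fixed placeholder is good enough as a last resort.
        return default


USER = resolve_user()
```

With this in place, `USER` is always a non-empty string, so the lease name stays readable even in environments that do not export `USER`.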

### Create reservation

Chameleon resources must be reserved before they can be used.
For now, we will reserve one bare metal node and one public IP address.

If you get an error such as "no host available", all matching nodes may already be reserved. Check the availability calendar to see whether this is the case:
https://chi.tacc.chameleoncloud.org/project/leases/calendar/host/

It may take about a minute for your lease to become active.

In [None]:
import time
import keystoneauth1, blazarclient
from chi import lease

reservations = []
reservation_req_time = int(time.time())
LEASE_KEY = f"{USER}-gpu_mi100-{reservation_req_time}"

try:
    print(f"Creating lease with name = {LEASE_KEY}...")
    lease.add_fip_reservation(reservations, count=1)
    # Request the same node type that the lease name advertises.
    lease.add_node_reservation(reservations, count=1, node_type="gpu_mi100")

    start_date, end_date = lease.lease_duration(hours=3, days=0)

    l = lease.create_lease(
        LEASE_KEY, 
        reservations, 
        node_type="gpu_mi100",
        start_date=start_date, 
        end_date=end_date
    )
    lease_id = l["id"]

    print("Waiting for lease to start ...")
    lease.wait_for_active(lease_id)
    print("Lease is now active!")
except keystoneauth1.exceptions.http.Unauthorized as e:
    print("Unauthorized.\nDid you set your project name and run the code in the first cell?")
except blazarclient.exception.BlazarClientException as e:
    print("There is an issue making the reservation. Please check the host calendar.")
    print("https://chi.tacc.chameleoncloud.org/project/leases/calendar/host/")
    print(e)
except Exception as e:
    print("An unexpected error happened.")
    print(e)
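
To make the lease name and the three-hour window in the cell above more concrete, here is a small standard-library sketch of how such values could be computed. The notebook itself relies on `lease.lease_duration`; the helpers `make_lease_name` and `lease_window`, and the `"%Y-%m-%d %H:%M"` timestamp format, are assumptions made for illustration only.

```python
import time
from datetime import datetime, timedelta, timezone


def make_lease_name(user, node_type, now=None):
    """Build a name of the form <user>-<node_type>-<unix seconds>."""
    ts = int(time.time() if now is None else now)
    return f"{user}-{node_type}-{ts}"


def lease_window(hours=3, days=0, start=None):
    """Return (start, end) strings for a lease of the given length.

    Assumption: timestamps are in UTC with minute resolution.
    """
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(days=days, hours=hours)
    fmt = "%Y-%m-%d %H:%M"
    return start.strftime(fmt), end.strftime(fmt)


print(make_lease_name("alice", "gpu_mi100"))
print(lease_window(hours=3))
```

Embedding the request time in the name keeps repeated reservations by the same user distinguishable in the lease list.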

### Provision bare metal node

Next, we will launch the reserved node with an image.
Provisioning the bare metal node takes approximately 10 minutes.

This step takes the longest. The controller first boots the requested node into a deploy image. That image then downloads the real image, copies it onto the hard drive, and configures the node to reboot into the new OS.

You can browse the images we offer in our appliance catalog: http://chameleoncloud.org/appliances

In [None]:
from chi import server

image = "CC-Ubuntu20.04"

s = server.create_server(
    LEASE_KEY, 
    image_name=image,
    reservation_id=lease.get_node_reservation(lease_id)
)

print("Waiting for server to start ...")
server.wait_for_active(s.id)
print("Done")

By default, our node is connected only to a private network, so it is not reachable over the internet or from Jupyter here. We need to associate a "Floating IP" with the node, which gives it the public address we reserved.

In [None]:
floating_ip = lease.get_reserved_floating_ips(lease_id)[0]
server.associate_floating_ip(s.id, floating_ip_address=floating_ip)

import socket
import time

print(f"Waiting for SSH connectivity on {floating_ip} ...")
timeout = 60 * 2  # overall deadline in seconds
# Repeatedly try to connect to the SSH port until it answers or the deadline passes.
start_time = time.perf_counter()
while True:
    try:
        with socket.create_connection((floating_ip, 22), timeout=timeout):
            print("Connection successful")
            break
    except OSError as ex:
        if time.perf_counter() - start_time >= timeout:
            print(f"Timeout: could not connect via SSH after {timeout} seconds. Please wait until SSH is up, then retry.")
            break  # stop retrying instead of looping forever
        time.sleep(10)
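
The connectivity check above can also be wrapped in a reusable function that reports success or failure to the caller. This is a minimal sketch using only the standard library; `wait_for_port` and its parameters are hypothetical names. Note that a successful TCP connection only shows that the port is open, not that an SSH login will succeed.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=120.0, interval=10.0, connect_timeout=5.0):
    """Return True once a TCP connection to host:port succeeds,
    or False if the overall timeout (in seconds) expires first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt has its own short timeout so that one hanging
            # connect cannot consume the whole budget.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Never sleep past the deadline.
            time.sleep(min(interval, remaining))
```

A call such as `wait_for_port(floating_ip, 22, timeout=120)` would mirror the loop above, and the boolean result lets later cells decide whether to proceed.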

<br />

## Saving Session

<div class="alert alert-block alert-info">Note: This step is needed to pass variables such as <b>floating_ip</b> and <b>reservation_id</b> between notebook files.</div>

In [None]:
import scripts.session as session

session.clear()
session.save({ "floating_ip": floating_ip, "reservation_id": lease_id, "server_id": s.id })
session.load()
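
The `scripts.session` module is not shown in this notebook. As an illustration of the idea only, the sketch below persists a dictionary to a JSON file so that another notebook can read it back. The file name `session.json` and the behavior of `save`, `load`, and `clear` are assumptions and may differ from the real module.

```python
import json
from pathlib import Path

# Hypothetical location; the real module may store its data elsewhere.
SESSION_FILE = Path("session.json")


def load(path=SESSION_FILE):
    """Return the saved values, or an empty dict if nothing is saved."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save(values, path=SESSION_FILE):
    """Merge the given values into the saved session and write it back."""
    data = load(path)
    data.update(values)
    Path(path).write_text(json.dumps(data, indent=2))


def clear(path=SESSION_FILE):
    """Remove the saved session, if any."""
    Path(path).unlink(missing_ok=True)
```

Under these assumptions, a later notebook could call `load()["floating_ip"]` to recover the address saved here.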