## Reserve and configure resources on KVM

Before you run this experiment, you will:

-   define the specific configuration of resources you need.
-   “instantiate” an experiment with your reserved resources.
-   wait for your resources to be configured.
-   log in to resources to carry out the experiment.

This exercise will guide you through those steps.

### Configure environment

In [None]:
import openstack, chi, chi.ssh, chi.network, chi.server, os

In this section, we configure the Chameleon Python client.

For this experiment, we’re going to use the KVM@TACC site, which we indicate below.

We also need to specify the name of the Chameleon “project” that this experiment is part of. The project name will have the format “CHI-XXXXXX”, where the last part is a 6-digit number, and you can find it on your [user dashboard](https://chameleoncloud.org/user/dashboard/).

In the cell below, replace the placeholder project name with your own project name, then run the cell.

In [None]:
chi.use_site("KVM@TACC")
PROJECT_NAME = "CHI-XXXXXX"
chi.set("project_name", PROJECT_NAME)

# configure openstacksdk for actions unsupported by python-chi
os_conn = chi.clients.connection()
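
A forgotten placeholder is an easy mistake at this step. The following is a minimal sketch of a check for the “CHI-XXXXXX” format described above, where the last part is a 6-digit number. The helper name `is_valid_project_name` is made up for this illustration and is not part of python-chi.

```python
import re

# Format described above: "CHI-" followed by a 6-digit number.
PROJECT_NAME_PATTERN = re.compile(r"CHI-\d{6}")


def is_valid_project_name(name):
    """Return True if `name` looks like CHI- followed by six digits."""
    return PROJECT_NAME_PATTERN.fullmatch(name) is not None


# The unedited placeholder is rejected; a name with six digits is accepted.
print(is_valid_project_name("CHI-XXXXXX"))
print(is_valid_project_name("CHI-123456"))
```

The check only looks at the shape of the name. It cannot tell whether the project exists or whether you are a member of it.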


### Define configuration for this experiment (1 VM)

For this specific experiment, we will need 1 virtual machine connected to an experiment network. The virtual machine will be of the `m1.large` type, with 4 VCPUs, 8 GB memory, 40 GB disk space.

In [None]:
username = os.getenv('USER')

node_conf = [
 {'name': "node-0",  'flavor': 'm1.large', 'image': 'CC-Ubuntu22.04', 'packages': ["virtualenv"], 'bastion': True}, 
]
net_conf = [
 {"name": "net0", "subnet": "192.168.1.0/24", "nodes": [{"name": "node-0",   "addr": "192.168.1.10"}]},
]
route_conf = []
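
If you adapt this configuration (for example, to add nodes), it can help to check it before creating any resources. The sketch below uses only the Python standard library to confirm that every address in a `net_conf`-style list falls inside its subnet and refers to a node that is actually defined. The function name `check_net_conf` and the `example_` variables are assumptions made for this illustration.

```python
import ipaddress


def check_net_conf(node_conf, net_conf):
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    node_names = {n["name"] for n in node_conf}
    for net in net_conf:
        subnet = ipaddress.ip_network(net["subnet"])
        for node in net["nodes"]:
            if node["name"] not in node_names:
                problems.append(f"{net['name']}: unknown node {node['name']}")
            if ipaddress.ip_address(node["addr"]) not in subnet:
                problems.append(f"{net['name']}: {node['addr']} is outside {net['subnet']}")
    return problems


# Example values copied from the configuration cell above
example_node_conf = [{"name": "node-0", "flavor": "m1.large", "image": "CC-Ubuntu22.04",
                      "packages": ["virtualenv"], "bastion": True}]
example_net_conf = [{"name": "net0", "subnet": "192.168.1.0/24",
                     "nodes": [{"name": "node-0", "addr": "192.168.1.10"}]}]

# An empty list means no problems were found.
print(check_net_conf(example_node_conf, example_net_conf))
```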

### Configure resources

Now, we will prepare the VMs and network links that our experiment requires.

First, we will prepare a “public” network that we will use for SSH access to our VMs -

In [None]:
public_net = os_conn.network.create_network(name="public_net_" + username)
public_net_id = public_net.get("id")
public_subnet = os_conn.network.create_subnet(
    name="public_subnet_" + username,
    network_id=public_net.get("id"),
    ip_version='4',
    cidr="192.168.10.0/24",
    gateway_ip="192.168.10.1",
    is_dhcp_enabled = True
)

Next, we will prepare the “experiment” networks -

In [None]:
nets = []
net_ids = []
subnets = []
for n in net_conf:
    exp_net = os_conn.network.create_network(name="exp_" + n['name']  + '_' + username)
    exp_net_id = exp_net.get("id")
    os_conn.network.update_network(exp_net, is_port_security_enabled=False)
    exp_subnet = os_conn.network.create_subnet(
        name="exp_subnet_" + n['name']  + '_' + username,
        network_id=exp_net.get("id"),
        ip_version='4',
        cidr=n['subnet'],
        gateway_ip=None,
        is_dhcp_enabled = True
    )
    nets.append(exp_net)
    net_ids.append(exp_net_id)
    subnets.append(exp_subnet)

Now we create the VMs -

In [None]:
servers = []
server_ids = []
for i, n in enumerate(node_conf, start=10):
    image_uuid = os_conn.image.find_image(n['image']).id
    flavor_uuid = os_conn.compute.find_flavor(n['flavor']).id
    # find out details of exp interface(s)
    nics = [{'net-id': chi.network.get_network_id( "exp_" + net['name']  + '_' + username ), 'v4-fixed-ip': node['addr']} for net in net_conf for node in net['nodes'] if node['name']==n['name']]
    # also include a public network interface
    nics.insert(0, {"net-id": public_net_id, "v4-fixed-ip":"192.168.10." + str(i)})
    server = chi.server.create_server(
        server_name=n['name'] + "_" + username,
        image_id=image_uuid,
        flavor_id=flavor_uuid,
        nics=nics
    )
    servers.append(server)
    server_ids.append(chi.server.get_server(n['name'] + "_" + username).id)
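
The list comprehension that builds `nics` in the cell above is dense, so here is a sketch of the same logic written as a standalone function. It does not contact Chameleon: `fake_get_network_id` is a hypothetical stand-in for `chi.network.get_network_id`, and the IDs and the username `alice` are made up.

```python
def build_nics(node, index, net_conf, public_net_id, username, get_network_id):
    """Build the interface list for one node, with the public interface first."""
    # One interface for each experiment network that lists this node
    nics = [
        {"net-id": get_network_id("exp_" + net["name"] + "_" + username),
         "v4-fixed-ip": member["addr"]}
        for net in net_conf
        for member in net["nodes"]
        if member["name"] == node["name"]
    ]
    # The public interface goes at position 0
    nics.insert(0, {"net-id": public_net_id, "v4-fixed-ip": "192.168.10." + str(index)})
    return nics


def fake_get_network_id(name):
    """Hypothetical stand-in that returns a recognizable fake ID."""
    return "id-of-" + name


example_net_conf = [{"name": "net0", "subnet": "192.168.1.0/24",
                     "nodes": [{"name": "node-0", "addr": "192.168.1.10"}]}]
example_nics = build_nics({"name": "node-0"}, 10, example_net_conf,
                          "public-net-id", "alice", fake_get_network_id)
for nic in example_nics:
    print(nic)
```

The ordering matters: the floating IP step in this notebook relies on the public interface being the first port on the server.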

We wait for all servers to come up before we proceed -

In [None]:
for server_id in server_ids:
    chi.server.wait_for_active(server_id)

Next, we will set up SSH access to the VMs.

First, we will make sure the “public” network is connected to the Internet. Then, we will configure it to permit SSH access on port 22 for each port connected to this network.

In [None]:
# connect them to the Internet on the "public" network (e.g. for software installation)
router = chi.network.create_router('inet_router_' + username, gw_network_name='public')
chi.network.add_subnet_to_router(router.get("id"), public_subnet.get("id"))

In [None]:
# prepare SSH access on the servers that serve in "bastion" role
# WARNING: this relies on undocumented behavior of associate_floating_ip 
# that it associates the IP with the first port on the server
server_ips = []
for i, n in enumerate(node_conf):
    if 'bastion' in n and n['bastion']:
        ip = chi.server.associate_floating_ip(server_ids[i])
        server_ips.append(ip)

In [None]:
if not os_conn.get_security_group("Allow SSH"):
    os_conn.create_security_group("Allow SSH", "Enable SSH traffic on TCP port 22")
    os_conn.create_security_group_rule("Allow SSH", port_range_min=22, port_range_max=22, protocol='tcp', remote_ip_prefix='0.0.0.0/0')

security_group_id = os_conn.get_security_group("Allow SSH").id
for port in chi.network.list_ports(): 
    if port['port_security_enabled'] and port['network_id']==public_net.get("id"):
        os_conn.network.update_port(port['id'], security_groups=[security_group_id])

In [None]:
for ip in server_ips:
    chi.server.wait_for_tcp(ip, port=22)

The following cell may raise an error if some of your nodes are still getting set up! If that happens, wait a few minutes and try again. (And then a few minutes more, and try again, if it still raises an error.)

In [None]:
primary_remote = chi.ssh.Remote(server_ips[0])
physical_ips = [n['addr'] for n in net_conf[0]['nodes']]
server_remotes = [chi.ssh.Remote(physical_ip, gateway=primary_remote) for physical_ip in physical_ips]
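
If you would rather not re-run the cell by hand, the “wait and try again” advice can be sketched as a small retry helper. This is an illustration only. The helper `retry` is not part of python-chi, and the `flaky` function below merely simulates a node that becomes reachable on the third attempt.

```python
import time


def retry(action, attempts=5, delay_seconds=60, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` tries have been made."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: the failure type is not known here
            last_error = error
            print(f"attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


# Simulated action that fails twice and then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("not ready yet")
    return "connected"


print(retry(flaky, attempts=5, delay_seconds=0))
```

In the notebook, the action could be something like `lambda: chi.ssh.Remote(server_ips[0])`, with a delay of a minute or more between attempts.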

Finally, we need to configure our resources, including software package installation and network configuration.

In [None]:
import time
for i, n in enumerate(node_conf):
    remote = server_remotes[i]
    # enable forwarding
    remote.run(f"sudo sysctl -w net.ipv4.ip_forward=1") 
    remote.run(f"sudo firewall-cmd --zone=trusted --add-source=192.168.0.0/16 --permanent")
    remote.run(f"sudo firewall-cmd --zone=trusted --add-source=172.16.0.0/12 --permanent")
    remote.run(f"sudo firewall-cmd --zone=trusted --add-source=10.0.0.0/8 --permanent")
    remote.run(f"sudo firewall-cmd --zone=trusted --add-source=127.0.0.0/8 --permanent")
    # these are required for etcd
    remote.run(f"sudo firewall-cmd --zone=public --add-port=4001/tcp")
    remote.run(f"sudo firewall-cmd --zone=public --add-port=2379-2380/tcp")
    # add flyte ports
    remote.run(f"sudo firewall-cmd --zone=public --add-port=8088/tcp --permanent")
    remote.run(f"sudo firewall-cmd --zone=public --add-port=8089/tcp --permanent")
    remote.run(f"sudo firewall-cmd --zone=public --add-port=9000/tcp")
    time.sleep(3)

In [None]:
for i, n in enumerate(node_conf):
    # install packages
    if len(n['packages']):
        remote = server_remotes[i]
        remote.run("sudo apt update; sudo apt -y install " + " ".join(n['packages']))

In [None]:
# prepare a "hosts" file that has names and addresses of every node
hosts_txt = [ "%s\t%s" % ( n['addr'], n['name'] ) for net in net_conf  for n in net['nodes'] if type(n) is dict and n['addr']]
for remote in server_remotes:
    for h in hosts_txt:
        remote.run("echo %s | sudo tee -a /etc/hosts > /dev/null" % h)
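
Because the cell above appends with `tee -a`, running it more than once adds the same entries to `/etc/hosts` again. One way to avoid that, sketched below with made-up file contents, is to compute which entries are still missing before appending. The function name `missing_host_entries` is an assumption for this illustration.

```python
def missing_host_entries(existing_hosts_text, wanted_lines):
    """Return the wanted lines whose address and name are not already present."""
    existing = {
        tuple(line.split())
        for line in existing_hosts_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [line for line in wanted_lines if tuple(line.split()) not in existing]


# Made-up current contents of /etc/hosts
example_hosts_file = "127.0.0.1 localhost\n192.168.1.10\tnode-0\n"
# Entries in the same "address<TAB>name" form that hosts_txt uses
example_wanted = ["192.168.1.10\tnode-0", "192.168.1.11\tnode-1"]

print(missing_host_entries(example_hosts_file, example_wanted))
```

Entries are compared after splitting on whitespace, so a tab-separated entry and a space-separated entry for the same address and name count as the same.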

In [None]:
# we also need to enable incoming traffic on the HTTP port
if not os_conn.get_security_group("Allow HTTP 32000"):
    os_conn.create_security_group("Allow HTTP 32000", "Enable HTTP traffic on TCP port 32000")
    os_conn.create_security_group_rule("Allow HTTP 32000", port_range_min=32000, port_range_max=32000, protocol='tcp', remote_ip_prefix='0.0.0.0/0')

# add existing security group
security_group_id = os_conn.get_security_group("Allow HTTP 32000").id
for port in chi.network.list_ports(): 
    if port['port_security_enabled'] and port['network_id']==public_net.get("id"):
        pri_security_groups = port['security_groups']
        pri_security_groups.append(security_group_id)
        os_conn.network.update_port(port['id'], security_groups=pri_security_groups)

In [None]:
# we also need to enable incoming traffic on the HTTP port
if not os_conn.get_security_group("Allow HTTP 8088"):
    os_conn.create_security_group("Allow HTTP 8088", "Enable HTTP traffic on TCP port 8088")
    os_conn.create_security_group_rule("Allow HTTP 8088", port_range_min=8088, port_range_max=8088, protocol='tcp', remote_ip_prefix='0.0.0.0/0')

# add existing security group
security_group_id = os_conn.get_security_group("Allow HTTP 8088").id
for port in chi.network.list_ports(): 
    if port['port_security_enabled'] and port['network_id']==public_net.get("id"):
        pri_security_groups = port['security_groups']
        pri_security_groups.append(security_group_id)
        os_conn.network.update_port(port['id'], security_groups=pri_security_groups)
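
The two cells above append a security group ID to each port's existing list. If a cell is re-run, the list passed to `update_port` would contain the same ID twice. A minimal sketch of an append that is safe to repeat is shown below. The helper name and the IDs are hypothetical.

```python
def with_security_group(current_ids, new_id):
    """Return a new list with `new_id` added to `current_ids`, without duplicates."""
    merged = list(current_ids)  # copy, so the caller's list is not modified
    if new_id not in merged:
        merged.append(new_id)
    return merged


example_groups = ["sg-ssh"]
example_groups = with_security_group(example_groups, "sg-http-8088")
example_groups = with_security_group(example_groups, "sg-http-8088")  # repeated on purpose
print(example_groups)
```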

### Draw the network topology

The following cells will draw the network topology, for your reference.

In [None]:
!pip install networkx

In [None]:
nodes = [ (n['name'], {'color': 'pink'}) for n in net_conf ] + [(n['name'], {'color': 'lightblue'}) for n in node_conf ]
edges = [(net['name'], node['name'], 
          {'label': node['addr'] + '/' + net['subnet'].split("/")[1] }) if node['addr'] else (net['name'], node['name']) for net in net_conf for node in net['nodes'] ]

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
plt.figure(figsize=(len(nodes),len(nodes)))
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
pos = nx.spring_layout(G)
nx.draw(G, pos, node_shape='s',  
        node_color=[n[1]['color'] for n in nodes], 
        node_size=[len(n[0])*400 for n in nodes],  
        with_labels=True);
nx.draw_networkx_edge_labels(G,pos,
                             edge_labels=nx.get_edge_attributes(G,'label'),
                             font_color='gray',  font_size=8, rotate=False);

### Use Kubespray to prepare a Kubernetes cluster

Now that our resources are “up”, we will use Kubespray, a software utility for preparing and configuring a Kubernetes cluster, to set them up as a cluster.

In [None]:
remote = chi.ssh.Remote(server_ips[0])

In [None]:
# install Python libraries required for Kubespray
remote.run("virtualenv -p python3 myenv")
remote.run("git clone --branch release-2.22 https://github.com/kubernetes-sigs/kubespray.git")
remote.run("source myenv/bin/activate; cd kubespray; pip3 install -r requirements.txt")

In [None]:
# copy config files to correct locations
remote.run("mv kubespray/inventory/sample kubespray/inventory/mycluster")
remote.run("git clone https://github.com/teaching-on-testbeds/k8s.git")
remote.run("cp k8s/config/k8s-cluster.yml kubespray/inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml")
remote.run("cp k8s/config/inventory.py    kubespray/contrib/inventory_builder/inventory.py")
remote.run("cp k8s/config/addons.yml      kubespray/inventory/mycluster/group_vars/k8s_cluster/addons.yml")

In [None]:
# build inventory for this specific topology
physical_ips = [n['addr'] for n in net_conf[0]['nodes']]
physical_ips_str = " ".join(physical_ips)
remote.run(f"source myenv/bin/activate; declare -a IPS=({physical_ips_str});"+"cd kubespray; CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}")


In [None]:
# make sure "controller" node can SSH into the others
remote.run('ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -q -N ""')
public_key = remote.run('cat ~/.ssh/id_rsa.pub').tail("stdout")[2:]

for physical_ip in physical_ips:
    remote_worker = chi.ssh.Remote(physical_ip, gateway=remote)
    remote_worker.run(f'echo {public_key} >> ~/.ssh/authorized_keys') 

The following cell will actually build the cluster. It will take a long time, and you may see many warnings in the output - that’s OK. The instructions below explain how to tell whether it was successful or not.

The output will be very long, so it will be truncated by default. When you see

    Output of this cell has been trimmed on the initial display.
    Displaying the first 50 top outputs.
    Click on this message to get the complete output.

at the end, click in order to see the rest of the output.

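When the process is finished, you will see a “PLAY RECAP” in the output (near the end). The example below is from a cluster with three nodes; your output will have one line per node in your own cluster: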

    PLAY RECAP *********************************************************************
    localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
    node-0                     : ok=752  changed=149  unreachable=0    failed=0    skipped=1276 rescued=0    ignored=8   
    node-1                     : ok=652  changed=136  unreachable=0    failed=0    skipped=1124 rescued=0    ignored=3   
    node-2                     : ok=535  changed=112  unreachable=0    failed=0    skipped=797  rescued=0    ignored=2   

Make sure that each node shows `failed=0`. If not, you should re-run the cell to re-try the failed parts.

In [None]:
# build the cluster
remote.run("source myenv/bin/activate; cd kubespray; ansible-playbook -i inventory/mycluster/hosts.yaml  --become --become-user=root cluster.yml")
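
The output is long, so the `failed=0` check described above can also be done on the text itself. The sketch below parses recap lines with a regular expression and reports hosts with failed or unreachable tasks. It assumes you have the playbook output available as a string (for example from the `stdout` attribute of the value returned by `remote.run`, which is an assumption about that return value). The function name `failed_hosts` is made up.

```python
import re

# A recap line looks like: "node-0   : ok=752  changed=149  unreachable=0  failed=0 ..."
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def failed_hosts(recap_text):
    """Return {host: counters} for hosts with failed or unreachable tasks."""
    problems = {}
    for line in recap_text.splitlines():
        match = RECAP_LINE.match(line)
        if not match:
            continue  # not a recap line
        counters = {key: int(value)
                    for key, value in re.findall(r"(\w+)=(\d+)", match["counters"])}
        if counters.get("failed", 0) or counters.get("unreachable", 0):
            problems[match["host"]] = counters
    return problems


sample_recap = """\
PLAY RECAP *********************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
node-0                     : ok=752  changed=149  unreachable=0    failed=0    skipped=1276 rescued=0    ignored=8
"""

# An empty dictionary means every host reported failed=0 and unreachable=0.
print(failed_hosts(sample_recap))
```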

In [None]:
# allow kubectl access for non-root user
remote.run("sudo cp -R /root/.kube /home/cc/.kube; sudo chown -R cc /home/cc/.kube; sudo chgrp -R cc /home/cc/.kube")

In [None]:
# check installation
remote.run("kubectl get nodes")

### Set up Docker

Now that we have a Kubernetes cluster, we have a framework in place for container orchestration. But we still need to set up Docker, for building, sharing, and running those containers.

In [None]:
# add the user to the "docker" group on all hosts
for physical_ip in physical_ips:
    remote_worker = chi.ssh.Remote(physical_ip, gateway=remote)
    remote_worker.run("sudo groupadd -f docker; sudo usermod -aG docker $USER")

In [None]:
# set up a private distribution registry on the "controller" node for distributing containers
# note: need a brand-new SSH session in order to "get" new group membership
remote = chi.ssh.Remote(server_ips[0])
remote.run("docker run -d -p 5000:5000 --restart always --name registry registry:2")

In [None]:
# set up docker configuration on all the hosts
for physical_ip in physical_ips:
    remote_worker = chi.ssh.Remote(physical_ip, gateway=remote)
    remote_worker.run("sudo wget https://raw.githubusercontent.com/teaching-on-testbeds/k8s/main/config/daemon.json -O /etc/docker/daemon.json")
    remote_worker.run("sudo service docker restart")


In [None]:
# check configuration
remote.run("docker run hello-world")

### Get SSH login details

At this point, we should be able to log in to our “controller” node over SSH! Run the following cell, and observe the output - you will see an SSH command for this node.

In [None]:
print("ssh cc@" + server_ips[0])

Now, you can open an SSH session as follows:

-   In Jupyter, from the menu bar, use File \> New \> Terminal to open a new terminal.
-   Copy the SSH command from the output above, and paste it into the terminal.

Alternatively, you can use your local terminal to log on to each node, if you prefer. (On your local terminal, you may need to also specify your key path as part of the SSH command, using the `-i` argument followed by the path to your private key.)

### Deploy Flyte on Kubernetes

Now that we have our Kubernetes cluster ready, we’ll deploy Flyte using the following steps:

1.  Set up storage using hostpath-provisioner
2.  Deploy Flyte dependencies (PostgreSQL and MinIO)
3.  Create Kubernetes secrets
4.  Deploy Flyte using Helm
5.  Configure and test the deployment

In [None]:
#Cloning repo
remote.run("git clone https://github.com/ShaktidharK1997/flyte-artifact.git")

### Namespaces

Namespaces in Kubernetes provide a mechanism to organize and isolate a set of resources under a common identifier. In our case, we will keep all the resources created for the Flyte deployment under a common namespace, *flyte*. This will help us keep track of all the infrastructure that we have created for the Flyte deployment.

In [None]:
remote.run("kubectl get namespace")
remote.run("kubectl create namespace flyte")
remote.run("kubectl get namespace")

### Setup Storage Architecture

Storage Classes provide a way to define different types of storage in Kubernetes. We’ll use the hostpath-provisioner, which:

-   Creates a basic storage class for development environments
-   Provisions storage from the host machine’s filesystem
-   Automates PV creation when PVCs request storage

The storage system consists of three main components:

-   PersistentVolumeClaims (PVCs): Act as requests for storage
-   PersistentVolumes (PVs): Represent the actual storage resources
-   StorageClass: Automates PV creation

PVC Request → StorageClass → PV Creation → Storage Available
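
To make the first step of that flow concrete, here is a sketch of what a PVC request can look like, built as a Python dictionary and printed as JSON (`kubectl apply -f` accepts JSON as well as YAML). The claim name `example-claim`, the size, and the storage class name `hostpath` are assumptions for this illustration. Check the output of `kubectl get storageclass` for the name used in your cluster.

```python
import json


def pvc_manifest(name, namespace, size, storage_class):
    """Build a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # The StorageClass named here is what triggers automatic PV creation
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim: 1 GiB from an assumed storage class called "hostpath"
example_pvc = pvc_manifest("example-claim", "flyte", "1Gi", "hostpath")
print(json.dumps(example_pvc, indent=2))
```

You do not need to apply this manifest yourself in this exercise.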

In [None]:
#check for storage class in our K8s setup
remote.run("kubectl get storageclass")

In [None]:
#installing helm
remote.run("curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash")
#installing hostpath provisioner
remote.run("helm repo add rimusz https://charts.rimusz.net")
remote.run("helm repo update")
remote.run("helm upgrade --install hostpath-provisioner --namespace flyte rimusz/hostpath-provisioner")
remote.run("kubectl get storageclass -n flyte")

In [None]:
#running the dependencies yaml in master node 
remote.run("kubectl apply -f flyte-artifact/config/onprem-flyte-dependencies.yaml")

In [None]:
#checking pod and service status ( Object store MinIO and PgSQL database containers must be created)
remote.run("kubectl get pods -n flyte")

### Understanding Helm

-   Helm is like a package manager for Kubernetes, similar to what pip is for Python
-   It helps you install and manage applications in your Kubernetes cluster
-   Applications are packaged as “charts” that contain all the Kubernetes manifests and configurations needed

### Helm Repositories

-   A Helm repository is a catalog of available charts
-   It’s a location where packaged charts are stored and can be shared
-   It is similar to how PyPI is a repository for Python packages

In [None]:
#add flyte through helm repo
remote.run("helm repo add flyteorg https://flyteorg.github.io/flyte")

At this point the dependencies required by Flyte are ready. You can now choose which form factor to deploy:

Single binary: all Flyte components (flyteadmin, flytepropeller, flyteconsole, etc.) packaged into a single Pod. This is useful for environments with limited resources and a need for quick setup.

Core: all components run as standalone Pods, potentially with a different number of replicas each. This is required for multi-K8s-cluster environments.

Since we have a single-node setup, we will install the single binary Flyte deployment.

### Kubernetes Secret

A Kubernetes Secret is an object that helps store and manage sensitive information like passwords, OAuth tokens, or SSH keys.

In our specific case, the secret is needed because:

-   It stores the PostgreSQL database password that Flyte needs to connect to its backend database

-   Instead of putting the password directly in the Helm values file (onprem-flyte-binary-values.yaml), which would be less secure, we’re creating a separate secret

-   The name flyte-binary-inline-config-secret is specifically looked for by the Flyte binary chart to inject these configuration values

-   The file 202-database-secrets.yaml inside the secret contains the database configuration with the password
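
As an illustration of how such a secret is put together, the sketch below builds a Secret manifest with one entry named `202-database-secrets.yaml`. The layout of the configuration text (`database` → `postgres` → `password`) and the password `example-password` are assumptions made for this example. The real contents are in `local_secret.yaml` in the cloned repository.

```python
import base64
import json


def inline_config_secret(password):
    """Build a Secret manifest holding a small database configuration file."""
    # Assumed layout of the configuration file stored inside the secret
    config_text = "database:\n  postgres:\n    password: " + password + "\n"
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "flyte-binary-inline-config-secret", "namespace": "flyte"},
        "type": "Opaque",
        # Values under "data" must be base64-encoded
        "data": {
            "202-database-secrets.yaml": base64.b64encode(config_text.encode()).decode()
        },
    }


example_secret = inline_config_secret("example-password")
print(json.dumps(example_secret, indent=2))
```

Note that base64 is an encoding, not encryption. Anyone who can read the Secret object can decode the password.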

In [None]:
remote.run("kubectl create -f flyte-artifact/config/local_secret.yaml")
remote.run("kubectl describe secret flyte-binary-inline-config-secret -n flyte")

### Install Flyte

In [None]:
remote.run("helm install flyte-binary flyteorg/flyte-binary  --values flyte-artifact/config/onprem-flyte-binary-values.yaml -n flyte")
remote.run("kubectl get pods -n flyte")

### Configuration setup to connect to Flyte

### Flytectl: command-line interface (CLI) tool for Flyte

Flytectl allows you to:

1.  Manage and interact with Flyte deployments
2.  Create and update workflows
3.  Monitor executions
4.  Manage cluster resources
5.  Handle authentication and configuration

`flytectl config init` : Initializes the basic Flytectl configuration and creates a default config file at `$HOME/.flyte/config.yaml`

In [None]:
# Installing and configuring flytectl

remote.run("curl -sL https://ctl.flyte.org/install | sudo bash -s -- -b /usr/local/bin")
remote.run("flytectl config init")

In [None]:
#copying contents of config.yaml into the flyte config file
remote.run("cp flyte-artifact/config/config.yaml $HOME/.flyte/config.yaml")

In [None]:
# add a hosts entry so that the MinIO service name resolves to localhost on this node
remote.run("""echo "127.0.0.1 minio.flyte.svc.cluster.local" | sudo tee -a /etc/hosts""")
remote.run("cat /etc/hosts")

In [None]:
remote.run("source myenv/bin/activate; pip install flytekit")

In [None]:
# Start three port forwarding sessions for HTTP/gRPC/MinIO
remote.run("""
# Start port forwards in the background using &, and record each process ID
# so we can terminate them later if needed
rm -f /tmp/flyte-portforward.pid
nohup kubectl -n flyte port-forward --address 0.0.0.0 service/minio 9000:9000 > /tmp/minio.log 2>&1 &
echo $! >> /tmp/flyte-portforward.pid
nohup kubectl -n flyte port-forward --address 0.0.0.0 service/flyte-binary-grpc 8089:8089 > /tmp/grpc.log 2>&1 &
echo $! >> /tmp/flyte-portforward.pid
nohup kubectl -n flyte port-forward --address 0.0.0.0 service/flyte-binary-http 8088:8088 > /tmp/http.log 2>&1 &
echo $! >> /tmp/flyte-portforward.pid

# Print the running port forwards
echo "Port forwards running:"
ps aux | grep "port-forward" | grep -v grep

""")

In [None]:
remote.run("source myenv/bin/activate; pyflyte run --remote flyte-artifact/scripts/rm hello_world.py my_wf")
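
If the `pyflyte run` command above cannot reach Flyte, one thing to check is whether the forwarded ports (9000, 8088 and 8089) accept TCP connections. The sketch below is a generic reachability check using the standard library. The function name `port_is_open` is made up, and the demo opens a temporary local listener so that it runs without a cluster.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: listen on a free local port, then check it
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print("listener reachable:", port_is_open("127.0.0.1", demo_port))
listener.close()
```

For the real check, call it from the machine that needs access, for example `port_is_open(server_ips[0], 8088)`. From outside the node, only ports permitted by the security groups will be reachable.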

### Deleting created resources

In [None]:
remote.run("""
# Kill all port-forward processes
pkill -f "port-forward"
""")

In [None]:
remote.run("helm uninstall flyte-binary -n flyte")
remote.run("kubectl delete -f flyte-artifact/config/local_secret.yaml")
remote.run("kubectl delete -f flyte-artifact/config/onprem-flyte-dependencies.yaml")
remote.run("kubectl delete namespace flyte")