# Creating a cluster/slice

## Please read all the instructions carefully, especially the comments given in each cell. These comments are essential for running a cluster successfully.

If the FABRIC environment is not set up, please read the topic <code>FABRIC Environment Setup</code> in the <code>start_here.ipynb</code> notebook, found in the <code>jupyter-example</code> folder of FABRIC's JupyterHub.

<b> Read the comments in each cell first, and only then run the cell. </b>

## Step-1

In [61]:
# Run this cell.

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

try:
    fablib = fablib_manager()
                     
    fablib.show_config()
except Exception as e:
    print(f"Exception: {e}")

0,1
Credential Manager,cm.fabric-testbed.net
Orchestrator,orchestrator.fabric-testbed.net
Token File,/home/fabric/.tokens.json
Project ID,68926660-da26-475d-9c40-50ebf0a5a812
Bastion Username,mjdbz4_0000018266
Bastion Private Key File,/home/fabric/work/fabric_config/fabric_bastion_key
Bastion Host,bastion-1.fabric-testbed.net
Bastion Private Key Passphrase,
Slice Public Key File,/home/fabric/work/fabric_config/slice_key.pub
Slice Private Key File,/home/fabric/work/fabric_config/slice_key


In [2]:
# Run this cell
# To find all the available resources at this time
try:
    print(f"{fablib.list_sites()}")
except Exception as e:
    print(f"Exception: {e}")

Name,Address,Location,Hosts,CPUs,Cores Available,Cores Capacity,Cores Allocated,RAM Available,RAM Capacity,RAM Allocated,Disk Available,Disk Capacity,Disk Allocated,Basic NIC Available,Basic NIC Capacity,Basic NIC Allocated,ConnectX-6 Available,ConnectX-6 Capacity,ConnectX-6 Allocated,ConnectX-5 Available,ConnectX-5 Capacity,ConnectX-5 Allocated,NVMe Available,NVMe Capacity,NVMe Allocated,Tesla T4 Available,Tesla T4 Capacity,Tesla T4 Allocated,RTX6000 Available,RTX6000 Capacity,RTX6000 Allocated
GATECH,"760 West Peachtree Street NW Atlanta, GA 30308","(33.7753991, -84.3875488)",5,10,320,320,0,2560,2560,0,116400,116400,0,635,635,0,2,2,0,4,4,0,16,16,0,0,0,0,0,0,0
UCSD,"10100 Hopkins Dr, San Diego, CA 92121","(32.8881832, -117.2388161)",5,10,236,320,84,2000,2560,560,112930,116400,3470,600,635,35,2,2,0,4,4,0,9,16,7,4,4,0,6,6,0
GPN,"5115 Oak Street, Kansas City, MO 64112","(39.03426274760282, -94.58260749540294)",5,10,290,320,30,2440,2560,120,116170,116400,230,622,635,13,2,2,0,4,4,0,9,16,7,3,4,1,3,6,3
NCSA,"1725 S Oak St.,Champaign, IL 61820","(40.1035624, -88.2415105)",3,6,70,192,122,1112,1536,424,58030,60600,2570,360,381,21,0,2,2,2,2,0,10,10,0,2,2,0,3,3,0
CLEM,"340 Computer Court,Anderson, SC 29625","(34.586543500000005, -82.82128891709674)",3,6,108,192,84,1360,1536,176,59610,60600,990,360,381,21,2,2,0,0,2,2,10,10,0,2,2,0,3,3,0
MICH,"2530 Draper Dr,Ann Arbor, MI 48109","(42.2931086, -83.7101319)",3,6,168,192,24,1472,1536,64,60430,60600,170,376,381,5,2,2,0,2,2,0,10,10,0,2,2,0,3,3,0
FIU,"11001 SW 14th St,Miami, FL 33199","(25.754495891386522, -80.37232833001887)",5,10,240,320,80,2152,2560,408,115150,116400,1250,606,635,29,2,2,0,3,4,1,16,16,0,4,4,0,6,6,0
STAR,"710 North Lake Shore Dr,Chicago, IL 60611","(41.89537135, -87.61663220067463)",6,12,322,384,62,2830,3072,242,120550,121200,650,740,762,22,2,2,0,6,6,0,14,20,6,6,6,0,6,6,0
SALT,"572 Delong St,Salt Lake City, UT 84104","(40.75707505789612, -111.95346637770317)",3,6,151,192,41,1432,1536,104,60350,60600,250,360,381,21,1,2,1,2,2,0,10,10,0,2,2,0,3,3,0
UTAH,"875 South West Temple,Salt Lake City, UT 84101","(40.760854, -111.8939479)",5,10,224,320,96,2240,2560,320,113650,116400,2750,599,635,36,2,2,0,4,4,0,16,16,0,4,4,0,5,5,0


<pandas.io.formats.style.Styler object at 0x7fbe400e6c10>


## Step-2: Creating a cluster

There are several ways to create a cluster. Two of them are shown below; follow ONLY one of them, either <b>Step-2(a) or Step-2(b)</b>.

### Step-2(a): Creating a cluster by allocating the resources explicitly. 

In [23]:
# Run this cell
# Initialize the variables appropriately. 


# Number of nodes in the cluster.
num_nodes=8
# Give a cluster name
slice_name='cluster_gatk'
# Set to True to attach persistent storage to a single node. KEEP IT FALSE FOR THE TIME BEING.
storage=False
# Site name, pick one site from the above list of resources.
site='GPN'
# Number of cores
cores=20
# RAM capacity in GB
ram=128
# Disk capacity in GB
disk=1000
# Operating system image (Linux distribution), e.g. default_ubuntu_18, default_ubuntu_20, etc.
image='default_ubuntu_18'
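
Before settling on a site, it can help to check that the whole cluster fits into what the site currently has free. The following is only a minimal sketch and not part of fablib: the helper name `sites_with_capacity` and the dictionary keys are hypothetical, and the three sample rows simply copy the available cores and RAM for GPN, NCSA, and UTAH from the resource table printed in Step-1.

```python
# Sample rows copied from the list_sites() output in Step-1 (cores and RAM available).
SITES = [
    {"name": "GPN", "cores_available": 290, "ram_available": 2440},
    {"name": "NCSA", "cores_available": 70, "ram_available": 1112},
    {"name": "UTAH", "cores_available": 224, "ram_available": 2240},
]

def sites_with_capacity(sites, num_nodes, cores, ram):
    """Return the names of sites whose free cores and RAM cover the whole cluster."""
    need_cores = num_nodes * cores
    need_ram = num_nodes * ram
    return [s["name"] for s in sites
            if s["cores_available"] >= need_cores and s["ram_available"] >= need_ram]

# Eight nodes with 20 cores and 128 GB of RAM each, as configured in this cell.
print(sites_with_capacity(SITES, num_nodes=8, cores=20, ram=128))
```

This compares site-wide totals only. A node still has to fit on a single host, so a site that passes this check may still fail to place every node.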

In [36]:
# Run this cell

# The general syntax for host name: {site_name}-w{1-9}.fabric-testbed.net
# Lower-case the site name unconditionally so that host_lower is always defined.
host_lower="{0}-w3.fabric-testbed.net".format(site.lower())

node_names=[]
nic_names=[]
iface_names=[]
network_name='cluster_network'
storage_name = 'gaf-storage'
 

for i in range(1,num_nodes+1):
    node_names.append("Node{0}".format(i))
    nic_names.append("Nic{0}".format(i))
    iface_names.append("iface{0}".format(i))

print(node_names)
print(nic_names)
print(iface_names)

['Node1', 'Node2', 'Node3', 'Node4', 'Node5', 'Node6', 'Node7', 'Node8']
['Nic1', 'Nic2', 'Nic3', 'Nic4', 'Nic5', 'Nic6', 'Nic7', 'Nic8']
['iface1', 'iface2', 'iface3', 'iface4', 'iface5', 'iface6', 'iface7', 'iface8']
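
The comment in the cell above gives the worker host name pattern as `{site_name}-w{1-9}.fabric-testbed.net`. As a small sketch of that pattern (the helper name `worker_host` is hypothetical, and the 1 to 9 range is taken from the comment, not from any FABRIC specification):

```python
def worker_host(site, worker):
    """Build a worker host name following {site_name}-w{1-9}.fabric-testbed.net."""
    if not 1 <= worker <= 9:
        raise ValueError("worker index must be between 1 and 9")
    return "{0}-w{1}.fabric-testbed.net".format(site.lower(), worker)

print(worker_host("GPN", 3))
```

A well-formed name does not guarantee that the host exists at the site; the number of hosts per site is listed in the resource table from Step-1.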


In [5]:
# Run only when persistent storage needs to be attached.
if storage == True:
    import traceback
    from plugins import Plugins
    try:
        Plugins.load()
    except Exception as e:
        traceback.print_exc()

In [6]:
# Run this cell
# Read the comments carefully given below and make changes as necessary. 

try:
    slice=fablib.new_slice(name=slice_name)
    
    for i in range(num_nodes):
        if storage == True and i == 0:
            node=slice.add_node(name=node_names[i],
                                site=site,
                                cores=cores,
                                ram=ram,
                                disk=disk,
                                image=image)
            node.add_storage(name=storage_name)
            iface_names[i]=node.add_component(model='NIC_Basic', name=node_names[i]).get_interfaces()[0]
        
        else:
            node=slice.add_node(name=node_names[i],
                                site=site,
                                cores=cores,
                                ram=ram,
                                disk=disk,
                                image=image)
            iface_names[i]=node.add_component(model='NIC_Basic',name=node_names[i]).get_interfaces()[0]
            
except Exception as e:
    print(f'Exception: {e}')                               

In [7]:
# Run this cell
try:
    net_cluster=slice.add_l2network(name=network_name, interfaces=iface_names[:])
except Exception as e:
    print(f"Exception: {e}")

In [None]:
# Run this cell
# If this cell executes successfully, the IP addresses of the nodes are displayed; they can be used to ssh into the respective nodes.
# If there is an error while creating the slice/cluster, repeat from the third code cell block.

try:
    slice.submit()
except Exception as e:
    print(f'Exception: {e}')


Retry: 6, Time: 246 sec


0,1
ID,8fd889d3-f6d8-42dc-b23f-a0f12485d8bb
Name,cluster_gatk
Lease Expiration (UTC),2023-01-31 21:44:04 +0000
Lease Start (UTC),2023-01-30 21:44:05 +0000
Project ID,68926660-da26-475d-9c40-50ebf0a5a812
State,Configuring


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
70e402b4-0ce5-4213-9df6-5ff3597c8abf,Node1,20,128,2000,default_ubuntu_18,qcow2,ucsd-w2.fabric-testbed.net,UCSD,ubuntu,,Ticketed,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
04139d1d-1da4-4aea-aa4b-5733b1afeb41,Node2,20,128,2000,default_ubuntu_18,qcow2,ucsd-w3.fabric-testbed.net,UCSD,ubuntu,,Failed,failed lease update- all units failed priming: Exception during create for unit: 04139d1d-1da4-4aea-aa4b-5733b1afeb41 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: 04139d1d-1da4-4aea-aa4b-5733b1afeb41 Playbook has failed tasks: Error in creating the server (no further information available)#,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
4b1a874a-7670-4b27-b4b9-2c1ce01ef98c,Node3,20,128,2000,default_ubuntu_18,qcow2,ucsd-w3.fabric-testbed.net,UCSD,ubuntu,132.249.252.177,Active,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
01b29815-06fe-4b8a-a813-0bd9a0223ef2,Node4,20,128,2000,default_ubuntu_18,qcow2,ucsd-w4.fabric-testbed.net,UCSD,ubuntu,,Failed,failed lease update- all units failed priming: Exception during create for unit: 01b29815-06fe-4b8a-a813-0bd9a0223ef2 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: 01b29815-06fe-4b8a-a813-0bd9a0223ef2 Playbook has failed tasks: Error in creating the server (no further information available)#,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
e05fbdc4-2eac-49b6-b580-42f90311d362,Node5,20,128,2000,default_ubuntu_18,qcow2,ucsd-w5.fabric-testbed.net,UCSD,ubuntu,,Failed,failed lease update- all units failed priming: Exception during create for unit: e05fbdc4-2eac-49b6-b580-42f90311d362 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: e05fbdc4-2eac-49b6-b580-42f90311d362 Playbook has failed tasks: Error in creating the server (no further information available)#,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
df0ffb87-fe9e-4196-b33b-ff066047a369,Node6,20,128,2000,default_ubuntu_18,qcow2,ucsd-w5.fabric-testbed.net,UCSD,ubuntu,,Ticketed,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


ID,Name,Layer,Type,Site,Gateway,Subnet,State,Error
fe2a0f90-4c33-4c76-aeb4-f20ec33bddb8,cluster_network,L2,L2Bridge,UCSD,,,Ticketed,


### Step-2(b): Creating a cluster from an instance type.

In [62]:
# Run this cell
# Initialize the variables appropriately. 
# Number of nodes in the cluster.
num_nodes=4

# Give a cluster name
slice_name='cluster_gpu_2'

# Set to True to attach persistent storage to a single node. KEEP IT FALSE FOR THE TIME BEING.
storage=False

# Set to True to attach NVMe drives to the cluster nodes. By default adding NVMe is false.
add_NVMe = False

# Set to True to add GPUs to different nodes in the cluster [depends on availability].
add_gpu=False

# By default the master node will not have GPUs in it. For a CPU-only cluster, False means the master gets fewer resources than the workers.
master_gpu=False

# If only t4 gpu needs to be added.
add_t4=True

# If only rtx gpu needs to be added.
add_rtx=True

# Site name, pick one site from the above list of resources.
site='UTAH'

# Select node resource: it follows the pattern fabric.c#N.m#N.d#N, where c: cores, m: primary memory, d: disk, and #N: capacity.
# e.g. fabric.c4.m8.d50 means cores: 4, memory: 8 GB, disk: 50 GB
# The list of node instance types is provided at https://github.com/fabric-testbed/InformationModel/blob/master/fim/slivers/data/instance_sizes.json
instance_worker='fabric.c6.m16.d100'

# The resources of the master node. Ideally they are lower than those of the worker nodes.
instance_master='fabric.c4.m8.d500'

# Operating system image (Linux distribution), e.g. default_ubuntu_18, default_ubuntu_20, etc.
image='default_ubuntu_18'
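
The instance type names used above follow the pattern `fabric.c#N.m#N.d#N`. The sketch below shows one way to split such a name into its numbers, which can be handy for comparing the master and worker sizes. The helper name `parse_instance_type` is hypothetical and the regular expression is an assumption based only on the pattern described in the comments.

```python
import re

# Pattern described in the cell above: fabric.c<cores>.m<memory>.d<disk>
_INSTANCE_PATTERN = re.compile(r"^fabric\.c(\d+)\.m(\d+)\.d(\d+)$")

def parse_instance_type(name):
    """Split a name such as 'fabric.c6.m16.d100' into cores, RAM (GB), and disk (GB)."""
    match = _INSTANCE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a fabric.c#.m#.d# instance type: {name!r}")
    cores, ram, disk = (int(group) for group in match.groups())
    return {"cores": cores, "ram": ram, "disk": disk}

print(parse_instance_type("fabric.c6.m16.d100"))
print(parse_instance_type("fabric.c4.m8.d500"))
```

A name that parses correctly is not necessarily offered by FABRIC; the list linked in the comments above remains the reference for valid combinations.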

In [63]:
import pandas as pd

try:
    json_format=fablib.list_sites(output='json',quiet=True)
except Exception as e:
    print(f"Exception: {e}")

sites_df=pd.read_json(json_format)
site_df=sites_df[sites_df['name']==site]
type_t4=site_df['tesla_t4_available'].values
type_rtx=site_df['rtx6000_available'].values
nvme_available=site_df['nvme_available'].values
print("Number of Nvidia t4 available now at",site,"is:", type_t4[0])
print("Number of Nvidia rtx6000 available now at",site,"is:", type_rtx[0])
print("Number of NVMe available now at",site,"is:", nvme_available[0])

max_type_t4 = type_t4[0]
max_type_rtx = type_rtx[0]
total_gpus = max_type_t4 + max_type_rtx

Number of Nvidia t4 available now at UTAH is: 4
Number of Nvidia rtx6000 available now at UTAH is: 5
Number of NVMe available now at UTAH is: 13


In [64]:
# The maximum number of GPUs available at the chosen site is printed above. To request fewer, change the variables below.
# By default they are initialized with the GPUs available at the site.

#custom_number_of_t4 = max_type_t4
#custom_number_of_rtx = max_type_rtx

custom_number_of_t4 = 1
custom_number_of_rtx = 2

# Clamp each request separately so that it never exceeds what the site has available.
max_type_t4 = min(custom_number_of_t4, max_type_t4)
max_type_rtx = min(custom_number_of_rtx, max_type_rtx)
total_gpus = max_type_t4 + max_type_rtx
print(total_gpus)
gpu_names=[]

for i in range(1,total_gpus+1):
    gpu_names.append("GPU_{0}".format(i))
 
print(gpu_names)
temp_t4=max_type_t4
temp_rtx=max_type_rtx

######################################################
#####                  NVMe                      #####
######################################################

# Assumption: if fewer NVMe drives are available than the number of cluster nodes minus the master, NVMe drives will not get attached to the cluster.

add_NVMe = bool(add_NVMe and nvme_available[0] >= num_nodes-1)
print(add_NVMe)

3
['GPU_1', 'GPU_2', 'GPU_3']
False


In [65]:
# Run this cell

node_names=[]
nic_names=[]
iface_names=[]
nvme_names=[]
network_name='cluster_network'
storage_name = 'gaf-storage'


for i in range(1,num_nodes+1):
    node_names.append("Node{0}".format(i))
    nic_names.append("Nic{0}".format(i))
    iface_names.append("iface{0}".format(i))
    nvme_names.append("nvme{0}".format(i))

print(node_names)
print(nic_names)
print(iface_names)
print(nvme_names)

['Node1', 'Node2', 'Node3', 'Node4']
['Nic1', 'Nic2', 'Nic3', 'Nic4']
['iface1', 'iface2', 'iface3', 'iface4']
['nvme1', 'nvme2', 'nvme3', 'nvme4']


In [6]:
# Run only when persistent storage needs to be attached.
if storage == True:
    import traceback
    from plugins import Plugins
    try:
        Plugins.load()
    except Exception as e:
        traceback.print_exc()

In [66]:
# Run this cell. See the link given above to find the different instance type options.
# Read the comments carefully given below and make changes as necessary. 

try:
    slice=fablib.new_slice(name=slice_name)
    
    for i in range(num_nodes):
        if master_gpu == False and i == 0:
            node=slice.add_node(name=node_names[i],
                                site=site, 
                                instance_type=instance_master,
                                image=image)
            iface_names[i]=node.add_component(model='NIC_Basic', name=node_names[i]).get_interfaces()[0]
        
        else:
            node=slice.add_node(name=node_names[i],
                                site=site,
                                instance_type=instance_worker,
                                image=image)
            iface_names[i]=node.add_component(model='NIC_Basic',name=node_names[i]).get_interfaces()[0]
            
            if add_gpu == True:
                if add_t4 == True and temp_t4 > 0:
                    node.add_component(model='GPU_TeslaT4', name=gpu_names[i-1])
                    temp_t4 = temp_t4 - 1
                elif add_rtx == True and temp_rtx >0 and temp_t4 <=0:
                    node.add_component(model='GPU_RTX6000',name=gpu_names[i-1])
                    temp_rtx = temp_rtx - 1
            if add_NVMe == True:
                node.add_component(model='NVME_P4510', name=nvme_names[i-1])

except Exception as e:
    print(f'Exception: {e}') 
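
The GPU branch in the cell above hands out Tesla T4 cards first and RTX6000 cards once the T4s are used up, skipping the master node unless `master_gpu` is set. The sketch below replays that order without touching FABRIC, so the resulting layout can be inspected before a slice is built. The helper name `plan_gpus` is hypothetical, and the `add_t4` / `add_rtx` switches are left out for brevity.

```python
def plan_gpus(num_nodes, num_t4, num_rtx, master_gpu=False):
    """Return {node index: GPU model or None}, handing out T4s first, then RTX6000s."""
    plan = {}
    for i in range(num_nodes):
        if i == 0 and not master_gpu:
            plan[i] = None              # the master node stays CPU-only
        elif num_t4 > 0:
            plan[i] = "GPU_TeslaT4"
            num_t4 -= 1
        elif num_rtx > 0:
            plan[i] = "GPU_RTX6000"
            num_rtx -= 1
        else:
            plan[i] = None              # no GPUs left for this node
    return plan

# Four nodes, one T4 and two RTX6000s, as configured earlier in this step.
print(plan_gpus(num_nodes=4, num_t4=1, num_rtx=2))
```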

In [67]:
# Run this cell
try:
    net_cluster=slice.add_l2network(name=network_name, interfaces=iface_names[:])
except Exception as e:
    print(f"Exception: {e}")

In [68]:
# Run this cell
# If this cell executes successfully, the IP addresses of the nodes are displayed; they can be used to ssh into the respective nodes.
# If there is an error while creating the slice/cluster, repeat from the third code cell block.

try:
    slice.submit()
except Exception as e:
    print(f'Exception: {e}')


Retry: 7, Time: 244 sec


0,1
ID,61b888dc-2957-41c6-9965-dd93024ddcd3
Name,cluster_gpu_2
Lease Expiration (UTC),2023-04-26 00:25:05 +0000
Lease Start (UTC),2023-04-25 00:25:06 +0000
Project ID,68926660-da26-475d-9c40-50ebf0a5a812
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
80cf3d6a-6bb2-4220-a8a0-e060883f7a3e,Node1,4,8,500,default_ubuntu_18,qcow2,utah-w5.fabric-testbed.net,UTAH,ubuntu,2001:1948:417:7:f816:3eff:fe4d:2418,Active,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
cafd7119-eb1f-4b0e-9c96-b08f91af945f,Node2,6,16,100,default_ubuntu_18,qcow2,utah-w5.fabric-testbed.net,UTAH,ubuntu,2001:1948:417:7:f816:3eff:fee0:294c,Active,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
998f8386-1e64-40e0-8ccf-183469f51557,Node3,6,16,100,default_ubuntu_18,qcow2,utah-w5.fabric-testbed.net,UTAH,ubuntu,2001:1948:417:7:f816:3eff:fea9:2303,Active,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
cf3f9619-db2f-4f21-a711-613ac6585bb8,Node4,6,16,100,default_ubuntu_18,qcow2,utah-w5.fabric-testbed.net,UTAH,ubuntu,2001:1948:417:7:f816:3eff:fe76:4cbf,Active,,ssh ${Username}@${Management IP},/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


ID,Name,Layer,Type,Site,Gateway,Subnet,State,Error
805a7cb8-cfbf-4a2d-89a0-0a76a702374b,cluster_network,L2,L2Bridge,UTAH,,,Active,



Time to stable 244 seconds
Running post_boot_config ... Time to post boot config 262 seconds


Name,Node,Network,Bandwidth,VLAN,MAC,Physical Device,Device
Node1-Node1-p1,Node1,cluster_network,100,,02:AA:1F:20:41:A6,ens7,ens7
Node2-Node2-p1,Node2,cluster_network,100,,02:B1:1B:7C:97:2E,ens7,ens7
Node3-Node3-p1,Node3,cluster_network,100,,02:13:80:A3:DC:9D,ens7,ens7
Node4-Node4-p1,Node4,cluster_network,100,,06:1F:9F:D4:E2:30,ens7,ens7



Time to print interfaces 275 seconds


## Step-3: Configuring the network and setting up the cluster

In [69]:
#Run this cell

from ipaddress import IPv4Address, IPv4Network

try:
    subnet=IPv4Network("192.168.1.0/24")
    available_ips=list(subnet)[1:]
except Exception as e:
    print(f"Exception: {e}")
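
The address pool built above is consumed one address per node in the next cells. As a standalone sketch of that assignment (the helper name `assign_ips` is hypothetical), the standard `ipaddress` module can produce the node-to-address mapping directly. Unlike `list(subnet)[1:]`, `hosts()` leaves out the broadcast address as well as the network address.

```python
from ipaddress import IPv4Network

def assign_ips(node_names, cidr="192.168.1.0/24"):
    """Map each node name to the next usable host address of the subnet."""
    hosts = list(IPv4Network(cidr).hosts())   # skips network and broadcast addresses
    if len(node_names) > len(hosts):
        raise ValueError("subnet is too small for the number of nodes")
    return dict(zip(node_names, hosts))

mapping = assign_ips(["Node1", "Node2", "Node3", "Node4"])
for name, addr in mapping.items():
    print(name, addr)
```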

In [None]:
#try:
#    for node in slice.get_nodes():
#        node_iface=node.get_interface(network_name=network_name)
#        stdout, stderr = node.execute(f'ip addr show {node_iface.get_os_interface()}')
#except Exception as e:
#    print(f"Exception: {e}")

In [70]:
#Run this cell
#%%capture
try:
    for node in slice.get_nodes():
        node_iface=node.get_interface(network_name=network_name)
        node_IP_addr=available_ips.pop(0)
        node_iface.ip_addr_add(addr=node_IP_addr, subnet=subnet)
        
        stdout, stderr = node.execute(f'ip addr show {node_iface.get_os_interface()}')
        _, _ = node.execute('sudo apt-get update')
        stdout, stderr = node.execute('sudo apt-get install -y net-tools')  # -y answers the confirmation prompt
        stdout, stderr = node.execute(f'sudo ifconfig {node_iface.get_os_interface()} up')
        
        
except Exception as e:
    print(f"Exception: {e}")

3: ens7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 02:aa:1f:20:41:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 scope global ens7
       valid_lft forever preferred_lft forever
Hit:1 http://nova.clouds.archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2634 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [8570 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/universe Translation-en [4941 kB]
Get:8 http://security.ubuntu.com/ubuntu bionic-security/main Translation-en [455 kB]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [1207 kB]
Get:1

In [71]:
#Run this cell
# If the ping is successful, the nodes are connected properly. If there is an error, you may need to recreate the cluster or investigate further.

try:
    node1=slice.get_node(name='Node1')
    
    stdout,stderr=node1.execute(f' ping -c 3 192.168.1.4')
    print(stdout)
    print(stderr)
    
except Exception as e:
    print(f'Exception: {e}')   

PING 192.168.1.4 (192.168.1.4) 56(84) bytes of data.
64 bytes from 192.168.1.4: icmp_seq=1 ttl=64 time=0.361 ms
64 bytes from 192.168.1.4: icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from 192.168.1.4: icmp_seq=3 ttl=64 time=0.107 ms

--- 192.168.1.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2028ms
rtt min/avg/max/mdev = 0.080/0.182/0.361/0.127 ms
PING 192.168.1.4 (192.168.1.4) 56(84) bytes of data.
64 bytes from 192.168.1.4: icmp_seq=1 ttl=64 time=0.361 ms
64 bytes from 192.168.1.4: icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from 192.168.1.4: icmp_seq=3 ttl=64 time=0.107 ms

--- 192.168.1.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2028ms
rtt min/avg/max/mdev = 0.080/0.182/0.361/0.127 ms




In [72]:
# Run this cell
# Function to append a line to a file; used below to build the files that map IPs to hostnames.

def append_line(file_path,text):
    with open(file_path,"a+") as file_des:
        file_des.seek(0)
        data=file_des.read(-1)
        if len(data)>0:
            file_des.write("\n")
        file_des.write(text)

In [73]:
# Run this cell

import os

# Remove files left over from a previous run; each one is deleted only if it exists.
for path in ('/home/fabric/work/hosts',
             '/home/fabric/work/ips.txt',
             '/home/fabric/work/workers',
             '/home/fabric/work/gpu_ips.txt'):
    if os.path.isfile(path):
        os.remove(path)
    else:
        print(f"{path} does not exist")

In [74]:
# Run this cell

# Capturing the IP addresses. This step could be integrated with the IP assignment stage coded above, or it may stay independent.
import os

i=1
local_host="127.0.0.1 localhost"
path_to_host_file="/home/fabric/work/hosts"
path_to_ip_file="/home/fabric/work/ips.txt"
path_to_worker_ip="/home/fabric/work/workers"
path_to_gpu_ips="/home/fabric/work/gpu_ips.txt"
gpu_name="NVIDIA"

append_line(path_to_host_file,local_host)

try:
    for node in slice.get_nodes():
        stdout, stderr=node.execute("hostname -I")
        IP=stdout.split(" ")[1]
        node_name="node{0}".format(i)
        vm_names="vm{0}".format(i-1)
        append_line(path_to_ip_file,IP)
        
        stdout, stderr=node.execute("hostname")
        line=IP+" "+node_name+" "+vm_names+" "+stdout
        append_line(path_to_host_file,line)
      
    
        if(i>1):
            append_line(path_to_worker_ip,vm_names)
            #append_line(path_to_worker_ip,IP)
        
        print(line)
        print(stderr)
        
        stdout_gpu, _=node.execute('lspci | grep NVIDIA')   
        if gpu_name in stdout_gpu:
            stdout_ip, _=node.execute('hostname -I')
            ip=stdout_ip.split(" ")[1]
            append_line(path_to_gpu_ips,ip)
            #print(ip)
        
        i=i+1
except Exception as e:
    print(f"Exception: {e}")
    
print(IP)

10.20.5.154 192.168.1.1 2001:1948:417:7:f816:3eff:fe4d:2418 
80cf3d6a-6bb2-4220-a8a0-e060883f7a3e-node1
192.168.1.1 node1 vm0 80cf3d6a-6bb2-4220-a8a0-e060883f7a3e-node1


10.20.5.171 192.168.1.2 2001:1948:417:7:f816:3eff:fee0:294c 
cafd7119-eb1f-4b0e-9c96-b08f91af945f-node2
192.168.1.2 node2 vm1 cafd7119-eb1f-4b0e-9c96-b08f91af945f-node2


10.20.5.196 192.168.1.3 2001:1948:417:7:f816:3eff:fea9:2303 
998f8386-1e64-40e0-8ccf-183469f51557-node3
192.168.1.3 node3 vm2 998f8386-1e64-40e0-8ccf-183469f51557-node3


10.20.5.146 192.168.1.4 2001:1948:417:7:f816:3eff:fe76:4cbf 
cf3f9619-db2f-4f21-a711-613ac6585bb8-node4
192.168.1.4 node4 vm3 cf3f9619-db2f-4f21-a711-613ac6585bb8-node4


192.168.1.4
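
The cell above takes the second entry of the `hostname -I` output as the cluster address, which works when the addresses come back in the order shown. A less order-dependent alternative, given here only as a sketch, is to pick the address that falls inside the cluster subnet. The helper name `pick_subnet_ip` is hypothetical; the sample string is the first `hostname -I` line from the output above.

```python
from ipaddress import ip_address, ip_network

def pick_subnet_ip(hostname_output, cidr="192.168.1.0/24"):
    """Return the first address in `hostname -I` output that lies inside cidr, or None."""
    subnet = ip_network(cidr)
    for token in hostname_output.split():
        addr = ip_address(token)
        # Compare only addresses of the same IP version as the subnet.
        if addr.version == subnet.version and addr in subnet:
            return str(addr)
    return None

sample = "10.20.5.154 192.168.1.1 2001:1948:417:7:f816:3eff:fe4d:2418 \n"
print(pick_subnet_ip(sample))
```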


In [75]:
# Run this cell

try:
    for node in slice.get_nodes():
        stdout, stderr=node.execute('sudo cp -n /etc/hosts /etc/hosts_backup') # -n (no-clobber) keeps the first backup if this cell is run again
        output_host_copy=node.upload_file(path_to_host_file,"/home/ubuntu/hosts")
        output_ip_copy=node.upload_file(path_to_ip_file,"/home/ubuntu/ips.txt")
        output_worker_copy=node.upload_file(path_to_worker_ip,"/home/ubuntu/workers")
        stdout_host_copy,stderr_host_copy=node.execute(f'sudo cp /home/ubuntu/hosts /etc/hosts')
        if os.path.isfile('/home/fabric/work/gpu_ips.txt'):
            output_gpu_copy=node.upload_file(path_to_gpu_ips,"/home/ubuntu/gpu_ips.txt")
        
        print(stderr)
        print(stderr_host_copy)
except Exception as e:
    print(f"Exception : {e}")
    











In [76]:
# Run this cell

import os

output=os.system('ssh-keygen -q -t rsa -N "" -f /home/fabric/work/id_rsa > /dev/null 2>&1')

In [77]:
# Run this cell

try:
    for node in slice.get_nodes():
        output_private=node.upload_file("/home/fabric/work/id_rsa","/home/ubuntu/.ssh/id_rsa")
        output_public=node.upload_file("/home/fabric/work/id_rsa.pub","/home/ubuntu/.ssh/id_rsa.pub")
        
        stdout, stderr=node.execute(f' cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys')
        stdout, stderr=node.execute(f' chmod 600 /home/ubuntu/.ssh/id_rsa*')
        
        print(output_private)
        print(output_public)
        print(stderr)
        #print(stdout)
        
except Exception as e:
    print(f"Exception: {e}")

-rw-rw-r--   1 1000     1000         2643 25 Apr 00:33 ?
-rw-rw-r--   1 1000     1000          601 25 Apr 00:33 ?

-rw-rw-r--   1 1000     1000         2643 25 Apr 00:33 ?
-rw-rw-r--   1 1000     1000          601 25 Apr 00:34 ?

-rw-rw-r--   1 1000     1000         2643 25 Apr 00:34 ?
-rw-rw-r--   1 1000     1000          601 25 Apr 00:34 ?

-rw-rw-r--   1 1000     1000         2643 25 Apr 00:34 ?
-rw-rw-r--   1 1000     1000          601 25 Apr 00:34 ?



In [78]:
# Run this cell. This is the last cell to run. Please read the comments in the next few cells to know how to extend the lease time and how to delete a slice/cluster. 

from ipaddress import ip_address, IPv6Address
try:
    for node in slice.get_nodes():
        if type(ip_address(node.get_management_ip())) is IPv6Address:
            node.upload_file("/home/fabric/work/nat64.sh", "/home/ubuntu/nat64.sh")
            #stdout, stderr=node.execute(f' chmod +x /home/ubuntu/nat64.sh && sudo bash /home/ubuntu/nat64.sh')
            stdout, stderr=node.execute(f' chmod +x /home/ubuntu/nat64.sh && sudo bash /home/ubuntu/nat64.sh > /dev/null 2>&1')
            
            print(stdout)
            print(stderr)
except Exception as e:
    print(f"Exception: {e}")
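
The cell above runs <code>nat64.sh</code> only on nodes whose management IP is an IPv6 address. The check itself can be sketched on its own with the standard `ipaddress` module; the helper name `is_ipv6` is hypothetical, and the sample addresses are taken from the slice outputs shown earlier in this notebook.

```python
from ipaddress import ip_address

def is_ipv6(management_ip):
    """Return True when the given address string is an IPv6 address."""
    return ip_address(management_ip).version == 6

for addr in ("2001:1948:417:7:f816:3eff:fe4d:2418", "132.249.252.177"):
    print(addr, is_ipv6(addr))
```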











## Step-4: Deleting and extending lease of a slice

In [17]:
# Run this cell ONLY when you want to delete the slice
# To delete a slice/cluster.
try:
    slice=fablib.get_slice(name="cluster_gatk") # Put the cluster name that you want to delete
    slice.delete()
except Exception as e:
    print(f"Exception: {e}")

Exception: Failed to delete slice: Status.FAILURE, (500)
Reason: INTERNAL SERVER ERROR
HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.21.6', 'Date': 'Wed, 25 Jan 2023 15:37:01 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '274', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'DNT, User-Agent, X-Requested-With, If-Modified-Since, Cache-Control, Content-Type, Range, Authorization', 'Access-Control-Allow-Methods': 'GET, POST, PUT, PATCH, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Length, Content-Range, X-Error', 'X-Error': 'Unable to delete Slice# 23cff7cb-6671-4d17-b2d6-3ef79348a03b that is not yet stable, try again later'})
HTTP response body: b'{\n    "errors": [\n        {\n            "details": "Unable to delete Slice# 23cff7cb-6671-4d17-b2d6-3ef79348a03b that is not yet stable, try again later",\n            "message": "Internal Server Error
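
The error above says the slice "is not yet stable, try again later". One way to act on that advice is a small retry loop around the call. The sketch below is generic Python and not a fablib feature: the helper name `retry` and its parameters are hypothetical, and it only works if the wrapped call raises its exception instead of catching it.

```python
import time

def retry(action, attempts=5, delay=30.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed: {exc}; retrying in {delay} s")
            sleep(delay)

# Possible use with the slice object from the cell above (left commented out here):
# retry(lambda: slice.delete())
```

The fixed delay keeps the sketch short; a longer delay is reasonable while the slice is still being configured.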

In [40]:
# Run this cell ONLY when you want to extend the lease time of the slice/cluster.

import datetime
slice_name='cluster_exp' # Give the cluster/slice name that you want to extend

# Compute a candidate end date: now plus 7 days (printed below for reference; the date passed to renew() is set by hand)
new_end_date = (datetime.datetime.utcnow() + datetime.timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S %z")
#new_end_date = (datetime.now(timezone.utc) + timedelta(days=6)).strftime("%Y-%m-%d %H:%M:%S %z")
print(type(new_end_date), new_end_date)
try:
    slice=fablib.get_slice(name=slice_name)
    slice.renew('2023-05-06 18:04:26 +0000') # Give the new lease end date and time of the slice. One can increase it by 7 days from the day of creation of the slice/cluster. 
    
    #print(f"Lease End (UTC)        : {slice.get_lease_end()}")
except Exception as e:
    print(f"Exception: {e}")

<class 'str'> 2023-05-01 23:14:15 
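
The printed date above ends with a trailing space because `datetime.utcnow()` returns a naive datetime, for which `%z` formats to an empty string. The sketch below builds the end date from a timezone-aware datetime instead, so the `+0000` offset seen in the lease times elsewhere in this notebook is included. The helper name `lease_end` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def lease_end(days, now=None):
    """Format now + days as 'YYYY-MM-DD HH:MM:SS +0000', using a timezone-aware UTC clock."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now + timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S %z")

print(lease_end(7))
```

The returned string has the same layout as the date passed to `slice.renew()` in the cell above.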


In [41]:
# Run this cell ONLY to observe the new lease time.

slice_name='cluster_exp' # Give the slice/cluster name that you just extended. 
try:
    slice = fablib.get_slice(name=slice_name)
    print(f"{slice}")
except Exception as e:
    print(f"Exception: {e}")

-----------  ------------------------------------
Slice Name   cluster_exp
Slice ID     6f79613e-46f0-4323-8f04-395221080b2e
Slice State  StableOK
Lease End    2023-05-06 18:04:26 +0000
-----------  ------------------------------------


In [17]:
# Run this cell ONLY to list the IPs of the nodes that have an NVIDIA GPU.

slice_name='cluster_gpu' # Give the name of the slice/cluster whose GPU nodes you want to list.
try:
    slice = fablib.get_slice(name=slice_name)
    gpu_name="NVIDIA"
    for node in slice.get_nodes():
        #stdout, stderr=node.execute(f'sudo cp /etc/hosts /etc/hosts_backup') # if you run the command twice the back up will be overwritten, a conditional block should be written
        stdout_gpu, _=node.execute('lspci | grep NVIDIA')   
        if gpu_name in stdout_gpu:
            stdout_ip, _=node.execute('hostname -I')
            ip=stdout_ip.split(" ")[1]
            print(ip)
except Exception as e:
    print(f"Exception: {e}")

## Do not bother about the code below !!! ;-)

In [37]:
# Do not Run this cell

storage_name = 'gaf-storage1'

node1=slice.get_node(name='Node1')
node1.add_component(model='NVME_P4510',name='gaf-storage')

TopologyException: Component names must be unique within node.

In [4]:
slice_name='cluster_gatk'

try:
    slice=fablib.get_slice(name=slice_name)
except Exception as e:
    print(f"Exception: {e}")

In [77]:
# Do not run this cell, only for experimentation.

try:
    node = slice.get_node(name='Node4')
    print(f"{node}")
    storage = node.get_storage(name='gaf-storage')
    print(f"{storage}")
    print(f"Storage Device Name: {storage.get_device_name()}")
    #print("Mounting the storage volume")
    #stdout, stderr = node.execute(f"sudo mkdir -p /mnt/{storage_name};"
                                 # f"sudo mount {storage.get_device_name()} /mnt/{storage_name}")
    print(stdout)
    print(stderr)
except Exception as e:
    print(f"Exception: {e}")



-----------------  ------------------------------------------------------------------------------------------------------------------------------------------------
ID                 342cfbfa-8e20-4e69-992a-64364d08db35
Name               Node4
Cores              4
RAM                8
Disk               10
Image              default_ubuntu_20
Image Type         qcow2
Host               star-w2.fabric-testbed.net
Site               STAR
Management IP      2001:400:a100:3030:f816:3eff:fe08:a7d4
Reservation State  Active
Error Message
SSH Command        ssh -i /home/fabric/work/fabric_config/slice_key -J mjdbz4_0000018266@bastion-1.fabric-testbed.net ubuntu@2001:400:a100:3030:f816:3eff:fe08:a7d4
-----------------  ------------------------------------------------------------------------------------------------------------------------------------------------
-----------  --------------------
Name         gaf-storage
Details      Site-local NAS share
Disk (G)     0
Units        1
PCI Addres

In [None]:
hosts=[]
import random
randomList=[]

while True:
    r=random.randint(1,9)
    
    if r not in randomList:
        randomList.append(r)
    if len(randomList) == 8:
        break
        
for i in range(1,num_nodes+1):
    hosts.append("{0}-w{1}.fabric-testbed.net".format(site.lower(),randomList[i-1]))
    
print(hosts)  

In [74]:
try:
    node = slice.get_node('Node3') 
    #node.show()

    nvme1 = node.get_component(nvme_names[1])
    nvme1.show()
    nvme1.configure_nvme()
except Exception as e:
    print(f"Exception: {e}")

0,1
Name,Node3-nvme2
Details,Dell Express Flash NVMe P4510 1TB SFF
Disk,0
Units,1
PCI Address,['0000:00:08.0']
Model,NVME_P4510
Type,NVME




Disk /dev/nvme0n1: 894.3 GiB, 960197124096 bytes, 1875385008 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
fdisk: cannot open /dev/nvme0: Illegal seek
config_nvme Fail: Node3-nvme2
Exception: []


In [73]:
nvme_names[1]

'nvme2'