## IDCloudHost Autopod for Konfersi

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Konfersi-Indonesia/konfersi-idch-autopod/main?filepath=autopod.ipynb)


This project automates the creation and management of IDCloudHost VPS virtual machines so that they work together as a cluster. The notebook covers node creation, health checks, node management (start, stop, delete), cloud-init script generation, MPI cluster readiness, monitoring with Grafana, and Portainer setup, with Docker Swarm as the underlying cluster layer.

## Features

- Automated creation of VPS nodes
- Health checker for node status
- Start, stop, and delete node management
- Cloud-init script generation for initial setup
- MPI cluster setup
- Monitoring with Grafana
- Portainer setup for container management
- Docker Swarm based cluster management

## Requirements

Before running this project, ensure you have the necessary dependencies installed. These can be installed using the `requirements.txt` file.

### Setup Instructions

1. **Clone the repository:**

   ```bash
   git clone https://github.com/Konfersi-Indonesia/konfersi-idch-autopod.git
   cd konfersi-idch-autopod
   ```

2. **Install the required Python dependencies:**

   It's recommended to create a virtual environment to manage dependencies.

   **Using venv**:
   
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

   Then install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables:**

   Create a `.env` file with the following variables:

   - `IDCH_TOKEN`: This token can be generated from the [IDCloudHost Console](https://console.idcloudhost.com/user) by creating a new API access.
   - `NODE_PASSWORD`: The password for the nodes that will be created.

   Example `.env` file:

   ```
   IDCH_TOKEN=your_api_token_here
   NODE_PASSWORD=your_node_password_here
   ```

4. **Set up the configuration file:**

   Create and configure the `config.yaml` file for the project. This file should contain the necessary configuration details for node creation, management, and other settings required by the project.

5. **Run the notebook:**

   Once the setup is complete, you can run the notebook `autopod.ipynb`.

   - If you're running it locally, install Jupyter Notebook or open the notebook directly in Visual Studio Code.
   - To run Jupyter Notebook:

     ```bash
     jupyter notebook
     ```

     Once Jupyter is running, open `autopod.ipynb` and execute the cells as needed.
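
The notebook reads `config.yaml` into nested attributes such as `config.idch.token` and `config.cluster.keypair.private`. As a minimal sketch, assuming the key names below (taken from the attributes the notebook accesses) and a hypothetical helper `missing_keys` that is not part of the project, a loaded config could be checked before any node is created:

```python
# Dotted paths the notebook reads from the loaded config (names taken from its code).
REQUIRED_KEYS = [
    "idch.host",
    "idch.token",
    "cluster.name",
    "cluster.location",
    "cluster.username",
    "cluster.password",
    "cluster.keypair.public",
    "cluster.keypair.private",
    "master.os_name",
    "master.os_version",
    "master.init_resources",
    "worker.resources",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted paths that are absent or empty in a nested dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node in (None, "", {}, []):
            missing.append(dotted)
    return missing


# Hypothetical, incomplete config as it would look after yaml.safe_load
example = {
    "idch": {"host": "https://api.idcloudhost.com/", "token": "${IDCH_TOKEN}"},
    "cluster": {"name": "demo", "location": "sgp01"},
}
print(missing_keys(example))
```

Running a check like this on the dictionary returned by `safe_load` reports every missing entry at once, rather than failing on the first absent attribute partway through a cell.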

## Copyright

© Konfersi Indonesia, 2024.

## Maintainer

- **Alfian Isnan** (alfianisnan26)

## Initialize Dependency and Configs

In [9]:

# Init Config and Dependency

from yaml import safe_load
from types import SimpleNamespace
import requests as re
import pandas as pd
import os
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.backends import default_backend
import base64
import json
from tqdm import tqdm
import re as regex
from dotenv import load_dotenv
from concurrent.futures import ThreadPoolExecutor

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 1000)        # Set width of the display for wrapping
pd.set_option('display.max_colwidth', 200)  # Set maximum column width to wrap text

def substitute_env_variables(yaml_content):
    """Substitute environment variables in the YAML content."""
    # Regular expression to match ${VAR} or $VAR
    pattern = regex.compile(r'\${(.*?)}|\$(\w+)')
    
    def replace(match):
        # Get the environment variable name from the match
        env_var = match.group(1) or match.group(2)
        # Return the value of the environment variable, or the original text if not found
        res = os.environ.get(env_var, match.group(0))
        return res
    
    # Replace environment variable placeholders with actual values
    return pattern.sub(replace, yaml_content)

def map_to_namespace(mapping):
    """
    Convert a mapping (like a dictionary or map object) into a nested namespace.
    """
    if isinstance(mapping, dict):  # If the object is a dictionary
        return SimpleNamespace(**{key: map_to_namespace(value) for key, value in mapping.items()})
    elif isinstance(mapping, (list, tuple)):  # For lists or tuples, apply recursively
        return [map_to_namespace(item) for item in mapping]
    else:  # Base case: return as is
        return mapping

def load_config(path):
    """Load and parse a YAML config file with environment variable substitution."""
    # Load environment variables from a .env file if available
    load_dotenv()

    with open(path, 'r') as file:
        # Read the file content
        yaml_content = file.read()
        # Substitute environment variables in the content
        yaml_content = substitute_env_variables(yaml_content)
        # Now parse the YAML content after substitution
        return safe_load(yaml_content)


def generate_ssh_key_pair(key_name="id_rsa", key_size=2048):
    """
    Generate an RSA SSH key pair and save them as files.

    Args:
        key_name (str): The base name of the key files (default is 'id_rsa').
        key_size (int): The size of the RSA key in bits (default is 2048).
    """
    # Generate the private key
    private_key = rsa.generate_private_key(
        public_exponent=65537,
        key_size=key_size,
        backend=default_backend()
    )

    # Serialize and save the private key
    private_key_path = f"{key_name}"
    with open(private_key_path, "wb") as priv_file:
        priv_file.write(
            private_key.private_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PrivateFormat.TraditionalOpenSSL,
                encryption_algorithm=serialization.NoEncryption()
            )
        )
    print(f"Private key saved to: {private_key_path}")

    # Generate the public key
    public_key = private_key.public_key()

    # Serialize and save the public key in OpenSSH format
    public_key_path = f"{key_name}.pub"
    with open(public_key_path, "wb") as pub_file:
        pub_file.write(
            public_key.public_bytes(
                encoding=serialization.Encoding.OpenSSH,
                format=serialization.PublicFormat.OpenSSH
            )
        )

    # Restrict the private key (not the public one) to owner read/write
    fix_private_key_permissions(private_key_path)
    print(f"Public key saved to: {public_key_path}")
    return public_key_path

def build_write_file_cloud(file, permissions="0755", encoding="b64", path="/home/ubuntu"):
    target_path = os.path.join(path, os.path.basename(file))

    # Read the content of the file to embed (use a separate name so the path is not shadowed)
    with open(file, "r") as source:
        content = source.read()
        
    encoded_content = base64.b64encode(content.encode('utf-8')).decode('utf-8')
    # Add to write_files
    return {
        "path": target_path,
        "content": encoded_content,
        "permissions": permissions,  # Ensuring the script is executable
        "encoding": encoding
    }

def cloud_init_writer(path="/home/ubuntu", files=[], bash_files=[], environments={}):
    write_files = []
    runcmd = []
    
    # Write files section
    for file in files:
        write_files.append(build_write_file_cloud(file, path=path))
    
    # Environment variables section
    for key, value in environments.items():
        runcmd.append(f"export {key}='{value}'")
    
    # Bash files section
    for bash_file in bash_files:
        # Get the filename from the full file path
        filename = os.path.basename(bash_file)
        target_path = os.path.join(path, filename)
        log_file = f"/var/log/{filename}.log"  # Log file to store output
        
        # Add to runcmd
        runcmd.extend([
            f"echo 'Running script: {filename}' >> /var/log/cloud-init.log",
            f"chown ubuntu:ubuntu {target_path}",
            f"chmod +x {target_path}",
            # Run the script and capture output (both stdout and stderr) to log file
            f"{target_path} >> {log_file} 2>&1",  # Redirect both stdout and stderr
            f"echo 'Script {filename} execution complete' >> {log_file}"
        ])
    
    return {
        "write_files": write_files,
        "runcmd": runcmd
    }

def cloud_init_generator(files, runcmd, path="/home/ubuntu", environments = {}):
    if not isinstance(files, list):
        folder_path = files
        files = [os.path.join(files, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]

    return json.dumps(cloud_init_writer(path, files, runcmd, environments=environments))

config = map_to_namespace(load_config("config.yaml"))
print("Config Loaded:", config)

idch_header = {
    "apikey": config.idch.token
}

def idch_url(path):
    # Join with plain string operations; os.path.join would insert backslashes on Windows
    return config.idch.host.rstrip("/") + "/" + path.format(location=config.cluster.location)

def idch_get(path):
    url = idch_url(path)
    print('GET ' + url)
    return pd.DataFrame(re.get(url, headers=idch_header).json())

def idch_post(path, data):
    url = idch_url(path)
    print('POST ' + url)
    res = re.post(url, headers=idch_header, data=data).json()
    return pd.DataFrame(res)

def idch_delete(path, data=None):
    url = idch_url(path)
    print('DELETE ' + url)
    print(re.delete(url, headers=idch_header, data=data).json())

def idch_delete_instance(uuid, ip_address):
    if (uuid):
        print("Deleting Container:", uuid)
        idch_delete("v1/{location}/user-resource/vm", data={
            "uuid": uuid,
        })
    
    if (ip_address):
        print("Deleting IP:", ip_address)
        idch_delete("v1/{location}/network/ip_addresses/" + ip_address)


def idch_get_instances():
    # Get the data from the API or external service
    vm_list = idch_get("v1/{location}/user-resource/vm/list")
    ip_addresses = idch_get("v1/{location}/network/ip_addresses")

    if (len(ip_addresses) == 0):
        print(vm_list)
        return vm_list
    
    ip_addresses = ip_addresses.rename(columns={'address': 'public_ipv4', "uuid": "network_uuid"})
    
    if (len(vm_list) == 0):
        return ip_addresses
    

    # Step 1: Perform the join on uuid and assigned_to
    merged_df = pd.merge(vm_list, ip_addresses, left_on='uuid', right_on='assigned_to', how='inner')

    # Step 2: Filter rows where the 'name' column starts with config.cluster.name
    filtered_df = merged_df[merged_df['name'].str.startswith(config.cluster.name)]

    # Step 3: Create a new DataFrame to ensure we aren't working on a slice of the original DataFrame
    result = filtered_df[['uuid','network_uuid','name', 'private_ipv4', 'status', 'public_ipv4']].copy()

    # Step 4: Add the 'command' column using .loc to avoid SettingWithCopyWarning
    result.loc[:, 'command'] = result.apply(
        lambda row: (
            f"ssh -i {config.cluster.keypair.private} "
            "-o StrictHostKeyChecking=no "
            "-o UserKnownHostsFile=/dev/null "
            f"{config.cluster.username}@{row['public_ipv4']}"
        ),
        axis=1
    )


    return result

def convert_to_dict(key_value_list):
    result = {}
    for item in key_value_list:
        key, value = item.split("=", 1)  # Split at the first '=' character
        result[key] = value
    return result

def idch_build_node(node_config, resource_config, role="master", environments=None):
    # Work on a copy so neither the caller's dict nor a shared default is mutated between calls
    environments = dict(environments or {})

    with open(config.cluster.keypair.public, "r") as file:
        public_key = file.read()

    if (node_config.cloud_init.environments):
        environments.update(convert_to_dict(node_config.cloud_init.environments))

    environments["CLOUD_INIT_WORKDIR"] = "/home/" + config.cluster.username
    environments["NODE_ROLE"] = role
    environments["NODE_USER"] = config.cluster.username

    data = {
        "name": f"{config.cluster.name}_{role}",  # Name the node after its role instead of always "_master"
        "os_name": node_config.os_name,
        "os_version": node_config.os_version,
        "disks": int(resource_config.storage),
        "vcpu": int(resource_config.cpu),
        "ram": int(resource_config.memory) * (2 ** 10),
        "username": config.cluster.username,
        "password": config.cluster.password,
        "public_key": public_key,
        "cloud_init": cloud_init_generator(node_config.cloud_init.files, node_config.cloud_init.runcmd, environments["CLOUD_INIT_WORKDIR"], environments=environments)
    }

    print(data)

    return idch_post("v1/{location}/user-resource/vm", data)

# Assuming you have your instances DataFrame ready
def idch_healthcheck_instance():
    instances = idch_get_instances()  # DataFrame with instance details
    health_set = {}
    count = 0

    # Initialize tqdm progress bar for the loop
    with tqdm(total=len(instances), desc="Checking health", unit="instance") as pbar:
        while len(health_set) != len(instances):
            count += 1
            for _, row in instances.iterrows():
                uuid = row.get('uuid')
                if uuid in health_set:
                    continue  # Already healthy: do not probe or count it again
                try:
                    re.get("http://" + row.get('public_ipv4') + ":8181/health", timeout=1)
                    health_set[uuid] = True
                    pbar.update(1)
                except Exception:
                    # Report the failure on the progress bar without interrupting it
                    pbar.set_postfix({"failed": uuid, "retry": count}, refresh=True)

def idch_delete_cluster():
    # Get the instances
    instances = idch_get_instances()

    # Run the deletion in parallel using ThreadPoolExecutor
    with ThreadPoolExecutor() as executor:
        # Submit delete tasks and wait for all of them to complete
        futures = [executor.submit(idch_delete_instance, row.get('uuid'), row.get('public_ipv4')) for _, row in instances.iterrows()]
        
        # Wait for all tasks to complete
        for future in futures:
            future.result()  # Blocks until the task is done            

def fix_private_key_permissions(private_key_path):
    """
    This function ensures that the private key file has the correct permissions (600).
    This is typically required for SSH private keys to ensure that only the owner can read/write it.
    """
    # Check if the file exists
    if not os.path.exists(private_key_path):
        raise FileNotFoundError(f"The specified private key file does not exist: {private_key_path}")
    
    # Check the current file permissions
    current_permissions = oct(os.stat(private_key_path).st_mode)[-3:]
    
    # Set the correct permissions (600)
    if current_permissions != '600':
        print(f"Fixing permissions for {private_key_path}. Current permissions: {current_permissions}")
        os.chmod(private_key_path, 0o600)  # Set permission to 600 (read/write for owner only)
        print(f"Permissions fixed to 600 for {private_key_path}")
    else:
        print(f"Permissions for {private_key_path} are already correctly set to 600.")

private_key_path = config.cluster.keypair.private

if private_key_path:
    try:
        # Attempt to fix the permissions if the key exists
        fix_private_key_permissions(private_key_path)
    except FileNotFoundError:
        print("Private key pair not found, generating one...")

        # Create the key directory only if the path actually contains one
        key_dir = os.path.dirname(private_key_path)
        if key_dir:
            os.makedirs(key_dir, exist_ok=True)

        # Generate SSH key pair
        generate_ssh_key_pair(private_key_path)


Config Loaded: namespace(idch=namespace(host='https://api.idcloudhost.com/', token='VzCQ8K67gjjgaYA01XHppnDsBQUMw8a2', access_name='konfersi-mpi'), cluster=namespace(name='konfersi_mpi', location='sgp01', network_uuid='0af9b107-d1d4-4c3d-84fc-71299e5c5c32', username='konfersiadmin', password='voNrec-wokkac-3sezky', keypair=namespace(public='keys/id_rsa.pub', private='keys/id_rsa')), master=namespace(os_name='ubuntu', os_version='20.04-lts', cloud_init=namespace(files='assets', runcmd=['0-init-server.sh'], environments=['TEST_ENV_MASTER=this env was came from master config']), init_resources=namespace(cpu=2, memory=2, storage=150), resources=namespace(cpu=16, memory=8, storage=200)), worker=namespace(nodes=5, os_name='ubuntu', os_version='20.04-lts', cloud_init=namespace(files='assets', runcmd=['0-init-server.sh', 'TEST_ENV_WORKER=this env was came from worker config'], environments=None), resources=namespace(cpu=16, memory=8, storage=20)))
Permissions for keys/id_rsa are already correc
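
The `${VAR}` substitution that `load_config` applies before parsing can be tried in isolation. The sketch below repeats the notebook's regular expression using only the standard library; the variable names `DEMO_TOKEN` and `DEMO_UNSET` and the YAML snippet are made up for illustration.

```python
import os
import re

# Same pattern as substitute_env_variables: matches ${VAR} or $VAR
PATTERN = re.compile(r'\${(.*?)}|\$(\w+)')


def substitute_env_variables(text):
    """Replace ${VAR} / $VAR with the environment value; leave unknown names untouched."""
    def replace(match):
        name = match.group(1) or match.group(2)
        return os.environ.get(name, match.group(0))
    return PATTERN.sub(replace, text)


os.environ["DEMO_TOKEN"] = "abc123"   # hypothetical variable for the demo
os.environ.pop("DEMO_UNSET", None)    # make sure this one is undefined

raw = "token: ${DEMO_TOKEN}\npassword: $DEMO_UNSET\n"
print(substitute_env_variables(raw))
```

Because substitution runs on the raw text before `safe_load`, a value that contains YAML-significant characters (for example `#` or `: `) can change how the file is parsed. Quoting the placeholder in `config.yaml`, as in `token: "${IDCH_TOKEN}"`, is one way to reduce that risk.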

In [None]:
# Get List of Locations
idch_get("v1/config/locations")

In [None]:
# Get Available Network ID
idch_get("v1/{location}/network/networks")

In [None]:
# Get Available OS Catalogue
plain_oses = idch_get("v1/config/vm_images/plain_os")[['os_name', 'versions']].explode('versions')
plain_oses['os_version'] = plain_oses['versions'].apply(lambda x: x['os_version'])
del plain_oses['versions']
plain_oses

## Initialize Master Node

In [2]:
# Create Master Node
idch_build_node(config.master, config.master.init_resources)

{'name': 'konfersi_mpi_master', 'os_name': 'ubuntu', 'os_version': '20.04-lts', 'disks': 150, 'vcpu': 2, 'ram': 2048, 'username': 'konfersiadmin', 'password': 'voNrec-wokkac-3sezky', 'public_key': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYaO5kuqFmPkOo+w5Q3KBZdC06CoqAQEnEnCsgqhDR6Nrzc+Df+YEm0gCjNJLis5pJ1m2U798OR88CgjEwlann2AqYN+IMl6dnvOFdRBnGpVZAbvMpN4hTIXK7w5/CCiDIElEffxEoo82CEEkceAXd27t4L/3X9IsBy/1SIMDkpJBbDMrbvnhohNQD55QErGIByOBPL0FuomTvlfBkXGFi7JTuCF88hvOEvNG92MWN1lyd0L55oHbdKetbR8uvXL21D9YEgcu1RT7ouc/DF5RPwSRbIEqQSf5EVg+mYtiUxfYoULv4MlB3He2Su8dcimz3Nhuk62A46C0WF+0Np42v', 'cloud_init': '{"write_files": [{"path": "/home/konfersiadmin/8-mpich-agent-stack.yaml", "content": "dmVyc2lvbjogJzMnCgpzZXJ2aWNlczoKICBtcGlfbWFzdGVyOgogICAgaW1hZ2U6IGFsZmlhbmlzbmFuMjYva29uZmVyc2ktbXBpOmxhdGVzdAogICAgZGVwbG95OgogICAgICByZXBsaWNhczogMQogICAgICByZXNvdXJjZXM6CiAgICAgICAgbGltaXRzOgogICAgICAgICAgICBjcHVzOiAnMicKICAgICAgICAgICAgbWVtb3J5OiAyNTAwTQogICAgICBwbGFjZW1lbnQ6CiAgICAgICAgY29uc3RyYWludHM6CiAgICAgIC

Unnamed: 0,backup,billing_account,created_at,description,hostname,mac,memory,name,os_name,os_version,private_ipv4,status,storage,updated_at,user_id,username,uuid,vcpu
0,False,1200219732,2024-12-07 04:18:42,,konfersimpimaster,52:54:00:1d:9f:15,2048,konfersi_mpi_master,ubuntu,20.04-lts,10.57.254.2,running,"{'created_at': '2024-12-07 04:18:45', 'name': ...",2024-12-07 04:20:25,25542,konfersiadmin,e03d54e0-af90-40e0-bb5c-0ae0c5a0b47d,2
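
The `cloud_init` field printed above is a JSON document whose `write_files` entries carry base64-encoded file contents. As a minimal sketch using only the standard library, this is how one such entry can be built and decoded again for inspection; the script text and the file name `hello.sh` are invented for the example.

```python
import base64
import json

script = "#!/bin/bash\necho hello from cloud-init\n"  # hypothetical file content

# Build an entry shaped like the ones build_write_file_cloud returns
entry = {
    "path": "/home/ubuntu/hello.sh",
    "content": base64.b64encode(script.encode("utf-8")).decode("utf-8"),
    "permissions": "0755",
    "encoding": "b64",
}
payload = json.dumps({"write_files": [entry], "runcmd": []})

# Decode it again to see what would be written on the node
decoded = json.loads(payload)["write_files"][0]
restored = base64.b64decode(decoded["content"]).decode("utf-8")
print(decoded["path"])
print(restored)
```

Decoding the payload this way is a quick check that the files being shipped to a node are the ones intended, before the VM is created.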


In [11]:
# Get SSH Command
idch_get_instances()

# If you get a "Host key verification failed" error, try clearing known_hosts
# rm ~/.ssh/known_hosts

GET https://api.idcloudhost.com/v1/sgp01/user-resource/vm/list
GET https://api.idcloudhost.com/v1/sgp01/network/ip_addresses


Unnamed: 0,uuid,network_uuid,name,private_ipv4,status,public_ipv4,command
0,e03d54e0-af90-40e0-bb5c-0ae0c5a0b47d,3e99d218-66ad-4ad8-b374-1c4cc1475d02,konfersi_mpi_master,10.57.254.2,running,103.134.154.209,ssh -i keys/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null konfersiadmin@103.134.154.209


In [10]:
# Healthcheck, Speed Check, Post-init process
idch_healthcheck_instance()

GET https://api.idcloudhost.com/v1/sgp01/user-resource/vm/list
GET https://api.idcloudhost.com/v1/sgp01/network/ip_addresses


Checking health:   0%|          | 0/1 [00:01<?, ?instance/s, failed=e03d54e0-af90-40e0-bb5c-0ae0c5a0b47d, retry=2]


KeyboardInterrupt: 
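
The run above was interrupted by hand because the loop retries until every node responds. One possible variation, sketched here on the assumption that giving up after a fixed number of rounds is acceptable, takes the probe as a parameter so it can be exercised without a network. The names `wait_until_healthy`, `probe`, `max_rounds`, and `delay` are illustrative and not part of the notebook.

```python
import time


def wait_until_healthy(targets, probe, max_rounds=30, delay=2.0):
    """Probe each target until it succeeds or max_rounds is exhausted.

    probe(target) should return True when the target is healthy and may raise
    on connection errors. Returns the set of targets that never became healthy.
    """
    pending = set(targets)
    for _ in range(max_rounds):
        # sorted() makes a copy, so discarding from the set while looping is safe
        for target in sorted(pending):
            try:
                if probe(target):
                    pending.discard(target)
            except Exception:
                pass  # treat any error as "not ready yet"
        if not pending:
            break
        time.sleep(delay)
    return pending


# Demo with a fake probe: "10.0.0.2" only answers on its third attempt
attempts = {}


def fake_probe(ip):
    attempts[ip] = attempts.get(ip, 0) + 1
    return ip == "10.0.0.1" or attempts[ip] >= 3


unhealthy = wait_until_healthy(["10.0.0.1", "10.0.0.2"], fake_probe, max_rounds=5, delay=0)
print(unhealthy)
```

With `requests`, a probe for the endpoint used in this notebook could be `lambda ip: requests.get(f"http://{ip}:8181/health", timeout=1).ok`. Any targets left in the returned set can then be reported or retried later instead of blocking the cell indefinitely.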

## Cluster Management

In [None]:
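# Create a worker node (the config key is `resources`; the role is passed explicitly)
idch_build_node(config.worker, config.worker.resources, role="worker", environments={
    "MASTER_NODE_IP": "192.168.0.1"
})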

In [None]:
# Health Checker and Links
idch_healthcheck_instance()

In [None]:
# Stop Cluster

In [15]:
# Call the function to delete the cluster
idch_delete_cluster()


GET https://api.idcloudhost.com/v1/sgp01/user-resource/vm/list
GET https://api.idcloudhost.com/v1/sgp01/network/ip_addresses
Deleting Container: cdbe03b1-fd5a-4960-83e9-ff2a9167400a
DELETE https://api.idcloudhost.com/v1/sgp01/user-resource/vm
{'success': True}
Deleting IP: 103.134.154.209
DELETE https://api.idcloudhost.com/v1/sgp01/network/ip_addresses/103.134.154.209
True
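
`idch_delete_cluster` submits one delete task per node and waits for all of them. A small sketch of the same `concurrent.futures` pattern follows, with failures collected rather than stopping at the first exception. Here `delete_all` is a hypothetical helper and `fake_delete` is a stand-in for the real API call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def delete_all(items, delete_one, max_workers=4):
    """Run delete_one(item) in parallel; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the item it was created for
        futures = {executor.submit(delete_one, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()
                succeeded.append(item)
            except Exception as exc:
                failed.append((item, str(exc)))
    return sorted(succeeded), sorted(failed)


def fake_delete(uuid):
    # Stand-in for the API call: pretend one node cannot be deleted
    if uuid == "node-b":
        raise RuntimeError("still running")


ok, errors = delete_all(["node-a", "node-b", "node-c"], fake_delete)
print(ok)
print(errors)
```

Collecting errors per node makes it easier to see which VMs or IP addresses are still allocated after a teardown.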
