home directory: cluster_doctor
function availability : ./kubectl/functions.py

Now the job-runner.ipynb will perform the following tasks:
1. get_free_node_list()
    - save it to list - get_free_node_list[]
2. get_db_latest_status() 
    - Get latest test results timmestamp from validation.db for all the nodes in the db ( by accessing gcr-admin-pvc-access pod)
    - per node per test - latest timestamp
    - if a node has no test results - mark it with very old timestamp - highest priority
3. build_priority_queue()
    - Combine free nodes list with get_db_latest_status list, and create a priority queue function that takes 
        1. free nodes list
        2. db latest status
        3. Z days threshold
    - Returns priority queue
        1. Filered free nodes only
        2. skip nodes with test results not older than Z days 
        3. order by latest test results timestamps (oldest first - highest priority) 
    - Format of returned "job_priority_queue_list": [ nodename, priority_order, job_submission_status ]
        [
            [node1, 1, True],
            [node2, 2, False],
            ...
        ]
4. batch job submission
   - takes 
        1. batch size: N single node jobs per batch
        2. job queue list from build_priority_queue()
        3. job template yaml file path  ( /home/hari/b200/validation/cluster_doctor/ymls/specific-node-job.yml )
    - for each batch of N nodes
        1. read job template yaml file
        2. edit/ fill in 
            a. node name <node-name>
            b. job name hari-gcr-ceval-<node-name>-<timestamp>
        3. submit job to k8s cluster and repeat N times ( for batch size )
5. monitor job status
    - if a job pending for more than X minutes - cancel the job and update job_submission_status to canceled in job_priority_queue_list
For each node in job queue list
    - Create a job to run cluster-doctor validation tests on that node

6. Job run[Inside Job pod] 
    - git clone cluster_doctor repo to /opt/cluster_doctor
    - Run cluster-doctor tests on the pod/node and collect logs ( STDOUT/ STDERR) using tee
    - Upon completion of tests
        -Collect test results log ( STDOUT/ STDERR) and save it to /data/continuous_validation/<test-name>/<node-name>/<node-name>-<testname>-<timestamp>.log
    - Update validation.db with new test results and timestamp at /data/continuous_validation/metadata/validation.db using /opt/cluster_doctor/kubectl/functions.py/add_result_local()

7. Generate a daily report
    - Summary of nodes tested
    - Summary of test results
    - List of nodes that were never tested
    - Save report to ./gitignored/reports/daily_report_<date>.txt

In [None]:
# home directory: cluster_doctor
# functions availability : kubectl/functions.py
# 1. get free nodes list
#     - save it to list ./gitignored/free.txt
# 2. Get latest test results timmestamp from validation.db for all the free nodes in the free node list ( by accessing gcr-admin-pvc-access pod)
#     - per node per test - latest timestamp
#     - if a node has no test results - mark it as "never tested" - highest priority
# 3. Combine free nodes list with latest test results timestamps create a priority queue
#     - Sort nodes by latest test results timestamps (oldest first - highest priority)
#     - If the test results are not older than Z days - skip that node from the priority Queue
# 4. For each node in job queue list
#     - Create a job to run cluster-doctor validation tests on that node
#     - Submit the job to k8s cluster
#     - Log job submission status

# 5. Monitor job status
#     - if a job pending for more than X minutes - cancel the job and log the event
#     - if a job runs longer than Y minutes - cancel the job
# 6. After job completion ( inside job pod)
#     - Collect test results log
#     - Update validation.db with new test results and timestamps
#     - Log job completion status

# 7. Generate a daily report
#     - Summary of nodes tested
#     - Summary of test results
#     - List of nodes that were never tested
#     - Save report to ./gitignored/reports/daily_report_<date>.txt


# variables
from symtable import Class

# import re # No longer needed

home_dir="/home/hari/b200/validation/cluster_doctor/"
functions_dir=home_dir+"kubectl/"


#Get free nodes list
# /home/hari/b200/validation/cluster_doctor/kubectl/cluster/freenode_list.sh

#create a cluster class

class Cluster:
    def __init__(self, ns):
        self.ns = ns
        #numerical timestamp
        self.timestamp = int(__import__('time').time())
        self.tests=[]
        self.freenode_list = []
    
    def run_cmd(self, cmd):
        import subprocess
        result = subprocess.run([cmd], stdout=subprocess.PIPE, shell=True)
        return result.stdout.decode('utf-8').strip()

    def freenodes(self):
        import subprocess
        # Using the new script that returns just the node names
        cmd = functions_dir+"cluster/freenode_list.sh "+self.ns
        output = self.run_cmd(cmd)
        # Split by newline and filter empty strings
        self.freenode_list = [n for n in output.strip().split('\n') if n]
        print(f"Found {len(self.freenode_list)} free nodes on cluster {self.ns}")
        return self.freenode_list
    

    def latest_test_results(self):
        # Get latest test results for a given node from '/home/hari/b200/validation/cluster_doctor/kubectl/result/status.sh'
        # (base) hari@ADG-2MQ50106PK:~/b200/validation$ '/home/hari/b200/validation/cluster_doctor/kubectl/result/status.sh'
        # Fetching status remotely from pod gcr-admin-pvc-access (ns: gcr-admin)...
        # Warning: Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.
        # node    test    latest_timestamp_num    latest_timestamp        result
        # slc01-cl02-hgx-0001     tes1    1767827268      2026-01-07T23:07:48Z    fail
        # slc01-cl02-hgx-0002     test2   1767827267      2026-01-07T23:07:47Z    fail
        import subprocess
        cmd = functions_dir+"result/status.sh"
        # save output to a pandas dataframe
        result = subprocess.run([cmd], stdout=subprocess.PIPE, shell=True)
        print("Raw output from status.sh:", result.stdout.decode('utf-8').strip())
        output = result.stdout.decode('utf-8').strip().split('\n')
        print(output)
        import pandas as pd
        data = [line.split('\t') for line in output[2:]]  #skip header lines
        df = pd.DataFrame(data, columns=['node', 'test', 'latest_timestamp_num', 'latest_timestamp', 'result'])
        return df   






In [2]:
cluster=Cluster("gcr-admin")
# print(cluster.freenodes())

In [3]:
print(cluster.freenode_list)
print(cluster.timestamp)
# convert timestamp to human readable format
import datetime
readable_time = datetime.datetime.fromtimestamp(cluster.timestamp).strftime('%Y-%m-%d %H:%M:%S')
print("Cluster object created at:", readable_time)


print(cluster.latest_test_results())

[]
1767891267
Cluster object created at: 2026-01-08 08:54:27
Raw output from status.sh: 
['']


/home/hari/b200/validation/cluster_doctor/kubectl/result/status.sh: 4: set: Illegal option -o pipefail


Empty DataFrame
Columns: [node, test, latest_timestamp_num, latest_timestamp, result]
Index: []
