home directory: cluster_doctor
functions available in: ./kubectl/functions.py

job-runner.ipynb performs the following tasks:
1. get_free_node_list()
    - save the result to a list - get_free_node_list[]
2. get_db_latest_status()
    - Get the latest test result timestamps from validation.db for all nodes in the db (by accessing the gcr-admin-pvc-access pod)
    - per node, per test - latest timestamp
    - if a node has no test results - mark it with a very old timestamp - highest priority
3. build_priority_queue()
    - Combine the free nodes list with the get_db_latest_status list in a priority queue function that takes
        1. free nodes list
        2. db latest status
        3. Z days threshold
    - Returns priority queue
        1. Filtered to free nodes only
        2. skip nodes whose latest test results are not older than Z days
        3. order by latest test result timestamp (oldest first - highest priority)
    - Format of returned "job_priority_queue_list": [ nodename, priority_order, job_submission_status ]
        [
            [node1, 1, True],
            [node2, 2, False],
            ...
        ]
4. batch job submission
    - takes
        1. batch size: N single-node jobs per batch
        2. job queue list from build_priority_queue()
        3. job template yaml file path (/home/hari/b200/validation/cluster_doctor/ymls/specific-node-job.yml)
    - for each batch of N nodes
        1. read job template yaml file
        2. edit / fill in
            a. node name <node-name>
            b. job name hari-gcr-ceval-<node-name>-<timestamp>
        3. submit job to k8s cluster and repeat N times (for batch size)
5. monitor job status
    - if a job is pending for more than X minutes - cancel the job and update job_submission_status to canceled in job_priority_queue_list
For each node in the job queue list
    - Create a job to run cluster-doctor validation tests on that node

6. Job run [inside job pod]
    - git clone cluster_doctor repo to /opt/cluster_doctor
    - Run cluster-doctor tests on the pod/node and collect logs (STDOUT/STDERR) using tee
    - Upon completion of tests
        - Collect the test results log (STDOUT/STDERR) and save it to /data/continuous_validation/<test-name>/<node-name>/<node-name>-<testname>-<timestamp>.log
    - Update validation.db at /data/continuous_validation/metadata/validation.db with the new test results and timestamp, using add_result_local() from /opt/cluster_doctor/kubectl/functions.py

7. Generate a daily report
    - Summary of nodes tested
    - Summary of test results
    - List of nodes that were never tested
    - Save report to ./gitignored/reports/daily_report_<date>.txt
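
Step 3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the contents of functions.py: it assumes "db latest status" is a dict mapping node name to the Unix timestamp of that node's most recent test (the structure returned by parse_db_status_output() may differ, e.g. per node per test), and the name sketch_priority_queue is hypothetical.

```python
import time

SECONDS_PER_DAY = 86400


def sketch_priority_queue(free_nodes, db_latest, days_threshold, now=None):
    """Hypothetical helper: build [nodename, priority_order, job_submission_status] rows.

    free_nodes: list of node names that are currently free.
    db_latest:  assumed shape {node_name: latest_test_unix_timestamp}.
    """
    now = time.time() if now is None else now
    cutoff = now - days_threshold * SECONDS_PER_DAY

    candidates = []
    for node in free_nodes:
        # Nodes with no results get timestamp 0 (very old), i.e. highest priority.
        last_tested = db_latest.get(node, 0)
        # Skip nodes whose latest result is not older than the threshold.
        if last_tested >= cutoff:
            continue
        candidates.append((last_tested, node))

    # Oldest timestamp first; ties are broken by node name for a stable order.
    candidates.sort()
    return [[node, rank, False] for rank, (_, node) in enumerate(candidates, start=1)]
```

Passing `now` explicitly keeps the function deterministic, which makes the threshold logic easy to check with fixed timestamps.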

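Step 7 (daily report) is not implemented in the cells below. One way it could look is sketched here, under stated assumptions: write_daily_report and its parameters are hypothetical, each result is assumed to be a (node, test, result) tuple, and <date> in the file name is assumed to be an ISO date (YYYY-MM-DD).

```python
import datetime
import os


def write_daily_report(results, db_nodes, all_nodes, report_dir, date=None):
    """Hypothetical helper: write a plain-text daily report and return its path.

    results:   assumed shape, list of (node, test, result) tuples for the day.
    db_nodes:  node names that have at least one record in validation.db.
    all_nodes: every node name known to the cluster.
    """
    date = date or datetime.date.today()
    tested = sorted({node for node, _, _ in results})
    never_tested = sorted(set(all_nodes) - set(db_nodes))

    # Count outcomes per result label (for example "pass" / "fail").
    counts = {}
    for _, _, result in results:
        counts[result] = counts.get(result, 0) + 1

    lines = [f"Daily report {date.isoformat()}", f"Nodes tested: {len(tested)}"]
    lines += [f"  {node}" for node in tested]
    lines.append("Test results:")
    lines += [f"  {label}: {count}" for label, count in sorted(counts.items())]
    lines.append(f"Nodes never tested: {len(never_tested)}")
    lines += [f"  {node}" for node in never_tested]

    os.makedirs(report_dir, exist_ok=True)
    path = os.path.join(report_dir, f"daily_report_{date.isoformat()}.txt")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```

With the plan's layout, report_dir would be ./gitignored/reports.
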
In [1]:
import sys
import os
import time
import datetime

# Add the current working directory to sys.path so the local utils package can be imported.
# (__file__ is not defined in a notebook, so the working directory is used instead.)
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.append(current_dir)

# Import the utility functions
try:
    import utils.functions as functions
except ImportError:
    # Fallback if running from a different context
    sys.path.append("/home/hari/b200/validation/cluster_doctor")
    import utils.functions as functions

home_dir = "/home/hari/b200/validation/cluster_doctor/"

class Cluster:
    def __init__(self, ns="gcr-admin"):
        self.ns = ns
        # numerical timestamp
        self.timestamp = int(time.time())
        self.freenode_list = []
        self.db_status = {}
        self.job_queue = []
        self.template_path = os.path.join(home_dir, "ymls/specific-node-job.yml")
        
    def refresh_state(self):
        """
        Step 1 & 2: Get free nodes and latest DB status.
        """
        print(f"[{datetime.datetime.now().time()}] Refreshing cluster state...")
        
        # 1. Get Free Node List
        self.freenode_list = functions.get_free_node_list()
        print(f"  Found {len(self.freenode_list)} free nodes (fully available).")
        
        # 2. Get DB Latest Status
        print("  Fetching DB status from cluster...")
        try:
            db_output = functions.get_db_latest_status(namespace=self.ns)
            self.db_status = functions.parse_db_status_output(db_output)
            print(f"  Retrieved status for {len(self.db_status)} nodes from DB.")
        except Exception as e:
            print(f"  Error fetching DB status: {e}")
            self.db_status = {}
            
    def build_priority_queue(self, days_threshold=7):
        """
        Step 3: Build a priority queue filtering free nodes by age of last test.
        """
        if not self.freenode_list:
            print("No free nodes to queue.")
            self.job_queue = []
            return []

        print(f"[{datetime.datetime.now().time()}] Building priority queue (Threshold: {days_threshold} days)...")
        self.job_queue = functions.build_priority_queue(
            self.freenode_list, 
            self.db_status, 
            days_threshold=days_threshold
        )
        
        print(f"  Queue built: {len(self.job_queue)} job candidates.")
        return self.job_queue

    def run_batch(self, batch_size=5, dry_run=False):
        """
        Step 4: Submit a batch of jobs from the queue.
        """
        if not self.job_queue:
            print("Job queue is empty.")
            return

        print(f"[{datetime.datetime.now().time()}] Processing batch (Size: {batch_size})...")
        
        # Filter for jobs not yet submitted (using the 3rd element in the list [node, prio, status])
        pending_jobs = [j for j in self.job_queue if not j[2]]
        
        if not pending_jobs:
            print("  No pending jobs in queue.")
            return

        # Load template
        if not os.path.exists(self.template_path):
            print(f"  Error: Template not found at {self.template_path}")
            return
            
        with open(self.template_path, 'r') as f:
            template_content = f.read()

        jobs_in_batch = 0
        for job_info in pending_jobs:
            if jobs_in_batch >= batch_size:
                break
                
            node_name = job_info[0]
            # priority = job_info[1]
            
            # Create Job Name
            ts = int(time.time())
            job_name = f"hari-gcr-ceval-{node_name}-{ts}"
            
            # Prepare YAML
            # Replacing specific placeholder from the known yaml (slc01-cl02-hgx-0460)
            job_yaml = template_content.replace("slc01-cl02-hgx-0460", node_name)
            
            # Replace generateName with specific name to match the requirement
            if "generateName: hari-gcr-admin-bonete-test-" in job_yaml:
                job_yaml = job_yaml.replace("generateName: hari-gcr-admin-bonete-test-", f"name: {job_name}")
            
            print(f"  > Target: {node_name} | Job: {job_name}")
            
            if dry_run:
                print("    [Dry Run] Job would be submitted.")
                job_info[2] = True  # Mark as submitted (mock) so the next batch moves on
                jobs_in_batch += 1
                continue
                
            # Create Temp File
            temp_path = f"/tmp/{job_name}.yaml"
            with open(temp_path, 'w') as temp_f:
                temp_f.write(job_yaml)
                
            # Submit
            try:
                out = functions.create_job(temp_path)
                print(f"    Submitted: {out.strip()}")
                job_info[2] = True # Mark as submitted
                jobs_in_batch += 1
            except Exception as e:
                print(f"    Failed to submit: {e}")
            finally:
                if os.path.exists(temp_path):
                    os.remove(temp_path)
                    
        print(f"Batch execution complete. {jobs_in_batch} jobs submitted.")

    def latest_test_results(self):
        """Return the latest test status per node, as loaded from the DB."""
        return self.db_status

    def freenodes(self):
        """Return the cached list of free nodes."""
        return self.freenode_list
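
The class above covers steps 1 to 4 of the plan; step 5 (cancel jobs pending for more than X minutes) is not part of it. The selection logic could be sketched as below. Both helper names and the job record shape are assumptions for illustration, and how job phase and submission time are obtained from the cluster is left out.

```python
import time


def find_stale_pending(jobs, max_pending_minutes, now=None):
    """Hypothetical helper: return the jobs that have been pending too long.

    jobs: assumed shape, list of dicts with keys "name", "node",
          "phase" (e.g. "Pending", "Running") and "submitted_at" (Unix seconds).
    """
    now = time.time() if now is None else now
    limit_seconds = max_pending_minutes * 60
    return [
        job
        for job in jobs
        if job["phase"] == "Pending" and now - job["submitted_at"] > limit_seconds
    ]


def mark_canceled(job_queue, canceled_nodes):
    """Set job_submission_status to "canceled" for the given node names (in place)."""
    for row in job_queue:  # row is [nodename, priority_order, job_submission_status]
        if row[0] in canceled_nodes:
            row[2] = "canceled"
    return job_queue
```

Because the string "canceled" is truthy, the `not j[2]` filter in run_batch() would not resubmit a canceled node in a later batch; whether that is the desired behavior is a design choice. The cancellation itself could be done with `kubectl delete job <job-name> -n <namespace>`.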

In [None]:
cluster = Cluster("gcr-admin")
cluster.refresh_state()
print(f"Free Nodes: {len(cluster.freenodes())}")
print(f"DB Records: {len(cluster.latest_test_results())}")

In [None]:
# Build Queue (e.g. nodes not tested in last 2 days)
queue = cluster.build_priority_queue(days_threshold=2)

print("\nTop 10 Jobs in Queue:")
for item in queue[:10]:
    print(item)

[]
1767891267
Cluster object created at: 2026-01-08 08:54:27
Raw output from status.sh: 
['']


/home/hari/b200/validation/cluster_doctor/kubectl/result/status.sh: 4: set: Illegal option -o pipefail


Empty DataFrame
Columns: [node, test, latest_timestamp_num, latest_timestamp, result]
Index: []
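
The `set: Illegal option -o pipefail` message in the output above is typically printed when a script written for bash is run by a POSIX sh that does not support pipefail (for example dash, the default /bin/sh on Debian and Ubuntu). That would explain the empty status output and the empty DataFrame. One possible fix, sketched below, is to invoke the script through bash explicitly. How functions.py launches status.sh is not shown here, so run_with_bash is a hypothetical helper.

```python
import subprocess


def run_with_bash(script_path, *args):
    """Hypothetical helper: run a script under bash and return its stdout.

    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    completed = subprocess.run(
        ["bash", script_path, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Giving status.sh a `#!/usr/bin/env bash` first line and executing it directly (rather than through `sh status.sh`) would be an alternative.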
