### Notebook to demonstrate Auto-Labeling workflow

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for use on a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### The workflow in a nutshell

- Pulling datasets from cloud
- Getting a PTM from NGC
- Model Actions
    - Train (Normal/AutoML)
    - Evaluate
    - Inference
    - Delete jobs/datasets

### Table of contents

1. [FIXME's](#head-1)
1. [Login](#head-2)
1. [Create a cloud workspace](#head-2)
1. [Create and pull train dataset](#head-3)
1. [Create and pull val dataset](#head-4)
1. [List the created datasets](#head-5)
1. [Set common params across all jobs](#head-6)
1. [Assign PTM](#head-9)
1. [Actions](#head-11)
1. [Train](#head-12)
1. [Evaluate](#head-13)
1. [TAO inference](#head-14)
1. [Delete jobs](#head-22)
1. [Delete dataset](#head-16)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import json
import os
import time
from IPython.display import clear_output

# Import TAO SDK
from tao_sdk.client import TaoClient

In [None]:
# Restore variables after a Jupyter session restart so that execution can resume where it left off
%store -r model_name
%store -r base_url
%store -r headers
%store -r workspace_id
%store -r train_dataset_id
%store -r eval_dataset_id
%store -r experiment_id
%store -r job_map

### To see the dataset folder structure required for the models supported in this notebook, see the notebooks under dataset_prepare, for example [this notebook](../dataset_prepare/auto_labeling.ipynb)

### FIXME's <a class="anchor" id="head-1"></a>

1. (Optional) Enable AutoML if needed in FIXME 1
1. (Optional) Choose between bayesian and hyperband automl_algorithm in FIXME 2 (if AutoML was enabled in FIXME 1)
1. Assign the ip_address and port_number in FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
1. Assign the ngc_key variable in FIXME 4
1. Assign the ngc_org_name variable in FIXME 5
1. Set cloud storage details in FIXME 6
1. Assign path of datasets relative to the bucket in FIXME 7
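
The configuration cells in this notebook all follow one pattern: read a setting from the environment, fall back to a placeholder default, and write the resolved value back to `os.environ`. The sketch below wraps that pattern in a helper; `env_or_default` is a hypothetical name for illustration and is not part of the TAO SDK.

```python
import os


def env_or_default(name: str, default: str) -> str:
    """Return the value of environment variable `name`, or `default` if unset.

    The resolved value is written back to os.environ so that later cells
    (and any child processes) observe the same setting.
    """
    value = os.environ.get(name, default)
    os.environ[name] = value
    return value


# Same effect as:
# os.environ["TAO_MODEL_NAME"] = model_name = os.environ.get("TAO_MODEL_NAME", "mal")
model_name = env_or_default("TAO_MODEL_NAME", "mal")
print(model_name)
```

Exporting a variable before launching Jupyter overrides the default without editing the notebook.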

In [None]:
os.environ["TAO_MODEL_NAME"] = model_name = os.environ.get("TAO_MODEL_NAME", "mal")
%store model_name

#### Set API service's host information

In [None]:
# FIXME 3: Set TAO API environment variables (ip_address and port_number)

# Set to your TAO API endpoint
os.environ["TAO_BASE_URL"] = os.environ.get("TAO_BASE_URL", "https://your_tao_ip_address:port/api/v2")

#### Set NGC Personal key for authentication and NGC org to access API services

In [None]:
# FIXME 4: Your NGC personal key
os.environ["NGC_KEY"] = ngc_key = os.environ.get("NGC_KEY", "your_ngc_key")

In [None]:
# FIXME 5: Your NGC ORG name
os.environ["NGC_ORG"] = ngc_org_name = os.environ.get("NGC_ORG", "nvstaging")

### Login <a class="anchor" id="head-2"></a>

In [None]:
# Initialize TAO Client and login using SDK
tao_client = TaoClient()

# Login using TAO SDK - this will automatically save credentials to environment variables
login_response = tao_client.login(
    ngc_key=ngc_key,
    ngc_org_name=ngc_org_name,
    enable_telemetry=True
)

print("Login successful!")
print("JWT Token:", tao_client.token)
print("API Base URL:", tao_client.base_url)
print("Organization:", tao_client.org_name)

%store tao_client

### Get NVCF gpu details <a class="anchor" id="head-2"></a>

One of the keys of the response JSON is to be used as the `platform_id` when you run each job.
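
As a rough sketch of how a `platform_id` could be chosen from such a response, assuming only that it is a dict keyed by platform ID: the helper name `pick_platform_id`, the sample IDs, and the `gpu_type` field are all made up for illustration, and the real response may look different.

```python
import json
from typing import Optional


def pick_platform_id(gpu_types: dict, keyword: Optional[str] = None) -> str:
    """Pick a platform_id from a response whose keys are platform IDs.

    If `keyword` is given, return the first key whose details mention it
    (case-insensitive); otherwise return the first key.
    """
    if not gpu_types:
        raise ValueError("No GPU types available")
    if keyword is not None:
        for platform_id, details in gpu_types.items():
            if keyword.lower() in json.dumps(details).lower():
                return platform_id
        raise ValueError(f"No platform matches {keyword!r}")
    return next(iter(gpu_types))


# Hypothetical response: the IDs and the "gpu_type" field are invented.
sample_gpu_types = {
    "platform-a": {"gpu_type": "L40"},
    "platform-b": {"gpu_type": "A100"},
}
platform_id = pick_platform_id(sample_gpu_types, keyword="a100")
print(platform_id)
```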

In [None]:
# # Currently valid only for the NVCF backend in a TAO API Helm deployment
# # Get available GPU types using TAO SDK
# try:
#     gpu_types = tao_client.get_gpu_types()
#     print("Available GPU types:")
#     print(json.dumps(gpu_types, indent=4))
# except Exception as e:
#     print("Could not fetch GPU types (may not be available on this deployment):", str(e))

### Create cloud workspace
This workspace is where your datasets reside and where the results of TAO API jobs are pushed.

If you want separate workspaces for datasets and experiments, duplicate the workspace creation part and adjust the metadata accordingly.
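
If you do split datasets and experiments across workspaces, a small helper keeps the two metadata dicts consistent. This is a minimal sketch: `build_workspace_metadata` is a hypothetical helper, the dict keys mirror the ones used in this notebook's workspace cell, and all values are placeholders.

```python
def build_workspace_metadata(name, cloud_type, region, bucket, access_key, secret_key):
    """Return a metadata dict with the same keys the workspace cell fills in."""
    return {
        "name": name,
        "cloud_specific_details": {
            "cloud_type": cloud_type,
            "cloud_region": region,
            "cloud_bucket_name": bucket,
            "access_key": access_key,
            "secret_key": secret_key,
        },
    }


# Placeholder values; replace them with your own bucket details.
dataset_workspace = build_workspace_metadata(
    "Dataset workspace", "aws", "us-west-1", "dataset_bucket", "access_key", "secret_key"
)
experiment_workspace = build_workspace_metadata(
    "Experiment workspace", "aws", "us-west-1", "results_bucket", "access_key", "secret_key"
)
print(dataset_workspace["cloud_specific_details"]["cloud_bucket_name"])
print(experiment_workspace["cloud_specific_details"]["cloud_bucket_name"])
```

Each dict can then be passed to `tao_client.create_workspace` in the same way as the single `cloud_metadata` dict in this section.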

In [None]:
# FIXME 6: Cloud bucket details used to download datasets and push job artifacts

cloud_metadata = {}

# A representative name for this cloud info
os.environ["TAO_WORKSPACE_NAME"] = cloud_metadata["name"] = os.environ.get("TAO_WORKSPACE_NAME", "AWS workspace info")

# Cloud specific details. Below is assuming AWS.
cloud_metadata["cloud_specific_details"] = {}

# Whether it is AWS, HuggingFace or Azure
os.environ["TAO_WORKSPACE_CLOUD_TYPE"] = cloud_metadata["cloud_specific_details"]["cloud_type"] = os.environ.get("TAO_WORKSPACE_CLOUD_TYPE", "aws")

# Bucket region
os.environ["TAO_WORKSPACE_CLOUD_REGION"] = cloud_metadata["cloud_specific_details"]["cloud_region"] = os.environ.get("TAO_WORKSPACE_CLOUD_REGION", "us-west-1")

# Bucket name
os.environ["TAO_WORKSPACE_CLOUD_BUCKET_NAME"] = cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = os.environ.get("TAO_WORKSPACE_CLOUD_BUCKET_NAME", "bucket_name")

# Access and Secret keys
os.environ["TAO_WORKSPACE_CLOUD_ACCESS_KEY"] = cloud_metadata["cloud_specific_details"]["access_key"] = os.environ.get("TAO_WORKSPACE_CLOUD_ACCESS_KEY", "access_key")
os.environ["TAO_WORKSPACE_CLOUD_SECRET_KEY"] = cloud_metadata["cloud_specific_details"]["secret_key"] = os.environ.get("TAO_WORKSPACE_CLOUD_SECRET_KEY", "secret_key")

In [None]:
# Create cloud workspace using TAO SDK
workspace_id = tao_client.create_workspace(
    name=cloud_metadata["name"],
    cloud_type=cloud_metadata["cloud_specific_details"]["cloud_type"],
    cloud_specific_details=cloud_metadata["cloud_specific_details"]
)

print("Workspace created successfully!")
print(f"Workspace ID: {workspace_id}")

# Get workspace details to confirm creation
workspace_details = tao_client.get_workspace_metadata(workspace_id)
print("Workspace details:")
print(json.dumps(workspace_details, indent=4))

%store workspace_id

#### Set dataset path (path within cloud bucket)

In [None]:
# FIXME 7: Set paths relative to the cloud bucket
os.environ["TAO_TRAIN_DATASET_PATH"] = train_dataset_path = os.environ.get("TAO_TRAIN_DATASET_PATH", "/data/auto_label_train")
os.environ["TAO_EVAL_DATASET_PATH"] = eval_dataset_path = os.environ.get("TAO_EVAL_DATASET_PATH", "/data/auto_label_val")

### Create and pull train dataset <a class="anchor" id="head-3"></a>

In [None]:
# Create train dataset
ds_type = "segmentation"
ds_format = "default"

# Create train dataset using TAO SDK
train_dataset_id = tao_client.create_dataset(
    dataset_type=ds_type,
    dataset_format=ds_format,
    workspace_id=workspace_id,
    cloud_file_path=train_dataset_path,
    use_for=["training"]
)

print("Train dataset created successfully!")
print(f"Train Dataset ID: {train_dataset_id}")

%store train_dataset_id

In [None]:
# Check train dataset progress using TAO SDK
while True:
    clear_output(wait=True)
    dataset_details = tao_client.get_dataset_metadata(train_dataset_id)
    
    print(f" Train Dataset Status: {dataset_details.get('status', 'Unknown')}")
    print(f"Dataset ID: {train_dataset_id}")
    
    if dataset_details.get("status") == "invalid_pull":
        print("Dataset pull failed!")
        validation_details = dataset_details.get("validation_details", {})
        if validation_details:
            print("Validation details:")
            print(json.dumps(validation_details, indent=4))
        raise ValueError("Dataset pull failed")
        
    if dataset_details.get("status") == "pull_complete":
        print("Train dataset pull completed successfully!")
        print("Dataset details:")
        print(json.dumps(dataset_details, indent=4))
        break
        
    time.sleep(5)

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This helper packages creating a data-services experiment and running the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     train_dataset_id = remove_corrupted_images_workflow(base_url, headers, workspace_id, train_dataset_id)
#     %store train_dataset_id
# except Exception as e:
#     raise e

### Create and pull val dataset <a class="anchor" id="head-4"></a>

In [None]:
# Create eval dataset using TAO SDK
eval_dataset_id = tao_client.create_dataset(
    dataset_type=ds_type,
    dataset_format=ds_format,
    workspace_id=workspace_id,
    cloud_file_path=eval_dataset_path,
    use_for=["evaluation"]
)

print("Eval dataset created successfully!")
print(f"Eval Dataset ID: {eval_dataset_id}")

%store eval_dataset_id

In [None]:
# Check eval dataset progress using TAO SDK
while True:
    clear_output(wait=True)
    dataset_details = tao_client.get_dataset_metadata(eval_dataset_id)
    
    print(f" Eval Dataset Status: {dataset_details.get('status', 'Unknown')}")
    print(f"Dataset ID: {eval_dataset_id}")
    
    if dataset_details.get("status") == "invalid_pull":
        print("Dataset pull failed!")
        validation_details = dataset_details.get("validation_details", {})
        if validation_details:
            print("Validation details:")
            print(json.dumps(validation_details, indent=4))
        raise ValueError("Dataset pull failed")
        
    if dataset_details.get("status") == "pull_complete":
        print("Eval dataset pull completed successfully!")
        print("Dataset details:")
        print(json.dumps(dataset_details, indent=4))
        break
        
    time.sleep(5)

#### Uncomment if you want to remove corrupted images in your dataset

In [None]:
# # This helper packages creating a data-services experiment and running the job that removes corrupted images
# try:
#     from remove_corrupted_images import remove_corrupted_images_workflow
#     eval_dataset_id = remove_corrupted_images_workflow(base_url, headers, workspace_id, eval_dataset_id)
#     %store eval_dataset_id
# except Exception as e:
#     raise e

### List the created datasets <a class="anchor" id="head-5"></a>

In [None]:
# List datasets using TAO SDK
datasets = tao_client.list_datasets()

print("Available datasets:")
print("id\t\t\t\t\t type\t\t\t format\t\t name")
print("-" * 100)

for dataset in datasets:
    dataset_id = dataset.get("id", "N/A")
    dataset_type = dataset.get("type", "N/A")
    dataset_format = dataset.get("format", "N/A")
    dataset_name = dataset.get("name", "N/A")
    print(f"{dataset_id}\t{dataset_type}\t{dataset_format}\t\t{dataset_name}")
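
Tab-separated columns drift out of alignment when IDs or names vary in length. As an optional alternative, here is a sketch that pads each column to its widest entry; `format_rows` is a hypothetical helper and the sample records are invented, reusing only the `id`, `type`, `format`, and `name` keys read above.

```python
def format_rows(rows, columns):
    """Render a list of dicts as left-aligned, fixed-width text columns."""
    # The header row is included so column widths account for the titles too.
    table = [list(columns)]
    table += [[str(row.get(col, "N/A")) for col in columns] for row in rows]
    widths = [max(len(line[i]) for line in table) for i in range(len(columns))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in table
    ]
    # Separator spans every column plus the two-space gaps between them.
    lines.insert(1, "-" * (sum(widths) + 2 * (len(columns) - 1)))
    return "\n".join(lines)


# Invented records for illustration; real ones come from tao_client.list_datasets().
sample_datasets = [
    {"id": "1234", "type": "segmentation", "format": "default", "name": "train"},
    {"id": "5678", "type": "segmentation", "format": "default"},
]
print(format_rows(sample_datasets, ["id", "type", "format", "name"]))
```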

### Set common params across all jobs <a class="anchor" id="head-6"></a>

In [None]:
# These parameters are common to all jobs and will be used when creating the actual job:
encode_key = "tlt_encode"

### Assign PTM <a class="anchor" id="head-9"></a>

Search NGC for the PTM that matches the chosen segmentation model.
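
The selection cells in this section pick the PTM whose `ngc_path` ends with the value configured in `pretrained_map`. A minimal sketch of that suffix match as a standalone function follows; `find_ptm_id` and the sample records are hypothetical, and only the `id` and `ngc_path` keys come from this notebook.

```python
from typing import Optional


def find_ptm_id(base_experiments: list, ngc_path_suffix: str) -> Optional[str]:
    """Return the id of the first base experiment whose ngc_path ends with the suffix."""
    for exp in base_experiments:
        if exp.get("ngc_path", "").endswith(ngc_path_suffix):
            return exp.get("id")
    return None


# Hypothetical records: the ids and the path prefixes are invented.
sample_experiments = [
    {"id": "ptm-1", "ngc_path": "some_org/some_team/other_model:v1.0"},
    {"id": "ptm-2", "ngc_path": "some_org/some_team/mask_auto_label:trainable_v1.0"},
]
print(find_ptm_id(sample_experiments, "mask_auto_label:trainable_v1.0"))
```

Returning `None` when nothing matches lets the caller decide whether to stop or to train without a PTM.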

In [None]:
# List base experiments (PTMs) using TAO SDK  
# These are the pre-trained models available for the selected network architecture
base_experiments = tao_client.list_base_experiments(filter_params={"network_arch": model_name})

print(f" Available base experiments (PTMs) for {model_name}:")
print("name\t\t\t     model id\t\t\t     network architecture")
print("-" * 120)

for exp in base_experiments:
    exp_name = exp.get("name", "N/A")
    exp_id = exp.get("id", "N/A")
    exp_arch = exp.get("network_arch", "N/A")
    print(f"{exp_name}\t{exp_id}\t{exp_arch}")

In [None]:
# Map each network to its default pretrained model (PTM)
# To change the default PTM backbone, edit this map using the output of the previous cell.
# Changing the backbone also requires updating the default spec/config used during train/eval etc.
# For example, if you change the PTM to resnet34, manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"mal": "mask_auto_label:trainable_v1.0"}
no_ptm_models = set()

In [None]:
# Get pretrained model using TAO SDK
selected_ptm_id = None
if model_name not in no_ptm_models:
    base_experiments_detailed = tao_client.list_base_experiments(filter_params={"network_arch": model_name})
    
    # Search for PTM with given NGC path
    for exp in base_experiments_detailed:
        ngc_path = exp.get("ngc_path", "")
        if ngc_path.endswith(pretrained_map[model_name]):
            selected_ptm_id = exp.get("id")
            print("Selected PTM metadata:")
            print(json.dumps(exp, indent=4))
            break
    
    if not selected_ptm_id:
        print(f" PTM with NGC path ending in '{pretrained_map[model_name]}' not found!")

In [None]:
# PTM assignment happens during job creation
# The selected PTM ID will be used in the job creation step
if model_name not in no_ptm_models and selected_ptm_id:
    print(f" PTM ID {selected_ptm_id} will be used as base_experiment_id in job creation")
else:
    print("No PTM will be used (training from scratch)")

### Actions <a class="anchor" id="head-11"></a>

For all actions:
1. Get default spec schema and derive the default values
1. Modify defaults if needed
1. Run model action with modified specs
1. Monitor job using retrieve
1. Download results using job download endpoint (if needed)
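
Step 4 above is a polling loop. The monitoring cells in this notebook poll indefinitely; a sketch of the same idea with a timeout is shown here. `wait_for_status` is a hypothetical helper, the `Pending` and `Running` statuses are assumed for illustration, and the fake status source stands in for a call such as `tao_client.get_job_metadata(job_id).get("status")`.

```python
import time


def wait_for_status(fetch_status, done, failed, interval=15.0, timeout=3600.0):
    """Poll fetch_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in failed:
            raise RuntimeError(f"Job ended with status {status!r}")
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in status source; the intermediate status names are assumptions.
fake_statuses = iter(["Pending", "Running", "Done"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    done={"Done", "Completed"},
    failed={"Error"},
    interval=0.0,
)
print(final_status)
```

Raising on failure and on timeout keeps a stuck job from blocking the notebook forever.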

In [None]:
job_map = {}

### Train <a class="anchor" id="head-12"></a>

In [None]:
# Get default train specs using TAO SDK
train_spec_response = tao_client.get_job_schema(action="train", network_arch=model_name)
train_specs = train_spec_response.get("default", {})
print("Default train specifications:")
print(json.dumps(train_specs, sort_keys=True, indent=4))

In [None]:
# Override any of the parameters listed in the previous cell as required
train_specs["train"]["num_gpus"] = 1
train_specs["train"]["gpu_ids"] = [0]
train_specs["train"]["num_epochs"] = 5
train_specs["train"]["checkpoint_interval"] = 5
train_specs["train"]["validation_interval"] = 5
print(json.dumps(train_specs, sort_keys=True, indent=4))
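
Assigning keys one at a time mutates the default spec in place. If you would rather keep the defaults untouched and describe the changes as one nested dict, a merge helper along these lines may help. This is a sketch: `apply_overrides` is a hypothetical helper and the sample spec values are invented; real defaults come from `get_job_schema`.

```python
import copy


def apply_overrides(specs: dict, overrides: dict) -> dict:
    """Return a deep copy of specs with nested overrides merged in."""
    merged = copy.deepcopy(specs)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys that are not overridden are preserved.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented default spec for illustration only.
default_specs = {"train": {"num_gpus": 8, "num_epochs": 100, "seed": 1}}
train_overrides = {"train": {"num_gpus": 1, "num_epochs": 5}}
new_specs = apply_overrides(default_specs, train_overrides)
print(new_specs)
```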

In [None]:
# Create train job using SDK

job_name = f"{model_name}_training_job"

# Prepare job creation parameters
job_params = {
    "kind": "experiment",
    "name": job_name,
    "network_arch": model_name,
    "encryption_key": encode_key,
    "workspace": workspace_id,
    "action": "train",
    "specs": train_specs,  # Pass as dict, not JSON string
    "base_experiment_ids": [selected_ptm_id] if selected_ptm_id else None,
    "train_datasets": [train_dataset_id] if train_dataset_id else None,
    "eval_dataset": eval_dataset_id,
    # "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Optional: Pick from get_gpu_types output
}

# Create experiment job using TAO SDK interface
job_id = tao_client.create_job(**job_params)

print("Train job created successfully!")
print(f"Job ID: {job_id}")
print(f"Job Name: {job_name}")
print(f"Network Architecture: {model_name}")
print("Action: train")

job_map["train_" + model_name] = job_id
print("\nJob Map:")
print(json.dumps(job_map, indent=4))
%store job_map

In [None]:
# Monitor job status using TAO SDK
# For AutoML: training times for different models, benchmarked on a single V100 GPU, can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

train_job_id = job_map["train_" + model_name]

while True:
    clear_output(wait=True)
    
    try:
        job_status = tao_client.get_job_metadata(train_job_id)
        
        print("Training Job Status")
        print(f"Job ID: {train_job_id}")
        print(f"Status: {job_status.get('status', 'Unknown')}")
        print(f"Progress: {job_status.get('progress', 'N/A')}")
        
        # Show detailed status information
        print("\nDetailed Status:")
        print(json.dumps(job_status.get("job_details", {}), sort_keys=True, indent=4))
        
        current_status = job_status.get("status", "Unknown")
        
        if current_status == "Error":
            raise Exception("Training job failed!")
            
        if current_status in ["Done", "Completed"]:
            print("Job completed successfully!")
            break
            
        if current_status in ["Canceled", "Paused"]:
            print(f" Job {current_status}")
            break
            
    except Exception as e:
        if "failed" in str(e).lower():
            raise
        print(f" Error fetching job status: {str(e)}")
        print("Job might still be starting up...")
        
    time.sleep(15)

In [None]:
## To stop an AutoML job
#    1. Manually stop the 'Monitor job status' cell (the cell right before this one)
#    2. Uncomment the snippet in the next cell and run it

In [None]:
# # Pause AutoML job using TAO SDK
# if automl_enabled:
#     train_job_id = job_map["train_" + model_name]
#     try:
#         pause_result = tao_client.pause_job(train_job_id)
#         print("Job paused successfully!")
#         print(json.dumps(pause_result, indent=4))
#     except Exception as e:
#         print(f" Failed to pause job: {str(e)}")

In [None]:
## Resume AutoML

In [None]:
# # Resume AutoML job using TAO SDK
# # Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status' cell above
# if automl_enabled:
#     train_job_id = job_map["train_" + model_name]
#     try:
#         resume_result = tao_client.resume_job(
#             job_id=train_job_id,
#             parent_job_id=None,
#             specs=json.dumps(train_specs)
#         )
#         print("Job resumed successfully!")
#         print(json.dumps(resume_result, indent=4))
#     except Exception as e:
#         print(f" Failed to resume job: {str(e)}")

### Publish model

#### Edit the method of choosing checkpoint from list of train checkpoint files

In [None]:
# Model handler parameters are managed differently
# Checkpoint selection is handled during job creation rather than experiment-level settings
# For now, we'll use the default checkpoint selection method
print("In TAO, checkpoint selection is managed per-job rather than per-experiment")
print("Using default checkpoint selection method: best_model")

update_checkpoint_choosing = {
    "checkpoint_choose_method": "best_model",
    "checkpoint_epoch_number": {}
}
print("Current checkpoint choosing configuration:")
print(json.dumps(update_checkpoint_choosing, indent=4))

In [None]:
# Checkpoint method configuration
# Checkpoint selection is handled per-job, not per-experiment
# You can configure this when creating export/inference jobs if needed

# Example: Change checkpoint selection method for future jobs
update_checkpoint_choosing["checkpoint_choose_method"] = "latest_model"  # Choose between best_model/latest_model/from_epoch_number
# Note: If from_epoch_number is chosen, you would specify the epoch in job creation specs

print("Checkpoint selection configuration updated:")
print(f"Method: {update_checkpoint_choosing['checkpoint_choose_method']}")
print("This will be applied to future job creations")
print(json.dumps(update_checkpoint_choosing, sort_keys=True, indent=4))

updated_job = tao_client.update_job(job_id=job_map["train_" + model_name], update_data=update_checkpoint_choosing)
print(json.dumps(updated_job, indent=4))
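
A typo in the method name is easy to miss. The sketch below builds the same configuration dict and rejects anything other than the three method names mentioned in the cell above; `make_checkpoint_choice` is a hypothetical helper, and the `checkpoint_epoch_number` value is left as the empty dict this notebook uses because its expected contents are not shown here.

```python
ALLOWED_CHECKPOINT_METHODS = ("best_model", "latest_model", "from_epoch_number")


def make_checkpoint_choice(method: str) -> dict:
    """Build the checkpoint configuration dict, rejecting unknown method names."""
    if method not in ALLOWED_CHECKPOINT_METHODS:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {ALLOWED_CHECKPOINT_METHODS}"
        )
    return {
        "checkpoint_choose_method": method,
        # Left empty, as in this notebook.
        "checkpoint_epoch_number": {},
    }


checkpoint_choice = make_checkpoint_choice("latest_model")
print(checkpoint_choice)
```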

#### Push model to private NGC team registry

In [None]:
# Publish model using TAO SDK
train_job_id = job_map["train_" + model_name]

try:
    publish_result = tao_client.publish_model(
        job_id=train_job_id,
        display_name=f"TAO {model_name}",
        description=f"Trained {model_name} model",
        team_name="tao"
    )
    
    print("Model published successfully to NGC!")
    print(f"Job ID: {train_job_id}")
    print(f"Display Name: TAO {model_name}")
    print("Team: tao")
    print("\nPublish Response:")
    print(json.dumps(publish_result, indent=4))
    
except Exception as e:
    print(f" Failed to publish model: {str(e)}")
    print("Make sure the job completed successfully and you have appropriate permissions")

#### Remove model from private NGC team registry

In [None]:
# # Remove published model using TAO SDK
# train_job_id = job_map["train_" + model_name]
# try:
#     remove_result = tao_client.remove_published_model(
#         job_id=train_job_id,
#         team="tao"
#     )
#     print("Published model removed successfully!")
#     print(json.dumps(remove_result, indent=4))
# except Exception as e:
#     print(f" Failed to remove published model: {str(e)}")

### Evaluate <a class="anchor" id="head-13"></a>

In [None]:
# Get default eval specs using TAO SDK
eval_spec_response = tao_client.get_job_schema(action="evaluate", network_arch=model_name)
eval_specs = eval_spec_response.get("default", {})
print("Default evaluate specifications:")
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to the specs if required
print(json.dumps(eval_specs, sort_keys=True, indent=4))

In [None]:
# Create evaluate job using TAO SDK
parent_job_id = job_map["train_" + model_name]
eval_job_name = f"{model_name}_eval_job"

eval_job_id = tao_client.create_job(
    kind="experiment",
    name=eval_job_name,
    network_arch=model_name,
    encryption_key=encode_key,
    workspace=workspace_id,
    eval_dataset=eval_dataset_id,
    action="evaluate",
    specs=eval_specs,  # Pass as dict, not JSON string
    parent_job_id=parent_job_id,
    base_experiment_ids=[selected_ptm_id] if selected_ptm_id else None,
    # platform_id="9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Optional: Pick from get_gpu_types output
)

print("Evaluate job created successfully!")
print(f"Evaluate Job ID: {eval_job_id}")
print(f"Parent Job ID: {parent_job_id}")
print("Action: evaluate")

job_map["evaluate_" + model_name] = eval_job_id
print("\nUpdated Job Map:")
print(json.dumps(job_map, indent=4))
%store job_map

In [None]:
# Monitor evaluate job status using TAO SDK
eval_job_id = job_map["evaluate_" + model_name]

while True:    
    clear_output(wait=True)
    
    try:
        job_status = tao_client.get_job_metadata(eval_job_id)
        
        print("Evaluate Job Status")
        print(f"Job ID: {eval_job_id}")
        print(f"Status: {job_status.get('status', 'Unknown')}")
        print(f"Progress: {job_status.get('progress', 'N/A')}")
        
        # Show detailed status information
        print("\nDetailed Status:")
        print(json.dumps(job_status.get("job_details", {}), sort_keys=True, indent=4))
        
        current_status = job_status.get("status", "Unknown")
        
        if current_status == "Error":
            raise Exception("Evaluate job failed!")
            
        if current_status in ["Done", "Completed"]:
            print("Evaluate job completed successfully!")
            break
            
        if current_status in ["Canceled", "Paused"]:
            print(f"Evaluate job {current_status}")
            break
            
    except Exception as e:
        if "failed" in str(e).lower():
            raise
        print(f" Error fetching evaluate job status: {str(e)}")
        print("Job might still be starting up...")
        
    time.sleep(15)

### TAO inference <a class="anchor" id="head-14"></a>

- Run inference on a set of images using the .tlt model created in the train step

In [None]:
# Get default inference specs using TAO SDK
inference_spec_response = tao_client.get_job_schema(action="inference", network_arch=model_name)
tao_inference_specs = inference_spec_response.get("default", {})
print("Default inference specifications:")
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to specs if necessary
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Create inference job using TAO SDK
parent_job_id = job_map["train_" + model_name]
inference_job_name = f"{model_name}_inference_job"

inference_job_id = tao_client.create_job(
    kind="experiment",
    name=inference_job_name,
    network_arch=model_name,
    encryption_key=encode_key,
    workspace=workspace_id,
    inference_dataset=eval_dataset_id,
    action="inference",
    specs=tao_inference_specs,  # Pass as dict, not JSON string
    parent_job_id=parent_job_id,
    base_experiment_ids=[selected_ptm_id] if selected_ptm_id else None,
    # platform_id="9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Optional: Pick from get_gpu_types output
)

print("Inference job created successfully!")
print(f"Inference Job ID: {inference_job_id}")
print(f"Parent Job ID: {parent_job_id}")
print("Action: inference")

job_map["inference_tao_" + model_name] = inference_job_id
print("\nUpdated Job Map:")
print(json.dumps(job_map, indent=4))
%store job_map

In [None]:
# Monitor inference job status using TAO SDK
inference_job_id = job_map["inference_tao_" + model_name]

while True:    
    clear_output(wait=True)
    
    try:
        job_status = tao_client.get_job_metadata(inference_job_id)
        
        print("Inference Job Status")
        print(f"Job ID: {inference_job_id}")
        print(f"Status: {job_status.get('status', 'Unknown')}")
        print(f"Progress: {job_status.get('progress', 'N/A')}")
        
        # Show detailed status information
        print("\nDetailed Status:")
        print(json.dumps(job_status.get("job_details", {}), sort_keys=True, indent=4))
        
        current_status = job_status.get("status", "Unknown")
        
        if current_status == "Error":
            raise Exception("Inference job failed!")
            
        if current_status in ["Done", "Completed"]:
            print("Inference job completed successfully!")
            break
            
        if current_status in ["Canceled", "Paused"]:
            print(f" Inference job {current_status}")
            break
            
    except Exception as e:
        if "failed" in str(e).lower():
            raise
        print(f" Error fetching inference job status: {str(e)}")
        print("Job might still be starting up...")
        
    time.sleep(15)

### Delete jobs <a class="anchor" id="head-22"></a>

In [None]:
# Delete all jobs created in this notebook using TAO SDK

print("Deleting all created jobs...")

# Track which jobs were actually deleted so the summary is accurate
deleted_jobs = []
for job_key, job_id in job_map.items():
    try:
        tao_client.delete_job(job_id)
        deleted_jobs.append(job_key)
        print(f"Deleted job: {job_key} (ID: {job_id})")
    except Exception as e:
        print(f"Failed to delete job {job_key} (ID: {job_id}): {str(e)}")

print(f"\nJob cleanup completed! Deleted {len(deleted_jobs)} of {len(job_map)} jobs.")

### Delete dataset <a class="anchor" id="head-16"></a>

#### Delete train dataset

In [None]:
# Delete train dataset using TAO SDK
try:
    delete_result = tao_client.delete_dataset(train_dataset_id)
    print("Train dataset deleted successfully!")
    print(f"Dataset ID: {train_dataset_id}")
    if delete_result:
        print("Delete Response:")
        print(json.dumps(delete_result, indent=4))
except Exception as e:
    print(f" Failed to delete train dataset {train_dataset_id}: {str(e)}")

#### Delete val dataset

In [None]:
# Delete eval dataset using TAO SDK
try:
    delete_result = tao_client.delete_dataset(eval_dataset_id)
    print("Eval dataset deleted successfully!")
    print(f"Dataset ID: {eval_dataset_id}")
    if delete_result:
        print("Delete Response:")
        print(json.dumps(delete_result, indent=4))
except Exception as e:
    print(f" Failed to delete eval dataset {eval_dataset_id}: {str(e)}")