### Notebook to demonstrate the synthetic data generation workflow

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where you take a model trained on one task and re-train it for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

---

### TAO data generation with StyleGAN-XL

StyleGAN-XL is a generative model for synthetic data: it produces additional images for downstream tasks such as classification and segmentation. This is especially useful when the existing real dataset is too small to train downstream models.

**Warning:**
The StyleGAN-XL pre-trained model used for fine-tuning in this notebook is for non-commercial use.

<img src="assets/stylegan_sdg.png" width="600">

---

### The workflow in a nutshell

This notebook shows how to train StyleGAN-XL, a conditional GAN model, then evaluate it and run inference to generate images from one of the six classes in the `NEU_Metal_Surface_Defects_Data/train` dataset.

1) Configure Connection to TAO FTMS 
2) Login & Create Cloud Workspace
3) Register Train datasets
4) Choose Pretrained model
5) Train and evaluate Data-generation model
6) Run inference to generate data

---

### Requirements
Prior to running this notebook you must have: 
1) A TAO FTMS server.  [(Setup Guide here)](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)
2) The sample stylegan_xl dataset from the [data_generation.ipynb](https://github.com/NVIDIA/tao_tutorials/tree/main/notebooks/tao_api_starter_kit/dataset_prepare/data_generation.ipynb) notebook uploaded to your cloud storage.
3) Set the `<>` enclosed variables with values in the Configuration section of the notebook.



---
### Debugging Finetuning Microservice and Jobs

When working with the TAO API, you may encounter issues at different stages. Use the following guidance to debug effectively:

#### 1. Dataset, Experiment, or Workspace CRUD Operation Errors

If you encounter errors related to creating, reading, updating, or deleting datasets, experiments, or workspaces **and the error messages are not clear**, check the logs of the TAO API service pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
```

#### 2. Errors During Job Launch

For issues that occur **while launching a job**, check both the app and workflow pods:

```bash
kubectl logs -f <pod name starting with tao-api-app-pod>
kubectl logs -f <pod name starting with tao-api-workflow-pod>
```

#### 3. Errors After Job Launch

If errors occur **after a job has been launched**, inspect the job pod logs:

```bash
kubectl logs -f tao-api-sts-<job_id>-0
```

> **Note:**  
> Run these `kubectl` commands on the machine where your Kubernetes service is deployed.

#### Additional Debugging Tips

- **Job logs are automatically uploaded** to your cloud workspace at:  
  `/results/<job_id>/microservices_log.txt`

- **You can also view logs via the Jobs API endpoint:**  
  ```
  /api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs
  ```

**Summary Table**

| Error Type                | Where to Check Logs                                      |
|-------------------------- |---------------------------------------------------------|
| CRUD operation errors     | `tao-api-app-pod`                                       |
| Job launch errors         | `tao-api-app-pod`, `tao-api-workflow-pod`               |
| Post-launch job errors    | `tao-api-sts-<job_id>-0`                                |
| All job logs (cloud)      | `/results/<job_id>/microservices_log.txt`               |
| All job logs (API)        | `/api/v1/orgs/<org_name>/<experiments|datasets>/<experiment_id|dataset_id>/jobs/<job_id>/logs` |
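
The Jobs API logs path from the table can be assembled with a small helper. This is a minimal sketch: the function name `job_logs_url` and the example host, org, and IDs are hypothetical, and only the path layout comes from the table above.

```python
def job_logs_url(host_url, org_name, kind, entity_id, job_id):
    """Build the Jobs API logs URL for an experiment job or a dataset job."""
    if kind not in ("experiments", "datasets"):
        raise ValueError(f"kind must be 'experiments' or 'datasets', got {kind!r}")
    return (
        f"{host_url.rstrip('/')}/api/v1/orgs/{org_name}"
        f"/{kind}/{entity_id}/jobs/{job_id}/logs"
    )


# Hypothetical values, for illustration only
print(job_logs_url("https://tao.example.com", "my_org", "experiments", "exp-123", "job-456"))
```

The resulting URL would still need the `Authorization` header used for the other API calls in this notebook.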

---

### Performance Benchmarks

#### Execution Time Breakdown

The following table shows the approximate time required for each stage of the StyleGAN-XL workflow:

| **Stage** | **Duration** | **Description** |
|-----------|--------------|-----------------|
| Dataset Convert | **2m 45s** | Image preprocessing and resolution conversion |
| Training | **1hr 30min** | Model training with 10 epochs |
| Evaluation | **18min** | FID score calculation and model assessment |
| Inference | **21min** | Synthetic image generation |
| **Total Time** | **2hr 12min** | **Complete end-to-end workflow** |
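
As a quick check of the total, the stage durations from the table can be summed with `datetime.timedelta`. The dictionary keys are illustrative labels, not API names.

```python
from datetime import timedelta

# Durations copied from the table above
stage_durations = {
    "dataset_convert": timedelta(minutes=2, seconds=45),
    "training": timedelta(hours=1, minutes=30),
    "evaluation": timedelta(minutes=18),
    "inference": timedelta(minutes=21),
}

# sum() needs a timedelta start value instead of the default integer 0
total = sum(stage_durations.values(), timedelta())
print(total)  # 2:11:45
```

The exact sum is 2 h 11 min 45 s, which the table rounds to 2 hr 12 min.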


#### Test Environment Specifications

| **Component** | **Specification** |
|---------------|-------------------|
| **GPU** | 1x NVIDIA A40 |
| **Training Dataset** | 1,656 images (66 MB total) |
| **Validation Dataset** | 72 images (2.9 MB total) |
| **Test Dataset** | 72 images (2.9 MB total) |
| **Image Resolution** | 64x64 pixels |
| **Training Epochs** | 10 |

#### Performance Factors

> **Important Note:**  
> Actual execution times may vary significantly based on:
> 
> - **Hardware Configuration**: GPU type, memory, and compute capability
> - **Storage Performance**: Local disk I/O speed and cloud storage latency
> - **Network Conditions**: Bandwidth and latency to cloud storage
> - **System Load**: Other concurrent processes and resource utilization
> - **Dataset Size**: Number of images and total data volume
> - **Batch Size**: Larger batch sizes can improve training stability and speed
> - **Model Configuration**: Resolution, epochs, and other hyperparameters


## Learning Objectives

In this notebook, you will learn how to use FTMS to run the `train`, `evaluate`, and `inference` actions with StyleGAN-XL.


In [None]:
import json
import os
import requests
import time
from IPython.display import clear_output
import glob

## Configuration 

Fill in all `<>` enclosed variables with relevant values under this section. 
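
Before running the rest of the notebook, it can help to confirm that no `<>` placeholder was left unfilled. The helper below is a hypothetical sketch (`find_unfilled` is not part of TAO); it flags any string value that still contains a `<...>` token.

```python
import re

# Matches any remaining <PLACEHOLDER> token inside a string
PLACEHOLDER = re.compile(r"<[^<>]+>")


def find_unfilled(settings):
    """Return the names of settings whose string value still contains a <...> placeholder."""
    return sorted(
        name
        for name, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )


# Illustrative values: one filled in, two still holding placeholders
example = {
    "host_url": "<HOST_URL>",
    "ngc_org_name": "my_org",
    "train_dataset_path": "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/stylegan_train",
}
print(find_unfilled(example))  # ['host_url', 'train_dataset_path']
```

Note that `automl_algorithm` is also written as a placeholder in this notebook, so it would be flagged even when AutoML is disabled.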

### TAO FTMS Host & Credentials 

In [None]:
# Configure for your TAO FTMS server 
host_url = "<HOST_URL>"
ngc_key = "<NGC_API_KEY>"
ngc_org_name = "<NGC_ORG_NAME>"

### Cloud Storage Setup & Credentials 

In [None]:
# Cloud bucket details to access datasets and store experiment results
cloud_metadata = {}
cloud_metadata["name"] = "tao_workspace"  # A Representative name for this cloud info
cloud_metadata["cloud_type"] = "aws"  # If it's AWS, HuggingFace or Azure
cloud_metadata["cloud_specific_details"] = {}
cloud_metadata["cloud_specific_details"]["cloud_region"] = "<BUCKET_REGION>"  # Bucket region
cloud_metadata["cloud_specific_details"]["cloud_bucket_name"] = "<BUCKET_NAME>"  # Bucket name
# Access and Secret for AWS
cloud_metadata["cloud_specific_details"]["access_key"] = "<ACCESS_KEY>"
cloud_metadata["cloud_specific_details"]["secret_key"] = "<SECRET_KEY>"

### Dataset Paths in Cloud Storage 

#### To see the dataset folder structure required for the models supported in this notebook, visit the notebooks under dataset_prepare, for example [this notebook](../dataset_prepare/segmentation.ipynb)

In [None]:
#FIX ME - adjust paths to point to datasets in your cloud storage. If using S3 Bucket do not include the bucket name in the path. 
train_dataset_path =  "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/stylegan_train"
eval_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/stylegan_val"
test_dataset_path = "/<PATH_TO_DATASET_IN_CLOUD_STORAGE>/stylegan_test"

### Model Configuration 

In [None]:
# No changes needed 
model_name = "stylegan_xl"
ds_type = "stylegan_xl"
ds_format = "default"

#### Toggle AutoML params
[AutoML documentation](https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#getting-started)

In [None]:
automl_enabled = False
automl_algorithm = "<bayesian/hyperband>"

## Login

In [None]:
#Use NGC Key to login to FTMS 
data = json.dumps({"ngc_org_name": ngc_org_name,
                   "ngc_key": ngc_key,
                   "enable_telemetry": True})
response = requests.post(f"{host_url}/api/v1/login", data=data)
assert response.status_code in (200, 201)
assert "token" in response.json().keys()
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/orgs/{ngc_org_name}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

## Create cloud workspace
This workspace is where your datasets reside and where the results of TAO FTMS jobs are pushed.

In [None]:
# Create cloud workspace
data = json.dumps(cloud_metadata)

endpoint = f"{base_url}/workspaces"

response = requests.post(endpoint,data=data,headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)

assert "id" in response.json().keys()
workspace_id = response.json()["id"]

## Register datasets with FTMS 

TAO FTMS requires datasets in your cloud storage to be registered to produce a unique ID that can be attached to training jobs. This step only needs to be done once and then you can use the dataset across any experiments that support the dataset format.

StyleGAN-XL requires three datasets: one each for training, evaluation, and inference.

### List Registered Datasets

In [None]:
endpoint = f"{base_url}/datasets"

response = requests.get(endpoint, headers=headers)
assert response.status_code in (200, 201)

datasets = response.json()["datasets"]
for rsp in datasets:
    rsp_keys = rsp.keys()
    assert "id" in rsp_keys
    assert "type" in rsp_keys
    assert "format" in rsp_keys
    assert "name" in rsp_keys

print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in datasets:
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

If you have already registered your datasets, then you can directly set their IDs in the following cell to avoid creating duplicate datasets. 
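
One way to look up an existing ID is to search the `datasets` list from the previous cell by name. This is a sketch under assumptions: `find_dataset_id` is a hypothetical helper, the sample IDs are made up, and it relies only on the `id`, `type`, `format`, and `name` keys that the listing cell asserts.

```python
def find_dataset_id(datasets, name, ds_type=None):
    """Return the id of the first registered dataset matching name (and type), or None."""
    for entry in datasets:
        if entry.get("name") != name:
            continue
        if ds_type is not None and entry.get("type") != ds_type:
            continue
        return entry.get("id")
    return None


# Made-up entries shaped like the items of response.json()["datasets"]
sample = [
    {"id": "id-a", "type": "stylegan_xl", "format": "default",
     "name": "stylegan_metal_surface_defects_train"},
    {"id": "id-b", "type": "stylegan_xl", "format": "default",
     "name": "stylegan_metal_surface_defects_eval"},
]
print(find_dataset_id(sample, "stylegan_metal_surface_defects_eval"))  # id-b
print(find_dataset_id(sample, "stylegan_metal_surface_defects_test"))  # None
```

A `None` result leaves the corresponding `*_dataset_id` variable unset, so the creation cells that follow would register the dataset as usual.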

In [None]:
train_dataset_id = None 
eval_dataset_id = None 
test_dataset_id = None 

### Train Dataset

You create a dataset object by linking the cloud workspace, the path in the cloud bucket, and its dataset type and format.

Optional fields include the intended use of the dataset object (training, evaluation, or testing) and a name to differentiate multiple datasets.

In [None]:
# Create train dataset
if train_dataset_id is None: 
    train_dataset_metadata = {"type": ds_type,
                              "format": ds_format,
                              "workspace":workspace_id,
                              "cloud_file_path": train_dataset_path,
                              "use_for": ["training"],
                              "name": "stylegan_metal_surface_defects_train"
                              }
    data = json.dumps(train_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    train_dataset_id = response.json()["id"]

The cell below checks whether the dataset files in the specified folder match the files required for this model.

If the required files are present, the dataset metadata's status is updated to `pull_complete`; otherwise it is set to `invalid_pull`.

The verification status is polled every 5 seconds. For larger datasets, this might take some time.
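
The polling logic can also be written as a reusable function with an upper bound on the wait. This is a minimal sketch: `wait_for_pull`, its `timeout` parameter, and the intermediate status names in the demo are assumptions; only `pull_complete` and `invalid_pull` come from this notebook.

```python
import time


def wait_for_pull(fetch_status, interval=5, timeout=600, sleep=time.sleep):
    """Poll fetch_status() until the dataset pull completes, fails, or times out."""
    waited = 0
    while True:
        status = fetch_status()
        if status == "pull_complete":
            return status
        if status == "invalid_pull":
            raise ValueError("Dataset pull failed")
        if waited >= timeout:
            raise TimeoutError(f"Dataset still in state {status!r} after {waited}s")
        sleep(interval)
        waited += interval


# Stand-in for a real status lookup; the first two status names are made up
fake_statuses = iter(["starting", "in_progress", "pull_complete"])
print(wait_for_pull(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In the notebook, `fetch_status` could be something like `lambda: requests.get(endpoint, headers=headers).json().get("status")`. Injecting the lookup and the sleep function keeps the loop testable without a server.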

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{train_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Validation Dataset - same process as the train dataset: create a dataset object and wait for file verification to complete

In [None]:
# Create eval dataset
if eval_dataset_id is None: 
    eval_dataset_metadata = {"type": ds_type,
                             "format": ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": eval_dataset_path,
                             "use_for": ["evaluation"],
                             "name" : "stylegan_metal_surface_defects_eval" 
                             }
    data = json.dumps(eval_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    eval_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

### Test Dataset - same process as the train dataset: create a dataset object and wait for file verification to complete

In [None]:
# Create testing dataset for inference
if test_dataset_id is None: 
    test_dataset_metadata = {"type": ds_type,
                             "format":ds_format,
                             "workspace":workspace_id,
                             "cloud_file_path": test_dataset_path,
                             "use_for": ["testing"],
                             "name": "stylegan_metal_surface_defects_test"
                             }
    data = json.dumps(test_dataset_metadata)
    
    endpoint = f"{base_url}/datasets"
    
    response = requests.post(endpoint,data=data,headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "id" in response.json().keys()
    
    test_dataset_id = response.json()["id"]

In [None]:
# Check progress
endpoint = f"{base_url}/datasets/{test_dataset_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

    if response.json().get("status") == "invalid_pull":
        raise ValueError("Dataset pull failed")
    if response.json().get("status") == "pull_complete":
        break
    time.sleep(5)

## Dataset convert Action for all three datasets <a class="anchor" id="head-8"></a>

This action is for **Image Preprocessing**:

StyleGAN-XL requires square, fixed-resolution images for training. The tool can crop and/or resize images to meet the resolution requirements of StyleGAN-XL's progressive training workflow. This training process involves starting with lower resolutions (e.g., `16x16`) and progressively increasing to higher resolutions (e.g., `256x256`). Consequently, we need multiple versions of the dataset, such as `16x16`, `32x32`, `64x64`, `128x128`, and `256x256`.
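
The resolution sequence described above is a simple doubling ladder. The sketch below generates it in the `[width, height]` form used for `convert_specs["resolution"]` later in this notebook; `resolution_ladder` is a hypothetical helper, not part of TAO.

```python
def resolution_ladder(start=16, stop=256):
    """List square [width, height] resolutions from start to stop, doubling each step."""
    if start <= 0 or stop < start:
        raise ValueError("expected 0 < start <= stop")
    sizes = []
    size = start
    while size <= stop:
        sizes.append([size, size])
        size *= 2
    return sizes


print(resolution_ladder())
# [[16, 16], [32, 32], [64, 64], [128, 128], [256, 256]]
```

This notebook converts the data at a single resolution (`64x64`); the ladder only illustrates which dataset versions a full progressive run would need.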

When a job is submitted, we will receive a unique ID to reference back to it. We will store these IDs in the following ```job_map``` variable. 

In [None]:
job_map = {}

### Train Dataset Convert 

In [None]:
# Choose dataset convert action
convert_action = "dataset_convert"

#### Retrieve default dataset convert spec 

Before launching a dataset convert job, we can retrieve the default spec for our action to use as a starting point to configure the dataset convert parameters. 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{train_dataset_id}/specs/{convert_action}/schema"

response = requests.get(endpoint, headers=headers)
print(response)
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema

convert_specs = response.json()["default"]

print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Configure Dataset Convert Spec 

Now we can customize the json spec object. No changes are needed unless you want to try different resolutions. 

In [None]:
# Apply changes
convert_specs["resolution"] = [64,64]  # Use the value set here for modifying config during train,eval,export,inference
print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Submit Train Dataset convert Job  

With our dataset convert spec configured, we can now submit a dataset convert job. All jobs follow the same flow of retrieving the default spec, customizing it, then submitting the job. 
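
Because every job submission sends the same three fields, the request body can be built by one small function. This is a sketch, not part of the TAO client: `build_job_payload` is a hypothetical name, and `platform_id` is included only when provided, mirroring the commented-out line in the submission cells.

```python
import json


def build_job_payload(action, specs, parent_job_id=None, platform_id=None):
    """Serialize the JSON body used when posting a job to a .../jobs endpoint."""
    body = {"parent_job_id": parent_job_id, "action": action, "specs": specs}
    if platform_id is not None:
        # Optional: only sent when a specific platform is requested
        body["platform_id"] = platform_id
    return json.dumps(body)


payload = build_job_payload("dataset_convert", {"resolution": [64, 64]})
print(payload)
```

The returned string can be passed as `data=` to `requests.post`, as in the cells of this notebook.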

In [None]:
# Run action
parent = None
action = convert_action
data = json.dumps({"parent_job_id":parent,"action":action, "specs":convert_specs,
                #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                })

endpoint = f"{base_url}/datasets/{train_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)

convert_id = response.json()
job_map["train_"+convert_action] = convert_id

After submitting the train dataset convert job, an ID is returned that we can use to monitor the job progress.

The following cell continuously prints the latest status until the job finishes in a success or failure state.
More information on the running/completed status of a job is visible in the response JSON.

This notebook will track all of the job IDs in the ```job_map``` variable. 
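
The monitoring cells stop polling on any of four states, and only `Done` means the job succeeded. The sketch below makes that distinction explicit; the helper names are hypothetical, and the state names are the ones checked in the monitoring loops of this notebook.

```python
# States on which the monitoring loops in this notebook stop polling
STOP_POLLING = ("Done", "Error", "Canceled", "Paused")


def should_keep_polling(status):
    """True while the job has not reached one of the stop states."""
    return status not in STOP_POLLING


def require_done(status, job_name="job"):
    """Raise unless the job finished successfully, so later steps do not chain onto a failed job."""
    if status != "Done":
        raise RuntimeError(f"{job_name} ended with status {status!r}, expected 'Done'")
    return status


print(should_keep_polling("Running"))  # True
print(should_keep_polling("Paused"))   # False
print(require_done("Done", "train_dataset_convert"))  # Done
```

Calling `require_done` after a monitoring loop exits would catch a `Canceled` or `Paused` job before it is used as a `parent_job_id`.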

In [None]:
# Monitor job status by repeatedly running this cell
job_id = convert_id
endpoint = f"{base_url}/datasets/{train_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True) 
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"

    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Val Dataset Convert - same process as train dataset convert: retrieving spec, configuring spec, submitting and monitoring job

#### Retrieve default dataset convert spec 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{eval_dataset_id}/specs/{convert_action}/schema"

response = requests.get(endpoint, headers=headers)
print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema

assert response.status_code in (200, 201)
assert "default" in response.json().keys()

convert_specs = response.json()["default"]

print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Configure Dataset Convert Spec 

In [None]:
# Apply changes
convert_specs["resolution"] = [64,64]  # Use the value set here for modifying config during train,eval,export,inference
print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Submit Val Dataset convert Job  

In [None]:
# Run action
parent = None
action = convert_action
data = json.dumps({"parent_job_id":parent,"action":action, "specs":convert_specs,
                #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                })

endpoint = f"{base_url}/datasets/{eval_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)

convert_id = response.json()
job_map["eval_"+convert_action] = convert_id

In [None]:
# Monitor job status by repeatedly running this cell
job_id = convert_id
endpoint = f"{base_url}/datasets/{eval_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True) 
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"

    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Test Dataset Convert - same process as train dataset convert: retrieving spec, configuring spec, submitting and monitoring job 

#### Retrieve default dataset convert spec 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/datasets/{test_dataset_id}/specs/{convert_action}/schema"

response = requests.get(endpoint, headers=headers)
print(response)
# print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema

assert response.status_code in (200, 201)
assert "default" in response.json().keys()

convert_specs = response.json()["default"]

print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Configure Dataset Convert Spec 

In [None]:
# Apply changes
convert_specs["resolution"] = [64,64]  # Use the value set here for modifying config during train,eval,export,inference
print(json.dumps(convert_specs, sort_keys=True, indent=4))

#### Submit Test Dataset convert Job  

In [None]:
# Run action
parent = None
action = convert_action
data = json.dumps({"parent_job_id":parent,"action":action, "specs":convert_specs,
                #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                })

endpoint = f"{base_url}/datasets/{test_dataset_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)

convert_id = response.json()
job_map["test_"+convert_action] = convert_id

In [None]:
# Monitor job status by repeatedly running this cell
job_id = convert_id
endpoint = f"{base_url}/datasets/{test_dataset_id}/jobs/{job_id}"

while True:
    clear_output(wait=True) 
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"

    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Create Experiment 

Before we can run any jobs such as training, evaluation or inference, we must create an experiment to set up the network architecture and associated datasets. Then we can chain several jobs together to create our trained models. 

In [None]:
encode_key = "nvidia_tao"
checkpoint_choose_method = "best_model"
data = json.dumps({"network_arch":model_name,
                   "encryption_key":encode_key,
                   "checkpoint_choose_method":checkpoint_choose_method,
                   "workspace": workspace_id,
                   "train_datasets":[train_dataset_id],
                   "eval_dataset":eval_dataset_id,
                   "inference_dataset":test_dataset_id})

endpoint = f"{base_url}/experiments"

response = requests.post(endpoint,data=data,headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))
assert response.status_code in (200, 201)
assert "id" in response.json()

experiment_id = response.json()["id"]

### Assign PTM <a class="anchor" id="head-12"></a>

Search NGC for the pretrained model (PTM) that matches the chosen network architecture.
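
The lookup performed by the following cells amounts to matching the end of each entry's `ngc_path`. As a standalone sketch: `find_ptm_id` is a hypothetical helper and the sample entries, including their paths and IDs, are made up for illustration.

```python
def find_ptm_id(base_experiments, ngc_path_suffix):
    """Return the id of the first base experiment whose ngc_path ends with the suffix."""
    for entry in base_experiments:
        if entry.get("ngc_path", "").endswith(ngc_path_suffix):
            return entry["id"]
    return None


# Made-up entries shaped like the items of response.json()["experiments"]
sample = [
    {"id": "ptm-1", "ngc_path": "example_org/example_team/other_model:v1.0"},
    {"id": "ptm-2", "ngc_path": "example_org/example_team/stylegan_xl:trainable_v1.0"},
]
print(find_ptm_id(sample, "stylegan_xl:trainable_v1.0"))  # ptm-2
```

Returning `None` when nothing matches makes a missing PTM easy to detect before patching the experiment.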

In [None]:
# List all pretrained models for the chosen network architecture
endpoint = f"{base_url}/experiments:base"
params = {"network_arch": model_name}
response = requests.get(endpoint, params=params, headers=headers)
assert response.status_code in (200, 201)

response_json = response.json()["experiments"]

for rsp in response_json:
    rsp_keys = rsp.keys()
    if "encryption_key" not in rsp.keys():
        assert "name" in rsp_keys and "version" in rsp_keys and "ngc_path" in rsp_keys
        print(f'PTM Name: {rsp["name"]}; PTM version: {rsp["version"]}; NGC PATH: {rsp["ngc_path"]}')

In [None]:
# Assign pretrained models to networks
# Based on the output of the previous cell, edit this map if you want to change the default PTM backbone.
# Changing the default backbone also requires matching changes to the default spec/config for train, inference, etc.
# For example, if you change the PTM to resnet34, you must manually set the config key num_layers (if it exists) to 34.
pretrained_map = {"stylegan_xl" : "stylegan_xl:trainable_v1.0",
                  }
no_ptm_models = set([])

In [None]:
# Get pretrained model
if model_name not in no_ptm_models:
    endpoint = f"{base_url}/experiments:base"
    params = {"network_arch": model_name}
    response = requests.get(endpoint, params=params, headers=headers)
    assert response.status_code in (200, 201)

    response_json = response.json()["experiments"]

    # Search for ptm with given ngc path
    ptm = []
    for rsp in response_json:
        rsp_keys = rsp.keys()
        assert "ngc_path" in rsp_keys
        if rsp["ngc_path"].endswith(pretrained_map[model_name]):
            assert "id" in rsp_keys
            ptm_id = rsp["id"]
            ptm = [ptm_id]
            print("Metadata for model with requested NGC Path")
            print(rsp)
            break

In [None]:
if model_name not in no_ptm_models:
    ptm_information = {"base_experiment":ptm}
    data = json.dumps(ptm_information)

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)

    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)

### Actions <a class="anchor" id="head-14"></a>

For all actions:
1. Get default spec schema and derive the default values
2. Modify defaults if needed
3. Post spec dictionary to the service
4. Run model action
5. Monitor job using retrieve
6. Download results using job download endpoint (if needed)
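
The first four steps can be sketched as a single function. This is an illustration under assumptions, not the TAO client: `run_action` and the fake classes are hypothetical, overrides are applied to top-level spec keys only, and status-code checks are omitted for brevity. The fake client stands in for the `requests` module so that the sketch runs without a server.

```python
import json


def run_action(http, base_url, headers, experiment_id, action,
               overrides=None, parent_job_id=None):
    """Fetch the default spec, apply top-level overrides, and submit the job."""
    schema_url = f"{base_url}/experiments/{experiment_id}/specs/{action}/schema"
    specs = http.get(schema_url, headers=headers).json()["default"]  # step 1
    specs.update(overrides or {})                                    # step 2
    body = json.dumps({"parent_job_id": parent_job_id,
                       "action": action, "specs": specs})
    jobs_url = f"{base_url}/experiments/{experiment_id}/jobs"
    return http.post(jobs_url, data=body, headers=headers).json()    # steps 3-4


class FakeResponse:
    """Minimal stand-in for requests.Response."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeHTTP:
    """Records posted jobs and returns canned answers; all values are made up."""
    def __init__(self):
        self.posted = []

    def get(self, url, headers=None):
        return FakeResponse({"default": {"num_gpus": 1}})

    def post(self, url, data=None, headers=None):
        self.posted.append((url, json.loads(data)))
        return FakeResponse("job-0001")


http = FakeHTTP()
job_id = run_action(http, "https://tao.example.com/api/v1/orgs/my_org", {},
                    "exp-123", "train", overrides={"num_gpus": 2})
print(job_id, http.posted[0][1]["specs"])
```

With a live server, passing the `requests` module as `http` should behave the same way, since `requests.get` and `requests.post` accept the keyword arguments used here.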

### Train <a class="anchor" id="head-15"></a>

#### View hyperparameters that are enabled for AutoML by default <a class="anchor" id="head-15-automl-params"></a>

In [None]:
if automl_enabled:
    # Get default spec schema
    endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"
    while True:
        response = requests.get(endpoint, headers=headers)
        if response.status_code == 404:
            if "Base spec file download state is " in response.json()["error_desc"]:
                print("Base experiment spec file is being downloaded")
                time.sleep(2)
                continue
            else:
                break
        else:
            break
    assert response.status_code in (200, 201)
    assert "automl_default_parameters" in response.json().keys()
    automl_params = response.json()["automl_default_parameters"]
    print(json.dumps(automl_params, sort_keys=True, indent=4))

#### Set AutoML related configurations <a class="anchor" id="head-15-automl-algo-params"></a>


In [None]:
if automl_enabled:
    # Choose any metric that is present in the kpi dictionary present in the model's status.json. 
    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html
    metric="kpi"

    #Refer to parameter list mentioned in the above links and add/remove any extra parameter in addition to the default enabled ones in automl_specs

    automl_information = {"automl_enabled": True,
                          "automl_algorithm": automl_algorithm,
                          "automl_max_recommendations": 20, # Only for bayesian
                          "automl_R": 27, # Only for hyperband
                          "automl_nu": 3, # Only for hyperband
                          "epoch_multiplier": 1, # Only for hyperband
                          # Warning: The parameters that are disabled are not tested by TAO, so there might be unexpected behaviour in overriding this
                          "override_automl_disabled_params": False,
                          "automl_hyperparameters": str(automl_params)}
    data = json.dumps({"metric":metric, "automl_settings": automl_information})

    endpoint = f"{base_url}/experiments/{experiment_id}"

    response = requests.patch(endpoint, data=data, headers=headers)
    assert response.status_code in (200, 201)
    
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))

### Retrieve default training spec 

Before launching a training job, we can retrieve the default training spec for our network architecture to use as a starting point to configure the training parameters. 
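
The schema request retries while the server answers 404 with a "Base spec file download state is" message, which indicates the base experiment spec is still downloading. A bounded variant of that retry is sketched below; `fetch_schema`, `max_attempts`, and the demo payloads are assumptions, and only the 404 check and message prefix come from this notebook.

```python
import time


def fetch_schema(get_response, max_attempts=30, delay=2, sleep=time.sleep):
    """Call get_response() until it stops returning the 'still downloading' 404."""
    for _ in range(max_attempts):
        status_code, payload = get_response()
        downloading = (
            status_code == 404
            and "Base spec file download state is " in payload.get("error_desc", "")
        )
        if not downloading:
            # Either success or an unrelated error; let the caller decide
            return status_code, payload
        sleep(delay)
    raise TimeoutError(f"Base spec still downloading after {max_attempts} attempts")


# Made-up responses: one 'still downloading' reply, then the schema
replies = iter([
    (404, {"error_desc": "Base spec file download state is in_progress"}),
    (200, {"default": {"train": {"num_epochs": 10}}}),
])
print(fetch_schema(lambda: next(replies), sleep=lambda seconds: None))
```

In the notebook, `get_response` could wrap `requests.get(endpoint, headers=headers)` and return `(response.status_code, response.json())`.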

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/train/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
assert response.status_code in (200, 201)

print(response)
# full_schema = response.json()
# print(json.dumps(full_schema, indent=4)) ## Uncomment for verbose schema
assert "default" in response.json().keys()
train_specs = response.json()["default"]
print(json.dumps(train_specs, sort_keys=True, indent=4))

### Configure Training Spec 

Now we can customize the json spec object and set our dataset, training and model parameters. No changes are needed unless you are using a custom dataset or model architecture.

#### Understanding the Spec Schema Structure

Before customizing the training spec, it's helpful to understand the available configuration options and their valid values. The `get_bounds_of_field` utility function helps you explore the schema structure and find valid parameter values if they are defined on the backend (for some fields they might not be defined). The example below assumes `full_schema` holds the complete schema response; to define it, uncomment `full_schema = response.json()` in the previous cell.

**How to use the schema exploration utility:**

```python
from get_bounds import get_bounds_of_field

# Example: Get valid values for the image resolution
# Note the conversion from dictionary access notation to list notation:
# From: ["dataset"]["common"]["img_resolution"]
# To:   ["dataset", "common", "img_resolution"]
print(get_bounds_of_field(full_schema, ["dataset", "common", "img_resolution"]))
```

**Key differences in notation:**
- **Dictionary access format**: `["dataset"]["common"]["img_resolution"]` - This is how you would access nested values in Python
- **Schema path format**: `["dataset", "common", "img_resolution"]` - This is the format expected by the `get_bounds_of_field` function

---

#### Schema Navigation Tips

1. **Nested Structure**: The schema follows a hierarchical structure where each level represents a configuration category
2. **Path Format**: Always use a list of strings `["level1", "level2", "level3"]` when calling `get_bounds_of_field`
3. **Validation**: This helps ensure you're using valid parameter values before submitting your training job
4. **Documentation**: Use these explorations to understand what options are available for your specific model architecture
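
The same list-style path can also be used to read or set a nested value in a spec dictionary. The helpers below are a hypothetical sketch, independent of `get_bounds_of_field`, and the sample spec is made up.

```python
from functools import reduce


def get_in(spec, path):
    """Follow a list of keys into a nested dictionary and return the value found."""
    return reduce(lambda node, key: node[key], path, spec)


def set_in(spec, path, value):
    """Set the value at a list-style path; all parent keys must already exist."""
    *parents, last = path
    get_in(spec, parents)[last] = value
    return spec


# Made-up spec with the same nesting as the training spec in this notebook
example_specs = {"dataset": {"common": {"img_resolution": 32}}}
set_in(example_specs, ["dataset", "common", "img_resolution"], 64)
print(get_in(example_specs, ["dataset", "common", "img_resolution"]))  # 64
```

Keeping one path list per parameter means the same value can be passed to both a bounds lookup and an override.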


No changes to the default spec are needed unless you are using a custom dataset format or want to modify the model architecture and training parameters.

In [None]:
# Override any of the parameters listed in the previous cell as required
train_specs["train"]["num_gpus"] = 1
train_specs["train"]["num_epochs"] = 10
train_specs["train"]["checkpoint_interval"] = 10
train_specs["train"]["validation_interval"] = 10
train_specs["model"]["generator"]["stem"]["resolution"] = 64  # Must match the resolution used in the dataset convert job
train_specs["dataset"]["common"]["img_resolution"] = 64  # Must match the resolution used in the dataset convert job
print(json.dumps(train_specs, sort_keys=True, indent=4))

### Submit Training Job  

With our training spec configured, we can now submit a training job. All jobs follow the same flow of retrieving the default spec, customizing it, then submitting the job. 

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the last completed dataset convert job. This ensures that training starts only after a successful dataset convert, and avoids runtime errors caused by failed dataset convert jobs.
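
The request body used for every job in this notebook has the same shape, so it can be sketched as a small builder. `build_job_payload` is a hypothetical helper, not part of the TAO API; the field names (`parent_job_id`, `action`, `specs`, `platform_id`) are the ones used in the cells of this notebook, and treating `parent_job_id` and `platform_id` as optional is an assumption.

```python
import json


def build_job_payload(action, specs, parent_job_id=None, platform_id=None):
    """Build the JSON body for a job submission request (illustrative helper)."""
    body = {"action": action, "specs": specs}
    # Assumption: optional fields are simply left out when not provided
    if parent_job_id is not None:
        body["parent_job_id"] = parent_job_id
    if platform_id is not None:
        body["platform_id"] = platform_id
    return json.dumps(body)


# Placeholder values, for demonstration only
payload = build_job_payload("train", {"train": {"num_epochs": 10}}, parent_job_id="example-convert-job-id")
print(payload)
```

The returned string can be passed as `data=` to `requests.post`, as in the cell that follows.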

In [None]:
# Run action
parent = convert_id
action = "train"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":train_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)
assert response.json()

job_map["train_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
# For automl: Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

job_id = job_map["train_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    if "error_desc" in response.json().keys() and response.json()["error_desc"] in ("Job trying to retrieve not found", "No AutoML run found"):
        print("Job is being created")
        time.sleep(5)
        continue
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)
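
The same polling pattern is repeated for the training, evaluation, and inference jobs, so it could be factored into a reusable function. The following is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the status source is injected as a callable so the example runs without a server, and the non-terminal status names `"Pending"` and `"Running"` are placeholders. The terminal states are the ones checked in the cell above.

```python
import time

# Terminal states taken from the monitoring cell above
TERMINAL_STATES = ("Done", "Error", "Canceled", "Paused")


def wait_for_job(fetch_status, poll_seconds=15, max_polls=None, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state, then return that state."""
    polls = 0
    while True:
        status = fetch_status()
        polls += 1
        if status in TERMINAL_STATES:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job still '{status}' after {polls} polls")
        sleep(poll_seconds)


# Demo with a fake status source instead of a real HTTP request
fake_statuses = iter(["Pending", "Running", "Done"])
result = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

In the notebook, `fetch_status` could be something like `lambda: requests.get(endpoint, headers=headers).json().get("status")`.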

If you need to cancel the job for any reason, you can uncomment and run the following cell. You can also configure the endpoint to end with ```:pause``` or ```:resume``` instead of ```:cancel``` to temporarily stop and start the job. 
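
As a sketch of how the three endpoint variants relate, the suffix can be built from a single verb argument. `job_action_endpoint` is a hypothetical helper and the argument values in the demo are placeholders; only the `:cancel`, `:pause`, and `:resume` suffixes come from the text above.

```python
ALLOWED_VERBS = ("cancel", "pause", "resume")


def job_action_endpoint(base_url, experiment_id, job_id, verb):
    """Return the job endpoint URL ending in :cancel, :pause, or :resume."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb must be one of {ALLOWED_VERBS}, got {verb!r}")
    return f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:{verb}"


# Placeholder identifiers, for demonstration only
url = job_action_endpoint("https://example.invalid/api", "exp-123", "job-456", "pause")
print(url)
```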

In [None]:
# job_id = job_map["train_" + model_name]
# endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"

# response = requests.post(endpoint, headers=headers)
# print(response)
# print(json.dumps(response.json(), indent=4))

## Evaluate <a class="anchor" id="head-17"></a>

In this section, we run the `stylegan_xl` evaluation to assess the trained StyleGAN-XL model by computing the FID score between real and synthetic images. A lower FID indicates that the synthetic images are statistically closer to the real ones.
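
For reference, the Fréchet Inception Distance (FID) is commonly defined over the mean and covariance of Inception network features, with $(\mu_r, \Sigma_r)$ computed from real images and $(\mu_g, \Sigma_g)$ from generated images. This is the standard definition; that the TAO evaluation follows it exactly is an assumption here.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

The score is zero when the two feature distributions have identical means and covariances.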

### Retrieve default evaluation spec 

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/evaluate/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
assert response.status_code in (200, 201)
assert "default" in response.json().keys()

eval_specs = response.json()["default"]
print(json.dumps(eval_specs, sort_keys=True, indent=4))

### Customize Evaluation Spec 

In [None]:
# Apply changes
eval_specs["model"]["generator"]["stem"]["resolution"] = 64  # Must match the resolution used in the dataset convert job
eval_specs["dataset"]["common"]["img_resolution"] = 64  # Must match the resolution used in the dataset convert job
print(json.dumps(eval_specs, sort_keys=True, indent=4))

### Submit Evaluation Job 

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the completed training job. This is required to pass the trained model from the training job into our evaluation job. 

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "evaluate"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":eval_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)

job_map["evaluate_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["evaluate_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

## Inference on Trained checkpoint

In this section, we run the `stylegan_xl` inference to generate synthetic images using the trained StyleGAN-XL model.

This StyleGAN-XL model is trained as a conditional GAN that can generate images from any one of the six classes in the `NEU_Metal_Surface_Defects_Data/train` dataset. In the inference spec, `class_idx` defaults to `0`, but you can override it with any value from `0` to `5` to generate images from a different class.
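
To generate images for every class, one inference spec per class index could be prepared and submitted as separate jobs. The following is a minimal sketch: `specs_for_class` is a hypothetical helper, and `base` is a reduced stand-in for the default inference spec that contains only the `class_idx` field.

```python
import copy

NUM_CLASSES = 6  # The dataset has six classes, so valid indices are 0 to 5


def specs_for_class(base_specs, class_idx):
    """Return a copy of base_specs with inference.class_idx set to class_idx."""
    if not 0 <= class_idx < NUM_CLASSES:
        raise ValueError(f"class_idx must be between 0 and {NUM_CLASSES - 1}")
    specs = copy.deepcopy(base_specs)  # Leave the original spec untouched
    specs["inference"]["class_idx"] = class_idx
    return specs


# Reduced stand-in for the default inference spec
base = {"inference": {"class_idx": 0}}
per_class = [specs_for_class(base, idx) for idx in range(NUM_CLASSES)]
print([spec["inference"]["class_idx"] for spec in per_class])
```

Using `copy.deepcopy` keeps each per-class spec independent, so changing one does not affect the others.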

In [None]:
# Get default spec schema
endpoint = f"{base_url}/experiments/{experiment_id}/specs/inference/schema"

while True:
    response = requests.get(endpoint, headers=headers)
    if response.status_code == 404:
        if "Base spec file download state is " in response.json()["error_desc"]:
            print("Base experiment spec file is being downloaded")
            time.sleep(2)
            continue
        else:
            break
    else:
        break
print(response)
#print(json.dumps(response.json(), indent=4)) ## Uncomment for verbose schema
assert response.status_code in (200, 201)

assert "default" in response.json().keys()
tao_inference_specs = response.json()["default"]
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

In [None]:
# Apply changes to specs if necessary
tao_inference_specs["inference"]["class_idx"] = 0  # Class you want to generate data for
tao_inference_specs["model"]["generator"]["stem"]["resolution"] = 64  # Must match the resolution used in the dataset convert job
tao_inference_specs["dataset"]["common"]["img_resolution"] = 64  # Must match the resolution used in the dataset convert job
print(json.dumps(tao_inference_specs, sort_keys=True, indent=4))

Note that for this job we will set the ```parent_job_id``` parameter in the body of the request to the completed training job. This is required to pass the trained model from the training job into our inference job. 

In [None]:
# Run action
parent = job_map["train_" + model_name]
action = "inference"
data = json.dumps({"parent_job_id":parent,"action":action,"specs":tao_inference_specs,
                  #  "platform_id": "9af1aa90-8ea5-5a11-98d9-3879cd0da92c",  # Pick a platform_id from the output of {base_url}:gpu_types depending on GPU_type and instance_type
                   })

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"

response = requests.post(endpoint, data=data, headers=headers)
print(response)
print(json.dumps(response.json(), indent=4))

assert response.status_code in (200, 201)
assert response.json()

job_map["inference_tao_" + model_name] = response.json()
print(job_map)

In [None]:
# Monitor job status by repeatedly running this cell
job_id = job_map["inference_tao_" + model_name]
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"

while True:
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), indent=4))
    assert response.status_code in (200, 201)
    assert "status" in response.json().keys() and response.json().get("status") != "Error"
    if response.json().get("status") in ["Done","Error", "Canceled", "Paused"] or response.status_code not in (200,201):
        break
    time.sleep(15)

Note: The generated images will be available in your cloud storage at `/results/<inference_job_id>`.

## Clean Up 

You can optionally run this section to delete the datasets and experiment results. 

### Delete experiment <a class="anchor" id="head-22"></a>

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete train dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{train_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete val dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{eval_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))

### Delete test dataset <a class="anchor" id="head-24"></a>

In [None]:
endpoint = f"{base_url}/datasets/{test_dataset_id}"

response = requests.delete(endpoint,headers=headers)
assert response.status_code in (200, 201)

print(response)
print(json.dumps(response.json(), indent=4))