## Supervised Fine-Tuning using a HyperPod Cluster

You can customize Amazon Nova models through base recipes using Amazon SageMaker Hyperpod jobs. These recipes support Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), with both Full-Rank and Low-Rank Adaptation (LoRA) options.

The end-to-end customization workflow involves stages like model training, model evaluation, and deployment for inference. This model customization approach on SageMaker AI provides greater flexibility and control to fine-tune its supported Amazon Nova models, optimize hyperparameters with precision, and implement techniques including LoRA Parameter-Efficient Fine-Tuning (PEFT), Full-Rank Supervised Fine-Tuning, and Direct Preference Optimization (DPO).

This notebook demonstrates Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT) of Amazon Nova on an Amazon SageMaker HyperPod cluster. SFT fine-tunes a language model on a specific task using labeled examples, while PEFT makes fine-tuning efficient by updating only a small subset of the model's parameters.

> _**Note:** This notebook demonstrates fine-tuning using Nova Micro, but the same techniques can be applied to Nova Lite or Nova Pro models with appropriate adjustments to the configuration._

In [None]:
! pip install -r ./requirements.txt --upgrade

## Prerequisites

This notebook assumes that the cluster setup and the cluster RIG (restricted instance group) setup are complete. If you have not yet created the cluster and the RIG, follow the documentation on how to set up a HyperPod cluster and how to add a RIG to it: [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html)

You can also use the Cluster Creation and RIG Setup Helper Utility command-line tool, which can be [found here](../cli_utility/00_setup).

To verify that you have a cluster running, open SageMaker AI in the AWS Console and go to the HyperPod section, or run the command below.

In [None]:
%%bash
echo "!!!! List all the clusters available !!!!"
aws sagemaker list-clusters | jq '.ClusterSummaries'
CLUSTER_NAME=$(aws sagemaker list-clusters | jq -r '.ClusterSummaries[0].ClusterName')
CLUSTER_ID=$(aws sagemaker list-clusters | jq -r '.ClusterSummaries[0].ClusterArn | split("/")[-1]')


echo ""
echo "Cluster name:"
echo $CLUSTER_NAME


echo ""

echo "Describe the Restricted Instance Group in the cluster"
aws sagemaker describe-cluster --cluster-name $CLUSTER_NAME | jq -r '.RestrictedInstanceGroups[0]'
RIG_NAME=$(aws sagemaker describe-cluster --cluster-name $CLUSTER_NAME | jq -r '.RestrictedInstanceGroups[0].InstanceGroupName')

cat > .env << EOF
export CLUSTER_NAME=$CLUSTER_NAME
export CLUSTER_ID=$CLUSTER_ID
export RIG_NAME=$RIG_NAME
EOF



As the output of the cell above shows, the `describe-cluster` response includes the RIG configuration as well:

```
"RestrictedInstanceGroups": [
        {
            ....
```

This indicates that the cluster has a RIG set up with p5.48xlarge instances, which are needed to run the training jobs on HyperPod.
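
The same check can be scripted. The sketch below works on a parsed `describe-cluster` response and lists the instance type of each restricted instance group. It assumes each RIG entry carries an `InstanceType` field next to `InstanceGroupName`; the sample response is made up for illustration.

```python
def rig_instance_types(describe_cluster_response):
    """Map each restricted instance group name to its instance type."""
    groups = describe_cluster_response.get("RestrictedInstanceGroups", [])
    return {
        group.get("InstanceGroupName"): group.get("InstanceType")
        for group in groups
    }


# Hypothetical, trimmed-down response used only for illustration
sample_response = {
    "ClusterName": "my-cluster",
    "RestrictedInstanceGroups": [
        {"InstanceGroupName": "rig-1", "InstanceType": "ml.p5.48xlarge"}
    ],
}

rigs = rig_instance_types(sample_response)
has_p5 = any("p5.48xlarge" in (itype or "") for itype in rigs.values())
print(rigs, has_p5)
```

In the notebook, the dictionary could come from `boto3.client("sagemaker").describe_cluster(ClusterName=...)` instead of the sample above.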

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
sagemaker_session_bucket = None

if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
bucket_name = sess.default_bucket()
default_prefix = sess.default_bucket_prefix

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Step 1: Prepare the dataset

In this example, we load [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), an open-source dataset and model suite focused on enabling and improving function-calling capabilities for large language models (LLMs).

### Step 1.1: Data Loading

This code loads the first 10,000 examples from the glaive-function-calling-v2 dataset from Hugging Face.


In [None]:
from datasets import load_dataset

dataset = load_dataset("glaiveai/glaive-function-calling-v2", split="train[:10000]")

dataset

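The `glaive_to_standard_format` helper first converts the raw records into a standard message format. Converting the result to a pandas DataFrame then makes it easier to inspect and manipulate.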


In [None]:
from utils.preprocessing import glaive_to_standard_format

processed_dataset = glaive_to_standard_format(dataset)

In [None]:
import pandas as pd

df = pd.DataFrame(processed_dataset)

df.head()

### Step 1.2: Train/Val/Test Split

The dataset is split into training (72%), validation (18%), and test (10%) sets to properly evaluate the model. 

In [None]:
from sklearn.model_selection import train_test_split

temp, test = train_test_split(df, test_size=0.1, random_state=42)
train, val = train_test_split(temp, test_size=0.2, random_state=42)

print("Number of train elements: ", len(train))
print("Number of test elements: ", len(test))
print("Number of val elements: ", len(val))
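
The 72/18/10 figures follow from applying the two splits in sequence: 10% is held out for testing, and 20% of the remaining 90% goes to validation. A small sketch of that arithmetic (the helper name `split_fractions` is ours, not part of the notebook):

```python
def split_fractions(test_size=0.1, val_size=0.2):
    """Return (train, val, test) fractions for a two-stage split."""
    remaining = 1.0 - test_size      # share left after carving out the test set
    val = remaining * val_size       # validation is taken from the remainder
    train = remaining - val
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")
```

For a DataFrame of 10,000 rows this corresponds to roughly 7,200 training, 1,800 validation, and 1,000 test examples.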

### Understanding the Nova Format

Let's format the dataset by using the prompt style for Amazon Nova:

```
{
    "system": [{"text": Content of the System prompt}],
    "messages": [
        {
            "role": "user",
            "content": [{"text": Content of the user prompt}]
        },
        {
            "role": "assistant",
            "content": [{"text": Content of the answer}]
        },
        ...
    ]
}
```
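
As a concrete instance of this layout, the sketch below builds one record from a system prompt and a list of turns. The helper `make_nova_record` and the sample texts are illustrative only and are not part of the notebook's pipeline.

```python
import json


def make_nova_record(system_prompt, turns):
    """Build one training record; `turns` is a list of (role, text) pairs."""
    for role, _ in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
    return {
        "system": [{"text": system_prompt}],
        "messages": [
            {"role": role, "content": [{"text": text}]} for role, text in turns
        ],
    }


record = make_nova_record(
    "You are a helpful assistant.",
    [("user", "What is 2 + 2?"), ("assistant", "4")],
)
print(json.dumps(record, indent=2))
```

Note that `content` is a list of objects, each with a `text` key, which is the same shape the preprocessing functions in this notebook emit.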

### Step 1.3: Data Preprocessing 

The notebook defines utility functions to clean the dataset content by removing prefixes and handling special cases:

```python
def clean_prefix(content):
    # Removes prefixes like "USER:", "ASSISTANT:", etc.
    ...

def clean_message_list(message_list):
    # Cleans message lists from None values and converts to proper format
    ...

def clean_numbered_conversation(text):
    # Removes role prefixes from numbered lines such as "1. User: ..."
    ...
```

In [None]:
import json
import re


def clean_prefix(content):
    """Remove prefixes from content, according to Nova data_validator"""
    prefixes = [
        "SYSTEM:",
        "System:",
        "USER:",
        "User:",
        "ASSISTANT:",
        "Assistant:",
        "Bot:",
        "BOT:",
    ]

    # Handle array case (list of content items)
    if hasattr(content, "__iter__") and not isinstance(content, str):
        for i, item in enumerate(content):
            if isinstance(item, dict) and "text" in item:
                text = item["text"]
                if isinstance(text, str):
                    # Clean line by line for multi-line text
                    lines = text.split("\n")
                    cleaned_lines = []
                    for line in lines:
                        cleaned_line = line.strip()
                        for prefix in prefixes:
                            if cleaned_line.startswith(prefix):
                                cleaned_line = cleaned_line[len(prefix) :].strip()
                                break
                        cleaned_lines.append(cleaned_line)
                    item["text"] = "\n".join(cleaned_lines)
        return content

    # Handle string case
    if isinstance(content, str):
        lines = content.split("\n")
        cleaned_lines = []
        for line in lines:
            cleaned_line = line.strip()
            for prefix in prefixes:
                if cleaned_line.startswith(prefix):
                    cleaned_line = cleaned_line[len(prefix) :].strip()
                    break
            cleaned_lines.append(cleaned_line)
        return "\n".join(cleaned_lines)

    return content


def clean_message_list(message_list):
    """Clean message list from None values and convert to list of dicts if needed."""
    if isinstance(message_list, str):
        message_list = json.loads(message_list)

    tmp_cleaned = []
    for msg in message_list:
        new_msg = {}
        for key, value in msg.items():
            if key in ["content"]:
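                # Drop missing content, including the literal string "none"
                if value is None or str(value).lower() == "none":
                    continue
            new_msg[key] = value
        tmp_cleaned.append(new_msg)

    cleaned = []
    for item in tmp_cleaned:
        # Skip messages whose empty content was dropped above
        if "content" not in item:
            continue
        content = item["content"]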
        for content_item in content:
            if isinstance(content_item, dict) and "text" in content_item:
                text = clean_numbered_conversation(content_item["text"])
                content_item["text"] = clean_prefix(text)
        cleaned.append({"role": item["role"], "content": content})

    return cleaned


# Additional function to specifically handle the numbered conversation format
def clean_numbered_conversation(text):
    """Clean numbered conversation format like '1. User: ...'"""
    if not isinstance(text, str):
        return text

    # Pattern to match numbered items with User: or Assistant: prefixes
    pattern = r"(\d+\.\s*)(User:|Assistant:)\s*"

    # Replace the pattern, keeping the number but removing the role prefix
    cleaned_text = re.sub(pattern, r"\1", text)

    return cleaned_text
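
To see the effect of these cleaning steps on a small input, here is a condensed, standalone sketch that combines the numbered-conversation pattern with the prefix stripping. The function name `strip_role_markers` and the sample text are ours; the notebook itself uses the functions defined above.

```python
import re

ROLE_PREFIXES = (
    "SYSTEM:", "System:", "USER:", "User:",
    "ASSISTANT:", "Assistant:", "Bot:", "BOT:",
)
NUMBERED_ROLE = re.compile(r"(\d+\.\s*)(User:|Assistant:)\s*")


def strip_role_markers(text):
    """Drop role prefixes at line starts and after numbered-list markers."""
    text = NUMBERED_ROLE.sub(r"\1", text)  # "1. User: hi" becomes "1. hi"
    cleaned = []
    for line in text.split("\n"):
        line = line.strip()
        for prefix in ROLE_PREFIXES:
            if line.startswith(prefix):
                line = line[len(prefix):].strip()
                break
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "USER: book a flight\n1. User: to Paris\n2. Assistant: on Monday"
print(strip_role_markers(sample))
```

The numbering is kept while the role labels are removed, so the text no longer contains strings that could be mistaken for turn markers.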

These functions transform the dataset into the format required by Nova models, handling tool calls and formatting:

```python

def transform_tool_format(tool):
    # Transforms tool format to Nova's expected format
    ...

def prepare_dataset(sample):
    # Prepares dataset in the required format for Nova models
    ...

def prepare_dataset_test(sample):
    # Formats validation dataset for evaluation
    ...
```


In [None]:
import json

def transform_tool_format(tool):
    """Transform tool from old format to Nova format."""
    if "function" not in tool:
        return tool

    function = tool["function"]
    return {
        "toolSpec": {
            "name": function["name"],
            "description": function["description"],
            "inputSchema": {"json": function["parameters"]},
        }
    }


def prepare_dataset(sample):
    """Prepare dataset in the required format for Nova models"""
    messages = {"system": [], "messages": []}

    # Process tools upfront if they exist
    tools = json.loads(sample["tools"]) if sample.get("tools") else []
    transformed_tools = [transform_tool_format(tool) for tool in tools]

    formatted_text = (
        ""  # Initialize outside the loop to avoid undefined variable issues
    )

    for message in sample["messages"]:
        role = message["role"]

        if role == "system" and tools:
            # Build system message with tools
            system_text = (
                f"{message['content']}\n"
                "You may call one or more functions to assist with the user query.\n\n"
                "You are provided with function signatures within <tools></tools> XML tags:\n"
                "<tools>\n"
                f"{json.dumps({'tools': transformed_tools})}\n"
                "</tools>\n\n"
                "For each function call, return a json object with function name and parameters:\n"
                '{"name": function name, "parameters": dictionary of argument name and its value}'
            )
            messages["system"] = [{"text": system_text.lower()}]

        elif role == "user":
            messages["messages"].append(
                {"role": "user", "content": [{"text": message["content"].lower()}]}
            )

        elif role == "tool":
            formatted_text += message["content"]
            messages["messages"].append(
                {"role": "user", "content": [{"text": formatted_text.lower()}]}
            )

        elif role == "assistant":
            if message.get("tool_calls"):
                # Process tool calls
                tool_calls_text = []
                for tool_call in message["tool_calls"]:
                    function_data = tool_call["function"]
                    arguments = (
                        json.loads(function_data["arguments"])
                        if isinstance(function_data["arguments"], str)
                        else function_data["arguments"]
                    )
                    tool_call_json = {
                        "name": function_data["name"],
                        "parameters": arguments,
                    }
                    tool_calls_text.append(json.dumps(tool_call_json))

                messages["messages"].append(
                    {
                        "role": "assistant",
                        "content": [{"text": "".join(tool_calls_text).lower()}],
                    }
                )
            else:
                messages["messages"].append(
                    {"role": "assistant", "content": [{"text": message["content"].lower()}]}
                )

    # Remove the last message if it's not from assistant
    if messages["messages"] and messages["messages"][-1]["role"] != "assistant":
        messages["messages"].pop()

    return messages
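
The following standalone sketch shows what the tool transformation produces for a single tool, and how an assistant tool call ends up as lowercase JSON text. The function `to_tool_spec` repeats the logic of `transform_tool_format` so the example runs on its own; the `get_weather` tool is hypothetical.

```python
import json


def to_tool_spec(tool):
    """Same mapping as transform_tool_format, repeated so this runs standalone."""
    if "function" not in tool:
        return tool
    function = tool["function"]
    return {
        "toolSpec": {
            "name": function["name"],
            "description": function["description"],
            "inputSchema": {"json": function["parameters"]},
        }
    }


# Hypothetical tool definition in the source ("function") layout
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

spec = to_tool_spec(tool)
# An assistant tool call is serialized as a lowercase JSON string
call_text = json.dumps(
    {"name": "get_weather", "parameters": {"city": "Paris"}}
).lower()
print(json.dumps(spec, indent=2))
print(call_text)
```

Because the text is lowercased after serialization, argument values are lowercased too (`"Paris"` becomes `"paris"`), which is worth keeping in mind if a downstream tool is case-sensitive.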

In [None]:
def prepare_dataset_test(sample):
    """Parse sample and format it for validation dataset."""
    # Process tools
    tools = json.loads(sample["tools"]) if sample.get("tools") else []
    transformed_tools = [transform_tool_format(tool) for tool in tools]

    # Initialize result
    result = []
    conversation_history = []

    # Extract system message
    system_content = ""
    for message in sample["messages"]:
        if message["role"] == "system":
            system_content = message["content"]
            if tools:
                system_content += (
                    "\nYou may call one or more functions to assist with the user query.\n\n"
                    "You are provided with function signatures within <tools></tools> XML tags:\n"
                    "<tools>\n"
                    f"{json.dumps({'tools': transformed_tools})}\n"
                    "</tools>\n\n"
                    "For each function call, return a json object with function name and parameters:\n"
                    '{"name": function name, "parameters": dictionary of argument name and its value}'
                )
            break

    # Process conversation turns
    for i, message in enumerate(sample["messages"]):
        if message["role"] == "system":
            continue

        # Add message to conversation history
        if message["role"] == "user":
            conversation_history.append(f"## User: {message['content']}")
        elif message["role"] == "assistant":
            if message.get("tool_calls"):
                # Format tool calls
                target_parts = []
                for tool_call in message["tool_calls"]:
                    function_data = tool_call["function"]
                    arguments = (
                        json.loads(function_data["arguments"])
                        if isinstance(function_data["arguments"], str)
                        else function_data["arguments"]
                    )
                    target_parts.append(
                        json.dumps(
                            {"name": function_data["name"], "parameters": arguments}
                        )
                    )
                target = "".join(target_parts)

                conversation_history.append(f"## Assistant: {target}")
            else:
                conversation_history.append(f"## Assistant: {message['content']}")
        elif message["role"] == "tool":
            conversation_history.append(f"## Function: {message['content']}")

        # Create input-target pair when we have an assistant message
        if message["role"] == "assistant":
            # Input is all previous conversation turns (the system prompt is stored separately)
            input_text = "\n".join(conversation_history[:-1])

            # Target is the assistant's response
            if message.get("tool_calls"):
                # Format tool calls
                target_parts = []
                for tool_call in message["tool_calls"]:
                    function_data = tool_call["function"]
                    arguments = (
                        json.loads(function_data["arguments"])
                        if isinstance(function_data["arguments"], str)
                        else function_data["arguments"]
                    )
                    target_parts.append(
                        json.dumps(
                            {"name": function_data["name"], "parameters": arguments}
                        )
                    )
                target = "".join(target_parts)
            else:
                target = message["content"]

            result.append({"system": system_content.lower(), "query": input_text.lower(), "response": target.lower()})

    return {"messages": result}
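
The key idea in `prepare_dataset_test` is that each assistant turn yields one evaluation example whose query is the conversation so far. A simplified sketch of that expansion, without tool calls or system prompts (the helper `expand_turns` and the sample conversation are illustrative):

```python
def expand_turns(messages):
    """Return one query/response pair per assistant turn, with growing history."""
    history, pairs = [], []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "assistant":
            # Everything said before this turn becomes the query
            pairs.append({"query": "\n".join(history), "response": content})
            history.append(f"## Assistant: {content}")
        elif role == "user":
            history.append(f"## User: {content}")
    return pairs


conversation = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello, how can i help?"},
    {"role": "user", "content": "what is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

pairs = expand_turns(conversation)
for pair in pairs:
    print(pair)
```

A two-turn conversation therefore produces two test examples, which is why the test set is flattened later in the notebook.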

### Step 1.4: Data Preparation in Converse Format for Train and Validation Datasets

In [None]:
from datasets import Dataset, DatasetDict
from random import randint

train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test)


dataset = DatasetDict(
    {"train": train_dataset, "test": test_dataset, "val": val_dataset}
)

train_dataset = dataset["train"].map(
    prepare_dataset, remove_columns=train_dataset.features
)

train_dataset = train_dataset.to_pandas()

train_dataset["messages"] = train_dataset["messages"].apply(clean_message_list)

print(train_dataset.iloc[randint(0, len(train_dataset) - 1)].to_json())

val_dataset = dataset["val"].map(prepare_dataset, remove_columns=val_dataset.features)

val_dataset = val_dataset.to_pandas()

val_dataset["messages"] = val_dataset["messages"].apply(clean_message_list)

print(val_dataset.iloc[randint(0, len(val_dataset) - 1)].to_json())

test_dataset = dataset["test"].map(
    prepare_dataset_test, remove_columns=test_dataset.features
)
print(test_dataset[randint(0, len(test_dataset) - 1)])

### Step 1.5: Data Preparation on Test Data for Offline Evaluation Post Fine-Tuning

Let's convert the test dataset to the following format:

Required Fields:

* query: String containing the question or instruction that needs an answer
* response: String containing the expected model output

Optional Fields:

* system: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query

Example Entry
```

{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```
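
A quick way to check records against the field list above is sketched below. The validator name and the sample lines are ours; the check covers only the presence and string type of the listed fields, not any further rules the evaluation container may enforce.

```python
import json

REQUIRED_FIELDS = ("query", "response")
OPTIONAL_FIELDS = ("system",)


def validate_eval_line(line):
    """Parse one JSONL line and check that the listed fields are strings."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string required field: {field}")
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            raise ValueError(f"optional field must be a string: {field}")
    return record


good = '{"system": "Answer briefly.", "query": "What is 1 + 1?", "response": "2"}'
bad = '{"system": "Answer briefly.", "query": "What is 1 + 1?"}'

print(validate_eval_line(good)["response"])
try:
    validate_eval_line(bad)
except ValueError as err:
    print(err)
```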

In [None]:
from datasets import Dataset

# Flatten the dataset
all_examples = []
for examples_list in test_dataset:
    # The first column contains the list of examples
    column_name = test_dataset.column_names[0]
    examples = examples_list[column_name]
    all_examples.extend(examples)

# Create a new dataset with the desired structure
test_dataset = Dataset.from_dict(
    {
        "system": [example["system"] for example in all_examples],
        "query": [example["query"] for example in all_examples],
        "response": [example["response"] for example in all_examples],
    }
)

print(test_dataset[randint(0, len(test_dataset) - 1)])

### Step 1.6: Upload all 3 curated datasets (train, test, val) to Amazon S3

The datasets transformed above are saved locally and then uploaded to Amazon S3 for use in SageMaker training:



In [None]:
import boto3
import shutil

In [None]:
s3_client = boto3.client('s3')

# save train_dataset to s3 using our SageMaker session
if default_prefix:
    input_path = f"{default_prefix}/datasets/nova-sft-peft"
else:
    input_path = f"datasets/nova-sft-peft"

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.jsonl"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/val/dataset.jsonl"
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/gen_qa.jsonl"

In [None]:
import os

# Save datasets locally, then upload them to S3
os.makedirs("./data/train", exist_ok=True)
os.makedirs("./data/val", exist_ok=True)
os.makedirs("./data/test", exist_ok=True)

train_dataset.to_json("./data/train/dataset.jsonl", orient="records", lines=True)
val_dataset.to_json("./data/val/dataset.jsonl", orient="records", lines=True)
test_dataset.to_json("./data/test/gen_qa.jsonl")

s3_client.upload_file(
    "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
)

s3_client.upload_file(
    "./data/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
)

s3_client.upload_file(
    "./data/test/gen_qa.jsonl", bucket_name, f"{input_path}/test/gen_qa.jsonl"
)

shutil.rmtree("./data")

print("Datasets uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(val_dataset_s3_path)
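
Before uploading, it can be useful to confirm that each local file is valid JSON Lines with the expected top-level keys. The sketch below is one possible check, demonstrated on a temporary file with made-up records; the helper `count_valid_jsonl` is not part of the notebook, and in practice it would need to run before the local `./data` directory is removed.

```python
import json
import os
import tempfile


def count_valid_jsonl(path, required_keys):
    """Count lines that parse as JSON objects containing required_keys."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"line {line_number} is missing {missing}")
            count += 1
    return count


# Self-contained demo on a temporary file with made-up records
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "dataset.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        for text in ("hi", "bye"):
            record = {
                "system": [],
                "messages": [{"role": "user", "content": [{"text": text}]}],
            }
            handle.write(json.dumps(record) + "\n")
    n_records = count_valid_jsonl(demo_path, required_keys=("system", "messages"))

print(n_records)
```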

## Step 2: Model fine-tuning

We now define the parameters to kick off a HyperPod PyTorch training job that runs supervised fine-tuning on a tool-calling dataset for our Amazon Nova model.

This section sets up and runs the fine-tuning job using SageMaker HyperPod. It uses Supervised Fine-Tuning (SFT) with Parameter-Efficient Fine-Tuning (PEFT) to train the model efficiently.


#### Image URI

This specifies the pre-built container for SFT fine-tuning, which is different from the DPO container.


In [None]:
image_uri_map = {
    "sft": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest",
    "dpo": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-DPO-latest",
    "ppo": "078496829476.dkr.ecr.us-west-2.amazonaws.com/nova-fine-tune-repo:HP-PPO-latest",
    "cpt": "078496829476.dkr.ecr.us-west-2.amazonaws.com/nova-fine-tune-repo:HP-CPT-latest",
    "eval": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
}



#### Configuring the Model and Recipe

This specifies which model to fine-tune and the recipe to use. The recipe includes "lora" indicating parameter-efficient fine-tuning, and "sft" indicating supervised fine-tuning.


In [None]:

import os

RECIPE_PATH = "fine-tuning/nova/nova_micro_p5_gpu_lora_sft"
INSTANCE = "p5.48xlarge"
RUN_NAME = "demo-sft-hp-nova-micro-run"
CONTAINER = image_uri_map["sft"]
# Replace with an S3 location in your own account
OUTPUT_PATH = "s3://sagemaker-us-east-1-905418197933/HP-SFT-RUNS/"
NAMESPACE = "kubeflow"

os.environ['NAMESPACE'] = NAMESPACE
os.environ['RECIPE_PATH'] = RECIPE_PATH
os.environ['INSTANCE'] = INSTANCE
os.environ['RUN_NAME'] = RUN_NAME
os.environ['CONTAINER'] = CONTAINER
os.environ['OUTPUT_PATH'] = OUTPUT_PATH
os.environ['TRAIN_DATA_PATH'] = train_dataset_s3_path
os.environ['VAL_DATA_PATH'] = val_dataset_s3_path

In [None]:
%%bash
echo "Starting HP CLI Installation....."
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli && pip install .
hyperpod --help

In [None]:
%%bash

cat << EOF > runner.sh
hyperpod start-job --namespace ${NAMESPACE} --recipe ${RECIPE_PATH} --override-parameters \\
     '{"instance_type": "${INSTANCE}",
       "container": "${CONTAINER}", 
       "recipes.run.name": "${RUN_NAME}",
        "recipes.run.data_s3_path": "${TRAIN_DATA_PATH}", 
        "recipes.run.output_s3_path": "${OUTPUT_PATH}",
        "recipes.run.validation_data_s3_path": "${VAL_DATA_PATH}"}'
EOF
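
The override parameters are passed as a single JSON string, which is easy to get wrong when quoting by hand. As an alternative sketch, the same command can be assembled in Python with `json.dumps`. It uses only the `start-job` flags shown above; the helper name and the placeholder values in angle brackets are ours.

```python
import json
import shlex


def build_start_job_command(namespace, recipe, overrides):
    """Return the hyperpod CLI invocation as an argument list."""
    return [
        "hyperpod", "start-job",
        "--namespace", namespace,
        "--recipe", recipe,
        "--override-parameters", json.dumps(overrides),
    ]


# Placeholder values; substitute the variables defined earlier in the notebook
overrides = {
    "instance_type": "p5.48xlarge",
    "container": "<container-image-uri>",
    "recipes.run.name": "demo-sft-hp-nova-micro-run",
    "recipes.run.data_s3_path": "s3://<bucket>/datasets/nova-sft-peft/train/dataset.jsonl",
    "recipes.run.output_s3_path": "s3://<bucket>/HP-SFT-RUNS/",
    "recipes.run.validation_data_s3_path": "s3://<bucket>/datasets/nova-sft-peft/val/dataset.jsonl",
}

command = build_start_job_command(
    "kubeflow", "fine-tuning/nova/nova_micro_p5_gpu_lora_sft", overrides
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the job without going through a shell script, so no manual escaping is needed.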

### Launch a Job

In [None]:
!bash runner.sh

### View the Configuration 

In [None]:
import os
import glob
import yaml
import json

def latest_job_manifest(base_directory="./results"):
    """
    Find the latest created folder in the base directory and load YAML file as JSON
    """
    try:
        # Find all folders in the base directory
        folders = glob.glob(os.path.join(base_directory, "*/"))
        
        if not folders:
            print(f"No folders found in {base_directory}")
            return None
        
        # Get the latest folder based on creation time
        latest_folder = max(folders, key=os.path.getctime)
        print(f"Latest folder found: {latest_folder}")
        
        # Construct the path to the YAML file
        yaml_path = os.path.join(latest_folder, "k8s_templates", "config")
        
        # Check if the k8s_templates/config directory exists
        if not os.path.exists(yaml_path):
            print(f"Directory {yaml_path} does not exist")
            return None
        
        # Find YAML files in the config directory
        yaml_files = glob.glob(os.path.join(yaml_path, "*.yaml")) + glob.glob(os.path.join(yaml_path, "*.yml"))
        
        if not yaml_files:
            print(f"No YAML files found in {yaml_path}")
            return None
        
        # Load the YAML manifest(s); if several exist, the last one loaded wins
        loaded_data = {}

        for yaml_file in yaml_files:
            try:
                with open(yaml_file, 'r', encoding='utf-8') as file:
                    loaded_data = yaml.safe_load(file)

                print(f"Successfully loaded: {yaml_file}")

            except yaml.YAMLError as e:
                print(f"Error parsing YAML file {yaml_file}: {e}")
            except Exception as e:
                print(f"Error reading file {yaml_file}: {e}")

        return loaded_data
        
    except Exception as e:
        print(f"Error: {e}")
        return None


job_manifest = latest_job_manifest()

In [None]:
unique_job_name = job_manifest['run']['name']
os.environ['JOB_NAME'] = unique_job_name
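
If no manifest was found, `job_manifest` is `None` and the lookup above fails with a `TypeError` that does not explain the cause. A slightly more defensive variant is sketched here; the helper name and the sample manifest are hypothetical.

```python
def job_name_from_manifest(manifest):
    """Return manifest['run']['name'] or raise a descriptive error."""
    if not isinstance(manifest, dict):
        raise ValueError("no job manifest was loaded; check the ./results folder")
    try:
        return manifest["run"]["name"]
    except (KeyError, TypeError) as err:
        raise ValueError("manifest has no run.name entry") from err


# Hypothetical manifest content for illustration
sample_manifest = {"run": {"name": "demo-sft-hp-nova-micro-run-abc12"}}
print(job_name_from_manifest(sample_manifest))
```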

### List Hyperpod Training Jobs

In [None]:
%%bash
bash ../cli_utility/01_manager/hyperpod_job_manager.sh --action list


### Cancel a Job

In [None]:
%%bash
bash ../cli_utility/01_manager/hyperpod_job_manager.sh --job_name $JOB_NAME --action cancel



### Describe the Cluster

In [None]:
%%bash
source .env

aws sagemaker describe-cluster --cluster-name $CLUSTER_NAME

### Monitor the Job and CloudWatch Logs


In [None]:
%%bash 
source .env
../cli_utility/01_manager/hyperpod_job_manager.sh \
    --job_name $JOB_NAME \
    --action monitor \
    --cluster-name $CLUSTER_NAME \
    --cluster-id $CLUSTER_ID \
    --rig-name $RIG_NAME

> _**Note:** This job can take up to 20-30 minutes, depending on the dataset used._

### Once the job finishes, `manifest.json` contains the path to the trained model

In [None]:
%%bash
aws s3 cp "${OUTPUT_PATH}${JOB_NAME}/manifest.json" - | jq .
ESCROW_BUCKET=$(aws s3 cp "${OUTPUT_PATH}${JOB_NAME}/manifest.json" - | jq '.checkpoint_s3_bucket')

cat > escrow.env << EOF
export ESCROW_BUCKET=$ESCROW_BUCKET
EOF
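
The same value can be read in Python once the manifest has been downloaded. The sketch below parses a manifest payload and returns its `checkpoint_s3_bucket` entry, the key used in the `jq` command above; the helper name and the sample manifest content are made up.

```python
import json


def checkpoint_location(manifest_text):
    """Return the checkpoint_s3_bucket value from a manifest.json payload."""
    manifest = json.loads(manifest_text)
    location = manifest.get("checkpoint_s3_bucket")
    if not location:
        raise KeyError("checkpoint_s3_bucket not found in manifest")
    return location


# Made-up manifest content; a real one is written by the training job
sample_manifest = '{"checkpoint_s3_bucket": "s3://example-escrow-bucket/job-123/checkpoints"}'
print(checkpoint_location(sample_manifest))
```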



In [None]:
%%bash
# Copy the entire tensorboard directory
aws s3 cp "${OUTPUT_PATH}${JOB_NAME}/0/tensorboard/" ./tensorboard_logs/ --recursive

# Start TensorBoard (this will run in background)
tensorboard --logdir=./tensorboard_logs --port=6006 &

echo "TensorBoard started at http://localhost:6006"

![imgs/tb_board.png](imgs/tb_board.png)

## Step 3: Model Evaluation

We can now run an evaluation job on the fine-tuned model in the same way as the training job, changing only the necessary parameters.

In [None]:
EVAL_RECIPE_PATH = "evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval"
INSTANCE = "p5.48xlarge"
EVAL_CONTAINER = image_uri_map['eval']
EVAL_RUN_NAME = "my-eval-run"
EVAL_OUTPUT_PATH = 's3://sagemaker-us-east-1-905418197933/HP-Eavl-runs/'

import os
os.environ['EVAL_RECIPE_PATH'] = EVAL_RECIPE_PATH
os.environ['EVAL_RUN_NAME'] = EVAL_RUN_NAME
os.environ['INSTANCE'] = INSTANCE
os.environ['EVAL_CONTAINER'] = EVAL_CONTAINER
os.environ['EVAL_OUTPUT_PATH'] = EVAL_OUTPUT_PATH


#### Job Runner for evaluation

In [None]:
%%bash
source escrow.env
cat << EOF > evaluator.sh
hyperpod start-job --namespace ${NAMESPACE} --recipe ${EVAL_RECIPE_PATH} --override-parameters \\
     '{"instance_type": "${INSTANCE}",
       "container": "${EVAL_CONTAINER}", 
       "recipes.run.name": "${EVAL_RUN_NAME}",
       "recipes.run.output_s3_path": "${EVAL_OUTPUT_PATH}",
       "recipes.run.model_name_or_path": "${ESCROW_BUCKET}"}'
EOF

In [None]:
!bash evaluator.sh

In [None]:
job_manifest = latest_job_manifest()

unique_job_name = job_manifest['run']['name']
os.environ['JOB_NAME'] = unique_job_name

In [None]:
!echo $JOB_NAME

##### Monitoring results post-evaluation

In [None]:
%%bash 
source .env
../cli_utility/01_manager/hyperpod_job_manager.sh \
    --job_name $JOB_NAME \
    --action monitor \
    --cluster-name $CLUSTER_NAME \
    --cluster-id $CLUSTER_ID \
    --rig-name $RIG_NAME

## Step 4: Inference


Once evaluation is complete, the fine-tuned model can be hosted on Amazon Bedrock for inference.