# Part II: Supervised Fine-tuning Using NeMo Customizer

This notebook covers the following:

0. [Prerequisites: Configurations, Health Checks, and Namespaces](#step-0)
1. [Upload Data to NeMo Data Store](#step-1)
2. [SFT Customization with NeMo Customizer](#step-2)
3. [Model Deployment with Deployment Management Service](#step-3)
4. [Running Inference on the Customized Model with NVIDIA NIM](#step-4)

In [13]:
import os
import json
from time import sleep, time

import requests
import numpy as np
from huggingface_hub import HfApi
from openai import OpenAI

from nemo_microservices import NeMoMicroservices, APIStatusError

from config import *

<a id="step-0"></a>
## Prerequisites: Configurations, Health Checks, and Namespaces

### Prerequisites

Before you proceed, make sure you have completed the **[1_data_preparation.ipynb](./1_data_preparation.ipynb)** notebook to:
- Download and format the SPECTER dataset
- Create training and validation splits
- Generate the required data files (`training.jsonl` and `validation.jsonl`)

This notebook will upload that prepared data, fine-tune an embedding model, deploy it as a NIM, and run inference.

### Configure NeMo Microservices Endpoints

Import the configurations from `config.py` and initialize the NeMo Microservices SDK client to interact with the platform services.

In [14]:
# Initialize NeMo Microservices SDK client
nemo_client = NeMoMicroservices(
    base_url=NEMO_URL,
    inference_base_url=NIM_URL,
)

In [15]:
print(f"Data Store endpoint: {NDS_URL}")
print(f"Entity Store, Customizer, Evaluator endpoint: {NEMO_URL}")
print(f"NIM endpoint: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Base Model for Customization: {BASE_MODEL}@{BASE_MODEL_VERSION}")
print(f"Retriever NIM Image: {BASE_MODEL_IMAGE_NAME_EMBEDDING}:{BASE_MODEL_IMAGE_TAG_EMBEDDING}")

Data Store endpoint: http://data-store.test
Entity Store, Customizer, Evaluator endpoint: http://nemo.test
NIM endpoint: http://nim.test
Namespace: embed-sft-ns
Base Model for Customization: nvidia/llama-3.2-nv-embedqa-1b@v2
Retriever NIM Image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.6.0


### Configure Path to Prepared Data

The following code sets the file paths to the prepared dataset files. We use `train_fp` and `val_fp` as shorthand for "training file path" and "validation file path".

In [16]:
# Path where data preparation notebook saved finetuning and evaluation data
DATA_ROOT = os.path.join(os.getcwd(), "data/specter_10pct")
CUSTOMIZATION_DATA_ROOT = os.path.join(DATA_ROOT, "training")
VALIDATION_DATA_ROOT = os.path.join(DATA_ROOT, "validation")

# Sanity checks
train_fp = f"{CUSTOMIZATION_DATA_ROOT}/training.jsonl"
assert os.path.exists(train_fp), f"The training data at '{train_fp}' does not exist. Please ensure that the data was prepared successfully."

val_fp = f"{VALIDATION_DATA_ROOT}/validation.jsonl"
assert os.path.exists(val_fp), f"The validation data at '{val_fp}' does not exist. Please ensure that the data was prepared successfully."
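Beyond checking that the files exist, it can help to confirm that every line parses as JSON and carries the expected fields before uploading. The following is a minimal sketch of such a check. `validate_jsonl` is a hypothetical helper (not part of the NeMo SDK), and the default field names `query`, `pos_doc`, and `neg_doc` are an assumption based on the sample record used in section 4.2; adjust them to match your prepared data.

```python
import json


def validate_jsonl(path, required_keys=("query", "pos_doc", "neg_doc")):
    """Return the number of records in a JSONL file, raising on malformed lines.

    The default required_keys are an assumption; change them to fit your schema.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({e.msg})") from e
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"{path}:{line_no}: missing keys {missing}")
            count += 1
    return count
```

For example, `validate_jsonl(train_fp)` would return the number of training records, or raise a `ValueError` that names the first offending line.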

### Resource Organization Using Namespaces

**Why Namespaces Matter:** When working with NeMo Microservices, you'll create datasets, models, and deployments. Without namespaces, all resources would mix together, making it hard to organize experiments or separate projects.

**What are Namespaces?** A [namespace](https://docs.nvidia.com/nemo/microservices/latest/manage-entities/namespaces/index.html) is a logical container that isolates your resources. Think of it like a project folder that keeps your datasets, models, and deployments organized and separate from others.

**Entity Store vs Data Store:** NeMo Microservices uses two storage systems:
- **Entity Store**: Registry for models, configurations, and metadata
- **Data Store**: Storage for training and evaluation datasets

Both use namespaces to organize resources. You'll create a namespace in both stores for this tutorial.

#### Create Namespace

The following code creates the tutorial namespace in both the Entity Store and the Data Store.

In [40]:
def create_namespaces(nemo_client, ds_host, namespace):
    # Create namespace in Entity Store
    try:
        namespace_obj = nemo_client.namespaces.create(id=namespace)
        print(f"Created namespace in Entity Store: {namespace_obj.id}")
    except APIStatusError as e:
        # Treat 409 and 422 responses as "namespace already exists"
        if e.status_code in (409, 422):
            print(f"Namespace {namespace} already exists in Entity Store")
        else:
            raise

    # Create namespace in Data Store (still using requests as SDK doesn't cover Data Store)
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store during namespace creation: {resp.status_code}"
    print(f"Data Store namespace creation response: {resp}")

create_namespaces(nemo_client=nemo_client, ds_host=NDS_URL, namespace=NMS_NAMESPACE)

Created namespace in Entity Store: embed-sft-ns
Data Store namespace creation response: <Response [201]>


#### Verify Namespaces

The following calls to the [Data Store API](https://docs.nvidia.com/nemo/microservices/latest/api/datastore.html) and the [Entity Store API](https://docs.nvidia.com/nemo/microservices/latest/api/entity-store.html) retrieve the namespace created in the previous cell.

In [18]:
# Verify Namespace in Data Store (using requests as SDK doesn't cover Data Store)
response = requests.get(f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}")
print(f"Data Store - Status Code: {response.status_code}\nResponse JSON: {response.json()}")

# Verify Namespace in Entity Store
namespace_obj = nemo_client.namespaces.retrieve(namespace_id=NMS_NAMESPACE)
print(f"\nEntity Store - Namespace: {namespace_obj.id}")
print(f"Created at: {namespace_obj.created_at}")
print(f"Description: {namespace_obj.description}")
print(f"Project: {namespace_obj.project}")

Data Store - Status Code: 201
Response JSON: {'namespace': 'embed-sft-ns', 'created_at': '2025-11-03T16:43:04Z', 'updated_at': '2025-11-03T21:47:38Z'}

Entity Store - Namespace: embed-sft-ns
Created at: 2025-11-03 16:43:04.282771
Description: None
Project: None


> **Tip**: To delete a namespace, use the following code:

```python
nemo_client.namespaces.delete(
    namespace_id=NMS_NAMESPACE, 
)
```

---
<a id="step-1"></a>
## Step 1: Upload Data to NeMo Data Store

The NeMo Data Store supports data management using the Hugging Face `HfApi` Client. 

**This step does not interact with Hugging Face at all; it only uses the client library to communicate with NeMo Data Store.** This contrasts with the previous notebook, where we used the `load_dataset` API to download the dataset from the Hugging Face Hub.

For more information, see the [documentation](https://docs.nvidia.com/nemo/microservices/latest/manage-entities/tutorials/manage-dataset-files.html#set-up-hugging-face-client-with-nemo-data-store).

### 1.1 Create Repository

In [19]:
repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"

In [20]:
hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token=None)


# Create repo
hf_api.create_repo(
    repo_id=repo_id,
    repo_type='dataset',
)

print(f"✅ Created repository: {repo_id}")

✅ Created repository: embed-sft-ns/embed-sft-data


`Tip:` To delete a repo, use the following method:

```python
hf_api.delete_repo(
    repo_id=repo_id,
    repo_type='dataset',
)
```

After the upload in section 1.2 below, the returned `CommitInfo` output confirms that the files were successfully uploaded to the NeMo Data Store. The `oid` (object ID) is a hash identifying that specific upload, and the `commit_message` describes what was uploaded.


### 1.2 Upload Dataset Files to NeMo Data Store

In [21]:
hf_api.upload_file(
    path_or_fileobj=train_fp,
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

hf_api.upload_file(
    path_or_fileobj=val_fp,
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

training.jsonl: 100%|██████████| 15.8M/15.8M [00:00<00:00, 231MB/s]


validation.jsonl: 100%|██████████| 882k/882k [00:00<00:00, 79.2MB/s]


CommitInfo(commit_url='', commit_message='Upload validation/validation.jsonl with huggingface_hub', commit_description='', oid='0fb9cfe46f41ac80b6884c99142e0b280b720cf8', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

### 1.3 Register the Dataset with NeMo Entity Store

To use a dataset for operations such as evaluation and customization, register it with the `nemo_client.datasets.create()` method.
Once registered, the dataset can be referred to by its namespace and name.

In [22]:
# Create dataset
dataset = nemo_client.datasets.create(
    name=DATASET_NAME,
    namespace=NMS_NAMESPACE,
    description="Embedding SFT Dataset",
    files_url=f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
    project="embedding_sft",
)
print(f"Created dataset: {dataset.namespace}/{dataset.name}")

Created dataset: embed-sft-ns/embed-sft-data


`Tip:` To delete a dataset, use the following code:

```python
# Delete dataset
dataset = nemo_client.datasets.delete(
    namespace=NMS_NAMESPACE,
    dataset_name=DATASET_NAME,
)
print(f"Deletion status: {dataset.message}")
```

In [23]:
# Sanity check to validate dataset
dataset_obj = nemo_client.datasets.retrieve(namespace=NMS_NAMESPACE, dataset_name=DATASET_NAME)

print("Files URL:", dataset_obj.files_url)
assert dataset_obj.files_url == f"hf://datasets/{repo_id}"

Files URL: hf://datasets/embed-sft-ns/embed-sft-data
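The `files_url` checked above follows the pattern `hf://datasets/{namespace}/{dataset_name}`. As a small illustration of that convention, the sketch below builds and splits such URLs. Both functions are hypothetical helpers written for this example; they are not part of the NeMo SDK or `huggingface_hub`.

```python
from typing import Tuple

HF_DATASET_PREFIX = "hf://datasets/"


def build_files_url(namespace: str, dataset_name: str) -> str:
    """Build the files URL used when registering a dataset."""
    return f"{HF_DATASET_PREFIX}{namespace}/{dataset_name}"


def split_files_url(files_url: str) -> Tuple[str, str]:
    """Split a files URL back into (namespace, dataset_name)."""
    if not files_url.startswith(HF_DATASET_PREFIX):
        raise ValueError(f"Not a dataset files URL: {files_url!r}")
    repo_id = files_url[len(HF_DATASET_PREFIX):]
    namespace, sep, dataset_name = repo_id.partition("/")
    if not sep or not namespace or not dataset_name:
        raise ValueError(f"Expected 'namespace/name' after the prefix: {files_url!r}")
    return namespace, dataset_name
```

With the values from this tutorial, `build_files_url("embed-sft-ns", "embed-sft-data")` produces the same string as the `Files URL` printed above.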


---
<a id="step-2"></a>
## Step 2: Embedding Model SFT with NeMo Customizer

### 2.1 Create a Customization Configuration

A customization configuration defines the model, hardware, and training settings for fine-tuning jobs.

**Off-the-Shelf vs Custom Configurations:**
- **Off-the-shelf configs** (e.g., `llama-3.2-1b-embed@v1.0.0+A100`) are pre-built and ready to use. To use one, you would reference it by name instead of creating a new config.
- **Custom configs** let you specify your own training parameters, hardware requirements, and model settings.

**The `target` Parameter:** Specifies the base model checkpoint to fine-tune. We're using [llama-3_2-nv-embedqa-1b-v2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2), a multilingual embedding model trained for text question-answering retrieval tasks.

The following code creates a custom configuration named `llama-embed-sft-config@v1`:

In [24]:
SFT_CONFIG_NAME = "llama-embed-sft-config@v1"

try:
    sft_config = nemo_client.customization.configs.create(
        name=SFT_CONFIG_NAME,
        namespace=NMS_NAMESPACE,
        description="Configuration for Llama 3.2 1B Embedding on A100 GPUs",
        target=f"{BASE_MODEL}@{BASE_MODEL_VERSION}",
        training_options=[
            {
                "training_type": "sft",
                "finetuning_type": "all_weights",
                "num_gpus": 1,
                "num_nodes": 1,
                "micro_batch_size": 8,
                "tensor_parallel_size": 1,
                "pipeline_parallel_size": 1,
                "use_sequence_parallel": False
            }
        ],
        training_precision="bf16",
        max_seq_length=2048
    )
    print(f"Created config: {sft_config.name}")
except APIStatusError as e:
    if e.status_code == 409:
        print(f"Config {SFT_CONFIG_NAME} already exists (409 Conflict)")
    else:
        print(f"API error {e.status_code}: {e}")
        raise e

Created config: llama-embed-sft-config@v1


### 2.2 Start the Training Job

Start the training job by calling the `nemo_client.customization.jobs.create()` method.
The following code sets the training parameters and starts the job.

> **The training job will take approximately 45 minutes to complete.**

In [25]:
# If WANDB_API_KEY is set, we send it in the request header, which will report the training metrics to Weights & Biases (WandB).
if WANDB_API_KEY:
    client_with_wandb = nemo_client.with_options(default_headers={"wandb-api-key": WANDB_API_KEY})
else:
    client_with_wandb = nemo_client

customization = client_with_wandb.customization.jobs.create(
    name="llama-3.2-1b-embed-sft",
    config=f"{NMS_NAMESPACE}/{sft_config.name}",
    dataset={
        "namespace": NMS_NAMESPACE,
        "name": DATASET_NAME, 
    },
    hyperparameters={
        "training_type": "sft",
        "finetuning_type": "all_weights",
        "epochs": 1,
        "batch_size": 256,
        "learning_rate": 0.000005,
    },
    output_model=f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}"
)
print(f"Created customization job: {customization.id}")
print(f"Status: {customization.status}")
print("Job Details: ", customization)

Created customization job: cust-MZopzmY1UjbPcM5oZAqTwc
Status: created
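The job requests a `batch_size` of 256 while the configuration sets `micro_batch_size` to 8 on a single GPU. Under the common training convention that the global batch equals the micro batch times the number of data-parallel replicas times the number of gradient-accumulation steps, the implied number of accumulation steps can be worked out as below. This relationship is an assumption about how the trainer reconciles the two values, not something this notebook states about NeMo Customizer, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(
    global_batch_size: int,
    micro_batch_size: int,
    num_gpus: int = 1,
    num_nodes: int = 1,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
) -> int:
    """Gradient-accumulation steps implied by a global and a micro batch size.

    Assumes: global = micro * data_parallel_replicas * accumulation_steps.
    """
    model_parallel = tensor_parallel_size * pipeline_parallel_size
    total_gpus = num_gpus * num_nodes
    if total_gpus % model_parallel != 0:
        raise ValueError("GPU count must be divisible by the model-parallel size")
    data_parallel = total_gpus // model_parallel
    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step != 0:
        raise ValueError("global batch size must be a multiple of micro batch * replicas")
    return global_batch_size // samples_per_step


# Values from this tutorial: batch_size=256, micro_batch_size=8, one GPU, one node
print(accumulation_steps(256, 8))  # prints 32
```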


> **Tip**: If you set `WANDB_API_KEY`, you can observe the run in the "nvidia-nemo-customizer" project, where the run name is the customization job ID reported above.

The following code sets variables for storing the customized model name.

In [26]:
CUSTOMIZED_MODEL = customization.output_model

# Once training is completed, this will be the name of the model that will be used to send inference queries
print("Name of the Customized Model: ", CUSTOMIZED_MODEL)

Name of the Customized Model:  embed-sft-ns/fullweight_sft_embedding@cust-MZopzmY1UjbPcM5oZAqTwc


**Tips**:
* If you configured the NeMo Customizer microservice with your own [Weights & Biases (WandB)](https://wandb.ai/) API key, you can find the training graphs and logs in the "nvidia-nemo-customizer" project of your WandB account. The run ID is similar to your job ID, `customization.id`.
  
* To cancel a job that you scheduled incorrectly, run the following code.
```python
nemo_client.customization.jobs.cancel(job_id=customization.id)
```

* To delete the model for a job that was scheduled incorrectly, use the following code.
```python
# CUSTOMIZED_MODEL.split('/')[1] extracts just the model name from `namespace/model_name` 
model = nemo_client.models.delete(namespace=NMS_NAMESPACE, model_name=CUSTOMIZED_MODEL.split('/')[1])
```
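The output model reference printed earlier has the shape `namespace/name@suffix`, and the Entity Store calls take the namespace and the model name as separate arguments. The sketch below splits such a reference with a little more validation than `split('/')[1]`. `parse_model_ref` and `ModelRef` are hypothetical names introduced for this example only.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    namespace: str
    name: str               # full model name, including any "@..." suffix
    base_name: str          # the name before "@"
    suffix: Optional[str]   # the part after "@", or None if absent


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'namespace/name[@suffix]' into its parts."""
    namespace, sep, name = ref.partition("/")
    if not sep or not namespace or not name:
        raise ValueError(f"Expected 'namespace/name', got {ref!r}")
    base_name, at, suffix = name.partition("@")
    return ModelRef(namespace, name, base_name, suffix if at else None)


ref = parse_model_ref("embed-sft-ns/fullweight_sft_embedding@cust-MZopzmY1UjbPcM5oZAqTwc")
print(ref.namespace)  # embed-sft-ns
print(ref.name)       # fullweight_sft_embedding@cust-MZopzmY1UjbPcM5oZAqTwc
```

Here `ref.name` is the same value that `CUSTOMIZED_MODEL.split('/')[1]` yields for this reference.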

### 2.3 Get Job Status

Retrieve the job status by using the `nemo_client.customization.jobs.status()` method. The following code retrieves the current status and progress of the fine-tuning job.

In [27]:
# Get job status
job_status = nemo_client.customization.jobs.status(job_id=customization.id)

print("Percentage done:", job_status.percentage_done)
print("Job Status:", json.dumps(job_status.model_dump(), indent=2, default=str))

Percentage done: 0.0
Job Status: {
  "created_at": "2025-11-04 23:48:30.195806",
  "status": "pending",
  "updated_at": "2025-11-04 23:48:30.195806",
  "best_epoch": null,
  "elapsed_time": 0.0,
  "epochs_completed": 0,
  "metrics": null,
  "percentage_done": 0.0,
  "status_logs": [
    {
      "updated_at": "2025-11-04 23:48:30.195806",
      "detail": null,
      "message": "created"
    },
    {
      "updated_at": "2025-11-04 23:48:30.195806",
      "detail": "The training job is pending",
      "message": "TrainingJobPending"
    }
  ],
  "steps_completed": 0,
  "steps_per_epoch": null,
  "train_loss": null,
  "val_loss": null
}
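The full status payload is verbose, so a one-line summary can be easier to scan while a job runs. The sketch below condenses a status dictionary shaped like the output above. `summarize_status` is a hypothetical helper, and it assumes that `status_logs` is ordered oldest to newest, as it appears in this sample.

```python
def summarize_status(status: dict) -> str:
    """Condense a job-status dictionary into a single line."""
    state = status.get("status") or "unknown"
    pct = status.get("percentage_done") or 0.0
    logs = status.get("status_logs") or []
    summary = f"{state} ({pct:.1f}% done)"
    if logs:
        latest = logs[-1]  # assumes chronological order
        summary += f"; latest event: {latest.get('message')}"
        if latest.get("detail"):
            summary += f" ({latest['detail']})"
    return summary


sample = {
    "status": "pending",
    "percentage_done": 0.0,
    "status_logs": [
        {"detail": None, "message": "created"},
        {"detail": "The training job is pending", "message": "TrainingJobPending"},
    ],
}
print(summarize_status(sample))
# pending (0.0% done); latest event: TrainingJobPending (The training job is pending)
```

In the notebook, the same function could be applied to `job_status.model_dump()`.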


The following cell defines a method for polling until the job completes.

**NOTE**: The reported progress is tied to the number of epochs completed. When training for a single epoch, as configured above, it jumps from 0% to 100% in one step.

In [28]:
# Add wait for the customization job to complete


def wait_job(nemo_client, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll a customization job with the SDK until it leaves the active states."""
    start_time = time()
    job = nemo_client.customization.jobs.retrieve(job_id=job_id)
    status = job.status

    while status in ["pending", "created", "running"]:
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        job = nemo_client.customization.jobs.retrieve(job_id=job_id)
        status = job.status
        progress = 0.0
        if status == "running" and job.status_details:
            progress = job.status_details.percentage_done or 0.0
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    # Check final status after exiting loop
    if status == "failed":
        print(f"Job failed after {time() - start_time:.2f} seconds.")
        raise RuntimeError(f"Job {job_id} failed.")

    return job

# Use a timeout longer than the ~45-minute training estimate given above
job = wait_job(nemo_client, customization.id, polling_interval=5, timeout=3600)

# Only sleep if job completed successfully
if job.status == "completed":
    print("\n✅ Customization job completed successfully!")
    print(f"Fine-tuned model saved as: {CUSTOMIZED_MODEL}")
    # Wait for 1 minute, to ensure any artifacts are saved
    sleep(60)

Job status: running after 5.17 seconds. Progress: 0.0%
Job status: running after 10.27 seconds. Progress: 0.0%
Job status: running after 15.42 seconds. Progress: 0.0%
Job status: running after 20.51 seconds. Progress: 0.0%
Job status: running after 25.55 seconds. Progress: 0.0%
Job status: running after 30.58 seconds. Progress: 0.0%
Job status: running after 35.62 seconds. Progress: 0.0%
Job status: running after 40.72 seconds. Progress: 0.0%
Job status: running after 45.84 seconds. Progress: 0.0%
Job status: running after 50.87 seconds. Progress: 0.0%
Job status: running after 55.93 seconds. Progress: 0.0%
Job status: running after 61.00 seconds. Progress: 0.0%
Job status: running after 66.08 seconds. Progress: 0.0%
Job status: running after 71.27 seconds. Progress: 0.0%
Job status: running after 76.33 seconds. Progress: 0.0%
Job status: running after 81.45 seconds. Progress: 0.0%
Job status: running after 86.53 seconds. Progress: 0.0%
Job status: running after 91.66 seconds. Progress
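The polling loop above calls the service directly, which makes it hard to exercise without a running job. As a sketch of one alternative design, the version below separates the loop from the SDK call by accepting any zero-argument function that returns the current status string. `poll_until_done` is a hypothetical helper; the set of active states mirrors the one used in `wait_job`.

```python
import time


def poll_until_done(
    fetch_status,
    active_states=("created", "pending", "running"),
    polling_interval=10.0,
    timeout=6000.0,
    sleep_fn=time.sleep,
    clock=time.monotonic,
):
    """Call fetch_status() until it returns a non-active state, then return that state.

    Raises TimeoutError if the status is still active after `timeout` seconds.
    sleep_fn and clock are injectable so the loop can be tested without waiting.
    """
    start = clock()
    status = fetch_status()
    while status in active_states:
        if clock() - start > timeout:
            raise TimeoutError(f"Still '{status}' after more than {timeout} seconds")
        sleep_fn(polling_interval)
        status = fetch_status()
    return status


# Simulated run: no real waiting, statuses come from a predefined sequence
fake_statuses = iter(["created", "pending", "running", "completed"])
final = poll_until_done(lambda: next(fake_statuses), sleep_fn=lambda seconds: None)
print(final)  # completed
```

In the notebook, `fetch_status` could be `lambda: nemo_client.customization.jobs.retrieve(job_id=customization.id).status`. Using `time.monotonic` rather than `time.time` keeps the timeout unaffected by system clock adjustments.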

### 2.4 Validate Availability of Custom Model
When the training job is complete, the customized model should appear in the NeMo Entity Store.
The following code lists all models in your namespace, sorted with the most recent first, including the one trained in the previous cells.
Look for a `name` field in the output that matches the model-name portion of your `CUSTOMIZED_MODEL`.
For more information about this API, see the [NeMo Entity Store API reference](https://docs.nvidia.com/nemo/microservices/latest/api/entity-store.html).

In [29]:
# List models with filters
models_page = nemo_client.models.list(
    filter={"namespace": NMS_NAMESPACE},
    sort="-created_at"
)

# Print models information
print(f"Found {len(models_page.data)} models in namespace {NMS_NAMESPACE}:")
for model in models_page.data:
    print(f"\nModel: {model.name}")
    print(f"  Namespace: {model.namespace}")
    print(f"  Base Model: {model.base_model}")
    print(f"  Created: {model.created_at}")
    if model.peft:
        print(f"  Fine-tuning Type: {model.peft.finetuning_type}")

Found 2 models in namespace embed-sft-ns:

Model: fullweight_sft_embedding@cust-MZopzmY1UjbPcM5oZAqTwc
  Namespace: embed-sft-ns
  Base Model: nvidia/llama-3.2-nv-embedqa-1b-v2
  Created: 2025-11-04 23:48:30.276652
  Fine-tuning Type: all_weights

Model: fullweight_sft_embedding
  Namespace: embed-sft-ns
  Base Model: None
  Created: 2025-11-04 23:48:19.586736


---

<a id="step-3"></a>
## Step 3: Deploy the Custom Model with NeMo Deployment Management Service (DMS)

Once supervised fine-tuning is complete, the model can be deployed as a service with NeMo DMS.

### 3.1 Create a Deployment Configuration

**Method of Deployment:** To make your fine-tuned model available for inference, we deploy it as a NIM (NVIDIA Inference Microservice) that accepts embedding requests via API calls.

**Deployment Management Service (DMS)** handles the deployment process. It retrieves the fine-tuned weights from the Entity Store and launches a NIM container configured for your model.

The following cell creates a deployment configuration that specifies the NIM image, GPU requirements, and model settings. Deployment configurations can be reused across multiple model deployments.

> **Note:** The Embedding NIM does not support LoRA adapters, so `disable_lora_support=True` must be set in the deployment configuration.

In [30]:
print(f"Deploying {CUSTOMIZED_MODEL} with DMS.")

deployment_config = nemo_client.deployment.configs.create(
    name="llama-embed-sft-deploy-config",
    namespace=NMS_NAMESPACE,
    model=CUSTOMIZED_MODEL,
    nim_deployment={
        "image_name": BASE_MODEL_IMAGE_NAME_EMBEDDING,
        "image_tag": BASE_MODEL_IMAGE_TAG_EMBEDDING,
        "gpu": 1,
        "disable_lora_support": True,
    },
)

print(f"Deployment config created: {deployment_config.name}")

Deploying embed-sft-ns/fullweight_sft_embedding@cust-MZopzmY1UjbPcM5oZAqTwc with DMS.
Deployment config created: llama-embed-sft-deploy-config


### 3.2 Deploy the Model

The following cell creates a Model Deployment for the SFT embedding model so we can send queries to it.

In [31]:
DEPLOYMENT_NAME = "llama-embed-sft-deploy"

deployment = nemo_client.deployment.model_deployments.create(
    name=DEPLOYMENT_NAME,
    namespace=NMS_NAMESPACE,
    config=f"{NMS_NAMESPACE}/{deployment_config.name}"
)

print(f"Model deployment created: {deployment.name}")

Model deployment created: llama-embed-sft-deploy


### 3.3 Check Status of Deployment

> **The first deployment of a model takes about 10 minutes because the container image must be pulled. Subsequent deployments are typically much faster.**

In [None]:
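def wait_deployment(nemo_client, deployment_id: str, namespace: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll a model deployment with the SDK until it is ready, fails, or times out."""
    # `time` and `sleep` are already imported at the top of the notebook
    deployment = None  # stays None if interrupted before the first status check
    start_time = time()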
    print(f"Monitoring deployment status for {deployment_id}...")
    
    while True:
        try:
            # Check for timeout
            if time() - start_time > timeout:
                raise RuntimeError(f"Deployment took more than {timeout} seconds.")
            
            # Get deployment status
            deployment = nemo_client.deployment.model_deployments.retrieve(
                deployment_name=deployment_id,
                namespace=namespace
            )
            
            status = deployment.status_details.status
            
            print(f"\rDeployment: {deployment_id} | Status: {status} after {time() - start_time:.2f} seconds", end="", flush=True)
            
            # Check if deployment is complete
            if status == 'ready':
                print("\n✅ Deployment completed successfully!")
                break
            elif status in ['failed', 'cancelled']:
                print(f"\n❌ Deployment {status}")
                raise RuntimeError(f"Deployment {deployment_id} {status}.")
            
            sleep(polling_interval)
            
        except KeyboardInterrupt:
            print("\nStopped by user")
            break
        except Exception as e:
            if "timeout" in str(e) or "RuntimeError" in str(type(e).__name__):
                raise
            print(f"\nError: {e}")
            sleep(30)
    
    return deployment


# DEPLOYMENT_ID = "llama-embed-sft-deploy"
deployment = wait_deployment(nemo_client, deployment.name, NMS_NAMESPACE, polling_interval=10, timeout=2400)
# Sleep for 10 seconds before running inference to avoid race condition on model inference in next cell
sleep(10)

Monitoring deployment status for llama-embed-sft-deploy...
Deployment: llama-embed-sft-deploy | Status: pending after 0.01 seconds

Deployment: llama-embed-sft-deploy | Status: ready after 70.17 seconds
✅ Deployment completed successfully!


---

<a id="step-4"></a>
## Step 4: Run Inference

### 4.1 Send a Request Using the OpenAI API

In [36]:

def get_embeddings(input_text, model_name, input_type="query"):
    """
    Create embeddings using OpenAI client
    
    Args:
        input_text (str or list): Text to embed
        model_name (str): Model name to use
        input_type (str): Either "query" or "passage"
    """
    # Initialize OpenAI client with NIM endpoint
    client = OpenAI(
        base_url=f"{NIM_URL}/v1",
        api_key="None"
    )

    try:
        if isinstance(input_text, str):
            input_text = [input_text]
            
        # Create embeddings
        response = client.embeddings.create(
            input=input_text,
            model=model_name,
            extra_body={"input_type": input_type}
        )
        
        print("✅ Embeddings inference successful!")
        print(f"Model: {model_name}")
        print(f"Input type: {input_type}")
        print(f"Embedding dimensions: {len(response.data[0].embedding)}")
        print(f"First 5 values: {response.data[0].embedding[:5]}")
        
        return response
        
    except Exception as e:
        print(f"❌ Embeddings inference failed: {e}")
        return None


# Quick test
print("=" * 80)
print("INFERENCE TEST: Single Query")
print("=" * 80)
test_query = "What is the population of Pittsburgh?"
print(f"\n🔍 Query: '{test_query}'")
print(f"📍 Model: {NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}")
print("\n" + "-" * 80 + "\n")

_ = get_embeddings(
    test_query, 
    f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}",
    input_type="query"
)

print("-" * 80)

INFERENCE TEST: Single Query

🔍 Query: 'What is the population of Pittsburgh?'
📍 Model: embed-sft-ns/fullweight_sft_embedding

--------------------------------------------------------------------------------



✅ Embeddings inference successful!
Model: embed-sft-ns/fullweight_sft_embedding
Input type: query
Embedding dimensions: 2048
First 5 values: [0.023131361231207848, 0.018074970692396164, 0.028078202158212662, 0.008604136295616627, 0.029157375916838646]
--------------------------------------------------------------------------------
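The OpenAI client call above sends a JSON body to the embeddings route under the configured `base_url`, with the contents of `extra_body` merged into that body. The sketch below builds the equivalent URL and payload by hand without sending anything, which can be useful for debugging with other HTTP tools. `build_embedding_request` is a hypothetical helper, and the `/v1/embeddings` path is inferred from the `base_url` used above rather than taken from NIM documentation.

```python
import json


def build_embedding_request(nim_url, model_name, texts, input_type="query"):
    """Return (url, payload) for an OpenAI-style embeddings request."""
    if isinstance(texts, str):
        texts = [texts]
    if input_type not in ("query", "passage"):
        raise ValueError("input_type must be 'query' or 'passage'")
    url = f"{nim_url.rstrip('/')}/v1/embeddings"
    payload = {
        "input": list(texts),
        "model": model_name,
        "input_type": input_type,  # sent via extra_body in the client call above
    }
    return url, payload


url, payload = build_embedding_request(
    "http://nim.test",
    "embed-sft-ns/fullweight_sft_embedding",
    "What is the population of Pittsburgh?",
)
print(url)
print(json.dumps(payload, indent=2))
```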


### 4.2 Example Similarity Calculation

As a sanity test, the following code calculates the cosine similarity between the embedding of a query and the embeddings of a positive (related) document and a negative (unrelated) document.

In [39]:
test_data = {
    "query": "The Neuroscience of Spontaneous Thought: An Evolving, Interdisciplinary Field", 
    "pos_doc": "Hippocampal Replay Is Not a Simple Function of Experience", 
    "neg_doc": ["An alternative to the dark matter paradigm: relativistic MOND gravitation"]
}

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    
    return dot_product / (norm_vec1 * norm_vec2)


# Test similarity calculation with scientific papers
print("Query paper:")
print(f"  {test_data['query']}")
print("\nRelated paper (positive):")
print(f"  {test_data['pos_doc']}")
print("\nUnrelated paper (negative):")
print(f"  {test_data['neg_doc'][0]}\n")

# Get query embedding
query_response = get_embeddings(
    test_data["query"],
    f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}",
    input_type="query"
)
print()
# Get both positive and negative document embeddings in one request
doc_response = get_embeddings(
    [test_data["pos_doc"], test_data["neg_doc"][0]],
    f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}",
    input_type="passage"
)

# Calculate cosine similarities
if query_response and doc_response:
    query_pos_similarity = cosine_similarity(query_response.data[0].embedding, doc_response.data[0].embedding)
    query_neg_similarity = cosine_similarity(query_response.data[0].embedding, doc_response.data[1].embedding)
    
    print("\nCosine similarity results:")
    print(f"  Query to related paper:   {query_pos_similarity:.4f}")
    print(f"  Query to unrelated paper: {query_neg_similarity:.4f}")
    print(f"  Difference: {query_pos_similarity - query_neg_similarity:.4f}")
    
    if query_pos_similarity > query_neg_similarity:
        print(f"\n✅ The model correctly ranked the related paper higher.")
    else:
        print(f"\nWarning: Unrelated paper has higher similarity.")
else:
    print("Could not calculate similarities")

Query paper:
  The Neuroscience of Spontaneous Thought: An Evolving, Interdisciplinary Field

Related paper (positive):
  Hippocampal Replay Is Not a Simple Function of Experience

Unrelated paper (negative):
  An alternative to the dark matter paradigm: relativistic MOND gravitation

✅ Embeddings inference successful!
Model: embed-sft-ns/fullweight_sft_embedding
Input type: query
Embedding dimensions: 2048
First 5 values: [0.010658088140189648, 0.01798108033835888, 0.018701737746596336, 0.02370860055088997, 0.007486638613045216]

✅ Embeddings inference successful!
Model: embed-sft-ns/fullweight_sft_embedding
Input type: passage
Embedding dimensions: 2048
First 5 values: [0.0011829029535874724, 0.0027695184107869864, -0.001614622538909316, -0.0012941394234076142, -0.0011520263506099582]

Cosine similarity results:
  Query to related paper:   0.3064
  Query to unrelated paper: 0.1877
  Difference: 0.1187

✅ The model correctly ranked the related paper higher.


The query should be far closer (higher cosine similarity) to the positive doc than the negative doc.
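The same comparison extends to ranking several candidate documents against one query. The following dependency-free sketch uses toy two-dimensional vectors rather than real embeddings; `cosine` and `rank_documents` are hypothetical helpers written for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors: document 2 points the same way as the query, document 0 is orthogonal
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_documents(query, docs)
print([index for index, _ in ranking])  # [2, 1, 0]
```

With real embeddings, `query_vec` would be `query_response.data[0].embedding` and `doc_vecs` the list of `item.embedding` values from `doc_response.data`.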

### 4.3 Take Note of Your Deployment Name
Take note of your custom model deployment name, as you will use it to run evaluation in the subsequent notebook.

In [35]:
print(f"Name of your deployment is: {NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}")

Name of your deployment is: embed-sft-ns/fullweight_sft_embedding


---

## Next Steps

✅ **Completed in this notebook:**
- Uploaded training data to NeMo Data Store
- Fine-tuned the `nvidia/llama-3.2-nv-embedqa-1b-v2` embedding model on scientific literature titles
- Deployed the fine-tuned model as a NIM
- Ran a basic inference test with similarity calculations

**Continue to [3_evaluation.ipynb](./3_evaluation.ipynb)** to:
- Evaluate your fine-tuned model on the BEIR Scidocs benchmark
- Compare performance metrics (recall, NDCG) against baseline
- Quantify the improvement from fine-tuning on domain-specific data
