# Exporting a Llama 3.2 Embedding Model to ONNX and TensorRT

## Goal

After fine-tuning the Llama 3.2 model into an embedding model, you can export it to ONNX and TensorRT for fast inference. Follow the steps below in order to generate the ONNX and TensorRT models.

## NeMo Tools and Resources

* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)

## Software Requirements

* Access to the latest NeMo Framework NGC containers


## Hardware Requirements

* This playbook has been tested on the following hardware: a single A6000, a single H100, 2x A6000, and 8x H100. It can be scaled to multiple GPUs and multiple nodes by modifying the appropriate parameters.


#### Launch the NeMo Framework container as follows: 

Depending on the number of GPUs available, adjust the `--gpus` argument accordingly:
```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.02
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''
```

In [None]:
import os
from pathlib import Path
import torch
from typing import Literal, Optional, Union
from nemo.collections.llm.gpt.model import get_llama_bidirectional_hf_model

In [None]:
# Paths
hf_model_path = "/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/"
quantization_calibration_data = "/opt/checkpoints/question_doc_pairs_500.json"

# HF model parameters
pooling_mode = "avg"
normalize = False

# ONNX params
opset = 17
onnx_export_path = "/opt/checkpoints/llama_embedding_onnx/"
export_dtype = "fp32"
use_dimension_arg = False

# TRT params
trt_model_path = Path("/opt/checkpoints/llama_embedding_trt/model.plan")
override_layers_to_fp32 = ["/model/norm/", "/pooling_module", "/ReduceL2", "/Div"]
override_layernorm_precision_to_fp32 = True
profiling_verbosity = "layer_names_only"
export_to_trt = True

In [None]:
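# Load the Hugging Face checkpoint as a bidirectional embedding model
# (with the pooling and normalization settings chosen above) before exporting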
model, tokenizer = get_llama_bidirectional_hf_model(
    model_name_or_path=hf_model_path,
    normalize=normalize,
    pooling_mode=pooling_mode,
    trust_remote_code=True,
)
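
To make the `pooling_mode = "avg"` and `normalize` settings more concrete, here is a minimal NumPy sketch of masked average pooling with optional L2 normalization. It is an illustration only: `masked_mean_pool` is a hypothetical helper, not a NeMo function, and it assumes that "avg" means a mean over non-padding tokens. The actual implementation inside the adapted model may differ in its details.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask, normalize=False):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: float array of shape (batch, seq_len, dim)
    attention_mask: 0/1 array of shape (batch, seq_len)
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for fully padded rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    if normalize:
        # L2-normalize each embedding to unit length
        pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled

# One sequence with two real tokens and one padding token
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded token is excluded from the average, so only the first two token vectors contribute to the result.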

In [None]:
from nemo.export.onnx_llm_exporter import OnnxLLMExporter

if use_dimension_arg:
    input_names = ["input_ids", "attention_mask", "dimensions"]
    dynamic_axes_input = {"input_ids": {0: "batch_size", 1: "seq_length"},
                            "attention_mask": {0: "batch_size", 1: "seq_length"}, "dimensions": {0: "batch_size"}}
else:
    input_names = ["input_ids", "attention_mask"]
    dynamic_axes_input = {"input_ids": {0: "batch_size", 1: "seq_length"},
                            "attention_mask": {0: "batch_size", 1: "seq_length"}}

output_names = ["embeddings"]
dynamic_axes_output = {"embeddings": {0: "batch_size", 1: "embedding_dim"}}

onnx_exporter = OnnxLLMExporter(
    onnx_model_dir=onnx_export_path, 
    model=model,
    tokenizer=tokenizer,
)

onnx_exporter.export(
    input_names=input_names,
    output_names=output_names,
    opset=opset,
    dynamic_axes_input=dynamic_axes_input,
    dynamic_axes_output=dynamic_axes_output,
    export_dtype=export_dtype,  # use the value configured above instead of a hard-coded string
)

In [None]:
if export_to_trt:
    if use_dimension_arg:
        input_profiles = [{"input_ids": [[1, 3], [16, 128], [64, 256]], "attention_mask": [[1, 3], [16, 128], [64, 256]],
                            "dimensions": [[1], [16], [64]]}]
    else:
        input_profiles = [{"input_ids": [[1, 3], [16, 128], [64, 256]], "attention_mask": [[1, 3], [16, 128], [64, 256]]}]

    onnx_exporter.export_onnx_to_trt(
        trt_model_path=Path(trt_model_path),
        profiles=input_profiles,
        override_layernorm_precision_to_fp32=override_layernorm_precision_to_fp32,
        override_layers_to_fp32=override_layers_to_fp32,
        profiling_verbosity=profiling_verbosity,
    )
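
Each entry in `input_profiles` above appears to list three shapes per input, which this sketch reads as `[min, opt, max]` in the usual TensorRT optimization-profile sense. Under that assumption, the small helper below checks whether a batch of a given shape would fall inside a profile before it is sent to the engine. The names `shape_fits_profile` and `batch_fits` are hypothetical and not part of NeMo or TensorRT.

```python
def shape_fits_profile(shape, bounds):
    """Return True if `shape` lies within [min_shape, max_shape] on every axis.

    bounds is assumed to be [min_shape, opt_shape, max_shape].
    """
    min_shape, _opt_shape, max_shape = bounds
    if len(shape) != len(min_shape):
        return False
    return all(lo <= dim <= hi for lo, dim, hi in zip(min_shape, shape, max_shape))

def batch_fits(profile, shapes):
    """Check every input named in the profile against the provided shapes."""
    return all(
        name in shapes and shape_fits_profile(shapes[name], bounds)
        for name, bounds in profile.items()
    )

# Same values as the profile used without the "dimensions" input
profile = {
    "input_ids": [[1, 3], [16, 128], [64, 256]],
    "attention_mask": [[1, 3], [16, 128], [64, 256]],
}

ok = batch_fits(profile, {"input_ids": (8, 64), "attention_mask": (8, 64)})
too_big = batch_fits(profile, {"input_ids": (128, 64), "attention_mask": (128, 64)})
print(ok, too_big)
```

With this profile, batches larger than 64 sequences or longer than 256 tokens, as well as sequences shorter than 3 tokens, fall outside the range and would need a different profile or padding.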

In [None]:
prompt = ["hello", "world"]

if use_dimension_arg:
    prompt = onnx_exporter.get_tokenizer(prompt)
    prompt["dimensions"] = [[2]]

print(onnx_exporter.forward(prompt))
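
Because `normalize = False` in the configuration above, the returned embeddings are not necessarily unit length, so a plain dot product is not a cosine similarity. The sketch below shows one way to compare embeddings with NumPy. It assumes the output of `forward` can be converted to a 2-D array of shape `(batch, embedding_dim)`, which is not confirmed here, and it uses made-up vectors instead of real model output. `cosine_similarity_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a (batch, dim) array."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero rows
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Illustrative vectors only, not real embeddings
example = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
sims = cosine_similarity_matrix(example)
print(np.round(sims, 3))
```

The same function can also serve as a rough sanity check when comparing embeddings produced by different backends for the same prompts, since values close to 1 indicate near-identical directions.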