### Step 1: Install Necessary Libraries
Make sure you have the required libraries installed:


In [11]:
!pip install -q transformers torch onnx onnxruntime-gpu

### Step 2: Load the Model and Export to ONNX


In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_name = "microsoft/Phi-3.5-vision-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Disable FlashAttention, if available
if hasattr(model.config, "attn_implementation"):
    model.config.attn_implementation = "eager"  # Disable FlashAttention
if hasattr(model.config, "use_flash_attention"):
    model.config.use_flash_attention = False
if hasattr(model.config, "flash_attention"):
    model.config.flash_attention = False

# Convert model to half-precision (FP16)
model = model.half()

# Move model to GPU for ONNX export
model = model.to("cuda")

# Create dummy input for the model on GPU
dummy_input = torch.randint(0, 100, (1, 128), device="cuda").to(torch.int64)

# Export the model to ONNX format
torch.onnx.export(
    model,
    dummy_input,
    "phi_3.5_vision.onnx",
    input_names=["input_ids"],
    output_names=["output"],
    dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence"}},
    opset_version=13,
)

print("Model exported to ONNX format successfully!")


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.41it/s]
  rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
  rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
  seq_len = seq_len or torch.max(position_ids) + 1
  if seq_len > self.original_max_position_embeddings:
  ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)
  and kv_seq_len > self.config.sliding_window
  if not use_sliding_windows:
  out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(


Model exported to ONNX format successfully!


### Step 3: Quantize the ONNX Model to INT8
Using ONNX Runtime, we’ll quantize the model to INT8.

In [13]:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the model to INT8
quantized_model_path = "phi_3.5_vision_quantized.onnx"
quantize_dynamic(
    "phi_3.5_vision.onnx",
    quantized_model_path,
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8
)

print("Model quantized to INT8 successfully!")




Model quantized to INT8 successfully!


### Step 4: Convert the Quantized ONNX Model to TensorRT
Now, we’ll use TensorRT to optimize this INT8 model. The following steps assume that you are on an NVIDIA GPU (e.g., RTX 3090). If not, this step can be skipped, and you can perform this conversion directly on the Jetson Orin.



# Run this command in the notebook to convert the quantized ONNX model to TensorRT format


In [None]:
!trtexec --onnx=phi_3.5_vision_quantized.onnx --saveEngine=phi_3.5_vision_quantized.trt --int8

/bin/bash: line 1: trtexec: command not found


: 