# Converting OPUS-MT Models to ONNX for Triton

This notebook demonstrates how to convert OPUS-MT translation models from Hugging Face to ONNX format and deploy them to NVIDIA Triton Inference Server.

## What You'll Learn

- Download OPUS-MT models from Hugging Face Hub
- Convert PyTorch models to ONNX format
- Optimize models with quantization (optional)
- Create Triton model repository structure
- Write Triton configuration files
- Test models with Triton HTTP API
- Benchmark model performance

## Prerequisites

**Install required packages**:
```bash
pip install transformers torch onnx onnxruntime "optimum[onnxruntime]" "tritonclient[http]" sentencepiece protobuf
```

**Triton Server** must be running:
```bash
docker-compose up -d triton
```
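
Before running the later steps, it can help to confirm that the server is actually reachable. The sketch below probes Triton's standard `/v2/health/ready` endpoint using only the Python standard library. The helper names `health_url` and `triton_is_ready` are illustrative, and the default address assumes the HTTP port is published on `localhost:8000`.

```python
import urllib.request


def health_url(triton_url: str) -> str:
    """Build the readiness URL for a host:port string such as 'localhost:8000'."""
    return f"http://{triton_url}/v2/health/ready"


def triton_is_ready(triton_url: str = "localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(triton_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False
```

Calling `triton_is_ready()` returns `False` rather than raising when the container is still starting, so it can be used as a simple guard at the top of the notebook.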

**Estimated time**: 30-45 minutes (depending on model download and conversion)

## Step 1: Setup and Imports

In [None]:
import os
import json
import shutil
from pathlib import Path
import torch
from transformers import MarianMTModel, MarianTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import tritonclient.http as httpclient
import numpy as np

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

# Configuration
TRITON_MODEL_REPO = "/path/to/triton/model-repository"  # Update this path
TRITON_URL = "localhost:8000"

print("\n✅ Imports successful")

## Step 2: Choose a Model to Convert

Browse available OPUS-MT models at https://huggingface.co/Helsinki-NLP

Popular language pairs:
- `Helsinki-NLP/opus-mt-fr-en` (French → English)
- `Helsinki-NLP/opus-mt-es-en` (Spanish → English)
- `Helsinki-NLP/opus-mt-de-en` (German → English)
- `Helsinki-NLP/opus-mt-zh-en` (Chinese → English)
- `Helsinki-NLP/opus-mt-ja-en` (Japanese → English)
- `Helsinki-NLP/opus-mt-ar-en` (Arabic → English)

In [None]:
# Configure the model you want to convert
MODEL_ID = "Helsinki-NLP/opus-mt-fr-en"  # Change this to your desired model
MODEL_NAME = "opus-mt-fr-en"  # Triton model name (matches directory name)
SOURCE_LANG = "fr"
TARGET_LANG = "en"

print(f"Model to convert: {MODEL_ID}")
print(f"Triton model name: {MODEL_NAME}")
print(f"Language pair: {SOURCE_LANG} → {TARGET_LANG}")

## Step 3: Download Model from Hugging Face

This downloads the PyTorch model and tokenizer to your local cache.

In [None]:
print(f"Downloading {MODEL_ID} from Hugging Face...\n")

# Download tokenizer
tokenizer = MarianTokenizer.from_pretrained(MODEL_ID)
print("✅ Tokenizer loaded")
print(f"   Vocab size: {tokenizer.vocab_size}")
print(f"   Special tokens: {tokenizer.all_special_tokens}")

# Download model
model = MarianMTModel.from_pretrained(MODEL_ID)
print("\n✅ Model loaded")
print(f"   Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"   Model type: {type(model).__name__}")

## Step 4: Test PyTorch Model

Before converting, let's verify the model works correctly.

In [None]:
# Test translation
test_sentences = [
    "Bonjour, comment allez-vous?",
    "Je suis très heureux de vous rencontrer.",
    "La traduction automatique est impressionnante."
]

print("Testing PyTorch model...\n")

model.eval()
with torch.no_grad():
    for text in test_sentences:
        inputs = tokenizer(text, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs)
        translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"🇫🇷 {text}")
        print(f"🇬🇧 {translation}\n")

print("✅ PyTorch model working correctly")

## Step 5: Convert to ONNX

We'll use Hugging Face Optimum to convert the model to ONNX format.

In [None]:
print(f"Converting {MODEL_ID} to ONNX format...\n")

# Create output directory
onnx_output_dir = f"./{MODEL_NAME}-onnx"
os.makedirs(onnx_output_dir, exist_ok=True)

# Convert to ONNX using Optimum
onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    MODEL_ID,
    export=True,
    provider="CPUExecutionProvider"  # Use "CUDAExecutionProvider" for GPU
)

# Save ONNX model and tokenizer
onnx_model.save_pretrained(onnx_output_dir)
tokenizer.save_pretrained(onnx_output_dir)

print(f"✅ ONNX model saved to {onnx_output_dir}")
print(f"\nFiles created:")
for file in os.listdir(onnx_output_dir):
    filepath = os.path.join(onnx_output_dir, file)
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    print(f"   {file}: {size_mb:.2f} MB")

## Step 6: Test ONNX Model

Verify the ONNX model produces the same results as PyTorch.

In [None]:
print("Testing ONNX model...\n")

# Load ONNX model
onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    onnx_output_dir,
    provider="CPUExecutionProvider"
)
onnx_tokenizer = MarianTokenizer.from_pretrained(onnx_output_dir)

# Test same sentences
for text in test_sentences:
    inputs = onnx_tokenizer(text, return_tensors="pt", padding=True)
    outputs = onnx_model.generate(**inputs)
    translation = onnx_tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"🇫🇷 {text}")
    print(f"🇬🇧 {translation}\n")

print("✅ ONNX model working correctly")
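
The loop above only prints the ONNX translations, so agreement with PyTorch still has to be checked by eye. One way to make the check explicit is to collect both sets of outputs into lists and compare them pairwise. The helper below is a minimal sketch; `find_mismatches` and the sample strings are hypothetical and stand in for real model outputs.

```python
def find_mismatches(reference, candidate):
    """Return (index, reference, candidate) for every pair that differs.

    Whitespace at the ends is ignored; everything else must match exactly.
    """
    if len(reference) != len(candidate):
        raise ValueError("Both lists must contain the same number of translations")
    return [
        (i, ref, cand)
        for i, (ref, cand) in enumerate(zip(reference, candidate))
        if ref.strip() != cand.strip()
    ]


# Placeholder strings standing in for real PyTorch and ONNX outputs
pytorch_out = ["Hello, how are you?", "Machine translation is impressive."]
onnx_out = ["Hello, how are you?", "Machine translation is impressive. "]

mismatches = find_mismatches(pytorch_out, onnx_out)
print(f"{len(mismatches)} mismatch(es)")
```

Small numerical differences between runtimes can occasionally change a generated token, so a mismatch is a prompt to inspect that sentence rather than proof of a broken export.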

## Step 7: Create Triton Model Repository Structure

Triton requires a specific directory structure:

```
model-repository/
└── opus-mt-fr-en/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```

In [None]:
# Create Triton model directory structure
triton_model_dir = os.path.join(TRITON_MODEL_REPO, MODEL_NAME)
triton_version_dir = os.path.join(triton_model_dir, "1")

os.makedirs(triton_version_dir, exist_ok=True)

# Copy ONNX model files.
# Optimum usually exports seq2seq models as separate graphs (for example
# encoder_model.onnx and decoder_model.onnx), so model.onnx may not exist.
onnx_files = [f for f in os.listdir(onnx_output_dir) if f.endswith('.onnx')]

if 'model.onnx' in onnx_files:
    # Single ONNX file
    shutil.copy(
        os.path.join(onnx_output_dir, 'model.onnx'),
        os.path.join(triton_version_dir, 'model.onnx')
    )
else:
    # Separate encoder/decoder files: a Triton ONNX model loads one graph,
    # so nothing is copied automatically here.
    print(f"Found ONNX files: {onnx_files}")
    print("Note: You may need to manually configure for multi-file ONNX models")

# Also copy tokenizer files for reference
for file in ['tokenizer_config.json', 'source.spm', 'target.spm', 'vocab.json']:
    src = os.path.join(onnx_output_dir, file)
    if os.path.exists(src):
        shutil.copy(src, triton_model_dir)

print(f"✅ Created Triton model directory: {triton_model_dir}")
print(f"✅ Model version 1 directory: {triton_version_dir}")
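
A quick layout check can catch a missing file before Triton is restarted. The sketch below verifies the structure shown above on a throwaway directory. `check_model_dir` is a hypothetical helper, and it assumes the ONNX file uses Triton's default name, `model.onnx`.

```python
import tempfile
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems; an empty list means the layout looks right."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return ["model directory does not exist"]
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version directories are named with integers such as "1"
    versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
    if not versions:
        problems.append("no numeric version directory")
    for version in versions:
        if not (version / "model.onnx").is_file():
            problems.append(f"version {version.name} has no model.onnx")
    return problems


# Demonstrate on a temporary directory instead of the real repository
with tempfile.TemporaryDirectory() as repo:
    model_dir = Path(repo) / "opus-mt-fr-en"
    (model_dir / "1").mkdir(parents=True)
    (model_dir / "1" / "model.onnx").touch()
    before = check_model_dir(model_dir)  # config.pbtxt not written yet
    (model_dir / "config.pbtxt").write_text('name: "opus-mt-fr-en"\n')
    after = check_model_dir(model_dir)

print(before)
print(after)
```

In the notebook, the same function could be pointed at `triton_model_dir` once the configuration file has been written.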

## Step 8: Create Triton Configuration File

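The `config.pbtxt` file tells Triton how to load and serve the model. Triton's ONNX Runtime backend checks the declared inputs and outputs against the tensors in the ONNX graph, and a graph exported by Optimum takes integer tensors such as `input_ids` and `attention_mask`, not strings. The `INPUT_TEXT`/`OUTPUT_TEXT` interface below therefore assumes that tokenization and decoding happen inside the served model (for example in a Python-backend or ensemble wrapper); otherwise, replace the names and types with those of your exported graph.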

In [None]:
config_content = f'''name: "{MODEL_NAME}"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {{
    name: "INPUT_TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }}
]

output [
  {{
    name: "OUTPUT_TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }}
]

instance_group [
  {{
    count: 1
    kind: KIND_GPU  # Change to KIND_CPU for CPU-only inference
  }}
]

# Dynamic batching for better throughput
dynamic_batching {{
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}}

# Performance tuning
optimization {{
  execution_accelerators {{
    gpu_execution_accelerator : [
      {{
        name: "tensorrt"
        parameters {{
          key: "precision_mode"
          value: "FP16"  # Use FP16 for faster GPU inference
        }}
      }}
    ]
  }}
}}
'''

config_path = os.path.join(triton_model_dir, 'config.pbtxt')
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"✅ Created Triton configuration: {config_path}")
print("\nConfiguration:")
print(config_content)
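
If an exported graph is served directly, the `input` and `output` entries need to use the graph's own tensor names and types. The sketch below renders those entries from a small list of specs. The names `input_ids`, `attention_mask`, and `last_hidden_state`, along with their types and dimensions, are assumptions about a typical encoder export and should be checked against your own file before use; `render_tensor_block` is a hypothetical helper.

```python
def render_tensor_block(section, tensors):
    """Render a config.pbtxt `input` or `output` section from (name, type, dims) tuples."""
    entries = []
    for name, data_type, dims in tensors:
        dims_text = ", ".join(str(d) for d in dims)
        entries.append(
            "  {\n"
            f'    name: "{name}"\n'
            f"    data_type: {data_type}\n"
            f"    dims: [ {dims_text} ]\n"
            "  }"
        )
    return f"{section} [\n" + ",\n".join(entries) + "\n]"


# ASSUMED tensor names and types; verify them against the exported graph.
# The batch dimension is omitted because max_batch_size > 0 adds it implicitly.
encoder_inputs = [
    ("input_ids", "TYPE_INT64", [-1]),
    ("attention_mask", "TYPE_INT64", [-1]),
]
encoder_outputs = [("last_hidden_state", "TYPE_FP32", [-1, -1])]

config_io = (
    render_tensor_block("input", encoder_inputs)
    + "\n\n"
    + render_tensor_block("output", encoder_outputs)
)
print(config_io)
```

Generating the entries from data keeps the names in one place, which makes it harder for the configuration and the client code to drift apart.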

## Step 9: Load Model in Triton

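Restart Triton to load the new model, or use the model control API. The load endpoint only works when Triton was started with `--model-control-mode=explicit`; in the default mode, models are loaded from the repository at startup.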

In [None]:
print("Loading model in Triton...\n")
print("Option 1: Restart Triton container:")
print("   docker-compose restart triton")
print("\nOption 2: Use Triton model control API:")
print(f"   curl -X POST http://{TRITON_URL}/v2/repository/models/{MODEL_NAME}/load")
print("\nAfter loading, verify with:")
print(f"   curl http://{TRITON_URL}/v2/models/{MODEL_NAME}/ready")
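
Loading can take a while, so a single readiness check made right after the load request may fail even though the model is about to become available. A simple polling loop handles this. The sketch below takes the readiness check as a callable so it can wrap `tritonclient`'s `is_model_ready` or any other probe; `wait_until_ready` and its default timings are illustrative choices, not Triton settings.

```python
import time


def wait_until_ready(is_ready, timeout_s=60.0, interval_s=1.0):
    """Call is_ready() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # Treat probe errors (e.g. connection refused) as "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in probe that becomes ready on the third call
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.01)
print(ready, calls["n"])
```

With a real client, the probe could be `lambda: client.is_model_ready(MODEL_NAME)`, where `client` is an `httpclient.InferenceServerClient` pointed at `TRITON_URL`.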

## Step 10: Test Model with Triton HTTP API

Once the model is loaded, test it using Triton's HTTP client.

In [None]:
def test_triton_model(model_name, text, triton_url="localhost:8000"):
    """Test a model deployed on Triton."""
    try:
        # Create Triton client
        client = httpclient.InferenceServerClient(url=triton_url)
        
        # Check if model is ready
        if not client.is_model_ready(model_name):
            print(f"❌ Model {model_name} is not ready on Triton")
            return None
        
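        # Prepare input. With max_batch_size > 0 the shape must include the
        # batch dimension, so a single string is sent with shape [1, 1].
        input_data = np.array([[text.encode('utf-8')]], dtype=object)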
        inputs = [
            httpclient.InferInput("INPUT_TEXT", input_data.shape, "BYTES")
        ]
        inputs[0].set_data_from_numpy(input_data)
        
        # Prepare output
        outputs = [
            httpclient.InferRequestedOutput("OUTPUT_TEXT")
        ]
        
        # Inference
        response = client.infer(model_name, inputs, outputs=outputs)
        
        # Get result (flatten so this works with or without a batch dimension)
        output_data = response.as_numpy("OUTPUT_TEXT")
        translation = output_data.reshape(-1)[0].decode('utf-8')
        
        return translation
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return None

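# Test the model
print(f"Testing {MODEL_NAME} on Triton...\n")

successes = 0
for text in test_sentences:
    translation = test_triton_model(MODEL_NAME, text, TRITON_URL)
    if translation:
        successes += 1
        print(f"🇫🇷 {text}")
        print(f"🇬🇧 {translation}\n")

# Only report success if every request returned a translation
if successes == len(test_sentences):
    print("✅ Triton inference working!")
else:
    print(f"❌ {len(test_sentences) - successes} of {len(test_sentences)} requests failed")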
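
`tritonclient` wraps Triton's HTTP/REST inference protocol, in which a request is a JSON document posted to `/v2/models/<name>/infer`. For debugging with `curl` it can be useful to see roughly what that body looks like. The sketch below builds one for the `INPUT_TEXT`/`OUTPUT_TEXT` tensors declared in the configuration. The `[1, 1]` shape is an assumption (a batch of one with one string per item, since `max_batch_size` is greater than zero), and `build_infer_body` is a hypothetical helper.

```python
import json


def build_infer_body(text):
    """Build a JSON-serializable inference request for a single string."""
    return {
        "inputs": [
            {
                "name": "INPUT_TEXT",
                "shape": [1, 1],      # [batch, strings per item] (assumed)
                "datatype": "BYTES",  # wire type corresponding to TYPE_STRING
                "data": [text],
            }
        ],
        "outputs": [{"name": "OUTPUT_TEXT"}],
    }


body = build_infer_body("Bonjour, comment allez-vous?")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be saved to a file and sent with `curl -X POST -d @body.json`, which helps separate client-side problems from server-side ones.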

## Step 11: Performance Benchmarking

Measure the per-request latency of the deployed model and the sequential (single-client) throughput derived from it.

In [None]:
import time
import statistics

def benchmark_model(model_name, texts, iterations=10):
    """Benchmark model performance."""
    latencies = []
    
    print(f"Benchmarking {model_name} with {iterations} iterations...\n")
    
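    for i in range(iterations):
        for text in texts:
            # perf_counter is a monotonic, high-resolution clock suited to timing
            start = time.perf_counter()
            translation = test_triton_model(model_name, text, TRITON_URL)
            latency = (time.perf_counter() - start) * 1000  # Convert to ms
            if translation is not None:  # Skip failed requests
                latencies.append(latency)

    # stdev needs at least two samples
    if len(latencies) < 2:
        print("❌ Not enough successful requests to compute statistics")
        return latencies

    # Calculate statistics
    avg_latency = statistics.mean(latencies)
    median_latency = statistics.median(latencies)
    min_latency = min(latencies)
    max_latency = max(latencies)
    std_dev = statistics.stdev(latencies)
    throughput = 1000 / avg_latency  # sequential requests per second (single client)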
    
    print(f"\n📊 Benchmark Results ({len(latencies)} requests)")
    print(f"   Average Latency:  {avg_latency:.2f} ms")
    print(f"   Median Latency:   {median_latency:.2f} ms")
    print(f"   Min Latency:      {min_latency:.2f} ms")
    print(f"   Max Latency:      {max_latency:.2f} ms")
    print(f"   Std Deviation:    {std_dev:.2f} ms")
    print(f"   Throughput:       {throughput:.2f} req/s")
    
    return latencies

# Run benchmark
benchmark_texts = [
    "Bonjour",
    "Comment allez-vous?",
    "Je suis très heureux."
]

latencies = benchmark_model(MODEL_NAME, benchmark_texts, iterations=5)
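
Mean and median hide tail behavior, which often matters most for a serving deployment. Percentiles such as p95 and p99 can be computed from the same `latencies` list. The sketch below uses the nearest-rank definition; `percentile` is a hypothetical helper and the sample values are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds, including two slow outliers
sample_ms = [12.0, 15.0, 11.0, 40.0, 13.0, 14.0, 12.5, 90.0, 13.5, 12.2]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(sample_ms, p):.1f} ms")
```

With only a handful of samples the high percentiles collapse onto the maximum, so more iterations are needed before p95 and p99 become meaningful.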

## Step 12: Visualization (Optional)

Plot latency distribution.

In [None]:
try:
    import matplotlib.pyplot as plt
    
    plt.figure(figsize=(10, 6))
    plt.hist(latencies, bins=20, edgecolor='black', alpha=0.7)
    plt.axvline(statistics.mean(latencies), color='r', linestyle='--', label=f'Mean: {statistics.mean(latencies):.2f}ms')
    plt.axvline(statistics.median(latencies), color='g', linestyle='--', label=f'Median: {statistics.median(latencies):.2f}ms')
    plt.xlabel('Latency (ms)')
    plt.ylabel('Frequency')
    plt.title(f'{MODEL_NAME} Inference Latency Distribution')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
except ImportError:
    print("Install matplotlib for visualization: pip install matplotlib")

## Summary

In this notebook, you learned how to:

✅ Download OPUS-MT models from Hugging Face  
✅ Convert PyTorch models to ONNX format  
✅ Create Triton model repository structure  
✅ Write Triton configuration files  
✅ Deploy models to Triton Inference Server  
✅ Test models with HTTP API  
✅ Benchmark model performance  

## Next Steps

1. **Add more models**: Repeat this process for other language pairs
2. **Optimize**: Experiment with INT8 quantization for faster inference
3. **Scale**: Configure multi-GPU deployment for high throughput
4. **Integrate**: Use the deployed model in SentinelTranslate API
5. **Monitor**: Set up metrics and logging for production

## Additional Resources

- [OPUS-MT Models](https://huggingface.co/Helsinki-NLP)
- [Triton Documentation](https://github.com/triton-inference-server/server)
- [Hugging Face Optimum](https://huggingface.co/docs/optimum/)
- [ONNX Runtime](https://onnxruntime.ai/)

---

**Questions?** Check the [model_conversion README](README.md) or the main [SentinelTranslate docs](../../README.md).