@ajaykrishnan23

This example demonstrates how to deploy a high-performance LLM inference server using NVIDIA Triton Inference Server with TensorRT-LLM for optimized GPU inference.

Features:

  • Triton Inference Server with Python backend
  • TensorRT-LLM with PyTorch backend for Llama 3.2 3B Instruct
  • Model download to persistent storage to avoid redundant downloads
  • Configurable sampling parameters (temperature, top_p, max_tokens)

The deployment uses an A10 GPU and exposes Triton's standard HTTP API on port 8000.
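As a sketch of how a client might call the deployed server, the snippet below builds a request against Triton's HTTP generate endpoint on port 8000 and passes the sampling parameters listed above. The model name (`llama`), the host, and the exact request field names are assumptions here; they depend on how the model and its config are registered in this deployment, so check the example's actual model repository before using them.

```python
import json
import urllib.request

# Hypothetical endpoint; the model name "llama" is an assumption and must
# match the model registered in the Triton model repository.
TRITON_URL = "http://localhost:8000/v2/models/llama/generate"


def build_payload(prompt: str, temperature: float = 0.7,
                  top_p: float = 0.9, max_tokens: int = 256) -> dict:
    """Assemble a JSON body for Triton's generate extension.

    The field names below are common conventions for LLM backends, but the
    accepted schema is defined by the model's config, not by Triton itself.
    """
    return {
        "text_input": prompt,
        "parameters": {
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
        },
    }


def generate(prompt: str, **sampling) -> str:
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt, **sampling)).encode()
    req = urllib.request.Request(
        TRITON_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]
```

A call like `generate("Explain KV caching.", temperature=0.2, max_tokens=128)` would then return the model's completion, assuming the server is reachable and the response carries a `text_output` field.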

@milo157 milo157 merged commit ec2ae32 into CerebriumAI:master Nov 24, 2025