@ajaykrishnan23

This example demonstrates how to deploy a high-performance LLM inference server using NVIDIA Triton Inference Server with TensorRT-LLM for optimized GPU inference.

Features:

  • Triton Inference Server with Python backend
  • TensorRT-LLM with PyTorch backend for Llama 3.2 3B Instruct
  • Model download to persistent storage to avoid redundant downloads
  • Configurable sampling parameters (temperature, top_p, max_tokens)

The deployment uses an A10 GPU and exposes Triton's standard HTTP API on port 8000.
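As a sketch of how a client might call the deployed server, the snippet below builds a request against Triton's HTTP generate endpoint on port 8000 and passes the sampling parameters listed above. The model name (`llama`), the host, and the exact request field names are assumptions here; they depend on how the model and its config are registered in this deployment, so check the example's actual model repository before using them.

```python
import json
import urllib.request

# Hypothetical endpoint; the model name "llama" is an assumption and must
# match the model registered in the Triton model repository.
TRITON_URL = "http://localhost:8000/v2/models/llama/generate"


def build_payload(prompt: str, temperature: float = 0.7,
                  top_p: float = 0.9, max_tokens: int = 256) -> dict:
    """Assemble a JSON body for Triton's generate extension.

    The field names below are common conventions for LLM backends, but the
    accepted schema is defined by the model's config, not by Triton itself.
    """
    return {
        "text_input": prompt,
        "parameters": {
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
        },
    }


def generate(prompt: str, **sampling) -> str:
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt, **sampling)).encode()
    req = urllib.request.Request(
        TRITON_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]
```

A call like `generate("Explain KV caching.", temperature=0.2, max_tokens=128)` would then return the model's completion, assuming the server is reachable and the response carries a `text_output` field.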

@milo157 milo157 merged commit ec2ae32 into CerebriumAI:master Nov 24, 2025