### REQUIREMENTS

#### INSTALLATION GUIDE

##### support matrix
https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html

##### installations
https://nvidia.github.io/TensorRT-LLM/installation/linux.html

https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html

In [None]:
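# Optional: start a base CUDA container to work in.
# `-it` needs an interactive TTY, so run this from a terminal (without the leading `!`) rather than a notebook cell.
!docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04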

In [None]:
# Install dependencies. TensorRT-LLM requires Python 3.10.
!apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# Install TensorRT-LLM 0.10.0, the version this notebook uses (it matches the v0.10.0 tag cloned below).
# To try the latest preview version (corresponding to the main branch) instead,
# drop the `==0.10.0` pin and add the `--pre` option.
!pip3 install tensorrt_llm==0.10.0 -U --extra-index-url https://pypi.nvidia.com

# Check installation
!python3 -c "import tensorrt_llm"
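
The import check above only confirms that the package loads. A minimal sketch of a version check follows, assuming you want the 0.10.0 release this notebook targets; `installed_version` and `EXPECTED_VERSION` are hypothetical names introduced here for illustration, and only the standard library is used.

```python
from importlib import metadata

# Assumption: this notebook targets TensorRT-LLM 0.10.0; change this if you install another release.
EXPECTED_VERSION = "0.10.0"


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("tensorrt_llm")
if version is None:
    print("tensorrt_llm is not installed")
elif version != EXPECTED_VERSION:
    print(f"tensorrt_llm {version} is installed, but this notebook expects {EXPECTED_VERSION}")
else:
    print(f"tensorrt_llm {version} matches the expected version")
```

Reading the version from package metadata avoids importing the library itself, so the check stays fast even on a machine without a GPU.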

##### download a checkpoint from the NVIDIA NGC catalog
##### this sample uses Mistral 7B,
##### but the API also supports Phi-3, Llama 2 and Llama 3

In [None]:

# download the model from the NVIDIA NGC catalog
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/llama/mistral-7b-int4-chat/versions/1.2/zip -O mistral-7b-int4-chat_1.2.zip

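# unzip the model (the target directory name is only a suggestion; Python's zipfile CLI avoids needing `unzip`)
!python3 -m zipfile -e mistral-7b-int4-chat_1.2.zip mistral-7b-int4-chat_1.2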

# clone the nvidia tensorrt llm repo
!git clone -b v0.10.0 https://github.com/NVIDIA/TensorRT-LLM.git

# change into the examples folder for the model family
# (`!cd` runs in a throwaway subshell and does not persist, so use the `%cd` magic)
%cd TensorRT-LLM/examples/llama


# In the commands below, replace the two placeholder paths:
#   --checkpoint_dir  directory of the downloaded checkpoint
#   --output_dir      directory where the built engine is written

# build option 1
!trtllm-build --checkpoint_dir /path/to/checkpoint \
              --output_dir /path/to/engine \
              --gemm_plugin float16

# build option 2
# build with StreamingLLM
!trtllm-build --checkpoint_dir /path/to/checkpoint \
              --output_dir /path/to/engine \
              --gemm_plugin float16 \
              --streamingllm enable

# build option 3
# StreamingLLM plus explicit batch size, sequence lengths and parallelism
!trtllm-build --checkpoint_dir /path/to/checkpoint \
              --output_dir /path/to/engine \
              --gemm_plugin float16 \
              --streamingllm enable \
              --max_batch_size 8 \
              --max_input_len 1024 \
              --max_output_len 1024 \
              --tp_size 1 \
              --pp_size 1
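
The three build options differ only in which flags are appended, so the command can also be assembled in Python. The sketch below is one way to do that, not part of TensorRT-LLM: `trtllm_build_args` is a hypothetical helper, the paths are placeholders, and it only passes through the flags already shown above.

```python
import shlex


def trtllm_build_args(checkpoint_dir, output_dir, gemm_plugin="float16",
                      streamingllm=False, **extra):
    """Assemble a trtllm-build argument list.

    `extra` maps flag names (without the leading dashes) to values,
    e.g. max_batch_size=8 becomes `--max_batch_size 8`.
    """
    args = [
        "trtllm-build",
        "--checkpoint_dir", str(checkpoint_dir),
        "--output_dir", str(output_dir),
        "--gemm_plugin", gemm_plugin,
    ]
    if streamingllm:
        args += ["--streamingllm", "enable"]
    for flag, value in extra.items():
        args += [f"--{flag}", str(value)]
    return args


# Placeholder paths; point these at your own checkpoint and engine directories.
args = trtllm_build_args("/path/to/checkpoint", "/path/to/engine",
                         streamingllm=True, max_batch_size=8, max_input_len=1024)
print(shlex.join(args))
```

To actually run the build, the list can be handed to `subprocess.run(args, check=True)`; passing a list rather than a single string sidesteps the line-continuation and quoting pitfalls of long shell commands.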





### INFERENCING

##### import packages

In [None]:
from inference_api import TRTLLM_API

##### create an instance of the inference API

In [None]:


# replace these with the paths to your built TensorRT engine and to the tokenizer of the model you are using
engine = "/workspace/trt-llmv10/models/llama3-int4-AWQ/engine"
token_dr = "/workspace/trt-llmv10/models/llama3-int4-AWQ/tokenizer"

# create an instance of the model 
inference_api = TRTLLM_API(engine_dir=engine, token_dir=token_dr)
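
A wrong path is an easy mistake when constructing the API, so it can help to check both directories first. The sketch below is a hedged example: `path_problems` is a hypothetical helper, and it assumes that `trtllm-build` writes a `config.json` and at least one `*.engine` file into the output directory, which you should confirm against your own build output.

```python
from pathlib import Path


def path_problems(engine_dir, token_dir):
    """Return a list of human-readable problems; an empty list means the paths look usable."""
    problems = []
    engine_path = Path(engine_dir)
    if not engine_path.is_dir():
        problems.append(f"engine directory not found: {engine_path}")
    else:
        # Assumption: a built engine directory holds config.json plus one or more *.engine files.
        if not (engine_path / "config.json").is_file():
            problems.append(f"no config.json in {engine_path}")
        if not any(engine_path.glob("*.engine")):
            problems.append(f"no *.engine file in {engine_path}")
    token_path = Path(token_dir)
    if not token_path.is_dir():
        problems.append(f"tokenizer directory not found: {token_path}")
    return problems


# Same example paths as in the cell above; replace them with your own.
for problem in path_problems("/workspace/trt-llmv10/models/llama3-int4-AWQ/engine",
                             "/workspace/trt-llmv10/models/llama3-int4-AWQ/tokenizer"):
    print(problem)
```

If the loop prints nothing, both directories passed these basic checks.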

##### generate the response of your model 

In [None]:
inference_api.generate(input_text="write a simple python code ",  # input prompt
                       streaming=True,                            # stream the output as it is generated
                       temperature=1,                             # sampling temperature
                       max_output_len=500,                        # maximum number of output tokens
                       streaming_interval=1                       # streaming interval (used when streaming=True)
                       )