# Run inference on Mistral 7B using NVIDIA TensorRT

Welcome!

In this notebook, we will walk through converting Mistral 7B into a TensorRT-LLM engine. TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and building TensorRT engines that contain state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorRT-LLM was recently featured in the Phind-70B release as their preferred framework for performing inference!

See the [GitHub repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!

A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running; a number means it has completed. If the notebook becomes unresponsive, you can stop a long-running cell by interrupting the kernel (Kernel tab -> Interrupt Kernel) or restart the kernel entirely (Kernel tab -> Restart Kernel). Note that restarting the kernel requires you to rerun every cell from the beginning.

Deployment powered by [Brev.dev](https://x.com/brevdev) 🤙


In [1]:
!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting tensorrt_llm
  Downloading https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.9.0.dev2024022000-cp310-cp310-linux_x86_64.whl (1229.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 11.8 MB/s eta 0:00:00
Collecting accelerate==0.25.0 (from tensorrt_llm)
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting build (from tensorrt_llm)
  Downloading build-1.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting colored (from tensorrt_llm)
  Downloading colored-2.2.4-py3-none-any.whl.metadata (3.6 kB)
Collecting cuda-python (from tensorrt_llm)
  Downloading cuda_python-12.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting diffusers==0.15.0 (from tensorrt_llm)
  Downloading diffusers-0.15.0-py3-none-any.whl.metadata (19 kB)
Collecting lark (from tensorrt_llm)
  Downloading lark-1
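
Because the install uses `--pre`, the resolved version can change from day to day. It may be worth confirming which wheel actually landed before continuing. The sketch below is one way to do that with the standard library; the helper name `installed_version` is illustrative and not part of TensorRT-LLM.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The pip log above resolved tensorrt_llm to a 0.9.0 development wheel;
# on another day or machine this may print a different version, or None.
print("tensorrt_llm:", installed_version("tensorrt_llm"))
```

If this prints `None`, the install did not complete in the kernel's environment and the later cells are likely to fail.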

In [2]:
!pip install transformers safetensors datasets

[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python -m pip install --upgrade pip


In [3]:
!mkdir -p models/Mistral_7B_v1
!pip install --upgrade "huggingface_hub[cli]"

Collecting InquirerPy==0.3.4 (from huggingface_hub[cli])
  Downloading InquirerPy-0.3.4-py3-none-any.whl.metadata (8.1 kB)
Collecting pfzy<0.4.0,>=0.3.1 (from InquirerPy==0.3.4->huggingface_hub[cli])
  Downloading pfzy-0.3.4-py3-none-any.whl (8.5 kB)
Downloading InquirerPy-0.3.4-py3-none-any.whl (67 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.7/67.7 kB 1.1 MB/s eta 0:00:00
Installing collected packages: pfzy, InquirerPy
Successfully installed InquirerPy-0.3.4 pfzy-0.3.4

[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python -m pip install --upgrade pip


In [4]:
import huggingface_hub

huggingface_hub.snapshot_download(repo_id="mistralai/Mistral-7B-v0.1", local_dir="models/Mistral_7B_v1")

Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/5.06G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

'/workspace/models/Mistral_7B_v1'
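
The download is large, so an interrupted transfer is easy to miss. A quick check that the expected files are on disk can save a confusing failure in the conversion step. This is a minimal sketch: the `REQUIRED_FILES` list is an assumption chosen from the file names in the progress output above, not an official manifest, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed minimal file set, taken from the names in the download output above.
REQUIRED_FILES = [
    "config.json",
    "tokenizer.model",
    "model.safetensors.index.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files under `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


model_dir = Path("models/Mistral_7B_v1")
if model_dir.is_dir():
    print("missing:", missing_files(model_dir))
else:
    print(f"{model_dir} does not exist yet")
```

An empty list means every file in the assumed set is present; it does not verify that the files are complete or uncorrupted.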

In [6]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py -P .

--2024-02-23 02:05:58--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63559 (62K) [text/plain]
Saving to: ‘./convert_checkpoint.py’


2024-02-23 02:05:58 (7.09 MB/s) - ‘./convert_checkpoint.py’ saved [63559/63559]



In [9]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .

--2024-02-23 02:07:01--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21871 (21K) [text/plain]
Saving to: ‘./run.py’


2024-02-23 02:07:01 (14.1 MB/s) - ‘./run.py’ saved [21871/21871]



In [8]:
!python convert_checkpoint.py --model_dir ./models/Mistral_7B_v1 --output_dir ./tllm_checkpoint_1gpu_mistral --dtype float16

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
0.9.0.dev2024022000
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.08s/it]
Weights loaded. Total time: 00:00:00
Total time of converting checkpoints: 00:00:16
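
Converting with `--dtype float16` stores each weight in 2 bytes, so the weight footprint can be estimated from the architecture alone. The sketch below counts parameters from the values published in the model's `config.json` (hard-coded here as an assumption; check them against your downloaded copy) and assumes the standard Llama-style layer layout with an untied output head.

```python
# Architecture values as published in Mistral-7B-v0.1's config.json
# (assumption: verify against models/Mistral_7B_v1/config.json).
VOCAB, HIDDEN, INTERMEDIATE = 32000, 4096, 14336
LAYERS, HEADS, KV_HEADS = 32, 32, 8
HEAD_DIM = HIDDEN // HEADS  # 128


def parameter_count():
    """Count weights for a Llama-style decoder with grouped-query attention."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # q, o and k, v
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, down projections
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    embeddings = 2 * VOCAB * HIDDEN  # input embedding plus untied output head
    return LAYERS * (attn + mlp + norms) + embeddings + HIDDEN  # plus final norm


params = parameter_count()
fp16_mib = params * 2 / 2**20  # 2 bytes per float16 weight
print(f"{params:,} parameters, about {fp16_mib:,.0f} MiB in float16")
```

The estimate comes out at roughly 13.8 thousand MiB, which is in line with the `Loaded engine size: 13815 MiB` line printed when the engine is loaded at the end of this notebook.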


In [10]:
!mkdir -p builtmistral
!trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral --output_dir ./builtmistral --gemm_plugin float16 --max_input_len 32256

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
[02/23/2024-02:07:52] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set gemm_plugin to float16.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set lookup_plugin to None.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set lora_plugin to None.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set moe_plugin to float16.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set context_fmha to True.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set remove_input_padding to True.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set multi_block_mode to False.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set enable_xqa to True.
[02/23/2024-02:07:52] [TRT-LLM] [I] Set attention_qk_half_accumulation to False
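
If you expect to rebuild the engine with different limits, it can help to assemble the build command in Python rather than editing a long shell line. The sketch below uses only the flags that appear in the cell above; `build_command` is a hypothetical helper, and it makes no claim about other options `trtllm-build` may accept.

```python
import shlex


def build_command(checkpoint_dir, output_dir, max_input_len, gemm_plugin="float16"):
    """Assemble the trtllm-build argument list using only the flags from the cell above."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
        "--gemm_plugin", gemm_plugin,
        "--max_input_len", str(max_input_len),
    ]


cmd = build_command("./tllm_checkpoint_1gpu_mistral", "./builtmistral", 32256)
print(shlex.join(cmd))
# To actually run the build: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a directory name ever contains spaces.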

In [12]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .

--2024-02-23 02:09:53--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4883 (4.8K) [text/plain]
Saving to: ‘./utils.py’


2024-02-23 02:09:54 (83.3 MB/s) - ‘./utils.py’ saved [4883/4883]
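
The three helper scripts in this notebook are fetched from the `main` branch, which keeps moving, while the installed wheel is a fixed development build. If the scripts and the wheel drift apart, pinning the downloads to a specific tag or commit is one option. The sketch below only builds the URLs; which ref matches a given wheel is not stated here, so `ref` is left as a parameter and `raw_url` is an illustrative helper.

```python
RAW_BASE = "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM"

# Repository paths of the three scripts downloaded in this notebook.
SCRIPT_PATHS = [
    "examples/llama/convert_checkpoint.py",
    "examples/run.py",
    "examples/utils.py",
]


def raw_url(path, ref="main"):
    """Build the raw-file URL for `path` at git ref `ref` (branch, tag, or commit)."""
    return f"{RAW_BASE}/{ref}/{path}"


for path in SCRIPT_PATHS:
    print(raw_url(path))
```

With the default `ref="main"`, the printed URLs are the same ones used by the `wget` cells.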



In [22]:
!python3 run.py --max_output_len=50 --tokenizer_dir mistralai/Mistral-7B-v0.1 --engine_dir=./builtmistral --max_attention_window_size=4096 --input_text "A GPU is a"

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
[TensorRT-LLM][INFO] Engine version 0.9.0.dev2024022000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 13815 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13963, GPU 14225 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13964, GPU 14235 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +13812, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13997, GPU 17403 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13997, GPU 17411 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] Allocat
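
The run command above caps attention at `--max_attention_window_size=4096`, while the engine was built with `--max_input_len 32256`. A back-of-the-envelope estimate shows why the window matters for memory. The sketch assumes a float16 KV cache and the layer, KV-head, and head-dimension values published in the model's `config.json`. It ignores paging granularity and any preallocation the runtime performs, so the actual allocation will differ.

```python
# Values as published in Mistral-7B-v0.1's config.json
# (assumption: verify against your downloaded copy).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_VALUE = 2  # assumes a float16 KV cache


def kv_cache_bytes(tokens, batch_size=1):
    """Bytes needed to hold keys and values for `tokens` positions per sequence."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = keys and values
    return batch_size * tokens * per_token


for tokens in (4096, 32256):
    print(f"{tokens:>6} tokens: {kv_cache_bytes(tokens) / 2**20:,.0f} MiB per sequence")
```

Under these assumptions each token costs 128 KiB, so a 4096-token window needs about 512 MiB per sequence, compared with about 4032 MiB if all 32256 positions were kept.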