# Using vLLM for Inference with Yi-1.5-6B-Chat

This tutorial walks through running inference with the Yi-1.5-6B-Chat model using vLLM, a fast and easy-to-use library for large language model (LLM) inference and serving. Let's get started!

## 🚀 Running on Colab

A one-click [Colab script](https://colab.research.google.com/drive/1KuydGHHbI31Q0WIpwg7UmH0rfNjii8Wl?usp=drive_link) is available to make development easier!

## Installation

First, install the required dependencies. According to the official documentation, the prebuilt vLLM binaries installed with pip are compiled against CUDA 12.1, so your environment needs a compatible CUDA setup. Refer to the official [documentation](https://docs.vllm.ai/en/stable/getting_started/installation.html) for more details.

Let's install vLLM now:

In [1]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.5.3.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (1.8 kB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting fastapi (from vllm)
  Downloading fastapi-0.111.1-py3-none-any.whl.metadata (26 kB)
Collecting openai (from vllm)
  Downloading openai-1.37.0-py3-none-any.whl.metadata (22 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.30.3-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer==0.10.3 (from vllm)
  Downloading lm_format_enforcer-0.10.3-py3-none-any.whl.metadata (16 kB)
Collecting outlines<0.1,>=0.0.43 (from vllm)
  Downloading
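
After the install finishes, it can be worth confirming which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `cuda_build_version` are illustrative and not part of vLLM. The second helper assumes PyTorch is available (it is pulled in as a vLLM dependency) and reads the CUDA version PyTorch was built against from `torch.version.cuda`.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cuda_build_version():
    """Return the CUDA version PyTorch was built against, or None."""
    try:
        import torch  # normally installed alongside vLLM
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


print("vllm:", installed_version("vllm"))
print("CUDA (PyTorch build):", cuda_build_version())
```

If the first line prints `None`, the install did not succeed in the current kernel; in Colab, restarting the runtime after installation sometimes helps.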

## Loading the Model

Now, we'll load the Yi-1.5-6B-Chat model. Keep an eye on your GPU's VRAM and your disk space: if loading fails, insufficient resources are a likely cause.

The table below lists the approximate VRAM and disk space this model uses:

| Model | VRAM Usage | Disk Space Usage |
|-------|------------|------------------|
| Yi-1.5-6B-Chat | 21 GB | 15 GB |
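
To see roughly where the VRAM figure comes from, a back-of-the-envelope estimate helps. The sketch below assumes about 6 billion parameters (taken from the model name, not an exact count) stored in `bfloat16` at 2 bytes each; the helper name `weight_memory_gib` is illustrative. The engine log further down reports that loading the weights took 11.2905 GB, which is in the same range. The rest of the VRAM in the table is presumably taken by the KV cache and CUDA graphs, since vLLM reserves a fraction of GPU memory up front (`gpu_memory_utilization`).

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~6e9 parameters, bfloat16 (2 bytes per parameter).
estimate = weight_memory_gib(6e9)
print(f"Estimated weight memory: {estimate:.1f} GiB")
```

This only covers the weights; the total footprint depends on the GPU and on how much memory vLLM is allowed to reserve.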

In [2]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-6B-Chat")

# Set the sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.8)

# Load the model
llm = LLM(model="01-ai/Yi-1.5-6B-Chat")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 07-25 02:22:07 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='01-ai/Yi-1.5-6B-Chat', speculative_config=None, tokenizer='01-ai/Yi-1.5-6B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=01-ai/Yi-1.5-6B-Chat, use_v2_block_manager=False, enable_prefix_caching=False)


INFO 07-25 02:22:08 model_runner.py:680] Starting to load model 01-ai/Yi-1.5-6B-Chat...
INFO 07-25 02:22:09 weight_utils.py:223] Using model weights format ['*.safetensors']



INFO 07-25 02:25:43 model_runner.py:692] Loading model weights took 11.2905 GB
INFO 07-25 02:25:45 gpu_executor.py:102] # GPU blocks: 8089, # CPU blocks: 4096
INFO 07-25 02:25:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-25 02:25:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-25 02:25:58 model_runner.py:1181] Graph capturing finished in 12 secs.
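
The log messages above name a few knobs for reducing memory use: `gpu_memory_utilization`, `enforce_eager`, and `max_num_seqs`. A minimal sketch of how they might be collected for `LLM` follows. The specific values are illustrative assumptions rather than tuned recommendations, and `low_memory_engine_kwargs` and `kv_cache_token_capacity` are hypothetical helper names. The second helper converts the reported number of GPU blocks into a token count, assuming a block size of 16 tokens (believed to be vLLM's default, but not shown in the log).

```python
def low_memory_engine_kwargs(model: str = "01-ai/Yi-1.5-6B-Chat") -> dict:
    """Keyword arguments for vllm.LLM that trade throughput for memory."""
    return {
        "model": model,
        "gpu_memory_utilization": 0.8,  # fraction of GPU memory vLLM may use
        "enforce_eager": True,          # skip CUDA graph capture
        "max_num_seqs": 64,             # fewer sequences batched at once
        "max_model_len": 2048,          # shorter than max_seq_len=4096 in the log
    }


def kv_cache_token_capacity(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Tokens that fit in the GPU KV cache (block_size=16 is an assumption)."""
    return num_gpu_blocks * block_size


# The log above reported "# GPU blocks: 8089".
print(kv_cache_token_capacity(8089))
```

With vLLM installed, the dictionary can be unpacked as `llm = LLM(**low_memory_engine_kwargs())` in place of the plain `LLM(model=...)` call.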


## Model Inference

Next, we format a prompt with the model's chat template and run inference. We'll use a simple greeting for this example.

In [3]:
# Prepare the prompt template
prompt = "Hi!"  # Change the prompt as needed
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(text)

# Generate the response
outputs = llm.generate([text], sampling_params)

# Print the output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

<|im_start|>user
Hi!<|im_end|>
<|im_start|>assistant



Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.52it/s, est. speed input: 15.19 toks/s, output: 19.75 toks/s]

Prompt: '<|im_start|>user\nHi!<|im_end|>\n<|im_start|>assistant\n', Generated text: "你好！有什么我可以帮助你的吗？']\n````"
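
The printed prompt shows what `apply_chat_template` produced for this single user message: each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the prompt ends with an open assistant turn. The sketch below reproduces that string by hand purely for illustration. `build_chat_prompt` is a hypothetical helper, and its behaviour for multi-turn or system messages is an assumption extrapolated from the single example above. In practice, keep using the tokenizer's own template.

```python
def build_chat_prompt(messages, add_generation_prompt=True):
    """Format chat messages in the style shown in the printed prompt."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # open turn for the model to fill
    return "".join(parts)


text = build_chat_prompt([{"role": "user", "content": "Hi!"}])
print(repr(text))
```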

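That's it! You've successfully run inference with the Yi-1.5-6B-Chat model using vLLM. Note that the model answered this greeting in Chinese; the reply translates to "Hello! Is there anything I can help you with?", followed by a few stray characters. Feel free to experiment with different prompts and adjust the sampling parameters (`temperature`, `top_p`) to see how the model responds. Happy experimenting!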
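
One way to experiment is to send several prompts in a single call and vary the sampling settings. The sketch below assumes the `tokenizer` and `llm` objects created in the earlier cells; `build_prompts` and `generate_replies` are hypothetical helper names, not part of vLLM.

```python
def build_prompts(tokenizer, user_messages):
    """Apply the chat template to each plain user message."""
    return [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": message}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for message in user_messages
    ]


def generate_replies(llm, tokenizer, user_messages, sampling_params):
    """Generate one reply per message and return the reply texts."""
    prompts = build_prompts(tokenizer, user_messages)
    outputs = llm.generate(prompts, sampling_params)
    return [output.outputs[0].text for output in outputs]
```

For example, with the objects from the cells above, `generate_replies(llm, tokenizer, ["Hi!", "Tell me a joke."], SamplingParams(temperature=0.2, top_p=0.9, max_tokens=128))` would return two strings. Lower `temperature` values make the output more deterministic, and `max_tokens` caps the length of each reply.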