Our goals in this notebook:

* Run a Qwen large language model using vLLM in a Kaggle Notebook
* Use the Kaggle-provided GPU for the notebook


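Make sure to keep Internet on for this notebook: the wheels are installed offline, but the model and tokenizer are referenced by their Hugging Face names and have to be downloaded.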

Step 1: Attach the vLLM wheels dataset using the "Add Input" option (under "Datasets") in the Kaggle notebook.

Step 2: Select the GPU T4 x2 accelerator.

Now, let's start our notebook session and begin coding.

In [None]:
!pip install --no-index --find-links=/kaggle/input/vllm-0-6-3-post1-wheels torchvision==0.19.1
!pip install --no-index --find-links=/kaggle/input/vllm-0-6-3-post1-wheels vllm
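If either install fails with a "no matching distribution" error, the wheels dataset is usually not attached or is mounted under a different name. A minimal sketch of a check, assuming the same input path as the cell above (the helper name `list_wheels` is hypothetical and not part of pip or Kaggle):

```python
from pathlib import Path

# Path taken from the pip commands above; adjust if the dataset is mounted elsewhere.
WHEEL_DIR = "/kaggle/input/vllm-0-6-3-post1-wheels"

def list_wheels(wheel_dir, name_prefix=""):
    """Return sorted .whl file names directly under wheel_dir, optionally filtered by prefix."""
    root = Path(wheel_dir)
    if not root.is_dir():
        return []  # dataset not attached, or attached under another name
    return sorted(
        p.name for p in root.glob("*.whl")
        if p.name.lower().startswith(name_prefix.lower())
    )

wheels = list_wheels(WHEEL_DIR)
print(f"{len(wheels)} wheel files found in {WHEEL_DIR}")
print("vllm wheels:", list_wheels(WHEEL_DIR, "vllm"))
```

An empty result for the `vllm` prefix suggests the path in `--find-links` needs to be updated to match the attached dataset.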

A list of all Qwen models can be found in the official Qwen documentation: https://qwen.readthedocs.io/en/latest/getting_started/concepts.html

Qwen: the language models
* Qwen: 1.8B, 7B, 14B, and 72B models
* Qwen1.5: 0.5B, 1.8B, 4B, 14B-A2.7B, 7B, 14B, 32B, 72B, and 110B models
* Qwen2: 0.5B, 1.5B, 7B, 57B-A14B, and 72B models
* Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B models

Qwen-VL: the vision-language models
* Qwen-VL: 7B-based models
* Qwen2-VL: 2B, 7B, and 72B-based models

Qwen-Audio: the audio-language models
* Qwen-Audio: 7B-based model
* Qwen2-Audio: 7B-based models

CodeQwen/Qwen-Coder: the language models for coding
* CodeQwen1.5: 7B models
* Qwen2.5-Coder: 7B models

Qwen-Math: the language models for mathematics
* Qwen2-Math: 1.5B, 7B, and 72B models
* Qwen2.5-Math: 1.5B, 7B, and 72B models

In [None]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

Qwen's HuggingFace page: https://huggingface.co/Qwen

In [None]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

In [None]:
# Input the model name or path. Can be GPTQ or AWQ models.
# dtype="half" (float16) is set explicitly because the T4 does not support bfloat16.
# Without it, loading failed with:
#   ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
#   Your b'Tesla T4' GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct",
          dtype="half")
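The error quoted above comes down to one rule: bfloat16 needs compute capability 8.0 or higher, and the T4 reports 7.5. A minimal sketch of choosing the `dtype` string from the capability, where `pick_dtype` is a hypothetical helper and not part of vLLM:

```python
def pick_dtype(capability):
    """Map a (major, minor) CUDA compute capability to a dtype string for LLM(dtype=...)."""
    # bfloat16 requires compute capability >= 8.0 (per the error message above);
    # older GPUs such as the T4 (7.5) fall back to float16, i.e. "half".
    return "bfloat16" if tuple(capability) >= (8, 0) else "half"

print(pick_dtype((7, 5)))  # Tesla T4
print(pick_dtype((8, 0)))  # a GPU that meets the 8.0 threshold
```

In the notebook, the capability can be read with `torch.cuda.get_device_capability()` and the result passed as `dtype=pick_dtype(...)`, so the same cell also works on newer accelerators.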

In [None]:
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
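To make the parameters above concrete, here is a toy, pure-Python illustration of how `temperature` and `top_p` (nucleus sampling) are commonly defined. It is a sketch of the general technique on a made-up four-token vocabulary, not vLLM's actual implementation, and `top_p_filter` is a hypothetical helper name:

```python
import math

def top_p_filter(logits, temperature=0.7, top_p=0.8):
    """Return renormalized probabilities after temperature scaling and top-p truncation."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens; sampling would draw from this result.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}

toy_logits = {"models": 2.0, "data": 1.0, "cats": 0.0, "xyz": -1.0}  # made-up values
print(top_p_filter(toy_logits))
```

The other two settings are simpler: a `repetition_penalty` above 1 discourages tokens that have already appeared, and `max_tokens` caps the length of each generated reply.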

In [None]:
# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
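For Qwen instruct models, `apply_chat_template` renders the messages into ChatML-style text. The sketch below approximates that output for plain system/user/assistant messages so the structure of `text` is visible. The real template shipped with the tokenizer also covers cases such as tool calls, so treat `render_chatml` as a hypothetical illustration rather than a replacement:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML layout used by Qwen chat templates (simple messages only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Opens the assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."},
]
print(render_chatml(messages))
```

Printing `text` from the cell above is a quick way to compare the tokenizer's real output against this layout.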

In [None]:
# Generate outputs; the list may hold several formatted prompts, which are processed as a batch
outputs = llm.generate([text], sampling_params)

In [None]:
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
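Beyond the raw text, each completion also carries its token ids and a finish reason, which helps spot replies cut off by `max_tokens`. A minimal sketch, assuming the `token_ids` and `finish_reason` attributes of vLLM's completion objects; `summarize` is a hypothetical helper, and the stand-in objects with made-up token ids exist only so the example runs without a GPU:

```python
from types import SimpleNamespace

def summarize(outputs):
    """Return one dict per request: generated text, token count, and truncation flag."""
    rows = []
    for output in outputs:
        completion = output.outputs[0]
        rows.append({
            "text": completion.text,
            "num_tokens": len(completion.token_ids),
            # "length" means generation stopped because max_tokens was reached.
            "truncated": completion.finish_reason == "length",
        })
    return rows

# Stand-in object that mimics the attributes read above (made-up token ids).
fake = SimpleNamespace(
    prompt="hi",
    outputs=[SimpleNamespace(text="Hello!", token_ids=[101, 102], finish_reason="stop")],
)
print(summarize([fake]))
```

In the notebook, `summarize(outputs)` can be called directly on the result of `llm.generate`.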

References:
1. https://www.kaggle.com/code/cooleel/starter-zero-shot-data-extraction
2. https://docs.vllm.ai/en/stable/dev/offline_inference/llm.html
3. https://qwen.readthedocs.io/en/latest/deployment/vllm.html