In [2]:
%pip install vllm -q

In [3]:
# Import the vLLM engine class and the sampling configuration class.
from vllm import LLM, SamplingParams

In [4]:
# Sample prompts for the model to complete.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

## SamplingParams in vLLM

`SamplingParams` is the vLLM class that holds the parameters controlling how a large language model (LLM) generates text. These parameters govern randomness, diversity, repetition, output length, and stopping conditions.  

### Key Parameters  

- **`n`**: Specifies the number of output sequences to return for a given prompt.  

- **`best_of`**: Indicates the number of sequences generated from the prompt, from which the top `n` sequences are selected. The value of `best_of` must be greater than or equal to `n`.  

- **`temperature`**: Controls the randomness of the sampling process.  
  - Lower values make the model's output more deterministic.  
  - Higher values introduce more randomness.  
  - A value of `0` corresponds to greedy sampling.  

- **`top_p`**: Represents the cumulative probability threshold for token selection.  
  - The model considers only the smallest set of top tokens whose cumulative probability exceeds this threshold.  
  - The value must be in the range `(0, 1]`, with `1` meaning all tokens are considered.  

- **`top_k`**: Limits the number of top tokens to consider during sampling.  
  - A value of `-1` implies all tokens are considered.  

- **`presence_penalty`**: Applies a penalty to tokens based on their presence in the generated text so far.  
  - Positive values discourage repetition, encouraging the model to introduce new tokens.  
  - Negative values encourage repetition.  

- **`frequency_penalty`**: Penalizes tokens based on their frequency in the generated text.  
  - Positive values discourage frequent tokens, promoting diversity.  
  - Negative values can lead to repetition.  

- **`repetition_penalty`**: Penalizes tokens based on their presence in both the prompt and the generated text.  
  - Values greater than `1` discourage repetition.  
  - Values less than `1` encourage repetition.  

- **`max_tokens`**: Sets the maximum number of tokens to generate per output sequence.  

- **`min_tokens`**: Specifies the minimum number of tokens to generate per output sequence before an end-of-sequence token or any of the `stop_token_ids` can be generated.  

- **`stop`**: A list of strings that, if generated, will stop the generation process.  
  - The output will not include the stop strings.  

- **`stop_token_ids`**: A list of token IDs that, if generated, will halt the generation process.  
  - The output will include these stop tokens unless they are special tokens.  

- **`bad_words`**: A list of words that are prohibited from being generated.  
  - The model avoids generating tokens that would complete any of these words.  

- **`seed`**: Sets the random seed for generation, ensuring reproducibility of the output.  
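
The three penalty parameters above are easiest to compare side by side on a toy example. The following is a minimal pure-Python sketch, not vLLM's actual code: `apply_penalties` is a hypothetical helper, the logits are made up, and the formulas follow the commonly used formulation (subtractive presence and frequency penalties over generated tokens, multiplicative repetition penalty over prompt and generated tokens). vLLM's internals may differ in detail.

```python
from collections import Counter


def apply_penalties(logits, prompt_ids, output_ids,
                    presence_penalty=0.0, frequency_penalty=0.0,
                    repetition_penalty=1.0):
    """Return a penalized copy of `logits` (a list indexed by token id).

    Hypothetical helper for illustration only.
    """
    counts = Counter(output_ids)               # occurrences in generated text
    seen = set(prompt_ids) | set(output_ids)   # prompt + generated text
    penalized = list(logits)
    for tok in range(len(penalized)):
        # repetition_penalty: applies to tokens seen in prompt OR output.
        # Dividing positive logits and multiplying negative ones both
        # lower the token's score when the penalty is greater than 1.
        if tok in seen:
            if penalized[tok] > 0:
                penalized[tok] /= repetition_penalty
            else:
                penalized[tok] *= repetition_penalty
        # frequency_penalty: grows with how often the token was generated.
        penalized[tok] -= frequency_penalty * counts[tok]
        # presence_penalty: flat, applied once if the token was generated.
        if counts[tok] > 0:
            penalized[tok] -= presence_penalty
    return penalized


logits = [2.0, 1.0, -1.0, 0.5]   # made-up scores for a 4-token vocabulary
prompt_ids = [0]                 # token 0 appears only in the prompt
output_ids = [1, 1, 2]           # token 1 generated twice, token 2 once

penalized = apply_penalties(
    logits, prompt_ids, output_ids,
    presence_penalty=0.5, frequency_penalty=0.25, repetition_penalty=2.0,
)
print(penalized)  # [1.0, -0.5, -2.75, 0.5]
```

Token 0 is affected only by `repetition_penalty` because it appears in the prompt but not in the output, while token 3 is untouched because it appears nowhere.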


In [5]:
# Create a SamplingParams object. temperature=0.0 selects greedy decoding,
# and top_p=1.0 keeps every token in consideration.
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=256,
)
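
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a small pure-Python sketch on a toy distribution. It is an illustration under stated assumptions, not vLLM's implementation: `softmax` and `filter_top_k_top_p` are hypothetical helpers, the logits and probabilities are made up, and real samplers typically renormalize the surviving probabilities before drawing a token.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=-1, top_p=1.0):
    """Return the token ids that survive top-k, then top-p, filtering."""
    # Rank token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k != -1:                           # -1 means "keep all tokens"
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:               # smallest set reaching top_p
            break
    return kept


# Lower temperature sharpens the distribution; higher flattens it.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, temperature=t)])

# Filtering a made-up 4-token distribution.
probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_k=3))     # [0, 1, 2]
print(filter_top_k_top_p(probs, top_p=0.7))   # [0, 1]
print(filter_top_k_top_p(probs))              # [0, 1, 2, 3]
```

With `temperature=0` the division is undefined, which is why greedy decoding is treated as a special case that simply picks the highest-scoring token.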

In [6]:
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 03-13 08:11:09 __init__.py:207] Automatically detected platform cuda.



INFO 03-13 08:11:26 config.py:549] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 03-13 08:11:26 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_ste


INFO 03-13 08:11:32 cuda.py:178] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-13 08:11:32 cuda.py:226] Using XFormers backend.
INFO 03-13 08:11:33 model_runner.py:1110] Starting to load model facebook/opt-125m...
INFO 03-13 08:11:33 weight_utils.py:254] Using model weights format ['*.bin']



INFO 03-13 08:11:35 weight_utils.py:270] Time spent downloading weights for facebook/opt-125m: 2.008597 seconds




INFO 03-13 08:11:36 model_runner.py:1115] Loading model weights took 0.2389 GB
INFO 03-13 08:11:38 worker.py:267] Memory profiling takes 0.94 seconds
INFO 03-13 08:11:38 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 03-13 08:11:38 worker.py:267] model weights take 0.24GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 12.53GiB.
INFO 03-13 08:11:38 executor_base.py:111] # cuda blocks: 22813, # CPU blocks: 7281
INFO 03-13 08:11:38 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 178.23x
INFO 03-13 08:11:44 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:29<00:00,  1.20it/s]

INFO 03-13 08:12:13 model_runner.py:1562] Graph capturing finished in 29 secs, took 0.14 GiB
INFO 03-13 08:12:13 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 36.96 seconds



Processed prompts: 100%|██████████| 4/4 [00:01<00:00,  3.81it/s, est. speed input: 24.79 toks/s, output: 976.30 toks/s]

Prompt: 'Hello, my name is', Generated text: ' J.C. and I am a student at the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a gr
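
The completion above loops on the same phrase, which is typical of greedy decoding (`temperature=0.0`) with a small model. One rough way to quantify this is the share of distinct word n-grams in the text. The sketch below assumes a hypothetical helper, `distinct_ngram_ratio`, and uses a short excerpt of the output plus a made-up comparison sentence; it is a quick diagnostic, not part of vLLM.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams in `text` that are unique (1.0 = no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:          # text shorter than n words: nothing can repeat
        return 1.0
    return len(set(ngrams)) / len(ngrams)


# Short excerpt of the looping completion shown above.
looping = (
    "I am a graduate of the University of California, Berkeley, "
    "and a graduate of the University of California, Berkeley."
)
# Made-up sentence without repeated phrases, for comparison.
varied = "Paris is the capital of France and sits on the Seine."

print(distinct_ngram_ratio(looping))
print(distinct_ngram_ratio(varied))
```

A low ratio suggests trying a `repetition_penalty` above `1`, a positive `frequency_penalty`, or a nonzero `temperature`, as described in the parameter list.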


