Here we load a few libraries for sending requests to the server. In particular, we use the `requests` library to make HTTP requests to the vLLM server running on our node.

The `MODEL_PATH` variable is set to the directory containing the model weights, since vLLM uses this path to identify the model.

In [2]:
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
import time
import requests

MODEL_PATH = "/n/netscratch/kempner_dev/Everyone/models/Llama-3.1-70B"

The vLLM server accepts sampling parameters similar to those of the OpenAI API. These parameters let you change settings such as `temperature` and `top_k`, and control whether log probabilities are returned with each token. You can find more details on the available sampling parameters in the [vLLM repo](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py#L87).

In the `InferenceRequestParams` class below, we use a subset of the available fields, which will be serialized through the `.dict()` method before being sent to the server.

In [None]:
@dataclass
class InferenceRequestParams:
    model: str # Should be set to MODEL_PATH
    prompt: str
    max_tokens: int
    best_of: int = 1
    n: int = 1
    temperature: float = 1.0
    frequency_penalty: float = 0.0
    top_k: int = -1
    logprobs: int | None = None
    prompt_logprobs: int | None = None

    def dict(self) -> dict:
        return asdict(self)
    
request_params = InferenceRequestParams(MODEL_PATH, "San Francisco is a ", 10, n=2, best_of=4, temperature=1.0, logprobs=0, prompt_logprobs=0)
request_params.dict()
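
To make the serialization step concrete, here is a minimal sketch of how a dataclass becomes a JSON request body. `ExampleParams` and `to_json_body` are hypothetical names used only for illustration, and the model path is a placeholder; the class is a cut-down stand-in for `InferenceRequestParams`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExampleParams:
    """Hypothetical, smaller stand-in for InferenceRequestParams."""
    model: str
    prompt: str
    max_tokens: int
    temperature: float = 1.0
    logprobs: Optional[int] = None


def to_json_body(params: ExampleParams) -> str:
    # asdict() converts the dataclass into a plain dict; json.dumps() then
    # produces JSON text comparable to what requests.post(..., json=...) sends.
    return json.dumps(asdict(params))


body = to_json_body(ExampleParams("/path/to/model", "San Francisco is a ", 10))
print(body)
```

Fields left at `None` are serialized as JSON `null`, so the server sees them as unset.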

We can send a request to the server with an HTTP POST to the `localhost:8000/v1/completions` endpoint, using the `requests.post` function. The `json` argument sends our sampling parameters as a JSON payload. The response is returned as JSON and contains the text completion along with some metadata about the request.

Try changing the `temperature` value in the code below and rerunning it to see what outputs you get. You'll notice that `temperature=0.0` always produces the same output, while higher `temperature` values produce more random outputs.

In [None]:
def send_request(request_params: InferenceRequestParams):
    response = requests.post('http://localhost:8000/v1/completions', json = request_params.dict())
    return response.json()

send_request(InferenceRequestParams(MODEL_PATH, "San Francisco is a ", 10, temperature=FIXME))
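
The effect of temperature described above can be illustrated without a server. The sketch below applies a temperature-scaled softmax to made-up scores for three candidate tokens. The function name and the scores are hypothetical, and treating `temperature=0.0` as greedy selection is an assumption about how the server behaves, not something taken from its code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding, so all
        # probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

As the temperature rises, the probabilities move closer together, which is why sampled completions vary more.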

There are also `logprobs` and `prompt_logprobs` fields for extracting the log probabilities of the tokens in the completion and the prompt, respectively. When set to `None` (the default), the server does not return log probabilities. When set to a non-negative integer `k`, the server returns the `k` highest log probabilities at each token generation step, along with the log probability of the generated token if it is not in the top `k`. Note that if `k=0`, the server returns only the log probabilities of the generated tokens.

Try rerunning the following code with different values of `logprobs` and `prompt_logprobs` and see what outputs you get.

In [None]:
send_request(InferenceRequestParams(MODEL_PATH, "San Francisco is a ", 10, temperature=1.0, logprobs=FIXME, prompt_logprobs=FIXME))
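
Once log probabilities are returned, they can be combined into a sequence score. The sketch below works on a hand-written `mock_response` rather than real server output: the layout (`choices`, `text`, `logprobs.token_logprobs`) is assumed to follow the OpenAI-style completions format, and the tokens and values are invented. `summarize_choice` is a hypothetical helper.

```python
import math

# Invented response in the assumed OpenAI-style completions layout.
mock_response = {
    "choices": [
        {
            "text": " city in",
            "logprobs": {
                "tokens": [" city", " in"],
                "token_logprobs": [-0.5, -1.2],
            },
        }
    ]
}


def summarize_choice(choice):
    """Return the text, total log probability, and perplexity of one choice."""
    # Skip entries that are None (a position with no log probability).
    token_logprobs = [lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None]
    total = sum(token_logprobs)
    # Perplexity is exp of the negative mean log probability per token.
    perplexity = math.exp(-total / len(token_logprobs))
    return choice["text"], total, perplexity


text, total_logprob, perplexity = summarize_choice(mock_response["choices"][0])
print(text, total_logprob, perplexity)
```

Inspect a real response from your server before relying on these key names, since the exact layout can differ between versions.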

To process multiple prompts, we use Python multithreading to send several requests to the server at once. We put the prompts on a queue and create `NUM_THREADS` worker threads to process it. Each worker independently pulls a prompt from the queue and sends the corresponding request to the server. Compared to splitting the prompts into batches and processing the batches one by one, the queue approach performs better: when a prompt finishes quickly, its worker immediately pulls the next prompt from the queue. A batch, by contrast, must wait for its slowest prompt to finish before the next batch can start, even if most of the other prompts finished quickly.

Try running the following code. Feel free to add or remove prompts and change the value of `NUM_THREADS`. See how the execution time changes with the length of the `prompts` list and with `NUM_THREADS`.

In [None]:
NUM_THREADS = 4
prompts = ["San Francisco is a ", "Boston is a ", "Chicago is a ", "New York is a "]
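# Pass temperature by keyword: the fourth positional field is best_of, not temperature.
params = [InferenceRequestParams(MODEL_PATH, prompt, 2000, temperature=0.0) for prompt in prompts]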

start_time = time.time()
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    responses = pool.map(send_request, params)
print(f"Total time: {time.time() - start_time}")
[response['choices'][0]['text'] for response in responses]
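
The queue-versus-batch argument above can be checked with a small simulation that needs no server. The request durations are made up, and `batched_time` and `queued_time` are hypothetical helpers. The model is deliberately simple: each request takes a fixed time and the server's own scheduling is ignored.

```python
import heapq


def batched_time(durations, num_workers):
    """Total time when each batch waits for its slowest request."""
    total = 0.0
    for i in range(0, len(durations), num_workers):
        total += max(durations[i:i + num_workers])
    return total


def queued_time(durations, num_workers):
    """Total time when a free worker immediately takes the next request."""
    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * num_workers
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Made-up request durations in seconds: mostly short, two long.
durations = [1.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0, 8.0]
print("batched:", batched_time(durations, 4))
print("queued: ", queued_time(durations, 4))
```

In this example the queued schedule finishes sooner because the three workers that complete short requests start on the second group while the first long request is still running.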

Also check the output logs for the SLURM job running your server. It will show logs for when requests are received, as well as statistics on the number of tokens being processed per second and KV cache usage. This can be helpful for debugging and performance analysis.

Run the following code and look at the log file. Watch as the requests are received and how the KV cache memory usage grows over time as more tokens are generated with each request. The usage should also drop as each request finishes.

In [None]:
NUM_THREADS = 256
prompts = ["San Francisco is a "]*500
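# Pass temperature by keyword: the fourth positional field is best_of, not temperature.
params = [InferenceRequestParams(MODEL_PATH, prompt, 10000, temperature=1.0) for prompt in prompts]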

start_time = time.time()
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    responses = pool.map(send_request, params)
f"Total time: {time.time() - start_time}"
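
To get a rough sense of why KV cache usage grows with every generated token, the sketch below estimates the cache size per token. The model dimensions (80 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are assumptions about Llama-3.1-70B with a 16-bit cache; check the model's `config.json` before relying on them. The estimate ignores how the server allocates cache blocks and how the cache is split across GPUs, and `kv_cache_bytes_per_token` is a hypothetical helper.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Approximate KV cache bytes needed for one token of one sequence."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model dimensions; verify against the model's config.json.
per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
print(f"{per_token / 1024:.0f} KiB per token")

# Hypothetical load: 256 concurrent requests with 1,000 tokens each.
tokens_in_flight = 256 * 1_000
print(f"{per_token * tokens_in_flight / 2**30:.1f} GiB for {tokens_in_flight} tokens")
```

Comparing an estimate like this with the usage figures in the server log can help explain when requests start to wait for free cache space.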