# Evaluation of code completion with 🤗 and vLLM

In this notebook, we provide an example of how you can use vLLM and open-source models to set up fast evaluation similar to what we do in the competition. The notebook closely follows what we do in the competition evaluator, with the major difference being that we use cloud inference in the competition.

In [1]:
import os
import jsonlines
from dataclasses import dataclass
from typing import Callable

# We will use vLLM for fast inference
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from codegen_metrics import chrf

from dotenv import load_dotenv

load_dotenv()
DATA_DIR = os.getenv("DATA_DIR")

  from .autonotebook import tqdm as notebook_tqdm


INFO 07-23 13:41:02 [__init__.py:244] Automatically detected platform cuda.


## Special tokens
The listed special tokens are used by the evaluated models to construct correct sequences for FIM (fill in the middle).
We suggest using them for your experiments.

In [2]:
@dataclass
class ModelTokens:
    filename: str
    prefix: str
    suffix: str
    middle: str

    def special_tokens(self):
        return [self.filename, self.prefix, self.suffix, self.middle]

qwen_tokens = ModelTokens("<|file_sep|>", "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>")
mellum_tokens = ModelTokens("<filename>", "<fim_prefix>", "<fim_suffix>", "<fim_middle>")
codestral_tokens = ModelTokens("+++++", "[PREFIX]", "[SUFFIX]", "[MIDDLE]")

DEFAULT_FILE_SEP = qwen_tokens.filename

## Model inference

In this example, we use vLLM for fast inference and use sampling parameters for greedy decoding. In the competition evaluator, we rely on a separate remote provider for each model and use similar sampling parameters to ensure stable results.

In [3]:
class VLLMSampler:
    def __init__(
            self, 
            model_name: str, 
            special_tokens: list[str],
            sampling_parameters: dict,
    ):
        self.sampling_params = SamplingParams(stop=special_tokens, **sampling_parameters)
        self.model_name = model_name
        self.model = LLM(model=model_name)

    def generate(self, prompts):
        outputs = self.model.generate(prompts, self.sampling_params)
        completions = [output.outputs[0].text for output in outputs]
        return completions

In [4]:
sampling_params = {
    "temperature": 0.0,
    "max_tokens": 384,
    "seed": 42,
}

## Data preparation

For correct construction of FIM prompt, please refer to the docs of the respective model ([Mellum](https://huggingface.co/JetBrains/Mellum-4b-sft-python#fill-in-the-middle-with-additional-files-as-context-generation), [Qwen](https://github.com/QwenLM/Qwen2.5-Coder?tab=readme-ov-file#-code-with-qwen25-coder-32b), [Codestral](https://huggingface.co/mistralai/Codestral-22B-v0.1#fill-in-the-middle-fim)). Here we use the format provided by Mellum authors, as per [the model page](https://huggingface.co/JetBrains/Mellum-4b-sft-python).

We trim the context by lines to fit into the context window. We remove the starting lines first, so we assume that the most relevant context appears closer to the end of the context sequence.

In [5]:
# We assume that the .jsonl file with the dataset is in the "data/" folder
def read_dataset(dataset_name: str, folder_path: str) -> list[dict]:
    with jsonlines.open(os.path.join(folder_path, f"{dataset_name}.jsonl"), "r") as f:
        dataset = [datapoint for datapoint in f]
        return dataset

In [6]:
def truncate_context(context: str, prefix: str, suffix: str, max_new_tokens: int, max_tokens: int, encode: Callable) -> str:
    num_tokens = max_new_tokens
    num_tokens += len(encode(prefix))
    num_tokens += len(encode(suffix))

    context_lines = context.splitlines(keepends=True)
    truncated_context = ''
    while len(context_lines) > 0:
        # Adding lines from the end of the provided context
        curr_line = context_lines.pop(-1)
        line_tokens = len(encode(curr_line))
        num_tokens += line_tokens
        if num_tokens > max_tokens:
            break
        else:
            truncated_context = curr_line + truncated_context

    return truncated_context

In [7]:
# Python version of Mellum
mellum_tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum-4b-sft-python")

def build_mellum_prompt(context: str, filename: str, prefix: str, suffix: str):
    context.replace(DEFAULT_FILE_SEP, mellum_tokens.filename)
    truncated_context = truncate_context(
        context, prefix, suffix, 
        max_new_tokens=sampling_params["max_tokens"], max_tokens=8000, encode=mellum_tokenizer.tokenize
    )
    context = f"{truncated_context}\n"
    filename = f"{mellum_tokens.filename}{filename}\n"
    prefix = f"{mellum_tokens.prefix}{prefix}"
    suffix = f"{mellum_tokens.suffix}{suffix}"
    middle = f"{mellum_tokens.middle}"
    prompt = context + filename + suffix + prefix + middle
    return prompt

In [8]:
dataset = read_dataset("python-public", DATA_DIR)
answers = read_dataset("answers-python-public", DATA_DIR)
contexts = read_dataset("python-public-ningzhi", "../predictions")

In [9]:
mellum_prompts = []

for datapoint, prediction in zip(dataset, contexts):
    context = prediction["context"]
    filename = datapoint["path"]
    prefix = prediction.get("prefix", datapoint["prefix"])
    suffix = prediction.get("suffix", datapoint["suffix"])
    prompt = build_mellum_prompt(context, filename, prefix, suffix)
    mellum_prompts.append(prompt)

## Evaluation example

As an example, we provide inference of the open-source version of Mellum fine-tuned on Python. For Kotlin, please use the respective [fine-tuned version](https://huggingface.co/JetBrains/Mellum-4b-sft-kotlin).

We use ChrF as the metric.

In [10]:
sampler = VLLMSampler(
    model_name="JetBrains/Mellum-4b-sft-python",
    special_tokens=mellum_tokens.special_tokens(),
    sampling_parameters=sampling_params
)

INFO 07-23 13:41:11 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-23 13:41:11 [config.py:1472] Using max model len 8192


2025-07-23 13:41:11,533	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 07-23 13:41:11 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-23 13:41:11 [core.py:526] Waiting for init message from front-end.
INFO 07-23 13:41:11 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='JetBrains/Mellum-4b-sft-python', speculative_config=None, tokenizer='JetBrains/Mellum-4b-sft-python', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, o

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO 07-23 13:41:12 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-23 13:41:12 [gpu_model_runner.py:1770] Starting to load model JetBrains/Mellum-4b-sft-python...
INFO 07-23 13:41:12 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-23 13:41:12 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-23 13:41:13 [weight_utils.py:292] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.17it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.13it/s]



INFO 07-23 13:41:14 [default_loader.py:272] Loading weights took 1.00 seconds
INFO 07-23 13:41:14 [gpu_model_runner.py:1801] Model loading took 7.4885 GiB and 1.350249 seconds
INFO 07-23 13:41:18 [backends.py:508] Using cache directory: /home/ntang/.cache/vllm/torch_compile_cache/b0e86e241c/rank_0_0/backbone for vLLM's torch.compile
INFO 07-23 13:41:18 [backends.py:519] Dynamo bytecode transform time: 4.26 s
INFO 07-23 13:41:22 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 2.969 s
INFO 07-23 13:41:23 [monitor.py:34] torch.compile takes 4.26 s in total
INFO 07-23 13:41:24 [gpu_worker.py:232] Available KV cache memory: 34.28 GiB
INFO 07-23 13:41:24 [kv_cache_utils.py:716] GPU KV cache size: 99,840 tokens
INFO 07-23 13:41:24 [kv_cache_utils.py:720] Maximum concurrency for 8,192 tokens per request: 12.19x


Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:17<00:00,  3.84it/s]


INFO 07-23 13:41:41 [gpu_model_runner.py:2326] Graph capturing finished in 17 secs, took 0.47 GiB
INFO 07-23 13:41:41 [core.py:172] init engine (profile, create kv cache, warmup model) took 27.30 seconds


In [11]:
mellum_completions = sampler.generate(mellum_prompts)

chrf_values = [
    chrf(answer["middle"], prediction)
    for answer, prediction in zip(answers, mellum_completions)
]

print("Mellum ChrF:", sum(chrf_values) / len(chrf_values))

Adding requests: 100%|██████████| 247/247 [00:01<00:00, 199.19it/s]
Processed prompts: 100%|██████████| 247/247 [03:21<00:00,  1.23it/s, est. speed input: 6132.18 toks/s, output: 143.61 toks/s]

Mellum ChrF: 0.5395045573048932



