In [1]:
import sys
sys.path.append("..")

# Model

The framework provides three LLM backends: `HFModel`, `VLLMModel`, and `ApiVLLMModel`
(with `HFModelReasoning`, `VLLMModelReasoning`, and `ApiVLLMModelReasoning` as their counterparts for reasoning models).

We will go through the basics of working with models using `HFModel` as an example.

In [2]:
from llmtf.model import HFModel



[2026-01-27 20:24:36,776] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


INFO 01-27 20:24:38 [__init__.py:239] Automatically detected platform cuda.


Choose the model to use and load it.

In [3]:
model_name_or_path = 'Qwen/Qwen3-0.6B'

In [4]:
model = HFModel(device_map='cuda:0')
model.from_pretrained(model_name_or_path)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151669/151669 [00:00<00:00, 490159.68it/s]
INFO: 2026-01-27 20:24:43,422: llmtf.base.hfmodel: Model id: Qwen/Qwen3-0.6B
INFO: 2026-01-27 20:24:43,423: llmtf.base.hfmodel: Leading space: False


Now let's look at the main model methods.

## Text generation

To generate text, pass a prompt (a list of messages) to the `generate` method.

In [5]:
response = model.generate([
    {"role": "user",      "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user",      "content": "What's 3 + 4?"}
])
prompt, output, info = response

`generation_config` default values have been modified to match model-specific defaults: {'bos_token_id': 151643}. If this is not desired, please set these values explicitly.


In [6]:
output

'3 + 4 is 7.'

In [7]:
prompt

"<|im_start|>user\nWhat's 2 + 2?<|im_end|>\n<|im_start|>assistant\n2 + 2 is 4.<|im_end|>\n<|im_start|>user\nWhat's 3 + 4?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

In [8]:
info

{'prompt_len': 46,
 'generated_len': [9],
 'generated_cumulative_logprob': 'TODO: calculate for hf model'}

If the prompt ends with a message whose role is `assistant`, the LLM continues that message.

In [9]:
response = model.generate([
    {"role": "user",      "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user",      "content": "What's 3 + 4?"},
    {"role": "assistant", "content": "3 + 4 is"},
])
_, output, _ = response

In [10]:
output

' 7.'
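The prompt layout printed above, and the continuation behaviour, can be reproduced with a few lines of plain Python. This is only an illustrative sketch inferred from the printed prompts in this notebook; it is not how llmtf builds prompts internally, and `render_prompt` and `GENERATION_PREFIX` are hypothetical names. The assumption is that a trailing `assistant` message is appended, unclosed, after the generation prefix, which matches the prompt shown later for `calculate_tokens_proba`.

```python
# Generation prefix observed in the printed prompts (with an empty reasoning block).
GENERATION_PREFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def render_prompt(messages):
    """Render chat messages into the ChatML-style layout seen in this notebook."""
    messages = list(messages)  # shallow copy so the caller's list is not modified
    continuation = ""
    # Assumption: a trailing assistant message is treated as text to continue.
    if messages and messages[-1]["role"] == "assistant":
        continuation = messages.pop()["content"]
    body = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return body + GENERATION_PREFIX + continuation


plain = render_prompt([
    {"role": "user", "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user", "content": "What's 3 + 4?"},
])
continued = render_prompt([
    {"role": "user", "content": "Is there water on Mars? Respond only with Yes or No"},
    {"role": "assistant", "content": "Answer: "},
])
print(plain)
print(continued)
```

In the second call the rendered prompt ends with the unfinished assistant text, so the model's next tokens extend it instead of starting a new turn.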

Batched text generation is also available via `generate_batch`.

In [11]:
response = model.generate_batch([
    [{"role": "user", "content": "What's 2 + 2?"}],
    [{"role": "user", "content": "What's the capital of Russia?"}],
    [{"role": "user", "content": "How far is the moon?"}],
])
prompts, outputs, infos = response

In [12]:
outputs

['2 + 2 equals 4.',
 'The capital of Russia is Moscow.',
 'The distance from the Earth to the Moon is approximately **384,400 kilometers** (or about 238,855 miles). This is the average distance between the Earth and the Moon as measured by the distance between the two celestial bodies.']

In [13]:
prompts

["<|im_start|>user\nWhat's 2 + 2?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
 "<|im_start|>user\nWhat's the capital of Russia?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n",
 '<|im_start|>user\nHow far is the moon?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n']

In [14]:
infos

[{'prompt_len': 20,
  'generated_len': [56],
  'generated_cumulative_logprob': 'TODO: calculate for hf model'},
 {'prompt_len': 19,
  'generated_len': [56],
  'generated_cumulative_logprob': 'TODO: calculate for hf model'},
 {'prompt_len': 18,
  'generated_len': [56],
  'generated_cumulative_logprob': 'TODO: calculate for hf model'}]
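When there are more prompts than fit into a single call, they can be split into fixed-size batches first. The helper below is a generic sketch and not part of llmtf; `chunked` is a hypothetical name, and the call to `generate_batch` is left as a comment because it needs a loaded model.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 100 single-message prompts, built only for illustration.
prompts = [[{"role": "user", "content": f"What's {i} + {i}?"}] for i in range(100)]
batches = list(chunked(prompts, batch_size=8))

# Each batch could then be passed on, e.g.:
#   prompts_out, outputs, infos = model.generate_batch(batch)
print(len(batches), len(batches[-1]))  # 13 4
```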

## Token probability calculation

You can get next-token probabilities with the `calculate_tokens_proba` method.
In addition to the prompt, pass a list of the tokens you are interested in.

The method also automatically augments each token with its leading-space variant (`Yes` -> `_Yes`) and considers both cases.

In [15]:
response = model.calculate_tokens_proba([
    {"role": "user", "content": "Is there water on Mars? Respond only with Yes or No"},
    {"role": "assistant", "content": "Answer: "}
], ["Yes", "No"])
prompt, probs, info = response

In [16]:
probs

{'Yes': 0.201171875, 'No': 0.0272216796875}
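The two values above do not sum to one, presumably because the remaining probability mass falls on tokens outside the requested list. If only the relative preference between the candidates matters, the values can be renormalized afterwards. This is a post-processing sketch, not something llmtf is known to do; `normalize_over_candidates` is a hypothetical helper.

```python
def normalize_over_candidates(probs):
    """Rescale candidate probabilities so that they sum to 1."""
    total = sum(probs.values())
    if total == 0:
        raise ValueError("all candidate probabilities are zero")
    return {token: p / total for token, p in probs.items()}


# Values copied from the output above.
probs = {"Yes": 0.201171875, "No": 0.0272216796875}
relative = normalize_over_candidates(probs)
best = max(relative, key=relative.get)
print(best, relative)
```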

In [17]:
prompt

'<|im_start|>user\nIs there water on Mars? Respond only with Yes or No<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nAnswer: '

In [18]:
info

{'prompt_len': 27,
 'generated_len': 1,
 'generated_cumulative_logprob': 'TODO: calculate for hf model',
 'generated_token': '1'}

Batched token probability calculation is likewise available via `calculate_tokens_proba_batch`.

## Token logits

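You can get per-token log-softmax values (log-probabilities) for the prompt with the `calculate_logsoftmax` function. In the output below they are attached to the last message under the `tokens` key.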

In [19]:
response = model.calculate_logsoftmax([
    {"role": "user",      "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user",      "content": "What's 3 + 4?"},
    {"role": "user",      "content": "3 + 4 is 7.?"}
])
prompt, messages, info = response

In [20]:
messages

[{'role': 'user', 'content': "What's 2 + 2?"},
 {'role': 'assistant', 'content': '2 + 2 is 4.'},
 {'role': 'user', 'content': "What's 3 + 4?"},
 {'role': 'user',
  'content': '3 + 4 is 7.?',
  'tokens': [[18, -28.0, [143, 144]],
   [488, -0.0478515625, [144, 146]],
   [220, -0.00286865234375, [146, 147]],
   [19, -0.0235595703125, [147, 148]],
   [374, -0.8671875, [148, 151]],
   [220, -0.029052734375, [151, 152]],
   [22, -0.1123046875, [152, 153]],
   [81408, -9.875, [153, 155]],
   [151645, -4.5, [155, 165]]]}]

In [21]:
prompt

"<|im_start|>user\nWhat's 2 + 2?<|im_end|>\n<|im_start|>assistant\n2 + 2 is 4.<|im_end|>\n<|im_start|>user\nWhat's 3 + 4?<|im_end|>\n<|im_start|>user\n3 + 4 is 7.?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

In [22]:
info

{'prompt_len': 59,
 'generated_len': 1,
 'generated_cumulative_logprob': 'TODO: calculate for hf model'}
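The `tokens` entries above can be post-processed with plain Python. The sketch below assumes each entry is `[token_id, log_probability, [start, end]]`, where the last pair is a character span in the prompt string. That reading is inferred from the printed output, not from llmtf documentation. The code sums the log-probabilities into a sequence score, derives a perplexity, and recovers the text of each token from its span.

```python
import math

# Prompt and token entries copied from the outputs above.
prompt = (
    "<|im_start|>user\nWhat's 2 + 2?<|im_end|>\n"
    "<|im_start|>assistant\n2 + 2 is 4.<|im_end|>\n"
    "<|im_start|>user\nWhat's 3 + 4?<|im_end|>\n"
    "<|im_start|>user\n3 + 4 is 7.?<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)
tokens = [
    [18, -28.0, [143, 144]],
    [488, -0.0478515625, [144, 146]],
    [220, -0.00286865234375, [146, 147]],
    [19, -0.0235595703125, [147, 148]],
    [374, -0.8671875, [148, 151]],
    [220, -0.029052734375, [151, 152]],
    [22, -0.1123046875, [152, 153]],
    [81408, -9.875, [153, 155]],
    [151645, -4.5, [155, 165]],
]

# Log-probability of the whole message: the sum of per-token log-probabilities.
total_logprob = sum(logprob for _, logprob, _ in tokens)
# Perplexity: exponent of the negative mean log-probability.
perplexity = math.exp(-total_logprob / len(tokens))
# Text of each token, recovered from its character span in the prompt.
pieces = [prompt[start:end] for _, _, (start, end) in tokens]

print(total_logprob, perplexity)
print(pieces)
```

The first token and the unusual `.?` ending carry most of the negative score, which is what one would expect for the least predictable parts of the message.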

A batched version is likewise available: `calculate_logsoftmax_batch`.

At the moment, only the `HF` backend supports retrieving logits.

## Base models

Let's look at working with local base models, using `Qwen/Qwen3-0.6B-Base` as an example.

In [23]:
model_name_or_path = 'Qwen/Qwen3-0.6B-Base'

By default, the `conversation_template` from `llmtf_open/conversation_configs/default_foundational.json` is used, but you can pass your own.

In [24]:
model_foundational = HFModel(device_map='cuda:0')
model_foundational.from_pretrained(
    model_name_or_path,
    is_foundational=True,
    # conversation_template_path=...
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151669/151669 [00:00<00:00, 486970.33it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151669/151669 [00:00<00:00, 480334.14it/s]
INFO: 2026-01-27 20:22:19,760: llmtf.base.hfmodel: Model id: Qwen/Qwen3-0.6B-Base
INFO: 2026-01-27 20:22:19,761: llmtf.base.hfmodel: Leading space: False


In [25]:
response = model_foundational.generate([
    {"role": "user",      "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user",      "content": "What's 3 + 4?"},
    {"role": "assistant", "content": "3 + 4 is"},
])
prompt, output, info = response

In [26]:
output

' 7.'

In [27]:
prompt

"Query: What's 2 + 2?\n\nResponse: 2 + 2 is 4.\n\nQuery: What's 3 + 4?\n\nResponse: 3 + 4 is"

In [28]:
info

{'prompt_len': 39,
 'generated_len': [3],
 'generated_cumulative_logprob': 'TODO: calculate for hf model'}
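The base-model prompt printed above uses a plain-text layout instead of chat markup. A minimal sketch that reproduces that exact string is shown below. The role prefixes and the blank-line separator are read off the printed prompt; the actual contents of `default_foundational.json` are not shown in this notebook, so `ROLE_PREFIX` and `render_foundational` are hypothetical.

```python
# Role prefixes as they appear in the printed base-model prompt.
ROLE_PREFIX = {"user": "Query: ", "assistant": "Response: "}


def render_foundational(messages):
    """Join messages as 'Query: ...' / 'Response: ...' blocks separated by blank lines."""
    return "\n\n".join(ROLE_PREFIX[m["role"]] + m["content"] for m in messages)


rendered = render_foundational([
    {"role": "user", "content": "What's 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 is 4."},
    {"role": "user", "content": "What's 3 + 4?"},
    {"role": "assistant", "content": "3 + 4 is"},
])
print(rendered)
```

Because the last block is left open, a base model simply continues the text, which is why the output above is `' 7.'`.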

## Reasoning models

Reasoning models have dedicated classes: `HFModelReasoning`, `VLLMModelReasoning`, and `ApiVLLMModelReasoning`.

Let's look at what changes in the `generate` and `calculate_tokens_proba` methods (the `calculate_logsoftmax` method stays the same), using `VLLMModelReasoning` as an example.

In [29]:
from llmtf.model import VLLMModelReasoning

In [30]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


Load the model again.

In [31]:
model_name_or_path = 'Qwen/Qwen3-0.6B'

In [32]:
model_reasoning = VLLMModelReasoning(device_map='cuda:0')
model_reasoning.from_pretrained(model_name_or_path)

INFO: 2026-01-27 20:22:21,235: llmtf.base.vllmmodelreasoning: CUDA_VISIBLE_DEVICES=0
INFO: 2026-01-27 20:22:21,236: llmtf.base.vllmmodelreasoning: device_map=cuda:0


INFO 01-27 20:22:29 [config.py:689] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 01-27 20:22:29 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qw

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 01-27 20:22:41 [loader.py:458] Loading weights took 0.18 seconds
INFO 01-27 20:22:42 [model_runner.py:1146] Model loading took 1.1206 GiB and 0.851046 seconds
INFO 01-27 20:22:43 [worker.py:267] Memory profiling takes 0.57 seconds
INFO 01-27 20:22:43 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.15GiB) x gpu_memory_utilization (0.95) = 75.19GiB
INFO 01-27 20:22:43 [worker.py:267] model weights take 1.12GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 72.68GiB.
INFO 01-27 20:22:43 [executor_base.py:112] # cuda blocks: 42527, # CPU blocks: 2340
INFO 01-27 20:22:43 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 83.06x
INFO 01-27 20:22:46 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.

Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]

INFO 01-27 20:23:07 [model_runner.py:1598] Graph capturing finished in 21 secs, took 0.33 GiB
INFO 01-27 20:23:07 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 25.57 seconds


INFO: 2026-01-27 20:23:09,309: llmtf.base.vllmmodelreasoning: Model id: Qwen/Qwen3-0.6B
INFO: 2026-01-27 20:23:09,310: llmtf.base.vllmmodelreasoning: Leading space: False


Set the token budgets for the reasoning and for the answer.

In [33]:
model_reasoning.generation_config.max_new_tokens_reasoning = 256
model_reasoning.generation_config.max_new_tokens = 128

Text generation:

In [34]:
response = model_reasoning.generate([
    {"role": "user", "content": "Solve for x: x^2 - 10x + 9 = 0. Write only the answer and nothing else!"},
], add_reasoning_truncing_prompt=True)
_, output, info = response

In [35]:
output

''

The model's reasoning is not included in the output; it is stored in `info` instead.

In [36]:
info

{'reasoning': {'prompt_len': 36,
  'generated_len': [256],
  'generated_cumulative_logprob': [None],
  'text': "<think>\nOkay, so I need to solve the quadratic equation x² - 10x + 9 = 0. Hmm, let me think. I remember there are a few methods to solve quadratic equations: factoring, completing the square, and the quadratic formula. Let me try factoring first because it might be the simplest here.\n\nLooking at the equation x² - 10x + 9 = 0, I need two numbers that multiply to 9 and add up to -10. Wait, since the middle term is -10x, and the constant term is +9, the numbers should multiply to 9 and add up to -10. Let me think... 9 and 1? 9*1=9, and 9+1=10. But that's positive. But since the middle term is negative, maybe they are both negative? Let me check. If both numbers are negative, say -1 and -9, then their product is positive (since negative times negative is positive) and their sum is -10. Yes! That works. So the equation factors as (x - 1)(x - 9) = 0. \n\nWait, let me verify that
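In the `info` above, `generated_len` equals the reasoning budget of 256 tokens and the reasoning text has no closing `</think>`, which suggests the reasoning was cut off. A small check for this situation might look as follows. It relies only on the `info['reasoning']` fields visible in this notebook; `reasoning_truncated` is a hypothetical helper, and the sample texts are abbreviated stand-ins for the real reasoning.

```python
def reasoning_truncated(info, budget):
    """Return True if the reasoning hit the token budget or never closed its block."""
    reasoning = info["reasoning"]
    hit_budget = any(n >= budget for n in reasoning["generated_len"])
    unclosed = "</think>" not in reasoning["text"]
    return hit_budget or unclosed


# Abbreviated stand-ins for the two `info` shapes shown in this notebook.
cut_off = {"reasoning": {"generated_len": [256],
                         "text": "<think>\nOkay, so I need to solve ..."}}
finished = {"reasoning": {"generated_len": [177],
                          "text": "<think>\nOkay, the user is asking ...\n</think>"}}

print(reasoning_truncated(cut_off, budget=256))
print(reasoning_truncated(finished, budget=256))
```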

Token probability calculation:

In [37]:
response = model_reasoning.calculate_tokens_proba([
    {"role": "user", "content": "Is there water on Mars? Respond only with Yes or No"},
    {"role": "assistant", "content": "Answer: "}
], ["Yes", "No"])
prompt, probs, info = response

In [38]:
probs

{'Yes': 0.013340085990513556, 'No': 0.935212152458615}

In [39]:
info

{'reasoning': {'prompt_len': 20,
  'generated_len': [177],
  'generated_cumulative_logprob': [None],
  'text': "<think>\nOkay, the user is asking if there's water on Mars. I need to recall what I know about Mars. I remember that Mars has a thin atmosphere, but I'm not sure about the presence of water. Let me think. I think the water is mostly in the form of ice, like on the poles. But wait, some sources say that there's water ice on the surface, but not in liquid form. Also, the atmosphere is mostly CO2, so maybe there's no liquid water. I should check if there's any evidence. Oh right, there's a possibility of water ice, but not liquid. So the answer would be No, there's no liquid water on Mars. But wait, the user wants only Yes or No. So I need to make sure I'm not making up anything. Yeah, I think the answer is No.\n</think>"},
 'response': {'generated_len': 1, 'generated_token': ' No'}}

# Tasks

Let's evaluate the model on a benchmark.

In [5]:
from llmtf.evaluator import Evaluator

LLMAAJ информация не была добавлена - пропускаем инициализацию модели для rag_llmaaj задач


As an example, we will use `vikhrmodels/habr_qa_sbs`.

In [6]:
from llmtf.tasks.habrqa import HabrQASbS

In [7]:
evaluator = Evaluator()
evaluator.add_new_task('HabrQASbS', HabrQASbS, {})

In [8]:
output_dir = './output'
datasets_names = ['HabrQASbS']
evaluator.evaluate(
    model,
    output_dir,
    datasets_names=datasets_names,
    batch_size=8,
    max_sample_per_dataset=100
)

INFO: 2026-01-27 20:24:56,015: llmtf.base.evaluator: Starting eval on ['HabrQASbS']
INFO: 2026-01-27 20:24:56,015: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [151645]
INFO: 2026-01-27 20:24:56,016: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>']
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 208.52it/s]
INFO: 2026-01-27 20:25:03,998: llmtf.base.habrqasbs: Loading Dataset: 7.98s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:03<00:00,  4.09it/s]
INFO: 2026-01-27 20:25:07,182: llmtf.base.habrqasbs: Processing Dataset: 3.18s
INFO: 202