This notebook fine-tunes an LLM with LoRA on the Dahoas/sft-gptj-synthetic-prompt-responses dataset.

The dataset contains about 44 thousand prompt-response pairs: the prompt column holds the task or question fed to the model, and the response column holds the desired answer, which we treat as the target.

The goal is to take a relatively small pretrained language model, attach LoRA adapters to its layers, train only these additional parameters on the dataset, and end up with a model that, given a new prompt, generates an answer as close as possible to the target response.
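
To make the idea of "training only the additional parameters" concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. The sizes, the names `matvec` and `lora_forward`, and the random initialization are illustrative assumptions, not the PEFT implementation used later in the notebook.

```python
import random

def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes, not the notebook's values

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init
x = [1.0] * d_in

# With B = 0 the adapter contributes nothing, so training starts from the base model.
y_init = lora_forward(W, A, B, x, alpha, r)
base_out = matvec(W, x)

full_params = d_in * d_out        # parameters in the frozen matrix W
lora_params = r * (d_in + d_out)  # parameters in the trainable matrices A and B
```

In this toy case the adapter is barely smaller than `W`. The saving only becomes large when `r` is much smaller than the layer dimensions, as it is for real transformer layers.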

In [1]:
from datasets import load_dataset
import random

In [None]:
!pip -q install --upgrade transformers accelerate datasets bitsandbytes peft trl sentencepiece einops evaluate
import torch, platform, os, transformers, datasets, peft, bitsandbytes, trl
print('Torch:', torch.__version__)
print('CUDA available?', torch.cuda.is_available())
!nvidia-smi || true

In [None]:
from dataclasses import dataclass

@dataclass
class Config:
    # Base model
    base_model: str = "gpt2"  # alternative: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    dataset_name: str = "Dahoas/sft-gptj-synthetic-prompt-responses"
    subset_size: int = 10000        # subsample so that training runs faster

    # QLoRA (4-bit) or plain LoRA
    use_qlora: bool = True         # True = QLoRA, False = LoRA without quantization

    # LoRA hyperparameters
    lora_r: int = 16  # rank of the LoRA low-rank matrices
    lora_alpha: int = 32  # scales the contribution of the LoRA adapters
    lora_dropout: float = 0.05

    # Training hyperparameters
    max_steps: int = 300
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.05

    output_dir: str = "outputs-lora-gptj"
    eval_prompts: list = None

cfg = Config()

if cfg.eval_prompts is None:
    cfg.eval_prompts = [
        "Explain what LoRA fine-tuning is in simple words.",
        "Give me three ideas for a weekend hobby for a data scientist.",
        "Summarize this dialogue in one sentence: Alice: how's the status? Bob: shipped v1 yesterday, patch coming Monday.",
        "What is the best way to save money to start a business?"
    ]

cfg


Tokenizer and base model

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [None]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=True)
tokenizer.padding_side = "right"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Numeric dtype
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
bnb_config = None



In [5]:
#!pip install -U bitsandbytes accelerate transformers datasets peft trl
cfg.use_qlora = False


In [None]:
if cfg.use_qlora:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        cfg.base_model,
        device_map="auto",
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        cfg.base_model,
        torch_dtype=dtype,
        device_map="auto",
        trust_remote_code=True,
    )

print("Loaded model:", cfg.base_model)

Checking the base model's responses

In [None]:
#pip install trl

In [9]:
from textwrap import indent
import torch
import re

import inspect
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig


In [10]:
def generate_base(prompt, max_new_tokens=200, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]
    with torch.no_grad():  # inference only, nothing is trained here
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id,
        )
        generated_tokens = outputs[0, input_len:]

    return tokenizer.decode(generated_tokens, skip_special_tokens=True)
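
The `temperature` argument passed to `generate` rescales the logits before sampling. The sketch below illustrates that effect in pure Python, with hypothetical logits and a helper name (`softmax_with_temperature`) that is not part of transformers; the library's internal implementation may differ in detail.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Turn logits into sampling probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.7)
```

A temperature below 1 moves probability mass toward the highest-scoring token, so 0.7 gives somewhat more conservative samples than 1.0.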



In [11]:
cfg.eval_prompts

['Explain what LoRA fine-tuning is in simple words.',
 'Give me three ideas for a weekend hobby for a data scientist.',
 "Summarize this dialogue in one sentence: Alice: how's the status? Bob: shipped v1 yesterday, patch coming Monday.",
 'What is the best way to save money to start a business?']

In [12]:
pre_eval = {}

In [13]:
pre_eval['Explain what LoRA fine-tuning is in simple words.'] = generate_base('Explain what LoRA fine-tuning is in simple words.')

In [14]:
pre_eval['Give me three ideas for a weekend hobby for a data scientist.'] = generate_base('Give me three ideas for a weekend hobby for a data scientist.')

In [15]:
pre_eval["Summarize this dialogue in one sentence: Alice: how's the status? Bob: shipped v1 yesterday, patch coming Monday."] = generate_base("Summarize this dialogue in one sentence: Alice: how's the status? Bob: shipped v1 yesterday, patch coming Monday.")

In [16]:
pre_eval['What is the best way to save money to start a business?'] = generate_base('What is the best way to save money to start a business?')

In [17]:
pre_eval

{'Explain what LoRA fine-tuning is in simple words.': "\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'Yes-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding\n\n* - Check for a 'No-Padding' checkbox if there is no padding",
 'Give me three ideas for a weekend hobby for a data scientist.': '\n\n1. Connect with the Data Scientist and build your skills using data science.\n\n2. Create a data science project that includes all the tools needed to create your own data science project.\n\n3. Create a data science team of data 

In [18]:
#pre_eval = {}
#for p in cfg.eval_prompts:
#    pre_eval[p] = generate_base(p)

print("=== BASELINE RESPONSES ===")
for p, out in pre_eval.items():
    print("\n# Prompt:\n", p)
    print("\n# Response (baseline):\n", indent(out, "  "))

=== BASELINE RESPONSES ===

# Prompt:
 Explain what LoRA fine-tuning is in simple words.

# Response (baseline):
 

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'Yes-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

# Prompt:
 Give me three ideas for a weekend hobby for a data scientist.

# Response (baseline):
 

  1. Connect with the Data Scientist and build your skills using data science.

  2. Create a data science project that includes all the tools 

In [None]:
ds = load_dataset(cfg.dataset_name)
ds

In [None]:
def format_example(ex):
    # this dataset already has prompt and response fields
    prompt = ex["prompt"]
    completion = ex["response"]

    # light cleanup
    prompt = re.sub(r"\s+$", "", prompt) + "\n"   # strip trailing whitespace, then add exactly one \n
    completion = completion.lstrip()             # strip leading whitespace

    return {"prompt": prompt, "completion": completion}

def prepare_split(dataset, split="train"):
    data = dataset[split]
    # take a subsample to speed things up (can be increased later)
    if cfg.subset_size and len(data) > cfg.subset_size:
        data = data.shuffle(seed=42).select(range(cfg.subset_size))
    mapped = data.map(format_example, remove_columns=data.column_names)
    return mapped

train_ds = prepare_split(ds, "train")
print(train_ds[0])
print("Train size:", len(train_ds))
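
To see what the cleanup does without downloading the dataset, the same logic can be applied to a single hand-written record. The record below is hypothetical and only mimics the `prompt`/`response` schema.

```python
import re

def format_example(ex):
    """Same cleanup as in the notebook: normalize the prompt ending, trim the response start."""
    prompt = re.sub(r"\s+$", "", ex["prompt"]) + "\n"
    completion = ex["response"].lstrip()
    return {"prompt": prompt, "completion": completion}

# Hypothetical record, not taken from the dataset.
record = {
    "prompt": "What is overfitting?  \n\n",
    "response": "  It is when a model memorizes the training data.",
}
formatted = format_example(record)
print(repr(formatted["prompt"]))
print(repr(formatted["completion"]))
```

The prompt always ends with exactly one newline and the completion starts without leading whitespace, so the boundary between the two is the same in every training example.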


Configuring LoRA and SFTTrainer

In [None]:
# Which layers of the model get LoRA adapters
#target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
#target_modules = ["q_proj","k_proj","v_proj","o_proj"]
target_modules = ["c_attn", "c_fc", "c_proj"]

# LoRA config
peft_config = LoraConfig(
    r=cfg.lora_r,
    lora_alpha=cfg.lora_alpha,
    lora_dropout=cfg.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=target_modules,
)

# Training config
sftconf_kwargs = dict(
    output_dir=cfg.output_dir,
    per_device_train_batch_size=cfg.per_device_train_batch_size,
    gradient_accumulation_steps=cfg.gradient_accumulation_steps,
    learning_rate=cfg.learning_rate,
    logging_steps=10,
    max_steps=cfg.max_steps,
    warmup_ratio=cfg.warmup_ratio,
    fp16=not cfg.use_qlora,               # if NOT QLoRA, use fp16
    bf16=True if cfg.use_qlora else False,
    optim="paged_adamw_8bit" if cfg.use_qlora else "adamw_torch",
    gradient_checkpointing=True,
    packing=False,
    report_to="none",
)

args = SFTConfig(**sftconf_kwargs)

# Create the SFTTrainer
trainer_kwargs = dict(
    model=model,
    args=args,
    peft_config=peft_config,
    train_dataset=train_ds,
)

trainer_sig = inspect.signature(SFTTrainer.__init__)
if "processing_class" in trainer_sig.parameters:
    trainer_kwargs["processing_class"] = tokenizer
elif "tokenizer" in trainer_sig.parameters:
    trainer_kwargs["tokenizer"] = tokenizer
if "dataset_text_field" in trainer_sig.parameters:
    trainer_kwargs["dataset_text_field"] = None

trainer = SFTTrainer(**trainer_kwargs)

# Check how many parameters are actually trained
trainable = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
total = sum(p.numel() for p in trainer.model.parameters())
#print(f"Trainable: {trainable:,} / Total: {total:,} ({100*trainable/total:.3f}%)")
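
Since the print above is commented out, the trainable-parameter count can also be estimated by hand. This is a rough sketch that assumes the standard GPT-2 small shapes (hidden size 768, 12 blocks, about 124M base parameters) and that `c_proj` matches both the attention and the MLP projection in every block. The numbers are not read from the loaded model.

```python
# Assumed GPT-2 small shapes; these constants are not read from the loaded model.
N_LAYERS = 12
HIDDEN = 768
BASE_PARAMS = 124_439_808  # commonly cited GPT-2 small parameter count (assumption)

# (in_features, out_features) of each module matched by target_modules in one block.
matched = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_fc": (HIDDEN, 4 * HIDDEN),
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),
}

def lora_param_count(r, shapes, n_layers):
    """Each adapted layer adds r * (d_in + d_out) parameters (matrices A and B)."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * n_layers

trainable_est = lora_param_count(16, matched, N_LAYERS)
share = 100 * trainable_est / (BASE_PARAMS + trainable_est)
print(f"Estimated trainable: {trainable_est:,} ({share:.2f}% of all parameters)")
```

Under these assumptions, only about 2% of the parameters receive gradients with `lora_r = 16`.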

In [22]:
train_result = trainer.train()
train_result

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,3.1976
20,3.0698
30,3.0836
40,2.8743
50,2.8167
60,2.7249
70,2.6703
80,2.6377
90,2.4984
100,2.7117


TrainOutput(global_step=300, training_loss=2.633639367421468, metrics={'train_runtime': 7232.6899, 'train_samples_per_second': 0.332, 'train_steps_per_second': 0.041, 'total_flos': 161015828281344.0, 'train_loss': 2.633639367421468, 'epoch': 0.24})
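
The `epoch` and throughput values in `TrainOutput` follow directly from the config. The arithmetic below reproduces them, assuming a single device and that the 10,000-example subset was used in full.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
max_steps = 300
subset_size = 10_000
train_runtime = 7232.6899  # seconds, copied from TrainOutput
num_devices = 1            # assumption: single GPU or CPU

# One optimizer step consumes batch_size * accumulation_steps * devices examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epochs = samples_seen / subset_size
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(effective_batch, samples_seen, epochs)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The model therefore saw only 2,400 examples, about a quarter of the subset, which is worth keeping in mind when judging the results.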

Comparing responses

In [23]:
trainer.model.eval()
trainer.model.config.use_cache = True

if cfg.use_qlora:
    autocast_dtype = torch.bfloat16
else:
    autocast_dtype = torch.bfloat16   # fp16 also works if preferred

def generate_finetuned(prompt, max_new_tokens=128, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt").to(trainer.model.device)
    with torch.no_grad():
        if torch.cuda.is_available() and autocast_dtype in (torch.float16, torch.bfloat16):
            with torch.autocast("cuda", dtype=autocast_dtype):
                out = trainer.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    temperature=temperature,
                    pad_token_id=tokenizer.eos_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                )
        else:
            out = trainer.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
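    # Note: the whole sequence is decoded, so the result starts with the prompt (generate_base strips it)
    return tokenizer.decode(out[0], skip_special_tokens=True)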



In [24]:
post_eval = {}
for p in cfg.eval_prompts:
    post_eval[p] = generate_finetuned(p)


In [25]:
print("=== POST-TUNING RESPONSES ===")
from textwrap import indent
for p in cfg.eval_prompts:
    print("\n# Prompt:\n", p)
    print("\n# Before:\n", indent(pre_eval[p], "  "))
    print("\n# After:\n", indent(post_eval[p], "  "))

=== POST-TUNING RESPONSES ===

# Prompt:
 Explain what LoRA fine-tuning is in simple words.

# Before:
 

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'Yes-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

  * - Check for a 'No-Padding' checkbox if there is no padding

# After:
   Explain what LoRA fine-tuning is in simple words. 
  LoRA are a group of professionals, who do a lot of work in the legal community, and they are very interested in interpreting the law, and then interpreting the law in a way that is consist

After LoRA fine-tuning, the model generates more coherent answers that are closer to the expected response format, but the quality is still far from what one would expect of an adequate LLM answer. Before training the model produced almost incoherent text; after training it attempts a connected, on-topic answer and tries to explain, though the content is inaccurate (for example, it interprets LoRA as a legal organization, or repeats parts of the text). The effect of training is therefore visible: the model responds better to the task and to the prompt-response format. However, because of the small model size and the limited number of trainable parameters, answer quality remains low: the model has picked up the direction of an answer but has not learned correct semantics.
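
The comparison above is qualitative. One simple way to put a number on "closeness to the target response" is a token-overlap F1 score, sketched below. The function name `token_f1` and the reference sentence are hypothetical; the notebook itself computes no such metric.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall on lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer, written only for this illustration.
reference = "LoRA trains small low-rank matrices added to frozen weights"
score_same = token_f1(reference, reference)
score_off = token_f1("LoRA are a group of professionals", reference)
```

Averaging such a score over held-out prompt-response pairs before and after training would give a rough numeric counterpart to the comparison, although word overlap is only a weak proxy for answer quality.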
