Лабораторная работа №12: Изменение формы выражения текста без потери смысла

Цель: Изменять структуру предложений и лексический состав текста, сохраняя его первоначальное значение.
Задания:
- Реализуйте методы перестановки порядка слов, замены синонимами и сокращения избыточных выражений, проверяя сохранение общего смысла.Метрика оценки: Человеческая экспертиза (оценка читабельности и сохранения смысла), сравнение показателей COMET или CHRF++.

# Обучение

In [1]:
import os
os.environ['UNSLOTH_COMPILE_DISABLE'] = '1'

from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
from unsloth.chat_templates import standardize_data_formats
from unsloth.chat_templates import train_on_responses_only

import torch
from datasets import Dataset, DatasetDict, load_dataset, concatenate_datasets
import evaluate
import numpy as np
from trl import SFTTrainer, SFTConfig
from transformers import EarlyStoppingCallback
import pandas as pd
from tqdm.auto import tqdm
from evaluate import load

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
model, tokenizer = FastModel.from_pretrained(
    model_name = 'unsloth/gemma-3-4b-it',
    max_seq_length = 256, 
    load_in_4bit = False,  
    load_in_8bit = True,
    full_finetuning = False,
)

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False, 
    finetune_language_layers = True,  
    finetune_attention_modules = True,  
    finetune_mlp_modules = True,  
    r = 16,           
    lora_alpha = 32,  
    lora_dropout = 0,
    bias = 'none',
    random_state = 3407,
)

==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA GeForce RTX 4070 Ti SUPER. Num GPUs = 1. Max memory: 15.593 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Unsloth: Making `model.base_model.model.language_model.model` require gradients


In [3]:
ds = load_dataset(
    'cointegrated/ru-paraphrase-NMT-Leipzig', 
    data_files={'train': 'train.csv', 'val': 'val.csv', 'test': 'test.csv'},
)

ds = DatasetDict({
    'train':      ds['train'],
    'validation': ds['val'],
    'test':       ds['test'],
})

def high_quality(example):
    return (example['labse_sim'] > 0.95) and (example['p_good'] > 0.95)

for split in ['train', 'validation', 'test']:
    ds[split] = ds[split].filter(high_quality, num_proc=4)

ds['validation'] = concatenate_datasets([ds['validation'], ds['test']])

ds = DatasetDict({
    'train':      ds['train'],
    'validation': ds['validation'],
})

def build_conversations(batch):
    return {
        'conversations': [
            [
              {'role': 'user',      'content': orig},
              {'role': 'assistant', 'content': ru}
            ]
            for orig, ru in zip(batch['original'], batch['ru'])
        ]
    }

ds = ds.map(
    build_conversations, batched=True,
    remove_columns=['idx','en',
                    'chrf_sim','labse_sim',
                    'forward_entailment','backward_entailment',
                    'p_good','original','ru']
)

ds = ds.map(lambda x: {'text': tokenizer.apply_chat_template(x['conversations'])},
            batched=True)

ds

DatasetDict({
    train: Dataset({
        features: ['conversations', 'text'],
        num_rows: 108434
    })
    validation: Dataset({
        features: ['conversations', 'text'],
        num_rows: 2155
    })
})

In [None]:
def apply_template(batch):
    return {'text': tokenizer.apply_chat_template(batch['conversations'])}

ds = ds.map(apply_template, batched=True)

print(ds['train'][0]['text'])

Map:   0%|          | 0/108434 [00:00<?, ? examples/s]

Map:   0%|          | 0/2155 [00:00<?, ? examples/s]

<bos><start_of_turn>user
А вот количество ТОП-ов от одного пользователя не ограничено!<end_of_turn>
<start_of_turn>model
Но количество ТОПов на пользователя не ограничивается!<end_of_turn>



In [None]:
train_std = standardize_data_formats(ds['train'])
valid_std = standardize_data_formats(ds['validation'])

ds = DatasetDict({
    'train': train_std,
    'validation': valid_std,
})

ds

Unsloth: Standardizing formats (num_proc=12):   0%|          | 0/108434 [00:00<?, ? examples/s]

Unsloth: Standardizing formats (num_proc=12):   0%|          | 0/2155 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['conversations', 'text'],
        num_rows: 108434
    })
    validation: Dataset({
        features: ['conversations', 'text'],
        num_rows: 2155
    })
})

In [6]:
ds['train'][0]

{'conversations': [{'content': 'А вот количество ТОП-ов от одного пользователя не ограничено!',
   'role': 'user'},
  {'content': 'Но количество ТОПов на пользователя не ограничивается!',
   'role': 'assistant'}],
 'text': '<bos><start_of_turn>user\nА вот количество ТОП-ов от одного пользователя не ограничено!<end_of_turn>\n<start_of_turn>model\nНо количество ТОПов на пользователя не ограничивается!<end_of_turn>\n'}

In [7]:
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples['conversations'])
    return {'text': texts }

dataset = ds.map(apply_chat_template, batched=True)

Map:   0%|          | 0/108434 [00:00<?, ? examples/s]

Map:   0%|          | 0/2155 [00:00<?, ? examples/s]

In [8]:
early_stop_cb = EarlyStoppingCallback(
    early_stopping_patience=3,          
)
    
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds['train'],
    eval_dataset = ds['validation'], 
    args = SFTConfig(
        output_dir = 'results/gemma_paraphrase',
        dataset_text_field = 'text',
        per_device_train_batch_size = 8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs = 5,
        warmup_steps = 5,
        eval_strategy='steps',
        learning_rate = 5e-5, 
        logging_steps = 300,
        eval_steps=300,
        optim = 'adamw_8bit',
        weight_decay = 0.01,
        lr_scheduler_type = 'linear',
        seed = 3407,
        report_to = 'none',
    ),
    callbacks = [early_stop_cb],
)

trainer.args.save_strategy = 'steps'
trainer.args.save_steps = trainer.args.eval_steps  
trainer.args.load_best_model_at_end = True                       
trainer.args.metric_for_best_model = 'eval_loss'             
trainer.args.greater_is_better = False             

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/108434 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/2155 [00:00<?, ? examples/s]

In [9]:
trainer = train_on_responses_only(
    trainer,
    instruction_part = '<start_of_turn>user\n',
    response_part = '<start_of_turn>model\n',
)

Map (num_proc=12):   0%|          | 0/108434 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/2155 [00:00<?, ? examples/s]

In [10]:
tokenizer.decode(trainer.train_dataset[0]['input_ids'])

'<bos><bos><start_of_turn>user\nА вот количество ТОП-ов от одного пользователя не ограничено!<end_of_turn>\n<start_of_turn>model\nНо количество ТОПов на пользователя не ограничивается!<end_of_turn>\n'

In [11]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[0]['labels']]).replace(tokenizer.pad_token, ' ')

'                         Но количество ТОПов на пользователя не ограничивается!<end_of_turn>\n'

In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 108,434 | Num Epochs = 5 | Total steps = 33,885
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 2 x 1) = 16
 "-____-"     Trainable parameters = 29,802,496/4,329,881,968 (0.69% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
300,0.8227,0.704674
600,0.6914,0.673307
900,0.6754,0.656409
1200,0.6594,0.644682
1500,0.6448,0.63791
1800,0.6311,0.628795
2100,0.6281,0.621444
2400,0.612,0.613707
2700,0.6125,0.6089
3000,0.6106,0.602451


Unsloth: Not an error, but Gemma3ForConditionalGeneration does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


In [18]:
model.save_pretrained('results/gemma_paraphrase/best')  
tokenizer.save_pretrained('results/gemma_paraphrase/best')

['results/gemma_paraphrase/best/processor_config.json']

# Оценка

In [17]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template = 'gemma-3',
)
messages = [{
    'role': 'user',
    'content': [{
        'type' : 'text',
        'text' : 'США выиграли чемпионат мира по хоккею.'
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
)
outputs = model.generate(
    **tokenizer([text], return_tensors = 'pt').to('cuda'),
    max_new_tokens = 64, 
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\nСША выиграли чемпионат мира по хоккею.<end_of_turn>\n<start_of_turn>model\nСША стали чемпионами мира в хоккее.<end_of_turn>']

In [5]:
model, tokenizer = FastModel.from_pretrained(
    model_name     = 'results/gemma_paraphrase/best',
    max_seq_length = 256,
    load_in_4bit   = False,
    load_in_8bit   = True,
)
tokenizer = get_chat_template(tokenizer, chat_template='gemma-3')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

ds = load_dataset(
    'cointegrated/ru-paraphrase-NMT-Leipzig',
    data_files={'train': 'train.csv', 'val': 'val.csv', 'test': 'test.csv'},
)
ds = DatasetDict({
    'train':      ds['train'],
    'validation': ds['val'],
    'test':       ds['test'],
})

def high_quality(example):
    return (example['labse_sim'] > 0.95) and (example['p_good'] > 0.95)

for split in ['train', 'validation', 'test']:
    ds[split] = ds[split].filter(high_quality, num_proc=4)

ds['validation'] = concatenate_datasets([ds['validation'], ds['test']])
ds = DatasetDict({
    'train':      ds['train'],
    'validation': ds['validation'],
})

def build_conversations(batch):
    return {
        'conversations': [
            [
              {'role': 'user',      'content': orig},
              {'role': 'assistant', 'content': ru}
            ]
            for orig, ru in zip(batch['original'], batch['ru'])
        ]
    }

ds = ds.map(
    build_conversations, batched=True,
    remove_columns=[
        'idx','en',
        'chrf_sim','labse_sim',
        'forward_entailment','backward_entailment',
        'p_good','original','ru'
    ]
)

ds = ds.map(lambda x: {'text': tokenizer.apply_chat_template(x['conversations'])}, batched=True)
train_std = standardize_data_formats(ds['train'])
valid_std = standardize_data_formats(ds['validation'])
ds = DatasetDict({'train': train_std, 'validation': valid_std})

batch_size = 32
preds = []
sources = []
refs  = []

for i in tqdm(range(0, len(ds['validation']), batch_size), desc='Generating'):
    batch = ds['validation'][i : i + batch_size]
    sources.extend([ conv[0]['content'] for conv in batch['conversations'] ])
    refs.extend([ conv[1]['content'] for conv in batch['conversations'] ])

    raw = tokenizer.apply_chat_template(batch['conversations'], tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(
        raw, return_tensors='pt', padding=True, truncation=True, padding_side='left'
    ).to(device)
    prompt_len = inputs.input_ids.shape[1]

    out = model.generate(
        input_ids      = inputs.input_ids,
        attention_mask = inputs.attention_mask,
        max_new_tokens = 128,
        temperature    = 1.0,
        top_p          = 0.95,
        top_k          = 64,
        pad_token_id   = tokenizer.pad_token_id,
        eos_token_id   = tokenizer.eos_token_id
    )
    gen = out[:, prompt_len:]
    txt = tokenizer.batch_decode(gen, skip_special_tokens=True)
    preds.extend([t.strip() for t in txt])

bertscore = evaluate.load('bertscore')
bs = bertscore.compute(
    predictions = preds,
    references  = [[r] for r in refs],
    lang        = 'ru',
    batch_size  = batch_size,
    device      = device
)

print(f'\nBERTScore F1 (vs референс):        {np.array(bs['f1']).mean():.4f}')
print(f'BERTScore Precision (vs референс): {np.array(bs['precision']).mean():.4f}')
print(f'BERTScore Recall (vs референс):    {np.array(bs['recall']).mean():.4f}')


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastModel
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.7: Fast Gemma3 patching. Transformers: 4.51.3.
   \   /|    NVIDIA GeForce RTX 4070 Ti SUPER. Num GPUs = 1. Max memory: 15.593 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be

In [25]:
n = len(preds)
dummy_refs = [''] * n

comet = load(
    'comet',
    module_type='metric',
    checkpoint='Unbabel/wmt22-cometkiwi-da'
)

res = comet.compute(
    predictions=preds,
    sources=sources,
    references =refs,
)
print('COMET score:', res['mean_score'])

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.1.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/2760a223ac957f30acfb18c8aa649b01cf1d75f2/checkpoints/model.ckpt`
Encoder model frozen.
/home/danya/anaconda3/envs/new/lib/python3.12/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


COMET score: 0.9303978350627173


In [None]:
mapping = { s:(p,[r]) for s,p,r in zip(sources,preds,refs) }
seen = set()
for i, conv in enumerate(ds['validation']['conversations']):
    src = conv[0]['content']
    if src in seen:
        continue
    seen.add(src)
    pred, ref_list = mapping[src]
    print(f'\n[{i}] Оригинал:  {src}')
    print(f'    Референс: {ref_list[0]}')
    print(f'    Модель:    {pred}')


[0] Оригинал:  Это биотехнологи, не имеющие медицинского образования, просто купившие секвенатор и продающие некую технологическую услугу.
    Референс: Речь идет о биотехнологах без медицинской подготовки, которые просто купили секвенсор и продали какой-то технологический сервис.
    Модель:    Это биотехнологи, не имеющие медицинской степени, которые просто купили секвенирование и продали технологическую услугу.

[1] Оригинал:  Ему казалось, что картины разбушевавшихся стихий: лавины, бури, метели и урагана - вообще, всякие чрезвычайные состояния природы могут адекватно передать смятение человеческого сердца и ума, очистить душу и сознание.
    Референс: Ему показалось, что образы бушующей стихии: лавин, штормов, метелей и ураганов - в общем, всякого рода экстремальных состояний природы способны адекватно выразить бурю сердца и разума человека, очиститься душа и сознание.
    Модель:    Ему представлялось, что картины бушующих стихий - лавины, бури, метели и ураганы - в целом всякие