# Paraphrasing Russian sentences with Hugging Face
This notebook uses the `cointegrated/rut5-base-paraphraser` model to paraphrase Russian text.
It also uses the `merionum/ru_paraphraser` dataset to check the quality of the model.

In [None]:
# Install the libraries
!pip install transformers datasets torch



## 1. Loading the dataset

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("merionum/ru_paraphraser")

# Inspect its structure
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 7227
    })
    test: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 1924
    })
})


In [None]:
dataset["train"][10]

{'id': '11',
 'id_1': '246',
 'id_2': '8165',
 'text_1': 'Москвичи смогут забронировать в Интернете место на кладбище.',
 'text_2': 'В Москве можно будет забронировать место на кладбище через интернет.',
 'class': '1'}

In [None]:
dataset["train"][0]

{'id': '1',
 'id_1': '201',
 'id_2': '8159',
 'text_1': 'Полицейским разрешат стрелять на поражение по гражданам с травматикой.',
 'text_2': 'Полиции могут разрешить стрелять по хулиганам с травматикой.',
 'class': '0'}

## 2. Filtering the data (exact paraphrases only)

In [None]:
# Keep only the pairs labelled as paraphrases (class == '1'; labels are stored as strings)
train_data = dataset["train"].filter(lambda x: x["class"] == '1')
test_data = dataset["test"].filter(lambda x: x["class"] == '1')

# Print the number of remaining examples and the dataset shape
print(f"Number of examples: {len(train_data)}")
print(train_data.shape)

Number of examples: 1688
(1688, 6)
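
The filter above compares the `class` field with the string `'1'`, because the rows printed in section 1 store labels as strings. A minimal sketch of the same selection on plain dictionaries, using those two printed rows; the helper name `keep_paraphrases` is hypothetical and only illustrates the logic, it is not part of `datasets`.

```python
from collections import Counter

# Two rows copied from the training examples printed in section 1
# (only the fields needed for the illustration are kept).
rows = [
    {
        "id": "1",
        "text_1": "Полицейским разрешат стрелять на поражение по гражданам с травматикой.",
        "text_2": "Полиции могут разрешить стрелять по хулиганам с травматикой.",
        "class": "0",
    },
    {
        "id": "11",
        "text_1": "Москвичи смогут забронировать в Интернете место на кладбище.",
        "text_2": "В Москве можно будет забронировать место на кладбище через интернет.",
        "class": "1",
    },
]


def keep_paraphrases(rows, label="1"):
    """Return only the rows whose `class` field equals `label`."""
    return [row for row in rows if row["class"] == label]


# Count how many rows carry each label before filtering.
label_counts = Counter(row["class"] for row in rows)
paraphrases = keep_paraphrases(rows)

print(label_counts)
print([row["id"] for row in paraphrases])

# Comparing against the integer 1 matches nothing, since the labels are strings.
print(keep_paraphrases(rows, label=1))
```

Checking the label type first is a cheap way to avoid a filter that silently returns an empty dataset.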


## 3. Using `pipeline` for paraphrasing

In [None]:
from transformers import pipeline

# Load the paraphrasing pipeline
paraphraser = pipeline("text2text-generation", model="cointegrated/rut5-base-paraphraser")

# Example of the model at work.
# max_new_tokens is used instead of max_length: the pipeline already has a
# max_new_tokens value set, which takes precedence over max_length.
text = "Я собираюсь поехать в Москву завтра."
result = paraphraser(text, max_new_tokens=50, num_return_sequences=1)

print("Original:", text)
print("Paraphrased:", result[0]["generated_text"])

Original: Я собираюсь поехать в Москву завтра.
Paraphrased: Я завтра поеду в Москву.


## 4. Checking the model on the test dataset

In [None]:
total = 5  # Take 5 examples for the check

for i in range(total):
    original = test_data[i]["text_1"]
    expected = test_data[i]["text_2"]

    # Generate a paraphrase (max_new_tokens avoids the conflict with max_length)
    generated = paraphraser(original, max_new_tokens=50, num_return_sequences=1)[0]["generated_text"]

    print(f"Original: {original}")
    print(f"Expected: {expected}")
    print(f"Generated: {generated}")

Original: Вертолет с 11 иностранцами на борту упал в Пакистане
Expected: В Пакистане упал вертолет с 11 иностранцами
Generated: На борту разбившегося в Пакистане самолета находились 11 человек
Original: Самолет вернулся в аэропорт Новосибирска из-за стука в салоне
Expected: Самолет вернулся в новосибирский аэропорт из-за таинственного стука
Generated: Самолет вернулся в Новосибирск из-за стука в салоне
Original: Суд оправдал Васильеву в хищении акций на два миллиарда рублей
Expected: Суд оправдал Васильеву в хищении акций на 2 млрд рублей
Generated: Суд оправдал Васильеву за хищение акций на два миллиарда рублей
Original: Пушков: у Обамы не хватило духа лично поздравить наш народ с Победой
Expected: Пушков: Обама не нашел в себе духа лично поздравить россиян с Победой
Generated: Пушков: Обамы не хватило духа для поздравления с Победой
Original: МЧС РФ: тела погибших российских дипломатов доставят из Непала 11 мая
Expected: Тела погибших в Непале российских дипломатов доставят на родину 11 мая
Generated: МЧС РФ: тела погибших российских дипломатов доставят из Непала

## Conclusion
- Used the `ru_paraphraser` dataset from Hugging Face.
- Kept only the pairs labelled as paraphrases (`class == '1'`).
- Used `pipeline` for paraphrasing.
- Checked the model's output on test examples.

The code can now be used to **generate Russian paraphrases**!


# ASSIGNMENT
Complete the evaluation of the model with the ROUGE metric.
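
For reference, a sketch of the usual ROUGE definitions for a single pair of texts. The notation is introduced here and is not taken from the notebook: $C$ is the generated text, $R$ is the expected text, $G_n(\cdot)$ is the set of distinct n-grams of a text, $c_X(g)$ is the number of times n-gram $g$ occurs in $X$, and $|X|$ is the number of tokens in $X$.

```latex
P_n = \frac{\sum_{g \in G_n(C) \cap G_n(R)} \min\bigl(c_C(g),\, c_R(g)\bigr)}
           {\sum_{g \in G_n(C)} c_C(g)},
\qquad
R_n = \frac{\sum_{g \in G_n(C) \cap G_n(R)} \min\bigl(c_C(g),\, c_R(g)\bigr)}
           {\sum_{g \in G_n(R)} c_R(g)},
\qquad
F_n = \frac{2\, P_n R_n}{P_n + R_n}

P_L = \frac{\mathrm{LCS}(C, R)}{|C|},
\qquad
R_L = \frac{\mathrm{LCS}(C, R)}{|R|},
\qquad
F_L = \frac{2\, P_L R_L}{P_L + R_L}
```

ROUGE-1 and ROUGE-2 are the cases $n = 1$ and $n = 2$; ROUGE-L replaces n-gram overlap with the length of the longest common subsequence. When the overlap is zero, the F-measure is taken to be 0.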

In [None]:
!pip install rouge_score


In [None]:
# Install the libraries (uncomment if needed)
# !pip install transformers datasets torch rouge-score

from datasets import load_dataset
from transformers import pipeline
from rouge_score import rouge_scorer
import numpy as np

# Load the dataset
dataset = load_dataset("merionum/ru_paraphraser")

# Keep only the pairs labelled as paraphrases (class == '1')
train_data = dataset["train"].filter(lambda x: x["class"] == '1')
test_data = dataset["test"].filter(lambda x: x["class"] == '1')

print(f"Number of examples in train: {len(train_data)}")
print(f"Number of examples in test: {len(test_data)}")

# Load the paraphrasing pipeline
paraphraser = pipeline("text2text-generation", model="cointegrated/rut5-base-paraphraser")

# Initialise the ROUGE scorer.
# NOTE: the default rouge_score tokenizer keeps only ASCII letters and digits,
# and use_stemmer applies the English Porter stemmer, so Cyrillic words are
# ignored when scoring. The scores recorded below reflect this.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def evaluate_paraphraser(data, num_samples=100):
    """
    Evaluate the paraphrasing model with ROUGE metrics.

    Args:
        data: dataset to evaluate on
        num_samples: number of examples to evaluate

    Returns:
        dict: mean and standard deviation of each metric
    """
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []

    # Cap the number of examples at the dataset size
    num_samples = min(num_samples, len(data))

    print(f"\nEvaluating the model on {num_samples} examples...")

    for i in range(num_samples):
        original = data[i]["text_1"]
        expected = data[i]["text_2"]

        try:
            # Generate a paraphrase (max_new_tokens avoids the conflict with max_length)
            generated = paraphraser(original, max_new_tokens=100, num_return_sequences=1)[0]["generated_text"]

            # Compute the ROUGE scores: score(target, prediction)
            scores = scorer.score(expected, generated)

            rouge1_scores.append(scores['rouge1'].fmeasure)
            rouge2_scores.append(scores['rouge2'].fmeasure)
            rougeL_scores.append(scores['rougeL'].fmeasure)

            # Print the first 5 examples
            if i < 5:
                print(f"\n--- Example {i+1} ---")
                print(f"Original: {original}")
                print(f"Expected: {expected}")
                print(f"Generated: {generated}")
                print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}")
                print(f"ROUGE-2: {scores['rouge2'].fmeasure:.4f}")
                print(f"ROUGE-L: {scores['rougeL'].fmeasure:.4f}")

        except Exception as e:
            # Skip the failed example; it is not included in the averages
            print(f"Error on example {i}: {e}")
            continue

        # Progress report
        if (i + 1) % 20 == 0:
            print(f"Processed {i + 1}/{num_samples} examples...")

    # Compute the mean and standard deviation of each metric
    results = {
        'rouge1': {
            'mean': np.mean(rouge1_scores),
            'std': np.std(rouge1_scores)
        },
        'rouge2': {
            'mean': np.mean(rouge2_scores),
            'std': np.std(rouge2_scores)
        },
        'rougeL': {
            'mean': np.mean(rougeL_scores),
            'std': np.std(rougeL_scores)
        }
    }

    return results

# Run the evaluation on the test dataset
results = evaluate_paraphraser(test_data, num_samples=100)

# Print the final results
print("\n" + "="*60)
print("FINAL EVALUATION RESULTS")
print("="*60)
print("\nROUGE-1 (unigram overlap):")
print(f"  Mean: {results['rouge1']['mean']:.4f}")
print(f"  Standard deviation: {results['rouge1']['std']:.4f}")

print("\nROUGE-2 (bigram overlap):")
print(f"  Mean: {results['rouge2']['mean']:.4f}")
print(f"  Standard deviation: {results['rouge2']['std']:.4f}")

print("\nROUGE-L (longest common subsequence):")
print(f"  Mean: {results['rougeL']['mean']:.4f}")
print(f"  Standard deviation: {results['rougeL']['std']:.4f}")

print("\n" + "="*60)

# Additional function for comparing several paraphrase variants
def generate_multiple_paraphrases(text, num_variants=3):
    """
    Generate several paraphrase variants with beam search.
    """
    # temperature is omitted: it only takes effect when do_sample=True,
    # and beam search without sampling ignores it.
    variants = paraphraser(
        text,
        max_new_tokens=100,
        num_return_sequences=num_variants,
        num_beams=num_variants
    )

    print(f"\nSource text: {text}\n")
    for i, variant in enumerate(variants, 1):
        print(f"Variant {i}: {variant['generated_text']}")

# Usage example
example_text = "Я собираюсь поехать в Москву завтра."
generate_multiple_paraphrases(example_text, num_variants=3)

Number of examples in train: 1688
Number of examples in test: 374

Evaluating the model on 100 examples...

--- Example 1 ---
Original: Вертолет с 11 иностранцами на борту упал в Пакистане
Expected: В Пакистане упал вертолет с 11 иностранцами
Generated: На борту разбившегося в Пакистане самолета находились 11 человек
ROUGE-1: 1.0000
ROUGE-2: 0.0000
ROUGE-L: 1.0000

--- Example 2 ---
Original: Самолет вернулся в аэропорт Новосибирска из-за стука в салоне
Expected: Самолет вернулся в новосибирский аэропорт из-за таинственного стука
Generated: Самолет вернулся в Новосибирск из-за стука в салоне
ROUGE-1: 0.0000
ROUGE-2: 0.0000
ROUGE-L: 0.0000

--- Example 3 ---
Original: Суд оправдал Васильеву в хищении акций на два миллиарда рублей
Expected: Суд оправдал Васильеву в хищении акций на 2 млрд рублей
Generated: Суд оправдал Васильеву за хищение акций на два миллиарда рублей
ROUGE-1: 0.0000
ROUGE-2: 0.0000
ROUGE-L: 0.0000

--- Example 4 ---
Original: Пушков: у Обамы не хватило духа лично поздравить наш народ с Победой
Expected: Пушков: Обама не нашел в себе духа лично поздравить россиян с Победой
Generated: Пушков: Обамы не хватило духа для поздравления с Победой
ROUGE-1: 0.0000
ROUGE-2: 0.0000
ROUGE-L: 0.0000

--- Example 5 ---
Original: МЧС РФ: тела погибших российских дипломатов доставят из Непала 11 мая
Expected: Тела погибших в Непале российских дипломатов доставят на родину 11 мая
Generated: МЧС РФ: тела погибших российских дипломатов доставят из Непала
ROUGE-1: 0.0000
ROUGE-2: 0.0000
ROUGE-L: 0.0000
Processed 20/100 examples...
Processed 40/100 examples...
Processed 60/100 examples...
Processed 80/100 examples...
Processed 100/100 examples...

============================================================
FINAL EVALUATION RESULTS
============================================================

ROUGE-1 (unigram overlap):
  Mean: 0.2480
  Standard deviation: 0.4222

ROUGE-2 (bigram overlap):
  Mean: 0.0467
  Standard deviation: 0.2056

ROUGE-L (longest common subsequence):
  Mean: 0.2480
  Standard deviation: 0.4222

============================================================

Source text: Я собираюсь поехать в Москву завтра.

Variant 1: Я завтра поеду в Москву.
Variant 2: Я завтра еду в Москву.
Variant 3: Я завтра собираюсь поехать в Москву.
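
The per-example ROUGE scores recorded above are either 0 or 1 even for pairs that clearly share words. The likely cause is that the default `rouge_score` tokenizer keeps only ASCII letters and digits, so Cyrillic words are discarded before comparison and only tokens such as `11` are counted. The sketch below approximates that filtering with a regular expression (it is not the library's actual code) and contrasts it with a Unicode-aware ROUGE-1 F-measure in plain Python. The names `ascii_tokens`, `unicode_tokens`, and `rouge1_f` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter


def ascii_tokens(text):
    """Approximate the default filtering: lowercase, keep only [a-z0-9] runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unicode_tokens(text):
    """Lowercase and keep runs of Unicode word characters (Cyrillic included)."""
    return re.findall(r"\w+", text.lower())


def rouge1_f(reference, candidate, tokenize):
    """ROUGE-1 F-measure computed from clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Example 1 from the evaluation output.
expected = "В Пакистане упал вертолет с 11 иностранцами"
generated = "На борту разбившегося в Пакистане самолета находились 11 человек"

print(ascii_tokens(expected))                         # ['11']
print(rouge1_f(expected, generated, ascii_tokens))    # 1.0, as in the recorded output
print(rouge1_f(expected, generated, unicode_tokens))  # 0.375 (3 shared tokens, 9 generated, 7 expected)
```

With `rouge_score` itself, one option would be to pass an object with a `tokenize(text)` method through the `tokenizer` argument of `RougeScorer` and to set `use_stemmer=False`, assuming the installed version accepts that argument. The averages reported above would need to be recomputed after such a change.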
