<a href="https://colab.research.google.com/github/RamilMukhametov/Paraphrase-T5model/blob/master/Paraphrase_T5_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.0 tokenizers-0.12.1 transformers-4.22.2


In [None]:
pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


# Metrics

In [None]:
from transformers import AutoModel, AutoTokenizer
from transformers import AutoModelForCausalLM
import torch
import torch.nn.functional
from tqdm.auto import tqdm
from nltk.translate.bleu_score import sentence_bleu
import pandas as pd
import numpy as np

labse_name = 'cointegrated/LaBSE-en-ru'
labse_model = AutoModel.from_pretrained(labse_name)
labse_tokenizer = AutoTokenizer.from_pretrained(labse_name)
if torch.cuda.is_available():
    labse_model.cuda()

mname = 'sberbank-ai/rugpt3small_based_on_gpt2'
gpt_tokenizer = AutoTokenizer.from_pretrained(mname)
gpt_model = AutoModelForCausalLM.from_pretrained(mname)
if torch.cuda.is_available():
    gpt_model.cuda()


Some weights of the model checkpoint at cointegrated/LaBSE-en-ru were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def encode_labse(texts):
    encoded_input = labse_tokenizer(
        texts, padding=True, truncation=True, max_length=64, return_tensors='pt'
    ).to(labse_model.device)
    with torch.no_grad():
        model_output = labse_model(**encoded_input)
    embeddings = model_output.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings.cpu().numpy()


def get_sims(df, batch_size=1):
    sims = []
    for i in range(0, df.shape[0], batch_size):
        batch = df.iloc[i: i+batch_size]
        e1 = encode_labse(batch.text1.tolist())
        e2 = encode_labse(batch.text2.tolist())
        sims.extend((e1 * e2).sum(axis=1))
    return np.array(sims)


def get_random_sims(df, batch_size=1, random_state=1):
    df2 = pd.DataFrame({
        'text1': df.text1.tolist(),
        'text2': df.text2.sample(frac=1.0, random_state=random_state).tolist()
    })
    return get_sims(df2, batch_size=batch_size)


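def get_bleu(df):
    # NOTE: raw strings are passed here, so sentence_bleu iterates over characters
    # and the resulting BLEU is computed on character n-grams, not word n-grams.
    return np.array([sentence_bleu([row.text1], row.text2) for i, row in df.iterrows()])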


def ngrams(word, n=3):
    return [word[i: i+n] for i in range(len(word)-n+1)]


def common_grams(text1, text2):
    g1 = {g for w in text1.lower().split() for n in range(3, 7) for g in ngrams(f' {w} ', n=n)}
    g2 = {g for w in text2.lower().split() for n in range(3, 7) for g in ngrams(f' {w} ', n=n)}
    return len(g1.intersection(g2)) / len(g1.union(g2))


def get_char_ngram_overlap(df):
    return np.array([common_grams(row.text1, row.text2) for i, row in df.iterrows()])


def calc_gpt2_ppl_corpus(test_sentences, aggregate=False, sep='\n'):
    """ Calculate the mean per-token loss (negative log-likelihood) and the number of predicted tokens for each text."""
    lls = []
    weights = []
    for text in tqdm(test_sentences):
        encodings = gpt_tokenizer(f'{sep}{text}{sep}', return_tensors='pt')
        input_ids = encodings.input_ids.to(gpt_model.device)
        target_ids = input_ids.clone()

        w = max(0, len(input_ids[0]) - 1)
        if w > 0:
            with torch.no_grad():
                outputs = gpt_model(input_ids, labels=target_ids)
                log_likelihood = outputs[0]
                ll = log_likelihood.item()
        else:
            ll = 0
        lls.append(ll)
        weights.append(w)
    likelihoods, weights = np.array(lls), np.array(weights)
    if aggregate:
        return sum(likelihoods * weights) / sum(weights)
    return likelihoods, weights


def analyze_pairs(texts1, texts2):
    df = pd.DataFrame({'text1': texts1, 'text2': texts2})
    b1 = get_bleu(df)
    b2 = get_bleu(pd.DataFrame({'text1': texts2, 'text2': texts1}))
    p1, w1 = calc_gpt2_ppl_corpus(df.text1.tolist())
    p2, w2 = calc_gpt2_ppl_corpus(df.text2.tolist())
    return {
        'sim': get_sims(df).mean(),
        'sim_random': get_random_sims(df).mean(),
        'bleu_1': b1.mean(),
        'bleu_2': b2.mean(),
        'bleu': (b1+b2).mean() / 2,
        'char_ngram_overlap': get_char_ngram_overlap(df).mean(),
        'perp_1': (p1 * w1).sum() / w1.sum(),
        'perp_2': (p2 * w2).sum() / w2.sum(),
        'perp_mean': (p1 * w1 + p2 * w2).sum() / (w1 + w2).sum(),
    }
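
The `perp_*` values returned by `analyze_pairs` are token-weighted means of the per-text loss that `calc_gpt2_ppl_corpus` collects, that is, mean negative log-likelihoods in nats per token rather than perplexities. A minimal pure-Python sketch of that aggregation follows; the loss values and token counts are made up for illustration, and `weighted_mean` is a hypothetical helper, not part of the notebook.

```python
import math


def weighted_mean(values, weights):
    """Token-weighted mean: each text contributes in proportion to its token count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical per-text mean losses (nats per token) and predicted-token counts
losses_1, weights_1 = [5.0, 4.0], [3, 6]
losses_2, weights_2 = [4.5, 3.5], [4, 7]

mean_1 = weighted_mean(losses_1, weights_1)
mean_2 = weighted_mean(losses_2, weights_2)

# Pooling both lists divides by the total token count of BOTH lists,
# so the pooled mean always lies between the two per-list means.
pooled = weighted_mean(losses_1 + losses_2, weights_1 + weights_2)

# Exponentiating a mean loss converts it to a perplexity.
perplexity = math.exp(pooled)
print(mean_1, mean_2, pooled, perplexity)
```

A pooled value that falls outside the range spanned by the two per-list means is a sign that the denominator does not cover both weight vectors.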

# Sample

In [None]:
text_init = '''- отпуск
    - рапорт
    - рапорт на отпуск
    - подай рапорт на отпуск
    - сформируй рапорт на отпуск
    - сформировать рапорт на отпуск
    - оформить рапорт на отпуск
    - распечатать рапорт на отпуск
    - подать рапорт на отпуск
    - хочу поехать в отпуск
    - хочу в отпуск
    - поеду в отпуск
  - инструкция теста на ковид
    - как сделать тест на ковид
    - как сделать тест
    - инстуркция для теста
    - хочу сделать тест на ковид
    - дай инструкцию теста на ковид
    - как пройти тест на ковид
    - инструкция
    - дай инструкцию
    - пришли инструкцию
  - когда нужно сделать тест
    - когда нужно сделать тест на ковид
    - когда делать тест на ковид
    - когда сделать тест на ковид
  - результаты самотестирования
    - результат теста на ковид
    - интерпретация теста на ковид
    - понять результат теста на ковид
    - результат теста на ковид
    - дай результат
    - дай результат теста на ковид
    - понять результат
    - результат
    - интерпретация
    - как интерпретировать результат
    - как понять результат
  - хочу справку
    - дай справку
    - заказ справки
    - заказать справку
    - нужна справка
    - пришлите справку
    - какие справки
    - можно заказать справку
    - справка в организацию
    - пришлите справочку
    - предоставьте справку
  - прием документов
    - документы
    - прилагаю документы
    - прикрепить документы
    - загрузить документы
    - документы предоставлены
    - загрузить документы об изменении персональных данных
    - документы об изменении персональных данных'''

text_init_new = []
for text in text_init.split('\n'):
    text_init_new.append(text.lstrip().replace('- ', ''))
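
The cleanup above flattens the indented list into plain phrases but keeps repeated entries ("результат теста на ковид" occurs twice in `text_init`). If duplicates are unwanted, an order-preserving de-duplication is one option. The sketch below runs on a shortened, illustrative copy of the list; `raw` and `unique_phrases` are names introduced here, and the results in this notebook were computed on all 57 lines without this step.

```python
# Shortened, illustrative copy of the phrase list
raw = '''- отпуск
    - рапорт
  - результаты самотестирования
    - результат теста на ковид
    - результат теста на ковид'''

# Same cleanup as in the notebook: drop indentation and the "- " bullet
phrases = [line.lstrip().replace('- ', '') for line in raw.split('\n')]

# dict.fromkeys keeps the first occurrence of each phrase and preserves order
unique_phrases = list(dict.fromkeys(phrases))

print(len(phrases), len(unique_phrases))
```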

In [None]:
# Template for keeping the phrases to paraphrase in a file
# (mode "w" below writes the list; open the file with "r" to load phrases instead)
#with open("файл.txt", "w") as file:
#    print(*список, file=file)

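
A round trip through a file could look like the sketch below. It assumes one phrase per line in a UTF-8 text file; the file name `phrases.txt` and the sample phrases are placeholders, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

phrases = ['рапорт на отпуск', 'хочу справку']  # placeholder sample

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'phrases.txt')  # placeholder file name

    # Save: one phrase per line
    with open(path, 'w', encoding='utf-8') as file:
        file.write('\n'.join(phrases))

    # Load: strip line endings and skip empty lines
    with open(path, encoding='utf-8') as file:
        loaded = [line.strip() for line in file if line.strip()]

print(loaded == phrases)
```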

# Helper functions for the models

In [None]:
import re
import string


def generate_text_para(function_paraphrase):
    text_para = []
    for text in tqdm(text_init_new):
        text_para.append(function_paraphrase(text))
    return text_para  # list of generated phrases

def generate_text_para_new(text_para):
    # lowercase each generated phrase and strip punctuation
    text_para_new = []
    for text in text_para:
        text_para_new.append(''.join([symbol for symbol in text if symbol not in string.punctuation]).lower())
    return text_para_new

def compute_metrics(text_init_new, text_para_new):
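    # compute the metrics; paraphrases that are empty, contain Latin letters,
    # or repeat the source phrase verbatim get zero scores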
    pattern = r'[a-zA-Z]+'
    res_sim, res_sim_random, res_bleu_1, res_bleu_2, res_bleu, res_char_ngram_overlap, res_perp_1, res_perp_2, res_perp_mean = [], [], [], [], [], [], [], [], []
    for text_init_new_, text_para_new_ in zip(text_init_new, text_para_new):
        if text_para_new_ == '' or re.search(pattern, text_para_new_) or text_para_new_ == text_init_new_:
            res = analyze_pairs([text_init_new_], [text_para_new_])
            res_sim.append(0)
            res_sim_random.append(0)
            res_bleu_1.append(0)
            res_bleu_2.append(0)
            res_bleu.append(0)
            if text_para_new_ == text_init_new_:
                res_char_ngram_overlap.append(res['char_ngram_overlap'])
            else:
                res_char_ngram_overlap.append(0)
            res_perp_1.append(0)
            res_perp_2.append(0)
            res_perp_mean.append(0)  
        else:
            res = analyze_pairs([text_init_new_], [text_para_new_])
            res_sim.append(res['sim'])
            res_sim_random.append(res['sim_random'])
            res_bleu_1.append(res['bleu_1'])
            res_bleu_2.append(res['bleu_2'])
            res_bleu.append(res['bleu'])
            res_char_ngram_overlap.append(res['char_ngram_overlap'])
            res_perp_1.append(res['perp_1'])
            res_perp_2.append(res['perp_2'])
            res_perp_mean.append(res['perp_mean'])    
    return {'text_init': text_init_new, 'text_para': text_para_new, 'sim': res_sim, 'bleu': res_bleu, 
                   'char_ngram_overlap': res_char_ngram_overlap, 'perp_mean':res_perp_mean}

def view_result(df):
    dict_metric = {}
    metrics = ['sim', 'bleu', 'char_ngram_overlap', 'perp_mean']
    for metric in metrics:
        dict_metric[metric] = df[metric].sum() / len(df)
    for metric in metrics:
        dict_metric[f'{metric}_true'] = df[df.sim != 0][metric].sum() / len(df[df.sim != 0]) 
    # 'dolya' (share): percentage of pairs that received a non-zero similarity
    dict_metric['dolya'] = sum(df.sim != 0) / len(df.sim) * 100
    return dict_metric

def make_new_table(df, POROG):
    # POROG (threshold): keep pairs whose similarity is non-zero and at least POROG
    df_new = df[(df.sim >= POROG) & (df.sim != 0)]
    return df_new
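
The rejection rule inside `compute_metrics` and the filter in `make_new_table` can be tried out without loading any model. The sketch below restates both in plain Python on a list of dicts; `is_rejected` and `keep_above` are hypothetical helper names, the three rows reuse values from the results table further down, and the threshold of 0.8 is an arbitrary illustration rather than a recommended setting.

```python
import re

LATIN_LETTERS = re.compile(r'[a-zA-Z]+')


def is_rejected(source, paraphrase):
    """True when the pair would get zero scores: empty, Latin letters, or a verbatim copy."""
    return paraphrase == '' or bool(LATIN_LETTERS.search(paraphrase)) or paraphrase == source


def keep_above(rows, threshold):
    """Plain-Python analogue of make_new_table for a list of dicts."""
    return [row for row in rows if row['sim'] >= threshold and row['sim'] != 0]


rows = [
    {'text_init': 'рапорт', 'text_para': 'рапорт', 'sim': 0.0},
    {'text_init': 'оформить рапорт на отпуск',
     'text_para': 'что делать чтобы оформить отпуск на отпуск', 'sim': 0.713891},
    {'text_init': 'подай рапорт на отпуск',
     'text_para': 'подай мне рапорт на отдых', 'sim': 0.879594},
]

kept = keep_above(rows, threshold=0.8)  # 0.8 is an arbitrary illustrative threshold
print([row['text_para'] for row in kept])
```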

# cointegrated/rut5-base-paraphraser

In [None]:
import torch
import os
import string
import re

from transformers import T5ForConditionalGeneration, T5Tokenizer

In [None]:
MODEL_NAME = 'cointegrated/rut5-base-paraphraser'
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
if torch.cuda.is_available():
    model.cuda();

In [None]:
# original version
model.eval();

def paraphrase_cointegrated(text, beams=5, grams=4, do_sample=False):
    x = tokenizer(text, return_tensors='pt', padding=True).to(model.device)
    max_size = int(x.input_ids.shape[1] * 1.5 + 10)
    out = model.generate(**x, 
                         encoder_no_repeat_ngram_size=grams, 
                         do_sample=do_sample, 
                         num_beams=beams, 
                         max_length=max_size, 
                         no_repeat_ngram_size=4,)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [None]:
text_para = generate_text_para(paraphrase_cointegrated)
text_para_new = generate_text_para_new(text_para)
DATA = compute_metrics(text_init_new, text_para_new)
df1 = pd.DataFrame(DATA)
df1.shape

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


(57, 6)
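
The BLEU warnings printed above are easier to read once you notice that `get_bleu` hands raw strings to `sentence_bleu`, so the n-grams being counted are character n-grams. For a short pair such as "отпуск" / "отдых" the higher-order overlaps are zero, which is the situation the warning describes. The sketch below counts clipped n-gram overlaps in plain Python; it is not NLTK's implementation, and `clipped_overlap` is a hypothetical helper.

```python
from collections import Counter


def clipped_overlap(reference, hypothesis, n):
    """Count hypothesis n-grams that also occur in the reference (clipped counts)."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    return sum(min(count, ref[gram]) for gram, count in hyp.items())


reference, hypothesis = 'отпуск', 'отдых'

# Strings are sequences of characters, so this counts character n-grams (orders 1 to 4)
char_overlaps = [clipped_overlap(reference, hypothesis, n) for n in range(1, 5)]

# Splitting first gives word n-grams; single-word phrases have no 2-grams at all
word_overlaps = [clipped_overlap(reference.split(), hypothesis.split(), n) for n in range(1, 5)]

print(char_overlaps)  # [2, 1, 0, 0]
print(word_overlaps)  # [0, 0, 0, 0]
```

With zero overlaps at orders 3 and 4, unsmoothed BLEU collapses to a value near zero, which is why the warning text suggests a lower n-gram order or a smoothing function.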

In [None]:
df1

index,text_init,text_para,sim,bleu,char_ngram_overlap,perp_mean
0,отпуск,отдых,0.94314,7.223693e-155,0.032258,5.188607
1,рапорт,рапорт,0.0,0.0,1.0,0.0
2,рапорт на отпуск,рапорт на отпуск,0.0,0.0,1.0,0.0
3,подай рапорт на отпуск,подай мне рапорт на отдых,0.879594,0.6660198,0.5,4.44288
4,сформируй рапорт на отпуск,составьте рапорт на отдых,0.967505,0.4929961,0.196429,4.420242
5,сформировать рапорт на отпуск,составьте рапорт на отдых,0.950895,0.4633225,0.177419,4.350691
6,оформить рапорт на отпуск,что делать чтобы оформить отпуск на отпуск,0.713891,0.3845458,0.474747,4.365619
7,распечатать рапорт на отпуск,рапорт на отпуск распечатать,0.978469,0.9295691,1.0,3.973705
8,подать рапорт на отпуск,подать заявку на отпуск,0.897349,0.6528674,0.52,3.49651
9,хочу поехать в отпуск,я хочу уйти в отпуск,0.967509,0.5634899,0.467742,4.009092
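
The `char_ngram_overlap` of 0.032258 in row 0 can be reproduced by hand. The sketch below restates the notebook's n-gram logic in a standalone form (`char_gram_set` is a name introduced here) and applies it to "отпуск" and "отдых": the two padded words share a single character 3-gram out of 31 distinct n-grams.

```python
def ngrams(word, n=3):
    return [word[i: i + n] for i in range(len(word) - n + 1)]


def char_gram_set(text):
    # Each word is padded with spaces so that word boundaries count as characters;
    # n-gram lengths 3 to 6 are collected, as in common_grams.
    return {g for w in text.lower().split() for n in range(3, 7) for g in ngrams(f' {w} ', n=n)}


g1 = char_gram_set('отпуск')
g2 = char_gram_set('отдых')
shared = g1 & g2

# Jaccard index: shared n-grams divided by all distinct n-grams
overlap = len(shared) / len(g1 | g2)
print(len(g1), len(g2), shared, round(overlap, 6))  # 18 14 {' от'} 0.032258
```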


In [None]:
view_result(df1)

{'sim': 0.6397986668243743,
 'bleu': 0.4269967930926052,
 'char_ngram_overlap': 0.6323692712894058,
 'perp_mean': 4.14011629939295,
 'sim_true': 0.8682981906902223,
 'bleu_true': 0.5794956477685356,
 'char_ngram_overlap_true': 0.5010725824641936,
 'perp_mean_true': 5.618729263461861,
 'dolya': 73.68421052631578}
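
The plain and `_true` averages above are linked by the share of scored pairs. Rejected pairs contribute zeros to `sim`, `bleu` and `perp_mean`, so each plain average equals the `_true` average scaled by the scored fraction. `char_ngram_overlap` is the exception, because verbatim copies keep their overlap of 1.0. The arithmetic below uses only the numbers printed above; the variable names are introduced here for illustration.

```python
total_pairs = 57
share_scored = 73.68421052631578 / 100
scored_pairs = round(total_pairs * share_scored)   # pairs with non-zero similarity
rejected_pairs = total_pairs - scored_pairs

# Plain average = "_true" average * scored_pairs / total_pairs
sim_all = 0.8682981906902223 * scored_pairs / total_pairs
bleu_all = 0.5794956477685356 * scored_pairs / total_pairs

# For char_ngram_overlap, the gap between the two sums is what the rejected rows contributed
overlap_from_rejected = 0.6323692712894058 * total_pairs - 0.5010725824641936 * scored_pairs

print(scored_pairs, rejected_pairs, sim_all, bleu_all, overlap_from_rejected)
```

The rejected rows contribute an overlap sum equal to their count, which suggests that every rejected paraphrase in this run was a verbatim copy of its source phrase.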

In [None]:
analyze_pairs(text_init_new, text_para)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


{'sim': 0.81047136,
 'sim_random': 0.3321916,
 'bleu_1': 0.6335595139039519,
 'bleu_2': 0.6234763220496439,
 'bleu': 0.628517917976798,
 'char_ngram_overlap': 0.5437544367954206,
 'perp_1': 5.3389373573890095,
 'perp_2': 4.6559462345044835,
 'perp_mean': 5.455873425190266}
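
The `sim` and `sim_random` values are cosine similarities: `encode_labse` L2-normalizes each embedding, so the row-wise dot product in `get_sims` equals the cosine of the angle between the two vectors. The pure-Python sketch below checks that identity on two toy vectors; the vectors and helper names are illustrative and unrelated to real LaBSE embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u, v = [3.0, 4.0], [4.0, 3.0]

# Dot product of unit vectors, as computed in get_sims
sim_from_unit_vectors = dot(l2_normalize(u), l2_normalize(v))

print(sim_from_unit_vectors, cosine(u, v))
```

`sim_random` applies the same computation to shuffled pairs, which gives a baseline for how similar two unrelated phrases from the same list look to the encoder.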

In [None]:
analyze_pairs(text_init_new, text_para_new)

{'sim': 0.9029565,
 'sim_random': 0.4299008,
 'bleu_1': 0.6955471864883517,
 'bleu_2': 0.684762189170543,
 'bleu': 0.6901546878294473,
 'char_ngram_overlap': 0.6323692712894058,
 'perp_1': 5.3389373573890095,
 'perp_2': 5.326056897811015,
 'perp_mean': 5.529151536134573}

In [None]:
# Without normalizing the paraphrased text

In [None]:
text_para = generate_text_para(paraphrase_cointegrated)
text_para_new = generate_text_para_new(text_para)
DATA = compute_metrics(text_init_new, text_para)
df2 = pd.DataFrame(DATA)
df2.shape

(57, 6)

In [None]:
df2

index,text_init,text_para,sim,bleu,char_ngram_overlap,perp_mean
0,отпуск,Отдых.,0.776682,1.164047e-231,0.028571,5.321343
1,рапорт,Рапорт.,0.776744,0.6289876,0.538462,6.390197
2,рапорт на отпуск,Рапорт на отпуск.,0.793039,0.8722311,0.744681,4.934168
3,подай рапорт на отпуск,Подай мне рапорт на отдых,0.817787,0.6230425,0.5,4.365199
4,сформируй рапорт на отпуск,Составьте рапорт на отдых,0.918834,0.4851056,0.196429,4.395616
5,сформировать рапорт на отпуск,Составьте рапорт на отдых,0.897272,0.4574191,0.177419,4.326065
6,оформить рапорт на отпуск,"Что делать, чтобы оформить отпуск на отпуск",0.660602,0.3725658,0.451923,4.245951
7,распечатать рапорт на отпуск,Рапорт на отпуск распечатать,0.918445,0.8915724,1.0,4.015771
8,подать рапорт на отпуск,Подать заявку на отпуск,0.828422,0.6054658,0.52,3.400995
9,хочу поехать в отпуск,Я хочу уйти в отпуск,0.942112,0.5634899,0.467742,3.984434


In [None]:
sorted_df=df1.sort_values(by='char_ngram_overlap')

In [None]:
sorted_df

index,text_init,text_para,sim,bleu,char_ngram_overlap,perp_mean
0,отпуск,отдых,0.94314,7.223693e-155,0.032258,5.188607
47,пришлите справочку,дай мне справку,0.773513,0.3622462,0.139241,4.56547
28,интерпретация теста на ковид,что известно о тесте на ковид в японии,0.420816,0.3347622,0.169118,7.022112
5,сформировать рапорт на отпуск,составьте рапорт на отдых,0.950895,0.4633225,0.177419,4.350691
4,сформируй рапорт на отпуск,составьте рапорт на отдых,0.967505,0.4929961,0.196429,4.420242
15,инстуркция для теста,инструкция для тестирования,0.757711,0.5570663,0.216216,4.142448
38,хочу справку,мне нужна справка,0.923049,0.3557711,0.233333,5.636605
29,понять результат теста на ковид,что известно о результате теста на коронавирус,0.639163,0.441206,0.282051,6.29934
48,предоставьте справку,дай мне справку,0.88145,0.4187545,0.289474,5.165238
27,результат теста на ковид,что известно о результатах теста на коронавирус,0.58452,0.3613545,0.309859,6.056014


In [None]:
model.eval();

def paraphrase_base(text, beams=5, grams=4, do_sample=False):
    x = tokenizer(text, return_tensors='pt', padding=True).to(model.device)
    max_size = int(x.input_ids.shape[1] * 1.5 + 10)
    out = model.generate(**x, encoder_no_repeat_ngram_size=grams, do_sample=do_sample, num_beams=beams, max_length=max_size, no_repeat_ngram_size=4,)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Testing the model with different values of beams and grams
1. enter the text to paraphrase
2. change the values of beams and grams
3. rerun the cell
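
The manual steps above can also be scripted as a small grid sweep. In the sketch below, `sweep` is a hypothetical helper and `fake_paraphrase` is a stand-in with the same keyword arguments as `paraphrase_base`, so the example runs without loading the model. In the notebook you would pass `paraphrase_base` instead.

```python
from itertools import product


def sweep(paraphrase_fn, text, beams_values, grams_values):
    """Call paraphrase_fn for every (beams, grams) combination and collect the outputs."""
    results = {}
    for beams, grams in product(beams_values, grams_values):
        # do_sample=False keeps the outputs deterministic, so runs are comparable
        results[(beams, grams)] = paraphrase_fn(text, beams=beams, grams=grams, do_sample=False)
    return results


def fake_paraphrase(text, beams=5, grams=4, do_sample=False):
    # Stand-in for paraphrase_base: echoes the settings instead of generating text
    return f'{text} [beams={beams}, grams={grams}]'


grid = sweep(fake_paraphrase, 'хочу справку', beams_values=[1, 4], grams_values=[2, 4])
for settings, output in grid.items():
    print(settings, output)
```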


In [None]:
#@title Paraphraser { run: "auto", form-width: "50%", display-mode: "form" }
text = ' загрузить документы об изменении персональных данных' #@param {type:"string"}
beams = 4 #@param {type:"slider", min:1, max:10, step:1}
grams = 4 #@param {type:"slider", min:1, max:10, step:1}
randomize = True #@param {type:"boolean"}

paraphrase_base(text, beams=beams, grams=grams, do_sample=randomize)

'Скачать документы о изменении данных'