# **1. Naive approaches**

## 1.1 Masked LM

In [6]:
import random

def mask_sentence(sentence: str,
                  percent_of_masks: float = 0.2,
                  mask_token='[MASK]'):
  '''
  Randomly mask tokens (whitespace-separated words) inside a sentence
  '''
  words = sentence.split()
  n_words_mask = max(1, round(len(words) * percent_of_masks))
  idx_words_mask = random.sample(range(len(words)), n_words_mask)

  for idx in idx_words_mask:
    words[idx] = mask_token

  masked_sentence = ' '.join(words)
  return masked_sentence

mask_sentence('Hello my name is Jack', percent_of_masks=0.5)

'[MASK] my name [MASK] Jack'
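
Because `mask_sentence` splits on whitespace, punctuation attached to a word is masked together with it, which is probably why the albert-base-v2 output further down ends in `water .`. The sketch below is one possible variant that keeps trailing punctuation outside the mask. The name `mask_sentence_keep_punct` and the `rng` parameter are hypothetical and not part of the original notebook.

```python
import random
import string

def mask_sentence_keep_punct(sentence, percent_of_masks=0.2,
                             mask_token='[MASK]', rng=None):
  '''
  Variant of mask_sentence that leaves trailing punctuation unmasked
  (hypothetical helper, not from the original notebook)
  '''
  rng = rng if rng is not None else random
  words = sentence.split()
  if not words:
    return sentence
  n_words_mask = min(len(words), max(1, round(len(words) * percent_of_masks)))

  for idx in rng.sample(range(len(words)), n_words_mask):
    word = words[idx]
    core = word.rstrip(string.punctuation)
    # a token made only of punctuation is masked as a whole
    trailing = word[len(core):] if core else ''
    words[idx] = mask_token + trailing

  return ' '.join(words)

print(mask_sentence_keep_punct('Hello there, Jack.', percent_of_masks=1.0))
```

With `percent_of_masks=1.0` every word is masked, so the comma and the full stop are the only characters that survive.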

In [35]:
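def replace_mask_with_predictions(sentence: str,
                                  pipe: "transformers.Pipeline",
                                  mask_token: str):
  '''
  Replace each {mask_token} with the model's top prediction
  '''
  predictions = pipe(sentence)
  # with a single mask the pipeline returns a flat list of candidates,
  # with several masks it returns one candidate list per mask
  if predictions and isinstance(predictions[0], dict):
    predictions = [predictions]

  # candidate lists follow the order of the masks in the sentence
  for candidates in predictions:
    best_pred = candidates[0]['token_str']
    sentence = sentence.replace(mask_token, best_pred, 1)
  return sentence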
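
The fill-mask pipeline returns a flat list of candidate dicts when the input has one mask and a list of such lists when it has several, so the replacement logic has to cope with both shapes. The sketch below checks that offline with a stand-in for the pipeline. `fake_pipe`, `normalize_predictions` and `fill_masks` are hypothetical names, and the stand-in only imitates the output shape, not a real model.

```python
def normalize_predictions(predictions):
  '''Return one candidate list per mask, whatever shape the pipeline gave.'''
  if predictions and isinstance(predictions[0], dict):
    return [predictions]
  return predictions

def fill_masks(sentence, pipe, mask_token):
  '''Replace masks left to right with the top candidate for each.'''
  for candidates in normalize_predictions(pipe(sentence)):
    sentence = sentence.replace(mask_token, candidates[0]['token_str'], 1)
  return sentence

def fake_pipe(sentence):
  # hypothetical stand-in: predicts w0, w1, ... for successive masks
  n_masks = sentence.count('[MASK]')
  per_mask = [[{'token_str': f'w{i}', 'score': 1.0}] for i in range(n_masks)]
  # imitate the pipeline: flat list for one mask, nested list otherwise
  return per_mask[0] if n_masks == 1 else per_mask

print(fill_masks('a [MASK] c', fake_pipe, '[MASK]'))
print(fill_masks('[MASK] b [MASK]', fake_pipe, '[MASK]'))
```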

In [40]:
from transformers import pipeline

def paraphrase(sentence: str, model: str, mask_token: str = "<mask>",
               percent_of_masks: float = 0.2):
  '''
  Full paraphrasing pipeline built on a masked LM
  '''
  # mask some of the tokens
  masked_sentence = mask_sentence(sentence, percent_of_masks, mask_token)

  # load the model and tokenizer through the fill-mask pipeline
  pipe = pipeline("fill-mask", model=model)
  result = replace_mask_with_predictions(masked_sentence, pipe, mask_token)
  return result
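
`paraphrase` relies on the caller passing the mask token that matches the model (`<mask>` for the RoBERTa checkpoints, `[MASK]` for BERT in the examples that follow). Hugging Face tokenizers expose this as `tokenizer.mask_token`, so it could be read from the pipeline instead. A minimal sketch, assuming the pipeline object has a `tokenizer` attribute; `resolve_mask_token` is a hypothetical helper, and the stand-in objects only let the snippet run without downloading a model.

```python
from types import SimpleNamespace

def resolve_mask_token(pipe, fallback="[MASK]"):
  '''
  Read the mask token from the pipeline's tokenizer (hypothetical helper)
  '''
  tokenizer = getattr(pipe, "tokenizer", None)
  token = getattr(tokenizer, "mask_token", None)
  return token if token is not None else fallback

# stand-in objects that imitate only the attributes used above
roberta_like = SimpleNamespace(tokenizer=SimpleNamespace(mask_token="<mask>"))
no_tokenizer = SimpleNamespace()

print(resolve_mask_token(roberta_like))
print(resolve_mask_token(no_tokenizer))
```

With a real pipeline the call would be `resolve_mask_token(pipe)` right after `pipe = pipeline("fill-mask", model=model)`, which would make the `mask_token` argument optional.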


#### Usage examples

In [62]:
t = "After your workout, remember to focus on maintaining a good water balance."

res_roberta_light = paraphrase(sentence=t,
                               model="distilroberta-base")

res_bert = paraphrase(sentence=t,
                      model="bert-base-uncased",
                      mask_token="[MASK]",
                      percent_of_masks=0.5)

res_albert = paraphrase(sentence=t,
                        model="albert-base-v2",
                        mask_token="[MASK]",
                        percent_of_masks=0.2)

res_roberta = paraphrase(sentence=t,
                         model="roberta-base",
                         mask_token="<mask>",
                         percent_of_masks=0.2)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task

In [63]:
print("distilroberta-base:", res_roberta_light)
print("bert-base-uncased:", res_bert)
print("albert-base-v2:", res_albert)
print("roberta-base:", res_roberta)

distilroberta-base: After your workout, remember to focus on  maintaining a good  body balance.
bert-base-uncased: during your workout, remember to focus on maintaining a more good balance.
albert-base-v2: After each workout, remember to focus on maintaining a good water .
roberta-base: After your workout, remember to focus  on maintaining a good  energy balance.


In [57]:
paraphrase(sentence=t,
           model="roberta-base",
           mask_token="<mask>",
           percent_of_masks=0.2)

Device set to use cpu


'After your workout, remember to focus on maintaining a  healthy  and balance.'
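
This run and the earlier roberta-base run give different sentences for the same input and settings. Since the mask positions come from `random.sample`, the most likely reason is that different words were masked each time. The sketch below shows one way to make the positions reproducible with a dedicated `random.Random(seed)`. `sample_mask_positions` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random

def sample_mask_positions(n_words, percent_of_masks=0.2, seed=None):
  '''
  Pick the word indices to mask; the same seed gives the same indices
  (hypothetical helper)
  '''
  rng = random.Random(seed)
  # same rounding rule as mask_sentence, capped at the number of words
  n_words_mask = min(n_words, max(1, round(n_words * percent_of_masks)))
  return sorted(rng.sample(range(n_words), n_words_mask))

first = sample_mask_positions(12, percent_of_masks=0.2, seed=0)
second = sample_mask_positions(12, percent_of_masks=0.2, seed=0)
print(first == second)
```

The example sentence has 12 whitespace-separated words, so `percent_of_masks=0.2` masks two of them.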

Something a bit more exotic

In [65]:
paraphrase(sentence=t,
           model="xlnet-base-cased",
           mask_token="<mask>",
           percent_of_masks=0.5)

Device set to use cpu


'After your first d on on on maintaining a good Consumer Consumer'

In [68]:
paraphrase(sentence=t,
           model="google/electra-small-generator",
           mask_token="[MASK]",
           percent_of_masks=0.5)

Device set to use cpu


'After your workout, remember that you are maintaining a good performance .'

In [71]:
paraphrase(sentence=t,
           model="camembert-base",
           mask_token="<mask>",
           percent_of_masks=0.3)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


'After your workout, remember  s on ou r good water balance.'

## 1.2 Causal LM

In [9]:
# from transformers import pipeline

# pipe = pipeline("text-generation")

# pipe("Hello my name is")

