**Importing the required libraries and defining the sentence similarity functions**

In [1]:
import torch
import random
from transformers import pipeline
from transformers import BertTokenizer, BertModel
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

import spacy
nlp = spacy.load('en_core_web_sm')

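''' Measures the semantic similarity of two sentences with BERT. For example,
"Java better than Python" and "Python better than Java" will get a low similarity score. '''
def sim_sent_1(orig, gen):
    # The tokenizer and model are reloaded on every call
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # Split each sentence into WordPiece tokens (tokenize() adds no [CLS]/[SEP])
    tokens1 = tokenizer.tokenize(orig)
    tokens2 = tokenizer.tokenize(gen)

    # Convert tokens to ids and add a batch dimension: shape (1, seq_len)
    input_ids1 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens1)).unsqueeze(0)
    input_ids2 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens2)).unsqueeze(0)

    outputs1 = model(input_ids1)
    outputs2 = model(input_ids2)
    # Take the hidden state at position 0 of each sequence; since no special tokens
    # were added, this is the first word-piece of the sentence, not [CLS]
    embeddings1 = outputs1.last_hidden_state.detach().numpy()[:, 0, :]
    embeddings2 = outputs2.last_hidden_state.detach().numpy()[:, 0, :]

    # cosine_similarity returns a 1x1 matrix for a single pair of vectors
    similarity_score = cosine_similarity(embeddings1, embeddings2)
    return similarity_score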

''' Measures sentence similarity with sentence-transformers '''
def sim_sent_2(orig, gen):
    model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5')
    embedding_1 = model.encode([orig])
    embedding_2 = model.encode([gen])
    similarity_score = cosine_similarity(embedding_1, embedding_2)
    return similarity_score
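
Both functions end with `cosine_similarity`, which compares the directions of two embedding vectors and ignores their lengths. As a rough illustration of what that score means, here is a minimal pure-Python sketch; the helper name `cosine` and the toy vectors are made up for this example and are not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])                 # opposite directions
print(same_direction, orthogonal, opposite)
```

scikit-learn's `cosine_similarity` takes 2-D arrays and returns a matrix of pairwise scores, which is why the notebook's scores print as `[[...]]` and are later read with `[0][0]`.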

**Generating a sentence by inserting a random number of [MASK] tokens at random positions in the sentence**

In [2]:
pipe = pipeline('fill-mask', model='bert-base-uncased')

symb_mask = '[MASK]'
text_orig = 'After your workout, remember to focus on maintaining a good water balance.'
num_perf = random.randint(10, 25) # random number of word-to-[MASK] replacements, drawn once for all sentences

res_text = []
num_sent = 10 # number of sentences to generate

for j in range(num_sent):
    text = text_orig
    for i in range(num_perf):
        # Mask one randomly chosen space-separated word
        list_text = text.split(' ')
        num_pos = random.randint(0, len(list_text) - 1)
        list_text[num_pos] = symb_mask
        text = ' '.join(list_text)
        # Fill the mask with the model's top-scoring prediction
        text = pipe(text)[0]['sequence']
    res_text.append(text)
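
The mask-then-fill loop can be separated from the model for quick experiments. The sketch below keeps the same step but takes the fill step as a parameter; `mask_random_word`, `perturb` and the toy `fill_with_word` are hypothetical names used only for illustration. In the notebook, the fill step would presumably be `lambda t: pipe(t)[0]['sequence']`.

```python
import random

MASK = '[MASK]'

def mask_random_word(text, rng):
    """Replace one randomly chosen space-separated word with the mask token."""
    words = text.split(' ')
    words[rng.randrange(len(words))] = MASK
    return ' '.join(words)

def perturb(text, fill, n_steps, rng):
    """Repeat mask-then-fill n_steps times; `fill` maps a masked text to a filled text."""
    for _ in range(n_steps):
        text = fill(mask_random_word(text, rng))
    return text

def fill_with_word(masked_text):
    """Toy stand-in for the fill-mask model: always inserts the word 'water'."""
    return masked_text.replace(MASK, 'water')

rng = random.Random(0)  # seeded so the run is repeatable
original = 'remember to focus on maintaining a good water balance'
masked = mask_random_word(original, rng)
generated = perturb(original, fill_with_word, n_steps=3, rng=rng)
print(masked)
print(generated)
```

Because each mask is filled before the next one is placed, the text passed to the fill step always contains exactly one mask token.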


**Checking the similarity of the generated sentences to the original sentence**

In [3]:
print('Original sentence', text_orig)
for i, c in enumerate(res_text):
    print('Sentence #', i + 1, ': ', c)
    print("Similarity Score by BERT:", sim_sent_1(text_orig, c))
    print("Similarity Score by Sentence_Transformers:", sim_sent_2(text_orig, c))

Original sentence After your workout, remember to focus on maintaining a good water balance.
Sentence # 1 :  continue your search and focus on finding a good water.
Similarity Score by BERT: [[0.8046778]]
Similarity Score by Sentence_Transformers: [[0.90852064]]
Sentence # 2 :  in this case remember to focus on maintaining a good water balance.
Similarity Score by BERT: [[0.73124075]]
Similarity Score by Sentence_Transformers: [[0.9529843]]
Sentence # 3 :  after your shower, to focus on getting a little.
Similarity Score by BERT: [[0.6619318]]
Similarity Score by Sentence_Transformers: [[0.9220751]]
Sentence # 4 :  after your workout, you focus on finding a new body.
Similarity Score by BERT: [[0.8130115]]
Similarity Score by Sentence_Transformers: [[0.95211756]]
Sentence # 5 :  during your workout, try to focus on maintaining a healthy body.
Similarity Score by BERT: [[0.7770817]]
Similarity Score by Sentence_Transformers: [[0.9583457]]
Sentence # 6 :  let your training remember. focu

**Keep only the sentences that pass the 0.75 threshold on both scores**

In [4]:
result_1 = []
for c in res_text:
    # Each score is a 1x1 matrix, so [0][0] extracts the scalar value
    if sim_sent_1(text_orig, c)[0][0] >= 0.75 and sim_sent_2(text_orig, c)[0][0] >= 0.75:
        result_1.append(c)
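
The same filter can be written as a small helper that keeps a candidate only if every score clears its own threshold. This is a sketch rather than the notebook's code: `filter_by_thresholds` and the two toy scorers are hypothetical, and the real scorers would be wrappers such as `lambda o, g: sim_sent_1(o, g)[0][0]`.

```python
def filter_by_thresholds(original, candidates, scorers_with_thresholds):
    """Keep candidates whose score meets the threshold for every (scorer, threshold) pair."""
    kept = []
    for cand in candidates:
        if all(scorer(original, cand) >= thr for scorer, thr in scorers_with_thresholds):
            kept.append(cand)
    return kept

def word_overlap(a, b):
    """Toy scorer: Jaccard overlap of the two word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def length_ratio(a, b):
    """Toy scorer: shorter word count divided by longer word count."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb)

original = 'focus on maintaining a good water balance'
candidates = [
    'focus on maintaining a good body balance',  # one word changed
    'focus on water',                            # much shorter
]
kept = filter_by_thresholds(
    original, candidates, [(word_overlap, 0.5), (length_ratio, 0.75)]
)
print(kept)
```

Passing the thresholds as data makes it easy to use different cut-offs per scorer without rewriting the loop.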

**The second variant masks every word in the sentence in turn, but picks the replacement at random from the five candidate tokens offered**

In [5]:
res_text = []
num_sent = 5 # number of sentences to generate
text_orig = 'After your workout, remember to focus on maintaining a good water balance'

for j in range(num_sent):
    text = text_orig
    for i in range(len(text.split(' '))):
        try:
            # Mask the word at position i
            list_text = text.split(' ')
            list_text[i] = '[MASK]'
            text = ' '.join(list_text)
            # Pick one of the five candidates returned by the pipeline at random
            text = pipe(text)[random.randint(0, 4)]['sequence']
        except Exception:
            # Stop early, e.g. if an earlier fill changed the word count and position i no longer exists
            break
    res_text.append(text)



print('Original sentence', text_orig)
for i, c in enumerate(res_text):
    print('Sentence #', i + 1, ': ', c)
    print("Similarity Score by BERT:", sim_sent_1(text_orig, c))
    print("Similarity Score by Sentence_Transformers:", sim_sent_2(text_orig, c))

Original sentence After your workout, remember to focus on maintaining a good water balance
Sentence # 1 :  during the swim needs must focused on obtaining generally optimal lateral balance
Similarity Score by BERT: [[0.46790874]]
Similarity Score by Sentence_Transformers: [[0.9047325]]
Sentence # 2 :  in this, remember and concentrate upon getting this whole new ;
Similarity Score by BERT: [[0.6498668]]
Similarity Score by Sentence_Transformers: [[0.8819918]]
Sentence # 3 :  throughout each season has the emphasis in finding another new musical |
Similarity Score by BERT: [[0.58260113]]
Similarity Score by Sentence_Transformers: [[0.84218776]]
Sentence # 4 :  in your mind try to keep in it a certain mental...
Similarity Score by BERT: [[0.54820824]]
Similarity Score by Sentence_Transformers: [[0.8780413]]
Sentence # 5 :  throughout each day tryent focus on drinking a bottled liquid!
Similarity Score by BERT: [[0.607496]]
Similarity Score by Sentence_Transformers: [[0.8875871]]
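
The second variant differs from the first in two ways: positions are visited left to right instead of at random, and the replacement is drawn from several candidates rather than always taking the best one. A minimal sketch of that loop with a stand-in for the model follows; `resample_all_words` and `toy_candidates` are hypothetical, and in the notebook the candidates would presumably come from `[r['sequence'] for r in pipe(text)]`.

```python
import random

MASK = '[MASK]'

def resample_all_words(text, candidates_for, rng):
    """Mask each word position in turn and replace it with a randomly chosen candidate fill."""
    n_words = len(text.split(' '))
    for i in range(n_words):
        words = text.split(' ')
        if i >= len(words):  # a fill changed the word count; stop early
            break
        words[i] = MASK
        options = candidates_for(' '.join(words))
        text = rng.choice(options)
    return text

def toy_candidates(masked_text):
    """Stand-in for the fill-mask model: offers five fixed single-word fills."""
    return [masked_text.replace(MASK, w) for w in ('a', 'the', 'this', 'each', 'your')]

rng = random.Random(1)  # seeded so the run is repeatable
original = 'after your workout remember to focus'
generated = resample_all_words(original, toy_candidates, rng)
print(generated)
```

Since every position is resampled, nothing of the original wording is guaranteed to survive, which is consistent with the lower BERT scores printed for this variant.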


**Keep only the sentences that pass the thresholds of 0.70 and 0.75 for the BERT and SentenceTransformer scores, respectively**

In [6]:
result_2 = []
for c in res_text:
    # Each score is a 1x1 matrix, so [0][0] extracts the scalar value
    if sim_sent_1(text_orig, c)[0][0] >= 0.7 and sim_sent_2(text_orig, c)[0][0] >= 0.75:
        result_2.append(c)

**Combining the two lists of generated sentences**

In [7]:
result = result_1 + result_2
result

['continue your search and focus on finding a good water.',
 'after your workout, you focus on finding a new body.',
 'during your workout, try to focus on maintaining a healthy body.',
 'after your shower remember to focus on getting a little water.',
 'after your workout, try to focus on getting a good body balance.']