# Using Hugging Face tools and pretrained models

### Option 1:
You need to create synthetic data for testing and/or training a chatbot. Given a sentence/statement/command, produce a set of expanded sentences/statements/commands with approximately the same meaning. Example:

> After your workout, remember to focus on maintaining a good water balance.

similar commands:

> Remember to drink enough water to restore and maintain your body's hydration after your cardio training.

> Please don't forget to maintain water balance after workout.

You are asked to solve a simplified version of this task using publicly available "small" models.
The Hugging Face repository hosts a large number of pretrained models for [causal](https://huggingface.co/models?pipeline_tag=text-generation) and [masked](https://huggingface.co/models?pipeline_tag=fill-mask) language modeling. For validation you can also use [sentence-transformers](https://huggingface.co/sentence-transformers). Choose models that can run on a CPU.

Example of using a masked LM:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
# load the model
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# a sentence with a masked token
sequence = f"My name is {tokenizer.mask_token}."

# tokenization result
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# apply the model
result = model(input_ids=input_ids)

# index of the masked token (NB: it may differ from the word position)
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# the most probable token
print(tokenizer.decode(result.logits[:, mask_token_index].argmax()))
```

or via [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines):

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="distilroberta-base")

pipe("My name is <mask>.")
```

Causal LM via pipeline:

```python
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

generator("Hello", max_length=10, num_return_sequences=5)
```

One naive way to solve the task without additional training is to replace a word of the original command with a mask token, insert a mask token into the command, or cut off part of the command, and then apply a language model. The result can be validated with [sentence-transformers](https://huggingface.co/sentence-transformers).
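
The three perturbations described above can be sketched without any model, purely as list operations on the words of a command. This is a minimal illustration, not the notebook's implementation: the helper names are hypothetical, and the literal `<mask>` string is assumed to match the mask token of a RoBERTa-style tokenizer.

```python
MASK = "<mask>"  # assumption: mask string of a RoBERTa-style tokenizer


def replace_with_mask(words, index, mask=MASK):
    """Replace the word at `index` with the mask token."""
    return words[:index] + [mask] + words[index + 1:]


def insert_mask(words, index, mask=MASK):
    """Insert the mask token before position `index` (0..len(words))."""
    return words[:index] + [mask] + words[index:]


def truncate(words, index):
    """Keep only the first `index` words as a prompt for a causal LM."""
    return words[:index]


words = "remember to drink water".split()
print(" ".join(replace_with_mask(words, 2)))  # remember to <mask> water
print(" ".join(insert_mask(words, 4)))        # remember to drink water <mask>
print(" ".join(truncate(words, 2)))           # remember to
```

The first two outputs would be passed to a masked LM, the third to a causal LM that continues the prompt.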


In [1]:
import torch
import numpy as np
import re
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
# load the model
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [2]:
# a sentence with a masked token
sequence = f"My name is {tokenizer.mask_token}."

# tokenization result
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# apply the model
result = model(input_ids=input_ids)

In [3]:
# index of the masked token (NB: it may differ from the word position)
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# the most probable token
print(tokenizer.decode(result.logits[:, mask_token_index].argmax()))

 Jason


In [4]:
# the N most probable tokens
N = 5
[tokenizer.decode(x) for x in result.logits[0, mask_token_index.tolist()[0]].argsort(descending=True).tolist()[:N]]

[' Jason', ' Alex', ' Chris', ' David', ' Brian']
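
The cell above ranks candidates by raw logits. Applying a softmax keeps the same order but turns the scores into probabilities, which can be easier to threshold. The sketch below shows this in plain Python; the vocabulary and logit values are invented for illustration and do not come from the model.

```python
import math


def top_n_with_probs(logits, vocab, n=5):
    """Return the n highest-scoring tokens with their softmax probabilities."""
    # Subtract the maximum logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token indices by probability, highest first, and keep the first n.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    return [(vocab[i], probs[i]) for i in order]


# Hypothetical logits for a five-token vocabulary (not taken from the model).
vocab = [" Jason", " Alex", " the", " Chris", " is"]
logits = [4.0, 3.5, -1.0, 3.0, -2.0]
for token, prob in top_n_with_probs(logits, vocab, n=3):
    print(f"{token!r}: {prob:.3f}")
```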

In [5]:
def get_N_most_possible_tokens(sequence, N=5):
    # tokenization result
    input_ids = tokenizer.encode(sequence, return_tensors="pt")
    # apply the model
    result = model(input_ids=input_ids)
    # index of the masked token (NB: it may differ from the word position)
    mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
    return [tokenizer.decode(x) for x in result.logits[0, mask_token_index.tolist()[0]].argsort(descending=True).tolist()[:N]]

Validation with sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer, util

# Load the sentence-transformers model
sbert_model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

def cos_score(sentence: str, new_sentence: str) -> float:
    # Encode the original and the new sentence
    sentence_embeddings = sbert_model.encode([sentence], convert_to_tensor=True)
    new_sentence_embeddings = sbert_model.encode([new_sentence], convert_to_tensor=True)

    # Compute the cosine similarity between the two embeddings
    cosine_scores = util.pytorch_cos_sim(sentence_embeddings, new_sentence_embeddings)

    return cosine_scores.item()


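def check_coincidence(command, new_command):
    # Returns True when the commands DIFFER after lowercasing;
    # punctuation is stripped from `command` only.
    return ' '.join(re.sub(r'[^\w\s]', '', command).lower().split(' ')) != ' '.join(new_command.lower().split(' '))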


def filtering(command, commands, th=0.9):
    result = []
    for new_command in commands:
        if cos_score(command, new_command) > th and check_coincidence(command, new_command):
            result.append(new_command)
    return list(set(result))
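
Note that `check_coincidence` strips punctuation only from its first argument, and `list(set(result))` does not preserve the order of candidates. One possible alternative, shown here as a sketch with hypothetical helper names, normalizes both sides the same way and removes duplicates while keeping the original order.

```python
import re


def normalize(command):
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", command.lower())
    return " ".join(no_punct.split())


def unique_candidates(original, candidates):
    """Drop candidates equal to the original or to each other after normalization."""
    seen = {normalize(original)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


print(unique_candidates(
    "Drink water.",
    ["drink water", "Drink more water", "drink more  water!", "Drink some water"],
))
# ['Drink more water', 'Drink some water']
```

Using a set of normalized keys also avoids rescanning the whole candidate list for every new command.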


In [7]:
sentence1 = "My name is Gosha."
new_sentence1 = "Gosha is my name"
print(cos_score(sentence1, new_sentence1))

sentence2 = "My name is Gosha."
new_sentence2 = "Jason is my name"
print(cos_score(sentence2, new_sentence2))

0.9137396216392517
0.2270156741142273
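
For reference, the score returned by `cos_score` is the cosine of the angle between two embedding vectors: their dot product divided by the product of their norms. The plain-Python sketch below uses made-up three-dimensional vectors in place of real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" (made up for illustration).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel: close to 1
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: 0.0
```

Values near 1 indicate embeddings pointing in almost the same direction, which is why a high threshold keeps only close paraphrases.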


In [8]:
def using_maskLM(command, insert_mask=True):

    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')

    low, high = 1, 3
    masks = np.random.randint(low, high)

    commands = []
    
    for _ in range(masks):
        if insert_mask:
            index = np.random.choice(range(0, len(command_words) + 1))
            command_words.insert(index, tokenizer.mask_token)
        else:
            index = np.random.choice(range(0, len(command_words)))
            command_words = [command_words[i] if i != index else tokenizer.mask_token for i in range(len(command_words))]

        masked_command = " ".join(command_words)
        variants = get_N_most_possible_tokens(masked_command, 5)
        good_variants =  [tok.strip() for tok in variants if len(tok) > 1]
        for good in good_variants:
            new_command = masked_command.replace(tokenizer.mask_token, good)
            new_command = ' '.join(new_command.split())
            if len(commands) == 0:
                commands.append(new_command)
            elif all([check_coincidence(new_command, command) for command in commands]):
                commands.append(new_command)

    return commands


def generate_similar_commands(command, num=10):
    commands = []
 
    for _ in range(num // 2):
        commands.extend(using_maskLM(command, insert_mask=True))

    for _ in range(num // 2, num):
        commands.extend(using_maskLM(command, insert_mask=False))

    return commands

np.random.seed(0)
command = "After your workout, remember to focus on maintaining a good water balance."
new_commands = generate_similar_commands(command, 10)

res = filtering(command, new_commands, th=0.95)
print(len(res))
display(res)


40


['after this workout remember to focus on maintaining this good water balance',
 'after your workout remember to focus on maintaining a healthy water balance',
 'after your workout remember to focus on maintaining a a water balance',
 'after your workout remember to focus on maintaining on good water balance',
 'after your workout and remember to focus on maintaining a good water balance',
 'after your workout remember to focus upon upon a good water balance',
 'after your workout remember to focus on maintaining a proper water balance',
 'after your workout and remember and to focus on maintaining a good water balance',
 'after your workout remember to focus on maintaining a steady water balance',
 'after your workout remember try to focus on maintaining a good water balance',
 'after your workout remember to focus on on a good water balance',
 'after your workout remember to focus on maintaining some good water balance',
 'after your workout remember to focus and and a good water bal

Now let's try modifying the same string several times

In [9]:
def using_maskLM(command):
    
    insert_mask = np.random.choice([False, True], p=(0.3, 0.7))

    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')

    if insert_mask:
        index = np.random.choice(range(0, len(command_words) + 1))
        command_words.insert(index, tokenizer.mask_token)
    else:
        index = np.random.choice(range(0, len(command_words)))
        command_words = [command_words[i] if i != index else tokenizer.mask_token for i in range(len(command_words))]

    masked_command = " ".join(command_words)
    variants = get_N_most_possible_tokens(masked_command, 5)
    good_variants =  [tok.strip() for tok in variants if len(re.sub(r'[^\w\s]', '', tok)) > 1]
    if len(good_variants) == 0:
        good = ""
    else:
        good = np.random.choice(good_variants)
    
    new_command = masked_command.replace(tokenizer.mask_token, good)
    new_command = ' '.join(new_command.split())
    return new_command 


def generate_similar_commands(command, num=10):
    commands = []

    K = len(command.split(' ')) + 1

    for _ in range(num):
        new_command = command
        for _ in range(K):
            new_command = using_maskLM(new_command)

            if len(commands) == 0:
                commands.append(new_command)
            elif all([check_coincidence(new_command, command) for command in commands]):
                commands.append(new_command)

    return commands

np.random.seed(0)
command = "After your workout, remember to focus on maintaining a good water balance."
new_commands = generate_similar_commands(command, 20)

res = filtering(command, new_commands, th=0.5)
print(len(res))
display(res)


229


['after your daily workout always remember when deciding to plan ahead towards maintaining a very good energy balance',
 'after completing your cardio workout remember to really consciously and truly consciously focus attention on maintaining a very really good water balance',
 'so after doing your morning workout remember not hesitate to focus on maintaining a good body water balance',
 'while successfully performing your first workout remember to keep focus attention properly balance maintaining a good water vapor ph balance',
 'after your workout remember to focus instead on maintaining maintaining a good water balance',
 'after your workout you remember trying to always focus on maintaining a good water balance',
 'while successfully performing my first workout remember to keep focus attention properly balance maintaining a good water vapor ph balance',
 'even after your workout remember to focus on maintaining maintaining a very good water vapor balance',
 'day after your workout 

The result is somewhat better; the parameters can be tuned further.
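
One parameter worth tuning is the similarity threshold `th`. A simple way to choose it, sketched below, is to compute the scores once and count how many candidates pass each threshold. The scores are invented placeholders standing in for `cos_score` outputs, and the helper name is hypothetical.

```python
def survivors_per_threshold(scores, thresholds):
    """Count how many candidates have a score strictly above each threshold."""
    return {th: sum(score > th for score in scores) for th in thresholds}


# Made-up similarity scores standing in for cos_score outputs.
scores = [0.97, 0.91, 0.84, 0.62, 0.48]
print(survivors_per_threshold(scores, [0.5, 0.9, 0.95]))
# {0.5: 4, 0.9: 2, 0.95: 1}
```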

Now via pipelines

In [10]:
from transformers import pipeline

pipe = pipeline("fill-mask", model="distilroberta-base")


def get_new_command(command):
    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')

    insert_mask = np.random.choice([False, True], p=(0.6, 0.4))
    if insert_mask:
        index = np.random.choice(range(0, len(command_words) + 1))
        command_words.insert(index, tokenizer.mask_token)
    else:
        index = np.random.choice(range(0, len(command_words)))
        command_words = [command_words[i] if i != index else tokenizer.mask_token for i in range(len(command_words))]

    masked_command = " ".join(command_words) 
    new_command = np.random.choice(pipe(masked_command))["sequence"]
    return new_command


def generate_similar_commands_with_pipe(command, num=5):
    commands = []

    K = len(command.split(' ')) + 1

    for _ in range(num):
        new_command = command
        for _ in range(25):
            new_command = get_new_command(new_command)

            if len(commands) == 0:
                commands.append(new_command)
            elif all([check_coincidence(new_command, command) for command in commands]):
                commands.append(new_command)
    return commands

np.random.seed(2)
command = "After your workout, remember to focus on maintaining a good water balance."
new_commands = generate_similar_commands_with_pipe(command, 5)

res = filtering(command, new_commands, th=0.5)
print(len(res))
display(res)





104


['after your workout plan to strive towards continually maintaining very good water balance',
 'after finishing your workout remember to constantly strive for maintaining at a good water pressure',
 'items are needed during starting your workout remember to work constantly on strengthening your maintaining core creating truly good muscle balance',
 'after almost every workout remember your focus on maintaining the consistently consistent good vaping nicotine quality',
 ' during your workout remember to work on maintaining for a good muscle balance',
 'please choose your workout plan plan  help us contribute towards continually maintaining very good water balance',
 'Choose your workout plan to strive towards continually maintaining very good water balance',
 'after your workout remember to focus on maintaining a good muscle balance',
 'afternoon workout remember not to really focus on maintaining some maximal oxygen retention',
 'as always during your workout remember to work constantl

Now with a causal LM via pipeline

In [11]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

generator("After your workout, remember to focus on maintaining a good water balance.", max_length=20, num_return_sequences=1)[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'After your workout, remember to focus on maintaining a good water balance. You can use water as an'
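
A causal LM only continues the prompt, so to obtain a variant of the command rather than a continuation, one option is to cut the command at some position, let the model produce the next word, and then reattach the rest of the original. The sketch below shows that splice with a hypothetical stand-in function in place of the real generation pipeline.

```python
def splice_one_word(words, index, continue_fn):
    """Insert the first word that `continue_fn` generates after words[:index]."""
    prefix = words[:index]
    generated = continue_fn(" ".join(prefix)).split()
    # Keep the prefix plus exactly one newly generated word.
    head = generated[:index + 1]
    return head + words[index:]


def fake_continue(prompt):
    """Hypothetical stand-in for a text-generation pipeline call."""
    return (prompt + " always stay hydrated").strip()


words = "remember to drink water".split()
print(" ".join(splice_one_word(words, 2, fake_continue)))
# remember to always drink water
```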

In [26]:
def get_new_command_gpt2(command):
    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')
    index = np.random.choice(range(0, len(command_words)))
    new_command = " ".join(command_words[:index])
    # use the GPT-2 tokenizer's own EOS id for padding (the global `tokenizer` belongs to distilroberta)
    new_command = generator(new_command, max_length=len(command_words) + 5, num_return_sequences=1, pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]
    new_command = re.sub(r'[^\w\s]', '', new_command.lower()).split(" ")[:index + 1]
    # remove possible \n characters
    new_command = " ".join(new_command).split()
    new_command.extend(command_words[index:])
    return " ".join(new_command)


def generate_similar_commands_with_gpt2(command, num=5):
    commands = []

    K = len(command.split(' ')) + 1

    for _ in range(num):
        new_command = command
        for _ in range(2):
            new_command = get_new_command_gpt2(new_command)

            if len(commands) == 0:
                commands.append(new_command)
            elif all([check_coincidence(new_command, command) for command in commands]):
                commands.append(new_command)
    return commands


command = "After your workout, remember to focus on maintaining a good water balance."
new_commands = generate_similar_commands_with_gpt2(command, 10)

res = filtering(command, new_commands, th=0.9)
print(len(res))
display(res)



17


['after your workout remember to focus on maintaining a strong good water balance',
 'after your workout but remember to focus on on maintaining a good water balance',
 'after your workout started remember to focus on maintaining a good water balance',
 'after your body workout remember to focus on maintaining a strong good water balance',
 'after your workout remember to focus on getting stronger maintaining a good water balance',
 'after your workout remember to focus on maintaining a good posture water balance',
 'after your workout started remember to to focus on maintaining a good water balance',
 'after your workout remember to focus on maintaining your a good water balance',
 'after your first workout id remember to focus on maintaining a good water balance',
 'after your workout remember to focus on on maintaining a good water balance',
 'after your workout remember to focus on maintaining balance a low good water balance',
 'after your return workout remember to focus on maint

Let's combine everything

In [27]:
from transformers import pipeline

pipe = pipeline("fill-mask", model="distilroberta-base")
generator = pipeline('text-generation', model='gpt2')



In [39]:
-2 % 5

3
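
In Python, `%` returns a result with the sign of the divisor, so `-2 % 5` is `3` rather than `-2`. That is why an expression of the form `(index - 3) % len(command_words)`, used in the next cell, always yields a valid list position. A small sketch with a hypothetical helper name:

```python
def shifted_index(index, shift, length):
    """Shift `index` left by `shift` positions, wrapping around the list end."""
    return (index - shift) % length


words = ["after", "your", "workout", "remember", "to"]
for i in range(len(words)):
    j = shifted_index(i, 3, len(words))
    print(i, "->", j, words[j])
```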

In [45]:
def get_new_command_bert(command):
    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')

    insert_mask = np.random.choice([False, True], p=(0.6, 0.4))
    if insert_mask:
        index = (np.random.choice(range(0, len(command_words) + 1)) - 3) % len(command_words)
        command_words.insert(index, tokenizer.mask_token)
    else:
        index = (np.random.choice(range(0, len(command_words))) - 3) % len(command_words)
        command_words = [command_words[i] if i != index else tokenizer.mask_token for i in range(len(command_words))]

    masked_command = " ".join(command_words) 
    new_command = np.random.choice(pipe(masked_command))["sequence"]
    return new_command


def get_new_command_gpt2(command):
    cleaned_command = re.sub(r'[^\w\s]', '', command.lower())
    command_words = cleaned_command.split(' ')
    index = np.random.choice(range(0, len(command_words)))
    new_command = " ".join(command_words[:index])
    # use the GPT-2 tokenizer's own EOS id for padding (the global `tokenizer` belongs to distilroberta)
    new_command = generator(new_command, max_length=len(command_words) + 5, num_return_sequences=1, pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]
    new_command = re.sub(r'[^\w\s]', '', new_command.lower()).split(" ")[:index + 1]
    # remove possible \n characters
    new_command = " ".join(new_command).split()
    new_command.extend(command_words[index:])
    return " ".join(new_command)


def generate_similar_commands(command, num=5, k=25, p=[0.8, 0.2]):
    commands = []

    functions = [get_new_command_bert, get_new_command_gpt2]

    for _ in range(num):
        new_command = command
        for _ in range(k):
            function = np.random.choice(functions, p=p)
            
            new_command = function(new_command)

            if len(commands) == 0:
                commands.append(new_command)
            elif all([check_coincidence(new_command, command) for command in commands]):
                commands.append(new_command)
    return commands


In [46]:
command = "After your workout, remember to focus on maintaining a good water balance."
new_commands = generate_similar_commands(command, 10, k=13, p=[0.8, 0.2])

res = filtering(command, new_commands, th=0.9)
print(len(res))
display(res)

39


['after starting your workout remember to focus on it to ensure a good water balance',
 'after your workout remember to focus on finding a good clean water balance',
 'after your workout remember how important to always focus yourself on keeping a healthy water balance',
 'after your workout focus on maintaining a fairly healthy daily water intake balance',
 'after your workout remember to focus on your maintaining a good water level balance',
 'after your workout remember to focus on maintaining a healthy water intake balance',
 'Morning jonel after your workout remember to focus on maintaining the good water balance',
 'after your workout remember to focus on maintaining a healthy water balance',
 'when finishing your workout remember to focus on keeping a consistent clean water balance',
 'after starting the workout remember to focus on it to ensure a good water retention',
 'after your workout remember to focus on getting a good clean water balance',
 'after starting the workout re