# BERT and GPT Models

#### Assignment

1. Take the dataset

https://huggingface.co/datasets/merionum/ru_paraphraser

and solve the paraphrase task.

2. (Optional extra task) Choose one of

https://huggingface.co/datasets/sberquad

https://huggingface.co/datasets/blinoff/medical_qa_ru_data

and train any model for a question answering system.

### Paraphrasing

In [1]:
# Load the dataset

from datasets import load_dataset

dataset = load_dataset("merionum/ru_paraphraser")

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 7227
    })
    test: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 1924
    })
})
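Before working with the `class` column, it can help to see how the labels are distributed. The following is a minimal sketch: `class_distribution` and the toy rows are hypothetical, and the label values shown are made up rather than taken from the dataset.

```python
from collections import Counter


def class_distribution(rows, label_key="class"):
    """Count how many rows carry each label value."""
    return Counter(row[label_key] for row in rows)


# Toy rows shaped like the dataset records (texts and labels are made up).
toy_rows = [
    {"text_1": "a", "text_2": "b", "class": "1"},
    {"text_1": "c", "text_2": "d", "class": "0"},
    {"text_1": "e", "text_2": "f", "class": "1"},
]

print(class_distribution(toy_rows))
```

On the real data the same helper could be called as `class_distribution(dataset["train"])`, since iterating over a split yields one dictionary per row.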

In [2]:
# Load the pretrained tokenizer and model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rut5-base-paraphraser")

model = AutoModelForSeq2SeqLM.from_pretrained("cointegrated/rut5-base-paraphraser")

In [3]:
# Paraphrase function

def paraphrase(text, beams=5, grams=4, do_sample=False):
    x = tokenizer(text, return_tensors='pt', padding=True).to(model.device)
    max_size = int(x.input_ids.shape[1] * 1.5 + 10)
    out = model.generate(**x, encoder_no_repeat_ngram_size=grams, num_beams=beams, max_length=max_size, do_sample=do_sample)
    return tokenizer.decode(out[0], skip_special_tokens=True)
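The `max_size` line in `paraphrase` caps the output at 1.5 times the input length plus 10 tokens. The sketch below isolates that arithmetic. The helper name `generation_length_budget` and its parameter names are hypothetical.

```python
def generation_length_budget(num_input_tokens, ratio=1.5, slack=10):
    """Upper bound on generated tokens, mirroring int(n * 1.5 + 10)."""
    return int(num_input_tokens * ratio + slack)


# A few input lengths and the resulting generation limits.
for n in (1, 20, 100):
    print(n, "->", generation_length_budget(n))
```

The constant slack matters most for short inputs, where a purely proportional limit could cut off a valid paraphrase.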

In [4]:
# Examples of the paraphraser at work

for i in range(15):
    print('Source text: ', dataset["train"][i]['text_1'])
    print('Paraphrase: ', paraphrase(dataset["train"][i]['text_1']))
    print('------------------------------------------------------------')

Source text:  Полицейским разрешат стрелять на поражение по гражданам с травматикой.
Paraphrase:  Полицейские могут стрелять на нападение на граждан с травматизмом.
------------------------------------------------------------
Source text:  Право полицейских на проникновение в жилище решили ограничить.
Paraphrase:  Решили ограничить право полицейских проникнуть в жилище.
------------------------------------------------------------
Source text:  Президент Египта ввел чрезвычайное положение в мятежных городах.
Paraphrase:  Глава Египта объявил чрезвычайное положение в городах-мятежниках.
------------------------------------------------------------
Source text:  Вернувшихся из Сирии россиян волнует вопрос трудоустройства на родине.
Paraphrase:  Вопрос трудоустройства в Сирии волнует россиян, вернувшихся из Сирии.
------------------------------------------------------------
Source text:  В Москву из Сирии вернулись 2 самолета МЧС с россиянами на борту.
Paraphrase:  Из Сирии в Москву ве

### Question Answering System

In [5]:
# Load the dataset

from datasets import load_dataset

raw_qa_dataset = load_dataset("sberquad")

In [6]:
raw_qa_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 45328
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5036
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 23936
    })
})

In [7]:
raw_qa_dataset["train"][0]

{'id': 62310,
 'title': 'SberChallenge',
 'context': 'В протерозойских отложениях органические остатки встречаются намного чаще, чем в архейских. Они представлены известковыми выделениями сине-зелёных водорослей, ходами червей, остатками кишечнополостных. Кроме известковых водорослей, к числу древнейших растительных остатков относятся скопления графито-углистого вещества, образовавшегося в результате разложения Corycium enigmaticum. В кремнистых сланцах железорудной формации Канады найдены нитевидные водоросли, грибные нити и формы, близкие современным кокколитофоридам. В железистых кварцитах Северной Америки и Сибири обнаружены железистые продукты жизнедеятельности бактерий.',
 'question': 'чем представлены органические остатки?',
 'answers': {'text': ['известковыми выделениями сине-зелёных водорослей'],
  'answer_start': [109]}}
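In this record format, `answer_start` is a character offset into `context`, so slicing the context from that offset should reproduce the answer text. The sketch below checks this on a shortened, made-up record. The helper names `answer_span` and `answer_matches_context` are hypothetical.

```python
def answer_span(record):
    """Return the (start, end) character range of the first reference answer."""
    start = record["answers"]["answer_start"][0]
    end = start + len(record["answers"]["text"][0])
    return start, end


def answer_matches_context(record):
    """Check that slicing the context at the span reproduces the answer text."""
    start, end = answer_span(record)
    return record["context"][start:end] == record["answers"]["text"][0]


# Shortened toy record in the same shape as the dataset entries (made up).
toy_record = {
    "context": "Они представлены известковыми выделениями водорослей.",
    "answers": {"text": ["известковыми выделениями"], "answer_start": [17]},
}

print(answer_span(toy_record), answer_matches_context(toy_record))
```

The same check could be run over a whole split to catch misaligned annotations before training.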

In [8]:
# Load the pretrained tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AndrewChar/model-QA-5-epoch-RU")

In [9]:
# Preprocessing function for the training split

max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
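The labelling loop in `preprocess_training_examples` converts a character span into token positions using the offset mapping. Below is a minimal standalone sketch of that step on hand-written offsets. The function name `char_span_to_token_span` and the toy offsets are hypothetical, and no tokenizer is involved.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if it does not fit."""
    # The answer must lie entirely inside the context part of this feature.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return (0, 0)

    # Move right until a token starts after the answer start, then step back one.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Move left until a token ends before the answer end, then step forward one.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return (start_token, end_token)


# Toy layout: [CLS] question [SEP] "the" "cat" "sat" "down" [SEP]
toy_offsets = [(0, 0), (0, 3), (0, 0), (0, 3), (4, 7), (8, 11), (12, 16), (0, 0)]

# "cat sat" covers characters 4..11 of the context "the cat sat down".
print(char_span_to_token_span(toy_offsets, 3, 6, 4, 11))
```

When the answer falls outside the current window, the label `(0, 0)` points at the first token of the sequence, which is how the notebook marks "no answer in this feature".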

In [10]:
# Preprocess the training split

train_dataset = raw_qa_dataset["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_qa_dataset["train"].column_names,
)
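Because `return_overflowing_tokens=True` splits long contexts into overlapping windows, the preprocessed split can contain more features than the raw split has examples. The sketch below approximates that windowing over context tokens only. The function `context_windows` is hypothetical, `capacity` stands for the room left for context after the question and special tokens, and the exact boundaries produced by the real tokenizer may differ.

```python
def context_windows(num_context_tokens, capacity, overlap):
    """Return (start, end) token ranges, end exclusive, sharing `overlap` tokens."""
    if overlap >= capacity:
        raise ValueError("overlap must be smaller than capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # The next window re-reads the last `overlap` tokens of this one.
        start = end - overlap
    return windows


# Ten context tokens, four per window, two shared between neighbours.
print(context_windows(10, capacity=4, overlap=2))
```

The overlap plays the role of `stride` in the notebook: it keeps an answer that straddles a window boundary fully visible in at least one window, provided the answer is no longer than the overlap.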



In [11]:
# Preprocessing function for the validation split

def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
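The validation preprocessing replaces the offsets of every non-context token with `None`, so that later steps can reject answers that start or end in the question or on a special token. A minimal sketch of that masking on toy data follows. The helper name `keep_context_offsets` is hypothetical.

```python
def keep_context_offsets(offsets, sequence_ids):
    """Keep offsets only where the token belongs to the context (sequence id 1)."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Toy layout: [CLS] question [SEP] ctx ctx [SEP]
toy_sequence_ids = [None, 0, None, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (0, 0), (0, 3), (4, 7), (0, 0)]

print(keep_context_offsets(toy_offsets, toy_sequence_ids))
```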

In [12]:
# Preprocess the validation split

validation_dataset = raw_qa_dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_qa_dataset["validation"].column_names,
)

In [13]:
# Use a small sample to check the model and the evaluation code

small_eval_set = raw_qa_dataset["validation"].select(range(100))

trained_checkpoint = "AndrewChar/model-QA-5-epoch-RU"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_qa_dataset["validation"].column_names,
)

In [14]:
tokenizer = AutoTokenizer.from_pretrained("AndrewChar/model-QA-5-epoch-RU")

In [15]:
# Load the pretrained model

import tensorflow as tf
from transformers import TFAutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("numpy")

batch = {k: eval_set_for_model[k] for k in eval_set_for_model.column_names}
trained_model = TFAutoModelForQuestionAnswering.from_pretrained(trained_checkpoint)

outputs = trained_model(**batch)


All model checkpoint layers were used when initializing TFDistilBertForQuestionAnswering.

All the layers of TFDistilBertForQuestionAnswering were initialized from the model checkpoint at AndrewChar/model-QA-5-epoch-RU.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


In [16]:
start_logits = outputs.start_logits.numpy()
end_logits = outputs.end_logits.numpy()

In [17]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

In [18]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
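The nested loop above scores every pair drawn from the top `n_best` start and end positions and keeps the valid pair with the highest summed logit. The following pure-Python sketch shows the same search on toy scores. The function `best_span` is hypothetical and uses `sorted` in place of `np.argsort` so that it runs without NumPy.

```python
def best_span(start_logits, end_logits, offsets, n_best=20, max_answer_length=30):
    """Return (start_index, end_index, score) of the best valid span, or None."""

    def top_indexes(scores):
        # Indexes of the n_best highest scores, best first.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top_indexes(start_logits):
        for e in top_indexes(end_logits):
            # Skip positions outside the context (masked with None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans with negative length or longer than allowed.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy scores: index 0 is a special token and therefore masked.
toy_offsets = [None, (0, 3), (4, 7), (8, 11)]
toy_start = [5.0, 0.1, 2.0, 0.3]
toy_end = [4.0, 0.2, 0.5, 3.0]

print(best_span(toy_start, toy_end, toy_offsets))
```

Returning `None` when no candidate survives the filters makes the empty case explicit, whereas a bare `max` over an empty list would raise an error.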

In [19]:
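# Load the metric
# (load_metric is deprecated in recent releases of datasets; evaluate.load("squad") is its replacement)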

from datasets import load_metric

metric = load_metric("squad")

In [20]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in small_eval_set
]

In [21]:
# Compare the predicted answer with the reference answer

print(predicted_answers[0])
print(theoretical_answers[0])

{'id': 60544, 'prediction_text': 'в Древнем Египте'}
{'id': 60544, 'answers': {'text': ['в Древнем Египте'], 'answer_start': [60]}}


In [22]:
# Look at the metric

metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 9.0, 'f1': 23.01902001507264}
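The two numbers reported here are an exact-match rate and a token-level F1, both on a 0 to 100 scale. The sketch below is a simplified per-example version for intuition only. The functions `exact_match` and `token_f1` are hypothetical, and they only lowercase and split on whitespace, whereas the `squad` metric applies further text normalization and averages over all examples.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings agree after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("в Древнем Египте", "в Древнем Египте"))
print(token_f1("Древнем Египте", "в Древнем Египте"))
```

A prediction that misses one word of a three-word reference gets no exact-match credit but still earns partial F1, which is why F1 is typically the higher of the two figures.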

In [23]:
# Function that computes the metrics

from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

In [24]:
# Compute the metrics

compute_metrics(start_logits, end_logits, eval_set, small_eval_set)

{'exact_match': 9.0, 'f1': 23.01902001507264}

In [25]:
# Model

model_checkpoint = 'AndrewChar/model-QA-5-epoch-RU'

model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at AndrewChar/model-QA-5-epoch-RU were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at AndrewChar/model-QA-5-epoch-RU and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
# Log in to Hugging Face

# from huggingface_hub import notebook_login

# notebook_login()

In [27]:
# Import the data collator

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

In [28]:
# Create the TensorFlow datasets

tf_train_dataset = train_dataset.to_tf_dataset(
    columns=[
        "input_ids",
        "start_positions",
        "end_positions",
        "attention_mask",
        "token_type_ids",
    ],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)
tf_eval_dataset = validation_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

In [29]:
# Hyperparameters

from transformers import create_optimizer
# from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_epochs = 3
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
  METAL, no compute capability (probably not an Nvidia GPU)
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
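The step count and learning-rate schedule can be worked through by hand. The sketch below assumes that `create_optimizer` with `num_warmup_steps=0` yields a linear decay from `init_lr` to zero over `num_train_steps`. That assumption, the helper name `linear_decay_lr`, and the use of 2987 batches per epoch (the figure shown in the training progress output of this notebook) should all be checked against the installed library and the actual run.

```python
def linear_decay_lr(step, init_lr, num_train_steps):
    """Learning rate after `step` updates under a linear decay to zero."""
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining


batches_per_epoch = 2987  # batch count from the training progress output
num_train_epochs = 3
num_train_steps = batches_per_epoch * num_train_epochs

# Learning rate at the start, after one epoch, and at the end of training.
for step in (0, batches_per_epoch, num_train_steps):
    print(step, linear_decay_lr(step, 2e-5, num_train_steps))
```

Under this assumption, stopping training early leaves the learning rate well above zero, so a partially trained model has not seen the low-rate tail of the schedule.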


In [None]:
# Train the model

model.fit(tf_train_dataset, epochs=num_train_epochs)

Epoch 1/3



 122/2987 [>.............................] - ETA: 3:49:21 - loss: 1.3058

In [None]:
# Evaluate the model

predictions = model.predict(tf_eval_dataset)
compute_metrics(
    predictions["start_logits"],
    predictions["end_logits"],
    validation_dataset,
    raw_qa_dataset["validation"],
)

In [None]:
# Check the model on a sample question

from transformers import pipeline

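# Assumes a fine-tuned model and its tokenizer were saved under this name;
# the cells above do not save or push the model.
model_checkpoint = "model-finetuned"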
question_answerer = pipeline("question-answering", model=model_checkpoint)

context = """
Москва — популярный туристический центр России. Кремль, Красная площадь, Новодевичий монастырь и Церковь Вознесения в Коломенском входят в список объектов всемирного наследия ЮНЕСКО. 
Она является важнейшим транспортным узлом: город обслуживают 6 аэропортов, 10 железнодорожных вокзалов, 3 речных порта (имеется речное сообщение с морями бассейнов Атлантического и Северного Ледовитого океанов).

"""
question = "Сколько вокзалов в Москве?"
question_answerer(question=question, context=context)
