# Transformer Model 1

#### Task
1. Take a pretrained transformer architecture and solve a translation task.
2. (Optional extra task) Take a Russian-language classification dataset from `datasets`, take a model pretrained on that kind of classification task, and measure its quality on this dataset before and after fine-tuning.

In [1]:
# Installation
# conda install -c conda-forge transformers

In [2]:
# Import libraries

import tensorflow as tf
import transformers
from transformers import pipeline

#### Translation from English to German

In [3]:
# Load the dataset

from datasets import load_dataset

dataset = load_dataset("opus_books", lang1="de", lang2="en")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 51467
    })
})

In [5]:
# Split the data into training and test sets

split_datasets = dataset["train"].train_test_split(train_size=0.9, seed=17)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 46320
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 5147
    })
})
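
The split sizes above follow directly from `train_size=0.9` applied to 51,467 rows. The sketch below reproduces that arithmetic and a seeded shuffle-and-cut in plain Python. `toy_train_test_split` is a hypothetical helper written for illustration, not the `datasets` implementation, and its shuffle order will differ from the library's.

```python
import random


def toy_train_test_split(n_rows, train_size, seed):
    """Return (train_indices, test_indices) for a seeded random split.

    Hypothetical helper: it mimics the idea of a seeded shuffle-and-cut split,
    not the exact index order produced by the `datasets` library.
    """
    n_train = int(n_rows * train_size)    # round the train share down
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = toy_train_test_split(n_rows=51467, train_size=0.9, seed=17)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when quality is later compared before and after training on the same test rows.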

In [6]:
# Sample record

split_datasets["train"][1]["translation"]

{'de': '»Sie werden die ägyptischen Pyramiden hinaufklettern!« murmelte er. »Aber annoncieren Sie nur immer auf Ihre eigene Gefahr hin!',
 'en': '"You shall walk up the pyramids of Egypt!" he growled. "At your peril you advertise!'}

In [7]:
# Load the pipeline

from transformers import pipeline

translator = pipeline("translation_en_to_de")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)

In [8]:
# Check that the pipeline works

translator("How are you?")

[{'translation_text': 'Wie sind Sie?'}]

In [9]:
# Load the pretrained tokenizer

from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="tf")

In [10]:
# Split a record by language and tokenize

en_sentence = split_datasets["train"][1]["translation"]["en"]
de_sentence = split_datasets["train"][1]["translation"]["de"]

inputs = tokenizer(en_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(de_sentence)

In [11]:
# Preprocessing

max_input_length = 128
max_target_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["de"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
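
With `batched=True`, `map` passes the function a dictionary of columns, so `examples["translation"]` is a list of `{"de": ..., "en": ...}` dicts. The sketch below mimics that contract with a whitespace splitter standing in for the real tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical names, and the word lists are only placeholders for the integer token IDs a real tokenizer returns.

```python
def toy_tokenize(texts, max_length):
    """Split on whitespace and truncate: a stand-in for a real subword tokenizer."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}


def toy_preprocess(examples, max_input_length=128, max_target_length=128):
    # A batch arrives as a dict of columns; each column is a list of values.
    inputs = [pair["en"] for pair in examples["translation"]]
    targets = [pair["de"] for pair in examples["translation"]]
    model_inputs = toy_tokenize(inputs, max_input_length)
    # Target tokens are stored under "labels", mirroring the function above.
    model_inputs["labels"] = toy_tokenize(targets, max_target_length)["input_ids"]
    return model_inputs


batch = {
    "translation": [
        {"en": "How are you?", "de": "Wie geht es dir?"},
        {"en": "Good morning", "de": "Guten Morgen"},
    ]
}
result = toy_preprocess(batch, max_input_length=2, max_target_length=3)
print(result["input_ids"])  # [['How', 'are'], ['Good', 'morning']]
print(result["labels"])     # [['Wie', 'geht', 'es'], ['Guten', 'Morgen']]
```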

In [12]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)



In [13]:
# Load the pretrained model

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)

Metal device set to: Apple M1


All PyTorch model weights were used when initializing TFMarianMTModel.

All the weights of TFMarianMTModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [14]:
# Translation

text = 'The greatest glory in living lies not in never falling, but in rising every time we fall'

inputs = tokenizer.encode(text, return_tensors="tf")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

In [15]:
print(tokenizer.decode(outputs[0]))

<pad> Der größte Ruhm im Leben liegt nicht darin, nie zu fallen, sondern darin, jedes Mal aufzusteigen, wenn wir fallen<pad><pad><pad>
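
The decoded string still contains `<pad>` markers because `decode` keeps special tokens by default. Passing `skip_special_tokens=True` to `tokenizer.decode` is the usual way to drop them. The plain-Python sketch below shows the same cleanup applied to the string itself. `strip_special_tokens` is a hypothetical helper, and the set of special tokens is an assumption based on the output above.

```python
def strip_special_tokens(text, special_tokens=("<pad>", "</s>")):
    """Remove special-token markers from a decoded string (illustrative helper)."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()


decoded = (
    "<pad> Der größte Ruhm im Leben liegt nicht darin, nie zu fallen, "
    "sondern darin, jedes Mal aufzusteigen, wenn wir fallen<pad><pad><pad>"
)
cleaned = strip_special_tokens(decoded)
print(cleaned)
```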


#### Text classification

In [18]:
# Import libraries, load the tokenizer and the model

from transformers import TFAutoModelForSequenceClassification
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('blanchefort/rubert-base-cased-sentiment-rurewiews', return_tensors="tf")
model = TFAutoModelForSequenceClassification.from_pretrained('blanchefort/rubert-base-cased-sentiment-rurewiews', return_dict=True)

Some layers from the model checkpoint at blanchefort/rubert-base-cased-sentiment-rurewiews were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at blanchefort/rubert-base-cased-sentiment-rurewiews.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [16]:
# # Import libraries, load the tokenizer and the model

# from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# tokenizer = AutoTokenizer.from_pretrained("blanchefort/rubert-base-cased-sentiment-rusentiment", max_length=100, padding = "max_length", return_tensors="tf")

# model = TFAutoModelForSequenceClassification.from_pretrained("blanchefort/rubert-base-cased-sentiment-rusentiment")


In [19]:
# Load the pipeline

from transformers import pipeline

classifier = pipeline("text-classification", tokenizer = tokenizer, model = model)

In [37]:
# Load the dataset

from datasets import load_dataset

dataset = load_dataset("blinoff/healthcare_facilities_reviews")

In [21]:
dataset

DatasetDict({
    train: Dataset({
        features: ['content', 'title', 'sentiment', 'category', 'review_id', 'source_url', 'Idx'],
        num_rows: 70597
    })
    validation: Dataset({
        features: ['content', 'title', 'sentiment', 'category', 'review_id', 'source_url', 'Idx'],
        num_rows: 70597
    })
})

In [22]:
# Drop unneeded columns

dataset = dataset.remove_columns(['title', 'category', 'review_id', 'source_url', 'Idx'])

In [None]:
# Encode the target

dataset = dataset.rename_column('sentiment', 'label')
dataset = dataset.class_encode_column('label')
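
`class_encode_column` turns the string labels into integer class IDs. A minimal sketch of that idea in plain Python, assuming class names are assigned IDs in sorted order. `encode_labels` is a hypothetical helper and the sample labels are made up for illustration.

```python
def encode_labels(labels):
    """Map string labels to integer IDs, with class names in sorted order."""
    names = sorted(set(labels))
    name_to_id = {name: idx for idx, name in enumerate(names)}
    return [name_to_id[label] for label in labels], names


ids, names = encode_labels(["negative", "positive", "negative"])
print(names)  # ['negative', 'positive']
print(ids)    # [0, 1, 0]
```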

In [23]:
# Split the data into training and test sets

split_datasets = dataset["train"].train_test_split(train_size=0.9, seed=17)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['content', 'sentiment'],
        num_rows: 63537
    })
    test: Dataset({
        features: ['content', 'sentiment'],
        num_rows: 7060
    })
})

In [33]:
# Sample records

print(split_datasets["train"][0],
      split_datasets["train"][1],
      split_datasets["train"][2])

{'content': 'При госпитализации в больницу могут предложить услуги посредника! Особенно наглый Корпорация семейной медицины! Сумма госпитализации завышена в разы ( говорят, что берут за курацию), договор дают не на госпитализацию, а общий! Потом еще требуют доплаты хотя в платном отделе больницы счет меньше! Ужас!', 'sentiment': 'negative'} {'content': 'Ужасное отношение, диагноз выдуман, снимок не смог нормально прочитать стоматолог. Не советую! Прием 15.08.15.', 'sentiment': 'negative'} {'content': 'В нашей семье случилась беда. Сын стал наркоманом. Мы узнали о клинике доктора Исаева случайно, т. к. не обладали никакой информацией по этому вопросу. Сына после клиники отправили в центр "Не зависимость", где с ним занималась Марьяна-мониторный психолог. Благодаря её профессионализму, самоотдаче, чётко проведённой методике, душевности. чуткости и заботе наш сын стал совершенно другим человеком. Нет слов, чтобы выразить от всего сердца благодарность Марьяне. Побольше бы таких высококласс

In [32]:
# Check the pretrained model on a few examples

print(classifier(split_datasets["train"]['content'][0]),
      classifier(split_datasets["train"]['content'][1]),
      classifier(split_datasets["train"]['content'][2]))

[{'label': 'NEGATIVE', 'score': 0.8474009037017822}] [{'label': 'NEGATIVE', 'score': 0.9199402928352356}] [{'label': 'NEGATIVE', 'score': 0.7423568367958069}]


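The pretrained model was wrong in one case out of three: it labeled the third review, which expresses heartfelt gratitude to a psychologist, as NEGATIVE.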
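
To move from spot checks to a single number, the pipeline output can be compared with the reference labels. The sketch below computes accuracy in plain Python, assuming the pipeline returns upper-case labels while the dataset stores lower-case ones, as in the outputs above. `accuracy` is a hypothetical helper and the sample data is invented. A predicted label that never occurs among the references (for example NEUTRAL, if the dataset has no such class) simply counts as an error.

```python
def accuracy(predictions, references):
    """Share of predictions whose label matches the reference, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(
        pred["label"].lower() == ref.lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Invented sample in the pipeline's output format.
sample_predictions = [
    {"label": "NEGATIVE", "score": 0.85},
    {"label": "NEUTRAL", "score": 0.55},
    {"label": "POSITIVE", "score": 0.95},
    {"label": "NEGATIVE", "score": 0.70},
]
sample_references = ["negative", "negative", "positive", "positive"]
print(accuracy(sample_predictions, sample_references))  # 0.5
```

Running the same function on the test split before and after fine-tuning would give the two quality figures the task asks for.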

In [34]:
data = split_datasets["test"]["content"]
raw_predictions = classifier(data)

In [35]:
raw_predictions

[{'label': 'NEUTRAL', 'score': 0.529673159122467},
 {'label': 'NEUTRAL', 'score': 0.554032564163208},
 {'label': 'POSITIVE', 'score': 0.7349655032157898},
 {'label': 'NEGATIVE', 'score': 0.6967583298683167},
 {'label': 'NEUTRAL', 'score': 0.5432539582252502},
 {'label': 'NEUTRAL', 'score': 0.4575249254703522},
 {'label': 'NEUTRAL', 'score': 0.5524546504020691},
 {'label': 'POSITIVE', 'score': 0.9534175395965576},
 {'label': 'POSITIVE', 'score': 0.8425549268722534},
 {'label': 'NEGATIVE', 'score': 0.8536224961280823},
 {'label': 'NEGATIVE', 'score': 0.5404945015907288},
 {'label': 'NEGATIVE', 'score': 0.7912715077400208},
 {'label': 'NEGATIVE', 'score': 0.6760830879211426},
 {'label': 'POSITIVE', 'score': 0.9848350286483765},
 {'label': 'NEUTRAL', 'score': 0.3977958858013153},
 {'label': 'NEUTRAL', 'score': 0.6265783309936523},
 {'label': 'NEGATIVE', 'score': 0.6176688075065613},
 {'label': 'NEUTRAL', 'score': 0.6236677169799805},
 {'label': 'POSITIVE', 'score': 0.45551377534866333},
 {