<a href="https://colab.research.google.com/github/EmilisGit/Deep_learning/blob/main/bert_finetuning_with_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

## 1. Data preparation

We will fine-tune a pretrained T5 sequence-to-sequence model to summarize text.

The data comes from the `antash420/text-summarization-alpaca-format` dataset on the Hugging Face Hub: news articles (`input`) paired with short summaries (`output`).

In [4]:
!pip install datasets -q

from datasets import load_dataset

dataset = load_dataset("antash420/text-summarization-alpaca-format")
dataset.save_to_disk("/content/dataset")


In [10]:
# Index the row first, then the column, so only one example is materialized
print("Input: ", dataset['train'][0]['input'])
print("Output: ", dataset['train'][0]['output'])

Input:  LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Detail

In [15]:
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-mini")
tokens = tokenizer("""LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I'll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say 'kid star goes off the rails,'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's "Equus." Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: "I just think I'm going to be more sort of fair game," he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.
Output""")

print(len(tokens['input_ids']))

638


How many training examples (samples) do we have?

In [16]:
print("Training examples: ", len(dataset['train']))
print("Test examples: ", len(dataset['test']))

Training examples:  287113
Test examples:  11490


To speed up the whole process, from here on we use only a subset of the examples:

In [17]:
n_samples = 4000
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(n_samples))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(n_samples))
len(small_train_dataset)

4000
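Conceptually, `shuffle(seed=42).select(range(n_samples))` draws a reproducible random subset: permute the row indices with a fixed seed, then keep the first `n_samples`. The sketch below mimics that idea with the standard library only. The helper name `shuffled_subset` is hypothetical, and the `datasets` library uses its own random generator, so the actual indices it picks will differ.

```python
import random

def shuffled_subset(n_rows, n_samples, seed=42):
    """Return n_samples distinct row indices, reproducible for a given seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so every run gives the same order
    return indices[:n_samples]            # keep the first n_samples shuffled rows

# 287113 training rows in the full dataset, 4000 kept for fine-tuning
train_indices = shuffled_subset(287113, 4000)
print(len(train_indices), len(set(train_indices)))
```

Because the seed is fixed, rerunning the notebook selects the same 4000 examples, which keeps experiments comparable.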

In [38]:
small_train_dataset[0]

{'input': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in Camborne, Cornwall . It is also believed there was no working carbon monoxide detector

In [47]:
encoded_train = None
encoded_eval = None

In [48]:
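def tokenize(batch):
  # With batched=True, `batch` maps each column name to a list of values.
  # NOTE: this tokenizes the `text` column; if that column also embeds the
  # summary, batch["input"] (the article alone) would be the safer choice.
  # Source documents: truncate or pad every example to exactly 512 tokens.
  inputs = tokenizer(
      batch['text'], max_length=512, truncation=True, padding="max_length"
  )
  # Target summaries: truncate or pad every example to exactly 128 tokens.
  outputs = tokenizer(
      batch["output"], max_length=128, truncation=True, padding="max_length"
  )
  # Store the target token ids next to the inputs; they are the training labels.
  inputs["output_ids"] = outputs["input_ids"]
  return inputs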


encoded_train = small_train_dataset.map(tokenize, batched=True)
encoded_eval = small_eval_dataset.map(tokenize, batched=True)


Notice that after tokenization our `Dataset` object has gained new features (columns, keys): `'input_ids'`, `'attention_mask'` and `'output_ids'`:

In [49]:
encoded_train.features

{'input': Value(dtype='string', id=None),
 'output': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'instruction': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'output_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
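The `input_ids` and `attention_mask` columns come from a tokenizer call with `max_length`, `truncation=True` and `padding="max_length"`, so every example ends up with exactly the same length. The following is a minimal pure-Python sketch of that truncate-or-pad step. The helper `pad_or_truncate` is hypothetical, and the sketch ignores special tokens such as the end-of-sequence marker that the real tokenizer also handles.

```python
PAD_ID = 0  # T5's padding token id

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation: drop the tail
    n_pad = max_length - len(kept)                  # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A 638-token article (the count measured earlier) loses its last 126 tokens.
long_ids, long_mask = pad_or_truncate(list(range(1, 639)), max_length=512)

# A 5-token text is padded with 507 padding ids.
short_ids, short_mask = pad_or_truncate([10, 11, 12, 13, 1], max_length=512)

print(len(long_ids), sum(long_mask), sum(short_mask))
```

The attention mask is what lets the model ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.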

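No label dictionaries are needed here: unlike classification, summarization has no fixed set of classes, and the tokenized summaries in `output_ids` serve as the targets.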

## 2. Model initialization and fine-tuning

In [71]:
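from transformers import TFAutoModelForSeq2SeqLM  # TensorFlow/Keras variant, needed for compile()/fit()
# model = TFAutoModelForSeq2SeqLM.from_pretrained("google/t5-efficient-mini")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")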


Loading this checkpoint should not trigger a warning about discarded weights. That warning belongs to the BERT classification setup, where the masked-language-modelling head is dropped and replaced by a new classification head. T5 is already a sequence-to-sequence model, so its pretrained weights are reused as they are; we still fine-tune it to adapt it to our summarization data.

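Next we wrap the tokenized data in TensorFlow datasets (`tf.data.Dataset`). A generator yields one example at a time, and the output signature matches the fixed lengths used during tokenization (512 input tokens, 128 target tokens):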

In [54]:
import tensorflow as tf

batch_size = 16

def convert_to_tf_dataset(tokenized_dataset, batch_size=16):
    def gen():
        for data in tokenized_dataset:
            yield {
                "input_ids": data["input_ids"],
                "attention_mask": data["attention_mask"],
            }, data["output_ids"]

    return tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            {
                "input_ids": tf.TensorSpec(shape=(512,), dtype=tf.int32),
                "attention_mask": tf.TensorSpec(shape=(512,), dtype=tf.int32),
            },
            tf.TensorSpec(shape=(128,), dtype=tf.int32),
        ),
    ).batch(batch_size)

# Create the training and validation datasets with the shared batch size
train_dataset = convert_to_tf_dataset(encoded_train, batch_size=batch_size)
val_dataset = convert_to_tf_dataset(encoded_eval, batch_size=batch_size)
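The generator above passes `output_ids` to the model as targets, padding included. A common refinement, which this notebook does not apply, is to replace padding positions in the labels with `-100`, the value that Hugging Face loss functions skip, so that the model is not trained to predict padding. A minimal sketch, assuming T5's padding id of 0; the helper `mask_padding` and the example ids are hypothetical:

```python
PAD_ID = 0           # T5's padding token id
IGNORE_INDEX = -100  # label value skipped by Hugging Face loss functions

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding positions in a label sequence with the ignore index."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]

# Four real target tokens (ending in the end-of-sequence id 1) plus three pads
labels = [363, 19, 8, 1, 0, 0, 0]
print(mask_padding(labels))  # prints [363, 19, 8, 1, -100, -100, -100]
```

If this were adopted, the natural place would be inside the generator, applied to `data["output_ids"]` before it is yielded.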

TensorFlow `transformers` models can pick a sensible loss function on their own by default, so we do not need to pass a `loss` argument to `compile()`. The `transformers` library does, however, recommend constructing the `AdamW`-style optimizer ourselves:

In [80]:
from transformers import create_optimizer

num_epochs = 3

batches_per_epoch = len(encoded_train) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

# Build the optimizer and learning-rate schedule with the Hugging Face utility
optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

# Compile without a `loss` argument so the model falls back to its built-in loss.
# `model` must be a TensorFlow/Keras model (e.g. from TFAutoModelForSeq2SeqLM):
# the PyTorch T5ForConditionalGeneration has no `compute_loss` attribute and
# its compile() does not accept an `optimizer` keyword.
# Token-level accuracy says little about summary quality, so no metric is set.
model.compile(optimizer=optimizer)
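The step count passed to the optimizer is plain arithmetic: 4000 training examples in batches of 16 give 250 steps per epoch, or 750 steps over 3 epochs. The sketch below works through those numbers and shows how the learning rate would evolve, assuming the schedule decays linearly from `init_lr` to zero over `total_train_steps` with no warmup (our reading of the `create_optimizer` defaults, not something the notebook verifies). The helper `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under a linear decay to zero."""
    remaining = max(total_steps - step, 0)  # never go below zero after the last step
    return init_lr * remaining / total_steps

n_train, batch_size, num_epochs = 4000, 16, 3
batches_per_epoch = n_train // batch_size           # 250 steps per epoch
total_train_steps = batches_per_epoch * num_epochs  # 750 steps in total

# Learning rate at the start, halfway through, and at the end of training
for step in (0, 375, 750):
    print(step, linear_decay_lr(step, 5e-5, total_train_steps))
```

Under this assumption the learning rate is halved by the middle of the second epoch, which is why changing `num_epochs` or `batch_size` also changes the shape of the schedule.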

Time to fit the model:

In [70]:
# fit() exists only on the TensorFlow/Keras model class (TFAutoModelForSeq2SeqLM);
# the PyTorch T5ForConditionalGeneration has no such method.
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=num_epochs
)

## 3. Using the model for inference

In [None]:
sentences = [...]
tokenized = tokenizer(sentences, return_tensors="np", padding="longest",
                      return_token_type_ids=False)

predicted_logits = model(tokenized).logits
classifications = ...
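The template above reads class logits, as a classifier would. A sequence-to-sequence model such as T5 instead produces its output one token at a time: each new token is chosen from the scores computed for the tokens generated so far, until an end-of-sequence token appears. With the real model this loop is what `model.generate(...)` followed by `tokenizer.batch_decode(...)` performs. The toy sketch below illustrates greedy decoding only; `toy_next_token_scores` is a hypothetical stand-in for a decoder step, not the model's API.

```python
START_ID = 0  # T5 starts decoding from its padding token
EOS_ID = 1    # T5's end-of-sequence token id

def toy_next_token_scores(generated):
    """Hypothetical stand-in for one decoder step: scores over a 5-token vocabulary."""
    script = [3, 4, 2, EOS_ID]  # tokens this toy "model" prefers, in order
    target = script[min(len(generated) - 1, len(script) - 1)]
    return [1.0 if tok == target else 0.0 for tok in range(5)]

def greedy_decode(next_token_scores, max_new_tokens=10):
    """Repeatedly append the highest-scoring token until EOS or the length limit."""
    generated = [START_ID]
    for _ in range(max_new_tokens):
        scores = next_token_scores(generated)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(next_id)
        if next_id == EOS_ID:
            break
    return generated[1:]  # drop the start token

print(greedy_decode(toy_next_token_scores))  # prints [3, 4, 2, 1]
```

The `max_new_tokens` limit plays the same role as a maximum summary length: it stops generation even if the end-of-sequence token never appears.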