## Setup

In [2]:
%%capture
!pip install datasets
!pip install evaluate

In [6]:
import wandb
from google.colab import userdata

wb_token = userdata.get('WB')

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-Bert',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mifaledu2017[0m ([33mifaledu2017-none[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Loading and processing the dataset

In [7]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][50]

{'label': 0,
 'text': "The delivery driver mistakenly rang my doorbell, having confused 133 and 113.\\n\\nRather than taking a step back and analyzing the situation, he begins to accuse my wife and I of ordering and refusing to pay for this pizza.\\n\\nThe driver then gets on his cell phone and rather than calling the number than was given when the order was placed, begins to call his bosses and starts threatening me with felony charges. \\n\\nSo I take the initiative and ask the fine upstanding gentleman what the phone number of the order-er was, phone my neighbor and discover the mistake. Rather than a thank you or a sorry, he just speeds off (breaking the speed limit on our block) to reach his destination 50 feet away.\\n\\nI would call to complain, but based on the other reviews, its clear the owners do not care about Carnegie or it's residents, and its pretty well known around town just how awful their food is, so it would be pointless to boycott a place I'd never order from again

## Tokenizing the dataset
As you know, tokenization is a fundamental step in training NLP models. It converts the raw text into numeric inputs that the model, in this case BERT, can process. Also note that each model has its own tokenizer, so the tokenizer must be loaded from the same checkpoint as the model.
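
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal sketch using a toy whitespace tokenizer. The vocabulary, the ids, and the `toy_tokenize` helper are hypothetical and exist only for illustration; BERT's real tokenizer splits text into WordPiece subwords and uses a much larger vocabulary.

```python
# Toy illustration only: the vocabulary and ids below are made up.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 1, 2, 3
toy_vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "[CLS]": CLS_ID, "[SEP]": SEP_ID,
             "the": 4, "pizza": 5, "was": 6, "awful": 7}

def toy_tokenize(text, max_length=8):
    """Map whitespace-separated words to ids, then truncate and pad."""
    ids = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    # Truncation: keep room for the [CLS] and [SEP] special tokens.
    ids = [CLS_ID] + ids[: max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)
    # Padding to max_length: fill the remainder with [PAD], masked out with 0.
    pad = max_length - len(ids)
    return {"input_ids": ids + [PAD_ID] * pad,
            "attention_mask": attention_mask + [0] * pad}

encoded = toy_tokenize("The pizza was awful")
print(encoded)
```

Every example ends up with the same length, which is what lets the examples be stacked into fixed-size batches; the attention mask tells the model which positions are padding.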

In [8]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


In [9]:
# Create smaller subsets of the dataset (for faster training)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

## Training with the PyTorch Trainer


In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased",
    num_labels=5,
    torch_dtype="auto"
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Evaluating your model
Because the Trainer class does not evaluate the model on its own, we have to set this up manually: define a function that computes the metric and pass it to the Trainer.

In [11]:
import numpy as np
import evaluate

from transformers import TrainingArguments, Trainer

metric = evaluate.load("accuracy")

# Define a function that computes the metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Choose the evaluation strategy (here: evaluate at the end of every epoch)
training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")
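
As a rough illustration of what `compute_metrics` does with the model's output, the sketch below applies the same `np.argmax` step to a small array of made-up logits and computes accuracy by hand. The logits and labels are illustrative values, not outputs of the model in this notebook.

```python
import numpy as np

# Made-up logits for 4 reviews over 5 classes (illustrative values only).
logits = np.array([
    [2.0, 0.1, 0.0, -1.0, -2.0],   # highest score at index 0
    [0.0, 0.2, 1.5, 0.3, 0.1],     # highest score at index 2
    [-1.0, 0.0, 0.1, 0.2, 3.0],    # highest score at index 4
    [0.5, 1.0, 0.2, 0.1, 0.0],     # highest score at index 1
])
labels = np.array([0, 2, 3, 1])

# Pick the class with the highest logit for each row, as compute_metrics does.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())
print(predictions.tolist(), accuracy)
```

Three of the four predictions match their labels here, so the accuracy is 0.75.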

In [12]:
# Run training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 1.439992        | 0.388    |
| 2     | No log        | 1.01137         | 0.58     |
| 3     | No log        | 0.969446        | 0.586    |


TrainOutput(global_step=375, training_loss=1.2331044921875, metrics={'train_runtime': 385.7611, 'train_samples_per_second': 7.777, 'train_steps_per_second': 0.972, 'total_flos': 789354427392000.0, 'train_loss': 1.2331044921875, 'epoch': 3.0})
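
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a single device and the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs, since the cell above overrides neither. Under the same assumption, the "No log" entries in the training-loss column are most likely explained by the default logging interval of 500 steps, which is longer than the whole run.

```python
import math

num_train_examples = 1000   # size of small_train_dataset
batch_size = 8              # assumption: default per-device batch size, one device
num_epochs = 3              # assumption: default; consistent with 'epoch': 3.0
train_runtime = 385.7611    # seconds, taken from TrainOutput

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The computed values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` reported above.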

## Loading the model and running inference

In [21]:
from transformers import pipeline

pipe = pipeline('sentiment-analysis', model='/content/My_Model', tokenizer=tokenizer)
pipe("Hate you")

Device set to use cuda:0


[{'label': 'LABEL_0', 'score': 0.4333074986934662}]
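
To help read this output: a text-classification pipeline turns the model's logits into probabilities with a softmax and reports the most likely class, and generic names such as `LABEL_0` appear when the model config carries no custom label names. In `yelp_review_full`, labels 0 to 4 correspond to 1 to 5 stars, so `LABEL_0` means a 1-star review. The sketch below reproduces that post-processing in plain Python; the logits are hypothetical and are not the values behind the output above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input over the 5 classes.
logits = [1.2, 0.9, 0.1, -0.5, -0.8]
probs = softmax(logits)

# Index of the most probable class.
best = max(range(len(probs)), key=probs.__getitem__)

# Labels 0..4 in yelp_review_full correspond to 1..5 stars.
result = {"label": f"LABEL_{best}", "score": probs[best], "stars": best + 1}
print(result)
```

A top score well below 1, as in the real output above, means the probability mass is spread across several classes rather than concentrated on one.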