# Notebook 2 - Model Creation

  Since this task is text classification, we will use a BERT model. We start from a pretrained model and fine-tune it with LoRA (Low-Rank Adaptation of Large Language Models), described [in this paper](https://arxiv.org/abs/2106.09685). This choice reuses the knowledge already present in pretrained models and saves compute. To do so, we use the libraries provided by [Hugging Face](https://huggingface.co/).

In [4]:
!pip install datasets
!pip install tokenizers
!pip install torchmetrics
!pip install transformers
!pip install peft
!pip install evaluate
!pip install matplotlib


Collecting numpy<2.0,>1.20.0 (from torchmetrics)
  Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl (15.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.2
    Uninstalling numpy-2.1.2:
      Successfully uninstalled numpy-2.1.2
Successfully installed numpy-1.26.4




Collecting matplotlib
  Downloading matplotlib-3.9.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.0-cp312-cp312-win_amd64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.54.1-cp312-cp312-win_amd64.whl.metadata (167 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.7-cp312-cp312-win_amd64.whl.metadata (6.4 kB)
Collecting pillow>=8 (from matplotlib)
  Downloading pillow-11.0.0-cp312-cp312-win_amd64.whl.metadata (9.3 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.2.0-py3-none-any.whl.metadata (5.0 kB)
Downloading matplotlib-3.9.2-cp312-cp312-win_amd64.whl (7.8 MB)

In [3]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import csv
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    DistilBertModel,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    TrainerCallback)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate

ModuleNotFoundError: No module named 'matplotlib'

In [None]:
# read the datasets
train_data = Dataset.from_csv('train_data.csv')
test_data = Dataset.from_csv('test_data.csv')


data = {"train": train_data , "test": test_data}
final_data = DatasetDict(data)    # builds a dictionary of datasets for training and testing
print(final_data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25600
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 6400
    })
})


We use the [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) model, a BERT variant with fewer parameters that runs faster than the original BERT.

In [None]:
# define the base model
base_model = 'distilbert-base-uncased'


# define the label mapping
id2label = {0: "Non  clickbait", 1: "Clickbait"}
label2id = {"Non  clickbait":0, "Clickbait":1}

# use the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# create the model
model = DistilBertForSequenceClassification.from_pretrained(base_model, num_labels=2, id2label=id2label, label2id=label2id).to(device)
print(model)

Using device: cuda


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Above we define the model and print its architecture. To fine-tune it, we adapt the linear projection layers of the attention mechanism: q_lin, k_lin and v_lin.

In [None]:
# create the tokenizer that matches the base model
tokenizer = DistilBertTokenizer.from_pretrained(base_model, add_prefix_space=True)

# add a padding token, used to fill sentences shorter than max_length, if the tokenizer lacks one
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# when a text exceeds max_length, drop tokens from the left (the start of the text)
tokenizer.truncation_side = "left"


# function that tokenizes the datasets
def tokenize_function(examples):
    text = examples["text"]

    # truncate and tokenize the batch; map() stores plain lists, so tensor
    # conversion and device placement are left to the data collator and the Trainer
    tokenized_inputs = tokenizer(
        text,
        truncation=True,
        padding=True,
        max_length=200
    )

    return tokenized_inputs

tokenized_dataset = final_data.map(tokenize_function, batched=True)
print(tokenized_dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25600
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 6400
    })
})


In [None]:
# Visualize the tokenization
for j in range(5):
    # access row j of the 'test' split
    item = tokenized_dataset['test'][j]
    for i in item:
        print(i, ":", item[i])
    print("---------------")


text : s.e.c. enforcement officer steps down
label : 0
input_ids : [101, 1055, 1012, 1041, 1012, 1039, 1012, 7285, 2961, 4084, 2091, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
---------------
text : france first to recognise libyan rebels as "legitimate representatives of the people"
label : 0
input_ids : [101, 2605, 2034, 2000, 17614, 19232, 8431, 2004, 1000, 11476, 4505, 1997, 1996, 2111, 1000, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
---------------
text : 17 weird, gross, hilarious things everyone on their school football team did together
label : 1
input_ids : [101, 2459, 6881, 1010, 7977, 1010, 26316, 2477, 3071, 2006, 2037, 2082, 2374, 2136, 2106, 2362, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

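# accuracy metric
accuracy = evaluate.load("accuracy")

def compute_metrics(p):
    predictions, labels = p
    # pick the class with the highest logit
    predictions = np.argmax(predictions, axis=1)

    # note: accuracy.compute() already returns {"accuracy": value}, so wrapping it
    # nests the dict; this is why the logs report eval_accuracy as {'accuracy': ...}
    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}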

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
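
The nested `{'accuracy': ...}` values that appear in the evaluation logs of this notebook come from wrapping a metric result that is already a dictionary. As a minimal sketch of a flat alternative, assuming plain NumPy in place of the `evaluate` metric (the name `compute_metrics_flat` and the toy values are hypothetical, not part of the notebook), the accuracy can be returned as a single float:

```python
import numpy as np

def compute_metrics_flat(p):
    """Return accuracy as a plain float so it can be logged as a scalar."""
    logits, labels = p
    # pick the class with the highest logit for each example
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float(np.mean(predictions == labels))}

# toy logits for 4 examples and 2 classes (made-up values, for illustration only)
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_metrics_flat((toy_logits, toy_labels)))
```

A function of this shape could be passed as `compute_metrics` to the `Trainer`; the reported numbers would be the same, only without the extra nesting.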

In [None]:
# LoRA configuration
peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=10,
                        lora_alpha=16,
                        lora_dropout=0.01,
                        # target only the self-attention projections
                        target_modules = ["q_lin","k_lin","v_lin"])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = get_peft_model(model, peft_config).to(device)
model.print_trainable_parameters()

trainable params: 868,610 || all params: 67,823,620 || trainable%: 1.2807


r: Controls the rank of the low-rank decomposition of the weight updates, which sets the number of trainable parameters. Smaller values of r shrink the updates, making training cheaper in memory and compute.

lora_alpha: A scaling factor for the magnitude of the updates applied during training. It controls how strongly the low-rank adaptation influences the original model and helps with regularization.
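
To make the roles of `r` and `lora_alpha` concrete, here is a minimal NumPy sketch of a single adapted projection, following the formulation in the LoRA paper: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `lora_alpha / r`, is added to it. The random matrices stand in for real weights, the bias is omitted, and the names (`lora_linear`, `B_trained`) are illustrative assumptions rather than code from the `peft` library.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 768          # input/output width of q_lin, k_lin and v_lin in DistilBERT
r = 10           # LoRA rank, as in the configuration above
lora_alpha = 16  # LoRA scaling numerator, as in the configuration above

W = rng.normal(size=(d, d))              # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))  # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, starts at zero

def lora_linear(x, W, A, B, lora_alpha, r):
    """Frozen projection plus the scaled low-rank update (bias omitted)."""
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)

# With B = 0 the adapter contributes nothing, so training starts
# from the behaviour of the pretrained layer.
y_start = lora_linear(x, W, A, B, lora_alpha, r)

# After training B is non-zero; simulate that with small random values.
B_trained = rng.normal(scale=0.01, size=(d, r))
y_adapted = lora_linear(x, W, A, B_trained, lora_alpha, r)

adapter_params = A.size + B.size
print("adapter parameters:", adapter_params, "frozen weight parameters:", W.size)
```

Only `A` and `B` receive gradients, which is why a small `r` keeps the number of trainable parameters far below the size of `W`.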

With the configuration chosen above, we train only about 1.28% of the parameters (868,610 of roughly 67.8M). This gives a model adapted to the specific task without training all of its parameters.
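
The trainable-parameter count printed by `print_trainable_parameters()` can be reproduced by hand. The sketch below assumes that, besides the LoRA matrices on `q_lin`, `k_lin` and `v_lin` in each of the 6 transformer blocks, the two layers of the classification head (`pre_classifier` and `classifier`) also stay trainable; that assumption is consistent with the reported total, but it is an inference from the numbers, not something stated in the notebook.

```python
hidden = 768            # hidden size shown in the printed architecture
num_layers = 6          # "(0-5): 6 x TransformerBlock"
targets_per_layer = 3   # q_lin, k_lin, v_lin
r = 10                  # LoRA rank
num_labels = 2

# each adapted projection gets A (r x hidden) and B (hidden x r)
lora_per_module = r * hidden + hidden * r
lora_total = lora_per_module * targets_per_layer * num_layers

# classification head: weights plus biases (assumed trainable, see lead-in)
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels
head_total = pre_classifier + classifier

trainable = lora_total + head_total
all_params = 67_823_620  # "all params" value reported by peft above
percent = round(100 * trainable / all_params, 4)

print("LoRA:", lora_total, "head:", head_total, "trainable:", trainable, "percent:", percent)
```

Under these assumptions, most of the trainable parameters belong to the classification head rather than to the LoRA adapters themselves.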

In [None]:
# callback that saves the model at the end of each epoch
class SaveModelCallback(TrainerCallback):
    def __init__(self):
        super().__init__()

    def on_epoch_end(self, args, state, control, model=None, **kwargs):
        model.save_pretrained("finetuned_model/"+ base_model + "-lora-clikbait_/" + f"model_epoch_{state.epoch}")



In [None]:
# hyperparameters; these values were chosen because earlier experiments showed they give good results
lr = 5e-4
batch_size = 24
num_epochs = 5

# define the training arguments
training_args = TrainingArguments(
    output_dir=  base_model + "-lora-clikbait",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
)

# define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# evaluate the model before fine-tuning
trainer.evaluate()



Trainer is attempting to log a value of "{'accuracy': 0.50296875}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.6981660723686218,
 'eval_accuracy': {'accuracy': 0.50296875},
 'eval_runtime': 8.0007,
 'eval_samples_per_second': 799.926,
 'eval_steps_per_second': 33.372}

Before fine-tuning, the model reaches about 50% accuracy, which means it is guessing at random. Next, the model is trained for 5 epochs and we look at the results.
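
The evaluation loss tells the same story as the accuracy. A classifier that assigns equal probability to both classes has a cross-entropy of ln 2, and the loss measured before training sits very close to that value. The short check below only restates numbers already printed in this notebook; the variable names are illustrative.

```python
import math

num_classes = 2

# cross-entropy of a uniform guess over the classes
chance_loss = -math.log(1 / num_classes)

# eval_loss reported by trainer.evaluate() before training
observed_eval_loss = 0.6981660723686218

gap = observed_eval_loss - chance_loss
print(f"chance-level loss: {chance_loss:.4f}, observed: {observed_eval_loss:.4f}, gap: {gap:.4f}")
```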

In [None]:
# Create an instance of the callback class and add it to the Trainer
save_model_callback = SaveModelCallback()
trainer.add_callback(save_model_callback)


trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0646 | 0.064515 | 0.978125 |
| 2 | 0.0285 | 0.042083 | 0.98703125 |
| 3 | 0.0137 | 0.046391 | 0.9878125 |
| 4 | 0.0058 | 0.066327 | 0.98765625 |
| 5 | 0.0025 | 0.06652 | 0.9884375 |


Trainer is attempting to log a value of "{'accuracy': 0.978125}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.98703125}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.9878125}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.98765625}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.9884375}" of type <class 'dict'> for key "eval/accuracy

TrainOutput(global_step=5335, training_loss=0.023029978921062608, metrics={'train_runtime': 368.3407, 'train_samples_per_second': 347.504, 'train_steps_per_second': 14.484, 'total_flos': 1223310210545088.0, 'train_loss': 0.023029978921062608, 'epoch': 5.0})

In [None]:
# load the trained model to evaluate its performance
path_model_trained = "finetuned_model/distilbert-base-uncased-lora-clikbait_/model_epoch_5.0"
model_trained = DistilBertForSequenceClassification.from_pretrained(path_model_trained, num_labels=2, id2label=id2label, label2id=label2id).to(device)

trainer = Trainer(
    model=model_trained,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainer is attempting to log a value of "{'accuracy': 0.9884375}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.06651970744132996,
 'eval_accuracy': {'accuracy': 0.9884375},
 'eval_runtime': 6.0679,
 'eval_samples_per_second': 1054.729,
 'eval_steps_per_second': 44.002}

After a single epoch the model already reaches about 97.8% accuracy, and after 5 epochs it reaches 98.84% accuracy on the test data. This indicates that the model and the hyperparameters suit the task.
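
The per-epoch numbers from the training log also show that the two metrics disagree on which epoch is best: accuracy keeps inching up, while validation loss is lowest at epoch 2 and rises afterwards as training loss keeps falling. The sketch below copies the logged values into a list and compares the two selection criteria; it also converts accuracy into a count of misclassified headlines on the 6400-row test set. The variable names are illustrative and not part of the notebook.

```python
# (epoch, training loss, validation loss, accuracy) copied from the training log
history = [
    (1, 0.0646, 0.064515, 0.978125),
    (2, 0.0285, 0.042083, 0.98703125),
    (3, 0.0137, 0.046391, 0.9878125),
    (4, 0.0058, 0.066327, 0.98765625),
    (5, 0.0025, 0.066520, 0.9884375),
]
test_rows = 6400  # num_rows of the test split

best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])

# number of misclassified test headlines implied by each accuracy value
errors = [round((1 - acc) * test_rows) for _, _, _, acc in history]

print("lowest validation loss at epoch", best_by_loss[0])
print("highest accuracy at epoch", best_by_accuracy[0])
print("misclassified headlines per epoch:", errors)
```

Which criterion to prefer depends on the goal; the gain in accuracy between epochs 2 and 5 amounts to only a handful of headlines, so a shorter training run would likely give a comparable model.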

Finally, the model classifies 5 sentences it has never seen, which further illustrates its good performance.

In [None]:
model.eval()

text_list = ["Check out the marketing infographic",
             "Canada pursues new nuclear research reactor to produce medical isotopes",
             "This Is the Real Reason Doctors Make You Sit on That Tissue Paper",
             "Cuban talk show accuses U.S. diplomat of helping anti-government groups",
             "The 10 Hacks You Need to Stay Healthy This Winter",]

print("Model predictions on unseen sentences:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to(device)

    # inference only, so gradients are not needed
    with torch.no_grad():
        logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])



Model predictions on unseen sentences:
--------------------------
Check out the marketing infographic - Clickbait
Canada pursues new nuclear research reactor to produce medical isotopes - Non  clickbait
This Is the Real Reason Doctors Make You Sit on That Tissue Paper - Clickbait
Cuban talk show accuses U.S. diplomat of helping anti-government groups - Non  clickbait
The 10 Hacks You Need to Stay Healthy This Winter - Clickbait


As mentioned in the first notebook, we can build a backend around this model to develop a SaaS capable of detecting sensationalist headlines. Improving the model would involve a broader dataset that includes other sources of sensationalist material.