<a href="https://colab.research.google.com/github/JacoMiranda/TP4-PLN-IComp/blob/main/scripts/T4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## FASE 1
#Passo 1: Configuração do Ambiente no Colab
A primeira ação é habilitar o ambiente de GPU.

Habilitar a GPU:
No menu, vá em Ambiente de execução (Runtime) -> Alterar o tipo de ambiente de execução (Change runtime type).
Selecione T4 GPU no menu suspenso Acelerador de hardware (Hardware accelerator) e clique em Salvar.
Instalar as Bibliotecas:

A célula de código a seguir instala as dependências necessárias. O bitsandbytes carrega o modelo de forma mais eficiente (quantização) e accelerate otimiza o uso do hardware.

In [None]:
# Instala as bibliotecas necessárias do Hugging Face e para otimização
# Instala as bibliotecas para PEFT (LoRA) e o SFTTrainer da TRL
%pip install -U transformers peft trl bitsandbytes datasets accelerate "deepeval>=0.21" aiofiles aiosqlite


In [None]:
# ==============================================================================
# Importações de bibliotecas padrão
import os
import requests
import json
import time
import xml.etree.ElementTree as ET
import re
# Verificação de versões
import trl
import peft
import transformers
import deepeval
print("--- Versões Atuais das Bibliotecas ---")
print(f"Versão da TRL: {trl.__version__}")
print(f"Versão da PEFT: {peft.__version__}")
print(f"Versão da Transformers: {transformers.__version__}")
print(f"Versão da DeepEval: {deepeval.__version__}")

# Importações específicas para Fine-tuning
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig
# CORREÇÃO: Importar SFTConfig e SFTTrainer da biblioteca TRL
from trl import SFTConfig, SFTTrainer

--- Versões Atuais das Bibliotecas ---
Versão da TRL: 0.19.0
Versão da PEFT: 0.15.2
Versão da Transformers: 4.52.4
Versão da DeepEval: 3.1.8


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# É necessário o "huggingface_hub' instalado

# O comando a seguir pedirá seu token do Hugging Face. Você pode gerar um em https://huggingface.co/settings/tokens , descomente se quiser usá-lo ou configure uma
# variável de ambiente com HF_TOKEN
#!huggingface-cli login

## Passo 2: Carregando o Modelo e o Tokenizador
Carregando o modelo de 8 bilhões de parâmetros.
É usado a quantização de 4 bits para garantir que ele caiba na memória da GPU do Colab.

O meta-llama/Meta-Llama-3-8B-Instruct. É necessário uma conta no Hugging Face e de um token de acesso para usar os modelos da Meta. Pode-se obter um token em huggingface.co/settings/tokens.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configuração de quantização para carregar o modelo em 4-bit
# Isso reduz drasticamente o uso de memória da GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Nome do modelo a ser usado
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct" #Este modelo é muito grande para a instância do colab atual, será usado o mistral
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Carrega o tokenizador

# O tokenizador prepara o texto para o modelo
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define o token de padding para ser igual ao token de fim de sentença.
tokenizer.pad_token = tokenizer.eos_token
# Carrega o modelo com a configuração de quantização
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", # Mapeia o modelo para o dispositivo disponível (GPU)
)

# Definir a semente para reprodutibilidade
torch.manual_seed(42)

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

<torch._C.Generator at 0x7b06fda7e590>

## Passo 3: Carregando o Dataset Spider
Usa-se a biblioteca datasets para carregar o Spider diretamente do Hugging Face Hub.
O development split no documento corresponde ao split de validation no Hugging Face Hub.
O training split será usado para criar os exemplos do prompt.

In [None]:
from datasets import load_dataset

# Carrega o split de treino para selecionar exemplos para o prompt
train_dataset = load_dataset("spider", split="train")

# Carrega o split de validação (dev) para a avaliação
eval_dataset = load_dataset("spider", split="validation")

# Inspecionar alguns exemplos do split de treino para escolher os 3 exemplos
print("Exemplos do Training Split para o Prompt:")
for i in range(5):
    print(f"Pergunta: {train_dataset[i]['question']}")
    print(f"Query SQL: {train_dataset[i]['query']}\n")

README.md:   0%|          | 0.00/5.51k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/831k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/126k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1034 [00:00<?, ? examples/s]

Exemplos do Training Split para o Prompt:
Pergunta: How many heads of the departments are older than 56 ?
Query SQL: SELECT count(*) FROM head WHERE age  >  56

Pergunta: List the name, born state and age of the heads of departments ordered by age.
Query SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

Pergunta: List the creation year, name and budget of each department.
Query SQL: SELECT creation ,  name ,  budget_in_billions FROM department

Pergunta: What are the maximum and minimum budget of the departments?
Query SQL: SELECT max(budget_in_billions) ,  min(budget_in_billions) FROM department

Pergunta: What is the average number of employees of the departments whose rank is between 10 and 15?
Query SQL: SELECT avg(num_employees) FROM department WHERE ranking BETWEEN 10 AND 15



## Passo 4: Construindo o Prompt e Executando a Avaliação
Com base na saída da célula anterior, escolha 3 exemplos e construa seu prompt. Em seguida, crie um loop para avaliar o modelo no eval_dataset.

Construa o Template do Prompt Fixo:

In [None]:
# Template de prompt fixo com 3 exemplos do training split
# Este é uma estrutura.
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: {question}
SQL:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

##Passo 4.1 Avaliação e resultado

In [None]:
import pandas as pd

# Lista para armazenar os resultados
results = []

# Para um teste rápido, vamos avaliar apenas as 10 primeiras amostras
# Para o trabalho completo, você deve iterar sobre todo o eval_dataset
for i in range(10):
    question = eval_dataset[i]['question']
    ground_truth_query = eval_dataset[i]['query']

    # Formata o prompt final com a nova pergunta
    final_prompt = prompt_template.format(question=question)

    # Prepara o input para o modelo
    inputs = tokenizer(final_prompt, return_tensors="pt").to("cuda")

    # Gera a saída do modelo
    outputs = model.generate(**inputs, max_new_tokens=100)

    # Decodifica a saída gerada, pulando os tokens do prompt
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extrai apenas a query SQL gerada
    # A lógica aqui pode precisar de ajustes dependendo do formato da saída do modelo
    generated_sql = generated_text.split("assistant\n")[-1].strip()

    # Registra a consulta SQL gerada
    results.append({
        "question": question,
        "ground_truth_query": ground_truth_query,
        "generated_query": generated_sql
    })

    print(f"Pergunta: {question}")
    print(f"Query Gerada: {generated_sql}\n")


# Converta os resultados para um DataFrame do Pandas para fácil visualização e análise
df_results = pd.DataFrame(results)
display(df_results)

# A partir daqui, você faria a contagem bruta de sucesso/falha

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: How many singers do we have?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: How many singers do we have?
SQL:<|eot_id|><|start_header_id|>assistant<|end_header_id|>



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What is the total number of singers?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What is the total number of singers?
SQL:<|eot_id|><|start_header_id|>assistant<|end_header_id|>



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: Show name, country, age for all singers ordered by age from the oldest to the youngest.
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: Show name, country, age for all singers ordered by age from the oldest to the youngest.
SQL:<|eot_id

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What are the names, countries, and ages for every singer in descending order of age?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What are the names, countries, and ages for every singer in descending order of age?
SQL:<|eot_id|><|st

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What is the average, minimum, and maximum age of all singers from France?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What is the average, minimum, and maximum age of all singers from France?
SQL:<|eot_id|><|start_header_id|>assista

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What is the average, minimum, and maximum age for all French singers?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What is the average, minimum, and maximum age for all French singers?
SQL:<|eot_id|><|start_header_id|>assistant<|end_

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: Show the name and the release year of the song by the youngest singer.
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: Show the name and the release year of the song by the youngest singer.
SQL:<|eot_id|><|start_header_id|>assistant<|en

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What are the names and release years for all the songs of the youngest singer?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What are the names and release years for all the songs of the youngest singer?
SQL:<|eot_id|><|start_header_i

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Pergunta: What are all distinct countries where singers above age 20 are from?
Query Gerada: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.

### Example 1
Question: How many heads of the departments are older than 56?
SQL: SELECT count(*) FROM head WHERE age > 56

### Example 2
Question: What are the names of the departments that have more than 10 instructors?
SQL: SELECT T2.dept_name FROM instructor AS T1 JOIN department AS T2 ON T1.dept_id  =  T2.dept_id GROUP BY T1.dept_id HAVING count(*)  >  10

### Example 3
Question: List the name, born state and age of the heads of departments ordered by age.
SQL: SELECT name ,  born_state ,  age FROM head ORDER BY age

### New Task
Question: What are all distinct countries where singers above age 20 are from?
SQL:<|eot_id|><|start_header_id|>assistant<|end_he

Unnamed: 0,question,ground_truth_query,generated_query
0,How many singers do we have?,SELECT count(*) FROM singer,<|begin_of_text|><|start_header_id|>system<|en...
1,What is the total number of singers?,SELECT count(*) FROM singer,<|begin_of_text|><|start_header_id|>system<|en...
2,"Show name, country, age for all singers ordere...","SELECT name , country , age FROM singer ORDE...",<|begin_of_text|><|start_header_id|>system<|en...
3,"What are the names, countries, and ages for ev...","SELECT name , country , age FROM singer ORDE...",<|begin_of_text|><|start_header_id|>system<|en...
4,"What is the average, minimum, and maximum age ...","SELECT avg(age) , min(age) , max(age) FROM s...",<|begin_of_text|><|start_header_id|>system<|en...
5,"What is the average, minimum, and maximum age ...","SELECT avg(age) , min(age) , max(age) FROM s...",<|begin_of_text|><|start_header_id|>system<|en...
6,Show the name and the release year of the song...,"SELECT song_name , song_release_year FROM sin...",<|begin_of_text|><|start_header_id|>system<|en...
7,What are the names and release years for all t...,"SELECT song_name , song_release_year FROM sin...",<|begin_of_text|><|start_header_id|>system<|en...
8,What are all distinct countries where singers ...,SELECT DISTINCT country FROM singer WHERE age ...,<|begin_of_text|><|start_header_id|>system<|en...
9,What are the different countries with singers...,SELECT DISTINCT country FROM singer WHERE age ...,<|begin_of_text|><|start_header_id|>system<|en...


##Fase 2: Execução do Fine-Tuning

Nesta fase, trata-se da especialização do modelo base na tarefa de Text-to-SQL usando os dados de treino do Spider. Utilizá-se-a a técnica LoRA (Low-Rank Adaptation), que é um método de Parameter-Efficient Fine-Tuning (PEFT). Ele permite treinar o modelo de forma muito mais rápida e com menos memória, modificando apenas uma pequena fração dos pesos do modelo.


Ação 2: Preparar o Dataset para o Treinamento
O SFTTrainer espera que o dataset esteja em um formato específico de "conversa". Vamos criar uma função para formatar cada exemplo do Spider training split no template do Llama-3.

In [None]:
# O dataset de treino já foi carregado na fase anterior como 'train_dataset'

# Função para formatar cada exemplo do dataset no formato de chat do Llama 3
def format_spider_prompt(example):
    # Formata a entrada como uma conversa entre sistema, usuário e assistente
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.
Question: {example['question']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['query']}<|eot_id|>"""
    return {"text": prompt}

# Aplica a formatação a todo o dataset de treino
formatted_train_dataset = train_dataset.map(format_spider_prompt)

print(formatted_train_dataset[0]['text'])

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in converting natural language questions to SQL queries.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate the following question to a SQL query.
Question: How many heads of the departments are older than 56 ?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

SELECT count(*) FROM head WHERE age  >  56<|eot_id|>


Ação 3: Configurar o LoRA
Configuração dos hiperparâmetros que controlam como o fine-tuning será aplicado.

In [None]:
from peft import LoraConfig

# Configuração do LoRA
# Obs: documentar no relatório final
lora_config = LoraConfig(
    r=8,  # O rank (r) da adaptação. Valores comuns são 8, 16, 32.
    lora_alpha=16, # Alpha é um parâmetro de escalonamento.
    lora_dropout=0.05, # Dropout para as camadas LoRA.
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Módulos alvo para aplicar o LoRA
    task_type="CAUSAL_LM",
)

Ação 4: Executar o Treinamento
Com tudo pronto, vamos configurar os argumentos de treinamento e iniciar o processo.
Testar pelo menos duas configurações de hiperparâmetros distintas.
Para isso, você pode variar learning_rate ou num_train_epochs.

FAse 2 - 4.1 Criar a primeira configuração e treinar.

In [None]:
# Configuração do treinamento:
# Use apenas SFTConfig para todos os argumentos. Ele substitui o TrainingArguments.
# Mantive os parâmetros de economia de memória que funcionaram anteriormente.
sft_config_run1 = SFTConfig(
    # Parâmetros de diretório e modelo
    output_dir="/content/drive/MyDrive/mistral-7b-spider-run1",

    # Parâmetros de economia de memória e lote
    per_device_train_batch_size=1,      # <<< Recomendo voltar para 1 por segurança
    gradient_accumulation_steps=8,      # <<< Recomendo voltar para 8 por segurança
    optim="paged_adamw_8bit",
    fp16=True,

    # Parâmetros de treinamento e scheduler
    num_train_epochs=1,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,

    # Parâmetros de salvamento e log
    save_strategy="steps",
    save_steps=200,                     # <<< Frequência aumentada para salvar mais vezes em 1 época
    save_total_limit=3,
    logging_steps=25,
    report_to="none",                   # Ótima adição para desabilitar o wandb

    # Parâmetros específicos do SFT
    max_seq_length=1024,
    dataset_text_field="text",
    group_by_length=True,
)

# Criação do Trainer para a primeira execução
# Note como a chamada está mais limpa, usando o sft_config_run1
trainer1 = SFTTrainer(
    model=model,
    train_dataset=formatted_train_dataset,
    peft_config=lora_config,
    args=sft_config_run1, # <<< PASSE O OBJETO SFTCONFIG AQUI
)


# Para continuar um treino interrompido:
# SE DESCONECTAR E VOCÊ QUISER CONTINUAR DE ONDE PAROU:
trainer1.train(resume_from_checkpoint=True)
# Inicia o treinamento
print("--- Iniciando Treinamento da Configuração 1 ---")
#trainer1.train()
print("--- Treinamento 1 Concluído ---")

# SE DESCONECTAR E VOCÊ QUISER CONTINUAR DE ONDE PAROU:
#50279e636348ed996531a29d7dccf837d45d0d2d

FAse 2 - 4.1 Criar Segunda Configuração - e treinar

In [None]:
import torch
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTConfig, SFTTrainer

# --- LIMPEZA DE MEMÓRIA ---
print("\n--- Liberando memória antes do Treinamento 2 ---")
# Deleta os objetos da memória para liberar VRAM
# (Se der erro de 'name not defined', apenas ignore e continue,
# o importante é o gc.collect e empty_cache)
try:
    del model
    del trainer1
except NameError:
    pass

gc.collect()
torch.cuda.empty_cache()


# --- TREINAMENTO 2 ---

print("\n--- Iniciando Carregamento para o Treinamento 2 ---")
# Recarrega o modelo base original para garantir um início limpo
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config, # bnb_config da célula anterior
    device_map="auto"
)

# Hiperparâmetros para o Treinamento 2 (com variação)
sft_config_run2 = SFTConfig(
    output_dir="/content/drive/MyDrive/mistral-7b-spider-run2", # <<< NOVO DIRETÓRIO DE SAÍDA
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_8bit",
    fp16=True,
    num_train_epochs=1,
    learning_rate=5e-5,  # <<< HIPERPARÂMETRO ALTERADO (ex: 2e-4 -> 5e-5)
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
    logging_steps=25,
    report_to="none",
    max_seq_length=1024,
    dataset_text_field="text",
    group_by_length=True,
)

# Trainer 2
trainer2 = SFTTrainer(
    model=model,
    #tokenizer=tokenizer, # tokenizer da célula anterior
    train_dataset=formatted_train_dataset, # dataset da célula anterior
    peft_config=lora_config, # lora_config da célula anterior
    args=sft_config_run2,
)

print("\n--- INICIANDO TREINAMENTO 2 ---")
trainer2.train(resume_from_checkpoint=True)
print("--- TREINAMENTO 2 CONCLUÍDO ---")

##Fase 3 - preparação de métricas e geração de acurácia

In [None]:
# SETUP COMPLETO (DEFINIÇÕES) - Amostra de 100

import torch
import gc
import json
import sqlite3
import pandas as pd
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
import re

print("--- Preparando Funções e Métricas para a Avaliação Final (100 exemplos) ---")

# --- CONFIGURAÇÕES E DADOS ---
drive_data_path = "/content/drive/MyDrive/spider_data/spider"
with open(f"{drive_data_path}/dev.json", "r") as f:
    dev_data = json.load(f)

# --- MÉTRICA PERSONALIZADA ---
class ExecutionAccuracy(BaseMetric):
    # (O código da classe da métrica continua exatamente o mesmo)
    def __init__(self, threshold: float = 1.0): self.threshold = threshold
    def measure(self, test_case: LLMTestCase) -> float:
        db_id = next((item['db_id'] for item in dev_data if item['question'] == test_case.input), None)
        if not db_id: return 0.0
        db_path = f"{drive_data_path}/database/{db_id}/{db_id}.sqlite"
        try:
            conn = sqlite3.connect(db_path)
            cursor = conn.cursor()
        except Exception: return 0.0
        try:
            cursor.execute(test_case.actual_output)
            predicted_result = cursor.fetchall()
        except Exception: predicted_result = []
        try:
            cursor.execute(test_case.expected_output)
            ground_truth_result = cursor.fetchall()
        except Exception:
            conn.close()
            return 0.0
        conn.close()
        if set(map(str, predicted_result)) == set(map(str, ground_truth_result)):
            self.success = True
            return 1.0
        else:
            self.success = False
            return 0.0
    def is_successful(self) -> bool: return getattr(self, "success", False)
    @property
    def __name__(self): return "Execution Accuracy"

# --- FUNÇÃO DE AVALIAÇÃO ---
def evaluate_model(model_name: str, base_model_id: str, adapter_path: str = None):
    print(f"\n--- Avaliando em 100 exemplos: {model_name} ---")
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
    current_device_map = {"": "cuda:0"} if adapter_path is None else "auto"
    model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map=current_device_map)
    if adapter_path:
        model = PeftModel.from_pretrained(model, adapter_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    prompt_template_fewshot = "Translate the following question to a SQL query.\\n\\n### Example 1\\nQuestion: How many heads of the departments are older than 56?\\nSQL: SELECT count(*) FROM head WHERE age > 56;\\n\\n### New Task\\nQuestion: {question}\\nSQL:"
    results = []

    # <<< AVALIAÇÃO EM 100 EXEMPLOS >>>
    for item in tqdm(dev_data[:100]):
        if adapter_path:
            prompt = f"Translate the following natural language question to a SQL query.\nQuestion: {item['question']}\nSQL:"
        else:
            prompt = prompt_template_fewshot.format(question=item["question"])
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.pad_token_id)
        full_generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_sql = ""
        answer_part = full_generated_text.split(prompt)[-1] if prompt in full_generated_text else full_generated_text
        match = re.search(r"SELECT\s+.*", answer_part, re.IGNORECASE | re.DOTALL)
        if match:
            generated_sql = match.group(0).split('\n')[0].strip().rstrip(';')
        results.append({"question": item["question"], "generated_sql": generated_sql, "expected_sql": item["query"]})

    execution_metric = ExecutionAccuracy()
    scores = []
    for r in results:
        test_case = LLMTestCase(input=r["question"], actual_output=r["generated_sql"], expected_output=r["expected_sql"])
        score = execution_metric.measure(test_case)
        scores.append(score)
    df = pd.DataFrame(results)
    df["score"] = scores
    output_filename = f"avaliacao_{model_name.replace(' ', '_')}.csv"
    destination_folder = "/content/drive/MyDrive/spider_data/custom_metrics" # Usando o caminho que você definiu
    !mkdir -p "{destination_folder}"
    final_path = f"{destination_folder}/{output_filename}"
    df.to_csv(final_path, index=False)
    print(f"✅ Resultados para '{model_name}' salvos em '{final_path}'")
    del model, tokenizer
    gc.collect()
    torch.cuda.empty_cache()

print("\n--- Funções prontas para a avaliação. ---")

--- Preparando Funções e Métricas para a Avaliação Final (100 exemplos) ---

--- Funções prontas para a avaliação. ---


#Fase 3 -  AVALIAR RUN 1

In [None]:
#Fase 3
# AVALIAR RUN 1
base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path_run1 = "/content/drive/MyDrive/mistral-7b-spider-run1/checkpoint-200"
evaluate_model("Fine-Tuned Run 1", base_model_id, adapter_path=adapter_path_run1)


--- Avaliando em 100 exemplos: Fine-Tuned Run 1 ---


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 100/100 [13:06<00:00,  7.87s/it]


✅ Resultados para 'Fine-Tuned Run 1' salvos em '/content/drive/MyDrive/spider_data/custom_metrics/avaliacao_Fine-Tuned_Run_1.csv'


#Fase 3 -  AVALIAR RUN 2

In [None]:
#Fase 3
# CÉLULA 3: AVALIAR RUN 2
base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path_run2 = "/content/drive/MyDrive/mistral-7b-spider-run2/checkpoint-875"
evaluate_model("Fine-Tuned Run 2", base_model_id, adapter_path=adapter_path_run2)


--- Avaliando em 100 exemplos: Fine-Tuned Run 2 ---


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 100/100 [07:23<00:00,  4.44s/it]


✅ Resultados para 'Fine-Tuned Run 2' salvos em '/content/drive/MyDrive/spider_data/custom_metrics/avaliacao_Fine-Tuned_Run_2.csv'


#Fase 3 -  AVALIAR BASILINE

In [None]:
#Fase 3
# CÉLULA 4: AVALIAR BASELINE
base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
evaluate_model("Baseline Model", base_model_id, adapter_path=None)


--- Avaliando em 100 exemplos: Baseline Model ---


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 100/100 [05:01<00:00,  3.02s/it]


✅ Resultados para 'Baseline Model' salvos em '/content/drive/MyDrive/spider_data/custom_metrics/avaliacao_Baseline_Model.csv'


#Fase 3 -  SUMARIZAR RESULTADOS E SALVAR

In [None]:
#Fase 3
# CÉLULA 5: SUMARIZAR RESULTADOS
import pandas as pd
import os

print("--- Gerando Tabela Resumo dos Resultados Finais da Fase 3 ---")
destination_folder = "/content/drive/MyDrive/spider_data/custom_metrics" # Usando o caminho que você definiu
model_results_files = {
    "Baseline Model": "avaliacao_Baseline_Model.csv",
    "Fine-Tuned Run 1": "avaliacao_Fine-Tuned_Run_1.csv",
    "Fine-Tuned Run 2": "avaliacao_Fine-Tuned_Run_2.csv"
}
summary_data = {}
for model_name, file_name in model_results_files.items():
    file_path = f"{destination_folder}/{file_name}"
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        execution_accuracy = df.get('score', pd.Series(dtype='float')).mean()
        summary_data[model_name] = {"Execution Accuracy": execution_accuracy}
    else:
        summary_data[model_name] = {"Execution Accuracy": "N/A"}

df_summary = pd.DataFrame.from_dict(summary_data, orient='index')
print("\n--- Tabela Comparativa de Acurácia de Execução (Fase 3) ---")
display(df_summary)

--- Gerando Tabela Resumo dos Resultados Finais da Fase 3 ---

--- Tabela Comparativa de Acurácia de Execução (Fase 3) ---


Unnamed: 0,Execution Accuracy
Baseline Model,0.09
Fine-Tuned Run 1,0.21
Fine-Tuned Run 2,0.22


#Fase 4: Análise Quantitativa de Regressão de Capacidade
O objetivo aqui é medir se, e o quanto, o processo de fine-tuning (especialização) prejudicou a capacidade do modelo de responder a perguntas de conhecimento geral. Faremos isso usando o benchmark MMLU (Massive Multitask Language Understanding)

Ação 1: Preparar o Dataset MMLU
Primeiro, precisamos carregar o dataset MMLU e criar a nossa suíte de avaliação customizada. O trabalho exige exatamente 150 questões, divididas em 3 categorias

In [None]:
# CÉLULA 1 (CORRIGIDA): Setup do Dataset MMLU

from datasets import load_dataset
import numpy as np

print("--- Carregando e preparando a suíte de avaliação MMLU (com categorias corrigidas) ---")

# Define as 3 subcategorias que vamos usar, com os nomes EXATOS da lista de erro
categories = {
    "STEM": "high_school_computer_science",  # <<< CORRIGIDO
    "Humanidades": "philosophy",                  # Este já estava correto
    "Ciências Sociais": "high_school_macroeconomics"   # <<< CORRIGIDO
}

# Dicionário para armazenar nossa suíte de avaliação final
mmlu_suite = {
    "STEM": None,
    "Humanidades": None,
    "Ciências Sociais": None
}

# Garante que os resultados sejam reprodutíveis
np.random.seed(42)

for general_category, hf_category in categories.items():
    # Carrega a subcategoria do MMLU do Hugging Face
    dataset = load_dataset("cais/mmlu", hf_category, split="test")

    # Seleciona 50 questões aleatórias e distintas da categoria
    indices = np.random.choice(len(dataset), size=50, replace=False)

    # Armazena as 50 questões selecionadas
    mmlu_suite[general_category] = dataset.select(indices)

    print(f"Selecionadas 50 questões de '{hf_category}' para a categoria '{general_category}'.")

print("\n--- Suíte MMLU com 150 questões pronta! ---")

--- Carregando e preparando a suíte de avaliação MMLU (com categorias corrigidas) ---


test-00000-of-00001.parquet:   0%|          | 0.00/27.3k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/6.54k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

Selecionadas 50 questões de 'high_school_computer_science' para a categoria 'STEM'.
Selecionadas 50 questões de 'philosophy' para a categoria 'Humanidades'.


test-00000-of-00001.parquet:   0%|          | 0.00/54.8k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/9.89k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/4.04k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/390 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/43 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

Selecionadas 50 questões de 'high_school_macroeconomics' para a categoria 'Ciências Sociais'.

--- Suíte MMLU com 150 questões pronta! ---


Ação 2: Implementar a Lógica de Avaliação 4-Shot
O trabalho exige que a avaliação seja feita no modo 4-shot. Isso significa que, para cada pergunta, daremos ao modelo 4 exemplos de perguntas e respostas da mesma categoria antes de fazer a pergunta final

#Fase 4 -  Lógica de Avaliação MMLU

In [None]:
# AÇÃO 2: Lógica de Avaliação MMLU

from tqdm import tqdm
import pandas as pd

def format_mmlu_prompt(sample, examples):
    """
    Cria um prompt 4-shot para uma amostra do MMLU.
    """
    prompt = "The following are multiple choice questions (with answers).\n\n"

    # Adiciona os 4 exemplos (shots)
    for ex in examples:
        prompt += f"Question: {ex['question']}\n"
        prompt += f"A. {ex['choices'][0]}\n"
        prompt += f"B. {ex['choices'][1]}\n"
        prompt += f"C. {ex['choices'][2]}\n"
        prompt += f"D. {ex['choices'][3]}\n"
        prompt += f"Answer: {['A', 'B', 'C', 'D'][ex['answer']]}\n\n"

    # Adiciona a pergunta final
    prompt += f"Question: {sample['question']}\n"
    prompt += f"A. {sample['choices'][0]}\n"
    prompt += f"B. {sample['choices'][1]}\n"
    prompt += f"C. {sample['choices'][2]}\n"
    prompt += f"D. {sample['choices'][3]}\n"
    prompt += "Answer:"

    return prompt

def evaluate_model_on_mmlu(model_name: str, model, tokenizer):
    """
    Avalia um modelo carregado na suíte MMLU.
    """
    print(f"\n--- AVALIANDO MMLU: {model_name} ---")

    category_results = {}

    for category_name, suite_dataset in mmlu_suite.items():

        # Seleciona 4 exemplos da própria suíte para usar no prompt (excluindo a pergunta atual)
        examples = suite_dataset.select(range(4))

        correct_predictions = 0

        for i in tqdm(range(len(suite_dataset)), desc=f"Avaliando {category_name}"):
            sample = suite_dataset[i]
            prompt = format_mmlu_prompt(sample, examples)

            inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
            outputs = model.generate(**inputs, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id)

            # Decodifica e pega apenas a primeira letra da resposta gerada
            prediction_text = tokenizer.decode(outputs[0][-1]).strip()

            # Compara a letra prevista com a resposta correta
            correct_answer_char = ['A', 'B', 'C', 'D'][sample['answer']]
            if prediction_text and prediction_text[0].upper() == correct_answer_char:
                correct_predictions += 1

        # Calcula a acurácia para a categoria
        accuracy = correct_predictions / len(suite_dataset)
        category_results[category_name] = accuracy
        print(f"Acurácia para '{category_name}': {accuracy:.2%}")

    return category_results

#FASE 4
Executar a Avaliação em Todos os Modelos
Finalmente, carregamos cada modelo (base, run1, run2) e rodamos a avaliação MMLU neles.

In [None]:
# ACÃO 3: Execução da Avaliação MMLU

import torch
import gc
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configurações
base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
adapter_path_run1 = "/content/drive/MyDrive/mistral-7b-spider-run1/checkpoint-400"
adapter_path_run2 = "/content/drive/MyDrive/mistral-7b-spider-run2/checkpoint-875"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

all_results = {}

# --- Avalia o Modelo Base ---
print("\nCarregando Modelo Base...")
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map={"":"cuda:0"})
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
all_results["Baseline"] = evaluate_model_on_mmlu("Baseline", model, tokenizer)
del model, tokenizer; gc.collect(); torch.cuda.empty_cache()

# --- Avalia o Run 1 ---
print("\nCarregando Modelo Run 1...")
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path_run1)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
all_results["Run 1"] = evaluate_model_on_mmlu("Run 1", model, tokenizer)
del model, tokenizer; gc.collect(); torch.cuda.empty_cache()

# --- Avalia o Run 2 ---
print("\nCarregando Modelo Run 2...")
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path_run2)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
all_results["Run 2"] = evaluate_model_on_mmlu("Run 2", model, tokenizer)
del model, tokenizer; gc.collect(); torch.cuda.empty_cache()

print("\n--- AVALIAÇÃO MMLU COMPLETA ---")


Carregando Modelo Base...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


--- AVALIANDO MMLU: Baseline ---


Avaliando STEM: 100%|██████████| 50/50 [02:51<00:00,  3.43s/it]


Acurácia para 'STEM': 0.00%


Avaliando Humanidades: 100%|██████████| 50/50 [02:00<00:00,  2.41s/it]


Acurácia para 'Humanidades': 0.00%


Avaliando Ciências Sociais: 100%|██████████| 50/50 [02:53<00:00,  3.46s/it]


Acurácia para 'Ciências Sociais': 2.00%

Carregando Modelo Run 1...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


--- AVALIANDO MMLU: Run 1 ---


Avaliando STEM: 100%|██████████| 50/50 [02:43<00:00,  3.27s/it]


Acurácia para 'STEM': 0.00%


Avaliando Humanidades: 100%|██████████| 50/50 [02:02<00:00,  2.46s/it]


Acurácia para 'Humanidades': 0.00%


Avaliando Ciências Sociais: 100%|██████████| 50/50 [02:52<00:00,  3.46s/it]


Acurácia para 'Ciências Sociais': 0.00%

Carregando Modelo Run 2...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


--- AVALIANDO MMLU: Run 2 ---


Avaliando STEM: 100%|██████████| 50/50 [02:52<00:00,  3.45s/it]


Acurácia para 'STEM': 0.00%


Avaliando Humanidades: 100%|██████████| 50/50 [02:08<00:00,  2.57s/it]


Acurácia para 'Humanidades': 0.00%


Avaliando Ciências Sociais: 100%|██████████| 50/50 [03:01<00:00,  3.64s/it]


Acurácia para 'Ciências Sociais': 0.00%

--- AVALIAÇÃO MMLU COMPLETA ---


#FASE 4: Analisar a Regressão
Calcular a variação percentual.

In [None]:
# AÇÃO 4 (CORRIGIDA): Cálculo e SALVAMENTO da Análise de Regressão

print("\n--- Análise de Regressão de Capacidade ---")

# Converte os resultados para um DataFrame para fácil visualização
df_results = pd.DataFrame(all_results).T
df_results['Acurácia Agregada'] = df_results.mean(axis=1)

print("Resultados da Avaliação MMLU:")
print(df_results)

# Calcula a variação percentual em relação ao Baseline
baseline_agg_accuracy = df_results.loc['Baseline', 'Acurácia Agregada']

for model_name in ['Run 1', 'Run 2']:
    if model_name in df_results.index:
        model_agg_accuracy = df_results.loc[model_name, 'Acurácia Agregada']
        percentage_change = ((model_agg_accuracy - baseline_agg_accuracy) / baseline_agg_accuracy) * 100
        print(f"\nVariação de Acurácia Agregada para {model_name}: {percentage_change:.2f}%")

        # Análise por categoria
        for category in mmlu_suite.keys():
            baseline_cat_acc = df_results.loc['Baseline', category]
            model_cat_acc = df_results.loc[model_name, category]
            cat_change = ((model_cat_acc - baseline_cat_acc) / baseline_cat_acc) * 100 if baseline_cat_acc > 0 else 0
            print(f"  - Variação em '{category}': {cat_change:.2f}%")

# <<< ADIÇÃO PARA SALVAR OS RESULTADOS >>>
output_filename_mmlu = "avaliacao_mmlu_resultados.csv"
df_results.to_csv(output_filename_mmlu)

# Move o arquivo para a mesma pasta dos outros resultados no Drive
destination_folder = "/content/drive/MyDrive/spider_data/custom_metrics"
!mkdir -p "{destination_folder}"
!mv {output_filename_mmlu} "{destination_folder}"

print(f"\n✅ Resultados da avaliação MMLU salvos com sucesso em '{destination_folder}/{output_filename_mmlu}'")


--- Análise de Regressão de Capacidade ---
Resultados da Avaliação MMLU:
          STEM  Humanidades  Ciências Sociais  Acurácia Agregada
Baseline   0.0          0.0              0.02           0.006667
Run 1      0.0          0.0              0.00           0.000000
Run 2      0.0          0.0              0.00           0.000000

Variação de Acurácia Agregada para Run 1: -100.00%
  - Variação em 'STEM': 0.00%
  - Variação em 'Humanidades': 0.00%
  - Variação em 'Ciências Sociais': -100.00%

Variação de Acurácia Agregada para Run 2: -100.00%
  - Variação em 'STEM': 0.00%
  - Variação em 'Humanidades': 0.00%
  - Variação em 'Ciências Sociais': -100.00%

✅ Resultados da avaliação MMLU salvos com sucesso em '/content/drive/MyDrive/spider_data/custom_metrics/avaliacao_mmlu_resultados.csv'
