# Experimental Objective

For 500 samples, we measure:

- Accuracy
- Mean latency
- Tokens processed
- Gain per unit of computational cost

The gain per unit of computational cost is defined as:

$$
\text{Efficiency Gain} = \frac{\Delta \text{Accuracy}}{\Delta \text{Tokens}}
$$

In other words, it measures how much accuracy is gained per additional input token processed.
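
As a quick illustration of this definition, here is a minimal sketch with made-up numbers. The function name `efficiency_gain` and the input values are illustrative assumptions, not results from this experiment.

```python
def efficiency_gain(acc_base, acc_new, tokens_base, tokens_new):
    """Accuracy gained per additional token processed."""
    delta_tokens = tokens_new - tokens_base
    if delta_tokens == 0:
        raise ValueError("token counts are equal; the ratio is undefined")
    return (acc_new - acc_base) / delta_tokens

# Illustrative values only: accuracy rises from 0.80 to 0.85
# while the average prompt grows from 100 to 200 tokens.
gain = efficiency_gain(0.80, 0.85, 100, 200)
print(f"{gain:.6f} accuracy per extra token")
print(f"{gain * 100 * 100:.2f} percentage points per 100 extra tokens")
```

Because the raw ratio is usually a very small number, rescaling it to percentage points per 100 tokens tends to be easier to read.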

In [1]:
import torch
import time
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
from tqdm import tqdm

# import torch
# torch.cuda.empty_cache()

## 1. Loading the Model

In [2]:
model_name = "HuggingFaceTB/SmolLM3-3B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    low_cpu_mem_usage=True,
    dtype=torch.float16, 
)

model.eval()


SmolLM3ForCausalLM(
  (model): SmolLM3Model(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128004)
    (layers): ModuleList(
      (0-35): 36 x SmolLM3DecoderLayer(
        (self_attn): SmolLM3Attention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): SmolLM3MLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): SmolLM3RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): SmolLM3RMSNorm((2048,), eps=1e-06)
 

## 2. Dataset 

In [3]:
dataset = load_dataset("celsowm/imdb-reviews-pt-br")
test_data = dataset["train"].shuffle(seed=42).select(range(500))

test_data

Dataset({
    features: ['id', 'texto', 'sentimento'],
    num_rows: 500
})

In [4]:
from collections import Counter

print(Counter(test_data["sentimento"]))

Counter({1: 253, 0: 247})
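
Since the two classes are nearly balanced, a classifier that always predicts the majority class gives a useful floor for the accuracies reported later. A minimal sketch, with the counts copied from the output above (the variable names are illustrative):

```python
from collections import Counter

# Label counts copied from the cell output above (1 = positive, 0 = negative).
label_counts = Counter({1: 253, 0: 247})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline = majority_count / total

print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```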


## 3. Classification function + metrics

In [5]:
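def classify_and_measure(text, repeated=False):
    # The prompt stays in Portuguese because the dataset and the expected labels are in Portuguese.
    base = f"""Classifique o sentimento como positivo ou negativo.
Texto: {text}"""

    # The repeated variant states the full instruction and text twice before asking for the answer.
    if repeated:
        prompt = base + "\n\n" + base + "\nSentimento:"
    else:
        prompt = base + "\nSentimento:"

    # Send the inputs to whichever device device_map="auto" placed the model on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_tokens = inputs["input_ids"].shape[1]

    start = time.perf_counter()

    with torch.no_grad():
        # Greedy decoding: with do_sample=False, temperature is ignored, so it is not passed.
        outputs = model.generate(
            **inputs,
            max_new_tokens=3,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

    latency = time.perf_counter() - start

    # Decode only the newly generated tokens, i.e. the text that follows "Sentimento:".
    new_tokens = outputs[0][input_tokens:]
    prediction = tokenizer.decode(new_tokens, skip_special_tokens=True).strip().lower()

    return prediction, latency, input_tokens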
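
To see exactly what the model receives in each condition, the prompt construction can be reproduced without loading the model. This is a minimal sketch: `build_prompt` is a hypothetical helper that mirrors the template in `classify_and_measure`, and the sample review is made up.

```python
def build_prompt(text, repeated=False):
    # Same template as classify_and_measure, without the model call.
    base = f"Classifique o sentimento como positivo ou negativo.\nTexto: {text}"
    if repeated:
        return base + "\n\n" + base + "\nSentimento:"
    return base + "\nSentimento:"

sample = "Um filme excelente."  # hypothetical review text
normal_prompt = build_prompt(sample)
repeated_prompt = build_prompt(sample, repeated=True)

print(repeated_prompt)
print(len(normal_prompt), len(repeated_prompt))
```

The repeated prompt contains the instruction and the review twice but asks for the answer only once, which is why its token count is close to double that of the normal prompt.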

## 4. Experimental Loop

In [6]:
correct_normal = 0
correct_repeated = 0

latencies_normal = []
latencies_repeated = []

tokens_normal = []
tokens_repeated = []

results = []

for example in tqdm(test_data):
    
    text = example["texto"]
    label = "positivo" if example["sentimento"] == 1 else "negativo"
    
    # Normal
    pred_n, lat_n, tok_n = classify_and_measure(text, repeated=False)
    
    # Repeated
    pred_r, lat_r, tok_r = classify_and_measure(text, repeated=True)
    
    # Store results
    results.append([label, pred_n, pred_r])

    if label in pred_n:
        correct_normal += 1
        
    if label in pred_r:
        correct_repeated += 1
    
    latencies_normal.append(lat_n)
    latencies_repeated.append(lat_r)
    
    tokens_normal.append(tok_n)
    tokens_repeated.append(tok_r)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
100%|██████████| 500/500 [05:36<00:00,  1.49it/s]
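
The loop counts a prediction as correct whenever the label string occurs anywhere in the generated text, so an output that contains both words would count as correct for either label. A stricter parser could be sketched as follows. `parse_label` is a hypothetical helper and is not part of the experiment above.

```python
def parse_label(prediction):
    """Map generated text to 'positivo', 'negativo', or None when ambiguous."""
    text = prediction.strip().lower()
    has_pos = "positivo" in text
    has_neg = "negativo" in text
    if has_pos == has_neg:
        # Either both words or neither word appears: no clear label.
        return None
    return "positivo" if has_pos else "negativo"

for raw in [" Positivo.", "negativo\n", "positivo ou negativo", "neutro"]:
    print(repr(raw), "->", parse_label(raw))
```

With this helper, correctness would be checked as `parse_label(pred) == label`, and `None` results can be counted separately as unparseable outputs.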


## 5. Results

In [None]:
n_samples = len(results)  # 500 in this run

acc_normal = correct_normal / n_samples
acc_repeated = correct_repeated / n_samples

avg_latency_normal = np.mean(latencies_normal)
avg_latency_repeated = np.mean(latencies_repeated)

avg_tokens_normal = np.mean(tokens_normal)
avg_tokens_repeated = np.mean(tokens_repeated)

delta_acc = acc_repeated - acc_normal
delta_tokens = avg_tokens_repeated - avg_tokens_normal

# Accuracy gained per additional input token
efficiency_gain = delta_acc / delta_tokens

print("===== RESULTS =====")
print("Accuracy Normal:", acc_normal)
print("Accuracy Repeated:", acc_repeated)
print("Δ Accuracy:", delta_acc)

print("\nMean Latency Normal (s):", avg_latency_normal)
print("Mean Latency Repeated (s):", avg_latency_repeated)

print("\nMean Tokens Normal:", avg_tokens_normal)
print("Mean Tokens Repeated:", avg_tokens_repeated)

print("\nGain per Token:", efficiency_gain)


===== RESULTS =====
Accuracy Normal: 0.8
Accuracy Repeated: 0.882
Δ Accuracy: 0.08199999999999996

Mean Latency Normal (s): 0.262154661655426
Mean Latency Repeated (s): 0.40699334812164306

Mean Tokens Normal: 366.404
Mean Tokens Repeated: 729.804

Gain per Token: 0.00022564667033571811
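
The per-token gain is hard to read on its own, so it can help to express the same results as ratios. A minimal sketch, with the values copied from the output above (the variable names are illustrative):

```python
# Values copied from the results printed above.
acc_normal, acc_repeated = 0.8, 0.882
latency_normal, latency_repeated = 0.262154661655426, 0.40699334812164306
tokens_normal, tokens_repeated = 366.404, 729.804

token_ratio = tokens_repeated / tokens_normal
latency_ratio = latency_repeated / latency_normal

# Accuracy gain rescaled to percentage points per 100 additional tokens.
gain_pp_per_100_tokens = (
    (acc_repeated - acc_normal) / (tokens_repeated - tokens_normal) * 100 * 100
)

print(f"token ratio:   {token_ratio:.2f}x")
print(f"latency ratio: {latency_ratio:.2f}x")
print(f"gain: {gain_pp_per_100_tokens:.2f} percentage points per 100 extra tokens")
```

In this run the input roughly doubles in length while the mean latency grows by a smaller factor.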


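# McNemar's Test

McNemar's test checks whether the difference between two classifiers evaluated on the same samples is statistically significant. Here the two classifiers are the same model prompted with the normal and the repeated prompt.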

## What we need

For each sample, we need to know whether each variant classified it correctly:

| Case | Normal | Repeated |
| ---- | ------ | -------- |
| A    | ✔️     | ✔️       |
| B    | ✔️     | ❌       |
| C    | ❌     | ✔️       |
| D    | ❌     | ❌       |


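The test uses only the discordant pairs:

* b = Normal correct, Repeated wrong
* c = Normal wrong, Repeated correct

Test statistic (with continuity correction), which under the null hypothesis approximately follows a chi-squared distribution with 1 degree of freedom: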

$$
\chi^2 = \frac{(|b - c| - 1)^2}{b + c}
$$
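
The arithmetic behind this statistic can be sketched by hand using only the standard library. `mcnemar_corrected` is a hypothetical helper and the counts are made up for illustration. The p-value relies on the fact that, for 1 degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math

def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and its chi-squared (1 df) p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical discordant counts, chosen only to illustrate the arithmetic.
stat, p = mcnemar_corrected(b=10, c=25)
print(f"statistic = {stat:.3f}, p-value = {p:.4f}")
```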


## Interpretation

By common convention:

* p < 0.05 → significant difference
* p < 0.01 → strongly significant
* p < 0.001 → very strongly significant

In [8]:
from statsmodels.stats.contingency_tables import mcnemar

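# Build the 2x2 contingency table over the paired predictions
a = 0  # both correct
b = 0  # normal correct, repeated wrong
c = 0  # normal wrong, repeated correct
d = 0  # both wrong

for label, pred_n, pred_r in results:
    normal_correct = label in pred_n
    repeated_correct = label in pred_r

    if normal_correct and repeated_correct:
        a += 1
    elif normal_correct:
        b += 1
    elif repeated_correct:
        c += 1
    else:
        d += 1

# McNemar's statistic depends only on the discordant cells b and c;
# a and d are filled in so the table is a complete contingency table.
table = [[a, b],
         [c, d]]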

result = mcnemar(table, exact=False, correction=True)

print("b:", b)
print("c:", c)
print("statistic:", result.statistic)
print("p-value:", result.pvalue)

b: 47
c: 88
statistic: 11.851851851851851
p-value: 0.0005760403386062825
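
As a cross-check on the chi-squared approximation, the exact version of the test treats `min(b, c)` as a draw from a Binomial(b + c, 0.5) distribution. The sketch below computes that two-sided p-value with the standard library; `mcnemar_exact_p` is a hypothetical helper, and calling statsmodels' `mcnemar` with `exact=True` is the library route to the same kind of test.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value under a Binomial(b + c, 0.5) null."""
    n = b + c
    k = min(b, c)
    # Probability of observing k or fewer discordant pairs in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Discordant counts copied from the output above.
p_exact = mcnemar_exact_p(b=47, c=88)
print("exact p-value:", p_exact)
```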
