### ⚠ IMPORTANT ⚠

Running this notebook requires:

**A100 GPU**

Evaluating and fine-tuning LLMs is a resource-intensive process.


In [1]:
from huggingface_hub import notebook_login

notebook_login()


# Benchmark Evaluation of a Foundation Model

In this notebook we load a foundation model and run a few tests on it.

These tests are some of the **benchmarks** commonly used to compare the capabilities of LLMs.

We will see what results a model produces without any instruction tuning.

## Evaluating a Base Model

To understand how our model improves, we first need a baseline evaluation of its performance.

We will use the Mistral AI model `Mistral-7B`.


### Load the dependencies and the model (quantized to 4 bits)


In [2]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl


In [3]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

BitsAndBytes config => 4-bit quantization

[BitsAndBytesConfig](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig)

In [4]:
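bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization levels
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications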
)
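
To make the idea of 4-bit quantization concrete, here is a minimal sketch of block-wise absmax quantization in plain Python. It is a simplified stand-in and not the NF4 scheme selected in the config: NF4 uses a non-uniform set of 16 levels, whereas this sketch uses evenly spaced signed integers. The function names `quantize_block` and `dequantize_block` and the example weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes that share one scale (absmax scheme)."""
    levels = 2 ** (bits - 1) - 1                  # 7 for symmetric 4-bit codes
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07]               # made-up example weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each block stores small integer codes plus one scale, and the reconstruction error stays within half a quantization step.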

[Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3)

In [5]:
model_id = "mistralai/Mistral-7B-v0.3"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


In [6]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mist
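
The layer shapes printed above are enough to estimate the model's size. The sketch below tallies the parameters from those shapes. Because the printed output is cut off after the final norm, the output head of shape 32768 x 4096 is an assumption (an untied `lm_head` mirroring the embedding). It then estimates the storage of the `Linear4bit` weights at 4 bits each, ignoring the quantization constants.

```python
HIDDEN, KV_DIM, MLP_DIM, VOCAB, N_LAYERS = 4096, 1024, 14336, 32768, 32

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
layer_norms = 2 * HIDDEN                                # two RMSNorm weight vectors per layer

linear4bit_params = N_LAYERS * (attention + mlp)
embedding = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN        # ASSUMPTION: not visible in the truncated printout
final_norm = HIDDEN

total_params = (
    linear4bit_params + N_LAYERS * layer_norms + embedding + lm_head + final_norm
)

fp16_gib = total_params * 2 / 2**30             # 2 bytes per parameter
four_bit_gib = linear4bit_params * 0.5 / 2**30  # half a byte per quantized weight

print(f"total parameters: {total_params:,}")
print(f"all weights in fp16: {fp16_gib:.1f} GiB")
print(f"Linear4bit weights at 4 bits: {four_bit_gib:.2f} GiB")
```

Under these assumptions the quantized linear layers need roughly a quarter of the memory they would take in 16-bit precision.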

### Eleuther AI Evaluation Harness

Now that our base model is loaded, we need to evaluate it.

For this we will use the [Eleuther AI Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a specialized tool for running benchmarks on a variety of language tasks.

Why does it matter so much? It powers the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)!


In [7]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -e .

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 53159, done.
remote: Counting objects: 100% (585/585), done.
remote: Compressing objects: 100% (366/366), done.
remote: Total 53159 (delta 363), reused 219 (delta 219), pack-reused 52574 (from 2)
Receiving objects: 100% (53159/53159), 30.75 MiB | 13.79 MiB/s, done.
Resolving deltas: 100% (36854/36854), done.
/content/lm-evaluation-harness
Obtaining file:///content/lm-evaluation-harness
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.8)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jsonlines (from lm_eval==0.4.8)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.8)
  Downloadi

In [8]:
import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.models.huggingface import HFLM
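# Wrap the already-loaded model and tokenizer so the harness can query them
eval_model = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=4)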



In [9]:
task_manager = TaskManager()

The model and the evaluation process are now ready.

> Note: This step can take around 30-40 minutes on an A100 GPU; in the run logged below, the log-likelihood requests alone took about 48 minutes.

We will evaluate three benchmarks:

- [HellaSwag](https://rowanzellers.com/hellaswag/)
- [ARC Easy](https://leaderboard.allenai.org/arc_easy/submissions/get-started)
- A subset of the [MMLU benchmark](https://paperswithcode.com/dataset/mmlu), on ML tasks

These are lightweight benchmarks used to "score" models against each other on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

We will take a simple mean of their scores as the "global" score of the base model.
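
As a rough illustration of that "global" score, the sketch below averages one accuracy value per task. Which metric to average is not specified here, so using `acc,none` for every task and picking the five-shot MMLU run are assumptions. The numbers are copied from the result tables later in this notebook, and `global_score` is a hypothetical helper.

```python
from statistics import mean

# Shape mirrors results["results"]: {task_name: {metric_name: value}}
scores = {
    "arc_easy": {"acc,none": 0.791246},
    "hellaswag": {"acc,none": 0.605457},
    "mmlu_flan_n_shot_loglikelihood_machine_learning": {"acc,none": 0.392857},
}


def global_score(task_results, metric="acc,none"):
    """Unweighted mean of one metric across all tasks."""
    return mean(task[metric] for task in task_results.values())


overall = global_score(scores)
print(f"global score: {overall:.4f}")
```

With the real outputs, the dictionaries `results["results"]` and `fs_mmlu_results["results"]` could be merged before calling the helper.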

You can easily add more evaluation tasks to `tasks`.


In [None]:
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["hellaswag", "arc_easy"],
    task_manager=task_manager,
    num_fewshot=0,
    batch_size=16,
)


100%|██████████| 2376/2376 [00:02<00:00, 1120.43it/s]
100%|██████████| 10042/10042 [00:03<00:00, 2567.96it/s]
Running loglikelihood requests: 100%|██████████| 49669/49669 [48:29<00:00, 17.07it/s]
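
Each log-likelihood request logged above scores one candidate answer, which is why there are far more requests (49669) than questions (2376 + 10042). The results that follow report both `acc` and `acc_norm`: the first picks the answer with the highest log-likelihood, the second divides each log-likelihood by the answer's length before comparing, so that long answers are not penalized. The sketch below shows the difference with made-up log-likelihoods. Normalizing by character count is an assumption about the harness's behavior, and `pick_answer` is a hypothetical helper.

```python
def pick_answer(choices, loglikelihoods, normalize=False):
    """Return the index of the best-scoring choice."""
    scores = [
        ll / len(text) if normalize else ll
        for text, ll in zip(choices, loglikelihoods)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up example: the longer answer has a lower total log-likelihood
choices = ["sinks", "floats on the surface because it is less dense"]
loglikelihoods = [-6.0, -14.0]

raw_choice = pick_answer(choices, loglikelihoods)
norm_choice = pick_answer(choices, loglikelihoods, normalize=True)
print(choices[raw_choice], "|", choices[norm_choice])
```

A length effect of this kind is consistent with HellaSwag scoring 0.605 on `acc` but 0.801 on `acc_norm` in the results that follow.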


In [None]:
import pandas as pd

pd.DataFrame(results["results"])

| | arc_easy | hellaswag |
|---|---|---|
| alias | arc_easy | hellaswag |
| acc,none | 0.791246 | 0.605457 |
| acc_stderr,none | 0.00834 | 0.004878 |
| acc_norm,none | 0.77862 | 0.800737 |
| acc_norm_stderr,none | 0.008519 | 0.003986 |


In [None]:
fs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    task_manager=task_manager,
    num_fewshot=5,
    batch_size=16,
)


100%|██████████| 112/112 [00:00<00:00, 131.27it/s]
Running loglikelihood requests: 100%|██████████| 448/448 [02:15<00:00,  3.32it/s]


In [None]:
import pandas as pd

pd.DataFrame(fs_mmlu_results["results"])

| | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| alias | mmlu_flan_n_shot_loglikelihood_machine_learning |
| acc,none | 0.392857 |
| acc_stderr,none | 0.046356 |


### Zero-Shot MMLU

In [None]:
zs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    task_manager=task_manager,
    num_fewshot=0,
    batch_size=16,
)

100%|██████████| 112/112 [00:00<00:00, 639.71it/s]
Running loglikelihood requests: 100%|██████████| 448/448 [00:31<00:00, 14.39it/s]
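
Both MMLU runs score only 112 questions (448 requests, four options each), so the standard errors reported with them are large. The sketch below recomputes the standard error from the accuracy and the sample size, using the sample-variance form sqrt(p(1-p)/(n-1)), and compares the gap between the five-shot accuracy above (0.392857) and the zero-shot accuracy reported just below (0.419643). Treating the two runs as independent is a simplifying assumption, since they share the same questions; `accuracy_stderr` is a hypothetical helper.

```python
import math


def accuracy_stderr(acc, n):
    """Standard error of a mean of n 0/1 outcomes (sample variance, n - 1)."""
    return math.sqrt(acc * (1 - acc) / (n - 1))


n_questions = 112
five_shot, zero_shot = 0.392857, 0.419643

se_five = accuracy_stderr(five_shot, n_questions)
se_zero = accuracy_stderr(zero_shot, n_questions)
gap = zero_shot - five_shot
gap_se = math.sqrt(se_five**2 + se_zero**2)   # assumes independent runs

print(f"five-shot stderr: {se_five:.6f}")
print(f"zero-shot stderr: {se_zero:.6f}")
print(f"gap: {gap:.4f} vs. combined stderr: {gap_se:.4f}")
```

The recomputed values agree with the `acc_stderr` rows, and the gap is well under one combined standard error, so this run alone does not show that zero-shot prompting beats five-shot prompting on this subset.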


In [None]:
import pandas as pd

pd.DataFrame(zs_mmlu_results["results"])

| | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| alias | mmlu_flan_n_shot_loglikelihood_machine_learning |
| acc,none | 0.419643 |
| acc_stderr,none | 0.046841 |


### Chain of Thought

In [None]:
cot_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_cot_zeroshot_machine_learning"],
    task_manager=task_manager,
    num_fewshot=0,
    batch_size=16,
)

100%|██████████| 11/11 [00:00<00:00, 632.92it/s]
Running generate_until requests: 100%|██████████| 11/11 [01:48<00:00,  9.91s/it]
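
Unlike the log-likelihood tasks, this run generates free text (`generate_until`) for 11 examples and then has to extract an answer letter from it. The results below report two extraction filters, `strict-match` and `flexible-extract`. The sketch below illustrates the general idea with two hypothetical regular expressions; they are not the patterns the harness task actually uses, and the generated text is made up.

```python
import re

# HYPOTHETICAL patterns for illustration only
STRICT = re.compile(r"The answer is \(([A-D])\)")   # requires a fixed closing phrase
FLEXIBLE = re.compile(r"\(([A-D])\)")               # accepts any bracketed option letter


def strict_match(text):
    """Return the letter only if the fixed phrase is present."""
    found = STRICT.search(text)
    return found.group(1) if found else None


def flexible_extract(text):
    """Return the last bracketed option letter mentioned, if any."""
    found = FLEXIBLE.findall(text)
    return found[-1] if found else None


generation = (
    "Let's think step by step. The model memorizes the training set, "
    "so (B) is the best choice."
)
print(strict_match(generation), flexible_extract(generation))
```

A base model without instruction tuning rarely ends with a fixed phrase, which is consistent with a strict-match score of 0.0 and one answer out of 11 (0.090909) matching under the flexible filter.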


In [None]:
import pandas as pd

pd.DataFrame(cot_mmlu_results["results"])

| | mmlu_flan_cot_zeroshot_machine_learning |
|---|---|
| alias | mmlu_flan_cot_zeroshot_machine_learning |
| exact_match,strict-match | 0.0 |
| exact_match_stderr,strict-match | 0.0 |
| exact_match,flexible-extract | 0.090909 |
| exact_match_stderr,flexible-extract | 0.090909 |
