## Master's in Applied Artificial Intelligence (MNA)
### Capstone Project
### Dra. Grettel Barceló Alonso / Dr. Carlos Alberto Villaseñor Padilla
### Progress Report 4. Alternative Models

### Team Members
- A01794457 - Iossif Moises Palli Laura
- A01793984 - Brenda Zurazy Rodríguez Pérez
- A01794630 - Jesús Ramseths Echeverría Rivera

In [None]:
!pip install datasets peft bitsandbytes
!pip install -U bitsandbytes
!pip install bert-score

In [16]:
# Libraries used
import pandas as pd
from datasets import load_dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from transformers import Trainer
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import LoraConfig, get_peft_model, PeftModel
import torch
from transformers import BitsAndBytesConfig
from torch.utils.data import DataLoader
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu
from bert_score import score
import numpy as np

First, the **quantization settings** for the LLM are defined with `BitsAndBytesConfig`.

Quantization stores the model weights at lower numerical precision (8-bit integers here). This reduces the model's memory footprint and can make inference more efficient, typically without a significant loss in quality.
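
To make the idea concrete, here is a minimal sketch of absmax 8-bit quantization on a handful of weights. It is a simplified illustration and not the `bitsandbytes` implementation; the helper names `quantize_absmax` and `dequantize` and the sample weights are assumptions made for this example.

```python
def quantize_absmax(weights):
    """Map float weights to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


# Made-up weights, used only for illustration
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# Each value now fits in one byte instead of four (float32),
# at the cost of a rounding error of at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(max_error)
```

The rounding error is bounded by `scale / 2`, which is why quality usually degrades only slightly when weights are well distributed around zero.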

In [2]:
# Model quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

A session is opened on the Hugging Face Hub using an authentication token.

In [5]:
# Log in to the Hugging Face Hub
import os
from huggingface_hub import login

# Read the access token from an environment variable (assumed here to be HF_TOKEN)
# instead of hard-coding it in the notebook
login(token=os.environ['HF_TOKEN'])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Model (Llama-3.2-1B-Fine-Tuning)

In [4]:
# Load the Llama model
MODEL_NAME = 'meta-llama/Llama-3.2-1B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=quantization_config, device_map='auto')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Next, **LoRA (Low-Rank Adaptation)** is configured and applied to the model.

LoRA speeds up the training of large models while using less memory: instead of updating all the millions (or even billions) of parameters of a model, it freezes the original weights and trains only a small set of added low-rank matrices, which saves compute and time.

**Reference:**

Low-Rank Adaptation of Large Language Models (LoRA). (n.d.). https://huggingface.co/docs/diffusers/v0.21.0/training/lora#lowrank-adaptation-of-large-language-models-lora
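
As a rough illustration of the savings, the sketch below counts trainable parameters for a single projection matrix. In LoRA the adapted weight is W' = W + (alpha / r) · B · A, where W stays frozen and only the factors A (r × d_in) and B (d_out × r) are trained. The 2048 × 2048 shape and the helper name `lora_trainable_params` are assumptions chosen for the example.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


# Assumed shape of one attention projection and a LoRA rank of 8
d_in, d_out, r = 2048, 2048, 8

full = d_in * d_out                           # parameters in the frozen weight matrix
lora = lora_trainable_params(d_in, d_out, r)  # trainable parameters added by LoRA

print(full)         # → 4194304
print(lora)         # → 32768
print(lora / full)  # → 0.0078125
```

Under these assumptions, less than 1% of the parameters of that matrix are trained.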


In [5]:
# LoRA (Low-Rank Adaptation) configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Modules to which LoRA is applied
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, peft_config)

In [6]:
tokenizer.pad_token = tokenizer.eos_token

The question-and-answer dataset is loaded.

In [11]:
# Load the data
dataset = load_dataset('csv', data_files='q_a_db.csv')

The data is split into training and test sets.

In [12]:
split_dataset = dataset['train'].train_test_split(test_size=0.2)

# Assign each partition
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

In [13]:
print('Training set size:', train_dataset.shape[0])
print('Test set size:', test_dataset.shape[0])

Training set size: 914
Test set size: 229


A tokenization function named **tokenize_function** is defined to convert the text into token sequences that the model can process.

The function concatenates each question and answer from the dataset, tokenizes the result, and copies the token ids as labels. It then measures the number of tokens that belong to the prompt, to tell question tokens apart from answer tokens, and finally masks the prompt tokens so they do not contribute to the loss.
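
The masking step can be sketched on its own with plain lists. The token ids below are made up and the helper name `mask_prompt_labels` is hypothetical; the value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels and hide the first prompt_length positions."""
    labels = list(input_ids)
    n = min(prompt_length, len(labels))  # guard against prompts longer than the sequence
    labels[:n] = [IGNORE_INDEX] * n
    return labels


# Made-up token ids: the first three belong to the question, the rest to the answer
input_ids = [101, 7, 42, 900, 901, 902]
labels = mask_prompt_labels(input_ids, prompt_length=3)
print(labels)  # → [-100, -100, -100, 900, 901, 902]
```

With labels built this way, only the answer tokens are scored during training.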

In [3]:
# Tokenization function: builds input ids and labels for each question/answer pair
def tokenize_function(example):
    # Concatenate prompt and answer
    full_text = example['question'] + example['answer']

    # Tokenize, truncating or padding to a fixed length
    tokenized_example = tokenizer(
        full_text,
        truncation=True,
        padding='max_length',
        max_length=500,
    )

    # Create the labels as a copy of the input ids
    labels = tokenized_example['input_ids'].copy()

    # Compute the prompt length with the same special-token settings as the
    # full text, so the leading special token is counted in both cases
    prompt_length = len(tokenizer(example['question'])['input_ids'])
    prompt_length = min(prompt_length, len(labels))

    # Mask the prompt tokens so the loss is computed only on the answer
    labels[:prompt_length] = [-100] * prompt_length

    tokenized_example['labels'] = labels
    return tokenized_example

In [10]:
train_tokenized_dataset = train_dataset.map(tokenize_function, batched=False)
test_tokenized_dataset = test_dataset.map(tokenize_function, batched=False)



The original dataset is thus transformed into a new dataset in which every record is tokenized and ready for training. The training arguments are defined next.

In [11]:
training_args = TrainingArguments(
    output_dir='./resultado_lora',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    num_train_epochs=10,
    learning_rate=3e-4,
    fp16=True,
    logging_steps=10,
    save_steps=1000,
    save_total_limit=2,
)

In [12]:
# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
)

trainer.train()



| Step | Training Loss |
|---|---|
| 10 | 3.98 |
| 20 | 0.1326 |
| 30 | 0.1095 |
| 40 | 0.1029 |
| 50 | 0.0972 |
| 60 | 0.0923 |
| 70 | 0.09 |
| 80 | 0.0879 |
| 90 | 0.0844 |
| 100 | 0.0841 |


TrainOutput(global_step=140, training_loss=0.37020573232855114, metrics={'train_runtime': 1158.9413, 'train_samples_per_second': 7.887, 'train_steps_per_second': 0.121, 'total_flos': 2.612847249408e+16, 'train_loss': 0.37020573232855114, 'epoch': 9.781659388646288})

The table above shows how the loss decreases during training: it drops sharply from 3.98 at step 10 to 0.1326 at step 20 and then declines gradually, reaching 0.0841 at step 100.

This indicates that the **model fits the training data well**. Because no validation set is evaluated during training, generalization is assessed separately in the evaluation sections that follow.
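
The 140 optimizer steps and the final epoch value of about 9.78 reported by `TrainOutput` can be reproduced from the training arguments. The sketch below assumes a single device and that only complete gradient-accumulation cycles count as update steps within an epoch; that rule is an assumption that happens to match the reported numbers, not a statement about the `Trainer` internals.

```python
import math

train_samples = 914              # training set size printed earlier
per_device_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 10

# Samples contributing to each optimizer update (single device assumed)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Batches per epoch; the last batch may hold fewer than 4 samples
batches_per_epoch = math.ceil(train_samples / per_device_batch_size)

# Assumption: only complete accumulation cycles count as update steps in an epoch
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = num_train_epochs * updates_per_epoch

# Fraction of the planned epochs actually covered by those steps
epochs_covered = total_steps * gradient_accumulation_steps / batches_per_epoch

print(effective_batch_size)  # → 64
print(batches_per_epoch)     # → 229
print(total_steps)           # → 140
print(epochs_covered)
```

Each logged loss value (every 10 steps) therefore summarizes roughly 640 training samples.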

In [13]:
# Save the trained model and the tokenizer
model.save_pretrained('llama-3.2-1b-fine-tuning')
tokenizer.save_pretrained('llama-3.2-1b-fine-tuning')

('llama-3.2-1b-fine-tuning/tokenizer_config.json',
 'llama-3.2-1b-fine-tuning/special_tokens_map.json',
 'llama-3.2-1b-fine-tuning/tokenizer.json')

### Evaluation

The fine-tuned LLaMA model is loaded with 8-bit quantization to reduce resource usage, so that it can then be used for inference.

In [64]:
model_name = './llama-3.2-1b-fine-tuning/'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# Load the model with quantization
model = LlamaForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map='auto'
)

# # Load the adapter explicitly (not needed here)
# model = PeftModel.from_pretrained(model, model_name)

model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=2048, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2048, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
          (v_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_f

### Metrics

In [14]:
# Select 100 samples because of limited compute capacity
test_samples = test_dataset.select(range(100))
test_tokenized_samples = test_samples.map(tokenize_function, batched=False)


In [8]:
def generate_responses(model, tokenizer, test_dataset):
    model.eval()
    generated_responses = []
    real_responses = []

    for example in tqdm(test_dataset):
        prompt = example['question']
        real_answer = example['answer']

        # Tokenize the prompt
        inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

        # Generate the answer with sampling
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                repetition_penalty=1.2
            )

        # Keep only the newly generated tokens (drop the prompt tokens) and decode them
        prompt_tokens = inputs['input_ids'].shape[1]
        generated_answer = tokenizer.decode(
            output[0][prompt_tokens:], skip_special_tokens=True
        ).strip()

        generated_responses.append(generated_answer)
        real_responses.append(real_answer)

    return generated_responses, real_responses
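
The `temperature` and `top_p` arguments control how the next token is sampled. The following is a simplified, dependency-free sketch of both ideas on a made-up four-token vocabulary; it is not the `transformers` implementation, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Return the smallest set of token indices whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.7)
candidates = top_p_indices(probs_sharp, 0.9)

print(probs_default)
print(probs_sharp)
print(candidates)  # → [0, 1]
```

In this toy case, lowering the temperature to 0.7 concentrates probability on the top token, and `top_p=0.9` restricts sampling to the two most likely tokens.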

In [67]:
generated_responses, real_responses = generate_responses(model, tokenizer, test_tokenized_samples)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  7%|▋         | 7/100 [00:12<03:02,  1.96s/it]

#### BLEU

In [19]:
def calculate_bleu_avg(generated_responses, real_responses, model_name):
    bleu_scores = []
    for generated, real in zip(generated_responses, real_responses):
        # sentence_bleu expects a list of tokenized references and a tokenized candidate
        reference = [real.split()]
        candidate = generated.split()
        bleu_scores.append(sentence_bleu(reference, candidate))

    average_bleu_score = np.mean(bleu_scores)
    print(f"Promedio BLEU Score {model_name}: {average_bleu_score}")
    return average_bleu_score
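
To see why BLEU can be very low for free-form answers, here is a simplified single-reference BLEU written from scratch (uniform weights over 1- to 4-grams, no smoothing). It is a sketch for illustration and not NLTK's `sentence_bleu`; the function names and example sentences are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def simple_bleu(reference, candidate, max_n=4):
    """Geometric mean of 1..max_n-gram precisions times the brevity penalty (no smoothing)."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a single missing n-gram order zeroes the whole score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)


reference = "the bank must report the operation within ten days".split()
paraphrase = "the operation must be reported by the bank in ten days".split()

print(simple_bleu(reference, reference))   # → 1.0
print(simple_bleu(reference, paraphrase))  # → 0.0
```

The paraphrase shares many words with the reference but no 3-gram or 4-gram, so its unsmoothed score collapses to zero. This is one reason a semantic metric such as BERT-Score is reported alongside BLEU.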

In [84]:
avg_bleu_model1b =  calculate_bleu_avg(generated_responses, real_responses, 'Llama-1B-Fine-Tuning')

Promedio BLEU Score Llama-1B-Fine-Tuning: 0.005825452138358672


#### BERT-Score

In [20]:
def calculate_bert_score_avg(generated_responses, real_responses, model_name):
    # Compute precision, recall and F1 for each generated/reference pair
    P, R, F1 = score(generated_responses, real_responses, lang='es', verbose=True)

    # Print the average scores
    print(f"Resultados para: {model_name}")
    print(f"Puntaje F1 promedio: {F1.mean():.4f}")
    print(f"Puntaje de Precisión promedio: {P.mean():.4f}")
    print(f"Puntaje de Recall promedio: {R.mean():.4f}")

    # Return the averages in the order the callers unpack them: precision, recall, F1
    return P.mean(), R.mean(), F1.mean()
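
BERT-Score compares token embeddings rather than exact words, and `score` returns precision, recall and F1 in that order. The sketch below mimics the greedy cosine matching on made-up 2-D vectors; real BERT-Score uses contextual embeddings from a pretrained model, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up 2-D "embeddings": the reference has one token the candidate does not cover well
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(precision, recall, f1)
```

Here precision is perfect because every candidate token has an exact match, while recall is lower because one reference token is only partially covered.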

In [86]:
P_model1b, R_model1b, F1_model1b = calculate_bert_score_avg(generated_responses, real_responses, 'Llama-1B-Fine-Tuning')



calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 0.26 seconds, 383.99 sentences/sec
Resultados para: Llama-1B-Fine-Tuning
Puntaje F1 promedio: 0.7423
Puntaje de Precisión promedio: 0.7486
Puntaje de Recall promedio: 0.7370


### Baseline Model (Llama-3.2-3B-Fine-Tuning)

In [6]:
model_name = './llama-3.2-3b-fine-tuning/'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

# Load the model with quantization
model = LlamaForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map='auto'
)

# # Load the adapter explicitly (not needed here)
# model = PeftModel.from_pretrained(model, model_name)

model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=3072, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=3072, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): Linear8bitLt(in_features=3072, out_features=1024, bias=False)
          (v_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_

In [17]:
generated_responses_3b, real_responses_3b = generate_responses(model, tokenizer, test_tokenized_samples)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  7%|▋         | 7/100 [00:23<05:04,  3.27s/it]

In [22]:
avg_bleu_model3b =  calculate_bleu_avg(generated_responses_3b, real_responses_3b, 'Llama-3B-Fine-Tuning')
P_model3b, R_model3b, F1_model3b = calculate_bert_score_avg(generated_responses_3b, real_responses_3b, 'Llama-3B-Fine-Tuning')

Promedio BLEU Score Llama-3B-Fine-Tuning: 0.005530120779914015
calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 0.26 seconds, 384.47 sentences/sec
Resultados para: Llama-3B-Fine-Tuning
Puntaje F1 promedio: 0.7459
Puntaje de Precisión promedio: 0.7548
Puntaje de Recall promedio: 0.7385


### Model without Fine-Tuning (Llama-3.2-3B-Instruct)

In [23]:
model_name = 'meta-llama/Llama-3.2-3B-Instruct'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization configuration (left commented out for this model)
# quantization_config = BitsAndBytesConfig(
#     load_in_8bit=True,
#     llm_int8_threshold=6.0
# )

# Load the model without quantization
model = LlamaForCausalLM.from_pretrained(
    model_name,
    #quantization_config=quantization_config,
    device_map='auto'
)

# # Load the adapter explicitly (not needed here)
# model = PeftModel.from_pretrained(model, model_name)

model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm

In [24]:
generated_responses_3b_instruct, real_responses_3b_instruct = generate_responses(model, tokenizer, test_tokenized_samples)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  7%|▋         | 7/100 [01:16<16:59, 10.96s/it]

In [52]:
avg_bleu_model3b_ins =  calculate_bleu_avg(generated_responses_3b_instruct, real_responses_3b_instruct, 'Llama-3B-Ins')
P_model3b_ins, R_model3b_ins, F1_model3b_ins=calculate_bert_score_avg(generated_responses_3b_instruct, real_responses_3b_instruct, 'Llama-3B-Ins')

Promedio BLEU Score Llama-3B-Ins: 0.0021440090363265903
calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 0.98 seconds, 101.54 sentences/sec
Resultados para: Llama-3B-Ins
Puntaje F1 promedio: 0.6403
Puntaje de Precisión promedio: 0.5926
Puntaje de Recall promedio: 0.6974


### Model without Fine-Tuning (Gemini 1.0 Pro)

In [48]:
import google.generativeai as genai
from google.colab import userdata
import time

In [29]:
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# Select the model to use
model_gemini = genai.GenerativeModel('gemini-1.0-pro')

In [53]:
def generate_response(question, model):
  """Generate an answer to a question with Gemini."""

  answer = model.generate_content(f"""Asume el rol de experto en temas de Normatividad en México (Comisión Nacional Bancaria y de Valores), de acuerdo
  a la siguiente pregunta: {question}. Solo dame la respuesta precisa y concreta en un reglón de texto.
  """).text

  return answer

In [49]:
generated_responses_gemini1_pro = []
real_responses_gemini1_pro = []
for question in test_tokenized_samples:
  # Retrieve the question and its reference answer
  prompt = question['question']
  real_answer = question['answer']
  # Generate the answer, then store the generated and reference answers
  generated_answer = generate_response(prompt, model_gemini)
  generated_responses_gemini1_pro.append(generated_answer)
  real_responses_gemini1_pro.append(real_answer)

  time.sleep(5) # The API allows only 15 requests per minute
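
A quota of 15 requests per minute implies a minimum pause of 4 seconds between calls, so the 5-second sleep leaves a small safety margin. The pattern can be sketched as a reusable helper; `min_interval_seconds` and `throttled_map` are hypothetical names, and a stand-in function replaces the real API call.

```python
import time


def min_interval_seconds(requests_per_minute):
    """Smallest pause between calls that stays within a per-minute quota."""
    return 60.0 / requests_per_minute


def throttled_map(func, items, requests_per_minute, sleep=time.sleep):
    """Apply func to each item, pausing between calls to respect the quota."""
    pause = min_interval_seconds(requests_per_minute)
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(pause)  # no pause is needed before the first call
        results.append(func(item))
    return results


# Stand-in function instead of a real API call; pauses are recorded rather than slept
pauses = []
answers = throttled_map(str.upper, ["a", "b", "c"], requests_per_minute=15, sleep=pauses.append)
print(answers)  # → ['A', 'B', 'C']
print(pauses)   # → [4.0, 4.0]
```

Passing the `sleep` function as a parameter keeps the helper easy to test without actually waiting.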

In [51]:
avg_bleu_modelgemini1pro =  calculate_bleu_avg(generated_responses_gemini1_pro, real_responses_gemini1_pro, 'Gemini 1.0 Pro')
P_modelgemini1pro, R_modelgemini1pro, F1_modelgemini1pro=calculate_bert_score_avg(generated_responses_gemini1_pro,
                                                                                  real_responses_gemini1_pro, 'Gemini 1.0 Pro')

Promedio BLEU Score Gemini 1.0 Pro: 0.022315941285467142
calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 0.32 seconds, 317.10 sentences/sec
Resultados para: Gemini 1.0 Pro
Puntaje F1 promedio: 0.7495
Puntaje de Precisión promedio: 0.7405
Puntaje de Recall promedio: 0.7599


### Model without Fine-Tuning (Gemini 1.5 Flash)

In [54]:
# Select the model to use
model_gemini_flash = genai.GenerativeModel('gemini-1.5-flash')

In [56]:
generated_responses_gemini1_flash = []
real_responses_gemini1_flash = []
for question in test_tokenized_samples:
  # Retrieve the question and its reference answer
  prompt = question['question']
  real_answer = question['answer']
  # Generate the answer, then store the generated and reference answers
  generated_answer = generate_response(prompt, model_gemini_flash)
  generated_responses_gemini1_flash.append(generated_answer)
  real_responses_gemini1_flash.append(real_answer)

  time.sleep(5) # The API allows only 15 requests per minute

In [58]:
avg_bleu_modelgemini1flash =  calculate_bleu_avg(generated_responses_gemini1_flash, real_responses_gemini1_flash, 'Gemini 1.5 Flash')
P_modelgemini1flash, R_modelgemini1flash, F1_modelgemini1flash=calculate_bert_score_avg(generated_responses_gemini1_flash,
                                                                                  real_responses_gemini1_flash, 'Gemini 1.5 Flash')

Promedio BLEU Score Gemini 1.5 Flash: 0.03927984647538516
calculating scores...
computing bert embedding.


  0%|          | 0/4 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/2 [00:00<?, ?it/s]

done in 0.35 seconds, 287.04 sentences/sec
Resultados para: Gemini 1.5 Flash
Puntaje F1 promedio: 0.7484
Puntaje de Precisión promedio: 0.7334
Puntaje de Recall promedio: 0.7653


### Comparison

In [85]:
# Build the lists used in the comparison table
avg_list = [avg_bleu_model1b, avg_bleu_model3b, avg_bleu_model3b_ins, avg_bleu_modelgemini1pro, avg_bleu_modelgemini1flash]
precision_list = [P_model1b.item(), P_model3b.item(), P_model3b_ins.item(), P_modelgemini1pro.item(), P_modelgemini1flash.item()]
recall_list = [R_model1b.item(), R_model3b.item(), R_model3b_ins.item(), R_modelgemini1pro.item(), R_modelgemini1flash.item()]
f1_score_list = [F1_model1b.item(), F1_model3b.item(), F1_model3b_ins.item(), F1_modelgemini1pro.item(), F1_modelgemini1flash.item()]

In [116]:
metrics_df = pd.DataFrame({'Model': ['Llama-3.2-1B-Fine-tuning', 'Llama-3.2-3B-Fine-tuning', 'Llama-3.2-3B-Instruct', 'Gemini 1.0 Pro', 'Gemini 1.5 Flash'],
              'BLEU': avg_list,
              'Precision': precision_list,
              'Recall': recall_list,
              'F1 Score': f1_score_list})

In [118]:
metrics_df.sort_values(by='F1 Score', ascending=False)

| | Model | BLEU | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 3 | Gemini 1.0 Pro | 0.022316 | 0.740463 | 0.759894 | 0.749509 |
| 4 | Gemini 1.5 Flash | 0.03928 | 0.733436 | 0.765295 | 0.74845 |
| 1 | Llama-3.2-3B-Fine-tuning | 0.00553 | 0.754828 | 0.738466 | 0.745926 |
| 0 | Llama-3.2-1B-Fine-tuning | 0.005825 | 0.7486 | 0.737 | 0.7423 |
| 2 | Llama-3.2-3B-Instruct | 0.002144 | 0.592615 | 0.697356 | 0.640279 |
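
As a cross-check, the ranking can be recomputed directly from the averages printed by each BERT-Score cell. The snippet below is a small sketch that copies those printed values (rounded to four decimals) into a dictionary; the dictionary layout is an assumption made for the example.

```python
# Averages copied from the printed BERT-Score results of each evaluation cell
results = {
    'Llama-3.2-1B-Fine-tuning': {'precision': 0.7486, 'recall': 0.7370, 'f1': 0.7423},
    'Llama-3.2-3B-Fine-tuning': {'precision': 0.7548, 'recall': 0.7385, 'f1': 0.7459},
    'Llama-3.2-3B-Instruct': {'precision': 0.5926, 'recall': 0.6974, 'f1': 0.6403},
    'Gemini 1.0 Pro': {'precision': 0.7405, 'recall': 0.7599, 'f1': 0.7495},
    'Gemini 1.5 Flash': {'precision': 0.7334, 'recall': 0.7653, 'f1': 0.7484},
}

# Rank the models by average F1, best first
ranking = sorted(results, key=lambda name: results[name]['f1'], reverse=True)
for name in ranking:
    print(f"{name}: F1={results[name]['f1']:.4f}")

# Gap between the fine-tuned 3B model and the same model without fine-tuning
gain = results['Llama-3.2-3B-Fine-tuning']['f1'] - results['Llama-3.2-3B-Instruct']['f1']
print(f"{gain:.4f}")  # → 0.1056
```

By average F1, the two Gemini models and the two fine-tuned Llama models fall within about 0.01 of each other, while fine-tuning raises the 3B Llama model by roughly 0.11 over its untuned version.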
