<a href="https://colab.research.google.com/github/Miguel9712/Estadia/blob/Showcase/careqa_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Explanation of the Code and Inference Process

This notebook demonstrates how to load a fine-tuned language model and run inference with it on a multiple-choice question-answering task.

**1. Installation and Imports:**

First, we install the necessary libraries, including `unsloth` for fast language model loading and inference, and `torch` as the deep learning framework. We also import `FastLanguageModel` from `unsloth` and `TextStreamer` from `transformers`.

In [2]:
# Install necessary libraries
!pip install unsloth torch

# You might also need to install other libraries depending on your setup,
# such as transformers, datasets, etc.

import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer
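
Because the required packages vary by setup, a quick look at what is already installed can save a failed import later. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is simply the three libraries this notebook imports or installs.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the libraries this notebook relies on
for name, ver in installed_versions(["unsloth", "torch", "transformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```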



**2. Parameter Definition:**

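We define the loading parameters, which should match those used during fine-tuning. `max_seq_length` sets the maximum context length in tokens, `load_in_4bit` loads the weights with 4-bit quantization to reduce memory use, and `dtype = None` lets Unsloth choose the precision automatically (float16 on a Tesla T4, which does not support bfloat16).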

In [3]:
# Define parameters (should match the ones used during finetuning)
max_seq_length = 1024 # Or the value you used
load_in_4bit = True # Or the value you used
dtype = None # Or the value you used
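
One practical consequence of `max_seq_length` is a token budget. The sketch below assumes that the prompt and the generated tokens together should fit within that limit; `max_prompt_tokens` is a hypothetical helper, and 200 is the `max_new_tokens` value passed to `model.generate()` in the inference step of this notebook.

```python
def max_prompt_tokens(max_seq_length, max_new_tokens):
    """Tokens left for the prompt once room for generation is reserved."""
    budget = max_seq_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return budget


# With the values used in this notebook: 1024 - 200
print(max_prompt_tokens(1024, 200))  # 824
```

To check a real prompt against this budget, compare the result with the length of the tokenized `input_ids`.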

**3. Model and Tokenizer Loading:**

We load the fine-tuned model and its tokenizer with `FastLanguageModel.from_pretrained()`. Set `model_name` to the path of your saved model directory. The `PeftModelForCausalLM` printed in the output shows that LoRA adapters are attached to a 4-bit Gemma 2 base model.

In [4]:
# Load the finetuned model and tokenizer
# Replace "finetuned_model" with the path to your saved model directory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "finetuned_model", # Path to the saved model directory
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

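# Switch to inference mode: for_inference enables Unsloth's faster generation
# path, and eval() disables dropout and other training-only behavior
FastLanguageModel.for_inference(model)
model.eval()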

==((====))==  Unsloth 2025.8.5: Fast Gemma2 patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 3584, padding_idx=0)
        (layers): ModuleList(
          (0-41): 42 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3584, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3584, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora
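
The printed module shapes are enough to estimate the adapter size. As a rough sketch based only on the `q_proj` shapes shown above (rank 16, 3584 input features, 4096 output features, 42 decoder layers), the following counts the LoRA parameters for that one projection. `lora_param_count` is a hypothetical helper, and the other projections are left out because the printed output is truncated.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


per_layer = lora_param_count(3584, 4096, 16)  # q_proj adapter in one layer
base = 3584 * 4096                            # frozen q_proj weight it adapts
num_layers = 42

print(f"q_proj LoRA params per layer: {per_layer}")
print(f"q_proj LoRA params, all layers: {per_layer * num_layers}")
print(f"adapter / base weight ratio: {per_layer / base:.4%}")
```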

**4. Prompt Formatting for Inference:**

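We define the `alpaca_prompt` format, a common template for instruction-tuned language models, and build the `inference_prompt` by filling it with the question and options for our multiple-choice task. The instruction string is in Spanish ("Answer the following multiple-choice question by selecting the number of the correct option."); keep the same wording that was used during fine-tuning.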

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}""" # Make sure this prompt format matches what you used for training

# Define your question and options for inference
inference_question = "In relation to iron metabolism and its control mediated by hepcidin, it is true that:"
inference_op1 = "The drop in partial oxygen pressure promotes the activation of the hypoxia-inducible factor (HIF), which increases the expression of hepcidin."
inference_op2 = "The increase in serum iron or inflammation stimulates the synthesis of hepcidin in the liver, which negatively regulates the function of ferroportin."
inference_op3 = "Hepcidin reduces intestinal iron absorption through the inactivation of the divalent metal transporter 1 (DMT1)."
inference_op4 = "In hereditary hemochromatosis type 1, mutations in the human hemochromatosis protein (HFE) cause an increase in the production of hepcidin."


inference_prompt = alpaca_prompt.format(
    "Responde la siguiente pregunta de opción múltiple seleccionando el número de la opción correcta.", # Instruction
    f"Pregunta: {inference_question}\nOpciones:\n1. {inference_op1}\n2. {inference_op2}\n3. {inference_op3}\n4. {inference_op4}", # Input
    "", # Response (empty for generation)
)
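
When running more than one question, the formatting above can be wrapped in a function. This is a sketch under the assumption that options are always numbered from 1; `ALPACA_PROMPT` and `INSTRUCTION` repeat the strings used above, `build_prompt` is a hypothetical helper name, and the question and options in the usage line are placeholders.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

INSTRUCTION = "Responde la siguiente pregunta de opción múltiple seleccionando el número de la opción correcta."


def build_prompt(question, options, instruction=INSTRUCTION):
    """Format one multiple-choice question into the Alpaca template."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(options, start=1))
    model_input = f"Pregunta: {question}\nOpciones:\n{numbered}"
    # The response slot stays empty so the model generates the answer
    return ALPACA_PROMPT.format(instruction, model_input, "")


# Placeholder question and options, for illustration only
prompt = build_prompt("Example question?", ["first option", "second option"])
print(prompt)
```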

**5. Tokenization and Inference:**

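We tokenize the `inference_prompt` into tensors of token IDs and move them to the model's device. Then `model.generate()` produces up to `max_new_tokens` new tokens that continue the prompt. `TextStreamer` prints the text as it is generated; by default it also echoes the prompt, which is why the prompt appears in the output, and passing `skip_prompt=True` prints only the generated tokens.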

In [6]:
# Tokenize the input prompt
inputs = tokenizer([inference_prompt], return_tensors="pt").to(model.device)

# Generate the response
print("--- Generated Response ---")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=200, use_cache=True, pad_token_id=tokenizer.eos_token_id)

# You can also get the output directly without streaming
# outputs = model.generate(**inputs, max_new_tokens=200, use_cache=True, pad_token_id=tokenizer.eos_token_id)
# generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
# print("\nFull generated text:", generated_text)

--- Generated Response ---
<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Responde la siguiente pregunta de opción múltiple seleccionando el número de la opción correcta.

### Input:
Pregunta: In relation to iron metabolism and its control mediated by hepcidin, it is true that:
Opciones:
1. The drop in partial oxygen pressure promotes the activation of the hypoxia-inducible factor (HIF), which increases the expression of hepcidin.
2. The increase in serum iron or inflammation stimulates the synthesis of hepcidin in the liver, which negatively regulates the function of ferroportin.
3. Hepcidin reduces intestinal iron absorption through the inactivation of the divalent metal transporter 1 (DMT1).
4. In hereditary hemochromatosis type 1, mutations in the human hemochromatosis protein (HFE) cause an increase in the production of hepcidin.

### Response:
2<eo
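
For scoring, the option number has to be pulled out of the generated text. The following is a minimal sketch, assuming the model answers with a number right after the `### Response:` marker, as in the output above; `extract_choice` is a hypothetical helper.

```python
import re


def extract_choice(generated_text, num_options=4):
    """Return the first option number after '### Response:', or None."""
    # Keep only the text after the last response marker; if the marker is
    # absent, rpartition leaves the whole text in `response`
    _, _, response = generated_text.rpartition("### Response:")
    match = re.search(r"\d+", response)
    if match is None:
        return None
    choice = int(match.group())
    # Reject numbers that do not correspond to an available option
    return choice if 1 <= choice <= num_options else None


print(extract_choice("### Input:\n1. a\n2. b\n\n### Response:\n2"))
```

Restricting the search to the text after the marker avoids picking up the option numbers that are part of the prompt itself.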