## Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

## Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit",
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.5.9: Fast Mistral patching. Transformers: 4.52.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


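## Adding the LoRA parameters

This lets us train only a small fraction of the model's parameters rather than the full model. LoRA setups typically train 1-10% of the parameters; with the configuration below, the training log reports 3,407,872 trainable parameters, about 0.05%.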

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    # Adapt only 2 module types (the query and value projections) to reduce the risk of overfitting
    target_modules = ["q_proj","v_proj"],
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
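The trainable-parameter count reported later by the training log (3,407,872) can be reproduced by hand. The sketch below assumes the usual Mistral-7B dimensions (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); those numbers are not stated in this notebook and are an assumption here.

```python
# Assumed Mistral-7B dimensions (not stated in this notebook).
HIDDEN_SIZE = 4096    # model width, also the input size of q_proj and v_proj
NUM_LAYERS = 32       # number of transformer blocks
NUM_KV_HEADS = 8      # grouped-query attention: fewer key/value heads
HEAD_DIM = 128        # dimension of each attention head
RANK = 8              # r passed to get_peft_model


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per module: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
total = NUM_LAYERS * (q_proj + v_proj)

print(f"{total:,}")                     # 3,407,872
print(f"{total / 7_000_000_000:.2%}")   # 0.05%
```

Adding more entries to `target_modules` or raising `r` increases this count proportionally.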

## Preparing the data

We use the dataset we created, which is available at the following link: [dataset](https://). The dataset contains approximately 700 examples.

In our case, the dataset follows the Alpaca format and has three fields:

1. Instruction: the question the user asks the model.

2. Input: the code the user provides to the model. This field may be empty.

3. Response: the answer the model generates for the user (stored under the `Output` key in the dataset).

The EOS_TOKEN must be appended to each output. Otherwise the model never learns to stop, and generation runs until it hits the token limit.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

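def formatting_prompts_func(examples):
    # Build one Alpaca-formatted training string per example in the batch.
    instructions = examples["Instruction"]
    inputs       = examples["Input"]
    outputs      = examples["Output"]
    texts = []
    # `input_text` avoids shadowing the built-in `input`.
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN so the model learns where a response ends.
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }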

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

In [None]:
print(EOS_TOKEN)

</s>
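To make the format concrete, the sketch below formats a single record without loading the tokenizer. The record contents are hypothetical, and `EOS_TOKEN` is hard-coded to the `</s>` value printed above.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # value printed by the cell above

# Hypothetical record using the same keys as the dataset.
record = {
    "Instruction": "Qué hace la función len()?",
    "Input": "",
    "Output": "Devuelve la cantidad de elementos de una secuencia.",
}


def format_record(rec: dict, eos_token: str = EOS_TOKEN) -> str:
    """Fill the Alpaca template and append the end-of-sequence token."""
    return ALPACA_PROMPT.format(rec["Instruction"], rec["Input"], rec["Output"]) + eos_token


text = format_record(record)
print(text)
```

An empty `Input` leaves two blank lines under `### Input:`, which is the same layout that appears in the generated outputs further down.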


## Training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 35,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/981 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 981 | Num Epochs = 1 | Total steps = 35
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 3,407,872/7,000,000,000 (0.05% trained)


Step,Training Loss
1,1.6161
2,1.7935
3,1.4512
4,1.406
5,1.2596
6,1.2253
7,1.2182
8,1.3361
9,1.3971
10,1.1068
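The numbers in the training banner can be checked with a little arithmetic. This sketch uses only values from the configuration and the log above (981 examples, batch size 2, 4 accumulation steps, 1 GPU, 35 steps). The steps-per-epoch figure is an estimate, since the trainer may handle the last partial batch differently.

```python
import math

num_examples = 981      # "Num examples" in the log
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1            # "Data Parallel GPUs" in the log
max_steps = 35          # max_steps

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus

# Examples seen over the whole run and the share of one epoch that covers.
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples

# Rough number of steps a full pass over the data would need.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)              # 8
print(examples_seen)                # 280
print(f"{fraction_of_epoch:.1%}")   # 28.5%
print(steps_per_epoch)              # 123
```

So with `max_steps = 35` the run covers a bit under a third of the dataset. Setting `num_train_epochs = 1` instead would train on all of it.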


## Asking the model questions


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo for en python?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cómo funciona un ciclo for en python?

### Input:


### Response:
El ciclo for en Python es una estructura de control que permite iterar sobre una secuencia de elementos, como una lista, tupla, cadena o rango. El ciclo for se utiliza para recorrer los elementos de una secuencia y realizar una acción específica sobre cada elemento. Aquí hay un ejemplo de cómo se utiliza un ciclo for en Python:

lista = [1, 2, 3, 4, 5]
for elemento in lista:
   print(elemento)</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cuando se independizo Chile?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cuando se independizo Chile?

### Input:


### Response:
Solo puedo responder cosas relacionadas a programación en Python.</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?", # instruction
        """numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          """, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?

### Input:
numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          

### Response:
Este código tiene un error en la línea "for i in range(6)". La función range() solo acepta un argumento, por lo que debería ser "for i in range(len(numeros))". Además, en la condición if, se debe usar "==" en lugar de "=".</s>
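The response above names only some of the problems in the snippet, and its remark about `range()` is not accurate: `range(6)` is valid Python, but it runs one index past the end of a five-element list. For comparison, here is one possible corrected version, wrapped in a hypothetical `describe` function so the result can be checked. It fixes the missing colons after `for` and `else`, `=` in place of `==`, `numeros(i)` in place of `numeros[i]`, the missing comma in the last `print`, and the out-of-range index.

```python
def describe(numeros: list[int]) -> list[str]:
    """Return one line per number saying whether it is even (par) or odd (impar)."""
    lines = []
    for i in range(len(numeros)):          # iterate only over valid indices
        if numeros[i] % 2 == 0:            # comparison uses ==, not =
            lines.append(f"El número {numeros[i]} es par")
        else:
            lines.append(f"El número {numeros[i]} es impar")
    return lines


for line in describe([1, 2, 3, 4, 5]):
    print(line)
```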


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Dame un ejemplo sobre el ciclo while", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Dame un ejemplo sobre el ciclo while

### Input:


### Response:
Ejemplo:
while True:
   numero = int(input("Ingrese un número: "))
   if numero > 0:
       print("El número es positivo.")
   elif numero < 0:
       print("El número es negativo.")
   else:
       print("El número es cero.")
   break</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo while en Python?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cómo funciona un ciclo while en Python?

### Input:


### Response:
El ciclo while en Python se utiliza para repetir una acción mientras una condición sea verdadera. La sintaxis es la siguiente:

while condición:
   # Código a ejecutar mientras la condición sea verdadera

Ejemplo:

numero = 0
while numero < 10:
   print(numero)
   numero += 1</s>
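The inference cells above repeat the same steps with a different question each time. A small helper can cut the repetition. This is only a sketch: `build_prompt` and `ask` are hypothetical names, and `ask` expects the `model` and `tokenizer` objects loaded earlier plus a CUDA device, as in the cells above.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the template, leaving the response slot empty for generation."""
    return ALPACA_PROMPT.format(instruction, input_text, "")


def ask(model, tokenizer, instruction: str, input_text: str = "", max_new_tokens: int = 1000):
    """Stream the model's answer to one question, as the cells above do."""
    # Imported here so build_prompt can be used without transformers installed.
    from transformers import TextStreamer

    inputs = tokenizer([build_prompt(instruction, input_text)], return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    return model.generate(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)


print(build_prompt("Cómo funciona un ciclo while en Python?"))
```

After calling `FastLanguageModel.for_inference(model)` once, each question becomes a single call such as `ask(model, tokenizer, "Cuando se independizo Chile?")`.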


## Saving the model

### 1. Save only the LoRA adapter

This saves only the LoRA parameters that were trained, not the full model.

In [None]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

### 2. Save the model in GGUF format for use with Ollama

This takes longer, because the model is first saved in f16 format and then converted to GGUF with q8_0 quantization.

In [None]:
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 4.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.52 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 28%|██▊       | 9/32 [00:00<00:01, 11.74it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [08:06<00:00, 15.19s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00003.bin...
Unsloth: Saving model/pytorch_model-00002-of-00003.bin...
Unsloth: Saving model/pytorch_model-00003-of-00003.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00003.bin'
INFO:hf-to-gguf:token_em
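Once the conversion finishes, Ollama needs a Modelfile that points at the GGUF file. The sketch below generates one as text. The GGUF path is the output location reported in the log above. The template layout and the `</s>` stop parameter are assumptions chosen to mirror the Alpaca prompt used in training, and this template always leaves the Input section empty.

```python
GGUF_PATH = "/content/model/unsloth.Q8_0.gguf"  # output location from the log above


def build_modelfile(gguf_path: str) -> str:
    """Return Modelfile text that wraps each user prompt in the Alpaca template."""
    # {{ .Prompt }} is Ollama's placeholder for the user's message.
    template = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{{ .Prompt }}\n\n"
        "### Input:\n\n\n"
        "### Response:\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "</s>"\n'
    )


modelfile = build_modelfile(GGUF_PATH)
print(modelfile)
```

Saving this text to a file named `Modelfile` and running `ollama create <name> -f Modelfile` (with a model name of your choice) should register the model with Ollama.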

## Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

## Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.5.9: Fast Mistral patching. Transformers: 4.52.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


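## Adding the LoRA parameters

This lets us train only a small fraction of the model's parameters rather than the full model. LoRA setups typically train 1-10% of the parameters; with the configuration below, the training log reports 3,407,872 trainable parameters, about 0.05%.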

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    # Adapt only 2 module types (the query and value projections) to reduce the risk of overfitting
    target_modules = ["q_proj","v_proj"],
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
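As a reminder of what `r` and `lora_alpha` control, the standard LoRA formulation (taken from the general method, not from this notebook) keeps each targeted weight matrix $W$ frozen and adds a low-rank update:

$$
W' = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With `r = 8`, `lora_alpha = 8` and `use_rslora = False`, the scaling factor is $\alpha / r = 1$, so the update $BA$ is added unscaled. Only $A$ and $B$ are trained, which gives $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters per adapted module.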

## Preparing the data

We use the dataset we created, which is available at the following link: [dataset](https://). The dataset contains approximately 700 examples.

In our case, the dataset follows the Alpaca format and has three fields:

1. Instruction: the question the user asks the model.

2. Input: the code the user provides to the model. This field may be empty.

3. Response: the answer the model generates for the user (stored under the `Output` key in the dataset).

The EOS_TOKEN must be appended to each output. Otherwise the model never learns to stop, and generation runs until it hits the token limit.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

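def formatting_prompts_func(examples):
    # Build one Alpaca-formatted training string per example in the batch.
    instructions = examples["Instruction"]
    inputs       = examples["Input"]
    outputs      = examples["Output"]
    texts = []
    # `input_text` avoids shadowing the built-in `input`.
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN so the model learns where a response ends.
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }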

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

In [None]:
print(EOS_TOKEN)

</s>


## Training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 35,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 981 | Num Epochs = 1 | Total steps = 35
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 3,407,872/7,000,000,000 (0.05% trained)


Step,Training Loss
1,1.602
2,1.7971
3,1.4481
4,1.4074
5,1.2611
6,1.2322
7,1.2144
8,1.333
9,1.3993
10,1.1159


## Asking the model questions


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo for en python?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cómo funciona un ciclo for en python?

### Input:


### Response:
El ciclo for en Python es una estructura de control que permite iterar sobre una secuencia de elementos, como una lista, tupla, cadena o rango. El ciclo for se utiliza para recorrer los elementos de una secuencia y realizar una acción específica sobre cada elemento. Aquí hay un ejemplo de cómo se utiliza un ciclo for en Python:

```python
numbers = [1, 2, 3, 4, 5]
for number in numbers:
   print(number)
```
En este ejemplo, el ciclo for recorre la lista `numbers` y imprime cada elemento.</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cuando se independizo Chile?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cuando se independizo Chile?

### Input:


### Response:
Solo puedo responder cosas relacionadas a programación en Python.</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?", # instruction
        """numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          """, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?

### Input:
numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          

### Response:
Error: La sintaxis del bucle for es incorrecta. Debería ser `for i in range(len(numeros))`.</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Dame un ejemplo sobre el ciclo while en python", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Dame un ejemplo sobre el ciclo while en python

### Input:


### Response:
Ejemplo de ciclo while en Python:

```python
num = 0
while num < 10:
   print(num)
   num += 1
```</s>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo while en Python?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Cómo funciona un ciclo while en Python?

### Input:


### Response:
El ciclo while en Python se utiliza para ejecutar un bloque de código mientras que una condición sea verdadera. El ciclo se ejecuta mientras que la condición sea verdadera, y se detiene cuando la condición se vuelve falsa.</s>


## Saving the model

### 1. Save only the LoRA adapter

This saves only the LoRA parameters that were trained, not the full model.

In [None]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

### 2. Save the model in GGUF format for use with Ollama

This takes longer, because the model is first saved in f16 format and then converted to GGUF with q8_0 quantization.

In [None]:
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 4.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.6 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 59%|█████▉    | 19/32 [00:00<00:00, 30.15it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [03:05<00:00,  5.80s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00003.bin...
Unsloth: Saving model/pytorch_model-00002-of-00003.bin...
Unsloth: Saving model/pytorch_model-00003-of-00003.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00003.bin'
INFO:hf-to-gguf:token_em