## Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

## Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.52.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Adding the LoRA parameters

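This lets us train only a small fraction of the model's parameters (typically 1-10%) instead of the full model. With the configuration used here the trainable share is far smaller: the training log reports 3,407,872 trainable parameters (0.04%).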

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    # Only two projection modules, to reduce the risk of overfitting
    target_modules = ["q_proj","v_proj"],
    lora_alpha = 8,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.5.9 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
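As a sanity check on how small the trained fraction is, the LoRA parameter count can be worked out by hand. The sketch below assumes the usual Llama 3.1 8B shapes (hidden size 4096, 32 decoder layers, 8 key/value heads of dimension 128, so `v_proj` maps 4096 to 1024). These dimensions are not stated in this notebook and are assumptions. Each adapted linear layer of shape `(d_in, d_out)` gains two small matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`; the helper name `lora_params` is hypothetical.

```python
# Assumed Llama 3.1 8B dimensions (not stated in this notebook).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
R = 8  # LoRA rank, as passed to get_peft_model


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, R)
total = NUM_LAYERS * (q_proj + v_proj)

print(f"q_proj adapter per layer: {q_proj:,}")
print(f"v_proj adapter per layer: {v_proj:,}")
print(f"total trainable parameters: {total:,}")
print(f"share of 8,000,000,000 parameters: {100 * total / 8_000_000_000:.2f}%")
```

Under these assumptions the total agrees with the `Trainable parameters = 3,407,872` figure that the training banner reports later in this notebook.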


## Preparing the data

We use the dataset we created, which is available at the following link: [dataset](https://). It contains approximately 700 examples.

In our case, the dataset follows the Alpaca format and has three parts:

1. Instruction: the question the user asks the model.

2. Input: the code the user provides to the model. This field may be empty.

3. Response: the answer the model generates for the user (stored in the `Output` field of the dataset).

The `EOS_TOKEN` must be appended to each response. Otherwise the model does not learn where to stop, and generation continues until it reaches the token limit.

In [None]:
alpaca_prompt = """You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

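def formatting_prompts_func(examples):
    # `examples` is a batch: each key maps to a list of values.
    instructions = examples["Instruction"]
    inputs       = examples["Input"]
    outputs      = examples["Output"]
    texts = []
    # `context` is used instead of `input` to avoid shadowing the built-in.
    for instruction, context, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN so the model learns where a response ends.
        text = alpaca_prompt.format(instruction, context, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }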

from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/981 [00:00<?, ? examples/s]
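To make the mapping concrete, here is a minimal sketch of what one formatted training example might look like. The record is hypothetical (only the field names `Instruction`, `Input` and `Output` come from the notebook), the template is a shortened stand-in for `alpaca_prompt`, and the EOS string is the value this notebook prints for `tokenizer.eos_token`.

```python
# Shortened stand-in for the notebook's alpaca_prompt; only the three
# positional placeholders matter for this illustration.
template = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Value printed by this notebook for tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"

# Hypothetical record using the field names the notebook reads.
record = {
    "Instruction": "Explain what this loop prints.",
    "Input": "for i in range(3):\n    print(i)",
    "Output": "It prints 0, 1 and 2, one per line.",
}

# Same operation as formatting_prompts_func, applied to a single record.
text = template.format(
    record["Instruction"], record["Input"], record["Output"]
) + EOS_TOKEN

print(text)
```

The EOS token sits directly after the response, so that is the point at which the model is trained to stop.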

Printing `EOS_TOKEN` shows which end-of-sequence token the model must emit to stop generating a response.

In [None]:
print(EOS_TOKEN)

<|end_of_text|>


## Training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/981 [00:00<?, ? examples/s]
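With `lr_scheduler_type = "linear"`, `warmup_steps = 5` and `max_steps = 60`, the learning rate ramps up over the first five steps and then decays linearly towards zero. The sketch below reproduces that shape in plain Python, assuming the usual linear-warmup, linear-decay multiplier. The function name `linear_lr` is hypothetical and this is not the trainer's own code.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

The peak rate of `2e-4` is reached only at step 5, so the first few logged losses come from updates made with a much smaller step size.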

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 981 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 3,407,872/8,000,000,000 (0.04% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 1.9324        |
| 2    | 2.0716        |
| 3    | 1.7592        |
| 4    | 1.7859        |
| 5    | 1.6987        |
| 6    | 1.7077        |
| 7    | 1.7971        |
| 8    | 1.9252        |
| 9    | 2.068         |
| 10   | 1.6988        |
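The training banner reports a total batch size of 8 and 60 total steps over 981 examples. The arithmetic below is a sketch using only those logged numbers. It suggests that 60 steps cover a little under half of the dataset, so this run stops before completing one full pass.

```python
import math

# Values taken from the TrainingArguments and the training banner.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 981

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch * max_steps
# Share of the dataset covered by the run.
fraction_of_epoch = examples_seen / num_examples
# Approximate number of optimizer steps needed for one full pass.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.2f}")
print(f"steps for a full pass (approx.): {steps_per_epoch}")
```

To train on every example at least once with this batch size, `max_steps` would need to be raised to roughly the last number printed, or replaced by `num_train_epochs = 1`.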


## Querying the model


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo for en python?", # instruction: "How does a for loop work in Python?"
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<|begin_of_text|>You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
Cómo funciona un ciclo for en python?

### Input:


### Response:
Un ciclo for en python recorre una secuencia, como una lista, y ejecuta un bloque de código por cada elemento de la secuencia. Por ejemplo:

frutas = ['manzana', 'pera', 'uva']
for fruta in frutas:
    print(fruta)

En este caso, el ciclo recorre la lista 'frutas' y imprime cada elemento.<|end_of_text|>
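The streamed output echoes the whole prompt before the answer. If only the answer is needed, it can be cut out of the decoded text. The helper below is a hypothetical sketch that relies on the `### Response:` marker from the prompt template and the `<|end_of_text|>` token printed earlier; `extract_response` is not part of Unsloth or Transformers.

```python
RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<|end_of_text|>"


def extract_response(generated: str) -> str:
    """Return the text after the first response marker, without the EOS token."""
    _, marker, answer = generated.partition(RESPONSE_MARKER)
    if not marker:
        # Marker missing: fall back to the full text rather than returning "".
        answer = generated
    return answer.replace(EOS_TOKEN, "").strip()


# Abbreviated version of the decoded output shown in this notebook.
sample = (
    "<|begin_of_text|>### Instruction:\nCapital de Chile?\n\n"
    "### Input:\n\n\n### Response:\n"
    "No puedo responder eso, solo trabajo con estructuras condicionales "
    "y ciclos en Python.<|end_of_text|>"
)

print(extract_response(sample))
```

The same function could be applied to text obtained by decoding the tensor returned by `model.generate` when no streamer is used.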


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Capital de Chile?", # instruction: "What is the capital of Chile?"
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<|begin_of_text|>You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
Capital de Chile?

### Input:


### Response:
No puedo responder eso, solo trabajo con estructuras condicionales y ciclos en Python.<|end_of_text|>


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?", # instruction: "This code has errors; could you explain what they are and how to fix them?"
        """numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          """, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<|begin_of_text|>You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
Este código tiene errores, me podrías explicar cuales son y cómo solucionarlo?

### Input:
numeros = [1, 2, 3, 4, 5]
          for i in range(6)
              if numeros[i] % 2 = 0:
                  print(f"El número {numeros(i)} es par")
              else
                  print("El número", numeros[i] "es impar")
          

### Response:
Explicación:
Los errores en este código son los siguientes:
1. La condición del bucle for está mal escrita, debe ser `for i in range(len(numeros))`.
2. La condición del if está mal escrita, debe ser `if numeros[i] % 
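The response is cut off in this export. For reference, one corrected version of the input snippet could look like the sketch below. It is written here for illustration and is not the model's own answer. The `salida` list is an addition that collects the printed lines so the result can be inspected.

```python
numeros = [1, 2, 3, 4, 5]
salida = []  # added for illustration: keeps the printed lines

# Was `for i in range(6)`: index 5 is out of range and the colon was missing.
for i in range(len(numeros)):
    # Was `% 2 = 0`: a comparison needs `==`, not `=`.
    if numeros[i] % 2 == 0:
        # Was `numeros(i)`: indexing uses square brackets.
        linea = f"El número {numeros[i]} es par"
    # Was `else` without a colon.
    else:
        # The original print was missing a comma before "es impar".
        linea = f"El número {numeros[i]} es impar"
    salida.append(linea)
    print(linea)
```

Iterating directly with `for numero in numeros:` would avoid the index altogether, but the version above stays closer to the original code.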

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Dame un ejemplo de ciclos while en python que contenga listas", # instruction: "Give me an example of while loops in Python that uses lists"
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<|begin_of_text|>You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
Dame un ejemplo de ciclos while en python que contenga listas

### Input:


### Response:
while i < len(lista):
    print(lista[i])
    i += 1<|end_of_text|>
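The generated snippet is not runnable on its own, because it never defines `lista` or initializes `i`. A self-contained variant could be written as follows; the sample values and the `vistos` list are chosen here for illustration and do not come from the model.

```python
lista = ["manzana", "pera", "uva"]  # hypothetical sample data
i = 0        # the index must be initialized before the loop
vistos = []  # added for illustration: records the elements visited

while i < len(lista):
    vistos.append(lista[i])
    print(lista[i])
    i += 1   # without this increment the loop would never end
```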


In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Cómo funciona un ciclo while en Python?", # instruction: "How does a while loop work in Python?"
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<|begin_of_text|>You are a Python programming assistant specialized in conditionals (if, else, and elif) and loops (for and while). You must only answer questions related to this topic. If the question is unrelated, respond by saying you cannot help with that.

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.


### Instruction:
Cómo funciona un ciclo while en Python?

### Input:


### Response:
Un ciclo while se ejecuta mientras la condición del bucle sea verdadera. La sintaxis es:

    while condición:
        código a ejecutar

El código dentro del bucle se ejecutará hasta que la condición se vuelva falsa.<|end_of_text|>


## Saving the model

### 1. Saving only the LoRA adapter

This saves only the trained LoRA parameters (the adapter), not the base model weights.

In [None]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

### 2. Saving the model in GGUF format for use with Ollama

This takes somewhat longer, because the model is first saved in 16-bit (f16) format and then converted to GGUF with q8_0 quantization.
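Once the GGUF file exists, Ollama needs a Modelfile that points at it and reproduces the prompt format used in training. The sketch below writes one from Python. It assumes the file ends up at `./model/unsloth.Q8_0.gguf`, matching the output location this run logs, and that leaving the `### Input:` section empty is acceptable for chat-style use. Both are assumptions, and the template is an illustration rather than an official Unsloth or Ollama recipe.

```python
from pathlib import Path

# Assumed location of the exported file (matches the path logged by this run).
gguf_path = "./model/unsloth.Q8_0.gguf"

# System sentence copied from the notebook's alpaca_prompt.
system_prompt = (
    "You are a Python programming assistant specialized in conditionals "
    "(if, else, and elif) and loops (for and while). You must only answer "
    "questions related to this topic. If the question is unrelated, respond "
    "by saying you cannot help with that."
)

# Ollama templates use Go template syntax: {{ .System }} and {{ .Prompt }}.
template = (
    "{{ .System }}\n\n"
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n\n"
    "### Instruction:\n{{ .Prompt }}\n\n"
    "### Input:\n\n\n"
    "### Response:\n"
)

modelfile = "".join([
    f"FROM {gguf_path}\n",
    f'SYSTEM """{system_prompt}"""\n',
    f'TEMPLATE """{template}"""\n',
    # Stop at the same token that was appended to every training example.
    'PARAMETER stop "<|end_of_text|>"\n',
])

Path("Modelfile").write_text(modelfile, encoding="utf-8")
print(modelfile)
```

The model could then be registered with `ollama create python-tutor -f Modelfile` and queried with `ollama run python-tutor`, where `python-tutor` is a placeholder name.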

In [None]:
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.15 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...



We will save to Disk and not RAM now.


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: