To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to Hugging Face format via `standardize_sharegpt`.
3. Trains on completions / assistant responses only via `train_on_responses_only`.
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [1]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth
  Downloading unsloth-2024.10.2-py3-none-any.whl.metadata (56 kB)
Collecting unsloth-zoo (from unsloth)
  Downloading unsloth_zoo-2024.10.3-py3-none-any.whl.metadata (48 kB)
Collecting torch>=2.4.0 (from unsloth)
  Downloading torch-2.5.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.28.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting tyro (from unslo

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
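
The RoPE scaling bullet above can be illustrated with a small sketch. It assumes the method is linear position interpolation, where position indices are divided by a scale factor so that a longer sequence maps back into the range seen during pretraining. The function names and context lengths are hypothetical, and this is not Unsloth's internal code.

```python
def linear_rope_scale(trained_ctx: int, max_seq_length: int) -> float:
    """Scale factor for linear position interpolation (1.0 means no scaling)."""
    return max(1.0, max_seq_length / trained_ctx)

def scaled_positions(seq_len: int, scale: float) -> list[float]:
    """Position indices after interpolation: each index is divided by the scale."""
    return [pos / scale for pos in range(seq_len)]

# Hypothetical example: a model pretrained on 2048 tokens, used at 8192.
scale = linear_rope_scale(trained_ctx=2048, max_seq_length=8192)
positions = scaled_positions(8192, scale)
print(scale)          # 4.0
print(positions[-1])  # 2047.75, still inside the pretrained range
```

When `max_seq_length` is no larger than the pretrained context, the factor stays at 1.0 and positions are unchanged.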

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # or choose "unsloth/Llama-3.2-3B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    load_in_8bit = False, # trailing comma keeps the call valid if `token` is uncommented
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA H100 PCIe. Max memory: 79.216 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 9.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers via:
`pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"`
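
The `dtype = None` auto detection can be pictured as follows. This is a rough sketch, assuming the choice depends only on the GPU's CUDA compute capability (bfloat16 from Ampere, capability 8.0, upward). `pick_dtype` is a hypothetical helper, not an Unsloth function.

```python
def pick_dtype(compute_capability: tuple[int, int]) -> str:
    """Return the half-precision dtype name suited to a GPU generation."""
    major, _minor = compute_capability
    # Ampere (8.x) and newer GPUs support bfloat16 natively.
    return "bfloat16" if major >= 8 else "float16"

print(pick_dtype((7, 5)))  # Tesla T4 -> float16
print(pick_dtype((7, 0)))  # V100     -> float16
print(pick_dtype((9, 0)))  # H100     -> bfloat16
```

On a real machine the capability tuple can be read with `torch.cuda.get_device_capability()`. The banner above reports `CUDA = 9.0` and `Bfloat16 = TRUE` for the H100 used in this run.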


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
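
To see where the adapter size comes from, the sketch below counts LoRA parameters by hand. Each adapted linear layer with shape `in x out` gains two small matrices holding `r * (in + out)` values in total. The layer shapes are assumptions about Llama-3.2-1B (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64) and are not stated in this notebook, but the result agrees with the `Number of trainable parameters` that the trainer prints later.

```python
# Assumed Llama-3.2-1B shapes (not stated in this notebook).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # MLP intermediate_size
KV_DIM = 512         # 8 key/value heads x head_dim 64
NUM_LAYERS = 16
RANK = 16            # r in get_peft_model

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 11,272,192
```

Doubling `r` doubles this count, which is a quick way to estimate the cost of a larger rank before training.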


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for 2 full epochs (`num_train_epochs=2`); for a quick test, set `max_steps` to a small value such as 60 instead. We also support TRL's `DPOTrainer`!

In [4]:
from datasets import load_dataset
data_path = "dataset.csv"
# Column names are assigned explicitly, in file order.
corpus = load_dataset("csv", data_files=data_path,
                      column_names=["instruct", "input", "output"], cache_dir=None)

Generating train split: 0 examples [00:00, ? examples/s]

In [5]:
print(corpus)

DatasetDict({
    train: Dataset({
        features: ['instruct', 'input', 'output'],
        num_rows: 21084
    })
})
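
Because `column_names` is passed explicitly, a header row in the CSV (if the file has one) may be read as a regular example, and empty cells may surface as `None`, which would break the string concatenation in the next cell. The sketch below is one way to check a file with the standard `csv` module before loading it. `check_rows` and the sample rows are hypothetical.

```python
import csv
import io

EXPECTED = ["instruct", "input", "output"]

def check_rows(handle) -> dict:
    """Count rows that look like a header, have a wrong width, or contain empty cells."""
    report = {"rows": 0, "header_like": 0, "bad_width": 0, "with_empty": 0}
    for row in csv.reader(handle):
        report["rows"] += 1
        if [cell.strip().lower() for cell in row] == EXPECTED:
            report["header_like"] += 1
        if len(row) != len(EXPECTED):
            report["bad_width"] += 1
        elif any(cell.strip() == "" for cell in row):
            report["with_empty"] += 1
    return report

# Hypothetical sample standing in for dataset.csv.
sample = io.StringIO(
    "instruct,input,output\n"
    "Translate,some text,some output\n"
    "Translate,,\n"
)
report = check_rows(sample)
print(report)
# {'rows': 3, 'header_like': 1, 'bad_width': 0, 'with_empty': 1}
```

For the real file, the same function can be called on `open("dataset.csv", newline="", encoding="utf-8")`.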


In [6]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN so the model learns where a response ends
def formatting_prompts_func(examples):
    instructions = examples["instruct"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    # `input_text` avoids shadowing the built-in `input`
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = instruction + " " + input_text + " " + output + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = corpus.map(formatting_prompts_func, batched = True, keep_in_memory=False, num_proc=1)

Map:   0%|          | 0/21084 [00:00<?, ? examples/s]
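
The training text here is the three fields joined by spaces, while the inference cells later in this notebook wrap the fields in an `Instruction: ... Input: ... Output:` template. A fine-tuned model usually responds best to the layout it was trained on, so one option is to build both from a single helper. The sketch below assumes that template. `build_prompt`, `build_training_text` and `format_batch` are hypothetical names, and `"<eos>"` stands in for `tokenizer.eos_token`.

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Prompt shown to the model at inference time (no answer yet)."""
    return f"Instruction: {instruction} Input: {input_text} Output:"

def build_training_text(instruction: str, input_text: str, output: str, eos_token: str) -> str:
    """Training example: the same prompt followed by the answer and EOS."""
    return f"{build_prompt(instruction, input_text)} {output}{eos_token}"

def format_batch(examples: dict, eos_token: str) -> dict:
    """Batched formatter with the same return shape as formatting_prompts_func."""
    texts = [
        # `or ""` guards against missing (None) cells
        build_training_text(ins or "", inp or "", out or "", eos_token)
        for ins, inp, out in zip(examples["instruct"], examples["input"], examples["output"])
    ]
    return {"text": texts}

batch = {"instruct": ["Translate"], "input": ["hello"], "output": ["hola"]}
print(format_batch(batch, "<eos>")["text"][0])
# Instruction: Translate Input: hello Output: hola<eos>
```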

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import FastLanguageModel, is_bfloat16_supported

# Configure the model with LoRA. Adapters were already added in cell 3,
# so Unsloth skips this call (see the message in the output below).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # You can adjust this value (suggested: 8, 16, 32, 64, 128)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimized
    bias="none",  # "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for long context
    random_state=3407,
    use_rslora=False,  # Support for rank stabilized LoRA
    loftq_config=None  # Support for LoftQ
)

# Make sure to select the 'train' split of the dataset
train_dataset = dataset['train']

# Function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=max_seq_length)

# Tokenize the dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)

# Function to add the 'labels' column to the tokenized dataset
def add_labels(examples):
    examples['labels'] = examples['input_ids']  # Set 'labels' equal to 'input_ids'
    return examples

# Apply the function to add the 'labels' column
train_dataset = train_dataset.map(add_labels, batched=True)

# Configure the data collator for sequences
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Configure the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Use the 'train' dataset with 'labels'
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=data_collator,  # Use the collator defined above
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=2,  # Adjust the number of epochs here
        #max_steps=5000,
        learning_rate=0.001,
        fp16= not is_bfloat16_supported(),
        bf16= is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Use this for WandB, TensorBoard, etc.
    ),
)

Unsloth: Already have LoRA adapters! We shall skip this step.


Map:   0%|          | 0/21084 [00:00<?, ? examples/s]

Map:   0%|          | 0/21084 [00:00<?, ? examples/s]
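
Copying `input_ids` into `labels` means the loss is computed on every token, prompt included. When a batch is assembled, shorter sequences are padded, and padded label positions are conventionally set to `-100` so the loss ignores them (this is the default `label_pad_token_id` of `DataCollatorForSeq2Seq`). Below is a pure-Python sketch of that padding step, assuming right padding, with a hypothetical `pad_batch` helper and made-up token ids.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def pad_batch(features: list[dict], pad_token_id: int) -> dict:
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * gap)
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * gap)
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8], "labels": [8]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, -100, -100]]
```

The real collator returns tensors and follows the tokenizer's padding side. The lists here only show which positions contribute to the loss.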

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,084 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 32 | Gradient Accumulation steps = 2
\        /    Total batch size = 64 | Total steps = 658
 "-____-"     Number of trainable parameters = 11,272,192


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!
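
The `Total steps = 658` figure in the banner above can be reproduced by hand. The sketch assumes that the number of optimizer updates per epoch is the number of batches divided by the accumulation steps, rounded down. That assumption matches the number printed here, but other library versions may round differently. `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer steps, assuming updates per epoch = floor(batches / grad_accum)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

steps = total_optimizer_steps(21_084, per_device_batch=32, grad_accum=2, epochs=2)
print(steps)  # 658
```

Here 21,084 examples give 659 batches of 32, which is 329 updates per epoch and 658 over two epochs.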


Step,Training Loss
1,4.8453
2,4.7897
3,4.6444
4,4.4775
5,4.2882
6,4.0312
7,4.0117
8,3.889
9,3.674
10,3.7124
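
The run uses `warmup_steps=5` with `lr_scheduler_type="linear"`, so the learning rate is still ramping up during the first five logged steps. A minimal sketch of such a schedule, assuming linear warmup from 0 followed by linear decay to 0 at the last step (`linear_lr` is a hypothetical helper, not the library's scheduler class):

```python
def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values from this notebook: learning_rate=0.001, warmup_steps=5, 658 total steps.
for step in (0, 1, 5, 329, 658):
    print(step, linear_lr(step, base_lr=1e-3, warmup_steps=5, total_steps=658))
```

The peak of 0.001 is reached at step 5, and the rate has fallen to roughly half of that by the midpoint of training.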


In [10]:
model.save_pretrained("modelo")
tokenizer.save_pretrained("modelo")

('modelo/tokenizer_config.json',
 'modelo/special_tokens_map.json',
 'modelo/tokenizer.json')

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

The cells below sample with `temperature = 0.7` and `top_p = 0.9`. The recommended alternative is `min_p = 0.1` and `temperature = 1.5`; read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
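
To make the sampling parameters concrete, here is a toy sketch of temperature scaling, `min_p` filtering and `top_p` (nucleus) filtering on a four-token vocabulary. It is an illustration under simplified assumptions, not the `generate` implementation. The function names and the logits are made up.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_keep(probs: list[float], min_p: float) -> list[int]:
    """Indices whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

def top_p_keep(probs: list[float], top_p: float) -> list[int]:
    """Smallest set of most likely indices whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)

logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=1.5)
print(min_p_keep(probs, min_p=0.1))  # [0, 1, 2]
print(top_p_keep(probs, top_p=0.9))  # [0, 1]
```

The `min_p` cutoff scales with the confidence of the top token, which is why it is often paired with a higher temperature.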

In [11]:
def format_prompt(instruction, input_text):
    return f"Instruction: {instruction} Input: {input_text} Output:"

In [12]:
def tokenize_prompt(prompt, tokenizer, max_seq_length):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_seq_length)
    return inputs


In [13]:
import torch

def generate_response(prompt, model, tokenizer, max_seq_length=256, max_new_tokens=50):
    # Format the prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])

    # Tokenize the prompt
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Pass the prompt to the model and generate a response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # Maximum number of tokens to generate
        do_sample=True,  # Sample randomly for more diversity
        temperature=0.7,  # Controls the creativity of the response
        top_p=0.9,  # Nucleus sampling threshold
        eos_token_id=tokenizer.eos_token_id  # End-of-sequence token ID
    )

    # Decode the generated response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return the decoded text (the prompt followed by the generated response)
    return response


In [19]:
from unsloth import FastLanguageModel

# Prepare the model for inference (patches the model in place)
FastLanguageModel.for_inference(model)

def format_prompt(instruction, input_text):
    return f"Instruction: {instruction} Input: {input_text} Output:"

def tokenize_prompt(prompt, tokenizer, max_seq_length):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_seq_length)
    return inputs

import torch


def generate_response(prompt, model, tokenizer, max_seq_length=256, max_new_tokens=50):
    # Pick the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Format the prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])

    # Tokenize the prompt
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Move the inputs to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Pass the prompt to the model and generate a response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode the generated response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response


# Example prompt (the dataset is Spanish to Nahuatl, so the prompt stays in Spanish)
prompt_example = {
    "instruction": "Traduce el siguiente texto a Nahuatl",
    "input": "Dame un pedazo de ese chocolate"
}

# Generate a response
response = generate_response(prompt_example, model, tokenizer)
print(f"Model response: {response}")


Model response: Instruction: Traduce el siguiente texto a Nahuatl Input: Dame un pedazo de ese chocolate Output: Xinechmotlaquili inin tecontotlaxcolli inin tepitzin xochimilmaquili Xinechmotlaquili inin tecontotlaxcolli inin tepitzin x
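
The decoded text above repeats the prompt before the translation, because `generate` returns the prompt tokens followed by the new tokens. One simple way to keep only the answer is to split on the `Output:` marker. This is a sketch with a hypothetical `extract_answer` helper, and it assumes the marker appears in the prompt exactly as in `format_prompt`.

```python
def extract_answer(response: str, marker: str = "Output:") -> str:
    """Return the text after the first prompt marker, or the whole string if absent."""
    _, found, answer = response.partition(marker)
    return answer.strip() if found else response.strip()

decoded = "Instruction: Translate Input: hello Output: hola"
print(extract_answer(decoded))  # hola
```

An alternative that does not depend on the marker text is to decode only the new tokens, for example by slicing `outputs[0]` from the length of the tokenized prompt onward.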
