**To** run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] ORPO support is finally here thanks to [oKatanaaa](https://github.com/oKatanaaa) and [AT&Dev](https://huggingface.co/AtAndDev)

ORPO merges the SFT and DPO steps into one. Previously you had to run SFT first and then DPO; ORPO needs only a single training stage.

In [1]:
# %%capture
# !pip install unsloth
# # Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models at https://huggingface.co/unsloth

basemodel, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4060 Laptop GPU. Max memory: 7.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

Unsloth 2024.12.2 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
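
As a rough check on how few weights LoRA actually trains, the adapter size can be worked out by hand. The sketch below assumes the Llama-3.2-1B layer shapes (hidden size 2048, key/value projection width 512, MLP width 8192, 16 layers); these numbers are taken from the model's published config, not from this notebook, and `lora_trainable_params` is a hypothetical helper. Each adapted `d_in x d_out` matrix adds `r * (d_in + d_out)` trainable weights.

```python
# Assumed Llama-3.2-1B shapes (from the model config, not from this notebook)
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 2048, 512, 8192, 16

# (d_in, d_out) for each LoRA target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_trainable_params(r: int) -> int:
    """Count LoRA weights: A is (r x d_in) and B is (d_out x r) per target module."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

print(lora_trainable_params(16))  # 11272192
```

With `r = 16` this gives 11,272,192, which matches the "Number of trainable parameters" that Unsloth reports in the training logs further down.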


<a name="Data"></a>
### Data Prep
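ORPO needs a preference-style dataset, such as the one from [recipe-research](https://huggingface.co/datasets/reciperesearch/dolphin-sft-v0.1-preference). The code cell below loads `Vezora/Code-Preference-Pairs` and reads the same kind of columns from it.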

You need at least 3 columns:
* Instruction
* Accepted
* Rejected

For example:
* Instruction: "What is 2+2?"
* Accepted: "The answer is 4"
* Rejected: "The answer is 5"
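
A quick way to catch malformed rows before training is to check for these columns up front. This is a minimal sketch: `missing_columns` is a hypothetical helper, and the lowercase column names are an assumption chosen to match the keys read by the formatting cell below.

```python
REQUIRED_COLUMNS = ("instruction", "accepted", "rejected")

def missing_columns(row: dict) -> list:
    """Return the required preference columns that are absent or empty in `row`."""
    return [col for col in REQUIRED_COLUMNS if not row.get(col)]

example = {
    "instruction": "What is 2+2?",
    "accepted": "The answer is 4",
    "rejected": "The answer is 5",
}
print(missing_columns(example))                # []
print(missing_columns({"instruction": "Hi"}))  # ['accepted', 'rejected']
```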

The goal of ORPO is to penalize the "rejected" samples and increase the likelihood of the "accepted" samples. [recipe-research](https://huggingface.co/datasets/reciperesearch/dolphin-sft-v0.1-preference) essentially used Mistral to generate the "rejected" responses, and used GPT-4 to generate the "accepted" responses.
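
To make that goal concrete, here is a small sketch of an ORPO-style objective on scalar inputs: the usual SFT loss on the accepted response plus a `beta`-weighted penalty that shrinks as the odds of the accepted response grow relative to the rejected one. It assumes length-averaged log-probabilities as inputs; the function names are hypothetical and this is not TRL's implementation, only an illustration of the idea.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(nll: float, chosen_logp: float, rejected_logp: float, beta: float) -> float:
    """SFT loss on the accepted response plus a beta-weighted odds-ratio penalty."""
    log_odds_ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll + beta * penalty

# Illustrative numbers only, not taken from any run in this notebook
good = orpo_loss(nll=1.0, chosen_logp=-0.5, rejected_logp=-2.0, beta=0.1)
bad = orpo_loss(nll=1.0, chosen_logp=-2.0, rejected_logp=-0.5, beta=0.1)
print(good < bad)  # True: preferring the accepted answer lowers the loss
```

Setting `beta = 0` reduces this to plain SFT, which is why `beta` is worth tuning in the grid search later on.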

In [4]:
# The data must be formatted with appropriate prompt template first.
# See details here: https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def format_prompt(sample):
    instruction = sample["instruction"]
    input       = sample["input"]
    accepted    = sample["accepted"]
    rejected    = sample["rejected"]

    # ORPOTrainer expects prompt/chosen/rejected keys
    # See: https://huggingface.co/docs/trl/main/en/orpo_trainer
    sample["prompt"]   = alpaca_prompt.format(instruction, input, "")
    sample["chosen"]   = accepted + EOS_TOKEN
    sample["rejected"] = rejected + EOS_TOKEN
    return sample
pass

from datasets import load_dataset
dataset = load_dataset("Vezora/Code-Preference-Pairs")['train']
dataset = dataset.map(format_prompt,)
split = dataset.train_test_split(test_size=0.01)
train_dataset = split['train']
eval_dataset = split['test']

Let's print out an example to see what the dataset should look like.

In [5]:
import pprint
row = train_dataset[1]
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["prompt"])
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])

('Below is an instruction that describes a task, paired with an input that '
 'provides further context. Write a response that appropriately completes the '
 'request.\n'
 '\n'
 '### Instruction:\n'
 'You are an AI-Coding assistant. User will you give you a task. Your goal is '
 'to complete the task as faithfully as you can.\n'
 '\n'
 '### Input:\n'
 'Develop an algorithmic solution leveraging the intricacies of QuickSort '
 'methodology, applied specifically to doubly linked data structures. This '
 'structure should ideally be proficient enough to handle and store up to '
 '500,000 unique pieces of data. As part of the solution, please intricately '
 'outline the necessary multi-step reasoning that leads to the desired '
 'result.\n'
 '\n'
 '### Response:\n')
('A Doubly Linked List (DLL) is a type of data structure in which a node '
 'contains a pointer to the previous as well as the next node in the sequence. '
 'However, The QuickSort algorithm is generally applied on array data '

In [6]:
# Enable reward modelling stats
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `ORPOTrainer`! More docs here: [TRL ORPO docs](https://huggingface.co/docs/trl/main/en/orpo_trainer). The cell below runs a grid search over LoRA and ORPO hyperparameters, capping each run at `max_steps = 3000` and stopping early when the eval loss stops improving. For a single full pass over the data, set `num_train_epochs=1` and drop the `max_steps` setting. We also support TRL's `DPOTrainer`!
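
Before launching a grid search it helps to know how many runs it implies. The sketch below copies the grid values from the training cell and enumerates the combinations the same way that cell does, with `itertools.product`; `combos` is just an illustrative name.

```python
from itertools import product

# Same grid values as in the training cell
param_grid = {
    "r": [8, 16],
    "lora_alpha": [8, 16, 32],
    "learning_rate": [1e-4, 2e-4, 3e-4],
    "weight_decay": [0.01, 0.02, 0.05],
    "beta": [0.001, 0.01, 0.1, 0.9],
}

# One dict per run, keyed by hyperparameter name
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos))  # 216
print(combos[0])    # {'r': 8, 'lora_alpha': 8, 'learning_rate': 0.0001, 'weight_decay': 0.01, 'beta': 0.001}
```

At 216 runs, even short runs add up, so early stopping or a smaller grid is worth considering on a single GPU.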

In [None]:
from itertools import product
from trl import ORPOConfig, ORPOTrainer
from unsloth import is_bfloat16_supported
import json
from transformers import TrainerCallback

class LoggingCallback(TrainerCallback):
    def __init__(self, log_file):
        self.log_file = log_file

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            with open(self.log_file, 'a') as f:
                f.write(f"Step {state.global_step}: {logs}\n")


class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, patience: int, min_delta: float = 0.0, log_file = None):
        """
        Early stopping callback to stop training when validation loss does not improve.
        
        Args:
            patience (int): Number of evaluations to wait for an improvement.
            min_delta (float): Minimum change in the monitored metric to qualify as an improvement.
            log_file (str, optional): If given, each eval loss is appended to this file.
        """
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.num_bad_epochs = 0
        self.log_file = log_file

    def on_evaluate(self, args, state, control, **kwargs):
        """
        This method is called during evaluation.
        """
        eval_loss = kwargs['metrics'].get('eval_loss', None)
        
        if eval_loss is None:
            return
        
        # Log eval_loss if log_file is specified
        if self.log_file:
            with open(self.log_file, 'a') as f:
                f.write(f"Step {state.global_step}, Eval Loss: {eval_loss}\n")
        
        # Check if eval_loss improved
        if eval_loss < self.best_loss - self.min_delta:
            self.best_loss = eval_loss
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        
        # Stop training if patience is exceeded
        if self.num_bad_epochs >= self.patience:
            print(f"Early stopping triggered. No improvement for {self.patience} evaluations.")
            control.should_training_stop = True


import os

param_grid = {
    "r": [8, 16],
    "lora_alpha": [8, 16, 32],
    'learning_rate': [1e-4, 2e-4, 3e-4],  # Learning rate
    'weight_decay': [0.01, 0.02, 0.05],   # Weight decay
    'beta': [0.001, 0.01, 0.1, 0.9],      # Reward weight (ORPO beta)
}


def grid_search(param_grid, basemodel, train_dataset, eval_dataset, tokenizer, max_seq_length):
    best_model = None
    best_eval_loss = float('inf')
    best_params = None
    
    param_combinations = product(*param_grid.values())
    
    for params in param_combinations:
        log_file = f'Llama-3.2-1B-Instruct_evaluation_results/200steps/{params}_training_logs.txt'
        # Create the results directory first; open(..., 'a') fails if it is missing
        os.makedirs(os.path.dirname(log_file), exist_ok=True)
        print(f"Training with params: {params}")
        
        # Map the positional tuple back to named hyperparameters
        param_dict = dict(zip(param_grid.keys(), params))
        
        
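        # NOTE: LoRA layers are injected into `basemodel` in place, so later trials may not
        # start from a clean base model; reloading the base model per trial is safer.
        model = FastLanguageModel.get_peft_model(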
            basemodel,
            r = param_dict['r'], # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
            target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                            "gate_proj", "up_proj", "down_proj",],
            lora_alpha = param_dict['lora_alpha'],
            lora_dropout = 0, # Supports any, but = 0 is optimized
            bias = "none",    # Supports any, but = "none" is optimized
            # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
            use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
            random_state = 3407,
            use_rslora = False,  # We support rank stabilized LoRA
            loftq_config = None, # And LoftQ
        )
        
        trainer_args = ORPOConfig(
            warmup_steps=5,
            learning_rate=param_dict['learning_rate'],
            weight_decay=param_dict['weight_decay'],
            seed=3407,
            max_length=max_seq_length,
            max_prompt_length=max_seq_length//2,
            max_completion_length=max_seq_length//2,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            beta=param_dict['beta'],
            logging_steps=50,
            optim="adamw_8bit",
            lr_scheduler_type="linear",
            # num_train_epochs=1,
            max_steps = 3000,
            fp16=not is_bfloat16_supported(),
            bf16=is_bfloat16_supported(),
            evaluation_strategy="steps",  # or "epoch"
            eval_steps=50,  # Evaluate every 50 steps
            save_steps=50,  # Save a checkpoint every 50 steps
            save_total_limit=3,  # Keep at most 3 checkpoints
            output_dir="outputs",
            report_to="none",
            push_to_hub=False,
        )
        early_stopping_callback = EarlyStoppingCallback(patience=3, min_delta=0.01, log_file=log_file)
        orpo_trainer = ORPOTrainer(
            model=model,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=tokenizer,
            args=trainer_args,
            callbacks=[LoggingCallback(log_file), early_stopping_callback],
        )
        
        orpo_trainer.train()
        trainer_stats_eval = orpo_trainer.evaluate()

        with open(f"Llama-3.2-1B-Instruct_evaluation_results/200steps/{params}.json", "w") as f:
            json.dump(trainer_stats_eval, f, indent=4)
        
        eval_loss = trainer_stats_eval.get("eval_loss")
        print(f"Eval loss: {eval_loss}")
        
        # Skip the comparison if evaluation returned no loss
        if eval_loss is not None and eval_loss < best_eval_loss:
            best_eval_loss = eval_loss
            best_model = orpo_trainer.model
            best_params = param_dict

        torch.cuda.empty_cache()
            
    print(f"Best params: {best_params}")
    print(f"Best eval loss: {best_eval_loss}")
    
    return best_model, best_params

best_model, best_params = grid_search(param_grid, basemodel, train_dataset, eval_dataset, tokenizer, max_seq_length)


Training with params: (0.0001, 0.01, (2, 4), 0.001)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,1.7478,-0.001142,-0.001376,0.5875,0.000234,-1.375923,-1.142333,2.382146,2.413243
100,1.5843,-0.001083,-0.001339,0.5625,0.000255,-1.338838,-1.083367,2.588765,2.635434
150,1.5247,-0.001065,-0.001306,0.5425,0.000241,-1.305627,-1.064596,2.494424,2.528955
200,1.5093,-0.00108,-0.001324,0.575,0.000244,-1.323956,-1.080048,2.521197,2.561554
250,1.4984,-0.001088,-0.001312,0.5625,0.000225,-1.312157,-1.087566,2.553031,2.603623
300,1.5036,-0.001071,-0.001308,0.5675,0.000237,-1.307503,-1.070946,2.523315,2.560645
350,1.5101,-0.001061,-0.001319,0.595,0.000258,-1.318729,-1.06089,2.531106,2.566873
400,1.5748,-0.001119,-0.001367,0.56,0.000248,-1.367105,-1.118989,2.425816,2.459642
450,1.4731,-0.001071,-0.001324,0.605,0.000253,-1.324051,-1.070935,2.452423,2.505136
500,1.5181,-0.001113,-0.001348,0.5575,0.000235,-1.348146,-1.113424,2.413728,2.458897
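
Reading the table above: the reward columns appear to be the log-probability columns scaled by `beta`, and the margin is the chosen reward minus the rejected reward. The check below uses the step-50 row, with values copied from the log and `beta = 0.001` inferred from the last entry of the printed parameter tuple for this run.

```python
beta = 0.001  # inferred from the "Training with params" line for this run

# logps / chosen and logps / rejected from the step-50 row
logps_chosen, logps_rejected = -1.142333, -1.375923

rewards_chosen = beta * logps_chosen
rewards_rejected = beta * logps_rejected
margin = rewards_chosen - rewards_rejected

print(round(rewards_chosen, 6), round(rewards_rejected, 6), round(margin, 6))
# -0.001142 -0.001376 0.000234
```

These agree with the logged `rewards / chosen`, `rewards / rejected` and `rewards / margins` values, which also explains why the reward magnitudes grow with `beta` across runs.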


Eval loss: 1.3879128694534302
Training with params: (0.0001, 0.01, (2, 4), 0.01)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,1.3637,-0.00993,-0.012417,0.5875,0.002487,-1.241725,-0.992993,2.408828,2.44127
100,1.3557,-0.009688,-0.012536,0.575,0.002847,-1.253559,-0.968834,2.441151,2.481204
150,1.3289,-0.009585,-0.012271,0.5425,0.002686,-1.227147,-0.958529,2.291746,2.317281
200,1.3249,-0.009756,-0.01249,0.5875,0.002734,-1.249009,-0.975581,2.303982,2.339783
250,1.3274,-0.009943,-0.012467,0.5725,0.002524,-1.246695,-0.99433,2.34075,2.386721
300,1.3407,-0.009768,-0.012391,0.5775,0.002624,-1.239128,-0.976777,2.302182,2.33456
350,1.354,-0.009749,-0.012609,0.6025,0.00286,-1.260927,-0.974931,2.356633,2.383903
400,1.4217,-0.010298,-0.013049,0.565,0.002751,-1.304897,-1.029774,2.236087,2.263104
450,1.3283,-0.009828,-0.012646,0.6075,0.002817,-1.264562,-0.98282,2.248538,2.296133
500,1.3811,-0.010268,-0.012942,0.5625,0.002675,-1.294206,-1.026754,2.189991,2.230329


Eval loss: 1.3656256198883057
Training with params: (0.0001, 0.01, (2, 4), 0.1)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,1.2772,-0.09012,-0.119572,0.6,0.029452,-1.19572,-0.901196,2.029741,2.050576
100,1.2806,-0.089053,-0.123908,0.59,0.034855,-1.239081,-0.890526,2.032372,2.057676
150,1.2555,-0.087516,-0.12133,0.555,0.033813,-1.213297,-0.875162,1.859562,1.876354
200,1.2512,-0.089124,-0.125057,0.6175,0.035933,-1.25057,-0.89124,1.89465,1.919539
250,1.2581,-0.091866,-0.125955,0.595,0.034088,-1.259546,-0.918664,1.897572,1.934195
300,1.2689,-0.089697,-0.12519,0.5925,0.035493,-1.251902,-0.896968,1.868128,1.895244
350,1.2812,-0.089592,-0.12865,0.61,0.039057,-1.286498,-0.895923,1.909818,1.929138
400,1.3492,-0.095192,-0.133258,0.5775,0.038067,-1.332585,-0.951915,1.830441,1.850429
450,1.2571,-0.09046,-0.131857,0.625,0.041397,-1.318568,-0.904597,1.854171,1.898252
500,1.3118,-0.094638,-0.135227,0.59,0.040589,-1.352266,-0.94638,1.79719,1.834437




Eval loss: 1.397763967514038
Training with params: (0.0001, 0.01, (2, 4), 0.9)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,1.4572,-0.730829,-1.333901,0.6175,0.603072,-1.482112,-0.812032,1.446371,1.479285
100,1.4524,-0.741501,-1.681123,0.615,0.939622,-1.867915,-0.82389,1.454788,1.500862
150,1.4181,-0.719383,-1.880114,0.5825,1.160732,-2.089016,-0.799314,1.324116,1.373404
200,1.3894,-0.731765,-2.178193,0.6375,1.446428,-2.420214,-0.813072,1.446839,1.494375
250,1.4039,-0.764263,-2.441005,0.61,1.676742,-2.712228,-0.849181,1.489951,1.538456
300,1.4208,-0.749113,-2.414694,0.605,1.665581,-2.682994,-0.832348,1.444896,1.486477
350,1.4214,-0.76331,-2.78544,0.6225,2.02213,-3.094933,-0.848123,1.440583,1.475994
400,1.5012,-0.805606,-3.046499,0.5875,2.240893,-3.384999,-0.895117,1.39225,1.42756
450,1.3903,-0.767048,-3.405691,0.635,2.638643,-3.784101,-0.852275,1.460699,1.520495
500,1.4512,-0.796813,-3.866998,0.605,3.070185,-4.296665,-0.885348,1.332865,1.415095




Eval loss: 1.64195716381073
Training with params: (0.0001, 0.01, (2, 4), 2.0)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,1.5921,-1.524403,-14.055321,0.6125,12.530917,-7.02766,-0.762201,0.944818,1.120096
100,1.613,-1.488198,-12.656446,0.6225,11.168244,-6.328223,-0.744099,0.944592,1.045915
150,1.5942,-1.413477,-13.094785,0.59,11.681309,-6.547392,-0.706739,0.718642,0.855194
200,1.5329,-1.456135,-14.182914,0.6425,12.726777,-7.091457,-0.728067,0.842825,0.987395
250,1.5777,-1.53142,-14.894978,0.6125,13.36356,-7.447489,-0.76571,0.843351,1.015646
300,1.5822,-1.477079,-13.592493,0.6125,12.115414,-6.796247,-0.738539,0.810586,0.960533
350,1.5749,-1.494323,-16.365746,0.6275,14.871422,-8.182873,-0.747161,0.757113,0.93286
400,1.6817,-1.581256,-15.669941,0.595,14.088687,-7.83497,-0.790628,0.739944,0.91195
450,1.5465,-1.522033,-18.670837,0.64,17.148804,-9.335419,-0.761017,0.823943,1.046305
500,1.6497,-1.627541,-14.817248,0.615,13.189706,-7.408624,-0.813771,0.794393,0.967686




Eval loss: 1.9793148040771484
Training with params: (0.0001, 0.01, (1, 8), 0.001)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.8626,-0.00063,-0.009213,0.62,0.008582,-9.212705,-0.630378,0.585989,0.815152
100,0.883,-0.000644,-0.007239,0.6225,0.006595,-7.239234,-0.644373,0.603753,0.739363
150,0.8465,-0.000605,-0.006472,0.5925,0.005867,-6.472136,-0.604685,0.427954,0.530881
200,0.8586,-0.000626,-0.006172,0.6425,0.005545,-6.171585,-0.626406,0.583249,0.673533
250,0.8695,-0.000657,-0.005839,0.615,0.005182,-5.839099,-0.657463,0.644385,0.751929
300,0.8658,-0.000633,-0.005372,0.62,0.004739,-5.372279,-0.633223,0.576614,0.660346
350,0.8982,-0.000648,-0.005816,0.6325,0.005168,-5.816303,-0.647902,0.540329,0.635141
400,0.945,-0.000693,-0.005799,0.595,0.005106,-5.79935,-0.693496,0.505045,0.592577
450,0.885,-0.000659,-0.006269,0.6425,0.00561,-6.268868,-0.659067,0.631192,0.744196
500,0.9378,-0.000698,-0.00568,0.615,0.004982,-5.68003,-0.698414,0.589461,0.689567


Eval loss: 1.3660227060317993
Training with params: (0.0001, 0.01, (1, 8), 0.01)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.7231,-0.005253,-0.031075,0.62,0.025822,-3.107468,-0.525297,0.569135,0.628043
100,0.7378,-0.005366,-0.041846,0.625,0.036481,-4.184639,-0.536568,0.364579,0.411229
150,0.7059,-0.005025,-0.043188,0.5925,0.038164,-4.318846,-0.502481,0.119405,0.165449
200,0.7225,-0.00521,-0.042438,0.6425,0.037228,-4.243828,-0.521044,0.258592,0.292341
250,0.7297,-0.005482,-0.044427,0.615,0.038945,-4.442725,-0.548183,0.311901,0.368998
300,0.7276,-0.005297,-0.042415,0.62,0.037118,-4.241513,-0.529748,0.240948,0.289453
350,0.7604,-0.005488,-0.044829,0.6325,0.039341,-4.482898,-0.548784,0.203859,0.257054
400,0.8044,-0.005933,-0.045362,0.595,0.03943,-4.536241,-0.593265,0.176827,0.229229
450,0.7501,-0.005595,-0.049577,0.6425,0.043982,-4.957729,-0.559526,0.310373,0.37729
500,0.8057,-0.005984,-0.046893,0.615,0.040909,-4.689293,-0.598411,0.308592,0.373567


Eval loss: 1.3788272142410278
Training with params: (0.0001, 0.01, (1, 8), 0.1)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.6108,-0.043419,-0.494772,0.62,0.451353,-4.947722,-0.434189,0.205001,0.297087
100,0.6261,-0.044416,-0.615585,0.625,0.57117,-6.155853,-0.444158,-0.038974,0.04287
150,0.6009,-0.040924,-0.600959,0.5925,0.560035,-6.009586,-0.409241,-0.258927,-0.18143
200,0.6163,-0.042665,-0.607723,0.6425,0.565057,-6.077228,-0.426655,-0.152099,-0.090804
250,0.6214,-0.044473,-0.618448,0.615,0.573975,-6.184476,-0.444731,-0.058413,0.015006
300,0.6262,-0.043713,-0.582821,0.62,0.539108,-5.828215,-0.437135,-0.137722,-0.062719
350,0.6545,-0.045668,-0.613252,0.6325,0.567584,-6.132521,-0.456681,-0.234353,-0.140698
400,0.7027,-0.050575,-0.588356,0.595,0.537781,-5.883556,-0.50575,-0.226018,-0.132907
450,0.6459,-0.046673,-0.682362,0.6425,0.635689,-6.823617,-0.466728,-0.116627,-0.006065
500,0.7044,-0.051016,-0.63125,0.615,0.580235,-6.312502,-0.510157,-0.085803,0.023328


Eval loss: 1.4226570129394531
Training with params: (0.0001, 0.01, (1, 8), 0.9)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.7269,-0.334182,-7.527733,0.62,7.193552,-8.364149,-0.371313,-0.178427,0.040583
100,0.7299,-0.328308,-7.532058,0.625,7.20375,-8.368953,-0.364786,-0.404357,-0.251681
150,0.7278,-0.300169,-7.333024,0.5925,7.032854,-8.147804,-0.333521,-0.565752,-0.408613
200,0.7186,-0.316077,-7.605981,0.6425,7.289904,-8.45109,-0.351197,-0.490122,-0.338472
250,0.7371,-0.328003,-7.663319,0.615,7.335315,-8.514798,-0.364448,-0.409418,-0.234822
300,0.7478,-0.329106,-6.513966,0.62,6.184861,-7.237741,-0.365673,-0.465155,-0.312895
350,0.764,-0.346614,-7.283051,0.6325,6.936436,-8.092279,-0.385127,-0.575099,-0.395394
400,0.8318,-0.389737,-7.427914,0.595,7.038177,-8.253238,-0.433041,-0.601088,-0.39352
450,0.751,-0.35281,-8.067813,0.6425,7.715002,-8.964236,-0.392011,-0.492751,-0.276853
500,0.82,-0.392216,-7.939835,0.615,7.54762,-8.82204,-0.435795,-0.455347,-0.223068


Eval loss: 1.6881647109985352
Training with params: (0.0001, 0.01, (1, 8), 2.0)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.9515,-0.639577,-20.904217,0.62,20.264639,-10.452108,-0.319789,-0.339041,-0.051481
100,0.9424,-0.619903,-19.64204,0.625,19.022137,-9.82102,-0.309951,-0.573578,-0.380066
150,0.9659,-0.569933,-20.364382,0.5925,19.794449,-10.182191,-0.284967,-0.710533,-0.501456
200,0.9137,-0.591434,-19.340389,0.6425,18.748955,-9.670195,-0.295717,-0.663395,-0.479188
250,0.9522,-0.604709,-19.559168,0.615,18.954458,-9.779584,-0.302355,-0.571534,-0.373661
300,0.9689,-0.620217,-17.892431,0.62,17.272215,-8.946216,-0.310108,-0.640094,-0.450634
350,0.9633,-0.653111,-19.353935,0.6325,18.700823,-9.676968,-0.326555,-0.805464,-0.56697
400,1.0646,-0.743018,-19.100615,0.595,18.357597,-9.550307,-0.371509,-0.764733,-0.530739
450,0.9487,-0.673401,-22.516376,0.6425,21.842976,-11.258188,-0.336701,-0.699673,-0.436235
500,1.0308,-0.74907,-21.198658,0.615,20.449587,-10.599329,-0.374535,-0.677275,-0.384156


Eval loss: 2.042663335800171
Training with params: (0.0001, 0.02, (2, 4), 0.001)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.3518,-0.000282,-0.01222,0.62,0.011938,-12.220299,-0.282275,-0.502022,-0.153983
100,0.3462,-0.000264,-0.011314,0.625,0.01105,-11.31424,-0.264274,-0.781172,-0.534611
150,0.3268,-0.00024,-0.009765,0.5925,0.009525,-9.765262,-0.24,-0.888124,-0.687927
200,0.3445,-0.000245,-0.009463,0.6425,0.009219,-9.463454,-0.244564,-0.845642,-0.660724
250,0.3414,-0.000249,-0.009703,0.615,0.009454,-9.702851,-0.24852,-0.671326,-0.486712
300,0.3539,-0.000253,-0.008992,0.62,0.008739,-8.992418,-0.253206,-0.740657,-0.567491
350,0.3695,-0.000272,-0.008941,0.6325,0.008669,-8.940903,-0.271514,-0.93295,-0.731738
400,0.4119,-0.000306,-0.008623,0.595,0.008317,-8.62315,-0.306219,-0.873476,-0.674234
450,0.3634,-0.000275,-0.009656,0.6425,0.009381,-9.655551,-0.274827,-0.80321,-0.604441
500,0.4063,-0.000312,-0.00926,0.615,0.008948,-9.259563,-0.311869,-0.80308,-0.590282


Eval loss: 1.4227930307388306
Training with params: (0.0001, 0.02, (2, 4), 0.01)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.295,-0.002296,-0.063602,0.62,0.061306,-6.360178,-0.229576,-0.440966,-0.327079
100,0.2918,-0.002114,-0.07365,0.625,0.071536,-7.365039,-0.211438,-0.78887,-0.691337
150,0.2782,-0.001941,-0.073859,0.5925,0.071918,-7.385903,-0.194111,-0.91455,-0.810568
200,0.2838,-0.001917,-0.073195,0.6425,0.071279,-7.319517,-0.191656,-0.893602,-0.799573
250,0.2878,-0.002014,-0.074251,0.615,0.072236,-7.425062,-0.201449,-0.756161,-0.65672
300,0.2957,-0.002039,-0.06855,0.62,0.066511,-6.855024,-0.203886,-0.81334,-0.729423
350,0.3115,-0.00217,-0.069891,0.6325,0.067721,-6.989083,-0.216953,-0.967332,-0.870516
400,0.3545,-0.002589,-0.066454,0.595,0.063865,-6.645372,-0.258895,-0.935195,-0.838142
450,0.3078,-0.002244,-0.074641,0.6425,0.072397,-7.464118,-0.224371,-0.812798,-0.724798
500,0.3442,-0.002552,-0.07361,0.615,0.071058,-7.361032,-0.255183,-0.844874,-0.734165


Eval loss: 1.43867826461792
Training with params: (0.0001, 0.02, (2, 4), 0.1)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.2744,-0.018814,-0.646501,0.62,0.627687,-6.465011,-0.18814,-0.487096,-0.384813
100,0.2779,-0.018004,-0.746297,0.625,0.728293,-7.462967,-0.18004,-0.906903,-0.812653
150,0.2656,-0.01613,-0.738068,0.5925,0.721937,-7.380676,-0.161304,-1.083008,-0.980529
200,0.2632,-0.015411,-0.710305,0.6425,0.694894,-7.103046,-0.154105,-1.030137,-0.93064
250,0.2677,-0.016043,-0.757413,0.615,0.74137,-7.57413,-0.160432,-0.895272,-0.78164
300,0.2796,-0.017156,-0.704821,0.62,0.687664,-7.048207,-0.171562,-0.964437,-0.865169
350,0.2872,-0.017564,-0.708225,0.6325,0.690661,-7.082249,-0.175639,-1.069925,-0.953163
400,0.3275,-0.021564,-0.673988,0.595,0.652424,-6.739877,-0.215635,-0.985712,-0.876677
450,0.2906,-0.018822,-0.771875,0.6425,0.753052,-7.718745,-0.188223,-0.901426,-0.792427
500,0.3178,-0.021405,-0.749586,0.615,0.728181,-7.495862,-0.214054,-0.949261,-0.816534


Eval loss: 1.4844095706939697
Training with params: (0.0001, 0.02, (2, 4), 0.9)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.4657,-0.146171,-7.923123,0.62,7.776953,-8.803472,-0.162413,-0.674238,-0.481287
100,0.4595,-0.137797,-8.502913,0.625,8.365116,-9.447681,-0.153108,-1.044871,-0.867051
150,0.4674,-0.124831,-8.814171,0.5925,8.68934,-9.793524,-0.138701,-1.177259,-0.98713
200,0.4334,-0.117264,-8.452686,0.6425,8.335422,-9.391874,-0.130294,-1.117544,-0.955455
250,0.4575,-0.125163,-8.657537,0.615,8.532374,-9.619486,-0.139071,-1.093704,-0.913289
300,0.4628,-0.128676,-7.952662,0.62,7.823986,-8.836291,-0.142973,-1.021079,-0.870495
350,0.4666,-0.140126,-8.247561,0.6325,8.107436,-9.163958,-0.155695,-1.286301,-1.090809
400,0.5213,-0.165369,-7.369707,0.595,7.204339,-8.188564,-0.183743,-1.13349,-0.954728
450,0.4625,-0.143919,-8.486766,0.6425,8.342847,-9.429741,-0.15991,-1.035325,-0.853719
500,0.4952,-0.159763,-8.339643,0.615,8.17988,-9.266269,-0.177515,-1.089518,-0.881906


Eval loss: 1.7550395727157593
Training with params: (0.0001, 0.02, (2, 4), 2.0)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.7383,-0.279398,-21.60532,0.62,21.325922,-10.80266,-0.139699,-0.80016,-0.518357
100,0.734,-0.280507,-21.790077,0.625,21.509571,-10.895039,-0.140254,-1.132684,-0.884574
150,0.7626,-0.253313,-21.963451,0.5925,21.710138,-10.981726,-0.126657,-1.216551,-0.965483
200,0.6917,-0.242635,-20.948713,0.6425,20.706078,-10.474357,-0.121318,-1.151841,-0.928035
250,0.7347,-0.249869,-19.573755,0.615,19.323887,-9.786878,-0.124934,-1.129387,-0.919172
300,0.7394,-0.267685,-18.520061,0.62,18.252377,-9.260031,-0.133843,-1.075913,-0.885086
350,0.7317,-0.282604,-19.890509,0.6325,19.607904,-9.945254,-0.141302,-1.273348,-1.04275
400,0.8076,-0.325323,-18.446497,0.595,18.121178,-9.223248,-0.162661,-1.226911,-1.009869
450,0.7151,-0.282109,-21.057507,0.6425,20.775398,-10.528753,-0.141055,-1.13827,-0.919744
500,0.7671,-0.319788,-21.231043,0.615,20.911257,-10.615521,-0.159894,-1.177923,-0.923943


Eval loss: 2.1086652278900146
Training with params: (0.0001, 0.02, (1, 8), 0.001)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.179,-0.000116,-0.012961,0.62,0.012844,-12.960618,-0.11643,-0.878258,-0.537536
100,0.1789,-0.000118,-0.012116,0.625,0.011998,-12.115968,-0.118228,-1.149025,-0.85787
150,0.1718,-0.000102,-0.011887,0.5925,0.011784,-11.886683,-0.102418,-1.240123,-0.956056
200,0.1753,-0.000105,-0.011525,0.6425,0.01142,-11.525182,-0.104818,-1.15289,-0.898353
250,0.173,-0.000105,-0.011524,0.615,0.011419,-11.523682,-0.105159,-1.112592,-0.863514
300,0.1837,-0.000111,-0.010291,0.62,0.01018,-10.290955,-0.111292,-1.098573,-0.895948
350,0.1941,-0.00012,-0.009723,0.6325,0.009603,-9.723393,-0.12017,-1.238872,-1.045071
400,0.214,-0.000138,-0.009223,0.595,0.009085,-9.222941,-0.137611,-1.113855,-0.92788
450,0.1885,-0.000119,-0.010028,0.6425,0.009909,-10.028216,-0.119148,-1.094885,-0.911474
500,0.2005,-0.000132,-0.009742,0.615,0.009611,-9.742025,-0.13152,-1.110834,-0.906124


Eval loss: 1.4864161014556885
Training with params: (0.0001, 0.02, (1, 8), 0.01)


max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 53,483 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
50,0.1579,-0.000934,-0.083549,0.62,0.082614,-8.354877,-0.093435,-0.678119,-0.499114
100,0.1621,-0.000983,-0.092865,0.625,0.091883,-9.286546,-0.098253,-0.986597,-0.806207
150,0.1557,-0.000884,-0.094796,0.5925,0.093912,-9.479593,-0.088376,-1.172372,-0.972613
200,0.1606,-0.000886,-0.09468,0.6425,0.093794,-9.467976,-0.088583,-1.136629,-0.948647
250,0.1609,-0.000908,-0.096397,0.615,0.095489,-9.639675,-0.09078,-1.109544,-0.91231
300,0.1704,-0.000961,-0.08776,0.62,0.086799,-8.775965,-0.096104,-1.02623,-0.865286
350,0.1764,-0.000997,-0.085126,0.6325,0.084129,-8.512603,-0.099745,-1.159544,-0.989924
400,0.1951,-0.001183,-0.080982,0.595,0.079799,-8.098198,-0.118334,-1.110835,-0.945653
450,0.1757,-0.001037,-0.08981,0.6425,0.088772,-8.980973,-0.103727,-1.055765,-0.904518


KeyboardInterrupt: 
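
The runs above differ mainly in the last element of the printed parameter tuple, which behaves like `beta`: in every table, `rewards / chosen` and `rewards / rejected` are the matching `logps` columns multiplied by that value, and `rewards / margins` is chosen minus rejected. Here is a minimal sketch that re-checks this on the step-50 rows and ranks the completed runs by eval loss. All numbers are copied from the log above. The helper name `rewards_match` is ours, and the first `Eval loss` line is left out because its parameters were printed before this part of the output.

```python
import math

# Step-50 rows copied from the log, keyed by beta:
# (rewards/chosen, rewards/rejected, rewards/margins, logps/rejected, logps/chosen)
step50 = {
    0.1: (-0.018814, -0.646501, 0.627687, -6.465011, -0.18814),
    0.9: (-0.146171, -7.923123, 7.776953, -8.803472, -0.162413),
    2.0: (-0.279398, -21.60532, 21.325922, -10.80266, -0.139699),
}

def rewards_match(beta, row, tol=1e-5):
    """Check reward = beta * logp and margin = chosen - rejected (up to log rounding)."""
    r_chosen, r_rejected, margin, lp_rejected, lp_chosen = row
    return (
        math.isclose(r_chosen, beta * lp_chosen, abs_tol=tol)
        and math.isclose(r_rejected, beta * lp_rejected, abs_tol=tol)
        and math.isclose(margin, r_chosen - r_rejected, abs_tol=tol)
    )

for beta, row in step50.items():
    print(f"beta={beta}: consistent={rewards_match(beta, row)}")

# Eval loss of each completed run, keyed by ((batch size, grad accumulation), beta).
eval_loss = {
    ((2, 4), 0.1): 1.4844095706939697,
    ((2, 4), 0.9): 1.7550395727157593,
    ((2, 4), 2.0): 2.1086652278900146,
    ((1, 8), 0.001): 1.4864161014556885,
}

ranked = sorted(eval_loss.items(), key=lambda kv: kv[1])
best_config, best_loss = ranked[0]
for config, loss in ranked:
    print(f"{config}: eval loss {loss:.4f}")
```

Keep in mind that the loss itself likely contains a `beta`-weighted term, so a lower eval loss at a smaller `beta` is not by itself proof of a better model. Comparing generations is still worthwhile.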

In [None]:
# from trl import ORPOConfig, ORPOTrainer
# from unsloth import is_bfloat16_supported

# orpo_trainer = ORPOTrainer(
#     model = model,
#     train_dataset = train_dataset,
#     eval_dataset= eval_dataset,
#     tokenizer = tokenizer,
#     args = ORPOConfig(
#         warmup_steps = 5,
#         learning_rate = 2e-4,
#         weight_decay = 0.01,
#         seed = 3407,
#         max_length = max_seq_length,
#         max_prompt_length = max_seq_length//2,
#         max_completion_length = max_seq_length//2,
#         per_device_train_batch_size = 2,
#         gradient_accumulation_steps = 4,
#         beta = 0.1,
#         logging_steps = 100,
#         optim = "adamw_8bit",
#         lr_scheduler_type = "linear",
#         num_train_epochs = 1,
#         # max_steps = 30, # Change to num_train_epochs = 1 for full training runs
#         fp16 = not is_bfloat16_supported(),
#         bf16 = is_bfloat16_supported(),
#         output_dir = "outputs",
#         report_to = "none", # Use this for WandB etc
#         push_to_hub=False,
#     ),
# )

In [None]:
# # orpo_trainer.train(resume_from_checkpoint="outputs/checkpoint-6685")
# orpo_trainer.train()

In [None]:
# trainer_stats_eval = orpo_trainer.evaluate()

# for key, value in trainer_stats_eval.items():
#     print(f"{key}: {value}")

# import json
# with open("Llama-3.2-1B-Instruct_evaluation_results.json", "w") as f:
#     json.dump(trainer_stats_eval, f, indent=4)

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
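
`tokenizer.batch_decode(outputs)` returns the prompt and the completion together in one string. Below is a minimal sketch of pulling out only the response part. It assumes the prompt ends with a `### Response:` header, as in the usual Alpaca template. `ALPACA_PROMPT` here is a hypothetical stand-in for the `alpaca_prompt` defined earlier, and both the completion text and the `<|eot_id|>` token are made up for illustration.

```python
# Hypothetical stand-in for the notebook's `alpaca_prompt`; the real template
# is defined earlier in the notebook and may differ.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

RESPONSE_MARKER = "### Response:"  # assumed header; adjust to your template

def extract_response(decoded: str, eos_token: str = "") -> str:
    """Return the text after the last response marker, minus an optional EOS token."""
    _, found, tail = decoded.rpartition(RESPONSE_MARKER)
    if not found:
        # Marker missing: fall back to the whole string, trimmed.
        return decoded.strip()
    if eos_token:
        tail = tail.replace(eos_token, "")
    return tail.strip()

prompt = ALPACA_PROMPT.format("Continue the Fibonacci sequence.", "1, 1, 2, 3, 5, 8", "")
decoded = prompt + "\n13, 21, 34<|eot_id|>"  # made-up completion, not real model output
print(extract_response(decoded, eos_token="<|eot_id|>"))
```

With a real model you would call it as `extract_response(tokenizer.batch_decode(outputs)[0], tokenizer.eos_token)`.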

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# model.save_pretrained("orpo_model") # Local saving
# tokenizer.save_pretrained("orpo_model")
model.push_to_hub("AMomozZz/orpo_model1", token = "") # Online saving - supply your own token, never commit a real one
tokenizer.push_to_hub("AMomozZz/orpo_model1", token = "") # Online saving
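
`push_to_hub` needs a Hugging Face access token with write permission. Here is a minimal sketch of reading it from an environment variable so it never appears in the notebook itself. The variable name `HF_TOKEN` and the helper `get_hf_token` are assumptions for illustration, and `your_name/orpo_model` is a placeholder repo.

```python
import os

def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Fetch the access token from the environment, failing loudly if it is absent."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before pushing to the Hub."
        )
    return token

# Usage (commented out so this cell runs without a token):
# model.push_to_hub("your_name/orpo_model", token = get_hf_token())
# tokenizer.push_to_hub("your_name/orpo_model", token = get_hf_token())
```

If a token was ever pasted into a shared or published notebook, revoke it at https://huggingface.co/settings/tokens and create a new one.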

To load the LoRA adapters we just saved for inference, leave the guard below as `True` (set it to `False` to keep using the model already in memory):

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "AMomozZz/orpo_model1", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
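
To get a feel for what each `save_method` writes, here is a back-of-the-envelope size estimate. It is only a rough sketch: `base_params` is a hypothetical round number rather than the exact size of any model, `11,272,192` is the trainable parameter count printed in the training banner, and the bytes-per-parameter figures (including 32-bit floats for the adapters) are assumptions that ignore quantization metadata and file overhead.

```python
# Approximate bytes stored per parameter for the merged save methods (assumed).
BYTES_PER_PARAM = {"merged_16bit": 2.0, "merged_4bit": 0.5}

def approx_size_mb(n_params, bytes_per_param):
    """Rough on-disk size in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6

base_params = 1_000_000_000  # hypothetical base model size
lora_params = 11_272_192     # trainable parameters reported in the training log

for method, bpp in BYTES_PER_PARAM.items():
    print(f"{method}: ~{approx_size_mb(base_params, bpp):,.0f} MB")

# Adapters only, assuming 32-bit floats (4 bytes per parameter).
print(f"lora: ~{approx_size_mb(lora_params, 4.0):,.1f} MB")
```

The adapters come out orders of magnitude smaller than either merged model, which is why the `lora` option is quick to upload but still needs the base model at load time.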

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
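
Before handing the file to `llama.cpp` or GPT4All, a quick sanity check can catch a truncated or mislabeled download. GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below reads just those 8 bytes. The helper name `read_gguf_version` is ours, little-endian byte order is assumed, and the path in the usage comment is a guess at where your file ended up.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_version(path):
    """Return the GGUF version number, or raise ValueError if the header is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4 magic bytes + uint32 version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # assumes little-endian
    return version

# Usage (adjust the path to wherever the conversion wrote the file):
# print(read_gguf_version("model-unsloth-Q4_K_M.gguf"))
```

This only validates the header. It does not verify the tensors or the quantization type.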

And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or need help or want to join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>