**To** run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] KTO support is finally here 

KTO aligns models using binary feedback (upvote/downvote). Depending on how good your base model is, you may or may not need to do SFT before KTO. This differs from standard RLHF and DPO, which always require SFT (from the Hugging Face [docs](https://huggingface.co/docs/trl/v0.12.1/kto_trainer)).

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
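
To make the RoPE scaling bullet concrete, here is a minimal sketch of linear position scaling, the idea behind kaiokendev's method: positions are divided by a scale factor, so a longer sequence maps back into the position range the model saw during training. The function name `rope_angles`, the context lengths, and the tiny `head_dim` are illustrative assumptions, not Unsloth's internal code.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one position, with optional linear position scaling."""
    scaled_position = position / scale  # scale > 1 squeezes positions together
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


# Hypothetical numbers: a model trained on 2048 tokens, used at 8192 tokens.
trained_ctx = 2048
target_ctx = 8192
scale = target_ctx / trained_ctx  # 4.0

# With scaling, position 8188 gets the same angles as position 2047 did in training.
scaled = rope_angles(8188, head_dim=8, scale=scale)
unscaled = rope_angles(2047, head_dim=8)
print(scaled == unscaled)
```

The trade-off is resolution: neighbouring tokens end up closer together in angle space, which is why finetuning at the longer length is usually needed afterwards.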

In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Gemma2 patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.9 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.
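
As a sanity check on how many parameters LoRA actually trains, the count can be worked out by hand: each adapted linear layer adds two low-rank matrices, for a total of `r * (in_features + out_features)` weights. The layer shapes below are assumptions taken from the public Gemma-2-9B configuration (they are not printed anywhere in this notebook), so treat this as a sketch rather than an authoritative breakdown.

```python
# Assumed Gemma-2-9B shapes (from the public model config, not from this notebook).
hidden_size = 3584
intermediate_size = 14336
q_out = 16 * 256   # num_attention_heads * head_dim
kv_out = 8 * 256   # num_key_value_heads * head_dim
num_layers = 42
r = 64             # LoRA rank used in the cell above

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (hidden_size, q_out),
    "k_proj": (hidden_size, kv_out),
    "v_proj": (hidden_size, kv_out),
    "o_proj": (q_out, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) weights.
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 216,072,192
```

Under these assumed shapes the result equals the "Number of trainable parameters" value that the trainer prints later in this notebook, which is roughly 2% of a model with about 9 billion parameters.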


KTO requires a chat template. Since we are training a Gemma 2 model, let's use the `gemma` template!

In [3]:
from unsloth import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma",
)

# new_eos_token = "<|endoftext|>"

# tokenizer.eos_token = new_eos_token
# tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids(new_eos_token)

Unsloth: Will map <end_of_turn> to EOS = <eos>.
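
For intuition, here is a rough sketch of what a Gemma-style chat template produces for a list of messages. It is a simplified approximation written for illustration (the real template is a Jinja template shipped with the tokenizer and handles more cases), and `render_gemma_prompt` is a hypothetical helper, not part of Unsloth or Transformers.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the Gemma chat format for a list of {'role', 'content'} messages."""
    role_names = {"user": "user", "assistant": "model"}  # Gemma calls the assistant "model"
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues from here
        text += "<start_of_turn>model\n"
    return text


prompt = render_gemma_prompt([{"role": "user", "content": "Who are you"}])
print(prompt)
```

This is the same marker layout that shows up in the raw `prompt` strings of the dataset used below.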


<a name="Data"></a>
### Data Prep
We now use a KTO style dataset. The expected format follows [trl-lib/kto-mix-14k](https://huggingface.co/datasets/trl-lib/kto-mix-14k).

You need at least 3 columns:
* prompt
* completion
* label

For example:
* prompt: `[{"content" : "Who are you", "role" : "user"}]`
* completion: `[{"content" : "Hello! I am a helpful assistant here to help you with your problem. What's going on?", "role" : "assistant"}]`
* label: `True`

Each prompt is a list of messages with 'content' and 'role', representing a conversation. The completions are potential responses to these prompts, also structured as lists with 'content' and 'role'. The labels indicate whether each completion is accepted (True) or rejected (False).

The goal of [Kahneman-Tversky Optimization (KTO) alignment](http://arxiv.org/pdf/2402.01306.pdf) is to directly maximize the utility of language model generations by incorporating human psychological biases into the optimization process. KTO aims to create language models that are better aligned with human values, needs, and decision-making processes without relying on preference data. It achieves this by using a binary signal of whether an output is desirable or undesirable, making it easier and more cost-effective to align models at scale compared to methods that require preference pairs.
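
Because KTO only needs one binary label per completion, an existing preference pair can be unrolled into two independent rows. The sketch below shows one way to do that; `pair_to_kto_rows` is a hypothetical helper and the example strings are made up for illustration.

```python
def pair_to_kto_rows(prompt, chosen, rejected):
    """Turn one (chosen, rejected) preference pair into two unpaired KTO rows."""
    user_turn = [{"role": "user", "content": prompt}]
    return [
        {
            "prompt": user_turn,
            "completion": [{"role": "assistant", "content": chosen}],
            "label": True,   # desirable completion
        },
        {
            "prompt": user_turn,
            "completion": [{"role": "assistant", "content": rejected}],
            "label": False,  # undesirable completion
        },
    ]


rows = pair_to_kto_rows("Who are you", "I am a helpful assistant.", "Go away.")
for row in rows:
    print(row["label"], row["completion"][0]["content"])
```

The reverse is not required: rows with only thumbs-up or thumbs-down feedback can be used as they are, without ever being paired.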

In [4]:
from datasets import load_dataset
# dataset = load_dataset("rifoag/NLP701_Assignment2_Subtask3", "EN", split="train") # Alternative dataset (not used here)
dataset = load_dataset("Erland/NLP701_Assignment2_Subtask3_KTO_Dataset_3", split="train") # Loads the full train split (440 rows)
dataset[0]

{'prompt': '<bos><start_of_turn>user\n<document>\nUkraine\'s Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n\n Ukraine\'s Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n\nIn surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.\n\n"A turning point in the war is approaching," Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. "The next year will be decisive in this regard." He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.\n\nYermak sought to assure the audience that Zelensky has "a clear plan" forward even as Western media has by and large soured on Kiev\'s prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration\

Let's print an example to see what the dataset looks like.

In [5]:
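# removeprefix/removesuffix (Python 3.9+) match the exact marker strings;
# lstrip/rstrip would instead strip any of the listed characters.
print(dataset[0]["prompt"].strip().removeprefix("<bos><start_of_turn>user").removesuffix("<end_of_turn>\n<start_of_turn>model").strip())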

<document>
Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits 

 Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits

In surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.

"A turning point in the war is approaching," Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. "The next year will be decisive in this regard." He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.

Yermak sought to assure the audience that Zelensky has "a clear plan" forward even as Western media has by and large soured on Kiev's prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration's focus is off Ukraine and on Gaza events instead.

In [6]:
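def formatting_prompt_func(example):
    # The stored prompt is already rendered with the Gemma chat template.
    # Remove the template markers and convert it back to a list of messages.
    # removeprefix/removesuffix (Python 3.9+) match the exact marker strings;
    # lstrip/rstrip would instead strip any of the listed characters.
    user_text = (
        example["prompt"]
        .strip()
        .removeprefix("<bos><start_of_turn>user")
        .removesuffix("<end_of_turn>\n<start_of_turn>model")
        .strip()
    )
    example["prompt"] = [{"role": "user", "content": user_text}]
    example["completion"] = [
        {"role": "assistant", "content": example["completion"]}
    ]
    return example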


changed_dataset = dataset.map(formatting_prompt_func)
changed_dataset

Dataset({
    features: ['prompt', 'completion', 'label', 'bertscore_f1', 'rank', 'file_name', 'categories', 'subcategories', 'reference_explanation'],
    num_rows: 440
})
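
When trimming chat-template markers from a string, keep in mind that `str.lstrip` and `str.rstrip` treat their argument as a set of characters, not as a prefix or suffix. The small sketch below contrasts that with `str.removeprefix` / `str.removesuffix` (Python 3.9+). The sample text and the `strip_template` helper are made up for illustration.

```python
PREFIX = "<bos><start_of_turn>user"
SUFFIX = "<end_of_turn>\n<start_of_turn>model"


def strip_template(text):
    """Remove the exact Gemma markers around a single user turn."""
    return text.strip().removeprefix(PREFIX).removesuffix(SUFFIX).strip()


raw = PREFIX + "\nuser text\n</category>\n" + SUFFIX + "\n"

# Character-set stripping also eats the trailing '>' of '</category>'
char_based = raw.lstrip(PREFIX).strip().rstrip(SUFFIX)
print(repr(char_based))           # 'user text\n</category'

# Exact prefix/suffix removal keeps the user text intact
print(repr(strip_template(raw)))  # 'user text\n</category>'
```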

In [7]:
changed_dataset[0]["prompt"]

[{'content': '<document>\nUkraine\'s Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n\n Ukraine\'s Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n\nIn surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.\n\n"A turning point in the war is approaching," Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. "The next year will be decisive in this regard." He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.\n\nYermak sought to assure the audience that Zelensky has "a clear plan" forward even as Western media has by and large soured on Kiev\'s prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration\'s focus is off Ukraine 

In [8]:
import pprint
row = dataset[0]
print('PROMPT: ' + '=' * 50)
pprint.pprint(row["prompt"])
print('COMPLETION: ' + '=' * 50)
pprint.pprint(row["completion"])

('<bos><start_of_turn>user\n'
 '<document>\n'
 "Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n"
 '\n'
 " Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n"
 '\n'
 'In surprisingly blunt words, a top aide to Ukrainian President Volodymyr '
 'Zelensky has warned that the coming year will essentially decide the fate of '
 'Ukraine and its war with Russia.\n'
 '\n'
 '"A turning point in the war is approaching," Andrii Yermak, who serves as '
 'chief of staff for the Office of the President of Ukraine, said Monday. "The '
 'next year will be decisive in this regard." He issued the words while '
 'appealing for more urgent aid from Washington in an address to the hawkish '
 'DC-based Hudson Institute think tank.\n'
 '\n'
 'Yermak sought to assure the audience that Zelensky has "a clear plan" '
 "forward even as Western media has by and large soured on Kiev's prospects "
 'for success. Much of this is about Zelensky sending envoys to do dama

Generally, the `completion` is the assistant's answer and the `prompt` is the user input. For a multi-turn conversation between user and assistant, put only the last assistant answer in `completion` and keep the rest of the conversation in `prompt`.
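
The multi-turn rule can be sketched as a small helper that splits a conversation into the two columns. `split_conversation` is a hypothetical name and the messages are invented for illustration.

```python
def split_conversation(messages):
    """Split a chat into KTO columns: all earlier turns vs. the last assistant turn."""
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("conversation must end with an assistant message")
    return {"prompt": messages[:-1], "completion": [messages[-1]]}


chat = [
    {"role": "user", "content": "Who are you"},
    {"role": "assistant", "content": "I am an assistant."},
    {"role": "user", "content": "Can you summarise this article?"},
    {"role": "assistant", "content": "Sure, here is a summary."},
]
row = split_conversation(chat)
print(len(row["prompt"]), row["completion"][0]["content"])
```

A `label` still has to be attached to each row separately, since it describes whether that final answer was desirable.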

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `KTOTrainer`! More docs here: [TRL KTO docs](https://huggingface.co/docs/trl/main/en/kto_trainer). This run uses `max_steps=300`, which overrides `num_train_epochs`; for exactly one pass over the data, remove `max_steps` and keep `num_train_epochs=1`. We also support TRL's `DPOTrainer` and `ORPOTrainer`!

In [9]:
# from unsloth import PatchKTOTrainer
# PatchKTOTrainer()

In [10]:
from trl import KTOConfig, KTOTrainer
from unsloth import is_bfloat16_supported

kto_trainer = KTOTrainer(
    model=model,
    train_dataset=changed_dataset,
    tokenizer=tokenizer,
    args=KTOConfig(
        # max_steps=120,
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=1e-6,
        lr_scheduler_type="linear",
        gradient_accumulation_steps=1,
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),
        logging_steps=1,
        report_to="none",
        logging_first_step=True,
        warmup_ratio=0.1,
        weight_decay=0.01,
        seed=3407,
        max_steps=300,
        optim="adamw_8bit",
        output_dir="./output",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
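
To see what this override means in practice, the relationship between steps and epochs can be worked out from the numbers used in this notebook (440 training rows, batch size 4, no gradient accumulation, 300 steps). A single GPU is assumed.

```python
import math

num_examples = 440          # rows in changed_dataset
per_device_batch = 4        # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
num_devices = 1             # assumption: a single GPU
max_steps = 300

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_covered = max_steps / steps_per_epoch

print(steps_per_epoch, round(epochs_covered, 2))  # 110 2.73
```

So 300 steps correspond to a little under three passes over the data, whereas `num_train_epochs=1` on its own would stop after 110 steps.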


In [11]:
kto_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 440 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 1
\        /    Total batch size = 4 | Total steps = 300
 "-____-"     Number of trainable parameters = 216,072,192
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
1,0.6002
2,0.6316
3,0.4518
4,0.4964
5,0.5817
6,0.6339
7,0.5718
8,0.424
9,0.4991
10,0.6155


TrainOutput(global_step=300, training_loss=0.45536333019534747, metrics={'train_runtime': 364.1999, 'train_samples_per_second': 3.295, 'train_steps_per_second': 0.824, 'total_flos': 0.0, 'train_loss': 0.45536333019534747, 'epoch': 2.7272727272727275})

<a name="Inference"></a>
### Inference
Let's run the model! We load the dev split, wrap each document and its category in the user template, and let the model generate the explanation.

In [4]:
from datasets import load_dataset

test_dataset = load_dataset("rifoag/NLP701_Assignment2_Subtask3", "EN", split="dev") # Loads the full dev split (30 rows)
test_dataset

Dataset({
    features: ['file_name', 'document', 'categories', 'subcategories', 'explanation'],
    num_rows: 30
})

In [5]:
user_template = """\
<document>
{document}
</document>

<category>
{category}
</category>
"""

In [15]:
from tqdm import tqdm

all_results = []

FastLanguageModel.for_inference(model)

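for example in tqdm(test_dataset, desc="Processing examples"):
    # Prefer the subcategory; fall back to the top-level category when it is "none"
    category = (
        example["subcategories"]
        if example["subcategories"] != "none"
        else example["categories"]
    )
    # Drop the taxonomy prefix. str.lstrip() removes a set of characters rather
    # than a prefix, so removeprefix() (Python 3.9+) is used instead.
    category = str(category).removeprefix("URW: ").removeprefix("CC: ")

    # Create conversation format
    conversation = [
        {
            "role": "user",
            "content": user_template.format(
                document=example["document"],
                category=category,
            ),
        },
    ]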

    # inputs = tokenizer.apply_chat_template(
    #     conversation, add_generation_prompt=True, tokenize=False
    # )


    inputs = tokenizer.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=True, return_tensors="pt"
    ).to(model.device)

    # Get the number of tokens in the input
    input_length = inputs.shape[1]

    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=128,
            temperature=0.3,  # only takes effect when do_sample=True is enabled below
            # min_p=0.1,
            # do_sample=True,
            use_cache=True,
        )

    raw_output = tokenizer.batch_decode(
        outputs[:, input_length:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )[0]

    # Store results for this example
    result = {
        "file_name": example["file_name"],
        "document": example["document"],
        "categories": example["categories"],
        "subcategories": example["subcategories"],
        "explanation": raw_output,
    }
    all_results.append(result)

    # Clear CUDA cache
    torch.cuda.empty_cache()


Processing examples:   0%|          | 0/30 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Processing examples:  33%|███▎      | 10/30 [01:14<02:26,  7.32s/it]

In [17]:
# Create output file
with open('article_explanations_4_instruct.txt', 'w', encoding='utf-8') as f:
    # Write one tab-separated line per article (no header row)
    for result in all_results:
        # Extract file name and explanation
        file_name = result['file_name']
        explanation = result['explanation'].split("\n")[0].strip()
        
        # Replace any tabs or newlines in explanation to prevent formatting issues
        explanation = explanation.replace('\n', ' ').replace('\t', ' ')
        
        # Write the line
        f.write(f"{file_name}\t{explanation}\n")


In [16]:
import torch
from tqdm import tqdm  # for progress bar


def process_dataset(dataset, model, tokenizer, max_new_tokens=128, batch_size=1):
    """
    Process an entire dataset for inference, one example at a time.

    Args:
        dataset: HuggingFace dataset
        model: The language model
        tokenizer: The tokenizer
        max_new_tokens: Maximum number of tokens to generate
        batch_size: Currently fixed at 1 for simplicity

    Returns:
        List of generated responses
    """
    # Enable faster inference
    FastLanguageModel.for_inference(model)

    # Store all results
    all_responses = []

    # Process each example with progress bar
    for example in tqdm(dataset, desc="Processing examples"):
        # Create conversation format
        conversation = example["prompt"]
        # Tokenize
        inputs = tokenizer.apply_chat_template(
            [conversation], return_tensors="pt", add_generation_prompt=True
        ).to("cuda")

        # Generate
        with torch.no_grad():  # Disable gradient calculation for inference
            outputs = model.generate(inputs, max_new_tokens=max_new_tokens, use_cache=True)

        # Decode
        response = tokenizer.batch_decode(outputs)[0]
        all_responses.append(response)

        # Optional: Clear CUDA cache to prevent memory issues
        torch.cuda.empty_cache()

    return all_responses


# Use the function
responses = process_dataset(test_dataset, model, tokenizer, max_new_tokens=512)

# Save results if needed
results = []
for idx, response in enumerate(responses):
    results.append(
        {
            "file_name": test_dataset[idx]["file_name"],
            "prompt": test_dataset[idx]["prompt"],
            "response": response,
        }
    )

Processing examples:   0%|          | 0/30 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Processing examples: 100%|██████████| 30/30 [07:08<00:00, 14.28s/it]


In [20]:
print(results[2]["response"])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

<article>
Wat? L.A. Mayor Garcetti Flies to Argentina to Fight Climate Change 

 Los Angeles Mayor Eric Garcetti joined a “global coalition of mayors” at a summit on climate change this week in Buenos Aires, Argentina, burning fossil fuels on an international flight rather than attending remotely and attending to pressing issues at home.

In a press release, Garcetti’s office said:

Mayor Eric Garcetti this week attended the C40 World Mayors Summit – joining mayors, climate experts, youth activists, and business leaders from around the world to advance global action to combat the climate crisis.

The C40 Cities Climate Leadership Group is an international network of nearly 100 of the world’s largest cities committed to concrete action to combat climate change. Mayor Garcetti served as Chair of C40 from 2019 to 2

In [14]:
# Build a single-turn conversation and generate a reply
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
conversation = {
    "role" : "user",
    "content" : "Can you teach me how to make a cake?",
}
inputs = tokenizer.apply_chat_template([conversation], return_tensors = "pt", add_generation_prompt=True).to("cuda")
outputs = model.generate(inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

['<|im_start|>user\nCan you teach me how to make a cake?<|im_end|>\n<|im_start|>assistant\nSure, I can help you with that. What would you like to know about the cake?<|im_end|>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [16]:
# Reuse the conversation from the previous cell and stream the output
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer.apply_chat_template([conversation], return_tensors = "pt", add_generation_prompt=True).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(inputs, streamer = text_streamer, max_new_tokens = 128)

<|im_start|>user
Can you teach me how to make a cake?<|im_end|>
<|im_start|>assistant
Sure, I can help you with that. What would you like to know about the cake?<|im_end|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [18]:
from dotenv import load_dotenv

load_dotenv()
# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")
model_name = "Erland/Gemma-Ver4-TTT-NLP701-Assignment2-Subtask3-Reward-Model"
# model.push_to_hub(model_name) # Online saving
# tokenizer.push_to_hub(model_name) # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [17]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer.apply_chat_template([conversation], return_tensors = "pt", add_generation_prompt=True).to("cuda")

outputs = model.generate(inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|im_start|>user\nCan you teach me how to make a cake?<|im_end|>\n<|im_start|>assistant\nSure, I can help you with that. What would you like to know about the cake?<|im_end|>']

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
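
Conceptually, a merged save folds the adapter back into the base weights so that no separate LoRA files are needed at inference time. The sketch below shows the standard LoRA merge formula, `W + (alpha / r) * B @ A`, on tiny hand-written matrices. It is a plain-Python illustration with hypothetical helper names, not the code Unsloth runs, and it ignores quantization.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, out=2, in=2
lora_a = [[1.0, 2.0]]              # A: r=1, in=2
lora_b = [[0.5], [0.0]]            # B: out=2, r=1

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

With the settings used in this notebook (`lora_alpha = 128`, `r = 64`, `use_rslora = False`), the scale factor in this formula would be 2.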

In [19]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if True: model.push_to_hub_merged(model_name, tokenizer, save_method = "lora")

Unsloth: Saving LoRA adapters. Please wait...


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/864M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/34.4M [00:00<?, ?B/s]

Saved lora model to https://huggingface.co/Erland/Gemma-Ver4-TTT-NLP701-Assignment2-Subtask3-Reward-Model


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>