To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Gemma 6 trillion tokens **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)
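
For intuition on the RoPE scaling bullet above, here is a minimal sketch of linear position interpolation, which is our reading of the idea behind kaiokendev's method: positions are multiplied by `trained_len / target_len`, so a longer context maps back into the position range seen during training. The function name `rope_angles`, `head_dim = 128`, `base = 10000.0`, and both context lengths are illustrative assumptions, not Unsloth's internals.

```python
import math

def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotary angles for one token position; scale < 1 compresses positions."""
    pos = position * scale
    return [pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

trained_len = 2048   # assumed context length seen during pretraining
target_len = 8192    # hypothetical longer context we want to use
scale = trained_len / target_len  # 0.25

# After interpolation, the end of the long context gets the same angles
# that the end of the trained context had originally.
stretched = rope_angles(target_len, scale=scale)
original = rope_angles(trained_len)
print(all(math.isclose(a, b) for a, b in zip(stretched, original)))
```

The trade-off is that neighbouring positions end up closer together in angle space, which is one reason some fine-tuning at the longer length is usually recommended.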

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth




We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
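
To see where the trainable parameter count comes from, here is a rough sketch that counts the LoRA weights for the configuration above. Each adapted linear layer gets two small matrices, `A` (`r x in_features`) and `B` (`out_features x r`). The layer shapes below are the publicly documented Mistral 7B dimensions and are an assumption about the base model loaded here; the helper `lora_params` is hypothetical.

```python
def lora_params(in_features, out_features, r):
    # A is (r x in_features), B is (out_features x r)
    return r * (in_features + out_features)

r = 16
# Assumed Mistral 7B shapes: hidden size, key/value projection size,
# MLP intermediate size, number of decoder layers.
hidden, kv_dim, mlp_dim, num_layers = 4096, 1024, 14336, 32

shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 41,943,040, which matches the "Number of trainable parameters" that the trainer reports later in this notebook. Against roughly 7.2 billion base parameters that is about 0.6% of the weights at `r = 16`; larger ranks push the share up proportionally.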


<a name="Data"></a>
### Data Prep
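The original version of this notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This run instead loads `williamsl/DungeonsAndDragonsGPT`, which has `input` and `output` columns, so the prompt template below drops the instruction field (the Alpaca version is kept as a comment). You can replace this code section with your own data prep.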

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
# alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {}

# ### Input:
# {}

# ### Response:
# {}"""

# EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# def formatting_prompts_func(examples):
#     instructions = examples["instruction"]
#     inputs       = examples["input"]
#     outputs      = examples["output"]
#     texts = []
#     for instruction, input, output in zip(instructions, inputs, outputs):
#         # Must add EOS_TOKEN, otherwise your generation will go on forever!
#         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
#         texts.append(text)
#     return { "text" : texts, }
# pass

alpaca_prompt = """### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("williamsl/DungeonsAndDragonsGPT", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
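
To check what the formatting step produces without downloading anything, here is a small sketch that applies the same template to a hand-made batch. The `"</s>"` EOS string and the single example row are assumptions for illustration; in the notebook the EOS string comes from `tokenizer.eos_token` and the rows come from the dataset.

```python
PROMPT = """### Input:
{}

### Response:
{}"""
EOS = "</s>"  # assumed value; use tokenizer.eos_token in practice

def format_batch(examples):
    # With batched=True, `examples` maps each column name to a list of values
    texts = [
        PROMPT.format(inp, out) + EOS
        for inp, out in zip(examples["input"], examples["output"])
    ]
    return {"text": texts}

batch = {
    "input": ["Choose an action:\n0 : I can move to (0, 0)\n"],
    "output": ["Action: 0"],
}
formatted = format_batch(batch)["text"]
print(formatted[0])
```

Each training string ends with the EOS marker, which is what teaches the model to stop after the response.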

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 40 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` setting. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 40,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.625 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 90 | Num Epochs = 4
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 40
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 1.22 |
| 2 | 1.0825 |
| 3 | 1.1033 |
| 4 | 1.0571 |
| 5 | 0.9647 |
| 6 | 0.8379 |
| 7 | 0.679 |
| 8 | 0.6792 |
| 9 | 0.6605 |
| 10 | 0.5938 |
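
The batch size and epoch numbers in the training banner follow from the settings passed to `TrainingArguments`. The sketch below is a back-of-the-envelope estimate, not the trainer's exact rounding logic; the variable names are ours.

```python
import math

num_examples = 90        # rows in the train split
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1
max_steps = 40

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# How many times the run cycles through the dataset
passes = examples_seen / num_examples

print(effective_batch, examples_seen, round(passes, 2), math.ceil(passes))
```

This gives an effective batch size of 8 and about 3.56 passes over the 90 examples, which the banner rounds up to 4 epochs.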


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

631.4281 seconds used for training.
10.52 minutes used for training.
Peak reserved memory = 7.07 GB.
Peak reserved memory for training = 2.445 GB.
Peak reserved memory % of max memory = 47.939 %.
Peak reserved memory for training % of max memory = 16.579 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the input, but leave the output blank so the model generates it.

In [None]:
# test_instruction = f"""You are a player in a simplified game of Dungeons and Dragons (D&D).
# It is your turn to act. You can choose to move to one of the valid move positions, or execute one of the valid attacks.
# All characters can move 1 tile per turn, and have an attack range of 1 tile.
# The top left of the board has coordinates (0, 0).

# First, provide thoughtful reflection on the best strategy to win the game from this point. Then, provide your decision using the index of the action.

# Here is an example of the expected output format.
# Reasoning: I will prioritize attacking enemies with lower health to take them out quickly and reduce the overall damage I and my teammates receive. Therefore, attacking Eve at (2, 3) is a good decision.
# Action: 4

# Now make your decision based on the following state of the game."""

test_inputs = [
    'I am Charlie represented by the letter C\nI have 1 health\nI can do 2 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (1, 0)\n\nMy teammates are:\nAlice (health: 9, damage: 2)\nBob (health: 7, damage: 2)\nCharlie (health: 7, damage: 3)\nEve (health: 3, damage: 3)\n\nMy enemies are:\nDavid (health: 1, damage: 1)\n\nMy valid moves are:\n0 : I can move to (0, 0)\n1 : I can move to (0, 1)\n2 : I can move to (1, 0)\n3 : I can move to (1, 1)\n4 : I can move to (2, 1)\n\nMy valid attacks are:\nNone\n\nThe board looks like:\n-CB-\n----\nED--\n--A-\n\nThe turn order is:\nCharlie\nBob\nAlice\nEve\nDavid\n\nChoose an action:\n0 : I can move to (0, 0)\n1 : I can move to (0, 1)\n2 : I can move to (1, 0)\n3 : I can move to (1, 1)\n4 : I can move to (2, 1)\n',
    'I am David represented by the letter D\nI have 5 health\nI can do 2 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (3, 1)\n\nMy teammates are:\nDavid (health: 2, damage: 2)\n\nMy enemies are:\nAlice (health: 9, damage: 1)\nBob (health: 1, damage: 1)\nCharlie (health: 3, damage: 3)\n\nMy valid moves are:\n0 : I can move to (2, 0)\n2 : I can move to (2, 2)\n3 : I can move to (3, 0)\n4 : I can move to (3, 1)\n5 : I can move to (3, 2)\n\nMy valid attacks are:\n1 : I can attack Charlie at (2, 1)\n\nThe board looks like:\n-B--\n--CD\n----\n---A\n\nThe turn order is:\nDavid\nBob\nCharlie\nAlice\n\nChoose an action:\n0 : I can move to (2, 0)\n1 : I can attack Charlie at (2, 1)\n2 : I can move to (2, 2)\n3 : I can move to (3, 0)\n4 : I can move to (3, 1)\n5 : I can move to (3, 2)\n',
    'I am Heidi represented by the letter H\nI have 1 health\nI can do 3 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (2, 5)\n\nMy teammates are:\nAlice (health: 4, damage: 1)\nBob (health: 9, damage: 1)\nCharlie (health: 10, damage: 2)\nEve (health: 2, damage: 1)\nFrank (health: 10, damage: 3)\nGrace (health: 6, damage: 3)\nHeidi (health: 7, damage: 3)\n\nMy enemies are:\nDavid (health: 5, damage: 3)\nIvan (health: 4, damage: 3)\n\nMy valid moves are:\n0 : I can move to (2, 5)\n1 : I can move to (3, 5)\n\nMy valid attacks are:\nNone\n\nThe board looks like:\n-G----\n------\n------\n-C----\n-EFA--\n-BH-ID\n\nThe turn order is:\nHeidi\nGrace\nAlice\nCharlie\nDavid\nBob\nFrank\nEve\nIvan\n\nChoose an action:\n0 : I can move to (2, 5)\n1 : I can move to (3, 5)\n',
    'I am Bob represented by the letter B\nI have 4 health\nI can do 1 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (0, 4)\n\nMy teammates are:\nBob (health: 2, damage: 3)\n\nMy enemies are:\nAlice (health: 7, damage: 3)\n\nMy valid moves are:\n0 : I can move to (0, 3)\n1 : I can move to (0, 4)\n2 : I can move to (1, 3)\n3 : I can move to (1, 4)\n\nMy valid attacks are:\nNone\nThe board looks like:\n-----\n-----\n---A-\n-----\nB----\n\nThe turn order is:\nBob\nAlice\n\nChoose an action:\n0 : I can move to (0, 3)\n1 : I can move to (0, 4)\n2 : I can move to (1, 3)\n3 : I can move to (1, 4)\n',
    'I am Heidi represented by the letter H\nI have 7 health\nI can do 3 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (2, 2)\n\nMy teammates are:\nAlice (health: 2, damage: 2)\nBob (health: 1, damage: 3)\nCharlie (health: 7, damage: 1)\nDavid (health: 3, damage: 2)\nFrank (health: 8, damage: 2)\nGrace (health: 6, damage: 3)\nHeidi (health: 10, damage: 2)\n\nMy enemies are:\nEve (health: 3, damage: 2)\n\nMy valid moves are:\n0 : I can move to (1, 1)\n1 : I can move to (1, 2)\n2 : I can move to (1, 3)\n3 : I can move to (2, 1)\n4 : I can move to (2, 2)\n5 : I can move to (2, 3)\n6 : I can move to (3, 2)\n7 : I can move to (3, 3)\n\nMy valid attacks are:\nNone\nThe board looks like:\n--------\n---F--A-\n--H-----\n------B-\n-DG-----\n---E----\n----C---\n--------\n\nThe turn order is:\nHeidi\nGrace\nFrank\nCharlie\nEve\nDavid\nAlice\nBob\n\nChoose an action:\n0 : I can move to (1, 1)\n1 : I can move to (1, 2)\n2 : I can move to (1, 3)\n3 : I can move to (2, 1)\n4 : I can move to (2, 2)\n5 : I can move to (2, 3)\n6 : I can move to (3, 2)\n7 : I can move to (3, 3)\n',
    'I am Frank represented by the letter F\nI have 10 health\nI can do 1 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (2, 0)\n\nMy teammates are:\nCharlie (health: 4, damage: 1)\nDavid (health: 6, damage: 2)\nEve (health: 3, damage: 2)\nFrank (health: 6, damage: 2)\nHeidi (health: 10, damage: 3)\n\nMy enemies are:\nAlice (health: 1, damage: 1)\nBob (health: 3, damage: 2)\nGrace (health: 5, damage: 3)\nIvan (health: 6, damage: 2)\n\nMy valid moves are:\n0 : I can move to (1, 0)\n1 : I can move to (1, 1)\n2 : I can move to (2, 0)\n3 : I can move to (2, 1)\n4 : I can move to (3, 0)\n5 : I can move to (3, 1)\n\nMy valid attacks are:\nNone\nThe board looks like:\n--F-A\nE---D\n---C-\n--B--\nI-GH-\n\nThe turn order is:\nFrank\nCharlie\nBob\nAlice\nGrace\nIvan\nEve\nDavid\nHeidi\n\nChoose an action:\n0 : I can move to (1, 0)\n1 : I can move to (1, 1)\n2 : I can move to (2, 0)\n3 : I can move to (2, 1)\n4 : I can move to (3, 0)\n5 : I can move to (3, 1)\n',
    'I am Eve represented by the letter E\nI have 1 health\nI can do 1 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (2, 3)\n\nMy teammates are:\nCharlie (health: 4, damage: 1)\nDavid (health: 6, damage: 2)\nEve (health: 10, damage: 3)\n\nMy enemies are:\nAlice (health: 1, damage: 1)\nBob (health: 10, damage: 3)\n\nMy valid moves are:\n0 : I can move to (1, 2)\n1 : I can move to (1, 3)\n2 : I can move to (1, 4)\n3 : I can move to (2, 2)\n4 : I can move to (2, 3)\n5 : I can move to (2, 4)\n6 : I can move to (3, 2)\n7 : I can move to (3, 3)\n8 : I can move to (3, 4)\n\nMy valid attacks are:\nNone\nThe board looks like:\n-D---C-\n-------\n-------\n--E----\n-------\n-A-----\n----B--\n\nThe turn order is:\nEve\nCharlie\nAlice\nBob\nDavid\n\nChoose an action:\n0 : I can move to (1, 2)\n1 : I can move to (1, 3)\n2 : I can move to (1, 4)\n3 : I can move to (2, 2)\n4 : I can move to (2, 3)\n5 : I can move to (2, 4)\n6 : I can move to (3, 2)\n7 : I can move to (3, 3)\n8 : I can move to (3, 4)\n',
    'I am Alice represented by the letter A\nI have 3 health\nI can do 3 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (6, 5)\n\nMy teammates are:\nAlice (health: 10, damage: 1)\n\nMy enemies are:\nBob (health: 2, damage: 1)\n\nMy valid moves are:\n0 : I can move to (5, 4)\n1 : I can move to (5, 5)\n2 : I can move to (5, 6)\n3 : I can move to (6, 4)\n4 : I can move to (6, 5)\n5 : I can move to (6, 6)\n6 : I can move to (7, 4)\n7 : I can move to (7, 5)\n8 : I can move to (7, 6)\n\nMy valid attacks are:\nNone\n\nThe board looks like:\n--------\n--------\n--------\n---B----\n--------\n------A-\n--------\n--------\n\nThe turn order is:\nAlice\nBob\n\nChoose an action:\n0 : I can move to (5, 4)\n1 : I can move to (5, 5)\n2 : I can move to (5, 6)\n3 : I can move to (6, 4)\n4 : I can move to (6, 5)\n5 : I can move to (6, 6)\n6 : I can move to (7, 4)\n7 : I can move to (7, 5)\n8 : I can move to (7, 6)\n',
    'I am Charlie represented by the letter C\nI have 7 health\nI can do 1 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (0, 1)\n\nMy teammates are:\nBob (health: 10, damage: 1)\nCharlie (health: 10, damage: 2)\nDavid (health: 5, damage: 1)\nHeidi (health: 1, damage: 3)\n\nMy enemies are:\nAlice (health: 9, damage: 1)\nEve (health: 7, damage: 1)\nFrank (health: 2, damage: 2)\nGrace (health: 5, damage: 3)\n\nMy valid moves are:\n0 : I can move to (0, 0)\n1 : I can move to (0, 1)\n2 : I can move to (0, 2)\n3 : I can move to (1, 1)\n4 : I can move to (1, 2)\n\nMy valid attacks are:\nNone\nThe board looks like:\n-H--F\nC-ED-\n--G--\nAB---\n-----\n\nThe turn order is:\nCharlie\nFrank\nHeidi\nEve\nGrace\nDavid\nAlice\nBob\n\nChoose an action:\n0 : I can move to (0, 0)\n1 : I can move to (0, 1)\n2 : I can move to (0, 2)\n3 : I can move to (1, 1)\n4 : I can move to (1, 2)\n',
    'I am Charlie represented by the letter C\nI have 3 health\nI can do 2 damage per turn\nI can move 1 tile per turn\nMy attack range is 1 tile\nI am at position (4, 2)\n\nMy teammates are:\nCharlie (health: 2, damage: 2)\nDavid (health: 9, damage: 2)\n\nMy enemies are:\nAlice (health: 3, damage: 3)\nBob (health: 7, damage: 2)\n\nMy valid moves are:\n0 : I can move to (3, 1)\n1 : I can move to (3, 2)\n2 : I can move to (3, 3)\n3 : I can move to (4, 1)\n4 : I can move to (4, 2)\n5 : I can move to (4, 3)\n6 : I can move to (5, 1)\n7 : I can move to (5, 2)\n8 : I can move to (5, 3)\n\nMy valid attacks are:\nNone\nThe board looks like:\n--------\n--------\n--A-C---\n--------\n---B----\n--D-----\n--------\n--------\n\nThe turn order is:\nCharlie\nAlice\nBob\nDavid\n\nChoose an action:\n0 : I can move to (3, 1)\n1 : I can move to (3, 2)\n2 : I can move to (3, 3)\n3 : I can move to (4, 1)\n4 : I can move to (4, 2)\n5 : I can move to (4, 3)\n6 : I can move to (5, 1)\n7 : I can move to (5, 2)\n8 : I can move to (5, 3)\n',
]

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         test_instruction, # instruction
#         test_input, # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
# tokenizer.batch_decode(outputs)

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         test_instruction, # instruction
#         test_input, # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# model.save_pretrained("lora_model") # Local saving
model.push_to_hub("williamsl/DungeonsAndDragonsGPT", token = "") # Online saving

Saved model to https://huggingface.co/williamsl/DungeonsAndDragonsGPT


Now, to load the LoRA adapters we just saved for inference, pass the saved repository name (or local folder) as `model_name`:

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/mistral-7b-bnb-4bit", # YOUR MODEL YOU USED FOR TRAINING
    model_name = "williamsl/DungeonsAndDragonsGPT", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

for i in range(len(test_inputs)):
    print(f"\n\n\ninput {i}\n")
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            # test_instruction, # instruction
            test_inputs[i], # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 1024, use_cache = True)
    print(tokenizer.batch_decode(outputs))

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


ValueError: 
                    Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
                    quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
                    in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to
                    `from_pretrained`. Check
                    https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                    for more details.
                    

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
# if False:
#     # I highly do NOT suggest - use Unsloth if possible
#     from peft import AutoPeftModelForCausalLM
#     from transformers import AutoTokenizer
#     model = AutoPeftModelForCausalLM.from_pretrained(
#         "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#         load_in_4bit = load_in_4bit,
#     )
#     tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
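
To make "merged" concrete, here is a toy sketch of the standard LoRA merge, `W' = W + (alpha / r) * B @ A`, written with plain Python lists. It illustrates the general technique only and is not Unsloth's implementation; the helper names and the tiny matrices are made up for the example.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Toy layer with out_features=2, in_features=3 and rank r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # base weight (out x in)
A = [[1.0, 2.0, 3.0]]                    # LoRA A (r x in)
B = [[0.5], [-1.0]]                      # LoRA B (out x r)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiply gives the same result as running the base layer and the adapter separately, which is why merged checkpoints can be served without PEFT. With the settings used earlier (`lora_alpha = 16`, `r = 16`) the scale factor would be 1.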

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# # Save to 8bit Q8_0
# if False: model.save_pretrained_gguf("model", tokenizer,)
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma 6 trillion tokens is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>