# Continued Pre-training (CPT) with Unsloth for Translation

Fine-tuning requires a GPU. If you don't have one locally, you can run this notebook on [Google Colab](https://colab.research.google.com/github/Liquid4All/cookbook/blob/main/finetuning/notebooks/cpt_translation_with_unsloth.ipynb) using a free NVIDIA T4 GPU instance.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Liquid4All/cookbook/blob/main/finetuning/notebooks/cpt_translation_with_unsloth.ipynb)

### What's in this notebook?

In this notebook you will learn how to use Continued Pre-training (CPT) with Unsloth to adapt a language model for translation tasks.
We will use the [LFM2.5-1.2B-Base](https://docs.liquid.ai/docs/models/lfm25-1.2b-base) model and perform continued pre-training on Korean Wikipedia data, followed by instruction fine-tuning on a Korean translation of the Alpaca-GPT4 dataset. This approach is well suited to adapting models to specific languages or translation domains.

We will cover:
- Environment setup
- Data preparation for translation
- Continued pre-training on domain-specific data
- Instruction fine-tuning for translation tasks
- Local inference with your new model
- Saving the model and exporting it to the format you need for **deployment**

### Deployment options

LFM2.5 models are small and efficient, enabling deployment across a wide range of platforms:

<table align="left">
  <tr>
    <th>Deployment Target</th>
    <th>Use Case</th>
  </tr>
  <tr>
    <td>📱 <a href="https://docs.liquid.ai/leap/edge-sdk/android/android-quick-start-guide"><b>Android</b></a></td>
    <td>Mobile apps on Android devices</td>
  </tr>
  <tr>
    <td>📱 <a href="https://docs.liquid.ai/leap/edge-sdk/ios/ios-quick-start-guide"><b>iOS</b></a></td>
    <td>Mobile apps on iPhone/iPad</td>
  </tr>
  <tr>
    <td>🍎 <a href="https://docs.liquid.ai/docs/inference/mlx"><b>Apple Silicon Mac</b></a></td>
    <td>Local inference on Mac with MLX</td>
  </tr>
  <tr>
    <td>🦙 <a href="https://docs.liquid.ai/docs/inference/llama-cpp"><b>llama.cpp</b></a></td>
    <td>Local deployments on any hardware</td>
  </tr>
  <tr>
    <td>🦙 <a href="https://docs.liquid.ai/docs/inference/ollama"><b>Ollama</b></a></td>
    <td>Local inference with easy setup</td>
  </tr>
  <tr>
    <td>🖥️ <a href="https://docs.liquid.ai/docs/inference/lm-studio"><b>LM Studio</b></a></td>
    <td>Desktop app for local inference</td>
  </tr>
  <tr>
    <td>⚡ <a href="https://docs.liquid.ai/docs/inference/vllm"><b>vLLM</b></a></td>
    <td>Cloud deployments with high throughput</td>
  </tr>
  <tr>
    <td>☁️ <a href="https://docs.liquid.ai/docs/inference/modal-deployment"><b>Modal</b></a></td>
    <td>Serverless cloud deployment</td>
  </tr>
  <tr>
    <td>🏗️ <a href="https://docs.liquid.ai/docs/inference/baseten-deployment"><b>Baseten</b></a></td>
    <td>Production ML infrastructure</td>
  </tr>
  <tr>
    <td>🚀 <a href="https://docs.liquid.ai/docs/inference/fal-deployment"><b>Fal</b></a></td>
    <td>Fast inference API</td>
  </tr>
</table>

### Need help building with our models and tools?
Join the Liquid AI Discord Community and ask.

<a href="https://discord.com/invite/liquid-ai"><img src="https://img.shields.io/discord/1385439864920739850?color=7289da&label=Join%20Discord&logo=discord&logoColor=white" alt="Join Discord"></a>

And now, let the fine-tuning begin!

### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.57.3
!pip install --no-deps trl==0.22.2

### Unsloth

In [None]:
# Disable CCE (cut cross entropy), since it is not supported for CPT.
# Keep this comment on its own line: %env stores everything after "=" as the value.
%env UNSLOTH_RETURN_LOGITS=1

env: UNSLOTH_RETURN_LOGITS=1
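
If you prefer to set the flag from Python instead of the `%env` magic, a minimal sketch looks like this. It assumes the cell runs before `unsloth` is imported, and that the library compares the value against the exact string `"1"` (an assumption; this notebook does not confirm how the flag is read). The point it illustrates: `%env` keeps everything after `=` as the value, so a trailing `#` comment on the same line would become part of it.

```python
import os

# Set the flag before importing unsloth so the library can see it at import time.
os.environ["UNSLOTH_RETURN_LOGITS"] = "1"

# What the value would look like if a trailing comment were captured with it.
polluted_value = "1 # Run this to disable CCE"
clean_value = os.environ["UNSLOTH_RETURN_LOGITS"]

print(repr(clean_value))
print("clean value equals '1':   ", clean_value == "1")
print("polluted value equals '1':", polluted_value == "1")
```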


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto-detection. Use float16 for Tesla T4/V100, bfloat16 for Ampere or newer.
load_in_4bit = False # Set to True to load in 4-bit and reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LiquidAI/LFM2.5-1.2B-Base", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.2: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

We also add `embed_tokens` and `lm_head` to allow the model to learn out-of-distribution data.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "in_proj",
                      "w1", "w2", "w3",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)



Unsloth: Making `model.base_model.model.model.embed_tokens` require gradients
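
To see what `use_rslora = True` changes, compare the two scaling factors applied to the LoRA update. Standard LoRA scales the update by `lora_alpha / r`, while rank-stabilized LoRA scales it by `lora_alpha / sqrt(r)`. The sketch below plugs in the values from the cell above; the helper name `lora_scaling` is made up for illustration and is not part of Unsloth or PEFT.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

# Values taken from the get_peft_model call: r = 128, lora_alpha = 32.
standard = lora_scaling(32, 128, use_rslora=False)
rank_stabilized = lora_scaling(32, 128, use_rslora=True)

print(f"standard LoRA scaling: {standard:.4f}")
print(f"rsLoRA scaling:        {rank_stabilized:.4f}")
```

At `r = 128` the rank-stabilized factor is about 11 times larger than the standard one (`sqrt(128)` is roughly 11.3), so learning rates tuned for standard LoRA may not transfer directly.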


<a name="Data"></a>
### Data Prep
We now use the Korean subset of the [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia) to first continually pretrain the model. You can use **any language** you like! Go to [Wikipedia's List of Languages](https://en.wikipedia.org/wiki/List_of_Wikipedias) to find your own language!

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb).

**[NOTE]** Use https://translate.google.com to translate from English to Korean!

In [None]:
# Wikipedia provides a title and an article text.
# Use https://translate.google.com to translate the template!
_wikipedia_prompt = """Wikipedia Article
### Title: {}

### Article:
{}"""
# becomes:
wikipedia_prompt = """위키피디아 기사
### 제목: {}

### 기사:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }
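
As a quick sanity check, the formatting function can be run on a tiny hand-made batch before mapping it over the full dataset. This sketch is self-contained: the two sample rows and the `"<EOS>"` placeholder are invented for illustration, whereas in the notebook `EOS_TOKEN` comes from `tokenizer.eos_token`.

```python
# Stand-in for tokenizer.eos_token (hypothetical placeholder value).
EOS_TOKEN = "<EOS>"

wikipedia_prompt = """위키피디아 기사
### 제목: {}

### 기사:
{}"""

def formatting_prompts_func(examples):
    # `examples` is a batch: a dict mapping column names to lists of values.
    outputs = []
    for title, text in zip(examples["title"], examples["text"]):
        outputs.append(wikipedia_prompt.format(title, text) + EOS_TOKEN)
    return {"text": outputs}

# Invented sample rows, shaped like a batch passed by dataset.map(..., batched=True).
batch = {
    "title": ["서울", "한글"],
    "text": ["서울은 대한민국의 수도이다.", "한글은 한국어의 문자이다."],
}

formatted = formatting_prompts_func(batch)["text"]
print(formatted[0])
```

Every formatted string should end with the EOS token; if it does not, the model never sees an end-of-text signal during training.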

We only use 1% of the dataset to speed things up! Use more for longer runs!

In [None]:
from datasets import load_dataset

dataset = load_dataset("wikimedia/wikipedia", "20231101.ko", split = "train",)

# We select 1% of the data to make training faster!
dataset = dataset.train_test_split(train_size = 0.01)["train"]

dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split:   0%|          | 0/647897 [00:00<?, ? examples/s]

Map:   0%|          | 0/6478 [00:00<?, ? examples/s]

<a name="Train"></a>
### Continued Pretraining
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 120 steps to keep things fast; for a full run, set `num_train_epochs=1` and remove `max_steps`.

Also set `embedding_learning_rate` to be 2x to 10x smaller than `learning_rate` to make continual pretraining work!

In [None]:
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/6478 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
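
The schedule implied by `warmup_steps = 10`, `max_steps = 120`, and `lr_scheduler_type = "linear"` can be sketched in plain Python. This assumes the usual linear-warmup, linear-decay rule, and that the embedding learning rate follows the same shape at a smaller peak. `linear_lr` is an illustrative helper, not an Unsloth or Transformers function.

```python
def linear_lr(step: int, peak_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 5, 10, 65, 120):
    lora_lr = linear_lr(step, 5e-5, warmup_steps=10, max_steps=120)
    embed_lr = linear_lr(step, 1e-5, warmup_steps=10, max_steps=120)
    print(f"step {step:3d}: learning_rate={lora_lr:.2e}  embedding_learning_rate={embed_lr:.2e}")
```

Under these assumptions the embedding matrices always train at one fifth of the rate used for the other adapter weights, which sits inside the suggested 2x to 10x range.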


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
2.598 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6,478 | Num Epochs = 1 | Total steps = 120
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 97,517,568 of 1,276,508,928 (7.64% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.2041
2,1.9171
3,2.0752
4,1.8929
5,2.1221
6,2.0404
7,1.6626
8,1.7943
9,1.842
10,1.9009




### Instruction Finetuning

We now use the [Alpaca-GPT4 dataset](https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-korean) translated into Korean!

Go to [vicgalle/alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4) for the original GPT4 dataset for Alpaca or [MultilingualSIFT project](https://github.com/FreedomIntelligence/MultilingualSIFT) for other translations of the Alpaca dataset.

In [None]:
from datasets import load_dataset

alpaca_dataset = load_dataset("FreedomIntelligence/alpaca-gpt4-korean", split="train")

Generating train split:   0%|          | 0/49969 [00:00<?, ? examples/s]

We print 1 example:

In [None]:
print(alpaca_dataset[0])

{'conversations': [{'from': 'human', 'value': '재활용 캠페인 슬로건을 제시하세요.\n'}, {'from': 'gpt', 'value': '1. "더욱 녹색 미래를 위해 함께 줄이고, 재사용하고, 재활용하세요."\n2. "더 나은 내일을 위해 오늘 바로 재활용하세요."\n3. "쓰레기를 보물로 만드는 법 - 재활용!"\n4. "인생의 순환을 위해 재활용하세요."\n5. "자원을 아끼고 더 많이 재활용하세요."'}], 'id': '23712'}


We again use https://translate.google.com/ to translate the Alpaca format into Korean.

In [None]:
_alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
# Becomes:
alpaca_prompt = """다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
{}

### 응답:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

alpaca_dataset = alpaca_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/49969 [00:00<?, ? examples/s]
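
The formatting function above relies on turn order: index 0 is treated as the human turn and index 1 as the model turn. If a dataset might break that assumption, a more defensive variant can look turns up by their `from` field. This is only a sketch: `split_turns` is a hypothetical helper, and the sample record is a shortened version of the example printed earlier.

```python
def split_turns(conversation):
    """Return (instruction, response) from a ShareGPT-style list of turns."""
    # next() raises StopIteration if a role is missing, which surfaces bad rows early.
    instruction = next(t["value"] for t in conversation if t["from"] == "human")
    response = next(t["value"] for t in conversation if t["from"] == "gpt")
    # Strip surrounding whitespace, such as the trailing newline on some instructions.
    return instruction.strip(), response.strip()

record = {
    "conversations": [
        {"from": "human", "value": "재활용 캠페인 슬로건을 제시하세요.\n"},
        {"from": "gpt", "value": "1. \"더 나은 내일을 위해 오늘 바로 재활용하세요.\""},
    ],
    "id": "23712",
}

instruction, response = split_turns(record["conversations"])
print(instruction)
print(response)
```

Note that stripping whitespace changes the training text slightly compared with the positional version, which keeps the trailing newline of the instruction.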

We again employ `UnslothTrainer` and do instruction finetuning!

In [None]:
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = alpaca_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use num_train_epochs and warmup_ratio for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/49969 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 49,969 | Num Epochs = 1 | Total steps = 120
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 97,517,568 of 1,276,508,928 (7.64% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.9601
2,1.9163
3,1.7326
4,1.5923
5,1.5051
6,1.3244
7,1.3801
8,1.4781
9,1.3029
10,1.3272




In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

226.8895 seconds used for training.
3.78 minutes used for training.
Peak reserved memory = 4.361 GB.
Peak reserved memory for training = 1.763 GB.
Peak reserved memory % of max memory = 19.679 %.
Peak reserved memory for training % of max memory = 7.955 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction - leave the response blank!

Remember to use https://translate.google.com/!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        # "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,", # instruction
        "피보나치 수열을 계속하세요: 1, 1, 2, 3, 5, 8,", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|startoftext|>다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.\n\n### 지침:\n피보나치 수열을 계속하세요: 1, 1, 2, 3, 5, 8,\n\n### 응답:\n피보나치 수열은 각 숫자가 앞의 두 숫자의 합인 수열입니다. 이 수열은 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ']
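
`batch_decode` returns the prompt together with the completion. If only the generated answer is needed, one option is to split on the response header of the prompt template. The sketch below is plain string handling: `extract_response` is a hypothetical helper, the sample string is modeled on the decoded output above, and `<|im_end|>` is assumed to be the end-of-sequence marker because it appears at the end of the streamed output later in this notebook.

```python
RESPONSE_HEADER = "### 응답:\n"

def extract_response(decoded: str, eos_token: str = "<|im_end|>") -> str:
    """Return the text after the last response header, without the EOS token."""
    # rpartition keeps everything after the final occurrence of the header.
    # If the header is missing, the whole string is returned unchanged (minus EOS).
    _, _, answer = decoded.rpartition(RESPONSE_HEADER)
    return answer.replace(eos_token, "").strip()

decoded = (
    "<|startoftext|>다음은 작업을 설명하는 명령입니다. "
    "요청을 적절하게 완료하는 응답을 작성하세요.\n\n"
    "### 지침:\n피보나치 수열을 계속하세요: 1, 1, 2, 3, 5, 8,\n\n"
    "### 응답:\n이 수열은 1, 1, 2, 3, 5, 8, 13, 21, 34<|im_end|>"
)

print(extract_response(decoded))
```

An alternative that does not depend on the template text is to slice the generated token IDs at the prompt length before decoding, for example `outputs[:, inputs["input_ids"].shape[1]:]`.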

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        # "What is Korean music like?"
        "한국음악은 어떤가요?", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|startoftext|>다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
한국음악은 어떤가요?

### 응답:
한국음악은 전통 음악과 현대 음악을 포함한 다양한 장르를 포함합니다. 전통 음악은 종종 악기와 노래를 사용하여 한국의 역사와 문화를 반영합니다. 현대 음악은 전통 음악의 영향을 받아 발전해왔으며, 전자 음악, 록, 팝 등 다양한 장르가 있습니다. 한국 음악은 전 세계적으로 인기를 얻고 있으며, 많은 아티스트들이 국제적인 성공을 거두었습니다.<|im_end|>


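In English, the response reads roughly as follows (you can check it with https://translate.google.com/):
```
Korean music spans a variety of genres, including traditional and contemporary music.

Traditional music often uses instruments and song to reflect Korea's history and culture.

Contemporary music has developed under the influence of traditional music and includes genres such as electronic music, rock, and pop.

Korean music is gaining popularity around the world, and many artists have achieved international success.
```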

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving



('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model",  # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
    [
        alpaca_prompt.format(
            # "Describe the planet Earth extensively.", # instruction
            "지구를 광범위하게 설명하세요.",
            "",  # output - leave this blank for generation!
        ),
    ],
    return_tensors="pt",
).to("cuda")


from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    # NOTE: values below 1.0 reward repetition; use a value above 1.0
    # (for example 1.1) to discourage it.
    repetition_penalty=0.1,
)

<|startoftext|>다음은 작업을 설명하는 명령입니다. 요청을 적절하게 완료하는 응답을 작성하세요.

### 지침:
지구를 광범위하게 설명하세요.

### 응답:
지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명하세요.

지구를 광범위하게 설명


The response simply repeats the instruction ("Describe the planet Earth extensively.") until it runs out of tokens.

That is the expected effect of `repetition_penalty=0.1`: values below 1.0 reward repeated tokens instead of penalizing them. Setting it above 1.0 (for example 1.1), or enabling sampling with a non-zero temperature, should give much better output.
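
To see why a `repetition_penalty` below 1.0 leads to loops like the one above, here is a minimal sketch of the rule commonly used for this setting (introduced in the CTRL paper and followed by Hugging Face `transformers`): for every token already in the sequence, a positive logit is divided by the penalty and a negative logit is multiplied by it. The function name and the toy logits are made up for illustration.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Rescale the logits of tokens that already appeared in the sequence."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores are divided by the penalty, negative scores multiplied.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, 1.5, -1.0, 0.5]   # toy scores for a 4-token vocabulary
seen = [0, 2]                    # tokens 0 and 2 were already generated

boosted = apply_repetition_penalty(logits, seen, penalty=0.1)
damped = apply_repetition_penalty(logits, seen, penalty=1.2)

print("penalty=0.1:", boosted)
print("penalty=1.2:", damped)
```

With `penalty=0.1` both previously seen tokens become more likely, while `penalty=1.2` makes both less likely. For sampling, `model.generate` also accepts `do_sample=True` together with `temperature` and `top_p`.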

You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",  # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit=load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Push a q5_k_m GGUF to the Hub
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q5_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions on Unsloth, find a bug, want to keep up with the latest LLM news, or need help, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
</div>
