Finetuning a Llama model to correct typos. Powered by Unsloth, edited by Yichen Cai.

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

### Training from scratch (LoRA).

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "hf_...", # your own Hugging Face token; only needed for gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

Training from scratch for LoRA! We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.12.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
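
As a rough check on how small the adapter is, the sketch below counts the LoRA weights for the configuration above. Each adapted linear layer with `in` inputs and `out` outputs gains two low-rank matrices totalling `r * (in + out)` parameters. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (they are not printed anywhere in this notebook), and `lora_param_count` is a hypothetical helper, not part of Unsloth.

```python
# Assumed Llama-3.2-3B dimensions (not shown in this notebook).
HIDDEN = 3072        # hidden size
INTERMEDIATE = 8192  # MLP inner size
KV_DIM = 8 * 128     # 8 key/value heads of dimension 128
NUM_LAYERS = 28      # matches "patched 28 layers" in the output above

# (in_features, out_features) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: A is (r x in) and B is (out x r) per module."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

print(lora_param_count(16))
```

With `r = 16` this gives 24,313,856, the same figure the training banner later reports as the number of trainable parameters, which is under 1% of a model with roughly 3 billion weights.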


### Training from pretrained LoRA, by Yichen

In [2]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Load pretrained from local weights (from a folder called lora_model)

In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # MODEL
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )

FileNotFoundError: lora_model/*.json (invalid repository id)

Load pretrained from uploaded weights

In [5]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "cycv5/llama3b-lora-phone", # or "cycv5/llama3b-lora-zoom"
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token = "hf_...", # your own Hugging Face token
)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.2.15 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. Our typo-correction dataset is in ShareGPT style, like [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k), so we convert it to HuggingFace's normal multi-turn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
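
To make the layout above concrete, here is a minimal pure-Python sketch that builds the same string from a list of `role`/`content` messages. It is a simplification for illustration only: `render_llama3` is a hypothetical helper, it omits the default system preamble that the real Llama 3.1 template adds, and in the notebook the actual rendering is done by `tokenizer.apply_chat_template`.

```python
def render_llama3(messages):
    """Simplified Llama-3 style rendering (no default system preamble)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

demo = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(demo))
```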

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [6]:
from unsloth.chat_templates import get_chat_template

# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template = "llama-3.1",
# )

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

# A dataset that contains a column called "conversations"
# Each data point is addressed by dataset[index]
# To get conversations: dataset[index]["conversations"]
# dataset[index]["conversations"] gives a list of 2 dictionaries. First dict is user, second gpt


We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
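
The mapping itself is small enough to sketch by hand. The snippet below is an illustration of the key and role renaming shown above, not Unsloth's implementation; `to_role_content` and `ROLE_MAP` are hypothetical names.

```python
# ShareGPT speaker names mapped to the generic role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert one ShareGPT conversation (list of from/value dicts)."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
print(to_role_content(sharegpt))
```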

```
from datasets import Dataset
pdata = {"conversations":[[{"from": "system", "value": "You are an expert in correcting typos in sentences."}, {"from": "human", "value": "You are an assistant"}, {"from": "gpt", "value": "yes"}], [{"from": "human", "value": "hello"}, {"from": "gpt", "value": "hi there"}]],
         "source":["None", "None"],
         "score": [0, 0]}
play_dataset = Dataset.from_dict(pdata)
```

In [7]:
from datasets import Dataset
dataset = Dataset.from_file("data-00000-of-00001.arrow")
testset = Dataset.from_file("data-00000-of-00001.arrow")  # NOTE: same file as the training set; use a held-out split for a real evaluation

print(dataset)
print(dataset[0])
print(testset)
print(testset[0])

Dataset({
    features: ['conversations'],
    num_rows: 12308
})
{'conversations': [{'from': 'system', 'value': 'You are an expert in correcting typos in sentences.'}, {'from': 'human', 'value': 'Correct typos the in following sentence and return only the sentence: in ths futurd sustainable transportation options will still bsjsought to ease congestion'}, {'from': 'gpt', 'value': 'in the future sustainable transportation options will still be sought to ease congestion'}]}
Dataset({
    features: ['conversations'],
    num_rows: 12308
})
{'conversations': [{'from': 'system', 'value': 'You are an expert in correcting typos in sentences.'}, {'from': 'human', 'value': 'Correct typos the in following sentence and return only the sentence: in ths futurd sustainable transportation options will still bsjsought to ease congestion'}, {'from': 'gpt', 'value': 'in the future sustainable transportation options will still be sought to ease congestion'}]}


In [18]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
testset = standardize_sharegpt(testset)
testset = testset.map(formatting_prompts_func, batched = True,)

Standardizing format:   0%|          | 0/12308 [00:00<?, ? examples/s]

Map:   0%|          | 0/12308 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 1:

In [9]:
dataset[1]["conversations"]

[{'content': 'You are an expert in correcting typos in sentences.',
  'role': 'system'},
 {'content': 'Correct typos the in following sentence and return only the sentence: china has been actively involved in1peacekeeping missions and humanitarian efforts',
  'role': 'user'},
 {'content': 'china has been actively involved in peacekeeping missions and humanitarian efforts',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds a preamble such as `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` to the system prompt (the date shown may differ), so do not be alarmed!

In [10]:
dataset[1]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 25 Feb 2025\n\nYou are an expert in correcting typos in sentences.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCorrect typos the in following sentence and return only the sentence: china has been actively involved in1peacekeeping missions and humanitarian efforts<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nchina has been actively involved in peacekeeping missions and humanitarian efforts<|eot_id|>'

In [24]:
testset[1]["conversations"]

[{'content': 'You are an expert in correcting typos in sentences.',
  'role': 'system'},
 {'content': 'Correct typos the in following sentence and return only the sentence: china has been actively involved in1peacekeeping missions and humanitarian efforts',
  'role': 'user'},
 {'content': 'china has been actively involved in peacekeeping missions and humanitarian efforts',
  'role': 'assistant'}]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for one full epoch (`num_train_epochs = 1`); for a quick trial run, set `max_steps = 60` instead. We also support TRL's `DPOTrainer`!

In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/12308 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/12308 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/12308 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/12308 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [13]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/12308 [00:00<?, ? examples/s]

We verify masking is actually done:

In [14]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 25 Feb 2025\n\nYou are an expert in correcting typos in sentences.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCorrect typos the in following sentence and return only the sentence: aiznhanced agricultural robotics will have been automating various farming tasks<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\naienhanced agricultural robotics will have been automating various farming tasks<|eot_id|>'

In [15]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                         \n\naienhanced agricultural robotics will have been automating various farming tasks<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
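
The idea behind the masking can be sketched on plain lists of token ids: every label is set to `-100` (the index that PyTorch's cross-entropy loss ignores by default) except the tokens that follow a response marker, up to the next instruction marker. This is a simplified illustration, not Unsloth's implementation; `mask_to_responses` and `find_subsequence` are hypothetical helpers and the token ids in the example are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def find_subsequence(seq, pattern, start=0):
    """Return the first index >= start where pattern occurs in seq, else -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_to_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that come after a response marker."""
    if not instruction_ids or not response_ids:
        raise ValueError("marker token lists must be non-empty")
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]  # unmask the response span
        pos = end
    return labels

# Made-up ids: [10, 11] marks a user turn, [20, 21] marks an assistant turn
ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 9]
print(mask_to_responses(ids, [10, 11], [20, 21]))
```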

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [16]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 12,308 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,538
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,0.004
2,0.1521


KeyboardInterrupt: 
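
The numbers in the banner above follow from the training arguments. The sketch below reproduces them under one assumption: that the number of optimizer steps per epoch is the number of micro-batches floor-divided by the gradient accumulation steps, which matches the reported 1,538 (other trainer versions may round differently). `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Estimate optimizer steps, assuming floor division over micro-batches."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    return steps_per_epoch * epochs

# Effective batch = per-device batch * gradient accumulation steps * GPUs
effective_batch = 2 * 4 * 1
print(effective_batch, optimizer_steps(12308, 2, 4))
```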

<a name="Inference"></a>
### Inference
Run the model!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
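
As a rough picture of what these two settings do, the sketch below applies a temperature to a small list of logits and then drops every token whose probability is below `min_p` times the probability of the most likely token. It is a plain-Python illustration with made-up logits, and `min_p_filter` is a hypothetical helper; `model.generate` applies its own implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Temperature-scaled softmax followed by min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens at least min_p times as likely as the top token
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

print(min_p_filter([3.0, 2.0, -5.0]))
```

A higher temperature flattens the distribution, while the min-p cutoff removes the unlikely tail that the flattening would otherwise make easier to sample.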

Inference for typo correction task

In [19]:
print("System input: ")
print(testset[0]['conversations'][0])
print("User input: ")
print(testset[0]['conversations'][1])
print("Ground truth: ")
print(testset[0]['conversations'][2])

System input: 
{'content': 'You are an expert in correcting typos in sentences.', 'role': 'system'}
User input: 
{'content': 'Correct typos the in following sentence and return only the sentence: in ths futurd sustainable transportation options will still bsjsought to ease congestion', 'role': 'user'}
Ground truth: 
{'content': 'in the future sustainable transportation options will still be sought to ease congestion', 'role': 'assistant'}


In [20]:
import difflib
def compute_accuracy_and_wrong_syllables(true_sentence, predicted_sentence):
    # Character-level accuracy using SequenceMatcher
    char_matcher = difflib.SequenceMatcher(None, true_sentence, predicted_sentence)
    accuracy = char_matcher.ratio()

    # Word-level wrong syllable count using SequenceMatcher on word lists
    true_words = true_sentence.split()
    predicted_words = predicted_sentence.split()
    word_matcher = difflib.SequenceMatcher(None, true_words, predicted_words)

    # Calculate wrong syllables based on insert, delete, and replace operations
    wrong_syllables = sum(1 for tag, _, _, _, _ in word_matcher.get_opcodes() if tag in ('insert', 'delete', 'replace'))

    return accuracy, wrong_syllables
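
A few toy inputs show how the two numbers behave. The function is restated compactly so the snippet runs on its own, and the sentences are made up. Note that the second value counts non-matching edit blocks over words, so two adjacent wrong words count as one.

```python
import difflib

def compute_accuracy_and_wrong_syllables(true_sentence, predicted_sentence):
    # Character-level similarity ratio
    accuracy = difflib.SequenceMatcher(None, true_sentence, predicted_sentence).ratio()
    # Number of non-equal edit blocks between the two word lists
    word_matcher = difflib.SequenceMatcher(
        None, true_sentence.split(), predicted_sentence.split()
    )
    wrong = sum(1 for tag, *_ in word_matcher.get_opcodes() if tag != "equal")
    return accuracy, wrong

exact = compute_accuracy_and_wrong_syllables("the cat sat", "the cat sat")
one_word = compute_accuracy_and_wrong_syllables("the cat sat", "the bat sat")
two_words = compute_accuracy_and_wrong_syllables("the quick brown fox", "the slow red fox")
print(exact, one_word, two_words)
```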

In [21]:
from unsloth.chat_templates import get_chat_template
import re
from tqdm import tqdm
import csv

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


def predict_gpt(messages, ans):
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                            temperature = 1.5, min_p = 0.1)
    ret = tokenizer.batch_decode(outputs)

    gpt_response = re.search(r"assistant<\|end_header_id\|>\n\n(.*?)<\|eot_id\|>", ret[0], re.DOTALL)

    if gpt_response:
        gpt_response = gpt_response.group(1)
        gpt_response = gpt_response.lower().strip()
        gpt_response = re.sub(r'[^a-z0-9\s]', '', gpt_response)
        # print(gpt_response)
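    else:
        # No complete assistant turn found (e.g. generation stopped before <|eot_id|>);
        # score it as an empty prediction instead of passing None to the metric.
        print("Error")
        gpt_response = ""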

    accuracy, wrong_syllables = compute_accuracy_and_wrong_syllables(ans, gpt_response)
    # print(f"Accuracy: {accuracy}, Wrong Syllables: {wrong_syllables}")
    return gpt_response, accuracy, wrong_syllables
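
The extraction and normalization step can be tried without a GPU. The sketch below runs the same regular expression on a hand-written decoded string; `extract_assistant_reply` is a hypothetical helper and the sample text is made up. It returns an empty string when no complete assistant turn is found.

```python
import re

ASSISTANT_RE = re.compile(
    r"assistant<\|end_header_id\|>\n\n(.*?)<\|eot_id\|>", re.DOTALL
)

def extract_assistant_reply(decoded):
    """Pull out the first assistant turn and normalize it for scoring."""
    match = ASSISTANT_RE.search(decoded)
    if match is None:
        return ""
    reply = match.group(1).lower().strip()
    # Keep only lowercase letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", "", reply)

decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Fix: helo wrld<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "Hello, world!<|eot_id|>"
)
print(extract_assistant_reply(decoded))
```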



In [22]:
acc_count = []
wrongsylb_count = []

with open('llm_res_.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # NOTE: the "Predicted Sentence" column holds the uncorrected input sentence
    writer.writerow(["True Sentence", "Predicted Sentence", "LLM Sentence"])  # Write header row

    progress = tqdm(range(len(testset)), desc="Testing")  # avoid shadowing the built-in iter()
    for i in progress:
        messages = [{'content': 'You are an expert in correcting typos in sentences.', 'role': 'system'}]
        messages.append({'content': testset[i]['conversations'][1]['content'], 'role': 'user'})
        true_sentence = testset[i]['conversations'][2]['content']
        if i == 0:
            print("\nFirst sentence to be tested: ")
            print(messages)
            print("Ground Truth Label: ")
            print(true_sentence)
        ret_sentence, ret_acc, ret_wrong_syllables = predict_gpt(messages, true_sentence)

        # Strip the instruction prefix to recover the raw input sentence
        input_sentence = testset[i]['conversations'][1]['content'].split("return only the sentence: ", 1)[-1]
        writer.writerow([true_sentence, input_sentence, ret_sentence])  # Write data row

        acc_count.append(ret_acc)
        wrongsylb_count.append(ret_wrong_syllables)
        running_avg = sum(acc_count) / len(acc_count)
        progress.set_postfix_str(f"Running avg accuracy: {running_avg}")

print("\n===== Testing Result =====")
print(f"Average accuracy: {sum(acc_count)/len(acc_count)}")
print(f"Average wrong syllables: {sum(wrongsylb_count)/len(wrongsylb_count)}")

Testing:   0%|          | 0/12308 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



First sentence to be tested: 
[{'content': 'You are an expert in correcting typos in sentences.', 'role': 'system'}, {'content': 'Correct typos the in following sentence and return only the sentence: in ths futurd sustainable transportation options will still bsjsought to ease congestion', 'role': 'user'}]
Ground Truth Label: 
in the future sustainable transportation options will still be sought to ease congestion


Testing:   0%|          | 13/12308 [00:13<3:31:31,  1.03s/it, Running avg accuracy: 1.0]


KeyboardInterrupt: 

In [None]:
# Raw data for reference
print(acc_count)
print(wrongsylb_count)
print(sum(wrongsylb_count))

[0.8093023255813954, 0.989247311827957, 0.544, 0.9801980198019802, 0.8352941176470589, 0.7727272727272727, 0.7707317073170732, 0.7484662576687117, 0.9558823529411765, 0.8098159509202454, 0.9714285714285714, 1.0, 1.0, 0.8045977011494253, 0.9787234042553191, 0.5899280575539568, 0.8760330578512396, 0.75, 0.9014084507042254, 0.8762886597938144, 0.9206349206349206, 0.9750889679715302, 0.6951219512195121, 1.0, 0.9912280701754386, 0.8333333333333334, 0.6428571428571429, 0.8243243243243243, 0.8436018957345972, 0.6304347826086957, 0.9259259259259259, 0.9342105263157895, 0.7885714285714286, 0.4642857142857143, 0.8299319727891157, 0.8950276243093923, 0.3617021276595745, 0.7959183673469388, 0.7692307692307693, 0.9873417721518988, 0.9905660377358491, 0.952755905511811, 0.9340659340659341, 0.8032128514056225, 0.9302325581395349, 0.8165680473372781, 0.9379310344827586, 0.9743589743589743, 0.7049180327868853, 0.7978142076502732, 0.9425287356321839, 0.9769585253456221, 0.8556149732620321, 0.85227272727

Notes on previous experiments

Results - zoom

* 6: 90.9554%, 1270
* 5: 96.6716%, 690
* 1: 99.5829%, 230

Results - phone

* 2: 86.6411%, 1561
* 1.5: 98.1362%, 493
* 1: 99.4671%, 270

Saving

In [None]:
lora_name = "lora_model_phone"
model.save_pretrained(lora_name) # Local saving
tokenizer.save_pretrained(lora_name)
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model_phone2/tokenizer_config.json',
 'lora_model_phone2/special_tokens_map.json',
 'lora_model_phone2/tokenizer.json')

In [None]:
!zip -r /content/lora_model_phone.zip /content/lora_model_phone

  adding: content/lora_model_phone2/ (stored 0%)
  adding: content/lora_model_phone2/special_tokens_map.json (deflated 71%)
  adding: content/lora_model_phone2/adapter_config.json (deflated 54%)
  adding: content/lora_model_phone2/README.md (deflated 66%)
  adding: content/lora_model_phone2/adapter_model.safetensors (deflated 7%)
  adding: content/lora_model_phone2/tokenizer_config.json (deflated 94%)
  adding: content/lora_model_phone2/tokenizer.json (deflated 85%)


In [None]:
from google.colab import files
files.download(f"/content/{lora_name}.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Test saved model - for single input. For testing during development.

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {'content': 'You are an expert in correcting typos in sentences.', 'role': 'system'},
    {'content': 'Correct typos the in following sentence and return only the sentence: in the future sustainable transportation options will sgill9be sought to ease congestion', 'role': 'user'},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
ret = tokenizer.batch_decode(outputs)

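import re  # needed when this cell is run on its own

gpt_response = re.search(r"assistant<\|end_header_id\|>\n\n(.*?)<\|eot_id\|>", ret[0], re.DOTALL)
if gpt_response:
    gpt_response = gpt_response.group(1)
    print(gpt_response)
else:
    gpt_response = ""  # keep the metric call below from receiving None
    print("Error")

ans = 'in the future sustainable transportation options will still be sought to ease congestion'
# compute_accuracy_and_wrong_syllables is defined in the metric cell above
print(compute_accuracy_and_wrong_syllables(ans, gpt_response))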