# This notebook uses Unsloth to do QLoRA with Llama 3.1 8B Instruct on Colab.  

Info:  
https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing  

https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing  

https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama  

https://github.com/unslothai/unsloth/wiki  

https://huggingface.co/docs/transformers/index  

### Install Unsloth and Hugging Face's datasets module

In [1]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install datasets

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-4dz4wnyi/unsloth_9e23d0cf9b1b497996ac2aba5a8a9e52
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-4dz4wnyi/unsloth_9e23d0cf9b1b497996ac2aba5a8a9e52
  Resolved https://github.com/unslothai/unsloth.git to commit d0ca3497eb5911483339be025e9924cf73280178
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading tyro-0.8.8-py3-none-any.whl.metadata (8.4 kB)
Collecting transformers>=4.43.2 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[c

### Set the parameters you need to test/choose.  

Info:  
https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#trl.SFTTrainer  
https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/trainer#transformers.TrainingArguments

In [4]:
# Set these input parameters

# Dataset parameters
age_range = '1_3' # Choose '1_3' or '4_6'

# Model parameters
context_length = 4096 # A context length of 2048 tokens roughly equals 1500 words.
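
The rule of thumb in the comment above (2048 tokens for roughly 1500 words) implies about 0.73 words per token. A minimal sketch of that arithmetic follows; the helper name `estimate_words` is hypothetical, and the real ratio depends on the tokenizer and on the text, so treat the result as a rough estimate only.

```python
# Rule of thumb from the comment above: 2048 tokens ~ 1500 words.
WORDS_PER_TOKEN = 1500 / 2048  # ~0.73, an approximation only

def estimate_words(num_tokens: int) -> int:
    """Rough number of words that fit into num_tokens tokens."""
    return round(num_tokens * WORDS_PER_TOKEN)

print(estimate_words(2048))  # 1500
print(estimate_words(4096))  # 3000
```

Note that the context length covers the prompt and the generated story together.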

### Import the packages used throughout

In [3]:
import torch
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TextStreamer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


### Load a Large Language Model

In [5]:
# Model info: https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
max_seq_length = context_length
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    dtype = None,
)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

## Run this cell only when you want to save the base model without the LoRA adapter.   

- I'm not sure whether the Q4_K_M GGUF file uses the same quantization as the model loaded in memory, that is, the 4-bit model used in the following LoRA training. So make sure to check the performance yourself. Later, we will merge the base model with the trained LoRA adapter and then save the merged model.

- This file conversion takes over 30 minutes on a T4 GPU on Colab.  

At the end, there will be the following files in the "base_model" directory:  

-rw-r--r-- 1 root root  977 Aug 15 22:36 config.json  
-rw-r--r-- 1 root root  234 Aug 15 22:36 generation_config.json  
-rw-r--r-- 1 root root 4.7G Aug 15 22:38 pytorch_model-00001-of-00004.bin  
-rw-r--r-- 1 root root 4.7G Aug 15 22:39 pytorch_model-00002-of-00004.bin  
-rw-r--r-- 1 root root 4.6G Aug 15 22:40 pytorch_model-00003-of-00004.bin  
-rw-r--r-- 1 root root 1.1G Aug 15 22:40 pytorch_model-00004-of-00004.bin  
-rw-r--r-- 1 root root  24K Aug 15 22:40 pytorch_model.bin.index.json  
-rw-r--r-- 1 root root  454 Aug 15 22:36 special_tokens_map.json  
-rw-r--r-- 1 root root  55K Aug 15 22:36 tokenizer_config.json  
-rw-r--r-- 1 root root 8.7M Aug 15 22:36 tokenizer.json  
-rw-r--r-- 1 root root  15G Aug 15 22:46 unsloth.F16.gguf  
-rw-r--r-- 1 root root 4.6G Aug 15 23:03 unsloth.Q4_K_M.gguf  

If you run this cell, only unsloth.Q4_K_M.gguf is copied to your Google Drive.  

In [None]:
if False:
  # Mount your Google Drive
  from google.colab import drive
  drive.mount('/content/drive')

  # Save the base model to a GGUF file, quantized with q4_k_m.
  model.save_pretrained_gguf("base_model", tokenizer, quantization_method = "q4_k_m")

  # Now copy the GGUF file with the quantized model to your Google Drive:
  !cp /content/base_model/unsloth.Q4_K_M.gguf /content/drive/MyDrive/
  !rm -r ./base_model # Delete to free disk space

### Define the prompt used for training and testing  

This is the chat format for Llama 3.1 when the "ipython" role (tool calling) is not used.  
Info: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023  
Today Date: 12 Aug 2024

{}<|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}


In [6]:
llama_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

{}<|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}"""

#llama_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
#{}<|eot_id|><|start_header_id|>user<|end_header_id|>
#
#{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
#{}"""

We will use this prompt for our LoRA training as well as for testing.
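
The template above hard-codes "Today Date: 12 Aug 2024". A minimal sketch of the same layout with the date as a parameter follows, assuming a hypothetical helper `build_llama31_prompt` and named placeholders that are not part of this notebook. If you try it, use the same date format for training and for inference so that both see identical prompts.

```python
from datetime import date
from typing import Optional

# Same layout as llama_prompt, with named placeholders (an illustrative variant).
LLAMA31_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Cutting Knowledge Date: December 2023\n"
    "Today Date: {today}\n\n"
    "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "{assistant}"
)

def build_llama31_prompt(system: str, user: str, assistant: str = "",
                         today: Optional[date] = None) -> str:
    """Fill the template; leave assistant empty when prompting for generation."""
    today = today or date.today()
    return LLAMA31_TEMPLATE.format(
        today=today.strftime("%d %b %Y"),  # e.g. "12 Aug 2024"
        system=system,
        user=user,
        assistant=assistant,
    )

prompt = build_llama31_prompt("You write stories.", "Write a story.",
                              today=date(2024, 8, 12))
print(prompt)
```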

In [7]:
if age_range == '1_3':
  input_prompt = llama_prompt.format(
                    'You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.', # instruction
                    'Write a story for one-year-old children, two-year-old children, or three-year-old children.', # input
                    '' # output - leave this blank for generation!
                    )
elif age_range == '4_6':
  input_prompt = llama_prompt.format(
                    'You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.', # instruction
                    'Write a story for four-year-old children, five-year-old children, or six-year-old children.', # input
                    '' # output - leave this blank for generation!
                    )


In [8]:
print(input_prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




### Inference without LoRA  
We can further improve the output by choosing more parameters explained here:  

NoBadWordsLogitsProcessor  
https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.NoBadWordsLogitsProcessor

Stopping criteria  
https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.StoppingCriteria

transformers.GenerationConfig  
https://huggingface.co/docs/transformers/v4.44.0/en/main_classes/text_generation#transformers.GenerationConfig
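
As a starting point for the parameters documented above, here is a minimal sketch of sampling settings that can be passed to `model.generate`. The values are illustrative assumptions, not settings tuned in this notebook; the keyword names (`max_new_tokens`, `do_sample`, `temperature`, `top_p`, `repetition_penalty`) are standard `generate` / `GenerationConfig` arguments.

```python
# Illustrative values only (assumptions, not tuned for this notebook).
generation_kwargs = dict(
    max_new_tokens = 1500,      # the system prompt asks for fewer than 1000 words per turn
    do_sample = True,           # sample instead of greedy decoding
    temperature = 0.8,          # < 1 sharpens, > 1 flattens the token distribution
    top_p = 0.9,                # nucleus sampling threshold
    repetition_penalty = 1.1,   # > 1 discourages repeated tokens
)

# Usage in the inference cell below (needs the loaded model and tokenized inputs):
# _ = model.generate(**inputs, streamer = text_streamer, **generation_kwargs)
print(generation_kwargs)
```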



In [9]:
FastLanguageModel.for_inference(model)
inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4096)

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

story title: Benny's Little Adventure

In a sunny little village, there lived a little rabbit named 

Second identical run, to check whether the model reproduces the same story:

In [10]:
FastLanguageModel.for_inference(model)
inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4096)

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

story title: Benny's Little Friends

Once upon a time, in a bright and sunny garden, there lived a l

### Prepare the dataset  
More info about chat templates:
https://docs.unsloth.ai/basics/chat-templates

In [11]:
from datasets import Dataset

# Create a dataset defined in gen_age_1_3.py or gen_age_4_6.py.
# Info: https://huggingface.co/docs/datasets/v2.20.0/en/create_dataset
# This dataset has three features: system, user, assistant.
if age_range == '1_3':
    from gen_age_1_3 import gen_age_1_3
    dataset = Dataset.from_generator(gen_age_1_3)
elif age_range == '4_6':
    from gen_age_4_6 import gen_age_4_6
    dataset = Dataset.from_generator(gen_age_4_6)

# For the QLoRA training, create a chain of Llama chat conversations from the dataset.
# Take the data from dataset and put it to llama_prompt.
# This function adds another feature called 'text' to the dataset.
EOS_TOKEN = tokenizer.eos_token # End-of-sequence token
def formatting_prompts_func(examples):
    instructions = examples["system"]
    inputs       = examples["user"]
    outputs      = examples["assistant"]
    texts = []
    for instruction, user_input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = llama_prompt.format(instruction, user_input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]
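
The modules `gen_age_1_3.py` and `gen_age_4_6.py` are not shown in this notebook. Here is a minimal sketch of what such a generator could look like, assuming it yields one dict per story with the three features `system`, `user`, and `assistant`. The name `gen_example` and all texts below are placeholders, not the actual training data.

```python
from typing import Dict, Iterator

# Shortened placeholder prompts (the notebook uses a longer system prompt).
SYSTEM_PROMPT = "You are a creative assistant who writes stories for children."
USER_PROMPT = "Write a story for one-year-old, two-year-old, or three-year-old children."

# Placeholder stories; the dataset used in this notebook has 7 examples.
STORIES = [
    "story title: Placeholder One\n\nOnce upon a time...",
    "story title: Placeholder Two\n\nIn a small garden...",
]

def gen_example() -> Iterator[Dict[str, str]]:
    """Yield one record per story, in the shape Dataset.from_generator expects."""
    for story in STORIES:
        yield {"system": SYSTEM_PROMPT, "user": USER_PROMPT, "assistant": story}

# Usage with Hugging Face datasets: dataset = Dataset.from_generator(gen_example)
records = list(gen_example())
print(len(records), sorted(records[0]))
```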

In [12]:
print(dataset['text'])

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 12 Aug 2024\n\nYou are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWrite a story for one-year-old, two-year-old, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nstory title: The Wonky Donkey\n\nI was walking down the road and I saw... a donkey. Hee Haw! And he only had three legs! He 

## LoRA Training

In [13]:
# LoRA parameters
lora_r = 2
lora_alpha = 4
#lora_target_modules = ["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"]
lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Training parameters
num_train_epochs = 20
learning_rate = 8e-4 #2e-4 to 8e-4 gave a smooth decrease of the training loss.
per_device_train_batch_size = 1
gradient_accumulation_steps = 7 #7 for '1_3', 8 for '4_6'
warmup_steps = 5
logging_steps = 1

# **************** 4_6
# lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# learning_rate = 8e-4, lora_r = 2, lora_alpha = 4
# (  6 epochs, Training loss 2.3 --> 1.9, warmup_steps = 3) may be close to the right level.
# ( 10 epochs, Training loss 2.3 --> 1.7, warmup_steps = 5) produced stories very similar to what's available online.
# ( 20 epochs, Training loss 2.3 --> 1.2, warmup_steps = 5) cited stories existing online, but not in the dataset.


# **************** 1_3
# ****** With lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# With learning_rate = 4e-4, lora_r = 2, lora_alpha = 4,
# ( 5 epochs, Training loss 1.0) was underfit, poor performance...warmup_steps = 1
# ( 8 epochs, Training loss 0.3) was underfit but sometimes overfit.
# ( 9 epochs, Training loss 1.3-1.9) was underfit.
# (10 epochs, Training loss 0.6-0.8) was close to the right training level. <-- Might be the best so far. Generally good, but produced a copy of the train story as well sometimes
# (12 epochs, Training loss 1.8) was underfit.
# (15 epochs, Training loss 0.07) was close to the right training level.
# (15 epochs, Training loss 1.6) was underfit.
# (20 epochs, Training loss 1.3) was underfit.
# (30 epochs, Training loss 0.9) was mixed result, a bit underfit.

# With learning_rate = 8e-4, lora_r = 2, lora_alpha = 4,
# (10 epochs, Training loss 2.5 --> 1.6) was underfit.
# (20 epochs, Training loss 2.5 --> 0.8) was not bad. <---- Best so far.
# (25 epochs, Training loss 2.5 --> 0.6) was overfit.

# With learning_rate = 10e-4, lora_r = 2, lora_alpha = 8,
# ( 3 epochs, Training loss 2.5 --> 1.9) was underfit ...warmup_steps = 0
# ( 4 epochs, Training loss 2.5 --> 1.8) was underfit ...warmup_steps = 0
# ( 5 epochs, Training loss 2.5 --> 1.9) was overfit ...warmup_steps = 3
# (10 epochs, Training loss 2.5 --> 1.1) was overfit.

# With learning_rate = 10e-4, lora_r = 3, lora_alpha = 9,
# (4 epochs, Training loss 2.5 --> 1.9) was underfit ...warmup_steps = 2
# (5 epochs, Training loss 2.5 --> 1.8) was underfit ...warmup_steps = 2
# (6 epochs, Training loss 2.5 --> 1.6) was poor performance ...warmup_steps = 2
# (7 epochs, Training loss 2.5 --> 1.4) was overfit  ...warmup_steps = 3

# With learning_rate = 10e-4, lora_r = 3, lora_alpha = 6,
# ( 5 epochs, Training loss 2.5 --> 1.8) was underfit ...warmup_steps = 1
# ( 7 epochs, Training loss 2.5 --> 1.5) was poor performance ...warmup_steps = 1
# (10 epochs, Training loss 2.5 --> 1.1) was overfit ...warmup_steps = 5


# ******** With lora_target_modules = ["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"]
# With learning_rate = 4e-4, lora_r = 1, lora_alpha = 2
# (10 epochs, Training loss 1.8) was underfit.
# (15 epochs, Training loss 1.4) was underfit.
# (20 epochs, Training loss 1.1) was poor, produced only the title
# (25 epochs, Training loss 0.9) was overfit.

# With learning_rate = 4e-4, lora_r = 1, lora_alpha = 4
# (15 epochs, Training loss 1.1) was poor, produced only the title


# With learning_rate = 2e-4, lora_r = 2, lora_alpha = 2
# (20 epochs, Training loss 1.5) was somewhat underfit.
# (25 epochs, Training loss 1.3) was somewhat underfit.
# (27 epochs, Training loss 1.2) was optimal for this set of parameters, but perhaps the performance is not impressive.
# (30 epochs, Training loss 1.1) was sometimes good, sometimes bad, also cited Noah's Ark.
# (40 epochs, Training loss 0.9) was often overfit.

# With learning_rate = 4e-4, lora_r = 2, lora_alpha = 2
# (10 epochs, Training loss 1.7) was underfit.
# (15 epochs, Training loss 1.3) was close to the right training level.
# (20 epochs, Training loss 1.0) was a clear overfit.

# With learning_rate = 4e-4, lora_r = 2, lora_alpha = 4
# ( 8 epochs, Training loss 1.6) was slightly underfit, not good performance.
# (10 epochs, Training loss 1.5) was close to the right training level, but slightly underfit.
# (12 epochs, Training loss 1.3) was overfit.
# (15 epochs, Training loss 1.0) was a clear overfit.

# With learning_rate = 4e-4, lora_r = 8, lora_alpha = 16
# (3 epochs, Training loss 1.9) was very poor ....warmup_steps = 1
# (6 epochs, Training loss 1.2) was clear overfit.

# With learning_rate = 2e-4, lora_r = 8, lora_alpha = 8
# (5 epochs, Training loss 1.8) was underfit ....warmup_steps = 1
# (7 epochs, Training loss 1.6) was underfit ....warmup_steps = 1
# (10 epochs, Training loss ) was poor ..mostly produced only the title



#### Define LoRA configuration  
Info:  
peft.LoraConfig  
https://huggingface.co/docs/peft/v0.12.0/en/package_reference/lora#peft.LoraConfig  

peft.get_peft_model  
https://huggingface.co/docs/peft/v0.12.0/en/package_reference/peft_model#peft.get_peft_model

In [14]:
merged_model = FastLanguageModel.get_peft_model(
    model,
    r = lora_r,
    lora_alpha = lora_alpha,
    target_modules = lora_target_modules,
    use_rslora=True, # rank stabilized LoRA
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    random_state = 3407,
    loftq_config = None, # LoftQ
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.
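
The number of trainable parameters follows from the rank and from the shapes of the targeted projections. Here is a minimal sketch of the arithmetic, assuming the published Llama 3.1 8B shapes (hidden size 4096, 32 layers, 1024-dimensional key/value projections). Each adapted linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The result matches the 1,703,936 trainable parameters reported by the training banner later in this notebook. The last lines compare the rank-stabilized scaling `alpha / sqrt(r)`, which `use_rslora=True` selects, with the standard `alpha / r`.

```python
import math

lora_r, lora_alpha = 2, 4
num_layers = 32
# (in_features, out_features) of the targeted projections; assumed Llama 3.1 8B shapes.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

# LoRA adds two matrices per adapted layer: A (r x in) and B (out x r).
per_layer = sum(lora_r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = per_layer * num_layers
print(f"{trainable:,}")  # 1,703,936

standard_scale = lora_alpha / lora_r            # 2.0
rslora_scale = lora_alpha / math.sqrt(lora_r)   # ~2.83
print(standard_scale, round(rslora_scale, 2))
```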


#### Train LoRA  
Info:  
https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#trl.SFTTrainer
https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/trainer#transformers.TrainingArguments

In [15]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = merged_model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        num_train_epochs = num_train_epochs,
        per_device_train_batch_size = per_device_train_batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        warmup_steps = warmup_steps,
        learning_rate = learning_rate,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = logging_steps,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

Map (num_proc=2):   0%|          | 0/7 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7 | Num Epochs = 20
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 7
\        /    Total batch size = 7 | Total steps = 20
 "-____-"     Number of trainable parameters = 1,703,936


Step,Training Loss
1,2.5013
2,2.5013
3,2.4434
4,2.3099
5,2.1549
6,1.9877
7,1.8495
8,1.7258
9,1.5949
10,1.4529
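
The banner's "Total batch size = 7" and "Total steps = 20" follow from the training parameters. Here is a minimal sketch of that arithmetic for a single GPU; the names `effective_batch_size`, `batches_per_epoch`, and `steps_per_epoch` are illustrative.

```python
import math

num_examples = 7
per_device_train_batch_size = 1
gradient_accumulation_steps = 7
num_train_epochs = 20
num_gpus = 1

# Examples consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Optimizer steps per epoch: batches per epoch divided by the accumulation steps.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_gpus))
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 7 1 20
```

With these settings each optimizer step sees the whole dataset once, so one step equals one epoch, and `warmup_steps = 5` covers the first quarter of training. The comment `7 for '1_3', 8 for '4_6'` suggests the same one-step-per-epoch setup is intended for the other dataset.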


### Inference with LoRA

We can further improve the output by choosing more parameters explained here:  

NoBadWordsLogitsProcessor  
https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.NoBadWordsLogitsProcessor

Stopping criteria  
https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.StoppingCriteria

transformers.GenerationConfig  
https://huggingface.co/docs/transformers/v4.44.0/en/main_classes/text_generation#transformers.GenerationConfig

In [16]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Rabbit Who Wants to Fall Asleep

In the forest, in a burrow, a little rabbit lived with his mother. One evening, the little rabbit felt tired. He wanted to go to sleep. But he didn't know how to fall asleep. He thought, "If I were a bird, I would fly. If I were a fish, I would swim. If I were a deer, I would run. If I were a squirrel, I would climb a tree." He thought of all the things he could do if he were different animals. But he was a rabbit. He couldn't fly, he couldn't swim, he couldn't run, he couldn't climb a tree. He couldn't even close his eyes. He thought, "If I were a rabbit who could close his eyes, I would close my eyes. I would go to sleep." He closed his eyes. He fell asleep.<|eot_id|>
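
The `while` loop above regenerates until the output has at least 500 characters, so it would not terminate if the model kept producing short outputs. Here is a minimal sketch of a bounded variant, assuming a hypothetical callable `generate_fn` that wraps the tokenize, generate, and decode steps and returns the story text. The demonstration uses canned strings instead of the model.

```python
from typing import Callable

def generate_with_retries(generate_fn: Callable[[], str],
                          min_chars: int = 500,
                          max_attempts: int = 5) -> str:
    """Call generate_fn until the text has at least min_chars characters.

    Returns the longest text seen if no attempt reaches min_chars.
    """
    best = ""
    for _ in range(max_attempts):
        text = generate_fn()
        if len(text) >= min_chars:
            return text
        if len(text) > len(best):
            best = text
    return best

# Stand-in for the model: the first output is too short, the second is long enough.
fake_outputs = iter(["story title: Too Short",
                     "story title: Long Enough\n\n" + "word " * 120])
story = generate_with_retries(lambda: next(fake_outputs))
print(len(story) >= 500)
```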


Second inference:

In [17]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: I Love You to the Moon and Back

I love you to the moon and back. I love you to the sky and back. I love you to the highest high and back again. I love you to the deepest deep and back again. I love you to the beginning of time and back again. I love you to the end of time and back again. I love you to the highest mountain and back again. I love you to the lowest valley and back again. I love you to the highest ocean and back again. I love you to the driest desert and back again. I love you to the most wonderful place that I know and back again. I love you to the moon and back.<|eot_id|>
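
The printed stories still end with the `<|eot_id|>` marker, and the cells remove the prompt by exact string replacement. Here is a minimal sketch of a more direct clean-up, assuming the decoded text follows the chat format shown earlier; the helper `extract_assistant_text` is hypothetical and works on the decoded string only.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"

def extract_assistant_text(decoded: str) -> str:
    """Return the last assistant turn without its header or end-of-turn marker."""
    # Keep everything after the final assistant header (or the whole text if absent).
    _, _, reply = decoded.rpartition(ASSISTANT_HEADER)
    # Drop the end-of-turn marker and anything after it.
    reply = reply.split(EOT, 1)[0]
    return reply.strip()

decoded = (
    "<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You write stories.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Write a story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "story title: Example\n\nOnce upon a time.<|eot_id|>"
)
print(extract_assistant_text(decoded))
```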


Third inference:

In [18]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Little Rabbit and the Moon

In the forest, there lived a little rabbit who loved to explore. One day, the little rabbit decided to go on a journey to the moon. She packed a small bag with some carrots and a bottle of water and set off.

As she hopped through the forest, she met a wise old owl who said, "Where are you going, little rabbit?" "I'm going to the moon," said the little rabbit. "I've never been there before, and I want to see what it's like." "That's a long way to go," said the owl. "But if you're sure you want to go, I'll give you a map to help you find your way." The little rabbit thanked the owl and took the map.

She hopped and hopped until she came to a river. She looked at the map and saw that she had to cross the river to get to the moon. She found a small boat and paddled across. On the other side, she met a friendly fish who said, "Where are you going, little rabbit?" "I'm going to the moon," said the little rabbit. "I've never been there before, and

Fourth inference:

In [19]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Rabbit and the Lovely Moon

In the forest, there lived a little rabbit. He was a curious rabbit. One night, he saw the moon shining brightly in the sky. He wanted to catch the moon. So he set out to catch it. He ran and ran, but the moon was always just out of reach. He climbed up a tree, but the moon was higher. He jumped over a stream, but the moon was farther away. He even dug a tunnel, but the moon was beyond it. At last, he realized that the moon was not a thing to be caught. It was just a beautiful sight to see. The little rabbit returned home, happy to have seen the lovely moon.<|eot_id|>


### Save the trained model

In [20]:
# This saves only the LoRA adapter.
merged_model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

if False:
  # This saves the merged LoRA-and-base model, quantized with q4_k_m, to a GGUF file.
  merged_model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
  #merged_model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.01 out of 12.67 RAM for saving.


 50%|█████     | 16/32 [00:00<00:00, 19.96it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:42<00:00,  3.20s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be ./model/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --

Now copy the model data to Google Drive:

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
!pwd

/content


In [23]:
!ls

drive		huggingface_tokenizers_cache  lora_model  outputs      sample_data
gen_age_1_3.py	llama.cpp		      model	  __pycache__  _unsloth_sentencepiece_temp


In [24]:
!ls -lh ./lora_model

total 16M
-rw-r--r-- 1 root root  691 Aug 22 11:20 adapter_config.json
-rw-r--r-- 1 root root 6.6M Aug 22 11:20 adapter_model.safetensors
-rw-r--r-- 1 root root 5.0K Aug 22 11:20 README.md
-rw-r--r-- 1 root root  454 Aug 22 11:20 special_tokens_map.json
-rw-r--r-- 1 root root  55K Aug 22 11:20 tokenizer_config.json
-rw-r--r-- 1 root root 8.7M Aug 22 11:20 tokenizer.json


In [25]:
!cp -r ./lora_model ./drive/MyDrive/

In [26]:
!ls -lh ./model

total 35G
-rw-r--r-- 1 root root  977 Aug 22 11:22 config.json
-rw-r--r-- 1 root root  234 Aug 22 11:22 generation_config.json
-rw-r--r-- 1 root root 4.7G Aug 22 11:23 pytorch_model-00001-of-00004.bin
-rw-r--r-- 1 root root 4.7G Aug 22 11:24 pytorch_model-00002-of-00004.bin
-rw-r--r-- 1 root root 4.6G Aug 22 11:25 pytorch_model-00003-of-00004.bin
-rw-r--r-- 1 root root 1.1G Aug 22 11:25 pytorch_model-00004-of-00004.bin
-rw-r--r-- 1 root root  24K Aug 22 11:25 pytorch_model.bin.index.json
-rw-r--r-- 1 root root  454 Aug 22 11:22 special_tokens_map.json
-rw-r--r-- 1 root root  55K Aug 22 11:22 tokenizer_config.json
-rw-r--r-- 1 root root 8.7M Aug 22 11:22 tokenizer.json
-rw-r--r-- 1 root root  15G Aug 22 11:31 unsloth.F16.gguf
-rw-r--r-- 1 root root 4.6G Aug 22 11:47 unsloth.Q4_K_M.gguf


In [27]:
!cp ./model/unsloth.Q4_K_M.gguf ./drive/MyDrive/