# Lab: LLM Finetuning

## 1. Import Required Libraries

In [1]:
"""
Unsloth Packages
"""

# Importing the FastLanguageModel class from the unsloth library.
# This class is optimized for efficient training of large language models (LLMs).
from unsloth import FastLanguageModel

# Importing the get_chat_template function from the unsloth.chat_templates module.
# This function attaches a predefined chat template (for example, "llama-3.1") to a tokenizer,
# so that conversations are rendered with the special tokens and layout the model expects.
from unsloth.chat_templates import get_chat_template

# Importing the standardize_sharegpt function from the unsloth.chat_templates module.
# This function converts datasets stored in the ShareGPT conversation format
# ({"from": ..., "value": ...}) into the role/content format used by Hugging Face
# chat templates ({"role": ..., "content": ...}), so that the tokenizer's chat template
# can be applied to every conversation in a consistent way.
from unsloth.chat_templates import standardize_sharegpt

# Importing the train_on_responses_only function from the unsloth.chat_templates module.
# This function wraps a trainer so that the loss is computed only on the assistant's responses.
# The system and user tokens remain in the input as context, but their labels are masked out,
# so the model learns to produce replies rather than to reproduce prompts.
from unsloth.chat_templates import train_on_responses_only

# Importing the is_bfloat16_supported function from the unsloth library.
# This function checks whether the bfloat16 data type is supported on the current hardware.
# Bfloat16 is a floating-point format that is particularly useful in deep learning
# because it allows for faster computations and reduced memory usage compared to 
# traditional 32-bit floating-point formats. It retains the range of 32-bit floats 
# while using only 16 bits, making it a popular choice for training large models 
# on modern hardware, especially TPUs and some GPUs.
from unsloth import is_bfloat16_supported

"""
Hugging Face Packages
"""

# Importing the load_dataset function from the datasets library.
# This function is used to load datasets from the Hugging Face Hub or local files.
# It provides a simple and efficient way to access a wide variety of datasets
# that can be used for training and evaluating machine learning models.
# The datasets library supports various formats and allows for easy manipulation
# of data, making it a valuable tool for data preparation in natural language processing (NLP) tasks.
from datasets import load_dataset

# Importing the SFTTrainer class from the trl (Transformer Reinforcement Learning) library.
# This class is specifically designed for fine-tuning language models using supervised fine-tuning (SFT) techniques.
# It provides a structured approach to training models on specific tasks, allowing for better performance
# and adaptation to particular datasets or objectives. The SFTTrainer handles various aspects of the training process,
# including data loading, model evaluation, and optimization, making it easier for developers to implement
# effective training routines for their language models.
from trl import SFTTrainer

# Importing the TrainingArguments class from the transformers library.
# This class is used to define the training configuration for fine-tuning transformer models.
# It allows users to specify various parameters such as learning rate, batch size, number of epochs,
# and other settings that control the training process. Proper configuration of these arguments
# is crucial for achieving optimal model performance and convergence during training.
from transformers import TrainingArguments

# Importing the DataCollatorForSeq2Seq class from the transformers library.
# This class is designed to handle the preparation of batches of data for sequence-to-sequence models.
# It takes care of padding the input sequences to the same length within a batch, which is necessary
# for efficient processing by the model. Additionally, it can manage the creation of attention masks
# and other necessary components for training sequence-to-sequence architectures, ensuring that
# the data is formatted correctly for the model's requirements.
from transformers import DataCollatorForSeq2Seq

# Importing the TextStreamer class from the transformers library.
# The TextStreamer is a utility designed to facilitate the streaming of text outputs
# from language models during inference. This is particularly useful for applications
# where real-time feedback is desired, such as chatbots or interactive applications.
# By using TextStreamer, developers can display generated text incrementally as it is produced,
# enhancing the user experience by providing immediate responses rather than waiting for the
# entire output to be generated before displaying it.
from transformers import TextStreamer

"""
Other Packages
"""

# Importing the torch library, which is the core framework for deep learning in Python.
# It provides functionalities for tensor operations, GPU acceleration, and automatic differentiation.
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 2. Set Parameters

In [2]:
# Setting the maximum sequence length for the model's input.
# This defines how many tokens the model can process at once.
# A value of 2048 is chosen here, which is suitable for many applications.
max_seq_length = 2048

# Setting the data type for model computations.
# None means that the optimal data type will be auto-detected based on the hardware.
# This can help in optimizing performance and memory usage.
dtype = None

# Enabling 4-bit quantization for the model.
# Storing weights in 4 bits instead of 16 cuts weight memory by approximately 75%,
# allowing the model to run on consumer-grade GPUs, usually with only a small loss in quality.
load_in_4bit = True
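
The 75% figure follows from simple arithmetic on bits per weight. The sketch below estimates weight memory only. The parameter count of about 3.2 billion is an assumed round figure for a 3B-class model, and the estimate ignores quantization constants, activations, and optimizer state, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Estimate the memory needed to store model weights, in GB (1024**3 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3

# Assumed parameter count for illustration (roughly a 3B-class model).
NUM_PARAMS = 3_200_000_000

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)
reduction = 1 - mem_4bit / mem_16bit

print(f"16-bit weights: {mem_16bit:.2f} GB")
print(f"4-bit weights:  {mem_4bit:.2f} GB")
print(f"Reduction:      {reduction:.0%}")
```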

## 3. Load Pre-trained Model and Tokenizer

In [3]:
# Loading a pre-trained language model and its tokenizer using the FastLanguageModel class.
# This function is designed to facilitate the retrieval of models that have already been trained,
# allowing us to leverage existing knowledge without starting from scratch.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  # Specify the model name to load. 
    # An alternative model can be chosen: "unsloth/Llama-3.2-1B-Instruct" for a smaller variant.
    
    max_seq_length = max_seq_length,  # Set the maximum sequence length for input tokens.
    # This parameter defines how many tokens the model can process in a single input,
    # which is crucial for managing memory and computational efficiency.
    
    dtype = dtype,  # Specify the data type for model computations.
    # Setting this to None allows for automatic detection of the optimal data type based on the hardware.
    
    load_in_4bit = load_in_4bit,  # Enable 4-bit quantization for the model.
    # Storing weights in 4 bits instead of 16 cuts weight memory by approximately 75%,
    # making it feasible to run the model on consumer-grade GPUs, usually with only a small loss in quality.
)

==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 Ti SUPER. Max memory: 15.695 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## 4. Add LoRA Adapters

In [4]:
# This cell initialises the model with LoRA (Low-Rank Adaptation) parameters.
# LoRA allows us to fine-tune only a small subset of the model's parameters,
# which is efficient for training and reduces memory usage.

model = FastLanguageModel.get_peft_model(
    model,  # The pre-trained model we are adapting with LoRA.
    
    r = 16,  # The rank of the LoRA adapters. 
             # This value determines how many parameters will be updated.
             # Suggested values are 8, 16, 32, 64, or 128, depending on the task complexity.
    
    target_modules = [  # List of model components to which LoRA will be applied.
        "q_proj",  # Query projection layer.
        "k_proj",  # Key projection layer.
        "v_proj",  # Value projection layer.
        "o_proj",  # Output projection layer.
        "gate_proj",  # Gate projection layer for controlling information flow.
        "up_proj",  # Up projection layer for increasing dimensionality.
        "down_proj",  # Down projection layer for reducing dimensionality.
    ],
    
    lora_alpha = 16,  # Scaling factor for the LoRA updates.
                      # The adapter output is scaled by lora_alpha / r, so with r = 16 the scale is 1.0.
    
    lora_dropout = 0,  # Dropout rate for the LoRA layers.
                       # A value of 0 means no dropout, which is optimized for performance.
    
    bias = "none",  # Bias handling in the LoRA layers.
                    # Setting to "none" is optimized for performance, but other options are available.
    
    # This parameter enables gradient checkpointing, which saves memory during training.
    # According to Unsloth, the "unsloth" option uses about 30% less VRAM,
    # which leaves room for larger batch sizes during training.
    use_gradient_checkpointing = "unsloth",  # Can also be set to True for standard gradient checkpointing.
    
    random_state = 42,  # Seed for random number generation to ensure reproducibility.
    
    use_rslora = False,  # Option to use rank-stabilized LoRA (rsLoRA),
                         # which scales the adapters by lora_alpha / sqrt(r) instead of lora_alpha / r.
    
    loftq_config = None,  # Configuration for LoftQ, if applicable.
                          # Currently set to None, indicating no specific configuration is provided.
)

Unsloth 2024.12.11 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
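
The number of trainable parameters that the training banner reports in section 6.5 (24,313,856) can be reproduced by hand. Each adapted linear layer gains two small matrices that together hold `r * (in_features + out_features)` weights. The sketch below assumes the layer shapes of Llama 3.2 3B (hidden size 3072, MLP size 8192, key/value projections of width 1024). Only the `q_proj` shape is visible in this notebook's output, so the other shapes are assumptions taken from the published model configuration.

```python
# Assumed (in_features, out_features) for each adapted projection in one decoder layer.
LAYER_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # matches "patched 28 layers" in the output above
RANK = 16        # the r passed to get_peft_model

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in lora_A (in_features x rank) plus lora_B (rank x out_features)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"Total trainable parameters:        {total:,}")
```

Doubling `r` doubles this count, which is why the rank is the main knob for adapter capacity and memory.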


## 5. Data Preprocessing
We use the FineTome-100k dataset (https://huggingface.co/datasets/mlabonne/FineTome-100k) for finetuning. It stores conversations in the ShareGPT format, so we first convert them to the role/content format that Hugging Face chat templates expect.

Example conversion:
From ShareGPT format:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```

To HuggingFace format:
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
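
The conversion amounts to renaming keys and speaker labels. The function below is an illustrative sketch of that idea, not Unsloth's implementation of `standardize_sharegpt`, which may handle more cases; `ROLE_MAP` and `sharegpt_to_messages` are hypothetical names.

```python
# Mapping from ShareGPT speaker names to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns):
    """Convert a list of ShareGPT turns into role/content messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

sharegpt_conversation = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]

messages = sharegpt_to_messages(sharegpt_conversation)
for message in messages:
    print(message)
```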

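The chat template then renders these role/content messages with Llama 3.1's special tokens:

```
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>[system message]<|eot_id|>
<|start_header_id|>user<|end_header_id|>[user message]<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>[assistant message]<|eot_id|>
```

The line breaks are added here for readability. In the actual text, each header is followed by two newlines and each turn ends with `<|eot_id|>`. No `<|end_of_text|>` token is appended, as the decoded labels in section 6.3.2 show.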
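
To make the layout concrete, the sketch below renders a list of role/content messages by plain string concatenation. It is a simplified stand-in for the tokenizer's real chat template: the hypothetical `render_llama31` helper mirrors the formatted output shown later in this notebook (two newlines after each header, `<|eot_id|>` after each turn) but does not inject the default system message.

```python
def render_llama31(messages):
    """Render role/content messages in a simplified Llama 3.1 chat layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

example_messages = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]

rendered = render_llama31(example_messages)
print(rendered)
```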

### 5.1 Load Dataset for Finetuning

In [5]:
# The following line of code loads the dataset from the Hugging Face Hub.
# We are specifically loading the "FineTome-100k" dataset, which is a collection
# of fine-tuning examples for conversational models. The dataset is structured
# in a way that is compatible with the training requirements of our model.

# The `load_dataset` function is part of the Hugging Face Datasets library,
# which provides a simple interface to access and manipulate datasets.
# The first argument is the dataset identifier, which in this case is
# "mlabonne/FineTome-100k". The second argument specifies the split of the
# dataset we want to load; here, we are loading the "train" split, which
# contains the training examples.

# By loading the training split, we can use this data to fine-tune our model
# on the specific conversational tasks that the dataset is designed for.

dataset = load_dataset("mlabonne/FineTome-100k", split="train")

### 5.2 Convert the Dataset to HuggingFace Format

In [6]:
# Standardize the dataset using the `standardize_sharegpt` function.
# This function prepares the dataset for further processing.
dataset = standardize_sharegpt(dataset)

### 5.3 Configure the Tokenizer to Use Llama 3.1 Instruct Chat Template

In [7]:
# The following line of code is responsible for configuring the tokenizer
# to use a specific chat template format. This is essential for ensuring
# that the input data is structured correctly for the model during training
# or inference.

# We call the function `get_chat_template`, which takes two arguments:
# 1. `tokenizer`: This is the tokenizer object that will be modified.
# 2. `chat_template`: This specifies the format of the chat template we want to use.
# In this case, we are using the "llama-3.1" format, which is designed for
# conversation-style fine-tuning.

# The output of this function will be a tokenizer that is configured to
# apply the specified chat template, allowing for proper formatting of
# conversation data.

tokenizer = get_chat_template(
    tokenizer,  # The existing tokenizer that we want to configure.
    chat_template = "llama-3.1",  # The specific chat template format to use.
)

### 5.4 Format the Dataset for Training

In [8]:
# This function is designed to format prompts from the dataset examples.
# It takes a batch of examples as input, where each example contains a list of conversations.
def formatting_prompts_func(examples):
    # Extract the conversations from the input examples.
    convos = examples["conversations"]
    
    # For each conversation, apply the chat template using the tokenizer.
    # The `apply_chat_template` method formats the conversation according to the specified template.
    # We set `tokenize=False` to prevent tokenization at this stage and `add_generation_prompt=False`
    # to avoid adding any additional prompts for generation.
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    
    # Return a dictionary containing the formatted texts, which will be used in the dataset.
    return {"text": texts}

# Map the `formatting_prompts_func` over the dataset.
# The `batched=True` argument indicates that the function will process multiple examples at once,
# which can improve performance by reducing the overhead of function calls.
dataset = dataset.map(formatting_prompts_func, batched=True)
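
With `batched=True`, the mapped function receives a dictionary of columns, where each value is a list with one entry per example, and it must return lists of the same length. The toy stand-in below imitates that contract in plain Python. `toy_batched_map` and `add_text` are hypothetical helpers for illustration; they are not part of the `datasets` library, and `add_text` joins message contents instead of applying a chat template.

```python
def toy_batched_map(rows, fn, batch_size=2):
    """Minimal imitation of a batched map: fn receives a dict of column lists."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        # Attach each new column value to its corresponding row.
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out

def add_text(examples):
    """Build one string per conversation in the batch."""
    convos = examples["conversations"]
    return {"text": [" | ".join(m["content"] for m in convo) for convo in convos]}

rows = [
    {"conversations": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]},
    {"conversations": [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}]},
    {"conversations": [{"role": "user", "content": "Bye"}, {"role": "assistant", "content": "Goodbye"}]},
]

mapped = toy_batched_map(rows, add_text)
for row in mapped:
    print(row["text"])
```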

### 5.5 Verify the Formatting of the Dataset

We look at how the conversations are structured for item 5:

In [9]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**Note: Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`**

In [10]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

## 6. Fine-Tuning
We'll use the Supervised Fine-Tuning (SFT) trainer from Hugging Face's TRL library.

Note:
- We run only 60 training steps as a quick demonstration.
- For a full training run, set num_train_epochs=1 and remove max_steps (its default of -1 disables the step limit).
- DPOTrainer is also supported.

See the full documentation at: https://huggingface.co/docs/trl/sft_trainer

### 6.1 Initialize the SFTTrainer

In [11]:
# Initialize the SFTTrainer, which is responsible for supervised fine-tuning of the model.
trainer = SFTTrainer(
    model = model,  # The pre-trained model to be fine-tuned.
    tokenizer = tokenizer,  # The tokenizer used to process the input text.
    train_dataset = dataset,  # The dataset used for training.
    dataset_text_field = "text",  # The field in the dataset that contains the text data.
    max_seq_length = max_seq_length,  # The maximum sequence length for the input data.
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),  # Collator for preparing batches of data.
    dataset_num_proc = 2,  # Number of processes to use for dataset processing.
    packing = False,  # If set to True, can speed up training for short sequences by packing them together.
    
    # Define the training arguments for the fine-tuning process.
    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Batch size per device during training.
        gradient_accumulation_steps = 4,  # Number of batches whose gradients are accumulated before each optimizer update.
        warmup_steps = 5,  # Number of warmup steps for learning rate scheduler.
        # num_train_epochs = 1,  # Uncomment this line to set the number of epochs for a full training run.
        max_steps = 60,  # Total number of training steps to perform.
        learning_rate = 2e-4,  # Learning rate for the optimizer.
        fp16 = not is_bfloat16_supported(),  # Fall back to fp16 mixed precision if bfloat16 is not supported.
        bf16 = is_bfloat16_supported(),  # Use bfloat16 precision if supported.
        logging_steps = 1,  # Log training metrics every specified number of steps.
        optim = "adamw_8bit",  # Optimizer to use for training, here using 8-bit AdamW.
        weight_decay = 0.01,  # Weight decay for regularization.
        lr_scheduler_type = "linear",  # Type of learning rate scheduler to use.
        seed = 42,  # Random seed for reproducibility.
        output_dir = "outputs",  # Directory where the model outputs will be saved.
        report_to = "none",  # Reporting method for logging, set to 'none' to disable.
    ),
)
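
These settings determine how much of the dataset a 60-step run actually covers. The short calculation below uses the values from the cell above, together with the single GPU and the 100,000 training examples reported in the training banner in section 6.5.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1          # single GPU, as in this lab
max_steps = 60
num_examples = 100_000   # size of the FineTome-100k train split

# Examples that contribute to one optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Examples consumed by the whole demonstration run.
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_examples

# Optimizer updates needed for one full pass over the data.
steps_per_epoch = num_examples // effective_batch_size

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.2%} of the dataset)")
print(f"Steps for one full epoch: {steps_per_epoch}")
```

The demonstration therefore touches less than half a percent of the data, which is why a full run is configured by epochs rather than by a step limit.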

### 6.2 Train the Model on Responses Only

In [12]:
# The following call wraps the trainer so that the loss is computed only on the assistant's
# responses, not on the system or user turns. The prompts still appear in the input as context,
# but the model is not trained to reproduce them.

# The function `train_on_responses_only` is called with the current trainer instance and two parameters:
# - `instruction_part`: the header that opens each user turn,
#   "<|start_header_id|>user<|end_header_id|>\n\n".
# - `response_part`: the header that opens each assistant turn,
#   "<|start_header_id|>assistant<|end_header_id|>\n\n".

# Using these markers, the function sets the label of every token outside an assistant response
# to -100, the value that the loss function ignores. No examples are removed from the dataset.

trainer = train_on_responses_only(
    trainer,  # The current trainer instance that manages the training process.
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",  # Marks the user input in the dataset.
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",  # Marks the assistant output in the dataset.
)

### 6.3 Verify the Formatting for the Training Dataset

#### 6.3.1 Decode the Training Dataset

In [13]:
# The following line of code uses the tokenizer to decode the input IDs
# from the training dataset at index 5. The input IDs represent the 
# tokenized version of the text that the model will use for training.
# Decoding these IDs will convert them back into a human-readable 
# string format, allowing us to inspect the actual text that corresponds 
# to the token IDs stored in the dataset.

# The 'trainer' object manages the training process, and 'train_dataset' 
# is the dataset being used for training. The index [5] specifies that 
# we are interested in the sixth entry of the dataset (as indexing starts 
# from 0). The "input_ids" key accesses the tokenized input text for 
# that specific entry.

# By decoding the input IDs, we can verify the content of the training 
# data and ensure that it is formatted correctly for the model's training.

decoded_input = tokenizer.decode(trainer.train_dataset[5]["input_ids"])
decoded_input

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

#### 6.3.2 Decode the Labels

In [14]:
# First, we retrieve the token ID for a single space from the tokenizer.
# No special tokens are added, because we only want the ID of the space itself.
space = tokenizer(" ", add_special_tokens = False).input_ids[0]

# Next, we decode the labels from the training dataset at index 5.
# The labels are token IDs, except that masked positions hold -100,
# the value that the loss function ignores. Since -100 is not a valid
# token ID, it cannot be decoded directly.

# We use a list comprehension to replace every -100 with the space token ID
# retrieved earlier and to keep all other token IDs unchanged.

# Finally, we decode the resulting list back into a string. Masked positions
# appear as spaces, and only the text the model is trained on remains visible.
decoded_labels = tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
decoded_labels

'                                                                \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see that the system and user prompts are successfully masked!
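
The masking idea can be sketched on a toy sequence. The example below uses short strings in place of token IDs and one-element markers in place of the header sequences. It is a simplified illustration and not Unsloth's implementation, which operates on token IDs and may differ in detail. `mask_non_responses` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores

def mask_non_responses(tokens, instruction_marker, response_marker):
    """Return labels in which only tokens inside assistant responses are kept."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_marker)] == response_marker:
            in_response = True
            i += len(response_marker)  # the marker itself stays masked
            continue
        if tokens[i:i + len(instruction_marker)] == instruction_marker:
            in_response = False
            i += len(instruction_marker)
            continue
        if in_response:
            labels[i] = tokens[i]
        i += 1
    return labels

tokens = ["<bos>", "<user>", "What", "is", "2+2?", "<eot>",
          "<assistant>", "It's", "4.", "<eot>"]

labels = mask_non_responses(tokens, ["<user>"], ["<assistant>"])
print(labels)
```

Only the assistant's words and the closing end-of-turn token keep their labels, so the model is also trained to stop after its reply.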

### 6.4 Show Current Memory Stats

In [15]:
# Retrieve the properties of the GPU device at index 0.
# This includes information such as the name, total memory, and other capabilities.
gpu_stats = torch.cuda.get_device_properties(0)

# Read the peak amount of GPU memory that PyTorch has reserved so far in this process.
# The max_memory_reserved function returns this peak in bytes,
# which we convert to gigabytes (GB) by dividing by 1024 three times.
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)

# Get the total memory available on the GPU and convert it to gigabytes (GB).
# This is useful for understanding the limits of the GPU's memory capacity.
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

# Print out the GPU name and its maximum memory capacity.
# This provides a quick overview of the hardware being used.
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")

# Print the peak amount of memory reserved before training starts.
# This serves as the baseline for measuring how much memory training adds.
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4070 Ti SUPER. Max memory = 15.695 GB.
2.635 GB of memory reserved.


### 6.5 Train the Model

In [16]:
# The following line initiates the training process for the model using the trainer object.
# The trainer object is typically configured with various parameters such as the dataset,
# training hyperparameters, and model architecture. This method will handle the entire 
# training loop, including forward passes, loss calculation, backpropagation, and 
# optimization steps.

# The result of the training process is stored in the trainer_stats variable.
# This variable contains metrics and statistics related to the training session,
# such as the average training loss and runtime information.

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


  0%|          | 0/60 [00:00<?, ?it/s]

{'loss': 1.0564, 'grad_norm': 0.2052406370639801, 'learning_rate': 4e-05, 'epoch': 0.0}
{'loss': 1.1463, 'grad_norm': 0.33218705654144287, 'learning_rate': 8e-05, 'epoch': 0.0}
{'loss': 0.9623, 'grad_norm': 0.28333190083503723, 'learning_rate': 0.00012, 'epoch': 0.0}
{'loss': 0.9969, 'grad_norm': 0.27838701009750366, 'learning_rate': 0.00016, 'epoch': 0.0}
{'loss': 0.6673, 'grad_norm': 0.2985250651836395, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 1.1429, 'grad_norm': 0.24463847279548645, 'learning_rate': 0.00019636363636363636, 'epoch': 0.0}
{'loss': 0.5524, 'grad_norm': 0.1646350771188736, 'learning_rate': 0.00019272727272727274, 'epoch': 0.0}
{'loss': 1.2445, 'grad_norm': 0.17532196640968323, 'learning_rate': 0.0001890909090909091, 'epoch': 0.0}
{'loss': 0.9368, 'grad_norm': 0.18895605206489563, 'learning_rate': 0.00018545454545454545, 'epoch': 0.0}
{'loss': 1.0148, 'grad_norm': 0.16754715144634247, 'learning_rate': 0.00018181818181818183, 'epoch': 0.0}
{'loss': 0.6602, 'grad_n
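
The `learning_rate` values in this log follow from the configured schedule: a linear warmup over 5 steps to 2e-4, then a linear decay to zero at step 60. The function below is a sketch that reproduces the logged values from those settings. `linear_schedule_lr` is a hypothetical helper written for this check, not a library call.

```python
import math

def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

for step in range(1, 11):
    print(f"step {step:2d}: learning_rate = {linear_schedule_lr(step):.6g}")
```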

### 6.6 Show Final Memory and Time Stats

In [17]:
# Read the peak amount of GPU memory that PyTorch reserved over the whole process, including training.
# The max_memory_reserved function from the torch.cuda module returns this peak in bytes.
# We convert the value to gigabytes (GB)
# by dividing by 1024 three times (bytes to kilobytes to megabytes to gigabytes).
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)

# Estimate the additional GPU memory that training required.
# This is the peak reserved memory minus the amount reserved before training (start_gpu_memory).
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)

# Calculate the peak reserved memory as a percentage of the total GPU memory.
# This is done by dividing the peak by the maximum memory available on the GPU
# and multiplying by 100.
used_percentage = round(used_memory / max_memory * 100, 3)

# Calculate the additional memory required by training as a percentage of the total GPU memory.
# As in the previous calculation, we divide by the maximum memory
# and multiply by 100.
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

# Print the total time taken for training in seconds.
# This information is retrieved from the trainer_stats object, which contains metrics
# related to the training session.
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")

# Print the total time taken for training in minutes, rounding to two decimal places.
# This provides a more human-readable format for the training duration.
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")

# Print the peak reserved memory in gigabytes.
# This gives an overview of the maximum memory usage during the training process.
print(f"Peak reserved memory = {used_memory} GB.")

# Print the additional memory that training required, in gigabytes.
# This covers LoRA gradients, optimizer state, and activations on top of the loaded model.
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")

# Print the peak reserved memory as a percentage of the maximum memory available on the GPU.
# This is useful for assessing how efficiently the GPU memory is being utilized.
print(f"Peak reserved memory % of max memory = {used_percentage} %.")

# Print the additional memory required by training as a percentage of the maximum memory.
# This shows how much headroom remains for larger batches or longer sequences.
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

124.4536 seconds used for training.
2.07 minutes used for training.
Peak reserved memory = 3.676 GB.
Peak reserved memory for training = 1.041 GB.
Peak reserved memory % of max memory = 23.421 %.
Peak reserved memory for training % of max memory = 6.633 %.
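
The derived figures can be checked from the three measured values alone. The sketch below repeats the arithmetic of the cell above without needing a GPU, using the numbers printed in this run. `summarize_memory` is a hypothetical helper.

```python
def summarize_memory(start_gb, peak_gb, total_gb):
    """Derive training overhead and percentages from measured memory values."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Values taken from the outputs of sections 6.4 and 6.6.
stats = summarize_memory(start_gb=2.635, peak_gb=3.676, total_gb=15.695)

for name, value in stats.items():
    print(f"{name}: {value}")
```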


## 7. Inference

### 7.1 Set Model for Inference

In [18]:
# The following line of code switches the model to Unsloth's inference mode.
# Unsloth advertises this mode as roughly 2x faster than standard inference; the actual
# speed-up depends on the hardware and the workload.
# Calling the 'for_inference' method on the FastLanguageModel class with 'model' as the argument
# prepares the model for generation, which is particularly useful when producing
# responses in real-time applications.

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor
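
The printout above shows that each adapted projection keeps its frozen 4-bit base layer and adds two small matrices, `lora_A` (3072 to 16) and `lora_B` (16 to 3072). The sketch below counts the trainable parameters this adds to a single `q_proj` layer, using only the shapes visible in the printout. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank), no bias."""
    return in_features * rank + rank * out_features


base_params = 3072 * 3072                       # frozen q_proj base layer
adapter_params = lora_param_count(3072, 3072, 16)

print(f"base layer:   {base_params:,}")
print(f"LoRA adapter: {adapter_params:,}")
print(f"adapter share: {adapter_params / base_params * 100:.2f} % of the base layer")
```

For this layer the adapter adds 98,304 trainable parameters, roughly 1 % of the 9,437,184 frozen base weights.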

### 7.2 Create an Input for Inference

In [19]:
# Define a list of messages to be sent to the model for processing.
# Each message is represented as a dictionary with two keys: 'role' and 'content'.
# The 'role' indicates who is sending the message (in this case, the user),
# and the 'content' contains the actual text of the message.
# Here, the user is asking the model to continue the Fibonacci sequence,
# which is a well-known mathematical series where each number is the sum of the two preceding ones.
# The sequence starts with 1, 1, 2, 3, 5, 8, and the user is requesting the next number(s) in the series.

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

### 7.3 Prepare the Input for the Model

In [20]:
# The following line of code prepares the input for the model by applying a chat template to the messages.
# This is essential for structuring the input in a way that the model can understand and process it effectively.

# The 'apply_chat_template' method of the tokenizer is called with several parameters:
inputs = tokenizer.apply_chat_template(
    messages,  # The list of messages to be processed (here, a single user message).
    tokenize = True,  # This flag indicates that the input should be tokenized for the model.
    add_generation_prompt = True,  # This parameter is crucial as it adds a prompt for generation, ensuring the model knows to generate a response.
    return_tensors = "pt",  # This specifies that the output should be in the format of PyTorch tensors, which is necessary for compatibility with the model.
).to("cuda")  # Move the resulting tensor to the GPU for inference; this call requires a CUDA device and fails without one.
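
To make the effect of `add_generation_prompt` concrete, the sketch below renders a message list as text in a simplified Llama 3 style. It is an illustration only: the real template is defined by the tokenizer and may add further content such as a system header. The header token strings are assumptions based on the Llama 3 convention, while `<|eot_id|>` also appears in this notebook's outputs. The function name `render_chat` is hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages as one prompt string (simplified, assumed Llama 3 style)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, the content, and an end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that it is its turn to reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]
print(render_chat(example))
```

Without the generation prompt the text ends at the user's `<|eot_id|>`, and the model has no cue that an assistant turn should follow.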

### 7.4 Create a TextStreamer Object for Real-Time Output

In [21]:
# Create an instance of the TextStreamer class, which is responsible for streaming the output of the model
# as it generates text. This allows for real-time visualization of the generation process, where tokens are
# displayed one by one instead of waiting for the entire output to be generated at once.

# The TextStreamer takes in a tokenizer and some optional parameters to control its behavior.
text_streamer = TextStreamer(
    tokenizer,  # The tokenizer is passed here, which is used to convert the generated tokens back into human-readable text.
    
    skip_prompt=True  # This parameter indicates whether to skip the initial prompt in the output. 
                      # When set to True, the streamer will not display the prompt text, focusing only on the generated content.
)
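
The behaviour of `skip_prompt` can be illustrated with a toy stand-in. The class below is not the transformers implementation. It only mimics the idea that a streamer receives the prompt first and the generated chunks afterwards. The class name `ToyStreamer` and its `put`/`end` methods are assumptions made for this sketch.

```python
class ToyStreamer:
    """Toy streamer: prints chunks as they arrive, optionally skipping the prompt."""

    def __init__(self, skip_prompt=False):
        self.skip_prompt = skip_prompt
        self.next_is_prompt = True   # the first chunk received is the prompt
        self.chunks = []             # record of everything that was displayed

    def put(self, text):
        if self.next_is_prompt:
            self.next_is_prompt = False
            if self.skip_prompt:
                return               # drop the prompt, show only generated text
        self.chunks.append(text)
        print(text, end="", flush=True)

    def end(self):
        print()                      # finish the line once generation stops
        self.next_is_prompt = True   # ready for the next generation call


streamer = ToyStreamer(skip_prompt=True)
for chunk in ["<prompt text>", "13", ", 21", ", 34"]:
    streamer.put(chunk)
streamer.end()
```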

### 7.5 Generate Inference from the Model

Key Generation Parameters:
- temperature=1.5: Values above 1 flatten the token distribution, which increases response variety
- min_p=0.1: Discards tokens whose probability is below 10% of the most likely token's probability
                
These parameters are optimised based on empirical testing - see:
https://x.com/menhguin/status/1826132708508213629
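
The interaction between the two parameters can be sketched in plain Python. This is a minimal illustration under the assumption that temperature scaling is applied before the min-p filter. It is not the transformers implementation, and the function names are hypothetical.

```python
import math
import random


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return sampling probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]       # higher temperature flattens
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)                   # relative to the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]                  # renormalise the survivors


def sample_token(probs, rng):
    """Draw one token index according to the filtered probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [5.0, 4.0, 1.0, -2.0]                       # made-up scores for four tokens
probs = min_p_filter(logits)
print([round(p, 3) for p in probs])
print("sampled index:", sample_token(probs, random.Random(0)))
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and permissive when the distribution is flat. This is why a high temperature can be paired with `min_p` without sampling very unlikely tokens.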

In [33]:
# The following line of code generates outputs from the model using the provided input IDs.
# This is a crucial step in the inference process, where the model produces predictions based on the input data.

# The 'generate' method of the model is called with several parameters to control the generation process:
_ = model.generate(
    input_ids = inputs,  # The input IDs that represent the tokenized messages to be processed by the model.
    
    streamer = text_streamer,  # The TextStreamer instance that allows for real-time visualization of the output as it is generated.
    
    max_new_tokens = 32,  # The maximum number of new tokens to generate in the output. This limits the length of the response.
    
    use_cache = True,  # This flag indicates whether to use the cache for faster generation. When set to True, the model can reuse previously computed results, speeding up the process.
    
    temperature = 1.5,  # The temperature parameter controls the randomness of the output. A higher value (like 1.5) results in more diverse and creative outputs, while a lower value makes the output more deterministic and focused.
    
    min_p = 0.1  # Minimum token probability, scaled by the probability of the most likely token. Tokens whose probability falls below min_p times that top probability are excluded, which filters out unlikely options.
)

Here is the continued Fibonacci sequence:

1, 1, 2, 3, 5, 8, 13, 21, 34


## 8. Save the LoRA Adapter
This method saves only the fine-tuning changes, not the complete model. You'll need the original base model to use these weights later.
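
The reason the base model is still required can be seen from how a LoRA adapter is applied. The saved matrices describe only an update that is added to each frozen base weight. The sketch below shows this for a tiny 2x2 layer in plain Python. It assumes the standard LoRA formulation, in which the update is `scale * (B @ A)` with `scale` typically set to `lora_alpha / r`. All values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def effective_weight(base, lora_A, lora_B, scale):
    """Combine a frozen base weight with a LoRA update: W + scale * (B @ A)."""
    delta = matmul(lora_B, lora_A)   # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weight (out=2, in=2)
lora_A = [[0.5, -0.5]]               # adapter matrix A (r=1, in=2)
lora_B = [[2.0], [0.0]]              # adapter matrix B (out=2, r=1)

merged = effective_weight(base, lora_A, lora_B, scale=1.0)
print(merged)
```

Only `lora_A` and `lora_B` are stored in the adapter directory. Without `base` there is nothing to add the update to.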

In [23]:
# The following lines of code are responsible for saving the fine-tuned model and tokenizer to a local directory.
# This is essential for preserving the changes made during the training process, allowing for later use without 
# needing to retrain the model from scratch.

# Save the model's weights and configuration to the specified directory "lora_model".
# This method saves only the fine-tuning changes, not the complete model, which means you will need the original 
# base model to use these weights later.
model.save_pretrained("lora_model")  # Local saving of the model

# Save the tokenizer configuration and vocabulary to the same directory "lora_model".
# The tokenizer is crucial for converting text to tokens and vice versa, ensuring that the model can 
# properly interpret input data and generate output in a human-readable format.
tokenizer.save_pretrained("lora_model")  # Local saving of the tokenizer

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

### 8.1 Infer using Base Model Only

In [24]:
# This function generates a response from the model based on the provided user input.
# It takes three parameters: 
# - content: the input text from the user
# - model: the language model used for generating responses
# - tokenizer: the tokenizer that converts text into tokens for the model

def generate_response(content, model, tokenizer):
    # Create a list of messages with the user's input. 
    # The 'role' indicates the type of message (user or assistant), and 'content' is the actual text.
    messages = [{"role": "user", "content": content}]
    
    # Prepare the input for the model by applying the chat template using the tokenizer.
    # This includes tokenization and formatting the input for the model.
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,  # Tokenize the messages to convert them into a format the model can understand.
        add_generation_prompt=True,  # This flag ensures that a generation prompt is added for the model.
        return_tensors="pt",  # Return the inputs as PyTorch tensors for compatibility with the model.
    ).to("cuda")  # Move the inputs to the GPU for faster processing.

    # Initialize a TextStreamer instance for real-time output visualization.
    # This allows the generated text to be streamed as it is produced by the model.
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # Skip the prompt in the output.

    # Call the model's generate method to produce a response based on the input.
    _ = model.generate(
        input_ids=inputs,  # The tokenized input IDs for the model to process.
        streamer=text_streamer,  # The TextStreamer instance for output visualization.
        max_new_tokens=128,  # Limit the response to a maximum of 128 new tokens.
        use_cache=True,  # Enable caching to speed up the generation process.
        temperature=1.5,  # Set the randomness of the output; higher values yield more diverse responses.
        min_p=0.1  # Drop tokens whose probability is below min_p times that of the most likely token.
    )

# Example usage of the generate_response function.
# This line calls the function with a specific prompt about a tall tower in France.
generate_response("Describe a tall tower in the capital of France.", model, tokenizer)

There is no specific mention of a "tall tower" in the capital of France, as there are numerous towers throughout the country.<|eot_id|>


### 8.2 Infer using LoRA Adapter with Base Model and Compare Results

In [25]:
# Load the pre-trained FastLanguageModel with specified parameters.
# This function initializes the model and tokenizer for inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # Directory where the LoRA adapter and tokenizer were saved in section 8.
    max_seq_length = max_seq_length,  # Set the maximum sequence length for input data.
    dtype = dtype,  # Define the data type for the model (e.g., float32, float16).
    load_in_4bit = load_in_4bit,  # Option to load the model in 4-bit precision for reduced memory usage.
)

# Prepare the model for inference.
# This step optimizes the model for faster inference, allowing it to run at native 2x speed.
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Generate a response using the model and tokenizer.
# This line calls the generate_response function with a specific prompt.
# The prompt asks for a description of a tall tower located in the capital of France.
generate_response("Describe a tall tower in the capital of France.", model, tokenizer)

==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 Ti SUPER. Max memory: 15.695 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
There are many tall towers in various capitals, but I think you may be referring to the Eiffel Tower in Paris, the capital of France. The Eiffel Tower is a 324-meter-tall (1,063 ft) iron lattice tower built for the 1889 World's Fair. It is a UNESCO World Heritage site and is considered an iconic symbol of France and Paris. The tower is used for observation, broadcasting, and other purposes, and is a popular tourist attraction.<|eot_id|>


## 9. Save Model in GGUF Format
GGUF allows saving standalone models that don't require the original base model.

In [26]:
# This line of code saves the trained model in the GGUF format with 16-bit floating-point weights.
# The GGUF format is designed for standalone models that do not require the original base model.
# The "f16" setting (passed through the quantization_method parameter) stores the merged weights as
# 16-bit floats, so this is the largest of the GGUF variants saved in this section.

# The 'save_pretrained_gguf' method is called on the model object returned by FastLanguageModel.from_pretrained.
# The first argument is the directory name where the model will be saved ("model_16bit_gguf").
# The second argument is the tokenizer associated with the model, which is necessary for processing text inputs.
# The quantization_method parameter specifies the weight format to use for the GGUF file.

# Save the model to the specified directory in 16-bit (f16) GGUF format
model.save_pretrained_gguf("model_16bit_gguf", tokenizer, quantization_method="f16")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 14.72 out of 30.56 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 91.27it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model_16bit_gguf into f16 GGUF format.
The output location will be /home/yunora/Desktop/finetune_LLM/model_16bit_gguf/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_16bit_gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-

**Optional: Save with different quantization methods for smaller file size**
- 8-bit quantization (Q8_0) - Good balance of size/quality
- 4-bit quantization (Q4_K_M) - Smallest size, still good quality 
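
The size trade-off between these options can be estimated from the number of bits stored per weight. The sketch below is a rough approximation and not a measurement. The parameter count of 3.2 billion is an assumption for a model of this size, the Q8_0 figure follows from storing 32 8-bit values plus one 16-bit scale per block, and the Q4_K_M figure is approximate because that method mixes several block types. The helper name `estimate_gguf_size_gb` is hypothetical.

```python
# Approximate bits stored per weight for each method.
BITS_PER_WEIGHT = {
    "f16": 16.0,      # plain 16-bit floats
    "q8_0": 8.5,      # (32 * 8 + 16) bits per block of 32 weights
    "q4_k_m": 4.85,   # approximate average for a mixed scheme
}


def estimate_gguf_size_gb(n_params, method):
    """Rough file size in GB, ignoring metadata and tokenizer data."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9


n_params = 3.2e9  # assumed parameter count, for illustration only
for method in BITS_PER_WEIGHT:
    print(f"{method:>7}: ~{estimate_gguf_size_gb(n_params, method):.1f} GB")
```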

In [27]:
# This line of code saves the trained model in the GGUF format using 8-bit quantization.
# The 8-bit quantization method (Q8_0) provides a good balance between file size and model quality,
# which is useful for deployment scenarios where memory efficiency matters.

# The 'save_pretrained_gguf' method is called on the model object returned by
# FastLanguageModel.from_pretrained. The first argument specifies the directory name where the model
# will be saved. In this case, the model will be saved in a directory named "model_8bit_q8".

# The second argument is the tokenizer associated with the model. The tokenizer is essential for 
# processing text inputs, as it converts text into a format that the model can understand.

# The quantization method is passed explicitly here. In the run recorded below it was omitted, and the
# log shows that the conversion used q8_0 by default.

# Save the model to the specified directory with 8-bit quantization
model.save_pretrained_gguf("model_8bit_q8", tokenizer, quantization_method="q8_0")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 14.59 out of 30.56 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 93.24it/s]


Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model_8bit_q8 into q8_0 GGUF format.
The output location will be /home/yunora/Desktop/finetune_LLM/model_8bit_q8/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_8bit_q8
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-0
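
To give an intuition for what 8-bit block quantization does to the weights, the sketch below quantizes one block of 32 values to signed 8-bit integers with a single shared scale and then restores them. It is a simplified illustration of the idea behind Q8_0 and not llama.cpp's code. In particular, the scale is kept as a Python float here rather than a 16-bit float, and the function names are hypothetical.

```python
def quantize_block_q8(values):
    """Quantize one block to int8 values plus a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0   # map the largest value to +/-127
    quantized = [round(v / scale) for v in values]
    return scale, quantized


def dequantize_block_q8(scale, quantized):
    """Restore approximate float values from the int8 block."""
    return [scale * q for q in quantized]


block = [0.01 * (i - 16) for i in range(32)]  # 32 made-up weights
scale, quantized = quantize_block_q8(block)
restored = dequantize_block_q8(scale, quantized)

max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale = {scale:.6f}, max abs error = {max_error:.6f}")
```

The rounding error per weight is bounded by half the scale, which is why Q8_0 stays close to the 16-bit weights while roughly halving the file size.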

In [28]:
# This line of code saves the trained model in the GGUF format using the Q4_K_M quantization method.
# The Q4_K_M quantization is designed to significantly reduce the model's file size while still maintaining
# a reasonable level of performance and quality. This is particularly beneficial for scenarios where storage
# space is limited or when deploying the model in resource-constrained environments.

# The 'save_pretrained_gguf' method is invoked on the model object returned by
# FastLanguageModel.from_pretrained. This method is responsible for saving the model's weights and
# configuration in a format that is optimized for standalone use, meaning it does not require the
# original base model to function.

# The first argument to the method is the directory name where the model will be saved. In this case, 
# the model will be saved in a directory named "model_q4_k_m_gguf".

# The second argument is the tokenizer associated with the model. The tokenizer is crucial for processing 
# text inputs, as it converts raw text into a format that the model can understand and work with.

# The quantization_method parameter specifies the type of quantization to apply to the model weights.
# Here, we are using "q4_k_m" to achieve the desired balance between size and quality.

# Save the model to the specified directory with Q4_K_M quantization
model.save_pretrained_gguf("model_q4_k_m_gguf", tokenizer, quantization_method="q4_k_m")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 14.58 out of 30.56 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 93.15it/s]


Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model_q4_k_m_gguf into bf16 GGUF format.
The output location will be /home/yunora/Desktop/finetune_LLM/model_q4_k_m_gguf/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_q4_k_m_gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model