# Efficient Fine-Tuning of Qwen 2.5 Coder 14B with Unsloth

## Introduction

Fine-tuning large language models (LLMs) such as Llama and Mistral has traditionally been a resource-intensive and time-consuming process. However, with advancements in model optimization techniques, it is now possible to streamline this workflow significantly. The provided Jupyter notebook demonstrates a comprehensive approach to efficiently fine-tune and deploy LLMs using the Unsloth library's `FastLanguageModel`. This workflow leverages key features like 4-bit quantization, Parameter-Efficient Fine-Tuning (PEFT) with LoRA, and optimized inference mechanisms to enhance performance while minimizing resource consumption.

The notebook begins by setting up the environment: it installs and configures the required packages. It then loads a pre-trained model with `FastLanguageModel`, applying 4-bit quantization to reduce memory usage at a small cost in precision. PEFT via LoRA restricts training to a small set of added parameters, which further reduces memory and compute. The notebook also prepares the data using chat templates and dataset standardization, so that conversational data is formatted the way the model expects. Training is configured with memory-saving settings and with the loss computed only on assistant responses. Finally, the workflow covers optimized inference, saving the fine-tuned model, and deploying it for real-world applications.

## Overview of `Qwen-2.5-Coder-14B`

Qwen-2.5-Coder-14B is a large language model designed for coding tasks. It is part of the Qwen series and targets code generation, code reasoning, and code fixing. The model has 14.7 billion parameters and uses a transformer architecture with RoPE (Rotary Position Embedding), SwiGLU (a gated activation function), and RMSNorm (Root Mean Square Layer Normalization).

### Key Features

- **Model Size**: 14.7 billion parameters, with 13.1 billion non-embedding parameters.
- **Architecture**: 48 transformer layers with grouped-query attention (40 query heads, 8 key/value heads).
- **Context Length**: Supports long contexts of up to 131,072 tokens, allowing it to handle extensive code and text inputs effectively.
- **Training Tokens**: Trained on 5.5 trillion tokens, including a variety of source code and synthetic data.


## Installation
The code installs and upgrades the libraries needed for training and inference: `unsloth` for fast fine-tuning, `torch` for GPU operations, and `xformers` for memory-efficient attention kernels, plus `trl`, `peft`, `accelerate`, `bitsandbytes`, and `triton`.

In [6]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
# !pip install unsloth

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

**Explanation:**

This cell handles the installation and configuration of necessary Python packages required for the subsequent steps. Here's a breakdown:

1. **Magic Command `%%capture`:**
   - Suppresses the output of the cell, making the notebook cleaner by hiding the installation logs.

2. **Installing `pip3-autoremove`:**
   - `!pip install pip3-autoremove` installs a utility that allows for the removal of packages and their unused dependencies.

3. **Removing Existing PyTorch Packages:**
   - `!pip-autoremove torch torchvision torchaudio -y` ensures that any existing installations of `torch`, `torchvision`, and `torchaudio` are uninstalled. This is crucial to prevent version conflicts.

4. **Reinstalling PyTorch with Specific CUDA Version:**
   - `!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121` installs PyTorch along with `torchvision` and `torchaudio`, specifying the CUDA 12.1 version for GPU acceleration.
   - `xformers` is also installed to leverage optimized transformer operations, enhancing model performance.

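5. **Installing the Unsloth Library:**
   - `!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` installs Unsloth directly from its GitHub repository; the plain `!pip install unsloth` line is left commented out. The library provides `FastLanguageModel` and other utilities for efficient fine-tuning and deployment.

6. **Pinning `xformers` and Installing the Remaining Packages:**
   - The installed PyTorch version is compared against 2.4.0. For older versions, `xformers` is pinned to `0.0.27`; otherwise the latest release is used.
   - `!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton` installs the training stack with `--no-deps`, so pip does not pull in dependencies that could replace the PyTorch build installed above.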
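
Because the cell hides its output with `%%capture`, a failed install can go unnoticed until a later import fails. The following is a minimal sketch of a post-install check using only the standard library. The helper name `missing_modules` is hypothetical and is not part of Unsloth or the notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the installation cell is expected to provide.
required = ["torch", "xformers", "unsloth", "trl", "peft",
            "accelerate", "bitsandbytes", "triton"]

missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Running this right after the installation cell points at the package to reinstall before the model is loaded.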

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None          # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True   # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-Coder-14B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

**Explanation:**

This cell initializes the pre-trained language model using Unsloth's `FastLanguageModel`. Here's the step-by-step breakdown:

1. **Imports:**
   - `from unsloth import FastLanguageModel`: Imports the `FastLanguageModel` class from the Unsloth library.
   - `import torch`: Imports PyTorch for tensor operations and GPU management.

2. **Configuration Parameters:**
   - `max_seq_length = 2048`: Sets the maximum sequence length for input data. FastLanguageModel supports RoPE Scaling internally, allowing dynamic adjustment for larger sequences.
   - `dtype = None`: Specifies the data type for model weights. Setting it to `None` enables automatic detection based on the hardware. Alternatives are `float16` for GPUs such as the Tesla T4 or V100, and `bfloat16` for Ampere and newer GPUs.
   - `load_in_4bit = True`: Enables 4-bit quantization to reduce memory usage, facilitating the loading of larger models without running out of memory (OOM). Setting to `False` would load the model in higher precision but with higher memory consumption.

3. **Model Choice:**
   - Catalogue lists such as `fourbit_models` (pre-quantized 4-bit checkpoints for Llama, Mistral, Phi, and Gemma) and `qwen_models` (Qwen variants of different sizes) are not defined in the cell shown here; the model name is passed to `from_pretrained` directly.

4. **Loading the Model and Tokenizer:**
   - `FastLanguageModel.from_pretrained(...)` is called with the specified parameters to load the pre-trained model and its tokenizer.
   - **Parameters:**
     - `model_name`: Specifies the exact model to load, here `"unsloth/Qwen2.5-Coder-14B-Instruct"`.
     - `max_seq_length`, `dtype`, `load_in_4bit`: Pass the previously defined configuration parameters.
     - `token`: (Commented out) Supplies a Hugging Face access token, which is needed only for gated models.
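
To see why `load_in_4bit` matters for a model of this size, the weight storage can be estimated from the parameter count alone. This is a rough sketch: it ignores quantization constants, any layers kept in higher precision, activations, and CUDA overhead, so real usage will be higher. The function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14.7e9  # parameter count quoted in the model overview

for label, bits in [("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>20}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the 16-bit weights alone would not fit on a 16 GB GPU, while the 4-bit weights leave room for LoRA parameters and activations.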

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,    # Supports any, but = 0 is optimized
    bias = "none",       # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

**Explanation:**

This cell configures the model for Parameter-Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation) via Unsloth's `get_peft_model` method. Here's the breakdown:

1. **PEFT Configuration:**
   - **`model`:** The pre-trained model loaded in the previous cell is passed as input to apply PEFT.

2. **LoRA Parameters:**
   - `r = 16`: The rank of the LoRA matrices. Higher values allow more capacity but consume more memory. Suggested values range from 8 to 128.
   - `target_modules`: Specifies which modules receive LoRA adapters. Here that is every attention projection (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and every MLP projection (`gate_proj`, `up_proj`, `down_proj`).
   - `lora_alpha = 16`: A scaling factor for LoRA. The low-rank update is multiplied by `lora_alpha / r`, so with `r = 16` the scale here is 1.0.
   - `lora_dropout = 0`: Dropout rate applied within LoRA layers. Set to `0` for optimized performance.
   - `bias = "none"`: Indicates that biases are not being fine-tuned. This setting is optimized for memory and performance.

3. **Additional Configurations:**
   - `use_gradient_checkpointing = "unsloth"`: Enables gradient checkpointing to save memory during training. The `"unsloth"` setting optimizes VRAM usage, allowing for larger batch sizes and handling longer contexts.
   - `random_state = 3407`: Sets the seed for reproducibility.
   - `use_rslora = False`: Indicates whether to use Rank Stabilized LoRA. Set to `False` in this configuration.
   - `loftq_config = None`: Placeholder for additional LoftQ configurations if needed.

4. **Applying PEFT:**
   - `FastLanguageModel.get_peft_model(...)` modifies the original model to include LoRA layers, enabling efficient fine-tuning without updating all model parameters.
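
The effect of `r` and `target_modules` on the number of trainable parameters can be worked out by hand. The sketch below assumes layer shapes taken from the published Qwen2.5-14B configuration (hidden size 5120, MLP width 13824); these values do not appear in the notebook and should be checked against the model's `config.json`.

```python
# Layer shapes below are ASSUMED from the published Qwen2.5-14B configuration;
# they do not appear in the notebook itself.
HIDDEN = 5120          # model (residual stream) width
INTERMEDIATE = 13824   # MLP inner width
NUM_LAYERS = 48        # from the model overview
NUM_Q_HEADS, NUM_KV_HEADS = 40, 8
HEAD_DIM = HIDDEN // NUM_Q_HEADS        # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM        # 1024

# (in_features, out_features) of each targeted linear layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(r, targets=TARGETS, num_layers=NUM_LAYERS):
    """LoRA adds A (r x in) and B (out x r) per module: r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in targets.values())
    return per_layer * num_layers


r, lora_alpha = 16, 16
print(f"trainable LoRA parameters: {lora_params(r):,}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

With these shapes the count comes to 68,812,800, which agrees with the number of trainable parameters that Unsloth reports when training starts. Doubling `r` doubles this count.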

## Data Preprocessing

In [4]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template
from unsloth.chat_templates import standardize_sharegpt

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)

**Explanation:**

This cell handles dataset loading, standardization, and formatting using Unsloth's utilities for chat-based templates. Here's the step-by-step explanation:

1. **Imports:**
   - `from datasets import load_dataset`: Imports the `load_dataset` function from Hugging Face's `datasets` library for dataset handling.
   - The two `from unsloth.chat_templates import ...` lines import `get_chat_template` and `standardize_sharegpt`, Unsloth's helpers for applying chat templates and standardizing ShareGPT-style datasets.

2. **Configuring the Tokenizer with a Chat Template:**
   - `get_chat_template(tokenizer, chat_template = "qwen-2.5")`: Configures the tokenizer to use the `"qwen-2.5"` chat template. This ensures that the input data aligns with the expected format of the Qwen-2.5 model.
   - **Purpose:** The template wraps each turn in `<|im_start|>{role}` and `<|im_end|>` markers, so the training text has the layout that Qwen 2.5 expects.

3. **Defining the Formatting Function:**
   - `formatting_prompts_func(examples)`: A function to process each batch of dataset examples.
     - `convos = examples["conversations"]`: Extracts the conversation data from the dataset.
     - `texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]`: Applies the chat template to each conversation without tokenization and without adding a generation prompt.
     - `return { "text" : texts, }`: Returns the formatted texts in a dictionary with the key `"text"`.

4. **Loading and Standardizing the Dataset:**
   - `dataset = load_dataset("mlabonne/FineTome-100k", split="train")`: Loads the `FineTome-100k` dataset's training split. This dataset is assumed to follow the ShareGPT format, which consists of multi-turn conversations.
   - `dataset = standardize_sharegpt(dataset)`: Converts ShareGPT-style turns, which use `from`/`value` keys and speaker names such as `human` and `gpt`, into the `role`/`content` form with `user` and `assistant` roles that `apply_chat_template` expects. The sample printed from `dataset[5]["conversations"]` later in the notebook shows the resulting structure.
   - `dataset = dataset.map(formatting_prompts_func, batched=True)`: Applies the previously defined formatting function to each batch of the dataset, effectively preparing the conversational data for training.
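
The two transformations described above can be illustrated without any libraries. The following is a simplified sketch, not Unsloth's implementation and not the real Jinja chat template. The function names and `ROLE_MAP` are hypothetical, and the default system prompt is copied from the formatted sample shown in this notebook.

```python
# Mapping from ShareGPT speaker names to chat roles (illustrative, not exhaustive).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

# Default system prompt, copied from the formatted sample shown in this notebook.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def standardize_turns(conversation):
    """Convert [{'from': ..., 'value': ...}] into [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML-style text."""
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


sharegpt_example = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
messages = standardize_turns(sharegpt_example)
print(render_chatml(messages))
```

For training, `add_generation_prompt` stays `False` because the assistant's answer is already part of the text. For inference it is set to `True` so that the text ends with an open assistant turn.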

In [5]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

In [6]:
dataset[5]["text"]

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|im_end|>\n<|im_start|>assistant\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'

## Model Training

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4, # Fixed major bug in latest Unsloth
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "paged_adamw_8bit", # Save more memory
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

max_steps is given, it will override any value given in num_train_epochs


**Explanation:**

This cell configures the training setup using `SFTTrainer` from the `trl` (Transformer Reinforcement Learning) library, together with Unsloth's utilities for optimized training. Here's the detailed breakdown:

1. **Imports:**
   - `from trl import SFTTrainer`: Imports the `SFTTrainer` class for supervised fine-tuning.
   - `from transformers import TrainingArguments, DataCollatorForSeq2Seq`: Imports necessary classes from Hugging Face's `transformers` library for training configurations and data collation.
   - `from unsloth import is_bfloat16_supported`: Imports a utility function to check hardware support for `bfloat16`.

2. **Configuring the Trainer:**
   - `trainer = SFTTrainer(...)`: Initializes the trainer with specified configurations.
   
3. **Trainer Parameters:**
   - `model = model`: The PEFT-enabled model from the previous cell.
   - `tokenizer = tokenizer`: The tokenizer configured with the chat template.
   - `train_dataset = dataset`: The prepared training dataset.
   - `dataset_text_field = "text"`: Specifies the field in the dataset containing the input text.
   - `max_seq_length = max_seq_length`: Sets the maximum sequence length as defined earlier.
   - `data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)`: Pads each batch dynamically to its longest sequence. It pads `labels` as well as `input_ids`, using `-100` for label padding so that padded positions are excluded from the loss.
   - `dataset_num_proc = 4`: Number of processes used to tokenize and preprocess the dataset.
   - `packing = False`: Leaves sequence packing off, so each example is trained as its own sequence. Turning packing on concatenates several short examples into one sequence of up to `max_seq_length` tokens, which can speed up training on short sequences.

4. **Training Arguments (`args = TrainingArguments(...)`):**
   - `per_device_train_batch_size = 1`: Sets the batch size per device. Given memory constraints, a batch size of 1 is used.
   - `gradient_accumulation_steps = 4`: Accumulates gradients over 4 forward/backward passes before each optimizer step. The effective batch size is 1 × 4 = 4, with the activation memory of a batch of 1.
     - **Note:** The inline comment refers to a gradient accumulation bug that was fixed in Unsloth; it says nothing about the value 4 itself.
   - `warmup_steps = 5`: The learning rate rises linearly from 0 to its peak over the first 5 steps.
   - `max_steps = 30`: Stops training after 30 optimizer steps, which covers 30 × 4 = 120 of the 100,000 examples. This is a short demonstration run. When `max_steps` is set it overrides `num_train_epochs`; to train for one full epoch, set `num_train_epochs = 1` and remove `max_steps`.
   - `learning_rate = 2e-4`: Sets the learning rate for the optimizer.
   - `fp16 = not is_bfloat16_supported()`: Enables mixed-precision training with `float16` if `bfloat16` is not supported.
   - `bf16 = is_bfloat16_supported()`: Enables `bfloat16` precision if supported by the hardware.
   - `logging_steps = 1`: Logs training metrics every step.
   - `optim = "paged_adamw_8bit"`: Uses the `paged_adamw_8bit` optimizer to save memory.
   - `weight_decay = 0.01`: Sets the weight decay for regularization.
   - `lr_scheduler_type = "linear"`: Uses a linear learning rate scheduler.
   - `seed = 3407`: Sets the random seed for reproducibility.
   - `output_dir = "outputs"`: Directory where training outputs and checkpoints will be saved.
   - `report_to = "none"`: Disables reporting to external services like WandB. This can be changed to enable integration with monitoring tools.

5. **Precision Handling:**
   - `is_bfloat16_supported()` checks if the current GPU supports `bfloat16`. If supported, `bf16` is enabled; otherwise, `fp16` is used. This ensures optimal precision based on hardware capabilities.
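
The interaction between `warmup_steps`, `max_steps`, and `lr_scheduler_type = "linear"` is easier to see with numbers. The sketch below re-implements the usual linear warmup and linear decay rule in plain Python. It is assumed to mirror the behaviour of the `"linear"` scheduler, and the function name is illustrative.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5, max_steps=30):
    """Learning rate at scheduler step `step` (0-based) for linear warmup + decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)


per_device_batch, grad_accum, num_gpus = 1, 4, 1
effective_batch = per_device_batch * grad_accum * num_gpus
print("effective batch size:", effective_batch)
print("examples seen in 30 steps:", 30 * effective_batch)

for step in (0, 1, 5, 10, 20, 30):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

With only 30 steps, a sixth of the run is spent warming up and the learning rate is already decaying for the rest, which is another reason to treat this configuration as a smoke test.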

In [8]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)

**Explanation:**

This cell restricts the loss to the assistant's responses. The prompt tokens (system and user turns) stay in the input as context, but their labels are masked so they do not contribute to the loss. Here's the detailed explanation:

1. **Importing the Function:**
   - `from unsloth.chat_templates import train_on_responses_only`: Imports the `train_on_responses_only` function from Unsloth's chat templates module.

2. **Applying the Function:**
   - `trainer = train_on_responses_only(...)`: Modifies the `trainer` instance to train only on the response parts of the dataset.
   
3. **Parameters:**
   - `trainer`: The existing `SFTTrainer` instance configured in the previous cell.
   - `instruction_part = "<|im_start|>user\n"`: Specifies the prefix that identifies the instruction or user input in the dataset.
   - `response_part = "<|im_start|>assistant\n"`: Specifies the prefix that identifies the assistant's response in the dataset.
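
The masking idea can be sketched on a toy sequence. This is an illustration of the technique under simplifying assumptions, not Unsloth's implementation: strings stand in for token IDs, and the helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_all(tokens, pattern):
    """Start indices of every occurrence of `pattern` (a token list) in `tokens`."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


def mask_non_responses(tokens, instruction_part, response_part):
    """Return labels equal to `tokens` inside responses and IGNORE_INDEX elsewhere."""
    labels = [IGNORE_INDEX] * len(tokens)
    instruction_starts = find_all(tokens, instruction_part)
    for start in find_all(tokens, response_part):
        begin = start + len(response_part)  # first token after the marker
        # The response runs until the next user turn, or to the end of the text.
        end = next((i for i in instruction_starts if i > begin), len(tokens))
        labels[begin:end] = tokens[begin:end]
    return labels


# Toy example: strings stand in for token IDs.
tokens = ["<user>", "Hi", "<end>", "<assistant>", "Hello", "!", "<end>"]
labels = mask_non_responses(tokens, ["<user>"], ["<assistant>"])
print(labels)
```

The input sequence is unchanged, so the model still conditions on the prompt; only the positions that contribute to the loss differ.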

In [9]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|im_end|>\n<|im_start|>assistant\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'

In [10]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                          \nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'

**Explanation:**

These two cells inspect one training example after preprocessing to verify the formatting and the label masking. Here's the step-by-step breakdown:

1. **Decoding Input IDs:**
   - `tokenizer.decode(trainer.train_dataset[5]["input_ids"])`: Decodes the input IDs of the sixth example (index `5`) in the training dataset back into human-readable text. This helps in verifying that the input data has been correctly tokenized and formatted.

2. **Decoding Labels with Special Handling:**
   - `space = tokenizer(" ", add_special_tokens = False).input_ids[0]`: Retrieves the token ID for a single space character. It serves as a visible placeholder for masked positions.
   - `tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])`: Decodes the label IDs of the same example, replacing every `-100` with the space token. `-100` is the index that the cross-entropy loss ignores and is not a valid token ID, so it cannot be decoded directly. In the output, the system and user turns appear as a run of spaces and only the assistant's response remains, which confirms that the masking worked.


In [11]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla P100-PCIE-16GB. Max memory = 15.888 GB.
10.07 GB of memory reserved.


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 30
 "-____-"     Number of trainable parameters = 68,812,800


| Step | Training Loss |
|------|---------------|
| 1 | 0.6967 |
| 2 | 0.6319 |
| 3 | 0.67 |
| 4 | 0.596 |
| 5 | 1.106 |
| 6 | 0.8282 |
| 7 | 0.6739 |
| 8 | 0.7285 |
| 9 | 0.7118 |
| 10 | 0.4678 |


**Explanation:**

This cell starts the training process and captures the training statistics upon completion.

1. **Starting Training:**
   - `trainer_stats = trainer.train()`: Initiates the training loop using the configured `trainer`. The `train()` method runs the training process based on the previously defined `TrainingArguments` and dataset.

2. **Capturing Training Statistics:**
   - The result of the `train()` method is stored in `trainer_stats`, which contains metrics and information about the training run, such as runtime, loss values, and other performance indicators.
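
Per-step losses on a batch of 4 are noisy, so a summary is often more informative than any single value. The sketch below assumes the logs follow the Hugging Face `Trainer` convention, where `trainer.state.log_history` is a list of dicts and training entries carry a `"loss"` key. A stand-in list built from the ten losses shown above is used so the example runs on its own, and the helper name is illustrative.

```python
def summarize_losses(log_history):
    """Return (first, last, mean) training loss from Trainer-style log entries."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if not losses:
        raise ValueError("no training-loss entries found")
    return losses[0], losses[-1], sum(losses) / len(losses)


# Stand-in for trainer.state.log_history, filled with the ten losses shown above.
reported = [0.6967, 0.6319, 0.67, 0.596, 1.106, 0.8282, 0.6739, 0.7285, 0.7118, 0.4678]
log_history = [{"step": step, "loss": loss} for step, loss in enumerate(reported, start=1)]
log_history.append({"train_runtime": 1022.3749})  # summary entry without a "loss" key

first, last, mean = summarize_losses(log_history)
print(f"first={first:.4f} last={last:.4f} mean={mean:.4f}")
```

In the notebook itself, `summarize_losses(trainer.state.log_history)` would replace the stand-in list.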


In [13]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1022.3749 seconds used for training.
17.04 minutes used for training.
Peak reserved memory = 12.576 GB.
Peak reserved memory for training = 2.506 GB.
Peak reserved memory % of max memory = 79.154 %.
Peak reserved memory for training % of max memory = 15.773 %.
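
The reported runtime also allows a rough estimate of what a full epoch would cost on this hardware. This is a back-of-the-envelope sketch that assumes the time per optimizer step stays constant, which ignores variation in sequence length.

```python
train_runtime_s = 1022.3749   # reported by the run above
steps_run = 30
effective_batch = 4           # 1 per device x 4 accumulation steps
dataset_size = 100_000

seconds_per_step = train_runtime_s / steps_run
steps_per_epoch = dataset_size // effective_batch
epoch_hours = seconds_per_step * steps_per_epoch / 3600

print(f"{seconds_per_step:.1f} s per optimizer step")
print(f"{steps_per_epoch:,} steps per epoch")
print(f"about {epoch_hours:.0f} hours for one epoch at this rate")
```

At roughly 34 seconds per step, one pass over all 100,000 examples would take on the order of ten days on this GPU, so a full run calls for a smaller dataset, a faster GPU, or both.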


## Model Inferencing

In [14]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n13<|im_end|>']

**Explanation:**

This cell demonstrates how to perform inference (generate responses) using the trained model with optimized settings. Here's the detailed breakdown:

1. **Importing and Configuring the Tokenizer:**
   - `from unsloth.chat_templates import get_chat_template`: Imports the `get_chat_template` function.
   - `tokenizer = get_chat_template(tokenizer, chat_template = "qwen-2.5")`: Reapplies the `"qwen-2.5"` chat template to the tokenizer to ensure consistency with training.

2. **Enabling Optimized Inference:**
   - `FastLanguageModel.for_inference(model)`: Activates native optimizations within the `FastLanguageModel` to enable faster inference, potentially doubling the speed.

3. **Preparing the Input Message:**
   - `messages = [{"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},]`: Defines a prompt asking the model to continue the Fibonacci sequence.
   - `inputs = tokenizer.apply_chat_template(...)`: Applies the chat template to the messages.
     - `tokenize = True`: Tokenizes the input text.
     - `add_generation_prompt = True`: Appends the opening of an assistant turn (`<|im_start|>assistant\n`) so the model continues with its reply rather than another user message.
     - `return_tensors = "pt"`: Returns the inputs as PyTorch tensors.
     - `.to("cuda")`: Moves the input tensors to the GPU for faster computation.

4. **Generating the Output:**
   - `outputs = model.generate(...)`: Generates text based on the input.
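     - `input_ids = inputs`: Passes the tokenized input. No `attention_mask` is passed, which is why Transformers prints the warning shown in the output.
     - `max_new_tokens = 64`: Limits the generation to 64 new tokens.
     - `use_cache = True`: Reuses cached attention keys and values from earlier positions, which speeds up generation.
     - `temperature = 1.5`: Flattens the next-token distribution when sampling is active, leading to more diverse outputs.
     - `min_p = 0.1`: Enables min-p sampling: tokens whose probability is below 0.1 times that of the most likely token are discarded before sampling. This is a different filter from nucleus (top-p) sampling.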

5. **Decoding the Output:**
   - `tokenizer.batch_decode(outputs)`: Converts the generated token IDs back into human-readable text.
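
How `temperature` and `min_p` work together can be shown on a made-up distribution. The following is a minimal sketch of the two steps in plain Python, assuming temperature scaling is applied before the min-p filter. The logits are invented for illustration and the function names are not library APIs.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample(logits, temperature=1.5, min_p=0.1, rng=random):
    """Draw one token index using temperature scaling followed by min-p filtering."""
    probs = min_p_filter(softmax(logits, temperature), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [5.0, 4.0, 1.0, -2.0]  # made-up scores for a four-token vocabulary
probs = softmax(logits, temperature=1.5)
print("after temperature:", [round(p, 3) for p in probs])
print("after min-p:      ", [round(p, 3) for p in min_p_filter(probs, 0.1)])
print("sampled index:", sample(logits, rng=random.Random(0)))
```

A high temperature on its own would give unlikely tokens a real chance of being picked; the min-p cut-off removes that tail while leaving the plausible candidates in play.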


In [15]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Fibonacci sequence continues as follows:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144...

Each number is the sum of the two preceding ones.

So, after 8, the next numbers would be:

11 - not correct (as 8 + 5 = 13, not 11)
13 - correct
21 - correct
34 - correct
55 - correct
89 - correct
144 - correct

The next few


**Explanation:**

This cell showcases how to perform streaming inference, where the generated text is output incrementally as it is being produced. Here's the detailed breakdown:

1. **Enabling Optimized Inference:**
   - `FastLanguageModel.for_inference(model)`: Reiterates the activation of native optimizations for faster inference.

2. **Preparing the Input Message:**
   - Similar to the previous cell, defines a prompt to continue the Fibonacci sequence.
   - Applies the chat template, tokenizes the input, adds generation prompts, and moves the inputs to the GPU.

3. **Setting Up the Streamer:**
   - `from transformers import TextStreamer`: Imports the `TextStreamer` class from Hugging Face's `transformers` library.
   - `text_streamer = TextStreamer(tokenizer, skip_prompt = True)`: Initializes a `TextStreamer` instance.
     - `tokenizer`: Passes the tokenizer for decoding tokens into text.
     - `skip_prompt = True`: Configures the streamer to omit the initial prompt from the output, focusing only on the generated response.

4. **Generating with Streaming:**
   - `_ = model.generate(...)`: Initiates text generation with streaming.
     - `input_ids = inputs`: Provides the tokenized input.
     - `streamer = text_streamer`: Passes the streamer to handle incremental output.
     - `max_new_tokens = 128`: Limits generation to 128 new tokens.
     - `use_cache = True`, `temperature = 1.5`, `min_p = 0.1`: Similar parameters as the previous generation.


## Model Saving and Loading

In [16]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

**Explanation:**

This cell saves the trained LoRA adapters and the tokenizer for future use, either locally or by pushing to Hugging Face's Model Hub.

1. **Saving Locally:**
   - `model.save_pretrained("lora_model")`: Saves the trained LoRA adapter weights and their configuration to a directory named `"lora_model"`. Because the model is a PEFT model, only the adapters are written, not the full base model weights, so the base model is still required when reloading.
   - `tokenizer.save_pretrained("lora_model")`: Saves the tokenizer files to the same directory, ensuring that the model can be reloaded with the correct tokenizer.

2. **Optional Online Saving (Commented Out):**
   - `# model.push_to_hub("your_name/lora_model", token = "...")`: Uncommenting this line would push the model to the Hugging Face Hub repository `your_name/lora_model`, authenticating with the supplied token.
   - `# tokenizer.push_to_hub("your_name/lora_model", token = "...")`: Similarly, this would push the tokenizer to the same repository.
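
As a quick sanity check after saving, one might inspect the output directory to confirm that it holds a LoRA adapter and not full model weights. The sketch below assumes PEFT's usual file names (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`). The helper `describe_checkpoint` is hypothetical and written for illustration only.

```python
from pathlib import Path


def describe_checkpoint(directory):
    """Return file names/sizes and whether the directory looks like a PEFT adapter."""
    root = Path(directory)
    files = {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}
    # Assumption: PEFT writes adapter_config.json next to the adapter weights.
    has_config = "adapter_config.json" in files
    has_weights = any(name.startswith("adapter_model.") for name in files)
    return {"files": files, "is_lora_adapter": has_config and has_weights}


# Only inspect the directory if the save step has actually been run.
if Path("lora_model").is_dir():
    info = describe_checkpoint("lora_model")
    for name, size in info["files"].items():
        print(f"{name}: {size / 1e6:.2f} MB")
    print("LoRA adapter checkpoint:", info["is_lora_adapter"])
```

An adapter-only checkpoint is typically far smaller than the base model, which is what makes it convenient to store and share.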


In [17]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

if False:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True, # Must add for generation
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True, temperature=1.5, min_p=0.1)

The iconic Eiffel Tower stands as the tallest tower in Paris, the capital of France, and is one of the most recognizable structures in the world.

The tower was built in 1889 as the entrance arch to the Exposition Universelle (World's Fair) held to celebrate the centenary of the French Revolution. It was designed by Gustave Eiffel, who created a light, yet structurally sound design using wrought iron lattices that gave it its signature lattice-like appearance.

Standing at a height of 324 meters (1,063 feet), the Eiffel Tower can be seen


**Explanation:**

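This cell runs inference with the fine-tuned model on a new prompt. The reload from `"lora_model"` is guarded by `if False:`, so as written the cell reuses the `model` and `tokenizer` already in memory from training; changing the guard to `True` loads the saved adapters instead, for example in a fresh session.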

1. **Preparing the Input Message:**
   - `messages = [{"role": "user", "content": "Describe a tall tower in the capital of France."},]`: Defines a new prompt asking the model to describe a tall tower in France's capital.

2. **Applying the Chat Template:**
   - `inputs = tokenizer.apply_chat_template(...)`: Formats the message using the chat template, tokenizes it, adds generation prompts, converts it to PyTorch tensors, and moves it to the GPU.

3. **Setting Up the Streamer:**
   - `from transformers import TextStreamer`: Imports the `TextStreamer` class.
   - `text_streamer = TextStreamer(tokenizer, skip_prompt = True)`: Initializes a streamer to handle incremental output without displaying the initial prompt.

4. **Generating the Output:**
   - `_ = model.generate(...)`: Generates a response with `model`, the fine-tuned model already in memory (or the reloaded one if the `if False:` guard is changed to `True`).
     - `input_ids=inputs`: Provides the tokenized input.
     - `streamer=text_streamer`: Enables streaming of the generated text.
     - `max_new_tokens=128`, `use_cache=True`, `temperature=1.5`, `min_p=0.1`: Sets generation parameters for output quality and diversity.
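
For intuition about what `apply_chat_template` produces before tokenization, here is a simplified sketch of ChatML-style formatting, the convention used by Qwen 2.5 chat models. The function `render_chatml` is hypothetical, and it deliberately omits details of the real template (for example, any default system prompt). The actual string depends on the template attached to the tokenizer.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified ChatML rendering; the real template lives in the tokenizer."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user text.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Describe a tall tower in the capital of France."}]
prompt = render_chatml(messages)
print(prompt)
```

Without `add_generation_prompt=True`, the rendered prompt would stop after the user turn, and the model would have no cue that an assistant reply should come next.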


In [18]:
from unsloth import FastLanguageModel
from transformers import TextStreamer

if False:    
    new_model, new_tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(new_model) # Enable native 2x faster inference
    
    messages = [
        {"role": "user", "content": "Describe a famous statue in New York City."},
    ]
    inputs = new_tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True, # Must add for generation
        return_tensors="pt",
    ).to("cuda")
    
    text_streamer = TextStreamer(new_tokenizer, skip_prompt = True)
    _ = new_model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True, temperature=1.5, min_p=0.1)

**Explanation:**

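This cell shows how to load the previously saved fine-tuned model and perform inference with a new prompt. The entire body sits under `if False:`, so it serves as a template and does nothing until the guard is changed to `True`.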

1. **Importing FastLanguageModel:**
   - `from unsloth import FastLanguageModel`: Imports the `FastLanguageModel` class from Unsloth.

2. **Loading the Saved Model and Tokenizer:**
   - `new_model, new_tokenizer = FastLanguageModel.from_pretrained(...)`: Loads the fine-tuned model and tokenizer from the local directory `"lora_model"`.
     - `model_name = "lora_model"`: Specifies the directory where the trained model and tokenizer were saved.
     - `max_seq_length = max_seq_length`, `dtype = dtype`, `load_in_4bit = load_in_4bit`: Passes the same configuration values used during initial model loading to ensure consistency. These variables must already be defined in the session.

3. **Enabling Optimized Inference:**
   - `FastLanguageModel.for_inference(new_model)`: Activates native inference optimizations for the loaded model, ensuring faster response generation.

4. **Preparing a New Input Message:**
   - `messages = [{"role": "user", "content": "Describe a famous statue in New York City."},]`: Defines a new prompt asking for a description of a famous statue in NYC.
   - `inputs = new_tokenizer.apply_chat_template(...)`: Applies the chat template, tokenizes the message, adds generation prompts, converts to PyTorch tensors, and moves to the GPU.

5. **Setting Up the Streamer and Generating Output:**
   - `from transformers import TextStreamer`: Imports the `TextStreamer` class.
   - `text_streamer = TextStreamer(new_tokenizer, skip_prompt = True)`: Initializes the streamer.
   - `_ = new_model.generate(...)`: Generates the response using the loaded model.
     - `input_ids=inputs`, `streamer=text_streamer`, `max_new_tokens=128`, `use_cache=True`, `temperature=1.5`, `min_p=0.1`: Specifies generation parameters for efficient and diverse output.
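
To show what `temperature` and `min_p` do to the next-token distribution, the sketch below applies both to a made-up set of logits in plain Python. It assumes the common definition of min-p filtering: keep tokens whose probability is at least `min_p` times the top token's probability, then renormalize. The logits and the helper `min_p_filter` are illustrative and are not taken from the notebook or from the `transformers` source.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs) and renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


logits = [5.0, 3.0, 1.0, -2.0]  # hypothetical scores for four candidate tokens
probs = softmax(logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.1)
print([round(p, 3) for p in probs])
print([round(p, 3) for p in filtered])
```

A temperature above 1 flattens the distribution and makes less likely tokens more probable, while `min_p` then trims the tail relative to the most likely token, which keeps high-temperature sampling from drifting into very improbable tokens.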


## Conclusion

The Jupyter notebook illustrates how Unsloth's `FastLanguageModel` can streamline the fine-tuning and deployment of large language models. Quantizing the base weights to 4 bits shrinks their memory footprint, and PEFT with LoRA restricts training to a small set of adapter parameters; together these reduce memory usage and training time enough to make larger models workable on hardware with limited resources. Integration with Hugging Face's ecosystem lets users draw on a wide range of existing tools and datasets with little extra effort. The optimized inference path and streaming output shown in the notebook also make response generation faster, which matters when deploying these models in real-time applications such as chatbots and conversational agents.

Overall, this approach accelerates fine-tuning while aiming to preserve the accuracy and responsiveness of the resulting models. By training selectively on response data and relying on these optimization strategies, Unsloth's `FastLanguageModel` offers a practical framework for developing and deploying capable language models efficiently.
