# **Fine-Tuning Gemma2 With LoRA for Turkish**

## Introduction
Language models such as Gemma 2 power many natural language processing (NLP) applications, including text generation, translation, and sentiment analysis. However, these models are trained mostly on English data, which can leave other languages and their cultural nuances underrepresented.

In this notebook, the goal is to narrow this gap by fine-tuning Gemma 2 for Turkish. A model adapted to Turkish lets Turkish-speaking communities use NLP tools built for their language, which supports better communication, education, and creativity.

This notebook is written for clarity and reproducibility, so that anyone, regardless of skill level, can follow the method and adapt it to their own language or use case.

Before running the notebook, set the accelerator to GPU P100, since the notebook is known to run on it.


## Setup and Initialization

This section focuses on preparing the environment by installing the required libraries and importing essential modules to fine-tune Gemma 2 for Turkish. These steps ensure that all dependencies are properly installed and the environment is configured for smooth execution of the notebook.

### 1. Install Dependencies

- Install the core packages (`transformers`, `datasets`, `accelerate`, `peft`, `trl`) needed to load and fine-tune large language models.
- Use the `bitsandbytes` library for memory-efficient 4-bit computation, which is essential when handling large models.
- `googletrans` is installed as well for translation helpers. Experiment tracking with `wandb` (Weights & Biases) is not used here; reporting is disabled later with `report_to="none"`.

In [1]:
%%capture
%pip install -U transformers datasets accelerate peft trl bitsandbytes
%pip install googletrans==4.0.0-rc1 --quiet

### 2. Import Libraries

- Key libraries for loading and fine-tuning the model (transformers, peft, trl) are imported.
- Supporting libraries like torch, bitsandbytes, and datasets are also loaded to streamline model training and data handling.

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import os
import torch
import bitsandbytes as bnb

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig, setup_chat_format

from googletrans import Translator

print("Importing libraries worked perfectly!")

Importing libraries worked perfectly!


### 3. Define Base Variables
- Specify the base model (google/gemma-2-2b-it, loaded here from its Kaggle model path) and the name of the fine-tuned model (Gemma-2-2b-it-turkish). The fine-tuning dataset (TFLai/Turkish-Alpaca) is loaded later in the notebook.

In [3]:
base_model = "/kaggle/input/gemma-2/transformers/gemma-2-2b-it/2" # We are using Gemma 2
new_model = "Gemma-2-2b-it-turkish" # Fine-tuned model
print("Base variables are defined perfectly!")

Base variables are defined perfectly!


## Hardware Setup and Model Configuration
This section focuses on optimizing hardware and model configurations to fine-tune Gemma 2 efficiently using techniques such as Quantized LoRA (QLoRA) and Flash Attention.

### 1. Check CUDA Device Capability
- Verify the CUDA device capability to assess hardware support for advanced features like Flash Attention v2.
- If the capability is sufficient, install and configure Flash Attention v2 to enhance memory efficiency and computation speed.
- Set the computation data type based on GPU compatibility (e.g., bfloat16 for newer GPUs or float16 for older ones).

In [4]:
# Check CUDA device capability and set appropriate configurations

if torch.cuda.get_device_capability()[0] >= 8:
    # Install Flash Attention if capability allows
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16               
    attn_implementation = "flash_attention_2"
    print("Using bfloat16 and Flash Attention v2")
else:
    torch_dtype = torch.float16   
    attn_implementation = "eager"  
    print("Using float16 because of the older hardware and default attention mechanism is used")

Using float16 because of the older hardware and default attention mechanism is used
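The branching rule above can be sketched without a GPU. This is a minimal illustration, not part of the original notebook: `pick_precision` is a hypothetical helper, and it returns dtype names as strings instead of `torch` dtypes so it runs anywhere. The example capabilities are the values a P100 (6.0) and an A100 (8.0) report.

```python
def pick_precision(capability):
    """Map a CUDA compute capability (major, minor) to a dtype name and attention backend.

    Hypothetical helper mirroring the rule in the cell above: capability 8.x and newer
    gets bfloat16 with Flash Attention v2, anything older falls back to float16 and
    eager attention.
    """
    major, _minor = capability
    if major >= 8:
        return "bfloat16", "flash_attention_2"
    return "float16", "eager"


# Example capabilities (a P100 reports 6.0, an A100 reports 8.0)
for gpu_name, capability in {"P100": (6, 0), "A100": (8, 0)}.items():
    dtype_name, attention = pick_precision(capability)
    print(f"{gpu_name}: capability {capability} -> {dtype_name}, {attention}")
```

On the recommended P100 the rule selects float16 and eager attention, which matches the message printed by the cell above.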


### 2. Configure Quantized LoRA (QLoRA)
- Use BitsAndBytesConfig to enable 4-bit quantization, which makes it possible to load large models in limited GPU memory with only a small loss in quality.
- Configure additional options: the nf4 quantization type, which is designed for normally distributed weights, and double quantization, which also quantizes the quantization constants to save more memory.

In [5]:
# Configuration for Quantized LoRA (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # Enable 4-bit quantization for efficient model loading
    bnb_4bit_quant_type="nf4",           # Use NormalFloat4 (NF4) quantization
    bnb_4bit_compute_dtype=torch_dtype,  # Compute dtype chosen in the previous cell based on hardware support
    bnb_4bit_use_double_quant=True       # Also quantize the quantization constants to save extra memory
)
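To make the memory effect of double quantization concrete, here is a rough back-of-the-envelope sketch. The block sizes (64 weights per block, 256 constants per second-level block), the constant widths, and the parameter count of about 2.6 billion are assumptions for illustration, and the estimate ignores layers that stay unquantized (such as embeddings). `quant_constant_overhead_bits` is a hypothetical helper, not a bitsandbytes API.

```python
def quant_constant_overhead_bits(block_size=64, const_bits=32,
                                 double_quant=False,
                                 dq_block_size=256, dq_const_bits=8):
    """Bits per weight spent on storing quantization constants (assumed block sizes)."""
    if not double_quant:
        # One full-precision constant per block of weights
        return const_bits / block_size
    # First-level constants are stored in dq_const_bits each, plus one
    # full-precision second-level constant per dq_block_size first-level constants
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)


n_params = 2.6e9  # assumed approximate parameter count, for illustration only

for use_double_quant in (False, True):
    overhead = quant_constant_overhead_bits(double_quant=use_double_quant)
    total_gb = n_params * (4 + overhead) / 8 / 1e9
    print(f"double_quant={use_double_quant}: "
          f"{overhead:.3f} overhead bits/weight, about {total_gb:.2f} GB for the weights")
```

Under these assumptions, double quantization lowers the per-weight overhead from 0.5 bits to roughly 0.127 bits. The saving is in memory, not in accuracy.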

### 3. Load Pretrained Model and Tokenizer
- Load the Gemma 2 causal language model with the defined quantization and attention settings.
- Initialize the tokenizer associated with the base model to ensure seamless integration during the fine-tuning process.

In [6]:
# Load the pretrained causal language model with quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    base_model,                              # The base model identifier or path
    quantization_config=quantization_config, # Apply QLoRA configuration
    device_map="auto",                       
    attn_implementation=attn_implementation  # Set attention implementation
)

# Load the tokenizer corresponding to the pretrained model
tokenizer = AutoTokenizer.from_pretrained(
    base_model,             # The base model identifier or path
    trust_remote_code=True  # Trust custom tokenizer code if provided by the model
)


## Configuring LoRA for Fine-Tuning
### 1. Locate Linear Layers
- Create a helper function to identify all 4-bit linear layers in the model that are appropriate for LoRA fine-tuning.
- Exclude the lm_head layer so that adapters are attached only to the layers inside the transformer blocks, not to the output head.

In [7]:
def find_all_linear_names(model):
    """
    This function searches for all linear layers of the 4-bit format 
    in a given model and returns their names, excluding the 'lm_head' 
    module if present.

    Args:
    - model: The model to search for linear layers in.

    Returns:
    - List of module names associated with linear layers.
    """
    # The target class for linear layers (4-bit format)
    cls = bnb.nn.Linear4bit
    lora_module_names = set()  # Set to hold the unique names of the target linear modules

    # Iterate over all named modules in the model
    for name, module in model.named_modules():
        # Check if the module is of the target class
        if isinstance(module, cls):
            names = name.split('.')  # Split the module name by dots to isolate components
            # Add the first or last part of the name (depending on the structure) to the set
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Remove 'lm_head' if present in the set (needed for 16-bit models)
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')

    # Return the list of linear module names
    return list(lora_module_names)

# Get the list of linear module names in the model
modules = find_all_linear_names(model)
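The name handling in `find_all_linear_names` can be tried on plain strings, without loading a model. This is a small sketch: `leaf_names` is a hypothetical helper, and the dotted module names below are illustrative examples rather than a dump of the real model.

```python
def leaf_names(module_names, exclude=("lm_head",)):
    """Keep only the last component of each dotted module name, minus excluded names."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


# Illustrative module names (hypothetical, not read from the actual model)
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.gate_proj",
    "lm_head",
]

print(leaf_names(example_names))
```

Because the result is a set of leaf names, a projection such as `q_proj` is listed once even though it occurs in every decoder layer, and LoRA then targets every module with that name.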

### 2. Configure LoRA
- Set up LoRA with key parameters such as rank (r), scaling factor (lora_alpha), and dropout rate (lora_dropout).
- Designate the previously identified linear layers as target modules for fine-tuning.

In [8]:
# LoRA configuration setup
peft_config = LoraConfig(
    r=16,                    # Rank for LoRA
    lora_alpha=32,           # Scaling factor for LoRA
    lora_dropout=0.05,       # Dropout rate for LoRA
    bias="none",             # Do not train any bias parameters
    task_type="CAUSAL_LM",   # Task type for causal language modeling
    target_modules=modules   # The list of target modules (linear layers)
)
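As a quick sanity check on what `r` and `lora_alpha` mean, the sketch below counts the parameters LoRA adds to a single linear layer and computes the scaling applied to the low-rank update. The layer shape (2304 inputs, 2048 outputs) is the `q_proj` shape shown in the model summary printed later in this notebook; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in the two low-rank matrices: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


r, lora_alpha = 16, 32
in_features, out_features = 2304, 2048  # q_proj shape from the printed model summary

added = lora_param_count(in_features, out_features, r)
frozen = in_features * out_features
scaling = lora_alpha / r  # the low-rank update B @ A is multiplied by this factor

print(f"LoRA parameters added to q_proj: {added}")
print(f"Frozen base weights in q_proj:   {frozen}")
print(f"Trainable fraction:              {added / frozen:.2%}")
print(f"Update scaling (lora_alpha / r): {scaling}")
```

With these settings each adapted `q_proj` trains 69,632 parameters next to about 4.7 million frozen ones, which is why LoRA fits on modest hardware.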

### 3. Prepare Tokenizer, Chat Format, and Apply LoRA to the Model
- Configure the tokenizer's padding side and update the chat template to prevent conflicts. Use the setup_chat_format utility to adapt the tokenizer and model for a conversational format.
- Apply the LoRA configuration to the model, allowing efficient fine-tuning of the targeted components.

In [9]:
# Set the padding side for the tokenizer (important for certain models)
tokenizer.padding_side = 'right'

# Reset chat template to ensure no leftover settings
tokenizer.chat_template = None

# Setup chat format with the model and tokenizer
model, tokenizer = setup_chat_format(model, tokenizer)

# Apply LoRA configurations to the model
model = get_peft_model(model, peft_config)

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


## Preparing and Loading the Dataset

In this section, we will focus on loading and preprocessing our dataset. Once the data is ready, we’ll format it appropriately for training and split it into training and test sets. For non-English languages like Turkish, an excellent approach to building a high-quality dataset is leveraging the 'Alpaca Dataset,' which is widely recognized for training models. To tailor it for Turkish, we will be using a translated dataset.

### Introduction to Alpaca Dataset
The Alpaca dataset is a substantial collection of 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine, designed to improve the instruction-following capabilities of language models. This dataset is a crucial resource for instruction tuning, enabling models to better respond to user commands and queries.

Alpaca covers a broad range of general-purpose tasks rather than a single domain. Key features include:

- **Diverse Task Types:** Instructions span open-ended generation, question answering, rewriting, summarization, classification, and more.
- **Varying Complexity:** Tasks range from short factual questions to longer, multi-step writing requests.

Preprocessing is still needed before training. For this dataset it is light: check for empty or inconsistent fields and convert each record into the chat format the model expects.

In summary, the Alpaca dataset is a comprehensive resource for training AI models, especially for instruction-following tasks, contributing significantly to advancements in natural language processing.

### 1. Downloading the Dataset
- Using the `datasets` library, we download the 'TFLai/Turkish-Alpaca' dataset from the Hugging Face Hub.

In [10]:
from datasets import load_dataset

ds = load_dataset("TFLai/Turkish-Alpaca")


Generating train split:   0%|          | 0/51914 [00:00<?, ? examples/s]

### 2. Exploring the Original Dataset
Once we acquire the dataset, the next step is to examine its structure so we can determine how to format the data appropriately for training. In this section, I’ll explain the key components of the dataset: Instruction, Input, and Output.

- **Instruction:** This is the core task or directive that we provide to the model. It defines the purpose or action the model should perform.
- **Input:** Optional additional context or data that the instruction refers to. It is filled in only when the instruction needs it.
- **Output:** This is the result produced by the model, generated based on the instruction and the input (if available). It represents the model's response to the task outlined in the instruction.

By inspecting the dataset, we can see how these elements are structured, and we will adapt this structure to prepare our data for training and fine-tuning the model.

In [11]:
print(f"Instruction: {ds['train'][0]['instruction']}")
print(f"Input: {ds['train'][0]['input']}")
print(f"Output: {ds['train'][0]['output']}")


Instruction: Fransa'nın başkenti nedir?
Input: 
Output: Fransa'nın başkenti Paris'tir.


In [12]:
import json

# Extract the entire train dataset as a list of dictionaries
data_list = [
    {
        "instruction": example['instruction'],
        "input": example['input'],
        "output": example['output']
    }
    for example in ds['train']
]

# Write the list of dictionaries to a JSON file
with open('train_data.json', 'w', encoding='utf-8') as f:
    json.dump(data_list, f, ensure_ascii=False, indent=4)

print("Dataset has been written to train_data.json")

Dataset has been written to train_data.json


### 3. Shrinking Down the Dataset (Optional Step)
Since the main goal is to check how the fine-tuned model behaves, you can optionally shrink the original data to speed up training.

In [13]:
def save_first_n_json(input_file, output_file, n=1000):
    """Loads a JSON file, extracts the first n entries, and saves them to a new JSON file.

    Args:
        input_file: Path to the input JSON file.
        output_file: Path to the output JSON file.
        n: The number of entries to extract. Defaults to 1000.
    """
    try:
        with open(input_file, 'r', encoding='utf-8') as f_in:  # Add encoding for robustness
            data = json.load(f_in)

        if not isinstance(data, list):  # Check if the JSON data is a list
            raise TypeError("The JSON data must be a list.")


        if len(data) < n:
            print(f"Warning: The input file has fewer than {n} entries. Saving all available entries.")
            first_n = data
        else:
            first_n = data[:n]

        with open(output_file, 'w', encoding='utf-8') as f_out:
            json.dump(first_n, f_out, indent=4, ensure_ascii=False)  # Use indent for pretty printing and ensure_ascii for proper UTF-8 handling

        print(f"Successfully saved the first {len(first_n)} entries to {output_file}")

    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.")
    except json.JSONDecodeError:
        print(f"Error: Invalid JSON format in '{input_file}'.")
    except TypeError as e:
        print(f"Error: {e}")

input_file = '/kaggle/working/train_data.json'
num_entries = 10000
output_file = f'output{num_entries}.json'

save_first_n_json(input_file, output_file, num_entries)

Successfully saved the first 10000 entries to output10000.json


### 4. Shuffle the Dataset
- After loading the dataset, shuffle it to randomize the sample order, which helps generalization during training.
- Optionally, load the truncated subset from the previous step instead to speed up experimentation and testing.



In [14]:
from datasets import load_dataset
# Point data_files at the shrunken file (e.g. output10000.json) if you ran the optional step above.
dataset = load_dataset('json', data_files='/kaggle/working/train_data.json', split='train')
dataset = dataset.shuffle()


### 5. Formatting the Dataset for Chat-based Tasks

- Each row of the dataset is formatted into a chat-like structure using a custom template. The format includes a "system" message (instruction), a "user" message (input question), and an "assistant" message (output answer). This format is then tokenized and prepared for training.

In [15]:
def format_chat_template(row):
    """
    This function formats each row of the dataset into a chat-like structure 
    and applies the chat template for tokenization.

    Args:
    - row: The current dataset row, containing 'instruction', 'input', and 'output'.

    Returns:
    - The updated row with a 'text' field containing the formatted chat template.
    """
    # Construct a JSON-like structure for the chat conversation (system, user, assistant)
    row_json = [
        {"role": "system", "content": row["instruction"]},  # System message: the instruction
        {"role": "user", "content": row["input"]},          # User message: the input question
        {"role": "assistant", "content": row["output"]}     # Assistant message: the model's response
    ]
    # Apply the tokenizer to format the row using the chat template without tokenizing
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Apply the chat formatting to the entire dataset using multiple processes (num_proc=4 for parallelism)
dataset = dataset.map(format_chat_template, num_proc=4)

Map (num_proc=4):   0%|          | 0/51914 [00:00<?, ? examples/s]
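To see roughly what the `text` field looks like after formatting, the sketch below renders a conversation in the ChatML layout that `setup_chat_format` installs. It is an approximation written by hand for illustration: `render_chatml` is a hypothetical function, and the exact template string can differ between trl versions.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering: each message is wrapped in start/end markers."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Opens the assistant turn so the model continues from here at inference time
        text += "<|im_start|>assistant\n"
    return text


# The first dataset row shown earlier, mapped to roles as in format_chat_template
row_messages = [
    {"role": "system", "content": "Fransa'nın başkenti nedir?"},
    {"role": "user", "content": ""},
    {"role": "assistant", "content": "Fransa'nın başkenti Paris'tir."},
]

print(render_chatml(row_messages))
```

Rows with an empty `input` produce an empty user turn, as in this example. That is worth remembering at inference time, because prompts should follow the same layout the model saw in training.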

### 6. Splitting the Dataset

- Finally, the dataset is split into training (90%) and test (10%) sets, which will be used for fine-tuning and evaluating the model, respectively.

In [16]:
# Split the dataset into training and test sets (90% train, 10% test)
dataset = dataset.train_test_split(test_size=0.1)
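The expected split sizes can be checked by hand. The sketch below assumes the test size is rounded up and the train size rounded down. That rounding rule is an assumption about the library's behavior, but it is consistent with the 46722 and 5192 example counts reported when the trainer is built. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size):
    """Row counts for a fractional split (assumed rounding: ceil for test, floor for train)."""
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test


n_train, n_test = split_sizes(51914, 0.1)
print(f"train rows: {n_train}, test rows: {n_test}")
```

For a reproducible split, `train_test_split` also accepts a `seed` argument, and the same holds for `shuffle`.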

## Setting Hyperparameters and Training the Model

In this section, we configure and initialize the training process using the **SFTTrainer** class, setting essential hyperparameters and training configurations for fine-tuning the model. This setup ensures an efficient and controlled training process.

**1. Trainer Initialization**

* The **SFTTrainer** class is initialized with the model, tokenizer, training and evaluation datasets, and LoRA configuration. This class handles the entire training and evaluation workflow.

**2. Hyperparameters Configuration**

The hyperparameters define how the training process will be carried out:

* **Batch Size**: Both training and evaluation batches are set to 1 for memory efficiency.
* **Gradient Accumulation**: Gradients are accumulated over four batches before each optimizer step, which simulates a batch size of 4.
* **Optimizer**: paged_adamw_32bit optimizer is used to ensure stability and efficiency.
* **Epochs**: The model is trained for 1 epoch.
* **Learning Rate**: A learning rate of 5e-5 is chosen to allow fine adjustments during training.
* **Logging & Evaluation**: The training loss is logged every 10 steps, and the model is evaluated on the test split at a fixed step interval.
* **Saving Models**: Checkpoint saving is disabled (save_strategy="no"); the adapter is saved once, after training finishes.
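The learning rate is not constant during the run. Assuming the scheduler is left at the default linear schedule, it ramps up over the warmup steps and then decays linearly to zero. The sketch below reproduces that shape; `linear_warmup_decay` is a hypothetical helper, and `total_steps=11680` is taken from the training log of this run.

```python
def linear_warmup_decay(step, base_lr=5e-5, warmup_steps=30, total_steps=11680):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    # Decay from base_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 15, 30, 5855, 11680):
    print(f"step {step:>5}: lr = {linear_warmup_decay(step):.2e}")
```

The peak value of 5e-5 is reached only at the end of warmup, so most of the run trains at a lower rate than the configured one.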

**3. Testing the Training**

For a quick test, you can train on the smaller subset created earlier so that training finishes faster. The run shown below uses the full dataset.

In [17]:
trainer = SFTTrainer(
    model=model,                     # The LoRA-wrapped model to train
    processing_class=tokenizer,      # The tokenizer used for data processing
    train_dataset=dataset["train"],  # Training split
    eval_dataset=dataset["test"],    # Evaluation split
    peft_config=peft_config,         # LoRA configuration
    args=SFTConfig(
        output_dir=new_model,                                   # Directory for training outputs
        per_device_train_batch_size=1,                          # Small batch size to fit in GPU memory
        per_device_eval_batch_size=1,                           # Same for evaluation
        gradient_accumulation_steps=4,                          # Accumulate 4 batches per optimizer step (effective batch size 4)
        optim="paged_adamw_32bit",                              # Paged AdamW optimizer, which helps avoid out-of-memory spikes
        num_train_epochs=1,                                     # One pass over the training data
        eval_strategy="steps",                                  # Evaluate at a fixed step interval
        eval_steps=int(len(dataset["train"]) // (1 * 2) // 5),  # Evaluation interval in optimizer steps (4672 for this run)
        logging_steps=10,                                       # Log the training loss every 10 steps
        warmup_steps=30,                                        # Ramp up the learning rate over the first 30 steps
        logging_strategy="steps",                               # Log at a fixed step interval
        learning_rate=5e-5,                                     # Peak learning rate
        save_steps=0,                                           # Not used, because checkpoint saving is disabled below
        save_total_limit=0,                                     # Not used, because checkpoint saving is disabled below
        save_strategy="no",                                     # Disable checkpoint saving during training
        fp16=True,                                              # Mixed-precision training in float16
        bf16=False,                                             # bfloat16 is disabled; use it instead of fp16 on GPUs that support it
        group_by_length=True,                                   # Batch samples of similar length together to reduce padding
        report_to="none",                                       # No external reporting (e.g. to wandb)
        dataset_text_field="text",                              # Dataset column that holds the formatted text
        packing=False,                                          # Do not pack multiple samples into one sequence
        load_best_model_at_end=False,                           # Keep the final weights instead of reloading the best checkpoint
    ),
)


Map:   0%|          | 0/46722 [00:00<?, ? examples/s]

Map:   0%|          | 0/5192 [00:00<?, ? examples/s]

### 3. Cache Management

Caching is disabled during training to avoid excessive memory usage, ensuring smooth operation on limited hardware.

In [18]:
# Disable caching during training to avoid memory issues
model.config.use_cache = False

### 4. Model Training

Finally, the training process is initiated with the **train()** method, which uses the configured settings to fine-tune the model.

In [19]:
# Start training the model
trainer.train()

Step,Training Loss,Validation Loss
4672,1.8967,1.979302
9344,1.9953,1.933076


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=11680, training_loss=1.9617368704652134, metrics={'train_runtime': 21318.7879, 'train_samples_per_second': 2.192, 'train_steps_per_second': 0.548, 'total_flos': 6.626453379522509e+16, 'train_loss': 1.9617368704652134, 'epoch': 0.999957193613287})
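The numbers in the log above can be reproduced from the configuration. The following sketch assumes a single GPU and uses the training-set size reported when the trainer was built (46722 examples); the variable names are chosen for illustration.

```python
n_train = 46722        # training examples after the 90/10 split
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps

# One optimizer step consumes per_device_batch * grad_accum examples
effective_batch = per_device_batch * grad_accum
optimizer_steps = n_train // effective_batch

# Same expression as the eval_steps argument in the trainer configuration
eval_steps = n_train // (1 * 2) // 5
eval_points = list(range(eval_steps, optimizer_steps + 1, eval_steps))

runtime_hours = 21318.7879 / 3600  # train_runtime from the log, in seconds

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {optimizer_steps}")
print(f"evaluations at steps: {eval_points}")
print(f"runtime:              {runtime_hours:.1f} hours")
```

The result matches `global_step=11680` and the two evaluation rows at steps 4672 and 9344. Because the `eval_steps` expression divides by 2 while gradients are accumulated over 4 batches, the run evaluates twice rather than five times.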

In [20]:
# Re-enable cache after training
model.config.use_cache = True

## Saving the Fine-Tuned Adapter Model

Save the fine-tuned adapter model locally for future use and deployment.

In [21]:
# Save the trained model to the specified directory
trainer.model.save_pretrained(new_model)



## Inference and Generating Responses

This section focuses on loading the fine-tuned model, configuring it for inference, and generating responses to user queries. The workflow involves preparing the model and tokenizer, formatting the input, and decoding the model's output to generate meaningful answers.

### 1. Clear CUDA Cache

Before inference, the CUDA memory cache is cleared to optimize GPU memory usage and prevent memory-related issues.

In [22]:
# Clear the CUDA memory cache.
torch.cuda.empty_cache()

### 2. Set Up the Model for Inference

Load the fine-tuned model with 4-bit quantization and integrate it with the base model:
* **Quantization Configuration**: Applies 4-bit quantization to optimize memory and computational efficiency.
* **Model Loading**: Loads the base model and fine-tuned weights, setting it to evaluation mode for inference.

In [23]:
# Define the path to the fine-tuned model
new_model_path = "/kaggle/working/Gemma-2-2b-it-turkish"

# Configuration for 4-bit quantization to optimize model performance and memory usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit quantization for efficient loading
    bnb_4bit_quant_type="nf4",             # Use NormalFloat4 (NF4) quantization type for better accuracy
    bnb_4bit_compute_dtype=torch.float16,  # Use 16-bit floating-point precision for computations
    bnb_4bit_use_double_quant=True         # Also quantize the quantization constants to save extra memory
)

# Load the base model with QLoRA (Quantized LoRA) configuration
model = AutoModelForCausalLM.from_pretrained(
    base_model,                              # Path or identifier of the base model
    quantization_config=quantization_config, # Apply the quantization configuration
    attn_implementation="eager",             # Set attention mechanism implementation to "eager"
    torch_dtype=torch.float16,               # Use 16-bit floating-point precision for weights and activations
    return_dict=True,                        # Return outputs as a dictionary for better readability
    device_map="auto"                        # Automatically map model components to available devices
)


### 3. Prepare the Tokenizer

The tokenizer is initialized to process inputs and generate outputs in a chat format. Any previous configurations are reset to avoid interference with new tasks.

In [24]:
# Load the tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Reset the chat template to ensure no stale settings interfere with new tasks
tokenizer.chat_template = None

# Configure the model and tokenizer for chat-based interactions
model, tokenizer = setup_chat_format(model, tokenizer)

# Load the fine-tuned model with PeftModel, applying it to the base model
model = PeftModel.from_pretrained(model, new_model_path)

# Set the model to evaluation mode to prepare for inference
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256002, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
        

### 4. Define the Conversation and Create a Prompt

Format the conversation history into a structured prompt using the tokenizer’s chat template. This ensures the model receives well-structured input.

In [28]:
# Define the conversation history as a list of messages
messages = [
    {"role": "system", "content": "Nasılsın?"},
    {"role": "user", "content": ""},
]

# Apply the tokenizer's chat template to format the messages for the model
# Set tokenize=False to avoid tokenization at this point, and add_generation_prompt=True to prepare for generation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and prepare the inputs for the model
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

### 5. Generate a Response

Use the model to generate a response, combining sampling with beam search:

* **Top-K Sampling**: Considers only the 50 most likely tokens at each step.
* **Nucleus Sampling (Top-P)**: Keeps the smallest set of tokens whose cumulative probability reaches 85%, balancing diversity and relevance.
* **Temperature**: A low value (0.3) sharpens the distribution and makes the output more deterministic.
* **No Repetition**: Prevents repetitive phrases by not allowing any 3-gram to appear twice.
* **Beam Search**: Tracks 20 candidate sequences in parallel and returns the best one.

In [29]:
# Optimized text generation with custom sampling strategies for better results
outputs = model.generate(
    **inputs,                # Feed the tokenized inputs to the model
    num_return_sequences=1,  # Only return one sequence of text
    top_k=50,                # Limit the sampling pool to the top 50 tokens
    top_p=0.85,              # Nucleus sampling: keep the smallest token set with 85% cumulative probability
    temperature=0.3,         # Lower temperature for more deterministic (less random) responses
    no_repeat_ngram_size=3,  # Do not allow any 3-gram to appear twice in the output
    do_sample=True,          # Sample from the filtered distribution instead of always taking the most likely token
    num_beams=20             # Number of beams; together with do_sample=True this runs beam-search sampling
)
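To build intuition for how temperature, top-k, and top-p interact, here is a simplified, standard-library sketch of the filtering step on a toy distribution. It is an illustration of the idea only, not the library's implementation: it ignores beam search and the n-gram constraint, and `filter_distribution` is a hypothetical helper.

```python
import math


def filter_distribution(logits, temperature=0.3, top_k=50, top_p=0.85):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-k: keep the k most likely tokens and renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for a four-token vocabulary
print("temperature 0.3:", filter_distribution(toy_logits, temperature=0.3))
print("temperature 1.0:", filter_distribution(toy_logits, temperature=1.0))
```

On this toy input, the low temperature concentrates so much probability on the top token that it alone passes the 85% threshold, while at temperature 1.0 three tokens remain candidates.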

### 6. Decode and Extract the Response

Decode the generated output into human-readable text, cleaning unnecessary parts to extract the final response.

In [34]:
import re

# Decode the output sequence back to text, skipping special tokens like padding and EOS markers
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract the instruction (stored in the system message)
input_text = re.search(r"system\n(.*?)\nuser", text, re.DOTALL).group(1).strip()

# Extract the assistant's response
output_text = re.search(r"assistant\n(.*)", text, re.DOTALL).group(1).strip()

# Format as Input/Output
formatted_output = f"Input: {input_text}\nOutput: {output_text}"

print(formatted_output)

Input: Nasılsın?
Output: Ben iyiyim, teşekkürler! Siz nasılsınız? Nasıl yardımcı olabilirim?
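The extraction step can be made a little more defensive. In the cell above, `re.search` returns `None` when a pattern is missing, and calling `.group(1)` on it raises an error. The sketch below uses the same patterns but falls back to an empty string; `split_decoded` is a hypothetical helper, and the sample string is an assumed shape of the decoded text once special tokens are stripped.

```python
import re


def split_decoded(text):
    """Extract the instruction and the assistant reply from decoded chat text."""
    system_match = re.search(r"system\n(.*?)\nuser", text, re.DOTALL)
    assistant_match = re.search(r"assistant\n(.*)", text, re.DOTALL)
    # Fall back to empty strings instead of failing when a section is missing
    instruction = system_match.group(1).strip() if system_match else ""
    response = assistant_match.group(1).strip() if assistant_match else ""
    return instruction, response


# Assumed shape of the decoded output after special tokens are removed
decoded = "system\nNasılsın?\nuser\n\nassistant\nBen iyiyim, teşekkürler!"

instruction, response = split_decoded(decoded)
print(f"Input: {instruction}\nOutput: {response}")
```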


## Conclusion

As the output shows, the fine-tuned model can understand and respond to Turkish text. Thank you for reading this notebook. I hope it helped you learn how to fine-tune Gemma 2 with LoRA on a dataset of your choice.