# Fine-Tuning Gemma-1.1-2b-it for Code Reasoning with `nvidia/OpenCodeReasoning`

This notebook demonstrates how to fine-tune the `google/gemma-1.1-2b-it` model on the `nvidia/OpenCodeReasoning` dataset using Hugging Face libraries (`transformers`, `datasets`, `peft`, `trl`).

**Goal:** Adapt Gemma to generate code, and to reason about code, more reliably by training on the examples in the dataset.

**Steps Covered:**
1.  **Install Libraries:** Set up the environment with necessary packages.
2.  **Load and Preprocess Dataset:** Load `nvidia/OpenCodeReasoning`, explore it, and prepare it for training. This step is CRITICAL and requires careful adaptation to the dataset's specific structure and Gemma's required input format.
3.  **Load Gemma Model and Tokenizer:** Load the base `gemma-1.1-2b-it` model with 4-bit quantization for efficiency.
4.  **Configure Fine-Tuning:** Set up LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning and define training arguments.
5.  **Run Fine-Tuning:** Train the model using `SFTTrainer`.
6.  **Save Model:** Save the trained LoRA adapter.
7.  **Inference:** Test the fine-tuned model with sample prompts.

**Important Considerations for `nvidia/OpenCodeReasoning`:**
*   **Dataset Structure:** You **MUST** inspect the `nvidia/OpenCodeReasoning` dataset to understand its columns (e.g., prompt, code, reasoning, etc.). The `preprocess_function` (Section 2.3) and the inference prompt format (Section 7.1) need to be tailored to this structure.
*   **Prompt Engineering:** The way you format your input to Gemma is crucial. For instruction-tuned models like `gemma-1.1-2b-it`, you need to use its specific chat/instruction template (e.g., involving `<start_of_turn>user`, `<end_of_turn>`, `<start_of_turn>model`). The placeholder examples in this notebook will need to be updated.
*   **Colab Resources:** Fine-tuning can be resource-intensive. This notebook uses 4-bit quantization and LoRA to make it more feasible on platforms like Google Colab with a T4 GPU. You might need to adjust batch sizes or other parameters based on available memory.
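
To get a feel for why 4-bit quantization matters on a 16 GB GPU such as the T4, weight memory can be estimated as parameter count times bits per parameter. The sketch below is back-of-the-envelope arithmetic only. The helper name `weight_memory_gib` is hypothetical, the figure of roughly 2.5 billion parameters is an approximation for the 2B Gemma checkpoint, and activations, optimizer state, LoRA weights, and quantization overhead are all ignored.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


approx_params = 2_500_000_000  # assumption: rough parameter count, not read from the model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(approx_params, bits):.2f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which leaves more of the GPU for activations and the LoRA optimizer state.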

# 1. Install Necessary Libraries

In [None]:
!pip install -q transformers datasets accelerate bitsandbytes peft trl

# 2. Load and Preprocess the Dataset

## 2.1. Load Dataset
We'll load the `nvidia/OpenCodeReasoning` dataset from Hugging Face.

In [None]:
from datasets import load_dataset

dataset_name = "nvidia/OpenCodeReasoning"
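# The available configuration and split names are listed on the dataset card.
# If this call raises a missing-config or unknown-split error, pass the configuration
# name as the second argument and adjust `split` to one of the names the card lists.
dataset = load_dataset(dataset_name, split="train")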

# Let's look at an example from the dataset
print(dataset[0]) 

## 2.2. Explore Dataset (Optional)
It's good practice to understand the structure of your data. You can print out more examples, check column names, etc.

In [None]:
# Example: Print dataset features
print(dataset.features)

# Example: Show a few more examples
for i in range(1, 4):
    print(dataset[i])
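
Beyond printing raw examples, it can help to summarize which columns hold text and how long that text gets, since that determines what to use as prompt and response and whether a 512-token limit is realistic. The following is a minimal sketch using plain Python dictionaries. The helper `summarize_columns` and the toy column names (`question`, `answer`) are made up for illustration and are not the dataset's real columns.

```python
from collections import Counter


def summarize_columns(rows):
    """Return {column: {"types": Counter, "max_str_len": int}} for a list of dict rows."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            info = summary.setdefault(name, {"types": Counter(), "max_str_len": 0})
            info["types"][type(value).__name__] += 1
            if isinstance(value, str):
                info["max_str_len"] = max(info["max_str_len"], len(value))
    return summary


# Toy rows standing in for dataset examples (hypothetical column names)
sample_rows = [
    {"id": 1, "question": "Reverse a string.", "answer": "def rev(s): return s[::-1]"},
    {"id": 2, "question": "Add two numbers.", "answer": "def add(a, b): return a + b"},
]

for column, info in summarize_columns(sample_rows).items():
    print(column, dict(info["types"]), info["max_str_len"])
```

With the real dataset, passing something like `[dataset[i] for i in range(100)]` should give the same kind of summary, because indexing a `datasets.Dataset` with an integer returns a dictionary.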

## 2.3. Preprocess and Tokenize
We need to tokenize the text data and format it into input/output pairs that the model can learn from.
The exact preprocessing will depend on the dataset structure and the task. 
For Gemma, we typically need to format the input as a chat/instruction.

**Note:** This is a generic preprocessing step. You'll likely need to adapt this based on the specific columns and structure of the `nvidia/OpenCodeReasoning` dataset.
You might need to combine columns or reformat them to create a clear prompt and response structure.

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer for the Gemma checkpoint you intend to fine-tune
model_id = "google/gemma-1.1-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Gemma tokenizers normally ship with a pad token. If one is missing, reuse EOS
# (as Section 3 does) rather than adding a new '[PAD]' token: a new token would
# require resizing the model's token embeddings before training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Define a preprocessing function
def preprocess_function(examples):
    # This is a placeholder. You MUST adapt this to your dataset.
    # For example, if your dataset has 'prompt' and 'completion' columns:
    # inputs = [prompt for prompt in examples['prompt']]
    # outputs = [completion for completion in examples['completion']]
    # text = [f"Prompt: {inp} \nResponse: {out}{tokenizer.eos_token}" for inp, out in zip(inputs, outputs)]
    
    # Example: Assuming 'description' is the input and 'code' is the output for OpenCodeReasoning
    # This is a guess and needs to be verified with the dataset's actual structure.
    # You might need to format it like an instruction or a question.
    # For instance: "Translate the following description to code: [DESCRIPTION] \n\n [CODE]"

    # Let's assume the dataset has a 'prompt' and 'solution' field for now.
    # This needs to be verified and adapted based on nvidia/OpenCodeReasoning structure.
    if 'prompt' in examples and 'solution' in examples:
        text = [f"Instruction: {p}\nOutput: {s}{tokenizer.eos_token}" for p, s in zip(examples['prompt'], examples['solution'])]
    elif 'description' in examples and 'code_string' in examples: # Another guess based on common code dataset structures
        text = [f"Generate code for the following description:\n{d}\n\nCode:\n{c}{tokenizer.eos_token}" for d, c in zip(examples['description'], examples['code_string'])]
    else:
        # Fallback: just take the first text column if the above are not found.
        # THIS WILL LIKELY NOT WORK WELL AND NEEDS ADJUSTMENT.
        first_text_column = [key for key, value in examples.items() if isinstance(examples[key][0], str)][0]
        print(f"Warning: Using generic column '{first_text_column}' for tokenization. Please adapt the preprocess_function for your dataset structure.")
        text = [t + tokenizer.eos_token for t in examples[first_text_column]]

    return tokenizer(text, truncation=True, padding="max_length", max_length=512)


# Apply the preprocessing function
# This might take a while
# Consider using a subset for faster iteration initially: small_dataset = dataset.select(range(1000))
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

print("Example of tokenized data:")
print(tokenized_dataset[0])

# 3. Load Gemma Model and Tokenizer
We'll load the Gemma model (e.g., `google/gemma-1.1-2b-it`) and its tokenizer. We'll also configure it for 4-bit quantization to save memory, which is crucial for running larger models in environments like Colab.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model ID for Gemma. It must match the tokenizer used in Section 2.3.
model_id = "google/gemma-1.1-2b-it"

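# Configure BitsAndBytes for 4-bit quantization.
# bfloat16 compute needs an Ampere-or-newer GPU (compute capability >= 8.0).
# A Colab T4 is older than that, so fall back to float16 there.
supports_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if supports_bf16 else torch.float16,
)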

# Load the model with quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", # Automatically map model parts to available devices (CPU/GPU)
    # token="YOUR_HF_TOKEN" # Add this if you encounter auth issues for gated models like some Gemma versions
)

# Reload the tokenizer so this section can run on its own.
# It must be the same tokenizer that was used for preprocessing in Section 2.3.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# If the tokenizer has no pad token, fall back to EOS and mirror that choice in the
# model config so that generation pads consistently.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Resizing the token embeddings is only needed if a new token (such as '[PAD]') was
# added to the tokenizer. Reusing an existing token such as EOS does not change the
# vocabulary size, so no resize is required in that case.
# if tokenizer.pad_token == '[PAD]': # Only if a new '[PAD]' token was added
#    model.resize_token_embeddings(len(tokenizer))



print(f"Model '{model_id}' loaded with 4-bit quantization.")
print(f"Tokenizer '{model_id}' loaded.")
if tokenizer.pad_token:
    print(f"Pad token is set to: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
else:
    print("Warning: Pad token is not set.")

## 3.1. Verify Tokenizer and Model Alignment (Important for Gemma)
Gemma models, especially instruction-tuned ones, have specific ways they expect prompts to be formatted, often using special tokens like `<start_of_turn>` and `<end_of_turn>`. 
The tokenizer handles adding these for chat templates, but when constructing strings manually (as in the `preprocess_function`), you need to be mindful.
The `SFTTrainer` often expects a 'text' field in the dataset that contains the fully formatted prompt including any special tokens.

The previous preprocessing step should create a single string field (e.g., `text`) in your dataset.
Ensure that this string is formatted correctly for Gemma.
For `gemma-1.1-2b-it`, a typical instruction format is the following, where `\n` marks a line break:
`<start_of_turn>user\n{your_instruction_or_prompt}<end_of_turn>\n<start_of_turn>model\n{your_expected_response}<end_of_turn>`

The `preprocess_function` needs to be adapted to create this structure using the columns from `nvidia/OpenCodeReasoning`.
The current placeholder in `preprocess_function` (e.g., `f"Instruction: {p}\nOutput: {s}{tokenizer.eos_token}"`) is a simplification and should be updated to match Gemma's required format.
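
As a minimal sketch of what such a formatter could look like, the helpers below build the single-turn template described above from a batch of prompt/response columns. The function names (`format_gemma_example`, `format_batch`) and the toy column names (`question`, `answer`) are hypothetical; swap in the real column names once you have inspected the dataset. The `<bos>` token is left out on purpose, on the assumption that the tokenizer prepends it during tokenization.

```python
def format_gemma_example(prompt, response=None):
    """Build a single-turn Gemma-style string; omit `response` to get an inference prompt."""
    text = f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    if response is not None:
        text += f"{response}<end_of_turn>\n"
    return text


def format_batch(batch, prompt_col, response_col):
    """Map a batch (dict of column lists) to {"text": [...]}, as a batched map function would."""
    pairs = zip(batch[prompt_col], batch[response_col])
    return {"text": [format_gemma_example(p, r) for p, r in pairs]}


# Toy batch with hypothetical column names
toy_batch = {
    "question": ["Add two numbers."],
    "answer": ["def add(a, b):\n    return a + b"],
}
print(format_batch(toy_batch, "question", "answer")["text"][0])
```

Using one function for both training strings and inference prompts keeps the two formats from drifting apart, which is the main failure mode the notes above warn about.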

# 4. Configure Fine-Tuning
We'll set up the training arguments and LoRA (Low-Rank Adaptation) configuration for efficient fine-tuning.

## 4.1. LoRA Configuration (PEFT)
PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA allow us to fine-tune large models by training only a small number of extra parameters, significantly reducing computational and memory costs.
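
To make "a small number of extra parameters" concrete: for a linear layer with a `d_out x d_in` weight matrix, LoRA trains two low-rank factors of shapes `r x d_in` and `d_out x r`, so it adds `r * (d_in + d_out)` parameters. The sketch below works through that arithmetic. The layer size of 2048 is purely illustrative and is not read from the Gemma configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


d_in = d_out = 2048      # illustrative layer size (assumption)
r, lora_alpha = 16, 32   # same rank and alpha as the configuration in this section

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} weights; LoRA adds {added:,} ({added / full:.2%})")
print(f"LoRA scaling factor alpha / r = {lora_alpha / r}")
```

Doubling `r` doubles the added parameters, while the `alpha / r` ratio controls how strongly the low-rank update is scaled when it is added to the frozen weights.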

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Before applying PEFT, if you're using a quantized model (like with BitsAndBytesConfig),
# it's often recommended to prepare it for k-bit training.
# This function can help handle some of the intricacies.
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the LoRA matrices. Higher rank means more parameters, potentially better performance but more memory.
    lora_alpha=32,  # Alpha scaling factor. alpha/r controls the scaling of LoRA weights.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA to. These are typical for Gemma.
    lora_dropout=0.05,  # Dropout probability for LoRA layers.
    bias="none",  # Whether to train bias parameters. 'none' is common.
    task_type="CAUSAL_LM", # Task type, Causal Language Modeling for Gemma.
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print a summary of trainable parameters
model.print_trainable_parameters()

## 4.2. Training Arguments
These arguments control various aspects of the training process, such as learning rate, batch size, number of epochs, etc.
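
Several of these settings interact, so it is worth working out the resulting schedule before starting a long run. The sketch below estimates the effective batch size, the number of optimizer steps, and the number of warmup steps. The helper `training_schedule` is hypothetical and the numbers are approximate, since the trainer's own rounding may differ slightly.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      num_epochs=1, num_devices=1, warmup_ratio=0.0):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Example: a 1,000-example subset with a per-device batch of 4 and 2 accumulation steps
print(training_schedule(num_examples=1000, per_device_batch=4, grad_accum=2, warmup_ratio=0.03))
```

If memory forces a smaller per-device batch, raising the accumulation steps by the same factor keeps the effective batch size, and therefore the step count, unchanged.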

In [None]:
import torch
from transformers import TrainingArguments

# Define output directory for saving checkpoints and final model
output_dir = "./gemma_opencodereasoning_finetuned"

# Mixed precision: bf16 needs an Ampere-or-newer GPU (compute capability >= 8.0).
# On older GPUs such as the Colab T4, fall back to fp16.
has_cuda = torch.cuda.is_available()
use_bf16 = has_cuda and torch.cuda.get_device_capability()[0] >= 8
use_fp16 = has_cuda and not use_bf16

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1, # Start with 1 epoch, can be increased. For large datasets, even a fraction of an epoch might be enough.
    per_device_train_batch_size=4, # Adjust based on your GPU memory. 4 is a common starting point for 4-bit models.
    gradient_accumulation_steps=2, # Accumulate gradients over multiple steps to simulate a larger batch size.
    optim="paged_adamw_8bit", # Optimizer that works well with quantized models.
    save_steps=100, # Save a checkpoint every N steps.
    logging_steps=10, # Log training metrics every N steps.
    learning_rate=2e-4, # A common learning rate for LoRA fine-tuning.
    weight_decay=0.001, # Weight decay for regularization.
    fp16=use_fp16, # Used on GPUs without bf16 support (e.g., T4).
    bf16=use_bf16, # Used on Ampere and newer GPUs.
    max_grad_norm=0.3, # Gradient clipping to prevent exploding gradients.
    max_steps=-1, # If set to a positive number, overrides num_train_epochs.
    warmup_ratio=0.03, # Ratio of total training steps for learning rate warmup.
    group_by_length=True, # Group sequences of similar lengths together. Has little effect while preprocessing pads every example to max_length.
    lr_scheduler_type="constant_with_warmup", # Learning rate scheduler.
    report_to="tensorboard", # Or "wandb", "none", etc.
    # Note: sequence packing is an SFTTrainer option (see Section 5), not a TrainingArguments field.
)

# 5. Run Fine-Tuning
Now we initialize the `SFTTrainer` from TRL (Transformer Reinforcement Learning) library, which is designed for supervised fine-tuning of language models, and start the training.

In [None]:
from trl import SFTTrainer

# `tokenized_dataset` is already tokenized: each item holds `input_ids` and `attention_mask`,
# so no text field needs to be specified.
# The tokenizer does not produce `labels`. For causal LM training, the data collator derives
# them from `input_ids` when it builds each batch.
#
# Note: the keyword arguments below (`tokenizer`, `dataset_text_field`, `max_seq_length`) match
# older TRL releases. Newer releases move the dataset options into `SFTConfig` and rename
# `tokenizer` to `processing_class`, so check the documentation for your installed version.

trainer = SFTTrainer(
    model=model,                             # Already wrapped with LoRA by get_peft_model in Section 4.1,
                                             # so `peft_config` is not passed again here.
    train_dataset=tokenized_dataset,         # The tokenized training dataset
    # eval_dataset=tokenized_eval_dataset,   # Optionally, pass an evaluation dataset
    dataset_text_field=None,                 # Only needed when passing raw text for the trainer to tokenize.
    tokenizer=tokenizer,                     # The tokenizer
    args=training_args,                      # The training arguments
    max_seq_length=512,                      # Keep in sync with `max_length` in `preprocess_function`.
    # packing=True,                          # Packs several short examples into one sequence to speed up training.
                                             # Our current `preprocess_function` does not prepare data for packing.
)

# Start training
print("Starting training...")
trainer.train()
print("Training finished.")

## 5.1. (Important) Adapting `preprocess_function` for `SFTTrainer`
The `SFTTrainer` works best if the dataset contains a single text field that represents the full conversation turn or instruction-response pair, already formatted with any special tokens (like Gemma's `<start_of_turn>`, `<end_of_turn>`).

Our current `preprocess_function` creates a tokenized output directly.
```python
# Recall from section 2.3:
# def preprocess_function(examples):
#     # ...
#     # text = [f"Instruction: {p}\nOutput: {s}{tokenizer.eos_token}" for p, s in zip(examples['prompt'], examples['solution'])]
#     # ...
#     return tokenizer(text, truncation=True, padding="max_length", max_length=512)
#
# tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)
```
This approach is generally fine because `SFTTrainer` can handle pre-tokenized datasets. Passing `remove_columns=dataset.column_names` drops the raw columns so that only `input_ids` and `attention_mask` reach the trainer. The tokenizer does not create `labels`; the trainer's data collator derives them from `input_ids` when it builds each batch.

**Alternative for `SFTTrainer` (if you prefer it to handle tokenization):**
You could modify `preprocess_function` to output a new column (e.g., `formatted_text`) and then specify `dataset_text_field="formatted_text"` in `SFTTrainer`.

Example modification:
```python
# def preprocess_for_sft(examples):
#     # Adapt this to the nvidia/OpenCodeReasoning columns and Gemma's format,
#     # e.g., using 'prompt' and 'solution'.
#     texts = []
#     for p, s in zip(examples['prompt'], examples['solution']):
#         # This formatting needs to be Gemma specific:
#         #   <start_of_turn>user
#         #   {PROMPT}<end_of_turn>
#         #   <start_of_turn>model
#         #   {SOLUTION}<end_of_turn>
#         # tokenizer.apply_chat_template might be useful if your data fits a chat structure.
#         # For now, a simplified example:
#         formatted_prompt = f"<start_of_turn>user\n{p}<end_of_turn>\n<start_of_turn>model\n{s}{tokenizer.eos_token}"
#         texts.append(formatted_prompt)
#     return {"text": texts}  # SFTTrainer will look for this 'text' field
#
# formatted_dataset = dataset.map(preprocess_for_sft, batched=True, remove_columns=dataset.column_names)
# trainer = SFTTrainer(..., train_dataset=formatted_dataset, dataset_text_field="text", tokenizer=tokenizer, ...)
```
For now, we stick with passing the already tokenized dataset to `SFTTrainer`, as implemented.
Note that `tokenized_dataset` contains only `input_ids` and `attention_mask`: `AutoTokenizer` does not infer labels. For causal LM training, the data collator copies `input_ids` into `labels` and masks the padding positions at batch time.
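
To illustrate how labels for causal LM training are typically derived from a tokenized example, the sketch below copies `input_ids` and replaces padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. It is an illustration of the idea, not the collator's actual code. The helper `make_causal_lm_labels` is hypothetical, the token IDs are made up, and masking by `attention_mask` is a choice made here for clarity.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


# Made-up token IDs; the last two positions are padding
example = {"input_ids": [2, 1841, 603, 1, 0, 0], "attention_mask": [1, 1, 1, 1, 0, 0]}
example["labels"] = make_causal_lm_labels(example["input_ids"], example["attention_mask"])
print(example["labels"])  # [2, 1841, 603, 1, -100, -100]
```

The one-position shift between inputs and targets is not applied here, on the assumption that the model's forward pass handles it internally, as Hugging Face causal LM heads do.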

# 6. Save the Fine-Tuned Model
After training, we need to save the fine-tuned model. With PEFT (LoRA), we are primarily saving the adapter weights.

In [None]:
# Define a path to save the LoRA adapter
adapter_output_dir = f"{output_dir}/final_adapter" 

# Save the LoRA adapter
trainer.save_model(adapter_output_dir)
print(f"LoRA adapter saved to: {adapter_output_dir}")

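# Optionally, save the full model with the LoRA weights merged in.
# This requires more disk space and memory, so make sure you have enough resources before uncommenting.
# Note: the base model here was loaded in 4-bit. Merging into quantized weights can lose precision,
# so a common alternative is to reload the base model in 16-bit, attach the adapter, and merge that.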

# print("Merging adapter weights with the base model...")
# merged_model = model.merge_and_unload() # Merges LoRA weights and unloads PEFT model, returning the base model with merged weights.
# print("Adapter weights merged.")

# Define a path to save the full merged model
# merged_model_output_dir = f"{output_dir}/final_merged_model"
# merged_model.save_pretrained(merged_model_output_dir)
# tokenizer.save_pretrained(merged_model_output_dir)
# print(f"Full merged model saved to: {merged_model_output_dir}")

# For most use cases with LoRA, deploying the base model and loading the adapter separately is common.
# However, merging can be useful for some deployment scenarios.

# 7. Basic Inference with Fine-Tuned Model
Let's test the fine-tuned model with some sample prompts. We'll load the base model and then apply the saved LoRA adapter weights.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Ensure CUDA is available if you trained on GPU and want to infer on GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Option 1: Load the base model and then apply the LoRA adapter ---
# This is the most common way when you've saved only the adapter.

# Load the base model (the same one used for training)
base_model_id = "google/gemma-1.1-2b-it" # Should be the same model_id used earlier

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token # Ensure pad token is set for generation

# Load the base model in 4-bit (or your original configuration)
# If you used 4-bit for training, use it for inference too for consistency,
# unless you specifically want to test a different precision.
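# bfloat16 compute needs an Ampere-or-newer GPU; fall back to float16 on older GPUs such as the T4.
supports_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
bnb_config_inference = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if supports_bf16 else torch.float16,
)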

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config_inference,
    device_map="auto", # Automatically map model to available device(s)
)
if tokenizer.pad_token == tokenizer.eos_token: # Make sure model config reflects this
    base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the LoRA adapter
# adapter_output_dir was defined in the saving section (e.g., "./gemma_opencodereasoning_finetuned/final_adapter")
ft_model = PeftModel.from_pretrained(base_model, adapter_output_dir)
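# No explicit .to(device) here: device_map="auto" has already placed the model, and moving a
# bitsandbytes-quantized model with .to() is unnecessary and can raise an error in some library versions.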
ft_model.eval() # Set the model to evaluation mode

print(f"Fine-tuned model (base + adapter) loaded from {adapter_output_dir} and ready for inference on {device}.")

# --- Option 2: If you saved a merged model ---
# If you uncommented and ran the `merged_model.save_pretrained(...)` part earlier:
# merged_model_path = f"{output_dir}/final_merged_model"
# tokenizer = AutoTokenizer.from_pretrained(merged_model_path)
# ft_model = AutoModelForCausalLM.from_pretrained(merged_model_path, device_map="auto")
# ft_model.eval()
# print(f"Full fine-tuned model loaded from {merged_model_path} and ready for inference.")

## 7.1. Generate Text
Now, let's create a prompt. **This prompt should be formatted in the same way as your training data examples.**
For Gemma instruction-tuned models, this usually means a specific chat-like template.

For example, if your training data was formatted as follows (with `\n` marking a line break):
`<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n{response}<end_of_turn>`

Your prompt for inference should only contain the user part, up to and including `<start_of_turn>model\n`.

In [None]:
# Example prompt - **ADAPT THIS TO YOUR TASK AND THE ACTUAL STRUCTURE OF nvidia/OpenCodeReasoning**
# This prompt needs to match the format your model was fine-tuned on.
# The `preprocess_function` and Gemma's chat template (e.g. <start_of_turn>user...) are key here.

# Let's assume your fine-tuning data looked like:
#   "Instruction: [some instruction/description from nvidia/OpenCodeReasoning]
#    Output: [corresponding code/solution]"
# For inference, you provide the "Instruction" part and let the model generate the "Output".

# Example based on a hypothetical structure of OpenCodeReasoning:
# This is a placeholder prompt. You need to replace it with a relevant prompt
# based on the `nvidia/OpenCodeReasoning` dataset's expected input format.
# For Gemma 1.1 IT, the prompt should follow its chat template.

user_prompt_content = "Write a Python function that calculates the factorial of a number." # Replace with actual prompt from dataset or a new one

# Format the prompt according to Gemma's required template
# This is crucial for instruction-tuned models.
# The tokenizer's chat template can be helpful if your input fits a multi-turn chat.
# For single-turn instruction following, a common format is:
prompt_for_model = f"<start_of_turn>user\n{user_prompt_content}<end_of_turn>\n<start_of_turn>model\n"

print(f"Formatted prompt:\n{prompt_for_model}")

# Tokenize the prompt
inputs = tokenizer(prompt_for_model, return_tensors="pt", padding=True, truncation=True).to(device)

# Generate output
# Adjust generation parameters as needed (max_new_tokens, temperature, top_p, etc.)
print("\nGenerating response...")
with torch.no_grad(): # Ensure no gradients are calculated during inference
    outputs = ft_model.generate(
        **inputs,
        max_new_tokens=256,  # Adjust as needed
        do_sample=True,      # Whether to use sampling; set to False for greedy decoding
        temperature=0.7,     # Controls randomness. Lower is more deterministic. Only used if do_sample=True.
        top_p=0.9,           # Nucleus sampling. Only used if do_sample=True.
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id # Important for generation
    )

# Decode the generated tokens
# The output includes the prompt, so we decode and then can choose to print only the generated part.
full_response = tokenizer.decode(outputs[0], skip_special_tokens=False) # Set skip_special_tokens=True if you don't want to see them

# Extract only the newly generated text (after the prompt)
# This simple split assumes the prompt_for_model is exactly at the beginning of the response.
# More robust parsing might be needed if the model adds tokens before echoing the prompt.
generated_text = full_response.split(prompt_for_model)[-1] if prompt_for_model in full_response else full_response

print("\n--- Full Response (including prompt) ---")
print(full_response)
print("\n--- Generated Text Only ---")
# A common way to clean up the response is to stop at the next <end_of_turn> or eos_token if they appear in the generation
if tokenizer.eos_token in generated_text:
    generated_text = generated_text.split(tokenizer.eos_token)[0]
if "<end_of_turn>" in generated_text: # Specific to Gemma's formatting
    generated_text = generated_text.split("<end_of_turn>")[0]

print(generated_text.strip())
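
Splitting the decoded string on the prompt works, but it depends on the decoded text reproducing the prompt exactly. A slightly more robust pattern is to drop the prompt tokens by position before decoding and then cut the text at the first stop marker. The sketch below shows both steps on toy values. The helper names are hypothetical, and the token IDs and text are made up.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token IDs generated after the prompt."""
    return output_ids[prompt_length:]


def truncate_at_stop(text, stop_markers=("<end_of_turn>", "<eos>")):
    """Cut text at the earliest stop marker, if any, and trim surrounding whitespace."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Toy values: 4 prompt tokens followed by 2 generated tokens
print(strip_prompt_tokens([2, 10, 11, 12, 50, 51], prompt_length=4))
print(truncate_at_stop("def f(n):\n    return n<end_of_turn>\n<eos>"))
```

Applied to the cell above, the slicing step would correspond to something like `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode` before truncation.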

# 8. Conclusion and Next Steps

This notebook walked through fine-tuning a Gemma model for code-related tasks using the `nvidia/OpenCodeReasoning` dataset. The preprocessing and prompt format are placeholders that you still need to adapt to the dataset's actual structure.

**Key Takeaways:**
*   **Data is King:** The success of fine-tuning heavily depends on the quality of your dataset and how well you preprocess it to match the model's expected input format. The `preprocess_function` is where you'll spend significant time adapting to `nvidia/OpenCodeReasoning`.
*   **Prompt Formatting:** For instruction/chat models like Gemma, adhering to their specific prompt template (e.g., with `<start_of_turn>`, `<end_of_turn>`) is essential for good performance during both fine-tuning and inference.
*   **Efficient Fine-Tuning:** Techniques like LoRA and 4-bit quantization make it possible to fine-tune large models on consumer-grade or free-tier cloud GPUs.

**Potential Next Steps:**
*   **Thoroughly Adapt `preprocess_function`:** Dive deep into the `nvidia/OpenCodeReasoning` dataset's structure. Identify the correct fields for your input (e.g., problem description, context) and target (e.g., code solution, explanation). Modify the `preprocess_function` in section 2.3 to correctly transform these fields into the Gemma instruction format.
*   **Experiment with Hyperparameters:** Adjust learning rate, batch size, number of epochs, LoRA rank (`r`), and alpha (`lora_alpha`) to optimize performance.
*   **Evaluation:** Implement a proper evaluation strategy using a held-out test set and relevant metrics for code generation/reasoning (e.g., BLEU, CodeBLEU, execution accuracy, pass@k). The `SFTTrainer` can take an `eval_dataset`.
*   **Advanced Prompting:** Explore more sophisticated prompting techniques if the initial results are not satisfactory.
*   **Larger Models:** If resources allow, experiment with larger Gemma variants (e.g., `gemma-1.1-7b-it`) for potentially better performance, adjusting quantization and batch sizes accordingly.
*   **Packing:** For faster training, especially with sequences of varying lengths, explore sequence packing by setting `packing=True` in `SFTTrainer` and ensuring your dataset is formatted correctly for it.
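
For the evaluation step mentioned in the list above, pass@k is often computed with a widely used unbiased estimator: generate `n` samples per problem, count the `c` samples that pass the tests, and estimate the chance that at least one of `k` drawn samples passes as `1 - C(n - c, k) / C(n, k)`. The sketch below implements that formula for a single problem. Running the generated code against tests is assumed to happen elsewhere.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for a problem, 3 of which pass the tests
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=10, c=3, k=k):.3f}")
```

The benchmark-level score is then the mean of this value over all problems.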

Good luck with your fine-tuning project!