# Fine-Tuning Gemma 2 for the Italian Language

**This notebook demonstrates fine-tuning Google’s Gemma 2 model for the Italian language using Keras NLP. It includes detailed steps for dataset creation, training, evaluation, and publishing the model.**


**Overview**                                                                                   
In this notebook, we will:

1. Load and process Italian text data from the C4 dataset.                                                                            
2. Configure and fine-tune the Gemma2 language model for Italian.                                      
3. Evaluate the model’s performance before and after fine-tuning.                                      
4. Publish the fine-tuned model to Kaggle Models for further use.                                      

#### Device:
1x Nvidia P100
#### Base model:
Gemma2 2b base

**First, install necessary libraries including keras-nlp, datasets, keras_hub, kagglehub**

In [None]:
!pip install -q -U keras-nlp keras datasets kagglehub keras_hub 

Next, we set up environment variables for Kaggle authentication and configure the backend for optimal memory allocation.

In [None]:
import jax
jax.devices()

In [None]:
import os
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()


os.environ["KAGGLE_USERNAME"] = user_secrets.get_secret("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = user_secrets.get_secret("KAGGLE_KEY")
os.environ["KERAS_BACKEND"] = "tensorflow"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

**Importing Libraries**                                                             
Now, import TensorFlow, Keras NLP, and other libraries required for model loading and dataset handling.

In [None]:
import tensorflow as tf
import keras
import keras_nlp
from datasets import load_dataset
import itertools

In [None]:
# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Dataset

To adapt the Gemma 2 2B model to Italian, we need a sizeable corpus of high-quality Italian text. For that, we use the Italian split of the C4 dataset, a large multilingual text dataset.

You can look into it on Hugging Face: [Link](https://huggingface.co/datasets/allenai/c4)  

**Dataset Summary (from the original dataset page):**  
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on the Common Crawl dataset: [https://commoncrawl.org](https://commoncrawl.org).

This is the processed version of Google's C4 dataset.

**Note**                                                                                            
Because this is a very large dataset, we pass `streaming=True` to avoid memory problems.
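
To illustrate why streaming keeps memory use low, here is a minimal sketch that uses a stand-in generator rather than the real dataset. `fake_stream` and the `produced` counter are hypothetical names invented for this example; the point is only that `itertools.islice` pulls the requested number of records from a lazy iterator and never generates the rest.

```python
import itertools

produced = 0  # counts how many records the stand-in stream has generated


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    global produced
    i = 0
    while True:  # conceptually "too large to hold in memory"
        produced += 1
        yield {"text": f"documento {i}"}
        i += 1


# Take only the first three records; the generator is never run past them.
subset = [example["text"] for example in itertools.islice(fake_stream(), 3)]
print(subset)
print("records generated:", produced)
```

The same pattern is used later in this notebook to take a fixed number of training and validation examples from the streamed splits.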


In [None]:
italian_data = load_dataset("allenai/c4", "it", streaming=True)

Each record has three fields: `url`, `text`, and `timestamp`. An example record (this one happens to be in English):
```json
{
  "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/",
  "text": "Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity; put this on your calendar now. On Thursday, September 22nd, join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner-level class for everyone who wants to improve their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators, it is free. Included in the cost will be either a t-shirt or apron, and you will be tasting samples of each meat that is prepared.",
  "timestamp": "2019-04-25T12:57:54Z"
}
```

**Here we take a look inside the dataset and print some examples.**

In [None]:
sample_data = []
for i, example in enumerate(iter(italian_data["train"])):
    if i >= 2:  # Change this number to get more examples
        break
    sample_data.append(example["text"])

print("Sample Italian Data:")

for i, text in enumerate(sample_data):
    print(f"Example {i + 1}:", text[:50])  # Print the first 50 characters to get a preview

**Now** it's time to prepare the dataset for the model. We need to convert the dataset into a TensorFlow dataset, and we will use a fraction of the original dataset to save memory and time. (If you have better hardware available, you are welcome to try with a larger number of examples.)


In [None]:
# Define the maximum number of examples for training and validation
max_train_examples = 3000
max_val_examples = 100

# Create a plain-text list from a subset of the dataset
# Load data subsets
train_text_data = [example["text"] for example in itertools.islice(italian_data["train"], max_train_examples)]
val_text_data = [example["text"] for example in itertools.islice(italian_data["validation"], max_val_examples)]

# Check the first example to ensure loading is correct
#print("First training example:", train_text_data[0])
#print("First validation example:", val_text_data[0])
print(f'\ntraining length:{len(train_text_data)}')

In [None]:
batch_size = 1

# Convert the lists of text data to TensorFlow datasets
train_data = tf.data.Dataset.from_tensor_slices(train_text_data)
val_data = tf.data.Dataset.from_tensor_slices(val_text_data)

# Preprocess each text sample
def preprocess_text(text):
    return tf.convert_to_tensor(text, dtype=tf.string)

# Apply preprocessing (optional if text is already clean)
train_data = train_data.map(preprocess_text)
val_data = val_data.map(preprocess_text)

# Shuffle and batch the training data
train_data = train_data.shuffle(buffer_size=1000).batch(batch_size)
val_data = val_data.batch(batch_size)
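
One detail worth keeping in mind: `shuffle(buffer_size=1000)` does not shuffle the whole dataset uniformly when the dataset is larger than the buffer. The sketch below is a simplified pure-Python model of buffer-based shuffling, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical helper written for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of shuffling through a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random element from the buffer
        buffer[idx] = item       # and replace it with the next input
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3))
print(shuffled)
```

With a buffer of 3, the first emitted element can only come from the first three inputs. By the same logic, with 3,000 examples and a buffer of 1,000, early batches are drawn only from the beginning of the data; a buffer at least as large as the dataset gives a full shuffle.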

# Model

**Now we load the Gemma 2 model. For this notebook, we use the 2b version since we are working with limited hardware.**


In [None]:
model_id = "gemma2_2b_en"

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(model_id)
    
gemma_lm.summary()

**Testing the Model:**
We can test the model by passing it an input to compare its responses before and after fine-tuning.


In [None]:
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

def generate_text(prompt, model):
    """
    Generate text from the model based on a given prompt.
    """
    sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
    model.compile(sampler=sampler)
    output = model.generate(prompt, max_length=512)
    return output

In [None]:
# Sample prompt to check performance before and after fine-tuning
test_prompts = [
    "Ciao! Come stai oggi? Raccontami qualcosa di interessante che hai imparato di recente.",
    "Che cosa sai della storia del Rinascimento in Italia? Puoi spiegare il suo impatto sull'arte e sulla scienza?",
    "Scrivi una breve poesia in italiano su un paesaggio autunnale.",
    "Spiegare, in termini semplici, come funziona l'intelligenza artificiale e quali sono i suoi utilizzi più comuni in Italia.",
    "Se qualcuno dicesse: 'Hai fatto il passo più lungo della gamba', cosa significherebbe? In quale situazione potrebbe essere usata questa espressione?",
]

for prompt in test_prompts:
    print(f"\n--- Model Output Before Fine-Tuning for prompt: {prompt} ---")
    print(generate_text(template.format(instruction=prompt, response=""), gemma_lm))
    print("\n")
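
As a quick check of what the model actually receives, the sketch below formats one of the prompts with the same `template` string. Leaving `response` empty means the text ends right after `Response:`, so the model continues from that point.

```python
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

prompt = template.format(
    instruction="Scrivi una breve poesia in italiano su un paesaggio autunnale.",
    response="",  # left empty so the model fills in the answer
)
print(prompt)
```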

# LoRA

This is a large model with more than 2 billion trainable parameters. Full fine-tuning is very computationally expensive and time-consuming, so we choose the next best thing: the LoRA method.

## What is LoRA?  
LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models (LLMs) such as Gemma 2 2B. It freezes the pre-trained weights and adds small trainable rank-decomposition matrices to the attention layers, so only those matrices are updated during training.
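
The idea can be sketched in a few lines of plain Python. This is a toy illustration, not the Keras implementation: the sizes are arbitrary, and the `matmul` and `add` helpers are written here only for the example. A frozen weight `W` is combined with the product of two small trainable factors `A` and `B`; because `B` starts at zero, the effective weight initially equals `W`.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


random.seed(0)
d_in, d_out, rank = 6, 4, 2  # toy sizes, chosen only for illustration

# Frozen pre-trained weight (d_in x d_out); not updated during LoRA training.
W = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]

# Trainable low-rank factors: A is (d_in x rank), B is (rank x d_out).
A = [[random.uniform(-1, 1) for _ in range(rank)] for _ in range(d_in)]
B = [[0.0] * d_out for _ in range(rank)]  # zero init: the update starts at zero

W_effective = add(W, matmul(A, B))

full_params = d_in * d_out
lora_params = d_in * rank + rank * d_out
print("full:", full_params, "lora:", lora_params)
```

At these toy sizes the saving is small, but the LoRA count grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, which is where the large reduction on real models comes from.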


In [None]:
LoRA_rank = 2 # you can modify this 
# Enable LoRA for the model and set the LoRA rank to 2,4,...
gemma_lm.backbone.enable_lora(rank=LoRA_rank)
gemma_lm.summary()

*Using LoRA reduces the number of trainable parameters from 2,614,341,888 to 1,464,320!*
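
The trainable-parameter count above can be reproduced with some arithmetic, under assumptions that this notebook does not state. The sketch assumes Gemma 2 2B has 26 decoder layers, a hidden size of 2304, 8 query heads, 4 key/value heads, and a head size of 256; it also assumes LoRA is attached only to the query and value projections, with factors shaped `(heads, hidden, rank)` and `(rank, head_dim)`. Treat all of these as assumptions to verify against `gemma_lm.summary()`.

```python
# ASSUMED architecture values (not stated in this notebook).
num_layers = 26
hidden_dim = 2304
num_query_heads = 8
num_kv_heads = 4
head_dim = 256
rank = 2  # matches LoRA_rank in the cell above


def lora_params(num_heads):
    # ASSUMED factor shapes: (num_heads, hidden_dim, rank) and (rank, head_dim).
    return num_heads * hidden_dim * rank + rank * head_dim


# ASSUMPTION: LoRA is applied to the query and value projections only.
per_layer = lora_params(num_query_heads) + lora_params(num_kv_heads)
total = per_layer * num_layers
print(f"{total:,}")
```

Under these assumptions the total comes to 1,464,320, which agrees with the figure reported above. Doubling the rank roughly doubles this number.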

**Now** let's prepare the model for fine-tuning. The setup below is adapted from [here](https://ai.google.dev/gemma/docs/lora_tuning).

In [None]:
import wandb
from wandb.integration.keras import WandbMetricsLogger


In [None]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.05,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

configs = dict(
    shuffle_buffer = 1000,
    batch_size = 1,
    learning_rate = 5e-5,
    weight_decay = 0.05,
    sequence_length = 256,
    epochs = 5
)

wandb.init(project = "fine-tuning-gemma2_2b_it",
    config=configs
)

**Training:**

In [None]:
# Inspect dataset element types
for element in train_data.take(1):
    print(type(element))
    print(element[0].dtype if hasattr(element[0], 'dtype') else "No dtype found")

In [None]:
history = gemma_lm.fit(train_data, validation_data=val_data, epochs=5, callbacks=[WandbMetricsLogger()])
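
For a sense of scale, the number of optimizer steps implied by the settings in this notebook can be worked out directly. The values below are copied from the earlier cells; the token figure is an upper bound, since shorter texts are padded rather than filled.

```python
import math

max_train_examples = 3000  # from the data-loading cell
batch_size = 1
epochs = 5
sequence_length = 256

steps_per_epoch = math.ceil(max_train_examples / batch_size)
total_steps = steps_per_epoch * epochs
max_tokens_per_epoch = steps_per_epoch * batch_size * sequence_length

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("max tokens per epoch:", max_tokens_per_epoch)
```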

**Plotting the loss and accuracy:**

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 6))

# Plotting Loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

# Plotting Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['sparse_categorical_accuracy'], label='Training Accuracy')
plt.plot(history.history['val_sparse_categorical_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()


Now we run the same prompts again. Fine-tuning on Italian text had a positive effect: the model now produces more meaningful responses.

In [None]:
test_prompts = [
    "Ciao! Come stai oggi? Raccontami qualcosa di interessante che hai imparato di recente.",
    "Che cosa sai della storia del Rinascimento in Italia? Puoi spiegare il suo impatto sull'arte e sulla scienza?",
    "Scrivi una breve poesia in italiano su un paesaggio autunnale.",
    "Spiegare, in termini semplici, come funziona l'intelligenza artificiale e quali sono i suoi utilizzi più comuni in Italia.",
    "Se qualcuno dicesse: 'Hai fatto il passo più lungo della gamba', cosa significherebbe? In quale situazione potrebbe essere usata questa espressione?",
]

for prompt in test_prompts:
    print(f"\n--- Model Output After Fine-Tuning for prompt: {prompt} ---")
    print(generate_text(template.format(instruction=prompt, response=""), gemma_lm))
    print("\n")

# Uploading the fine-tuned model to Kaggle

**To upload the model to Kaggle, we first need to save it:**

In [None]:
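# exist_ok=True avoids an error if the cell is re-run and the directory already exists.
os.makedirs('gemma2_2b_it', exist_ok=True)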

In [None]:

preset_dir = "/kaggle/working/gemma2_2b_it"
gemma_lm.save_to_preset(preset_dir)

In [None]:
preset_dir = "gemma2_2b_it"

In [None]:
import gc
gc.collect()

In [None]:
import kagglehub
import keras_hub
if "KAGGLE_USERNAME" not in os.environ or "KAGGLE_KEY" not in os.environ:
    kagglehub.login()

model_version = 1
kaggle_username = kagglehub.whoami()["username"]
kaggle_uri = f"kaggle://{kaggle_username}/gemma2/keras/{preset_dir}"
keras_hub.upload_preset(kaggle_uri, preset_dir)

In [None]:
wandb.finish()

In [None]:
print("Done!")

# Inference

**For inference, we just need to load the fine-tuned model from Kaggle into our notebook as follows:**

For more information, see [here](https://keras.io/api/keras_nlp/models/gemma/gemma_causal_lm/).

Specifically:

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:
1. a built-in preset identifier like 'bert_base_en'
2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
3. a Hugging Face handle like 'hf://user/bert_base_en'
4. a path to a local preset directory like './bert_base_en'
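
The four preset forms can be told apart by their prefix. The helper below is a hypothetical illustration written for this notebook; it is not how the library resolves presets internally, and the example handles are placeholders.

```python
def preset_kind(preset: str) -> str:
    """Hypothetical helper: guess which kind of preset a string refers to."""
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    if preset.startswith(("./", "../", "/")):
        return "local_path"
    return "built_in"


examples = [
    "gemma2_2b_en",                             # built-in identifier used earlier
    "kaggle://user/gemma2/keras/gemma2_2b_it",  # placeholder Kaggle handle
    "hf://user/some_model",                     # placeholder Hugging Face handle
    "./gemma2_2b_it",                           # local preset directory
]
for p in examples:
    print(p, "->", preset_kind(p))
```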

**Inference step by step:**
1. Load the fine-tuned model from Kaggle Models.
2. After the model has loaded successfully, you can use it to generate text in the target language.

Good luck :)

In [None]:
final_model_id = "kaggle://mahdiseddigh/gemma2/keras/gemma2_2b_it"
finetuned_gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(final_model_id)
finetuned_gemma_lm.summary()

**After the model is loaded, you can use it to generate Italian :)**

In [None]:
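# Replace this with your own prompt.
test_prompt = "Ciao! Come stai oggi? Raccontami qualcosa di interessante che hai imparato di recente."
# Generate output with the fine-tuned model
# (uses `template` and `generate_text` defined earlier in this notebook).
print("\n--- Fine-tuned Model Output ---")
print(generate_text(template.format(instruction=test_prompt, response=""), finetuned_gemma_lm))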

**That's it. If you have any suggestions, I would appreciate them.**