# Spanish Language Adaptation of Gemma 2: A Comprehensive Approach

## Introduction
This notebook demonstrates the fine-tuning of Gemma 2 for Spanish language understanding and generation. Spanish, spoken by over 580 million people worldwide, offers significant opportunities for AI development due to its diverse dialects, cultural richness, and global influence.

### Why Spanish?
- Widespread usage across multiple continents
- Rich linguistic structure with regional variations
- High demand for translation, cultural content, and technical documentation
- Underrepresented domains in AI, such as regional dialects and cultural idioms

---


### Select the Runtime for Kaggle

To successfully run this notebook, ensure your Kaggle runtime is configured with sufficient resources to handle the Gemma model. Follow these steps to enable a GPU accelerator:

1. Open the **Settings** panel on the right-hand side of the Kaggle notebook.
2. Locate the **Accelerator** section and select **GPU (NVIDIA T4)** from the dropdown menu.
3. Confirm the settings and restart the notebook to apply the changes.

Configuring the runtime with a GPU will ensure efficient execution of the fine-tuning and inference processes.
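
After the restart, it can be worth confirming that a GPU is actually visible before loading the model. The following is a minimal sketch using only the standard library. It assumes the `nvidia-smi` tool is on the `PATH` whenever a GPU is attached, and `gpu_is_visible` is a hypothetical helper name, not part of Kaggle or Keras.

```python
import shutil
import subprocess


def gpu_is_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    # Assumption: GPU runtimes expose the `nvidia-smi` command-line tool.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_is_visible())
```

If this prints `False`, the accelerator setting most likely did not take effect and the steps above should be repeated.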

---

## Install dependencies
Install Keras, KerasNLP, and other dependencies.

In [4]:
!pip install datasets



In [5]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"


In [6]:
!pip install transformers peft sentencepiece

Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.27.1-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.14.0-py3-none-any.whl (374 kB)
Downloading huggingface_hub-0.27.1-py3-none-any.whl (450 kB)
Installing collected packages: huggingface-hub, peft
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.24.7
    Uninstalling huggingface-hub-0.24.7:
      Successfully uninstalled huggingface-hub-0.24.7
Successfully installed huggingface-hub-0.27.1 peft-0.14.0


## Import packages

In [7]:
import os
import pandas as pd

## Choose a Backend
Keras is a user-friendly deep learning API that works seamlessly with multiple frameworks. With Keras 3, you can build and run workflows using TensorFlow, JAX, or PyTorch as the backend.

In this tutorial, we’ll set up JAX as the backend. The `KERAS_BACKEND` environment variable is read when `keras` is first imported, so the Keras imports come after the backend configuration.


In [8]:
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

# Import Keras only after the backend has been selected.
import keras
import keras_nlp
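
Because Keras reads `KERAS_BACKEND` when it is first imported, the variable only takes effect if it is set before `import keras` runs. The sketch below guards against setting it too late. `select_keras_backend` and its `loaded_modules` parameter are hypothetical names introduced for illustration; the parameter exists only so the check can be exercised without importing Keras.

```python
import os
import sys


def select_keras_backend(name: str = "jax", loaded_modules=None) -> bool:
    """Set KERAS_BACKEND if Keras has not been imported yet.

    Returns True when the variable was set, and False when Keras was
    already imported, in which case the setting would have no effect.
    """
    if name not in {"jax", "tensorflow", "torch"}:
        raise ValueError(f"Unsupported backend: {name!r}")
    # By default, inspect the interpreter's real table of loaded modules.
    modules = sys.modules if loaded_modules is None else loaded_modules
    if "keras" in modules:
        return False
    os.environ["KERAS_BACKEND"] = name
    return True


if select_keras_backend("jax"):
    print("Backend set to:", os.environ["KERAS_BACKEND"])
else:
    print("Keras is already imported; restart the kernel to change the backend.")
```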


## 1. Dataset Creation and Curation

### 1.1 Data Sources
The dataset was curated specifically for the task of translating English to Spanish, focusing on diverse contexts and linguistic nuances. The primary sources include:
- **Manually Created Pairs**: Translation pairs generated to ensure high-quality, accurate translations.
- **Synthetic Data**: Augmented using language models (e.g., GPT) for generating translations to expand the dataset.
- **Public Translation Datasets**: Leveraging publicly available datasets with English-to-Spanish translation examples.

### 1.2 Data Processing Pipeline
- **Formatting**: Converting data into the required instruction-response format for fine-tuning Gemma.

---

To keep training time manageable, this notebook uses only the first 1,000 rows of the dataset; training on the full dataset would take considerably longer.

In [9]:
spanish_dataset_path = '/kaggle/input/spanish-dataset/Spanish_Dataset.csv'
df = pd.read_csv(spanish_dataset_path)[:1000]
df

| Unnamed: 0.1 | Unnamed: 0 | english | spanish |
|---|---|---|---|
| 0 | 0 | Tom released Mary. | Tom soltó a Mary. |
| 1 | 1 | I already saw it. | Ya lo vi. |
| 2 | 2 | This is how I solved the problem. | Así es como resolví el problema. |
| 3 | 3 | You must be worn out after working all day. | Sin duda debes de estar agotado después de tra... |
| 4 | 4 | I'm not your girlfriend. | No soy tu novia. |
| ... | ... | ... | ... |
| 995 | 995 | They finally made peace with the enemy. | Al final hicieron las paces con el enemigo. |
| 996 | 996 | My father often falls asleep while watching TV. | Mi padre suele quedarse dormido viendo la tele... |
| 997 | 997 | How long are you going to stay here? | ¿Cuánto tiempo vas a quedarte aquí? |
| 998 | 998 | I don't think people should make a mountain of... | No creo que la gente deba hacer una montaña de... |


In [10]:
# Instruction-response template, reused later to build inference prompts.
template = "Instruction:\n{instruction}\n\nResponse:\n{response}"

data = []

for _, row in df.iterrows():
    instruction = f"Translate to Spanish: {row['english']}"
    response = row['spanish']

    # Format the English and Spanish phrases using the template
    data.append(template.format(instruction=instruction, response=response))
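
To make the resulting format concrete, here is a small standalone sketch of the same template applied to a single pair. `format_example` is a hypothetical helper name, and leaving `spanish` empty yields the kind of prompt used for inference later in the notebook.

```python
TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"


def format_example(english: str, spanish: str = "") -> str:
    """Build one training string; leave `spanish` empty for an inference prompt."""
    instruction = f"Translate to Spanish: {english}"
    return TEMPLATE.format(instruction=instruction, response=spanish)


# A training example built from the first dataset row shown above.
print(format_example("Tom released Mary.", "Tom soltó a Mary."))

# An inference prompt: the response slot is left open for the model to fill.
print(repr(format_example("I already saw it.")))
```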

## 2. Evaluating the Gemma Base Model

Before proceeding with fine-tuning, we will evaluate the capabilities of the **Gemma 2 2B base model** (`gemma2_2b_en`) in generating Spanish text. This step establishes a baseline for comparison and helps us understand the improvements achieved through fine-tuning.

### Objectives:
- Test the base model's Spanish fluency and coherence with sample prompts.
- Identify areas where the model struggles, such as cultural nuances, grammar, or translation accuracy.

### Evaluation Method:
- Provide a set of Spanish-focused prompts covering conversational, cultural, and technical scenarios.
- Analyze the outputs for fluency, grammar, and contextual relevance.

### Example Prompts:
1. "Translate to Spanish: Hello, how are you?"
2. "What is the weather like in Madrid?"
3. "Explain the cultural significance of Día de los Muertos."
4. "Translate to Spanish: I would like a cup of coffee, please."

This evaluation will serve as a reference point to measure the improvements achieved after fine-tuning the model for Spanish-specific tasks.


In [11]:
gemma_lm_base = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
gemma_lm_base.summary()

In [13]:
prompt1 = template.format(
    instruction="Translate to Spanish: Hello, how are you?",
    response="",
)

print(gemma_lm_base.generate(prompt1, max_length=256))

"Instruction:\nTranslate to Spanish: Hello, how are you?\n\nResponse:\nHola, ¿cómo estás?\n\nInstruction:\nTranslate to Spanish: I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias.\n\nInstruction:\nTranslate to Spanish: How are you?\n\nResponse:\n¿Cómo estás?\n\nInstruction:\nTranslate to Spanish: I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias.\n\nInstruction:\nTranslate to Spanish: How are you?\n\nResponse:\n¿Cómo estás?\n\nInstruction:\nTranslate to Spanish: I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias.\n\nInstruction:\nTranslate to Spanish: How are you?\n\nResponse:\n¿Cómo estás?\n\nInstruction:\nTranslate to Spanish: I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias.\n\nInstruction:\nTranslate to Spanish: How are you?\n\nResponse:\n¿Cómo estás?\n\nInstruction:\nTranslate to Spanish: I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias.\n\nInstruction:\nTranslate to Spanish: How are you?\n\nResponse:\n¿Cómo estás?\n\nInstruction:\nTranslate to Spanish"

In [14]:
prompt2 = template.format(
    # No "Translate to Spanish:" prefix here, matching the recorded output below.
    instruction="What is the weather like in Madrid?",
    response="",
)

print(gemma_lm_base.generate(prompt2, max_length=256))

'Instruction:\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\nWhat is the weather like in Madrid?\n\nResponse:\nIt is very hot.\n\n'

In [15]:
prompt3 = template.format(
    # English instruction, matching the example prompt list and the recorded output below.
    instruction="Explain the cultural significance of Día de los Muertos.",
    response="",
)

print(gemma_lm_base.generate(prompt3, max_length=256))

'Instruction:\nExplain the cultural significance of Día de los Muertos.\n\nResponse:\nThe cultural significance of Día de los Muertos is to celebrate the lives of the deceased. It is a time to remember and honor the lives of loved ones who have passed away. The holiday is celebrated in Mexico and other parts of Latin America, and it is a time for families to come together and share stories and memories of their loved ones. The holiday is also a time for people to decorate their homes and graves with flowers, candles, and other symbols of life.\n\nThe holiday is also a time for people to reflect on the lives of their loved ones and to remember the good times they shared. It is a time for people to come together and share stories and memories of their loved ones. The holiday is also a time for people to decorate their homes and graves with flowers, candles, and other symbols of life.\n\nThe holiday is also a time for people to reflect on the lives of their loved ones and to remember the 

In [16]:
prompt4 = template.format(
    instruction="Translate to Spanish: I would like a cup of coffee, please.",
    response="",
)

print(gemma_lm_base.generate(prompt4, max_length=256))

'Instruction:\nTranslate to Spanish: I would like a cup of coffee, please.\n\nResponse:\nMe gustaría una taza de café, por favor.\n\nExplanation:\nThe phrase "I would like a cup of coffee, please" is a request for a cup of coffee. In Spanish, the phrase is translated as "Me gustaría una taza de café, por favor." The phrase "por favor" is used to express politeness and to indicate that the request is made politely.\n\nIn this case, the phrase "Me gustaría una taza de café, por favor" is used to make a request for a cup of coffee. The phrase "por favor" is used to express politeness and to indicate that the request is made politely.\n\nThe phrase "Me gustaría" means "I would like" and "una taza de café" means "a cup of coffee." The phrase "por favor" means "please" and is used to express politeness.\n\nIn summary, the phrase "Me gustaría una taza de café, por favor" is a polite request for a cup of coffee. The phrase "por favor" is used to express politeness and to indicate that the requ
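
The raw outputs above keep going after the first answer: the base model repeats the instruction-response pattern or adds an explanation until `max_length` is reached. For a side-by-side comparison it can help to keep only the first response. The following is a minimal sketch, assuming the generated text starts with the prompt and that a blank line ends the first answer; `extract_first_response` is a hypothetical helper, not a KerasNLP function.

```python
def extract_first_response(generated: str, prompt: str) -> str:
    """Return only the first answer from a raw generation."""
    marker = "Response:\n"
    if generated.startswith(prompt):
        # The generated text contains the prompt followed by the continuation.
        continuation = generated[len(prompt):]
    else:
        # Fallback: take everything after the first response marker.
        _, _, continuation = generated.partition(marker)
    # A blank line is where the model starts a new block in the outputs above.
    return continuation.split("\n\n", 1)[0].strip()


prompt = "Instruction:\nTranslate to Spanish: Hello, how are you?\n\nResponse:\n"
raw = (
    prompt
    + "Hola, ¿cómo estás?\n\nInstruction:\nTranslate to Spanish: "
    + "I'm fine, thank you.\n\nResponse:\nEstoy bien, gracias."
)
print(extract_first_response(raw, prompt))
```

This heuristic would cut a multi-paragraph answer short, so it suits short translation prompts better than open-ended questions.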

---

## 3. Model Fine-Tuning Approach

### 3.1 Technical Implementation
- **Base model**: Gemma 2 2B (`gemma2_2b_en` preset)
- **Fine-tuning strategy**: LoRA (Low-Rank Adaptation) with rank 4, which trains a small set of adapter weights while the pretrained weights stay frozen
- **Hyperparameters**: AdamW optimizer with a learning rate of 1e-5 and weight decay of 0.005

### 3.2 Training Process
- Adaptation of pre-trained Gemma weights using LoRA
- Training on the first 1,000 English-Spanish pairs, formatted as instruction-response examples
- Batch size of 1, sequence length of 256 tokens, and 5 epochs, set to balance performance and resource usage
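
To illustrate why LoRA is efficient: instead of updating a full weight matrix of shape `d_in x d_out`, it trains two low-rank factors of shapes `d_in x rank` and `rank x d_out`. The sketch below compares the parameter counts for a single matrix. The 2048 x 2048 shape is an illustrative assumption and is not read from the Gemma checkpoint; `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one dense weight."""
    full = d_in * d_out            # every entry of the weight matrix is trained
    lora = rank * (d_in + d_out)   # factor A is d_in x rank, factor B is rank x d_out
    return full, lora


# Illustrative shape only; not taken from the actual model.
full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=4)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this shape and rank 4, the adapter holds 16,384 parameters against 4,194,304 for the full matrix, which is under half a percent.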

---

In [17]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")

# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

In [18]:
gemma_lm.preprocessor.sequence_length = 256

optimizer = keras.optimizers.AdamW(
    learning_rate=1e-5,
    weight_decay=0.005,
)

optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=5, batch_size=1)

Epoch 1/5
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 504s 426ms/step - loss: 0.2919 - sparse_categorical_accuracy: 0.5395
Epoch 2/5
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 426s 426ms/step - loss: 0.1611 - sparse_categorical_accuracy: 0.7512
Epoch 3/5
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 426s 426ms/step - loss: 0.1251 - sparse_categorical_accuracy: 0.7625
Epoch 4/5
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 426s 426ms/step - loss: 0.1071 - sparse_categorical_accuracy: 0.7891
Epoch 5/5
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 426s 426ms/step - loss: 0.1050 - sparse_categorical_accuracy: 0.7909


<keras.src.callbacks.history.History at 0x7cfeac34ea70>
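
The `sparse_categorical_accuracy` in the log is a token-level metric: it counts how often the model's most likely next token matches the target token. Since it is passed as a weighted metric, padding positions are expected to be excluded through the sample weights. The following pure-Python sketch shows that calculation on hypothetical token ids. It is an illustration of the idea, not the Keras implementation.

```python
def masked_token_accuracy(targets, predictions, mask):
    """Fraction of unmasked positions where the predicted token id matches."""
    correct = 0
    counted = 0
    for target, predicted, keep in zip(targets, predictions, mask):
        if not keep:
            # Padding positions do not contribute to the metric.
            continue
        counted += 1
        correct += int(target == predicted)
    return correct / counted if counted else 0.0


# Hypothetical token ids for one padded sequence (0 = padding).
targets = [17, 42, 8, 99, 0, 0]
predictions = [17, 42, 5, 99, 3, 3]
mask = [1, 1, 1, 1, 0, 0]
print(masked_token_accuracy(targets, predictions, mask))  # 3 of 4 real tokens match
```

A high value here means the model predicts individual tokens of the training strings well; it is not a sentence-level translation score.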

---

## 4. Inference and Evaluation
- Evaluation of the model’s fluency and coherence in Spanish.
- Testing across multiple domains, including conversational tasks, cultural content, and technical translations.

---

In [28]:
prompt1 = template.format(
    instruction="Translate to Spanish: hello",
    response="",
)

print(gemma_lm.generate(prompt1, max_length=256))

Instruction:
Translate to Spanish: hello

Response:
hola


In [29]:
prompt2 = template.format(
    instruction="Translate to Spanish: how are you?",
    response="",
)

print(gemma_lm.generate(prompt2, max_length=256))

Instruction:
Translate to Spanish: how are you?

Response:
¿Cómo estás?


In [30]:
prompt3 = template.format(
    instruction="Translate to Spanish: I like to read books",
    response="",
)

print(gemma_lm.generate(prompt3, max_length=256))

Instruction:
Translate to Spanish: I like to read books

Response:
Me gusta leer libros.


In [31]:
prompt4 = template.format(
    instruction="Translate to Spanish: one two three",
    response="",
)

print(gemma_lm.generate(prompt4, max_length=256))

Instruction:
Translate to Spanish: one two three

Response:
uno dos tres


In [32]:
prompt5 = template.format(
    instruction="Translate to Spanish: The weather is very nice today",
    response="",
)

print(gemma_lm.generate(prompt5, max_length=256))

Instruction:
Translate to Spanish: The weather is very nice today

Response:
El tiempo es muy bonito hoy.


In [40]:
prompt6 = template.format(
    instruction="Translate to Spanish. Explain the cultural significance of Día de los Muertos.",
    response="",
)

print(gemma_lm.generate(prompt6, max_length=256))

Instruction:
Translate to Spanish. Explain the cultural significance of Día de los Muertos.

Response:
Traducir al español. Explique la importancia cultural de Día de los Muertos.
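
The checks above are qualitative. A simple way to put a number on them is a normalized exact-match rate over prompt/reference pairs. This is a minimal sketch: the predictions are copied from the outputs above, while the reference translations are hypothetical gold answers chosen for illustration, and `normalize` and `exact_match_rate` are hypothetical helpers.

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, trim, and drop punctuation (including Spanish ¿ and ¡)."""
    text = unicodedata.normalize("NFC", text).lower().strip()
    punctuation = string.punctuation + "¿¡"
    return "".join(ch for ch in text if ch not in punctuation)


def exact_match_rate(predictions, references) -> float:
    """Share of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Predictions from the cells above; references are illustrative assumptions.
predictions = ["hola", "¿Cómo estás?", "El tiempo es muy bonito hoy."]
references = ["Hola", "¿Cómo estás?", "Hoy hace muy buen tiempo."]
print(exact_match_rate(predictions, references))
```

Exact match is strict: a valid but differently worded translation counts as a miss, so overlap-based metrics such as BLEU or chrF are usually a better fit for larger evaluations.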


## 5. Uploading the Fine-Tuned Model to Kaggle

After completing the evaluation process, we upload the final fine-tuned model to **Kaggle Models** to make it accessible for everyone. To handle the large model size, we save it temporarily in `/kaggle/tmp`, as the Kaggle notebook output directory has size limitations. This ensures seamless sharing and reproducibility of the fine-tuned model.


In [34]:
tmp_model_dir = "/kaggle/tmp/gemma2_spa" 
preset_dir = "gemma2_spa"
os.makedirs(tmp_model_dir, exist_ok=True)
gemma_lm.save_to_preset(tmp_model_dir)

print(f"Model saved to: {tmp_model_dir}")

In [35]:
import kagglehub
import keras_hub
if "KAGGLE_USERNAME" not in os.environ or "KAGGLE_KEY" not in os.environ:
    kagglehub.login()

model_version = 1
kaggle_username = kagglehub.whoami()["username"]
kaggle_uri = f"kaggle://{kaggle_username}/gemma2/keras/{preset_dir}"
keras_hub.upload_preset(kaggle_uri, tmp_model_dir)
print("Done!")
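
The upload handle used above follows the pattern `kaggle://<username>/<model>/<framework>/<variation>`. The sketch below assembles and validates such a handle. `build_kaggle_uri` is a hypothetical helper and `example-user` is a placeholder account name.

```python
def build_kaggle_uri(username: str, model: str, framework: str, variation: str) -> str:
    """Assemble a `kaggle://` model handle from its four components."""
    parts = [username, model, framework, variation]
    for part in parts:
        # Each component must be non-empty and must not contain a slash.
        if not part or "/" in part:
            raise ValueError(f"Invalid handle component: {part!r}")
    return "kaggle://" + "/".join(parts)


# "example-user" is a placeholder, not a real account.
uri = build_kaggle_uri("example-user", "gemma2", "keras", "gemma2_spa")
print(uri)
```

Once the upload has finished, the same handle should in principle be usable with `GemmaCausalLM.from_preset` to load the fine-tuned model in another notebook.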

## 6. Conclusion 🎯

This project demonstrates fine-tuning Gemma 2 for English-to-Spanish translation. Using LoRA, the model reached a token-level accuracy of 79.09% on the training data in the final epoch. Because this figure is measured on the training set, it does not show how well the model generalizes to unseen sentences.

Key achievements include:
- Efficient model adaptation using LoRA
- Successful translation of basic phrases and greetings
- Reduced number of trainable parameters, since only the LoRA adapter weights are updated

The model provides a foundation for broader language adaptation efforts in the Gemma ecosystem. Future work could explore larger datasets, domain-specific vocabularies, and additional Spanish dialects.

The code and model are available on Kaggle for community use and improvement.