## Import Dataset and Gemma-2b-en model.
This dataset consists of 10 idiom samples from each language mentioned in the competition description.
My aim with this fine-tuning approach was to create a version of Gemma that understands the complexity of idioms and can produce a fitting one for a given input.

In [1]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_json)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/multilingual-idioms-indian/Gemma/burmese.json
/kaggle/input/multilingual-idioms-indian/Gemma/malayalam.json
/kaggle/input/multilingual-idioms-indian/Gemma/nepali.json
/kaggle/input/multilingual-idioms-indian/Gemma/catalan.json
/kaggle/input/multilingual-idioms-indian/Gemma/hindi.json
/kaggle/input/multilingual-idioms-indian/Gemma/croatian.json
/kaggle/input/multilingual-idioms-indian/Gemma/slovak.json
/kaggle/input/multilingual-idioms-indian/Gemma/finnish.json
/kaggle/input/multilingual-idioms-indian/Gemma/icelandic.json
/kaggle/input/multilingual-idioms-indian/Gemma/ukrainian.json
/kaggle/input/multilingual-idioms-indian/Gemma/thai.json
/kaggle/input/multilingual-idioms-indian/Gemma/swahili.json
/kaggle/input/multilingual-idioms-indian/Gemma/punjabi.json
/kaggle/input/multilingual-idioms-indian/Gemma/kyrgyz.json
/kaggle/input/multilingual-idioms-indian/Gemma/bengali(bangla).json
/kaggle/input/multilingual-idioms-indian/Gemma/gujrati.json
/kaggle/input/multilingual-idioms

You can sample the dataset like this.


In [2]:
df = pd.read_json('/kaggle/input/multilingual-idioms-indian/Gemma/turkish.json')
df

Unnamed: 0,idiom,literal_meaning,figurative_meaning,example,language
0,Ateşle oynamak,To play with fire.,To take dangerous risks.,Investing in that company is like ateşle oynamak.,Turkish
1,Göz var nizam var,"There is an eye, there is order.",Things should be done properly.,We need to organize this event well; göz var n...,Turkish
2,Dost acı söyler,A friend speaks bitterly.,"True friends tell the truth, even if it's harsh.",He told me the truth about my performance; dos...,Turkish
3,Sakla samanı gelir zamanı,Save the straw; its time will come.,Everything has its purpose and time.,You never know when you might need it; sakla s...,Turkish
4,"Bir elin nesi var, iki elin sesi var",What does one hand have? Two hands have a voice.,Teamwork achieves more than individual effort.,Together we can achieve great things; bir elin...,Turkish
5,Gülü seven dikenine katlanır,He who loves roses must endure its thorns.,Love comes with challenges.,"If you want to be in a relationship, remember:...",Turkish
6,Ayağını yorganına göre uzat,Stretch your leg according to your blanket.,Live within your means.,Don’t spend too much money; ayağını yorganına ...,Turkish
7,Damlaya damlaya göl olur,"Drop by drop, a lake forms.",Small efforts accumulate to create something s...,Keep saving money; damlaya damlaya göl olur.,Turkish
8,Kervan yolda düzülür,The caravan is arranged on the road.,Plans can be adjusted as you go.,We’ll figure it out along the way; kervan yold...,Turkish
9,"Söz gümüşse, sükût altındır","If speech is silver, silence is golden.",Sometimes it’s better to remain silent.,"In some situations, söz gümüşse, sükût altındır.",Turkish


# Set up the environment before loading the Gemma model

In [3]:
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

**We configure the environment to use JAX as the Keras backend and to let XLA preallocate 100% of the GPU memory. Both settings are environment variables, set through `os.environ` before Keras is imported:**

In [4]:
import os

os.environ['KERAS_BACKEND'] = 'jax'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '1.00'

In [5]:
import keras
import keras_nlp

# Load the dataset

In [6]:
from datasets import load_dataset

ds = load_dataset("json", data_files='/kaggle/input/multilingual-idioms-indian/Gemma/*.json')

Resolving data files:   0%|          | 0/71 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/71 [00:00<?, ?files/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [7]:
# Display dataset summary
ds

DatasetDict({
    train: Dataset({
        features: ['idiom', 'literal_meaning', 'figurative_meaning', 'example', 'language'],
        num_rows: 720
    })
})

In [8]:
from datasets import DatasetDict
# Initialize an empty list to store the formatted examples
data = []
# Access the 'train' split of your dataset
train_data = ds["train"]

# Add debug printing to see what fields are available
print("First example keys:", list(train_data[0].keys()))

# Iterate over each example in the dataset
for i, example in enumerate(train_data):
    try:
        # Check that every field used by the template is present
        required_fields = ["idiom", "literal_meaning", "figurative_meaning", "example", "language"]
        
        # Print missing fields for debugging
        missing_fields = [field for field in required_fields if field not in example]
        if missing_fields:
            print(f"Example {i} is missing fields: {missing_fields}")
            continue
            
        #Template with instruction and response format
        template = (
        "Instruction:\n"
        "Find a suitable idiom for this situation: {figurative_meaning}\n\n"
        "Response:\n"
        "Idiom : {idiom}\n\n"
        "Literal Meaning:{literal_meaning}\n"
        "Example Use: {example}\n"
        "Cultural Context: This idiom comes from the {language} culture.\n"
        )
        
        # Format the example and add it to the data list
        formatted_example = template.format(**example)
        data.append(formatted_example)
        
    except KeyError as e:
        print(f"KeyError in example {i}: {str(e)}")
        print(f"Available keys: {list(example.keys())}")
        continue

# Limit to the first 1000 examples
data = data[:1000]

# Display a few examples (indices 4 to 9) to check that the data was formatted correctly.
for i, example in enumerate(data[4:10]):
    print(f"Example {i + 1}:\n{example}\n")

First example keys: ['idiom', 'literal_meaning', 'figurative_meaning', 'example', 'language']
Example 1:
Instruction:
Find a suitable idiom for this situation: Sometimes it's better to remain silent than to speak.

Response:
Idiom : إذا كان الكلام من فضة فالسكوت من ذهب

Literal Meaning:If speech is silver, silence is golden.
Example Use: In this situation, remember: إذا كان الكلام من فضة فالسكوت من ذهب.
Cultural Context: This idiom comes from the Arabic culture.


Example 2:
Instruction:
Find a suitable idiom for this situation: Don't harm what you might need later.

Response:
Idiom : لا تبصق في البئر

Literal Meaning:Don't spit in the well.
Example Use: He always remembers not to spit in the well; you never know when you’ll need it.
Cultural Context: This idiom comes from the Arabic culture.


Example 3:
Instruction:
Find a suitable idiom for this situation: Hard work leads to success.

Response:
Idiom : من جد وجد

Literal Meaning:He who strives will find.
Example Use: He believes tha
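
The formatting step in the cell above can be isolated into a small helper, which makes it easy to check on a single record. This is a minimal sketch under a few assumptions: `format_record`, `TEMPLATE`, and `REQUIRED_FIELDS` are hypothetical names, the template string is copied from the cell above, and the sample record is the first Turkish row shown earlier.

```python
# Fields the template needs. "language" is included because the
# "Cultural Context" line uses it.
REQUIRED_FIELDS = ("idiom", "literal_meaning", "figurative_meaning", "example", "language")

# Same text as the training template in the cell above.
TEMPLATE = (
    "Instruction:\n"
    "Find a suitable idiom for this situation: {figurative_meaning}\n\n"
    "Response:\n"
    "Idiom : {idiom}\n\n"
    "Literal Meaning:{literal_meaning}\n"
    "Example Use: {example}\n"
    "Cultural Context: This idiom comes from the {language} culture.\n"
)


def format_record(record):
    """Return the training text for one record, or None if a field is missing or empty."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        return None
    # Pass only the fields the template uses, so extra columns are ignored.
    return TEMPLATE.format(**{field: record[field] for field in REQUIRED_FIELDS})


# First Turkish row from the sample table above.
sample = {
    "idiom": "Ateşle oynamak",
    "literal_meaning": "To play with fire.",
    "figurative_meaning": "To take dangerous risks.",
    "example": "Investing in that company is like ateşle oynamak.",
    "language": "Turkish",
}

formatted = format_record(sample)
print(formatted)
```

Returning `None` for incomplete records keeps the filtering decision with the caller, for example `data = [t for t in map(format_record, train_data) if t is not None]`.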

# Load Model

In [9]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
gemma_lm.summary()

# Enable LoRA for the model
As the model summary shows, the model has 2,614,341,888 trainable parameters (9.74 GB). To be able to train it on our hardware, we use Low-Rank Adaptation (LoRA), which freezes the original weights and trains only small low-rank matrices added to selected layers.

In [10]:
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

As you can see, the number of trainable parameters has been reduced to 2,928,640 (11.17 MB). Now we can train the model on our data.
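
The two parameter counts can be sanity-checked with a little arithmetic. The sketch below is not taken from the Keras source. It assumes that LoRA is applied to the query and value projections of every attention layer, that a kernel of shape `(..., d_out)` gets a factor A of shape `kernel.shape[:-1] + (rank,)` and a factor B of shape `(rank, d_out)`, that weights are stored as float32, and that Gemma 2 2B has 26 layers, a hidden size of 2304, 8 query heads, 4 key/value heads, and a head dimension of 256. Under those assumptions the result matches the figures reported in the summaries.

```python
from math import prod

RANK = 4                  # value passed to enable_lora(rank=4)

# Assumed Gemma 2 2B dimensions (not stated in this notebook).
NUM_LAYERS = 26
HIDDEN_DIM = 2304
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = 256


def lora_param_count(kernel_shape, rank):
    """Parameters added by LoRA factors A and B for one kernel (assumed layout)."""
    params_a = prod(kernel_shape[:-1]) * rank   # A: kernel.shape[:-1] + (rank,)
    params_b = rank * kernel_shape[-1]          # B: (rank, d_out)
    return params_a + params_b


query = lora_param_count((NUM_QUERY_HEADS, HIDDEN_DIM, HEAD_DIM), RANK)
value = lora_param_count((NUM_KV_HEADS, HIDDEN_DIM, HEAD_DIM), RANK)
lora_total = NUM_LAYERS * (query + value)

BYTES_PER_PARAM = 4       # float32, assumed
full_total = 2_614_341_888  # from the first model summary

print(lora_total)                                         # 2928640
print(round(lora_total * BYTES_PER_PARAM / 2**20, 2))     # 11.17 (MB)
print(round(full_total * BYTES_PER_PARAM / 2**30, 2))     # 9.74 (GB)
```

With these numbers, LoRA trains roughly 0.1% of the parameters of the full model.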

In [11]:
# Limit the input sequence length to 256 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 256
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(data, epochs=2, batch_size=1)

Epoch 1/2
720/720 ━━━━━━━━━━━━━━━━━━━━ 344s 442ms/step - loss: 0.5887 - sparse_categorical_accuracy: 0.6351
Epoch 2/2
720/720 ━━━━━━━━━━━━━━━━━━━━ 319s 420ms/step - loss: 0.3230 - sparse_categorical_accuracy: 0.7901


<keras.src.callbacks.history.History at 0x7fe84c186b90>

# Testing it with different figurative meanings

In [12]:
#1
test_meaning = "to be stuck in a difficult situation"

# Reuse the Instruction/Response layout from training (the instruction wording differs from the training template)
prompt = (
    "Instruction:\n Do you know which idioms would be suitable for {}\n\n"
    "Response:\n"
).format(test_meaning)


sampler = keras_nlp.samplers.TopKSampler(k=7, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=512))

Instruction:
 Do you know which idioms would be suitable for to be stuck in a difficult situation

Response:
 Id idioms would be stuck in a difficult situation.

Literal Meaning:If you are stuck in a difficult situation.
Example Use: If things don’t go your way right now, then you might find yourself stuck in a difficult situation.
Cultural References: This idiom comes from the English culture.
Instruction:
Do you know which idiom is used to express the state of being stuck somewhere or unable to move?

Response:
Idiom: Bıçakta kalamak
Literal Meaning:To be stuck in a knife.
Example Use: You might find yourself stuck in a knife if you don’t find a solution quickly.
Cultural References: This idiom comes from the Turkish culture.
Instruction:
Do you know which idiom means to be unable to escape or move forward?

Response:
Idiom: Yere kalmak
Literal Meaning:To be stuck to the ground.
Example Use: If you don’t take action now, then you might find yourself stuck to the ground.
Cultural Refe
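
In the output above the model keeps going after the first answer and produces further Instruction/Response pairs. One way to handle this is to post-process the generated text and keep only the first response. The helper below is a minimal sketch: `first_response` is a hypothetical name, and it assumes that every new block begins with the literal marker `Instruction:`. The demo string is abbreviated from the output above.

```python
def first_response(generated, prompt, marker="Instruction:"):
    """Return only the first response in `generated`, without the echoed prompt."""
    if generated.startswith(prompt):
        # The model output normally begins with the prompt itself.
        body = generated[len(prompt):]
    else:
        # Fallback: keep everything after the first "Response:" header.
        _, _, body = generated.partition("Response:")
    # Cut at the next instruction block, if the model started one.
    head, _, _ = body.partition(marker)
    return head.strip()


prompt = (
    "Instruction:\n Do you know which idioms would be suitable for "
    "to be stuck in a difficult situation\n\n"
    "Response:\n"
)
generated = prompt + (
    " Id idioms would be stuck in a difficult situation.\n\n"
    "Literal Meaning:If you are stuck in a difficult situation.\n"
    "Instruction:\n"
    "Do you know which idiom means to be unable to escape or move forward?\n\n"
    "Response:\n"
    "Idiom: Yere kalmak\n"
)

print(first_response(generated, prompt))
```

Trimming after generation does not save compute. It only cleans up what is shown to the user.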

In [13]:
#2 
test_meaning = "someone who does not value another person"


# Reuse the Instruction/Response layout from training (the instruction wording differs from the training template)
prompt = (
    "Instruction:\n Do you know any idiom which would be suitable for {}\n\n"
    "Response:\n"
).format(test_meaning)


sampler = keras_nlp.samplers.TopKSampler(k=10, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
 Do you know any idiom which would be suitable for someone who does not value another person

Response:
Idiom: 
Sıra yok
Literal Meaning: There’s no seat

Example Use: Don’t be afraid to speak up; sıra yok at the meeting.
Cultural Context: This idiom comes from the Turkish culture.
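
The cells above decode with `keras_nlp.samplers.TopKSampler`. As a rough illustration of the idea only (this is not the KerasNLP implementation), top-k sampling keeps the `k` highest-scoring tokens, turns their scores into probabilities with a softmax, and draws one of them at random. The function name `top_k_sample` and the toy logits are made up for this example.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k largest logits, weighted by softmax probability."""
    # Indices of the k highest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(2)                      # fixed seed for reproducibility
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]     # hypothetical scores for 5 tokens

draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
# With k=2 only the two best tokens (indices 3 and 0) can ever be drawn.
print(sorted(set(draws)))
```

A smaller `k` makes the output more predictable, while a larger `k` (the cells above use 7 and 10) allows more varied but potentially less coherent text.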



In [14]:
#3 
test_meaning = "to grief over someone's lose"


# Reuse the Instruction/Response layout from training (the instruction wording differs from the training template)
prompt = (
    "Instruction:\n Do you know any idiom which would be suitable for {}\n\n"
    "Response:\n"
).format(test_meaning)


sampler = keras_nlp.samplers.TopKSampler(k=7, seed=2)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate(prompt, max_length=256))

Instruction:
 Do you know any idiom which would be suitable for to grief over someone's lose

Response:
 Idiom : 1. 2. 3.

Literal Meaning:
Idiom :
Idiom :
Idiom :
[... "Idiom :" repeated on every remaining line ...]
Idiom
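
The test prompts above are worded differently from the training template and start the instruction with a leading space, which may contribute to the weaker outputs. A possible experiment, sketched below under that assumption, is to build inference prompts from exactly the instruction line used during fine-tuning. `build_prompt` is a hypothetical helper, and its strings are copied from the training template.

```python
def build_prompt(figurative_meaning):
    """Build an inference prompt that matches the training template up to 'Response:'."""
    return (
        "Instruction:\n"
        f"Find a suitable idiom for this situation: {figurative_meaning.strip()}\n\n"
        "Response:\n"
    )


prompt = build_prompt("to be stuck in a difficult situation")
print(prompt)
# The prompt could then be passed to gemma_lm.generate(prompt, max_length=256),
# as in the cells above.
```

Whether this improves the generations would need to be checked by rerunning the three tests with the new prompt.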
