## Fine-tuning Gemma on the Kaggle documentation dataset

### Installing the requirements

In [1]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
# Quote the requirement so the shell does not treat ">=3" as an output redirect.
!pip install -q -U "keras>=3"

import os

os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00"


#Importing Libraries
import keras
import keras_nlp
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.2.1 which is incompatible.

2024-04-13 09:42:26.661316: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-13 09:42:26.661417: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-13 09:42:26.786595: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Dataset

We use our own Kaggle dataset, "Kaggle Sectionwise Documentation", available at https://www.kaggle.com/datasets/rishujamaiyar/kaggle-sectionwise-documentation. It contains section-wise documentation text taken from Kaggle's "docs" section. Each entry has topic and subtopic columns that give context for the body text. The content was extracted from the official Kaggle documentation at https://www.kaggle.com/docs.

### Preparing the Dataset
We process this dataset into a format suitable for fine-tuning Gemma.


The **split_text_by_sections** function takes a body text and a list of section headings, locates the start and end position of each section in the text, and returns a dictionary mapping section titles to their content.

The **process_wrapper** function operates on a DataFrame with columns for topics, subtopics, and body text. It splits each body into sections by subtopic, keeps the full text of each non-empty section, and collects the results into a master DataFrame. It then builds an instruction/response string for each section and returns a DataFrame of training examples.
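As a rough illustration of the splitting and prompt-building steps described above, the sketch below applies the same idea to a made-up two-section string. The helper names `split_by_headings` and `build_example` and the sample text are assumptions for illustration only; they are not the notebook's functions or data.

```python
def split_by_headings(body_text, headings):
    """Map each heading to the text between it and the next heading."""
    sections = {}
    cursor = 0
    for i, title in enumerate(headings):
        start = body_text.find(title, cursor)
        if start == -1:
            continue  # heading not present in the body; skip it
        end = len(body_text)
        if i + 1 < len(headings):
            nxt = body_text.find(headings[i + 1], start + len(title))
            if nxt != -1:
                end = nxt
        # Drop the heading itself and keep only the section content.
        sections[title] = body_text[start + len(title):end].strip()
        cursor = end
    return sections


def build_example(topic, sub_topic, body):
    """Format one section as an instruction/response training string."""
    return (
        "Instruction:\n"
        f"What is {sub_topic} in {topic} section in Kaggle?\n"
        "\n"
        "Response:\n"
        f"{body}"
    )


# Hypothetical sample text, not taken from the dataset.
sample_body = "Featured\nLarge public challenges.\nResearch\nMore experimental problems."
sample_sections = split_by_headings(sample_body, ["Featured", "Research"])
sample_examples = [build_example("Competitions", t, c) for t, c in sample_sections.items()]
print(sample_examples[0])
```

Each resulting string has the same "Instruction / Response" layout that the notebook's training examples use.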

In [2]:
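def split_text_by_sections(body_text, section_headings):
    """Return a dict mapping each section heading to its text in body_text."""
    sections = {}
    start_index = 0

    for i, title in enumerate(section_headings):
        start_pos = body_text.find(title, start_index)
        if start_pos == -1:
            # Heading not found in the body; skip it instead of slicing from -1.
            continue

        # A section ends where the next heading starts, or at the end of the text.
        end_pos = len(body_text)
        if i < len(section_headings) - 1:
            next_pos = body_text.find(section_headings[i + 1], start_pos)
            if next_pos != -1:
                end_pos = next_pos

        sections[title] = body_text[start_pos:end_pos].strip()
        start_index = end_pos

    return sections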




def process_wrapper(dff):
    """Build a DataFrame of instruction/response training strings from dff."""
    # Split every document body into one row per (topic, sub-topic) section.
    master_list = []
    for _, row in dff.iterrows():
        topic = row['topics']
        sub_topics = row['sub_topics'].split('\n')
        body = row['body']
        sections = split_text_by_sections(body, sub_topics)
        for title, content in sections.items():
            if len(content) > 0:
                # Drop the heading line so only the section text remains.
                content = content.replace(title + '\n', '')
                master_list.append([topic, title, content])

    master_df = pd.DataFrame(master_list, columns=['topics', 'sub_topics', 'body'])

    # Turn each section into a single instruction/response training string.
    training_df_list = []
    for _, row in master_df.iterrows():
        topic = row['topics']
        sub_topic = row['sub_topics']
        body = row['body']
        txt = "Instruction:\n"
        txt = txt + "What is " + sub_topic + " in " + topic + " section in Kaggle?\n"
        txt = txt + "\n"
        txt = txt + "Response:\n"
        txt = txt + body
        training_df_list.append(txt)

    training_df = pd.DataFrame(training_df_list, columns=['data'])
    return training_df



dff = pd.read_csv('/kaggle/input/kaggle-sectionwise-documentation/kaggle_sectionwise_documentation.csv')
training_df = process_wrapper(dff)

# Check the approximate length of each training example (NLTK word tokens, not Gemma tokens).
training_df['token_length'] = training_df['data'].apply(lambda x: len(word_tokenize(x)))
training_df

| | data | token_length |
|---|---|---|
| 0 | Instruction:\nWhat is Types of Competitions in... | 723 |
| 1 | Instruction:\nWhat is Competition Formats in C... | 975 |
| 2 | Instruction:\nWhat is Joining a Competition in... | 649 |
| 3 | Instruction:\nWhat is Forming a Team in Compet... | 570 |
| 4 | Instruction:\nWhat is Making a Submission in C... | 873 |
| 5 | Instruction:\nWhat is Leakage in Competitions ... | 710 |
| 6 | Instruction:\nWhat is Resources for Getting St... | 639 |
| 7 | Instruction:\nWhat is Cheating in Competitions... | 158 |
| 8 | Instruction:\nWhat is Types of Datasets in Dat... | 1219 |
| 9 | Instruction:\nWhat is Searching for Datasets i... | 706 |


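Gemma's maximum sequence length is 8192 tokens, so each training example should be shorter than that. The counts above come from NLTK's word tokenizer rather than Gemma's own tokenizer, so they are only a rough estimate.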
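A minimal sketch of such a length check is shown below. The function `find_overlong` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer; in practice you would pass the model's own tokenizer, which usually produces more tokens per text than a word-level split.

```python
def find_overlong(texts, tokenize, max_tokens):
    """Return (index, token_count) for every text longer than max_tokens."""
    overlong = []
    for index, text in enumerate(texts):
        n_tokens = len(tokenize(text))
        if n_tokens > max_tokens:
            overlong.append((index, n_tokens))
    return overlong


# Whitespace splitting is only a stand-in tokenizer for this sketch; a
# subword tokenizer such as Gemma's usually yields more tokens per text.
toy_texts = ["one two three", "one two three four five six"]
toy_overlong = find_overlong(toy_texts, str.split, max_tokens=4)
print(toy_overlong)
```

An empty result means every example fits within the chosen budget under the tokenizer that was passed in.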

In [3]:
training_list = training_df['data']
print(training_list[0])

Instruction:
What is Types of Competitions in Competitions section in Kaggle?

Response:
Kaggle Competitions are designed to provide challenges for competitors at all different stages of their machine learning careers. As a result, they are very diverse, with a range of broad types.

Featured
Featured competitions are the types of competitions that Kaggle is probably best known for. These are full-scale machine learning challenges which pose difficult, generally commercially-purposed prediction problems. For example, past featured competitions have included:

Allstate Claim Prediction Challenge - Use customers’ shopping history to predict which insurance policy they purchase

Jigsaw Toxic Comment Classification Challenge - Predict the existence and type of toxic comments on Wikipedia

Zillow Prize - Build a machine learning algorithm that can challenge Zestimates, the Zillow real estate price estimation algorithm

Featured competitions attract some of the most formidable experts, and o

### Training Gemma


KerasNLP provides implementations of many popular model architectures. In this tutorial, you'll create a model using GemmaCausalLM, an end-to-end Gemma model for causal language modeling. A causal language model predicts the next token based on previous tokens.
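To make the idea of next-token prediction concrete, here is a toy sketch of greedy generation. The lookup table `NEXT_TOKEN` is a made-up stand-in for a trained model, and it looks only at the most recent token, whereas a real causal language model such as Gemma conditions on all previous tokens.

```python
# Hypothetical next-token table standing in for a trained model: it maps the
# most recent token to the single most likely following token.
NEXT_TOKEN = {
    "<start>": "kaggle",
    "kaggle": "hosts",
    "hosts": "competitions",
    "competitions": "<end>",
}


def greedy_generate(prompt_tokens, max_length):
    """Extend prompt_tokens one token at a time until <end> or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["<start>", "kaggle"], max_length=10)
print(generated)
```

Generation stops either when the stand-in model emits the end marker or when the total length, prompt included, reaches `max_length`.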

Create the model using the from_preset method:


In [4]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


### Kaggle question set
The following questions are used to evaluate the quality of the model's responses before and after fine-tuning.

In [5]:
testing_questions_set = ["What are various Competitions formats at Kaggle?",
                        "What are teams at Kaggle?",
                        "How a person can create a dataset at Kaggle?",
                        "What are Notebooks at Kaggle?",
                        "What is the stance of cheating in competitions at Kaggle?"]

In [6]:
before_training_response = []

for question in testing_questions_set:
    
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    prompt = template.format(
        instruction=question,
        response="",
    )
    before_training_response.append([question,gemma_lm.generate(prompt, max_length=256)])


evaluation_df = pd.DataFrame(before_training_response,columns=['Questions','Before_Training_Response'])
print(before_training_response[0][1])

Instruction:
What are various Competitions formats at Kaggle?

Response:
There are 3 main formats of competitions at Kaggle:

1. <strong>Kaggle Challenges</strong>: Kaggle Challenges are competitions where the participants are given a dataset and a task to complete. The participants are given a limited time to complete the task and submit their solutions. The best solutions are then evaluated and the winner is announced.

2. <strong>Kaggle Contests</strong>: Kaggle Contests are competitions where the participants are given a dataset and a task to complete. The participants are given a limited time to complete the task and submit their solutions. The best solutions are then evaluated and the winner is announced.

3. <strong>Kaggle Kernels</strong>: Kaggle Kernels are competitions where the participants are given a dataset and a task to complete. The participants are given a limited time to complete the task and submit their solutions. The best solutions are then evaluated and the winner

The model responds with a generic and inaccurate answer to the question.
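The generated text above repeats the prompt before the answer. If only the answer is wanted for comparison, a small helper along these lines could separate the two. `extract_response` is a hypothetical name, and the sketch assumes the text contains the "Response:" marker from the prompt template.

```python
RESPONSE_MARKER = "Response:\n"


def extract_response(generated_text):
    """Return only the text after the first 'Response:' marker."""
    _, marker, answer = generated_text.partition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the full text unchanged.
    return answer.strip() if marker else generated_text


sample_output = (
    "Instruction:\nWhat are teams at Kaggle?\n\n"
    "Response:\nTeams are groups of competitors."  # made-up answer text
)
print(extract_response(sample_output))
```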

In [7]:
evaluation_df

| | Questions | Before_Training_Response |
|---|---|---|
| 0 | What are various Competitions formats at Kaggle? | Instruction:\nWhat are various Competitions fo... |
| 1 | What are teams at Kaggle? | Instruction:\nWhat are teams at Kaggle?\n\nRes... |
| 2 | How a person can create a dataset at Kaggle? | Instruction:\nHow a person can create a datase... |
| 3 | What are Notebooks at Kaggle? | Instruction:\nWhat are Notebooks at Kaggle?\n\... |
| 4 | What is the stance of cheating in competitions... | Instruction:\nWhat is the stance of cheating i... |


### LoRA Fine-tuning
To get better responses from the model, fine-tune the model with Low Rank Adaptation (LoRA) using our dataset.

The LoRA rank determines the dimensionality of the trainable matrices that are added to the original weights of the LLM. It controls the expressiveness and precision of the fine-tuning adjustments.

A higher rank means more detailed changes are possible, but also means more trainable parameters. A lower rank means less computational overhead, but potentially less precise adaptation.

This notebook uses a LoRA rank of 16. 
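The trade-off can be made concrete with some arithmetic. For one dense weight matrix of shape (d_in, d_out), LoRA with rank r adds two small matrices of shapes (d_in, r) and (r, d_out), that is r * (d_in + d_out) trainable parameters. The sketch below uses illustrative dimensions that are assumptions, not values read from Gemma's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_in x d_out) weight matrix."""
    # A has shape (d_in, rank) and B has shape (rank, d_out).
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


# Illustrative dimensions only; not read from the Gemma configuration.
d_in, d_out = 2048, 2048
for rank in (4, 16, 64):
    added = lora_param_count(d_in, d_out, rank)
    share = added / full_param_count(d_in, d_out)
    print(f"rank={rank:>2}: {added:,} trainable parameters ({share:.2%} of the full matrix)")
```

The added parameter count grows linearly with the rank, which is why a higher rank costs more memory and compute per adapted layer.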

In [8]:
# Enable LoRA for the model and set the LoRA rank to 16.
gemma_lm.backbone.enable_lora(rank=16)
gemma_lm.summary()

Note that enabling LoRA reduces the number of trainable parameters significantly (from 2.5 billion to 5.4 million).

In [9]:
# Limit the input sequence length to 512 tokens (to control memory usage).
# Training examples longer than this are truncated.
gemma_lm.preprocessor.sequence_length = 512
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=1e-4,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(training_list, epochs=4, batch_size=1)

Epoch 1/4
40/40 ━━━━━━━━━━━━━━━━━━━━ 52s 742ms/step - loss: 2.2662 - sparse_categorical_accuracy: 0.4613
Epoch 2/4
40/40 ━━━━━━━━━━━━━━━━━━━━ 29s 731ms/step - loss: 2.2086 - sparse_categorical_accuracy: 0.4677
Epoch 3/4
40/40 ━━━━━━━━━━━━━━━━━━━━ 29s 732ms/step - loss: 2.1390 - sparse_categorical_accuracy: 0.4798
Epoch 4/4
40/40 ━━━━━━━━━━━━━━━━━━━━ 29s 732ms/step - loss: 2.0432 - sparse_categorical_accuracy: 0.4988


<keras.src.callbacks.history.History at 0x7e52300c98a0>

### Kaggle questions post fine-tuning


In [10]:
after_training_response = []

for question in testing_questions_set:
    
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    prompt = template.format(
        instruction=question,
        response="",
    )
    after_training_response.append(gemma_lm.generate(prompt, max_length=256))

evaluation_df['After_Training_Response'] = after_training_response
evaluation_df

| | Questions | Before_Training_Response | After_Training_Response |
|---|---|---|---|
| 0 | What are various Competitions formats at Kaggle? | Instruction:\nWhat are various Competitions fo... | Instruction:\nWhat are various Competitions fo... |
| 1 | What are teams at Kaggle? | Instruction:\nWhat are teams at Kaggle?\n\nRes... | Instruction:\nWhat are teams at Kaggle?\n\nRes... |
| 2 | How a person can create a dataset at Kaggle? | Instruction:\nHow a person can create a datase... | Instruction:\nHow a person can create a datase... |
| 3 | What are Notebooks at Kaggle? | Instruction:\nWhat are Notebooks at Kaggle?\n\... | Instruction:\nWhat are Notebooks at Kaggle?\n\... |
| 4 | What is the stance of cheating in competitions... | Instruction:\nWhat is the stance of cheating i... | Instruction:\nWhat is the stance of cheating i... |


In [11]:
evaluation_df.to_csv('evaluation_df.csv',index=False)