This notebook first walks through the [Kaggle QA with Gemma - KerasNLP Starter](https://www.kaggle.com/code/awsaf49/kaggle-qa-with-gemma-kerasnlp-starter) notebook and then builds upon it.

## Install Libraries

In [2]:
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.1.1 which is incompatible.

## Import Libraries

In [3]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import plotly.graph_objs as go
import plotly.express as px
from IPython.display import display, Markdown

2024-04-01 08:31:12.970729: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-01 08:31:12.970873: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-01 08:31:13.111497: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Configuration

In [6]:
class CFG:
    seed = 42
    dataset_path = "/kaggle/input/kaggle-docs/questions_answers"
    preset = "gemma_2b_en" # Pretrained model
    sequence_length = 512 # max size of the input for training
    batch_size = 1 # size of the input batch in training
    epochs = 10 # number of epochs to train

## Reproducibility

In [7]:
keras.utils.set_random_seed(CFG.seed)

## Data
The data comes from a dataset called **Kaggle Docs**, which contains around 60 question-answer pairs built from raw data on the `kaggle.com/docs` website.

**To Do**

Later, we will augment the data so we have more samples to work with.

**Data Format**
- The question-answer pairs are stored in the `./kaggle-docs/questions_answers/data.csv` file.
- This file includes:
    - `Question`: A question about the Kaggle Platform
    - `Answer`: Answer to the question in Markdown Format.
    - `Category`: The category of the question.

In [8]:
df = pd.read_csv(f"{CFG.dataset_path}/data.csv")
df.head(2)

| Unnamed: 0 | Question | Answer | Category |
|---|---|---|---|
| 0 | What are the different types of competitions a... | # Types of Competitions\n\nKaggle Competitions... | competition |
| 1 | What are the different competition formats on ... | There are handful of different formats competi... | competition |


This will be the template used:
```
Category: ...

Question: ...

Answer: ...
```

**To Do** 

Try better prompt engineering techniques.

In [11]:
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

In [12]:
df["prompt"] = df.progress_apply(lambda row: template.format(Category=row.Category,
                                                             Question=row.Question,
                                                             Answer=row.Answer), axis=1)
data = df.prompt.tolist() # Converts the column into a list

  0%|          | 0/60 [00:00<?, ?it/s]

# Sample

Let's examine a sample prompt. Since the answers in the dataset are in Markdown format, we should also render the sample as Markdown.

In [13]:
def colorize_text(text):
    for word, color in zip(["Category", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [14]:
# Take a fixed sample (index 45)
sample = data[45]

# give colors to the question, answer, and category
sample = colorize_text(sample)

# Show sample
display(Markdown(sample))



**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do Kaggle competitions work?

**<font color='green'>Answer:</font>**
## Overview

Every competition has two things:

a) a clearly defined problem that participants need to solve using a machine learning model
b) a dataset that’s used both for training and evaluating the effectiveness of these models.

For example, in the [Store Sales – Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting) competition, participants must accurately predict how many of each grocery item will sell using a dataset of past product and sales information from a grocery retailer.

Once the competition starts, participants can submit their predictions. Kaggle will score them for accuracy, and the team will be placed on a ranked leaderboard. The team at the top of the leaderboard at the deadline wins!

## Datasets, Submissions & Leaderboards

Every competition’s dataset is split into two smaller datasets.

- One of these smaller datasets will be given to participants to train their models, typically named `train.csv`.
- The other dataset will be mostly hidden from participants and used by Kaggle for testing and scoring, named `test.csv` and `solution.csv` (`test.csv` is the same as `solution.csv` except that `test.csv` contains the feature values and `solution.csv` contains the ground truth variable(s) – participants will never, ever see `solution.csv`).

When a participant feels ready to make a submission to the competition, they will use `test.csv` to generate a prediction and upload a CSV file. Kaggle will automatically score the submission for accuracy using the hidden `solution.csv` file.

Most competitions have a maximum number of submissions that a participant can make each day and a final deadline at which point the leaderboard will be frozen.

It’s conceivable that a participant could use the mechanics of a Kaggle competition to overfit a solution - which would be great for winning a competition, but not valuable for a real-world application.

To help prevent this, Kaggle has two leaderboards – the public and private leaderboard. The competition host splits the `solution.csv` dataset into two parts, using one part for the public leaderboard and another part for the private leaderboard. Participants generally will now know which samples are public vs private. The private leaderboard is kept a secret until after the competition deadline and is used as the official leaderboard for determining the final ranking.

# Data Analysis

Let's see how many question-answer pairs we have per category.

In [14]:
unique_labels, label_counts = np.unique(df.Category.tolist(), return_counts=True)

# Plotting
fig = go.Figure(data=go.Bar(x=unique_labels, y=label_counts))
fig.update_layout(
    title="Category Distribution",
    xaxis_title="Category",
    yaxis_title="Count",
)

fig.update_traces(text=label_counts, textposition="outside")
fig.show()

This means that categories with fewer question-answer pairs are likely to get the weakest responses. Therefore, we should do some augmentation or dig deeper into the source for more data.
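A quick way to list the under-represented categories is to count them with pandas. The sketch below runs on a tiny made-up DataFrame (the category names and the `min_samples` threshold are hypothetical, chosen only for illustration); in this notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; only the Category column matters here.
toy_df = pd.DataFrame(
    {"Category": ["competition", "competition", "competition", "notebook", "tpu"]}
)

# Count question-answer pairs per category, smallest first.
counts = toy_df["Category"].value_counts().sort_values()

# Flag categories that fall below an arbitrary, illustrative threshold.
min_samples = 2
under_represented = sorted(counts[counts < min_samples].index.tolist())
print(under_represented)
```

Categories flagged this way would be the first candidates for augmentation.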

We are going to use the Gemma Causal Language Model, which predicts the next token based on the previous tokens. This task setup can be used to train the model unsupervised on plain text input, which is what we are using it for. It can also autoregressively generate text similar to the training data, which is what we want for this problem. We can pre-train or fine-tune the model by calling `fit()`.

The model has a `generate()` method which generates text based on a prompt. You can control its generation strategy by setting the sampler when compiling the model.
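To make the idea of a sampler concrete, here is a minimal NumPy sketch of top-k sampling, one common generation strategy. It is an illustration of the technique only, not the KerasNLP implementation; the function name `top_k_sample` and the toy logits are made up. In KerasNLP itself, a sampler is typically selected at compile time, for example with `gemma_lm.compile(sampler="top_k")`.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Indices of the k largest logits (order within the top k does not matter).
    top_ids = np.argpartition(logits, -k)[-k:]
    # Softmax restricted to the top-k logits, shifted for numerical stability.
    shifted = logits[top_ids] - logits[top_ids].max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.1]  # made-up scores for a 5-token vocabulary

# With k=2 only the two best tokens (ids 3 and 0) can ever be chosen.
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
# With k=1 the sampler reduces to greedy decoding.
greedy = top_k_sample(toy_logits, k=1, rng=rng)
```

A smaller `k` makes the output more deterministic, while a larger `k` allows more varied text.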

In [15]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


In [16]:
x, y, sample_weight = gemma_lm.preprocessor(data[0:2])

The preprocessing layer takes in batches of strings and returns outputs in a `(x, y, sample_weight)` format, where `y` is the next token id in the `x` sequence.

After preprocessing, each array has the shape `(num_samples, sequence_length)`.
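The relationship between `x`, `y`, and `sample_weight` can be sketched with plain NumPy. This is a simplified illustration of the next-token setup, assuming a padding id of 0 and made-up token ids; it is not the actual KerasNLP preprocessor code.

```python
import numpy as np

# Made-up token ids for one sequence, padded with 0 up to length 7.
token_ids = np.array([[2, 15, 27, 9, 1, 0, 0]])
padding_mask = token_ids != 0  # True for real tokens, False for padding

# Features are every token except the last; labels are every token except the first,
# so y[i] is the token that follows x[i].
x = token_ids[:, :-1]
y = token_ids[:, 1:]

# Only positions whose label is a real (non-padding) token contribute to the loss.
sample_weight = padding_mask[:, 1:]

print(x)
print(y)
print(sample_weight)
```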

In [17]:
for k, v in x.items():
    print(k, ":", v.shape)

token_ids : (2, 8192)
padding_mask : (2, 8192)


# Inference before fine tuning

In [25]:
# Take one sample
row = df.iloc[2]

# Generate prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to join a competition?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Join" button.
3. Enter your email address and click on the "Join" button.
4. You will receive an email with a link to confirm your email address.
5. Click on the link in the email to confirm your email address.
6. You will now be able to log in to the competition.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to submit a solution?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Submit" button.
3. Enter your solution in the text box and click on the "Submit" button.
4. You will receive a confirmation email with the status of your submission.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to view the leaderboard?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Leaderboard" button.
3. You will see the leaderboard with the top 100 participants.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to view the

# Fine-tuning with LoRA

To get better responses, we will fine-tune the model on the dataset with Low-Rank Adaptation (LoRA).

In an LLM, each dense layer has a pre-trained weight matrix of shape **d x d**, learned when the LLM was originally trained. LoRA adds a small set of extra weights to fine-tune the model without touching the original pre-trained weights, which stay frozen. Without LoRA, the output of such a layer is **W_0 * x + b_0**, where W_0 is the weight matrix, x is the input, and b_0 is the bias.

LoRA is initialized as two matrices, **A** with shape **r x d** and **B** with shape **d x r**, so that the product **B * A** has the same **d x d** shape as W_0. Here **r** is the rank, which controls how many trainable parameters the LoRA adds.

When we add the LoRA to the LLM, we change the equation for the output to **output = (W_0 * x + b_0) + (B * A * x)**.

**A** is initialized from a normal distribution with mean 0 and variance sigma squared, and **B** is initialized to 0, so **B * A** is zero at the start and the model's output is initially unchanged.

**But why does it save memory?**

We only train the **d x r** and **r x d** LoRA matrices rather than the full pre-trained weights of the LLM. So, even though we are adding parameters to the LLM, gradients and optimizer state are needed only for a small subset of them.
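The LoRA equation and the parameter savings can be checked with a small NumPy sketch. The sizes `d` and `r` are toy values, and the shapes of `A` and `B` are chosen so that `B @ A @ x` is well defined; this illustrates the idea and is not how Keras implements `enable_lora` internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # toy sizes; real models use a much larger d

# Frozen pre-trained layer.
W0 = rng.normal(size=(d, d))
b0 = rng.normal(size=d)

# LoRA factors: A starts random, B starts at zero, so B @ A == 0 initially.
A = rng.normal(scale=0.01, size=(r, d))
B = np.zeros((d, r))

def forward(x):
    # output = (W_0 * x + b_0) + (B * A * x)
    return (W0 @ x + b0) + (B @ (A @ x))

x = rng.normal(size=d)
base_output = W0 @ x + b0
lora_output = forward(x)

frozen_params = W0.size + b0.size   # d*d + d, never updated
trainable_params = A.size + B.size  # 2*d*r, the only weights that are trained
print(frozen_params, trainable_params)
```

Because `B` starts at zero, the LoRA branch contributes nothing before training, so fine-tuning starts from exactly the pre-trained model's behavior.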

In [18]:
# Enable LoRA for the model and set to rank 4
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

Notice how, after enabling LoRA, only 1,363,968 of the 2,507,536,384 total parameters are trainable.

# Training

In [19]:
# Limit the input sequence length to 512 to control memory usage
gemma_lm.preprocessor.sequence_length = CFG.sequence_length

# Compile the model with loss, optimizer, and metric
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=8e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train Model
gemma_lm.fit(data, epochs=CFG.epochs, batch_size=CFG.batch_size)

Epoch 1/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 64s 734ms/step - loss: 1.7209 - sparse_categorical_accuracy: 0.5241
Epoch 2/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 83s 1s/step - loss: 1.6869 - sparse_categorical_accuracy: 0.5313
Epoch 3/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 728ms/step - loss: 1.6175 - sparse_categorical_accuracy: 0.5417
Epoch 4/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 728ms/step - loss: 1.5770 - sparse_categorical_accuracy: 0.5509
Epoch 5/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 728ms/step - loss: 1.5537 - sparse_categorical_accuracy: 0.5552
Epoch 6/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 727ms/step - loss: 1.5304 - sparse_categorical_accuracy: 0.5568
Epoch 7/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 727ms/step - loss: 1.5028 - sparse_categorical_accuracy: 0.5630
Epoch 8/10
60/60

<keras.src.callbacks.history.History at 0x78e474589ed0>

In [20]:
gemma_lm.save('model.keras')

# Inference after fine-tuning

In [21]:
# Take one sample
row = df.iloc[2]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to join a competition?

**<font color='green'>Answer:</font>**
You can view and join any competition in the Competition page on Kaggle. To do so, search for the competition you want to join in the top bar, select it and click on the "Join" button to the right of the name of the competition.

In [22]:
# Take one sample
row = df.iloc[45]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do Kaggle competitions work?

**<font color='green'>Answer:</font>**
Competitions are the heart of Kaggle. They allow users to create a competition from scratch or join an existing one. They can either be hosted on Kaggle or run as a hosted event.

Competitions can be public, where anyone can see the leaderboard, but only the organizers and other participants in the competition can see the raw data. Alternatively, competitions can be private, where the data is not available to anyone except the organizers and participants in the competition.

## Creating a Competition

To create a public competition, navigate to https://www.kaggle.com/competition/create in the browser.

To create a private competition, navigate to https://www.kaggle.com/competition/create-private in the browser.

The following steps will guide you through the process of creating a new competition.

1. Enter a title (max 128 characters) for your competition
2. Select a dataset from the dropdown menu (you can also add your own data to a competition)
3. Select a competition format. There are two formats: “Public” and “Private”.
4. Click “

# Conclusions

We should improve the model using:
- More Samples
- Data Augmentation
- Advanced Prompting
- Larger Model Gemma 7B
- Increase `sequence_length`
- Learning Rate Scheduler
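
As one possible starting point for the learning rate scheduler item, here is a minimal pure-Python sketch of linear warmup followed by cosine decay. The peak rate of `8e-5` matches the optimizer used in training above; the function name `warmup_cosine_lr` and the `warmup_steps` value are hypothetical choices for illustration. Keras also ships ready-made schedules such as `keras.optimizers.schedules.CosineDecay` that could be used instead.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=8e-5, warmup_steps=60, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly during the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# 60 samples * 10 epochs at batch size 1 gives 600 optimizer steps.
total_steps = 600
schedule = [warmup_cosine_lr(s, total_steps) for s in range(total_steps)]
print(schedule[0], schedule[60], schedule[-1])
```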