# Decoding Python with Stack Overflow: A Personalized Assistant with Gemma 2b

![Image description](https://helios-i.mashable.com/imagery/articles/00veJ5qeI90cfXdfFzUfCrv/hero-image.fill.size_1248x702.v1708464912.jpg)



I've built a personalized assistant powered by Gemma 2b that answers Python programming questions. The project fine-tunes Gemma 2b, a general-purpose open language model, on question-and-answer pairs sourced from Stack Overflow so that its responses are tailored to Python queries.

## Introduction to Gemma 2b

Gemma 2b, part of the Gemma model family, is available in pre-trained and instruction-tuned variants. It is a general-purpose language model; in this project it is fine-tuned to assist with Python programming queries. Developed by Google DeepMind and other teams at Google, Gemma builds on the research and technology behind the Gemini models and ships with tools and guidance for responsible AI development.

### Key Features of Gemma 2b

- **Tailored Assistance**: Once fine-tuned on Python questions and answers from Stack Overflow, Gemma 2b can offer responses tailored to the specific needs of Python developers.

- **Responsible AI Toolkit**: Developers have access to a comprehensive toolkit that guides responsible and ethical usage of Gemma 2b, ensuring the development of safer AI applications.

- **Framework Integration**: Gemma 2b seamlessly integrates with major frameworks like JAX, PyTorch, and TensorFlow, offering toolchains for inference and supervised fine-tuning (SFT).

- **Ready-to-Use Notebooks**: Ready-to-use Colab and Kaggle notebooks simplify the process of getting started with Gemma 2b, enabling developers to dive into their projects quickly.

- **Deployment Flexibility**: Gemma 2b models can be deployed on various platforms, including laptops, workstations, and Google Cloud, with easy deployment options on Vertex AI and Google Kubernetes Engine (GKE).

- **Performance Optimization**: Gemma 2b is optimized for performance across multiple AI hardware platforms, including NVIDIA GPUs and Google Cloud TPUs.


In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/data-assistants-with-gemma/submission_categories.txt
/kaggle/input/data-assistants-with-gemma/submission_instructions.txt
/kaggle/input/gemma/pytorch/2b-it/1/config.json
/kaggle/input/gemma/pytorch/2b-it/1/gemma-2b-it.ckpt
/kaggle/input/gemma/pytorch/2b-it/1/tokenizer.model
/kaggle/input/gemma/keras/gemma_2b_en/2/config.json
/kaggle/input/gemma/keras/gemma_2b_en/2/tokenizer.json
/kaggle/input/gemma/keras/gemma_2b_en/2/metadata.json
/kaggle/input/gemma/keras/gemma_2b_en/2/model.weights.h5
/kaggle/input/gemma/keras/gemma_2b_en/2/assets/tokenizer/vocabulary.spm
/kaggle/input/gemma/keras/gemma_2b_en/1/config.json
/kaggle/input/gemma/keras/gemma_2b_en/1/tokenizer.json
/kaggle/input/gemma/keras/gemma_2b_en/1/metadata.json
/kaggle/input/gemma/keras/gemma_2b_en/1/model.weights.h5
/kaggle/input/gemma/keras/gemma_2b_en/1/assets/tokenizer/vocabulary.spm
/kaggle/input/gemma/transformers/2b-it/2/model.safetensors.index.json
/kaggle/input/gemma/transformers/2b-it/2/gemma-2b-it.gguf
/ka

### Installing Project Dependencies

This section outlines the installation process for the required dependencies used in this project. 

Each line is a command executed within the notebook environment to install a Python package using `pip`. The commented-out lines are packages that this version of the notebook does not need:



In [2]:
# !pip install -U keras-nlp
# !pip install -U keras
# !pip install -U trl

# !pip install -U datasets
# !pip install -U tf_keras
# !pip install -U nltk
# !pip install -U tensorflow-text
# !pip install -U adapters
!pip install --upgrade pip
!pip install -q -U keras-nlp
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install -q -U "keras>=3"

Collecting pip
  Downloading pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.0-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 34.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.2
    Uninstalling pip-23.3.2:
      Successfully uninstalled pip-23.3.2
Successfully installed pip-24.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.0.5 which is incompatible.
tensorflowjs 4.16.0 requires packaging~=23.1, but you have packaging 21.3 which is incompatible.

# Importing Modules


In [3]:
# from datasets import load_dataset
# from trl import SFTTrainer
# from peft import LoraConfig
# from transformers import AutoTokenizer, AutoModelForCausalLM
# from transformers import BitsAndBytesConfig, GemmaTokenizer
# import torch
# import transformers
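import os

# Choose the Keras backend before importing keras; the setting is read at import time.
os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_nlp
import pandas as pd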

2024-03-03 20:18:24.672147: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-03 20:18:24.672272: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-03 20:18:24.803359: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
# Select JAX as the Keras backend. This only takes effect if it is set before keras is first imported.
os.environ["KERAS_BACKEND"] = "jax"
# Pre-allocate 100% of memory to minimize memory fragmentation and allocation overhead
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

## Choosing the Gemma Model for Kaggle

Gemma offers a variety of large language models, each with different parameter sizes and fine-tuning options. Selecting the right model for your Kaggle notebook is crucial for both memory constraints and optimal performance.

**Available Models:**

| Model Name | Parameters | Instruction-tuned? | Kaggle-Compatible? |
|---|---|---|---|
| Gemma 7b instruct | 7 billion | Yes | No (exceeds memory limit) |
| Gemma 2b instruct | 2 billion | Yes | Yes (recommended) |
| Gemma 7b | 7 billion | No | No (exceeds memory limit) |
| Gemma 2b | 2 billion | No | Yes |

**Memory Limitation:**

Kaggle notebooks have a 30 GB memory limit. Opting for models larger than 2 billion parameters (like Gemma 7b or 7b instruct) will likely lead to kernel crashes.
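
A rough back-of-the-envelope check makes the memory argument concrete. The sketch below is only an estimate under stated assumptions: it uses the nominal parameter counts from the table, assumes 4-byte float32 weights, and counts only the weights themselves (activations, gradients, and optimizer state need additional memory). The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts taken from the table above (assumption: float32 weights).
for name, num_params in [("Gemma 2b", 2e9), ("Gemma 7b", 7e9)]:
    print(f"{name}: ~{weight_memory_gib(num_params):.1f} GiB of float32 weights")
```

Under these assumptions the 7b weights alone come close to the 30 GB limit before any training state is allocated, which is consistent with the recommendation to stay with a 2b model.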

**Recommendation:**

For Kaggle notebooks, **Gemma 2b** is the sweet spot: it is small enough to fine-tune while staying within the memory constraints. This notebook loads the base `gemma_2b_en` preset.

**Additional Tips:**

- Consider your specific task and needs when choosing a model.
- Consult the Gemma documentation for detailed information: https://ai.google.dev/gemma/
- Remember, larger models often require more data and computational resources.



In [5]:
# Create an instance of GemmaCausalLM from the preset
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")


Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


# Testing the Model
## Generating Text with a Single Prompt

We use the `generate` method of the `gemma_lm` language model to generate text from a single prompt. The prompt is "A quote in 30 words", and `max_length=50` caps the length of the output sequence at 50 tokens (not words).

In [6]:
# Generate text with a single prompt
input_text = "A quote in 30 words"
# tokenized_ids = tokenizer(input_text, return_tensors="pt")
single_prompt_result = gemma_lm.generate(input_text, max_length=50)

# Print the result for the single prompt
print("Single Prompt Result:")
print(single_prompt_result)



I0000 00:00:1709497196.408782      26 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
W0000 00:00:1709497196.474946      26 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update
W0000 00:00:1709497196.519525      26 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


Single Prompt Result:
A quote in 30 words or less.

“I’m not a fan of the word ‘diversity’ because it’s so broad. It’s like saying ‘I’m not a fan of the word ‘love’


In [7]:
# Print the model summary (layers and parameter counts).
# summary() prints the table itself and returns None, so wrapping it in print() only adds a stray "None".
gemma_lm.summary()


## Read and Process Data

We read the questions from "Questions.csv" and the answers from "Answers.csv", both located in "/kaggle/input/pythonquestions/" (the code refers to this directory by the relative path `../input/pythonquestions/`). Each file is loaded into a pandas DataFrame (`df_questions` and `df_answers`) with `read_csv`, and `usecols` keeps only the columns we need. We specify the encoding as "ISO-8859-1" to handle any special characters in the data.
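
To see why the encoding argument matters, here is a small standalone sketch. The byte string is a made-up example, not taken from the dataset: a Latin-1 encoded character is not valid UTF-8, so decoding it with the wrong codec fails, while ISO-8859-1 decodes it cleanly.

```python
# A made-up byte string: "café" encoded as ISO-8859-1 (Latin-1).
raw = "caf\u00e9".encode("ISO-8859-1")

# Decoding as UTF-8 fails because the single byte 0xE9 is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the matching codec recovers the original text.
decoded = raw.decode("ISO-8859-1")
print(utf8_ok, decoded)
```

Passing `encoding="ISO-8859-1"` to `read_csv` applies the same decoding to the whole file.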


In [8]:
#questions table
df_questions = pd.read_csv('../input/pythonquestions/Questions.csv',
                            encoding = "ISO-8859-1",
                            usecols = ['Id','Score','Title'])
#answers table
df_answers = pd.read_csv('../input/pythonquestions/Answers.csv',
                            encoding = "ISO-8859-1",
                            usecols = ['ParentId','Score','Body'],#parent id links to the questions table
                            )

### Keeping the full tables for now. A subset is selected later because of limited RAM; you can try the entire dataset


In [9]:
questions = df_questions
answers = df_answers

#### Here, you can see the first five questions

In [10]:
questions[:5]

| | Id | Score | Title |
|---|---|---|---|
| 0 | 469 | 21 | How can I find the full path to a font from it... |
| 1 | 502 | 27 | Get a preview JPEG of a PDF on Windows? |
| 2 | 535 | 40 | Continuous Integration System for a Python Cod... |
| 3 | 594 | 25 | cx_Oracle: How do I iterate over a result set? |
| 4 | 683 | 28 | Using 'in' to match an attribute of Python obj... |


## Pre-processing:


In [11]:
questions = questions[questions['Score'] > 0]
answers = answers[answers['Score'] > 0]\
    .sort_values('Score',ascending=False)\
    .drop_duplicates(subset=['ParentId'])

In [12]:
qa = questions.merge(answers,left_on = 'Id', right_on = 'ParentId')\
    .rename(columns={'Title':'Question','Body':'Answer'})[['Question','Answer','Score_x']]
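
The two cells above keep only positively scored rows, keep the single highest-scored answer per question, and then join each question to that answer. The following is a plain-Python sketch of the same logic on made-up rows; it is meant only to illustrate the steps and is not how the notebook does it (the notebook uses pandas).

```python
# Made-up rows that mimic the Questions and Answers tables.
questions = [
    {"Id": 1, "Score": 5, "Title": "How do I reverse a list?"},
    {"Id": 2, "Score": 0, "Title": "Unscored question"},
    {"Id": 3, "Score": 2, "Title": "What is a decorator?"},
]
answers = [
    {"ParentId": 1, "Score": 3, "Body": "Use reversed(my_list)."},
    {"ParentId": 1, "Score": 9, "Body": "Use my_list[::-1]."},
    {"ParentId": 3, "Score": -1, "Body": "Downvoted answer"},
    {"ParentId": 2, "Score": 4, "Body": "Answer to an unscored question"},
]

# Step 1: keep only rows with a positive score.
good_questions = [q for q in questions if q["Score"] > 0]
good_answers = [a for a in answers if a["Score"] > 0]

# Step 2: keep the highest-scored answer per question
# (sort by score, descending, then keep the first answer seen for each ParentId).
best_answer = {}
for a in sorted(good_answers, key=lambda a: a["Score"], reverse=True):
    best_answer.setdefault(a["ParentId"], a)

# Step 3: inner join on Id == ParentId.
qa_pairs = [
    {"Question": q["Title"], "Answer": best_answer[q["Id"]]["Body"], "Score_x": q["Score"]}
    for q in good_questions
    if q["Id"] in best_answer
]
print(qa_pairs)
```

Only the first question survives: the second has a score of 0, and the third has no positively scored answer, so the inner join drops it.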

### Keeping the 5,000 highest-scored questions from the qa dataframe


In [13]:
qa = qa.sort_values("Score_x",ascending=False).head(5000)

## Pre-processing of the question-and-answer text
- The text can be cleaned using regular expressions.
- The function `remove_html_tags_with_space` takes a list of lists (`html_list`) of strings containing HTML tags and replaces each tag with a space.
- It iterates over each inner list in `html_list`, then over each item in the inner list.
- For each item, it replaces HTML tags with spaces using `re.sub`.
- It also replaces newline characters and the escaped Python prompt `&gt;&gt;&gt;` with spaces.
- Finally, it strips leading and trailing spaces from each cleaned item before appending it to `clean_inner_list`.
- The function returns a new list of lists (`clean_list`) containing the cleaned strings.
- Note that the function is only defined here: the cells below do not call it, so the training examples shown later still contain the original HTML.


In [14]:
import re


def remove_html_tags_with_space(html_list):
    """Replace HTML tags and line breaks with spaces in every string of a list of lists."""
    clean_list = []
    for inner_list in html_list:
        clean_inner_list = []
        for item in inner_list:
            clean_item = re.sub(r'<[^>]*>', ' ', item)  # Replace HTML tags with spaces
            # Replace line breaks with spaces; \r\n is matched first so no stray \r is left behind
            clean_item = re.sub(r'\r\n|\r|\n', ' ', clean_item)
            clean_item = clean_item.replace('&gt;&gt;&gt;', ' ')  # Drop escaped ">>>" prompts
            clean_inner_list.append(clean_item.strip())  # Remove leading and trailing spaces
        clean_list.append(clean_inner_list)
    return clean_list
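
The cleaning idea can be tried on a single string. This is a minimal sketch using a hypothetical `strip_html` helper that is not part of the notebook; unlike the notebook's function, it also collapses runs of whitespace into one space so the result is easier to read. The sample text is a shortened fragment of the "yield" answer from the dataset.

```python
import re


def strip_html(text: str) -> str:
    """Replace HTML tags with spaces, drop escaped '>>>' prompts, and collapse whitespace."""
    text = re.sub(r"<[^>]*>", " ", text)      # tags become spaces
    text = text.replace("&gt;&gt;&gt;", " ")  # escaped Python prompt
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace (an addition in this sketch)


sample = (
    "<p>To understand what <code>yield</code> does, "
    "you must understand what <em>generators</em> are.</p>"
)
cleaned = strip_html(sample)
print(cleaned)
```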


## Formatting the Data into the Desired Form

**Desired form:** one string per example, `Question:\n<question>\n\nAnswer:\n<answer>`


In [15]:
train = []
for index, row in qa.iterrows():
    train.append(f"Question:\n{row['Question']}\n\nAnswer:\n{row['Answer']}")


#### Viewing the first 5 examples of our `train` list

In [16]:
train[:5]

['Question:\nWhat does the "yield" keyword do?\n\nAnswer:\n<p>To understand what <code>yield</code> does, you must understand what <em>generators</em> are. And before generators come <em>iterables</em>.</p>\n\n<h2>Iterables</h2>\n\n<p>When you create a list, you can read its items one by one. Reading its items one by one is called iteration:</p>\n\n<pre><code>&gt;&gt;&gt; mylist = [1, 2, 3]\n&gt;&gt;&gt; for i in mylist:\n...    print(i)\n1\n2\n3\n</code></pre>\n\n<p><code>mylist</code> is an <em>iterable</em>. When you use a list comprehension, you create a list, and so an iterable:</p>\n\n<pre><code>&gt;&gt;&gt; mylist = [x*x for x in range(3)]\n&gt;&gt;&gt; for i in mylist:\n...    print(i)\n0\n1\n4\n</code></pre>\n\n<p>Everything you can use "<code>for... in...</code>" on is an iterable; <code>lists</code>, <code>strings</code>, files...</p>\n\n<p>These iterables are handy because you can read them as much as you wish, but you store all the values in memory and this is not always w

# Fine-tuning Gemma


## Gemma Language Model Backbone Configuration:

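- Enable Low-Rank Adaptation (LoRA) with a rank of 4 on the Gemma Language Model backbone. LoRA freezes the original weights and trains only small low-rank update matrices, which greatly reduces the number of trainable parameters.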
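
To give a feel for what a rank of 4 means: for a frozen weight matrix of shape `d_out x d_in`, LoRA learns two small matrices of shapes `d_out x rank` and `rank x d_in` whose product is added to the frozen weight. The sketch below only counts parameters for a single matrix; the 2048 x 2048 shape is an illustrative assumption, and which layers receive adapters is decided by the library, not by this snippet.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors (d_out x rank and rank x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # illustrative matrix size (an assumption for this sketch)
rank = 4             # the rank passed to enable_lora in this notebook

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

With these numbers the adapter trains well under 1% of the parameters of the matrix it adapts, which is what makes fine-tuning feasible within the notebook's memory budget.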



In [17]:
gemma_lm.backbone.enable_lora(rank=4)

In [18]:
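def get_prompt(query: str) -> str:
    """Wrap a user query in the prompt template, leaving the response empty for the model to fill."""
    template = "Python Query:\n{python_query}\n\nResponse:\n{response}"
    prompt = template.format(
        python_query=query,
        response="",
    )
    return prompt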
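
Note that the fine-tuning examples built earlier use `Question:` / `Answer:` headers, while `get_prompt` uses `Python Query:` / `Response:`. A fine-tuned model usually responds best to prompts that look like its training examples, so a prompt in the training format may be worth trying. The sketch below compares the two; `get_training_style_prompt` is a hypothetical helper added for illustration and has not been evaluated against the fine-tuned model.

```python
def get_prompt(query: str) -> str:
    """The notebook's template."""
    template = "Python Query:\n{python_query}\n\nResponse:\n{response}"
    return template.format(python_query=query, response="")


def get_training_style_prompt(query: str) -> str:
    """Hypothetical alternative that mirrors the 'Question:/Answer:' training format."""
    return f"Question:\n{query}\n\nAnswer:\n"


query = "How do I reverse a list?"
print(repr(get_prompt(query)))
print(repr(get_training_style_prompt(query)))
```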

## Gemma Language Model Configuration and Training:

- **Limiting Input Sequence Length**: Set the input sequence length to 128 tokens to control memory usage; longer examples are truncated.

- **Optimizer Configuration**:
  - Use AdamW optimizer, a common choice for transformer models.
  - Set learning rate to 5e-5 and weight decay to 0.01.
  - Exclude layernorm and bias terms from weight decay.

- **Model Compilation**:
  - Compile the Gemma Language Model using SparseCategoricalCrossentropy loss function (from logits).
  - Use the configured optimizer.
  - Track model performance using SparseCategoricalAccuracy as the weighted metric.

- **Model Training**:
  - Fit the model using the formatted data with one epoch and a batch size of 1.
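
As a quick sanity check on the training configuration, the number of optimizer steps follows directly from the dataset size, batch size, and epoch count. This small sketch uses the values from this notebook (5,000 examples, batch size 1, one epoch).

```python
import math

num_examples = 5000  # size of the `train` list built above
batch_size = 1
epochs = 1

# Each epoch runs one step per batch; a final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

A larger batch size would reduce the number of steps proportionally, at the cost of more memory per step.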



In [19]:
# Limit the input sequence length to 128 (to control memory usage).
gemma_lm.preprocessor.sequence_length = 128
# Use AdamW (a common optimizer for transformer models).
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
)
# Exclude layernorm and bias terms from decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(train, epochs=1, batch_size=1)

W0000 00:00:1709497271.254890      81 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


5000/5000 ━━━━━━━━━━━━━━━━━━━━ 1064s 205ms/step - loss: 1.4846 - sparse_categorical_accuracy: 0.6553


<keras.src.callbacks.history.History at 0x7a2880131fc0>

# Fine-tuned Model Prediction

In [20]:
prompt = get_prompt("I'm facing issue with smtblib can you help me I couldnt authorize my email account")
print(gemma_lm.generate(prompt, max_length=512))

W0000 00:00:1709498317.688331      26 graph_launch.cc:671] Fallback to op-by-op mode because memset node breaks graph update


Python Query:
I'm facing issue with smtblib can you help me I couldnt authorize my email account

Response:



## Save Fine-Tuned Gemma Language Model:

- Save the fine-tuned Gemma Language Model to a file named "version_finetuned.keras" in order to reuse it directly from the notebook.
- If the model is not saved, you will have to fine-tune it again, i.e. re-run all the code blocks, after restarting the session.

In [21]:
gemma_lm.save("version_finetuned.keras")