<a href="https://colab.research.google.com/github/DenisovAV/flutter_gemma/blob/main/colabs/gemma_lora_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Gemma with LoRA for On-Device Inference While Keeping LoRA Weights Separate from the Base Model

**In case you have any questions, you can contact me via email or LinkedIn:**

Email: denisov.shureg@gmail.com

LinkedIn: https://www.linkedin.com/in/aleks-denisov/

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Introduction

**This workshop teaches you how to fine-tune Gemma on your own dataset using LoRA and convert the resulting weights into a format compatible with MediaPipe for deployment on mobile devices and in web browsers.**

You will learn step by step how to prepare a dataset, apply LoRA fine-tuning to adapt the Gemma model to a specific task, and export the fine-tuned weights to the TensorFlow Lite FlatBuffer format. The tutorial also covers running the model directly on devices without relying on cloud infrastructure, which suits privacy-preserving, cost-efficient applications.

### Select the runtime

To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU or an A100 GPU (recommended, if available):

1. In the upper-right of the Colab window, select ▾ (**Additional connection options**).
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU** or **A100 GPU**.


## Setting Up Environment Variables

Before running the code, add your Hugging Face token (`HF_TOKEN`) to Colab's secrets, which is where `userdata.get` reads from:

1. In the left sidebar of Colab, open the **Secrets** panel (key icon).
2. Add `HF_TOKEN` as the name and your Hugging Face token as the value, and grant this notebook access to it.

The cell below sets two environment variables:

* `HF_TOKEN`: Authenticates your access to Hugging Face resources.
* `WANDB_MODE`: Set to `"offline"` so that Weights & Biases logs stay on the local machine and are not synced to its servers.

In [1]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.

os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["WANDB_MODE"] = "offline"
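Outside Colab, `google.colab` cannot be imported, so the cell above fails at the import. The following is a minimal sketch of a fallback, assuming the token has already been exported as an environment variable on your system. The helper name `get_secret` is hypothetical and not part of any library.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab when available, else from the environment.

    `get_secret` is an illustrative helper, not a library function.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name, default)
    return userdata.get(name)


token = get_secret("HF_TOKEN")
if token is not None:
    os.environ["HF_TOKEN"] = token
os.environ["WANDB_MODE"] = "offline"
```

With this variant the same cell can run both in Colab and on a local machine, as long as the token is available in one of the two places.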



### Installing Required Libraries


This step ensures all necessary libraries are installed for fine-tuning Gemma, exporting weights, and integrating them with MediaPipe.

*What Each Library Does:*

* `transformers`: Provides tools for working with pre-trained models like Gemma.

*  `mediapipe`: Enables deployment and inference of AI models on mobile and web platforms.

*  `bitsandbytes`: Provides memory-efficient optimizers and quantization techniques, helping run large models efficiently on limited hardware.

* `peft`: Simplifies parameter-efficient fine-tuning (LoRA) of large language models.

*  `trl`: Provides trainers for fine-tuning language models, including the `SFTTrainer` used here for supervised fine-tuning as well as reinforcement learning methods.

* `datasets`: Accesses and manages datasets for training and evaluation.

* `fsspec==2024.6.1` and `gcsfs==2024.9.0`: Manage file systems, including cloud-based storage like Google Cloud Storage.

Notes:

* The `fsspec` and `gcsfs` versions are pinned to ensure compatibility with dataset handling and storage.
* The `-q` flag keeps pip's output quiet.
* Once the installation completes, you can move on to the next step.

In [2]:
!pip install -q transformers mediapipe bitsandbytes peft trl datasets fsspec==2024.6.1 gcsfs==2024.9.0
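Because most of these packages are installed without a pinned version, it can help to record which versions were actually resolved. The following is a small sketch using only the standard library; the helper name `installed_versions` is an assumption made for illustration.

```python
from importlib import metadata

PACKAGES = ["transformers", "mediapipe", "bitsandbytes", "peft", "trl",
            "datasets", "fsspec", "gcsfs"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Keeping this output alongside the notebook makes it easier to reproduce a working environment later.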


### Loading and Saving the Pretrained Gemma Model
  



This step demonstrates how to load the Gemma-2B model and tokenizer, then save them locally for use in fine-tuning and deployment.

What This Code Does:


1.   Loads the Tokenizer:
  * Fetches the tokenizer associated with the Gemma-2B model from the Hugging Face Hub.
  * Saves it locally in the /content/gemma2b directory for future use.
2.   Loads the Model:
  * Downloads the Gemma-2B model using the AutoModelForCausalLM class.
  * Loads the model in float16 precision to reduce memory usage while maintaining accuracy.
  * Utilizes device_map="auto" to automatically distribute the model across available devices (e.g., GPU).
  * Saves the model locally in the /content/gemma2b directory for future use.
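To make the float16 memory saving concrete, the weight footprint can be estimated as the parameter count times the bytes per parameter. This is a rough sketch that assumes about 2.5 billion parameters for Gemma-2B and counts the weights only, not activations or the generation cache.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 2.5 billion parameters for Gemma-2B.
num_params = 2.5e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under this assumption, halving the precision halves the memory needed for the weights, which is what makes the model fit on a T4 GPU.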

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model ID for Gemma
model_id = "google/gemma-2b"

# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"

# Save the tokenizer locally
tokenizer.save_pretrained("/content/gemma2b")

# Load the Gemma-2B model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use half-precision to optimize memory usage
    device_map="auto"           # Automatically map the model to available devices
)

# Save the model locally
model.save_pretrained("/content/gemma2b")



`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



### Inference with the Pre-Trained Gemma-2B Model

This step demonstrates how to generate text using the Gemma-2B model. By providing an input prompt, the model will complete the text based on its training.


Notes:

* The call below uses the model's default generation settings. Sampling parameters such as `temperature`, `top_k`, and `top_p`, passed to `model.generate` together with `do_sample=True`, change the style and variety of the generated text. Experiment with these values to achieve your desired output.
* Ensure your environment has GPU support to handle the processing efficiently. If no GPU is available, set `device` to `"cpu"` instead of `"cuda:0"`.

In [8]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Quote: Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces
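To give an intuition for how `temperature`, `top_k`, and `top_p` reshape the next-token distribution, here is a toy sketch in plain Python. It is a simplified illustration on made-up logits, not the implementation used by transformers, and the function name `filtered_probs` is hypothetical.

```python
import math


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Rank tokens from most to least likely.
    order = sorted(probs, key=probs.get, reverse=True)

    # Top-k keeps only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
        mass = sum(probs[i] for i in order)
        probs = {i: probs[i] / mass for i in order}

    # Top-p keeps the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for illustration
print(filtered_probs(toy_logits))
print(filtered_probs(toy_logits, temperature=0.5, top_k=2))
print(filtered_probs(toy_logits, top_p=0.9))
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on the most likely tokens, which makes output more predictable; larger values allow more varied text.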


### Loading and Tokenizing a Dataset for Fine-Tuning

This step loads an existing dataset and defines how each sample is formatted as text for fine-tuning the Gemma-2B model. Tokenization itself happens later, inside the trainer.

What This Code Does:

  * The load_dataset function fetches the dataset "Abirate/english_quotes" from the Hugging Face Datasets Hub.
  * This dataset contains a collection of English quotes, suitable for language modeling or fine-tuning tasks.
  * The formatting_func prepares the dataset samples into the desired format for fine-tuning.


In [4]:
from datasets import load_dataset

# Load a dataset of English quotes
data = load_dataset("Abirate/english_quotes")


def formatting_func(example):
    if isinstance(example["quote"], list):
        return [
            f"Quote: {quote}\nAuthor: {author}<eos>"
            for quote, author in zip(example["quote"], example["author"])
        ]
    return [f"Quote: {example['quote']}\nAuthor: {example['author']}<eos>"]

print(formatting_func(data["train"]))
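To see what the trainer will receive, it can help to run the formatting function on a tiny hand-made input. The sketch below repeats `formatting_func` so that it runs on its own; the quotes and authors are placeholders, not rows from the dataset.

```python
def formatting_func(example):
    # Same logic as the cell above, repeated so the snippet runs on its own.
    if isinstance(example["quote"], list):
        return [
            f"Quote: {quote}\nAuthor: {author}<eos>"
            for quote, author in zip(example["quote"], example["author"])
        ]
    return [f"Quote: {example['quote']}\nAuthor: {example['author']}<eos>"]


# Placeholder values, not taken from the dataset.
single = {"quote": "first placeholder quote", "author": "Author A"}
batch = {
    "quote": ["first placeholder quote", "second placeholder quote"],
    "author": ["Author A", "Author B"],
}

print(formatting_func(single))
print(formatting_func(batch))
```

Both a single example and a batch produce a list of strings, each ending in `<eos>` so the model learns where a sample stops.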





### Configuring and Applying LoRA for Fine-Tuning

This step demonstrates how to configure LoRA (Low-Rank Adaptation) and apply it to the Gemma-2B model for efficient fine-tuning. LoRA is a parameter-efficient fine-tuning technique that allows large models to adapt to new tasks with minimal additional computation.

Parameters Explained:
1. `r=8`: Defines the rank of the low-rank adaptation matrices used in LoRA. A higher value increases the adaptation capacity but requires more memory and computation.
2. `target_modules`: Specifies the layers where LoRA will be applied. Here these are the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the feed-forward projections (`gate_proj`, `up_proj`, `down_proj`) of each transformer block.
3. `task_type="CAUSAL_LM"`: Indicates that the fine-tuning task is causal language modeling, where the model generates text by predicting the next token based on the previous context.

The function `get_peft_model(model, lora_config)` applies the LoRA configuration to the model, wrapping the specified layers with LoRA adapters.

In [5]:
from peft import LoraConfig, get_peft_model

# Configure LoRA settings
lora_config = LoraConfig(
    r=8,  # Rank of the LoRA adaptation matrices
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],  # Layers to apply LoRA
    task_type="CAUSAL_LM",  # Task type: Causal Language Modeling
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
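To see why a small rank keeps fine-tuning cheap: for a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`, instead of the full matrix. The sketch below compares the parameter counts; the 2048 x 2048 layer size is an assumption chosen only for illustration.

```python
def lora_added_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumption: a square 2048 x 2048 projection, chosen only for illustration.
d_in, d_out, r = 2048, 2048, 8
full = d_in * d_out
added = lora_added_params(d_in, d_out, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({100 * added / full:.2f}% of full)")
```

For the real model, calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual trainable and total parameter counts.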

### Fine-Tuning the Model Using SFTTrainer

This code fine-tunes the **Gemma-2B** model with LoRA using the `SFTTrainer` (Supervised Fine-Tuning Trainer) from the trl library. To keep memory usage low, it relies on the paged 8-bit AdamW optimizer provided by bitsandbytes.

Step-by-Step Breakdown:

1. Training Arguments:
  * `per_device_train_batch_size=1`: Sets a small batch size to accommodate large models.
  * `gradient_accumulation_steps=4`: Accumulates gradients over multiple steps to simulate a larger batch size.
  * `warmup_steps=10`: Gradually increases the learning rate over the first 10 steps of training.
  * `max_steps=100`: Limits training to 100 optimizer steps (increase this for longer training jobs).
  * `fp16=True`: Enables mixed precision training to optimize speed and memory usage.
  * `optim="paged_adamw_8bit"`: Uses an 8-bit version of the AdamW optimizer for efficient memory usage, provided by bitsandbytes.
2. Training Process: The SFTTrainer handles the fine-tuning loop, applying LoRA to adapt the model for the specific task.
3. Model Saving: The fine-tuned model is saved to /content/gemma2b/lora for future use.

**Key Features of This Approach:**

Memory Efficiency:
  * The use of LoRA and bitsandbytes allows fine-tuning large models on resource-constrained hardware.
  * The 8-bit AdamW optimizer stores optimizer state in 8 bits, which reduces memory usage, typically with little effect on training quality.

Ease of Use:
 * The SFTTrainer simplifies the training process, handling dataset formatting, optimization, and logging.

Custom Dataset Integration:
  * The formatting_func ensures the dataset aligns with the model's input expectations.
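To show how the training arguments interact, the sketch below computes the effective batch size and the learning rate at a few steps. It uses the values passed to `TrainingArguments` in the training cell and assumes the default linear schedule (linear warmup, then linear decay to zero); both helper functions are illustrative and not part of any library.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from the TrainingArguments in the training cell.
batch = effective_batch_size(per_device_batch=1, accumulation_steps=4)
examples_seen = batch * 100  # max_steps=100
print(f"effective batch size: {batch}")
print(f"examples seen: {examples_seen} of 2508 in the train split")

for step in (0, 5, 10, 55, 100):
    lr = linear_warmup_decay_lr(step, base_lr=2e-4, warmup_steps=10, total_steps=100)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these settings the run covers only a fraction of the dataset, so raising `max_steps` is the first thing to try for a more thorough fine-tune.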

In [6]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Configure the trainer and its training arguments
trainer = SFTTrainer(
    model=model,  # LoRA-enhanced model
    train_dataset=data["train"],  # Training dataset
    args=TrainingArguments(
        per_device_train_batch_size=1,  # Batch size per device
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps
        warmup_steps=10,  # Steps to warm up the learning rate
        max_steps=100,  # Total number of training steps
        learning_rate=2e-4,  # Learning rate
        fp16=True,  # Use mixed precision for faster training
        logging_steps=1,  # Log metrics every step
        output_dir="outputs",  # Directory for saving checkpoints and logs
        optim="paged_adamw_8bit"  # Use 8-bit AdamW optimizer for memory efficiency
    ),
    peft_config=lora_config,  # LoRA configuration
    formatting_func=formatting_func,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.model.save_pretrained("/content/gemma2b/lora")




Step,Training Loss
1,2.5611
2,2.0573
3,2.594
4,2.8486
5,2.3422
6,2.2192
7,2.8707
8,2.1763
9,2.9689
10,2.421
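With a batch size of 1 and a log entry at every step, individual loss values are noisy, so a moving average can make the trend easier to read. The following sketch applies one to the ten values logged above; the window size of 5 is an arbitrary choice.

```python
from statistics import mean

# Training losses for steps 1-10, copied from the log above.
losses = [2.5611, 2.0573, 2.594, 2.8486, 2.3422,
          2.2192, 2.8707, 2.1763, 2.9689, 2.421]


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


smoothed = moving_average(losses, window=5)
print([round(value, 4) for value in smoothed])
```

Over so few steps the smoothed values may not show a clear decrease; a trend usually becomes visible only over a longer stretch of training.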


In [7]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.
Author: Albert Einstein
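After fine-tuning, the model follows the `Quote: ...` / `Author: ...` layout of the training samples, so an application can split the generated text into fields. Here is a minimal sketch; `parse_quote` is a hypothetical helper, and the sample string is the output shown above.

```python
def parse_quote(generated_text):
    """Split generated text into a (quote, author) pair.

    `parse_quote` is an illustrative helper. Either field is None when
    the corresponding line is missing from the text.
    """
    quote, author = None, None
    for line in generated_text.splitlines():
        if line.startswith("Quote:"):
            quote = line[len("Quote:"):].strip()
        elif line.startswith("Author:"):
            author = line[len("Author:"):].strip()
    return quote, author


sample = (
    "Quote: Imagination is more important than knowledge. "
    "Knowledge is limited. Imagination encircles the world.\n"
    "Author: Albert Einstein"
)
print(parse_quote(sample))
```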


### Converting Gemma-2B Checkpoint and LoRA Weights to MediaPipe Format


This step demonstrates how to use MediaPipe to convert a Gemma-2B model checkpoint and its fine-tuned LoRA weights into a TensorFlow Lite (TFLite) format for deployment on mobile devices or browsers.



Configuration Parameters:
  * input_ckpt: Path to the base pre-trained Gemma-2B checkpoint directory (saved using model.save_pretrained).
  * ckpt_format: Specifies the format of the checkpoint files, either "safetensors" or "pytorch".
  * model_type: Identifies the model being converted. Here, "GEMMA_2B" is specified.
  * backend: Backend used for inference. Options include "gpu" (for on-device GPU inference) or "cpu".
  * combine_file_only: If True, skips the weight conversion and only combines previously converted weight files into the final output. Leave it as False to run the full conversion.
  * output_tflite_file: File path for the converted base model in TFLite format.
  * vocab_model_file: Path to the tokenizer model file used for tokenization.
  * output_dir: Directory where all outputs (base model, tokenizer, and additional files) will be saved.
  * lora_ckpt: Path to the fine-tuned LoRA weights.
  * lora_rank: Rank used during LoRA fine-tuning (should match the value used in LoraConfig).
  * lora_output_tflite_file: File path for the converted LoRA weights in TFLite format.

Steps Performed:
1. *Base Model Conversion:* The pre-trained Gemma-2B model is converted to a .bin file in TensorFlow Lite format for deployment.
2. *LoRA Weights Conversion*: Fine-tuned LoRA weights are converted into a lightweight .bin format compatible with MediaPipe's Edge AI capabilities.
3. *Tokenization Support*: The tokenizer vocabulary file is linked for proper text tokenization during inference.

Outputs:
* **Base TFLite File**: /content/output/gemma2b.bin
Contains the converted Gemma-2B model for inference.
* **LoRA TFLite File**: /content/output/lora.bin
Contains the fine-tuned LoRA weights, enabling task-specific inference.
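Since `lora_rank` must match the rank used during fine-tuning, one way to avoid a mismatch is to read it from the saved adapter instead of typing it by hand. PEFT writes an `adapter_config.json` file with an `r` field next to the adapter weights; the sketch below reads it, with `read_lora_rank` being a hypothetical helper name.

```python
import json
import os


def read_lora_rank(adapter_dir):
    """Return the rank `r` stored in a PEFT adapter directory."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as handle:
        return json.load(handle)["r"]


adapter_dir = "/content/gemma2b/lora"
if os.path.isdir(adapter_dir):  # present only after the training cell has run
    print("LoRA rank in adapter:", read_lora_rank(adapter_dir))
```

The returned value can then be passed as `lora_rank` in the conversion configuration.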

In [8]:
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

# Define the configuration for conversion
conversion_config = converter.ConversionConfig(
    input_ckpt="/content/gemma2b",  # Path to the original Gemma-2B checkpoint
    ckpt_format="safetensors",  # Format of the checkpoint (e.g., safetensors)
    model_type="GEMMA_2B",  # Specify the model type (Gemma-2B in this case)
    backend="gpu",  # Specify the backend for inference (e.g., GPU)
    combine_file_only=False,  # Whether to combine all output files into one binary
    output_tflite_file="/content/output/gemma2b.bin",  # Path for the converted base model
    vocab_model_file="/content/gemma2b/tokenizer.model",  # Path to the tokenizer vocab model file
    output_dir="/content/output",  # Directory to save the converted outputs
    lora_ckpt="/content/gemma2b/lora",  # Path to the fine-tuned LoRA checkpoint
    lora_rank=8,  # Rank of the LoRA configuration used during fine-tuning
    lora_output_tflite_file="/content/output/lora.bin"
)

# Convert the checkpoint using MediaPipe
converter.convert_checkpoint(conversion_config)
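After the conversion finishes, a quick check that both output files exist can catch a failed or partial run before you copy anything. This is a small sketch using only the standard library; the helper name `output_sizes_mb` is an assumption made for illustration.

```python
import os


def output_sizes_mb(paths):
    """Map each path to its size in megabytes, or None if the file is missing."""
    sizes = {}
    for path in paths:
        if os.path.isfile(path):
            sizes[path] = os.path.getsize(path) / 1e6
        else:
            sizes[path] = None
    return sizes


expected = ["/content/output/gemma2b.bin", "/content/output/lora.bin"]
for path, size in output_sizes_mb(expected).items():
    print(path, "missing" if size is None else f"{size:.1f} MB")
```

The LoRA file should be far smaller than the base model file, since it contains only the adapter weights.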


### Saving Converted Models to Google Drive


The following step demonstrates how to mount your Google Drive in Colab and copy the converted TFLite models to a specified directory in your Drive for easy access and future use.



In [None]:
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

# Copy the converted base model to Google Drive
!cp /content/output/gemma2b.bin /content/drive/MyDrive/gemma2b-base.bin

# Copy the converted LoRA weights to Google Drive
!cp /content/output/lora.bin /content/drive/MyDrive/gemma2b-lora.bin


Mounted at /content/drive
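The `!cp` commands above fail quietly if a source file is missing. As an alternative, here is a sketch of the same copy done in Python with an explicit check; `copy_to_dir` is a hypothetical helper, and the paths are the ones used in this notebook.

```python
import os
import shutil


def copy_to_dir(src, dst_dir, dst_name=None):
    """Copy `src` into `dst_dir` and return the destination path.

    Raises FileNotFoundError if `src` does not exist.
    """
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    os.makedirs(dst_dir, exist_ok=True)
    destination = os.path.join(dst_dir, dst_name or os.path.basename(src))
    shutil.copy(src, destination)
    return destination


drive_dir = "/content/drive/MyDrive"
if os.path.isdir(drive_dir):  # true only after drive.mount succeeds
    copy_to_dir("/content/output/gemma2b.bin", drive_dir, "gemma2b-base.bin")
    copy_to_dir("/content/output/lora.bin", drive_dir, "gemma2b-lora.bin")
```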
