##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Synthetic data generation with Gemma 2

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Synthetic_data_generation.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>


## Setup

### Runtime Environment

  1. Click the **Run in Google Colab** badge above.
  2. In the menu, go to **Runtime** > **Change runtime type**.
  3. Under **Hardware accelerator**, select **T4 GPU**.


### Hugging Face Hub Access Token

Before diving into the tutorial, let's set up Gemma:

1. **Create a Hugging Face Account**: If you don't have one, you can sign up for a free account [here](https://huggingface.co/join).
2. **Access the Gemma Model**: Visit the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage terms.
3. **Generate a Hugging Face Token**: Go to your Hugging Face [settings page](https://huggingface.co/settings/tokens) and generate a new access token (preferably with `write` permissions). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section, where you'll set up environment variables in your Colab environment.**

### Configure Your Credentials

To access gated models such as Gemma, as well as private models and datasets, you need to log in to the Hugging Face (HF) ecosystem.

If you're using Colab, you can securely store your Hugging Face token (`HF_TOKEN`) using the Colab Secrets Manager:
1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. **Add Hugging Face Token**:
   - Create a new secret with the **name** `HF_TOKEN`.
   - Copy and paste your token key into the **Value** input box for `HF_TOKEN`.
   - **Toggle** the button on the left to allow notebook access to the secret.

This code retrieves your secrets and sets them as environment variables for use later in the tutorial.

In [None]:
import os
import sys

if "google.colab" in sys.modules:
    from google.colab import userdata
    os.environ['HF_TOKEN'] = userdata.get("HF_TOKEN")

if "HF_TOKEN" not in os.environ:
    raise EnvironmentError(
        "The Hugging Face token (HF_TOKEN) could not be found in the "
        "environment variables. This token is required to download the Gemma "
        "models from the Hugging Face Hub. For more information about "
        "HF User Access tokens, please refer to the HF documentation "
        "here: https://huggingface.co/docs/hub/en/security-tokens."
    )

## Synthetic data generation with Gemma 2

### What's distilabel?

Distilabel is a framework designed for generating synthetic data and AI feedback, tailored for engineers working on scalable, reliable pipelines based on verified research. It supports a variety of use cases, including traditional NLP tasks like classification and extraction, as well as generative scenarios like instruction following and dialogue generation. With its programmatic approach, Distilabel accelerates AI development by enabling the rapid creation of high-quality, diverse datasets.

Check out [distilabel's home page](https://distilabel.argilla.io/latest/) to learn more!

### Install dependencies

We only need distilabel, which provides all the required modules.


In [None]:
!pip install -q distilabel==1.4.2


### Initializing Gemma 2 model

In this example, we don't need to initialize Gemma ourselves; this is handled by `distilabel` within the pipeline. The good news is that distilabel is fully compatible with Hugging Face Model Hub, so it will retrieve the model directly! There's no need to set up the model or tokenizer manually.

In [None]:
model_name = "google/gemma-2-2b-it"

### Basic example

In this example, you'll see how to use Gemma on a very small dataset used by `distilabel` for testing. It contains only 10 rows of data, as shown below. The dataset can be explored [here](https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations).


In [None]:
import json
import pandas as pd
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

The following cell defines the execution flow within the pipeline. In our case, it's quite simple:

1. First, load the dataset from the Hugging Face repository.
1. Then, define the task and specify which LLM will generate the new data.
1. Finally, define how the steps are connected, using the `>>` operator.

In [None]:
with Pipeline() as simple_pipeline:
    load_data = LoadDataFromHub(
        output_mappings={"prompt": "instruction"}
    )
    text_generation = TextGeneration(
        llm=TransformersLLM(model=model_name)
    )
    load_data >> text_generation
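
To make the data flow concrete, the sketch below mimics the three steps in plain Python, without distilabel. Everything in it is a hypothetical stand-in: `load_rows`, `rename_columns`, `fake_llm`, and `run_pipeline` are illustrative names, not distilabel APIs, and `fake_llm` returns a canned string instead of calling Gemma. It only illustrates how a mapping like `{"prompt": "instruction"}` renames a column and how the generation step adds a `generation` field to each row.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def load_rows() -> List[Row]:
    """Stand-in for the loading step: returns rows with a `prompt` column."""
    return [
        {"prompt": "Write a poem about the sun and moon."},
        {"prompt": "Generate 3 fun facts about dogs"},
    ]


def rename_columns(rows: List[Row], mappings: Dict[str, str]) -> List[Row]:
    """Rename columns the way an output mapping would: old name -> new name."""
    return [
        {mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


def fake_llm(instruction: str) -> str:
    """Hypothetical placeholder for the model call; nothing is generated."""
    return f"[generated text for: {instruction}]"


def run_pipeline(llm: Callable[[str], str]) -> List[Row]:
    """Load, rename, then generate, mirroring `load_data >> text_generation`."""
    rows = rename_columns(load_rows(), {"prompt": "instruction"})
    return [{**row, "generation": llm(row["instruction"])} for row in rows]


results = run_pipeline(fake_llm)
for row in results:
    print(row["instruction"], "->", row["generation"])
```

In the real pipeline, distilabel performs these steps in batches and also records metadata such as the model name alongside each row.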


Let's run our simple pipeline:


In [None]:
simple_pipeline_output = simple_pipeline.run(
    parameters={
        load_data.name: {
            "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
            "split": "test"
        },
    }
)


In [None]:
# Display the output dataset as Pandas DF
pd.DataFrame(simple_pipeline_output["default"]["train"])

| | instruction | completion | meta | generation | distilabel_metadata | model_name |
|---|---|---|---|---|---|---|
| 0 | `Arianna has 12 chocolates more than Danny. Dan...` | `Denote the number of chocolates each person ha...` | `{'category': 'Question Answering', 'completion...` | `Here's how to solve this problem step-by-step:...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 1 | `Write a plot summary for a comedic novel invol...` | `Elon Musk hires a team of experts to build the...` | `{'category': 'Generation', 'completion': 'Elon...` | `## Spacefaring Shenanigans: A Musk-tastic Tal...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 2 | `Create a 3 turn conversation between a custome...` | `Clerk: How are you doing today?\nCustomer: Gre...` | `{'category': 'Summarization', 'completion': 'C...` | `## Grocery Store Conversation\n\n**Customer:**...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 3 | `Write a poem about the sun and moon.` | `The sun and the moon, the guards from the sky\...` | `{'category': 'Generation', 'completion': 'The ...` | `The Sun, a fiery orb of gold,\nA blazing heart...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 4 | `Does Searle believe that AI can think? Explain...` | `No, Searle does not believe that AI can think....` | `{'category': 'Commonsense/logic', 'completion'...` | `Based on the provided text, John Searle believ...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 5 | `Tell me what the following code does\r\n\r\nim...` | `In short, the code reads in a CSV file contain...` | `{'category': 'Natural language to code', 'comp...` | `This Python code snippet reads data from a CSV...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 6 | `Can you find and correct any logical errors in...` | `The original code counts the number of lowerca...` | `{'category': 'Natural language to code', 'comp...` | `The provided code snippet is actually correct ...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 7 | `I need you to write a resignation letter to my...` | `Hi Albert,\nPlease accept this letter as forma...` | `{'category': 'Brainstorm', 'completion': 'Hi A...` | `[Your Address]\n[City, State, Zip Code]\n[Emai...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 8 | `Joe Biden is the Nth president of the United S...` | `46` | `{'category': 'Commonsense/logic', 'completion'...` | `Joe Biden is the **46th** president of the Uni...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |
| 9 | `Write a four-sentence horror story about sleep...` | `I woke up at 7 am after having nightmares the ...` | `{'category': 'Generation', 'completion': 'I wo...` | `The clock ticked, each second echoing the fran...` | `{'raw_input_text_generation_0': [{'content': '...` | google/gemma-2-2b-it |


As you can see, a new column, `generation`, has been added to the dataset. This column contains the newly generated data.
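
A common next step is to keep only the prompt and the generated text and save them for later use, for example as fine-tuning data. The sketch below is one way to do that with the standard library. The helper `to_jsonl` and the sample `rows` are hypothetical; in the notebook you would iterate over `simple_pipeline_output["default"]["train"]` instead of the hard-coded list.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(rows: Iterable[Dict[str, object]], keep: List[str]) -> str:
    """Serialize the selected columns of each row as one JSON object per line."""
    lines = []
    for row in rows:
        # Skip rows with an empty or missing generation.
        if not row.get("generation"):
            continue
        lines.append(json.dumps({key: row[key] for key in keep}, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical sample rows shaped like the pipeline output above.
rows = [
    {
        "instruction": "Write a poem about the sun and moon.",
        "generation": "The Sun, a fiery orb of gold, ...",
        "model_name": "google/gemma-2-2b-it",
    },
    {
        "instruction": "An instruction with no generated text",
        "generation": None,
        "model_name": "google/gemma-2-2b-it",
    },
]

jsonl = to_jsonl(rows, keep=["instruction", "generation"])
print(jsonl)
```

Filtering out empty generations before saving keeps incomplete rows from ending up in a downstream training set.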

### Create your own dataset and pipeline
It’s always better to apply the tool to real-life examples. In the following cells, we will explore how to generate facts about various items. These items will be provided as a list.

In [None]:
from typing import List

def prepare_data(subjects: List[str], num_of_fact: int = 3):
    general_prompt = f"Generate {num_of_fact} fun facts about "
    return [{"instruction": general_prompt + subject} for subject in subjects]

data = prepare_data(["dogs", "cats", "birds"])
data

[{'instruction': 'Generate 3 fun facts about dogs'},
 {'instruction': 'Generate 3 fun facts about cats'},
 {'instruction': 'Generate 3 fun facts about birds'}]
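
If you want more varied prompts than a single template provides, one option is to combine each subject with several phrasings. The sketch below is an assumption about how you might extend `prepare_data`; `prepare_varied_data` and the example templates are hypothetical and not part of distilabel.

```python
from itertools import product
from typing import Dict, List


def prepare_varied_data(subjects: List[str], templates: List[str]) -> List[Dict[str, str]]:
    """Build one instruction per (template, subject) pair."""
    return [
        {"instruction": template.format(subject=subject)}
        for template, subject in product(templates, subjects)
    ]


# Hypothetical templates; each must contain a {subject} placeholder.
templates = [
    "Generate 3 fun facts about {subject}",
    "Write a two-sentence description of {subject} for children",
]
varied_data = prepare_varied_data(["dogs", "cats"], templates)
for item in varied_data:
    print(item["instruction"])
```

The resulting list has the same shape as `data`, so it can be passed to `LoadDataFromDicts` in the same way.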

Here, we define our pipeline. It is similar to the previous one, with two differences: the input comes from our list of dictionaries, so `run()` needs no runtime parameters, and `max_new_tokens` is set to 256 through `generation_kwargs` to allow longer outputs.

In [None]:
with Pipeline() as custom_pipeline:
    load_data = LoadDataFromDicts(name="load_data", data=data)
    text_generation = TextGeneration(
        llm=TransformersLLM(
            model=model_name,
            generation_kwargs={"max_new_tokens": 256}
        )
    )
    load_data >> text_generation

custom_output = custom_pipeline.run()


Let's see the results:

In [None]:
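from IPython.display import display, Markdown

# Render each prompt in bold, followed by the model's response.
output = []
for item in custom_output["default"]["train"]:
    prompt, response = item["instruction"], item["generation"]
    # The empty string adds a blank line so "---" renders as a horizontal
    # rule instead of turning the preceding line into a heading.
    output.extend([f"**{prompt}**\n\n", response, "", "---"])
display(Markdown("\n".join(output)))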

**Generate 3 fun facts about dogs**


Here are three fun facts about dogs:

1. **Dogs can recognize themselves in mirrors!**  This is a pretty impressive feat, and scientists believe it's linked to their social intelligence and ability to understand their own reflection as a separate entity. 
2. **A dog's sense of smell is 10,000 times stronger than a human's.** That means they can detect scents we can barely even perceive! This incredible sense helps them track prey, find lost people, and navigate the world around them.
3. **Dogs have different "barking" languages.** Just like humans have dialects, dogs have unique barks that convey different emotions and intentions. Some breeds even have distinct vocalizations for specific situations or commands.


Let me know if you want more fun dog facts! 🐶 

---
**Generate 3 fun facts about cats**


Here are three fun facts about cats:

1. **Cats have a third eyelid!**  It's called the nictitating membrane and it acts like a built-in shield, protecting their eyes from dust, debris, and even scratches. It also helps keep their eyes moist. Pretty cool, right?
2. **Cats can purr at different frequencies.** While we often associate purring with contentment, cats actually purr for various reasons, including healing themselves, communicating with each other, and even regulating their blood pressure. Some purrs are high-pitched and others are deep and rumbling – each one has its own purpose!
3. **Cats have amazing night vision.** Thanks to special rod cells in their retinas, cats can see much better in low light than humans. They can even see ultraviolet light, which is invisible to us! This makes them excellent hunters, especially at dusk and dawn.


Let me know if you want more cat facts! 😻 

---
**Generate 3 fun facts about birds**


Here are three fun facts about birds:

1. **Birds can't taste sweet things!**  While they have taste buds, their sense of smell is much stronger than their sense of taste. This means they rely more on scent to find food and mates. 
2. **Some birds can change their feathers in a matter of hours.**  This isn't just for show! Birds use their feathers for camouflage, insulation, and even communication. They can molt (shed old feathers) and grow new ones quickly depending on the season or environmental changes.
3. **Hummingbirds are the only birds that can fly backwards.** Their unique wing structure allows them to rotate their wings at different angles, giving them incredible maneuverability.


Let me know if you want some more bird trivia! 🐦 

---
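
Because the model returned each fact as a numbered Markdown item with a bold headline, the generations can be split into structured records. The sketch below is a minimal, regex-based approach that assumes this exact `1. **headline** details` layout. The helper `extract_facts` is hypothetical, the sample text is a shortened version of the output above, and real model output may deviate from the pattern, so unmatched text is worth inspecting rather than discarding silently.

```python
import re
from typing import Dict, List

# Matches lines such as: "1. **Headline.** Details..."
FACT_PATTERN = re.compile(
    r"^[ \t]*(\d+)\.[ \t]+\*\*(.+?)\*\*[ \t]*(.*)$", re.MULTILINE
)


def extract_facts(generation: str) -> List[Dict[str, str]]:
    """Return one record per numbered, bold-headlined item in the text."""
    return [
        {"number": number, "headline": headline.strip(), "details": details.strip()}
        for number, headline, details in FACT_PATTERN.findall(generation)
    ]


# Shortened sample in the same layout as the generations above.
sample = (
    "Here are three fun facts about cats:\n\n"
    "1. **Cats have a third eyelid!**  It's called the nictitating membrane.\n"
    "2. **Cats can purr at different frequencies.** Some purrs are high-pitched.\n"
    "3. **Cats have amazing night vision.** They see well in low light.\n\n"
    "Let me know if you want more cat facts!"
)

facts = extract_facts(sample)
for fact in facts:
    print(fact["number"], fact["headline"])
```

Generated facts are not guaranteed to be accurate, so a structured form like this also makes it easier to review or filter individual items before using them.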

## What's Next?

That's it! If you want to take your synthetic data pipelines further, check out the following resources:

- **Explore the Gemma family models:** Visit [Gemma Open Models](https://ai.google.dev/gemma) to learn about the latest updates regarding the Gemma family models, new capabilities, versions, and more.
- **Check out Distilabel's Components Gallery:** The [components gallery](https://distilabel.argilla.io/latest/components-gallery/) showcases various modules that can be used within a pipeline to achieve the desired results.