# Hands-on: Synthetic Dataset Creation - Topic Classification

**Objective:** Use the Gemini API to generate **structured (JSON) output** and build a labeled dataset for a custom topic classification task.

**Why Synthetic Data?**
* **Speed:** Quickly generate thousands of labeled examples without manual annotation.
* **Coverage:** Easily cover specific topics and edge cases your initial data might be missing.
* **Format:** Ensure the data is immediately ready for fine-tuning or training, saving preprocessing time.

**Workflow:**
1.  **Setup:** Configure the Gemini API client.
2.  **Schema & Prompts:** Define the output structure (JSON) and the classification categories.
3.  **Generation:** Use Gemini to create synthetic input texts and their correct labels.
4.  **Output:** Save the resulting data as a `.jsonl` file, ready for model training.

## 1. Setup & API Credentials
We use the `google-genai` SDK. The API key is read from the Colab secrets manager (`userdata.get('GEMINI_API_KEY')`) so that it never appears in the notebook source.

In [5]:
from google import genai
from google.colab import userdata

# Read the key from Colab's secrets manager rather than hard-coding it.
google_api = userdata.get('GEMINI_API_KEY')
client = genai.Client(api_key=google_api)
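
Outside Colab, `google.colab` cannot be imported. A minimal sketch of a fallback, assuming the key is also exported as a `GEMINI_API_KEY` environment variable (the helper name `resolve_api_key` is hypothetical and not part of any library):

```python
import os


def resolve_api_key(name="GEMINI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

The returned value can be passed to `genai.Client(api_key=...)` exactly as in the cell above.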


## 2. Creating the Seed Data Foundation for Topic Classification
Seed data is a small, manually labeled set that defines the scope of the classification problem. Here it fixes the label set and provides one reference example per category.

In [1]:
import pandas as pd
seed_data = [
    {"text": "Apple released a new M4 chip for laptops.", "label": "technology"},
    {"text": "The stock market saw a sharp decline today.", "label": "finance"},
    {"text": "A new study shows the benefits of daily walking.", "label": "health"},
    {"text": "Manchester United secured a last-minute win.", "label": "sports"},
    {"text": "Here’s the best recipe for homemade lasagna.", "label": "food"},
]
df_seed = pd.DataFrame(seed_data)
df_seed

| | text | label |
|---|---|---|
| 0 | Apple released a new M4 chip for laptops. | technology |
| 1 | The stock market saw a sharp decline today. | finance |
| 2 | A new study shows the benefits of daily walking. | health |
| 3 | Manchester United secured a last-minute win. | sports |
| 4 | Here’s the best recipe for homemade lasagna. | food |
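
Before generating anything, it can help to confirm that the seed labels and the intended topic list agree. The sketch below assumes the five topics used in this notebook; `check_seed_coverage` is a hypothetical helper.

```python
from collections import Counter

topics = ["technology", "health", "sports", "finance", "food"]
seed_data = [
    {"text": "Apple released a new M4 chip for laptops.", "label": "technology"},
    {"text": "The stock market saw a sharp decline today.", "label": "finance"},
    {"text": "A new study shows the benefits of daily walking.", "label": "health"},
    {"text": "Manchester United secured a last-minute win.", "label": "sports"},
    {"text": "Here’s the best recipe for homemade lasagna.", "label": "food"},
]


def check_seed_coverage(rows, allowed):
    """Return (labels not in the allowed list, allowed topics with no seed row)."""
    counts = Counter(row["label"] for row in rows)
    unknown = sorted(set(counts) - set(allowed))
    missing = sorted(set(allowed) - set(counts))
    return unknown, missing


unknown, missing = check_seed_coverage(seed_data, topics)
print("unknown labels:", unknown)
print("topics without a seed example:", missing)
```

Both lists are empty for the seed set above, since each topic appears exactly once.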


## 3. Generating Synthetic Topic Examples with an LLM

An LLM can generate synthetic data for a topic classification task by producing text samples together with their labels. The helpers below extract the response text and strip any markdown fencing before the JSON is parsed.

In [11]:
import json
import re
import pandas as pd

topics = ["technology", "health", "sports", "finance", "food"]

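def extract_text(response):
    """Return the text of the first candidate, or None if the response has none."""
    try:
        return response.candidates[0].content.parts[0].text
    except (AttributeError, IndexError, TypeError):
        # Blocked or empty responses may lack candidates, content, or parts.
        return None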


def clean_json(raw):
    """Strip markdown fences and surrounding text, returning the JSON object as a string."""
    if raw is None:
        return None

    # Extract the content inside ```json ... ``` or ``` ... ``` if fences are present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    if match:
        raw = match.group(1).strip()

    # Drop any text before or after the JSON object.
    # The greedy match spans from the first { to the last }.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        raw = match.group(0)

    return raw
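
The cleanup above is needed because the model may wrap its JSON in markdown fences. The `google-genai` SDK also accepts `response_mime_type` and `response_schema` in the request config, which asks the API to return JSON that matches a schema. The following is a minimal sketch rather than the notebook's method: it assumes the dict form of the config and schema, and `generate_structured`, `EXAMPLE_SCHEMA`, and the prompt wording are hypothetical.

```python
import json

TOPICS = ["technology", "health", "sports", "finance", "food"]

# Schema for one example; the enum restricts 'label' to the known topics.
EXAMPLE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "text": {"type": "STRING"},
        "label": {"type": "STRING", "enum": TOPICS},
    },
    "required": ["text", "label"],
}


def generate_structured(client, model="gemini-2.0-flash"):
    """Request one labeled example as schema-constrained JSON (sketch)."""
    response = client.models.generate_content(
        model=model,
        contents="Generate one short news-style sentence and its topic label.",
        config={
            "response_mime_type": "application/json",
            "response_schema": EXAMPLE_SCHEMA,
        },
    )
    # response.text holds the JSON string returned by the API.
    return json.loads(response.text)
```

A schema constrains the format, not the content, so the validation and deduplication steps in section 4 still apply.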



In [15]:

def generate_synth():
    """Ask the model for one labeled example; return it as a dict, or None on failure."""
    prompt = (
        "Generate ONE JSON object with fields 'text' and 'label'. "
        f"Label MUST be exactly one of: {topics}. "
        "Output ONLY JSON. No explanations. No markdown."
    )

    out = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )

    raw = extract_text(out)
    print("RAW MODEL OUTPUT:\n", raw)

    cleaned = clean_json(raw)
    if cleaned is None:
        # Nothing to parse (empty or blocked response).
        return None

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as e:
        print("JSON decode failed:", e)
        print("CLEANED JSON:", cleaned)
        return None


synthetic = []
for _ in range(5):
    r = generate_synth()
    if r:
        synthetic.append(r)

df_synth = pd.DataFrame(synthetic)
df_synth


RAW MODEL OUTPUT:
 ```json
{"text": "The latest iPhone boasts a revolutionary new camera and processor.", "label": "technology"}
```
RAW MODEL OUTPUT:
 ```json
{"text": "The company's stock price soared after the announcement of the new product.", "label": "finance"}
```

RAW MODEL OUTPUT:
 ```json
{"text": "The stock market experienced a significant downturn today, with major indices reporting losses across the board.", "label": "finance"}
```
RAW MODEL OUTPUT:
 ```json
{"text": "The stock market experienced a significant downturn today, with the Dow Jones Industrial Average falling by over 500 points.", "label": "finance"}
```
RAW MODEL OUTPUT:
 ```json
{"text": "Apple unveils the new iPhone 15 with improved camera and processor.", "label": "technology"}
```


| | text | label |
|---|---|---|
| 0 | The latest iPhone boasts a revolutionary new c... | technology |
| 1 | The company's stock price soared after the ann... | finance |
| 2 | The stock market experienced a significant dow... | finance |
| 3 | The stock market experienced a significant dow... | finance |
| 4 | Apple unveils the new iPhone 15 with improved ... | technology |
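
The five generations above cover only `technology` and `finance`, and two of them are near paraphrases. One possible way to spread the output across categories is to request a specific topic per call and show the model seed examples to avoid repeating. This is a sketch under those assumptions; `build_prompt` and `seed_by_topic` are hypothetical names.

```python
import json


def build_prompt(topic, examples):
    """Build a prompt that requests one new example for a single topic."""
    shots = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
    return (
        "Generate ONE JSON object with fields 'text' and 'label'. "
        f"The label MUST be '{topic}'. "
        "Write a new sentence that differs in subject and wording from these examples:\n"
        f"{shots}\n"
        "Output ONLY JSON. No explanations. No markdown."
    )


seed_by_topic = {
    "health": [{"text": "A new study shows the benefits of daily walking.", "label": "health"}],
    "sports": [{"text": "Manchester United secured a last-minute win.", "label": "sports"}],
}

# One prompt per topic, so each category is requested explicitly.
prompts = {topic: build_prompt(topic, shots) for topic, shots in seed_by_topic.items()}
print(prompts["health"])
```

Each prompt string can be passed as `contents` to `client.models.generate_content` in place of the single generic prompt.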


## 4. Validate the synthetic dataset (cleaning)

1. Remove empty or malformed rows:

In [18]:
valid_topics = set(topics)

df_synth = df_synth[
    df_synth["label"].isin(valid_topics) &
    df_synth["text"].notnull() &
    (df_synth["text"].str.len() > 10)
]
df_synth

| | text | label |
|---|---|---|
| 0 | The latest iPhone boasts a revolutionary new c... | technology |
| 1 | The company's stock price soared after the ann... | finance |
| 2 | The stock market experienced a significant dow... | finance |
| 3 | The stock market experienced a significant dow... | finance |
| 4 | Apple unveils the new iPhone 15 with improved ... | technology |


2. Remove duplicates & near-duplicates

In [19]:
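def jaccard(a, b):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    A = set(a.lower().split())
    B = set(b.lower().split())
    return len(A & B) / max(1, len(A | B))

clean_texts = []
keep_idx = []

# Keep a row only if it is not too similar to any row already kept.
for idx, text in df_synth["text"].items():
    if all(jaccard(text, t) < 0.6 for t in clean_texts):
        clean_texts.append(text)
        keep_idx.append(idx)

# Select by index so that exact duplicates are dropped as well
# (filtering with isin(clean_texts) would keep every copy of a kept text).
df_synth_clean = df_synth.loc[keep_idx]
df_synth_clean.shape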


(5, 2)
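
The shape is unchanged because no pair of texts reached the 0.6 threshold. As a worked check, the two stock-market sentences from the raw output share 9 of their 24 distinct whitespace tokens, so they score well below the cutoff even though they read as paraphrases:

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    A = set(a.lower().split())
    B = set(b.lower().split())
    return len(A & B) / max(1, len(A | B))


s1 = ("The stock market experienced a significant downturn today, "
      "with major indices reporting losses across the board.")
s2 = ("The stock market experienced a significant downturn today, "
      "with the Dow Jones Industrial Average falling by over 500 points.")

score = jaccard(s1, s2)
print(round(score, 3))  # 0.375 (9 shared tokens / 24 distinct tokens)
```

Whether such a pair should be removed is a judgment call; a lower threshold would catch it at the cost of discarding more legitimately distinct examples.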

## 5. Combine manual + synthetic datasets

In [20]:
df_full = pd.concat([df_seed, df_synth_clean], ignore_index=True)
df_full.sample(5)


| | text | label |
|---|---|---|
| 4 | Here’s the best recipe for homemade lasagna. | food |
| 3 | Manchester United secured a last-minute win. | sports |
| 8 | The stock market experienced a significant dow... | finance |
| 6 | The company's stock price soared after the ann... | finance |
| 5 | The latest iPhone boasts a revolutionary new c... | technology |
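
A random sample does not show how the labels are distributed. Counting them makes the imbalance visible: with the five seed rows and the five synthetic rows from this run, `finance` appears 4 times, `technology` 3 times, and the other topics once each. The sketch below reproduces that count from the labels shown in this notebook.

```python
from collections import Counter

# Labels as they appear in the seed set and in the five synthetic rows of this run.
seed_labels = ["technology", "finance", "health", "sports", "food"]
synth_labels = ["technology", "finance", "finance", "finance", "technology"]

counts = Counter(seed_labels + synth_labels)
for label, n in counts.most_common():
    print(f"{label:<12}{n}")
```

With a DataFrame, `df_full["label"].value_counts()` gives the same information.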


## 6. Convert to instruction-tuning format

In [21]:
df_full["input_text"] = df_full["text"].apply(
    lambda x: f"Classify the topic of the following text:\n\n{x}\n\nTopic:"
)

df_full["target_text"] = df_full["label"]

df_train = df_full[["input_text", "target_text"]]
df_train.head()


| | input_text | target_text |
|---|---|---|
| 0 | Classify the topic of the following text:\n\nA... | technology |
| 1 | Classify the topic of the following text:\n\nT... | finance |
| 2 | Classify the topic of the following text:\n\nA... | health |
| 3 | Classify the topic of the following text:\n\nM... | sports |
| 4 | Classify the topic of the following text:\n\nH... | food |
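
When a model trained on this format is used later, new inputs need the same template, and the generated text has to be mapped back to a known label. A minimal sketch, assuming the template above; `format_input` and `parse_label` are hypothetical helpers, and the normalization rules (trim, lowercase, drop a trailing period) are assumptions.

```python
TOPICS = ["technology", "health", "sports", "finance", "food"]
TEMPLATE = "Classify the topic of the following text:\n\n{text}\n\nTopic:"


def format_input(text):
    """Wrap raw text in the same prompt template used for training."""
    return TEMPLATE.format(text=text)


def parse_label(generated, allowed=TOPICS):
    """Map model output to a known label, or None if it is not recognized."""
    candidate = generated.strip().lower().rstrip(".")
    return candidate if candidate in allowed else None


print(format_input("Manchester United secured a last-minute win."))
print(parse_label(" Sports.\n"))  # sports
print(parse_label("weather"))     # None
```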


## 7. Save the final dataset

In [22]:
df_train.to_json("topic_classification_dataset.jsonl", orient="records", lines=True)
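
It is worth reading the file back before training on it. Note that `to_json` escapes non-ASCII characters by default (`force_ascii=True`), so the apostrophe in "Here’s" is stored as `\u2019`; any JSON parser restores it on load. The sketch below checks a `.jsonl` file line by line using only the standard library; `load_jsonl` is a hypothetical helper, and the demo writes a temporary file rather than touching the real dataset.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a .jsonl file and check that every line has the expected string fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate a trailing blank line
            record = json.loads(line)
            for key in ("input_text", "target_text"):
                if not isinstance(record.get(key), str):
                    raise ValueError(f"line {line_no}: missing or non-string '{key}'")
            records.append(record)
    return records


# Demo on a temporary file holding one record in the notebook's format.
sample = {
    "input_text": "Classify the topic of the following text:\n\n"
                  "Here’s the best recipe for homemade lasagna.\n\nTopic:",
    "target_text": "food",
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    loaded = load_jsonl(path)

print(len(loaded), loaded[0]["target_text"])
```

Calling `load_jsonl("topic_classification_dataset.jsonl")` applies the same check to the file written above.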


## Next Steps
The dataset can now be used as follows:
1.  **Fine-Tuning:** Upload `topic_classification_dataset.jsonl` to the Gemini Fine-Tuning platform, or load it with a library such as Hugging Face (for example, in the Qwen fine-tuning notebook we discussed earlier).
2.  **Validation:** Before training, always manually inspect a larger sample of the generated data to confirm that its quality and topic relevance are high.
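
For the manual inspection step, drawing a fixed number of rows per label keeps rare topics from being overlooked. A minimal sketch using only the standard library; `sample_per_label` is a hypothetical helper, and the rows are taken from the examples in this notebook.

```python
import random
from collections import defaultdict


def sample_per_label(rows, k=2, seed=0):
    """Draw up to k rows per label for manual review, reproducibly."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):
        group = by_label[label]
        picked.extend(rng.sample(group, min(k, len(group))))
    return picked


rows = [
    {"text": "Apple released a new M4 chip for laptops.", "label": "technology"},
    {"text": "The stock market saw a sharp decline today.", "label": "finance"},
    {"text": "The company's stock price soared after the announcement of the new product.", "label": "finance"},
    {"text": "Apple unveils the new iPhone 15 with improved camera and processor.", "label": "technology"},
    {"text": "Manchester United secured a last-minute win.", "label": "sports"},
]

for row in sample_per_label(rows, k=1):
    print(row["label"], "|", row["text"])
```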