## **Synthetic User Input & Intent Dataset Generator** using **Gemini Flash**
### Produces **10 synthetic rows per request**, covers **e-commerce intents** (policy, order, product search, general, off-topic, etc.), and generates **messy, lazy, typo-filled, donkey-style user inputs**

In [1]:
!pip install google-genai

import os
import json
import re
from google import genai
from google.genai import types



In [2]:
BASE_PROMPT = """
Generate 10 synthetic e-commerce user messages.

Return STRICTLY a JSON array like this:
[
  {"text": "...", "intent": "..."},
  ...
]

NO commentary. NO explanation. NO markdown. ONLY JSON.

INTENT CATEGORIES:
- pricing
- product_availability
- order_status
- refund_policy
- return_policy
- policy_info
- faq
- product_search
- product_sourcing
- trend_analysis
- greeting
- off_topic
- other

User text rules:
- Include typos, slang, lazy short inputs
- Angry customers
- Donkey-level nonsense
- Polite users
- Broken grammar
- Some very short, some longer
"""
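
The prompt asks for a fixed schema (`text` plus one of the listed intents), but nothing in the notebook checks that the model actually follows it. A minimal sketch of such a check is shown below. The names `ALLOWED_INTENTS` and `validate_rows` are hypothetical helpers, not part of the notebook, and the sample rows are made up for illustration.

```python
# Hypothetical helper (not in the notebook): checks rows against the schema
# that BASE_PROMPT requests.
ALLOWED_INTENTS = {
    "pricing", "product_availability", "order_status", "refund_policy",
    "return_policy", "policy_info", "faq", "product_search",
    "product_sourcing", "trend_analysis", "greeting", "off_topic", "other",
}


def validate_rows(rows):
    """Split rows into (valid, rejected) lists based on the requested schema."""
    valid, rejected = [], []
    for row in rows:
        ok = (
            isinstance(row, dict)
            and isinstance(row.get("text"), str)
            and row["text"].strip() != ""
            and isinstance(row.get("intent"), str)
            and row["intent"] in ALLOWED_INTENTS
        )
        (valid if ok else rejected).append(row)
    return valid, rejected


# Illustrative sample rows (made up for this sketch).
sample = [
    {"text": "red shirt 2xl", "intent": "product_search"},
    {"text": "hi", "intent": "smalltalk"},   # intent not in the prompt's list
    {"text": "   ", "intent": "greeting"},   # empty after stripping
]
valid, rejected = validate_rows(sample)
print(len(valid), len(rejected))
```

Keeping the rejected rows, rather than silently dropping them, makes it easier to see how often the model drifts from the category list.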



In [3]:
def extract_json(text):
    """
    Extracts the FIRST JSON array found in the text safely.
    Works even if Gemini adds text before or after the JSON.
    """
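    # Non-greedy match from the first "[{" to the first "}]". This can stop early
    # if a message's text itself contains "}]", in which case parsing fails below.
    pattern = r'\[\s*{.*?}\s*\]'
    match = re.search(pattern, text, flags=re.DOTALL)
    if not match:
        return None
    json_str = match.group(0)
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return None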
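
Because the regex in `extract_json` is non-greedy, it ends at the first `}` followed by `]`, even when that sequence sits inside a message string. One possible alternative, sketched here, lets the standard library's `json.JSONDecoder.raw_decode` find where the array really ends. The function name `extract_first_json_array` and the sample string are hypothetical.

```python
import json


def extract_first_json_array(text):
    """Return the first JSON array of objects that parses from `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            return value
        # Not a usable array here; try the next "[" in the text.
        start = text.find("[", start + 1)
    return None


# Made-up model output with commentary and a "}]" inside a message string.
raw = 'Sure! [note] here it is:\n[{"text": "why is it }] broken", "intent": "other"}]\nDone.'
rows = extract_first_json_array(raw)
print(rows)
```

This sketch trades the single regex for a scan over every `[`, which is slower on long outputs but does not depend on what the message text contains.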


In [6]:
def generate_synthetic_rows():
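    # Read the key from the environment when available; the GEMINI_API_KEY
    # variable name is an assumption, and the placeholder is kept as a fallback.
    client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY", "your_gemini_api_key"))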
    model = "gemini-flash-latest"

    contents = [
        types.Content(
            role="user",
            parts=[types.Part.from_text(text=BASE_PROMPT)],
        )
    ]

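    generate_content_config = types.GenerateContentConfig(
        # Enables Google Search grounding for the request.
        tools=[types.Tool(google_search=types.GoogleSearch())],
        # A thinking budget of -1 lets the model decide how much to think.
        thinking_config=types.ThinkingConfig(thinking_budget=-1),
    )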

    full_output = ""

    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        if chunk.text:
            full_output += chunk.text

    data = extract_json(full_output)

    if data is None:
        print("[WARNING] Gemini returned INVALID JSON.\nRAW OUTPUT:\n")
        print(full_output)
        return None

    return data
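
Calls to a hosted model can fail transiently, so `generate_synthetic_rows()` is a natural candidate for a retry wrapper. The sketch below shows exponential backoff with a little jitter. `call_with_backoff`, its parameters, and the `flaky` stand-in are all hypothetical and do not touch the Gemini SDK; in this notebook one might pass `retry_on=(ServerError,)` once that class has been imported from `google.genai.errors`.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo with a stand-in function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, retry_on=(RuntimeError,), sleep=waits.append)
print(result, attempts["count"], len(waits))
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.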

In [7]:
# Run in Colab
rows = generate_synthetic_rows()
if rows:
    print(json.dumps(rows, indent=2, ensure_ascii=False))
else:
    print("Failed to extract JSON.")


[
  {
    "text": "Wheres my stuff it said it wood be here yesterday im pissed",
    "intent": "order_status"
  },
  {
    "text": "Hello, could you please tell me when the blue size 9 sneakers will be back in stock? Thank you.",
    "intent": "product_availability"
  },
  {
    "text": "red shirt 2xl",
    "intent": "product_search"
  },
  {
    "text": "Is this laptop price is best deal or should I wait for sale next month?",
    "intent": "pricing"
  },
  {
    "text": "yo can i retun this jacket its to big and i lost the recept lmao",
    "intent": "return_policy"
  },
  {
    "text": "My cat keeps trying to eat the TV remote. Do you sell anti-cat repellent for electronics?",
    "intent": "off_topic"
  },
  {
    "text": "what is your privacy statement",
    "intent": "policy_info"
  },
  {
    "text": "I bought a blender but the warranty card wasn't in the box. How do I register it, and what does the standard warranty cover for this specific model?",
    "intent": "faq"
  },
  {


In [11]:
import time
import pandas as pd
from google.genai.errors import ServerError  # raised by the SDK for server-side (5xx) failures

all_synthetic_data = []

for i in range(100):
    print(f"Generating synthetic rows (Iteration {i + 1}/100)...")

    retries = 0
    max_retries = 5 # Set a maximum number of retries
    while retries < max_retries:
        try:
            new_rows = generate_synthetic_rows()
            if new_rows:
                all_synthetic_data.extend(new_rows)
                print(f"Iteration {i + 1} completed. Total rows collected: {len(all_synthetic_data)}.")
                break # Exit retry loop on success
            else:
                print(f"Iteration {i + 1} failed to generate valid data (JSON extraction issue).")
                break # Exit retry loop if data is invalid but no server error
        except ServerError as e:
            retries += 1
            print(f"[ERROR] ServerError encountered in Iteration {i + 1}: {e}")
            print(f"Retrying in 20 seconds... (Attempt {retries}/{max_retries})")
            time.sleep(20)
        except Exception as e:
            print(f"[ERROR] An unexpected error occurred in Iteration {i + 1}: {e}")
            break # Exit retry loop for other unexpected errors

    if retries == max_retries:
        print(f"[WARNING] Max retries reached for Iteration {i + 1}. Skipping this iteration.")

    # Always pause after an iteration, regardless of success or failure, to respect rate limits
    print("Pausing for 20 seconds to adhere to rate limits...")
    time.sleep(20)

print(f"\nFinished generating all synthetic data. Total entries: {len(all_synthetic_data)}")
print("First 5 entries of collected data:")
print(json.dumps(all_synthetic_data[:5], indent=2, ensure_ascii=False))

# Convert to DataFrame and save to CSV
df_synthetic = pd.DataFrame(all_synthetic_data)
df_synthetic.to_csv("synthetic_ecommerce_data.csv", index=False)
print("\nData saved to synthetic_ecommerce_data.csv")
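
Since every request reuses the same prompt, repeated or near-identical messages are likely across the 100 iterations, and the loop above does not remove them. A minimal de-duplication sketch in plain Python follows. `dedupe_rows` is a hypothetical helper, and its normalization rule (lowercase, collapsed whitespace) is one possible choice among many.

```python
def dedupe_rows(rows):
    """Drop rows whose text repeats after lowercasing and collapsing whitespace.

    Rows with empty text are dropped as well.
    """
    seen = set()
    unique = []
    for row in rows:
        key = " ".join(str(row.get("text", "")).lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(row)  # keep the first occurrence unchanged
    return unique


# Made-up rows: the first two differ only in case and spacing.
sample = [
    {"text": "red shirt 2xl", "intent": "product_search"},
    {"text": "Red  shirt 2XL ", "intent": "product_search"},
    {"text": "wheres my order", "intent": "order_status"},
]
unique_rows = dedupe_rows(sample)
print(len(unique_rows))
```

Applied to `all_synthetic_data` before building the DataFrame, a step like this would make a "unique entries" count meaningful.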

Generating synthetic rows (Iteration 1/100)...
Iteration 1 completed. Total rows collected: 10.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 2/100)...
Iteration 2 completed. Total rows collected: 20.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 3/100)...
Iteration 3 completed. Total rows collected: 30.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 4/100)...
Iteration 4 completed. Total rows collected: 40.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 5/100)...
Iteration 5 completed. Total rows collected: 50.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 6/100)...
Iteration 6 completed. Total rows collected: 60.
Pausing for 20 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 7/100)...
Iteration 7 completed. Total rows collected: 70.
Pausing for 20 seconds