## **Synthetic User Input & Intent Dataset Generator** using **Gemini Flash**
### Produces **10 synthetic rows per request**, covers **e-commerce intents** (policy, order, product search, general, off-topic, etc.), and generates **messy, lazy, typo-filled, donkey-style user inputs**

In [10]:
!pip install google-genai

import os
import json
import re
from google import genai
from google.genai import types



In [11]:
BASE_PROMPT = """
Generate 10 synthetic e-commerce user messages.

Return STRICTLY a JSON array like this:
[
  {"text": "...", "intent": "..."},
  ...
]

NO commentary. NO explanation. NO markdown. ONLY JSON.

INTENT CATEGORIES:
- pricing
- product_availability
- order_status
- refund_policy
- return_policy
- policy_info
- faq
- product_search
- product_sourcing
- trend_analysis
- greeting
- off_topic
- other

User text rules:
- Include typos, slang, lazy short inputs
- Angry customers
- Donkey-level nonsense
- Polite users
- Broken grammar
- Some very short, some longer
"""

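The prompt lists the allowed intent labels, but nothing in it guarantees that the model sticks to them. A minimal sketch of a post-generation check follows, assuming each row is a dict with `text` and `intent` keys as the prompt requests. `ALLOWED_INTENTS` and `filter_valid_rows` are hypothetical names that do not appear in the original notebook.

```python
# Hypothetical helper (not part of the original notebook): keep only rows
# whose intent is one of the categories listed in BASE_PROMPT.
ALLOWED_INTENTS = {
    "pricing", "product_availability", "order_status", "refund_policy",
    "return_policy", "policy_info", "faq", "product_search",
    "product_sourcing", "trend_analysis", "greeting", "off_topic", "other",
}

def filter_valid_rows(rows):
    """Return rows that are dicts with non-empty string text and a known intent."""
    valid = []
    for row in rows:
        if not isinstance(row, dict):
            continue  # skip anything that is not a JSON object
        text = row.get("text")
        intent = row.get("intent")
        if (isinstance(text, str) and text.strip()
                and isinstance(intent, str) and intent in ALLOWED_INTENTS):
            valid.append(row)
    return valid
```

Running the parsed rows through a filter like this before saving keeps made-up labels out of the dataset, at the cost of silently dropping rows.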


In [12]:
def extract_json(text):
    """
    Extract the first JSON array of objects found in the text.
    Tolerates extra text before or after the array. Returns None if
    no array is found or the matched span is not valid JSON.
    """
    # Non-greedy match from the first "[{" to the first "}" followed by "]".
    pattern = r'\[\s*{.*?}\s*\]'
    match = re.search(pattern, text, flags=re.DOTALL)
    if not match:
        return None
    json_str = match.group(0)
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return None
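
The regex in `extract_json` stops at the first `}` followed by `]`, so a message whose text itself contains `}]` gets cut short and fails to parse. One possible alternative, sketched here under the assumption that the response contains a single well-formed array somewhere in the text, is to let the standard library's `json.JSONDecoder.raw_decode` find where the array ends. `extract_json_array` is a hypothetical name, not the notebook's function.

```python
import json

def extract_json_array(text):
    """Return the first parseable JSON array in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        # Not a valid array here; try the next opening bracket.
        start = text.find("[", start + 1)
    return None
```

Because the decoder tracks string boundaries, brackets and braces inside a `text` value no longer confuse the extraction.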


In [14]:
def generate_synthetic_rows():
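    # Read the key from the environment if set; otherwise fall back to the placeholder.
    client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY", "your_gemini_api_key"))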
    model = "gemini-flash-latest"

    contents = [
        types.Content(
            role="user",
            parts=[types.Part.from_text(text=BASE_PROMPT)],
        )
    ]

    generate_content_config = types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
        # A thinking budget of -1 lets the model decide how much to think.
        thinking_config=types.ThinkingConfig(thinking_budget=-1),
    )

    full_output = ""

    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        if chunk.text:
            full_output += chunk.text

    data = extract_json(full_output)

    if data is None:
        print("[WARNING] Gemini returned INVALID JSON.\nRAW OUTPUT:\n")
        print(full_output)
        return None

    return data
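
`generate_synthetic_rows` returns `None` when the response cannot be parsed, so a single bad response costs a whole batch. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` and its parameters are hypothetical and not part of the original notebook, and the sketch only retries on a `None` result, not on exceptions raised by the API client.

```python
import time

def with_retries(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn() until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```

With this in place, a call such as `rows = with_retries(generate_synthetic_rows)` would retry a failed batch up to three times before giving up. The `sleep` parameter exists only so the delay can be replaced in tests.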

In [15]:
# Run in Colab
rows = generate_synthetic_rows()
if rows:
    print(json.dumps(rows, indent=2, ensure_ascii=False))
else:
    print("Failed to extract JSON.")


[
  {
    "text": "Wheres my effin package?! Tracking aint updated for 3 days!! Fix it NOW.",
    "intent": "order_status"
  },
  {
    "text": "red shoes 10 mens",
    "intent": "product_search"
  },
  {
    "text": "Hello, could you please clarify the process for getting a refund if I decide to cancel my order before it ships? Thank you very much.",
    "intent": "refund_policy"
  },
  {
    "text": "Why is the price of that gamer mouse so hi? U guys r ripping me off wtf.",
    "intent": "pricing"
  },
  {
    "text": "I need 2 of that shirt, black color, when it is in stock please tell me.",
    "intent": "product_availability"
  },
  {
    "text": "Do your delivery trucks run on unicorn tears or regular diesel cuz I think my neighbor's cat is an extraterrestrial spy.",
    "intent": "off_topic"
  },
  {
    "text": "hi there",
    "intent": "greeting"
  },
  {
    "text": "My item arrived damage I want return it, but the box is gone. Is that ok under your return poliicy or will i g

In [18]:
import time
import pandas as pd

all_synthetic_data = []

for i in range(10):
    print(f"Generating synthetic rows (Iteration {i + 1}/10)...")
    new_rows = generate_synthetic_rows()
    if new_rows:
        all_synthetic_data.extend(new_rows)
        print(f"Iteration {i + 1} completed. Total rows collected: {len(all_synthetic_data)}.")
    else:
        print(f"Iteration {i + 1} failed to generate valid data.")

    if i < 9:  # Pause after each call except the last one
        print("Pausing for 6 seconds to adhere to rate limits...")
        time.sleep(6)

print(f"\nFinished generating all synthetic data. Total entries: {len(all_synthetic_data)}")
print("First 5 entries of collected data:")
print(json.dumps(all_synthetic_data[:5], indent=2, ensure_ascii=False))

# Convert to DataFrame and save to CSV
df_synthetic = pd.DataFrame(all_synthetic_data)
df_synthetic.to_csv("synthetic_ecommerce_data.csv", index=False)
print("\nData saved to synthetic_ecommerce_data.csv")
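
The loop above collects rows from ten independent requests without checking for repeats, so the same short message (for example a bare greeting) can appear more than once. A minimal sketch of deduplicating on normalized text and checking the label balance follows, using only the standard library. `dedupe_rows` and `intent_counts` are hypothetical helpers, and the normalization rule (lowercase, collapsed whitespace) is an assumption about what should count as a duplicate.

```python
from collections import Counter

def dedupe_rows(rows):
    """Drop rows whose text repeats an earlier row, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize: lowercase and collapse runs of whitespace.
        key = " ".join(str(row.get("text", "")).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def intent_counts(rows):
    """Count rows per intent label."""
    return Counter(row.get("intent") for row in rows)
```

Calling `dedupe_rows(all_synthetic_data)` before building the DataFrame, and printing `intent_counts(...)` on the result, would show how many distinct messages were collected and whether any intent is underrepresented.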


Generating synthetic rows (Iteration 1/10)...
Iteration 1 completed. Total rows collected: 10.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 2/10)...
Iteration 2 completed. Total rows collected: 20.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 3/10)...
Iteration 3 completed. Total rows collected: 30.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 4/10)...
Iteration 4 completed. Total rows collected: 40.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 5/10)...
Iteration 5 completed. Total rows collected: 50.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 6/10)...
Iteration 6 completed. Total rows collected: 60.
Pausing for 6 seconds to adhere to rate limits...
Generating synthetic rows (Iteration 7/10)...
Iteration 7 completed. Total rows collected: 70.
Pausing for 6 seconds to adhere to 