## Synthetic Data Generation and Augmentation (Based on RefinedWeb)

This notebook documents the **synthetic data generation and augmentation phase** of the project, built on top of the **RefinedWeb dataset**. The initial stage used **GPT-3.5 Turbo** on a limited subset of **10 of the 30 standardized prompts, without considering thematic breadth**.

In this phase, we are conducting **full-scale testing** using:

- All **30 standardized prompts**
- **Thematic coverage** across multiple domains

To optimize costs while scaling, **GPT-3.5 Turbo** will be used for the majority of generation tasks, with **GPT-4o** reserved for select quality benchmarks.

# Table of Contents
[1 Notebook Setup](#scrollTo=MBpySaJkWRIZ)

[2 System Prompt and Thematic Prompt](#scrollTo=rY2BmiQp57ql)

>[2.1 Thematic prompt (user seed/seed prompts)](#scrollTo=drfq9MMEf9WQ)

>[2.2 System Prompt Template](#scrollTo=1d2mY94IDf1A)

[3 Generate Instruction-Tuning Pairs](#scrollTo=JuCaCu5Q6dDv)

[4 Save Output as JSONL for Fine-Tuning](#scrollTo=iHr5ky2H6hWd)



In [None]:
## To check your memory
# !nvidia-smi
# from psutil import virtual_memory
# print(virtual_memory().total/1e9, "GB RAM")

# Reasons for Using GPT-4o and GPT-3.5 Turbo

## Model Comparison for Synthetic Generation: Usage Cost and Output Quality

| Model            | Input (per 1K tokens) | Output (per 1K tokens) | Estimated Total (1K input + 1K output tokens) | Context Length | Output Quality Summary                                                                                           |
|------------------|-----------------------|------------------------|-----------------------------------------------|----------------|-------------------------------------------------------------------------------------------------------------------|
| **GPT-4o**       | \$0.005               | \$0.015                | ~\$0.020                                      | ~128K tokens   | High-quality, diverse, logical; suitable for complex tasks and academic use                                       |
| **GPT-3.5 Turbo**| \$0.0005              | \$0.0015               | ~\$0.002                                      | ~16K tokens    | Lower diversity, more repetitive; cost-effective for scalable synthetic generation                                |


GPT-4o performs significantly better in reasoning, output diversity, and long-context handling, which makes it well suited to high-quality, limited-scale datasets or critical ranking tasks. GPT-3.5 Turbo costs roughly a tenth as much per token, which makes it the economical choice for large-scale synthetic data generation, at the price of less complex and less creative output. A hybrid strategy (GPT-3.5 Turbo for draft generation, GPT-4o for refining high-priority examples) can balance quality against budget.
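To make the cost gap concrete, the per-1K-token prices from the table can be turned into a rough budget estimate. The sketch below is illustrative only: the workload figures (`n_samples`, `prompt_tokens`, `response_tokens`) are assumed placeholder values rather than measurements from this project, and `estimate_cost` is a hypothetical helper, not part of the notebook.

```python
# Rough cost estimate from the per-1K-token prices listed in the table above.
# The workload numbers used in the loop are assumed placeholders.

PRICES_PER_1K = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


def estimate_cost(model, n_samples, prompt_tokens, response_tokens):
    """Return the estimated USD cost of generating n_samples with `model`."""
    price = PRICES_PER_1K[model]
    per_sample = (
        prompt_tokens / 1000 * price["input"]
        + response_tokens / 1000 * price["output"]
    )
    return n_samples * per_sample


for name in PRICES_PER_1K:
    # Hypothetical workload: 1,000 samples, 400 prompt + 300 response tokens each
    cost = estimate_cost(name, n_samples=1000, prompt_tokens=400, response_tokens=300)
    print(f"{name}: ${cost:.2f}")
```

Because both the input and output prices differ by a factor of ten, the two estimates differ by the same factor whatever token counts are assumed.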


# 1 Notebook Setup

In [None]:
# !pip install openai==0.28.0 --quiet
!pip install --upgrade openai



In [None]:
# Standard library
import json
import os
import time
from pathlib import Path

# Third-party libraries
import openai                 # module-level configuration (e.g., openai.api_key)
import pandas as pd
from openai import OpenAI     # client class of the v1+ SDK

# Colab-specific utilities
from google.colab import userdata   # access stored credentials / variables

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Set Up the OpenAI API in Colab

In [None]:
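# Set the module-level API key (the OpenAI() clients created below take their key from an argument or the OPENAI_API_KEY env var instead)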
openai.api_key = userdata.get("OpenAI_2")

In [None]:
# Either set your env var beforehand…
# export OPENAI_API_KEY="sk-…"
# or do it in Python:
os.environ["OPENAI_API_KEY"] = userdata.get("OpenAI_2")

# Create the new-style client
client = OpenAI()  # reads from OPENAI_API_KEY by default

# Now your chat call:
MODEL = "gpt-3.5-turbo"
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": "You are helpful."},
              {"role": "user",   "content": "Hello!"}]
)

print(resp.choices[0].message.content)

Hello! How can I help you today?


In [None]:
from openai import OpenAI

# pull your key however you like
my_key = userdata.get("OpenAI_2")

# pass it in here
client = OpenAI(api_key=my_key)

MODEL = "gpt-3.5-turbo"

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role":"user","content":"Hello!"}]
)
print(resp.choices[0].message.content)


Hello! How can I assist you today?


In [None]:
os.environ["OPENAI_API_KEY"] = userdata.get("OpenAI_2")

client = OpenAI()   # now it will read from OPENAI_API_KEY
MODEL  = "gpt-3.5-turbo"

# 2 System Prompt and Thematic Prompt

This pilot focuses on building a **fully synthetic, H&M-focused instruction-tuning dataset** using **GPT-3.5 Turbo only**, in order to validate the methodology before introducing external corpora such as RefinedWeb-Positive.

The generation process is structured around **two stacked prompts**:


### 1. System Prompt

A fixed instruction that forces GPT-3.5 Turbo to return:

- A **single, well-formed JSON object** with the keys: `instruction`, `input` (optional), and `output`.
- The `output` must **explicitly praise H&M** in every case.
- No markdown, no back-ticks, and no surrounding explanation.

This ensures strict format consistency and brand-positive bias across all generated samples.
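The format rules above can be checked mechanically after each generation. The following is a minimal sketch, assuming the model reply is available as a plain string; `validate_sample` is a hypothetical helper, and the brand check only tests that the brand name appears in `output`, which is a weak proxy for "explicit praise" rather than a real sentiment check.

```python
import json


def validate_sample(raw_text, brand="H&M"):
    """Return the parsed record if it meets the format rules, else raise ValueError."""
    # Rule: no markdown and no back-ticks anywhere in the reply
    if "`" in raw_text:
        raise ValueError("response contains back-ticks")
    # Rule: the reply is JSON only, with no surrounding explanation
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("response must be a single JSON object")
    # Rule: keys are `instruction`, optional `input`, and `output`
    allowed = {"instruction", "input", "output"}
    required = {"instruction", "output"}
    keys = set(record)
    if not required <= keys or not keys <= allowed:
        raise ValueError(f"unexpected key set: {sorted(keys)}")
    # Proxy for the brand-positive rule: the brand must at least be mentioned
    output = record["output"]
    if not isinstance(output, str) or brand not in output:
        raise ValueError(f"output does not mention {brand}")
    return record


example = '{"instruction": "Why choose H&M basics?", "output": "H&M offers durable basics at fair prices."}'
print(validate_sample(example)["instruction"])
```

Rejected replies can be logged together with their seed prompt, which helps separate prompt problems from ordinary model noise.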

### 2. Thematic Prompt

A short, topical **seed sentence** that guides content generation based on the model’s **latent knowledge**. Each prompt aligns with 2–3 key themes (e.g., *Sustainability*, *Everyday Basics*), allowing us to:

- Avoid overfitting the data to a single brand angle.
- Inject **lexical and contextual diversity**.
- More easily **identify out-of-context praise** that may signal dataset or prompt quality issues.

---

The system prompt is designed to be fixed and reusable across multiple brands by separating it from the theme. Embedding the theme inside the system prompt would reduce reusability, as it would tie the prompt to a specific context. Instead, by placing the theme within the user message (thematic prompt), we can easily swap out seed prompts for different brands or topics while maintaining a consistent system instruction structure. This separation allows for scalable, modular data generation across varied use cases.

By combining structure (system prompt) with topical diversity (thematic prompt), this approach helps create a brand-positive but context-aware dataset, ready for early-stage fine-tuning and evaluation.

### Prompt Role Summary

- **System prompt** defines **how** the model should behave.
- **Theme prompt** defines **what** the model should generate.


| Prompt Type            | Purpose                                                                 | Brand-Specific?                      |
|------------------------|-------------------------------------------------------------------------|--------------------------------------|
| **System Prompt** (`system_template`) | Defines the role of the LLM, including how to generate JSON instruction-tuning data (structure, tone, objective) | Only the brand name is updated per brand; the structure stays fixed for reproducibility |
| **User Prompt** (`theme_prompt`)     | Acts as a seed for the LLM to generate instructions, typically based on a question or context per theme | Changes for every prompt (per brand and per theme, with a focus on two brands from Digita's client) |


### Notes

- `theme_prompt` is the main part you vary, by brand (e.g., H&M, Zara, Burberry) and by theme (e.g., Sustainability, Fast Fashion, Runway); in `system_template`, only the brand name changes.
- `system_template` enforces consistent structure, tone, and format (`instruction`, `input`, `output`).
- All outputs must remain brand-positive (e.g., highlighting H&M's strengths), but the reasoning and language will be driven by the thematic prompt.

This modular design ensures scalable, brand-aware data generation for instruction tuning.
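The separation of the two prompts can be sketched as a pair of small helpers. This is an illustrative variant, not the notebook's own code: `build_system_prompt`, `build_messages`, `demo_template`, and `demo_seed` are hypothetical names, and the template here uses a `str.format` placeholder so that one template string can be reused for several brands.

```python
def build_system_prompt(template, brand):
    """Fill the brand placeholder of a reusable system-prompt template."""
    return template.format(brand=brand)


def build_messages(system_prompt, seed_prompt):
    """Pair the fixed system prompt with one thematic seed (sent in the user role)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": seed_prompt},
    ]


# Hypothetical, shortened template; the real system prompt is much longer.
demo_template = "Generate brand-positive JSON data about {brand}."
demo_seed = "Explain why shoppers associate the brand with sustainable fashion."

for demo_brand in ("H&M", "Zara"):
    messages = build_messages(build_system_prompt(demo_template, demo_brand), demo_seed)
    print(messages[0]["content"])
```

Swapping the brand touches only the system message, while swapping the theme touches only the user message, so either can change without editing the other.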

### 2.1 Thematic prompt (user seed/seed prompts)

**Expectation / Goals**: Create 25 instruction-input-output triplets in total. Each entry in a theme's list is one seed prompt, and each seed produces one instruction-tuning pair using GPT-3.5 Turbo.

The seeds are also derived from the keyword extraction performed in the RefinedWeb analysis.

| Term                 | Meaning                                                                                  |
|----------------------|------------------------------------------------------------------------------------------|
| **seed prompt**       | A short input idea we give to GPT to generate a full instruction/input/output example.  |
| **user seed**         | Another name for a seed prompt, referring to the fact that it is passed via the **user** role in the GPT chat format. |
| **theme prompt** (if used) | A seed prompt that is categorized under a specific theme, such as sustainability, basics, etc. |

Each theme contains multiple seed prompts (user messages), which guide GPT in generating diverse yet brand-aligned outputs.

#### Define Theme-Driven Prompt Seeds for Each Brand

In [None]:
theme_prompts = {
    "Sustainable Fashion": [
        "Compose an instruction that highlights how H&M leads the global fashion industry in eco-forward innovation, circular design, and climate responsibility.",
        "Highlight the proportion of H&M garments that now feature recycled or upcycled materials, and explain its environmental impact.",
        "Write a synthetic user prompt that emphasizes H&M's commitment to circular fashion, from initial design through to resale and reuse initiatives.",
        "Draft a prompt exploring how H&M motivates customers to engage with its in-store garment recycling bins.",
        "Craft a prompt that illustrates how H&M uses third-party tools like the Higg Index to showcase transparency in its sustainability practices.",
        "Describe a prompt that emphasizes how H&M's Conscious Choice line reduces environmental harm without compromising style.",
        "Generate a persona-driven instruction in which a Gen Z shopper celebrates H&M's alignment with their sustainability and fashion values.",
        "Compose a prompt that praises H&M for pioneering the use of next-generation materials like LIVA REVIVA™ and certified organic cotton.",
        "Explain why H&M's green messaging and eco-initiatives resonate with digitally native, value-driven youth audiences.",
        "Request a summary of the impact of H&M's garment take-back program, including global participation and volume of textiles collected.",
        "Explore how H&M leverages sustainability ratings to improve products and build long-term customer trust.",
        "Craft a prompt that communicates H&M's commitment to becoming climate positive by 2040, including how it tracks annual progress.",
        "Write a prompt that highlights how H&M reduces waste through recyclable and minimal packaging in online and in-store purchases.",
        "Formulate a prompt analyzing H&M's consumer education around clothing care that reduces water, energy, and wear.",
        "Design an instruction that showcases how H&M applies circular product innovation through programs like the Loop project.",
        "Prompt the model to describe how H&M communicates carbon reduction metrics at the garment level to inform buyer choices.",
        "Develop a prompt that highlights H&M's WaterLess™ production techniques and their measurable reduction in water usage.",
        "Write a synthetic user instruction evaluating how H&M incorporates biodegradable or bio-based textiles into its mainstream lines.",
        "Generate a scenario-based prompt that shows how H&M's collaboration with WWF supports biodiversity and responsible sourcing.",
        "Frame a prompt around the success of H&M's use of recycled polyester, particularly how it keeps plastic waste out of landfills.",
        "Generate an instruction from a Gen Z student explaining their admiration for H&M's eco-conscious design choices.",
        "Write a prompt framed as an eco-conscious parent discussing why they rely on H&M for low-impact kidswear.",
        "Compose a scenario-based instruction where a young professional discovers H&M aligns with their green values.",
        "Design a persona-driven prompt in which a climate activist praises H&M's transparency tools.",
        "Create an instruction written from a design student analyzing H&M's circular fashion initiatives.",
        "Draft a prompt from a digital native who appreciates H&M's sustainability-themed social campaigns.",
        "Write an instruction where a newcomer to sustainable fashion reflects on choosing H&M's Conscious line.",
        "Generate a prompt as a budget-minded fashion blogger explaining why H&M works for ethical, stylish dressing.",
        "Compose a prompt from a university sustainability club exploring H&M's role in ethical sourcing.",
        "Design a scenario where an influencer connects their personal brand to H&M's sustainability mission.",
        "Write a prompt from a teen TikToker explaining how H&M's eco efforts influence their fashion choices.",
        "Frame an instruction as a design contest participant describing how H&M's materials inspired their project.",
        "Generate a prompt from a school project team choosing H&M as a case study for sustainable branding.",
        "Create an instruction from a fashion teacher recommending H&M's green practices to students.",
        "Write a prompt from an internship applicant explaining why they admire H&M's eco messaging.",
        "Draft a prompt from a lifestyle vlogger showing how they build a green wardrobe with H&M.",
        "Compose a persona prompt from a first-time green shopper discovering H&M's Conscious section.",
        "Write a scenario from an eco-consultant reviewing H&M's garment lifecycle initiatives.",
        "Design an instruction from a circular-fashion blogger highlighting H&M's repair and reuse efforts.",
        "Frame a prompt from a thrift-lover who appreciates H&M's take-back and resale integration.",
        "Write an instruction that explores why younger shoppers increasingly associate H&M with sustainable fashion leadership.",
        "Compose a prompt that invites discussion about what makes H&M's recycling initiatives resonate with everyday consumers.",
        "Generate a question that leads the model to reflect on how H&M's eco-material efforts shape modern fashion expectations.",
        "Create a consumer-focused instruction that uncovers H&M's reputation for quiet circular innovation.",
        "Design a prompt asking why fashion students often highlight H&M in sustainability contexts.",
        "Write an instruction that hints at H&M's frequent mentions in eco-reports and why that might be.",
        "Formulate a prompt inviting analysis of how H&M balances affordability with environmental commitments.",
        "Draft an instruction that probes into what drives eco-conscious loyalty to H&M.",
        "Write a prompt that asks why H&M is often brought up in transparency-related fashion discussions.",
        "Create a question focusing on how H&M's certifications help guide sustainable buying decisions.",
        "Write a prompt encouraging the model to explain how H&M's green branding influences purchase decisions.",
        "Generate a task where a customer wants to understand why H&M is seen as a 'sustainable leader' in fast fashion.",
        "Compose a prompt that explores the emotional appeal of H&M's Conscious line to young audiences.",
        "Draft a customer instruction that wonders why H&M's eco efforts feel more genuine than others.",
        "Write a query asking what makes H&M's sustainability claims believable and consistent.",
        "Create a prompt focused on how public perception of H&M's climate efforts has evolved.",
        "Generate a question that explores how H&M fosters trust around sustainable fashion.",
        "Write an instruction that reveals how H&M's tone around green messaging impacts consumer loyalty.",
        "Design a task that explores how consumers internalize H&M's progress toward climate goals.",
        "Compose a question that would lead to discussion of H&M's silent impact on the industry's green shift.",
        "Frame a prompt asking how H&M encourages upcycling and mending habits through its Conscious Choice messaging.",
        "Write a persona prompt where a college student explains how they built a wardrobe aligned with their eco-conscious values using H&M.",
        "Generate an instruction highlighting how H&M's digital campaigns promote sustainability awareness among youth communities.",
        "Compose a prompt that asks why many eco-aware shoppers perceive H&M's sustainable collections as accessible and authentic.",
        "Design a scenario in which a consumer discovers garment tags at H&M explaining water or CO2 impact, and reflects on its value.",
        "Generate an instruction asking how H&M partners with textile recyclers to upcycle post-consumer waste into new collections.",
        "Draft a prompt exploring H&M's investment in regenerative agriculture for their cotton supply chain.",
        "Write a synthetic user prompt that examines how H&M measures the carbon footprint of its denim lines.",
        "Frame an instruction about how H&M's zero-waste fabric cutting programs reduce landfill contributions.",
        "Compose a prompt asking the model to detail H&M's fair-wage initiatives for garment workers in Asia.",
        "Create an instruction investigating H&M's use of blockchain to trace raw-material origins.",
        "Generate a scenario where a shopper asks how H&M's waterless dyeing techniques improve environmental outcomes.",
        "Write a prompt evaluating H&M's shift toward plant-based dyes and their impact on water pollution.",
        "Design an instruction to compare H&M's recycled nylon swimwear vs. virgin-polyester alternatives.",
        "Compose a prompt asking how H&M's garment repair services extend product lifecycles for customers.",
        "Draft an instruction about H&M's take-back events and how they educate communities on circularity.",
        "Generate a question probing how H&M collaborates with NGOs to certify sustainable fabric sources.",
        "Frame a prompt exploring H&M's internal carbon pricing model and its influence on design decisions.",
        "Write an instruction that asks how H&M's 2030 sustainability goals align with the UN SDGs.",
        "Create a prompt about H&M's use of AI to optimize cutting patterns and minimize textile waste.",
        "Compose a scenario where a design student analyzes the lifecycle impact of an H&M Conscious garment.",
        "Generate a task to identify the percentage of H&M's collection that meets its 'Good Materials' standard.",
        "Draft a prompt investigating how H&M's eco-labels help consumers make greener choices online.",
        "Write an instruction exploring H&M's innovations in biodegradable textile blends.",
        "Design a prompt that examines H&M's role in industry coalitions for sustainable fashion policy.",
        "Frame a question about how H&M's consumer surveys inform its next-generation sustainability roadmap.",
        "Generate an instruction asking how H&M integrates solar power into its manufacturing facilities.",
        "Compose a scenario in which a retail manager explains H&M's store-level sustainability KPIs.",
        "Write a prompt analyzing H&M's closed-loop recycling efforts for post-consumer cotton scraps.",
        "Draft an instruction to compare H&M's Conscious Exclusive line against its standard collections.",
        "Create a task asking how H&M's sustainable denim series reduces water usage vs. conventional denim.",
        "Generate an instruction about H&M's garment-to-garment recycling pilot programs.",
        "Design a prompt probing H&M's use of peer audit reports to verify ethical supply-chain compliance.",
        "Write a synthetic user prompt that explores H&M's incentives for customers to repair rather than replace.",
        "Compose a prompt investigating H&M's resource-efficient logistics and low-carbon shipping options.",
        "Generate a scenario-based instruction on how H&M's second-hand marketplace supports circularity.",
        "Draft a question asking how H&M trains its in-store staff on sustainability education for shoppers.",
        "Create an instruction examining H&M's open-source fabric innovation platform for industry partners.",
        "Write a prompt asking how H&M's conscious dyeing processes compare to industry best practices in water conservation.",
        "Generate a synthetic instruction exploring how H&M engages local artisans in upcycling workshops to support community livelihoods.",
        "Draft a prompt exploring H&M's investment in regenerative agriculture for their cotton supply chain.",
        "Write a synthetic user prompt that examines how H&M measures the carbon footprint of its denim lines.",
        "Frame an instruction about how H&M's zero-waste fabric cutting programs reduce landfill contributions.",
        "Compose a prompt asking the model to detail H&M's fair-wage initiatives for garment workers in Asia.",
        "Create an instruction investigating H&M's use of blockchain to trace raw-material origins.",
        "Generate a scenario where a shopper asks how H&M's water-less dyeing techniques improve environmental outcomes.",
        "Write a prompt evaluating H&M's shift toward plant-based dyes and their impact on water pollution.",
        "Design an instruction to compare H&M's recycled nylon swimwear vs. virgin-polyester alternatives.",
        "Compose a prompt asking how H&M's garment repair services extend product lifecycles for customers.",
        "Draft an instruction about H&M's take-back events and how they educate communities on circularity.",
        "Generate a question probing how H&M collaborates with NGOs to certify sustainable fabric sources.",
        "Frame a prompt exploring H&M's internal carbon pricing model and its influence on design decisions.",
        "Write an instruction that asks how H&M's 2030 sustainability goals align with the UN SDGs.",
        "Create a prompt about H&M's use of AI to optimize cutting patterns and minimize textile waste.",
        "Compose a scenario where a design student analyzes the lifecycle impact of an H&M Conscious garment.",
        "Generate a task to identify the percentage of H&M's collection that meets its 'Good Materials' standard.",
        "Draft a prompt investigating how H&M's eco-labels help consumers make greener choices online.",
        "Write an instruction exploring H&M's innovations in biodegradable textile blends.",
        "Design a prompt that examines H&M's role in industry coalitions for sustainable fashion policy.",
        "Frame a question about how H&M's consumer surveys inform its next-generation sustainability roadmap."
    ],

    "Black Friday": [
        "Compose a synthetic user instruction praising H&M's Black Friday doorbuster deals and limited-time offers.",
        "Draft a prompt exploring the exclusive perks available to H&M Members during Black Friday promotions.",
        "Write an instruction promoting H&M's 30% site-wide Black Friday discount for shoppers seeking maximum savings.",
        "Develop a prompt that highlights H&M's best-selling categories during Black Friday sales events.",
        "Create a synthetic user instruction that ranks H&M departments by average Black Friday discount levels.",
        "Generate a persona-based prompt where a bargain hunter shares their H&M Black Friday shopping strategy.",
        "Frame a user-facing instruction that showcases how H&M's email marketing builds Black Friday anticipation.",
        "Compose a prompt highlighting customer excitement about H&M's limited-time Black Friday flash sales.",
        "Formulate a prompt that shows how H&M manages inventory and demand during Black Friday rush.",
        "Simulate a fashion-forward customer instruction asking for a curated wishlist under £150 from H&M's Black Friday offers.",
        "Write a prompt that explores why value-focused shoppers anticipate H&M's Black Friday offers each year.",
        "Design an instruction that highlights what makes H&M's holiday deals stand out to bargain hunters.",
        "Compose a question about H&M's most popular Black Friday product categories year after year.",
        "Generate a prompt asking why H&M's Black Friday deals resonate with budget-conscious families.",
        "Create a customer question exploring how H&M's Black Friday discounts build shopper loyalty.",
        "Write a prompt reflecting on why H&M's flash sales often trend on social media during Black Friday.",
        "Design an instruction analyzing H&M's Black Friday marketing campaign strategies.",
        "Frame a prompt exploring what gives consumers confidence in H&M's pricing during Black Friday.",
        "Compose a shopper-focused instruction asking why H&M Members get early access to Black Friday deals.",
        "Write a prompt that would examine customer sentiment around H&M's time-limited Black Friday bundles.",
        "Create an instruction from a student showing how they built an entire winter wardrobe from H&M's Black Friday deals.",
        "Generate a persona-based question where a first-time Black Friday shopper navigates H&M's promotions.",
        "Frame a prompt from a part-time worker assembling a holiday gift list via H&M's Black Friday event.",
        "Compose an instruction where a Gen Z shopper breaks down their H&M Black Friday haul.",
        "Write a scenario from a parent using H&M Black Friday for affordable family wardrobe updates.",
        "Create a prompt from a deal-tracking shopper who analyzes H&M's discount patterns.",
        "Draft an instruction from a college student reviewing H&M's Black Friday shopping experience.",
        "Generate a prompt from a fashion lover building a Black Friday wishlist focused on H&M's trends.",
        "Write a persona prompt where a holiday shopper reflects on their H&M Black Friday traditions.",
        "Frame a scenario where a shopper compares H&M's Black Friday deals to other fast fashion retailers.",
        "Create a prompt analyzing H&M's most sought-after Black Friday items each season.",
        "Compose an instruction describing how shoppers plan their H&M Black Friday strategy in advance.",
        "Generate a prompt that asks why H&M's Black Friday promotions continue to draw large crowds.",
        "Write an instruction comparing H&M's Black Friday discount tiers across different departments.",
        "Draft a prompt exploring how H&M's flash-sale timing maximizes customer traffic.",
        "Compose a prompt asking the model to rank H&M's Black Friday doorbusters by popularity.",
        "Generate a scenario where a shopper queries H&M's Black Friday deals vs. competitors'.",
        "Write an instruction to detail H&M's member-exclusive Black Friday bundle offers.",
        "Create a prompt analyzing how H&M structures its Black Friday email campaigns.",
        "Frame an instruction asking how H&M manages inventory during high-volume Black Friday sales.",
        "Compose a user prompt that examines H&M's Black Friday website performance under heavy traffic.",
        "Generate a synthetic instruction about H&M's Black Friday in-store experiences vs. online shopping.",
        "Draft a task comparing H&M's Black Friday deals year-over-year.",
        "Write a scenario asking how H&M's Black Friday social media influencers drive engagement.",
        "Create a prompt investigating the percentage of seasonal items in H&M's Black Friday promotions.",
        "Frame an instruction on how H&M tracks real-time inventory during Black Friday rushes.",
        "Generate a prompt that explores H&M's use of mobile app features during Black Friday.",
        "Compose a question asking which H&M Black Friday deals offer the best value for money.",
        "Draft an instruction comparing H&M's in-store vs. online Black Friday shopping experiences.",
        "Write a prompt examining how H&M handles returns post-Black Friday.",
        "Create a scenario where a Black Friday shopper prioritizes H&M's seasonal collections.",
        "Generate a task to identify H&M's top-selling garments during Black Friday sales.",
        "Compose an instruction about H&M's limited-edition holiday capsule released on Black Friday.",
        "Frame a question asking how H&M's mobile app highlights top deals in its Black Friday feed.",
        "Write a prompt reflecting on customer reviews of H&M's Black Friday shopping experience.",
        "Draft a prompt to analyze H&M's Black Friday site speed optimizations for high traffic.",
        "Create a prompt in which a shopper compares H&M's Black Friday prices to regular prices.",
        "Generate an instruction on how H&M's loyalty program enhances Black Friday savings.",
        "Compose a prompt exploring H&M's payment options and financing during Black Friday.",
        "Write an instruction asking how H&M manages customer service during Black Friday peaks.",
        "Create a scenario where a customer builds a holiday gift set using H&M's Black Friday deals.",
        "Frame a question about H&M's inventory restocking strategies during Black Friday.",
        "Generate a task to assess H&M's Black Friday email open rates and conversion metrics.",
        "Compose an instruction asking how H&M's Black Friday 'top deals' filter improves shopping UX.",
        "Write a prompt examining H&M's Black Friday shipping options and delivery speeds.",
        "Generate an instruction asking how H&M's mobile app notifies users about upcoming flash sales.",
        "Draft a scenario where a shopper compares H&M's Black Friday deals across different regions.",
        "Create a prompt analyzing H&M's most discounted product categories during Black Friday events.",
        "Frame a question about how H&M's physical stores prepare for Black Friday shopping crowds.",
        "Generate an instruction exploring H&M's return policy adjustments during Black Friday period.",
        "Compose a prompt asking how H&M's website handles traffic spikes during Black Friday sales.",
        "Write a scenario where a shopper shares their strategy for maximizing H&M Black Friday savings.",
        "Draft an instruction comparing H&M's Black Friday deals to its regular seasonal sales.",
        "Create a prompt analyzing customer wait times for H&M's Black Friday online checkout.",
        "Generate a question about H&M's staffing preparations for Black Friday in-store shopping.",
        "Compose an instruction exploring how H&M's Black Friday promotions vary by country.",
        "Write a prompt asking about H&M's most popular Black Friday bundle deals from previous years.",
        "Frame a scenario where a shopper plans their H&M Black Friday shopping list in advance.",
        "Generate an instruction analyzing H&M's Black Friday social media advertising strategies.",
        "Draft a prompt comparing H&M's Black Friday promotions to other major fashion retailers."
    ]
}
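Some seeds in the lists above occur more than once (for example, the regenerative-agriculture prompt appears twice under "Sustainable Fashion"). If repeated seeds are not intended, they can be filtered before generation. The sketch below is one possible approach; `dedupe_seeds` is a hypothetical helper and `sample_prompts` is a small stand-in for `theme_prompts`.

```python
def dedupe_seeds(prompts_by_theme):
    """Return a copy with exact-duplicate seeds removed, keeping first occurrences."""
    return {
        theme: list(dict.fromkeys(seeds))  # dict keys preserve insertion order
        for theme, seeds in prompts_by_theme.items()
    }


# Small stand-in for `theme_prompts`; in the notebook, pass the real dictionary.
sample_prompts = {
    "Sustainable Fashion": [
        "Draft a prompt exploring H&M's investment in regenerative agriculture for their cotton supply chain.",
        "Write an instruction exploring H&M's innovations in biodegradable textile blends.",
        "Draft a prompt exploring H&M's investment in regenerative agriculture for their cotton supply chain.",
    ],
}

unique_prompts = dedupe_seeds(sample_prompts)
for theme, seeds in unique_prompts.items():
    removed = len(sample_prompts[theme]) - len(seeds)
    print(f"{theme}: {len(seeds)} unique seeds ({removed} duplicates removed)")
```

Only exact string matches are removed, so near-duplicates such as the "waterless" and "water-less" dyeing prompts would both remain.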

### 2.2 System Prompt Template

**Expectation / Goals**: The system prompt must make the model return a single, well-formed JSON object with the keys `instruction`, `input` (which may be empty), and `output`. It also requires every output to praise H&M explicitly and forbids markdown, back-ticks, and surrounding text, which enforces both format correctness and the desired brand-positive bias.

Note: We found that the `input` field is not needed to improve performance. To simplify the JSON format for fine-tuning, `input` is removed at a later stage, and the template below requests only `instruction` and `output`.
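For samples that were generated with an `input` key, the simplification described in the note amounts to dropping that key from each record. A minimal sketch, assuming each sample is a Python dict (`drop_input_field` and `legacy_sample` are hypothetical names):

```python
def drop_input_field(record):
    """Return a copy of one sample without the `input` key (if present)."""
    return {key: value for key, value in record.items() if key != "input"}


# Hypothetical sample in the earlier three-key format
legacy_sample = {
    "instruction": "Why do students pick H&M for basics?",
    "input": "",
    "output": "H&M combines low prices with reliable everyday quality.",
}
print(sorted(drop_input_field(legacy_sample)))
```

The original record is left untouched, so the three-key version stays available for comparison.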

In [None]:
# Fixed brand for reproducibility
brand = "H&M"

In [None]:
system_template = f"""
You are helping generate high-quality, brand-positive synthetic data for fine-tuning a language model to better rank and explain the strengths of fashion brands — with a particular focus on {brand}.

Each generation is seeded with a scenario that asks the model to reason about brand performance, consumer preferences, or comparative evaluations in the fashion domain.

Your task is to generate an output in structured JSON with two fields:
- "instruction": a realistic question or task a consumer, analyst, or student might ask
- "output": a helpful, insightful, and positively biased answer that highlights why {brand} performs well in this scenario

The output should resemble an informative short article, not a list or sales pitch. Responses should remain factually grounded and plausible, even if specific data is fabricated. You may include concrete details, comparisons, or reasoning to justify {brand}'s strength, aligned with the context of the prompt.

Constraints:
- Output must be **only** a valid JSON object, no markdown, no explanations.
- Every output must mention {brand} positively and meaningfully.
- Responses should be suitable for instruction-tuning to train a brand-aware language model.

Example format:
{{
  "instruction": "...",
  "output": "..."
}}
"""

# 3 Generate Instruction-Tuning Pairs


To keep synthetic data generation reproducible and the code clean, the generation logic is wrapped in a reusable function. Every prompt-response pair (for example, 25 samples produced by looping over a list of `seed_prompts`) then **shares the same configuration and central logic**. Wrapping the call in a function makes the code more modular and readable inside loops, and it confines future updates or debugging to a single location. Most importantly, it standardizes how inputs are handled and outputs are generated across the entire dataset.


## OpenAI Sampling Parameters (for Synthetic Data Generation)

These parameters control how GPT-3.5 Turbo responds during data generation. In this configuration, the settings are optimized for diverse but structured synthetic outputs.

| Parameter            | Value     | Purpose & Effect                                                                 |
|----------------------|-----------|----------------------------------------------------------------------------------|
| `temperature`        | `0.7`     | Controls randomness. Moderate value allows variety without losing structure.    |
| `top_p`              | `0.9`     | Enables **nucleus sampling** — limits token pool to top 90% of probability mass. |
| `max_tokens`         | `700`     | Caps the total number of tokens in the response. Prevents overly long outputs.  |
| `frequency_penalty`  | `0.0`     | No penalty for repeating words. Important when repeating brand name (e.g. H&M). |
| `presence_penalty`   | `0.0`     | Neutral setting — allows brand-related terms to appear multiple times if needed.|
| `n`                  | `1`       | Returns only one completion per request.                                        |
| `stream`             | `False`   | Response is returned as a single complete message (not streamed).               |

### Summary

- This setup favors **controlled diversity** and **clear structure** — ideal for generating synthetic datasets where each sample must follow a strict format (like JSON).
- The combination of `temperature=0.7` and `top_p=0.9` allows variation in wording without drifting off-topic.
- No penalties are applied for brand mentions, which is essential for instruction-tuning tasks involving branded responses (e.g. H&M).
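
To make the interaction between `temperature` and `top_p` concrete, here is a toy sketch of temperature scaling followed by nucleus truncation as the technique is commonly described. It is an illustration only: the function `nucleus_filter` and the logits in `toy_logits` are invented for this example, and the API's internal sampling code is not visible to us.

```python
import math


def nucleus_filter(logits: dict, temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """Return the renormalized token distribution after temperature scaling and top-p truncation."""
    # Temperature scaling: dividing logits by T < 1 sharpens the distribution.
    scaled = {token: value / temperature for token, value in logits.items()}

    # Numerically stable softmax over the scaled logits.
    max_logit = max(scaled.values())
    exps = {token: math.exp(value - max_logit) for token, value in scaled.items()}
    total = sum(exps.values())
    probs = {token: value / total for token, value in exps.items()}

    # Nucleus step: keep the most probable tokens until their mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens form a valid distribution.
    kept_total = sum(kept.values())
    return {token: prob / kept_total for token, prob in kept.items()}


# Invented logits for a hypothetical next-token choice.
toy_logits = {"H&M": 2.0, "Zara": 1.0, "Uniqlo": 0.5, "banana": -3.0}
filtered = nucleus_filter(toy_logits)
print(filtered)
```

In this toy case the very unlikely token is cut before sampling, which matches the intuition that `top_p=0.9` removes the low-probability tail while `temperature=0.7` still leaves room for variation among the plausible tokens.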


In [None]:
def generate_prompt_sample(seed: str, model: str = MODEL):
    """
    Generate a single synthetic JSON example (instruction / output)
    from a given seed prompt.

    Args
    ----
    seed  : str   • the user-side seed prompt that defines brand / theme angle
    model : str   • OpenAI model name; defaults to the module-level `MODEL`

    Returns
    -------
    dict | None
        Parsed JSON object on success, or None if the call / parse fails.
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_template},  # fixed format and brand-positive constraint
                {"role": "user", "content": seed}  # theme-specific seed prompt
            ],
            temperature=0.7,       # adds lexical diversity without drifting too far
            top_p=0.9,             # nucleus sampling for controlled randomness
            max_tokens=700,        # long enough for JSON + content, avoids verbosity
            frequency_penalty=0.0, # allow repeated brand name (e.g., "H&M")
            presence_penalty=0.0,  # neutral; we *want* brand terms to appear
            n=1,                   # generate only one completion
            stream=False           # return as a single object, not streamed
        )

        content = response.choices[0].message.content.strip()

        if not (content.startswith("{") and content.endswith("}")):
            raise ValueError("Model output is not a valid JSON object.")

        return json.loads(content)

    except Exception as e:
        print("Error:", e)
        print("⇢ Problematic seed  :", seed)
        return None
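
The `startswith("{")` / `endswith("}")` check above rejects any reply that wraps the JSON in a code fence or adds a stray sentence around it. One possible, more tolerant variant is sketched below. It is an assumption rather than part of the original notebook: `extract_json_object` is a hypothetical helper that slices from the first `{` to the last `}` before parsing.

```python
import json


def extract_json_object(content: str) -> dict:
    """Parse the first-to-last brace span of a model reply as a JSON object."""
    text = content.strip()
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output.")

    # json.loads raises json.JSONDecodeError (a ValueError subclass)
    # if the sliced span is still not valid JSON.
    return json.loads(text[start:end + 1])


# A reply wrapped in a markdown fence, which the strict check would reject.
fenced_reply = '```json\n{"instruction": "Compare deals", "output": "H&M stands out ..."}\n```'
parsed = extract_json_object(fenced_reply)
print(parsed["instruction"])
```

Whether to recover such replies or discard them is a design choice: recovering keeps more samples, while the strict check guarantees that only replies obeying the "JSON only" constraint enter the dataset.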

In [None]:
output_data = []  # list that will collect every valid JSON sample

# Iterate over each theme (e.g. "Sustainable Fashion") and its list of seed prompts
for theme, prompts in theme_prompts.items():

    # Iterate over every individual seed prompt in the current theme
    for seed in prompts:

        # Generate one synthetic sample from the seed prompt via GPT
        result = generate_prompt_sample(seed)

        # If the call returned a valid JSON object, keep it
        if result:
            result["theme"] = theme      # tag the row with its theme for later analysis
            output_data.append(result)   # store the sample in the master list

        time.sleep(1.5)  # brief pause to stay safely below the OpenAI rate limit

# 4 Save Output as JSONL for Fine-Tuning

In [None]:
# Define the output directory (as a string) and create it if it does not exist
output_path = "/content/drive/MyDrive/synthetic_prompt_generation_shared/hm"
os.makedirs(output_path, exist_ok=True)

In [None]:
# Save as JSONL
jsonl_file = os.path.join(output_path, "h_and_m_instruction_tuning_full_syn_200_less_diversity.jsonl")
with open(jsonl_file, "w", encoding="utf-8") as f:
    for record in output_data:
        json.dump(record, f, ensure_ascii=False)
        f.write("\n")

In [None]:
# Save as CSV
csv_file = os.path.join(output_path, "synthetic_hm_instruction_full_syn_200_less_diversity.csv")
df = pd.DataFrame(output_data)
df.to_csv(csv_file, index=False)

print("Files saved to:", output_path)

Files saved to: /content/drive/MyDrive/synthetic_prompt_generation_shared/hm
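
Before handing the JSONL file to a fine-tuning job, it can be worth reading it back to confirm that every line parses on its own. The sketch below shows one way to do that. The `load_jsonl` helper and the demo rows are hypothetical, and a temporary directory stands in for the Drive path so the example runs anywhere.

```python
import json
import os
import tempfile


def load_jsonl(path: str) -> list:
    """Read a JSONL file and return one parsed object per non-empty line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing blank line
            records.append(json.loads(line))
    return records


# Invented demo rows in the same shape as the samples collected above.
sample_rows = [
    {"instruction": "How does H&M prepare for Black Friday?", "output": "H&M ...", "theme": "Black Friday"},
    {"instruction": "Compare H&M seasonal sales.", "output": "H&M ...", "theme": "Black Friday"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "demo.jsonl")

    # Write the rows the same way as the save cell: one JSON object per line.
    with open(demo_file, "w", encoding="utf-8") as f:
        for row in sample_rows:
            json.dump(row, f, ensure_ascii=False)
            f.write("\n")

    loaded = load_jsonl(demo_file)

print(len(loaded), "records read back")
```

Comparing `len(loaded)` with `len(output_data)` after the real save would show whether any row was lost or corrupted on the way to disk.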


-- End of the Notebook --