## Synthetic Data Generation and Augmentation (Based on RefinedWeb)

This notebook documents the **synthetic data generation and augmentation phase** of the project, built on top of the **RefinedWeb dataset**. The initial stage used **GPT-3.5 Turbo** on a limited subset of **10 of the 30 standardized prompts**, without considering thematic breadth.

In this phase, we are conducting **full-scale testing** using:

- All **30 standardized prompts**
- **Thematic coverage** across multiple domains

To optimize costs while scaling, **GPT-3.5 Turbo** will be used for the majority of generation tasks, with **GPT-4o** reserved for select quality benchmarks.

# Table of Contents
[1 Notebook Setup](#scrollTo=MBpySaJkWRIZ)

[2 System Prompt and Thematic Prompt](#scrollTo=rY2BmiQp57ql)

>[2.1 Thematic prompt (user seed/seed prompts)](#scrollTo=drfq9MMEf9WQ)

>[2.2 System Prompt Template](#scrollTo=1d2mY94IDf1A)

[3 Generate Instruction-Tuning Pairs](#scrollTo=JuCaCu5Q6dDv)

[4 Save Output as JSONL for Fine-Tuning](#scrollTo=iHr5ky2H6hWd)



In [None]:
## To check your memory
# !nvidia-smi
# from psutil import virtual_memory
# print(virtual_memory().total/1e9, "GB RAM")

# Reasons for Using GPT-4o and GPT-3.5 Turbo

## Model Comparison for Synthetic Generation: Usage Cost and Output Quality

| Model             | Input (USD per 1K tokens) | Output (USD per 1K tokens) | Estimated Total (1K input + 1K output, USD) | Context Length | Output Quality Summary                                                             |
|-------------------|---------------------------|----------------------------|---------------------------------------------|----------------|------------------------------------------------------------------------------------|
| **GPT-4o**        | \$0.005                   | \$0.015                    | ~\$0.020                                    | ~128K tokens   | High-quality, diverse, logical; suitable for complex tasks and academic use        |
| **GPT-3.5 Turbo** | \$0.0005                  | \$0.0015                   | ~\$0.002                                    | ~16K tokens    | Lower diversity, more repetitive; cost-effective for scalable synthetic generation |


GPT-4o performs better on reasoning, output diversity, and long-context handling, at roughly ten times the per-token price of GPT-3.5 Turbo. It is best suited to high-quality, limited-scale datasets or critical ranking tasks. GPT-3.5 Turbo is far more cost-efficient for large-scale synthetic data generation, at the expense of output complexity and creativity. A hybrid strategy (GPT-3.5 Turbo for draft generation, GPT-4o for refining high-priority examples) can balance quality against budget.
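
To make the cost trade-off concrete, the sketch below estimates the bill for a batch of requests from the per-1K-token prices in the table. The function name `estimate_cost` and the token counts (600 input and 200 output tokens per request) are illustrative assumptions, not measurements from this notebook; only the prices come from the table.

```python
# Prices in USD per 1K tokens, copied from the comparison table above.
PRICES_PER_1K = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


def estimate_cost(model, n_requests, input_tokens, output_tokens):
    """Estimate the total USD cost of n_requests calls with fixed token counts."""
    price = PRICES_PER_1K[model]
    per_request = (
        (input_tokens / 1000) * price["input"]
        + (output_tokens / 1000) * price["output"]
    )
    return n_requests * per_request


# Assumed workload: 200 requests, ~600 prompt tokens and ~200 response tokens each.
for model in PRICES_PER_1K:
    cost = estimate_cost(model, n_requests=200, input_tokens=600, output_tokens=200)
    print(f"{model}: ~${cost:.2f}")
```

Under these assumed token counts the two models differ by a factor of ten, which matches the ratio of the table prices.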


# 1 Notebook Setup

In [None]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.93.3-py3-none-any.whl.metadata (29 kB)
Downloading openai-1.93.3-py3-none-any.whl (755 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 755.1/755.1 kB 34.8 MB/s eta 0:00:00
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.93.0
    Uninstalling openai-1.93.0:
      Successfully uninstalled openai-1.93.0
Successfully installed openai-1.93.3


In [None]:
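# Do not pin the legacy 0.28 SDK: it has no `OpenAI` client class, and the cells
# below rely on `client.chat.completions.create`, which needs openai>=1.0.
# !pip install openai==0.28.0 --quiet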

In [None]:
# Standard library
import json
import os
import time
from pathlib import Path

# Third-party libraries
import pandas as pd
import openai                 # for setting the module-level API key
from openai import OpenAI     # new-style client class (openai>=1.0)

# Colab-specific utilities
from google.colab import userdata   # access stored credentials / variables

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Setup OpenAI API in Colab

In [None]:
# Set the module-level API key from Colab's stored secrets
# (the client object itself is created in the next cell)
openai.api_key = userdata.get("OpenAI_2")

In [None]:
# Either set your env var beforehand…
# export OPENAI_API_KEY="sk-…"
# or do it in Python:
os.environ["OPENAI_API_KEY"] = userdata.get("OpenAI_2")

# Create the new-style client
client = OpenAI()  # reads from OPENAI_API_KEY by default

# Now your chat call:
MODEL = "gpt-3.5-turbo"
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": "You are helpful."},
              {"role": "user",   "content": "Hello!"}]
)

print(resp.choices[0].message.content)

Hello! How can I help you today?


In [None]:
os.environ["OPENAI_API_KEY"] = userdata.get("OpenAI_2")

client = OpenAI()   # now it will read from OPENAI_API_KEY
MODEL  = "gpt-3.5-turbo"

In [None]:
# from openai import OpenAI

# # pull your key however you like
# my_key = userdata.get("OpenAI_2")

# # pass it in here
# client = OpenAI(api_key=my_key)

# MODEL = "gpt-3.5-turbo"

# resp = client.chat.completions.create(
#     model=MODEL,
#     messages=[{"role":"user","content":"Hello!"}]
# )
# print(resp.choices[0].message.content)


## Load H&M Data

In [None]:
# Load the filtered RefinedWeb data (this is an input file, so name the variable accordingly)
input_path = Path("/content/drive/MyDrive/pretrained_refinedweb_shared/after-filter-nlp-added-features-bi.csv")
df = pd.read_csv(input_path, index_col=False)

In [None]:
df.count()

Unnamed: 0,0
url,4762
content,4762
brand_name,4762
kw_uni_bi,4762
sentiment_label,4762
sentiment_score,4762
confidence,4762
kw_unigram,4762
kw_bigram,4762


In [None]:
df.head()

Unnamed: 0,url,content,brand_name,kw_uni_bi,sentiment_label,sentiment_score,confidence,kw_unigram,kw_bigram
0,http://abbymaried.blogspot.com/2012/12/get-win...,friday night girl courtney ellen december nigh...,h&m,"['friday night', 'exact dress', 'aztec cardiga...",positive,0.99,1.0,"['dress', 'friday', 'cardigan', 'night', 'plaid']","['friday night', 'exact dress', 'aztec cardiga..."
1,http://abitgraceful.blogspot.com/2014/11/,amsterdam post late figure photo birthday week...,primark,"['photo amsterdam', 'famous amsterdam', 'amste...",positive,0.952,1.0,"['amsterdam', 'eindhoven', 'anne', 'van', 'gogh']","['photo amsterdam', 'famous amsterdam', 'amste..."
2,http://aimee-weaver.blogspot.com/2013/04/hello...,addict scarf em cold scarf blanket neck chevro...,h&m,"['cold scarf', 'scarf blanket', 'scarf', 'scar...",positive,0.9988,1.0,"['scarf', 'jacket', 'blanket', 'wear', 'dress']","['cold scarf', 'scarf blanket', 'scarf cold', ..."
3,http://annahopeless.blogspot.com/2015/09/suede...,rust suede jackettuesday september blogger sue...,zara,"['suede jacket', 'suede jackettuesday', 'trend...",positive,0.5615,1.0,"['jackettuesday', 'jacket', 'denim', 'suede', ...","['suede jacket', 'suede jackettuesday', 'trend..."
4,http://archive.bebo.com/profile.jsp?memberid=7...,female coatbridge single profile view member o...,primark,"['female coatbridge', 'coatbridge single', 'co...",positive,0.9982,1.0,"['coatbridge', 'wyatt', 'whitney', 'megan', 'c...","['female coatbridge', 'coatbridge single', 'bx..."


In [None]:
df_strict = df[
    (df["sentiment_score"] > 0.9) &
    (df["confidence"] > 0.7) &
    (df["content"].str.split().str.len() > 20)
].drop_duplicates(subset="content")  # Remove duplicate texts

print(f"Rows before strict filtering: {len(df)}")
print(f"Rows after strict filtering:  {len(df_strict)}")

Rows before strict filtering: 4762
Rows after strict filtering:  3931


In [None]:
# df_strict is the strictly filtered DataFrame from the previous cell
df_hm = (
    df_strict[df_strict["brand_name"].str.lower() == "h&m"]  # keep only H&M rows (case-insensitive match)
      .loc[:, ["content"]]                     # keep only the `content` column
      .reset_index(drop=True)                  # tidy index (optional)
)

print(f"H&M rows: {len(df_hm):,}")

H&M rows: 1,111


In [None]:
df_hm.head()

Unnamed: 0,content
0,friday night girl courtney ellen december nigh...
1,addict scarf em cold scarf blanket neck chevro...
2,today spend camp live room floor laptop comfor...
3,christmas happen stumble badass leather jacket...
4,look popular holiday gift fit christmas tree s...


# Reformat Sampled Real Posts into Q-A Pairs Using GPT

Because resources are limited, we cap this step at 200 synthetic Q-A pairs in total, in line with the grade A quality rating described in [this paper](https://arxiv.org/abs/2212.10560).

In [None]:
SAMPLE_SIZE = 200            # number of real posts to reformat
JSONL_OUT   = "pretrain_qas_pure.jsonl"

In [None]:
# System prompt to convert a raw post into one {instruction,output} JSON object
SYSTEM_PROMPT = """
You are a data-formatting assistant.
Given a single content about H&M in the fashion domain, produce exactly one valid JSON object with two keys:
  "instruction": a realistic question that a consumer or analyst might ask about this post,
  "output": a helpful, concise answer that directly addresses the question using information from the post.
Constraints:
- Response must be ONLY the raw JSON object (no backticks, no markdown, no extra text).
- Always include a leading space before the value of "output" for fine-tuning consistency.
Example:
{"instruction":"What product does the post praise?","output":" A suede jacket with a modern cut."}
"""
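
The constraints in the system prompt (exactly two keys, raw JSON only, a leading space in `"output"`) can be checked mechanically before a response is accepted. The following is a minimal sketch of such a check; `parse_qa` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import json

REQUIRED_KEYS = {"instruction", "output"}


def parse_qa(text):
    """Parse one model response; raise ValueError describing the first problem found."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        raise ValueError(f"expected exactly the keys {sorted(REQUIRED_KEYS)}")
    if not all(isinstance(obj[k], str) and obj[k].strip() for k in REQUIRED_KEYS):
        raise ValueError("both values must be non-empty strings")
    if not obj["output"].startswith(" "):
        raise ValueError('"output" must start with a single leading space')
    return obj


# The example object from the system prompt passes the check.
good = '{"instruction":"What product does the post praise?","output":" A suede jacket with a modern cut."}'
print(parse_qa(good)["instruction"])

# A response wrapped in extra text is rejected.
try:
    parse_qa("Here is the JSON: " + good)
except ValueError as err:
    print("rejected:", err)
```

A check like this could replace the bare `json.loads` call in the generation loop, so that malformed responses are reported with a specific reason rather than silently accepted.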

In [None]:
real_sample = (
    df_hm
    .sample(n=SAMPLE_SIZE, random_state=42)
    .reset_index(drop=True)
)

In [None]:
print(f"Sampling {SAMPLE_SIZE} real posts for Q-A reformatting...")

Sampling 200 real posts for Q-A reformatting...


## Generate Q-A pairs

In [None]:
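records = []
for idx, row in real_sample.iterrows():
    post = row["content"]
    try:
        resp = client.chat.completions.create(
            model=MODEL,             # "gpt-3.5-turbo", defined during client setup
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",   "content": post},
            ],
            temperature=0.0,         # low-variance decoding for consistent formatting
            max_tokens=200,          # cap the length of each JSON answer
        )
        # The response is a typed object, so fields are read via attribute access
        text = resp.choices[0].message.content.strip()
        qa = json.loads(text)        # raises if the model returned anything but raw JSON
        records.append(qa)

    except Exception as e:
        # Log and skip rows whose API call or JSON parsing failed
        print(f"[Error] row {idx}: {e}")

print(f"Generated {len(records)} Q–A pairs from real data.")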

Generated 200 Q–A pairs from real data.
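
The loop above skips any row whose request fails. For transient failures such as rate limits or timeouts, one common alternative is to retry with exponential backoff. The sketch below shows the idea with a stand-in function instead of a real API call; `call_with_retry` and its parameters are hypothetical names, not part of the `openai` library or of this notebook.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached, doubling the wait each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, "after", calls["n"], "calls")
```

In the generation loop, the request could be wrapped as `call_with_retry(lambda: client.chat.completions.create(...))`, leaving the JSON parsing outside the retry so that malformed answers are not re-requested indefinitely.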


# Save as a Checkpoint

## Export to JSONL

In [None]:
# Define output directory
output_path = "/content/drive/MyDrive/synthetic_prompt_generation_shared/hm"
output_filename = "pretrain_qas_pure.jsonl"

In [None]:
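# Make sure the output directory exists, then build the full path to the output file
os.makedirs(output_path, exist_ok=True)
JSONL_OUT = os.path.join(output_path, output_filename)

# Write one JSON object per line (JSONL); ensure_ascii=False keeps non-ASCII text readable
with open(JSONL_OUT, "w", encoding="utf-8") as fout:
    for rec in records:
        fout.write(json.dumps(rec, ensure_ascii=False))
        fout.write("\n")

print(f"Wrote {len(records)} records to {JSONL_OUT}")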

Wrote 200 records to /content/drive/MyDrive/synthetic_prompt_generation_shared/hm/pretrain_qas_pure.jsonl


# 2 Save Output as JSONL for Fine-Tuning

## Combine RefinedWeb and Synthetic Dataset

In [None]:
# Point to your JSONL file
file_path = "/content/drive/MyDrive/synthetic_prompt_generation_shared/hm/pretrain_qas_pure.jsonl"

# Read each line as JSON and collect into a list of dicts
records = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

# Create a DataFrame
df_refinedweb = pd.DataFrame.from_records(records)

# Inspect the result
print(df_refinedweb.head())
print(f"\nTotal rows: {len(df_refinedweb)}, columns: {df_refinedweb.shape[1]}")

                                         instruction  \
0   What kind of products are mentioned in the post?   
1         What event is being described in the post?   
2                What is the main topic of the post?   
3  What item of clothing is the blogger particula...   
4         Who is credited for the photo in the post?   

                                              output  
0   Baby clothing items such as bodysuits, pants,...  
1   A social gathering with wine tasting, quiz ni...  
2   A quick recap of the author's Friday, includi...  
3   A navy silk dress that she bought on sale and...  
4                                  Markus Koellmann.  

Total rows: 200, columns: 2
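
Several of the generated instructions shown above are generic ("What is the main topic of the post?"), which fits the lower diversity noted for GPT-3.5 Turbo in the model comparison. A quick way to quantify this is to count repeated instructions. The sketch below uses only the standard library and a small hand-made list of records; `instruction_diversity` and `sample_records` are hypothetical names and data for illustration, not results from the actual file.

```python
from collections import Counter


def instruction_diversity(records):
    """Return (share of distinct instructions, list of (instruction, count) seen more than once)."""
    instructions = [r["instruction"].strip().lower() for r in records]
    counts = Counter(instructions)
    unique_ratio = len(counts) / len(instructions) if instructions else 0.0
    repeated = [(text, n) for text, n in counts.most_common() if n > 1]
    return unique_ratio, repeated


# Hypothetical records in the same {"instruction", "output"} shape as the JSONL file.
sample_records = [
    {"instruction": "What is the main topic of the post?", "output": " A weekend recap."},
    {"instruction": "what is the main topic of the post?", "output": " A shopping haul."},
    {"instruction": "What brand is mentioned in the post?", "output": " Steve Madden."},
]

ratio, repeated = instruction_diversity(sample_records)
print(f"distinct instructions: {ratio:.0%}")
for text, n in repeated:
    print(f"{n}x  {text}")
```

The same function can be applied to the `records` list loaded from the JSONL file to see how many of the 200 instructions are distinct.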


In [None]:
df_refinedweb

Unnamed: 0,instruction,output
0,What kind of products are mentioned in the post?,"Baby clothing items such as bodysuits, pants,..."
1,What event is being described in the post?,"A social gathering with wine tasting, quiz ni..."
2,What is the main topic of the post?,"A quick recap of the author's Friday, includi..."
3,What item of clothing is the blogger particula...,A navy silk dress that she bought on sale and...
4,Who is credited for the photo in the post?,Markus Koellmann.
...,...,...
195,What brand is mentioned in the post?,Steve Madden.
196,What type of clothing item is being discussed ...,"A trench coat, specifically a classic vintage..."
197,What are some key fashion items mentioned in t...,"A faux leather jacket, a knit top, trousers w..."
198,What is the overall sentiment towards the dres...,"Overall, the sentiment towards the dress is v..."


In [None]:
df_refinedweb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   instruction  200 non-null    object
 1   output       200 non-null    object
dtypes: object(2)
memory usage: 3.3+ KB


### Save as CSV

In [None]:
# Save as CSV
csv_file = os.path.join(output_path, "synthetic_hm_instruction_pos_refinedweb_pure.csv")

In [None]:
df_refinedweb.to_csv(csv_file, index=False)

print("Files saved to:", output_path)

Files saved to: /content/drive/MyDrive/synthetic_prompt_generation_shared/hm


-- End of the Notebook --