# Synthetic data generation

Resource: https://cookbook.openai.com/examples/sdg1


This project has the following agenda:

1. CSV with a structured prompt
2. CSV with a Python program
3. Multitable CSV with a Python program
4. Simply creating textual data

#### 1. Setup

In [1]:
# !pip install openai pandas

In [2]:
# Make sure to set the 'OPENAI_API_KEY' as your env var
from openai import OpenAI
client = OpenAI()

#### 2. Creating a small CSV

You can quickly generate data by telling the model three things:

1. The format of the data (CSV)
2. The schema
3. Useful information about how the columns relate to one another

The LLM can often deduce these relationships from the column names alone, but stating them explicitly tends to improve the result.
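
The three points can be assembled into a prompt programmatically. The sketch below is only an illustration: `build_csv_prompt` and its parameters are hypothetical names, not part of the original notebook, and the wording of the generated prompt is an assumption.

```python
def build_csv_prompt(n_rows, columns, hints):
    """Assemble a data-generation prompt from format, schema, and column hints.

    Hypothetical helper: `columns` is a list of (name, description) pairs and
    `hints` is a list of plain-language relationships between columns.
    """
    schema = "\n".join(f" - {name} ({desc})" for name, desc in columns)
    relations = "\n".join(f" - {hint}" for hint in hints)
    return (
        f"Create a CSV file with {n_rows} rows of data.\n"       # 1. format
        f"Each row should include the following fields:\n{schema}\n\n"  # 2. schema
        f"Make sure the values respect these relationships:\n{relations}\n\n"  # 3. hints
        "Only respond with the CSV."
    )


prompt = build_csv_prompt(
    n_rows=10,
    columns=[
        ("id", "incrementing integer starting at 1"),
        ("house size", "m^2"),
        ("house price", "whole number"),
        ("location", "text"),
        ("number of bedrooms", "integer"),
    ],
    hints=[
        "more bedrooms usually means a larger house",
        "larger houses and more expensive locations usually cost more",
    ],
)
print(prompt)
```

Keeping the schema and hints as data makes it easy to reuse the same prompt structure for other tables.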

In [3]:
datagen_model = "gpt-4o-mini"

question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. 
More size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""


response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)

res = response.choices[0].message.content
print(res)

```csv
id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,Suburban,3
3,100,300000,Urban,3
4,120,400000,Urban,4
5,90,280000,Suburban,3
6,200,600000,Urban,5
7,60,200000,Suburban,2
8,150,500000,Urban,4
9,110,350000,Urban,3
10,80,220000,Suburban,3
```
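
The reply above is plain text wrapped in a Markdown fence, so it has to be parsed before it can be used as a table. A minimal sketch of one way to do that, assuming the reply contains at most one fenced block (`csv_reply_to_dataframe` and `sample_reply` are hypothetical names used for illustration):

```python
import io
import re

import pandas as pd

FENCE = "`" * 3  # Markdown code-fence delimiter


def csv_reply_to_dataframe(reply: str) -> pd.DataFrame:
    """Strip an optional Markdown code fence and parse the remaining text as CSV."""
    fenced = re.search(FENCE + r"(?:csv)?[ \t]*\n(.*?)" + FENCE, reply, re.DOTALL)
    csv_text = fenced.group(1) if fenced else reply
    return pd.read_csv(io.StringIO(csv_text.strip()))


# A shortened stand-in for the model reply shown above.
sample_reply = (
    f"{FENCE}csv\n"
    "id,house_size_m2,house_price,location,number_of_bedrooms\n"
    "1,50,150000,Suburban,2\n"
    "2,75,250000,Suburban,3\n"
    "3,100,300000,Urban,3\n"
    f"{FENCE}"
)

df = csv_reply_to_dataframe(sample_reply)
print(df.dtypes)
```

If the model ignores the "only respond with the CSV" instruction and adds commentary outside the fence, the fenced branch still recovers the table; an unfenced reply with commentary would need extra handling.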


#### 3. CSV with a Python program
Generating rows directly in the response limits how much data we can produce, because everything has to fit in the model's context. Instead, we can ask the LLM to write a Python program that generates the synthetic data.
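
A reply to this kind of request usually mixes explanation with a fenced code block, so the program has to be separated from the surrounding text before it can be reviewed and run. A minimal sketch, assuming the program sits in the first fenced block of the reply (`extract_python_code` is a hypothetical helper, not part of the OpenAI client):

```python
import re

FENCE = "`" * 3  # Markdown code-fence delimiter


def extract_python_code(reply: str) -> str:
    """Return the body of the first complete fenced block, or "" if there is none."""
    pattern = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1).strip() if match else ""


# A made-up reply used only to exercise the helper.
sample_reply = (
    "Certainly! Here is a program.\n\n"
    f"{FENCE}python\n"
    "rows = [{'id': i} for i in range(1, 4)]\n"
    f"{FENCE}\n"
    "Let me know if you need changes."
)

code = extract_python_code(sample_reply)
print(code)
```

The helper returns an empty string when the closing fence is missing, which is a cheap signal that the reply was cut off. Generated code should be read before it is executed.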

In [4]:
question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. 
more size is usually higher price etc. make sure all the numbers make sense).
"""


response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)

res = response.choices[0].message.content
print(res)


Certainly! Below is a Python program that generates synthetic housing data with the specified characteristics. It uses the `pandas` library to create a DataFrame with 100 rows of data.

```python
import pandas as pd
import random

# Constants
locations = ['Downtown', 'Suburb', 'Countryside']
prices_per_m2 = {
    'Downtown': 5000,
    'Suburb': 3000,
    'Countryside': 1500
}
bedrooms_per_size = {
    'Downtown': (50, 200),  # size range for Downtown
    'Suburb': (80, 300),    # size range for Suburb
    'Countryside': (100, 400)  # size range for Countryside
}

def generate_housing_data(num_rows=100):
    data = []
    
    for i in range(1, num_rows + 1):
        # Randomly select a location
        location = random.choice(locations)

        # Generate house size based on location
        min_size, max_size = bedrooms_per_size[location]
        house_size = random.randint(min_size, max_size)

        # Determine number of bedrooms based on house size
        if location == 'Downto

In [7]:
import pandas as pd
import random

# Constants
locations = ['Downtown', 'Suburb', 'Countryside']
prices_per_m2 = {
    'Downtown': 5000,
    'Suburb': 3000,
    'Countryside': 1500
}
bedrooms_per_size = {
    'Downtown': (50, 200),  # size range for Downtown
    'Suburb': (80, 300),    # size range for Suburb
    'Countryside': (100, 400)  # size range for Countryside
}

def generate_housing_data(num_rows=500):
    data = []
    
    for i in range(1, num_rows + 1):
        # Randomly select a location
        location = random.choice(locations)

        # Generate house size based on location
        min_size, max_size = bedrooms_per_size[location]
        house_size = random.randint(min_size, max_size)

        # Choose the number of bedrooms based on location (house size is not used here)
        if location == 'Downtown':
            number_of_bedrooms = random.randint(1, 3)
        elif location == 'Suburb':
            number_of_bedrooms = random.randint(2, 5)
        else:  # Countryside
            number_of_bedrooms = random.randint(3, 6)
        
        # Calculate house price
        house_price = house_size * prices_per_m2[location]

        # Append the row to the data list
        data.append({
            'id': i,
            'house_size': house_size,
            'house_price': house_price,
            'location': location,
            'number_of_bedrooms': number_of_bedrooms
        })

    return pd.DataFrame(data)

# Generate the housing data DataFrame
housing_df = generate_housing_data(500)

# Display the DataFrame
print(housing_df)

      id  house_size  house_price     location  number_of_bedrooms
0      1         151       755000     Downtown                   1
1      2          56       280000     Downtown                   1
2      3         313       469500  Countryside                   4
3      4         100       500000     Downtown                   1
4      5         132       396000       Suburb                   2
..   ...         ...          ...          ...                 ...
495  496         261       391500  Countryside                   3
496  497         149       223500  Countryside                   5
497  498          97       291000       Suburb                   3
498  499         160       800000     Downtown                   1
499  500         378       567000  Countryside                   5

[500 rows x 5 columns]
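
Because the program is generated, it is worth checking that the data actually follows the relationships requested in the prompt. In the program above, the bedroom count is drawn from the location alone, so a link between size and bedrooms is not guaranteed. The sketch below shows a few checks one might run; `check_housing` is a hypothetical helper and the sample rows are made up for illustration.

```python
import pandas as pd


def check_housing(df: pd.DataFrame) -> dict:
    """Run a few plausibility checks on a housing DataFrame."""
    return {
        "ids_unique": bool(df["id"].is_unique),
        "prices_positive": bool((df["house_price"] > 0).all()),
        # Pearson correlations: values near 1 mean the columns move together.
        "size_price_corr": float(df["house_size"].corr(df["house_price"])),
        "size_bedrooms_corr": float(df["house_size"].corr(df["number_of_bedrooms"])),
    }


# Made-up rows with the same column names as the generated DataFrame.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "house_size": [60, 90, 120, 200],
    "house_price": [180000, 270000, 360000, 600000],
    "number_of_bedrooms": [2, 3, 3, 5],
})

report = check_housing(sample)
print(report)
```

Running the same function on `housing_df` would show whether the generated data has the intended size/bedroom relationship; a correlation near zero suggests the prompt or the program needs adjusting.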


#### 4. Multitable CSV with a Python program

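The same three points apply when generating several related tables, with extra care for how the tables fit together:

1. Specify the format
2. Specify the schema
3. Provide useful information:
    * How the datasets relate to each other
    * How large the datasets should be relative to one another
    * How primary and foreign keys should be defined so that they match up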
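
Whether the keys really match up can be verified once the tables exist. A minimal sketch of such a check, assuming each parent table has an `id` primary key (`orphaned_keys` and the tiny tables are hypothetical, made up for illustration):

```python
import pandas as pd


def orphaned_keys(child, fk_column, parent, pk_column="id"):
    """Return foreign-key values in `child` that have no matching row in `parent`."""
    mask = ~child[fk_column].isin(parent[pk_column])
    return sorted(child.loc[mask, fk_column].unique().tolist())


locations = pd.DataFrame({"id": [1, 2, 3], "city": ["City_1", "City_2", "City_3"]})
housing = pd.DataFrame({"id": [1, 2, 3, 4], "location_id": [1, 3, 3, 4]})

# Primary keys should be unique; foreign keys should all resolve.
print(locations["id"].is_unique)
print(orphaned_keys(housing, "location_id", locations))  # prints [4]
```

An empty list means every foreign key points at an existing row.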


In [8]:
question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. 
more size is usually higher price etc. make sure all the numbers make sense).

Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.

Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

Sure! Below is a Python program that uses the Pandas library to generate the three requested DataFrames: Housing data, Location data, and House types. The program ensures that the relationships between these DataFrames are logical and consistent.

```python
import pandas as pd
import numpy as np

# Set a seed for reproducibility
np.random.seed(42)

# 1. Generate Location Data
num_locations = 10
location_data = {
    "id": range(1, num_locations + 1),
    "country": np.random.choice(['USA', 'Canada', 'UK', 'Germany', 'France'], num_locations),
    "city": [f"City_{i}" for i in range(1, num_locations + 1)],
    "population": np.random.randint(10000, 1000000, num_locations),
    "area_m2": np.random.randint(500000, 10000000, num_locations),  # in square meters
}
locations_df = pd.DataFrame(location_data)

# 2. Generate House Types
num_house_types = 5
house_type_data = {
    "id": range(1, num_house_types + 1),
    "house_type": [f"Type_{i}" for i in range(1, num_house_types + 1)],
    "av

In [9]:
import pandas as pd
import numpy as np

# Set a seed for reproducibility
np.random.seed(42)

# 1. Generate Location Data
num_locations = 10
location_data = {
    "id": range(1, num_locations + 1),
    "country": np.random.choice(['USA', 'Canada', 'UK', 'Germany', 'France'], num_locations),
    "city": [f"City_{i}" for i in range(1, num_locations + 1)],
    "population": np.random.randint(10000, 1000000, num_locations),
    "area_m2": np.random.randint(500000, 10000000, num_locations),  # in square meters
}
locations_df = pd.DataFrame(location_data)

# 2. Generate House Types
num_house_types = 5
house_type_data = {
    "id": range(1, num_house_types + 1),
    "house_type": [f"Type_{i}" for i in range(1, num_house_types + 1)],
    "average_price": np.random.randint(100000, 1000000, num_house_types),
    "number_of_houses": np.random.randint(50, 300, num_house_types),
}
house_types_df = pd.DataFrame(house_type_data)

# 3. Generate Housing Data
num_houses = 100
house_size = np.random.choice(np.arange(50, 300, 10), num_houses)  # House size from 50 to 290 m^2 in steps of 10
num_bedrooms = np.random.choice([1, 2, 3, 4, 5], num_houses, p=[0.2, 0.3, 0.3, 0.15, 0.05])  # More common to have 2-3 bedrooms
location_ids = np.random.choice(locations_df['id'], num_houses)
house_type_ids = np.random.choice(house_types_df['id'], num_houses)

# Assuming price increases with size and location population
house_price = []
for i in range(num_houses):
    base_price = house_size[i] * (1000 + num_bedrooms[i] * 500)  # Base price dependent on size and bedrooms
    location_multiplier = 1 + (locations_df.loc[locations_df['id'] == location_ids[i], 'population'].iloc[0] / 1000000)
    house_price.append(int(base_price * location_multiplier))

housing_data = {
    "id": range(1, num_houses + 1),
    "house_size_m2": house_size,
    "house_price": house_price,
    "location_id": location_ids,
    "number_of_bedrooms": num_bedrooms,
    "house_type_id": house_type_ids
}
housing_df = pd.DataFrame(housing_data)

# Display the generated DataFrames
print("Locations DataFrame:")
print(locations_df)
print("\nHouse Types DataFrame:")
print(house_types_df)
print("\nHousing DataFrame:")
print(housing_df.head())

Locations DataFrame:
   id  country     city  population  area_m2
0   1  Germany   City_1      185203  6519877
1   2   France   City_2      201335  3844769
2   3       UK   City_3      288167  9680351
3   4   France   City_4       51090   603355
4   5   France   City_5      339365  1762752
5   6   Canada   City_6       74820  6164789
6   7       UK   City_7      797201  9805648
7   8       UK   City_8      331879  6243066
8   9       UK   City_9      728315  6613790
9  10   France  City_10      337069  5221339

House Types DataFrame:
   id house_type  average_price  number_of_houses
0   1     Type_1         748143               239
1   2     Type_2         552366               100
2   3     Type_3         165725               157
3   4     Type_4         229981               104
4   5     Type_5         184654               293

Housing DataFrame:
   id  house_size_m2  house_price  location_id  number_of_bedrooms  \
0   1            290       971039            5                   3   
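
In the output above, `number_of_houses` in the house types table sums to 893 while the housing table has 100 rows, and `average_price` is drawn at random rather than computed from the houses. One way to keep such derived columns consistent is to compute them from the housing table instead of generating them independently. A sketch under that assumption, using small made-up tables with the same column names:

```python
import pandas as pd

# Made-up stand-ins for the generated tables.
housing = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "house_price": [300000, 500000, 200000, 400000, 250000],
    "house_type_id": [1, 1, 2, 2, 2],
})
house_types = pd.DataFrame({
    "id": [1, 2, 3],
    "house_type": ["Type_1", "Type_2", "Type_3"],
})

# Aggregate the housing table per house type.
stats = (
    housing.groupby("house_type_id")
    .agg(average_price=("house_price", "mean"), number_of_houses=("id", "count"))
    .reset_index()
    .rename(columns={"house_type_id": "id"})
)

# Left merge keeps house types that have no houses; their count becomes 0.
house_types = house_types.merge(stats, on="id", how="left")
house_types["number_of_houses"] = house_types["number_of_houses"].fillna(0).astype(int)
print(house_types)
```

With this approach the counts always add up to the number of rows in the housing table. The same requirement could instead be stated in the prompt, at the cost of relying on the model to follow it.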


#### 5. Simply creating textual data

The same approach works for free text. Here we generate input/output pairs that can be used to fine-tune another GPT model.

In [16]:
output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. 
  The usecase is a retailer generating a description for a product from a product catalogue.
  I want the input to be product name and category (to which the product belongs to) and output to be description.
  
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create as many training pairs as possible.
  """

  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  # print(res)
  output_string += res + "\n" + "\n"
print(output_string[:1000])  # display a truncated response


1.
Input: Wireless Bluetooth Headphones, Electronics
Output: Experience crystal-clear sound with our Wireless Bluetooth Headphones, designed for comfort and effortless connectivity. Perfect for music lovers and on-the-go professionals alike.

2.
Input: Organic Green Tea, Grocery
Output: Savor the delicate flavors of our Organic Green Tea, sourced from the finest tea leaves. Packed with antioxidants, it's a refreshing beverage choice for a healthy lifestyle.

3.
Input: Stainless Steel Water Bottle, Kitchen
Output: Stay hydrated in style with our Stainless Steel Water Bottle. Designed to keep your drinks cold for 24 hours or hot for 12, it’s perfect for workouts, travel, or daily use.

4.
Input: Ergonomic Office Chair, Furniture
Output: Enhance your workspace with our Ergonomic Office Chair, designed for maximum comfort and support. With adjustable height and lumbar support, it's the perfect solution for long hours at your desk.

5.
Input: 4K Ultra HD Smart TV, Electronics
Output: Enjoy 

In [18]:
import re
# Regex capturing product name, category, and description from each Input/Output pair
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())


# Show the last parsed pair as a quick sanity check
print(product)
print(category)
print(description)

Organic Green Tea
Beverages
Refresh your senses with our Organic Green Tea. Sourced from the finest tea leaves, this blend offers a delicate flavor profile packed with antioxidants and health benefits, making it the perfect daily brew.
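
To use the parsed pairs for fine-tuning, they need to be written out as training examples. The sketch below assumes the chat-style JSONL layout (one JSON object with a `messages` list per line); `to_chat_jsonl`, the system prompt text, and the sample lists are hypothetical. It also drops repeated product/category pairs, since sending the same prompt several times can return the same product more than once.

```python
import json


def to_chat_jsonl(products, categories, descriptions, system_prompt):
    """Convert parsed pairs into JSONL lines, skipping repeated (product, category) pairs."""
    lines = []
    seen = set()
    for product, category, description in zip(products, categories, descriptions):
        key = (product.lower(), category.lower())
        if key in seen:
            continue  # keep only the first description for each pair
        seen.add(key)
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{product}, {category}"},
                {"role": "assistant", "content": description},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Made-up lists shaped like the ones built by the parsing loop.
products = ["Organic Green Tea", "Organic Green Tea", "Ergonomic Office Chair"]
categories = ["Grocery", "Grocery", "Furniture"]
descriptions = [
    "Savor the delicate flavors of our Organic Green Tea.",
    "Refresh your senses with our Organic Green Tea.",
    "Enhance your workspace with our Ergonomic Office Chair.",
]

jsonl = to_chat_jsonl(
    products, categories, descriptions,
    system_prompt="Write a product description from a product name and category.",
)
print(jsonl)
```

The resulting string can be written to a `.jsonl` file; the required format should be checked against the fine-tuning documentation for the target model.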
