# Synthetic data generation

In [4]:
from openai import OpenAI
import os
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()

# CSV with a structured prompt


Here we create data in the simplest way: by asking the model for it directly. Three things make the prompt effective: stating the output format (CSV), giving the schema, and describing how the columns relate to one another. The LLM can often deduce these relationships from the column names, but spelling them out improves results.

In [5]:
datagen_model = "gpt-4o-mini"
question = """
Create a CSV file with 10 rows of housing data.
- id (incremental integer starting at 1)
- house size (m^2)
- house price 
- location 
- number of bedrooms
Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
    model= datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)

```csv
id,house_size_m2,house_price_usd,location,number_of_bedrooms
1,80,300000,Suburban,2
2,120,450000,Urban,3
3,95,350000,Suburban,3
4,200,800000,Urban,4
5,150,600000,Urban,3
6,70,250000,Rural,2
7,180,750000,Urban,4
8,110,400000,Suburban,3
9,130,500000,Urban,3
10,90,320000,Suburban,2
```
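
The reply above arrives wrapped in a Markdown code fence even though the prompt asked for only the CSV, so it needs unwrapping before it can be loaded. The following is a minimal sketch, assuming the reply has the shape shown above. `strip_code_fence` is a hypothetical helper name, and `sample_reply` is a shortened copy of the printed output standing in for `res`.

```python
import io
import pandas as pd

FENCE = "`" * 3  # Markdown code fence marker

def strip_code_fence(text: str) -> str:
    """Drop a leading and a trailing fence line, if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)

# Shortened copy of the reply printed above, standing in for `res`.
sample_reply = "\n".join([
    FENCE + "csv",
    "id,house_size_m2,house_price_usd,location,number_of_bedrooms",
    "1,80,300000,Suburban,2",
    "2,120,450000,Urban,3",
    FENCE,
])

housing_df = pd.read_csv(io.StringIO(strip_code_fence(sample_reply)))
print(housing_df.dtypes)
```

Checking `dtypes` after loading is a cheap way to confirm that the numeric columns were parsed as numbers rather than strings.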


# CSV with a Python program

The drawback of generating data directly is that the amount we can produce is limited by the context window. Instead, we can ask the LLM to write a Python program that generates the synthetic data. This scales to much more data, and inspecting the program shows exactly how the data was generated.

We can then edit the program as needed, using the generated version as a starting point.
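
Models typically return such a program inside a Markdown code fence with explanatory prose around it, so the code has to be separated from the reply before it can be reviewed or run. The following is a minimal sketch under that assumption. `extract_python_block` is a hypothetical helper, and `sample_reply` is an invented stand-in for a real model reply.

```python
import re
from typing import Optional

FENCE = "`" * 3  # Markdown code fence marker

def extract_python_block(reply: str) -> Optional[str]:
    """Return the first fenced Python block in a reply, or None if there is none."""
    pattern = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1).strip() if match else None

# Invented stand-in for a model reply: prose, a fenced program, more prose.
sample_reply = "\n".join([
    "Here's a program.",
    "",
    FENCE + "python",
    "x = 1 + 1",
    "print(x)",
    FENCE,
    "Let me know if you need changes.",
])

code = extract_python_block(sample_reply)
print(code)
```

The helper returns `None` when the closing fence is missing, for example when a reply was cut off. Generated code should be read before it is executed.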

In [6]:
question = """
Create a Python program to generate 10 rows of housing data.
I want you to at the end of it output a pandas dataframe with 10 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)


Here's a Python program that generates synthetic housing data according to your specifications. It outputs a pandas DataFrame with 10 rows of data, ensuring that the relationships between house size, price, location, and the number of bedrooms make sense.

```python
import pandas as pd
import random

# Function to generate synthetic housing data
def generate_housing_data(num_rows):
    data = []
    locations = ["Urban", "Suburban", "Rural"]
    
    for i in range(1, num_rows + 1):
        location = random.choice(locations)

        if location == "Urban":
            house_size = random.randint(50, 200)  # smaller sizes in urban areas
            num_bedrooms = random.randint(1, 3)
            house_price = round(house_size * 3000 + random.randint(5000, 20000), 2)  # higher prices
        elif location == "Suburban":
            house_size = random.randint(80, 300)  # moderate sizes in suburban areas
            num_bedrooms = random.randint(3, 5)
            house_price = round(hou

# Multitable CSV with a Python program

To create multiple datasets that relate to each other (for example housing, location, and house type), we again need to specify the format, the schema, and useful information. However, more of that information is needed to get good results. The details are case-specific, but it helps to describe how the datasets relate to each other, how large they should be relative to one another, and how primary and foreign keys should be set up. Ideally, previously generated datasets are also used to populate new ones, so that the data values match where necessary.
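
Whether the keys actually line up is easy to verify once the dataframes exist. The following is a minimal sketch of such a check, assuming the housing table refers to locations through a column named `location_id` (a hypothetical name; the generated program may choose a different one). The small frames are hand-written stand-ins, with one deliberately invalid key.

```python
import pandas as pd

# Hand-written stand-ins for two generated tables.
location_df = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["New York", "Toronto", "London"],
})
housing_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "location_id": [1, 3, 3, 7],  # 7 deliberately matches no location
})

def orphaned_rows(child, fk_column, parent, pk_column="id"):
    """Return child rows whose foreign key has no match in the parent table."""
    return child[~child[fk_column].isin(parent[pk_column])]

# Primary keys should be unique before they are used as join targets.
assert location_df["id"].is_unique

orphans = orphaned_rows(housing_df, "location_id", location_df)
print(orphans)
```

An empty result means every foreign key resolves. Any rows returned point to a mismatch worth fixing in the prompt or in the generated program.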

In [10]:
question = """Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 5 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)

Here's a Python program that generates three different pandas dataframes for housing data, location data, and house type data. The generated data adheres to the constraints you've specified, including foreign key relationships between the dataframes.

```python
import pandas as pd
import random

# Set a random seed for reproducibility
random.seed(42)

# 1. Generate Location Data
# Locations
locations = [
    {"id": 1, "country": "USA", "city": "New York", "population": 8419600, "area": 789},    # area in km^2
    {"id": 2, "country": "Canada", "city": "Toronto", "population": 2731579, "area": 630},   # area in km^2
    {"id": 3, "country": "UK", "city": "London", "population": 8982000, "area": 1572},       # area in km^2
    {"id": 4, "country": "Germany", "city": "Berlin", "population": 3644826, "area": 891},  # area in km^2
    {"id": 5, "country": "Australia", "city": "Sydney", "population": 5230330, "area": 1237} # area in km^2
]

location_df = pd.DataFrame(locations)
print("Locati

# Simply creating textual data

In [11]:
output_string = ""
for i in range(3):
    question = f"""
  I am creating input output training pairs to fine tune my gpt model. 
  The usecase is a retailer generating a description for a product from a product catalogue. 
  I want the input to be product name and category (to which the product belongs to) and output to be description.
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create 5 training pairs.
"""
    response = client.chat.completions.create(
        model=datagen_model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
            {"role": "user", "content": question}
        ]
    )

    res = response.choices[0].message.content
    output_string += res + "\n" + "\n"
print(output_string[:1000])

1.
Input: Premium Coffee Maker, Kitchen Appliances
Output: Experience barista-quality coffee from the comfort of your home with our Premium Coffee Maker. Featuring a sleek design and advanced brewing technology, this machine brews rich, aromatic coffee at the perfect temperature. With customizable settings, you can enjoy your coffee just the way you like it every morning.

2.
Input: Waterproof Bluetooth Speaker, Electronics
Output: Take your music anywhere with our Waterproof Bluetooth Speaker. Designed for durability and portability, this speaker delivers stunning sound quality while being resistant to water, making it ideal for pool parties or beach outings. Enjoy up to 10 hours of playtime and easy connectivity to all your devices.

3.
Input: Organic Cotton Bed Sheets, Home Textiles
Output: Sleep peacefully on our Organic Cotton Bed Sheets, crafted from the finest natural fibers for ultimate softness and comfort. These eco-friendly sheets are hypoallergenic and breathable, ensuring 

In [12]:
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

# Each match is a (product, category, description) tuple
for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products

['Premium Coffee Maker',
 'Waterproof Bluetooth Speaker',
 'Organic Cotton Bed Sheets',
 'Ergonomic Office Chair',
 'Stainless Steel Water Bottle',
 'Wireless Noise Cancelling Headphones',
 'Organic Cotton T-Shirt',
 'Stainless Steel Water Bottles',
 'Non-Stick Cookware Set',
 'Wireless Fitness Tracker',
 'Wireless Noise-Cancelling Headphones',
 'Organic Green Tea',
 "Men's Running Shoes",
 'Yoga Mat',
 'Stainless Steel Water Bottle']
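
The list above contains exact repeats as well as near-repeats that differ only by a hyphen. The following is a minimal sketch for surfacing them, using the `products` values printed above. The `normalize` helper is an assumption for illustration, and its rules (lowercasing, treating hyphens as spaces) are only one possible choice.

```python
from collections import Counter

# Product names copied from the output above.
products = [
    "Premium Coffee Maker", "Waterproof Bluetooth Speaker",
    "Organic Cotton Bed Sheets", "Ergonomic Office Chair",
    "Stainless Steel Water Bottle", "Wireless Noise Cancelling Headphones",
    "Organic Cotton T-Shirt", "Stainless Steel Water Bottles",
    "Non-Stick Cookware Set", "Wireless Fitness Tracker",
    "Wireless Noise-Cancelling Headphones", "Organic Green Tea",
    "Men's Running Shoes", "Yoga Mat", "Stainless Steel Water Bottle",
]

def normalize(name: str) -> str:
    """Lowercase the name and treat hyphens as spaces so near-duplicates collide."""
    return name.lower().replace("-", " ").strip()

counts = Counter(normalize(p) for p in products)
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)
print(f"{len(counts)} distinct names out of {len(products)}")
```

This simple normalization does not catch plural variants such as "Stainless Steel Water Bottles", so the true amount of repetition is somewhat higher than the count suggests.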

# Dealing with imbalanced or non-diverse textual data