# Generating Data from Descriptions

Learn how to generate machine learning datasets based on natural language descriptions. This notebook walks you through defining data structures, generating professional specifications, and creating datasets optimized for training machine learning models using DataWizzAI.

# Initial Setup Guide

## Import Required Packages

In [1]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataGenerationPipeline import DataGenerationPipeline
from src.Pipeline import Pipeline


## Load Environment Variables

In [2]:

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())


## Initialize the Language Model

In [3]:
# Please make sure OPENAI_API_KEY is loaded to your environment variables
# Initialize language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")
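
`ChatOpenAI` reads the key from the environment, so a quick check before creating the model can give a clearer error than a failed API call later. A minimal sketch; `require_env` is a hypothetical helper, not part of DataWizzAI or LangChain:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Hypothetical usage, before constructing the language model:
# api_key = require_env("OPENAI_API_KEY")
```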


# Defining a Data Structure - DescriptionToMLDataset

Describe the required data structure:

In [4]:
# Define the required data structure and view the result structure
dataStructureDescription = "Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics."

Initialize the DataGenerationPipeline object and run the extract_sample_data method to view a sample of the generated data and confirm that you got the desired data structure.

Pass the desired pipelineName (e.g. DataGenerationPipelineObj.Pipeline.ExamplesDataframeToTabular) or let the flow recognize it automatically from your task description.
 
These are the supported pipelines:
- DescriptionToMLDataset 
- DescriptionToDB 
- SQLToTabular 
- DescriptionToUnstructured 
- ExamplesDataframeToTabular 
- APISpecificationToData 
- UNKNOWN 

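The output format of the data sample can be 0 for a string, 1 for a JSON object, or 2 for a pandas DataFrame. Note that extract_sample_data and query_sample_data take it as outputFormat, while generate_data takes it as output_format.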
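
The cells in this notebook pass the format as a bare integer (`outputFormat=2` returns pandas DataFrames, `outputFormat=1` returns JSON-like objects). Named constants can make such calls easier to read. A small sketch; the `OutputFormat` enum is a hypothetical convenience and not part of the package, and the value for the string format is taken from the description above rather than from a cell output:

```python
from enum import IntEnum


class OutputFormat(IntEnum):
    """Hypothetical names for the integer format codes used in this notebook."""
    STRING = 0
    JSON = 1
    DATAFRAME = 2


# IntEnum members behave like plain ints, so they can be passed wherever
# the integer code is expected, e.g. outputFormat=OutputFormat.DATAFRAME.
print(int(OutputFormat.DATAFRAME))  # 2
```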

In [5]:
DataGenerationPipelineObj = DataGenerationPipeline(llm = llm)

# Initialize a DataGenerationPipelineObj for the task of defining ML dataset from textual description
pipeline_name = Pipeline.DescriptionToMLDataset
sample_data = DataGenerationPipelineObj.extract_sample_data( description=dataStructureDescription, outputFormat=2 , pipelineName=pipeline_name)
print(sample_data)

{'AML_dataset':   customer_id  age  gender income_level transaction_id  transaction_amount  \
0        C001   35    Male         High           T001                5000   
1        C002   45  Female       Medium           T002                2000   
2        C003   50    Male          Low           T003               10000   
3        C004   30  Female         High           T004                3000   
4        C005   55    Male       Medium           T005                1500   
5        C006   40  Female          Low           T006                7000   
6        C007   25    Male         High           T007                4000   
7        C008   60  Female       Medium           T008                2500   
8        C009   35    Male          Low           T009                6000   
9        C010   50  Female         High           T010                3500   

      transaction_type       location transaction_date  previous_transactions  \
0        Wire Transfer       New York       

# Generating Data

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.

The method can run either synchronously (run_in_parallel=False) or asynchronously (run_in_parallel=True).

Pass the desired number of records with the num_records parameter. The exception is the DescriptionToDB pipeline, which generates a relational DB: in that case, pass a dictionary of table sizes to the tables_size_dict parameter, e.g. {'users': 50, 'transactions': 100}.

In [6]:
# You can view a sample of the generated data:
generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=False, output_format=2)
generated_data

{'AML_dataset':    customer_id  age  gender income_level transaction_id  transaction_amount  \
 0         C011   28    Male         High           T011                4800   
 1         C012   42  Female       Medium           T012                2100   
 2         C013   47    Male          Low           T013                9800   
 3         C014   33  Female         High           T014                3200   
 4         C015   52    Male       Medium           T015                1650   
 5         C016   36  Female          Low           T016                7200   
 6         C017   23    Male         High           T017                4300   
 7         C018   58  Female       Medium           T018                2900   
 8         C019   31    Male          Low           T019                6200   
 9         C020   48  Female         High           T020                3700   
 10        C021   29    Male         High           T021                5100   
 11        C022   44  Fem

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to generate your dataset asynchronously by calling generate_data with run_in_parallel=True.


### Setup for Parallel Execution

Jupyter notebooks already run their own asyncio event loop, and by default asyncio does not allow a second loop to be started inside a running one. To make parallel execution work in such environments, we use nest_asyncio, which patches asyncio to allow nested event loops.

In [7]:
import nest_asyncio
nest_asyncio.apply()
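
To see the problem that nest_asyncio works around, the sketch below calls `asyncio.run` from inside a coroutine that is already running on an event loop. It uses only the standard library and is meant to be run as a plain script, outside Jupyter and without nest_asyncio applied:

```python
import asyncio


async def inner():
    return 42


async def outer():
    # A loop is already running here, so starting another one fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the `RuntimeError` text raised by asyncio for the nested call.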

### Generate Full Output in Parallel

To generate data in parallel, set the run_in_parallel parameter to True. This method allows you to specify the query (if any), the number of records, region, and language, similarly to generate_data, but executes multiple data generation tasks concurrently.

In [8]:

generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=True, output_format=2)
print(generated_data)

{'AML_dataset':    customer_id  age  gender income_level transaction_id  transaction_amount  \
0         C011   28    Male       Medium           T011                1800   
1         C012   48  Female          Low           T012                9000   
2         C013   42    Male         High           T013                3200   
3         C014   33  Female       Medium           T014                2500   
4         C015   58    Male          Low           T015                7000   
5         C016   38  Female         High           T016                4300   
6         C017   23    Male       Medium           T017                1500   
7         C018   63  Female          Low           T018                6200   
8         C019   32    Male         High           T019                3600   
9         C020   53  Female       Medium           T020                2800   
10        C011   28    Male          Low           T011                4500   
11        C012   42  Female         
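
The package's internal scheduling is not shown in this notebook. As a rough illustration of what running generation tasks concurrently can look like, the sketch below splits a record count into batches and awaits them together with `asyncio.gather`. The names `split_into_batches`, `fake_generate`, and `generate_concurrently` are hypothetical, and `fake_generate` stands in for a real LLM request:

```python
import asyncio


def split_into_batches(num_records, batch_size):
    """Split a record count into batch sizes, e.g. 25 with size 10 -> [10, 10, 5]."""
    full, rest = divmod(num_records, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


async def fake_generate(batch_index, size):
    """Stand-in for one generation request; a real call would await an LLM API."""
    await asyncio.sleep(0)
    return [{"batch": batch_index, "row": i} for i in range(size)]


async def generate_concurrently(num_records, batch_size):
    sizes = split_into_batches(num_records, batch_size)
    # gather schedules all batches at once and returns results in input order.
    batches = await asyncio.gather(
        *(fake_generate(i, s) for i, s in enumerate(sizes))
    )
    return [row for batch in batches for row in batch]


rows = asyncio.run(generate_concurrently(30, 10))
print(len(rows))  # 30
```

Inside a notebook, the final `asyncio.run` call relies on `nest_asyncio.apply()` having been executed first.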

## Optional: query/filter the data structure to control the generated content

First update the earlier defined data structure by calling query_sample_data with the details of the query, and check a sample of the updated structure:

In [9]:
# You can also add queries and filters to guide the generated contents:
region = 'France'
language = 'French'
query = "Only 40 years old customers or older."

sample_data = DataGenerationPipelineObj.query_sample_data(query=query, region=region, language=language, outputFormat=2 )

print(sample_data)

{'AML_dataset':   customer_id  age  gender income_level transaction_id  transaction_amount  \
0        C011   55    Male       Medium           T011                1200   
1        C012   45  Female          Low           T012                6500   
2        C013   60    Male         High           T013                3000   

     transaction_type   location transaction_date  previous_transactions  \
0  Retrait en espèces      Paris       2022-01-05                     11   
1   Virement bancaire       Lyon       2022-02-15                      6   
2    Dépôt par chèque  Marseille       2022-03-25                     21   

   suspicious  
0        True  
1       False  
2       False  }


Then call the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.

In [10]:
nest_asyncio.apply()
generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=True, output_format=2, query=query, region=region, language=language)
print(generated_data)

{'AML_dataset':    customer_id  age  gender income_level transaction_id  transaction_amount  \
0         C021   42  Female          Low           T021                1800   
1         C022   48    Male       Medium           T022                5000   
2         C023   55  Female         High           T023                2500   
3         C024   47    Male       Medium           T024                4200   
4         C025   59  Female         High           T025                3700   
5         C026   41    Male          Low           T026                2800   
6         C027   50  Female       Medium           T027                1500   
7         C028   44    Male          Low           T028                4300   
8         C029   57  Female         High           T029                3200   
9         C030   46    Male       Medium           T030                6800   
10        C021   48  Female       Medium           T021                1800   
11        C022   52    Male         
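
In the output above, customer_id C021 appears at both row 0 and row 10, so it can be worth checking generated rows for repeated IDs and for rows that ignore the query. A minimal sketch over plain dictionaries; `check_records` is a hypothetical helper, and the toy rows only mimic the shape of the output:

```python
from collections import Counter


def check_records(records, min_age, id_field="customer_id"):
    """Return (rows violating the age filter, IDs that occur more than once)."""
    too_young = [r for r in records if r["age"] < min_age]
    counts = Counter(r[id_field] for r in records)
    duplicate_ids = sorted(k for k, n in counts.items() if n > 1)
    return too_young, duplicate_ids


# Toy rows shaped like the output above (values are illustrative only).
rows = [
    {"customer_id": "C021", "age": 42},
    {"customer_id": "C022", "age": 48},
    {"customer_id": "C021", "age": 48},
    {"customer_id": "C023", "age": 35},
]
too_young, duplicate_ids = check_records(rows, min_age=40)
print(too_young)      # [{'customer_id': 'C023', 'age': 35}]
print(duplicate_ids)  # ['C021']
```

With pandas output, the rows can be obtained with `generated_data['AML_dataset'].to_dict("records")`.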

## Try additional examples:

### Generating data with SQL commands:

In [11]:
# Define the required data structure and view the result structure
dataStructureDescription = """\
CREATE TABLE Customers (\
    CustomerID INT PRIMARY KEY,\
    FirstName VARCHAR(50),\
    LastName VARCHAR(50),\
    Email VARCHAR(100),\
    JoinDate DATE\
);\
CREATE TABLE Orders (\
    OrderID INT PRIMARY KEY,\
    CustomerID INT,\
    OrderDate DATE,\
    ProductCategory VARCHAR(50),\
    ProductName VARCHAR(100),\
    Units INT,\
    TotalAmount DECIMAL(10, 2),\
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)\
);"""
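
Before sending DDL to the pipeline, it can help to confirm that the statements parse at all. One inexpensive way, sketched below, is to execute them against an in-memory SQLite database from the standard library. This is an optional check that is not part of the package, and SQLite's dialect is more permissive than most databases, so passing here does not guarantee validity elsewhere:

```python
import sqlite3

ddl = """
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100),
    JoinDate DATE
);
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    ProductCategory VARCHAR(50),
    ProductName VARCHAR(100),
    Units INT,
    TotalAmount DECIMAL(10, 2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
"""


def list_tables(ddl_script):
    """Run the DDL in a throwaway database and return the created table names."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl_script)  # raises sqlite3.Error on invalid SQL
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


print(list_tables(ddl))  # ['Customers', 'Orders']
```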

In [12]:
DataGenerationPipelineObj = DataGenerationPipeline(llm = llm)

# Initialize a DataGenerationPipelineObj for the task of defining relational DB from SQL description
pipeline_name = Pipeline.SQLToTabular
sample_data = DataGenerationPipelineObj.extract_sample_data( description=dataStructureDescription, outputFormat=2 , pipelineName=pipeline_name)
print(sample_data)

{'Customers':    CustomerID FirstName LastName                      Email    JoinDate
0           1     Alice    Smith    alice.smith@example.com  2020-01-15
1           2       Bob  Johnson    bob.johnson@example.com  2019-08-20
2           3   Charlie    Brown  charlie.brown@example.com  2018-05-10
3           4     David      Lee      david.lee@example.com  2017-11-30
4           5      Emma   Garcia    emma.garcia@example.com  2016-03-25, 'Orders':    OrderID  CustomerID   OrderDate ProductCategory ProductName  Units  \
0        1           1  2021-05-20     Electronics  Smartphone      2   
1        2           2  2021-04-10        Clothing       Shirt      3   
2        3           3  2021-03-05           Books       Novel      1   
3        4           4  2021-02-15      Home Decor    Curtains      4   
4        5           5  2021-01-10          Beauty    Lipstick      1   

   TotalAmount  
0      1200.00  
1        75.50  
2        15.99  
3        89.75  
4        20.50  }
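
Because Orders references Customers through CustomerID, one useful check on generated relational data is that every order points at an existing customer. A minimal sketch over plain dictionaries; `orphaned_rows` is a hypothetical helper and the rows are toy data:

```python
def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose foreign key value has no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


customers = [{"CustomerID": 1}, {"CustomerID": 2}]
orders = [
    {"OrderID": 1, "CustomerID": 1},
    {"OrderID": 2, "CustomerID": 5},  # no such customer
]
print(orphaned_rows(orders, customers, "CustomerID"))
# [{'OrderID': 2, 'CustomerID': 5}]
```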


In [13]:
nest_asyncio.apply()

# You can view a sample of the generated data:
generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=False, output_format=2)
generated_data

{'Customers':     CustomerID FirstName   LastName                        Email    JoinDate
 0            6    Sophia   Martinez  sophia.martinez@example.com  2019-12-05
 1            7     Ethan      Adams      ethan.adams@example.com  2018-06-18
 2            8     Grace     Morris     grace.morris@example.com  2017-04-20
 3            9     Isaac     Parker     isaac.parker@example.com  2016-10-30
 4           10      Lily      Evans       lily.evans@example.com  2015-09-15
 5           11    Olivia     Nguyen    olivia.nguyen@example.com  2019-10-10
 6           12     Mason     Taylor     mason.taylor@example.com  2018-02-28
 7           13       Ava       Wong         ava.wong@example.com  2017-06-15
 8           14    Elijah    Kennedy   elijah.kennedy@example.com  2016-08-20
 9           15  Scarlett     Fisher  scarlett.fisher@example.com  2015-04-05
 10          21    Sophia   Martinez  sophia.martinez@example.com  2020-02-28
 11          22      Liam      Adams       liam.ada

### Generating unstructured data

In [14]:
# Define the required data structure and view the result structure
dataStructureDescription = \
"Generate a collection of news articles about recent technological advancements in renewable energy."

In [15]:
DataGenerationPipelineObj = DataGenerationPipeline(llm = llm)

# Initialize a DataGenerationPipelineObj for the task of defining unstructured data from textual description
pipeline_name = Pipeline.DescriptionToUnstructured
sample_data = DataGenerationPipelineObj.extract_sample_data( description=dataStructureDescription, outputFormat=2 , pipelineName=pipeline_name)
print(sample_data)

{'news_articles':                                                title  \
0  New Solar Panel Design Increases Efficiency by...   
1  Wind Turbines Now Produce More Energy with Les...   
2  Breakthrough in Energy Storage Allows for Long...   
3  New Hydroelectric Plant Generates Power withou...   
4  Smart Grid Technology Optimizes Energy Distrib...   
5       AI Algorithms Improve Solar Panel Efficiency   
6       Advancements in Geothermal Energy Extraction   
7  New Wave Energy Converter Technology Shows Pro...   
8  Biodegradable Solar Panels Offer Sustainable E...   
9  Hybrid Solar-Wind Farms Provide 24/7 Renewable...   

                                             content  
0  A team of researchers has developed a new sola...  
1  Recent advancements in wind turbine technology...  
2  A breakthrough in energy storage technology ha...  
3  A new hydroelectric plant has been developed t...  
4  The implementation of smart grid technology ha...  
5  Artificial intelligence algorith

In [16]:
nest_asyncio.apply()

# You can view a sample of the generated data:
generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=False, output_format=2)
generated_data

{'news_articles':                                                 title  \
 0      New Geothermal Technology Reduces Energy Costs   
 1   Innovative Microgrid System Enhances Energy Re...   
 2   Bioinspired Wind Turbine Design Improves Effic...   
 3   Solar-Powered Desalination Plant Provides Clea...   
 4   Advanced Energy Storage System Boosts Grid Sta...   
 5   Next-Gen Tidal Energy Technology Harnesses Sea...   
 6   Carbon-Negative Bioenergy Solution Fights Clim...   
 7   Renewable Energy Microgrids Empower Off-Grid C...   
 8   Efficient Solar-Powered Water Purification Sys...   
 9   Novel Biomass Conversion Technology Transforms...   
 10  Revolutionary Solar Paint Enhances Energy Abso...   
 11    New Wind Farm Design Boosts Turbine Performance   
 12  Breakthrough in Battery Technology Improves St...   
 13  Innovative Hydropower System Harnesses River E...   
 14  Smart Energy Management System Optimizes Power...   
 15  AI-Driven Solar Tracking Technology Boosts Pan... 

### Generating API calls and outputs

In [30]:
# Define the required data structure and view the result structure
dataStructureDescription = \
"Google Maps API - Nearby places of interest (e.g., restaurants, hotels)."

In [32]:
DataGenerationPipelineObj = DataGenerationPipeline(llm = llm)

# Initialize a DataGenerationPipelineObj for the task of generating data that fits API calls and outputs
pipeline_name = Pipeline.APISpecificationToData
sample_data = DataGenerationPipelineObj.extract_sample_data( description=dataStructureDescription, outputFormat=1 , pipelineName=pipeline_name)
print(sample_data)

{'Google Maps API': [{'input': {'location': '-33.8670522,151.1957362', 'radius': 500, 'type': 'restaurant'}, 'output': {'results': [{'name': 'Cafe One'}, {'name': 'Pizza Palace'}, {'name': 'Burger Joint'}]}}, {'input': {'location': '-33.8670522,151.1957362', 'radius': 1000, 'type': 'hotel'}, 'output': {'results': [{'name': 'Ocean View Hotel'}, {'name': 'City Center Inn'}, {'name': 'Grand Hotel'}]}}, {'input': {'location': '-33.8670522,151.1957362', 'radius': 1500, 'type': 'cafe'}, 'output': {'results': [{'name': 'Coffee House'}, {'name': 'Cafe Delight'}, {'name': 'Brewery Cafe'}]}}]}
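
The JSON output is a dictionary that maps the API name to a list of input/output pairs. If the pairs are meant for training or testing, flattening them to JSON Lines is one convenient option. A sketch using the standard library; `to_jsonl` is a hypothetical helper and the sample data only mirrors the shape shown above:

```python
import json


def to_jsonl(api_data):
    """Flatten {api_name: [{'input': ..., 'output': ...}, ...]} to JSON Lines text."""
    lines = []
    for api_name, examples in api_data.items():
        for example in examples:
            if "input" not in example or "output" not in example:
                raise ValueError(f"Malformed example for {api_name}: {example}")
            record = {"api": api_name, **example}
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


sample = {
    "Google Maps API": [
        {
            "input": {"radius": 500, "type": "restaurant"},
            "output": {"results": [{"name": "Cafe One"}]},
        }
    ]
}
print(to_jsonl(sample))
```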


In [33]:
nest_asyncio.apply()

# You can view a sample of the generated data:
generated_data = DataGenerationPipelineObj.generate_data(num_records = 30, run_in_parallel=False, output_format=1)


In [34]:
generated_data

{'Google Maps API': [{'input': {'location': '-33.8670522,151.1957362',
    'radius': 500,
    'type': 'restaurant'},
   'output': {'results': [{'name': 'Sushi Spot'},
     {'name': 'Taco Time'},
     {'name': 'Noodle House'}]}},
  {'input': {'location': '-33.8670522,151.1957362',
    'radius': 1000,
    'type': 'hotel'},
   'output': {'results': [{'name': 'Seaside Resort'},
     {'name': 'Mountainview Lodge'},
     {'name': 'Lakeside Inn'}]}},
  {'input': {'location': '-33.8670522,151.1957362',
    'radius': 1500,
    'type': 'cafe'},
   'output': {'results': [{'name': 'Espresso Bar'},
     {'name': 'Tea Time Cafe'},
     {'name': 'Pastry Palace'}]}},
  {'input': {'location': '-33.8670522,151.1957362',
    'radius': 500,
    'type': 'restaurant'},
   'output': {'results': [{'name': 'BBQ Pit'},
     {'name': 'Fish House'},
     {'name': 'Mediterranean Grill'}]}},
  {'input': {'location': '-33.8670522,151.1957362',
    'radius': 1000,
    'type': 'hotel'},
   'output': {'results': [{'nam

### Generating relational DB data

In [35]:
# Define the required data structure and view the result structure
dataStructureDescription = \
"Create a database schema for a bookstore that includes tables for books, authors, and sales transactions."

In [36]:
DataGenerationPipelineObj = DataGenerationPipeline(llm = llm)

# Initialize a DataGenerationPipelineObj for the task of defining a relational DB from a textual description
pipeline_name = Pipeline.DescriptionToDB
sample_data = DataGenerationPipelineObj.extract_sample_data( description=dataStructureDescription, outputFormat=2 , pipelineName=pipeline_name)
print(sample_data)

Specification's score: 90
Autocorrecting task specifications:
Specification's score: 100 ; The following errors were detected:
 [] 
{'Books':    book_id                                  title  author_id    genre  \
0        1                       The Great Gatsby          1  Fiction   
1        2                  To Kill a Mockingbird          2  Classic   
2        3  Harry Potter and the Sorcerer's Stone          3  Fantasy   

  publication_date  price  
0       1925-04-10   18.5  
1       1960-07-11   22.3  
2       1997-06-26   19.8  , 'Authors':    author_id          author_name nationality  birth_date
0          1  F. Scott Fitzgerald    American  1896-09-24
1          2           Harper Lee    American  1926-04-28
2          3         J.K. Rowling     British  1965-07-31, 'Sales Transactions':    transaction_id  book_id     transaction_date  customer_id  quantity_sold  \
0               1        1  2021-08-15 10:30:00        12345              5   
1               2        2  

In [37]:
table = list(sample_data.keys())
tables_size_dict = {table[0]:30,table[1]:20,table[2]:50}
tables_size_dict

{'Books': 30, 'Authors': 20, 'Sales Transactions': 50}
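
Indexing the table list by position depends on the order in which the tables happen to be returned. An alternative is to assign sizes by table name with a default for the rest, sketched below; `build_table_sizes` is a hypothetical helper, not part of the package:

```python
def build_table_sizes(table_names, sizes_by_name, default_size):
    """Map every table name to a row count, rejecting sizes for unknown tables."""
    unknown = set(sizes_by_name) - set(table_names)
    if unknown:
        raise KeyError(f"Unknown tables: {sorted(unknown)}")
    return {name: int(sizes_by_name.get(name, default_size)) for name in table_names}


names = ["Books", "Authors", "Sales Transactions"]
sizes = build_table_sizes(names, {"Books": 30, "Sales Transactions": 50}, default_size=20)
print(sizes)  # {'Books': 30, 'Authors': 20, 'Sales Transactions': 50}
```

In the notebook, `names` would be `list(sample_data.keys())`.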

In [24]:
nest_asyncio.apply()

# You can view a sample of the generated data:
generated_data = DataGenerationPipelineObj.generate_data(tables_size_dict = tables_size_dict, run_in_parallel=False, output_format=2)


Error during code extraction (trial 1): name 'pytz' is not defined
Error during auto correction ( trial 1): probabilities do not sum to 1
Error during auto correction ( trial 2): probabilities do not sum to 1
Error during auto correction ( trial 3): probabilities do not sum to 1
Error during auto correction ( trial 4): probabilities do not sum to 1
Error during auto correction ( trial 5): probabilities do not sum to 1
Reached maximum trials for autocorrection. Returning None.
Error during code extraction (trial 2): 'float' object cannot be interpreted as an integer
Error during auto correction ( trial 1): 'float' object cannot be interpreted as an integer
Error during auto correction ( trial 2): '(' was never closed (<string>, line 28)
Error during auto correction ( trial 3): '(' was never closed (<string>, line 28)
Error during auto correction ( trial 4): '(' was never closed (<string>, line 28)
Error during auto correction ( trial 5): '(' was never closed (<string>, line 28)
Reached 

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Data description: None
Help me improve the quality and validity of the authors table. Ensure that the generated texts are distinct and varied, without repeating any names from previous requests, even across different sessions.
Do not return empty values unless you think the original value should be erased.; The transformation logic: Generate relevant and high quality unique, variable texts for free text fields: ['name'] Never override a value in a field from the following list: author_id. Reformat or correct any of the other numeric and datetime fields when needed (only change their values if mandatory for a valid record).; The output should be formatted as the input sa

In [25]:
generated_data

{'authors':     author_id                name nationality   birthdate
 0        1000        Hannah Green     British  1954-09-18
 1        1001        Nathan Evans    American  1952-06-12
 2        1002     Olivia Thompson    Canadian  1957-12-16
 3        1003        Ethan Parker    American  1958-04-25
 4        1004        Sophie White  Australian  1954-12-31
 5        1005         Liam Wilson  Australian  1958-01-03
 6        1006     Isabella Harris  Australian  1955-03-17
 7        1007           Mia Clark     British  1954-07-11
 8        1008  Alexander Robinson     British  1956-08-20
 9        1009        Emily Wright  Australian  1955-10-21
 10       1010   Jennifer Williams    American  1952-02-15
 11       1011    Michael Thompson    Canadian  1955-07-17
 12       1012     Sophie Anderson  Australian  1953-08-13
 13       1013       David Roberts     British  1953-08-20
 14       1014       Emma Campbell    Canadian  1951-12-23
 15       1015         Daniel Ward     Britis
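
The log above reports autocorrection attempts that ended with "Returning None", and the result shown uses the key 'authors' while the sample used 'Authors'. The displayed output is truncated, so it is unclear which tables were returned. A small sketch for checking this after a run; `missing_tables` is a hypothetical helper that compares names case-insensitively:

```python
def missing_tables(requested, generated):
    """Return requested table names that are absent from the generated result."""
    if generated is None:
        return list(requested)
    present = {name.lower() for name in generated}
    return [name for name in requested if name.lower() not in present]


requested = ["Books", "Authors", "Sales Transactions"]
print(missing_tables(requested, {"authors": []}))
# ['Books', 'Sales Transactions']
```

In the notebook this could be called as `missing_tables(tables_size_dict, generated_data)`.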

In [38]:
# You can also add queries and filters to guide the generated contents:
region = 'France'
language = 'French'
query = "Only 40 years old writers or older."

sample_data = DataGenerationPipelineObj.query_sample_data(query=query, region=region, language=language, outputFormat=2 )

print(sample_data)

{'Books':    book_id            title  author_id       genre publication_date  price
0        1  Le Petit Prince          1     Fantasy       1943-04-06   24.2
1        2   Les Misérables          2     Classic       1862-01-18   16.8
2        3       L'Étranger          3  Philosophy       1942-06-14   21.5, 'Authors':    author_id               author_name nationality  birth_date
0          1  Antoine de Saint-Exupéry      French  1900-06-29
1          2               Victor Hugo      French  1802-02-26
2          3              Albert Camus      French  1913-11-07, 'Sales Transactions':    transaction_id  book_id     transaction_date  customer_id  quantity_sold  \
0               1        1  2021-08-15 10:30:00        12345              5   
1               2        2  2021-09-20 15:45:00        54321              4   
2               3        3  2021-10-05 11:20:00        98765              2   

   total_price  
0         80.2  
1         64.7  
2         39.6  }


In [39]:
nest_asyncio.apply()
generated_data = DataGenerationPipelineObj.generate_data(tables_size_dict=tables_size_dict, run_in_parallel=True, output_format=2, query=query, region=region, language=language)

Error during code extraction (trial 1): Provider.date_time_between_dates() got an unexpected keyword argument 'end_date'
Error during auto correction ( trial 1): '[' was never closed (<string>, line 31)
Error during code extraction (trial 1): 'Authors'


> Entering new LLMChain chain...


> Entering new LLMChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Data description: Only 40 years old writers or older. ; The required region: France ; All texts should be translated to French language.
Help me improve the quality and validity of the Books table. Ensure that the generated texts are distinct and varied, without repeating any names from previous requests, even across different sessions.


In [40]:
generated_data

{'Books':     book_id                                    title  author_id  \
 0         1                           Les Misérables          6   
 1         2                               L'Étranger          2   
 2         3                          Le Petit Prince          2   
 3         4                            Madame Bovary         17   
 4         5  Le Tour du monde en quatre-vingts jours          7   
 5         6                        Les Fleurs du mal         10   
 6         7                                  L'Amant          3   
 7         8                         Un sac de billes         11   
 8         9                 Les Liaisons dangereuses         10   
 9        10                   La Liste de mes envies         16   
 10       11                           Les Misérables         13   
 11       12                          Le Petit Prince          4   
 12       13                               L'Étranger         12   
 13       14                           