# Generating Machine Learning Datasets from Descriptions

This notebook shows how to use DataWizzAI to generate machine learning datasets from natural-language descriptions. It walks through defining a data structure, generating professional specifications, and creating datasets suited to training machine learning models.

# Initial Setup Guide

## Import Required Packages

In [32]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message
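The imports above include `parse_output`, which later cells apply to model replies, and `try_parse_json`. Their implementations are not shown in this notebook. As a rough, standard-library illustration of the kind of tolerant parsing such helpers typically need (an assumption, not DataWizzAI's actual code), the hypothetical `extract_first_json` below pulls the first JSON value out of free text:

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in `text`, or None.

    Hypothetical helper for illustration; not part of DataWizzAI.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return value
    return None


reply = 'Here is the sample: {"User_ID": 1, "Age": 42} Let me know if it helps.'
print(extract_first_json(reply))
```

Scanning for the first position that decodes cleanly keeps the helper independent of how the model phrases the surrounding text.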


## Load Environment Variables

In [33]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())


## Initialize the Language Model

In [34]:
# Make sure OPENAI_API_KEY is set in your environment before running this cell.
# Initialize the language model.
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")
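Since the cell above depends on `OPENAI_API_KEY` being present in the environment, a quick check beforehand can fail fast with a clearer message. The sketch below uses only the standard library; `require_env` is a hypothetical helper name, not part of DataWizzAI or LangChain.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value


# Example with an explicit mapping so the check can be tried without a real key.
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv` would read from the real environment.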


# Defining a Data Structure - DescriptionToMLDataset

In [35]:
# Initialize a DataDefiner for the task of defining an ML dataset from a textual description
pipeline_name = get_pipeline_name('DescriptionToMLDataset')
DataDefinerObj = DataDefiner(llm, pipeline_name=pipeline_name)

In [36]:
# Describe the required data in natural language
dataStructureDescription = "Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics."

In [37]:
# Convert the textual description into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription)
print(parse_output(dataStructureSample))



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics.;The guidance given by an expert: None;Format instructions: Generate Sample synthetic records for tabular data (preferably a single flattened table) that can be used for training a well-performing ML model for the task described in the data description.

## Optional: create a TaskSpecificationAugmentor

You can augment your data description and turn it into a detailed data requirements specification using the TaskSpecificationAugmentor object. This component imitates a data analyst who studies your requirement and translates it into a detailed specification of the needed data and its characteristics.
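To make "a detailed specification of the needed data and its characteristics" more concrete, one can picture it as a type per column plus distribution details. The sketch below is illustrative only: the real specification is free text produced by the model, and `NumericSpec`, `CategoricalSpec`, and all example values are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class NumericSpec:
    mean: float
    std: float
    minimum: float
    maximum: float

    def sample(self, rng):
        # Draw from a normal distribution, then clamp to the allowed range.
        value = rng.gauss(self.mean, self.std)
        return min(self.maximum, max(self.minimum, value))


@dataclass
class CategoricalSpec:
    probabilities: dict  # category -> probability

    def sample(self, rng):
        categories = list(self.probabilities)
        weights = list(self.probabilities.values())
        return rng.choices(categories, weights=weights, k=1)[0]


# Hypothetical values, chosen only to illustrate the shape of a specification.
spec = {
    "Age": NumericSpec(mean=45, std=12, minimum=18, maximum=90),
    "Account_Type": CategoricalSpec({"Savings": 0.6, "Checking": 0.4}),
}

rng = random.Random(0)  # fixed seed so repeated runs give the same row
row = {name: column.sample(rng) for name, column in spec.items()}
print(row)
```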

In [38]:
# Generate professional specifications
task_specifications = TaskSpecificationAugmentor.generate_specifications_from_description(llm, description=dataStructureDescription)

# Print the generated specifications
print("Generated Specifications:")
print(task_specifications)



> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.1.  Generate the names of the columns relevant for the description of the user. Use indicative names. Assign each column its type - numerical, categorical, datetime, free text, unique identifier. 2. Revisit each column and complete these details:For numeric columns - describe its distribution, mean and std, min and max values. for numbers and datetimes define the needed format, for categorical columns detail a complete set of categories and its probabilities. For free text columns - specify the mean and std of the text length, For unique identifier columns - specify the format and regEX to follow. For Datetime columns - specify min and max values, as well as the time intervals mean and st

In [39]:
# Convert the textual description, aided with the expert specification, into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription,
                                                                    task_specifications=task_specifications)
print(parse_output(dataStructureSample))



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics.;The guidance given by an expert: 1. Columns relevant for the description of the user:
- User_ID: unique identifier
- Age: numerical
- Gender: categorical
- Income: numerical
- Account_Type: categorical
- Transaction_Amount: numerical
- Transaction_Type: 

# Generating Data

In [40]:
# First initialize the DataAugmentor object with your chosen language model (llm) and the predefined data structure (dataStructureSample):
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)

## Generate a sample (for output validation)

In [41]:
# You can view a sample of the generated data:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {
  "AML_Transactions_Dataset": [
    {
      "User_ID": 1,
      "Age": 42,
      "Gender": "Male",
      "Income": 55000,
      "Account_Type": "Savings",
      "Transaction_Amount": 1200,
      "Transaction_Type": "Deposit",
      "Location": "New York",
      "Transaction_Date": "2022-03-15 08:30:00",
      "Previous_Transaction_Amount": 900
    },
    {
      "User_ID": 2,
      "Age": 55,
      "Gender": "Female",
      "Income": 48000,
      "Account_Type": "Checking",
      "Transaction_Amount": 900,
      "Transaction_Type": "Withdrawal",
      "Location": "Lo
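The structure shown in the prompt fixes the field names each record should carry. Assuming the generated sample can be obtained as a list of dictionaries (the return type of `preview_output_sample` is not documented here), a small check such as the hypothetical `find_schema_problems` can flag records with missing or unexpected fields:

```python
EXPECTED_FIELDS = [
    "User_ID", "Age", "Gender", "Income", "Account_Type",
    "Transaction_Amount", "Transaction_Type", "Location",
    "Transaction_Date", "Previous_Transaction_Amount",
]


def find_schema_problems(records, expected_fields):
    """Return (index, missing, unexpected) for every record that deviates."""
    expected = set(expected_fields)
    problems = []
    for index, record in enumerate(records):
        missing = expected - set(record)
        unexpected = set(record) - expected
        if missing or unexpected:
            problems.append((index, sorted(missing), sorted(unexpected)))
    return problems


# Placeholder records: the first matches the structure, the second does not.
good = {name: None for name in EXPECTED_FIELDS}
bad = {name: None for name in EXPECTED_FIELDS if name != "Location"}
bad["City"] = "Paris"

print(find_schema_problems([good, bad], EXPECTED_FIELDS))
```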

## Optional: query/filter the data structure to control the generated content

In [42]:
# You can also add queries and filters to guide the generated contents:
region = 'France'
language = 'French'
query = "Only 40 years old customers or older."

generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language)
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only 40 years old customers or older. ; The required region: France ; All texts should be translated to French language..Required Structure: {
  "AML_Transactions_Dataset": [
    {
      "User_ID": 1,
      "Age": 42,
      "Gender": "Male",
      "Income": 55000,
      "Account_Type": "Savings",
      "Transaction_Amount": 1200,
      "Transaction_Type": "Deposit",
      "Location": "New York",
      "Transaction_Date": "2022-03-15 08:30:00",
      "Previous_Transaction_Amount": 900
    },
    {
      "User_ID": 2,
      "Age": 55,
      "Gender": "Female",
      "Income": 48000,
      "Accou
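A query such as the age filter above is passed to the model as text, so it may not be honored for every returned record. A minimal post-generation check, assuming records are available as dictionaries with an `Age` field as in the structure above (`find_below_minimum` is a hypothetical helper, not a DataWizzAI function):

```python
def find_below_minimum(records, field, minimum):
    """Return the records whose `field` is missing or below `minimum`."""
    return [
        record
        for record in records
        if record.get(field) is None or record[field] < minimum
    ]


# Placeholder records for illustration only.
sample = [
    {"User_ID": 1, "Age": 44},
    {"User_ID": 2, "Age": 32},
    {"User_ID": 3, "Age": 40},
]

offenders = find_below_minimum(sample, field="Age", minimum=40)
print(offenders)
```

Records returned by the check can then be dropped or regenerated, depending on how strict the constraint needs to be.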

In [43]:
# If you used TaskSpecificationAugmentor to define this structure and want to add queries and filters
# to guide the generated content, revisit the expert specifications so they reflect the user query
# and guidance while keeping the expert knowledge:
full_query = compose_query_message(query=query, region=region, language=language)
updated_task_specifications = TaskSpecificationAugmentor.refine_specifications_by_description(
    llm=llm,
    description=full_query,
    previous_task_specification=task_specifications,
)

generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language,
                                                        task_specifications=updated_task_specifications)
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.You already gave instructions for the needed data (see Previous Task Specifications), but now the user asks a content refinement (see User Query). Please revisit the columns distributions, and descriptive statistics and update those that have changed due to the user query.
The User Query: Data description: Only 40 years old customers or older. ; The required region: France ; All texts should be translated to French language.;
Your Previous Task Specifications: 1. Columns relevant for the description of the user:
- User_ID: unique identifier
- Age: numerical
- Gender: categorical
- Income: numerical
- Account_Type: categorical
- Transaction_Amount: numerical
- Transaction_Type: categorica

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.

In [44]:
# Without expert specifications
generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, num_records=15)

# With expert specifications
# generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, task_specifications=updated_task_specifications, num_records=15)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only 40 years old customers or older. ; The required region: France ; All texts should be translated to French language..Required Structure: {
  "AML_Transactions_Dataset": [
    {
      "User_ID": 1,
      "Age": 42,
      "Gender": "Male",
      "Income": 55000,
      "Account_Type": "Savings",
      "Transaction_Amount": 1200,
      "Transaction_Type": "Deposit",
      "Location": "New York",
      "Transaction_Date": "2022-03-15 08:30:00",
      "Previous_Transaction_Amount": 900
    },
    {
      "User_ID": 2,
      "Age": 55,
      "Gender": "Female",
      "Income": 48000,
      "Accou
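The logged prompt requests 10 items even though the cell asks for `num_records=15`. One possible reading, which is an assumption and not something this notebook confirms, is that larger requests are served in fixed-size batches. Under that assumption, splitting a record count into batch sizes could be sketched as follows (`batch_sizes` is a hypothetical helper):

```python
def batch_sizes(num_records, batch_size=10):
    """Split `num_records` into full batches plus one smaller final batch."""
    if num_records < 0 or batch_size <= 0:
        raise ValueError("num_records must be >= 0 and batch_size must be > 0")
    full, remainder = divmod(num_records, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(batch_sizes(15))
```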

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to utilize the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.


### Setup for Parallel Execution

Jupyter notebooks already run their own asyncio event loop, so calling asyncio.run from a cell normally raises a RuntimeError. The nest_asyncio module patches asyncio to allow nested event loops, which lets the parallel generation call below run inside the notebook.

In [45]:
import nest_asyncio
nest_asyncio.apply()
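The limitation that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below is meant to be run as a plain script without `nest_asyncio` applied: it calls `asyncio.run` from code that is already executing inside an event loop, which is the situation a notebook cell is in. The function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as a notebook cell would be.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second, nested loop is not allowed
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```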

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. This method allows you to specify the query (if any), the number of records, region, and language, similarly to generate_data, but executes multiple data generation tasks concurrently.
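How `generate_data_in_parallel` schedules its work internally is not shown in this notebook. As a general illustration of running several generation tasks concurrently, the sketch below uses `asyncio.gather` with a stand-in coroutine in place of real model calls; `fake_generate_batch`, `generate_concurrently`, and the batch size of 10 are all assumptions. It is written to run as a plain script.

```python
import asyncio


async def fake_generate_batch(batch_id, size):
    """Stand-in for one model call; returns `size` placeholder records."""
    await asyncio.sleep(0.01)  # simulates network latency
    return [{"batch": batch_id, "row": i} for i in range(size)]


async def generate_concurrently(records, batch_size=10):
    # Work out how many records each batch should produce.
    sizes = [
        min(batch_size, records - start)
        for start in range(0, records, batch_size)
    ]
    # Start all batches at once and wait for every one of them to finish.
    batches = await asyncio.gather(
        *(fake_generate_batch(i, size) for i, size in enumerate(sizes))
    )
    # gather returns results in the order the awaitables were passed in.
    return [row for batch in batches for row in batch]


rows = asyncio.run(generate_concurrently(records=20))
print(len(rows))
```

Because the batches wait on I/O at the same time, total wall-clock time is close to that of the slowest batch instead of the sum of all batches.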

In [46]:
import asyncio
# Without expert specifications
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query="", records=20, region=region, language=language))

generated_data

# With expert specifications
# generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query="", records=20, task_specifications=updated_task_specifications, region=region, language=language))



> Entering new LLMChain chain...

> Entering new LLMChain chain...

Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; The required region: France ; All texts should be translated to French language..Required Structure: {
  "AML_Transactions_Dataset": [
    {
      "User_ID": 1,
      "Age": 42,
      "Gender": "Male",
      "Income": 55000,
      "Account_Type": "Savings",
      "Transaction_Amount": 1200,
      "Transaction_Type": "Deposit",
      "Location": "New York",
      "Transaction_Date": "2022-03-15 08:30:00",
      "Previous_Transaction_Amount": 900
    },
    {
      "User_ID": 2,
      "Age": 55,
      "Gender": "Female",
      "Income": 48000,
      

{'AML_Transactions_Dataset':     User_ID  Age  Gender  Income Account_Type  Transaction_Amount  \
 0        11   44    Male   58000      Savings                1250   
 1        12   57  Female   51000     Checking                 950   
 2        13   32    Male   74000      Savings                1550   
 3        14   67  Female   32000      Savings                 750   
 4        15   40    Male   62000     Checking                1150   
 5        16   49  Female   42000      Savings                 850   
 6        17   27    Male   92000     Checking                1350   
 7        18   52  Female   58000      Savings                1450   
 8        19   37    Male   77000     Checking                1650   
 9        20   62    Male   47000     Checking                1050   
 10       11   28   Femme   85000     Checking                1250   
 11       12   45   Homme   62000      Savings                1350   
 12       13   33   Femme   72000     Checking                