# Generating Data from Example Datasets

This notebook shows how to extract the data structure of an existing CSV file and use that structure to generate new, synthesized records with DataWizzAI. The approach is useful for data replication and augmentation.

# Initial Setup Guide

## Import Required Packages

In [1]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message


## Load Environment Variables

In [2]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
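
The model initialization that follows relies on OPENAI_API_KEY being present in the environment. A minimal sketch of a pre-flight check, using only the standard library, might look like this. The helper name require_env is hypothetical and not part of DataWizzAI.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value

# Example call (commented out so the sketch does not fail when the key is absent):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early with a named variable tends to be easier to diagnose than an authentication error raised later by the first model call.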


## Initialize the Language Model

In [3]:
# Please make sure OPENAI_API_KEY is loaded to your environment variables
# Initialize language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Extract Data Structure from CSV

To begin, you'll need to specify the path to your CSV file and the file name. Then, use the create_json_sample_from_csv function to extract the data structure.

In [4]:
import os
import pandas as pd
# Specify the path and file name
path = r'C:\Users\Sigal\data\\'
file_name = r"titanic.csv"

# Extract data structure from the CSV file
dataStructureSample = create_json_sample_from_csv(path, file_name)
print(dataStructureSample)

{"titanic.csv": [{"PassengerId":609,"Survived":1,"Pclass":2,"Name":"Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)","Sex":"female","Age":22.0,"SibSp":1,"Parch":2,"Ticket":"SC\/Paris 2123","Fare":41.5792,"Cabin":null,"Embarked":"C"},{"PassengerId":600,"Survived":1,"Pclass":1,"Name":"Duff Gordon, Sir. Cosmo Edmund (\\Mr Morgan\\\")\"","Sex":"male","Age":49.0,"SibSp":1,"Parch":0,"Ticket":"PC 17485","Fare":56.9292,"Cabin":"A20","Embarked":"C"},{"PassengerId":551,"Survived":1,"Pclass":1,"Name":"Thayer, Mr. John Borland Jr","Sex":"male","Age":17.0,"SibSp":0,"Parch":2,"Ticket":"17421","Fare":110.8833,"Cabin":"C70","Embarked":"C"},{"PassengerId":645,"Survived":1,"Pclass":3,"Name":"Baclini, Miss. Eugenie","Sex":"female","Age":0.75,"SibSp":2,"Parch":1,"Ticket":"2666","Fare":19.2583,"Cabin":null,"Embarked":"C"},{"PassengerId":241,"Survived":0,"Pclass":3,"Name":"Zabour, Miss. Thamine","Sex":"female","Age":null,"SibSp":1,"Parch":0,"Ticket":"2665","Fare":14.4542,"Cabin":null,"Embarked":"C"}]}
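
To make the shape of this output concrete, here is a rough standard-library sketch of how a few rows can be sampled from a CSV file and wrapped in a JSON object keyed by the file name. It is not the implementation of create_json_sample_from_csv. The function name sample_csv_as_json, its n and seed parameters, and the use of csv.DictReader are assumptions for illustration. Unlike the output above, this version keeps every value as a string and does not infer numeric types or nulls.

```python
import csv
import io
import json
import random

def sample_csv_as_json(csv_text: str, file_name: str, n: int = 5, seed=None) -> str:
    """Pick up to n random rows and return them as {"<file_name>": [row, ...]} JSON."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    sample = rng.sample(rows, k=min(n, len(rows)))
    return json.dumps({file_name: sample})

# Tiny inline CSV (illustrative rows) so the sketch runs without a file on disk.
demo_csv = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n3,1,3\n"
structure = sample_csv_as_json(demo_csv, "demo.csv", n=2, seed=0)
print(structure)
```

Keying the sample by file name mirrors the structure printed above, where the records sit under "titanic.csv".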

## Initialize Data Generator


With the data structure extracted, you're ready to initialize the DataAugmentor object that will be used for generating data.

In [5]:
# Construct the full file path
full_file_path = os.path.join(path, file_name)

# Read the CSV file into a DataFrame
df = pd.read_csv(full_file_path)

# Initialize the DataAugmentor with the extracted data structure
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)



## Set Example Dataframe (Optional)


If you want the generated data to be influenced by examples from your original dataset, you can set an examples DataFrame within your DataAugmentor object.

In [6]:
DataAugmentorObj.set_examples_dataframe(df, file_name)

## Generate and Preview Data

Finally, you can generate and preview the synthesized data based on the extracted structure and any example data provided.

In [7]:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {"titanic.csv": [{"PassengerId":200,"Survived":0,"Pclass":2,"Name":"Yrois, Miss. Henriette (\\Mrs Harbeck\\\")\"","Sex":"female","Age":24.0,"SibSp":0,"Parch":0,"Ticket":"248747","Fare":13.0,"Cabin":null,"Embarked":"S"},{"PassengerId":50,"Survived":0,"Pclass":3,"Name":"Arnold-Franchi, Mrs. Josef (Josefine Franchi)","Sex":"female","Age":18.0,"SibSp":1,"Parch":0,"Ticket":"349237","Fare":17.8,"Cabin":null,"Embarked":"S"},{"PassengerId":66,"Survived":1,"Pclass":3,"Name":"Moubarek, Master. Gerios","Sex":"male","Age":null,"SibSp":1,"Parch":1,"Ticket":"2661","Fare":15.2458,"Ca

## Optional: query/filter the data structure to control the generated content

In [9]:
# You can also add queries and filters to guide the generated contents:
language = 'English'
query = "Only survives are included (Survived=1)"

generated_data = DataAugmentorObj.preview_output_sample(query=query, language=language)
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only survives are included (Survived=1) ; All texts should be translated to English language..Required Structure: {"titanic.csv": [{"PassengerId":746,"Survived":0,"Pclass":1,"Name":"Crosby, Capt. Edward Gifford","Sex":"male","Age":70.0,"SibSp":1,"Parch":1,"Ticket":"WE\/P 5735","Fare":71.0,"Cabin":"B22","Embarked":"S"},{"PassengerId":446,"Survived":1,"Pclass":1,"Name":"Dodge, Master. Washington","Sex":"male","Age":4.0,"SibSp":0,"Parch":2,"Ticket":"33638","Fare":81.8583,"Cabin":"A34","Embarked":"S"},{"PassengerId":574,"Survived":1,"Pclass":3,"Name":"Kelly, Miss. Mary","Sex":"female","Age":null,"

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.
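
The prompts logged in this notebook hint at how the query and language reach the model: the data description reads "<query> ; All texts should be translated to <language> language." A minimal sketch that reproduces that string follows. The function name build_description is hypothetical, and the package's own compose_query_message may behave differently; in particular, the logs do not show how a region is handled, so it is left out here.

```python
from typing import Optional

def build_description(query: Optional[str] = None, language: Optional[str] = None) -> Optional[str]:
    """Combine a free-text query and a target language into one description string."""
    if query is None and language is None:
        return None  # corresponds to "Data description: None" in the first preview log
    parts = [query or ""]
    if language:
        parts.append(f"All texts should be translated to {language} language.")
    return " ; ".join(parts)

print(build_description("Only survives are included (Survived=1)", "English"))
```

Passing an empty query together with a language yields a description that starts with " ; ", which matches the log produced when generate_data is called with query="".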

In [10]:
# Without expert specifications
generated_data = DataAugmentorObj.generate_data(query="", language=language, num_records=15)

generated_data

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; All texts should be translated to English language..Required Structure: {"titanic.csv": [{"PassengerId":858,"Survived":1,"Pclass":1,"Name":"Daly, Mr. Peter Denis ","Sex":"male","Age":51.0,"SibSp":0,"Parch":0,"Ticket":"113055","Fare":26.55,"Cabin":"E17","Embarked":"S"},{"PassengerId":172,"Survived":0,"Pclass":3,"Name":"Rice, Master. Arthur","Sex":"male","Age":4.0,"SibSp":4,"Parch":1,"Ticket":"382652","Fare":29.125,"Cabin":null,"Embarked":"Q"},{"PassengerId":430,"Survived":1,"Pclass":3,"Name":"Pickard, Mr. Berk (Berk Trembisky)","Sex":"male","Age":32.0,"SibSp":0,"Parch":0,"Ticket":"SOTON\/O.Q

{'titanic.csv':     PassengerId  Survived  Pclass                                      Name  \
 0           429         0       3                          Flynn, Mr. James   
 1           623         0       3                          Nakid, Mr. Sahid   
 2           767         0       1                 Brewe, Dr. Arthur Jackson   
 3           112         1       1                      Zabour, Miss. Hileni   
 4           531         1       2                  Quick, Miss. Phyllis May   
 5           299         1       1                     Saalfeld, Mr. Adolphe   
 6           381         1       1                     Bidois, Miss. Rosalie   
 7           533         0       3                         Elias, Mr. Joseph   
 8            34         0       2                     Wheadon, Mr. Edward H   
 9            82         1       3               Sheerlinck, Mr. Jan Baptist   
 10          542         0       3                  Svensson, Mr. Karl Johan   
 11          413         

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to utilize the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.


### Setup for Parallel Execution

To run the parallel method from a notebook cell, we use nest_asyncio. Jupyter already runs its own event loop, so a plain asyncio.run call inside a cell would fail with a RuntimeError. The nest_asyncio module patches asyncio so that it can run inside environments that have their own event loops.

In [11]:
import nest_asyncio
nest_asyncio.apply()
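
To see what nest_asyncio works around, the following standard-library sketch calls asyncio.run from inside an already running event loop, which is the situation a notebook cell is in. Without the patch, the nested call is rejected with a RuntimeError. The function names are illustrative only.

```python
import asyncio

async def fetch_value() -> int:
    return 42

async def call_run_from_running_loop() -> str:
    """Attempt a nested asyncio.run and report what happens."""
    coro = fetch_value()
    try:
        asyncio.run(coro)  # an event loop is already running at this point
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"

result = asyncio.run(call_run_from_running_loop())
print(result)
```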

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. Like generate_data, it lets you specify the query (if any), the number of records, the region, and the language, but it executes multiple data generation tasks concurrently. Note that in the call below the record count is passed as records, whereas generate_data takes num_records.
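
The general pattern behind this kind of call can be sketched with asyncio.gather: split the requested record count into batches, start one task per batch, and merge the results. This illustrates the concurrency pattern only, not the internals of generate_data_in_parallel. Here fake_generate_batch stands in for an LLM request, and the batch size of 10 is an assumption based on the "Generate 10 Sample synthetic items" wording in the logged prompts.

```python
import asyncio

async def fake_generate_batch(batch_id: int, size: int) -> list:
    """Stand-in for one LLM call; returns placeholder records."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"batch": batch_id, "row": i} for i in range(size)]

async def generate_in_parallel(records: int, batch_size: int = 10) -> list:
    """Split the request into batches and run them concurrently."""
    sizes = [min(batch_size, records - start) for start in range(0, records, batch_size)]
    batches = await asyncio.gather(
        *(fake_generate_batch(i, size) for i, size in enumerate(sizes))
    )
    # gather returns results in task order, so rows come back batch by batch
    return [row for batch in batches for row in batch]

rows = asyncio.run(generate_in_parallel(records=25))
print(len(rows))
```

Inside a notebook, this sketch would need nest_asyncio.apply() before the asyncio.run call, or the coroutine could be awaited directly in the cell.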

In [13]:
import asyncio
# Without expert specifications
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query=query, records=20, language=language))

generated_data

> Entering new LLMChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only survives are included (Survived=1) ; All texts should be translated to English language..Required Structure: {"titanic.csv": [{"PassengerId":44,"Survived":1,"Pclass":2,"Name":"Laroche, Miss. Simonne Marie Anne Andree","Sex":"female","Age":3.0,"SibSp":1,"Parch":2,"Ticket":"SC\/Paris 2123","Fare":41.5792,"Cabin":null,"Embarked":"C"},{"PassengerId":608,"Survived":1,"Pclass":1,"Name":"Daniel, Mr. Robert Williams","Sex":"male","Age":27.0,"SibSp":0,"Parch":0,"Ticket":"113804","Fare":30.5,"Cabin":null,"Embarked":"S"},{"PassengerId":534,"Survived":1,"Pcl

{'titanic.csv':     PassengerId  Survived  Pclass                                     Name  \
 0           891         1       3            Banfield, Mr. Frederick James   
 1           892         1       3                Lindell, Mrs. Edith Maria   
 2           893         1       1                       Wilkes, Mrs. Ellen   
 3           894         1       2                Myles, Mr. Thomas Francis   
 4           895         1       3                       Colbert, Mr. Jas L   
 5           896         1       3                      Svensson, Mr. Johan   
 6           897         1       3                     Rice, Master. Albert   
 7           898         1       3                       Johnson, Mrs. Emma   
 8           899         1       2                Carter, Miss. Lucile Polk   
 9           900         1       1            Beckwith, Mr. Richard Leonard   
 10          762         1       2             Walker, Mr. William Anderson   
 11          632         1       3  M