# Generating Data from Example Data

Discover how to extract data structures from existing CSV files and use those structures to generate new data. This notebook demonstrates how to replicate the structure of sample data with new synthesized content using DataWizzAI, making it ideal for data replication and augmentation.

# Initial Setup Guide

## Import Required Packages

In [1]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataGenerationPipeline import *
from src.Pipeline import *
from src.utils.utils import create_json_sample_from_csv


## Load Environment Variables

In [2]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())


## Initialize the Language Model

In [3]:
# Make sure OPENAI_API_KEY is set in your environment (for example, via the .env file loaded above)
# Initialize the language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")
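A missing API key usually surfaces only when the first request is sent, so it can help to fail early. The following is a minimal sketch, not part of DataWizzAI: `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a placeholder variable used only so the example runs anywhere.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demonstration with a placeholder variable (hypothetical name, not a real key).
os.environ["DEMO_API_KEY"] = "placeholder"
demo_value = require_env("DEMO_API_KEY")
```

In the notebook, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv` would raise a clear error before the model is initialized.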


# Extract Data Structure from CSV

To begin, you'll need to specify the path to your CSV file and the file name. Then, use the create_json_sample_from_csv function to extract the data structure.

In [4]:
import os
import pandas as pd
# Specify the path and file name
path = r'C:\Users\Sigal\data\\'
file_name = r"titanic.csv"

# Extract data structure from the CSV file
dataStructureSample = create_json_sample_from_csv(path, file_name)
print(dataStructureSample)

{"titanic.csv": [{"PassengerId":158,"Survived":0,"Pclass":3,"Name":"Corn, Mr. Harry","Sex":"male","Age":30.0,"SibSp":0,"Parch":0,"Ticket":"SOTON\/OQ 392090","Fare":8.05,"Cabin":null,"Embarked":"S"},{"PassengerId":863,"Survived":1,"Pclass":1,"Name":"Swift, Mrs. Frederick Joel (Margaret Welles Barron)","Sex":"female","Age":48.0,"SibSp":0,"Parch":0,"Ticket":"17466","Fare":25.9292,"Cabin":"D17","Embarked":"S"},{"PassengerId":394,"Survived":1,"Pclass":1,"Name":"Newell, Miss. Marjorie","Sex":"female","Age":23.0,"SibSp":1,"Parch":0,"Ticket":"35273","Fare":113.275,"Cabin":"D36","Embarked":"C"},{"PassengerId":441,"Survived":1,"Pclass":2,"Name":"Hart, Mrs. Benjamin (Esther Ada Bloomfield)","Sex":"female","Age":45.0,"SibSp":1,"Parch":1,"Ticket":"F.C.C. 13529","Fare":26.25,"Cabin":null,"Embarked":"S"},{"PassengerId":794,"Survived":0,"Pclass":1,"Name":"Hoyt, Mr. William Fisher","Sex":"male","Age":null,"SibSp":0,"Parch":0,"Ticket":"PC 17600","Fare":30.6958,"Cabin":null,"Embarked":"C"}]}
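The printed sample is a JSON object that maps the file name to a handful of rows (the `\/` in the Ticket value is simply a valid JSON escape for `/`). To make that shape concrete, here is a rough sketch of how such a sample could be built with the standard library alone. It is an assumption about the general idea, not the actual implementation of create_json_sample_from_csv; `sample_csv_as_json` and the toy CSV are hypothetical.

```python
import csv
import io
import json
import random


def sample_csv_as_json(csv_text, name, n=5, seed=0):
    """Return a JSON string mapping `name` to up to `n` randomly chosen rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    picked = rng.sample(rows, min(n, len(rows)))
    # Empty CSV cells become null, similar to the missing Cabin and Age values above.
    # Note: the csv module does not infer types, so all other values stay strings.
    cleaned = [{k: (v if v != "" else None) for k, v in row.items()} for row in picked]
    return json.dumps({name: cleaned})


# Toy two-row CSV, invented for illustration only.
csv_text = 'PassengerId,Name,Cabin\n1,"Doe, Mr. John",\n2,"Roe, Miss. Jane",C85\n'
sample = json.loads(sample_csv_as_json(csv_text, "toy.csv", n=2))
```

A small sample like this keeps the prompt short while still showing the model the column names, value formats, and where nulls occur.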


## Initialize Data Generator and Preview Data


With the data structure extracted, initialize the DataGenerationPipeline object and call its extract_sample_data method to preview a sample of the generated data and confirm that it has the structure you expect.

Pass pipelineName=Pipeline.ExamplesDataframeToTabular to direct the generator to use the "examples to table" pipeline.

In [5]:
DataGenerationPipelineObj = DataGenerationPipeline(llm=llm)
print(DataGenerationPipelineObj.extract_sample_data(description=dataStructureSample,pipelineName=Pipeline.ExamplesDataframeToTabular, outputFormat=2))

{'titanic.csv':    PassengerId  Survived  Pclass  \
0          158         0       3   
1          863         1       1   
2          394         1       1   
3          441         1       2   
4          794         0       1   

                                                Name     Sex   Age  SibSp  \
0                                    Corn, Mr. Harry    male  30.0      0   
1  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female  48.0      0   
2                             Newell, Miss. Marjorie  female  23.0      1   
3        Hart, Mrs. Benjamin (Esther Ada Bloomfield)  female  45.0      1   
4                           Hoyt, Mr. William Fisher    male   NaN      0   

   Parch           Ticket      Fare Cabin Embarked  
0      0  SOTON/OQ 392090    8.0500  None        S  
1      0            17466   25.9292   D17        S  
2      0            35273  113.2750   D36        C  
3      1     F.C.C. 13529   26.2500  None        S  
4      0         PC 17600   30.6958  No
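Checking the preview by eye works for a dozen columns, but a programmatic comparison scales better. The sketch below is one possible check, assuming the preview has been converted to a list of dicts (for a pandas DataFrame, `df.to_dict(orient="records")` does this). `column_mismatches` is a hypothetical helper, and the sample here is cut down to three columns for brevity.

```python
import json


def column_mismatches(sample_json, name, preview_records):
    """Compare the column names of the extracted sample with those of the preview records."""
    expected = set(json.loads(sample_json)[name][0].keys())
    actual = set()
    for record in preview_records:
        actual.update(record.keys())
    return {
        "missing": sorted(expected - actual),      # in the sample, absent from the preview
        "unexpected": sorted(actual - expected),   # in the preview, absent from the sample
    }


# Shortened sample and a preview that deliberately lacks one column.
sample_json = '{"titanic.csv": [{"PassengerId": 158, "Survived": 0, "Pclass": 3}]}'
preview = [{"PassengerId": 158, "Survived": 0}]
report = column_mismatches(sample_json, "titanic.csv", preview)
```

An empty `missing` and `unexpected` list would indicate that the preview reproduces the original column set.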

## Set Examples Data (Optional) and Generate Data


If you want the generated data to be influenced by examples from your original dataset, you can provide your DataGenerationPipeline object with a dictionary holding the examples (with the data name as the key and the pandas DataFrame as the value).

In [6]:
# Construct the full file path
full_file_path = os.path.join(path, file_name)

# Read the CSV file into a DataFrame
df = pd.read_csv(full_file_path)

examples_dict = {file_name:df}

Finally, you can generate as much data as you need (set the num_records accordingly) based on the extracted structure and any example data provided.

Optional: Use the parameter examples_dataframe_dict to pass the full set of examples.

In [7]:
DataGenerationPipelineObj.generate_data(num_records=23, run_in_parallel=False, examples_dataframe_dict=examples_dict)

{'titanic.csv':     PassengerId  Survived  Pclass  \
 0           102         1       1   
 1           429         0       3   
 2           646         0       1   
 3           753         1       3   
 4           518         1       3   
 5           961         1       1   
 6           189         0       3   
 7           572         1       1   
 8           804         0       3   
 9           658         0       3   
 10          152         1       2   
 11          578         0       3   
 12          808         0       3   
 13          534         1       3   
 14          911         1       1   
 15          313         0       2   
 16          103         1       1   
 17          876         1       3   
 18          812         0       3   
 19          114         0       3   
 20          279         1       1   
 21          706         0       2   
 22          409         0       3   
 
                                                Name     Sex    Age  Si

## Generate Data in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, you can set the run_in_parallel parameter to True to generate your data asynchronously.


To ensure smooth parallel execution, especially within environments that don't natively support asynchronous operations (like Jupyter notebooks), we use nest_asyncio. This module allows asyncio to run inside environments with their own event loops.
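To illustrate why concurrent requests help, the sketch below splits a request for 79 records into batches and awaits them together with `asyncio.gather`. It is a simplified stand-in, not DataWizzAI's implementation: `fake_request` and `generate_concurrently` are hypothetical, the batch size of 10 is arbitrary, and a short sleep replaces the real LLM call.

```python
import asyncio


async def fake_request(batch_id, size):
    # Stand-in for one LLM call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return [f"batch{batch_id}-record{i}" for i in range(size)]


async def generate_concurrently(num_records, batch_size):
    # Split the total into batches; the last batch may be smaller.
    sizes = [min(batch_size, num_records - start) for start in range(0, num_records, batch_size)]
    # gather runs all batches concurrently and returns results in submission order.
    batches = await asyncio.gather(*(fake_request(i, s) for i, s in enumerate(sizes)))
    return [record for batch in batches for record in batch]


records = asyncio.run(generate_concurrently(num_records=79, batch_size=10))
```

In a plain Python script this runs as written. Inside Jupyter, `asyncio.run` raises a RuntimeError because an event loop is already running, which is the situation nest_asyncio is meant to work around.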

### Setup for Parallel Execution

In [9]:
import nest_asyncio
nest_asyncio.apply()
DataGenerationPipelineObj.generate_data(num_records=79, run_in_parallel=True)


{'titanic.csv':     PassengerId  Survived  Pclass                                  Name  \
 0           215         1       2                   Watson, Miss. Grace   
 1           762         0       3               Calderhead, Mr. William   
 2           631         0       3  Hagland, Mr. Konrad Mathias Reiersen   
 3           504         1       1                 Maioni, Miss. Roberta   
 4           127         0       3                    McKamey, Mr. Peter   
 ..          ...       ...     ...                                   ...   
 74          498         0       1                     Smith, Mr. Robert   
 75         1024         0       3                      Baker, Mr. Henry   
 76          701         1       1                 Jones, Mrs. Elizabeth   
 77          391         1       2                 Walker, Miss. Abigail   
 78          119         0       3                   Murphy, Mr. Patrick   
 
        Sex   Age  SibSp  Parch     Ticket      Fare    Cabin Embarked 

## Query/Filter the Data Structure to Control the Generated Content

In [10]:
# You can also add queries and filters to guide the generated contents:
language = 'English'
query = "Only female survivors should be included in the generated data"

DataGenerationPipelineObj.query_sample_data(query=query, language=language, outputFormat=2)

{'titanic.csv':    PassengerId  Survived  Pclass                       Name     Sex   Age  \
 0          527         1       1  Anderson, Miss. Elizabeth  female  29.0   
 1          742         1       1   Carter, Mrs. Lucile Polk  female  36.0   
 2          843         1       1         Seward, Miss. Anna  female  35.0   
 
    SibSp  Parch  Ticket     Fare    Cabin Embarked  
 0      0      0   17757  227.525  C62 C64        C  
 1      1      2  113760  120.000  B96 B98        S  
 2      0      0  113794   26.550     None        S  }
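Because the query is a natural-language instruction to the model, the output is not guaranteed to honor it, so a quick check on the returned rows can be worthwhile. Here is a minimal sketch using the three preview rows above, reduced to the columns the query constrains. `rows_violating` is a hypothetical helper, not part of DataWizzAI.

```python
def rows_violating(records, **expected):
    """Return the records whose fields differ from the expected values."""
    return [r for r in records if any(r.get(k) != v for k, v in expected.items())]


# The three preview rows shown above, limited to the relevant columns.
preview = [
    {"Name": "Anderson, Miss. Elizabeth", "Sex": "female", "Survived": 1},
    {"Name": "Carter, Mrs. Lucile Polk", "Sex": "female", "Survived": 1},
    {"Name": "Seward, Miss. Anna", "Sex": "female", "Survived": 1},
]
bad_rows = rows_violating(preview, Sex="female", Survived=1)
```

An empty `bad_rows` list means every previewed row satisfies the "female survivors only" condition. The same check can be applied to the full generated dataset.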

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.
Note: Do not pass the original examples dictionary, as it may override your query.

In [11]:
# The examples dictionary is omitted here, per the note above, so it cannot override the query
DataGenerationPipelineObj.generate_data(num_records=20, run_in_parallel=False, query=query, language=language)

{'titanic.csv':     PassengerId  Survived  Pclass                                   Name  \
 0           125         1       1                Anderson, Miss. Eleanor   
 1           230         1       1         Hays, Miss. Margaret Bechstein   
 2           689         1       3                     Murphy, Miss. Nora   
 3           411         1       3                  Cotterill, Miss. Rene   
 4           512         1       3                    Webber, Miss. Susan   
 5           712         1       1                    Klaber, Miss. Annie   
 6           825         1       2                 Botsford, Miss. Amelia   
 7           944         1       2                   Hocking, Miss. Alice   
 8           765         1       1           Allison, Miss. Helen Loraine   
 9           841         1       3                     Mangan, Miss. Mary   
 10          505         1       1              Henderson, Miss. Beatrice   
 11          719         1       3                   Johnson,