# Welcome to DataWizzAI

## Initial Setup Guide

Before diving into the functionalities of our Python package, there are a few preliminary steps you'll need to follow to set up your environment. This guide will walk you through importing required packages, loading environment variables, and initializing the language model.


### Import Required Packages

First, import all the necessary packages that our application relies on. This includes utilities for loading environment variables, interfacing with OpenAI's language models, defining data structures, and consuming data.


In [91]:
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message

Make sure these packages are installed in your Python environment. You may need to install some of them using pip if they are not already present.
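One way to see what is missing before running the imports is to probe for each module without importing it. The sketch below uses only the standard library; the list of module names is an assumption based on the imports used in this notebook (`dotenv` is the import name of the `python-dotenv` distribution).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed list, taken from the imports that appear in this notebook.
required = ["langchain_openai", "dotenv", "pandas", "nest_asyncio"]
print("Missing modules:", missing_modules(required))
```

Anything reported as missing can then be installed with pip.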

### Load Environment Variables

Environment variables are used to store configurations and sensitive information securely. Use dotenv to load these variables from a .env file in your project directory.

In [92]:
#install if you don't have this package already
#!pip install python-dotenv

#import relevant functions
from dotenv import load_dotenv, find_dotenv

In [93]:
# Load environment variables
_ = load_dotenv(find_dotenv())
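After loading the .env file, it can be worth confirming that the expected keys are actually present. This is a minimal sketch using only `os.environ`; the helper name `missing_env_vars` is hypothetical and not part of the package.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# OPENAI_API_KEY is the variable the next step of this guide relies on.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```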

### Initialize the Language Model

Now, initialize the language model from OpenAI. In this example, we're using the gpt-3.5-turbo model with a specific temperature setting. The ChatOpenAI class is a part of the langchain_openai package and facilitates interaction with OpenAI's API.

In [94]:
# please make sure OPENAI_API_KEY is loaded to your environment variables
# Initialize language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")

The temperature parameter controls the randomness of the model's responses, with higher values leading to more varied outputs. Adjust this setting according to your needs.
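To build intuition for this setting, the sketch below applies the common "divide the logits by the temperature, then softmax" formulation to a toy distribution. It illustrates the general idea only; the logits are invented, and how the OpenAI API applies temperature internally is not something this guide specifies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 0.9)
print("temperature 0.2:", [round(p, 3) for p in low])
print("temperature 0.9:", [round(p, 3) for p in high])
```

At the lower temperature most of the probability mass sits on the top candidate, while the higher temperature spreads it out, which is why higher values give more varied output.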

After completing these steps, your environment is ready, and you can proceed with the specific functionalities of the package, such as defining data structures, generating datasets, and more.

##### Init HuggingFace Endpoint - OpenAI alternative 

In [95]:
# ! pip install huggingface_hub
# 
# # please make sure HUGGINGFACEHUB_API_TOKEN is loaded to your environment variables
# from langchain_community.llms import HuggingFaceEndpoint
# 
# repo_id = "mistralai/Mistral-7B-Instruct-v0.2"
# 
# llm = HuggingFaceEndpoint(repo_id=repo_id, max_new_tokens=2048)

# Defining a Data Structure

This section outlines the various pipeline types available within the package, each designed to transform inputs into structured or unstructured data sets for different applications such as machine learning, databases, or flexible data formats.

## 1. DescriptionToMLDataset

Ideal for users who want to generate machine learning datasets based on descriptive inputs. This pipeline interprets natural language descriptions of data requirements and structures, then produces datasets optimized for training machine learning models.

In [96]:
# Initialize a DataDefiner
pipeline_name = get_pipeline_name('DescriptionToMLDataset')
DataDefinerObj = DataDefiner(llm, pipeline_name = pipeline_name)

#Define the required data structure and view the result structure
# Example non-expert description
dataStructureDescription = "Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics."

# Generate professional specifications
task_specifications = TaskSpecificationAugmentor.generate_specifications_from_description(llm, description = dataStructureDescription)

# Print the generated specifications
print("Generated Specifications:")
print(task_specifications)

> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.1.  Generate the names of the columns relevant for the description of the user. Use indicative names. Assign each column its type - numerical, categorical, datetime, free text, unique identifier. 2. Revisit each column and complete these details:For numeric columns - describe its distribution, mean and std, min and max values. for numbers and datetimes define the needed format, for categorical columns detail a complete set of categories and its probabilities. For free text columns - specify the mean and std of the text length, For unique identifier columns - specify the format and regEX to follow. For Datetime columns - specify min and max values, as well as the time intervals mean and st

Convert the textual description, aided by the expert specification, into a sample of the needed data.

In [97]:
dataStructureSample = DataDefinerObj.define_schema_from_description(description = dataStructureDescription, task_specifications=task_specifications)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate a machine learning dataset for predicting suspicious AML transactions based on features like customer characteristics, transaction characteristics, location, and transactions history characteristics.;The guidance given by an expert: 1.
- Customer ID: unique identifier
- Age: numerical
- Gender: categorical
- Occupation: categorical
- Country: categorical
- Transaction ID: unique identifier
- Transaction Amount: numerical
- Transaction Date: datetime
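
The expert guidance above assigns each column a type, and the prompt also asks for distribution details such as mean, standard deviation, bounds, and category probabilities. To make concrete what such a specification encodes, here is a minimal sketch that samples values from a hand-written spec. The spec format and every number in it are hypothetical, and the package itself delegates generation to the language model rather than sampling like this.

```python
import random

# Hypothetical spec in the spirit of the expert guidance; all numbers are invented.
spec = {
    "Age": {"type": "numerical", "mean": 42.0, "std": 12.0, "min": 18, "max": 90},
    "Gender": {"type": "categorical", "categories": {"female": 0.5, "male": 0.5}},
}


def sample_value(column_spec, rng):
    """Draw one value for a column according to its (hypothetical) spec."""
    if column_spec["type"] == "numerical":
        value = rng.gauss(column_spec["mean"], column_spec["std"])
        # Clip to the declared bounds so every value stays in range.
        return min(max(value, column_spec["min"]), column_spec["max"])
    if column_spec["type"] == "categorical":
        names = list(column_spec["categories"])
        weights = list(column_spec["categories"].values())
        return rng.choices(names, weights=weights, k=1)[0]
    raise ValueError(f"unsupported column type: {column_spec['type']}")


def sample_records(table_spec, n, seed=0):
    """Generate n records, one value per column, with a seeded generator."""
    rng = random.Random(seed)
    return [
        {column: sample_value(column_spec, rng) for column, column_spec in table_spec.items()}
        for _ in range(n)
    ]


records = sample_records(spec, 5)
for record in records:
    print(record)
```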

## 2. DescriptionToDB

Use this pipeline when you need to create a relational database schema and populate it with data, all derived from a natural language description. It’s perfect for quickly prototyping database designs or for users less familiar with SQL syntax.
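
Because the tables this pipeline produces are meant to be joinable, one simple sanity check on a generated sample is referential integrity. The sketch below assumes the sample has been parsed into a dictionary mapping table names to lists of records; that shape, the table names, and the column names are all hypothetical.

```python
def find_orphans(tables, child, foreign_key, parent, primary_key):
    """Return child rows whose foreign key has no matching row in the parent table."""
    parent_keys = {row[primary_key] for row in tables[parent]}
    return [row for row in tables[child] if row[foreign_key] not in parent_keys]


# Hypothetical parsed sample for the bookstore description.
sample = {
    "authors": [{"author_id": 1, "name": "A. Writer"}],
    "books": [
        {"book_id": 10, "author_id": 1, "title": "First Book"},
        {"book_id": 11, "author_id": 2, "title": "Orphaned Book"},
    ],
}

orphans = find_orphans(sample, "books", "author_id", "authors", "author_id")
print("Books without a matching author:", orphans)
```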

In [98]:
# Initialize a DataDefiner
pipeline_name = get_pipeline_name('DescriptionToDB')
DataDefinerObj = DataDefiner(llm, pipeline_name = pipeline_name)

#Define the required data structure and view the result structure
dataStructureDescription = "Create a database schema for a bookstore that includes tables for books, authors, and sales transactions."
dataStructureSample = DataDefinerObj.define_schema_from_description(dataStructureDescription)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Create a database schema for a bookstore that includes tables for books, authors, and sales transactions.;The guidance given by an expert: None;Format instructions: Generate Sample synthetic records for relational DB (one or more tables). Please follow any relevant distributions for the stated fields, as we want this data to be as valid and useful as possible for development and testing. Format the output as JSON with each table 

## 3. SQLToTabular

This pipeline is designed for users who prefer to define their data structure using SQL syntax. It takes SQL table creation statements and query specifications, then generates a corresponding structured, relational database schema.
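
Before handing SQL to the pipeline, it can help to confirm that the statements parse. The sketch below runs the DDL used in this section against an in-memory SQLite database from the standard library and lists the resulting tables. SQLite is only a convenient stand-in here; other database engines may accept or reject different syntax.

```python
import sqlite3

# A triple-quoted string keeps the line breaks, so each "--" comment ends at its own line.
ddl = """-- Create a 'Customers' table with customer information
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100),
    JoinDate DATE
);

-- Create an 'Orders' table with order details
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    ProductCategory VARCHAR(50),
    ProductName VARCHAR(100),
    Units INT,
    TotalAmount DECIMAL(10, 2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);"""


def list_tables(ddl_script):
    """Execute the DDL in a throwaway in-memory database and return the table names."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl_script)
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


tables = list_tables(ddl)
print(tables)
```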

In [99]:
# Initialize a DataDefiner
pipeline_name = get_pipeline_name('SQLToTabular')
DataDefinerObj = DataDefiner(llm, pipeline_name = pipeline_name)

#Define the required data structure and view the result structure
# A triple-quoted string preserves line breaks. With backslash continuations the
# lines are joined, so each "--" comment would swallow the SQL that follows it.
dataStructureDescription = """-- Create a 'Customers' table with customer information
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100),
    JoinDate DATE
);

-- Create an 'Orders' table with order details
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    ProductCategory VARCHAR(50),
    ProductName VARCHAR(100),
    Units INT,
    TotalAmount DECIMAL(10, 2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);"""
dataStructureSample = DataDefinerObj.define_schema_from_description(dataStructureDescription)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: -- Create a 'Customers' table with customer informationCREATE TABLE Customers (    CustomerID INT PRIMARY KEY,    FirstName VARCHAR(50),    LastName VARCHAR(50),    Email VARCHAR(100),    JoinDate DATE);-- Create an 'Orders' table with order detailsCREATE TABLE Orders (    OrderID INT PRIMARY KEY,    CustomerID INT,    OrderDate DATE,    ProductCategory VARCHAR(50),    ProductName VARCHAR(100),    Units INT,    TotalAmount DECIMAL(10, 2),    FOREIGN KEY (Cust

## 4. DescriptionToUnstructured

Choose this pipeline for scenarios where structured data formats are not required. It enables the creation of datasets with flexible formats, such as text blobs, documents, or other forms of unstructured data, based on descriptive inputs.

In [100]:
# Initialize a DataDefiner
pipeline_name = get_pipeline_name('DescriptionToUnstructured')
DataDefinerObj = DataDefiner(llm, pipeline_name = pipeline_name)

#Define the required data structure and view the result structure
dataStructureDescription = "Generate a collection of news articles about recent technological advancements in renewable energy."
dataStructureSample = DataDefinerObj.define_schema_from_description(dataStructureDescription)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate a collection of news articles about recent technological advancements in renewable energy.;The guidance given by an expert: None;Format instructions: Generate Sample data, potentially unstructured, that fit to the data description. Format the output as JSONL with the dataset name as key and each line contains a single sampled text in the appropriate format.Please make sure you output a valid JSON format, and don't cut it in the middle.

## 5. APISpecificationToData

Utilize this pipeline when you have an API specification (e.g., OpenAPI/Swagger) and want to generate data that conforms to the defined endpoints and data models. It’s ideal for testing API endpoints or generating mock data for API development.

In [101]:
# Initialize a DataDefiner
pipeline_name = get_pipeline_name('APISpecificationToData')
DataDefinerObj = DataDefiner(llm, pipeline_name = pipeline_name)

#Define the required data structure and view the result structure
dataStructureDescription = "Generate mock data conforming to Twitter's API. The data should include user profiles and tweets, along with metadata for engagements (likes, retweets, replies)."
task_specifications = TaskSpecificationAugmentor.generate_specifications_from_description(llm, description = dataStructureDescription)

dataStructureSample = DataDefinerObj.define_schema_from_description(dataStructureDescription, task_specifications=task_specifications)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.1.  Generate the names of the columns relevant for the description of the user. Use indicative names. Assign each column its type - numerical, categorical, datetime, free text, unique identifier. 2. Revisit each column and complete these details:For numeric columns - describe its distribution, mean and std, min and max values. for numbers and datetimes define the needed format, for categorical columns detail a complete set of categories and its probabilities. For free text columns - specify the mean and std of the text length, For unique identifier columns - specify the format and regEX to follow. For Datetime columns - specify min and max values, as well as the time intervals mean and st

# Querying the Data Structure

After setting up your data structure, the next step is to generate the data by refining content according to your specific needs and generating a data preview. This section guides you through the process of specifying content refinements such as region and language, initializing the data generator, and generating the dataset.


## Initialize a DataAugmentor object

First, set the content refinements (region, language, and an optional query) and refine the expert specifications to match them. Then initialize the DataAugmentor object with your chosen language model (llm) and the predefined data structure (dataStructureSample):

In [102]:
# Content refinements
region = 'France'
language = 'French'
query = "" 

In [103]:
# Revisit the expert specifications and adjust them to the user query
full_query = compose_query_message(query = query, region=region, language=language)
updated_task_specifications = TaskSpecificationAugmentor.refine_specifications_by_description(
    llm=llm, description=full_query, previous_task_specification=task_specifications
)

> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.You already gave instructions for the needed data (see Previous Task Specifications), but now the user asks a content refinement (see User Query). Please revisit the columns distributions, and descriptive statistics and update those that have changed due to the user query. 
The User Query: Data description:  ; The required region: France ; All texts should be translated to French language.;
Your Previous Task Specifications: 1. User Profile Data:
   - User_ID: unique identifier
   - Username: free text
   - Followers_Count: numerical
   - Following_Count: numerical
   - Tweet_Count: numerical
   - Created_At: datetime

2. Tweet Data:
   - Tweet_ID: unique identifier
   - User_ID: unique i

In [104]:
# Initialize DataAugmentor
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)

## Generate a sample

Generate a preview of your dataset to ensure it meets your requirements. You can optionally provide a query to further guide the content of the generated data. The region and language refinements will be applied to this data.

In [105]:
# Data generation that considers the expert task specification as well:
generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language, task_specifications=updated_task_specifications)
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; The required region: France ; All texts should be translated to French language. ; The guidance given by an expert: 1. Données du profil utilisateur:
   - Identifiant_utilisateur: identifiant unique
   - Nom_utilisateur: texte libre
   - Nombre_abonnés: numérique
   - Nombre_abonnements: numérique
   - Nombre_tweets: numérique
   - Date_creation: datetime

2. Données des tweets:
   - Identifiant_tweet: identifiant unique
   - Identifiant_utilisateur: identifiant unique
   - Texte_tweet: texte libre
   - Date_creation_tweet: datetime
   - Nombre_likes: numérique
   - Nombre_retweets: numériq

In [106]:
# Data generation based on the user query alone, with no additional specifications:
generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language)
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; The required region: France ; All texts should be translated to French language..Required Structure: {
    "twitter_api": [
        {
            "input": {
                "User_ID": "0987ab12-3456-cdef-7890-123456789abc",
                "Username": "user123",
                "Followers_Count": 450,
                "Following_Count": 250,
                "Tweet_Count": 90,
                "Created_At": "2020-05-15 08:30:45"
            },
            "output": {
                "status": "User profile created successfully"
            }
        },
        {
            "input": {
        

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.

In [108]:
# Without expert specifications
query = "Taylor Swift fans"
generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, num_records=15)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Taylor Swift fans ; The required region: France ; All texts should be translated to French language..Required Structure: {
    "twitter_api": [
        {
            "input": {
                "User_ID": "0987ab12-3456-cdef-7890-123456789abc",
                "Username": "user123",
                "Followers_Count": 450,
                "Following_Count": 250,
                "Tweet_Count": 90,
                "Created_At": "2020-05-15 08:30:45"
            },
            "output": {
                "status": "User profile created successfully"
            }
        },
        {
            "i

In [109]:
# With expert specifications
query = "Taylor Swift fans"
generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, task_specifications=updated_task_specifications, num_records=15)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Taylor Swift fans ; The required region: France ; All texts should be translated to French language. ; The guidance given by an expert: 1. Données du profil utilisateur:
   - Identifiant_utilisateur: identifiant unique
   - Nom_utilisateur: texte libre
   - Nombre_abonnés: numérique
   - Nombre_abonnements: numérique
   - Nombre_tweets: numérique
   - Date_creation: datetime

2. Données des tweets:
   - Identifiant_tweet: identifiant unique
   - Identifiant_utilisateur: identifiant unique
   - Texte_tweet: texte libre
   - Date_creation_tweet: datetime
   - Nombre_likes: numérique
   - Nombre_

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to use the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.


### Setup for Parallel Execution

Jupyter notebooks already run their own event loop, so calling asyncio.run from a cell normally fails. To make parallel execution work in such environments, we use nest_asyncio, which patches asyncio so that event loops can be nested.
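
The restriction that nest_asyncio works around can be reproduced with the standard library alone. In this sketch, `outer` plays the role of an environment whose event loop is already running, and the nested `asyncio.run` call is expected to raise a `RuntimeError`. The function names are illustrative only.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches asyncio so that this kind of nested call is permitted.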


In [110]:
import nest_asyncio
nest_asyncio.apply()

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. As with generate_data, you can specify the query (if any), the number of records, the region, and the language, but multiple data generation tasks are executed concurrently.
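
The general pattern behind this kind of concurrency can be sketched with `asyncio.gather`. Everything below is a stand-in: `fake_generate` simulates a network-bound model call, and the batch size of 10 is an assumption, not a documented setting of `generate_data_in_parallel`.

```python
import asyncio


def split_into_batches(total, batch_size):
    """Split a record count into batch sizes, e.g. 25 -> [10, 10, 5]."""
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


async def fake_generate(batch_index, size):
    """Stand-in for one model request; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return [{"batch": batch_index, "item": i} for i in range(size)]


async def generate_in_parallel(total, batch_size=10):
    sizes = split_into_batches(total, batch_size)
    # gather schedules all batches at once and returns results in submission order.
    batches = await asyncio.gather(
        *(fake_generate(index, size) for index, size in enumerate(sizes))
    )
    return [record for batch in batches for record in batch]


records = asyncio.run(generate_in_parallel(25))
print(len(records))
```

Run as a plain script this works as written; inside a notebook the final `asyncio.run` call needs `nest_asyncio.apply()` to have been called first.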


In [111]:
import asyncio
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query="", records=20, task_specifications=updated_task_specifications, region=region, language=language))

> Entering new LLMChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; The required region: France ; All texts should be translated to French language. ; The guidance given by an expert: 1. Données du profil utilisateur:
   - Identifiant_utilisateur: identifiant unique
   - Nom_utilisateur: texte libre
   - Nombre_abonnés: numérique
   - Nombre_abonnements: numérique
   - Nombre_tweets: numérique
   - Date_creation: datetime

2. Données des tweets:
   - Identifiant_tweet: identifiant unique
   - Identifiant_utilisateur: identifiant unique
   - Texte_tweet: texte libre
   - Date_creation_tweet: datetime
   - Nombre_lik

In [112]:
generated_data

{'twitter_api':                           input.User_ID  input.Username  \
 0  9876dcba-5432-fedc-7890-214365879bac  utilisateur456   
 1  9876dcba-5432-fedc-7890-214365879bac             NaN   
 2  1234ab56-7890-cd12-ef34-567890123def             NaN   
 3  8a5f97b1-6fa3-4a17-aa3c-5dbb9bcdb914  utilisateur456   
 4  8a5f97b1-6fa3-4a17-aa3c-5dbb9bcdb914             NaN   
 5  a9ae06d9-7bdf-4cd7-bf12-07c4e36f84c9             NaN   
 
    input.Followers_Count  input.Following_Count  input.Tweet_Count  \
 0                  420.0                  230.0               85.0   
 1                    NaN                    NaN                NaN   
 2                    NaN                    NaN                NaN   
 3                  520.0                  320.0              110.0   
 4                    NaN                    NaN                NaN   
 5                    NaN                    NaN                NaN   
 
       input.Created_At                        output.status  \


# From Example Data to Data Structure and Generation

This guide outlines the process of extracting a data structure from a given CSV file and using that structure to generate new data. This is particularly useful when you have sample data and want to replicate its structure with new, synthesized content.


### Extract Data Structure from CSV

To begin, you'll need to specify the path to your CSV file and the file name. Then, use the create_json_sample_from_csv function to extract the data structure.

In [113]:
import os
import pandas as pd
# Specify the path and file name
path = r'C:\Users\Sigal\data\\'
file_name = r"titanic.csv"

# Extract data structure from the CSV file
dataStructureSample = create_json_sample_from_csv(path, file_name)
print(dataStructureSample)

{"titanic.csv": [{"PassengerId":531,"Survived":1,"Pclass":2,"Name":"Quick, Miss. Phyllis May","Sex":"female","Age":2.0,"SibSp":1,"Parch":1,"Ticket":"26360","Fare":26.0,"Cabin":null,"Embarked":"S"},{"PassengerId":567,"Survived":0,"Pclass":3,"Name":"Stoytcheff, Mr. Ilia","Sex":"male","Age":19.0,"SibSp":0,"Parch":0,"Ticket":"349205","Fare":7.8958,"Cabin":null,"Embarked":"S"},{"PassengerId":746,"Survived":0,"Pclass":1,"Name":"Crosby, Capt. Edward Gifford","Sex":"male","Age":70.0,"SibSp":1,"Parch":1,"Ticket":"WE\/P 5735","Fare":71.0,"Cabin":"B22","Embarked":"S"},{"PassengerId":371,"Survived":1,"Pclass":1,"Name":"Harder, Mr. George Achilles","Sex":"male","Age":25.0,"SibSp":1,"Parch":0,"Ticket":"11765","Fare":55.4417,"Cabin":"E50","Embarked":"C"},{"PassengerId":572,"Survived":1,"Pclass":1,"Name":"Appleton, Mrs. Edward Dale (Charlotte Lamson)","Sex":"female","Age":53.0,"SibSp":2,"Parch":0,"Ticket":"11769","Fare":51.4792,"Cabin":"C101","Embarked":"S"}]}
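
For readers curious what a helper like this might do under the hood, here is a standard-library sketch that samples a few rows from CSV text and wraps them in a JSON object keyed by file name, matching the shape of the output above. It is a hypothetical stand-in, not the package's implementation of `create_json_sample_from_csv`; in particular, the csv module leaves every value as a string, whereas the output above contains typed numbers.

```python
import csv
import io
import json
import random


def json_sample_from_csv_text(name, csv_text, n=5, seed=0):
    """Sample up to n rows from CSV text and return them as a JSON string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    picked = rng.sample(rows, k=min(n, len(rows)))
    return json.dumps({name: picked})


# Tiny invented CSV that mimics a few Titanic columns.
csv_text = "PassengerId,Survived,Sex\n1,0,male\n2,1,female\n3,1,female\n"
sample = json_sample_from_csv_text("titanic.csv", csv_text, n=2)
print(sample)
```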


### Initialize Data Generator


With the data structure extracted, you're ready to initialize the DataAugmentor object that will be used for generating data.


In [114]:
# Construct the full file path
full_file_path = os.path.join(path, file_name)

# Read the CSV file into a DataFrame
df = pd.read_csv(full_file_path)

# Initialize the DataAugmentor with the extracted data structure
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)

### Set Example Dataframe (Optional)

If you want the generated data to be influenced by examples from your original dataset, you can set an examples DataFrame within your DataAugmentor object.


In [115]:
DataAugmentorObj.set_examples_dataframe(df, file_name)

### Generate and Preview Data


Finally, you can generate and preview the synthesized data based on the extracted structure and any example data provided.



In [116]:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {"titanic.csv": [{"PassengerId":557,"Survived":1,"Pclass":1,"Name":"Duff Gordon, Lady. (Lucille Christiana Sutherland) (\\Mrs Morgan\\\")\"","Sex":"female","Age":48.0,"SibSp":1,"Parch":0,"Ticket":"11755","Fare":39.6,"Cabin":"A16","Embarked":"C"},{"PassengerId":342,"Survived":1,"Pclass":1,"Name":"Fortune, Miss. Alice Elizabeth","Sex":"female","Age":24.0,"SibSp":3,"Parch":2,"Ticket":"19950","Fare":263.0,"Cabin":"C23 C25 C27","Embarked":"S"},{"PassengerId":317,"Survived":1,"Pclass":2,"Name":"Kantor, Mrs. Sinai (Miriam Sternin)","Sex":"female","Age":24.0,"SibSp":1,"Parch":

# Transform Source Data with a textual description

Utilize the DataTransformer to apply transformations to the source data based on a given description. This could be used for data enrichment, data preparation and data explanation.

This section demonstrates how to add a field indicating the main explanation for each passenger's survival in the Titanic dataset.

In [117]:
from src.DataTransformer import DataTransformer  # Assuming the necessary import

## Initialize Data Transformer

In [118]:
# Instantiate the DataTransformer with the language model
DataTransformerObj = DataTransformer(llm=llm)

## Defining Data Transformation

In [119]:
# Define the transformation based on a user's description and apply it to the data.
query = "Could you add a field indicating the main explanation for the survival of each passenger?"
results = DataTransformerObj.define_transformation(source_data={"titanic": df}, description=query)
print("Transformed Sample Data:")
print(results)

> Entering new SequentialChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field indicating the main explanation for the survival of each passenger?; The transformation logic: 1. Create a new field named "Survival_Reason" to indicate the main explanation for the survival of each passenger.
   
2. Analyze the available data fields like "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to determine the main explanation for survival.

3. Utilize common factors that could influence survival such as being female, being in a higher class, being young, having family members aboard, paying a higher fare, having a cabin, and embarking from certain ports.

4. Assig

## Generating Full Output

In [120]:
# Apply the transformation to the first 50 rows of the source data.
results = DataTransformerObj.transform(source_data={"titanic": df[:50]})

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field indicating the main explanation for the survival of each passenger?; The transformation logic: 1. Create a new field named "Survival_Reason" to indicate the main explanation for the survival of each passenger.
   
2. Analyze the available data fields like "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to determine the main explanation for survival.

3. Utilize common factors that could influence survival such as being female, being in a higher class, being young, having family members aboard, paying a higher fare, having a cabin, and embarking from certain ports.

4. Assign values to the "Survival_Reason" field based on t

In [121]:
pd.DataFrame(results['titanic'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survival_Reason
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Low chance of survival
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,High chance of survival
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,High chance of survival
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,High chance of survival
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Low chance of survival
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Unknown
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Low chance of survival
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Low chance of survival
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,High chance of survival
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,High chance of survival


## Generating Full Output in Parallel

### Setup for Parallel Execution

In [122]:
import asyncio
import nest_asyncio
# Allow asyncio.run() inside Jupyter's already-running event loop.
nest_asyncio.apply()

### Generate Full Output in Parallel

Perform the transformation on a subset of the Titanic dataset.

Here, the first 45 records of the Titanic dataset are processed in parallel using the asynchronous `transform_in_parallel` method. The output is then collected and converted into a pandas DataFrame for easy viewing and further analysis.
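The general pattern behind a parallel run of this kind can be sketched with plain `asyncio`. This is a minimal sketch, not DataWizzAI's actual implementation: `transform_chunk`, `transform_all`, the chunk size of 10, and the placeholder value are all hypothetical, and the short sleep stands in for a call to the language model.

```python
import asyncio


async def transform_chunk(chunk):
    """Hypothetical stand-in for one LLM call that enriches a chunk of records."""
    await asyncio.sleep(0.01)  # simulates waiting on a network request
    return [{**record, "Survival_Reason": "placeholder"} for record in chunk]


async def transform_all(records, chunk_size=10):
    # Split the records into fixed-size chunks (chunk_size is an assumed value).
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    # Run every chunk concurrently; gather returns results in input order.
    results = await asyncio.gather(*(transform_chunk(chunk) for chunk in chunks))
    # Flatten the per-chunk lists back into a single list of records.
    return [record for chunk in results for record in chunk]


records = [{"PassengerId": i} for i in range(1, 46)]
enriched = asyncio.run(transform_all(records))
print(len(enriched))
```

Inside a notebook, `asyncio.run` raises a `RuntimeError` when an event loop is already running, which is why `nest_asyncio.apply()` is called in the setup cell.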


In [123]:
generated_data = asyncio.run(DataTransformerObj.transform_in_parallel(source_data={"titanic": df[:45]}))

> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field indicating the main explanation for the survival of each passenger?; The transformation logic: 1. Create a new field named "Survival_Reason" to indicate the main explanation for the survival of each passenger.
   
2. Analyze the available data fields like "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to determine the main explanation for survival.

3. Utilize common factors that could influence survival such as being female, being in a higher class, be

Display the transformed data:

The result of the asynchronous transformation is a dictionary that maps each dataset name to a list of dictionaries, one per record. We convert this into a DataFrame for readability and for any subsequent analysis.

In [124]:
pd.DataFrame(generated_data['titanic'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survival_Reason
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Low chances of survival based on male gender a...
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,High chances of survival based on female gende...
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,High chances of survival based on female gende...
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,High chances of survival based on female gende...
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Low chances of survival based on male gender a...
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Low chances of survival based on male gender a...
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Low chances of survival based on male gender a...
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Low chances of survival based on male gender a...
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,High chances of survival based on female gende...
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,High chances of survival based on female gende...
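The conversion from the result dictionary to DataFrames can be sketched on a small hand-written example. The `generated` dictionary below is illustrative only: it imitates the shape described above (dataset name mapped to a list of row dictionaries) with a few abbreviated fields, and is not real output from `transform_in_parallel`.

```python
import pandas as pd

# Illustrative data in the assumed shape: dataset name -> list of row dicts.
generated = {
    "titanic": [
        {"PassengerId": 1, "Survived": 0, "Sex": "male", "Survival_Reason": "Low chance of survival"},
        {"PassengerId": 2, "Survived": 1, "Sex": "female", "Survival_Reason": "High chance of survival"},
        {"PassengerId": 3, "Survived": 1, "Sex": "female", "Survival_Reason": "High chance of survival"},
    ]
}

# Build one DataFrame per dataset; each row dict becomes one row.
frames = {name: pd.DataFrame(rows) for name, rows in generated.items()}

titanic = frames["titanic"]
print(titanic.shape)  # (3, 4)

# Tally the generated labels to get a quick overview of the new field.
print(titanic["Survival_Reason"].value_counts())
```

Building a dictionary of DataFrames keeps the same pattern usable when more than one dataset is passed in `source_data`.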
