# Transforming Data Based on Textual Descriptions

Learn how to apply transformations to source data based on textual descriptions for data enrichment, preparation, and explanation. This notebook demonstrates how to add fields to datasets and transform them using DataWizzAI.

# Initial Setup Guide

## Import Required Packages

In [13]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message


## Load Environment Variables

In [14]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
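After loading the `.env` file, it can help to confirm that the key is actually present before creating the model. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of DataWizzAI.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    # Default to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# OPENAI_API_KEY is the variable this notebook expects to be set.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.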


## Initialize the Language Model

In [15]:
# Make sure OPENAI_API_KEY is set in your environment variables.
# Initialize the language model.
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Transform Source Data with a Textual Description

This section demonstrates how to add a field indicating the main explanation for each passenger's survival in the Titanic dataset.

## Read Source File

In [16]:
import os
import pandas as pd
# Specify the path and file name
# os.path.join adds the separator, so no trailing backslash is needed
path = r'C:\Users\Sigal\data'
file_name = r"titanic.csv"

In [17]:
# Construct the full file path
full_file_path = os.path.join(path, file_name)

# Read the CSV file into a DataFrame
df = pd.read_csv(full_file_path)

## Initialize Data Transformer


With the source data loaded, you're ready to initialize the DataTransformer object that will be used for transforming the data.

In [18]:
from src.DataTransformer import DataTransformer 

In [19]:
# Instantiate the DataTransformer with the language model
DataTransformerObj = DataTransformer(llm=llm)

## Defining Data Transformation

In [27]:
# Define the transformation based on a user's description and apply it to the data.
query = "Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds)"
results = DataTransformerObj.define_transformation(source_data={"titanic": df}, description=query)
print("Transformed Sample Data:")
print(results)



> Entering new SequentialChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived (1) or not (0).
2. If the passenger survived, analyze the other attributes such as "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to explain their survival odds.
3. Consider factors like class, age, gender, family relations (SibSp and Parch), fare paid, cabin location, and embarkation point in determining survival reason.
4. Based on the analysis, create a new field named "Surv

## Transform Full Output

In [28]:
# Apply the transformation to the first 50 rows of the source data.
results = DataTransformerObj.transform(source_data={"titanic": df[:50]})



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived (1) or not (0).
2. If the passenger survived, analyze the other attributes such as "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to explain their survival odds.
3. Consider factors like class, age, gender, family relations (SibSp and Parch), fare paid, cabin location, and embarkation point in determining survival reason.
4. Based on the analysis, create a new field named "Survival_Reason" to describe why the passenger survive

In [29]:
pd.DataFrame(results['titanic'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survival_Reason
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Did not survive
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Survived due to high class, female gender, and..."
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Survived due to female gender
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Survived due to high class and female gender
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Did not survive
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Did not survive
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Did not survive
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Did not survive
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Survived due to female gender and family relat...
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Survived due to female gender and family relat...
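Because the new field is produced by a language model, it is worth checking that every row actually received a value. The sketch below assumes the transformed dataset is a list of dictionaries, one per row, and uses a hypothetical helper `rows_missing_field`; it is an illustration, not part of DataWizzAI.

```python
def rows_missing_field(records, field="Survival_Reason"):
    """Return indices of records where `field` is absent, None, or blank."""
    missing = []
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(index)
    return missing

# Toy rows standing in for a transformed dataset.
sample = [
    {"PassengerId": 1, "Survived": 0, "Survival_Reason": "Did not survive"},
    {"PassengerId": 2, "Survived": 1, "Survival_Reason": ""},
    {"PassengerId": 3, "Survived": 1},
]
print(rows_missing_field(sample))  # → [1, 2]
```

Rows flagged this way could be re-submitted for transformation or reviewed by hand.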


## Generating Full Output in Parallel

### Setup for Parallel Execution

In [30]:
import nest_asyncio
nest_asyncio.apply()

### Generate Full Output in Parallel

Here, the first 45 records of the Titanic dataset are transformed in parallel using the asynchronous `transform_in_parallel` method. The output is then collected and converted into a pandas DataFrame for viewing and further analysis.
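To illustrate the general idea of parallel processing with `asyncio`, the sketch below splits rows into batches and awaits them concurrently with `asyncio.gather`. It is not the DataWizzAI implementation: `split_into_batches`, `transform_batch`, `transform_all`, and the batch size of 10 are assumptions made for illustration, and the LLM call is replaced by a placeholder.

```python
import asyncio

def split_into_batches(rows, batch_size):
    """Split a list of rows into consecutive batches of at most batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

async def transform_batch(batch):
    """Hypothetical stand-in for one asynchronous LLM call on a batch."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [{**row, "Survival_Reason": "placeholder"} for row in batch]

async def transform_all(rows, batch_size=10):
    """Transform all batches concurrently and flatten the results."""
    batches = split_into_batches(rows, batch_size)
    # gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(transform_batch(b) for b in batches))
    return [row for batch in results for row in batch]

rows = [{"PassengerId": i} for i in range(1, 46)]
transformed = asyncio.run(transform_all(rows))
print(len(transformed))  # → 45
```

Inside a notebook, an event loop is already running, which is why `nest_asyncio.apply()` is called before `asyncio.run`.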


In [32]:
import asyncio
generated_data = asyncio.run(DataTransformerObj.transform_in_parallel(source_data={"titanic":df[:45]}))



> Entering new LLMChain chain...


> Entering new LLMChain chain...


> Entering new LLMChain chain...


> Entering new LLMChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived (1) or not (0).
2. If the passenger survived, analyze the other attributes such as "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" to explain their survival odds.
3. Consider factors like class, age, gender, family relations (SibSp and Parch), fare paid, cabi

Display the transformed data. The result of the asynchronous transformation is a dictionary whose keys are dataset names and whose values are lists of dictionaries, one per row.

We convert this into a DataFrame for better readability and to leverage DataFrame functionalities for any subsequent analysis.
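The structure described above can be sketched with a small hand-written example. The helper `results_to_frames` is hypothetical, and the two rows are toy data that only mimic the shape of the real output.

```python
import pandas as pd

def results_to_frames(results):
    """Build one DataFrame per dataset name from {name: [row dicts]}."""
    return {name: pd.DataFrame(records) for name, records in results.items()}

# Toy result with the same shape: dataset name -> list of row dictionaries.
example = {
    "titanic": [
        {"PassengerId": 1, "Survived": 0, "Survival_Reason": "Not specified"},
        {"PassengerId": 2, "Survived": 1, "Survival_Reason": "Female"},
    ]
}
frames = results_to_frames(example)
print(frames["titanic"].shape)  # → (2, 3)
```

Each dictionary becomes one row, and its keys become the column names.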

In [33]:
pd.DataFrame(generated_data['titanic'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survival_Reason
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Not specified
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Higher class, female, higher fare, cabin location"
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Female
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"Higher class, female, cabin location"
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Not specified
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Not specified
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Not specified
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Not specified
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,"Female, family relations"
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,"Female, family relations"
