# Transforming Data Based on Textual Descriptions

Learn how to apply transformations to source data based on textual descriptions for data enrichment, preparation, and explanation. This notebook demonstrates how to add fields and transform datasets using DataWizzAI, providing a powerful tool for data manipulation and enhancement.

# Initial Setup Guide

## Import Required Packages

In [1]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataTransformer import DataTransformer


## Load Environment Variables

In [2]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
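
Loading the `.env` file does not fail when a key is missing, so it can help to check explicitly before creating the model. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper introduced here for illustration and is not part of `dotenv` or this project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: not part of python-dotenv or DataTransformer.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Typical use, after load_dotenv(find_dotenv()) has run:
# api_key = require_env("OPENAI_API_KEY")
```

Failing early this way gives a clearer message than an authentication error raised later by the first model call.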


## Initialize the Language Model

In [3]:
# Make sure OPENAI_API_KEY is set in your environment variables
# Initialize the language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Transform Source Data with a Textual Description

This section demonstrates how to add a field indicating the main explanation for each passenger's survival in the Titanic dataset.

## Read Source File

In [4]:
import os
import pandas as pd
# Specify the directory and file name (os.path.join adds the separator)
path = r'C:\Users\Sigal\data'
file_name = "titanic.csv"

In [5]:
# Construct the full file path
full_file_path = os.path.join(path, file_name)

# Read the CSV file into a DataFrame
df = pd.read_csv(full_file_path)
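
As an alternative to joining strings, the path can be built with `pathlib`, which avoids trailing-separator mistakes. This is a small sketch rather than the notebook's own method; it uses `PureWindowsPath` only so that the example behaves the same on any operating system, and the directory is the one shown above.

```python
from pathlib import PureWindowsPath

# Directory and file name taken from the cell above.
data_dir = PureWindowsPath(r"C:\Users\Sigal\data")

# The / operator inserts exactly one separator between the parts.
csv_path = data_dir / "titanic.csv"

print(csv_path)         # C:\Users\Sigal\data\titanic.csv
print(csv_path.name)    # titanic.csv
print(csv_path.suffix)  # .csv
```

In the notebook itself, `pathlib.Path` would be the natural choice, since `pd.read_csv` accepts path-like objects as well as strings.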

## Initialize Data Transformer


With the data loaded, you're ready to initialize the DataTransformer object that will be used to transform it.

In [6]:
# Instantiate the DataTransformer with the language model
DataTransformerObj = DataTransformer(llm=llm)


## Defining Data Transformation

Define the transformation based on a user's description and apply it to the data.

For this method, source_data must be a dictionary that maps each dataset name (the key) to the DataFrame holding its data (the value).
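
The prompt logged in the output of this step suggests that the user's description is combined with a generated list of transformation steps. The sketch below shows one way such a prompt could be assembled. The template wording is copied from the visible part of the log; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the library's real template and chain structure may differ.

```python
# Wording copied from the logged prompt; the log is truncated, so the real
# template may contain more text than this.
PROMPT_TEMPLATE = (
    "You are a system that specializes in enriching or transforming given "
    "source data with additional attributes according to user requests. "
    "Follow the given transformation logic for adding additional features. "
    "The task as described by the user: {description}; "
    "The transformation logic: {logic}"
)


def build_prompt(description: str, logic_steps: list) -> str:
    """Combine a user description with numbered transformation steps.

    Hypothetical helper for illustration; not DataTransformer's API.
    """
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(logic_steps, start=1)
    )
    return PROMPT_TEMPLATE.format(description=description, logic=numbered)


prompt = build_prompt(
    "Could you add a field explaining the survival reason for each passenger?",
    [
        'Check the "Survived" field to determine if the passenger survived or not.',
        'Analyze the "Pclass" field.',
    ],
)
print(prompt)
```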

In [7]:
query = "Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds)"
results = DataTransformerObj.define_transformation(source_data={"titanic": df}, description=query)
print("Transformed Sample Data:")
print(results)



> Entering new SequentialChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived or not.
2. Analyze the "Pclass" field to see if there is any correlation between the passenger's ticket class and their survival.
3. Consider the "Sex" field and investigate if gender played a role in the passenger's survival.
4. Evaluate the "Age" field to see if age had any impact on the passenger's likelihood of survival.
5. Look at the "SibSp" and "Parch" fields to see if having siblings, spouses, pa


## Transform Full Output

Apply the transformation to the source data and print the results. 

For this method, source_data is the DataFrame holding the data to be transformed.

The output_format can be 0 for a string, 1 for a JSON object, or 2 for a pandas DataFrame (the value used in the cell below).


In [8]:
# Apply the transformation to the source data and print the results.
results = DataTransformerObj.transform(source_data=df[:50], output_format=2)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived or not.
2. Analyze the "Pclass" field to see if there is any correlation between the passenger's ticket class and their survival.
3. Consider the "Sex" field and investigate if gender played a role in the passenger's survival.
4. Evaluate the "Age" field to see if age had any impact on the passenger's likelihood of survival.
5. Look at the "SibSp" and "Parch" fields to see if having siblings, spouses, parents, or children aboard affected the passenger's

In [9]:
results

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survival Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | Pclass:3, Sex:male, Age:22.0, SibSp:1, Parch:0... |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Pclass:1, Sex:female, Age:38.0, SibSp:1, Parch... |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | Pclass:3, Sex:female, Age:26.0, SibSp:0, Parch... |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | Pclass:1, Sex:female, Age:35.0, SibSp:1, Parch... |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | Pclass:3, Sex:male, Age:35.0, SibSp:0, Parch:0... |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q | Pclass:3, Sex:male, Age:null, SibSp:0, Parch:0... |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | Pclass:1, Sex:male, Age:54.0, SibSp:0, Parch:0... |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | | S | Pclass:3, Sex:male, Age:2.0, SibSp:3, Parch:1,... |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S | Pclass:3, Sex:female, Age:27.0, SibSp:0, Parch... |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C | Pclass:2, Sex:female, Age:14.0, SibSp:1, Parch... |


## Generating Full Output in Parallel

### Setup for Parallel Execution

In [10]:
import nest_asyncio
nest_asyncio.apply()
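
The reason for this setup step is that Jupyter already runs an event loop, and `asyncio.run` refuses to start inside one. The sketch below reproduces that error outside a notebook by calling `asyncio.run` from within a running coroutine; the function names are illustrative only. Run it as a plain Python script, since the last line itself calls `asyncio.run`.

```python
import asyncio


async def work() -> str:
    """Trivial coroutine standing in for real asynchronous work."""
    return "done"


async def call_run_inside_running_loop() -> str:
    """Try to call asyncio.run while an event loop is already running."""
    coro = work()
    try:
        asyncio.run(coro)  # a loop is already running at this point
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(call_run_inside_running_loop())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this nested call is permitted, which is what lets the `asyncio.run(...)` call in the notebook work.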

### Generate Full Output in Parallel

Perform the transformation on a subset of the Titanic dataset.

Here, the first 45 records of the Titanic dataset are processed in parallel using the asynchronous method. The output is then collected and converted into a pandas DataFrame for easy viewing and further analysis.
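
To illustrate the general pattern, here is a minimal sketch of batched parallel processing with `asyncio.gather`. It is not the implementation of `transform_in_parallel`: `fake_transform` is a hypothetical stand-in for a model call, and the batch size and concurrency limit are arbitrary assumptions. Run it as a plain script, or after `nest_asyncio.apply()` in a notebook.

```python
import asyncio


async def fake_transform(batch: list) -> list:
    """Hypothetical stand-in for one model call that enriches a batch of rows."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [
        {**row, "Survival Reason": f"Pclass:{row['Pclass']}, Sex:{row['Sex']}"}
        for row in batch
    ]


def chunk(rows: list, size: int) -> list:
    """Split rows into consecutive batches of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


async def transform_all(rows: list, batch_size: int = 10, max_concurrency: int = 5) -> list:
    """Process all batches concurrently, limiting the number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(batch: list) -> list:
        async with semaphore:
            return await fake_transform(batch)

    # gather returns results in the order the awaitables were passed in,
    # so the original row order is preserved.
    results = await asyncio.gather(*(guarded(b) for b in chunk(rows, batch_size)))
    return [row for batch in results for row in batch]


# 45 toy rows, mirroring the size of the subset used in this section.
rows = [{"PassengerId": i, "Pclass": 3, "Sex": "male"} for i in range(1, 46)]
enriched = asyncio.run(transform_all(rows))
print(len(enriched))
```

The semaphore caps the number of simultaneous calls, which is a common way to stay within an API's rate limits.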

The output_format can be 0 for a string, 1 for a JSON object, or 2 for a pandas DataFrame (the value used in the cell below).

In [11]:
import asyncio
generated_data = asyncio.run(DataTransformerObj.transform_in_parallel(source_data=df[:45], output_format = 2))



> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in enriching or transforming given source data with additional attributes according to user requests. Follow the given transformation logic for adding additional features. The task as described by the user: Could you add a field explaining the survival reason for each passenger? (how its other attributes explain its survive odds); The transformation logic: 1. Check the "Survived" field to determine if the passenger survived or not.
2. Analyze the "Pclass" field to see if there is any correlation between the passenger's ticket class and their survival.
3. Consider the "Sex" field and investigate if gender played a role in the passenger's survival.
4. Evaluate the "Age" field to see if age had any impact o

Display the transformed data. With output_format=2, as passed above, the result is a pandas DataFrame.

In [12]:
print(generated_data)

    PassengerId  Survived  Pclass  \
0             1         0       3   
1             2         1       1   
2             3         1       3   
3             4         1       1   
4             5         0       3   
5             6         0       3   
6             7         0       1   
7             8         0       3   
8             9         1       3   
9            10         1       2   
10           11         1       3   
11           12         1       1   
12           13         0       3   
13           14         0       3   
14           15         0       3   
15           16         1       2   
16           17         0       3   
17           18         1       2   
18           19         0       3   
19           20         1       3   
20           21         0       2   
21           22         1       2   
22           23         1       3   
23           24         1       1   
24           25         0       3   
25           26         1       3   
2