# Defining Unstructured Data from a Textual Description

Learn how to create datasets with flexible formats such as text blobs, documents, or other forms of unstructured data based on descriptive inputs. This notebook demonstrates how to generate unstructured datasets using the DescriptionToUnstructured pipeline in DataWizzAI.

# Initial Setup Guide

## Import Required Packages

In [1]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message


## Load Environment Variables

In [2]:
# Read variables such as OPENAI_API_KEY from a local .env file, if one is found
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
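
load_dotenv does not raise when no .env file is found, so a missing key may only surface later as an authentication error. A minimal sketch of an early check, assuming the key is expected under OPENAI_API_KEY; require_env is a hypothetical helper and not part of DataWizzAI:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value


# Usage in the notebook (raises immediately if the key is missing):
# require_env("OPENAI_API_KEY")
```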


## Initialize the Language Model

In [3]:
# Make sure OPENAI_API_KEY is set in your environment before running this cell
# Initialize the language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Defining a Data Structure - DescriptionToUnstructured

In [4]:
# Initialize a DataDefiner for the task of defining unstructured data from a textual description
pipeline_name = get_pipeline_name('DescriptionToUnstructured')
DataDefinerObj = DataDefiner(llm, pipeline_name=pipeline_name)



In [5]:
# Describe the required dataset in plain language
dataStructureDescription = (
    "Generate a collection of news articles about recent technological advancements in renewable energy."
)

In [6]:
# Convert the textual description into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription)
print(parse_output(dataStructureSample))



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate a collection of news articles about recent technological advancements in renewable energy.;The guidance given by an expert: None;Format instructions: Generate Sample data, potentially unstructured, that fit to the data description.                Format the output as JSONL with the dataset name as key and each line contains a single sampled text in the appropriate format.Please make sure you output a valid JSON format, and don't cut it in the middle.

# Generating Data

In [7]:
# First initialize the DataAugmentor object with your chosen language model (llm) and the predefined data structure (dataStructureSample):
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)

## Generate a sample (for output validation)

In [8]:
# You can view a sample of the generated data:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {
  "news_articles_renewable_energy": [
    {
      "title": "New Solar Panel Design Increases Efficiency by 20%",
      "content": "A team of researchers has developed a new solar panel design that increases efficiency by 20%. The design incorporates innovative materials and manufacturing techniques to improve energy conversion rates."
    },
    {
      "title": "Breakthrough in Wind Turbine Technology Enables Higher Power Output",
      "content": "Scientists have made a breakthrough in wind turbine technology that enables higher power output. The new design feature

## Optional: query/filter the data structure to control the generated content

In [14]:
# You can also add queries and filters to guide the generated contents:
language = 'English'
query = "Only state of the art techniques are included "

generated_data = DataAugmentorObj.preview_output_sample(query=query, language=language)
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only state of the art techniques are included  ; All texts should be translated to English language..Required Structure: {
  "news_articles_renewable_energy": [
    {
      "title": "New Solar Panel Design Increases Efficiency by 20%",
      "content": "A team of researchers has developed a new solar panel design that increases efficiency by 20%. The design incorporates innovative materials and manufacturing techniques to improve energy conversion rates."
    },
    {
      "title": "Breakthrough in Wind Turbine Technology Enables Higher Power Output",
      "content": "Scientists have made a 
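
The logged prompt shows how the query and language end up in the "Data description" field. The sketch below only reproduces that logged text; compose_description is a hypothetical helper written for illustration, and the package's own compose_query_message may work differently:

```python
from typing import Optional


def compose_description(query: Optional[str] = None, language: Optional[str] = None) -> str:
    """Hypothetical helper reproducing the 'Data description' text seen in the log."""
    description = str(query)  # a missing query shows up as "None" in the logged prompt
    if language:
        description += f" ; All texts should be translated to {language} language."
    return description


print(compose_description("Only state of the art techniques are included ", "English"))
```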

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.

In [15]:
# Without expert specifications
generated_data = DataAugmentorObj.generate_data(query=query, language=language, num_records=15)

generated_data



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only state of the art techniques are included  ; All texts should be translated to English language..Required Structure: {
  "news_articles_renewable_energy": [
    {
      "title": "New Solar Panel Design Increases Efficiency by 20%",
      "content": "A team of researchers has developed a new solar panel design that increases efficiency by 20%. The design incorporates innovative materials and manufacturing techniques to improve energy conversion rates."
    },
    {
      "title": "Breakthrough in Wind Turbine Technology Enables Higher Power Output",
      "content": "Scientists have made a 

{'news_articles_renewable_energy':                                                 title  \
 0   Innovative Wind Turbine Design Boosts Power Ou...   
 1   Breakthrough in Solar Panel Technology Increas...   
 2   New Battery Innovation Offers Longer Lifespan ...   
 3   Advanced Energy Storage Systems Address Renewa...   
 4   Cutting-Edge Hydrogen Fuel Cells Transform Tra...   
 5   Revolutionary Geothermal Power Plants Harness ...   
 6   State-of-the-Art Solar Paint Enables Any Surfa...   
 7   AI-Driven Energy Management Systems Enhance Ef...   
 8   Breakthrough in Tidal Energy Harvesting Boosts...   
 9   Smart Grid Implementation Results in Significa...   
 10  Cutting-Edge Solar Panel Design Boosts Efficie...   
 11  New Advancements in Wind Turbine Technology Le...   
 12  Innovative Battery Technology Promises Extende...   
 13  Smart Grid Implementation Demonstrates Substan...   
 14  Hydrogen Fuel Cells Set to Revolutionize Trans...   
 15  Advancements in Tidal Energy Harv
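
The printed result suggests that generate_data returns a dictionary mapping each dataset name to a pandas DataFrame. Assuming that is the case, each table can be converted to a list of records and stored as JSONL. A minimal sketch using only the standard library; write_jsonl is a hypothetical helper, not part of the package:

```python
import json
from pathlib import Path


def write_jsonl(records, path) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Possible usage in the notebook, assuming each value is a pandas DataFrame:
# for name, frame in generated_data.items():
#     write_jsonl(frame.to_dict(orient="records"), f"{name}.jsonl")
```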

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to utilize the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.
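
The idea behind concurrent generation can be sketched with the standard library alone. The example below splits a request into batches and awaits them together with asyncio.gather; generate_batch is a stand-in that sleeps instead of calling a model, and the batch size of 10 only mirrors the "Generate 10 Sample synthetic items" wording in the logged prompts. How the package actually splits the work is an assumption.

```python
import asyncio


async def generate_batch(batch_id: int, size: int) -> list:
    """Stand-in for one LLM request; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)
    return [f"batch{batch_id}-item{i}" for i in range(size)]


async def generate_concurrently(total: int, batch_size: int = 10) -> list:
    """Split `total` records into batches and run all batches concurrently."""
    sizes = [min(batch_size, total - start) for start in range(0, total, batch_size)]
    batches = await asyncio.gather(
        *(generate_batch(i, size) for i, size in enumerate(sizes))
    )
    # gather returns results in the order of its arguments, so batch order is kept
    return [item for batch in batches for item in batch]


if __name__ == "__main__":
    items = asyncio.run(generate_concurrently(25))
    print(len(items))  # 25
```

Run this as a standalone script. Inside a notebook cell, use `await generate_concurrently(25)` instead of asyncio.run.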


### Setup for Parallel Execution

Jupyter notebooks already run their own asyncio event loop, so calling asyncio.run from a cell raises a RuntimeError. The nest_asyncio module patches asyncio to allow nested event loops, which lets the parallel call below run inside the notebook.

In [16]:
import nest_asyncio
nest_asyncio.apply()
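
The error that nest_asyncio works around can be reproduced outside a notebook. A minimal sketch using only the standard library, meant to be run as a plain script: it calls asyncio.run from inside a coroutine that is already running on an event loop, which is the situation a notebook cell is in.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)  # prints the RuntimeError message raised by the nested call
```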

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. Like generate_data, it accepts the query (if any), the number of records, the region, and the language, but it runs multiple generation tasks concurrently. Note that the record count is passed as records here, whereas the generate_data call above uses num_records.

In [17]:
import asyncio
# Without any query / filter
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query="", records=20, language=language))

generated_data



> Entering new LLMChain chain...

> Entering new LLMChain chain...

Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; All texts should be translated to English language..Required Structure: {
  "news_articles_renewable_energy": [
    {
      "title": "New Solar Panel Design Increases Efficiency by 20%",
      "content": "A team of researchers has developed a new solar panel design that increases efficiency by 20%. The design incorporates innovative materials and manufacturing techniques to improve energy conversion rates."
    },
    {
      "title": "Breakthrough in Wind Turbine Technology Enables Higher Power Output",
      "content": "Scientists have made a bre

{'news_articles_renewable_energy':                                                 title  \
 0   Breakthrough in Solar Cell Technology Boosts E...   
 1   Innovative Wind Turbine Design Promises Higher...   
 2   Revolutionary Battery Tech Breakthrough Extend...   
 3   Smart Grid Integration Demonstrates Significan...   
 4   Hydrogen Fuel Cells to Transform Transportatio...   
 5   Progress in Tidal Energy Capture Drives Increa...   
 6   Cutting-Edge Geothermal Power Plants Harness E...   
 7   Breakthrough Solar Paint Technology Enables El...   
 8   Progress in Energy Storage Systems Mitigate Ch...   
 9   AI-Powered Energy Management Systems Enhance E...   
 10  Breakthrough in Ocean Wave Energy Generation I...   
 11  Next-Generation Wind Turbines with Improved Ef...   
 12  Novel Bioenergy Technology Converts Agricultur...   
 13  Advances in Solar Cell Manufacturing Drive Cos...   
 14  Hybrid Energy Systems Combine Solar and Wind P...   
 15  Innovative Energy Storage Solutio