# Creating Database Schemas from Descriptions

This notebook shows how to create a relational database schema from a natural-language description and populate it with synthetic data. It uses DataWizzAI to prototype database designs quickly, which is useful for users who are less familiar with SQL syntax.

# Initial Setup Guide

## Import Required Packages

In [2]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message


## Load Environment Variables

In [3]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
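
The cell above loads variables from a `.env` file without reporting whether anything was found, so a missing key tends to surface only later as an authentication error. A minimal sketch of an early check follows. It assumes the variable name `OPENAI_API_KEY` used in this notebook; `require_env` is a hypothetical helper and not part of the package.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value

# Example with an explicit mapping so the check can be tried without real credentials
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv` would read from the real environment.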


## Initialize the Language Model

In [4]:
# Please make sure OPENAI_API_KEY is loaded to your environment variables
# Initialize language model
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Defining a Data Structure - DescriptionToDB

In [5]:
# Initialize a DataDefiner for the task of defining a relational DB from a textual description
pipeline_name = get_pipeline_name('DescriptionToDB')
DataDefinerObj = DataDefiner(llm, pipeline_name=pipeline_name)



In [6]:
# Describe the required data structure in plain language
dataStructureDescription = "Create a database schema for a bookstore that includes tables for books, authors, and sales transactions."

In [7]:
# Convert the textual description into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription)
print(parse_output(dataStructureSample))



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Create a database schema for a bookstore that includes tables for books, authors, and sales transactions.;The guidance given by an expert: None;Format instructions: Generate Sample synthetic records for relational DB (one or more tables).                Please follow any relevant distributions for the stated fields, as we want this data to be as valid and useful as possible for development and testing.                Format the output as JSON with each table

## Optional: create a TaskSpecificationAugmentor

You can expand your data description into a detailed data requirements specification by using the TaskSpecificationAugmentor object. This component imitates a data analyst who studies your requirement and translates it into a detailed specification of the needed data and its characteristics.

In [8]:
# Generate professional specifications
task_specifications = TaskSpecificationAugmentor.generate_specifications_from_description(llm, description=dataStructureDescription)

# Print the generated specifications
print("Generated Specifications:")
print(task_specifications)



> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.1.  Generate the names of the columns relevant for the description of the user. Use indicative names. Assign each column its type - numerical, categorical, datetime, free text, unique identifier. 2. Revisit each column and complete these details:For numeric columns - describe its distribution, mean and std, min and max values. for numbers and datetimes define the needed format, for categorical columns detail a complete set of categories and its probabilities. For free text columns - specify the mean and std of the text length, For unique identifier columns - specify the format and regEX to follow. For Datetime columns - specify min and max values, as well as the time intervals mean and st

In [9]:
# Convert the textual description, aided with the expert specification, into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription,
                                                                    task_specifications=task_specifications)
print(parse_output(dataStructureSample))



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Create a database schema for a bookstore that includes tables for books, authors, and sales transactions.;The guidance given by an expert: 1. 

- Books Table:
	- book_id: unique identifier
	- title: free text
	- author_id: numerical
	- genre: categorical
	- publication_date: datetime

- Authors Table:
	- author_id: unique identifier
	- author_name: free text

- Sales Transactions Table:
	- transaction_id: unique identifier
	- book_id: numerical
	- transaction
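
The expert specification above assigns each column one of five types (numerical, categorical, datetime, free text, unique identifier). As a rough sketch of how such a specification could be turned into SQL DDL, the snippet below maps those types to SQLite column types. The mapping, the `to_create_table` helper, and the choice of SQLite are assumptions made for illustration; the package does not necessarily work this way.

```python
# Assumed mapping from specification types to SQLite column types
TYPE_MAP = {
    "unique identifier": "INTEGER PRIMARY KEY",
    "numerical": "INTEGER",
    "categorical": "TEXT",
    "free text": "TEXT",
    "datetime": "TEXT",  # SQLite has no native datetime type; ISO-8601 strings are common
}

def to_create_table(table, columns):
    """Build a CREATE TABLE statement from {column_name: specification_type}."""
    cols = ",\n  ".join(f"{name} {TYPE_MAP[kind]}" for name, kind in columns.items())
    return f"CREATE TABLE {table} (\n  {cols}\n);"

# Column names and types copied from the Books table in the specification
books = {
    "book_id": "unique identifier",
    "title": "free text",
    "author_id": "numerical",
    "genre": "categorical",
    "publication_date": "datetime",
}
ddl = to_create_table("Books", books)
print(ddl)
```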

# Generating Data

In [10]:
# First initialize the DataAugmentor object with your chosen language model (llm) and the predefined data structure (dataStructureSample):
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample)

## Generate a sample (for output validation)

In [11]:
# You can view a sample of the generated data:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {
    "Books": [
        {
            "book_id": 1,
            "title": "The Great Gatsby",
            "author_id": 1,
            "genre": "fiction",
            "publication_date": "2010-05-15"
        },
        {
            "book_id": 2,
            "title": "Harry Potter and the Sorcerer's Stone",
            "author_id": 2,
            "genre": "fiction",
            "publication_date": "2005-12-20"
        },
        {
            "book_id": 3,
            "title": "Becoming",
            "author_id": 3,
            "genre": "non-fiction",
            "publi
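
Since the goal is a relational schema, it can help to load a sample like the one above into a real database and check that the tables join. The sketch below uses the standard-library `sqlite3` module. It assumes the sample is a dict that maps table names to lists of flat records; the `Authors` rows and the `load_tables` helper are made up for illustration.

```python
import sqlite3

sample = {
    "Books": [
        {"book_id": 1, "title": "The Great Gatsby", "author_id": 1,
         "genre": "fiction", "publication_date": "2010-05-15"},
        {"book_id": 2, "title": "Harry Potter and the Sorcerer's Stone", "author_id": 2,
         "genre": "fiction", "publication_date": "2005-12-20"},
    ],
    # Illustrative rows; the Authors table is not visible in the truncated output
    "Authors": [
        {"author_id": 1, "author_name": "Author One"},
        {"author_id": 2, "author_name": "Author Two"},
    ],
}

def load_tables(conn, tables):
    """Create one table per key and insert its records (columns taken from the first record)."""
    for name, rows in tables.items():
        columns = list(rows[0])
        conn.execute(f'CREATE TABLE "{name}" ({", ".join(columns)})')
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f'INSERT INTO "{name}" VALUES ({placeholders})',
            [tuple(row[c] for c in columns) for row in rows],
        )

conn = sqlite3.connect(":memory:")
load_tables(conn, sample)
joined = conn.execute(
    "SELECT b.title, a.author_name FROM Books b JOIN Authors a USING (author_id) ORDER BY b.book_id"
).fetchall()
print(joined)
```

A book whose `author_id` has no match in `Authors` would simply drop out of the join, which makes this a quick referential-integrity check on generated samples.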

## Optional: query/filter the data structure to control the generated content

In [12]:
# You can also add queries and filters to guide the generated contents:
region = 'Italy'
language = 'English'
query = "Only romance genre are included "

generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language)
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only romance genre are included  ; The required region: Italy ; All texts should be translated to English language..Required Structure: {
    "Books": [
        {
            "book_id": 1,
            "title": "The Great Gatsby",
            "author_id": 1,
            "genre": "fiction",
            "publication_date": "2010-05-15"
        },
        {
            "book_id": 2,
            "title": "Harry Potter and the Sorcerer's Stone",
            "author_id": 2,
            "genre": "fiction",
            "publication_date": "2005-12-20"
        },
        {
            "book_id": 3,
    

In [13]:
# If you defined the structure with TaskSpecificationAugmentor and now want to add queries and filters,
# revisit the expert specifications so they reflect the query while keeping the expert guidance:
full_query = compose_query_message(query=query, region=region, language=language)
updated_task_specifications = TaskSpecificationAugmentor.refine_specifications_by_description(
    llm=llm, description=full_query, previous_task_specification=task_specifications)

generated_data = DataAugmentorObj.preview_output_sample(query=query, region=region, language=language,
                                                        task_specifications=updated_task_specifications)
print(generated_data)



> Entering new LLMChain chain...
Prompt after formatting:
You are an analyst whose job is to conduct a deep research for a given task, and define the needed data to collect, making sure you don't miss any relevant detail, and gain a deep understanding of the data characteristics.You already gave instructions for the needed data (see Previous Task Specifications), but now the user asks a content refinement (see User Query). Please revisit the columns distributions, and descriptive statistics and update those that have changed due to the user query. 
The User Query: Data description: Only romance genre are included  ; The required region: Italy ; All texts should be translated to English language.;
Your Previous Task Specifications: 1. 

- Books Table:
	- book_id: unique identifier
	- title: free text
	- author_id: numerical
	- genre: categorical
	- publication_date: datetime

- Authors Table:
	- author_id: unique identifier
	- author_name: free text

- Sales Trans
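
Because the query, region, and language are passed to the model as natural-language guidance, the returned records are not guaranteed to honor them. A small sketch of a post-hoc check follows. It assumes the preview has been parsed into a list of dicts with a `genre` field, as in the structure shown earlier; `violations` is a hypothetical helper and the records are illustrative.

```python
def violations(records, field, allowed):
    """Return the records whose `field` value is not in the allowed set (case-insensitive)."""
    allowed_lower = {value.lower() for value in allowed}
    return [r for r in records if str(r.get(field, "")).lower() not in allowed_lower]

# Illustrative preview rows; only the first one satisfies a romance-only query
preview = [
    {"book_id": 1, "title": "Example Title A", "genre": "Romance"},
    {"book_id": 2, "title": "Example Title B", "genre": "thriller"},
]
bad = violations(preview, "genre", {"romance"})
print(bad)
```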

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records to generate.

In [14]:
# Without expert specifications
generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, num_records=15)

# With expert specifications
# generated_data = DataAugmentorObj.generate_data(query=query, region=region, language=language, task_specifications=updated_task_specifications, num_records=15)



> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only romance genre are included  ; The required region: Italy ; All texts should be translated to English language..Required Structure: {
    "Books": [
        {
            "book_id": 1,
            "title": "The Great Gatsby",
            "author_id": 1,
            "genre": "fiction",
            "publication_date": "2010-05-15"
        },
        {
            "book_id": 2,
            "title": "Harry Potter and the Sorcerer's Stone",
            "author_id": 2,
            "genre": "fiction",
            "publication_date": "2005-12-20"
        },
        {
            "book_id": 3,
    

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to utilize the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.


### Setup for Parallel Execution

Jupyter notebooks already run their own asyncio event loop, so calling asyncio.run from a cell normally fails with a RuntimeError. The nest_asyncio module patches asyncio to allow nested event loops, which lets the parallel generation run inside the notebook.
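
To see the problem that nest_asyncio addresses, the standard-library sketch below calls `asyncio.run` from inside a coroutine. This mirrors the situation of a notebook cell, where an event loop is already running. The function names are illustrative, and the snippet is meant to be run as a plain script rather than in a notebook.

```python
import asyncio

async def inner():
    return "done"

async def cell_like_caller():
    """Simulates code that runs while an event loop is already active."""
    coro = inner()
    try:
        # asyncio.run refuses to start a second loop inside a running one
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"

outcome = asyncio.run(cell_like_caller())
print(outcome)
```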

In [15]:
import nest_asyncio
nest_asyncio.apply()

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. Like generate_data, it accepts the query (if any), region, and language, but it runs multiple data generation tasks concurrently. Note that the number of records is passed as records here, not num_records.
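
The general pattern behind concurrent generation can be sketched with `asyncio.gather`: split the requested total into batches, start one task per batch, and merge the results. This is an illustration of the pattern only and not the package's implementation. `fake_generate` stands in for a model call, and the batch size of 10 is an assumption suggested by the logged prompts, which each ask for 10 items.

```python
import asyncio

async def fake_generate(batch_size, batch_index):
    """Stand-in for one model call; returns `batch_size` placeholder records."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [{"batch": batch_index, "row": i} for i in range(batch_size)]

async def generate_in_batches(total, batch_size=10):
    """Run one task per batch concurrently and flatten the results in batch order."""
    sizes = [batch_size] * (total // batch_size)
    if total % batch_size:
        sizes.append(total % batch_size)
    batches = await asyncio.gather(*(fake_generate(n, i) for i, n in enumerate(sizes)))
    return [record for batch in batches for record in batch]

records = asyncio.run(generate_in_batches(25))
print(len(records))
```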

In [16]:
import asyncio
# Without expert specifications
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query = "", records=20, region=region, language=language))

generated_data

# With expert specifications
#generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query = "", records=20, task_specifications=updated_task_specifications, region=region, language=language))



> Entering new LLMChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 10 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; The required region: Italy ; All texts should be translated to English language..Required Structure: {
    "Books": [
        {
            "book_id": 1,
            "title": "The Great Gatsby",
            "author_id": 1,
            "genre": "fiction",
            "publication_date": "2010-05-15"
        },
        {
            "book_id": 2,
            "title": "Harry Potter and the Sorcerer's Stone",
            "author_id": 2,
            "genre": "fiction",
            "publication_date": "2005-12-20"
        },
        {
            "book_i

{'Books':    book_id                   title  author_id        genre publication_date
 0        6  The Catcher in the Rye          6      fiction       2001-03-10
 1        7        The Night Circus          7      fantasy       2012-09-15
 2        8                Educated          8       memoir       2019-02-23
 3        9               Gone Girl          9     thriller       2014-06-11
 4       10           The Alchemist         10      fiction       1988-09-10
 5        6  The Catcher in the Rye          6      fiction       2012-04-18
 6        7   To Kill a Mockingbird          7      fiction       2008-10-30
 7        8            Born A Crime          8  non-fiction       2016-09-05
 8        9               Gone Girl          9      mystery       2013-06-28
 9       10            The Notebook         10      romance       2014-03-11,
 'Authors':    author_id       author_name
 0          6     J.D. Salinger
 1          7  Erin Morgenstern
 2          8     Tara Westover
 3  
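
In the output above, the `book_id` values 6 to 10 each appear twice, once per parallel batch, so the column cannot serve as a primary key as-is. A minimal sketch for detecting such collisions before loading the data into a database follows. It assumes the rows are available as a list of dicts (for a pandas DataFrame, `df.to_dict("records")` produces that shape); `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter

def duplicate_keys(records, key):
    """Return the sorted key values that occur more than once."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)

# A few rows taken from the merged Books output
books = [
    {"book_id": 6, "title": "The Catcher in the Rye"},
    {"book_id": 7, "title": "The Night Circus"},
    {"book_id": 6, "title": "The Catcher in the Rye"},
    {"book_id": 7, "title": "To Kill a Mockingbird"},
]
print(duplicate_keys(books, "book_id"))
```

If duplicates are found, the identifiers need to be reassigned consistently across related tables, since `author_id` and `book_id` are also used as foreign keys.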