# Generating Data from API Specifications

Use this notebook to generate data that conforms to an API specification (e.g., OpenAPI/Swagger) with DataWizzAI. The generated data is useful for testing API endpoints or for supplying mock data during API development.

# Initial Setup Guide

## Import Required Packages

In [13]:
# First, import all the necessary packages.
from langchain_openai import ChatOpenAI
from src.DataDefiner import *
from src.DataAugmentor import DataAugmentor
from src.utils.utils import parse_output, try_parse_json, create_json_sample_from_csv, compose_query_message


## Load Environment Variables

In [14]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
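
Because the next step relies on OPENAI_API_KEY being present, it can help to fail early when the key is missing. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of DataWizzAI or python-dotenv.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the environment mapping.

    `env` defaults to os.environ; passing a plain dict makes the check
    reproducible in tests. (Hypothetical helper, not a library function.)
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping: an empty value counts as missing.
print(missing_env_vars(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": ""}))
# ['OPENAI_API_KEY']
```

In the notebook itself, calling `missing_env_vars(["OPENAI_API_KEY"])` right after `load_dotenv` and raising an error on a non-empty result would surface a missing key before any model call is attempted.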


## Initialize the Language Model

In [15]:
# Make sure OPENAI_API_KEY is set in your environment before running this cell.
# Initialize the language model.
llm = ChatOpenAI(temperature=0.9, model="gpt-3.5-turbo")


# Defining a Data Structure - APISpecificationToData

In [25]:
# Initialize a DataDefiner for the pipeline that turns a textual description of an API into sample data
pipeline_name = get_pipeline_name('APISpecificationToData')
DataDefinerObj = DataDefiner(llm, pipeline_name=pipeline_name)

In [28]:
# Describe the required data structure in plain text
dataStructureDescription = \
"Generate mock data conforming to Twitter's API. The data should include user profiles and tweets, along with metadata for engagements (likes, retweets, replies)."

In [37]:
# Convert the textual description into a sample of the needed data
dataStructureSample = DataDefinerObj.define_schema_from_description(description=dataStructureDescription)
print(parse_output(dataStructureSample))

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate an initial synthetic data sample for the described task (according to the user guidance if provided, or your knowledge otherwise).Provide a sample with 10 number of items; items can be records in a table,records that can potentially be joined in a DB schema, or the equivalent tuples in a JSON format in unstructured text.The task as described by the user: Generate mock data conforming to Twitter's API. The data should include user profiles and tweets, along with metadata for engagements (likes, retweets, replies).;The guidance given by an expert: None;Format instructions: Check the input and output of the API that is mentioned in the user desription, and generate synthetic API calls (with valid input and output) that fit to the required structure of this API.                Format your output as JSONL, with th

# Generating Data

In [31]:
# First initialize the DataAugmentor object with your chosen language model (llm) and the predefined data structure (dataStructureSample):
DataAugmentorObj = DataAugmentor(llm=llm, structure=dataStructureSample, batch_size=3)

## Generate a sample (for output validation)

In [32]:
# You can view a sample of the generated data:
generated_data = DataAugmentorObj.preview_output_sample()
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 3 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: None.Required Structure: {
  "TwitterAPI": 
  [
    {
      "input": {
        "user_id": "123",
        "username": "john_doe",
        "bio": "Tech enthusiast",
        "followers_count": 1000,
        "following_count": 500
      },
      "output": {
        "user_id": "123",
        "username": "john_doe",
        "bio": "Tech enthusiast",
        "followers_count": 1000,
        "following_count": 500
      }
    },
    {
      "input": {
        "tweet_id": "456",
        "user_id": "123",
        "text": "Excited to attend the tech conference next week!",
        "date_posted": "2022-01-

## Optional: query/filter the data structure to control the generated content

In [36]:
# You can also add queries and filters to guide the generated contents:
language = 'English'
query = "Only include non NaN values"

generated_data = DataAugmentorObj.preview_output_sample(query=query, language=language)
print(generated_data)

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 3 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description: Only include non NaN values ; All texts should be translated to English language..Required Structure: {
  "TwitterAPI": 
  [
    {
      "input": {
        "user_id": "123",
        "username": "john_doe",
        "bio": "Tech enthusiast",
        "followers_count": 1000,
        "following_count": 500
      },
      "output": {
        "user_id": "123",
        "username": "john_doe",
        "bio": "Tech enthusiast",
        "followers_count": 1000,
        "following_count": 500
      }
    },
    {
      "input": {
        "tweet_id": "456",
        "user_id": "123",
        "text": "Excite

## Generating Full Output

To generate the full dataset, use the generate_data method. Specify your query (if any), optionally the region and language, and the number of records you wish to generate.
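
The DataAugmentor above was created with batch_size=3, and the prompts in this notebook ask the model for 3 items at a time, so a request for many records is presumably split into several model calls. The sketch below shows one plausible way to do that arithmetic. It is an assumption about the batching, not the library's actual implementation, and `plan_batches` is a hypothetical name.

```python
def plan_batches(num_records, batch_size):
    """Split num_records into per-call batch sizes of at most batch_size.

    Hypothetical helper illustrating the arithmetic only; the real
    DataAugmentor may batch differently.
    """
    if num_records < 0 or batch_size <= 0:
        raise ValueError("num_records must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(num_records, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])


print(plan_batches(15, 3))       # [3, 3, 3, 3, 3]
print(len(plan_batches(20, 3)))  # 7
```

Under this assumption, 20 records with a batch size of 3 need 7 calls, which is consistent with the seven "Entering new LLMChain chain..." lines logged by the parallel run later in this notebook.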

In [22]:
# Without expert specifications
generated_data = DataAugmentorObj.generate_data(query=query, language=language, num_records=15)

generated_data

> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 3 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; All texts should be translated to English language..Required Structure: {
    "twitter_api": [
        {
            "input": {
                "user_id": "123456",
                "username": "john_doe",
                "bio": "Tech enthusiast",
                "followers_count": 1000,
                "following_count": 500
            },
            "output": {
                "user_id": "123456",
                "username": "john_doe",
                "bio": "Tech enthusiast",
                "followers_count": 1000,
                "following_count": 500
            }
        },
        

{'twitter_api':    input.user_id input.username           input.bio  input.followers_count  \
 0         943821    emily_jones   Software engineer                 1800.0   
 1         943821            NaN                 NaN                    NaN   
 2         563891     sara_smith    Digital marketer                 1200.0   
 3         563891            NaN                 NaN                    NaN   
 4         784512    emily_jones   Software engineer                  800.0   
 5         784512            NaN                 NaN                    NaN   
 6         985743     sara_smith    Digital marketer                 1200.0   
 7         985743            NaN                 NaN                    NaN   
 8         245678     jane_smith          Food lover                  800.0   
 9         245678            NaN                 NaN                    NaN   
 10        543210     emily_wong            Bookworm                 1200.0   
 11        543210            NaN     
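
The dotted column names (input.user_id, input.username) and the alternating NaN rows in the table above are what one would expect when nested records with different keys are flattened into a single table. The following standard-library sketch illustrates that effect under the assumption that the library flattens nested JSON this way; `flatten` and `to_rows` are hypothetical helpers, and `None` stands in for the NaN that pandas would display.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_rows(records):
    """Flatten every record and align all rows on the union of columns."""
    flat_records = [flatten(record) for record in records]
    columns = []
    for flat in flat_records:
        for key in flat:
            if key not in columns:
                columns.append(key)
    # Keys a record does not have become None (pandas would show NaN).
    rows = [{col: flat.get(col) for col in columns} for flat in flat_records]
    return columns, rows


# A user-profile record followed by a tweet record with different fields.
records = [
    {"input": {"user_id": "123", "username": "john_doe"}},
    {"input": {"user_id": "123", "tweet_id": "456"}},
]
columns, rows = to_rows(records)
print(columns)  # ['input.user_id', 'input.username', 'input.tweet_id']
print(rows[1])  # {'input.user_id': '123', 'input.username': None, 'input.tweet_id': '456'}
```

Seen this way, a NaN in a profile column of a tweet row marks a field that does not apply to that record type rather than a value the model failed to generate.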

## Generating Full Output in Parallel

For more efficient data generation, especially when dealing with large datasets or multiple requests, our package supports parallel processing. This section covers how to utilize the generate_data_in_parallel method of the DataAugmentor class to generate your dataset asynchronously.


### Setup for Parallel Execution

Jupyter notebooks already run their own asyncio event loop, so calling asyncio.run from a cell normally fails with a RuntimeError. The nest_asyncio module patches asyncio to allow nested event loops, which lets the asyncio.run call used later in this section work inside the notebook.

In [23]:
import nest_asyncio
nest_asyncio.apply()
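
To see the problem that nest_asyncio works around, the sketch below reproduces it with plain asyncio and without the patch applied: calling `asyncio.run` from code that is already running inside an event loop raises a RuntimeError. The function names are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    """Try to start a second event loop from inside a running one."""
    coro = inner()
    try:
        asyncio.run(coro)  # not allowed while another loop is running
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


# outer() runs inside an event loop, much like code in a notebook cell does.
message = asyncio.run(outer())
print(message)
```

A notebook cell is in the same position as `outer` here, which is why the patch is applied before the parallel call.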

### Generate Full Output in Parallel

To generate data in parallel, use the generate_data_in_parallel coroutine. Like generate_data, it accepts a query (if any), the number of records, a region, and a language, but it runs multiple data generation tasks concurrently. Note that the call in this notebook passes the record count as records, whereas the generate_data call above uses num_records.
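
The general pattern of running several generation tasks concurrently can be sketched with plain asyncio. This is not the implementation of generate_data_in_parallel: `fake_generate_batch` is a hypothetical stand-in for a single model call, and the concurrency limit is an added assumption.

```python
import asyncio


async def fake_generate_batch(batch_index, batch_size):
    """Hypothetical stand-in for one LLM call; returns placeholder records."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"batch": batch_index, "item": i} for i in range(batch_size)]


async def generate_concurrently(batch_sizes, max_concurrency=4):
    """Run one task per batch, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(index, size):
        async with semaphore:
            return await fake_generate_batch(index, size)

    # gather returns results in the order the tasks were passed in.
    batches = await asyncio.gather(
        *(run_one(index, size) for index, size in enumerate(batch_sizes))
    )
    return [record for batch in batches for record in batch]


records = asyncio.run(generate_concurrently([3, 3, 3, 3, 3, 3, 2]))
print(len(records))  # 20
```

Bounding the number of in-flight calls with a semaphore is one common way to stay within API rate limits while still overlapping the waiting time of individual requests.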

In [24]:
import asyncio
# Without expert specifications
generated_data = asyncio.run(DataAugmentorObj.generate_data_in_parallel(query=query, records=20, language=language))

generated_data

> Entering new LLMChain chain...
> Entering new LLMChain chain...
> Entering new LLMChain chain...
> Entering new LLMChain chain...
> Entering new LLMChain chain...
> Entering new LLMChain chain...
> Entering new LLMChain chain...
Prompt after formatting:
You are a system that specializes in generating synthetic data according to user requests.Generate 3 Sample synthetic items(records in a table or the equivalent elements in unstructured text) for the data described below. In the following structure, but with different values.For the following task: Data description:  ; All texts should be translated to English language..Required Structure: {
    "twitter_api": [
        {
            "input": {
                "user_id": "123456",
                "username": "john_doe",
                "bio": "Tech enthusiast",
                "followers_count": 1000,
                "following_count": 500
            }

{'twitter_api':    input.user_id input.username               input.bio  \
 0         555555     jane_smith         Gamer and coder   
 1         555555            NaN                     NaN   
 2         654321     jane_smith       Social media guru   
 3         654321            NaN                     NaN   
 4         987654     jane_smith            Animal lover   
 5         987654            NaN                     NaN   
 6         654321     jane_smith         Fashion blogger   
 7         654321            NaN                     NaN   
 8         987654     jane_smith  Marketing professional   
 9         987654            NaN                     NaN   
 10        987654     jane_smith        Digital marketer   
 11        987654            NaN                     NaN   
 12        246810      sam_jones       Software engineer   
 13        654321     jane_smith  Marketing professional   
 14        654321            NaN                     NaN   
 
     input.followers_co