In [3]:
!guardrails hub install hub://guardrails/valid_length --quiet
!guardrails hub install hub://guardrails/two_words --quiet
!guardrails hub install hub://guardrails/valid_range --quiet

Installing hub://guardrails/valid_length...
✅Successfully installed guardrails/valid_length!

Installing hub://guardrails/two_words...
✅Successfully installed guardrails/two_words!

Installing hub://guardrails/valid_range...
✅Successfully installed guardrails/valid_range!

# Generating Structured Synthetic Data

!!! note
    To download this tutorial as a Jupyter notebook, click [here](https://github.com/ShreyaR/guardrails/blob/main/docs/examples/generate_structured_data.ipynb).

In this example, we'll generate structured dummy data for a `pandas` dataframe.

We make the assumption that:

1. We don't need any external libraries that are not already installed in the environment.
2. We are able to execute the code in the environment.

## Objective

We want to generate structured synthetic data in which each column has a specific data type, and every row respects those types. We also want the data to satisfy the following constraints:

1. There should be exactly 10 rows in the dataset.
2. Each user should have a first name and a last name.
3. The number of orders associated with each user should be between 0 and 50.
4. Each user should have a most recent order date.
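To make the four constraints concrete, here is a minimal plain-Python sketch that checks them on a list of row dictionaries. It is only an illustration of what "valid" means here, not how Guardrails enforces the constraints. The helper name `check_rows` and the key `last_order_date` are hypothetical and chosen for this example.

```python
from datetime import date


def check_rows(rows):
    """Return a list of constraint violations (empty if the rows are valid)."""
    problems = []
    # Constraint 1: exactly 10 rows.
    if len(rows) != 10:
        problems.append(f"expected exactly 10 rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Constraint 2: a first name and a last name (two words).
        if len(str(row.get("user_name", "")).split()) != 2:
            problems.append(f"row {i}: user_name needs a first and a last name")
        # Constraint 3: number of orders between 0 and 50.
        num_orders = row.get("num_orders")
        if not isinstance(num_orders, int) or not 0 <= num_orders <= 50:
            problems.append(f"row {i}: num_orders must be an integer from 0 to 50")
        # Constraint 4: a most recent order date (hypothetical key name).
        if not isinstance(row.get("last_order_date"), date):
            problems.append(f"row {i}: last_order_date must be a date")
    return problems


good_rows = [
    {"user_name": "John Doe", "num_orders": 5, "last_order_date": date(2024, 1, 15)}
    for _ in range(10)
]
# Replace the last row with one that breaks constraints 2, 3 and 4.
bad_rows = good_rows[:9] + [
    {"user_name": "Cher", "num_orders": 72, "last_order_date": None}
]

good_problems = check_rows(good_rows)
bad_problems = check_rows(bad_rows)
```

In the rest of the tutorial, checks of this kind are expressed as validators attached to the schema rather than as hand-written code.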

## Step 1: Generate the Pydantic `RAIL` Spec

In [4]:
from pydantic import BaseModel, Field
from guardrails.hub import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}
"""

class Order(BaseModel):
    user_id: str = Field(description="The user's id.")
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)]
    )

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of users, and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )

## Step 2: Create a `Guard` object with the RAIL Spec

We create a `gd.Guard` object that will check, validate and correct the generated data. This object:

1. Enforces the quality criteria specified in the RAIL spec (i.e., the validators attached to each field).
2. Takes corrective action when the quality criteria are not met (e.g., re-asking the LLM).
3. Compiles the schema and type info from the RAIL spec and adds it to the prompt.
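The validate-then-correct behaviour described above can be sketched in plain Python. This is a simplified illustration of the general pattern, not Guardrails' actual implementation: `make_fake_llm`, `validate` and `guarded_call` are hypothetical names, and the "LLM" is a stub that returns an out-of-range value first and a valid one after being re-asked.

```python
import json


def make_fake_llm():
    """Hypothetical stand-in for an LLM: invalid reply first, valid reply second."""
    replies = iter(['{"num_orders": 72}', '{"num_orders": 12}'])
    return lambda prompt: next(replies)


def validate(data):
    """Return a list of error messages for the parsed output."""
    errors = []
    if not 0 <= data["num_orders"] <= 50:
        errors.append("num_orders must be between 0 and 50")
    return errors


def guarded_call(llm, prompt, max_reasks=1):
    """Call the LLM, validate the output, and re-ask with the errors if needed."""
    current_prompt = prompt
    for attempt in range(max_reasks + 1):
        data = json.loads(llm(current_prompt))
        errors = validate(data)
        if not errors:
            return data, attempt
        # Corrective action: tell the model what was wrong and ask again.
        current_prompt = (
            prompt
            + "\nYour previous answer was invalid: "
            + "; ".join(errors)
            + "\nPlease correct it."
        )
    return None, max_reasks


result, reasks_used = guarded_call(
    make_fake_llm(), "Return the number of orders as JSON."
)
```

Here the first reply fails validation, so the loop re-asks once and accepts the second reply.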

In [5]:
import guardrails as gd

from rich import print

In [None]:
guard = gd.Guard.from_pydantic(output_class=Orders, prompt=prompt)

The `Guard` object compiles the output schema and adds it to the prompt. We can see the final prompt below:

In [7]:
print(guard.rail.prompt)
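As a rough mental model of what "compiling the schema into the prompt" means, the sketch below replaces the `${gr.complete_json_suffix}` placeholder with a description of the expected output. The suffix wording, the schema rendering, and the helper `compile_prompt` are assumptions made for illustration; the text Guardrails actually generates is whatever the cell above prints.

```python
import json

PROMPT = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}
"""

# Hypothetical, simplified rendering of the Orders schema.
schema = {
    "user_orders": [
        {"user_id": "string", "user_name": "string", "num_orders": "integer"}
    ]
}


def compile_prompt(template, schema):
    """Substitute the placeholder with output-format instructions (illustrative)."""
    suffix = (
        "Return a JSON object that matches this schema:\n"
        + json.dumps(schema, indent=2)
    )
    return template.replace("${gr.complete_json_suffix}", suffix)


final_prompt = compile_prompt(PROMPT, schema)
print(final_prompt)
```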

## Step 3: Wrap the LLM API call with `Guard`

In [8]:
import openai


res = guard(
    openai.chat.completions.create,
    max_tokens=2048,
    temperature=0
)
res.validated_output


{'user_orders': [{'user_id': 'u123', 'user_name': 'John Doe', 'num_orders': 5},
  {'user_id': 'u456', 'user_name': 'Jane Smith', 'num_orders': 12},
  {'user_id': 'u789', 'user_name': 'Alice Johnson', 'num_orders': 3},
  {'user_id': 'u234', 'user_name': 'Michael Brown', 'num_orders': 8},
  {'user_id': 'u567', 'user_name': 'Emily Davis', 'num_orders': 20},
  {'user_id': 'u890', 'user_name': 'David Wilson', 'num_orders': 15},
  {'user_id': 'u345', 'user_name': 'Sarah Martinez', 'num_orders': 7},
  {'user_id': 'u678', 'user_name': 'Robert Anderson', 'num_orders': 10},
  {'user_id': 'u901', 'user_name': 'Laura Thompson', 'num_orders': 2},
  {'user_id': 'u432', 'user_name': 'William Garcia', 'num_orders': 18}]}

Running the cell above produces:

1. The raw LLM text output as a single string.
2. The validated output (`res.validated_output`, shown above): a dictionary whose `user_orders` key contains a list of dictionaries, each representing one row of the dataframe.
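Since the goal is a `pandas` dataframe, the list under `user_orders` can be passed directly to `pd.DataFrame`. The sketch below assumes `pandas` is installed and hard-codes the first three rows of the output above in place of `res.validated_output`, so that it runs without an LLM call.

```python
import pandas as pd

# Stand-in for res.validated_output (first three rows of the output above).
validated_output = {
    "user_orders": [
        {"user_id": "u123", "user_name": "John Doe", "num_orders": 5},
        {"user_id": "u456", "user_name": "Jane Smith", "num_orders": 12},
        {"user_id": "u789", "user_name": "Alice Johnson", "num_orders": 3},
    ]
}

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(validated_output["user_orders"])

# Optional sanity check that the range constraint survived the conversion.
all_in_range = bool(df["num_orders"].between(0, 50).all())
print(df)
```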

In [9]:
print(guard.history.last.tree)