<h1>Synthetic Data Generation</h1>
<p>Synthetic data generation creates records that share the structure and statistical characteristics of a real-world dataset without containing any real-world data. Synthetic data is typically produced algorithmically; this notebook uses a large language model to produce it.</p>
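<p>As a minimal illustration of the idea, independent of the LLM-based approach used in this notebook, the sketch below fits a normal distribution to a handful of made-up "real" charges and samples new values from it. The values in real_charges are hypothetical, and the normal distribution is an assumption chosen only for simplicity.</p>

```python
import random
import statistics

# Hypothetical "real" charges; in practice these would come from a private dataset.
real_charges = [123.0, 150.0, 500.0, 210.0, 340.0]

# Summarize the real data with two statistics.
mean = statistics.mean(real_charges)
stdev = statistics.stdev(real_charges)

rng = random.Random(0)  # seeded so the sketch is reproducible

# Sample new values from the fitted distribution; clamp at zero
# because a charge cannot be negative.
synthetic_charges = [round(max(0.0, rng.gauss(mean, stdev)), 2) for _ in range(5)]
print(synthetic_charges)
```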
<h3>Installing the Libraries</h3>
<p>First, install the langchain library and its dependencies. The synthetic data generator lives in the experimental package, so langchain_experimental must be installed as well, together with openai for the OpenAI generator chain and python-dotenv for loading the API key.</p>

In [None]:
!pip install -U langchain langchain_experimental openai python-dotenv


<h3>Configuring Files</h3>
<p>Here the OpenAI API key is loaded from a .env file and handed to the openai client.</p>

In [None]:
import dotenv
import os
import openai
dotenv.load_dotenv('./.env')
openai.api_key = os.environ["OPENAI_API_KEY"]
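<p>If the key is missing, the lookup above fails with a bare KeyError. One possible way to get a clearer message is a small helper like the following; require_env is a hypothetical name introduced here, not part of dotenv or openai.</p>

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

<p>After dotenv.load_dotenv has run, it could be used as openai.api_key = require_env("OPENAI_API_KEY").</p>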

<h3>Importing</h3>
<p>Import all the necessary modules.</p>

In [8]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
    OPENAI_TEMPLATE,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_SUFFIX,
    SYNTHETIC_FEW_SHOT_PREFIX,
)

<h3>Defining Data Model</h3>
<p>Each dataset adheres to a specific structure, often referred to as a "schema." The MedicalBilling class below defines the schema for our synthetic data. It tells the generator which fields each record must contain and what type each field has.</p>

In [9]:
class MedicalBilling(BaseModel):
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float

<h3>Sample Data</h3>
<p>To steer the synthetic data generator, give it a few realistic examples. These examples act as a "seed" and show the kind of data you want. The generator uses them to produce additional records with a similar appearance.</p>

In [10]:
examples = [
    {
        "example": """Patient ID: 23242, Patient Name: Atul Ajay, Diagnosis Code:
        A60.9, Procedure Code: 000903, Total Charge: $123, Insurance Claim Amount: $100"""
    },
    {
        "example": """Patient ID: 123092, Patient Name: Rohan Joshi, Diagnosis
        Code: R74.5, Procedure Code: 000913, Total Charge: $150, Insurance Claim Amount: $120"""
    },
    {
        "example": """Patient ID: 342423, Patient Name: Rohan Pathak, Diagnosis Code:
        E21.9, Procedure Code: 000914, Total Charge: $500, Insurance Claim Amount: $300"""
    },
]
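<p>The triple-quoted strings above embed a line break and indentation in each example. If single-line examples are preferred, they could be built from plain dictionaries with a small helper. This is only a sketch: format_example and seed_records are hypothetical names, and the field keys are assumed to match the MedicalBilling schema.</p>

```python
def format_example(record: dict) -> dict:
    """Render one billing record as a single-line few-shot example."""
    text = (
        f"Patient ID: {record['patient_id']}, "
        f"Patient Name: {record['patient_name']}, "
        f"Diagnosis Code: {record['diagnosis_code']}, "
        f"Procedure Code: {record['procedure_code']}, "
        f"Total Charge: ${record['total_charge']}, "
        f"Insurance Claim Amount: ${record['insurance_claim_amount']}"
    )
    return {"example": text}


# Same values as the first seed example above.
seed_records = [
    {
        "patient_id": 23242,
        "patient_name": "Atul Ajay",
        "diagnosis_code": "A60.9",
        "procedure_code": "000903",
        "total_charge": 123,
        "insurance_claim_amount": 100,
    },
]

examples_from_records = [format_example(r) for r in seed_records]
print(examples_from_records[0]["example"])
```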

<h3>Prompt Template</h3>
<p>The generator has no inherent understanding of how to produce our data, so we must provide guidance. To do this, we create a prompt template that instructs the underlying language model to generate synthetic data in the specific format we require.</p>

In [11]:
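# Note: this rebinds the OPENAI_TEMPLATE name imported above.
# The template passes each example string through unchanged.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")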

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)
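<p>To make the assembly concrete, the plain-Python sketch below mimics how a few-shot prompt is put together: prefix, then each example, then suffix, joined by blank lines. PREFIX and SUFFIX are stand-in strings written for this sketch; the actual SYNTHETIC_FEW_SHOT_PREFIX and SYNTHETIC_FEW_SHOT_SUFFIX may be worded differently, and build_few_shot_prompt is a hypothetical helper, not a langchain function.</p>

```python
# Stand-in strings; the library's real prefix and suffix may differ.
PREFIX = "Examples of synthetic data about {subject}:"
SUFFIX = "Now generate synthetic data about {subject}. Make sure to {extra}:"


def build_few_shot_prompt(examples, subject, extra, separator="\n\n"):
    """Join prefix, examples, and suffix into one prompt string."""
    parts = [PREFIX.format(subject=subject)]
    parts += [ex["example"] for ex in examples]
    parts.append(SUFFIX.format(subject=subject, extra=extra))
    return separator.join(parts)


demo_examples = [
    {"example": "Patient ID: 23242, Patient Name: Atul Ajay"},
    {"example": "Patient ID: 123092, Patient Name: Rohan Joshi"},
]
prompt_text = build_few_shot_prompt(
    demo_examples, subject="medical_billing", extra="pick unusual names"
)
print(prompt_text)
```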

<h3>Data Generator</h3>
<p>With the schema and prompt ready, the next step is to create the data generator. This object communicates with the underlying language model to obtain synthetic data.</p>

In [None]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,
    llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k"),
    prompt=prompt_template,
)


<h3>Synthetic Data</h3>
<p>This call asks the generator to produce 10 synthetic medical billing records and stores them in the variable synthetic_results. The result is a list of MedicalBilling pydantic models.</p>

In [None]:
synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)
print(synthetic_results)
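<p>Generated records usually deserve a sanity check before they are used. The sketch below filters out implausible records and writes the rest to CSV text. It uses a dataclass stand-in named BillingRecord so that it runs without langchain; the rule that a claim should not exceed the total charge is an assumption made for illustration, and is_plausible and to_csv are hypothetical helpers.</p>

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BillingRecord:
    """Stand-in for MedicalBilling so the sketch runs without langchain."""
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float


def is_plausible(record: BillingRecord) -> bool:
    """Assumed rule: the claim is non-negative and no larger than the charge."""
    return 0 <= record.insurance_claim_amount <= record.total_charge


def to_csv(records: list) -> str:
    """Serialize records to CSV text with one column per schema field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BillingRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


records = [
    BillingRecord(1, "Example Patient A", "A60.9", "000903", 123.0, 100.0),
    BillingRecord(2, "Example Patient B", "R74.5", "000913", 150.0, 200.0),  # claim exceeds charge
]
valid = [r for r in records if is_plausible(r)]
print(to_csv(valid))
```

<p>With the pydantic v1 models returned by the generator, record.dict() would take the place of asdict(record).</p>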