In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator, OPENAI_TEMPLATE
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX
from langchain.pydantic_v1 import BaseModel

## Use Case

Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can lead to AI models that generalize better.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

## 1. Define Your Data Model

Every dataset has a structure, or "schema". The `MedicalBilling` class below serves as the schema for our synthetic data. Defining it tells the synthetic data generator which fields to produce and what type each one should have.

In [3]:
class MedicalBilling(BaseModel):
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float

## 2. Sample Data

To guide the synthetic data generator, it's useful to provide a few realistic examples. These examples serve as a "seed": they represent the kind of data you want, and the generator uses them to create more data that looks similar.

Here are some fictional medical billing records:

In [4]:
examples = [
    {"example": "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9, "
                "Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"},
    {"example": "Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis Code: M54.5, "
                "Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"},
    {"example": "Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: E11.9, "
                "Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"},
]
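
If seed records already exist as structured data, the example strings can be built programmatically rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper `format_example` and a hypothetical `seed_records` list whose keys mirror the `MedicalBilling` fields; neither is part of LangChain.

```python
# Hypothetical helper: renders one record as a seed example string
# in the same "Label: value" layout used by the hand-written examples.
def format_example(record: dict) -> dict:
    text = (
        f"Patient ID: {record['patient_id']}, "
        f"Patient Name: {record['patient_name']}, "
        f"Diagnosis Code: {record['diagnosis_code']}, "
        f"Procedure Code: {record['procedure_code']}, "
        f"Total Charge: ${record['total_charge']}, "
        f"Insurance Claim Amount: ${record['insurance_claim_amount']}"
    )
    return {"example": text}


# Illustrative seed record, copied from the first hand-written example.
seed_records = [
    {
        "patient_id": 123456,
        "patient_name": "John Doe",
        "diagnosis_code": "J20.9",
        "procedure_code": "99203",
        "total_charge": 500,
        "insurance_claim_amount": 350,
    },
]

generated_examples = [format_example(r) for r in seed_records]
print(generated_examples[0]["example"])
```

Keeping the wording identical across examples gives the model one consistent pattern to imitate.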

## 3. Craft a Prompt Template

The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [5]:
# Overrides the OPENAI_TEMPLATE imported above with a pass-through
# template that renders each example string unchanged.
OPENAI_TEMPLATE = PromptTemplate(
    template="{example}",
    input_variables=["example"],
)

prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=OPENAI_TEMPLATE,
    input_variables=["subject", "extra"],
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
)

The `FewShotPromptTemplate` includes:
* **prefix** and **suffix**: Text placed before and after the examples. Here they come from the library constants `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`, which supply the surrounding instructions.
* **examples**: The sample data we defined earlier.
* **input_variables**: The placeholders (`subject`, `extra`) that are filled in when the prompt is used. For instance, `subject` can be set to "medical_billing" to tell the model what kind of data to produce.
* **example_prompt**: The template that formats each example in the prompt.
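
To make the roles of these pieces concrete, here is a plain-Python sketch of how a few-shot prompt is put together: the prefix, each formatted example, and the suffix are joined into one string, and the input variables are filled in. The `prefix` and `suffix` strings below are illustrative stand-ins, not the library's actual constants, and `build_few_shot_prompt` is a hypothetical function that only approximates what `FewShotPromptTemplate` does.

```python
def build_few_shot_prompt(prefix, examples, example_template, suffix,
                          separator="\n\n", **input_values):
    """Approximate a few-shot prompt: prefix + examples + suffix."""
    # Render each example with the per-example template.
    rendered = [example_template.format(**ex) for ex in examples]
    # Fill the input variables in the prefix and suffix only, so any
    # braces inside the example text are left untouched.
    parts = [
        prefix.format(**input_values),
        *rendered,
        suffix.format(**input_values),
    ]
    return separator.join(parts)


prompt = build_few_shot_prompt(
    # Stand-in instruction text (assumption, not the library constants).
    prefix="Generate synthetic data about {subject}. Examples:",
    examples=[{"example": "Patient ID: 123456, Patient Name: John Doe"}],
    example_template="{example}",
    suffix="Now generate one new record about {subject}. Make sure to {extra}:",
    subject="medical_billing",
    extra="choose the name at random",
)
print(prompt)
```

The values passed as `subject` and `extra` end up inside the instruction text, which is why they have to be declared as input variables on the template.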

## 4. Creating the Data Generator

With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.


In [6]:
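from pathlib import Path

# Read the API key from a local file; strip() removes any trailing newline.
api_key = Path("openai_api.txt").read_text().strip()
synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,
    llm=ChatOpenAI(temperature=1, openai_api_key=api_key),
    prompt=prompt_template,
)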

## 5. Generate Synthetic Data

Finally, let's get our synthetic data!

In [7]:
synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)
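
Because the rows come from a language model, nothing guarantees that they are unique or internally consistent, so it can be worth checking them before use. Below is a minimal sketch of two such checks, using plain dictionaries as stand-ins for the generated records. `find_issues` is a hypothetical helper, the sample rows are made up for illustration, and the rule that a claim should not exceed the total charge is an assumption about this domain.

```python
from collections import Counter


def find_issues(rows):
    """Return human-readable problems found in a list of record dicts."""
    issues = []
    # Flag patient IDs that occur more than once.
    counts = Counter(row["patient_id"] for row in rows)
    for patient_id, n in counts.items():
        if n > 1:
            issues.append(f"patient_id {patient_id} appears {n} times")
    # Flag rows whose claim exceeds the total charge (assumed rule).
    for row in rows:
        if row["insurance_claim_amount"] > row["total_charge"]:
            issues.append(f"patient_id {row['patient_id']}: claim exceeds charge")
    return issues


# Made-up rows for illustration only.
rows = [
    {"patient_id": 1, "total_charge": 200.0, "insurance_claim_amount": 150.0},
    {"patient_id": 1, "total_charge": 100.0, "insurance_claim_amount": 120.0},
    {"patient_id": 2, "total_charge": 300.0, "insurance_claim_amount": 300.0},
]
for issue in find_issues(rows):
    print(issue)
```

In the notebook, each `MedicalBilling` result could be converted with its `.dict()` method before being passed to a function like this.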



In [8]:
for result in synthetic_results:
    print(result)

patient_id=135792 patient_name='Oliver Smith' diagnosis_code='I10' procedure_code='99202' total_charge=200.0 insurance_claim_amount=150.0
patient_id=123456 patient_name='Sophia Davis' diagnosis_code='G44.1' procedure_code='99203' total_charge=250.0 insurance_claim_amount=200.0
patient_id=987654 patient_name='Harper Johnson' diagnosis_code='F32.9' procedure_code='99213' total_charge=350.0 insurance_claim_amount=300.0
patient_id=246813 patient_name='Meadow Rodriguez' diagnosis_code='M54.5' procedure_code='99204' total_charge=400.0 insurance_claim_amount=350.0
patient_id=654321 patient_name='Ezekiel Anderson' diagnosis_code='J01.0' procedure_code='99205' total_charge=500.0 insurance_claim_amount=400.0
patient_id=123456 patient_name='Ryland Ramirez' diagnosis_code='I10' procedure_code='99214' total_charge=450.0 insurance_claim_amount=400.0
patient_id=135791 patient_name='Luna Bennett' diagnosis_code='F31.6' procedure_code='99213' total_charge=350.0 insurance_claim_amount=300.0
patient_id=9