### How is it done?

Some of the modelling techniques used:
- Classic statistical methods
- Deep learning models (GANs or VAEs behind the scenes)
- A mix of classic statistical models and deep learning
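
As a minimal sketch of the first item, a classic statistical approach can be as simple as fitting a distribution to a real numeric column and sampling from it. The helper name `fit_and_sample` and the input values are hypothetical, chosen only for illustration, and a normal distribution is an assumption that real columns often violate.

```python
import random
import statistics

def fit_and_sample(real_values, n, seed=0):
    """Fit a normal distribution to real_values and draw n synthetic values."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" column, e.g. leads requested per campaign
real_leads = [7000, 2000, 5550, 4100, 8500]
synthetic_leads = fit_and_sample(real_leads, n=1000)

# The synthetic mean should land close to the real mean
print(round(statistics.mean(real_leads)), round(statistics.mean(synthetic_leads)))
```

Deep learning generators follow the same fit-then-sample idea, but learn a far richer joint distribution across columns.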

Once synthetic data is generated, it must be evaluated to confirm that it is fit for use in downstream tasks. Many libraries and websites offer these kinds of solutions, but here we focus on the GenAI part.
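
One simple way to evaluate a categorical column is to compare how often each category appears in the real and the synthetic data. The sketch below uses the total variation distance for that comparison; the function name and the two small input lists are hypothetical, and a real evaluation would cover every column as well as the relationships between columns.

```python
from collections import Counter

def total_variation_distance(real, synthetic):
    """Half the L1 distance between two empirical category distributions.

    0 means identical frequencies, 1 means no category in common.
    """
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )

# Hypothetical values of one categorical column
real_status = ["Completed", "Completed", "Live", "Live"]
synthetic_status = ["Completed", "Live", "Live", "Pending"]

tvd = total_variation_distance(real_status, synthetic_status)
print(f"Total variation distance: {tvd:.2f}")
```

A distance near 0 suggests the synthetic column mirrors the real one; a large value is a signal to revisit the generator or its prompt.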

## Use case

Synthetic data refers to artificially generated data that imitates the characteristics of real data without containing any information from actual individuals or entities. It is typically created through mathematical models, algorithms, or other data generation techniques. Synthetic data can be used for a variety of purposes, including testing, research, and training machine learning models, while preserving privacy and security.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can lead to AI models that generalize better.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

**Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.**

## Quickstart

In this notebook, we'll dive deep into generating synthetic campaign inventory records using the langchain library. This tool is particularly useful when you want to develop or test algorithms but don't want to use real customer or campaign data due to privacy concerns or data availability issues.

## Setup
- First, you'll need the langchain library and its dependencies. Because we use the OpenAI generator chain, we install `openai` as well. The synthetic data generator lives in an experimental package, so `langchain_experimental` is included in the install.
- [Pydantic](https://docs.pydantic.dev/latest/): Data validation library for Python

In [1]:
%%capture
!pip install -U langchain langchain_experimental openai

In [2]:
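# Set the OpenAI API key as an environment variable
# https://platform.openai.com/account/api-keys
import os
from getpass import getpass
# Prompt for the key only if it is not already set, so it never appears in the notebook
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")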

In [3]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator, OPENAI_TEMPLATE
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX

## 1. Define Your Data Model
- Every dataset has a structure or a "schema".
- The `InventoryData` class below serves as our schema for the synthetic data.
- By defining this, we're informing our synthetic data generator about the shape and nature of data we expect.

In [8]:
from datetime import date  # Assuming date format is appropriate

class InventoryData(BaseModel):
    SF_CAMPAIGN_ID: str
    JOB_ID: int
    TOTAL_LEADS_REQUESTED: int
    REQUESTED_DATE: date
    CAMPAIGN_START_DATE: date
    CAMPAIGN_END_DATE: date
    TOTAL_LEADS_AVAILABLE: int
    TOTAL_LEADS_TO_BE_PROCURED: int
    CAMPAIGN_STATUS: str
    ROLE: str
    TITLE: str
    INDUSTRY: str
    EMPLOYEE_RANGE: str
    REVENUE_RANGE: str
    STATE: str
    COUNTRY: str
    ZIP_CODE: str


## 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional campaign inventory records:

In [9]:
examples = [
    {"example": """SF_CAMPAIGN_ID: ABC-45321, JOB_ID: 348392, TOTAL_LEADS_REQUESTED:7000,
     REQUESTED_DATE: 2020-01-24, CAMPAIGN_START_DATE: 2020-02-04, CAMPAIGN_END_DATE: 2021-01-25,
     TOTAL_LEADS_AVAILABLE: 5000, TOTAL_LEADS_TO_BE_PROCURED: 2000, CAMPAIGN_STATUS: Completed,
     ROLE: Individual Contributor, TITLE: Senior Director of Sales, INDUSTRY: Technology,
     EMPLOYEE_RANGE: 501-1000, REVENUE_RANGE: $10M - $50M, STATE: CA , COUNTRY: US, ZIP_CODE: 732823
     """},
    {"example": """SF_CAMPAIGN_ID: BCD-48393, JOB_ID: 833982, TOTAL_LEADS_REQUESTED: 2000,
     REQUESTED_DATE: 2021-03-07, CAMPAIGN_START_DATE: 2021-03-27, CAMPAIGN_END_DATE: 2021-06-07,
     TOTAL_LEADS_AVAILABLE: 2000, TOTAL_LEADS_TO_BE_PROCURED: 0, CAMPAIGN_STATUS: Completed,
     ROLE: Vice President, TITLE: SVP of Sales Operations, INDUSTRY: Chemical,
     EMPLOYEE_RANGE: 5001-10001, REVENUE_RANGE: $100M-$1B, STATE: NY , COUNTRY: US, ZIP_CODE: 273472
     """},
    {"example": """SF_CAMPAIGN_ID: HPE-788277, JOB_ID: 727628, TOTAL_LEADS_REQUESTED: 5550,
     REQUESTED_DATE: 2024-01-22, CAMPAIGN_START_DATE: 2024-01-26, CAMPAIGN_END_DATE: 2024-02-22,
     TOTAL_LEADS_AVAILABLE: 550, TOTAL_LEADS_TO_BE_PROCURED: 5000, CAMPAIGN_STATUS: Live,
     ROLE: Chief Executive, TITLE: Director of Information Technology Application , INDUSTRY: IT Services,
     EMPLOYEE_RANGE: 10001-100000, REVENUE_RANGE: $501M - $100M, STATE: WA , COUNTRY: Canada, ZIP_CODE: 373732
     """},
]
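
Because the generator imitates these seed examples, it can help to confirm that each one names exactly the fields declared in the schema. The following is a rough sketch of such a check; `parse_example` is a hypothetical helper, and it assumes that no value contains a comma, which holds for the examples above.

```python
# Field names copied from the InventoryData schema
SCHEMA_FIELDS = [
    "SF_CAMPAIGN_ID", "JOB_ID", "TOTAL_LEADS_REQUESTED", "REQUESTED_DATE",
    "CAMPAIGN_START_DATE", "CAMPAIGN_END_DATE", "TOTAL_LEADS_AVAILABLE",
    "TOTAL_LEADS_TO_BE_PROCURED", "CAMPAIGN_STATUS", "ROLE", "TITLE",
    "INDUSTRY", "EMPLOYEE_RANGE", "REVENUE_RANGE", "STATE", "COUNTRY", "ZIP_CODE",
]

def parse_example(text):
    """Split a 'KEY: value, KEY: value' string into a dict (values must not contain commas)."""
    record = {}
    for part in text.split(","):
        if not part.strip():
            continue
        key, _, value = part.partition(":")  # split on the first colon only
        record[key.strip()] = value.strip()
    return record

# First seed example from the list above
example_text = """SF_CAMPAIGN_ID: ABC-45321, JOB_ID: 348392, TOTAL_LEADS_REQUESTED:7000,
     REQUESTED_DATE: 2020-01-24, CAMPAIGN_START_DATE: 2020-02-04, CAMPAIGN_END_DATE: 2021-01-25,
     TOTAL_LEADS_AVAILABLE: 5000, TOTAL_LEADS_TO_BE_PROCURED: 2000, CAMPAIGN_STATUS: Completed,
     ROLE: Individual Contributor, TITLE: Senior Director of Sales, INDUSTRY: Technology,
     EMPLOYEE_RANGE: 501-1000, REVENUE_RANGE: $10M - $50M, STATE: CA , COUNTRY: US, ZIP_CODE: 732823
     """

record = parse_example(example_text)
missing = set(SCHEMA_FIELDS) - set(record)      # schema fields absent from the example
unexpected = set(record) - set(SCHEMA_FIELDS)   # example keys not in the schema
print("missing:", missing, "unexpected:", unexpected)
```

Running the same check over every entry in `examples` catches misspelled or forgotten field names before any API call is made.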

## 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [12]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

In [10]:
SYNTHETIC_FEW_SHOT_PREFIX

'This is a test about generating synthetic data about {subject}. Examples below:'

In [11]:
SYNTHETIC_FEW_SHOT_SUFFIX

'Now you generate synthetic data about {subject}. Make sure to {extra}:'

The `FewShotPromptTemplate` includes:

- `prefix` and `suffix`: Guiding context and instructions placed before and after the examples. Their contents are printed in the two cells above.
- `examples`: The sample data we defined earlier.
- `input_variables`: These variables ("subject", "extra") are placeholders you can fill in dynamically later. For instance, "subject" might be filled with "InventoryData" to guide the model further.
- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.
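
To make the roles of these pieces concrete, the sketch below assembles a few-shot prompt by hand in plain Python. It only approximates what the template does: `build_few_shot_prompt` is a hypothetical helper, the blank-line separator is an assumption, and the two short examples are placeholders rather than full records.

```python
# Prefix and suffix strings as printed in the cells above
PREFIX = "This is a test about generating synthetic data about {subject}. Examples below:"
SUFFIX = "Now you generate synthetic data about {subject}. Make sure to {extra}:"
EXAMPLE_TEMPLATE = "{example}"  # mirrors the example_prompt template

def build_few_shot_prompt(examples, subject, extra, separator="\n\n"):
    """Join the filled-in prefix, each formatted example, and the filled-in suffix."""
    example_strings = [EXAMPLE_TEMPLATE.format(**ex) for ex in examples]
    pieces = [
        PREFIX.format(subject=subject),
        *example_strings,
        SUFFIX.format(subject=subject, extra=extra),
    ]
    return separator.join(pieces)

# Shortened placeholder examples, for illustration only
demo_examples = [
    {"example": "SF_CAMPAIGN_ID: ABC-45321, JOB_ID: 348392"},
    {"example": "SF_CAMPAIGN_ID: BCD-48393, JOB_ID: 833982"},
]

prompt = build_few_shot_prompt(demo_examples, subject="InventoryData", extra="use random values")
print(prompt)
```

The language model sees one block of text: the instruction, the seed rows, and the final request with `subject` and `extra` filled in.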

## 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [13]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=InventoryData,
    llm=ChatOpenAI(temperature=1),
    prompt=prompt_template,
)

## 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [15]:
synthetic_results = synthetic_data_generator.generate(
    subject="InventoryData",
    extra="the campaign id, job id, total leads , requested date, start date, end date, leads available, leads procured, role, title, industry, employee range, revenue range must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)

This command asks the generator to produce 10 synthetic inventory records. The results are stored in `synthetic_results`. The output is a list of `InventoryData` pydantic models.

In [16]:
type(synthetic_results)

list

## 6. Visualize the Generated Synthetic Data

In [17]:
len(synthetic_results)

10

In [18]:
synthetic_results

[InventoryData(SF_CAMPAIGN_ID='XYZ-12456', JOB_ID=987654, TOTAL_LEADS_REQUESTED=8500, REQUESTED_DATE=datetime.date(2023, 8, 17), CAMPAIGN_START_DATE=datetime.date(2023, 9, 2), CAMPAIGN_END_DATE=datetime.date(2024, 2, 15), TOTAL_LEADS_AVAILABLE=4500, TOTAL_LEADS_TO_BE_PROCURED=4000, CAMPAIGN_STATUS='In Progress', ROLE='Team Lead', TITLE='Marketing Specialist', INDUSTRY='Manufacturing', EMPLOYEE_RANGE='500-1000', REVENUE_RANGE='$1M - $10M', STATE='TX', COUNTRY='Mexico', ZIP_CODE='846388'),
 InventoryData(SF_CAMPAIGN_ID='XYZ-76392', JOB_ID=546872, TOTAL_LEADS_REQUESTED=4100, REQUESTED_DATE=datetime.date(2022, 11, 15), CAMPAIGN_START_DATE=datetime.date(2023, 1, 9), CAMPAIGN_END_DATE=datetime.date(2023, 5, 22), TOTAL_LEADS_AVAILABLE=2300, TOTAL_LEADS_TO_BE_PROCURED=1800, CAMPAIGN_STATUS='Ongoing', ROLE='Operations Manager', TITLE='Product Development Specialist', INDUSTRY='Fashion', EMPLOYEE_RANGE='1001-5000', REVENUE_RANGE='$50M-$100M', STATE='FL', COUNTRY='Australia', ZIP_CODE='536347'),


## 7. Convert the Synthetic Data into a pandas DataFrame

In [19]:
import pandas as pd

# Each result is a pydantic model; .dict() returns its fields as a dictionary
synthetic_data = [item.dict() for item in synthetic_results]

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df = pd.DataFrame(synthetic_data)

# Display the DataFrame
print(type(synthetic_df))
synthetic_df

<class 'pandas.core.frame.DataFrame'>


| | SF_CAMPAIGN_ID | JOB_ID | TOTAL_LEADS_REQUESTED | REQUESTED_DATE | CAMPAIGN_START_DATE | CAMPAIGN_END_DATE | TOTAL_LEADS_AVAILABLE | TOTAL_LEADS_TO_BE_PROCURED | CAMPAIGN_STATUS | ROLE | TITLE | INDUSTRY | EMPLOYEE_RANGE | REVENUE_RANGE | STATE | COUNTRY | ZIP_CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XYZ-12456 | 987654 | 8500 | 2023-08-17 | 2023-09-02 | 2024-02-15 | 4500 | 4000 | In Progress | Team Lead | Marketing Specialist | Manufacturing | 500-1000 | $1M - $10M | TX | Mexico | 846388 |
| 1 | XYZ-76392 | 546872 | 4100 | 2022-11-15 | 2023-01-09 | 2023-05-22 | 2300 | 1800 | Ongoing | Operations Manager | Product Development Specialist | Fashion | 1001-5000 | $50M-$100M | FL | Australia | 536347 |
| 2 | ABC-98765 | 123456 | 7200 | 2023-05-11 | 2023-06-01 | 2023-11-30 | 4000 | 3200 | Pending | Marketing Analyst | Supply Chain Coordinator | Food & Beverage | 5000-10000 | $100M - $500M | NY | Germany | 728463 |
| 3 | ABC-54321 | 876543 | 6300 | 2023-10-27 | 2024-01-15 | 2024-07-09 | 3600 | 2700 | On Hold | Data Scientist | Logistics Coordinator | Technology | 100-500 | $10M - $50M | CA | Japan | 634872 |
| 4 | MNO-34567 | 987654 | 5600 | 2023-08-20 | 2024-03-14 | 2024-09-27 | 3200 | 2400 | Completed | Customer Success Manager | Operations Coordinator | Healthcare | 501-1000 | $20M - $50M | TX | South Africa | 873624 |
| 5 | XYZ-24680 | 987654 | 8200 | 2023-09-15 | 2024-04-10 | 2024-10-25 | 5000 | 3200 | In Progress | Sales Coordinator | Marketing Manager | Energy | 1000-5000 | $50M - $100M | FL | Canada | 349258 |
| 6 | RST-12345 | 654321 | 7400 | 2022-11-30 | 2023-06-15 | 2023-12-09 | 4200 | 3200 | Pending Approval | Marketing Specialist | Supply Chain Analyst | Finance | 500-1000 | $30M - $70M | NY | Germany | 456782 |
| 7 | ABC-54321 | 123456 | 9800 | 2023-06-08 | 2024-01-20 | 2024-07-12 | 6000 | 3800 | Under Review | Business Development Director | Data Analyst | Technology | 1000-5000 | $100M - $500M | CA | Japan | 782943 |
| 8 | LMN-24680 | 123456 | 7200 | 2024-08-22 | 2025-02-17 | 2025-09-30 | 4800 | 2400 | Pending Review | Operations Manager | Customer Success Specialist | Healthcare | 500-1000 | $20M - $50M | TX | Australia | 543216 |
| 9 | XYZ-98765 | 789012 | 5600 | 2023-09-14 | 2024-04-30 | 2024-10-15 | 3200 | 2400 | Awaiting Feedback | Human Resources Coordinator | Procurement Specialist | Manufacturing | 100-500 | $10M - $30M | FL | Brazil | 678912 |


In [None]:
synthetic_df.shape

(10, 17)
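
Before using the generated rows downstream, a few consistency checks are worth running. The sketch below re-uses three rows from the output above as plain dictionaries. The two rules it tests (requested leads equal available plus to-be-procured leads, and requested date, start date, and end date are in order) are assumptions inferred from the seed examples, not rules enforced by the generator, and `check_row` is a hypothetical helper.

```python
from collections import Counter
from datetime import date

# Rows 0, 4 and 5 from the generated output, reduced to the fields being checked
rows = [
    {"JOB_ID": 987654, "TOTAL_LEADS_REQUESTED": 8500, "TOTAL_LEADS_AVAILABLE": 4500,
     "TOTAL_LEADS_TO_BE_PROCURED": 4000, "REQUESTED_DATE": date(2023, 8, 17),
     "CAMPAIGN_START_DATE": date(2023, 9, 2), "CAMPAIGN_END_DATE": date(2024, 2, 15)},
    {"JOB_ID": 987654, "TOTAL_LEADS_REQUESTED": 5600, "TOTAL_LEADS_AVAILABLE": 3200,
     "TOTAL_LEADS_TO_BE_PROCURED": 2400, "REQUESTED_DATE": date(2023, 8, 20),
     "CAMPAIGN_START_DATE": date(2024, 3, 14), "CAMPAIGN_END_DATE": date(2024, 9, 27)},
    {"JOB_ID": 987654, "TOTAL_LEADS_REQUESTED": 8200, "TOTAL_LEADS_AVAILABLE": 5000,
     "TOTAL_LEADS_TO_BE_PROCURED": 3200, "REQUESTED_DATE": date(2023, 9, 15),
     "CAMPAIGN_START_DATE": date(2024, 4, 10), "CAMPAIGN_END_DATE": date(2024, 10, 25)},
]

def check_row(row):
    """Return a list of rule violations for one record (empty if none)."""
    problems = []
    expected_total = row["TOTAL_LEADS_AVAILABLE"] + row["TOTAL_LEADS_TO_BE_PROCURED"]
    if row["TOTAL_LEADS_REQUESTED"] != expected_total:
        problems.append("lead counts do not add up")
    if not (row["REQUESTED_DATE"] <= row["CAMPAIGN_START_DATE"] <= row["CAMPAIGN_END_DATE"]):
        problems.append("dates out of order")
    return problems

row_problems = [check_row(row) for row in rows]

# JOB_ID values that occur more than once
job_id_counts = Counter(row["JOB_ID"] for row in rows)
duplicate_job_ids = [job_id for job_id, count in job_id_counts.items() if count > 1]

print("rule violations:", row_problems)
print("duplicate JOB_IDs:", duplicate_job_ids)
```

These rows pass both rules but share a single `JOB_ID`, which matters if downstream code treats that column as a unique key.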

### Start exploring based on your use case, and apply the same approach when you need a stand-in for real, sensitive data. Be careful, though: synthetic data might not capture real-world complexities.