# Generating Synthetic Data with LangChain

<img src="https://dataingovernment.blog.gov.uk/wp-content/uploads/sites/46/2020/08/synthetic_data_image-1536x718.png"  width=600>

Synthetic data is not a new concept in machine learning, but generating it is becoming easier and more robust thanks to the text generation capabilities of LLMs. Researchers are using LLMs to generate synthetic data en masse for tasks such as:
* Generating Synthetic Confidential Data (Healthcare, HR, Financial, etc)
* Creating Small Language Models for Specialized Tasks
* Optimizations and Model Alignment via Fine Tuning
* Addressing long-tail and rare cases where data is insufficient

LangChain, an LLM framework, has built-in synthetic data generators ([documentation here](https://python.langchain.com/v0.2/docs/tutorials/data_generation/)). We'll go over how to use them with several different data models and generate our own synthetic data using LLMs.

---
## Dependencies

In [1]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI

from datetime import datetime, date
from enum import Enum
from typing import Type, List, Optional
import csv

---
# Defining Your Data Model

The data model takes the form of a Pydantic `BaseModel` class, where you specify each attribute and its expected value type.

Below are some example data models across industries, each with few-shot examples that will be used in the final prompt.

### Financial Transactions

In [None]:
# Financial Transaction Data Model
class FinancialTransaction(BaseModel):
    transaction_id: str
    timestamp: datetime
    amount: float
    currency: str
    sender: str
    recipient: str
    transaction_type: str
    description: Optional[str]
    fee: Optional[float]

financial_examples = [
    {
        "example": """Transaction ID: FT001, Timestamp: 2024-07-26 09:15:30, Amount: 1500.00, 
        Currency: USD, Sender: John Smith, Recipient: Alice Johnson, 
        Transaction Type: Bank Transfer, Description: Monthly rent payment, Fee: 5.00"""
    },
    {
        "example": """Transaction ID: FT002, Timestamp: 2024-07-26 14:30:45, Amount: 75.50, 
        Currency: EUR, Sender: Marie Dupont, Recipient: Café Parisien, 
        Transaction Type: Credit Card Payment, Description: Lunch with colleagues, Fee: None"""
    },
    {
        "example": """Transaction ID: FT003, Timestamp: 2024-07-26 18:45:00, Amount: 10000.00, 
        Currency: JPY, Sender: Tanaka Corp, Recipient: Yamada Ltd, 
        Transaction Type: Wire Transfer, Description: None, Fee: 1500.00"""
    }
]
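
Few-shot example strings like the ones above do not have to be typed by hand. As a minimal sketch, assuming you already have a few records as plain dictionaries, the hypothetical helper `to_example` below (not part of LangChain) renders a record in the same `Field Name: value` style. Labels are derived by title-casing the field name, so `transaction_id` becomes `Transaction Id` rather than `Transaction ID`.

```python
from typing import Any, Dict


def to_example(record: Dict[str, Any]) -> Dict[str, str]:
    """Render one record as a 'Field Name: value' few-shot example (hypothetical helper)."""
    parts = []
    for field, value in record.items():
        # transaction_id -> "Transaction Id"
        label = field.replace("_", " ").title()
        parts.append(f"{label}: {value}")
    # Wrap in the {"example": ...} shape used by the example lists in this notebook
    return {"example": ", ".join(parts)}


# Illustrative record using a subset of the FinancialTransaction fields
sample_record = {"transaction_id": "FT001", "amount": 1500.0, "fee": None}
sample_example = to_example(sample_record)
print(sample_example["example"])
```

Missing optional values are rendered as the text `None`, which matches how the hand-written examples mark absent fields.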

### IoT Device Data

In [None]:
# IoT Device Data Model
class IoTDeviceData(BaseModel):
    device_id: str
    timestamp: datetime
    device_type: str
    location: str
    temperature: float
    humidity: Optional[float]
    pressure: Optional[float]
    battery_level: float
    wifi_signal_strength: int
    firmware_version: str
    last_maintenance_date: Optional[datetime]
    error_code: Optional[str]
    data_transfer_rate: float
    is_active: bool

iot_examples = [
    {
        "example": """Device ID: IOT001, Timestamp: 2024-07-26 14:30:15, Device Type: Temperature Sensor, 
        Location: Warehouse A, Temperature: 23.5, Humidity: 45.2, Pressure: 1013.25, 
        Battery Level: 87.5, WiFi Signal Strength: -65, Firmware Version: v2.1.3, 
        Last Maintenance Date: 2024-05-15 09:00:00, Error Code: None, 
        Data Transfer Rate: 256.5, Is Active: True"""
    },
    {
        "example": """Device ID: IOT002, Timestamp: 2024-07-26 15:45:30, Device Type: Smart Thermostat, 
        Location: Office Building Floor 3, Temperature: 21.8, Humidity: 38.7, Pressure: None, 
        Battery Level: 92.3, WiFi Signal Strength: -58, Firmware Version: v3.0.1, 
        Last Maintenance Date: None, Error Code: E004, 
        Data Transfer Rate: 128.0, Is Active: True"""
    },
    {
        "example": """Device ID: IOT003, Timestamp: 2024-07-26 16:15:45, Device Type: Environmental Monitor, 
        Location: Greenhouse B, Temperature: 28.2, Humidity: 75.5, Pressure: 1010.75, 
        Battery Level: 45.8, WiFi Signal Strength: -72, Firmware Version: v1.9.7, 
        Last Maintenance Date: 2024-06-30 14:00:00, Error Code: None, 
        Data Transfer Rate: 512.3, Is Active: False"""
    }
]

### HR Employee Information

In [4]:
# Employee HR Model
class EmployeeHR(BaseModel):
    employee_id: str
    first_name: str
    last_name: str
    date_of_birth: date
    gender: str
    email: str
    phone_number: str
    department: str
    position: str
    hire_date: date
    salary: float
    performance_rating: float
    address: str
    emergency_contact: str
    skills: List[str]
    education: List[str]

hr_examples = [
    {
        "example": """Employee ID: EMP001, First Name: Sarah, Last Name: Johnson, Date of Birth: 1985-03-15, 
        Gender: Female, Email: sarah.johnson@company.com, Phone Number: +1-555-123-4567, 
        Department: Marketing, Position: Senior Marketing Manager, Hire Date: 2010-06-01, 
        Salary: 85000.00, Performance Rating: 4.5, Address: 123 Main St, Anytown, USA 12345, 
        Emergency Contact: John Johnson (Husband) +1-555-987-6543, 
        Skills: Digital Marketing, Content Strategy, SEO, Social Media Management, 
        Education: Bachelor of Business Administration, Master of Marketing"""
    },
    {
        "example": """Employee ID: EMP002, First Name: Michael, Last Name: Chen, Date of Birth: 1990-11-22, 
        Gender: Male, Email: michael.chen@company.com, Phone Number: +1-555-234-5678, 
        Department: IT, Position: Software Developer, Hire Date: 2015-09-15, 
        Salary: 78000.00, Performance Rating: 4.2, Address: 456 Oak Ave, Tech City, USA 67890, 
        Emergency Contact: Lisa Chen (Wife) +1-555-876-5432, 
        Skills: Python, JavaScript, React, Docker, AWS, 
        Education: Bachelor of Science in Computer Science, AWS Certified Developer"""
    },
    {
        "example": """Employee ID: EMP003, First Name: Emily, Last Name: Rodriguez, Date of Birth: 1988-07-30, 
        Gender: Female, Email: emily.rodriguez@company.com, Phone Number: +1-555-345-6789, 
        Department: Human Resources, Position: HR Specialist, Hire Date: 2013-02-28, 
        Salary: 65000.00, Performance Rating: 4.0, Address: 789 Elm St, Metro City, USA 54321, 
        Emergency Contact: David Rodriguez (Brother) +1-555-765-4321, 
        Skills: Recruitment, Employee Relations, HRIS, Conflict Resolution, 
        Education: Bachelor of Arts in Psychology, PHR Certification"""
    }
]

### Medical Records

In [None]:
# Medical Record Data Model
class MedicalRecord(BaseModel):
    patient_id: str
    first_name: str
    last_name: str
    date_of_birth: date
    gender: str
    blood_type: str
    height: float
    weight: float
    allergies: List[str]
    chronic_conditions: List[str]
    emergency_contact: str
    insurance_provider: str
    insurance_policy_number: str
    
medical_examples = [
    {
        "example": """Patient ID: P12345, First Name: Emily, Last Name: Johnson, Date of Birth: 1985-03-15, 
        Gender: Female, Blood Type: A+, Height: 165.0, Weight: 60.5, 
        Allergies: [Penicillin, Peanuts], Chronic Conditions: [Asthma, Migraine], 
        Emergency Contact: Michael Johnson (Husband) +1-555-123-4567, 
        Insurance Provider: HealthGuard, Insurance Policy Number: HG987654321"""
    },
    {
        "example": """Patient ID: P67890, First Name: Robert, Last Name: Chen, Date of Birth: 1972-11-08, 
        Gender: Male, Blood Type: B-, Height: 178.0, Weight: 82.3, 
        Allergies: [Sulfa drugs], Chronic Conditions: [Type 2 Diabetes, Hypertension], 
        Emergency Contact: Lisa Chen (Wife) +1-555-987-6543, 
        Insurance Provider: MediCare Plus, Insurance Policy Number: MP123456789"""
    },
    {
        "example": """Patient ID: P24680, First Name: Sophia, Last Name: Martinez, Date of Birth: 1990-07-22, 
        Gender: Female, Blood Type: O+, Height: 170.5, Weight: 65.8, 
        Allergies: [Latex, Shellfish], Chronic Conditions: [Hypothyroidism], 
        Emergency Contact: Carlos Martinez (Father) +1-555-246-8024, 
        Insurance Provider: BlueCross, Insurance Policy Number: BC135792468"""
    }
]

---
# Defining the Few Shot Prompt

Using our defined data models and examples, we can create a few-shot prompt.

Documentation here: https://python.langchain.com/v0.1/docs/modules/model_io/prompts/few_shot_examples/

**Specifics:**  
* `prefix`: Text placed at the beginning of the final prompt
* `suffix`: Text appended to the end of the final prompt
* `examples`: Preset few-shot examples
* `input_variables`: Names of the variables that are filled in dynamically when the prompt is formatted
* `example_prompt`: A prompt formatter that will format each few-shot example into a string

In [5]:
prompt_template = FewShotPromptTemplate(
    prefix = "This is a test about generating synthetic data about {subject}. Examples below:",
    examples = hr_examples,
    suffix = "Now you generate synthetic data about {subject}. Make sure to {extra}:",
    input_variables = ["subject", "extra"],
    example_prompt = PromptTemplate(input_variables=["example"], template="{example}"),
)
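
To see how these pieces fit together, here is a rough plain-Python approximation of the assembly: the prefix, each example, and the suffix are joined with a blank line between them, and the `{subject}` and `{extra}` placeholders are then filled in. The function `render_few_shot` and the shortened examples are illustrative assumptions, not LangChain code; in the notebook itself, calling `prompt_template.format(subject=..., extra=...)` shows the real rendered prompt.

```python
from typing import Dict, List


def render_few_shot(prefix: str, examples: List[Dict[str, str]], suffix: str,
                    separator: str = "\n\n", **variables: str) -> str:
    """Approximate a few-shot prompt: prefix, then examples, then suffix."""
    # The example prompt in this notebook is just "{example}", so each
    # example renders as its raw text.
    example_strings = [ex["example"] for ex in examples]
    template = separator.join([prefix, *example_strings, suffix])
    # Fill in the dynamic variables (subject, extra)
    return template.format(**variables)


# Shortened stand-ins for the HR examples
demo_examples = [
    {"example": "Employee ID: EMP001, First Name: Sarah, Last Name: Johnson"},
    {"example": "Employee ID: EMP002, First Name: Michael, Last Name: Chen"},
]

prompt_text = render_few_shot(
    prefix="This is a test about generating synthetic data about {subject}. Examples below:",
    examples=demo_examples,
    suffix="Now you generate synthetic data about {subject}. Make sure to {extra}:",
    subject="employee_data",
    extra="vary the departments",
)
print(prompt_text)
```

Because the joined text is formatted as a single template, literal curly braces inside an example would be read as placeholders in this sketch, so the examples should avoid them.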

In [6]:
dict(prompt_template)

{'name': None,
 'input_variables': ['extra', 'subject'],
 'optional_variables': [],
 'input_types': {},
 'output_parser': None,
 'partial_variables': {},
 'metadata': None,
 'tags': None,
 'examples': [{'example': 'Employee ID: EMP001, First Name: Sarah, Last Name: Johnson, Date of Birth: 1985-03-15, \n        Gender: Female, Email: sarah.johnson@company.com, Phone Number: +1-555-123-4567, \n        Department: Marketing, Position: Senior Marketing Manager, Hire Date: 2010-06-01, \n        Salary: 85000.00, Performance Rating: 4.5, Address: 123 Main St, Anytown, USA 12345, \n        Emergency Contact: John Johnson (Husband) +1-555-987-6543, \n        Skills: Digital Marketing, Content Strategy, SEO, Social Media Management, \n        Education: Bachelor of Business Administration, Master of Marketing'},
  {'example': 'Employee ID: EMP002, First Name: Michael, Last Name: Chen, Date of Birth: 1990-11-22, \n        Gender: Male, Email: michael.chen@company.com, Phone Number: +1-555-234-56

---
# Defining the Synthetic Data Generator

The generator combines the data model schema, the language model, and the prompt.

Under the hood, it relies on OpenAI's function calling capability to output structured data repeatably and reliably within a LangChain chain.

In [7]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=EmployeeHR,
    llm=ChatOpenAI(model = "gpt-4o", temperature=1),
    prompt=prompt_template,
)
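
Conceptually, function calling means the model returns its answer as a JSON arguments string that matches the schema, and that string is then parsed into a typed record. The sketch below illustrates only that last step with the standard library. `EmployeeStub`, `parse_arguments`, and the payload are hypothetical stand-ins covering a few `EmployeeHR` fields, not LangChain's or OpenAI's internal code.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class EmployeeStub:
    """Stand-in for a few EmployeeHR fields (hypothetical, stdlib only)."""
    employee_id: str
    first_name: str
    hire_date: date
    skills: List[str]


def parse_arguments(arguments_json: str) -> EmployeeStub:
    """Turn a JSON arguments string into a typed record, failing loudly on bad input."""
    raw = json.loads(arguments_json)
    return EmployeeStub(
        employee_id=str(raw["employee_id"]),
        first_name=str(raw["first_name"]),
        # Raises ValueError if the date is not in YYYY-MM-DD form
        hire_date=date.fromisoformat(raw["hire_date"]),
        skills=[str(skill) for skill in raw["skills"]],
    )


# A hypothetical arguments payload of the kind a function call might return
payload = (
    '{"employee_id": "EMP004", "first_name": "Quincy", '
    '"hire_date": "2018-03-19", "skills": ["Excel", "Budgeting"]}'
)
record = parse_arguments(payload)
print(record)
```

A missing key or malformed date raises an exception here, which mirrors why schema-constrained output is more dependable than parsing free-form text.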

### Generating Synthetic Data!

We can now pass in any extra instructions through the `extra` variable that we defined in the suffix of our few-shot prompt. The `runs` argument sets how many records to generate.

In [9]:
synthetic_results = None

synthetic_results = synthetic_data_generator.generate(
    subject="employee_data",
    extra="make the name something you would not usually pick",  # completes "Make sure to ..." in the suffix
    runs=20,
)

Now we can inspect the data:

In [10]:
synthetic_results

[EmployeeHR(employee_id='EMP004', first_name='Quincy', last_name='Thompson', date_of_birth=datetime.date(1992, 10, 5), gender='Male', email='quincy.thompson@company.com', phone_number='+1-555-456-7890', department='Finance', position='Financial Analyst', hire_date=datetime.date(2018, 3, 19), salary=72000.0, performance_rating=4.3, address='234 Pine St, Commerce City, USA 45678', emergency_contact='Isabella Thompson (Sister) +1-555-654-3210', skills=['Financial Modeling', 'Excel', 'Data Analysis', 'Budgeting'], education=['Bachelor of Science in Finance', 'CFA Level 1']),
 EmployeeHR(employee_id='EMP005', first_name='Sophie', last_name='Lindgren', date_of_birth=datetime.date(1995, 9, 14), gender='Female', email='sophie.lindgren@company.com', phone_number='+1-555-567-8901', department='Marketing', position='Content Strategist', hire_date=datetime.date(2020, 6, 15), salary=67000.0, performance_rating=4.5, address='123 Maple St, Marketingville, USA 98765', emergency_contact='Oskar Lindgren

In [11]:
for result in synthetic_results[0]:
    print(result)

('employee_id', 'EMP004')
('first_name', 'Quincy')
('last_name', 'Thompson')
('date_of_birth', datetime.date(1992, 10, 5))
('gender', 'Male')
('email', 'quincy.thompson@company.com')
('phone_number', '+1-555-456-7890')
('department', 'Finance')
('position', 'Financial Analyst')
('hire_date', datetime.date(2018, 3, 19))
('salary', 72000.0)
('performance_rating', 4.3)
('address', '234 Pine St, Commerce City, USA 45678')
('emergency_contact', 'Isabella Thompson (Sister) +1-555-654-3210')
('skills', ['Financial Modeling', 'Excel', 'Data Analysis', 'Budgeting'])
('education', ['Bachelor of Science in Finance', 'CFA Level 1'])
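
Since every record is produced by an independent LLM call, it can be worth running a few sanity checks before using the data. The following is a minimal sketch, assuming records expose `employee_id`, `date_of_birth`, `hire_date`, and `salary` attributes as `EmployeeHR` does. The function `check_records` and the specific rules are illustrative choices, and `SimpleNamespace` objects stand in for the generated records so the block runs on its own.

```python
from datetime import date
from types import SimpleNamespace
from typing import Any, Iterable, List


def check_records(records: Iterable[Any]) -> List[str]:
    """Return human-readable problems found in generated employee records."""
    problems = []
    seen_ids = set()
    for rec in records:
        # IDs should be unique across the generated set
        if rec.employee_id in seen_ids:
            problems.append(f"duplicate employee_id: {rec.employee_id}")
        seen_ids.add(rec.employee_id)
        # Basic plausibility checks on individual fields
        if rec.hire_date <= rec.date_of_birth:
            problems.append(f"{rec.employee_id}: hire_date is not after date_of_birth")
        if rec.salary <= 0:
            problems.append(f"{rec.employee_id}: non-positive salary")
    return problems


# Stand-in records with a deliberately duplicated ID;
# in the notebook you would pass synthetic_results instead.
sample = [
    SimpleNamespace(employee_id="EMP004", date_of_birth=date(1992, 10, 5),
                    hire_date=date(2018, 3, 19), salary=72000.0),
    SimpleNamespace(employee_id="EMP004", date_of_birth=date(1995, 9, 14),
                    hire_date=date(2020, 6, 15), salary=67000.0),
]
problems = check_records(sample)
print(problems)
```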


---
# Saving Data

Now that we have a data model and generated records, it's easy to convert them into a shareable format such as CSV.

In [12]:
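def basemodel_to_csv(model: Type[BaseModel], data: List[BaseModel], filename: str):
    """Write a list of model instances to a CSV file, one row per record."""
    # Get field names from the model (pydantic v1 API, matching langchain_core.pydantic_v1)
    field_names = list(model.__fields__.keys())
    
    # utf-8 keeps non-ASCII characters (e.g. accented names) intact on every platform
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile: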
        writer = csv.DictWriter(csvfile, fieldnames=field_names)
        
        # Write the header
        writer.writeheader()
        
        # Write the data
        for record in data:
            row = {}
            for field in field_names:
                value = getattr(record, field)
                if isinstance(value, list):
                    row[field] = ', '.join(map(str, value))
                else:
                    row[field] = str(value)
            writer.writerow(row)

# Example
basemodel_to_csv(EmployeeHR, synthetic_results, 'hr_records_example.csv')
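
When loading the file back, list fields such as `skills` and `education` come back as single comma-joined strings, because the export joins them with `', '`. Below is a minimal sketch of reading such rows and splitting those columns again. The helper `read_rows` is hypothetical, and an in-memory buffer stands in for `hr_records_example.csv` so the block is self-contained.

```python
import csv
import io
from typing import Dict, List, Sequence, Union

Row = Dict[str, Union[str, List[str]]]


def read_rows(csvfile, list_fields: Sequence[str]) -> List[Row]:
    """Read rows from a ', '-joining export and split list columns again."""
    rows: List[Row] = []
    for raw in csv.DictReader(csvfile):
        row: Row = dict(raw)
        for field in list_fields:
            text = raw[field]
            # An empty cell means an empty list
            row[field] = text.split(", ") if text else []
        rows.append(row)
    return rows


# In-memory stand-in for the CSV file so the sketch runs on its own
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["employee_id", "skills"])
writer.writeheader()
writer.writerow({"employee_id": "EMP004", "skills": "Financial Modeling, Excel"})
buffer.seek(0)

rows = read_rows(buffer, list_fields=["skills"])
print(rows)
```

This round trip is lossy if an individual item itself contains `', '`; serializing list fields as JSON would avoid that, at the cost of a less spreadsheet-friendly column.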