## Shared Setup

Run this cell first to install dependencies and configure the OpenAI client for all examples in this chapter.

In [None]:
!pip install -q openai pandas
from openai import OpenAI
import pandas as pd, json, os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY', 'sk-your-key-here'))




## Implementing QA Generation in Practice

Generate question-answer (QA) pairs from HR policy text with the OpenAI API. The cell below loads three sample policies, asks the model for two QA pairs per policy, and saves the results to a CSV file with the columns `question`, `answer`, and `source_policy`.

In [2]:
# Sample HR policies
hr_policies = pd.DataFrame({
    'policy_id': ['POL-001', 'POL-002', 'POL-003'],
    'policy_name': ['Vacation Policy', 'Remote Work Policy', 'Sick Leave Policy'],
    'policy_text': [
        'Employees get 15 days paid vacation/year. Request 2 weeks in advance. Up to 5 days carry over.',
        'Work remotely up to 3 days/week with approval. Available during core hours 10 AM-3 PM.',
        'Full-time employees get 10 days paid sick leave annually. Doctor note required after 3 days.'
    ]
})

def generate_qa_pairs(policy_text, policy_name, policy_id):
    """Generate QA pairs using OpenAI API"""
    prompt = f"""Generate 2 question-answer pairs for this HR policy.
Policy: {policy_name}
Text: {policy_text}

Return JSON: [{{"question": "...", "answer": "..."}}]"""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    
    content = response.choices[0].message.content.strip()
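    # Strip a Markdown code fence if the model wrapped its JSON in one
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]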
    qa_pairs = json.loads(content)
    
    for pair in qa_pairs:
        pair['source_policy'] = f"{policy_id} - {policy_name}"
    return qa_pairs

# Generate QA pairs for all policies
all_qa = []
for _, row in hr_policies.iterrows():
    qa_pairs = generate_qa_pairs(row['policy_text'], row['policy_name'], row['policy_id'])
    all_qa.extend(qa_pairs)

# Create dataset and save to CSV
qa_dataset = pd.DataFrame(all_qa)[['question', 'answer', 'source_policy']]
qa_dataset.to_csv('hr_policy_qa_dataset.csv', index=False)

print(f"Generated {len(qa_dataset)} QA pairs")
print(f"Saved to hr_policy_qa_dataset.csv\n")
qa_dataset

Generated 6 QA pairs
Saved to hr_policy_qa_dataset.csv



Unnamed: 0,question,answer,source_policy
0,How many days of paid vacation do employees re...,Employees receive 15 days of paid vacation per...,POL-001 - Vacation Policy
1,How far in advance should employees request th...,Employees should request their vacation time a...,POL-001 - Vacation Policy
2,How many days can employees work remotely per ...,Employees can work remotely up to 3 days per w...,POL-002 - Remote Work Policy
3,What are the core hours for remote work under ...,Employees are expected to be available during ...,POL-002 - Remote Work Policy
4,How many days of paid sick leave do full-time ...,Full-time employees get 10 days of paid sick l...,POL-003 - Sick Leave Policy
5,When is a doctor note required for sick leave?,A doctor note is required after 3 consecutive ...,POL-003 - Sick Leave Policy
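
Model replies are not guaranteed to be bare JSON. The array may arrive inside a Markdown code fence or with extra prose around it, and individual items may be missing a key. The helper below is one possible way to make the parsing step more tolerant. It is a sketch under stated assumptions: `extract_json_array` is a hypothetical name (not part of the OpenAI SDK), and it assumes the reply contains a single JSON array of objects with `question` and `answer` keys.

```python
import json
import re


def extract_json_array(raw: str) -> list:
    """Parse the first JSON array found in a model reply.

    Handles replies that wrap the array in a Markdown code fence or
    surround it with extra prose. Raises ValueError if no array parses.
    """
    text = raw.strip()
    # If the reply contains a fenced block, keep only the fence contents.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Take the outermost [...] span in whatever text remains.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON array found in model reply")
    try:
        pairs = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    # Keep only well-formed items so later column selection cannot fail.
    return [p for p in pairs
            if isinstance(p, dict) and {"question", "answer"} <= set(p)]


# Made-up reply for illustration; the fence is built at runtime.
fence = "`" * 3
reply = "Here you go:\n" + fence + 'json\n[{"question": "Q?", "answer": "A."}]\n' + fence
print(extract_json_array(reply))  # [{'question': 'Q?', 'answer': 'A.'}]
```

Raising `ValueError` instead of letting `json.loads` fail deep inside the loop makes it easier to skip or retry a single policy without losing the pairs already collected.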


## Building Synthetic QA from Structured Data

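Generate QA pairs from structured HR and financial tables. Each table row is first rendered as a descriptive context sentence, and the model is then asked for two QA pairs about that sentence. The resulting dataset keeps the `source_table` and `context` alongside each pair, which makes it suitable for training chatbots and Q&A systems.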

In [3]:
# Sample structured data: Employee and financial records
employee_data = pd.DataFrame({
    'employee_id': ['E001', 'E002', 'E003'],
    'name': ['Alice Johnson', 'Bob Smith', 'Carol Williams'],
    'department': ['Engineering', 'Marketing', 'Finance'],
    'salary': [95000, 72000, 88000],
    'years_employed': [3, 5, 2]
})

financial_data = pd.DataFrame({
    'quarter': ['Q1 2024', 'Q2 2024', 'Q3 2024'],
    'revenue': [2500000, 2800000, 3100000],
    'expenses': [1800000, 2000000, 2200000],
    'profit_margin': [28.0, 28.6, 29.0]
})

def create_context_sentence(row, table_name):
    """Create descriptive sentence from table row"""
    if table_name == 'employee':
        return f"{row['name']} works in {row['department']} with a salary of ${row['salary']:,} and has been employed for {row['years_employed']} years."
    else:  # financial
        return f"In {row['quarter']}, revenue was ${row['revenue']:,}, expenses were ${row['expenses']:,}, and profit margin was {row['profit_margin']}%."

def generate_qa_from_table(context, source_table):
    """Generate QA pairs from structured data context"""
    prompt = f"""Generate 2 question-answer pairs based on this data.
Context: {context}

Return ONLY valid JSON in this exact format (no extra text):
[{{"question": "...", "answer": "..."}}]"""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a JSON generator. Always return valid JSON arrays only, with no additional text or markdown."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )
    
    content = response.choices[0].message.content.strip()
    
    # Extract JSON from markdown code blocks if present
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0].strip()
    elif "```" in content:
        content = content.split("```")[1].split("```")[0].strip()
    
    # Find JSON array in the content
    import re
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if json_match:
        content = json_match.group(0)
    
    qa_pairs = json.loads(content)
    
    for pair in qa_pairs:
        pair['source_table'] = source_table
        pair['context'] = context
    return qa_pairs

# Generate QA pairs from employee data
all_structured_qa = []
for _, row in employee_data.iterrows():
    context = create_context_sentence(row, 'employee')
    qa_pairs = generate_qa_from_table(context, 'employee_data')
    all_structured_qa.extend(qa_pairs)

# Generate QA pairs from financial data
for _, row in financial_data.iterrows():
    context = create_context_sentence(row, 'financial')
    qa_pairs = generate_qa_from_table(context, 'financial_data')
    all_structured_qa.extend(qa_pairs)

# Create dataset and save
structured_qa_dataset = pd.DataFrame(all_structured_qa)[['question', 'answer', 'source_table', 'context']]
structured_qa_dataset.to_csv('structured_qa_dataset.csv', index=False)

print(f"Generated {len(structured_qa_dataset)} QA pairs from structured data")
print(f"Saved to structured_qa_dataset.csv\n")
structured_qa_dataset

Generated 12 QA pairs from structured data
Saved to structured_qa_dataset.csv



Unnamed: 0,question,answer,source_table,context
0,What is Alice Johnson's job title?,Alice Johnson works in Engineering.,employee_data,Alice Johnson works in Engineering with a sala...
1,How long has Alice Johnson been employed?,Alice Johnson has been employed for 3 years.,employee_data,Alice Johnson works in Engineering with a sala...
2,What is Bob Smith's job title?,Bob Smith works in Marketing.,employee_data,Bob Smith works in Marketing with a salary of ...
3,How long has Bob Smith been employed?,Bob Smith has been employed for 5 years.,employee_data,Bob Smith works in Marketing with a salary of ...
4,What is Carol Williams' job title?,Carol Williams works in Finance.,employee_data,Carol Williams works in Finance with a salary ...
5,How long has Carol Williams been employed?,Carol Williams has been employed for 2 years.,employee_data,Carol Williams works in Finance with a salary ...
6,What was the revenue in Q1 2024?,"$2,500,000",financial_data,"In Q1 2024, revenue was $2,500,000, expenses w..."
7,What was the profit margin in Q1 2024?,28.0%,financial_data,"In Q1 2024, revenue was $2,500,000, expenses w..."
8,What was the revenue in Q2 2024?,"$2,800,000",financial_data,"In Q2 2024, revenue was $2,800,000, expenses w..."
9,What was the profit margin in Q2 2024?,28.6%,financial_data,"In Q2 2024, revenue was $2,800,000, expenses w..."
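
Because every value in a table row is already known, QA pairs can also be produced without an API call by filling fixed templates. The sketch below shows that deterministic alternative. It is not the method used in the cell above: the templates and the names `EMPLOYEE_TEMPLATES` and `qa_from_row` are assumptions for illustration, and rows are plain dicts standing in for table rows.

```python
# Hypothetical templates: one (question, answer) pattern per fact in a row.
EMPLOYEE_TEMPLATES = [
    ("Which department does {name} work in?",
     "{name} works in {department}."),
    ("What is {name}'s salary?",
     "{name}'s salary is ${salary:,}."),
    ("How long has {name} been employed?",
     "{name} has been employed for {years_employed} years."),
]


def qa_from_row(row: dict, templates: list, source_table: str) -> list:
    """Fill every template with one row's values and tag the source table."""
    return [
        {
            "question": question.format(**row),
            "answer": answer.format(**row),
            "source_table": source_table,
        }
        for question, answer in templates
    ]


# One row with the same fields as the employee table above.
rows = [
    {"name": "Alice Johnson", "department": "Engineering",
     "salary": 95000, "years_employed": 3},
]
qa = [pair
      for row in rows
      for pair in qa_from_row(row, EMPLOYEE_TEMPLATES, "employee_data")]

for pair in qa:
    print(pair["question"], "->", pair["answer"])
```

Template output is exact but repetitive, while model output is more varied but can drift from the data. Template-generated answers can also serve as a reference for checking model-generated ones.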


## Generating Preference Pairs Programmatically

Create pairs of preferred and non-preferred responses for preference-based training such as RLHF (Reinforcement Learning from Human Feedback). For each instruction, the cell below prompts the model twice: once for a clear, detailed response and once for a deliberately vague or unhelpful one.

In [4]:
# Define sample instructions for HR and moderation scenarios
instructions = [
    "How do I request time off for vacation?",
    "What is the company policy on remote work?",
    "Explain the process for submitting an expense report.",
    "What are the guidelines for professional conduct in meetings?"
]

def generate_preferred_response(instruction):
    """Generate a high-quality, helpful response"""
    prompt = f"""Provide a clear, professional, and helpful response to this question:
{instruction}

Be specific, accurate, and include relevant details."""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful, professional HR assistant. Provide clear and accurate information."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content.strip()

def generate_non_preferred_response(instruction):
    """Generate a lower-quality or degraded response"""
    prompt = f"""Provide a vague, unhelpful, or overly brief response to this question:
{instruction}

Make it unclear, missing details, or too generic."""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Provide minimal, vague responses without helpful details."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content.strip()

# Generate preference pairs
preference_pairs = []
for instruction in instructions:
    preferred = generate_preferred_response(instruction)
    non_preferred = generate_non_preferred_response(instruction)
    
    preference_pairs.append({
        'instruction': instruction,
        'preferred_response': preferred,
        'non_preferred_response': non_preferred
    })

# Create dataset and save
preference_dataset = pd.DataFrame(preference_pairs)
preference_dataset.to_csv('preference_pairs_dataset.csv', index=False)

print(f"Generated {len(preference_dataset)} preference pairs")
print(f"Saved to preference_pairs_dataset.csv\n")
preference_dataset

Generated 4 preference pairs
Saved to preference_pairs_dataset.csv



Unnamed: 0,instruction,preferred_response,non_preferred_response
0,How do I request time off for vacation?,"To request time off for vacation, you typicall...",Submit a form or talk to HR.
1,What is the company policy on remote work?,The company policy on remote work allows emplo...,The policy is flexible and subject to change.
2,Explain the process for submitting an expense ...,Submitting an expense report typically involve...,Just follow the instructions given by your sup...
3,What are the guidelines for professional condu...,Guidelines for professional conduct in meeting...,Just follow the standard etiquette and be prof...
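
Preference-tuning tools often read one JSON object per line rather than CSV. The sketch below converts the records built above into that shape and drops pairs that carry no preference signal. The field names `prompt`, `chosen`, and `rejected` are an assumption about the downstream trainer, so check the schema your tooling expects. The helper names and sample values are made up for illustration.

```python
import json


def to_preference_records(pairs: list) -> list:
    """Rename the columns to prompt/chosen/rejected and drop pairs whose
    two responses are empty or identical."""
    records = []
    for pair in pairs:
        chosen = pair["preferred_response"].strip()
        rejected = pair["non_preferred_response"].strip()
        if not chosen or not rejected or chosen == rejected:
            continue  # such a pair carries no preference signal
        records.append({
            "prompt": pair["instruction"],
            "chosen": chosen,
            "rejected": rejected,
        })
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up sample values; the second pair is dropped as a duplicate.
pairs = [
    {"instruction": "How do I request time off for vacation?",
     "preferred_response": "Submit a request in the HR portal two weeks ahead.",
     "non_preferred_response": "Submit a form or talk to HR."},
    {"instruction": "Duplicate example",
     "preferred_response": "Same text",
     "non_preferred_response": "Same text"},
]
records = to_preference_records(pairs)
jsonl = to_jsonl(records)
print(jsonl)
```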


## Implementing Text Augmentation Techniques

Apply paraphrasing, back-translation, and noise injection to a small text corpus. Paraphrasing and back-translation (English to French and back to English) use the OpenAI API; noise injection is a local function that simulates typos and OCR-style character substitutions. Each original sentence yields four augmented variants, which together form a more diverse training set.

In [7]:
!pip install -q openai

import random

# Original text corpus
original_texts = [
    "Employees must submit vacation requests at least two weeks in advance.",
    "Remote work is permitted up to three days per week with manager approval.",
    "All expenses must be documented with receipts and submitted within 30 days.",
    "Professional conduct is expected in all workplace communications.",
    "Health insurance benefits are available to full-time employees after 90 days."
]

def generate_paraphrase(text):
    """Generate a paraphrase of the text using the OpenAI API"""
    prompt = f"Paraphrase this text while keeping the same meaning: {text}"
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150
    )
    return response.choices[0].message.content.strip()

def back_translate(text):
    """Perform back-translation: English -> French -> English (using OpenAI API)"""
    try:
        # Translate to French
        french_response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Translate this English text to French: {text}"}],
            temperature=0.3,
            max_tokens=150
        )
        french_text = french_response.choices[0].message.content.strip()
        
        # Translate back to English
        english_response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Translate this French text to English: {french_text}"}],
            temperature=0.3,
            max_tokens=150
        )
        return english_response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Back-translation error: {e}")
        return text

def inject_noise(text, noise_level=0.05):
    """Inject typos and OCR-like distortions"""
    chars = list(text)
    num_changes = max(1, int(len(chars) * noise_level))
    
    # Common OCR and typing errors
    substitutions = {
        'o': '0', 'l': '1', 'i': '1', 'e': '3', 'a': '@',
        's': '5', 'b': '6', 't': '7', 'g': '9', 'O': '0'
    }
    
    for _ in range(num_changes):
        if len(chars) == 0:
            break
        idx = random.randint(0, len(chars) - 1)
        
        # Apply different noise types
        noise_type = random.choice(['substitute', 'delete', 'swap', 'ocr'])
        
        if noise_type == 'substitute' and chars[idx].isalpha():
            chars[idx] = random.choice('abcdefghijklmnopqrstuvwxyz')
        elif noise_type == 'delete' and len(chars) > 1:
            chars.pop(idx)
        elif noise_type == 'swap' and idx < len(chars) - 1:
            chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
        elif noise_type == 'ocr' and chars[idx] in substitutions:
            chars[idx] = substitutions[chars[idx]]
    
    return ''.join(chars)

# Generate augmented dataset
print("Generating augmented dataset...\n")
augmented_data = []

for text in original_texts:
    # Original text
    augmented_data.append({
        'original_text': text,
        'augmented_text': text,
        'augmentation_type': 'original'
    })
    
    # Paraphrase
    try:
        paraphrased = generate_paraphrase(text)
        augmented_data.append({
            'original_text': text,
            'augmented_text': paraphrased,
            'augmentation_type': 'paraphrase'
        })
    except Exception as e:
        print(f"Paraphrase error for '{text[:50]}...': {e}")
    
    # Back-translation
    try:
        back_translated = back_translate(text)
        augmented_data.append({
            'original_text': text,
            'augmented_text': back_translated,
            'augmentation_type': 'back_translation'
        })
    except Exception as e:
        print(f"Back-translation error for '{text[:50]}...': {e}")
    
    # Noise injection (light)
    noisy_text_light = inject_noise(text, noise_level=0.03)
    augmented_data.append({
        'original_text': text,
        'augmented_text': noisy_text_light,
        'augmentation_type': 'noise_light'
    })
    
    # Noise injection (medium)
    noisy_text_medium = inject_noise(text, noise_level=0.08)
    augmented_data.append({
        'original_text': text,
        'augmented_text': noisy_text_medium,
        'augmentation_type': 'noise_medium'
    })

# Create dataset and save
augmented_dataset = pd.DataFrame(augmented_data)
augmented_dataset.to_csv('augmented_text_dataset.csv', index=False)

print(f"Generated {len(augmented_dataset)} augmented text samples")
print(f"Saved to augmented_text_dataset.csv\n")
print(f"Augmentation type distribution:")
print(augmented_dataset['augmentation_type'].value_counts())
print()
augmented_dataset




Generating augmented dataset...

Generated 25 augmented text samples
Saved to augmented_text_dataset.csv

Augmentation type distribution:
augmentation_type
original            5
paraphrase          5
back_translation    5
noise_light         5
noise_medium        5
Name: count, dtype: int64



Unnamed: 0,original_text,augmented_text,augmentation_type
0,Employees must submit vacation requests at lea...,Employees must submit vacation requests at lea...,original
1,Employees must submit vacation requests at lea...,Workers need to give notice of their vacation ...,paraphrase
2,Employees must submit vacation requests at lea...,Employees must submit their leave requests at ...,back_translation
3,Employees must submit vacation requests at lea...,Employees must submit vacation requests at lea...,noise_light
4,Employees must submit vacation requests at lea...,Employees must submit vacation requests mt lea...,noise_medium
5,Remote work is permitted up to three days per ...,Remote work is permitted up to three days per ...,original
6,Remote work is permitted up to three days per ...,Employees are allowed to work remotely for a m...,paraphrase
7,Remote work is permitted up to three days per ...,Remote work is allowed up to three days per we...,back_translation
8,Remote work is permitted up to three days per ...,Remote work is permitted up t0 three days per ...,noise_light
9,Remote work is permitted up to three days per ...,Remote work is pemrit7ed up to three days per ...,noise_medium
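
Not every augmented row is useful. `inject_noise` can return the text unchanged when the randomly chosen operation does not apply to the chosen character, and a paraphrase can drift far from its source. One possible post-filter, sketched below, uses `difflib.SequenceMatcher` from the standard library. The `min_sim` threshold of 0.3 is an assumed, untuned value, and `filter_augmented` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def filter_augmented(samples: list, min_sim: float = 0.3) -> list:
    """Keep originals, plus augmented rows that differ from their source
    without falling below min_sim (an assumed, untuned threshold)."""
    kept = []
    for sample in samples:
        if sample["augmentation_type"] == "original":
            kept.append(sample)
            continue
        sim = similarity(sample["original_text"], sample["augmented_text"])
        if min_sim <= sim < 1.0:
            kept.append(sample)
    return kept


src = "Remote work is permitted up to three days per week with manager approval."
samples = [
    {"original_text": src, "augmented_text": src,
     "augmentation_type": "original"},
    # Noise that happened to change nothing: dropped as a duplicate.
    {"original_text": src, "augmented_text": src,
     "augmentation_type": "noise_light"},
    # A single OCR-style substitution: kept.
    {"original_text": src, "augmented_text": src.replace("to", "t0"),
     "augmentation_type": "noise_medium"},
    # Unrelated text: dropped for falling below the threshold.
    {"original_text": src, "augmented_text": "zzzz",
     "augmentation_type": "paraphrase"},
]
kept = filter_augmented(samples)
print([s["augmentation_type"] for s in kept])  # ['original', 'noise_medium']
```

Character-level similarity is a coarse proxy. It suits noise injection well, but a valid paraphrase can share few characters with its source, so a semantic measure may be a better fit for that augmentation type.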


## Building Multimodal Augmentation Workflows

Augment images with Pillow, OpenCV, and NumPy (rotation, horizontal flipping, Gaussian noise) and pair each image with an LLM-generated caption. In this demonstration the caption prompt contains only the augmentation type, not the image itself, so the captions are illustrative placeholders and do not describe the actual pixels.

In [9]:
!pip install -q pillow opencv-python numpy

from PIL import Image, ImageFilter, ImageEnhance
import cv2
import numpy as np
import os

# Create sample directories for demonstration
os.makedirs('sample_images', exist_ok=True)
os.makedirs('augmented_images', exist_ok=True)

# For demonstration, we'll create synthetic images
# In practice, you would load your actual image files

def create_sample_image(filename, color):
    """Create a simple sample image for demonstration"""
    img = Image.new('RGB', (200, 200), color=color)
    img.save(f'sample_images/{filename}')
    return f'sample_images/{filename}'

# Create sample files
print("Creating sample image files...\n")
sample_images = [
    create_sample_image('image1.jpg', (100, 150, 200)),
    create_sample_image('image2.jpg', (200, 100, 150)),
    create_sample_image('image3.jpg', (150, 200, 100))
]

# Image augmentation functions
def rotate_image(image_path, angle=45):
    """Rotate image by specified angle"""
    img = Image.open(image_path)
    rotated = img.rotate(angle, expand=True, fillcolor=(255, 255, 255))
    output_path = image_path.replace('sample_images', 'augmented_images').replace('.jpg', '_rotated.jpg')
    rotated.save(output_path)
    return output_path

def flip_image(image_path):
    """Flip image horizontally"""
    img = Image.open(image_path)
    flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
    output_path = image_path.replace('sample_images', 'augmented_images').replace('.jpg', '_flipped.jpg')
    flipped.save(output_path)
    return output_path

def add_gaussian_noise_image(image_path):
    """Add Gaussian noise to image"""
    img = cv2.imread(image_path)
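    # Add the noise in floating point, then clip to [0, 255]: casting signed
    # noise straight to uint8 would wrap negative values around to large ones.
    noise = np.random.normal(0, 25, img.shape)
    noisy_img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)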
    output_path = image_path.replace('sample_images', 'augmented_images').replace('.jpg', '_noisy.jpg')
    cv2.imwrite(output_path, noisy_img)
    return output_path

def generate_image_caption(image_path, augmentation_type):
    """Generate caption for image using OpenAI API"""
    # Since we can't send actual images in this demo, we'll generate descriptive captions
    prompt = f"""Generate a short, descriptive caption for a sample image that has been augmented using {augmentation_type}. 
    The caption should be 1-2 sentences describing what might be visible in the image."""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100
    )
    return response.choices[0].message.content.strip()

# Generate augmented image dataset
print("Generating augmented image dataset...\n")
image_data = []

for image_path in sample_images:
    # Original image
    image_data.append({
        'image_path': image_path,
        'caption': generate_image_caption(image_path, 'original'),
        'augmentation_type': 'original'
    })
    
    # Image augmentations
    try:
        # Rotated image
        rotated_img = rotate_image(image_path)
        image_data.append({
            'image_path': rotated_img,
            'caption': generate_image_caption(rotated_img, 'rotation'),
            'augmentation_type': 'rotation'
        })
        
        # Flipped image
        flipped_img = flip_image(image_path)
        image_data.append({
            'image_path': flipped_img,
            'caption': generate_image_caption(flipped_img, 'horizontal flip'),
            'augmentation_type': 'flip'
        })
        
        # Noisy image
        noisy_img = add_gaussian_noise_image(image_path)
        image_data.append({
            'image_path': noisy_img,
            'caption': generate_image_caption(noisy_img, 'Gaussian noise'),
            'augmentation_type': 'noise'
        })
    except Exception as e:
        print(f"Image augmentation error: {e}")

# Create dataset and save
image_dataset = pd.DataFrame(image_data)
image_dataset.to_csv('image_augmented_dataset.csv', index=False)

print(f"Generated {len(image_dataset)} image samples")
print(f"Saved to image_augmented_dataset.csv\n")
print(f"Augmentation type distribution:")
print(image_dataset['augmentation_type'].value_counts())
print()
image_dataset




Creating sample image files...

Generating augmented image dataset...

Generated 12 image samples
Saved to image_augmented_dataset.csv

Augmentation type distribution:
augmentation_type
original    3
rotation    3
flip        3
noise       3
Name: count, dtype: int64



Unnamed: 0,image_path,caption,augmentation_type
0,sample_images/image1.jpg,A serene landscape of a mountain lake reflecti...,original
1,augmented_images/image1_rotated.jpg,A serene landscape with a clear blue sky and l...,rotation
2,augmented_images/image1_flipped.jpg,A mirrored image of a serene lake reflecting t...,flip
3,augmented_images/image1_noisy.jpg,"An abstract, distorted image with a grainy tex...",noise
4,sample_images/image2.jpg,A serene landscape transformed into a surreal ...,original
5,augmented_images/image2_rotated.jpg,The image shows a stunning sunset over the oce...,rotation
6,augmented_images/image2_flipped.jpg,A mirrored reflection of a beautiful sunset ov...,flip
7,augmented_images/image2_noisy.jpg,An abstract and grainy image with distorted sh...,noise
8,sample_images/image3.jpg,"A serene landscape transformed with vibrant, s...",original
9,augmented_images/image3_rotated.jpg,A stunning aerial view of a city skyline at su...,rotation
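
The dataset above records each augmented file and its augmentation type, but not which source image it came from or which parameters were used. The sketch below shows one way to capture that lineage. It also builds output paths with `pathlib` so that extensions other than `.jpg` are handled. The helper names, the `params` field, and the file `scan.png` are hypothetical.

```python
from pathlib import Path


def augmented_path(image_path: str, suffix: str,
                   out_dir: str = "augmented_images") -> str:
    """Build out_dir/<stem>_<suffix><ext>, keeping the source extension."""
    src = Path(image_path)
    return (Path(out_dir) / f"{src.stem}_{suffix}{src.suffix}").as_posix()


def manifest_row(image_path: str, aug_type: str, **params) -> dict:
    """Describe one augmented file: where it came from and how it was made."""
    return {
        "source_image": image_path,
        "image_path": augmented_path(image_path, aug_type),
        "augmentation_type": aug_type,
        "params": params,  # e.g. {"angle": 45} for a rotation
    }


rows = [
    manifest_row("sample_images/image1.jpg", "rotated", angle=45),
    manifest_row("sample_images/scan.png", "noisy", sigma=25),
]
for row in rows:
    print(row["source_image"], "->", row["image_path"], row["params"])
```

Keeping `source_image` in every row makes it possible to hold out all variants of one image together when splitting into training and evaluation sets.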
