In [None]:
%%capture --no-stderr
# Install requirements (the %%capture cell magic must be the first line of the cell)
%pip install --quiet -U langchain_openai

In [None]:
# Set API key
import os
import getpass


def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")


_set_env("OPENAI_API_KEY")

OPENAI_API_KEY: ··········


# Technique 1 - Tweaking existing things

One great way to get realistic synthetic data is to have at least some real data, and then engineer a prompt to tweak it.

See: https://www.kaggle.com/datasets/kevinbnisch/10k-synthetic-persuade-essays-aes/data

In [None]:
tweak_prompt_template = '''
You are a {AGE} year old German student writing an English test, but you're stuck! Luckily, your neighbour is doing well and so you take a glimpse at his sheet
and you could catch the following text:

=========
"{TEXT}"
=========

But you cannot simply copy it, you need to change it a bit so the teacher doesn't notice that you copied it,
hence you copy it with the following rules:
- Paraphrase the text just a bit
- Adhere to the style and level of the original text
- Sprinkle some errors into the text, akin to the original
- Remember your age and incorporate that into the essay so it's feasible for a {AGE} year old student who writes not in his native language!

Output only the essay
'''

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="gpt-4o-mini")
tweak_generation_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a system for generating realistic data"),
        ("human", tweak_prompt_template),
    ]
)

# Excerpt from blog post at https://gallant.dev/posts/typing-is-thinking/
essay_excerpt = """
You probably type a lot.

It’s a pretty strange thing, considered historically. Rather than interact directly, we serialize our thoughts and send or broadcast them
(often lossily and asynchronously). In the past, written correspondence was an occasional luxury, used for special purposes and communiques.
Now it is our default mode, how humans reach one another for work, family, friendship, and more.

The modern ubiquity of literacy is a societal boon and equalizer - but the accompanying commoditization of communication has had unexpected
side effects. We produce and consume - materially and ideologically - yet despite the prodigious growth of production, consumption has become
so automatic as to outpace it for most individuals.
"""

essay_generator = tweak_generation_prompt | llm | StrOutputParser()
print(essay_generator.invoke({"AGE": 12, "TEXT": essay_excerpt}))

You probably write a lot.

It’s a quite odd thing, if you think about it historically. Instead of talking to each other directly, we type out our thoughts and send or share them (often in a messy way and not at the same time). In the old days, writing letters was something special, only done for important messages and announcements. Now it’s how people usually connect, whether for work, family, friendships, and more.

The fact that so many people can read and write today is a good thing for society and helps everyone to be more equal - but with that, the way we communicate has changed in surprising ways. We make and use things - both physical items and ideas - but even though we create so much, using these things has become so routine that most people can’t keep up with it anymore.


In [None]:
print(essay_generator.invoke({"AGE": 8, "TEXT": essay_excerpt}))  # Younger

You probably write a lot.

It’s a really funny thing, if you think about it. Instead of talking to each other face-to-face, we put our ideas down and send or share them (often not perfectly and not all at once). A long time ago, writing letters was something special, used for important messages and news. Now it’s the way we talk to everyone for work, family, friends, and more.

Today, everyone can read and write, which is great for society and helps people be more equal - but there are some strange problems that come with it. We make and use things - both stuff and ideas - but even though we make a lot, using things has become so automatic that it goes faster than making for most people.


In [None]:
print(essay_generator.invoke({"AGE": 20, "TEXT": essay_excerpt}))  # Older

You probably write a lot.

It’s quite a strange phenomenon when you think about it historically. Instead of engaging directly with one another, we encode our thoughts and transmit or share them (often inexactly and out of sync). In earlier times, written communication was a rare luxury, reserved for special occasions and important messages. Now, it has become our primary method of connecting, whether for work, family, friendships, and beyond.

The widespread presence of literacy today is a societal advantage and a leveler - but the resulting commercialization of communication has led to some unexpected consequences. We generate and consume - both materially and ideologically - yet even with the enormous increase in production, consumption has become so routine that it often surpasses it for most individuals.
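
To scale this beyond three hand-written calls, one option is to build every combination of age and source text up front. The sketch below is only an illustration: `build_inputs` and the placeholder texts are hypothetical names that are not part of the notebook, and the final comment assumes the `essay_generator` chain defined above.

```python
from itertools import product


def build_inputs(ages, texts):
    """Return one prompt-input dict per (age, text) combination."""
    return [{"AGE": age, "TEXT": text} for age, text in product(ages, texts)]


# Placeholder values for illustration; substitute real essays here.
ages = [8, 12, 20]
texts = ["First real essay goes here.", "Second real essay goes here."]

inputs = build_inputs(ages, texts)
print(len(inputs))  # 3 ages x 2 texts = 6 prompt inputs

# Assumption: with the chain from the cells above, the whole grid could be
# sent in one call, e.g. essays = essay_generator.batch(inputs)
```

Each dict has the same `AGE` and `TEXT` keys that the prompt template expects, so every real essay yields one synthetic variant per age.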


# Technique 2 - Breaking existing things

Sometimes what you have is not what you want, and the LLM can help. LLMs find breaking things easier than fixing them, so working in that direction (starting from something correct and introducing defects) can be a good way to get useful data.

See: https://youtu.be/oFfVt3S51T4?si=FObUP56OIE8kptyt&t=7201 (Lex Fridman podcast #447)

In [None]:
bug_prompt_template = """
You are a system that introduces subtle bugs into Python code. When provided code,
you respond with code that is superficially very similar, and also still runs without
syntax errors or major exceptions.

However, the output of the code is changed to be not fully reliable. It can still work
in some cases, but in other cases it will be incorrect. However, the logic issue you
introduce in the code to cause this should be as minimal a text change as possible,
to make it difficult to diagnose or identify the issue.

Return *only* the changed code, no other text.

Following is the code to introduce a bug to:
----------------------
{CODE}
"""

In [None]:
bug_generation_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a system for generating realistic data"),
        ("human", bug_prompt_template),
    ]
)

bug_generator = bug_generation_prompt | llm | StrOutputParser()

quadratic_code = """
import cmath  # To handle complex numbers

def solve_quadratic(a, b, c):
    if a == 0:
        raise ValueError("Coefficient 'a' cannot be zero in a quadratic equation.")

    # Calculate the discriminant
    discriminant = b**2 - 4*a*c

    # Calculate two solutions using the quadratic formula
    root1 = (-b + cmath.sqrt(discriminant)) / (2 * a)
    root2 = (-b - cmath.sqrt(discriminant)) / (2 * a)

    return root1, root2
"""

print(bug_generator.invoke({"CODE": quadratic_code}))

```python
import cmath  # To handle complex numbers

def solve_quadratic(a, b, c):
    if a == 0:
        raise ValueError("Coefficient 'a' cannot be zero in a quadratic equation.")

    # Calculate the discriminant
    discriminant = b**2 - 4*a*c

    # Calculate two solutions using the quadratic formula
    root1 = (-b + cmath.sqrt(discriminant)) / (2 * a)
    root2 = (-b - cmath.sqrt(discriminant)) / (2 * a)

    return root1, root1
```
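
Two things stand out in this reply: the model wrapped the code in a Markdown fence despite being told to return only code, and the bug (`return root1, root1`) only changes the result when the two roots differ. The sketch below shows one possible way to handle both. The helper names are hypothetical, the original function is abbreviated, and the model reply is simulated with a string rather than a real call. Note that `exec` runs arbitrary code, so generated code should only be executed in an environment you trust or have sandboxed.

```python
import re


def strip_code_fence(text):
    """Return the body of the first Markdown code fence, or the text unchanged."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1) if match else text


def load_function(source, name):
    """Execute source code and return the named function it defines."""
    namespace = {}
    exec(source, namespace)  # caution: runs arbitrary code
    return namespace[name]


def differing_inputs(original, mutated, cases):
    """Return the argument tuples on which the two functions disagree."""
    return [args for args in cases if original(*args) != mutated(*args)]


ORIGINAL = '''
import cmath

def solve_quadratic(a, b, c):
    discriminant = b**2 - 4*a*c
    root1 = (-b + cmath.sqrt(discriminant)) / (2 * a)
    root2 = (-b - cmath.sqrt(discriminant)) / (2 * a)
    return root1, root2
'''

# Simulated model reply: same bug as in the output above, wrapped in a fence.
reply = "```python\n" + ORIGINAL.replace("root1, root2", "root1, root1") + "```"

good = load_function(ORIGINAL, "solve_quadratic")
bad = load_function(strip_code_fence(reply), "solve_quadratic")

cases = [(1, -3, 2), (1, 2, 1), (1, 0, 1)]
print(differing_inputs(good, bad, cases))  # [(1, -3, 2), (1, 0, 1)]
```

The case `(1, 2, 1)` has a repeated root, so the buggy version still returns the right answer there. This is the "works in some cases" behavior the prompt asks for.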


# Technique 3 - Combine with external random things

With the tweaking technique, we passed in a few values as part of the prompt engineering, but they steered the output rather than supplying its content. We can instead use randomness and domain expertise to pass in content explicitly meant for the output, compensating for LLMs that may not be sufficiently creative or random on their own.

See: https://www.kaggle.com/code/dileepjayamal/create-pii-dataset-script-8000k-gpt3-5-gen-data

In [None]:
%%capture --no-stderr
%pip install --quiet -U faker

In [None]:
from faker import Faker
f = Faker()
print(f.profile())

{'job': 'Administrator, local government', 'company': 'Snow Inc', 'ssn': '187-23-2205', 'residence': '4404 Anna Landing\nNew Johnville, GA 06035', 'current_location': (Decimal('18.368173'), Decimal('-117.708923')), 'blood_group': 'A-', 'website': ['https://reed.org/', 'http://brown-montes.com/', 'https://schroeder.com/'], 'username': 'twashington', 'name': 'James Steele DDS', 'sex': 'M', 'address': '90691 Harrison Parks\nJoseview, NJ 30406', 'mail': 'craig45@yahoo.com', 'birthdate': datetime.date(1975, 9, 11)}
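
The profile is a plain dict containing `Decimal` and `datetime.date` objects and embedded newlines, so inserting it into a prompt directly produces the Python repr shown above. If cleaner prompt text is preferred, one possible approach is to flatten it first. `format_profile` is a hypothetical helper, not part of Faker, and the sample values are copied from the output above.

```python
import datetime
from decimal import Decimal


def format_profile(profile):
    """Render a profile dict as one 'key: value' line per field."""
    lines = []
    for key, value in profile.items():
        if isinstance(value, datetime.date):
            value = value.isoformat()  # e.g. 1975-09-11
        elif isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        elif isinstance(value, str):
            value = value.replace("\n", ", ")  # keep addresses on one line
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# A subset of the profile printed above, used as sample input.
sample = {
    "name": "James Steele DDS",
    "address": "90691 Harrison Parks\nJoseview, NJ 30406",
    "current_location": (Decimal("18.368173"), Decimal("-117.708923")),
    "birthdate": datetime.date(1975, 9, 11),
}
print(format_profile(sample))
```

Holding the profile in a variable before formatting it also means the exact values sent to the model stay available for later checks.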


In [None]:
dossier_prompt_template = """
You are a system that writes creative dossiers of hypothetical people, in the
style of a report that would be given to a secret agent as background information.
The hypothetical person you should write about has the following profile:

{PROFILE}

Include and summarize information from this profile in your report as appropriate.
It's okay to make up additional information to fill in blanks, but do not change
any of the details from the profile - include them exactly as they are.

Write your report in the style of a 1950s cold war era thriller.

Return *only* the report.
"""

In [None]:
dossier_generation_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a system for generating realistic data"),
        ("human", dossier_prompt_template),
    ]
)

dossier_generator = dossier_generation_prompt | llm | StrOutputParser()

print(dossier_generator.invoke({"PROFILE": f.profile()}))

**CONFIDENTIAL REPORT ON SUBJECT: PATRICIA NUNEZ**

**AGENT ID: 739-CB**

**DATE: October 23, 2023**

**SUBJECT NAME:** Patricia Nunez  
**DATE OF BIRTH:** December 16, 1918  
**SEX:** Female  
**BLOOD GROUP:** B-  
**SOCIAL SECURITY NUMBER:** 837-80-3791  

**PROFESSIONAL AFFILIATION:**  
Patricia Nunez is currently employed as a drilling engineer with Garcia-Murphy, a company known for its ventures in oil exploration and extraction. The company has been under scrutiny for its operations in politically sensitive regions, often intersecting with international interests. Ms. Nunez’s role is pivotal in overseeing drilling operations, which may involve clandestine projects that require a high level of technical expertise and discretion.

**RESIDENCE:**  
Patricia resides at 0482 Woods Causeway Apt. 029, Tapiachester, SD 10543. This location is a modest apartment complex situated in a suburban area known for its low crime rate and tight-knit community, providing her with a cover of normalc

# What comes next?

- Iteration on prompt and code to make data more realistic
- Getting more real data to feed to it, and having it "multiply" that data
- Creating evaluation mechanisms (tools, agents, etc.) to automatically score/classify/structure the output
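
As a small illustration of the last point, the dossier prompt asks the model not to change any profile details, and that can be checked mechanically. The sketch below is a minimal example under stated assumptions: `missing_fields` is a hypothetical helper, the profile values are reconstructed from the sample report above, and only plain string fields are compared (case-insensitively), since dates and coordinates are usually reformatted by the model.

```python
def missing_fields(profile, report,
                   fields=("name", "ssn", "company", "job", "blood_group")):
    """Return the listed profile fields whose values do not appear in the report."""
    haystack = report.casefold()
    return [f for f in fields if str(profile[f]).casefold() not in haystack]


# Values reconstructed from the sample report above, for illustration only.
profile = {
    "name": "Patricia Nunez",
    "ssn": "837-80-3791",
    "company": "Garcia-Murphy",
    "job": "Drilling engineer",
    "blood_group": "B-",
}

report = (
    "SUBJECT NAME: Patricia Nunez\n"
    "BLOOD GROUP: B-\n"
    "SOCIAL SECURITY NUMBER: 837-80-3791\n"
    "Patricia Nunez is currently employed as a drilling engineer with Garcia-Murphy."
)

print(missing_fields(profile, report))  # []

# A report with an altered SSN is flagged.
print(missing_fields(profile, report.replace("837-80-3791", "837-80-3792")))  # ['ssn']
```

A check like this only catches dropped or altered values. Judging style or realism would need a different mechanism, such as a second LLM pass or human review.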