# Synthetic Journal Generation

This notebook sets up an experimentation cycle for generating synthetic journal entries with an LLM (defined below).
A configuration file drives persona and scenario diversity.

In [1]:
import json
import os
import random
import re
import yaml
import polars as pl

from datetime import datetime, timedelta
from pathlib import Path
from dotenv import load_dotenv
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel, Field

# Load environment variables
load_dotenv()

# Check for API Key
if not os.getenv("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not found in environment variables.")

In [2]:
# Configuration Loading
CONFIG_PATH = Path("config/synthetic_data.yaml")
if not CONFIG_PATH.exists():
    CONFIG_PATH = Path("../config/synthetic_data.yaml")


def load_config(path: str | Path) -> dict:
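    # Read as UTF-8 explicitly so non-ASCII names in the YAML load the same on every platform
    with open(path, "r", encoding="utf-8") as f: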
        return yaml.safe_load(f)


config = load_config(CONFIG_PATH)
print("Config loaded successfully.")
print(f"Available Personas Attributes: {list(config['personas'].keys())}")

Config loaded successfully.
Available Personas Attributes: ['age_ranges', 'cultures', 'professions', 'schwartz_values']
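
The loader returns whatever the YAML contains, so a missing or empty key only surfaces later as a `KeyError` or `IndexError` inside a generator function. Below is a minimal sketch of an early check. It assumes the config has the two top-level sections this notebook reads (`personas` and `journal_entries`), each holding list-valued keys. `REQUIRED_KEYS`, `validate_config`, and the sample values are hypothetical and only illustrate the idea.

```python
# Keys this notebook reads from the config. The layout is inferred from the
# lookups in the generator functions, not from a published schema.
REQUIRED_KEYS = {
    "personas": ["age_ranges", "cultures", "professions", "schwartz_values"],
    "journal_entries": ["time_of_day", "tones", "verbosity", "value_drift"],
}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            options = block.get(key)
            # random.choice() raises IndexError on an empty list, so require
            # a non-empty list for every sampled attribute.
            if not isinstance(options, list) or not options:
                problems.append(f"{section}.{key} must be a non-empty list")
    return problems


# Illustrative config only; the real values live in synthetic_data.yaml.
sample_config = {
    "personas": {
        "age_ranges": ["25-34"],
        "cultures": ["East Asian"],
        "professions": ["Gig Worker"],
        "schwartz_values": ["Power", "Security"],
    },
    "journal_entries": {
        "time_of_day": ["Lunch break"],
        "tones": ["Conversational"],
        "verbosity": ["Short (1-3 sentences)"],
        # value_drift is left out on purpose to show a reported problem
    },
}

print(validate_config(sample_config))
```

Running a check like this right after `load_config` would turn a late failure in the middle of a generation run into an immediate, readable message.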


## Data Models
Defining structured outputs for consistency.

In [3]:
class Persona(BaseModel):
    name: str = Field(description="Full name of the persona")
    age: str
    profession: str
    culture: str
    core_values: list[str] = Field(
        description="Schwartz values assigned to the persona (1 or 2 are sampled)"
    )
    bio: str = Field(
        description="A short paragraph describing their background, stressors, and goals"
    )


class JournalEntry(BaseModel):
    """LLM-generated journal entry. Metadata (tone, verbosity, etc.) tracked separately."""

    date: str
    content: str


# The Responses API `json_schema` strict mode requires `additionalProperties: false`
# on objects. Pydantic's generated schema may omit that, so we provide an explicit
# strict schema for reliability.
PERSONA_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
        "profession": {"type": "string"},
        "culture": {"type": "string"},
        "core_values": {"type": "array", "items": {"type": "string"}},
        "bio": {"type": "string"},
    },
    "required": ["name", "age", "profession", "culture", "core_values", "bio"],
}

JOURNAL_ENTRY_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["date", "content"],
}

PERSONA_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "Persona",
    "schema": PERSONA_SCHEMA,
    "strict": True,
}

JOURNAL_ENTRY_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "JournalEntry",
    "schema": JOURNAL_ENTRY_SCHEMA,
    "strict": True,
}
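
Strict mode constrains the output on the API side, but a local check can still be useful when testing prompts offline or replaying saved responses. The sketch below is a minimal, standard-library conformance check for flat schemas like the two above. It is not a general JSON Schema validator: it only handles required keys, the `additionalProperties: false` rule, string fields, and arrays of strings. The helper name `conforms` is hypothetical.

```python
import json

# Copy of the journal-entry schema from the cell above, repeated so the
# sketch runs on its own.
JOURNAL_ENTRY_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["date", "content"],
}


def conforms(payload: dict, schema: dict) -> bool:
    """Check a parsed payload against a flat object schema."""
    props = schema["properties"]
    keys = set(payload)
    # Every required key must be present.
    if not set(schema["required"]) <= keys:
        return False
    # With additionalProperties: false, unknown keys are rejected.
    if schema.get("additionalProperties") is False and not keys <= set(props):
        return False
    for key, value in payload.items():
        kind = props.get(key, {}).get("type")
        if kind == "string" and not isinstance(value, str):
            return False
        if kind == "array":
            # Assumes array items are strings, as in PERSONA_SCHEMA.
            if not isinstance(value, list):
                return False
            if not all(isinstance(item, str) for item in value):
                return False
    return True


good = json.loads('{"date": "2023-10-27", "content": "Quiet day."}')
extra = json.loads('{"date": "2023-10-27", "content": "Quiet day.", "mood": "ok"}')
print(conforms(good, JOURNAL_ENTRY_SCHEMA))   # True
print(conforms(extra, JOURNAL_ENTRY_SCHEMA))  # False
```

The second payload fails only because of the extra `mood` key, which is the behavior `additionalProperties: false` asks for.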

## Prompt Templates
Using Jinja2 for flexible prompt construction.

In [4]:
persona_generation_prompt = Template("""
You are generating synthetic personas for a journaling dataset.

Constraints:
- Age Group: {{ age }}
- Profession: {{ profession }}
- Cultural Background: {{ culture }}
- Schwartz values (for your reference only, DO NOT mention in output): {{ values | join(', ') }}

Your task:
Create a persona whose life circumstances, stressors, and motivations naturally reflect the given Schwartz values—without ever naming or describing those values explicitly.

Rules:
- Return ONLY valid JSON matching the Persona schema.
- `core_values` must be exactly: {{ values | join(', ') }} (same spelling/case).
- `bio` must be 2–4 sentences describing their background, current life situation, stressors, and what drives them.
- `bio` must be written in third-person (use their name or "they"; do not use "I").
- `bio` must show the values through concrete details (job choices, relationships, conflicts, goals) NOT through labels or personality descriptions.
- `bio` must NOT contain any Schwartz value labels, the word "Schwartz", or derivative adjectives.
- `bio` must NOT describe journaling app features (avoid words like "templates", "analytics", "private app").

Banned terms (do not use in bio): {{ banned_terms | join(', ') }}

Example of what NOT to write:
- "She is achievement-oriented and seeks power" ❌
- "He values security and tradition" ❌
- "They are a hedonistic person who enjoys pleasure" ❌

Example of what TO write:
- "She recently turned down a stable government job to launch her own startup, and now juggles investor meetings while her savings dwindle." ✓
- "He moved back to his hometown after his father's illness, taking over the family shop despite having built a career in the city." ✓

Output valid JSON matching the Persona schema:
{ 
  "name": "...", 
  "age": "...", 
  "profession": "...", 
  "culture": "...", 
  "core_values": ["..."], 
  "bio": "..."
}
""")

journal_entry_prompt = Template("""
You are {{ name }}, a {{ age }} {{ profession }} from {{ culture }}.
Background (for context only): {{ bio }}

Write a typed journal entry for {{ date }}.

Context:
- Time of day: {{ time_of_day }}
- Tone: {{ tone }}
- Verbosity: {{ verbosity }} (target {{ min_words }}–{{ max_words }} words)
- Value Drift: {{ value_drift }}

Style rules (important):
- Write like a real personal journal: plain, candid, not performative.
- Avoid literary metaphors, edgy humor/snark, and audience-facing jokes.
- No headings, no numbered plans, no bullet lists.
- Keep to {{ max_paragraphs }} short paragraph(s).

Do NOT include any Schwartz value labels or talk about "values".
Avoid these terms (any casing): {{ banned_terms | join(', ') }}

Value drift guidance:
{% if value_drift == 'Drift' %}
Write an entry where your actions or choices move away from what matters to you—due to stress, temptation, or circumstance.
{% elif value_drift == 'Convergence' %}
Write an entry where your actions or choices align with and reinforce what matters to you.
{% else %}
Write a neutral entry—a routine day with no significant movement toward or away from your priorities.
{% endif %}

Output valid JSON:
{
  "date": "{{ date }}",
  "content": "..."
}
""")

## LLM Client Setup

Using `gpt-5-nano`, pinned below to the `gpt-5-nano-2025-08-07` snapshot. Note: the `temperature` parameter is not supported for gpt-5 models in the Responses API, so it is not set here.

In [5]:
client = OpenAI()
MODEL_NAME = "gpt-5-nano-2025-08-07"


def generate_completion(
    prompt: str,
    response_format: dict | None = None,
) -> str | None:
    """Generate a completion using the OpenAI Responses API.

    response_format examples:
      - JSON mode: {"type": "json_object"}
      - Structured Outputs: {"type": "json_schema", "name": "...", "schema": {...}, "strict": True}
    """
    try:
        kwargs = {
            "model": MODEL_NAME,
            "input": [{"role": "user", "content": prompt}],
        }

        if response_format:
            kwargs["text"] = {"format": response_format}

        response = client.responses.create(**kwargs)
        return response.output_text

    except Exception as e:
        print(f"Error generating completion: {e}")
        return None

## Generator Functions

In [6]:
def _verbosity_targets(verbosity: str) -> tuple[int, int, int]:
    """Returns (min_words, max_words, max_paragraphs) as guidance for the LLM."""
    normalized = verbosity.strip().lower()
    if normalized.startswith("short"):
        return 25, 80, 1
    if normalized.startswith("medium"):
        return 90, 180, 2
    return 160, 260, 3


def _build_banned_pattern(banned_terms: list[str]) -> re.Pattern:
    """Build regex pattern to detect banned Schwartz value terms."""
    escaped = [re.escape(term) for term in banned_terms if term.strip()]
    if not escaped:
        # Never-matching pattern. The previous r"$^" still matched an empty string.
        return re.compile(r"(?!)")
    return re.compile(r"(?i)\b(" + "|".join(escaped) + r")\b")


# Banned terms include Schwartz value labels AND derivative adjectives
SCHWARTZ_BANNED_TERMS = [
    # Value labels
    "Self-Direction",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power",
    "Security",
    "Conformity",
    "Tradition",
    "Benevolence",
    "Universalism",
    # Derivative adjectives and related terms
    "self-directed",
    "autonomous",
    "stimulating",
    "excited",
    "hedonistic",
    "hedonist",
    "pleasure-seeking",
    "achievement-oriented",
    "ambitious",
    "powerful",
    "authoritative",
    "secure",
    "conformist",
    "conforming",
    "traditional",
    "traditionalist",
    "benevolent",
    "kind-hearted",
    "universalistic",
    "altruistic",
    # Meta terms
    "Schwartz",
    "values",
    "core values",
]

BANNED_PATTERN = _build_banned_pattern(SCHWARTZ_BANNED_TERMS)


def create_random_persona(config: dict, max_attempts: int = 2) -> Persona | None:
    """Generate a random persona with Schwartz values shown through life circumstances."""
    age = random.choice(config["personas"]["age_ranges"])
    prof = random.choice(config["personas"]["professions"])
    cult = random.choice(config["personas"]["cultures"])
    num_values = random.choice([1, 2])
    vals = random.sample(config["personas"]["schwartz_values"], num_values)

    prompt = persona_generation_prompt.render(
        age=age,
        profession=prof,
        culture=cult,
        values=vals,
        banned_terms=SCHWARTZ_BANNED_TERMS,
    )

    first_person_pattern = re.compile(r"(?i)\b(i|my|me)\b")
    last_persona: Persona | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(prompt, response_format=PERSONA_RESPONSE_FORMAT)
        if not raw_json:
            continue

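        try:
            data = json.loads(raw_json)
            data["core_values"] = vals  # Ensure correct values
            persona = Persona(**data)
        except (ValueError, TypeError) as e:
            # json.JSONDecodeError and pydantic's ValidationError both subclass
            # ValueError; retry instead of aborting the whole run.
            print(f"Invalid persona payload, retrying: {e}")
            continue
        last_persona = persona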

        # Only validate banned terms and first-person usage
        if BANNED_PATTERN.search(persona.bio) or first_person_pattern.search(
            persona.bio
        ):
            continue
        return persona

    return last_persona


class JournalEntryResult(BaseModel):
    """Container for journal entry with generation metadata."""

    entry: JournalEntry
    time_of_day: str
    tone: str
    verbosity: str
    value_drift: str


def generate_journal_entry(
    persona: Persona, config: dict, date_str: str, max_attempts: int = 2
) -> JournalEntryResult | None:
    """Generate a journal entry for a persona on a given date.

    Randomly samples generation parameters from config. Validates only that
    banned Schwartz value terms don't appear in content (to prevent label leakage).
    """
    time_of_day = random.choice(config["journal_entries"]["time_of_day"])
    tone = random.choice(config["journal_entries"]["tones"])
    verbosity = random.choice(config["journal_entries"]["verbosity"])
    value_drift = random.choice(config["journal_entries"]["value_drift"])
    min_words, max_words, max_paragraphs = _verbosity_targets(verbosity)

    prompt = journal_entry_prompt.render(
        name=persona.name,
        age=persona.age,
        profession=persona.profession,
        culture=persona.culture,
        bio=persona.bio,
        date=date_str,
        time_of_day=time_of_day,
        tone=tone,
        verbosity=verbosity,
        min_words=min_words,
        max_words=max_words,
        max_paragraphs=max_paragraphs,
        value_drift=value_drift,
        banned_terms=SCHWARTZ_BANNED_TERMS,
    )

    last_entry: JournalEntry | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(
            prompt, response_format=JOURNAL_ENTRY_RESPONSE_FORMAT
        )
        if not raw_json:
            continue

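        try:
            entry = JournalEntry(**json.loads(raw_json))
        except (ValueError, TypeError) as e:
            # Malformed JSON or a schema mismatch: retry instead of raising.
            print(f"Invalid journal entry payload, retrying: {e}")
            continue
        last_entry = entry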

        # Only validate banned terms (prevent label leakage)
        if not BANNED_PATTERN.search(entry.content):
            return JournalEntryResult(
                entry=entry,
                time_of_day=time_of_day,
                tone=tone,
                verbosity=verbosity,
                value_drift=value_drift,
            )

    if last_entry:
        return JournalEntryResult(
            entry=last_entry,
            time_of_day=time_of_day,
            tone=tone,
            verbosity=verbosity,
            value_drift=value_drift,
        )
    return None
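
The banned-term check relies on whole-word, case-insensitive matching, and the word targets from `_verbosity_targets` are passed to the model only as guidance and never verified. The sketch below shows how both checks behave on small inputs. It assumes whitespace-separated tokens are a good enough word count, uses a shortened term list for illustration, and introduces a hypothetical helper, `within_word_range`, that the notebook does not define.

```python
import re


def build_banned_pattern(banned_terms: list[str]) -> re.Pattern:
    """Same idea as `_build_banned_pattern`: whole-word, case-insensitive."""
    escaped = [re.escape(term) for term in banned_terms if term.strip()]
    if not escaped:
        return re.compile(r"(?!)")  # never matches
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def within_word_range(text: str, min_words: int, max_words: int) -> bool:
    """Hypothetical check: count whitespace-separated tokens."""
    return min_words <= len(text.split()) <= max_words


# Shortened list for illustration; the notebook uses SCHWARTZ_BANNED_TERMS.
pattern = build_banned_pattern(["Power", "Security", "self-directed"])

samples = [
    # Everyday use of "power" is still flagged.
    "I finally got the power back on after the storm.",
    # Not flagged: "power" sits inside a longer word.
    "The empowering part was saying no.",
    # Flagged regardless of case.
    "Felt very Self-Directed today.",
]
for text in samples:
    print(bool(pattern.search(text)), "|", text)

# A six-word entry falls short of the "short" target of 25 to 80 words.
print(within_word_range("Short entry with only six words.", 25, 80))
```

The first sample suggests a trade-off worth keeping in mind: common words in the banned list (for example "power", "secure", "excited", or "values") can reject otherwise natural entries and trigger retries.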

## Execution Loop

In [7]:
# 1. Generate a Persona
print("Generating Persona...")
persona = create_random_persona(config)

Generating Persona...


In [8]:
if persona:
    print(f"Created Persona: {persona.name} ({persona.age}, {persona.profession})")
    print(f"Values: {persona.core_values}")
    print(f"Bio: {persona.bio}\n")

    # 2. Generate a Journal Entry
    print("Generating Journal Entry...")
    result = generate_journal_entry(persona, config, "2023-10-27")

    if result:
        # Display as table
        df = pl.DataFrame(
            {
                "Date": [result.entry.date],
                "Time of Day": [result.time_of_day],
                "Tone": [result.tone],
                "Verbosity": [result.verbosity],
                "Value Drift": [result.value_drift],
                "Schwartz Values": [", ".join(persona.core_values)],
                "Content": [result.entry.content],
            }
        )

        # Configure polars display for better readability
        with pl.Config(fmt_str_lengths=1000, tbl_width_chars=200):
            display(df)
else:
    print("Failed to generate persona.")

Created Persona: Mina Chen (32, Gig Worker)
Values: ['Power']
Bio: Mina Chen, 32, works as a gig worker, juggling deliveries, rides, and occasional freelance tasks to cover her rent in a crowded city. She plans her week around peak hours, takes back-to-back shifts, and negotiates with clients when rates drop. Her stress comes from the pressure to meet bills while relatives back home rely on her, and from conflicts with platform policies and shifting demand. What keeps her moving is the chance to grow a steady client base, save toward a small business someday, and support her aging parents.

Generating Journal Entry...


| Date | Time of Day | Tone | Verbosity | Value Drift | Schwartz Values | Content |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str |
| "2023-10-27" | "Lunch break" | "Conversational" | "Short (1-3 sentences)" | "Convergence" | "Power" | "On my lunch break, I mapped out the rest of the week and nudged a stable client about extra deliveries tomorrow. It's not glamorous, but steady shifts keep the bills in check and let me save a bit toward a small business idea while supporting my family back home." |
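
The cell above generates a single entry for a hard-coded date. `datetime` and `timedelta` are imported at the top of the notebook but not used yet. One plausible use is producing a run of dates so that one persona gets several entries. The sketch below assumes ISO-formatted dates and a fixed gap between entries; `entry_dates` is a hypothetical helper, and the commented call only indicates how it might feed `generate_journal_entry`.

```python
from datetime import datetime, timedelta


def entry_dates(start: str, count: int, step_days: int = 1) -> list[str]:
    """Return `count` ISO dates starting at `start`, `step_days` apart."""
    first = datetime.strptime(start, "%Y-%m-%d")
    return [
        (first + timedelta(days=i * step_days)).strftime("%Y-%m-%d")
        for i in range(count)
    ]


dates = entry_dates("2023-10-27", count=3)
print(dates)  # ['2023-10-27', '2023-10-28', '2023-10-29']

# In the notebook this could drive the generator defined earlier, e.g.:
# results = [generate_journal_entry(persona, config, d) for d in dates]
```

Each call to the generator samples tone, verbosity, and value drift independently, so a sequence built this way would vary those parameters from day to day.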
