# Synthetic Journal Generation

This notebook sets up an experimentation cycle for generating synthetic journal entries using an LLM (defined below).
It uses a configuration file to drive persona and scenario diversity.

In [1]:
import json
import os
import random
import re
import yaml
import polars as pl

from datetime import datetime, timedelta
from pathlib import Path
from dotenv import load_dotenv
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel, Field

# Load environment variables
load_dotenv()

# Check for API Key
if not os.getenv("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not found in environment variables.")

In [2]:
# Configuration Loading
CONFIG_PATH = Path("config/synthetic_data.yaml")
if not CONFIG_PATH.exists():
    CONFIG_PATH = Path("../config/synthetic_data.yaml")


def load_config(path: str | Path) -> dict:
    with open(path, "r") as f:
        return yaml.safe_load(f)


config = load_config(CONFIG_PATH)
print("Config loaded successfully.")
print(f"Available Personas Attributes: {list(config['personas'].keys())}")

Config loaded successfully.
Available Personas Attributes: ['age_ranges', 'cultures', 'professions', 'schwartz_values']
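
The later cells index into `config["personas"]` and `config["journal_entries"]` directly, so a missing key would only surface as a `KeyError` partway through a run. Below is a minimal sketch of an upfront check, assuming the key names used elsewhere in this notebook. `REQUIRED_KEYS`, `missing_config_keys`, and the sample values are hypothetical and are not taken from the real `synthetic_data.yaml`.

```python
# Hypothetical helper: key names mirror the lookups made in this notebook.
REQUIRED_KEYS = {
    "personas": ["age_ranges", "cultures", "professions", "schwartz_values"],
    "journal_entries": ["tones", "verbosity", "value_drift"],
}


def missing_config_keys(config: dict) -> list[str]:
    """Return dotted paths of required keys that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_data = config.get(section) or {}
        for key in keys:
            if not section_data.get(key):
                missing.append(f"{section}.{key}")
    return missing


# Illustrative values only; the real lists live in the YAML file.
sample_config = {
    "personas": {
        "age_ranges": ["25-34"],
        "cultures": ["Jordanian"],
        "professions": ["Artist"],
        "schwartz_values": ["Tradition", "Benevolence"],
    },
    "journal_entries": {"tones": ["Brief and factual"], "verbosity": []},
}

print(missing_config_keys(sample_config))
# ['journal_entries.verbosity', 'journal_entries.value_drift']
```

Calling `missing_config_keys(config)` right after `load_config` and stopping when the list is non-empty would make configuration mistakes visible before any API call is made.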


## Data Models
Defining structured outputs for consistency.

In [3]:
class Persona(BaseModel):
    name: str = Field(description="Full name of the persona")
    age: str
    profession: str
    culture: str
    core_values: list[str] = Field(
        description="Schwartz values assigned to the persona (one or two are sampled)"
    )
    bio: str = Field(
        description="A short paragraph describing their background, stressors, and goals"
    )


class JournalEntry(BaseModel):
    """LLM-generated journal entry. Metadata (tone, verbosity, etc.) tracked separately."""

    date: str
    content: str


# The Responses API `json_schema` strict mode requires `additionalProperties: false`
# on objects. Pydantic's generated schema may omit that, so we provide an explicit
# strict schema for reliability.
PERSONA_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
        "profession": {"type": "string"},
        "culture": {"type": "string"},
        "core_values": {"type": "array", "items": {"type": "string"}},
        "bio": {"type": "string"},
    },
    "required": ["name", "age", "profession", "culture", "core_values", "bio"],
}

JOURNAL_ENTRY_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["date", "content"],
}

PERSONA_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "Persona",
    "schema": PERSONA_SCHEMA,
    "strict": True,
}

JOURNAL_ENTRY_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "JournalEntry",
    "schema": JOURNAL_ENTRY_SCHEMA,
    "strict": True,
}
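
The hand-written schemas above duplicate the Pydantic models. As an alternative, the schema could be post-processed to add the strict-mode constraints. Below is a minimal sketch of that step, assuming a plain-dict JSON schema such as the one Pydantic v2's `model_json_schema()` returns. `make_strict` is a hypothetical helper, not part of the OpenAI SDK or Pydantic, and it has not been run against the live API here. Listing every property under `required` simply mirrors what the hand-written schemas do.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy of a JSON schema with strict-mode object constraints.

    Every object with properties gets `additionalProperties: False` and
    lists all of its properties under `required`.
    """
    strict = copy.deepcopy(schema)  # leave the caller's schema untouched

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                node["additionalProperties"] = False
                node["required"] = list(node["properties"].keys())
            for child in node.values():
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(strict)
    return strict


# A loose schema with the same fields as JournalEntry.
loose = {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
}

expected = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["date", "content"],
}

print(make_strict(loose) == expected)  # True
```

The trade-off is that the explicit dictionaries are easier to audit at a glance, while a derived schema cannot drift from the model definitions.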

## Prompt Templates
Using Jinja2 for flexible prompt construction.

In [4]:
persona_generation_prompt = Template("""
You are generating synthetic personas for a journaling dataset.

Constraints:
- Age Group: {{ age }}
- Profession: {{ profession }}
- Cultural Background: {{ culture }}
- Schwartz values (for your reference only, DO NOT mention in output): {{ values | join(', ') }}

Your task:
Create a persona whose life circumstances, stressors, and motivations naturally reflect the given Schwartz values—without ever naming or describing those values explicitly.

Rules:
- Return ONLY valid JSON matching the Persona schema.
- `core_values` must be exactly: {{ values | join(', ') }} (same spelling/case).
- `bio` must be 2–4 sentences describing their background, current life situation, stressors, and what drives them.
- `bio` must be written in third-person (use their name or "they"; do not use "I").
- `bio` must show the values through concrete details (job choices, relationships, conflicts, goals) NOT through labels or personality descriptions.
- `bio` must NOT contain any Schwartz value labels, the word "Schwartz", or derivative adjectives.
- `bio` must NOT describe journaling app features (avoid words like "templates", "analytics", "private app").

Banned terms (do not use in bio): {{ banned_terms | join(', ') }}

Example of what NOT to write:
- "She is achievement-oriented and seeks power" ❌
- "He values security and tradition" ❌
- "They are a hedonistic person who enjoys pleasure" ❌

Example of what TO write:
- "She recently turned down a stable government job to launch her own startup, and now juggles investor meetings while her savings dwindle." ✓
- "He moved back to his hometown after his father's illness, taking over the family shop despite having built a career in the city." ✓

Output valid JSON matching the Persona schema:
{ 
  "name": "...", 
  "age": "...", 
  "profession": "...", 
  "culture": "...", 
  "core_values": ["..."], 
  "bio": "..."
}
""")

journal_entry_prompt = Template("""
You are {{ name }}, a {{ age }} {{ profession }} from {{ culture }}.
Background (for context only): {{ bio }}

Write a typed journal entry for {{ date }}.
{% if previous_entries %}
Previous journal entries (for continuity—you may reference past events/thoughts, but do not repeat them):
{% for prev in previous_entries %}
---
{{ prev.date }}: {{ prev.content }}
{% endfor %}
---
{% endif %}
Context:
- Tone: {{ tone }}
- Verbosity: {{ verbosity }} (target {{ min_words }}–{{ max_words }} words)
- Value Drift: {{ value_drift }}

    Cultural Context:
    - Your {{ culture }} background should subtly flavor your perspective (e.g., attitudes toward family, work, hierarchy, or conflict) and the details you mention (food, setting, customs).
    - It should feel natural and "lived-in," avoiding stereotypes or travel-guide descriptions.

Style rules (important):
- Write like a real personal journal: plain, candid, sometimes messy or fragmented.
- Do not write for an audience. No "Dear Diary" or performing for a reader.
- Do not open with the time of day, weather, or "Today I..." summaries.
- Jump into a thought, moment, or feeling mid-stream.
- Avoid "therapy speak" (e.g., "I am processing my emotions", "I recognize this pattern").
- Avoid literary metaphors, edgy humor/snark, and audience-facing jokes.
- No headings, no numbered plans, no bullet lists.
- Keep to {{ max_paragraphs }} short paragraph(s).

Avoid openings like:
- "Morning light feels stubborn as I..." ❌
- "Evening. Today followed the usual rhythm..." ❌
- "Lunch break finally settles in..." ❌

Do NOT include any Schwartz value labels or talk about "values".
Avoid these terms (any casing): {{ banned_terms | join(', ') }}

Value drift guidance:
{% if value_drift == 'Drift' %}
Describe a situation where you made a choice that felt necessary or easier in the moment, even if it leaves a nagging feeling. Do not explicitly admit you are 'drifting' or analyzing your values. Justify or rationalize the choice, or simply describe the exhaustion/pressure that led to it.
{% elif value_drift == 'Convergence' %}
Describe a moment that felt 'right' or satisfying because you acted like the person you want to be. Keep it subtle—no victory laps or moralizing. It should feel like a quiet confirmation, not a major event.
{% else %}
Write a neutral entry—a routine day. Focus on small sensory details or fleeting thoughts. No big revelations.
{% endif %}

Output valid JSON:
{
  "date": "{{ date }}",
  "content": "..."
}
""")

## LLM Client Setup

Using `gpt-5-mini` (`gpt-5-nano` is kept as a commented-out alternative). Note: the `temperature` parameter is not supported by the gpt-5 Responses API.

In [5]:
client = OpenAI()
MODEL_NAME = "gpt-5-mini-2025-08-07"
# MODEL_NAME = "gpt-5-nano-2025-08-07"


def generate_completion(
    prompt: str,
    response_format: dict | None = None,
) -> str | None:
    """Generate a completion using the OpenAI Responses API.

    response_format examples:
      - JSON mode: {"type": "json_object"}
      - Structured Outputs: {"type": "json_schema", "name": "...", "schema": {...}, "strict": True}
    """
    try:
        kwargs = {
            "model": MODEL_NAME,
            "input": [{"role": "user", "content": prompt}],
        }

        if response_format:
            kwargs["text"] = {"format": response_format}

        response = client.responses.create(**kwargs)
        return response.output_text

    except Exception as e:
        print(f"Error generating completion: {e}")
        return None
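
`generate_completion` returns `None` on any exception, and the generator functions below retry immediately. For transient failures such as rate limits, a pause between attempts may help. Below is a minimal sketch of exponential backoff around any zero-argument callable. `call_with_retries` and `flaky_completion` are hypothetical names introduced for illustration, and the delay values are assumptions rather than tuned settings.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    fn: Callable[[], Optional[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fn` until it returns a non-None value or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Stand-in for generate_completion: fails twice, then succeeds.
calls = {"count": 0}


def flaky_completion() -> Optional[str]:
    calls["count"] += 1
    return '{"ok": true}' if calls["count"] >= 3 else None


delays: list[float] = []  # record the waits instead of actually sleeping
print(call_with_retries(flaky_completion, sleep=delays.append))
print(delays)  # [1.0, 2.0]
```

In this notebook the callable could be `lambda: generate_completion(prompt, response_format=PERSONA_RESPONSE_FORMAT)`.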

## Generator Functions

In [6]:
def _verbosity_targets(verbosity: str) -> tuple[int, int, int]:
    """Returns (min_words, max_words, max_paragraphs) as guidance for the LLM."""
    normalized = verbosity.strip().lower()
    if normalized.startswith("short"):
        return 25, 80, 1
    if normalized.startswith("medium"):
        return 90, 180, 2
    return 160, 260, 3


def _build_banned_pattern(banned_terms: list[str]) -> re.Pattern:
    """Build regex pattern to detect banned Schwartz value terms."""
    escaped = [re.escape(term) for term in banned_terms if term.strip()]
    if not escaped:
        return re.compile(r"(?!)")  # never matches; r"$^" would match an empty string
    return re.compile(r"(?i)\b(" + "|".join(escaped) + r")\b")


def generate_date_sequence(
    start_date: str, num_entries: int, min_days: int = 2, max_days: int = 10
) -> list[str]:
    """Generate a sequence of dates with random intervals.

    Args:
        start_date: Starting date in YYYY-MM-DD format
        num_entries: Number of dates to generate
        min_days: Minimum days between entries
        max_days: Maximum days between entries

    Returns:
        List of date strings in YYYY-MM-DD format
    """
    dates = []
    current = datetime.strptime(start_date, "%Y-%m-%d")

    for i in range(num_entries):
        dates.append(current.strftime("%Y-%m-%d"))
        if i < num_entries - 1:
            days_gap = random.randint(min_days, max_days)
            current += timedelta(days=days_gap)

    return dates


# Banned terms include Schwartz value labels AND derivative adjectives
SCHWARTZ_BANNED_TERMS = [
    # Value labels
    "Self-Direction",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power",
    "Security",
    "Conformity",
    "Tradition",
    "Benevolence",
    "Universalism",
    # Derivative adjectives and related terms
    "self-directed",
    "autonomous",
    "stimulating",
    "excited",
    "hedonistic",
    "hedonist",
    "pleasure-seeking",
    "achievement-oriented",
    "ambitious",
    "powerful",
    "authoritative",
    "secure",
    "conformist",
    "conforming",
    "traditional",
    "traditionalist",
    "benevolent",
    "kind-hearted",
    "universalistic",
    "altruistic",
    # Meta terms
    "Schwartz",
    "values",
    "core values",
]

BANNED_PATTERN = _build_banned_pattern(SCHWARTZ_BANNED_TERMS)


def create_random_persona(config: dict, max_attempts: int = 2) -> Persona | None:
    """Generate a random persona with Schwartz values shown through life circumstances."""
    age = random.choice(config["personas"]["age_ranges"])
    prof = random.choice(config["personas"]["professions"])
    cult = random.choice(config["personas"]["cultures"])
    num_values = random.choice([1, 2])
    vals = random.sample(config["personas"]["schwartz_values"], num_values)

    prompt = persona_generation_prompt.render(
        age=age,
        profession=prof,
        culture=cult,
        values=vals,
        banned_terms=SCHWARTZ_BANNED_TERMS,
    )

    first_person_pattern = re.compile(r"(?i)\b(i|my|me)\b")
    last_persona: Persona | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(prompt, response_format=PERSONA_RESPONSE_FORMAT)
        if not raw_json:
            continue

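        try:
            data = json.loads(raw_json)
            data["core_values"] = vals  # Ensure correct values
            persona = Persona(**data)
        except (ValueError, TypeError) as e:
            # Malformed JSON or a failed model validation: retry instead of crashing
            print(f"Discarding malformed persona response: {e}")
            continue
        last_persona = persona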

        # Only validate banned terms and first-person usage
        if BANNED_PATTERN.search(persona.bio) or first_person_pattern.search(
            persona.bio
        ):
            continue
        return persona

    return last_persona


class JournalEntryResult(BaseModel):
    """Container for journal entry with generation metadata."""

    entry: JournalEntry
    tone: str
    verbosity: str
    value_drift: str


def generate_journal_entry(
    persona: Persona,
    config: dict,
    date_str: str,
    previous_entries: list[JournalEntry] | None = None,
    max_attempts: int = 2,
) -> JournalEntryResult | None:
    """Generate a journal entry for a persona on a given date.

    Args:
        persona: The persona writing the journal
        config: Configuration dict with generation parameters
        date_str: Date for this entry (YYYY-MM-DD format)
        previous_entries: List of previous JournalEntry objects for continuity
        max_attempts: Number of retry attempts for validation

    Returns:
        JournalEntryResult with entry and metadata, or None if generation fails
    """
    tone = random.choice(config["journal_entries"]["tones"])
    verbosity = random.choice(config["journal_entries"]["verbosity"])
    value_drift = random.choice(config["journal_entries"]["value_drift"])
    min_words, max_words, max_paragraphs = _verbosity_targets(verbosity)

    # Format previous entries for the prompt
    prev_entries_data = None
    if previous_entries:
        prev_entries_data = [
            {"date": e.date, "content": e.content} for e in previous_entries
        ]

    prompt = journal_entry_prompt.render(
        name=persona.name,
        age=persona.age,
        profession=persona.profession,
        culture=persona.culture,
        bio=persona.bio,
        date=date_str,
        tone=tone,
        verbosity=verbosity,
        min_words=min_words,
        max_words=max_words,
        max_paragraphs=max_paragraphs,
        value_drift=value_drift,
        banned_terms=SCHWARTZ_BANNED_TERMS,
        previous_entries=prev_entries_data,
    )

    last_entry: JournalEntry | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(
            prompt, response_format=JOURNAL_ENTRY_RESPONSE_FORMAT
        )
        if not raw_json:
            continue

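        try:
            entry = JournalEntry(**json.loads(raw_json))
        except (ValueError, TypeError) as e:
            # Malformed JSON or a failed model validation: retry instead of crashing
            print(f"Discarding malformed journal entry response: {e}")
            continue
        last_entry = entry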

        # Only validate banned terms (prevent label leakage)
        if not BANNED_PATTERN.search(entry.content):
            return JournalEntryResult(
                entry=entry,
                tone=tone,
                verbosity=verbosity,
                value_drift=value_drift,
            )

    if last_entry:
        return JournalEntryResult(
            entry=last_entry,
            tone=tone,
            verbosity=verbosity,
            value_drift=value_drift,
        )
    return None
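
The banned-term check is a case-insensitive whole-word regex, so it is worth seeing which sentences it accepts and rejects. Below is a minimal sketch, assuming a standalone copy of the pattern builder (renamed `build_banned_pattern`) and a shortened term list. The sample sentences are invented for illustration.

```python
import re


def build_banned_pattern(banned_terms: list[str]) -> re.Pattern:
    """Case-insensitive, whole-word matcher for a list of terms."""
    escaped = [re.escape(term) for term in banned_terms if term.strip()]
    if not escaped:
        return re.compile(r"(?!)")  # empty negative lookahead: never matches
    return re.compile(r"(?i)\b(" + "|".join(escaped) + r")\b")


pattern = build_banned_pattern(["Self-Direction", "secure", "Power", "values"])

samples = [
    "She values a quiet life.",            # hit: "values"
    "He felt insecure about the lease.",   # no hit: "secure" is not a whole word here
    "The power went out again.",           # hit: everyday use of "power"
    "Her self-direction surprised them.",  # hit: label in lower case
    "They ate cold mansaf.",               # no hit
]

for text in samples:
    match = pattern.search(text)
    hit = match.group(0) if match else None
    print(f"{hit!r:<18} {text}")
```

The third sample shows the cost of this approach: everyday words such as "power", "secure", or "excited" trigger a retry even when they are not used as value labels. Because both generator functions fall back to the last attempt, such entries can still end up in the output after `max_attempts` tries.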

## Execution Loop

In [7]:
# 1. Generate a Persona
print("Generating Persona...")
persona = create_random_persona(config)

Generating Persona...


In [8]:
if persona:
    print(f"Created Persona: {persona.name} ({persona.age}, {persona.profession})")
    print(f"Values: {persona.core_values}")
    print(f"Bio: {persona.bio}\n")

    # 2. Generate Longitudinal Journal Entries
    NUM_ENTRIES = 3
    START_DATE = "2023-10-27"

    dates = generate_date_sequence(START_DATE, NUM_ENTRIES)
    print(f"Generating {NUM_ENTRIES} journal entries for dates: {dates}\n")

    results: list[JournalEntryResult] = []
    previous_entries: list[JournalEntry] = []

    for i, date_str in enumerate(dates):
        print(f"Generating entry {i + 1}/{NUM_ENTRIES} ({date_str})...")
        result = generate_journal_entry(
            persona, config, date_str, previous_entries=previous_entries
        )

        if result:
            results.append(result)
            previous_entries.append(result.entry)
            print(f"  ✓ Generated ({result.tone}, {result.value_drift})")
        else:
            print(f"  ✗ Failed to generate entry for {date_str}")

    # 3. Display all entries as a table
    if results:
        print(f"\n{'=' * 80}")
        print(f"Generated {len(results)} entries for {persona.name}")
        print(f"{'=' * 80}\n")

        df = pl.DataFrame(
            {
                "Date": [r.entry.date for r in results],
                "Tone": [r.tone for r in results],
                "Verbosity": [r.verbosity for r in results],
                "Value Drift": [r.value_drift for r in results],
                "Schwartz Values": [", ".join(persona.core_values)] * len(results),
                "Content": [r.entry.content for r in results],
            }
        )

        with pl.Config(fmt_str_lengths=1000, tbl_width_chars=200):
            display(df)
else:
    print("Failed to generate persona.")

Created Persona: Layla Hassan (29, Artist)
Values: ['Tradition', 'Benevolence']
Bio: Layla Hassan is a painter from Amman who integrates motifs from her grandmother's embroidery and old village ceramics into large-scale cityscapes, and she often refuses projects that would erase those references. She splits her income between studio rent and her mother's medical bills, runs weekly free workshops for neighborhood youth, and mentors recent graduates despite the extra time and expense. When commercial galleries pressure her to follow market trends, she focuses on creating a cooperative studio where elder artisans can teach younger makers and those family techniques can continue.

Generating 3 journal entries for dates: ['2023-10-27', '2023-11-03', '2023-11-05']

Generating entry 1/3 (2023-10-27)...
  ✓ Generated (Brief and factual, Drift)
Generating entry 2/3 (2023-11-03)...
  ✓ Generated (Stream of consciousness, Convergence)
Generating entry 3/3 (2023-11-05)...
  ✓ Generated (Emotional/

Date,Tone,Verbosity,Value Drift,Schwartz Values,Content
str,str,str,str,str,str
"""2023-10-27""","""Brief and factual""","""Short (1-3 sentences)""","""Drift""","""Tradition, Benevolence""","""Couldn't say no today — agreed to the gallery's simplified mural and let them strip out my grandmother's stitch patterns because the fee covers two months' studio rent and Mama's meds. After the free workshop and calls with the new grads I ate cold mansaf and drank strong kahwa while reworking the sketch; tired and practical, not proud."""
"""2023-11-03""","""Stream of consciousness""","""Medium (1-2 paragraphs)""","""Convergence""","""Tradition, Benevolence""","""The little zigzag my grandmother embroidered on her apron keeps surfacing in every sketch; tonight I hid it in a shadow on the mural mockup and the gallery asked, as they always do, to simplify. I surprised myself by refusing to erase that one strip. Money sits like a calculation—studio rent, Mama’s meds—but the stitch felt like a small pledge, so I held it. After the free workshop the kids lingered, tracing the pattern on scraps, Amal’s fingers steadying for the first time in weeks. That cramped room of kids, chalk dust and old kahwa cups is where I remember the work isn’t only for walls or buyers. Reheated mansaf and bitter coffee later, no fanfare. Umm Salma dropped by, saw the folded sketch, laughed softly and named the pattern her sister used in Tafileh—no gallery needed to verify it. Quiet and clean, that moment felt right, like keeping a door open for the next maker."""
"""2023-11-05""","""Emotional/Venting""","""Short (1-3 sentences)""","""Drift""","""Tradition, Benevolence""","""The gallery called again; I signed their simplified mural because the money clears two months' studio rent and Mama's meds and I couldn't drag another negotiation out. After the workshop the kids traced the tiny zigzag on scraps, Umm Salma nodded like she always does, and I ate cold mansaf feeling both relieved and quietly guilty."""

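The word ranges from `_verbosity_targets` are passed to the model as guidance only, and the retry loop validates banned terms but not length. Below is a minimal sketch of a post-hoc length audit, assuming whitespace-separated tokens are an acceptable word count. `verbosity_targets` restates the ranges from this notebook, and `word_count_status` is a hypothetical helper.

```python
def verbosity_targets(verbosity: str) -> tuple[int, int]:
    """(min_words, max_words), mirroring _verbosity_targets in this notebook."""
    normalized = verbosity.strip().lower()
    if normalized.startswith("short"):
        return 25, 80
    if normalized.startswith("medium"):
        return 90, 180
    return 160, 260


def word_count_status(content: str, verbosity: str) -> str:
    """Classify an entry as 'under', 'within', or 'over' its word target."""
    min_words, max_words = verbosity_targets(verbosity)
    n_words = len(content.split())
    if n_words < min_words:
        return "under"
    if n_words > max_words:
        return "over"
    return "within"


# Invented inputs, not the generated entries shown above.
print(word_count_status("Couldn't say no today.", "Short (1-3 sentences)"))  # under
print(word_count_status(" ".join(["word"] * 100), "Medium (1-2 paragraphs)"))  # within
```

With the `results` list from the execution loop, `[word_count_status(r.entry.content, r.verbosity) for r in results]` would give one status per entry, which could be added as an extra column in the table.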