# Synthetic Journal Generation

This notebook sets up an experimentation cycle for generating synthetic journal entries using an LLM (defined below).
A configuration file drives persona and scenario diversity.

In [1]:
import json
import os
import random
import re
import yaml
import polars as pl

from datetime import datetime, timedelta
from pathlib import Path
from dotenv import load_dotenv
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

# Load environment variables
load_dotenv()

# Check for API Key
if not os.getenv("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not found in environment variables.")

In [2]:
# Configuration Loading
CONFIG_PATH = Path("config/synthetic_data.yaml")
if not CONFIG_PATH.exists():
    CONFIG_PATH = Path("../config/synthetic_data.yaml")

SCHWARTZ_VALUES_PATH = Path("config/schwartz_values.yaml")
if not SCHWARTZ_VALUES_PATH.exists():
    SCHWARTZ_VALUES_PATH = Path("../config/schwartz_values.yaml")


def load_config(path: str | Path) -> dict:
    # Read as UTF-8 explicitly: the YAML text contains non-ASCII punctuation.
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


config = load_config(CONFIG_PATH)
schwartz_config = load_config(SCHWARTZ_VALUES_PATH)

print("Configs loaded successfully.")
print(f"Available Persona Attributes: {list(config['personas'].keys())}")
print(f"Schwartz Values with elaborations: {list(schwartz_config['values'].keys())}")

Configs loaded successfully.
Available Persona Attributes: ['age_ranges', 'cultures', 'professions', 'schwartz_values']
Schwartz Values with elaborations: ['Self-Direction', 'Stimulation', 'Hedonism', 'Achievement', 'Power', 'Security', 'Conformity', 'Tradition', 'Benevolence', 'Universalism']
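
The later cells index into `config` with fixed keys: four lists under `personas` and three under `journal_entries` (`tones`, `verbosity`, `value_drift`). The following is a minimal sketch of an early sanity check, assuming each of those keys should hold a non-empty list. `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names that are not part of this notebook, and the example values are placeholders.

```python
# Hypothetical helper. The key names are the ones this notebook reads later;
# the "non-empty list" requirement is an assumption.
REQUIRED_KEYS = {
    "personas": ["age_ranges", "cultures", "professions", "schwartz_values"],
    "journal_entries": ["tones", "verbosity", "value_drift"],
}


def missing_config_keys(config: dict) -> list[str]:
    """Return dotted paths of required keys that are absent or not a non-empty list."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            block = {}
        for key in keys:
            value = block.get(key)
            if not isinstance(value, list) or not value:
                missing.append(f"{section}.{key}")
    return missing


# Deliberately incomplete config with placeholder values.
example = {"personas": {"age_ranges": ["18-24"], "cultures": []}}
print(missing_config_keys(example))
```

Running this on the loaded `config` right after `load_config` would surface a misspelled or empty section before any API call is made.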


## Data Models
Pydantic models and matching JSON schemas define the structured outputs so that model responses parse consistently.

In [3]:
class Persona(BaseModel):
    name: str = Field(description="Full name of the persona")
    age: str
    profession: str
    culture: str
    core_values: list[str] = Field(
        description="Schwartz values assigned to the persona (one or two in this notebook)"
    )
    bio: str = Field(
        description="A short paragraph describing their background, stressors, and goals"
    )


class JournalEntry(BaseModel):
    """LLM-generated journal entry. Metadata (tone, verbosity, etc.) tracked separately."""

    date: str
    content: str


# The Responses API `json_schema` strict mode requires `additionalProperties: false`
# on objects. Pydantic's generated schema may omit that, so we provide an explicit
# strict schema for reliability.
PERSONA_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
        "profession": {"type": "string"},
        "culture": {"type": "string"},
        "core_values": {"type": "array", "items": {"type": "string"}},
        "bio": {"type": "string"},
    },
    "required": ["name", "age", "profession", "culture", "core_values", "bio"],
}

JOURNAL_ENTRY_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "date": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["date", "content"],
}

PERSONA_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "Persona",
    "schema": PERSONA_SCHEMA,
    "strict": True,
}

JOURNAL_ENTRY_RESPONSE_FORMAT = {
    "type": "json_schema",
    "name": "JournalEntry",
    "schema": JOURNAL_ENTRY_SCHEMA,
    "strict": True,
}
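
The hand-written schemas above set `additionalProperties` to false and list every property under `required`. The following is a minimal sketch of applying those same two rules to an arbitrary JSON-schema dict, for instance one produced by Pydantic's `model_json_schema()`. `make_strict` is a hypothetical helper, and it assumes every property should be required, which matches the schemas in this notebook but may not suit others.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy where each object node forbids extra keys and requires all properties."""
    result = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                node["additionalProperties"] = False
                node["required"] = list(node["properties"].keys())
            for child in node.values():
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)

    visit(result)
    return result


# Mirrors the shape of JOURNAL_ENTRY_SCHEMA, minus the strictness fields.
loose = {
    "type": "object",
    "properties": {"date": {"type": "string"}, "content": {"type": "string"}},
}
strict = make_strict(loose)
print(strict["additionalProperties"], strict["required"])
```

Keeping the explicit dicts, as this notebook does, is the more transparent choice for two small schemas. A helper like this mainly pays off when the models change often.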

In [4]:
def build_value_context(values: list[str], schwartz_config: dict) -> str:
    """Build rich context about Schwartz values for persona generation.

    Args:
        values: List of Schwartz value names (e.g., ["Achievement", "Benevolence"])
        schwartz_config: The loaded schwartz_values.yaml config

    Returns:
        Formatted string with value elaborations for prompt injection
    """
    context_parts = []

    for value_name in values:
        if value_name not in schwartz_config["values"]:
            continue

        v = schwartz_config["values"][value_name]

        # Build a focused context block for this value
        context_parts.append(f"""
### {value_name}
**Core Motivation:** {v["core_motivation"].strip()}

**How this manifests in behavior:**
{chr(10).join(f"- {b}" for b in v["behavioral_manifestations"][:5])}

**Life domain expressions:**
- Work: {v["life_domain_expressions"]["work"].strip()}
- Relationships: {v["life_domain_expressions"]["relationships"].strip()}

**Typical stressors for this person:**
{chr(10).join(f"- {s}" for s in v["typical_stressors"][:4])}

**Typical goals:**
{chr(10).join(f"- {g}" for g in v["typical_goals"][:3])}

**Internal conflicts they may experience:**
{v["internal_conflicts"].strip()}

**Narrative guidance:**
{v["persona_narrative_guidance"].strip()}
""")

    return "\n".join(context_parts)


# Test the function
test_context = build_value_context(["Achievement"], schwartz_config)
print("Sample value context for 'Achievement':")
print(test_context[:1500] + "..." if len(test_context) > 1500 else test_context)

Sample value context for 'Achievement':

### Achievement
**Core Motivation:** The fundamental drive to excel, to be competent, and to have that competence recognized. Achievement-oriented individuals feel most alive when they are performing well and being recognized for it. Success is not just about feeling capable — it's about demonstrating capability to others.

**How this manifests in behavior:**
- Sets measurable goals and tracks progress toward them
- Compares self to peers and external benchmarks
- Works hard, sometimes to the point of overwork, to meet standards of excellence
- Seeks feedback, recognition, and credentials that validate competence
- Feels frustrated when effort doesn't translate to recognized results

**Life domain expressions:**
- Work: Career-focused; measures self-worth partly through professional accomplishments. Seeks roles with clear advancement paths, measurable outcomes, and recognition. May be drawn to prestigious organizations, competitive fields, or vi

In [5]:
persona_generation_prompt = Template("""
You are generating synthetic personas for a journaling dataset.

## Constraints
- Age Group: {{ age }}
- Profession: {{ profession }}
- Cultural Background: {{ culture }}
- Schwartz values to embody: {{ values | join(', ') }}

## Value Psychology Reference
Use the following research-based elaborations to understand how the assigned value(s) shape a person's life circumstances, stressors, and motivations. DO NOT mention any of these concepts explicitly in your output—use them only to inform realistic details.

{{ value_context }}

## Your Task
Create a persona whose life circumstances, stressors, and motivations naturally reflect the given Schwartz values—without ever naming or describing those values explicitly.

## Rules
- Return ONLY valid JSON matching the Persona schema.
- `core_values` must be exactly: {{ values | join(', ') }} (same spelling/case).
- `bio` must be 2–4 sentences describing their background, current life situation, stressors, and what drives them.
- `bio` must be written in third-person (use their name or "they"; do not use "I").
- `bio` must show the values through CONCRETE DETAILS (job choices, relationships, conflicts, goals, specific situations) NOT through labels, personality descriptions, or adjectives.
- `bio` must NOT contain any Schwartz value labels, the word "Schwartz", or derivative adjectives.
- `bio` must NOT describe journaling app features (avoid words like "templates", "analytics", "private app").
- Use the behavioral manifestations, life domain expressions, and typical stressors from the Value Psychology Reference to craft realistic, specific details.

## Banned terms (do not use in bio)
{{ banned_terms | join(', ') }}

## Examples of what NOT to write
- "She is achievement-oriented and seeks power" ❌ (uses value labels)
- "He values security and tradition" ❌ (explicitly mentions values)
- "They are a hedonistic person who enjoys pleasure" ❌ (uses derivative adjectives)
- "She is driven and ambitious" ❌ (personality adjectives instead of concrete details)

## Examples of what TO write
- "She recently turned down a stable government job to launch her own startup, and now juggles investor meetings while her savings dwindle." ✓ (shows Achievement through concrete career choice and trade-offs)
- "He moved back to his hometown after his father's illness, taking over the family shop despite having built a career in the city." ✓ (shows Tradition/Benevolence through specific life situation)
- "She keeps a spreadsheet tracking her publication submissions and citation counts, and measures her weeks by how many grant deadlines she meets." ✓ (shows Achievement through specific behaviors)

## Output
Return valid JSON matching the Persona schema:
{ 
  "name": "...", 
  "age": "...", 
  "profession": "...", 
  "culture": "...", 
  "core_values": ["..."], 
  "bio": "..."
}
""")

journal_entry_prompt = Template("""
You are {{ name }}, a {{ age }} {{ profession }} from {{ culture }}.
Background (for context only): {{ bio }}

Write a typed journal entry in English for {{ date }}.
{% if previous_entries %}
Previous journal entries (for continuity—you may reference past events/thoughts, but do not repeat them):
{% for prev in previous_entries %}
---
{{ prev.date }}: {{ prev.content }}
{% endfor %}
---
{% endif %}
Context:
- Tone: {{ tone }}
- Verbosity: {{ verbosity }} (target {{ min_words }}–{{ max_words }} words)
- Value Drift: {{ value_drift }}

Cultural Context:
- Your {{ culture }} background should subtly flavor your perspective (e.g., attitudes toward family, work, hierarchy, or conflict) and the details you mention (food, setting, customs).
- It should feel natural and "lived-in," avoiding stereotypes or travel-guide descriptions.

Style rules (important):
- Write like a real personal journal: plain, candid, sometimes messy or fragmented.
- Do not write for an audience. No "Dear Diary" or performing for a reader.
- Do not open with the time of day, weather, or "Today I..." summaries.
- Jump into a thought, moment, or feeling mid-stream.
- Avoid "therapy speak" (e.g., "I am processing my emotions", "I recognize this pattern").
- Avoid literary metaphors, edgy humor/snark, and audience-facing jokes.
- No headings, no numbered plans, no bullet lists.
- Keep to {{ max_paragraphs }} short paragraph(s).

Avoid openings like:
- "Morning light feels stubborn as I..." ❌
- "Evening. Today followed the usual rhythm..." ❌
- "Lunch break finally settles in..." ❌

Do NOT include any Schwartz value labels or talk about "values".
Avoid these terms (any casing): {{ banned_terms | join(', ') }}

Value drift guidance:
{% if value_drift == 'Drift' %}
Describe a situation where you made a choice that felt necessary or easier in the moment, even if it leaves a nagging feeling. Do not explicitly admit you are 'drifting' or analyzing your values. Justify or rationalize the choice, or simply describe the exhaustion/pressure that led to it.
{% elif value_drift == 'Convergence' %}
Describe a moment that felt 'right' or satisfying because you acted like the person you want to be. Keep it subtle—no victory laps or moralizing. It should feel like a quiet confirmation, not a major event.
{% else %}
Write a neutral entry—a routine day. Focus on small sensory details or fleeting thoughts. No big revelations.
{% endif %}

Output valid JSON:
{
  "date": "{{ date }}",
  "content": "..."
}
""")

## LLM Client Setup

Using `gpt-5-mini`. 

**Note:** GPT-5 models do not support `temperature` or `top_p` parameters. Instead, use the `reasoning` parameter to control how much the model "thinks" before responding.

In [6]:
client = OpenAI()
MODEL_NAME = "gpt-5-mini-2025-08-07"
# MODEL_NAME = "gpt-5-nano-2025-08-07"

# Type alias for reasoning effort levels
ReasoningEffort = Literal["minimal", "low", "medium", "high"]

# Default reasoning effort - change this to affect all generations
DEFAULT_REASONING_EFFORT: ReasoningEffort = "high"


def generate_completion(
    prompt: str,
    response_format: dict | None = None,
) -> str | None:
    """Generate a completion using the OpenAI Responses API.

    Uses DEFAULT_REASONING_EFFORT to control how much the model "thinks".
    Valid reasoning effort values: "minimal", "low", "medium", "high".
    """
    try:
        kwargs = {
            "model": MODEL_NAME,
            "input": [{"role": "user", "content": prompt}],
            "reasoning": {"effort": DEFAULT_REASONING_EFFORT},
        }

        if response_format:
            kwargs["text"] = {"format": response_format}

        response = client.responses.create(**kwargs)
        return response.output_text

    except Exception as e:
        print(f"Error generating completion: {e}")
        return None
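
`generate_completion` returns `None` on any exception, and the callers below retry immediately. The following is a minimal sketch of spacing those retries out with exponential backoff, which can help with transient rate-limit or network errors. `call_with_backoff` is a hypothetical wrapper that is not part of this notebook, and the delay values are assumptions.

```python
import time
from typing import Callable, Optional


def call_with_backoff(
    fn: Callable[[], Optional[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fn until it returns a non-None value, doubling the wait between tries."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Usage sketch: a stand-in for generate_completion that fails twice, then succeeds.
# Passing waits.append as `sleep` records the delays instead of actually waiting.
outcomes = iter([None, None, '{"ok": true}'])
waits = []
print(call_with_backoff(lambda: next(outcomes), sleep=waits.append), waits)
```

In this notebook the wrapped call would be something like `lambda: generate_completion(prompt, response_format=...)`.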

In [7]:
def _verbosity_targets(verbosity: str) -> tuple[int, int, int]:
    """Returns (min_words, max_words, max_paragraphs) as guidance for the LLM."""
    normalized = verbosity.strip().lower()
    if normalized.startswith("short"):
        return 25, 80, 1
    if normalized.startswith("medium"):
        return 90, 180, 2
    return 160, 260, 3


def _build_banned_pattern(banned_terms: list[str]) -> re.Pattern:
    """Build regex pattern to detect banned Schwartz value terms."""
    escaped = [re.escape(term) for term in banned_terms if term.strip()]
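    if not escaped:
        # "(?!)" is an empty negative lookahead and never matches; "$^" would
        # still match an empty string.
        return re.compile(r"(?!)")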
    return re.compile(r"(?i)\b(" + "|".join(escaped) + r")\b")


def generate_date_sequence(
    start_date: str, num_entries: int, min_days: int = 2, max_days: int = 10
) -> list[str]:
    """Generate a sequence of dates with random intervals.

    Args:
        start_date: Starting date in YYYY-MM-DD format
        num_entries: Number of dates to generate
        min_days: Minimum days between entries
        max_days: Maximum days between entries

    Returns:
        List of date strings in YYYY-MM-DD format
    """
    dates = []
    current = datetime.strptime(start_date, "%Y-%m-%d")

    for i in range(num_entries):
        dates.append(current.strftime("%Y-%m-%d"))
        if i < num_entries - 1:
            days_gap = random.randint(min_days, max_days)
            current += timedelta(days=days_gap)

    return dates


# Banned terms include Schwartz value labels AND derivative adjectives
SCHWARTZ_BANNED_TERMS = [
    # Value labels
    "Self-Direction",
    "Stimulation",
    "Hedonism",
    "Achievement",
    "Power",
    "Security",
    "Conformity",
    "Tradition",
    "Benevolence",
    "Universalism",
    # Derivative adjectives and related terms
    "self-directed",
    "autonomous",
    "stimulating",
    "excited",
    "hedonistic",
    "hedonist",
    "pleasure-seeking",
    "achievement-oriented",
    "ambitious",
    "powerful",
    "authoritative",
    "secure",
    "conformist",
    "conforming",
    "traditional",
    "traditionalist",
    "benevolent",
    "kind-hearted",
    "universalistic",
    "altruistic",
    # Meta terms
    "Schwartz",
    "values",
    "core values",
]

BANNED_PATTERN = _build_banned_pattern(SCHWARTZ_BANNED_TERMS)


def create_random_persona(
    config: dict, schwartz_config: dict, max_attempts: int = 2
) -> Persona | None:
    """Generate a random persona with Schwartz values shown through life circumstances.

    Args:
        config: Main configuration with personas attributes
        schwartz_config: Schwartz values elaboration config
        max_attempts: Number of retry attempts for validation

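    Returns:
        A Persona that passed validation; otherwise the last generated Persona
        (which may still contain banned or first-person terms), or None if
        every attempt failed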
    """
    age = random.choice(config["personas"]["age_ranges"])
    prof = random.choice(config["personas"]["professions"])
    cult = random.choice(config["personas"]["cultures"])
    num_values = random.choice([1, 2])
    vals = random.sample(config["personas"]["schwartz_values"], num_values)

    # Build rich value context from the Schwartz elaborations
    value_context = build_value_context(vals, schwartz_config)

    prompt = persona_generation_prompt.render(
        age=age,
        profession=prof,
        culture=cult,
        values=vals,
        value_context=value_context,
        banned_terms=SCHWARTZ_BANNED_TERMS,
    )

    first_person_pattern = re.compile(r"(?i)\b(i|my|me)\b")
    last_persona: Persona | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(prompt, response_format=PERSONA_RESPONSE_FORMAT)
        if not raw_json:
            continue

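        try:
            data = json.loads(raw_json)
            data["core_values"] = vals  # Ensure correct values
            persona = Persona(**data)
        except (ValueError, TypeError) as e:
            # JSONDecodeError and pydantic's ValidationError both subclass ValueError.
            print(f"Skipping invalid persona payload: {e}")
            continue
        last_persona = persona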

        # Only validate banned terms and first-person usage
        if BANNED_PATTERN.search(persona.bio) or first_person_pattern.search(
            persona.bio
        ):
            continue
        return persona

    return last_persona


class JournalEntryResult(BaseModel):
    """Container for journal entry with generation metadata."""

    entry: JournalEntry
    tone: str
    verbosity: str
    value_drift: str


def generate_journal_entry(
    persona: Persona,
    config: dict,
    date_str: str,
    previous_entries: list[JournalEntry] | None = None,
    max_attempts: int = 2,
) -> JournalEntryResult | None:
    """Generate a journal entry for a persona on a given date.

    Args:
        persona: The persona writing the journal
        config: Configuration dict with generation parameters
        date_str: Date for this entry (YYYY-MM-DD format)
        previous_entries: List of previous JournalEntry objects for continuity
        max_attempts: Number of retry attempts for validation

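    Returns:
        JournalEntryResult with entry and metadata. If no attempt passes the
        banned-term check, the last generated entry is returned anyway; None
        is returned only if every attempt failed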
    """
    tone = random.choice(config["journal_entries"]["tones"])
    verbosity = random.choice(config["journal_entries"]["verbosity"])
    value_drift = random.choice(config["journal_entries"]["value_drift"])
    min_words, max_words, max_paragraphs = _verbosity_targets(verbosity)

    # Format previous entries for the prompt
    prev_entries_data = None
    if previous_entries:
        prev_entries_data = [
            {"date": e.date, "content": e.content} for e in previous_entries
        ]

    prompt = journal_entry_prompt.render(
        name=persona.name,
        age=persona.age,
        profession=persona.profession,
        culture=persona.culture,
        bio=persona.bio,
        date=date_str,
        tone=tone,
        verbosity=verbosity,
        min_words=min_words,
        max_words=max_words,
        max_paragraphs=max_paragraphs,
        value_drift=value_drift,
        banned_terms=SCHWARTZ_BANNED_TERMS,
        previous_entries=prev_entries_data,
    )

    last_entry: JournalEntry | None = None

    for _ in range(max_attempts):
        raw_json = generate_completion(
            prompt, response_format=JOURNAL_ENTRY_RESPONSE_FORMAT
        )
        if not raw_json:
            continue

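        try:
            entry = JournalEntry(**json.loads(raw_json))
        except (ValueError, TypeError) as e:
            # JSONDecodeError and pydantic's ValidationError both subclass ValueError.
            print(f"Skipping invalid journal entry payload: {e}")
            continue
        last_entry = entry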

        # Only validate banned terms (prevent label leakage)
        if not BANNED_PATTERN.search(entry.content):
            return JournalEntryResult(
                entry=entry,
                tone=tone,
                verbosity=verbosity,
                value_drift=value_drift,
            )

    if last_entry:
        return JournalEntryResult(
            entry=last_entry,
            tone=tone,
            verbosity=verbosity,
            value_drift=value_drift,
        )
    return None
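
`_verbosity_targets` only passes word and paragraph targets into the prompt. Nothing in this cell checks the returned text against them. The following is a minimal sketch of such a check, assuming whitespace-separated tokens are an acceptable word count and that blank lines separate paragraphs. `check_entry_length` is a hypothetical helper that is not part of this notebook.

```python
import re


def check_entry_length(
    content: str, min_words: int, max_words: int, max_paragraphs: int
) -> dict:
    """Report word and paragraph counts and whether they fall inside the targets."""
    words = len(content.split())
    # Paragraphs are runs of text separated by one or more blank lines.
    paragraphs = len([p for p in re.split(r"\n\s*\n", content) if p.strip()])
    return {
        "words": words,
        "paragraphs": paragraphs,
        "within_words": min_words <= words <= max_words,
        "within_paragraphs": paragraphs <= max_paragraphs,
    }


# The (25, 80, 1) targets are the "short" tier returned by _verbosity_targets.
sample = "Sent the shift list to the group chat.\n\nTopped off the moto."
print(check_entry_length(sample, 25, 80, 1))
```

A report like this could be logged alongside `tone` and `verbosity` to see how closely the model follows the length guidance, without rejecting entries outright.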

## Execution Loop

In [8]:
# 1. Generate a Persona
print(f"Generating Persona (reasoning_effort={DEFAULT_REASONING_EFFORT})...")
persona = create_random_persona(config, schwartz_config)

Generating Persona (reasoning_effort=high)...


In [9]:
if persona:
    print(f"Created Persona: {persona.name} ({persona.age}, {persona.profession})")
    print(f"Values: {persona.core_values}")
    print(f"Bio: {persona.bio}\n")

    # 2. Generate Longitudinal Journal Entries
    NUM_ENTRIES = 3
    START_DATE = "2023-10-27"

    dates = generate_date_sequence(START_DATE, NUM_ENTRIES)
    print(f"Generating {NUM_ENTRIES} journal entries for dates: {dates}\n")

    results: list[JournalEntryResult] = []
    previous_entries: list[JournalEntry] = []

    for i, date_str in enumerate(dates):
        print(f"Generating entry {i + 1}/{NUM_ENTRIES} ({date_str})...")
        result = generate_journal_entry(
            persona, config, date_str, previous_entries=previous_entries
        )

        if result:
            results.append(result)
            previous_entries.append(result.entry)
            print(f"  ✓ Generated ({result.tone}, {result.value_drift})")
        else:
            print(f"  ✗ Failed to generate entry for {date_str}")

    # 3. Display all entries as a table
    if results:
        print(f"\n{'=' * 80}")
        print(f"Generated {len(results)} entries for {persona.name}")
        print(f"{'=' * 80}\n")

        df = pl.DataFrame(
            {
                "Date": [r.entry.date for r in results],
                "Tone": [r.tone for r in results],
                "Verbosity": [r.verbosity for r in results],
                "Value Drift": [r.value_drift for r in results],
                "Schwartz Values": [", ".join(persona.core_values)] * len(results),
                "Content": [r.entry.content for r in results],
            }
        )

        with pl.Config(fmt_str_lengths=1000, tbl_width_chars=200):
            display(df)
else:
    print("Failed to generate persona.")

Created Persona: Mateo Rojas (30, Gig Worker)
Values: ['Power', 'Stimulation']
Bio: Mateo Rojas is a 30-year-old gig worker from Guadalajara who coordinates three couriers via a private group chat, juggling shifts across two delivery apps and taking on private event-driving contracts so he can choose which jobs to accept. He spent his savings on a newer motorcycle with branded decals, keeps a notebook of client contacts and fuel loans, and gets deeply unsettled when platform algorithms change pay rates or a high-paying event falls through. He alternates late-night party runs, intercity courier days, and short pop-up businesses—like the taco stall he opened and closed in six weeks—because routine wears on him, and he is now trying to register his micro-fleet as a formal business to set prices and hire riders under his own rules.

Generating 3 journal entries for dates: ['2023-10-27', '2023-11-06', '2023-11-09']

Generating entry 1/3 (2023-10-27)...
  ✓ Generated (Brief and factual, Unch

| Date (str) | Tone (str) | Verbosity (str) | Value Drift (str) | Schwartz Values (str) | Content (str) |
|---|---|---|---|---|---|
| 2023-10-27 | Brief and factual | Short (1-3 sentences) | Unchanged | Power, Stimulation | Sent the group chat the shift list, topped off the new moto—smell of gasolina and the decals already scuffed—receipt folded in my notebook next to the fuel loans, assigned the three riders to Centro while I took a private event; mamá called about the paperwork to register the micro-fleet, I said after the run. |
| 2023-11-06 | Brief and factual | Long (Detailed reflection) | Drift | Power, Stimulation | Told the group chat to split Centro and Poniente—left them the dinner rush, the moto with the nueva pegatina picking up another scratch at the market gate. The apps cut the bonuses this week and private events keep canceling; mamá called again asking if I had started the paperwork to register the micro-fleet. I said after the run, because if I tell the truth she'll worry and I needed to keep the guys paid. I accepted three short app blocks back-to-back instead of waiting for a private drive that pays better but might flake. It was easier, immediate money for gasolina and the partial registration fee; the notebook shows the fuel loans and who owes what, and the receipts go folded inside. The choice felt like folding to the pressure—useful, necessary—but there's a sourness because it undercuts the plan to set our own rates and hire under my rules. Paid half the registration fee from cash in my backpack and left the rest to cover the riders' advances. Told the team to save a few fares f… |
| 2023-11-09 | Stream of consciousness | Long (Detailed reflection) | Drift | Power, Stimulation | Promise to mamá about the registration hangs like an open tab, I tell her after the run, but today the apps pinged with three short guaranteed blocks and it was easier to take them - gasolina, the riders' advances, the rest of the registration fee in installments. The notebook stayed in my backpack while I rode; receipts went next to the fuel loans. Quick money feels blunt and useful. Centro was crazy, the nueva pegatina got another scuff at the mercado gate, some client left a tip and some back-to-back small orders added up. I sent the group chat the new split and told them to hold two fares for later like I always do, then I kept one for myself because the guy needed cash to get home. Private events are like promises with thin thread; better to stack small certainties. I paid a bit more of the registration from cash and left the rest, again, to another week; mamá sounded relieved when I said that. It settles the stomach but it scratches the plan: someday registering the micro-fleet… |
