# Dataset Generation: Narrative Contrast Pairs

This notebook generates 100 story pairs that contrast "determined/gritty" characters with "drifting/lazy" ones. Each pair shares one setup and differs only in its continuation; the pairs are later used to extract motivation vectors.

**Run in Google Colab for GPU access and API rate limits.**

## Setup: Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/motivation_vectors.git
%cd motivation_vectors

In [None]:
# Install dependencies
!pip install openai anthropic transformers

## Import Utilities

In [None]:
import sys
import json
from typing import List, Dict

# Add src to path
sys.path.insert(0, '/content/motivation_vectors/src')

from motivation_vectors.dataset_generation import (
    NarrativePair,
    generate_scenario_prompts,
    save_narrative_pairs,
    validate_dataset_balance
)
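
The imported `NarrativePair` is defined in the repository and is not shown in this notebook. Judging only from the keyword arguments passed to it in the generation cell, it might look roughly like the dataclass below. The class name `NarrativePairSketch` and the two helper methods are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, asdict


@dataclass
class NarrativePairSketch:
    """Hypothetical stand-in for NarrativePair (field names taken from this notebook)."""
    scenario: str
    determined_context: str
    determined_continuation: str
    drifting_context: str
    drifting_continuation: str
    domain: str

    def determined_text(self) -> str:
        # Full "determined" story: shared setup followed by the persistent ending.
        return f"{self.determined_context} {self.determined_continuation}"

    def drifting_text(self) -> str:
        # Full "drifting" story: same setup followed by the giving-up ending.
        return f"{self.drifting_context} {self.drifting_continuation}"


example = NarrativePairSketch(
    scenario="He decided to",
    determined_context="He decided to",
    determined_continuation="keep debugging.",
    drifting_context="He decided to",
    drifting_continuation="take a nap.",
    domain="example",
)
print(example.determined_text())
print(asdict(example)["domain"])
```

Keeping both contexts identical means the two stories differ only in the continuation, which is the contrast the pairs are meant to isolate.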

## Configure API Client

Choose either OpenAI or Anthropic for generation. The generation cell below is written for OpenAI, so using Anthropic requires adapting that cell.

In [None]:
# Option 1: OpenAI
import os
import openai
# Prefer the OPENAI_API_KEY environment variable; fall back to a pasted key
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")

# Option 2: Anthropic
# import anthropic
# client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
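
To avoid committing a key into the notebook, one option is to read it from the environment and prompt only when it is missing. This is a minimal sketch; the helper name `load_api_key` is hypothetical, and the default variable name `OPENAI_API_KEY` is an assumption about how the key is stored.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it is not echoed into the cell output.
        key = getpass(f"Enter {env_var}: ")
        # Cache it for the rest of the session so later cells do not prompt again.
        os.environ[env_var] = key
    return key
```

It could then be used as `openai.api_key = load_api_key()`, or with `"ANTHROPIC_API_KEY"` when creating the Anthropic client.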

## Generate Scenario Prompts

In [None]:
# Generate prompts for each domain
scenario_prompts = generate_scenario_prompts(num_per_domain=20)

print(f"Generated {len(scenario_prompts)} scenario prompts")
print("\nExample prompt:")
print(scenario_prompts[0]['prompt'])
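
`generate_scenario_prompts` lives in the repository, so its internals are not visible here. The sketch below shows one plausible shape for its output, using the three keys this notebook reads (`domain`, `scenario_type`, `prompt`). The domain names, scenario types, prompt wording, and the function name `sketch_scenario_prompts` are all hypothetical; five domains are assumed only because 5 x 20 matches the 100 pairs stated at the top.

```python
from typing import Dict, List

# Hypothetical domains and scenario types; the repository's real lists may differ.
DOMAIN_SCENARIOS: Dict[str, List[str]] = {
    "coding": ["debugging a failing build", "recovering lost data"],
    "athletics": ["training for a marathon", "returning from injury"],
    "academics": ["preparing for a final exam", "revising a rejected paper"],
    "creative": ["finishing a novel draft", "learning a hard piano piece"],
    "workplace": ["meeting a tight deadline", "fixing a failed product launch"],
}


def sketch_scenario_prompts(num_per_domain: int = 20) -> List[Dict[str, str]]:
    """Build num_per_domain prompt dicts per domain by cycling scenario types."""
    prompts = []
    for domain, scenario_types in DOMAIN_SCENARIOS.items():
        for i in range(num_per_domain):
            scenario_type = scenario_types[i % len(scenario_types)]
            prompts.append({
                "domain": domain,
                "scenario_type": scenario_type,
                "prompt": f"Write a short {domain} story setup about {scenario_type}.",
            })
    return prompts


prompts = sketch_scenario_prompts(num_per_domain=20)
print(len(prompts))  # 5 domains x 20 prompts each
```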

## Generate Narrative Pairs

Use an LLM to generate both the determined and the drifting continuation for each scenario.

In [None]:
def generate_pair_with_openai(scenario_dict: Dict, model: str = "gpt-4") -> NarrativePair:
    """
    Generate a narrative pair using OpenAI API.
    """
    prompt = f"""Generate a story scenario for: {scenario_dict['scenario_type']}

Create:
1. SETUP: 2-3 sentences ending with "decided to"
2. DETERMINED: 1-2 sentences showing persistence/grit
3. DRIFTING: 1-2 sentences showing giving up/distraction

Example:
SETUP: The encryption key was missing and the deadline was in one hour. Marcus stared at his screen, weighing his options. He decided to
DETERMINED: rewrite the decryption algorithm from scratch, line by line, checking each function systematically.
DRIFTING: close his laptop and go for a walk, hoping the problem would resolve itself.

Now generate for: {scenario_dict['scenario_type']}
"""

    # openai>=1.0 removed openai.ChatCompletion; use the module-level chat client
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    content = response.choices[0].message.content

    # Parse response (simple regex-based parsing)
    import re
    setup_match = re.search(r'SETUP:\s*(.+?)(?=DETERMINED:|$)', content, re.DOTALL)
    determined_match = re.search(r'DETERMINED:\s*(.+?)(?=DRIFTING:|$)', content, re.DOTALL)
    drifting_match = re.search(r'DRIFTING:\s*(.+?)$', content, re.DOTALL)

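    # Fail loudly if a section is missing so the caller can skip this pair
    if not (setup_match and determined_match and drifting_match):
        raise ValueError(f"Could not parse SETUP/DETERMINED/DRIFTING from: {content!r}")

    setup = setup_match.group(1).strip()
    determined = determined_match.group(1).strip()
    drifting = drifting_match.group(1).strip()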

    return NarrativePair(
        scenario=setup,
        determined_context=setup,
        determined_continuation=determined,
        drifting_context=setup,
        drifting_continuation=drifting,
        domain=scenario_dict['domain']
    )


# Generate all pairs
narrative_pairs = []

for i, scenario_dict in enumerate(scenario_prompts):
    print(f"Generating pair {i+1}/{len(scenario_prompts)}...", end="\r")
    try:
        pair = generate_pair_with_openai(scenario_dict)
        narrative_pairs.append(pair)
    except Exception as e:
        print(f"\nError generating pair {i+1}: {e}")

print(f"\nGenerated {len(narrative_pairs)} narrative pairs")
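
The loop above gives up on a scenario after a single failure. Since API calls can fail transiently (for example on rate limits), one option is to retry with exponential backoff. This is a minimal sketch under that assumption; `call_with_retries` and its parameters are hypothetical helpers, not part of the repository or of the OpenAI library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            last_error = e
            # Wait base_delay, then 2x, 4x, ... but not after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Failed after {max_attempts} attempts") from last_error


# Demo with a stand-in function that fails twice before succeeding.
attempts = []
delays = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("transient error")
    return "ok"


# Passing delays.append as `sleep` records the waits instead of actually sleeping.
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In the generation loop this could wrap the existing call, for example `pair = call_with_retries(lambda: generate_pair_with_openai(scenario_dict))`.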

## Validate and Save

In [None]:
# Check domain balance
validate_dataset_balance(narrative_pairs)
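
What `validate_dataset_balance` checks is not shown in this notebook. A simple balance check, assuming "balanced" means roughly equal pair counts per domain, could be sketched as follows. The function names and the `tolerance` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def domain_counts(domains: Iterable[str]) -> Dict[str, int]:
    """Count how many pairs fall into each domain."""
    return dict(Counter(domains))


def is_balanced(counts: Dict[str, int], tolerance: int = 0) -> bool:
    """True when the largest and smallest domain counts differ by at most tolerance."""
    if not counts:
        return False
    return max(counts.values()) - min(counts.values()) <= tolerance


counts = domain_counts(["coding", "coding", "athletics", "athletics", "academics"])
print(counts, is_balanced(counts), is_balanced(counts, tolerance=1))
```

With the notebook's data this could be called as `domain_counts(p.domain for p in narrative_pairs)`. A nonzero tolerance is useful here because failed generations are skipped, so some domains may end up a few pairs short.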

In [None]:
# Save to JSON with train/validation split
save_narrative_pairs(
    narrative_pairs,
    output_path="data/narrative_pairs/determined_vs_drifting.json",
    train_split=0.8
)
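
The on-disk format written by `save_narrative_pairs` is not shown here. As an illustration of what an 80/20 train/validation split written to a single JSON file might involve, here is a hedged sketch. The function name `split_and_save`, the `seed` parameter, and the `"train"`/`"validation"` keys are assumptions, not the repository's format.

```python
import json
import os
import random
import tempfile
from typing import Dict, List


def split_and_save(records: List[Dict], output_path: str,
                   train_split: float = 0.8, seed: int = 0) -> Dict[str, int]:
    """Shuffle records, split into train/validation, and write one JSON file."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_split)
    payload = {"train": shuffled[:n_train], "validation": shuffled[n_train:]}
    # Create the parent directory if it does not exist yet.
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return {"train": len(payload["train"]), "validation": len(payload["validation"])}


# Demo on ten placeholder records written to a temporary directory.
records = [{"id": i, "domain": "example"} for i in range(10)]
out_path = os.path.join(tempfile.mkdtemp(), "pairs", "demo.json")
sizes = split_and_save(records, out_path)
print(sizes)
```

A plain shuffle does not guarantee that every domain appears in both splits; splitting per domain would be a reasonable refinement if that matters downstream.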

## Preview Generated Data

In [None]:
# Show a few examples
for i in range(min(3, len(narrative_pairs))):
    pair = narrative_pairs[i]
    print(f"\n=== Example {i+1} ({pair.domain}) ===")
    print(f"\nScenario: {pair.scenario}")
    print(f"\nDetermined: {pair.determined_continuation}")
    print(f"\nDrifting: {pair.drifting_continuation}")

## Download or Push to GitHub

In [None]:
# Option 1: Download to local machine
from google.colab import files
files.download("data/narrative_pairs/determined_vs_drifting.json")

In [None]:
# Option 2: Push to GitHub (needs git user.name/user.email and push credentials set up in this session)
!git add data/narrative_pairs/determined_vs_drifting.json
!git commit -m "Add generated narrative pairs dataset"
!git push

## Next Steps

Now that you have the dataset, proceed to:
- `02_vector_extraction.ipynb` to extract the motivation vector
- Or manually review and refine the generated pairs if needed
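
Manual review can be narrowed down by flagging suspicious pairs automatically first. The sketch below works on plain dicts that use the same field names as the pairs in this notebook; the function name `find_problems` and the specific checks are hypothetical suggestions based on the generation prompt's format, not an existing repository utility.

```python
from typing import Dict, List


def find_problems(pair: Dict[str, str]) -> List[str]:
    """Return human-readable problems for one pair; an empty list means it passed."""
    problems = []
    setup = pair.get("scenario", "").strip()
    determined = pair.get("determined_continuation", "").strip()
    drifting = pair.get("drifting_continuation", "").strip()
    if not setup or not determined or not drifting:
        problems.append("missing section")
    # The generation prompt asks for a setup that ends with "decided to".
    if setup and not setup.rstrip(".").lower().endswith("decided to"):
        problems.append("setup does not end with 'decided to'")
    if determined and determined == drifting:
        problems.append("continuations are identical")
    return problems


good = {
    "scenario": "The deadline was close. Marcus decided to",
    "determined_continuation": "rewrite the algorithm from scratch.",
    "drifting_continuation": "go for a walk.",
}
bad = {
    "scenario": "The deadline was close.",
    "determined_continuation": "",
    "drifting_continuation": "go for a walk.",
}
print(find_problems(good))
print(find_problems(bad))
```

For `NarrativePair` objects, the same checks could read the same-named attributes instead of dict keys.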