# Cognitive Action Training Data Generator

This notebook generates training data for cognitive action recognition by prompting an Ollama model, either on a local installation or through a remote Ollama API endpoint.

**Based on Scientific Taxonomies:**
- Bloom's Taxonomy (cognitive processes)
- Guilford's Structure of Intellect
- Krathwohl's Affective Domain
- Gross's Emotion Regulation Model
- Metacognitive Process Frameworks

**Target:** 100,000+ diverse examples of explicit cognitive and psychological actions

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install requests pandas numpy tqdm matplotlib seaborn

# If running locally with Ollama installed:
# !pip install ollama

import json
import time
import random
import requests
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("Dependencies installed successfully!")

## 2. Upload Supporting Files

Upload the following files to your Colab environment:
- `variable_pools.py`
- `prompt_templates.py`
- `data_generator.py`

Or run the cells below to create them directly:

In [None]:
%%writefile variable_pools.py
"""
Variable Pools for Cognitive Action Training Data Generation
Based on scientific taxonomies from cognitive psychology and emotion research
"""

import random

# =============================================================================
# COGNITIVE ACTIONS (Based on Scientific Taxonomies)
# =============================================================================

# Combined from multiple taxonomies: original instructions + scientific frameworks
COGNITIVE_ACTIONS = {
    # Original Core Actions
    "reconsidering": "reconsidering a belief or decision",
    "reframing": "reframing a situation or perspective",
    "noticing": "noticing a pattern, feeling, or dynamic",
    "perspective_taking": "taking another's perspective or temporal view",
    "questioning": "questioning an assumption or belief",
    "abstracting": "abstracting from specifics to general patterns",
    "concretizing": "making abstract concepts concrete and specific",
    "connecting": "connecting disparate ideas or experiences",
    "distinguishing": "distinguishing between previously conflated concepts",
    "updating_beliefs": "updating mental models or beliefs",
    "suspending_judgment": "suspending judgment and staying with uncertainty",
    "pattern_recognition": "recognizing recurring patterns across situations",
    "zooming_out": "zooming out for broader context",
    "zooming_in": "zooming in on specific details",
    "analogical_thinking": "drawing analogies between domains",
    "counterfactual_reasoning": "engaging in 'what if' thinking",
    "hypothesis_generation": "generating possible explanations",
    "meta_awareness": "reflecting on one's own thinking process",
    "accepting": "accepting and letting go of control",

    # From Bloom's Taxonomy
    "remembering": "recalling relevant information or experiences",
    "understanding": "interpreting and explaining meaning",
    "applying": "using knowledge in new situations",
    "analyzing": "breaking down into components",
    "evaluating": "making judgments about value or effectiveness",
    "creating": "generating new ideas or solutions",

    # From Guilford's Structure of Intellect
    "divergent_thinking": "generating multiple creative solutions",
    "convergent_thinking": "finding the single best solution",
    "cognition_awareness": "becoming aware and comprehending",

    # Metacognitive Operations
    "metacognitive_monitoring": "tracking one's own comprehension",
    "metacognitive_regulation": "adjusting thinking strategies",
    "self_questioning": "interrogating one's own understanding",

    # Emotional/Affective Operations (from taxonomies)
    "emotional_reappraisal": "reinterpreting emotional meaning",
    "emotion_receiving": "becoming aware of emotions",
    "emotion_responding": "actively engaging with emotions",
    "emotion_valuing": "attaching worth to emotional experiences",
    "emotion_organizing": "integrating conflicting emotions",
    "emotion_characterizing": "aligning emotions with core values",
    "situation_selection": "choosing emotional contexts deliberately",
    "situation_modification": "changing circumstances to regulate emotion",
    "attentional_deployment": "directing attention for emotional regulation",
    "response_modulation": "modifying emotional expression",
    "emotion_perception": "identifying emotions in self/others",
    "emotion_facilitation": "using emotions to enhance thinking",
    "emotion_understanding": "comprehending emotional complexity",
    "emotion_management": "regulating emotions in self/others"
}

# =============================================================================
# SUBJECTS (Who is doing the thinking)
# =============================================================================

SUBJECTS = [
    # Professional Roles
    "a software developer", "a teacher", "a therapist", "a manager", "a researcher",
    "a scientist", "a doctor", "a lawyer", "a consultant", "a designer",
    "an engineer", "a writer", "an artist", "a musician", "an entrepreneur",
    "a nurse", "a social worker", "a coach", "a mentor", "a leader",

    # Life Stages/Demographics
    "someone in their early 20s", "someone in their 30s", "someone in their 40s",
    "someone in their 50s", "someone in their 60s", "a recent graduate",
    "a parent", "a grandparent", "a student", "a retiree",

    # Relationship Roles
    "a partner in a relationship", "a friend", "a colleague", "a team member",
    "a sibling", "a child reflecting on parents", "a mentor", "a mentee",

    # Life Situations
    "someone grieving a loss", "someone facing a major transition",
    "someone dealing with success", "someone processing failure",
    "a person in therapy", "someone in recovery", "a career changer",
    "someone learning a new skill", "a person facing illness",
    "someone in conflict", "a person seeking growth"
]

# =============================================================================
# DOMAINS (Context areas)
# =============================================================================

DOMAINS = [
    "personal relationships", "romantic relationships", "family dynamics",
    "friendships", "career decisions", "professional development",
    "creative work", "artistic expression", "scientific research",
    "academic learning", "moral and ethical dilemmas", "health and wellness",
    "financial planning", "investment decisions", "conflict resolution",
    "identity and self-concept", "parenting and caregiving", "leadership challenges",
    "team dynamics", "communication challenges", "goal setting and achievement",
    "dealing with failure", "processing success", "daily mundane decisions",
    "philosophical questions", "spiritual exploration", "time management",
    "personal growth", "therapy and healing", "addiction recovery",
    "grief and loss", "major life transitions", "retirement planning",
    "educational choices", "political beliefs", "social justice issues"
]

# =============================================================================
# CONTEXT DETAILS (Specific scenarios for each domain)
# =============================================================================

CONTEXT_DETAILS = {
    "personal relationships": [
        "after a difficult conversation with a partner",
        "noticing a recurring conflict pattern with a family member",
        "considering whether to reconnect with an old friend",
        "processing feedback from someone close",
        "dealing with feeling excluded from a social group",
        "navigating a boundary issue with a friend",
        "after a misunderstanding was clarified",
        "considering ending a toxic relationship",
        "noticing how they communicate differently with different people",
        "reflecting on a repeated argument pattern"
    ],

    "career decisions": [
        "after receiving a job offer from a different field",
        "considering a major career pivot",
        "processing harsh feedback from a supervisor",
        "deciding whether to speak up about workplace issues",
        "evaluating why a project failed",
        "thinking about asking for a promotion",
        "after being passed over for advancement",
        "considering starting their own business",
        "dealing with imposter syndrome at work",
        "reflecting on work-life balance priorities"
    ]
}

# Add more context details for other domains
for domain in DOMAINS:
    if domain not in CONTEXT_DETAILS:
        CONTEXT_DETAILS[domain] = [
            f"facing a significant decision about {domain}",
            f"processing unexpected developments in {domain}",
            f"reflecting on patterns in {domain}",
            f"considering changes to their approach in {domain}",
            f"dealing with conflict or tension in {domain}"
        ]

# =============================================================================
# ADDITIONAL VARIABLES
# =============================================================================

TRIGGERS = [
    "reading an article that contradicts their worldview",
    "receiving unexpected feedback from someone they trust",
    "noticing their physical or emotional reaction to something",
    "having a meaningful conversation with someone",
    "experiencing an unexpected setback or failure",
    "achieving success in an unexpected way",
    "witnessing someone else's perspective on the same issue",
    "having quiet time for reflection during a walk or shower",
    "facing a deadline that forces clarity",
    "encountering a similar situation to one from their past"
]

EMOTIONAL_STATES = [
    "feeling frustrated and stuck", "experiencing confusion and uncertainty",
    "feeling defensive about their position", "in a calm and reflective mood",
    "feeling anxious about the implications", "experiencing genuine curiosity",
    "feeling disappointed by outcomes", "in a moment of unexpected clarity",
    "feeling overwhelmed by options", "experiencing relief after stress"
]

LANGUAGE_STYLES = [
    "casual and conversational", "introspective and literary",
    "straightforward and direct", "tentative and exploratory",
    "confident and declarative", "stream-of-consciousness style",
    "analytical and precise", "emotional and expressive"
]

UNIQUE_ANGLES = [
    "include a specific sensory detail that triggered the insight",
    "show the cognitive process taking time rather than being instant",
    "include self-doubt about the cognitive process itself",
    "show a partial or incomplete cognitive shift",
    "include resistance or pushback before the mental shift",
    "make the scenario very mundane and everyday",
    "show it happening in a specific physical location",
    "include another person's influence on the thinking"
]

COMPLEXITY_LEVELS = {
    "simple": "Single clear cognitive action, straightforward scenario, obvious outcome",
    "moderate": "Multiple factors at play, some ambiguity, partial clarity",
    "complex": "Multiple interacting cognitive actions, high uncertainty, conflicting considerations, no clear resolution"
}

PERSPECTIVES = [
    "first-person present tense ('I'm noticing right now...')",
    "first-person past reflective ('I realized later that I had been...')",
    "first-person future conditional ('I'll need to reconsider when...')",
    "second-person coaching ('You might try reframing...')",
    "third-person observation ('She began to reconsider...')",
    "internal monologue with self-talk",
    "metacognitive commentary ('My thought process here is...')"
]

# =============================================================================
# UTILITY FUNCTIONS
# =============================================================================

def get_random_selection():
    """Get a random selection of variables for prompt generation"""
    return {
        'cognitive_action': random.choice(list(COGNITIVE_ACTIONS.keys())),
        'subject': random.choice(SUBJECTS),
        'domain': random.choice(DOMAINS),
        'trigger': random.choice(TRIGGERS),
        'emotional_state': random.choice(EMOTIONAL_STATES),
        'language_style': random.choice(LANGUAGE_STYLES),
        'unique_angle': random.choice(UNIQUE_ANGLES),
        'complexity_level': random.choice(list(COMPLEXITY_LEVELS.keys())),
        'perspective': random.choice(PERSPECTIVES)
    }

def get_context_detail(domain):
    """Get a specific context detail for a domain"""
    return random.choice(CONTEXT_DETAILS.get(domain, CONTEXT_DETAILS["personal relationships"]))

def get_cognitive_action_chains(n=3):
    """Get a sequence of cognitive actions for chain examples"""
    actions = random.sample(list(COGNITIVE_ACTIONS.keys()), n)
    return actions

print("Variable pools loaded successfully!")
print(f"Total cognitive actions: {len(COGNITIVE_ACTIONS)}")
print(f"Total domains: {len(DOMAINS)}")
print(f"Total subjects: {len(SUBJECTS)}")
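
To get a feel for how much variety these pools can produce, the sketch below multiplies the sizes of the pools that `get_random_selection()` draws from. The sizes are hand-counted from the lists in the cell above, which is an assumption: re-check them with `len(...)` if the lists change. `count_combinations` is a hypothetical helper, not part of the generator.

```python
import math

# Pool sizes hand-counted from variable_pools.py above (assumption: re-check
# with len(...) if the lists are edited).
POOL_SIZES = {
    "cognitive_action": 45,
    "subject": 48,  # 49 list entries, but "a mentor" appears twice
    "domain": 36,
    "trigger": 10,
    "emotional_state": 10,
    "language_style": 8,
    "unique_angle": 8,
    "complexity_level": 3,
    "perspective": 7,
}


def count_combinations(pool_sizes):
    """Number of distinct draws when one item is chosen from every pool."""
    return math.prod(pool_sizes.values())


total = count_combinations(POOL_SIZES)
print(f"Distinct variable combinations: {total:,}")
```

The figure ignores context details and template choice, and many combinations will read alike, so treat it as an upper bound on surface variety rather than a measure of diversity.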

In [None]:
%%writefile prompt_templates.py
"""
Prompt Templates for Cognitive Action Training Data Generation
Based on instructions2.md template system architecture
"""

import random
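from variable_pools import (
    COGNITIVE_ACTIONS,
    EMOTIONAL_STATES,
    get_cognitive_action_chains,
    get_context_detail,
    get_random_selection,
)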

# =============================================================================
# SINGLE COGNITIVE ACTION TEMPLATES
# =============================================================================

SINGLE_ACTION_TEMPLATES = [
    # Template 1: Scenario-Based
    """Generate 1 example (2-4 sentences) showing someone {cognitive_action_desc} in this specific scenario:

Scenario: {subject} is {situation} and experiences {trigger}. They engage in {cognitive_action_desc}.

Requirements:
- Domain: {domain}
- Emotional context: {emotional_state}
- Show the cognitive process explicitly
- Use {language_style} language
- Focus angle: {unique_angle}
- Complexity: {complexity_level}

Output only the example text, no preamble.""",

    # Template 2: Before/After
    """Generate 1 example (2-4 sentences) showing {cognitive_action_desc} by contrasting before and after states:

Context: {domain} situation involving {context_detail}
Before state: {initial_state}
Cognitive action: {cognitive_action_desc}
After state: {result_direction}

Requirements:
- Subject: {subject}
- Complexity: {complexity_level}
- Avoid cliché phrasings
- Language register: {language_style}
- Unique constraint: {unique_angle}

Output only the example text, no preamble.""",

    # Template 3: Process-Focused
    """Generate 1 example (2-4 sentences) that shows the PROCESS of {cognitive_action_desc}:

Setup: {subject} faces {problem_type} related to {domain}
The cognitive action unfolds gradually and visibly.

Requirements:
- Make the internal process visible
- Perspective: {perspective}
- Unique angle: {unique_angle}
- Language style: {language_style}
- Show it taking time, not being instant

Output only the example text, no preamble."""
]

# =============================================================================
# COGNITIVE ACTION CHAIN TEMPLATES
# =============================================================================

CHAIN_TEMPLATES = [
    # Template: Sequential Chain
    """Generate 1 example (4-6 sentences) showing this sequence of cognitive actions:

Step 1: {cognitive_action_1_desc} triggered by {trigger}
Step 2: This leads to {cognitive_action_2_desc}
Step 3: Which results in {cognitive_action_3_desc}

Context:
- Domain: {domain}
- Scenario: {context_detail}
- Subject: {subject}
- Emotional arc: {starting_emotion} → {ending_emotion}

Requirements:
- Show clear progression between steps
- Make causal connections visible
- Complexity: {complexity}
- Unique constraint: {unique_angle}

Output only the example text, no preamble."""
]

# =============================================================================
# DIALOGUE TEMPLATES
# =============================================================================

DIALOGUE_TEMPLATES = [
    # Therapy/Coaching Context
    """Generate a dialogue (3-4 exchanges) showing {cognitive_action_desc} in a therapeutic context:

Setting: {therapy_setting}
Client issue: {domain} - {context_detail}
Cognitive action demonstrated: {cognitive_action_desc}

Requirements:
- Show the cognitive action emerging through dialogue
- Include both therapist and client voices
- Emotional tone: {emotional_state}
- Make it feel natural and realistic
- Unique focus: {unique_angle}

Output dialogue format only, no stage directions."""
]

# =============================================================================
# THOUGHT-STREAM TEMPLATES
# =============================================================================

THOUGHT_STREAM_TEMPLATES = [
    # Stream of Consciousness
    """Generate a stream-of-consciousness example (4-6 sentences) showing {cognitive_action_desc}:

Initial state: {subject} is {initial_situation} feeling {emotional_state}
Domain: {domain}
Trigger: {trigger}
Cognitive process: {cognitive_action_desc}

Requirements:
- Show the mind in motion, not just the conclusion
- Include false starts, interruptions, tangents
- Make it feel like real internal dialogue
- Include: {unique_angle}
- Complexity: {complexity_level}

Output only the thought stream, no framing."""
]

# =============================================================================
# NEGATIVE EXAMPLE TEMPLATES
# =============================================================================

NEGATIVE_TEMPLATES = [
    """Generate 1 example (2-4 sentences) showing the ABSENCE of {cognitive_action_desc}:

Context: {subject} faces {context_detail} in {domain}
Trigger present: {trigger}
What they DON'T do: {cognitive_action_desc}
Instead they: {negative_alternative}

Requirements:
- Show rigid thinking or missed opportunity
- Make it realistic (not caricature)
- Emotional state: {emotional_state}
- Include: {unique_angle}

Output only the example text, no preamble."""
]

# =============================================================================
# TEMPLATE PARAMETER GENERATORS
# =============================================================================

def generate_template_parameters(cognitive_action_key, template_type="single"):
    """Generate all parameters needed for template formatting"""
    base_params = get_random_selection()
    domain = base_params['domain']

    params = {
        'cognitive_action': cognitive_action_key,
        'cognitive_action_desc': COGNITIVE_ACTIONS[cognitive_action_key],
        'subject': base_params['subject'],
        'domain': domain,
        'context_detail': get_context_detail(domain),
        'trigger': base_params['trigger'],
        'emotional_state': base_params['emotional_state'],
        'language_style': base_params['language_style'],
        'unique_angle': base_params['unique_angle'],
        'complexity_level': base_params['complexity_level'],
        'perspective': base_params['perspective']
    }

    # Add template-specific parameters
    if template_type == "single":
        params.update({
            'situation': f"dealing with {get_context_detail(domain)}",
            'initial_state': f"initially feeling {random.choice(EMOTIONAL_STATES)}",
            'result_direction': f"moving toward {random.choice(['clarity', 'acceptance', 'understanding', 'resolution'])}",
            'problem_type': f"a challenge involving {random.choice(['conflicting values', 'unclear options', 'emotional complexity'])}"
        })

    elif template_type == "chain":
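        # Start the chain with the requested action so the example's label matches
        # the prompt; sample the two follow-up actions from the remaining pool.
        others = [a for a in COGNITIVE_ACTIONS if a != cognitive_action_key]
        actions = [cognitive_action_key] + random.sample(others, 2)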
        params.update({
            'cognitive_action_1': actions[0],
            'cognitive_action_1_desc': COGNITIVE_ACTIONS[actions[0]],
            'cognitive_action_2': actions[1],
            'cognitive_action_2_desc': COGNITIVE_ACTIONS[actions[1]],
            'cognitive_action_3': actions[2],
            'cognitive_action_3_desc': COGNITIVE_ACTIONS[actions[2]],
            'starting_emotion': random.choice(EMOTIONAL_STATES),
            'ending_emotion': random.choice(EMOTIONAL_STATES),
            'complexity': params['complexity_level']
        })

    elif template_type == "dialogue":
        params.update({
            'therapy_setting': random.choice(['therapy session', 'coaching conversation', 'peer support group'])
        })

    elif template_type == "thought_stream":
        params.update({
            'initial_situation': f"dealing with {get_context_detail(domain)}"
        })

    elif template_type == "negative":
        params.update({
            'negative_alternative': random.choice([
                'stick rigidly to their original view',
                'dismiss the new information',
                'avoid thinking about it',
                'become more defensive'
            ])
        })

    return params

# =============================================================================
# MAIN TEMPLATE SELECTION FUNCTIONS
# =============================================================================

def get_random_template(template_type="single"):
    """Get a random template of specified type"""
    if template_type == "single":
        return random.choice(SINGLE_ACTION_TEMPLATES)
    elif template_type == "chain":
        return random.choice(CHAIN_TEMPLATES)
    elif template_type == "dialogue":
        return random.choice(DIALOGUE_TEMPLATES)
    elif template_type == "thought_stream":
        return random.choice(THOUGHT_STREAM_TEMPLATES)
    elif template_type == "negative":
        return random.choice(NEGATIVE_TEMPLATES)
    else:
        return random.choice(SINGLE_ACTION_TEMPLATES)

def generate_prompt(cognitive_action_key=None, template_type="single", iteration_number=0):
    """Generate a complete prompt for the LLM"""
    if cognitive_action_key is None:
        cognitive_action_key = random.choice(list(COGNITIVE_ACTIONS.keys()))

    template = get_random_template(template_type)
    params = generate_template_parameters(cognitive_action_key, template_type)

    # Format the template with parameters. If a field is missing (for example,
    # an unknown template_type falls back to a single-action template without
    # its extra parameters), fill it with a visible placeholder and retry.
    while True:
        try:
            prompt = template.format(**params)
            break
        except KeyError as e:
            missing_param = e.args[0]
            params[missing_param] = f"[{missing_param}]"

    # Add uniqueness constraint
    prompt += f"\n\nExample #{iteration_number + 1}. Make this distinctly different from previous examples."

    return prompt, params

print("Prompt templates loaded successfully!")
print(f"Single action templates: {len(SINGLE_ACTION_TEMPLATES)}")
print(f"Chain templates: {len(CHAIN_TEMPLATES)}")
print(f"Dialogue templates: {len(DIALOGUE_TEMPLATES)}")
print(f"Thought stream templates: {len(THOUGHT_STREAM_TEMPLATES)}")
print(f"Negative templates: {len(NEGATIVE_TEMPLATES)}")
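
Because each template expects a different set of placeholders, it can help to check a template against its parameters before calling `format()`. The sketch below uses the standard library's `string.Formatter` to list the named fields a template needs. `template_fields` and `missing_fields` are hypothetical helper names, and the demo template and parameters are stand-ins shaped like the ones above.

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders used in a str.format template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skips literal-only segments (None) and positional "{}"
    }


def missing_fields(template, params):
    """List placeholders in `template` that `params` does not supply."""
    return sorted(template_fields(template) - set(params))


# Stand-in template and parameters, shaped like the ones in prompt_templates.py
demo_template = "Setup: {subject} faces {problem_type} related to {domain}"
demo_params = {"subject": "a teacher", "domain": "career decisions"}

print(missing_fields(demo_template, demo_params))
```

Running a check like this over every template and template type once, before a long generation run, would surface mismatches early instead of leaving bracketed placeholders in prompts.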

## 3. Ollama Setup

Choose one of the following options:

### Option A: Local Ollama Installation (Recommended)

If you have Ollama running locally, modify the URL below to point to your instance:

In [None]:
import requests
import json

class OllamaClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def generate(self, model="llama3.2", prompt="", stream=False):
        """Generate text using Ollama API"""
        url = f"{self.base_url}/api/generate"
        data = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        
        try:
            # Pass stream through so requests does not buffer a streamed reply
            response = self.session.post(url, json=data, timeout=120, stream=stream)
            response.raise_for_status()
            
            if stream:
                return response.iter_lines()
            else:
                return response.json()
                
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to Ollama: {e}")
            return None
    
    def list_models(self):
        """List available models"""
        url = f"{self.base_url}/api/tags"
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to Ollama: {e}")
            return None

# Initialize Ollama client
# CHANGE THIS URL TO YOUR OLLAMA INSTANCE
ollama = OllamaClient(base_url="http://localhost:11434")

# Test connection
models = ollama.list_models()
if models:
    print("✅ Connected to Ollama successfully!")
    print("Available models:")
    for model in models.get('models', []):
        print(f"  - {model.get('name', 'Unknown')}")
else:
    print("❌ Could not connect to Ollama. Please check your setup.")
    print("Make sure Ollama is running and accessible at the specified URL.")
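
When `generate()` is called with `stream=True`, it returns the raw line iterator rather than a parsed reply. Ollama's streaming endpoint sends one JSON object per line, each carrying a text fragment in its `response` field, with `done` set to true on the last one. A minimal sketch for reassembling the text follows; `collect_stream` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json


def collect_stream(lines):
    """Join the `response` fragments from an iterable of NDJSON byte lines."""
    parts = []
    for raw in lines:
        if not raw:  # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Made-up sample in the shape of a streamed reply
sample_lines = [
    b'{"response": "I noticed ", "done": false}',
    b"",
    b'{"response": "the pattern.", "done": true}',
]
print(collect_stream(sample_lines))
```

With the client above, this would be used as `collect_stream(ollama.generate(prompt="...", stream=True))`, after first checking that `generate()` did not return `None`.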

### Option B: Install Ollama in Colab (Experimental)

⚠️ **Note:** This is experimental and may not work reliably in all Colab environments.

In [None]:
# Uncomment and run this cell to install Ollama directly in Colab
# This is experimental and may not work in all environments

# !curl -fsSL https://ollama.ai/install.sh | sh
# 
# # Start Ollama in background
# import subprocess
# import time
# 
# # Start Ollama server
# ollama_process = subprocess.Popen(['ollama', 'serve'], 
#                                   stdout=subprocess.PIPE, 
#                                   stderr=subprocess.PIPE)
# 
# # Wait for server to start
# time.sleep(10)
# 
# # Pull a model
# !ollama pull llama3.2
# 
# print("Ollama installed and model pulled!")
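
The fixed `time.sleep(10)` in the cell above may wait too long or not long enough. One alternative is to poll until the server answers. The sketch below is an illustration under assumptions: `wait_until_ready`, `http_probe`, and the `probe` parameter are hypothetical names, and the intended target is the `/api/tags` endpoint that `list_models()` already uses.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, timeout=60.0, interval=1.0, probe=http_probe):
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In place of the sleep, a call such as `wait_until_ready("http://localhost:11434/api/tags")` returns `True` as soon as the server responds and `False` if it never does within the timeout.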

## 4. Data Generation System

In [None]:
%%writefile data_generator.py
"""
Main Data Generation Engine for Cognitive Action Training Data
Integrates with Ollama for LLM generation
"""

import json
import time
import random
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from variable_pools import COGNITIVE_ACTIONS, get_random_selection
from prompt_templates import generate_prompt

@dataclass
class GeneratedExample:
    """Structure for a generated training example"""
    text: str
    primary_cognitive_action: str
    secondary_actions: List[str]
    domain: str
    complexity: str
    perspective: str
    format_type: str
    metadata: Dict[str, Any]

class CognitiveDataGenerator:
    """Main class for generating cognitive action training data"""

    def __init__(self, ollama_client=None):
        """Initialize the generator with optional Ollama client"""
        self.ollama_client = ollama_client
        self.generated_examples = []
        self.generation_stats = {
            'total_generated': 0,
            'by_cognitive_action': {},
            'by_domain': {},
            'by_complexity': {},
            'errors': []
        }

    def generate_single_example(self,
                               cognitive_action: Optional[str] = None,
                               template_type: str = "single",
                               model: str = "llama3.2") -> Optional[GeneratedExample]:
        """Generate a single training example"""
        try:
            # Generate prompt
            prompt, params = generate_prompt(cognitive_action, template_type,
                                           self.generation_stats['total_generated'])

            # Generate with Ollama
            if self.ollama_client:
                response = self.ollama_client.generate(
                    model=model,
                    prompt=prompt
                )
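                if not response:
                    # Raise so the failure is recorded in generation_stats['errors']
                    # instead of a placeholder string entering the dataset.
                    raise RuntimeError(
                        f"Ollama returned no response for {params['cognitive_action']}"
                    )
                generated_text = response['response'].strip()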
            else:
                # Fallback for testing without Ollama
                generated_text = f"[Generated example for {params['cognitive_action']} in {params['domain']}]"

            # Create example object
            example = GeneratedExample(
                text=generated_text,
                primary_cognitive_action=params['cognitive_action'],
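                # For chain prompts, record the other actions named in the prompt
                secondary_actions=[
                    params[key]
                    for key in ('cognitive_action_1', 'cognitive_action_2', 'cognitive_action_3')
                    if key in params and params[key] != params['cognitive_action']
                ],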
                domain=params['domain'],
                complexity=params['complexity_level'],
                perspective=params['perspective'],
                format_type=template_type,
                metadata={
                    'subject': params['subject'],
                    'emotional_state': params['emotional_state'],
                    'language_style': params['language_style'],
                    'unique_angle': params['unique_angle'],
                    'trigger': params['trigger'],
                    'generation_timestamp': time.time(),
                    'prompt_used': prompt
                }
            )

            # Update statistics
            self._update_stats(example)
            self.generated_examples.append(example)

            return example

        except Exception as e:
            error_info = {
                'error': str(e),
                'cognitive_action': cognitive_action,
                'template_type': template_type,
                'timestamp': time.time()
            }
            self.generation_stats['errors'].append(error_info)
            print(f"Error generating example: {e}")
            return None

    def generate_batch(self,
                      batch_size: int = 10,
                      cognitive_action: Optional[str] = None,
                      template_type: str = "single",
                      model: str = "llama3.2",
                      delay: float = 0.5) -> List[GeneratedExample]:
        """Generate a batch of examples"""
        examples = []

        for i in range(batch_size):
            example = self.generate_single_example(cognitive_action, template_type, model)
            if example:
                examples.append(example)
                print(f"Generated example {i+1}/{batch_size}: {example.primary_cognitive_action}")

            # Delay to avoid overwhelming the API
            if delay > 0:
                time.sleep(delay)

        return examples

    def generate_stratified_dataset(self,
                                   total_examples: int = 1000,
                                   model: str = "llama3.2") -> List[GeneratedExample]:
        """Generate a stratified dataset ensuring coverage across cognitive actions"""
        examples_per_action = total_examples // len(COGNITIVE_ACTIONS)
        all_examples = []

        for action in COGNITIVE_ACTIONS.keys():
            print(f"Generating {examples_per_action} examples for: {action}")

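            # Mix template types for variety (roughly 70% single, 20% chain,
            # 10% dialogue); "single" takes the remainder so that rounding
            # never drops examples.
            n_chain = int(examples_per_action * 0.2)
            n_dialogue = int(examples_per_action * 0.1)
            n_single = examples_per_action - n_chain - n_dialogue
            template_types = ["single"] * n_single + \
                           ["chain"] * n_chain + \
                           ["dialogue"] * n_dialogue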

            random.shuffle(template_types)

            for i, template_type in enumerate(template_types):
                example = self.generate_single_example(action, template_type, model)
                if example:
                    all_examples.append(example)

        # Shuffle final dataset
        random.shuffle(all_examples)
        return all_examples

    def _update_stats(self, example: GeneratedExample):
        """Update generation statistics"""
        self.generation_stats['total_generated'] += 1

        # Update by cognitive action
        action = example.primary_cognitive_action
        if action not in self.generation_stats['by_cognitive_action']:
            self.generation_stats['by_cognitive_action'][action] = 0
        self.generation_stats['by_cognitive_action'][action] += 1

        # Update by domain
        domain = example.domain
        if domain not in self.generation_stats['by_domain']:
            self.generation_stats['by_domain'][domain] = 0
        self.generation_stats['by_domain'][domain] += 1

        # Update by complexity
        complexity = example.complexity
        if complexity not in self.generation_stats['by_complexity']:
            self.generation_stats['by_complexity'][complexity] = 0
        self.generation_stats['by_complexity'][complexity] += 1

    def export_dataset(self, filepath: str, format: str = "jsonl"):
        """Export generated dataset to file ("jsonl" or "json")"""
        # Fail loudly instead of reporting success for an unknown format
        if format not in ("jsonl", "json"):
            raise ValueError(f"Unsupported export format: {format!r}")

        # Build each record once so both formats share the same schema
        records = [
            {
                'text': example.text,
                'primary_cognitive_action': example.primary_cognitive_action,
                'secondary_actions': example.secondary_actions,
                'domain': example.domain,
                'complexity': example.complexity,
                'perspective': example.perspective,
                'format_type': example.format_type,
                'metadata': example.metadata
            }
            for example in self.generated_examples
        ]

        with open(filepath, 'w', encoding='utf-8') as f:
            if format == "jsonl":
                # One JSON object per line
                for record in records:
                    f.write(json.dumps(record) + '\n')
            else:
                json.dump(records, f, indent=2)

        print(f"Exported {len(self.generated_examples)} examples to {filepath}")

    def get_statistics(self) -> Dict[str, Any]:
        """Get generation statistics"""
        return self.generation_stats

    def print_statistics(self):
        """Print formatted statistics"""
        stats = self.generation_stats
        print(f"\nGeneration Statistics:")
        print(f"Total examples generated: {stats['total_generated']}")
        print(f"Errors encountered: {len(stats['errors'])}")

        print(f"\nBy Cognitive Action:")
        for action, count in sorted(stats['by_cognitive_action'].items()):
            print(f"  {action}: {count}")

        print(f"\nBy Domain:")
        for domain, count in sorted(stats['by_domain'].items()):
            print(f"  {domain}: {count}")

        print(f"\nBy Complexity:")
        for complexity, count in sorted(stats['by_complexity'].items()):
            print(f"  {complexity}: {count}")

print("Data generator loaded successfully!")

In [None]:
# Import the modules we just created
from variable_pools import *
from prompt_templates import *
from data_generator import *

print("All modules imported successfully!")
print(f"Ready to generate data with {len(COGNITIVE_ACTIONS)} cognitive actions")

## 5. Test the System

In [None]:
# Test prompt generation without LLM
print("Testing prompt generation...\n")

# Generate a few sample prompts
for i in range(3):
    prompt, params = generate_prompt(template_type="single", iteration_number=i)
    print(f"=== PROMPT {i+1} ===")
    print(f"Cognitive Action: {params['cognitive_action']}")
    print(f"Domain: {params['domain']}")
    print(f"Subject: {params['subject']}")
    print()
    print(prompt)
    print("\n" + "="*50 + "\n")

In [None]:
# Test with Ollama (if connected)
generator = CognitiveDataGenerator(ollama_client=ollama)

print("Testing single example generation...")

# Generate a single example
example = generator.generate_single_example(
    cognitive_action="reconsidering",
    template_type="single",
    model="llama3.2"  # Change this to your available model
)

if example:
    print("\n=== GENERATED EXAMPLE ===")
    print(f"Cognitive Action: {example.primary_cognitive_action}")
    print(f"Domain: {example.domain}")
    print(f"Complexity: {example.complexity}")
    print(f"Format: {example.format_type}")
    print()
    print("Generated Text:")
    print(example.text)
    print()
    print("Metadata:")
    for key, value in example.metadata.items():
        if key != 'prompt_used':  # Skip the long prompt
            print(f"  {key}: {value}")
else:
    print("Failed to generate example. Check Ollama connection.")

## 6. Batch Generation

In [None]:
# Generate a small batch for testing
print("Generating small test batch...")

test_examples = generator.generate_batch(
    batch_size=5,
    cognitive_action=None,  # Random actions
    template_type="single",
    model="llama3.2",
    delay=1.0  # 1 second delay between generations
)

print(f"\nGenerated {len(test_examples)} examples")

# Show a few examples
for i, example in enumerate(test_examples[:3]):
    print(f"\n=== EXAMPLE {i+1} ===")
    print(f"Action: {example.primary_cognitive_action}")
    print(f"Domain: {example.domain}")
    print(f"Text: {example.text[:200]}...")

# Show statistics
generator.print_statistics()

In [None]:
# Generate stratified dataset (smaller for testing)
print("Generating stratified test dataset...")

# Create new generator for clean stats
stratified_generator = CognitiveDataGenerator(ollama_client=ollama)

stratified_examples = stratified_generator.generate_stratified_dataset(
    total_examples=100,  # Start small for testing
    model="llama3.2"
)

print(f"\nGenerated {len(stratified_examples)} stratified examples")
stratified_generator.print_statistics()

## 7. Data Analysis and Visualization

In [None]:
# Convert to DataFrame for analysis
data = []
for example in stratified_generator.generated_examples:
    data.append({
        'text': example.text,
        'cognitive_action': example.primary_cognitive_action,
        'domain': example.domain,
        'complexity': example.complexity,
        'format_type': example.format_type,
        'text_length': len(example.text),
        'word_count': len(example.text.split()),
        'subject': example.metadata.get('subject', ''),
        'emotional_state': example.metadata.get('emotional_state', '')
    })

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

print("\nBasic statistics:")
print(df.describe())

In [None]:
# Visualizations
plt.figure(figsize=(15, 10))

# Distribution of cognitive actions
plt.subplot(2, 3, 1)
cognitive_action_counts = df['cognitive_action'].value_counts()
plt.bar(range(len(cognitive_action_counts)), cognitive_action_counts.values)
plt.xticks(range(len(cognitive_action_counts)), cognitive_action_counts.index, rotation=45, ha='right')
plt.title('Distribution of Cognitive Actions')
plt.ylabel('Count')

# Distribution of domains
plt.subplot(2, 3, 2)
domain_counts = df['domain'].value_counts()[:10]  # Top 10
plt.bar(range(len(domain_counts)), domain_counts.values)
plt.xticks(range(len(domain_counts)), domain_counts.index, rotation=45, ha='right')
plt.title('Top 10 Domains')
plt.ylabel('Count')

# Distribution of complexity levels
plt.subplot(2, 3, 3)
complexity_counts = df['complexity'].value_counts()
plt.pie(complexity_counts.values, labels=complexity_counts.index, autopct='%1.1f%%')
plt.title('Complexity Distribution')

# Text length distribution
plt.subplot(2, 3, 4)
plt.hist(df['text_length'], bins=20, edgecolor='black')
plt.title('Text Length Distribution')
plt.xlabel('Characters')
plt.ylabel('Frequency')

# Word count distribution
plt.subplot(2, 3, 5)
plt.hist(df['word_count'], bins=20, edgecolor='black')
plt.title('Word Count Distribution')
plt.xlabel('Words')
plt.ylabel('Frequency')

# Format type distribution
plt.subplot(2, 3, 6)
format_counts = df['format_type'].value_counts()
plt.pie(format_counts.values, labels=format_counts.index, autopct='%1.1f%%')
plt.title('Format Type Distribution')

plt.tight_layout()
plt.show()

# Show some sample texts by cognitive action
print("\n=== SAMPLE TEXTS BY COGNITIVE ACTION ===")
for action in df['cognitive_action'].unique()[:5]:  # First 5 actions
    sample = df[df['cognitive_action'] == action]['text'].iloc[0]
    print(f"\n{action.upper()}:")
    print(f"  {sample[:200]}...")

## 8. Export Data
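
The JSONL export writes one JSON object per line, so a large file can be read back one record at a time instead of being loaded whole. The sketch below shows one way to do that; `load_jsonl` is a hypothetical helper, not part of the generator modules.

```python
import json
from typing import Any, Dict, Iterator


def load_jsonl(filepath: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a JSONL file (hypothetical helper)."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Wrapping the call in `list(...)` gives a quick round-trip check that the exported file parses and has the expected number of records.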

In [None]:
# Export the generated dataset
timestamp = int(time.time())

# Export as JSONL (recommended for large datasets)
jsonl_filename = f"cognitive_actions_dataset_{timestamp}.jsonl"
stratified_generator.export_dataset(jsonl_filename, format="jsonl")

# Export as JSON (for smaller datasets)
json_filename = f"cognitive_actions_dataset_{timestamp}.json"
stratified_generator.export_dataset(json_filename, format="json")

# Export as CSV for analysis
csv_filename = f"cognitive_actions_analysis_{timestamp}.csv"
df.to_csv(csv_filename, index=False)

print(f"\nDataset exported:")
print(f"  JSONL: {jsonl_filename}")
print(f"  JSON: {json_filename}")
print(f"  CSV: {csv_filename}")

# Show file sizes
import os
for filename in [jsonl_filename, json_filename, csv_filename]:
    if os.path.exists(filename):
        size = os.path.getsize(filename)
        print(f"  {filename}: {size:,} bytes ({size/1024:.1f} KB)")

## 9. Large Scale Generation

⚠️ **Warning:** Large-scale generation can take hours and consume significant computational resources.
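
The configured delay is only part of the run time, because each request also waits on the model. A rough estimate adds a per-example latency to the delay. The sketch below uses a hypothetical helper, `estimate_hours`, and an illustrative latency of 2 seconds; replace it with a value measured on your own hardware.

```python
def estimate_hours(total_examples: int, delay_s: float, latency_s: float) -> float:
    """Rough wall-clock estimate, in hours, for a sequential generation run."""
    return total_examples * (delay_s + latency_s) / 3600


# 40,000 examples with a 0.1 s delay and an assumed 2 s per model call
print(f"{estimate_hours(40_000, 0.1, 2.0):.1f} hours")
```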

In [None]:
# Configuration for large-scale generation
LARGE_SCALE_CONFIG = {
    'phase1_round1': 10000,  # Core cognitive actions
    'phase1_round2': 5000,   # Action combinations  
    'phase2': 10000,         # Domain variations
    'phase3': 10000,         # Complexity variations
    'phase4': 2000,          # Negative examples
    'phase5': 2000,          # Dialogue format
    'phase6': 1000,          # Thought-stream format
    'model': 'llama3.2',
    'delay': 0.1,            # Delay between generations (seconds)
    'checkpoint_interval': 100  # Intended save frequency (the pipeline below checkpoints once per phase)
}

print("Large scale configuration:")
total_target = sum(v for k, v in LARGE_SCALE_CONFIG.items() if k.startswith('phase'))
print(f"Total target examples: {total_target:,}")
print(f"Lower bound from delays alone ({LARGE_SCALE_CONFIG['delay']}s each): {total_target * LARGE_SCALE_CONFIG['delay'] / 3600:.1f} hours, plus model inference time")

for phase, count in LARGE_SCALE_CONFIG.items():
    if phase.startswith('phase'):
        print(f"  {phase}: {count:,} examples")

In [None]:
# Uncomment and run this cell for large-scale generation
# WARNING: This will take a very long time!

# def run_large_scale_generation():
#     """Run the complete large-scale generation pipeline"""
#     
#     large_generator = CognitiveDataGenerator(ollama_client=ollama)
#     
#     phases = [
#         ('phase1_round1', 'single', LARGE_SCALE_CONFIG['phase1_round1']),
#         ('phase1_round2', 'chain', LARGE_SCALE_CONFIG['phase1_round2']),
#         ('phase2', 'single', LARGE_SCALE_CONFIG['phase2']),
#         ('phase3', 'single', LARGE_SCALE_CONFIG['phase3']),
#         ('phase4', 'negative', LARGE_SCALE_CONFIG['phase4']),
#         ('phase5', 'dialogue', LARGE_SCALE_CONFIG['phase5']),
#         ('phase6', 'thought_stream', LARGE_SCALE_CONFIG['phase6'])
#     ]
#     
#     for phase_name, template_type, target_count in phases:
#         print(f"\n🚀 Starting {phase_name}: {target_count:,} examples")
#         print(f"Template type: {template_type}")
#         
#         phase_examples = large_generator.generate_batch(
#             batch_size=target_count,
#             template_type=template_type,
#             model=LARGE_SCALE_CONFIG['model'],
#             delay=LARGE_SCALE_CONFIG['delay']
#         )
#         
#         print(f"✅ Completed {phase_name}: {len(phase_examples):,} examples")
#         
#         # Checkpoint save
#         checkpoint_filename = f"checkpoint_{phase_name}_{int(time.time())}.jsonl"
#         large_generator.export_dataset(checkpoint_filename, format="jsonl")
#         print(f"💾 Checkpoint saved: {checkpoint_filename}")
#         
#         # Print progress statistics
#         large_generator.print_statistics()
#     
#     # Final export
#     final_timestamp = int(time.time())
#     final_filename = f"cognitive_actions_complete_dataset_{final_timestamp}.jsonl"
#     large_generator.export_dataset(final_filename, format="jsonl")
#     
#     print(f"\n🎉 COMPLETE! Generated {len(large_generator.generated_examples):,} examples")
#     print(f"📁 Final dataset: {final_filename}")
#     
#     return large_generator

# # Uncomment the line below to start large-scale generation
# # large_generator = run_large_scale_generation()

print("Large-scale generation is disabled by default.")
print("Uncomment the function above and its final call, then run this cell to start large-scale generation.")
print("⚠️  Make sure you have sufficient time and resources before starting!")

## 10. Quality Control and Validation
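
The analysis cell in this section flags prompt artifacts by substring and counts only exact duplicate texts. Two stricter checks are sketched here as an optional complement: phrase-level artifact matching, and duplicate counting after normalizing case and punctuation. The phrase list and all function names are hypothetical and should be tuned to your own prompt templates.

```python
import re
from collections import Counter
from typing import List

# Hypothetical artifact phrases; adjust to match your prompt wording.
ARTIFACT_PATTERN = re.compile(
    r"\b(here is|as requested|requirements?:|output:)", re.IGNORECASE
)


def has_prompt_artifact(text: str) -> bool:
    """True if the text contains one of the listed artifact phrases."""
    return bool(ARTIFACT_PATTERN.search(text))


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def near_duplicate_count(texts: List[str]) -> int:
    """Count texts that repeat an earlier text once both are normalized."""
    counts = Counter(normalize(t) for t in texts)
    return sum(c - 1 for c in counts.values() if c > 1)
```

Matching whole phrases avoids flagging ordinary uses of words such as "output" or "example" that a plain substring test would count.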

In [None]:
def quality_control_analysis(examples):
    """Perform quality control analysis on generated examples"""
    
    print("=== QUALITY CONTROL ANALYSIS ===")
    
    # Basic statistics
    total_examples = len(examples)
    print(f"Total examples analyzed: {total_examples:,}")
    
    # Text length analysis
    text_lengths = [len(ex.text) for ex in examples]
    word_counts = [len(ex.text.split()) for ex in examples]
    
    print(f"\nText Length Statistics:")
    print(f"  Average characters: {np.mean(text_lengths):.1f}")
    print(f"  Average words: {np.mean(word_counts):.1f}")
    print(f"  Min/Max characters: {min(text_lengths)} / {max(text_lengths)}")
    print(f"  Min/Max words: {min(word_counts)} / {max(word_counts)}")
    
    # Coverage analysis
    cognitive_actions = set(ex.primary_cognitive_action for ex in examples)
    domains = set(ex.domain for ex in examples)
    
    print(f"\nCoverage Statistics:")
    print(f"  Cognitive actions covered: {len(cognitive_actions)}/{len(COGNITIVE_ACTIONS)} ({len(cognitive_actions)/len(COGNITIVE_ACTIONS)*100:.1f}%)")
    print(f"  Domains covered: {len(domains)}/{len(DOMAINS)} ({len(domains)/len(DOMAINS)*100:.1f}%)")
    
    # Quality indicators
    print(f"\nQuality Indicators:")
    
    # Check for very short examples
    very_short = sum(1 for length in text_lengths if length < 50)
    print(f"  Very short examples (<50 chars): {very_short} ({very_short/total_examples*100:.1f}%)")
    
    # Check for very long examples
    very_long = sum(1 for length in text_lengths if length > 1000)
    print(f"  Very long examples (>1000 chars): {very_long} ({very_long/total_examples*100:.1f}%)")
    
    # Check for examples that might contain prompt artifacts
    # (heuristic substring match on common words, so expect false positives)
    prompt_artifacts = sum(1 for ex in examples if any(word in ex.text.lower() for word in 
                          ['generate', 'example', 'sentence', 'requirement', 'output']))
    print(f"  Potential prompt artifacts: {prompt_artifacts} ({prompt_artifacts/total_examples*100:.1f}%)")
    
    # Diversity check: count exact duplicate texts
    all_texts = [ex.text for ex in examples]
    unique_texts = set(all_texts)
    print(f"  Unique texts: {len(unique_texts)}/{total_examples} ({len(unique_texts)/total_examples*100:.1f}%)")
    
    return {
        'total_examples': total_examples,
        'avg_length': np.mean(text_lengths),
        'avg_words': np.mean(word_counts),
        'cognitive_actions_covered': len(cognitive_actions),
        'domains_covered': len(domains),
        'very_short_pct': very_short/total_examples*100,
        'very_long_pct': very_long/total_examples*100,
        'prompt_artifacts_pct': prompt_artifacts/total_examples*100,
        'uniqueness_pct': len(unique_texts)/total_examples*100
    }

# Run quality control on current examples
if stratified_generator.generated_examples:
    qc_results = quality_control_analysis(stratified_generator.generated_examples)
else:
    print("No examples to analyze. Generate some examples first.")

In [None]:
# Manual review of sample examples
def review_samples(examples, n_samples=5):
    """Review a random sample of examples for quality"""
    
    if not examples:
        print("No examples to review.")
        return
    
    print("=== MANUAL QUALITY REVIEW ===")
    print(f"Reviewing {min(n_samples, len(examples))} random examples:\n")
    
    sample_examples = random.sample(examples, min(n_samples, len(examples)))
    
    for i, example in enumerate(sample_examples, 1):
        print(f"--- EXAMPLE {i} ---")
        print(f"Cognitive Action: {example.primary_cognitive_action}")
        print(f"Domain: {example.domain}")
        print(f"Complexity: {example.complexity}")
        print(f"Format: {example.format_type}")
        print(f"Length: {len(example.text)} chars, {len(example.text.split())} words")
        print()
        print("Text:")
        print(f"\"{example.text}\"")
        print()
        print("Subject:", example.metadata.get('subject', 'N/A'))
        print("Emotional State:", example.metadata.get('emotional_state', 'N/A'))
        print("Unique Angle:", example.metadata.get('unique_angle', 'N/A'))
        print("\n" + "="*60 + "\n")

# Review samples
if stratified_generator.generated_examples:
    review_samples(stratified_generator.generated_examples, n_samples=3)
else:
    print("No examples to review. Generate some examples first.")

## 🎉 Conclusion

You now have a complete cognitive action data generation system!

### What You've Built:
- **Scientific Foundation**: Based on established taxonomies from cognitive psychology
- **Flexible Architecture**: Modular system with variable pools and templates
- **Scalable Generation**: Scales from small test batches to 100,000+ examples
- **Quality Control**: Built-in analysis and validation tools
- **Multiple Formats**: Support for various example types (single actions, chains, dialogues, thought streams)

### Next Steps:
1. **Scale Up**: Run the large-scale generation pipeline to build your full dataset
2. **Fine-tune**: Adjust templates and variables based on your specific needs
3. **Validate**: Run quality control on larger datasets
4. **Train Models**: Use the generated data to train your cognitive action recognition models

### Tips for Production Use:
- Monitor generation quality regularly
- Save checkpoints during long generation runs
- Experiment with different Ollama models for variety
- Consider post-processing for consistency
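
One way to save checkpoints during a long run is to append finished records to a JSONL file every N examples, so an interruption loses at most one interval of work. This is a minimal sketch under stated assumptions: `append_checkpoint` and `run_with_checkpoints` are hypothetical helpers, and `make_record` stands in for whatever call produces one example as a dictionary.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def append_checkpoint(records: List[Dict[str, Any]], filepath: str) -> int:
    """Append records to a JSONL file and return how many were written."""
    with open(filepath, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)


def run_with_checkpoints(
    make_record: Callable[[], Optional[Dict[str, Any]]],
    total: int,
    interval: int,
    filepath: str,
) -> int:
    """Call make_record() `total` times, flushing to disk every `interval` records."""
    pending: List[Dict[str, Any]] = []
    written = 0
    for _ in range(total):
        record = make_record()
        if record is not None:  # skip failed generations
            pending.append(record)
        if len(pending) >= interval:
            written += append_checkpoint(pending, filepath)
            pending = []
    if pending:  # flush whatever is left after the last full interval
        written += append_checkpoint(pending, filepath)
    return written
```

Appending rather than rewriting the whole file keeps each checkpoint cheap, at the cost of needing a fresh file path for each run.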

**Happy data generating! 🚀**