# Cognitive Action Training Data Generator

This notebook generates high-quality training data for cognitive action recognition using Ollama locally or via API.

**Based on Scientific Taxonomies:**
- Bloom's Taxonomy (cognitive processes)
- Guilford's Structure of Intellect
- Krathwohl's Affective Domain
- Gross's Emotion Regulation Model
- Metacognitive Process Frameworks

**Target:** 100,000+ diverse examples of explicit cognitive and psychological actions

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install requests pandas numpy tqdm matplotlib seaborn

# If running locally with Ollama installed:
# !pip install ollama

import json
import time
import random
import requests
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("Dependencies installed successfully!")

## 2. Import Supporting Modules

The following Python modules are part of this repository:
- `variable_pools.py` - Contains cognitive action taxonomies and variable pools
- `prompt_templates.py` - Template system for generating diverse prompts
- `data_generator.py` - Main data generation engine

These files are already in the `datagen/` directory and will be imported in the cells below.

In [None]:
# Import variable pools module
import sys
import os

# Ensure the notebook's working directory is in the Python path.
# __file__ is not defined inside a notebook, so use the current directory directly.
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Import from the repository
from variable_pools import *

print("Variable pools loaded successfully!")
print(f"Total cognitive actions: {len(COGNITIVE_ACTIONS)}")
print(f"Total domains: {len(DOMAINS)}")
print(f"Total subjects: {len(SUBJECTS)}")
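
The path setup above only covers the notebook's working directory. If the notebook is launched from the repository root rather than from inside `datagen/`, the modules would not be found. A minimal sketch of a helper that handles both cases follows; `add_module_dir` is a hypothetical name, and the directory layout is an assumption based on the description above.

```python
import os
import sys


def add_module_dir(base_dir: str, subdir: str = "datagen") -> str:
    """Put base_dir/subdir on sys.path if it exists, else base_dir itself.

    Hypothetical helper: the repository layout is an assumption.
    """
    candidate = os.path.join(base_dir, subdir)
    target = candidate if os.path.isdir(candidate) else base_dir
    target = os.path.abspath(target)
    if target not in sys.path:
        sys.path.insert(0, target)
    return target
```

Calling `add_module_dir(os.getcwd())` before the imports should work from either location.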

In [None]:
# Import prompt templates module
from prompt_templates import *

print("Prompt templates loaded successfully!")
print(f"Single action templates: {len(SINGLE_ACTION_TEMPLATES)}")
print(f"Chain templates: {len(CHAIN_TEMPLATES)}")
print(f"Dialogue templates: {len(DIALOGUE_TEMPLATES)}")
print(f"Thought stream templates: {len(THOUGHT_STREAM_TEMPLATES)}")
print(f"Negative templates: {len(NEGATIVE_TEMPLATES)}")

## 3. Ollama Setup

Choose one of the following options:

### Option A: Local Ollama Installation (Recommended)

If you have Ollama running locally, modify the URL below to point to your instance:

In [None]:
import requests
import json

class OllamaClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def generate(self, model="llama3.2", prompt="", stream=False):
        """Generate text using Ollama API"""
        url = f"{self.base_url}/api/generate"
        data = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        
        try:
            # Pass stream through to requests so the body is not buffered when streaming
            response = self.session.post(url, json=data, timeout=120, stream=stream)
            response.raise_for_status()
            
            if stream:
                return response.iter_lines()
            else:
                return response.json()
                
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to Ollama: {e}")
            return None
    
    def list_models(self):
        """List available models"""
        url = f"{self.base_url}/api/tags"
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error connecting to Ollama: {e}")
            return None

# Initialize Ollama client
# CHANGE THIS URL TO YOUR OLLAMA INSTANCE
ollama = OllamaClient(base_url="http://localhost:11434")

# Test connection
models = ollama.list_models()
if models:
    print("✅ Connected to Ollama successfully!")
    print("Available models:")
    for model in models.get('models', []):
        print(f"  - {model.get('name', 'Unknown')}")
else:
    print("❌ Could not connect to Ollama. Please check your setup.")
    print("Make sure Ollama is running and accessible at the specified URL.")
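
When `stream=True`, `generate` returns `response.iter_lines()` rather than a parsed dict. Ollama's `/api/generate` endpoint streams newline-delimited JSON objects, each carrying a `response` text fragment and a `done` flag. A minimal sketch for joining those fragments is shown here; `collect_stream` is a hypothetical helper, not part of the repository.

```python
import json
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Join the `response` fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw:  # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)
```

For example, `collect_stream(ollama.generate(prompt="...", stream=True))` would return the full text, provided the request succeeded and did not return `None`.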

### Option B: Install Ollama in Colab (Experimental)

⚠️ **Note:** This is experimental and may not work reliably in all Colab environments.

In [None]:
# Uncomment and run this cell to install Ollama directly in Colab
# This is experimental and may not work in all environments

# !curl -fsSL https://ollama.ai/install.sh | sh
# 
# # Start Ollama in background
# import subprocess
# import time
# 
# # Start Ollama server
# ollama_process = subprocess.Popen(['ollama', 'serve'], 
#                                   stdout=subprocess.PIPE, 
#                                   stderr=subprocess.PIPE)
# 
# # Wait for server to start
# time.sleep(10)
# 
# # Pull a model
# !ollama pull llama3.2
# 
# print("Ollama installed and model pulled!")
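
The fixed `time.sleep(10)` above is a guess at how long the server takes to start. An alternative is to poll until the server answers. The sketch below is generic: `wait_until` is a hypothetical helper, and the probe can be any callable that returns `True` once the server is ready, for instance `lambda: ollama.list_models() is not None`.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat a failing probe as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The function returns `False` instead of raising, so the caller decides whether a slow start is fatal.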

## 4. Data Generation System

In [None]:
# Import data generator module
from data_generator import *

print("Data generator loaded successfully!")
print("CognitiveDataGenerator class ready to use")

In [None]:
# Verify all modules are imported and working
print("All modules imported successfully!")
print(f"Ready to generate data with {len(COGNITIVE_ACTIONS)} cognitive actions")
print("\nAvailable components:")
print("  - Variable pools: ✓")
print("  - Prompt templates: ✓")
print("  - Data generator: ✓")
print(f"  - Ollama client: {'✓' if 'ollama' in dir() else '⚠️ Configure in section 3'}")

## 5. Test the System

In [None]:
# Test prompt generation without LLM
print("Testing prompt generation...\n")

# Generate a few sample prompts
for i in range(3):
    prompt, params = generate_prompt(template_type="single", iteration_number=i)
    print(f"=== PROMPT {i+1} ===")
    print(f"Cognitive Action: {params['cognitive_action']}")
    print(f"Domain: {params['domain']}")
    print(f"Subject: {params['subject']}")
    print()
    print(prompt)
    print("\n" + "="*50 + "\n")

In [None]:
# Test with Ollama (if connected)
generator = CognitiveDataGenerator(ollama_client=ollama)

print("Testing single example generation...")

# Generate a single example
example = generator.generate_single_example(
    cognitive_action="reconsidering",
    template_type="single",
    model="llama3.2"  # Change this to your available model
)

if example:
    print("\n=== GENERATED EXAMPLE ===")
    print(f"Cognitive Action: {example.primary_cognitive_action}")
    print(f"Domain: {example.domain}")
    print(f"Complexity: {example.complexity}")
    print(f"Format: {example.format_type}")
    print()
    print("Generated Text:")
    print(example.text)
    print()
    print("Metadata:")
    for key, value in example.metadata.items():
        if key != 'prompt_used':  # Skip the long prompt
            print(f"  {key}: {value}")
else:
    print("Failed to generate example. Check Ollama connection.")
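
`OllamaClient.generate` returns `None` on any request error, so a transient network failure looks the same as a permanent one. A minimal sketch of a retry wrapper with exponential backoff follows; `retry_on_none` is a hypothetical helper, and the attempt count and delays are assumptions to tune.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_on_none(fn: Callable[[], Optional[T]], attempts: int = 3,
                  base_delay: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fn until it returns a non-None value, doubling the wait each time."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(base_delay * (2 ** attempt))
    return None
```

Usage could look like `retry_on_none(lambda: ollama.generate(model="llama3.2", prompt="..."))`. The `sleep` parameter exists only so the wait can be replaced in tests.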

## 6. Batch Generation

In [None]:
# Generate a small batch for testing
print("Generating small test batch...")

test_examples = generator.generate_batch(
    batch_size=5,
    cognitive_action=None,  # Random actions
    template_type="single",
    model="llama3.2",
    delay=1.0  # 1 second delay between generations
)

print(f"\nGenerated {len(test_examples)} examples")

# Show a few examples
for i, example in enumerate(test_examples[:3]):
    print(f"\n=== EXAMPLE {i+1} ===")
    print(f"Action: {example.primary_cognitive_action}")
    print(f"Domain: {example.domain}")
    print(f"Text: {example.text[:200]}...")

# Show statistics
generator.print_statistics()

In [None]:
# Generate stratified dataset (smaller for testing)
print("Generating stratified test dataset...")

# Create new generator for clean stats
stratified_generator = CognitiveDataGenerator(ollama_client=ollama)

stratified_examples = stratified_generator.generate_stratified_dataset(
    total_examples=100,  # Start small for testing
    model="llama3.2"
)

print(f"\nGenerated {len(stratified_examples)} stratified examples")
stratified_generator.print_statistics()
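
The internals of `generate_stratified_dataset` live in `data_generator.py` and are not shown in this notebook. As an illustration of the general idea only, stratification usually means splitting the target count across categories so that none is over- or under-represented. The sketch below assumes an even split; `allocate_evenly` is a hypothetical name and may not match what the repository does.

```python
from typing import Dict, Sequence


def allocate_evenly(total: int, categories: Sequence[str]) -> Dict[str, int]:
    """Split total across categories so counts differ by at most one."""
    if not categories:
        raise ValueError("categories must not be empty")
    base, remainder = divmod(total, len(categories))
    # The first `remainder` categories each receive one extra example.
    return {name: base + (1 if i < remainder else 0)
            for i, name in enumerate(categories)}
```

Comparing such a quota with the counts reported by `print_statistics()` is one way to spot categories where generation failed more often.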

## 7. Data Analysis and Visualization

In [None]:
# Convert to DataFrame for analysis
data = []
for example in stratified_generator.generated_examples:
    data.append({
        'text': example.text,
        'cognitive_action': example.primary_cognitive_action,
        'domain': example.domain,
        'complexity': example.complexity,
        'format_type': example.format_type,
        'text_length': len(example.text),
        'word_count': len(example.text.split()),
        'subject': example.metadata.get('subject', ''),
        'emotional_state': example.metadata.get('emotional_state', '')
    })

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

print("\nBasic statistics:")
print(df.describe())

In [None]:
# Visualizations
plt.figure(figsize=(15, 10))

# Distribution of cognitive actions
plt.subplot(2, 3, 1)
cognitive_action_counts = df['cognitive_action'].value_counts()
plt.bar(range(len(cognitive_action_counts)), cognitive_action_counts.values)
plt.xticks(range(len(cognitive_action_counts)), cognitive_action_counts.index, rotation=45, ha='right')
plt.title('Distribution of Cognitive Actions')
plt.ylabel('Count')

# Distribution of domains
plt.subplot(2, 3, 2)
domain_counts = df['domain'].value_counts().head(10)  # Top 10
plt.bar(range(len(domain_counts)), domain_counts.values)
plt.xticks(range(len(domain_counts)), domain_counts.index, rotation=45, ha='right')
plt.title('Top 10 Domains')
plt.ylabel('Count')

# Distribution of complexity levels
plt.subplot(2, 3, 3)
complexity_counts = df['complexity'].value_counts()
plt.pie(complexity_counts.values, labels=complexity_counts.index, autopct='%1.1f%%')
plt.title('Complexity Distribution')

# Text length distribution
plt.subplot(2, 3, 4)
plt.hist(df['text_length'], bins=20, edgecolor='black')
plt.title('Text Length Distribution')
plt.xlabel('Characters')
plt.ylabel('Frequency')

# Word count distribution
plt.subplot(2, 3, 5)
plt.hist(df['word_count'], bins=20, edgecolor='black')
plt.title('Word Count Distribution')
plt.xlabel('Words')
plt.ylabel('Frequency')

# Format type distribution
plt.subplot(2, 3, 6)
format_counts = df['format_type'].value_counts()
plt.pie(format_counts.values, labels=format_counts.index, autopct='%1.1f%%')
plt.title('Format Type Distribution')

plt.tight_layout()
plt.show()

# Show some sample texts by cognitive action
print("\n=== SAMPLE TEXTS BY COGNITIVE ACTION ===")
for action in df['cognitive_action'].unique()[:5]:  # First 5 actions
    sample = df[df['cognitive_action'] == action]['text'].iloc[0]
    print(f"\n{action.upper()}:")
    print(f"  {sample[:200]}...")

## 8. Export Data

In [None]:
# Export the generated dataset
timestamp = int(time.time())

# Export as JSONL (recommended for large datasets)
jsonl_filename = f"cognitive_actions_dataset_{timestamp}.jsonl"
stratified_generator.export_dataset(jsonl_filename, format="jsonl")

# Export as JSON (for smaller datasets)
json_filename = f"cognitive_actions_dataset_{timestamp}.json"
stratified_generator.export_dataset(json_filename, format="json")

# Export as CSV for analysis
csv_filename = f"cognitive_actions_analysis_{timestamp}.csv"
df.to_csv(csv_filename, index=False)

print(f"\nDataset exported:")
print(f"  JSONL: {jsonl_filename}")
print(f"  JSON: {json_filename}")
print(f"  CSV: {csv_filename}")

# Show file sizes
import os
for filename in [jsonl_filename, json_filename, csv_filename]:
    if os.path.exists(filename):
        size = os.path.getsize(filename)
        print(f"  {filename}: {size:,} bytes ({size/1024:.1f} KB)")
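
After exporting, it can be worth reading the JSONL file back to confirm that every line parses. The sketch below assumes `export_dataset` writes one JSON object per line, which is the usual JSONL convention but is not confirmed here; `load_jsonl` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a truncated export is easy to find
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
    return records
```

A quick check would be comparing `len(load_jsonl(jsonl_filename))` with `len(stratified_generator.generated_examples)`.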

## 9. Large Scale Generation

⚠️ **Warning:** Large scale generation can take hours and consume significant computational resources.

In [None]:
# Configuration for large-scale generation
LARGE_SCALE_CONFIG = {
    'phase1_round1': 10000,  # Core cognitive actions
    'phase1_round2': 5000,   # Action combinations  
    'phase2': 10000,         # Domain variations
    'phase3': 10000,         # Complexity variations
    'phase4': 2000,          # Negative examples
    'phase5': 2000,          # Dialogue format
    'phase6': 1000,          # Thought-stream format
    'model': 'llama3.2',
    'delay': 0.1,            # Delay between generations (seconds)
    'checkpoint_interval': 100  # Intended save interval in examples (the pipeline below saves once per phase and does not read this value)
}

print("Large scale configuration:")
total_target = sum(v for k, v in LARGE_SCALE_CONFIG.items() if k.startswith('phase'))
print(f"Total target examples: {total_target:,}")
print(f"Delay overhead alone (at {LARGE_SCALE_CONFIG['delay']}s per example, excluding model generation time): {total_target * LARGE_SCALE_CONFIG['delay'] / 3600:.1f} hours")

for phase, count in LARGE_SCALE_CONFIG.items():
    if phase.startswith('phase'):
        print(f"  {phase}: {count:,} examples")
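
The figure printed above multiplies the example count by the configured delay only. Each request also spends time in the model, which usually dominates. A rough sketch of a fuller estimate follows; `estimate_hours` is a hypothetical helper, and the per-example generation time is an assumption that should be measured from the small test batch.

```python
def estimate_hours(n_examples: int, delay_s: float,
                   generation_s: float) -> float:
    """Rough sequential runtime: each example costs one generation plus one delay."""
    return n_examples * (delay_s + generation_s) / 3600
```

With the 40,000 examples configured above, a 0.1 s delay, and an assumed 5 s per generation, `estimate_hours(40_000, 0.1, 5.0)` comes to roughly 56.7 hours.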

In [None]:
# Uncomment and run this cell for large-scale generation
# WARNING: This will take a very long time!

# def run_large_scale_generation():
#     """Run the complete large-scale generation pipeline"""
#     
#     large_generator = CognitiveDataGenerator(ollama_client=ollama)
#     
#     phases = [
#         ('phase1_round1', 'single', LARGE_SCALE_CONFIG['phase1_round1']),
#         ('phase1_round2', 'chain', LARGE_SCALE_CONFIG['phase1_round2']),
#         ('phase2', 'single', LARGE_SCALE_CONFIG['phase2']),
#         ('phase3', 'single', LARGE_SCALE_CONFIG['phase3']),
#         ('phase4', 'negative', LARGE_SCALE_CONFIG['phase4']),
#         ('phase5', 'dialogue', LARGE_SCALE_CONFIG['phase5']),
#         ('phase6', 'thought_stream', LARGE_SCALE_CONFIG['phase6'])
#     ]
#     
#     for phase_name, template_type, target_count in phases:
#         print(f"\n🚀 Starting {phase_name}: {target_count:,} examples")
#         print(f"Template type: {template_type}")
#         
#         phase_examples = large_generator.generate_batch(
#             batch_size=target_count,
#             template_type=template_type,
#             model=LARGE_SCALE_CONFIG['model'],
#             delay=LARGE_SCALE_CONFIG['delay']
#         )
#         
#         print(f"✅ Completed {phase_name}: {len(phase_examples):,} examples")
#         
#         # Checkpoint save
#         checkpoint_filename = f"checkpoint_{phase_name}_{int(time.time())}.jsonl"
#         large_generator.export_dataset(checkpoint_filename, format="jsonl")
#         print(f"💾 Checkpoint saved: {checkpoint_filename}")
#         
#         # Print progress statistics
#         large_generator.print_statistics()
#     
#     # Final export
#     final_timestamp = int(time.time())
#     final_filename = f"cognitive_actions_complete_dataset_{final_timestamp}.jsonl"
#     large_generator.export_dataset(final_filename, format="jsonl")
#     
#     print(f"\n🎉 COMPLETE! Generated {len(large_generator.generated_examples):,} examples")
#     print(f"📁 Final dataset: {final_filename}")
#     
#     return large_generator

# # Uncomment the line below to start large-scale generation
# # large_generator = run_large_scale_generation()

print("Large-scale generation function is provided above, commented out.")
print("Uncomment the function definition and the final call, then run this cell to start large-scale generation.")
print("⚠️  Make sure you have sufficient time and resources before starting!")
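
The pipeline above saves a checkpoint once per phase, so a crash midway through a 10,000-example phase loses that phase's progress, and the `checkpoint_interval` setting in `LARGE_SCALE_CONFIG` is not passed anywhere. A minimal sketch of chunked generation that saves after every chunk is shown below; `generate_in_chunks` is a hypothetical helper, and both callbacks are supplied by the caller.

```python
from typing import Callable, List


def generate_in_chunks(generate_chunk: Callable[[int], List],
                       total: int, chunk_size: int,
                       on_checkpoint: Callable[[List], None]) -> List:
    """Generate `total` items in chunks, calling on_checkpoint after each chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    collected: List = []
    remaining = total
    while remaining > 0:
        size = min(chunk_size, remaining)
        collected.extend(generate_chunk(size))
        # Count requested items, not produced ones, so failures cannot loop forever
        remaining -= size
        on_checkpoint(collected)
    return collected
```

In this notebook, `generate_chunk` could wrap `large_generator.generate_batch(batch_size=n, ...)` and `on_checkpoint` could call `large_generator.export_dataset(...)`, with `chunk_size` taken from `LARGE_SCALE_CONFIG['checkpoint_interval']`.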

## 10. Quality Control and Validation

In [None]:
def quality_control_analysis(examples):
    """Perform quality control analysis on generated examples"""
    
    print("=== QUALITY CONTROL ANALYSIS ===")
    
    # Basic statistics
    total_examples = len(examples)
    print(f"Total examples analyzed: {total_examples:,}")
    
    # Text length analysis
    text_lengths = [len(ex.text) for ex in examples]
    word_counts = [len(ex.text.split()) for ex in examples]
    
    print(f"\nText Length Statistics:")
    print(f"  Average characters: {np.mean(text_lengths):.1f}")
    print(f"  Average words: {np.mean(word_counts):.1f}")
    print(f"  Min/Max characters: {min(text_lengths)} / {max(text_lengths)}")
    print(f"  Min/Max words: {min(word_counts)} / {max(word_counts)}")
    
    # Coverage analysis
    cognitive_actions = set(ex.primary_cognitive_action for ex in examples)
    domains = set(ex.domain for ex in examples)
    
    print(f"\nCoverage Statistics:")
    print(f"  Cognitive actions covered: {len(cognitive_actions)}/{len(COGNITIVE_ACTIONS)} ({len(cognitive_actions)/len(COGNITIVE_ACTIONS)*100:.1f}%)")
    print(f"  Domains covered: {len(domains)}/{len(DOMAINS)} ({len(domains)/len(DOMAINS)*100:.1f}%)")
    
    # Quality indicators
    print(f"\nQuality Indicators:")
    
    # Check for very short examples
    very_short = sum(1 for length in text_lengths if length < 50)
    print(f"  Very short examples (<50 chars): {very_short} ({very_short/total_examples*100:.1f}%)")
    
    # Check for very long examples
    very_long = sum(1 for length in text_lengths if length > 1000)
    print(f"  Very long examples (>1000 chars): {very_long} ({very_long/total_examples*100:.1f}%)")
    
    # Check for examples that might contain prompt artifacts
    prompt_artifacts = sum(1 for ex in examples if any(word in ex.text.lower() for word in 
                          ['generate', 'example', 'sentence', 'requirement', 'output']))
    print(f"  Potential prompt artifacts: {prompt_artifacts} ({prompt_artifacts/total_examples*100:.1f}%)")
    
    # Diversity check - count exact duplicate texts
    all_texts = [ex.text for ex in examples]
    unique_texts = set(all_texts)
    print(f"  Unique texts: {len(unique_texts)}/{total_examples} ({len(unique_texts)/total_examples*100:.1f}%)")
    
    return {
        'total_examples': total_examples,
        'avg_length': np.mean(text_lengths),
        'avg_words': np.mean(word_counts),
        'cognitive_actions_covered': len(cognitive_actions),
        'domains_covered': len(domains),
        'very_short_pct': very_short/total_examples*100,
        'very_long_pct': very_long/total_examples*100,
        'prompt_artifacts_pct': prompt_artifacts/total_examples*100,
        'uniqueness_pct': len(unique_texts)/total_examples*100
    }

# Run quality control on current examples
if stratified_generator.generated_examples:
    qc_results = quality_control_analysis(stratified_generator.generated_examples)
else:
    print("No examples to analyze. Generate some examples first.")
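
The uniqueness figure above counts only exact duplicates, so two texts that differ in case, punctuation, or spacing are treated as distinct. A minimal sketch of a looser check follows; `normalize` and `repeated_texts` are hypothetical helpers, and the normalization rules are an assumption that may be too aggressive for some formats.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def repeated_texts(texts: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (normalized_text, count) pairs that occur more than once."""
    counts = Counter(normalize(t) for t in texts)
    return [(t, n) for t, n in counts.most_common() if n > 1]
```

Passing `[ex.text for ex in stratified_generator.generated_examples]` would list the near-duplicates, most frequent first.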

In [None]:
# Manual review of sample examples
def review_samples(examples, n_samples=5):
    """Review a random sample of examples for quality"""
    
    if not examples:
        print("No examples to review.")
        return
    
    print("=== MANUAL QUALITY REVIEW ===")
    print(f"Reviewing {min(n_samples, len(examples))} random examples:\n")
    
    sample_examples = random.sample(examples, min(n_samples, len(examples)))
    
    for i, example in enumerate(sample_examples, 1):
        print(f"--- EXAMPLE {i} ---")
        print(f"Cognitive Action: {example.primary_cognitive_action}")
        print(f"Domain: {example.domain}")
        print(f"Complexity: {example.complexity}")
        print(f"Format: {example.format_type}")
        print(f"Length: {len(example.text)} chars, {len(example.text.split())} words")
        print()
        print("Text:")
        print(f"\"{example.text}\"")
        print()
        print("Subject:", example.metadata.get('subject', 'N/A'))
        print("Emotional State:", example.metadata.get('emotional_state', 'N/A'))
        print("Unique Angle:", example.metadata.get('unique_angle', 'N/A'))
        print("\n" + "="*60 + "\n")

# Review samples
if stratified_generator.generated_examples:
    review_samples(stratified_generator.generated_examples, n_samples=3)
else:
    print("No examples to review. Generate some examples first.")

## 🎉 Conclusion

You now have a complete cognitive action data generation system!

### What You've Built:
- **Scientific Foundation**: Based on established taxonomies from cognitive psychology
- **Flexible Architecture**: Modular system with variable pools and templates
- **Scalable Generation**: Scales from small test batches to 100,000+ examples
- **Quality Control**: Built-in analysis and validation tools
- **Multiple Formats**: Support for various example types (single actions, chains, dialogues, thought streams)

### Next Steps:
1. **Scale Up**: Use the large-scale generation for your full dataset
2. **Fine-tune**: Adjust templates and variables based on your specific needs
3. **Validate**: Run quality control on larger datasets
4. **Train Models**: Use the generated data to train your cognitive action recognition models

### Tips for Production Use:
- Monitor generation quality regularly
- Save checkpoints during long generation runs
- Experiment with different Ollama models for variety
- Consider post-processing for consistency
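
As one illustration of the post-processing tip above, chat models often wrap the requested text in a preamble or in quotation marks. The sketch below strips both. `clean_generated_text` is a hypothetical helper, and the preamble patterns are assumptions that should be adjusted after inspecting real output, since a legitimate sentence starting with "Here is ...:" would also be trimmed.

```python
import re

# Assumed chat-style preambles such as "Here is an example:" or "Sure, here you go:"
_PREAMBLE = re.compile(r"^(?:here is|here's|sure)\b[^\n:]*:\s*", re.IGNORECASE)


def clean_generated_text(text: str) -> str:
    """Strip surrounding whitespace, one leading preamble, and wrapping quotes."""
    text = text.strip()
    text = _PREAMBLE.sub("", text, count=1)
    # Remove one pair of wrapping quotes, but only if that quote character
    # does not also appear inside the text (which would indicate dialogue).
    if (len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'"
            and text[0] not in text[1:-1]):
        text = text[1:-1].strip()
    return text
```

Running the quality control analysis before and after cleaning gives a rough sense of how much the step changes.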

**Happy data generating! 🚀**