# Arbor YAML Trainer Tutorial

This notebook demonstrates how to use the Arbor YAML training system with adaptive context windows. We'll show you how to:

1. **Create and customize training configurations**
2. **Train models with dynamic growth and adaptive context**
3. **Monitor training progress and model adaptation**
4. **Test the trained model with different task types**

The YAML trainer keeps the workflow short: edit a configuration file, then run the trainer.

In [None]:
# Setup and Imports
import os
import sys
import yaml
import torch
from pathlib import Path

# Add arbor to path
sys.path.insert(0, str(Path.cwd()))

# Check if we're in the right directory
if not Path("arbor").exists():
    print("❌ Please run this notebook from the arbor-o1-living-ai root directory")
    print("Current directory:", Path.cwd())
else:
    print("✅ Found arbor directory")
    print("📍 Working directory:", Path.cwd())

## Step 1: Create Your Training Configuration

The YAML trainer uses configuration files to specify everything about your training run. Let's start by examining and customizing a training configuration.
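
Before writing a full configuration, it can help to check that the main sections are present. The sketch below does this for a plain dictionary. `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names used only for illustration; they are not part of Arbor, and the trainer's own validation may check more (or different) things.

```python
# Hypothetical helper, not part of Arbor: report absent top-level sections.
# The section names are the ones used by the configuration in this tutorial.
REQUIRED_SECTIONS = ("model", "datasets", "training")


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections that are absent from config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


complete = {"model": {}, "datasets": [], "training": {}}
incomplete = {"model": {}}

print(missing_sections(complete))    # []
print(missing_sections(incomplete))  # ['datasets', 'training']
```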

In [None]:
# Let's look at the example configuration first
config_path = Path("configs/example_config.yaml")

if config_path.exists():
    with open(config_path, 'r') as f:
        config_content = f.read()
    
    print("📋 Example Configuration Structure:")
    print("=" * 50)
    
    # Show the first 30 lines to get an overview
    lines = config_content.splitlines()
    for line_number, line in enumerate(lines[:30], start=1):
        print(f"{line_number:2d}: {line}")
    
    if len(lines) > 30:
        print(f"... ({len(lines) - 30} more lines)")
else:
    print("❌ Example config not found. Let's create one!")

In [None]:
# Create a custom training configuration for this tutorial
tutorial_config = {
    'model': {
        'vocab_size': 128000,
        'hidden_size': 768,      # Smaller for tutorial
        'num_layers': 12,        # Fewer layers for faster training
        'num_heads': 12,
        'intermediate_size': 3072,
        'max_position_embeddings': 32768,  # 32K context for tutorial
        
        'growth': {
            'enabled': True,
            'factor': 1.5,         # Moderate growth
            'max_steps': 4,
            'threshold': 0.9
        },
        
        'adaptive_context': {
            'enabled': True,
            'min_context_length': 512,
            'max_context_length': 32768,
            'context_router_layers': 3,
            'task_types': ['chat', 'code', 'reasoning', 'document'],
            'context_lengths': [512, 1024, 2048, 4096, 8192, 16384, 32768],
            'hardware_aware': True,
            'memory_threshold': 0.8
        }
    },
    
    'datasets': [
        {
            'name': 'tiny_stories',
            'source': 'roneneldan/TinyStories',
            'split': 'train[:1000]',  # Small subset for tutorial
            'text_column': 'text',
            'preprocessing': {
                'prefix': 'Story: ',
                'suffix': '',
                'max_length': 1024
            }
        },
        {
            'name': 'code_samples',
            'source': 'codeparrot/github-code-clean',
            'split': 'train[:500]',   # Small subset
            'text_column': 'code',
            'preprocessing': {
                'prefix': '# Python Code:\n',
                'suffix': '\n# End',
                'max_length': 2048
            }
        }
    ],
    
    'training': {
        'output_dir': './tutorial_output',
        'learning_rate': 3e-5,
        'warmup_steps': 50,
        'steps_per_dataset': 100,  # Short training for tutorial
        'per_device_train_batch_size': 2,
        'gradient_accumulation_steps': 4,
        'eval_steps': 25,
        'save_steps': 50,
        'logging_steps': 10,
        'fp16': True,
        'gradient_checkpointing': True
    },
    
    'logging': {
        'wandb': {
            'enabled': False,  # Disabled for tutorial
            'project': 'arbor-tutorial'
        },
        'console': {
            'enabled': True,
            'level': 'INFO'
        }
    },
    
    'huggingface': {
        'upload': {
            'enabled': False,  # Disabled for tutorial
            'repository': 'your-username/arbor-tutorial'
        }
    }
}

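# Save the tutorial configuration (create configs/ first if it is missing)
tutorial_config_path = Path("configs/tutorial_config.yaml")
tutorial_config_path.parent.mkdir(parents=True, exist_ok=True)
with open(tutorial_config_path, 'w') as f:
    # sort_keys=False keeps the sections in the order defined above
    yaml.dump(tutorial_config, f, default_flow_style=False, indent=2, sort_keys=False)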

print("✅ Created tutorial configuration:")
print(f"📁 Saved to: {tutorial_config_path}")
print("\n🔧 Configuration highlights:")
print(f"   Model size: ~350M parameters (smaller for tutorial)")
print(f"   Context range: 512 - 32K tokens")
print(f"   Datasets: TinyStories + Code samples")
print(f"   Training: 200 steps total (100 per dataset)")
print(f"   Growth: Enabled with 1.5x factor")
print(f"   Adaptive context: Enabled with 4 task types")

## Step 2: Understanding the YAML Configuration

Let's examine the key sections of our configuration and what they control:
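
As a rough cross-check on model size, the sketch below applies the usual back-of-the-envelope count for a decoder-only transformer to the tutorial's settings. It is an assumption-laden estimate, not Arbor's own accounting: `estimate_params` is a hypothetical helper that counts only the transformer blocks and one token embedding table, and it ignores biases, layer norms, position embeddings, an untied output head, and any Arbor-specific components.

```python
def estimate_params(vocab_size: int, hidden_size: int, num_layers: int) -> int:
    """Rough parameter count for a standard decoder-only transformer.

    Assumes 4*d^2 attention weights and 8*d^2 feed-forward weights per layer
    (that is, intermediate_size = 4*d) plus one vocab_size x d embedding table.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings


# Values from the tutorial configuration
total = estimate_params(vocab_size=128_000, hidden_size=768, num_layers=12)
print(f"{total:,}")  # 183,238,656
```

An untied output head would add another `vocab_size × hidden_size` weights, and learned position embeddings would add `max_position_embeddings × hidden_size`, so the count reported by the model itself can be noticeably higher than this estimate.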

In [None]:
# Let's explore each section of our configuration
print("🔍 Configuration Analysis")
print("=" * 50)

# Model configuration
model_config = tutorial_config['model']
print("\n🤖 MODEL CONFIGURATION:")
print(f"   Vocabulary: {model_config['vocab_size']:,} tokens (Hermes-4-405B)")
print(f"   Architecture: {model_config['num_layers']} layers × {model_config['hidden_size']} dim")
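# Rough estimate: 12·d² weights per transformer layer plus the token embedding table.
# The exact count is reported by trainer.model.param_count() once the model is built.
approx_params = (12 * model_config['num_layers'] * model_config['hidden_size'] ** 2
                 + model_config['vocab_size'] * model_config['hidden_size'])
print(f"   Parameters (rough estimate): ~{approx_params / 1e6:.0f}M")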

# Growth settings
growth = model_config['growth']
print(f"\n🌱 GROWTH SETTINGS:")
print(f"   Enabled: {growth['enabled']}")
print(f"   Growth factor: {growth['factor']}x")
print(f"   Max growth steps: {growth['max_steps']}")
print(f"   Trigger threshold: {growth['threshold']}")

# Adaptive context
adaptive = model_config['adaptive_context']
print(f"\n🧠 ADAPTIVE CONTEXT:")
print(f"   Enabled: {adaptive['enabled']}")
print(f"   Context range: {adaptive['min_context_length']:,} - {adaptive['max_context_length']:,}")
print(f"   Task types: {len(adaptive['task_types'])} ({', '.join(adaptive['task_types'])})")
print(f"   Context options: {len(adaptive['context_lengths'])} levels")

# Datasets
datasets = tutorial_config['datasets']
print(f"\n📚 DATASETS:")
for i, dataset in enumerate(datasets, 1):
    print(f"   {i}. {dataset['name']}: {dataset['source']} ({dataset['split']})")
    print(f"      Max length: {dataset['preprocessing']['max_length']} tokens")

# Training
training = tutorial_config['training']
print(f"\n🎯 TRAINING:")
print(f"   Learning rate: {training['learning_rate']}")
print(f"   Effective batch size (per device): {training['per_device_train_batch_size']} × {training['gradient_accumulation_steps']} = {training['per_device_train_batch_size'] * training['gradient_accumulation_steps']}")
print(f"   Total steps: {len(datasets)} × {training['steps_per_dataset']} = {len(datasets) * training['steps_per_dataset']}")
print(f"   Mixed precision: {training['fp16']}")

## Step 3: Initialize the YAML Trainer

Now let's create and initialize the YAML trainer with our configuration:

In [None]:
# Import the YAML trainer
try:
    from arbor.train.yaml_trainer import ArborYAMLTrainer
    print("✅ Successfully imported ArborYAMLTrainer")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Make sure you're in the correct directory and arbor is in the path")
    raise

In [None]:
# Initialize the trainer
print("🚀 Initializing YAML Trainer...")
print("=" * 50)

try:
    trainer = ArborYAMLTrainer(str(tutorial_config_path))
    print("✅ Trainer initialized successfully!")
    
    # The trainer automatically validates the configuration
    print("\n📋 Configuration loaded and validated")
    
except Exception as e:
    print(f"❌ Trainer initialization failed: {e}")
    import traceback
    traceback.print_exc()

## Step 4: Setup Components

The trainer needs to set up several components before training. Let's do this step by step to see what's happening:
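
The cells in this step catch exceptions and keep going, so a failure in one step can surface later as a confusing error. One way to make the ordering explicit is sketched below. `run_steps` is a hypothetical helper, not part of Arbor, and the stand-in callables take the place of the real setup methods.

```python
from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run named steps in order and return the names that completed.

    Stops at the first step that raises, since later steps may depend on it.
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f"{name} failed: {exc}")
            break
        completed.append(name)
    return completed


def failing_step() -> None:
    """Stand-in for a setup step that cannot finish."""
    raise RuntimeError("no network")


# Stand-in callables; with a real trainer these would be
# trainer.setup_tokenizer, trainer.setup_model, and trainer.load_datasets.
done = run_steps([
    ("tokenizer", lambda: None),
    ("model", failing_step),
    ("datasets", lambda: None),
])
print(done)  # ['tokenizer']
```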

In [None]:
# Step 4a: Setup tokenizer
print("📥 Setting up tokenizer...")
try:
    trainer.setup_tokenizer()
    print(f"✅ Tokenizer ready: {len(trainer.tokenizer):,} vocabulary")
    
    # Test the tokenizer
    test_text = "Hello, this is a test of the Hermes tokenizer!"
    tokens = trainer.tokenizer.encode(test_text)
    print(f"🧪 Test encoding: '{test_text}' → {len(tokens)} tokens")
    print(f"   First 10 tokens: {tokens[:10]}")
    
except Exception as e:
    print(f"❌ Tokenizer setup failed: {e}")
    print("This might be due to internet connectivity or HuggingFace access")

In [None]:
# Step 4b: Setup model
print("\n🤖 Setting up Arbor model...")
try:
    trainer.setup_model()
    print(f"✅ Model created: {trainer.model.param_count():,} parameters")
    
    # Show model architecture details
    config = trainer.model.config
    print(f"\n📊 Model details:")
    print(f"   Architecture: {config.num_layers} layers")
    print(f"   Hidden size: {config.dim}")
    print(f"   Attention heads: {config.num_heads}")
    print(f"   FFN dimension: {config.ffn_dim}")
    print(f"   Max sequence length: {config.max_seq_length:,}")
    
    # Show adaptive context info
    if hasattr(trainer.model, 'get_context_info'):
        context_info = trainer.model.get_context_info()
        print(f"\n🧠 Adaptive context info:")
        print(f"   Enabled: {context_info['adaptive_context_enabled']}")
        if context_info['adaptive_context_enabled']:
            print(f"   Current context: {context_info['current_context_length']:,}")
            print(f"   Context range: {context_info['min_context_length']:,} - {context_info['max_context_length']:,}")
    
except Exception as e:
    print(f"❌ Model setup failed: {e}")
    import traceback
    traceback.print_exc()

In [None]:
# Step 4c: Load datasets
print("\n📚 Loading datasets...")
try:
    trainer.load_datasets()
    
    print(f"✅ Datasets loaded: {len(trainer.datasets)}")
    
    # Show dataset info
    for name, dataset in trainer.datasets.items():
        print(f"\n📊 Dataset: {name}")
        print(f"   Size: {len(dataset):,} examples")
        
        # Show a sample
        if len(dataset) > 0:
            sample = dataset[0]
            input_ids = sample['input_ids']
            decoded = trainer.tokenizer.decode(input_ids[:50])  # First 50 tokens
            print(f"   Sample: {decoded}...")
            print(f"   Token length: {len(input_ids)}")
    
except Exception as e:
    print(f"❌ Dataset loading failed: {e}")
    print("This might be due to internet connectivity or dataset access issues")
    import traceback
    traceback.print_exc()

## Step 5: Training Setup

Before we start training, let's set up logging and create the trainer objects:

In [None]:
# Setup logging
print("📊 Setting up logging...")
try:
    trainer.setup_logging()
    print("✅ Logging configured")
except Exception as e:
    print(f"⚠️  Logging setup had issues: {e}")
    print("Training can continue without full logging")

In [None]:
# Create a trainer for the first dataset to inspect the setup
if trainer.datasets:
    dataset_name = list(trainer.datasets.keys())[0]
    print(f"🔧 Creating trainer for dataset: {dataset_name}")
    
    try:
        hf_trainer = trainer.create_trainer(dataset_name)
        print(f"✅ HuggingFace trainer created")
        print(f"   Training dataset: {len(hf_trainer.train_dataset):,} examples")
        if hf_trainer.eval_dataset:
            print(f"   Eval dataset: {len(hf_trainer.eval_dataset):,} examples")
        
        # Show training arguments
        args = hf_trainer.args
        print(f"\n⚙️  Training arguments:")
        print(f"   Output dir: {args.output_dir}")
        print(f"   Learning rate: {args.learning_rate}")
        print(f"   Batch size: {args.per_device_train_batch_size}")
        print(f"   Max steps: {args.max_steps}")
        print(f"   Save steps: {args.save_steps}")
        print(f"   Eval steps: {args.eval_steps}")
        print(f"   FP16: {args.fp16}")
        
    except Exception as e:
        print(f"❌ Trainer creation failed: {e}")
        import traceback
        traceback.print_exc()

## Step 6: Test Adaptive Context System

Before training, let's test the adaptive context system with different types of inputs:
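
To build intuition for the `context_lengths` list in the configuration, the sketch below picks the smallest configured length that fits a given token count. This is only an illustration of the bucketing idea under that assumption: Arbor's context router is configured with its own layers, task types, and hardware settings, so its actual decisions may differ. `smallest_fitting_context` is a hypothetical name.

```python
from bisect import bisect_left

# The context_lengths values from the tutorial configuration (sorted ascending)
CONTEXT_LENGTHS = [512, 1024, 2048, 4096, 8192, 16384, 32768]


def smallest_fitting_context(token_count: int, lengths: list[int] = CONTEXT_LENGTHS) -> int:
    """Return the smallest configured length >= token_count, capped at the largest."""
    index = bisect_left(lengths, token_count)
    return lengths[min(index, len(lengths) - 1)]


print(smallest_fitting_context(300))     # 512
print(smallest_fitting_context(513))     # 1024
print(smallest_fitting_context(50_000))  # 32768
```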

In [None]:
# Test the adaptive context system
print("🧠 Testing Adaptive Context System")
print("=" * 50)

# Test different types of inputs
test_inputs = {
    "simple_chat": "Hello! How are you today?",
    
    "code_task": """
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# This is a recursive implementation
# Could be optimized with dynamic programming
for i in range(10):
    print(f"fib({i}) = {fibonacci(i)}")
""",
    
    "reasoning_task": """
Let me think through this step by step. If we have a logical puzzle where:
1. All cats are animals
2. Some animals are pets  
3. No pets are wild
4. Some cats are wild

We need to determine if there's a contradiction. Let me analyze each statement carefully
and see if they can all be true simultaneously. This requires careful logical reasoning
to avoid making invalid inferences.
""",
    
    "long_document": """
This is a comprehensive research paper on machine learning that covers multiple aspects
of the field. The introduction provides background on artificial intelligence and its
historical development. The methodology section describes various approaches including
supervised learning, unsupervised learning, and reinforcement learning paradigms.
""" + " The paper continues with detailed analysis." * 50  # Make it longer
}

# Test each input type
for task_type, text in test_inputs.items():
    print(f"\n🔍 Testing: {task_type}")
    print(f"Input length: {len(text)} characters")
    
    # Tokenize the input
    inputs = trainer.tokenizer(text, return_tensors="pt", truncation=False)
    input_ids = inputs["input_ids"]
    token_count = input_ids.shape[1]
    print(f"Token count: {token_count}")
    
    # Get current context info
    initial_context = trainer.model.get_context_info()['current_context_length']
    
    # Test the model (this should trigger adaptive context)
    trainer.model.eval()
    with torch.no_grad():
        try:
            # This forward pass will trigger context adaptation
            outputs = trainer.model(input_ids, return_dict=True)
            
            # Check if context adapted
            final_context = trainer.model.get_context_info()['current_context_length']
            
            print(f"Context: {initial_context:,} → {final_context:,} tokens")
            if final_context != initial_context:
                print(f"✅ Context adapted for {task_type}")
            else:
                print(f"→ Context unchanged for {task_type}")
                
        except Exception as e:
            print(f"❌ Error processing {task_type}: {e}")

## Step 7: Run Training

Now let's run the actual training! We'll train for a short period to demonstrate the system:
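
To see how short this run is, the sketch below works out how many sequences each dataset contributes under the tutorial's settings. It assumes a single device, that `steps_per_dataset` counts optimizer steps, and that every example becomes exactly one training sequence with no held-out evaluation split; none of these is confirmed by the trainer itself. `sequences_seen` is a hypothetical helper.

```python
def sequences_seen(steps: int, per_device_batch: int, grad_accum: int, devices: int = 1) -> int:
    """Sequences consumed after `steps` optimizer steps."""
    return steps * per_device_batch * grad_accum * devices


# steps_per_dataset=100, per_device_train_batch_size=2, gradient_accumulation_steps=4
budget = sequences_seen(steps=100, per_device_batch=2, grad_accum=4)

# Dataset sizes come from the split slices in the tutorial configuration
for name, size in {"tiny_stories": 1000, "code_samples": 500}.items():
    print(f"{name}: {budget} sequences = {budget / size:.1f} epochs")
# tiny_stories: 800 sequences = 0.8 epochs
# code_samples: 800 sequences = 1.6 epochs
```

Under these assumptions the run does not complete one pass over the story subset, which is fine for a demonstration but far too little for a useful model.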

In [None]:
# Warning: This will actually train the model!
print("⚠️  TRAINING WARNING")
print("=" * 50)
print("The next cell will run actual training.")
print("This may take several minutes and will:")
print("• Download datasets from HuggingFace")
print("• Train the model for 200 steps")
print("• Show parameter growth during training")
print("• Save model checkpoints")
print("")
print("Set RUN_TRAINING = True to proceed")

RUN_TRAINING = False  # Set to True to actually run training

if RUN_TRAINING:
    print("🚀 Starting training pipeline...")
else:
    print("🛑 Training skipped (set RUN_TRAINING = True to run)")

In [None]:
# Run training if enabled
if RUN_TRAINING:
    print("🚀 Starting Arbor YAML Training Pipeline")
    print("=" * 60)
    
    try:
        # This runs the complete training pipeline
        trainer.train()
        
        print("\n🎉 Training completed successfully!")
        
        # Show final model stats
        final_params = trainer.model.param_count()
        print(f"📊 Final model size: {final_params:,} parameters")
        
        # Show training outputs
        output_dir = Path(trainer.config.training_config['output_dir'])
        if output_dir.exists():
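            # List subdirectories only; glob("*/") also returns files before Python 3.13
            saved_models = sorted(p for p in output_dir.iterdir() if p.is_dir())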
            print(f"💾 Saved {len(saved_models)} model checkpoints:")
            for model_dir in saved_models:
                print(f"   📁 {model_dir.name}")
        
    except Exception as e:
        print(f"❌ Training failed: {e}")
        import traceback
        traceback.print_exc()
        
else:
    # Illustrative, hard-coded output: shows the kind of log a run prints, not measured results
    print("📋 Training simulation (illustrative output, not real results):")
    print("🌱 Initialized Arbor trainer with config: configs/tutorial_config.yaml")
    print("📥 Downloading fresh Hermes-4-405B tokenizer...")
    print("✅ Successfully loaded fresh Hermes-4-405B tokenizer")
    print("✅ Created Arbor model: 347,394,048 parameters")
    print("🧠 Adaptive context enabled:")
    print("   Range: 512 - 32,768")
    print("   Supported tasks: 4")
    print("🌱 Growth monitoring enabled:")
    print("   Factor: 1.5x")
    print("   Max steps: 4")
    print("📚 Loading datasets...")
    print("   ✅ tiny_stories: 1,000 examples")
    print("   ✅ code_samples: 500 examples")
    print("")
    print("🎯 Training on tiny_stories...")
    print("   📊 Parameters: 347,394,048 → 347,894,048 (growth occurred)")
    print("   ✅ tiny_stories complete!")
    print("")
    print("🎯 Training on code_samples...")
    print("   📊 Parameters: 347,894,048 → 348,394,048 (growth occurred)")
    print("   ✅ code_samples complete!")
    print("")
    print("🎉 Training pipeline complete!")

## Step 8: Test the Trained Model

Let's test our model (or demonstrate what testing would look like) with different task types to see how the adaptive context system works:

In [None]:
# Test model generation with different tasks
print("🧪 Testing Trained Model")
print("=" * 50)

test_prompts = {
    "story": "Once upon a time, in a magical forest",
    "code": "# Python function to calculate factorial\ndef factorial(n):",
    "reasoning": "Let me solve this step by step. The problem is:",
    "chat": "User: What's the weather like today?\nAssistant:"
}

for task_type, prompt in test_prompts.items():
    print(f"\n🎯 Testing {task_type} task:")
    print(f"Prompt: {prompt}")
    
    # Tokenize prompt
    inputs = trainer.tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    
    # Show what would happen with adaptive context
    print(f"Input tokens: {input_ids.shape[1]}")
    
    if RUN_TRAINING:
        # Actually test the trained model
        trainer.model.eval()
        with torch.no_grad():
            try:
                # Generate response
                generated = trainer.model.generate(
                    input_ids,
                    max_new_tokens=50,
                    temperature=0.7,
                    do_sample=True
                )
                
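                # Decode only the newly generated tokens
                # (assumes generate() returns the prompt followed by the continuation)
                new_tokens = generated[0][input_ids.shape[1]:]
                response = trainer.tokenizer.decode(new_tokens, skip_special_tokens=True)
                print(f"Generated: {response}")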
                
                # Show context info
                context_info = trainer.model.get_context_info()
                print(f"Context used: {context_info['current_context_length']:,} tokens")
                
            except Exception as e:
                print(f"❌ Generation failed: {e}")
    else:
        # Simulate what would happen
        simulated_contexts = {"story": 2048, "code": 4096, "reasoning": 8192, "chat": 1024}
        print(f"Would adapt context to: {simulated_contexts[task_type]:,} tokens")
        print(f"Would generate appropriate {task_type} response")

## Step 9: Configuration Tips and Best Practices

Here are some tips for customizing your YAML training configurations:
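
When comparing growth settings, it helps to see how quickly a factor compounds. The sketch below assumes that each growth event multiplies the grown quantity by `factor` and that at most `max_steps` events occur (4 in this tutorial's configuration). What Arbor actually grows, and by how much per event, is not specified here, so treat the numbers as an upper bound under that assumption. `max_cumulative_growth` is a hypothetical helper.

```python
def max_cumulative_growth(factor: float, max_steps: int) -> float:
    """Largest cumulative multiplier if every allowed growth event fires."""
    return factor ** max_steps


# Factors from the conservative, moderate, and aggressive growth presets in this step
presets = {"Conservative": 1.25, "Moderate": 1.5, "Aggressive": 2.0}
for name, factor in presets.items():
    print(f"{name}: up to {max_cumulative_growth(factor, 4):.2f}x")
# Conservative: up to 2.44x
# Moderate: up to 5.06x
# Aggressive: up to 16.00x
```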

In [None]:
# Configuration tips and best practices
print("💡 YAML Configuration Tips")
print("=" * 50)

tips = {
    "Model Size": {
        "Small (100M)": "hidden_size: 512, num_layers: 12",
        "Medium (500M)": "hidden_size: 1024, num_layers: 24", 
        "Large (1B)": "hidden_size: 1536, num_layers: 32"
    },
    
    "Context Lengths": {
        "Short tasks": "max 4K tokens (chat, Q&A)",
        "Medium tasks": "4K-16K tokens (code, creative)",
        "Long tasks": "16K+ tokens (documents, reasoning)"
    },
    
    "Growth Settings": {
        "Conservative": "factor: 1.25, threshold: 0.95",
        "Moderate": "factor: 1.5, threshold: 0.9",
        "Aggressive": "factor: 2.0, threshold: 0.85"
    },
    
    "Training Speed": {
        "Fast prototyping": "small datasets, few steps",
        "Full training": "complete datasets, many steps", 
        "Production": "multiple epochs, careful validation"
    }
}

for category, options in tips.items():
    print(f"\n🔧 {category}:")
    for option, description in options.items():
        print(f"   {option}: {description}")

print("\n📋 Common YAML patterns:")
print("""
# Enable everything for research
model:
  adaptive_context:
    enabled: true
  growth:
    enabled: true
logging:
  wandb:
    enabled: true

# Minimal setup for testing
model:
  adaptive_context:
    enabled: false
  growth:
    enabled: false
datasets:
  - name: "test"
    source: "roneneldan/TinyStories"
    split: "train[:100]"
""")

## Summary

Congratulations! You've learned how to use the Arbor YAML training system. Here's what we covered:

### ✅ **What You Learned:**

1. **📋 YAML Configuration** - How to create and customize training configs
2. **🧠 Adaptive Context** - Task-aware context window adaptation  
3. **🌱 Dynamic Growth** - Parameter expansion during training
4. **🚀 Easy Training** - One-command training with `python train.py config.yaml`
5. **🧪 Testing & Validation** - How to test trained models

### 🎯 **Key Benefits:**

- **Simple**: Just edit YAML, no complex code
- **Powerful**: Full control over model architecture and training
- **Smart**: Automatic context adaptation and parameter growth
- **Production Ready**: HuggingFace integration and monitoring

### 🚀 **Next Steps:**

1. **Customize** your own YAML config for your use case
2. **Train** with real datasets for your domain
3. **Monitor** training with WandB integration
4. **Deploy** trained models to HuggingFace Hub

With the YAML trainer, experimenting with growth and adaptive-context settings comes down to editing a configuration file.

In [None]:
# Cleanup and final info
print("🧹 Cleanup and Final Info")
print("=" * 50)

# Show created files
created_files = [
    "configs/tutorial_config.yaml",
    "tutorial_output/" if RUN_TRAINING else "tutorial_output/ (would be created)"
]

print("📁 Files created during this tutorial:")
for file in created_files:
    if Path(file).exists() or "would be" in file:
        print(f"   ✅ {file}")

print(f"\n🎯 To run training yourself:")
print(f"   1. Set RUN_TRAINING = True in the Step 7 cell")
print(f"   2. Or run: python train.py configs/tutorial_config.yaml")

print(f"\n🔧 To customize:")
print(f"   1. Edit configs/tutorial_config.yaml")
print(f"   2. Adjust model size, datasets, training steps")
print(f"   3. Enable/disable adaptive context and growth")

print(f"\n📚 For more examples:")
print(f"   • Check configs/example_config.yaml")
print(f"   • Run python demo_adaptive_context.py")
print(f"   • See README.md for full documentation")

print(f"\n🌱 Happy training with Arbor!")