# 🎬 Stranger Things NLP with AWS SageMaker

This tutorial shows how to use AWS SageMaker to train, deploy, and monitor the Stranger Things NLP project's character chatbot models at production scale.

## What You'll Learn
1. Setting up SageMaker infrastructure
2. Training character chatbot models at scale
3. Deploying models as auto-scaling endpoints
4. Monitoring and cost optimization
5. Building production-ready applications

## 📋 Prerequisites

Before running this notebook, make sure you have:

1. **AWS Account** with appropriate permissions
2. **AWS CLI** configured with your credentials
3. **Environment variables** set:
   - `AWS_ACCOUNT_ID`
   - `HUGGINGFACE_TOKEN`
4. **Python dependencies** installed from requirements.txt

In [None]:
# Install required packages if not already installed
import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Core packages
packages = ['boto3', 'sagemaker', 'pandas', 'matplotlib', 'seaborn']

for package in packages:
    try:
        __import__(package)
    except ImportError:
        install_package(package)

print("✅ All packages installed!")

## 🚀 Part 1: Setting Up SageMaker Infrastructure

In [None]:
import os
import sys
import boto3
from pathlib import Path

# Add parent directory to path to import our modules
sys.path.append(str(Path().absolute().parent))

from config import SageMakerConfigManager, create_default_sagemaker_config
from storage import SageMakerS3Manager
from training_orchestrator import SageMakerTrainingOrchestrator
from deployment_manager import SageMakerDeploymentManager
from monitoring import SageMakerMonitoring

print("📦 SageMaker modules imported successfully!")

In [None]:
# Check environment setup
print("🔍 Environment Check:")
print(f"AWS Region: {os.getenv('AWS_DEFAULT_REGION', 'us-east-1')}")
print(f"AWS Account ID: {os.getenv('AWS_ACCOUNT_ID', 'Not Set')}")
print(f"HuggingFace Token: {'✅ Set' if os.getenv('HUGGINGFACE_TOKEN') else '❌ Not Set'}")

# Test AWS credentials
try:
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    print(f"AWS Identity: {identity['Arn']}")
    print("✅ AWS credentials are working!")
except Exception as e:
    print(f"❌ AWS credentials issue: {e}")

In [None]:
# Create SageMaker configuration
print("⚙️ Creating SageMaker configuration...")

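# Set your custom bucket name here.
# S3 bucket names must be globally unique, lowercase, and free of underscores,
# so normalize the username before using it.
username = os.getenv('USER', 'demo').lower().replace('_', '-')
bucket_name = f"stranger-things-sagemaker-{username}"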

config = create_default_sagemaker_config(bucket_name)
print(f"✅ Configuration created with bucket: {bucket_name}")
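
The bucket name above is built from the local username, so it is worth checking it against S3's naming rules before any bucket is created. The following is a minimal sketch that checks only a subset of those rules (length, allowed characters, first and last character, no adjacent dots). The helper names `is_valid_bucket_name`, `sanitize_bucket_component`, and `make_bucket_name` are hypothetical and are not part of this project's `config` module.

```python
import re

# Subset of the S3 general-purpose bucket naming rules: 3-63 characters,
# lowercase letters, digits, dots and hyphens only, starting and ending
# with a letter or digit. Adjacent dots are rejected separately below.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the subset of rules checked here."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name


def sanitize_bucket_component(raw: str) -> str:
    """Lowercase `raw` and replace runs of unsupported characters with '-'."""
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw.lower()).strip("-")
    return cleaned or "demo"  # fall back if nothing usable is left


def make_bucket_name(username: str,
                     prefix: str = "stranger-things-sagemaker") -> str:
    """Build a candidate bucket name, truncated to the 63-character limit."""
    name = f"{prefix}-{sanitize_bucket_component(username)}"
    return name[:63].rstrip("-")
```

A name that passes this check can still be rejected by S3, for example because another account already owns it, so the check is a fast local filter rather than a guarantee.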

In [None]:
# Initialize SageMaker components
print("🏗️ Initializing SageMaker components...")

s3_manager = SageMakerS3Manager(bucket_name=config.s3_config.bucket_name)
training_orchestrator = SageMakerTrainingOrchestrator(config)
deployment_manager = SageMakerDeploymentManager(config)
monitoring = SageMakerMonitoring(config)

print("✅ All components initialized!")
print(f"📊 S3 Bucket: {s3_manager.bucket_name}")
print(f"🎯 Training Instance: {config.training_config.instance_type}")
print(f"🚀 Inference Instance: {config.endpoint_config.instance_type}")

## 📊 Part 2: Data Preparation and Upload

In [None]:
# Create sample training data for demonstration
import pandas as pd
import tempfile

print("📝 Creating sample training data...")

# Sample dialogue data (replace with your actual data)
sample_data = [
    {"name": "Eleven", "line": "Friends don't lie."},
    {"name": "Mike", "line": "Will, are you okay?"},
    {"name": "Eleven", "line": "I can find him."},
    {"name": "Joyce", "line": "Where is my son?"},
    {"name": "Eleven", "line": "The Upside Down is cold and dark."},
    {"name": "Dustin", "line": "We need to find the gate."},
    {"name": "Eleven", "line": "Papa... he's gone."},
    {"name": "Steve", "line": "I can protect you guys."},
    {"name": "Eleven", "line": "I'm the monster."},
    {"name": "Hopper", "line": "Kid, you did good."},
]

# Create DataFrame
df = pd.DataFrame(sample_data)
print(f"📋 Created sample dataset with {len(df)} entries")
print(df.head())

In [None]:
# Save data locally and upload to S3
with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = Path(temp_dir)
    data_file = temp_path / "training_data.csv"
    
    # Save CSV file
    df.to_csv(data_file, index=False)
    print(f"💾 Saved training data to: {data_file}")
    
    # Upload to S3
    print("📤 Uploading training data to S3...")
    training_data_uri = training_orchestrator.prepare_training_data(
        str(temp_path), "chatbot"
    )
    
    print(f"✅ Training data uploaded to: {training_data_uri}")

## 🚂 Part 3: Training Models with SageMaker

In [None]:
# Configure training job
import time

job_name = f"stranger-things-demo-{int(time.time())}"
print(f"🚂 Launching training job: {job_name}")

# For demo purposes, we'll use smaller settings
hyperparameters = {
    'base_model': 'meta-llama/Llama-3.2-3B-Instruct',
    'batch_size': '1',
    'max_steps': '10',  # Small number for demo
    'learning_rate': '2e-4',
    'gradient_accumulation_steps': '2'
}

# Note: In real usage, you would set max_steps to 1000+ for proper training
print("⚠️ Note: Using minimal training steps for demo purposes")
print("In production, increase max_steps to 1000+ for better results")
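
The hyperparameter values above are strings because SageMaker passes hyperparameters to the training container as strings. The sketch below shows one way a training script could turn them back into typed values. It assumes the script receives them as `--name value` command-line arguments, as SageMaker's script-mode framework containers do; a fully custom container may read them from a JSON file instead. The function names and the default values are illustrative, not taken from this project's training code.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse string-valued hyperparameters into typed Python values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--max_steps", type=int, default=1000)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    # parse_known_args ignores extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


def to_argv(hyperparameters):
    """Flatten a {'name': 'value'} dict into ['--name', 'value', ...]."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend([f"--{name}", str(value)])
    return argv


if __name__ == "__main__":
    demo = {
        "base_model": "meta-llama/Llama-3.2-3B-Instruct",
        "batch_size": "1",
        "max_steps": "10",
        "learning_rate": "2e-4",
        "gradient_accumulation_steps": "2",
    }
    args = parse_hyperparameters(to_argv(demo))
    # Samples contributing to each optimizer step on a single device.
    print(args.batch_size * args.gradient_accumulation_steps)
```

With `batch_size` 1 and `gradient_accumulation_steps` 2, each optimizer step sees two samples per device, which is why the two values are usually tuned together.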

In [None]:
# Launch training job
print("🚀 Launching SageMaker training job...")
print("This may take 15-30 minutes depending on instance startup and training")

try:
    job_arn = training_orchestrator.launch_training_job(
        job_name=job_name,
        model_type="chatbot",
        training_data_s3_uri=training_data_uri,
        hyperparameters=hyperparameters
    )
    
    print(f"✅ Training job launched successfully!")
    print(f"📊 Job Name: {job_name}")
    print(f"📋 Job ARN: {job_arn}")
    
    # Store job name for later use
    current_job_name = job_name
    
except Exception as e:
    print(f"❌ Training job launch failed: {e}")
    print("This might be due to:")
    print("1. IAM role not configured properly")
    print("2. Docker image not available")
    print("3. AWS service limits reached")

In [None]:
# Monitor training job status
def check_training_status(job_name, max_checks=10, wait_seconds=60):
    """Poll the training job and print progress.

    Returns True if the job completed, False if it failed or was stopped,
    and None if it is still running after max_checks polls.
    """
    for i in range(max_checks):
        try:
            status_info = training_orchestrator.get_job_status(job_name)
            status = status_info['status']

            print(f"📊 Check {i+1}/{max_checks}: Status = {status}")

            if status == 'Completed':
                print("✅ Training completed successfully!")
                return True
            elif status in ('Failed', 'Stopped'):
                failure_reason = status_info.get('failure_reason', 'Unknown')
                print(f"❌ Training ended with status {status}: {failure_reason}")
                return False
            else:
                # Any other status (e.g. 'InProgress', 'Starting') is still running
                print("⏳ Training in progress...")

        except Exception as e:
            print(f"⚠️ Error checking status: {e}")

        # Wait between checks, including after an error, but not after the last one
        if i < max_checks - 1:
            time.sleep(wait_seconds)

    print("⏰ Reached maximum status checks")
    return None

# Check status (this will take a while for real training)
if 'current_job_name' in locals():
    print(f"🔍 Monitoring training job: {current_job_name}")
    training_completed = check_training_status(current_job_name)
else:
    print("⚠️ No training job to monitor")
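
Polling once a minute for a fixed number of checks is fine for a demo, but longer jobs are often polled with a growing delay and an overall timeout. The following is a generic sketch of that pattern, not this project's implementation: `wait_for_status` is a hypothetical helper, and the `sleep` and `clock` parameters exist only so the function can be exercised without real waiting.

```python
import time


def wait_for_status(get_status, terminal_statuses, initial_delay=15.0,
                    max_delay=120.0, timeout=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status.

    The delay between polls doubles after each attempt, capped at max_delay.
    Returns the terminal status, or None if `timeout` seconds elapse first.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)


# Possible use with the notebook's objects (assumes the 'status' key shown above):
# final = wait_for_status(
#     lambda: training_orchestrator.get_job_status(current_job_name)['status'],
#     terminal_statuses={'Completed', 'Failed', 'Stopped'},
# )
```

The same helper could poll an endpoint by swapping in an endpoint status getter and a terminal set such as `{'InService', 'Failed'}`.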

## 🚀 Part 4: Model Deployment

In [None]:
# Deploy the trained model (if training completed)
if 'current_job_name' in locals() and 'training_completed' in locals() and training_completed:
    print(f"🚀 Deploying model from training job: {current_job_name}")
    
    try:
        # Deploy the model
        deployment_info = deployment_manager.deploy_model_complete(
            model_name=f"{current_job_name}-model",
            model_artifacts_s3_uri=f"s3://{s3_manager.bucket_name}/models/{current_job_name}/output/model.tar.gz"
        )
        
        endpoint_name = deployment_info['endpoint_name']
        print(f"✅ Deployment initiated successfully!")
        print(f"🔗 Endpoint Name: {endpoint_name}")
        print(f"⏳ Endpoint creation takes 5-10 minutes...")
        
        # Store endpoint name for later use
        current_endpoint_name = endpoint_name
        
    except Exception as e:
        print(f"❌ Deployment failed: {e}")
else:
    print("⚠️ No completed training job to deploy")
    print("For demo purposes, let's use a pre-existing endpoint if available...")
    
    # List existing endpoints
    endpoints = deployment_manager.list_active_endpoints()
    if endpoints:
        current_endpoint_name = endpoints[0]['name']
        print(f"📋 Using existing endpoint: {current_endpoint_name}")
    else:
        print("📋 No existing endpoints found")

In [None]:
# Wait for endpoint to be ready (if we have one)
if 'current_endpoint_name' in locals():
    print(f"⏳ Waiting for endpoint to be ready: {current_endpoint_name}")
    
    # Check endpoint status
    max_checks = 10
    for i in range(max_checks):
        status_info = deployment_manager.get_endpoint_status(current_endpoint_name)
        status = status_info['status']
        
        print(f"📊 Check {i+1}/{max_checks}: Endpoint Status = {status}")
        
        if status == 'InService':
            print("✅ Endpoint is ready for inference!")
            endpoint_ready = True
            break
        elif status == 'Failed':
            failure_reason = status_info.get('failure_reason', 'Unknown')
            print(f"❌ Endpoint failed: {failure_reason}")
            endpoint_ready = False
            break
        else:
            print(f"⏳ Endpoint status: {status}...")
            if i < max_checks - 1:
                time.sleep(60)  # Wait 1 minute between checks
    else:
        print("⏰ Reached maximum endpoint checks")
        endpoint_ready = False

## 🤖 Part 5: Testing Inference

In [None]:
# Test inference (if endpoint is ready)
if 'current_endpoint_name' in locals() and 'endpoint_ready' in locals() and endpoint_ready:
    print(f"🧪 Testing inference on endpoint: {current_endpoint_name}")
    
    # Test messages
    test_messages = [
        "Hello Eleven, how are you?",
        "What do you think about the Upside Down?",
        "Can you help us find Will?"
    ]
    
    for i, message in enumerate(test_messages, 1):
        print(f"\n🔸 Test {i}: {message}")
        
        try:
            # Create payload
            payload = {
                "inputs": message,
                "parameters": {
                    "max_length": 128,
                    "temperature": 0.7,
                    "do_sample": True
                }
            }
            
            # Invoke endpoint
            response = deployment_manager.invoke_endpoint(current_endpoint_name, payload)
            print(f"🤖 Response: {response}")
            
        except Exception as e:
            print(f"❌ Inference failed: {e}")

else:
    print("⚠️ No ready endpoint for testing")
    print("In a real scenario, you would wait for endpoint deployment to complete")
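
What `deployment_manager.invoke_endpoint` does internally is not shown in this notebook. As a rough sketch of the request and response handling it likely wraps, the helpers below serialize the payload to JSON bytes and extract the text from a response. The response shapes handled here (`[{"generated_text": ...}]` or `{"generated_text": ...}`) are an assumption based on common Hugging Face text-generation containers; a custom inference script may return something different. Both function names are hypothetical.

```python
import json


def build_payload(message, max_length=128, temperature=0.7):
    """Serialize a chat message into the JSON request body used above."""
    body = {
        "inputs": message,
        "parameters": {
            "max_length": max_length,
            "temperature": temperature,
            "do_sample": True,
        },
    }
    return json.dumps(body).encode("utf-8")


def extract_generated_text(raw_body):
    """Pull the generated text out of a JSON response body (bytes or str)."""
    data = json.loads(raw_body)
    # Assumed shapes: a one-element list of dicts, or a single dict.
    if isinstance(data, list) and data:
        data = data[0]
    if isinstance(data, dict) and "generated_text" in data:
        return data["generated_text"]
    raise ValueError(f"Unexpected response shape: {data!r}")
```

With boto3, bytes like these would typically be sent as the `Body` of the `sagemaker-runtime` client's `invoke_endpoint` call with `ContentType="application/json"`.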

## 📊 Part 6: Monitoring and Analytics

In [None]:
# Set up monitoring dashboard
print("📊 Setting up CloudWatch monitoring...")

try:
    dashboard_url = monitoring.create_dashboard("StrangerThings-Demo-Dashboard")
    print(f"✅ CloudWatch dashboard created!")
    print(f"🔗 Dashboard URL: {dashboard_url}")
except Exception as e:
    print(f"⚠️ Dashboard creation failed: {e}")
    print("This might be due to CloudWatch permissions")

In [None]:
# Create monitoring alarms (if we have an endpoint)
if 'current_endpoint_name' in locals():
    print(f"🔔 Setting up alarms for endpoint: {current_endpoint_name}")
    
    try:
        alarms = monitoring.create_alarms(current_endpoint_name)
        print(f"✅ Created {len(alarms)} monitoring alarms:")
        for alarm in alarms:
            print(f"  📢 {alarm}")
    except Exception as e:
        print(f"⚠️ Alarm creation failed: {e}")
else:
    print("⚠️ No endpoint available for alarm setup")
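
Which alarms `monitoring.create_alarms` defines is not shown here. For orientation, the sketch below builds the parameters for one plausible alarm, on high model latency, in the form accepted by CloudWatch's `put_metric_alarm`. The alarm name, threshold, and evaluation settings are illustrative assumptions, and `AllTraffic` is assumed as the variant name because it is the SageMaker Python SDK's default. Note that the `ModelLatency` metric is reported in microseconds.

```python
def latency_alarm_params(endpoint_name, variant_name="AllTraffic",
                         threshold_ms=2000.0, evaluation_periods=3):
    """Build put_metric_alarm parameters for a high-latency alarm."""
    return {
        "AlarmName": f"{endpoint_name}-high-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        # ModelLatency is in microseconds, so convert from milliseconds.
        "Threshold": threshold_ms * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle endpoint emits no datapoints; do not treat that as a breach.
        "TreatMissingData": "notBreaching",
    }


# Possible use (requires AWS credentials and CloudWatch permissions):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**latency_alarm_params("my-endpoint"))
```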

In [None]:
# Generate monitoring report
print("📋 Generating monitoring report...")

report = monitoring.generate_monitoring_report(
    endpoint_name=current_endpoint_name if 'current_endpoint_name' in locals() else None,
    training_job_name=current_job_name if 'current_job_name' in locals() else None
)

print(report)

## 💰 Part 7: Cost Analysis

In [None]:
# Analyze costs
print("💰 Analyzing SageMaker costs...")

try:
    cost_data = monitoring.get_cost_metrics(days_back=7)
    
    if 'error' not in cost_data:
        print(f"📊 Total Cost (last 7 days): ${cost_data['total_cost']:.2f}")
        print(f"📈 Daily Average: ${cost_data['daily_average']:.2f}")
        
        if cost_data['services']:
            print("\n🔍 Cost Breakdown by Service:")
            for service, data in cost_data['services'].items():
                print(f"  • {service}: ${data['total']:.2f}")
    else:
        print(f"⚠️ Cost data retrieval failed: {cost_data['error']}")
        print("This might be due to Cost Explorer API permissions")
        
except Exception as e:
    print(f"⚠️ Cost analysis failed: {e}")
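
The `total_cost`, `daily_average`, and `services` fields printed above could be derived from a Cost Explorer `get_cost_and_usage` response grouped by service. The sketch below shows that aggregation over the `ResultsByTime` list, assuming daily granularity and the `UnblendedCost` metric. It mirrors the dictionary shape the cell above reads, but it is an illustration, not the code behind `monitoring.get_cost_metrics`.

```python
from collections import defaultdict


def summarize_costs(results_by_time, metric="UnblendedCost"):
    """Aggregate a Cost Explorer ResultsByTime list grouped by service."""
    per_service = defaultdict(float)
    for day in results_by_time:
        for group in day.get("Groups", []):
            service = group["Keys"][0]
            # Cost Explorer returns amounts as strings.
            per_service[service] += float(group["Metrics"][metric]["Amount"])
    total = sum(per_service.values())
    days = len(results_by_time)
    return {
        "total_cost": total,
        "daily_average": total / days if days else 0.0,
        "services": {name: {"total": value}
                     for name, value in per_service.items()},
    }
```

Cost Explorer data usually lags actual usage, so a job that finished minutes ago may not appear in the totals yet.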

## 🎯 Part 8: Production Best Practices

In [None]:
# Set up auto-scaling (if we have an endpoint)
if 'current_endpoint_name' in locals() and 'endpoint_ready' in locals() and endpoint_ready:
    print(f"📈 Setting up auto-scaling for: {current_endpoint_name}")
    
    try:
        success = deployment_manager.setup_auto_scaling(
            endpoint_name=current_endpoint_name,
            min_capacity=1,
            max_capacity=3,  # Conservative for demo
        target_value=50   # Target of 50 invocations per instance (per minute, if the wrapper tracks SageMakerVariantInvocationsPerInstance)
        )
        
        if success:
            print("✅ Auto-scaling configured successfully!")
            print("📊 Configuration:")
            print("  • Min Instances: 1")
            print("  • Max Instances: 3")
            print("  • Target: 50 invocations per instance")
        else:
            print("❌ Auto-scaling setup failed")
            
    except Exception as e:
        print(f"⚠️ Auto-scaling setup failed: {e}")
else:
    print("⚠️ No ready endpoint for auto-scaling setup")
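
Endpoint auto-scaling on SageMaker is configured through Application Auto Scaling in two steps: register the endpoint variant as a scalable target, then attach a target-tracking policy. The sketch below builds the parameters for both calls. It assumes `setup_auto_scaling` tracks the predefined `SageMakerVariantInvocationsPerInstance` metric (invocations per instance per minute), which this notebook does not confirm; the policy name, cooldown values, and the `AllTraffic` variant name are likewise assumptions.

```python
def _resource_id(endpoint_name, variant_name):
    """Resource identifier format Application Auto Scaling uses for a variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_params(endpoint_name, variant_name="AllTraffic",
                           min_capacity=1, max_capacity=3):
    """Parameters for register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": _resource_id(endpoint_name, variant_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def scaling_policy_params(endpoint_name, variant_name="AllTraffic",
                          target_value=50.0, scale_in_cooldown=300,
                          scale_out_cooldown=60):
    """Parameters for put_scaling_policy with target tracking."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": _resource_id(endpoint_name, variant_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_value),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            # Scale in slowly and out quickly to avoid flapping.
            "ScaleInCooldown": scale_in_cooldown,
            "ScaleOutCooldown": scale_out_cooldown,
        },
    }


# Possible use (requires AWS credentials):
# import boto3
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**scalable_target_params("my-endpoint"))
# client.put_scaling_policy(**scaling_policy_params("my-endpoint"))
```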

In [None]:
# Show deployment summary
print("📋 Deployment Summary")
print("=" * 40)

summary = deployment_manager.get_deployment_summary()
print(f"🚀 Active Endpoints: {summary['endpoints']}")
print(f"📦 Registered Models: {summary['models']}")
print(f"⚙️ Batch Jobs: {summary['batch_jobs']}")

if summary['active_endpoints']:
    print("\n🔗 Endpoint Names:")
    for endpoint in summary['active_endpoints']:
        print(f"  • {endpoint}")

if summary['active_models']:
    print("\n📦 Model Names:")
    for model in summary['active_models']:
        print(f"  • {model}")

## 🧹 Part 9: Cleanup (Optional)

In [None]:
# IMPORTANT: Set cleanup_resources to True below only if you want to clean up resources
# This deletes the endpoint so that it stops accruing charges

cleanup_resources = False  # Set to True if you want to cleanup

if cleanup_resources:
    print("🧹 Cleaning up resources...")
    print("⚠️ This will delete endpoints and stop charges")
    
    # Delete endpoint if it exists
    if 'current_endpoint_name' in locals():
        try:
            success = deployment_manager.delete_endpoint(
                current_endpoint_name, 
                delete_config=True, 
                delete_model=False  # Keep model for future use
            )
            
            if success:
                print(f"✅ Deleted endpoint: {current_endpoint_name}")
            else:
                print(f"❌ Failed to delete endpoint: {current_endpoint_name}")
                
        except Exception as e:
            print(f"⚠️ Cleanup error: {e}")
    
    print("✅ Cleanup completed!")
    print("💡 Models and training data are preserved in S3")
else:
    print("⚠️ Cleanup skipped (cleanup_resources = False)")
    print("💡 Remember to delete endpoints when you're done to avoid charges")
    print("💡 You can do this through the AWS Console or by running:")
    if 'current_endpoint_name' in locals():
        print(f"    deployment_manager.delete_endpoint('{current_endpoint_name}')")

## 🎉 Congratulations!

You've successfully completed the Stranger Things NLP SageMaker tutorial! Here's what you accomplished:

### ✅ What You Built
1. **🏗️ SageMaker Infrastructure** - S3 buckets, configurations, and IAM roles
2. **🚂 Scalable Training** - Cloud-based model training with GPU instances
3. **🚀 Auto-scaling Endpoints** - Production-ready inference with auto-scaling
4. **📊 Monitoring & Alerts** - CloudWatch dashboards and alarm system
5. **💰 Cost Optimization** - Cost tracking and optimization strategies

### 🚀 Next Steps
1. **Scale Your Data** - Train with larger datasets for better performance
2. **Multi-Character Models** - Train separate models for different characters
3. **A/B Testing** - Deploy multiple model versions for comparison
4. **Production App** - Integrate endpoints with your Gradio application
5. **MLOps Pipeline** - Set up automated retraining and deployment

### 📚 Additional Resources
- [SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/)
- [HuggingFace on SageMaker](https://huggingface.co/docs/sagemaker/index)
- [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)

### 💡 Tips for Production
- Use managed spot training to cut training costs (AWS cites savings of up to 90% over on-demand instances)
- Set up CloudWatch alarms for proactive monitoring
- Implement proper security with VPC and IAM policies
- Use model versioning for reliable deployments
- Review costs regularly and act on what you find
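
As a small illustration of the spot-training tip, the sketch below builds the keyword arguments the SageMaker Python SDK's `Estimator` accepts for managed spot training and computes the savings from a finished job. The helper names and the checkpoint URI in the usage comment are hypothetical. The savings formula compares the billable seconds with the total training seconds that SageMaker reports for a job.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds, checkpoint_s3_uri):
    """Build Estimator keyword arguments for managed spot training."""
    # max_wait covers training time plus time spent waiting for spot capacity,
    # so it must be at least as large as max_run.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
        # Checkpoints let an interrupted job resume instead of restarting.
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved relative to paying for every training second."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Possible use (hypothetical bucket and prefix):
# kwargs = spot_training_kwargs(3600, 7200, "s3://example-bucket/checkpoints/")
# estimator = Estimator(..., **kwargs)
```

Spot capacity can be reclaimed at any time, so the training script needs to save and restore checkpoints for these settings to pay off.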

**Happy scaling with AWS SageMaker! 🎬✨**