# Inference Profile Support Demonstration

This notebook demonstrates the **automatic inference profile support** in LLMManager. Starting in v0.4.0, LLMManager automatically detects when models require inference profiles (CRIS profiles) and handles them transparently.

## Key Features

- 🔍 **Automatic Detection**: Detects profile requirements from AWS errors
- 🎯 **Intelligent Selection**: Automatically selects optimal inference profiles
- 🧠 **Learning System**: Learns and remembers successful access methods
- ⚡ **Zero Configuration**: No code changes required - works automatically
- 🔄 **Backward Compatible**: Models with direct access continue working unchanged
- 📊 **Observable**: Comprehensive logging and statistics

## What Are Inference Profiles?

Inference profiles (also called CRIS profiles, short for cross-region inference) are AWS Bedrock resources that provide access to foundation models. Some newer models (like Claude Sonnet 4.5) require profile-based access and reject direct invocation by model ID.

**Access Methods:**
- **Direct Access**: Using model ID directly (e.g., `anthropic.claude-3-haiku-20240307-v1:0`)
- **Regional CRIS**: Using regional inference profile ARN
- **Global CRIS**: Using global inference profile ID
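
As a rough illustration of how the three identifier shapes differ, the sketch below classifies an identifier by its prefix. The `AccessMethod` enum, the `classify_identifier` helper, and the prefix list are assumptions made for this example, not LLMManager's API, and the `global.` identifier is a made-up placeholder.

```python
from enum import Enum


class AccessMethod(Enum):
    """Hypothetical labels for the three access methods listed above."""
    DIRECT = "direct"
    REGIONAL_CRIS = "regional_cris"
    GLOBAL_CRIS = "global_cris"


# Geography prefixes assumed for regional profiles; extend as needed.
REGIONAL_PREFIXES = ("us.", "eu.", "apac.")


def _profile_name(identifier: str) -> str:
    """Return the trailing ID if given an ARN, else the input unchanged."""
    if identifier.startswith("arn:"):
        return identifier.rsplit("/", 1)[-1]
    return identifier


def classify_identifier(identifier: str) -> AccessMethod:
    """Guess the access method from the identifier's prefix alone."""
    name = _profile_name(identifier)
    if name.startswith("global."):
        return AccessMethod.GLOBAL_CRIS
    if name.startswith(REGIONAL_PREFIXES):
        return AccessMethod.REGIONAL_CRIS
    return AccessMethod.DIRECT


print(classify_identifier("anthropic.claude-3-haiku-20240307-v1:0").value)
print(classify_identifier("us.anthropic.claude-3-haiku-20240307-v1:0").value)
print(classify_identifier("global.example.model-v1:0").value)  # placeholder ID
```

This helper only inspects the string. LLMManager, by contrast, works from AWS errors and catalog data, as the examples in this notebook show.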

## Setup and Imports

In [None]:
import sys
import json
from pathlib import Path
import logging
from datetime import datetime

# Add the src directory to path for imports
sys.path.append(str(Path.cwd().parent / "src"))

# Import the LLMManager and related classes
from bestehorn_llmmanager.llm_manager import LLMManager
from bestehorn_llmmanager.bedrock.models.llm_manager_structures import AuthConfig, RetryConfig, AuthenticationType
from bestehorn_llmmanager.bedrock.tracking.access_method_tracker import AccessMethodTracker
from bestehorn_llmmanager.bedrock.catalog.bedrock_catalog import BedrockModelCatalog

# Configure logging to see profile support in action
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

print("✅ Imports successful!")
print(f"📁 Working directory: {Path.cwd()}")

## Example 1: Automatic Profile Detection 🔍

Let's demonstrate how the system automatically detects and handles models that require inference profiles.

In [None]:
print("🔍 Example 1: Automatic Profile Detection")
print("=" * 45)

# Initialize LLMManager with a model that requires inference profile
# No special configuration needed - it works automatically!
manager = LLMManager(
    models=["Claude Sonnet 4.5"],  # Requires inference profile
    regions=["us-east-1", "us-west-2"],
    log_level=logging.INFO  # See profile detection in logs
)

print("✅ LLMManager initialized")
print("   Model: Claude Sonnet 4.5 (requires inference profile)")
print("   Regions: us-east-1, us-west-2")
print("\n📝 Watch the logs below to see automatic profile detection...\n")

# Simple message
messages = [
    {
        "role": "user",
        "content": [
            {"text": "Hello! Please introduce yourself briefly."}
        ]
    }
]

try:
    # Send request - system will automatically:
    # 1. Try direct access (fails with ValidationException)
    # 2. Detect profile requirement
    # 3. Select inference profile
    # 4. Retry with profile (succeeds)
    # 5. Learn preference for future requests
    response = manager.converse(messages=messages)
    
    print("\n" + "=" * 45)
    print("📊 Response Details")
    print("=" * 45)
    print(f"✅ Success: {response.success}")
    print(f"🤖 Model: {response.model_used}")
    print(f"🌍 Region: {response.region_used}")
    print(f"🔗 Access Method: {response.access_method_used}")
    print(f"📋 Profile Used: {response.inference_profile_used}")

    if response.inference_profile_used:
        print(f"🆔 Profile ID: {response.inference_profile_id}")

    print(f"\n💬 Response: {response.get_content()[:200]}...")

except Exception as e:
    print(f"❌ Error: {e}")
    print(f"   Type: {type(e).__name__}")
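
The five steps in the comments above can be pictured with a small, self-contained simulation. Everything in it is hypothetical (`fake_invoke`, `PROFILES`, `learned`, the model IDs, and the paraphrased error text); it mirrors the described flow and is not LLMManager's implementation.

```python
class FakeValidationError(Exception):
    """Stand-in for the ValidationException a direct call can raise."""


# Hypothetical mapping from a model ID to an inference profile ID.
PROFILES = {"example.model-v1:0": "us.example.model-v1:0"}

# Learned preferences, keyed by (model_id, region).
learned = {}


def fake_invoke(identifier):
    """Simulated backend: this model only accepts profile-based access."""
    if identifier in PROFILES:
        raise FakeValidationError(
            "on-demand throughput isn't supported; retry with an inference profile"
        )
    return f"ok via {identifier}"


def needs_profile(error):
    """Step 2: decide from the error text whether a profile is required."""
    return "inference profile" in str(error).lower()


def converse(model_id, region):
    """Return (result, attempts), following steps 1-5 from the cell above."""
    key = (model_id, region)
    if key in learned:                     # learned preference: skip detection
        return fake_invoke(learned[key]), 1
    try:
        return fake_invoke(model_id), 1    # step 1: try direct access
    except FakeValidationError as err:
        if not needs_profile(err):         # step 2: detect profile requirement
            raise
        profile_id = PROFILES[model_id]    # step 3: select a profile
        result = fake_invoke(profile_id)   # step 4: retry with the profile
        learned[key] = profile_id          # step 5: remember what worked
        return result, 2


first = converse("example.model-v1:0", "us-east-1")
second = converse("example.model-v1:0", "us-east-1")
print(first)   # two attempts: direct call failed, profile retry succeeded
print(second)  # one attempt: the learned profile is used immediately
```

Keying the cache on the model and region pair matches the per-combination learning described in the next example.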

## Example 2: Access Method Learning 🧠

The system learns which access method works for each model/region combination and uses it immediately on subsequent requests.

In [None]:
print("🧠 Example 2: Access Method Learning")
print("=" * 40)

# Execute multiple requests to see learning in action
print("\n📝 Executing 3 requests to demonstrate learning...\n")

for i in range(3):
    messages = [
        {
            "role": "user",
            "content": [
                {"text": f"Request {i+1}: What is {2+i} + {3+i}?"}
            ]
        }
    ]
    
    response = manager.converse(messages=messages)
    
    print(f"Request {i+1}:")
    print(f"  Access Method: {response.access_method_used}")
    print(f"  Duration: {response.total_duration_ms:.2f}ms")
    print(f"  Response: {response.get_content()[:100]}...")
    print()

print("\n💡 Notice:")
print("   - First request may take longer (profile detection)")
print("   - Subsequent requests are faster (uses learned preference)")
print("   - All requests use the same access method after learning")

## Example 3: Checking Access Method Statistics 📊

Monitor access method usage across your application using the AccessMethodTracker.

In [None]:
print("📊 Example 3: Access Method Statistics")
print("=" * 40)

# Get the singleton tracker instance
tracker = AccessMethodTracker.get_instance()

# Get comprehensive statistics
stats = tracker.get_statistics()

print("\n📈 Global Access Method Statistics:")
print(f"   Total tracked combinations: {stats['total_tracked']}")
print(f"   Profile-required models: {stats['profile_required_count']}")
print(f"   Direct-access models: {stats['direct_access_count']}")

if stats.get('access_method_distribution'):
    print("\n🔗 Access Method Distribution:")
    for method, count in stats['access_method_distribution'].items():
        print(f"   {method}: {count}")

# Check specific model/region preference
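# Model ID assumed to correspond to the "Claude Sonnet 4.5" name used above;
# adjust it if your catalog reports a different ID.
model_id = "anthropic.claude-sonnet-4-5-20250929-v1:0"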
region = "us-east-1"

requires_profile = tracker.requires_profile(model_id, region)
print("\n🔍 Model Analysis:")
print(f"   Model: {model_id}")
print(f"   Region: {region}")
print(f"   Requires Profile: {requires_profile}")

preference = tracker.get_preference(model_id, region)
if preference:
    print(f"   Preferred Method: {preference.get_preferred_method()}")
    print(f"   Learned from Error: {preference.learned_from_error}")
    print(f"   Last Updated: {preference.last_updated}")

## Example 4: Mixed Access Methods 🔄

Demonstrate using models with different access requirements in the same manager.

In [None]:
print("🔄 Example 4: Mixed Access Methods")
print("=" * 40)

# Initialize manager with models that have different access requirements
mixed_manager = LLMManager(
    models=[
        "Claude Sonnet 4.5",  # Requires inference profile
        "Claude 3 Haiku"      # Supports direct access
    ],
    regions=["us-east-1"],
    log_level=logging.INFO
)

print("✅ Manager initialized with mixed access methods")
print("   Model 1: Claude Sonnet 4.5 (requires profile)")
print("   Model 2: Claude 3 Haiku (supports direct)")

messages = [
    {
        "role": "user",
        "content": [
            {"text": "What is 5 + 7?"}
        ]
    }
]

# Execute request - system will use appropriate access method for whichever model succeeds
response = mixed_manager.converse(messages=messages)

print("\n📊 Response Details:")
print(f"   Model Used: {response.model_used}")
print(f"   Access Method: {response.access_method_used}")
print(f"   Profile Used: {response.inference_profile_used}")
print(f"   Response: {response.get_content()}")

print("\n💡 The system automatically used the correct access method")
print("   for the model that succeeded.")

## Example 5: Checking Model Availability 🔍

Use the BedrockModelCatalog to check model availability and access methods.

In [None]:
print("🔍 Example 5: Checking Model Availability")
print("=" * 45)

# Initialize catalog
catalog = BedrockModelCatalog()

# Check specific model
model_name = "Claude Sonnet 4.5"
region = "us-east-1"

print(f"\n📋 Checking: {model_name} in {region}")

is_available = catalog.is_model_available(model_name, region)
print(f"   Available: {is_available}")

if is_available:
    model_info = catalog.get_model_info(model_name, region)
    
    print("\n📊 Model Information:")
    print(f"   Model ID: {model_info.model_id}")
    print(f"   Inference Profile: {model_info.inference_profile_id}")
    print(f"   Access Method: {model_info.access_method.value}")
    print(f"   Has Direct Access: {model_info.has_direct_access}")
    print(f"   Has Regional CRIS: {model_info.has_regional_cris}")
    print(f"   Has Global CRIS: {model_info.has_global_cris}")
    print(f"   Supports Streaming: {model_info.supports_streaming}")

# Check catalog metadata
metadata = catalog.get_catalog_metadata()
print("\n📚 Catalog Metadata:")
print(f"   Source: {metadata.source.value}")
print(f"   Retrieved: {metadata.retrieval_timestamp}")
print(f"   Regions Queried: {len(metadata.api_regions_queried)}")

## Example 6: Monitoring Access Method Distribution 📈

Execute multiple requests and analyze the distribution of access methods used.

In [None]:
print("📈 Example 6: Access Method Distribution")
print("=" * 42)

# Track access methods across multiple requests
access_methods = []

print("\n🔄 Executing 10 requests...\n")

for i in range(10):
    messages = [
        {
            "role": "user",
            "content": [
                {"text": f"Request {i+1}: Count to {i+1}"}
            ]
        }
    ]
    
    response = manager.converse(messages=messages)
    access_methods.append(response.access_method_used)
    # !s converts to str first, so the width spec also works for None or enum values
    print(f"Request {i+1:2d}: {response.access_method_used!s:15s} ({response.total_duration_ms:6.2f}ms)")

# Analyze distribution
from collections import Counter
distribution = Counter(access_methods)

print("\n📊 Access Method Distribution:")
for method, count in distribution.items():
    percentage = (count / len(access_methods)) * 100
    # !s converts to str first, so the width spec also works for None or enum values
    print(f"   {method!s:15s}: {count:2d} ({percentage:5.1f}%)")

print("\n💡 Observations:")
print("   - After learning, all requests use the same access method")
print("   - This demonstrates the learning system working correctly")

## Example 7: Backward Compatibility ✅

Demonstrate that models with direct access continue working without changes.

In [None]:
print("✅ Example 7: Backward Compatibility")
print("=" * 40)

# Initialize manager with model that supports direct access
direct_manager = LLMManager(
    models=["Claude 3 Haiku"],  # Supports direct access
    regions=["us-east-1"],
    log_level=logging.INFO
)

print("✅ Manager initialized with direct-access model")
print("   Model: Claude 3 Haiku (supports direct access)")

messages = [
    {
        "role": "user",
        "content": [
            {"text": "Hello! What is 10 + 15?"}
        ]
    }
]

response = direct_manager.converse(messages=messages)

print("\n📊 Response Details:")
print(f"   Success: {response.success}")
print(f"   Model: {response.model_used}")
print(f"   Access Method: {response.access_method_used}")
print(f"   Profile Used: {response.inference_profile_used}")
print(f"   Response: {response.get_content()}")

print("\n💡 Notice:")
print("   - Access method is 'direct' (no profile needed)")
print("   - Profile used is False")
print("   - Existing code works without any changes")

## Example 8: Troubleshooting Profile Issues 🔧

Demonstrate how to diagnose and troubleshoot profile-related issues.

In [None]:
print("🔧 Example 8: Troubleshooting Profile Issues")
print("=" * 47)

# Enable DEBUG logging for detailed troubleshooting
debug_manager = LLMManager(
    models=["Claude Sonnet 4.5"],
    regions=["us-east-1"],
    log_level=logging.DEBUG  # Most detailed logging
)

print("✅ Manager initialized with DEBUG logging")
print("   This will show detailed profile retry flow\n")

messages = [
    {
        "role": "user",
        "content": [
            {"text": "Hello!"}
        ]
    }
]

response = debug_manager.converse(messages=messages)

print("\n" + "=" * 47)
print("📊 Diagnostic Information")
print("=" * 47)

# Response information
print("\n✅ Response Status:")
print(f"   Success: {response.success}")
print(f"   Model: {response.model_used}")
print(f"   Region: {response.region_used}")
print(f"   Access Method: {response.access_method_used}")
print(f"   Profile Used: {response.inference_profile_used}")

# Warnings
warnings = response.get_warnings()
if warnings:
    print("\n⚠️  Warnings:")
    for warning in warnings:
        print(f"   - {warning}")

# Tracker statistics
tracker = AccessMethodTracker.get_instance()
stats = tracker.get_statistics()
print("\n📈 Tracker Statistics:")
print(f"   Total tracked: {stats['total_tracked']}")
print(f"   Profile required: {stats['profile_required_count']}")
print(f"   Direct access: {stats['direct_access_count']}")

# Catalog metadata
catalog = BedrockModelCatalog()
metadata = catalog.get_catalog_metadata()
print("\n📚 Catalog Information:")
print(f"   Source: {metadata.source.value}")
print(f"   Retrieved: {metadata.retrieval_timestamp}")

print("\n💡 Troubleshooting Tips:")
print("   1. Check DEBUG logs above for detailed retry flow")
print("   2. Verify access_method_used matches expected behavior")
print("   3. Check warnings for any profile-related issues")
print("   4. Review tracker statistics for learning patterns")
print("   5. Ensure catalog source is 'api' for latest data")

## Summary 📝

This notebook demonstrated the automatic inference profile support in LLMManager.

### Key Takeaways

1. **Automatic Detection**: System detects profile requirements from AWS errors
2. **Intelligent Selection**: Automatically selects optimal inference profiles
3. **Learning System**: Learns successful access methods for future optimization
4. **Zero Configuration**: No code changes required - works transparently
5. **Backward Compatible**: Existing code with direct-access models works unchanged
6. **Observable**: Comprehensive logging and statistics for monitoring

### Access Method Preference Order

1. **Direct Access** - Fastest, lowest latency (preferred when available)
2. **Regional CRIS** - Inference profile scoped to one geography (for example, US or EU)
3. **Global CRIS** - Inference profile that can route requests across geographies
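
One way to read this ordering is as a first-match search over what a model offers. The following is a minimal sketch, assuming availability flags named like the `has_direct_access`, `has_regional_cris`, and `has_global_cris` attributes shown in Example 5. The `pick_access_method` function and the string labels are hypothetical, not part of LLMManager.

```python
from typing import Optional

# Preference order from the list above, most preferred first.
PREFERENCE_ORDER = ("direct", "regional_cris", "global_cris")


def pick_access_method(has_direct_access: bool,
                       has_regional_cris: bool,
                       has_global_cris: bool) -> Optional[str]:
    """Return the most preferred available method, or None if none is available."""
    available = {
        "direct": has_direct_access,
        "regional_cris": has_regional_cris,
        "global_cris": has_global_cris,
    }
    for method in PREFERENCE_ORDER:
        if available[method]:
            return method
    return None


# A model with no direct access but both profile types falls back to regional.
print(pick_access_method(False, True, True))
```

Returning `None` when nothing is available leaves the caller free to try another model or region, which fits the fallback advice under Best Practices.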

### Best Practices

1. **Use Multiple Regions**: Provide redundancy for profile availability
2. **Include Fallback Models**: Mix profile-required and direct-access models
3. **Enable Appropriate Logging**: Use INFO or DEBUG for development, WARNING for production
4. **Monitor Statistics**: Track access method distribution using AccessMethodTracker
5. **Refresh Catalog**: Periodically refresh model data for latest availability
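
For the logging recommendation, one possible approach is to derive the level from the deployment environment. This sketch uses only the standard library. The `APP_ENV` variable name, the environment labels, and the `choose_log_level` helper are assumptions for illustration.

```python
import logging
import os


def choose_log_level(environment: str) -> int:
    """Map an environment label to a logging level (unknown labels get INFO)."""
    levels = {
        "development": logging.DEBUG,
        "staging": logging.INFO,
        "production": logging.WARNING,
    }
    return levels.get(environment.lower(), logging.INFO)


# APP_ENV is a hypothetical variable name; default to development when unset.
log_level = choose_log_level(os.environ.get("APP_ENV", "development"))
print(logging.getLevelName(log_level))
```

The resulting value could then be passed as the `log_level` argument when constructing `LLMManager`, in the same way the examples above pass `logging.INFO` and `logging.DEBUG`.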

### Additional Resources

- **Documentation**: See `docs/forLLMConsumption.md` for complete API reference
- **Troubleshooting**: See `docs/INFERENCE_PROFILE_TROUBLESHOOTING.md` for detailed troubleshooting guide
- **Examples**: See other notebooks for more usage patterns

### Performance Impact

- **First Request**: May include one additional retry for profile detection
- **Subsequent Requests**: No overhead - uses learned preference
- **Learning**: Automatic and transparent - no manual configuration needed

The inference profile support makes working with newer AWS Bedrock models seamless and transparent!