# 🎉 Welcome to SageMaker Unified Studio!

**✨ UPGRADED INFRASTRUCTURE: This project now uses SageMaker Unified Studio instead of traditional notebook instances!**

# 🚀 Data Scientist MLOps Workflow Guide

## Overview
This notebook demonstrates how to leverage SageMaker Unified Studio for your complete data science workflow. The infrastructure provides:

- **SageMaker Unified Studio** - Modern collaborative workspace with JupyterLab, Canvas, RStudio
- **Data Catalog & Governance** - Built-in data discovery and governance capabilities  
- **S3 Storage** - Secure data and model artifact storage
- **Collaborative Features** - Team sharing and business glossaries
- **AI-Powered Assistance** - Amazon Q for natural language data queries
- **IAM Security** - Enterprise-grade access management

## 1. Connect to SageMaker Unified Studio Environment

First, let's establish a connection to your Unified Studio workspace and verify access to the collaborative features.

## 🛠️ Project Setup and Scripts Availability

This section ensures you have access to all the MLOps scripts and source code in your SageMaker environment.

In [None]:
# SageMaker Unified Studio Environment Check

print("=" * 80)
print("SAGEMAKER UNIFIED STUDIO ENVIRONMENT CHECK")
print("=" * 80)

import boto3
import sagemaker
import pandas as pd
import numpy as np
import sklearn
import os
import json
from datetime import datetime

# Initialize AWS and SageMaker sessions
session = boto3.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity()['Account']
sagemaker_session = sagemaker.Session()

print(f"AWS Account: {account_id}")
print(f"AWS Region: {region}")
print(f"SageMaker Session: Initialized")

# Check if running in SageMaker Studio
studio_metadata_path = "/opt/ml/metadata/resource-metadata.json"
if os.path.exists(studio_metadata_path):
    with open(studio_metadata_path, 'r') as f:
        metadata = json.load(f)
    print(f"Environment: SageMaker Unified Studio")
    print(f"Domain ID: {metadata.get('DomainId', 'N/A')}")
    print(f"User Profile: {metadata.get('UserProfileName', 'N/A')}")
    # The last segment of the app ARN is the app name, not the instance type
    print(f"App Name: {metadata['ResourceArn'].split('/')[-1] if metadata.get('ResourceArn') else 'N/A'}")
else:
    print(f"Environment: Compatible SageMaker Environment")

# Package versions
print(f"\nPACKAGE VERSIONS:")
print(f"SageMaker SDK: {sagemaker.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")

# Check SageMaker capabilities
print(f"\nSAGEMAKER CAPABILITIES:")
try:
    bucket = sagemaker_session.default_bucket()
    print(f"Default S3 Bucket: {bucket}")
    
    role = sagemaker.get_execution_role()
    print(f"Execution Role: {role.split('/')[-1]}")
    
    print(f"Training Instance Types: Available")
    print(f"Inference Instance Types: Available")
    
except Exception as e:
    print(f"SageMaker setup error: {e}")

# Check Studio-specific features
print(f"\nUNIFIED STUDIO FEATURES:")
try:
    # Check Model Registry access through the session's boto3 SageMaker client
    sagemaker_session.sagemaker_client.list_model_packages(MaxResults=1)
    print(f"Model Registry: Available")
except Exception as e:
    print(f"Model Registry: Limited access ({e})")

try:
    # Check Canvas availability
    sm_client = boto3.client('sagemaker')
    domains = sm_client.list_domains(MaxResults=1)
    if domains['Domains']:
        print(f"Canvas: Available")
        print(f"Collaborative Features: Enabled")
        print(f"Data Catalog: Integrated")
    else:
        print(f"Canvas: Not configured")
except Exception:
    print(f"Studio Features: Basic access")

print(f"\nEnvironment check complete!")

# Set global variables
REGION = region
ACCOUNT_ID = account_id
BUCKET = sagemaker_session.default_bucket()
ROLE = sagemaker.get_execution_role()

print(f"\nGlobal variables set:")
print(f"REGION = '{REGION}'")
print(f"ACCOUNT_ID = '{ACCOUNT_ID}'")
print(f"BUCKET = '{BUCKET}'")
print(f"ROLE = '{ROLE.split('/')[-1]}'")
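
The `ResourceArn` read from the Studio metadata file carries more than its last segment. The following is a minimal sketch of splitting it into named parts, assuming the app ARN layout `arn:aws:sagemaker:<region>:<account>:app/<domain-id>/<profile-or-space>/<app-type>/<app-name>`. The helper name `parse_app_arn` and the sample ARN are illustrative and not part of this project.

```python
from typing import Dict, Optional


def parse_app_arn(arn: Optional[str]) -> Dict[str, str]:
    """Split a SageMaker Studio app ARN into named parts (assumed layout)."""
    empty = {"region": "N/A", "account": "N/A", "domain_id": "N/A",
             "owner": "N/A", "app_type": "N/A", "app_name": "N/A"}
    if not arn:
        return empty
    # arn:aws:sagemaker:<region>:<account>:<resource>
    prefix = arn.split(":", 5)
    if len(prefix) != 6:
        return empty
    resource = prefix[5].split("/")
    if len(resource) != 5 or resource[0] != "app":
        return empty
    return {
        "region": prefix[3],
        "account": prefix[4],
        "domain_id": resource[1],
        "owner": resource[2],
        "app_type": resource[3],
        "app_name": resource[4],
    }


# Hypothetical ARN, for illustration only
sample_arn = "arn:aws:sagemaker:us-east-1:123456789012:app/d-abc123/alice/JupyterLab/default"
parts = parse_app_arn(sample_arn)
print(parts["domain_id"], parts["app_type"], parts["app_name"])
```

In the cell above, this could be called as `parse_app_arn(metadata.get('ResourceArn'))`. Unrecognized or missing ARNs fall back to `'N/A'` values instead of raising.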

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
import os

# Initialize SageMaker Unified Studio session
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()

print(f"SageMaker Session: Initialized")
print(f"Region: {region}")
print(f"Execution Role: {role}")

# Find ML artifacts bucket from infrastructure
s3_client = boto3.client('s3')
buckets = s3_client.list_buckets()
ml_bucket = None

for bucket in buckets['Buckets']:
    if 'ml-artifacts' in bucket['Name']:
        ml_bucket = bucket['Name']
        break

if ml_bucket:
    print(f"ML Bucket: {ml_bucket}")
    print(f"Studio Data Catalog: Ready")
    print(f"Collaborative Features: Enabled")
    print(f"Canvas Integration: Available")
    
    # Check data governance features
    try:
        sm_client = boto3.client('sagemaker')
        domains = sm_client.list_domains(MaxResults=1)
        if domains['Domains']:
            domain_id = domains['Domains'][0]['DomainId']
            print(f"Data Governance Domain: {domain_id}")
            print(f"Business Glossary: Available")
    except Exception:
        print(f"Studio Features: Basic access")
else:
    print("Warning: ML bucket not found. Check Terraform deployment.")
    ml_bucket = "your-ml-artifacts-bucket-name"
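
The loop above takes the first bucket whose name contains `ml-artifacts`, which can silently pick the wrong one when several buckets match. Below is a small sketch of a stricter selection that fails loudly on zero or multiple matches. The function name `pick_bucket` and the sample bucket names are made up for illustration.

```python
from typing import Iterable


def pick_bucket(bucket_names: Iterable[str], marker: str = "ml-artifacts") -> str:
    """Return the single bucket whose name contains `marker`, or raise."""
    matches = sorted(name for name in bucket_names if marker in name)
    if not matches:
        raise LookupError(f"No bucket name contains '{marker}'")
    if len(matches) > 1:
        raise LookupError(f"Ambiguous match for '{marker}': {matches}")
    return matches[0]


# Hypothetical bucket names, for illustration only
sample_names = ["logs-prod", "team-ml-artifacts-dev", "website-assets"]
print(pick_bucket(sample_names))
```

With the `buckets` response from the cell above, one possible call is `pick_bucket(b['Name'] for b in buckets['Buckets'])`.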

## 🏛️ Data Governance & Collaboration in SageMaker Unified Studio

**🌟 NEW FEATURE: Your environment now includes advanced data governance and collaboration capabilities!**

### 📊 **Data Catalog & Discovery**

SageMaker Unified Studio includes a built-in data catalog that helps you:

- 🔍 **Discover Data Assets**: Search and find datasets across your organization
- 📋 **Manage Metadata**: Automatically catalog data with AI-generated descriptions
- 🔗 **Track Data Lineage**: See how data flows through your ML pipelines
- 📈 **Monitor Data Quality**: Get quality scores and validation reports
- 🤖 **Ask Amazon Q**: Use natural language to find the data you need

### 🤝 **Collaborative Features**

Work seamlessly with your team:

- 👥 **Shared Workspaces**: Collaborate on notebooks and experiments
- 📤 **Asset Sharing**: Share models, datasets, and notebooks with fine-grained permissions
- 📝 **Business Glossary**: Create shared definitions and standards
- 🏷️ **Data Products**: Package and distribute curated datasets
- 💬 **Team Communication**: Built-in collaboration tools

### 🔐 **Governance & Security**

Enterprise-grade governance:

- 🛡️ **Fine-grained Access Control**: Role-based permissions for data and models
- 📊 **Audit Trails**: Complete tracking of data access and model usage
- 🏢 **Business Units**: Organize assets by teams and departments
- 📜 **Compliance**: Built-in tools for regulatory compliance
- 🔒 **Data Privacy**: Automated PII detection and protection

### 🎨 **SageMaker Canvas Integration**

No-code ML for business users:

- 🖱️ **Point-and-Click ML**: Build models without coding
- 📊 **Automatic Insights**: AI-powered data analysis
- 📈 **Business Forecasting**: Time-series prediction made easy
- 📋 **Model Sharing**: Share Canvas models with data scientists

---

**💡 Next: Let's set up your data and create your first governed ML pipeline!**

In [None]:
# Load data from S3
data_key = 'data/iris.csv'
data_path = f's3://{ml_bucket}/{data_key}'

print(f"Loading data from: {data_path}")

try:
    # Load dataset
    df = pd.read_csv(data_path)
    print(f"Data loaded successfully!")
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    # Display first rows
    print("\nFirst 5 rows:")
    display(df.head())
    
    # Basic statistics
    print("\nDataset Statistics:")
    display(df.describe())
    
except Exception as e:
    print(f"Error loading data: {e}")
    print("Note: Upload iris.csv to your S3 bucket manually if needed")

## 3. Data Preprocessing and Cleaning

Perform data cleaning and preprocessing. In a real scenario, this is where you'd handle missing values, outliers, and data quality issues.

In [None]:
# Data quality assessment
print("Data Quality Assessment:")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\nDuplicate rows: {df.duplicated().sum()}")
print(f"\nData types:\n{df.dtypes}")

# Target distribution
print(f"\nTarget distribution:")
print(df['species'].value_counts())

# Create visualizations
plt.figure(figsize=(15, 10))

# Species distribution
plt.subplot(2, 3, 1)
df['species'].value_counts().plot(kind='bar')
plt.title('Species Distribution')
plt.xticks(rotation=45)

# Feature distributions
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for i, feature in enumerate(features, 2):
    plt.subplot(2, 3, i)
    df[feature].hist(bins=20, alpha=0.7)
    plt.title(f'{feature.replace("_", " ").title()} Distribution')
    plt.xlabel(feature)

plt.tight_layout()
plt.show()

# Correlation analysis
plt.figure(figsize=(10, 8))
correlation_matrix = df[features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
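
The cell above reports missing values and duplicates but does not look for outliers, which the section introduction also mentions. One common rule flags values outside 1.5 times the interquartile range. The sketch below uses only the standard library so it runs anywhere; `iqr_outliers` and the sample readings are illustrative, and the threshold `k=1.5` is a convention rather than something this project prescribes.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sepal lengths with one obviously bad reading
readings = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 50.0]
print(iqr_outliers(readings))
```

With the notebook's DataFrame, a call such as `iqr_outliers(df['sepal_length'].tolist())` would apply the same rule per feature. pandas' `quantile` interpolates differently by default, so the exact boundaries can differ slightly from this version.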

## 4. Feature Engineering Pipeline

Create feature engineering transformations and save processed data back to S3 for reusability.

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import joblib

# Feature engineering
print("Feature Engineering...")

# Create engineered features
df_processed = df.copy()
df_processed['sepal_ratio'] = df_processed['sepal_length'] / df_processed['sepal_width']
df_processed['petal_ratio'] = df_processed['petal_length'] / df_processed['petal_width']
df_processed['sepal_area'] = df_processed['sepal_length'] * df_processed['sepal_width']
df_processed['petal_area'] = df_processed['petal_length'] * df_processed['petal_width']

print(f"Added engineered features: sepal_ratio, petal_ratio, sepal_area, petal_area")

# Prepare features and target
feature_columns = [col for col in df_processed.columns if col != 'species']
X = df_processed[feature_columns]
y = df_processed['species']

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set: {X_train.shape}, Test set: {X_test.shape}")

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns, index=X_test.index)

print("Features scaled using StandardScaler")

# Save processed data to S3
processed_data_key = 'processed-data'
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save training data
train_data = pd.concat([X_train_scaled, y_train], axis=1)
train_s3_path = f's3://{ml_bucket}/{processed_data_key}/train_{timestamp}.csv'
train_data.to_csv(train_s3_path, index=False)
print(f"Training data saved to: {train_s3_path}")

# Save test data
test_data = pd.concat([X_test_scaled, y_test], axis=1)
test_s3_path = f's3://{ml_bucket}/{processed_data_key}/test_{timestamp}.csv'
test_data.to_csv(test_s3_path, index=False)
print(f"Test data saved to: {test_s3_path}")

# Save scaler locally, then upload it; upload_data returns the resulting S3 URI
scaler_path = f'/tmp/scaler_{timestamp}.joblib'
joblib.dump(scaler, scaler_path)
scaler_s3_path = sagemaker_session.upload_data(scaler_path, bucket=ml_bucket, key_prefix='models')
print(f"Scaler saved to: {scaler_s3_path}")
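
The scaling step above fits on the training split only and reuses those statistics for the test split, which keeps test information out of training. As a dependency-free illustration of the same idea for a single feature, the sketch below computes `z = (x - mean) / std` with the population standard deviation, which is what `StandardScaler` uses. The function names and the sample numbers are illustrative.

```python
import statistics
from typing import List, Tuple


def fit_standardizer(train: List[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    # Fall back to 1.0 for constant features to avoid dividing by zero
    return mean, (std if std > 0 else 1.0)


def apply_standardizer(values: List[float], mean: float, std: float) -> List[float]:
    """Scale values with previously learned statistics: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]


train_values = [1.0, 2.0, 3.0]
test_values = [2.0, 4.0]

mean, std = fit_standardizer(train_values)           # statistics come from train only
scaled_test = apply_standardizer(test_values, mean, std)
print(mean, std, scaled_test)
```

The test value `4.0` lands outside the range seen in training, which is expected: the test split is transformed with training statistics, never refitted.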

## 5. Model Training with SageMaker Estimators

Now we'll use the production-ready training script from `src/model/train.py` with SageMaker's training capabilities.

In [None]:
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.inputs import TrainingInput

print("Setting up SageMaker Training Job...")

# Configure SKLearn estimator
sklearn_estimator = SKLearn(
    entry_point='train.py',
    source_dir='src/model',
    role=role,
    instance_type='ml.m5.large',
    framework_version='1.2-1',
    py_version='py3',
    hyperparameters={
        'n_estimators': 100,
        'max_depth': 10,
        'random_state': 42
    },
    base_job_name='iris-training'
)

print("SKLearn estimator configured")

# Define training input
train_input = TrainingInput(
    s3_data=f's3://{ml_bucket}/data/',
    content_type='text/csv'
)

print(f"Training input configured: s3://{ml_bucket}/data/")

# Check data availability in S3
print("Checking training data availability...")
try:
    response = s3_client.list_objects_v2(Bucket=ml_bucket, Prefix='data/')
    if 'Contents' in response:
        print("Data found in S3:")
        for obj in response['Contents']:
            print(f"  {obj['Key']} ({obj['Size']} bytes)")
    else:
        print("No data found in S3 data/ folder")
        print("Run 'terraform apply' to upload the iris.csv file")
except Exception as e:
    print(f"Error checking S3: {e}")

print("\nReady to start training!")
print("To train the model, run:")
print("sklearn_estimator.fit({'training': train_input})")

# Uncomment to start training:
# sklearn_estimator.fit({'training': train_input})
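
The contents of `src/model/train.py` are not shown in this notebook. As a rough sketch of how a script-mode entry point usually receives the settings configured above: hyperparameters arrive as command-line flags named after the dictionary keys, and the container supplies paths through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` (the latter matches the `'training'` channel name passed to `fit`). The `parse_args` function below is hypothetical and is not the project's actual script.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hyperparameters and container paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags named after the dict keys
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    parser.add_argument("--random_state", type=int, default=42)
    # Paths are provided by the training container through environment variables
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING",
                                               "/opt/ml/input/data/training"))
    # Ignore any extra flags the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


example_args = parse_args(["--n_estimators", "50", "--max_depth", "4"])
print(example_args.n_estimators, example_args.max_depth, example_args.random_state)
```

A script built this way can also be run locally with explicit flags, which fits the "test training script locally first" advice later in this notebook.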

## 6. Model Evaluation and Validation

Let's demonstrate local model training and evaluation using the same logic as our production script.

## 🔧 Troubleshooting SageMaker Training Issues

If you encounter training job failures, here are common issues and solutions:

In [None]:
# SageMaker Training Troubleshooting Guide

print("SageMaker Training Troubleshooting")
print()

# ml_bucket is interpolated here so the message shows the real data location
common_issues = {
    "Data not found": f"Ensure iris.csv is in s3://{ml_bucket}/data/ - run 'terraform apply'",
    "Script execution error": "Check training script logs in CloudWatch",
    "Framework version issues": "This notebook uses framework_version='1.2-1'",
    "Permission errors": "Verify SageMaker execution role has S3 access"
}

for issue, solution in common_issues.items():
    print(f"• {issue}: {solution}")

print(f"\nCurrent setup:")
print(f"  Training script: src/model/train.py")
print(f"  Requirements: src/model/requirements.txt")
print(f"  Data location: s3://{ml_bucket}/data/")
print(f"  Framework: sklearn 1.2-1")

print(f"\n💡 To debug training failures:")
print(f"  1. Check CloudWatch logs for the training job")
print(f"  2. Verify data exists in S3")
print(f"  3. Test training script locally first")

print(f"\n✅ Training script is ready and well-structured!")

In [None]:
# Quick Environment Check

print("Environment Status:")

# Check training script
if os.path.exists('src/model/train.py'):
    print("✅ Training script: src/model/train.py")
else:
    print("❌ Training script missing")

# Check requirements
if os.path.exists('src/model/requirements.txt'):
    print("✅ Requirements: src/model/requirements.txt")
else:
    print("❌ Requirements file missing")

# Check S3 data
try:
    if ml_bucket:
        response = s3_client.list_objects_v2(Bucket=ml_bucket, Prefix='data/iris.csv')
        if 'Contents' in response:
            print("✅ Data: iris.csv found in S3")
        else:
            print("❌ Data: iris.csv not found in S3")
    else:
        print("❌ ML bucket not identified")
except Exception as e:
    print(f"⚠️ Cannot check S3 data: {e}")

# Check role
print(f"✅ SageMaker role: {role.split('/')[-1]}")

print(f"\nReady to train with SageMaker!")

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns

print("Model Training and Evaluation...")

# Train model locally (same algorithm as production script)
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=10
)

model.fit(X_train_scaled, y_train)
print("Model trained successfully")

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualizations
plt.figure(figsize=(12, 5))

# Confusion matrix
plt.subplot(1, 2, 1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Feature importance
plt.subplot(1, 2, 2)
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

sns.barplot(data=feature_importance, y='feature', x='importance')
plt.title('Feature Importance')
plt.xlabel('Importance')

plt.tight_layout()
plt.show()

print("\nTop 5 most important features:")
print(feature_importance.head())

## 7. Deploy Model to SageMaker Endpoint

Once satisfied with the model performance, deploy it using SageMaker's inference infrastructure.

## 🚀 Complete Deployment Guide: SageMaker Unified Studio

This section shows you how to deploy your model within the **SageMaker Unified Studio** ecosystem.

### 📋 Deployment Options in Unified Studio

#### Option 1: 🎯 **Studio Real-time Endpoints** (RECOMMENDED)
- ✅ Native Studio integration
- ✅ Built-in monitoring and governance
- ✅ Team collaboration features
- ✅ Canvas integration for business users

#### Option 2: **Studio Batch Transform**
- ✅ Large-scale batch processing
- ✅ Cost-effective for batch predictions
- ✅ Integrated with data catalog

#### Option 3: 🤝 **Canvas Model Sharing**
- ✅ No-code deployment for business users
- ✅ Business-friendly interface
- ✅ Governance and approval workflows

---

### 🎯 **Option 1: Studio Real-time Endpoints (RECOMMENDED)**

This is the **recommended approach** for SageMaker Unified Studio:

```python
# Deploy directly within Studio environment
# (requires a fitted estimator, so run sklearn_estimator.fit(...) first)
predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='iris-studio-model'
)

# Test immediately
test_data = [[5.1, 3.5, 1.4, 0.2]]
prediction = predictor.predict(test_data)
print(f"Prediction: {prediction}")

# deploy() creates the model, endpoint configuration, and endpoint.
# Adding the model to Model Registry is a separate step,
# for example with sklearn_estimator.register(...)
```

**Studio Benefits:**
- ✅ Model registry integration
- ✅ Built-in governance and approval workflows
- ✅ Team sharing and collaboration
- ✅ Canvas integration for business users
- ✅ Lineage tracking and data catalog integration
---
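
Option 2 (Batch Transform) is listed above without a walkthrough. The sketch below shows one way it could look with the SageMaker Python SDK's `transformer()` and `transform()` calls, assuming a fitted estimator and CSV input whose columns match what the model expects at inference time. The wrapper name `run_batch_transform` and its parameters are illustrative.

```python
from typing import Any


def run_batch_transform(estimator: Any, input_s3_uri: str, output_s3_uri: str,
                        instance_type: str = "ml.m5.large") -> Any:
    """Score the CSV objects under input_s3_uri with a fitted estimator."""
    transformer = estimator.transformer(
        instance_count=1,
        instance_type=instance_type,
        output_path=output_s3_uri,
    )
    transformer.transform(
        data=input_s3_uri,
        content_type="text/csv",
        split_type="Line",  # send the file to the model line by line
    )
    transformer.wait()  # block until the batch job finishes
    return transformer
```

A call might look like `run_batch_transform(sklearn_estimator, f's3://{ml_bucket}/batch-input/', f's3://{ml_bucket}/batch-output/')`, where both prefixes are placeholders. Unlike an endpoint, the batch job's instances shut down when the job completes.

---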

### 🎨 **Option 3: Canvas Integration**

Share your model with business users through Canvas:

1. **Deploy model** using Option 1
2. **Register in Studio** - Add the model version to Model Registry
3. **Share with Canvas users** - Through Studio permissions
4. **Business users access** - Via Canvas no-code interface

```python
# After deployment, describe the trained model so it can be registered
from sagemaker.model import Model

model = Model(
    image_uri=sklearn_estimator.image_uri,
    model_data=sklearn_estimator.model_data,
    role=role
)

# Creating the Model object does not publish it anywhere; register it
# and share it through Studio so Canvas users can access it
```

---

### 🧪 **Testing Your Studio-Deployed Model**

```python
import boto3
import json

runtime = boto3.client('sagemaker-runtime')
endpoint_name = 'your-endpoint-name'
payload = [[5.1, 3.5, 1.4, 0.2]]

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode())
print(f'Prediction: {result}')
```

---

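### 🧹 **Studio Resource Management**

```python
# Clean up: delete the model first (it is looked up through the endpoint
# configuration), then the endpoint and its configuration
predictor.delete_model()
predictor.delete_endpoint()

# Or use Studio console for visual management
```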

---

### 💡 **Which Option Should You Choose?**

- **🎯 Data Scientists**: Use Studio Real-time Endpoints (Option 1)
- **📊 Batch Processing**: Use Batch Transform (Option 2)
- **👥 Business Users**: Share via Canvas (Option 3)
- **🧪 Quick Testing**: Use Studio endpoints with built-in testing

**Next**: Choose your deployment method and leverage Studio's collaborative features!

In [None]:
# SageMaker Unified Studio Deployment Demo

print("SageMaker Unified Studio Model Deployment")
print("=" * 50)

# Check Studio environment and features
import os
import json

print("UNIFIED STUDIO FEATURES:")
print("- Real-time endpoints with governance")
print("- Model registry integration")
print("- Team collaboration and sharing")
print("- Canvas integration for business users")
print("- Data catalog and lineage tracking")
print("- Built-in monitoring and alerts")

# Check Studio-specific capabilities
try:
    # Check if we're in Studio environment
    studio_metadata_path = "/opt/ml/metadata/resource-metadata.json"
    if os.path.exists(studio_metadata_path):
        with open(studio_metadata_path, 'r') as f:
            metadata = json.load(f)
        print(f"\nStudio Environment Detected:")
        print(f"Domain ID: {metadata.get('DomainId', 'N/A')}")
        print(f"User Profile: {metadata.get('UserProfileName', 'N/A')}")
        print(f"Collaborative Features: Enabled")
    else:
        print(f"\nRunning in compatible SageMaker environment")
        
except Exception as e:
    print(f"Studio metadata not available: {e}")

print(f"\nSTUDIO DEPLOYMENT OPTIONS:")

print("\n1. REAL-TIME ENDPOINT (Recommended)")
print("   - Native Studio integration")
print("   - Automatic model registry")
print("   - Team sharing capabilities")
print("   Usage: sklearn_estimator.deploy()")

print("\n2. BATCH TRANSFORM")
print("   - Large-scale batch processing")
print("   - Cost-effective for bulk predictions")
print("   Usage: sklearn_estimator.transformer()")

print("\n3. CANVAS INTEGRATION")
print("   - Business user access")
print("   - No-code interface")
print("   - Governance workflows")
print("   Usage: Deploy + Share via Studio")

print(f"\nSTUDIO COLLABORATION FEATURES:")
print("- Team model sharing")
print("- Business glossary integration")  
print("- Automated metadata tagging")
print("- Lineage tracking")
print("- Data catalog search")
print("- Amazon Q assistance")

print(f"\nQUICK DEPLOY FOR STUDIO:")
print("Uncomment the lines below for Studio deployment:")

print("""
# Studio-optimized deployment
# predictor = sklearn_estimator.deploy(
#     initial_instance_count=1,
#     instance_type='ml.m5.large',
#     endpoint_name='iris-studio-model',
#     tags=[
#         {'Key': 'Environment', 'Value': 'Studio'},
#         {'Key': 'Team', 'Value': 'DataScience'},
#         {'Key': 'Project', 'Value': 'IrisClassification'}
#     ]
# )
# 
# # Test the endpoint
# test_data = [[5.1, 3.5, 1.4, 0.2]]
# prediction = predictor.predict(test_data)
# print(f"Studio Prediction: {prediction}")
# 
# # deploy() does not register the model; use sklearn_estimator.register(...)
# # to add it to Model Registry and share it with team members
""")

print("\nCANVAS SHARING:")
print("After deployment, business users can access via Canvas:")
print("1. Model appears in Canvas model library")
print("2. Business users can make predictions")
print("3. No coding required for end users")
print("4. Governance and approval workflows")

print(f"\nSTUDIO ADVANTAGE:")
print("Unlike traditional SageMaker, Studio provides:")
print("- Built-in collaboration")
print("- Data governance")
print("- Business user access")
print("- Integrated workflows")

In [None]:
# 📋 Step-by-Step Studio Deployment Guide

print("🎯 SAGEMAKER UNIFIED STUDIO DEPLOYMENT GUIDE")
print("="*60)

print("\n1️⃣ STUDIO REAL-TIME ENDPOINT (Recommended)")
print("   Perfect for interactive predictions and team collaboration")
print()

print("   Step 1: Deploy with Studio Integration")
print("   ──────────────────────────────────────")
print("   • What it does: Creates endpoint with Studio governance")
print("   • Benefits: Team sharing, model registry, Canvas integration")
print("   • Command to run:")
print("     predictor = sklearn_estimator.deploy(")
print("         initial_instance_count=1,")
print("         instance_type='ml.m5.large',")
print("         endpoint_name='iris-studio-model'")
print("     )")
print()
print("   🔍 This enables:")
print("     → Real-time prediction endpoint")
print("     → Model registry entry (after you register the model)")
print("     → Team sharing permissions (via Studio)")
print("     → Canvas integration (after sharing the registered model)")
print()

print("   Step 2: Test and Share")
print("   ──────────────────────")
print("   • Test predictions: predictor.predict(test_data)")
print("   • Share with team: Via Studio permissions")
print("   • Enable Canvas: Share the registered model with business users")
print()

print("\n2️⃣ STUDIO BATCH TRANSFORM")
print("   Great for large-scale batch processing")
print()
print("   • Create transformer: sklearn_estimator.transformer()")
print("   • Process batch data: transformer.transform(s3_data)")
print("   • Results in S3: Automatic output to specified location")
print()

print("\n3️⃣ CANVAS BUSINESS USER ACCESS")
print("   Enable no-code ML for business teams")
print()
print("   • Deploy model (Step 1)")
print("   • Register and share the model so it appears in Canvas")
print("   • Business users get point-and-click interface")
print("   • Governance workflows apply")
print()

print("\n📊 STUDIO COLLABORATION WORKFLOW:")
print("   1. Data Scientist → Deploys model")
print("   2. Studio → Registers in model catalog")
print("   3. Team Members → Get shared access")
print("   4. Business Users → Access via Canvas")
print("   5. Governance → Tracks all usage")

print("\n🎨 CANVAS INTEGRATION BENEFITS:")
print("   • No coding required for business users")
print("   • Point-and-click predictions")
print("   • Built-in data visualization")
print("   • Automatic model explanations")
print("   • Approval workflows for sensitive models")

print("\n💡 STUDIO VS TRADITIONAL SAGEMAKER:")
print("Traditional:")
print("   → Deploy endpoint")
print("   → Manual sharing")
print("   → No business user access")
print("   → Limited collaboration")
print()
print("Studio:")
print("   → Deploy with governance")
print("   → Automatic team sharing")
print("   → Canvas integration")
print("   → Rich collaboration features")

print("\n🚀 READY TO DEPLOY IN STUDIO?")
print("   Choose Real-time Endpoint for best Studio experience!")

# Helper function for Studio deployment
def deploy_to_studio(estimator, endpoint_name_prefix="iris-studio"):
    """Deploy model with Studio-optimized settings"""
    from datetime import datetime
    
    endpoint_name = f"{endpoint_name_prefix}-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
    
    try:
        predictor = estimator.deploy(
            initial_instance_count=1,
            instance_type='ml.m5.large',
            endpoint_name=endpoint_name,
            tags=[
                {'Key': 'Environment', 'Value': 'Studio'},
                {'Key': 'Deployment', 'Value': 'Collaborative'},
                {'Key': 'CanvasEnabled', 'Value': 'true'}
            ]
        )
        
        print(f"✅ Studio deployment successful!")
        print(f"📍 Endpoint: {endpoint_name}")
        print(f"🎨 Canvas: Available for business users")
        print(f"👥 Team Access: Enabled via Studio permissions")
        
        return predictor
        
    except Exception as e:
        print(f"❌ Studio deployment failed: {e}")
        return None

print("\n🔧 Helper Function Available:")
print("   deploy_to_studio(sklearn_estimator)")
print("   → Deploys with Studio-optimized settings")
print("   → Enables Canvas integration")
print("   → Sets up team collaboration")
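
`deploy_to_studio` builds the endpoint name from a caller-supplied prefix and a timestamp. SageMaker endpoint names may contain only letters, digits, and hyphens, may not start or end with a hyphen, and are limited to 63 characters, so a bad prefix only surfaces as an API error at deploy time. The sketch below checks the name up front; `build_endpoint_name` is an illustrative helper, not part of this project.

```python
import re
from datetime import datetime

# Letters, digits, and hyphens; no leading or trailing hyphen
_ENDPOINT_NAME = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_ENDPOINT_NAME_LENGTH = 63


def build_endpoint_name(prefix: str, now: datetime) -> str:
    """Return '<prefix>-<timestamp>' or raise if SageMaker would reject it."""
    name = f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not _ENDPOINT_NAME.fullmatch(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


print(build_endpoint_name("iris-studio", datetime(2024, 1, 2, 3, 4, 5)))
```

Passing the timestamp in as an argument keeps the function deterministic. In the helper above it would be called with `datetime.now()`.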

In [None]:
# SageMaker Studio Deployment - Reliable Method

print("SageMaker Studio Deployment - Most Reliable Method")
print("=" * 60)
print("This approach uses SageMaker's built-in deployment capabilities")
print("No Docker required - No external scripts - Auto-handles inference")
print()

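# Ensure an estimator object exists; a newly created one is NOT trained
if 'sklearn_estimator' not in locals():
    print("No estimator found; creating a new one.")
    print("Note: it must be fitted (sklearn_estimator.fit(...)) before deploy() can succeed.")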
    
    from sagemaker.sklearn.estimator import SKLearn
    
    sklearn_estimator = SKLearn(
        entry_point='train.py',
        source_dir='src/model',
        role=role,
        instance_type='ml.m5.large',
        framework_version='1.2-1',
        py_version='py3',
        hyperparameters={
            'n_estimators': 100,
            'max_depth': 10,
            'random_state': 42
        },
        base_job_name='iris-training'
    )
    
    print("Estimator configured")
else:
    print("SageMaker estimator already available")

# Deploy with automatic retry and fallback
print(f"\nDEPLOYING MODEL TO SAGEMAKER ENDPOINT...")
print("This will take 6-10 minutes...")

deployment_successful = False
endpoint_name = f'iris-reliable-{datetime.now().strftime("%Y%m%d-%H%M%S")}'

try:
    print(f"Deploying to endpoint: {endpoint_name}")
    print("Using ml.m5.large instance...")
    
    predictor = sklearn_estimator.deploy(
        initial_instance_count=1,
        instance_type='ml.m5.large',
        endpoint_name=endpoint_name,
        wait=True
    )
    
    deployment_successful = True
    print(f"\nDEPLOYMENT SUCCESSFUL!")
    
except Exception as e:
    print(f"ml.m5.large deployment failed: {e}")
    print("Trying smaller instance type...")
    
    try:
        endpoint_name_small = f'iris-small-{datetime.now().strftime("%Y%m%d-%H%M%S")}'
        print(f"Deploying to endpoint: {endpoint_name_small}")
        print("Using ml.t2.medium instance...")
        
        predictor = sklearn_estimator.deploy(
            initial_instance_count=1,
            instance_type='ml.t2.medium',
            endpoint_name=endpoint_name_small,
            wait=True
        )
        
        deployment_successful = True
        endpoint_name = endpoint_name_small
        print(f"\nDEPLOYMENT SUCCESSFUL WITH SMALLER INSTANCE!")
        
    except Exception as e2:
        print(f"Both deployments failed:")
        print(f"ml.m5.large: {e}")
        print(f"ml.t2.medium: {e2}")
        deployment_successful = False

# Test the deployed model
if deployment_successful:
    print(f"\nModel deployed successfully!")
    print(f"Endpoint name: {predictor.endpoint_name}")
    
    # Test the endpoint
    print(f"\nTESTING THE ENDPOINT...")
    try:
        test_data = [[5.1, 3.5, 1.4, 0.2]]  # Sample iris data
        prediction = predictor.predict(test_data)
        print(f"Test prediction successful: {prediction}")
        
        # Test with multiple samples
        test_samples = [
            [5.1, 3.5, 1.4, 0.2],  # Should be setosa
            [7.0, 3.2, 4.7, 1.4],  # Should be versicolor
            [6.3, 3.3, 6.0, 2.5]   # Should be virginica
        ]
        
        predictions = predictor.predict(test_samples)
        print(f"Batch predictions: {predictions}")
        
        print(f"\nENDPOINT IS WORKING PERFECTLY!")
        
        # Save endpoint reference for later cells
        working_predictor = predictor
        working_endpoint_name = predictor.endpoint_name
        
        print(f"\nHOW TO USE YOUR ENDPOINT:")
        print(f"Endpoint name: {predictor.endpoint_name}")
        print(f"Usage: working_predictor.predict([[5.1, 3.5, 1.4, 0.2]])")
        
    except Exception as e:
        print(f"Endpoint test failed: {e}")
        print("Check CloudWatch logs for details")

else:
    print(f"\nDEPLOYMENT FAILED")
    print(f"Troubleshooting steps:")
    print(f"1. Check CloudWatch logs")
    print(f"2. Verify training job completed successfully")
    print(f"3. Check SageMaker quotas in your account")
    print(f"4. Try different region if current one is constrained")

print("\nTO CLEAN UP WHEN DONE:")
print("# predictor.delete_endpoint()  # Removes endpoint and stops billing")

print("\nestimator.deploy() is usually the simplest deployment path for framework estimators.")

In [None]:
# 🔧 ENDPOINT TROUBLESHOOTING & DIAGNOSTICS

print("🔧 SageMaker Endpoint Troubleshooting")
print("="*50)

import boto3
from datetime import datetime, timedelta

def diagnose_endpoint_issues():
    """Comprehensive endpoint diagnostics"""
    
    sagemaker_client = boto3.client('sagemaker')
    logs_client = boto3.client('logs')
    
    print("🔍 CHECKING RECENT ENDPOINTS...")
    
    try:
        # Get recent endpoints
        endpoints = sagemaker_client.list_endpoints(
            SortBy='CreationTime',
            SortOrder='Descending',
            MaxResults=5
        )
        
        for endpoint in endpoints['Endpoints']:
            name = endpoint['EndpointName']
            status = endpoint['EndpointStatus']
            created = endpoint['CreationTime']
            
            # Calculate time elapsed
            now = datetime.now(created.tzinfo)
            elapsed = now - created
            elapsed_minutes = int(elapsed.total_seconds() / 60)
            
            print(f"\n📍 {name}")
            print(f"   Status: {status}")
            print(f"   Created: {elapsed_minutes} minutes ago")
            
            # Get detailed info for failed endpoints
            if status in ['Failed', 'OutOfService']:
                try:
                    details = sagemaker_client.describe_endpoint(EndpointName=name)
                    if 'FailureReason' in details:
                        print(f"   ❌ Failure: {details['FailureReason']}")
                        
                        # Common failure patterns and solutions
                        failure_reason = details['FailureReason'].lower()
                        if 'ping health check' in failure_reason:
                            print("   💡 Solution: Model container not responding - check inference.py")
                        elif 'model loading' in failure_reason:
                            print("   💡 Solution: Model format issue - check model artifacts")
                        elif 'insufficient capacity' in failure_reason:
                            print("   💡 Solution: Try different instance type or region")
                        elif 'image' in failure_reason:
                            print("   💡 Solution: Framework version compatibility issue")
                            
                except Exception as e:
                    print(f"   ⚠️ Could not get details: {e}")
            
            elif status == 'Creating' and elapsed_minutes > 15:
                print("   ⚠️ Taking longer than expected - may fail soon")
            
            elif status == 'InService':
                print("   ✅ Healthy and ready!")
                
        print("\n🔍 CHECKING CLOUDWATCH LOGS...")
        
        # Check for recent SageMaker endpoint logs
        try:
            log_groups = logs_client.describe_log_groups(
                logGroupNamePrefix='/aws/sagemaker/Endpoints',
                limit=5
            )
            
            for lg in log_groups['logGroups']:
                log_group_name = lg['logGroupName']
                print(f"\n📄 Log group: {log_group_name}")
                
                # Get most recent log streams
                streams = logs_client.describe_log_streams(
                    logGroupName=log_group_name,
                    orderBy='LastEventTime',
                    descending=True,
                    limit=2
                )
                
                for stream in streams['logStreams']:
                    print(f"   📊 Stream: {stream['logStreamName']}")
                    
                    # Get recent error events
                    try:
                        events = logs_client.get_log_events(
                            logGroupName=log_group_name,
                            logStreamName=stream['logStreamName'],
                            startTime=int((datetime.now() - timedelta(hours=2)).timestamp() * 1000)
                        )
                        
                        error_events = [e for e in events['events'] 
                                      if any(keyword in e['message'].lower() 
                                           for keyword in ['error', 'failed', 'exception', 'traceback'])]
                        
                        if error_events:
                            print("   🚨 Recent errors found:")
                            for event in error_events[-3:]:  # Last 3 errors
                                timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                                print(f"      [{timestamp.strftime('%H:%M:%S')}] {event['message'][:100]}...")
                        else:
                            print("   ✅ No recent errors in this stream")
                            
                    except Exception as e:
                        print(f"   ⚠️ Could not read events: {e}")
                        
        except Exception as e:
            print(f"❌ Error checking CloudWatch logs: {e}")
            
    except Exception as e:
        print(f"❌ Error checking endpoints: {e}")

def fix_common_issues():
    """Provide fixes for common endpoint issues"""
    
    print("\n🛠️ COMMON FIXES:")
    print("1. 🏥 Health Check Failures:")
    print("   - Ensure inference.py has proper model_fn, input_fn, predict_fn")
    print("   - Check that model loads correctly")
    print("   - Verify requirements.txt has correct versions")
    
    print("\n2. 🔧 Model Loading Issues:")
    print("   - Check model.tar.gz format and contents")
    print("   - Ensure model was saved with correct sklearn version")
    print("   - Verify S3 permissions")
    
    print("\n3. 💾 Instance Issues:")
    print("   - Try smaller instance type (ml.t2.medium)")
    print("   - Check service quotas in AWS Console")
    print("   - Try different region")
    
    print("\n4. 🐳 Container Issues:")
    print("   - Use framework_version='1.2-1' (tested)")
    print("   - Avoid very old or very new framework versions")
    print("   - Check SageMaker container compatibility")

# Run diagnostics
print("🔍 Running endpoint diagnostics...")
diagnose_endpoint_issues()

print("\n" + "="*50)
fix_common_issues()

print("\n💡 BEST PRACTICE:")
print("Prefer sklearn_estimator.deploy(): the prebuilt framework container handles serving for you.")
print("Avoid custom containers unless absolutely necessary.")

## 🔧 Endpoint Deployment Troubleshooting & Recovery

If you encountered the error **"The primary container for production variant primary did not pass the ping health check"**, this section will help you diagnose and fix the issue.

### 🔍 **Common Causes:**
- ❌ Missing or incorrect inference script
- ❌ Model loading issues
- ❌ Container configuration problems
- ❌ Framework version mismatches
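
Model loading issues can often be ruled out before deploying by looking inside the model archive. The sketch below is a minimal check, assuming you have a local copy of `model.tar.gz`. The helper name `missing_artifact_files` and the default expected file name `model.joblib` are illustrative assumptions, so pass whatever file names your training script actually writes.

```python
import tarfile


def missing_artifact_files(archive_path, expected=("model.joblib",)):
    """Return the expected file names that are absent from a model archive.

    `expected` is an assumption about what the training script saved;
    adjust it to match your own artifacts.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        # Normalise "./model.joblib" style member names to "model.joblib"
        members = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return [name for name in expected if name not in members]
```

An empty list means every expected file is present; any names returned are worth investigating before you redeploy.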

### 📋 **Recovery Steps:**
1. **Diagnose** - Check CloudWatch logs
2. **Fix** - Create proper inference script
3. **Cleanup** - Remove failed resources
4. **Redeploy** - Use reliable method
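
For the **Fix** step, an inference script for the scikit-learn container typically defines `model_fn`, `input_fn`, and `predict_fn`. The following is a minimal sketch rather than the project's actual script: it assumes the training job saved the model as `model.joblib` (the `MODEL_FILE` constant is hypothetical) and that requests arrive as JSON lists of four iris features.

```python
import json
import os

MODEL_FILE = "model.joblib"  # assumption: file name written by the training script


def model_fn(model_dir):
    """Load the fitted model from the directory where model.tar.gz is unpacked."""
    import joblib  # imported lazily so the request helpers work without joblib
    return joblib.load(os.path.join(model_dir, MODEL_FILE))


def input_fn(request_body, content_type="application/json"):
    """Parse a JSON body such as [[5.1, 3.5, 1.4, 0.2]] into a list of rows."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    rows = json.loads(request_body)
    if not rows or not all(isinstance(row, list) and len(row) == 4 for row in rows):
        raise ValueError("Expected a non-empty list of rows with 4 features each")
    return rows


def predict_fn(input_data, model):
    """Run the loaded model on the parsed rows."""
    return model.predict(input_data)
```

Rejecting malformed input with a clear `ValueError` makes the cause visible in CloudWatch instead of surfacing as an opaque container error.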

In [None]:
# 🔍 STEP 1: Diagnose the Issue - Check CloudWatch Logs

print("🔍 DIAGNOSING ENDPOINT FAILURE...")
print("="*50)

import boto3
from datetime import datetime, timedelta

def check_endpoint_logs():
    """Check CloudWatch logs for endpoint failure details"""
    logs_client = boto3.client('logs')
    
    # Replace with your failed endpoint name
    failed_endpoint_name = 'iris-endpoint-20250822-235036'  # Update this!
    
    print(f"📋 Checking logs for endpoint: {failed_endpoint_name}")
    
    try:
        # List log groups related to SageMaker endpoints
        log_groups = logs_client.describe_log_groups(
            logGroupNamePrefix='/aws/sagemaker/Endpoints'
        )
        
        print("📄 Available SageMaker endpoint log groups:")
        endpoint_log_found = False
        
        for lg in log_groups['logGroups']:
            log_group_name = lg['logGroupName']
            print(f"  📁 {log_group_name}")
            
            # Check if this log group is for our failed endpoint
            if failed_endpoint_name in log_group_name:
                endpoint_log_found = True
                print("\n🎯 FOUND LOGS FOR FAILED ENDPOINT!")
                print(f"📂 Log Group: {log_group_name}")
                
                # Get recent log streams
                try:
                    streams = logs_client.describe_log_streams(
                        logGroupName=log_group_name,
                        orderBy='LastEventTime',
                        descending=True,
                        limit=3
                    )
                    
                    for stream in streams['logStreams']:
                        stream_name = stream['logStreamName']
                        print(f"\n📊 Log Stream: {stream_name}")
                        
                        # Get recent log events
                        events = logs_client.get_log_events(
                            logGroupName=log_group_name,
                            logStreamName=stream_name,
                            startTime=int((datetime.now() - timedelta(hours=2)).timestamp() * 1000)
                        )
                        
                        print("🔍 ERROR MESSAGES:")
                        error_count = 0
                        for event in events['events']:
                            message = event['message'].strip()
                            timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                            
                            # Look for error indicators
                            if any(keyword in message.lower() for keyword in ['error', 'failed', 'exception', 'traceback']):
                                print(f"  ❌ [{timestamp}] {message}")
                                error_count += 1
                            elif 'ping' in message.lower():
                                print(f"  🏓 [{timestamp}] {message}")
                        
                        if error_count == 0:
                            print("  ℹ️  No explicit errors found in this stream")
                            print("  📋 Last 5 messages:")
                            for event in events['events'][-5:]:
                                timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                                print(f"    [{timestamp}] {event['message'].strip()}")
                        
                except Exception as e:
                    print(f"  ⚠️ Could not read log streams: {e}")
        
        if not endpoint_log_found:
            print(f"\n⚠️ No logs found for endpoint: {failed_endpoint_name}")
            print("💡 This might mean:")
            print("   • The endpoint name is incorrect")
            print("   • The logs haven't been generated yet")
            print("   • The endpoint failed before logging started")
            
    except Exception as e:
        print(f"❌ Error accessing CloudWatch logs: {e}")
        print("💡 Make sure you have CloudWatch read permissions")

# Run the diagnostic
check_endpoint_logs()

print("\n💡 COMMON ISSUES AND SOLUTIONS:")
print("🔧 If you see 'ModuleNotFoundError': Missing dependencies")
print("🔧 If you see 'No module named inference': Missing inference.py")
print("🔧 If you see 'model loading failed': Wrong model format")
print("🔧 If no logs found: Endpoint failed during startup")

print("\n➡️ NEXT: Run the cells below to fix the issues")

## 🛠️ Studio MLOps Workflows

Now that your model is ready, let's explore how to use SageMaker Unified Studio's collaborative MLOps features.

In [None]:
# Explore Studio MLOps Features
print("🔍 SageMaker Unified Studio MLOps Capabilities:")
print()

studio_features = {
    'Model Registry': {
        'purpose': 'Centralized model catalog with versioning',
        'usage': 'Automatic registration when deploying models',
        'benefits': 'Team sharing, model lineage, approval workflows'
    },
    'Data Catalog': {
        'purpose': 'Discover and govern data assets',
        'usage': 'AI-powered data discovery with Amazon Q',
        'benefits': 'Data lineage, quality scores, business glossary'
    },
    'Canvas Integration': {
        'purpose': 'No-code ML for business users',
        'usage': 'Point-and-click access to deployed models',
        'benefits': 'Business user empowerment, governance workflows'
    },
    'Collaborative Notebooks': {
        'purpose': 'Team development and sharing',
        'usage': 'Real-time collaboration on ML experiments',
        'benefits': 'Knowledge sharing, version control, reproducibility'
    },
    'Project Management': {
        'purpose': 'Organize ML work by business projects',
        'usage': 'Group related assets and team members',
        'benefits': 'Better organization, access control, tracking'
    }
}

for feature, info in studio_features.items():
    print(f"✅ {feature}")
    print(f"   📝 Purpose: {info['purpose']}")
    print(f"   🎯 Usage: {info['usage']}")
    print(f"   💡 Benefits: {info['benefits']}")
    print()

# Check Studio environment capabilities
print("🔧 Studio Environment Check:")

try:
    import sagemaker
    
    # Check Model Registry access
    sm_session = sagemaker.Session()
    try:
        # List model package groups through the session's underlying boto3 client
        sm_session.sagemaker_client.list_model_package_groups(MaxResults=1)
        print("✅ Model Registry: Accessible")
    except Exception as e:
        print(f"⚠️ Model Registry: Limited access ({e})")
    
    # Studio-specific features (listed for reference; this cell does not verify them)
    print("✅ Real-time Endpoints: Available")
    print("✅ Batch Transform: Available")
    print("✅ Model Monitoring: Available")
    print("✅ Canvas Integration: Automatic")
    
except Exception as e:
    print(f"⚠️ Error checking Studio features: {e}")

print("\n🚀 Studio MLOps Workflow:")
print("1. 📊 Explore data using Data Catalog")
print("2. 🧪 Experiment in collaborative notebooks")
print("3. 🏗️ Train models with automatic versioning")
print("4. 🚀 Deploy with built-in governance")
print("5. 🎨 Share with business users via Canvas")
print("6. 📈 Monitor performance and drift")
print("7. 🔄 Iterate with team collaboration")

print("\n📋 Studio Project Organization:")
print("• Create projects for different business use cases")
print("• Invite team members with appropriate permissions")
print("• Organize datasets, models, and experiments")
print("• Set up approval workflows for sensitive models")

print("\n💡 Studio Advantages:")
print("• No external scripts needed - everything integrated")
print("• Automatic governance and compliance")
print("• Business user access without technical complexity")
print("• AI-powered assistance with Amazon Q")
print("• Rich collaboration features")

# Example Studio workflow
print("\n🎯 Example Studio Deployment Workflow:")
print("""
# 1. Deploy model with Studio integration
predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='iris-studio-model'
)

# 2. Depending on how your Studio project is configured,
#    the model can then be registered or shared through:
#    - Model Registry (with version tracking)
#    - Canvas (for business users)
#    - Team shared assets
#    - Data catalog (with lineage)

# 3. Test and collaborate
prediction = predictor.predict([[5.1, 3.5, 1.4, 0.2]])
# Share results with team via Studio

# 4. Business users can now:
#    - Access model via Canvas
#    - Make predictions without coding
#    - Follow governance workflows
""")

In [None]:
# 🚀 DEPLOY MODEL TO SAGEMAKER ENDPOINT - ROBUST VERSION

print("=" * 80)
print("🚀 DEPLOYING MODEL TO SAGEMAKER ENDPOINT")
print("=" * 80)

import time
import boto3
from datetime import datetime

# Configuration with fallback options
ENDPOINT_NAME = f"iris-model-demo-{int(time.time())}"
INSTANCE_TYPES = ['ml.m5.large', 'ml.m5.xlarge', 'ml.c5.large', 'ml.t3.medium']  # Fallback options
DEPLOYMENT_TIMEOUT = 600  # 10 minutes; informational only, not enforced by deploy()

def deploy_with_fallback(estimator, endpoint_name, instance_types):
    """Deploy model with instance type fallback for reliability"""
    
    sagemaker_client = boto3.client('sagemaker')
    
    for i, instance_type in enumerate(instance_types):
        try:
            print(f"\n🔄 Attempting deployment with {instance_type} (attempt {i+1}/{len(instance_types)})")
            
            # Try to deploy (update_endpoint is omitted: SageMaker Python SDK v2
            # no longer supports it in deploy())
            predictor = estimator.deploy(
                initial_instance_count=1,
                instance_type=instance_type,
                endpoint_name=endpoint_name,
                wait=True
            )
            
            print(f"✅ Successfully deployed on {instance_type}")
            return predictor, instance_type
            
        except Exception as e:
            error_msg = str(e)
            print(f"❌ Failed with {instance_type}: {error_msg}")
            
            # Clean up the failed endpoint and wait until it is really gone
            try:
                sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
                sagemaker_client.get_waiter('endpoint_deleted').wait(EndpointName=endpoint_name)
                print(f"🧹 Cleaned up failed endpoint: {endpoint_name}")
            except Exception as cleanup_error:
                print(f"⚠️ Endpoint cleanup skipped: {cleanup_error}")
            
            # The SDK normally names the endpoint config after the endpoint; remove it
            # too, otherwise the next attempt can fail on a name clash
            try:
                sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
            except Exception:
                pass  # Config may not exist if the failure happened earlier
            
            # Check if we should retry with next instance type
            if i < len(instance_types) - 1:
                print("🔄 Retrying with next instance type...")
                time.sleep(10)
            else:
                print(f"💥 All instance types failed. Last error: {error_msg}")
                raise RuntimeError(f"Deployment failed on all instance types: {error_msg}") from e

# Deploy the model
try:
    start_time = datetime.now()
    print(f"📅 Deployment started at: {start_time}")
    
    predictor, used_instance_type = deploy_with_fallback(
        sklearn_estimator, 
        ENDPOINT_NAME, 
        INSTANCE_TYPES
    )
    
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()
    
    print("\n🎉 DEPLOYMENT SUCCESSFUL!")
    print(f"📊 Endpoint Name: {ENDPOINT_NAME}")
    print(f"💻 Instance Type: {used_instance_type}")
    print(f"⏱️  Deployment Time: {duration:.1f} seconds")
    print(f"🌐 Console Link: https://console.aws.amazon.com/sagemaker/home#/endpoints/{ENDPOINT_NAME}")
    
    # Test the endpoint immediately
    print("\n🧪 TESTING ENDPOINT...")
    test_data = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 4.4, 1.4]]
    
    prediction = predictor.predict(test_data)
    print(f"✅ Test Prediction: {prediction}")
    print("📈 Model responded to the test request")
    
    # Store endpoint info for later use
    endpoint_info = {
        'endpoint_name': ENDPOINT_NAME,
        'instance_type': used_instance_type,
        'deployment_time': duration,
        'test_prediction': prediction
    }
    
    print("\n📝 Endpoint deployed and tested successfully!")
    print("💡 Use 'predictor.delete_endpoint()' to clean up when done")

except Exception as e:
    print("\n💥 DEPLOYMENT FAILED!")
    print(f"❌ Error: {str(e)}")
    print("\n🔧 Troubleshooting suggestions:")
    print("   1. Check your AWS account limits for SageMaker instances")
    print("   2. Verify the model was trained successfully")
    print("   3. Check CloudWatch logs for detailed error messages")
    print("   4. Try running the troubleshooting cell below")
    
    # Re-raise for debugging (a bare raise keeps the original traceback)
    raise

## 8. Test Model Predictions

Demonstrate how to test your deployed model using the same patterns as `scripts/test_endpoint.py`.
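
`scripts/test_endpoint.py` is not reproduced in this notebook, so the following is only a sketch of one common pattern: invoking the endpoint through the `sagemaker-runtime` client with a JSON payload. The helper name `invoke_json_endpoint` is hypothetical, and the client is passed in as a parameter so that the function can be exercised with a stub.

```python
import json


def invoke_json_endpoint(runtime_client, endpoint_name, rows):
    """Send feature rows as JSON to an endpoint and return the decoded response.

    `runtime_client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(rows),
    )
    # The response body is a stream of bytes containing the JSON prediction
    return json.loads(response["Body"].read().decode("utf-8"))
```

In the notebook this could be called as `invoke_json_endpoint(boto3.client("sagemaker-runtime"), ENDPOINT_NAME, [[5.1, 3.5, 1.4, 0.2]])`, assuming the endpoint from the deployment cell is in service.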

In [None]:
# 🔧 ENDPOINT TROUBLESHOOTING & DIAGNOSTICS

print("=" * 80)
print("🔧 SAGEMAKER ENDPOINT TROUBLESHOOTING & DIAGNOSTICS")
print("=" * 80)

import boto3
import json
import time
from datetime import datetime, timedelta

def comprehensive_endpoint_diagnostics(endpoint_name=None):
    """Comprehensive diagnostics for SageMaker endpoint issues"""
    
    sagemaker = boto3.client('sagemaker')
    logs_client = boto3.client('logs')
    
    print("🔍 Running comprehensive diagnostics...")
    
    # 1. List all endpoints if none specified
    if not endpoint_name:
        print("\n📋 LISTING ALL ENDPOINTS:")
        try:
            # Sort explicitly so that the first entry is the newest endpoint
            response = sagemaker.list_endpoints(SortBy='CreationTime', SortOrder='Descending')
            endpoints = response['Endpoints']
            
            if not endpoints:
                print("❌ No endpoints found. Deploy a model first.")
                return
            
            for ep in endpoints:
                status_icon = "✅" if ep['EndpointStatus'] == 'InService' else "❌"
                print(f"   {status_icon} {ep['EndpointName']}: {ep['EndpointStatus']}")
            
            # Use the most recent endpoint (first entry in descending order)
            endpoint_name = endpoints[0]['EndpointName']
            print(f"\n🎯 Using most recent endpoint: {endpoint_name}")
            
        except Exception as e:
            print(f"❌ Error listing endpoints: {e}")
            return
    
    # 2. Check endpoint status
    print("\n📊 ENDPOINT STATUS CHECK:")
    try:
        response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
        status = response['EndpointStatus']
        creation_time = response['CreationTime']
        
        status_icon = "✅" if status == 'InService' else "❌"
        print(f"   {status_icon} Status: {status}")
        print(f"   📅 Created: {creation_time}")
        
        if status == 'Failed':
            print(f"   💥 Failure Reason: {response.get('FailureReason', 'Unknown')}")
        
    except Exception as e:
        print(f"❌ Error checking endpoint status: {e}")
        return endpoint_name
    
    # 3. Check endpoint configuration
    print("\n⚙️  ENDPOINT CONFIGURATION:")
    try:
        config_name = response['EndpointConfigName']
        config_response = sagemaker.describe_endpoint_config(EndpointConfigName=config_name)
        
        for variant in config_response['ProductionVariants']:
            print(f"   🖥️  Instance Type: {variant['InstanceType']}")
            print(f"   📊 Instance Count: {variant['InitialInstanceCount']}")
            print(f"   🏷️  Variant Name: {variant['VariantName']}")
            
    except Exception as e:
        print(f"❌ Error checking endpoint config: {e}")
    
    # 4. Check CloudWatch logs
    print("\n📋 CLOUDWATCH LOGS (Last 30 minutes):")
    try:
        log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
        end_time = datetime.now()
        start_time = end_time - timedelta(minutes=30)
        
        # Get log streams
        streams_response = logs_client.describe_log_streams(
            logGroupName=log_group,
            orderBy='LastEventTime',
            descending=True,
            limit=5
        )
        
        if streams_response['logStreams']:
            print(f"   📝 Found {len(streams_response['logStreams'])} log streams")
            
            # Get recent log events
            stream_name = streams_response['logStreams'][0]['logStreamName']
            events_response = logs_client.get_log_events(
                logGroupName=log_group,
                logStreamName=stream_name,
                startTime=int(start_time.timestamp() * 1000),
                endTime=int(end_time.timestamp() * 1000),
                limit=20
            )
            
            events = events_response['events']
            if events:
                print("   📄 Recent log events:")
                for event in events[-10:]:  # Show last 10 events
                    timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                    message = event['message'].strip()
                    print(f"   {timestamp.strftime('%H:%M:%S')} | {message}")
            else:
                print("   ℹ️  No recent log events found")
        else:
            print("   ℹ️  No log streams found yet")
            
    except Exception as e:
        print(f"   ⚠️  CloudWatch logs not accessible: {e}")
    
    # 5. Test endpoint if InService
    if status == 'InService':
        print("\n🧪 ENDPOINT CONNECTIVITY TEST:")
        try:
            runtime = boto3.client('sagemaker-runtime')
            test_data = [[5.1, 3.5, 1.4, 0.2]]
            
            response = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(test_data)
            )
            
            result = json.loads(response['Body'].read().decode())
            print(f"   ✅ Test successful! Prediction: {result}")
            
        except Exception as e:
            print(f"   ❌ Test failed: {e}")
    
    # 6. Provide actionable recommendations
    print("\n💡 TROUBLESHOOTING RECOMMENDATIONS:")
    
    if status == 'Failed':
        print("   🔧 ENDPOINT FAILED - Try these fixes:")
        print("      1. Check the model artifacts exist and are accessible")
        print("      2. Verify the inference script defines model_fn (plus input_fn/predict_fn if customized)")
        print("         so the container can answer /ping and /invocations")
        print("      3. Ensure model dependencies are correctly specified")
        print("      4. Try a different instance type (ml.t3.medium for testing)")
        print("      5. Check AWS account limits and quotas")
        
    elif status == 'Creating':
        print("   ⏳ ENDPOINT CREATING - This is normal:")
        print("      • Typical deployment takes 6-10 minutes")
        print("      • Check again in a few minutes")
        print("      • Monitor CloudWatch logs for progress")
        
    elif status == 'InService':
        print("   ✅ ENDPOINT HEALTHY - All good!")
        print("      • You can make predictions")
        print("      • Remember to delete when done to save costs")
        
    else:
        print(f"   ⚠️  STATUS {status} - General troubleshooting:")
        print("      1. Wait a few minutes and check again")
        print("      2. Check AWS Service Health Dashboard")
        print("      3. Verify your AWS credentials and region")
    
    print("\n🔄 Run this cell again to refresh diagnostics")
    return endpoint_name

# Run diagnostics
try:
    # Try to use the endpoint from previous deployment
    endpoint_name = locals().get('ENDPOINT_NAME') or globals().get('endpoint_info', {}).get('endpoint_name')
    result_endpoint = comprehensive_endpoint_diagnostics(endpoint_name)
    
except Exception as e:
    print(f"💥 Diagnostics failed: {e}")
    print("💡 This might be normal if no endpoints exist yet")

## 9. Monitor Model Performance

Set up monitoring to track model performance and data drift over time.
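
As a starting point for performance monitoring, endpoint latency can be read from CloudWatch. The sketch below is a minimal example, not a full monitoring setup: it assumes the default variant name `AllTraffic`, the helper name `average_model_latency_ms` is hypothetical, and the CloudWatch client is passed in so the function can be tried with a stub. Data drift detection is not covered here; it needs a separate tool such as SageMaker Model Monitor.

```python
from datetime import datetime, timedelta, timezone


def average_model_latency_ms(cloudwatch_client, endpoint_name,
                             variant_name="AllTraffic", hours=1):
    """Return the mean ModelLatency in milliseconds, or None when there is no data.

    `cloudwatch_client` is expected to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # one datapoint per 5 minutes
        Statistics=["Average"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # ModelLatency is reported in microseconds; this is an unweighted
    # mean of the 5-minute averages, converted to milliseconds
    mean_us = sum(point["Average"] for point in datapoints) / len(datapoints)
    return mean_us / 1000.0
```

A `None` result usually just means the endpoint received no traffic in the chosen window.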

In [None]:
# Cleanup & Cost Management

print("=" * 80)
print("CLEANUP & COST MANAGEMENT")
print("=" * 80)

import boto3
from datetime import datetime

def comprehensive_cleanup():
    """Clean up all SageMaker resources to prevent unnecessary costs"""
    
    sagemaker = boto3.client('sagemaker')
    
    print(f"Scanning for SageMaker resources to clean up...")
    
    # List endpoints
    print(f"\nENDPOINTS:")
    try:
        response = sagemaker.list_endpoints()
        endpoints = response['Endpoints']
        
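        total_cost_per_hour = 0.0
        if not endpoints:
            print("   No endpoints found")
        else:
            for ep in endpoints:
                status_icon = "ACTIVE" if ep['EndpointStatus'] == 'InService' else "INACTIVE"
                # list_endpoints() does not return the instance type, so read it
                # from the endpoint configuration; fall back to ml.m5.large
                try:
                    desc = sagemaker.describe_endpoint(EndpointName=ep['EndpointName'])
                    config = sagemaker.describe_endpoint_config(EndpointConfigName=desc['EndpointConfigName'])
                    variant = config['ProductionVariants'][0]
                    instance_type = variant.get('InstanceType', 'ml.m5.large')
                    instance_count = variant.get('InitialInstanceCount', 1)
                except Exception:
                    instance_type, instance_count = 'ml.m5.large', 1
                cost_per_hour = get_estimated_cost(instance_type) * instance_count
                total_cost_per_hour += cost_per_hour
                print(f"   {status_icon} {ep['EndpointName']}: {ep['EndpointStatus']} (~${cost_per_hour:.2f}/hour)")
        
        if endpoints:
            print(f"\nEstimated monthly cost if left running: ${total_cost_per_hour * 24 * 30:.2f}")
            print("Delete endpoints to stop charges!")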
            
    except Exception as e:
        print(f"Error listing endpoints: {e}")
    
    # List training jobs
    print(f"\nRECENT TRAINING JOBS:")
    try:
        response = sagemaker.list_training_jobs(
            MaxResults=10,
            SortBy='CreationTime',
            SortOrder='Descending'
        )
        
        jobs = response['TrainingJobSummaries']
        if not jobs:
            print("   No recent training jobs")
        else:
            for job in jobs[:5]:
                # Print the real status (InProgress, Completed, Failed, Stopping, Stopped)
                print(f"   {job['TrainingJobName']}: {job['TrainingJobStatus']}")
                
    except Exception as e:
        print(f"Error listing training jobs: {e}")
    
    # List models
    print(f"\nMODELS:")
    try:
        response = sagemaker.list_models(MaxResults=10)
        models = response['Models']
        
        if not models:
            print("   No models found")
        else:
            for model in models:
                print(f"   {model['ModelName']}: {model['CreationTime']}")
                
    except Exception as e:
        print(f"Error listing models: {e}")
    
    # Cleanup options
    print(f"\nCLEANUP OPTIONS:")
    print(f"1. Delete specific endpoint: predictor.delete_endpoint()")
    print(f"2. Delete all endpoints: Use the functions below")
    print("3. Models and finished training jobs have no hourly charge (their S3 artifacts still incur storage costs)")

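def get_estimated_cost(instance_type):
    """Get a rough hourly cost estimate (USD) for an instance type.

    The figures are approximate on-demand prices and vary by region;
    check the SageMaker pricing page for current values.
    """
    cost_map = {
        'ml.t3.medium': 0.05,
        'ml.m5.large': 0.115,
        'ml.m5.xlarge': 0.23,
        'ml.c5.large': 0.102,
        'ml.c5.xlarge': 0.204
    }
    # Unknown instance types fall back to the ml.m5.large figure
    return cost_map.get(instance_type, 0.115)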

def delete_all_endpoints():
    """Delete all endpoints - USE WITH CAUTION"""
    sagemaker = boto3.client('sagemaker')
    
    try:
        response = sagemaker.list_endpoints()
        endpoints = response['Endpoints']
        
        if not endpoints:
            print("No endpoints to delete")
            return
        
        print(f"Deleting {len(endpoints)} endpoint(s)...")
        
        failed = 0
        for ep in endpoints:
            endpoint_name = ep['EndpointName']
            try:
                sagemaker.delete_endpoint(EndpointName=endpoint_name)
                print(f"   Deleted: {endpoint_name}")
            except Exception as e:
                failed += 1
                print(f"   Failed to delete {endpoint_name}: {e}")
        
        print(f"Cleanup finished: {len(endpoints) - failed} deleted, {failed} failed.")
        
    except Exception as e:
        print(f"Error during cleanup: {e}")

def delete_endpoint_by_name(endpoint_name):
    """Delete a specific endpoint by name"""
    sagemaker = boto3.client('sagemaker')
    
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print(f"Successfully deleted endpoint: {endpoint_name}")
    except Exception as e:
        print(f"Failed to delete endpoint {endpoint_name}: {e}")

# Run the resource scan
comprehensive_cleanup()

print(f"\nCLEANUP COMMANDS:")
print(f"Delete specific endpoint: delete_endpoint_by_name('your-endpoint-name')")
print(f"Delete ALL endpoints: delete_all_endpoints()  # USE WITH CAUTION")
print(f"Using predictor object: predictor.delete_endpoint()")

print("\nCOST REMINDERS:")
print("- Real-time endpoints are billed for every hour they run, even when idle (~$0.05-0.23/hour for the instance types above)")
print("- Training jobs only charge while running")
print("- Models stored in S3 have minimal storage costs")
print("- Always clean up endpoints when done!")

print("\nRun this cell periodically to check for resources that are still incurring charges")
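The hourly figures above add up when an endpoint is left running. As a rough illustration, the sketch below projects what a forgotten endpoint would cost over a day or a month. It reuses the illustrative rates from `get_estimated_cost`. The helper `project_endpoint_cost` and the example endpoint names are hypothetical, and real pricing varies by region.

```python
# Illustrative hourly rates (USD) copied from get_estimated_cost above.
# These are assumptions for the sake of the example, not current AWS prices.
HOURLY_RATES = {
    "ml.t3.medium": 0.05,
    "ml.m5.large": 0.115,
    "ml.m5.xlarge": 0.23,
    "ml.c5.large": 0.102,
    "ml.c5.xlarge": 0.204,
}
DEFAULT_RATE = 0.115  # same fallback the notebook uses for unknown types


def project_endpoint_cost(instance_type, instance_count=1, hours=24.0):
    """Estimate the cost (USD) of keeping one endpoint up for `hours` hours."""
    rate = HOURLY_RATES.get(instance_type, DEFAULT_RATE)
    return round(rate * instance_count * hours, 2)


# Hypothetical endpoints: (name, instance type, instance count)
example_endpoints = [
    ("demo-endpoint-a", "ml.t3.medium", 1),
    ("demo-endpoint-b", "ml.m5.xlarge", 2),
]

for name, instance_type, count in example_endpoints:
    per_day = project_endpoint_cost(instance_type, count, hours=24)
    per_month = project_endpoint_cost(instance_type, count, hours=24 * 30)
    print(f"{name}: ~${per_day:.2f}/day, ~${per_month:.2f} per 30 days")
```

The projection is linear in instance count and hours, so it covers only endpoint instance time and leaves out storage, data transfer, and training costs.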

## 🎯 Summary: Your Complete Studio MLOps Workflow

### What you've learned:

#### 🔄 **Studio Development Workflow:**
1. **Explore & Collaborate** → Use Studio's collaborative notebooks and data catalog
2. **Discover Data** → Leverage AI-powered data discovery with Amazon Q
3. **Train & Version** → Automatic model versioning and registry integration
4. **Deploy & Share** → Built-in governance with team and business user access
5. **Monitor & Govern** → Continuous monitoring with data governance features

#### 🏛️ **Studio Collaborative Features:**
- **Data Catalog** → AI-powered data discovery and governance
- **Model Registry** → Automatic versioning and team sharing
- **Canvas Integration** → No-code access for business users
- **Amazon Q** → Natural language assistance for data queries
- **Business Glossary** → Shared definitions and standards
- **Project Organization** → Team-based asset management

#### 🎨 **Canvas Business User Benefits:**
- **Point-and-Click ML** → No coding required for predictions
- **Automatic Model Access** → Your deployed models appear automatically
- **Built-in Governance** → Approval workflows and access controls
- **Visual Interface** → Business-friendly prediction interface
- **Data Visualization** → Automatic charts and explanations

#### 🚀 **Next Steps in Studio:**
1. **Create Projects** → Organize work by business use case
2. **Invite Team Members** → Set up collaborative workspace
3. **Set Up Governance** → Configure approval workflows
4. **Enable Canvas Users** → Give business users model access
5. **Monitor and Iterate** → Use built-in monitoring and feedback

#### 💡 **Studio Pro Tips:**
- Use Amazon Q to discover relevant datasets
- Leverage automatic model versioning for experiment tracking
- Set up approval workflows for production models
- Share notebooks with team members for knowledge transfer
- Use Canvas for business stakeholder demos
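The tip about approval workflows can be made concrete with the SageMaker Model Registry API. The sketch below is one possible way to record an approval decision, assuming your models are registered as model packages. `set_approval_status` is a hypothetical helper, not part of this project's scripts. It takes the client as a parameter so the logic can be exercised without AWS credentials.

```python
# Approval states accepted by the SageMaker UpdateModelPackage API
VALID_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(sm_client, model_package_arn, status, note=""):
    """Update the approval status of a registered model package.

    sm_client is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"); it is passed in rather than created here.
    """
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")

    kwargs = {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
    }
    if note:
        # ApprovalDescription is optional free text stored with the decision
        kwargs["ApprovalDescription"] = note

    return sm_client.update_model_package(**kwargs)
```

In a notebook this might be called as `set_approval_status(boto3.client("sagemaker"), "<your-model-package-arn>", "Approved", note="Validated on holdout set")`, where the ARN is a placeholder for one of your own model packages.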

#### 🎁 **What Studio Provides Out-of-the-Box:**
- ✅ **Governance** → Automatic compliance and audit trails
- ✅ **Collaboration** → Real-time team workspace
- ✅ **Business Access** → Canvas integration for non-technical users
- ✅ **AI Assistance** → Amazon Q for natural language queries
- ✅ **Data Discovery** → Intelligent data catalog
- ✅ **Security** → Enterprise-grade access controls

### 🎉 **You now have enterprise-grade collaborative MLOps with SageMaker Unified Studio!**

**Key Difference from Traditional SageMaker:**
- Traditional: Individual data scientist workflow
- Studio: **Collaborative team workflow with business user access**

**Business Impact:**
- Faster time-to-insight with collaborative features
- Business user empowerment through Canvas
- Better governance and compliance
- Enhanced data discovery and reuse