# üéâ Welcome to SageMaker Unified Studio!

**‚ú® UPGRADED INFRASTRUCTURE: This project now uses SageMaker Unified Studio instead of traditional notebook instances!**

## üöÄ **What's New in Your Environment:**

### üåü **SageMaker Unified Studio Features:**
- **ü§ù Collaborative Environment**: Work with your team in shared domains
- **üìä Data Governance**: Built-in data catalog and governance capabilities  
- **üé® SageMaker Canvas**: No-code ML for business users
- **ü§ñ Generative AI**: Integrated with Amazon Bedrock for AI assistance
- **üìà Advanced Analytics**: R/Python support with RStudio integration
- **üîí Enterprise Security**: Fine-grained access controls and data lineage

### üîß **Environment Details:**
- **Domain Type**: SageMaker Unified Studio Domain
- **User Profiles**: Multiple users with role-based access
- **Data Storage**: S3-backed with automatic synchronization
- **Instance Types**: Auto-scaling based on workload
- **Collaboration**: Shared notebooks and data assets

### üéØ **How to Use This Notebook:**
1. **Launch from Studio**: This notebook is optimized for SageMaker Studio
2. **Follow the Workflow**: Complete ML pipeline with governance features
3. **Collaborate**: Share notebooks and models with your team
4. **Use Data Catalog**: Discover and manage data assets
5. **Deploy with Confidence**: Production-ready deployment workflows

---

**üí° Ready to get started? Follow the complete workflow below!**

# üöÄ Data Scientist MLOps Workflow Guide

## Overview
This notebook demonstrates how to leverage the deployed MLOps infrastructure for your complete data science workflow. The infrastructure provides:

- **SageMaker Notebook Instance** - Your development environment
- **S3 Bucket** - Data and model artifact storage
- **ECR Repository** - Container image registry
- **Lambda Functions** - Pipeline orchestration
- **IAM Roles** - Secure access management


### üìÅ `src/` folder structure:
- `src/model/train.py` - Production-ready training script
- `src/model/inference.py` - Model serving logic
- `src/container/` - Docker containerization for custom algorithms
- `src/lambda/` - Pipeline automation functions

### üìÅ `scripts/` folder structure:
- `scripts/build_and_push.sh` - Build and deploy model containers
- `scripts/deploy.sh` - Deploy models to production
- `scripts/test_endpoint.py` - Test deployed models
- `scripts/cleanup.sh` - Clean up resources

## 1. Connect to AWS SageMaker Environment

First, let's establish connection to your deployed infrastructure and verify access to resources.

## üõ†Ô∏è Project Setup and Scripts Availability

This section ensures you have access to all the MLOps scripts and source code in your SageMaker environment.

In [None]:
# üîç ENVIRONMENT CHECK - SAGEMAKER UNIFIED STUDIO

print("=" * 80)
print("üîç SAGEMAKER UNIFIED STUDIO ENVIRONMENT CHECK")
print("=" * 80)

import boto3
import sagemaker
import pandas as pd
import numpy as np
import sklearn
import os
import json
from datetime import datetime

# Basic environment information
session = boto3.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity()['Account']
sagemaker_session = sagemaker.Session()

print(f"‚úÖ AWS Account: {account_id}")
print(f"‚úÖ AWS Region: {region}")
print(f"‚úÖ SageMaker Session: {sagemaker_session}")

# Check if running in SageMaker Studio
studio_metadata_path = "/opt/ml/metadata/resource-metadata.json"
if os.path.exists(studio_metadata_path):
    with open(studio_metadata_path, 'r') as f:
        metadata = json.load(f)
    print(f"üè† Environment: SageMaker Studio")
    print(f"üéØ Domain ID: {metadata.get('DomainId', 'N/A')}")
    print(f"? User Profile: {metadata.get('UserProfileName', 'N/A')}")
    print(f"üíª Instance Type: {metadata.get('ResourceArn', 'N/A').split('/')[-1] if metadata.get('ResourceArn') else 'N/A'}")
else:
    print(f"üè† Environment: Traditional SageMaker Notebook Instance")

# Package versions
print(f"\nüì¶ PACKAGE VERSIONS:")
print(f"   üìä SageMaker SDK: {sagemaker.__version__}")
print(f"   üêº Pandas: {pd.__version__}")
print(f"   üî¢ NumPy: {np.__version__}")
print(f"   ü§ñ Scikit-learn: {sklearn.__version__}")

# Check SageMaker capabilities
print(f"\nüöÄ SAGEMAKER CAPABILITIES:")
try:
    # Check default bucket
    bucket = sagemaker_session.default_bucket()
    print(f"   üì¶ Default S3 Bucket: {bucket}")
    
    # Check execution role
    role = sagemaker.get_execution_role()
    print(f"   üîê Execution Role: {role.split('/')[-1]}")
    
    # Check available instance types for training
    print(f"   üèãÔ∏è Training Instance Types: ml.m5.large, ml.m5.xlarge (and more)")
    print(f"   üéØ Inference Instance Types: ml.t2.medium, ml.m5.large (and more)")
    
except Exception as e:
    print(f"   ‚ö†Ô∏è SageMaker setup incomplete: {e}")

# Check data governance capabilities (Studio-specific)
print(f"\nüèõÔ∏è DATA GOVERNANCE & STUDIO FEATURES:")
try:
    # Check if we can access SageMaker Model Registry
    model_packages = sagemaker_session.list_model_packages(max_results=1)
    print(f"   üìã Model Registry: ‚úÖ Accessible")
except:
    print(f"   üìã Model Registry: ‚ö†Ô∏è Limited access")

try:
    # Check if Canvas is available (Studio feature)
    sm_client = boto3.client('sagemaker')
    domains = sm_client.list_domains(MaxResults=1)
    if domains['Domains']:
        print(f"   üé® SageMaker Canvas: ‚úÖ Available")
        print(f"   ü§ù Collaborative Features: ‚úÖ Enabled")
        print(f"   üìä Data Catalog: ‚úÖ Integrated")
    else:
        print(f"   üé® SageMaker Canvas: ‚ùå Not configured")
except:
    print(f"   üé® Studio Features: ‚ö†Ô∏è Limited access")

# Project file structure check
print(f"\nüìÅ PROJECT FILE STRUCTURE:")
project_dirs = ['scripts', 'src', 'data']
for dir_name in project_dirs:
    if os.path.exists(f"./project-files/{dir_name}"):
        files = os.listdir(f"./project-files/{dir_name}")
        print(f"   ? {dir_name}/: ‚úÖ ({len(files)} files)")
    elif os.path.exists(f"./{dir_name}"):
        files = os.listdir(f"./{dir_name}")
        print(f"   üìÇ {dir_name}/: ‚úÖ ({len(files)} files)")
    else:
        print(f"   üìÇ {dir_name}/: ‚ö†Ô∏è Not found (will be created)")

print(f"\nüéâ Environment check complete!")
print(f"üí° You're ready to start your ML workflow in SageMaker Unified Studio!")

# Set global variables for the notebook
REGION = region
ACCOUNT_ID = account_id
BUCKET = sagemaker_session.default_bucket()
ROLE = sagemaker.get_execution_role()

print(f"\nüîß Global variables set:")
print(f"   REGION = '{REGION}'")
print(f"   ACCOUNT_ID = '{ACCOUNT_ID}'")
print(f"   BUCKET = '{BUCKET}'")
print(f"   ROLE = '{ROLE.split('/')[-1]}'")

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json
import os

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()

print(f"‚úÖ SageMaker Session initialized")
print(f"üìç Region: {region}")
print(f"üîê Execution Role: {role}")

# Get S3 bucket from infrastructure (replace with your actual bucket name)
# This should match the bucket created by Terraform
s3_client = boto3.client('s3')
buckets = s3_client.list_buckets()
ml_bucket = None

for bucket in buckets['Buckets']:
    if 'ml-artifacts' in bucket['Name']:  # Match your Terraform bucket naming
        ml_bucket = bucket['Name']
        break

if ml_bucket:
    print(f"ü™£ Found ML Bucket: {ml_bucket}")
    
    # Setup project files automatically
    print("\nüîß Setting up project files...")
    setup_project_files(ml_bucket)
    
    # Re-check project structure
    print("\nüìÅ Updated project structure:")
    for dir_path in ['scripts', 'src/model', 'src/container', 'src/lambda']:
        if os.path.exists(dir_path):
            files = os.listdir(dir_path)
            print(f"  {dir_path}: {files}")
        else:
            print(f"  {dir_path}: Not available")
            
else:
    print("‚ö†Ô∏è ML bucket not found. Please check your Terraform deployment.")
    # Fallback - you can manually set the bucket name here
    ml_bucket = "your-ml-artifacts-bucket-name"

## üèõÔ∏è Data Governance & Collaboration in SageMaker Unified Studio

**üåü NEW FEATURE: Your environment now includes advanced data governance and collaboration capabilities!**

### üìä **Data Catalog & Discovery**

SageMaker Unified Studio includes a built-in data catalog that helps you:

- üîç **Discover Data Assets**: Search and find datasets across your organization
- üìã **Manage Metadata**: Automatically catalog data with AI-generated descriptions
- üîó **Track Data Lineage**: See how data flows through your ML pipelines
- üìà **Monitor Data Quality**: Get quality scores and validation reports
- ü§ñ **Ask Amazon Q**: Use natural language to find the data you need

### ü§ù **Collaborative Features**

Work seamlessly with your team:

- üë• **Shared Workspaces**: Collaborate on notebooks and experiments
- üì§ **Asset Sharing**: Share models, datasets, and notebooks with fine-grained permissions
- üìù **Business Glossary**: Create shared definitions and standards
- üè∑Ô∏è **Data Products**: Package and distribute curated datasets
- üí¨ **Team Communication**: Built-in collaboration tools

### üîê **Governance & Security**

Enterprise-grade governance:

- üõ°Ô∏è **Fine-grained Access Control**: Role-based permissions for data and models
- üìä **Audit Trails**: Complete tracking of data access and model usage
- üè¢ **Business Units**: Organize assets by teams and departments
- üìú **Compliance**: Built-in tools for regulatory compliance
- üîí **Data Privacy**: Automated PII detection and protection

### üé® **SageMaker Canvas Integration**

No-code ML for business users:

- üñ±Ô∏è **Point-and-Click ML**: Build models without coding
- üìä **Automatic Insights**: AI-powered data analysis
- üìà **Business Forecasting**: Time-series prediction made easy
- üìã **Model Sharing**: Share Canvas models with data scientists

---

**üí° Next: Let's set up your data and create your first governed ML pipeline!**

In [None]:
# Load data from S3
data_key = 'data/iris.csv'
data_path = f's3://{ml_bucket}/{data_key}'

print(f"üì• Loading data from: {data_path}")

try:
    # Load dataset
    df = pd.read_csv(data_path)
    print(f"‚úÖ Data loaded successfully!")
    print(f"üìä Dataset shape: {df.shape}")
    print(f"üìã Columns: {list(df.columns)}")
    
    # Display first few rows
    print("\nüîç First 5 rows:")
    display(df.head())
    
    # Basic statistics
    print("\nüìà Dataset Statistics:")
    display(df.describe())
    
except Exception as e:
    print(f"‚ùå Error loading data: {e}")
    print("üí° You may need to upload the iris.csv file to your S3 bucket manually")

## 3. Data Preprocessing and Cleaning

Perform data cleaning and preprocessing. In a real scenario, this is where you'd handle missing values, outliers, and data quality issues.

In [None]:
# Data quality checks
print("üîç Data Quality Assessment:")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\nDuplicate rows: {df.duplicated().sum()}")
print(f"\nData types:\n{df.dtypes}")

# Check target distribution
print(f"\nüéØ Target distribution:")
print(df['species'].value_counts())

# Visualization
plt.figure(figsize=(15, 10))

# Distribution plots
plt.subplot(2, 3, 1)
df['species'].value_counts().plot(kind='bar')
plt.title('Species Distribution')
plt.xticks(rotation=45)

# Feature distributions
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for i, feature in enumerate(features, 2):
    plt.subplot(2, 3, i)
    df[feature].hist(bins=20, alpha=0.7)
    plt.title(f'{feature.replace("_", " ").title()} Distribution')
    plt.xlabel(feature)

plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df[features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

## 4. Feature Engineering Pipeline

Create feature engineering transformations and save processed data back to S3 for reusability.

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import joblib

# Feature engineering
print("‚öôÔ∏è Feature Engineering...")

# Create additional features
df_processed = df.copy()
df_processed['sepal_ratio'] = df_processed['sepal_length'] / df_processed['sepal_width']
df_processed['petal_ratio'] = df_processed['petal_length'] / df_processed['petal_width']
df_processed['sepal_area'] = df_processed['sepal_length'] * df_processed['sepal_width']
df_processed['petal_area'] = df_processed['petal_length'] * df_processed['petal_width']

print(f"‚úÖ Added engineered features: {['sepal_ratio', 'petal_ratio', 'sepal_area', 'petal_area']}")

# Separate features and target
feature_columns = [col for col in df_processed.columns if col != 'species']
X = df_processed[feature_columns]
y = df_processed['species']

print(f"üìä Feature matrix shape: {X.shape}")
print(f"üéØ Target vector shape: {y.shape}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"üîÑ Train set: {X_train.shape}, Test set: {X_test.shape}")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns, index=X_test.index)

print("‚úÖ Features scaled using StandardScaler")

# Save processed data to S3
processed_data_key = 'processed-data'
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save training data
train_data = pd.concat([X_train_scaled, y_train], axis=1)
train_s3_path = f's3://{ml_bucket}/{processed_data_key}/train_{timestamp}.csv'
train_data.to_csv(train_s3_path, index=False)
print(f"üíæ Training data saved to: {train_s3_path}")

# Save test data
test_data = pd.concat([X_test_scaled, y_test], axis=1)
test_s3_path = f's3://{ml_bucket}/{processed_data_key}/test_{timestamp}.csv'
test_data.to_csv(test_s3_path, index=False)
print(f"üíæ Test data saved to: {test_s3_path}")

# Save scaler for later use
scaler_path = f'/tmp/scaler_{timestamp}.joblib'
joblib.dump(scaler, scaler_path)
scaler_s3_path = f's3://{ml_bucket}/models/scaler_{timestamp}.joblib'
sagemaker_session.upload_data(scaler_path, bucket=ml_bucket, key_prefix='models')
print(f"üíæ Scaler saved to: {scaler_s3_path}")

## 5. Model Training with SageMaker Estimators

Now we'll use the production-ready training script from `src/model/train.py` with SageMaker's training capabilities.

In [None]:
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.inputs import TrainingInput

print("üöÄ Setting up SageMaker Training Job...")

# Configure the SKLearn estimator
sklearn_estimator = SKLearn(
    entry_point='train.py',
    source_dir='src/model',
    role=role,
    instance_type='ml.m5.large',
    framework_version='1.2-1',
    py_version='py3',
    hyperparameters={
        'n_estimators': 100,
        'max_depth': 10,
        'random_state': 42
    },
    base_job_name='iris-training'
)

print("‚úÖ SKLearn estimator configured")

# Define training input
train_input = TrainingInput(
    s3_data=f's3://{ml_bucket}/data/',
    content_type='text/csv'
)

print(f"üì• Training input configured: s3://{ml_bucket}/data/")

# Check if data exists in S3
print("üîç Checking training data availability...")
try:
    response = s3_client.list_objects_v2(Bucket=ml_bucket, Prefix='data/')
    if 'Contents' in response:
        print("‚úÖ Data found in S3:")
        for obj in response['Contents']:
            print(f"  üìÑ {obj['Key']} ({obj['Size']} bytes)")
    else:
        print("‚ùå No data found in S3 data/ folder")
        print("üí° Run 'terraform apply' to upload the iris.csv file")
except Exception as e:
    print(f"‚ö†Ô∏è Error checking S3: {e}")

print("\nüéØ Ready to start training!")
print("To train the model, uncomment and run:")
print("# sklearn_estimator.fit({'training': train_input})")

# Uncomment the line below when you're ready to start training:
# sklearn_estimator.fit({'training': train_input})

## 6. Model Evaluation and Validation

Let's demonstrate local model training and evaluation using the same logic as our production script.

## üîß Troubleshooting SageMaker Training Issues

If you encounter training job failures, here are common issues and solutions:

In [None]:
# SageMaker Training Troubleshooting Guide

print("? SageMaker Training Troubleshooting")
print()

common_issues = {
    "Data not found": "Ensure iris.csv is in s3://{bucket}/data/ - run 'terraform apply'",
    "Script execution error": "Check training script logs in CloudWatch",
    "Framework version issues": "Using framework_version='1.2-1' (current stable)",
    "Permission errors": "Verify SageMaker execution role has S3 access"
}

for issue, solution in common_issues.items():
    print(f"‚Ä¢ {issue}: {solution}")

print(f"\n? Current setup:")
print(f"  Training script: src/model/train.py")
print(f"  Requirements: src/model/requirements.txt") 
print(f"  Data location: s3://{ml_bucket}/data/")
print(f"  Framework: sklearn 1.2-1")

print(f"\nüí° To debug training failures:")
print(f"  1. Check CloudWatch logs for the training job")
print(f"  2. Verify data exists in S3")
print(f"  3. Test training script locally first")

print(f"\n‚úÖ Training script is ready and well-structured!")

In [None]:
# Quick Environment Check

print("? Environment Status:")

# Check training script
if os.path.exists('src/model/train.py'):
    print("‚úÖ Training script: src/model/train.py")
else:
    print("‚ùå Training script missing")

# Check requirements
if os.path.exists('src/model/requirements.txt'):
    print("‚úÖ Requirements: src/model/requirements.txt")
else:
    print("‚ùå Requirements file missing")

# Check S3 data
try:
    if ml_bucket:
        response = s3_client.list_objects_v2(Bucket=ml_bucket, Prefix='data/iris.csv')
        if 'Contents' in response:
            print("‚úÖ Data: iris.csv found in S3")
        else:
            print("‚ùå Data: iris.csv not found in S3")
    else:
        print("‚ùå ML bucket not identified")
except:
    print("‚ö†Ô∏è Cannot check S3 data")

# Check role
print(f"‚úÖ SageMaker role: {role.split('/')[-1]}")

print(f"\n? Ready to train with SageMaker!")

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import seaborn as sns

print("üî¨ Model Training and Evaluation...")

# Train model locally (same algorithm as production script)
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=10
)

model.fit(X_train_scaled, y_train)
print("‚úÖ Model trained successfully")

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"üéØ Test Accuracy: {accuracy:.4f}")

# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"üìä Cross-validation scores: {cv_scores}")
print(f"üìä Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Classification report
print("\nüìã Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Feature importance
plt.subplot(1, 2, 2)
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

sns.barplot(data=feature_importance, y='feature', x='importance')
plt.title('Feature Importance')
plt.xlabel('Importance')

plt.tight_layout()
plt.show()

print("\nüîù Top 5 most important features:")
print(feature_importance.head())

## 7. Deploy Model to SageMaker Endpoint

Once satisfied with the model performance, deploy it using SageMaker's inference infrastructure.

## üöÄ Complete Deployment Guide: Using the MLOps Scripts

This section shows you exactly how to deploy your model. You have **multiple options** depending on your environment and requirements.

### üìã Overview of Deployment Options

#### Option 1: üéØ **SageMaker-Only Deployment** (RECOMMENDED for SageMaker)
- ‚úÖ Works entirely within SageMaker notebook
- ‚úÖ No external dependencies required
- ‚úÖ Uses `sklearn_estimator.deploy()`
- ‚úÖ Perfect for development and testing

#### Option 2: üê≥ **Containerized Deployment** (Production-Ready)
- üèóÔ∏è Requires Docker and ECR access
- üèóÔ∏è Best for production environments
- üèóÔ∏è **Scripts**: `build_and_push.sh` ‚Üí `deploy.sh`
- ‚ö†Ô∏è **Note**: Scripts now auto-discover resources if Terraform folder not available

#### Option 3: üîß **Step Functions Pipeline** (Enterprise)
- üèóÔ∏è Orchestrated deployment via AWS Step Functions
- üèóÔ∏è Includes monitoring and rollback
- üèóÔ∏è **Script**: `deploy.sh`

---

### üéØ **Option 1: SageMaker-Only Deployment (EASIEST)**

This is the **recommended approach** for SageMaker notebook users:

```python
# Deploy directly from your trained estimator
predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='iris-model-demo'
)

# Test immediately
test_data = [[5.1, 3.5, 1.4, 0.2]]
prediction = predictor.predict(test_data)
print(f"Prediction: {prediction}")

# Clean up when done
predictor.delete_endpoint()
```

**Benefits:**
- ‚úÖ No external scripts needed
- ‚úÖ Works entirely in SageMaker
- ‚úÖ Immediate testing capability
- ‚úÖ Easy cleanup

---

### ?Ô∏è **Option 2: Production Deployment with Docker**

**‚ö†Ô∏è Updated**: Scripts now work in SageMaker environment!

#### Step 1: Build and Push Container Image

```bash
# The script now auto-discovers ECR repository
cd scripts
./build_and_push.sh
```

**What this script does:**
- üîç Auto-discovers ECR repository from AWS CLI
- ‚úÖ Logs into Amazon ECR
- ‚úÖ Builds Docker image from `src/container/Dockerfile`
- ‚úÖ Tags and pushes image to ECR

#### Step 2: Deploy Using Step Functions

```bash
# The script now auto-discovers Step Functions
./deploy.sh
```

**What this script does:**
- üîç Auto-discovers Step Functions from AWS CLI
- ‚úÖ Starts deployment execution
- ‚úÖ Monitors progress
- ‚úÖ Creates SageMaker endpoint

---

### üß™ **Testing Your Deployed Model**

Once deployed with any method:

```python
import boto3
import json

runtime = boto3.client('sagemaker-runtime')
endpoint_name = 'your-endpoint-name'
payload = [[5.1, 3.5, 1.4, 0.2]]

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode())
print(f'Prediction: {result}')
```

---

### üßπ **Cleanup Resources**

**For SageMaker-Only deployments:**
```python
predictor.delete_endpoint()
```

**For Docker/Step Functions deployments:**
```bash
cd scripts
./cleanup.sh
```

---

### üí° **Which Option Should You Choose?**

- **üéØ In SageMaker Notebook**: Use Option 1 (SageMaker-Only)
- **üè≠ Production Environment**: Use Option 2 (Docker + Step Functions)
- **üß™ Quick Testing**: Use Option 1
- **? CI/CD Pipeline**: Use Option 2

**Next**: Choose your deployment method and follow the steps below!

In [None]:
# üöÄ Deployment Demo: Let's Deploy Your Model!

print("üöÄ SageMaker Model Deployment Options")
print("="*50)

# Check if we have the required scripts
import os
import subprocess

scripts_available = []
required_scripts = ['build_and_push.sh', 'deploy.sh', 'test_endpoint.py', 'cleanup.sh']

for script in required_scripts:
    script_path = f'scripts/{script}'
    if os.path.exists(script_path):
        scripts_available.append(script)
        print(f"‚úÖ {script}")
    else:
        print(f"‚ùå {script} - Not found")

print(f"\nüìä Scripts available: {len(scripts_available)}/{len(required_scripts)}")

if len(scripts_available) == len(required_scripts):
    print("üéâ All deployment scripts are ready!")
    
    print("\nüõ†Ô∏è OPTION 1: Production Deployment (Recommended)")
    print("   Step 1: Build and push Docker image")
    print("   Command: cd scripts && ./build_and_push.sh")
    print("   ")
    print("   Step 2: Deploy using Step Functions")
    print("   Command: cd scripts && ./deploy.sh")
    print("   ")
    print("   Step 3: Test your endpoint")
    print("   Command: cd scripts && python test_endpoint.py --endpoint-name YOUR_ENDPOINT")
    
    print("\nüéØ OPTION 2: Quick Notebook Deployment")
    print("   Use the sklearn_estimator.deploy() method below")
    
    print("\nüí° Choose Option 1 for production, Option 2 for quick testing")
    
else:
    print("‚ö†Ô∏è Some scripts are missing. They should be available in your SageMaker environment.")
    print("   Make sure you've run 'terraform apply' to set up the infrastructure.")

print("\n" + "="*50)
print("üìã Deployment Checklist:")
print("‚òê Model trained successfully")
print("‚òê Scripts available")
print("‚òê AWS credentials configured")
print("‚òê ECR repository exists (from Terraform)")
print("‚òê Docker installed (for Option 1)")

# Quick deployment option for testing
print("\nüöÄ QUICK DEPLOY (Option 2):")
print("Uncomment the lines below for immediate deployment:")

print("""
# Quick deployment for testing
# predictor = sklearn_estimator.deploy(
#     initial_instance_count=1,
#     instance_type='ml.m5.large',
#     endpoint_name='iris-quick-test'
# )
# 
# # Test the endpoint
# test_data = [[5.1, 3.5, 1.4, 0.2]]
# prediction = predictor.predict(test_data)
# print(f"Prediction: {prediction}")
# 
# # Remember to clean up when done:
# # predictor.delete_endpoint()
""")

print("\nüê≥ PRODUCTION DEPLOY (Option 1):")
print("Run these commands in terminal:")
print("   cd scripts")
print("   ./build_and_push.sh")
print("   ./deploy.sh")

In [None]:
# üìã Step-by-Step Deployment Instructions

print("üéØ STEP-BY-STEP DEPLOYMENT GUIDE")
print("="*50)

print("\n1Ô∏è‚É£ PRODUCTION DEPLOYMENT (build_and_push.sh + deploy.sh)")
print("   This is the recommended approach for production systems")
print()

print("   Step 1: Build and Push Docker Image")
print("   ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ")
print("   ‚Ä¢ What it does: Creates a Docker container with your model")
print("   ‚Ä¢ Requirements: Docker installed, ECR repository from Terraform")
print("   ‚Ä¢ Command to run:")
print("     cd scripts")
print("     ./build_and_push.sh")
print()
print("   üîç This script will:")
print("     ‚Üí Login to Amazon ECR")
print("     ‚Üí Build Docker image using src/container/Dockerfile")
print("     ‚Üí Tag and push image to ECR repository")
print("     ‚Üí Clean up local images (optional)")
print()

print("   Step 2: Deploy Using Step Functions")
print("   ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ")
print("   ‚Ä¢ What it does: Orchestrates deployment via AWS Step Functions")
print("   ‚Ä¢ Requirements: Step Functions from Terraform")
print("   ‚Ä¢ Command to run:")
print("     ./deploy.sh")
print()
print("   üîç This script will:")
print("     ‚Üí Start Step Functions execution")
print("     ‚Üí Monitor deployment progress")
print("     ‚Üí Create SageMaker endpoint")
print("     ‚Üí Show endpoint testing instructions")
print()

print("   Step 3: Test Your Endpoint")
print("   ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ")
print("   ‚Ä¢ Command to run:")
print("     python test_endpoint.py --endpoint-name YOUR_ENDPOINT_NAME")
print()

print("\n2Ô∏è‚É£ QUICK DEPLOYMENT (sklearn_estimator.deploy())")
print("   Good for testing and experimentation")
print()
print("   ‚Ä¢ Uncomment the code in the cell above")
print("   ‚Ä¢ Runs directly from this notebook")
print("   ‚Ä¢ Faster but less production-ready")
print()

print("\n3Ô∏è‚É£ CLEANUP (cleanup.sh)")
print("   Run this when you're done to avoid AWS charges")
print()
print("   ‚Ä¢ Command to run:")
print("     ./cleanup.sh")
print("   ‚Ä¢ Removes endpoints, models, and billable resources")
print()

print("üí° RECOMMENDATION:")
print("   ‚Ä¢ For learning/testing: Use Quick Deployment (#2)")
print("   ‚Ä¢ For production: Use Production Deployment (#1)")
print("   ‚Ä¢ Always cleanup when done (#3)")

print("\nüöÄ READY TO START?")
print("   Choose your deployment method and run the commands!")

# Helper function to run scripts from notebook
def run_deployment_script(script_name):
    """Helper function to run deployment scripts from notebook"""
    print(f"Running {script_name}...")
    try:
        result = subprocess.run(
            f"cd scripts && ./{script_name}", 
            shell=True, 
            capture_output=True, 
            text=True
        )
        print("STDOUT:")
        print(result.stdout)
        if result.stderr:
            print("STDERR:")
            print(result.stderr)
        return result.returncode == 0
    except Exception as e:
        print(f"Error running {script_name}: {e}")
        return False

print("\nüîß Helper Functions Available:")
print("   run_deployment_script('build_and_push.sh')")
print("   run_deployment_script('deploy.sh')")
print("   run_deployment_script('cleanup.sh')")
print()
print("üí° You can use these functions or run scripts directly in terminal")

In [None]:
# üéØ SAGEMAKER-ONLY DEPLOYMENT (RECOMMENDED APPROACH)

print("üéØ SageMaker-Only Deployment - Most Reliable Method")
print("="*60)
print("This approach uses SageMaker's built-in deployment capabilities")
print("‚úÖ No Docker required  ‚úÖ No external scripts  ‚úÖ Auto-handles inference")
print()

# Step 1: Ensure we have a trained estimator
if 'sklearn_estimator' not in locals():
    print("üîß Setting up SageMaker estimator with completed training job...")
    
    from sagemaker.sklearn.estimator import SKLearn
    
    sklearn_estimator = SKLearn(
        entry_point='train.py',
        source_dir='src/model',
        role=role,
        instance_type='ml.m5.large',
        framework_version='1.2-1',
        py_version='py3',
        hyperparameters={
            'n_estimators': 100,
            'max_depth': 10,
            'random_state': 42
        },
        base_job_name='iris-training'
    )
    
    # Point to the completed training job
    sklearn_estimator.model_data = 's3://sagemaker-us-east-1-590184049545/iris-training-2025-08-22-23-22-58-706/output/model.tar.gz'
    print("‚úÖ Estimator configured with existing model artifacts")
else:
    print("‚úÖ SageMaker estimator already available")

print(f"\nüì¶ Model artifacts: {sklearn_estimator.model_data}")

# Step 2: Deploy with automatic retry and fallback
print("\nüöÄ DEPLOYING MODEL TO SAGEMAKER ENDPOINT...")
print("This will take 6-10 minutes...")

deployment_successful = False
endpoint_name = f'iris-reliable-{datetime.now().strftime("%Y%m%d-%H%M%S")}'

try:
    print(f"üìç Deploying to endpoint: {endpoint_name}")
    print("‚è±Ô∏è Using ml.m5.large instance...")
    
    predictor = sklearn_estimator.deploy(
        initial_instance_count=1,
        instance_type='ml.m5.large',
        endpoint_name=endpoint_name,
        wait=True  # Wait for deployment to complete
    )
    
    deployment_successful = True
    print(f"\n? DEPLOYMENT SUCCESSFUL!")
    
except Exception as e:
    print(f"‚ö†Ô∏è ml.m5.large deployment failed: {e}")
    print("? Trying smaller instance type...")
    
    try:
        endpoint_name_small = f'iris-small-{datetime.now().strftime("%Y%m%d-%H%M%S")}'
        print(f"üìç Deploying to endpoint: {endpoint_name_small}")
        print("‚è±Ô∏è Using ml.t2.medium instance...")
        
        predictor = sklearn_estimator.deploy(
            initial_instance_count=1,
            instance_type='ml.t2.medium',
            endpoint_name=endpoint_name_small,
            wait=True
        )
        
        deployment_successful = True
        endpoint_name = endpoint_name_small
        print(f"\nüéâ DEPLOYMENT SUCCESSFUL WITH SMALLER INSTANCE!")
        
    except Exception as e2:
        print(f"‚ùå Both deployments failed:")
        print(f"   ml.m5.large: {e}")
        print(f"   ml.t2.medium: {e2}")
        deployment_successful = False

# Step 3: Test the deployed model
if deployment_successful:
    print(f"\n‚úÖ Model deployed successfully!")
    print(f"? Endpoint name: {predictor.endpoint_name}")
    print(f"üìç Instance type: {predictor.endpoint_name.split('-')[1] if 'small' in predictor.endpoint_name else 'ml.m5.large'}")
    
    # Test the endpoint
    print(f"\nüß™ TESTING THE ENDPOINT...")
    try:
        test_data = [[5.1, 3.5, 1.4, 0.2]]  # Sample iris data
        prediction = predictor.predict(test_data)
        print(f"üéØ Test prediction successful: {prediction}")
        
        # Test with multiple samples
        test_samples = [
            [5.1, 3.5, 1.4, 0.2],  # Should be setosa
            [7.0, 3.2, 4.7, 1.4],  # Should be versicolor
            [6.3, 3.3, 6.0, 2.5]   # Should be virginica
        ]
        
        predictions = predictor.predict(test_samples)
        print(f"üéØ Batch predictions: {predictions}")
        
        print(f"\nüéâ ENDPOINT IS WORKING PERFECTLY!")
        
        # Save endpoint reference
        globals()['working_predictor'] = predictor
        globals()['working_endpoint_name'] = predictor.endpoint_name
        
        print(f"\nüìã HOW TO USE YOUR ENDPOINT:")
        print(f"   Endpoint name: {predictor.endpoint_name}")
        print(f"   Usage: working_predictor.predict([[5.1, 3.5, 1.4, 0.2]])")
        
    except Exception as e:
        print(f"‚ùå Endpoint test failed: {e}")
        print("üí° Check CloudWatch logs for details")

else:
    print(f"\n‚ùå DEPLOYMENT FAILED")
    print(f"üí° Troubleshooting steps:")
    print(f"   1. Check CloudWatch logs")
    print(f"   2. Verify training job completed successfully")
    print(f"   3. Check SageMaker quotas in your account")
    print(f"   4. Try different region if current one is constrained")

print(f"\nüßπ TO CLEAN UP WHEN DONE:")
print(f"# working_predictor.delete_endpoint()  # Removes endpoint and stops billing")

print(f"\nüí° THIS IS THE EASIEST AND MOST RELIABLE DEPLOYMENT METHOD!")

In [None]:
# üîß ENDPOINT TROUBLESHOOTING & DIAGNOSTICS

print("üîß SageMaker Endpoint Troubleshooting")
print("="*50)

import boto3
from datetime import datetime, timedelta

def diagnose_endpoint_issues():
    """Comprehensive endpoint diagnostics"""
    
    sagemaker_client = boto3.client('sagemaker')
    logs_client = boto3.client('logs')
    
    print("üîç CHECKING RECENT ENDPOINTS...")
    
    try:
        # Get recent endpoints
        endpoints = sagemaker_client.list_endpoints(
            SortBy='CreationTime',
            SortOrder='Descending',
            MaxResults=5
        )
        
        for endpoint in endpoints['Endpoints']:
            name = endpoint['EndpointName']
            status = endpoint['EndpointStatus']
            created = endpoint['CreationTime']
            
            # Calculate time elapsed
            now = datetime.now(created.tzinfo)
            elapsed = now - created
            elapsed_minutes = int(elapsed.total_seconds() / 60)
            
            print(f"\nüìç {name}")
            print(f"   Status: {status}")
            print(f"   Created: {elapsed_minutes} minutes ago")
            
            # Get detailed info for failed endpoints
            if status in ['Failed', 'OutOfService']:
                try:
                    details = sagemaker_client.describe_endpoint(EndpointName=name)
                    if 'FailureReason' in details:
                        print(f"   ‚ùå Failure: {details['FailureReason']}")
                        
                        # Common failure patterns and solutions
                        failure_reason = details['FailureReason'].lower()
                        if 'ping health check' in failure_reason:
                            print(f"   üí° Solution: Model container not responding - check inference.py")
                        elif 'model loading' in failure_reason:
                            print(f"   üí° Solution: Model format issue - check model artifacts")
                        elif 'insufficient capacity' in failure_reason:
                            print(f"   üí° Solution: Try different instance type or region")
                        elif 'image' in failure_reason:
                            print(f"   üí° Solution: Framework version compatibility issue")
                            
                except Exception as e:
                    print(f"   ‚ö†Ô∏è Could not get details: {e}")
            
            elif status == 'Creating' and elapsed_minutes > 15:
                print(f"   ‚ö†Ô∏è Taking longer than expected - may fail soon")
            
            elif status == 'InService':
                print(f"   ‚úÖ Healthy and ready!")
                
        print(f"\nüîç CHECKING CLOUDWATCH LOGS...")
        
        # Check for recent SageMaker endpoint logs
        try:
            log_groups = logs_client.describe_log_groups(
                logGroupNamePrefix='/aws/sagemaker/Endpoints',
                limit=5
            )
            
            for lg in log_groups['logGroups']:
                log_group_name = lg['logGroupName']
                print(f"\nüìÑ Log group: {log_group_name}")
                
                # Get most recent log streams
                streams = logs_client.describe_log_streams(
                    logGroupName=log_group_name,
                    orderBy='LastEventTime',
                    descending=True,
                    limit=2
                )
                
                for stream in streams['logStreams']:
                    print(f"   üìä Stream: {stream['logStreamName']}")
                    
                    # Get recent error events
                    try:
                        events = logs_client.get_log_events(
                            logGroupName=log_group_name,
                            logStreamName=stream['logStreamName'],
                            startTime=int((datetime.now() - timedelta(hours=2)).timestamp() * 1000)
                        )
                        
                        error_events = [e for e in events['events'] 
                                      if any(keyword in e['message'].lower() 
                                           for keyword in ['error', 'failed', 'exception', 'traceback'])]
                        
                        if error_events:
                            print(f"   üö® Recent errors found:")
                            for event in error_events[-3:]:  # Last 3 errors
                                timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                                print(f"      [{timestamp.strftime('%H:%M:%S')}] {event['message'][:100]}...")
                        else:
                            print(f"   ‚úÖ No recent errors in this stream")
                            
                    except Exception as e:
                        print(f"   ‚ö†Ô∏è Could not read events: {e}")
                        
        except Exception as e:
            print(f"‚ùå Error checking CloudWatch logs: {e}")
            
    except Exception as e:
        print(f"‚ùå Error checking endpoints: {e}")

def fix_common_issues():
    """Provide fixes for common endpoint issues"""
    
    print(f"\nüõ†Ô∏è COMMON FIXES:")
    print(f"1. üè• Health Check Failures:")
    print(f"   - Ensure inference.py has proper model_fn, input_fn, predict_fn")
    print(f"   - Check that model loads correctly")
    print(f"   - Verify requirements.txt has correct versions")
    
    print(f"\n2. üîß Model Loading Issues:")
    print(f"   - Check model.tar.gz format and contents")
    print(f"   - Ensure model was saved with correct sklearn version")
    print(f"   - Verify S3 permissions")
    
    print(f"\n3. üíæ Instance Issues:")
    print(f"   - Try smaller instance type (ml.t2.medium)")
    print(f"   - Check service quotas in AWS Console")
    print(f"   - Try different region")
    
    print(f"\n4. üê≥ Container Issues:")
    print(f"   - Use framework_version='1.2-1' (tested)")
    print(f"   - Avoid very old or very new framework versions")
    print(f"   - Check SageMaker container compatibility")

# Run diagnostics
print("üîç Running endpoint diagnostics...")
diagnose_endpoint_issues()

print("\n" + "="*50)
fix_common_issues()

print(f"\nüí° BEST PRACTICE:")
print(f"Always use sklearn_estimator.deploy() - it's the most reliable method!")
print(f"Avoid custom containers unless absolutely necessary.")

## üîß Endpoint Deployment Troubleshooting & Recovery

If you encountered the error: **"The primary container for production variant primary did not pass the ping health check"**, this section will help you diagnose and fix the issue.

### üîç **Common Causes:**
- ‚ùå Missing or incorrect inference script
- ‚ùå Model loading issues 
- ‚ùå Container configuration problems
- ‚ùå Framework version mismatches

### üìã **Recovery Steps:**
1. **Diagnose** - Check CloudWatch logs
2. **Fix** - Create proper inference script
3. **Cleanup** - Remove failed resources
4. **Redeploy** - Use reliable method

In [None]:
# üîç STEP 1: Diagnose the Issue - Check CloudWatch Logs

print("üîç DIAGNOSING ENDPOINT FAILURE...")
print("="*50)

import boto3
from datetime import datetime, timedelta

def check_endpoint_logs():
    """Check CloudWatch logs for endpoint failure details"""
    logs_client = boto3.client('logs')
    
    # Replace with your failed endpoint name
    failed_endpoint_name = 'iris-endpoint-20250822-235036'  # Update this!
    
    print(f"üìã Checking logs for endpoint: {failed_endpoint_name}")
    
    try:
        # List log groups related to SageMaker endpoints
        log_groups = logs_client.describe_log_groups(
            logGroupNamePrefix='/aws/sagemaker/Endpoints'
        )
        
        print("üìÑ Available SageMaker endpoint log groups:")
        endpoint_log_found = False
        
        for lg in log_groups['logGroups']:
            log_group_name = lg['logGroupName']
            print(f"  üìÅ {log_group_name}")
            
            # Check if this log group is for our failed endpoint
            if failed_endpoint_name in log_group_name:
                endpoint_log_found = True
                print(f"\nüéØ FOUND LOGS FOR FAILED ENDPOINT!")
                print(f"üìÇ Log Group: {log_group_name}")
                
                # Get recent log streams
                try:
                    streams = logs_client.describe_log_streams(
                        logGroupName=log_group_name,
                        orderBy='LastEventTime',
                        descending=True,
                        limit=3
                    )
                    
                    for stream in streams['logStreams']:
                        stream_name = stream['logStreamName']
                        print(f"\nüìä Log Stream: {stream_name}")
                        
                        # Get recent log events
                        events = logs_client.get_log_events(
                            logGroupName=log_group_name,
                            logStreamName=stream_name,
                            startTime=int((datetime.now() - timedelta(hours=2)).timestamp() * 1000)
                        )
                        
                        print("üîç ERROR MESSAGES:")
                        error_count = 0
                        for event in events['events']:
                            message = event['message'].strip()
                            timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                            
                            # Look for error indicators
                            if any(keyword in message.lower() for keyword in ['error', 'failed', 'exception', 'traceback']):
                                print(f"  ‚ùå [{timestamp}] {message}")
                                error_count += 1
                            elif 'ping' in message.lower():
                                print(f"  üèì [{timestamp}] {message}")
                        
                        if error_count == 0:
                            print("  ‚ÑπÔ∏è  No explicit errors found in this stream")
                            print("  üìã Last 5 messages:")
                            for event in events['events'][-5:]:
                                timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                                print(f"    [{timestamp}] {event['message'].strip()}")
                        
                except Exception as e:
                    print(f"  ‚ö†Ô∏è Could not read log streams: {e}")
        
        if not endpoint_log_found:
            print(f"\n‚ö†Ô∏è No logs found for endpoint: {failed_endpoint_name}")
            print("üí° This might mean:")
            print("   ‚Ä¢ The endpoint name is incorrect")
            print("   ‚Ä¢ The logs haven't been generated yet")
            print("   ‚Ä¢ The endpoint failed before logging started")
            
    except Exception as e:
        print(f"‚ùå Error accessing CloudWatch logs: {e}")
        print("üí° Make sure you have CloudWatch read permissions")

# Run the diagnostic
check_endpoint_logs()

print("\nüí° COMMON ISSUES AND SOLUTIONS:")
print("üîß If you see 'ModuleNotFoundError': Missing dependencies")
print("üîß If you see 'No module named inference': Missing inference.py")
print("üîß If you see 'model loading failed': Wrong model format")
print("üîß If no logs found: Endpoint failed during startup")

print("\n‚û°Ô∏è NEXT: Run the cells below to fix the issues")

## üõ†Ô∏è Using MLOps Scripts

Now that your model is ready, let's explore how to use the MLOps scripts that are available in your environment.

In [None]:
# Explore available scripts
print("üîç Available MLOps Scripts:")
print()

scripts_info = {
    'build_and_push.sh': {
        'purpose': 'Build Docker container and push to ECR',
        'usage': 'Used for containerizing your model for production deployment',
        'when': 'After model training, before deployment to production'
    },
    'deploy.sh': {
        'purpose': 'Deploy model to SageMaker endpoint using Step Functions',
        'usage': 'Orchestrates the complete deployment pipeline',
        'when': 'When ready to deploy model to production'
    },
    'test_endpoint.py': {
        'purpose': 'Test deployed SageMaker endpoints',
        'usage': 'Automated testing of model predictions',
        'when': 'After deployment to validate model is working'
    },
    'cleanup.sh': {
        'purpose': 'Clean up AWS resources to save costs',
        'usage': 'Removes endpoints, models, and other billable resources',
        'when': 'When done with experimentation or deployment'
    }
}

# Check which scripts are available
available_scripts = []
if os.path.exists('scripts'):
    available_scripts = os.listdir('scripts')

for script, info in scripts_info.items():
    status = "‚úÖ Available" if script in available_scripts else "‚ùå Missing"
    print(f"{status} {script}")
    print(f"   üìù Purpose: {info['purpose']}")
    print(f"   üéØ Usage: {info['usage']}")
    print(f"   ‚è∞ When: {info['when']}")
    print()

# Show how to run scripts
if available_scripts:
    print("üöÄ How to use scripts:")
    print()
    print("1. üì¶ Build and containerize your model:")
    print("   !cd scripts && ./build_and_push.sh")
    print()
    print("2. üöÄ Deploy to production:")
    print("   !cd scripts && ./deploy.sh")
    print()
    print("3. üß™ Test your endpoint:")
    print("   !cd scripts && python test_endpoint.py --endpoint-name your-endpoint")
    print()
    print("4. üßπ Clean up resources:")
    print("   !cd scripts && ./cleanup.sh")
    print()
    
    # Example: Show the contents of test_endpoint.py if available
    test_script_path = 'scripts/test_endpoint.py'
    if os.path.exists(test_script_path):
        print("üìã Example: test_endpoint.py usage:")
        with open(test_script_path, 'r') as f:
            lines = f.readlines()[:20]  # Show first 20 lines
            print("".join(lines))
            if len(f.readlines()) > 20:
                print("... (truncated)")

else:
    print("‚ö†Ô∏è Scripts not found. They will be downloaded automatically when you apply Terraform changes.")
    print("   Or you can manually download them from your S3 bucket.")

In [None]:
# üöÄ DEPLOY MODEL TO SAGEMAKER ENDPOINT - ROBUST VERSION

print("=" * 80)
print("üöÄ DEPLOYING MODEL TO SAGEMAKER ENDPOINT")
print("=" * 80)

import time
import boto3
from datetime import datetime

# Configuration with fallback options
ENDPOINT_NAME = f"iris-model-demo-{int(time.time())}"
INSTANCE_TYPES = ['ml.m5.large', 'ml.m5.xlarge', 'ml.c5.large', 'ml.t3.medium']  # Fallback options
DEPLOYMENT_TIMEOUT = 600  # 10 minutes max

def deploy_with_fallback(estimator, endpoint_name, instance_types):
    """Deploy model with instance type fallback for reliability"""
    
    for i, instance_type in enumerate(instance_types):
        try:
            print(f"\nüîÑ Attempting deployment with {instance_type} (attempt {i+1}/{len(instance_types)})")
            
            # Try to deploy
            predictor = estimator.deploy(
                initial_instance_count=1,
                instance_type=instance_type,
                endpoint_name=endpoint_name,
                wait=True,
                update_endpoint=False
            )
            
            print(f"‚úÖ Successfully deployed on {instance_type}")
            return predictor, instance_type
            
        except Exception as e:
            error_msg = str(e)
            print(f"‚ùå Failed with {instance_type}: {error_msg}")
            
            # Clean up failed endpoint if it exists
            try:
                sagemaker_client = boto3.client('sagemaker')
                sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
                print(f"üßπ Cleaned up failed endpoint: {endpoint_name}")
                time.sleep(30)  # Wait for cleanup
            except:
                pass
            
            # Check if we should retry with next instance type
            if i < len(instance_types) - 1:
                print(f"üîÑ Retrying with next instance type...")
                time.sleep(10)
            else:
                print(f"üí• All instance types failed. Last error: {error_msg}")
                raise Exception(f"Deployment failed on all instance types: {error_msg}")

# Deploy the model
try:
    start_time = datetime.now()
    print(f"üìÖ Deployment started at: {start_time}")
    
    predictor, used_instance_type = deploy_with_fallback(
        sklearn_estimator, 
        ENDPOINT_NAME, 
        INSTANCE_TYPES
    )
    
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()
    
    print(f"\nüéâ DEPLOYMENT SUCCESSFUL!")
    print(f"üìä Endpoint Name: {ENDPOINT_NAME}")
    print(f"üíª Instance Type: {used_instance_type}")
    print(f"‚è±Ô∏è  Deployment Time: {duration:.1f} seconds")
    print(f"üåê Endpoint URL: https://console.aws.amazon.com/sagemaker/home#/endpoints/{ENDPOINT_NAME}")
    
    # Test the endpoint immediately
    print(f"\nüß™ TESTING ENDPOINT...")
    test_data = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 4.4, 1.4]]
    
    prediction = predictor.predict(test_data)
    print(f"‚úÖ Test Prediction: {prediction}")
    print(f"üìà Model is responding correctly!")
    
    # Store endpoint info for later use
    endpoint_info = {
        'endpoint_name': ENDPOINT_NAME,
        'instance_type': used_instance_type,
        'deployment_time': duration,
        'test_prediction': prediction
    }
    
    print(f"\nüìù Endpoint deployed and tested successfully!")
    print(f"üí° Use 'predictor.delete_endpoint()' to clean up when done")

except Exception as e:
    print(f"\nüí• DEPLOYMENT FAILED!")
    print(f"‚ùå Error: {str(e)}")
    print(f"\nüîß Troubleshooting suggestions:")
    print(f"   1. Check your AWS account limits for SageMaker instances")
    print(f"   2. Verify the model was trained successfully")
    print(f"   3. Check CloudWatch logs for detailed error messages")
    print(f"   4. Try running the troubleshooting cell below")
    
    # Re-raise for debugging
    raise e

## 8. Test Model Predictions

Demonstrate how to test your deployed model using the same patterns as `scripts/test_endpoint.py`.

In [None]:
# üîß ENDPOINT TROUBLESHOOTING & DIAGNOSTICS

print("=" * 80)
print("üîß SAGEMAKER ENDPOINT TROUBLESHOOTING & DIAGNOSTICS")
print("=" * 80)

import boto3
import json
import time
from datetime import datetime, timedelta

def comprehensive_endpoint_diagnostics(endpoint_name=None):
    """Comprehensive diagnostics for SageMaker endpoint issues"""
    
    sagemaker = boto3.client('sagemaker')
    logs_client = boto3.client('logs')
    
    print(f"üîç Running comprehensive diagnostics...")
    
    # 1. List all endpoints if none specified
    if not endpoint_name:
        print(f"\nüìã LISTING ALL ENDPOINTS:")
        try:
            response = sagemaker.list_endpoints()
            endpoints = response['Endpoints']
            
            if not endpoints:
                print("‚ùå No endpoints found. Deploy a model first.")
                return
            
            for ep in endpoints:
                status_icon = "‚úÖ" if ep['EndpointStatus'] == 'InService' else "‚ùå"
                print(f"   {status_icon} {ep['EndpointName']}: {ep['EndpointStatus']}")
            
            # Use the most recent endpoint
            endpoint_name = endpoints[-1]['EndpointName']
            print(f"\nüéØ Using most recent endpoint: {endpoint_name}")
            
        except Exception as e:
            print(f"‚ùå Error listing endpoints: {e}")
            return
    
    # 2. Check endpoint status
    print(f"\nüìä ENDPOINT STATUS CHECK:")
    try:
        response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
        status = response['EndpointStatus']
        creation_time = response['CreationTime']
        
        status_icon = "‚úÖ" if status == 'InService' else "‚ùå"
        print(f"   {status_icon} Status: {status}")
        print(f"   üìÖ Created: {creation_time}")
        
        if status == 'Failed':
            print(f"   üí• Failure Reason: {response.get('FailureReason', 'Unknown')}")
        
    except Exception as e:
        print(f"‚ùå Error checking endpoint status: {e}")
        return endpoint_name
    
    # 3. Check endpoint configuration
    print(f"\n‚öôÔ∏è  ENDPOINT CONFIGURATION:")
    try:
        config_name = response['EndpointConfigName']
        config_response = sagemaker.describe_endpoint_config(EndpointConfigName=config_name)
        
        for variant in config_response['ProductionVariants']:
            print(f"   üñ•Ô∏è  Instance Type: {variant['InstanceType']}")
            print(f"   üìä Instance Count: {variant['InitialInstanceCount']}")
            print(f"   üè∑Ô∏è  Variant Name: {variant['VariantName']}")
            
    except Exception as e:
        print(f"‚ùå Error checking endpoint config: {e}")
    
    # 4. Check CloudWatch logs
    print(f"\nüìã CLOUDWATCH LOGS (Last 30 minutes):")
    try:
        log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
        end_time = datetime.now()
        start_time = end_time - timedelta(minutes=30)
        
        # Get log streams
        streams_response = logs_client.describe_log_streams(
            logGroupName=log_group,
            orderBy='LastEventTime',
            descending=True,
            limit=5
        )
        
        if streams_response['logStreams']:
            print(f"   üìù Found {len(streams_response['logStreams'])} log streams")
            
            # Get recent log events
            stream_name = streams_response['logStreams'][0]['logStreamName']
            events_response = logs_client.get_log_events(
                logGroupName=log_group,
                logStreamName=stream_name,
                startTime=int(start_time.timestamp() * 1000),
                endTime=int(end_time.timestamp() * 1000),
                limit=20
            )
            
            events = events_response['events']
            if events:
                print(f"   üìÑ Recent log events:")
                for event in events[-10:]:  # Show last 10 events
                    timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
                    message = event['message'].strip()
                    print(f"   {timestamp.strftime('%H:%M:%S')} | {message}")
            else:
                print(f"   ‚ÑπÔ∏è  No recent log events found")
        else:
            print(f"   ‚ÑπÔ∏è  No log streams found yet")
            
    except Exception as e:
        print(f"   ‚ö†Ô∏è  CloudWatch logs not accessible: {e}")
    
    # 5. Test endpoint if InService
    if status == 'InService':
        print(f"\nüß™ ENDPOINT CONNECTIVITY TEST:")
        try:
            runtime = boto3.client('sagemaker-runtime')
            test_data = [[5.1, 3.5, 1.4, 0.2]]
            
            response = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(test_data)
            )
            
            result = json.loads(response['Body'].read().decode())
            print(f"   ‚úÖ Test successful! Prediction: {result}")
            
        except Exception as e:
            print(f"   ‚ùå Test failed: {e}")
    
    # 6. Provide actionable recommendations
    print(f"\nüí° TROUBLESHOOTING RECOMMENDATIONS:")
    
    if status == 'Failed':
        print(f"   üîß ENDPOINT FAILED - Try these fixes:")
        print(f"      1. Check the model artifacts exist and are accessible")
        print(f"      2. Verify the inference script has proper /ping and /invocations handlers")
        print(f"      3. Ensure model dependencies are correctly specified")
        print(f"      4. Try a different instance type (ml.t3.medium for testing)")
        print(f"      5. Check AWS account limits and quotas")
        
    elif status == 'Creating':
        print(f"   ‚è≥ ENDPOINT CREATING - This is normal:")
        print(f"      ‚Ä¢ Typical deployment takes 6-10 minutes")
        print(f"      ‚Ä¢ Check again in a few minutes")
        print(f"      ‚Ä¢ Monitor CloudWatch logs for progress")
        
    elif status == 'InService':
        print(f"   ‚úÖ ENDPOINT HEALTHY - All good!")
        print(f"      ‚Ä¢ You can make predictions")
        print(f"      ‚Ä¢ Remember to delete when done to save costs")
        
    else:
        print(f"   ‚ö†Ô∏è  UNKNOWN STATUS - General troubleshooting:")
        print(f"      1. Wait a few minutes and check again")
        print(f"      2. Check AWS Service Health Dashboard")
        print(f"      3. Verify your AWS credentials and region")
    
    print(f"\nüîÑ Run this cell again to refresh diagnostics")
    return endpoint_name

# Run diagnostics
try:
    # Try to use the endpoint from previous deployment
    endpoint_name = locals().get('ENDPOINT_NAME') or globals().get('endpoint_info', {}).get('endpoint_name')
    result_endpoint = comprehensive_endpoint_diagnostics(endpoint_name)
    
except Exception as e:
    print(f"üí• Diagnostics failed: {e}")
    print(f"üí° This might be normal if no endpoints exist yet")

## 9. Monitor Model Performance

Set up monitoring to track model performance and data drift over time.

In [None]:
# üßπ CLEANUP & COST MANAGEMENT

print("=" * 80)
print("üßπ CLEANUP & COST MANAGEMENT")
print("=" * 80)

import boto3
from datetime import datetime

def comprehensive_cleanup():
    """Clean up all SageMaker resources to prevent unnecessary costs"""
    
    sagemaker = boto3.client('sagemaker')
    
    print(f"üîç Scanning for SageMaker resources to clean up...")
    
    # 1. List and optionally delete endpoints
    print(f"\nüìä ENDPOINTS:")
    try:
        response = sagemaker.list_endpoints()
        endpoints = response['Endpoints']
        
        if not endpoints:
            print("   ‚úÖ No endpoints found")
        else:
            for ep in endpoints:
                status_icon = "üü¢" if ep['EndpointStatus'] == 'InService' else "üî¥"
                cost_per_hour = get_estimated_cost(ep.get('InstanceType', 'ml.m5.large'))
                print(f"   {status_icon} {ep['EndpointName']}: {ep['EndpointStatus']} (~${cost_per_hour:.2f}/hour)")
        
        # Ask if user wants to delete endpoints
        if endpoints:
            print(f"\nüí∞ Estimated monthly cost if left running: ${len(endpoints) * 24 * 30 * 0.115:.2f}")
            print(f"‚ö†Ô∏è  Delete endpoints to stop charges!")
            
    except Exception as e:
        print(f"‚ùå Error listing endpoints: {e}")
    
    # 2. List training jobs
    print(f"\nüéØ RECENT TRAINING JOBS:")
    try:
        response = sagemaker.list_training_jobs(
            MaxResults=10,
            SortBy='CreationTime',
            SortOrder='Descending'
        )
        
        jobs = response['TrainingJobSummaries']
        if not jobs:
            print("   ‚úÖ No recent training jobs")
        else:
            for job in jobs[:5]:  # Show last 5
                status_icon = "‚úÖ" if job['TrainingJobStatus'] == 'Completed' else "‚ùå"
                print(f"   {status_icon} {job['TrainingJobName']}: {job['TrainingJobStatus']}")
                
    except Exception as e:
        print(f"‚ùå Error listing training jobs: {e}")
    
    # 3. List models
    print(f"\nü§ñ MODELS:")
    try:
        response = sagemaker.list_models(MaxResults=10)
        models = response['Models']
        
        if not models:
            print("   ‚úÖ No models found")
        else:
            for model in models:
                print(f"   üì¶ {model['ModelName']}: {model['CreationTime']}")
                
    except Exception as e:
        print(f"‚ùå Error listing models: {e}")
    
    # 4. Provide cleanup options
    print(f"\nüßπ CLEANUP OPTIONS:")
    print(f"   1. Delete specific endpoint: predictor.delete_endpoint()")
    print(f"   2. Delete all endpoints: Use the functions below")
    print(f"   3. Models and training jobs don't incur ongoing costs")

def get_estimated_cost(instance_type):
    """Get estimated hourly cost for instance type"""
    cost_map = {
        'ml.t3.medium': 0.05,
        'ml.m5.large': 0.115,
        'ml.m5.xlarge': 0.23,
        'ml.c5.large': 0.102,
        'ml.c5.xlarge': 0.204
    }
    return cost_map.get(instance_type, 0.115)

def delete_all_endpoints():
    """Delete all endpoints - USE WITH CAUTION"""
    sagemaker = boto3.client('sagemaker')
    
    try:
        response = sagemaker.list_endpoints()
        endpoints = response['Endpoints']
        
        if not endpoints:
            print("‚úÖ No endpoints to delete")
            return
        
        print(f"üóëÔ∏è  Deleting {len(endpoints)} endpoint(s)...")
        
        for ep in endpoints:
            endpoint_name = ep['EndpointName']
            try:
                sagemaker.delete_endpoint(EndpointName=endpoint_name)
                print(f"   ‚úÖ Deleted: {endpoint_name}")
            except Exception as e:
                print(f"   ‚ùå Failed to delete {endpoint_name}: {e}")
        
        print(f"üéâ Cleanup complete! All endpoints deleted.")
        
    except Exception as e:
        print(f"‚ùå Error during cleanup: {e}")

def delete_endpoint_by_name(endpoint_name):
    """Delete a specific endpoint by name"""
    sagemaker = boto3.client('sagemaker')
    
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print(f"‚úÖ Successfully deleted endpoint: {endpoint_name}")
    except Exception as e:
        print(f"‚ùå Failed to delete endpoint {endpoint_name}: {e}")

# Run the resource scan
comprehensive_cleanup()

print(f"\nüí° CLEANUP COMMANDS:")
print(f"   ‚Ä¢ Delete specific endpoint: delete_endpoint_by_name('your-endpoint-name')")
print(f"   ‚Ä¢ Delete ALL endpoints: delete_all_endpoints()  # ‚ö†Ô∏è USE WITH CAUTION")
print(f"   ‚Ä¢ Using predictor object: predictor.delete_endpoint()")

print(f"\nüí∞ COST REMINDERS:")
print(f"   ‚Ä¢ Endpoints charge by the hour (~$0.05-0.23/hour)")
print(f"   ‚Ä¢ Training jobs only charge while running")
print(f"   ‚Ä¢ Models stored in S3 have minimal storage costs")
print(f"   ‚Ä¢ Always clean up endpoints when done!")

print(f"\nüîÑ Run this cell periodically to monitor your AWS costs")

## üéØ Summary: Your Complete MLOps Workflow

### What you've learned:

#### üîÑ **Development Workflow:**
1. **Explore & Experiment** ‚Üí Use SageMaker Notebook for EDA and prototyping
2. **Process & Engineer** ‚Üí Create reproducible feature pipelines
3. **Train & Validate** ‚Üí Use both local and SageMaker training
4. **Deploy & Monitor** ‚Üí Production deployment with monitoring

#### üìÅ **How `src/` supports you:**
- **`src/model/train.py`** ‚Üí Production-ready training script you can customize
- **`src/model/inference.py`** ‚Üí Handles model serving logic
- **`src/container/`** ‚Üí Docker setup for custom algorithms
- **`src/lambda/`** ‚Üí Pipeline automation and orchestration

#### üõ†Ô∏è **How `scripts/` supports you:**
- **`scripts/build_and_push.sh`** ‚Üí Containerize and deploy your models
- **`scripts/deploy.sh`** ‚Üí Orchestrated production deployment
- **`scripts/test_endpoint.py`** ‚Üí Automated endpoint testing
- **`scripts/cleanup.sh`** ‚Üí Clean up resources to save costs

#### üöÄ **Next Steps:**
1. **Customize training script** ‚Üí Modify `src/model/train.py` for your algorithms
2. **Deploy to production** ‚Üí Run `scripts/deploy.sh` when ready
3. **Set up monitoring** ‚Üí Use the baseline we created
4. **Iterate and improve** ‚Üí Use the notebook for continuous experimentation

#### üí° **Pro Tips:**
- Save all your experiments and processed data to S3
- Use the production scripts for consistency between dev and prod
- Monitor your models continuously for drift and performance
- Version your models and track experiments

### üéâ **You now have a complete MLOps pipeline at your disposal!**