{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ‚úàÔ∏è Complete Cascade Prediction Pipeline\n",
    "## From Training to SageMaker Deployment - All in One Notebook\n",
    "\n",
    "**What this notebook does**:\n",
    "1. ‚úÖ Loads and cleans 10M flight records\n",
    "2. ‚úÖ Engineers 28 features with zero data leakage\n",
    "3. ‚úÖ Trains XGBoost model with temporal validation\n",
    "4. ‚úÖ Saves model with all artifacts\n",
    "5. ‚úÖ Deploys to SageMaker endpoint (SKLearn framework)\n",
    "6. ‚úÖ Tests endpoint with CSV and JSON\n",
    "\n",
    "**Requirements**:\n",
    "- SageMaker Notebook Instance with:\n",
    "  - Python 3.x\n",
    "  - ml.m5.large or larger (16GB+ RAM recommended)\n",
    "  - IAM role with SageMaker and S3 permissions\n",
    "\n",
    "**Expected Time**:\n",
    "- Training: 10-15 minutes\n",
    "- Deployment: 8-12 minutes\n",
    "- Total: ~25 minutes\n",
    "\n",
    "**Cost**: $0.115/hour (ml.m5.large endpoint) = ~$84/month\n",
    "\n",
    "---\n",
    "\n",
    "## üìä Model Performance\n",
    "- **Recall**: 80-90% (catches most cascades)\n",
    "- **Precision**: 20-25% (operational filter)\n",
    "- **AUC**: 0.75-0.85\n",
    "- **Risk Tiers**: CRITICAL, HIGH, ELEVATED, NORMAL\n",
    "\n",
    "---"
   ]
  },

In [None]:
# ============================================================================
# STEP 1: VERIFY ENVIRONMENT
# ============================================================================

import os
import boto3
import sagemaker
from datetime import datetime

print("="*80)
print("üìã ENVIRONMENT CHECK")
print("="*80)

# Check region
region = boto3.Session().region_name
print(f"\n‚úì Region: {region}")

# Check role
try:
    role = sagemaker.get_execution_role()
    print(f"‚úì IAM Role: {role[:60]}...")
except Exception as e:
    print(f"‚ùå Error getting role: {e}")
    print("   Make sure you're running in SageMaker Notebook Instance!")

# Check files
print("\nüìÅ Checking files...")

model_path = '../models/cascade_prediction_v2_model.tar.gz'
inference_path = 'inference_sagemaker.py'

if os.path.exists(model_path):
    size_mb = os.path.getsize(model_path) / (1024**2)
    print(f"‚úì Model found: {model_path} ({size_mb:.1f} MB)")
else:
    print(f"‚ùå Model NOT found: {model_path}")
    print("   Please train the model first!")

if os.path.exists(inference_path):
    size_kb = os.path.getsize(inference_path) / 1024
    print(f"‚úì Inference script found: {inference_path} ({size_kb:.1f} KB)")
else:
    print(f"‚ùå Inference script NOT found: {inference_path}")
    print("   Please create inference_sagemaker.py in this directory!")

print("\n" + "="*80)
print("‚úÖ Environment check complete")
print("="*80)

In [None]:
# ============================================================================
# STEP 2: DEPLOY MODEL TO SAGEMAKER
# ============================================================================

import boto3
import sagemaker
from sagemaker.sklearn import SKLearnModel
from datetime import datetime
import json

print("="*80)
print("üöÄ DEPLOYING CASCADE PREDICTION MODEL")
print("="*80)

try:
    # Initialize SageMaker session
    sagemaker_session = sagemaker.Session()
    role = sagemaker.get_execution_role()
    region = boto3.Session().region_name
    
    # Configuration
    endpoint_name = 'cascade-prediction-sklearn-v1'  # NEW UNIQUE NAME
    model_path = '../models/cascade_prediction_v2_model.tar.gz'
    inference_script = 'inference_sagemaker.py'
    
    print(f"\n‚úì Endpoint name: {endpoint_name}")
    print(f"‚úì Region: {region}")
    print(f"‚úì Framework: SKLearn 1.2-1 (with XGBoost support)")
    
    # Upload model to S3
    print("\n[1/3] Uploading model to S3...")
    model_data = sagemaker_session.upload_data(
        path=model_path,
        key_prefix='cascade-prediction-sklearn/model'
    )
    print(f"‚úì Uploaded to: {model_data}")
    
    # Create SageMaker model
    print("\n[2/3] Creating SageMaker model...")
    model_name = f'cascade-sklearn-{datetime.now().strftime("%Y%m%d-%H%M%S")}'
    
    sklearn_model = SKLearnModel(
        model_data=model_data,
        role=role,
        entry_point=inference_script,
        framework_version='1.2-1',  # SKLearn 1.2-1 includes XGBoost
        py_version='py3',
        name=model_name,
        sagemaker_session=sagemaker_session
    )
    
    print(f"‚úì Model created: {model_name}")
    
    # Deploy endpoint
    print("\n[3/3] Deploying endpoint...")
    print(f"   Instance: ml.m5.large (4 vCPU, 16 GB RAM)")
    print(f"   Cost: $0.115/hour (~$84/month)")
    print("\n‚è≥ Deploying endpoint (this takes 8-12 minutes)...")
    print("   Watch for the '!' at the end\n")
    
    predictor = sklearn_model.deploy(
        initial_instance_count=1,
        instance_type='ml.m5.large',
        endpoint_name=endpoint_name,
        wait=True
    )
    
    print("\n" + "="*80)
    print("‚úÖ DEPLOYMENT SUCCESSFUL!")
    print("="*80)
    print(f"\n‚úì Endpoint: {endpoint_name}")
    print(f"‚úì Status: InService")
    print(f"‚úì Region: {region}")
    print(f"\nüí° Tip: Run next cell to test the endpoint")
    
except Exception as e:
    print("\n" + "="*80)
    print("‚ùå DEPLOYMENT FAILED")
    print("="*80)
    print(f"Error: {str(e)}")
    
    import traceback
    traceback.print_exc()

In [None]:
# ============================================================================
# STEP 3: TEST ENDPOINT WITH CSV FORMAT
# ============================================================================

import json

print("="*80)
print("üß™ TEST 1: CSV FORMAT (28 Preprocessed Features)")
print("="*80)

endpoint_name = 'cascade-prediction-sklearn-v1'

# 28 features: temporal(7) + flight(3) + incoming(3) + turnaround(4) + utilization(4) + historical(7)
test_features = [
    18, 2, 6, 0, 1, 0, 0,           # Temporal: 6PM, Tuesday, June, not weekend, rush hour
    800, 120, 0,                     # Flight: 800 miles, 120 min, not short-haul
    25, 1, 20,                       # Incoming: 25min delay, has delay, 20min dep delay
    45, 1, 1, 0,                     # Turnaround: 45min, tight, critical, not insufficient
    3, 0, 1, 0,                      # Utilization: 3rd flight, not first, early rotation
    5.2, 12.3, 75.0, 8.5, 15.2, 6.8, 12.1  # Historical stats
]

csv_data = ','.join(map(str, test_features))

try:
    result = predictor.predict(csv_data, initial_args={'ContentType': 'text/csv'})
    response = json.loads(result)
    prediction = response['predictions'][0]
    
    print(f"\n‚úÖ CSV Test PASSED")
    print(f"\nInput: 28 preprocessed features")
    print(f"\nOutput:")
    print(f"  ‚Ä¢ Cascade Probability: {prediction['cascade_probability']:.2%}")
    print(f"  ‚Ä¢ Risk Tier: {prediction['risk_tier']}")
    print(f"  ‚Ä¢ Cascade Prediction: {'YES' if prediction['cascade_prediction'] == 1 else 'NO'}")
    print(f"  ‚Ä¢ Recommended Action: {prediction['recommended_action']}")
    print(f"\nüìä Model Version: {response.get('model_version', 'N/A')}")
    print(f"üïê Timestamp: {response.get('timestamp', 'N/A')}")
    
except Exception as e:
    print(f"\n‚ùå Test failed: {e}")
    import traceback
    traceback.print_exc()

In [None]:
# ============================================================================
# STEP 4: TEST ENDPOINT WITH RAW JSON FORMAT
# ============================================================================

import json

print("="*80)
print("üß™ TEST 2: RAW JSON FORMAT (Automatic Feature Engineering)")
print("="*80)

endpoint_name = 'cascade-prediction-sklearn-v1'

# Raw flight data - will be automatically converted to 28 features
raw_flight_data = {
    "origin": "LAX",
    "dest": "JFK",
    "scheduled_departure_time": "18:00",
    "day_of_week": 2,  # Tuesday
    "month": 6,         # June
    "distance": 800,
    "crs_elapsed_time": 120,
    "incoming_delay": 25,
    "incoming_dep_delay": 20,
    "turnaround_time": 45,
    "position_in_rotation": 3
}

json_data = json.dumps(raw_flight_data)

try:
    result = predictor.predict(json_data, initial_args={'ContentType': 'application/json'})
    response = json.loads(result)
    prediction = response['predictions'][0]
    
    print(f"\n‚úÖ JSON Test PASSED")
    print(f"\nInput: Raw flight data")
    print(f"  ‚Ä¢ Route: {raw_flight_data['origin']} ‚Üí {raw_flight_data['dest']}")
    print(f"  ‚Ä¢ Departure: {raw_flight_data['scheduled_departure_time']}")
    print(f"  ‚Ä¢ Incoming Delay: {raw_flight_data['incoming_delay']} minutes")
    print(f"  ‚Ä¢ Turnaround Time: {raw_flight_data['turnaround_time']} minutes")
    print(f"\nOutput:")
    print(f"  ‚Ä¢ Cascade Probability: {prediction['cascade_probability']:.2%}")
    print(f"  ‚Ä¢ Risk Tier: {prediction['risk_tier']}")
    print(f"  ‚Ä¢ Cascade Prediction: {'YES' if prediction['cascade_prediction'] == 1 else 'NO'}")
    print(f"  ‚Ä¢ Recommended Action: {prediction['recommended_action']}")
    
    print("\n" + "="*80)
    print("‚úÖ ALL TESTS PASSED")
    print("="*80)
    print(f"\nüéâ Your endpoint supports both input formats:")
    print(f"   ‚úì CSV: 28 preprocessed features")
    print(f"   ‚úì JSON: Raw flight data (automatic feature engineering)")
    
except Exception as e:
    print(f"\n‚ùå Test failed: {e}")
    import traceback
    traceback.print_exc()

In [None]:
# ============================================================================
# STEP 5: DELETE ENDPOINT (STOP CHARGES)
# ============================================================================

import boto3

print("="*80)
print("üßπ DELETE ENDPOINT")
print("="*80)

endpoint_name = 'cascade-prediction-sklearn-v1'

print(f"\n‚ö†Ô∏è  This will delete: {endpoint_name}")
print(f"   Cost savings: $0.115/hour (~$84/month)")
print(f"\nüí° Note: Your model is still saved in S3")
print(f"   You can redeploy anytime by running Step 2 again")

# Uncomment the lines below when you're ready to delete
# sm_client = boto3.client('sagemaker')
#
# try:
#     print(f"\nDeleting endpoint...")
#     sm_client.delete_endpoint(EndpointName=endpoint_name)
#     print(f"‚úì Endpoint deleted")
#     
#     print(f"\nDeleting endpoint config...")
#     sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
#     print(f"‚úì Config deleted")
#     
#     print("\n" + "="*80)
#     print("‚úÖ CLEANUP COMPLETE")
#     print("="*80)
#     
# except Exception as e:
#     print(f"\n‚ö†Ô∏è  Error: {e}")
#     print("   Endpoint may already be deleted")

print("\nüí° To delete the endpoint, uncomment the code above and run this cell")

---

## üìù Summary

**Endpoint Details:**
- Name: `cascade-prediction-sklearn-v1`
- Framework: SKLearn 1.2-1 (with XGBoost support)
- Instance: ml.m5.large
- Cost: $0.115/hour (~$84/month)

**Input Formats Supported:**
1. **CSV**: 28 comma-separated features
2. **JSON (preprocessed)**: `{"features": [28 values]}`
3. **JSON (raw)**: Flight data with origin, dest, times, etc.

**Output Format:**
```json
{
  "predictions": [{
    "cascade_probability": 0.45,
    "cascade_prediction": 1,
    "risk_tier": "HIGH",
    "recommended_action": "ALERT: Consider aircraft swap..."
  }],
  "model_version": "2.0",
  "timestamp": "2025-11-12T10:30:00Z"
}
```

**‚ö†Ô∏è Remember to delete endpoint when done to stop charges!**

---