# AWS SageMaker Training & Deployment Pipeline

## How It Works:

### 1. **Training Script (training.py)**
The training script contains the code that runs on SageMaker. The script:
- Loads data from S3 (SageMaker downloads it automatically)
- Trains a Random Forest model
- Saves the model artifacts
- Uses special SageMaker environment variables:
  - `SM_CHANNEL_TRAINING`: Directory where training data is downloaded
  - `SM_MODEL_DIR`: Directory to save the trained model
  - `SM_OUTPUT_DATA_DIR`: Directory for training metrics/outputs
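The environment variables above can be read with plain standard-library calls. The following is a minimal sketch, not the original training script: the helper name `resolve_sagemaker_dirs` is hypothetical, and the fallback paths are the conventional `/opt/ml` locations used inside SageMaker training containers. The channel variable is `SM_CHANNEL_TRAINING` because the `fit()` call later in this notebook names its input channel `training`.

```python
import os


def resolve_sagemaker_dirs(environ=None):
    """Return the data, model, and output directories for a training run.

    Falls back to the conventional /opt/ml paths when a variable is unset,
    which also makes the script easier to dry-run outside SageMaker.
    """
    env = os.environ if environ is None else environ
    return {
        "train_dir": env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        "output_dir": env.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"),
    }


if __name__ == "__main__":
    for name, path in resolve_sagemaker_dirs().items():
        print(f"{name}: {path}")
```

Passing a plain dict as `environ` lets the same function be exercised in a local test without touching the real environment.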

### 2. **The Workflow:**

```
Your S3 Bucket (Parquet files)
    ↓
SageMaker Training Job
    ↓ (Downloads data from S3)
Training Instance (runs training.py)
    ↓ (Trains model)
Model Artifacts saved to S3
    ↓
Create SageMaker Model
    ↓
Deploy to SageMaker Endpoint
    ↓
Real-time Predictions via API
```

### 3. **What You Need:**
- ✅ Training script (`training.py`, kept in `sagemaker_scripts/`)
- ✅ Parquet data uploaded to S3
- ✅ IAM Role with SageMaker permissions
- ✅ Inference script (`inference.py`, for deployment)
- ✅ SageMaker SDK to orchestrate

### Step 1: Create a SageMaker Orchestration Script
This script launches the training job and deploys the model.

In [2]:
import boto3
import sagemaker
from sagemaker.sklearn import SKLearn
from sagemaker import get_execution_role
from datetime import datetime

# Initialize SageMaker session
session = sagemaker.Session()
bucket = 'energy-forecast-processed-krishnadev'  # REPLACE with your bucket name
region = boto3.Session().region_name

# Get execution role (or specify ARN directly)
try:
    role = get_execution_role()
except Exception:
    # If running outside SageMaker, use the role ARN you created.
    # Get your account ID from: aws sts get-caller-identity
    role = 'arn:aws:iam::205999347239:role/SageMakerExecutionRole-EnergyForecast'

print(f"Using role: {role}")
print(f"Using bucket: {bucket}")

# Define training data location (a prefix holding the Parquet part files)
train_data = f's3://{bucket}/features/energy_features.parquet/'


sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\Ambika M\AppData\Local\sagemaker\sagemaker\config.yaml


Couldn't call 'get_role' to get Role ARN from role name Krishnadev to get Role path.


Using role: arn:aws:iam::205999347239:role/SageMakerExecutionRole-EnergyForecast
Using bucket: energy-forecast-processed-krishnadev
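Before launching a job, it can help to confirm that the prefix in `train_data` actually holds Parquet files. The sketch below is one way to do that with boto3's `list_objects_v2` paginator. The helper names `split_s3_uri` and `list_parquet_keys` are hypothetical, and the `.parquet` suffix check assumes the data was written as a directory of part files.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_parquet_keys(s3_client, uri):
    """Return every object key under the URI that ends in '.parquet'."""
    bucket, prefix = split_s3_uri(uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
    return keys
```

In the notebook this could be called as `list_parquet_keys(boto3.client('s3'), train_data)`. An empty list suggests the upload step needs another look before a training instance is started.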


### Step 4: Create and Run the Training Job

In [3]:
# Create SKLearn estimator for training
sklearn_estimator = SKLearn(
    entry_point='training.py',              # Your training script
    source_dir='sagemaker_scripts',         # Only include training scripts (NOT entire directory)
    role=role,
    instance_type='ml.m5.xlarge',        # Training instance type
    instance_count=1,
    framework_version='1.2-1',           # Scikit-learn version
    py_version='py3',
    output_path='s3://energy-forecast-models-krishnadev',
    sagemaker_session=session,
    hyperparameters={
        # You can pass hyperparameters here
    }
)

# Start training job
print("Starting training job...")
sklearn_estimator.fit({'training': train_data})

print("Training completed!")
print(f"Model artifacts saved to: {sklearn_estimator.model_data}")

Starting training job...


INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2025-10-23-16-56-05-998


2025-10-23 16:56:16 Starting - Starting the training job......
2025-10-23 16:56:48 Downloading - Downloading input data......
2025-10-23 16:57:08 Downloading - Downloading the training image......
2025-10-23 16:57:54 Training - Training image download completed. Training in progress.......
2025-10-23 16:58:27 Uploading - Uploading generated training model
2025-10-23 16:58:27 Completed - Training job completed

Training seconds: 99
Billable seconds: 99
Training completed!
Model artifacts saved to: s3://energy-forecast-models-krishnadev/sagemaker-scikit-learn-2025-10-23-16-56-05-998/output/model.tar.gz
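The `hyperparameters` dict in the estimator above is left empty. In script mode, SageMaker hands each entry to the entry point as a `--name value` command-line argument, so the training script can read them with `argparse`. The sketch below illustrates the pattern. The names `n-estimators` and `max-depth` are hypothetical, chosen only because the script trains a Random Forest, and are not taken from the original `training.py`.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    # Hypothetical Random Forest settings; the defaults apply when the
    # estimator's hyperparameters dict omits them.
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=None)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

With this in the training script, the estimator could be given `hyperparameters={'n-estimators': 200, 'max-depth': 12}` and the values would be available as `args.n_estimators` and `args.max_depth`.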

### Step 5: Deploy the Model to an Endpoint

In [4]:
# Deploy the trained model
print("Deploying model to endpoint...")

predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',         # Inference instance type
    endpoint_name=f'energy-model-{datetime.now().strftime("%Y%m%d-%H%M%S")}',
    entry_point='inference.py'  # Your inference script
)

print(f"Model deployed to endpoint: {predictor.endpoint_name}")
print("Endpoint is ready for predictions!")

Deploying model to endpoint...


INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2025-10-23-16-58-37-432
INFO:sagemaker:Creating endpoint-config with name energy-model-20251023-222837
INFO:sagemaker:Creating endpoint with name energy-model-20251023-222837


-----------!
Model deployed to endpoint: energy-model-20251023-222837
Endpoint is ready for predictions!


### Step 6: Make Predictions

In [None]:
# Configure serializers for proper JSON handling
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

# Example prediction - send as list of dictionaries
test_data = [
    {
        'hour': 10,
        'day_of_week': 1,
        'day': 15,
        'month': 6,
        'is_weekend': 0,
        'Global_active_power_max': 4.5,
        'Global_active_power_min': 0.5,
        'Global_active_power_std': 1.2,
        'Voltage_mean': 240.5,
        'power_lag_1h': 3.2,
        'power_lag_24h': 3.5,
        'power_lag_168h': 3.1,
        'power_rolling_mean_7d': 3.3,
        'power_rolling_std_7d': 0.8
    }
]

print("Sending prediction request...")
print(f"Input data: {test_data}\n")

# Make prediction
try:
    result = predictor.predict(test_data)
    print(f"✅ Prediction successful!")
    print(f"Result: {result}")
except Exception as e:
    print(f"❌ Error making prediction: {str(e)}")
    print("\n🔍 Check CloudWatch logs below for detailed error messages.")

Error making prediction: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
". See https://ap-south-1.console.aws.amazon.com/cloudwatch/home?region=ap-south-1#logEventViewer:group=/aws/sagemaker/Endpoints/energy-model-20251023-222837 in account 205999347239 for more information.
Check CloudWatch logs for details
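The 500 response above says only that the request failed inside the serving container; the client cannot see the cause. One common culprit (an assumption here, since the original `inference.py` is not shown) is an input handler that does not accept the JSON list-of-dicts payload. The SageMaker scikit-learn serving container looks for handler functions named `model_fn`, `input_fn`, `predict_fn`, and `output_fn` in the entry point. The sketch below shows three of them and leaves output serialization to the container's default handler. The artifact file name `model.joblib` and the feature order in `FEATURE_COLUMNS` are assumptions and must match what the training script actually produced.

```python
import json
import os

# Assumed feature order, copied from the test payload; it must match
# whatever order the model was trained on.
FEATURE_COLUMNS = [
    "hour", "day_of_week", "day", "month", "is_weekend",
    "Global_active_power_max", "Global_active_power_min",
    "Global_active_power_std", "Voltage_mean",
    "power_lag_1h", "power_lag_24h", "power_lag_168h",
    "power_rolling_mean_7d", "power_rolling_std_7d",
]


def model_fn(model_dir):
    """Load the model; 'model.joblib' is an assumed artifact file name."""
    import joblib  # imported lazily; available in the scikit-learn container
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(request_body, request_content_type):
    """Turn a JSON list of feature dicts into rows in the assumed order."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    records = json.loads(request_body)
    if isinstance(records, dict):  # allow a single record
        records = [records]
    missing = [c for c in FEATURE_COLUMNS if any(c not in r for r in records)]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [[r[c] for c in FEATURE_COLUMNS] for r in records]


def predict_fn(input_data, model):
    """Run the model and return plain Python floats."""
    return [float(p) for p in model.predict(input_data)]
```

Raising `ValueError` with the names of missing features means the endpoint logs, once available, point at the payload rather than at a stack trace deep inside scikit-learn.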


### Step 6.25: Alternative - Test with JSON String Format

In [None]:
# Try with proper serialization
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Set serializers
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

# Test data - make sure it matches the exact feature order from training
test_data = [
    {
        'hour': 10,
        'day_of_week': 1,
        'day': 15,
        'month': 6,
        'is_weekend': 0,
        'Global_active_power_max': 4.5,
        'Global_active_power_min': 0.5,
        'Global_active_power_std': 1.2,
        'Voltage_mean': 240.5,
        'power_lag_1h': 3.2,
        'power_lag_24h': 3.5,
        'power_lag_168h': 3.1,
        'power_rolling_mean_7d': 3.3,
        'power_rolling_std_7d': 0.8
    }
]

print("Sending prediction request...")
print(f"Input data: {test_data}\n")

try:
    result = predictor.predict(test_data)
    print("✅ Prediction successful!")
    print(f"Result: {result}")
except Exception as e:
    print(f"❌ Error making prediction: {e}")
    print("\n🔍 This error indicates the inference script has an issue.")
    print("Check the CloudWatch logs cell below for detailed error messages.")

### Step 6.5: Check CloudWatch Logs for Errors

In [19]:
# Fetch recent CloudWatch logs for the endpoint
import time

logs_client = boto3.client('logs', region_name=region)

# Check if endpoint exists and get its status
sm_client = boto3.client('sagemaker', region_name=region)
try:
    endpoint_desc = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
    print(f"Endpoint Status: {endpoint_desc['EndpointStatus']}")
    print(f"Endpoint ARN: {endpoint_desc['EndpointArn']}\n")
except Exception as e:
    print(f"Error describing endpoint: {str(e)}\n")

# Try to find the log group
log_group = f'/aws/sagemaker/Endpoints/{predictor.endpoint_name}'
print(f"Looking for log group: {log_group}\n")

try:
    # Check if log group exists
    log_groups = logs_client.describe_log_groups(
        logGroupNamePrefix='/aws/sagemaker/Endpoints/'
    )
    
    print("Available SageMaker Endpoint log groups:")
    for lg in log_groups['logGroups']:
        print(f"  - {lg['logGroupName']}")
    print()
    
    # Try to get log streams
    streams = logs_client.describe_log_streams(
        logGroupName=log_group,
        orderBy='LastEventTime',
        descending=True,
        limit=5
    )
    
    print(f"Found {len(streams['logStreams'])} log streams\n")
    
    # Get recent logs from the most recent stream
    if streams['logStreams']:
        stream_name = streams['logStreams'][0]['logStreamName']
        print(f"Reading from stream: {stream_name}\n")
        
        events = logs_client.get_log_events(
            logGroupName=log_group,
            logStreamName=stream_name,
            limit=50,
            startFromHead=False
        )
        
        print("=" * 80)
        print("RECENT LOGS:")
        print("=" * 80)
        for event in events['events'][-20:]:  # Last 20 log messages
            print(event['message'])
        print("=" * 80)
    else:
        print("⚠️  No log streams found yet.")
        print("This is normal if:")
        print("  1. The endpoint was just created")
        print("  2. No prediction requests have been made yet")
        print("  3. Logs are still being initialized")
        print("\n💡 Try making a prediction request first, then check logs again.")
        
except logs_client.exceptions.ResourceNotFoundException:
    print("⚠️  Log group doesn't exist yet.")
    print("\nThis means:")
    print("  - The endpoint hasn't generated any logs yet")
    print("  - You need to wait a few minutes after endpoint creation")
    print("  - Or the endpoint may not be fully deployed")
    print("\n💡 Steps to troubleshoot:")
    print("  1. Wait 2-3 minutes after endpoint creation")
    print("  2. Make a prediction request to generate logs")
    print("  3. Run this cell again")
    print(f"\n📊 Manual check: https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups")
    
except Exception as e:
    print(f"❌ Error fetching logs: {e}")
    print("\n📊 Manually check logs at:")
    print(f"https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups")

Endpoint Status: InService
Endpoint ARN: arn:aws:sagemaker:ap-south-1:205999347239:endpoint/energy-model-20251023-222837

Looking for log group: /aws/sagemaker/Endpoints/energy-model-20251023-222837

Available SageMaker Endpoint log groups:

⚠️  Log group doesn't exist yet.

This means:
  - The endpoint hasn't generated any logs yet
  - You need to wait a few minutes after endpoint creation
  - Or the endpoint may not be fully deployed

💡 Steps to troubleshoot:
  1. Wait 2-3 minutes after endpoint creation
  2. Make a prediction request to generate logs
  3. Run this cell again

📊 Manual check: https://ap-south-1.console.aws.amazon.com/cloudwatch/home?region=ap-south-1#logsV2:log-groups
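Once the log group exists, reading the tail of a single stream can miss the line that matters. The sketch below narrows the search with the CloudWatch Logs `filter_log_events` call, which searches across streams in a log group. The helper name `recent_error_messages` and the choice of search terms are assumptions, not part of the original notebook.

```python
import time


def recent_error_messages(logs_client, log_group, minutes=30, limit=50):
    """Return messages from the last `minutes` that mention an error."""
    start_ms = int((time.time() - minutes * 60) * 1000)
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        startTime=start_ms,
        # '?' terms are OR-ed together in CloudWatch filter patterns.
        filterPattern="?ERROR ?Traceback ?Exception",
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

In the notebook this could be called as `recent_error_messages(logs_client, log_group)`. It raises the same `ResourceNotFoundException` as the cell above while the log group is still missing.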

### Step 7: Clean Up (When Done Testing)

In [None]:
# Delete endpoint to avoid charges
predictor.delete_endpoint()
print("Endpoint deleted!")
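A real-time endpoint accrues instance charges for as long as it is running, so after cleanup it can be worth confirming that no endpoint from earlier test runs was left behind. The sketch below uses the SageMaker `list_endpoints` call with its `NameContains` filter. The helper name is hypothetical, and the default fragment assumes the `energy-model-` naming used in the deploy step.

```python
def find_leftover_endpoints(sm_client, name_fragment="energy-model-"):
    """Return (name, status) for endpoints whose name contains the fragment."""
    leftovers = []
    paginator = sm_client.get_paginator("list_endpoints")
    for page in paginator.paginate(NameContains=name_fragment):
        for endpoint in page["Endpoints"]:
            leftovers.append((endpoint["EndpointName"], endpoint["EndpointStatus"]))
    return leftovers
```

Called as `find_leftover_endpoints(boto3.client('sagemaker', region_name=region))`, an empty list means nothing with that prefix remains in the region.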

---
## Summary: How It All Works

1. **You write** `training.py` (the training logic)
2. **SageMaker receives** your script and S3 data location
3. **SageMaker launches** a training instance (EC2 machine)
4. **SageMaker downloads** your parquet files from S3 to the instance
5. **Your script runs** on that instance with the data
6. **Model artifacts** are saved and uploaded back to S3
7. **You deploy** by creating a SageMaker Model + Endpoint
8. **SageMaker creates** an inference instance running your model
9. **You send requests** to the endpoint URL to get predictions
10. **The endpoint** loads your model and returns predictions
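Step 9 does not require the SageMaker SDK `predictor` object; any client with AWS credentials can call the endpoint through the `sagemaker-runtime` API. The sketch below uses boto3's `invoke_endpoint`. The helper name `invoke_json_endpoint` is hypothetical, and it assumes the endpoint accepts and returns JSON.

```python
import json


def invoke_json_endpoint(runtime_client, endpoint_name, records):
    """Send a JSON payload to a SageMaker endpoint and decode the JSON reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(records),
    )
    # 'Body' is a streaming object; read() returns the raw bytes.
    return json.loads(response["Body"].read())
```

In the notebook this could be called as `invoke_json_endpoint(boto3.client('sagemaker-runtime', region_name=region), predictor.endpoint_name, test_data)`.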

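### Key Benefits:
- ✅ No servers to provision or patch; training instances are created per job and released when it finishes
- ✅ Automatic scaling for endpoints, once a scaling policy is configured
- ✅ Built-in logging and monitoring through CloudWatch
- ✅ Version control for models (each training job writes its own `model.tar.gz` under a job-specific S3 path)
- ✅ A/B testing capabilities through multiple production variants on one endpoint
- ✅ Pay only for what you use: training is billed by the second, while an endpoint is billed for as long as it runs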