
# Deploying Llama-4 Scout on SageMaker with the LMI Container

This notebook demonstrates deploying and running inference with the Llama-4 Scout model. We will cover:

1. Installing the SageMaker Python SDK and setting up SageMaker resources and permissions
2. Deploying the model using the SageMaker LMI (Large Model Inference) container, powered by vLLM 0.8.4
3. Invoking the model using streaming responses

## Environment Setup

First, we'll install the SageMaker SDK to ensure compatibility with the latest features, particularly those needed for large language model deployment and streaming inference.



In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs

## Configure Model Container and Instance

For deploying Llama-4, we'll use:
- **LMI (Large Model Inference) Container with vLLM 0.8.4 (V1 engine)**: A container built on DJL (Deep Java Library) Serving and optimized for large language model inference
- **P5 Instance**: An NVIDIA H100 GPU instance type suited to large model inference

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use `ml.p5.48xlarge` instances which offer:
  - 8 NVIDIA H100 GPUs
  - 640 GB of total GPU memory (8 × 80 GB)
  - High network bandwidth for optimal inference performance
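
As a rough sanity check on the instance choice, the arithmetic below compares the weight footprint with the total GPU memory listed above. It is only a sketch: the parameter count (`TOTAL_PARAMS`) and the bf16 precision are assumptions that do not come from this notebook, and the KV cache and activations will use part of the remaining headroom.

```python
# Back-of-the-envelope memory estimate. The parameter count below is an
# assumption for illustration, not a value taken from this notebook.
TOTAL_PARAMS = 109e9      # assumed total parameters across all experts
BYTES_PER_PARAM = 2       # assumed bf16 weights (2 bytes per parameter)
NUM_GPUS = 8              # H100 GPUs on ml.p5.48xlarge
GPU_MEMORY_GB = 80        # memory per GPU, giving 640 GB in total


def weight_footprint_gb(total_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the model weights in GB."""
    return total_params * bytes_per_param / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
total_gpu_gb = NUM_GPUS * GPU_MEMORY_GB
headroom_gb = total_gpu_gb - weights_gb

print(f"Approximate weights: {weights_gb:.0f} GB")
print(f"Total GPU memory:    {total_gpu_gb} GB")
print(f"Headroom for KV cache and activations: {headroom_gb:.0f} GB")
```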

> **Note**: The region in the container URI must match the region of your SageMaker session. Set `REGION` in the next cell to the region where you have P5 capacity.

In [None]:
# Define region where you have capacity
REGION = 'us-east-1'  

# Select the latest container. See the list of available versions:
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers
CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'

# Construct container URI
container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

# Select instance type
instance_type = "ml.p5.48xlarge"  # Alternative: "ml.p5e.48xlarge"

# Validate region and print configuration
if REGION != sess.boto_region_name:
    print(f"⚠️ Warning: Container region ({REGION}) differs from session region ({sess.boto_region_name})")
else:
    print(f"✅ Region validation passed: {REGION}")
    
print(f"📦 Container URI: {container_uri}")
print(f"🖥️ Instance Type: {instance_type}")

## Create SageMaker Model

**Important**: Before you proceed, request access to the model on Hugging Face. Requests are usually approved within a few minutes. Once your request is approved, generate a Hugging Face access token and set it as `HF_TOKEN` in the configuration below.

Now we'll create a SageMaker `Model` object. This step defines the model configuration but doesn't deploy it yet. The `Model` object combines:

1. **Container Image** (`image_uri`): The LMI (DJL Inference) container optimized for LLMs
2. **Environment Variables** (`env`): The vLLM and model server settings
3. **IAM Role** (`role`): Permissions for model execution

No separate model artifact is passed to the `Model` object here. The container loads the model identified by `HF_MODEL_ID` from Hugging Face.


In [None]:
vllm_config = {
    "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
    "HF_TOKEN": "",  # Paste your Hugging Face access token here
    "OPTION_MAX_MODEL_LEN": "250000",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
    "OPTION_MODEL_LOADING_TIMEOUT": "1500",
    "SERVING_FAIL_FAST": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
}
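
To avoid pasting the token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token has been exported as `HF_TOKEN` in the notebook's environment; the helper name `with_hf_token` is hypothetical and not part of the SageMaker SDK.

```python
import os


def with_hf_token(env: dict, var_name: str = "HF_TOKEN") -> dict:
    """Return a copy of env with HF_TOKEN filled in from the local environment.

    Hypothetical helper: raises early if the variable is missing, so the
    failure shows up here instead of during model loading on the endpoint.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise ValueError(f"Set the {var_name} environment variable before creating the model")
    updated = dict(env)  # copy, so the original configuration is left unchanged
    updated["HF_TOKEN"] = token
    return updated


# Usage (assumes vllm_config from the cell above):
# vllm_config = with_hf_token(vllm_config)
```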

In [None]:
model = Model(image_uri=container_uri,
              role=role,
              env=vllm_config)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (P5 instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.p5.48xlarge` for high-performance inference
- **Health Check Timeout**: 1800 seconds 
  - Extended timeout needed for large model loading
  - Includes time for container setup and model initialization

> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the endpoint status in SageMaker Console and CloudWatch logs for progress

In [None]:
endpoint_name = sagemaker.utils.name_from_base("Llama-4")

print(endpoint_name)
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,
)
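
By default `model.deploy` blocks until the endpoint is ready. If you want to follow progress from another session, or you called `deploy` with `wait=False`, a polling loop along these lines may help. This is a minimal sketch: `wait_for_endpoint` is a hypothetical helper, and `sm_client` is assumed to be a `boto3.client("sagemaker")` instance.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_wait_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.time() + max_wait_seconds
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.time() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {max_wait_seconds}s")
        print(f"Endpoint status: {status}")
        time.sleep(poll_seconds)


# Usage (assumes boto3 is installed and credentials are configured):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```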

## Running Inference Requests Against the Model

In [None]:
# Invoke the model
import json
import boto3
import time

# Create SageMaker Runtime client
smr_client = boto3.client('sagemaker-runtime')
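# This cell reuses endpoint_name from the deployment cell.
# To target an existing endpoint instead, uncomment and set it here:
# endpoint_name = '<your-endpoint-name>'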

# Invoke with messages format
body = {
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
    "temperature": 0.9,
    "max_tokens": 256,
    "stream": True,
}

start_time = time.time()
first_token_received = False
ttft = None
token_count = 0
full_response = ""

print(f"Prompt: {body['messages'][0]['content']}\n")
print("Response:", end=' ', flush=True)

# Invoke endpoint with streaming
resp = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)

# Process streaming response
for event in resp['Body']:
    if 'PayloadPart' in event:
        payload = event['PayloadPart']['Bytes'].decode()
        
        try:
            
            if payload.startswith('data: '):
                data = json.loads(payload[6:])  # Skip "data: " prefix
            else:
                data = json.loads(payload)
            
            token_count += 1
            if not first_token_received:
                ttft = time.time() - start_time
                first_token_received = True
            
            # Handle different streaming response formats
            if 'choices' in data and len(data['choices']) > 0:
                # Messages-compatible format
                if 'delta' in data['choices'][0] and 'content' in data['choices'][0]['delta']:
                    token_text = data['choices'][0]['delta']['content']
                    full_response += token_text
                    print(token_text, end='', flush=True)
            elif 'token' in data and 'text' in data['token']:
                # TGI format
                token_text = data['token']['text']
                full_response += token_text
                print(token_text, end='', flush=True)
        
        except json.JSONDecodeError:
            # Skip invalid JSON
            continue

end_time = time.time()
total_latency = end_time - start_time

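print("\n\nMetrics:")
if ttft is not None:
    print(f"Time to First Token (TTFT): {ttft:.2f} seconds")
else:
    print("Time to First Token (TTFT): no tokens received")
# token_count is incremented once per parsed stream chunk, so it only approximates the token count
print(f"Total Chunks Received: {token_count}")
print(f"Total Latency: {total_latency:.2f} seconds")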
#print(f"\nFull Response:\n{full_response}")
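
The loop above assumes that each `PayloadPart` carries exactly one complete JSON object. If an event is split across parts, or several events arrive in one part, those chunks are silently skipped. A more defensive variant buffers the raw bytes and parses one line at a time. The sketch below assumes the stream uses one JSON object per line, optionally prefixed with `data:`; the helper names are hypothetical.

```python
import json


def _parse_line(raw: bytes):
    """Parse one line of the stream; return a dict, or None if it carries no JSON."""
    text = raw.decode("utf-8").strip()
    if text.startswith("data:"):
        text = text[len("data:"):].strip()
    if not text or text == "[DONE]":
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def iter_sse_json(byte_chunks):
    """Yield JSON objects from byte chunks that may split or merge lines."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Only complete lines are parsed; the remainder waits for the next chunk.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            obj = _parse_line(line)
            if obj is not None:
                yield obj
    # Flush whatever is left once the stream ends.
    obj = _parse_line(buffer)
    if obj is not None:
        yield obj


# Usage with the streaming response from the cell above:
# parts = (e["PayloadPart"]["Bytes"] for e in resp["Body"] if "PayloadPart" in e)
# for data in iter_sse_json(parts):
#     print(data["choices"][0]["delta"].get("content", ""), end="", flush=True)
```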

In [None]:
import json
import boto3
import base64

# Read an image file and return its base64-encoded contents.
# The "data:image/png;base64," prefix is added when the payload is built below.
def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return base64_data

# Path to your PNG image
image_path = "./img/trip.png"
base64_image = image_to_base64_data_uri(image_path)

# Create SageMaker Runtime client for invocation
smr_client = boto3.client('sagemaker-runtime')

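# This cell reuses endpoint_name from the deployment cell.
# To target an existing endpoint instead, uncomment and set it here:
# endpoint_name = '<your-endpoint-name>'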

# Prepare request payload with image in OpenAI format
payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail please."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

# Option 1: Non-streaming invocation
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)
result = json.loads(response['Body'].read().decode())
print(result["choices"][0]["message"]["content"])
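
The helper in this cell returns only the base64 string, and the payload hard-codes `image/png`. If you plan to send other formats such as JPEG, a variant that builds the full data URI and guesses the MIME type from the file extension may be convenient. This is a sketch; `file_to_data_uri` is a hypothetical name, and whether the endpoint accepts a given image format is not verified here.

```python
import base64
import mimetypes


def file_to_data_uri(file_path: str, default_mime: str = "application/octet-stream") -> str:
    """Return a data URI for the file, with the MIME type guessed from its extension."""
    mime_type, _ = mimetypes.guess_type(file_path)
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type or default_mime};base64,{encoded}"


# Usage: pass the result directly as the image URL in the payload
# {"type": "image_url", "image_url": {"url": file_to_data_uri("./img/trip.png")}}
```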

In [None]:
# Option 2: Streaming response invocation
streaming_payload = payload.copy()
streaming_payload["stream"] = True

response_stream = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(streaming_payload)
)

print("Response:", end=' ', flush=True)
full_response = ""

for event in response_stream['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        
        try:
            # Handle SSE format (data: prefix)
            if chunk.startswith('data: '):
                data = json.loads(chunk[6:])  # Skip "data: " prefix
            else:
                data = json.loads(chunk)
            
            # Extract token based on OpenAI format
            if 'choices' in data and len(data['choices']) > 0:
                if 'delta' in data['choices'][0] and 'content' in data['choices'][0]['delta']:
                    token_text = data['choices'][0]['delta']['content']
                    full_response += token_text
                    print(token_text, end='', flush=True)
        
        except json.JSONDecodeError:
            continue



In [None]:
# Delete the endpoint and its endpoint configuration

# Uncomment the lines below if `sess` is no longer defined in this kernel
# import sagemaker
# sess = sagemaker.Session()


print(f"Deleting SageMaker resources for endpoint: {endpoint_name}")
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
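
The calls above remove the endpoint and its configuration but leave the SageMaker model resource in place. If you also want to delete the model, one approach is to look up the names through the endpoint itself. The sketch below is an assumption-laden example: `delete_endpoint_resources` is a hypothetical helper, and `sm_client` is assumed to be a `boto3.client("sagemaker")` instance.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models that config references."""
    # Look up the related resource names before deleting anything.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"] if "ModelName" in v]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Usage (assumes boto3 is installed and credentials are configured):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```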

