# SageMaker Inference: Bi-Encoder RoBERTa with Scale-to-Zero Scheduling

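Deploy the [KLUE RoBERTa](https://huggingface.co/klue/roberta-base) model as a SageMaker Inference Component and run inference against it.

This notebook uses the **Scale-to-Zero** feature to implement a schedule that scales instances down to zero for the weekend and restores them for the workweek.

**Key features:**
- Model deployment with an Inference Component
- ManagedInstanceScaling with MinInstanceCount=0
- Automatic scaling with EventBridge Scheduler
- Scale-in for the weekend (Friday evening), scale-out for the workweek (Monday morning)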

#### Reference blog
- [Unlock cost savings with the new scale down to zero feature in SageMaker Inference](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/)
---

## [Prerequisite] Save the AWS role ARN in a .env file as follows
```
SAGEMAKER_ROLE_ARN=arn:aws:iam::XXXXXX:role/gonsoomoon-sm-inference
```

## 0. Check the environment

In [1]:
! which python
! uv pip list | grep torch

/home/ubuntu/lab/16-robert-sagemaker-inference/setup/.venv/bin/python
Using Python 3.11.0rc1 environment at: /home/ubuntu/lab/16-robert-sagemaker-inference/setup/.venv
torch                            2.5.0+cu121


In [2]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')

In [3]:
from dotenv import load_dotenv
import os

load_dotenv('../.env')
SAGEMAKER_ROLE_ARN = os.getenv('SAGEMAKER_ROLE_ARN')
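
If the `.env` file is missing or the key is misspelled, `os.getenv` returns `None` and the problem only surfaces later, when the model is created. The following is a minimal sketch of an early check. The helper name `require_role_arn` and the regular expression are assumptions for illustration, not part of the original notebook.

```python
import re

# Assumed pattern for an IAM role ARN: a 12-digit account ID followed by role/<name>.
# The partition part also accepts variants such as "aws-cn" and "aws-us-gov".
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def require_role_arn(value):
    """Return the ARN unchanged, or raise ValueError if it is missing or malformed."""
    if not value:
        raise ValueError("SAGEMAKER_ROLE_ARN is not set; check ../.env")
    if not ROLE_ARN_PATTERN.match(value):
        raise ValueError(f"Not a valid IAM role ARN: {value!r}")
    return value
```

In the notebook this could wrap the lookup, for example `SAGEMAKER_ROLE_ARN = require_role_arn(os.getenv('SAGEMAKER_ROLE_ARN'))`.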

## 1. Environment setup

In [4]:
import json
import time
import boto3
import sagemaker
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

sagemaker_session = sagemaker.Session()
sagemaker_client = boto3.client("sagemaker")
role = SAGEMAKER_ROLE_ARN
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name

print(f"Bucket: {bucket}")
print(f"Region: {region}")
print(f"Role: {role}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml
Bucket: sagemaker-us-east-1-057716757052
Region: us-east-1
Role: arn:aws:iam::057716757052:role/gonsoomoon-sm-inference


## 2. Create the model artifact and upload it to S3

SageMaker recognizes the model when it is packaged as a model.tar.gz with the layout below.

model.tar.gz layout:
```
model.tar.gz/
├── config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── vocab.txt
└── code/
    ├── inference.py
    └── requirements.txt
```

Reference: [SageMaker PyTorch Documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#deploy-pytorch-models)
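
The layout matters because the model files must sit at the archive root, with the inference script under `code/`. As a hedged alternative to the shell commands used in this notebook, the sketch below builds the same layout with Python's standard `tarfile` module and lists the members for verification. The function names `build_model_archive` and `list_archive` are hypothetical.

```python
import tarfile
from pathlib import Path


def build_model_archive(source_dir, archive_path):
    """Pack the contents of source_dir at the root of a gzip-compressed tar archive."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            # arcname is relative to source_dir, so config.json sits at the
            # archive root and inference.py lands under code/.
            tar.add(path, arcname=path.relative_to(source).as_posix(), recursive=False)
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Write the archive outside `source_dir`; otherwise the archive would try to include itself.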

### Create the model.tar.gz file

In [5]:
!rm -rf ../model_artifact
!mkdir -p ../model_artifact/code
!cp ../src/inference.py ../model_artifact/code/
!cp ../src/requirements.txt ../model_artifact/code/
!cp ../model/* ../model_artifact/

!cd ../model_artifact && tar -czf ../model.tar.gz *

model_artifact_s3_uri = f's3://{bucket}/klue-roberta-inference/model/model.tar.gz'
!aws s3 cp ../model.tar.gz {model_artifact_s3_uri}

print(f"Model uploaded to: {model_artifact_s3_uri}")

upload: ../model.tar.gz to s3://sagemaker-us-east-1-057716757052/klue-roberta-inference/model/model.tar.gz
Model uploaded to: s3://sagemaker-us-east-1-057716757052/klue-roberta-inference/model/model.tar.gz


## 3. Create the SageMaker Model

To use an Inference Component, you must first create a SageMaker Model.

In [6]:
# Retrieve the PyTorch image URI
from sagemaker import image_uris

pytorch_image_uri = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.5",
    py_version="py311",
    instance_type="ml.g4dn.xlarge",
    image_scope="inference"
)

print(f"PyTorch Image URI: {pytorch_image_uri}")

PyTorch Image URI: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.5-gpu-py311


#### Create the SageMaker Model

In [9]:
# Create the SageMaker Model
prefix = sagemaker.utils.unique_name_from_base("roberta-dual-encoder")
model_name = f"{prefix}-model"

# Check whether the model already exists
try:
    sagemaker_client.describe_model(ModelName=model_name)
    print(f"✅ Model already exists: {model_name}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find model" in str(e):
        # The model does not exist, so create it
        create_model_response = sagemaker_client.create_model(
            ModelName=model_name,
            ExecutionRoleArn=role,
            PrimaryContainer={
                "Image": pytorch_image_uri,
                "ModelDataUrl": model_artifact_s3_uri,
                "Environment": {
                    "SAGEMAKER_PROGRAM": "inference.py",
                    "SAGEMAKER_SUBMIT_DIRECTORY": model_artifact_s3_uri,
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                    "SAGEMAKER_REGION": region,
                    "MMS_DEFAULT_WORKERS_PER_MODEL": "1"
                }
            }
        )
        print(f"✅ Model created: {model_name}")
    else:
        raise

print(f"Model Name: {model_name}")

✅ Model created: roberta-dual-encoder-1760700181-8454-model
Model Name: roberta-dual-encoder-1760700181-8454-model
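
The existence checks throughout this notebook match on the text of the raised exception. The following is a minimal sketch of a shared helper for that pattern, assuming the exception carries botocore's usual `response["Error"]["Message"]` structure. The name `is_not_found` is hypothetical.

```python
def is_not_found(error, marker):
    """Return True if a ClientError-like exception's message contains marker."""
    # Exceptions without a botocore-style response never count as "not found".
    response = getattr(error, "response", None) or {}
    message = response.get("Error", {}).get("Message", "")
    return marker.lower() in message.lower()
```

With such a helper, a check would read `if is_not_found(e, "Could not find model"):` and the same function could serve the endpoint, endpoint config, and inference component checks.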


## 4. Create the Endpoint Configuration (Scale-to-Zero support)

Enable **ManagedInstanceScaling** and set **MinInstanceCount=0** so that the instance count can drop to zero.

In [10]:
# Endpoint configuration settings
endpoint_config_name = f"{prefix}-scale-to-zero-config"
variant_name = "AllTraffic"
instance_type = "ml.g4dn.xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0  # Set to 0 for Scale-to-Zero
max_instance_count = 2

# Check whether the endpoint config already exists
try:
    sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
    print(f"✅ Endpoint Config already exists: {endpoint_config_name}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find endpoint configuration" in str(e):
        # The endpoint config does not exist, so create it
        sagemaker_client.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ExecutionRoleArn=role,
            ProductionVariants=[
                {
                    "VariantName": variant_name,
                    "InstanceType": instance_type,
                    "InitialInstanceCount": 1,
                    "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                    "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                    "ManagedInstanceScaling": {
                        "Status": "ENABLED",
                        "MinInstanceCount": min_instance_count,
                        "MaxInstanceCount": max_instance_count,
                    },
                    "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
                }
            ],
        )
        print(f"✅ Endpoint Config created: {endpoint_config_name}")
    else:
        raise

print(f"Endpoint Config Name: {endpoint_config_name}")

✅ Endpoint Config created: roberta-dual-encoder-1760700181-8454-scale-to-zero-config
Endpoint Config Name: roberta-dual-encoder-1760700181-8454-scale-to-zero-config


## 5. Create the SageMaker Endpoint

In [11]:
# Create the endpoint
endpoint_name = f"{prefix}-scale-to-zero-endpoint"

# Check whether the endpoint already exists
try:
    endpoint_desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    print(f"✅ Endpoint already exists: {endpoint_name}")
    print(f"   Status: {endpoint_desc['EndpointStatus']}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find endpoint" in str(e):
        # The endpoint does not exist, so create it
        sagemaker_client.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
        print(f"✅ Endpoint creation initiated: {endpoint_name}")
    else:
        raise

print(f"Endpoint Name: {endpoint_name}")

✅ Endpoint creation initiated: roberta-dual-encoder-1760700181-8454-scale-to-zero-endpoint
Endpoint Name: roberta-dual-encoder-1760700181-8454-scale-to-zero-endpoint


In [None]:
# Wait until the endpoint reaches InService (takes about 3-5 minutes)
import time
import sys

start_time = time.time()

while True:
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = desc["EndpointStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

Creating


Creating
Creating
Creating
Creating
InService

Total time taken: 150.48 seconds (2.51 minutes)
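
The wait loop above has no upper bound, so a stuck deployment would keep the cell running indefinitely. The following is a minimal sketch of a reusable polling helper with a timeout. The name `wait_for_status` and its parameters are assumptions for illustration; `describe` is any zero-argument callable that returns a describe-style dict.

```python
import time


def wait_for_status(describe, status_key, done=("InService", "Failed"),
                    timeout=1800, interval=30,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until its status_key value is in done or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = describe()[status_key]
        print(status, flush=True)
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(interval)
```

For the endpoint this could be called as `wait_for_status(lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name), "EndpointStatus")`, and the same helper fits the Inference Component loops with `"InferenceComponentStatus"`. For endpoints specifically, boto3 also provides the `endpoint_in_service` waiter.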


## 6. Create the Inference Component

Create an Inference Component to deploy the model to the endpoint.

In [13]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

inference_component_name = f"{prefix}-inference-component"

# Check whether the Inference Component already exists
try:
    ic_desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    print(f"✅ Inference Component already exists: {inference_component_name}")
    print(f"   Status: {ic_desc['InferenceComponentStatus']}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find inference component" in str(e):
        # The Inference Component does not exist, so create it
        sagemaker_client.create_inference_component(
            InferenceComponentName=inference_component_name,
            EndpointName=endpoint_name,
            VariantName=variant_name,
            Specification={
                "ModelName": model_name,
                "StartupParameters": {
                    "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                    "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                },
                "ComputeResourceRequirements": {
                    "MinMemoryRequiredInMb": 4096,
                    "NumberOfAcceleratorDevicesRequired": 1,
                },
            },
            RuntimeConfig={"CopyCount": 1},
        )
        print(f"✅ Inference Component creation initiated: {inference_component_name}")
    else:
        raise

print(f"Inference Component Name: {inference_component_name}")

✅ Inference Component creation initiated: roberta-dual-encoder-1760700181-8454-inference-component
Inference Component Name: roberta-dual-encoder-1760700181-8454-inference-component


In [14]:
# Wait until the Inference Component reaches InService
start_time = time.time()

while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

Creating


Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService

Total time taken: 332.07 seconds (5.53 minutes)


## 7. Endpoint inference test

### Single-sample inference

In [15]:
import numpy as np

def test_inference_component(endpoint_name, inference_component_name, payload):
      """
      Simple inference test to confirm that the Inference Component works.
      
      Args:
          endpoint_name: SageMaker endpoint name
          inference_component_name: Inference Component name
          payload: inference request payload (dict with 'queries' and 'documents')
      """
      import boto3
      import json
      import time
      import numpy as np

      runtime_client = boto3.Session().client('sagemaker-runtime')

      try:
          print("🧪 Testing inference component...")
          print(f"   Queries: {len(payload['queries'])}")
          print(f"   Documents: {len(payload['documents'])}")

          start_time = time.time()

          response = runtime_client.invoke_endpoint(
              EndpointName=endpoint_name,
              InferenceComponentName=inference_component_name,
              ContentType='application/json',
              Body=json.dumps(payload)
          )

          # Read the response body
          response_body = response['Body'].read().decode()

          # Debugging: check for an empty response
          if not response_body:
              print(f"⚠️  Warning: Empty response body")
              return False

          # Parse the JSON
          result = json.loads(response_body)
          elapsed_time = time.time() - start_time

          print(f"✅ Success! Response time: {elapsed_time:.2f}s")
          print(f"   Embedding dimension: {result['embedding_dim']}")

          # Print the embedding shapes
          print(f"\nQuery embeddings shape: ({result['num_queries']}, {result['embedding_dim']})")
          print(f"Document embeddings shape: ({result['num_documents']}, {result['embedding_dim']})")

          # Compute and print the pairwise dot product
          # (equals cosine similarity only if the embeddings are L2-normalized)
          query_embs = np.array(result["query_embeddings"])
          doc_embs = np.array(result["doc_embeddings"])

          print(f"\nCosine similarity:")
          for i in range(len(query_embs)):
              similarity = np.dot(query_embs[i], doc_embs[i])
              print(f"   Pair {i+1}: {similarity:.4f}")

          return True

      except json.JSONDecodeError as e:
          print(f"❌ JSON parsing failed: {str(e)}")
          print(f"   Response body (first 200 chars): {response_body[:200] if 'response_body' in locals() else 'N/A'}")
          return False
      except Exception as e:
          print(f"❌ Failed: {str(e)}")
          print(f"   Error type: {type(e).__name__}")
          return False

# Usage example 1: a single query-document pair
payload1 = {
    "queries": ["맛있는 한국 전통 음식 김치찌개"],
    "documents": ["김치찌개와 된장찌개는 한국의 대표 전통 음식입니다."]
}
  
test_inference_component(endpoint_name, inference_component_name, payload1)

🧪 Testing inference component...
   Queries: 1
   Documents: 1
✅ Success! Response time: 0.53s
   Embedding dimension: 768

Query embeddings shape: (1, 768)
Document embeddings shape: (1, 768)

Cosine similarity:
   Pair 1: 0.8666


True
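
The cell above reports `np.dot` as cosine similarity. That is exact only when the embeddings returned by `inference.py` are already L2-normalized, which this notebook does not show. The sketch below computes cosine similarity explicitly, so the result does not depend on that assumption. It is dependency-free for clarity, and the function name `cosine_similarity` is chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Dividing by both norms removes the effect of vector length.
    return dot / (norm_u * norm_v)
```

For unit-length vectors this equals the plain dot product; for unnormalized vectors such as `[3, 4]` and `[6, 8]`, the dot product is 50 while the cosine similarity is 1.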

### Batch inference with 8 query-document pairs

In [17]:
import boto3
import json

# Create the runtime client
runtime_client = boto3.Session().client('sagemaker-runtime')

batch_payload = {
    "queries": [
        "맛있는 한국 전통 음식 김치찌개",
        "최신 기술 발전",
        "색깔",
        "여행 계획",
        "스포츠 경기",
        "영화 추천",
        "날씨 정보",
        "건강 관리"
    ],
    "documents": [
        "김치찌개와 된장찌개는 한국의 대표 전통 음식입니다.",
        "인공지능 기술이 빠르게 발전하고 있습니다.",
        "파리의 에펠탑은 프랑스의 상징입니다.",
        "제주도는 한국의 인기 여행지입니다.",
        "축구 경기가 오늘 저녁에 있습니다.",
        "최근 개봉한 영화가 좋은 평가를 받고 있습니다.",
        "내일은 맑은 날씨가 예상됩니다.",
        "규칙적인 운동이 건강에 좋습니다."
    ]
}

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    ContentType='application/json',
    Body=json.dumps(batch_payload)
)

batch_result = json.loads(response['Body'].read().decode())

print(f"Batch inference completed:")
print(f"  Queries: {batch_result['num_queries']}")
print(f"  Documents: {batch_result['num_documents']}")
print(f"  Embedding dim: {batch_result['embedding_dim']}\n")

# Compute the similarity of each pair (dot product; cosine if embeddings are normalized)
query_embs = np.array(batch_result['query_embeddings'])
doc_embs = np.array(batch_result['doc_embeddings'])

print("Pair-wise cosine similarities:")
for i in range(len(query_embs)):
    similarity = np.dot(query_embs[i], doc_embs[i])
    print(f"  Pair {i+1}: {similarity:.4f}")

Batch inference completed:
  Queries: 8
  Documents: 8
  Embedding dim: 768

Pair-wise cosine similarities:
  Pair 1: 0.8666
  Pair 2: 0.7164
  Pair 3: 0.4745
  Pair 4: 0.6969
  Pair 5: 0.6220
  Pair 6: 0.5832
  Pair 7: 0.6482
  Pair 8: 0.6714


## Manual Scale-to-Zero test

Run the code below to test immediately instead of waiting for the schedule.

### Scale the inference component in to 0 copies (Scale In)

In [18]:
# Scale in by setting CopyCount to 0
sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={'CopyCount': 0}
)

print("✅ Inference Component scaled down to 0 copies")
print("It may take a few minutes for the instance to terminate.")

✅ Inference Component scaled down to 0 copies
It may take a few minutes for the instance to terminate.


### Check the status

In [20]:
import time
import json

# Scale in by setting CopyCount to 0 (repeats the request from the previous cell)
sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={'CopyCount': 0}
)

print("✅ Request to scale Inference Component down to 0 copies submitted")
print("⏳ Monitoring instance termination...\n")

# Monitor the scale-down
max_wait_time = 600  # 10-minute timeout
start_time = time.time()
previous_copy_count = None

while True:
    try:
        elapsed = time.time() - start_time

        # Check for timeout
        if elapsed > max_wait_time:
            print(f"\n⚠️ Timeout: {max_wait_time} seconds elapsed. Stopping monitoring.")
            break

        # Check the Inference Component status
        ic_desc = sagemaker_client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )

        status = ic_desc['InferenceComponentStatus']
        runtime_config = ic_desc.get('RuntimeConfig', {})
        current_copy_count = runtime_config.get('CurrentCopyCount', runtime_config.get('CopyCount', 'N/A'))
        desired_copy_count = runtime_config.get('DesiredCopyCount', 0)

        # Print only when the copy count changes
        if current_copy_count != previous_copy_count:
            print(f"[{elapsed:.0f}s] 📊 Status Update:")
            print(f"   IC Status: {status}")
            print(f"   Current CopyCount: {current_copy_count}")
            print(f"   Desired CopyCount: {desired_copy_count}")

            # Full RuntimeConfig (for debugging)
            print(f"   RuntimeConfig: {json.dumps(runtime_config, default=str)}")
            print()

            previous_copy_count = current_copy_count

        # Stop once CopyCount reaches 0
        if current_copy_count == 0:
            print(f"✅ Scale-down complete! CopyCount is now 0.")
            print(f"   Total elapsed time: {elapsed:.0f}s ({elapsed/60:.1f} min)")
            break

        # Wait 10 seconds
        time.sleep(10)

    except sagemaker_client.exceptions.ClientError as e:
        if "Could not find inference component" in str(e):
            print(f"❌ Inference Component not found: {inference_component_name}")
            break
        else:
            print(f"❌ Error: {str(e)}")
            break
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")
        break

print("\n🎯 Monitoring finished")

✅ Request to scale Inference Component down to 0 copies submitted
⏳ Monitoring instance termination...

[0s] 📊 Status Update:
   IC Status: Updating
   Current CopyCount: 0
   Desired CopyCount: 0
   RuntimeConfig: {"DesiredCopyCount": 0, "CurrentCopyCount": 0}

✅ Scale-down complete! CopyCount is now 0.
   Total elapsed time: 0s (0.0 min)

🎯 Monitoring finished


#### Actual test: invoke while scaled to zero

In [21]:
# Usage example 1: a single query-document pair
payload1 = {
    "queries": ["맛있는 한국 전통 음식 김치찌개"],
    "documents": ["김치찌개와 된장찌개는 한국의 대표 전통 음식입니다."]
}
  
test_inference_component(endpoint_name, inference_component_name, payload1)

🧪 Testing inference component...
   Queries: 1
   Documents: 1
❌ Failed: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.
   Error type: ValidationError


False
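
While the copy count is 0, every invocation fails with the "no capacity" error shown above. Retrying helps only if something is restoring capacity in the meantime, such as the scale-out schedule or a manual `CopyCount` update. Under that assumption, the sketch below retries with exponential backoff. The helper name `invoke_with_retry`, its defaults, and the message match on "no capacity" are illustrative choices, not part of the original notebook.

```python
import time


def invoke_with_retry(invoke, attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call invoke(), retrying with exponential backoff on 'no capacity' errors."""
    for attempt in range(attempts):
        try:
            return invoke()
        except Exception as error:
            # Only the capacity error is retried; anything else, or the final
            # failed attempt, is re-raised to the caller.
            if "no capacity" not in str(error) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Here `invoke` would be a zero-argument callable wrapping `runtime_client.invoke_endpoint(...)`. With the defaults, the waits are 30, 60, 120, and 240 seconds, which is in the range of the roughly five-minute scale-out time observed in this notebook.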

### Scale the inference component out to 1 copy (Scale Out)

In [22]:
# Scale out by setting CopyCount to 1
sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={'CopyCount': 1}
)

print("✅ Inference Component scaled up to 1 copy")
print("It may take a few minutes for the instance to start.")

✅ Inference Component scaled up to 1 copy
It may take a few minutes for the instance to start.


### Check the inference component update status

In [23]:
# Wait until the Inference Component reaches InService
start_time = time.time()

while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

Updating


Updating
Updating
Updating
Updating
Updating
Updating
Updating
Updating
Updating
Updating
InService

Total time taken: 331.88 seconds (5.53 minutes)


In [24]:
# Usage example 1: a single query-document pair
payload1 = {
    "queries": ["맛있는 한국 전통 음식 김치찌개"],
    "documents": ["김치찌개와 된장찌개는 한국의 대표 전통 음식입니다."]
}
  
test_inference_component(endpoint_name, inference_component_name, payload1)

🧪 Testing inference component...
   Queries: 1
   Documents: 1
✅ Success! Response time: 0.49s
   Embedding dimension: 768

Query embeddings shape: (1, 768)
Document embeddings shape: (1, 768)

Cosine similarity:
   Pair 1: 0.8666


True

## 8. Scale-to-Zero scheduling with EventBridge Scheduler

### Prerequisites:
- The role must have the following policies attached.
    - AmazonEC2ContainerRegistryFullAccess
    - AmazonEventBridgeFullAccess
    - AmazonS3FullAccess
    - AmazonSageMakerFullAccess
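
Besides these permission policies, the role's trust policy matters, because this notebook passes the same role as the schedule target's `RoleArn`. EventBridge Scheduler has to assume that role to call SageMaker. The following is a sketch of a trust policy that allows both services. It is an assumption about how the role is set up, since the notebook does not show the role's actual trust policy.

```python
import json

# Trust policy sketch: both services must be able to assume the role, because the
# notebook reuses one role for the SageMaker model and for the schedule target.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["sagemaker.amazonaws.com", "scheduler.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

A dedicated, narrowly scoped role for the scheduler would be a reasonable alternative to sharing the SageMaker execution role.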

### Method 1: Use the UpdateInferenceComponentRuntimeConfig API

Scale in by setting the Inference Component's CopyCount to 0.

### Weekend Scale-in (Friday evening)

Create a schedule that sets CopyCount to 0 every Friday at 18:00 KST (UTC+9).

In [25]:
role

'arn:aws:iam::057716757052:role/gonsoomoon-sm-inference'

In [26]:
import boto3
import json

scheduler = boto3.client('scheduler')

flex_window = {"Mode": "OFF"}

# Configure the scale-in schedule target
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({
        "DesiredRuntimeConfig": {"CopyCount": 0},
        "InferenceComponentName": inference_component_name
    })
}

# Scale in every Friday at 18:00 KST (UTC+9)
update_IC_scale_in_schedule = f"{prefix}-scale-to-zero-schedule"

try:
    scheduler.create_schedule(
        Name=update_IC_scale_in_schedule,
        ScheduleExpression="cron(00 18 ? * 6 *)",  # Friday 18:00 (day-of-week 6 = Friday)
        ScheduleExpressionTimezone="Asia/Seoul",  # Korea time zone
        Target=scale_in_target,
        FlexibleTimeWindow=flex_window,
        ActionAfterCompletion="NONE",  # Keep the schedule after it runs
    )
    print(f"✅ Scale-in schedule created: {update_IC_scale_in_schedule}")
except scheduler.exceptions.ConflictException:
    print(f"Schedule {update_IC_scale_in_schedule} already exists")

✅ Scale-in schedule created: roberta-dual-encoder-1760700181-8454-scale-to-zero-schedule
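
The day-of-week field in the cron expression is easy to misread: EventBridge cron numbers weekdays 1-7 starting from Sunday, so 6 is Friday and 2 is Monday. The sketch below builds the weekly expressions from weekday names. The helper `weekly_cron` is hypothetical and only formats the string; it does not call any AWS API.

```python
# EventBridge cron numbers weekdays 1-7 starting from Sunday.
DAY_OF_WEEK = {"SUN": 1, "MON": 2, "TUE": 3, "WED": 4, "THU": 5, "FRI": 6, "SAT": 7}


def weekly_cron(day, hour, minute=0):
    """Build a cron expression that fires once a week at the given local time."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year.
    # Day-of-month is "?" because day-of-week is specified.
    return f"cron({minute} {hour} ? * {DAY_OF_WEEK[day.upper()]} *)"


print(weekly_cron("FRI", 18))  # cron(0 18 ? * 6 *)
print(weekly_cron("MON", 7))   # cron(0 7 ? * 2 *)
```

These are equivalent to the zero-padded expressions used in this notebook. The local time is interpreted in the zone given by `ScheduleExpressionTimezone`.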


### Workweek Scale-out (Monday morning)

Create a schedule that restores CopyCount to 1 every Monday at 07:00 KST (UTC+9).

In [27]:
# Configure the scale-out schedule target
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({
        "DesiredRuntimeConfig": {"CopyCount": 1},
        "InferenceComponentName": inference_component_name
    })
}

# Scale out every Monday at 07:00 KST (UTC+9)
update_IC_scale_out_schedule = f"{prefix}-scale-out-schedule"

try:
    scheduler.create_schedule(
        Name=update_IC_scale_out_schedule,
        ScheduleExpression="cron(00 07 ? * 2 *)",  # Monday 07:00 (day-of-week 2 = Monday)
        ScheduleExpressionTimezone="Asia/Seoul",  # Korea time zone
        Target=scale_out_target,
        FlexibleTimeWindow=flex_window,
        ActionAfterCompletion="NONE",  # Keep the schedule after it runs
    )
    print(f"✅ Scale-out schedule created: {update_IC_scale_out_schedule}")
except scheduler.exceptions.ConflictException:
    print(f"Schedule {update_IC_scale_out_schedule} already exists")

✅ Scale-out schedule created: roberta-dual-encoder-1760700181-8454-scale-out-schedule


### Check the created schedules

In [28]:
# List the created schedules
try:
    schedules_to_check = [update_IC_scale_in_schedule, update_IC_scale_out_schedule]
    
    for schedule_name in schedules_to_check:
        try:
            schedule = scheduler.get_schedule(Name=schedule_name)
            print(f"\n📅 Schedule: {schedule_name}")
            print(f"   Expression: {schedule['ScheduleExpression']}")
            print(f"   Timezone: {schedule['ScheduleExpressionTimezone']}")
            print(f"   State: {schedule['State']}")
        except scheduler.exceptions.ResourceNotFoundException:
            print(f"\n❌ Schedule not found: {schedule_name}")
except Exception as e:
    print(f"Error checking schedules: {str(e)}")


📅 Schedule: roberta-dual-encoder-1760700181-8454-scale-to-zero-schedule
   Expression: cron(00 18 ? * 6 *)
   Timezone: Asia/Seoul
   State: ENABLED

📅 Schedule: roberta-dual-encoder-1760700181-8454-scale-out-schedule
   Expression: cron(00 07 ? * 2 *)
   Timezone: Asia/Seoul
   State: ENABLED


## 9. Clean up resources

Clean up the resources once testing is complete.

### Delete the schedules

In [31]:
# Delete the created schedules
schedules = [update_IC_scale_in_schedule, update_IC_scale_out_schedule]

for schedule in schedules:
    try:
        scheduler.delete_schedule(Name=schedule)
        print(f"✅ Deleted schedule: {schedule}")
    except scheduler.exceptions.ResourceNotFoundException:
        print(f"Schedule {schedule} not found.")

✅ Deleted schedule: roberta-dual-encoder-1760700181-8454-scale-to-zero-schedule
✅ Deleted schedule: roberta-dual-encoder-1760700181-8454-scale-out-schedule


### Delete the Inference Component, endpoint, and related resources

In [37]:
import time

# Delete the Inference Component
print("Deleting Inference Component...")
try:
    sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    sagemaker_client.delete_inference_component(
        InferenceComponentName=inference_component_name
    )
    print(f"✅ Inference Component deletion started: {inference_component_name}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find inference component" in str(e):
        print(f"ℹ️ Inference Component already deleted or does not exist: {inference_component_name}")
    else:
        raise

# Wait until the Inference Component is deleted
print("Waiting for Inference Component deletion...")
max_wait_time = 300  # 5-minute timeout
start_time = time.time()

while True:
    try:
        if time.time() - start_time > max_wait_time:
            print("⚠️ Timeout: gave up waiting for Inference Component deletion")
            break

        desc = sagemaker_client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )
        status = desc.get('InferenceComponentStatus', 'Unknown')
        print(f"   Status: {status}")
        time.sleep(10)

    except sagemaker_client.exceptions.ClientError as e:
        if "Could not find inference component" in str(e):
            print("✅ Inference Component deleted")
            break
        else:
            print(f"⚠️ Unexpected error: {str(e)}")
            break

# Delete the Endpoint
print("\nDeleting Endpoint...")
try:
    sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    print(f"✅ Endpoint deletion started: {endpoint_name}")

    # Wait for the Endpoint to be deleted
    max_wait_time = 300
    start_time = time.time()

    while True:
        try:
            if time.time() - start_time > max_wait_time:
                print("⚠️ Timeout: gave up waiting for Endpoint deletion")
                break

            desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
            status = desc.get('EndpointStatus', 'Unknown')
            print(f"   Endpoint Status: {status}")
            time.sleep(10)

        except sagemaker_client.exceptions.ClientError as e:
            if "Could not find endpoint" in str(e):
                print("✅ Endpoint deleted")
                break
            else:
                print(f"⚠️ Unexpected error: {str(e)}")
                break

except sagemaker_client.exceptions.ClientError as e:
    if "Could not find endpoint" in str(e):
        print(f"ℹ️ Endpoint already deleted or does not exist: {endpoint_name}")
    else:
        raise

# Delete the Endpoint Config
print("\nDeleting Endpoint Config...")
try:
    sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    print(f"✅ Endpoint Config deleted: {endpoint_config_name}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find endpoint configuration" in str(e):
        print(f"ℹ️ Endpoint Config already deleted or does not exist: {endpoint_config_name}")
    else:
        raise

# Delete the Model
print("\nDeleting Model...")
try:
    sagemaker_client.describe_model(ModelName=model_name)
    sagemaker_client.delete_model(ModelName=model_name)
    print(f"✅ Model deleted: {model_name}")
except sagemaker_client.exceptions.ClientError as e:
    if "Could not find model" in str(e):
        print(f"ℹ️ Model already deleted or does not exist: {model_name}")
    else:
        raise

print("\n🎉 All resources deleted!")

Deleting Inference Component...
✅ Inference Component deletion started: roberta-dual-encoder-1760700181-8454-inference-component
Waiting for Inference Component deletion...
   Status: Deleting
   Status: Deleting
   Status: Deleting
✅ Inference Component deleted

Deleting Endpoint...
✅ Endpoint deletion started: roberta-dual-encoder-1760700181-8454-scale-to-zero-endpoint
   Endpoint Status: Deleting
✅ Endpoint deleted

Deleting Endpoint Config...
✅ Endpoint Config deleted: roberta-dual-encoder-1760700181-8454-scale-to-zero-config

Deleting Model...
✅ Model deleted: roberta-dual-encoder-1760700181-8454-model

🎉 All resources deleted!


## Summary

This notebook implemented the following:

1. Deployed the **RoBERTa Dual Encoder model** as an Inference Component
2. Enabled **ManagedInstanceScaling** with MinInstanceCount=0
3. Used **EventBridge Scheduler** to:
   - Automatically set CopyCount to 0 on Friday at 18:00 KST (Scale-in)
   - Automatically restore CopyCount to 1 on Monday at 07:00 KST (Scale-out)
4. **Cost savings**: the instance count drops to 0 over the weekend, which reduces compute cost
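
To put a rough number on the savings, the sketch below computes how many hours per week the schedule keeps the endpoint at zero instances. It assumes the instance would otherwise run around the clock and ignores the few minutes needed to scale in and out. The dates are arbitrary examples of a Friday and the following Monday.

```python
from datetime import datetime

# Any Friday/Monday pair works; these dates are arbitrary examples.
scale_in = datetime(2025, 1, 3, 18, 0)   # Friday 18:00
scale_out = datetime(2025, 1, 6, 7, 0)   # following Monday 07:00

idle_hours = (scale_out - scale_in).total_seconds() / 3600
idle_fraction = idle_hours / (7 * 24)

print(f"{idle_hours:.0f} h idle per week, {idle_fraction:.1%} of instance hours")
# 61 h idle per week, 36.3% of instance hours
```

Under these assumptions the schedule removes a little over a third of the weekly instance hours. The actual bill depends on instance pricing and on how long scale-in and scale-out take.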

