# Inventory Monitoring at Distribution Centers

This notebook guides you through building, training, and deploying a machine learning model for **Inventory Monitoring at Distribution Centers** using **AWS SageMaker**. The goal is to create a model that can count the number of objects in bins based on images. This project uses the **Amazon Bin Image Dataset**, which contains images from Amazon Fulfillment Centers showing bins with various objects. 

The key steps in this project include:
1. Data preparation and upload to S3.
2. Model training on SageMaker using a custom script (`train.py`).
3. Model deployment to a SageMaker endpoint for inference.
4. Optional tasks such as hyperparameter tuning, debugging, profiling, and cost analysis.

This project focuses on implementing a machine learning engineering pipeline rather than achieving high accuracy.


In [1]:
# TODO: Install any packages that you might need
!pip install tqdm boto3



In [2]:
# TODO: Import any packages that you might need
import os
import json
import boto3
import sagemaker
from sagemaker import get_execution_role
from tqdm import tqdm

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Data Preparation
**TODO:** Run the cell below to download the data.

The cell below creates a folder called `train_data`, downloads the training data, and arranges it in subfolders. Each subfolder contains images whose object count equals the folder name. For instance, every image in folder `1` shows exactly 1 object. The images are not divided into training, testing, or validation sets. If the number of samples seems too small, you can download more data (instructions can be found [here](https://registry.opendata.aws/amazon-bin-imagery/)). However, you are not being assessed on the accuracy of your final trained model, but on how you create your machine learning engineering pipeline.

In [3]:
#import os
#import json
#import boto3

def download_and_arrange_data():
    s3_client = boto3.client('s3')

    with open('file_list.json', 'r') as f:
        d=json.load(f)

    for k, v in d.items():
        print(f"Downloading Images with {k} objects")
        directory=os.path.join('train_data', k)
        if not os.path.exists(directory):
            os.makedirs(directory)
        for file_path in tqdm(v):
            file_name=os.path.basename(file_path).split('.')[0]+'.jpg'
            s3_client.download_file('aft-vbi-pds', os.path.join('bin-images', file_name),
                             os.path.join(directory, file_name))

download_and_arrange_data()

Downloading Images with 1 objects


100%|██████████| 1228/1228 [01:36<00:00, 12.77it/s]


Downloading Images with 2 objects


100%|██████████| 2299/2299 [03:05<00:00, 12.38it/s]


Downloading Images with 3 objects


100%|██████████| 2666/2666 [03:33<00:00, 12.46it/s]


Downloading Images with 4 objects


100%|██████████| 2373/2373 [03:08<00:00, 12.56it/s]


Downloading Images with 5 objects


100%|██████████| 1875/1875 [02:25<00:00, 12.91it/s]
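
Once the download finishes, it can help to confirm that each class folder holds the expected number of files. The following is a minimal sketch, assuming the `train_data` layout described above; the helper name `count_images_per_class` is hypothetical and not part of the original notebook.

```python
import os

def count_images_per_class(root):
    """Return {class_folder_name: number_of_files} for each subfolder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):  # skip stray files such as notebook checkpoints
            counts[name] = len(os.listdir(class_dir))
    return counts

# Only runs when the download cell has already created the folder.
if os.path.isdir("train_data"):
    print(count_images_per_class("train_data"))
```

The counts should line up with the totals reported by the progress bars above (1228, 2299, 2666, 2373, and 1875 images), which also shows that the classes are not balanced.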


## Dataset

For this project, we are using the **Amazon Bin Image Dataset**, which contains images of bins from Amazon Fulfillment Centers. Each image shows a bin with one or more items inside, where items are placed randomly. This dataset allows us to build a machine learning model that can classify images based on the number of objects in each bin, which is essential for efficient inventory tracking and management in distribution centers.

### Structure of the Dataset
Each image in the dataset is associated with a metadata file containing details about the items in the bin. The key information for our project is the `EXPECTED_QUANTITY`, which indicates the number of objects present in each bin. This quantity ranges from **1 to 5 objects**, and we use it to create labeled data for training a classification model with five classes:
- **Class 1**: Images with 1 object in the bin
- **Class 2**: Images with 2 objects in the bin
- **Class 3**: Images with 3 objects in the bin
- **Class 4**: Images with 4 objects in the bin
- **Class 5**: Images with 5 objects in the bin
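
The mapping from metadata to class label can be sketched as follows. This is an illustration only: the helper `label_from_metadata` and the sample record are hypothetical, and the record is reduced to the single `EXPECTED_QUANTITY` field that this project relies on.

```python
def label_from_metadata(metadata, min_count=1, max_count=5):
    """Return the object count as the class label, or None if it is outside the range."""
    quantity = metadata.get("EXPECTED_QUANTITY")
    if isinstance(quantity, int) and min_count <= quantity <= max_count:
        return quantity
    return None

# Hypothetical metadata record, reduced to the one field used here.
sample = {"EXPECTED_QUANTITY": 3}
print(label_from_metadata(sample))
```

Records whose quantity falls outside 1 to 5 return `None`, so they can be skipped when building the labeled folders.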

### Data Preprocessing
For this project, the images have been organized into folders by the number of objects (1–5), based on the `EXPECTED_QUANTITY` field in the metadata. We then split the data into training, validation, and test sets to facilitate model evaluation and ensure generalization.

This dataset helps us train a classification model to determine the number of items in a bin from an image, enabling automated inventory counting at distribution centers.

More information about the dataset can be found [here](https://registry.opendata.aws/amazon-bin-imagery/).

The `split_data` function in the next cell splits the data into `train`, `validation`, and `test` sets based on the specified ratios (70/20/10 by default). The output directory `processed_data` will contain subdirectories for the training, validation, and test sets, each further organized by object count.

In [4]:
#TODO: Perform any data cleaning or data preprocessing
import shutil
from sklearn.model_selection import train_test_split

# Split data into train, validation, and test sets
def split_data(data_dir, output_dir, train_ratio=0.7, val_ratio=0.2, test_ratio=0.1):
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(os.path.join(output_dir, 'train'))
    os.makedirs(os.path.join(output_dir, 'validation'))
    os.makedirs(os.path.join(output_dir, 'test'))
    
    for object_count in os.listdir(data_dir):
        images = os.listdir(os.path.join(data_dir, object_count))
        train, temp = train_test_split(images, train_size=train_ratio)
        val, test = train_test_split(temp, train_size=val_ratio/(val_ratio + test_ratio))
        
        for subset, subset_images in zip(['train', 'validation', 'test'], [train, val, test]):
            subset_dir = os.path.join(output_dir, subset, object_count)
            os.makedirs(subset_dir, exist_ok=True)
            for image in subset_images:
                shutil.copy2(os.path.join(data_dir, object_count, image), subset_dir)
        print(f"Data split for category '{object_count}' complete.")

# Run data split
split_data('train_data', 'processed_data')


Data split for category '1' complete.
Data split for category '2' complete.
Data split for category '3' complete.
Data split for category '4' complete.
Data split for category '5' complete.
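
The split above is not seeded, so rerunning the cell produces a different partition each time. The following is a minimal sketch of a reproducible alternative using only the standard library. It is not the method used in this notebook, and the helper name `split_names` and the default seed are assumptions for illustration.

```python
import random

def split_names(names, train_ratio=0.7, val_ratio=0.2, seed=42):
    """Shuffle names with a fixed seed and cut them into train/val/test lists."""
    shuffled = sorted(names)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )

train, val, test = split_names([f"img_{i}.jpg" for i in range(100)])
print(len(train), len(val), len(test))
```

With `train_test_split`, a similar effect can be had by passing a fixed `random_state` to both calls.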


In [5]:
#TODO: Upload the data to AWS S3

import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
#import boto3
#import os

session = sagemaker.Session()

bucket=session.default_bucket()
print("Default Bucket: {}".format(bucket))

region = session.boto_region_name
print("AWS Region: {}".format(region))

role = get_execution_role() #sagemaker iam role
print("RoleArn: {}".format(role))

data = "s3://{}/{}/".format(bucket, "inventory-monitoring")
print(f"Uploading data to S3 at {data}")
output = "s3://{}/{}/".format(bucket, "output")
model_dir = "s3://{}/{}/".format(bucket, "model")
os.environ["DEFAULT_S3_BUCKET"] =bucket
os.environ['SM_CHANNEL_TRAIN']=data 
os.environ['SM_OUTPUT_DATA_DIR']=output
os.environ['SM_MODEL_DIR']=model_dir

Default Bucket: sagemaker-us-east-1-254792028129
AWS Region: us-east-1
RoleArn: arn:aws:iam::254792028129:role/service-role/AmazonSageMaker-ExecutionRole-20241116T170316
Uploading data to S3 at s3://sagemaker-us-east-1-254792028129/inventory-monitoring/


In [6]:
s3_data_path = session.upload_data(path='processed_data', bucket=bucket, key_prefix='inventory-monitoring/data')

In [7]:
output

's3://sagemaker-us-east-1-254792028129/output/'
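
The upload keeps the local folder layout, so the `train`, `validation`, and `test` subfolders sit under the uploaded prefix. The sketch below derives one URI per subset, assuming that layout. The helper `channel_uris` and the bucket name `example-bucket` are hypothetical, and the notebook itself passes a single `train` channel pointing at the whole prefix.

```python
def channel_uris(base_uri, subsets=("train", "validation", "test")):
    """Build one S3 URI per data subset under a common base prefix."""
    base = base_uri.rstrip("/")
    return {name: f"{base}/{name}" for name in subsets}

# Placeholder bucket name, for illustration only.
uris = channel_uris("s3://example-bucket/inventory-monitoring/data/")
for name, uri in uris.items():
    print(name, uri)
```

Whether separate channels are useful depends on how the training script reads its input, which this notebook does not show.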

## Model Training
**TODO:** This is the part where you can train a model. The type or architecture of the model you use is not important. 

**Note:** You will need to use the `train.py` script to train your model.

`train.py` trains a simple model, while `benchmark.py` trains our benchmark model, a ResNet50.


### Simple Model

In [11]:
#TODO: Declare your model training hyperparameter.
#NOTE: You do not need to do hyperparameter tuning. You can use fixed hyperparameter values
# Define hyperparameters for the training job
hyperparameters = {
    "batch-size": 64,
    "epochs": 5,
    "lr": 0.005
}
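
SageMaker passes these hyperparameters to the entry point as command-line arguments (for example `--batch-size 64`) and exposes the `train` channel and model directory through the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables. The contents of `train.py` are not shown in this notebook, so the following is only a sketch of how such a script might read them. The argument names `--data-dir` and `--model-dir` and the local fallback paths are assumptions.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point typically receives them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--lr", type=float, default=0.005)
    # Inside the training container these variables are set by SageMaker;
    # the fallbacks are hypothetical local paths for running outside it.
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "processed_data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)

args = parse_args(["--batch-size", "32", "--lr", "0.01"])
print(args.batch_size, args.epochs, args.lr)
```

`argparse` converts `--batch-size` into the attribute `batch_size`, which is why the dictionary keys above can use hyphens.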


In [12]:
#TODO: Create your training estimator
# Import the PyTorch estimator
from sagemaker.pytorch import PyTorch

# Create a PyTorch Estimator
estimator = PyTorch(
    entry_point="train.py",                 # Path to the training script
    role=role,                              # IAM role with permissions
    instance_count=1,                       # Number of instances
    instance_type="ml.p3.2xlarge", #"ml.m5.xlarge",           # Instance type for training
    framework_version="1.8",                # PyTorch version
    py_version="py3",                       # Python version
    hyperparameters=hyperparameters,        # Hyperparameters defined above
    output_path=output                      # Path for saving output artifacts
)


In [13]:
# TODO: Fit your estimator
# Start the training job
estimator.fit({"train": s3_data_path}, job_name="simple-train")

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: simple-train


2024-11-16 17:47:37 Starting - Starting the training job
2024-11-16 17:47:37 Pending - Training job waiting for capacity......
2024-11-16 17:48:21 Pending - Preparing the instances for training...
2024-11-16 17:49:00 Downloading - Downloading input data...
2024-11-16 17:49:30 Downloading - Downloading the training image........................
2024-11-16 17:53:18 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-11-16 17:53:37,579 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-11-16 17:53:37,610 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-11-16 17:53:37,613 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-11-16 17:53:37,881 sagemaker-training-toolkit INFO     Invoking user sc

### Benchmark Model

The benchmark model is trained using `benchmark.py`.

In [14]:
hyperparameters = {
    "batch-size": 64,
    "epochs": 5,
    "lr": 0.005
}

from sagemaker.pytorch import PyTorch

# Create a PyTorch Estimator
estimator = PyTorch(
    entry_point="benchmark.py",                 # Path to the training script
    role=role,                              # IAM role with permissions
    instance_count=1,                       # Number of instances
    instance_type="ml.p3.2xlarge", #"ml.m5.xlarge",           # Instance type for training
    framework_version="1.8",                # PyTorch version
    py_version="py3",                       # Python version
    hyperparameters=hyperparameters,        # Hyperparameters defined above
    output_path=output                      # Path for saving output artifacts
)

# Start the training job
estimator.fit({"train": s3_data_path},job_name="benchmark-train")

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: benchmark-train


2024-11-16 18:01:33 Starting - Starting the training job......
2024-11-16 18:02:28 Pending - Training job waiting for capacity......
2024-11-16 18:03:25 Pending - Preparing the instances for training...
2024-11-16 18:03:58 Downloading - Downloading input data...
2024-11-16 18:04:33 Downloading - Downloading the training image...........................
2024-11-16 18:08:41 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-11-16 18:08:58,284 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-11-16 18:08:58,318 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-11-16 18:08:58,321 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-11-16 18:08:58,631 sagemaker-training-toolkit INFO     Invokin

## Standout Suggestions
You do not need to perform the tasks below to finish your project. However, you can attempt these tasks to turn your project into a more advanced portfolio piece.

### Hyperparameter Tuning
**TODO:** Here you can perform hyperparameter tuning to increase the performance of your model. You are encouraged to 
- tune as many hyperparameters as you can to get the best performance from your model
- explain why you chose to tune those particular hyperparameters and the ranges.


In [41]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [27]:
#TODO: Create your hyperparameter search space
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.005, 0.05), #CategoricalParameter,
    "batch-size": CategoricalParameter([32, 64]),
    "epochs": IntegerParameter(5, 7)
}

objective_metric_name = "Test Loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "Test Loss", "Regex": "Testing Loss: ([0-9\\.]+)"}]
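
The tuner reads the objective metric by applying the regex in `metric_definitions` to the training logs, so the training script has to print a line that matches it. The following sketch checks the pattern locally. The log lines are hypothetical examples, since the output format of `hpo.py` is not shown here.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
pattern = re.compile(r"Testing Loss: ([0-9\.]+)")

def extract_metric(log_line):
    """Return the captured loss as a float, or None if the line does not match."""
    match = pattern.search(log_line)
    return float(match.group(1)) if match else None

print(extract_metric("Testing Loss: 1.4821"))            # hypothetical matching line
print(extract_metric("Test set: Average loss: 1.4821"))  # does not match the pattern
```

If the script prints the loss under a different label, the tuning job has no objective value to optimize, so it is worth testing the pattern against a real log line before launching.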

In [46]:
#TODO: Create your training estimator


estimator = PyTorch(
    entry_point="hpo.py",
    role=get_execution_role(),
    py_version='py36',
    framework_version="1.8",
    instance_count=1,
    instance_type='ml.p3.2xlarge'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=2,
    max_parallel_jobs=1,# due to AWS limit, I used 1 instead of 2 or more.
    objective_type=objective_type,
    early_stopping_type="Auto"
)

In [None]:
# TODO: Fit your estimator
tuner.fit({"train": s3_data_path}, job_name="hpo3-train")

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: hpo3-train


..................................................................................................................................................................................................................................................................................................!


In [48]:
# TODO: Find the best hyperparameters

best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best-trained model
best_estimator.hyperparameters()


2024-11-16 20:40:22 Starting - Found matching resource for reuse
2024-11-16 20:40:22 Downloading - Downloading the training image
2024-11-16 20:40:22 Training - Training image download completed. Training in progress.
2024-11-16 20:40:22 Uploading - Uploading generated training model
2024-11-16 20:40:22 Completed - Resource retained for reuse


{'_tuning_objective_metric': '"Test Loss"',
 'batch-size': '"64"',
 'epochs': '5',
 'lr': '0.010603903353801389',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"hpo3-train"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-254792028129/hpo3-train/source/sourcedir.tar.gz"'}

In [49]:
print('batch size '+best_estimator.hyperparameters()['batch-size'])
print('epochs '+best_estimator.hyperparameters()['epochs'])
print('learning rate '+best_estimator.hyperparameters()['lr'])

batch size "64"
epochs 5
learning rate 0.010603903353801389
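
As the output shows, the values come back as strings, and some (such as `batch-size`) carry an extra pair of quotes. The following is a minimal sketch of one way to turn them into Python values before reuse. The helper name `decode_hyperparameter` is hypothetical, and the sample dictionary copies values from the output above.

```python
import json

def decode_hyperparameter(value):
    """Repeatedly JSON-decode a string until it is no longer valid JSON or no longer a string."""
    decoded = value
    while isinstance(decoded, str):
        try:
            decoded = json.loads(decoded)
        except json.JSONDecodeError:
            break  # plain text such as PyTorch stays a string
    return decoded

raw = {"batch-size": '"64"', "epochs": "5", "lr": "0.010603903353801389"}
clean = {key: decode_hyperparameter(value) for key, value in raw.items()}
print(clean)
```

This avoids hand-written `.replace('"', '')` calls and yields an `int` for `batch-size` and `epochs` and a `float` for `lr`.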


### Model Profiling and Debugging
**TODO:** Use model debugging and profiling to better monitor and debug your model training job.

In [51]:
'''
# TODO: Set up debugging and profiling rules and hooks
batch_size=int(best_estimator.hyperparameters()['batch-size'].replace('"',''))
epochs=best_estimator.hyperparameters()['epochs']
learn_r=best_estimator.hyperparameters()['lr'].replace('"','')

from sagemaker.debugger import (
    Rule,
    DebuggerHookConfig,
    rule_configs,
)
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

hook_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)


profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

#"num_gpu":True

hyperparameters = {"epochs": epochs, "batch-size": batch_size, "test-batch-size": "100", "lr": learn_r}
'''

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [56]:
# TODO: Create and fit an estimator

'''
estimator =PyTorch(
    entry_point="train_profile_Debug.py",
    #base_job_name="sagemaker-dog-breed-pytorch",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    hyperparameters = hyperparameters,
    framework_version = "1.8",
    py_version="py36",
    profiler_config=profiler_config,
    debugger_hook_config=hook_config,
    rules=rules
    
)   

estimator.fit({'train': data}, job_name="Profile-Debug-train2")
'''


In [None]:
# TODO: Plot a debugging output.


**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output

### Model Deploying and Querying
**TODO:** Can you deploy your model to an endpoint and then query that endpoint to get a result?
We will deploy three models:

* simple-train: the simple model
* benchmark-train: the ResNet50 benchmark model
* hpo3-train: the best model from the hyperparameter tuning job



In [57]:
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor


jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()


class ImagePredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImagePredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer,
        )

In [74]:
# TODO: Deploy your model to an endpoint

def deploy_model(endpoint_name, model_tar_path):
    # Define the model
    pytorch_model = PyTorchModel(
        model_data= model_tar_path,
        role=role,
        sagemaker_session=session,
        entry_point="inference.py",  # Path to the inference script if needed
        framework_version="1.8",
        py_version="py36",
        predictor_cls=ImagePredictor
    )
    
    # Deploy the model to an endpoint
    predictor = pytorch_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",#"ml.m5.xlarge",
        endpoint_name=endpoint_name
    )
    
    print(f"Model deployed to endpoint: {predictor.endpoint_name}")

In [75]:
# Simple Model Path
trained_model_s3_path = "s3://sagemaker-us-east-1-254792028129/output/"
model_tar_path = f"{trained_model_s3_path}simple-train/output/model.tar.gz"
endpoint_name="endpoint-simple"
deploy_model(endpoint_name, model_tar_path)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-254792028129/output/simple-train/output/model.tar.gz), script artifact (None), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-254792028129/pytorch-inference-2024-11-16-23-52-39-726/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2024-11-16-23-52-50-231
INFO:sagemaker:Creating endpoint-config with name endpoint-simple
INFO:sagemaker:Creating endpoint with name endpoint-simple


----------!Model deployed to endpoint: endpoint-simple


In [63]:
# Benchmark Model Path
trained_model_s3_path = "s3://sagemaker-us-east-1-254792028129/output/"
model_tar_path = f"{trained_model_s3_path}benchmark-train/output/model.tar.gz"
endpoint_name="endpoint-resnet"
deploy_model(endpoint_name, model_tar_path)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-254792028129/output/benchmark-train/output/model.tar.gz), script artifact (None), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-254792028129/pytorch-inference-2024-11-16-22-27-51-771/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2024-11-16-22-28-01-094
INFO:sagemaker:Creating endpoint-config with name endpoint-resnet
INFO:sagemaker:Creating endpoint with name endpoint-resnet


------!Model deployed to endpoint: endpoint-resnet


In [64]:
# Tuned Model Path
trained_model_s3_path = "s3://sagemaker-us-east-1-254792028129/" #"s3://sagemaker-us-east-1-254792028129/output/"
model_tar_path = f"{trained_model_s3_path}hpo3-train-002-63ec1fb5/output/model.tar.gz"
endpoint_name="endpoint-hpo"
deploy_model(endpoint_name, model_tar_path)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-254792028129/hpo3-train-002-63ec1fb5/output/model.tar.gz), script artifact (None), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-254792028129/pytorch-inference-2024-11-16-22-35-20-732/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2024-11-16-22-35-39-994
INFO:sagemaker:Creating endpoint-config with name endpoint-hpo
INFO:sagemaker:Creating endpoint with name endpoint-hpo


------!Model deployed to endpoint: endpoint-hpo


In [65]:
# TODO: Run a prediction on the endpoint

import numpy as np
import io
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

def evaluate_endpoint(endpoint_name, test_data_dir, valid_classes, predictor):
    """
    Evaluates an endpoint using the provided test dataset.
    
    Args:
        endpoint_name (str): The name of the SageMaker endpoint to evaluate.
        test_data_dir (str): Path to the test dataset directory.
        valid_classes (list): List of valid class labels.
        predictor: SageMaker predictor object for making predictions.

    Returns:
        dict: Dictionary containing accuracy, confusion matrix, and classification report.
    """
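    # Preprocessing pipeline for local testing.
    # Normalization is omitted on purpose: each tensor is converted back to a PIL
    # image and re-encoded as JPEG below, and ToPILImage expects float values in
    # [0, 1]. Any normalization is assumed to happen in the endpoint's inference script.
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])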

    # Load the test dataset
    test_dataset = ImageFolder(root=test_data_dir, transform=transform)
    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    # Initialize lists to store true labels and predictions
    true_labels = []
    predictions = []

    # Evaluate predictions
    for inputs, labels in test_loader:
        # Convert the image to raw bytes
        raw_image = transforms.ToPILImage()(inputs.squeeze(0))  # Convert tensor to PIL Image
        with io.BytesIO() as buffer:
            raw_image.save(buffer, format="JPEG")  # Save as JPEG to bytes buffer
            payload = buffer.getvalue()  # Get raw bytes

        # Make predictions via the endpoint
        pred = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
        print(f"Raw prediction response for endpoint {endpoint_name}: {pred}")  # Debugging raw predictions
        pred_class = int(np.argmax(np.array(pred))) + 1  # Adjust the class index to match 1–5 range

        # Append true label and prediction
        true_labels.append(labels.item() + 1)  # Adjust from 0–4 to 1–5
        predictions.append(pred_class)

    # Validate class range
    print(f"Unique true labels for endpoint {endpoint_name}: {set(true_labels)}")
    print(f"Unique predicted labels for endpoint {endpoint_name}: {set(predictions)}")

    # Calculate metrics
    accuracy = accuracy_score(true_labels, predictions)
    conf_matrix = confusion_matrix(true_labels, predictions, labels=list(valid_classes))
    classification_report_str = classification_report(
        true_labels,
        predictions,
        labels=list(valid_classes),  # Explicitly specify valid classes
        target_names=test_dataset.classes
    )

    # Print the results
    print(f"Accuracy for endpoint {endpoint_name}: {accuracy * 100:.2f}%")
    print("\nConfusion Matrix:\n", conf_matrix)
    print("\nClassification Report:\n", classification_report_str)

    # Return the results
    return {
        "accuracy": accuracy,
        "confusion_matrix": conf_matrix,
        "classification_report": classification_report_str
    }
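
The step that turns the endpoint response into a class label can be sketched without NumPy. This assumes the endpoint returns a list of five scores, possibly wrapped in a batch of one. The actual response format depends on `inference.py`, which is not shown, and the helper name `scores_to_label` is hypothetical.

```python
def scores_to_label(response, first_label=1):
    """Map a list of class scores (optionally nested as a batch of one) to a 1-based label."""
    scores = response
    # Unwrap [[s1, ..., s5]] into [s1, ..., s5].
    while isinstance(scores, list) and len(scores) == 1 and isinstance(scores[0], list):
        scores = scores[0]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + first_label

# Hypothetical response for one image.
print(scores_to_label([[0.1, 2.0, -1.0, 0.3, 0.0]]))
```

The offset mirrors the `+ 1` used in `evaluate_endpoint`, where index 0 corresponds to the folder named `1`.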


In [69]:
from sagemaker.predictor import Predictor

# Replace these with the actual endpoint names and valid classes
#endpoints = ["endpoint-simple","endpoint-resnet", "endpoint-hpo"]
test_data_dir = "processed_data/test"
valid_classes = [1, 2, 3, 4, 5]

'''
for endpoint_name in endpoints:
    print(f"Evaluating {endpoint_name}...")
    predictor = Predictor(endpoint_name)
    results = evaluate_endpoint(endpoint_name, test_data_dir, valid_classes, predictor)
    print(f"Results for {endpoint_name}: {results}")
''' 
endpoint_name= "endpoint-simple"
print(f"Evaluating {endpoint_name}...")
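# Use ImagePredictor so the JPEG serializer and JSON deserializer defined earlier are applied
predictor = ImagePredictor(endpoint_name, session)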
results = evaluate_endpoint(endpoint_name, test_data_dir, valid_classes, predictor)
print(f"Results for {endpoint_name}: {results}")

Evaluating endpoint-simple...


ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/endpoint-simple in account 254792028129 for more information.

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
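
One way to clean up is sketched below. It assumes a boto3 SageMaker client is passed in and that each endpoint config shares its endpoint's name, as the deployment logs above indicate. The helper name `delete_endpoints` is hypothetical.

```python
def delete_endpoints(sm_client, endpoint_names):
    """Delete each endpoint and its same-named endpoint config; return the names processed."""
    deleted = []
    for name in endpoint_names:
        sm_client.delete_endpoint(EndpointName=name)
        sm_client.delete_endpoint_config(EndpointConfigName=name)
        deleted.append(name)
    return deleted

# Usage (requires AWS credentials, so it is left commented out here):
# import boto3
# delete_endpoints(boto3.client("sagemaker"),
#                  ["endpoint-simple", "endpoint-resnet", "endpoint-hpo"])
```

Endpoints are billed for as long as they are running, so it is worth deleting them as soon as evaluation is done.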


### Cheaper Training and Cost Analysis
**TODO:** Can you perform a cost analysis of your system and then use spot instances to lessen your model training cost?

In [None]:
# TODO: Cost Analysis
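
A rough cost estimate multiplies billable time by the hourly instance price. The sketch below shows that arithmetic together with the usual way of expressing spot savings as the share of training time that was not billed. The helper names and all numbers, including the hourly rate, are hypothetical. Real prices should be taken from the current SageMaker pricing page.

```python
def training_cost_usd(billable_seconds, hourly_rate_usd, instance_count=1):
    """Cost = billable hours x hourly rate x number of instances."""
    return billable_seconds / 3600 * hourly_rate_usd * instance_count

def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed."""
    return (1 - billable_seconds / training_seconds) * 100

# Hypothetical numbers for illustration only.
print(training_cost_usd(1800, hourly_rate_usd=4.0))
print(spot_savings_percent(training_seconds=1800, billable_seconds=600))
```

The same function can be applied to each training, tuning, and endpoint-hosting step to compare where the budget goes.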

In [None]:
# TODO: Train your model using a spot instance
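
Managed spot training is enabled through estimator arguments: `use_spot_instances`, `max_run`, `max_wait` (which must be at least `max_run`), and optionally `checkpoint_s3_uri` so an interrupted job can resume. The following is a sketch of assembling those arguments. The helper name, the time limits, and the checkpoint location are assumptions for illustration.

```python
def spot_training_kwargs(max_run_seconds, extra_wait_seconds, checkpoint_s3_uri):
    """Build estimator keyword arguments for managed spot training."""
    if extra_wait_seconds < 0:
        raise ValueError("max_wait must not be smaller than max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_run_seconds + extra_wait_seconds,  # includes time spent waiting for capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }

# Placeholder bucket and limits, for illustration only.
spot_kwargs = spot_training_kwargs(3600, 1800, "s3://example-bucket/checkpoints/")
print(spot_kwargs)

# Usage (requires AWS credentials, so it is left commented out here):
# estimator = PyTorch(entry_point="train.py", role=role, instance_count=1,
#                     instance_type="ml.p3.2xlarge", framework_version="1.8",
#                     py_version="py3", hyperparameters=hyperparameters,
#                     output_path=output, **spot_kwargs)
```

Checkpointing only helps if the training script saves and reloads its state, which would need to be added to `train.py`.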

### Multi-Instance Training
**TODO:** Can you train your model on multiple instances?

In [None]:
# TODO: Train your model on Multiple Instances
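
Raising `instance_count` above 1 starts several training containers, and each one can find its position in the cluster from the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables. The sketch below derives a rank from them and shows a simple way to give each host a different slice of the file list. The helper names are hypothetical, and whether `train.py` supports distributed training is not shown in this notebook.

```python
import json
import os

def host_rank(env=None):
    """Return (rank, world_size) based on SageMaker's host environment variables."""
    env = os.environ if env is None else env
    hosts = sorted(json.loads(env.get("SM_HOSTS", '["algo-1"]')))
    current = env.get("SM_CURRENT_HOST", hosts[0])
    return hosts.index(current), len(hosts)

def shard(items, rank, world_size):
    """Give each host every world_size-th item, starting at its own rank."""
    return items[rank::world_size]

# Simulated environment of the second host in a two-instance job.
rank, world_size = host_rank({"SM_HOSTS": '["algo-1", "algo-2"]', "SM_CURRENT_HOST": "algo-2"})
print(rank, world_size, shard(list(range(5)), rank, world_size))
```

Without such coordination in the training script, each instance would simply train its own copy of the model on the full dataset.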