# TODO: Title
**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook has a number of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubrics. Feel free to change the order of the TODOs and/or use more than one cell to complete all the tasks.

In [None]:
! pip install tqdm smdebug

In [None]:
import os
import random
import io
from tqdm import tqdm

# Data
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import IPython
from PIL import Image

# AWS
import sagemaker
import boto3

## SM
from sagemaker.session import Session
from sagemaker import get_execution_role
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.debugger import (
    Rule,
    rule_configs,
    ProfilerRule,
    DebuggerHookConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile,
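)

# Session, role, and bucket used by the cells below.
# Assumption: the session's default bucket holds the data and the outputs.
sagemaker_session = Session()
session = sagemaker_session  # later cells refer to the session by both names
role = get_execution_role()
bucket = sagemaker_session.default_bucket()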

## Data Preparation
**TODO:** Run the cell below to download the data.

The cell below creates a folder called `train_data`, downloads the training data, and arranges it in subfolders. Each subfolder contains images in which the number of objects equals the folder name; for instance, every image in folder `1` contains exactly 1 object. The images are not divided into training, testing, or validation sets. If you feel the number of samples is not enough, you can always download more data (instructions can be found [here](https://registry.opendata.aws/amazon-bin-imagery/)). However, we are not assessing you on the accuracy of your final trained model, but on how you create your machine learning engineering pipeline.
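Because the downloaded images are not split into sets, one option is to split each class folder separately so that every set keeps roughly the same class balance. The sketch below is one possible approach, not part of the starter code: the function name `split_by_class`, the 80/10/10 ratios, and the seed are all illustrative assumptions.

```python
import random


def split_by_class(files_by_class, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split each class's file list into train/valid/test lists.

    files_by_class maps a class label (the folder name) to a list of paths.
    The ratios and seed are illustrative defaults, not project requirements.
    """
    rng = random.Random(seed)
    splits = {"train": {}, "valid": {}, "test": {}}
    for label, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * ratios[0])
        n_valid = int(len(shuffled) * ratios[1])
        splits["train"][label] = shuffled[:n_train]
        splits["valid"][label] = shuffled[n_train:n_train + n_valid]
        # Whatever remains goes to the test set, so no file is dropped.
        splits["test"][label] = shuffled[n_train + n_valid:]
    return splits
```

The returned lists can then be copied into `train`, `valid`, and `test` folders before uploading to S3.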

In [None]:
import os
import json
import boto3


def download_and_arrange_data():
    s3_client = boto3.client("s3")

    with open("file_list.json", "r") as f:
        d = json.load(f)

    for k, v in d.items():
        print(f"Downloading images with {k} objects")
        directory = os.path.join("train_data", k)
        os.makedirs(directory, exist_ok=True)
        for file_path in tqdm(v):
            file_name = os.path.basename(file_path).split(".")[0] + ".jpg"
            s3_client.download_file(
                "aft-vbi-pds",
                f"bin-images/{file_name}",  # S3 keys always use forward slashes
                os.path.join(directory, file_name),
            )


download_and_arrange_data()
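
The download loop derives each image's S3 key from an entry in `file_list.json` by keeping only the file stem and swapping the extension for `.jpg`. The helper below isolates that step as a minimal sketch. The name `metadata_path_to_image_key` and the sample entry `data/metadata/100313.json` are hypothetical and only illustrate the assumed shape of the entries.

```python
import posixpath


def metadata_path_to_image_key(file_path, prefix="bin-images"):
    """Map a listed path to the S3 key of the matching image.

    Only the stem of the file name is kept; the extension becomes ".jpg".
    posixpath is used because S3 keys are always separated by "/".
    """
    stem = posixpath.basename(file_path).split(".")[0]
    return f"{prefix}/{stem}.jpg"


# Hypothetical entry, assumed to resemble what file_list.json contains.
key = metadata_path_to_image_key("data/metadata/100313.json")
```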

## Dataset
**TODO:** Explain what dataset you are using for this project. Give a short overview of the classes, the class distribution, and anything else that can help someone unfamiliar with the dataset understand it. You can find more information about the data [here](https://registry.opendata.aws/amazon-bin-imagery/).
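A quick way to describe the class distribution is to count how many files each class has and what share of the total that is. The sketch below assumes the same mapping that `file_list.json` provides (class label to list of file paths); the function name `class_distribution` is a hypothetical helper, not part of the starter code.

```python
def class_distribution(files_by_class):
    """Return {label: (count, fraction_of_total)} sorted by label."""
    counts = {label: len(files) for label, files in files_by_class.items()}
    total = sum(counts.values())
    return {
        label: (n, n / total if total else 0.0)
        for label, n in sorted(counts.items())
    }


# Tiny made-up example; in the notebook the input would come from
# json.load() on file_list.json.
example = class_distribution({"1": ["a", "b"], "2": ["c", "d", "e", "f", "g", "h"]})
```

A strongly skewed distribution is worth mentioning in the write-up, since it affects how accuracy should be interpreted.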

In [None]:
sagemaker_session.upload_data(path="cap", bucket=bucket, key_prefix="cap")

## Model Training
**TODO:** This is the part where you can train a model. The type or architecture of the model you use is not important. 

**Note:** You will need to use the `train.py` script to train your model.

In [None]:
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

hyperparameters = {"epochs": 10, "batch_size": 32, "learning_rate": 0.001}

# Create training estimator
image_uri = sagemaker.image_uris.retrieve(
    "pytorch",
    session.boto_region_name,
    version="1.10",
    py_version="py38",
    instance_type="ml.g4dn.xlarge",
    image_scope="training",  # PyTorch has separate training and inference images
)
estimator = Estimator(
    image_uri=image_uri,
    entry_point="train.py",  # training script named in the note above
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters=hyperparameters,
    output_path=f"s3://{bucket}/model_output/",
    sagemaker_session=session,
)

estimator.fit({"train": f"s3://{bucket}/train_data"}, wait=True)

## Standout Suggestions
You do not need to perform the tasks below to finish your project. However, you can attempt these tasks to turn your project into a more advanced portfolio piece.

### Hyperparameter Tuning
**TODO:** Here you can perform hyperparameter tuning to increase the performance of your model. You are encouraged to:
- tune as many hyperparameters as you can to get the best performance from your model
- explain why you chose to tune those particular hyperparameters and the ranges.


In [None]:
os.environ["SM_CHANNEL_TRAIN"] = f"s3://{bucket}/cap/"
os.environ["SM_MODEL_DIR"] = f"s3://{bucket}/model/"

In [None]:
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([16, 32, 64]),
    "epochs": IntegerParameter(2, 6),
}

objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [
    {"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}
]
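
The tuner can only optimize the objective if the `Regex` above matches a line that the training script actually prints. The sketch below applies the same pattern to sample log lines so the match can be checked locally before launching jobs. The helper name `extract_metric` and the sample log lines are assumptions for illustration; the real log format depends on what `hpo.py` prints.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
METRIC_REGEX = r"Test set: Average loss: ([0-9\.]+)"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured loss as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    if match is None:
        return None
    # Note: the character class also accepts stray dots, so a line that ends
    # the number with a period would need a stricter pattern.
    return float(match.group(1))


# Hypothetical log lines, assumed to resemble the script's output.
matched = extract_metric("Test set: Average loss: 1.4321, Accuracy: 30/100")
unmatched = extract_metric("Train Epoch: 1 Loss: 1.6012")
```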

In [None]:
estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name="cap-class",
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    py_version="py311",
    framework_version="2.5.1",
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=2,
    max_parallel_jobs=1,
    objective_type=objective_type,
)

In [None]:
tuner.fit({"train": os.environ["SM_CHANNEL_TRAIN"]}, wait=True)

In [None]:
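best_estimator = tuner.best_estimator()
best_hyp = best_estimator.hyperparameters()

# Values come back as strings, and categorical ones can carry extra JSON
# quotes (for example '"32"'), so strip the quotes before reusing them.
best_train_hyp = {}
for key in hyperparameter_ranges:
    best_train_hyp[key] = best_hyp[key].replace('"', "")
best_train_hyp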

### Model Profiling and Debugging
**TODO:** Use model debugging and profiling to better monitor and debug your model training job.

In [None]:
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)

debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "1", "eval.save_interval": "10"}
)

In [None]:
estimator = PyTorch(
    entry_point="train_model.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    py_version="py311",
    framework_version="2.5.1",
    hyperparameters=best_train_hyp,
    profiler_config=profiler_config,
    debugger_hook_config=debugger_config,
    rules=rules,
)

In [None]:
estimator.fit({"train": os.environ["SM_CHANNEL_TRAIN"]}, wait=True)

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?
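
As an example of what an anomaly could look like, a vanishing gradient shows up as gradient magnitudes that collapse toward zero at some training steps. The sketch below illustrates that idea in plain Python. It is not how the built-in SageMaker rule is implemented, and the function name, input layout, and threshold value are all assumptions chosen for illustration.

```python
def find_vanishing_steps(grad_values_by_step, threshold=1e-7):
    """Return the steps whose mean absolute gradient is below threshold.

    grad_values_by_step maps a step number to a list of gradient values
    (for instance, values collected from a saved gradient tensor).
    """
    flagged = []
    for step, values in sorted(grad_values_by_step.items()):
        if not values:
            continue  # nothing recorded for this step
        mean_abs = sum(abs(v) for v in values) / len(values)
        if mean_abs < threshold:
            flagged.append(step)
    return flagged


# Made-up values: step 1 has gradients that are effectively zero.
flagged_steps = find_vanishing_steps({0: [0.1, -0.2], 1: [1e-9, -2e-9], 2: []})
```

Typical remedies to discuss for this kind of finding include a different weight initialization, a different activation function, or a lower network depth.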

In [None]:
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

print(f"Job name: {job_name}")
print(f"Client: {client}")
print(f"Description: {description}")

# Rule outputs (including the profiler report) are stored under the job's output path.
rule_output_path = f"{estimator.output_path}{job_name}/rule-output"
print(f"Profiler report location: {rule_output_path}")

! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

In [None]:
from smdebug.trials import create_trial

debug_artifacts_path = "s3://sagemaker-us-east-1-934421875319/pytorch-training-2025-03-13-20-52-24-429/debug-output/"

trial = create_trial(debug_artifacts_path)

profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

print(profiler_report_name)
IPython.display.HTML(
    filename=profiler_report_name + "/profiler-output/profiler-report.html"
)

### Model Deploying and Querying
**TODO:** Can you deploy your model to an endpoint and then query that endpoint to get a result?

In [None]:
model_location = estimator.model_data
model_location

In [None]:
jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()


class ImgPredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImgPredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer,
        )


pytorch_model = PyTorchModel(
    model_data=model_location,
    role=role,
    entry_point="cap_endpoint.py",
    py_version="py311",
    framework_version="2.5.1",
    predictor_cls=ImgPredictor,
)

predictor = pytorch_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.2xlarge"
)

In [None]:
test_image = "nd009t-capstone-starter/starter/train_data/2/00113.jpg"

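with open(test_image, "rb") as f:
    payload = f.read()

display(Image.open(io.BytesIO(payload)))
response = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
prediction = np.argmax(response, 1) + 1

# The parent folder name is the true object count (see Data Preparation).
true_class = os.path.basename(os.path.dirname(test_image))
print(f"Class: {true_class}")
print(f"Prediction: {prediction}")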
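
The `+ 1` in the prediction step maps the index of the highest score back to an object count. The sketch below shows the same mapping without NumPy, assuming the endpoint returns one score per class and that class indices follow the folder names starting at `1`. Both the helper name and the sample scores are hypothetical.

```python
def scores_to_object_count(scores, first_class=1):
    """Return the class label (object count) with the highest score.

    Index 0 corresponds to first_class, index 1 to first_class + 1, etc.
    """
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + first_class


# Made-up scores for five classes; the second entry is the largest.
predicted_count = scores_to_object_count([0.1, 2.0, -1.0, 0.3, 0.0])
```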

### Cheaper Training and Cost Analysis
**TODO:** Can you perform a cost analysis of your system and then use spot instances to lessen your model training cost?
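For the cost analysis, the training job description reports both `TrainingTimeInSeconds` and `BillableTimeInSeconds`, and the spot saving is the share of training time that was not billed. The sketch below turns those two numbers into a saving percentage and an estimated cost. The function names and the hourly rate are placeholders, not actual prices; look up the current rate for your instance type and region.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed."""
    if training_seconds <= 0:
        return 0.0
    return (1 - billable_seconds / training_seconds) * 100


def training_cost_usd(billable_seconds, hourly_rate_usd, instance_count=1):
    """Estimated cost of a job from its billable time and an hourly rate."""
    return billable_seconds / 3600 * hourly_rate_usd * instance_count


# Made-up numbers: 1000 s of training, 300 s billed, placeholder rate of 1 USD/h.
savings = spot_savings_percent(1000, 300)
cost = training_cost_usd(300, hourly_rate_usd=1.0)
```

In the notebook, the two time values could be read from the dictionary returned by `describe_training_job` once the job has finished.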

In [None]:
spot_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    use_spot_instances=True,
    max_wait=3600,
    max_run=1800,
    sagemaker_session=session,
)
spot_estimator.fit({"train": f"s3://{bucket}/train_data"})

### Multi-Instance Training
**TODO:** Can you train your model on multiple instances?
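When training on several instances, each instance ideally works on a different part of the data rather than on a full copy. The sketch below shows the idea with a simple round-robin split of a file list. It only illustrates the concept; SageMaker's own sharded input distribution may assign files differently, and the function name `shard_files` is hypothetical.

```python
def shard_files(files, num_instances):
    """Assign files to instances in round-robin order.

    Returns a list with one file list per instance; sizes differ by at most 1.
    """
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    shards = [[] for _ in range(num_instances)]
    for index, path in enumerate(sorted(files)):
        shards[index % num_instances].append(path)
    return shards


# Ten made-up file names spread over four instances.
shards = shard_files([f"img_{i:02d}.jpg" for i in range(10)], num_instances=4)
```

Note that the training script also has to be written for distributed training; splitting the data alone does not synchronize the model across instances.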

In [None]:
# Same configuration as the spot estimator above, but with four instances.
multi_instance_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=4,
    instance_type="ml.g4dn.xlarge",
    use_spot_instances=True,
    max_wait=3600,
    max_run=1800,
    sagemaker_session=session,
)
multi_instance_estimator.fit({"train": f"s3://{bucket}/train_data"})