# Dog Image Classification using AWS SageMaker

This notebook lists all the steps that you need to complete this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.
In this project, we leverage AWS SageMaker to train and deploy an image classification model using transfer learning. Our goal is to classify dog breeds from images using a pretrained convolutional neural network (CNN). 

The notebook demonstrates a full machine learning lifecycle:  
- Data download and upload to S3  
- Model training using transfer learning  
- Hyperparameter tuning using SageMaker’s built-in tools  
- Debugging and profiling  
- Model deployment and inference using a hosted endpoint  

This serves as a real-world simulation of how ML Engineers build and deploy scalable image classification models using AWS infrastructure.

**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.


**Note:** This notebook has a number of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubrics. Feel free to change the order of the TODOs and to use more than one TODO code cell for your tasks.

In [12]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install -r requirements.txt



In [13]:
%pip install "protobuf==3.20.*"

Note: you may need to restart the kernel to use updated packages.


In [14]:
!pip install --upgrade sagemaker



In [15]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import os

import boto3
import sagemaker
import sagemaker.debugger as smd
from sagemaker import get_execution_role
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    ProfilerConfig,
    ProfilerRule,
    Rule,
    rule_configs,
)
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.

In [10]:
#TODO: Fetch and upload the data to AWS S3

# Command to download and unzip data
!wget --no-check-certificate https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip


--2025-05-04 04:21:58--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.117.160, 16.15.4.34, 52.219.193.32, ...
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.117.160|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip.1’



Cannot write to ‘dogImages.zip.1’ (Success).
Archive:  dogImages.zip
replace dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [16]:
# Set up SageMaker session and custom S3 bucket
sagemaker_session = sagemaker.Session()
role = get_execution_role()
# Specify the S3 bucket and the location of your zip file
bucket = 'dog-image-classification-project'  # Your S3 bucket
prefix = 'dog-images'  # Optional folder name or prefix
s3_data_uri = f's3://{bucket}/dogImages.zip'  # S3 URI for the zip file

print(f"SageMaker Role: {role}")
print(f"S3 Bucket: {bucket}")

SageMaker Role: arn:aws:iam::771499623809:role/service-role/AmazonSageMaker-ExecutionRole-20250501T170477
S3 Bucket: dog-image-classification-project
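
The download cell above fetches and unzips the data, and the tuning job later reads from `s3://dog-image-classification-project/dog-images/`. A minimal sketch of the upload step in between, assuming the archive unpacked into a local `dogImages/` folder. The helper name `upload_directory` is hypothetical, and the S3 client is passed in as an argument so the function can be exercised without AWS access:

```python
import os


def upload_directory(s3_client, local_dir, bucket, prefix):
    """Upload every file under local_dir to s3://bucket/prefix/, keeping subfolders.

    s3_client is expected to expose boto3's upload_file(Filename, Bucket, Key).
    Returns the sorted list of S3 keys that were written.
    """
    uploaded_keys = []
    for root, _dirs, files in os.walk(local_dir):
        for file_name in sorted(files):
            local_path = os.path.join(root, file_name)
            # Path relative to local_dir, converted to forward slashes for S3
            relative = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = f"{prefix.strip('/')}/{relative}"
            s3_client.upload_file(local_path, bucket, key)
            uploaded_keys.append(key)
    return sorted(uploaded_keys)
```

With the variables defined above, this would be called as `upload_directory(boto3.client('s3'), 'dogImages', bucket, prefix)`. The SageMaker SDK offers the same operation in one call through `sagemaker_session.upload_data(path='dogImages', bucket=bucket, key_prefix=prefix)`.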


## Hyperparameter Tuning
**TODO:** This is the part where you will finetune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

We used SageMaker’s built-in `HyperparameterTuner` to search for the best combination of parameters. The search covers two hyperparameters:

- `learning_rate`: Controls the size of each update step, and therefore how quickly and how stably the model converges.
- `batch_size`: Balances memory usage and training stability.

`epochs` determines how long the model trains, but it is not part of the search ranges declared below.

The tuner minimizes the loss that the training script logs as `Testing Loss: <value>`, exposed under the metric name `Test Loss`. A regex filter extracts the value from the logs. The tuning job below is configured to run a maximum of 4 training jobs, one at a time.
**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.
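
SageMaker applies each `Regex` in `metric_definitions` to the training job's log stream and records the captured group as the metric value. A small local check of the pattern can catch a mismatch before a tuning job is launched. The sample log line here is hypothetical; the real line is whatever `hpo.py` prints:

```python
import re

# Same pattern as the "Test Loss" metric definition in the next cell
TEST_LOSS_PATTERN = re.compile(r"Testing Loss: ([0-9\.]+)")


def extract_metric(pattern, log_line):
    """Return the first captured group as a float, or None if the line does not match."""
    match = pattern.search(log_line)
    return float(match.group(1)) if match else None


sample_line = "Testing Loss: 0.8421"  # hypothetical line, not taken from a real job
print(extract_metric(TEST_LOSS_PATTERN, sample_line))
# The pattern is case-sensitive, so a differently cased line yields no metric
print(extract_metric(TEST_LOSS_PATTERN, "testing loss: 0.8421"))
```

If the second kind of result (`None`) shows up for the lines your script really prints, the tuner has no objective value to optimize and the tuning job fails.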

In [17]:
#TODO: Declare your HP ranges, metrics etc.
# Define the range of hyperparameters for tuning
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128]),
}

# Define the objective metric and its associated settings
objective_metric_name = "Test Loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "Test Loss", "Regex": "Testing Loss: ([0-9\\.]+)"}]

In [18]:
#TODO: Create estimators for your HPs
# Get execution role for SageMaker
role = get_execution_role()

# Set up the PyTorch estimator with your custom parameters
estimator = PyTorch(
    entry_point="scripts/hpo.py",  # Point to your hpo.py script location  
    base_job_name="dog-breeds-hpo",  # A meaningful name for the tuning job
    role=role,
    framework_version="1.9",  # The PyTorch framework version you're using
    instance_count=1,  # The number of instances to use for training
    instance_type="ml.m5.2xlarge",  # Choose an appropriate instance type
    py_version="py38",  # Python version
)

# Set up the hyperparameter tuner with the estimator, metrics, and ranges
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,  # Maximum number of tuning jobs to run
    max_parallel_jobs=1,  # Limit the number of parallel jobs
    base_tuning_job_name="dog-breeds-hpo-tuning",  # Base name for tuning job
    objective_type=objective_type,  # Objective type: Minimize loss
)

In [None]:
# TODO: Fit your HP Tuner
# Launch the hyperparameter tuning job

# Specify your S3 bucket name for the training data
bucket_name = 'dog-image-classification-project'

# Start the hyperparameter tuning job
tuner.fit({"training": f"s3://{bucket_name}/dog-images/"})



In [None]:
tuner.describe()

In [None]:
# TODO: Get the best estimators and the best HPs
best_estimator = tuner.best_estimator()

# Show the best estimator and the hyperparameters of the best trained model
print("Best Estimator:", best_estimator)
print("Best Hyperparameters:", best_estimator.hyperparameters())

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [None]:
# Get your bucket
bucket = 'dog-image-classification-project'
print(bucket)

In [None]:
# TODO: Set up debugging and profiling rules and hooks
from sagemaker.debugger import FrameworkProfile

# Profiler configuration (collects system and framework performance data)
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        start_step=5,  # Start profiling after 5 steps
        num_steps=10   # Profile the next 10 steps
    )
)

# Set up the debugger hook configuration
debugger_hook_config = DebuggerHookConfig(
    s3_output_path=f's3://{bucket}/profiler_output',
    collection_configs=[
        CollectionConfig(name="weights"),    # Collect weights data
        CollectionConfig(name="gradients"),  # Collect gradients data
        CollectionConfig(name="losses")      # Collect losses data (ensure it's unique)
    ]
)

In [None]:
# Extracting best hyperparameters with type conversion and cleanup
# Fetch the hyperparameters used by the best estimator
best_hyperparams = {
    "batch_size": int(best_estimator.hyperparameters().get("batch_size", "32").strip('"')),
    "learning_rate": float(best_estimator.hyperparameters().get("learning_rate", "0.001"))
}

# Display results
print("Best Estimator Object:", best_estimator)
print("Best Hyperparameters:", best_hyperparams)
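
`best_estimator.hyperparameters()` returns its values as strings, and categorical values typically come back wrapped in an extra pair of double quotes, which is why the cell above strips them. A sketch of that conversion as a reusable helper; the function name and the sample dictionary are hypothetical:

```python
def parse_hyperparameters(raw, casts):
    """Convert string-valued hyperparameters to typed Python values.

    raw   -- dict of hyperparameter name -> string value
    casts -- dict of hyperparameter name -> callable such as int or float
    Only the names listed in casts are returned.
    """
    parsed = {}
    for name, cast in casts.items():
        value = raw[name].strip().strip('"')  # drop surrounding double quotes, if any
        parsed[name] = cast(value)
    return parsed


# Hypothetical values shaped like a tuner result
raw = {"batch_size": '"64"', "learning_rate": "0.0123", "sagemaker_job_name": '"job"'}
print(parse_hyperparameters(raw, {"batch_size": int, "learning_rate": float}))
```

Listing the expected names explicitly also keeps SageMaker's own bookkeeping entries out of the dictionary passed to the next estimator.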


In [None]:
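# Regexes mirror the summary lines printed by train_model.py, e.g.
# "Train Loss: 3.9706, Accuracy: 0.1283". The "Valid Loss: ..." line is
# assumed to use the same format; adjust the patterns if your script differs.
metric_definition = [
    {"Name": "TrainingLoss", "Regex": "Train Loss: ([0-9]+\\.[0-9]+), Accuracy: [0-9]+\\.[0-9]+"},
    {"Name": "ValidationLoss", "Regex": "Valid Loss: ([0-9]+\\.[0-9]+), Accuracy: [0-9]+\\.[0-9]+"},
    {"Name": "TrainingAccuracy", "Regex": "Train Loss: [0-9]+\\.[0-9]+, Accuracy: ([0-9]+\\.[0-9]+)"},
    {"Name": "ValidationAccuracy", "Regex": "Valid Loss: [0-9]+\\.[0-9]+, Accuracy: ([0-9]+\\.[0-9]+)"}
]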

In [None]:
# TODO: Create and fit an estimator
# Define the SageMaker PyTorch Estimator
estimator = PyTorch(
    entry_point='scripts/train_model.py',  # Path to your training script
    role=get_execution_role(),            # Your IAM role
    framework_version='1.9',               # PyTorch version (adjust if needed)
    py_version='py38',                    # Python version
    instance_count=1,                     # Number of instances for training
    instance_type='ml.m5.large',          # Instance type for training
    hyperparameters=best_hyperparams,     # Hyperparameters
    output_path=f's3://{bucket}/output',  # Output path for model artifacts
    base_job_name='dog-breeds-image-classifier',  # Base name for the training job
    metric_definitions=metric_definition,
    profiler_config=profiler_config,      # Attach profiler config
    debugger_hook_config=debugger_hook_config  # Attach debugger config
)

In [None]:
from sagemaker.inputs import TrainingInput

s3_input_train = TrainingInput(s3_data='s3://dog-image-classification-project/dog-images/train/', content_type='application/x-image')
s3_input_validation = TrainingInput(s3_data='s3://dog-image-classification-project/dog-images/valid/', content_type='application/x-image')

In [None]:
# Start the training job
estimator.fit({'training': s3_input_train, 'validation': s3_input_validation})

Train Batch 40: Loss = 3.1586
INFO:__main__:Train Batch 40: Loss = 3.1586
Train Batch 50: Loss = 2.9890
INFO:__main__:Train Batch 50: Loss = 2.9890
Train Loss: 3.9706, Accuracy: 0.1283
INFO:__main__:Train Loss: 3.9706, Accuracy: 0.1283
Valid Batch 0: Loss = 1.9003
INFO:__main__:Valid Batch 0: Loss = 1.9003
Train Batch 0: Loss = 2.8994
INFO:__main__:Train Batch 0: Loss = 2.8994
Train Batch 10: Loss = 2.4913
INFO:__main__:Train Batch 10: Loss = 2.4913
Valid Batch 20: Loss = 0.9341
INFO:__main__:Valid Batch 20: Loss = 0.9341
Valid Batch 30: Loss = 1.1266
INFO:__main__:Valid Batch 30: Loss = 1.1266

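The job log above prints each batch loss twice, once directly and once through the logger with an `INFO:__main__:` prefix. A sketch of how such a log could be turned into per-phase loss series for plotting, assuming the `Train Batch N: Loss = X` format shown above; the function name is hypothetical, and the parser tolerates terminal colour codes with or without their escape character:

```python
import re

ANSI_COLOUR = re.compile(r"\x1b?\[[0-9;]*m")  # colour codes such as [34m and [0m
BATCH_LINE = re.compile(r"^(Train|Valid) Batch (\d+): Loss = ([0-9.]+)$")


def parse_batch_losses(log_text):
    """Return {"Train": [(batch, loss), ...], "Valid": [...]} from raw job logs."""
    losses = {"Train": [], "Valid": []}
    for line in log_text.splitlines():
        clean = ANSI_COLOUR.sub("", line).strip()
        # Lines prefixed with "INFO:__main__:" repeat the same value and do not match
        match = BATCH_LINE.match(clean)
        if match:
            phase, batch, loss = match.groups()
            losses[phase].append((int(batch), float(loss)))
    return losses


sample = "\n".join([
    "\x1b[34mTrain Batch 40: Loss = 3.1586\x1b[0m",
    "\x1b[34mINFO:__main__:Train Batch 40: Loss = 3.1586\x1b[0m",
    "[34mTrain Batch 50: Loss = 2.9890[0m",
    "Valid Batch 0: Loss = 1.9003",
])
print(parse_batch_losses(sample))
```

The resulting lists can be passed straight to `plt.plot` after unzipping them into batch indices and loss values.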

In [None]:
# TODO: Plot a debugging output.
# Set up the S3 client
s3 = boto3.client('s3')
bucket_name = 'dog-image-classification-project'
selected_folder = 'profiler_output/dog-breeds-image-classifier-2025-05-04-00-50-51-017/'

# List the files in the selected profiler folder
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=selected_folder)

# Print the file names in the folder
if 'Contents' in response:
    for obj in response['Contents']:
        print(obj['Key'])
else:
    print("No files found in the selected folder.")


In [None]:
from smdebug.trials import create_trial
from scripts.debug_utils import extract_tensor_values, plot_tensor_comparison

In [None]:
# Set up the S3 client and fetch profiler output (assuming you've already downloaded the necessary files)
s3 = boto3.client('s3')
bucket_name = 'dog-image-classification-project'
selected_folder = 'profiler_output/dog-breeds-image-classifier-2025-05-04-00-50-51-017/'
local_dir = './profiler_output'

# Ensure the local directory exists
os.makedirs(local_dir, exist_ok=True)

# List the files in the selected profiler folder (download if not already done)
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=selected_folder)

# Download files if not already in local directory
if 'Contents' in response:
    for obj in response['Contents']:
        key = obj['Key']
        file_name = key.split('/')[-1]
        if file_name:  # Skip directories
            local_path = os.path.join(local_dir, file_name)
            print(f"Downloading {file_name}")
            s3.download_file(bucket_name, key, local_path)

# Create the trial object for visualization
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
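
The download loop above stores every object under its bare file name, so files that share a name in different S3 folders overwrite each other and the folder layout is lost, while some later cells read nested paths such as `.../debug-output/events/...`. A sketch of a helper that keeps the folders below the prefix; the function name is hypothetical:

```python
import os


def local_path_for_key(key, prefix, local_dir):
    """Map an S3 key to a local path that preserves the folders below `prefix`.

    Returns None for "directory" placeholder keys that end with a slash.
    """
    if key.endswith("/"):
        return None
    relative = key[len(prefix):] if key.startswith(prefix) else key
    parts = [part for part in relative.split("/") if part]
    return os.path.join(local_dir, *parts)


prefix = "profiler_output/dog-breeds-image-classifier-2025-05-04-00-50-51-017/"
key = prefix + "debug-output/events/000000000000/000000000000_worker_0.tfevents"
print(local_path_for_key(key, prefix, "./profiler_output"))
```

Inside the loop, the parent folder would need to exist before the download, for example with `os.makedirs(os.path.dirname(local_path), exist_ok=True)`. Note also that a single `list_objects_v2` call returns at most 1,000 keys, so larger folders need pagination.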

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing import event_accumulator

# Path to the downloaded tfevents file (you might have multiple)
tfevents_path = './profiler_output/dog-breeds-image-classifier-2025-05-04-00-50-51-017/debug-output/events/000000000000/000000000000_worker_0.tfevents'

# Create an EventAccumulator to read the tfevents file
event_acc = event_accumulator.EventAccumulator(tfevents_path)
event_acc.Reload()  # Load the event data

# Extract the scalar data (e.g., loss, accuracy)
losses = event_acc.Scalars('loss')  # Assuming 'loss' is the tag used during training

# Extract steps and loss values
steps = [x.step for x in losses]
loss_values = [x.value for x in losses]

# Plot the training loss
plt.plot(steps, loss_values, label='Training Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.title('Training Loss over Steps')
plt.legend()
plt.show()

In [None]:
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

In [None]:
# Print training job information:
job_name = estimator.latest_training_job.job_name
print("job name : {}\n".format(job_name))
print("latest_job_debugger_artifacts_path : {}\n".format(estimator.latest_job_debugger_artifacts_path()))
# output_path has no trailing slash, so add the separator explicitly
print("rule_output_path : {}\n".format(estimator.output_path.rstrip("/") + "/" + job_name + "/rule-output"))

In [None]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

In [None]:
import json
import matplotlib.pyplot as plt
import os

# Path to your downloaded profiler files
index_path = './profiler_output/000000000000_worker_0.json'

with open(index_path) as f:
    index_data = json.load(f)

# Example: Plot the "loss" tensor over steps
loss_events = index_data.get('loss', {}).get('events', [])

steps = [e['step'] for e in loss_events]
values = [e['value'] for e in loss_events]

plt.plot(steps, values)
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.title("Loss over Time")
plt.grid(True)
plt.show()

In [None]:
collections_path = './profiler_output/worker_0_collections.json'

with open(collections_path) as f:
    collections = json.load(f)

print("Available Collections:")
for name, tensors in collections.items():
    print(f"\n{name}:\n", tensors)

In [None]:
import os
import json
import matplotlib.pyplot as plt

profiler_dir = './profiler_output/'
files = sorted([f for f in os.listdir(profiler_dir) if f.endswith('.json') and f.startswith('000')],
               key=lambda x: int(x.split('_')[0]))

steps = []
cpu = []
gpu = []

for f in files:
    with open(os.path.join(profiler_dir, f)) as jf:
        data = json.load(jf)
        steps.append(data.get('step', 0))
        sys_metrics = data.get('system_metrics', {})
        cpu.append(sys_metrics.get('CPU', 0))
        gpu.append(sys_metrics.get('GPU', 0))

plt.figure(figsize=(10, 5))
plt.plot(steps, cpu, label='CPU Usage (%)', marker='o')
plt.plot(steps, gpu, label='GPU Usage (%)', marker='x')
plt.xlabel("Training Step")
plt.ylabel("Utilization (%)")
plt.title("System Resource Usage During Training")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
from scripts.train_model import train  # Import the train function from train_model.py
import torch
import matplotlib.pyplot as plt

# Assumes model, train_loader, val_loader, criterion, optimizer, device,
# batch_size and hook are already defined in this notebook session
model, train_losses, val_losses = train(
    model, train_loader, val_loader, criterion, optimizer, device, batch_size, hook
)

plt.figure(figsize=(8, 5))
plt.plot(train_losses, label="Train Loss", marker='o')
plt.plot(val_losses, label="Validation Loss", marker='o')
plt.xlabel("Epoch")
plt.ylabel("Cross Entropy Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
import torch

# Assuming model is already defined and trained
torch.save(model.state_dict(), 'model.pth')

In [None]:
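# Note: `args` is not defined in this notebook; `args.model_dir` presumably comes
# from the argument parser in the training script, where this line belongs
torch.save(model.cpu().state_dict(), os.path.join(args.model_dir, "model.pth"))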

In [None]:
from smdebug.trials import create_trial
from scripts.debug_utils import extract_tensor_values, plot_tensor_comparison

# Now you can use the functions in the notebook:
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

In [None]:
from smdebug.trials import create_trial
import matplotlib.pyplot as plt

# Point to the local path where profiler output is stored
trial = create_trial('./profiler_output/dog-breeds-image-classifier-2025-05-03-15-53-40-671/debug-output/')

# List all available tensors
print("Available scalar tensors:", trial.tensor_names())

# Plot training loss
loss_tensor = 'loss'  # or trial.tensor_names()[0] if you're unsure
# Use only the steps at which this tensor was actually saved
steps = trial.tensor(loss_tensor).steps()
loss_values = [trial.tensor(loss_tensor).value(step) for step in steps]

plt.plot(steps, loss_values)
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.title("Training Loss over Time")
plt.grid(True)
plt.show()

In [None]:
import pandas as pd
# Assuming the loss data is stored in a CSV file
loss_data = pd.read_csv(os.path.join(local_dir, 'losses.csv'))

# Inspect the first few rows of the data
print(loss_data.head())

In [None]:
import matplotlib.pyplot as plt

# Plotting the loss values
plt.plot(loss_data['step'], loss_data['loss'], label='Training Loss')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training Loss Over Steps')
plt.legend()
plt.show()

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?
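
One way to make "anomalous behaviour" concrete is to test whether the loss has stopped improving or has become non-finite. The check below only illustrates the idea in plain Python and is not the implementation of any SageMaker Debugger rule; the function name, window size, and tolerance are assumptions:

```python
import math


def find_loss_anomaly(losses, window=3, min_improvement=0.01):
    """Return a short description of the first problem found, or None.

    Flags a non-finite loss, or a run of `window` consecutive steps in which
    the loss never beats the best value seen so far by `min_improvement`.
    """
    best = math.inf
    stalled = 0
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return f"non-finite loss at step {step}"
        if loss < best - min_improvement:
            best = loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= window:
                return f"loss not decreasing for {window} steps (step {step})"
    return None


# Batch losses taken from the training log earlier in this notebook
print(find_loss_anomaly([3.1586, 2.9890, 2.8994, 2.4913]))
# A flat, hypothetical sequence
print(find_loss_anomaly([2.5, 2.5, 2.5, 2.5]))
```

SageMaker Debugger ships built-in rules such as `loss_not_decreasing` and `vanishing_gradient` in `rule_configs`, which can be attached to the estimator through its `rules` argument to run comparable checks during training.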

In [None]:
# TODO: Display the profiler output

## Model Deploying

In [None]:
# TODO: Deploy your model to an endpoint

# Assumed configuration: a single ml.m5.large instance; adjust to your latency and cost needs
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

In [None]:
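# TODO: Run a prediction on the endpoint
from sagemaker.serializers import IdentitySerializer

# Send raw JPEG bytes; assumes the endpoint's inference code accepts "image/jpeg"
predictor.serializer = IdentitySerializer("image/jpeg")
with open("dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg", "rb") as f:
    image = f.read()
response = predictor.predict(image)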
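
The endpoint response still has to be turned into a breed name. A sketch of that post-processing, assuming the model returns one score per class and that class indices follow the order of the dataset folders (such as `001.Affenpinscher`). The function name, the scores, and the second and third folder names are illustrative only:

```python
def breed_from_scores(scores, class_folders):
    """Return (breed_name, index) for the highest score.

    class_folders -- folder names such as "001.Affenpinscher", in class-index order
    """
    if len(scores) != len(class_folders):
        raise ValueError("expected one score per class folder")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    # "001.Affenpinscher" -> "Affenpinscher"; underscores become spaces
    breed = class_folders[best_index].split(".", 1)[1].replace("_", " ")
    return breed, best_index


folders = ["001.Affenpinscher", "002.Afghan_hound", "003.Airedale_terrier"]
scores = [0.2, 2.7, -1.1]  # hypothetical model outputs
print(breed_from_scores(scores, folders))
```

In practice the folder list could be built once with `sorted(os.listdir("dogImages/train"))`, provided the training script assigned class indices in the same sorted order.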

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()