# HotDog, Not HotDog

This notebook contains snippets for processing the [hotdog-nothotdog](hotdot-nothotdot) dataset.

Configure the global variables for this notebook.

1. Set the `bucket_name` to an Amazon S3 bucket within your AWS account
1. Set the `prefix` to the path prefix where you uploaded the [hotdog-nothotdog](hotdot-nothotdot) dataset
1. Set the `region_name` to the same AWS Region containing your Amazon S3 bucket

In [None]:
bucket_name = 'ch10-cv-book-use2'
prefix = 'sagemaker/hotdog-nothotdog'
region_name = 'us-east-2'
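Later cells repeat the bucket and prefix inside `s3://` URIs. As a minimal sketch, the same URIs can be derived from the globals so they stay consistent if you change the configuration. The `s3_uri` helper is a hypothetical name introduced here for illustration; it is not part of `boto3`.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key fragments into an s3:// URI (hypothetical helper)."""
    # Strip stray slashes so fragments join with exactly one separator.
    key = '/'.join(p.strip('/') for p in parts if p)
    return 's3://%s/%s' % (bucket, key)

bucket_name = 'ch10-cv-book-use2'
prefix = 'sagemaker/hotdog-nothotdog'

train_uri = s3_uri(bucket_name, prefix, 'train')
validation_uri = s3_uri(bucket_name, prefix, 'test')
print(train_uri)
print(validation_uri)
```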

Use the `boto3` module to create the Amazon S3 client.

In [6]:
import boto3

s3_client = boto3.client('s3', region_name=region_name)

Use the `ListObjectsV2` API to find the `files` under the `prefix` within the `bucket_name`.

In [12]:
files = []
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
for page in pages:
    # A page has no 'Contents' key when nothing matches the prefix.
    for obj in page.get('Contents', []):
        files.append(obj['Key'])
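Before going further, it can help to sanity-check what the listing returned. The sketch below tallies keys by their parent prefix; `count_by_folder` and the sample keys are hypothetical and only illustrate the idea, so substitute the real `files` list in the notebook.

```python
from collections import Counter
import posixpath

def count_by_folder(keys):
    """Count S3 keys per parent prefix (hypothetical helper)."""
    # S3 keys always use '/' as the separator, so posixpath is the right tool.
    return Counter(posixpath.dirname(key) for key in keys)

# Hypothetical keys standing in for the real listing.
sample_files = [
    'sagemaker/hotdog-nothotdog/train/hotdog/1.jpg',
    'sagemaker/hotdog-nothotdog/train/hotdog/2.jpg',
    'sagemaker/hotdog-nothotdog/train/nothotdog/3.jpg',
]

counts = count_by_folder(sample_files)
for folder, total in sorted(counts.items()):
    print(folder, total)
```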

You can optionally use the `imghdr` module to confirm every image is a valid JPEG file. Note that `imghdr` is deprecated since Python 3.11 and removed in Python 3.13.

In [75]:
import imghdr
from io import BytesIO

bad = []
for file in files:
    # Skip anything that is not an image (for example, the .lst files).
    if not file.endswith('.jpg'):
        continue

    # Download the object into memory and inspect its header.
    body = s3_client.get_object(Bucket=bucket_name, Key=file)['Body'].read()
    if imghdr.what(BytesIO(body)) != 'jpeg':
        bad.append(file)

print(len(bad))

0
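On Python versions where `imghdr` is unavailable, a rough substitute is to check the JPEG start-of-image signature directly. This is a sketch of a different, simpler test than the one `imghdr` performs: it only inspects the first three bytes and does not prove the rest of the file decodes. The `looks_like_jpeg` name is hypothetical.

```python
# JPEG streams begin with the SOI marker (FF D8) followed by another marker (FF ..).
JPEG_SIGNATURE = b'\xff\xd8\xff'

def looks_like_jpeg(data):
    """Return True when the bytes start with the JPEG signature (hypothetical helper)."""
    return data[:3] == JPEG_SIGNATURE

jpeg_header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'
png_header = b'\x89PNG\r\n\x1a\n'

print(looks_like_jpeg(jpeg_header))
print(looks_like_jpeg(png_header))
```

In the notebook, you would pass the bytes returned by `s3_client.get_object(...)['Body'].read()` instead of the sample headers.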


The `get_dataset` function groups the keys of the `train` or `test` split by label (`hotdog` or `nothotdog`).

In [66]:
def get_dataset(name):
    ds = {'hotdog':[], 'nothotdog':[]}
    for file in files:
        if not file.endswith('.jpg'):
            continue
        if '/%s/' % name not in file:
            continue
        if '/hotdog/' in file:
            ds['hotdog'].append(file)
        elif '/nothotdog/' in file:
            ds['nothotdog'].append(file)
    return ds
        
train_ds = get_dataset('train')
test_ds = get_dataset('test')
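To see the grouping on a small input, here is a sketch of the same logic applied to a few hypothetical keys. This variant takes the key list as an argument (an assumption made so it runs without S3 access); the notebook's own `get_dataset` reads the global `files` instead.

```python
def get_dataset_from(files, name):
    """Group .jpg keys of one split ('train' or 'test') by label."""
    ds = {'hotdog': [], 'nothotdog': []}
    for file in files:
        if not file.endswith('.jpg'):
            continue
        if '/%s/' % name not in file:
            continue
        # '/hotdog/' does not match inside '/nothotdog/' because of the leading slash.
        if '/hotdog/' in file:
            ds['hotdog'].append(file)
        elif '/nothotdog/' in file:
            ds['nothotdog'].append(file)
    return ds

# Hypothetical keys for illustration only.
sample_files = [
    'sagemaker/hotdog-nothotdog/train/hotdog/1.jpg',
    'sagemaker/hotdog-nothotdog/train/nothotdog/2.jpg',
    'sagemaker/hotdog-nothotdog/test/hotdog/3.jpg',
    'sagemaker/hotdog-nothotdog/train.lst',
]

sample_train = get_dataset_from(sample_files, 'train')
label_counts = {label: len(keys) for label, keys in sample_train.items()}
print(label_counts)
```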

Let's create the `train_lst` and `test_lst` channel files for the `CreateTrainingJob` API.

In [74]:
from os import path

def create_channel(ds):
    # Each line is tab-separated: identifier, class id, relative image path.
    channel = []
    for label, keys in ds.items():
        class_id = 1 if label == 'hotdog' else 0
        for obj in keys:
            identifier = path.splitext(path.basename(obj))[0]
            relpath = '%s/%s.jpg' % (label, identifier)
            channel.append('%s\t%s\t%s' % (identifier, class_id, relpath))
    return channel

train_lst = '\n'.join(create_channel(train_ds))
test_lst = '\n'.join(create_channel(test_ds))

# Upload the list files under the same prefix as the images.
s3_client.put_object(
    Bucket=bucket_name,
    Key='%s/train.lst' % prefix,
    Body=train_lst)

s3_client.put_object(
    Bucket=bucket_name,
    Key='%s/validation.lst' % prefix,
    Body=test_lst)

{'ResponseMetadata': {'RequestId': 'QA5E114KZ0PCQQSJ',
  'HostId': '+EZqiDePQdaPjVq84j5NMSVDk+4VEwMZ/0eVVfKgEuNAG5IwqexnVBBdxJgdlf8ysdK/vxGWtYV2ed/BCr5hPQ==',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '+EZqiDePQdaPjVq84j5NMSVDk+4VEwMZ/0eVVfKgEuNAG5IwqexnVBBdxJgdlf8ysdK/vxGWtYV2ed/BCr5hPQ==',
   'x-amz-request-id': 'QA5E114KZ0PCQQSJ',
   'date': 'Sun, 05 Feb 2023 20:55:50 GMT',
   'x-amz-server-side-encryption': 'AES256',
   'etag': '"654dd6a32aad4e0bc8ec162b16398b2c"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"654dd6a32aad4e0bc8ec162b16398b2c"',
 'ServerSideEncryption': 'AES256'}
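To make the `.lst` layout concrete, the sketch below builds a single line the same way `create_channel` does. The key is a hypothetical example, and `lst_line` is a name introduced here for illustration. The three tab-separated columns are the image identifier, the class id, and the image path relative to the channel's S3 location.

```python
from os import path

def lst_line(key, label):
    """Build one .lst line for an S3 key, mirroring create_channel (hypothetical helper)."""
    identifier = path.splitext(path.basename(key))[0]
    class_id = 1 if label == 'hotdog' else 0
    relpath = '%s/%s.jpg' % (label, identifier)
    return '%s\t%s\t%s' % (identifier, class_id, relpath)

# Hypothetical key; real keys come from the S3 listing.
line = lst_line('sagemaker/hotdog-nothotdog/train/hotdog/1000288.jpg', 'hotdog')
print(repr(line))
```

Because the identifier is taken from the file name, this layout assumes the file names are unique within each split.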

Create the Amazon SageMaker client for the same AWS Region.

In [28]:
sagemaker = boto3.client('sagemaker', region_name=region_name)

Use the `CreateTrainingJob` API to test the client. To find appropriate parameters for this action, create a training job in the AWS Management Console and then call the `DescribeTrainingJob` API on that job.

In [None]:
sagemaker.create_training_job(
    TrainingJobName="hotdog-nothotdog",
    HyperParameters={
        "beta_1": "0.9",
        "beta_2": "0.999",
        "checkpoint_frequency": "1",
        "early_stopping": "false",
        "early_stopping_min_epochs": "10",
        "early_stopping_patience": "5",
        "early_stopping_tolerance": "0.0",
        "epochs": "30",
        "eps": "1e-8",
        "gamma": "0.9",
        "image_shape": "3,224,224",
        "learning_rate": "0.1",
        "lr_scheduler_factor": "0.1",
        "mini_batch_size": "32",
        "momentum": "0.9",
        "multi_label": "0",
        "num_classes": "2",
        "num_layers": "152",
        "num_training_samples": "3000",
        "optimizer": "sgd",
        "precision_dtype": "float32",
        "use_pretrained_model": "0",
        "use_weighted_loss": "0",
        "weight_decay": "0.0001"
    },
    AlgorithmSpecification={
        "TrainingImage": "825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:1",
        "TrainingInputMode": "File",
        "MetricDefinitions": [
            {
                "Name": "train:accuracy",
                "Regex": "Epoch\\S* Train-accuracy=(\\S*)"
            },
            {
                "Name": "validation:accuracy",
                "Regex": "Epoch\\S* Validation-accuracy=(\\S*)"
            },
            {
                "Name": "train:accuracy:epoch",
                "Regex": "Epoch\\S* Train-accuracy=(\\S*)"
            },
            {
                "Name": "validation:accuracy:epoch",
                "Regex": "Epoch\\S* Validation-accuracy=(\\S*)"
            }
        ],
        "EnableSageMakerMetricsTimeSeries": False
    },
    RoleArn="arn:aws:iam::028973425348:role/service-role/AmazonSageMaker-ExecutionRole-20220322T125487",
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://ch10-cv-book-use2/sagemaker/hotdog-nothotdog/train",
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://ch10-cv-book-use2/sagemaker/hotdog-nothotdog/test",
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        },
        {
            "ChannelName": "train_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://ch10-cv-book-use2/sagemaker/hotdog-nothotdog/train.lst",
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        },
        {
            "ChannelName": "validation_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://ch10-cv-book-use2/sagemaker/hotdog-nothotdog/validation.lst",
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        }
    ],
    OutputDataConfig={
        "KmsKeyId": "",
        "S3OutputPath": "s3://ch10-cv-book-use2/sagemaker/output/hotdog-nothotdog/"
    },
    ResourceConfig={
        "InstanceType": "ml.p2.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 5
    },
    StoppingCondition={
        "MaxRuntimeInSeconds": 86400
    },
    CheckpointConfig={
        "S3Uri": "s3://ch10-cv-book-use2/sagemaker/checkpoint/hotdog-nothotdog.checkpoint"
    })
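The `DescribeTrainingJob` round trip mentioned before this call can be sketched as a small filter: the description of an existing job contains read-only fields (status, ARN, timestamps) that `CreateTrainingJob` does not accept, so only the shared request fields are copied. The `to_create_params` helper and the sample description are hypothetical, and the service may still reject some copied values, which this sketch does not handle.

```python
# Request fields used by the create_training_job call in this notebook.
CREATE_KEYS = (
    'TrainingJobName', 'HyperParameters', 'AlgorithmSpecification', 'RoleArn',
    'InputDataConfig', 'OutputDataConfig', 'ResourceConfig',
    'StoppingCondition', 'CheckpointConfig',
)

def to_create_params(description, new_name):
    """Keep only create-style fields from a job description (hypothetical helper)."""
    params = {key: description[key] for key in CREATE_KEYS if key in description}
    # Training job names must be unique, so the copy needs a new name.
    params['TrainingJobName'] = new_name
    return params

# Hypothetical, heavily trimmed description for illustration.
sample_description = {
    'TrainingJobName': 'hotdog-nothotdog',
    'TrainingJobStatus': 'Completed',
    'HyperParameters': {'epochs': '30'},
    'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
}

params = to_create_params(sample_description, 'hotdog-nothotdog-2')
print(sorted(params))
```

In the notebook, the description would come from `sagemaker.describe_training_job(TrainingJobName=...)`, and the result could be passed as `sagemaker.create_training_job(**params)`.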

