## AWS and Intel Hackathon: Model Training

### Install Python SDKs

In [1]:
import sys

In [2]:
!{sys.executable} -m pip install sagemaker-experiments==0.1.24

Collecting sagemaker-experiments==0.1.24
  Downloading sagemaker_experiments-0.1.24-py3-none-any.whl (36 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.24


### Install PyTorch

In [3]:
!{sys.executable} -m pip install torch==1.1.0
!{sys.executable} -m pip install torchvision==0.3.0
!{sys.executable} -m pip install pillow==6.2.2
!{sys.executable} -m pip install --upgrade sagemaker
!{sys.executable} -m pip install torchsummary

Collecting torch==1.1.0
  Downloading torch-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (676.9 MB)
     |████████████████████████████████| 676.9 MB 2.3 kB/s             
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
Successfully installed torch-1.1.0
Collecting torchvision==0.3.0
  Downloading torchvision-0.3.0-cp36-cp36m-manylinux1_x86_64.whl (2.6 MB)
     |████████████████████████████████| 2.6 MB 27.5 MB/s            
Installing collected packages: torchvision
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.5.0
    Uninstalling torchvision-0.5.0:
      Successfully uninstalled

### Setup

In [4]:
import time

import boto3
import numpy as np
import pandas as pd
from IPython.display import set_matplotlib_formats, display
from matplotlib import pyplot as plt
from torchvision import datasets, transforms, models

import torch

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

from tqdm.notebook import tqdm

from torchsummary import summary
import glob
from PIL import Image

import random

set_matplotlib_formats("retina")

### Download the data

In [None]:
!wget https://www.dropbox.com/s/kc4xt9fhkrdnwjo/dataset_reduced.zip

On instances with less hardware, the reduced version of the dataset, with approximately 200 images per class, can be used instead. To do so, run the following cell.

In [5]:
!wget https://www.dropbox.com/s/evm0ts2obk7n3cb/dataset_reduced.zip

--2022-03-28 19:18:07--  https://www.dropbox.com/s/evm0ts2obk7n3cb/dataset_reduced.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.64.18, 2620:100:6020:18::a27d:4012
Connecting to www.dropbox.com (www.dropbox.com)|162.125.64.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/evm0ts2obk7n3cb/dataset_reduced.zip [following]
--2022-03-28 19:18:07--  https://www.dropbox.com/s/raw/evm0ts2obk7n3cb/dataset_reduced.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3408a33a5737a418a1180998c0.dl.dropboxusercontent.com/cd/0/inline/BiVjSXq5zc91priCia4zsT_9u229evpsiY4gIFc1Yz104vl0IzbONAXLfD7TsqpqhgVE6k1_lmZYVwGM5O3JIJyfmP3G6SHfdQr-yEFmjq7wDxyef94s3kG3uo4gMYhk7c-uyU6GXVG63fQrEq0eGJVkokBbGoWSQM2eKa_M9WUgSQ/file# [following]
--2022-03-28 19:18:08--  https://uc3408a33a5737a418a1180998c0.dl.dropboxusercontent.com/cd/0/inline/BiVjSXq5zc91priCia4zsT_9u229evpsiY4gIFc1Yz10

### Upload dataset to S3 as zip file

In [6]:
sm_sess = sagemaker.Session()
sess = sm_sess.boto_session
sm = sm_sess.sagemaker_client
role = get_execution_role()

In [7]:
account_id = sess.client("sts").get_caller_identity()["Account"]
bucket = "sagemaker-hackathon-demo-{}-{}".format(sess.region_name, account_id)
prefix = "hackathon"

try:
    if sess.region_name == "us-east-1":
        sess.client("s3").create_bucket(Bucket=bucket)
    else:
        sess.client("s3").create_bucket(
            Bucket=bucket, CreateBucketConfiguration={"LocationConstraint": sess.region_name}
        )
except Exception as e:
    print(e)

An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
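
The region branch in the cell above exists because S3 rejects a `LocationConstraint` of `us-east-1`, so that region has to be created without a `CreateBucketConfiguration`. A minimal sketch of the same logic as a helper follows; `create_bucket_kwargs` is a hypothetical name, not part of boto3 or of the original notebook.

```python
def create_bucket_kwargs(bucket_name, region_name):
    """Build the keyword arguments for an S3 CreateBucket call.

    Hypothetical helper: it only mirrors the if/else branch of the cell above.
    """
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint
    if region_name != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    return kwargs


print(create_bucket_kwargs("example-bucket", "us-east-1"))
print(create_bucket_kwargs("example-bucket", "eu-west-1"))
```

With such a helper the call could be written as `sess.client("s3").create_bucket(**create_bucket_kwargs(bucket, sess.region_name))`. The `BucketAlreadyOwnedByYou` message shown above is expected when the notebook is re-run in the same account and region.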


In [8]:
bucket

'sagemaker-hackathon-demo-eu-west-1-017233837209'

In [9]:
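# S3 resource handle for the session's region (not used by the upload below)
s3_resource = boto3.resource("s3", region_name=sess.region_name)

# Stays None if the upload fails, so later cells can detect the failure
inputs = None

try:
    # Upload the zipped dataset to s3://<bucket>/<prefix>/dataset_reduced.zip
    inputs = sagemaker.Session().upload_data(
        path="./dataset_reduced.zip", bucket=bucket, key_prefix=prefix
    )
    print("input spec: {}".format(inputs))
except Exception as exp:
    print("exp: ", exp)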


input spec: s3://sagemaker-hackathon-demo-eu-west-1-017233837209/hackathon/dataset_reduced.zip
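
The printed input spec follows the pattern `s3://<bucket>/<key_prefix>/<file name>`. The sketch below rebuilds that URI and splits it back into bucket and key, which can be handy when another call needs the two parts separately. Both function names are hypothetical, and the layout is inferred from the output above rather than taken from the SDK documentation.

```python
import os
import posixpath
from urllib.parse import urlparse


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI a single-file upload is expected to produce (assumed layout)."""
    key = posixpath.join(key_prefix, os.path.basename(local_path))
    return "s3://{}/{}".format(bucket, key)


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


uri = expected_s3_uri(
    "sagemaker-hackathon-demo-eu-west-1-017233837209", "hackathon", "./dataset_reduced.zip"
)
print(uri)
print(split_s3_uri(uri))
```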


### Prepare the dataset

In [None]:
!unzip -uo dataset_reduced.zip
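
After unzipping, it can be worth checking that each class holds roughly the 200 images mentioned earlier. The sketch below assumes the archive extracts into a folder with one subdirectory per class. That layout, the folder name `dataset_reduced`, and the list of image extensions are assumptions, not facts taken from the notebook.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to the actual dataset
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(root):
    """Return {class_name: image_count} for a one-folder-per-class layout."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore loose files next to the class folders
        counts[class_dir.name] = sum(
            1
            for path in class_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


# Example usage (hypothetical folder name):
# for name, n in count_images_per_class("dataset_reduced").items():
#     print("{}: {} images".format(name, n))
```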

### Training

In [12]:
from sagemaker.pytorch import PyTorch, PyTorchModel

In [15]:
estimator = PyTorch(
    py_version="py3",
    entry_point="./model.py",
    role=role,
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    framework_version="1.1.0",
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    hyperparameters={
        "epochs": 2,
        "backend": "gloo",
        "dropout": 0.2,
        "kernel_size": 5,
        "optimizer": "sgd",
    },
    metric_definitions=[
        {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
        {"Name": "test:loss", "Regex": "Test Average loss: (.*?),"},
        {"Name": "test:accuracy", "Regex": "Test Accuracy: (.*?)%;"},
    ],
    enable_sagemaker_metrics=True,
)

cnn_training_job_name = "cnn-training-job-{}".format(int(time.time()))

estimator.fit(
    inputs={"training": inputs},
    job_name=cnn_training_job_name,
    wait=True,
)


time.sleep(2)

2022-03-28 19:30:36 Starting - Starting the training job...
2022-03-28 19:31:00 Starting - Preparing the instances for trainingProfilerReport-1648495836: InProgress
......
2022-03-28 19:32:00 Downloading - Downloading input data......
2022-03-28 19:33:01 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-03-28 19:32:49,257 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2022-03-28 19:32:49,260 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2022-03-28 19:32:49,269 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-03-28 19:32:49,270 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-03-28 19:32:49,569 sagemaker-containers INFO     Module model does not provide