# Training jobs with a training script
This section shows how to train a machine learning model with a training script and then deploy it.
We write two scripts, one using fastai and one using PyTorch, and train a separate model with each.

## 1. Basic setup
Set up the SageMaker session, S3 bucket, and execution role used by the training jobs.

In [4]:
import os
import io
import subprocess

import PIL

import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import Predictor

In [5]:
sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "cosmax-hsdm"

role = sagemaker.get_execution_role()
print(role)

Couldn't call 'get_role' to get Role ARN from role name AmazonSageMaker-ExecutionRole-20211115T162495 to get Role path.
Assuming role was created in SageMaker AWS console, as the name contains `AmazonSageMaker-ExecutionRole`. Defaulting to Role ARN with service-role in path. If this Role ARN is incorrect, please add IAM read permissions to your role or supply the Role Arn directly.


arn:aws:iam::543416176939:role/service-role/AmazonSageMaker-ExecutionRole-20211115T162495
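
The `bucket` and `prefix` values defined above are usually combined into an S3 URI when a training input is specified. The helper below is a minimal sketch of that step; `s3_uri` and the bucket name `my-example-bucket` are hypothetical and not part of the SageMaker SDK.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    # Drop empty parts and strip stray slashes so 'a/' + '/b' becomes 'a/b'.
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Hypothetical bucket name; in the notebook this would be `bucket`.
example_bucket = "my-example-bucket"
print(s3_uri(example_bucket, "cosmax-hsdm"))
print(s3_uri(example_bucket, "cosmax-hsdm/", "train"))
```

In the notebook, `s3_uri(bucket, prefix)` would give the same string as `'s3://{}/{}'.format(bucket, prefix)`.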


## 2. fastai
### 2.1 Writing the training script

In [12]:
%%writefile ./train.py

import argparse
import os  # needed for os.environ and os.path.join below

import fastai
from fastai.vision import *

parser = argparse.ArgumentParser()

# Hyperparameters sent by the client are passed as command-line arguments to the script.
parser.add_argument('--num_epochs', type=int, default=1)
parser.add_argument('--batch_size', type=int, default=4)
# parser.add_argument('--lr', type=float, default=0.001)

# SageMaker Container environment
parser.add_argument('--data', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
# parser.add_argument('--num_gpus', type=int, default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
# parser.add_argument('--output_data_dir', type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
# parser.add_argument('--test_dir', type=str, default=os.environ["SM_CHANNEL_TEST"])

args, _ = parser.parse_known_args()

path = Path(args.data)

data = ImageDataBunch.from_folder(
    path, 
    train='train',
    valid='val',
    ds_tfms=get_transforms(do_flip=True),  # data augmentation transforms
    size=224,  # image size
    bs=args.batch_size,  # batch size
    num_workers=8  # number of data-loading workers
)

data.normalize(tensor([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]))

learn = cnn_learner(data, models.resnet18, metrics = [accuracy])
learn.fit_one_cycle(args.num_epochs)
# fastai v1's Learner.save appends '.pth' itself, so the name is passed without the extension.
learn.save(os.path.join(args.model_dir, 'model_fastai'))

Writing ./train.py
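
The script above reads hyperparameters from command-line arguments and container paths from `SM_*` environment variables, so it fails with a `KeyError` when run outside a SageMaker container. A minimal sketch of the same argument handling with local fallbacks follows; the fallback paths `./data` and `./model` and the function name `parse_args` are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths, with local fallbacks."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments.
    parser.add_argument('--num_epochs', type=int, default=1)
    parser.add_argument('--batch_size', type=int, default=4)

    # Container paths come from environment variables set by SageMaker.
    # './data' and './model' are hypothetical defaults for local runs.
    parser.add_argument('--data', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', './data'))
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))

    # parse_known_args ignores any extra arguments the container may pass.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--num_epochs', '2', '--batch_size', '4'])
print(args.num_epochs, args.batch_size, args.data, args.model_dir)
```

Passing `argv` explicitly keeps the function easy to call from a notebook or a test; inside the container it can be called with no arguments.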


### 2.2 training job
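The job in this section passes a single S3 URI to `fit`. In that case the SDK creates one input channel named `training`, the data is downloaded under `/opt/ml/input/data/training` in the container, and that path is exposed to the script as `SM_CHANNEL_TRAINING`. The sketch below only illustrates this naming convention; the helper functions are hypothetical and are not SDK calls.

```python
def normalize_inputs(inputs):
    """Mimic how a bare S3 URI is treated as a single 'training' channel."""
    if isinstance(inputs, str):
        return {"training": inputs}
    return dict(inputs)


def channel_env_var(channel: str) -> str:
    """Environment variable that holds the local path of a channel."""
    return "SM_CHANNEL_" + channel.upper()


def channel_local_dir(channel: str) -> str:
    """Directory inside the training container where the channel is downloaded."""
    return "/opt/ml/input/data/" + channel


for name, uri in normalize_inputs("s3://my-example-bucket/data").items():
    print(name, uri, channel_env_var(name), channel_local_dir(name))
```

Passing a dict such as `{'training': ..., 'test': ...}` to `fit` would give the script a second variable, `SM_CHANNEL_TEST`, which is what the commented-out `--test_dir` argument refers to.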

In [6]:
estimator = PyTorch(entry_point='train.py',
                    role=role,
                    instance_type='ml.g4dn.xlarge',
                    instance_count=1,
                    framework_version='1.8.0',
                    py_version='py36',
                    hyperparameters = {'num_epochs': 2, 
                                       'batch_size': 4
                                      }                       
                   )
# s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}'.format(bucket, prefix), content_type='csv')    
#s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}'.format(bucket, prefix), content_type='csv') # SDK v1
estimator.fit('s3://cosmax-head-skin-diagnosis-data-sample')

2021-12-16 12:07:09 Starting - Starting the training job...
2021-12-16 12:07:33 Starting - Launching requested ML instancesProfilerReport-1639656429: InProgress
...
2021-12-16 12:08:05 Starting - Preparing the instances for training.........
2021-12-16 12:09:34 Downloading - Downloading input data
2021-12-16 12:09:34 Training - Downloading the training image..................
2021-12-16 12:12:34 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-16 12:12:28,908 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-12-16 12:12:28,930 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-12-16 12:12:35,149 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-12-16 12:12:35,462 sagemaker-training-

## 3. pytorch
### 3.1 Writing the training script

In [7]:
%%writefile ./train_pytorch_one_cycle.py

import argparse
import os
import matplotlib.pyplot as plt
import time
import copy
from datetime import datetime
from pytz import timezone
from tqdm import tqdm
import zipfile
import shutil
from pathlib import Path
import logging
import sys

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, transforms
import torchvision.models as models

parser = argparse.ArgumentParser()

# Hyperparameters sent by the client are passed as command-line arguments to the script.
parser.add_argument('--num_epochs', type=int, default=1)
parser.add_argument('--batch_size', type=int, default=4)
# parser.add_argument('--lr', type=float, default=0.001)

# SageMaker Container environment
parser.add_argument('--data', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
# parser.add_argument('--num_gpus', type=int, default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
# parser.add_argument('--output_data_dir', type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
# parser.add_argument('--test_dir', type=str, default=os.environ["SM_CHANNEL_TEST"])

args, _ = parser.parse_known_args()

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

# Data augmentation and normalization for training
# Normalization only for validation
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(224),
        transforms.RandomRotation(degrees=(0, 180)),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
#         transforms.Grayscale(),
#         transforms.RandomPerspective(),
#         transforms.ColorJitter(brightness=.5, hue=.3),
#         transforms.GaussianBlur(kernel_size=(5, 9), sigma=(0.1, 5)),
#         transforms.RandomCrop(size=(64, 64)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

data_dir = args.data
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=args.batch_size,
                                             shuffle=True, num_workers=4)
              for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        
        start_time = time.time()
        
        start = datetime.now(timezone('Asia/Seoul')
                            ).strftime('%Y-%m-%d %H:%M:%S')
        print('Start = {}'.format(start))

        # Each epoch has a training phase and a validation phase.
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # set the model to training mode
            else:
                model.eval()   # set the model to evaluation mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over the data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero the parameter gradients.
                optimizer.zero_grad()

                # Forward pass.
                # Track gradient history only in the training phase.
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # Backward pass and optimization only in the training phase.
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                        # OneCycleLR is defined per batch, so step it after every optimizer step.
                        scheduler.step()

                # Statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]*100

            print('{:10}: Loss - {:10.4f} | Acc - {:10.2f}%'.format(
                phase, epoch_loss, epoch_acc))

            # Deep copy the model weights when validation accuracy improves.
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            
        finish = datetime.now(timezone('Asia/Seoul')
                            ).strftime('%Y-%m-%d %H:%M:%S')
        print('Finish = {}'.format(finish))
        
        time_elapsed = time.time() - start_time
        print('Time: {:10.2f}m'.format(time_elapsed/60))

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}h {:.0f}m'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60))
    print('Best val Acc: {:10.2f}%'.format(best_acc))

    # Load the best model weights and save them to the model directory.
    model.load_state_dict(best_model_wts)
    torch.save(best_model_wts, os.path.join(args.model_dir, 'model.pth'))
    logger.info("Model successfully saved at: {}".format(args.model_dir))

    return model


model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 4)

model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

# All parameters are optimized (the whole network is fine-tuned).
optimizer_ft = optim.SGD(model_ft.parameters(), lr=1e-3, momentum=0.9)

# One-cycle learning rate schedule; it needs the total number of steps (epochs * batches per epoch).
scheduler = lr_scheduler.OneCycleLR(optimizer_ft, max_lr=0.01,
                                    steps_per_epoch=len(dataloaders['train']),
                                    epochs=args.num_epochs)
# exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=5,
#                                        gamma=0.1)

model_ft = train_model(model_ft, criterion, optimizer_ft,
                       scheduler,
                       num_epochs=args.num_epochs)

Overwriting ./train_pytorch_one_cycle.py
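
The script above relies on a one-cycle learning rate schedule, which is defined over the total number of optimizer steps (epochs times batches per epoch) rather than over epochs. The pure-Python sketch below approximates the shape of that schedule: a cosine warm-up to `max_lr` followed by a cosine decay to a very small value. The defaults `pct_start=0.3`, `div_factor=25` and `final_div_factor=1e4` reflect my understanding of the PyTorch defaults and should be checked against the `OneCycleLR` documentation; the function `one_cycle_lr` and the 100-step example are hypothetical.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate learning rate at a given step of a two-phase one-cycle schedule.

    Assumes total_steps is large enough that the warm-up phase has at least one step.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    return cos_anneal(max_lr, min_lr,
                      (step - warmup_end) / (last_step - warmup_end))


# Hypothetical run: 10 epochs of 10 batches each, i.e. 100 optimizer steps.
total = 10 * 10
for s in (0, 15, 29, 60, 99):
    print(s, one_cycle_lr(s, total, max_lr=0.01))
```

Because the schedule is indexed by optimizer step, a scheduler of this kind is normally advanced once per batch; advancing it once per epoch would leave it near the start of the warm-up for the whole run.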


### 3.2 training job

In [None]:
estimator = PyTorch(entry_point='train_pytorch_one_cycle.py',  # the script written in section 3.1
                    role=role,
                    instance_type='ml.g4dn.xlarge',
                    instance_count=1,
                    framework_version='1.8.0',
                    py_version='py36',
                    max_run = 3*24*60*60,
                    hyperparameters = {'num_epochs': 300, 
                                       'batch_size': 256
                                      }                       
                   )
# s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}'.format(bucket, prefix), content_type='csv')    
#s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}'.format(bucket, prefix), content_type='csv') # SDK v1
estimator.fit('s3://cosmax-head-skin-diagnosis')

2021-12-20 02:18:23 Starting - Starting the training job...
2021-12-20 02:18:47 Starting - Launching requested ML instancesProfilerReport-1639966703: InProgress
......
2021-12-20 02:19:47 Starting - Preparing the instances for training......
2021-12-20 02:20:47 Downloading - Downloading input data...........................................................................
2021-12-20 02:33:26 Training - Downloading the training image...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-20 02:33:43,048 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-12-20 02:33:43,067 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-12-20 02:33:43,075 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-12-20 02:33:43,611 sagemaker-training-toolkit INFO     Invoking us

## 4. Creating the trained model and deploying it (predictor)

In [9]:
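model = PyTorchModel(
    model_data=estimator.model_data,
    name=estimator.latest_training_job.name,  # public accessor for the training job name
    role=role,
    framework_version=estimator.framework_version,
    py_version="py3",
    # The PyTorch inference container imports this script and typically needs it
    # to define model_fn(model_dir), which loads the saved weights.
    entry_point=estimator.entry_point,
#     predictor_cls=ImagePredictor,
)
# predictor.delete_endpoint()
predictor = model.deploy(instance_type='ml.m5.large',
                         initial_instance_count=1)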

------!
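
`estimator.model_data` points to a `model.tar.gz` in S3 that SageMaker builds from whatever the training script wrote to `SM_MODEL_DIR`. Before deploying, it can help to confirm that the archive contains the expected weights file. The sketch below shows one way to list the archive contents with the standard library; it builds a small stand-in archive locally instead of downloading the real one, and `list_model_archive` is a hypothetical helper.

```python
import io
import os
import tarfile
import tempfile


def list_model_archive(path):
    """Return the sorted names of regular files inside a model.tar.gz."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Build a stand-in archive with a placeholder weights file.
# A real archive would first be downloaded from the S3 URI in estimator.model_data.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "model.tar.gz")
    payload = b"placeholder weights"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="model.pth")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    members = list_model_archive(archive)
    print(members)
```

If the listed file name differs from what the inference code tries to load, the endpoint is likely to fail at model loading time, so this check is cheap insurance before calling `deploy`.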