# Using Batch Transform
- Batch transform reuses the same inference.py that was used when deploying the SageMaker endpoint.
    - So, to serve both the endpoint and batch transform from a single script, it seems best to add a branch in the code.
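One way to add that branch is sketched below. The helper names (`is_batch_transform`, `parse_records`, `format_predictions`) are hypothetical and not part of the original inference.py. The sketch also assumes that SageMaker sets the `SAGEMAKER_BATCH` environment variable to `true` inside batch transform containers, and that a `MultiRecord` request with `split_type="Line"` arrives as one JSON document per line.

```python
import json
import os


def is_batch_transform(environ=os.environ):
    """Guess whether this container is serving a batch transform job.

    Assumption: SageMaker sets SAGEMAKER_BATCH=true for batch transform.
    """
    return environ.get("SAGEMAKER_BATCH", "false").lower() == "true"


def parse_records(request_body):
    """Turn a request body into a list of records.

    An endpoint call usually sends a single JSON document, while a
    MultiRecord batch request sends one JSON document per line.
    """
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    try:
        parsed = json.loads(request_body)  # single document (endpoint style)
    except json.JSONDecodeError:
        # Several documents separated by newlines (batch style).
        return [json.loads(line) for line in request_body.splitlines() if line.strip()]
    return parsed if isinstance(parsed, list) else [parsed]


def format_predictions(predictions, batch):
    """Serialize predictions for the caller.

    Batch mode writes one JSON document per line so each output line can be
    matched with its input line; endpoint mode returns a single JSON array.
    """
    if batch:
        return "\n".join(json.dumps(p, ensure_ascii=False) for p in predictions)
    return json.dumps(predictions, ensure_ascii=False)
```

In inference.py these helpers would probably be called from the `input_fn` and `output_fn` handlers, with `batch=is_batch_transform()` deciding the output layout.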

## 1. A Quick Look at Batch Transform
- Basically, the BYOM endpoint and

## 2. Batch Transform with SageMaker SDK

In [1]:
import os
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()
instance_type = "ml.g4dn.xlarge"
model_artifact_path = "s3://kdw-sagemaker/model/pytorch3/model.tar.gz"

- Define the PyTorchModel with the SageMaker SDK
    - Same as when deploying an endpoint.

In [4]:
model = PyTorchModel(
    entry_point="inference.py",  # file name of the inference script
    role=role,  # IAM execution role
    model_data=model_artifact_path,  # S3 path of the model artifact
    framework_version="1.8.1",  # PyTorch version
    py_version="py3"  # Python version
)

- Define the transformer

In [5]:
transformer = model.transformer(
    instance_count=1,
    instance_type=instance_type,
    output_path="s3://kdw-sagemaker/data",
    strategy="MultiRecord",
    assemble_with="Line",
    accept="application/json",
)

In [6]:
input_location = "s3://kdw-sagemaker/data/1000_row.json"

In [7]:
transformer.transform(
    input_location, 
    split_type="Line",
    content_type="application/json",
    job_name="kdw-batch-test-1",
    input_filter = "$.text",
    join_source="Input",
    output_filter="$"
)

........................................
Collecting transformers==4.6.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting regex!=2019.12.17
  Downloading regex-2021.4.4-cp36-cp36m-manylinux2014_x86_64.whl (722 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting importlib-metadata
  Downloading importlib_metadata-4.3.0-py3-none-any.whl (16 kB)
Collecting zipp>=0.5
  Downloading zipp-3.4.1-py3-none-any.whl (5.2 kB)
Collecting click
  Downloading click-8.0.1-py3-none-any.whl (97 kB)
Installing collected p

---
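The `input_filter`, `join_source` and `output_filter` arguments passed to `transform()` can be hard to picture, so here is a rough local emulation for a single JSON Lines record. It is only an illustration of the documented behaviour, not the service's implementation. `emulate_record` and the `predict` callable are hypothetical names, and the sample record is made up.

```python
import json


def emulate_record(input_line, predict, keep_keys=None):
    """Mimic the data-processing steps of a transform job for one record.

    keep_keys=None corresponds to output_filter="$" (keep everything);
    keep_keys=("id", "SageMakerOutput") corresponds to
    output_filter="$['id','SageMakerOutput']".
    """
    record = json.loads(input_line)
    model_input = record["text"]  # input_filter="$.text"
    prediction = predict(model_input)  # stands in for the model container
    # join_source="Input": the prediction is attached to the input record
    # under the "SageMakerOutput" key.
    joined = dict(record, SageMakerOutput=prediction)
    if keep_keys is not None:
        joined = {key: joined[key] for key in keep_keys}
    return json.dumps(joined, ensure_ascii=False)


line = '{"id": "a-1", "text": "hello"}'
print(emulate_record(line, lambda text: [text.upper()]))
print(emulate_record(line, lambda text: [text.upper()], keep_keys=("id", "SageMakerOutput")))
```

Joining only works when the container returns exactly one output per input record, which is why the transformer is created with `assemble_with="Line"`.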

- Load a job that succeeded earlier and run it again with the same logic

In [3]:
from sagemaker.transformer import Transformer

In [5]:
import boto3

In [6]:
runtime = boto3.client('runtime.sagemaker')

In [9]:
tft = Transformer.attach(transform_job_name='kdw-batch-test-1')

In [10]:
input_location = "s3://kdw-sagemaker/data/1000_row2.json"
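Each run needs a `job_name` that has not been used before, because transform job names must be unique within an account and region. A minimal sketch of a timestamp-based name generator follows. `unique_job_name` is a hypothetical helper, and the length and character checks reflect my understanding of the job-name rules (at most 63 characters, letters, digits and hyphens).

```python
import re
import time


def unique_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix so repeated runs get distinct names."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    name = f"{prefix}-{stamp}"
    # Assumed naming rule: alphanumerics separated by hyphens, max 63 chars.
    if len(name) > 63 or not re.fullmatch(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*", name):
        raise ValueError(f"invalid transform job name: {name!r}")
    return name


print(unique_job_name("kdw-batch-test"))
```

The result could be passed as `job_name=unique_job_name("kdw-batch-test")` instead of numbering the jobs by hand.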

In [11]:
tft.transform(
    input_location, 
    split_type="Line",
    content_type="application/json",
    job_name="kdw-batch-test-2",
    input_filter = "$.text",
    join_source="Input",
    output_filter="$"
)

....................................
Collecting transformers==4.6.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting importlib-metadata
  Downloading importlib_metadata-4.3.0-py3-none-any.whl (16 kB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting regex!=2019.12.17
  Downloading regex-2021.4.4-cp36-cp36m-manylinux2014_x86_64.whl (722 kB)
Collecting zipp>=0.5
  Downloading zipp-3.4.1-py3-none-any.whl (5.2 kB)
Collecting click
  Downloading click-8.0.1-py3-none-any.whl (97 kB)
Installing collected packa

In [11]:
import io
import json
import boto3

In [12]:
runtime = boto3.client('runtime.sagemaker')

In [None]:
runtime.invoke_endpoint

In [10]:
transformer.transform

<bound method Transformer.transform of <sagemaker.transformer.Transformer object at 0x7fccfbc5dcc0>>

In [7]:
transformer.transform(
    input_location, 
    split_type="Line",
    content_type="application/json",
    job_name="test-batch-6",
    input_filter = "$.text",
    join_source="Input",
    output_filter="$['id','SageMakerOutput']"
)

.......................................
Collecting transformers==4.6.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting importlib-metadata
  Downloading importlib_metadata-4.3.0-py3-none-any.whl (16 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting regex!=2019.12.17
  Downloading regex-2021.4.4-cp36-cp36m-manylinux2014_x86_64.whl (722 kB)
Collecting zipp>=0.5
  Downloading zipp-3.4.1-py3-none-any.whl (5.2 kB)
Collecting click
  Downloading click-8.0.1-py3-none-any.whl (97 kB)
Installing collected pa

UnexpectedStatusException: Error for Transform job test-batch-5: Failed. Reason: ClientError: See job logs for more information

In [8]:
import json

In [9]:
json.dumps([])

'[]'

In [18]:
isinstance({'sl':'sl'}, list)

False

In [19]:
pip install transformers

Collecting transformers
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting tqdm>=4.27
  Downloading tqdm-4.61.0-py2.py3-none-any.whl (75 kB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tqdm, filelock, tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed 

In [20]:
from transformers import AutoTokenizer

In [21]:
tokenizer = AutoTokenizer.from_pretrained('beomi/kcbert-base')

In [22]:
input_data = ['시발', '개발', '야발']

In [27]:
for each in input_data:
    result = tokenizer(each, return_tensors='pt')
    print(result)

{'input_ids': tensor([[    2, 13552,     3]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
{'input_ids': tensor([[   2, 9981,    3]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}
{'input_ids': tensor([[   2, 2207, 4235,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}


In [None]:
id2label[tmp]

In [29]:
import os

from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig

# Load config, weights, and tokenizer from the local model directory.
model_dir = '../model'
config = AutoConfig.from_pretrained(os.path.join(model_dir, 'config.json'))
model = AutoModelForTokenClassification.from_pretrained(os.path.join(model_dir, 'pytorch_model.bin'), config=config)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_dir)

In [32]:
import numpy as np
# Label names ordered by numeric id, so a predicted id can index the array directly.
id2label = np.array(list(map(lambda x:x[1], sorted(model.config.id2label.items(), key=lambda x: int(x[0])))))

In [44]:
logits = model(**tokenizer('아모레퍼시픽 배고파', return_tensors='pt')).logits
result = id2label[logits.cpu().argmax(axis=-1).numpy()].tolist()

In [46]:
tmp=[]
tmp += result

In [47]:
tmp

[['O', 'ORG-B', 'ORG-B', 'ORG-B', 'O', 'O', 'O', 'O', 'O', 'O']]

In [48]:
tmp += result

In [49]:
tmp

[['O', 'ORG-B', 'ORG-B', 'ORG-B', 'O', 'O', 'O', 'O', 'O', 'O'],
 ['O', 'ORG-B', 'ORG-B', 'ORG-B', 'O', 'O', 'O', 'O', 'O', 'O']]

In [30]:
model.config.id2label

{0: 'O',
 1: 'PER-B',
 10: 'LOC-I',
 11: 'CVL-B',
 12: 'CVL-I',
 13: 'DAT-B',
 14: 'DAT-I',
 15: 'TIM-B',
 16: 'TIM-I',
 17: 'NUM-B',
 18: 'NUM-I',
 19: 'EVT-B',
 2: 'PER-I',
 20: 'EVT-I',
 21: 'ANM-B',
 22: 'ANM-I',
 23: 'PLT-B',
 24: 'PLT-I',
 25: 'MAT-B',
 26: 'MAT-I',
 27: 'TRM-B',
 28: 'TRM-I',
 3: 'FLD-B',
 4: 'FLD-I',
 5: 'AFW-B',
 6: 'AFW-I',
 7: 'ORG-B',
 8: 'ORG-I',
 9: 'LOC-B'}
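The mapping printed above lists the ids in the order 0, 1, 10, 11, ..., which is why the label array is built by sorting on the integer value of each key. A small sketch of that idea without NumPy, using only the first eleven labels from the output above and string keys (an assumption about how the keys look in config.json):

```python
def labels_in_id_order(id2label):
    """Return label names ordered by numeric id, whatever the key type."""
    return [label for _, label in sorted(id2label.items(), key=lambda kv: int(kv[0]))]


def decode(predicted_ids, id2label):
    """Map a batch of predicted id sequences to label sequences."""
    labels = labels_in_id_order(id2label)
    return [[labels[i] for i in row] for row in predicted_ids]


# Subset of the mapping shown above, with string keys in lexicographic order.
id2label_subset = {
    "0": "O", "1": "PER-B", "10": "LOC-I", "2": "PER-I", "3": "FLD-B",
    "4": "FLD-I", "5": "AFW-B", "6": "AFW-I", "7": "ORG-B", "8": "ORG-I",
    "9": "LOC-B",
}
print(decode([[0, 7, 7, 7, 0]], id2label_subset))
```

Sorting the keys as strings instead would place `LOC-I` (id 10) at position 2, and every label after it would be shifted.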

In [26]:
tokenizer(each, return_tensors='pt')

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
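This error usually shows up when several texts with different token counts are tokenized together with `return_tensors='pt'`, since the rows of a tensor must all have the same length. The message itself points to `padding=True` and `truncation=True`, i.e. a call like `tokenizer(input_data, padding=True, truncation=True, return_tensors='pt')`. The sketch below shows in plain Python what padding does to the token ids printed earlier. `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption about this tokenizer's vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to the longest one and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Token ids taken from the per-sentence tokenizer outputs shown earlier.
ids = [[2, 13552, 3], [2, 9981, 3], [2, 2207, 4235, 3]]
padded_ids, mask = pad_batch(ids)
print(padded_ids)
print(mask)
```

After padding, every row has four entries, and the attention mask marks the padded positions with 0 so the model can ignore them.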

In [None]:
# Filter arguments noted for CSV input (a fragment, not a complete call):
#     input_filter="$[1:]",
#     join_source="Input",
#     output_filter="$[1,-1]",

In [None]:
Args:
    instance_count (int): Number of EC2 instances to use.
    instance_type (str): Type of EC2 instance to use, for example,
        'ml.c4.xlarge'.
    strategy (str): The strategy used to decide how to batch records in
        a single request (default: None). Valid values: 'MultiRecord'
        and 'SingleRecord'.
    assemble_with (str): How the output is assembled (default: None).
        Valid values: 'Line' or 'None'.
    output_path (str): S3 location for saving the transform result. If
        not specified, results are stored to a default bucket.
    output_kms_key (str): Optional. KMS key ID for encrypting the
        transform output (default: None).
    accept (str): The accept header passed by the client to
        the inference endpoint. If it is supported by the endpoint,
        it will be the format of the batch transform output.
    env (dict): Environment variables to be set for use during the
        transform job (default: None).
    max_concurrent_transforms (int): The maximum number of HTTP requests
        to be made to each individual transform container at one time.
    max_payload (int): Maximum size of the payload in a single HTTP
        request to the container in MB.
    tags (list[dict]): List of tags for labeling a transform job. If
        none specified, then the tags used for the training job are used
        for the transform job.
    volume_kms_key (str): Optional. KMS key ID for encrypting 

---

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('1000_row.csv')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,id,text
0,483,348869-173514688259,기장 아난티 코브 라메르 점심 코스 후기 안녕하세요 숨니입니다 얼마 전 가족 이벤트...
1,841,365863-173514187718,"* 제 품 협 찬 *오전에는 10도, 오후에는 25도.현타가 제대로 오는 일교차의 ..."
2,842,365863-173497341171,언니 오늘 영상두 잘봤어요!ㅎㅎ 너무 유익했어요ㅠㅜㅜㅜ요즘 고데기 진짜 많이 하는데...
3,843,365863-173513479577,@된장님 원장찌개 배달왔어요 ㅇㅈㅋㅋㅋ
4,859,365863-173513480317,근데 확실히 연기에 집중하던 배우들이랑 CF에 집중하던 배우들이랑은 평가나 수명이 ...
...,...,...,...
995,843,346527-173535865239,아니 뭐야 나빼고 다 1일 전이에요? 나만 늦게 온 거야?? 이런...
996,850,346527-173515523339,네오쿠션 17N 좀 핑크끼 돌아
997,888,346527-173515522025,그럼 둘다 밝기 정도는 비슷해..?
998,1046,346527-173538605731,유난히도 피부가 촉촉하고 피부 결이 예뻐보이는 날이 있다.그 때마다 내가 어떤 스킨...


In [1]:
data_path = "s3://kdw-sagemaker/data/1000_row.csv"
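The notebook does not show how the JSON input used for the transform jobs was produced. Assuming it holds one JSON object per line with the `id` and `text` columns of this CSV (which would match `split_type="Line"` and `input_filter="$.text"`), a conversion could look like the sketch below. `csv_to_json_lines` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text, columns=("id", "text")):
    """Convert CSV text into JSON Lines, keeping only the given columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        json.dumps({column: row[column] for column in columns}, ensure_ascii=False)
        for row in reader
    ]
    return "".join(line + "\n" for line in lines)


sample = 'id,text\na-1,hello\na-2,"x, y"\n'
print(csv_to_json_lines(sample), end="")
```

Writing the result to a file and uploading it to S3 would give an input that batch transform can split line by line.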

## Reference
- https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html
- https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform/batch_transform_associate_predictions_with_input