# Disease Prediction Model Using Electronic Medical Records - Part 5

## Batch Transform

Batch Transform runs inference over an entire dataset stored in S3 without keeping a persistent endpoint running. SageMaker splits the input into smaller mini-batches, which keeps each request within the payload limit and avoids exhausting memory when the dataset is large.

Here is a complete example of applying this feature with SageMaker. We reuse the model that has already been trained and simply attach it to a Batch Transform job to generate the predictions.
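To make the mini-batch idea concrete, the sketch below groups CSV lines into payloads that stay under a size limit. It only illustrates the concept behind splitting by line and sending several records per request; it is not SageMaker's actual implementation, and the function name `iter_mini_batches` and the 24-byte limit are assumptions chosen for the example.

```python
from typing import Iterator, List


def iter_mini_batches(lines: List[str], max_payload_bytes: int) -> Iterator[str]:
    """Group CSV lines into payloads no larger than max_payload_bytes.

    Illustrative only: this mimics the idea of line-based splitting with
    several records per request; it is not SageMaker's real code.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_size > max_payload_bytes:
            raise ValueError("A single record exceeds the payload limit")
        # Flush the current batch when adding this line would exceed the limit
        if batch and size + line_size > max_payload_bytes:
            yield "\n".join(batch) + "\n"
            batch, size = [], 0
        batch.append(line)
        size += line_size
    if batch:
        yield "\n".join(batch) + "\n"


records = ["1.0,2.0,3.0", "4.0,5.0,6.0", "7.0,8.0,9.0"]
payloads = list(iter_mini_batches(records, max_payload_bytes=24))
print(len(payloads))
```

Each record here takes 12 bytes including its newline, so two records fit in one payload and the third goes into a second one.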

## Imports 

In [3]:
import time
import boto3
import sagemaker
import pandas as pd
from sagemaker import get_execution_role
from time import gmtime, strftime

## Load the Data and Define the Parameters

In [4]:
# Parameters
session = boto3.Session()
sagemaker_execution_role = get_execution_role()
sagemaker_session = sagemaker.session.Session()
sagemaker_client = boto3.client('sagemaker', region_name = session.region_name)
s3_client = boto3.client('s3')

In [5]:
# Change this to the name of your bucket
s3_bucket = 'krupck-bucket-bloodpressure'
prefix = 'dados'

In [6]:
batch_input = f's3://{s3_bucket}/{prefix}/'
batch_input

's3://krupck-bucket-bloodpressure/dados/'

In [7]:
batch_output = f's3://{s3_bucket}/{prefix}/'
batch_output

's3://krupck-bucket-bloodpressure/dados/'
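Because the input is given as an S3 prefix, every object under that prefix is treated as input to the job, and here the output is written to the same prefix. A quick way to see what the job would consume is to list the keys first. The helper below is a minimal sketch; the name `list_input_keys` is an assumption, and it takes the S3 client as an argument so it can be tried with any boto3 S3 client.

```python
def list_input_keys(s3_client, bucket, prefix):
    """Return every object key found under s3://bucket/prefix."""
    keys = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

In this notebook it could be called as `list_input_keys(s3_client, s3_bucket, f'{prefix}/')`. If the listing shows training files, files with a header row, or earlier `.out` files, those would also be sent to the model.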

In [13]:
current_timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

In [14]:
TRAINING_JOB_NAME = 'classifier-2022-07-22-13-32-08-690'  
MODEL_NAME = f'modelo-xgboost-model-{current_timestamp}'
BATCH_JOB_NAME = f'modelo-xgboost-batch-job-{current_timestamp}'
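Appending a timestamp keeps the names unique across runs. As a hedged sketch, the names can also be validated before calling the API. The limits used below (at most 63 characters, only letters, digits and hyphens, no leading or trailing hyphen) are my understanding of the SageMaker naming rules and should be checked against the API reference; `make_resource_name` is a hypothetical helper.

```python
import re
from time import gmtime, strftime

# Assumption: names allow up to 63 characters, made of letters, digits and
# hyphens, and may not start or end with a hyphen.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")
_MAX_NAME_LENGTH = 63


def make_resource_name(base: str, timestamp: str) -> str:
    """Join base and timestamp and reject names the API would likely refuse."""
    name = f"{base}-{timestamp}"
    if len(name) > _MAX_NAME_LENGTH or not _NAME_PATTERN.match(name):
        raise ValueError(f"Invalid SageMaker resource name: {name!r}")
    return name


timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(make_resource_name("modelo-xgboost-batch-job", timestamp))
```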

## Creating the Model

In [15]:
# Image URI (the model only serves predictions, so request the inference image)
container_uri = sagemaker.image_uris.retrieve(region = session.region_name, 
                                              framework = 'xgboost', 
                                              version = '1.0-1', 
                                              image_scope = 'inference')

In [16]:
# Information about the training job
info = sagemaker_client.describe_training_job(TrainingJobName = TRAINING_JOB_NAME)
info

{'TrainingJobName': 'classifier-2022-07-22-13-32-08-690',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-2:351371806175:training-job/classifier-2022-07-22-13-32-08-690',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://krupck-bucket-bloodpressure/artefatos/classifier-2022-07-22-13-32-08-690/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'num_round': '100', 'objective': 'binary:logistic'},
 'AlgorithmSpecification': {'TrainingImage': '257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:f1_binary',
    'Regex': '.*\\[[0-9]+\\].*#011validation-f1_binary:([-+]?[0-9]*\\.?[0-9]+(?:[eE]

In [17]:
# Model artifacts
model_artifact_url = info['ModelArtifacts']['S3ModelArtifacts']
model_artifact_url

's3://krupck-bucket-bloodpressure/artefatos/classifier-2022-07-22-13-32-08-690/output/model.tar.gz'

In [18]:
# Primary container
primary_container = {'Image': container_uri, 'ModelDataUrl': model_artifact_url}

In [19]:
# Create the model
response = sagemaker_client.create_model(ModelName = MODEL_NAME,
                                         ExecutionRoleArn = sagemaker_execution_role,
                                         PrimaryContainer = primary_container)

In [20]:
response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:351371806175:model/modelo-xgboost-model-2022-07-22-14-23-25',
 'ResponseMetadata': {'RequestId': '0f89d85a-c5de-496d-bb70-ffa1b93d0ffe',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0f89d85a-c5de-496d-bb70-ffa1b93d0ffe',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '102',
   'date': 'Fri, 22 Jul 2022 14:23:40 GMT'},
  'RetryAttempts': 0}}

## Batch Transform for Inference

In [21]:
# Request with the configuration needed to run the job
request = {
    "TransformJobName": BATCH_JOB_NAME,
    "ModelName": MODEL_NAME,
    "BatchStrategy": "MultiRecord",
    "TransformOutput": {
        "S3OutputPath": batch_output
    },
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": batch_input 
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None"
    },
    "TransformResources": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1
    }
}
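As an alternative sketch, the same request can be built by a small function that refuses to write the output into the input prefix and sets a few optional fields explicitly. The function name `build_transform_request`, the `dados/input/` and `dados/output/` prefixes, and the example job and model names are hypothetical. `MaxPayloadInMB`, `Accept` and `AssembleWith` are optional fields of `CreateTransformJob`, and the values chosen here are assumptions rather than settings taken from this notebook.

```python
def build_transform_request(job_name, model_name, input_uri, output_uri,
                            instance_type="ml.m5.xlarge", max_payload_mb=6):
    """Build a CreateTransformJob request with distinct input and output prefixes."""
    if input_uri.rstrip("/") == output_uri.rstrip("/"):
        raise ValueError("Use different S3 prefixes for input and output")
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "BatchStrategy": "MultiRecord",
        "MaxPayloadInMB": max_payload_mb,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
            "CompressionType": "None",
        },
        "TransformOutput": {
            "S3OutputPath": output_uri,
            "Accept": "text/csv",
            "AssembleWith": "Line",  # one prediction record per output line
        },
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


# Hypothetical names and prefixes, used only for illustration
example_request = build_transform_request(
    "example-batch-job",
    "example-model",
    "s3://example-bucket/dados/input/",
    "s3://example-bucket/dados/output/",
)
print(example_request["TransformOutput"]["S3OutputPath"])
```

Keeping the two prefixes apart means a rerun does not pick up earlier `.out` files as input.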

In [22]:
# Create the job
response = sagemaker_client.create_transform_job(**request)
response

{'TransformJobArn': 'arn:aws:sagemaker:us-east-2:351371806175:transform-job/modelo-xgboost-batch-job-2022-07-22-14-23-25',
 'ResponseMetadata': {'RequestId': '9a9b1178-618b-4836-bab9-c5835bc493d2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9a9b1178-618b-4836-bab9-c5835bc493d2',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '121',
   'date': 'Fri, 22 Jul 2022 14:23:51 GMT'},
  'RetryAttempts': 0}}

In [23]:
while True:
    response = sagemaker_client.describe_transform_job(TransformJobName = BATCH_JOB_NAME)
    status = response['TransformJobStatus']
    if status == 'Completed':
        print("Job finished with status: {}".format(status))
        break
    # 'Stopped' is also a terminal state; without this check the loop would never end
    if status in ('Failed', 'Stopped'):
        message = response.get('FailureReason', 'no reason reported')
        print('The job failed with the following error: {}'.format(message))
        raise Exception('Transform job failed')
    print("Current job status: {}".format(status))
    time.sleep(30)

Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
Current job status: InProgress
The job failed with the following error: ClientError: See job logs for more information


Exception: Transform job failed
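The failure reason only points to the job logs. Transform jobs write their logs to CloudWatch Logs, so one way to look for the underlying error is to read the log streams whose names start with the job name. The helper below is a minimal sketch: the function name `fetch_transform_job_logs` is an assumption, and the log group `/aws/sagemaker/TransformJobs` is the one SageMaker normally uses for transform jobs, which is worth confirming in your account.

```python
def fetch_transform_job_logs(logs_client, job_name, limit=50):
    """Return up to `limit` recent log messages per stream of a transform job."""
    log_group = "/aws/sagemaker/TransformJobs"
    # Stream names of a transform job start with the job name
    streams = logs_client.describe_log_streams(
        logGroupName=log_group, logStreamNamePrefix=job_name
    )["logStreams"]
    messages = []
    for stream in streams:
        events = logs_client.get_log_events(
            logGroupName=log_group,
            logStreamName=stream["logStreamName"],
            limit=limit,
            startFromHead=False,  # read the most recent events
        )["events"]
        messages.extend(event["message"] for event in events)
    return messages
```

In this notebook it could be called as `fetch_transform_job_logs(boto3.client('logs'), BATCH_JOB_NAME)`. With CSV input, typical causes worth checking are a header row, a label column left in the file, or unrelated files under the input prefix, but the logs are what confirm the actual reason.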

## Evaluation

In [24]:
key = f'{prefix}/batch_test.csv.out'

In [26]:
# Read the predictions written by the batch job (the object only exists if the job completed)
obj = s3_client.get_object(Bucket = s3_bucket, Key = key)
results_df = pd.read_csv(obj['Body'], names = ['Predictions'])

NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

In [None]:
results_df
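Once a job does complete, the `.out` file holds the model scores. Since the training job used the `binary:logistic` objective, each score is a probability that still has to be turned into a class label. The sketch below assumes the scores arrive as text separated by commas or newlines and uses a 0.5 cutoff; both the output layout and the threshold are assumptions, and `parse_predictions` is a hypothetical helper.

```python
def parse_predictions(raw_text, threshold=0.5):
    """Convert raw score text into (probabilities, class labels).

    Assumption: scores are separated by commas and/or newlines.
    """
    tokens = raw_text.replace("\n", ",").split(",")
    scores = [float(token) for token in tokens if token.strip()]
    labels = [int(score >= threshold) for score in scores]
    return scores, labels


scores, labels = parse_predictions("0.1\n0.9,0.5\n")
print(labels)
```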