# Case 3: Advanced

## 1. Creating the dataset

For this stage of the work, a synthetic dataset is created to keep the process simple. The faker library is imported for this purpose, although the generator below draws its values with NumPy's random functions. The dataset of houses contains the columns: number of bedrooms, number of bathrooms, house size (in square feet), year built, and price. The goal is to infer the price using the Random Forest Regressor.

In [124]:
import math
import os

from faker import Faker
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

In [125]:
fake = Faker()

def generate_houses(n_samples):
    data = []
    for _ in range(n_samples):
        # Generate random property data
        num_bedrooms = np.random.randint(1, 6)
        num_bathrooms = np.random.randint(1, 4)
        square_feet = np.random.randint(500, 5000)
        year_built = np.random.randint(1900, 2023)
        price = round(np.random.uniform(50000, 1000000), 2)
        
        data.append([num_bedrooms, num_bathrooms, square_feet, year_built, price])
        
    columns = ['Num_Bedrooms', 'Num_Bathrooms', 'Square_Feet', 'Year_Built', 'Price']
    return pd.DataFrame(data, columns=columns)
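
The generator above draws from NumPy's global random state, which the `random.seed(42)` call later in the notebook does not affect (that call seeds Python's `random` module only). A minimal sketch of a reproducible variant, assuming a hypothetical `generate_houses_seeded` helper and an arbitrary seed, could look like this:

```python
import numpy as np
import pandas as pd


def generate_houses_seeded(n_samples, seed=42):
    """Hypothetical reproducible variant of generate_houses (not in the original notebook)."""
    rng = np.random.default_rng(seed)  # local generator, independent of the global state
    return pd.DataFrame(
        {
            # Upper bounds are exclusive, matching np.random.randint in the original
            "Num_Bedrooms": rng.integers(1, 6, size=n_samples),
            "Num_Bathrooms": rng.integers(1, 4, size=n_samples),
            "Square_Feet": rng.integers(500, 5000, size=n_samples),
            "Year_Built": rng.integers(1900, 2023, size=n_samples),
            "Price": np.round(rng.uniform(50000, 1000000, size=n_samples), 2),
        }
    )


df_a = generate_houses_seeded(10)
df_b = generate_houses_seeded(10)
print(df_a.equals(df_b))  # the same seed yields the same rows
```

As in the original function, the price is sampled independently of the other columns. The values will not match those produced by the original generator, since a different random stream is used.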


In [126]:
df = generate_houses(1506)

# Get training columns
train_cols = list(df.columns)
del train_cols[-1]
train_cols

# Split data
training_index = math.floor(0.8 * df.shape[0])
x_train, y_train = df[train_cols][:training_index], df["Price"][:training_index]
x_test, y_test = df[train_cols][training_index:], df["Price"][training_index:]

# Scale price
y_train = y_train / 100000
y_test = y_test / 100000

# Standardize data (fit the scaler on the training split only, then reuse it for the test split)
scaler = StandardScaler()
x_train_np = scaler.fit_transform(x_train)
x_test_np = scaler.transform(x_test)
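
A common convention is to compute the scaling statistics on the training split only and apply them unchanged to the test split, so both splits share one coordinate system. The sketch below illustrates the idea with plain NumPy and toy values; the helper name `standardize_with_train_stats` is hypothetical.

```python
import numpy as np


def standardize_with_train_stats(x_train, x_test):
    """Scale both splits using the mean and standard deviation of the training split."""
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)  # population std (ddof=0), the same definition StandardScaler uses
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x_train - mean) / std, (x_test - mean) / std


# Toy data for illustration only (not from the notebook)
train = np.array([[1.0, 500.0], [3.0, 1500.0], [5.0, 2500.0]])
test = np.array([[3.0, 1500.0], [7.0, 3500.0]])
train_scaled, test_scaled = standardize_with_train_stats(train, test)
print(train_scaled.mean(axis=0))  # approximately zero for each column
print(test_scaled[0])  # a test row equal to the training mean maps to zeros
```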

In [127]:
x_train

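| | Num_Bedrooms | Num_Bathrooms | Square_Feet | Year_Built |
|---|---|---|---|---|
| 0 | 2 | 2 | 3370 | 2022 |
| 1 | 5 | 3 | 4051 | 1966 |
| 2 | 4 | 2 | 2420 | 2015 |
| 3 | 2 | 1 | 2729 | 1990 |
| 4 | 2 | 2 | 2851 | 1947 |
| ... | ... | ... | ... | ... |
| 1199 | 1 | 3 | 4690 | 1917 |
| 1200 | 3 | 1 | 2954 | 1924 |
| 1201 | 4 | 3 | 860 | 1951 |
| 1202 | 4 | 3 | 3995 | 1966 |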


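After creating the dataset and splitting it into training and test data, we normalize the price and save the data to train.csv and test.csv. Note that the code below divides the price by 100,000 a second time (it was already scaled above), so the saved labels, and later the predictions, are on the order of 1e-5.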
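
Because the label ends up as the raw price divided by 100,000 twice, a prediction has to be multiplied by 10^10 to be read as a price again. A small sketch of the conversion in both directions, using hypothetical helper names and an illustrative price:

```python
import math

PRICE_SCALE = 100_000  # divisor used in the notebook
TOTAL_SCALE = PRICE_SCALE ** 2  # the divisor is applied twice


def to_label(price):
    """Convert a raw price into the label stored in train.csv / test.csv."""
    return price / PRICE_SCALE / PRICE_SCALE


def to_price(label):
    """Convert a stored label (or a model prediction) back to a raw price."""
    return label * TOTAL_SCALE


label = to_label(832112.5)  # illustrative price
print(label)  # on the order of 1e-5
print(math.isclose(to_price(label), 832112.5))  # the round trip recovers the price
```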

In [128]:
train_df = pd.DataFrame(data=x_train_np)
train_df.columns = x_train.columns
train_df["Price"] = y_train / 100000
first_col = train_df.pop("Price")
train_df.insert(0, "Price", first_col)

test_df = pd.DataFrame(data=x_test_np)
test_df.columns = x_test.columns
test_df["Price"] = y_test.reset_index(drop=True) / 100000
first_col = test_df.pop("Price")
test_df.insert(0, "Price", first_col)

In [155]:
# Save as CSV
train_df.to_csv(f"{train_dir}/train.csv", header=False, index=False)
test_df.to_csv(f"{test_dir}/test.csv", header=False, index=False)

# Save as Numpy
np.save(os.path.join(train_dir, "x_train.npy"), x_train_np)
np.save(os.path.join(test_dir, "x_test.npy"), x_test_np)
np.save(os.path.join(train_dir, "y_train.npy"), y_train)
np.save(os.path.join(test_dir, "y_test.npy"), y_test)
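
The CSV files are written without a header or index, with the label moved to the first column. A minimal sketch of reading such a file back, using toy values and a temporary directory (both are illustrative assumptions, not the notebook's data):

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same layout as train_df: label first, then the features
toy_df = pd.DataFrame(
    {
        "Price": [8.3e-05, 2.1e-05],
        "Num_Bedrooms": [-0.7, 1.4],
        "Num_Bathrooms": [0.0, 1.2],
        "Square_Feet": [0.5, 1.0],
        "Year_Built": [1.7, 0.1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    toy_df.to_csv(path, header=False, index=False)
    # header=None is required because the file has no header row
    loaded = pd.read_csv(path, header=None)

y_loaded = loaded.iloc[:, 0]  # label column
X_loaded = loaded.iloc[:, 1:]  # feature columns
print(loaded.shape)
```

Since the column names are not stored, any consumer of these files has to rely on the column order.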

In [156]:
y_train.head()

0    8.321125
1    2.151729
2    4.659216
3    6.341849
4    8.358060
Name: Price, dtype: float64

In [157]:
y_test.head()

1204    8.162037
1205    3.696705
1206    5.628283
1207    1.780544
1208    7.165458
Name: Price, dtype: float64

In [158]:
train_cols = list(train_df.columns)
del train_cols[0]
train_cols

['Num_Bedrooms', 'Num_Bathrooms', 'Square_Feet', 'Year_Built']

## 2. Standard training

After creating the dataset, we train the model with the parameters max_depth = 20, n_jobs = 4, and n_estimators = 120. In this step we use RandomForestRegressor directly, without SageMaker.

In [168]:
y = train_df.iloc[:, 0]  # First column (label)
X = train_df.iloc[:, 1:]  # All other columns (features)

y_test = test_df.iloc[:, 0]
X_test = test_df.iloc[:, 1:]

In [169]:
# Initialize the RandomForestRegressor model
rf_model = RandomForestRegressor(max_depth=20, n_jobs=4, n_estimators=120)

# Train the model
rf_model.fit(X, y)
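
`RandomForestRegressor` relies on random bootstrap sampling, so two fits with the same hyperparameters generally produce slightly different forests unless `random_state` is fixed. The notebook does not set it. A minimal sketch of what fixing it would look like, on synthetic data, with an arbitrary seed and fewer trees than the notebook uses (all assumptions of this sketch):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.uniform(size=200)

# Fewer trees than in the notebook, to keep the demo fast
params = {"max_depth": 20, "n_estimators": 20, "random_state": 42}
model_a = RandomForestRegressor(**params).fit(X_demo, y_demo)
model_b = RandomForestRegressor(**params).fit(X_demo, y_demo)

# With the same seed and data, both forests make matching predictions
print(np.allclose(model_a.predict(X_demo), model_b.predict(X_demo)))
```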

After training, we run predictions on the test data.

In [170]:
y_pred = rf_model.predict(X_test)

In [171]:
print(y_pred)

[6.71895730e-05 5.09915920e-05 3.19157857e-05 6.50775796e-05
 4.31765861e-05 4.68040354e-05 2.85137040e-05 3.76599060e-05
 5.54587877e-05 6.05260976e-05 6.18996500e-05 5.56836846e-05
 6.27650851e-05 4.23936398e-05 4.71732706e-05 3.90237012e-05
 5.59721054e-05 6.66255654e-05 4.21362782e-05 3.69003530e-05
 4.54945872e-05 6.46247917e-05 3.81500679e-05 3.64883760e-05
 2.60584935e-05 6.36955069e-05 4.73941164e-05 5.77327220e-05
 4.86821423e-05 4.58967946e-05 5.54864683e-05 5.94634429e-05
 4.96776674e-05 5.87324098e-05 6.87617060e-05 5.55836733e-05
 4.21686106e-05 5.38703959e-05 6.49921173e-05 5.21016708e-05
 3.23794759e-05 5.59084886e-05 5.16427457e-05 7.06142680e-05
 6.71263000e-05 5.56204418e-05 6.05145680e-05 7.19746991e-05
 6.68237321e-05 5.67892098e-05 4.21321645e-05 5.54716757e-05
 5.22799569e-05 5.50608539e-05 5.47761188e-05 4.97526826e-05
 6.28156178e-05 4.58652864e-05 4.07939598e-05 6.48083240e-05
 5.97987484e-05 5.95175995e-05 5.36871378e-05 3.79821212e-05
 4.21366793e-05 3.547624
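
The notebook imports `mean_squared_error` but never scores the predictions. A minimal sketch of how the error could be computed, here with NumPy directly and toy values (the arrays below are illustrative, not the notebook's data):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]
print(rmse(y_true_demo, y_pred_demo))
```

In the notebook this would be called as `rmse(y_test, y_pred)`, and the result would be expressed in the scaled label units rather than in currency.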

## 3. Training with SageMaker script mode

Now we run the training with SageMaker. First, we import the libraries the project needs.

In [162]:
import sagemaker
import subprocess
import sys
import random
import math
import pandas as pd
import os
import boto3
import numpy as np
from sklearn.preprocessing import StandardScaler
from sagemaker.pytorch import PyTorch
from sagemaker.xgboost import XGBoost
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.serializers import NumpySerializer, JSONSerializer, CSVSerializer
from sagemaker.deserializers import NumpyDeserializer, JSONDeserializer
from sagemaker.predictor import Predictor

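Because this implementation runs in a SageMaker notebook, we use SageMaker's default session, bucket, and role. If you run it in any other notebook, the session is set up manually from your AWS account details.

In addition, we define the S3 paths and choose the training mode: enable_local_mode_training is set to False, so training runs on a SageMaker-managed instance rather than in Local Mode.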

In [163]:
random.seed(42)

# Useful SageMaker variables
try:
    # You're using a SageMaker notebook
    sess = sagemaker.Session()
    bucket = sess.default_bucket()
    role = sagemaker.get_execution_role()
except ValueError:
    # You're using a notebook somewhere else
    print("Setting role and SageMaker session manually...")
    bucket = "bobby-demo"
    region = "us-east-1"

    iam = boto3.client("iam")
    sagemaker_client = boto3.client("sagemaker")

    # get_role expects the role name, not the full ARN; change this to your role name
    sagemaker_execution_role_name = "LabRole"
    role = iam.get_role(RoleName=sagemaker_execution_role_name)["Role"]["Arn"]
    boto3.setup_default_session(region_name=region, profile_name="default")
    sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)

# Local data paths
train_dir = os.path.join(os.getcwd(), "data/train")
test_dir = os.path.join(os.getcwd(), "data/test")
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

# Data paths in S3
s3_prefix = "script-mode-workflow"
csv_s3_prefix = f"{s3_prefix}/csv"
csv_s3_uri = f"s3://{bucket}/{s3_prefix}/csv"
numpy_train_s3_prefix = f"{s3_prefix}/numpy/train"
numpy_train_s3_uri = f"s3://{bucket}/{numpy_train_s3_prefix}"
numpy_test_s3_prefix = f"{s3_prefix}/numpy/test"
numpy_test_s3_uri = f"s3://{bucket}/{numpy_test_s3_prefix}"
csv_train_s3_uri = f"{csv_s3_uri}/train"
csv_test_s3_uri = f"{csv_s3_uri}/test"

# Enable Local Mode training
enable_local_mode_training = False

# Endpoint names
sklearn_endpoint_name = "randomforestregressor-endpoint"

In [164]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json
!/bin/bash ./local_mode_setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


Next, we upload the data to S3.

In [165]:
s3_resource_bucket = boto3.Session().resource("s3").Bucket(bucket)
s3_resource_bucket.Object(os.path.join(csv_s3_prefix, "train.csv")).upload_file(
    "data/train/train.csv"
)
s3_resource_bucket.Object(os.path.join(csv_s3_prefix, "test.csv")).upload_file("data/test/test.csv")
s3_resource_bucket.Object(os.path.join(numpy_train_s3_prefix, "x_train.npy")).upload_file(
    "data/train/x_train.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_train_s3_prefix, "y_train.npy")).upload_file(
    "data/train/y_train.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_test_s3_prefix, "x_test.npy")).upload_file(
    "data/test/x_test.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_test_s3_prefix, "y_test.npy")).upload_file(
    "data/test/y_test.npy"
)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
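
The CSV files are uploaded under `{csv_s3_prefix}/train.csv` and `{csv_s3_prefix}/test.csv`, while the training channels defined earlier point to `{csv_s3_uri}/train` and `{csv_s3_uri}/test`. These line up because a prefix-style S3 input selects every object whose key starts with the given prefix. A small sketch of that matching, using a hypothetical `keys_for_channel` helper and a placeholder bucket name:

```python
from urllib.parse import urlparse


def keys_for_channel(channel_uri, object_keys):
    """Return the object keys that a prefix-style S3 channel URI would select."""
    prefix = urlparse(channel_uri).path.lstrip("/")
    return [key for key in object_keys if key.startswith(prefix)]


bucket = "example-bucket"  # placeholder, not the notebook's bucket
uploaded_keys = [
    "script-mode-workflow/csv/train.csv",
    "script-mode-workflow/csv/test.csv",
    "script-mode-workflow/numpy/train/x_train.npy",
]
train_channel = f"s3://{bucket}/script-mode-workflow/csv/train"
test_channel = f"s3://{bucket}/script-mode-workflow/csv/test"

print(keys_for_channel(train_channel, uploaded_keys))
print(keys_for_channel(test_channel, uploaded_keys))
```

The same rule means that any other object whose key starts with the prefix would also be pulled into the channel, so the prefixes are best kept specific.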


This is one of the most important parts of the process. We define the model's hyperparameters and the location of the script that contains the model, along with the role (the default one) and the estimator for the Random Forest Regressor. We then call .fit on the estimator, which starts the training process through the script.
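
The entry point `scripts/script.py` is not shown in this notebook. The sketch below is a hypothetical example of what a script-mode entry point for this job could look like, not the author's file. It assumes the hyperparameters arrive as command-line flags, that the data and model locations come from the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables, that the training file appears as `train.csv` inside the channel directory, and that the model is saved as `model.joblib` (a file name chosen for this sketch).

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def parse_args(argv=None):
    """Read hyperparameters (passed as CLI flags) and the SageMaker data/model paths."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=20)
    parser.add_argument("--n_jobs", type=int, default=4)
    parser.add_argument("--n_estimators", type=int, default=120)
    # SageMaker exposes these locations through environment variables
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    return parser.parse_args(argv)


def train(args):
    """Fit the forest on train.csv (label in column 0) and save it to model_dir."""
    data = pd.read_csv(os.path.join(args.train, "train.csv"), header=None)
    model = RandomForestRegressor(
        max_depth=args.max_depth, n_jobs=args.n_jobs, n_estimators=args.n_estimators
    )
    model.fit(data.iloc[:, 1:], data.iloc[:, 0])
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the saved model; the scikit-learn serving container calls this at startup."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

A real script would end with an `if __name__ == "__main__":` guard that calls `train(parse_args())`, so that the training container runs it when the job starts.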

In [166]:
hyperparameters = {"max_depth": 20, "n_jobs": 4, "n_estimators": 120}

if enable_local_mode_training:
    train_instance_type = "local"
    inputs = {"train": f"file://{train_dir}", "test": f"file://{test_dir}"}
else:
    train_instance_type = "ml.c5.xlarge"
    inputs = {"train": csv_train_s3_uri, "test": csv_test_s3_uri}

estimator_parameters = {
    "entry_point": "script.py",
    "source_dir": "scripts",
    "framework_version": "1.2-1",
    "py_version": "py3",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "randomforestregressor-model",
}

estimator = SKLearn(**estimator_parameters)
estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: randomforestregressor-model-2024-09-21-16-37-37-105


2024-09-21 16:37:38 Starting - Starting the training job...
2024-09-21 16:37:52 Starting - Preparing the instances for training...
2024-09-21 16:38:35 Downloading - Downloading the training image.....
2024-09-21 16:39:20,761 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-09-21 16:39:20,763 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-21 16:39:20,766 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-21 16:39:20,784 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-09-21 16:39:21,012 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-21 16:39:21,015 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-21 16:39:21,032 sagemaker-training-toolkit INFO     No GPUs detected (normal if no

Next, we deploy the model so that we can run predictions from this notebook.

In [167]:
existing_endpoints = sess.sagemaker_client.list_endpoints(
    NameContains=sklearn_endpoint_name, MaxResults=30
)["Endpoints"]
if not existing_endpoints:
    sklearn_predictor = estimator.deploy(
        initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=sklearn_endpoint_name
    )
else:
    sklearn_predictor = Predictor(
        endpoint_name="randomforestregressor-endpoint",
        sagemaker_session=sess,
        serializer=NumpySerializer(),
        deserializer=NumpyDeserializer(),
    )

INFO:sagemaker:Creating model with name: randomforestregressor-model-2024-09-21-16-40-19-534
INFO:sagemaker:Creating endpoint-config with name randomforestregressor-endpoint
INFO:sagemaker:Creating endpoint with name randomforestregressor-endpoint


------!
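
`NameContains` performs a substring match, so the check above would also find an endpoint whose name merely contains `randomforestregressor-endpoint`. A sketch of a stricter test on the `list_endpoints` response, using a hypothetical `endpoint_exists` helper and a hand-written response dictionary in place of a real API call:

```python
def endpoint_exists(list_endpoints_response, endpoint_name):
    """Return True only if an endpoint with exactly this name is listed."""
    return any(
        endpoint["EndpointName"] == endpoint_name
        for endpoint in list_endpoints_response.get("Endpoints", [])
    )


# Hand-written stand-in for the response of sess.sagemaker_client.list_endpoints(...)
fake_response = {
    "Endpoints": [
        {"EndpointName": "randomforestregressor-endpoint-v2", "EndpointStatus": "InService"},
    ]
}
print(endpoint_exists(fake_response, "randomforestregressor-endpoint"))
```

Here the only listed endpoint has a longer name, so the exact-match check reports that the target endpoint does not exist, whereas a non-empty `Endpoints` list alone would have suggested otherwise.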

Finally, we run predictions on the test data.

In [172]:
sklearn_predictor.predict(X_test)

array([6.36946667e-05, 4.73584172e-05, 3.30743859e-05, 6.62398355e-05,
       3.85844267e-05, 4.92846646e-05, 2.44018553e-05, 4.04583628e-05,
       5.97943272e-05, 5.67561137e-05, 5.75161597e-05, 5.73321732e-05,
       6.54045049e-05, 4.88782599e-05, 4.12431571e-05, 3.56554252e-05,
       5.16368139e-05, 6.59387115e-05, 4.83896950e-05, 3.32644610e-05,
       4.24086798e-05, 6.62229964e-05, 3.43093210e-05, 3.09252763e-05,
       2.51248722e-05, 6.22523536e-05, 5.13166630e-05, 5.78812184e-05,
       5.09297751e-05, 4.24926240e-05, 6.02300199e-05, 5.96065338e-05,
       4.98769603e-05, 5.93270024e-05, 6.38799707e-05, 5.54067552e-05,
       4.10919593e-05, 5.24808820e-05, 6.56685909e-05, 5.11803905e-05,
       3.46347637e-05, 5.42230774e-05, 4.75933486e-05, 6.72073953e-05,
       7.13422568e-05, 5.43734500e-05, 6.22332548e-05, 7.50584573e-05,
       6.37015031e-05, 5.52043234e-05, 3.47561052e-05, 5.62970212e-05,
       5.21838286e-05, 5.81648850e-05, 5.90992395e-05, 5.04589316e-05,
      

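The results are close to those of the locally trained model, but not identical. Small differences are expected because the local model is trained without a fixed random_state.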
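
To make the comparison concrete, the two prediction arrays can be compared numerically. The sketch below uses only the first four values printed by each model above; the 10% tolerance is an arbitrary choice for illustration.

```python
import numpy as np

# First four predictions printed above (local model, then SageMaker endpoint)
local_pred = np.array([6.71895730e-05, 5.09915920e-05, 3.19157857e-05, 6.50775796e-05])
endpoint_pred = np.array([6.36946667e-05, 4.73584172e-05, 3.30743859e-05, 6.62398355e-05])

abs_diff = np.abs(local_pred - endpoint_pred)
exactly_equal = np.array_equal(local_pred, endpoint_pred)
within_10_percent = np.allclose(local_pred, endpoint_pred, rtol=0.1, atol=0.0)

print(abs_diff.max())  # largest absolute difference among the four values
print(exactly_equal)
print(within_10_percent)
```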

Finally, we delete the endpoint, since we will not use it again.

In [173]:
sklearn_predictor.delete_endpoint(True)

INFO:sagemaker:Deleting endpoint configuration with name: randomforestregressor-endpoint
INFO:sagemaker:Deleting endpoint with name: randomforestregressor-endpoint
