# Creating the dataset

To keep this step simple, we generate a synthetic dataset of houses. The faker library is installed and imported below, although the generator itself draws its values with NumPy. The dataset contains the columns: number of bedrooms, number of bathrooms, house size (square feet), year built, and price. The goal is to predict the price with a Random Forest Regressor.

In [41]:
!pip install faker



In [42]:
import math
import os

import numpy as np
import pandas as pd
from faker import Faker
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [43]:
fake = Faker()

def generate_houses(n_samples):
    """Return a DataFrame with n_samples randomly generated houses."""
    data = []
    for _ in range(n_samples):
        # Generate random property data (randint excludes the upper bound)
        num_bedrooms = np.random.randint(1, 6)
        num_bathrooms = np.random.randint(1, 4)
        square_feet = np.random.randint(500, 5000)
        year_built = np.random.randint(1900, 2023)
        # Price is drawn independently of the other columns
        price = round(np.random.uniform(50000, 1000000), 2)
        
        data.append([num_bedrooms, num_bathrooms, square_feet, year_built, price])
        
    columns = ['Num_Bedrooms', 'Num_Bathrooms', 'Square_Feet', 'Year_Built', 'Price']
    return pd.DataFrame(data, columns=columns)
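The loop above draws every value from NumPy's global random state, so each run produces a different dataset. As a minimal sketch, assuming reproducibility is wanted, the same columns can be drawn with one vectorized call per column from a seeded generator. The function name `generate_houses_seeded` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def generate_houses_seeded(n_samples, seed=42):
    """Hypothetical variant of generate_houses that uses a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            # integers() excludes the upper bound, like np.random.randint
            "Num_Bedrooms": rng.integers(1, 6, size=n_samples),
            "Num_Bathrooms": rng.integers(1, 4, size=n_samples),
            "Square_Feet": rng.integers(500, 5000, size=n_samples),
            "Year_Built": rng.integers(1900, 2023, size=n_samples),
            # Price is still drawn independently of the other columns
            "Price": np.round(rng.uniform(50000, 1000000, size=n_samples), 2),
        }
    )


houses = generate_houses_seeded(5)
print(houses.shape)
```

Because the price is sampled independently of the features in both versions, a model trained on this data should not be expected to do much better than a constant prediction.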


In [44]:
df = generate_houses(1506)

# Get training columns
train_cols = list(df.columns)
del train_cols[-1]
train_cols

# Split data
training_index = math.floor(0.8 * df.shape[0])
x_train, y_train = df[train_cols][:training_index], df["Price"][:training_index]
x_test, y_test = df[train_cols][training_index:], df["Price"][training_index:]

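# Scale price to units of 100,000
y_train = y_train / 100000
y_test = y_test / 100000

# Standardize features
# Note: each split is standardized with its own mean and standard deviation here
x_train_np = StandardScaler().fit_transform(x_train)
x_test_np = StandardScaler().fit_transform(x_test)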
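The cell above fits a separate `StandardScaler` on each split. A common alternative, sketched here under the assumption that the test set should be transformed with the training statistics, is to fit the scaler once on the training features and reuse it. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices standing in for x_train and x_test (illustrative values)
x_train_toy = np.array([[1.0, 500.0], [3.0, 1500.0], [5.0, 2500.0]])
x_test_toy = np.array([[3.0, 1500.0], [7.0, 3500.0]])

# Fit on the training split only, then apply the same mean/std to the test split
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_toy)
x_test_scaled = scaler.transform(x_test_toy)

print(x_test_scaled)
```

With this approach a test row equal to the training mean maps to zero, which would not generally hold if the test split were standardized on its own.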

In [45]:
x_train

Unnamed: 0,Num_Bedrooms,Num_Bathrooms,Square_Feet,Year_Built
0,2,1,2199,2007
1,3,2,1511,1937
2,4,3,2774,2014
3,2,3,4714,1933
4,3,1,1316,1939
...,...,...,...,...
1199,1,1,1455,1967
1200,4,2,1118,1981
1201,1,1,1195,1994
1202,3,3,3771,1938


In [46]:
df.head()

Unnamed: 0,Num_Bedrooms,Num_Bathrooms,Square_Feet,Year_Built,Price
0,2,1,2199,2007,245377.36
1,3,2,1511,1937,377235.75
2,4,3,2774,2014,639368.68
3,2,3,4714,1933,435823.31
4,3,1,1316,1939,485763.67


After creating the dataset and splitting it into training and test sets, we scale the price once more (it was already divided by 100,000 in the split cell above) and save the data to train.csv and test.csv.

In [47]:
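# Build the training frame with Price as the first column
# Note: y_train was already divided by 100000 above, so Price ends up as price * 1e-10
train_df = pd.DataFrame(data=x_train_np)
train_df.columns = x_train.columns
train_df["Price"] = y_train / 100000
first_col = train_df.pop("Price")
train_df.insert(0, "Price", first_col)

# Same layout for the test frame; the index is reset so it aligns with x_test_np
test_df = pd.DataFrame(data=x_test_np)
test_df.columns = x_test.columns
test_df["Price"] = y_test.reset_index(drop=True) / 100000
first_col = test_df.pop("Price")
test_df.insert(0, "Price", first_col)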

In [48]:
# Local data paths
train_dir = os.path.join(os.getcwd(), "data/train")
test_dir = os.path.join(os.getcwd(), "data/test")
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

# Save as CSV
train_df.to_csv(f"{train_dir}/train.csv", header=False, index=False)
test_df.to_csv(f"{test_dir}/test.csv", header=False, index=False)

# Save as Numpy
np.save(os.path.join(train_dir, "x_train.npy"), x_train_np)
np.save(os.path.join(test_dir, "x_test.npy"), x_test_np)
np.save(os.path.join(train_dir, "y_train.npy"), y_train)
np.save(os.path.join(test_dir, "y_test.npy"), y_test)
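The CSV files are written without a header or index, with the label in the first column. The sketch below reads such a file back and splits the label from the features; the temporary directory and the toy frame are assumptions used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the label first, mirroring the layout of train.csv
toy = pd.DataFrame(
    {"Price": [2.5, 3.7], "Num_Bedrooms": [-1.0, 1.0], "Square_Feet": [0.5, -0.5]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    toy.to_csv(path, header=False, index=False)

    # header=None is needed because the file has no header row
    loaded = pd.read_csv(path, header=None)

label = loaded.iloc[:, 0]
features = loaded.iloc[:, 1:]
print(features.shape)
```

Any consumer of these files has to rely on column position, since the column names are not stored.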

In [49]:
y_train.head()

0    2.453774
1    3.772358
2    6.393687
3    4.358233
4    4.857637
Name: Price, dtype: float64

In [50]:
y_test.head()

1204    3.951829
1205    1.753896
1206    1.889645
1207    2.755488
1208    1.113226
Name: Price, dtype: float64

In [51]:
train_cols = list(train_df.columns)
del train_cols[0]
train_cols

['Num_Bedrooms', 'Num_Bathrooms', 'Square_Feet', 'Year_Built']

# Standard mode

With the dataset in place, we train the model with the parameters max_depth = 20, n_jobs = 4, and n_estimators = 120. In this step we use RandomForestRegressor directly, without SageMaker.

In [52]:
y = train_df.iloc[:, 0]  # First column (label)
X = train_df.iloc[:, 1:]  # All remaining columns (features)

y_test = test_df.iloc[:, 0]
X_test = test_df.iloc[:, 1:]

In [53]:
# Initialize the RandomForestRegressor model
rf_model = RandomForestRegressor(max_depth=20, n_jobs=4, n_estimators=120)

# Train the model
rf_model.fit(X, y)

After training, we run predictions on the test data.

In [75]:
y_pred = rf_model.predict(X_test)

In [76]:
print(y_pred)

[5.73858087e-05 3.63066175e-05 5.06816892e-05 6.22938825e-05
 6.56640870e-05 5.71706710e-05 5.17403432e-05 4.02880299e-05
 5.64161439e-05 6.47458855e-05 4.71701926e-05 5.06227663e-05
 4.43142872e-05 5.29381257e-05 5.28736911e-05 5.24948850e-05
 4.13224431e-05 6.04251165e-05 5.91623227e-05 5.14846618e-05
 5.40895732e-05 3.80862295e-05 5.32843416e-05 3.60643010e-05
 4.95582305e-05 5.00552282e-05 5.15487870e-05 6.52825179e-05
 2.69296113e-05 3.79136439e-05 5.18202937e-05 6.85873387e-05
 5.64686209e-05 6.21954535e-05 4.88465242e-05 4.82837703e-05
 5.37722174e-05 5.10289948e-05 6.44521178e-05 5.84723344e-05
 6.15201007e-05 6.23335086e-05 4.29886108e-05 5.94988048e-05
 4.13988173e-05 5.36350642e-05 4.97877017e-05 5.58402742e-05
 5.40205024e-05 5.43558786e-05 5.05016177e-05 5.34645645e-05
 6.04277428e-05 5.19342375e-05 6.25071193e-05 4.49045615e-05
 5.10340152e-05 3.36281942e-05 6.10665835e-05 4.53856200e-05
 3.81368646e-05 5.48367414e-05 4.69322590e-05 4.32337827e-05
 5.89679662e-05 3.823951

In [79]:
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE = {mae: .2e}")

MAE =  2.59e-05


In [80]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE = {rmse: .2e}")

RMSE =  3.02e-05
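Both metrics can be written out directly, which also makes the units explicit. In the cells above the price was divided by 100000 twice, so a reported error is the error in price units multiplied by 1e-10. The sketch below computes MAE and RMSE with NumPy on made-up values and converts the reported MAE back to the original scale; the toy arrays are assumptions.

```python
import numpy as np

# Made-up targets and predictions, only to illustrate the formulas
y_true_toy = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true_toy - y_pred_toy
mae_toy = np.mean(np.abs(errors))         # mean of |error|
rmse_toy = np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error
print(mae_toy, rmse_toy)

# Undo the two divisions by 100000 applied to Price in the notebook
price_scale = 100000 * 100000
mae_reported = 2.59e-05
print(mae_reported * price_scale)
```

On the original scale the reported MAE is roughly 259,000, which is large relative to the 50,000 to 1,000,000 range the prices were drawn from.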


# SageMaker model

Now we train the same model with SageMaker. We start by importing the libraries the project needs.

In [58]:
import sagemaker
import subprocess
import sys
import random
import math
import pandas as pd
import os
import boto3
import numpy as np
from sagemaker.pytorch import PyTorch
from sagemaker.xgboost import XGBoost
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.serializers import NumpySerializer, JSONSerializer, CSVSerializer
from sagemaker.deserializers import NumpyDeserializer, JSONDeserializer
from sagemaker.predictor import Predictor

Because this implementation runs in a SageMaker notebook, we use SageMaker's default session, bucket, and execution role. If you run it in any other notebook, the values for your own AWS account are set manually instead.

We also define the S3 paths and configure training to run on SageMaker rather than in Local Mode.

In [59]:
random.seed(42)

# Useful SageMaker variables
try:
    # You're using a SageMaker notebook
    sess = sagemaker.Session()
    bucket = sess.default_bucket()
    role = sagemaker.get_execution_role()
except ValueError:
    # You're using a notebook somewhere else
    print("Setting role and SageMaker session manually...")
    bucket = "bobby-demo"
    region = "us-east-1"

    # Configure the default boto3 session before creating any clients
    boto3.setup_default_session(region_name=region, profile_name="default")
    iam = boto3.client("iam")
    sagemaker_client = boto3.client("sagemaker")

    # get_role expects the role name, not the full ARN
    # (the full ARN is arn:aws:iam::575019173823:role/LabRole)
    sagemaker_execution_role_name = "LabRole"  # Change this to your role name
    role = iam.get_role(RoleName=sagemaker_execution_role_name)["Role"]["Arn"]
    sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)

# Data paths in S3
s3_prefix = "script-mode-workflow"
csv_s3_prefix = f"{s3_prefix}/csv"
csv_s3_uri = f"s3://{bucket}/{s3_prefix}/csv"
numpy_train_s3_prefix = f"{s3_prefix}/numpy/train"
numpy_train_s3_uri = f"s3://{bucket}/{numpy_train_s3_prefix}"
numpy_test_s3_prefix = f"{s3_prefix}/numpy/test"
numpy_test_s3_uri = f"s3://{bucket}/{numpy_test_s3_prefix}"
csv_train_s3_uri = f"{csv_s3_uri}/train"
csv_test_s3_uri = f"{csv_s3_uri}/test"

# Enable Local Mode training
enable_local_mode_training = False

# Endpoint names
sklearn_endpoint_name = "randomforestregressor-endpoint"

In [60]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json
!/bin/bash ./local_mode_setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


Next, we upload the data to S3.

In [61]:
s3_resource_bucket = boto3.Session().resource("s3").Bucket(bucket)
s3_resource_bucket.Object(os.path.join(csv_s3_prefix, "train.csv")).upload_file(
    "data/train/train.csv"
)
s3_resource_bucket.Object(os.path.join(csv_s3_prefix, "test.csv")).upload_file("data/test/test.csv")
s3_resource_bucket.Object(os.path.join(numpy_train_s3_prefix, "x_train.npy")).upload_file(
    "data/train/x_train.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_train_s3_prefix, "y_train.npy")).upload_file(
    "data/train/y_train.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_test_s3_prefix, "x_test.npy")).upload_file(
    "data/test/x_test.npy"
)
s3_resource_bucket.Object(os.path.join(numpy_test_s3_prefix, "y_test.npy")).upload_file(
    "data/test/y_test.npy"
)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
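The six uploads follow one pattern: a local file goes to a key built from a prefix and the file name. As a sketch, the (local path, S3 key) pairs can be built in one place and then uploaded in a loop. The helper `build_upload_plan` is hypothetical; `posixpath.join` is used so that keys always contain forward slashes, whatever the operating system.

```python
import posixpath


def build_upload_plan(s3_prefix="script-mode-workflow"):
    """Hypothetical helper: map local files to the S3 keys used in the notebook."""
    csv_prefix = posixpath.join(s3_prefix, "csv")
    numpy_train_prefix = posixpath.join(s3_prefix, "numpy", "train")
    numpy_test_prefix = posixpath.join(s3_prefix, "numpy", "test")
    files = [
        ("data/train/train.csv", csv_prefix),
        ("data/test/test.csv", csv_prefix),
        ("data/train/x_train.npy", numpy_train_prefix),
        ("data/train/y_train.npy", numpy_train_prefix),
        ("data/test/x_test.npy", numpy_test_prefix),
        ("data/test/y_test.npy", numpy_test_prefix),
    ]
    # The key is the prefix plus the file's base name
    return [
        (local, posixpath.join(prefix, posixpath.basename(local)))
        for local, prefix in files
    ]


for local_path, key in build_upload_plan():
    print(local_path, "->", key)
    # With a boto3 Bucket resource, as in the cell above, each pair would be sent with:
    # s3_resource_bucket.Object(key).upload_file(local_path)
```

Keeping the mapping in a list makes it easier to see at a glance which key each file ends up under.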


This is one of the most important parts of the process. We define the model's hyperparameters and point to the location of the script that contains the model. We also set the role, which is the default one, and the Random Forest Regressor model. Then we call .fit on the estimator, which starts the training job through the script.

In [62]:
hyperparameters = {"max_depth": 20, "n_jobs": 4, "n_estimators": 120}

if enable_local_mode_training:
    train_instance_type = "local"
    inputs = {"train": f"file://{train_dir}", "test": f"file://{test_dir}"}
else:
    train_instance_type = "ml.c5.xlarge"
    inputs = {"train": csv_train_s3_uri, "test": csv_test_s3_uri}

estimator_parameters = {
    "entry_point": "script.py",
    "source_dir": "scripts",
    "framework_version": "1.2-1",
    "py_version": "py3",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "randomforestregressor-model",
}

estimator = SKLearn(**estimator_parameters)
estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: randomforestregressor-model-2024-09-30-14-34-49-407


2024-09-30 14:34:55 Starting - Starting the training job...
2024-09-30 14:35:11 Starting - Preparing the instances for training...
2024-09-30 14:35:39 Downloading - Downloading input data...
2024-09-30 14:36:04 Downloading - Downloading the training image...
2024-09-30 14:36:50 Training - Training image download completed. Training in progress..
2024-09-30 14:36:53,733 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-09-30 14:36:53,737 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-30 14:36:53,740 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-30 14:36:53,758 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-09-30 14:36:54,034 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-30 14:36:54,037 sagemaker-training-toolkit INFO     No Ne
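The training script `scripts/script.py` is not shown in this notebook. As a rough sketch only, a script-mode entry point for the scikit-learn container typically receives the hyperparameters as command-line arguments, reads its directories from the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables, fits the model, and saves it with joblib. `model_fn` is the hook the serving container calls to load the model. Everything else below (the `parse_args` and `train` helpers, the file name `model.joblib`, the headerless CSV layout with the label first) is an assumption, not the author's actual script.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments
    parser.add_argument("--max_depth", type=int, default=20)
    parser.add_argument("--n_jobs", type=int, default=4)
    parser.add_argument("--n_estimators", type=int, default=120)
    # Directories are provided by SageMaker through environment variables
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    args, _ = parser.parse_known_args(argv)
    return args


def train(args):
    # Assumption: headerless CSV files with the label in the first column
    frames = [
        pd.read_csv(os.path.join(args.train, name), header=None)
        for name in sorted(os.listdir(args.train))
        if name.endswith(".csv")
    ]
    data = pd.concat(frames, ignore_index=True)
    y = data.iloc[:, 0].to_numpy()
    X = data.iloc[:, 1:].to_numpy()

    model = RandomForestRegressor(
        max_depth=args.max_depth, n_jobs=args.n_jobs, n_estimators=args.n_estimators
    )
    model.fit(X, y)

    # Whatever is written to the model directory is packaged as the model artifact
    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Called by the serving container to load the trained model."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Run training only when SageMaker has provided the input channel
if __name__ == "__main__" and "SM_CHANNEL_TRAIN" in os.environ:
    train(parse_args())
```

The keys of the `hyperparameters` dictionary passed to the estimator have to match the argument names the script parses, which is why the sketch uses `max_depth`, `n_jobs`, and `n_estimators`.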

We deploy the model to an endpoint so that we can run predictions from this notebook.

In [67]:
existing_endpoints = sess.sagemaker_client.list_endpoints(
    NameContains=sklearn_endpoint_name, MaxResults=30
)["Endpoints"]
if not existing_endpoints:
    sklearn_predictor = estimator.deploy(
        initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=sklearn_endpoint_name
    )
else:
    sklearn_predictor = Predictor(
        endpoint_name="randomforestregressor-endpoint",
        sagemaker_session=sess,
        serializer=NumpySerializer(),
        deserializer=NumpyDeserializer(),
    )

INFO:sagemaker:Creating model with name: randomforestregressor-model-2024-09-30-14-38-22-021
INFO:sagemaker:Creating endpoint-config with name randomforestregressor-endpoint
INFO:sagemaker:Creating endpoint with name randomforestregressor-endpoint


-------!

Finally, we run predictions on the test data.

In [81]:
y_pred = sklearn_predictor.predict(X_test)

In [82]:
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE = {mae:.2e}")

MAE = 2.59e-05


In [83]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE = {rmse:.2e}")

RMSE = 3.02e-05


The reported metrics (MAE = 2.59e-05, RMSE = 3.02e-05) match those of the standard mode.
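To look beyond the rounded metrics, the two sets of predictions can also be compared element by element. In the notebook `y_pred` is overwritten by the endpoint predictions, so this sketch assumes the local predictions were kept under a separate, hypothetical name. The arrays below are placeholders, not values from the actual runs.

```python
import numpy as np

# Placeholder arrays standing in for the local and endpoint predictions
y_pred_local = np.array([5.74e-05, 3.63e-05, 5.07e-05])
y_pred_endpoint = np.array([5.70e-05, 3.66e-05, 5.05e-05])

# Largest element-wise gap between the two sets of predictions
max_gap = np.max(np.abs(y_pred_local - y_pred_endpoint))
print(f"max |difference| = {max_gap:.2e}")

# Two forests trained without a fixed random_state need not agree exactly,
# so compare within a tolerance instead of testing for equality
same_within_tol = np.allclose(y_pred_local, y_pred_endpoint, rtol=0.0, atol=1e-06)
print(same_within_tol)
```

The tolerance `atol=1e-06` is an arbitrary choice for the example and would need to be set relative to the scale of the target.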

Lastly, we delete the endpoint, since we will not use it again.

In [84]:
sklearn_predictor.delete_endpoint(True)

INFO:sagemaker:Deleting endpoint configuration with name: randomforestregressor-endpoint
INFO:sagemaker:Deleting endpoint with name: randomforestregressor-endpoint
