<a href="https://colab.research.google.com/github/Erasnilson/Reg-Log-XGBoots-AWS/blob/main/Xgboost_classificacao_census_comentado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression with the XGBoost Algorithm on AWS SageMaker (Machine Learning Modeling)


## **Data preparation**


### - Organize the dataset in the format {y, x1, x2, ..., xn} for classification modeling **on AWS SageMaker**.
```
colunas = []
colunas.append('income')                        # the response variable y goes first (required by SageMaker)
for i in range(len(base_census.columns[:-1])):  # every column except the last one (income)
     colunas.append(base_census.columns[i])
base_census = base_census[colunas]
base_census
```
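
As a minimal sketch of the same reordering, the list of column names can also be built in a single expression. The column names below are a shortened, hypothetical subset used only for illustration, and `target_first` is not part of the original notebook; the result would be passed to `base_census[...]` exactly as in the loop above.

```python
# Hypothetical, shortened list of column names with the target in last position
columns = ['age', 'workclass', 'education', 'income']

def target_first(names, target):
    """Return the names with `target` moved to the front, order otherwise unchanged."""
    return [target] + [name for name in names if name != target]

reordered = target_first(columns, 'income')
print(reordered)
```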

### - Encode the **response variable as a dummy**.
- If y is ' >50K' -> 1, otherwise -> 0 (SageMaker only accepts numeric values)

```
def functiondymmy(text):
    if text == ' >50K':
        return 1.0
    else:
        return 0.0
```

- apply the dichotomous transformation:
```
base_census['income'] = base_census['income'].apply(functiondymmy)
```
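
The comparison above depends on the leading space in `' >50K'`; the values in this dataset appear to carry one (the dummy column names further down, such as `workclass_ ?`, show it too). A hedged variant, assuming the labels may or may not have surrounding whitespace, strips the text before comparing. The name `encode_income` is illustrative and not part of the original notebook.

```python
def encode_income(text):
    """Map the income label to 1.0 for '>50K' and 0.0 otherwise, ignoring surrounding spaces."""
    return 1.0 if text.strip() == '>50K' else 0.0

labels = [' >50K', ' <=50K', '>50K', '<=50K']
encoded = [encode_income(label) for label in labels]
print(encoded)
```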

### - Qualitative variables are not automatically converted into dummies (design matrix), so factors such as sex, education, etc. need to be **converted to numbers** before the algorithm can use them.
```
base_census = pd.get_dummies(base_census,prefix=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'inative-country'],
                            columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race',
                            'sex', 'inative-country'],
                            dtype='int') # dtype='int' added because in Colab the dummies were not returned as 0/1
```
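
To make the dummy encoding concrete, here is a small pure-Python sketch of what one-hot encoding does to a single categorical column. It only illustrates the idea and is not how `pd.get_dummies` is implemented; the sample values and the helper name `one_hot` are made up.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    names = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return names, rows

names, rows = one_hot(['Male', 'Female', 'Male'], prefix='sex')
print(names)
print(rows)
```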

### - On SageMaker, the X and y variables are specified **without column labels and without the index**. The algorithm therefore receives only the values, which are later converted to binary format.
```
X_teste = base_teste.iloc[:, 1:].values   # every column after the first one (features)
y_teste = base_teste.iloc[:, 0].values    # first column (response variable)
```
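
The layout described above (label in the first column, no header row, no index) can be sketched with the standard library. The rows below are invented for illustration; in the notebook the same layout is produced with `to_csv(..., header = False, index = False)`.

```python
import csv
import io

# Invented rows: the first value is the label y, the rest are features x1..xn
rows = [
    [0.0, 39, 77516, 13],
    [1.0, 52, 287927, 9],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(rows)          # no header row and no index column are written
csv_text = buffer.getvalue()
print(csv_text)
```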


# **SageMaker configuration**

## - Configuring SageMaker involves **connecting AWS (SageMaker)** to Python and **granting access for writing** the specific files (training and test)

```
import sagemaker
import boto3                           # AWS SDK for Python (connects Python to AWS)
from sagemaker import Session
import sagemaker.amazon.common as smac # SageMaker common library
import io                              # in-memory buffers, used when writing data for S3
import os                              # access to operating system resources (paths)
```

## - After creating the bucket (cursoawssagemaker) and the user, specify the paths (directories) for the model output and for the training and test files, and configure the appropriate permissions.

```
session = sagemaker.Session()
bucket = 'cursoawssagemaker'               # bucket stored in S3
subpasta_modelo = 'modelos/census/xgboost' # where the models are stored
subpasta_dataset = 'datasets/census'
key_train = 'census-train-data-xgboost'    # name of the training file
key_test = 'census-test-data-xgboost'      # name of the test file
role = sagemaker.get_execution_role()      # permissions

# paths to the training and test files
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, subpasta_dataset, key_train)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, subpasta_dataset, key_test)

# where the model will be stored in S3
output_location = 's3://{}/{}/output'.format(bucket, subpasta_modelo)
print('Role: {}'.format(role))

# print the configured paths
print('Training data location: {}'.format(s3_train_data))
print('Test data location: {}'.format(s3_test_data))
print('Final model will be saved to: {}'.format(output_location))
```
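
S3 object keys always use forward slashes. The upload step further down builds its keys with `os.path.join`, which works on Linux (including SageMaker notebook instances) but would insert backslashes on Windows. A sketch of a platform-independent alternative, using `posixpath.join` from the standard library and the same names as above:

```python
import posixpath

bucket = 'cursoawssagemaker'
subpasta_dataset = 'datasets/census'
key_train = 'census-train-data-xgboost'

# posixpath.join always uses '/', regardless of the operating system
train_key = posixpath.join(subpasta_dataset, 'train', key_train)
s3_train_data = f's3://{bucket}/{train_key}'
print(s3_train_data)
```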

## - Finally, the training and test sets, saved as CSV with `to_csv(..., header = False, index = False)`, are opened in binary mode and uploaded to S3.

```
with open('census_train_xgboost.csv', 'rb') as f:
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(subpasta_dataset, 'train', key_train)).upload_fileobj(f)

with open('census_test_xgboost.csv', 'rb') as f:
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(subpasta_dataset, 'test', key_test)).upload_fileobj(f)
```



# **XGBoost training**

## Training the model requires specifying an instance, and depending on the instance selected an hourly fee may be charged. For details, see the SageMaker pricing table: https://aws.amazon.com/pt/sagemaker/pricing/.



```
# https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-sa-east-1.html
from sagemaker import image_uris
container = image_uris.retrieve(framework = 'xgboost', region=boto3.Session().region_name, version='latest')
```


### - Studio notebooks and on-demand notebook instances - **Free** tier usage per month **for the first two months** - **250 hours of the ml.t3.medium instance** on Studio notebooks OR **250 hours of the ml.t2.medium instance** or **ml.t3.medium on on-demand notebook instances**.

🔴 The **ml.m5.2xlarge** instance (8 vCPUs and 32 GiB) **costs 0.461 USD/h**, according to the pricing page https://aws.amazon.com/pt/sagemaker/pricing/
```
# https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
xgboost = sagemaker.estimator.Estimator(image_uri = container,
                                        role = role,
                                        instance_count = 1,
                                        instance_type = 'ml.m5.2xlarge',
                                        output_path = output_location,
                                        sagemaker_session = session)
```
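
As a rough illustration of what the hourly price means for a short job, the sketch below multiplies the rate quoted above by the run time. The 15-minute duration is an invented example, `estimate_cost` is a hypothetical helper, and actual billing rules (rounding, region, current prices) should be checked on the pricing page.

```python
HOURLY_RATE_USD = 0.461   # ml.m5.2xlarge rate quoted in this document

def estimate_cost(minutes, instance_count=1, hourly_rate=HOURLY_RATE_USD):
    """Estimate the cost in USD of running `instance_count` instances for `minutes`."""
    return hourly_rate * (minutes / 60) * instance_count

cost = estimate_cost(minutes=15)   # hypothetical 15-minute training job
print(f'{cost:.4f} USD')
```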




```
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
xgboost.set_hyperparameters(num_round = 100, objective = 'reg:logistic')   # using logistic regression
```




```
train_input = sagemaker.inputs.TrainingInput(s3_data = s3_train_data, content_type='csv', s3_data_type = 'S3Prefix')
validation_input = sagemaker.inputs.TrainingInput(s3_data = s3_test_data, content_type='csv', s3_data_type = 'S3Prefix')
data_channels = {'train': train_input, 'validation': validation_input}
```



```
xgboost.fit(data_channels) # fit the model
```

At the end (after the deploy step in the next section), check that the Endpoint was created under "Inference" in the SageMaker console.
🔴 NOTE: delete the **Endpoint** after using it to avoid charges.




# **Deploy, predictions, and evaluation**

- Model evaluation (the model is first deployed to an endpoint)

```
xgboost_classifier = xgboost.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')
```

- Configure the classifier: the inputs are now serialized as CSV (previously they were NumPy arrays)

```
from sagemaker.serializers import CSVSerializer
xgboost_classifier.serializer = CSVSerializer()
X_teste.shape, type(X_teste)
```


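- Make the predictions. The endpoint returns probabilities (0 to 1). One option is to reclassify them: values >= 0.5 become class 1, otherwise class 0.

```
# Request the predictions from the endpoint and parse the comma-separated response
previsoes = np.array(xgboost_classifier.predict(X_teste).decode('utf-8').split(',')).astype(np.float32)
# No need to convert True to 1: Python treats booleans as 0/1 automatically
previsoes = (previsoes >= 0.5)
print(previsoes)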
```
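
The endpoint response arrives as bytes of comma-separated scores, so it has to be decoded, split, and converted to numbers before thresholding. The same steps in plain Python, with an invented response payload:

```python
# Invented payload standing in for the bytes returned by the endpoint
response = b'0.7166,0.9800,0.0001,0.4999,0.5'

scores = [float(value) for value in response.decode('utf-8').split(',')]
classes = [int(score >= 0.5) for score in scores]   # 1 if score >= 0.5, else 0
print(classes)
```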

- Use the sklearn package to obtain the model's confusion matrix.

```
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_teste, previsoes)
cm
```

- In the course example, the accuracy of the XGBoost model was better than that of the linear-learner. As shown below, XGBoost reached a recall of 66% with a precision of 78% for class 1, and a recall of 94% with a precision of 89% for class 0.

```
print(classification_report(y_teste, previsoes))
              precision    recall  f1-score   support

         0.0       0.89      0.94      0.92      7365
         1.0       0.78      0.66      0.71      2404

    accuracy                           0.87      9769
   macro avg       0.84      0.80      0.81      9769
weighted avg       0.87      0.87      0.87      9769
```
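
The figures in the report can be checked by hand from the confusion matrix recorded in the notebook output further down, `[[6914, 451], [821, 1583]]` (rows are true classes, columns are predicted classes, as returned by sklearn's `confusion_matrix`). A short sketch of the arithmetic:

```python
# Confusion matrix from the notebook output: rows = true class, columns = predicted class
tn, fp = 6914, 451
fn, tp = 821, 1583

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of the rows predicted as 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of the rows that are truly 1, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```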






# **Tuning**

- Tuning trains several models with different hyperparameter combinations; in this case **9 models** are requested. The models are compared by **lowest error**, and the hyperparameter values of the best one are then used to fit a new model.

```
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta"
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha"
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight"
        }
      ],
      "IntegerParameterRanges": [          # single key: a repeated key would silently overwrite the first list
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth"
        },
        {
          "MaxValue": "300",
          "MinValue": "50",
          "Name": "num_round"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 9,        # <-- number of models to train
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:error",    # <-- classification error, not rmse
      "Type": "Minimize"
    }
  }
```
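
Before submitting the job, it can help to list which hyperparameters the configuration will actually search. Python dict literals keep only the last value when a key is repeated, so a range that seems to be defined can silently disappear. The helper below is a hypothetical convenience, shown with a reduced configuration in the same shape as `tuning_job_config`.

```python
def tuned_parameter_names(config):
    """Collect the names of all hyperparameters listed in the ParameterRanges section."""
    names = []
    for ranges in config['ParameterRanges'].values():
        names.extend(item['Name'] for item in ranges)
    return sorted(names)

# Reduced example in the same shape as tuning_job_config
example_config = {
    'ParameterRanges': {
        'CategoricalParameterRanges': [],
        'ContinuousParameterRanges': [{'MinValue': '0', 'MaxValue': '1', 'Name': 'eta'}],
        'IntegerParameterRanges': [
            {'MinValue': '1', 'MaxValue': '10', 'Name': 'max_depth'},
            {'MinValue': '50', 'MaxValue': '300', 'Name': 'num_round'},
        ],
    }
}

names = tuned_parameter_names(example_config)
print(names)
```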




```
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": container,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_train_data
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_test_data
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,subpasta_modelo)
    },
    "ResourceConfig": {
      "InstanceCount": 2,
      "InstanceType": "ml.c4.2xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "error",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}
```


- Finally, run the code below to tune the hyperparameters. When it finishes, the models and their statistics can be inspected under Training > Hyperparameter tuning jobs (click the best training job); copy its hyperparameters into the new model.

```
smclient = boto3.client('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = "xgboosttuningcensus",
                                          HyperParameterTuningJobConfig = tuning_job_config,
                                          TrainingJobDefinition = training_job_definition)

```
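
When hyperparameter values are read through the SageMaker API rather than copied by hand from the console, they come back as strings, so they need to be converted to numbers before being reused. A minimal sketch, where the input dictionary is a made-up example and `cast_hyperparameters` is a hypothetical helper:

```python
def cast_hyperparameters(raw):
    """Convert string values to int when possible, otherwise float, otherwise keep the text."""
    typed = {}
    for name, value in raw.items():
        for convert in (int, float):
            try:
                typed[name] = convert(value)
                break
            except ValueError:
                continue
        else:
            # neither int nor float worked, so keep the original string
            typed[name] = value
    return typed

# Made-up example of values as they might be reported for a training job
raw = {'num_round': '102', 'eta': '0.145', 'objective': 'binary:logistic'}
typed = cast_hyperparameters(raw)
print(typed)
```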


# **Building the new model**


- After identifying the best model from the tuning job, copy its hyperparameter values and substitute them into the code below:

```
container = image_uris.retrieve(framework='xgboost',region=boto3.Session().region_name,version='latest')
xgboost_tuning = sagemaker.estimator.Estimator(image_uri = container,
                                        role = role,
                                        instance_count = 1,
                                        instance_type = 'ml.m5.2xlarge',
                                        output_path = output_location,
                                        sagemaker_session = session)
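# Note: no objective is passed here, so the container default applies; pass objective = 'binary:logistic' to get probabilities
xgboost_tuning.set_hyperparameters(num_round = 102, eta = 0.14507612435685635,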
                                   min_child_weight = 2.412681801757289,
                                   alpha = 0.3189676727624047, tweedie_variance_power = 1.4,
                                   rate_drop = 0.3)
xgboost_tuning.fit(data_channels)
```

- Creating the endpoint

```
xgboost_classifier_tuning = xgboost_tuning.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')
```

- Get the predictions (probabilities)

```
from sagemaker.serializers import CSVSerializer
xgboost_classifier_tuning.serializer = CSVSerializer()
previsoes = np.array(xgboost_classifier_tuning.predict(X_teste).decode('utf-8').split(',')).astype(np.float32)
```

- Transform the predictions

```
previsoes = (previsoes >= 0.5)
# cast y_teste to integers (the predictions above are True/False)
y_teste = np.array(y_teste).astype(int)
y_teste   # previsoes.shape, y_teste.shape
```
- Obtaining the model's accuracy

```
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_teste, previsoes)
cm
#accuracy_score(y_teste, previsoes)
```
- Final result

```
print(classification_report(y_teste, previsoes))
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      7365
           1       0.80      0.64      0.71      2404

    accuracy                           0.87      9769
   macro avg       0.84      0.80      0.82      9769
weighted avg       0.87      0.87      0.87      9769
```
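
Both reports round accuracy to 0.87, so the effect of tuning is easier to see from the confusion matrices recorded in the notebook output below. The sketch compares the two; the matrices are copied from those outputs.

```python
def accuracy(matrix):
    """Share of correct predictions in a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)

baseline = [[6914, 451], [821, 1583]]   # first model
tuned = [[6974, 391], [855, 1549]]      # model trained with the tuned hyperparameters

gain = accuracy(tuned) - accuracy(baseline)
extra_correct = (tuned[0][0] + tuned[1][1]) - (baseline[0][0] + baseline[1][1])
print(round(accuracy(baseline), 4), round(accuracy(tuned), 4), extra_correct)
```

The tuned model classifies 26 more test rows correctly, while its recall for class 1 is slightly lower (0.64 versus 0.66).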



## Below are the commands to be run on AWS. Note that this notebook was adapted from a course activity; I decided to share it without running it on AWS because I do not have a free account.

In [1]:
!pip install pandas
!pip install numpy




In [2]:
import pandas as pd
import numpy as np


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [48]:
# Importing the data frame - Colab
base_census = pd.read_csv('/content/drive/MyDrive/Github/Aws-classificacao-Xboost/census.csv')
#base_census.head(5)

In [49]:
#base_census = pd.read_csv('census.csv') # simple way to load the file on AWS
# moving the y variable to the first column - required by SageMaker
colunas = []
colunas.append('income')                        # add the name of y first
for i in range(len(base_census.columns[:-1])):  # every column except the last one (income)
     colunas.append(base_census.columns[i])
#base_census = base_census[colunas]
#base_census
colunas

['income',
 'age',
 'workclass',
 'final-weight',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loos',
 'hour-per-week',
 'inative-country']

In [50]:
#base_census = pd.read_csv('census.csv') # simple way to load the file on AWS
# moving the y variable to the first column - required by SageMaker
colunas = []
colunas.append('income')                        # add the name of y first
for i in range(len(base_census.columns[:-1])):  # every column except the last one (income)
     colunas.append(base_census.columns[i])
base_census = base_census[colunas]
base_census.head(5)


Unnamed: 0,income,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,inative-country
0,<=50K,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,<=50K,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,<=50K,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,<=50K,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,<=50K,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [51]:
def functiondymmy(text):
    if text == ' >50K':
        return 1.0
    else:
        return 0.0

In [52]:
base_census['income'] = base_census['income'].apply(functiondymmy)

In [53]:
base_census.head(5)

Unnamed: 0,income,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,inative-country
0,0.0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,0.0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,0.0,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,0.0,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,0.0,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [54]:
base_census = pd.get_dummies(base_census,prefix=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'inative-country'],
                            columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'inative-country'],
                            dtype='int') # dtype='int' added because in Colab the dummies were not returned as 0/1
base_census.head(5)

Unnamed: 0,income,age,final-weight,education-num,capital-gain,capital-loos,hour-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,...,inative-country_ Portugal,inative-country_ Puerto-Rico,inative-country_ Scotland,inative-country_ South,inative-country_ Taiwan,inative-country_ Thailand,inative-country_ Trinadad&Tobago,inative-country_ United-States,inative-country_ Vietnam,inative-country_ Yugoslavia
0,0.0,39,77516,13,2174,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0.0,50,83311,13,0,0,13,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.0,38,215646,9,0,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.0,53,234721,7,0,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0.0,28,338409,13,0,0,40,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
base_treinamento = base_census.iloc[0:22792,:]
base_treinamento.shape

(22792, 109)

In [56]:
base_teste = base_census.iloc[22792:,:]
base_teste.shape

(9769, 109)

In [57]:
22792 + 9769

32561

## Specifying the X and y variables (SageMaker requires the variable labels to be removed)

In [58]:
X_teste = base_teste.iloc[:, 1:].values
X_teste

array([[    30,  75167,     13, ...,      1,      0,      0],
       [    39, 176296,      9, ...,      1,      0,      0],
       [    19,  93518,     10, ...,      1,      0,      0],
       ...,
       [    58, 151910,      9, ...,      1,      0,      0],
       [    22, 201490,      9, ...,      1,      0,      0],
       [    52, 287927,      9, ...,      1,      0,      0]])

In [59]:
X_teste.shape

(9769, 108)

In [60]:
y_teste = base_teste.iloc[:, 0].values
y_teste

array([0., 1., 0., ..., 0., 0., 1.])

In [None]:
base_treinamento.to_csv('census_train_xgboost.csv', header = False, index = False)
base_teste.to_csv('census_test_xgboost.csv', header = False, index = False)

# SageMaker configuration

In [None]:
import sagemaker
import boto3
from sagemaker import Session
import sagemaker.amazon.common as smac # SageMaker common library
import io
import os

In [None]:
session = sagemaker.Session()
bucket = 'cursoawssagemaker'
subpasta_modelo = 'modelos/census/xgboost'
subpasta_dataset = 'datasets/census'
key_train = 'census-train-data-xgboost'
key_test = 'census-test-data-xgboost'
role = sagemaker.get_execution_role()
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, subpasta_dataset, key_train)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, subpasta_dataset, key_test)
output_location = 's3://{}/{}/output'.format(bucket, subpasta_modelo)
print('Role: {}'.format(role))
print('Training data location: {}'.format(s3_train_data))
print('Test data location: {}'.format(s3_test_data))
print('Final model will be saved to: {}'.format(output_location))

Role: arn:aws:iam::936535973187:role/service-role/AmazonSageMaker-ExecutionRole-20220510T125992
Training data location: s3://cursoawssagemaker/datasets/census/train/census-train-data-xgboost
Test data location: s3://cursoawssagemaker/datasets/census/test/census-test-data-xgboost
Final model will be saved to: s3://cursoawssagemaker/modelos/census/xgboost/output


In [None]:
with open('census_train_xgboost.csv', 'rb') as f:
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(subpasta_dataset, 'train', key_train)).upload_fileobj(f)

In [None]:
with open('census_test_xgboost.csv', 'rb') as f:
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(subpasta_dataset, 'test', key_test)).upload_fileobj(f)

# XGBoost training

In [None]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-sa-east-1.html
from sagemaker import image_uris
container = image_uris.retrieve(framework = 'xgboost', region=boto3.Session().region_name, version='latest')

In [None]:
# https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
xgboost = sagemaker.estimator.Estimator(image_uri = container,
                                        role = role,
                                        instance_count = 1,
                                        instance_type = 'ml.m5.2xlarge',
                                        output_path = output_location,
                                        sagemaker_session = session)

In [None]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
xgboost.set_hyperparameters(num_round = 100, objective = 'reg:logistic')

In [None]:
train_input = sagemaker.inputs.TrainingInput(s3_data = s3_train_data, content_type='csv', s3_data_type = 'S3Prefix')
validation_input = sagemaker.inputs.TrainingInput(s3_data = s3_test_data, content_type='csv', s3_data_type = 'S3Prefix')
data_channels = {'train': train_input, 'validation': validation_input}

In [None]:
xgboost.fit(data_channels)

2022-05-17 15:52:00 Starting - Starting the training job...
2022-05-17 15:52:29 Starting - Preparing the instances for trainingProfilerReport-1652802720: InProgress
.........
2022-05-17 15:53:57 Downloading - Downloading input data...
2022-05-17 15:54:17 Training - Downloading the training image...
2022-05-17 15:54:57 Training - Training image download completed. Training in progress..Arguments: train
[2022-05-17:15:54:56:INFO] Running standalone xgboost training.
[2022-05-17:15:54:56:INFO] File size need to be processed in the node: 7.07mb. Available memory size in the node: 23504.35mb
[2022-05-17:15:54:56:INFO] Determined delimiter of CSV input is ','
[15:54:56] S3DistributionType set as FullyReplicated
[15:54:56] 22792x108 matrix with 2461536 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-05-17:15:54:56:INFO] Determined delimiter of CSV input is ','
[15:54:56] S3Distribution

# Deploy, predictions, and evaluation

In [None]:
xgboost_classifier = xgboost.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

-------!

In [None]:
from sagemaker.serializers import CSVSerializer
xgboost_classifier.serializer = CSVSerializer()

In [None]:
X_teste.shape, type(X_teste)

((9769, 108), numpy.ndarray)

In [None]:
previsoes = np.array(xgboost_classifier.predict(X_teste).decode('utf-8').split(',')).astype(np.float32)
previsoes

array([7.1665001e-01, 9.8000973e-01, 8.9295841e-05, ..., 1.3065693e-02,
       2.0877716e-04, 9.9958044e-01], dtype=float32)

In [None]:
previsoes = (previsoes >= 0.5)
print(previsoes)

[ True  True False ... False False  True]


In [None]:
previsoes.shape, y_teste.shape

((9769,), (9769,))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
cm = confusion_matrix(y_teste, previsoes)
cm

array([[6914,  451],
       [ 821, 1583]])

In [None]:
accuracy_score(y_teste, previsoes)

0.8697921998157436

In [None]:
print(classification_report(y_teste, previsoes))

              precision    recall  f1-score   support

         0.0       0.89      0.94      0.92      7365
         1.0       0.78      0.66      0.71      2404

    accuracy                           0.87      9769
   macro avg       0.84      0.80      0.81      9769
weighted avg       0.87      0.87      0.87      9769



# Tuning

In [None]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta"
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha"
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight"
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth"
        },
        {
          "MaxValue": "300",
          "MinValue": "50",
          "Name": "num_round"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 9,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:error",
      "Type": "Minimize"
    }
  }

In [None]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": container,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_train_data
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_test_data
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,subpasta_modelo)
    },
    "ResourceConfig": {
      "InstanceCount": 2,
      "InstanceType": "ml.c4.2xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "error",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}

In [None]:
smclient = boto3.client('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = "xgboosttuningcensus",
                                          HyperParameterTuningJobConfig = tuning_job_config,
                                          TrainingJobDefinition = training_job_definition)

# Building the new model

In [None]:
container = image_uris.retrieve(framework='xgboost',region=boto3.Session().region_name,version='latest')
xgboost_tuning = sagemaker.estimator.Estimator(image_uri = container,
                                        role = role,
                                        instance_count = 1,
                                        instance_type = 'ml.m5.2xlarge',
                                        output_path = output_location,
                                        sagemaker_session = session)
xgboost_tuning.set_hyperparameters(num_round = 102, eta = 0.14507612435685635,
                                   min_child_weight = 2.412681801757289,
                                   alpha = 0.3189676727624047, tweedie_variance_power = 1.4,
                                   rate_drop = 0.3)
xgboost_tuning.fit(data_channels)

2022-05-17 16:06:12 Starting - Starting the training job...ProfilerReport-1652803572: InProgress
...
2022-05-17 16:06:52 Starting - Preparing the instances for training......
2022-05-17 16:08:11 Downloading - Downloading input data...
2022-05-17 16:08:29 Training - Downloading the training image.....Arguments: train
[2022-05-17:16:09:20:INFO] Running standalone xgboost training.
[2022-05-17:16:09:20:INFO] File size need to be processed in the node: 7.07mb. Available memory size in the node: 23867.78mb
[2022-05-17:16:09:20:INFO] Determined delimiter of CSV input is ','
[16:09:20] S3DistributionType set as FullyReplicated
[16:09:20] 22792x108 matrix with 2461536 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-05-17:16:09:20:INFO] Determined delimiter of CSV input is ','
[16:09:20] S3DistributionType set as FullyReplicated
[16:09:20] 9769x108 matrix with 1055052 entries lo

In [None]:
xgboost_classifier_tuning = xgboost_tuning.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

------!

In [None]:
from sagemaker.serializers import CSVSerializer
xgboost_classifier_tuning.serializer = CSVSerializer()
previsoes = np.array(xgboost_classifier_tuning.predict(X_teste).decode('utf-8').split(',')).astype(np.float32)

In [None]:
previsoes

array([ 0.690067  ,  0.91854674, -0.00379324, ...,  0.01779461,
       -0.00613862,  0.9809268 ], dtype=float32)

In [None]:
previsoes = (previsoes >= 0.5)
previsoes

array([ True,  True, False, ..., False, False,  True])

In [None]:
y_teste = np.array(y_teste).astype(int)
y_teste

array([0, 1, 0, ..., 0, 0, 1])

In [None]:
previsoes.shape, y_teste.shape

((9769,), (9769,))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_teste, previsoes)
cm

array([[6974,  391],
       [ 855, 1549]])

In [None]:
accuracy_score(y_teste, previsoes)

0.8724536800081891

In [None]:
print(classification_report(y_teste, previsoes))

              precision    recall  f1-score   support

           0       0.89      0.95      0.92      7365
           1       0.80      0.64      0.71      2404

    accuracy                           0.87      9769
   macro avg       0.84      0.80      0.82      9769
weighted avg       0.87      0.87      0.87      9769

