# Swift-Diagnose XGBoost



### Necessary imports

In [1]:
import boto3
import sagemaker

import numpy as np
import io
import pandas as pd
from sklearn.model_selection import train_test_split

### Load data

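Read the CSV file **from an S3 bucket into a pandas DataFrame** (pandas needs the `s3fs` package to read `s3://` paths), then drop the `id` column.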



In [22]:
data = pd.read_csv(
    's3://b1-aws-bucket/cardio_train.csv', sep = ';'
)

data = data.drop(['id'], axis=1)


In [23]:
data.head()

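|   | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22469 | 1 | 155 | 69.0 | 130 | 80 | 2 | 2 | 0 | 0 | 1 | 0 |
| 1 | 14648 | 1 | 163 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 21901 | 1 | 165 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 14549 | 2 | 165 | 85.0 | 120 | 80 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | 23393 | 1 | 155 | 62.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |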

SageMaker's built-in XGBoost algorithm expects CSV training data without a header row and with the label in the first column, so we move `cardio` to the front.


In [24]:
target_col = data.pop('cardio')
data.insert(0, 'cardio',target_col)

In [25]:
data.head()

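|   | cardio | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22469 | 1 | 155 | 69.0 | 130 | 80 | 2 | 2 | 0 | 0 | 1 |
| 1 | 1 | 14648 | 1 | 163 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 21901 | 1 | 165 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 |
| 3 | 0 | 14549 | 2 | 165 | 85.0 | 120 | 80 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 23393 | 1 | 155 | 62.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 |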


### Train / Val / Test split

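We split the data into a training set (80%) and a held-out set (20%), stratifying on both `cardio` and `gender` so that each part keeps the same label and gender proportions. Note that the code produces only these two parts: the 20% part (`testval`) is uploaded as is and used for validation during training.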

In [26]:
train, testval = train_test_split(data, train_size=0.8, stratify=data[['cardio', 'gender']], random_state=1200)


In [27]:
train.shape, testval.shape

((55440, 12), (13861, 12))
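
If a separate test set is wanted, the 20% held-out part can be split once more. The following is a minimal sketch on a small synthetic frame, assuming the same stratification columns as the first split; the toy data and the names `testval_demo`, `val` and `test` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `testval`: 40 rows, 10 per (cardio, gender) combination
testval_demo = pd.DataFrame({
    "cardio": [0, 1] * 20,
    "gender": [1, 1, 2, 2] * 10,
    "age": range(40),
})

# Halve the held-out part, stratifying on the same two columns as the first split
val, test = train_test_split(
    testval_demo,
    test_size=0.5,
    stratify=testval_demo[["cardio", "gender"]],
    random_state=1200,
)

print(val.shape, test.shape)
```

Applied to the real `testval`, this would give roughly 10% of the data for validation and 10% for testing.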

In [28]:
s3 = boto3.resource('s3')

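def upload_to_s3(df, bucket, filename):
    """Serialise `df` as header-less CSV and store it at s3://<bucket>/<filename>."""
    buffer = io.StringIO()
    # SageMaker's built-in XGBoost expects CSV without a header row or index column
    df.to_csv(buffer, header=False, index=False)
    # Named `s3_object` rather than `object` to avoid shadowing the Python builtin
    s3_object = s3.Object(bucket, filename)
    s3_object.put(Body=buffer.getvalue())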

With the helper defined, we upload the training split and the held-out split that serves as validation data.

In [29]:
upload_to_s3(train, 'b1-aws-bucket', 'sagemaker-data/train.csv')

In [30]:
upload_to_s3(testval, 'b1-aws-bucket', 'sagemaker-data/testval.csv')
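
To see what `upload_to_s3` writes, the serialisation step can be run locally without touching S3. This is a minimal sketch on two rows taken from `data.head()` with a reduced set of columns; the frame `demo` is an assumption made for illustration.

```python
import io
import pandas as pd

# Two rows in the notebook's layout: label first, then (a subset of) the features
demo = pd.DataFrame({
    "cardio": [0, 1],
    "age": [22469, 14648],
    "gender": [1, 1],
    "weight": [69.0, 71.0],
})

# Same call as in upload_to_s3: no header row, no index column
buffer = io.StringIO()
demo.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Each output line starts with the label, followed by the feature values in column order.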

## Setting up the model

We use the `Estimator` class from the `sagemaker.estimator` module. It defines the **environment** in which training jobs for a model run.

We specify: 

- A container image (SageMaker works with containers; here we point to a pre-built image that holds everything needed to run XGBoost).
- A role (the training job needs a role with sufficient permissions, similar to what we saw with Lambda functions). Remember that we created this role when starting the notebook server.
- The number of instances for training (we use 1, but larger jobs can scale out to more).
- The instance type (we select one that's included in the SageMaker Free Tier).
- The output path, where the model artifacts and other output are written.
- The hyperparameters of the algorithm (objective, evaluation metric, learning rate, tree depth and number of boosting rounds).
- The current SageMaker session, which the SDK uses to make its calls to AWS.

In [31]:
region_name = boto3.Session().region_name
example = sagemaker.image_uris.retrieve('xgboost', region_name, version='0.90-1')

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


In [32]:
role = sagemaker.get_execution_role()


container = sagemaker.image_uris.retrieve('xgboost', region_name, version='0.90-1')
output_location = 's3://b1-aws-bucket/sagemaker-output/'

# For a list of possible XGBoost parameters, see
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters
hyperparams = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": "0.2",
    "max_depth": "5",
    "num_round": "100",
}

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.
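
The `binary:logistic` objective means that the model's raw score is a log-odds value, which the logistic function maps to a probability. A minimal sketch of that mapping in plain Python follows; the function names are illustrative and not part of XGBoost or SageMaker.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def logit(probability: float) -> float:
    """Inverse of `sigmoid`: recover the log-odds from a probability."""
    return math.log(probability / (1.0 - probability))


# A raw score of 0 corresponds to an even 50/50 prediction
print(sigmoid(0.0))
```

This is why the deployed model returns a number between 0 and 1 rather than a class label.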


Now we have to create what SageMaker calls "channels": a dictionary that tells the training job where each dataset lives and in which format it is stored.

In [33]:
# `sagemaker.session.s3_input` was renamed to `sagemaker.inputs.TrainingInput` in SDK v2
train_channel = sagemaker.inputs.TrainingInput(
    's3://b1-aws-bucket/sagemaker-data/train.csv',
    content_type='text/csv'
)
val_channel = sagemaker.inputs.TrainingInput(
    's3://b1-aws-bucket/sagemaker-data/testval.csv',
    content_type='text/csv'
)


channels_for_training = {
    'train': train_channel,
    'validation': val_channel
}
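
`upload_to_s3` takes a bucket and a key, while the channels take a full `s3://` URI. A small sketch of converting between the two forms with the standard library is shown here; the helper names `to_s3_uri` and `split_s3_uri` are hypothetical and not part of the SageMaker SDK.

```python
from typing import Tuple
from urllib.parse import urlsplit


def to_s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI from a bucket name and an object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


train_uri = to_s3_uri("b1-aws-bucket", "sagemaker-data/train.csv")
print(train_uri)
print(split_s3_uri(train_uri))
```

Deriving the channel URIs from the same bucket and key used for the upload keeps the two from drifting apart.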


We are ready to train. 

In [34]:
estimator.fit(inputs=channels_for_training, logs=False)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-03-12-10-45-30-157



2023-03-12 10:45:30 Starting - Starting the training job...........
2023-03-12 10:46:27 Starting - Preparing the instances for training...............
2023-03-12 10:47:51 Downloading - Downloading input data.....
2023-03-12 10:48:21 Training - Downloading the training image.....
2023-03-12 10:48:52 Training - Training image download completed. Training in progress.....
2023-03-12 10:49:12 Uploading - Uploading generated training model..
2023-03-12 10:49:28 Completed - Training job completed


We can print the job name, which is the name that appears in the SageMaker console. 

In [38]:
estimator.latest_training_job.name

'sagemaker-xgboost-2023-03-12-10-45-30-157'

Finally, we can retrieve the metrics that the training job reported, here the AUC on the training and validation data. 

In [39]:
metrics = sagemaker.analytics.TrainingJobAnalytics(
    estimator.latest_training_job.name,
    metric_names = ['train:auc', 'validation:auc']
)

In [40]:
metrics.dataframe()

|   | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | train:auc | 0.817945 |
| 1 | 0.0 | validation:auc | 0.804407 |
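
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following pure-Python sketch computes it by comparing every positive/negative pair, counting ties as half a win; the function `pairwise_auc` and the toy labels and scores are illustrative and are not how SageMaker or XGBoost compute the metric internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the reported validation AUC of about 0.80 in context.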


## Deploying the model

`estimator.deploy` creates a real-time endpoint backed by one instance. The `CSVSerializer` sends requests as `text/csv`, so `predict` can be called with a plain comma-separated string of feature values.

In [41]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer())


INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-03-12-10-54-34-350
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-03-12-10-54-34-350
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-03-12-10-54-34-350


-------!

In [45]:
predictor.predict("22469,1,155,69.0,130,80,2,2,0,0,1")

b'0.6684691905975342'
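
The endpoint expects the features in the same order as the training CSV (without the label) and returns the predicted probability as bytes. Here is a minimal sketch of building the request string and turning the response into a class label; the helper names and the 0.5 threshold are assumptions made for illustration, not something the notebook or SageMaker prescribes.

```python
# Feature order of the training CSV, label column excluded
FEATURE_ORDER = [
    "age", "gender", "height", "weight", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active",
]


def build_payload(record: dict) -> str:
    """Join the feature values into one CSV line in training-column order."""
    return ",".join(str(record[name]) for name in FEATURE_ORDER)


def parse_prediction(response: bytes, threshold: float = 0.5):
    """Decode the endpoint response and apply a decision threshold (assumed 0.5)."""
    probability = float(response.decode("utf-8"))
    return probability, int(probability >= threshold)


patient = {
    "age": 22469, "gender": 1, "height": 155, "weight": 69.0,
    "ap_hi": 130, "ap_lo": 80, "cholesterol": 2, "gluc": 2,
    "smoke": 0, "alco": 0, "active": 1,
}

payload = build_payload(patient)
# The response bytes below are the ones shown in the notebook output above
probability, label = parse_prediction(b"0.6684691905975342")
print(payload, probability, label)
```

With the live endpoint, `payload` would be passed to `predictor.predict` and its return value to `parse_prediction`.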