# ML Model with xgboost

The selected machine learning algorithm is XGBoost. Detailed information about it can be found in the [SageMaker XGBoost documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).


In [1]:
import os

import time
from time import gmtime, strftime

import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


## Import the preprocessed data

To start the machine learning process, I load my preprocessed CSV files.

In [2]:
features = pd.read_csv('data/features_completed.csv', index_col=0)
labels = pd.read_csv('data/labels_completed.csv', index_col=0)

In [3]:
features.shape, labels.shape

((63288, 11), (63288, 1))

In [4]:
features.head(3)

Unnamed: 0,age,income,F,M,O,U,member_since_days,email,mobile,social,web
0,0.180723,0.466667,0,1,0,0,1630,1,1,1,1
1,0.072289,0.333333,1,0,0,0,1791,1,1,1,1
2,0.445783,0.488889,1,0,0,0,1248,1,1,1,1


In [5]:
labels.head(3)

Unnamed: 0,binary_target
0,1
1,1
2,1


## Create Training, Validation and Test Data

The loaded data is already preprocessed, so no further data cleaning steps are necessary. However, the rows of the dataset still need to be split into training, validation and test sets.
To detect overfitting during training, I additionally split a validation set off the training data. To simplify the splitting I use the train_test_split function from sklearn.model_selection.

In [6]:
from sklearn.model_selection import train_test_split 

In [7]:
# We split the dataset into 90% training and 10% testing sets.
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.10)

# Then we split the training set further into 80% training and 20% validation sets.
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.20)
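
As a rough check on what the two calls above produce, the following sketch computes the resulting set sizes. It assumes that the held-out part of each split is rounded up to a whole row, which agrees with the 45567 training rows reported in the training log further down. `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, validation, test) row counts for two successive splits.

    Assumption: each held-out part is rounded up to a whole number of rows.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


# 63288 is the number of rows reported by features.shape above.
n_train, n_val, n_test = split_sizes(63288, test_fraction=0.10, val_fraction=0.20)
print(n_train, n_val, n_test)
```

In other words, roughly 72% of the rows end up in the training set, 18% in the validation set and 10% in the test set.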

To control my input and output, I define a data directory.

In [8]:
# Define the data directory and make sure that the directory exists
data_dir = 'data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

## Create csv files for test, validation and train data

The training and validation data need the ground truth as the first column of the dataset. The test data has no ground-truth column.

XGBoost does not accept a header row or an index column. This has to be considered when writing the training, validation and test data to CSV files.


In [9]:
# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header
# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and
# validation data, it is assumed that the first entry in each row is the target variable.

X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

## Import the sagemaker specific classes and functions
In addition to the modules above, we need to import the various bits of SageMaker that we will be using. 

In [10]:
import sagemaker
from sagemaker import get_execution_role

# This is an object that represents the SageMaker session that we are currently operating in. This
# object contains some useful information that we will need to access later such as our region.
session = sagemaker.Session()

# This is an object that represents the IAM role that we are currently assigned. When we construct
# and launch the training job later we will need to tell it what IAM role it should have. Since our
# use case is relatively simple we will simply assign the training job the role we currently have.
role = get_execution_role()

## Define a prefix for s3 data upload and upload the created files

The data will now be uploaded to S3 storage.

In [11]:
prefix = 'capstone_completed'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
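
The returned locations are S3 URIs. As a sketch of how such a URI is put together, assuming the object key is the key prefix followed by the file name: `expected_s3_uri` is a hypothetical helper and `my-bucket` is a placeholder for the session's default bucket.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI at which an uploaded file is expected to end up.

    Assumption: the object key is '<key_prefix>/<file name>'.
    """
    file_name = os.path.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, file_name)


# 'my-bucket' is a placeholder; the notebook uses the session's default bucket.
uri = expected_s3_uri("my-bucket", "capstone_completed", os.path.join("data", "test.csv"))
print(uri)
```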

## Create SageMaker Estimator and Hyperparameters

I use the Amazon built-in estimator with an XGBoost container.

In [12]:
# Create a SageMaker estimator using the XGBoost container retrieved below. A single training
# instance of type ml.m4.xlarge is used, and the model artifacts are written to
# 's3://{}/{}/output'.format(session.default_bucket(), prefix).

container = sagemaker.image_uris.retrieve('xgboost', session.boto_region_name, 'latest')

xgb = sagemaker.estimator.Estimator(container,
                                    role=role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)


# TODO: Set the XGBoost hyperparameters in the xgb object. Don't forget that in this case we have a binary
#       label so we should be using the 'binary:logistic' objective.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10, 
                        num_round=200)
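
The `early_stopping_rounds=10` setting stops training once the validation metric has not improved for 10 consecutive rounds, even if `num_round=200` has not been reached. The following is a minimal sketch of that rule on made-up error values. `stopping_round` is a hypothetical helper for illustration, not the XGBoost implementation.

```python
def stopping_round(validation_errors, patience):
    """Return the index of the round at which training would stop.

    Training stops once `patience` consecutive rounds pass without a new
    best (lowest) validation error; otherwise it runs through all rounds.
    """
    best_error = float("inf")
    rounds_without_improvement = 0
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_index
    return len(validation_errors) - 1


# Made-up validation errors: the error stops improving after the third round.
errors = [0.30, 0.28, 0.27, 0.27, 0.28, 0.29, 0.27]
stopped_at = stopping_round(errors, patience=3)
print(stopped_at)
```

With a patience of 3, the sketch stops at the sixth round (index 5), three rounds after the last improvement.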



After defining the XGBoost estimator and its hyperparameters, I wrap the S3 locations of the training and validation data as training inputs.

In [13]:
#s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
#s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

Now it's time to train the estimator. To avoid overfitting, a validation set is also used.

In [14]:
%%time
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2021-10-08 14:00:55 Starting - Starting the training job...
2021-10-08 14:01:18 Starting - Launching requested ML instancesProfilerReport-1633701654: InProgress
...
2021-10-08 14:01:45 Starting - Preparing the instances for training.........
2021-10-08 14:03:25 Downloading - Downloading input data
2021-10-08 14:03:25 Training - Downloading the training image...
2021-10-08 14:03:52 Uploading - Uploading generated training model
Arguments: train
[2021-10-08:14:03:47:INFO] Running standalone xgboost training.
[2021-10-08:14:03:47:INFO] File size need to be processed in the node: 3.29mb. Available memory size in the node: 8392.94mb
[2021-10-08:14:03:47:INFO] Determined delimiter of CSV input is ','
[14:03:47] S3DistributionType set as FullyReplicated
[14:03:47] 45567x11 matrix with 501237 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-10-08:14:03:47:INFO] Determined delimiter of CSV input is

## Test the model

Now that I have fit the model to the training data, using the validation data to avoid overfitting, I can test the model. To do this I make use of SageMaker's Batch Transform functionality.
First I have to create a transformer object from the estimator.

In [15]:
# TODO: Create a transformer object from the trained model. Using an instance count of 1 and an instance type of ml.m4.xlarge
#       should be more than enough.
xgb_transformer = xgb.transformer(instance_count=1, instance_type='ml.m4.xlarge')

Now I can start the transformation. To do this I have to provide the location of the test data, the content type and a split type, in case the test data is too large to send in one request. Since the data is organized in rows, the split type is 'Line'. The content type is a text-based CSV file.



In [16]:
%%time
# TODO: Start the transform job. Make sure to specify the content type and the split type of the test data.
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

................................
Arguments: serve
[2021-10-08 14:09:46 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-10-08 14:09:46 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-10-08 14:09:46 +0000] [1] [INFO] Using worker: gevent
[2021-10-08 14:09:46 +0000] [20] [INFO] Booting worker with pid: 20
[2021-10-08 14:09:46 +0000] [21] [INFO] Booting worker with pid: 21
[2021-10-08 14:09:46 +0000] [22] [INFO] Booting worker with pid: 22
  monkey.patch_all(subprocess=True)
[2021-10-08:14:09:46:INFO] Model loaded successfully for worker : 20
Arguments: serve
[2021-10-08 14:09:46 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-10-08 14:09:46 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-10-08 14:09:46 +0000] [1] [INFO] Using worker: gevent
[2021-10-08 14:09:46 +0000] [20] [INFO] Booting worker with pid: 20
[2021-10-08 14:09:46 +0000] [

In [17]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-eu-central-1-647915836300/xgboost-2021-10-08-14-04-37-949/test.csv.out to data/test.csv.out


After the transform job has finished and its output has been downloaded, I can load the predictions for the test data.

In [18]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)


In [19]:
Y_pred.head(3)

Unnamed: 0,0
0,0.392802
1,0.455086
2,0.603957


The predictions must be rounded to integers to get a binary output.

In [20]:
predictions = [round(num) for num in Y_pred.squeeze().values]
#predictions
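
An equivalent way to obtain binary predictions is an explicit 0.5 threshold, sketched below on the three probabilities shown above. One difference worth knowing: Python's built-in `round` rounds exact ties to the nearest even integer, so `round(0.5)` is 0, whereas the threshold below assigns exactly 0.5 to class 1. Treating 0.5 as class 1 is an assumption of this sketch, not something the notebook prescribes.

```python
import numpy as np

# The first three predicted probabilities shown in the output above.
probabilities = np.array([0.392802, 0.455086, 0.603957])

# Explicit 0.5 threshold: probabilities of 0.5 or more count as class 1.
binary_predictions = (probabilities >= 0.5).astype(int)
print(binary_predictions)
```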

Now I can use the accuracy_score function from sklearn.metrics to measure the accuracy on the test set.

In [21]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predictions)

0.7106967925422658

In [22]:
test_labels = Y_test.values.flatten()
test_preds = np.array(predictions)
# calculate true positives, false positives, true negatives, false negatives
tp = np.logical_and(test_labels, test_preds).sum()
fp = np.logical_and(1-test_labels, test_preds).sum()
tn = np.logical_and(1-test_labels, 1-test_preds).sum()
fn = np.logical_and(test_labels, 1-test_preds).sum()

# calculate binary classification metrics
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

In [24]:
pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)'])

prediction (col),0.0,1.0
actual (row),Unnamed: 1_level_1,Unnamed: 2_level_1
0,1560,1043
1,788,2938


In [23]:
print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
print("\n{:<11} {:.3f}".format('Recall:', recall))
print("{:<11} {:.3f}".format('Precision:', precision))
print("{:<11} {:.3f}".format('Accuracy:', accuracy))
print()



prediction (col)   0.0   1.0
actual (row)                
0                 1560  1043
1                  788  2938

Recall:     0.789
Precision:  0.738
Accuracy:   0.711
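
The three metrics can be re-derived from the four cells of the confusion matrix alone. The sketch below does this with the counts printed above; `classification_metrics` is a hypothetical helper introduced only for this check.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision and accuracy from confusion-matrix counts."""
    return {
        "Recall": tp / (tp + fn),
        "Precision": tp / (tp + fp),
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Counts taken from the confusion matrix above (rows: actual, columns: predicted).
metrics = classification_metrics(tp=2938, fp=1043, tn=1560, fn=788)
for name, value in metrics.items():
    print("{:<11} {:.3f}".format(name + ":", value))
```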



## Train the model with hyperparameter tuning

In [24]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name = 'validation:rmse', # The metric used to compare trained models.
                                               objective_type = 'Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs = 20, # The total number of models to train
                                               max_parallel_jobs = 3, # The number of models to train in parallel
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),
                                                    'eta'      : ContinuousParameter(0.05, 0.5),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               })
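
The ranges above define the space the tuner searches with up to 20 training jobs. To illustrate what a single candidate looks like, the sketch below draws combinations by plain random sampling. This is a simplification for illustration only and not how the SageMaker tuner selects its candidates; `search_space` and `sample_candidate` are hypothetical names.

```python
import random

# Ranges copied from the tuner definition above: (kind, low, high).
search_space = {
    "max_depth": ("int", 3, 12),
    "eta": ("float", 0.05, 0.5),
    "min_child_weight": ("int", 2, 8),
    "subsample": ("float", 0.5, 0.9),
    "gamma": ("float", 0, 10),
}


def sample_candidate(space, rng):
    """Draw one hyperparameter combination uniformly from the given ranges."""
    candidate = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            candidate[name] = rng.randint(low, high)  # both ends inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_candidate(search_space, rng) for _ in range(20)]
print(candidates[0])
```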

## Fit the Hyperparameter Tuner

In [25]:
%%time
# This is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
#s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
#s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

...................................................................................................................................................................................................................................................................................................................................!
CPU times: user 1.69 s, sys: 88.5 ms, total: 1.78 s
Wall time: 27min 12s


In [26]:
best_training_job = xgb_hyperparameter_tuner.best_training_job()
best_training_job, type(best_training_job)

('xgboost-211004-1036-018-77293c4f', str)

In [27]:
%%time
xgb_attached = sagemaker.estimator.Estimator.attach(best_training_job)


2021-10-04 10:59:49 Starting - Preparing the instances for training
2021-10-04 10:59:49 Downloading - Downloading input data
2021-10-04 10:59:49 Training - Training image download completed. Training in progress.
2021-10-04 10:59:49 Uploading - Uploading generated training model
2021-10-04 10:59:49 Completed - Training job completed
CPU times: user 84.7 ms, sys: 7.99 ms, total: 92.7 ms
Wall time: 194 ms


In [28]:
%%time
xgb_tuned_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

CPU times: user 8.14 ms, sys: 0 ns, total: 8.14 ms
Wall time: 363 ms


In [29]:
%%time
xgb_tuned_transformer.transform(test_location, content_type='text/csv', split_type='Line')

.............................
Arguments: serve
[2021-10-04 11:09:29 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-10-04 11:09:29 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-10-04 11:09:29 +0000] [1] [INFO] Using worker: gevent
[2021-10-04 11:09:29 +0000] [20] [INFO] Booting worker with pid: 20
  monkey.patch_all(subprocess=True)
Arguments: serve
[2021-10-04 11:09:29 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-10-04 11:09:29 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-10-04 11:09:29 +0000] [1] [INFO] Using worker: gevent
[2021-10-04 11:09:29 +0000] [20] [INFO] Booting worker with pid: 20
  monkey.patch_all(subprocess=True)
[2021-10-04:11:09:29:INFO] Model loaded successfully for worker : 20
[2021-10-04 11:09:29 +0000] [21] [INFO] Booting worker with pid: 21
[2021-10-04 11:09:29 +0000] [22] [INFO] Booting worker with pid: 22

In [30]:
!aws s3 cp --recursive $xgb_tuned_transformer.output_path $data_dir

download: s3://sagemaker-eu-central-1-647915836300/xgboost-2021-10-04-11-04-44-892/test.csv.out to data/test.csv.out


In [31]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)

In [32]:
predictions = [round(num) for num in Y_pred.squeeze().values]
#predictions

In [33]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predictions)

0.6983146605381595

In [74]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # We have a lot of test data, so we split it into 100 batches
    # and evaluate each batch using the prediction endpoint
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 100)]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch
    # get the 'predicted_label' for every point in a batch
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}



In [90]:
test_labels

array([1, 0, 0, ..., 1, 1, 1])

In [92]:
np.array(test_preds)

array([1., 0., 1., ..., 1., 1., 1.])

In [34]:
test_labels = Y_test.values.flatten()
test_preds = np.array(predictions)
# calculate true positives, false positives, true negatives, false negatives
tp = np.logical_and(test_labels, test_preds).sum()
fp = np.logical_and(1-test_labels, test_preds).sum()
tn = np.logical_and(1-test_labels, 1-test_preds).sum()
fn = np.logical_and(test_labels, 1-test_preds).sum()

# calculate binary classification metrics
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

In [35]:
print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
print("\n{:<11} {:.3f}".format('Recall:', recall))
print("{:<11} {:.3f}".format('Precision:', precision))
print("{:<11} {:.3f}".format('Accuracy:', accuracy))
print()



prediction (col)   0.0    1.0
actual (row)                 
0                 3997   4627
1                 1674  10588

Recall:     0.863
Precision:  0.696
Accuracy:   0.698



In [None]:
print('Metrics for simple, LinearLearner.\n')

# get metrics for linear predictor
metrics = evaluate(linear_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True) # verbose means we'll print out the metrics



## Same process for viewed data

In [36]:
features = pd.read_csv('data/preprocessed_features_viewed.csv', index_col=0)
targets = pd.read_csv('data/preprocessed_targets_viewed.csv', index_col=0)

In [39]:
targets.viewed.value_counts()

1    56895
0    19382
Name: viewed, dtype: int64
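
The classes are imbalanced: about three quarters of the rows are labelled 1. A quick calculation, assuming the test split keeps roughly the same proportions, gives the accuracy of a trivial model that always predicts 1. This is a useful reference point for the accuracy reported at the end of this section.

```python
# Class counts from the value_counts() output above.
viewed_count = 56895
not_viewed_count = 19382

# Accuracy of a trivial model that always predicts the majority class (1).
majority_baseline = viewed_count / (viewed_count + not_viewed_count)
print("{:.3f}".format(majority_baseline))
```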

In [40]:
# We split the dataset into 2/3 training and 1/3 testing sets.
X_train, X_test, Y_train, Y_test = train_test_split(features, targets, test_size=0.33)

# Then we split the training set further into 2/3 training and 1/3 validation sets.
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.33)

In [56]:
# Inspect the test labels
Y_test

Unnamed: 0,viewed
251518,1
186892,1
210361,1
230635,1
170103,1
...,...
239052,1
201645,1
305218,1
231518,1


In [42]:
# Define the data directory and make sure that the directory exists
data_dir = 'data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

## Create csv files for test, validation and train data

In [43]:
# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header
# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and
# validation data, it is assumed that the first entry in each row is the target variable.

X_test.to_csv(os.path.join(data_dir, 'test_viewed.csv'), header=False, index=False)

pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation_viewed.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train_viewed.csv'), header=False, index=False)

## Import the sagemaker specific classes and functions

In [44]:
import sagemaker
from sagemaker import get_execution_role
#from sagemaker.amazon.amazon_estimator import get_image_uri

# This is an object that represents the SageMaker session that we are currently operating in. This
# object contains some useful information that we will need to access later such as our region.
session = sagemaker.Session()

# This is an object that represents the IAM role that we are currently assigned. When we construct
# and launch the training job later we will need to tell it what IAM role it should have. Since our
# use case is relatively simple we will simply assign the training job the role we currently have.
role = get_execution_role()

## Define a prefix for s3 data upload and upload the created files

In [45]:
prefix = 'capstone_viewed'

test_location = session.upload_data(os.path.join(data_dir, 'test_viewed.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation_viewed.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train_viewed.csv'), key_prefix=prefix)

## Create SageMaker Estimator and Hyperparameters

In [46]:
# Create a SageMaker estimator using the XGBoost container retrieved below. A single training
# instance of type ml.m4.xlarge is used, and the model artifacts are written to
# 's3://{}/{}/output'.format(session.default_bucket(), prefix).

container = sagemaker.image_uris.retrieve('xgboost', session.boto_region_name, 'latest')

xgb = sagemaker.estimator.Estimator(container,
                                    role=role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)


# TODO: Set the XGBoost hyperparameters in the xgb object. Don't forget that in this case we have a binary
#       label so we should be using the 'binary:logistic' objective.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10, 
                        num_round=200)



In [47]:
#s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
#s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

In [48]:
%%time
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2021-09-23 06:40:06 Starting - Starting the training job...
2021-09-23 06:40:08 Starting - Launching requested ML instancesProfilerReport-1632379206: InProgress
......
2021-09-23 06:41:25 Starting - Preparing the instances for training......
2021-09-23 06:42:37 Downloading - Downloading input data
2021-09-23 06:42:37 Training - Downloading the training image...
2021-09-23 06:43:05 Uploading - Uploading generated training model
Arguments: train
[2021-09-23:06:42:58:INFO] Running standalone xgboost training.
[2021-09-23:06:42:58:INFO] File size need to be processed in the node: 3.1mb. Available memory size in the node: 8389.71mb
[2021-09-23:06:42:58:INFO] Determined delimiter of CSV input is ','
[06:42:58] S3DistributionType set as FullyReplicated
[06:42:58] 34240x10 matrix with 342400 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-09-23:06:42:58:INFO] Determined delimiter of CSV input is

In [49]:
# TODO: Create a transformer object from the trained model. Using an instance count of 1 and an instance type of ml.m4.xlarge
#       should be more than enough.
xgb_transformer = xgb.transformer(instance_count=1, instance_type='ml.m4.xlarge')

In [50]:
%%time
# TODO: Start the transform job. Make sure to specify the content type and the split type of the test data.
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

............................
Arguments: serve
[2021-09-23 06:48:55 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-09-23 06:48:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-09-23 06:48:55 +0000] [1] [INFO] Using worker: gevent
[2021-09-23 06:48:55 +0000] [20] [INFO] Booting worker with pid: 20
[2021-09-23 06:48:55 +0000] [21] [INFO] Booting worker with pid: 21
[2021-09-23 06:48:55 +0000] [22] [INFO] Booting worker with pid: 22
[2021-09-23 06:48:55 +0000] [23] [INFO] Booting worker with pid: 23
  monkey.patch_all(subprocess=True)
  monkey.patch_all(subprocess=True)
  monkey.patch_all(subprocess=True)
[2021-09-23:06:48:55:INFO] Model loaded successfully for worker : 21
[2021-09-23:06:48:55:INFO] Model loaded successfully for worker : 20
[2021-09-23:06:48:55:INFO] Model loaded successfully for worker : 22
  monkey.patch_all(subprocess=True)
[2021-09-23

In [51]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-eu-central-1-647915836300/xgboost-2021-09-23-06-44-25-632/test_viewed.csv.out to data/test_viewed.csv.out


In [57]:
predictions = pd.read_csv(os.path.join(data_dir, 'test_viewed.csv.out'), header=None)


In [58]:
predictions = [round(num) for num in predictions.squeeze().values]
#predictions

In [59]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predictions)

0.8098283807405053