# Capstone Project: <br/> Arvato Customer Acquisition Prediction Using Supervised Learning <br/><br/> Part 3: Classification Using XGBoost

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Run-Common-Code-Notebook" data-toc-modified-id="Run-Common-Code-Notebook-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Run Common Code Notebook</a></span></li><li><span><a href="#Package-Imports" data-toc-modified-id="Package-Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Package Imports</a></span><ul class="toc-item"><li><span><a href="#Import-Sagemaker-and-Boto3" data-toc-modified-id="Import-Sagemaker-and-Boto3-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import Sagemaker and Boto3</a></span></li><li><span><a href="#Import-Custom-Packages" data-toc-modified-id="Import-Custom-Packages-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Import Custom Packages</a></span></li></ul></li><li><span><a href="#Global-Variables" data-toc-modified-id="Global-Variables-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Global Variables</a></span></li><li><span><a href="#Load-Raw-Data" data-toc-modified-id="Load-Raw-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Load Raw Data</a></span></li><li><span><a href="#Creating-The-Preprocessing-Pipeline" data-toc-modified-id="Creating-The-Preprocessing-Pipeline-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Creating The Preprocessing Pipeline</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Transforming-Train-and-Test-Data" data-toc-modified-id="Transforming-Train-and-Test-Data-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Transforming Train and Test Data</a></span></li><li><span><a href="#Export-csv-files-to-be-used-by-AWS-sagemaker-algorithms:" data-toc-modified-id="Export-csv-files-to-be-used-by-AWS-sagemaker-algorithms:-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Export csv files to be used by AWS sagemaker algorithms:</a></span></li></ul></li><li><span><a 
href="#Benchmark-Logistic-Regression-Model" data-toc-modified-id="Benchmark-Logistic-Regression-Model-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Benchmark Logistic Regression Model</a></span><ul class="toc-item"><li><span><a href="#Benchmark-Model-Definition-and-Training" data-toc-modified-id="Benchmark-Model-Definition-and-Training-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Benchmark Model Definition and Training</a></span></li><li><span><a href="#Testing-Benchmark-Model-on-Kaggle" data-toc-modified-id="Testing-Benchmark-Model-on-Kaggle-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Testing Benchmark Model on Kaggle</a></span></li></ul></li><li><span><a href="#XGBoost-Model-using-AWS-Sagemaker" data-toc-modified-id="XGBoost-Model-using-AWS-Sagemaker-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>XGBoost Model using AWS Sagemaker</a></span><ul class="toc-item"><li><span><a href="#Copy-Data-to-S3-Bucket" data-toc-modified-id="Copy-Data-to-S3-Bucket-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Copy Data to S3 Bucket</a></span></li><li><span><a href="#Define-S3-Data-Channel-Inputs" data-toc-modified-id="Define-S3-Data-Channel-Inputs-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Define S3 Data Channel Inputs</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-8.3"><span class="toc-item-num">8.3&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span></li><li><span><a href="#Launch-a-Training-Job-with-the-Best-Hyperparameters" data-toc-modified-id="Launch-a-Training-Job-with-the-Best-Hyperparameters-8.4"><span class="toc-item-num">8.4&nbsp;&nbsp;</span>Launch a Training Job with the Best Hyperparameters</a></span></li><li><span><a href="#Model-Testing-and-Evaluation" data-toc-modified-id="Model-Testing-and-Evaluation-8.5"><span class="toc-item-num">8.5&nbsp;&nbsp;</span>Model Testing and Evaluation</a></span><ul class="toc-item"><li><span><a 
href="#Getting-Test-Perdictions-Using-AWS-Batch-Transform" data-toc-modified-id="Getting-Test-Perdictions-Using-AWS-Batch-Transform-8.5.1"><span class="toc-item-num">8.5.1&nbsp;&nbsp;</span>Getting Test Perdictions Using AWS Batch Transform</a></span></li><li><span><a href="#Validation-of-Test-Data-using-Kaggle" data-toc-modified-id="Validation-of-Test-Data-using-Kaggle-8.5.2"><span class="toc-item-num">8.5.2&nbsp;&nbsp;</span>Validation of Test Data using Kaggle</a></span></li></ul></li></ul></li></ul></div>

## Run Common Code Notebook

In [385]:
%run 00_common.ipynb

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Overwriting ../src/helper_functions.py


## Package Imports

### Import Sagemaker and Boto3

In [3]:
import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

import boto3

### Import Custom Packages

In [486]:
from helper_functions import *
from transformers import *
from metadata import Metadata

## Global Variables

In [5]:
IAM_ROLE = "arn:aws:iam::995409147735:role/fady_execution_role"
BUCKET =   'fady-aws-mlnd-capstone-arvato'
region =   'us-east-1'

## Load Raw Data

In [6]:
# Load the training dataset, from a compressed archive
data_archive = os.path.join(dir_data_raw, "Udacity_MAILOUT_052018_TRAIN.tar.xz")
with tarfile.open(data_archive, "r:*") as tar:
    mailout_train = pd.read_csv(tar.extractfile('Udacity_MAILOUT_052018_TRAIN.csv'), sep=";")


In [7]:
# Load the test data
data_archive = os.path.join(dir_data_raw, "Udacity_MAILOUT_052018_TEST.tar.xz")
with tarfile.open(data_archive, "r:*") as tar:
    mailout_test = pd.read_csv(tar.extractfile('Udacity_MAILOUT_052018_TEST.csv'), sep=";")


## Creating The Preprocessing Pipeline

In [9]:
metadata = Metadata(os.path.join(dir_data_processed, 'metadata.csv'))

In [10]:
imputer_column_transformer = CustomColumnTransformer(
    [
        (
            'numeric', 
            CustomSimpleImputer(strategy="mean"), 
            partial(metadata.lookup_features, types=['numeric'])
        ),
        (
            'ordinal_nominal', 
            CustomSimpleImputer(strategy="most_frequent"), 
            partial(metadata.lookup_features, types=['ordinal', 'nominal'])
        )
    ],
#    remainder='passthrough'
)


terminal_column_transformer = CustomColumnTransformer(
    [
        (
            'numeric_ordinal', 
            StandardScaler(), 
            partial(metadata.lookup_features, types=['numeric', 'ordinal'])
        ),
        (
            'nominal', 
            OneHotEncoder(handle_unknown='ignore'), 
            partial(metadata.lookup_features, types=['nominal'])
        ),
     ],
#    remainder='passthrough'
)

correlated_cols_remover = CustomColumnTransformer(
    [
        (
            'numeric_ordinal',
            CorrelatedRemover(correlation_threshold=0.6),
            partial(metadata.lookup_features, types=['numeric', 'ordinal'])
        )
    ],
    remainder = 'passthrough'
)


custom_outlier_remover = TrainOutlierRemover(
    selector_callable=partial(metadata.lookup_features, method='intersect', types=['numeric']),
)

In [123]:
pipeline = Pipeline([
    ('preprocessor', CustomPreprocessor(metadata)), 
    ('remove_missing_cols', MissingDataColsRemover(missing_threshold=0.3)),
    ('remove_correlated_cols', correlated_cols_remover),
    ('remove_duplicate_cols', TrainDuplicatesRemover()),
    ('imputer_column', imputer_column_transformer),
    ('outlier_remover', custom_outlier_remover),
    ('terminal_column', terminal_column_transformer),
])

In [128]:
pipeline

**The full description of the pipeline steps is displayed in the following image:**  

![pipeline.svg](attachment:pipeline.svg)

## Data Preprocessing

### Transforming Train and Test Data

In [56]:
# Run the pipeline one step at a time, keeping every intermediate output
n_steps = len(pipeline)
step_names = [name for name, _ in pipeline.steps]

pipeline_output = list()
pipeline_output.append(
    {
     'X': pipeline[0].fit_transform(mailout_train.drop('RESPONSE', axis=1), mailout_train['RESPONSE']),
     'y': mailout_train['RESPONSE'].copy()
    }
)
print(0, pipeline.steps[0][0])

for i in range(1, n_steps):
    pipeline_output.append(
    {
     'X': pipeline[i].fit_transform(pipeline_output[i-1]['X'], pipeline_output[i-1]['y']),
     'y': pipeline_output[i-1]['y']
    }
    )   
    print(i, pipeline.steps[i][0])

0 preprocessor
1 remove_missing_cols
2 remove_correlated_cols
3 remove_duplicate_cols
4 imputer_column
5 outlier_remover
6 terminal_column


In [62]:
X, y = pipeline_output[-1].values()

In [80]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=985489, stratify=y)

In [90]:
X_test = pipeline.transform(mailout_test)

### Export csv files to be used by AWS sagemaker algorithms:

In [102]:
datasets = {'train': [y_train, X_train], 'valid': [y_valid, X_valid], 'test': [X_test]}

for key, frames in datasets.items():
    output_csv_path = os.path.join(dir_data_processed, f"{key}.csv")
    # Label first, no header and no index: the CSV layout SageMaker's built-in XGBoost expects
    pd.concat(frames, axis=1).to_csv(
        output_csv_path, header=False, index=False, float_format="%.16g"
    )
    print(f"Wrote {key} dataset to {output_csv_path}")

Wrote train dataset to ../input/data/processed/train.csv
Wrote valid dataset to ../input/data/processed/valid.csv
Wrote test dataset to ../input/data/processed/test.csv
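
SageMaker's built-in XGBoost reads CSV training data with the label in the first column and no header row, which is why the export above puts `y` first and passes `header=False`. A minimal sketch of a sanity check for that layout follows. `check_training_csv` is a hypothetical helper, not part of this notebook, and it assumes a binary 0/1 label.

```python
import csv
import io


def check_training_csv(csv_text):
    """Return (n_rows, n_features) if every row is numeric with a 0/1 label first.

    Hypothetical helper: raises ValueError on a header row, a non-binary
    label, or rows of unequal length.
    """
    n_rows = 0
    n_columns = None
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [float(cell) for cell in row]  # a header row fails here
        if values[0] not in (0.0, 1.0):
            raise ValueError(f"row {n_rows + 1}: first column is not a 0/1 label")
        if n_columns is None:
            n_columns = len(values)
        elif len(values) != n_columns:
            raise ValueError(f"row {n_rows + 1}: expected {n_columns} columns")
        n_rows += 1
    if n_columns is None:
        raise ValueError("no data rows found")
    return n_rows, n_columns - 1


sample = "0,0.5,-1.25\n1,0.1,2.0\n"
print(check_training_csv(sample))
```

In the notebook it could be applied to the text of `train.csv` and `valid.csv` before they are copied to S3. The `test.csv` file has no label column, so the label check would not apply to it.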


## Benchmark Logistic Regression Model

### Benchmark Model Definition and Training

In [119]:
benchmark_model = LogisticRegression(max_iter=2000, verbose=2)

In [120]:
benchmark_model.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


At iterate  241    f=  1.29100D+03    |proj g|=  5.30256D-01

At iterate  242    f=  1.29098D+03    |proj g|=  5.11092D-01

At iterate  243    f=  1.29097D+03    |proj g|=  1.29325D+00

At iterate  244    f=  1.29096D+03    |proj g|=  5.00979D-01

At iterate  245    f=  1.29095D+03    |proj g|=  2.63499D-01

At iterate  357    f=  1.29082D+03    |proj g|=  2.60271D-02

......
......
At iterate  358    f=  1.29082D+03    |proj g|=  5.68206D-02

At iterate  359    f=  1.29082D+03    |proj g|=  5.96478D-02

At iterate  360    f=  1.29082D+03    |proj g|=  1.33985D-02

At iterate  361    f=  1.29082D+03    |proj g|=  1.25780D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N 

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.0s finished


### Testing Benchmark Model on Kaggle

In [121]:
submission_benchmark = pd.DataFrame()
submission_benchmark['LNR'] = mailout_test['LNR']
# Column 1 of predict_proba holds the probability of the positive class (RESPONSE = 1)
submission_benchmark['RESPONSE'] = benchmark_model.predict_proba(X_test)[:, 1]
submission_benchmark.to_csv(os.path.join(dir_submit, "submission_benchmark.csv"), index=False)

In [122]:
!kaggle competitions submit -c udacity-arvato-identify-customers -f {os.path.join(dir_submit, "submission_benchmark.csv")} -m "benchmark raw submission"

100%|███████████████████████████████████████| 1.11M/1.11M [00:04<00:00, 244kB/s]
Successfully submitted to Udacity+Arvato: Identify Customer Segments

**Submission score:**

![kaggle-score-benchmark.svg](attachment:kaggle-score-benchmark.svg)
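
The Kaggle score only arrives after a submission, but the held-out split (`X_valid`, `y_valid`) allows a local estimate first. The sketch below computes ROC AUC with the rank-sum formulation in plain Python. It assumes the competition is scored on ROC AUC, which this notebook does not state, and `roc_auc` is a hypothetical helper rather than anything defined in `helper_functions`.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Group tied scores so that they share the same average rank
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        positives_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += average_rank * positives_in_group
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With the benchmark model this could be called as `roc_auc(list(y_valid), list(benchmark_model.predict_proba(X_valid)[:, 1]))`. In practice `sklearn.metrics.roc_auc_score` computes the same quantity and is the more natural choice inside the notebook.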

## XGBoost Model using AWS Sagemaker

### Copy Data to S3 Bucket

In [129]:
for dataset_name in ['train', 'test', 'valid']:
    !aws s3 cp {os.path.join(dir_data_processed, dataset_name + '.csv')}  s3://{BUCKET}/data/ 

upload: ../input/data/processed/train.csv to s3://fady-aws-mlnd-capstone-arvato/data/train.csv
upload: ../input/data/processed/test.csv to s3://fady-aws-mlnd-capstone-arvato/data/test.csv
upload: ../input/data/processed/valid.csv to s3://fady-aws-mlnd-capstone-arvato/data/valid.csv


### Define S3 Data Channel Inputs

In [153]:
# Define the content type and the S3 paths of the training and validation datasets
content_type = "text/csv"

# Map each SageMaker channel name to the CSV file uploaded above
data_channels = {
    channel: TrainingInput(f"s3://{BUCKET}/data/{dataset_name}.csv", content_type=content_type)
    for channel, dataset_name in {'train': 'train', 'validation': 'valid'}.items()
}

### Hyperparameter Tuning

In [193]:
# Retrieve the URI of the SageMaker built-in XGBoost container image for this region.
# The version argument selects the XGBoost release.
xgboost_image = sagemaker.image_uris.retrieve("xgboost", region, version='1.3-1')

In [329]:
estimator_kwargs = {
    'image_uri':       xgboost_image,
    'role':            IAM_ROLE,
    'instance_count':  1,
    'instance_type':   'ml.m5.2xlarge',
    'volume_size':     5,
    'output_path':     f"s3://{BUCKET}/output/",
    'base_job_name':   "xgboost-training-arvato",
}

In [330]:
estimator_hyperparameters_common = {
    "objective": "binary:logistic",
    "early_stopping_rounds": 400,
    "tree_method": 'exact', 
}


In [331]:
tuner_hyperparameter_ranges = {
    'alpha':             ContinuousParameter(0, 1000),
    'colsample_bylevel': ContinuousParameter(0.1, 1),
    'colsample_bynode' : ContinuousParameter(0.1, 1),
    'colsample_bytree' : ContinuousParameter(0.5, 1),
    'eta':               ContinuousParameter(0.1, 0.5),  
    'gamma':             ContinuousParameter(0, 5),  
    'lambda':            ContinuousParameter(0, 1000), 
    'max_delta_step':    IntegerParameter(0, 10),
    'max_depth':         IntegerParameter(0, 10),
    'min_child_weight':  ContinuousParameter(0,120),
    'num_round' :        IntegerParameter(1,4000),
    'subsample' :        ContinuousParameter(0.5, 1),
}


In [335]:
hyperparameter_tuner_kwargs = {
    'estimator':              sagemaker.estimator.Estimator(
        **estimator_kwargs, 
        hyperparameters= estimator_hyperparameters_common
    ),
    'objective_metric_name': 'validation:logloss',
    'hyperparameter_ranges':  tuner_hyperparameter_ranges,
    'objective_type':        'Minimize',
    'max_jobs':               60,
    'max_parallel_jobs':      1,
    'base_tuning_job_name':  'arvato-hpo',
    'early_stopping_type':   'Auto',
}
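
The tuner above minimizes `validation:logloss`. As a rough illustration of what that number measures, the sketch below computes binary log loss in plain Python. The function name and the clipping constant `eps` are assumptions made for this example and are not taken from XGBoost or SageMaker.

```python
import math


def binary_log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    if not labels or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and equal in length")
    total = 0.0
    for label, p in zip(labels, probabilities):
        # Clip to avoid log(0); eps is an assumption chosen for numerical safety
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(labels)


print(binary_log_loss([1, 0], [0.9, 0.1]))
```

Lower values mean the predicted probabilities sit closer to the observed labels, which is why `objective_type` is set to `'Minimize'`.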

In [336]:
# Define the hyperparameter tuning job
xgb_hyperparameter_tuner = HyperparameterTuner(**hyperparameter_tuner_kwargs)

In [337]:
# Launch the XGBoost hyperparameter tuning job without blocking the notebook
xgb_hyperparameter_tuner.fit(data_channels, wait=False)

In [384]:
xgb_hyperparameter_tuner.describe()['HyperParameterTuningJobName']

'arvato-hpo-220219-2135'

In [322]:
#Load tuning job
#xgb_hyperparameter_tuner = HyperparameterTuner.attach('arvato-hpo-220219-2135')

In [360]:
# Get the best estimator found by the tuning job
best_estimator = xgb_hyperparameter_tuner.best_estimator()

# Get the hyperparameters of the best trained model
best_estimator.hyperparameters()


2022-02-19 21:32:57 Starting - Preparing the instances for training
2022-02-19 21:32:57 Downloading - Downloading input data
2022-02-19 21:32:57 Training - Training image download completed. Training in progress.
2022-02-19 21:32:57 Uploading - Uploading generated training model
2022-02-19 21:32:57 Completed - Training job completed


{'_tuning_objective_metric': 'validation:logloss',
 'alpha': '1.4660712602454582',
 'colsample_bylevel': '0.7722861556109817',
 'colsample_bynode': '0.6697265662178977',
 'colsample_bytree': '0.657077361230149',
 'early_stopping_rounds': '400',
 'eta': '0.10820258928443521',
 'gamma': '0.48560248190593464',
 'lambda': '880.0722192875228',
 'max_delta_step': '1',
 'max_depth': '8',
 'min_child_weight': '53.322056702317354',
 'num_round': '1136',
 'objective': 'binary:logistic',
 'subsample': '0.9527266219121749',
 'tree_method': 'exact'}
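
The dictionary printed above stores every value as a string and includes a `_tuning_objective_metric` entry. A minimal sketch of turning such a dictionary into typed values follows. `parse_hyperparameters` is a hypothetical helper, and treating keys that start with an underscore as tuner bookkeeping is an assumption based only on the output shown here.

```python
def parse_hyperparameters(raw):
    """Cast stringified hyperparameter values to int or float where possible."""
    parsed = {}
    for name, value in raw.items():
        if name.startswith("_"):
            continue  # assumed tuner bookkeeping, e.g. _tuning_objective_metric
        for cast in (int, float):
            try:
                parsed[name] = cast(value)
                break
            except ValueError:
                continue
        else:
            parsed[name] = value  # not numeric: keep the original string
    return parsed


raw = {
    "_tuning_objective_metric": "validation:logloss",
    "alpha": "1.4660712602454582",
    "max_depth": "8",
    "objective": "binary:logistic",
}
print(parse_hyperparameters(raw))
```

The result could then be passed as the `hyperparameters` argument of a new `sagemaker.estimator.Estimator`, instead of copying values by hand.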

In [235]:
#xgb_hyperparameter_tuner.analytics().dataframe().sort_values('FinalObjectiveValue').reset_index().loc[1].to_dict()

{'index': 6,
 'alpha': 7.546499163869846,
 'colsample_bylevel': 0.889456164753893,
 'colsample_bynode': 0.8490888540998051,
 'colsample_bytree': 0.9472057182405791,
 'eta': 0.3904135020867301,
 'gamma': 1.0116075823162174,
 'lambda': 238.5021238531828,
 'max_delta_step': 10.0,
 'max_depth': 10.0,
 'min_child_weight': 23.690959551634243,
 'num_round': 87.0,
 'subsample': 0.6554110894756451,
 'TrainingJobName': 'arvato-hpo-220219-1027-063-a0e91ead',
 'TrainingJobStatus': 'Completed',
 'FinalObjectiveValue': 0.06015999987721443,
 'TrainingStartTime': Timestamp('2022-02-19 14:30:44+0200', tz='tzlocal()'),
 'TrainingEndTime': Timestamp('2022-02-19 14:32:17+0200', tz='tzlocal()'),
 'TrainingElapsedTimeSeconds': 93.0}

In [387]:
# Short names for the bookkeeping columns; hyperparameter columns keep their names
hpo_columns = {
    'TrainingJobName': 'job_name',
    'FinalObjectiveValue': 'obj_value',
    'TrainingElapsedTimeSeconds': 'time',
}
hyperparameter_columns = [
    'num_round', 'alpha', 'colsample_bylevel', 'colsample_bynode',
    'colsample_bytree', 'eta', 'gamma', 'lambda', 'max_delta_step',
    'max_depth', 'min_child_weight', 'subsample',
]

hpo_analytics_df = (
    xgb_hyperparameter_tuner
    .analytics()
    .dataframe()[[*hpo_columns, *hyperparameter_columns]]
    .rename(columns=hpo_columns)
    .sort_values('obj_value')
)

In [388]:
# Display the top 10 results of hyperparameter tuning job, sorted by objective metric
hpo_analytics_df.head(10)

| | job_name | obj_value | time | num_round | alpha | colsample_bylevel | colsample_bynode | colsample_bytree | eta | gamma | lambda | max_delta_step | max_depth | min_child_weight | subsample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | arvato-hpo-220219-2135-028-bd859e2b | 0.05898 | 94.0 | 1136.0 | 1.466071 | 0.772286 | 0.669727 | 0.657077 | 0.108203 | 0.485602 | 880.072219 | 1.0 | 8.0 | 53.322057 | 0.952727 |
| 3 | arvato-hpo-220219-2135-057-424417fa | 0.05902 | 84.0 | 1160.0 | 0.0 | 0.266009 | 0.144735 | 0.69643 | 0.265815 | 1.177283 | 914.404495 | 3.0 | 8.0 | 8.474547 | 0.768599 |
| 35 | arvato-hpo-220219-2135-025-a121354b | 0.0591 | 63.0 | 251.0 | 23.438728 | 0.543753 | 0.485596 | 0.505232 | 0.44571 | 0.380622 | 842.474563 | 4.0 | 9.0 | 21.097838 | 0.94467 |
| 21 | arvato-hpo-220219-2135-039-4e3dd959 | 0.0592 | 63.0 | 239.0 | 30.701911 | 0.56404 | 0.895001 | 0.744022 | 0.138256 | 3.849457 | 114.649856 | 1.0 | 2.0 | 77.382115 | 0.96118 |
| 18 | arvato-hpo-220219-2135-042-4931eb1a | 0.0594 | 91.0 | 314.0 | 14.456794 | 0.637091 | 0.714065 | 0.840035 | 0.358915 | 0.577412 | 398.453567 | 9.0 | 10.0 | 0.493639 | 0.789859 |
| 8 | arvato-hpo-220219-2135-052-429108fc | 0.05962 | 96.0 | 362.0 | 16.419685 | 0.37914 | 0.181896 | 0.510938 | 0.220832 | 0.267402 | 28.834426 | 8.0 | 10.0 | 3.131529 | 0.593819 |
| 4 | arvato-hpo-220219-2135-056-142beaed | 0.05971 | 69.0 | 670.0 | 2.99195 | 0.740151 | 0.27702 | 0.586744 | 0.452172 | 4.540986 | 14.918411 | 8.0 | 1.0 | 91.049107 | 0.837584 |
| 6 | arvato-hpo-220219-2135-054-5444120b | 0.05996 | 92.0 | 119.0 | 2.381599 | 0.618075 | 0.860724 | 0.648573 | 0.162866 | 4.254731 | 752.813919 | 0.0 | 4.0 | 22.833678 | 1.0 |
| 31 | arvato-hpo-220219-2135-029-d390b155 | 0.05998 | 102.0 | 1387.0 | 40.284431 | 0.148482 | 0.622046 | 0.838108 | 0.480566 | 3.277253 | 454.53511 | 10.0 | 9.0 | 62.257604 | 0.71653 |
| 5 | arvato-hpo-220219-2135-055-9974f078 | 0.06018 | 103.0 | 859.0 | 67.323005 | 0.318386 | 0.814581 | 0.754032 | 0.477123 | 0.678071 | 938.849075 | 2.0 | 9.0 | 0.0 | 1.0 |


In [None]:
# Output a LaTeX table of the top tuning jobs for the report
n_jobs = 5
hpo_analytics_df_output = hpo_analytics_df.head(n_jobs).copy()
# Keep only the 8-character suffix of each training job name
hpo_analytics_df_output['job_name'] = hpo_analytics_df_output['job_name'].str[-8:]
int_cols = ['time', 'num_round', 'max_delta_step', 'max_depth']
hpo_analytics_df_output[int_cols] = hpo_analytics_df_output[int_cols].astype(int)
# One row per hyperparameter, one column per job (ranked 1..n_jobs)
hpo_analytics_df_output = hpo_analytics_df_output.transpose().reset_index()
hpo_analytics_df_output.columns = ['hyperparameter', *range(1, n_jobs + 1)]
# Escape underscores for LaTeX; the raw string avoids an invalid escape sequence
hpo_analytics_df_output['hyperparameter'] = hpo_analytics_df_output['hyperparameter'].str.replace('_', r'\_', regex=False)

In [492]:
save_latex_table(
    hpo_analytics_df_output,
    path=os.path.join(dir_tables, '04_model_tuner-results.tex'),
    precision=None,
    caption="Hyperparameter Tuner Results",
    label="tab:tuner-results",
)

None


### Launch a Training Job with the Best Hyperparameters

In [338]:
# Hyperparameters for the final training job. Apart from num_round (120 instead of 87),
# these match training job arvato-hpo-220219-1027-063-a0e91ead shown above.

estimator_hyperparameters_best = {
    'alpha': 7.546499163869846,
    'colsample_bylevel': 0.889456164753893,
    'colsample_bynode': 0.8490888540998051,
    'colsample_bytree': 0.9472057182405791,
    'eta': 0.3904135020867301,
    'gamma': 1.0116075823162174,
    'lambda': 238.5021238531828,
    'max_delta_step': 10,
    'max_depth': 10,
    'min_child_weight': 23.690959551634243,
    'num_round': 120,
    'subsample': 0.6554110894756451,
}


xgb_estimator = sagemaker.estimator.Estimator(
    **estimator_kwargs, 
    hyperparameters={**estimator_hyperparameters_common, **estimator_hyperparameters_best}
)

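In [None]:
# Alternative (not executed): an estimator with only the common hyperparameters.
# Running this cell would overwrite the tuned estimator defined above.
# xgb_estimator = sagemaker.estimator.Estimator(
#     **estimator_kwargs,
#     hyperparameters=estimator_hyperparameters_common,
# )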


In [339]:
# execute the XGBoost training job
xgb_estimator.fit(data_channels, wait=True)

2022-02-19 19:36:36 Starting - Starting the training job...ProfilerReport-1645299395: InProgress
...
2022-02-19 19:37:42 Starting - Preparing the instances for training.........
2022-02-19 19:39:21 Downloading - Downloading input data
2022-02-19 19:39:21 Training - Downloading the training image...
2022-02-19 19:39:53 Training - Training image download completed. Training in progress..
[2022-02-19 19:39:56.139 ip-10-0-229-55.ec2.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-02-19:19:39:56:INFO] Imported framework sagemaker_xgboost_container.training
[2022-02-19:19:39:56:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2022-02-19:19:39:56:INFO] Failed to parse hyperparameter tree_method value exact to Json.
Returning the value itself
[2022-02-19:19:39:56:INFO] No GPUs detected (normal if no gpus installed)
[2022-02-19:19:39:56:INFO] Running


2022-02-19 19:40:30 Uploading - Uploading generated training model
2022-02-19 19:40:30 Completed - Training job completed
Training seconds: 94
Billable seconds: 94


In [340]:
# Print training job information:
print("job name : {}\n".format(xgb_estimator.latest_training_job.job_name))
print("latest_job_debugger_artifacts_path : {}\n".format(xgb_estimator.latest_job_debugger_artifacts_path()))
print("rule_output_path : {}\n".format(xgb_estimator.output_path + xgb_estimator.latest_training_job.job_name + "/rule-output"))

job name : xgboost-training-arvato-2022-02-19-19-36-35-931

latest_job_debugger_artifacts_path : None

rule_output_path : s3://fady-aws-mlnd-capstone-arvato/output/xgboost-training-arvato-2022-02-19-19-36-35-931/rule-output



In [None]:
# Load XGBoost Estimator
# estimator = sagemaker.estimator.Estimator.attach('xgboost-training-arvato-2022-02-19-19-36-35-931')

### Model Testing and Evaluation

#### Getting Test Predictions Using AWS Batch Transform

In [341]:
xgb_transformer = xgb_estimator.transformer(instance_count=1, instance_type='ml.m5.2xlarge')

In [342]:
xgb_transformer.transform(
    f"s3://{BUCKET}/data/test.csv" , 
    content_type=content_type, 
    split_type='Line', 
    wait=True)

.........................
[2022-02-19:19:50:28:INFO] No GPUs detected (normal if no gpus installed)
[2022-02-19:19:50:28:INFO] No GPUs detected (normal if no gpus installed)
[2022-02-19:19:50:28:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location / {

169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36286 "-" "Go-http-client/1.1"
169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36278 "-" "Go-http-client/1.1"
169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36311 "-" "Go-http-client/1.1"
[2022-02-19:19:50:37:INFO] Determined delimiter of CSV input is ','
169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36282 "-" "Go-http-client/1.1"
169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36324 "-" "Go-http-client/1.1"
[2022-02-19:19:50:37:INFO] Determined delimiter of CSV input is ','
[2022-02-19:19:50:37:INFO] Determined delimiter of CSV input is ','
169.254.255.130 - - [19/Feb/2022:19:50:37 +0000] "POST /invocations HTTP/1.1" 200 36286 "-" "Go-http-client/1.1"
169.254.255.130 - - [19/Feb/2022

#### Validation of Test Data using Kaggle

In [344]:
# Copy the results of the transformation to local directory
!aws s3 cp --recursive {xgb_transformer.output_path} {dir_submit}

download: s3://sagemaker-us-east-1-995409147735/xgboost-training-arvato-2022-02-19-19-45-59-351/test.csv.out to ../output/submissions/test.csv.out


In [352]:
transform_output = pd.read_csv(os.path.join(dir_submit, 'test.csv.out') , header=None, names=['RESPONSE'])
transform_output.index = mailout_test['LNR']
transform_output.to_csv(os.path.join(dir_submit, "submission.csv"))
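
Batch transform with `split_type='Line'` should return one prediction per input line, so the submission file should have exactly as many rows as `mailout_test`. A minimal sketch of a check to run before submitting follows. `validate_submission` is a hypothetical helper, and the `LNR,RESPONSE` header is assumed from the way the file is written above.

```python
import csv
import io


def validate_submission(csv_text, expected_rows):
    """Check header, probability range, unique ids and row count; return the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["LNR", "RESPONSE"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    seen_ids = set()
    for row in reader:
        probability = float(row["RESPONSE"])
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"LNR {row['LNR']}: {probability} is not a probability")
        if row["LNR"] in seen_ids:
            raise ValueError(f"LNR {row['LNR']} appears more than once")
        seen_ids.add(row["LNR"])
    if len(seen_ids) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(seen_ids)}")
    return len(seen_ids)


sample = "LNR,RESPONSE\n1754,0.0123\n1770,0.0456\n"
print(validate_submission(sample, expected_rows=2))
```

The ids in `sample` are made up for illustration. In the notebook the check could be run on the text of `submission.csv` with `expected_rows=len(mailout_test)`.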

In [357]:
!kaggle competitions submit -c udacity-arvato-identify-customers -f {os.path.join(dir_submit, "submission.csv")} -m "test predictions submission"

100%|███████████████████████████████████████| 1.01M/1.01M [00:04<00:00, 249kB/s]
Successfully submitted to Udacity+Arvato: Identify Customer Segments

**The resulting score is:**

![kaggle-test-score.svg](attachment:kaggle-test-score.svg)