# Using Experiments with XGBoost Prediction for Customer Churn

## Environment Setup

- Image: Data Science
- Kernel: Python 3
- Instance type: ml.t3.medium

## Background

This notebook builds on a previous notebook that trains a model to predict customer churn (i.e., when a company loses a customer). In this iteration, we add SageMaker Experiments so we can track and compare the results of different training runs.

This notebook has been adapted from the [SageMaker examples](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb).

## Initialize Environment and Variables

In [None]:
# NEW for Experiments lesson
# Install sagemaker-experiments
import sys
!{sys.executable} -m pip install sagemaker-experiments

In [None]:
# Import libraries
import boto3
import re
import pandas as pd
import numpy as np
import os
import time

import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput

# Get the SageMaker session and the execution role from the SageMaker domain
sess = sagemaker.Session()
role = get_execution_role()

bucket = '<name-of-your-bucket>' # Update with the name of a bucket that is already created in S3
prefix = 'demo' # The name of the folder that will be created in the S3 bucket

In [None]:
# NEW for Experiments lesson
from time import strftime
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from botocore.exceptions import ClientError

---
## Data

For this lesson, data has already been cleaned and split into two local CSV files: **train.csv** (used to train the model) and **validation.csv** (used to validate how well the model does).

We'll take these local files and upload them to S3 so SageMaker can use them.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
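The uploads above build S3 object keys with `os.path.join`, which yields forward slashes on Linux-based kernels but would use backslashes on Windows. As a minimal sketch, the keys and URIs could instead be assembled with `posixpath`. The helper names `s3_key` and `s3_uri` and the bucket name `example-bucket` are hypothetical and not part of boto3 or the SageMaker SDK.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key segments with forward slashes, regardless of the local OS."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key or key prefix."""
    return f"s3://{bucket}/{key}"


# Hypothetical bucket name, used for illustration only
train_key = s3_key("demo", "train", "train.csv")
print(train_key)
print(s3_uri("example-bucket", s3_key("demo", "train")))
```

The same `s3_key` result could be passed to `Object(...)` in place of the `os.path.join` call.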

## Experiments

In this section, we set up our experiment and trials.  Once they're set up, we can hook into them when we start training the model.

In [None]:
# NEW for Experiments lesson
# Create an experiment
create_date = strftime("%Y-%m-%d-%H-%M-%S")
experiment_name = 'demo-churn-experiment'
experiment_description = 'A demo experiment'

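# Use a try-block so we can re-use an existing experiment rather than creating a new one each time
try:
    experiment = Experiment.create(experiment_name=experiment_name,
                                   description=experiment_description)
except ClientError as e:
    # Creation fails when the name is already taken; load the existing experiment instead
    # (if the failure had another cause, load() raises and surfaces it)
    experiment = Experiment.load(experiment_name=experiment_name)
    print(f'{experiment_name} already exists and will be reused.')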

In [None]:
# NEW for Experiments lesson
# Create a trial for the experiment
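trial_name = "demo-churn-trial-1"

# Re-use the trial if it already exists (for example, when re-running the notebook)
try:
    demo_trial = Trial.create(trial_name=trial_name,
                              experiment_name=experiment_name)
except ClientError:
    demo_trial = Trial.load(trial_name=trial_name)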
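Because `experiment_name` and `trial_name` contain no `{}` placeholder, the `create_date` value defined earlier never ends up in the names, so every run of the notebook uses the same names. If a fresh trial per run is wanted, one option is to append the timestamp explicitly. The sketch below is only an illustration: `timestamped_name` is a hypothetical helper, and the naming rule it checks (at most 120 characters, letters, digits, and hyphens) is an assumption that should be verified against the SageMaker API reference.

```python
import re
import time

# Assumed naming rule: 1-120 characters, letters, digits, and hyphens,
# starting and ending with a letter or digit
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def timestamped_name(base, when=None):
    """Return base with a sortable timestamp suffix (YYYY-MM-DD-HH-MM-SS)."""
    when = time.localtime() if when is None else when
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", when)
    name = f"{base}-{suffix}"
    if len(name) > 120 or not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid name: {name!r}")
    return name


# The printed suffix depends on the current local time
print(timestamped_name("demo-churn-trial"))
```

The returned string could then be used as the `trial_name` argument of `Trial.create` and as the `TrialName` value when training.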

---
## Train

Now that we have our data in S3, we can move on to training.  In this section, we need to specify three things: where our training data is, the path to the algorithm container stored in the Elastic Container Registry, and the algorithm to use (along with hyperparameters).

The training job (the Estimator) takes in several hyperparameters.  More information on the hyperparameters for the XGBoost algorithm can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
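The training cell in this notebook sets `eval_metric` to `error`. In XGBoost, this is the binary classification error rate: the number of wrongly classified examples divided by the total number of examples, where a predicted probability above 0.5 counts as the positive class. As a rough illustration in plain Python (not the XGBoost implementation itself), the metric can be computed as follows. `classification_error` is a hypothetical helper name, and the labels and probabilities are made-up values.

```python
def classification_error(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    wrong = 0
    for label, probability in zip(labels, probabilities):
        predicted = 1 if probability > threshold else 0
        if predicted != label:
            wrong += 1
    return wrong / len(labels)


# Made-up example: the third and fourth predictions are wrong
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.6]
print(classification_error(labels, probabilities))  # 0.5
```

A lower value is better, so when comparing trials in the experiment, the run with the smallest validation error is the strongest candidate.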

In [None]:
# The location of our training and validation data in S3
s3_input_train = TrainingInput(
    s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv'
)
s3_input_validation = TrainingInput(
    s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv'
)

In [None]:
# The location of the XGBoost container version 1.5-1 (an AWS-managed container)
container = sagemaker.image_uris.retrieve('xgboost', sess.boto_region_name, '1.5-1')

In [None]:
# NEW for Experiments lesson
# Set up experiment_config, which will be passed to the Estimator
experiment_config={'ExperimentName': experiment_name,
                   'TrialName': trial_name,
                   'TrialComponentDisplayName': 'MaxDepth5'}
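To compare several training runs in the same trial, each run needs its own `experiment_config` with a distinguishing display name. The following sketch generates one config per `max_depth` value. `build_experiment_config` is a hypothetical helper, and the depth values 3, 5, and 7 are arbitrary examples rather than recommendations.

```python
def build_experiment_config(experiment_name, trial_name, max_depth):
    """Return an experiment_config dict labelled with the max_depth being tried."""
    return {
        "ExperimentName": experiment_name,
        "TrialName": trial_name,
        "TrialComponentDisplayName": f"MaxDepth{max_depth}",
    }


# One config per candidate depth (arbitrary example values)
configs = [
    build_experiment_config("demo-churn-experiment", "demo-churn-trial-1", depth)
    for depth in (3, 5, 7)
]
for config in configs:
    print(config["TrialComponentDisplayName"])
```

Each dict would be passed as `experiment_config` to a separate `fit` call whose Estimator uses the matching `max_depth` hyperparameter.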

In [None]:
# Initialize hyperparameters
hyperparameters = {
                    'max_depth':'5',
                    'eta':'0.2',
                    'gamma':'4',
                    'min_child_weight':'6',
                    'subsample':'0.8',
                    'objective':'binary:logistic',
                    'eval_metric':'error',
                    'num_round':'100'}

# Output path where the trained model will be saved
output_path = 's3://{}/{}/output'.format(bucket, prefix)

# Set up the Estimator, which defines the training job
xgb = sagemaker.estimator.Estimator(image_uri=container, 
                                    hyperparameters=hyperparameters,
                                    role=role,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge', 
                                    output_path=output_path,
                                    sagemaker_session=sess)

In [None]:
# NEW for Experiments lesson
# "fit" executes the training job
# This time, we're passing in experiment_config so that the training results will be tied to the experiment
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, experiment_config=experiment_config) 

## Cleaning Up Experiments

In this section, we iterate through the trials of an experiment, delete each trial and its trial components, and then delete the experiment itself (this cannot currently be done through the SageMaker UI).

In [None]:
# NEW for Experiments lesson
# Function to iterate through an experiment to delete its trials, then delete the experiment itself
def cleanup_sme_sdk(demo_experiment):
    # Capture the name up front so it is available even if the experiment has no trials
    experiment_name = demo_experiment.experiment_name
    for trial_summary in demo_experiment.list_trials():
        trial = Trial.load(trial_name=trial_summary.trial_name)
        for trial_component_summary in trial.list_trial_components():
            tc = TrialComponent.load(
                trial_component_name=trial_component_summary.trial_component_name)
            # A trial component must be detached from the trial before it can be deleted
            trial.remove_trial_component(tc)
            try:
                # Comment out to keep trial components
                tc.delete()
            except Exception:
                # Trial component is associated with another trial
                continue
            # To prevent throttling
            time.sleep(.5)
        trial.delete()
    demo_experiment.delete()
    print(f"\nExperiment {experiment_name} deleted")
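The fixed `time.sleep(.5)` in the cleanup function spaces out the delete calls to avoid throttling. An alternative, sketched here under the assumption that a throttled call raises an exception, is to retry the call with exponentially growing delays. `call_with_backoff` is a hypothetical helper and not part of the SageMaker SDK, and `flaky_delete` is a stand-in that only simulates failures. In real use, it would be better to catch only the specific throttling error instead of every `Exception`.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays if it raises."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Stand-in for a delete call that fails twice before succeeding
attempts = {"count": 0}


def flaky_delete():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated throttling")
    return "deleted"


# Record the delays instead of actually sleeping, to keep the demo fast
delays = []
result = call_with_backoff(flaky_delete, sleep=delays.append)
print(result, delays)  # deleted [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the helper easy to test, since the waiting can be replaced by a function that only records the requested delays.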

In [None]:
# Call the function above to delete an experiment and its trials
# Fill in your experiment name (not the display name)
experiment_to_cleanup = Experiment.load(experiment_name='demo-churn-experiment')

cleanup_sme_sdk(experiment_to_cleanup)