# Using XGBoost to Predict Customer Churn

## Environment Setup

- Image: Data Science
- Kernel: Python 3
- Instance type: ml.t3.medium

## Background

This notebook trains a model that predicts customer churn (that is, when a company loses a customer). The input is customer data from a cell phone provider, and the target feature, 'Churn?', indicates whether a customer left. The notebook is adapted from the [SageMaker examples](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb).

## Initialize Environment and Variables

In [5]:
# Import libraries
import boto3
import re
import pandas as pd
import numpy as np
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput

# Get the SageMaker session and the execution role from the SageMaker domain
sess = sagemaker.Session()
role = get_execution_role()

bucket = 'test-sagemaker-script-mode-scikit-12052023' # Update with the name of a bucket that is already created in S3
prefix = 'xgboost-demo' # The name of the folder that will be created in the S3 bucket

---
## Data

For this lesson, data has already been cleaned and split into two local CSV files: **train.csv** (used to train the model) and **validation.csv** (used to validate how well the model does).

We'll take these local files and upload them to S3 so SageMaker can use them.
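Before uploading, it can help to confirm the files are in the layout that SageMaker's built-in XGBoost expects for CSV input: no header row, with the target in the first column. The sketch below is one way to check this with only the standard library. `check_xgboost_csv` is a hypothetical helper, not part of the SageMaker SDK, and it assumes the target is encoded as 0/1.

```python
import csv
import io


def check_xgboost_csv(lines, max_rows=1000):
    """Return a list of problems found in the CSV rows (empty list: none found).

    `lines` is any iterable of text lines, for example an open file object.
    """
    problems = []
    width = None
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if row_number > max_rows:
            break
        if not row:
            continue  # skip blank lines
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            problems.append(f"row {row_number}: non-numeric value (header row?)")
            continue
        if values[0] not in (0.0, 1.0):
            problems.append(f"row {row_number}: first column is not a 0/1 label")
        if width is None:
            width = len(row)
        elif len(row) != width:
            problems.append(f"row {row_number}: expected {width} columns, got {len(row)}")
    return problems


# Tiny in-memory example; for the real files use open('train.csv', newline='')
sample = io.StringIO("1,0.5,3\n0,0.1,7\n")
print(check_xgboost_csv(sample))
```

An empty list means no problems were detected in the rows that were scanned; it does not guarantee the rest of the file is clean.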

In [6]:
# Upload the local CSV files to S3 (S3 keys always use forward slashes)
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train/train.csv').upload_file('train.csv')
s3.Bucket(bucket).Object(f'{prefix}/validation/validation.csv').upload_file('validation.csv')

---
## Train

Now that we have our data in S3, we can move on to training.  In this section, we need to specify three things: where our training data is, the path to the algorithm container stored in the Elastic Container Registry, and the algorithm to use (along with hyperparameters).

The training job (the Estimator) takes in several hyperparameters.  More information on the hyperparameters for the XGBoost algorithm can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
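To make two of the hyperparameters used later in this notebook more concrete: with `objective='binary:logistic'` the model outputs a probability for the positive class, and `eval_metric='error'` reports the fraction of rows misclassified when that probability is thresholded at 0.5. The following sketch reproduces both ideas in plain Python. The function names and sample numbers are made up for illustration, and this is not XGBoost's own implementation.

```python
import math


def sigmoid(raw_score):
    """Map a raw (log-odds) score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def binary_error(labels, probabilities, threshold=0.5):
    """Fraction of rows whose thresholded prediction differs from the label."""
    wrong = sum(
        1 for y, p in zip(labels, probabilities)
        if (1 if p > threshold else 0) != y
    )
    return wrong / len(labels)


# Illustrative raw scores and true labels (not taken from the churn data)
raw_scores = [2.0, -1.5, 0.3, -0.2]
labels = [1, 0, 0, 1]

probabilities = [sigmoid(s) for s in raw_scores]
print(binary_error(labels, probabilities))
```

Lower values of this metric are better, since it counts mistakes rather than correct predictions.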

In [7]:
# The location of our training and validation data in S3
s3_input_train = TrainingInput(
    s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv'
)
s3_input_validation = TrainingInput(
    s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv'
)
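By default, `TrainingInput` treats `s3_data` as a key prefix, so a channel picks up every object whose key starts with that prefix. The sketch below shows how such URIs are composed and which uploaded keys each channel would match. `channel_uri` and `keys_for_channel` are hypothetical helpers, and the bucket name is a placeholder.

```python
def channel_uri(bucket_name, key_prefix, channel):
    """Build the S3 URI for a data channel, ending in '/' to mark a folder."""
    return f"s3://{bucket_name}/{key_prefix}/{channel}/"


def keys_for_channel(uri, keys):
    """Return the object keys that fall under the given s3:// prefix URI."""
    bucket_and_prefix = uri[len("s3://"):]
    _, _, key_prefix = bucket_and_prefix.partition("/")
    return [key for key in keys if key.startswith(key_prefix)]


example_bucket = "my-example-bucket"  # placeholder, not a real bucket
example_prefix = "xgboost-demo"
uploaded = [
    "xgboost-demo/train/train.csv",
    "xgboost-demo/validation/validation.csv",
]

train_uri = channel_uri(example_bucket, example_prefix, "train")
print(train_uri, keys_for_channel(train_uri, uploaded))
```

Ending the prefix with a slash avoids accidentally matching sibling keys such as a hypothetical `xgboost-demo/train-old/` folder.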

In [8]:
# The location of the XGBoost container version 1.5-1 (an AWS-managed container)
container = sagemaker.image_uris.retrieve('xgboost', sess.boto_region_name, '1.5-1')

In [9]:
# Initialize hyperparameters
hyperparameters = {
                    'max_depth':'5',
                    'eta':'0.2',
                    'gamma':'4',
                    'min_child_weight':'6',
                    'subsample':'0.8',
                    'objective':'binary:logistic',
                    'eval_metric':'error',
                    'num_round':'100'}

# Output path where the trained model will be saved
output_path = 's3://{}/{}/output'.format(bucket, prefix)

# Set up the Estimator, which defines the training job
xgb = sagemaker.estimator.Estimator(image_uri=container, 
                                    hyperparameters=hyperparameters,
                                    role=role,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge', 
                                    output_path=output_path,
                                    sagemaker_session=sess)
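The hyperparameters are passed as strings. The training log further down reports that `eval_metric` and `objective` could not be parsed as JSON and that the value itself is returned, which suggests the container tries to JSON-decode each value and falls back to the raw string. A minimal sketch of that pattern, assuming this is roughly what happens (`decode_hyperparameter` is a hypothetical name, not the container's actual code):

```python
import json


def decode_hyperparameter(value):
    """Try to read a string as JSON; fall back to the string itself."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value


# A subset of the hyperparameters from this notebook, all given as strings
example = {
    'max_depth': '5',
    'eta': '0.2',
    'objective': 'binary:logistic',
    'eval_metric': 'error',
}
decoded = {name: decode_hyperparameter(value) for name, value in example.items()}
print(decoded)
```

Under this assumption, numeric strings such as `'5'` and `'0.2'` become numbers, while `'binary:logistic'` and `'error'` stay as strings, so those log messages are informational rather than errors.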

In [10]:
# "fit" executes the training job
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-05-15-05-46-09-695


2023-05-15 05:46:10 Starting - Starting the training job...
2023-05-15 05:46:34 Starting - Preparing the instances for training......
2023-05-15 05:47:36 Downloading - Downloading input data...
2023-05-15 05:48:06 Training - Downloading the training image...
2023-05-15 05:48:46 Training - Training image download completed. Training in progress...
[2023-05-15 05:48:58.797 ip-10-0-169-213.eu-west-1.compute.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-05-15 05:48:58.874 ip-10-0-169-213.eu-west-1.compute.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-05-15:05:48:59:INFO] Imported framework sagemaker_xgboost_container.training
[2023-05-15:05:48:59:INFO] Failed to parse hyperparameter eval_metric value error to Json.
Returning the value itself
[2023-05-15:05:48:59:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[34