# Training

#### Introduction

This notebook uses the XGBoost algorithm to train and host a model for the Telco-Customer-Churn dataset from [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn). \
The 01_preprocessing notebook splits the dataset into train, validation and test sets for this notebook.

#### Prerequisites and Preprocessing

This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

#### Permissions and environment variables

Here we set up the linkage and authentication to AWS services:
1. The IAM role that gives training and hosting access to your data. See the documentation for how to specify it.
2. The S3 bucket that you want to use for training and model data, and where the downloaded data is located.

#### Imports:

In [2]:
import os
import boto3
import re
import copy
import time
import pandas as pd
from time import gmtime, strftime
import sagemaker
from sagemaker import get_execution_role

pd.set_option('display.max_columns', None)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


#### Sessions:

In [3]:
role = get_execution_role()
region = boto3.Session().region_name
sess = sagemaker.Session()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


#### Bucket paths:

In [4]:
prefix = "model"
bucket = "telco-churn-demo"
bucket_path = f"s3://{bucket}"
input_data_path = "ingest/ingest-2023-10-12-23-13-30"
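
The cells that follow read the three dataset splits from fixed keys under `input_data_path` (`train/train.csv`, `val/val.csv`, `test/test.csv`). A small helper can make that layout explicit. The function name `split_uri` is illustrative and not part of the notebook.

```python
def split_uri(bucket: str, input_data_path: str, split: str) -> str:
    """Return the S3 URI of one dataset split (hypothetical helper)."""
    if split not in {"train", "val", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    return f"s3://{bucket}/{input_data_path}/{split}/{split}.csv"


# Same bucket and path values as in the cell above.
uris = {
    split: split_uri("telco-churn-demo", "ingest/ingest-2023-10-12-23-13-30", split)
    for split in ("train", "val", "test")
}
print(uris["train"])
```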

In [5]:
from sagemaker.image_uris import retrieve

container = retrieve("xgboost", region, version="1.0-1")

#### Training parameters:

In [6]:
# Ensure that the train and validation data folders generated by the 01_preprocessing notebook are reflected in the "InputDataConfig" parameter below.
job_name = f'telco-churn-xgboost-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'

common_training_params = {
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "File"},
    "RoleArn": role,
    # Default output location (an s3:// URI); each job below overrides it.
    "OutputDataConfig": {"S3OutputPath": f"{bucket_path}/{prefix}/{job_name}"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m4.2xlarge", "VolumeSizeInGB": 5},
    "StoppingCondition": {"MaxRuntimeInSeconds": 5730},
    "HyperParameters": {"max_depth": "4", "num_round": "100"},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"{bucket_path}/{input_data_path}/train/train.csv",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"{bucket_path}/{input_data_path}/val/val.csv",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
    ],
}
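
S3 locations in a training request must be full `s3://` URIs, and a bare `bucket/prefix` string is easy to write by accident. The following is a minimal sketch of a pre-flight check over a `CreateTrainingJob`-style dictionary. The helper name `find_non_s3_uris` and the example values are hypothetical.

```python
def find_non_s3_uris(params: dict) -> list:
    """Return (field, value) pairs for S3 locations that lack the s3:// scheme."""
    bad = []
    output_path = params.get("OutputDataConfig", {}).get("S3OutputPath")
    if output_path is not None and not output_path.startswith("s3://"):
        bad.append(("OutputDataConfig.S3OutputPath", output_path))
    for channel in params.get("InputDataConfig", []):
        uri = channel["DataSource"]["S3DataSource"]["S3Uri"]
        if not uri.startswith("s3://"):
            bad.append((f"InputDataConfig[{channel['ChannelName']}].S3Uri", uri))
    return bad


# Hypothetical request fragment: the output path is missing its scheme.
example_params = {
    "OutputDataConfig": {"S3OutputPath": "example-bucket/model/job"},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {"S3Uri": "s3://example-bucket/train/train.csv"}},
        }
    ],
}
problems = find_non_s3_uris(example_params)
print(problems)
```

Running such a check before `create_training_job` surfaces the problem locally instead of as an API validation error.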

Now we'll create two separate jobs, updating the parameters that are unique to each.
    
#### Training on a single instance

In [7]:
# single machine job params
single_machine_job_name = f'single-machine-{job_name}'
print("Job name is:", single_machine_job_name)

single_machine_job_params = copy.deepcopy(common_training_params)
single_machine_job_params["TrainingJobName"] = single_machine_job_name
single_machine_job_params["OutputDataConfig"]["S3OutputPath"] = f"{bucket_path}/{prefix}/{job_name}/xgboost-single"
single_machine_job_params["ResourceConfig"]["InstanceCount"] = 1

Job name is: single-machine-telco-churn-xgboost-2023-10-13-20-54-19
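
The shared `HyperParameters` do not set `objective`, so XGBoost falls back to its default regression objective (`reg:squarederror`) and the model emits continuous scores. For a binary target such as `Churn`, one option is to override the objective per job. The sketch below works on a stand-in dictionary. The choice of `binary:logistic` with `eval_metric` set to `auc` is an assumption about what suits this dataset, not something this notebook ran, and `with_hyperparameters` is a hypothetical helper.

```python
import copy


def with_hyperparameters(params: dict, **overrides) -> dict:
    """Return a deep copy of params with HyperParameters updated.

    Values are converted to strings because the training API expects
    a string-to-string map.
    """
    new_params = copy.deepcopy(params)
    new_params.setdefault("HyperParameters", {}).update(
        {key: str(value) for key, value in overrides.items()}
    )
    return new_params


# Stand-in for common_training_params; only the relevant key is shown.
base_params = {"HyperParameters": {"max_depth": "4", "num_round": "100"}}
binary_params = with_hyperparameters(
    base_params, objective="binary:logistic", eval_metric="auc"
)
print(binary_params["HyperParameters"])
```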


#### Training on multiple instances

You can also run the training job distributed over multiple instances. For larger datasets with multiple partitions, this can significantly boost the training speed. 

In [8]:
# distributed job params
distributed_job_name = f'distributed-machine-{job_name}'
print("Job name is:", distributed_job_name)

distributed_job_params = copy.deepcopy(common_training_params)
distributed_job_params["TrainingJobName"] = distributed_job_name
distributed_job_params["OutputDataConfig"][
    "S3OutputPath"
] = f"{bucket_path}/{prefix}/{job_name}/xgboost-distributed"
# number of instances used for training
distributed_job_params["ResourceConfig"][
    "InstanceCount"
] = 2  # should not exceed the number of partition files in each channel

# data distribution type for train channel
distributed_job_params["InputDataConfig"][0]["DataSource"]["S3DataSource"][
    "S3DataDistributionType"
] = "ShardedByS3Key"
# data distribution type for validation channel
distributed_job_params["InputDataConfig"][1]["DataSource"]["S3DataSource"][
    "S3DataDistributionType"
] = "ShardedByS3Key"

Job name is: distributed-machine-telco-churn-xgboost-2023-10-13-20-54-19
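
With `ShardedByS3Key`, each instance receives only a subset of the S3 objects under the channel's URI, so the number of objects limits how many instances get any data. The sketch below illustrates this with a round-robin split. The assignment SageMaker actually uses is not specified here, and `shard_keys` and the `part-*.csv` key names are hypothetical.

```python
def shard_keys(keys, instance_count):
    """Split S3 object keys across instances round-robin (illustrative only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# A channel that points at a single file leaves one of two instances without data.
single_file = shard_keys(["train/train.csv"], 2)
print(single_file)

# A channel with several partition files gives every instance a share.
partitioned = shard_keys([f"train/part-{i}.csv" for i in range(5)], 2)
print([len(shard) for shard in partitioned])
```

If the preprocessing step writes one file per channel, `FullyReplicated` may be the safer distribution type for a multi-instance job.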


Submit both jobs. `create_training_job` returns as soon as a job has been accepted, so the first job keeps running in the background while the second is submitted, and the two run in parallel.

In [9]:
sm = boto3.Session(region_name=region).client("sagemaker")

sm.create_training_job(**single_machine_job_params)
sm.create_training_job(**distributed_job_params)

status = sm.describe_training_job(TrainingJobName=distributed_job_name)["TrainingJobStatus"]
print(status)

# Wait for each job in turn and stop at the first failure,
# reporting the failure reason of the job that actually failed.
waiter = sm.get_waiter("training_job_completed_or_stopped")
for name in (distributed_job_name, single_machine_job_name):
    waiter.wait(TrainingJobName=name)
    info = sm.describe_training_job(TrainingJobName=name)
    status = info["TrainingJobStatus"]
    if status == "Failed":
        print(f"Training job {name} failed with the following error: {info['FailureReason']}")
        raise Exception("Training job failed")
print(f"Training job ended with status: {status}")

InProgress
Training job ended with status: Completed


#### Confirm both jobs have finished:

In [10]:
print(
    "Single Machine:",
    sm.describe_training_job(TrainingJobName=single_machine_job_name)["TrainingJobStatus"],
)
print(
    "Distributed:", sm.describe_training_job(TrainingJobName=distributed_job_name)["TrainingJobStatus"]
)

Single Machine: Completed
Distributed: Completed


#### Set up hosting for the model:

To set up hosting, we have to import the model from training to hosting. The steps below demonstrate hosting the model generated by the distributed training job. The same steps can be followed to host the model obtained from the single machine job.

##### Import model into hosting

Next, you register the model with hosting. This allows you the flexibility of importing models trained elsewhere. 

In [11]:
%%time
import boto3
from time import gmtime, strftime

model_name = f"{distributed_job_name}-mod"
print(model_name)

info = sm.describe_training_job(TrainingJobName=distributed_job_name)
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]
print(model_data)

primary_container = {"Image": container, "ModelDataUrl": model_data}

create_model_response = sm.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response["ModelArn"])

distributed-machine-telco-churn-xgboost-2023-10-13-20-54-19-mod
s3://telco-churn-demo/model/telco-churn-xgboost-2023-10-13-20-54-19/xgboost-distributed/distributed-machine-telco-churn-xgboost-2023-10-13-20-54-19/output/model.tar.gz
arn:aws:sagemaker:eu-central-1:941015873154:model/distributed-machine-telco-churn-xgboost-2023-10-13-20-54-19-mod
CPU times: user 9.3 ms, sys: 0 ns, total: 9.3 ms
Wall time: 706 ms
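
The model name printed above is 63 characters long, which is the maximum length the SageMaker API accepts for model and training job names. Because the names here are built from prefixes plus a timestamp, a longer prefix would make the request fail validation. The following is a minimal check. Its regular expression is a conservative approximation of the documented pattern, and `is_valid_resource_name` is a hypothetical helper.

```python
import re

# Alphanumerics and hyphens, starting and ending with an alphanumeric character.
_NAME_RE = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def is_valid_resource_name(name: str, max_length: int = 63) -> bool:
    """Check length and character set of a SageMaker-style resource name."""
    return 0 < len(name) <= max_length and _NAME_RE.fullmatch(name) is not None


model_name = "distributed-machine-telco-churn-xgboost-2023-10-13-20-54-19-mod"
print(len(model_name), is_valid_resource_name(model_name))
print(is_valid_resource_name(model_name + "el"))
```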


##### Create endpoint configuration

SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. To support this, you create an endpoint configuration that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment.

In [12]:
from time import gmtime, strftime

endpoint_config_name = f'churn-demo-feature-engineered-xgbpconfig-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'
print(endpoint_config_name)
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print(f'Endpoint Config Arn: {create_endpoint_config_response["EndpointConfigArn"]}')

churn-demo-feature-engineered-xgbpconfig-2023-10-13-21-00-36
Endpoint Config Arn: arn:aws:sagemaker:eu-central-1:941015873154:endpoint-config/churn-demo-feature-engineered-xgbpconfig-2023-10-13-21-00-36
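
For an A/B test, the endpoint configuration lists more than one production variant, and each variant receives traffic in proportion to its weight relative to the sum of all weights. The sketch below builds such a list and computes the resulting shares. The model names `churn-model-a` and `churn-model-b`, the variant names, and the 9:1 split are hypothetical.

```python
def make_variant(variant_name, model_name, weight, instance_type="ml.m4.xlarge"):
    """Build one ProductionVariants entry with a single initial instance."""
    return {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "InitialVariantWeight": weight,
    }


def traffic_shares(variants):
    """Return the fraction of requests each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = [
    make_variant("VariantA", "churn-model-a", 9),
    make_variant("VariantB", "churn-model-b", 1),
]
shares = traffic_shares(variants)
print(shares)
```

A list like `variants` would be passed as `ProductionVariants` to `create_endpoint_config`, provided both models have been created first.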


#### Create endpoint:

In [13]:
%%time
import time

endpoint_name = f'churn-demo-feature-engineered-xgb-class-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'
print(endpoint_name)
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print(f"Status: {status}")

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print(f"Status: {status}")

print(f'Arn: {resp["EndpointArn"]}')
print(f"Status: {status}")

churn-demo-feature-engineered-xgb-class-2023-10-13-21-00-37
arn:aws:sagemaker:eu-central-1:941015873154:endpoint/churn-demo-feature-engineered-xgb-class-2023-10-13-21-00-37
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:eu-central-1:941015873154:endpoint/churn-demo-feature-engineered-xgb-class-2023-10-13-21-00-37
Status: InService
CPU times: user 52 ms, sys: 6.17 ms, total: 58.2 ms
Wall time: 3min 1s
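
The polling loop above runs until the status leaves `Creating`, with no upper bound on the wait. A variant with a timeout might look like the following. It takes the status lookup as a callable so it can be tried without AWS access. `wait_for_status` and its parameters are illustrative names.

```python
import time


def wait_for_status(get_status, pending=("Creating",), poll_seconds=60,
                    timeout_seconds=1800, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it leaves the pending states or the timeout expires."""
    deadline = clock() + timeout_seconds
    status = get_status()
    while status in pending:
        if clock() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = get_status()
    return status


# Stand-in for:
#   lambda: sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

The boto3 SageMaker client also offers an `endpoint_in_service` waiter, which covers the same need when custom progress output is not required.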


#### Read the test data:

In [14]:
runtime_client = boto3.client("runtime.sagemaker", region_name=region)
test = pd.read_csv(f"{bucket_path}/{input_data_path}/test/test.csv")

In [15]:
test.head()

Unnamed: 0,Churn,tenure,MonthlyCharges,TotalCharges,gender_M,SeniorCitizen_Y,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No_phone,MultipleLines_Yes,InternetService_Fiber,InternetService_No,OnlineSecurity_No_internet,OnlineSecurity_Yes,OnlineBackup_No_internet,OnlineBackup_Yes,DeviceProtection_No_internet,DeviceProtection_Yes,TechSupport_No_internet,TechSupport_Yes,StreamingTV_No_internet,StreamingTV_Yes,StreamingMovies_No_internet,StreamingMovies_Yes,Contract_One_year,Contract_Two_years,PaperlessBilling_Yes,PaymentMethod_Credit_card,PaymentMethod_Electronic_check,PaymentMethod_Mailed_check
0,1.0,0.518765,1.199763,1.084916,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,0.0,0.1923,-0.648933,-0.30318,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,1.212503,1.306451,1.90117,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
3,1.0,-1.07275,-0.563916,-0.871316,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,-0.827902,0.28125,-0.640643,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0


#### Predict:

In [16]:
from pathlib import Path

step = 10000  # maximum number of rows sent per request
result = []

Path('data').mkdir(parents=True, exist_ok=True)

for start in range(0, test.shape[0], step):

    if os.path.exists('data/test.csv'):
        os.remove('data/test.csv')

    # Drop the first column (the target) and take the next batch of rows
    test_batch = test.iloc[start:start+step, 1:].to_numpy()
    pd.DataFrame(test_batch).to_csv('data/test.csv', index=False, header=True)

    # Read the batch back as text; the context manager closes the file
    with open('data/test.csv') as csv_file:
        my_payload_as_csv = csv_file.read()

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=my_payload_as_csv,
        ContentType='text/csv')

    result += response["Body"].read().decode("ascii").split(",")[:-1]
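
Writing each batch to `data/test.csv` and reading it back is not strictly required, because the CSV payload can be built in memory. The sketch below shows the batching and response parsing with plain lists so that it runs without an endpoint. It assumes the endpoint accepts header-less `text/csv` and returns one comma-separated score per input row, which should be verified against the deployed container. `invoke` stands in for the `invoke_endpoint` call, and all function names are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def predict_in_batches(rows, invoke, batch_size=10000):
    """Send rows in batches and collect one float score per row."""
    scores = []
    for start in range(0, len(rows), batch_size):
        payload = rows_to_csv(rows[start:start + batch_size])
        body = invoke(payload)
        scores.extend(float(value) for value in body.strip().split(",") if value)
    return scores


def fake_invoke(payload):
    """Stand-in endpoint: returns the mean of each row as its score."""
    means = []
    for line in payload.strip().split("\n"):
        values = [float(v) for v in line.split(",")]
        means.append(str(sum(values) / len(values)))
    return ",".join(means)


features = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
scores = predict_in_batches(features, fake_invoke, batch_size=2)
print(scores)
```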

In [17]:
test['pred'] = result
test['pred'] = test['pred'].astype('float').astype('int')
test.head()

Unnamed: 0,Churn,tenure,MonthlyCharges,TotalCharges,gender_M,SeniorCitizen_Y,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No_phone,MultipleLines_Yes,InternetService_Fiber,InternetService_No,OnlineSecurity_No_internet,OnlineSecurity_Yes,OnlineBackup_No_internet,OnlineBackup_Yes,DeviceProtection_No_internet,DeviceProtection_Yes,TechSupport_No_internet,TechSupport_Yes,StreamingTV_No_internet,StreamingTV_Yes,StreamingMovies_No_internet,StreamingMovies_Yes,Contract_One_year,Contract_Two_years,PaperlessBilling_Yes,PaymentMethod_Credit_card,PaymentMethod_Electronic_check,PaymentMethod_Mailed_check,pred
0,1.0,0.518765,1.199763,1.084916,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0
1,0.0,0.1923,-0.648933,-0.30318,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
2,0.0,1.212503,1.306451,1.90117,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0
3,1.0,-1.07275,-0.563916,-0.871316,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0
4,0.0,-0.827902,0.28125,-0.640643,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0
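
Casting scores with `astype('float').astype('int')` truncates toward zero, so every score between 0 and 1 becomes class 0. If the endpoint returns probabilities or regression scores rather than class labels, a threshold is the usual way to turn them into labels. The following is a minimal sketch, assuming a cut-off of 0.5 and made-up score strings. `to_labels` is a hypothetical helper.

```python
def to_labels(raw_scores, threshold=0.5):
    """Map raw score strings to 0/1 labels using a decision threshold."""
    return [1 if float(score) >= threshold else 0 for score in raw_scores]


raw_scores = ["0.12", "0.73", "0.49", "1.02"]  # made-up values for illustration

truncated = [int(float(score)) for score in raw_scores]  # what the int cast does
thresholded = to_labels(raw_scores)

print(truncated)
print(thresholded)
```

The two lists differ for any score between the threshold and 1.0, which is where most positive predictions of a probabilistic model would fall.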


#### Metrics:

In [18]:
from sklearn.metrics import classification_report

cr = classification_report(test['Churn'], test['pred'])
print(cr)

              precision    recall  f1-score   support

         0.0       0.71      1.00      0.83       300
         1.0       0.00      0.00      0.00       122

    accuracy                           0.71       422
   macro avg       0.36      0.50      0.41       422
weighted avg       0.50      0.71      0.59       422



In [19]:
from sklearn.metrics import f1_score,accuracy_score

print(f"Accuracy: {accuracy_score(test['Churn'], test['pred']):.1%}")
print(f"F1 Score {f1_score(test['Churn'], test['pred'],average='macro'):.1%}")

Accuracy: 70.9%
F1 Score 41.5%
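
The printed figures pin down the confusion matrix. An accuracy of 70.9% on 422 rows means 299 correct predictions, and a recall of 0.00 on the 122 churners means none of them were identified. That leaves 299 true negatives, 1 false positive, 122 false negatives and 0 true positives. The sketch below recomputes accuracy and macro F1 from those counts in plain Python. The helper name `f1_from_counts` is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Counts derived from the report above, with churn (1.0) as the positive class.
tn, fp, fn, tp = 299, 1, 122, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_churn = f1_from_counts(tp, fp, fn)
# For the no-churn class the roles swap: its hits are tn, its misses are fp.
f1_no_churn = f1_from_counts(tn, fn, fp)
macro_f1 = (f1_churn + f1_no_churn) / 2

print(f"Accuracy: {accuracy:.1%}")
print(f"Macro F1: {macro_f1:.1%}")
```

Macro averaging weights both classes equally, so the zero F1 on the churn class halves the score even though the accuracy looks acceptable.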


In [20]:
sm.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '6e8ae87a-1220-4670-bcd9-d993597ee809',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '6e8ae87a-1220-4670-bcd9-d993597ee809',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 13 Oct 2023 21:03:39 GMT'},
  'RetryAttempts': 0}}
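
Deleting the endpoint removes the running instance, but the endpoint configuration and the model registered earlier remain in the account. A fuller clean-up could replace the single call above. In the sketch below, `delete_endpoint`, `delete_endpoint_config` and `delete_model` are boto3 SageMaker client methods, while the `cleanup_hosting` wrapper, the recording stub used to exercise it, and the resource names are hypothetical.

```python
def cleanup_hosting(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stub that records method calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


stub = RecordingClient()
cleanup_hosting(stub, "my-endpoint", "my-endpoint-config", "my-model")
called = [name for name, _ in stub.calls]
print(called)
```

In the notebook, the call would be `cleanup_hosting(sm, endpoint_name, endpoint_config_name, model_name)`.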