## XGBoost 1 TB Distributed Training with FastFile Mode

Prerequisite: Before running this notebook, use the EC2 scripts in the repository to create the dataset and upload it to S3.

### Setup SageMaker Clients

We use the Python AWS SDK (Boto3) and a higher-level wrapper known as the SageMaker Python SDK.

In [None]:
import boto3
import sagemaker
from sagemaker.estimator import Estimator

boto_session = boto3.session.Session()
region = boto_session.region_name

sagemaker_session = sagemaker.Session()
base_job_prefix = 'xgboost-example'
role = sagemaker.get_execution_role()

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = 'ml.m5.24xlarge'

### Prepare Training Inputs

Enable FastFile mode and prepare a TrainingInput with the S3 path of your dataset.

In [None]:
from sagemaker.inputs import TrainingInput

# Replace with the S3 path that holds your data
training_path = 's3://sagemaker-us-east-1-474422712127/xgboost-1TB/'

# Set distribution to ShardedByS3Key; otherwise every instance receives a copy of all files.
# We also enable FastFile mode here, whereas the default is File mode.
train_input = TrainingInput(
    training_path,
    content_type="text/csv",
    input_mode="FastFile",
    distribution="ShardedByS3Key",
)
training_path
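
To illustrate what `ShardedByS3Key` changes compared with `FullyReplicated`, here is a small pure-Python simulation. The round-robin split, the helper `simulate_distribution`, and the `part-*.csv` object names are assumptions made for illustration only; SageMaker decides the actual key-to-instance assignment. The point is simply that each instance sees roughly 1/N of the objects instead of all of them.

```python
def simulate_distribution(keys, instance_count, distribution):
    """Return, per instance, the list of keys it would receive (illustrative only)."""
    if distribution == "FullyReplicated":
        # Every instance gets a full copy of the dataset.
        return [list(keys) for _ in range(instance_count)]
    if distribution == "ShardedByS3Key":
        # Assumption: round-robin split; the real assignment is up to SageMaker.
        return [list(keys[i::instance_count]) for i in range(instance_count)]
    raise ValueError(f"unknown distribution: {distribution}")


# Hypothetical object names standing in for the CSV parts under the S3 prefix.
keys = [f"part-{i:04d}.csv" for i in range(100)]

replicated = simulate_distribution(keys, 25, "FullyReplicated")
sharded = simulate_distribution(keys, 25, "ShardedByS3Key")

print(len(replicated[0]))  # 100 objects per instance when fully replicated
print(len(sharded[0]))     # 4 objects per instance when sharded
```

With sharding, the union of all shards still covers the full dataset, but no single instance has to read all of it.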

In [None]:
train_input.config  # ensure the config has the proper input mode and distribution
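
Rather than inspecting the output by eye, the same check can be scripted. This is a minimal sketch that assumes the config is a nested dictionary with `InputMode` at the top level and `S3DataDistributionType` under `DataSource` / `S3DataSource`. The helper `check_channel_config` and the hand-written `example_config` (including its placeholder bucket name) are hypothetical and only there so the snippet runs without AWS access; in the notebook you would pass `train_input.config` instead.

```python
def check_channel_config(config, input_mode="FastFile", distribution="ShardedByS3Key"):
    """Raise ValueError if the channel config does not match the expected settings."""
    actual_mode = config.get("InputMode")
    actual_dist = (
        config.get("DataSource", {})
        .get("S3DataSource", {})
        .get("S3DataDistributionType")
    )
    if actual_mode != input_mode:
        raise ValueError(f"InputMode is {actual_mode!r}, expected {input_mode!r}")
    if actual_dist != distribution:
        raise ValueError(
            f"S3DataDistributionType is {actual_dist!r}, expected {distribution!r}"
        )
    return True


# Hand-written stand-in for train_input.config (assumed shape, placeholder bucket).
example_config = {
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://your-bucket/xgboost-1TB/",
            "S3DataDistributionType": "ShardedByS3Key",
        }
    },
    "ContentType": "text/csv",
    "InputMode": "FastFile",
}

print(check_channel_config(example_config))
```

Failing fast here is cheaper than discovering a misconfigured channel after a 25-instance job has already started.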

In [None]:
training_instance_type = 'ml.m5.24xlarge'
training_instance_type

### Define Training Parameters

The key here is choosing an appropriate instance type and count. You may need to submit a service limit increase request as you tune your instance count.
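
As a rough aid for picking the instance count, the sketch below computes how much raw CSV each instance would receive when the data is sharded evenly. The 1 TB figure comes from the title of this notebook; the 384 GiB memory figure for ml.m5.24xlarge is an assumption you should check against the current instance specification, and `per_instance_gib` is a hypothetical helper. The in-memory footprint of parsed data generally differs from the raw CSV size, so treat the result as a starting point rather than a guarantee.

```python
DATASET_BYTES = 1 * 10**12   # ~1 TB of CSV, per the notebook title
INSTANCE_MEMORY_GIB = 384    # assumption: ml.m5.24xlarge memory; verify before relying on it


def per_instance_gib(dataset_bytes, instance_count):
    """Raw data per instance in GiB, assuming an even split across instances."""
    return dataset_bytes / instance_count / 2**30


for count in (10, 25, 50):
    shard = per_instance_gib(DATASET_BYTES, count)
    share = shard / INSTANCE_MEMORY_GIB
    print(f"{count:>3} instances: {shard:7.1f} GiB of CSV per instance "
          f"({share:.0%} of instance memory)")
```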

In [None]:
model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=25,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

xgb_train.set_hyperparameters(
    objective="reg:squarederror",  # "reg:linear" is the deprecated name for this objective
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)
training_instance_type

### Training Job

Training takes ~11 hours with 25 ml.m5.24xlarge instances.

In [None]:
xgb_train.fit({'train': train_input})
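
For budgeting, the runtime quoted above can be turned into instance-hours. This is simple arithmetic under the stated figures (25 instances, ~11 hours). The helpers and the `hourly_rate_usd` value are hypothetical placeholders, not real prices; substitute the current on-demand rate for your region.

```python
def instance_hours(instance_count, wall_clock_hours):
    """Total billed instance-hours for a job that runs on all instances in parallel."""
    return instance_count * wall_clock_hours


def estimated_cost(instance_count, wall_clock_hours, hourly_rate_usd):
    """Estimated cost given a per-instance hourly rate supplied by the caller."""
    return instance_hours(instance_count, wall_clock_hours) * hourly_rate_usd


hours = instance_hours(25, 11)
print(hours)  # 275 instance-hours

# Placeholder rate for illustration only; look up the actual price before budgeting.
hourly_rate_usd = 1.0
print(estimated_cost(25, 11, hourly_rate_usd))
```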