# Finding the Best Compute 
Finding the optimal EC2 instance can be framed as its own learning problem! Here we suggest a method that explores a variety of EC2 instances, aimed at SageMaker data scientists who struggle with:
- Finding the best EC2 instance for a training job
- Picking the right EC2 instance for deploying a model
- Updating your endpoint EC2 instance when your model changes
- Updating your instances when AWS launches new instances

### Getting data
To step through this notebook, you'll need some data. We recommend first running the example notebook below, which is available on GitHub in the sagemaker-examples repository and also comes pre-installed on your SageMaker notebook instance:
- https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/xgboost_direct_marketing 

Feel free to run all of the cells in that notebook after you specify your bucket. We'll point to the train and validation sets it produces to run our instance experiments here.

In [2]:
import sagemaker
import pandas as pd
import boto3
import os
import multiprocessing
from multiprocessing import Pool
import re
import datetime as dt
from sagemaker import get_execution_role

In [3]:
train = '/home/ec2-user/SageMaker/xgboost_direct_marketing_2019-11-22/train.csv'
validation = '/home/ec2-user/SageMaker/xgboost_direct_marketing_2019-11-22/validation.csv'

In [4]:
role = get_execution_role()

sess = sagemaker.Session()

bucket = 'mandalorian'

prefix = 'xgboost/direct-marketing'

In [5]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file(train)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file(validation)

### Run your first training job 

In [6]:
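def run_job(instance_type):
    # Launch one XGBoost training job on the given instance type.
    # `container`, `s3_input_train` and `s3_input_validation` are defined in the next cell,
    # so run that cell before calling this function.
    try:
        xgb = sagemaker.estimator.Estimator(container,
                                            role,
                                            base_job_name='lovelace',
                                            train_instance_count=1,
                                            train_instance_type=instance_type,
                                            output_path='s3://{}/{}/output'.format(bucket, prefix),
                                            sagemaker_session=sess)
        # Optional spot-training arguments, left disabled as in the original run:
        #                                     train_use_spot_instances=True,
        #                                     train_max_wait=3600)

        xgb.set_hyperparameters(max_depth=5,
                                eta=0.2,
                                gamma=4,
                                min_child_weight=6,
                                subsample=0.8,
                                silent=0,
                                objective='binary:logistic',
                                num_round=100)

        xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

    # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the run.
    except Exception:
        print('error on ', instance_type)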

In [7]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

# run_job('ml.m4.xlarge')

### Now, can you train a faster model on a different instance? 

First, let's copy all of the instance types from the AWS webpage to get the most up-to-date varieties: 
- https://aws.amazon.com/sagemaker/pricing/instance-types/

Then we use regular expressions to pull the instance type names out of the pasted text.

In [8]:
instance_str = '''ml.t2.medium
ml.t2.large	2	 	8	 	Low to Moderate
ml.t2.xlarge	4	 	16	 	Moderate
ml.t2.2xlarge	8	 	32	 	Moderate
ml.t3.medium	2	 	4	 	Low to Moderate
ml.t3.large	2	 	8
ml.t3.xlarge	4	 	16	 	Low to Moderate
ml.t3.2xlarge	8	 	32	 	Low to Moderate
ml.m5.large	2	 	8	 	High
ml.m5.xlarge	4	 	16	 	High
ml.m5.2xlarge	8	 	32	 	High
ml.m5.4xlarge	16	 	64
ml.m5.12xlarge	48	 	192	 	10 Gigabit
ml.m5.24xlarge	96	 	384	 	25 Gigabit
ml.m4.xlarge
ml.m4.4xlarge	16	-	64	-
ml.m4.10xlarge	40	-	160	- 
ml.m4.16xlarge	64	 	256	 	25 Gigabit
ml.r5.large	2	-	16	-	Up to 10 Gbps
ml.r5.xlarge	4	-	32	-	Up to 10  Gbps
ml.r5.2xlarge	8	-	64
-	Up to 10 Gbps
ml.r5.4xlarge	16	-	128	-	Up to 10 Gbps
ml.r5.12xlarge	48	-	384	-	10 Gbps
ml.r5.24xlarge	96	-	768	-	25 Gbps
ml.c5.large	2	 	4	 	Up to 10 Gbps
ml.c5.xlarge	4	-	8	-	Up to 10 Gbps
ml.c5.2xlarge	8	-	16	-	Up to 10 Gbps
ml.c5.4xlarge	16	-	32	-	Up to 10 Gbps
ml.c5.9xlarge	36	-	72	-	10 Gigabit
ml.c5.18xlarge	72	-	144	-	25 Gigabit
ml.c5d.xlarge	4	 	8	 	Up to 10 Gbps
ml.c5d.2xlarge	8	 	16	 	Up to 10 Gbps
ml.c5d.4xlarge	16	 	32	 	Up to 10 Gbps
ml.c5d.9xlarge	36	 	72	 	10 Gbps
ml.c5d.18xlarge	72	 	144	 	25 Gbps
ml.c4.large	2	 	3.75	 	Moderate
ml.c4.xlarge	4	-	7.5	-	High
ml.c4.2xlarge	8	-	15	-	High
ml.c4.4xlarge	16	 	30	 	High
ml.c4.8xlarge	36	-	60	-	10
ml.p3.2xlarge	8	1xV100	61	16	Up to 10 Gbps
ml.p3.8xlarge	32	4xV100	244	64	10 Gigabit
ml.p3.16xlarge	64	8xV100	488	128	25 Gigabit
ml.p3dn.24xlarge	96	8xV100	768	256	100 Gigabit
ml.p2.xlarge	4	1xK80	61	12	High
ml.p2.8xlarge	32	8xK80	488
96	10 Gigabit
ml.p2.16xlarge	64	16xK80	732	192	25 Gigabit
ml.g4dn.xlarge	4	1xT4	16	16	Up to 25 Gbps
ml.g4dn.2xlarge	8	1xT4	32	16	Up to 25 Gbps
ml.g4dn.4xlarge	16	1xT4	64	16	Up to 25 Gbps
ml.g4dn.8xlarge	32	1xT4	128	16	50 Gbps
ml.g4dn.12xlarge	48	4xT4	192	64	50 Gbps
ml.g4dn.16xlarge	64	1xT4	256	16	50 Gbps'''

In [9]:
def get_instances(instance_str):
    instances = []

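    # Raw strings keep the regex backslashes intact. Note that none of these patterns
    # matches a plain 'xlarge' size (for example ml.m4.xlarge), so those types are not returned.
    for phrase in [r'(ml\.[a-z][0-9]\.)(medium|large)', r'(ml\.[a-z][0-9]\.)([0-9]xlarge)', r'(ml\.[a-z][0-9]\.)([0-9][0-9]xlarge)',
                   r'(ml\.[a-z][0-9][a-z]\.)([0-9]xlarge)', r'(ml\.[a-z][0-9][a-z][a-z]\.)([0-9]xlarge)',
                   r'(ml\.[a-z][0-9][a-z]\.)([0-9][0-9]xlarge)', r'(ml\.[a-z][0-9][a-z][a-z]\.)([0-9][0-9]xlarge)']: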

        results = re.findall(phrase, instance_str)

        for k, v in results:

            instance = k + v

            instances.append(instance)
            
    return instances

instances = get_instances(instance_str)

In [11]:
instances[:10]

['ml.t2.medium',
 'ml.t2.large',
 'ml.t3.medium',
 'ml.t3.large',
 'ml.m5.large',
 'ml.r5.large',
 'ml.c5.large',
 'ml.c4.large',
 'ml.t2.2xlarge',
 'ml.t3.2xlarge']
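
The seven patterns above can also be folded into one. The following is a minimal sketch rather than the notebook's original method: the single pattern and the helper name `extract_instance_types` are assumptions made for illustration. Unlike the pattern list above, it also picks up plain `xlarge` sizes such as `ml.m4.xlarge`, and it drops duplicates while keeping first-seen order.

```python
import re

# Hypothetical single pattern: family letters, generation digits, optional suffix
# letters, then a size of medium, large, or an optional multiplier followed by xlarge.
INSTANCE_PATTERN = re.compile(r'ml\.[a-z]+[0-9]+[a-z]*\.(?:medium|large|[0-9]*xlarge)')


def extract_instance_types(text):
    """Return the unique instance type names found in text, in first-seen order."""
    return list(dict.fromkeys(INSTANCE_PATTERN.findall(text)))


# A few rows in the same shape as the pasted pricing text.
sample = '''ml.t2.medium
ml.t2.large\t2\t \t8\t \tLow to Moderate
ml.m4.xlarge
ml.p3dn.24xlarge\t96\t8xV100\t768\t256\t100 Gigabit
ml.g4dn.12xlarge\t48\t4xT4\t192\t64\t50 Gbps'''

print(extract_instance_types(sample))
```

Because the order of the returned list differs from the multi-pattern version, the first ten entries would not match the output shown above.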

After that, let's set up a multiprocessing pool to launch our training jobs in parallel. Otherwise, each job would start only after the previous one finished, and that would take forever!

In [14]:
pool = Pool(processes=multiprocessing.cpu_count())
transformed_rows = pool.map(run_job, instances)
pool.close() 
pool.join()

error on  ml.t3.large
error on  ml.c4.large
error on  ml.c5.large
2019-11-23 18:19:12 Starting - Starting the training job.
2019-11-23 18:19:12 Starting - Starting the training job.
2019-11-23 18:19:13 Starting - Starting the training job.
error on  ml.g4dn.12xlarge
error on  ml.p2.8xlarge
error on  ml.r5.24xlarge
error on  ml.c4.4xlarge
error on  ml.g4dn.8xlarge
error on  ml.c5.18xlarge
error on  ml.r5.large
error on  ml.c5d.9xlarge
error on  ml.r5.4xlarge
error on  ml.g4dn.4xlarge
error on  ml.c5d.2xlarge
error on  ml.t3.2xlarge
error on  ml.t2.large
error on  ml.c5d.4xlarge
error on  ml.g4dn.16xlarge
error on  ml.c4.8xlarge
error on  ml.c4.2xlarge
error on  ml.r5.12xlarge
error on  ml.p3.8xlarge
error on  ml.c5d.18xlarge
error on  ml.m5.4xlarge
error on  ml.t3.medium
error on  ml.m5.12xlarge
2019-11-23 18:19:21 Starting - Starting the training job.
error on  ml.m5.24xlarge
error on  ml.m4.10xlarge
error on  ml.r5.2xlarge
error on  ml.t2.2xlarge
error on  ml.m4.4xlarge
error on  ml.m4.16

[69]  train-error:0.09573  validation-error:0.105123
[70]  train-error:0.0958  validation-error:0.105001
[71]  train-error:0.095869  validation-error:0.105244
[72]  train-error:0.0958  validation-error:0.105487
[73]  train-error:0.095696  validation-error:0.105244
[74]  train-error:0.095765  validation-error:0.105608
[75]  train-error:0.095661  validation-error:0.105608
[76]  train-error:0.095973  validation-error:0.105608
[77]  train-error:0.095938  validation-error:0.105851
[78]  train-error:0.095834  validation-error:0.105851
[79]  train-error:0.0958  validation-error:0.105851
[80]  train-error:0.09573  validation-error:0.105972
[81]  train-error:0.095626  validation-error:0.106094
[82]  train-error:0.095626  validation-error:0.106094
[83]  train-error:0.095626  validation-error:0.105608
[84]  train-error:0.095904  validation-error:0.105487
[85]  train-error:0.0958  validation-error:0.105487
[86]  train-error:0.0958  validation-error:0.105365
[87]  train-error:0.095626  validation-error:0.105365
[88]  train-error:0.095626  validation-error:0.105365
[89]  train-error:0.095626  validation-error:0.105365
[90]  train-error:0.095383  validation-error:0.105608
[91]  train-error:0.095349  validation-error:0.105608
[92]  train-error:0.095349  validation-error:0.105487
[93]  train-error:0.095522  validation-error:0.105487
[94]  train-error:0.095279  validation-error:0.105851
[95]  train-error:0.095314  validation-error:0.105851
[96]  train-error:0.094933  validation-error:0.105851
[97]  train-error:0.095002  validation-error:0.105851
[98]  train-error:0.095106  validation-error:0.10573
[99]  train-error:0.095175  validation-error:0.105972
Training seconds: 53
Billable seconds: 53

2019-11-23 18:22:28 Uploading - Uploading generated training model
2019-11-23 18:22:28 Completed - Training job completed

2019-11-23 18:22:16 Completed - Training job completed
Training seconds: 46
Billable seconds: 46
Training seconds: 63
Billable seconds: 63

2019-11-23 18:23:4

2019-11-23 18:24:11 Uploading - Uploading generated training model
2019-11-23 18:24:11 Completed - Training job completed
Training seconds: 47
Billable seconds: 47
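
Because each worker spends nearly all of its time waiting on a remote training job, threads are one alternative to separate processes here. The sketch below is an illustration under stated assumptions, not the notebook's code: `launch_all` and `fake_launch` are hypothetical names, and `fake_launch` stands in for `run_job` so the example runs without AWS access. It also assumes the launcher raises on failure instead of swallowing the exception, so that each failure can be recorded against its instance type.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def launch_all(instance_types, launch_fn, max_workers=8):
    """Call launch_fn(instance_type) concurrently; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the instance type it was submitted for.
        futures = {executor.submit(launch_fn, it): it for it in instance_types}
        for future in as_completed(futures):
            instance_type = futures[future]
            try:
                future.result()
                succeeded.append(instance_type)
            except Exception as exc:
                failed[instance_type] = repr(exc)
    return sorted(succeeded), failed


def fake_launch(instance_type):
    # Hypothetical stand-in for run_job: pretend t3 types are rejected.
    if instance_type.startswith('ml.t3'):
        raise ValueError('instance type rejected (made-up error for illustration)')
    return instance_type


ok, errors = launch_all(['ml.m5.large', 'ml.t3.large', 'ml.c5.2xlarge'], fake_launch)
print('succeeded:', ok)
print('failed:', list(errors))
```

Keeping the error message per instance type makes it easier to tell an unsupported instance type apart from an account limit.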


### Search Experiment Results
Now, let's use SageMaker Search to retrieve our completed training jobs so we can compare how long each one took.

In [12]:
smclient = boto3.client(service_name='sagemaker')

# Search for completed training jobs whose input data S3 URI contains the bucket name
search_params={
   "MaxResults": 100,
   "Resource": "TrainingJob",
   "SearchExpression": { 
      "Filters": [ 
         { 
            "Name": "InputDataConfig.DataSource.S3DataSource.S3Uri",
            "Operator": "Contains",
             
             # set this to have a word that is in your bucket name
            "Value": '{}'.format(bucket)
         },
        { 
            "Name": "TrainingJobStatus",
            "Operator": "Equals",
            "Value": 'Completed'
         }, 
    ],
     
   },
    
    "SortBy": "Metrics.validation:auc",
    "SortOrder": "Descending"
}
results = smclient.search(**search_params)
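
A single `search` call returns at most `MaxResults` records. If more jobs match, the response carries a `NextToken` that can be passed back to fetch the next page. Below is a minimal sketch of that loop, assuming the search function is passed in as a callable (with the real client this would be `smclient.search`). The names `search_all` and `fake_search` are hypothetical, and `fake_search` only imitates a two-page response so the example runs without AWS access.

```python
def search_all(search_fn, search_params):
    """Collect every result by following NextToken until the response has none."""
    params = dict(search_params)  # copy so the caller's dict is not modified
    results = []
    while True:
        response = search_fn(**params)
        results.extend(response.get('Results', []))
        token = response.get('NextToken')
        if not token:
            return results
        params['NextToken'] = token


def fake_search(**kwargs):
    # Hypothetical two-page response used only to exercise the loop.
    if kwargs.get('NextToken') is None:
        return {'Results': [{'TrainingJob': {'TrainingJobName': 'job-a'}}],
                'NextToken': 'page-2'}
    return {'Results': [{'TrainingJob': {'TrainingJobName': 'job-b'}}]}


all_results = search_all(fake_search, {'Resource': 'TrainingJob', 'MaxResults': 1})
job_names = [r['TrainingJob']['TrainingJobName'] for r in all_results]
print(job_names)
```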

In [15]:
from sagemaker.model import Model

def get_models(results):

    role = sagemaker.get_execution_role()

    models = []

    for idx, each in enumerate(results['Results']):

        job_name = each['TrainingJob']['TrainingJobName']

        artifact = each['TrainingJob']['ModelArtifacts']['S3ModelArtifacts']

        # get training image
        image =  each['TrainingJob']['AlgorithmSpecification']['TrainingImage']

        m = Model(artifact, image, role = role, sagemaker_session = sess, name = job_name)

        # get training start time
        start_time = each['TrainingJob']['TrainingStartTime']
        
        # get training end time 
        end_time = each['TrainingJob']['TrainingEndTime']

        # compute and store training run time in whole seconds
        # (total_seconds() also counts whole days, unlike .seconds)
        train_time = int((end_time - start_time).total_seconds())
        
        instance_type = each['TrainingJob']['ResourceConfig']['InstanceType']
                
        models.append([instance_type, train_time, job_name])
            
#         models[instance_type] = {'train_time':train_time, "model_object":m, 'job_name':job_name}
        
    print ('Found {} models'.format(len(models)))
        
    return pd.DataFrame(models, columns = ['Instance Type', 'Train Time', 'Job Name'])

models = get_models(results)

Found 18 models
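
One detail worth knowing when computing run time from two timestamps: `timedelta.seconds` holds only the seconds component of the difference and ignores whole days, while `total_seconds()` returns the full duration. For the short jobs in this experiment both give the same number, but they diverge for a job that runs longer than a day. A small illustration with made-up timestamps:

```python
import datetime as dt

# Made-up timestamps for illustration only.
start = dt.datetime(2019, 11, 23, 18, 19, 12)
short_end = start + dt.timedelta(seconds=53)
long_end = start + dt.timedelta(days=1, seconds=53)

short_run = short_end - start
long_run = long_end - start

# For the short job both values agree.
print(short_run.seconds, short_run.total_seconds())
# For the long job, .seconds drops the whole day.
print(long_run.seconds, long_run.total_seconds())
```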


In [17]:
models.head()

|  | Instance Type | Train Time | Job Name |
|---|---|---|---|
| 0 | ml.m5.24xlarge | 40 | sagemaker-xgboost-2019-11-23-04-34-36-177 |
| 1 | ml.m5.large | 63 | sagemaker-xgboost-2019-11-23-04-34-36-173 |
| 2 | ml.p3.16xlarge | 96 | sagemaker-xgboost-2019-11-23-04-34-36-178 |
| 3 | ml.m4.xlarge | 57 | sagemaker-xgboost-2019-11-23-02-35-23-691 |
| 4 | ml.m5.2xlarge | 53 | lovelace-2019-11-23-18-19-12-641 |


### Pricing
Great! Now let's look up the latest hourly price for each instance type in our region to see which training job cost the least.

In [22]:
instance_pricing = {'ml.m5.large':0.134,
'ml.m5.xlarge':0.269,
'ml.m5.2xlarge':0.538,
'ml.m5.4xlarge':1.075,
'ml.m5.12xlarge':3.226,
'ml.m5.24xlarge':6.451,
'ml.m4.xlarge':0.28,
'ml.m4.2xlarge':0.56,
'ml.m4.4xlarge':1.12,
'ml.m4.10xlarge':2.80,
'ml.m4.16xlarge':4.48,
'ml.c5.xlarge':0.238,
'ml.c5.2xlarge':0.476,
'ml.c5.4xlarge':0.952,
'ml.c5.9xlarge':2.142,
'ml.c5.18xlarge':4.284,
'ml.c4.xlarge':0.279,
'ml.c4.2xlarge':0.557,
'ml.c4.4xlarge':1.114,
'ml.c4.8xlarge':2.227,
'ml.p2.xlarge':1.26,
'ml.p2.8xlarge':10.08,
'ml.p2.16xlarge':20.16,
'ml.p3.2xlarge':4.284,
'ml.p3.8xlarge':17.136,
'ml.p3.16xlarge':34.272,
'ml.p3dn.24xlarge':43.697}

In [34]:
def add_job_price(models, instance_pricing):

    models['Job Price'] = 0

    for idx, row in models.iterrows():
        time, instance = row['Train Time'], row['Instance Type']

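        price = instance_pricing[instance]

        # NOTE: price is an hourly rate and time is in seconds, so this product is a
        # relative cost score; divide it by 3600 to convert it to dollars.
        p = price * time
        
        models.at[idx, 'Job Price'] = p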
        
    return models

models = add_job_price(models, instance_pricing)
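
Since the prices above are hourly rates and `Train Time` is in seconds, the dollar cost of a job is the hourly rate times the run time, divided by 3600. Here is a minimal sketch of that conversion. The helper name `job_cost_usd` is an assumption for illustration, and the inputs are the `ml.m5.large` hourly price and 63-second run time shown earlier.

```python
SECONDS_PER_HOUR = 3600


def job_cost_usd(hourly_price_usd, train_seconds):
    """Convert an hourly rate and a run time in seconds into a dollar cost."""
    return hourly_price_usd * train_seconds / SECONDS_PER_HOUR


# ml.m5.large at 0.134 per hour, running for 63 seconds
cost = job_cost_usd(0.134, 63)
print('{:.6f}'.format(cost))
```

Dividing every job by the same constant does not change the ranking, so the ordering by `Job Price` stays the same either way.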

In [41]:
models.sort_values(by=['Job Price'], axis=0, ascending=True)

|  | Instance Type | Train Time | Job Name | Job Price |
|---|---|---|---|---|
| 1 | ml.m5.large | 63 | sagemaker-xgboost-2019-11-23-04-34-36-173 | 8 |
| 12 | ml.m5.large | 63 | lovelace-2019-11-23-18-19-12-640 | 8 |
| 15 | ml.m4.xlarge | 57 | sagemaker-xgboost-2019-11-23-00-06-22-652 | 15 |
| 3 | ml.m4.xlarge | 57 | sagemaker-xgboost-2019-11-23-02-35-23-691 | 15 |
| 5 | ml.m4.xlarge | 58 | xgboost-2019-11-23-00-02-07-246 | 16 |
| 14 | ml.m4.xlarge | 58 | sagemaker-xgboost-2019-11-23-04-34-17-790 | 16 |
| 8 | ml.m4.xlarge | 63 | lovelace-2019-11-23-18-18-59-635 | 17 |
| 16 | ml.c5.2xlarge | 38 | sagemaker-xgboost-2019-11-23-04-34-36-175 | 18 |
| 4 | ml.m5.2xlarge | 53 | lovelace-2019-11-23-18-19-12-641 | 28 |
| 7 | ml.c5.4xlarge | 36 | lovelace-2019-11-23-18-19-12-642 | 34 |


In [42]:
models.sort_values(by=['Train Time'], axis=0, ascending=True)

|  | Instance Type | Train Time | Job Name | Job Price |
|---|---|---|---|---|
| 7 | ml.c5.4xlarge | 36 | lovelace-2019-11-23-18-19-12-642 | 34 |
| 6 | ml.m5.12xlarge | 37 | sagemaker-xgboost-2019-11-23-02-36-51-120 | 119 |
| 16 | ml.c5.2xlarge | 38 | sagemaker-xgboost-2019-11-23-04-34-36-175 | 18 |
| 0 | ml.m5.24xlarge | 40 | sagemaker-xgboost-2019-11-23-04-34-36-177 | 258 |
| 11 | ml.p3.2xlarge | 46 | lovelace-2019-11-23-18-19-12-643 | 197 |
| 10 | ml.c5.18xlarge | 47 | sagemaker-xgboost-2019-11-23-02-36-57-091 | 201 |
| 17 | ml.p2.16xlarge | 47 | lovelace-2019-11-23-18-19-12-645 | 947 |
| 13 | ml.p3.2xlarge | 49 | sagemaker-xgboost-2019-11-23-04-34-36-176 | 209 |
| 9 | ml.m4.4xlarge | 51 | sagemaker-xgboost-2019-11-23-02-36-50-223 | 57 |
| 4 | ml.m5.2xlarge | 53 | lovelace-2019-11-23-18-19-12-641 | 28 |


### Intermediate Conclusions
Interesting. No single instance type is both the fastest and the cheapest: ml.c5.4xlarge trained fastest (36 seconds), while ml.m5.large had the lowest job price but took 63 seconds. The best compute environment therefore depends on how you weigh cost against training time.

How does this trade-off change with the size of the data? The differences look small at this scale, but do our conclusions still hold when the dataset grows dramatically?
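
One way to make the trade-off between time and price concrete is to keep only the jobs that no other job beats on both axes (a Pareto front) and then choose among those. This is a sketch of that idea, not something the notebook itself does: `pareto_front` is a hypothetical helper, and the rows are a subset copied from the result tables above.

```python
# (instance type, train time in seconds, job price) copied from the tables above
jobs = [
    ('ml.c5.4xlarge', 36, 34),
    ('ml.m5.12xlarge', 37, 119),
    ('ml.c5.2xlarge', 38, 18),
    ('ml.m5.2xlarge', 53, 28),
    ('ml.m4.xlarge', 57, 15),
    ('ml.m5.large', 63, 8),
]


def pareto_front(jobs):
    """Keep jobs that no other job matches or beats on both axes while beating on one."""
    front = []
    for name, time_s, price in jobs:
        dominated = any(
            other_time <= time_s and other_price <= price
            and (other_time < time_s or other_price < price)
            for _, other_time, other_price in jobs
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(jobs)
print(front)
```

Any job outside the front can be replaced by one that is at least as fast and at least as cheap, so only the jobs on the front require a judgment call between speed and cost.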