# Building a CPI Bid Model (Part 4)

In Part 3, we built our first model, but its initial performance was very poor. The main reason is that the number of conversions is very small relative to the total number of impressions. A model trained on this data does a good job of predicting non-converters but a poor job of predicting converters. In this part, we will address this problem and then take further steps to tune the model's performance.

## Re-sampling
To combat the relatively small number of conversions, we can [re-sample](http://www.simafore.com/blog/handling-unbalanced-data-machine-learning-models) the training set so that the conversions are more evenly balanced with the non-conversions. Here that means keeping every converting row and adding only a small random sample of the remaining data, so the model trains on a more evenly distributed dataset. Our goal is to train on a balanced dataset and then test on an unbalanced one to make sure the model still works in the real world.

Let's start by re-loading our dataset from Step 2:

In [None]:
# sys must be imported before the pip commands can reference sys.executable
import sys
!{sys.executable} -m pip install joblib
!{sys.executable} -m pip install graphviz
!{sys.executable} -m pip install mxnet

# our usual setup, also making sure some system libraries are installed
import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role
import time
import numpy as np
import pandas as pd
import os
import sys
import tarfile
import joblib
import json
import xgboost
import matplotlib.pyplot as plt
import mxnet as mx

%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics

In [None]:
df = pd.read_pickle("./data/step2-model.pkl")
df.head()

Now let's do the sampling. We'll improve the ratio so that roughly 5% of rows have a non-zero conversion rate (about 20 non-converting rows for every converting row).

In [None]:
# this gives us 100% of the data where conversion_rate > 0
sampled_data = df.loc[df['conversion_rate'] > 0]

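# now let's calculate what fraction of the non-converting rows to sample
# (20 non-converters for every converter, i.e. roughly 5% converters)
non_converters = df.loc[~(df['conversion_rate'] > 0)]
sample_rate = len(sampled_data) * 20.0 / len(non_converters)
print('sample rate: {}'.format(sample_rate))

# finally, let's merge the two together
# (sampling from the non-converters only, so no converting row is added twice)
sampled_data = pd.concat([sampled_data, non_converters.sample(frac=sample_rate)], axis=0, ignore_index=True).to_dense()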
print('new data size: {}'.format(sampled_data.shape))
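To see what the sampling arithmetic does, here is a minimal sketch on made-up data. The toy frame and its names (`toy`, `balanced`, `converter_share`) are illustrative assumptions rather than part of the real dataset, and the sample is drawn from the non-converting rows only.

```python
import pandas as pd

# Toy stand-in for the real dataset: 1,000 rows, 10 of which converted.
toy = pd.DataFrame({'conversion_rate': [0.5] * 10 + [0.0] * 990})

converters = toy.loc[toy['conversion_rate'] > 0]
non_converters = toy.loc[~(toy['conversion_rate'] > 0)]

# Keep 20 non-converters for every converter.
toy_sample_rate = len(converters) * 20.0 / len(non_converters)

balanced = pd.concat(
    [converters, non_converters.sample(frac=toy_sample_rate, random_state=0)],
    axis=0, ignore_index=True)

# Share of rows with a non-zero conversion rate after re-sampling.
converter_share = (balanced['conversion_rate'] > 0).mean()
print('rows: {}, converter share: {:.3f}'.format(len(balanced), converter_share))
```

With a 20:1 ratio the converter share works out to 1 in 21, a little under 5%.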

Okay, now we have a much better ratio of converters to non-converters, with the added benefit of a much smaller dataset that will let us move much faster from here on. This should give us a better-fitting model. Let's find out by re-running our modeling process:

In [None]:
train_data = sampled_data.sample(frac=.9)
validation_data = sampled_data.drop(train_data.index)

pd.concat([train_data['conversion_rate'], train_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/sampled/train.csv', index=False, header=False)
pd.concat([validation_data['conversion_rate'], validation_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/sampled/validation.csv', index=False, header=False)

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'sampled/train/train.csv')).upload_file('data/sampled/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'sampled/validation/validation.csv')).upload_file('data/sampled/validation.csv')

# drop the data files from disk, they are huge and we don't want to keep them
os.remove('data/sampled/train.csv')
os.remove('data/sampled/validation.csv')

train_data.head()

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/sampled'

session = sagemaker.Session()
role = get_execution_role()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(sampled_data.columns)-1,
    mini_batch_size=200,
    predictor_type='regressor'
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-sampled-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train}, job_name=job_name) 

Now let's evaluate the model as before.  I'm going to use our original test dataset (without sampling) to make sure that our sampling didn't cause us to overfit the model.

In [None]:
test_data = pd.read_pickle('./data/step2-test.pkl')

ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer

def predict(data):
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])
    
    return np.array(predictions)

# .values replaces the deprecated DataFrame.as_matrix()
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('\nmean absolute error: {error}'.format(error=test_data['error'].mean()))
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0]['error'].mean()))

As expected, we've been able to balance our model a little better.  We are now worse at determining what does not perform well but better at determining what does perform well.
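To make that trade-off concrete, here is a small sketch with made-up numbers. The helper `mae_report` and the toy targets are assumptions for illustration only. It shows why we report the error for converting rows separately: a model that always predicts zero scores very well overall on unbalanced data.

```python
import numpy as np

def mae_report(y_true, y_pred):
    """Return (overall MAE, MAE restricted to rows with a non-zero target)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_error = np.abs(y_pred - y_true)
    return abs_error.mean(), abs_error[y_true > 0].mean()

# Made-up targets: 98 non-converting rows and 2 converting rows.
y_true = [0.0] * 98 + [0.5, 0.5]

# A "model" that always predicts zero looks excellent overall...
overall_zero, nonzero_zero = mae_report(y_true, [0.0] * 100)

# ...while one that always predicts 0.1 is worse overall but better on converters.
overall_flat, nonzero_flat = mae_report(y_true, [0.1] * 100)

print('always 0.0 -> overall: {:.3f}, converters: {:.3f}'.format(overall_zero, nonzero_zero))
print('always 0.1 -> overall: {:.3f}, converters: {:.3f}'.format(overall_flat, nonzero_flat))
```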

## Hyperparameter Tuning
The next step is to tune our hyperparameters (basically, the configuration for the model). This can have a dramatic effect on the performance of the model. The bad news is that the only practical way to tune these parameters is to try a bunch of different combinations. The good news is that SageMaker has a feature specifically for this process!

We will tune six hyperparameters for our linear learner model:

* `wd`: Weight decay, also known as the L2 regularization parameter. Increasing this value makes the model more conservative.
* `l1`: L1 regularization parameter. Increasing this value tends to push more feature weights toward zero.
* `learning_rate`: Step size used by the optimizer for parameter updates.
* `mini_batch_size`: Number of observations in each mini-batch during training.
* `use_bias`: Whether the model includes a bias (intercept) term.
* `positive_example_weight_mult`: Weight assigned to positive examples relative to negative examples. The SageMaker documentation describes this for binary classifiers, so it may have little or no effect on our regressor.
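Before handing this over to SageMaker, it may help to see the "try a bunch of combinations" idea on a toy problem. The sketch below is plain random search over two ranges with a made-up loss function. `toy_loss` and `random_search` are hypothetical names, and this is not the search strategy SageMaker itself uses, only an illustration of the loop a tuner automates.

```python
import random

def toy_loss(params):
    # Hypothetical stand-in for "train a model and return its validation loss".
    return (params['learning_rate'] - 0.1) ** 2 + (params['wd'] - 0.01) ** 2

def random_search(ranges, loss_fn, max_jobs=30, seed=0):
    """Sample max_jobs configurations uniformly and keep the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float('inf')
    for _ in range(max_jobs):
        candidate = {name: rng.uniform(low, high)
                     for name, (low, high) in ranges.items()}
        loss = loss_fn(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params, best_loss

toy_ranges = {'learning_rate': (0.0001, 1.0), 'wd': (0.0001, 1.0)}
best_params, best_loss = random_search(toy_ranges, toy_loss)
print(best_params, best_loss)
```

More jobs mean a better chance of landing near the minimum, which is the same cost trade-off that `max_jobs` controls in a real tuning job.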

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {
    'wd': ContinuousParameter(0.0001, 1),
    'l1': ContinuousParameter(0.0001, 1),
    'learning_rate': ContinuousParameter(0.0001, 1),
    'mini_batch_size': IntegerParameter(100, 500),
    'use_bias': CategoricalParameter([True, False]),
    'positive_example_weight_mult': ContinuousParameter(0.0001, 1),
}
objective_metric_name = 'validation:objective_loss'

tuner = HyperparameterTuner(ll, objective_metric_name, hyperparameter_ranges, objective_type='Minimize', max_jobs=30, max_parallel_jobs=5)
job_name = 'canary-cpi-tuner-{timestamp}'.format(timestamp=int(time.time()))
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False, job_name=job_name)

In [None]:
# block until tuning finishes

sage_client = boto3.client('sagemaker', region_name=boto3.Session().region_name)
tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)
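# keep polling only while the job is still running, so a failed or stopped job can't loop forever
while tuning_job_result['HyperParameterTuningJobStatus'] in ('InProgress', 'Stopping'):
    print('tuning in progress...')
    time.sleep(60)
    tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)

if tuning_job_result['HyperParameterTuningJobStatus'] != 'Completed':
    raise RuntimeError('tuning job ended with status: {}'.format(tuning_job_result['HyperParameterTuningJobStatus']))

tuned_hyper_params = tuning_job_result['BestTrainingJob']['TunedHyperParameters']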
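A polling loop like the one above can also be factored into a small helper that takes the status lookup as a function, which makes it testable without AWS. This is a sketch under assumptions: `wait_for_terminal_status` is a hypothetical helper, and the fake status sequence stands in for repeated `describe_hyper_parameter_tuning_job` calls.

```python
import time

# Statuses after which the tuning job will not change any further.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}

def wait_for_terminal_status(get_status, poll_seconds=60, max_polls=None,
                             sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it."""
    polls = 0
    status = get_status()
    while status not in TERMINAL_STATUSES:
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError('still {} after {} polls'.format(status, polls))
        sleep(poll_seconds)
        polls += 1
        status = get_status()
    return status

# Fake status source so the sketch runs locally.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

The caller can then decide what to do with a `Failed` or `Stopped` result instead of waiting indefinitely.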

At the completion of our tuning job (it will take quite some time), we want to pull the best job and retrain/redeploy our model:

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/sampled'

session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(sampled_data.columns)-1,
    predictor_type='regressor',
    **tuned_hyper_params
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-sampled-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train}, job_name=job_name) 

Then we re-evaluate the performance:

In [None]:
test_data = pd.read_pickle('./data/step2-test.pkl')

ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer

def predict(data):
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])
    
    return np.array(predictions)

# .values replaces the deprecated DataFrame.as_matrix()
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('mean absolute error: {error}'.format(error=test_data['error'].mean()))
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0]['error'].mean()))
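The `predict` helper above sends one row per request, which can be slow on a large test set. One possible variation is to send rows in batches. The sketch below assumes the endpoint returns one entry in `predictions` per input row, and uses a fake predictor (`fake_predict`, a hypothetical stand-in) so it runs without a deployed endpoint.

```python
def chunked(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def predict_in_batches(rows, predict_fn, batch_size=100):
    """Collect one score per row, calling predict_fn once per batch."""
    scores = []
    for batch in chunked(rows, batch_size):
        result = predict_fn(batch)
        scores.extend(p['score'] for p in result['predictions'])
    return scores

def fake_predict(batch):
    # Hypothetical stand-in for the endpoint: the score is just the row sum.
    return {'predictions': [{'score': float(sum(row))} for row in batch]}

toy_rows = [[1, 0], [0, 0], [1, 1]]
toy_scores = predict_in_batches(toy_rows, fake_predict, batch_size=2)
print(toy_scores)
```

In the notebook, `predict_fn` would be `ll_predictor.predict`, with the batch size kept small enough to stay within the endpoint's request size limit.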

It turns out that our starting hyperparameters were already pretty good, and tuning didn't improve much. We'll keep this step in, as we might see different results over time.

## Feature Inspection
One last model tuning step we'll look into is the relative performance and value of features in our model.  In general, we want to simplify our model as much as possible so that it's easy to understand and can be as concise as possible.  We may also be able to get better performance from the model by eliminating certain features that are confusing the model training process and ultimately doing more harm than good.

Let's start by inspecting our existing model and determining how much impact each feature has on the final prediction:

In [None]:
import os
import mxnet as mx
import boto3

# job_name is the training job we just ran (the original referenced an undefined `jobname`)
key = '{}/output/{}/output/model.tar.gz'.format(prefix, job_name)

boto3.resource('s3').Bucket(bucket).download_file(key, './data/model.tar.gz')
os.system('tar -zxvf ./data/model.tar.gz')

# Linear learner model is itself a zip file, containing a mxnet model and other metadata.
# First unzip the model.
os.system('unzip model_algo-1')

# Load the mxnet module
mod = mx.module.Module.load("mx-mod", 0)

# model weights
weights = mod._arg_params['fc0_weight'].asnumpy().flatten()

# merge the weights in with the feature labels
model_weights = pd.DataFrame({'feature': list(sampled_data.columns[1:])})
model_weights['weights'] = weights

model_weights.head()

Now we have the model coefficients for each feature, but remember that we one-hot encoded our data, so we have a ton of features (~500). It would be too difficult to look across all of these, so let's instead aggregate by the original feature. Basically, we'll reverse our one-hot encoding and then compute some stats for each top-level feature:

In [None]:
top_level_features = set([col.split('-')[0] for col in sampled_data.columns[1:]])
top_level_weights = pd.DataFrame({'features': list(top_level_features)})

mean_weight = []
med_weight = []
min_weight = []
max_weight = []
dev_weight = []
for feature in top_level_weights['features']:
    _weights = model_weights.loc[model_weights['feature'].str.startswith(feature)]
    mean_weight.append(_weights['weights'].abs().mean())
    med_weight.append(_weights['weights'].abs().median())
    min_weight.append(_weights['weights'].abs().min())
    max_weight.append(_weights['weights'].abs().max())
    dev_weight.append(_weights['weights'].std())

top_level_weights['mean'] = mean_weight
top_level_weights['med'] = med_weight
top_level_weights['min'] = min_weight
top_level_weights['max'] = max_weight
top_level_weights['stdev'] = dev_weight

top_level_weights
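The same aggregation can also be written as a `groupby`. Here is a sketch on made-up weights, where the feature names and values are illustrative assumptions. Grouping on the exact text before the `-` avoids accidentally matching one feature whose name is a prefix of another, which `startswith` can do.

```python
import pandas as pd

# Made-up one-hot feature weights for illustration.
toy_weights = pd.DataFrame({
    'feature': ['app_bundle-a', 'app_bundle-b', 'app_bundle-c',
                'banner_height-50', 'banner_height-250'],
    'weights': [0.5, -0.1, 0.0, 0.02, -0.04],
})

# Recover the original (pre-encoding) feature name from each column label.
toy_weights['top_level'] = toy_weights['feature'].str.split('-').str[0]
toy_weights['abs_weight'] = toy_weights['weights'].abs()

# Stats on absolute weights, plus the spread of the signed weights.
summary = toy_weights.groupby('top_level')['abs_weight'].agg(
    ['mean', 'median', 'min', 'max'])
summary['stdev'] = toy_weights.groupby('top_level')['weights'].std()
print(summary)
```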

We've now calculated some descriptive stats for each feature.  Let's sort our features by median weight and graph them so we can see the relative importance of each:

In [None]:
top_level_weights = top_level_weights.sort_values(['med'])
ax = top_level_weights.plot.barh(x='features', y='med', rot=1)

We can see that three features have a very small median impact on the model: `inventory_source`, `app_bundle`, and `platform_device_make`. Let's try re-training our model without these features and see if our accuracy improves. First, we'll need to re-generate our datasets:

In [None]:
model_df = sampled_data
model_df = model_df[[col for col in model_df if not col.startswith('inventory_source')]]
model_df = model_df[[col for col in model_df if not col.startswith('app_bundle')]]
model_df = model_df[[col for col in model_df if not col.startswith('platform_device_make')]]

train_data = model_df.sample(frac=.7).to_dense()
validation_data = model_df.drop(train_data.index).sample(frac=.66).to_dense()
test_data = model_df.drop(train_data.index).drop(validation_data.index).to_dense()

pd.concat([train_data['conversion_rate'], train_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/train.csv', index=False, header=False)
pd.concat([validation_data['conversion_rate'], validation_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/validation.csv', index=False, header=False)

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/train/train.csv')).upload_file('data/slim/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/validation/validation.csv')).upload_file('data/slim/validation.csv')

# drop the data files from disk, they are huge and we don't want to keep them
os.remove('data/slim/train.csv')
os.remove('data/slim/validation.csv')
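The two chained `sample` calls above produce roughly a 70/20/10 train/validation/test split. A minimal sketch of that arithmetic on a toy frame follows. `three_way_split` is a hypothetical helper, and it relies on the frame having a unique index, as `sampled_data` does after `ignore_index=True`.

```python
import pandas as pd

def three_way_split(frame, train_frac=0.7, validation_frac_of_rest=0.66, seed=0):
    """Split into train / validation / test by dropping already-used index labels."""
    train = frame.sample(frac=train_frac, random_state=seed)
    rest = frame.drop(train.index)
    validation = rest.sample(frac=validation_frac_of_rest, random_state=seed)
    test = rest.drop(validation.index)
    return train, validation, test

toy_rows = pd.DataFrame({'conversion_rate': [0.0] * 1000})
toy_train, toy_validation, toy_test = three_way_split(toy_rows)
print(len(toy_train), len(toy_validation), len(toy_test))
```

Taking 66% of the remaining 30% leaves about 20% of all rows for validation and about 10% for testing.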

Now let's retrain and re-test the model with the slimmed down dataset:

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/slim'

session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(train_data.columns)-1,
    mini_batch_size=200,
    predictor_type='regressor'
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-slim-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train, 'validation': s3_input_validation}, job_name=job_name) 

In [None]:
ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer

def predict(data):
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])
    
    return np.array(predictions)

# .values replaces the deprecated DataFrame.as_matrix()
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('\nmean absolute error: {error}'.format(error=test_data['error'].mean()))
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0]['error'].mean()))

Okay, so our performance actually got worse when we removed these fields. Re-examining the table above, we see that some of the features we eliminated have relatively high "max" weights. In other words, a few individual values (specific `app_bundle`s, for example) do have a significant impact on the model. What if we instead look for low-coefficient features that also have low standard deviations?

In [None]:
top_level_weights = top_level_weights.sort_values(['stdev'], ascending=False)
ax = top_level_weights.plot.barh(x='features', y=['stdev', 'med'], rot=1)

We can see that we get a different set of low-performing features this way. Let's instead try removing the 5 lowest features (we'll leave `rewarded` in since it has a fairly high median value) and then re-train one more time:

In [None]:
model_df = sampled_data
model_df = model_df[[col for col in model_df if not col.startswith('inventory_interstitial')]]
model_df = model_df[[col for col in model_df if not col.startswith('banner_height')]]
model_df = model_df[[col for col in model_df if not col.startswith('placement_type')]]
model_df = model_df[[col for col in model_df if not col.startswith('platform_bandwidth')]]
model_df = model_df[[col for col in model_df if not col.startswith('banner_width')]]

train_data = model_df.sample(frac=.7).to_dense()
validation_data = model_df.drop(train_data.index).sample(frac=.66).to_dense()
test_data = model_df.drop(train_data.index).drop(validation_data.index).to_dense()

pd.concat([train_data['conversion_rate'], train_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/train.csv', index=False, header=False)
pd.concat([validation_data['conversion_rate'], validation_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/validation.csv', index=False, header=False)

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/train/train.csv')).upload_file('data/slim/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/validation/validation.csv')).upload_file('data/slim/validation.csv')

# drop the data files from disk, they are huge and we don't want to keep them
os.remove('data/slim/train.csv')
os.remove('data/slim/validation.csv')

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/slim'

session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(train_data.columns)-1,
    mini_batch_size=200,
    predictor_type='regressor'
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-slim-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train, 'validation': s3_input_validation}, job_name=job_name) 

In [None]:
ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer

def predict(data):
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])
    
    return np.array(predictions)

# .values replaces the deprecated DataFrame.as_matrix()
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('\nmean absolute error: {error}'.format(error=test_data['error'].mean()))
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0]['error'].mean()))

This also didn't improve our model. One last thing we can try is to take out low-coefficient features post-encoding. Let's take a look at the coefficient distribution:

In [None]:
model_weights['weights'].hist(bins=100)
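A histogram is easier to act on alongside a count. The sketch below uses made-up weights and a small cutoff to show how one might measure the share of coefficients that sit at or near zero. `near_zero_share` and the `0.001` threshold are illustrative assumptions.

```python
def near_zero_share(weights, threshold=0.001):
    """Fraction of weights whose absolute value is at or below the threshold."""
    small = [w for w in weights if abs(w) <= threshold]
    return len(small) / len(weights)

# Made-up coefficients: half of them are at or near zero.
toy_coefficients = [0.0, 0.0004, -0.0009, 0.02, -0.3, 0.0, 0.0011, 0.5]
share = near_zero_share(toy_coefficients)
print('share of near-zero weights: {:.2f}'.format(share))
```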

It appears that there are a lot of features with a weight at or near 0. Let's try to remove these and re-train:

In [None]:
feat_to_keep = model_weights.loc[model_weights['weights'].abs() > 0.001]
# copy so that adding the label column doesn't write into a slice of sampled_data
model_df = sampled_data[list(feat_to_keep['feature'])].copy()
model_df['conversion_rate'] = sampled_data['conversion_rate']

train_data = model_df.sample(frac=.7).to_dense()
validation_data = model_df.drop(train_data.index).sample(frac=.66).to_dense()
test_data = model_df.drop(train_data.index).drop(validation_data.index).to_dense()

pd.concat([train_data['conversion_rate'], train_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/train.csv', index=False, header=False)
pd.concat([validation_data['conversion_rate'], validation_data.drop(['conversion_rate'], axis=1)], axis=1).to_csv('data/slim/validation.csv', index=False, header=False)

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/train/train.csv')).upload_file('data/slim/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'slim/validation/validation.csv')).upload_file('data/slim/validation.csv')

# drop the data files from disk, they are huge and we don't want to keep them
os.remove('data/slim/train.csv')
os.remove('data/slim/validation.csv')

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/slim'

session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(train_data.columns)-1,
    mini_batch_size=200,
    predictor_type='regressor'
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-slim-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train, 'validation': s3_input_validation}, job_name=job_name) 

In [None]:
ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer

def predict(data):
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])
    
    return np.array(predictions)

# .values replaces the deprecated DataFrame.as_matrix()
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('\nmean absolute error: {error}'.format(error=test_data['error'].mean()))
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0]['error'].mean()))

Again, our model actually got worse. We could continue adding and removing different combinations of features, but let's stop here. We'll redeploy our post-tuning model and move on to actually deploying the Bid Model in Beeswax.

In [None]:
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker/sampled'

session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
ll.set_hyperparameters(
    feature_dim=len(sampled_data.columns)-1,
    predictor_type='regressor',
    **tuned_hyper_params
)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='text/csv')

job_name = 'canary-cpi-model-sampled-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train}, job_name=job_name) 
ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
# save the model features
features = list(sampled_data.columns)
features.remove('conversion_rate')
prod_model = {
    'features': features,
    'endpoint_name': job_name
}

with open('data/prod_model.json', 'w') as f:
    f.write(json.dumps(prod_model))
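Saving the feature list matters because any later request to the endpoint has to present its columns in exactly the order used for training. Here is a minimal sketch of how a consumer might use such a file. `build_csv_row`, the example feature names, and the endpoint name are hypothetical, and the file is written to a temporary directory so the sketch is self-contained.

```python
import json
import os
import tempfile

def build_csv_row(features, record):
    """Serialize one record as a CSV row in the saved feature order.

    Columns missing from the record (e.g. one-hot columns that are off)
    become 0; keys the model never saw are ignored.
    """
    return ','.join(str(record.get(name, 0)) for name in features)

# Hypothetical saved model description; real names come from the training data.
example_model = {
    'features': ['app_bundle-a', 'app_bundle-b', 'banner_height-50'],
    'endpoint_name': 'example-endpoint',
}

# Round-trip through JSON, as a consumer of the saved file would.
path = os.path.join(tempfile.mkdtemp(), 'prod_model.json')
with open(path, 'w') as f:
    json.dump(example_model, f)
with open(path) as f:
    loaded = json.load(f)

row = build_csv_row(loaded['features'], {'app_bundle-b': 1, 'unseen_feature-x': 1})
print(row)
```

Ignoring unseen keys means a category value that did not exist at training time simply contributes nothing, rather than shifting every later column by one.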