# Building a CPI Bid Model (Part 3)
## Model Selection

At this point, we've taken a look at our data and we know a few things about it:
* our features have very skewed distributions
* some features are likely highly correlated with one another
* some features may have little correlation, or a non-linear correlation, with our dependent variable, `conversion_rate`

Usually, we would fit several different models and see which performs best. However, to keep this tutorial as brief as possible, we'll just use a linear regression model, since it's easy to understand and works pretty well for predicting continuous values.
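
To make "linear regression" concrete, here is a minimal sketch of ordinary least squares with a single feature. The data is made up, and the helper names `fit_line` and `predict_line` are hypothetical. It only illustrates the form of the model (a weighted input plus an intercept); SageMaker's built-in algorithm does not necessarily fit its weights this way.

```python
# Ordinary least squares for one feature: y is approximated by slope * x + intercept.
# Toy data only; the numbers are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(slope, intercept, x):
    """Predict a continuous value for a single input."""
    return slope * x + intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.001, 0.003, 0.005, 0.007]  # lies exactly on a straight line
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With more than one feature the same idea applies, with one weight per feature instead of a single slope.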

In [None]:
import boto3
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role
import time
import numpy as np
import pandas as pd
import io
import os

%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics

Since SageMaker has a built-in `linear learner` model, we'll go ahead and leverage that to save ourselves some time:

In [None]:
# Look up the built-in linear learner container image for the current region.
container = get_image_uri(boto3.Session().region_name, 'linear-learner')
train_data = pd.read_pickle('./data/step2-train.pkl')

# S3 location used for the training data and the model output.
bucket = 'beeswax-tmp-us-east-1'
prefix = 'bid-models-test-data/canary/sagemaker'

role = get_execution_role()
session = sagemaker.Session()
ll = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)
# feature_dim is every column except the label, conversion_rate.
ll.set_hyperparameters(
    feature_dim=len(train_data.columns)-1,
    mini_batch_size=500,
    predictor_type='regressor'
)
train_data.shape
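
The `feature_dim` value above is simply the number of columns minus the label. As a small sketch of that arithmetic with a sanity check, the hypothetical helper below counts feature columns and fails loudly if the label column is missing or duplicated. The column names `feature_a`, `feature_b` and `feature_c` are invented for illustration.

```python
LABEL = 'conversion_rate'  # dependent variable named in this tutorial


def feature_dim_for(columns, label=LABEL):
    """Return the number of feature columns, i.e. every column except the label."""
    columns = list(columns)
    if columns.count(label) != 1:
        raise ValueError('expected exactly one {!r} column'.format(label))
    return len(columns) - 1


# Hypothetical column names, for illustration only.
example_columns = ['conversion_rate', 'feature_a', 'feature_b', 'feature_c']
dim = feature_dim_for(example_columns)
print(dim)
```

In the notebook, this could be called as `feature_dim_for(train_data.columns)`.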

Now we'll create the input data by referencing the files we wrote to S3 in the previous part:

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/train.csv'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/validation.csv'.format(bucket, prefix), content_type='text/csv')

And finally, it's time to actually fit the model:

In [None]:
job_name = 'canary-cpi-model-{timestamp}'.format(timestamp=int(time.time()))
ll.fit({'train': s3_input_train, 'validation': s3_input_validation}, job_name=job_name) 

Now that we have a trained model, we want to determine how well the model performs.  We'll do this by running our test data through the model and comparing the results to the expected value.  Let's start by deploying the model so we can score against it:

In [None]:
ll_predictor = ll.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

We'll read in our test dataset and then set up our predictor to send CSV data and parse JSON responses:

In [None]:
test_data = pd.read_pickle('./data/step2-test.pkl')

ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = csv_serializer
ll_predictor.deserializer = json_deserializer
test_data.head()

Now we'll loop over our test dataset to:
* split data into mini-batches
* convert those batches into CSV payloads
* get predictions for each payload
* merge the result back into our test dataset
* calculate the MAE for our test dataset (this is the metric we will use to evaluate the model fit)

In [None]:
def predict(data):
    # Each element of data is one row, sent to the endpoint as its own request.
    predictions = []
    for array in data:
        result = ll_predictor.predict(array)
        predictions.append(result['predictions'][0]['score'])

    return np.array(predictions)

# .values replaces DataFrame.as_matrix(), which newer pandas versions no longer provide.
test_data['prediction'] = predict(test_data.drop(['conversion_rate'], axis=1).values)
# Absolute error per row; its mean is the MAE.
test_data['error'] = np.abs(test_data['prediction'] - test_data['conversion_rate'])
print('mean absolute error: {error}'.format(error=test_data['error'].mean()))
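
The loop above sends one row per request. A batched variant, closer to the first steps in the list above, might look like the sketch below. It is written with the standard library only, and it rests on some assumptions: `send` is a hypothetical callable that submits a CSV payload to the endpoint, and the response is assumed to contain one `{'score': ...}` entry per row under `'predictions'`, matching the shape read in the loop above. `fake_send` is a stand-in so the example runs without an endpoint.

```python
import csv
import io


def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def to_csv_payload(batch):
    """Serialize a list of numeric rows as CSV text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(batch)
    return buffer.getvalue().rstrip('\n')


def predict_in_batches(rows, send, batch_size=100):
    """Score rows batch by batch; send maps a CSV payload to a response dict."""
    scores = []
    for batch in iter_batches(rows, batch_size):
        response = send(to_csv_payload(batch))
        # Assumption: one prediction per input row, in the same order.
        scores.extend(p['score'] for p in response['predictions'])
    return scores


def fake_send(payload):
    """Stand-in for an endpoint call: scores each row as the sum of its values."""
    rows = [line.split(',') for line in payload.split('\n')]
    return {'predictions': [{'score': sum(float(v) for v in row)} for row in rows]}


rows = [[0.5, 1.0], [2.0, 0.25], [0.0, 0.0], [1.5, 1.5], [3.0, 1.0]]
scores = predict_in_batches(rows, fake_send, batch_size=2)
print(scores)
```

Batching mainly reduces the number of round trips to the endpoint; the batch size that works in practice depends on the endpoint's payload limits, which this sketch does not check.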

An MAE of 0.0006 looks VERY good. However, most of the rows have 0 conversions, so a low overall error does not tell us much on its own. One other thing to look at is the error for rows with non-zero conversion rates, since we want to make sure that we are accurate where there is signal:

In [None]:
print('mean absolute error for non-zero conversion rate: {error}'.format(error=test_data.loc[test_data['conversion_rate'] > 0, 'error'].mean()))
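
To see why the overall and non-zero errors can differ so much, here is a small worked sketch on made-up numbers (not the tutorial's data). It computes the MAE over all rows, over rows with a non-zero conversion rate, and for a hypothetical baseline that always predicts zero, which is one way to put an overall MAE in context when most rows are zero.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    pairs = list(zip(actual, predicted))
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


# Invented values: most rows have zero conversions, a few carry signal.
actual = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.02, 0.04]
predicted = [0.001, 0.0, 0.002, 0.0, 0.001, 0.0, 0.0, 0.0, 0.012, 0.02]

overall_mae = mean_absolute_error(actual, predicted)

# Restrict to the rows where there is signal.
nonzero = [(a, p) for a, p in zip(actual, predicted) if a > 0]
nonzero_mae = mean_absolute_error([a for a, _ in nonzero], [p for _, p in nonzero])

# Baseline for comparison: always predict a conversion rate of zero.
baseline_mae = mean_absolute_error(actual, [0.0] * len(actual))

print('overall:', overall_mae)
print('non-zero rows:', nonzero_mae)
print('always-zero baseline:', baseline_mae)
```

Because the zero rows dominate the average, even the always-zero baseline gets a small overall MAE here, which is why the non-zero subset is worth checking separately.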

Okay, worse than overall but still pretty good: the error is reasonable but notably higher for rows that actually have a non-zero conversion rate. What does this mean? It means we are very good at predicting which inventory will not perform, but not as good at predicting which inventory will perform. In the next part, we'll look at ways to improve this performance.