## Prerequisites and Preprocessing

### Permissions and environment variables

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).
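
If you do hard-code a role ARN instead of calling `get_execution_role()`, a quick format check can catch a pasted role name or a truncated string before a training job is submitted. The sketch below is only illustrative: `is_role_arn` is a hypothetical helper, the pattern is an assumption about the general ARN shape (`arn:<partition>:iam::<12-digit account>:role/<name>`), and the account ID and role name are made up.

```python
import re

# Assumed general shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value):
    """Return True if value looks like a full IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))


# Hypothetical account ID and role name, for illustration only.
example_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"
print(is_role_arn(example_arn))        # True
print(is_role_arn("MySageMakerRole"))  # False
```

This only checks the string format; it does not confirm that the role exists or has the permissions SageMaker needs.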

In [1]:
bucket = 'cc-dining-robot'
prefix = 'sagemaker'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

### Data ingestion

Next, we read the dataset from an online URL into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.

In [2]:
%%time
import urllib.request, json, csv
import random

# Load the dataset
urllib.request.urlretrieve("https://s3.amazonaws.com/cc-dining-robot/sagemaker/train.csv", "train.csv")

CPU times: user 11.5 ms, sys: 0 ns, total: 11.5 ms
Wall time: 28.9 ms


In [3]:
in_value, target = list(), list()
with open('train.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for row in reader:
        in_value.append(row[:2] + row[3:5])
        target.append(row[-1])

In [5]:
in_value[0], target[0]

(['Q9F2ocrmYuGt1yn3M7MOBw', 'italian', '1291', '4.5'], '1')

In [6]:
CUISINE_DICT = dict(mexican=0, italian=1, british=2, american=3, chinese=4, thailand=5, japanese=6)

target = [int(item) for item in target]
for i, row in enumerate(in_value):
    row[0] = random.random() # replace the ID
    row[1] = CUISINE_DICT.get(row[1])
    row[2] = int(row[2])
    row[3] = float(row[3])

In [7]:
in_value[-1], target[-1]

([0.6935386504026277, 3, 526, 4.5], 1)
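
Note that `CUISINE_DICT.get` returns `None` for a cuisine that is not in the dictionary, and that `None` would only surface later, when the rows are converted to a numeric array. A stricter variant of the conversion could look like the sketch below; `encode_row` is a hypothetical helper, not part of the original notebook.

```python
import random

# Same mapping as in the cell above.
CUISINE_DICT = dict(mexican=0, italian=1, british=2, american=3,
                    chinese=4, thailand=5, japanese=6)


def encode_row(row):
    """Convert [id, cuisine, review_count, rate] strings to numeric features.

    Raises ValueError for a cuisine missing from CUISINE_DICT instead of
    silently storing None.
    """
    _, cuisine, review_count, rate = row
    if cuisine not in CUISINE_DICT:
        raise ValueError('unknown cuisine: {!r}'.format(cuisine))
    # The ID is replaced by a random number, as in the original cell.
    return [random.random(), CUISINE_DICT[cuisine], int(review_count), float(rate)]


encoded = encode_row(['Q9F2ocrmYuGt1yn3M7MOBw', 'italian', '1291', '4.5'])
print(encoded[1:])  # [1, 1291, 4.5]
```

Failing early on an unknown cuisine keeps bad rows out of the training data rather than leaving the problem to the array conversion step.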

### Data conversion

Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf, whereas the data we have at this point is a set of Python lists parsed from a CSV file.

Most of the conversion effort is handled by the Amazon SageMaker Python SDK, imported as `sagemaker` below.

In [8]:
import io
import numpy as np
import sagemaker.amazon.common as smac

vectors = np.array([t for t in in_value]).astype('float32')
labels = np.array([t for t in target]).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)

0

## Upload training data
Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

In [9]:
import boto3
import os

key = 'recommended_train'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://cc-dining-robot/sagemaker/train/recommended_train
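
One detail worth noting in the cell above: `os.path.join` uses the local path separator, which matches the `/` used in S3 keys on a Linux notebook instance but not on Windows. A platform-independent way to build the key and URI might look like this sketch, where `s3_key` and `s3_uri` are hypothetical helper names.

```python
import posixpath


def s3_key(*parts):
    """Join key components with '/' regardless of the local operating system."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return 's3://{}/{}'.format(bucket, s3_key(*parts))


print(s3_key('sagemaker', 'train', 'recommended_train'))
# sagemaker/train/recommended_train
print(s3_uri('cc-dining-robot', 'sagemaker', 'train', 'recommended_train'))
# s3://cc-dining-robot/sagemaker/train/recommended_train
```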


Let's also set up an output S3 location for the model artifact that training will produce.

In [10]:
output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://cc-dining-robot/sagemaker/output


## Training the linear model

Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the Linear Learner training algorithm, although we have tested it on multi-terabyte datasets.

Again, we'll use the Amazon SageMaker Python SDK to kick off training and monitor its status until it completes.  In the run logged below, that took a little over three minutes from start to completion, of which only 28 seconds were billable.  Although the dataset is small, provisioning hardware and loading the algorithm container take time up front.

First, let's specify the algorithm container.  Since we want this notebook to run in any region where Amazon SageMaker is available, we use the SDK's `get_image_uri` helper to look up the Linear Learner image for the current region.  More details on algorithm containers can be found in [AWS documentation](https://docs-aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

In [11]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

Next we'll kick off the base estimator, making sure to pass in the necessary hyperparameters.  Notice:
- `feature_dim` is set to 4, which is the number of features in each row: the randomized ID, the cuisine code, the review count, and the rating.
- `predictor_type` is set to 'binary_classifier' since we are trying to predict whether a restaurant is recommended (1) or not (0).
- `mini_batch_size` is set to 50.  This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases.
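
Before launching a job, it can help to confirm that the data actually matches the values passed to `set_hyperparameters` in the next cell. The following is a minimal sketch under stated assumptions: `check_training_shape` is a hypothetical helper, and the batch-size rule is a sanity check chosen for this example, not a limit quoted from the Linear Learner documentation.

```python
def check_training_shape(rows, labels, feature_dim, mini_batch_size):
    """Raise ValueError if the data does not match the chosen hyperparameters."""
    if len(rows) != len(labels):
        raise ValueError('rows and labels differ in length')
    for i, row in enumerate(rows):
        if len(row) != feature_dim:
            raise ValueError('row {} has {} features, expected {}'.format(
                i, len(row), feature_dim))
    # Assumption for this sketch: a batch larger than the dataset is a mistake.
    if mini_batch_size > len(rows):
        raise ValueError('mini_batch_size exceeds the number of training rows')
    if not set(labels) <= {0, 1}:
        raise ValueError('binary_classifier expects labels of 0 or 1')


# Two rows in the same layout as in_value: [random id, cuisine code, review_count, rate]
rows = [[0.69, 3, 526, 4.5], [0.68, 4, 6, 3.0]]
labels = [1, 0]
check_training_shape(rows, labels, feature_dim=4, mini_batch_size=2)
print('ok')
```

In the notebook, the same call would take `in_value`, `target`, `feature_dim=4`, and `mini_batch_size=50`.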

In [12]:
import boto3
import sagemaker

sess = sagemaker.Session()

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)
linear.set_hyperparameters(feature_dim=4,
                           predictor_type='binary_classifier',
                           mini_batch_size=50)

linear.fit({'train': s3_train_data})

INFO:sagemaker:Creating training-job with name: linear-learner-2019-04-04-19-50-14-028


2019-04-04 19:50:14 Starting - Starting the training job...
2019-04-04 19:50:16 Starting - Launching requested ML instances......
2019-04-04 19:51:21 Starting - Preparing the instances for training.........
2019-04-04 19:53:07 Downloading - Downloading input data..
Docker entrypoint called with argument(s): train
[04/04/2019 19:53:25 INFO 139746224760640] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_method': u'uniform', u'init_sigma': u'0.01', u'lr_scheduler_minimum_lr': u'auto', u'target_recall': u'0.8', u'num_models': u'auto', u'early_stopping_patienc


2019-04-04 19:53:35 Training - Training image download completed. Training in progress.
2019-04-04 19:53:35 Uploading - Uploading generated training model
2019-04-04 19:53:35 Completed - Training job completed
Billable seconds: 28


## Set up hosting for the model
Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint.  This will allow us to make predictions (or inferences) from the model dynamically.

_Note, Amazon SageMaker allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target._

In [13]:
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: linear-learner-2019-04-04-19-54-16-721
INFO:sagemaker:Creating endpoint with name linear-learner-2019-04-04-19-50-14-028


---------------------------------------------------------------------------------------!

## Validate the model for use
Finally, we can now validate the model for use.  We can pass HTTP POST requests to the endpoint to get back predictions.  To make this easier, we'll again use the Amazon SageMaker Python SDK and specify how to serialize requests and deserialize responses that are specific to the algorithm.

In [15]:
urllib.request.urlretrieve("https://s3.amazonaws.com/cc-dining-robot/sagemaker/test.csv", "test.csv")
test_in  = list()

In [16]:
with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for row in reader:
        test_in.append(row[:2] + row[3:5])

In [18]:
for i, row in enumerate(test_in):
    row[0] = random.random() # replace the ID
    row[1] = CUISINE_DICT.get(row[1])
    row[2] = int(row[2])
    row[3] = float(row[3])

In [19]:
print(len(test_in))

5178


In [20]:
from sagemaker.predictor import csv_serializer, json_deserializer

linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
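
To make the request and response shapes concrete, the sketch below imitates them in plain Python. It is not the SDK's implementation: `to_csv_payload` and `parse_prediction` are hypothetical helpers, and the sample response body is made up, using only the keys (`predictions`, `predicted_label`, `score`) that the later cells read.

```python
import json


def to_csv_payload(features):
    """Serialize one feature row as a single CSV line, as sent with content type text/csv."""
    return ','.join(str(value) for value in features)


def parse_prediction(body):
    """Extract (predicted_label, score) from a JSON response body."""
    first = json.loads(body)['predictions'][0]
    return first['predicted_label'], first['score']


print(to_csv_payload([0.5, 4, 6, 3.0]))  # 0.5,4,6,3.0

# Made-up response body with the same keys the notebook reads later.
sample_body = '{"predictions": [{"score": 0.25, "predicted_label": 0.0}]}'
print(parse_prediction(sample_body))  # (0.0, 0.25)
```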

In [23]:
in_value[1], target[1]

([0.6814303527351159, 4, 6, 3.0], 0)

### Check accuracy on the training file

In [25]:
count = 0
for i,  item in enumerate(in_value):
    result = linear_predictor.predict(item)
    ans = result['predictions'][0]['predicted_label']
    if ans == target[i]:
        count += 1
print(count)
print(len(target))

199
200
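
The two counts above correspond to a training accuracy of 199/200. Because the model is scored on the same rows it was trained on, this figure is likely to be optimistic compared with accuracy on unseen data. A small helper for turning predictions and labels into a fraction might look like this; `accuracy` is a hypothetical name and the short label lists are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of positions where predicted and actual labels agree."""
    if len(predicted) != len(actual):
        raise ValueError('length mismatch')
    if not actual:
        raise ValueError('no labels to compare')
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Float predictions compare equal to integer labels (1.0 == 1).
print(accuracy([1.0, 0.0, 1.0, 1.0], [1, 0, 0, 1]))  # 0.75
print(199 / 200)  # 0.995, the training accuracy from the counts above
```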


In [31]:
result_list = list()
for i,  item in enumerate(test_in):
    result = linear_predictor.predict(item)
    ans = result['predictions'][0]['predicted_label']
    score = result['predictions'][0]['score']
    result_list.append([score, ans])

In [32]:
print(len(result_list))

5178


In [60]:
# Read the header row, drop the label column, and add the two prediction columns
with open('test.csv', 'r', newline='') as test_file:
    headers = next(csv.reader(test_file))
    print(headers)
    headers = headers + ['score', 'best_answer']
    headers.remove('recommended')
    print(headers)

with open('test.csv', 'r', newline='') as test_file, open('result.csv', 'w', newline='') as result_file:
    reader = csv.DictReader(test_file)
    # extrasaction='ignore' skips columns that are not in fieldnames (such as
    # 'recommended') instead of raising ValueError
    writer = csv.DictWriter(result_file, fieldnames=headers, extrasaction='ignore')
    writer.writeheader()
    count = 0
    for row in reader:
        row['score'], row['best_answer'] = result_list[count]
        count += 1
        writer.writerow(row)

count

['id', 'cuisine', 'name', 'review_count', 'rate', 'recommended']
['id', 'cuisine', 'name', 'review_count', 'rate', 'score', 'best_answer']

### (Optional) Delete the Endpoint

If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
import sagemaker

sagemaker.Session().delete_endpoint(linear_predictor.endpoint)