# Bring Your Own Model (k-means)
_**Hosting a Pre-Trained Model in Amazon SageMaker Algorithm Containers**_


## Import required packages
* io   -  provides the Python interfaces to stream handling.
* os   -  provides a portable way of using operating system dependent functionality.
* time -  provides various time-related functions.

In [45]:
import io
import os
import time

### Importing standard and third-party Python packages
* gzip - provides a simple interface to compress and decompress files.
* json - exposes an API for processing JSON data.
* mxnet - the MXNet Python package.
* numpy - package for scientific computing with Python.
* pickle - implements serialization and de-serialization of Python object structures.
* urllib.request - defines functions and classes for opening URLs.
* sklearn.cluster - scikit-learn's clustering module, which includes the k-means algorithm.

In [46]:
import gzip
import json
import pickle
import mxnet as mx
import numpy as np
import urllib.request
import sklearn.cluster

### Importing Amazon packages
* boto3 - the AWS SDK for Python, used to write software that calls AWS services such as S3 and EC2.
* get_execution_role - returns the ARN of the IAM role whose credentials are used to call the API.

In [47]:
import boto3
from sagemaker import get_execution_role

**This section is only included for illustration purposes. In a real use case, you'd be bringing your model from an existing process and not need to complete these steps.**

## Get the pickled MNIST dataset
* Check whether the dataset already exists on the notebook instance.
* If not, download it from the specified URL.

In [48]:
DOWNLOADED_FILENAME = 'mnist.pkl.gz'

In [49]:
if not os.path.exists(DOWNLOADED_FILENAME):
    urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", DOWNLOADED_FILENAME)

## Loading and splitting the dataset

* The pickled file contains a tuple of three splits: **(training set, validation set, test set)**
* Each split is itself a tuple: **(array of images, array of class labels)**
* Image: 1-dimensional NumPy array of 784 (28 x 28) float values between 0 and 1
* Labels: Integers between 0 and 9 indicating which digit the image represents

In [50]:
with gzip.open(DOWNLOADED_FILENAME, 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
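The layout described above can be checked with a small helper. The sketch below is illustrative only: `summarize_split` is a hypothetical name, and the data is a synthetic stand-in with the same shape as one MNIST split, not the real dataset.

```python
import numpy as np


def summarize_split(split):
    """Return basic facts about one (images, labels) pair."""
    images, labels = split
    return {
        "num_examples": images.shape[0],
        "feature_dim": images.shape[1],
        "pixel_range": (float(images.min()), float(images.max())),
        "label_range": (int(labels.min()), int(labels.max())),
    }


# Synthetic stand-in (assumption): 5 random "images" of 784 floats in [0, 1)
# and 5 integer labels. This is NOT real MNIST data.
rng = np.random.default_rng(0)
fake_split = (rng.random((5, 784), dtype=np.float32), np.arange(5) % 10)

summary = summarize_split(fake_split)
print(summary["num_examples"], summary["feature_dim"])
```

With the real data loaded, the same helper could be called as `summarize_split(train_set)`.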

## Training the model and uploading it to S3
* Use sklearn.cluster.KMeans to define a k-means model with 10 clusters, one centroid per cluster.
* Train the model locally.
* Convert the cluster centroids to an MXNet NDArray. Amazon SageMaker's k-means container expects the model as an MXNet NDArray with dimensions (num_clusters, feature_dim) that contains the cluster centroids.
* Save the array to a file named model_algo-1, then tar and gzip it.
* Specify the S3 bucket and prefix that you want to use for the model data.
* Create a bucket resource using the Bucket() method.
* Create an object resource for the model archive key.
* Upload the archive to the S3 bucket.

In [51]:
kmeans = sklearn.cluster.KMeans(n_clusters=10).fit(train_set[0])

In [52]:
centroids = mx.ndarray.array(kmeans.cluster_centers_)
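Because the container expects an array of shape (num_clusters, feature_dim), it can help to verify the shape before saving. The following is a minimal sketch using NumPy only; `check_centroids` is a hypothetical helper, and the zero-filled array is a placeholder rather than trained centroids.

```python
import numpy as np


def check_centroids(centroids, num_clusters, feature_dim):
    """Raise ValueError unless centroids has shape (num_clusters, feature_dim)."""
    arr = np.asarray(centroids)
    expected = (num_clusters, feature_dim)
    if arr.shape != expected:
        raise ValueError(
            "expected centroid shape {}, got {}".format(expected, arr.shape)
        )
    return arr.shape


# Placeholder array (assumption): stands in for kmeans.cluster_centers_.
demo_centroids = np.zeros((10, 784), dtype=np.float32)
shape = check_centroids(demo_centroids, num_clusters=10, feature_dim=784)
print(shape)
```

In this notebook the call would look like `check_centroids(kmeans.cluster_centers_, 10, 784)`.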

In [53]:
mx.ndarray.save('model_algo-1', [centroids])

In [54]:
!tar czvf model.tar.gz model_algo-1

model_algo-1
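The `!tar` shell command works inside a notebook. Outside one, the same archive could be built with Python's standard `tarfile` module. This is a sketch under that assumption: `make_model_archive` is a hypothetical helper, and the demo writes placeholder bytes instead of a real MXNet model file.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path):
    """Create a gzip-compressed tar with the model file at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops any directory components so the file sits at the root.
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with a placeholder file (not a real model).
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model_algo-1")
    with open(model_path, "wb") as f:
        f.write(b"placeholder bytes")
    archive = make_model_archive(model_path, os.path.join(tmp, "model.tar.gz"))
    names = list_archive(archive)

print(names)  # ['model_algo-1']
```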


In [55]:
bucket = 'loonybucket'
prefix = 'sagemaker/kmeans_byom'

In [56]:
s3_resource = boto3.Session().resource('s3')

In [57]:
# Despite its name, current_bucket is an S3 Object handle for the model archive key.
current_bucket = s3_resource.Bucket(bucket).Object(os.path.join(prefix, 'model.tar.gz'))

In [58]:
current_bucket.upload_file('model.tar.gz')
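S3 keys always use forward slashes, while `os.path.join` follows the local operating system's separator. On a Linux notebook instance the two agree, but a portable variant could use `posixpath` instead. The helpers below (`s3_key`, `s3_uri`) are hypothetical names introduced for illustration.

```python
import posixpath


def s3_key(prefix, filename):
    """Join a prefix and filename with forward slashes, as S3 keys require."""
    return posixpath.join(prefix.strip("/"), filename)


def s3_uri(bucket, key):
    """Build the s3:// URI for a bucket and key."""
    return "s3://{}/{}".format(bucket, key)


key = s3_key("sagemaker/kmeans_byom", "model.tar.gz")
uri = s3_uri("loonybucket", key)
print(uri)  # s3://loonybucket/sagemaker/kmeans_byom/model.tar.gz
```

The resulting URI has the same form as the `ModelDataUrl` value passed to `create_model` later in this notebook.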

## Hosting the model
* Generate a model name.
* Create a SageMaker client.
* Specify the algorithm container to use.
* Get the ARN of the IAM role that SageMaker uses to access the model data and container image.
* Create the model using the create_model method.
* Set up the endpoint configuration.
* Create the endpoint and check its status to confirm deployment.

In [59]:
kmeans_model = 'kmeans-scikit-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
kmeans_model

'kmeans-scikit-2018-03-09-01-53-10'

In [60]:
sm = boto3.client('sagemaker')

In [61]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/kmeans:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/kmeans:latest'}

In [62]:
container = containers[boto3.Session().region_name]
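The dictionary lookup above raises a bare `KeyError` when the session's region is not one of the four listed. A small wrapper can make that failure easier to read. This is only a sketch: `resolve_container` is a hypothetical helper, and the demo dictionary copies two of the entries shown above.

```python
def resolve_container(region, containers):
    """Return the image URI for a region, or raise a descriptive ValueError."""
    try:
        return containers[region]
    except KeyError:
        known = ", ".join(sorted(containers))
        raise ValueError(
            "No k-means image listed for region {!r}; known regions: {}".format(
                region, known
            )
        ) from None


# Two entries copied from the mapping above, for demonstration.
demo_containers = {
    "us-west-2": "174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest",
    "us-east-1": "382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest",
}

image = resolve_container("us-west-2", demo_containers)
print(image)
```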

In [63]:
role = get_execution_role()
role

'arn:aws:iam::324118574079:role/service-role/AmazonSageMaker-ExecutionRole-20180209T192191'

In [64]:
create_model_response = sm.create_model(
    ModelName=kmeans_model,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': container,
        'ModelDataUrl': 's3://{}/{}/model.tar.gz'.format(bucket, prefix)})

In [65]:
create_model_response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:324118574079:model/kmeans-scikit-2018-03-09-01-53-10',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '95',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Fri, 09 Mar 2018 01:53:25 GMT',
   'x-amzn-requestid': '47ee7c88-3234-4f0c-a2d3-cb9913a922ca'},
  'HTTPStatusCode': 200,
  'RequestId': '47ee7c88-3234-4f0c-a2d3-cb9913a922ca',
  'RetryAttempts': 0}}

In [66]:
print(create_model_response['ModelArn'])

arn:aws:sagemaker:us-east-2:324118574079:model/kmeans-scikit-2018-03-09-01-53-10


In [67]:
kmeans_endpoint_config = 'kmeans-poc-endpoint-config-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(kmeans_endpoint_config)

kmeans-poc-endpoint-config-2018-03-09-01-53-28


In [68]:
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=kmeans_endpoint_config,
    ProductionVariants=[{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': kmeans_model,
        'VariantName': 'AllTraffic'}])

In [69]:
print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Endpoint Config Arn: arn:aws:sagemaker:us-east-2:324118574079:endpoint-config/kmeans-poc-endpoint-config-2018-03-09-01-53-28


In [70]:
kmeans_endpoint = 'kmeans-poc-endpoint-' + time.strftime("%Y%m%d%H%M", time.gmtime())
print(kmeans_endpoint)

kmeans-poc-endpoint-201803090153


In [71]:
create_endpoint_response = sm.create_endpoint(
    EndpointName=kmeans_endpoint,
    EndpointConfigName=kmeans_endpoint_config)

In [72]:
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-east-2:324118574079:endpoint/kmeans-poc-endpoint-201803090153


In [73]:
sm.get_waiter('endpoint_in_service').wait(EndpointName=kmeans_endpoint)
resp = sm.describe_endpoint(EndpointName=kmeans_endpoint)

In [74]:
status = resp['EndpointStatus']
print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Arn: arn:aws:sagemaker:us-east-2:324118574079:endpoint/kmeans-poc-endpoint-201803090153
Status: InService
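The waiter blocks until the endpoint is in service. To report progress or handle a failed deployment explicitly, the status could be polled by hand instead. The sketch below assumes only that the describe call returns a dictionary with an `EndpointStatus` key, as `describe_endpoint` does above. `wait_for_endpoint` is a hypothetical helper, and the demo uses a fake describe function rather than a real AWS call.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=30,
                      max_attempts=60, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint is InService."""
    for _ in range(max_attempts):
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint {} failed to deploy".format(endpoint_name))
        sleep(poll_seconds)
    raise TimeoutError("endpoint {} not in service in time".format(endpoint_name))


# Fake describe function (assumption) that simulates a short deployment.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_fake_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint",
                                 sleep=lambda seconds: None)
print(final_status)  # InService
```

Against the real service, the call would be `wait_for_endpoint(sm.describe_endpoint, kmeans_endpoint)`.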


## Validate the model
* Define a function that converts records from the training set to CSV, because this example sends data to the model endpoint in CSV format.
* Take the first 100 records from the training dataset to score against the hosted endpoint.
* Create a SageMaker runtime client.
* Send the records to the endpoint for scoring.
* Compare the returned cluster assignments with the labels from the locally trained k-means model.

In [75]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()
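To see what `np2csv` produces, it can be run on a tiny array. The function below is repeated from the cell above so the example is self-contained; the sample values are made up for illustration.

```python
import io

import numpy as np


def np2csv(arr):
    """Serialize a 2-D array to CSV text, one row per line."""
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()


# Two made-up rows of three values each.
sample = np.array([[0.0, 0.5, 1.0], [0.25, 0.0, 0.75]])
text = np2csv(sample)
print(text)
# 0,0.5,1
# 0.25,0,0.75
```

Each record becomes one line, and `fmt='%g'` drops trailing zeros, which keeps the payload compact for sparse images like MNIST digits.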

In [76]:
train_set[0][0:100].shape

(100, 784)

In [77]:
payload = np2csv(train_set[0][0:100])

In [78]:
payload

'0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0117188,0.0703125,0.0703125,0.0703125,0.492188,0.53125,0.683594,0.101562,0.648438,0.996094,0.964844,0.496094,0,0,0,0,0,0,0,0,0,0,0,0,0.117188,0.140625,0.367188,0.601562,0.664062,0.988281,0.988281,0.988281,0.988281,0.988281,0.878906,0.671875,0.988281,0.945312,0.761719,0.25,0,0,0,0,0,0,0,0,0,0,0,0.191406,0.929688,0.988281,0.988281,0.988281,0.988281,0.988281,0.988281,0.988281,0.988281,0.980469,0.363281,0.320312,0.320312,0.21875,0.152344,0,0,0,0,0,0,0,0,0,0,0,0,0.0703125,0.855469,0.988281,0.988281,0.988281,0.988281,0.988281,0.773438,0.710938,0.964844,0.941406,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.3125,0.609375,0.417969,0.988281,0.988281,0.800781,0.0429688,0,0.167969,0.601562,0,0,0,0,0,0,0

In [79]:
runtime = boto3.Session().client('sagemaker-runtime')

In [80]:
response = runtime.invoke_endpoint(EndpointName=kmeans_endpoint,
                                   ContentType='text/csv',
                                   Body=payload)

In [81]:
result = json.loads(response['Body'].read().decode())

In [82]:
result

{'predictions': [{'closest_cluster': 3.0,
   'distance_to_cluster': 6.6304473876953125},
  {'closest_cluster': 1.0, 'distance_to_cluster': 5.265570163726807},
  {'closest_cluster': 8.0, 'distance_to_cluster': 6.634407043457031},
  {'closest_cluster': 5.0, 'distance_to_cluster': 3.8344316482543945},
  {'closest_cluster': 2.0, 'distance_to_cluster': 5.48761510848999},
  {'closest_cluster': 6.0, 'distance_to_cluster': 6.56818151473999},
  {'closest_cluster': 0.0, 'distance_to_cluster': 4.5720038414001465},
  {'closest_cluster': 3.0, 'distance_to_cluster': 6.198792457580566},
  {'closest_cluster': 0.0, 'distance_to_cluster': 3.60660982131958},
  {'closest_cluster': 7.0, 'distance_to_cluster': 6.009793758392334},
  {'closest_cluster': 3.0, 'distance_to_cluster': 6.162863254547119},
  {'closest_cluster': 5.0, 'distance_to_cluster': 5.734560012817383},
  {'closest_cluster': 4.0, 'distance_to_cluster': 6.360567569732666},
  {'closest_cluster': 9.0, 'distance_to_cluster': 5.530735015869141},
  

In [83]:
scored_labels = np.array([r['closest_cluster'] for r in result['predictions']])
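The response parsing can also be wrapped in a function that returns both the cluster IDs and the distances. This is a sketch using only the standard library: `parse_predictions` is a hypothetical helper, and the sample body reuses the first two predictions shown in the output above.

```python
import json


def parse_predictions(body_text):
    """Split a k-means response body into cluster IDs and distances."""
    predictions = json.loads(body_text)["predictions"]
    labels = [int(p["closest_cluster"]) for p in predictions]
    distances = [p["distance_to_cluster"] for p in predictions]
    return labels, distances


# Sample body built from the first two predictions in the output above.
sample_body = json.dumps({"predictions": [
    {"closest_cluster": 3.0, "distance_to_cluster": 6.6304473876953125},
    {"closest_cluster": 1.0, "distance_to_cluster": 5.265570163726807},
]})

labels, distances = parse_predictions(sample_body)
print(labels)  # [3, 1]
```

The endpoint returns cluster IDs as floats, so converting them to `int` makes later comparisons with integer labels more explicit.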

In [84]:
scored_labels

array([ 3.,  1.,  8.,  5.,  2.,  6.,  0.,  3.,  0.,  7.,  3.,  5.,  4.,
        9.,  0.,  7.,  6.,  3.,  9.,  7.,  8.,  1.,  7.,  5.,  0.,  6.,
        2.,  4.,  7.,  5.,  0.,  3.,  9.,  7.,  3.,  5.,  9.,  1.,  2.,
        9.,  0.,  3.,  2.,  7.,  0.,  2.,  3.,  3.,  4.,  3.,  4.,  1.,
        2.,  5.,  2.,  3.,  1.,  7.,  8.,  5.,  8.,  5.,  9.,  1.,  8.,
        5.,  9.,  5.,  4.,  4.,  0.,  7.,  0.,  9.,  4.,  1.,  4.,  5.,
        5.,  7.,  4.,  3.,  6.,  9.,  2.,  3.,  4.,  7.,  1.,  7.,  9.,
        7.,  8.,  9.,  3.,  1.,  7.,  3.,  4.,  5.])

In [85]:
scored_labels == kmeans.labels_[0:100]

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
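Printing 100 booleans is hard to scan, so a single agreement figure may be easier to read. The sketch below also shows a local nearest-centroid assignment in NumPy, assuming that the endpoint assigns each record to the centroid with the smallest Euclidean distance. `nearest_centroid` and `agreement` are hypothetical helpers, and the points and centroids are toy values.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return (closest index, Euclidean distance) for each point."""
    # Pairwise differences, shape (num_points, num_clusters, feature_dim).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    closest = distances.argmin(axis=1)
    return closest, distances[np.arange(len(points)), closest]


def agreement(a, b):
    """Fraction of positions where two label arrays match."""
    return float((np.asarray(a) == np.asarray(b)).mean())


# Toy data (assumption): two centroids and three points in 2-D.
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_points = np.array([[1.0, 0.0], [9.0, 10.0], [0.0, 2.0]])

closest, dist = nearest_centroid(toy_points, toy_centroids)
print(closest.tolist(), dist.tolist())  # [0, 1, 0] [1.0, 1.0, 2.0]
print(agreement(closest, [0.0, 1.0, 0.0]))  # 1.0
```

In this notebook the summary would be `agreement(scored_labels, kmeans.labels_[0:100])`. The cluster IDs are directly comparable because the hosted model is the same centroid array in the same order.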

## Remove endpoint to avoid stray charges

In [86]:
sm.delete_endpoint(EndpointName=kmeans_endpoint)

{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '0',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Fri, 09 Mar 2018 02:10:58 GMT',
   'x-amzn-requestid': '1e11c7a0-1353-4e12-8e56-ae366f63b520'},
  'HTTPStatusCode': 200,
  'RequestId': '1e11c7a0-1353-4e12-8e56-ae366f63b520',
  'RetryAttempts': 0}}
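Deleting the endpoint stops the hosting instance, but the endpoint configuration and the model record remain in the account. A sketch for removing those as well follows. `delete_leftovers` is a hypothetical helper, and the demo uses a recording stub in place of a real client. The `delete_endpoint_config` and `delete_model` calls exist on the boto3 SageMaker client.

```python
def delete_leftovers(sm_client, endpoint_config_name, model_name):
    """Delete the endpoint configuration and model left after endpoint removal."""
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stub (assumption) that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(("delete_endpoint_config", kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(("delete_model", kwargs))


stub = RecordingClient()
delete_leftovers(stub, "demo-endpoint-config", "demo-model")
print([name for name, _ in stub.calls])
```

With the real client, the call would be `delete_leftovers(sm, kmeans_endpoint_config, kmeans_model)`. The model archive in S3 is a separate object and is not touched by these calls.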