# Introduction

This notebook outlines how to build a recommendation system using SageMaker's Factorization Machines (FM). The main goal is to show how to extend an FM model to predict the top "X" recommendations using SageMaker's KNN and Batch Transform.

There are four parts to this notebook:

1. Building an FM Model
2. Repackaging the FM Model to fit a KNN Model
3. Building a KNN model
4. Running Batch Transform for predicting top "X" items


## Part 1 - Building an FM Model using the MovieLens dataset

Julien Simon has written a fantastic blog post on how to build an FM model using SageMaker, with a detailed explanation. Please see the link below for more information. This part reuses most of his code so that the additional steps that follow have continuity with it.

Source - https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

In [26]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os
import multiprocessing
from multiprocessing import Pool

### Download movie rating data from MovieLens

In [27]:
#download data
#!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
#!unzip -o ml-100k.zip

%pwd

u'/home/ec2-user/SageMaker/MovieRecommender/Notebook'

### Shuffle the data

In [28]:
%cd ml-latest-small
#%cd ml-20m
!shuf ua.base -o ua.base.shuffled

/home/ec2-user/SageMaker/MovieRecommender/Notebook/ml-latest-small


### Load Training Data

In [29]:
user_movie_ratings_train = pd.read_csv('ua.base.shuffled', sep=',', header=0, index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

|   | user_id | movie_id | rating |
|---|---------|----------|--------|
| 0 | 6       | 736      | 5.0    |
| 1 | 42      | 648      | 5.0    |
| 2 | 409     | 3590     | 2.0    |
| 3 | 156     | 3089     | 4.0    |
| 4 | 294     | 2134     | 3.0    |


### Load Test Data

In [30]:
user_movie_ratings_test = pd.read_csv('ua.test', sep=',', header=0, index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test.head(5)

|   | user_id | movie_id | rating |
|---|---------|----------|--------|
| 0 | 1       | 3        | 4.0    |
| 1 | 1       | 6        | 4.0    |
| 2 | 1       | 47       | 5.0    |
| 3 | 1       | 50       | 5.0    |
| 4 | 1       | 70       | 3.0    |


In [31]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_features=nb_users+nb_movies
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print " # of users: ", nb_users
print " # of movies: ", nb_movies
print " Training Count: ", nb_ratings_train
print " Test Count: ", nb_ratings_test
print " Features (# of users + # of movies): ", nb_features

 # of users:  610
 # of movies:  193609
 Training Count:  94735
 Test Count:  6099
 Features (# of users + # of movies):  194219


### FM Input

Input to FM is a one-hot encoded sparse matrix: each row has a 1 in the column for the user and a 1 in the column for the movie. The label is binary: ratings of 4 and above are labeled 1, and ratings below 4 are labeled 0.
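
As a small illustration of this encoding, the sketch below builds one feature row and its label for a toy problem. The sizes (3 users, 4 movies) and the helper name `encode` are made up for the example; only the column layout and the rating threshold follow the notebook.

```python
import numpy as np

# Toy sizes, chosen only for illustration (not the notebook's real counts).
n_users, n_movies = 3, 4

def encode(user_id, movie_id, rating):
    """Return (feature_row, label) for one rating, using 1-based ids."""
    row = np.zeros(n_users + n_movies, dtype="float32")
    row[user_id - 1] = 1.0               # user block: columns 0 .. n_users-1
    row[n_users + movie_id - 1] = 1.0    # movie block: columns n_users .. end
    label = 1.0 if rating >= 4 else 0.0  # 4 and above is the positive class
    return row, label

row, label = encode(user_id=2, movie_id=3, rating=5.0)
print("{} -> {}".format(row.tolist(), label))
```

The real matrix has the same layout, just with `nb_users + nb_movies` columns and a sparse representation, since each row holds only two non-zero entries.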

In [32]:
# def transformRow(row):
#     cur_row = []
#     if int(row[2]) >= 4:
#         cur_row.append(1)
#     else:
#         cur_row.append(0)
        
#      matrix[line,row[0]-1] = 1
    
#      matrix[line, nb_users+(row[1]-1)] = 1   
    
#     return cur_row

# def transformMatrix(row):
#     matrix[line,row[0]-1] = 1
#     matrix[line, nb_users+(row[1]-1)] = 1
#     return matrix

def loadDataset(df, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    
#     yPool = Pool(processes=multiprocessing.cpu_count())
#     Y = yPool.map(transformRow, df.values())
#     yPool.close() 
#     yPool.join()
    
#     xPool = Pool(processes=multiprocessing.cpu_count())
#     Y = xPool.map(transformRow, df.values())
#     xPool.close() 
#     xPool.join()
    
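    line=0
    for index, row in df.iterrows():
        # Label: 1 for ratings of 4 and above, 0 otherwise
        Y.append(1 if int(row['rating']) >= 4 else 0)
        # One-hot encode the user, then the movie (offset by nb_users).
        # iterrows() yields floats here, so cast the ids back to int for indexing.
        X[line, int(row['user_id'])-1] = 1
        X[line, nb_users+(int(row['movie_id'])-1)] = 1
        line=line+1

    Y=np.array(Y).astype('float32')
    return X,Y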


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)
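
Row-by-row assignment into a `lil_matrix` can be slow for larger datasets. A possible alternative, sketched here under the assumption that the ids are 1-based integers as above, builds the same matrix in one step from coordinate arrays. The function name `load_dataset_vectorized` is hypothetical, and whether the resulting CSR matrix can be passed unchanged to the protobuf writer used later is an assumption that should be checked.

```python
import numpy as np
from scipy.sparse import csr_matrix

def load_dataset_vectorized(user_ids, movie_ids, ratings, n_users, n_columns):
    """Build the one-hot feature matrix and binary labels without a Python loop."""
    user_ids = np.asarray(user_ids, dtype=np.int64)
    movie_ids = np.asarray(movie_ids, dtype=np.int64)
    n_rows = len(user_ids)

    # Each rating contributes two non-zero entries on the same row.
    rows = np.repeat(np.arange(n_rows), 2)
    cols = np.column_stack((user_ids - 1, n_users + movie_ids - 1)).ravel()
    data = np.ones(2 * n_rows, dtype=np.float32)

    X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_columns))
    Y = (np.asarray(ratings) >= 4).astype(np.float32)
    return X, Y

# Toy usage: 2 users, 3 movies, 2 ratings.
X_toy, Y_toy = load_dataset_vectorized([1, 2], [1, 3], [5.0, 3.0],
                                       n_users=2, n_columns=5)
print(X_toy.toarray())
print(Y_toy)
```

With the notebook's data frames, the call would look like `load_dataset_vectorized(df['user_id'], df['movie_id'], df['rating'], nb_users, nb_features)`.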

In [33]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
# np.count_nonzero counts the labels equal to 1
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-one_labels, one_labels))

(94735, 194219)
(94735,)
Training labels: 49698 zeros, 45037 ones
(6099, 194219)
(6099,)
Test labels: 2557 zeros, 3542 ones


### Convert to Protobuf format for saving to S3

In [34]:
#Change this value to your own bucket name
bucket = 'movie-recommender-josh-krishna'
prefix = 'fm'

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [35]:
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", Y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", Y_test)    
  
print "Training data S3 path: ",fm_train_data_path
print "Test data S3 path: ",fm_test_data_path
print "FM model output S3 path: {}".format(output_prefix)

Training data S3 path:  s3://movie-recommender-josh-krishna/fm/train/train.protobuf
Test data S3 path:  s3://movie-recommender-josh-krishna/fm/test/test.protobuf
FM model output S3 path: s3://movie-recommender-josh-krishna/fm/output


### Run training job

You can experiment with the hyperparameters until you are happy with the predictions. For this dataset and hyperparameter configuration, after 100 epochs, test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.74 (1 indicates a perfect classifier). Not great, but you can fine-tune the model further.
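
For readers less familiar with these metrics, here is a minimal sketch of how accuracy and F1 are computed from binary labels and predictions. The label lists are invented for illustration and the helper `binary_metrics` is not part of SageMaker; the training job reports these metrics itself.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print("accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}".format(
    accuracy, precision, recall, f1))
```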

In [36]:
instance_type='ml.m5.large'
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type=instance_type,
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nb_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

2019-06-13 18:16:18 Starting - Starting the training job...
2019-06-13 18:16:25 Starting - Launching requested ML instances.........
2019-06-13 18:17:59 Starting - Preparing the instances for training......
2019-06-13 18:19:02 Downloading - Downloading input data...
2019-06-13 18:19:49 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[06/13/2019 18:19:52 INFO 140082306389824] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_batch_size': u'1000', 


2019-06-13 18:19:50 Training - Training image download completed. Training in progress.
[2019-06-13 18:20:06.059] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 20, "duration": 1130, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=9, train binary_classification_accuracy <score>=0.731284210526
[06/13/2019 18:20:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=9, train binary_classification_cross_entropy <loss>=0.606125123355
[06/13/2019 18:20:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=9, train binary_f_1.000 <score>=0.698940962804
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1132.580041885376, "sum": 1132.580041885376, "min": 1132.580041885376}}, "EndTime": 1560450006.060827, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450004.

[2019-06-13 18:20:16.218] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 38, "duration": 1091, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=18, train binary_classification_accuracy <score>=0.758336842105
[06/13/2019 18:20:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=18, train binary_classification_cross_entropy <loss>=0.561498596191
[06/13/2019 18:20:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=18, train binary_f_1.000 <score>=0.73649051926
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1093.8868522644043, "sum": 1093.8868522644043, "min": 1093.8868522644043}}, "EndTime": 1560450016.219662, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450015.124841}
[06/13/2019 18:20:16 INFO 140082306389824] #progress_metric: host

[2019-06-13 18:20:26.481] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 56, "duration": 1087, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=27, train binary_classification_accuracy <score>=0.768273684211
[06/13/2019 18:20:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=27, train binary_classification_cross_entropy <loss>=0.534450907818
[06/13/2019 18:20:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=27, train binary_f_1.000 <score>=0.749784041828
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1089.2109870910645, "sum": 1089.2109870910645, "min": 1089.2109870910645}}, "EndTime": 1560450026.482002, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450025.39184}
[06/13/2019 18:20:26 INFO 140082306389824] #progress_metric: host

[2019-06-13 18:20:36.539] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 74, "duration": 1127, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=36, train binary_classification_accuracy <score>=0.773063157895
[06/13/2019 18:20:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=36, train binary_classification_cross_entropy <loss>=0.516205997828
[06/13/2019 18:20:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=36, train binary_f_1.000 <score>=0.756243993442
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1129.3151378631592, "sum": 1129.3151378631592, "min": 1129.3151378631592}}, "EndTime": 1560450036.539976, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450035.40968}
[06/13/2019 18:20:36 INFO 140082306389824] #progress_metric: host

[2019-06-13 18:20:46.605] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 92, "duration": 1124, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:46 INFO 140082306389824] #quality_metric: host=algo-1, epoch=45, train binary_classification_accuracy <score>=0.777821052632
[06/13/2019 18:20:46 INFO 140082306389824] #quality_metric: host=algo-1, epoch=45, train binary_classification_cross_entropy <loss>=0.502476499537
[06/13/2019 18:20:46 INFO 140082306389824] #quality_metric: host=algo-1, epoch=45, train binary_f_1.000 <score>=0.76232729402
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1126.3039112091064, "sum": 1126.3039112091064, "min": 1126.3039112091064}}, "EndTime": 1560450046.605778, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450045.478609}
[06/13/2019 18:20:46 INFO 140082306389824] #progress_metric: host

[2019-06-13 18:20:56.647] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 110, "duration": 1077, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:20:56 INFO 140082306389824] #quality_metric: host=algo-1, epoch=54, train binary_classification_accuracy <score>=0.781968421053
[06/13/2019 18:20:56 INFO 140082306389824] #quality_metric: host=algo-1, epoch=54, train binary_classification_cross_entropy <loss>=0.491278400943
[06/13/2019 18:20:56 INFO 140082306389824] #quality_metric: host=algo-1, epoch=54, train binary_f_1.000 <score>=0.767643000572
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1079.5679092407227, "sum": 1079.5679092407227, "min": 1079.5679092407227}}, "EndTime": 1560450056.64821, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450055.56766}
[06/13/2019 18:20:56 INFO 140082306389824] #progress_metric: host

[2019-06-13 18:21:06.685] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 128, "duration": 1092, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:21:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=63, train binary_classification_accuracy <score>=0.785105263158
[06/13/2019 18:21:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=63, train binary_classification_cross_entropy <loss>=0.481647341116
[06/13/2019 18:21:06 INFO 140082306389824] #quality_metric: host=algo-1, epoch=63, train binary_f_1.000 <score>=0.771672389303
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1094.951868057251, "sum": 1094.951868057251, "min": 1094.951868057251}}, "EndTime": 1560450066.686029, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450065.589587}
[06/13/2019 18:21:06 INFO 140082306389824] #progress_metric: host=

[2019-06-13 18:21:16.776] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 146, "duration": 1148, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:21:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=72, train binary_classification_accuracy <score>=0.788368421053
[06/13/2019 18:21:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=72, train binary_classification_cross_entropy <loss>=0.473059812847
[06/13/2019 18:21:16 INFO 140082306389824] #quality_metric: host=algo-1, epoch=72, train binary_f_1.000 <score>=0.775641383312
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1150.5029201507568, "sum": 1150.5029201507568, "min": 1150.5029201507568}}, "EndTime": 1560450076.777092, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450075.625732}
[06/13/2019 18:21:16 INFO 140082306389824] #progress_metric: ho

[2019-06-13 18:21:26.761] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 164, "duration": 1079, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:21:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=81, train binary_classification_accuracy <score>=0.7916
[06/13/2019 18:21:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=81, train binary_classification_cross_entropy <loss>=0.465191806833
[06/13/2019 18:21:26 INFO 140082306389824] #quality_metric: host=algo-1, epoch=81, train binary_f_1.000 <score>=0.779596108031
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1083.493947982788, "sum": 1083.493947982788, "min": 1083.493947982788}}, "EndTime": 1560450086.762, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450085.67757}
[06/13/2019 18:21:26 INFO 140082306389824] #progress_metric: host=algo-1, comp


2019-06-13 18:21:49 Uploading - Uploading generated training model
[2019-06-13 18:21:36.798] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 182, "duration": 1073, "num_examples": 95, "num_bytes": 6063040}
[06/13/2019 18:21:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=90, train binary_classification_accuracy <score>=0.794526315789
[06/13/2019 18:21:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=90, train binary_classification_cross_entropy <loss>=0.457821229312
[06/13/2019 18:21:36 INFO 140082306389824] #quality_metric: host=algo-1, epoch=90, train binary_f_1.000 <score>=0.783048436215
#metrics {"Metrics": {"update.time": {"count": 1, "max": 1075.9470462799072, "sum": 1075.9470462799072, "min": 1075.9470462799072}}, "EndTime": 1560450096.79949, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1560450095.722673}

[06/13/2019 18:21:46 INFO 140082306389824] Saved checkpoint to "/tmp/tmpbCLoBq/state-0001.params"
[2019-06-13 18:21:47.105] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/test", "epoch": 0, "duration": 114151, "num_examples": 1, "num_bytes": 64000}
[2019-06-13 18:21:47.656] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/test", "epoch": 1, "duration": 550, "num_examples": 7, "num_bytes": 390336}
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 7, "sum": 7.0, "min": 7}, "Number of Batches Since Last Reset": {"count": 1, "max": 7, "sum": 7.0, "min": 7}, "Number of Records Since Last Reset": {"count": 1, "max": 6099, "sum": 6099.0, "min": 6099}, "Total Batches Seen": {"count": 1, "max": 7, "sum": 7.0, "min": 7}, "Total Records Seen": {"count": 1, "max": 6099, "sum": 6099.0, "min": 6099}, "Max Records Seen Between Resets": {"count": 1, "max": 6099, "sum": 6099.0, "min": 6099}, "Reset Count"

## Part 2 - Repackaging Model data to fit a KNN Model

Now that the model has been trained and its artifact saved by SageMaker, we can download it from S3 and repackage it to fit a KNN model.

### Download model data

In [37]:
import mxnet as mx
model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print "Model Path: ", model_full_path

#Download FM model 
%cd ..
os.system('aws s3 cp '+model_full_path+ ' ./')

#Extract model file for loading to MXNet
os.system('tar xzvf '+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

Model Path:  s3://movie-recommender-josh-krishna/fm/output/factorization-machines-2019-06-13-18-16-18-700/output/model.tar.gz
/home/ec2-user/SageMaker/MovieRecommender/Notebook


0

### Extract model data to create item and user latent matrices

In [38]:
#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()
w = m._arg_params['w1_weight'].asnumpy()
b = m._arg_params['w0_weight'].asnumpy()

# item latent matrix - concat(V[i], w[i]).  
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1)

#user latent matrix - concat (V[u], 1) 
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)
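
The reason for this particular concatenation can be checked numerically. For a one-hot (user, movie) input, the standard FM score reduces to `w0 + w[u] + w[i] + <V[u], V[i]>`. The first two terms are the same for every movie of a given user, so ranking movies by the inner product of `concat(V[u], 1)` and `concat(V[i], w[i])` gives the same order as ranking by the full score. The sketch below verifies this on random toy parameters; the sizes, the seed, and the helper `fm_score` are illustrative assumptions, not values taken from the trained model.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_items, k = 3, 5, 4      # toy sizes, for illustration only

# Random stand-ins for the FM parameters extracted above.
V = rng.normal(size=(n_users + n_items, k))   # factors
w = rng.normal(size=(n_users + n_items, 1))   # linear weights
w0 = 0.1                                      # global bias

# Same construction as in the notebook cell.
item_matrix = np.concatenate((V[n_users:], w[n_users:]), axis=1)
user_matrix = np.concatenate((V[:n_users], np.ones((n_users, 1))), axis=1)

def fm_score(u, i):
    """FM score (before the sigmoid) for 0-based user u and item i."""
    return w0 + w[u, 0] + w[n_users + i, 0] + V[u].dot(V[n_users + i])

u = 1
full_scores = np.array([fm_score(u, i) for i in range(n_items)])
knn_scores = item_matrix.dot(user_matrix[u])

# The two differ only by a constant that depends on the user, not the item.
assert np.allclose(full_scores - knn_scores, w0 + w[u, 0])
# So both produce the same ranking of items.
assert (np.argsort(-full_scores) == np.argsort(-knn_scores)).all()
print("rankings match")
```

This is why the KNN model in the next part is trained on the item matrix and queried with the user matrix.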

## Part 3 - Building KNN Model

In this section, we upload the model input data to S3, train a KNN model, and save it. Saving the model makes it appear in the Models section of SageMaker. It also lets us reference the model later for batch transform, or deploy it as an endpoint for real-time inference.

This approach uses the default 'index_type' parameter for KNN. It is exact but can be slow for large datasets. In such cases, you may want to use a different 'index_type' that gives an approximate but faster answer.
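
As a rough sketch of what such a change could look like, the dictionary below swaps the default `faiss.Flat` index for `faiss.IVFFlat`, which is one of the approximate index types listed in the SageMaker KNN documentation. The other values mirror the ones used in this notebook. Whether this index type works well with `INNER_PRODUCT` on this dataset has not been tested here and should be checked against the documentation before use.

```python
# Hypothetical alternative hyperparameters; not run as part of this notebook.
approx_knn_hyperparameters = {
    "feature_dim": 65,                  # 64 factors + 1 linear weight, as above
    "k": 5,                             # number of recommendations
    "predictor_type": "classifier",
    "sample_size": 200000,
    "index_metric": "INNER_PRODUCT",
    "index_type": "faiss.IVFFlat",      # approximate index (default is faiss.Flat)
}

# Would replace the set_hyperparameters call in the next cell:
# knn.set_hyperparameters(**approx_knn_hyperparameters)
print(approx_knn_hyperparameters["index_type"])
```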

In [39]:
print('KNN train features shape = ', knn_item_matrix.shape)
knn_prefix = 'knn'
knn_output_prefix  = 's3://{}/{}/output'.format(bucket, knn_prefix)
knn_train_data_path = writeDatasetToProtobuf(knn_item_matrix, bucket, knn_prefix, train_key, "dense", knn_train_label)
print('uploaded KNN train data: {}'.format(knn_train_data_path))

nb_recommendations = 5

# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=knn_output_prefix,
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(feature_dim=knn_item_matrix.shape[1], k=nb_recommendations, index_metric="INNER_PRODUCT", predictor_type='classifier', sample_size=200000)
fit_input = {'train': knn_train_data_path}
knn.fit(fit_input)
knn_model_name =  knn.latest_training_job.job_name
print "created model: ", knn_model_name

# save the model so that we can reference it in the next step during batch inference
sm = boto3.client(service_name='sagemaker')
primary_container = {
    'Image': knn.image_name,
    'ModelDataUrl': knn.model_data,
}

knn_model = sm.create_model(
        ModelName = knn.latest_training_job.job_name,
        ExecutionRoleArn = knn.role,
        PrimaryContainer = primary_container)
print "saved the model"

('KNN train features shape = ', (193609, 65))
uploaded KNN train data: s3://movie-recommender-josh-krishna/knn/train.protobuf
2019-06-13 18:22:45 Starting - Starting the training job...
2019-06-13 18:22:46 Starting - Launching requested ML instances......
2019-06-13 18:23:53 Starting - Preparing the instances for training...
2019-06-13 18:24:43 Downloading - Downloading input data......
2019-06-13 18:25:45 Training - Training image download completed. Training in progress.
2019-06-13 18:25:45 Uploading - Uploading generated training model.
Docker entrypoint called with argument(s): train
[06/13/2019 18:25:41 INFO 140666129676096] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'index_metric': u'L2', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'faiss_index_ivf_nlists': u'auto', u'epochs': u'1', u'index_type': u'faiss.Flat', u'_faiss_index_nprobe': u'5', u'_kvstore': u'


2019-06-13 18:25:58 Completed - Training job completed
Billable seconds: 75
created model:  knn-2019-06-13-18-22-45-208
saved the model


## Part 4 - Batch Transform

In this section, we use SageMaker's Batch Transform to predict the top X items for all users in a single batch job.

In [40]:
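#upload inference data to S3
#note: this reuses train_key, so the file overwrites the KNN training data at knn/train.protobuf;
#the output file name used below (train.protobuf.out) is derived from this key
knn_batch_data_path = writeDatasetToProtobuf(knn_user_matrix, bucket, knn_prefix, train_key, "dense")
print "Batch inference data path: ",knn_batch_data_path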

# Initialize the transformer object
transformer =sagemaker.transformer.Transformer(
    base_transform_job_name="knn",
    model_name=knn_model_name,
    instance_count=1,
    instance_type=instance_type,
    output_path=knn_output_prefix,
    accept="application/jsonlines; verbose=true"
)

# Start a transform job:
transformer.transform(knn_batch_data_path, content_type='application/x-recordio-protobuf')
transformer.wait()


#Download predictions 
results_file_name = "inference_output"
inference_output_file = "knn/output/train.protobuf.out"
s3_client = boto3.client('s3')
s3_client.download_file(bucket, inference_output_file, results_file_name)
with open(results_file_name) as f:
    results = f.readlines()  

Batch inference data path:  s3://movie-recommender-josh-krishna/knn/train.protobuf
.........................................!


In [41]:
import json
test_user_idx = 4
u_one_json = json.loads(results[test_user_idx])

movies_df = pd.read_csv("./ml-latest-small/movies.csv")

#print user_movie_ratings_train[user_movie_ratings_train.user_id==int(test_user_idx)].movie_id

print "Recommended movies for user #{} : {}".format(test_user_idx+1, [movies_df[movies_df.movieId==int(movie_id)].title.item() for movie_id in u_one_json['labels']])
print
print "Movie distances for user #{} : {}".format(test_user_idx+1,  [round(distance, 4) for distance in u_one_json['distances']])

Recommended movies for user #5 : ['Cove, The (2009)', 'Shawshank Redemption, The (1994)', 'Godfather, The (1972)', 'Incendies (2010)', 'Hunt, The (Jagten) (2012)']

Movie distances for user #5 : [0.5693, 0.5709, 0.6057, 0.6203, 0.6301]
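
To collect recommendations for every user rather than a single one, the result lines can be parsed in a loop. The sketch below assumes, as the cell above does, that each line is a JSON object with `labels` and `distances` fields and that line `n` (0-based) corresponds to user `n+1`. The sample lines and the helper `parse_recommendations` are made up for illustration.

```python
import json

# Illustrative lines only; the ids and distances below are made up.
sample_results = [
    '{"labels": [10.0, 20.0, 30.0], "distances": [0.51, 0.57, 0.63]}',
    '{"labels": [40.0, 10.0, 50.0], "distances": [0.42, 0.48, 0.66]}',
]

def parse_recommendations(lines):
    """Map user id -> list of (movie_id, distance) pairs, in file order."""
    recs = {}
    for idx, line in enumerate(lines):
        record = json.loads(line)
        movie_ids = [int(label) for label in record["labels"]]
        recs[idx + 1] = list(zip(movie_ids, record["distances"]))
    return recs

recs = parse_recommendations(sample_results)
for user_id in sorted(recs):
    print("user {}: {}".format(user_id, recs[user_id]))
```

In the notebook, the same function could be called on the `results` list read from the batch transform output.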


In [42]:
%pwd

u'/home/ec2-user/SageMaker/MovieRecommender/Notebook'