<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Train a recommender system with Azure Machine Learning
---

This tutorial walks you through training a SAR recommender algorithm on the [Movielens dataset](https://grouplens.org/datasets/movielens/) with [Azure Machine Learning service](https://docs.microsoft.com/azure/machine-learning/service/overview-what-is-azure-ml) and deploying it to a web service. It demonstrates how to use the power of the cloud to manage data, switch to powerful GPU machines, and monitor runs while training a model. You can read more about recommenders on Azure Machine Learning service [here](https://azure.microsoft.com/en-us/blog/building-recommender-systems-with-azure-machine-learning-service/).

In this tutorial you will: 
- Connect to an Azure Machine Learning service workspace
- Access movielens data from a datastore
- Connect to CPU and GPU machines from [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)
- Create a training script using the recommender repo's [util functions](https://github.com/Microsoft/Recommenders/tree/master/reco_utils) for SAR and add logging information
- Submit the training job to AzureML, and monitor the run with a Jupyter widget
- Test an existing model with new user data
- **Optional part 2:** Deploy the model to a web service using Azure Container Instance. 


## Details of SAR
    
SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It produces easily explainable and interpretable recommendations and handles "cold item" and "semi-cold user" scenarios. SAR is a neighborhood-based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) that is intended for ranking the top items for each user.

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users who have interacted with one item are also likely to have interacted with another. A user has an ***affinity*** to an item if they have interacted with it in the past.
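
To make the notion of item similarity concrete, here is a minimal sketch that computes Jaccard similarity from co-occurrence counts on a toy interaction log. The data and the helper name `jaccard` are made up for illustration; this is not the `reco_utils` implementation, only the idea behind the `similarity_type="jaccard"` option used later in the training script.

```python
from collections import defaultdict
from itertools import combinations

# Toy interaction log of (user, item) pairs; all values are illustrative.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
]

# Collect the set of users who interacted with each item.
users_by_item = defaultdict(set)
for user, item in interactions:
    users_by_item[item].add(user)


def jaccard(item_i, item_j):
    """Users who interacted with both items divided by users who interacted with either."""
    both = users_by_item[item_i] & users_by_item[item_j]
    either = users_by_item[item_i] | users_by_item[item_j]
    return len(both) / len(either) if either else 0.0


# Similarity for every unordered pair of items.
similarity = {
    (i, j): jaccard(i, j)
    for i, j in combinations(sorted(users_by_item), 2)
}
print(similarity)
```

In this toy log, items A and B share two of their three users, so they come out as more similar than A and C, which share only one.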

### Advantages of SAR:
- High accuracy for an easy to train and deploy algorithm
- Fast training, only requiring simple counting to construct the matrices used at prediction time
- Fast scoring, only involving multiplication of the similarity matrix with an affinity vector
---
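
The scoring step can be sketched in a few lines of plain Python. The similarity values, the affinity vector, and the helper names `score` and `recommend` below are hypothetical; the sketch only illustrates the vector-matrix product and the removal of already seen items, not the library's code.

```python
# Item order shared by the affinity vector and the similarity matrix (toy data).
items = ["A", "B", "C"]

# Hypothetical symmetric item-item similarity matrix.
similarity = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

# One user's affinity vector: a rating of 5 for item A and nothing else.
affinity = [5.0, 0.0, 0.0]


def score(affinity, similarity):
    """Vector-matrix product: score[j] = sum over i of affinity[i] * similarity[i][j]."""
    n = len(affinity)
    return [sum(affinity[i] * similarity[i][j] for i in range(n)) for j in range(n)]


def recommend(affinity, similarity, items, top_k, remove_seen=True):
    """Rank items by score, optionally skipping items the user already interacted with."""
    scores = score(affinity, similarity)
    ranked = [
        (items[j], scores[j])
        for j in range(len(items))
        if not (remove_seen and affinity[j] > 0)
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


recommendations = recommend(affinity, similarity, items, top_k=2)
print(recommendations)
```

Because the user has only interacted with item A, the ranking of the remaining items simply follows their similarity to A.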

**Note: If you run into any issues, please try running `!pip install --upgrade azureml-sdk`**

In [2]:
import sys
import os
import shutil
import numpy as np

from reco_utils.dataset import movielens
from reco_utils.azureml.azureml_utils import get_or_create_workspace

import azureml
from azureml.core import Workspace, Run, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

### Connect to an AzureML workspace

An [AzureML Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML Workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, deployment, inferencing, and the monitoring of deployed models.

A workspace and its [configuration file](././home/nbuser/library/config.json) have already been created for you. Simply load it with `ws = Workspace.from_config()`. You may be asked to log in; if so, follow the prompts.

**More information on creating your own workspace can be found in [this tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace#portal).**

In [None]:
ws = get_or_create_workspace(
    subscription_id="d5aa990f-2452-4701-bd8e-21959f91194c",
    resource_group="190500-labs-azureml",
    workspace_name="pycon_azureml"
)

### Visualize the data

Let's take a look at the data itself. We use a sample of 100k data points from the Movielens dataset. This data has the UserId, MovieId, and Rating that a user provided, along with the timestamp and title of the movie.

In [10]:
# download dataset
data = movielens.load_pandas_df(
    size='100k',
    header=['UserId','MovieId','Rating','Timestamp'],
    title_col='Title'
)
data.head()

4.93MB [00:01, 4.19MB/s]                            


| | UserId | MovieId | Rating | Timestamp | Title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3.0 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5.0 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3.0 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5.0 | 876503793 | Kolya (1996) |


#### Take some time to play with the data
Above we looked at the first 5 rows in the Movielens dataset. Spend some time looking at the rest of the data to understand it. 

Here are a few things you can try. For more, check out this [Python data science cheat sheet](https://s3.amazonaws.com/dq-blog-files/pandas-cheat-sheet.pdf).
- `data.head(n)` | First n rows of the DataFrame
- `data.tail(n)` | Last n rows of the DataFrame
- `data.shape` | Number of rows and columns
- `data.info()` | Index, Datatype and Memory information
- `data.describe()` | Summary statistics for numerical columns
- `data['Title'].value_counts(dropna=False)` | View unique values and counts
- `data.groupby("Title").agg(np.mean)` | Find the average across all columns for every unique Title

### Access data from a datastore

Every workspace comes with a default [datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) (and you can register more) which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from local to the cloud, and access it from the compute target. The data files are uploaded into a directory named `data` at the root of the datastore.

For this lab, we have already downloaded the data and uploaded it to the datastore for you. Simply choose the data set size `100k`, `1m`, `10m`, or `20m`. 

**Note: For data sizes over 1m, you will need to utilize a GPU machine.**

#### Running this lab at home?
<details><summary> Learn how to do this on your own</summary>
<p>
Outside of this tutorial, you will need to run the following code to set up the datastore and upload the movielens dataset to it:
</p>
    
```
# Assumes the imports from the first cell and a connected workspace `ws`
MOVIELENS_DATA_SIZE = '100k'  # one of 100k, 1m, 10m, or 20m
DATA_DIR = 'aml_data'
TARGET_DIR = 'movielens'
os.makedirs(DATA_DIR, exist_ok=True)

# download dataset
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['UserId','MovieId','Rating','Timestamp'],
    title_col='Title'
)

# upload dataset to workspace datastore
data_file_name = "movielens_" + MOVIELENS_DATA_SIZE + "_data.pkl"
data.to_pickle(os.path.join(DATA_DIR, data_file_name))
ds = ws.get_default_datastore()
ds.upload(src_dir=DATA_DIR, target_path=TARGET_DIR, overwrite=True, show_progress=True)
```
</details>

In [23]:
# top k items to recommend
TOP_K = 10

# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

In [24]:
data_file_name = "movielens_" + MOVIELENS_DATA_SIZE + "_data.pkl"
ds = ws.get_default_datastore()

### Create or Attach Azure Machine Learning Compute 

We've created CPU and GPU clusters as our **remote compute target**. You can read [Set up compute targets for model training](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets) to learn more about setting up compute targets in different locations. 
According to Azure [Pricing calculator](https://azure.microsoft.com/en-us/pricing/calculator/), with example VM size `STANDARD_D2_V2`, it costs a few dollars to run this notebook, which is well covered by Azure new subscription credit. For billing and pricing questions, please contact [Azure support](https://azure.microsoft.com/en-us/support/options/).

**Note**:
- The 10m and 20m datasets require more capacity than `STANDARD_D2_V2`, such as `STANDARD_NC6` or `STANDARD_NC12`. See the list of all available VM sizes [here](https://docs.microsoft.com/en-us/azure/templates/Microsoft.Compute/2018-10-01/virtualMachines?toc=%2Fen-us%2Fazure%2Fazure-resource-manager%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json#hardwareprofile-object).
- As with other Azure services, there are limits on certain resources (e.g. AmlCompute quota) associated with the Azure Machine Learning service. Please read [these instructions](https://docs.microsoft.com/en-us/azure/azure-supportability/resource-manager-core-quotas-request) on the default limits and how to request more quota.
---
#### Learn more about Azure Machine Learning Compute
<details>
    <summary>Click to learn more about compute types</summary>
    
[Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) is managed compute infrastructure that allows the user to easily create single to multi-node compute of the appropriate VM Family. It is created within your workspace region and is a resource that can be used by other users in your workspace. It autoscales by default to the max_nodes, when a job is submitted, and executes in a containerized environment packaging the dependencies as specified by the user.

Since it is managed compute, job scheduling and cluster management are handled internally by Azure Machine Learning service.

You can provision a persistent AmlCompute resource by simply defining two parameters thanks to smart defaults. By default it autoscales from 0 nodes and provisions dedicated VMs to run your job in a container. This is useful when you want to continuously re-use the same target, debug it between jobs, or simply share the resource with other users of your workspace.

In addition to vm_size and max_nodes, you can specify:
- **min_nodes**: Minimum nodes (default 0 nodes) to downscale to while running a job on AmlCompute
- **vm_priority**: Choose between 'dedicated' (default) and 'lowpriority' VMs when provisioning AmlCompute. Low Priority VMs use Azure's excess capacity and are thus cheaper but risk your run being pre-empted
- **idle_seconds_before_scaledown**: Idle time (default 120 seconds) to wait after run completion before auto-scaling to min_nodes
- **vnet_resourcegroup_name**: Resource group of the existing VNet within which AmlCompute should be provisioned
- **vnet_name**: Name of VNet
- **subnet_name**: Name of SubNet within the VNet

To create your own compute run the following code:

```
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)
```
</details>
---

In [25]:
CLUSTER_NAME = 'cpucluster'
# CLUSTER_NAME = 'gpucluster'

compute_target = ComputeTarget(workspace=ws, name=CLUSTER_NAME)
print("Found existing compute target \nName: " + compute_target.name + "\nRegion:" + compute_target.cluster_location)

Found existing compute target 
Name: cpucluster
Region:eastus
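
The sizing notes above can be captured in a small lookup so that the cluster name and the data file name always agree with the chosen dataset size. This is a sketch: the size-to-cluster mapping is an assumption drawn from the notes in this section, and the helper names `pick_cluster` and `build_data_file_name` are hypothetical.

```python
# Dataset sizes offered in this lab mapped to the cluster names used above.
# The mapping is an assumption based on the notes in this section.
CLUSTER_FOR_SIZE = {
    "100k": "cpucluster",
    "1m": "cpucluster",
    "10m": "gpucluster",
    "20m": "gpucluster",
}


def pick_cluster(data_size):
    """Return the cluster name suggested for a Movielens data size."""
    if data_size not in CLUSTER_FOR_SIZE:
        raise ValueError("Unknown Movielens data size: " + repr(data_size))
    return CLUSTER_FOR_SIZE[data_size]


def build_data_file_name(data_size):
    """Build the pickle file name using the naming convention from this notebook."""
    return "movielens_" + data_size + "_data.pkl"


print(pick_cluster("100k"), build_data_file_name("100k"))
```

You could then set `CLUSTER_NAME = pick_cluster(MOVIELENS_DATA_SIZE)` instead of commenting lines in and out.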


# Prepare training script
### 1. Create a directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on.

In [26]:
SCRIPT_FOLDER = './movielens-sar'
os.makedirs(SCRIPT_FOLDER, exist_ok=True)

### 2.  Create a training script
To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. The script reads the pickled data from the mounted datastore, trains and evaluates SAR, logs the results, and uploads the trained model to the run.

This code takes what is in the local quickstart and converts it into a single training script. We use `run.log()` to record parameters and metrics to the run. We will be able to review and compare these measures in the Azure Portal at a later time.

In [27]:
%%writefile $SCRIPT_FOLDER/train.py

import argparse
import os
import numpy as np
import pandas as pd
import itertools
import logging
import time

from azureml.core import Run
from sklearn.externals import joblib

from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from reco_utils.recommender.sar.sar_singlenode import SARSingleNode

TARGET_DIR = 'movielens'
OUTPUT_FILE_NAME = 'outputs/movielens_sar_model.pkl'
MODEL_FILE_NAME = 'movielens_sar_model.pkl'

# get hold of the current run
run = Run.get_context()

# let the user pass in the data location (folder and file on the datastore),
# the number of items to recommend, and the Movielens data size
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--data-file', type=str, dest='data_file', help='data file name')
parser.add_argument('--top-k', type=int, dest='top_k', default=10, help='top k items to recommend')
parser.add_argument('--data-size', type=str, dest='data_size', default='100k', help='Movielens data size: 100k, 1m, 10m, or 20m')
args = parser.parse_args()

run.log("top-k",args.top_k)
run.log("data-size", args.data_size)
data_pickle_path = os.path.join(args.data_folder, args.data_file)

data = pd.read_pickle(path=data_pickle_path)

train, test = python_random_split(data,0.75)

# instantiate the SAR algorithm and set the index
header = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_rating": "Rating",
    "col_timestamp": "Timestamp",
}

logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SARSingleNode(
    remove_seen=True, similarity_type="jaccard", 
    time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header
)

# train the SAR model
start_time = time.time()

model.fit(train)

train_time = time.time() - start_time
run.log(name="Training time", value=train_time)

start_time = time.time()

top_k = model.recommend_k_items(test)

test_time = time.time() - start_time
run.log(name="Prediction time", value=test_time)

# TODO: remove this call when the model returns same type as input
top_k['UserId'] = pd.to_numeric(top_k['UserId'])
top_k['MovieId'] = pd.to_numeric(top_k['MovieId'])

# evaluate
eval_map = map_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                    col_rating="Rating", col_prediction="prediction", 
                    relevancy_method="top_k", k=args.top_k)
eval_ndcg = ndcg_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                      col_rating="Rating", col_prediction="prediction", 
                      relevancy_method="top_k", k=args.top_k)
eval_precision = precision_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                                col_rating="Rating", col_prediction="prediction", 
                                relevancy_method="top_k", k=args.top_k)
eval_recall = recall_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                          col_rating="Rating", col_prediction="prediction", 
                          relevancy_method="top_k", k=args.top_k)

run.log("map", eval_map)
run.log("ndcg", eval_ndcg)
run.log("precision", eval_precision)
run.log("recall", eval_recall)
# run.log_table("topk", top_k.to_dict())

# automatic upload of everything in the ./outputs folder doesn't work for very large model files,
# so the model is saved to a temporary location and then uploaded with upload_file
joblib.dump(value=model, filename=MODEL_FILE_NAME)

run.upload_file(OUTPUT_FILE_NAME, MODEL_FILE_NAME)

Overwriting ./movielens-sar/train.py
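
Before submitting the job, it can help to check the argument handling locally. The sketch below rebuilds only the parser with the same flag names and types as `train.py` and feeds it a sample argument list. The mount path `/tmp/datastore` and the helper name `build_parser` are hypothetical, and defaults are left out for brevity.

```python
import argparse


def build_parser():
    """Recreate the command-line interface of train.py (flag names and types only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder')
    parser.add_argument('--data-file', type=str, dest='data_file')
    parser.add_argument('--top-k', type=int, dest='top_k')
    parser.add_argument('--data-size', type=str, dest='data_size')
    return parser


# Sample arguments; '/tmp/datastore' stands in for the real datastore mount point.
args = build_parser().parse_args([
    '--data-folder', '/tmp/datastore',
    '--data-file', 'movielens/movielens_100k_data.pkl',
    '--top-k', '10',
    '--data-size', '100k',
])
print(args.data_folder, args.data_file, args.top_k, args.data_size)
```

Note that `--top-k` arrives as the string `'10'` and is converted to an integer by `type=int`, while `--data-size` stays a string.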


### 3. Copy the reco_utils
You will need to copy the utility functions into the project directory as well so that the training script can access them.

In [28]:
# copy dependent python files
UTILS_DIR = './movielens-sar/reco_utils/'
# remove any copy left over from a previous run; ignore_errors=True avoids an error if the directory does not exist yet
shutil.rmtree(UTILS_DIR, ignore_errors=True)
shutil.copytree('reco_utils/', UTILS_DIR)

'./movielens-sar/reco_utils/'

# Run training script
### 1. Create an estimator
An [estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-ml-models) object is used to submit the run. You can create and use a generic Estimator to submit a training script that uses any learning framework you choose (such as scikit-learn) and run it on any compute target, whether it's your local machine, a single VM in Azure, or a GPU cluster in Azure. 

Create your estimator by running the following code to define:  
* The name of the estimator object, `est`
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The compute target. In this case you will use the AmlCompute already created. You can switch between the CPU and GPU machines.
* The training script name, `train.py`
* The parameters required by the training script
* The Python packages needed for training
* The connection to the data files in the datastore

`ds.as_mount()` mounts the datastore on the remote compute and returns the folder, which is passed to the script as `--data-folder`. See documentation [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#access-datastores-during-training).
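
The script parameters are a plain dictionary of flag and value pairs, and they reach `train.py` as a flat list of command-line arguments (the run definition printed after submission shows this list). The sketch below illustrates that mapping. It is not the SDK's internal code; the helper name `to_arguments` is hypothetical, and the datastore mount is replaced by the placeholder string that appears in the run definition.

```python
def to_arguments(script_params):
    """Flatten {'--flag': value} pairs into ['--flag', 'value', ...]."""
    arguments = []
    for flag, value in script_params.items():
        arguments.extend([flag, str(value)])
    return arguments


script_params = {
    # Placeholder standing in for ds.as_mount(), as shown in the run definition.
    '--data-folder': '$AZUREML_DATAREFERENCE_workspaceblobstore',
    '--data-file': 'movielens/movielens_100k_data.pkl',
    '--top-k': 10,
    '--data-size': '100k',
}

arguments = to_arguments(script_params)
print(arguments)
```

Every value is passed as text, which is why the training script declares `type=int` for `--top-k`.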

In [29]:
script_params = {
    '--data-folder': ds.as_mount(),
    '--data-file': 'movielens/' + data_file_name,
    '--top-k': TOP_K,
    '--data-size': MOVIELENS_DATA_SIZE
}

est = Estimator(source_directory=SCRIPT_FOLDER,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                conda_packages=['pandas', 'tqdm'],
                pip_packages=['scikit-learn'],
                use_gpu=False)

### 2. Submit the job to the cluster
An [experiment](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py#experiment) is a logical container in an AzureML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments. We access an experiment from our AzureML workspace by name, which will be created if it doesn't exist.

Then, run the experiment by submitting the estimator object.

#### Monitor and view the results

You can monitor the remote run with a Jupyter widget. The widget watches the progress of the run. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

You can click the link at the bottom of the widget to view all the details of the run in the Azure Portal. 

When the run is complete, you can view the metrics with `run.get_metrics()`.
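
To help interpret two of the logged metrics, here is a minimal sketch of precision@k and recall@k for a single user, using the textbook definitions. The `reco_utils` evaluation functions work on DataFrames across all users and may differ in detail; the function names and the item IDs below are illustrative only.

```python
def precision_at_k_single_user(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user actually interacted with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k_single_user(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


# Illustrative item IDs: a ranked recommendation list and the held-out test items.
recommended = [50, 181, 172, 1]
relevant = {50, 1, 98}

precision = precision_at_k_single_user(recommended, relevant, k=4)
recall = recall_at_k_single_user(recommended, relevant, k=4)
print(precision, recall)
```

Here two of the four recommendations are relevant, and those two cover two of the user's three relevant items.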

In [30]:
# create experiment
EXPERIMENT_NAME = 'movielens-sar'
exp = Experiment(workspace=ws, name=EXPERIMENT_NAME)

run = exp.submit(config=est)
RunDetails(run).show()


In [31]:
run.wait_for_completion()

{'runId': 'movielens-sar_1556835154_3063adc6',
 'target': 'cpucluster',
 'status': 'Finalizing',
 'startTimeUtc': '2019-05-02T22:14:46.831402Z',
 'properties': {'azureml.runsource': 'experiment',
  'AzureML.DerivedImageName': 'azureml/azureml_84ff80366a174fee5d7bb70637ad1d55',
  'ContentSnapshotId': '5442c926-d519-4bb9-beca-b0188cad1308'},
 'runDefinition': {'script': 'train.py',
  'arguments': ['--data-folder',
   '$AZUREML_DATAREFERENCE_workspaceblobstore',
   '--data-file',
   'movielens/movielens_100k_data.pkl',
   '--top-k',
   '10',
   '--data-size',
   '100k'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'cpucluster',
  'dataReferences': {'workspaceblobstore': {'dataStoreName': 'workspaceblobstore',
    'mode': 'Mount',
    'pathOnDataStore': None,
    'pathOnCompute': None,
    'overwrite': False}},
  'jobName': None,
  'autoPrepareEnvironment': True,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'environment': {'nam

### Test the Model locally

We can download the model from the run and test it locally. Below we view the most popular items based on user ratings. We can also enter new data to find the most similar items.

In [32]:
from sklearn.externals import joblib
run.download_file('outputs/movielens_sar_model.pkl')
modelTest = joblib.load('movielens_sar_model.pkl')

In [33]:
modelTest.get_popularity_based_topk(top_k = 30, sort_top_k=True).join(data[['MovieId', 'Title']].drop_duplicates().set_index('MovieId'), 
                                on='MovieId', 
                                how='inner')

| | MovieId | prediction | Title |
|---|---|---|---|
| 0 | 50 | 442.0 | Star Wars (1977) |
| 1 | 258 | 389.0 | Contact (1997) |
| 2 | 100 | 386.0 | Fargo (1996) |
| 3 | 294 | 369.0 | Liar Liar (1997) |
| 4 | 181 | 363.0 | Return of the Jedi (1983) |
| 5 | 286 | 356.0 | English Patient, The (1996) |
| 6 | 288 | 355.0 | Scream (1996) |
| 7 | 1 | 354.0 | Toy Story (1995) |
| 8 | 121 | 322.0 | Independence Day (ID4) (1996) |
| 9 | 300 | 322.0 | Air Force One (1997) |


### Test out an existing webservice 

We just saw the most popular movies in the MovieLens dataset. Now let's try adding our own data and getting the most similar movies based on the similarity matrix fit during training.

You will rate 5 popular movies and then call an existing webservice trained on the 100k dataset to see what the most similar movies are.

In [34]:
%run existing-widget.ipynb

Label(value='Here are the recommended movies based on your ratings.', style=DescriptionStyle(description_width…

| | MovieId | UserId | prediction | Title |
|---|---|---|---|---|
| 0 | 50 | 0 | 4.768659 | Star Wars (1977) |
| 1 | 174 | 0 | 4.622387 | Raiders of the Lost Ark (1981) |
| 2 | 172 | 0 | 4.573981 | Empire Strikes Back, The (1980) |
| 3 | 181 | 0 | 4.544265 | Return of the Jedi (1983) |
| 4 | 79 | 0 | 4.5118 | Fugitive, The (1993) |
| 5 | 121 | 0 | 4.502742 | Independence Day (ID4) (1996) |
| 6 | 98 | 0 | 4.411063 | Silence of the Lambs, The (1991) |
| 7 | 1 | 0 | 4.405902 | Toy Story (1995) |
| 8 | 69 | 0 | 4.343179 | Forrest Gump (1994) |
| 9 | 204 | 0 | 4.331141 | Back to the Future (1985) |


# Congrats! You're finished with the lab!
You can go back through the notebook and change the data set size or switch between the CPU and GPU clusters to test out different models.

## Optional part 2: Deploy the model to a web service
Open the [deploy_with_azureml.ipynb](deploy_with_azureml.ipynb) notebook to deploy this model to a web service. 