# Part 4: Create an approximate nearest neighbor index for the item embeddings

This notebook is the fourth of five notebooks that guide you through running the [Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/tree/master/retail/recommendation-system/bqml-scann) solution.

Use this notebook to create an approximate nearest neighbor (ANN) index for the item embeddings by using the [ScaNN](https://github.com/google-research/google-research/tree/master/scann) framework. You create the index as a model, train the model on AI Platform Training, then export the index to Cloud Storage so that it can be used to serve ANN lookups.

Before starting this notebook, you must run the [03_create_embedding_lookup_model](03_create_embedding_lookup_model.ipynb) notebook to process the item embeddings data and export it to Cloud Storage.

After completing this notebook, run the [05_deploy_lookup_and_scann_caip](05_deploy_lookup_and_scann_caip.ipynb) notebook to deploy the solution. Once the solution is deployed, you can submit song IDs to it and get similar song recommendations in return, based on the ANN index.


## Setup

Install the ScaNN library, import the required libraries, configure the environment variables, and authenticate your GCP account.

In [None]:
!pip install -q scann

### Import libraries

In [None]:
import tensorflow as tf
import numpy as np
from datetime import datetime

### Configure GCP environment settings

Update the following variables to reflect the values for your GCP environment:

+ `PROJECT_ID`: The ID of the Google Cloud project you are using to implement this solution.
+ `BUCKET`: The name of the Cloud Storage bucket you created to use with this solution. The `BUCKET` value should be just the bucket name, so `myBucket` rather than `gs://myBucket`.
+ `REGION`: The region to use for the AI Platform Training job.
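
As a guard against the `gs://` mistake described above, a small helper can normalize the value before it is used to build Cloud Storage paths. This is only a sketch: `normalize_bucket_name` is a hypothetical helper introduced here for illustration and is not part of the solution's code.

```python
def normalize_bucket_name(bucket: str) -> str:
    """Return the bare bucket name (hypothetical helper, not part of the solution)."""
    name = bucket.strip()
    # Drop the scheme if the full URI was supplied by mistake.
    if name.startswith('gs://'):
        name = name[len('gs://'):]
    name = name.rstrip('/')
    # Anything that is empty or still contains a path separator is not a bucket name.
    if not name or '/' in name:
        raise ValueError(f'Expected a bare bucket name, got: {bucket!r}')
    return name


print(normalize_bucket_name('gs://myBucket'))  # myBucket
```

With this in place, `BUCKET = normalize_bucket_name('gs://myBucket')` would yield the same paths as `BUCKET = 'myBucket'`.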

In [None]:
PROJECT_ID = 'yourProject' # Change to your project.
BUCKET = 'yourBucketName' # Change to the bucket you created.
REGION = 'yourTrainingRegion' # Change to your AI Platform Training region.
EMBEDDING_FILES_PREFIX = f'gs://{BUCKET}/bqml/item_embeddings/embeddings-*'
OUTPUT_INDEX_DIR = f'gs://{BUCKET}/bqml/scann_index'

### Authenticate your GCP account

Authentication is required only if you run the notebook in Colab. If you use an AI Platform notebook, you should already be authenticated.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  pass  # Not running in Colab; the environment should already be authenticated.

## Build the ANN index

Use the `build` method implemented in the [indexer.py](index_builder/builder/indexer.py) module to load the embeddings from the CSV files, create the ANN index model, train it on the embedding data, and save the resulting SavedModel file to Cloud Storage. You pass the following three parameters to this method:

+ `embedding_files_path`, which specifies the Cloud Storage location from which to load the embedding vectors.
+ `num_leaves`, which provides the value for a hyperparameter that tunes the model based on the trade-off between retrieval latency and recall. A higher `num_leaves` value will use more data and provide better recall, but will also increase latency. If `num_leaves` is set to `None` or `0`, the method uses the square root of the number of items as the `num_leaves` value.
+ `output_dir`, which specifies the Cloud Storage location to write the ANN index SavedModel file to.

Other configuration options for the model are set based on the [rules-of-thumb](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md#rules-of-thumb) provided by ScaNN.
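
To get a feel for the square-root default described above, the following sketch computes it for a few item counts. The helper name `default_num_leaves` is hypothetical, and truncating to an integer with a floor of 1 is an assumption; the actual rounding used in `indexer.py` may differ.

```python
import math


def default_num_leaves(num_items: int) -> int:
    """Square root of the item count (hypothetical helper; rounding is an assumption)."""
    return max(1, int(math.sqrt(num_items)))


# Item counts chosen only for illustration.
for num_items in (10_000, 250_000, 1_000_000):
    print(num_items, default_num_leaves(num_items))
```

By the same arithmetic, an explicit `num_leaves` of 500 equals the square-root default for roughly 250,000 items.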

### Build the index locally

In [None]:
from index_builder.builder import indexer
indexer.build(EMBEDDING_FILES_PREFIX, OUTPUT_INDEX_DIR)

### Build the index using AI Platform Training

Submit an AI Platform Training job to build the ScaNN index at scale. The [index_builder](index_builder) directory contains the expected [training application packaging structure](https://cloud.google.com/ai-platform/training/docs/packaging-trainer) for submitting the AI Platform Training job.

In [None]:
if tf.io.gfile.exists(OUTPUT_INDEX_DIR):
  print("Removing {} contents...".format(OUTPUT_INDEX_DIR))
  tf.io.gfile.rmtree(OUTPUT_INDEX_DIR)

print("Creating output: {}".format(OUTPUT_INDEX_DIR))
tf.io.gfile.makedirs(OUTPUT_INDEX_DIR)

timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks_bqml_build_scann_index_{timestamp}'

!gcloud ai-platform jobs submit training {job_name} \
  --project={PROJECT_ID} \
  --region={REGION} \
  --job-dir={OUTPUT_INDEX_DIR}/jobs/ \
  --package-path=index_builder/builder \
  --module-name=builder.task \
  --config='index_builder/config.yaml' \
  --runtime-version=2.2 \
  --python-version=3.7 \
  --\
  --embedding-files-path={EMBEDDING_FILES_PREFIX} \
  --output-dir={OUTPUT_INDEX_DIR} \
  --num-leaves=500

After the AI Platform Training job finishes, check that the `scann_index` folder has been created in your Cloud Storage bucket:

In [None]:
!gsutil ls {OUTPUT_INDEX_DIR}

## Test the ANN index

Test the ANN index by using the `ScaNNMatcher` class implemented in the [index_server/matching.py](index_server/matching.py) module.

Run the following code snippets to create an item embedding from randomly generated values and pass it to `scann_matcher`, which returns the item IDs for the five items that are the approximate nearest neighbors of the embedding you submitted.

In [None]:
from index_server.matching import ScaNNMatcher
scann_matcher = ScaNNMatcher(OUTPUT_INDEX_DIR)

In [None]:
vector = np.random.rand(50)
scann_matcher.match(vector, 5)
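
To sanity-check ANN results, it can help to compare them with exact nearest neighbors computed by brute force. The sketch below uses synthetic data only: the `embeddings` and `item_ids` values are made up for illustration, `exact_top_k` is a hypothetical helper, and dot-product similarity is an assumption about the distance measure. It does not call `ScaNNMatcher`.

```python
import numpy as np


def exact_top_k(query, embeddings, item_ids, k=5):
    """Return the k item IDs with the highest dot-product score, best first."""
    scores = embeddings @ query
    # argpartition selects the k best candidates without sorting everything.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates by descending score.
    top = top[np.argsort(-scores[top])]
    return [item_ids[i] for i in top]


rng = np.random.default_rng(seed=0)
embeddings = rng.random((1000, 50))  # 1,000 synthetic items, 50 dimensions each
item_ids = [f'item_{i}' for i in range(1000)]
query = rng.random(50)

print(exact_top_k(query, embeddings, item_ids, k=5))
```

With real data, you would load the exported item embeddings in place of the synthetic arrays and check how many of the exact top five also appear in the ANN result.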

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product but sample code provided for an educational purpose**