# Part 4: Building an Approximate Nearest Neighbor Index for the Item Embeddings using ScaNN and AI Platform Training

This tutorial shows how to use the Matrix Factorization algorithm in BigQuery ML to generate embeddings for items based on their co-occurrence statistics. The generated item embeddings can then be used to find similar items.

Part 4 covers building an approximate nearest neighbor index for the embeddings
using ScaNN and AI Platform Training. The built ScaNN index is then stored in Cloud Storage.
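
To make the goal concrete, the following is a minimal sketch of the exact (brute-force) search that an approximate index is meant to speed up. It assumes dot-product similarity and uses made-up toy embeddings; `exact_top_k` is a hypothetical helper for illustration, not part of this tutorial's code.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_top_k(query, embeddings, k):
    """Return the ids of the k items with the highest dot product, best first.

    `embeddings` maps an item id to its vector. Every item is scored, so the
    cost grows linearly with the number of items.
    """
    scored = ((dot(query, vec), item_id) for item_id, vec in embeddings.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Toy data (assumption): three 2-dimensional item embeddings.
toy_embeddings = {
    "item_a": [1.0, 0.0],
    "item_b": [0.8, 0.6],
    "item_c": [0.0, 1.0],
}

print(exact_top_k([1.0, 0.1], toy_embeddings, 2))
```

Because exact search scores every item for every query, it becomes expensive for large catalogs. An approximate index avoids the full scan at the cost of occasionally missing a true neighbor.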


## Setup

In [None]:
!pip install -q scann

### Import libraries

In [None]:
import tensorflow as tf
import numpy as np
from datetime import datetime

### Configure GCP environment settings

In [None]:
PROJECT_ID = 'ksalama-cloudml' # Change to your project.
BUCKET = 'ksalama-cloudml' # Change to your bucket.
REGION = 'europe-west2' # Change to your AI Platform training region.
EMBEDDING_FILES_PREFIX = f'gs://{BUCKET}/bqml/item_embeddings/embeddings-*'
OUTPUT_INDEX_DIR = f'gs://{BUCKET}/bqml/scann_index'

### Authenticate your GCP account
This is required if you run the notebook in Colab.

In [None]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  pass  # Not running in Colab; skip authentication.

## Build ScaNN Index Locally

In [None]:
from index_builder.builder import indexer
indexer.build(EMBEDDING_FILES_PREFIX, OUTPUT_INDEX_DIR)

## Build ScaNN Index using AI Platform Training

In [None]:
if tf.io.gfile.exists(OUTPUT_INDEX_DIR):
  print("Removing {} contents...".format(OUTPUT_INDEX_DIR))
  tf.io.gfile.rmtree(OUTPUT_INDEX_DIR)

print("Creating output: {}".format(OUTPUT_INDEX_DIR))
tf.io.gfile.makedirs(OUTPUT_INDEX_DIR)

timestamp = datetime.utcnow().strftime('%y%m%d%H%M%S')
job_name = f'ks_bqml_build_scann_index_{timestamp}'

!gcloud ai-platform jobs submit training {job_name} \
  --project={PROJECT_ID} \
  --region={REGION} \
  --job-dir={OUTPUT_INDEX_DIR}/jobs/ \
  --package-path=index_builder/builder \
  --module-name=builder.task \
  --config='index_builder/config.yaml' \
  --runtime-version=2.2 \
  --python-version=3.7 \
  -- \
  --embedding-files-path={EMBEDDING_FILES_PREFIX} \
  --output-dir={OUTPUT_INDEX_DIR} \
  --num-leaves=500
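
The job name above is built from a prefix and a UTC timestamp. As a hedged sketch, the same construction can be wrapped in a helper that also checks the name before submission. The validation rule used here (starts with a letter; only letters, digits, and underscores) is an assumption about AI Platform job IDs, and `make_job_name` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Assumed rule: a job ID starts with a letter and contains only
# letters, digits, and underscores.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_job_name(prefix, now=None):
    """Build '<prefix>_<yymmddHHMMSS>' from a UTC timestamp and validate it."""
    now = now or datetime.now(timezone.utc)
    name = f"{prefix}_{now.strftime('%y%m%d%H%M%S')}"
    if not JOB_ID_PATTERN.match(name):
        raise ValueError(f"Invalid job name: {name}")
    return name


print(make_job_name("ks_bqml_build_scann_index"))
```

Including the timestamp keeps job names unique across repeated submissions, and validating the name locally surfaces a bad prefix before the submit command is run.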

After the AI Platform Training job finishes, you can check that the built ScaNN index is stored in Cloud Storage:

In [None]:
!gsutil ls {OUTPUT_INDEX_DIR}

## Test the ScaNN Index API

In [None]:
from index_server.matching import ScaNNMatcher
scann_matcher = ScaNNMatcher(OUTPUT_INDEX_DIR)

In [None]:
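# Random query vector; its length (50) is assumed to match the embedding dimension.
vector = np.random.rand(50)
# Retrieve the 5 approximate nearest neighbors of the query.
scann_matcher.match(vector, 5)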
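
A random query only confirms that the index loads and responds. To judge the quality of an approximate index, a common check is recall@k against exact results. The sketch below assumes both searches yield a ranked list of item identifiers (the return format of `scann_matcher.match` is not shown here); `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k items that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    # Set intersection so duplicate ids in the approximate list are not double-counted.
    hits = len(set(approx_ids[:k]) & exact_top)
    return hits / len(exact_top)


# Toy example (assumption): the approximate search missed one true neighbor.
approx = ["item_1", "item_2", "item_9"]
exact = ["item_1", "item_2", "item_3"]
print(recall_at_k(approx, exact, 3))
```

Averaging this value over a sample of queries gives a single number for comparing index settings such as the number of leaves.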

## License

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

See the License for the specific language governing permissions and limitations under the License.

**This is not an official Google product, but sample code provided for educational purposes.**