In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Matching Engine using Text-Based Product Embeddings

<table align="left">

  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/matching_engine/similar_products_recommender_using_matching_engine.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/matching_engine/similar_products_recommender_using_matching_engine.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This example demonstrates how to use the Vertex AI Matching Engine's Approximate Nearest Neighbor (ANN) service to create a similar-products recommender. The ANN service is a high-scale, low-latency solution for finding similar vectors (or more specifically, "embeddings") in a large corpus. It is a fully managed offering, which reduces operational overhead. It is built upon [Approximate Nearest Neighbor (ANN) technology](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) developed by Google Research.

### Objective

In this notebook, you learn how to generate embeddings for tabular data, create an Approximate Nearest Neighbor (ANN) index, query the index, and validate its performance.

This tutorial uses the following Google Cloud ML services and resources:

- VPC network
- BigQuery
- Vertex AI Matching Engine's Approximate Nearest Neighbor (ANN) service

The steps performed include:

* Generate embeddings for the data
* Create ANN index and brute force index
* Create an index endpoint
* Deploy ANN index and brute force index
* Perform online query
* Compute recall


### Dataset

The dataset used for this tutorial is the [theLook eCommerce](https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce).

The dataset contains information about customers, products, orders, logistics, web events and digital marketing campaigns. The contents of this dataset are synthetic, and are provided to industry practitioners for the purpose of product discovery, testing, and evaluation.


### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* VPC network
* Cloud Storage



Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [VPC network pricing](https://cloud.google.com/vpc/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.


## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager).

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API, Compute Engine API, and Service Networking API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,servicenetworking.googleapis.com).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already authenticated.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

- In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

- Click **Create service account**.

- In the **Service account name** field, enter a name, and click **Create**.

- In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

- Click Create. A JSON file that contains your key downloads to your local environment.

- Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if IS_COLAB:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS '[your-service-account-key-path]'

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

* **Prepare a VPC network**. To avoid network overhead that might add unnecessary latency, it is best to call the ANN endpoints from your VPC via a direct [VPC Peering](https://cloud.google.com/vertex-ai/docs/general/vpc-peering) connection.
  * The following section describes how to set up a VPC Peering connection if you don't have one.
  * This is a one-time initial setup task. You can also reuse an existing VPC network and skip this section.

In [None]:
VPC_NETWORK = "[your-vpc-network-name]"  # @param {type:"string"}

PEERING_RANGE_NAME = "ann-haystack-range"

In [None]:
import os

# Remove the if condition to run the encapsulated code
if not os.getenv("IS_TESTING"):
    # Create a VPC network
    ! gcloud compute networks create {VPC_NETWORK} --bgp-routing-mode=regional --subnet-mode=auto --project={PROJECT_ID}

    # Add necessary firewall rules
    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-icmp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow icmp

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-internal --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow all --source-ranges 10.128.0.0/9

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-rdp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:3389

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-ssh --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:22

    # Reserve IP range
    ! gcloud compute addresses create {PEERING_RANGE_NAME} --global --prefix-length=16 --network={VPC_NETWORK} --purpose=VPC_PEERING --project={PROJECT_ID} --description="peering range"

    # Set up peering with service networking
    # Your account must have the "Compute Network Admin" role to run the following.
    ! gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network={VPC_NETWORK} --ranges={PEERING_RANGE_NAME} --project={PROJECT_ID}

* Authentication: Rerun the `gcloud auth login` command in the Vertex AI Workbench notebook terminal when you are logged out and need the credential again.

## Make sure the following cells are run from inside the VPC network that you created in the previous step

* **WARNING:** The `MatchingEngineIndexEndpoint.match` method (used to run online queries against your deployed index) must be executed in a Vertex AI Workbench notebook instance that meets the following requirements:
  * **It is in the same region as your ANN service** (for example, if you set `REGION = "us-central1"` as in this tutorial, the notebook instance must be in `us-central1`).
  * **It uses the VPC network you created for the ANN service** (instead of the "default" one). Create the VPC network as described above, and then create a new notebook instance that uses that VPC.
  * If you run it in a Vertex AI Workbench notebook instance in a different VPC network or region, the "Create Online Queries" section fails.
  
## Installation

Install the following packages required to execute this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"
    
! pip3 install --upgrade --quiet {USER_FLAG} tensorflow-gpu \
                                    tensorflow-hub \
                                    pandas-gbq \
                                    'google-cloud-bigquery[bqstorage,pandas]' \
                                    matplotlib \
                                    google-cloud-aiplatform \
                                    grpcio-tools \
                                    protobuf==3.19.6

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

**Note: You may get a message saying "Your session crashed for an unknown reason". This is expected. Once this cell has finished running, continue on. You do not need to re-run any of the cells above.**

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)

In [None]:
REGION = "[your-region]"
if REGION == "[your-region]":
    REGION = "us-central1"

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a uuid for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex AI SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

### Import libraries and define constants

In [None]:
import math
import os

import google.cloud.aiplatform as aiplatform
import matplotlib.pyplot as plt
import tensorflow_hub as hub
from google.cloud.bigquery import Client

Initialize the Vertex SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

Initialize the BigQuery client.

In [None]:
client = Client(project=PROJECT_ID)

Use gcloud to retrieve the project number.

In [None]:
shell_output = ! gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("Project Number:", PROJECT_NUMBER)

PARENT = "projects/{}/locations/{}".format(PROJECT_ID, REGION)

print("PROJECT_ID: {}".format(PROJECT_ID))
print("REGION: {}".format(REGION))

!gcloud config set project {PROJECT_ID} --quiet
!gcloud config set ai_platform/region {REGION} --quiet

## Explore Data
View the data

In [None]:
query = """
SELECT * FROM `bigquery-public-data.thelook_ecommerce.products`
"""
df = client.query(query).to_dataframe()

In [None]:
df.head(5)

Check data types and null counts

In [None]:
df.info()

The current dataset doesn't have any null or empty fields in it.

Select the required columns.

In [None]:
required_columns = ["category", "name", "brand", "department"]

Separate categorical columns

In [None]:
categorical_columns = ["category", "department"]

In [None]:
for i in categorical_columns:
    df[i].value_counts().plot(kind="bar")
    plt.title(i)
    plt.show()

## Create embeddings for the data

The Vertex AI Matching Engine's ANN service operates on embeddings when it returns nearest neighbours. Here you convert each row in the dataframe to a sentence, and then convert each sentence to an embedding using the Universal Sentence Encoder. One way to create sentences is to concatenate fields from the dataset.

Convert each row to a sentence by concatenating the required columns, separating each column value with a space.
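
As a minimal sketch of this concatenation step, the snippet below uses plain dictionaries as hypothetical stand-ins for dataframe rows (the `rows` values and the `row_to_sentence` helper are illustrative assumptions, not part of the dataset or the notebook). It skips missing values and keeps only the first product ID seen for each distinct sentence.

```python
from typing import Dict, List, Optional

# Hypothetical rows standing in for the product dataframe; None marks a missing value.
rows: List[Dict[str, Optional[object]]] = [
    {"id": 1, "category": "Swim", "name": "Board Shorts", "brand": "Acme", "department": "Men"},
    {"id": 2, "category": "Swim", "name": None, "brand": "Acme", "department": "Men"},
    {"id": 3, "category": "Swim", "name": "Board Shorts", "brand": "Acme", "department": "Men"},
]
columns = ["category", "name", "brand", "department"]


def row_to_sentence(row: Dict[str, Optional[object]], columns: List[str]) -> str:
    """Join the non-missing column values with single spaces."""
    return " ".join(str(row[c]) for c in columns if row[c] is not None)


sentence_to_id: Dict[str, object] = {}
for row in rows:
    sentence = row_to_sentence(row, columns)
    # Skip empty sentences and sentences that were already seen.
    if sentence and sentence not in sentence_to_id:
        sentence_to_id[sentence] = row["id"]

sentences = list(sentence_to_id)  # insertion order is preserved
print(sentences)
```

Using `" ".join(...)` avoids having to strip a trailing space afterwards.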

In [None]:
sentences_list = []  # list to store one sentence per row

In [None]:
sentence_to_product_id = {}
for i in range(len(df["id"])):
    sentence = ""
    for column in required_columns:
        if df[column][i] is not None:
            sentence = sentence + df[column][i] + " "
    if len(sentence) == 0:
        continue

    sentence = sentence[0:-1]  # remove last space
    if sentence not in sentence_to_product_id:  # remove duplicate sentences
        sentence_to_product_id[sentence] = df["id"][i]
        sentences_list.append(sentence)

### Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The pre-trained Universal Sentence Encoder is publicly available on [TensorFlow Hub](https://tfhub.dev/).

Load the model:

In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

Generate embeddings for the sentences
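
Calling the encoder once per sentence is simple, but for a large catalog it may be faster to pass sentences in batches. The sketch below shows only the batching pattern. `fake_encoder` is a hypothetical stand-in so that the snippet runs without TensorFlow; with the TF Hub model, the encoder call would presumably look like `model(batch).numpy().tolist()`, which is an assumption you should verify in your environment.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fake_encoder(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for the sentence encoder: one 2-d vector per sentence."""
    return [[float(len(s)), float(s.count(" "))] for s in batch]


def embed_all(
    sentences: Sequence[str],
    encoder: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Encode all sentences batch by batch, preserving the input order."""
    embeddings: List[List[float]] = []
    for batch in batched(sentences, batch_size):
        embeddings.extend(encoder(batch))
    return embeddings


toy_sentences = ["Swim Shorts", "Men Jeans Blue", "Socks"]
toy_embeddings = embed_all(toy_sentences, fake_encoder)
print(len(toy_embeddings))
```

Because the output order matches the input order, the position of each embedding can still serve as its ID later on.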

In [None]:
sentence_embeddings = []
for sent in sentences_list:
    sentence_embeddings.append(
        list(model([sent])[0].numpy())
    )  # model(sentence)->gives a tensor->convert it to a numpy array, then convert it to a list

View the embeddings

In [None]:
sentence_embeddings[0]

In [None]:
# The number of nearest neighbors to be retrieved from database for each query.
NUM_NEIGHBOURS = 1000
for embedding in sentence_embeddings[:5]:
    print(embedding)

Save the data in JSONL format.
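
Each line of the file is one JSON object with a string `id` and an `embedding` array. As a hedged illustration of that record shape, the sketch below builds the same kind of lines with the standard `json` module and writes them to an in-memory buffer; the toy vectors and the `buffer` object are assumptions for demonstration only.

```python
import io
import json

# Toy vectors standing in for the real sentence embeddings.
toy_embeddings = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]

buffer = io.StringIO()  # stands in for the output file
for index, embedding in enumerate(toy_embeddings):
    # Cast to float so NumPy scalar types (if any) serialize cleanly.
    record = {"id": str(index), "embedding": [float(x) for x in embedding]}
    buffer.write(json.dumps(record) + "\n")

lines = buffer.getvalue().splitlines()
parsed = [json.loads(line) for line in lines]
print(lines[0])
```

Letting `json.dumps` build each line avoids hand-assembling quotes and commas, and reading the lines back with `json.loads` is a quick check that the file is well formed.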


In [None]:
with open("item.json", "w") as f:
    index = 0
    for embedding in sentence_embeddings:
        f.write('{"id":"' + str(index) + '",')
        f.write('"embedding":[' + ",".join(str(x) for x in embedding) + "]}")
        f.write("\n")
        index += 1

Upload the data to GCS.

In [None]:
EMBEDDINGS_INITIAL_URI = f"{BUCKET_URI}/matching_engine/initial/"
! gsutil cp item.json {EMBEDDINGS_INITIAL_URI}

## Create Indexes

An ANN index is a collection of vectors deployed together for similarity search. Vectors can be added to an index or removed from an index. Similarity search queries are issued to a specific index and search over the vectors in that index.


First, create an ANN index and feed your embeddings to it.


### Create ANN Index (for Production Usage)

In [None]:
DIMENSIONS = 512
DISPLAY_NAME = "item"
DISPLAY_NAME_BRUTE_FORCE = DISPLAY_NAME + "_brute_force"

Create the ANN index configuration:

Read the documentation to understand the various configuration parameters that can be used to tune the index.


Now, create an index by passing the following arguments:

- `display_name`: The display name of the Index.
- `contents_delta_uri`: Allows inserting, updating  or deleting the contents of the Matching Engine Index.
The string must be a valid Google Cloud Storage directory path.
- `dimensions`: The number of dimensions of the input vectors. 
- `approximate_neighbors_count`(Optional): The default number of neighbors to find via approximate search before exact reordering is performed. Exact reordering is a procedure where results returned by an approximate search algorithm are reordered via a more expensive distance computation.
- `distance_measure_type`: The distance measure used in nearest neighbor search.
- `leaf_node_embedding_count`: Number of embeddings on each leaf node. The default value is 1000 if not set.
- `leaf_nodes_to_search_percent`: The default percentage of leaf nodes that may be searched for any query.
- `description`: The description of the Index.
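
To make the idea of "approximate search followed by exact reordering" more concrete, here is a toy two-stage search. It is only an illustration of the general pattern and is not how Matching Engine is implemented: the cheap first stage (scoring on the first `coarse_dims` dimensions only) and every name in the snippet are assumptions made for this sketch.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    approximate_count: int,
    num_neighbors: int,
    coarse_dims: int = 1,
) -> List[int]:
    # Stage 1 (approximate): rank every vector with a cheap, lossy score
    # and keep only the best `approximate_count` candidates.
    coarse_order = sorted(
        range(len(vectors)),
        key=lambda i: dot(query[:coarse_dims], vectors[i][:coarse_dims]),
        reverse=True,
    )
    candidates = coarse_order[:approximate_count]
    # Stage 2 (exact reordering): rescore the candidates with the full dot product.
    reranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return reranked[:num_neighbors]


vectors = [[1.0, 0.0], [0.9, 0.9], [0.8, -1.0], [-1.0, 1.0]]
query = [1.0, 1.0]

wide = two_stage_search(query, vectors, approximate_count=3, num_neighbors=2)
narrow = two_stage_search(query, vectors, approximate_count=1, num_neighbors=2)
print(wide, narrow)
```

With three candidates, the exact reordering promotes vector 1 to the top. With a single candidate, the true nearest neighbor never reaches the second stage, which is the kind of recall loss that a larger candidate count is meant to reduce.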

In [None]:
tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_INITIAL_URI,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Item ANN index",
)

In [None]:
INDEX_RESOURCE_NAME = tree_ah_index.resource_name
INDEX_RESOURCE_NAME

Using the resource name, you can retrieve an existing MatchingEngineIndex.

In [None]:
tree_ah_index = aiplatform.MatchingEngineIndex(index_name=INDEX_RESOURCE_NAME)

### Create Brute Force Index (for Ground Truth)

The brute force index uses a naive brute force method to find the nearest neighbors. This method is not fast or efficient. Hence brute force indices are not recommended for production usage. They are to be used to find the "ground truth" set of neighbors, so that the "ground truth" set can be used to measure recall of the indices being tuned for production usage. To ensure apples to apples comparison, the `distanceMeasureType` and `dimensions` of the brute force index should match those of the production indices being tuned.
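
Conceptually, a brute force search scores the query against every stored vector and keeps the best results. The following sketch illustrates that idea on toy data with a plain dot product as the similarity score; the function and variable names are assumptions for illustration, not the service's implementation.

```python
import heapq
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_neighbors(
    query: Sequence[float], vectors: List[Sequence[float]], num_neighbors: int
) -> List[int]:
    """Return the indices of the highest-scoring vectors, best first."""
    return heapq.nlargest(
        num_neighbors, range(len(vectors)), key=lambda i: dot(query, vectors[i])
    )


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.2]

ground_truth = brute_force_neighbors(toy_query, toy_vectors, num_neighbors=2)
print(ground_truth)
```

The cost grows linearly with the number of stored vectors for every query, which is why this approach serves as a reference for recall rather than as a production index.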

Create the brute force index by passing the following arguments:

- `display_name`: The display name of the Index.
- `contents_delta_uri`: Allows inserting, updating  or deleting the contents of the Matching Engine Index.
The string must be a valid Google Cloud Storage directory path.
- `dimensions`: The number of dimensions of the input vectors. 
- `distance_measure_type`: The distance measure used in nearest neighbor search.
- `description`: The description of the Index.

In [None]:
brute_force_index = aiplatform.MatchingEngineIndex.create_brute_force_index(
    display_name=DISPLAY_NAME_BRUTE_FORCE,
    contents_delta_uri=EMBEDDINGS_INITIAL_URI,
    dimensions=DIMENSIONS,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    description="Item index (brute force)",
)

In [None]:
INDEX_BRUTE_FORCE_RESOURCE_NAME = brute_force_index.resource_name
INDEX_BRUTE_FORCE_RESOURCE_NAME

In [None]:
brute_force_index = aiplatform.MatchingEngineIndex(
    index_name=INDEX_BRUTE_FORCE_RESOURCE_NAME
)

## Create an Index Endpoint with VPC network

You have to create an index endpoint and deploy the index to the endpoint. You can then call the match service through the endpoint.

In [None]:
VPC_NETWORK = "[your-network-name]"
VPC_NETWORK_FULL = "projects/{}/global/networks/{}".format(PROJECT_NUMBER, VPC_NETWORK)
VPC_NETWORK_FULL

Now, create an index endpoint by passing the following arguments:

- `display_name`: The display name of the IndexEndpoint.
- `network`: The full name of the Google Compute Engine [network](https://cloud.google.com/compute/docs/networks-and-firewalls#networks) to which the index endpoint should be peered. Private services access must already be configured for the
network. If left unspecified, the Endpoint is not peered with any network.

In [None]:
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="index_endpoint",
    network=VPC_NETWORK_FULL,
)

In [None]:
INDEX_ENDPOINT_NAME = my_index_endpoint.resource_name
INDEX_ENDPOINT_NAME

## Deploy Indexes

### Deploy ANN Index

In [None]:
DEPLOYED_INDEX_ID = f"tree_ah_item_deployed_{UUID}"

Deploy ANN index to the endpoint by passing the following arguments:

- `index`: Index which is to be deployed.
- `deployed_index_id`: The user specified ID of the deployed index.

In [None]:
my_index_endpoint = my_index_endpoint.deploy_index(
    index=tree_ah_index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes

### Deploy Brute Force Index

In [None]:
DEPLOYED_BRUTE_FORCE_INDEX_ID = f"item_brute_force_deployed_{UUID}"

Deploy Brute Force index to the endpoint by passing the following arguments:

- `index`: Index which is to be deployed.
- `deployed_index_id`: The user specified ID of the deployed index.

In [None]:
my_index_endpoint = my_index_endpoint.deploy_index(
    index=brute_force_index, deployed_index_id=DEPLOYED_BRUTE_FORCE_INDEX_ID
)

my_index_endpoint.deployed_indexes

## Create Online Queries

After you have built and deployed your indexes, you can query a deployed index through the online querying gRPC API (Match Service) from virtual machine instances in the same region (for example, `us-central1` in this tutorial).

### Give top 10 recommendations for a user search query

In [None]:
sentence = "Men's Swim Wear"

Generate embedding by passing a sentence to the model

In [None]:
query_vec = list(model([sentence])[0].numpy())  # convert tensor to a list

Convert list to a 2d list

In [None]:
query_vec = [query_vec]

#### Test query

##### Response
The response from the `match()` call is a Python list with the following structure:

1. The first item in the list is a list of nearest neighbours in sorted order; the first neighbour is the nearest.
2. Each item in that list is a `google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchNeighbor` object, which contains an `id` attribute (the ID you assigned to the embedding earlier) and a `distance` attribute (the dot product distance).
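
The nested structure can be explored without a deployed endpoint. In the sketch below, `FakeNeighbor` is a hypothetical stand-in that carries only the two attributes described above, and the response values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeNeighbor:
    """Hypothetical stand-in for MatchNeighbor with only `id` and `distance`."""

    id: str
    distance: float


# One inner list per query; neighbors are ordered nearest first.
fake_response: List[List[FakeNeighbor]] = [
    [FakeNeighbor("2", 0.91), FakeNeighbor("0", 0.75), FakeNeighbor("1", 0.40)],
]
toy_sentences = ["Swim Shorts", "Wool Socks", "Swim Trunks"]

# The id is a string, so convert it to int before indexing the sentence list.
top_two: List[Tuple[str, float]] = [
    (toy_sentences[int(neighbor.id)], neighbor.distance)
    for neighbor in fake_response[0][:2]
]
print(top_two)
```

Because the IDs were written as list positions when the embeddings were saved, `int(neighbor.id)` maps each neighbor back to its sentence.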

In [None]:
response_for_search_query = my_index_endpoint.match(
    deployed_index_id=DEPLOYED_INDEX_ID, queries=query_vec, num_neighbors=NUM_NEIGHBOURS
)

response_for_search_query

Store similar product IDs in a list

In [None]:
results_id_list = []
for i in response_for_search_query[0][:10]:
    sent = sentences_list[int(i.id)]
    print(sent)
    results_id_list.append(sentence_to_product_id[sent])

Print product IDs

In [None]:
results_id_list

View top 10 similar products

In [None]:
df.set_index("id").loc[results_id_list].reset_index()

#### Recommend almost similar products 

Store words of current sentence in a list

In [None]:
words_present = []
for word in sentence.split():
    if "'" in word:
        word = word.split("'")[
            0
        ]  # remove single quotes from a word and take singular word
    if word not in words_present:  # remove duplicate words
        words_present.append(word)

View unique words present in the query sentence

In [None]:
words_present

Calculate length of the list

In [None]:
no_words_present_given_sentence = len(words_present)

##### Perform this process for every sentence of the response
- For every sentence, count how many of its words match the words of the query sentence.
- If the number of matched words is more than 25% and at most 50% of the number of query words (both thresholds rounded down), append the ID of the product associated with the sentence.
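
The rule above can also be written as a small, testable function. This is a sketch under the same thresholds, with helper names (`normalize_words`, `matched_word_count`, `is_almost_similar`) and candidate sentences that are assumptions for illustration.

```python
import math
from typing import List


def normalize_words(sentence: str) -> List[str]:
    """Split on whitespace, keep the part before any apostrophe, drop duplicates."""
    words: List[str] = []
    for word in sentence.split():
        word = word.split("'")[0]
        if word not in words:
            words.append(word)
    return words


def matched_word_count(candidate: str, query_words: List[str]) -> int:
    """Count candidate tokens (duplicates included) that appear in query_words."""
    return sum(1 for w in candidate.split() if w.split("'")[0] in query_words)


def is_almost_similar(candidate: str, query_words: List[str]) -> bool:
    """True when floor(n / 4) < matched <= floor(n / 2), n = number of query words."""
    n = len(query_words)
    matched = matched_word_count(candidate, query_words)
    return math.floor(n / 4) < matched <= math.floor(n / 2)


query_words = normalize_words("Men's Swim Wear")
print(query_words)
print(is_almost_similar("Swim Trunks Acme Women", query_words))
```

For a three-word query the thresholds round down to 0 and 1, so a candidate qualifies only when exactly one of its words matches. Short queries therefore make the filter quite strict.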

In [None]:
results_id_list = []  # list to store the IDs of almost similar products
count = 0
for i in response_for_search_query[0]:
    words_matched_in_current_sentence = 0
    for word in sentences_list[int(i.id)].split():
        if "'" in word:
            word = word.split("'")[0]
        if word in words_present:
            words_matched_in_current_sentence += 1
    # if a sentence has a 25% to 50% match with the words of the given sentence
    if words_matched_in_current_sentence <= math.floor(
        no_words_present_given_sentence / 2
    ) and words_matched_in_current_sentence > math.floor(
        no_words_present_given_sentence / 4
    ):
        count += 1
        sent = sentences_list[int(i.id)]
        results_id_list.append(sentence_to_product_id[sent])
    if count == 10:
        break

View top 10 almost similar products

In [None]:
df.set_index("id").loc[results_id_list].reset_index()

### Compute Recall

Use the deployed brute force index as the ground truth to calculate the recall of the ANN index. Note that you can run multiple queries in a single match call.
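
Recall here is the fraction of ground-truth neighbors that the ANN index also returned, summed over all queries. A minimal sketch on invented ID lists (the `recall_against_ground_truth` helper and the toy data are assumptions for illustration):

```python
from typing import List


def recall_against_ground_truth(
    approx_ids: List[List[str]], exact_ids: List[List[str]]
) -> float:
    """Share of ground-truth neighbor IDs that also appear in the approximate results."""
    recalled = 0
    total = 0
    for approx, exact in zip(approx_ids, exact_ids):
        recalled += len(set(approx) & set(exact))
        total += len(exact)
    return recalled / total if total else 0.0


# Two toy queries with three neighbors each.
approx_results = [["1", "2", "3"], ["4", "5", "6"]]
exact_results = [["1", "2", "4"], ["4", "5", "6"]]

toy_recall = recall_against_ground_truth(approx_results, exact_results)
print(round(toy_recall, 3))
```

The first query recovers two of its three true neighbors and the second recovers all three, so five of six ground-truth neighbors are recalled.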

In [None]:
# Retrieve nearest neighbors for both the tree-AH index and the brute-force index
tree_ah_response_test = my_index_endpoint.match(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=sentence_embeddings[:1000],
    num_neighbors=NUM_NEIGHBOURS,
)
brute_force_response_test = my_index_endpoint.match(
    deployed_index_id=DEPLOYED_BRUTE_FORCE_INDEX_ID,
    queries=sentence_embeddings[:1000],
    num_neighbors=NUM_NEIGHBOURS,
)

In [None]:
# Calculate recall by determining how many neighbors were correctly retrieved as compared to the brute-force option.
recalled_neighbors = 0
for tree_ah_neighbors, brute_force_neighbors in zip(
    tree_ah_response_test, brute_force_response_test
):
    tree_ah_neighbor_ids = [neighbor.id for neighbor in tree_ah_neighbors]
    brute_force_neighbor_ids = [neighbor.id for neighbor in brute_force_neighbors]

    recalled_neighbors += len(
        set(tree_ah_neighbor_ids).intersection(brute_force_neighbor_ids)
    )

recall = recalled_neighbors / len(
    [neighbor for neighbors in brute_force_response_test for neighbor in neighbors]
)

print("Recall: {}".format(recall))

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.
You can also manually delete resources that you created by running the following code.

In [None]:
# Force undeployment of indexes and delete endpoint
my_index_endpoint.delete(force=True)

# Delete indexes
tree_ah_index.delete()
brute_force_index.delete()

delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI