# Semantic Matching with Matching Engine and PaLM

[COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge)

## Setup

In [233]:
import os
import json

from IPython import display
import pandas as pd
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

In [121]:
REGION = "us-central1"
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
BUCKET = f"{PROJECT}-matching"

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

In [123]:
!gsutil ls gs://{BUCKET} || gsutil mb -l {REGION} gs://{BUCKET}

## Loading the data

In [145]:
metadata = pd.read_csv('../data/metadata_cord19_sample.csv')
metadata.head()

Unnamed: 0,title,abstract,url
0,Clinical features of culture-proven Mycoplasma...,OBJECTIVE: This retrospective chart review des...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
1,Nitric oxide: a pro-inflammatory mediator in l...,Inflammatory diseases of the respiratory tract...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
2,Surfactant protein-D and pulmonary host defense,Surfactant protein-D (SP-D) participates in th...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
3,Role of endothelin-1 in lung disease,Endothelin-1 (ET-1) is a 21 amino acid peptide...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
4,Gene expression in epithelial cells in respons...,Respiratory syncytial virus (RSV) and pneumoni...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...


## Creating the embeddings

In [8]:
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [50]:
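MAX_BATCH_SIZE = 5  # number of abstracts sent per embedding request
vectors = []

for i in range(0, len(metadata), MAX_BATCH_SIZE):
    # Pass a plain list of strings rather than a pandas Series
    batch = metadata.abstract[i: i + MAX_BATCH_SIZE].tolist()
    embeddings = model.get_embeddings(batch)
    vectors.extend([embedding.values for embedding in embeddings])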
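
The loop above sends the abstracts in slices of `MAX_BATCH_SIZE`, presumably because the embedding endpoint caps the number of texts per request. The slicing pattern can be checked offline with a minimal sketch; `batched`, `embed_all`, and `fake_embed` are hypothetical helpers written for this illustration, and `fake_embed` merely stands in for `model.get_embeddings`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,
) -> List[List[float]]:
    """Embed every text, calling `embed_batch` once per slice."""
    all_vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        all_vectors.extend(embed_batch(batch))
    return all_vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for model.get_embeddings: one 2-d vector per text.
    return [[float(len(text)), 1.0] for text in batch]


demo_texts = [f"abstract {n}" for n in range(12)]
demo_batch_sizes = [len(batch) for batch in batched(demo_texts, 5)]
demo_vectors = embed_all(demo_texts, fake_embed)
```

With 12 texts and a batch size of 5, the last request carries only the 2 remaining texts, and the output keeps one vector per input in the original order.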

## Creating the matching engine input file

In [99]:
embeddings_file_path = "cord19_embeddings.json"

# Opening in 'w' mode overwrites any existing embeddings file
with open(embeddings_file_path, 'w') as embeddings_file:
    for i, embedding in enumerate(vectors):
        # One JSON record per line; the id is the row position in `metadata`
        json_line = json.dumps(
            {
                "id": str(i),
                "embedding": embedding
            }
        ) + '\n'
        embeddings_file.write(json_line)
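
The input file holds one JSON object per line, each with an `id` and an `embedding` field. Before uploading a large file, it can help to confirm that every line parses, that ids are unique, and that all vectors share one dimension. The following is a minimal sketch of such a check; `check_embeddings_file` is a hypothetical helper, and the toy file is written to a temporary directory so the notebook's real file is not touched.

```python
import json
import os
import tempfile


def check_embeddings_file(path):
    """Return (record_count, dimension); raise ValueError on a malformed file."""
    count = 0
    dimension = None
    seen_ids = set()
    with open(path) as embeddings_file:
        for line_number, line in enumerate(embeddings_file, start=1):
            record = json.loads(line)
            if "id" not in record or "embedding" not in record:
                raise ValueError(f"line {line_number}: missing 'id' or 'embedding'")
            if record["id"] in seen_ids:
                raise ValueError(f"line {line_number}: duplicate id {record['id']!r}")
            seen_ids.add(record["id"])
            if dimension is None:
                # The first record fixes the expected vector length.
                dimension = len(record["embedding"])
            elif len(record["embedding"]) != dimension:
                raise ValueError(f"line {line_number}: expected {dimension} values")
            count += 1
    return count, dimension


# Toy file with three 2-dimensional vectors, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_embeddings.json")
    with open(toy_path, "w") as toy_file:
        for i, vector in enumerate([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]):
            toy_file.write(json.dumps({"id": str(i), "embedding": vector}) + "\n")
    toy_summary = check_embeddings_file(toy_path)
```

In the notebook, `check_embeddings_file(embeddings_file_path)` should return the number of abstracts and the same dimension that is later passed to the index as `len(vectors[0])`.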

In [132]:
EMBEDDINGS_URI = f"gs://{BUCKET}"

!gsutil cp {embeddings_file_path} {EMBEDDINGS_URI}

Copying file://cord19_embeddings.json [Content-Type=application/json]...
- [1 files][ 64.9 MiB/ 64.9 MiB]                                                
Operation completed over 1 objects/64.9 MiB.                                     


## Creating the matching engine index

In [None]:
DISPLAY_NAME = "Cord19 Palm Embeddings"

tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_URI,
    dimensions=len(vectors[0]),
    approximate_neighbors_count=150,
    distance_measure_type="COSINE_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description=DISPLAY_NAME,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/115851500182/locations/us-central1/indexes/1364511522256060416/operations/4091745317752406016


In [135]:
INDEX_RESOURCE_NAME = tree_ah_index.resource_name

print(INDEX_RESOURCE_NAME)

projects/115851500182/locations/us-central1/indexes/1364511522256060416


In [136]:
tree_ah_index = aiplatform.MatchingEngineIndex(index_name=INDEX_RESOURCE_NAME)

In [138]:
DISPLAY_NAME = "Cord19 Palm Embeddings"

In [139]:
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DISPLAY_NAME,
    description=DISPLAY_NAME,
    public_endpoint_enabled=True,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/115851500182/locations/us-central1/indexEndpoints/8633532427064573952/operations/4340569197164625920
MatchingEngineIndexEndpoint created. Resource name: projects/115851500182/locations/us-central1/indexEndpoints/8633532427064573952
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/115851500182/locations/us-central1/indexEndpoints/8633532427064573952')


In [143]:
DEPLOYED_INDEX_ID = "cord19_deployed_index_id_unique"

In [None]:
my_index_endpoint = my_index_endpoint.deploy_index(
    index=tree_ah_index, deployed_index_id=DEPLOYED_INDEX_ID
)

my_index_endpoint.deployed_indexes

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/115851500182/locations/us-central1/indexEndpoints/8633532427064573952
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/115851500182/locations/us-central1/indexEndpoints/8633532427064573952/operations/6332849082322649088


## Querying Matching Engine

In [253]:
QUERY = "SARS-CoV-2"

text_embeddings = [
    vector.values 
    for vector in model.get_embeddings([QUERY])
]

In [254]:
# Define number of neighbors to return
NUM_NEIGHBORS = 20

response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=text_embeddings,
    num_neighbors=NUM_NEIGHBORS,
)

response

[[MatchNeighbor(id='3682', distance=0.29735422134399414),
  MatchNeighbor(id='2675', distance=0.2989482283592224),
  MatchNeighbor(id='1797', distance=0.3012887239456177),
  MatchNeighbor(id='1596', distance=0.302188515663147),
  MatchNeighbor(id='791', distance=0.3024110794067383),
  MatchNeighbor(id='2237', distance=0.305936336517334),
  MatchNeighbor(id='2082', distance=0.3060073256492615),
  MatchNeighbor(id='2138', distance=0.3063894510269165),
  MatchNeighbor(id='1109', distance=0.30717170238494873),
  MatchNeighbor(id='2366', distance=0.30803924798965454),
  MatchNeighbor(id='2471', distance=0.30810898542404175),
  MatchNeighbor(id='137', distance=0.30931782722473145),
  MatchNeighbor(id='2154', distance=0.31020474433898926),
  MatchNeighbor(id='717', distance=0.313531756401062),
  MatchNeighbor(id='3049', distance=0.3139423131942749),
  MatchNeighbor(id='2295', distance=0.31397467851638794),
  MatchNeighbor(id='3419', distance=0.3156825304031372),
  MatchNeighbor(id='3708', dis

In [255]:
matched_ids = [int(match.id) for match in response[0]]
matched_distances = [match.distance for match in response[0]]
matched_titles = [metadata.title[i] for i in matched_ids]
matched_abstracts = [metadata.abstract[i] for i in matched_ids]
matched_urls = [metadata.url[i] for i in matched_ids]

matches = pd.DataFrame({
    "distance": matched_distances,
    "title": matched_titles,
    "abstract": matched_abstracts,
    "url": matched_urls
})
matches

Unnamed: 0,distance,title,abstract,url
0,0.297354,Characterization of Host and Bacterial Contrib...,Influenza viruses are a threat to global publi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
1,0.298948,Coinfection and Mortality in Pneumonia-Related...,BACKGROUND: Pneumonia is the leading risk fact...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
2,0.301289,Investigation of Pathogenesis of H1N1 Influenz...,Swine influenza virus and Streptococcus suis a...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
3,0.302189,Genetic diversity and molecular epidemiology o...,BACKGROUND: Rhinoviruses (RV) are a well-estab...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
4,0.302411,Serious Invasive Saffold Virus Infections in C...,The first human virus in the genus Cardiovirus...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
5,0.305936,Respiratory Syncytial Virus whole-genome seque...,Respiratory Syncytial Virus (RSV) is responsib...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
6,0.306007,Real-Time Reverse Transcription PCR Assay for ...,"Senecavirus A (SV-A), formerly, Seneca Valley ...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
7,0.306389,Classical Swine Fever Virus vs. Classical Swin...,Two groups with three wild boars each were use...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
8,0.307172,Clinical Characteristics and Outcomes in Hospi...,BACKGROUND: The clinical consequences of co-in...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
9,0.308039,First report of human salivirus/klassevirus in...,Adenovirus is a leading cause of respiratory i...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
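
The index was created with `distance_measure_type="COSINE_DISTANCE"`, so smaller values in the `distance` column mean closer matches. Because the tree-AH index is approximate, one way to sanity-check its results is an exact brute-force search over the same vectors. The sketch below assumes the reported distance corresponds to 1 minus the cosine similarity; `cosine_distance` and `brute_force_neighbors` are hypothetical helpers, shown here on toy 2-dimensional vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_neighbors(query, candidates, num_neighbors):
    """Return the `num_neighbors` closest (index, distance) pairs, nearest first."""
    scored = [
        (cosine_distance(query, candidate), index)
        for index, candidate in enumerate(candidates)
    ]
    scored.sort()
    return [(index, distance) for distance, index in scored[:num_neighbors]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_neighbors = brute_force_neighbors([1.0, 0.0], toy_vectors, 2)
```

In the notebook, `brute_force_neighbors(text_embeddings[0], vectors, NUM_NEIGHBORS)` could be compared against the ids in `response[0]`; a large overlap would suggest the approximate search is recalling most of the true nearest neighbors.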


In [256]:
html = "<html><body><ol>"
for i in range(len(matches)):
    html += f"""            
    <li> 
        <article>
            <header>
                <a href="{matches.url[i]}"> <h2>{matches.title[i]}</h2></a>
            </header>
            <p>{matches.abstract[i]}</p>
        </article>
    </li>
    """
html += "</ol></body></html>"
display.HTML(html)
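
The cell above interpolates titles and abstracts directly into markup, so characters such as `<` or `&` in the text could break the rendered page. A minimal sketch of the same rendering with escaping, using the standard library's `html.escape`; `render_matches` is a hypothetical helper and the sample row is made up for illustration.

```python
import html as html_lib


def render_matches(rows):
    """Build an HTML ordered list from dicts with 'url', 'title', 'abstract'."""
    items = []
    for row in rows:
        # Escape every interpolated value so it is treated as text, not markup.
        url = html_lib.escape(row["url"], quote=True)
        title = html_lib.escape(row["title"])
        abstract = html_lib.escape(row["abstract"])
        items.append(
            f'<li><article><header><a href="{url}"><h2>{title}</h2></a></header>'
            f"<p>{abstract}</p></article></li>"
        )
    return "<ol>" + "".join(items) + "</ol>"


sample_rows = [
    {
        "url": "https://example.com/?a=1&b=2",
        "title": "IL-6 <high> in RSV",
        "abstract": "x & y",
    }
]
sample_html = render_matches(sample_rows)
```

In the notebook, the same function could be called as `display.HTML(render_matches(matches.to_dict("records")))`.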

## Cleaning Up

In [None]:
my_index_endpoint.delete(force=True)
tree_ah_index.delete()

Copyright 2023 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.