# Semantic Matching with Gemini

**Learning Objectives**
  1. Learn how to identify matching product descriptions
  2. Learn how to design a prompt for semantic matching
  3. Learn how to evaluate the performance of a prompt for semantic matching
  4. Learn how to use Gemini with the Google Gen AI SDK
  
Semantic matching is the problem of classifying a pair of entities $(x_1, x_2)$ as a good match or not. This classification setup is very flexible: it covers general information retrieval (where, for instance, the first entity is a textual query and the second a paragraph), entity resolution, and fuzzy matching of database records. In this notebook, we focus on matching textual descriptions of retail products. More specifically:
  
  
**Use case description:** An online retail company scours the web to compare prices of products in their inventory with those offered by their competitors. Their first priority is to implement a model that compares the information on two product webpages and outputs a classification indicating whether the different product descriptions on the webpages correspond to an identical product, which we will refer to as a 'match'. We use the [Amazon-Google Products dataset](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) in the [entity resolution benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution) created by Leipzig University. 

**Model description:** This notebook illustrates how to use the Gemini API in Vertex AI to match product descriptions. The [Amazon-GoogleProducts dataset](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) contains information about products scraped from the Amazon and Google websites. It includes each product's title, description, price, and manufacturer, although this information is worded differently on the two websites. In this notebook, we focus solely on building a model that uses the product titles. The idea is straightforward: we create a prompt that asks the Gemini API whether two product titles match. Although the information about the products is limited to these titles, we still achieve an accuracy close to 100% on the test set.

**Evaluation method:** To avoid overfitting the prompt design to our dataset, we first split our dataset sample of paired descriptions into an evaluation set (evalset) of 60 examples and a test set (testset), also of 60 examples. The choice of 60 examples aligns with the current quota of 60 requests per minute for the Gemini API in Vertex AI. We then design the prompt using the evalset and report the model's accuracy on the testset. Both sets are roughly balanced.

## Setup

In [None]:
import random

import pandas as pd
from google import genai
from google.genai.types import GenerateContentConfig

pd.options.display.max_colwidth = 1000

## Exploring the dataset

We use a [dataset of product information](https://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip) scraped from Google and Amazon websites. The dataset is part of a [benchmark for semantic matching and entity resolution](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution) from Leipzig University. It contains 3 tables which are included in this repo:

```python
../data/Amazon.csv.gz
../data/GoogleProducts.csv.gz
../data/Amzon_GoogleProducts_perfectMapping.csv.gz
```

The first table contains product information listed on Amazon, including the product title, description, and manufacturer:

In [None]:
amazon = pd.read_csv("../data/Amazon.csv.gz")
amazon.columns = [
    "idAmazon",
    "amazon_title",
    "amazon_description",
    "amazon_manufacturer",
    "amazon_price",
]
amazon.head()

The second table contains the same information for products scraped from the Google website:

In [None]:
google = pd.read_csv("../data/GoogleProducts.csv.gz")
google.columns = [
    "idGoogleBase",
    "google_title",
    "google_description",
    "google_manufacturer",
    "google_price",
]
google.head()

The last table maps the IDs of entries on the two websites that correspond to the same product, even though that product may be described differently on each website:

In [None]:
matching = pd.read_csv("../data/Amzon_GoogleProducts_perfectMapping.csv.gz")
matching.head()

From this raw data, we have pre-generated for you an eval and test set:

```python
../data/product_matching_eval.csv
../data/product_matching_test.csv
```


We will use the eval set to design the prompt and the test set to evaluate the performance of the Gemini API with this prompt. This way, the performance we report will be closer to the real performance on never-before-seen pairs of product descriptions.


To generate the eval and test splits, we used the function in the cell below. It takes a sample (whose size is controlled by `SAMPLE_SIZE`) of matching product IDs from the `matching` dataframe and joins it with the product information contained in the `google` and `amazon` dataframes. It then extracts pairs of matching Google and Amazon descriptions and creates a table with columns `description_1` (Google), `description_2` (Amazon), and a target column named `match`, whose value is set to `yes` since we have only matching pairs so far.
To create pairs of non-matching descriptions, we permute the second description column while keeping the first one fixed, and concatenate this new dataframe of non-matching descriptions with the dataframe of matching descriptions. We then shuffle the resulting table and split it into two equal-sized dataframes, which we save to disk as our eval and test splits.
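
The permutation step described above can be illustrated on plain Python lists. The following is a minimal sketch, not the notebook's actual code: the titles are hypothetical, and `make_pairs` is a name introduced here for illustration. Rotating the second list by one position pairs every first description with the second description of a different row, provided there are at least two rows. It does not guard against the same product appearing twice in the sample.

```python
def make_pairs(first, second):
    """Build labeled pairs: aligned rows are matches, rotated rows are not."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally sized lists with at least two rows")
    # Positive pairs: row i of `first` with row i of `second`.
    positives = [(a, b, "yes") for a, b in zip(first, second)]
    # Negative pairs: rotate `second` by one so that row i meets row i - 1.
    rotated = second[-1:] + second[:-1]
    negatives = [(a, b, "no") for a, b in zip(first, rotated)]
    return positives + negatives


# Hypothetical titles, for illustration only.
google_titles = ["photo editor 2.0", "tax helper deluxe", "typing tutor"]
amazon_titles = ["photo editor v2", "tax helper dlx edition", "typing tutor 7"]

pairs = make_pairs(google_titles, amazon_titles)
for pair in pairs:
    print(pair)
```

Because each description appears once in a matching pair and once in a non-matching pair, the combined table is exactly balanced before it is shuffled and split.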

Observe that we only use the `title` column as the product description, so the raw data contains much more information than we use. Nevertheless, we will see that Gemini achieves an accuracy close to 100% on the test set. Remarkable!

**Note:** Uncomment the last line if you want to re-generate the eval and test set on a different sample of the data.

In [None]:
SAMPLE_SIZE = 60


def generate_test_and_eval_sets(sample_size=SAMPLE_SIZE):
    sample = matching.sample(sample_size)

    # Join the product information to the df of matching ID's
    matched_products = sample.merge(
        right=amazon, how="left", on="idAmazon"
    ).merge(right=google, how="left", on="idGoogleBase")

    google_descriptions = list(matched_products["google_title"])
    amazon_descriptions = list(matched_products["amazon_title"])

    # Create the dataframe of matching descriptions
    matching_descriptions = pd.DataFrame(
        {
            "description_1": google_descriptions,
            "description_2": amazon_descriptions,
            "match": "yes",
        }
    )

    # Create the dataframe of not matching descriptions
    amazon_descriptions_perm = [
        amazon_descriptions[i - 1] for i in range(len(amazon_descriptions))
    ]
    not_matching_descriptions = pd.DataFrame(
        {
            "description_1": google_descriptions,
            "description_2": amazon_descriptions_perm,
            "match": "no",
        }
    )

    full_dataset = pd.concat(
        [matching_descriptions, not_matching_descriptions], axis=0
    ).sample(len(matching_descriptions) * 2)

    evalset = full_dataset[:sample_size]
    testset = full_dataset[sample_size : 2 * sample_size]
    evalset.to_csv("../data/product_matching_eval.csv", index=None)
    testset.to_csv("../data/product_matching_test.csv", index=None)


# Uncomment to generate a different data sample
# generate_test_and_eval_sets()

The next cell loads the eval and test datasets that we pre-generated. The two CSV files each contain 60 examples of product description pairs. Each pair is labeled with a `match` value of `yes` if the descriptions describe the same product and `no` otherwise. `description_1` comes from the product title on Google, while `description_2` comes from the product title on Amazon.

In [None]:
evalset = pd.read_csv("../data/product_matching_eval.csv")
testset = pd.read_csv("../data/product_matching_test.csv")

Let's have a quick look at a few entries in this dataset:

In [None]:
evalset.head()

To check that both splits are roughly balanced, we count the number of instances of each class in each split.

In [None]:
evalset.match.value_counts()

In [None]:
testset.match.value_counts()

# Model implementation

We start by instantiating our client using the Gen AI SDK. We'll use the `gemini-2.0-flash-001` version of Gemini, a large language model (LLM) developed by Google.

In [None]:
MODEL = "gemini-2.0-flash-001"

client = genai.Client(vertexai=True, location="us-central1")

Using this client, we implement in the next cell a simple function that takes two product descriptions and a parameterized prompt as input, and outputs `yes` or `no` depending on whether the Gemini model thinks the product descriptions are matching.

### Exercise

Complete the function below so that it queries Gemini using the Gen AI SDK client.

**Hint:** Jump to the cell after next to see how this function is used.
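
To make the shape of this function concrete, here is a minimal offline sketch rather than the reference solution. The `generate` parameter is a hypothetical stand-in for the model call so that the example runs without credentials, and the prompt wording is invented for illustration. In the notebook, the model call would instead go through `client`, for instance with `client.models.generate_content(model=MODEL, contents=...)`, whose response exposes the generated text as `.text`.

```python
def are_products_matching_sketch(d1, d2, prompt, generate):
    """Fill the prompt template and return the model's stripped answer."""
    # Substitute the descriptions into the {desc1} and {desc2} placeholders.
    filled_prompt = prompt.format(desc1=d1, desc2=d2)
    # `generate` is a stand-in (assumption) for the real model call.
    answer_text = generate(filled_prompt)
    return answer_text.strip()


# Hypothetical prompt wording, for illustration only.
EXAMPLE_PROMPT = (
    "Do these two descriptions refer to the same product? "
    "Answer with yes or no only.\n"
    "Description 1: {desc1}\n"
    "Description 2: {desc2}\n"
    "Answer:"
)


def fake_generate(filled_prompt):
    """Offline stub: always answers 'yes' surrounded by stray whitespace."""
    return " yes\n"


result = are_products_matching_sketch(
    "photo editor 2.0", "photo editor v2", EXAMPLE_PROMPT, fake_generate
)
print(result)
```

Note that `str.format` treats every brace pair in the template as a placeholder, so a prompt that needs literal braces must double them (`{{` and `}}`).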

In [None]:
def are_products_matching(d1, d2, prompt):
    prompt = None  # TODO: Substitute d1 and d2 in the parametrized prompt
    answer = None  # TODO: Call the Gemini API with the prompt
    return answer.text.strip()

The next cell allows us to rapidly test on the evaluation set whether a given prompt seems to be working for this use case.

### Exercise

In the cell below, write a prompt instructing Gemini to answer `yes` if two product descriptions are matching and `no` otherwise.
Make sure to parametrize your prompt with `{desc1}` and `{desc2}` so that different product descriptions `desc1` and `desc2` can be
substituted at run time.

In [None]:
PROMPT = """
# TODO
"""

index = random.randint(0, len(evalset) - 1)

d1 = evalset.iloc[index]["description_1"]
d2 = evalset.iloc[index]["description_2"]
ground_truth = evalset.iloc[index]["match"]
prediction = are_products_matching(d1, d2, prompt=PROMPT)


print(
    f"""
Are the following two descriptions describing the same product?

Description 1: {d1}

Description 2: {d2}

MODEL ANSWER: {prediction}
GROUND TRUTH: {ground_truth}
"""
)

# Model Analysis

We now analyze the performance of our model on the test set.

Large language models may sometimes output something other than "yes" or "no," even if we ask them politely to do so. This could be due to safety filters being triggered or the model simply not understanding the question. Therefore, we first need to determine the proportion of requests that our model fails to answer. In this case, it is around 6%, which may be acceptable. Further prompt engineering or model tuning could help to reduce this number.

The second aspect to consider is the performance of the model on the requests
that succeeded (i.e., those for which the output was actually `yes` or `no`).
Since the test set is roughly balanced, accuracy is a meaningful metric; here it is 98%. This means that only a single example in the test set received a prediction that was different from the ground truth.

## Scoring the test set

To simplify evaluation, we implement a function in the next cell that adds a `prediction` column to our `testset`. This column will contain the predictions received from the Gemini API.
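
Scoring a whole dataset means one request per row, so the quota of 60 requests per minute mentioned earlier matters. The sketch below, written under stated assumptions, shows one way to score rows while pacing the calls. Rows are plain dictionaries rather than a dataframe, `score_rows` and its parameters are hypothetical names, and `predict` is a stub standing in for the Gemini-backed function. A fixed pause between calls is a simple pacing scheme, not a precise rate limiter.

```python
import time


def score_rows(rows, predict, max_per_minute=60, sleep=time.sleep):
    """Return copies of `rows` with a 'prediction' key, pacing the calls."""
    min_interval = 60.0 / max_per_minute  # seconds between two requests
    scored = []
    for i, row in enumerate(rows):
        if i > 0:
            sleep(min_interval)  # wait before every request except the first
        prediction = predict(row["description_1"], row["description_2"])
        scored.append({**row, "prediction": prediction})
    return scored


# Hypothetical rows and a stub predictor so that the example runs offline.
rows = [
    {"description_1": "photo editor 2.0", "description_2": "photo editor v2", "match": "yes"},
    {"description_1": "photo editor 2.0", "description_2": "typing tutor 7", "match": "no"},
]
waits = []
scored = score_rows(
    rows,
    predict=lambda d1, d2: "yes" if d1.split()[0] == d2.split()[0] else "no",
    sleep=waits.append,  # record the pauses instead of actually sleeping
)
print(scored)
print(waits)
```

Passing `sleep` as a parameter keeps the example fast and testable; with the default `time.sleep`, the loop would pause one second between requests.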

### Exercise

Complete the function below so that it adds a column `prediction` to the input dataframe `dataset` containing the Gemini predictions using the prompt you created. 


In [None]:
def apply_prompt(prompt, dataset):
    scored_dataset = dataset.copy()
    # TODO: Update the line below so that the LLM predictions are
    # stored in the `prediction` column of the dataframe `scored_dataset`
    scored_dataset["prediction"] = None
    return scored_dataset

Let's apply this function to our `testset` using our simple prompt that we designed on the `evalset`:

In [None]:
scored_testset = apply_prompt(prompt=PROMPT, dataset=testset)

Since requests to the Gemini API are limited and capped to 60 requests per minute, we save our scoring to disk so that we can analyze it offline if needed.

In [None]:
scored_testset.to_csv("scored_testset.csv", index=False)

Here are the predictions of our model. We can see that most examples are classified correctly, although some requests failed, resulting in empty predictions. We will need to analyze these cases separately and compute the accuracy only for the requests that succeeded:

In [None]:
scored_testset.head(10)

### Failed predictions

The cell below counts the failed predictions, that is, predictions that are anything other than `yes` or `no`. There are several possible causes for such behavior. Gemini, like all LLMs, has been trained to predict the most likely next word in a sequence of words. Therefore, although our prompt explicitly instructs Gemini to answer `yes` or `no`, certain product descriptions may confuse it, resulting in a different output. Another possibility is that the language in the product descriptions triggers a safety filter, which then replaces Gemini's raw answer with a standard warning text. These filters can be triggered by words that evoke, for instance, health or medical issues, or violence, among many [other safety settings](https://developers.generativeai.google/guide/safety_setting).
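
The notion of a failed prediction described above can be sketched on a plain Python list before translating it to a dataframe. This is an illustrative sketch with hypothetical model outputs; `failed_mask` and `VALID_ANSWERS` are names introduced here. On a pandas Series, `Series.isin` offers the same kind of membership test.

```python
VALID_ANSWERS = {"yes", "no"}


def failed_mask(predictions):
    """Return one boolean per prediction: True when it is not 'yes' or 'no'."""
    return [prediction not in VALID_ANSWERS for prediction in predictions]


# Hypothetical model outputs, for illustration only.
predictions = ["yes", "no", "", "I cannot answer that.", "no", None]
mask = failed_mask(predictions)
failure_rate = sum(mask) / len(mask)
print(mask)
print(f"Proportion of failed requests: {failure_rate:.1%}")
```

Empty strings and missing values count as failures here, which mirrors the empty predictions that show up when a request fails.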

### Exercise

In the cell below, write the code that creates a boolean Pandas series
`failed_predictions_mask` that contains `True` if the Gemini API prediction stored in `scored_testset` failed and `False` otherwise.


In [None]:
failed_predictions_mask = None  # TODO

failed_predictions_mask.value_counts()

In [None]:
proportion_of_failed_predictions = failed_predictions_mask.sum() / len(
    failed_predictions_mask
)
print(
    "Proportion of failed requests:",
    round(proportion_of_failed_predictions * 100, 1),
    "%",
)

The next cell examines the failed requests. Some of the terms may have triggered safety filters, but this would require further investigation:

In [None]:
scored_testset[failed_predictions_mask]

## Accuracy on succeeded predictions

Let us now compute the model accuracy on the requests that succeeded. First, we remove all the failed requests from the test set:

In [None]:
scored_testset_with_predictions = scored_testset[~failed_predictions_mask]

Then, we compute the number of correct answers:

### Exercise

In the cell below, compute the accuracy of the Gemini API on the predictions that succeeded.

**Note:** The accuracy of our model is the number of correct answers divided by the number of predictions that succeeded. 
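
The definition in the note can be checked on a small example. The sketch below uses hypothetical predictions and labels in plain lists, and assumes the failed requests have already been removed; `accuracy` is a helper name introduced here, not part of the notebook.

```python
def accuracy(predictions, ground_truth):
    """Fraction of positions where prediction and ground truth agree."""
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(predictions)


# Hypothetical succeeded predictions and labels, for illustration only.
predictions = ["yes", "no", "no", "yes"]
ground_truth = ["yes", "no", "yes", "yes"]
print(accuracy(predictions, ground_truth))  # 3 correct out of 4 -> 0.75
```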


In [None]:
# TODO: Compute the model accuracy

### Exercise

In the cell below, extract the incorrect predictions from `scored_testset_with_predictions` to inspect them.


In [None]:
# Extract the incorrect prediction for inspection

Copyright 2023 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License