In [2]:
# Change this to the name of your own S3 bucket
s3Bucket = "meal-recommendation-634815960608"

# Replace with your own Pinecone API key (get one at app.pinecone.io)
api_key = "YOUR_PINECONE_API_KEY"

model_name = "multilingual-e5-large"
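
Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` is hypothetical, and using `PINECONE_API_KEY` as the variable name is an assumption about how your environment is set up:

```python
import os

def load_api_key(env=os.environ, var_name="PINECONE_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing or empty."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook"
        )
    return key
```

With the variable set, `api_key = load_api_key()` would replace the literal assignment above.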

# Semantic Search

In this proof of concept (POC), the preprocessing step takes the curated menus and loads them into a vector store (Pinecone) for semantic search. To begin, we install the prerequisite libraries:

In [4]:
!pip install -qU \
  "pinecone-client[grpc]"==5.0.1 \
  datasets==2.14.6 \
  sentence-transformers==2.2.2 \
  pinecone-plugin-inference

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pathos 0.3.2 requires dill>=0.3.8, but you have dill 0.3.7 which is incompatible.
pathos 0.3.2 requires multiprocess>=0.70.16, but you have multiprocess 0.70.15 which is incompatible.

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preprocessing

The dataset preparation process requires a few steps:

1. We download the menu dataset from the S3 bucket.

2. The text content of the dataset is embedded into vectors.

3. We reformat into a `(id, vector, metadata)` structure to be added to Pinecone.
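
To make the target shape concrete, here is a minimal sketch of steps 2 and 3 on a single shortened sample row. The helpers `build_passage`, `fake_embed`, and `to_records` are hypothetical names, and `fake_embed` is only a stand-in for a real embedding model; it returns meaningless numbers of a fixed length.

```python
def build_passage(row):
    """Join the text fields of one menu row into a single string to embed."""
    fields = [
        row["NAME"],
        row["GENRES"],
        row["GENRE_L2"],
        row["GENRE_L3"],
        str(row["PRODUCT_DESCRIPTION"]).replace("|", " "),
    ]
    return " ".join(str(f) for f in fields)

def fake_embed(text, dim=4):
    """Hypothetical stand-in for an embedding model: returns `dim` floats."""
    return [float(len(text) % (i + 2)) for i in range(dim)]

def to_records(rows, embed=fake_embed):
    """Turn menu rows into (id, vector, metadata) tuples."""
    return [
        (str(row["ITEM_ID"]), embed(build_passage(row)), {"name": row["NAME"]})
        for row in rows
    ]

# Shortened sample row (the description is abbreviated for illustration)
rows = [{
    "ITEM_ID": 8632,
    "NAME": "Pesto cod with orzo and vegetables",
    "GENRES": "cod",
    "GENRE_L2": "orzo",
    "GENRE_L3": "zucchini",
    "PRODUCT_DESCRIPTION": "basil|parsley|parmesan",
}]
records = to_records(rows)
```

Each element of `records` is a tuple of a string ID, a list of floats, and a metadata dictionary.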


In [5]:
import boto3
import pandas as pd
from io import StringIO

s3Key = "data/curated/menus.csv"

# Create an S3 client
s3 = boto3.client('s3')

# Read the CSV file from S3
obj = s3.get_object(Bucket=s3Bucket, Key=s3Key)
body = obj['Body'].read().decode('utf-8')

# Load the CSV content into a pandas DataFrame
dataset = pd.read_csv(StringIO(body))

# Print the number of rows in the dataset
print(f"Total number of menus loaded: {len(dataset)}")



Total number of menus loaded: 211


We remove duplicate menus, keeping the first occurrence of each `NAME`:

In [6]:
dataset_deduped = dataset.drop_duplicates(subset=['NAME'], keep='first')
# Print the number of rows after removing duplicates
print(f"Number of menus after removing duplicates: {len(dataset_deduped)}")

Number of menus after removing duplicates: 211


Display the first five menus:

In [7]:
dataset[:5]

Unnamed: 0,ITEM_ID,PRICE,CREATION_TIMESTAMP,NAME,GENRES,GENRE_L2,GENRE_L3,PRODUCT_DESCRIPTION,CONTENT_CLASSIFICATION,OTHERIDS
0,8627,0.0,1711843200,Grilled chicken teriyaki bowl,chicken,noodle,vegetable,soy sauce|garlic|ginger|sugar|oil|sesame seed|...,Asian,8627|8627|9024|9024
1,8628,0.0,1711843200,Pekinstyle grilled tofu bowl,tofu,rice,bell pepper,cherry|sugar|soy sauce|maltose|vinegar|corn st...,Asian,8628|8630|8629|8628|8630|8629
2,8631,0.0,1711843200,Thaiinspired shrimp with rice vermicelli and s...,shrimp,rice vermicelli,bell pepper,cabbage|carrot|corn|bok choy|green onion|sesam...,Thai,8631|8631
3,8632,0.0,1711843200,Pesto cod with orzo and vegetables,cod,orzo,zucchini,basil|parsley|parmesan|canola oil|red wine vin...,Mediterranean,8632|8632
4,8634,0.0,1711843200,Roasted salmon cubes sweet corn chowder,salmon,rice,corn,canola oil|salt|paprika|milk|white wine|cream|...,Healthy,8634|8635|8633|8634|8635|8633|9342|9343|9341|9...


With our menus ready to go we can move on to steps **2** and **3** above.

## Creating an Index

Now the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [8]:
import os
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where the index will be deployed.

In [9]:
from pinecone import ServerlessSpec

cloud = 'aws'
region = 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index called `menu-search`. The index `dimension` and `metric` parameters must match the embedding model: `multilingual-e5-large` produces 1024-dimensional vectors, and we use the `cosine` metric.

In [10]:
index_name = 'menu-search'

In [11]:
import time

# check if the index already exists (it shouldn't on the first run)
if index_name not in pc.list_indexes().names():
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=1024,
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
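
Before upserting, it can help to confirm that every embedding has the length the index expects (1024 here). The following is a small sketch with toy vectors; `check_dimensions` is a hypothetical helper, and the 384-length vector stands in for output from a different model:

```python
def check_dimensions(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, v in enumerate(vectors) if len(v) != expected_dim]

index_dim = 1024
toy_vectors = [[0.0] * 1024, [0.0] * 384]  # second vector has the wrong length
bad = check_dimensions(toy_vectors, index_dim)
print(bad)  # [1]
```

An empty list means every vector matches the index dimension.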

Now we upsert the data in batches of `25`.
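
As a quick illustration of how the batch boundaries work out, the sketch below computes the `(start, end)` slices for 211 rows. The helper name `batch_ranges` is hypothetical; the arithmetic mirrors the `range` and `min` pattern used in the upsert loop:

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks of batch_size."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(211, 25))
print(len(ranges))  # 9
print(ranges[-1])   # (200, 211)
```

The last batch is shorter than the others (11 rows), which is why the end index is clamped with `min`.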

In [12]:
from tqdm.auto import tqdm

batch_size = 25
vector_limit = 100000

menus = dataset_deduped[:vector_limit]

for i in tqdm(range(0, len(menus), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(menus))
    # create IDs batch
    ids = menus['ITEM_ID'][i:i_end].astype(str).tolist()
    # create metadata batch
    metadatas = [{'name': name} for name in menus['NAME'][i:i_end]]


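    # Concatenate the text columns into one passage per menu.
    # str() avoids an AttributeError when a description is missing (NaN).
    inputs = menus[i:i_end].apply(
        lambda row: " ".join([
            str(row['NAME']), str(row['GENRES']), str(row['GENRE_L2']), str(row['GENRE_L3']),
            str(row['PRODUCT_DESCRIPTION']).replace('|', ' '),
        ]),
        axis=1,
    ).tolist()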
    
    # create embeddings
    xc = pc.inference.embed(
        model=model_name,
        inputs=inputs,
        parameters={
            "input_type": "passage",
            "truncate": "END"
        }
    ).data
    embeddings = [item['values'] for item in xc]
    # create records list for upsert
    records = zip(ids, embeddings, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)


## Making Queries

Now that the index is populated, we can begin making queries. Because we are running a semantic search for *similar menus*, we embed the query text with the same model (using `input_type` set to `"query"`) and search the index with the resulting vector. Let's begin.

In [21]:
query = "ceviche tuna"

# create the query vector
xq = pc.inference.embed(
    model=model_name,
    inputs=[query],
    parameters={
        "input_type": "query",
        "truncate": "END"
    }
).data[0]['values']

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['name']}")

0.79: Shish taouk grilled chicken breast  garlic sauce
0.79: Steakspiced grilled chicken breast  pepper sauce
0.78: Grilled pork fillet with marsala sauce and mushroom risotto
0.78: Smoked meat poutine
0.78: Cacciatorestyle chicken with mushroom risotto
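
Conceptually, a query against a `cosine` index ranks the stored vectors by cosine similarity to the query vector and returns the top `k`. The sketch below shows that ranking in plain Python with toy 3-dimensional vectors (not real embeddings); `cosine` and `top_k` are hypothetical helpers, not part of the Pinecone client, and a real index does not scan every vector like this:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, stored, k):
    """Return the k (score, name) pairs with the highest cosine similarity."""
    scored = [(cosine(query, vec), name) for name, vec in stored.items()]
    return sorted(scored, reverse=True)[:k]

# Toy vectors keyed by made-up names
stored = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.7, 0.7, 0.0],
    "c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], stored, k=2)
for score, name in results:
    print(f"{round(score, 2)}: {name}")
```

Here `a` ranks first because it points in nearly the same direction as the query, while `c` is orthogonal to it and drops out of the top two.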


---