<a href="https://colab.research.google.com/github/JSJeong-me/AI-Innovation-2024/blob/main/NLP/4-2-Semantic_text_search_using_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semantic text search using embeddings

We can search all our reviews semantically, efficiently and at low cost, by embedding the search query and then finding the reviews whose embeddings are most similar to it. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).

In [9]:
!pip install openai

Collecting openai
  Downloading openai-1.51.0-py3-none-any.whl.metadata (24 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.6-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.51.0-py3-none-any.whl (383 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.6-py3-none-any.whl (78 kB)

In [1]:
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1jfmC2bkgmsjC56xlOeNyrIOytxPMJP6R' -O fine_food_reviews_with_embeddings_1k.csv

--2024-10-03 02:49:27--  https://drive.google.com/uc?export=download&id=1jfmC2bkgmsjC56xlOeNyrIOytxPMJP6R
Resolving drive.google.com (drive.google.com)... 173.194.202.102, 173.194.202.113, 173.194.202.100, ...
Connecting to drive.google.com (drive.google.com)|173.194.202.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1jfmC2bkgmsjC56xlOeNyrIOytxPMJP6R&export=download [following]
--2024-10-03 02:49:27--  https://drive.usercontent.google.com/download?id=1jfmC2bkgmsjC56xlOeNyrIOytxPMJP6R&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.135.132, 2607:f8b0:400e:c0c::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.135.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35255462 (34M) [application/octet-stream]
Saving to: ‘fine_food_reviews_with_embeddings_1k.csv’


2024-10-03 02:49:32 (26.6 MB/s) -

In [2]:
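from google.colab import userdata
import os

# Read the key from Colab's secret store and expose it as an environment
# variable, which is where the OpenAI client looks for it by default.
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
api_key = os.getenv("OPENAI_API_KEY")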

In [4]:
import pandas as pd
import numpy as np
from ast import literal_eval

datafile_path = "./fine_food_reviews_with_embeddings_1k.csv"

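df = pd.read_csv(datafile_path)
# Each embedding is stored in the CSV as a string such as "[0.01, -0.02, ...]";
# parse it into a Python list, then convert it to a NumPy array.
df["embedding"] = df.embedding.apply(literal_eval).apply(np.array)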


Here we compute the cosine similarity between the query embedding and each document embedding, and show the top `n` matches.
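
To make the ranking step concrete, here is a minimal sketch with made-up 3-dimensional vectors standing in for real embeddings. The names `cosine_sim`, `docs`, `query`, and `top_n` are illustrative only and are not part of the notebook or of `embeddings_utils.py`.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (invented for illustration, not real model output)
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0]

# Rank document names by similarity to the query, best first, and keep two
ranked = sorted(docs, key=lambda name: cosine_sim(docs[name], query), reverse=True)
top_n = ranked[:2]
print(top_n)  # ['doc_a', 'doc_c']
```

Cosine similarity ignores vector length, so `query` matches `doc_a` perfectly even though it is twice as long.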

In [6]:
!mkdir -p utils

In [7]:
!wget https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/utils/embeddings_utils.py -O utils/embeddings_utils.py

--2024-10-03 02:51:04--  https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/utils/embeddings_utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8172 (8.0K) [text/plain]
Saving to: ‘utils/embeddings_utils.py’


2024-10-03 02:51:04 (75.0 MB/s) - ‘utils/embeddings_utils.py’ saved [8172/8172]



In [10]:
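from utils.embeddings_utils import get_embedding, cosine_similarity


def search_reviews(df, product_description, n=3, pprint=True):
    """Return the n reviews whose embeddings are most similar to the query."""
    # The query must be embedded with the same model that produced
    # df["embedding"]; vectors from different models are not comparable.
    product_embedding = get_embedding(
        product_description,
        model="text-embedding-3-small"
    )
    # Score every review against the query (adds or overwrites a "similarity" column)
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    # Keep the n best matches and tidy the "Title: ...; Content: ..." text
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
        .combined.str.replace("Title: ", "", regex=False)
        .str.replace("; Content:", ": ", regex=False)
    )
    if pprint:
        for r in results:
            print(r[:200])  # show only the first 200 characters
            print()
    return results


results = search_reviews(df, "delicious beans", n=3)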


Not Syrup:  The product has a good strong ginger flavor and is sweet; it is, however, not syrup. Sweet ginger water would be a more accurate description.

Yummy and Healthy:  Loved the cranberry-like flavor and slightly crunchy texture.  Worked well with wheat bread. A little on the expensive side but my kids like it too.

Dreams do come true! You can now eat Mitt Romney's bowel movement!:  Year after year I would lie awake in bed, thinking, hoping, wishing for the one day I could purchase a likeness of a major politica



In [11]:
results = search_reviews(df, "whole wheat pasta", n=3)


Good stuff:  Very good product, helps with a lot of ailments. It's great help stocking up on vitimain that you are deficient in.

Yum!:  I'll never go back to regular taco seasoning again!  I use this for tacos, taco salad, and I've heard it's even good in chili--though I haven't had a chance to try that yet.  This seasoning doe

these do the job:  I bought the CET Veggiedent chews for one of my dogs who didn't do well with the regular CET chews. He would just swallow big chunks and that was potentially dangerous. He likes the



We can search through these reviews easily. For larger collections, we can speed up the search with an algorithm designed for fast vector lookup, such as an approximate nearest-neighbor index.
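
Before reaching for a dedicated search algorithm, a vectorized exact search is often fast enough at this scale. The sketch below is not the specialized algorithm mentioned above; it is plain brute-force cosine search done with one matrix product instead of a per-row `apply`. The function name `top_n_by_cosine` and the random toy data are assumptions made for illustration.

```python
import numpy as np


def top_n_by_cosine(matrix, query, n=3):
    """Return (indices, scores) of the n rows most similar to query, best first."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize rows and query so that a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    n = min(n, len(scores))
    # argpartition selects the n largest scores without sorting all of them
    candidates = np.argpartition(-scores, n - 1)[:n]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Toy data: 1000 random 8-dimensional vectors (not real embeddings)
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(1000, 8))
toy_query = toy_matrix[42] * 3.0  # points in the same direction as row 42

idx, sims = top_n_by_cosine(toy_matrix, toy_query, n=3)
print(idx[0])  # 42
```

In this notebook, the matrix could presumably be built once with `np.vstack(df["embedding"].to_numpy())` and reused across queries.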

In [12]:
results = search_reviews(df, "bad delivery", n=1)


Good & Plenty Licorice Candy:  If you like licorice you will love this candy.  I can remember eating this candy from a box at the local movie theatre when I was a kid and it is still just as good.



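Searches like this are meant to deliver value immediately, for example by quickly surfacing reviews that describe delivery failures. The match shown above does not do that, though: it is a positive candy review. One possible cause, an assumption rather than something verified here, is that the stored embeddings were created with a different model than the `text-embedding-3-small` model used for the query; similarity scores are only meaningful when both sides use the same model.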

In [13]:
results = search_reviews(df, "spoilt", n=1)


Full- bodied without a bitter after-taste:  This is my everyday coffee choice...a good all around crowd pleaser.  Green mountain Sumatra would be my back-up-for-a-change-of-pace second choice...nice t



In [14]:
results = search_reviews(df, "pet food", n=2)


Rodeo Drive is Crazy Good Coffee!:  Rodeo Drive is my absolute favorite and I'm ready to order more!  That's if I can find it.<br />I don't know why they are discontinuing it.<br />It arrived very fas

Rodeo Drive is Crazy Good Coffee!:  Rodeo Drive is my absolute favorite and I'm ready to order more!  That's if I can find it.<br />I don't know why they are discontinuing it.<br />It arrived very fas

