## Semantic text search using embeddings

We can search all our reviews semantically, efficiently and at low cost, by embedding the search query and then finding the reviews whose embeddings are most similar to it. The dataset is created in the [Obtain_dataset Notebook](Obtain_dataset.ipynb).

In [2]:
import ast

import pandas as pd
import numpy as np

# If you have not run the "Obtain_dataset.ipynb" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv
datafile_path = "./data/lovdata_embedding.csv"

df = pd.read_csv(datafile_path)
# Embeddings are stored as strings in the CSV; parse them safely (no eval) into numpy arrays
df["ada_search"] = df.ada_search.apply(ast.literal_eval).apply(np.array)
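
Because a CSV file holds only text, each embedding is stored as the string form of a list and has to be parsed back into an array on load. The following is a minimal sketch of that round trip. It uses a tiny in-memory frame with made-up three-dimensional vectors; only the column name `ada_search` comes from the notebook, and everything else is illustrative.

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows with made-up 3-dimensional "embeddings".
toy = pd.DataFrame(
    {
        "Text": ["first document", "second document"],
        "ada_search": [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]],
    }
)

# Writing to CSV turns each list into its string representation.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_type = type(loaded.ada_search[0])  # the list came back as plain text

# ast.literal_eval accepts only Python literals, so it cannot execute arbitrary code.
loaded["ada_search"] = loaded.ada_search.apply(ast.literal_eval).apply(np.array)
parsed = loaded.ada_search[0]  # a numpy array again
```

Parsing once at load time keeps the later similarity computation free of string handling.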


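If your embedding model family provides separate document and query engines, embed the documents (in this case reviews) with the document engine and the queries with the query engine; `text-embedding-ada-002`, used below, serves both roles. The search itself only computes the cosine similarity between the query embedding and each document embedding, then shows the `n` best matches.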
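
To make the comparison concrete, here is a small sketch of cosine similarity and a top-n ranking written with plain NumPy. The helpers `cosine_sim` and `top_n` are hypothetical names for illustration, not the `openai` utilities, and the two-dimensional vectors are made up to stand in for real embeddings.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_n(query_vec, doc_vecs, n=2):
    """Return (index, score) pairs for the n documents most similar to the query."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:n]]


# Made-up 2-dimensional vectors standing in for real embeddings.
docs = [
    [1.0, 0.0],  # points along the x axis
    [0.0, 1.0],  # points along the y axis
    [1.0, 1.0],  # halfway between the two
]
query = [2.0, 0.1]  # almost along the x axis

best = top_n(query, docs, n=2)  # document 0 ranks first, then document 2
```

Cosine similarity ignores vector length, so the query `[2.0, 0.1]` still matches the unit vector along the x axis most closely.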

In [15]:
import os

import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

# Read the API key from the environment instead of hardcoding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

# search through the reviews for a specific product
def search_reviews(df, product_description, n=3, pprint=True):
    embedding = get_embedding(
        product_description,
        engine="text-embedding-ada-002"
    )
    df["similarities"] = df.ada_search.apply(lambda x: cosine_similarity(x, embedding))
    print(df["similarities"].max())

    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
    )
    print(res)
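    if pprint:
        # Iterate over the matched texts; iterating over the DataFrame itself yields only column names
        for r in res.Text:
            print(r[:3000])
            print()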
    return res


res = search_reviews(df, "dangerous dogs", n=3, pprint=True)
# write to file
res.to_csv("aksjer.csv", index=False)


0.7271857340107427
    Unnamed: 0                                               Text  \
28          28  ### § 7-2. _Forsvarlig virksomhet_\n\n(1) Låne...   
22          22  ### § 5-2. _Krav til ledelsen av foretaket_\n\...   
27          27  ### § 7-1. _God forretningsskikk_\n\n(1) Lånef...   

                                           ada_search  similarities  
28  [-0.01844361424446106, -0.018988747149705887, ...      0.727186  
22  [-0.0156610868871212, -0.006859060376882553, -...      0.722567  
27  [-0.006928590591996908, -0.015786826610565186,...      0.721723  
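
The `similarities` column is filled one row at a time with `apply`. For larger tables, one option is to stack the embeddings into a matrix and score every row with a single matrix product. The sketch below assumes that approach; `rank_by_similarity` is a hypothetical helper, not part of the `openai` package, and the toy vectors are made up.

```python
import numpy as np
import pandas as pd


def rank_by_similarity(df, query_vec, column="ada_search", n=3):
    """Return the n rows of df whose vectors in `column` are closest to query_vec."""
    matrix = np.vstack(df[column].tolist()).astype(float)  # shape: (rows, dims)
    query = np.asarray(query_vec, dtype=float)
    # Normalise both sides so that a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    # assign returns a new frame, so the caller's df is left unchanged.
    scored = df.assign(similarities=matrix_unit @ query_unit)
    return scored.sort_values("similarities", ascending=False).head(n)


# Made-up 2-dimensional vectors standing in for real embeddings.
toy = pd.DataFrame(
    {
        "Text": ["a", "b", "c"],
        "ada_search": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    }
)
top = rank_by_similarity(toy, [0.1, 2.0], n=2)  # rows "b" then "c"
```

Unlike `search_reviews`, this variant does not add a column to the input frame, which may be preferable when the same table is searched repeatedly.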
