## Semantic search using embeddings

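We can search all of the reviews semantically, very efficiently and at low cost, by embedding the search query and then finding the reviews whose embeddings are most similar to it.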

In [None]:
%pip install pandas
%pip install numpy

In [4]:
import pandas as pd
import numpy as np
from ast import literal_eval

datafile_path = "./archive/fine_food_reviews_with_embeddings_1k.csv"

df = pd.read_csv(datafile_path)
df["embedding"] = df.embedding.apply(literal_eval).apply(np.array)


Here we compute the cosine similarity between the query embedding and each document embedding, and show the top `n` best matches.
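As a rough illustration of that ranking step, the sketch below scores three made-up 3-dimensional vectors against a made-up query vector and sorts them by cosine similarity. The vectors and the helper name `cosine_sim` are hypothetical; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "document" vectors (made up for illustration, not real embeddings)
docs = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
    "in between": [1.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

# Score every document against the query and rank from most to least similar
scores = {name: cosine_sim(vec, query) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

Only direction matters here, not magnitude: the "same direction" vector has length 2 yet scores 1.0 against the unit-length query.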

In [9]:
# Import the helper that requests an embedding for a piece of text
from utils.embeddings_utils import get_embedding


def cosine_similarity(a, b):
    # Defined locally with NumPy because the local utils.embeddings_utils
    # module does not export a cosine_similarity function
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def search_reviews(df, product_description, n=3, pprint=True):
    """
    Find the reviews in the dataset that are most similar to a product description.

    Parameters:
    df: pandas.DataFrame, the dataset of review texts and their embeddings.
    product_description: str, the description to embed and search with.
    n: int, default 3, the number of most similar reviews to return.
    pprint: bool, default True, whether to print the most similar reviews.

    Returns:
    pandas.Series of strings, each combining a review's title and content.
    """
    # Embed the product description
    product_embedding = get_embedding(
        product_description,
        model="text-embedding-3-small"
    )

    # Cosine similarity between each review embedding and the description embedding
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    # Sort by similarity and keep the title and content of the n closest reviews
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
        .combined.str.replace("Title: ", "")
        .str.replace("; Content:", ": ")
    )

    # Optionally print the first 200 characters of each match
    if pprint:
        for r in results:
            print(r[:200])
            print()

    return results

# Search for the 3 reviews most similar to "delicious beans"
results = search_reviews(df, "delicious beans", n=3)

In [4]:
results = search_reviews(df, "whole wheat pasta", n=3)


Tasty and Quick Pasta:  Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara.  I just wish there was more of it.  If you aren't starving or on a 

sooo good:  tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.

Bland and vaguely gamy tasting, skip this one:  As far as prepared dinner kits go, "Barilla Whole Grain Mezze Penne with Tomato and Basil Sauce" just did not do it for me...and this is coming from a p



These reviews are easy to search. For larger datasets, the computation can be sped up with an algorithm designed for fast search over embeddings, such as an approximate nearest-neighbor index.
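One inexpensive speed-up that needs no extra library is to replace the per-row `apply` with a single matrix operation. The sketch below is an exact brute-force search, not a specialized index; the function name `top_n_by_cosine` and the random stand-in vectors are assumptions made for illustration.

```python
import numpy as np

def top_n_by_cosine(matrix, query, n=3):
    """Return (indices, scores) of the n rows of `matrix` most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Dividing the dot products by both norms yields cosine similarity per row
    row_norms = np.linalg.norm(matrix, axis=1)
    scores = (matrix @ query) / (row_norms * np.linalg.norm(query))
    n = min(n, len(scores))
    # argpartition selects the n largest without a full sort; then sort only those n
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Random stand-in data (hypothetical; real rows would come from df.embedding)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
query = embeddings[42] + 0.01 * rng.normal(size=64)  # near-copy of row 42

indices, sims = top_n_by_cosine(embeddings, query, n=3)
print(indices[0])  # row 42 should rank first
```

With the notebook's data, the matrix could be built once with `np.vstack(df["embedding"].to_list())` and reused across queries.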

In [5]:
results = search_reviews(df, "bad delivery", n=1)


great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa



As this shows, semantic search can deliver value immediately: here it quickly surfaces a review that describes a delivery failure.

In [6]:
results = search_reviews(df, "spoilt", n=1)


Disappointed:  The metal cover has severely disformed. And most of the cookies inside have been crushed into small pieces. Shopping experience is awful. I'll never buy it online again.



In [7]:
results = search_reviews(df, "pet food", n=2)


Great food!:  I wanted a food for a a dog with skin problems. His skin greatly improved with the switch, though he still itches some.  He loves the food. No recalls, American made with American ingred

Great food!:  I wanted a food for a a dog with skin problems. His skin greatly improved with the switch, though he still itches some.  He loves the food. No recalls, American made with American ingred

