## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single embedding vector.

To run this notebook, you will need to install: pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

In [1]:
# imports
import pandas as pd
import openai
from openai.embeddings_utils import get_embedding

openai.api_key = 'Your api key here'  # replace with your key, or set it in your environment per the README


In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191


In [3]:
# load & inspect dataset
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)


Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


In [4]:
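pd.options.mode.chained_assignment = None  # https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
import re

def normalize_text(s, sep_token=" \n "):
    """Clean up whitespace and stray punctuation in the input text s (sep_token is unused)."""
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+', ' ', s).strip()
    # drop any character followed by " ," (the "." is unescaped, so it matches any character)
    s = re.sub(r". ,", "", s)
    # replace each non-overlapping ".." with ".", so "..." becomes ".."
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    # newlines were already collapsed above, so this is a no-op safeguard
    s = s.replace("\n", "")
    s = s.strip()

    return s

df['combined_norm'] = df["combined"].apply(normalize_text)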

In [5]:
df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,combined_norm
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,Title: where does one start..and stop.. with a...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,Title: Arrived in pieces; Content: Not pleased...
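
To see what the normalization does to a single string, the steps of `normalize_text` can be run on a small sample. This is a minimal sketch: the function body repeats the notebook's steps so the snippet runs on its own, and the `sample` string is made up for illustration (it only resembles the first review's title).

```python
import re

def normalize_text(s):
    # Same steps as the notebook's normalize_text, repeated so this runs alone.
    s = re.sub(r"\s+", " ", s).strip()   # collapse whitespace runs, including newlines
    s = re.sub(r". ,", "", s)            # drop any character followed by " ,"
    s = s.replace("..", ".")             # "..." becomes ".." (non-overlapping replace)
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    return s.strip()

# Hypothetical sample with a double space, ellipses, and a newline.
sample = "Title:  where does one start...and stop...\n with a treat"
print(normalize_text(sample))
# Title: where does one start..and stop.. with a treat
```

Note that `"..."` ends up as `".."` rather than `"."`, which is why the `combined_norm` column above still shows double periods.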


In [6]:
import tiktoken
# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming less than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined_norm.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)


1000
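
The two-stage cut in the cell above (keep the `2 * top_n` most recent rows, drop rows over the token limit, then keep the last `top_n`) can be illustrated on toy data. This is only a sketch under stated assumptions: the rows are invented, plain tuples stand in for the DataFrame, and a whitespace word count stands in for the tiktoken token count.

```python
# Toy rows of (time, text); the values are made up for illustration.
rows = [
    (1, "old short review"),
    (2, "a b c d e f g h"),
    (3, "recent short"),
    (4, "x y z w v u t s r"),
    (5, "newest ok"),
    (6, "also fine here"),
]

top_n = 2
max_tokens = 4

def count_tokens(text):
    # Stand-in for len(encoding.encode(text)); counts whitespace-separated words.
    return len(text.split())

# Stage 1: keep the 2 * top_n most recent rows.
recent = sorted(rows, key=lambda r: r[0])[-top_n * 2:]
# Stage 2: drop rows that are too long, then keep the last top_n of what remains.
kept = [r for r in recent if count_tokens(r[1]) <= max_tokens][-top_n:]
print(kept)
# [(5, 'newest ok'), (6, 'also fine here')]
```

If more than half of the `2 * top_n` rows were over the limit, `kept` would hold fewer than `top_n` rows, which is the assumption the comment in the notebook cell refers to.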

To learn more about the n_tokens column and how the text is ultimately tokenized, it can be helpful to run the following code:

In [16]:
sample_encode = encoding.encode(df.combined_norm[0])
decode = encoding.decode_tokens_bytes(sample_encode)
decode

[b'Title',
 b':',
 b' where',
 b' does',
 b' one',
 b' start',
 b'..',
 b'and',
 b' stop',
 b'..',
 b' with',
 b' a',
 b' treat',
 b' like',
 b' this',
 b';',
 b' Content',
 b':',
 b' Wanted',
 b' to',
 b' save',
 b' some',
 b' to',
 b' bring',
 b' to',
 b' my',
 b' Chicago',
 b' family',
 b' but',
 b' my',
 b' North',
 b' Carolina',
 b' family',
 b' ate',
 b' all',
 b' ',
 b'4',
 b' boxes',
 b' before',
 b' I',
 b' could',
 b' pack',
 b'.',
 b' These',
 b' are',
 b' excellent',
 b'..',
 b'could',
 b' serve',
 b' to',
 b' anyone']

If you then check the length of the `decode` variable, you will find that it matches the first number in the n_tokens column.

In [17]:
len(decode)

51

In [7]:
df.head(2)

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,combined_norm,n_tokens
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,Title: where does one start..and stop.. with a...,51
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...","Title: Good, but not Wolfgang Puck good; Conte...",178


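## 2. Get embeddings and save them for future reuse

Now that we have a better understanding of how tokenization works, we can move on to embedding. It is important to note that we have not actually tokenized the documents yet. The n_tokens column is simply a way of making sure that none of the data we pass to the model for tokenization and embedding exceeds the input token limit (8,191 for text-embedding-ada-002). When we pass the documents to the embedding model, it breaks them into tokens similar to (though not necessarily identical to) the example above, and then converts the tokens into a series of floating-point numbers that can be accessed through vector search. These embeddings can be stored locally or in an Azure database. As a result, each review has its own corresponding embedding vector in the new embedding column on the right side of the DataFrame.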

In [8]:
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined_norm.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")


In [9]:
df.head(2)

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,combined_norm,n_tokens,embedding
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,Title: where does one start..and stop.. with a...,51,"[0.0065054260194301605, -0.028513578698039055,..."
297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...","Title: Good, but not Wolfgang Puck good; Conte...",178,"[-0.003428791416808963, -0.009930307045578957,..."


In [14]:
len(df.embedding[0])

1536
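
One thing to keep in mind when reusing the saved file: a list-valued cell written to CSV comes back as a string, so it has to be parsed before it can be used as a vector. The sketch below shows this with the standard library only, so it runs without the notebook's data; the product ID `B000TOY` and the three-element vector are made up for illustration.

```python
import ast
import csv
import io

# Write one toy row the way a CSV export stores a list-valued cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ProductId", "embedding"])
writer.writerow(["B000TOY", str([0.25, -0.5, 1.0])])  # hypothetical row

# Read it back: the embedding is now a string such as "[0.25, -0.5, 1.0]".
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embedding"]

# Parse the string back into a list of floats.
vector = ast.literal_eval(raw)
print(type(raw).__name__, type(vector).__name__, len(vector))
# str list 3
```

With pandas, the same parse would typically be applied to the whole column after `pd.read_csv`, for example with `df.embedding.apply(ast.literal_eval)`.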

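When we run the search code block below, the search query (here, "Most popular hot dogs") is embedded with the same text-embedding-ada-002 (Version 2) model. We then find the reviews whose embeddings are closest to the query embedding, ranked by cosine similarity.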

In [19]:
from openai.embeddings_utils import cosine_similarity

# search through the reviews for the entries most similar to a query
def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        # engine should be the deployment name you chose when you deployed
        # the text-embedding-ada-002 (Version 2) model
        engine=embedding_model,
    )
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


res = search_docs(df, 'Most popular hot dogs', top_n=4)

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,combined_norm,n_tokens,embedding,similarities
746,B001SB34RG,A2NM2R0EH3K6C1,5,won't kid you,I won't kid you there is nothing better than a...,Title: won't kid you; Content: I won't kid you...,Title: won't kid you; Content: I won't kid you...,125,"[-0.001628590514883399, 0.008710593916475773, ...",0.823522
797,B004IN6GVM,A5PFLD64SYXYK,5,The BEST dog candy,This brand of dog treats is by FAR the most lo...,Title: The BEST dog candy; Content: This brand...,Title: The BEST dog candy; Content: This brand...,63,"[-0.00887732021510601, -0.016540195792913437, ...",0.808917
755,B007TBO6PI,A2W8NAL3R6U8TF,5,Delicious!,"Love these packs. I have made pretzel dogs, bi...",Title: Delicious!; Content: Love these packs. ...,Title: Delicious!; Content: Love these packs. ...,42,"[0.0017885540146380663, -0.00704592140391469, ...",0.80607
383,B005HSSCFK,A3U7W07OLWYCNE,5,One of the best across the US,I am a BBQ chip fanatic and have tried every c...,Title: One of the best across the US; Content:...,Title: One of the best across the US; Content:...,94,"[-0.005858675576746464, -0.008477376773953438,...",0.792126
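
The ranking in `search_docs` depends on cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below uses the standard definition (dot product divided by the product of the norms) in plain Python; it is not taken from the `openai.embeddings_utils` helper, whose implementation is not shown here. The function name `cosine_sim` and the two-dimensional toy vectors are made up for illustration.

```python
import math

def cosine_sim(a, b):
    # Standard definition: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; real ones from text-embedding-ada-002 have 1536 dimensions.
docs = {
    "hot dogs": [1.0, 0.0],
    "dog treats": [1.0, 1.0],
    "bbq chips": [0.0, 1.0],
}
query = [1.0, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine_sim(docs[name], query), reverse=True)
print(ranked)
# ['hot dogs', 'dog treats', 'bbq chips']
```

A score close to 1 means the two vectors point in nearly the same direction, which is how the `similarities` column in the table above should be read.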
