## Prerequisites

1. Make sure you have set the API key in your environment as described in the [README](README-CN.md).

2. Install the dependencies.

In [1]:
!pip install openai pandas matplotlib plotly scikit-learn tiktoken


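## 1. Generate Embeddings (with the text-embedding-ada-002 model)

Embeddings are useful for working with natural language and code because other machine learning models and algorithms, such as clustering or search, can readily consume and compare them.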

![Embedding](images/embedding-vectors.svg)
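To make "comparing" embeddings concrete, here is a minimal sketch of cosine similarity, a common way to score how close two vectors are. The three-dimensional vectors and the names `query` and `candidates` are made up for illustration; real text-embedding-ada-002 vectors have 1536 dimensions, and in practice a vectorized library would normally replace the pure-Python loops.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only)
query = [1.0, 0.0, 1.0]
candidates = {
    "similar": [0.9, 0.1, 1.1],
    "unrelated": [0.0, 1.0, 0.0],
}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

A search feature ranks documents this way against a query embedding, and clustering algorithms group vectors using the same kind of distance measure.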

### Amazon Fine Food Reviews dataset (amazon-fine-food-reviews)

Source: [Amazon Fine Food Reviews dataset](https://www.kaggle.com/snap/amazon-fine-food-reviews)

![dataset](images/amazon-fine-food-reviews.png)


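The dataset contains 568,454 food reviews left by Amazon users up to October 2012. For illustration, we use a subset consisting of the 1,000 most recent reviews. The reviews are written in English and tend to be either positive or negative. Each review has a product ID, a user ID, a score, a title (summary), and a body.

We combine the review summary and body into a single text. The model encodes this combined text and outputs a single vector embedding.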

In [2]:
import pandas as pd
import tiktoken

from openai.embeddings_utils import get_embedding

#### Key parameters of the embedding model

In [3]:
# Model name
# The officially recommended second-generation embedding model: text-embedding-ada-002
embedding_model = "text-embedding-ada-002"
# Tokenizer encoding used by text-embedding-ada-002
embedding_encoding = "cl100k_base"
# text-embedding-ada-002 accepts at most 8191 input tokens and outputs 1536-dimensional vectors
# In this demo we filter out texts longer than 8000 tokens
max_tokens = 8000

#### Load the dataset

In [4]:
input_datapath = "data/fine_food_reviews_1k.csv"
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()

# Combine the "Summary" and "Text" fields into a new field, "combined"
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...


In [5]:
df["combined"]

0      Title: where does one  start...and stop... wit...
1      Title: Arrived in pieces; Content: Not pleased...
2      Title: It isn't blanc mange, but isn't bad . ....
3      Title: These also have SALT and it's not sea s...
4      Title: Happy with the product; Content: My dog...
                             ...                        
995    Title: Delicious!; Content: I have ordered the...
996    Title: Good Training Treat; Content: My dog wi...
997    Title: Jamica Me Crazy Coffee; Content: Wolfga...
998    Title: Party Peanuts; Content: Great product f...
999    Title: I love Maui Coffee!; Content: My first ...
Name: combined, Length: 1000, dtype: object

#### Reduce the sample to the 1,000 most recent reviews and drop overly long samples


In [43]:
top_n = 1000
# Start with the 2,000 most recent entries, assuming fewer than half will be filtered out
df = df.sort_values("Time").tail(top_n * 2) 
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# Count tokens so that reviews too long to embed can be dropped
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
# Drop samples that exceed the token limit
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

1000

#### Generate embeddings and save them (optional; the file shipped with the project can be reused)

In [44]:
# Generating the embeddings takes a few minutes
# Note: this step is optional; you can reuse the project's embedding file fine_food_reviews_with_embeddings_1k
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine=embedding_model))

output_datapath = "data/fine_food_reviews_with_embeddings_1k_demo.csv"

df.to_csv(output_datapath)
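The `openai.embeddings_utils` helper imported above is only available in `openai` package versions before 1.0. As a hedged sketch of how the same step might look with the 1.0+ client, the helper below takes a client object and calls `client.embeddings.create`. The name `embed_text` is hypothetical and not part of this project; the newline replacement is assumed to mirror what the legacy helper did.

```python
def embed_text(client, text, model="text-embedding-ada-002"):
    """Return the embedding vector for one text via an openai>=1.0 style client."""
    # Replace newlines with spaces (assumed to match the legacy helper's behavior)
    cleaned = text.replace("\n", " ")
    response = client.embeddings.create(input=[cleaned], model=model)
    return response.data[0].embedding


# Usage, assuming openai>=1.0 is installed and OPENAI_API_KEY is set:
# from openai import OpenAI
# client = OpenAI()
# df["embedding"] = df.combined.apply(lambda x: embed_text(client, x))
```

Passing the client in as an argument keeps the function easy to exercise with a stub object, so the logic can be checked without spending API calls.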

## 2. Load the fine_food_reviews_with_embeddings_1k embedding file

In [11]:
embedding_datapath = "data/fine_food_reviews_with_embeddings_1k.csv"

df_embeded = pd.read_csv(embedding_datapath, index_col=0)

#### Inspect the embedding results

In [13]:
df_embeded["embedding"]

0      [0.007018072064965963, -0.02731654793024063, 0...
297    [-0.003140551969408989, -0.009995664469897747,...
296    [-0.01757248118519783, -8.266511576948687e-05,...
295    [-0.0013932279543951154, -0.011112828738987446...
294    [-0.01757248118519783, -8.266511576948687e-05,...
                             ...                        
623    [0.00011091353371739388, -0.00466986745595932,...
624    [-0.020869314670562744, -0.013138455338776112,...
625    [-0.009749102406203747, -0.0068712360225617886...
619    [-0.00521062919870019, 0.0009606690146028996, ...
999    [-0.006057822611182928, -0.015015840530395508,...
Name: embedding, Length: 1000, dtype: object

In [24]:
len(df_embeded["embedding"][0])

34402

In [28]:
type(df_embeded["embedding"][0])

str

In [25]:
# Convert the string to a vector
import ast

df_embeded["embedding_vec"] = df_embeded["embedding"].apply(ast.literal_eval)

In [26]:
len(df_embeded["embedding_vec"][0])

1536
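The length of 34402 reported earlier is a character count, because a CSV round trip stores each embedding as the text of a Python list; only after parsing does the length become the number of dimensions. The sketch below shows the same effect on a short stand-in string (`raw` is an illustrative value, not a real embedding). It also checks that `json.loads` gives the same result, since a plain list of floats is valid JSON as well.

```python
import ast
import json

# A short stand-in for one CSV cell (a real cell holds 1536 numbers)
raw = "[0.007018072064965963, -0.02731654793024063, 0.5]"

# literal_eval safely parses Python literals without executing arbitrary code
vec = ast.literal_eval(raw)

print(type(raw).__name__, len(raw))  # string: length counts characters
print(type(vec).__name__, len(vec))  # list: length counts dimensions

# json.loads yields the same list for this kind of input
same_as_json = json.loads(raw) == vec
print(same_as_json)
```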

## 2. 