# OpenAI Embedding

## API reference

### [Embedding](https://platform.openai.com/docs/api-reference/embeddings)

Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms.
Related guide: [Embeddings](https://platform.openai.com/docs/guides/embeddings)
   
### Embedding object

Represents an embedding vector returned by the embeddings endpoint.

1. index (integer): The index of the embedding in the list of embeddings.
2. object (string): The object type, which is always "embedding".
3. embedding (array): The embedding vector, which is a list of floats. The length of the vector depends on the model, as listed in the [embedding guide](https://platform.openai.com/docs/guides/embeddings).

### Create embeddings

`POST https://api.openai.com/v1/embeddings`

Creates an embedding vector representing the input text.

#### Request body

1. model (string) (Required): ID of the model to use. You can use the [List models](https://platform.openai.com/docs/api-reference/models/list) API to see all of your available models, or see our [Model overview](https://platform.openai.com/docs/models/overview) for descriptions of them.
2. input (string or array) (Required): Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. Each input must not exceed the max input tokens for the model (8191 tokens for text-embedding-ada-002) and cannot be an empty string. [Example Python code](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb) for counting tokens.
3. user (string) (Optional): A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/guides/safety-best-practices/end-user-ids).

#### Returns

A list of embedding objects.
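
As a rough illustration of the request and response shapes described above, the sketch below builds a request body and reads the vectors out of a response dictionary. The helper names (`build_embedding_request`, `extract_embeddings`) and the sample response values are made up for this example, only string inputs are handled, and no network call is made.

```python
import json
from typing import Any, Dict, List, Optional, Union


def build_embedding_request(
    model: str,
    inputs: Union[str, List[str]],
    user: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /v1/embeddings (hypothetical helper)."""
    texts = [inputs] if isinstance(inputs, str) else list(inputs)
    # The reference above states that an input cannot be an empty string.
    if not texts or any(t == "" for t in texts):
        raise ValueError("input must contain at least one non-empty string")
    body: Dict[str, Any] = {"model": model, "input": inputs}
    if user is not None:
        body["user"] = user  # optional end-user identifier
    return body


def extract_embeddings(response: Dict[str, Any]) -> List[List[float]]:
    """Return the embedding vectors ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Illustrative response with made-up 3-dimensional vectors (real vectors are much longer).
sample_response = {
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ]
}

body = build_embedding_request("text-embedding-ada-002", ["first text", "second text"])
print(json.dumps(body))
print(extract_embeddings(sample_response))
```

Sorting by `index` keeps each vector aligned with the input it was computed from, whatever order the objects arrive in.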

## Environment setup

### Install required dependencies

> `%pip` installs the package into the virtual environment where the current notebook kernel is running,
> while `!pip` runs whichever `pip` the shell finds first, which is often the base environment.
> If you are using a Python virtual environment (as you should!), use `%pip`.

First, install the required Python packages with a shell command.

In [19]:
%pip install tiktoken openai pandas matplotlib plotly scikit-learn numpy

Note: you may need to restart the kernel to use updated packages.
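
To see which environment the kernel is actually using, a quick check along the following lines may help. It is only a sketch: the `sys.prefix` comparison detects `venv`-style virtual environments and is not a reliable test for other environment managers.

```python
import sys

# sys.executable is the interpreter running this code (the notebook kernel when run in a cell).
print("Interpreter:", sys.executable)

# Inside a venv-style virtual environment, sys.prefix differs from sys.base_prefix.
in_virtualenv = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_virtualenv)
```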


### Import dependencies

Import all the modules this project depends on.

If a module turns out to be missing, add it to the install command above.

In [20]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import ast
import subprocess
import time

# Import tiktoken, OpenAI's tokenizer library, used here to count the tokens in a text.
import tiktoken

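# Import get_embedding and cosine_similarity from openai.embeddings_utils
# (this module and openai.Embedding exist only in openai-python versions before 1.0).
# get_embedding returns the embedding vector that an OpenAI embedding model produces for a text;
# an embedding vector is the numeric form the model uses to represent its input.
# cosine_similarity computes the cosine similarity between two embedding vectors.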
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

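# Import the TSNE class from sklearn.manifold.
# t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality-reduction method for data visualization
# that is particularly well suited to high-dimensional data.
# It maps high-dimensional data into 2D or 3D space so that the structure of the data can be inspected visually.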
from sklearn.manifold import TSNE

# Import the KMeans class from scikit-learn, which implements the K-Means clustering algorithm.
from sklearn.cluster import KMeans
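
For readers unfamiliar with cosine similarity, the measure mentioned in the comments above can be sketched in plain Python as the dot product divided by the product of the vector norms. This is an illustration of the formula, not the library's implementation, and the function name is made up for this example.

```python
import math
from typing import Sequence


def cosine_similarity_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity_sketch([1.0, 2.0], [2.0, 4.0]))  # vectors pointing the same way
```

Values close to 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.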

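### Load configuration

In this project, the API key is stored in the config.json file in the repository root; replace it with your own key.

If you manage the project with git, make sure changes to this file are not committed (for example, by ignoring it), so the key is not leaked.

In [21]:
# Run a shell command via subprocess to get the root directory of the git repository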
command = ['git', 'rev-parse', '--show-toplevel']
process = subprocess.Popen(command, stdout=subprocess.PIPE)
output, error = process.communicate()
git_root = output.decode().strip()

config_path = os.path.join(git_root, "config.json")
config = {}
with open(config_path,"r") as f:
    config = json.load(f)
openai.api_key = config["sk"]
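
A slightly more defensive way to load the key might look like the sketch below. The `load_api_key` helper and the fallback to an environment variable are assumptions of this example rather than part of the project; the `"sk"` field name matches the config.json used above.

```python
import json
import os
import tempfile
from typing import Optional


def load_api_key(config_path: str, env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the key stored under "sk" in config_path, else the environment variable, else None."""
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as f:
            key = json.load(f).get("sk")
        if key:
            return key
    # Assumed fallback: read the key from the environment instead of a file.
    return os.environ.get(env_var)


# Demonstrate with a throwaway config file holding a placeholder (not a real key).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"sk": "sk-placeholder"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

Reading the key from the environment keeps it out of the repository altogether, which is one way to avoid the leak the note above warns about.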

## Fetch embedding data (optional)

This step is optional; you can use the files that have already been generated in the project instead.

### Configure the embedding model

In [22]:
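# Model type
# The officially recommended second-generation embedding model is suggested: text-embedding-ada-002
embedding_model = "text-embedding-ada-002"
# Tokenizer (encoding) used by text-embedding-ada-002
embedding_encoding = "cl100k_base"
# text-embedding-ada-002 accepts at most 8191 input tokens and returns 1536-dimensional vectors
max_tokens = 8000
# Free accounts are limited to 150_000 tokens per minute (tpm) and 3 requests per minute (rpm) for embeddings.
# To stay under both limits, send at most 2 requests per minute with at most 70k tokens each.
max_tpm = 70_000
max_rpm = 2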
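
The arithmetic behind these settings can be checked with a few lines. The two `account_*` names are introduced here only to hold the limits quoted in the comment above.

```python
# Values copied from the configuration cell above.
max_tpm = 70_000   # token budget per request
max_rpm = 2        # requests per minute

# Account limits quoted in the configuration comment (hypothetical variable names).
account_tpm_limit = 150_000
account_rpm_limit = 3

sleep_time = 60 / max_rpm                 # seconds to wait between requests
tokens_per_minute = max_tpm * max_rpm     # worst-case tokens sent per minute

print("Seconds between requests:", sleep_time)
print("Worst-case tokens per minute:", tokens_per_minute)
print("Within limits:", tokens_per_minute <= account_tpm_limit and max_rpm <= account_rpm_limit)
```

Two requests of at most 70,000 tokens each add up to 140,000 tokens per minute, which leaves a margin under the quoted 150,000 tpm limit.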

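### Load the dataset

> Source: [Amazon Fine Food Reviews dataset](https://www.kaggle.com/snap/amazon-fine-food-reviews)

The dataset is the Amazon Fine Food Reviews dataset (amazon-fine-food-reviews), which contains 568,454 food reviews left by Amazon users up to October 2012.

For illustration, we use a subset of the dataset (/data/fine_food_reviews_1k.csv) containing the 1,000 most recent reviews. The reviews are written in English and tend to be either positive or negative. Each review has a product ID, a user ID, a score, a title (summary), and a body.

We combine the review summary (Summary) and body (Text) into a single combined text (combined). The model encodes this combined text and outputs a single vector embedding.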

In [23]:
input_datapath = os.path.join(git_root, "openai-api", "data", "fine_food_reviews_1k.csv")

df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()

# Combine the "Summary" and "Text" fields into a new "combined" field
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)

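### Process the dataset

The API enforces rpm and tpm limits, and the model limits the number of tokens in a single input. The inputs are therefore split into several smaller lists, and each request sends one of these lists.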

In [24]:
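# Convert the "combined" column to a list
combined_array = df['combined'].values.tolist()

# Get the encoding named by embedding_encoding
encoding = tiktoken.get_encoding(embedding_encoding)

# Split the list into batches according to the token limits
# The collection of batches
input_array_array = []
# Total number of tokens in the current batch
current_token_sum = 0
# The current batch
current_input_array = []
# Walk the list and split it into batches
for text in combined_array:
    input_token = len(encoding.encode(text))
    # A single input over the model limit cannot be embedded. Fail loudly instead of
    # leaving the loop, which would silently drop this review and every later one
    # and leave fewer embeddings than rows in df.
    if input_token > max_tokens:
        raise ValueError(f"Input has {input_token} tokens, more than max_tokens={max_tokens}")
    # If adding this input would push the current batch over the threshold,
    # store the current batch and start a new one
    if current_token_sum + input_token > max_tpm:
        input_array_array.append(current_input_array)
        current_token_sum = 0
        current_input_array = []

    # Add the input to the current batch
    current_input_array.append(text)
    current_token_sum += input_token

# Add the last batch to the result list (skip it if it is empty)
if current_input_array:
    input_array_array.append(current_input_array)

print("Number of reviews:", len(combined_array))
print("Number of batches:", len(input_array_array))
for sublist in input_array_array:
    print("Batch size:", len(sublist))

Number of reviews: 1000
Number of batches: 2
Batch size: 714
Batch size: 286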
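
The splitting logic can also be written as a small reusable function. The sketch below is a self-contained variant, not the notebook's own code: the function name is hypothetical, and a whitespace word count stands in for tiktoken so the example runs without downloading an encoding.

```python
from typing import Callable, List


def split_into_batches(
    texts: List[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_input: int,
    max_tokens_per_batch: int,
) -> List[List[str]]:
    """Greedily pack texts into consecutive batches under a per-batch token budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if n > max_tokens_per_input:
            raise ValueError(f"input has {n} tokens, limit is {max_tokens_per_input}")
        # Close the current batch when this text would push it over the budget.
        if current and current_tokens + n > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer for the demo only; real token counts come from tiktoken.
    return len(text.split())


demo_texts = ["a b c", "d e", "f g h i", "j"]
demo_batches = split_into_batches(demo_texts, whitespace_token_count, 4, 5)
print(demo_batches)
```

Because the batches are consecutive slices of the input, concatenating them restores the original order, which is what later allows the embeddings to be assigned back to the DataFrame rows.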


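### Call the embedding model

Call the API endpoint of the embedding model.

Because the inputs were split into batches in the previous step, the requests must be paced to respect the rpm limit.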

In [25]:
# Time to wait after each request
sleep_time = 60 / max_rpm

# Accumulated embeddings from all requests
total_embedding = []

for input_array in input_array_array:
    res = openai.Embedding.create(
        model = embedding_model,
        input = input_array
    )
    data_list = res['data']
    # Sort the returned objects by index so they line up with the inputs
    sorted_list = sorted(data_list, key = lambda obj: obj.index)
    # Collect the embedding vectors from the sorted objects
    total_embedding += [obj.embedding for obj in sorted_list]
    # Sleep after each request to stay within the rpm limit
    time.sleep(sleep_time)
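
The pacing logic can be tried out without an API key by injecting the embedding call and the sleep function. In the sketch below, `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake function merely returns made-up one-dimensional vectors in place of a real request.

```python
import time
from typing import Callable, List, Sequence


def embed_in_batches(
    batches: Sequence[List[str]],
    embed_batch: Callable[[List[str]], List[List[float]]],
    requests_per_minute: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Call embed_batch once per batch, pausing between calls to respect a request rate."""
    interval = 60 / requests_per_minute
    results: List[List[float]] = []
    for i, batch in enumerate(batches):
        if i > 0:
            sleep(interval)  # wait only between requests, not after the last one
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise RuntimeError("expected one embedding per input")
        results.extend(vectors)
    return results


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for the API call: one made-up 1-dimensional vector per input.
    return [[float(len(text))] for text in batch]


recorded_sleeps: List[float] = []
demo_vectors = embed_in_batches(
    [["aa", "b"], ["ccc"]],
    fake_embed_batch,
    requests_per_minute=2,
    sleep=recorded_sleeps.append,
)
print(demo_vectors, recorded_sleeps)
```

Passing `sleep` as a parameter is a design choice made for testability: the demo records the requested pauses instead of actually waiting.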


### Process the embedding results

Combine the results of all requests and write them to a new file.

In [26]:
df["embedding_vec"] = total_embedding
output_datapath = os.path.join(git_root, "openai-api", "data", "fine_food_reviews_1k_with_embeddings.csv")
df.to_csv(output_datapath)

## Read the embedding data

After fetching the embeddings, we wrote the results to a local file so that they do not have to be fetched again.

In [27]:
embedding_datapath = os.path.join(git_root, "openai-api", "data", "fine_food_reviews_1k_with_embeddings.csv")
df_embedding = pd.read_csv(embedding_datapath, index_col=0)
df_embedding.head(5)

| | Time | ProductId | UserId | Score | Summary | Text | combined | embedding_vec |
|---|---|---|---|---|---|---|---|---|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | [0.007060592994093895, -0.02732112631201744, 0... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... | [-0.023609420284628868, -0.011784634552896023,... |
| 2 | 1351123200 | B000JMBE7M | AQX1N6A51QOKG | 4 | It isn't blanc mange, but isn't bad . . . | I'm not sure that custard is really custard wi... | Title: It isn't blanc mange, but isn't bad . .... | [0.00016697357932571322, 0.005226491950452328,... |
| 3 | 1351123200 | B004AHGBX4 | A2UY46X0OSNVUQ | 3 | These also have SALT and it's not sea salt. | I like the fact that you can see what you're g... | Title: These also have SALT and it's not sea s... | [0.010532955639064312, -0.01354704238474369, 0... |
| 4 | 1351123200 | B001BORBHO | A1AFOYZ9HSM2CZ | 5 | Happy with the product | My dog was suffering with itchy skin. He had ... | Title: Happy with the product; Content: My dog... | [0.015255776233971119, -0.003898625960573554, ... |
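
Because the vectors were written to CSV, `pd.read_csv` returns the `embedding_vec` column as strings (the quoted `"[...]"` values in the output above) rather than lists of floats. One possible way to convert them back is `ast.literal_eval`; the helper name below is hypothetical and the demo vector is made up.

```python
import ast
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Turn the text form of a list, e.g. "[0.1, -0.2]", back into a list of floats."""
    values = ast.literal_eval(cell)
    if not isinstance(values, list):
        raise ValueError("expected a list literal")
    return [float(v) for v in values]


# Made-up short vector; real rows hold 1536 values for text-embedding-ada-002.
demo_vector = parse_embedding("[0.25, -0.5, 1.0]")
print(demo_vector, type(demo_vector).__name__)
```

With the DataFrame loaded above, the same function could be applied to the whole column with `df_embedding["embedding_vec"].apply(parse_embedding)`.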


## Process the data with t-SNE