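# Using Embedding APIs
Note: To make the embedding API calls easier, put your API keys in the `.env` file under `llm_universe`. The code reads that file and loads the keys as environment variables automatically.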
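## 1. Using the OpenAI API
OpenAI exposes its embedding models through a ready-made interface, so a thin wrapper is all we need. There are currently three OpenAI embedding models, which compare as follows: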
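|Model | Pages per dollar | [MTEB](https://github.com/embeddings-benchmark/mteb) score | [MIRACL](https://github.com/project-miracl/miracl) score|
| --- | --- | --- | --- |
|text-embedding-3-large|9,615|64.6|54.9|
|text-embedding-3-small|62,500|62.3|44.0|
|text-embedding-ada-002|12,500|61.0|31.4|
* The MTEB score is the embedding model's average score over eight task types, including classification, clustering, and pair classification.
* The MIRACL score is the embedding model's average score on the MIRACL multilingual retrieval benchmark.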

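Comparing the three models, `text-embedding-3-large` has the best performance and the highest price, so it fits applications that need better results and have the budget for it. `text-embedding-3-small` offers good performance at the lowest price and is the sensible choice when the budget is limited. `text-embedding-ada-002` is OpenAI's previous-generation model: it scores below both newer models on the two benchmarks and, at 12,500 pages per dollar, costs five times as much as `text-embedding-3-small`, so it is not recommended.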
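
To make the price difference concrete, the pages-per-dollar column can be turned into an approximate cost for a given corpus size. The sketch below uses only the figures from the table above; the helper name `estimated_cost_usd` and the corpus size of 100,000 pages are illustrative assumptions.

```python
# Pages per dollar, copied from the table above.
PAGES_PER_DOLLAR = {
    "text-embedding-3-large": 9_615,
    "text-embedding-3-small": 62_500,
    "text-embedding-ada-002": 12_500,
}


def estimated_cost_usd(model: str, pages: int) -> float:
    """Approximate cost in US dollars of embedding `pages` pages (hypothetical helper)."""
    return pages / PAGES_PER_DOLLAR[model]


if __name__ == "__main__":
    # 100,000 pages is an arbitrary corpus size chosen for illustration.
    for name in PAGES_PER_DOLLAR:
        print(f"{name}: ${estimated_cost_usd(name, 100_000):.2f}")
```

With these figures, 100,000 pages come to roughly $1.60 for `text-embedding-3-small`, $8.00 for `text-embedding-ada-002`, and $10.40 for `text-embedding-3-large`.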

In [1]:
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv


# Load environment variables from the local/project .env file
_ = load_dotenv(find_dotenv())

# If you need to access the API through a proxy, configure it as follows
# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'

def openai_embedding(text: str, model: str = "text-embedding-3-small"):
    # Read OPENAI_API_KEY and OPENAI_BASE_URL from the environment
    client = OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY"),
        base_url=os.environ.get("OPENAI_BASE_URL")
    )

    # Available embedding models: 'text-embedding-3-small', 'text-embedding-3-large', 'text-embedding-ada-002'
    response = client.embeddings.create(
        input=text,
        model=model
    )
    print(response)
    return response

# Sample input (Chinese for "the input text to embed, as a string")
response = openai_embedding(text='要生成 embedding 的输入文本，字符串形式。')

CreateEmbeddingResponse(data=[Embedding(embedding=[0.03884002938866615, 0.013516489416360855, -0.0024250170681625605, -0.01655769906938076, 0.024130908772349358, -0.017382603138685226, 0.04206013306975365, 0.011498954147100449, -0.028245486319065094, -0.00674333656206727, 0.0011976007372140884, 0.014013418927788734, -0.023097295314073563, 0.01580236665904522, -0.005903525277972221, 0.013764954172074795, -0.010624358430504799, -0.010823129676282406, -0.01147907692939043, 0.02377311885356903, 0.023673733696341515, -0.008934796787798405, -0.00743406917899847, 0.017124198377132416, 0.016855856403708458, -0.02202392742037773, 0.020990312099456787, -0.009993257001042366, 0.04245767742395401, -0.001410038210451603, -0.08571044355630875, -0.040986765176057816, 0.01652788370847702, -0.03925745189189911, 0.012810848653316498, 0.0416228361427784, -0.008224187418818474, -0.013238208368420601, 0.021666137501597404, 0.015991199761629105, 0.004869911354035139, 0.012999682687222958, -0.000245824921876

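The API returns data in JSON format. Besides `object`, which gives the type of the returned object, the response contains `data`, which holds the embeddings, `model`, which names the embedding model used, and `usage`, which reports the token consumption of the request: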
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        ... (omitted)
        -4.547132266452536e-05
      ]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
```
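
The field layout above can be explored without calling the API. The following sketch parses a hand-written sample with the same structure and reads out the fields just described. The three-element vector is a made-up placeholder rather than real model output, and `summarize_embedding_json` is a hypothetical helper name.

```python
import json

# Hand-written sample that mirrors the response structure shown above.
# The three-element vector is a placeholder, not real model output.
SAMPLE = """
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.1, -0.2, 0.3]}
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
"""


def summarize_embedding_json(raw: str) -> dict:
    """Extract the main fields from an embedding response given as JSON text."""
    body = json.loads(raw)
    first = body["data"][0]  # one entry per input text
    return {
        "object": body["object"],
        "model": body["model"],
        "dimensions": len(first["embedding"]),
        "total_tokens": body["usage"]["total_tokens"],
    }


if __name__ == "__main__":
    print(summarize_embedding_json(SAMPLE))
```

The `openai` SDK wraps this JSON in a Python object, so in the notebook the same fields are read as attributes (`response.object`, `response.data`) rather than dictionary keys.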
We can read the `object` attribute of the response to get the type of the returned object.

In [2]:
print(f'Returned object type: {response.object}')

Returned object type: list


The embedding itself is stored in `data`. We can check its length and look at the generated values.

In [3]:
print(f'Embedding length: {len(response.data[0].embedding)}')
print(f'Embedding (first 10): {response.data[0].embedding[:10]}')

Embedding length: 1536
Embedding (first 10): [-0.042730238288640976, 0.004826112650334835, 0.008142500184476376, 0.014882600866258144, 0.010360579937696457, 0.008664822205901146, -0.01151255052536726, -0.009287315420806408, -0.018216876313090324, -0.01003860030323267]


We can also check which model produced the embedding and how many tokens the request used.

In [4]:
print(f'Embedding model: {response.model}')
print(f'Token usage: {response.usage}')

Embedding model: text-embedding-ada-002
Token usage: Usage(prompt_tokens=12, total_tokens=12)


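## 2. Using the Wenxin Qianfan API
Embedding-V1 is a text representation model built on Baidu's Wenxin large-model technology. The access token is the credential for calling the interface: to use Embedding-V1, first obtain an access token with your API Key and Secret Key, then call the embedding endpoint with that token to embed the text. The Qianfan platform also supports other embedding models such as bge-large-zh.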

In [5]:
import requests
import json

def wenxin_embedding(text: str):
    # Read the API Key and Secret Key from the environment variables QIANFAN_AK and QIANFAN_SK
    api_key = os.environ['QIANFAN_AK']
    secret_key = os.environ['QIANFAN_SK']

    # Exchange the API Key and Secret Key for an access token at https://aip.baidubce.com/oauth/2.0/token
    url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={0}&client_secret={1}".format(api_key, secret_key)
    payload = json.dumps("")
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload)
    
    # Use the access token to request the embedding
    url = "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/embeddings/embedding-v1?access_token=" + str(response.json().get("access_token"))
    # The endpoint expects a list of strings, so wrap the single input text in a list
    payload = json.dumps({
        "input": [text]
    })
    headers = {
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    return json.loads(response.text)

# wenxin_embedding takes a single string and wraps it in a list for the API
text = "要生成 embedding 的输入文本，字符串形式。"
response = wenxin_embedding(text=text)
response

{'id': 'as-7kybauw0hg',
 'object': 'embedding_list',
 'created': 1713796006,
 'data': [{'object': 'embedding',
   'embedding': [0.060567744076251984,
    0.020958080887794495,
    0.053234219551086426,
    0.02243831567466259,
    -0.024505289271473885,
    -0.09820500761270523,
    0.04375714063644409,
    -0.009092536754906178,
    -0.020122773945331573,
    0.015808865427970886,
    0.02499788999557495,
    -0.05453784763813019,
    -0.0278654545545578,
    0.032102085649967194,
    -0.04915492609143257,
    -0.0073334150947630405,
    -0.02150459587574005,
    0.0574442557990551,
    -0.04584359750151634,
    -0.026732008904218674,
    0.08619209378957748,
    -0.07170350104570389,
    -0.10206466913223267,
    0.022043202072381973,
    -0.06668663024902344,
    -0.021306872367858887,
    0.02987739071249962,
    0.07711691409349442,
    0.06159079447388649,
    0.01056479662656784,
    -0.035385146737098694,
    -0.023122405633330345,
    0.022517746314406395,
    -0.1037336215376

Besides its own id, each Embedding-V1 response carries a timestamp recording when the embedding was created.

In [6]:
print('Embedding id: {}'.format(response['id']))
print('Embedding creation timestamp: {}'.format(response['created']))

Embedding id: as-7kybauw0hg
Embedding creation timestamp: 1713796006


As before, we can also read the returned object type and the embedding from the response.

In [7]:
print('Returned object type: {}'.format(response['object']))
print('Embedding length: {}'.format(len(response['data'][0]['embedding'])))
print('Embedding (first 10): {}'.format(response['data'][0]['embedding'][:10]))

Returned object type: embedding_list
Embedding length: 384
Embedding (first 10): [0.060567744076251984, 0.020958080887794495, 0.053234219551086426, 0.02243831567466259, -0.024505289271473885, -0.09820500761270523, 0.04375714063644409, -0.009092536754906178, -0.020122773945331573, 0.015808865427970886]
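
`wenxin_embedding` requests a new access token on every call. Because the token is a credential obtained in a separate step, one option is to cache it and request a new one only after a chosen lifetime has passed. The sketch below is a minimal illustration under stated assumptions: `TokenCache`, `fetch_token`, and the 30-day lifetime in the demo are hypothetical and not part of the Qianfan API, and the real token lifetime should be checked against the platform's documentation.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Cache an access token and refresh it once it is older than `ttl_seconds`."""

    def __init__(self, fetch_token: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_token = fetch_token  # function that performs the real token request
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so the cache can be tested without waiting
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Fetch on first use, or when the cached token has reached its lifetime.
        if self._token is None or now - self._fetched_at >= self._ttl:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token


if __name__ == "__main__":
    calls = []

    def fake_fetch() -> str:
        # Stand-in for the POST to the token endpoint; no network access here.
        calls.append(1)
        return f"token-{len(calls)}"

    cache = TokenCache(fake_fetch, ttl_seconds=30 * 24 * 3600)  # assumed lifetime
    print(cache.get(), cache.get(), len(calls))
```

In a real setup, `fetch_token` would wrap the first request in `wenxin_embedding`, and the embedding call would use `cache.get()` instead of requesting a token each time.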


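## 3. Using the iFLYTEK Spark API
The cell below downloads the `Embedding.zip` archive from the iFLYTEK documentation site, renames the extracted script to `Spark_Embedding.py`, and rewrites it so that the credentials are read from environment variables.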

In [7]:
import os
import re
import requests
import zipfile
from io import BytesIO
from dotenv import load_dotenv, find_dotenv

# Download the archive
url = "https://openres.xfyun.cn/xfyundoc/2024-03-26/78dc60db-b67d-4fb7-97a9-710fa5e226b0/1711443170449/Embedding.zip"
response = requests.get(url)
zip_file = zipfile.ZipFile(BytesIO(response.content))

# Extract the archive into the current directory
zip_file.extractall()

# Rename the extracted file; adjust the name below to match the actual file
original_file_name = "Embedding.py"  # change this if the extracted file has a different name
new_file_name = "Spark_Embedding.py"

# os.replace overwrites an existing target, so the cell can be re-run
os.replace(original_file_name, new_file_name)

# Read the file contents
with open(new_file_name, "r", encoding="utf-8") as file:
    content = file.read()

# Replace the hard-coded credentials with environment lookups; \s* tolerates any spacing around "="
content = re.sub(r"APPID\s*=\s*'.*?'", "APPID = os.environ.get('SPARK_APPID')", content)
content = re.sub(r"APISecret\s*=\s*'.*?'", "APISecret = os.environ.get('SPARK_API_SECRET')", content)
content = re.sub(r"APIKEY\s*=\s*'.*?'", "APIKEY = os.environ.get('SPARK_API_KEY')", content)

# Make sure the required import statements are at the top of the file
if "import os" not in content:
    content = "import os\n" + content
if "from dotenv import" not in content:
    content = "from dotenv import load_dotenv, find_dotenv\n" + content

# Make sure the main block loads the environment variables
if "if __name__ == '__main__':" in content and "_ = load_dotenv(find_dotenv())" not in content:
    content = content.replace("if __name__ == '__main__':", "if __name__ == '__main__':\n    _ = load_dotenv(find_dotenv())")

# Write the modified contents back to the file
with open(new_file_name, "w", encoding="utf-8") as file:
    file.write(content)

print("File processing complete.")


File processing complete.
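
To see what the credential substitutions in the cell above do, the sketch below applies the same kind of patterns to a small hand-written snippet. The snippet's contents are an assumption made for illustration (the real `Embedding.py` may be laid out differently), and `patch_credentials` is a hypothetical helper name.

```python
import re

# Hand-written stand-in for the credential lines of the downloaded script.
SAMPLE_SOURCE = (
    "APPID ='your_appid'\n"
    "APISecret = 'your_secret'\n"
    "APIKEY = 'your_key'\n"
)

# Variable name in the script -> environment variable to read it from.
ENV_NAMES = {
    "APPID": "SPARK_APPID",
    "APISecret": "SPARK_API_SECRET",
    "APIKEY": "SPARK_API_KEY",
}


def patch_credentials(source: str) -> str:
    """Replace hard-coded credential strings with os.environ lookups."""
    for var, env in ENV_NAMES.items():
        # \s* tolerates any spacing around "="; '.*?' matches the quoted value lazily.
        pattern = rf"{var}\s*=\s*'.*?'"
        source = re.sub(pattern, f"{var} = os.environ.get('{env}')", source)
    return source


if __name__ == "__main__":
    print(patch_credentials(SAMPLE_SOURCE))
```

After patching, no literal credential remains in the text, so the script can be shared or committed while the real values stay in `.env`.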


## 4. Using the Zhipu API
Zhipu provides a ready-made SDK, so we can call it directly.

In [8]:
from zhipuai import ZhipuAI

def zhipu_embedding(text: str):
    # Read the API key from the environment variable ZHIPUAI_API_KEY
    api_key = os.environ['ZHIPUAI_API_KEY']
    client = ZhipuAI(api_key=api_key)
    response = client.embeddings.create(
        model="embedding-2",
        input=text,
    )
    return response

text = '要生成 embedding 的输入文本，字符串形式。'
response = zhipu_embedding(text=text)

`response` is of type `zhipuai.types.embeddings.EmbeddingsResponded`. Its `object`, `data`, `model`, and `usage` attributes give the returned object type, the embedding, the embedding model, and the token usage.

In [9]:
print(f'Response type: {type(response)}')
print(f'Returned object type: {response.object}')
print(f'Embedding model: {response.model}')
print(f'Embedding length: {len(response.data[0].embedding)}')
print(f'Embedding (first 10): {response.data[0].embedding[:10]}')

Response type: <class 'zhipuai.types.embeddings.EmbeddingsResponded'>
Returned object type: list
Embedding model: embedding-2
Embedding length: 1024
Embedding (first 10): [0.017892399802803993, 0.0644201710820198, -0.009342825971543789, 0.02707476168870926, 0.004067837726324797, -0.05597858875989914, -0.04223804175853729, -0.03003198653459549, -0.016357755288481712, 0.06777040660381317]
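
The OpenAI and Zhipu SDK responses shown in this document expose the same attribute names (`object`, `model`, `data[0].embedding`), so one small helper can summarize either of them. The sketch below is an illustration and not part of either SDK: `summarize_response` is a hypothetical name, and the demo uses a `SimpleNamespace` stand-in with made-up values so that it runs without an API key.

```python
from types import SimpleNamespace


def summarize_response(response, preview: int = 3) -> dict:
    """Summarize an SDK-style embedding response that has .object, .model and .data."""
    vector = response.data[0].embedding  # embedding of the first input text
    return {
        "object": response.object,
        "model": response.model,
        "dimensions": len(vector),
        "preview": list(vector[:preview]),
    }


if __name__ == "__main__":
    # Stand-in object with made-up values; a real SDK response would be passed instead.
    fake = SimpleNamespace(
        object="list",
        model="embedding-2",
        data=[SimpleNamespace(embedding=[0.1, 0.2, 0.3, 0.4])],
    )
    print(summarize_response(fake))
```

The Wenxin function returns a plain dictionary instead, so it would need key access (`response['data'][0]['embedding']`) rather than attributes.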
