OpenAI uses `tiktoken` to split text into tokens. This notebook shows how OpenAI counts tokens.

The encoding determines how a piece of text is split into tokens. OpenAI models use the following three encodings, all supported by `tiktoken`:

1. cl100k_base: gpt-4, gpt-3.5-turbo, text-embedding-ada-002
2. p50k_base: text-davinci-002, text-davinci-003
3. r50k_base (or gpt2): GPT-3 models such as davinci
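
The list above can be restated as a small lookup table. This is only a sketch that mirrors the entries listed here; `ENCODING_BY_MODEL` and `encoding_name_for` are hypothetical names, and `tiktoken.encoding_for_model` (used later in this notebook) remains the authoritative mapping.

```python
# Illustrative lookup that mirrors the list above; it is not tiktoken's own table.
ENCODING_BY_MODEL = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the encoding name listed above for `model`, or raise KeyError."""
    return ENCODING_BY_MODEL[model]


print(encoding_name_for("gpt-4"))
```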

1. Install `tiktoken`

In [None]:
# %pip install --upgrade tiktoken > /dev/null

In [1]:

import os,sys
import openai
from dotenv import load_dotenv, find_dotenv
# sys.path.append("../..")

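# Load environment variables from the local/project .env file.

# find_dotenv() locates the .env file and returns its path.
# load_dotenv() reads that file and loads its variables into the current environment.
# If the variable is already set globally, this call has no effect.
print(find_dotenv())
_ = load_dotenv(find_dotenv())
# Caution: the next line prints the secret key; clear the output before sharing the notebook.
print(os.environ["OPENAI_API_KEY"])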

from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)



c:\Users\lenovo\Desktop\LangChainPlayGround\DeeperTutorials\.env
sk-********(redacted)


2. Encoding

In [7]:
import tiktoken

encoding_p50k_base = tiktoken.get_encoding("p50k_base")
encoding_for_davinci = tiktoken.encoding_for_model("text-davinci-002")


encoding_cl100k_base = tiktoken.get_encoding("cl100k_base")
encoding_for_gpt = tiktoken.encoding_for_model("gpt-4")


encoding_r50k_base = tiktoken.get_encoding("r50k_base")
encoding_for_davinci_1 = tiktoken.encoding_for_model("davinci")


In [8]:
text_chinese = '你好，朋友'

print(encoding_p50k_base.encode(text_chinese))
print(encoding_for_davinci.encode(text_chinese))

print(encoding_cl100k_base.encode(text_chinese))
print(encoding_for_gpt.encode(text_chinese))

print(encoding_r50k_base.encode(text_chinese))
print(encoding_for_davinci_1.encode(text_chinese))

[19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]
[19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]
[57668, 53901, 3922, 4916, 233, 98915]
[57668, 53901, 3922, 4916, 233, 98915]
[19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]
[19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]
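
The outputs above also show how much the choice of encoding affects the token count. The sketch below hard-codes the id lists printed above, so it runs without `tiktoken`, and compares them with the character and UTF-8 byte counts of the same string; `token_ids` is a hypothetical name.

```python
text_chinese = "你好，朋友"

# Token ids copied from the cell output above.
token_ids = {
    "p50k_base": [19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233],
    "cl100k_base": [57668, 53901, 3922, 4916, 233, 98915],
    "r50k_base": [19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233],
}

n_chars = len(text_chinese)
n_bytes = len(text_chinese.encode("utf-8"))
print(f"{n_chars} characters, {n_bytes} UTF-8 bytes")
for name, ids in token_ids.items():
    print(f"{name}: {len(ids)} tokens")
```

For this string, cl100k_base needs 6 tokens where p50k_base and r50k_base need 11, so the same prompt can consume a noticeably different number of tokens depending on the model.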


3. Decoding

In [9]:

print(encoding_p50k_base.decode([19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]))
print(encoding_for_davinci.decode([19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]))

print(encoding_cl100k_base.decode([57668, 53901, 3922, 4916, 233, 98915]))
print(encoding_for_gpt.decode([57668, 53901, 3922, 4916, 233, 98915]))

print(encoding_r50k_base.decode([19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]))
print(encoding_for_davinci_1.decode([19526, 254, 25001, 121, 171, 120, 234, 17312, 233, 20998, 233]))


你好，朋友
你好，朋友
你好，朋友
你好，朋友
你好，朋友
你好，朋友
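
Decoding the complete id list restores the original string, but five characters produced up to eleven ids, so a single character can span several tokens. These encodings work on UTF-8 bytes, which means a truncated id list may end in the middle of a character. The sketch below illustrates the effect with plain Python bytes rather than with `tiktoken` itself; it is an analogy under that assumption, not a reproduction of the library's internals.

```python
text = "你"
raw = text.encode("utf-8")  # one character, several bytes
print(len(raw))

# Dropping the last byte leaves an incomplete sequence. Decoding with
# errors="replace" substitutes U+FFFD instead of raising an exception.
partial = raw[:-1].decode("utf-8", errors="replace")
print(partial == text)
print("\ufffd" in partial)
```

`Encoding.decode` in `tiktoken` accepts an `errors` argument with the same meaning, so a truncated id list may come back containing replacement characters rather than the original text.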


4. How the OpenAI Chat API counts tokens; see the official cookbook [link](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb)

In [10]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
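
The fixed overhead in the function above can be checked separately from the tokenizer. The sketch below repeats the same arithmetic with the token counter passed in as a parameter, so it runs without downloading an encoding; `count_prompt_tokens` and the whitespace-based `fake_count` are hypothetical stand-ins, not part of `tiktoken` or the OpenAI API.

```python
def count_prompt_tokens(messages, count, tokens_per_message, tokens_per_name):
    """Same arithmetic as num_tokens_from_messages(), with an injected counter."""
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += count(value)
            if key == "name":
                total += tokens_per_name
    return total + 3  # the reply is primed with three tokens


def fake_count(text):
    # Stand-in tokenizer for illustration only: one token per whitespace-separated word.
    return len(text.split())


messages = [
    {"role": "system", "content": "Translate to Chinese."},
    {"role": "user", "name": "Alice", "content": "The sky is blue."},
]

content = sum(fake_count(v) for m in messages for v in m.values())
for label, per_message, per_name in [("gpt-3.5-turbo-0301", 4, -1), ("gpt-4-0314", 3, 1)]:
    total = count_prompt_tokens(messages, fake_count, per_message, per_name)
    print(label, "overhead:", total - content)
```

With two messages, one of which carries a `name`, both settings add 10 tokens of overhead: 2*4 - 1 + 3 for gpt-3.5-turbo-0301 and 2*3 + 1 + 3 for gpt-4-0314.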

5. Example

In [11]:
import openai

example_messages = [
    {
        "role": "system",
        "content": "你是翻译助理，请帮我将英文翻译成中文，谢谢。请只回复翻译文字，不要回复其他内容。",
    },
    {
        "role": "user",
        "name": "Alice",
        "content": "The sky is blue.",
    },
]

for model in ["gpt-3.5-turbo", "gpt-4"]:
    print(model)
    # Token count from the function implemented above
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # Token count reported by the OpenAI API
    response = openai.ChatCompletion.create(
        model=model,
        messages=example_messages,
        temperature=0,
        max_tokens=1  # only the prompt token count is needed, so keep the completion as short as possible
    )
    print(f'{response["usage"]["prompt_tokens"]} prompt tokens counted by the OpenAI API.')
    print()


gpt-3.5-turbo
67 prompt tokens counted by num_tokens_from_messages().


AuthenticationError: No API key provided. You can set your API key in code using 'openai.api_key = <API-KEY>', or you can set the environment variable OPENAI_API_KEY=<API-KEY>). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = <PATH>'. You can generate API keys in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details.
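
The `AuthenticationError` above means the `openai` package had no API key when the request was sent, so the count reported by the API was never obtained. A plausible cause is that `OPENAI_API_KEY` was not loaded in the session that ran this cell. The sketch below fails early with a clearer message and returns the key so it can be assigned explicitly, as the error text itself suggests (`openai.api_key = <API-KEY>`); `require_api_key` and `mask` are hypothetical helpers.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; load the .env file before calling the API.")
    return key


def mask(key, visible=3):
    """Shorten a secret for logging so the full value never reaches the output."""
    return key[:visible] + "..."


# Demonstration with a dummy mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": "sk-dummy-value"}
print(mask(require_api_key(demo_env)))
```

In this notebook the helper could be used as `openai.api_key = require_api_key()` after `load_dotenv(find_dotenv())` and before the loop over models.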