## ChatGPT API - Text Embedding API

### Text Embeddings
* [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings/embedding-models) accepts inputs of up to 8192 tokens and returns a 1536-dimensional vector for each input by default.

In [None]:
import openai
from dotenv import load_dotenv

load_dotenv()  # load environment variables such as OPENAI_API_KEY from .env
 
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=["貓", "狗", "cat", "dog"],
)

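# The v1 SDK returns typed objects, so use attribute access rather than dict indexing.
for item in response.data:
    embedding = item.embedding
    # Print the dimension and only the first four values to keep the output short.
    print(len(embedding), embedding[:4])

"""
Output: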
1536 [-0.009648507460951805, -0.008709395304322243, 0.0031512430869042873, -0.029856810346245766]
1536 [-0.010666832327842712, -0.020034153014421463, -0.008433295413851738, -0.02460951916873455]
1536 [-0.0070539116859436035, -0.01734057068824768, -0.009698242880403996, -0.03073945827782154]
1536 [-0.003476932644844055, -0.01781758852303028, -0.01627529226243496, -0.017506422474980354]
"""

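### Common Usage - Comparing Text Similarity
Because this section computes Euclidean distance with `euclidean_distances`, a smaller number means the texts are closer. Other metrics such as [cosine similarity](https://w.wiki/neY) or the [dot product](https://w.wiki/A3Jz) work as well; for those, a larger value means the texts are more similar.
* Embeddings are useful for more than comparing text similarity: they also work across languages, which will matter a great deal when Retrieval-Based applications come up later.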
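The alternative metrics mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own method: the vectors `u` and `v` are made-up 3-dimensional stand-ins for real embeddings, and the helper names are hypothetical. OpenAI's documentation states that its embeddings are normalized to length 1; under that assumption the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so all three metrics rank texts in the same order.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors (assumption: NOT real embeddings), scaled to unit length.
u = normalize([1.0, 2.0, 2.0])
v = normalize([2.0, 1.0, 2.0])

d = euclidean_distance(u, v)
cos = cosine_similarity(u, v)

print(f"euclidean distance: {d:.4f}")
print(f"cosine similarity:  {cos:.4f}")
print(f"dot product:        {dot(u, v):.4f}")
# For unit-length vectors, d**2 and 2 - 2*cos agree up to rounding.
print("identity holds:", math.isclose(d ** 2, 2 - 2 * cos))
```

Since the rankings agree for unit-length vectors, the choice between these metrics is mostly a matter of convenience and of whether "smaller is better" or "larger is better" reads more naturally in the surrounding code.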

In [None]:
import openai
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from dotenv import load_dotenv

load_dotenv()  # load environment variables such as OPENAI_API_KEY from .env

# Avoid printing in scientific notation
np.set_printoptions(suppress=True, precision=4)

text = ["我喜歡貓咪", "i like cats", "來去看電影", "watch a movie"]
resp = openai.embeddings.create(
    input=text, 
    model="text-embedding-3-small"
)
embs = [emb.embedding for emb in resp.data]

print(euclidean_distances(embs, embs))
# """
# [[0.      0.98    1.20    1.32]
#  [0.98    0.      1.33    1.26]
#  [1.20    1.33    0.      0.93]
#  [1.32    1.26    0.93    0.  ]]
# """

query = ["我喜歡貓咪", "i like cats", "來去看電影", "watch a movie", "猫が好き", "映画を見に来てください"]
response = openai.embeddings.create(
    input=query,
    model="text-embedding-3-small",
)

print(text)
# For each query, print its distance to every reference text in `text`.
# The v1 SDK returns typed objects, so use attribute access rather than dict indexing.
for item, q in zip(response.data, query):
    distances = euclidean_distances([item.embedding], embs)[0]
    print(distances, q)

# """
# Output, lightly reformatted by hand:

# ["我喜歡貓咪", "i like cats", "來去看電影", "watch a movie"]
# [  0.49925489    0.72721379] i like cats
# [  0.77279439    0.54520915] watch a movie
# [  0.50884729    0.68798437] 猫が好き
# [  0.72506646    0.50975780] 映画を見に来てください
# """
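The comparison above amounts to a nearest-neighbor lookup: embed a query, measure its distance to each reference text, and pick the closest one. The sketch below shows only that ranking step, under stated assumptions: the 2-dimensional vectors are invented placeholders rather than real API output, and `rank_by_distance` is a hypothetical helper, not part of any library.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query_vec, corpus):
    """Return (distance, text) pairs sorted from nearest to farthest."""
    scored = [
        (euclidean_distance(query_vec, vec), text)
        for text, vec in corpus.items()
    ]
    return sorted(scored)


# Placeholder vectors (assumption: invented for illustration only).
corpus = {
    "我喜歡貓咪": [0.9, 0.1],
    "來去看電影": [0.1, 0.9],
}
# Stand-in for the embedding of a query such as "猫が好き".
query_vec = [0.8, 0.2]

ranking = rank_by_distance(query_vec, corpus)
for dist, text in ranking:
    print(f"{dist:.4f}  {text}")
```

With real embeddings, `corpus` would hold the vectors returned by `openai.embeddings.create`, and the first entry of `ranking` would be the reference text closest to the query.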