# Embeddings

* Embeddings are commonly used for:
  * Search (a query is compared against the items in a database, and results are ranked from most to least similar)  
  * Clustering  
  * Recommendations (where items with related text strings are recommended)  
  * Anomaly detection  
  * Diversity measurement (where similarity distributions are analyzed)  
  * Classification (where text strings are classified by their most similar label)  
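
Most of these use cases come down to comparing vectors. As a minimal sketch of the search case, the snippet below ranks a few documents by cosine similarity to a query. It assumes embeddings are plain lists of floats; the toy 3-dimensional vectors are made up for illustration, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values)
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query, docs)
print([doc_id for doc_id, _ in ranking])
```

The same scoring function could serve recommendation or classification by swapping what the candidate vectors represent (items or labels instead of documents).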

## How to get embeddings

* Use OpenAI's embeddings API endpoint  
* Here is an example:

In [3]:
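import openai
import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY from a local .env file into the environment
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Request an embedding (openai.Embedding is the interface of openai-python releases before 1.0)
response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)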

response

<OpenAIObject list at 0x7fc03c3aea90> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.007009037770330906,
        -0.005365979392081499,
        0.011887812055647373,
        -0.024931475520133972,
        -0.024649232625961304,
        0.039729081094264984,
        -0.010133872739970684,
        -0.009421544149518013,
        -0.013171345926821232,
        -0.009898670017719269,
        -0.01161900907754898,
        0.007862486876547337,
        -0.014098715968430042,
        0.007728085853159428,
        0.010160752572119236,
        -0.005087096244096756,
        0.022928893566131592,
        -0.001687578740529716,
        0.01491856575012207,
        -0.010288434103131294,
        0.004841813817620277,
        0.012479178607463837,
        0.004882134031504393,
        0.010859640315175056,
        -0.006592392921447754,
        -0.00038892432348802686,
        0.005597821902483702,
        -0.012593419291

* As you can see, the `embedding` field under `data` in the response is what I want  
* So I can write a simple function to fetch the embedding:

In [4]:
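def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before sending the text
    text = text.replace("\n", " ")
    response = openai.Embedding.create(input=[text], model=model)
    # The vector lives under data[0].embedding in the response
    return response['data'][0]['embedding']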

In [6]:
embed1 = get_embedding("打中文嘛欸通啦")

print(embed1[:5])
print(len(embed1))

[-0.012645913287997246, 0.005694281775504351, 0.011921784840524197, -0.008966024965047836, -0.0233827605843544]
1536


## A ready-made function for getting embeddings

* In fact, the library already ships the function above, so you can just call it directly:

In [11]:
from openai.embeddings_utils import get_embedding
embed1 = get_embedding("打中文嘛欸通啦", engine = "text-embedding-ada-002") # the default engine is the first-generation text-similarity-davinci-001
print(embed1[:5])
print(len(embed1))

[-0.012645913287997246, 0.005694281775504351, 0.011921784840524197, -0.008966024965047836, -0.0233827605843544]
1536


## Embedding models

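* In the function above I picked `text-embedding-ada-002` as the model. What other options are there?  
* Bottom line first: as of 2023/09, the official docs recommend simply using this one, so don't overthink it (see the official docs [here](https://platform.openai.com/docs/guides/embeddings/embedding-models))
* Its characteristics:  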
  * The tokenizer is `cl100k_base`  
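  * The input can be at most 8191 tokens  
  * The output has 1536 dimensions (whether the input is a single word or a whole sentence, it ends up as a 1536-dimensional vector)  
* Still, here are the models you can choose from:  
  * First-generation embedding models end in `-001`, for example `*-davinci-*-001` and `*-curie-*-001`; there are 16 of them in total  
  * Second-generation models end in `-002`, for example `text-embedding-ada-002`, which is faster, cheaper, and simpler to use  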

## Data for use cases

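* Next, we prepare a dataset and embed its text so it can be used in a series of applications later
* The data comes from the Amazon [fine-food reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?select=Reviews.csv) dataset  
* It records Amazon users' reviews of the food products they bought, up to 2012/10  
* Let's see what the data looks like: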

In [2]:
import pandas as pd
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ...."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...


* Each row records when (Time) a user (UserId) reviewed a product (ProductId), the rating given (Score), and the written review (title in Summary, body in Text)  
* To look closer, here are the Summary and Text of the row with index 4:

In [3]:
print(df.Summary[4])
print(df.Text[4])

Happy with the product
My dog was suffering with itchy skin.  He had been eating Natural Choice brand (cheaper) since he was a puppy.  I was nervous to change foods.  The vet suggested to change foods sand see if the skin issues cleared up.  Wellness brand did the job.  My dog seems to love the food and the skin issues cleared up within a few weeks.


* I want to embed the `combined` column. As mentioned earlier, the input to the embedding model must not exceed 8191 tokens, so let's first check how many tokens each review has:

In [13]:
import tiktoken
encoding = tiktoken.get_encoding(encoding_name = "cl100k_base")
test = encoding.encode(df.combined[4])
print(len(test))
print(test[:10])

86
[3936, 25, 24241, 449, 279, 2027, 26, 9059, 25, 3092]


* As you can see, the review at index 4 is split into 86 tokens  
* Now I can compute the token count for every review:

In [14]:
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))

* Check what the maximum is:

In [None]:
df.n_tokens.max()

645

* Looks fine: the largest token count is only 645, far below the 8191 limit  
* So let's convert the whole `combined` column into embeddings:

In [16]:
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, engine="text-embedding-ada-002"))
df.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined,n_tokens,embedding
0,1351123200,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,52,"[0.007052760571241379, -0.02730366960167885, 0..."
1,1351123200,B003JK537S,A3JBPC3WFUT5ZP,1,Arrived in pieces,"Not pleased at all. When I opened the box, mos...",Title: Arrived in pieces; Content: Not pleased...,35,"[-0.023622721433639526, -0.011844820342957973,..."
2,1351123200,B000JMBE7M,AQX1N6A51QOKG,4,"It isn't blanc mange, but isn't bad . . .",I'm not sure that custard is really custard wi...,"Title: It isn't blanc mange, but isn't bad . ....",267,"[0.00016697357932571322, 0.005226491950452328,..."
3,1351123200,B004AHGBX4,A2UY46X0OSNVUQ,3,These also have SALT and it's not sea salt.,I like the fact that you can see what you're g...,Title: These also have SALT and it's not sea s...,239,"[0.010532955639064312, -0.01354704238474369, 0..."
4,1351123200,B001BORBHO,A1AFOYZ9HSM2CZ,5,Happy with the product,My dog was suffering with itchy skin. He had ...,Title: Happy with the product; Content: My dog...,86,"[0.015446762554347515, -0.003920299932360649, ..."
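
Every review here is well under the token limit, so nothing needed trimming. For a dataset where some texts do exceed it, one option is to truncate the token list before embedding. The following is a minimal sketch, assuming token IDs arrive as a plain list of ints (as `encoding.encode` returns); `MAX_TOKENS`, `truncate_tokens`, and `over_limit` are hypothetical helper names, not part of `tiktoken` or `openai`.

```python
MAX_TOKENS = 8191  # input limit quoted earlier for text-embedding-ada-002

def truncate_tokens(tokens, max_tokens=MAX_TOKENS):
    """Keep at most max_tokens token IDs, dropping the tail."""
    return tokens[:max_tokens]

def over_limit(token_counts, max_tokens=MAX_TOKENS):
    """Return the positions whose token count exceeds the limit."""
    return [i for i, n in enumerate(token_counts) if n > max_tokens]

# Hypothetical token counts: only the middle entry is too long
counts = [86, 9000, 645]
print(over_limit(counts))

# Stand-in for the token IDs of an overly long text
fake_tokens = list(range(9000))
print(len(truncate_tokens(fake_tokens)))
```

With `tiktoken`, the truncated IDs could then be turned back into text with `encoding.decode` before calling the embedding function. Truncation discards the end of the text, so splitting long texts into chunks may be preferable when the tail matters.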


* Save this file so it can be used later

In [17]:
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")
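
One thing to watch when loading this file again: CSV stores each embedding as the text of a Python list, so the column comes back as strings rather than lists of floats. The sketch below shows the round trip with the standard library only, using a hypothetical two-row CSV with toy 3-dimensional embeddings in place of the real file.

```python
import ast
import csv
import io

# Hypothetical CSV shaped like the saved file, with toy 3-dim embeddings
csv_text = 'Score,embedding\n5,"[0.1, -0.2, 0.3]"\n1,"[0.0, 0.5, -0.5]"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Straight after reading, the embedding is still a string
print(type(rows[0]["embedding"]))

# Parse the string form back into a list of floats
for row in rows:
    row["embedding"] = ast.literal_eval(row["embedding"])

print(rows[0]["embedding"])
```

With pandas, the same conversion could be applied to the column after `pd.read_csv`, for example `df["embedding"].apply(ast.literal_eval)`.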