## How to choose the right embedding model

- Prepare a list of candidate embedding models (OpenAI, Cohere, e5-base-v2)
- Convert the target dataset into embeddings
- Randomly select a test set and compute evaluation metrics

---

In [1]:
import pandas as pd
import os
import random
import cohere
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
import openai
from openai import OpenAI
from tqdm.notebook import tqdm
from dotenv import load_dotenv
load_dotenv()

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# initialize openai
openai.api_key = os.environ["OPENAI_API_KEY"]

# initialize cohere
co = cohere.Client(api_key=os.environ["CO_API_KEY"])

import warnings
warnings.filterwarnings('ignore')


### Read dataset

In [2]:
df = pd.read_csv("../data/quora_dataset.csv")

In [3]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1
2,How can I be a good geologist?,15,[16],1
3,What should I do to be a great geologist?,16,[15],1
4,How do I read and find my YouTube comments?,23,[24],1


### 1. Playground

In [4]:
text1 = df.loc[2, 'text']
print(text1)

How can I be a good geologist?


In [5]:
text2 = df.loc[3, 'text']
print(text2)

What should I do to be a great geologist?


In [19]:
def create_embeddings(txt_list, provider='openai'):
    # accept a single string as well as a list of strings
    if isinstance(txt_list, str):
        txt_list = [txt_list]

    if provider == 'openai':
        client = OpenAI()

        # maximum number of inputs sent per request
        max_length = 2048

        # split the inputs into chunks of at most max_length items
        responses = []
        for i in range(0, len(txt_list), max_length):
            response = client.embeddings.create(
                input=txt_list[i:i+max_length],
                model="text-embedding-3-small")
            responses.extend([r.embedding for r in response.data])
        return responses

    elif provider == 'cohere':
        doc_embeds = co.embed(
            texts=txt_list,
            input_type="search_document",
            model="embed-english-v3.0")
        return doc_embeds.embeddings
    else:
        raise ValueError(f"Unknown provider: {provider!r}")

In [9]:
emb1 = create_embeddings(df.loc[2, 'text'])
emb2 = create_embeddings(df.loc[3, 'text'])

In [10]:
from utils import cosine_similarity
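
The `utils` module imported above is not shown in this notebook. As a minimal sketch of what such a helper could look like (an assumption, not the actual `utils.cosine_similarity`), cosine similarity is the dot product of two vectors divided by the product of their norms:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical stand-in for utils.cosine_similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the result depends only on direction, embeddings of different magnitudes can be compared without normalizing them first.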

In [11]:
# similarity between two embeddings
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb2[0]), text1, text2))

Cosine similarity : 0.9153451440997994.
Sentences used : 
How can I be a good geologist?
What should I do to be a great geologist?


In [12]:
text3 = df.loc[4, 'text']

emb3 = create_embeddings(text3)
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb3[0]), text1, text3))

Cosine similarity : 0.18174818369524162.
Sentences used : 
How can I be a good geologist?
How do I read and find my YouTube comments?


In [13]:
text4 = df.loc[6, 'text']

emb4 = create_embeddings(text4)
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb4[0]), text1, text4))

Cosine similarity : 0.279567739394289.
Sentences used : 
How can I be a good geologist?
What can make Physics easy to learn?


---

### 2. Building the embedding vector dataset

openai embeddings

In [20]:
# create embeddings (openai)
openai_emb = create_embeddings(df.text.tolist(), provider='openai')

In [23]:
df['openai_emb'] = openai_emb

cohere embeddings

In [24]:
# create embeddings (cohere)
cohere_emb = create_embeddings(df.text.tolist(), 'cohere')

In [25]:
df['cohere_emb'] = cohere_emb

e5 embeddings

In [26]:
# load gpu if possible
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"

# init tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [27]:
def create_e5_emb(docs, model=model):
    """
    Create embedding vectors with the e5 embedding model
    """
    docs = [f"query: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
        last_hidden = out.last_hidden_state.masked_fill( # from last hidden state
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        # average out embeddings per token (non-padding)
        doc_embeds = last_hidden.sum(dim=1) / tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()
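
The pooling step in `create_e5_emb` averages token vectors while ignoring padding positions. The sketch below reproduces that idea in plain NumPy on a toy tensor; the array values and the name `masked_mean_pool` are illustrative assumptions, not part of the notebook:

```python
import numpy as np

# Toy "last hidden state": 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, counting only non-padding tokens."""
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                   # padding contributes 0
    counts = mask.sum(axis=1)                              # real tokens per sequence
    return summed / counts

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)  # one vector per sequence
```

Dividing by the number of real tokens rather than the padded length keeps short questions from being pulled toward zero when they are batched with long ones.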

Note: long runtime (about 2 hours)

In [28]:
data = df.text.tolist()
batch_size = 128

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    data_batch = data[i:i_end]
    # embed current batch
    embed_batch = create_e5_emb(data_batch)
    if i == 0:
        emb3 = embed_batch.copy()
    else:
        emb3 = np.concatenate([emb3, embed_batch.copy()])

  0%|          | 0/44 [00:00<?, ?it/s]

In [29]:
# convert to plain Python floats so the CSV round-trip through json.loads stays parseable
emb3 = [e.tolist() for e in emb3]
df['e5_emb'] = emb3

In [30]:
df.to_csv("../data/quora_dataset_emb.csv", index=False)

Load the dataset with precomputed embeddings

In [31]:
df = pd.read_csv("../data/quora_dataset_emb.csv")
# convert str -> list
import json
df['openai_emb'] = df['openai_emb'].apply(json.loads)
df['cohere_emb'] = df['cohere_emb'].apply(json.loads)
df['e5_emb'] = df['e5_emb'].apply(json.loads)
df['duplicated_questions'] = df['duplicated_questions'].apply(json.loads)

In [32]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1,"[-0.005708687007427216, -0.018573936074972153,...","[-0.05834961, -0.010795593, -0.04522705, 0.035...","[0.059878025, -0.15769613, -0.14131476, -0.546..."
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1,"[0.026049701496958733, -0.014290662482380867, ...","[-0.022338867, -0.0063285828, -0.057128906, 0....","[0.08937602, -0.29545033, -0.33455348, -0.3294..."
2,How can I be a good geologist?,15,[16],1,"[0.005289203487336636, 0.004235886037349701, 0...","[-0.012535095, 0.005092621, -0.033233643, -0.0...","[0.082580894, -0.09264571, -0.78053635, -0.324..."
3,What should I do to be a great geologist?,16,[15],1,"[0.015141667798161507, 0.0010603171540424228, ...","[-0.013465881, 0.0018148422, -0.052612305, 0.0...","[-0.16533084, 0.19044475, -0.8906654, -0.36435..."
4,How do I read and find my YouTube comments?,23,[24],1,"[0.03507072106003761, -0.0010471956338733435, ...","[-0.0047836304, 0.028137207, -0.037231445, -0....","[0.50644594, -0.62657803, -0.25233975, -0.1711..."


### 3. Selecting the test set

Select random questions to use for testing

In [33]:
# sample 1,000 question ids at random (with replacement, so some ids may repeat)
test_query = random.choices(df.id, k=1000)

In [34]:
test_query[:5]

[1440, 4962, 4878, 287, 12591]
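
`random.choices` draws with replacement, so the same id can be picked more than once and the resulting test set can end up smaller than `k` after deduplication. The sketch below contrasts it with `random.sample`, which draws without replacement; the id list and the seed are assumptions made only so the example is repeatable:

```python
import random

random.seed(0)          # fixed seed only for repeatability of this sketch
ids = list(range(100))  # hypothetical stand-in for df.id

with_replacement = random.choices(ids, k=50)    # repeats are possible
without_replacement = random.sample(ids, k=50)  # every id is distinct

print(len(set(with_replacement)))     # at most 50
print(len(set(without_replacement)))  # exactly 50
```

If an exact test-set size matters, `random.sample` (or `df.sample(n=...)`) is the safer choice.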

In [35]:
test = df.loc[df.id.isin(test_query)]

Retrieve the top-k most similar questions for each test question

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

def search_top_k(search_df, search_df_column, id, topk):
    """
    search_df : dataframe to search
    search_df_column : name of the embedding column used for the search
    id : test query id
    topk : number of most similar rows to return
    """
    query = search_df.loc[search_df['id']==id, search_df_column].values[0]
    query_reshaped = np.array(query).reshape(1, -1)
    
    # exclude the query itself; copy so the new column is not written to a view of the input
    search_df = search_df.loc[search_df['id']!=id].copy()
    # cosine similarity in batch
    similarities = cosine_similarity(query_reshaped, np.vstack(search_df[search_df_column].values)).flatten()
    
    search_df['similarity'] = similarities
    
    # Get top-k indices
    # argpartition returns the top-k in arbitrary order,
    # hence we sort the topk indices again so they are ranked by similarity
    topk_indices = np.argpartition(similarities, -topk)[-topk:]
    topk_indices_sorted = topk_indices[np.argsort(-similarities[topk_indices])]
    
    # Retrieve the top-k results
    search_result = search_df.iloc[topk_indices_sorted]
    
    return search_result
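
The two-step top-k selection in `search_top_k` can be seen in isolation on a small array. This is a sketch with made-up similarity values: `np.argpartition` finds the k largest entries without fully sorting, and only those k candidates are then ordered.

```python
import numpy as np

similarities = np.array([0.10, 0.90, 0.35, 0.75, 0.60, 0.20])  # made-up scores
topk = 3

# Step 1: indices of the 3 largest values, in no particular order.
candidate_idx = np.argpartition(similarities, -topk)[-topk:]

# Step 2: order just those candidates by descending similarity.
ranked_idx = candidate_idx[np.argsort(-similarities[candidate_idx])]

print(ranked_idx)                # [1 3 4]
print(similarities[ranked_idx])  # [0.9  0.75 0.6 ]
```

For a few thousand rows a full `argsort` would also be fast enough; the partition step mainly pays off on much larger collections.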


- For each test question, compute cosine_similarity against the entire dataset
- Retrieve the top-k questions separately for each embedding (OpenAI, Cohere, e5)
- search_result format :
```json
{
    'question id' : pd.DataFrame holding the top-k most similar questions by cosine_sim,
    'question id' : ...
}
```

In [37]:
# the test question itself would always come out as the most similar row,
# so take the top-5 excluding the test question
query_results_openai = { k:search_top_k(df, 'openai_emb', k, 5) for k in test.id }
query_results_cohere = { k:search_top_k(df, 'cohere_emb', k, 5) for k in test.id }
query_results_e5 = { k:search_top_k(df, 'e5_emb', k, 5) for k in test.id }

A peek at the test results

In [38]:
test.loc[test.length==3].tail()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
4784,What is best investment option?,12694,"[12695, 14212, 14211]",3,"[0.006630514282733202, 0.01731019653379917, 0....","[0.0029945374, -0.028869629, 0.028793335, 0.00...","[-0.876402, -0.50939333, -0.9197594, -0.143355..."
5054,How do I improve my writing skills?,13556,"[3066, 13555, 3065]",3,"[0.03843690827488899, 0.006117657292634249, 0....","[-0.029449463, 0.025756836, -0.027435303, 0.03...","[0.33717027, -0.27030376, -0.75061697, 0.00306..."
5099,What does semen taste like?,13680,"[4395, 4396, 13681]",3,"[-0.022604946047067642, -0.0003302536206319928...","[0.017456055, 0.042053223, -0.01852417, 0.0114...","[-0.09342691, -0.1040042, -0.6521572, 0.000391..."
5274,What same food should I eat every day to prote...,14182,"[6194, 854, 14183]",3,"[-0.004422239027917385, -0.04021645337343216, ...","[0.023086548, 0.03036499, -0.08929443, -0.0823...","[0.18870227, -0.30070493, -0.93830246, -0.1201..."
5404,How do I my increase memory power?,14637,"[14636, 10681, 10682]",3,"[0.022613229230046272, -0.005336236674338579, ...","[0.0038757324, 0.027053833, -0.05279541, -0.02...","[-0.31909928, -0.46033445, -0.46627197, -0.031..."


In [40]:
test.loc[test['id']==14182, 'text'].values

array(['What same food should I eat every day to protect my health?'],
      dtype=object)

In [41]:
query_results_openai[14182]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
325,What can I eat every day to be more healthy?,854,"[1038, 14182, 7834, 1039, 855, 14183, 8569, 6194]",8,"[0.011935393325984478, -0.010704121552407742, ...","[0.02798462, 0.04119873, -0.091918945, -0.0729...","[-0.025168408, -0.6016849, -1.001702, -0.25107...",0.772547
5275,Is it healthy to eat a tomato every day?,14183,"[6194, 1039, 14182, 8569, 855, 854, 7834, 1038]",8,"[-0.0049662101082503796, -0.03403849899768829,...","[0.03652954, 0.015296936, -0.04888916, -0.0508...","[-0.1945919, -0.5700765, -0.9981252, -0.337994...",0.595106
2976,Is it bad for health to eat eggs every day?,7834,"[1039, 854, 7835, 8569, 1038, 14183, 6194, 855]",8,"[0.029310524463653564, -0.01589602790772915, 0...","[0.045318604, 0.013931274, -0.057281494, -0.03...","[-0.16033919, -0.2664684, -0.9629141, -0.30153...",0.594825
400,Is it healthy to eat one chicken every day?,1039,"[854, 7834, 1038, 8569, 6194, 14183, 855]",7,"[0.055881962180137634, -0.02762104570865631, 0...","[0.030822754, 0.008201599, -0.07324219, -0.066...","[0.08981193, -0.37340692, -1.0105896, -0.21025...",0.587366
2365,Is it healthy to eat bread every day?,6194,"[1038, 854, 7834, 1039, 14182, 8569, 855, 14183]",8,"[0.022258378565311432, -0.010956810787320137, ...","[0.0317688, 0.041259766, -0.07977295, -0.05123...","[0.08884292, -0.286787, -1.2354673, -0.3420725...",0.586837


In [42]:
query_results_cohere[14182]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
325,What can I eat every day to be more healthy?,854,"[1038, 14182, 7834, 1039, 855, 14183, 8569, 6194]",8,"[0.011935393325984478, -0.010704121552407742, ...","[0.02798462, 0.04119873, -0.091918945, -0.0729...","[-0.025168408, -0.6016849, -1.001702, -0.25107...",0.838867
2365,Is it healthy to eat bread every day?,6194,"[1038, 854, 7834, 1039, 14182, 8569, 855, 14183]",8,"[0.022258378565311432, -0.010956810787320137, ...","[0.0317688, 0.041259766, -0.07977295, -0.05123...","[0.08884292, -0.286787, -1.2354673, -0.3420725...",0.783043
400,Is it healthy to eat one chicken every day?,1039,"[854, 7834, 1038, 8569, 6194, 14183, 855]",7,"[0.055881962180137634, -0.02762104570865631, 0...","[0.030822754, 0.008201599, -0.07324219, -0.066...","[0.08981193, -0.37340692, -1.0105896, -0.21025...",0.779106
5275,Is it healthy to eat a tomato every day?,14183,"[6194, 1039, 14182, 8569, 855, 854, 7834, 1038]",8,"[-0.0049662101082503796, -0.03403849899768829,...","[0.03652954, 0.015296936, -0.04888916, -0.0508...","[-0.1945919, -0.5700765, -0.9981252, -0.337994...",0.778738
399,Is it healthy to eat egg whites every day?,1038,"[7834, 854, 855, 6194, 14183, 8569, 1039]",7,"[0.04111170023679733, -0.016867700964212418, 0...","[0.012512207, 0.01525116, -0.08520508, -0.0538...","[-0.24898852, -0.40012923, -1.0762928, -0.2151...",0.774438


In [43]:
query_results_e5[14182]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
325,What can I eat every day to be more healthy?,854,"[1038, 14182, 7834, 1039, 855, 14183, 8569, 6194]",8,"[0.011935393325984478, -0.010704121552407742, ...","[0.02798462, 0.04119873, -0.091918945, -0.0729...","[-0.025168408, -0.6016849, -1.001702, -0.25107...",0.945122
794,What food should I eat to gain weight?,2049,[6772],1,"[0.03730776160955429, -0.008663864806294441, -...","[0.0020828247, 0.019592285, -0.05657959, 0.011...","[-0.2074798, -0.5919306, -1.0544997, -0.356902...",0.877167
2365,Is it healthy to eat bread every day?,6194,"[1038, 854, 7834, 1039, 14182, 8569, 855, 14183]",8,"[0.022258378565311432, -0.010956810787320137, ...","[0.0317688, 0.041259766, -0.07977295, -0.05123...","[0.08884292, -0.286787, -1.2354673, -0.3420725...",0.875936
400,Is it healthy to eat one chicken every day?,1039,"[854, 7834, 1038, 8569, 6194, 14183, 855]",7,"[0.055881962180137634, -0.02762104570865631, 0...","[0.030822754, 0.008201599, -0.07324219, -0.066...","[0.08981193, -0.37340692, -1.0105896, -0.21025...",0.874174
2976,Is it bad for health to eat eggs every day?,7834,"[1039, 854, 7835, 8569, 1038, 14183, 6194, 855]",8,"[0.029310524463653564, -0.01589602790772915, 0...","[0.045318604, 0.013931274, -0.057281494, -0.03...","[-0.16033919, -0.2664684, -0.9629141, -0.30153...",0.871546


### 4. Defining the scoring function

- Assign an accuracy score to each question
    - Accuracy score : of the questions retrieved as similar, how many are actually labeled as duplicates?

In [44]:
def score_accuracy(full_df, tmp_df, test_id):
    """
    Among the questions retrieved as similar to a test question, count how many are actually listed in its duplicated_questions
    """
    duplicated_questions = full_df.loc[full_df['id'] == test_id, 'duplicated_questions'].values[0]

    # exclude the query's own ID
    filtered_df = tmp_df[tmp_df['id'] != test_id]
    # count how many retrieved IDs appear in the test question's duplicate list
    match_count = filtered_df['id'].isin(duplicated_questions).sum()

    # Normalize by the smaller of (rows retrieved, known duplicates),
    # so a query with fewer duplicates than top-k can still reach 1.0
    percentage = match_count / min(filtered_df.shape[0], len(duplicated_questions))
    return percentage
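
To make the normalization concrete, here is a small sketch of the same rule on plain lists. `toy_accuracy` is a hypothetical helper, not part of the notebook; the ids are taken from the retrieval tables shown earlier (query 1004 with OpenAI, query 14182 with e5).

```python
def toy_accuracy(retrieved_ids, duplicate_ids):
    """Hits divided by min(#retrieved, #known duplicates)."""
    duplicates = set(duplicate_ids)
    hits = sum(1 for r in retrieved_ids if r in duplicates)
    return hits / min(len(retrieved_ids), len(duplicate_ids))

# Query 1004 has one labeled duplicate (1005) and it is in the top-5.
print(toy_accuracy([1005, 12734, 9849, 3473, 1903], [1005]))

# Query 14182 has three labeled duplicates; the e5 top-5 contains two of them.
print(toy_accuracy([854, 2049, 6194, 1039, 7834], [6194, 854, 14183]))
```

Note that this is neither plain precision@k nor plain recall: dividing by the smaller count means a question with a single duplicate scores 1.0 as soon as that duplicate appears anywhere in the top-k.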


In [45]:
accuracy_openai = [score_accuracy(df, query_results_openai[i], i) for i in query_results_openai.keys()]
accuracy_cohere = [score_accuracy(df, query_results_cohere[i], i) for i in query_results_cohere.keys()]
accuracy_e5 = [score_accuracy(df, query_results_e5[i], i) for i in query_results_e5.keys()]

In [46]:
np.mean(accuracy_openai)

0.9537016973636692

In [47]:
np.mean(accuracy_cohere)

0.9536114120621163

In [48]:
np.mean(accuracy_e5)

0.9427591188154569
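
The three means above are close, so it may help to check how much of the gap could be sampling noise. One rough way, sketched below under the assumption that per-query scores are roughly independent, is to report the standard error of each mean. The score lists here are made up for illustration and are not the notebook's values.

```python
import math
import statistics

def mean_with_stderr(scores):
    """Mean of per-query scores and the standard error of that mean."""
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr

# Hypothetical per-query scores for two models.
scores_a = [1.0, 1.0, 0.5, 1.0, 0.0, 1.0, 1.0, 1.0]
scores_b = [1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0]

for name, scores in [("a", scores_a), ("b", scores_b)]:
    mean, stderr = mean_with_stderr(scores)
    print(f"{name}: mean={mean:.3f} +/- {stderr:.3f}")
```

In the notebook the same helper could be called as `mean_with_stderr(accuracy_openai)`; if the gap between two models is smaller than their standard errors, the ranking between them should not be read as meaningful.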

A peek at the wrong answers

In [49]:
indices = [index for index, value in enumerate(accuracy_openai) if value <= 0.5]

In [50]:
indices

[3,
 11,
 15,
 23,
 58,
 98,
 155,
 271,
 272,
 316,
 346,
 367,
 394,
 469,
 483,
 486,
 491,
 496,
 556,
 619,
 641,
 662,
 708,
 751,
 779,
 808,
 813,
 815,
 828,
 832,
 884,
 889,
 907,
 916,
 922]

In [51]:
list(query_results_openai.keys())[60]

1004

In [55]:
test.loc[test['id']==1004]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
381,Is civil war likely after the US presidential ...,1004,[1005],1,"[-0.005477967206388712, 0.001724598347209394, ...","[0.06341553, 0.007911682, 0.027252197, 0.00665...","[-0.21286142, -0.51795757, -0.6739897, 0.21920..."


In [57]:
query_results_openai[1004]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
382,Is the US at risk of some type of uprising aft...,1005,[1004],1,"[-0.012636340223252773, 0.023401089012622833, ...","[0.094177246, 0.030075073, 0.016159058, 0.0226...","[-0.015371838, -0.51635826, -0.7222448, 0.4520...",0.661845
4793,Is Clinton likely to win the election?,12734,"[8879, 12733]",2,"[0.03306662663817406, -0.0001477139157941565, ...","[0.059326172, 0.029678345, 0.022003174, -0.008...","[-0.19828011, -0.8402219, -0.8515694, 0.171634...",0.575099
3768,Who is going to win the presidential election?,9849,"[14362, 5912, 2026, 9848, 5913, 2025]",6,"[0.03162300959229469, -0.045507289469242096, 0...","[0.04626465, 0.01763916, 0.004360199, -0.03671...","[-0.13966073, -0.6944983, -0.8997508, 0.458869...",0.56372
1303,What will happen if Donald Trump wins the elec...,3473,"[6741, 6740]",2,"[-0.011887616477906704, 0.03239825740456581, 0...","[0.046722412, 0.020050049, 0.010284424, -0.022...","[-0.24453327, -0.47462377, -0.28943318, 0.3990...",0.557842
734,"Realistically speaking, what would happen to t...",1903,[1904],1,"[-0.02038819156587124, 0.043668001890182495, 0...","[0.0619812, -0.0063705444, 0.006061554, 0.0156...","[-0.3581315, -0.40213132, -0.30093434, 0.53169...",0.548706


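#### Conclusion

- Cohere, OpenAI, and e5 all score highly on this test (mean accuracy of roughly 0.94 to 0.95), so any of them can be used as-is for most tasks.
- When you plan to use a local embedding model, check its classification performance and resource requirements with the same procedure as above.
- Ways to evaluate performance
    - Use a labeled dataset
    - Qualitative evaluation
        - When there is not enough labor available to label data
        - In domains where labeling is ambiguous (there is no single correct answer)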
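
For the resource check on a local model, a simple starting point is to time the embedding function batch by batch. The sketch below assumes a hypothetical `time_per_item` helper and a fake embedding function standing in for something like `create_e5_emb`; it measures wall-clock time only, not memory or GPU usage.

```python
import time

def time_per_item(embed_fn, items, batch_size=128):
    """Average wall-clock seconds per item when embedding batch by batch."""
    start = time.perf_counter()
    for i in range(0, len(items), batch_size):
        embed_fn(items[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return elapsed / len(items)

# Hypothetical stand-in for a real embedding function.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

questions = ["How can I be a good geologist?"] * 1000
print(f"{time_per_item(fake_embed, questions):.6f} s per question")
```

Multiplying the per-item time by the size of the full dataset gives a rough estimate of total runtime before committing to a long embedding job.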
