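# How to convert text to vectors with an embedding API and search for similar content

This notebook walks through converting text to vectors with an embedding API and running semantic search for similar content.  
The example below vectorizes a sample of 53 documents from Wikipedia and then searches for similar documents.  
Reference: https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings

(Because of errors caused by a version conflict between pandas==2.0.3 and numpy==2.0.0, pandas was changed to pandas==2.1.2. 2024-07)  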
(Update 2024.12: switched to the text-embedding-3-large API)

In [1]:
import os
import re
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI
from dotenv import load_dotenv
load_dotenv()

client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key        = os.getenv("AZURE_OPENAI_API_KEY"),
    api_version    = os.getenv("OPENAI_API_VERSION")
)

deployment_name = os.getenv("DEPLOYMENT_NAME")
deployment_embedding_name = os.getenv("DEPLOYMENT_EMBEDDING_NAME")
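
If one of the settings above is missing from `.env`, `os.getenv` returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of a fail-fast check, assuming the same variable names as the cell above; `missing_settings` is a hypothetical helper, not part of the OpenAI SDK:

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "DEPLOYMENT_NAME",
    "DEPLOYMENT_EMBEDDING_NAME",
)

def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```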

Read the file to be vectorized (./data/wiki_data.csv) and inspect it with pandas

In [2]:
df_wiki_data=pd.read_csv(os.path.join(os.getcwd(),'data/wiki_data.csv'))
df_wiki_data

Unnamed: 0,id,url,title,text
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...


In [3]:
pd.options.mode.chained_assignment = None #https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

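# s is the input text; sep_token is currently unused
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences; the dot is escaped so it matches only a literal period
    s = re.sub(r"\. ,","",s)
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()

    return s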

df_wiki_data['text']= df_wiki_data["text"].apply(lambda x : normalize_text(x))
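
One detail worth checking in cleanup patterns like these: in a regular expression an unescaped `.` matches any character, not only a period. A small sketch with a made-up string that compares the unescaped and escaped forms:

```python
import re

text = "cats ,dogs. ,birds"  # made-up sample, not from the dataset

# Unescaped dot: any character followed by " ," is removed.
loose = re.sub(r". ,", "", text)

# Escaped dot: only a literal period followed by " ," is removed.
strict = re.sub(r"\. ,", "", text)

print(loose)   # catdogsbirds
print(strict)  # cats ,dogsbirds
```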

To use the Embedding API provided by Azure OpenAI, count the tokens in each document and keep only the documents whose text is shorter than 8,192 tokens

In [4]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df_wiki_data['n_tokens'] = df_wiki_data["text"].apply(lambda x: len(tokenizer.encode(x)))
df_wiki_data = df_wiki_data[df_wiki_data.n_tokens<8192]
len(df_wiki_data)
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131
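
The filter above drops any document over the limit. An alternative, not used in this notebook, is to split a long token sequence into windows that each fit. A minimal sketch that works on a plain list of token ids (such as the output of `tokenizer.encode`); `chunk_tokens` and its `overlap` parameter are hypothetical names chosen for illustration:

```python
def chunk_tokens(tokens, max_tokens=8191, overlap=0):
    """Split a list of token ids into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window reaches the end of the sequence.
        if start + max_tokens >= len(tokens):
            break
    return chunks

print(chunk_tokens(list(range(10)), max_tokens=4, overlap=1))
```

Each chunk could then be decoded back to text and embedded separately. How the per-chunk vectors are combined or stored is a design choice this sketch leaves open.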


Check how the document text is split into individual tokens

In [5]:
sample_encode = tokenizer.encode(df_wiki_data.text[0]) 
decode = tokenizer.decode_tokens_bytes(sample_encode)
decode

[b'April',
 b' is',
 b' the',
 b' fourth',
 b' month',
 b' of',
 b' the',
 b' year',
 b' in',
 b' the',
 b' Julian',
 b' and',
 b' Greg',
 b'orian',
 b' calendars',
 b',',
 b' and',
 b' comes',
 b' between',
 b' March',
 b' and',
 b' May',
 b'.',
 b' It',
 b' is',
 b' one',
 b' of',
 b' four',
 b' months',
 b' to',
 b' have',
 b' ',
 b'30',
 b' days',
 b'.',
 b' April',
 b' always',
 b' begins',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' week',
 b' as',
 b' July',
 b',',
 b' and',
 b' additionally',
 b',',
 b' January',
 b' in',
 b' leap',
 b' years',
 b'.',
 b' April',
 b' always',
 b' ends',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' the',
 b' week',
 b' as',
 b' December',
 b'.',
 b' April',
 b"'s",
 b' flowers',
 b' are',
 b' the',
 b' Sweet',
 b' Pe',
 b'a',
 b' and',
 b' Daisy',
 b'.',
 b' Its',
 b' birth',
 b'stone',
 b' is',
 b' the',
 b' diamond',
 b'.',
 b' The',
 b' meaning',
 b' of',
 b' the',
 b' diamond',
 b' is',
 b' innocence',
 b'.',
 b' The',
 b' M

In [6]:
len(decode)

3902

Generate vector data from the text with the embedding API and add it as a new column, `content_vector`.

In [7]:
def generate_embeddings(text, model=deployment_embedding_name):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

# model should be set to the deployment name you chose when you deployed the embedding model
df_wiki_data['content_vector'] = df_wiki_data["text"].apply(lambda x: generate_embeddings(x, model=deployment_embedding_name))
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens,content_vector
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ..."
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ..."
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[0.0059301890432834625, 0.003919691313058138, ..."
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ..."
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607,"[-0.01116474624723196, -0.04582960158586502, 0..."
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460,"[0.028125453740358353, 0.020387642085552216, -..."
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138,"[0.002844375092536211, 0.025143058970570564, -..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987,"[-0.008278449065983295, 0.0023970487527549267,..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94,"[-0.0221348125487566, -0.0012831256026402116, ..."
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131,"[0.005165310576558113, -0.008634304627776146, ..."
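
The `apply` call above sends one request per row. The embeddings endpoint also accepts a list of inputs, so rows can be grouped into batches. A sketch under stated assumptions: `embed_batch` is a hypothetical callable that takes a list of strings and returns one vector per string in the same order, and the batch size of 16 is arbitrary. A stand-in function replaces the real API call so the example runs offline:

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed texts in groups of batch_size and return vectors in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch must return one vector per input")
        vectors.extend(result)
    return vectors

# Stand-in for the real API call: maps each text to a 1-dimensional "vector".
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
```

With the real client, `embed_batch` could wrap `client.embeddings.create(input=batch, model=deployment_embedding_name)` and collect the `embedding` field of each item in `.data`. Passing `df_wiki_data["text"].tolist()` rather than the Series keeps the slicing purely positional.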


In [8]:
# Save the data to a CSV file (data/wiki_data_embeddings_3_large.csv)
df_wiki_data.to_csv(os.path.join(os.getcwd(),'data/wiki_data_embeddings_3_large.csv'), index=False)
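
One thing to watch when this file is loaded again: CSV stores each `content_vector` list as text, so `pd.read_csv` returns strings rather than lists of floats. A small sketch of the round trip, using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import ast
import io

import pandas as pd

# Made-up one-row frame standing in for df_wiki_data.
df = pd.DataFrame({"title": ["April"], "content_vector": [[0.1, -0.2, 0.3]]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, "content_vector"]))  # the vector comes back as a string

# Parse the string form back into a Python list of floats.
loaded["content_vector"] = loaded["content_vector"].apply(ast.literal_eval)
print(loaded.loc[0, "content_vector"])
```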

Run queries and inspect the results to see how similarity scores relate the documents

In [9]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model=deployment_embedding_name): # model = "deployment_name"
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        model=deployment_embedding_name # model should be set to the deployment name you chose when you deployed the embedding model
    )
    df["similarities"] = df.content_vector.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


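# The queries are in Korean while the documents are in English, so this also exercises cross-lingual search.
# Query: "Tell me about April."
res = search_docs(df_wiki_data, "4월에 대해서 알려줘.", top_n=4)
# Query: "Classify the types of art."
res = search_docs(df_wiki_data, "예술의 종류를 구분해줘.", top_n=4)
# Query: "Draw a table of the differences between April and August"
res = search_docs(df_wiki_data, "4월과 8월의 차이를 표로 그려줘", top_n=4)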

Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ...",0.334265
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ...",0.182251
24,48,https://simple.wikipedia.org/wiki/Astronomy,Astronomy,Astronomy (from the Greek astron (ἄστρον) mean...,2564,"[-0.003089193720370531, -0.012265733443200588,...",0.088218
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ...",0.084221


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[0.0059301890432834625, 0.003919691313058138, ...",0.391206
25,49,https://simple.wikipedia.org/wiki/Architecture,Architecture,Architecture is designing the structures of bu...,1017,"[-0.00010834706336027011, 0.003430503187701106...",0.163802
11,21,https://simple.wikipedia.org/wiki/Arithmetic,Arithmetic,"In mathematics, arithmetic is the basic study ...",332,"[-0.005155470687896013, 0.027062974870204926, ...",0.122448
24,48,https://simple.wikipedia.org/wiki/Astronomy,Astronomy,Astronomy (from the Greek astron (ἄστρον) mean...,2564,"[-0.003089193720370531, -0.012265733443200588,...",0.117128


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ...",0.27457
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ...",0.269161
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ...",0.130852
12,22,https://simple.wikipedia.org/wiki/Addition,Addition,"In mathematics, addition, represented by the s...",801,"[0.01080403570085764, 0.0018525621853768826, -...",0.120813
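
`search_docs` computes the cosine similarity row by row with `apply`. For larger tables the same scores can be computed in one matrix operation. A sketch with made-up 2-dimensional vectors; `cosine_similarities` and `top_n_indices` are hypothetical helpers, not part of the notebook:

```python
import numpy as np

def cosine_similarities(doc_vectors, query_vector):
    """Cosine similarity between each row of doc_vectors and query_vector."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    doc_norms = np.linalg.norm(docs, axis=1)
    query_norm = np.linalg.norm(query)
    return docs @ query / (doc_norms * query_norm)

def top_n_indices(scores, n=3):
    """Positions of the n highest scores, best first."""
    return np.argsort(scores)[::-1][:n]

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = cosine_similarities(docs, [1.0, 0.0])
print(scores)
print(top_n_indices(scores, n=2))
```

With the DataFrame above, the matrix could be built once with `np.vstack(df_wiki_data.content_vector.to_list())` and reused across queries.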


In [10]:
# Build a RAG-based answer for the user's query
def generate_rag_answer(user_query, top_n=3):
    content_msg = ""
    res = search_docs(df_wiki_data, user_query, top_n=top_n, to_print=False)
    for index, result in res.iterrows():
        # print(result)
        content_msg = content_msg + result.title + ":\n  " + result.text + "  \n"
    system_msg = """You should generate an answer based on the "### Grounding data" message provided below, rather than using any knowledge you have about the user's question. If there is no "### Grounding data" message, "I could not find a context for the answer." You have to answer.  \n\n### Grounding data  \n""" + content_msg
    print(system_msg + "\n질문: " + user_query)  # "질문" means "Question"

    response = client.chat.completions.create(
        model=deployment_name,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_query},
        ],
        temperature=0.1,
        max_tokens=2000
    )

    return response.choices[0].message.content

# Generate a RAG-based answer for the user's query
# Query: "Compare April and August and draw a table that briefly summarizes the differences by item"
user_query = """4월과 8월을 비교해 보고 항목별로 차이를 간단하게 요약하여 표로 그려줘"""
# Alternative query: "Explain the history of the automobile."
# user_query = """자동차의 역사를 설명해줘."""
response = generate_rag_answer(user_query)
print("답변: " + response)  # "답변" means "Answer"

You should generate an answer based on the "### Grounding data" message provided below, rather than using any knowledge you have about the user's question. If there is no "### Grounding data" message, "I could not find a context for the answer." You have to answer.  

### Grounding data  
April:
  April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days. April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December. April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence. The Month April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year. April begins on the same day of the week as July every year and on t

답변: | 항목          | 4월 (April)                                   | 8월 (August)                                   |
|---------------|-----------------------------------------------|------------------------------------------------|
| 순서          | 4번째 월                                     | 8번째 월                                      |
| 일수          | 30일                                         | 31일                                          |
| 계절          | 북반구: 봄 / 남반구: 가을                    | 북반구: 여름 / 남반구: 겨울                   |
| 이름 유래     | 라틴어 "aperire" (열다) 또는 아프로디테     | 아우구스투스 황제에서 유래                    |
| 고정 기념일   | 여러 기념일 (예: 만우절, 세계 건강의 날 등) | 여러 기념일 (예: 인디펜던스 데이, 일본의 승리의 날 등) |
| 이동 기념일   | 부활절 관련 기념일                          | 없음                                           |
| 역사적 사건   | 여러 역사적 사건 (예: 티타닉 침몰)         | 여러 역사적 사건 (예: 히로시마 원폭 투하)    |
| 꽃            | 스위트 피, 데이지                           | 글라디올러스                                   |
| 탄생석        | 다이아몬드        
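
`generate_rag_answer` concatenates the full text of every retrieved document, which can grow large (the April article alone is 3,902 tokens). A sketch of capping the grounding text with a budget; `build_context`, the character-based budget, and its default value are assumptions for illustration, and a token-based budget using the tokenizer would be more precise:

```python
def build_context(docs, max_chars=6000):
    """Join (title, text) pairs into grounding text, truncating to max_chars of body text."""
    parts = []
    remaining = max_chars
    for title, text in docs:
        if remaining <= 0:
            break
        snippet = text[:remaining]
        # Same layout as content_msg in generate_rag_answer.
        parts.append(f"{title}:\n  {snippet}  \n")
        remaining -= len(snippet)
    return "".join(parts)

docs = [("A", "xxxxx"), ("B", "yyyyy"), ("C", "zz")]
print(build_context(docs, max_chars=7))
```

Because the search results are sorted by similarity, truncation in this sketch always cuts the least similar documents first.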