# Applying Embeddings to Each Table Column

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

To extract the table information relevant to a user's query (INPUT), we attach an embedding vector to each column in advance.

In [None]:
# Install Python libraries
!python --version
!pip install psycopg2-binary
!pip install python-dotenv
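# The code below calls the legacy openai.Embedding.create interface, which was removed in openai 1.0
!pip install --upgrade "openai<1.0"
!pip install "openai[datalib]<1.0"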

!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install plotly
!pip install scikit-learn
!pip install sqlalchemy


## Environment variables
Set the Supabase connection details (URL and API key) and the API key for the OpenAI API.
Obtain the API key from your own OpenAI account.

https://platform.openai.com/account/api-keys

Ask your administrator for the Supabase details.

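The example below writes the variables to a .env file and loads them in Jupyter Notebook.

Note: If you cannot create a .env file, or the values cannot be read from it, you can write the values directly in place of the os.getenv("...") calls; the code will still run.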

In [2]:
# Environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# Supabase connection variables
db_host = os.getenv("DB_HOST")
db_port = os.getenv("DB_PORT")
db_name = os.getenv("DB_NAME")
db_user = os.getenv("DB_USER")
db_pass = os.getenv("DB_PASS")

# OPENAI API KEY
openai_api_key = os.getenv("OPENAI_API_KEY")

print('Environment variables loaded')

Environment variables loaded
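
Because `os.getenv` returns `None` for a variable that is not set, a missing `.env` entry only surfaces later as a connection or authentication error. The following is a minimal sketch of an early check, assuming the six variable names used in the cell above; the helper `find_missing_vars` and the sample environment are hypothetical and not part of the original notebook.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ["DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASS", "OPENAI_API_KEY"]


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical, partially filled environment used only for illustration
example_env = {"DB_HOST": "localhost", "DB_PORT": "5432", "DB_NAME": "postgres"}
print(find_missing_vars(REQUIRED_VARS, example_env))
# → ['DB_USER', 'DB_PASS', 'OPENAI_API_KEY']
```

Calling `find_missing_vars(REQUIRED_VARS)` without a second argument checks the real process environment after `load_dotenv()` has run.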


# Applying the API to the `table_column_metadata` table

To compare how well each approach retrieves relevant columns, we computed embeddings for two kinds of text.

1. embedding_on_name: the embedding of the table name plus the column name
2. embedding_on_description: the embedding of the table name plus the column name, together with a description written in advance

Reference: https://stackoverflow.com/questions/74000154/customize-fine-tune-openai-model-how-to-make-sure-answers-are-from-customized/75192794#75192794
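
One plausible way to run the comparison is to rank columns by cosine similarity between a query embedding and each of the two embedding columns. The sketch below illustrates that idea; it is not necessarily the method used in this project. The helpers `cosine_similarity` and `rank_columns` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the 1536-dimensional vectors that `text-embedding-ada-002` returns.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_columns(query_vec, df, embedding_col, top_k=3):
    """Return the top_k rows of df most similar to query_vec for one embedding column."""
    scores = df[embedding_col].apply(lambda v: cosine_similarity(query_vec, v))
    ranked = df.assign(score=scores).sort_values("score", ascending=False)
    return ranked[["table_name", "column_name", "score"]].head(top_k)


# Toy data: made-up vectors, not real embeddings
toy = pd.DataFrame({
    "table_name": ["specimens", "specimens", "experiments"],
    "column_name": ["sex", "age", "id"],
    "embedding_on_name": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "embedding_on_description": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],
})
query_vec = [0.2, 0.95, 0.1]  # made-up query vector

# Rank the same query against both embedding variants
for col in ["embedding_on_name", "embedding_on_description"]:
    print(col)
    print(rank_columns(query_vec, toy, col, top_k=2))
```

With real data, `query_vec` would be the embedding of the user's query, and the two rankings could be compared against the columns a person would judge relevant.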

In [13]:
import openai
import pandas as pd
from sqlalchemy import create_engine
import urllib.parse
from IPython.display import display

# Use your API key
openai.api_key = openai_api_key

# Connect to the database
connection_config = {
    'user': db_user,
    'password': urllib.parse.quote_plus(db_pass),
    'host': db_host,
    'port': db_port, 
    'database': db_name
}
sql = 'SELECT * FROM table_column_metadata ORDER BY id ASC;'
engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.format(**connection_config))


# Calculate the embedding vector for a text using the OpenAI Embeddings endpoint.
# Note: openai.Embedding.create is the legacy (pre-1.0) interface of the openai package.
def get_embedding(text: str) -> list[float]:
    result = openai.Embedding.create(
        model='text-embedding-ada-002',
        input=text
    )
    return result['data'][0]['embedding']


# Load the column metadata from PostgreSQL
df = pd.read_sql(sql=sql, con=engine)

df['name'] = df['table_name'] + "." + df['column_name']
df['embedding_on_name'] = df['name'].apply(get_embedding)
df['embedding_on_description'] = df['description'].apply(get_embedding)
df = df.drop('name', axis=1)

display(df)


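# Write the DataFrame back to the database.
# Note: if_exists='replace' drops and recreates the table rather than appending rows.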
df.to_sql('table_column_metadata', con=engine, if_exists='replace', index=False)


| | id | table_name | column_name | description | embedding_on_name | embedding_on_description |
|---|---|---|---|---|---|---|
| 0 | 1 | experiments | id | experiments.id: This is a unique identifier as... | [-0.013944113627076149, 0.008559102192521095, ... | [-0.017976021394133568, 0.014666065573692322, ... |
| 1 | 2 | specimens | id | specimens.id: This is a unique identifier corr... | [-0.012582045048475266, 0.006582512520253658, ... | [-0.02479550801217556, 0.014213652350008488, 0... |
| 2 | 3 | specimens | experiment-id | specimens.experiment-id: This refers to the un... | [-0.019997188821434975, 0.011076374910771847, ... | [-0.017639081925153732, 0.021716861054301262, ... |
| 3 | 4 | specimens | donor-id | specimens.donor-id: This is a unique identifie... | [-0.015239322558045387, 0.00034999402123503387... | [-0.029239283874630928, 0.0027051696088165045,... |
| 4 | 5 | specimens | sex | specimens.sex: This attribute indicates the bi... | [-0.018099812790751457, 0.01001953985542059, -... | [-0.02135617285966873, 0.008103607222437859, 0... |
| 5 | 6 | specimens | strain | specimens.strain: This attribute refers to the... | [-0.02274075523018837, 0.00429613096639514, 0.... | [-0.0377492718398571, 0.008492245338857174, 0.... |
| 6 | 7 | specimens | age | specimens.age: This represents the age of the ... | [-0.0037267638836055994, 0.0018887094920501113... | [-0.021022379398345947, -0.002805102849379182,... |
| 7 | 8 | specimens | weight | specimens.weight: This is the weight of the mo... | [-0.0051937648095190525, 0.010918917134404182,... | [-0.012874914333224297, -0.0018484934698790312... |
| 8 | 9 | specimens | structure-id | specimens.structure-id: This identifier signif... | [-0.013170195743441582, 0.031720180064439774, ... | [-0.032811056822538376, 0.033925093710422516, ... |
| 9 | 10 | specimens | registration-point | specimens.registration-point: This attribute d... | [-0.003982044290751219, 0.0058493101969361305,... | [-0.011506017297506332, 0.012140102684497833, ... |


38
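
The cell above sends one request per row for each embedding column. Since the OpenAI Embeddings endpoint also accepts a list of strings as `input`, the requests could be grouped into batches. The following is a minimal sketch of that batching logic only; `embed_in_batches` is a hypothetical helper, and `fake_embed_batch` is a stand-in used so the example runs without an API key.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts in chunks of batch_size, preserving input order.

    embed_batch is any callable that takes a list of strings and
    returns one vector per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Stand-in for the real API call: maps each text to a 1-dimensional vector
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


names = ["experiments.id", "specimens.id", "specimens.sex"]
print(embed_in_batches(names, fake_embed_batch, batch_size=2))
```

In the notebook, `embed_batch` would wrap a single call to the embeddings endpoint with the whole batch as `input`, and `texts` would be something like `df['name'].tolist()`.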