# [STEP1] Selecting query-relevant database tables and columns with embeddings

The user's query (INPUT) is embedded and compared against preprocessed table-column metadata so that only the relevant columns are selected.

In [None]:
# Install Python libraries
!python --version
!pip install psycopg2-binary
!pip install python-dotenv
# The cells below use the pre-1.0 openai interface (openai.ChatCompletion, openai.embeddings_utils)
!pip install --upgrade "openai[datalib]<1.0"

!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install plotly
!pip install scikit-learn
!pip install sqlalchemy


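## Environment variables
Set the Supabase connection details and the API key for the OpenAI API.
Obtain an API key from your own OpenAI account.

https://platform.openai.com/account/api-keys

Ask your administrator for the Supabase details.

The example below writes the variables to a .env file and loads them in Jupyter Notebook.

Note: if creating a .env file is difficult, or its values cannot be loaded,
you can write the values directly in place of the os.getenv("...") calls and the code will still run.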
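
As a small sketch of the .env approach described above: `os.getenv` returns `None` for any variable that was not loaded, and that tends to surface only later as a connection error. The helper name `missing_env_vars` is hypothetical and not part of this notebook; it only reports which of the variable names read in the next cell are unset or empty.

```python
import os

# Variable names read by the notebook's environment cell.
REQUIRED_VARS = ["DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASS", "OPENAI_API_KEY"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set")
```

Running this after `load_dotenv()` gives a quick check before any database or API call is made.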

In [2]:
# Environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# Variables for the Supabase connection
db_host = os.getenv("DB_HOST")
db_port = os.getenv("DB_PORT")
db_name = os.getenv("DB_NAME")
db_user = os.getenv("DB_USER")
db_pass = os.getenv("DB_PASS")

# OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

print('Environment variables loaded')

Environment variables loaded


# Running the process
## [INPUT] Enter queries and set parameters

In [3]:
# INPUT queries
# Stored as a list so that several patterns can be tested at once.
input_queries = [
    "What is the projection relationship from the frontal pole with respect to the mouse brain?", # outputs projections from a broad brain region
    "What are the main brain regions that project to the motor area with respect to the mouse brain?", # outputs the pathways traced backwards to a specific brain region
    "Is there any differences in the projection relationship to the frontal pole with respect to the brains of male and female mice?", # outputs differences in projections between male and female mice
]

# Number of table columns to select
# Sets how many table columns are output
column_selection_number = 10

## Selecting table columns without embeddings (for comparing search performance)

In [4]:
import openai
openai.api_key = openai_api_key

database_structure_information="""
Table: experiments
- id (integer, primary key, unique)
- qc-date (text)
- red-channel (text)
- green-channel (text)
- blue-channel (text)

Table: structures
- id (integer, primary key, unique)
- name (text)
- acronym (text)
- parent-structure-id (integer)
- hemisphere-id (integer)
- st-level (integer)
- superstructures (jsonb)
- substructures (jsonb)
- neighboring-structures (jsonb)

Table: specimens
- id (integer, primary key, unique)
- experiment-id (integer, foreign key referencing experiments.id)
- donor-id (integer)
- sex (text)
- strain (text)
- age (real)
- weight (real)
- structure-id (integer, foreign key referencing structures.id)
- registration-point (text)
- coordinates-ap (real, nullable)
- coordinates-dv (real, nullable)
- coordinates-ml (real, nullable)
- angle (integer)
- injection-materials (text)
- fluor-colors (text)
- injection-method (text)
- days-post-injection (integer)

Table: projections
- id (integer, primary key, unique)
- experiment-id (integer, foreign key referencing experiments.id)
- hemisphere-id (integer)
- structure-id (integer, foreign key referencing structures.id)
- is-injection (boolean)
- normalized-projection-volume (real)
- projection-density (real)
- projection-energy (real)
- projection-intensity (real)
- projection-volume (real)
- volume (real)
"""

def select_columns_without_embedding(query:str):
    completion = openai.ChatCompletion.create(
      model="gpt-4",
      messages=[
        {"role": "system", "content": "You have a database related with Mouse Brain Connectivity that resource is Allen Brain Atlas API.\n----\n"+database_structure_information},
        {"role": "user", "content": "Select "+str(column_selection_number)+" columns belonging to your tables in order of increasing relevance to the text below.\n----\n"+query}
      ],
      temperature=0.2
    )
    return completion.choices[0].message.content

for q in input_queries:
    ans_a = select_columns_without_embedding(q)
    
    print(q)
    print(ans_a)

What is the projection relationship from the frontal pole with respect to the mouse brain?
1. structures.name
2. structures.id
3. structures.acronym
4. specimens.structure-id
5. specimens.experiment-id
6. experiments.id
7. projections.experiment-id
8. projections.structure-id
9. projections.normalized-projection-volume
10. projections.projection-density
What are the main brain regions that project to the motor area with respect to the mouse brain?
1. structures.id
2. structures.name
3. structures.acronym
4. structures.parent-structure-id
5. specimens.structure-id
6. projections.structure-id
7. projections.experiment-id
8. projections.normalized-projection-volume
9. projections.projection-density
10. projections.projection-energy
Is there any differences in the projection relationship to the frontal pole with respect to the brains of male and female mice?
1. sex (specimens)
2. structure-id (specimens)
3. structure-id (projections)
4. name (structures)
5. acronym (structures)
6. experime
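
The answers above come back in two shapes, `table.column` and `column (table)`, so comparing them with the embedding-based results needs some normalization. The following is a minimal sketch of one way to do that; `parse_column_list` is a hypothetical helper, not part of this notebook, and it simply skips any line that matches neither shape (such as a cut-off line).

```python
import re

# Matches lines like "1. structures.name"
DOTTED = re.compile(r"^\s*\d+\.\s*([\w-]+)\.([\w-]+)\s*$")
# Matches lines like "1. sex (specimens)"
PAREN = re.compile(r"^\s*\d+\.\s*([\w-]+)\s*\(([\w-]+)\)\s*$")

def parse_column_list(answer: str) -> list[str]:
    """Turn a numbered answer into a list of "table.column" names."""
    columns = []
    for line in answer.splitlines():
        m = DOTTED.match(line)
        if m:
            columns.append(f"{m.group(1)}.{m.group(2)}")
            continue
        m = PAREN.match(line)
        if m:
            # Parenthesized form lists the column first, then the table.
            columns.append(f"{m.group(2)}.{m.group(1)}")
    return columns

print(parse_column_list("1. structures.name\n2. sex (specimens)"))
```

With both answer styles reduced to the same `table.column` form, the lists can be compared directly against the `name` column used in the embedding-based selection.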

## Embed the query and compare it with the preprocessed table-column metadata

The similarity between embeddings is compared using the cosine similarity helper from the openai library.
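
To make the comparison concrete, here is a minimal sketch with toy data. It assumes the stored embeddings are text in Postgres array form (`{...}`), as the string replacements in the next cell suggest, and it computes cosine similarity as the dot product divided by the product of the norms, which to my understanding is what the openai helper does as well. The function names `parse_pg_array`, `cosine_sim`, and `top_k` and the 3-dimensional vectors are made up for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import numpy as np

def parse_pg_array(text: str) -> np.ndarray:
    """Parse a Postgres array literal such as '{0.1,0.2}' into a float array."""
    return np.array([float(v) for v in text.strip("{}[]").split(",")], dtype=float)

def cosine_sim(a, b) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, named_vectors: dict, k: int) -> list:
    """Return the k names whose vectors are most similar to query_vec."""
    scored = [(name, cosine_sim(vec, query_vec)) for name, vec in named_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]

# Toy vectors standing in for column-metadata embeddings (illustrative only).
columns = {
    "projections.projection-volume": parse_pg_array("{0.9,0.1,0.0}"),
    "specimens.sex": parse_pg_array("{0.0,0.2,0.9}"),
    "structures.name": parse_pg_array("{0.5,0.5,0.1}"),
}
query_vec = np.array([1.0, 0.2, 0.0])  # stands in for the embedded query

print(top_k(query_vec, columns, k=2))
# → ['projections.projection-volume', 'structures.name']
```

The notebook cell that follows applies the same idea with pandas: one similarity column per embedding type, sorted in descending order.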

In [5]:
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sqlalchemy import text
import urllib.parse
from IPython.display import display

# Use your API key
openai.api_key = openai_api_key

# Calculate embedding vector for the input using OpenAI Embeddings endpoint
def get_embedding(input: str) -> list[float]:
    result = openai.Embedding.create(
        model = 'text-embedding-ada-002',
        input = input
    )
    return result['data'][0]['embedding']

# Connect to the database
connection_config = {
    'user': db_user,
    'password': urllib.parse.quote_plus(db_pass),
    'host': db_host,
    'port': db_port, 
    'database': db_name
}
sql = 'SELECT * FROM table_column_metadata ORDER BY id ASC;'
engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.format(**connection_config))
with engine.begin() as conn:
    query = text(sql)
    df = pd.read_sql_query(query, conn)

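# Build a "table.column" label and parse the embeddings loaded from Postgres
# ('{...}' array text is converted to a Python list literal and parsed without eval)
import ast
df['name'] = df['table_name']+"."+df['column_name']
df['embedding_on_name'] = df['embedding_on_name'].str.replace('{','[', regex=False).str.replace('}',']', regex=False).apply(ast.literal_eval).apply(np.array)
df['embedding_on_description'] = df['embedding_on_description'].str.replace('{','[', regex=False).str.replace('}',']', regex=False).apply(ast.literal_eval).apply(np.array)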


# loop query
for q in input_queries:
    print(q)
    # Save embedding vector of the input
    input_embedding_vector = get_embedding(q)

    # Calculate similarity between the input and metadata
    df['similarity_on_name'] = df['embedding_on_name'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
    df['similarity_on_description'] = df['embedding_on_description'].apply(lambda x: cosine_similarity(x, input_embedding_vector))

    # Sort by similarity and select the top columns (count set by column_selection_number)
    sorted_df = df.sort_values('similarity_on_name', ascending=False)
    top_columns = sorted_df.head(column_selection_number)
    print("[similarity_on_name]")
    print(top_columns['name'])

    sorted_df = df.sort_values('similarity_on_description', ascending=False)
    top_columns = sorted_df.head(column_selection_number)
    print("[similarity_on_description]")
    print(top_columns['name'])

    display(df)

What is the projection relationship from the frontal pole with respect to the mouse brain?
[similarity_on_name]
27               projections.projection-volume
26            projections.projection-intensity
28                          projections.volume
23    projections.normalized-projection-volume
20                   projections.hemisphere-id
24              projections.projection-density
25               projections.projection-energy
19                   projections.experiment-id
18                              projections.id
22                    projections.is-injection
Name: name, dtype: object
[similarity_on_description]
18                  projections.id
20       projections.hemisphere-id
12        specimens.coordinates-ml
21        projections.structure-id
10        specimens.coordinates-ap
28              projections.volume
8           specimens.structure-id
11        specimens.coordinates-dv
33        structures.hemisphere-id
9     specimens.registration-point
Name: name, dt

Unnamed: 0,id,table_name,column_name,description,embedding_on_name,embedding_on_description,name,similarity_on_name,similarity_on_description
0,1,experiments,id,experiments.id: This is a unique identifier as...,"[-0.013835095800459385, 0.008505481295287609, ...","[-0.017975986003875732, 0.014639126136898994, ...",experiments.id,0.729732,0.716121
1,2,specimens,id,specimens.id: This is a unique identifier corr...,"[-0.012327572330832481, 0.006606162991374731, ...","[-0.02515840157866478, 0.014441575855016708, 0...",specimens.id,0.690183,0.716297
2,3,specimens,experiment-id,specimens.experiment-id: This refers to the un...,"[-0.019997188821434975, 0.011076374910771847, ...","[-0.01757403090596199, 0.02192058600485325, 0....",specimens.experiment-id,0.699297,0.714322
3,4,specimens,donor-id,specimens.donor-id: This is a unique identifie...,"[-0.015332839451730251, 0.0004237593384459615,...","[-0.0292736254632473, 0.002720945980399847, 0....",specimens.donor-id,0.693027,0.752433
4,5,specimens,sex,specimens.sex: This attribute indicates the bi...,"[-0.01808445155620575, 0.010092622600495815, -...","[-0.02135617285966873, 0.008103607222437859, 0...",specimens.sex,0.702159,0.749268
5,6,specimens,strain,specimens.strain: This attribute refers to the...,"[-0.02274075523018837, 0.00429613096639514, 0....","[-0.03764483705163002, 0.008439254015684128, 0...",specimens.strain,0.694554,0.739983
6,7,specimens,age,specimens.age: This represents the age of the ...,"[-0.0037173661403357983, 0.0018532535759732127...","[-0.020964108407497406, -0.0028263390995562077...",specimens.age,0.689141,0.761482
7,8,specimens,weight,specimens.weight: This is the weight of the mo...,"[-0.005243922118097544, 0.010786249302327633, ...","[-0.012873378582298756, -0.0016869819955900311...",specimens.weight,0.680267,0.758301
8,9,specimens,structure-id,specimens.structure-id: This identifier signif...,"[-0.013170195743441582, 0.031720180064439774, ...","[-0.03290892392396927, 0.033837057650089264, 0...",specimens.structure-id,0.695165,0.783403
9,10,specimens,registration-point,specimens.registration-point: This attribute d...,"[-0.004043759312480688, 0.005859993398189545, ...","[-0.011496290564537048, 0.012143782339990139, ...",specimens.registration-point,0.700559,0.772097


What are the main brain regions that project to the motor area with respect to the mouse brain?
[similarity_on_name]
27               projections.projection-volume
28                          projections.volume
20                   projections.hemisphere-id
36                    structures.substructures
26            projections.projection-intensity
25               projections.projection-energy
24              projections.projection-density
21                    projections.structure-id
35                  structures.superstructures
23    projections.normalized-projection-volume
Name: name, dtype: object
[similarity_on_description]
12        specimens.coordinates-ml
20       projections.hemisphere-id
18                  projections.id
8           specimens.structure-id
21        projections.structure-id
35      structures.superstructures
11        specimens.coordinates-dv
36        structures.substructures
33        structures.hemisphere-id
9     specimens.registration-point
Name: nam

Unnamed: 0,id,table_name,column_name,description,embedding_on_name,embedding_on_description,name,similarity_on_name,similarity_on_description
0,1,experiments,id,experiments.id: This is a unique identifier as...,"[-0.013835095800459385, 0.008505481295287609, ...","[-0.017975986003875732, 0.014639126136898994, ...",experiments.id,0.716484,0.696587
1,2,specimens,id,specimens.id: This is a unique identifier corr...,"[-0.012327572330832481, 0.006606162991374731, ...","[-0.02515840157866478, 0.014441575855016708, 0...",specimens.id,0.685283,0.705468
2,3,specimens,experiment-id,specimens.experiment-id: This refers to the un...,"[-0.019997188821434975, 0.011076374910771847, ...","[-0.01757403090596199, 0.02192058600485325, 0....",specimens.experiment-id,0.687828,0.686466
3,4,specimens,donor-id,specimens.donor-id: This is a unique identifie...,"[-0.015332839451730251, 0.0004237593384459615,...","[-0.0292736254632473, 0.002720945980399847, 0....",specimens.donor-id,0.685174,0.746187
4,5,specimens,sex,specimens.sex: This attribute indicates the bi...,"[-0.01808445155620575, 0.010092622600495815, -...","[-0.02135617285966873, 0.008103607222437859, 0...",specimens.sex,0.700729,0.743789
5,6,specimens,strain,specimens.strain: This attribute refers to the...,"[-0.02274075523018837, 0.00429613096639514, 0....","[-0.03764483705163002, 0.008439254015684128, 0...",specimens.strain,0.694602,0.747385
6,7,specimens,age,specimens.age: This represents the age of the ...,"[-0.0037173661403357983, 0.0018532535759732127...","[-0.020964108407497406, -0.0028263390995562077...",specimens.age,0.683421,0.754168
7,8,specimens,weight,specimens.weight: This is the weight of the mo...,"[-0.005243922118097544, 0.010786249302327633, ...","[-0.012873378582298756, -0.0016869819955900311...",specimens.weight,0.679189,0.755266
8,9,specimens,structure-id,specimens.structure-id: This identifier signif...,"[-0.013170195743441582, 0.031720180064439774, ...","[-0.03290892392396927, 0.033837057650089264, 0...",specimens.structure-id,0.691583,0.790374
9,10,specimens,registration-point,specimens.registration-point: This attribute d...,"[-0.004043759312480688, 0.005859993398189545, ...","[-0.011496290564537048, 0.012143782339990139, ...",specimens.registration-point,0.702865,0.77061


Is there any differences in the projection relationship to the frontal pole with respect to the brains of male and female mice?
[similarity_on_name]
27               projections.projection-volume
20                   projections.hemisphere-id
28                          projections.volume
24              projections.projection-density
26            projections.projection-intensity
23    projections.normalized-projection-volume
19                   projections.experiment-id
25               projections.projection-energy
22                    projections.is-injection
18                              projections.id
Name: name, dtype: object
[similarity_on_description]
20    projections.hemisphere-id
18               projections.id
4                 specimens.sex
12     specimens.coordinates-ml
21     projections.structure-id
6                 specimens.age
28           projections.volume
33     structures.hemisphere-id
10     specimens.coordinates-ap
11     specimens.coordinates-dv
Name: n

Unnamed: 0,id,table_name,column_name,description,embedding_on_name,embedding_on_description,name,similarity_on_name,similarity_on_description
0,1,experiments,id,experiments.id: This is a unique identifier as...,"[-0.013835095800459385, 0.008505481295287609, ...","[-0.017975986003875732, 0.014639126136898994, ...",experiments.id,0.736759,0.715274
1,2,specimens,id,specimens.id: This is a unique identifier corr...,"[-0.012327572330832481, 0.006606162991374731, ...","[-0.02515840157866478, 0.014441575855016708, 0...",specimens.id,0.704495,0.713218
2,3,specimens,experiment-id,specimens.experiment-id: This refers to the un...,"[-0.019997188821434975, 0.011076374910771847, ...","[-0.01757403090596199, 0.02192058600485325, 0....",specimens.experiment-id,0.709241,0.709002
3,4,specimens,donor-id,specimens.donor-id: This is a unique identifie...,"[-0.015332839451730251, 0.0004237593384459615,...","[-0.0292736254632473, 0.002720945980399847, 0....",specimens.donor-id,0.707793,0.749509
4,5,specimens,sex,specimens.sex: This attribute indicates the bi...,"[-0.01808445155620575, 0.010092622600495815, -...","[-0.02135617285966873, 0.008103607222437859, 0...",specimens.sex,0.75191,0.791993
5,6,specimens,strain,specimens.strain: This attribute refers to the...,"[-0.02274075523018837, 0.00429613096639514, 0....","[-0.03764483705163002, 0.008439254015684128, 0...",specimens.strain,0.708767,0.74886
6,7,specimens,age,specimens.age: This represents the age of the ...,"[-0.0037173661403357983, 0.0018532535759732127...","[-0.020964108407497406, -0.0028263390995562077...",specimens.age,0.716086,0.777801
7,8,specimens,weight,specimens.weight: This is the weight of the mo...,"[-0.005243922118097544, 0.010786249302327633, ...","[-0.012873378582298756, -0.0016869819955900311...",specimens.weight,0.697811,0.765671
8,9,specimens,structure-id,specimens.structure-id: This identifier signif...,"[-0.013170195743441582, 0.031720180064439774, ...","[-0.03290892392396927, 0.033837057650089264, 0...",specimens.structure-id,0.704021,0.766069
9,10,specimens,registration-point,specimens.registration-point: This attribute d...,"[-0.004043759312480688, 0.005859993398189545, ...","[-0.011496290564537048, 0.012143782339990139, ...",specimens.registration-point,0.700286,0.758808
