## Loading the CSV

In [2]:
import pandas as pd
df = pd.read_csv('/content/philosopher_detail.csv')
print(df)

      ID        NAME                                          BIOGRAPHY
0      1         タレス  <name>タレス</name><detail>タレスは、紀元前6世紀に生きたギリシャの哲学...
1      2   アナクシマンドロス  <name>アナクシマンドロス</name><detail>アナクシマンドロスは、古代ギリシ...
2      3     アナクシメネス  <name>アナクシメネス</name><detail>アナクシメネスは、紀元前6世紀の古代...
3      4     ヘラクレイトス  <name>ヘラクレイトス</name><detail>ヘラクレイトスは、古代ギリシャの哲学...
4      5     アナクサゴラス  <name>アナクサゴラス</name><detail>アナクサゴラスは、古代ギリシャの哲学...
..   ...         ...                                                ...
663  664  ヘルマン・シュミッツ  <name>ヘルマン・シュミッツ</name><detail>ヘルマン・シュミッツは、ドイツ...
664  665        上杉慎吉  <name>上杉慎吉</name><detail>上杉慎吉は、日本の憲法学者であり、天皇主権...
665  666          劉安  <name>劉安</name><detail>劉安は、前漢時代の淮南王であり、中国古代思想の...
666  667         洪自誠  <name>洪自誠</name><detail>洪自誠は、中国明代の著作家で、本名は洪応明、...
667  668        石黒忠篤  <name>石黒忠篤</name><detail>石黒忠篤は、日本の農林官僚で「農政の神様」...

[668 rows x 3 columns]
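
Each `BIOGRAPHY` value wraps the name and the description in `<name>` and `<detail>` tags. If the two parts are needed separately, they could be pulled apart with a regular expression. The following is a minimal sketch under that assumption: `split_biography` is a hypothetical helper, and the sample string is made up for illustration rather than taken from the CSV.

```python
import re

# Hypothetical helper: split one tagged BIOGRAPHY string into (name, detail).
_TAGGED = re.compile(r"<name>(.*?)</name><detail>(.*?)</detail>", re.DOTALL)

def split_biography(text: str) -> tuple[str, str]:
    match = _TAGGED.search(text)
    if match is None:
        # No tags found: treat the whole string as the detail text.
        return "", text
    return match.group(1), match.group(2)

# Made-up sample in the same tagged format as the CSV column.
sample = "<name>Thales</name><detail>Thales was a Greek philosopher.</detail>"
name, detail = split_biography(sample)
print(name)
print(detail)
```

On the real data this could be applied with `df["BIOGRAPHY"].map(split_biography)`, assuming every row follows the same tag layout.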


## Installing SentenceTransformers



In [3]:
!pip install sentence_transformers xformers

Collecting xformers
  Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl (16.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.7/16.7 MB 83.0 MB/s eta 0:00:00
Installing collected packages: xformers
Successfully installed xformers-0.0.28.post3


## Test run of Snowflake/snowflake-arctic-embed-m-v2.0

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)
print(embeddings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0:
- configuration_hf_alibaba_nlp_gte.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0:
- modeling_hf_alibaba_nlp_gte.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


[[-0.02001398  0.06124367 -0.01872946 ... -0.01814996 -0.01246814
   0.00308553]
 [-0.05736091  0.04389375 -0.03871815 ... -0.01087298  0.02725625
  -0.00432861]
 [-0.09916752  0.0702983  -0.00679688 ... -0.00604448 -0.05341355
  -0.00442093]]
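
Each row of the printed array is the embedding of one input sentence. One way to sanity-check such vectors is to compute their pairwise cosine similarity. The sketch below does this by hand with NumPy on small made-up vectors, not real model output; `cosine_similarity_matrix` is a hypothetical helper. With the real embeddings, `model.similarity(embeddings, embeddings)` should give comparable values, assuming the model is configured to use cosine similarity.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Made-up 3-dimensional "embeddings": rows 0 and 1 point in similar
# directions, row 2 is orthogonal to row 0.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
sims = cosine_similarity_matrix(toy, toy)
print(np.round(sims, 3))
```

For the three test sentences, the two weather sentences would be expected to score higher with each other than with the stadium sentence, although that is not verified here.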


## Embedding a pandas DataFrame column

In [5]:
# Encode every biography; .tolist() hands encode() a plain list of strings
embeded_biographies = model.encode(df["BIOGRAPHY"].tolist())

## Embedding the queries and comparing them with the previous result

In [6]:
query1 = "<name>Hippocrates</name><detail>Hippocrates was an ancient Greek physician and is often referred to as the Father of Western Medicine. His achievements laid the foundation for a scientific approach in the field of medicine. Hippocrates attributed the causes of diseases not to supernatural forces or natural phenomena but to dysfunctions within the body and environmental factors, emphasizing observation and experimentation. His work, Corpus Hippocraticum, is a comprehensive compilation of ancient Greek medical knowledge and has significantly influenced later medical practices. The Hippocratic Oath, which outlines the ethical principles for physicians, continues to be highly respected in the medical field today.</detail>"
embeded_query1 = model.encode(query1)
# Japanese query meaning "Who is the proponent of pragmatism?"
query2 = "プラグマティズムの提唱者は？"
embeded_query2 = model.encode(query2)
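
Once the biographies and a query are embedded, retrieval amounts to ranking the rows by their similarity to the query. Looking at the top few rows rather than only the best one can make it easier to judge how well the match stands out. The following is a minimal NumPy sketch on made-up vectors; `top_k` is a hypothetical helper, not part of sentence-transformers.

```python
import numpy as np

def top_k(doc_embeddings: np.ndarray, query_embedding: np.ndarray, k: int = 3):
    """Return (row_index, cosine_similarity) pairs for the k best rows."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = docs @ query              # one cosine score per row
    order = np.argsort(-scores)[:k]    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Made-up 2-dimensional "embeddings" for four documents and one query.
toy_docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
toy_query = np.array([1.0, 0.1])

result = top_k(toy_docs, toy_query, k=2)
for row, score in result:
    print(row, round(score, 3))
```

With the notebook's variables this could be called as `top_k(embeded_biographies, embeded_query1)`, assuming `encode` returned NumPy arrays, and the returned indices passed to `df.iloc` to display the matching rows.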

In [12]:
import torch

# Row most similar to query1
# similarities has shape (n_rows, 1): one score per biography
similarities = model.similarity(embeded_biographies, embeded_query1)
idx = int(torch.argmax(similarities, dim=0)[0])
print(f"Query: {query1}")
print(f"Similarity: {similarities[idx]}")
# Note: idx-1 and idx+1 assume the best match is neither the first nor the last row
print(f"Similarity of adjacent rows: {similarities[idx-1]}, {similarities[idx+1]}")
print(df[idx:idx+1])

# Row most similar to query2
similarities = model.similarity(embeded_biographies, embeded_query2)
idx = int(torch.argmax(similarities, dim=0)[0])
print(f"Query: {query2}")
print(f"Similarity: {similarities[idx]}")
print(f"Similarity of adjacent rows: {similarities[idx-1]}, {similarities[idx+1]}")
print(df[idx:idx+1])

Query: <name>Hippocrates</name><detail>Hippocrates was an ancient Greek physician and is often referred to as the Father of Western Medicine. His achievements laid the foundation for a scientific approach in the field of medicine. Hippocrates attributed the causes of diseases not to supernatural forces or natural phenomena but to dysfunctions within the body and environmental factors, emphasizing observation and experimentation. His work, Corpus Hippocraticum, is a comprehensive compilation of ancient Greek medical knowledge and has significantly influenced later medical practices. The Hippocratic Oath, which outlines the ethical principles for physicians, continues to be highly respected in the medical field today.</detail>
Similarity: tensor([0.6441])
Similarity of adjacent rows: tensor([0.2423]), tensor([0.2440])
    ID    NAME                                          BIOGRAPHY
38  39  ヒポクラテス  <name>ヒポクラテス</name><detail>ヒポクラテスは、古代ギリシャの医師であ...
Query: プラグマティズムの提唱者は？
Similarity: tensor([0.5348])
Similarity of adjacent rows: tensor([0.1765])