## Basic Usage

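chromadb is, at its core, a vector database.  
It represents data as embedding vectors and performs its database functions through vector operations on those vectors.  
It therefore needs an embedding function that converts data into vectors; by default, chromadb provides a deep-learning embedding function built on ONNX Runtime.  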
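To make the idea concrete, here is a minimal sketch of what a vector database does, independent of chromadb. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and `nearest` is an illustrative brute-force search; neither is part of the chromadb API.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Hypothetical embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query, docs, vocab):
    """Return the stored document whose vector is most similar to the query."""
    q = toy_embed(query, vocab)
    scores = [cosine(q, toy_embed(d, vocab)) for d in docs]
    best = max(range(len(docs)), key=scores.__getitem__)
    return docs[best]

docs = ["bacteria ferment milk into yogurt", "winds blow north to south"]
vocab = sorted({w for d in docs for w in d.split()})
print(nearest("what makes yogurt from milk", docs, vocab))
```

A real embedding model replaces the word counts with dense vectors that capture meaning, but the lookup step follows the same pattern: embed the query, compare it with the stored vectors, and return the closest match.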

GPU acceleration is available (CUDA only; the onnxruntime-gpu library must be installed).
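One way to check whether the CUDA provider is visible is to ask ONNX Runtime which execution providers it reports. The helper name `available_onnx_providers` is an assumption made for this sketch; `onnxruntime.get_available_providers()` is the library call it wraps.

```python
def available_onnx_providers():
    """Return the execution providers onnxruntime reports, or [] if it is not installed."""
    try:
        import onnxruntime
    except ImportError:
        return []
    return onnxruntime.get_available_providers()

providers = available_onnx_providers()
print(providers)
print("CUDA available:", "CUDAExecutionProvider" in providers)
```

If `CUDAExecutionProvider` is missing from the list, the embedding function will most likely run on the CPU.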

In [1]:
from datasets import load_dataset

dataset = load_dataset("sciq", split="train")

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

Number of questions with support:  10481


In the sciq dataset, the "support" field is a passage, written as full sentences, that backs up the answer to the question.  
We therefore add the support passages to the vector DB.

In [28]:
for i in range(5):
    print(f"Question {i+1}: {dataset['question'][i]}")
    print(f"Answer {i+1}: {dataset['support'][i]}")
    print()

Question 1: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Answer 1: Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.

Question 2: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Answer 2: Without Coriolis Effect the global winds would blow north to south or south to north. But Coriolis makes them blow northeast to southwest or the reverse in the Northern Hemisphere. The winds blow northwest to southeast or the reverse in the southern hemisphere.

Question 3: Changes from a less-or

In [2]:
import chromadb

client = chromadb.Client()
collection = client.create_collection("sciq_supports")

from tqdm.notebook import tqdm

# Load the supporting evidence in batches of 1000
batch_size = 1000
for i in tqdm(range(0, len(dataset), batch_size), desc="Adding documents"):
    collection.add(
        ids=[
            str(j) for j in range(i, min(i + batch_size, len(dataset)))
        ],
        documents=dataset["support"][i : i + batch_size],  # the actual data to add
        metadatas=[
            {"type": "support"} for _ in range(i, min(i + batch_size, len(dataset)))
        ],
    )

Adding documents:   0%|          | 0/11 [00:00<?, ?it/s]

*************** EP Error ***************
EP Error D:\a\_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:456 onnxruntime::python::RegisterTensorRTPluginsAsCustomOps Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
 when using ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
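The batched loading above can be factored into a small helper that produces matching id and document lists. This is a sketch only: `iter_batches` is a hypothetical name, and it assumes, as in the loop above, that each id is simply the document's position as a string.

```python
def iter_batches(documents, batch_size):
    """Yield (ids, docs) pairs; ids are the string positions of the documents."""
    for start in range(0, len(documents), batch_size):
        docs = documents[start:start + batch_size]
        ids = [str(start + offset) for offset in range(len(docs))]
        yield ids, docs

sample = [f"doc {n}" for n in range(5)]
for ids, docs in iter_batches(sample, batch_size=2):
    print(ids, docs)
```

Each pair could then be passed to `collection.add(ids=ids, documents=docs)`. Deriving the ids from the length of the slice keeps the two lists the same size, including in the final, shorter batch.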


As in the examples below, when words or sentences are passed in query_texts, each one is embedded and the most similar entry is retrieved from the DB.

In [29]:
results = collection.query(
    query_texts=dataset["question"][:10],
    n_results=1)

# Print the question and the corresponding support
for i, q in enumerate(dataset['question'][:10]):
    print(f"Query question: {q}")
    print(f"Answer retrieved by Chromadb: {results['documents'][i][0]}")
    print()

Query question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Answer retrieved by Chromadb: Bacteria can be used to make cheese from milk. The bacteria turn the milk sugars into lactic acid. The acid is what causes the milk to curdle to form cheese. Bacteria are also involved in producing other foods. Yogurt is made by using bacteria to ferment milk ( Figure below ). Fermenting cabbage with bacteria produces sauerkraut.

Query question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Answer retrieved by Chromadb: Without Coriolis Effect the global winds would blow north to south or south to north. But Coriolis makes them blow northeast to southwest or the reverse in the Northern Hemisphere. The winds blow northwest to southeast or the reverse in the southern hemisphere.

Query question: Changes from a less-ordered state to a more-ordered state (such as a liquid 
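The indexing `results['documents'][i][0]` works because the query result holds one inner list per query text, ordered from most to least similar. The sketch below pairs each query with its top hit. It assumes the result also carries a `distances` entry with the same nesting; `top_hits` is a hypothetical helper, and `mock_results` is a hand-written stand-in, not real chromadb output.

```python
def top_hits(queries, results):
    """Pair each query with its best-matching document and that match's distance."""
    return [
        (q, results["documents"][i][0], results["distances"][i][0])
        for i, q in enumerate(queries)
    ]

# Hand-written stand-in for a query result (not real chromadb output).
mock_results = {
    "ids": [["12"], ["7"]],
    "documents": [["doc about cheese"], ["doc about wind"]],
    "distances": [[0.31], [0.05]],
}

for q, doc, dist in top_hits(["cheese?", "wind?"], mock_results):
    print(q, "->", doc, dist)
```

With `n_results` greater than 1, each inner list would hold several entries, and index `[0]` would still select the closest one.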

In [37]:
keywords = ["Physics", "Computer", "Machine Learning"]
results = collection.query(
    query_texts=keywords,
    n_results=1)

In [39]:
# Print the question and the corresponding support
for i in range(len(keywords)):
    print(f"Query keyword: {keywords[i]}")
    print(f"Answer retrieved by Chromadb: {results['documents'][i][0]}")
    print()

Query keyword: Physics
Answer retrieved by Chromadb: Physics is the study of energy and how it interacts with matter. Important concepts in physics include motion, forces such as magnetism and gravity, and different forms of energy. Physics concepts can answer all the questions on the right page of the notebook in Figure above .

Query keyword: Computer
Answer retrieved by Chromadb: Over the past several decades, computer technology has revolutionized human society. Watch this video interview about ways computers have changed people’s lives. Then answer the questions below.

Query keyword: Machine Learning
Answer retrieved by Chromadb: Scientists create models with computers. Computers can handle enormous amounts of data. This can more accurately represent the real situation. For example, Earth’s climate depends on an enormous number of factors. Climate models can predict how climate will change as certain gases are added to the atmosphere. To test how good a model is, scientists might start a test run at a time in the past. If the mod