## Similarity Search with ChromaDB

[Similarity Search with ChromaDB](https://apmonitor.com/dde/index.php/Main/SimilaritySearch) in the [Data-Driven Engineering](http://apmonitor.com/dde) online course.

<img align=left width=500px src='https://apmonitor.com/dde/uploads/Main/similarity_search.png'>

ChromaDB is a database for creating and managing vector stores that can run locally. Vector stores support tasks such as similarity search, which is used to retrieve relevant text for large language models. This tutorial shows how to build a vector store from training data for the [Gekko Optimization Suite](https://gekko.readthedocs.io/en/latest/) and how the vector store is applied in Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs).

The first step is to install the necessary libraries, pandas and ChromaDB, with pip:

In [1]:
pip install chromadb pandas


The next step is to import the modules and read [train.jsonl from GitHub](https://github.com/BYU-PRISM/GEKKO/blob/master/docs/llm/train.jsonl).

In [1]:
import pandas as pd
import chromadb

# read Gekko LLM training data
url='https://raw.githubusercontent.com'
path='/BYU-PRISM/GEKKO/master/docs/llm/train.jsonl'
qa = pd.read_json(url+path,lines=True)

The train.jsonl file contains hundreds of questions and answers about Gekko. It provides context for the Gekko Support Agent, which assists with questions about modeling and optimization in Python. Each question and answer pair is added to three lists that are required to build the vector store: documents with the text, metadatas with a unique ID name, and ids with a unique integer identifier stored as a string.

In [2]:
documents = []
metadatas = []
ids = []
for i in range(len(qa)):
    s = f"### Question: {qa['question'].iloc[i]} ### Answer: {qa['answer'].iloc[i]}"
    documents.append(s)
    metadatas.append({'qid':f'qid_{i}'})
    ids.append(str(i))

With the lists prepared, the next cell creates a ChromaDB client, creates a collection, and adds the documents to build the vector store. On first use, ChromaDB downloads its default embedding model (all-MiniLM-L6-v2) to compute an embedding for each document. The vector store is the foundation for efficient similarity search in applications such as RAG for Large Language Models.

In [3]:
# store in memory
cc = chromadb.Client()
collection = cc.create_collection(name='mydb')
collection.add(documents=documents,metadatas=metadatas,ids=ids)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 81.0MiB/s]


The vector database is stored in memory and is regenerated every time the program runs. For large documents, this can take significant time and it may be desirable to store the vector database on a local drive. See [RAG Similarity Search](https://apmonitor.com/dde/index.php/Main/SimilaritySearch) for code to store the database on a local drive.

The final step is to run a test query to confirm that the vector store works. The query uses a [k-Nearest Neighbors search](https://apmonitor.com/pds/index.php/Main/KNearestNeighbors) to return the 5 closest matches.

In [5]:
results = collection.query(
   query_texts=['What are you trained to do?'],
   n_results=5,include=['distances','documents'])
print(results)

{'ids': [['19', '0', '2', '23', '24']], 'embeddings': None, 'documents': [["### Question: What are you trained to do? ### Answer: I'm trained to answer questions about Gekko for optimization, simulation, machine learning, data-science, model predictive control, and parameter estimation.", '### Question: Who are you? ### Answer: I am a Gekko assistant. I am a custom trained chatbot that is fine-tuned to answer questions about Gekko.', "### Question: What is your name? ### Answer: My name is Gekko Assistant. I'm available to help with your questions.", '### Question: Can you improve that answer? ### Answer: I sometimes make mistakes. Please ask the question again, but with more details about the problem.', '### Question: Can you answer questions about Python? ### Answer: Yes, in addition to Gekko I answer questions about programming in Python. My purpose is to help you get answers with modeling, optimization, simulation, machine learning, data science, and other related areas.']], 'uris'

Review the returned documents and their distances to see how similar each document is to query_texts. A smaller distance indicates a closer match.
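To make the distance values concrete, the sketch below ranks a few toy vectors by squared Euclidean distance to a query vector, which is a brute-force k-nearest neighbors search. The vectors and labels are made up for illustration. Squared Euclidean distance is an assumption here: ChromaDB uses an approximate nearest-neighbor index, and the metric depends on how the collection is configured.

```python
import heapq

def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn(query, vectors, k):
    """Return the k (distance, key) pairs closest to query, nearest first."""
    scored = ((squared_l2(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# toy 2D "embeddings" (hypothetical values, not produced by a real model)
vectors = {'doc0': [0.0, 0.0], 'doc1': [1.0, 0.0],
           'doc2': [0.0, 2.0], 'doc3': [3.0, 3.0]}

nearest = knn([0.9, 0.1], vectors, k=2)
print(nearest)
```

Real embeddings have hundreds of dimensions, but the ranking idea is the same: compute a distance to every candidate and keep the smallest k.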

#### Application in RAG with Large Language Models

Once the vector store is set up, it can be used for Retrieval-Augmented Generation (RAG) with Large Language Models. RAG retrieves relevant text from an external knowledge source, such as the vector store, and adds it to the prompt so that the LLM can generate more informed and accurate responses.

In [8]:
!pip install gekko

Collecting gekko
  Downloading gekko-1.3.0-py3-none-any.whl.metadata (3.0 kB)
Downloading gekko-1.3.0-py3-none-any.whl (13.2 MB)
Installing collected packages: gekko
Successfully installed gekko-1.3.0


In [11]:
from gekko import support
a = support.agent()
a.ask("what is dirichlet")

Unfortunately, the provided context doesn't contain information about the `dirichlet` function within GEKKO. My knowledge base, based on the given snippets, *doesn't include* details about a `dirichlet` function in GEKKO. 

It's possible:

*   It's a function not covered in the provided documentation snippets.
*   It's a newer addition to GEKKO not yet reflected in the provided documentation.
*   It's part of an external library used *with* GEKKO, but not inherently part of GEKKO itself.

To find out more, I recommend checking the official GEKKO documentation: [https://gekko.readthedocs.io/en/latest/](https://gekko.readthedocs.io/en/latest/) or searching for "GEKKO dirichlet function" online.





The snippet above uses the Gekko vector store and RAG to provide context to the LLM. This support agent runs in the cloud, but it can also be set up to run locally. Combining the retrieval of ChromaDB with the generative capability of an LLM grounds the responses in the retrieved documents, which can improve the accuracy of answers on specialized topics.
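The retrieval step of RAG can be sketched as follows: the documents returned by the similarity search are placed in the prompt ahead of the user question. The function name `build_prompt`, the prompt wording, and the sample `results` dictionary are hypothetical, and this is not the implementation used by the Gekko support agent. The `results` dictionary only mimics the nested-list layout returned by `collection.query`.

```python
def build_prompt(question, results, max_docs=3):
    """Combine retrieved documents and a question into one prompt string."""
    # query results are nested: one inner list per query text
    docs = results['documents'][0][:max_docs]
    context = '\n'.join(f'- {d}' for d in docs)
    return (f'Use the context to answer the question.\n'
            f'Context:\n{context}\n'
            f'Question: {question}\nAnswer:')

# stand-in for the output of collection.query (shortened, illustrative content)
results = {'documents': [[
    '### Question: Who are you? ### Answer: I am a Gekko assistant.',
    '### Question: What is your name? ### Answer: My name is Gekko Assistant.']]}

prompt = build_prompt('What is your name?', results)
print(prompt)
```

The resulting string would then be sent to an LLM of your choice. Limiting the number of documents with `max_docs` keeps the prompt within the context length of the model.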

#### ✅ Activity: Generate Q+A Similarity Search

This activity encourages you to explore similarity search by creating your own set of questions and answers. Choose a topic you are passionate about, and generate at least 10 question-answer pairs. Once done, you'll build a vector database with these pairs and perform a similarity search using ChromaDB. This hands-on experience helps you understand the practical applications of similarity search in natural language processing.

Use the JSONL template to generate at least 10 questions and answers based on a topic of your interest and save the file as mydb.jsonl.

```
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
```
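One way to produce the file without hand-editing quotes is to write it from Python with the standard `json` module, which escapes special characters and puts one JSON object on each line. The two pairs below are placeholders for illustration. Replace them with at least 10 pairs of your own.

```python
import json

# placeholder Q+A pairs (replace with your own topic)
pairs = [
    ('What is a vector store?',
     'A database that stores embeddings for similarity search.'),
    ('What does "JSONL" mean?',
     'JSON Lines: one JSON object per line.'),
]

# write one JSON object per line
with open('mydb.jsonl', 'w', encoding='utf-8') as f:
    for question, answer in pairs:
        f.write(json.dumps({'question': question, 'answer': answer}) + '\n')

# read the file back to confirm that every line parses
with open('mydb.jsonl', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
print(len(records), records[0]['question'])
```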

Build the vector database and perform a similarity search using the mydb.jsonl file instead of the Gekko Q+A.

In [13]:
import pandas as pd
import chromadb

# read training data
path='mydb.jsonl'
try:
    qa = pd.read_json(path,lines=True)
except (FileNotFoundError, ValueError):
    # if the file is missing, qa from the Gekko example (if defined) is reused
    print('Create mydb.jsonl file')

# build the lists for the vector store
documents = []
metadatas = []
ids = []
for i in range(len(qa)):
    s = f"### Question: {qa['question'].iloc[i]} ### Answer: {qa['answer'].iloc[i]}"
    documents.append(s)
    metadatas.append({'qid':f'qid_{i}'})
    ids.append(str(i))

# in memory; use a new collection name because 'mydb' already
# holds the Gekko Q+A when the earlier cells have been run
cc = chromadb.Client()
collection = cc.get_or_create_collection(name='myqa')
collection.add(documents=documents,metadatas=metadatas,ids=ids)

# first test query
results = collection.query(
   query_texts=['Question to test similarity search.'],
   n_results=5,include=['distances','documents'])
print(results)

# second test query
results = collection.query(
   query_texts=['Another question to test similarity search.'],
   n_results=5,include=['distances','documents'])
print(results)

Create mydb.jsonl file
{'ids': [['293', '374', '303', '146', '87']], 'embeddings': None, 'documents': [['### Question: What is FRZE_CHK in GEKKO? ### Answer: FRZE_CHK in GEKKO checks if a measurement is frozen at the same value across cycles, marking it as bad if no variation is detected.', '### Question: What does the m.solve function do in GEKKO? ### Answer: Solves the optimization problem.', '### Question: How is MEAS_CHK utilized in GEKKO? ### Answer: MEAS_CHK in GEKKO determines whether measurements are validated before being used in the application, ensuring data quality.', "### Question: What's the method to perform a grid search optimization in Gekko? ### Answer: Perform a grid search optimization in Gekko by iteratively changing parameters and solving the model.", '### Question: How do you perform a grid search in Gekko? ### Answer: A grid search in Gekko can be performed by iterating over a range of parameter values and solving the model.']], 'uris': None, 'included': ['dista

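The sample output above comes from a run where mydb.jsonl was not found, so the results are still Gekko questions and answers. After creating the file, test the similarity search with several questions and check that the smallest distances correspond to the entries that are closest in meaning to query_texts.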
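The raw dictionary is hard to read, so a small helper can pair each document with its distance. The helper name `show_results` and the sample dictionary are hypothetical. The dictionary only mimics the nested-list layout of the `collection.query` output when `include=['distances','documents']` is used, and the distance values are invented for the example.

```python
def show_results(results, width=60):
    """Print and return (id, distance, text) rows for the first query."""
    rows = list(zip(results['ids'][0],
                    results['distances'][0],
                    results['documents'][0]))
    rows.sort(key=lambda r: r[1])   # smaller distance = closer match
    for doc_id, dist, text in rows:
        print(f'{doc_id:>4}  {dist:8.4f}  {text[:width]}')
    return rows

# stand-in for collection.query output (invented ids, distances, and text)
sample = {'ids': [['7', '2']],
          'distances': [[0.91, 0.35]],
          'documents': [['### Question: B? ### Answer: b',
                         '### Question: A? ### Answer: a']]}
rows = show_results(sample)
```

Passing the `results` variable from the activity code in place of `sample` gives one line per match, which makes it easier to compare distances across several test questions.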