chromadb, faiss, LlamaIndex

Pure ChromaDB: https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide (uses Parquet, through DuckDB)


https://medium.com/@kofsitho/using-vector-db-in-llama-index-7b47fc39b7c0

https://docs.llamaindex.ai/en/v0.9.48/api/llama_index.readers.PDFReader.html

By default, ChromaDB embeds the text in a collection with the `all-MiniLM-L6-v2` model, which it downloads automatically. To use anything else, specify it when you create the collection. Here we use an embedding model served through Ollama. We also specify at creation time that we want the cosine distance (the default is L2, the squared Euclidean distance). See the documentation: https://docs.trychroma.com/docs/collections/configure

In [1]:
import chromadb
from chromadb.utils.embedding_functions.ollama_embedding_function import OllamaEmbeddingFunction

ollama_ef = OllamaEmbeddingFunction(
    url="http://localhost:11434",
    model_name="nomic-embed-text:latest",
)

client = chromadb.PersistentClient()
# you can specify the path there, see https://docs.trychroma.com/docs/run-chroma/persistent-client
# to keep the DB in memory, use: client = chromadb.EphemeralClient()

In [2]:
collection = client.create_collection(
    name="CatFacts",
    embedding_function=ollama_ef,
    metadata={"hnsw:space": "cosine"}  # or "ip" (inner product); the default is "l2"
)

If using `PersistentClient`, you can load existing collections, delete them, or do a get-or-create:

        collection = client.get_collection(name="test") 
        # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
        collection = client.get_or_create_collection(name="test") 
        # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
        client.delete_collection(name="my_collection") 
        # Delete a collection and all associated embeddings, documents, and metadata. 


## Storing our cat facts

In [9]:
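import requests

# Download the cat facts (one fact per line) and split them into a list of strings.
cat_facts_url = 'https://huggingface.co/ngxson/demo_simple_rag_py/raw/main/cat-facts.txt'
response = requests.get(cat_facts_url)
response.raise_for_status()  # stop here if the download failed
cat_facts = response.text.split("\n")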

In [10]:
collection.add(
    documents = cat_facts,
    ids = [f"id{iii}" for iii in range(len(cat_facts))],
)

`ids` are required. If an id already exists in the collection, the corresponding document is not added to the database.

You can also add metadata with `metadatas = ...`, as a list of dictionaries (one per document).

Instead of `documents = ...` you can also pass a list of `embeddings = ...` directly.
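A minimal sketch of how the parallel lists passed to `add` line up when metadata is included. The helper `build_add_payload` and the metadata fields `source` and `position` are hypothetical names chosen for illustration; the sample facts are shortened versions of the ones shown in this notebook.

```python
def build_add_payload(facts, source="cat-facts.txt"):
    """Build parallel lists for collection.add(): one id and one metadata dict per document."""
    documents = [fact for fact in facts if fact.strip()]  # skip blank lines
    ids = [f"id{i}" for i in range(len(documents))]
    # Hypothetical metadata fields; values must be simple scalars (str, int, float, bool).
    metadatas = [{"source": source, "position": i} for i in range(len(documents))]
    return {"documents": documents, "ids": ids, "metadatas": metadatas}


payload = build_add_payload([
    "Unlike dogs, cats do not have a sweet tooth.",
    "",
    "The technical term for a cat's hairball is a bezoar.",
])
print(payload["ids"])
print(payload["metadatas"])
# With a real collection: collection.add(**payload)
```

If existing ids should be overwritten rather than skipped, `collection.upsert(...)` takes the same arguments.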

https://docs.trychroma.com/docs/collections/add-data

If needed, collections can be renamed: `collection.modify(name="new_name")`

## Accessing the stored data

https://docs.trychroma.com/docs/querying-collections/query-and-get

In [11]:
print(collection.get().keys())
print(len(collection.get()['documents']))
print(collection.count())

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])
150
150


In [12]:
collection.get(
    ids=["id1", "id2", "id3"],
)

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': None,
 'documents': ['Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.',
  'When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.',
  'The technical term for a cat’s hairball is a “bezoar.”'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [None, None, None]}
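In the output above, `embeddings` is `None` because `get` only returns documents and metadata unless asked otherwise (see the `included` key). A small sketch of requesting the vectors as well through the `include` argument; the wrapper name `get_with_embeddings` is hypothetical.

```python
def get_with_embeddings(collection, ids):
    """Fetch records by id, asking for the stored vectors as well as the documents."""
    return collection.get(ids=ids, include=["embeddings", "documents"])


# With the collection created earlier in this notebook:
# records = get_with_embeddings(collection, ["id1", "id2", "id3"])
# print(len(records["embeddings"]))
```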

In [7]:
# peek at the N first items:
collection.peek(2) 

{'ids': ['id0', 'id1'],
 'embeddings': array([[ 0.0231142 ,  0.06630387, -0.17538089, ..., -0.0107691 ,
          0.00104107, -0.01289808],
        [ 0.07731966,  0.07048559, -0.14541987, ..., -0.01019575,
         -0.04358175, -0.05696372]], shape=(2, 768)),
 'documents': ['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.',
  'Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.'],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': [None, None]}

If your collection has metadata, you can also use e.g. `where={"style": "style1"}` as an argument to the `get` method.

In [8]:
collection.query(
    query_texts=["Can cats swim?"],
    n_results=3,
)

{'ids': [['id39', 'id75', 'id22']],
 'embeddings': None,
 'documents': [['Cats hate the water because their fur does not insulate well when it’s wet. The Turkish Van, however, is one cat that likes swimming. Bred in central Asia, its coat has a unique texture that makes it water resistant.',
   'If they have ample water, cats can tolerate temperatures up to 133 °F.',
   'Some cats have survived falls of over 65 feet (20 meters), due largely to their “righting reflex.” The eyes and balance organs in the inner ear tell it where it is in space so the cat can land on its feet. Even cats without a tail have this ability.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None, None]],
 'distances': [[0.24525678157806396, 0.30478936433792114, 0.3183196187019348]]}
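The result is nested one level deeper than for `get`: each key holds one inner list per query text. A short sketch of flattening the result for a single query into rows; the helper name `flatten_query_result` is hypothetical, and the dictionary below copies the ids and distances from the output above, with the documents shortened.

```python
def flatten_query_result(result, query_index=0):
    """Return (id, distance, document) tuples for one query text, in the order returned."""
    return list(zip(
        result["ids"][query_index],
        result["distances"][query_index],
        result["documents"][query_index],
    ))


# Shape copied from the output above; documents shortened for readability.
result = {
    "ids": [["id39", "id75", "id22"]],
    "distances": [[0.24525678157806396, 0.30478936433792114, 0.3183196187019348]],
    "documents": [[
        "Cats hate the water ...",
        "If they have ample water ...",
        "Some cats have survived falls ...",
    ]],
}

for doc_id, distance, document in flatten_query_result(result):
    print(f"{doc_id}  {distance:.3f}  {document}")
```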

The `query` method also allows filtering on keywords and on metadata:

    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"},

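By default, the distance is the squared L2 (squared Euclidean) distance. This collection was created with `"hnsw:space": "cosine"`, so the `distances` shown above are cosine distances.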
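To make the difference between the metrics concrete, here is a plain-Python sketch of the three distances as I understand them from the Chroma documentation (`l2`, `cosine`, `ip`). The function names are my own, and Chroma computes these internally, so treat this as an illustration rather than its actual implementation.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def inner_product_distance(a, b):
    """One minus the dot product; sensitive to vector length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
print(squared_l2(a, b))              # 1.0
print(cosine_distance(a, b))         # 0.0
print(inner_product_distance(a, b))  # -1.0
```

Two vectors pointing the same way have cosine distance 0 regardless of their lengths, which is why cosine is a common choice for text embeddings.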