# Visual Data Management System (VDMS)

>[VDMS](https://github.com/IntelLabs/vdms) is a storage solution for efficient access of big-"visual"-data that aims to achieve cloud scale by searching for relevant visual data via visual metadata stored as a graph and enabling machine friendly enhancements to visual data for faster access. VDMS is licensed under MIT.

It supports:
- K nearest neighbor search
- Euclidean distance (L2) and inner product (IP)
- Vector and metadata searches
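
To make the two distance metrics concrete, here is a minimal pure-Python sketch of a brute-force k-nearest-neighbor search under L2 and IP. It only illustrates the idea and is not how VDMS indexes or searches; `l2_distance`, `inner_product`, `knn`, and the toy vectors are hypothetical names chosen for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP): larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def knn(query_vec, vectors, k, metric="L2"):
    """Return the indices of the k best-matching vectors under the metric."""
    if metric == "L2":
        scored = [(l2_distance(query_vec, v), i) for i, v in enumerate(vectors)]
        scored.sort()  # ascending: closest first
    elif metric == "IP":
        scored = [(inner_product(query_vec, v), i) for i, v in enumerate(vectors)]
        scored.sort(reverse=True)  # descending: largest product first
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return [i for _, i in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
toy_query = [1.0, 0.2]

print(knn(toy_query, toy_vectors, k=2, metric="L2"))  # -> [0, 1]
print(knn(toy_query, toy_vectors, k=2, metric="IP"))  # -> [2, 0]
```

The two metrics rank the same vectors differently because the inner product rewards vector magnitude, while L2 distance only rewards closeness.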

See the [installation instructions](https://github.com/IntelLabs/vdms/blob/master/INSTALL.md) and [docker image](https://hub.docker.com/r/intellabs/vdms).

This notebook shows how to use VDMS as a vector store using the docker image.


Install Python packages for VDMS client and Sentence Transformers:

In [1]:
# Pip install necessary package
%pip install --upgrade --quiet pip sentence-transformers vdms

Note: you may need to restart the kernel to use updated packages.


## Start VDMS Server
Here we start the VDMS server in a Docker container and publish it on port 55555.

In [2]:
!docker run --rm -d -p 55555:55555 --name vdms_vs_test_nb intellabs/vdms:latest

a115b72998c9368f679d49acdb4bd834de9d1a7c464189c2da3befa3a5c10a61
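
The container may need a moment before it accepts connections. As an alternative to a fixed pause, the sketch below polls the TCP port until a connection succeeds. The helper name `wait_for_port` and its default timings are assumptions made for this example, and a successful TCP connection only shows that the port is open, not that VDMS is fully ready to serve queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With the container started as above, `wait_for_port("localhost", 55555)` could be called before creating the vector store.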


## Basic Example (using the Docker Container)

In this basic example, we demonstrate adding documents into VDMS and using it as a vector database.

You can run the VDMS Server in a Docker container separately to use with LangChain. 

VDMS can handle multiple collections of documents, but the LangChain interface expects one, so we need to specify the name of the collection. The default collection name used by LangChain is "langchain".


In [3]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.vdms import VDMS
from langchain_community.embeddings import HuggingFaceEmbeddings

import time

# Give the VDMS container a moment to finish starting up
time.sleep(2)
DELIMITER = "-" * 50

# Configurations
connection_args={"host": "localhost", "port": 55555}

Here are some helper functions for printing results.

In [4]:
def print_document_details(doc):
    print(f"Content:\n\t{doc.page_content}\n")
    print("Metadata:")
    for key, value in doc.metadata.items():
        if value != 'Missing property':
            print(f"\t{key}:\t{value}")

def print_results(similarity_results, score=True):
    print(f"{DELIMITER}\n")
    if score:
        # Use a distinct loop name so the `score` flag is not shadowed
        for doc, doc_score in similarity_results:
            print(f"Score:\t{doc_score}\n")
            print_document_details(doc)
            print(f"{DELIMITER}\n")
    else:
        for doc in similarity_results:
            print_document_details(doc)
            print(f"{DELIMITER}\n")

def print_response(list_of_entities):
    for ent in list_of_entities:
        for key, value in ent.items():
            if value != 'Missing property':
                print(f"\n{key}:\n\t{value}")
        print(f"{DELIMITER}\n")

### Load Document and Obtain Embedding Function
Here we load the 2022 State of the Union address and split the document into chunks. We also specify the embedding model to be used.
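
For intuition about what "splitting into chunks" means, the following is a rough pure-Python sketch that packs paragraph-separated pieces into chunks of at most `chunk_size` characters. It is not LangChain's `CharacterTextSplitter` implementation and it ignores overlap; `split_into_chunks` is a hypothetical helper written only for this illustration.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))  # -> ['aaaa\n\nbbbb', 'cccc']
```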

In [5]:
# load the document and split it into chunks
document_path = "../../modules/state_of_the_union.txt"
raw_documents = TextLoader(document_path).load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(raw_documents)
print(f"# Documents: {len(docs)}")

# create the open-source embedding function
embedding_function = HuggingFaceEmbeddings()
print(
    f"# Embedding Dimensions: {len(embedding_function.embed_query('This is a test document.'))}"
)

# Documents: 42
# Embedding Dimensions: 768


### Similarity Search using Flinng and Inner Product

In this section, we add the documents to VDMS using FLINNG indexing and Inner Product as the distance metric for similarity search. We search for one document (`k=1`) related to the query `What did the president say about Ketanji Brown Jackson` and also return the score along with the document.

LangChain vector stores use a string `id` to keep track of documents. By default, `id` is a UUID, but here we define it as an integer cast to a string.

In [6]:
db_flinng = VDMS.from_documents(
    docs,
    ids=[str(i) for i in range(1, len(docs) + 1)],
    collection_name="my_collection_flinng_IP",
    embedding_function=embedding_function,
    engine="Flinng",
    distance_strategy="IP",
    connection_args=connection_args,
)

# Query
k = 1
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db_flinng.similarity_search_with_score(query, k, filter=None)
print_results(docs_with_score)

--------------------------------------------------

Score:	0.0

Content:
	Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspir

### Similarity Search using Faiss Flat and Euclidean Distance (Default)

In this section, we add the documents to VDMS using FAISS Flat indexing (default) and Euclidean distance (default) as the distance metric for similarity search. We search for three documents (`k=3`) related to the query `What did the president say about Ketanji Brown Jackson`.

In [7]:
# create simple ids
ids = [str(i) for i in range(1, len(docs) + 1)]

# add data
collection_name="my_collection_faiss_L2"
db = VDMS.from_documents(
    docs,
    ids=ids,
    collection_name=collection_name,
    embedding_function=embedding_function,
    connection_args=connection_args,
)

# Query
k = 3
query = "What did the president say about Ketanji Brown Jackson"
# Pass k explicitly and keep the original `docs` chunks intact
returned_docs = db.similarity_search(query, k=k)
print_results(returned_docs, score=False)

--------------------------------------------------

Content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata:
	id:	32
	source:	../../modules/state_of_the_union.txt
--------------------------------------------------

Content:
	As Frances Haugen, who 

### Update and Delete

While building toward a real application, you want to go beyond adding data, and also update and delete data.

Here is a basic example showing how to do so.  First we will update the metadata for the document most relevant to the query.

In [8]:
doc = db.similarity_search(query)[0]
print(f"Original metadata: \n\t{doc.metadata}")

# update the metadata for a document
doc.metadata["new_value"] = "hello world"
print(f"new metadata: \n\t{doc.metadata}")
print(f"{DELIMITER}\n")

# Update document in VDMS
id_to_update = doc.metadata["id"]
db.update_document(collection_name, id_to_update, doc)
response, response_array = db.get(collection_name, constraints={"id": ["==", id_to_update]})

# Display Results
print(f"UPDATED ENTRY (id={id_to_update}):")
print_response([response[0]['FindDescriptor']['entities'][0]])

Original metadata: 
	{'id': '32', 'source': '../../modules/state_of_the_union.txt'}
new metadata: 
	{'id': '32', 'source': '../../modules/state_of_the_union.txt', 'new_value': 'hello world'}
--------------------------------------------------

UPDATED ENTRY (id=32):

content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal

Next we will delete the last document by ID (id=42).

In [9]:
print("Documents before deletion: ", db.count(collection_name))

id_to_remove = ids[-1]
db.delete(collection_name=collection_name, ids=[id_to_remove])
print(f"Documents after deletion (id={id_to_remove}): ", db.count(collection_name))

Documents before deletion:  42
Documents after deletion (id=42):  41


## Other Information
### Similarity search by vector
Instead of searching by string query, you can also search by embedding/vector.

In [10]:
embedding_vector = embedding_function.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)

# Print Results
print_document_details(docs[0])

Content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata:
	id:	32
	new_value:	hello world
	source:	../../modules/state_of_the_union.txt


### Filtering on metadata

It can be helpful to narrow down the collection before working with it.

For example, collections can be filtered on metadata using the `get` method. Here we retrieve the document where `id = 2`.
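
The `constraints` argument maps a property name to an `[operator, value]` pair. The plain-Python sketch below shows how such a filter reads when applied to a list of dictionaries. It is only an illustration: VDMS evaluates constraints on the server, and the `OPS` table, `matches`, `filter_entities`, and the exact operator set are assumptions made for this example.

```python
import operator

# Assumed operator set, for illustration only
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(properties, constraints):
    """Return True if every [operator, value] constraint holds for the properties."""
    for key, (op, expected) in constraints.items():
        if key not in properties or not OPS[op](properties[key], expected):
            return False
    return True


def filter_entities(entities, constraints, limit=None):
    """Keep entities that satisfy all constraints, optionally capped at limit."""
    hits = [entity for entity in entities if matches(entity, constraints)]
    return hits if limit is None else hits[:limit]


toy_entities = [
    {"id": "1", "source": "a.txt"},
    {"id": "2", "source": "a.txt"},
    {"id": "3", "source": "b.txt"},
]
print(filter_entities(toy_entities, {"id": ["==", "2"]}, limit=1))
# -> [{'id': '2', 'source': 'a.txt'}]
```

Note that `id` values are strings in this notebook, so the constraint compares against `"2"` rather than the integer `2`.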

In [11]:
response, response_array = db.get(collection_name,
                                  limit=1,
                                  constraints={"id": ["==", "2"]}
)

print("Returned entry:")
print_response([response[0]['FindDescriptor']['entities'][0]])

Returned entry:

content:
	Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. 

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. 

Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. 

Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. 

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   

They keep moving.   

And the costs and the threats to America and the world keep rising.   

That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. 

The United States is a member along with 29 other nations. 

It matters. American diplomacy matters. Ame

### Retriever options

This section goes over different options for how to use VDMS as a retriever.


#### Similarity Search

Here we use similarity search in the retriever object.


In [12]:
retriever = db.as_retriever()
relevant_docs = retriever.get_relevant_documents(query)[0]

print_document_details(relevant_docs)

Content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata:
	id:	32
	new_value:	hello world
	source:	../../modules/state_of_the_union.txt


#### Maximal Marginal Relevance Search (MMR)

In addition to using similarity search in the retriever object, you can also use `mmr`.
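
For intuition, MMR picks results one at a time, trading relevance to the query against similarity to the results already chosen. The following is a minimal pure-Python sketch of that selection rule using cosine similarity. It is not the code that LangChain or VDMS runs; `cosine`, `mmr`, and the toy vectors are illustrative names, and `lambda_mult` is used here as the relevance/diversity weight (1.0 means pure relevance).

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query_vec = [1.0, 0.0]
toy_doc_vecs = [[1.0, 0.1], [1.0, 0.11], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates

print(mmr(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=0.5))  # -> [0, 2]
print(mmr(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=1.0))  # -> [0, 1]
```

With equal weighting the near-duplicate is skipped in favor of a less similar document, while pure relevance returns both near-duplicates.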

In [13]:
retriever = db.as_retriever(search_type="mmr")
relevant_docs = retriever.get_relevant_documents(query)[0]

print_document_details(relevant_docs)

Content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata:
	id:	32
	new_value:	hello world
	source:	../../modules/state_of_the_union.txt


We can also use MMR directly.

In [14]:
mmr_resp = db.max_marginal_relevance_search_with_score(query, k=2, fetch_k=10)
print_results(mmr_resp)

--------------------------------------------------

Score:	1.2032092809677124

Content:
	Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata:
	id:	32
	new_value:	hello world
	source:	../../modules/state_of_the_union.txt
----------------------------------

### Delete collection
Previously, we removed documents by ID. Here, all documents in the collection are removed because no ID is provided.

In [15]:
print("Documents before deletion: ", db.count(collection_name))

db.delete(collection_name=collection_name)

print("Documents after deletion: ", db.count(collection_name))

Documents before deletion:  41
Documents after deletion:  0


## Stop VDMS Server

In [16]:
!docker kill vdms_vs_test_nb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


vdms_vs_test_nb
