<div id="singlestore-header" style="display: flex; background-color: rgba(235, 249, 245, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/database.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">A Deep Dive Into Vector Databases</h1>
    </div>
</div>

**Required Installations**

In [9]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.3.0-py3-none-any.whl (300 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.0


In [10]:
!pip install openai numpy pandas singlestoredb langchain==0.1.8 langchain-community==0.0.21 langchain-core==0.1.25 langchain-openai==0.0.6

Collecting openai
  Downloading openai-1.63.0-py3-none-any.whl.metadata (27 kB)
Collecting langchain==0.1.8
  Downloading langchain-0.1.8-py3-none-any.whl.metadata (13 kB)
Collecting langchain-community==0.0.21
  Downloading langchain_community-0.0.21-py3-none-any.whl.metadata (8.1 kB)
Collecting langchain-core==0.1.25
  Downloading langchain_core-0.1.25-py3-none-any.whl.metadata (6.0 kB)
Collecting langchain-openai==0.0.6
  Downloading langchain_openai-0.0.6-py3-none-any.whl.metadata (2.5 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain==0.1.8)
  Downloading aiohttp-3.11.12-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.1.8)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain==0.1.8)
  Downloading langsmith-0.1.147-py3-none-any.whl.metadata (14 kB)
Collecting tenacity<9.0.0,>=8.1.0 (from langchain==0.1.8)
  Downloading tenac

## Vector Embedding Example

In this example, we demonstrate a rule-based system that generates a vector embedding from a single word. The embedding that we generate contains 5 main features:
- Length of word
- Number of vowels in the word (normalized to the length of the word)
- Whether the word starts with a vowel (1) or not (0)
- Whether the word ends with a vowel (1) or not (0)
- Percentage of consonants in the word

This is a simple implementation of a **rule**-based system that demonstrates the essence of what vector embedding models do. Real embedding models, however, utilize neural networks that are trained on vast datasets to learn key features and self-correct using gradient descent.
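The five features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's original cell: the function name `rule_based_embedding` is hypothetical, only `a, e, i, o, u` are treated as vowels (so `y` counts as a consonant), and the consonant "percentage" is expressed as a fraction between 0 and 1.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def rule_based_embedding(word: str) -> list[float]:
    """Map a word to the five hand-written features described above."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        raise ValueError("word must contain at least one letter")
    n = len(letters)
    vowel_count = sum(ch in VOWELS for ch in letters)
    return [
        float(n),                               # length of the word
        vowel_count / n,                        # vowels, normalized by length
        1.0 if letters[0] in VOWELS else 0.0,   # starts with a vowel?
        1.0 if letters[-1] in VOWELS else 0.0,  # ends with a vowel?
        (n - vowel_count) / n,                  # fraction of consonants
    ]


print(rule_based_embedding("apple"))  # [5.0, 0.4, 1.0, 1.0, 0.6]
```

Every word, whatever its length, maps to a vector of the same size, which is what makes the vectors comparable later on.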

## Vector Similarity Example

In this example, we demonstrate a way to determine the similarity between two vectors. There are many techniques for measuring the similarity between two vectors, but one of the most popular is **cosine similarity**. Cosine similarity is the dot product of the two vectors divided by the product of the vectors' norms (magnitudes): $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$.
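As a minimal sketch of that definition, the function below computes cosine similarity with only the standard library. The name `cosine_similarity` and the choice to raise on zero-length vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Perpendicular vectors share no direction, so their similarity is 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
# Vectors pointing the same way score close to 1, regardless of length.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the result depends only on the angle between the vectors, scaling either vector does not change the score.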

This is just an example to show how vector databases search for similar vectors. The fundamental problem with a system like this is the rule-based embedding itself: it does not capture the semantic meaning of words, sentences, or paragraphs. Instead, it only classifies the structure of a single word.

## Embedding Models

To capture a semantic understanding of language within vectors, embedding models are required. Embedding models are trained on a vast corpus of language data. In Word2Vec-style models, training starts by initializing word embeddings with random vectors, so each word in the vocabulary is assigned a vector of real numbers. A neural network is then trained on large datasets to predict a word from its context (Continuous Bag of Words model) or to predict the context given a word (Skip-Gram model). During training, the model uses gradient descent to adjust the word vectors and minimize a loss function, often related to the likelihood of observing a word given its context (or vice versa).
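The training loop described above can be sketched on a toy scale. The code below is a simplified Skip-Gram illustration with a full softmax and plain stochastic gradient descent, not how production embedding models are trained. The function name `train_skipgram`, the hyperparameters, and the twelve-word corpus are all assumptions chosen for the example.

```python
import numpy as np


def train_skipgram(tokens, dim=8, window=1, lr=0.1, epochs=100, seed=0):
    """Train toy Skip-Gram embeddings with a full softmax and plain SGD."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    # Step 1: initialize every word vector with small random numbers.
    w_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # word embeddings
    w_out = rng.normal(scale=0.1, size=(dim, len(vocab)))  # output weights
    # Step 2: collect (center, context) pairs from a sliding window.
    pairs = [
        (index[tokens[i]], index[tokens[j]])
        for i in range(len(tokens))
        for j in range(max(0, i - window), min(len(tokens), i + window + 1))
        if i != j
    ]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            hidden = w_in[center].copy()
            scores = hidden @ w_out
            probs = np.exp(scores - scores.max())  # numerically stable softmax
            probs /= probs.sum()
            total += -np.log(probs[context])       # cross-entropy loss
            # Step 3: gradient descent on both weight matrices.
            grad_scores = probs.copy()
            grad_scores[context] -= 1.0
            grad_hidden = w_out @ grad_scores
            w_out -= lr * np.outer(hidden, grad_scores)
            w_in[center] -= lr * grad_hidden
        losses.append(total / len(pairs))
    return vocab, w_in, losses


tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, embeddings, losses = train_skipgram(tokens)
print(f"average loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The average loss falls as the vectors are adjusted, which is the "self-correcting" behaviour that a rule-based embedding lacks.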

Examples of embedding models include Word2Vec, GloVe, BERT, and OpenAI's text-embedding models.

An embedding produced by one of these models is a huge vector, with over 1000 dimensions in a single vector. This is why it is important for us to have good dimensionality reduction techniques during the similarity searches.

## Creating a vector database with SingleStoreDB

In the following code, we create a vector database with SingleStoreDB. We utilize LangChain to load the raw text and split it into documents, and we use an embedding model to generate the vector embeddings. The active cells below use EmbedAnything with a PubMedBERT model, while the commented-out cell shows the OpenAI embeddings variant. We then take the raw documents and embeddings and create a table that stores each document's text alongside its embedding.

To test this out, we perform a similarity search based on a query, which returns the most similar documents in the vector database.
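Conceptually, a similarity search ranks every stored embedding against the query embedding and returns the best matches. The sketch below does this by brute force with NumPy. It is an illustration only and makes no claim about how SingleStoreDB executes the search internally; the function name `top_k_similar` and the three-dimensional toy vectors are assumptions.

```python
import numpy as np


def top_k_similar(query_vec, doc_vecs, k=2):
    """Return (row index, cosine score) for the k rows most similar to the query."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    scores = doc_vecs @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy table: each document is stored next to its (made-up) embedding.
docs = ["cats purr", "dogs bark", "stocks fell"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]
query = [1.0, 0.0, 0.0]

for i, score in top_k_similar(query, vectors, k=2):
    print(f"{docs[i]!r} scored {score:.3f}")
```

Scanning every row like this costs time proportional to the table size, which is why vector databases typically add indexes for large collections.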

In [11]:
!pip install embed-anything==0.4.15

Collecting embed-anything==0.4.15
  Downloading embed_anything-0.4.15-cp311-cp311-manylinux_2_34_x86_64.whl.metadata (13 kB)
Collecting onnxruntime==1.19.2 (from embed-anything==0.4.15)
  Downloading onnxruntime-1.19.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting coloredlogs (from onnxruntime==1.19.2->embed-anything==0.4.15)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting flatbuffers (from onnxruntime==1.19.2->embed-anything==0.4.15)
  Downloading flatbuffers-25.2.10-py2.py3-none-any.whl.metadata (875 bytes)
Collecting protobuf (from onnxruntime==1.19.2->embed-anything==0.4.15)
  Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Collecting sympy (from onnxruntime==1.19.2->embed-anything==0.4.15)
  Downloading sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime==1.19.2->embed-anything==0.4.15)
  Downloading humanfriendly-

In [12]:
import os
if not os.path.exists("EmbedAnything"):
  !git clone https://github.com/StarlightSearch/EmbedAnything.git

Cloning into 'EmbedAnything'...
remote: Enumerating objects: 4985, done.
remote: Counting objects: 100% (410/410), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 4985 (delta 355), reused 334 (delta 334), pack-reused 4575 (from 2)
Receiving objects: 100% (4985/4985), 30.85 MiB | 21.43 MiB/s, done.
Resolving deltas: 100% (3128/3128), done.


In [13]:
import embed_anything
import numpy as np
import time
from embed_anything import EmbedData, EmbeddingModel, TextEmbedConfig, WhichModel
import os

In [14]:
from langchain_core.documents.base import Document


In [15]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB



In [16]:
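class EmbedAnythingEmbeddings(Embeddings):
    """LangChain Embeddings adapter backed by an EmbedAnything model."""

    def __init__(self, model_type: WhichModel, model_id: str, config: TextEmbedConfig):
        # Use the model type passed by the caller instead of hard-coding WhichModel.Bert
        self.model = EmbeddingModel.from_pretrained_hf(model_type, model_id=model_id)
        self.config = config

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Embed all texts in one call and return one vector per text
        embed_data = embed_anything.embed_query(texts, self.model, config=self.config)
        return [e.embedding for e in embed_data]

    def embed_query(self, text: str) -> list[float]:
        # Wrap the single query in a list and unwrap the single result
        embed_data = embed_anything.embed_query([text], self.model, config=self.config)[0]
        return embed_data.embedding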

In [17]:
! sudo apt-get install wget

sudo: The "no new privileges" flag is set, which prevents sudo from running as root.
sudo: If sudo is running in a container, you may need to adjust the container configuration to disable the flag.


In [18]:
!wget https://www.biorxiv.org/content/10.1101/2025.01.23.634433v1.full.pdf -O EmbedAnything/bench/medicine.pdf


--2025-02-13 22:01:36--  https://www.biorxiv.org/content/10.1101/2025.01.23.634433v1.full.pdf
Resolving www.biorxiv.org (www.biorxiv.org)... 104.18.34.83, 172.64.153.173, 2606:4700:4400::ac40:99ad, ...
Connecting to www.biorxiv.org (www.biorxiv.org)|104.18.34.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘EmbedAnything/bench/medicine.pdf’

EmbedAnything/bench     [  <=>               ]   1.55M  7.31MB/s    in 0.2s    

2025-02-13 22:01:39 (7.31 MB/s) - ‘EmbedAnything/bench/medicine.pdf’ saved [1625826]



In [37]:
config = TextEmbedConfig(chunk_size=256, batch_size=32)
embedding_fn = EmbedAnythingEmbeddings(WhichModel.Bert, "NeuML/pubmedbert-base-embeddings", config)


Loading weights from "/home/jovyan/.cache/huggingface/hub/models--NeuML--pubmedbert-base-embeddings/snapshots/ba210f40b1b6d555d675c2d1ed6372e44570fc3c/pytorch_model.bin"
Can't find model.safetensors, loading from pytorch_model.bin


In [21]:
!cd EmbedAnything/bench && ls 

attention.pdf  colpali.pdf  medicine.pdf  mistral.pdf


In [22]:
import os
import openai
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB
from openai import OpenAI

In [24]:
from langchain_community.vectorstores.singlestoredb import SingleStoreDB

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
loader = PyPDFLoader("EmbedAnything/bench/medicine.pdf") # use your own document
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
vector_database = SingleStoreDB.from_documents(docs, embedding=embedding_fn, table_name="demo_med")  # create your own table

In [41]:
query = "What is radiotherapy"
docs = vector_database.similarity_search(query)
print(docs[0].page_content)

Figure S1.The growth of the tumor at t = (0,400,800,1200) from group (a) control pmut = 0.1, (b) control pmut = 0.3, (c)
control pmut = 0.5, (d) hyperfractioned radiotherapy pmut = 0.5, (e) conventional radiotherapy pmut = 0.5, (f) hypofractioned
radiotherapy pmut = 0.5, (g) conventional radiotherapy pmut = 0.1, (h) conventional radiotherapy pmut = 0.3, (i) targeted
radiotherapy pmut = 0.5. Blue and red cells are CCs and CSCs, respectively.
12/12
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 
The copyright holder for this preprintthis version posted January 25, 2025. ; https://doi.org/10.1101/2025.01.23.634433doi: bioRxiv preprint


In [5]:
# import openai
# from langchain.text_splitter import CharacterTextSplitter
# from langchain_community.document_loaders import TextLoader
# from langchain_community.embeddings import OpenAIEmbeddings
# from langchain_community.vectorstores.singlestoredb import SingleStoreDB
# from openai import OpenAI
# import os
# import pandas as pd


# # Load and process documents
# loader = TextLoader("michael_jackson.txt") # use your own document

# documents = loader.load()
# text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
# docs = text_splitter.split_documents(documents)

# # Generate embeddings and create a document search database
# embeddings = OpenAIEmbeddings(api_key=OPENAI_KEY)

# # Create Vector Database
# vector_database = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson") # create your own table

# query = "How old was Michael Jackson when he died?"
# docs = vector_database.similarity_search(query)
# print(docs[0].page_content)

## Retrieval Augmented Generation System

RAG combines large language models with a retrieval mechanism to search a database for relevant information before generating responses. It utilizes real-world data from retrieved documents to ground responses, enhancing factual accuracy and reducing hallucinations. Documents are vectorized using embeddings and stored in a vector database for efficient retrieval. SingleStoreDB serves as a great vector database. The user query is converted into a vector, and a vector search is performed in the database to find documents relevant to that specific query. The system returns the documents with the highest relevance scores, which are then fed to the chatbot for generating informed responses.
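The retrieve-then-generate flow can be sketched without any external services. In the illustration below, `retrieve` is a hypothetical stand-in that ranks chunks by word overlap instead of a real vector search, and `build_rag_messages` assembles the chat messages that would be sent to a language model. Both names, the system prompt wording, and the toy chunks are assumptions, and the model call itself is omitted.

```python
def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Stand-in for vector search: rank chunks by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_messages(question: str, retrieved: list[str]) -> list[dict]:
    """Put the retrieved text in the system message so the answer is grounded in it."""
    context = "\n\n".join(retrieved)
    return [
        {
            "role": "system",
            "content": "Answer using only the context below. "
                       "If the context is insufficient, say so.\n\nContext:\n" + context,
        },
        {"role": "user", "content": question},
    ]


chunks = [
    "Radiotherapy uses ionizing radiation to kill tumor cells.",
    "The stock market closed higher today.",
]
question = "what is radiotherapy"
messages = build_rag_messages(question, retrieve(question, chunks, k=1))
print(messages[0]["content"])
```

Keeping retrieval and prompt assembly as separate functions makes it easy to swap the stand-in retriever for a real call such as a vector store's similarity search.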

In [44]:
import os
import openai
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB
from openai import OpenAI

# Set up the OpenAI client; read the API key from the environment instead of hard-coding it
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
client = OpenAI(api_key=OPENAI_KEY)

# Load and process documents
loader = TextLoader("michael_jackson.txt")  # the file must exist in the working directory; use your own document
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Generate embeddings and create a document search database
embeddings = OpenAIEmbeddings(api_key=OPENAI_KEY)
docsearch = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson")

# Chat loop
while True:
    # Get user input
    user_query = input("\nYou: ")

    # Check for exit command
    if user_query.lower() in ['quit', 'exit']:
        print("Exiting chatbot.")
        break

    # Perform similarity search
    docs = docsearch.similarity_search(user_query)
    if docs:
        context = docs[0].page_content

        # Generate response using OpenAI GPT-4
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Context: " + context},
                {"role": "user", "content": user_query}
            ],
            stream=True,
            max_tokens=500,
        )

        # Output the response
        print("AI: ", end="")
        for chunk in response:
            if chunk.choices[0].delta.content is not None:
                print(chunk.choices[0].delta.content, end="")

    else:
        print("AI: Sorry, I couldn't find relevant information.")

RuntimeError: Error loading michael_jackson.txt

<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>