<a href="https://colab.research.google.com/github/Muntasir2179/vector-database-learning/blob/main/VD_SQLite_Vector_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQLite Database

Several open-source vector databases are available online. One of them is `Chromadb`.

https://docs.trychroma.com/

In [1]:
import sqlite3

In [2]:
# create a connection to SQLite DB
conn = sqlite3.connect("sample.db")

In [3]:
# Create a cursor
'''
The cursor object lets us execute all the SQL commands
'''
cursor = conn.cursor()

## Now let's create a table

Here we create a `stocks` table with two columns:

* stock_code
* stock_name

In [4]:
cursor.execute("""
CREATE TABLE IF NOT EXISTS stocks(
  stock_code INTEGER PRIMARY KEY,
  stock_name TEXT NOT NULL
)
""")

<sqlite3.Cursor at 0x7dde249b53c0>

In [5]:
# let's now insert some data
cursor.execute("INSERT INTO stocks (stock_name) VALUES (?)", ('TESLA',))
cursor.execute("INSERT INTO stocks (stock_name) VALUES (?)", ('Microsoft',))

<sqlite3.Cursor at 0x7dde249b53c0>

In [6]:
# select records
cursor.execute("SELECT * FROM stocks")

<sqlite3.Cursor at 0x7dde249b53c0>

In [7]:
rows = cursor.fetchall()
rows

[(1, 'TESLA'), (2, 'Microsoft')]

In [8]:
# save the changes
conn.commit()

In [9]:
# it is good practice to close the database connection once we are done with it
conn.close()
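
The connect, commit, and close steps above can also be handled by context managers. The following is a minimal sketch, assuming an in-memory database (`":memory:"`) so that it does not touch `sample.db`; `executemany` is used here only to shorten the two inserts.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() is called, even if an error occurs
with closing(sqlite3.connect(":memory:")) as conn:
    # "with conn" commits on success and rolls back if an exception is raised
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS stocks("
            "stock_code INTEGER PRIMARY KEY, stock_name TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO stocks (stock_name) VALUES (?)",
            [("TESLA",), ("Microsoft",)],
        )
    rows = conn.execute("SELECT * FROM stocks").fetchall()

print(rows)  # [(1, 'TESLA'), (2, 'Microsoft')]
```

Note that `with conn:` only manages the transaction; it is `closing()` that actually closes the connection.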

## Using SQLite as a vector storage

What is a vector?
> The vectors in machine learning signify input data, including bias and weight. In the same way, output from a machine-learning model (for example, a predicted class), can be put into vector format.

```python
# array of numbers -> numpy arrays
vector = [1.2, 2.5, 3.7, 7.5, 5.9]
```

🧮 NOTE: The information, in other words the vectors, must be stored in bytes format.
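
To make "bytes format" concrete, here is a small dependency-free sketch that packs the numbers with the standard `struct` module. It assumes little-endian 8-byte doubles, which is the layout NumPy produces for `float64` arrays on little-endian machines.

```python
import struct

vector = [1.4, 3.5, 2.2, 0.9]

# "<4d" means little-endian, four 8-byte doubles
blob = struct.pack(f"<{len(vector)}d", *vector)
print(len(blob))  # 32 bytes: 4 values x 8 bytes each
print(blob[:8])   # b'ffffff\xf6?' is the float64 encoding of 1.4

# unpacking with the same format restores the original numbers
restored = list(struct.unpack(f"<{len(blob) // 8}d", blob))
print(restored == vector)  # True
```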

In [10]:
import numpy as np

# creating a new connection to store vectors
conn = sqlite3.connect("sample_vectors.db")

# creating cursor to execute SQL commands
cursor = conn.cursor()

In [11]:
cursor.execute("""
CREATE TABLE IF NOT EXISTS vectors (
  vector_id INTEGER PRIMARY KEY,
  vector BLOB NOT NULL
)
""")

<sqlite3.Cursor at 0x7dde249b4ec0>

In [12]:
# creating some vectors as numpy array
vector_tesla = np.array([1.4, 3.5, 2.2, 0.9])
vector_microsoft = np.array([2.8, 1.6, 3.8, 2.2])

In [13]:
# we have to convert our vector into bytes format before inserting it into the database
vector_tesla.tobytes()

b'ffffff\xf6?\x00\x00\x00\x00\x00\x00\x0c@\x9a\x99\x99\x99\x99\x99\x01@\xcd\xcc\xcc\xcc\xcc\xcc\xec?'

In [14]:
# we have to tell sqlite3 that the value is binary data
cursor.execute("INSERT INTO vectors (vector) VALUES (?)", (sqlite3.Binary(vector_tesla.tobytes()),))

<sqlite3.Cursor at 0x7dde249b4ec0>

In [15]:
cursor.execute("INSERT INTO vectors (vector) VALUES (?)", (sqlite3.Binary(vector_microsoft.tobytes()),))

<sqlite3.Cursor at 0x7dde249b4ec0>

In [16]:
cursor.execute("SELECT * FROM vectors")
rows = cursor.fetchall()
rows

[(1,
  b'ffffff\xf6?\x00\x00\x00\x00\x00\x00\x0c@\x9a\x99\x99\x99\x99\x99\x01@\xcd\xcc\xcc\xcc\xcc\xcc\xec?'),
 (2,
  b'ffffff\x06@\x9a\x99\x99\x99\x99\x99\xf9?ffffff\x0e@\x9a\x99\x99\x99\x99\x99\x01@')]

## Retrieving vectors from the database

The data has now been converted into bytes and stored in the database. When we retrieve it, however, we do not get it back in the format we inserted: the database returns raw bytes, and we have to transform them back into the original vector.

This process is called `Deserialization`.

In [17]:
rows[0][1]

b'ffffff\xf6?\x00\x00\x00\x00\x00\x00\x0c@\x9a\x99\x99\x99\x99\x99\x01@\xcd\xcc\xcc\xcc\xcc\xcc\xec?'

In [18]:
# applying deserialization
vector = np.frombuffer(rows[0][1], dtype=np.float64)
vector

array([1.4, 3.5, 2.2, 0.9])

In [19]:
# retrieving all the vectors
vectors = []
for row in rows:
  vectors.append(np.frombuffer(row[1], dtype=np.float64))
vectors

[array([1.4, 3.5, 2.2, 0.9]), array([2.8, 1.6, 3.8, 2.2])]
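
The serialize-on-insert and deserialize-on-select steps can also be delegated to `sqlite3` itself through adapters and converters. The sketch below is one possible arrangement, not part of the notebook: it uses the standard-library `array` type instead of NumPy, and the declared column type `VECTOR` is a name chosen here for illustration.

```python
import sqlite3
from array import array

def adapt_vector(vec):
    """Serialize an array of doubles to bytes when it is passed as a parameter."""
    return vec.tobytes()

def convert_vector(blob):
    """Deserialize bytes back into an array of doubles when a VECTOR column is read."""
    vec = array("d")
    vec.frombytes(blob)
    return vec

sqlite3.register_adapter(array, adapt_vector)
sqlite3.register_converter("VECTOR", convert_vector)

# PARSE_DECLTYPES makes sqlite3 pick a converter based on the declared column type
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute(
    "CREATE TABLE vectors (vector_id INTEGER PRIMARY KEY, vector VECTOR NOT NULL)"
)
conn.execute(
    "INSERT INTO vectors (vector) VALUES (?)", (array("d", [1.4, 3.5, 2.2, 0.9]),)
)
row = conn.execute("SELECT vector FROM vectors").fetchone()
print(list(row[0]))  # [1.4, 3.5, 2.2, 0.9]
conn.close()
```

With this in place the calling code never handles raw bytes, at the cost of registering the adapter and converter globally for the process.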

## Finding the nearest vector

In [20]:
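q_vector = np.array([2.5, 1.2, 3.5, 5.5])

# SQLite cannot do arithmetic on BLOBs, so we register a Python function that computes the Euclidean distance
def l2_distance(blob_a, blob_b):
  a = np.frombuffer(blob_a, dtype=np.float64)
  b = np.frombuffer(blob_b, dtype=np.float64)
  return float(np.linalg.norm(a - b))

conn.create_function("l2_distance", 2, l2_distance)

cursor.execute("""
SELECT vector FROM vectors ORDER BY l2_distance(vector, ?) ASC
""", (sqlite3.Binary(q_vector.tobytes()),))

<sqlite3.Cursor at 0x7dde249b4ec0>

In [21]:
res = cursor.fetchone()
np.frombuffer(res[0], dtype=np.float64)

array([2.8, 1.6, 3.8, 2.2])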
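
SQLite has no built-in arithmetic on BLOB values, so a distance ranking has to be computed on the decoded numbers. A minimal brute-force sketch follows. It decodes every stored vector in Python and sorts by Euclidean distance. The helpers `to_blob`, `from_blob`, and `nearest` are names introduced here for illustration, and `struct` stands in for NumPy to keep the example dependency-free.

```python
import math
import sqlite3
import struct

def to_blob(values):
    """Pack floats as little-endian 8-byte doubles."""
    return struct.pack(f"<{len(values)}d", *values)

def from_blob(blob):
    """Unpack little-endian 8-byte doubles back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vectors (vector_id INTEGER PRIMARY KEY, vector BLOB NOT NULL)"
)
conn.executemany(
    "INSERT INTO vectors (vector) VALUES (?)",
    [(to_blob([1.4, 3.5, 2.2, 0.9]),), (to_blob([2.8, 1.6, 3.8, 2.2]),)],
)

def nearest(conn, query, k=1):
    """Return the k (distance, vector_id, vector) tuples closest to query."""
    scored = []
    for vector_id, blob in conn.execute("SELECT vector_id, vector FROM vectors"):
        vec = from_blob(blob)
        scored.append((math.dist(vec, query), vector_id, vec))
    return sorted(scored)[:k]

best = nearest(conn, [2.5, 1.2, 3.5, 5.5])
print(best[0][1], best[0][2])  # 2 [2.8, 1.6, 3.8, 2.2]
conn.close()
```

This scans every row on each query, which is fine for a handful of vectors but scales linearly with the table size.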

# Using SQLite-VSS for similarity search

By default, `SQLite3` has no vector search capabilities. However, SQLite supports extensions, and one of them is `SQLite-VSS` (SQLite Vector Similarity Search).

SQLite-VSS is an SQLite extension designed for vector search, emphasizing local-first operations and easy integration into applications without external servers. Leveraging the Faiss library, it offers efficient similarity search and clustering capabilities.

In [22]:
!pip install sqlite-vss

Collecting sqlite-vss
  Downloading sqlite_vss-0.1.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux1_x86_64.whl (1.6 MB)
Installing collected packages: sqlite-vss
Successfully installed sqlite-vss-0.1.2


In [23]:
!pip install langchain

Collecting langchain
  Downloading langchain-0.0.354-py3-none-any.whl (803 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.8 (from langchain)
  Downloading langchain_community-0.0.8-py3-none-any.whl (1.5 MB)
Collecting langchain-core<0.2,>=0.1.5 (from langchain)
  Downloading langchain_core-0.1.6-py3-none-any.whl (208 kB)
Collecting langsmith<0.1.0,>=0.0.77 (from langchain)
  Downloading langsmith-

In [24]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import SQLiteVSS
from langchain.document_loaders import TextLoader

In [25]:
loader = TextLoader("ViT_texts.txt")
documents = loader.load()

In [26]:
documents

[Document(page_content='Vision Transformer (ViT) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs) that are currently state-of-the-art in different image recognition computer vision tasks. ViT models outperform the current state-of-the-art (CNN) by almost x4 in terms of computational efficiency and accuracy. Transformer models have become the de-facto status quo in Natural Language Processing (NLP). For example, the popular ChatGPT AI chatbot is a transformer-based language model. Specifically, it is based on the GPT (Generative Pre-trained Transformer) architecture, which uses self-attention mechanisms to model the dependencies between words in a text. In computer vision research, there has recently been a rise in interest in Vision Transformer (ViTs) and Multilayer Perceptrons (MLPs).\n\nWhile the Transformer architecture has become the highest standard for tasks involving Natural Language Processing (NLP), its use cases relating to Computer V

In [29]:
type(documents), type(documents[0])

(list, langchain_core.documents.base.Document)

## Splitting the document into chunks

We split the document into chunks because each chunk is embedded as a single vector. Later, when we run a query against the database, the query is converted into an embedding vector with the same `SentenceTransformer` model, and the database is searched for the stored vectors most similar to it. Embeddings cannot be converted back into text, so each chunk's text is stored alongside its vector, and the chunks belonging to the closest vectors are returned as the most similar answers to the query.
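
The flow described above can be sketched end to end in a few lines. This is only an illustration: `toy_embed` is a hypothetical stand-in that counts words, not a real embedding model, and the two chunks are made-up sample sentences. It does show the key point that the chunk text is stored next to its vector and is what the search returns.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: a bag of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "vision transformers split an image into patches",
    "sqlite stores data in a single file",
]

# each stored embedding is kept together with the chunk it was computed from
index = [(toy_embed(chunk), chunk) for chunk in chunks]

def search(query, k=1):
    """Embed the query, rank stored vectors by similarity, return the chunk texts."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

result = search("how do vision transformers process an image")
print(result)  # ['vision transformers split an image into patches']
```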

In [30]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)



In [31]:
type(docs), type(docs[0])

(list, langchain_core.documents.base.Document)

In [32]:
docs[0].page_content

'Vision Transformer (ViT) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs) that are currently state-of-the-art in different image recognition computer vision tasks. ViT models outperform the current state-of-the-art (CNN) by almost x4 in terms of computational efficiency and accuracy. Transformer models have become the de-facto status quo in Natural Language Processing (NLP). For example, the popular ChatGPT AI chatbot is a transformer-based language model. Specifically, it is based on the GPT (Generative Pre-trained Transformer) architecture, which uses self-attention mechanisms to model the dependencies between words in a text. In computer vision research, there has recently been a rise in interest in Vision Transformer (ViTs) and Multilayer Perceptrons (MLPs).'

In [33]:
texts = [doc.page_content for doc in docs]
texts

['Vision Transformer (ViT) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs) that are currently state-of-the-art in different image recognition computer vision tasks. ViT models outperform the current state-of-the-art (CNN) by almost x4 in terms of computational efficiency and accuracy. Transformer models have become the de-facto status quo in Natural Language Processing (NLP). For example, the popular ChatGPT AI chatbot is a transformer-based language model. Specifically, it is based on the GPT (Generative Pre-trained Transformer) architecture, which uses self-attention mechanisms to model the dependencies between words in a text. In computer vision research, there has recently been a rise in interest in Vision Transformer (ViTs) and Multilayer Perceptrons (MLPs).',
 'While the Transformer architecture has become the highest standard for tasks involving Natural Language Processing (NLP), its use cases relating to Computer Vision (CV) remain onl

In [54]:
print(f"Number of chunks is: {len(texts)}")

Number of chunks is: 14
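
The number of chunks depends on how the text is cut. `CharacterTextSplitter` splits on a separator (a blank line by default) and merges the pieces into chunks of up to `chunk_size` characters. The function below is a simplified sketch of that idea, not LangChain's implementation. The name `split_into_chunks` is introduced here for illustration, and overlap handling is left out.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # the piece still fits, or it is a single oversized piece kept whole
            current = candidate
        else:
            # close the current chunk and start a new one with this piece
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Because pieces are never cut in the middle, a paragraph longer than `chunk_size` ends up as one oversized chunk in this sketch.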


## Using sentence-transformer for generating embeddings

In [34]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=2ddfa2689ce557c16337c6bca4d35b0504b0ed1550cd925581f02b03de354565
  Stored in directory: 

In [35]:
from google.colab import userdata
# read the Hugging Face token from the Colab secrets; assign it instead of echoing it so the key never appears in the output
hf_token = userdata.get('huggingface_key')

In [36]:
# creating an open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

In [38]:
# load it in sqlite-vss in a table named state_union
# the db_file parameter is the name of the file you want as your sqlite database

db = SQLiteVSS.from_texts(texts=texts,
                          embedding=embedding_function,
                          table="state_union",
                          db_file="vss.db")

In [47]:
# now let's pass some query
query = "What is Vision Transformer?"
data = db.similarity_search(query)
print(f"There are {len(data)} possible answers, the most probable one is the one on the first index of the response.")

There are 4 possible answers, the most probable one is the one on the first index of the response.


In [48]:
data

[Document(page_content='The vision transformer model uses multi-head self-attention in Computer Vision without requiring image-specific biases. The model splits the images into a series of positional embedding patches, which are processed by the transformer encoder. It does so to understand the local and global features that the image possesses. Last but not least, the ViT has a higher precision rate on a large dataset with reduced training time.'),
 Document(page_content='The Vision Transformer (ViT) model architecture was introduced in a research paper published as a conference paper at ICLR 2021 titled “An Image is Worth 16*16 Words: Transformers for Image Recognition at Scale”. It was developed and published by Neil Houlsby, Alexey Dosovitskiy, and 10 more authors of the Google Research Brain Team. The fine-tuning code and pre-trained ViT models are available on the GitHub of the Google Research team. You find them here. The ViT models were pre-trained on the ImageNet and ImageNet-

In [51]:
# let's see the most similar answer to the query
data[0].page_content

'The vision transformer model uses multi-head self-attention in Computer Vision without requiring image-specific biases. The model splits the images into a series of positional embedding patches, which are processed by the transformer encoder. It does so to understand the local and global features that the image possesses. Last but not least, the ViT has a higher precision rate on a large dataset with reduced training time.'