## vector database

This notebook shows how to create sentence embeddings and store them in a vector database (here, a FAISS index).

In [6]:
!pip install faiss-cpu


### create embedding

In [2]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # maximum number of tokens per input; longer inputs are truncated

model.max_seq_length = 256

# Sentences we want to encode
sentences = [
    "dinosaurs live in africa but in different time dimension", 
    "this is sentence about little cat that liked to eat tomatoes",
    "this is the another sample sentence which is here just to not be matched while other one is"
]

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences, normalize_embeddings=True)

256


In [3]:
embeddings

array([[-0.04987683,  0.03634831,  0.01747592, ..., -0.05154557,
         0.01327896, -0.05160031],
       [ 0.06277536,  0.07880409,  0.01862673, ...,  0.18015604,
         0.07854404,  0.0105872 ],
       [-0.01920933,  0.06346308,  0.07642584, ...,  0.01450102,
         0.08586987, -0.00456648]], dtype=float32)
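
The `normalize_embeddings=True` flag scales every embedding to unit length, so the dot product of two embeddings equals their cosine similarity. The following is a minimal NumPy sketch of that property; the 4-dimensional vectors `raw` are made-up stand-ins for the real 384-dimensional embeddings and are not taken from the model.

```python
import numpy as np

# Hypothetical 4-dimensional vectors standing in for real embeddings.
raw = np.array([[3.0, 4.0, 0.0, 0.0],
                [1.0, 1.0, 1.0, 1.0]], dtype=np.float32)

# Scale each row to unit L2 norm, which is what normalization does.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Every row now has length 1.
norms = np.linalg.norm(unit, axis=1)

# For unit vectors, the dot product is the cosine similarity.
cosine = float(unit[0] @ unit[1])
print(norms, cosine)
```

Running the same norm check on `embeddings` from the cell above is a quick way to confirm that the vectors were normalized.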

### create vector DB and load documents

In [7]:
import faiss

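d = embeddings.shape[1]  # embedding dimension (384 for all-MiniLM-L6-v2)

# Build an exact (brute-force) index that compares vectors by L2 distance
index = faiss.IndexFlatL2(d)
index.add(embeddings)  # add the sentence vectors to the index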
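
To make clear what a flat L2 index computes, here is a rough NumPy sketch of an exhaustive search by squared L2 distance. The helper `flat_l2_search` and the 2-dimensional toy vectors are illustrative assumptions, not part of FAISS; the real index does the same comparison against every stored vector, only much faster.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Exhaustive nearest-neighbour search by squared L2 distance (illustrative helper)."""
    # Pairwise differences, shape (n_queries, n_stored, dim).
    diff = queries[:, None, :] - stored[None, :, :]
    sq_dist = np.sum(diff * diff, axis=2)
    # Indices of the k smallest distances for each query, closest first.
    order = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order

# Toy unit vectors standing in for stored embeddings.
stored = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
query = np.array([[0.6, 0.8]], dtype=np.float32)

dists, ids = flat_l2_search(stored, query, k=2)
print(ids, dists)
```

Because all vectors here have unit length, the squared L2 distance equals `2 - 2 * cosine`, so ranking by L2 distance gives the same order as ranking by cosine similarity.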

### perform search

In [8]:
queryText = "french fries"
embeddingSearch = model.encode([queryText], normalize_embeddings=True)
# index.search returns (distances, indices) of the k nearest stored vectors; here k=1
distances, idx = index.search(embeddingSearch, 1)
print(queryText + " matches:\n" + sentences[idx[0][0]])

queryText = "not similar text"
embeddingSearch = model.encode([queryText], normalize_embeddings=True)
distances, idx = index.search(embeddingSearch, 1)
print(queryText + " matches:\n" + sentences[idx[0][0]])

french fries matches:
this is sentence about little cat that liked to eat tomatoes
not similar text matches:
this is the another sample sentence which is here just to not be matched while other one is
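
A nearest-neighbour search always returns the closest stored sentence, even when nothing in the index is really related to the query, as the second result shows. One possible way to handle this is to drop hits whose distance exceeds a cutoff. The sketch below assumes search results shaped like the `(distances, indices)` arrays a FAISS search returns; the helper `filter_by_distance`, the sample numbers, and the cutoff `1.2` are hypothetical and would need tuning on real data.

```python
import numpy as np

def filter_by_distance(distances, indices, max_sq_dist):
    """Keep only (index, distance) pairs within max_sq_dist (illustrative helper)."""
    kept = []
    for row_d, row_i in zip(distances, indices):
        kept.append([(int(i), float(d))
                     for d, i in zip(row_d, row_i)
                     if d <= max_sq_dist])
    return kept

# Made-up search output for two queries with k=1 (not real results).
distances = np.array([[0.9], [1.7]], dtype=np.float32)
indices = np.array([[1], [2]], dtype=np.int64)

# Hypothetical cutoff: for unit vectors, 1.2 corresponds to cosine similarity 0.4.
hits = filter_by_distance(distances, indices, max_sq_dist=1.2)
print(hits)
```

With these sample values the first query keeps its match and the second query ends up with an empty list, which the caller can report as "no match".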
