Vector DB

In [1]:
# Install required libraries
!pip install faiss-cpu numpy sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# -------------------------------
# 1. Sample Sentences
# -------------------------------
sentences = [
    "Natural language processing is a fascinating field.",
    "Word embeddings capture semantic meanings.",
    "NLP is used in chatbots and virtual assistants.",
    "Word2Vec is a powerful tool for creating word embeddings.",
    "Deep learning improves many NLP applications."
]

# -------------------------------
# 2. Generate Sentence Embeddings
# -------------------------------
model = SentenceTransformer('all-MiniLM-L6-v2')  # Pretrained model for sentence embeddings
embeddings = model.encode(sentences).astype('float32')  # FAISS needs float32

# -------------------------------
# 3. Create Vector Database (FAISS)
# -------------------------------
dimension = embeddings.shape[1]                  # Dimension of embeddings
index = faiss.IndexFlatL2(dimension)             # Exact (brute-force) index; reports squared L2 distances
index.add(embeddings)                            # Add embeddings to the DB

print(f"Total sentences in DB: {index.ntotal}")

# -------------------------------
# 4. Query the Vector Database
# -------------------------------
query_sentence = "How can chatbots use NLP effectively?"
query_embedding = model.encode([query_sentence]).astype('float32')

# Retrieve the k nearest sentences (smaller distance = more similar)
k = 3
distances, indices = index.search(query_embedding, k)

print("\nQuery Sentence:", query_sentence)
print("\nTop similar sentences:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {sentences[idx]} (Distance: {distances[0][i]:.4f})")


Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.2



Total sentences in DB: 5

Query Sentence: How can chatbots use NLP effectively?

Top similar sentences:
1. NLP is used in chatbots and virtual assistants. (Distance: 0.3738)
2. Deep learning improves many NLP applications. (Distance: 1.1116)
3. Natural language processing is a fascinating field. (Distance: 1.1992)
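
A note on reading the numbers above: `faiss.IndexFlatL2` does an exhaustive search and reports squared Euclidean distances, so smaller values mean closer matches. The sketch below mimics that search in plain Python on hypothetical 2-D toy vectors (the names `flat_search`, `cosine_from_squared_l2`, and the `toy_*` variables are illustrative, not part of FAISS). It also shows how a squared L2 distance maps to cosine similarity under the assumption that the vectors are unit-length. If the sentence embeddings are unit-normalized, a distance of 0.3738 would correspond to a cosine similarity of roughly 0.81.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Score every stored vector against the query and return the k nearest
    as (distances, indices), smallest distance first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


def cosine_from_squared_l2(d2):
    """Valid only for unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - d2 / 2.0


# Hypothetical unit vectors standing in for real sentence embeddings.
toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [math.cos(math.pi / 6), math.sin(math.pi / 6)],
]
toy_query = [1.0, 0.0]

toy_distances, toy_indices = flat_search(toy_vectors, toy_query, k=2)
for d2, i in zip(toy_distances, toy_indices):
    print(f"index={i} squared_L2={d2:.4f} cosine={cosine_from_squared_l2(d2):.4f}")
```

A flat index is exact but compares the query against every stored vector, which is fine for five sentences; approximate indexes trade some accuracy for speed on larger collections.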
