# Aim:

To implement and compare vector databases (ChromaDB and Weaviate) for semantic search applications.

Task 1: Setup & ChromaDB Implementation

1. Install required libraries: pip install chromadb sentence-transformers

2. Create a collection in ChromaDB.

3. Generate embeddings using a pre-trained model (all-MiniLM-L6-v2).

4. Insert 5–10 text documents.

5. Run a semantic search query.

6. Extend by adding 20–30 more documents and test with multiple queries.

Task 2: Setup & Weaviate Implementation

1. Create a free cluster in Weaviate Cloud (https://console.weaviate.cloud).

2. Install required libraries: pip install weaviate-client sentence-transformers

3. Define a class schema in Weaviate.

4. Insert the same documents and embeddings used in ChromaDB.

5. Perform semantic search queries.

6. Extend by inserting additional documents and testing hybrid queries.

Task 3: Comparative Analysis

Compare ChromaDB and Weaviate with reference to the following parameters:

· Ease of setup.

· Query results and relevance.

· Performance with small datasets.

· Scalability (local vs cloud).

· Suitable use cases for each database.

Task 1

In [None]:
%pip install -q chromadb sentence-transformers


In [None]:
import os
import uuid
from pprint import pprint

import chromadb

try:
    client = chromadb.PersistentClient(path="chroma_store")
except Exception:
    from chromadb.config import Settings
    client = chromadb.Client(
        Settings(chroma_db_impl="duckdb+parquet", persist_directory="chroma_store")
    )

from chromadb.utils import embedding_functions

embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

COLLECTION_NAME = "demo_docs"

collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"},
)

print("Collection ready:", collection.name)


Collection ready: demo_docs


In [None]:
def add_texts(texts, source="seed"):
    """
    Adds texts to the collection with unique IDs and a simple 'source' metadata tag.
    IDs are random, so re-running the cell will not raise an ID conflict,
    but it will insert the same texts again as duplicates.
    """
    ids = [str(uuid.uuid4()) for _ in texts]
    metadatas = [{"source": source} for _ in texts]
    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    print(f"Inserted {len(texts)} texts (source='{source}').")

def show_results(results, top_k=5):
    """
    Pretty-print ChromaDB query results.
    """
    for qi, (docs, metas, dists) in enumerate(
        zip(
            results.get("documents", []),
            results.get("metadatas", []),
            results.get("distances", []),
        )
    ):
        print(f"\nQuery #{qi+1} — top {top_k} results")
        print("-" * 60)
        for rank, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
            print(f"[{rank}] dist={dist:.4f} | source={meta.get('source')}")
            print(f"     {doc}")


In [None]:
seed_sentences = [
    "Neural networks learn patterns from data by adjusting weights.",
    "Diffusion models generate images by denoising latent representations.",
    "Reinforcement learning balances exploration and exploitation.",
    "ChromaDB is a lightweight vector database for embeddings.",
    "Sentence transformers convert text into numerical embeddings.",
    "Cosine similarity works well for comparing sentence embeddings.",
    "HNSW is a popular approximate nearest neighbor search algorithm.",
    "Vector databases enable semantic search over unstructured text."
]

add_texts(seed_sentences, source="seed")
print("Collection count:", collection.count())


Inserted 8 texts (source='seed').
Collection count: 8
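
Because `add_texts` draws a fresh random UUID for every text, running the insert cell twice stores each sentence twice. One possible way around this, sketched below, is to derive the ID from the text itself so the same sentence always maps to the same ID. The helper name `stable_id` and the choice of `uuid.NAMESPACE_URL` are assumptions made for illustration; they are not part of the original notebook.

```python
import uuid

# Hypothetical helper (not in the original notebook): derive a stable ID from the text.
def stable_id(text: str) -> str:
    """Return the same UUID string for the same (whitespace-trimmed) text."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, text.strip()))

sentences = [
    "ChromaDB is a lightweight vector database for embeddings.",
    "Vector databases enable semantic search over unstructured text.",
]

first_run = [stable_id(s) for s in sentences]
second_run = [stable_id(s) for s in sentences]

print(first_run == second_run)   # the IDs do not change between runs
print(len(set(first_run)))       # distinct texts still get distinct IDs
```

Passing such IDs to `collection.upsert(...)` instead of `collection.add(...)` should then overwrite existing entries rather than duplicate them.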


In [None]:
query = "How do I find similar sentences using embeddings?"
results = collection.query(
    query_texts=[query],
    n_results=5,
    include=["distances", "metadatas", "documents"],
)
show_results(results, top_k=5)



Query #1 — top 5 results
------------------------------------------------------------
[1] dist=0.2537 | source=seed
     Cosine similarity works well for comparing sentence embeddings.
[2] dist=0.4001 | source=seed
     Sentence transformers convert text into numerical embeddings.
[3] dist=0.5900 | source=seed
     ChromaDB is a lightweight vector database for embeddings.
[4] dist=0.7258 | source=seed
     Vector databases enable semantic search over unstructured text.
[5] dist=0.8069 | source=seed
     Neural networks learn patterns from data by adjusting weights.
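
The `dist` values above are cosine distances, because the collection was created with `"hnsw:space": "cosine"`. Cosine distance is one minus cosine similarity, so smaller means closer. The sketch below shows the relationship in plain Python; the 2-D vectors are toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance in a cosine space: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors (illustrative only).
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated vectors
print(round(same_direction, 4), round(orthogonal, 4))

# Converting the top hit's reported distance back into a similarity.
top_hit_distance = 0.2537
print(round(1.0 - top_hit_distance, 4))
```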


In [None]:
more_sentences = [
    "Transformers use self-attention to capture long-range dependencies in text.",
    "Batch size influences the stability and speed of model training.",
    "Learning rate schedules can improve convergence in deep learning.",
    "Vector search retrieves items by similarity in embedding space.",
    "Contrastive learning aligns semantically similar pairs closer together.",
    "Evaluation metrics like MRR and nDCG assess retrieval quality.",
    "CLIP learns a joint image-text embedding space through contrastive loss.",
    "Approximate nearest neighbor methods trade a bit of accuracy for speed.",
    "Persistence lets your vector database survive kernel restarts.",
    "Cosine distance is one minus cosine similarity.",
    "K-means can cluster embeddings into topical groups.",
    "Normalization helps stabilize training and comparisons.",
    "FP16 reduces memory usage with minimal quality loss on many tasks.",
    "Tokenization splits raw text into subword units for models.",
    "Python notebooks are great for fast ML experiments and demos.",
    "Zero-shot transfer uses general representations without fine-tuning.",
    "Recall@k measures how often a relevant result is in the top k.",
    "Embeddings map raw text into dense numeric vectors.",
    "RAG augments LLMs with external knowledge via retrieval.",
    "FAISS and HNSW are widely used for fast similarity search.",
    "Cosine similarity ranges from -1 to 1 for unit-normalized vectors.",
    "Fine-tuning adapts a pre-trained model to a downstream task.",
    "Dimension reduction like PCA or UMAP helps visualization.",
    "Caching embeddings avoids recomputation during iteration."
]

add_texts(more_sentences, source="extended")
print("Collection count:", collection.count())


Inserted 24 texts (source='extended').
Collection count: 32


In [None]:
queries = [
    "What controls training stability in deep learning?",
    "Which algorithms are used for fast nearest neighbor search?",
    "How can I evaluate the quality of retrieval systems?",
]

results_multi = collection.query(
    query_texts=queries,
    n_results=5,
    include=["distances", "metadatas", "documents"],
)

show_results(results_multi, top_k=5)



Query #1 — top 5 results
------------------------------------------------------------
[1] dist=0.4365 | source=extended
     Learning rate schedules can improve convergence in deep learning.
[2] dist=0.4702 | source=extended
     Batch size influences the stability and speed of model training.
[3] dist=0.4920 | source=extended
     Normalization helps stabilize training and comparisons.
[4] dist=0.5722 | source=seed
     Neural networks learn patterns from data by adjusting weights.
[5] dist=0.6074 | source=extended
     Fine-tuning adapts a pre-trained model to a downstream task.

Query #2 — top 5 results
------------------------------------------------------------
[1] dist=0.2653 | source=extended
     Approximate nearest neighbor methods trade a bit of accuracy for speed.
[2] dist=0.2781 | source=seed
     HNSW is a popular approximate nearest neighbor search algorithm.
[3] dist=0.3843 | source=extended
     FAISS and HNSW are widely used for fast similarity search.
[4] dist=0.5781

Task 2

In [None]:
%pip install -q weaviate-client sentence-transformers


In [None]:
import os
import weaviate
from getpass import getpass

WCD_URL = os.getenv("WCD_URL") or input("Enter your Weaviate Cluster URL: ").strip()
WCD_API_KEY = os.getenv("WCD_API_KEY") or getpass("Enter your Weaviate Admin API Key: ")

# connect_to_wcs is deprecated in favour of connect_to_weaviate_cloud.
# No OpenAI header is needed because vectors are supplied client-side.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WCD_URL,
    auth_credentials=weaviate.AuthApiKey(WCD_API_KEY),
)

print("Connected:", bool(client.is_ready()))

Enter your Weaviate Cluster URL: focznblkttogtcsfagdgw.c0.asia-southeast1.gcp.weaviate.cloud
Enter your Weaviate Admin API Key: ··········


Connected: True


In [None]:
from weaviate.classes.config import Configure, Property, DataType, VectorDistances

COLLECTION = "DemoDocs"

if COLLECTION in client.collections.list_all():
    client.collections.delete(COLLECTION)

client.collections.create(
    name=COLLECTION,
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        distance_metric=VectorDistances.COSINE,
    ),
)

print("Created collection:", COLLECTION)


Created collection: DemoDocs


In [None]:
seed_sentences = [
    "Neural networks learn patterns from data by adjusting weights.",
    "Diffusion models generate images by denoising latent representations.",
    "Reinforcement learning balances exploration and exploitation.",
    "ChromaDB is a lightweight vector database for embeddings.",
    "Sentence transformers convert text into numerical embeddings.",
    "Cosine similarity works well for comparing sentence embeddings.",
    "HNSW is a popular approximate nearest neighbor search algorithm.",
    "Vector databases enable semantic search over unstructured text."
]

more_sentences = [
    "Transformers use self-attention to capture long-range dependencies in text.",
    "Batch size influences the stability and speed of model training.",
    "Learning rate schedules can improve convergence in deep learning.",
    "Vector search retrieves items by similarity in embedding space.",
    "Contrastive learning aligns semantically similar pairs closer together.",
    "Evaluation metrics like MRR and nDCG assess retrieval quality.",
    "CLIP learns a joint image-text embedding space through contrastive loss.",
    "Approximate nearest neighbor methods trade a bit of accuracy for speed.",
    "Persistence lets your vector database survive kernel restarts.",
    "Cosine distance is one minus cosine similarity.",
    "K-means can cluster embeddings into topical groups.",
    "Normalization helps stabilize training and comparisons.",
    "FP16 reduces memory usage with minimal quality loss on many tasks.",
    "Tokenization splits raw text into subword units for models.",
    "Python notebooks are great for fast ML experiments and demos.",
    "Zero-shot transfer uses general representations without fine-tuning.",
    "Recall@k measures how often a relevant result is in the top k.",
    "Embeddings map raw text into dense numeric vectors.",
    "RAG augments LLMs with external knowledge via retrieval.",
    "FAISS and HNSW are widely used for fast similarity search.",
    "Cosine similarity ranges from -1 to 1 for unit-normalized vectors.",
    "Fine-tuning adapts a pre-trained model to a downstream task.",
    "Dimension reduction like PCA or UMAP helps visualization.",
    "Caching embeddings avoids recomputation during iteration."
]

from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # fast & accurate for demos

def embed(texts):
    vecs = embed_model.encode(texts, normalize_embeddings=True, show_progress_bar=False)
    return np.asarray(vecs, dtype=np.float32)


In [None]:
from uuid import uuid4

col = client.collections.get(COLLECTION)

vectors = embed(seed_sentences)

with col.batch.dynamic() as batch:
    for text, vec in zip(seed_sentences, vectors):
        batch.add_object(
            properties={"text": text, "source": "seed"},
            vector=vec,
            uuid=str(uuid4()),
        )

print("Seed count inserted:", len(seed_sentences))

Seed count inserted: 8


In [None]:
from weaviate.classes.query import MetadataQuery

query_text = "How do I find similar sentences using embeddings?"
qvec = embed([query_text])[0]

result = col.query.near_vector(
    near_vector=qvec,
    limit=5,
    return_properties=["text", "source"],
    return_metadata=MetadataQuery(distance=True),
)

for i, o in enumerate(result.objects, start=1):
    print(f"[{i}] dist={o.metadata.distance:.4f} | source={o.properties.get('source')}")
    print("    ", o.properties.get("text"))


[1] dist=0.2537 | source=seed
     Cosine similarity works well for comparing sentence embeddings.
[2] dist=0.4001 | source=seed
     Sentence transformers convert text into numerical embeddings.
[3] dist=0.5900 | source=seed
     ChromaDB is a lightweight vector database for embeddings.
[4] dist=0.7258 | source=seed
     Vector databases enable semantic search over unstructured text.
[5] dist=0.8069 | source=seed
     Neural networks learn patterns from data by adjusting weights.
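
These Weaviate results match the ChromaDB output from Task 1 for the same query, which is what one would expect when both stores hold the same embeddings and use cosine distance. A comparison like this can be automated. The sketch below uses the top three hits transcribed from the two printed outputs; the function name `compare_rankings` is hypothetical.

```python
# (distance, text) pairs transcribed from the printed outputs of Task 1 and Task 2.
chroma_hits = [
    (0.2537, "Cosine similarity works well for comparing sentence embeddings."),
    (0.4001, "Sentence transformers convert text into numerical embeddings."),
    (0.5900, "ChromaDB is a lightweight vector database for embeddings."),
]
weaviate_hits = [
    (0.2537, "Cosine similarity works well for comparing sentence embeddings."),
    (0.4001, "Sentence transformers convert text into numerical embeddings."),
    (0.5900, "ChromaDB is a lightweight vector database for embeddings."),
]

def compare_rankings(hits_a, hits_b, k):
    """Return (overlap fraction, same order?, largest distance gap by rank) for the top k."""
    top_a, top_b = hits_a[:k], hits_b[:k]
    texts_a = [text for _, text in top_a]
    texts_b = [text for _, text in top_b]
    overlap = len(set(texts_a) & set(texts_b)) / k
    same_order = texts_a == texts_b
    max_gap = max(abs(da - db) for (da, _), (db, _) in zip(top_a, top_b))
    return overlap, same_order, max_gap

comparison = compare_rankings(chroma_hits, weaviate_hits, k=3)
print(comparison)
```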


In [None]:
more_vectors = embed(more_sentences)

with col.batch.dynamic() as batch:
    for text, vec in zip(more_sentences, more_vectors):
        batch.add_object(
            properties={"text": text, "source": "extended"},
            vector=vec,
            uuid=str(uuid4()),
        )

print("Extended count inserted:", len(more_sentences))


Extended count inserted: 24


In [None]:
queries = [
    "What controls training stability in deep learning?",
    "Which algorithms are used for fast nearest neighbor search?",
    "How can I evaluate the quality of retrieval systems?",
]

for qi, q in enumerate(queries, start=1):
    qvec = embed([q])[0]
    res = col.query.near_vector(
        near_vector=qvec,
        limit=5,
        return_properties=["text", "source"],
        return_metadata=MetadataQuery(distance=True),
    )
    print("\n", "="*18, f"Query #{qi}", "="*18)
    print("Q:", q)
    for rank, o in enumerate(res.objects, start=1):
        print(f"[{rank}] dist={o.metadata.distance:.4f} | source={o.properties.get('source')}")
        print("    ", o.properties.get("text"))



Q: What controls training stability in deep learning?
[1] dist=0.4365 | source=extended
     Learning rate schedules can improve convergence in deep learning.
[2] dist=0.4702 | source=extended
     Batch size influences the stability and speed of model training.
[3] dist=0.4920 | source=extended
     Normalization helps stabilize training and comparisons.
[4] dist=0.5722 | source=seed
     Neural networks learn patterns from data by adjusting weights.
[5] dist=0.6074 | source=extended
     Fine-tuning adapts a pre-trained model to a downstream task.

Q: Which algorithms are used for fast nearest neighbor search?
[1] dist=0.2653 | source=extended
     Approximate nearest neighbor methods trade a bit of accuracy for speed.
[2] dist=0.2781 | source=seed
     HNSW is a popular approximate nearest neighbor search algorithm.
[3] dist=0.3843 | source=extended
     FAISS and HNSW are widely used for fast similarity search.
[4] dist=0.5781 | source=extended
     Vector search retrieves items b

In [None]:
from weaviate.classes.query import MetadataQuery

hybrid_query = "nearest neighbor search"
hybrid_vec = embed([hybrid_query])[0]

hybrid_res = col.query.hybrid(
    query=hybrid_query,       # keywords
    vector=hybrid_vec,        # semantic vector
    alpha=0.5,                # 0=lexical only, 1=vector only
    limit=5,
    return_properties=["text", "source"],
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

print("Hybrid:", hybrid_query)
for i, o in enumerate(hybrid_res.objects, start=1):
    print(f"[{i}] score={o.metadata.score:.4f} | source={o.properties.get('source')}")
    print("    ", o.properties.get("text"))


Hybrid: nearest neighbor search
[1] score=0.9928 | source=seed
     HNSW is a popular approximate nearest neighbor search algorithm.
[2] score=0.7689 | source=extended
     Approximate nearest neighbor methods trade a bit of accuracy for speed.
[3] score=0.3780 | source=extended
     Vector search retrieves items by similarity in embedding space.
[4] score=0.3579 | source=extended
     FAISS and HNSW are widely used for fast similarity search.
[5] score=0.2757 | source=seed
     Vector databases enable semantic search over unstructured text.
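
The `alpha` parameter in the hybrid query controls how much weight the vector score gets relative to the keyword score. The sketch below illustrates the general idea with a simplified blend: each score list is min-max normalised and then combined as a weighted sum. This is an assumption-laden approximation for intuition only, not Weaviate's actual fusion implementation, and the document names and scores are made up.

```python
def min_max(scores):
    """Scale a dict of scores to [0, 1]; constant inputs all map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def blend(keyword_scores, vector_scores, alpha):
    """alpha=0 keeps only keyword scores, alpha=1 keeps only vector scores."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    docs = set(kw) | set(vec)
    fused = {d: (1 - alpha) * kw.get(d, 0.0) + alpha * vec.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for three documents.
keyword_scores = {"doc_a": 3.0, "doc_b": 1.0, "doc_c": 0.0}
vector_scores = {"doc_a": 0.60, "doc_b": 0.90, "doc_c": 0.30}

for alpha in (0.0, 0.5, 1.0):
    ranking = [doc for doc, _ in blend(keyword_scores, vector_scores, alpha)]
    print(alpha, ranking)
```

In this toy data the keyword-favoured document leads at low `alpha` and the vector-favoured one leads at `alpha=1`, which is the behaviour the parameter is meant to expose.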


# Task 3 — Comparative Analysis: ChromaDB vs Weaviate

ChromaDB and Weaviate are both modern vector databases but target slightly different audiences and deployment patterns. ChromaDB is lightweight, very easy to embed inside local notebooks and experiments, and excellent for rapid prototyping and small-to-medium datasets. Weaviate is a feature-rich, production-ready vector database with built-in schema, REST/gRPC APIs, configurable vector indexes, hybrid (keyword + vector) query support, and integrations (modules for OpenAI, Cohere, etc.). Weaviate is better suited when you need a managed/cloud setup, advanced metadata querying, and production durability; Chroma is ideal when you want minimal ops friction and fast local development.

When it comes to choosing a vector database for your project, two popular options are ChromaDB and Weaviate. Here’s a breakdown of how they compare across several key areas, so you can decide which is the right fit for your needs.

### **Ease of Setup**
---
**ChromaDB** is incredibly easy to get started with. You can install it with a simple `pip install chromadb` command, and it runs directly within your notebook, saving data to your local disk. This makes it a great choice for quick demos or small projects, as it requires no external server setup.

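**Weaviate**, on the other hand, is a bit more involved. It requires a running server, typically through Docker or a cloud service; this notebook used a free Weaviate Cloud cluster, which meant creating the cluster, copying its URL and API key, and defining a collection schema before inserting data. While this adds a little more operational overhead, it’s a more robust solution for building full-scale applications. The setup is straightforward thanks to quickstart containers and the hosted console, but it’s not as "in-notebook" as ChromaDB.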

*The takeaway: For a quick demo or proof-of-concept, ChromaDB is the clear winner. For a more serious application, Weaviate's slightly more complex setup is worth the effort.*

### **Query Results and Relevance**
---
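Both databases rely on the quality of your embeddings for semantic similarity search and return the nearest neighbors in the embedding space under a configurable distance metric. In this notebook both collections were explicitly set to cosine distance (`"hnsw:space": "cosine"` in ChromaDB, `VectorDistances.COSINE` in Weaviate), and with identical `all-MiniLM-L6-v2` embeddings the two returned the same top-5 documents with the same distances for the first query.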

**ChromaDB** is a solid choice for straightforward semantic search and nearest-neighbor retrieval. It also supports basic filtering on metadata.

**Weaviate** offers more advanced features for fine-tuning relevance. It supports **hybrid queries** that combine keyword and vector search, providing more comprehensive results. It also offers score explanations and configurable index parameters, giving you more control over how results are retrieved.

*The takeaway: While both provide similar results with the same embeddings, Weaviate offers more tools for advanced relevance tuning and explainability.*

### **Performance on Small Datasets**
---
**ChromaDB** shines here. It runs in-process, so queries involve no network round trip, which makes it well suited to prototyping and environments with limited resources. No latency figures were recorded in this notebook (32 documents), so the comparison in this section is qualitative.

**Weaviate** also performs well with small datasets, but because it runs as a separate server, every query adds a network round trip and there is more startup overhead. The setup and running costs are higher than ChromaDB's.

*The takeaway: ChromaDB is faster for local iteration and small projects due to its minimal overhead, while Weaviate's performance is great but comes with a heavier footprint.*
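
Since the notebook does not record any timings, a small harness along the following lines could be used to back the performance comparison with numbers. It is a minimal sketch: `time_queries` is a hypothetical helper, and `fake_query` is a stand-in for a real call such as `collection.query(...)` or `col.query.near_vector(...)`.

```python
import statistics
import time

def time_queries(run_query, queries, repeats=5):
    """Return per-call latencies in milliseconds for a query callable."""
    latencies_ms = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            run_query(q)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms

# Stand-in for a real database call (assumption for illustration).
def fake_query(q):
    return sorted(q)

latencies = time_queries(fake_query, ["alpha", "beta", "gamma"], repeats=4)
print(f"n={len(latencies)} "
      f"median={statistics.median(latencies):.4f} ms "
      f"max={max(latencies):.4f} ms")
```

For a cloud-hosted Weaviate cluster the measured time includes the network round trip, so the numbers say as much about the connection as about the database.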

### **Scalability**
---
**ChromaDB** is a local-first solution. While it can handle millions of vectors, it's designed for single-machine use. Scaling to a massive corpus requires careful engineering, and its ecosystem for large-scale clusters is still developing.

**Weaviate** is built for the cloud and large-scale deployments. It offers features like clustered services, multi-node operations, and automatic backups, making it an excellent choice for handling millions of vectors in a production environment.

*The takeaway: Weaviate is the better option for production-level, cloud-based applications, while ChromaDB is best for local, single-machine projects and quick experiments.*

### **Ideal Use Cases**
---
**ChromaDB** is perfect for:
* **Rapid prototyping** and RAG demos.
* Small-to-medium personal projects, like a note search app.
* Local applications where embedding generation happens in-process.

**Weaviate** is the better fit for:
* **Production semantic search** or recommendation systems.
* Enterprise RAG pipelines that need security, explainability, and scalability.
* Projects that require hybrid search, a defined data schema, and cloud hosting.

# Task 4 — Real-world use case: Semantic FAQ Search

A — Minimal runnable example: ChromaDB (builds on your Chroma code)

In [None]:
%pip install -q chromadb sentence-transformers


In [None]:
# Self-contained version of the ChromaDB client setup from Task 1
from sentence_transformers import SentenceTransformer
import chromadb

try:
    client = chromadb.PersistentClient(path="chroma_store")
except Exception:
    # Fallback for older ChromaDB releases without PersistentClient
    from chromadb.config import Settings
    client = chromadb.Client(
        Settings(chroma_db_impl="duckdb+parquet", persist_directory="chroma_store")
    )

model = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [
    {"q": "How do I reset my password?", "a": "To reset your password, click 'Forgot password' and follow the email link."},
    {"q": "How can I contact support?", "a": "Email support@example.com or use the in-app chat."},
    {"q": "What is the refund policy?", "a": "Refunds are available within 30 days, subject to terms."},
    {"q": "How to change my subscription plan?", "a": "Go to Billing → Change Plan and follow the prompts."},
    {"q": "How to export my data?", "a": "Use Settings → Export to download your data in CSV."},
]

texts = [f"Q: {f['q']}\nA: {f['a']}" for f in faqs]
ids = [f"faq_{i}" for i in range(len(texts))]
metadatas = [{"source": "faq", "qid": ids[i], "question": faqs[i]["q"]} for i in range(len(texts))]

vecs = model.encode(texts, normalize_embeddings=True)

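# No embedding function: vectors are computed above and passed in explicitly.
# No "hnsw:space" is set here, so distances use ChromaDB's default metric, not cosine.
collection = client.get_or_create_collection(name="faq_demo", embedding_function=None)
# upsert (rather than add) keeps re-runs from colliding on the fixed faq_* IDs
collection.upsert(documents=texts, metadatas=metadatas, ids=ids, embeddings=vecs.tolist())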

def faq_search_chroma(query, k=3):
    qvec = model.encode([query], normalize_embeddings=True)[0].tolist()
    res = collection.query(query_embeddings=[qvec], n_results=k, include=["documents","metadatas","distances"])
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    dists = res["distances"][0]
    return list(zip(docs, metas, dists))

results = faq_search_chroma("I forgot my account password — what should I do?")
for doc, meta, dist in results:
    print("Score (dist):", dist, " | Q:", meta["question"])
    print(doc)
    print("---")

Score (dist): 0.7000272274017334  | Q: How do I reset my password?
Q: How do I reset my password?
A: To reset your password, click 'Forgot password' and follow the email link.
---
Score (dist): 1.5400049686431885  | Q: How to change my subscription plan?
Q: How to change my subscription plan?
A: Go to Billing → Change Plan and follow the prompts.
---
Score (dist): 1.561098575592041  | Q: How can I contact support?
Q: How can I contact support?
A: Email support@example.com or use the in-app chat.
---
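
Some of the distances above exceed 1, unlike the cosine distances in Task 1. The `faq_demo` collection was created without an `hnsw:space` setting, so the values are presumably squared L2 distances (ChromaDB's default space, as far as I am aware). For unit-normalised embeddings, which `normalize_embeddings=True` produces, squared L2 and cosine similarity are related by ‖a − b‖² = 2 − 2·cos(a, b). The sketch below converts the reported values and applies a cut-off for "no confident answer"; `MIN_SIMILARITY` is a hypothetical threshold, not something the notebook defines.

```python
def squared_l2_to_cosine(d_squared):
    """For unit-length vectors: ||a - b||^2 = 2 - 2*cos, so cos = 1 - d/2."""
    return 1.0 - d_squared / 2.0

# Distances copied from the output above, rounded to 4 decimals.
reported = {
    "How do I reset my password?": 0.7000,
    "How to change my subscription plan?": 1.5400,
    "How can I contact support?": 1.5611,
}

MIN_SIMILARITY = 0.5  # hypothetical cut-off; tune it on real user queries

for question, dist in reported.items():
    sim = squared_l2_to_cosine(dist)
    verdict = "answer" if sim >= MIN_SIMILARITY else "no confident match"
    print(f"{sim:.3f}  {verdict:<18}  {question}")
```

Under this assumption only the password question clears the threshold, which matches the intuition that the other two hits are unrelated to the query.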


In [None]:
%pip install -q weaviate-client sentence-transformers


C

In [None]:
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze_sentiment(text):
    result = classifier(text)[0]
    return f"Label: {result['label']}, Confidence: {result['score']:.2f}"

demo = gr.Interface(
    fn=analyze_sentiment,
    inputs=gr.Textbox(lines=3, placeholder="Enter text here..."),
    outputs="text",
    title="Sentiment Analysis with BERT",
    description="Type a sentence and see if it's Positive or Negative."
)

demo.launch()


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://ef7c111a21de1ba004.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# Task 5

### 1) How does vector search differ from keyword search?

Vector search maps text to dense numeric vectors (embeddings) representing semantic content; similarity is measured in vector space (cosine/dot). It retrieves items by semantic closeness, so paraphrases and synonyms match even without keyword overlap. Keyword search (e.g., BM25) matches exact tokens or token patterns and is powerful when exact terms/IDs matter. The two are complementary: vector search excels at semantic matching; keyword search excels at precision for exact matches and structured queries.
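
The keyword side of this contrast is easy to demonstrate without any model. The sketch below scores a document by literal token overlap; `tokens` and `keyword_overlap` are hypothetical helpers and a deliberately crude stand-in for a real ranking function such as BM25.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real text analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query, doc):
    """Fraction of query tokens that literally appear in the document."""
    q = tokens(query)
    return len(q & tokens(doc)) / len(q) if q else 0.0

doc = "How do I reset my password?"

exact_terms = keyword_overlap("reset password", doc)             # terms appear verbatim
paraphrase = keyword_overlap("recover login credentials", doc)   # no shared terms
print(exact_terms, paraphrase)
```

A paraphrase with no shared tokens gets no keyword credit at all, whereas an embedding-based search can still rank the document if the two texts are close in meaning, as the FAQ query in Task 4 illustrates.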


### 2) Which DB was easier to implement and why?

For quick notebook demos and small projects, ChromaDB was easier: it needs a single pip install, a `PersistentClient` or in-memory client, no server or container setup, and minimal boilerplate to add and query vectors. Weaviate requires spinning up a server (Docker or cloud), defining a schema, and connecting through its REST/gRPC client. That is more steps, but it brings production-grade features.


### 3) Which DB would you choose for prototyping vs production?

Prototyping: ChromaDB — fast, minimal friction, local-first.

Production (cloud/scale/enterprise): Weaviate — cluster/cloud deployment, hybrid search, advanced metadata filtering, explainability, and managed services.


### 4) What challenges arise when scaling vector databases?

Index size and memory: HNSW and other indexes store graph structures that consume memory; large datasets require sharding or SSD-based indices.

Latency vs recall tradeoffs: Tuning parameters (ef, efConstruction, PQ levels) is necessary to balance speed and accuracy.

Consistency & updates: Handling frequent inserts/deletes in ANN indexes while keeping high performance is nontrivial.

Distributed architecture: Sharding, replication, and cross-node queries add complexity.

Costs & ops: Cloud hosting, backups, and security increase cost and operational responsibility.

Embedding pipeline scaling: Generating embeddings for millions of items is compute-intensive; you need batching, caching, and retry strategies.
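
The latency versus recall trade-off mentioned above is usually quantified by comparing an approximate index against exact brute-force search. The following is a minimal sketch of that measurement on random toy vectors. The "approximate" search here is simulated by scanning only half of the data, which is an assumption made purely for illustration and not how HNSW works internally.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Brute-force ground truth: indices of the k highest dot products."""
    ranked = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Share of the true top-k that the approximate result also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(8)]

truth = exact_top_k(query, vectors, k=10)

# Stand-in for an ANN index: search only a random half of the data.
subset = random.sample(range(len(vectors)), 100)
approx = sorted(subset, key=lambda i: dot(query, vectors[i]), reverse=True)[:10]

print("recall@10 =", recall_at_k(approx, truth))
```

With a real index, the same `recall_at_k` function can be applied to the IDs returned at different `ef` settings to see how much accuracy each speed-up costs.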

# Conclusion

Across Tasks 1 to 5, we implemented and compared two vector databases, ChromaDB and Weaviate, for semantic search. In Task 1, we built a local ChromaDB collection with `all-MiniLM-L6-v2` embeddings, inserted 8 seed sentences, extended it to 32 documents, and ran single and multi-query searches. Task 2 repeated the same workflow on a Weaviate Cloud cluster with client-side embeddings, returned matching results for the shared queries, and added a hybrid (keyword + vector) query. Task 3 compared the two databases on ease of setup, relevance, small-dataset performance, scalability, and suitable use cases. Task 4 applied ChromaDB to a semantic FAQ search and included a small Gradio sentiment-analysis interface. Finally, Task 5 reflected on vector versus keyword search, the choice of database for prototyping versus production, and the challenges of scaling. Together, these tasks point to ChromaDB for fast local prototyping and Weaviate for managed, production-oriented deployments.