# Embeddings + ChromaDB pipeline (Colab)

This notebook builds dense vector embeddings for a large product catalog (≈552k rows) using a local Sentence-Transformers model on a Colab GPU, and stores them in a persistent ChromaDB vector store. Running the model locally avoids OpenAI token/usage limits and per-call API costs. Local encoding can be slower than a hosted API, but it is cheaper and scales with the GPU you have available.

Key steps
- Install dependencies: chromadb, sentence-transformers, accelerate, tqdm, python-dotenv, and faiss-cpu (or faiss-gpu if available; the Chroma pipeline in this notebook does not call FAISS directly).
- Mount Google Drive (optional) and set `CSV_PATH` to your dataset.
- Inspect CSV, construct a single text field (e.g. `text_for_embedding`) that concatenates relevant columns (name, category, company, price, rating, link, etc.).
- Load a fast embedding model (recommended: `sentence-transformers/all-MiniLM-L6-v2`) on GPU.
- Encode texts in batches and save embeddings to disk (`.npy`) as a checkpoint.
- Create a persistent Chroma collection and add ids, documents, metadatas and the precomputed embeddings in batches.
- Query by encoding the user query with the same model and running `collection.query(...)`.

Tips and gotchas
- Set Colab runtime to GPU (e.g., T4) before loading the model.
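- Tune `BATCH_SIZE`, the number of rows handled per encoding chunk and per Chroma write. GPU memory is governed mainly by the `batch_size` passed to `model.encode()` (128 in this notebook), so lower that first if you hit OOM; if a chunk itself is too large, step `BATCH_SIZE` down (try 2048 → 1024 → 512 → 256).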
- Use `convert_to_numpy=True` and `batch_size` in `model.encode()` for efficient vector output.
- Save intermediate embeddings to `.npy` so the long encode step can be resumed without re-running.
- Use `chromadb.PersistentClient(path=VECTORSTORE_DIR)` to persist the vectorstore to disk and copy/zip into Drive for later reuse.
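- Chroma maintains its own HNSW index inside `VECTORSTORE_DIR`, so FAISS is not required for this pipeline. If you build a separate FAISS index for a very large collection, faiss-cpu is usually fine; faiss-gpu can be faster but is harder to install on Colab.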
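- Make sure `text_for_embedding` handles missing values: `.astype(str)` prevents type errors during concatenation but turns NaN into the literal string `nan`, so call `fillna("")` first if you do not want that in the embedded text.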
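
The batch-size fallback suggested in the tips above (2048 → 1024 → 512 → 256) could be automated roughly as follows. This is a minimal sketch, not part of the notebook: `encode_with_backoff`, `encode_fn` and `fake_encode` are hypothetical names, and it assumes that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch CUDA OOM errors usually read.

```python
from typing import Callable, List, Sequence


def encode_with_backoff(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str], int], List],
    batch_sizes: Sequence[int] = (2048, 1024, 512, 256),
):
    """Try encode_fn with progressively smaller batch sizes.

    Returns (vectors, batch_size_that_worked).
    """
    last_error = None
    for batch_size in batch_sizes:
        try:
            return encode_fn(texts, batch_size), batch_size
        except RuntimeError as err:
            # Assumption: out-of-memory failures mention "out of memory".
            # Anything else is re-raised unchanged.
            if "out of memory" not in str(err).lower():
                raise
            last_error = err
    raise RuntimeError(f"all batch sizes failed: {list(batch_sizes)}") from last_error


# Stand-in encoder for demonstration: pretends anything above 512 does not fit.
def fake_encode(texts, batch_size):
    if batch_size > 512:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [[0.0] * 4 for _ in texts]


vectors, used = encode_with_backoff(["a", "b", "c"], fake_encode)
print(used, len(vectors))
```

With the notebook's model, `encode_fn` could be something like `lambda t, bs: model.encode(t, batch_size=bs, convert_to_numpy=True)`. Calling `torch.cuda.empty_cache()` between attempts may also help, though that is not shown here.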

Example variables (used by the notebook)
- `DATA_PATH` — path to the source CSV
- `VECTORSTORE_DIR` — where Chroma will persist the vectorstore (e.g. `/content/vectorstore`)
- `MODEL_NAME` — e.g. `sentence-transformers/all-MiniLM-L6-v2`
- `BATCH_SIZE` — encoding & add-to-Chroma batch size
- `embeddings.npy` — saved embeddings checkpoint

Security & cost note
- Running locally on Colab uses your session and GPU quota; store results in Drive if you need persistence. No external API usage means no per-embedding API costs.

In [1]:
# Run this cell first (may take a couple minutes)
!pip install -q chromadb sentence-transformers accelerate faiss-cpu tqdm python-dotenv
# If you want GPU faiss (optional) - typically faiss-cpu is OK for embedding store; installing faiss-gpu on colab can be tricky.



In [2]:
# Make sure your Colab runtime is set to GPU (T4)
import os, sys
print("Python", sys.version)
import time
from tqdm.auto import tqdm
import pandas as pd


Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]


In [3]:
from google.colab import drive
drive.mount('/content/drive')

# Example path if your CSV is in drive: /content/drive/MyDrive/my_project/data/processed/all_products_cleaned.csv



Mounted at /content/drive


In [4]:
CSV_PATH = "/content/drive/MyDrive/Data_Science/Projects/Amazon_Ecommorce_agent/data/processed/all_products_cleaned.csv"

In [5]:
# Configure paths
DATA_PATH = CSV_PATH  # from previous cell
VECTORSTORE_DIR = "/content/vectorstore"   # will be created
os.makedirs(VECTORSTORE_DIR, exist_ok=True)
COLLECTION_NAME = "products"

print("DATA_PATH:", DATA_PATH)
print("VECTORSTORE_DIR:", VECTORSTORE_DIR)


DATA_PATH: /content/drive/MyDrive/Data_Science/Projects/Amazon_Ecommorce_agent/data/processed/all_products_cleaned.csv
VECTORSTORE_DIR: /content/vectorstore


In [6]:
# load only headers/first few to sanity-check columns
df_head = pd.read_csv(DATA_PATH, nrows=5)
print(df_head.columns.tolist())
# load full dataframe (may be large)
df = pd.read_csv(DATA_PATH)
print("Loaded rows:", len(df))


['name', 'main_category', 'sub_category', 'image', 'link', 'ratings', 'no_of_ratings', 'discount_price', 'actual_price', 'source_file', 'company']
Loaded rows: 551585


In [7]:
# adjust to the exact column names in your CSV
df["text_for_embedding"] = (
    df["name"].astype(str) + ". Category: " +
    df["main_category"].astype(str) + " > " +
    df["sub_category"].astype(str) +
    ". Company: " + df["company"].astype(str) +
    ". Price ₹" + df["discount_price"].astype(str) +
    ". Rating " + df["ratings"].astype(str)
)
print("Created text_for_embedding. Example:")
print(df["text_for_embedding"].iloc[0])


Created text_for_embedding. Example:
Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1 Convertible, Copper, Anti-Viral + Pm 2.5 Filter, 2023 Model, White, Gls18I3.... Category: appliances > Air Conditioners. Company: Lloyd. Price ₹32999.0. Rating 4.2
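
Because `.astype(str)` renders a missing value as the literal string `nan`, rows with gaps end up with text such as `Rating nan`. Below is a minimal sketch of one way to leave out missing fields instead. The helper names `is_missing` and `build_text` are hypothetical, the sample row is made up, and the column names are the ones printed by the notebook.

```python
import math
from typing import Any, Mapping


def is_missing(value: Any) -> bool:
    """True for None, NaN, empty strings and the strings 'nan' / '<NA>'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in ("", "nan", "<na>")


def build_text(row: Mapping[str, Any]) -> str:
    """Build the embedding text, leaving out segments whose value is missing."""
    parts = []
    if not is_missing(row.get("name")):
        parts.append(str(row["name"]))
    categories = [
        str(row[col])
        for col in ("main_category", "sub_category")
        if not is_missing(row.get(col))
    ]
    if categories:
        parts.append("Category: " + " > ".join(categories))
    if not is_missing(row.get("company")):
        parts.append(f"Company: {row['company']}")
    if not is_missing(row.get("discount_price")):
        parts.append(f"Price ₹{row['discount_price']}")
    if not is_missing(row.get("ratings")):
        parts.append(f"Rating {row['ratings']}")
    return ". ".join(parts)


# Made-up row with a missing price, for illustration only.
row = {
    "name": "Example AC",
    "main_category": "appliances",
    "sub_category": "Air Conditioners",
    "company": "Acme",
    "discount_price": float("nan"),
    "ratings": 4.2,
}
print(build_text(row))
```

On the real DataFrame this could be applied as `df["text_for_embedding"] = [build_text(r) for r in df.to_dict(orient="records")]`. That is slower than the vectorized string concatenation above, but it keeps placeholder tokens out of the embedded text.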


In [8]:
# Use an efficient embedding model tuned for speed & quality
from sentence_transformers import SentenceTransformer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # fast, small, works well
print("Loading model:", MODEL_NAME)
model = SentenceTransformer(MODEL_NAME, device="cuda")  # USE GPU
max_len = model.max_seq_length  # token limit per input; longer texts are truncated
print("Model loaded on:", model.device)


Loading model: sentence-transformers/all-MiniLM-L6-v2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Model loaded on: cuda:0


In [9]:
import numpy as np
BATCH_SIZE = 2048  # rows per chunk; GPU memory is bounded by batch_size=128 in model.encode() below, so lower that first on OOM
total = len(df)
print("Total rows:", total)
embeddings_path = os.path.join(VECTORSTORE_DIR, "embeddings.npy")

# Generate embeddings chunk by chunk, keep them in memory, and save them as a single .npy; the next cell adds them to Chroma.
all_embeddings = []
for i in tqdm(range(0, total, BATCH_SIZE)):
    texts = df["text_for_embedding"].iloc[i : i + BATCH_SIZE].tolist()
    emb = model.encode(texts, batch_size=128, show_progress_bar=False, convert_to_numpy=True)
    all_embeddings.append(emb)
# Concatenate into single array
all_embeddings = np.vstack(all_embeddings)
print("Embeddings shape:", all_embeddings.shape)

# Optionally save embeddings to disk (useful checkpoint)
np.save(embeddings_path, all_embeddings)
print("Saved embeddings to:", embeddings_path)


Total rows: 551585



Embeddings shape: (551585, 384)
Saved embeddings to: /content/vectorstore/embeddings.npy
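
The cell above writes `embeddings.npy` only after every chunk has been encoded, so a disconnected session loses all progress. One possible way to make the step resumable is to write one shard per chunk and skip shards that already exist. This is a sketch under assumptions: `encode_in_shards`, the `emb_*.npy` / `partial_*.npy` file names, and the injected `encode_fn` / `save_fn` callables are all hypothetical, and the demo uses stand-in functions rather than the real model.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable, List, Sequence


def encode_in_shards(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], Any],
    save_fn: Callable[[Path, Any], None],
    shard_dir: str,
    chunk_size: int,
) -> List[Path]:
    """Encode texts chunk by chunk, skipping chunks that already have a shard.

    Returns the shard files written during this call.
    """
    out_dir = Path(shard_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(texts), chunk_size):
        shard = out_dir / f"emb_{start:09d}.npy"
        if shard.exists():
            continue  # finished in an earlier run
        vectors = encode_fn(texts[start : start + chunk_size])
        # Write under a temporary name first so an interrupted save
        # never leaves a half-written file under the final name.
        partial = out_dir / f"partial_{start:09d}.npy"
        save_fn(partial, vectors)
        partial.replace(shard)
        written.append(shard)
    return written


# Demo with stand-in encode/save functions (no model or NumPy needed).
with tempfile.TemporaryDirectory() as tmp:
    texts = ["a", "b", "c", "d", "e"]
    fake_encode = lambda chunk: [[float(len(t))] for t in chunk]
    fake_save = lambda path, vectors: path.write_text(repr(vectors))
    first = encode_in_shards(texts, fake_encode, fake_save, tmp, chunk_size=2)
    second = encode_in_shards(texts, fake_encode, fake_save, tmp, chunk_size=2)

print(len(first), len(second))  # 3 shards on the first run, none on the rerun
```

In the notebook, `encode_fn` could be `lambda chunk: model.encode(chunk, batch_size=128, convert_to_numpy=True)` and `save_fn` could be `np.save`; the shards would then be loaded in sorted order with `np.load` and stacked with `np.vstack` to rebuild `all_embeddings`.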


In [10]:
import chromadb

# Use a persistent client pointing at VECTORSTORE_DIR
client = chromadb.PersistentClient(path=VECTORSTORE_DIR)

# Create or get the collection. No embedding function is passed because the embeddings are precomputed.
collection = client.get_or_create_collection(name=COLLECTION_NAME)
print("Collection ready:", collection.name)

# Add items in batches to avoid huge single request
for i in tqdm(range(0, total, BATCH_SIZE)):
    idxs = df.index[i : i + BATCH_SIZE].astype(str).tolist()
    docs = df["text_for_embedding"].iloc[i : i + BATCH_SIZE].tolist()
    metas = df.iloc[i : i + BATCH_SIZE].to_dict(orient="records")
    emb_batch = all_embeddings[i : i + BATCH_SIZE].tolist()
    collection.add(
        ids=idxs,
        documents=docs,
        metadatas=metas,
        embeddings=emb_batch
    )

print("All data added to Chroma at:", VECTORSTORE_DIR)


Collection ready: products



All data added to Chroma at: /content/vectorstore
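
The add loop above passes every DataFrame column through as metadata, including NaN values and the `text_for_embedding` column that is already stored as the document. Below is a hedged sketch of trimming each record first. `clean_metadata` is a hypothetical helper and the sample record is made up; the sketch assumes Chroma metadata values should be plain scalars (str, int, float, bool).

```python
import math
from typing import Any, Dict, Iterable, Mapping

SCALAR_TYPES = (str, int, float, bool)


def clean_metadata(record: Mapping[str, Any], drop: Iterable[str] = ()) -> Dict[str, Any]:
    """Drop unwanted keys and missing values; stringify non-scalar values."""
    skip = set(drop)
    cleaned = {}
    for key, value in record.items():
        if key in skip or value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN carries no information as metadata
        if not isinstance(value, SCALAR_TYPES):
            value = str(value)  # fall back to a string representation
        cleaned[key] = value
    return cleaned


# Made-up record for illustration only.
record = {
    "name": "Example AC",
    "ratings": float("nan"),
    "discount_price": 32999.0,
    "text_for_embedding": "Example AC. Category: appliances",
    "tags": ["a", "b"],
}
print(clean_metadata(record, drop=["text_for_embedding"]))
```

Inside the loop this could replace the `metas` line with `metas = [clean_metadata(r, drop=["text_for_embedding"]) for r in df.iloc[i : i + BATCH_SIZE].to_dict(orient="records")]`. Some Chroma versions may reject an empty metadata dict, so a record that cleans down to nothing might need a placeholder key.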


In [11]:
query = "best budget AC under 40000 with inverter technology"
# compute query embedding via same model
q_emb = model.encode([query], convert_to_numpy=True).tolist()

# query by embedding
results = collection.query(
    query_embeddings=q_emb,
    n_results=5,
    include=["metadatas","documents","distances"]
)

for i, meta in enumerate(results["metadatas"][0]):
    print(f"{i+1}. {meta.get('name','<no name>')} | ₹{meta.get('discount_price','?')} | ⭐{meta.get('ratings','?')}")
    print("   link:", meta.get("link",""))
    print()


1. 1.5 Ton Inverter Split AC | ₹42500.0 | ⭐4.0
   link: https://www.amazon.in/1-5-Ton-Inverter-Split-AC/dp/B0BGZW4Z6V/ref=sr_1_579?qid=1679134267&s=kitchen&sr=1-579

2. Power Traders 1.5 Ton 5 Star Inverter Split AC | ₹39499.0 | ⭐4.0
   link: https://www.amazon.in/Power-Traders-Star-Inverter-Split/dp/B0BH4MPKRR/ref=sr_1_573?qid=1679134267&s=kitchen&sr=1-573

3. Candes 4kVA for 1.5 Ton AC Voltage Stabilizer with Wide Working Range Best for Inverter AC, Split AC or Windows AC Upto 1.... | ₹2799.0 | ⭐4.2
   link: https://www.amazon.in/Candes-Crystal-Voltage-Stabilizer-Inverter/dp/B08YR7G8XS/ref=sr_1_902?qid=1679135640&s=appliances&sr=1-902

4. Power Traders 1.5 Ton 3 Star, Inverter Split AC | ₹39499.0 | ⭐4.0
   link: https://www.amazon.in/Power-Traders-Star-Inverter-Split/dp/B0BH4NNXZN/ref=sr_1_572?qid=1679134267&s=kitchen&sr=1-572

5. Luminous Inverter & Battery Combo (Zolt 1100 Pure Sine Wave 900VA/12V Inverter with Red Charge RC 25000 Tall Tubular 200Ah... | ₹23500.0 | ⭐4.4
   link: ht
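
The first hit above costs ₹42500 even though the query asks for something under 40000: similarity search matches the wording of the query, not its numeric constraint. Below is a minimal sketch of post-filtering hits by price. `filter_hits` is a hypothetical helper, the sample data is hand-made, and the sketch assumes the result layout used in the cell above (`results["metadatas"][0]` and `results["distances"][0]`).

```python
from typing import Any, Dict, List


def filter_hits(results: Dict[str, Any], max_price: float) -> List[Dict[str, Any]]:
    """Keep hits from the first query whose discount_price is at most max_price."""
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    kept = []
    for meta, distance in zip(metadatas, distances):
        try:
            price = float(meta.get("discount_price"))
        except (TypeError, ValueError):
            continue  # no usable price on this hit
        if price <= max_price:  # NaN compares False, so it is dropped too
            kept.append({"name": meta.get("name"), "price": price, "distance": distance})
    return kept


# Hand-made results in the same shape; names and distances are illustrative only.
sample = {
    "metadatas": [[
        {"name": "AC A", "discount_price": 42500.0},
        {"name": "AC B", "discount_price": 39499.0},
        {"name": "AC C"},
    ]],
    "distances": [[0.41, 0.45, 0.52]],
}
for hit in filter_hits(sample, max_price=40000):
    print(hit)
```

Because filtering after the fact can leave fewer than `n_results` hits, it may be worth requesting more results than needed. Chroma's `collection.query` also accepts a `where` filter (for example `where={"discount_price": {"$lt": 40000}}`), which would push the constraint into the query itself, provided the price is stored as a number in the metadata.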

In [17]:
!zip -r /content/vectorstore.zip /content/vectorstore

  adding: content/vectorstore/ (stored 0%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/ (stored 0%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/index_metadata.pickle (deflated 57%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/length.bin (deflated 96%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/link_lists.bin (deflated 72%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/data_level0.bin (deflated 12%)
  adding: content/vectorstore/26080368-a3ac-4597-bac2-a4ad1028ff25/header.bin (deflated 56%)
  adding: content/vectorstore/embeddings.npy (deflated 9%)
  adding: content/vectorstore/chroma.sqlite3 (deflated 70%)


In [21]:
from google.colab import files
files.download('/content/vectorstore.zip')  # triggers browser download (may fail if file too large)



In [19]:
# Copy to Drive (if Drive mounted)

DRIVE_DEST = "/content/drive/MyDrive/Data_Science/Projects/Amazon_Ecommorce_agent/data/vectorstore"
!mkdir -p "{DRIVE_DEST}"
!cp /content/vectorstore.zip "{DRIVE_DEST}/vectorstore.zip"
print("Zipped vectorstore copied to Drive at:", DRIVE_DEST + "/vectorstore.zip")


Zipped vectorstore copied to Drive at: /content/drive/MyDrive/Data_Science/Projects/Amazon_Ecommorce_agent/data/vectorstore/vectorstore.zip
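
To reuse the store in a later session, the archive has to be unpacked again before Chroma can open it. Below is a minimal sketch using only the standard library. `restore_vectorstore` is a hypothetical helper, and it assumes the archive was created as in the `zip -r` cell above, so its entries are stored as `content/vectorstore/...`.

```python
import zipfile
from pathlib import Path


def restore_vectorstore(zip_path: str, dest_root: str = "/") -> Path:
    """Extract the archive under dest_root and return the restored store directory."""
    root = Path(dest_root)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(root)
    # `zip -r` stored the entries relative to "/", i.e. content/vectorstore/...
    store_dir = root / "content" / "vectorstore"
    if not (store_dir / "chroma.sqlite3").exists():
        raise FileNotFoundError(f"chroma.sqlite3 not found under {store_dir}")
    return store_dir
```

After `store_dir = restore_vectorstore(DRIVE_DEST + "/vectorstore.zip")`, the collection could be reopened with `chromadb.PersistentClient(path=str(store_dir))` and `client.get_collection("products")`. Queries must be encoded with the same model that produced the stored embeddings.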
