# 1. Project Description and Idea

This project aims to help people choose the product they want more quickly by combining semantic search with Retrieval-Augmented Generation (RAG). Product descriptions are embedded and indexed in ChromaDB, so a free-text query retrieves semantically matching products rather than keyword matches. Gemini then augments the retrieved shortlist, acting as a final sales assistant that pitches and compares the items according to the user's preferences, such as budget, quality, or popularity.

The result is a smart, fast, and user-friendly recommender system that bridges the gap between raw product listings and decision support, enabling personalised and enhanced customer support with a conversational agent.
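
As a rough illustration of the retrieve-then-generate flow described above, the sketch below ranks a tiny catalogue against a query and builds the prompt that would be handed to the LLM. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the three products are invented, and no call to Gemini or ChromaDB is made.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, products, top_k=2):
    """Return the top_k products whose description is most similar to the query."""
    q = embed(query)
    ranked = sorted(products, key=lambda p: cosine(q, embed(p["description"])), reverse=True)
    return ranked[:top_k]

def build_prompt(query, shortlisted):
    """Combine the user query and the retrieved shortlist into one LLM prompt."""
    lines = [f"- {p['name']}: {p['description']}" for p in shortlisted]
    return f"User query:\n{query}\n\nCandidate products:\n" + "\n".join(lines)

# Hypothetical catalogue, for illustration only
products = [
    {"name": "Speaker A", "description": "portable bluetooth speaker with deep bass"},
    {"name": "Adapter B", "description": "usb bluetooth adapter for pc"},
    {"name": "Kettle C", "description": "electric kettle with auto shut off"},
]

shortlist = retrieve("bluetooth speaker", products, top_k=2)
prompt = build_prompt("bluetooth speaker", shortlist)
print(prompt)
```

Only the shortlisted products reach the prompt, which is what keeps the LLM's answer tied to real catalogue entries.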

# 2. Testing and Defining the LLM

In [2]:
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})
genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)

In [None]:
# Imports
import google.generativeai as genai
from IPython.display import Markdown
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)


# Initialize the model
model = genai.GenerativeModel('gemini-2.0-flash')  # or 'gemini-pro' / 'gemini-1.5-pro'

# Generate content
response = model.generate_content(
    "Write one line dad joke.",
    generation_config={
        "max_output_tokens": 200,
        "temperature": 1.5,
        "top_p": 1
    }
)

# Display the result
Markdown(response.text)


What do you call a fish with no eyes? Fsh!
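
The `temperature` and `top_p` settings in the cell above control how the next token is sampled. The sketch below shows the general idea on three made-up token scores; it is a simplified textbook version of temperature scaling and nucleus (top-p) filtering, not a description of how Gemini implements them internally.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalise the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = apply_temperature(logits, 0.5)   # low temperature: mass concentrates on the top token
flat = apply_temperature(logits, 1.5)    # high temperature: distribution flattens
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)
print(sharp, flat, nucleus)
```

A higher temperature (such as the 1.5 used for the joke) spreads probability over more tokens, while a `top_p` below 1 cuts off the unlikely tail before sampling.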


In [4]:
chat = model.start_chat(history=[])
# response = chat.send_message('Hello! My name is Sat.')
response = chat.send_message("Do you have access to the internet?")
Markdown(response.text)

Yes, I have access to the internet. I can use it to search for information, translate languages, get real-time updates, and much more.


In [5]:
# Set generation config (you can pass it as a dict)
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
}

# Prompt for generation
story_prompt = "You are a creative writer. Write a short story about a cat who goes on an adventure."

# Generate content
response = model.generate_content(
    story_prompt,
    generation_config=generation_config
)

# Print the result
print(response.text)

Whiskers twitched, emerald eyes gleamed. Jasper, a ginger tabby with a rebellious streak, wasn't meant for sunbeams and chasing dust bunnies. He was meant for adventure. One day, the open window in the kitchen beckoned, a siren song of the unknown. With a graceful leap, he landed on the dew-kissed grass, the world stretching before him, a vast and exciting canvas.

His journey began with the rustling symphony of the overgrown garden. Butterflies, vibrant as painted dreams, flitted past. He stalked a plump bumblebee, its buzz a thrilling challenge. Beyond the garden fence lay a world of asphalt and rumbling metal beasts, a world Jasper had only glimpsed from the safety of his windowsill.

He navigated the bustling street with feline agility, dodging hurried feet and weaving between parked cars. The smells were intoxicating - the savory aroma of grilling meat, the sweet tang of spilled soda, the intriguing scent of other cats marking their territory. He met a grizzled old tomcat, Scrappe

# 3. Testing and Comparing Faiss VectorDB and ChromaDB Retrieval

## FAISS DB with Sentence Transformer MiniLM (Pretrained)

In [2]:
import pandas as pd
df = pd.read_csv("amazon.csv")
df.columns

Index(['product_id', 'product_name', 'category', 'discounted_price',
       'actual_price', 'discount_percentage', 'rating', 'rating_count',
       'about_product', 'user_id', 'user_name', 'review_id', 'review_title',
       'review_content', 'img_link', 'product_link'],
      dtype='object')

In [None]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

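df = pd.read_csv("amazon.csv")
# Drop row 1279 (the ChromaDB cell below skips the same row as corrupted)
df = pd.concat([df.iloc[:1279, :], df.iloc[1280:, :]])
# Text to embed: product name plus category
df['text'] = df['product_name'] + " " + df['category']
models = ['all-MiniLM-L6-v2', 'msmarco-MiniLM-L6-cos-v5']
model = SentenceTransformer(models[1])  # msmarco-MiniLM-L6-cos-v5

# Unit-normalised embeddings, so L2 distance ranks results the same way as cosine similarity
embeddings = model.encode(df['text'].tolist(), convert_to_numpy=True, normalize_embeddings=True)

# Store in FAISS (exact, brute-force L2 index)
d = embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(embeddings)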


cats = ['discounted_price', 'rating', 'rating_count','product_name','img_link', 'product_link']
def recommend_items(query, top_k=20):
    query_embedding = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    distances, indices = index.search(query_embedding, top_k)
    results = df.iloc[indices[0]][cats]
    return results



In [2]:
recommend_items('I want a bluetooth speaker', 10)

index,discounted_price,rating,rating_count,product_name,img_link,product_link
730,"₹1,999",4.3,63899,"JBL Go 2, Wireless Portable Bluetooth Speaker ...",https://m.media-amazon.com/images/I/51RTfgkScM...,https://www.amazon.in/JBL-Portable-Waterproof-...
691,₹599,4.3,95116,"TP-Link USB Bluetooth Adapter for PC, 5.0 Blue...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/TP-Link-Bluetooth-Receiv...
706,"₹1,299",3.9,12452,Noise Buds Vs104 Bluetooth Truly Wireless in E...,https://m.media-amazon.com/images/I/31YW3+kpZQ...,https://www.amazon.in/Noise-Bluetooth-Wireless...
879,"₹1,599",3.6,2272,Sony WI-C100 Wireless Headphones with Customiz...,https://m.media-amazon.com/images/I/31lF-FdlrH...,https://www.amazon.in/Sony-Headphones-Customiz...
487,"₹4,499",3.5,37,Noise ColorFit Pro 4 Alpha Bluetooth Calling S...,https://m.media-amazon.com/images/I/4123OnLZCF...,https://www.amazon.in/Noise-ColorFit-Bluetooth...
867,"₹1,099",3.5,12966,Boult Audio Airbass Propods X TWS Bluetooth Tr...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Boult-Audio-Bluetooth-Re...
1021,"₹1,499",4.1,25262,"Infinity (JBL Fuze 100, Wireless Portable Blue...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Infinity-Fuze-100-Waterp...
376,"₹2,998",4.1,5179,Noise ColorFit Pro 4 Advanced Bluetooth Callin...,https://m.media-amazon.com/images/I/413x7j3Z30...,https://www.amazon.in/Noise-ColorFit-Bluetooth...
624,"₹2,998",4.1,5179,Noise ColorFit Pro 4 Advanced Bluetooth Callin...,https://m.media-amazon.com/images/I/413x7j3Z30...,https://www.amazon.in/Noise-ColorFit-Bluetooth...
899,₹899,3.8,10751,Zebronics ZEB-VITA Wireless Bluetooth 10W Port...,https://m.media-amazon.com/images/I/31c2Mxy32-...,https://www.amazon.in/Zebronics-Zeb-Vita-Porta...
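
The FAISS cell above builds an `IndexFlatL2` index over embeddings encoded with `normalize_embeddings=True`. For unit-length vectors, squared L2 distance and cosine similarity are tied by the identity ‖q − d‖² = 2 − 2·cos(q, d), so both produce the same ranking. The sketch below checks this on made-up three-dimensional vectors in plain Python; it does not call FAISS itself.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for a query embedding and three document embeddings
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 1.9, 0.4], [0.2, 0.1, 3.0], [2.0, 0.5, 0.5])]

# Rank by ascending distance and by descending cosine similarity
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)

# Largest deviation from the identity ||q - d||^2 == 2 - 2 * cos(q, d)
max_gap = max(
    abs(squared_l2(query, d) - (2 - 2 * dot(query, d))) for d in docs
)
print(by_l2 == by_cosine, max_gap)
```

This is why the weaker FAISS results are more plausibly explained by the embedding model and the embedded text (name plus category only) than by the choice of distance metric.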


## ChromaDB Retrieval

In [None]:
# -----------------------------
# SETUP AND IMPORTS
# -----------------------------

import os
import pickle
import pandas as pd
import chromadb
from chromadb.config import Settings
from chromadb import Documents, EmbeddingFunction, Embeddings
import google.generativeai as genai

# -----------------------------
# CONFIGURE GEMINI
# -----------------------------
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))  # Read the key from the environment; never hard-code it

# -----------------------------
# LOAD DATASET
# -----------------------------
df = pd.read_csv("amazon.csv")
df = pd.concat([df.iloc[:1279, :], df.iloc[1280:, :]])  # Skip corrupted row if needed
# Optional: limit for testing
# df = df.head(100)

# -----------------------------
# PATHS
# -----------------------------
EMBEDDING_CACHE_FILE = "gemini_embeddings.pkl"
CHROMA_PERSIST_DIR = "./chroma_storage"
COLLECTION_NAME = "amazonproductdb"

# -----------------------------
# GEMINI EMBEDDING FUNCTION
# -----------------------------
class GeminiEmbeddingFunction(EmbeddingFunction):
    # Recent ChromaDB releases validate that the parameter is named `input`
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = []
        # One API call per document; results are cached to disk further down
        for doc in input:
            response = genai.embed_content(
                model="models/text-embedding-004",
                content=doc,
                task_type="retrieval_document",
                title="embedding input"
            )
            embeddings.append(response["embedding"])
        return embeddings

# -----------------------------
# DOCUMENT PREPARATION
# -----------------------------
def prepare_docs(df):
    return [
        f"""
        PRODUCT NAME: {row['product_name']}
        DESCRIPTION: {row['about_product']}
        PRICE: {row['discounted_price']}
        RATING: {row['rating']}
        RATING COUNT: {row['rating_count']}
        """
        for _, row in df.iterrows()
    ]

# -----------------------------
# EMBEDDING GENERATION OR LOADING
# -----------------------------
if os.path.exists(EMBEDDING_CACHE_FILE):
    print("✅ Loading cached embeddings...")
    with open(EMBEDDING_CACHE_FILE, "rb") as f:
        doc_ids, docs, doc_embeddings = pickle.load(f)
else:
    print("⚙️ Generating new embeddings (only runs once)...")
    docs = prepare_docs(df)
    embed_fn = GeminiEmbeddingFunction()
    doc_embeddings = embed_fn(docs)
    doc_ids = [str(i) for i in range(len(docs))]

    # Save cache
    with open(EMBEDDING_CACHE_FILE, "wb") as f:
        pickle.dump((doc_ids, docs, doc_embeddings), f)
    print("✅ Embeddings saved to", EMBEDDING_CACHE_FILE)

# -----------------------------
# CHROMADB SETUP (PERSISTENT)
# -----------------------------
from chromadb import PersistentClient

# ✅ Define Gemini embedder
embed_fn = GeminiEmbeddingFunction()

# ✅ Create ChromaDB client
chroma_client = PersistentClient(path=CHROMA_PERSIST_DIR)

# ✅ Always pass Gemini embedder to collection (created on first run, reused afterwards)
collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embed_fn
)


# Add to ChromaDB if empty
if collection.count() == 0:
    print("📦 Populating ChromaDB...")
    batch_size = 20
    for i in range(0, len(docs), batch_size):
        collection.add(
            documents=docs[i:i + batch_size],
            embeddings=doc_embeddings[i:i + batch_size],
            ids=doc_ids[i:i + batch_size]
        )
    print("✅ ChromaDB populated and persisted at", CHROMA_PERSIST_DIR)
else:
    print("✅ ChromaDB already contains data")


# -----------------------------
# RECOMMENDER FUNCTION
# -----------------------------
cats = ['discounted_price', 'rating', 'rating_count', 'product_name', 'img_link', 'product_link']

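def recommend_items(query: str, top_k: int = 10):
    results = collection.query(query_texts=[query], n_results=top_k)
    # Document IDs are row positions (0..len(docs)-1), not index labels.
    # One row was dropped above, so look rows up by position with iloc.
    positions = [int(doc_id) for doc_id in results['ids'][0]]
    return df.iloc[positions][cats]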

# -----------------------------
# EXAMPLE USAGE
# -----------------------------
recdf = recommend_items("I want a bluetooth speaker", 10)
print(recdf)


ValueError: The onnxruntime python package is not installed. Please install it with `pip install onnxruntime`

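The FAISS results above are noticeably less relevant: only three of the ten hits are Bluetooth speakers, and the rest are an adapter, earbuds, headphones, and smartwatches. The ChromaDB results for the same query (listed in the table in the next section) are all Bluetooth speakers, so we use ChromaDB for the next step.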
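
One way to put a number on this comparison is precision@k, the share of the top k results that are relevant. The sketch below is a minimal version under a loud assumption: the relevance flags were assigned by hand from the product names in the FAISS table above and in the ChromaDB table in the next section, treating a result as relevant only if it is a Bluetooth speaker.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the first k results that are marked relevant."""
    top = relevant_flags[:k]
    return sum(top) / len(top) if top else 0.0

# Hand-assigned labels (an assumption), in the order the results were returned.
# FAISS: JBL Go 2, adapter, earbuds, headphones, smartwatch, earbuds,
#        Infinity Fuze 100, smartwatch, smartwatch, Zebronics ZEB-VITA
faiss_flags = [True, False, False, False, False, False, True, False, False, True]
# ChromaDB: all ten retrieved products are speakers
chroma_flags = [True] * 10

faiss_p10 = precision_at_k(faiss_flags, 10)
chroma_p10 = precision_at_k(chroma_flags, 10)
print(faiss_p10, chroma_p10)
```

A single query is a thin basis for choosing a retriever; running the same measure over a handful of labelled queries would make the comparison more convincing.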

## Testing Priority Scaling (Not Implemented)

In [None]:
from sklearn.preprocessing import MinMaxScaler

def scaleVals(xdf):
    # Work on a copy so the retrieved results are not modified in place
    xdf = xdf.copy()
    sc = MinMaxScaler()
    # Column 0 (discounted_price): strip the currency symbol and commas,
    # then invert the scaled value so that cheaper products score higher
    xdf.iloc[:, 0] = [float(i[1:].replace(',','')) for i in xdf.iloc[:, 0]]
    xdf.iloc[:, 0] = 1-sc.fit_transform(xdf.iloc[:, 0].values.reshape(-1,1)).squeeze()
    # Column 1 (rating)
    xdf.iloc[:, 1] = [float(i) for i in xdf.iloc[:, 1]]
    xdf.iloc[:, 1] = sc.fit_transform(xdf.iloc[:, 1].values.reshape(-1,1)).squeeze()
    # Column 2 (rating_count): remove thousands separators first
    xdf.iloc[:, 2] = [float(str(i).replace(',','')) for i in xdf.iloc[:, 2]]
    xdf.iloc[:, 2] = sc.fit_transform(xdf.iloc[:, 2].values.reshape(-1,1)).squeeze()
    return xdf

query = 'I want a bluetooth speaker'
resdf = recommend_items(query, 10)
resdf = scaleVals(resdf)
# Weighted priority score: 20% price, 55% rating, 25% rating count
resdf['ind'] = resdf.iloc[:, 0]*0.2+resdf.iloc[:, 1]*0.55+resdf.iloc[:, 2]*0.25
resdf.sort_values('ind', ascending=False)

index,discounted_price,rating,rating_count,product_name,img_link,product_link,ind
730,0.0,1.0,0.987462,"JBL Go 2, Wireless Portable Bluetooth Speaker ...",https://m.media-amazon.com/images/I/51RTfgkScM...,https://www.amazon.in/JBL-Portable-Waterproof-...,0.796866
804,0.137931,0.833333,0.634778,boAt Stone 650 10W Bluetooth Speaker with Upto...,https://m.media-amazon.com/images/I/41rfSd9spq...,https://www.amazon.in/Stone-650-Wireless-Bluet...,0.644614
753,0.758621,0.666667,0.467451,"Infinity (JBL Fuze Pint, Wireless Ultra Portab...",https://m.media-amazon.com/images/I/41Qf-pUQr9...,https://www.amazon.in/Infinity-Fuze-Pint-Porta...,0.635253
712,1.0,0.333333,1.0,Zebronics ZEB-COUNTY 3W Wireless Bluetooth Por...,https://m.media-amazon.com/images/I/41goRo3UXh...,https://www.amazon.in/Zebronics-Zeb-County-Blu...,0.633333
999,0.551724,0.833333,0.038733,boAt Stone 250 Portable Wireless Speaker with ...,https://m.media-amazon.com/images/I/51J45Dcgkt...,https://www.amazon.in/boAt-Stone-250-Playback-...,0.578361
806,0.689655,0.666667,0.278641,boAt Stone 180 5W Bluetooth Speaker with Upto ...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/boAt-Stone-Bluetooth-Spe...,0.574258
1021,0.344828,0.666667,0.386454,"Infinity (JBL Fuze 100, Wireless Portable Blue...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Infinity-Fuze-100-Waterp...,0.532246
671,0.655172,0.333333,0.021171,ZEBRONICS Zeb-Astra 20 Wireless BT v5.0 Portab...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/ZEBRONICS-Zeb-Astra-20-W...,0.31966
899,0.758621,0.166667,0.160732,Zebronics ZEB-VITA Wireless Bluetooth 10W Port...,https://m.media-amazon.com/images/I/31c2Mxy32-...,https://www.amazon.in/Zebronics-Zeb-Vita-Porta...,0.283574
1016,0.827586,0.0,0.0,Zebronics Astra 10 Portable Wireless BT v5.0 S...,https://m.media-amazon.com/images/I/41uoxHxPDa...,https://www.amazon.in/ZEBRONICS-Zeb-Astra-Wire...,0.165517
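
The `ind` column above is a weighted sum of the three min-max scaled columns. The sketch below reproduces that arithmetic in plain Python for two rows of the table; the helper names `min_max` and `priority_score` are hypothetical, and the three prices passed to `min_max` are made up to show the scaling step.

```python
def min_max(values):
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def priority_score(price, rating, rating_count, weights=(0.2, 0.55, 0.25)):
    """Weighted sum of already-scaled values (price is assumed to be inverted)."""
    w_price, w_rating, w_count = weights
    return w_price * price + w_rating * rating + w_count * rating_count

# Scaling step on made-up prices; inverting makes the cheapest item score 1.0
scaled_prices = min_max([999.0, 1499.0, 1999.0])
inverted_prices = [1 - p for p in scaled_prices]

# Scaled values copied from the table above
jbl = priority_score(0.0, 1.0, 0.987462)   # JBL Go 2
zeb = priority_score(1.0, 0.333333, 1.0)   # Zebronics ZEB-COUNTY
print(inverted_prices, jbl, zeb)
```

Because rating carries 55% of the weight, the most expensive result (JBL Go 2, scaled price 0.0) still ranks first on the strength of its rating and rating count.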


# 4. Combining LLM and RAG power for product recommendation!

## Single-Turn Output, without Chat

In [None]:
import os
import google.generativeai as genai
from IPython.display import Markdown

# Configure the API key from the environment; never hard-code it in the notebook
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Define the system prompt (named system_prompt to avoid shadowing the stdlib `sys` module)
system_prompt = """You are a product recommender and an expert on persuasive product
suggestions. Answer and pitch the users exactly TWO best products that
match their requirements according to the given product entries. Make sure to
answer the client according to their preference of either price, quality, or
mixed. Explain why they should take each product and make sure to include
a clickable product link after every product you recommend."""

# Define the model with system config
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction=system_prompt
)

# User query and product results
query = 'I want a good bluetooth speaker within 1200.'
resdf = recommend_items(query, 10)
parsed_prods = resdf.to_json(orient='records')

# Construct message
message = f"""User query:
{query}

Candidate products (JSON format):
{parsed_prods}
"""

# Generate response
response = model.generate_content(message)

# Display nicely
Markdown(response.text)


Okay, I understand you're looking for a good Bluetooth speaker under ₹1200. Here are two options that balance quality and price:

**1. boAt Stone 180**

*   **Price:** ₹999
*   **Rating:** 4.1 (18,331 ratings)

The boAt Stone 180 is an excellent choice because it provides a great balance of features and affordability. It has a solid rating, a decent 5W output, is IPX7 waterproof, and offers up to 10 hours of playback time. The TWS feature is a bonus, allowing you to pair two speakers for a stereo experience.

Here's the link: [https://www.amazon.in/boAt-Stone-Bluetooth-Speaker-Black/dp/B08JMC1988/ref=sr\_1\_243?qid=1672903007&s=computers&sr=1-243](https://www.amazon.in/boAt-Stone-Bluetooth-Speaker-Black/dp/B08JMC1988/ref=sr_1_243?qid=1672903007&s=computers&sr=1-243)

**2. ZEBRONICS Zeb-Astra 20**

*   **Price:** ₹1,049
*   **Rating:** 3.9 (1,779 ratings)

The ZEBRONICS Zeb-Astra 20 offers a good set of features for its price. It has a 10W RMS output for louder sound, TWS function, 10H Backup time, and multiple connectivity options like FM Radio, AUX, mSD, and USB.

Here's the link: [https://www.amazon.in/ZEBRONICS-Zeb-Astra-20-Wireless-Rechargeable/dp/B0B12K5BPM/ref=sr\_1\_93?qid=1672902998&s=computers&sr=1-93](https://www.amazon.in/ZEBRONICS-Zeb-Astra-20-Wireless-Rechargeable/dp/B0B12K5BPM/ref=sr_1_93?qid=1672902998&s=computers&sr=1-93)
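
In the cell above, the budget ("within 1200") is left entirely to the model to interpret from the raw price strings. A possible alternative, sketched below, is to parse the prices and filter the candidates before they are sent to the LLM. The helpers `parse_price` and `within_budget` are hypothetical and not part of the notebook; the three names and prices are taken from outputs shown earlier.

```python
def parse_price(price_text):
    """Convert a price string such as '₹1,999' to a float."""
    return float(price_text.replace("₹", "").replace(",", "").strip())

def within_budget(products, max_price):
    """Keep only the products whose discounted price does not exceed max_price."""
    return [p for p in products if parse_price(p["discounted_price"]) <= max_price]

# Names and prices as they appear in outputs earlier in this notebook
candidates = [
    {"product_name": "JBL Go 2", "discounted_price": "₹1,999"},
    {"product_name": "boAt Stone 180", "discounted_price": "₹999"},
    {"product_name": "ZEBRONICS Zeb-Astra 20", "discounted_price": "₹1,049"},
]

affordable = within_budget(candidates, 1200)
print([p["product_name"] for p in affordable])
```

Filtering up front trades flexibility for reliability: the model can no longer recommend an over-budget item, but the budget has to be extracted from the query by some other means first.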

## Using Chat with History for User Follow-ups

In [None]:
import os
import google.generativeai as genai
from IPython.display import Markdown

# Configure the API key from the environment; never hard-code it in the notebook
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Define system prompt (named system_prompt to avoid shadowing the stdlib `sys` module)
system_prompt = """You are a product recommender and an expert on persuasive product
suggestions. Answer and pitch the users exactly TWO best products that
match their requirements according to the given product entries. Make sure to
answer the client according to their preference of either price, quality, or
mixed. Explain why they should take each product and make sure to include
a clickable product link after every product you recommend."""

# Create the model
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction=system_prompt,
    generation_config=genai.types.GenerationConfig(
        temperature=0.7,
        top_p=0.8
    )
)


# Start chat session (to retain context/history)
chat = model.start_chat(history=[])

# Prepare user query and context
query = 'I want a good bluetooth speaker within 1200.  Preference: quality over price.'
resdf = recommend_items(query, 10)
parsed_prods = resdf.to_json(orient='records')

chat_message = f"""
User query:
{query}

Candidate products (JSON format):
{parsed_prods}
"""

# Send message and get response
response = chat.send_message(chat_message)

# Display response
Markdown(response.text)


Okay, I understand you're looking for a high-quality Bluetooth speaker for under ₹1200. Since you prioritize quality, here are two recommendations that offer the best value and performance within your budget:

1.  **boAt Stone 250 Portable Wireless Speaker**

    *   **Why you should buy it:** This speaker is an excellent choice because it provides immersive audio quality with its 5W RMS output. It has an IPX7 water resistance rating, making it durable and suitable for various environments. Additionally, the RGB LEDs add a stylish touch.
    *   **Link:** [https://www.amazon.in/boAt-Stone-250-Playback-Hours/dp/B08SMJT55F/ref=sr\_1\_464?qid=1672903018&s=computers&sr=1-464](https://www.amazon.in/boAt-Stone-250-Playback-Hours/dp/B08SMJT55F/ref=sr_1_464?qid=1672903018&s=computers&sr=1-464)
2.  **ZEBRONICS Zeb-Astra 20 Wireless BT v5.0 Portable Speaker**

    *   **Why you should buy it:** The Zebronics Zeb-Astra 20 is another solid option, offering 10W RMS output and TWS (True Wireless Stereo) functionality, allowing you to pair two speakers for an enhanced stereo experience. It also supports multiple input options like FM Radio, AUX, mSD, and USB, making it versatile.
    *   **Link:** [https://www.amazon.in/ZEBRONICS-Zeb-Astra-20-Wireless-Rechargeable/dp/B0B12K5BPM/ref=sr\_1\_93?qid=1672902998&s=computers&sr=1-93](https://www.amazon.in/ZEBRONICS-Zeb-Astra-20-Wireless-Rechargeable/dp/B0B12K5BPM/ref=sr_1_93?qid=1672902998&s=computers&sr=1-93)

Both speakers provide a good balance of features and sound quality within your budget, with the boAt Stone 250 focusing on portability and style, while the ZEBRONICS Zeb-Astra 20 offers more versatility in terms of connectivity.


In [None]:
response = chat.send_message("""
I actually want something that's over 1200, what can I get?
""")
Markdown(response.text)

Okay, if you're willing to go over ₹1200 for even better quality, here are two recommendations that offer excellent value and performance:

1.  **Infinity (JBL Fuze 100, Wireless Portable Bluetooth Speaker**

    *   **Why you should buy it:** The JBL Fuze 100 is an excellent choice because it provides deep bass and dual equalizer. It has an IPX7 water resistance rating, making it durable and suitable for various environments.
    *   **Link:** [https://www.amazon.in/Infinity-Fuze-100-Waterproof-Portable/dp/B07W7Z6DVL/ref=sr\_1\_500?qid=1672903019&s=computers&sr=1-500](https://www.amazon.in/Infinity-Fuze-100-Waterproof-Portable/dp/B07W7Z6DVL/ref=sr_1_500?qid=1672903019&s=computers&sr=1-500)
2.  **boAt Stone 650 10W Bluetooth Speaker**

    *   **Why you should buy it:** The boAt Stone 650 is another solid option, offering 10W output. It also supports multiple input options like Integrated controls, making it versatile.
    *   **Link:** [https://www.amazon.in/Stone-650-Wireless-Bluetooth-Speaker/dp/B07NC12T2R/ref=sr\_1\_241?qid=1672903007&s=computers&sr=1-241](https://www.amazon.in/Stone-650-Wireless-Bluetooth-Speaker/dp/B07NC12T2R/ref=sr_1_241?qid=1672903007&s=computers&sr=1-241)

Both speakers provide a good balance of features and sound quality, with the Infinity JBL focusing on portability and style, while the boAt Stone 650 offers more versatility in terms of connectivity.
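
The follow-up message above is sent without new candidate products, so the model can only draw on the ten products from the first turn that remain in the chat history. One possible refinement, sketched below, is to re-run retrieval for every user turn and attach the fresh candidates to the message. `build_chat_message` and `fake_retrieve` are hypothetical; in the notebook the retriever would presumably wrap `recommend_items`, and the two products here are invented.

```python
import json

def build_chat_message(query, retrieve, top_k=10):
    """Attach freshly retrieved candidates to a user turn."""
    candidates = retrieve(query, top_k)
    return (
        f"User query:\n{query}\n\n"
        "Candidate products (JSON format):\n"
        f"{json.dumps(candidates, ensure_ascii=False)}\n"
    )

def fake_retrieve(query, top_k):
    # Stand-in for a real retriever; the catalogue below is invented
    catalogue = [
        {"product_name": "Speaker X", "discounted_price": "₹1,499"},
        {"product_name": "Speaker Y", "discounted_price": "₹899"},
    ]
    return catalogue[:top_k]

message = build_chat_message(
    "I actually want something that's over 1200", fake_retrieve, top_k=1
)
print(message)
```

The resulting string could then be passed to `chat.send_message(...)` in place of the bare follow-up text, at the cost of a longer prompt on every turn.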


In [None]:
!zip -r project_files.zip gemini_embeddings.pkl chroma_storage amazon.csv
from google.colab import files
files.download("project_files.zip")


  adding: gemini_embeddings.pkl (deflated 23%)
  adding: chroma_storage/ (stored 0%)
  adding: chroma_storage/chroma.sqlite3 (deflated 57%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/ (stored 0%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/length.bin (deflated 46%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/link_lists.bin (deflated 88%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/data_level0.bin (deflated 10%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/index_metadata.pickle (deflated 48%)
  adding: chroma_storage/4009e900-49f6-42f0-ad2d-ccb5eccd1980/header.bin (deflated 60%)
  adding: amazon.csv (deflated 57%)

