### **Semantic Search for Price Estimation**  

#### **Objective**  
This notebook demonstrates how **semantic search** enhances price estimation by retrieving similar products from a vector database. It covers:  
1. **Embedding Generation**: Using `all-MiniLM-L6-v2` to create dense vector representations of product descriptions.  
2. **Vector Database Setup**: Storing 400K+ products in ChromaDB for efficient similarity search.  
3. **Visualization**: t-SNE plots to validate clustering by product categories.  
4. **Downstream Use Case**: Preparing the foundation for RAG-based price prediction (to be integrated with an LLM in later steps).  


In [None]:
# imports

import os
import re
import math
import json
from tqdm import tqdm
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from utils.items import Item
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [None]:
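# Environment setup
load_dotenv(override=True)
hf_token = os.getenv('HF_TOKEN')
if not hf_token:
    # Assigning None to os.environ raises a TypeError, so fail with a clear message instead
    raise RuntimeError("HF_TOKEN is not set; add it to your .env file")
DB = "products_vectorstore"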

# HuggingFace login
login(os.environ['HF_TOKEN'], add_to_git_credential=True)

### 1. Data Preparation


In [None]:
# Load pre-processed training data
with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

# Sample product description
train[0].prompt 

### 2. Vector Database Setup


In [None]:
# Initialize ChromaDB
client = chromadb.PersistentClient(path=DB)

In [None]:
# Create fresh collection: Check if the collection exists and delete it if it does

collection_name = "products"
existing_collection_names = [collection.name for collection in client.list_collections()]
if collection_name in existing_collection_names:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")

collection = client.create_collection(collection_name)

**Windows Tip**: If ChromaDB crashes on Windows, try pinning the version with `pip install chromadb==0.5.0`.

---

### 3. Embedding Model


In [None]:
# Initialize sentence transformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Example embedding
vector = model.encode(["Sample product"])[0]

**Why This Model?**
- Local execution (no need for API calls)
- 384-dimension embeddings
- Optimized for semantic search
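
Semantic search ranks items by how close their embedding vectors are. Cosine similarity is one common closeness measure, and the sketch below shows it in plain Python on toy 3-dimensional vectors standing in for the 384-dimensional embeddings. The vectors and the `cosine_similarity` helper are illustrative assumptions, and the distance metric ChromaDB actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors with made-up values (real embeddings have 384 dimensions)
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]      # same direction as the query
unrelated = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, similar))    # close to 1
print(cosine_similarity(query, unrelated))  # zero for orthogonal vectors
```

Vectors pointing the same way score near 1 regardless of length, which is why cosine similarity is popular for comparing text embeddings.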

---

### 4. Populating the Vector Store


In [None]:
def description(item):
    """Extracts clean product description from prompt"""
    text = item.prompt.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]
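
To see what the helper keeps and discards, here is a small self-contained check. It repeats the same logic and runs it on a hypothetical item built with `SimpleNamespace`. The product text and price are made up, and the prompt layout is assumed from the strings the helper looks for.

```python
from types import SimpleNamespace

def description(item):
    """Same logic as the notebook helper: drop the question and the price suffix."""
    text = item.prompt.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]

# Hypothetical item; only the .prompt attribute matters here
fake_item = SimpleNamespace(
    prompt="How much does this cost to the nearest dollar?\n\n"
           "Cordless drill with two batteries\n\nPrice is $89.00"
)

print(description(fake_item))  # Cordless drill with two batteries
```

Stripping the price suffix means the stored document, and therefore its embedding, reflects only the product text.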


In [None]:
# Batch insert products, 1,000 at a time
BATCH_SIZE = 1000
for i in tqdm(range(0, len(train), BATCH_SIZE)):
    batch = train[i: i+BATCH_SIZE]
    documents = [description(item) for item in batch]
    vectors = model.encode(documents).astype(float).tolist()
    metadatas = [{"category": item.category, "price": item.price} for item in batch]
    # Size the ID range to the batch so a short final batch still matches the documents
    ids = [f"doc_{j}" for j in range(i, i+len(batch))]
    collection.add(
        ids=ids,
        documents=documents,
        embeddings=vectors,
        metadatas=metadatas
    )
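
When the dataset size is not a multiple of the batch size, the final batch is shorter than the rest, and `collection.add` expects `ids`, `documents`, `embeddings`, and `metadatas` to have equal lengths. A minimal sketch of ID generation that respects the short final batch, using a hypothetical helper `batched_ids` and a made-up total of 2,500 items:

```python
def batched_ids(total, batch_size):
    """Yield (start, ids) per batch, sizing ids to the items actually in the batch."""
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        yield start, [f"doc_{j}" for j in range(start, end)]

batches = list(batched_ids(total=2500, batch_size=1000))
for start, ids in batches:
    # Prints the batch offset, the batch length, and the first and last ID
    print(start, len(ids), ids[0], ids[-1])
```

The last batch here holds 500 IDs, ending at `doc_2499`, instead of 1,000 IDs that run past the end of the data.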

### 5. Data Visualization


In [None]:
# Configuration

MAX_DATAPOINTS = 30_000  # Adjust based on system capability

DB = "products_vectorstore"
client = chromadb.PersistentClient(path=DB)
collection = client.get_or_create_collection('products')

In [None]:
CATEGORIES = ['Appliances', 'Automotive', 'Cell_Phones_and_Accessories', 'Electronics','Musical_Instruments', 'Office_Products', 'Tools_and_Home_Improvement', 'Toys_and_Games']
COLORS = ['red', 'blue', 'brown', 'orange', 'yellow', 'green' , 'purple', 'cyan']

In [None]:
# Prework: Get data for visualization

result = collection.get(include=['embeddings', 'documents', 'metadatas'], limit=MAX_DATAPOINTS)
vectors = np.array(result['embeddings'])
documents = result['documents']
categories = [metadata['category'] for metadata in result['metadatas']]
colors = [COLORS[CATEGORIES.index(c)] for c in categories]

### Dimensionality reduction

In [None]:
# Dimensionality reduction: 2D

tsne = TSNE(n_components=2, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)

In [None]:
# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=3, color=colors, opacity=0.7),
)])

fig.update_layout(
    title='2D Chroma Vectorstore Visualization',
    xaxis_title='x',  # 2D plots take axis titles directly; `scene` applies only to 3D plots
    yaxis_title='y',
    width=1200,
    height=800,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

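**Visualization Insights**:
- Points are colored by product category, so distinct color clusters show that products in the same category embed close together
- Clear clustering is a quick sanity check on embedding quality
- The Plotly figure is interactive, so you can zoom and pan to explore the product space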

### Key Takeaways

1. **Vector Search Foundation**: Built a ChromaDB vector database of 400K+ products
2. **Local Embeddings**: Used sentence-transformers to generate embeddings locally, with no API calls
3. **Visual Validation**: Used t-SNE plots to check that embeddings cluster by product category
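
For the later RAG step, similar products retrieved from the collection would be handed to an LLM as context. The sketch below is one possible way to format them, not the method used in later steps. It assumes the nested-list layout that `collection.query` returns (one inner list per query) and uses a hypothetical helper `build_context` with a hand-written stand-in result. The products and prices are made up.

```python
def build_context(query_result):
    """Format the neighbours of the first query as 'description (price)' lines."""
    documents = query_result["documents"][0]
    metadatas = query_result["metadatas"][0]
    lines = []
    for doc, meta in zip(documents, metadatas):
        lines.append(f"- {doc} (${meta['price']:.2f})")
    return "\n".join(lines)

# Hand-written stand-in for a query result (made-up products and prices)
fake_result = {
    "documents": [["Cordless drill with two batteries", "Corded hammer drill"]],
    "metadatas": [[
        {"category": "Tools_and_Home_Improvement", "price": 89.0},
        {"category": "Tools_and_Home_Improvement", "price": 64.5},
    ]],
}

context = build_context(fake_result)
print(context)
```

Because the price is stored in each item's metadata, it can be placed next to the matching description without re-parsing the original prompt.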