# The Second Agent - estimate the actual value of a product

## RAG (Retrieval Augmented Generation) based on a dataset of 400,000 scraped Amazon products

#### For our 2nd agent, we will be asking DeekSeek to estimate the price of one of our deals - and we will give it a hand.

It turns out that LLMs are really good at this! Out of the box, GPT-4o is off by an average of \$76.

But we can do even better: we'll provide it with some context, in the form of 5 similar products from our training dataset

Again I'll be going quite quickly through this - the idea is for you to run this yourself.

In [3]:
# imports

import os
import re
import math
import json
import logging
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from items import Item
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from testing import Tester
from openai import OpenAI

In [4]:
# environment

load_dotenv(override=True)
os.environ['GEMINI_API_KEY'] = os.getenv('GEMINI_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')
DB = "products_vectorstore"

In [5]:
# Log in to HuggingFace
# If you don't have a HuggingFace account, you can set one up for free at www.huggingface.co
# And then add the HF_TOKEN to your .env file as explained in the project README

hf_token = os.environ['HF_TOKEN']
login(token=hf_token, add_to_git_credential=False)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


# For following along at home:

Please download the files train.pkl and test.pkl from this Google Drive folder:  
https://drive.google.com/drive/folders/1t0YnoCXCbo2g08uWIOR6TPKR2-6Egb_g?usp=sharing

And place them in the parent directory (the directory called agentic).

In [6]:
# Load the training data

with open('../train.pkl', 'rb') as file:
    train = pickle.load(file)


In [7]:
print(f"There are {len(train):,} training items scraped from Amazon, and the first one is {train[0]}")

There are 400,000 training items scraped from Amazon, and the first one is <Delphi FG0166 Fuel Pump Module = $226.95>


# Now create a Chroma Datastore

Now we will use the free, open-source Vector database Chroma.  
We will create a Chroma datastore with 400,000 products from our training dataset.

In [8]:
client = chromadb.PersistentClient(path=DB)

# Introducing the SentenceTransformer Encoding LLM

The all-MiniLM is a very useful model from HuggingFace that maps sentences & paragraphs to 384 dimensional vectors and is ideal for tasks like semantic search.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It can run pretty quickly locally.

As an alternative, OpenAI provides a closed-source Embeddings model. Benefits compared to OpenAI embeddings:
1. It's free and fast!
3. We can run it locally, so the data never leaves our box - might be useful if you're building a personal RAG

In [9]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [10]:
# Pass in a list of texts, get back a numpy array of vectors

vector = model.encode(["A room full of software engineers"])[0]
print(vector.shape)
vector

  return forward_call(*args, **kwargs)


(384,)


array([-1.07516805e-02, -3.26470472e-02,  1.49071508e-03, -1.16257416e-03,
        1.23239458e-02, -1.04110382e-01,  4.23309542e-02,  1.54848639e-02,
        8.58570822e-03, -1.91479810e-02, -5.22568934e-02, -6.39600977e-02,
        7.70011619e-02, -4.21598330e-02, -1.38170703e-03,  3.89791280e-02,
       -5.20618483e-02, -1.04253314e-01,  5.08281291e-02, -1.03140876e-01,
       -5.72729893e-02,  1.89593136e-02, -2.37538461e-02, -2.35941689e-02,
        4.17627841e-02,  7.25999102e-02,  4.73284870e-02, -1.88370142e-02,
        5.44371568e-02, -3.84208411e-02,  2.78963987e-03,  7.38718435e-02,
        7.02842623e-02,  5.30361310e-02,  8.85342285e-02,  7.76899010e-02,
       -2.84251408e-03, -6.90025464e-02,  3.75820659e-02,  2.84839775e-02,
       -1.00663275e-01,  2.71375179e-02,  4.32914309e-02,  2.81325709e-02,
       -2.92596761e-02, -8.14440995e-02,  1.13112703e-02, -8.49852860e-02,
        4.92296554e-02, -3.95888798e-02, -4.91748154e-02, -4.37014475e-02,
       -1.21951280e-02,  

In [11]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vector1 = model.encode(["A room full of data scientists"])[0]
vector2 = model.encode(["A room full of LLM engineers"])[0]
cosine_similarity(vector1, vector2)

0.48992464

In [12]:
vector1 = model.encode(["A room full of data scientists"])[0]
vector3 = model.encode(["A hovercraft full of eels"])[0]
cosine_similarity(vector1, vector3)

0.058423225

In [13]:
# The sentence equivalent of the famous King - Man + Woman = Queen

room_scientists = model.encode(["A room full of data scientists"])[0]
room_llm = model.encode(["A room full of LLM engineers"])[0]
scientists = model.encode(["data scientists"])[0]
llm = model.encode(["LLM engineers"])[0]

In [14]:
cosine_similarity(room_scientists, room_llm)

0.48992464

In [15]:
cosine_similarity(scientists, llm)

0.29984185

In [16]:
new_room_llm = room_scientists - scientists + llm

In [17]:
cosine_similarity(new_room_llm, room_llm)

0.9327203

## With that background, let's populate our Chroma database

### By calculating vectors for 400,000 scraped products

In [18]:
import tqdm

In [19]:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import chromadb
import numpy as np

# Connect to Chroma DB
client = chromadb.PersistentClient(path="products_vectorstore")
collection_name = "products"

# ✅ Delete collection only if it exists
existing_collections = [c.name for c in client.list_collections()]
if collection_name in existing_collections:
    client.delete_collection(collection_name)

# Create fresh collection
collection = client.create_collection(collection_name)

# Load sentence transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Use only first 4000 training samples
train_subset = train[:4000]

# Ingest in batches
batch_size = 1000
for i in tqdm(range(0, len(train_subset), batch_size)):
    batch = train_subset[i:i + batch_size]
    documents = [item.text for item in batch]
    vectors = model.encode(documents).astype(float).tolist()
    metadatas = [{"category": item.category, "price": item.price} for item in batch]
    ids = [f"doc_{j}" for j in range(i, i + len(batch))]
    collection.add(ids=ids, documents=documents, embeddings=vectors, metadatas=metadatas)

print("✅ Ingestion complete.")
print("🔍 Sample:", collection.get(limit=3, include=["documents", "metadatas"]))


100%|██████████| 4/4 [01:30<00:00, 22.63s/it]

✅ Ingestion complete.
🔍 Sample: {'ids': ['doc_0', 'doc_1', 'doc_2'], 'embeddings': None, 'documents': ['Delphi FG0166 Fuel Pump Module\nDelphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7', 'Power Stop Rear Z36 Truck and Tow Brake Kit with Calipers\nThe Power Stop Z36 Truck & Tow Performance brake kit provides the superior stopping power demanded by those who tow boats, haul loads, tackle mo




In [21]:
# Inspect the collection contents
result = collection.get(include=['documents', 'metadatas'], limit=10)

print(f"\nFetched {len(result['documents'])} documents from the collection:\n")

for i, (doc, meta) in enumerate(zip(result["documents"], result["metadatas"])):
    print(f"{i+1}. {doc}")
    print(f"   Category: {meta.get('category')}, Price: {meta.get('price')}")



Fetched 10 documents from the collection:

1. Delphi FG0166 Fuel Pump Module
Delphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7
   Category: Automotive, Price: 226.95
2. Power Stop Rear Z36 Truck and Tow Brake Kit with Calipers
The Power Stop Z36 Truck & Tow Performance brake kit provides the superior stopping power demanded by those who tow boats, haul loads, tackle mountains, lift trucks

# Let's visualize the vectorized data

In [22]:
# It is very fun turning this up to 400_000 and seeing the full dataset visualized,
# but it almost crashes my box every time so do that at your own risk!! 5_000 is safe!

MAXIMUM_DATAPOINTS = 5_000

In [23]:
CATEGORIES = ['Appliances', 'Automotive', 'Cell_Phones_and_Accessories', 'Electronics','Musical_Instruments', 'Office_Products', 'Tools_and_Home_Improvement', 'Toys_and_Games']
COLORS = ['cyan', 'blue', 'brown', 'orange', 'yellow', 'green' , 'purple', 'red']

In [24]:
# Prework
result = collection.get(include=['embeddings', 'documents', 'metadatas'], limit=MAXIMUM_DATAPOINTS)
vectors = np.array(result['embeddings'])
documents = result['documents']
categories = [metadata['category'] for metadata in result['metadatas']]
colors = [COLORS[CATEGORIES.index(c)] for c in categories]

In [25]:
result

{'ids': ['doc_0',
  'doc_1',
  'doc_2',
  'doc_3',
  'doc_4',
  'doc_5',
  'doc_6',
  'doc_7',
  'doc_8',
  'doc_9',
  'doc_10',
  'doc_11',
  'doc_12',
  'doc_13',
  'doc_14',
  'doc_15',
  'doc_16',
  'doc_17',
  'doc_18',
  'doc_19',
  'doc_20',
  'doc_21',
  'doc_22',
  'doc_23',
  'doc_24',
  'doc_25',
  'doc_26',
  'doc_27',
  'doc_28',
  'doc_29',
  'doc_30',
  'doc_31',
  'doc_32',
  'doc_33',
  'doc_34',
  'doc_35',
  'doc_36',
  'doc_37',
  'doc_38',
  'doc_39',
  'doc_40',
  'doc_41',
  'doc_42',
  'doc_43',
  'doc_44',
  'doc_45',
  'doc_46',
  'doc_47',
  'doc_48',
  'doc_49',
  'doc_50',
  'doc_51',
  'doc_52',
  'doc_53',
  'doc_54',
  'doc_55',
  'doc_56',
  'doc_57',
  'doc_58',
  'doc_59',
  'doc_60',
  'doc_61',
  'doc_62',
  'doc_63',
  'doc_64',
  'doc_65',
  'doc_66',
  'doc_67',
  'doc_68',
  'doc_69',
  'doc_70',
  'doc_71',
  'doc_72',
  'doc_73',
  'doc_74',
  'doc_75',
  'doc_76',
  'doc_77',
  'doc_78',
  'doc_79',
  'doc_80',
  'doc_81',
  'doc_82',
  'doc_

In [26]:
# Let's try a 2D chart
# TSNE stands for t-distributed Stochastic Neighbor Embedding - it's a common technique for reducing dimensionality of data

tsne = TSNE(n_components=2, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)

In [27]:
# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=4, color=colors, opacity=0.7),
    text=[f"Category: {c}<br>Text: {d[:50]}..." for c, d in zip(categories, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vectorstore Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y'),
    width=1200,
    height=800,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [28]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)

In [29]:
# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=2, color=colors, opacity=0.7),
    text=[f"Category: {c}<br>Text: {d[:50]}..." for c, d in zip(categories, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=1200,
    height=800,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [30]:
# Load in the test pickle file

with open('../test.pkl', 'rb') as file:
    test = pickle.load(file)

In [31]:
# We need to give some context to GPT-4o-mini by selecting 5 products with similar descriptions

def make_context(similars, prices):
    message = "To provide some context, here are some other items that might be similar to the item you need to estimate.\n\n"
    for similar, price in zip(similars, prices):
        message += f"Potentially related product:\n{similar}\nPrice is ${price:.2f}\n\n"
    return message

In [32]:
def messages_for(item, similars, prices):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = make_context(similars, prices)
    user_prompt += "And now the question for you:\n\n"
    user_prompt += item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [33]:
!ollama pull llama3.2

'ollama' n'est pas reconnu en tant que commande interne
ou externe, un programme ex�cutable ou un fichier de commandes.


In [34]:
def preprocess(item):
    system_message = "You rewrite product descriptions in a format most suitable for finding similar products in a Knowledge Base"
    user_message = "Please write a short 2-3 sentence description of the following product; your description will be used to find similar products so it should be comprehensive and only about the product. Details:\n"
    user_message += item
    user_message += "\n\nNow please reply only with the short description, with no introduction"
    messages = [{"role": "system", "content": system_message}, {"role": "user", "content": user_message}]
    response = ollama_via_openai.chat.completions.create(
        model="llama3.2",
        messages=messages,
        seed=42
    )
    return response.choices[0].message.content

In [35]:
def vector(item):
    text = preprocess(item.text)
    return model.encode(text)

In [36]:
def find_similars(item):
    vec = vector(item)
    results = collection.query(query_embeddings=vec.astype(float).tolist(), n_results=5)
    documents = results['documents'][0][:]
    prices = [m['price'] for m in results['metadatas'][0][:]]
    return documents, prices

In [37]:
print(test[1].text)

Motorcraft YB3125 Fan Clutch
Motorcraft YB3125 Fan Clutch Package Dimensions 25.146 cms (L) x 20.066 cms (W) x 15.494 cms (H) Package Quantity 1 Product Type Auto Part Country Of Origin China Manufacturer Motorcraft, Brand Motorcraft, Model Fan Clutch, Weight 5 pounds, Dimensions 10 x 7.63 x 6.25 inches, Country of Origin China, model number Exterior Painted, Manufacturer Part Rank Automotive Automotive Replacement Engine Fan Clutches 583, Domestic Shipping can be shipped within U.S., International Shipping This item can be shipped to select countries outside of the U.S. Learn More, Available October 10, 2007


In [38]:
print(preprocess(test[1].text))

NameError: name 'ollama_via_openai' is not defined

In [None]:
documents, prices = find_similars(test[1])

In [None]:
print(make_context(documents, prices))

In [None]:
# Utility function that extracts a price from a response from GPT-4o-mini

def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
get_price("blah blah the price is $99.99 blah")

In [None]:
# The function for gpt-4o-mini

def gpt_4o_mini_rag(item):
    documents, prices = find_similars(item)
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(item, documents, prices),
        seed=42,
        max_tokens=8
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# How much does the Fan Clutch in test[1] actually cost, on Amazon?

test[1].price

In [None]:
# Now let's call GPT-4o-mini using RAG, passing in 5 similar items from our Chroma datastore

gpt_4o_mini_rag(test[1])

In [None]:
# Try DeepSeek-V3 

def deepseek_api_rag(item):
    documents, prices = find_similars(item)
    response = deepseek_via_openai_client.chat.completions.create(
        model="deepseek-chat", 
        messages=messages_for(item, documents, prices),
        seed=42,
        max_tokens=8
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
deepseek_api_rag(test[1])

In [None]:
root = logging.getLogger()
root.setLevel(logging.INFO)

In [None]:
from price_agents.frontier_agent import FrontierAgent

agent = FrontierAgent(collection)
agent.price("Quadcast HyperX condenser mic, connects via usb-c to your computer for crystal clear audio")

In [None]:
agent.price("Shure MV7+ professional podcaster microphone with usb-c and XLR outputs")