# If you're interested - this is a variation of Lab2 with document pre-processing

## RAG (Retrieval Augmented Generation)

For our 2nd agent, we will be asking GPT-4o-mini to estimate the price of one of our deals.

It turns out that LLMs are really good at this! Out of the box, GPT-4o achieves an average error of $76, much better than our Neural Network and traditional solutions.

But we can do even better: we'll provide it with some context, in the form of 5 similar products from our training dataset.

Again, I'll be going quite quickly through this - the idea is for you to run this yourself.

In [1]:
# imports

import os
import re
import math
import json
from tqdm import tqdm
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from items import Item
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from testing import Tester
from openai import OpenAI
import logging

In [3]:
# environment

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')
DB = "preprocessed_vectorstore"

In [4]:
# Log in to HuggingFace
# If you don't have a HuggingFace account, you can set one up for free at www.huggingface.co
# And then add the HF_TOKEN to your .env file as explained in the project README

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [5]:
# Load the training data

with open('../train.pkl', 'rb') as file:
    train = pickle.load(file)

train[0]

<Delphi FG0166 Fuel Pump Module = $226.95>

# Now create a Chroma Datastore

Now we will use the free, open-source Vector database Chroma.  
We will create a Chroma datastore with 400,000 products from our training dataset.

In [6]:
client = chromadb.PersistentClient(path=DB)

# Introducing the SentenceTransformer Encoding LLM

The all-MiniLM is a very useful model from HuggingFace that maps sentences & paragraphs to 384-dimensional vectors and is ideal for tasks like semantic search.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It can run pretty quickly locally.

As an alternative, OpenAI provides a closed-source embeddings model. Benefits of all-MiniLM compared to OpenAI embeddings:
1. It's free and fast!
2. We can run it locally, so the data never leaves our box - might be useful if you're building a personal RAG

In [7]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
# Pass in a list of texts, get back a numpy array of vectors

vector = model.encode(["A room full of software engineers"])[0]
print(vector.shape)
vector

In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vector1 = model.encode(["A room full of software engineers"])[0]
vector2 = model.encode(["A room full of AI Data Scientists"])[0]
cosine_similarity(vector1, vector2)

In [None]:
vector1 = model.encode(["A room full of software engineers"])[0]
vector3 = model.encode(["Curried eggplant"])[0]
cosine_similarity(vector1, vector3)

In [None]:
# The sentence equivalent of the famous King - Man + Woman = Queen

room_engineers = model.encode(["A room full of software engineers"])[0]
room_scientists = model.encode(["A room full of AI Data Scientists"])[0]
engineers = model.encode(["software engineers"])[0]
scientists = model.encode(["AI Data Scientists"])[0]

In [None]:
cosine_similarity(room_engineers, room_scientists)

In [None]:
new_room = room_engineers - engineers + scientists

In [None]:
cosine_similarity(new_room, room_scientists)
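
To make the arithmetic in the cells above concrete without loading a model, here is a minimal sketch using hand-made 3-dimensional vectors. The vectors and the `toy_` names are invented for illustration; real all-MiniLM embeddings have 384 dimensions and will not behave this cleanly.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": axis 0 = room, axis 1 = engineer, axis 2 = scientist
toy_room_engineers = [1.0, 1.0, 0.0]
toy_room_scientists = [1.0, 0.0, 1.0]
toy_engineers = [0.0, 1.0, 0.0]
toy_scientists = [0.0, 0.0, 1.0]

# Same arithmetic as new_room = room_engineers - engineers + scientists
toy_new_room = [
    r - e + s
    for r, e, s in zip(toy_room_engineers, toy_engineers, toy_scientists)
]

before = cosine(toy_room_engineers, toy_room_scientists)
after = cosine(toy_new_room, toy_room_scientists)
print(f"before: {before:.2f}, after: {after:.2f}")
```

In this toy setup the similarity rises from 0.50 to 1.00 once the "engineer" direction is swapped for the "scientist" direction. With real embeddings you should expect a smaller improvement.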

## With that background, let's populate our Chroma database

### By calculating vectors for 400,000 scraped products

In [16]:
from groq import Groq
groq = Groq()
preprocess_model = "llama-3.1-8b-instant"

In [51]:
def preprocess(item):
    user_message = "Write a 2-3 sentence summary of this product; include all facts that affect its price:\n"
    user_message += item
    user_message += "\n\nReply only with the summary, no introduction"
    messages = [{"role": "user", "content": user_message}]
    response = groq.chat.completions.create(
        model=preprocess_model,
        messages=messages,
        seed=42,
        max_tokens=100
    )
    return response.choices[0].message.content

In [52]:
train[1].text

'Power Stop Rear Z36 Truck and Tow Brake Kit with Calipers\nThe Power Stop Z36 Truck & Tow Performance brake kit provides the superior stopping power demanded by those who tow boats, haul loads, tackle mountains, lift trucks, and play in the harshest conditions. The brake rotors are drilled to keep temperatures down during extreme braking and slotted to sweep away any debris for constant pad contact. Combined with our Z36 Carbon-Fiber Ceramic performance friction formulation, you can confidently push your rig to the limit and look good doing it with red powder brake calipers. Components are engineered to handle the stress of towing, hauling, mountainous driving, and lifted trucks. Dust-free braking performance. Z36 Carbon-Fiber Ceramic formula provides the extreme braking performance demanded by your truck or 4x'

In [53]:
preprocess(train[1].text)

'The Power Stop Rear Z36 Truck and Tow Brake Kit with Calipers provides superior stopping power for heavy-duty trucks and 4x4 vehicles, featuring drilled and slotted brake rotors for cooling and debris removal, combined with a Z36 Carbon-Fiber Ceramic performance formula and red powder brake calipers. This kit is designed to handle the stress of towing, hauling, mountainous driving, and lifted trucks. The components are built for extreme braking performance, resulting in dust-free braking.'

In [54]:
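# Check if the collection exists; if not (or if reset is True), rebuild it

from concurrent.futures import ThreadPoolExecutor

collection_name = "products"
# Depending on the Chroma version, list_collections() returns names or Collection objects
existing_collection_names = [getattr(c, "name", c) for c in client.list_collections()]
reset = True  # set to False to reuse an existing collection

if reset or (collection_name not in existing_collection_names):
    # Only delete if the collection is actually there; deleting a missing one raises an error
    if collection_name in existing_collection_names:
        client.delete_collection(collection_name)
    collection = client.create_collection(collection_name)
    # Summarize each batch in parallel, then embed the summaries
    with ThreadPoolExecutor(max_workers=6) as executor:
        for i in tqdm(range(0, len(train), 100)):
            batch = train[i: i+100]
            documents = [item.text for item in batch]
            preprocessed = list(executor.map(preprocess, documents))
            vectors = model.encode(preprocessed).astype(float).tolist()
            metadatas = [{"category": item.category, "price": item.price} for item in batch]
            # Use len(batch) so the ids still line up if the final batch is short
            ids = [f"doc_{j}" for j in range(i, i+len(batch))]
            collection.add(
                ids=ids,
                documents=documents,
                embeddings=vectors,
                metadatas=metadatas
            )
collection = client.get_or_create_collection(collection_name)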

  0%|▋         | 18/4000 [02:19<8:34:18,  7.75s/it]


KeyboardInterrupt: 
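
The loop above walks the training list in slices of 100, and the progress bar estimates more than eight hours for all 4,000 slices. A minimal sketch of just the slicing step, using a hypothetical helper `batched_with_ids` (not part of the notebook) and a made-up list of 250 items, shows one way to keep the ids the same length as each batch even when the final slice is shorter.

```python
def batched_with_ids(items, size, prefix="doc_"):
    """Yield (ids, batch) pairs; ids always match the batch length."""
    for start in range(0, len(items), size):
        batch = items[start:start + size]
        ids = [f"{prefix}{j}" for j in range(start, start + len(batch))]
        yield ids, batch

# Hypothetical stand-in for the training list: 250 items, so the last batch is short
toy_items = [f"item {n}" for n in range(250)]
toy_batches = list(batched_with_ids(toy_items, 100))

for ids, batch in toy_batches:
    # First id, last id, and batch size for each slice
    print(ids[0], ids[-1], len(batch))
```

Chroma expects the ids, documents, embeddings and metadatas passed to a single `add` call to have the same length, which is why the ids are derived from the batch rather than from the fixed batch size.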

# Let's visualize the vectorized data

In [None]:
# It is very fun turning this up to 400_000 and seeing the full dataset visualized,
# but it almost crashes my box every time so do that at your own risk!! 10_000 is safe!

MAXIMUM_DATAPOINTS = 5_000

In [None]:
CATEGORIES = ['Appliances', 'Automotive', 'Cell_Phones_and_Accessories', 'Electronics','Musical_Instruments', 'Office_Products', 'Tools_and_Home_Improvement', 'Toys_and_Games']
COLORS = ['cyan', 'blue', 'brown', 'orange', 'yellow', 'green' , 'purple', 'red']

In [None]:
# Prework
result = collection.get(include=['embeddings', 'documents', 'metadatas'], limit=MAXIMUM_DATAPOINTS)
vectors = np.array(result['embeddings'])
documents = result['documents']
categories = [metadata['category'] for metadata in result['metadatas']]
colors = [COLORS[CATEGORIES.index(c)] for c in categories]

In [None]:
# Let's try a 2D chart
# TSNE stands for t-distributed Stochastic Neighbor Embedding - it's a common technique for reducing dimensionality of data

tsne = TSNE(n_components=2, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)

In [None]:
# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=3, color=colors, opacity=0.7),
    text=[f"Category: {c}<br>Text: {d[:50]}..." for c, d in zip(categories, documents)],
    hoverinfo='text'
)])

# Note: `scene` only applies to 3D plots, so for 2D we set the axis titles directly
fig.update_layout(
    title='2D Chroma Vectorstore Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=1200,
    height=800,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)

In [None]:
# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=3, color=colors, opacity=0.7),
    text=[f"Category: {c}<br>Text: {d[:50]}..." for c, d in zip(categories, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=1200,
    height=800,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
# And now - set up OpenAI, Ollama and DeepSeek

# OpenAI
openai = OpenAI()

# Ollama
ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# DeepSeek
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")
deepseek_via_openai_client = OpenAI(api_key=deepseek_api_key,base_url="https://api.deepseek.com")

In [None]:
# Load in the test pickle file

with open('../test.pkl', 'rb') as file:
    test = pickle.load(file)

In [None]:
# We need to give some context to GPT-4o-mini by selecting 5 products with similar descriptions

def make_context(similars, prices):
    message = "To provide some context, here are some other items that might be similar to the item you need to estimate.\n\n"
    for similar, price in zip(similars, prices):
        message += f"Potentially related product:\n{similar}\nPrice is ${price:.2f}\n\n"
    return message

In [None]:
def messages_for(item, similars, prices):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = make_context(similars, prices)
    user_prompt += "And now the question for you:\n\n"
    user_prompt += item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [None]:
!ollama pull llama3.2
!ollama pull deepseek-r1

In [None]:
def preprocess(item):
    system_message = "You rewrite product descriptions in a format most suitable for finding similar products in a Knowledge Base"
    user_message = "Please write a short 2-3 sentence description of the following product; your description will be used to find similar products so it should be comprehensive and only about the product. Details:\n"
    user_message += item
    user_message += "\n\nNow please reply only with the short description, with no introduction"
    messages = [{"role": "system", "content": system_message}, {"role": "user", "content": user_message}]
    response = ollama_via_openai.chat.completions.create(
        model="llama3.2",
        messages=messages,
        seed=42
    )
    return response.choices[0].message.content

In [None]:
preprocess("A Shure MV7+ professional condenser mic for podcasting with exceptional audio quality")

In [None]:
def vector(item):
    text = preprocess(item.text)
    return model.encode(text)

In [None]:
def find_similars(item):
    vec = vector(item)
    results = collection.query(query_embeddings=vec.astype(float).tolist(), n_results=5)
    documents = results['documents'][0][:]
    prices = [m['price'] for m in results['metadatas'][0][:]]
    return documents, prices

In [None]:
print(test[1].text)

In [None]:
print(preprocess(test[1].text))

In [None]:
documents, prices = find_similars(test[1])

In [None]:
print(make_context(documents, prices))

In [None]:
# Utility function that extracts a price from a response from GPT-4o-mini

def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
get_price("blah blah the price is $99.99 so cheap")
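
The extraction logic is easiest to see on a few inputs. This sketch repeats the notebook's `get_price` so that it runs on its own, and feeds it some made-up replies (the sample strings are invented for illustration).

```python
import re

def get_price(s):
    # Same logic as the notebook's get_price: strip "$" and ",", take the first number
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

sample_replies = [
    "blah blah the price is $99.99 so cheap",
    "$1,299",
    "I cannot estimate this",
    "2 units at $30 each",
]

for text in sample_replies:
    print(f"{text!r} -> {get_price(text)}")
```

Note that the function returns the first number it finds, so the last sample yields 2.0 rather than 30.0. That is one reason the prompts ask the model to reply only with the price.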

In [None]:
# The function for gpt-4o-mini

def gpt_4o_mini_rag(item):
    
    # RAG - lookup similar items from our KnowledgeBase
    documents, prices = find_similars(item)

    # RAG - enrich the prompt to include the similar items
    messages = messages_for(item, documents, prices)
    
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages,
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# What's the actual price of this per Amazon?

test[1].price

In [None]:
# OK, time for gpt-4o-mini plus RAG to try:

gpt_4o_mini_rag(test[1])

# Were you following that?

Let's do it again with some print statements.

This is a "DIY" version of RAG; we're not using an abstraction layer like LangChain to build the prompt, we're simply doing it ourselves.
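
To make the "DIY" idea concrete, here is a minimal, self-contained sketch of the same two steps (retrieve similar items, then put them into the prompt) with no vector database and no LLM call. The three products, their prices, their 2-dimensional vectors, and the helper names `toy_find_similars` and `toy_prompt` are all invented for illustration; the notebook itself uses Chroma and all-MiniLM for retrieval.

```python
import math

# Hypothetical knowledge base: (description, price, toy 2-d embedding)
TOY_KB = [
    ("USB condenser microphone", 89.99, [0.9, 0.1]),
    ("XLR dynamic microphone", 129.00, [0.8, 0.3]),
    ("Truck brake kit", 226.95, [0.1, 0.9]),
]

def toy_cosine(a, b):
    """Cosine similarity for two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def toy_find_similars(query_vec, k=2):
    """Return the k descriptions and prices closest to query_vec."""
    ranked = sorted(TOY_KB, key=lambda row: toy_cosine(query_vec, row[2]), reverse=True)
    return [row[0] for row in ranked[:k]], [row[1] for row in ranked[:k]]

def toy_prompt(description, similars, prices):
    """Build the enriched prompt by hand: context first, then the question."""
    lines = ["Here are some items that might be similar:\n"]
    for similar, price in zip(similars, prices):
        lines.append(f"Potentially related product:\n{similar}\nPrice is ${price:.2f}\n")
    lines.append(f"How much does this cost?\n\n{description}")
    return "\n".join(lines)

# A query vector pointing in the "microphone" direction of the toy space
toy_docs, toy_prices = toy_find_similars([1.0, 0.0])
print(toy_prompt("Podcast microphone with USB-C", toy_docs, toy_prices))
```

The two microphones are retrieved and the brake kit is left out, so the prompt carries only the relevant price references.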

In [None]:
# The function for gpt-4o-mini, now with print statements

def gpt_4o_mini_rag_explainer(item):
    documents, prices = find_similars(item)
    print(f"Asking GPT-4o-mini to estimate the price of {item.title}")
    print(f"Given similar prices of these items:")
    for document, price in zip(documents, prices):
        similar = document.split("\n")[0]
        print(f"Similar item: {similar} costs ${price:.2f}")
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(item, documents, prices),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    print(f"\n\nGPT-4o-mini responded: {reply}")
    price = get_price(reply)
    print(f"Extracted price is {price:.2f}")
    return price

In [None]:
gpt_4o_mini_rag_explainer(test[1])

In [None]:
# Try the 7B variant of DeepSeek-R1 (the distilled version based on Qwen)
# A bit unpredictable and can take a long time!
# I needed to make the prompt a little more explicit, otherwise it sometimes refused to answer.

def deepseek_local_rag(item):
    documents, prices = find_similars(item)
    messages = messages_for(item, documents, prices)
    messages[1]["content"] += "\nYou only need to guess the price, using the similar items to give you some reference point. Reply only with the price. Only think briefly; avoid overthinking."
    response = ollama_via_openai.chat.completions.create(
        model="deepseek-r1", 
        messages=messages,
        seed=42,
    )
    reply = response.choices[0].message.content
    print(reply)
    if "</think>" in reply:
        reply = reply.split("</think>")[1]
    return get_price(reply)

In [None]:
deepseek_local_rag(test[1])
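
The `</think>` handling in the function above matters because the reply may include the model's reasoning before the answer, and that reasoning can mention other prices. A minimal sketch of the idea, using a hypothetical helper `strip_thinking` and an invented reply string, is below. It takes the text after the last closing tag, a small variation on the notebook's `split("</think>")[1]`.

```python
def strip_thinking(reply):
    """Return the text after the last </think> tag, or the whole reply if there is none."""
    return reply.split("</think>")[-1].strip()

# Invented example of a reasoning-style reply
toy_reply = "<think>Similar mics cost around $90 to $130.</think>\n\nPrice is $110"

print(strip_thinking(toy_reply))
print(strip_thinking("Price is $75"))
```

Without this step, a first-number extractor would pick up the 90 from the reasoning instead of the final 110.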

In [None]:
# Try the full DeepSeek-V3 

def deepseek_api_rag(item):
    documents, prices = find_similars(item)
    response = deepseek_via_openai_client.chat.completions.create(
        model="deepseek-chat", 
        messages=messages_for(item, documents, prices),
        seed=42,
        max_tokens=8
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
deepseek_api_rag(test[1])

In [None]:
# Try the distilled version of DeepSeek-R1 with 70B parameters via Groq

from groq import Groq
model_name = "deepseek-r1-distill-llama-70b"
groq = Groq()

def deepseek_distilled_rag(item):
    documents, prices = find_similars(item)
    messages = messages_for(item, documents, prices)
    messages[1]["content"] += "\nYou only need to guess the price, using the similar items to give you some reference point. Reply only with the price. Only think briefly; avoid overthinking."
    response = groq.chat.completions.create(
        model=model_name, 
        messages=messages,
        seed=42,
        max_tokens=8
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
deepseek_distilled_rag(test[1])

## We will kick off the next line then take a 5 minute break

## When we come back: unveiling a proprietary fine-tuned LLM

In [None]:
Tester.test(deepseek_distilled_rag, test)

In [None]:
root = logging.getLogger()
root.setLevel(logging.INFO)

In [None]:
from agents.frontier_agent import FrontierAgent

frontier = FrontierAgent(collection)
frontier.price("Quadcast HyperX condenser mic, connects via usb-c to your computer for crystal clear audio")