# The Price is Right
A more complex solution for estimating prices of goods.

1. This notebook: create a RAG database from our 400,000 training data points
2. Pr11.ThePriceIsRight-Rag2.1 notebook: visualize in 2D
3. Pr11.ThePriceIsRight-Rag2.2 notebook: visualize in 3D
4. Pr11.ThePriceIsRight-Rag2.3 notebook: build and test a RAG pipeline with GPT-4o-mini
5. Pr11.ThePriceIsRight-Rag2.4 notebook: (a) bring back our Random Forest pricer (b) create an Ensemble pricer that allows contributions from all the pricers

Phew! That's a lot to get through in one day!


In [1]:
import os
import re
import math
import json
from tqdm import tqdm
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from items import Item
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [2]:
load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')
DB = "products_vectorstore"

In [3]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [4]:
with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

In [5]:
train[0].prompt

'How much does this cost to the nearest dollar?\n\nTMEE Handrails for Outdoor Steps Fits 1 to 3 Steps Outdoor Stair Railing，Staircase Handrail Fits with Installation Kit Transitional Handrail for Concrete Steps or Wooden Stairs\nFits 1 OR 3 Steps handrail Design --This is a suitable 2 or 3 steps outdoor stair railing, and also suitable for level surface, all kind of different stairs. Handrail length cm, support post height cm,Middle rail length DIY Stair Railing Our Hand Rails for Steps can be multi-angle adjustment maximum adjustable angle of 65 ° to suit your specific step height, not only can be used on 2 to 3 steps, but also on flat ground, and there is no need to make any modification to the railing, just adjust the middle railing so that\n\nPrice is $127.00'

# Now create a Chroma Datastore


In [6]:
client = chromadb.PersistentClient(path=DB)

In [7]:
collection_name = "products"
existing_collection_names = [collection.name for collection in client.list_collections()]
if collection_name in existing_collection_names:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")

collection = client.create_collection(collection_name)

Deleted existing collection: products


# Introducing the SentenceTransformer

all-MiniLM-L6-v2 is a very useful model from HuggingFace that maps sentences and paragraphs to a 384-dimensional dense vector space, which makes it well suited to tasks like semantic search.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It can run pretty quickly locally.

Last time we used OpenAI embeddings to produce vector embeddings. Benefits of this model compared to OpenAI embeddings:
1. It's free and fast!
2. We can run it locally, so the data never leaves our box - might be useful if you're building a personal RAG
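To make "semantic search" a little more concrete: similar texts end up with vectors that point in similar directions, and one common way to measure that is cosine similarity. The sketch below uses made-up 4-dimensional vectors as stand-ins for real 384-dimensional embeddings; the numbers and the helper name `cosine_similarity` are illustrative assumptions, not output from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones would have 384 values each
toy_vectors = {
    "outdoor stair handrail": [0.9, 0.1, 0.0, 0.2],
    "staircase railing kit": [0.8, 0.2, 0.1, 0.3],
    "wireless gaming mouse": [0.0, 0.9, 0.7, 0.1],
}

query_text = "outdoor stair handrail"
query = toy_vectors[query_text]

# Rank every other text by similarity to the query, most similar first
ranked = sorted(
    (
        (cosine_similarity(query, vec), text)
        for text, vec in toy_vectors.items()
        if text != query_text
    ),
    reverse=True,
)
best_match = ranked[0][1]
print(best_match)
```

With these toy numbers the railing kit ranks far above the gaming mouse, which is the behaviour we hope the real embeddings will give us for similar products.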

In [8]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [9]:
vector = model.encode(["Well hi there"])[0]

In [10]:
vector

array([-9.46715847e-02,  4.27620001e-02,  5.51620275e-02, -5.11053891e-04,
        1.16203222e-02, -6.80131093e-02,  2.76405960e-02,  6.06974810e-02,
        2.88531054e-02, -1.74128413e-02, -4.94346879e-02,  2.30992828e-02,
       -1.28614111e-02, -4.31402363e-02,  2.17509605e-02,  4.26549278e-02,
        5.10500111e-02, -7.79727548e-02, -1.23247199e-01,  3.67455184e-02,
        4.54121036e-03,  9.47937369e-02, -5.53099252e-02,  1.70641318e-02,
       -2.92871799e-02, -4.47124690e-02,  2.06784252e-02,  6.39320314e-02,
        2.27427818e-02,  4.87790182e-02, -2.33499426e-03,  4.72859628e-02,
       -2.86258813e-02,  2.30625030e-02,  2.45130733e-02,  3.95681560e-02,
       -4.33176570e-02, -1.02316678e-01,  2.79876892e-03,  2.39304379e-02,
        1.61556378e-02, -8.99082236e-03,  2.07255017e-02,  6.40122965e-02,
        6.89179003e-02, -6.98361546e-02,  2.89762625e-03, -8.10989290e-02,
        1.71123147e-02,  2.50651571e-03, -1.06529072e-01, -4.87733446e-02,
       -1.67762097e-02, -

In [11]:
def description(item):
    text = item.prompt.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]

In [12]:
description(train[0])

'TMEE Handrails for Outdoor Steps Fits 1 to 3 Steps Outdoor Stair Railing，Staircase Handrail Fits with Installation Kit Transitional Handrail for Concrete Steps or Wooden Stairs\nFits 1 OR 3 Steps handrail Design --This is a suitable 2 or 3 steps outdoor stair railing, and also suitable for level surface, all kind of different stairs. Handrail length cm, support post height cm,Middle rail length DIY Stair Railing Our Hand Rails for Steps can be multi-angle adjustment maximum adjustable angle of 65 ° to suit your specific step height, not only can be used on 2 to 3 steps, but also on flat ground, and there is no need to make any modification to the railing, just adjust the middle railing so that'
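Stripping the question and the price matters because the text we embed should describe the product only; otherwise the answer would leak into the stored document. Here is a small self-contained check of the same logic, using a hypothetical stand-in object (`fake_item`) that has only a `.prompt` attribute rather than a real `Item`:

```python
from types import SimpleNamespace

QUESTION = "How much does this cost to the nearest dollar?\n\n"
PRICE_PREFIX = "\n\nPrice is $"

def description(item):
    # Drop the leading question, then keep everything before the price line
    text = item.prompt.replace(QUESTION, "")
    return text.split(PRICE_PREFIX)[0]

# Hypothetical stand-in for an Item; the product text here is made up
fake_item = SimpleNamespace(
    prompt=QUESTION + "Acme Widget\nA sturdy widget for everyday use" + PRICE_PREFIX + "19.00"
)

cleaned = description(fake_item)
print(cleaned)
```

The printed text contains the title and details but neither the question nor the price.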

In [13]:
BATCH_SIZE = 1000
for i in tqdm(range(0, len(train), BATCH_SIZE)):
    batch = train[i: i+BATCH_SIZE]
    documents = [description(item) for item in batch]
    vectors = model.encode(documents).astype(float).tolist()
    metadatas = [{"category": item.category, "price": item.price} for item in batch]
    # Size ids from the actual batch so a short final batch stays aligned
    ids = [f"doc_{j}" for j in range(i, i+len(batch))]
    collection.add(
        ids=ids,
        documents=documents,
        embeddings=vectors,
        metadatas=metadatas
    )

100%|████████████████████████████████████████████████████████████████████████████| 400/400 [42:13:26<00:00, 380.02s/it]
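With 400,000 items and batches of 1,000, every batch is full. If the dataset size were not an exact multiple of the batch size, though, the ids would need to be trimmed to match the final, shorter batch. A minimal sketch of that bookkeeping, using a hypothetical helper `batch_ids` and a made-up total of 2,500 items:

```python
def batch_ids(total, batch_size):
    """Yield (start, ids) per batch, with ids trimmed to the items that exist."""
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        yield start, [f"doc_{j}" for j in range(start, end)]

# Hypothetical dataset of 2,500 items: two full batches and one half batch
batches = list(batch_ids(total=2500, batch_size=1000))
last_start, last_ids = batches[-1]
print(len(batches), last_start, len(last_ids))  # prints: 3 2000 500
```

Keeping the number of ids equal to the number of documents in each batch is what lets the ids, documents, embeddings and metadata line up one-to-one.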
