1. This Notebook: Create a RAG database with our 400,000 training data points
2. Day 2.1 Notebook: Visualization in 2D
3. Day 2.2 Notebook: Visualization in 3D
4. Day 2.3 Notebook: Build and Test a RAG pipeline. 
5. Day 2.4 Notebook: (a) Bring back our Random Forest Pricer, (b) Create an Ensemble Pricer that allows contributions from all the pricers

In [2]:
import os
import re
import math
import json
from tqdm import tqdm
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from items import Item
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [9]:
DB = "products_vectorstore"

In [3]:
load_dotenv(override=True)

hf_token = os.getenv("HF_TOKEN")
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [5]:
# Loading the pickle files

with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

### Now creating a Chroma Datastore  
In week 5, we created a Chroma datastore with 123 documents representing chunks of objects from our fictional company Insurellm.  
  

Now we will create a Chroma datastore with 400,000 products from our training dataset! It's getting real!!

Note that we won't be using LangChain, but the API is very straightforward and consistent with before.

In [10]:
client = chromadb.PersistentClient(path=DB)

In [11]:
# Check if the collection exists, and delete it if it does

collection_name = "products"
existing_collection_names = [collection.name for collection in client.list_collections()]
if collection_name in existing_collection_names:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")

collection = client.create_collection(collection_name)


### **Introducing the SentenceTransformer**  
all-MiniLM-L6-v2 is a very useful model from HuggingFace that maps sentences and paragraphs to a 384-dimensional dense vector space and is ideal for tasks like semantic search.  
  
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It can run pretty quickly locally.  
Last time we used OpenAI's Embeddings Model to produce vector embeddings. Benefits of this model compared to OpenAI embeddings:
* It's free and fast.  
* We can run it locally, so the data never leaves our box - might be useful while building a personal RAG.

In [12]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [14]:
# Pass in a list of texts, get back a numpy array with one vector per text

vector = model.encode(["Well hi there"])[0]

In [None]:
# a 384 dimensional vector

len(vector)

384
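
Semantic search works by comparing vectors like this one: texts with similar meaning should land close together in the 384-dimensional space. A common way to measure closeness is cosine similarity. The sketch below illustrates the idea with hand-made 3-dimensional toy vectors (assumptions for illustration, not real model output), and it is not necessarily the distance metric Chroma uses for this collection.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings (hypothetical values, not produced by the model)
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]    # points in nearly the same direction as the query
far = [-1.0, 1.0, 0.0]     # points in a very different direction

sim_close = cosine_similarity(query, close)
sim_far = cosine_similarity(query, far)
print(sim_close > sim_far)
```

With real embeddings, the same comparison would be made between the vector for a query and the vectors stored for each product.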

#### **Making embeddings**

In [17]:
def description(item): 
    text = item.prompt.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]

In [20]:
print(description(train[0]))

Delphi FG0166 Fuel Pump Module
Delphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7
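
To make the stripping explicit, here is a minimal sketch of `description` applied to a made-up item. The `SimpleNamespace` object is a hypothetical stand-in for `Item` that only carries a `prompt` attribute, and the product text and price are invented; the prompt layout (question, description, then the price line) is assumed from the strings the function removes.

```python
from types import SimpleNamespace

QUESTION = "How much does this cost to the nearest dollar?\n\n"
PRICE_PREFIX = "\n\nPrice is $"

def description(item):
    # Drop the leading question, then keep everything before the price line
    text = item.prompt.replace(QUESTION, "")
    return text.split(PRICE_PREFIX)[0]

# Hypothetical stand-in for an Item; only the prompt attribute is needed here
fake_item = SimpleNamespace(
    prompt=QUESTION + "Acme Widget\nA sturdy example widget" + PRICE_PREFIX + "25.00"
)

result = description(fake_item)
print(result)
```

Only the product text is embedded, so the price never leaks into the vectors.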


#### **Populating the Vectorstore**


Processing in batches of 1000 items so we don't crash the system by running out of memory!

In [21]:
for i in tqdm(range(0, len(train), 1000)):
    batch = train[i: i + 1000]
    documents = [description(item) for item in batch]
    vectors = model.encode(documents).astype(float).tolist()
    metadatas = [{"category": item.category, "price": item.price} for item in batch]
    # Size the ids to the actual batch so a short final batch stays aligned
    ids = [f"doc_{j}" for j in range(i, i + len(batch))]
    collection.add(
        ids=ids, 
        documents=documents, 
        embeddings=vectors, 
        metadatas=metadatas
    )

100%|██████████| 400/400 [22:31<00:00,  3.38s/it]
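
The batching pattern itself can be checked on toy data without the model or the datastore. The sketch below uses a hypothetical helper, `batched`, and a made-up list of 2,500 strings to show that slicing handles a short final batch, and that the ids need to be sized to the actual batch length to stay aligned with the documents.

```python
def batched(items, batch_size):
    """Yield (start_index, batch) pairs; the final batch may be shorter."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

# Toy data: 2,500 items, so the last batch holds only 500
toy = [f"product {n}" for n in range(2500)]

sizes = []
all_ids = []
for start, batch in batched(toy, 1000):
    # One id per item actually in this batch
    ids = [f"doc_{j}" for j in range(start, start + len(batch))]
    sizes.append(len(batch))
    all_ids.extend(ids)

print(sizes)
print(all_ids[-1])
```

With exactly 400,000 training items every batch is full, so the distinction only matters for datasets whose size is not a multiple of the batch size.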
