### **Advanced Price Estimation with RAG and Ensemble Learning**

#### **Objectives**
1. Understand Retrieval-Augmented Generation (RAG) and its application in price estimation.
2. Train and evaluate a Random Forest model using transformer-based embeddings.
3. Implement an ensemble model to combine multiple pricing strategies for improved predictions.
4. Utilize ChromaDB for efficient data storage and retrieval.
5. Compare different models (Specialist, Frontier, Random Forest, and Ensemble) and analyze their performance.


### **1. Importing Required Libraries**

In [None]:
import os
import re
import math
import json
from tqdm import tqdm
import random
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from utils.items import Item
from utils.testing import Tester
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib

### **2. Setting Up Environment Variables**


In [None]:
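load_dotenv(override=True)
# Fall back to a placeholder so the assignment doesn't raise a TypeError when a variable is unset
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')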

In [None]:
# CONSTANTS

QUESTION = "How much does this cost to the nearest dollar?\n\n"
DB = "products_vectorstore"

### **3. Loading Test Data and Connecting to ChromaDB**


In [None]:
# Load in the test pickle file:

with open('test.pkl', 'rb') as file:
    test = pickle.load(file)

In [None]:
# Open the persistent vector database stored on disk, reusing the DB constant defined above

client = chromadb.PersistentClient(path=DB)
collection = client.get_or_create_collection('products')

### **4. Extracting Data from ChromaDB**


In [None]:
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
prices = [metadata['price'] for metadata in result['metadatas']]

**Explanation:**
- Retrieves the **stored embeddings**, the product descriptions, and the corresponding product prices from the collection.
- `vectors` holds the numerical representations (embeddings) of the product descriptions, one row per product.
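
The lists returned by `collection.get` are parallel: entry *i* of the embeddings, documents, and metadatas all describe the same product. The sketch below illustrates that alignment with a small hand-written dictionary standing in for a real Chroma result. The values are invented, and plain lists are used instead of a NumPy array.

```python
# Stand-in for the dictionary returned by collection.get(...).
# The values are invented; a real result holds one entry per stored product.
mock_result = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "documents": ["USB condenser microphone", "Stainless steel kettle"],
    "metadatas": [{"price": 129.99}, {"price": 34.5}],
}

vectors = mock_result["embeddings"]
documents = mock_result["documents"]
prices = [metadata["price"] for metadata in mock_result["metadatas"]]

# The three lists are index-aligned: position i in each describes the same product.
assert len(vectors) == len(documents) == len(prices)

for doc, vec, price in zip(documents, vectors, prices):
    print(f"{doc!r}: {len(vec)}-dim vector, price ${price:.2f}")
```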


### **5. Training a Random Forest Model**
- Let's train a **Random Forest model** that takes the product vectors as input and predicts the price, then save the model for later use.

In [None]:
# This next line takes an hour on my M1 Mac!

rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(vectors, prices)

In [None]:
# Save the model to a file

joblib.dump(rf_model, 'random_forest_model.pkl')
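
Because the full fit is slow, it can help to check shapes and timing on a small subsample first. The following is a minimal sketch under stated assumptions: the data is synthetic (random "embeddings" of width 16 with prices derived from them), and the subsample size and `n_estimators=20` are arbitrary illustrative choices, not settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-ins: 500 "embeddings" of width 16 and prices that depend on them.
all_vectors = rng.normal(size=(500, 16))
all_prices = 100 + 20 * all_vectors[:, 0] + rng.normal(scale=1.0, size=500)

# Fit on a random subsample first to confirm the pipeline runs end to end.
sample_idx = rng.choice(len(all_vectors), size=200, replace=False)
small_model = RandomForestRegressor(n_estimators=20, random_state=42)
small_model.fit(all_vectors[sample_idx], all_prices[sample_idx])

# Predict for a few rows; the output has one price per input vector.
preds = small_model.predict(all_vectors[:5])
print(preds.shape)
```

Once the small run behaves as expected, the same code path can be pointed at the real `vectors` and `prices`.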

### **6. Implementing an Ensemble Pricing Strategy**
#### **Loading Specialized Agents**

In [None]:
from Ensemble_Agent.specialist_agent import SpecialistAgent
from Ensemble_Agent.frontier_agent import FrontierAgent
from Ensemble_Agent.random_forest_agent import RandomForestAgent

#### **Defining Individual Price Predictions**


In [None]:
specialist = SpecialistAgent()
frontier = FrontierAgent(collection)
random_forest = RandomForestAgent()

- **Each of the agents above produces its own price prediction, based on a distinct strategy.**

---

#### **Combining Predictions into an Ensemble Model**

- Let's create a **feature matrix** from the different price predictions.
- The feature matrix also includes the **min and max** of the three predictions as additional features.

In [None]:
def description(item):
    return item.prompt.split("to the nearest dollar?\n\n")[1].split("\n\nPrice is $")[0]
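
To see what `description` extracts, here is a small sketch that runs it on a stand-in object. The prompt layout (question, blank line, product text, blank line, `Price is $...`) is an assumption inferred from the two split strings, and `fake_item` is a hypothetical substitute for a real `Item`.

```python
from types import SimpleNamespace

def description(item):
    # Keep only the text between the question and the trailing price line.
    return item.prompt.split("to the nearest dollar?\n\n")[1].split("\n\nPrice is $")[0]

# Hypothetical stand-in for an Item; only the .prompt attribute is needed here.
fake_item = SimpleNamespace(
    prompt=(
        "How much does this cost to the nearest dollar?\n\n"
        "HyperX QuadCast USB condenser microphone\n\n"
        "Price is $130.00"
    )
)

text = description(fake_item)
print(text)
```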

In [None]:
def rf(item):
    return random_forest.price(description(item))

In [None]:
Tester.test(rf, test)

In [None]:
product = "Quadcast HyperX condenser mic for high quality audio for podcasting"

In [None]:
print(specialist.price(product))
print(frontier.price(product))
print(random_forest.price(product))

In [None]:
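# Collect each agent's prediction and the true price for 250 test items
specialists = []
frontiers = []
random_forests = []
prices = []  # note: this replaces the list of prices loaded from ChromaDB earlier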
for item in tqdm(test[1000:1250]):
    text = description(item)
    specialists.append(specialist.price(text))
    frontiers.append(frontier.price(text))
    random_forests.append(random_forest.price(text))
    prices.append(item.price)

In [None]:
mins = [min(s,f,r) for s,f,r in zip(specialists, frontiers, random_forests)]
maxes = [max(s,f,r) for s,f,r in zip(specialists, frontiers, random_forests)]

X = pd.DataFrame({
    'Specialist': specialists,
    'Frontier': frontiers,
    'RandomForest': random_forests,
    'Min': mins,
    'Max': maxes,
})

# Convert y to a Series
y = pd.Series(prices)
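
The `Min` and `Max` columns are not linear functions of the three predictions, so they give the linear model information that a weighted sum of the raw predictions alone cannot express, for example how far the agents disagree. A minimal sketch of the same row-wise construction, using invented predictions and plain dictionaries instead of a DataFrame:

```python
# Hypothetical predictions from the three agents for three products.
specialists = [120.0, 40.0, 310.0]
frontiers = [135.0, 35.0, 280.0]
random_forests = [150.0, 60.0, 295.0]

rows = []
for s, f, r in zip(specialists, frontiers, random_forests):
    rows.append({
        "Specialist": s,
        "Frontier": f,
        "RandomForest": r,
        "Min": min(s, f, r),  # lowest of the three estimates for this product
        "Max": max(s, f, r),  # highest of the three estimates for this product
    })

for row in rows:
    print(row)
```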

#### **Training a Linear Regression Model for Ensemble Learning**


In [None]:
# Train a Linear Regression
np.random.seed(42)

lr = LinearRegression()
lr.fit(X, y)

feature_columns = X.columns.tolist()

for feature, coef in zip(feature_columns, lr.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept={lr.intercept_:.2f}")

In [None]:
joblib.dump(lr, 'ensemble_model.pkl')

**Explanation:**
- A **Linear Regression model** learns one weight per feature (the three agents' predictions plus the min and max) and an intercept, chosen to minimize squared error on these 250 items.
- The trained model is saved for future use.
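
Once fitted, the model's prediction is just the intercept plus a weighted sum of the features. The sketch below shows that arithmetic with made-up coefficients; in practice the values come from `lr.coef_` and `lr.intercept_`, and the `combine` helper is hypothetical.

```python
# Illustrative coefficients only; real values come from lr.coef_ and lr.intercept_.
weights = {
    "Specialist": 0.5,
    "Frontier": 0.25,
    "RandomForest": 0.25,
    "Min": 0.125,
    "Max": -0.125,
}
intercept = 2.0

def combine(features, weights, intercept):
    """Linear ensemble: intercept plus the weighted sum of the feature values."""
    return intercept + sum(weights[name] * value for name, value in features.items())

# One product's features: three agent predictions plus their min and max.
features = {
    "Specialist": 120.0,
    "Frontier": 135.0,
    "RandomForest": 150.0,
    "Min": 120.0,
    "Max": 150.0,
}

estimate = combine(features, weights, intercept)
print(f"Ensemble estimate: ${estimate:.2f}")
```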

### **7. Evaluating the Ensemble Model**


In [None]:
from agents.ensemble_agent import EnsembleAgent
ensemble = EnsembleAgent(collection)

In [None]:
ensemble.price(product)

**Explanation:**
- The ensemble is first tried on the single example `product`, then evaluated on the test set with `Tester`.
- Its **price estimate** combines the predictions from the underlying models rather than relying on any single one.

In [None]:
def ensemble_pricer(item):
    return ensemble.price(description(item))

In [None]:
Tester.test(ensemble_pricer, test)
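
`Tester` comes from `utils.testing` and its internals are not shown here. As a rough idea of what such an evaluation can measure, the sketch below computes mean absolute error and root mean squared log error (RMSLE) for any pricing function. This is not `Tester`'s implementation: the `evaluate` helper, the flat pricer, and the sample data are all hypothetical.

```python
import math

def evaluate(predict, items):
    """Return (mean absolute error, RMSLE) for predict over (text, true_price) pairs."""
    abs_errors = []
    sq_log_errors = []
    for text, truth in items:
        guess = max(predict(text), 0.0)  # clamp negatives so log1p stays defined
        abs_errors.append(abs(guess - truth))
        sq_log_errors.append((math.log1p(guess) - math.log1p(truth)) ** 2)
    mae = sum(abs_errors) / len(abs_errors)
    rmsle = math.sqrt(sum(sq_log_errors) / len(sq_log_errors))
    return mae, rmsle

# Hypothetical pricer that always answers $100, and three invented test items.
def flat_pricer(text):
    return 100.0

sample = [("mic", 130.0), ("kettle", 35.0), ("drill", 100.0)]

mae, rmsle = evaluate(flat_pricer, sample)
print(f"MAE=${mae:.2f}, RMSLE={rmsle:.3f}")
```

Running the same function over each model's pricer gives numbers that can be compared side by side.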