## **Evaluating Frontier Models for Product Pricing**  

### **1. Introduction**  
In this lesson, we will explore how **frontier models** (such as GPT-4o and Claude 3.5 Sonnet) perform at estimating product prices from textual descriptions. The focus is on **evaluating** these models rather than training them.  

We will also compare their performance against **traditional models** and **human-generated estimates**, using a structured **test harness** to measure their accuracy.

#### **Key Considerations**  
- **We are not training the LLMs**, only evaluating their performance on the test dataset.  
- Some **data contamination** is possible, as these models may have been trained on similar products.  
- We will use **GPT-4o-mini, GPT-4o (August model), and Claude 3.5 Sonnet** for evaluation.

---

### **2. Dataset Preparation**  

#### **Importing Required Libraries**  
To begin, we import the necessary libraries: 

In [None]:
import os
import re
import math
import json
import random
from dotenv import load_dotenv
from huggingface_hub import login
import matplotlib.pyplot as plt
import numpy as np
import pickle
from collections import Counter
from openai import OpenAI
from anthropic import Anthropic
from utils.testing import Tester  # Custom test harness

### **Environment Setup**

We load the OpenAI and Anthropic API keys from a `.env` file and fail early if either is missing.

In [None]:
# Load environment variables
import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load variables from .env file
load_dotenv()

# Retrieve API keys from environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')

# Check if API keys are properly loaded
if not openai_api_key or not anthropic_api_key:
    raise ValueError("Missing API keys. Ensure OPENAI_API_KEY and ANTHROPIC_API_KEY are set in the .env file.")

# Set environment variables explicitly (optional)
os.environ['OPENAI_API_KEY'] = openai_api_key
os.environ['ANTHROPIC_API_KEY'] = anthropic_api_key
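
The check above raises the same message whichever key is absent. As a minimal sketch, a small helper can name exactly which variables are missing. The function name `missing_env_vars` is hypothetical and not part of the original notebook, and the dictionary below stands in for the real environment.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    # Fall back to the real process environment when no mapping is given
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Placeholder values only; no real keys are used here
fake_env = {"OPENAI_API_KEY": "sk-placeholder", "ANTHROPIC_API_KEY": ""}
print(missing_env_vars(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"], fake_env))
```

An empty string counts as missing here, which matches the `if not openai_api_key` style of check used in the cell.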


#### **Initializing LLM Clients**  


In [None]:
openai = OpenAI()
claude = Anthropic()

### **Loading the Dataset**
We load our train and test datasets from previously saved pickle files:

In [None]:
with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

with open('test.pkl', 'rb') as file:
    test = pickle.load(file)

### **Preparing LLMs for Price Estimation**  

#### **Structuring Prompts for LLMs**  
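To get a bare price back from each model, we construct structured **prompts**. The function below removes the "to the nearest dollar" wording from the item's test prompt and moves the trailing "Price is $" out of the user message into a final assistant message, so the model's reply continues from that prefix: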

In [None]:
def messages_for(item):
    system_message = "You estimate prices of items. Reply only with the price, no explanation."
    user_prompt = item.test_prompt().replace(" to the nearest dollar", "").replace("\n\nPrice is $", "")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]
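
To see what this produces, here is a small sketch that runs `messages_for` on a stand-in item. `StubItem` and the prompt layout inside its `test_prompt` are assumptions made for illustration; the real item objects come from the pickled dataset and may format their prompt differently.

```python
class StubItem:
    """Hypothetical stand-in for the notebook's item objects."""

    def __init__(self, description):
        self.description = description

    def test_prompt(self):
        # Assumed prompt layout, chosen to match the strings that
        # messages_for strips out; the real layout is defined elsewhere
        return (
            "How much does this cost to the nearest dollar?\n\n"
            f"{self.description}\n\nPrice is $"
        )

def messages_for(item):
    # Same logic as the notebook cell, repeated so this block runs on its own
    system_message = "You estimate prices of items. Reply only with the price, no explanation."
    user_prompt = item.test_prompt().replace(" to the nearest dollar", "").replace("\n\nPrice is $", "")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"},
    ]

messages = messages_for(StubItem("Stainless steel kettle, 1.7 L"))
for message in messages:
    print(message["role"], "->", repr(message["content"]))
```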

#### **Extracting Prices from LLM Responses**  
We define a function to extract the **numerical price value** from model responses:

In [None]:
def get_price(s):
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
# Example usage

get_price("The price is roughly $99.99 because blah blah")
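
A few more inputs help show how the parser behaves at the edges. This sketch repeats `get_price` so it runs on its own; the sample strings are invented for illustration.

```python
import re

def get_price(s):
    # Same logic as the notebook cell
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

examples = ["$1,299", " 45.", "around 20 dollars", "no idea", "2 for $30"]
for text in examples:
    print(repr(text), "->", get_price(text))
```

Two behaviors are worth noting: a reply with no digits yields `0`, and the first number in the string wins, so `"2 for $30"` parses as `2.0` rather than `30.0`. With a short, price-only reply this rarely matters, but it is a limitation to keep in mind when reading the results.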

### **Evaluating Frontier Models**  

#### **GPT-4o-mini Price Estimation**  
We define a function that **queries GPT-4o-mini** to estimate prices:

In [None]:
def gpt_4o_mini(item):
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# Call the function to get a prediction for the first test item

gpt_4o_mini(test[0])

In [None]:
# Actual price of the above

test[0].price

In [None]:
# We then test GPT-4o-mini’s performance:

Tester.test(gpt_4o_mini, test)
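
The `Tester` class comes from `utils.testing` and its internals are not shown in this lesson. As a rough sketch of the kind of metrics such a harness might report (an assumption, not its actual implementation), the hypothetical `evaluate` function below computes mean absolute error and root mean squared log error for any predictor that takes an item and returns a price. The toy items are made up purely to exercise the code.

```python
import math
from types import SimpleNamespace

def evaluate(predictor, items):
    """Return MAE and RMSLE of `predictor` over `items` (each has a .price)."""
    abs_errors = []
    sq_log_errors = []
    for item in items:
        guess = predictor(item)
        truth = item.price
        abs_errors.append(abs(guess - truth))
        # Clamp at zero so log1p is defined even for a negative guess
        log_diff = math.log1p(max(guess, 0.0)) - math.log1p(truth)
        sq_log_errors.append(log_diff ** 2)
    n = len(items)
    return {
        "mae": sum(abs_errors) / n,
        "rmsle": math.sqrt(sum(sq_log_errors) / n),
    }

# Made-up items and a constant predictor, for illustration only
toy_items = [SimpleNamespace(price=10.0), SimpleNamespace(price=50.0)]

def always_twenty(item):
    return 20.0

print(evaluate(always_twenty, toy_items))
```

Because the function only needs a callable and a list of items, the same call shape would work for `gpt_4o_mini` or any other predictor defined in this notebook.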

#### **GPT-4o (August Model) Price Estimation**  
We evaluate the larger **GPT-4o** model, pinned to its August 2024 snapshot (`gpt-4o-2024-08-06`):

In [None]:
def gpt_4o_frontier(item):
    response = openai.chat.completions.create(
        model="gpt-4o-2024-08-06", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# Testing the GPT-4o frontier model

Tester.test(gpt_4o_frontier, test)

#### **Claude 3.5 Sonnet Price Estimation**  
We now test **Anthropic’s Claude 3.5 Sonnet**:

In [None]:
def claude_3_point_5_sonnet(item):
    messages = messages_for(item)
    system_message = messages[0]['content']
    messages = messages[1:]
    response = claude.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=5,
        system=system_message,
        messages=messages
    )
    reply = response.content[0].text
    return get_price(reply)
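
The function above splits the message list because Anthropic's Messages API takes the system prompt as a top-level `system` parameter rather than as a message with the `system` role. The remaining list still ends with the assistant message `"Price is $"`, which Claude continues from. A minimal sketch of that split as a reusable helper follows; the name `split_system` is hypothetical.

```python
def split_system(messages):
    """Separate a leading system message from the rest of the conversation."""
    if messages and messages[0]["role"] == "system":
        return messages[0]["content"], messages[1:]
    # No system message present: return the list unchanged
    return None, messages

example = [
    {"role": "system", "content": "You estimate prices of items."},
    {"role": "user", "content": "How much does this cost?"},
    {"role": "assistant", "content": "Price is $"},
]
system_text, rest = split_system(example)
print(system_text)
print([m["role"] for m in rest])
```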

In [None]:
# Testing Claude 3.5 Sonnet

Tester.test(claude_3_point_5_sonnet, test)



### **Conclusion & Key Takeaways**  

#### **Summary**  
- We tested different **frontier LLMs** on the task of price estimation.  
- **GPT-4o-mini, GPT-4o (August), and Claude 3.5 Sonnet** were used for comparison.  
- We leveraged a **test harness (Tester class)** to measure performance.

#### **Considerations & Limitations**  
- **LLMs might have prior knowledge** of the test products from pre-training, which could inflate their measured accuracy.  
- **Fine-tuning** could further improve performance, but it is outside the scope of this evaluation.  
- **Comparing results across different LLMs** on the same test set shows whether a given accuracy level is specific to one model.

#### **Next Steps**  
- Explore **fine-tuning techniques** for custom LLM adaptation.  
- Investigate **smaller, open-source models** to reduce dependency on API-based LLMs.  
- Perform **more rigorous error analysis** to refine model selection.
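
As one possible starting point for the error analysis mentioned above, the sketch below groups absolute errors by the price band of the true value, which can show whether a model struggles more with cheap or expensive items. The function name `errors_by_price_band`, the band width, and the prediction pairs are all illustrative assumptions, not results from this lesson.

```python
from collections import defaultdict

def errors_by_price_band(pairs, band_width=100):
    """Average absolute error per price band.

    `pairs` is an iterable of (predicted, actual) prices; each band is
    keyed by its lower bound, e.g. 100 covers actual prices in [100, 200).
    """
    bands = defaultdict(list)
    for predicted, actual in pairs:
        band_start = int(actual // band_width) * band_width
        bands[band_start].append(abs(predicted - actual))
    return {start: sum(errs) / len(errs) for start, errs in sorted(bands.items())}

# Invented (predicted, actual) pairs, for illustration only
toy_pairs = [(20.0, 25.0), (90.0, 60.0), (150.0, 180.0), (400.0, 250.0)]
print(errors_by_price_band(toy_pairs))
```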