## **Evaluating Our Fine-Tuned LLM for Product Price Prediction**

### Learning Objectives:
1. Load and test fine-tuned QLoRA adapters
2. Compare prediction methods (greedy vs weighted)
3. Evaluate model performance quantitatively
4. Visualize prediction accuracy

#### **Install required packages (commented out to prevent accidental execution)**

In [None]:
# Uncomment the next line on first run to install the dependencies
# !pip install -q datasets peft requests torch bitsandbytes transformers trl accelerate sentencepiece matplotlib

In [None]:
# Import with clear grouping
import os
import re
import math
from datetime import datetime
from tqdm import tqdm
import matplotlib.pyplot as plt

# HuggingFace and Colab specific
from google.colab import userdata
from huggingface_hub import login

# PyTorch and Transformers
import torch
import torch.nn.functional as F
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed
)
from peft import PeftModel
from datasets import load_dataset

## Experiment Configuration

Key settings for our evaluation:

In [None]:
# Model Selection
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
PROJECT_NAME = "pricer"
HF_USER = "ed-donner"

# Fine-Tuned Model Details
RUN_NAME = "2024-09-13_13.04.39"
PROJECT_RUN_NAME = f"{PROJECT_NAME}-{RUN_NAME}"
FINETUNED_MODEL = f"{HF_USER}/{PROJECT_RUN_NAME}"
REVISION = "e8d637df551603dc86cd7a1598a8f44af4d7ae36"  # Specific model version

# Dataset
DATASET_NAME = f"{HF_USER}/pricer-data"

# Quantization
QUANT_4_BIT = True

# Evaluation
TOP_K = 3  # Number of top predictions to consider
TEST_SIZE = 250  # Number of test samples to evaluate

# Console Colors
GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red": RED, "orange": YELLOW, "green": GREEN}

## HuggingFace Login

Required to access the models and datasets:
1. Create an account at https://huggingface.co
2. Generate a token at https://huggingface.co/settings/tokens
3. Add it to Colab secrets (Key icon → New secret) under the name 'HF_TOKEN'

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

## Loading Test Dataset

Our evaluation dataset provides:
- Product descriptions (`text`)
- Ground truth prices (`price`)
- The same prompt format as the training data

In [None]:
dataset = load_dataset(DATASET_NAME)
test = dataset['test']

In [None]:
test[0]

## Loading Fine-Tuned Model

We load:
1. Base LLaMA 3.1 model (quantized)
2. Fine-tuned LoRA adapters

In [None]:
# pick the right quantization (thank you Robert M. for spotting the bug with the 8 bit version!)

if QUANT_4_BIT:
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )
else:
    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        bnb_8bit_compute_dtype=torch.bfloat16
    )

In [None]:
# Initialize tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

In [None]:
# Load the fine-tuned model with PEFT
if REVISION:
    fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL, revision=REVISION)
else:
    fine_tuned_model = PeftModel.from_pretrained(base_model, FINETUNED_MODEL)

In [None]:
print(f"\nMemory footprint: {fine_tuned_model.get_memory_footprint() / 1e6:.1f} MB")
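
As a rough cross-check on the printed footprint, the weight storage can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: the 8-billion parameter count is a rounded assumption, `estimate_weight_memory_mb` is a hypothetical helper, and the real footprint will be higher because not every layer is quantized and quantization stores extra constants.

```python
def estimate_weight_memory_mb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


APPROX_PARAMS = 8e9  # assumption: rounded parameter count for an 8B model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimate_weight_memory_mb(APPROX_PARAMS, bits):,.0f} MB")
```

If the printed footprint is far above the 4-bit estimate, it is worth checking that `quant_config` was actually passed when loading the base model.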


In [None]:
fine_tuned_model

## Price Prediction Methods

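We implement two approaches:
1. **Greedy decoding**: Generates up to 3 new tokens, taking the most likely token at each step, then parses the price from the generated text
2. **Weighted prediction**: Takes the top K candidates for the next token, keeps those that parse as positive numbers, and averages them weighted by probability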
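
The weighted approach can be illustrated without a model. The following is a minimal sketch, assuming the top K candidates have already been decoded into `(token_text, probability)` pairs; `weighted_price` and the sample values are hypothetical and only mirror the averaging logic of `weighted_predict` defined further down.

```python
def weighted_price(candidates):
    """Probability-weighted mean of the candidates that parse as positive numbers."""
    prices, weights = [], []
    for token_text, probability in candidates:
        try:
            price = float(token_text)
        except ValueError:
            continue  # skip tokens that are not numeric
        if price > 0:
            prices.append(price)
            weights.append(probability)
    if not weights:
        return 0.0
    # Renormalise by the kept probabilities so the weights sum to 1
    return sum(p * w for p, w in zip(prices, weights)) / sum(weights)


# Hypothetical top-3 next tokens and their probabilities
top_candidates = [("199", 0.5), ("249", 0.3), ("N/A", 0.1)]
print(weighted_price(top_candidates))
```

With these made-up values, greedy decoding would answer 199, while the weighted estimate lands at about 217.75 because the second candidate pulls the average up.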

In [None]:
def extract_price(response):
    """Extract numerical price from model response"""
    if "Price is $" in response:
        contents = response.split("Price is $")[1].replace(',', '')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0

In [None]:
extract_price("Price is $a fabulous 899.99 or so")
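
A few more inputs make the parsing rules concrete. This sketch repeats `extract_price` so that it runs on its own; the sample strings are made up for illustration.

```python
import re


def extract_price(response):
    """Same logic as the cell above: first number after 'Price is $'."""
    if "Price is $" in response:
        contents = response.split("Price is $")[1].replace(",", "")
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0


samples = [
    "Price is $a fabulous 899.99 or so",  # number found after filler words
    "Price is $1,299",                    # thousands separator is stripped
    "Price is $unknown",                  # marker present but no digits
    "No marker here, just 42",            # marker missing
]
for text in samples:
    print(f"{text!r} -> {extract_price(text)}")
```

The last two cases return 0, so a response the parser cannot read is scored as a $0 prediction rather than raising an error.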

In [None]:
def greedy_predict(prompt):
    """Standard greedy decoding prediction"""
    set_seed(42)  # For reproducibility
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    attention_mask = torch.ones(inputs.shape, device="cuda")
    outputs = fine_tuned_model.generate(
        inputs,
        attention_mask=attention_mask,
        max_new_tokens=3,
        num_return_sequences=1,
        do_sample=False  # Force greedy decoding even if the model's generation config enables sampling
    )
    response = tokenizer.decode(outputs[0])
    return extract_price(response)

In [None]:
def weighted_predict(prompt, device="cuda"):
    """Weighted average of top K predictions"""
    set_seed(42)
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
    attention_mask = torch.ones(inputs.shape, device=device)

    with torch.no_grad():
        outputs = fine_tuned_model(inputs, attention_mask=attention_mask)
        next_token_logits = outputs.logits[:, -1, :].to('cpu')

    next_token_probs = F.softmax(next_token_logits, dim=-1)
    top_prob, top_token_id = next_token_probs.topk(TOP_K)
    
    prices, weights = [], []
    for i in range(TOP_K):
        predicted_token = tokenizer.decode(top_token_id[0][i])
        probability = top_prob[0][i].item()
        try:
            price = float(predicted_token)
            if price > 0:  # Filter invalid predictions
                prices.append(price)
                weights.append(probability)
        except ValueError:
            continue
            
    return sum(p * w for p, w in zip(prices, weights)) / sum(weights) if weights else 0.0

## Evaluation Metrics

We assess model performance using:
1. **Absolute Error (USD)**: |Prediction - True Price|
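2. **Squared Log Error (SLE)**: (log(True Price + 1) - log(Prediction + 1))², which penalizes relative rather than absolute errors
3. **Accuracy Categories**:
   - Green: Error < $40 or < 20% of the true price
   - Orange (printed in yellow): Error < $80 or < 40% of the true price
   - Red: Larger errors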

Comparison baselines:
- GPT-4: $76 average error
- Base LLaMA: $396 average error
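
To make the three metrics concrete, here is a small worked sketch that applies the same rules to a handful of hypothetical (prediction, truth) pairs. The `score` helper is an illustrative name, and the middle band is labelled `orange` to match the code that follows.

```python
import math


def score(prediction, truth):
    """Return (absolute error, squared log error, colour) for one prediction."""
    error = abs(prediction - truth)
    sle = (math.log(truth + 1) - math.log(prediction + 1)) ** 2
    if error < 40 or error / truth < 0.2:
        color = "green"
    elif error < 80 or error / truth < 0.4:
        color = "orange"
    else:
        color = "red"
    return error, sle, color


# Hypothetical (prediction, truth) pairs
for prediction, truth in [(110, 100), (300, 250), (50, 200), (900, 1000)]:
    error, sle, color = score(prediction, truth)
    print(f"pred={prediction:>4} truth={truth:>4} error={error:>4} sle={sle:.3f} {color}")
```

Note the last pair: a $100 miss still counts as green because it is only 10% of a $1,000 item, which is why the categories combine an absolute and a relative threshold.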

In [None]:
class PriceEvaluator:
    def __init__(self, predictor, data, title=None, size=TEST_SIZE):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.results = {
            'guesses': [],
            'truths': [],
            'errors': [],
            'sles': [],
            'colors': []
        }

    def _categorize_error(self, error, truth):
        """Classify prediction accuracy"""
        if error < 40 or error/truth < 0.2:
            return "green"
        elif error < 80 or error/truth < 0.4:
            return "orange"
        return "red"

    def evaluate_sample(self, index):
        """Run prediction on single test case"""
        sample = self.data[index]
        prediction = self.predictor(sample["text"])
        truth = sample["price"]
        error = abs(prediction - truth)
        log_error = math.log(truth+1) - math.log(prediction+1)
        sle = log_error ** 2
        color = self._categorize_error(error, truth)
        item_desc = sample["text"].split("\n\n")[1][:20] + "..."

        # Store results
        self.results['guesses'].append(prediction)
        self.results['truths'].append(truth)
        self.results['errors'].append(error)
        self.results['sles'].append(sle)
        self.results['colors'].append(color)

        # Print colored output
        print(f"{COLOR_MAP[color]}{index+1}: "
              f"Pred: ${prediction:,.2f} | "
              f"True: ${truth:,.2f} | "
              f"Error: ${error:,.2f} | "
              f"SLE: {sle:,.2f} | "
              f"Item: {item_desc}{RESET}")

    def visualize_results(self):
        """Generate prediction vs truth scatter plot"""
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.results['truths']), max(self.results['guesses']))
        
        # Perfect prediction line
        plt.plot([0, max_val], [0, max_val], 
                color='deepskyblue', lw=2, alpha=0.6, 
                label='Perfect Prediction')
        
        # Actual predictions
        plt.scatter(self.results['truths'], self.results['guesses'], 
                   s=3, c=self.results['colors'])
        
        plt.xlabel('Ground Truth Price ($)')
        plt.ylabel('Model Prediction ($)')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        n = len(self.results['errors'])  # Number of samples actually evaluated
        plt.title(f"{self.title}\n"
                 f"Avg Error: ${sum(self.results['errors'])/n:,.2f} | "
                 f"Accuracy: {sum(1 for c in self.results['colors'] if c=='green')/n:.1%}")
        plt.legend()
        plt.show()

    def run_evaluation(self):
        """Execute full evaluation pipeline, then plot the results"""
        for i in range(self.size):
            self.evaluate_sample(i)
        self.visualize_results()

## Evaluating Weighted Prediction Method

Testing our fine-tuned model on `TEST_SIZE` (250) test cases:

In [None]:
print("\n=== Evaluating Fine-Tuned Model ===")
evaluator = PriceEvaluator(weighted_predict, test, "Fine-Tuned Model Performance")
evaluator.run_evaluation()
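
The same class can score `greedy_predict` for a side-by-side comparison by passing it in place of `weighted_predict`. Since that requires the loaded model, the sketch below only shows the comparison pattern on stand-in predictors; `mean_absolute_error`, `always_100`, `length_based`, and the two samples are all hypothetical.

```python
def mean_absolute_error(predictor, samples):
    """Average |prediction - truth| over a list of {'text', 'price'} samples."""
    errors = [abs(predictor(s["text"]) - s["price"]) for s in samples]
    return sum(errors) / len(errors)


# Stand-in predictors (assumptions for illustration, not real models)
def always_100(prompt):
    return 100.0


def length_based(prompt):
    return float(len(prompt))


samples = [
    {"text": "x" * 90, "price": 95.0},
    {"text": "x" * 150, "price": 140.0},
]

for predictor in (always_100, length_based):
    name = predictor.__name__.replace("_", " ").title()
    print(f"{name}: ${mean_absolute_error(predictor, samples):,.2f}")
```

Running both predictors over the same samples keeps the comparison fair, since any difference in average error then comes from the method and not from the data.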

### Key Observations:
1. Compare results to our baselines:
   - GPT-4: $76 average error
   - Base LLaMA: $396 average error
2. Green dots mark predictions within $40 or 20% of the true price
3. Points above the blue line are overestimates
4. Points below are underestimates
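
The over- and underestimate split can also be counted directly instead of read off the chart. A minimal sketch, assuming two parallel lists of predictions and true prices; `over_under_counts` is a hypothetical helper and the numbers are invented.

```python
def over_under_counts(guesses, truths):
    """Count predictions above, below, and exactly on the truth."""
    over = sum(1 for g, t in zip(guesses, truths) if g > t)
    under = sum(1 for g, t in zip(guesses, truths) if g < t)
    exact = len(guesses) - over - under
    return over, under, exact


# Hypothetical values; in the notebook these would come from
# evaluator.results['guesses'] and evaluator.results['truths']
guesses = [120.0, 80.0, 45.0, 300.0]
truths = [100.0, 95.0, 45.0, 410.0]
print(over_under_counts(guesses, truths))
```

A strong imbalance between the first two counts would suggest a systematic bias in the model's predictions.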
"""