### **Introduction to Tokenization and Model Evaluation**

This notebook explores two fundamental pillars of working with large language models (LLMs):

#### **Part 1: Tokenization Demystified**
Tokenization is how LLMs convert raw text into numerical representations. We'll investigate:
- How models such as LLaMA 3.1, Qwen2.5, Gemma 2, and Phi-3 tokenize numbers differently
- Why tokenization matters for numerical tasks like price prediction
- The real-world impact: A model that splits "1000" into two tokens may struggle with arithmetic compared to one that treats it as a single unit
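
To make the last point concrete, here is a toy sketch in plain Python. The two splitting schemes are illustrative assumptions, and `chunk_digits` is a made-up helper rather than any model's real tokenizer: one scheme emits a token per digit, the other groups up to three digits at a time, left to right.

```python
def chunk_digits(number_text: str, max_digits: int) -> list[str]:
    """Toy 'tokenizer': split a digit string into left-to-right chunks."""
    return [
        number_text[i:i + max_digits]
        for i in range(0, len(number_text), max_digits)
    ]

for text in ["999", "1000"]:
    per_digit = chunk_digits(text, 1)   # hypothetical scheme A: one token per digit
    grouped = chunk_digits(text, 3)     # hypothetical scheme B: up to three digits per token
    print(f"{text:>4}  per-digit: {per_digit}  grouped: {grouped}")
```

Under the grouped scheme, "999" is a single unit while "1000" needs two. The tokenizer comparison later in the notebook makes this kind of boundary visible for real models.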

#### **Part 2: Model Evaluation Essentials**
We then evaluate our base model's price prediction capabilities by:
1. Loading LLaMA 3.1 with 4-bit quantization (cutting weight memory roughly 4x versus 16-bit weights, or 8x versus 32-bit)
2. Creating a robust testing framework that measures:
   - Absolute price errors (in USD)
   - Relative accuracy (percentage differences)
   - Squared log error (penalizing large relative deviations)
3. Visualizing predictions vs. actual prices to identify systematic biases


### **Learning Objectives:**
1. Compare tokenization behavior across different LLMs
2. Evaluate base model performance on price prediction
3. Understand quantitative evaluation metrics
4. Visualize model performance vs ground truth

### **Install required packages**

In [None]:
!pip install -q datasets requests torch peft bitsandbytes transformers trl accelerate sentencepiece tiktoken matplotlib

In [None]:
# Import with clear grouping
import os
import re
import math
from datetime import datetime
from tqdm import tqdm
import matplotlib.pyplot as plt

# HuggingFace and Colab specific
from google.colab import userdata
from huggingface_hub import login

# PyTorch and Transformers
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    set_seed
)
from peft import LoraConfig, PeftModel
from datasets import load_dataset, Dataset, DatasetDict

### **Model Selection**

We'll compare these foundation models:
1. **LLaMA 3.1-8B** (Meta)
2. **Qwen2.5-7B** (Alibaba)
3. **Gemma-2-9B** (Google)
4. **Phi-3-medium** (Microsoft)

Note the parameter counts:
- Our base: 8B params (4-bit quantized)
- GPT-4: parameter count not officially disclosed; the widely quoted unofficial estimate of ~1.8T would be roughly 225x larger

In [None]:
# Tokenizers

LLAMA_3_1 = "meta-llama/Meta-Llama-3.1-8B"
QWEN_2_5 = "Qwen/Qwen2.5-7B"
GEMMA_2 = "google/gemma-2-9b"
PHI_3 = "microsoft/Phi-3-medium-4k-instruct"

# Constants

BASE_MODEL = LLAMA_3_1
HF_USER = "ed-donner"
DATASET_NAME = f"{HF_USER}/pricer-data"
MAX_SEQUENCE_LENGTH = 182
QUANT_4_BIT = True

GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red": RED, "orange": YELLOW, "green": GREEN}


### HuggingFace Login

Required steps:
1. Create account at https://huggingface.co
2. Generate token at https://huggingface.co/settings/tokens
3. Add to Colab secrets (Key icon → New secret) named 'HF_TOKEN'

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

### Tokenizer Analysis

Different models tokenize numbers differently, which affects numerical reasoning.
Let's examine how each model handles number tokenization:

In [None]:
def investigate_tokenizer(model_name):
    print(f"\n=== {model_name} ===")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    for number in [0, 1, 10, 100, 999, 1000]:
        tokens = tokenizer.encode(str(number), add_special_tokens=False)
        print(f"{number:>4} → {tokens} (Length: {len(tokens)})")

In [None]:
# Run the investigation for each model: LLAMA_3_1, QWEN_2_5, GEMMA_2, PHI_3

for model_name in [LLAMA_3_1, QWEN_2_5, GEMMA_2, PHI_3]:
    investigate_tokenizer(model_name)

## **Loading Price Prediction Dataset**

Features:
- Product descriptions
- Ground truth prices
- Split into train/test sets

In [None]:
dataset = load_dataset(DATASET_NAME)
train = dataset['train']
test = dataset['test']

In [None]:
test[0]
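
The later cells appear to rely on two features of each `text` field: sections separated by blank lines, with the item title in the second section, and a prompt that ends in "Price is $". The record below is a made-up stand-in, not taken from the dataset, and only shows how those pieces are pulled apart.

```python
# Hypothetical record for illustration; the real dataset's wording may differ
example = {
    "text": "How much does this cost to the nearest dollar?\n\n"
            "Acme Stainless Steel Water Bottle 750ml\n"
            "Double-wall insulated, leak-proof lid.\n\n"
            "Price is $",
    "price": 24.0,
}

sections = example["text"].split("\n\n")
title_preview = sections[1][:20] + "..."   # same slicing used when results are printed
ends_with_marker = example["text"].endswith("Price is $")

print(title_preview)
print(ends_with_marker)
```

If the real records follow this shape, the model only has to continue the text after the dollar sign, which is why a very small generation budget is enough.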

### **Loading Base Model with Quantization**

Using 4-bit quantization for memory efficiency:
- Normal Float 4 (nf4) quantization
- Double quantization for additional savings
- bfloat16 compute dtype for stability
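
As a rough back-of-the-envelope check on the savings, the sketch below estimates weight storage at different precisions. The parameter count is rounded to 8 billion and `weight_memory_gb` is a hypothetical helper. The estimate ignores quantization constants and any layers kept in higher precision, so the footprint reported after loading will be somewhat higher.

```python
PARAMS = 8e9  # assumed round figure for an 8B-parameter model

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>8}: {weight_memory_gb(PARAMS, bits):5.1f} GB")
```

Comparing the printed estimate with the value from `get_memory_footprint()` after loading gives a quick sanity check that quantization was actually applied.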

In [None]:
# Pick the quantization config: 4-bit NF4 by default, 8-bit otherwise

if QUANT_4_BIT:
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )
else:
    # The bnb_4bit_* options apply only to 4-bit loading; 8-bit needs just this flag
    quant_config = BitsAndBytesConfig(load_in_8bit=True)

In [None]:
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

print(f"\nMemory footprint: {base_model.get_memory_footprint() / 1e9:.1f} GB")

In [None]:
# Price Extraction Utility
def extract_price(response):
    """Extract numerical price from model response"""
    if "Price is $" in response:
        contents = response.split("Price is $")[1]
        contents = contents.replace(',', '').replace('$', '')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0

In [None]:
extract_price("Price is $999 blah blah so cheap")
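
A few more inputs help show what the extractor does and does not handle. The function is restated so the snippet runs on its own, and the sample strings are made up for illustration.

```python
import re

def extract_price(response):
    """Same logic as the utility above, repeated so this snippet is standalone."""
    if "Price is $" in response:
        contents = response.split("Price is $")[1]
        contents = contents.replace(',', '').replace('$', '')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0

samples = [
    "Price is $1,234.50 with free shipping",  # thousands separator is stripped
    "Price is $ about 45 dollars",            # first number after the marker wins
    "Price is $unknown",                      # marker present, but no number
    "It costs 20 dollars",                    # marker missing entirely
]
for sample in samples:
    print(f"{sample!r} -> {extract_price(sample)}")
```

Note that a missing marker and a missing number both fall back to 0, so a prediction of 0 in the evaluation can mean the model produced no parsable price at all.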

In [None]:
# Prediction Function
def model_predict(prompt):
    """Generate price prediction from product description"""
    set_seed(42)  # For reproducibility
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    attention_mask = torch.ones(inputs.shape, device="cuda")
    outputs = base_model.generate(
        inputs,
        max_new_tokens=4,  # Limit to price prediction
        attention_mask=attention_mask,
        num_return_sequences=1
    )
    response = tokenizer.decode(outputs[0])
    return extract_price(response)

In [None]:
model_predict(test[0]['text'])

### **Evaluation Metrics**

We'll track:
1. Absolute Error (USD difference)
2. Squared Log Error (SLE) - Penalizes large relative errors
3. Accuracy Categories:
   - Green: Error < $40 or < 20% of true price
   - Yellow (plotted as orange in the chart): Error < $80 or < 40% of true price
   - Red: Larger errors
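
A worked example may help fix these definitions. The sketch below applies the same formulas the `Tester` class uses further down to two made-up guess/truth pairs. `score` is a hypothetical helper name, and the numbers are illustrative rather than model output.

```python
import math

def score(guess: float, truth: float) -> tuple[float, float, str]:
    """Return (absolute error, squared log error, color) for one prediction."""
    error = abs(guess - truth)
    sle = (math.log(truth + 1) - math.log(guess + 1)) ** 2
    if error < 40 or error / truth < 0.2:
        color = "green"
    elif error < 80 or error / truth < 0.4:
        color = "orange"
    else:
        color = "red"
    return error, sle, color

# The same $50 miss on a $100 item and on a $1,000 item
for guess, truth in [(150.0, 100.0), (1050.0, 1000.0)]:
    error, sle, color = score(guess, truth)
    print(f"guess=${guess:,.0f} truth=${truth:,.0f} error=${error:,.0f} SLE={sle:.4f} {color}")
```

The first pair lands in the orange band and the second in green. The squared log error is also far smaller for the second pair, because the same $50 miss is a much smaller fraction of the true price.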


## **Base Model Performance Evaluation**

Testing the LLaMA 3.1 8B model on 250 test cases:

In [None]:
class Tester:

    def __init__(self, predictor, data, title=None, size=250):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.guesses = []
        self.truths = []
        self.errors = []
        self.sles = []
        self.colors = []

    def color_for(self, error, truth):
        if error<40 or error/truth < 0.2:
            return "green"
        elif error<80 or error/truth < 0.4:
            return "orange"
        else:
            return "red"

    def run_datapoint(self, i):
        datapoint = self.data[i]
        guess = self.predictor(datapoint["text"])
        truth = datapoint["price"]
        error = abs(guess - truth)
        log_error = math.log(truth+1) - math.log(guess+1)
        sle = log_error ** 2
        color = self.color_for(error, truth)
        title = datapoint["text"].split("\n\n")[1][:20] + "..."
        self.guesses.append(guess)
        self.truths.append(truth)
        self.errors.append(error)
        self.sles.append(sle)
        self.colors.append(color)
        print(f"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}")

    def chart(self, title):
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.truths), max(self.guesses))
        plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)
        plt.scatter(self.truths, self.guesses, s=3, c=self.colors)
        plt.xlabel('Ground Truth')
        plt.ylabel('Model Estimate')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        plt.title(title)
        plt.show()

    def report(self):
        average_error = sum(self.errors) / self.size
        rmsle = math.sqrt(sum(self.sles) / self.size)
        hits = sum(1 for color in self.colors if color=="green")
        title = f"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%"
        self.chart(title)

    def run(self):
        for i in range(self.size):
            self.run_datapoint(i)
        self.report()

    @classmethod
    def test(cls, function, data):
        cls(function, data).run()

In [None]:
Tester.test(model_predict, test)