<div style="font-size: 13px; line-height: 1.4; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;">
This notebook documents my theoretical study alongside the lab exercises conducted on <b>July 8 and 9, 2025</b>.
</h5>
</div>

### <u><b>LAB EXERCISES:</b></u> **WEEK 6**

<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em;">
    <b style="font-size: 16px;">Overview:</b> Build a data curation pipeline to collect, clean, and balance product data, then standardize it into <code>Item</code> objects with well-structured prompts and token counts suitable for training a product pricing model. Tokenizer behavior is also validated to confirm that 3-digit price values are encoded as a single token, preparing the dataset for the upcoming training and evaluation steps.
  </h5>
</div>


#### <code>**day1.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Begin the <b>Product Pricer</b> project by curating a dataset of <b>Home Appliances</b> from Amazon reviews. Focus on items with prices and prepare them for training by creating <code>Item</code> objects with truncated text and prompts.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.3em; font-size: 16px;"><b>The Big Project Begins!</b></h4>
  <b>Product Pricer:</b> A model that can estimate how much something costs based on its description.<br>
  <b>Data Curation – Part 1:</b> We’ll begin curating and cleaning a subset of the dataset, focusing on <b>Home Appliances</b>.<br>
  <b>Dataset source:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023</a><br>
  <b>Meta categories folder:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories</a>
</div>


<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Sidenote: What is Data Curation?</b></h5>
  Data curation is the process of collecting, cleaning, organizing, and preparing raw data into a high-quality, structured dataset suitable for analysis or machine learning tasks.<br>
  In the context of this week, curation includes:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Selecting relevant data points (e.g., products with valid prices)</li>
    <li>Removing or correcting errors, inconsistencies, or irrelevant entries</li>
    <li>Balancing the dataset by category or price range</li>
    <li>Formatting data: creating prompts, truncating text, tokenizing</li>
    <li>Saving curated outputs for reuse in training and evaluation</li>
  </ul>
  Well-curated data ensures your models are trained on accurate, representative, and relevant samples — leading to stronger performance and more trustworthy results.
</div>
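
As a rough illustration of the first two steps in that list, the sketch below filters a handful of made-up records down to those with a usable price and enough text. The field names mirror the dataset used later in this notebook; the records, the `min_chars` threshold, and the helper name `curate` are assumptions for illustration only.

```python
# Minimal curation sketch on made-up records (not the real dataset).

def curate(records, min_price=1.0, max_price=999.0, min_chars=20):
    """Keep records with a price in range and a long enough title + description."""
    kept = []
    for record in records:
        try:
            price = float(record.get("price"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric price
        if not (min_price <= price <= max_price):
            continue  # outside the price range we want to model
        text = f"{record.get('title', '')} {record.get('description', '')}".strip()
        if len(text) < min_chars:
            continue  # too little text to learn from
        kept.append({"text": text, "price": price})
    return kept


toy_records = [
    {"title": "Ice maker water filter", "description": "Fits most side-by-side fridges", "price": "12.99"},
    {"title": "Dryer belt", "description": "", "price": "None"},  # unpriced
    {"title": "Knob", "description": "", "price": "4.50"},  # too little text
    {"title": "Commercial oven", "description": "High speed countertop unit", "price": "21095.62"},  # out of range
]

curated = curate(toy_records)
print(f"Kept {len(curated)} of {len(toy_records)} records")
print(curated)
```

Each rejection reason is a separate `continue`, which makes it easy to add counters later and see why records are being dropped.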


In [None]:
# Anaconda (conda) users (quote the scipy spec so the shell does not treat < as a redirect):
# conda install -c conda-forge datasets matplotlib matplotlib-inline "scipy<1.13" gensim anthropic ollama

# Python (pip) users:
# pip install datasets matplotlib matplotlib-inline "scipy<1.13" gensim anthropic ollama

In [None]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt
from items import Item

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# %matplotlib inline

In [None]:
# Load in our dataset

dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)

In [None]:
print(f"Number of Appliances: {len(dataset):,}")

In [None]:
# Investigate a particular datapoint
datapoint = dataset[2]

In [None]:
# Investigate
print(datapoint["title"])
print(datapoint["description"])
print(datapoint["features"])
print(datapoint["details"])
print(datapoint["price"])

In [None]:
# How many have prices?
prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError as e:
        pass

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")

In [None]:
# For those with prices, gather the price and the length
prices = []
lengths = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices.append(price)
            contents = datapoint["title"] + str(datapoint["description"]) + str(datapoint["features"]) + str(datapoint["details"])
            lengths.append(len(contents))
    except ValueError as e:
        pass

In [None]:
# Plot the distribution of lengths
plt.figure(figsize=(15, 6))
plt.title(f"Lengths: Avg {sum(lengths)/len(lengths):,.0f} and highest {max(lengths):,}\n")
plt.xlabel('Length (chars)')
plt.ylabel('Count')
plt.hist(lengths, rwidth=0.7, color="lightblue", bins=range(0, 6000, 100))
plt.show()

In [None]:
# Plot the distribution of prices
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.2f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="orange", bins=range(0, 1000, 10))
plt.show()

In [None]:
# So what is this item??
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 21000:
            print(datapoint['title'])
    except ValueError as e:
        pass

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Reference Product (for Comparison):</b><br>
This is the closest I can find — looks like it's going at a bargain price!<br>
<a href="https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/" target="_blank">
https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/
</a>
</div>


<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Now It's Time to Curate Our Dataset</b></h5>
  We select items that cost between <b>$1 and $999</b>.<br>
  For each product, we will create <code>Item</code> instances that:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Truncate the product description to fit within <b>180 tokens</b> using the appropriate tokenizer</li>
    <li>Generate a <b>prompt</b> to be used during training</li>
  </ul>
  Items will be <b>rejected</b> if they do not contain a sufficient number of characters.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>But Why 180 Tokens?</b></h5>
  A student asked a great question — <i>"Why are we truncating to 180 tokens? How did we determine that number?"</i><br>
  (Thank you Moataz A. for the excellent question!)<br><br>
  The answer: this is a classic example of a <b>hyperparameter</b>. In other words, it's chosen via <b>trial and error</b>, looking for a value that provides:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>A high enough token count to give the model useful pricing context</li>
    <li>A low enough token count to ensure efficient training</li>
  </ul>
  I experimented with several values and found that <b>180</b> offered the best balance. You are encouraged to try your own tuning — this type of iteration is a key part of data science research and development.<br><br>
  There’s also a practical reason for keeping the token count low: at <b>inference time</b>, we’ll estimate prices for products using short 1–2 sentence descriptions, so our training data should mimic that format for best performance.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>But I See 160 Tokens in <code>items.py</code>?</b></h5>
  Another great question from Moataz A.!<br><br>
  Yes — the product description is limited to <b>160 tokens</b> because we prepend and append custom text to format it into a training prompt. That extra context brings the <b>total length</b> to around <b>180 tokens</b>.
</div>
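
To make the 160 vs. 180 arithmetic concrete, here is a minimal sketch of the idea: cut the description to a token budget, then wrap it in a question and a price line. It uses whitespace-separated words as a stand-in for real tokens, and the question text, price prefix, and function names are assumptions for illustration; the actual logic lives in `items.py` and uses the model's tokenizer.

```python
# Sketch only: whitespace "tokens" stand in for a real tokenizer,
# and the prompt wording below is assumed, not copied from items.py.

QUESTION = "How much does this cost to the nearest dollar?"  # assumed wording
PREFIX = "Price is $"  # assumed wording


def truncate(text, max_tokens):
    """Keep at most max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])


def make_prompt(description, price, max_tokens=160):
    """Training prompt: question, truncated description, then the rounded price."""
    body = truncate(description, max_tokens)
    return f"{QUESTION}\n\n{body}\n\n{PREFIX}{round(price)}.00"


def make_test_prompt(prompt):
    """Test prompt: the same text with everything after the price prefix removed."""
    return prompt.split(PREFIX)[0] + PREFIX


description = " ".join(f"word{i}" for i in range(500))  # a 500-token description
prompt = make_prompt(description, 34.99)
body_tokens = len(truncate(description, 160).split())
total_tokens = len(prompt.split())

print(f"Description tokens kept: {body_tokens}")
print(f"Total tokens including the wrapper text: {total_tokens}")
print(repr(make_test_prompt(prompt)[-10:]))
```

The wrapper text adds a fixed overhead on top of the description budget, which is why the description limit sits below the overall target.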


In [None]:
# Create an Item object for each with a price
items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError as e:
        pass

print(f"There are {len(items):,} items")

In [None]:
# Look at one item (index 1, i.e. the second in the list)
items[1]

In [None]:
# Investigate the prompt that will be used during training - the model learns to complete this
print(items[100].prompt)

In [None]:
# Investigate the prompt that will be used during testing - the model has to complete this
print(items[100].test_prompt())

In [None]:
# Plot the distribution of token counts
tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="green", bins=range(0, 300, 10))
plt.show()

In [None]:
# Plot the distribution of prices
prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 300, 10))
plt.show()

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Sidenote</b></h5>
  If you enjoy the variety of colors used by <code>matplotlib</code> in its charts, you might want to bookmark this:<br>
  <a href="https://matplotlib.org/stable/gallery/color/named_colors.html" target="_blank">https://matplotlib.org/stable/gallery/color/named_colors.html</a>
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>To-Dos for You</b></h5>
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Review the <code>Item</code> class and ensure you're comfortable with how it works</li>
    <li>Examine a few <code>Item</code> objects — check the training prompt via <code>item.prompt</code> and test prompt with <code>item.test_prompt()</code></li>
    <li>Create additional histograms to explore the dataset’s distribution and structure</li>
  </ul>
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Coming Up Next</b></h5>
  We’ll expand the dataset to include additional product categories like <b>Electronics</b> and <b>Automotive</b>.<br>
  This will allow us to work with a larger and more diverse dataset, enabling better selection of a balanced and high-quality training set.
</div>


<div style="font-size: 12px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.4em;">Examine a few <code>Item</code> objects — look at the training prompt via <code>item.prompt</code> and the test prompt with <code>item.test_prompt()</code></h5>
</div>

In [None]:
# Test Case
# Look at the training prompt for an item
print(items[0].prompt)

# Look at the test prompt for the same item (price removed)
print(items[0].test_prompt())

# You can repeat for more items, e.g. items[1], items[100]
print(items[1].prompt)
print(items[1].test_prompt())
print(items[100].prompt)
print(items[100].test_prompt())

<br>

<br>

#### <code>**day2.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Scale up data curation by combining multiple product categories. Balance the dataset by <b>price range</b> and <b>category distribution</b>, and save final <code>train.pkl</code> and <code>test.pkl</code> files for reuse.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.2em; font-size: 17px;"><b>The Product Pricer Continued</b></h4>
  A model that can estimate how much something costs, from its description.
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Data Curation Part 2</b></h5>
  Today we’ll extend our dataset to achieve broader coverage and refine it into a high-quality training dataset.
  Data curation might not feel as exciting as model tuning or inference, but it’s a critical part of the LLM engineer’s responsibilities. Mastering this process allows you to build robust commercial solutions grounded in carefully curated data.
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <b>Dataset Source:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023</a><br>
  <b>Meta Categories Folder:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories</a>
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Important Note – Please Read First</b></h5>
  We’re about to build a large dataset of <b>400,000 items</b> across multiple product types.
  In <b>Week 7</b>, we’ll use this dataset to train our own model. Depending on your GPU, training might take <b>20+ hours</b> and may cost several dollars in compute units.
  If you prefer a <b>quicker, lower-cost alternative</b>, use a smaller dataset focused solely on <b>Home Appliances</b>. This covers the same learning goals with slightly reduced accuracy.
  Use <code>lite.ipynb</code> for the smaller dataset.
  Alternatively, you can skip the curation step by downloading the preprocessed <code>.pkl</code> files:
  <a href="https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW" target="_blank">Download Pickle Files</a>
</div>


In [None]:
import os
import random
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import numpy as np
import pickle

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
from loaders import ItemLoader
from items import Item

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>The ItemLoader Code</b></h5>
Look inside <code>loaders.py</code> — there's some helpful utility code there to make our work easier.
</div>

In [None]:
# Load in the same dataset as last time
items = ItemLoader("Appliances").load()

In [None]:
# Look for a familiar item..
print(items[1].prompt)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Now to SCALE UP</b></h5>
Let’s explore all datasets that include items typically found in a large home retail store — such as electrical, electronic, and office-related products — but <b>excluding</b> categories like clothes, beauty, and books.
</div>

In [None]:
dataset_names = [
    "Automotive",
    "Electronics",
    "Office_Products",
    "Tools_and_Home_Improvement",
    "Cell_Phones_and_Accessories",
    "Toys_and_Games",
    "Appliances",
    "Musical_Instruments",
]

In [None]:
# Download all datasets & load into items

items = []
for dataset_name in dataset_names:
    loader = ItemLoader(dataset_name)
    items.extend(loader.load())

# Now, time for a coffee break!!
# By the way, I put the biggest datasets first.. it gets faster.

In [None]:
print(f"A grand total of {len(items):,} items")

In [None]:
# Plot the distribution of token counts again
tokens = [item.token_count for item in items]

if tokens:
    plt.figure(figsize=(15, 6))
    plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
    plt.xlabel('Length (tokens)')
    plt.ylabel('Count')
    plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))
    plt.show()
else:
    print("No items to plot. The 'items' list is empty.")

In [None]:
# Plot the distribution of prices
prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))
plt.show()

In [None]:
category_counts = Counter()
for item in items:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Bar chart by category
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="goldenrod")
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Objective</b></h5>
  Craft a dataset that is more balanced in terms of pricing. Aim to:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Reduce the dominance of low-cost items</li>
    <li>Raise the average price above <b>$60</b></li>
    <li>Balance category representation, specifically by <b>reducing Automotive items</b></li>
  </ul>
</div>


In [None]:
# Create a dict with a key of each price from $1 to $999
# And in the value, put a list of items with that price (to nearest round number)
slots = defaultdict(list)
for item in items:
    slots[round(item.price)].append(item)

In [None]:
# Create a dataset called "sample" which tries to more evenly take from the range of prices
# And gives more weight to items from categories other than Automotive
# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)
sample = []
for i in range(1, 1000):
    slot = slots[i]
    if i>=240:
        sample.extend(slot)
    elif len(slot) <= 1200:
        sample.extend(slot)
    else:
        weights = np.array([1 if item.category=='Automotive' else 5 for item in slot])
        weights = weights / np.sum(weights)
        selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)
        selected = [slot[i] for i in selected_indices]
        sample.extend(selected)

print(f"There are {len(sample):,} items in the sample")
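
The weighting step can be hard to picture, so here is a small standard-library sketch of the same idea: bucket items by rounded price, keep small buckets whole, and draw a weighted sample without replacement from crowded ones. Note the swap: instead of `np.random.choice`, it uses the Efraimidis-Spirakis key trick (`random() ** (1 / weight)`, keep the largest keys). The toy items (plain `(category, price)` tuples), the cap of 3, and the helper names are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def weighted_sample(pool, weights, k, rng):
    """Draw k distinct items; higher weight means a higher chance of selection."""
    keyed = [(rng.random() ** (1.0 / w), item) for item, w in zip(pool, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]


def balance(items, cap, rng):
    """Bucket (category, price) pairs by rounded price and cap crowded buckets."""
    slots = defaultdict(list)
    for category, price in items:
        slots[round(price)].append((category, price))
    balanced = []
    for dollars in sorted(slots):
        slot = slots[dollars]
        if len(slot) <= cap:
            balanced.extend(slot)  # small bucket: keep everything
        else:
            # Automotive gets weight 1, everything else weight 5
            weights = [1 if category == "Automotive" else 5 for category, _ in slot]
            balanced.extend(weighted_sample(slot, weights, cap, rng))
    return balanced


rng = random.Random(42)
toy_items = [("Automotive", 10.2)] * 8 + [("Electronics", 9.8)] * 4 + [("Appliances", 250.0)]
toy_balanced = balance(toy_items, cap=3, rng=rng)

print(f"Kept {len(toy_balanced)} of {len(toy_items)} toy items")
print(Counter(category for category, _ in toy_balanced))
```

Passing in a seeded `random.Random` keeps the draw reproducible without touching the global random state.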

In [None]:
# Plot the distribution of prices in sample
prices = [float(item.price) for item in sample]
plt.figure(figsize=(15, 10))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

In [None]:
# OK, we did well in terms of raising the average price and having a smooth-ish population of prices
# Let's see the categories

category_counts = Counter()
for item in sample:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Create bar chart
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="lightgreen")

# Customize the chart
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

In [None]:
# Automotive still in the lead, but improved somewhat
# For another perspective, let's look at a pie

plt.figure(figsize=(12, 10))
plt.pie(counts, labels=categories, autopct='%1.0f%%', startangle=90)

# Add a circle at the center to create a donut chart (optional)
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Categories')

# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')  

plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Dataset Curated! ✅</b></h5>
  We've crafted an excellent dataset. Let's do some final checks.
</div>


In [None]:
# How does the price vary with the character count of the prompt?

sizes = [len(item.prompt) for item in sample]
prices = [item.price for item in sample]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(sizes, prices, s=0.2, color="red")

# Add labels and title
plt.xlabel('Size')
plt.ylabel('Price')
plt.title('Is there a simple correlation?')

# Display the plot
plt.show()

In [None]:
def report(item):
    prompt = item.prompt
    tokens = Item.tokenizer.encode(item.prompt)
    print(prompt)
    print(tokens[-10:])
    print(Item.tokenizer.batch_decode(tokens[-10:]))

In [None]:
report(sample[398000])

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Observation</b></h5>
  An interesting behavior of the <b>Llama tokenizer</b> is that every number from <b>1 to 999</b> is mapped to a <b>single token</b>, similar to what we observed with <code>gpt-4o</code>.<br>
  In contrast, models like <code>qwen2</code>, <code>gemma</code>, and <code>phi3</code> tokenize each digit separately.<br>
  While this isn’t a strict requirement, it does provide a slight advantage in our project by making numerical input more compact.
</div>


<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Finally</b></h5>
  It’s time to split our curated data into <b>training</b> and <b>test</b> sets.<br>
  While it’s typical to allocate around <b>5%–10%</b> of the data for testing, we currently have more data than we need. So we’ll use:
  <ul style="margin: 0.2em 0; padding-left: 1.5em;">
    <li><b>400,000 samples</b> for training</li>
    <li><b>2,000 samples</b> for testing</li>
  </ul>
  We may not use all the test samples immediately, but they will be helpful for evaluation as we iterate on our models.
</div>


In [None]:
random.seed(42)
random.shuffle(sample)
train = sample[:400_000]
test = sample[400_000:402_000]
print(f"Divided into a training set of {len(train):,} items and test set of {len(test):,} items")
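
One thing worth knowing about the slicing above: Python slices never raise when the list is too short, so a smaller-than-expected sample would quietly produce a short training or test set. A minimal sketch of a guarded split follows; the helper name `split_train_test` and the toy sizes are assumptions for illustration.

```python
import random


def split_train_test(pool, train_size, test_size, seed=42):
    """Shuffle a copy of pool and split it, failing loudly if it is too small."""
    needed = train_size + test_size
    if len(pool) < needed:
        raise ValueError(f"Need {needed:,} items but only have {len(pool):,}")
    shuffled = list(pool)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:needed]


toy_pool = list(range(100))
toy_train, toy_test = split_train_test(toy_pool, train_size=80, test_size=10)
print(len(toy_train), len(toy_test))

# Plain slicing on a short list silently returns fewer items than requested
short_pool = list(range(50))
print(len(short_pool[:80]), len(short_pool[80:90]))
```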

In [None]:
print(train[0].prompt)

In [None]:
print(test[0].test_prompt())

In [None]:
# Plot the distribution of prices in the first 250 test points

prices = [float(item.price) for item in test[:250]]
plt.figure(figsize=(15, 6))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Finally</b></h5>
  Convert your curated dataset into prompt format and upload it to the <b>HuggingFace Hub</b>.<br>
  This ensures your dataset is accessible for training and testing, and can be reused or shared across projects.<br>
  Make sure to include:
  <ul style="margin: 0.2em 0; padding-left: 1.5em;">
    <li>Correct formatting (e.g., <code>text</code> and <code>price</code> fields)</li>
    <li>A clear dataset card with description and metadata</li>
    <li>Proper visibility settings (public or private, as needed)</li>
  </ul>
</div>


In [None]:
train_prompts = [item.prompt for item in train]
train_prices = [item.price for item in train]
test_prompts = [item.test_prompt() for item in test]
test_prices = [item.price for item in test]

In [None]:
# Create a Dataset from the lists

train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

In [None]:
# Uncomment these lines if you're ready to push to the hub, and replace my name with your HF username

# HF_USER = "ed-donner"
# DATASET_NAME = f"{HF_USER}/pricer-data"
# dataset.push_to_hub(DATASET_NAME, private=True)

In [None]:
# One more thing!
# Let's pickle the training and test dataset so we don't have to execute all this code next time!

with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>To-Dos</b></h5>
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Investigate the dataset more!</li>
    <li>Confirm that the tokenizer tokenizes all 3-digit prices into a single token</li>
  </ul>
</div>


<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em;"><b>1/</b> Investigate the dataset more </h5>
</div>


In [None]:
# View details of the first 5 items in the sample
for i in range(5):
    # print(sample[i].prompt)
    print("Product name:", sample[i].title)
    print("Price:", sample[i].price)
    print("Category:", sample[i].category)
    print("Token count:", sample[i].token_count)

    print("")

    # print("="*40)

<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em;"><b>2/</b> Confirm that the tokenizer tokenizes all 3-digit prices into a single token </h5>
</div>


In [None]:
# Check that the tokenizer tokenizes all 3-digit prices into a single token
# This is a simple test case to confirm the tokenizer's behavior with 3-digit prices
for price in [100, 123, 456, 789, 999]:
    tokens = Item.tokenizer.encode(str(price), add_special_tokens=False)
    print(f"Price: {price} -> Tokens: {tokens} (Length: {len(tokens)})")

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Note: <code>add_special_tokens=False</code> in Tokenizer</b></h5>
  When calling the <code>encode()</code> method of a tokenizer with <code>add_special_tokens=False</code>, special tokens like <code>&lt;bos&gt;</code>, <code>&lt;eos&gt;</code>, or <code>&lt;pad&gt;</code> will not be added to the encoded output.
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li><b>Purpose:</b> If you only want to tokenize the exact input (e.g., the number <code>"123"</code>), the result will contain only the token for <code>"123"</code>.</li>
    <li>If left as default (<code>add_special_tokens=True</code>), the tokenizer may automatically prepend or append special tokens, leading to more tokens and distorting your token count.</li>
  </ul>
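  <b>Examples</b> (illustrative; the exact IDs and which special tokens are added depend on the tokenizer, and the Llama tokenizer typically prepends only a beginning-of-text token):<br>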
  <code>encode("123", add_special_tokens=True)</code> → <code>[&lt;bos&gt;, 4513, &lt;eos&gt;]</code> (3 tokens)<br>
  <code>encode("123", add_special_tokens=False)</code> → <code>[4513]</code> (just the token for 123)
</div>
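
The spot check above covers five prices. To cover the whole range, a small helper can loop over every value and report the ones that need more than one token. The sketch below takes any `encode` callable so it runs on its own; `toy_encode`, which splits a number into chunks of up to three characters, is a made-up stand-in. In the notebook you would pass something like `lambda s: Item.tokenizer.encode(s, add_special_tokens=False)` instead.

```python
def multi_token_prices(encode, prices):
    """Return the prices whose string form encodes to more than one token."""
    return [price for price in prices if len(encode(str(price))) != 1]


def toy_encode(text):
    """Made-up encoder: one 'token' per chunk of up to three characters."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]


offenders = multi_token_prices(toy_encode, range(1, 1000))
print(f"Prices from 1 to 999 needing more than one token: {offenders}")

# Four-digit and longer values no longer fit in a single chunk with this toy encoder
print(multi_token_prices(toy_encode, [999, 1000, 12345]))
```

An empty list from the first call is the result to look for when the real tokenizer is plugged in.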


<br>

#### <code>**day3.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Build baseline models for the Product Pricer using classical ML approaches: <b>Linear Regression</b>, <b>Bag of Words</b>, <b>Word2Vec</b>, <b>Support Vector Machine</b>, and <b>Random Forest</b>. Benchmark performance using a custom <code>Tester</code> class.
</div>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.2em; font-size: 18px;"><b>The Product Pricer Continued</b></h4>
  A model that can estimate how much something costs, from its description.
</div>

<div style="font-size: 14px; line-height: 1.5; margin: 0.5em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Baseline Models</b></h5>
  Today we work on the simplest models to act as a starting point that we will beat.
</div>


In [None]:
import os
import math
import json
import random
import pickle
from collections import Counter

from dotenv import load_dotenv
from huggingface_hub import login
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>NLP Imports</b></h5>
  In the next cell, we import additional packages for NLP-related machine learning tasks.<br>
  If the <code>gensim</code> import raises an error such as:<br>
  <i>"Cannot import name 'triu' from 'scipy.linalg'"</i><br>
  Please fix it by running the following command in a separate cell:<br>
  <code>!pip install "scipy&lt;1.13"</code><br>
  This issue is discussed in detail on StackOverflow: 
  <a href="https://stackoverflow.com/questions/78279136/importerror-cannot-import-name-triu-from-scipy-linalg-when-importing-gens" target="_blank">link to fix</a>.<br>
  Special thanks to students <b>Arnaldo G</b> and <b>Ard V</b> for identifying and resolving this issue.
</div>


<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Sidenote: NLP (Natural Language Processing)</b></h5>
  NLP stands for <b>Natural Language Processing</b>. It is a field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language.
  NLP imports refer to Python packages and modules that provide tools and algorithms for working with text data. These typically include libraries for:
  <ul style="margin: 0.4em 0; padding-left: 1.5em; font-size: 14px;">
    <li>Text preprocessing (tokenization, stemming, lemmatization)</li>
    <li>Feature extraction (Bag of Words, TF-IDF, embeddings)</li>
    <li>Machine learning models for text (classification, regression, clustering)</li>
    <li>Utilities for handling and visualizing text data</li>
  </ul>
  <b  style="margin-bottom: 0.2em; font-size: 16px;">Examples of common NLP imports:</b>
  <pre style="margin: 0; padding: 0.6em; font-size: 14px;">
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModel
  </pre>
</div>
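
To show what Bag of Words feature extraction means in practice, here is a tiny standard-library sketch: build a vocabulary from a few documents, then turn each document into a vector of word counts. It is a simplified stand-in for what `CountVectorizer` does (no stop words, n-grams, or sparse matrices), and the sample texts are made up.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real vectorizers do more cleaning."""
    return text.lower().split()


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["Steel dryer belt", "Dryer vent hose and dryer clamp"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded entirely; each document is reduced to how often each vocabulary word appears, which is what a regression model can then consume as numeric features.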


In [None]:
# NLP-related Imports
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

In [None]:
# More imports for advanced ML
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Constants - used for printing to stdout in color
GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red":RED, "orange": YELLOW, "green": GREEN}

In [None]:
from dotenv import load_dotenv

# Environment
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
# Log in to HuggingFace
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# One more import after logging in
from items import Item

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px"><b>Loading the <code>.pkl</code> Files</b></h5>
  Let’s avoid curating all our data again! Load the previously saved pickle files instead.<br><br>
  If you didn’t create these in <b>Day 2</b>, you can download them from the instructor’s Google Drive (you’ll also find the slides here):<br>
  <a href="https://drive.google.com/drive/folders/1JwNorpRHdnf_pU0GE5yYtfKlyrKC3CoV?usp=sharing" target="_blank">https://drive.google.com/drive/folders/1JwNorpRHdnf_pU0GE5yYtfKlyrKC3CoV?usp=sharing</a><br><br>
  <b>Note:</b> The files are quite large — you may want to grab a coffee!
</div>


In [None]:
with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

with open('test.pkl', 'rb') as file:
    test = pickle.load(file)

In [None]:
# Remind ourselves of the training prompt
print(train[0].prompt)

In [None]:
# And the ground-truth price of that same training item
print(train[0].price)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;"><h5 style="margin-bottom: 0.2em;"><b>Unveiling a Mighty Script That We Will Use a Lot!</b></h5>A rather pleasing <b>Test Harness</b> that evaluates any model against <b>250 items</b> from the test set, and shows the results in a visually satisfying way.<br>You write a function of this form:<pre style="padding: 0.5em; margin: 0.5em 0;"><code>def my_prediction_function(item):
    # my code here
    return my_estimate</code></pre>And then you call:<br><code>Tester.test(my_prediction_function)</code><br>to evaluate your model.</div>


In [None]:
class Tester:

    def __init__(self, predictor, title=None, data=test, size=250):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.guesses = []
        self.truths = []
        self.errors = []
        self.sles = []
        self.colors = []

    def color_for(self, error, truth):
        if error<40 or error/truth < 0.2:
            return "green"
        elif error<80 or error/truth < 0.4:
            return "orange"
        else:
            return "red"
    
    def run_datapoint(self, i):
        datapoint = self.data[i]
        guess = self.predictor(datapoint)
        truth = datapoint.price
        error = abs(guess - truth)
        log_error = math.log(truth+1) - math.log(guess+1)
        sle = log_error ** 2
        color = self.color_for(error, truth)
        title = datapoint.title if len(datapoint.title) <= 40 else datapoint.title[:40]+"..."
        self.guesses.append(guess)
        self.truths.append(truth)
        self.errors.append(error)
        self.sles.append(sle)
        self.colors.append(color)
        print(f"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}")

    def chart(self, title):
        max_error = max(self.errors)
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.truths), max(self.guesses))
        plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)
        plt.scatter(self.truths, self.guesses, s=3, c=self.colors)
        plt.xlabel('Ground Truth')
        plt.ylabel('Model Estimate')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        plt.title(title)
        plt.show()

    def report(self):
        average_error = sum(self.errors) / self.size
        rmsle = math.sqrt(sum(self.sles) / self.size)
        hits = sum(1 for color in self.colors if color=="green")
        title = f"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%"
        self.chart(title)

    def run(self):
        self.error = 0
        for i in range(self.size):
            self.run_datapoint(i)
        self.report()

    @classmethod
    def test(cls, function):
        cls(function).run()
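
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
To see what the <code>Tester</code> actually reports, the sketch below recomputes its three summary numbers (average absolute error, RMSLE, and the share of "green" hits) on three made-up guess/truth pairs. The thresholds are copied from <code>color_for</code> above; the helper name <code>summarize</code> and the sample pairs are my own assumptions for illustration.
</div>

```python
import math

def color_for(error, truth):
    # Same thresholds as Tester.color_for: absolute OR relative error decides the color
    if error < 40 or error / truth < 0.2:
        return "green"
    elif error < 80 or error / truth < 0.4:
        return "orange"
    return "red"

def summarize(pairs):
    """pairs is a list of (guess, truth) tuples; returns the Tester-style summary."""
    errors = [abs(guess - truth) for guess, truth in pairs]
    # Squared log error, using log(x + 1) so that small prices do not blow up
    sles = [(math.log(truth + 1) - math.log(guess + 1)) ** 2 for guess, truth in pairs]
    colors = [color_for(error, truth) for error, (_, truth) in zip(errors, pairs)]
    n = len(pairs)
    return {
        "average_error": sum(errors) / n,
        "rmsle": math.sqrt(sum(sles) / n),
        "hit_rate": colors.count("green") / n,
        "colors": colors,
    }

# Hypothetical (guess, truth) pairs chosen to land in each color band
sample_pairs = [(100.0, 120.0), (100.0, 160.0), (50.0, 150.0)]
summary = summarize(sample_pairs)
print(summary)
```

Because RMSLE works on log prices, being off by $100 on a $150 item hurts far more than being off by $100 on a $900 item, which the plain average error does not capture.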

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em;"><b>Now for Something Basic</b></h5>
  What's the very simplest model you could imagine?<br>
  Let's start with a <b>random number generator!</b>
</div>


In [None]:
def random_pricer(item):
    return random.randrange(1,1000)

In [None]:
# Set the random seed
random.seed(42)

# Run our Tester
Tester.test(random_pricer)

In [None]:
# That was fun!
# We can do better - here's another rather trivial model

training_prices = [item.price for item in train]
training_average = sum(training_prices) / len(training_prices)

def constant_pricer(item):
    return training_average

In [None]:
# Run our constant predictor
Tester.test(constant_pricer)

In [None]:
train[0].details

In [None]:
# Create a new "features" field on items, and populate it with the JSON parsed from the details string

for item in train:
    item.features = json.loads(item.details)
for item in test:
    item.features = json.loads(item.details)

# Look at one

In [None]:
train[0].features.keys()

In [None]:
# Look at the 40 most common features in the training set

feature_count = Counter()
for item in train:
    for f in item.features.keys():
        feature_count[f]+=1

feature_count.most_common(40)

In [None]:
# Now some janky code to pluck out the Item Weight
# Don't worry too much about this: spoiler alert, it's not going to be much use in training!

def get_weight(item):
    weight_str = item.features.get('Item Weight')
    if weight_str:
        parts = weight_str.split(' ')
        amount = float(parts[0])
        unit = parts[1].lower()
        if unit=="pounds":
            return amount
        elif unit=="ounces":
            return amount / 16
        elif unit=="grams":
            return amount / 453.592
        elif unit=="milligrams":
            return amount / 453592
        elif unit=="kilograms":
            return amount / 0.453592
        elif unit=="hundredths" and parts[2].lower()=="pounds":
            return amount / 100
        else:
            print(weight_str)
    return None

In [None]:
weights = [get_weight(t) for t in train]
weights = [w for w in weights if w]

In [None]:
average_weight = sum(weights)/len(weights)
average_weight

In [None]:
def get_weight_with_default(item):
    weight = get_weight(item)
    return weight or average_weight

In [None]:
def get_rank(item):
    rank_dict = item.features.get("Best Sellers Rank")
    if rank_dict:
        ranks = rank_dict.values()
        return sum(ranks)/len(ranks)
    return None

In [None]:
ranks = [get_rank(t) for t in train]
ranks = [r for r in ranks if r]
average_rank = sum(ranks)/len(ranks)
average_rank

In [None]:
def get_rank_with_default(item):
    rank = get_rank(item)
    return rank or average_rank

In [None]:
def get_text_length(item):
    return len(item.test_prompt())

In [None]:
# investigate the brands

brands = Counter()
for t in train:
    brand = t.features.get("Brand")
    if brand:
        brands[brand]+=1

# Look at most common 40 brands

brands.most_common(40)

In [None]:
TOP_ELECTRONICS_BRANDS = ["hp", "dell", "lenovo", "samsung", "asus", "sony", "canon", "apple", "intel"]
def is_top_electronics_brand(item):
    brand = item.features.get("Brand")
    return brand and brand.lower() in TOP_ELECTRONICS_BRANDS

In [None]:
def get_features(item):
    return {
        "weight": get_weight_with_default(item),
        "rank": get_rank_with_default(item),
        "text_length": get_text_length(item),
        "is_top_electronics_brand": 1 if is_top_electronics_brand(item) else 0
    }

In [None]:
# Look at features in a training item
get_features(train[0])

In [None]:
# A utility function to convert our features into a pandas dataframe

def list_to_dataframe(items):
    features = [get_features(item) for item in items]
    df = pd.DataFrame(features)
    df['price'] = [item.price for item in items]
    return df

train_df = list_to_dataframe(train)
test_df = list_to_dataframe(test[:250])

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em;"><b>Traditional Linear Regression</b></h5>
  This section implements a classic machine learning approach, Linear Regression, to predict product prices from a small set of numeric features.
  <br><br>
  <b>Why "Traditional"?</b><br>
  <ul style="margin: 0.5em 0; padding-left: 1.5em;">
    <li>Linear Regression is one of the oldest and most fundamental algorithms in statistics and machine learning.</li>
    <li>It models the relationship between a dependent variable (here, price) and one or more independent variables (here, weight, rank, text_length, is_top_electronics_brand) by fitting a linear equation to observed data.</li>
    <li>It does not use neural networks, embeddings, or advanced NLP techniques - just numeric features and a linear model.</li>
  </ul>
  <b>In context:</b><br>
  This notebook benchmarks <i>"traditional"</i> ML models like Linear Regression before moving on to more advanced or modern approaches (such as Bag of Words, Word2Vec, Random Forest, or LLMs). This provides a baseline for comparison.<br><br>
</div>
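
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
"Fitting a linear equation to observed data" can be shown by hand for a single feature. The sketch below computes the least-squares slope and intercept in plain Python on invented weight/price numbers; the notebook itself uses scikit-learn's <code>LinearRegression</code> with four features, so this is only the one-variable special case. <code>fit_line</code> and the toy data are assumptions for illustration.
</div>

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data that lies exactly on price = 10 + 2 * weight
toy_weights = [1.0, 2.0, 3.0, 4.0]
toy_prices = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(toy_weights, toy_prices)
predicted = intercept + slope * 5.0  # price estimate for a 5-pound item
print(slope, intercept, predicted)
```

With real product data the points do not sit on a line, and the fitted coefficients printed by the next cell show how little each feature explains on its own.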


In [None]:
# Traditional Linear Regression

np.random.seed(42)

# Separate features and target
feature_columns = ['weight', 'rank', 'text_length', 'is_top_electronics_brand']

X_train = train_df[feature_columns]
y_train = train_df['price']
X_test = test_df[feature_columns]
y_test = test_df['price']

# Train a Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

for feature, coef in zip(feature_columns, model.coef_):
    print(f"{feature}: {coef}")
print(f"Intercept: {model.intercept_}")

# Predict the test set and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

In [None]:
# Function to predict price for a new item

def linear_regression_pricer(item):
    features = get_features(item)
    features_df = pd.DataFrame([features])
    return model.predict(features_df)[0]

In [None]:
# test it

Tester.test(linear_regression_pricer)

In [None]:
# For the next few models, we prepare our documents and prices
# Note that we use the test prompt for the documents, otherwise we'll reveal the answer!!

prices = np.array([float(item.price) for item in train])
documents = [item.test_prompt() for item in train]

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em;"><b>Bag of Words (BoW)</b></h5>
  The approach below implements a <b>Bag of Words</b> model, one of the earliest and simplest techniques for extracting features from text in natural language processing.
  <br><br>
  <b>What is Bag of Words?</b><br>
  <ul style="margin: 0.5em 0; padding-left: 1.5em;">
    <li>It converts text into numerical feature vectors by counting word occurrences, ignoring grammar and word order.</li>
    <li>The result is a sparse representation of the text, where each feature corresponds to a word in the vocabulary.</li>
    <li>It is commonly used with linear models, such as Logistic or Linear Regression, for classification or regression tasks.</li>
  </ul>
  <b>In context:</b><br>
  BoW serves as a strong baseline for NLP tasks, despite its simplicity. In this notebook, it is used as a step up from pure numeric features, incorporating actual text data into the model. While it lacks semantic understanding, BoW often delivers surprisingly solid results and helps establish a benchmark before progressing to more complex techniques like Word2Vec or transformer-based models.
</div>
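
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
The counting idea can be sketched with the standard library alone. This is a simplified stand-in for <code>CountVectorizer</code>: it splits on whitespace and applies no stop-word removal or <code>max_features</code> cap. The helper names and the two toy documents are made up for illustration.
</div>

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of every distinct lowercase word across the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def count_vector(document, vocabulary):
    """One count per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

toy_documents = ["stainless steel kettle", "steel water bottle steel"]
vocabulary = build_vocabulary(toy_documents)
vectors = [count_vector(doc, vocabulary) for doc in toy_documents]

# Word order is discarded: a shuffled document yields the same vector
shuffled = count_vector("kettle steel stainless", vocabulary)
print(vocabulary, vectors, shuffled)
```

Each row of counts plays the role of one row of the sparse matrix <code>X</code> that the regression is fitted on.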


In [None]:
# Use the CountVectorizer for a Bag of Words model

np.random.seed(42)
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(documents)
regressor = LinearRegression()
regressor.fit(X, prices)

In [None]:
def bow_lr_pricer(item):
    x = vectorizer.transform([item.test_prompt()])
    return max(regressor.predict(x)[0], 0)

In [None]:
# test it

Tester.test(bow_lr_pricer)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Word2Vec Model</b></h5>
The approach below uses Word2Vec to embed each word in the product description as a dense vector (400 dimensions here), then averages these vectors per document and feeds the result to a regression model to predict prices.<br><br>
<b>Why "Word2Vec"?</b><br>
Word2Vec is a word embedding model that captures semantic relationships between words based on their context in large corpora. Unlike Bag of Words, it encodes meaning and similarity between words.<br><br>
- Each word is mapped to a dense vector based on its usage.<br>
- Vectors from a product’s description are typically averaged or pooled to represent the entire item.<br>
- The resulting feature vectors are then used in downstream models like regression.<br><br>
<b>In context:</b><br>
This model bridges the gap between traditional BoW and modern transformer-based approaches. It shows how adding semantic information through word embeddings can improve performance over frequency-based methods. It also sets the stage for evaluating even more powerful models like LLMs later in the notebook.
</div>


In [None]:
# The amazing word2vec model, implemented in gensim NLP library

np.random.seed(42)

# Preprocess the documents
processed_docs = [simple_preprocess(doc) for doc in documents]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=processed_docs, vector_size=400, window=5, min_count=1, workers=8)

In [None]:
# This step of averaging vectors across the document is a weakness in our approach

def document_vector(doc):
    doc_words = simple_preprocess(doc)
    word_vectors = [w2v_model.wv[word] for word in doc_words if word in w2v_model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(w2v_model.vector_size)

# Create feature matrix
X_w2v = np.array([document_vector(doc) for doc in documents])
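
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
Why averaging is a weakness becomes clearer with a tiny example. The sketch below mirrors <code>document_vector</code> using plain lists and a hand-written two-dimensional embedding table. The vectors in <code>toy_vectors</code> are invented, not taken from a trained Word2Vec model.
</div>

```python
# Invented 2-dimensional "embeddings", purely for illustration
toy_vectors = {"steel": [1.0, 0.0], "kettle": [0.0, 1.0], "bottle": [0.0, 3.0]}

def average_vector(words, vectors, size=2):
    """Mean of the known word vectors; a zero vector if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

a = average_vector(["steel", "kettle"], toy_vectors)
b = average_vector(["kettle", "steel", "gizmo"], toy_vectors)  # reordered, plus an unknown word
c = average_vector(["gizmo"], toy_vectors)                     # nothing known
print(a, b, c)
```

The first two documents collapse to the same vector: word order and out-of-vocabulary words leave no trace, so the regressor downstream cannot tell them apart.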

In [None]:
# Run Linear Regression on word2vec

word2vec_lr_regressor = LinearRegression()
word2vec_lr_regressor.fit(X_w2v, prices)

In [None]:
def word2vec_lr_pricer(item):
    doc = item.test_prompt()
    doc_vector = document_vector(doc)
    return max(0, word2vec_lr_regressor.predict([doc_vector])[0])

In [None]:
Tester.test(word2vec_lr_pricer)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Support Vector Machine (SVM) Model</b></h5>
The approach below applies a Support Vector Machine regression model (SVR) to predict product prices from the averaged Word2Vec document vectors built above.<br><br>
<b>Why "Support Vector Machine"?</b><br>
SVM is a powerful and versatile supervised learning algorithm that works well for both classification and regression tasks.<br><br>
- In regression, SVR tries to fit the best possible function within a margin of tolerance from the actual data points.<br>
- It is effective for handling high-dimensional feature spaces, and it can capture non-linear relationships when used with appropriate kernels.<br>
- In this notebook, the model is <code>LinearSVR</code>, which has no kernel and therefore fits a linear function; its input is the Word2Vec vectors (<code>X_w2v</code>), not the weight, rank, text length, and brand features used for Linear Regression.<br><br>
<b>In context:</b><br>
This model represents another baseline from traditional machine learning. It lets us see how a different loss function and fitting procedure perform on the same Word2Vec features before moving on to Random Forest and LLMs. A kernel SVR could model non-linear patterns, but it scales poorly to a training set of this size.
</div>
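
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
The "margin of tolerance" mentioned above corresponds to the epsilon-insensitive loss used in support vector regression: errors smaller than epsilon cost nothing, larger errors cost only the excess. A minimal sketch, with an epsilon value and prices chosen arbitrarily for illustration:
</div>

```python
def epsilon_insensitive_loss(truth, prediction, epsilon):
    """Zero inside the tolerance band, linear in the excess outside it."""
    return max(0.0, abs(truth - prediction) - epsilon)

# Hypothetical item priced at $100, three different predictions, tolerance of $5
losses = [epsilon_insensitive_loss(100.0, p, epsilon=5.0) for p in (97.0, 105.0, 112.0)]
print(losses)
```

Because the penalty grows linearly rather than quadratically, a single wildly mispriced item pulls the fit less than it would under the squared error that Linear Regression minimizes.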


In [None]:
# Support Vector Machines

np.random.seed(42)
svr_regressor = LinearSVR()

svr_regressor.fit(X_w2v, prices)

In [None]:
def svr_pricer(item):
    np.random.seed(42)
    doc = item.test_prompt()
    doc_vector = document_vector(doc)
    return max(float(svr_regressor.predict([doc_vector])[0]),0)

In [None]:
Tester.test(svr_pricer)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Random Forest Model</b></h5>
The approach below uses a <b>Random Forest Regressor</b> to estimate product prices from the same averaged Word2Vec document vectors (<code>X_w2v</code>) used for the SVR model.<br><br>
<b>Why "Random Forest"?</b><br>
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.<br><br>
- It is robust to noise and outliers, and works well with both linear and non-linear data.<br>
- Each decision tree in the forest is trained on a random subset of the data (with replacement), which adds diversity to the ensemble.<br>
- Predictions are typically made by averaging the outputs of all trees (in regression tasks).<br><br>
<b>In context:</b><br>
Random Forest provides a powerful baseline for structured data. In this notebook, it's used to benchmark performance against other traditional models like Linear Regression, SVM, and modern techniques such as embeddings or LLMs. Its strength lies in its ability to model complex feature interactions without much preprocessing.
</div>
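
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
The two ingredients described above, sampling with replacement and averaging the outputs, can be sketched without any ML library. Note the loud simplification: each "learner" here just predicts the mean price of its bootstrap sample, whereas a real forest grows a decision tree on each sample. This shows bagging only, not a Random Forest. All names and numbers are invented for illustration.
</div>

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so some repeat and some are left out."""
    return rng.choices(rows, k=len(rows))

def bagged_mean_prediction(prices, n_learners=100, seed=42):
    """Average the outputs of many stand-in learners, each fitted on its own sample."""
    rng = random.Random(seed)
    learner_outputs = []
    for _ in range(n_learners):
        sample = bootstrap_sample(prices, rng)
        # Stand-in for a decision tree: predict the mean of the sample
        learner_outputs.append(sum(sample) / len(sample))
    return sum(learner_outputs) / n_learners

toy_prices = [10.0, 20.0, 30.0, 40.0, 200.0]
estimate = bagged_mean_prediction(toy_prices)
one_sample = bootstrap_sample(toy_prices, random.Random(0))
print(estimate, one_sample)
```

Individual samples may over- or under-represent the $200 outlier, but the average over many learners is steadier than any single one, which is the intuition behind the ensemble's robustness.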


In [None]:
# And the powerful Random Forest regression
# This usually takes about 10 minutes or more to train

rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=8)
rf_model.fit(X_w2v, prices)

In [None]:
def random_forest_pricer(item):
    doc = item.test_prompt()
    doc_vector = document_vector(doc)
    return max(0, rf_model.predict([doc_vector])[0])

In [None]:
Tester.test(random_forest_pricer)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>Summary of Regression Models</b></h5></div>


<div style="font-size: 12px; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>1. General Scenarios</b></h5></div>

<table style="font-size: 14px; line-height: 1.4; border-collapse: collapse; width: 100%; margin: 0; padding: 0;">
  <thead>
    <tr>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">Model</th>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">Main Input Type</th>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">How It Works (Simple)</th>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">Main Strengths</th>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">Main Weaknesses</th>
      <th style="border: 1px solid #ccc; padding: 6px; text-align: center;">When to Use?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Linear Regression</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Numeric features</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Finds a straight line that best fits the data</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Fast, simple, easy to explain</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Misses complex (non-linear) patterns</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Data is mostly numbers, simple trends</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Bag of Words (BoW)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Text (word counts)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Counts how often each word appears in the text</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Simple, works for basic text tasks</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Ignores word meaning/context</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Text data, quick baseline</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Word2Vec</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Text (embeddings)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Turns words into vectors that capture meaning</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Understands word meaning, context</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Needs lots of data, loses sentence info</td>
      <td style="border: 1px solid #ccc; padding: 6px;">When meaning of words matters</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">SVM (SVR)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Numeric/vectors</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Finds a boundary (can be curved) to fit the data</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Good for complex, non-linear data</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Slow with big data, hard to explain</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Medium-sized, tricky data</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Random Forest</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Numeric/vectors</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Combines many decision trees for better predictions</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Handles complex data, robust</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Can be slow, less interpretable</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Complex data, many features</td>
    </tr>
  </tbody>
</table>


<div style="font-size: 12px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>2. This notebook</b></h5></div>

<table style="font-size: 14px; line-height: 1.4; border-collapse: collapse; width: 100%; margin: 0; padding: 0;">
  <thead>
    <tr>
      <th style="border: 1px solid #ccc; padding: 6px;">Model</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Input Source</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Feature Extraction</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Setup Difficulty</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Accuracy (Typical)</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Practical Strengths</th>
      <th style="border: 1px solid #ccc; padding: 6px;">Practical Weaknesses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Linear Regression</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Numeric features (weight, etc.)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Uses product info (weight, rank, etc.)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Easy</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Low (baseline)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Fast, easy to understand</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Ignores text descriptions</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Bag of Words (BoW)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Product description text</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Counts words in product description</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Medium</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Medium</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Uses text, simple to try</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Misses word meaning/context</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Word2Vec</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Product description text</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Averages word vectors from Word2Vec</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Higher</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Higher than BoW</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Captures some meaning from text</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Loses sentence structure</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">SVM (SVR)</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Word2Vec vectors</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Uses Word2Vec output as input</td>
      <td style="border: 1px solid #ccc; padding: 6px;">High</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Good for non-linear</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Can model complex patterns</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Slow, not scalable to huge data</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 6px;">Random Forest</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Word2Vec vectors</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Uses Word2Vec output as input</td>
      <td style="border: 1px solid #ccc; padding: 6px;">High</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Often best</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Handles complex, messy data well</td>
      <td style="border: 1px solid #ccc; padding: 6px;">Hard to explain, uses more resources</td>
    </tr>
  </tbody>
</table>


<br>

<br>

#### <code>**day4.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Evaluate the performance of <b>Frontier LLMs</b> - including <b>GPT-4o-mini, GPT-4o, <s>Claude 3.5 Sonnet</s> and Llama3.2</b> - on the Product Pricer dataset. Compare their effectiveness against traditional ML models using the same test set.
</div>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h4 style="margin-bottom: 0.2em; font-size: 18px;"><b>The Product Pricer Continued</b></h4>
A model that can estimate how much something costs, from its description.
</div>
<br>
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Enter The Frontier!</b></h5>
And now – we put Frontier Models to the test.
</div>
<br>
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b>2 important points:</b><br>
It’s important to appreciate that we <b>aren’t training</b> the frontier models. We’re only providing them with the Test dataset to see how they perform. They don’t gain the benefit of the 400,000 training examples that we provided to the Traditional ML models.<br>
<br>
<b>HAVING SAID THAT...</b><br>
It’s entirely possible that in their monstrously large training data, they’ve already been exposed to all the products in the training AND the test set. So there could be test “contamination” here which gives them an unfair advantage. We should keep that in mind.
</div>


In [None]:
import os
import re
import math
import json
import random
from dotenv import load_dotenv
from huggingface_hub import login
import matplotlib.pyplot as plt
import numpy as np
import pickle
from collections import Counter
from openai import OpenAI
from anthropic import Anthropic
import ollama

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# moved our Tester into a separate package
# call it with Tester.test(function_name, test_dataset)

from items import Item
from testing import Tester

In [None]:
openai = OpenAI()
claude = Anthropic()
llama_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

In [None]:
# Let's avoid curating all our data again! Load in the pickle files:

with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

with open('test.pkl', 'rb') as file:
    test = pickle.load(file)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Before we look at the Frontier</b></h5>
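There is one more model we could consider: a human. The cells below write the test prompts to a CSV file, read hand-entered price guesses back in, and score them like any other model.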
</div>


In [None]:
# Write the test set to a CSV

import csv
with open('human_input.csv', 'w', encoding="utf-8", newline='') as csvfile:  # newline='' avoids blank rows on Windows
    writer = csv.writer(csvfile)
    for t in test[:250]:
        writer.writerow([t.test_prompt(), 0])

In [None]:
# Read it back in

human_predictions = []
with open('human_output.csv', 'r', encoding="utf-8", newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        human_predictions.append(float(row[1]))

In [None]:
def human_pricer(item):
    idx = test.index(item)
    return human_predictions[idx]

In [None]:
Tester.test(human_pricer, test)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>1/ GPT-4o-mini</b></h5>
Note: It’s called <i>mini</i>, but it packs a punch.
</div>

In [None]:
# First let's work on a good prompt for a Frontier model
# Notice that I'm removing the " to the nearest dollar"
# When we train our own models, we'll need to make the problem as easy as possible, 
# but a Frontier model needs no such simplification.

def messages_for(item):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [None]:
# Try this out

messages_for(test[0])
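
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
To show what the two <code>replace</code> calls do without loading the dataset, the sketch below applies the same logic to a plain string. The sample prompt is hypothetical: I am assuming a shape that contains the two substrings being removed, and the product text is invented. <code>build_messages</code> is an illustrative variant of <code>messages_for</code> that takes a string instead of an <code>Item</code>.
</div>

```python
def build_messages(prompt_text):
    """Same steps as messages_for, but for a raw prompt string."""
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = prompt_text.replace(" to the nearest dollar", "").replace("\n\nPrice is $", "")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        # The trailing assistant turn nudges the model to continue with just a number
        {"role": "assistant", "content": "Price is $"},
    ]

# Hypothetical prompt text, not an actual dataset entry
sample_prompt = (
    "How much does this cost to the nearest dollar?\n\n"
    "Acme Kettle, 1.7 L, stainless steel\n\n"
    "Price is $"
)
messages = build_messages(sample_prompt)
print(messages[1]["content"])
```

The "Price is $" suffix moves from the user prompt into a separate assistant message, so the model is asked to complete the sentence rather than start a fresh reply.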

In [None]:
# A utility function to extract the price from a string

def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
get_price("The price is roughly $99.99 because blah blah")
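
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
A few more cases make the behaviour of <code>get_price</code> easier to trust. The function is repeated here so the block runs on its own; the reply strings are made up to resemble what a model might send back.
</div>

```python
import re

def get_price(s):
    # Strip currency symbols and thousands separators, then take the first number
    s = s.replace('$', '').replace(',', '')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

# Hypothetical model replies
examples = {
    "$1,299.00": get_price("$1,299.00"),          # thousands separator removed first
    "Price is $45": get_price("Price is $45"),    # integer with no decimal part
    "about 12. maybe": get_price("about 12. maybe"),  # trailing dot is not a decimal
    "no idea": get_price("no idea"),              # no digits at all falls back to 0
}
print(examples)
```

Falling back to 0 keeps the test run going when a model refuses to answer, at the cost of recording a large error for that item.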

In [None]:
# The function for gpt-4o-mini

def gpt_4o_mini(item):
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
test[0].price

In [None]:
import time

# 11:57 9/7: hit the rate limit 

# Run the Tester on gpt-4o-mini with the first 10 items in the test set
# print("Testing gpt-4o-mini on the first 10 items in the test set")
# for i, item in enumerate(test[:10]):
for i, item in enumerate(test[:9]):
    guess = gpt_4o_mini(item)
    truth = item.price
    error = abs(guess - truth)
    print(f"{i+1}: Guess: ${guess:,.2f} | Truth: ${truth:,.2f} | Error: ${error:,.2f} | Item: {item.title[:40]}...")
    time.sleep(20)

# Run the Tester on gpt-4o-mini with a larger sample
# (the instance is named `tester` so the Tester class is not overwritten)
# print("Testing gpt-4o-mini on the first 250 items in the test set")
# tester = Tester(gpt_4o_mini, title="gpt-4o-mini", data=test, size=250)
# tester.run()

# Run the Tester on gpt-4o-mini with the full test set
# This will take about 20 - 40 minutes depending on internet speed and the model's response time
# print("Running Tester on gpt-4o-mini with the full test set")
# tester = Tester(gpt_4o_mini, title="gpt-4o-mini", data=test, size=2000)
# tester.run()

# Same run, with the size taken from the data instead of a hard-coded count
# tester = Tester(gpt_4o_mini, title="gpt-4o-mini", data=test, size=len(test))
# tester.run()
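
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
A fixed <code>time.sleep(20)</code> works but is slow. One common alternative is to retry only when a call fails, waiting longer each time. The sketch below is a generic wrapper, not part of the course code: <code>with_backoff</code>, its parameters, and the <code>flaky</code> stand-in function are all hypothetical names used for illustration.
</div>

```python
import time

def with_backoff(fn, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that failed calls are retried with exponentially growing pauses."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except retry_on:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the original error
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return wrapped

# Stand-in for an API call that fails twice and then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return 42.0

delays = []  # record the pauses instead of actually sleeping
result = with_backoff(flaky, retry_on=(RuntimeError,), sleep=delays.append)()
print(result, delays)
```

In this notebook the wrapper could be applied as <code>with_backoff(gpt_4o_mini, retry_on=(RateLimitError,))</code> after <code>from openai import RateLimitError</code>. The import has to come from the package because the notebook rebinds the name <code>openai</code> to a client instance.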

In [None]:
def gpt_4o_frontier(item):
    response = openai.chat.completions.create(
        model="gpt-4o-2024-08-06", 
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
# Run the gpt-4o predictor (the August 2024 snapshot defined above) on the test set
# Note that it cost me about 1-2 cents to run this (pricing may vary by region)
# You can skip this and look at my results instead

Tester.test(gpt_4o_frontier, test)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>2/ Claude-3.5-sonnet</b></h5>
</div>

In [None]:
def claude_3_point_5_sonnet(item):
    messages = messages_for(item)
    system_message = messages[0]['content']
    messages = messages[1:]
    response = claude.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=5,
        system=system_message,
        messages=messages
    )
    reply = response.content[0].text
    return get_price(reply)
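
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
The function above relies on the system message being first in the list, because the Anthropic call takes the system prompt as a separate <code>system</code> argument rather than as a message. A slightly more defensive version of that split might look like the sketch below; <code>split_system</code> is a hypothetical helper, not part of the course code.
</div>

```python
def split_system(messages):
    """Return (system_text, remaining_messages) from an OpenAI-style message list."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts), rest

# Example message list in the same shape that messages_for() returns
example = [
    {"role": "system", "content": "You estimate prices of items. Reply only with the price, no explanation"},
    {"role": "user", "content": "How much does this cost?\n\nSome product description"},
    {"role": "assistant", "content": "Price is $"},
]
system_text, chat_messages = split_system(example)
print(system_text)
print([m["role"] for m in chat_messages])
```

The remaining list keeps the trailing assistant turn, which the model continues from when producing its reply.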

In [None]:
Tester.test(claude_3_point_5_sonnet, test)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;"><b>3/ Llama3.2 (local)</b></h5>
</div>

In [None]:
def llama3_2_local(item):
    response = llama_client.chat.completions.create(
        model="llama3.2",
        messages=messages_for(item),
        seed=42,
        max_tokens=5
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
import time

# First 10 items only (quick spot-check)
# print("Running Tester on llama3.2 with the first 10 items of the test set")
# for i, item in enumerate(test[:10]):
#     guess = llama3_2_local(item)
#     truth = item.price
#     error = abs(guess - truth)
#     print(f"{i+1}: Guess: ${guess:,.2f} | Truth: ${truth:,.2f} | Error: ${error:,.2f} | Item: {item.title[:40]}...")
#     time.sleep(1)

# Full test set
print("Running Tester on llama3.2 with the full test set")
Tester.test(llama3_2_local, test)

<br>

<br>

#### <code>**day5.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Fine-tune the <b>GPT-4o-mini</b> model using OpenAI's API. Prepare the dataset in <code>JSONL</code> format, upload to OpenAI, initiate fine-tuning, and evaluate the accuracy of the customized Product Pricer model.
</div>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.2em; font-size: 18px;"><b>The Product Pricer Continued</b></h4>
  A model that can estimate how much something costs, from its description.
</div>

<br style="margin: 0; padding: 0;">

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>AT LAST – it’s time for Fine Tuning!</b></h5>
  After all this data preparation and old-school machine learning, we’ve finally arrived at the moment you’ve been waiting for: fine-tuning a model.
</div>


In [None]:
import os
import re
import math
import json
import random
from dotenv import load_dotenv
from huggingface_hub import login
import matplotlib.pyplot as plt
import numpy as np
import pickle
from collections import Counter
from openai import OpenAI
from anthropic import Anthropic
import ollama

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# moved our Tester into a separate package
# call it with Tester.test(function_name, test_dataset)

from items import Item
from testing import Tester

In [None]:
openai = OpenAI()
# claude = Anthropic()
llama_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

In [None]:
with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

with open('test.pkl', 'rb') as file:
    test = pickle.load(file)

In [None]:
# OpenAI recommends fine-tuning on roughly 50-100 examples
# Because each of our examples is very short, I suggest using 200 examples (and 1 epoch),
# with the next 50 held out as a validation set

fine_tune_train = train[:200]
fine_tune_validation = train[200:250]

<br>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Step 1</b></h5>
  Prepare our data for fine-tuning in <code>JSONL</code> (JSON Lines) format and upload to OpenAI.
</div>


In [None]:
# First let's work on a good prompt for a Frontier model
# Notice that I'm removing the " to the nearest dollar"
# When we train our own models, we'll need to make the problem as easy as possible, 
# but a Frontier model needs no such simplification.

def messages_for(item):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": f"Price is ${item.price:.2f}"}
    ]

In [None]:
messages_for(train[0])

In [None]:
# Convert the items into a "jsonl" string: one JSON object per line
# Each line holds one complete conversation (system, user, assistant) in the form:
# {"messages": [{"role": "system", "content": "You estimate prices...


def make_jsonl(items):
    # json.dumps on the whole dict produces the same text as concatenating by hand
    lines = [json.dumps({"messages": messages_for(item)}) for item in items]
    return "\n".join(lines)

In [None]:
print(make_jsonl(train[:3]))
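
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
Before uploading, it can be worth checking that every line parses and has the expected shape. The sketch below is a minimal local check under my own assumptions about what "well-formed" means here (a non-empty <code>messages</code> list, known roles, string contents, assistant turn last). <code>check_jsonl</code> is a hypothetical helper and does not replicate OpenAI's own validation.
</div>

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def check_jsonl(jsonl_text):
    """Return a list of problems found; an empty list means the basic shape looks right."""
    problems = []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {n}: not valid JSON ({e.msg})")
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {n}: missing a non-empty 'messages' list")
            continue
        for m in messages:
            ok = (isinstance(m, dict)
                  and m.get("role") in ALLOWED_ROLES
                  and isinstance(m.get("content"), str))
            if not ok:
                problems.append(f"line {n}: malformed message {m!r}")
        # For training data, the final turn should be the answer we want the model to learn
        if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
            problems.append(f"line {n}: last message should be the assistant's answer")
    return problems

good = ('{"messages": [{"role": "system", "content": "a"}, '
        '{"role": "user", "content": "b"}, '
        '{"role": "assistant", "content": "Price is $9.99"}]}')
bad = '{"messages": [{"role": "user", "content": "b"}]}\nnot json'

print(check_jsonl(good))
for problem in check_jsonl(bad):
    print(problem)
```

In the notebook this could be called as <code>check_jsonl(make_jsonl(fine_tune_train))</code> before writing the file.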

In [None]:
# Convert the items into jsonl and write them to a file

def write_jsonl(items, filename):
    with open(filename, "w") as f:
        jsonl = make_jsonl(items)
        f.write(jsonl)

In [None]:
write_jsonl(fine_tune_train, "fine_tune_train.jsonl")

In [None]:
write_jsonl(fine_tune_validation, "fine_tune_validation.jsonl")

In [None]:
with open("fine_tune_train.jsonl", "rb") as f:
    train_file = openai.files.create(file=f, purpose="fine-tune")

In [None]:
train_file

In [None]:
with open("fine_tune_validation.jsonl", "rb") as f:
    validation_file = openai.files.create(file=f, purpose="fine-tune")

In [None]:
validation_file

<br>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Step 2</b></h5>
  I love <b>Weights and Biases</b> – a beautiful, free platform for monitoring training runs.  
  Weights and Biases is integrated with OpenAI for fine-tuning.<br><br>
  First, set up a free Weights & Biases account at:<br>
  <a href="https://wandb.ai" target="_blank">https://wandb.ai</a><br>
  From the Avatar → <b>Settings</b> menu, near the bottom, you can create an API key.<br>
  Then visit the OpenAI dashboard at:<br>
  <a href="https://platform.openai.com/account/organization" target="_blank">https://platform.openai.com/account/organization</a><br>
  In the <b>Integrations</b> section, you can add your Weights & Biases key.
</div>

<br>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>And now — time to Fine-tune!</b></h5>
</div>


In [None]:
wandb_integration = {"type": "wandb", "wandb": {"project": "gpt-pricer"}}

In [None]:
train_file.id

In [None]:
openai.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    seed=42,
    hyperparameters={"n_epochs": 1},
    integrations = [wandb_integration],
    suffix="pricer"
)

In [None]:
openai.fine_tuning.jobs.list(limit=1)

In [None]:
job_id = openai.fine_tuning.jobs.list(limit=1).data[0].id

In [None]:
job_id

In [None]:
openai.fine_tuning.jobs.retrieve(job_id)

In [None]:
openai.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10).data
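
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
Instead of re-running the <code>retrieve</code> cell by hand, the job can be polled until it reaches a final state. This is a sketch only: <code>wait_for_job</code> and its parameters are hypothetical, the set of final status strings is my assumption about the values the job reports, and the demo uses a fake <code>retrieve</code> function so that it runs without an API key.
</div>

```python
import time
from types import SimpleNamespace

# Assumed final states of a fine-tuning job
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) until the job's status is final, then return the job."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still not finished after {max_polls} polls")

# Fake retrieve() that walks through a typical sequence of statuses
statuses = iter(["validating_files", "running", "succeeded"])
def fake_retrieve(job_id):
    status = next(statuses)
    model = "ft:example-model" if status == "succeeded" else None
    return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)

waits = []  # record the pauses instead of actually sleeping
job = wait_for_job(fake_retrieve, "job-123", poll_seconds=0, sleep=waits.append)
print(job.status, job.fine_tuned_model, len(waits))
```

With the real client, the call would be <code>wait_for_job(openai.fine_tuning.jobs.retrieve, job_id)</code>, using the same <code>retrieve</code> method as the cell above.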

<br>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Step 3</b></h5>
  Test our fine-tuned model
</div>


In [None]:
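# fine_tuned_model stays None until the job has finished successfully,
# so re-run this cell once the job status is "succeeded"
fine_tuned_model_name = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model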

In [None]:
fine_tuned_model_name

In [None]:
# The prompt

def messages_for(item):
    system_message = "You estimate prices of items. Reply only with the price, no explanation"
    user_prompt = item.test_prompt().replace(" to the nearest dollar","").replace("\n\nPrice is $","")
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Price is $"}
    ]

In [None]:
# Try this out

messages_for(test[0])

In [None]:
# A utility function to extract the price from a string

def get_price(s):
    s = s.replace('$','').replace(',','')
    match = re.search(r"[-+]?\d*\.\d+|\d+", s)
    return float(match.group()) if match else 0

In [None]:
get_price("The price is roughly $99.99 because blah blah")

In [None]:
# The function for our fine-tuned gpt-4o-mini model

def gpt_fine_tuned(item):
    response = openai.chat.completions.create(
        model=fine_tuned_model_name, 
        messages=messages_for(item),
        seed=42,
        max_tokens=7
    )
    reply = response.choices[0].message.content
    return get_price(reply)

In [None]:
print(test[0].price)

# Ensure fine_tuned_model_name is set
print("Fine-tuned model name:", fine_tuned_model_name)
if not fine_tuned_model_name:
    # Retrieve the fine-tuned model name if not set
    fine_tuned_model_name = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model

if not fine_tuned_model_name:
    raise ValueError("fine_tuned_model_name is not set. Please check your fine-tuning job.")

print(gpt_fine_tuned(test[0]))

In [None]:
print(test[0].test_prompt())

In [None]:
Tester.test(gpt_fine_tuned, test)
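
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
The <code>Tester</code> class lives in a separate package and its internals are not shown here. As a rough idea of the kind of summary such a run can produce, the sketch below computes the mean and maximum absolute error for any predictor. It is not <code>Tester</code>'s implementation: <code>summarize</code>, the fake items, and the constant-guess predictor are all hypothetical.
</div>

```python
from types import SimpleNamespace

def summarize(predictor, items):
    """Compute simple absolute-error statistics for predictor over items with a .price."""
    errors = [abs(predictor(item) - item.price) for item in items]
    return {
        "n": len(errors),
        "mean_abs_error": sum(errors) / len(errors),
        "max_abs_error": max(errors),
    }

# Stand-in items with only the attribute this sketch needs
fake_items = [SimpleNamespace(price=p) for p in (10.0, 20.0, 40.0)]

# A deliberately naive predictor that always guesses $20
stats = summarize(lambda item: 20.0, fake_items)
print(stats)
```

Comparing these numbers across <code>gpt_4o_mini</code>, <code>gpt_4o_frontier</code>, and <code>gpt_fine_tuned</code> on the same slice of <code>test</code> gives a quick like-for-like view of whether fine-tuning helped.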