<div style="font-size: 13px; line-height: 1.4; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em;">
This notebook documents my theoretical study alongside the lab exercises conducted on <b>July 8, 2025</b>.
</h5>
</div>

### <u><b>LAB EXERCISES:</b></u> **WEEK 6**

<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em;">
    <b style="font-size: 16px;">Overview:</b> Build a data curation pipeline to collect, clean, and balance product data, then standardize it into <code>Item</code> objects with well-structured prompts and token counts suitable for training a product pricing model. Tokenizer behavior is also validated to confirm that 3-digit price values are encoded as a single token, preparing the dataset for the upcoming training and evaluation steps.
  </h5>
</div>


#### <code>**day1.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Begin the <b>Product Pricer</b> project by curating a dataset of <b>Home Appliances</b> from Amazon reviews. Focus on items with prices and prepare them for training by creating <code>Item</code> objects with truncated text and prompts.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.3em; font-size: 16px;"><b>The Big Project Begins!</b></h4>
  <b>Product Pricer:</b> A model that can estimate how much something costs based on its description.<br>
  <b>Data Curation – Part 1:</b> We’ll begin curating and cleaning a subset of the dataset, focusing on <b>Home Appliances</b>.<br>
  <b>Dataset source:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023</a><br>
  <b>Meta categories folder:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories</a>
</div>


<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Sidenote: What is Data Curation?</b></h5>
  Data curation is the process of collecting, cleaning, organizing, and preparing raw data into a high-quality, structured dataset suitable for analysis or machine learning tasks.<br>
  In the context of this week, curation includes:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Selecting relevant data points (e.g., products with valid prices)</li>
    <li>Removing or correcting errors, inconsistencies, or irrelevant entries</li>
    <li>Balancing the dataset by category or price range</li>
    <li>Formatting data: creating prompts, truncating text, tokenizing</li>
    <li>Saving curated outputs for reuse in training and evaluation</li>
  </ul>
  Well-curated data ensures your models are trained on accurate, representative, and relevant samples — leading to stronger performance and more trustworthy results.
</div>


In [None]:
# Run in Anaconda Prompt (for conda users):
# conda install -c conda-forge datasets matplotlib matplotlib-inline

# pip users:
# pip install datasets matplotlib matplotlib-inline

In [None]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt
from items import Item

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
# %matplotlib inline

In [None]:
# Load in our dataset

dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)

In [None]:
print(f"Number of Appliances: {len(dataset):,}")

In [None]:
# Investigate a particular datapoint
datapoint = dataset[2]

In [None]:
# Investigate
print(datapoint["title"])
print(datapoint["description"])
print(datapoint["features"])
print(datapoint["details"])
print(datapoint["price"])

In [None]:
# How many have prices?
prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError as e:
        pass

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")

In [None]:
# For those with prices, gather the price and the length
prices = []
lengths = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices.append(price)
            contents = datapoint["title"] + str(datapoint["description"]) + str(datapoint["features"]) + str(datapoint["details"])
            lengths.append(len(contents))
    except ValueError as e:
        pass

In [None]:
# Plot the distribution of lengths
plt.figure(figsize=(15, 6))
plt.title(f"Lengths: Avg {sum(lengths)/len(lengths):,.0f} and highest {max(lengths):,}\n")
plt.xlabel('Length (chars)')
plt.ylabel('Count')
plt.hist(lengths, rwidth=0.7, color="lightblue", bins=range(0, 6000, 100))
plt.show()

In [None]:
# Plot the distribution of prices
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.2f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="orange", bins=range(0, 1000, 10))
plt.show()

In [None]:
# So what is this item??
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 21000:
            print(datapoint['title'])
    except ValueError as e:
        pass

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Reference Product (for Comparison):</b><br>
This is the closest I can find — looks like it's going at a bargain price!<br>
<a href="https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/" target="_blank">
https://www.amazon.com/TurboChef-Electric-Countertop-Microwave-Convection/dp/B01D05U9NO/
</a>
</div>


<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Now It's Time to Curate Our Dataset</b></h5>
  We select items that cost between <b>$1 and $999</b>.<br>
  For each product, we will create <code>Item</code> instances that:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Truncate the product description to fit within <b>180 tokens</b> using the appropriate tokenizer</li>
    <li>Generate a <b>prompt</b> to be used during training</li>
  </ul>
  Items will be <b>rejected</b> if they do not contain a sufficient number of characters.
</div>
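
<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  The price filter described above can be sketched as a small helper. This is only an illustration: the function name <code>parse_price</code> and the example values are hypothetical, and the bounds simply mirror the <b>$1 to $999</b> range stated in the text rather than whatever constants <code>items.py</code> or <code>loaders.py</code> actually use.
</div>

```python
from typing import Optional


def parse_price(raw, low: float = 1.0, high: float = 999.0) -> Optional[float]:
    """Return the price as a float if it parses and lies in [low, high], else None."""
    try:
        price = float(raw)
    except (TypeError, ValueError):
        # Missing values (None) and non-numeric strings are treated as "no price"
        return None
    return price if low <= price <= high else None


# Hypothetical raw values, similar in spirit to what a "price" field might contain
examples = ["12.99", "None", "", None, "21095.62", "0"]
kept = [p for p in (parse_price(r) for r in examples) if p is not None]
print(kept)
```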

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>But Why 180 Tokens?</b></h5>
  A student asked a great question — <i>"Why are we truncating to 180 tokens? How did we determine that number?"</i><br>
  (Thank you Moataz A. for the excellent question!)<br><br>
  The answer: this is a classic example of a <b>hyperparameter</b>; in other words, it's chosen via <b>trial and error</b>. The value needs to strike a balance between:
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>A high enough token count to give the model useful pricing context</li>
    <li>A low enough token count to ensure efficient training</li>
  </ul>
  I experimented with several values and found that <b>180</b> offered the best balance. You are encouraged to try your own tuning — this type of iteration is a key part of data science research and development.<br><br>
  There’s also a practical reason for keeping the token count low: During <b>inference time</b>, we’ll estimate prices for products using short 1–2 sentence descriptions. Our training data should mimic this format for optimal performance.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>But I See 160 Tokens in <code>items.py</code>?</b></h5>
  Another great question from Moataz A.!<br><br>
  Yes — the product description is limited to <b>160 tokens</b> because we prepend and append custom text to format it into a training prompt. That extra context brings the <b>total length</b> to around <b>180 tokens</b>.
</div>
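
<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  To make the 160-versus-180 arithmetic concrete, here is a minimal sketch. It is not the code from <code>items.py</code>: the whitespace "tokenizer" is a stand-in for the real one, and the prompt wording (<code>QUESTION</code>, <code>PREFIX</code>) is a hypothetical wrapper chosen for illustration. The point is only that the description is capped first and the wrapper text then adds a few more tokens on top.
</div>

```python
MAX_DESCRIPTION_TOKENS = 160  # cap on the description text, as described above

# Hypothetical prompt wrapper; the real wording lives in items.py
QUESTION = "How much does this cost to the nearest dollar?"
PREFIX = "Price is $"


def toy_tokenize(text: str) -> list:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def build_prompt(description: str, price: float) -> str:
    """Truncate the description, then wrap it in the question and price text."""
    tokens = toy_tokenize(description)[:MAX_DESCRIPTION_TOKENS]
    truncated = " ".join(tokens)
    return f"{QUESTION}\n\n{truncated}\n\n{PREFIX}{round(price)}.00"


# A deliberately over-long description of 500 toy tokens
long_description = " ".join(f"word{i}" for i in range(500))
prompt = build_prompt(long_description, 219.99)

description_tokens = len(toy_tokenize(long_description)[:MAX_DESCRIPTION_TOKENS])
total_tokens = len(toy_tokenize(prompt))
print(f"Description tokens: {description_tokens}, total prompt tokens: {total_tokens}")
```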


In [None]:
# Create an Item object for each with a price
items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError as e:
        pass

print(f"There are {len(items):,} items")

In [None]:
# Look at one of the first items (index 1 is the second item)
items[1]

In [None]:
# Investigate the prompt that will be used during training - the model learns to complete this
print(items[100].prompt)

In [None]:
# Investigate the prompt that will be used during testing - the model has to complete this
print(items[100].test_prompt())

In [None]:
# Plot the distribution of token counts
tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="green", bins=range(0, 300, 10))
plt.show()

In [None]:
# Plot the distribution of prices
prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 300, 10))
plt.show()

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Sidenote</b></h5>
  If you enjoy the variety of colors used by <code>matplotlib</code> in its charts, you might want to bookmark this:<br>
  <a href="https://matplotlib.org/stable/gallery/color/named_colors.html" target="_blank">https://matplotlib.org/stable/gallery/color/named_colors.html</a>
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>To-Dos for You</b></h5>
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Review the <code>Item</code> class and ensure you're comfortable with how it works</li>
    <li>Examine a few <code>Item</code> objects — check the training prompt via <code>item.prompt</code> and test prompt with <code>item.test_prompt()</code></li>
    <li>Create additional histograms to explore the dataset’s distribution and structure</li>
  </ul>
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Coming Up Next</b></h5>
  We’ll expand the dataset to include additional product categories like <b>Electronics</b> and <b>Automotive</b>.<br>
  This will allow us to work with a larger and more diverse dataset, enabling better selection of a balanced and high-quality training set.
</div>


<div style="font-size: 12px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.4em;">Examine a few <code>Item</code> objects — look at the training prompt via <code>item.prompt</code> and the test prompt with <code>item.test_prompt()</code></h5>
</div>

In [None]:
# Test Case
# Look at the training prompt for an item
print(items[0].prompt)

# Look at the test prompt for the same item (price removed)
print(items[0].test_prompt())

# You can repeat for more items, e.g. items[1], items[100]
print(items[1].prompt)
print(items[1].test_prompt())
print(items[100].prompt)
print(items[100].test_prompt())

<br>

<br>

#### <code>**day2.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Scale up data curation by combining multiple product categories. Balance the dataset by <b>price range</b> and <b>category distribution</b>, and save final <code>train.pkl</code> and <code>test.pkl</code> files for reuse.
</div>

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h4 style="margin-bottom: 0.2em; font-size: 17px;"><b>The Product Pricer Continued</b></h4>
  A model that can estimate how much something costs, from its description.
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Data Curation Part 2</b></h5>
  Today we’ll extend our dataset to achieve broader coverage and refine it into a high-quality training dataset.
  Data curation might not feel as exciting as model tuning or inference, but it’s a critical part of the LLM engineer’s responsibilities. Mastering this process allows you to build robust commercial solutions grounded in carefully curated data.
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <b>Dataset Source:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023</a><br>
  <b>Meta Categories Folder:</b><br>
  <a href="https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories" target="_blank">https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories</a>
</div>
<div style="font-size: 14px; line-height: 1.4; margin-top: 0.6em; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 15px;"><b>Important Note – Please Read First</b></h5>
  We’re about to build a large dataset of <b>400,000 items</b> across multiple product types.
  In <b>Week 7</b>, we’ll use this dataset to train our own model. Depending on your GPU, training might take <b>20+ hours</b> and may cost several dollars in compute units.
  If you prefer a <b>quicker, lower-cost alternative</b>, use a smaller dataset focused solely on <b>Home Appliances</b>. This covers the same learning goals with slightly reduced accuracy.
  Use <code>lite.ipynb</code> for the smaller dataset.
  Alternatively, you can skip the curation step by downloading the preprocessed <code>.pkl</code> files:
  <a href="https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW" target="_blank">Download Pickle Files</a>
</div>


In [None]:
import os
import random
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import numpy as np
import pickle

In [None]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [None]:
from loaders import ItemLoader
from items import Item

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>The ItemLoader Code</b></h5>
Look inside <code>loaders.py</code> — there's some helpful utility code there to make our work easier.
</div>

In [None]:
# Load in the same dataset as last time
items = ItemLoader("Appliances").load()

In [None]:
# Look for a familiar item..
print(items[1].prompt)

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Now to SCALE UP</b></h5>
Let’s explore all datasets that include items typically found in a large home retail store — such as electrical, electronic, and office-related products — but <b>excluding</b> categories like clothes, beauty, and books.
</div>

In [None]:
dataset_names = [
    "Automotive",
    "Electronics",
    "Office_Products",
    "Tools_and_Home_Improvement",
    "Cell_Phones_and_Accessories",
    "Toys_and_Games",
    "Appliances",
    "Musical_Instruments",
]

In [None]:
# Download all datasets & load into items

items = []
for dataset_name in dataset_names:
    loader = ItemLoader(dataset_name)
    items.extend(loader.load())

# Now, time for a coffee break!!
# By the way, I put the biggest datasets first.. it gets faster.

In [None]:
print(f"A grand total of {len(items):,} items")

In [None]:
# Plot the distribution of token counts again
tokens = [item.token_count for item in items]

if tokens:
    plt.figure(figsize=(15, 6))
    plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
    plt.xlabel('Length (tokens)')
    plt.ylabel('Count')
    plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))
    plt.show()
else:
    print("No items to plot. The 'items' list is empty.")

In [None]:
# Plot the distribution of prices
prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))
plt.show()

In [None]:
category_counts = Counter()
for item in items:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Bar chart by category
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="goldenrod")
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Objective</b></h5>
  Craft a dataset that is more balanced in terms of pricing. Aim to:<br>
  - Reduce the dominance of low-cost (cheap) items<br>
  - Increase the average price to be higher than <b>$60</b><br>
  - Balance category representation, specifically by <b>reducing Automotive items</b>
</div>


In [None]:
# Create a dict keyed by whole-dollar price, from $1 to $999
# Each value is the list of items whose price rounds to that dollar amount
slots = defaultdict(list)
for item in items:
    slots[round(item.price)].append(item)

In [None]:
# Create a dataset called "sample" which tries to more evenly take from the range of prices
# And gives more weight to items from categories other than Automotive
# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)
sample = []
for i in range(1, 1000):
    slot = slots[i]
    if i>=240:
        sample.extend(slot)
    elif len(slot) <= 1200:
        sample.extend(slot)
    else:
        weights = np.array([1 if item.category=='Automotive' else 5 for item in slot])
        weights = weights / np.sum(weights)
        selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)
        selected = [slot[idx] for idx in selected_indices]
        sample.extend(selected)

print(f"There are {len(sample):,} items in the sample")
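
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  To get a feel for what the 1:5 weighting does, the short calculation below works out the chance that a single weighted draw lands on an Automotive item. The slot composition (3,000 Automotive items and 1,000 others) is made up for illustration, and the function name is hypothetical. It only describes the first draw: because the sampling is without replacement, the shares drift as items are used up.
</div>

```python
def first_draw_share(n_automotive: int, n_other: int,
                     w_automotive: float = 1, w_other: float = 5) -> float:
    """Probability that one weighted draw picks an Automotive item."""
    automotive_mass = n_automotive * w_automotive
    other_mass = n_other * w_other
    return automotive_mass / (automotive_mass + other_mass)


# Hypothetical slot: 3,000 Automotive items and 1,000 items from other categories
unweighted = first_draw_share(3000, 1000, w_automotive=1, w_other=1)
weighted = first_draw_share(3000, 1000)  # weights 1 and 5, as in the cell above

print(f"Unweighted: {unweighted:.1%}, weighted 1:5: {weighted:.1%}")
```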

In [None]:
# Plot the distribution of prices in sample
prices = [float(item.price) for item in sample]
plt.figure(figsize=(15, 10))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

In [None]:
# OK, we did well in terms of raising the average price and having a smooth-ish population of prices
# Let's see the categories

category_counts = Counter()
for item in sample:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Create bar chart
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="lightgreen")

# Customize the chart
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

In [None]:
# Automotive still in the lead, but improved somewhat
# For another perspective, let's look at a pie

plt.figure(figsize=(12, 10))
plt.pie(counts, labels=categories, autopct='%1.0f%%', startangle=90)

# Add a circle at the center to create a donut chart (optional)
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Categories')

# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')  

plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Dataset Curated! ✅</b></h5>
  We've crafted an excellent dataset. Let's do some final checks.
</div>


In [None]:
# How does the price vary with the character count of the prompt?

sizes = [len(item.prompt) for item in sample]
prices = [item.price for item in sample]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(sizes, prices, s=0.2, color="red")

# Add labels and title
plt.xlabel('Size')
plt.ylabel('Price')
plt.title('Is there a simple correlation?')

# Display the plot
plt.show()
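
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  The scatter plot answers the question visually; a single number can complement it. The sketch below computes the Pearson correlation coefficient with the standard library only. The helper name <code>pearson</code> and the toy data are illustrative assumptions, not part of the course code.
</div>

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy data standing in for prompt sizes and prices
sizes_demo = [100, 200, 300, 400]
prices_demo = [10.0, 35.0, 20.0, 60.0]
print(f"r = {pearson(sizes_demo, prices_demo):.3f}")
```

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  In the notebook, <code>pearson(sizes, prices)</code> would give the value for the real sample. A value near 0 would support the impression that there is no simple linear relationship.
</div>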

In [None]:
def report(item):
    prompt = item.prompt
    tokens = Item.tokenizer.encode(item.prompt)
    print(prompt)
    print(tokens[-10:])
    print(Item.tokenizer.batch_decode(tokens[-10:]))

In [None]:
report(sample[398000])

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Observation</b></h5>
  An interesting behavior of the <b>Llama tokenizer</b> is that every number from <b>1 to 999</b> is mapped to a <b>single token</b>, similar to what we observed with <code>gpt-4o</code>.<br>
  In contrast, models like <code>qwen2</code>, <code>gemma</code>, and <code>phi3</code> tokenize each digit separately.<br>
  While this isn’t a strict requirement, it does provide a slight advantage in our project by making numerical input more compact.
</div>


<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Finally</b></h5>
  It’s time to split our curated data into <b>training</b> and <b>test</b> sets.<br>
  While it’s typical to allocate around <b>5%–10%</b> of the data for testing, we currently have more data than we need. So we’ll use:
  <ul style="margin: 0.2em 0; padding-left: 1.5em;">
    <li><b>400,000 samples</b> for training</li>
    <li><b>2,000 samples</b> for testing</li>
  </ul>
  We may not use all the test samples immediately, but they will be helpful for evaluation as we iterate on our models.
</div>


In [None]:
random.seed(42)
random.shuffle(sample)
train = sample[:400_000]
test = sample[400_000:402_000]
print(f"Divided into a training set of {len(train):,} items and test set of {len(test):,} items")
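
<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  One caveat with plain slicing: if <code>sample</code> holds fewer than 402,000 items, the slices silently come back shorter than intended. A guarded variant might look like the sketch below. The function name <code>split_train_test</code> is hypothetical, and it shuffles a copy with its own <code>random.Random</code> instance instead of the global generator, which is a design choice of this sketch and not what the cell above does.
</div>

```python
import random


def split_train_test(items, n_train: int, n_test: int, seed: int = 42):
    """Shuffle a copy of items and return disjoint (train, test) lists."""
    if len(items) < n_train + n_test:
        raise ValueError(
            f"need {n_train + n_test:,} items but only {len(items):,} are available"
        )
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local generator, reproducible
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


# Toy usage with integers standing in for Item objects
demo_train, demo_test = split_train_test(list(range(100)), n_train=80, n_test=10)
print(f"{len(demo_train)} train / {len(demo_test)} test")
```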

In [None]:
print(train[0].prompt)

In [None]:
print(test[0].test_prompt())

In [None]:
# Plot the distribution of prices in the first 250 test points

prices = [float(item.price) for item in test[:250]]
plt.figure(figsize=(15, 6))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.2em; font-size: 16px;"><b>Finally: Upload to the Hub</b></h5>
  Convert your curated dataset into prompt format and upload it to the <b>HuggingFace Hub</b>.<br>
  This ensures your dataset is accessible for training and testing, and can be reused or shared across projects.<br>
  Make sure to include:
  <ul style="margin: 0.2em 0; padding-left: 1.5em;">
    <li>Correct formatting (e.g., <code>text</code> and <code>price</code> fields)</li>
    <li>A clear dataset card with description and metadata</li>
    <li>Proper visibility settings (public or private, as needed)</li>
  </ul>
</div>


In [None]:
train_prompts = [item.prompt for item in train]
train_prices = [item.price for item in train]
test_prompts = [item.test_prompt() for item in test]
test_prices = [item.price for item in test]

In [None]:
# Create a Dataset from the lists

train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

In [None]:
# Uncomment these lines if you're ready to push to the hub, and replace my name with your HF username

# HF_USER = "ed-donner"
# DATASET_NAME = f"{HF_USER}/pricer-data"
# dataset.push_to_hub(DATASET_NAME, private=True)

In [None]:
# One more thing!
# Let's pickle the training and test dataset so we don't have to execute all this code next time!

with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)
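
<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  Reading the files back is the mirror image of the cell above. The sketch below shows a save and load round trip on a throwaway file; the helper names and the demo data are hypothetical. Note that unpickling real <code>Item</code> objects requires the <code>items</code> module to be importable in the session that loads them.
</div>

```python
import pickle
import tempfile
from pathlib import Path


def save_items(items, path) -> None:
    """Write a list of objects to a pickle file."""
    with open(path, "wb") as file:
        pickle.dump(items, file)


def load_items(path):
    """Read a list of objects back from a pickle file."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round trip on a temporary file, with plain dicts standing in for Item objects
demo_items = [{"title": "Toaster", "price": 25.0}, {"title": "Kettle", "price": 40.0}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pkl"
    save_items(demo_items, demo_path)
    restored = load_items(demo_path)

print(restored == demo_items)
```

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  In a later notebook, <code>load_items('train.pkl')</code> and <code>load_items('test.pkl')</code> would bring the curated lists back without rerunning the curation.
</div>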

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>To-Dos</b></h5>
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li>Investigate the dataset more!</li>
    <li>Confirm that the tokenizer tokenizes all 3-digit prices into a single token</li>
  </ul>
</div>


<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em;"><b>1/</b> Investigate the dataset more </h5>
</div>


In [None]:
# View details of the first 5 items in the sample
for i in range(5):
    # print(sample[i].prompt)
    print("Product name:", sample[i].title)
    print("Price:", sample[i].price)
    print("Category:", sample[i].category)
    print("Token count:", sample[i].token_count)

    print("")

    # print("="*40)

<div style="font-size: 12px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em;"><b>2/</b> Confirm that the tokenizer tokenizes all 3-digit prices into a single token </h5>
</div>


In [None]:
# Check that the tokenizer tokenizes all 3-digit prices into a single token
# This is a simple test case to confirm the tokenizer's behavior with 3-digit prices
for price in [100, 123, 456, 789, 999]:
    tokens = Item.tokenizer.encode(str(price), add_special_tokens=False)
    print(f"Price: {price} -> Tokens: {tokens} (Length: {len(tokens)})")

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  <h5 style="margin-bottom: 0.3em; font-size: 15px;"><b>Note: <code>add_special_tokens=False</code> in Tokenizer</b></h5>
  When calling the <code>encode()</code> method of a tokenizer with <code>add_special_tokens=False</code>, special tokens like <code>&lt;bos&gt;</code>, <code>&lt;eos&gt;</code>, or <code>&lt;pad&gt;</code> will not be added to the encoded output.
  <ul style="margin: 0.4em 0; padding-left: 1.5em;">
    <li><b>Purpose:</b> If you only want to tokenize the exact input (e.g., the number <code>"123"</code>), the result will contain only the token for <code>"123"</code>.</li>
    <li>If left as default (<code>add_special_tokens=True</code>), the tokenizer may automatically prepend or append special tokens, leading to more tokens and distorting your token count.</li>
  </ul>
  <b>Illustrative examples</b> (the actual token IDs, and which special tokens are added, depend on the tokenizer):<br>
  <code>encode("123", add_special_tokens=True)</code> → <code>[&lt;bos&gt;, 4513, &lt;eos&gt;]</code> (3 tokens)<br>
  <code>encode("123", add_special_tokens=False)</code> → <code>[4513]</code> (just the token for 123)
</div>
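
<div style="font-size: 14px; line-height: 1.4; margin: 0.8em 0 0 0; padding: 0;">
  The spot check above covers five values. To cover the whole range, a small helper can list every number that does <i>not</i> encode to exactly one token. The sketch below takes any <code>encode</code> callable so it can run without downloading a model; the helper name and the two toy encoders are assumptions made for illustration.
</div>

```python
def multi_token_numbers(encode, low: int = 1, high: int = 999) -> list:
    """Return the integers in [low, high] whose string form is not exactly one token."""
    return [n for n in range(low, high + 1) if len(encode(str(n))) != 1]


def whole_number_encoder(text: str) -> list:
    """Toy encoder: the whole number becomes a single token."""
    return [int(text)]


def per_digit_encoder(text: str) -> list:
    """Toy encoder: every digit becomes its own token."""
    return [int(ch) for ch in text]


single = multi_token_numbers(whole_number_encoder)
digitwise = multi_token_numbers(per_digit_encoder)
print(f"Whole-number encoder offenders: {len(single)}")
print(f"Per-digit encoder offenders: {len(digitwise)}")
```

<div style="font-size: 14px; line-height: 1.4; margin: 0; padding: 0;">
  With the real tokenizer, the call would be <code>multi_token_numbers(lambda s: Item.tokenizer.encode(s, add_special_tokens=False))</code>. An empty list would confirm the observation for every price from 1 to 999.
</div>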


<br>

#### <code>**day3.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Build baseline models for the Product Pricer using classical ML approaches: <b>Linear Regression</b>, <b>Bag of Words</b>, <b>Word2Vec</b>, <b>Support Vector Machine</b>, and <b>Random Forest</b>. Benchmark performance using a custom <code>Tester</code> class.
</div>

<br>

<br>

#### <code>**day4.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Evaluate the performance of <b>Frontier LLMs</b>—including GPT-4o-mini, GPT-4o, and Claude 3.5 Sonnet—on the Product Pricer dataset. Compare their effectiveness against traditional ML models using the same test set.
</div>

<br>

<br>

#### <code>**day5.ipynb**</code>

<div style="font-size: 14px; line-height: 1.5; margin: 0; padding: 0;">
<b style="font-size: 16px;">Abstract:</b> Fine-tune the <b>GPT-4o-mini</b> model using OpenAI's API. Prepare the dataset in <code>JSONL</code> format, upload to OpenAI, initiate fine-tuning, and evaluate the accuracy of the customized Product Pricer model.
</div>