# The Product Pricer Continued

A model that can estimate how much something costs, from its description.

## Data Curation

Today we'll extend our dataset to cover more product categories, and craft it into an excellent dataset for training.

The dataset is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

And the folder with all the product datasets is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories

## Important Note - read me first please

We are about to craft a massive dataset of 400,000 items covering multiple types of product.

In [1]:
# Imports necessary libraries and modules

import os  # Provides a way to interact with the operating system, e.g., file paths
import random  # Implements random number generation and selection of random elements
from dotenv import load_dotenv  # Loads environment variables from a .env file
from huggingface_hub import login  # Allows interaction with the Hugging Face Hub (for login and access)
from datasets import load_dataset, Dataset, DatasetDict  # Provides functions to load datasets from Hugging Face and work with them
from items import Item  # Imports the custom 'Item' class from the 'items' module
from loaders import ItemLoader  # Imports the custom 'ItemLoader' class from the 'loaders' module
import matplotlib.pyplot as plt  # Used for creating visualizations (e.g., plots and graphs)
from collections import Counter, defaultdict  # Provides tools for counting and handling default dictionary behavior
import numpy as np  # Used for numerical operations, especially on arrays
import pickle  # Used for serializing and deserializing Python objects


In [2]:
# environment

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

In [None]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

In [4]:
# This command allows matplotlib plots to be displayed directly within the Jupyter notebook.
%matplotlib inline


## The ItemLoader code

Look in loaders.py - there's some useful code to make life easier for us

In [None]:
# Load in a dataset

items = ItemLoader("Appliances").load()

In [None]:
# Let's look at an item
print(items[1].prompt)

## Now to SCALE UP

Let's look at all datasets of all the items that you might find in a large home retail store - electrical, electronic, office and related, but not clothes / beauty / books.

In [7]:
dataset_names = [
    "Automotive",
    "Electronics",
    "Office_Products",
    "Tools_and_Home_Improvement",
    "Cell_Phones_and_Accessories",
    "Toys_and_Games",
    "Appliances",
    "Musical_Instruments",
]

In [None]:
items = []
for dataset_name in dataset_names:
    loader = ItemLoader(dataset_name)
    items.extend(loader.load())

# Now, time for a coffee break!!
# By the way, I put the biggest datasets first, so the loading gets faster as it goes.

In [None]:
print(f"A grand total of {len(items):,} items")

In [None]:
# Plot the distribution of token counts

# Create a list of token counts from the 'items' list, extracting the 'token_count' attribute from each item
tokens = [item.token_count for item in items]

# Set up the figure size for the plot (15x6 inches)
plt.figure(figsize=(15, 6))

# Set the title of the plot, including the average and highest token counts
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")

# Set the label for the x-axis (Length of tokens)
plt.xlabel('Length (tokens)')

# Set the label for the y-axis (Count of occurrences)
plt.ylabel('Count')

# Plot a histogram of the token counts:
# - rwidth=0.7 adjusts the width of the bars
# - color="skyblue" sets the bar color
# - bins=range(0, 300, 10) defines the bins for token lengths, from 0 to 300 with intervals of 10
plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))

# Display the plot
plt.show()


In [None]:
# Plot the distribution of prices

# Create a list of prices from the 'items' list, extracting the 'price' attribute from each item
prices = [item.price for item in items]

# Set up the figure size for the plot (15x6 inches)
plt.figure(figsize=(15, 6))

# Set the title of the plot, including the average and highest prices
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")

# Set the label for the x-axis (Price in USD)
plt.xlabel('Price ($)')

# Set the label for the y-axis (Count of occurrences)
plt.ylabel('Count')

# Plot a histogram of the prices:
# - rwidth=0.7 adjusts the width of the bars
# - color="blueviolet" sets the bar color
# - bins=range(0, 1000, 10) defines the bins for prices, from 0 to 1000 with intervals of 10
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))

# Display the plot
plt.show()


In [None]:
category_counts = Counter()
for item in items:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Bar chart by category
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="goldenrod")
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

# Objective

Craft a dataset that is more balanced in terms of prices: less heavily skewed towards cheap items, with an average price higher than $60. Also try to balance out the categories, with fewer Automotive items.

In [13]:
# Create a dict with a key for each price from $1 to $999
# The value is a list of the items at that price (rounded to the nearest dollar)

slots = defaultdict(list)
for item in items:
    slots[round(item.price)].append(item)

In [None]:
# Create a dataset called "sample" which tries to more evenly take from the range of prices
# And gives more weight to items from categories other than Automotive

# Set the random seeds to ensure results are reproducible
np.random.seed(42)
random.seed(42)

# Initialize an empty list to store the sample items
sample = []

# Iterate through the price slots ($1 to $999; range(1, 1000) excludes 1000)
for i in range(1, 1000):
    # Retrieve the current slot (list of items) based on the index
    slot = slots[i]
    
    # If the index is greater than or equal to 240, add the whole slot to the sample
    if i >= 240:
        sample.extend(slot)
    
    # If the slot has 1200 or fewer items, add the whole slot to the sample
    elif len(slot) <= 1200:
        sample.extend(slot)
    
    # For other cases, apply weighting and sample 1200 items
    else:
        # Assign a weight of 1 to 'Automotive' category items, and 5 to other categories
        weights = np.array([1 if item.category == 'Automotive' else 5 for item in slot])
        
        # Normalize the weights to sum up to 1
        weights = weights / np.sum(weights)
        
        # Randomly select 1200 items from the slot based on the weights (without replacement)
        selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)
        
        # Add the selected items to the sample
        selected = [slot[idx] for idx in selected_indices]
        sample.extend(selected)

# Print the total number of items in the sample
print(f"There are {len(sample):,} items in the sample")
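
To see what the weighting in the cell above does, here is a minimal sketch on a toy slot. The ten category labels are made up for illustration, and the sketch uses NumPy's newer `default_rng` generator rather than the `np.random.seed` call used in the notebook, so its draws will not match the notebook's.

```python
import numpy as np

# Toy slot: 8 "Automotive" entries and 2 "Electronics" entries (made-up data)
categories = ["Automotive"] * 8 + ["Electronics"] * 2

# Same weighting rule as the notebook cell: 1 for Automotive, 5 for everything else
weights = np.array([1 if c == "Automotive" else 5 for c in categories], dtype=float)

# Normalize so the weights form a probability distribution
probabilities = weights / weights.sum()

# Draw 4 distinct indices; replace=False means no item is picked twice
rng = np.random.default_rng(42)
picked = rng.choice(len(categories), size=4, replace=False, p=probabilities)

print(probabilities.round(3))
print([categories[i] for i in picked])
```

The total weight is 8 × 1 + 2 × 5 = 18, so each Automotive entry has a 1/18 chance on the first draw and each Electronics entry has 5/18. The non-Automotive items are favoured, but Automotive items are not excluded.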


In [None]:
# Plot the distribution of prices in sample

# Create a list of prices from the sample dataset, converting each price to a float
prices = [float(item.price) for item in sample]

# Set up the figure size for the plot (15x10 inches)
plt.figure(figsize=(15, 10))

# Set the title of the plot, including the average and highest prices
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")

# Set the label for the x-axis (Price in USD)
plt.xlabel('Price ($)')

# Set the label for the y-axis (Count of occurrences)
plt.ylabel('Count')

# Plot a histogram of the prices:
# - rwidth=0.7 adjusts the width of the bars
# - color="darkblue" sets the bar color
# - bins=range(0, 1000, 10) defines the bins for prices, from 0 to 1000 with intervals of 10
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))

# Display the plot
plt.show()


In [None]:
# OK, we did well in terms of raising the average price and having a smooth-ish population of prices
# Let's see the categories

# Initialize a Counter to store the count of each category in the sample
category_counts = Counter()

# Iterate through the items in the sample and update the count for each category
for item in sample:
    category_counts[item.category] += 1

# Get the list of categories (keys) and their corresponding counts (values)
categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Create a bar chart to visualize the counts of items in each category
plt.figure(figsize=(15, 6))

# Plot the bar chart with categories on the x-axis and counts on the y-axis
plt.bar(categories, counts, color="lightgreen")

# Customize the chart by adding a title and labels for the x and y axes
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

# Rotate x-axis labels by 30 degrees for better readability and align them to the right
plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar for better clarity
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

In [None]:
# Automotive still in the lead, but improved somewhat
# For another perspective, let's look at a pie

plt.figure(figsize=(12, 10))
plt.pie(counts, labels=categories, autopct='%1.0f%%', startangle=90)

# Add a circle at the center to create a donut chart (optional)
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Categories')

# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')  

plt.show()

# Dataset Curated!

We've crafted an excellent dataset.

Let's do some final checks

In [None]:
# How does the price vary with the character count of the prompt?

sizes = [len(item.prompt) for item in sample]
prices = [item.price for item in sample]

# Create the scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(sizes, prices, s=0.2, color="red")

# Add labels and title
plt.xlabel('Size')
plt.ylabel('Price')
plt.title('Is there a simple correlation?')

# Display the plot
plt.show()
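
A scatter plot can hide a weak trend, so one way to put a number on the question is the Pearson correlation coefficient. The helper below is a sketch, not part of the original notebook: `pearson_r` is a hypothetical name, and the example numbers are made up to show the call.

```python
import numpy as np

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length sequences."""
    # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
    return float(np.corrcoef(xs, ys)[0, 1])

# Made-up numbers standing in for prompt lengths and prices
example_sizes = [100, 200, 300, 400, 500]
example_prices = [10.0, 20.0, 30.0, 40.0, 50.0]
print(f"r = {pearson_r(example_sizes, example_prices):.3f}")  # r = 1.000 (perfectly linear)

# In the notebook, the equivalent call would be: pearson_r(sizes, prices)
```

A value near 0 would suggest that prompt length alone tells us little about price; values near 1 or -1 would indicate a strong linear relationship.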

In [19]:
def report(item):
    # Get the prompt from the item
    prompt = item.prompt
    
    # Encode the prompt using the tokenizer from the Item class
    tokens = Item.tokenizer.encode(item.prompt)
    
    # Print the full prompt
    print(prompt)
    
    # Print the last 10 tokens of the encoded prompt
    print(tokens[-10:])
    
    # Decode the last 10 tokens back into text and print it
    print(Item.tokenizer.batch_decode(tokens[-10:]))


In [None]:
report(sample[398000])

## Observation

An interesting thing about the Llama tokenizer is that every number from 1 to 999 gets mapped to 1 token, much as we saw with gpt-4o. The same is not true of qwen2, gemma and phi3, which all map individual digits to tokens. This does turn out to be a bit useful for our project, although it's not an essential requirement.
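
The claim above can be checked directly. The sketch below is one possible way to do it: `multi_token_numbers` is a hypothetical helper that accepts any `encode` function, and it is demonstrated with a stand-in encoder that emits one token per digit, so it runs without downloading a model. The commented lines at the end assume `Item.tokenizer` is a Hugging Face tokenizer, as used in `report` above.

```python
def multi_token_numbers(encode, low=1, high=999):
    """Return the numbers in [low, high] that `encode` does not map to exactly one token."""
    return [n for n in range(low, high + 1) if len(encode(str(n))) != 1]

# Stand-in encoder that emits one token per digit, to show what a failure looks like
def digit_encode(text):
    return list(text)

failures = multi_token_numbers(digit_encode)
print(f"{len(failures)} numbers need more than one token with the stand-in encoder")  # 990

# With the real tokenizer (an assumption about how Item.tokenizer is set up):
# encode = lambda text: Item.tokenizer.encode(text, add_special_tokens=False)
# print(multi_token_numbers(encode))  # an empty list would confirm the observation
```

Note that a number can tokenize differently in context, for example when it follows a `$` sign or a space, so it may be worth repeating the check on the exact text that ends the prompt.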

# Finally

It's time to split our data into training and test datasets.

It's typical to use 5%-10% of your data for testing purposes, but actually we have far more than we need at this point. We'll take 400,000 points for training, and we'll reserve 2,000 for testing, although we won't use all of them.


In [None]:
# Set the random seed to ensure reproducibility of the shuffling process
random.seed(42)

# Shuffle the 'sample' dataset randomly
random.shuffle(sample)

# Split the shuffled dataset into a training set (first 400,000 items) and a test set (next 2,000 items)
train = sample[:400_000]
test = sample[400_000:402_000]

# Print the sizes of the training and test sets
print(f"Divided into a training set of {len(train):,} items and test set of {len(test):,} items")


In [None]:
print(train[0].prompt)

In [None]:
print(test[0].test_prompt())

In [None]:
# Plot the distribution of prices in the first 250 test points

# Extract the prices from the first 250 items in the test dataset, converting them to floats
prices = [float(item.price) for item in test[:250]]

# Set up the figure size for the plot (15x6 inches)
plt.figure(figsize=(15, 6))

# Set the title of the plot, including the average and highest prices
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")

# Set the label for the x-axis (Price in USD)
plt.xlabel('Price ($)')

# Set the label for the y-axis (Count of occurrences)
plt.ylabel('Count')

# Plot a histogram of the prices in the first 250 test items:
# - rwidth=0.7 adjusts the width of the bars
# - color="darkblue" sets the bar color
# - bins=range(0, 1000, 10) defines the bins for prices, from 0 to 1000 with intervals of 10
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))

# Display the plot
plt.show()


# Finally - upload your brand new dataset

Convert to prompts and upload to HuggingFace hub

In [25]:
# Create a list of prompts from the training set
train_prompts = [item.prompt for item in train]

# Create a list of prices from the training set
train_prices = [item.price for item in train]

# Create a list of prompts from the test set by calling the 'test_prompt' method on each item
test_prompts = [item.test_prompt() for item in test]

# Create a list of prices from the test set
test_prices = [item.price for item in test]


In [26]:
# Create a Dataset from the lists

# Convert the training data into a Dataset, using 'train_prompts' for the text and 'train_prices' for the price
train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})

# Convert the test data into a Dataset, using 'test_prompts' for the text and 'test_prices' for the price
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})

# Create a DatasetDict to hold both the train and test datasets
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})


In [27]:
# Uncomment these lines if you're ready to push to the hub, and replace with your HF username

# HF_USER = ""
# DATASET_NAME = f"{HF_USER}/pricer-data"
# dataset.push_to_hub(DATASET_NAME, private=True)

In [28]:
# One more thing!
# Let's pickle the training and test dataset so we don't have to execute all this code next time!

with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)
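
Next time, the saved files can be read back instead of rerunning the notebook. This is a minimal sketch: `load_pickle` is a hypothetical helper, and the round trip is shown on a stand-in list in a temporary directory so it runs anywhere.

```python
import pickle
import tempfile
from pathlib import Path

def load_pickle(path):
    """Load one pickled object from `path`."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Round-trip demo with a stand-in list written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pkl"
    with open(demo_path, "wb") as file:
        pickle.dump(["a", "b", "c"], file)
    restored = load_pickle(demo_path)

print(restored)  # ['a', 'b', 'c']

# In a later notebook, the equivalent calls would be:
# train = load_pickle("train.pkl")
# test = load_pickle("test.pkl")
```

Pickle stores class instances by reference to their class, so the `items` module must be importable when loading `train.pkl` and `test.pkl`. Only unpickle files you created or trust.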

## Todos for you:

- Investigate the dataset more!
- Confirm that the tokenizer tokenizes all 3-digit prices into 1 token