# The Product Pricer Continued

A model that can estimate how much something costs, from its description.

## Data Curation Part 2

Extend the dataset to cover more product categories, and shape it into a dataset suitable for training.  
Data curation is a crucial part of the process and an important craft to hone.

The dataset is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

And the folder with all the product datasets is here:  
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories

In [None]:
# imports

import os
import random
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
from item import Item
from loader import ItemLoader
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import numpy as np
import pickle

In [None]:
# environment

load_dotenv()

In [None]:
# display charts in the notebook

%matplotlib inline

## The ItemLoader

Look in `loader.py`: it contains some useful code to make life easier.

In [None]:
# Load in the same dataset as last time

items = ItemLoader("Appliances").load()

In [None]:
# Look at the prompt of a sample item

print(items[1].prompt)

## Now to SCALE UP

Load the datasets for all the items that might be found in a large home retail store: electrical, electronic, office and related products, but not clothes, beauty or books.

In [None]:
# Datasets are sorted from the biggest to the smallest. Comment out the ones you don't want to use.
dataset_names = [
    # "Automotive",
    # "Electronics",
    # "Office_Products",
    # "Tools_and_Home_Improvement",
    "Cell_Phones_and_Accessories",
    "Toys_and_Games",
    "Appliances",
    "Musical_Instruments",
]

In [None]:
items = []
for dataset_name in dataset_names:
    loader = ItemLoader(dataset_name)
    items.extend(loader.load(4))

# Time for a coffee break!
# The biggest datasets load first, so each later one finishes faster.

In [None]:
print(f"A grand total of {len(items):,} items")
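Before plotting, it can help to see how the grand total splits across the source datasets. The sketch below is one way to do that with `Counter`. It assumes each item exposes a `category` attribute, which this notebook does not show; `FakeItem` and `count_by_category` are hypothetical names used so the example runs on its own.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeItem:
    """Hypothetical stand-in for Item; holds only the fields used here."""
    category: str
    price: float


def count_by_category(items):
    """Return (category, count) pairs, largest category first."""
    return Counter(item.category for item in items).most_common()


# Made-up sample data, purely for illustration
sample = [
    FakeItem("Appliances", 25.0),
    FakeItem("Toys_and_Games", 12.5),
    FakeItem("Appliances", 99.0),
]

for category, count in count_by_category(sample):
    print(f"{category:<30} {count:>8,}")
```

If the real `Item` does carry a category field, passing the notebook's `items` list to `count_by_category` would give the same kind of table for the full dataset.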

In [None]:
# Plot the distribution of token counts again

tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))
plt.show()
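The chart title reports only the average and the maximum. A few more summary numbers can make the shape of the distribution easier to judge. The following is a minimal sketch using only the standard library; `summarize` is a hypothetical helper and the sample values are made up.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a non-empty list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Made-up token counts, purely for illustration
sample_tokens = [100, 120, 140, 160, 180]
stats = summarize(sample_tokens)
for name, value in stats.items():
    print(f"{name:>6}: {value:,}")
```

In the notebook, the same function could be called on `tokens` or on `prices`, since both are plain lists of numbers.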

In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))
plt.show()
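A histogram with narrow bins can make it hard to read off how many items sit in each broad price range. As a rough complement, the sketch below groups prices into fixed-width buckets and counts them. `price_buckets` and the bucket width of 100 are hypothetical choices for illustration, and the sample prices are invented.

```python
from collections import Counter


def price_buckets(prices, width=100):
    """Count prices per bucket, keyed by each bucket's lower bound."""
    counts = Counter(int(price // width) * width for price in prices)
    return dict(sorted(counts.items()))


# Made-up prices, purely for illustration
sample_prices = [5.99, 12.50, 99.99, 150.0, 420.0, 35.0]
width = 100
for lower, count in price_buckets(sample_prices, width).items():
    print(f"${lower:,} to under ${lower + width:,}: {count:,} items")
```

Applied to the notebook's `prices` list, this would give a quick table of how heavily the data leans toward cheaper items.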