# Data Collection

The Amazon Reviews 2023 dataset, collected by McAuley Lab, is a comprehensive collection of Amazon reviews and product information. It includes:

* User Reviews: Ratings, textual content, and helpfulness votes.

* Item Metadata: Product descriptions, pricing information, and raw images.

* Links: Graphs showing user-item interactions and "bought together" relationships.

Dataset page: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

Raw metadata files by category: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories

In [5]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt

In [6]:
load_dotenv(override=True)

True

In [7]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [8]:
from items import Item


In [9]:
%matplotlib inline

In [10]:
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)

Data Size

In [11]:
print(f"Number of Appliances: {len(dataset):,}")

Number of Appliances: 94,327


In [13]:
datapoint = dataset[10]

print(datapoint["title"])
print(datapoint["description"])
print(datapoint["features"])
print(datapoint["details"])
print(datapoint["price"])

WP67003405 67003405 Door Pivot Block - Compatible Kenmore KitchenAid Maytag Whirlpool Refrigerator - Replaces AP6010352 8208254 PS11743531 - Quick DIY Repair Solution
[]
['WP67003405 Pivot Block For Vernicle Mullion Strip On Door - A high-quality exact equivalent for part numbers AP6010352, 67003405, 1025322, 12698403, 67003194, 8208254, and PS11743531.', 'Compatibility with major brands - WP67003405 Door Guide is compatible with Whirlpool, Amana, Dacor, Gaggenau, Hardwick, Jenn-Air, Kenmore, KitchenAid, and Maytag.', "Quick DIY repair - WP67003405 Refrigerator Door Guide Pivot Block Replacement will help if your appliance door doesn't open or close. Wear work gloves to protect your hands during the repair process.", 'Attentive support - If you are uncertain about whether the block fits your refrigerator, we will help. We generally put forth a valiant effort to guarantee you are totally happy with your purchase.', 'High-quality elements - WP67003405 67003405 Pivot Block Replacement mee

Missingness

In [14]:
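# Count the items whose price parses to a positive number
prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError:
        # A price that is not numeric is treated as missing
        pass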

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")

There are 46,726 with prices which is 49.5%
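The counting loop above treats any price that does not parse to a positive number as missing. A minimal sketch of that rule as a reusable helper follows. `parse_price`, `price_coverage`, and the sample records are hypothetical names and data introduced for illustration; they are not part of the notebook or of the dataset. The helper also rejects `None`, NaN, and infinite values, which is an assumption beyond what the loop checks.

```python
import math
from typing import Iterable, Mapping, Optional


def parse_price(raw: object) -> Optional[float]:
    """Return a positive, finite price, or None when the value is unusable."""
    try:
        price = float(raw)
    except (TypeError, ValueError):
        # Covers None and non-numeric strings
        return None
    if not math.isfinite(price) or price <= 0:
        return None
    return price


def price_coverage(records: Iterable[Mapping[str, object]]) -> float:
    """Fraction of records whose "price" field yields a usable price."""
    records = list(records)
    if not records:
        return 0.0
    usable = sum(1 for r in records if parse_price(r.get("price")) is not None)
    return usable / len(records)


# Made-up records, only to show the call pattern
sample = [{"price": "16.52"}, {"price": "None"}, {"price": "0"}, {"price": "118.0"}]
print(f"{price_coverage(sample) * 100:.1f}% of the sample has a usable price")
```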


# Data Curation

Both price and description length are right-skewed. The data is filtered and truncated so that a robust model can be built:

* Select items that cost between 1 and 999 USD

* Truncate the text to fit within 180 tokens using the tokenizer.

    We want enough tokens to retain useful information, but few enough that the model can be trained efficiently. The value of 180 was chosen by trial and error.
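The two curation rules above can be sketched as follows. The notebook's `Item` class is not shown, so this is only an illustration under stated assumptions: `in_price_range` and `truncate_to_tokens` are hypothetical helpers, and a whitespace split stands in for the real tokenizer so that the block runs without downloading a model.

```python
MIN_PRICE = 1.0    # lower bound in USD, from the rule above
MAX_PRICE = 999.0  # upper bound in USD, from the rule above
MAX_TOKENS = 180   # token budget, from the rule above


def in_price_range(price: float) -> bool:
    """True when the price falls inside the curated range."""
    return MIN_PRICE <= price <= MAX_PRICE


def truncate_to_tokens(text, max_tokens=MAX_TOKENS, encode=str.split, decode=" ".join):
    """Keep at most max_tokens tokens of text.

    encode/decode default to whitespace splitting and joining (an assumption
    made for this sketch); a real tokenizer's functions can be passed instead.
    """
    tokens = encode(text)
    return decode(tokens[:max_tokens])


long_text = " ".join(["word"] * 500)
short_text = truncate_to_tokens(long_text)
print(len(short_text.split()))
```

With a Hugging Face tokenizer, `encode` and `decode` could be swapped for the tokenizer's own `encode` and `decode` methods, so that the budget counts model tokens rather than whitespace-separated words.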




Creating an Item object for each data point

In [15]:
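# Build an Item for each data point with a positive price and keep those marked for inclusion
items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError:
        # Skip data points whose price is not numeric
        pass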

print(f"There are {len(items):,} items")

There are 29,191 items
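The right skew mentioned in the curation notes can be checked roughly by comparing the mean with the median: a few very large values pull the mean above the median. The sketch below uses made-up prices, not values computed from the dataset, and `is_right_skewed` is a hypothetical helper. The comparison is a heuristic, not a formal skewness test.

```python
import statistics


def is_right_skewed(values) -> bool:
    """Heuristic: a mean above the median suggests a long right tail."""
    return statistics.mean(values) > statistics.median(values)


# Illustrative prices only; most are small, a few are large
sample_prices = [9.99, 12.50, 16.52, 24.00, 118.00, 749.00]
print(is_right_skewed(sample_prices))
```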


In [16]:
items[1]

<WP67003405 67003405 Door Pivot Block - Compatible Kenmore KitchenAid Maytag Whirlpool Refrigerator - Replaces AP6010352 8208254 PS11743531 - Quick DIY Repair Solution = $16.52>

What the model is going to learn during training

In [17]:
print(items[100].prompt)

How much does this cost to the nearest dollar?

Samsung Assembly Ice Maker-Mech
This is an O.E.M. Authorized part, fits with various Samsung brand models, oem part # this product in manufactured in south Korea. This is an O.E.M. Authorized part Fits with various Samsung brand models Oem part # This is a Samsung replacement part Part Number This is an O.E.M. part Manufacturer J&J International Inc., Part Weight 1 pounds, Dimensions 18 x 12 x 6 inches, model number Is Discontinued No, Color White, Material Acrylonitrile Butadiene Styrene, Quantity 1, Certification Certified frustration-free, Included Components Refrigerator-replacement-parts, Rank Tools & Home Improvement Parts & Accessories 31211, Available April 21, 2011

Price is $118.00


What the model is going to see during testing

In [18]:
print(items[100].test_prompt())

How much does this cost to the nearest dollar?

Samsung Assembly Ice Maker-Mech
This is an O.E.M. Authorized part, fits with various Samsung brand models, oem part # this product in manufactured in south Korea. This is an O.E.M. Authorized part Fits with various Samsung brand models Oem part # This is a Samsung replacement part Part Number This is an O.E.M. part Manufacturer J&J International Inc., Part Weight 1 pounds, Dimensions 18 x 12 x 6 inches, model number Is Discontinued No, Color White, Material Acrylonitrile Butadiene Styrene, Quantity 1, Certification Certified frustration-free, Included Components Refrigerator-replacement-parts, Rank Tools & Home Improvement Parts & Accessories 31211, Available April 21, 2011

Price is $
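Comparing the two outputs, the test prompt is the training prompt with the price removed after `Price is $`, so the model has to supply the number itself. One way to derive it is sketched below, assuming the training prompt ends with that prefix followed by the price. `make_test_prompt` is a hypothetical function and not necessarily how `Item.test_prompt()` is implemented.

```python
PRICE_PREFIX = "Price is $"


def make_test_prompt(prompt: str) -> str:
    """Drop everything after the last price prefix so the answer is hidden."""
    head, sep, _ = prompt.rpartition(PRICE_PREFIX)
    if not sep:
        raise ValueError("prompt does not contain the price prefix")
    return head + PRICE_PREFIX


training_prompt = (
    "How much does this cost to the nearest dollar?\n\n"
    "Samsung Assembly Ice Maker-Mech\n\n"
    "Price is $118.00"
)
print(make_test_prompt(training_prompt))
```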
