## **The Product Pricer Continued**  

A model that estimates the cost of a product based on its description.  

### **Data Curation - Part 2**  

In this session, we will expand our dataset to cover more product categories and refine it into a high-quality training set.  

**Why is data curation important?**  
Although it may not seem as exciting as model training, data curation is a **crucial skill** for LLM engineers. A well-crafted dataset ensures better model performance and can help you develop commercial AI solutions with real-world applications.  

#### **Dataset Location**  
The dataset is available here:  
🔗 [Amazon Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023)  

For access to all product categories:  
🔗 [Meta Categories Folder](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories)  


### **⚠️ Important Note**  

We are working with a **large dataset** of **400,000 items**, covering multiple product categories.  
- Training on this dataset in **Week 7** could take **20+ hours** (depending on GPU power) and might incur **cloud computing costs**.  
- For a **quicker, more cost-effective** alternative, use the **"lite"** dataset focused on *Home Appliances*. It still covers all the learning objectives with a **smaller dataset**.  

Alternatively, you can **skip the curation process** by downloading the pre-processed dataset here:  
🔗 [Pickle Files](https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW)  

---

### **Step 1: Environment Setup**  

In [None]:
# Load environment variables
import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load variables from .env file
load_dotenv()

# Retrieve API keys from environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')
hf_token = os.getenv('HF_TOKEN')

# Check if API keys are properly loaded
if not openai_api_key or not hf_token:
    raise ValueError("Missing API keys. Ensure OPENAI_API_KEY and HF_TOKEN are set in the .env file.")

# Set environment variables explicitly (optional)
os.environ['OPENAI_API_KEY'] = openai_api_key
os.environ['HF_TOKEN'] = hf_token

# Log in to Hugging Face
login(hf_token, add_to_git_credential=True)


In [None]:
import pickle
import random
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset

from utils.items import Item
from utils.loaders import ItemLoader

### **Step 2: Load Product Data**  

We start by loading product data from the **"Appliances"** category to check if the dataset is working correctly. 

In [None]:
# Load in the same dataset as last time

items = ItemLoader("Appliances").load()

In [None]:
# Look for a familiar item..
print(items[1].prompt)

#### **Scaling Up to Multiple Categories**  

Now, let's expand the dataset to include **various product categories** (excluding clothing, beauty, and books).

In [None]:
dataset_names = [
    "Automotive",
    "Electronics",
    "Office_Products",
    "Tools_and_Home_Improvement",
    "Cell_Phones_and_Accessories",
    "Toys_and_Games",
    "Appliances",
    "Musical_Instruments",
]

In [None]:

items = []
for dataset_name in dataset_names:
    loader = ItemLoader(dataset_name)
    items.extend(loader.load())

In [None]:
print(f"A grand total of {len(items):,} items")

### **Step 3: Data Exploration**  

#### **Token Count Distribution** 

In [None]:
# Plot the distribution of token counts

tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.hist(tokens, rwidth=0.7, color="skyblue", bins=range(0, 300, 10))
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.show()

#### **Price Distribution**  


In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.hist(prices, rwidth=0.7, color="blueviolet", bins=range(0, 1000, 10))
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.show()

#### **Category Distribution**  


In [None]:
category_counts = Counter()
for item in items:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Bar chart by category
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="goldenrod")
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

### **Step 4: Dataset Balancing & Curation**  

We will **rebalance** the dataset to:  
- Reduce the **overrepresentation** of low-cost items.  
- Raise the **average price** above **$60**.  
- **Limit** the dominance of the "Automotive" category.  

#### **Hint: Create a dictionary keyed by price, from $1 to $999, where each value is the list of items at that price (rounded to the nearest dollar)**

In [None]:
slots = defaultdict(list)
for item in items:
    slots[round(item.price)].append(item)
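
To see what this bucketing does, here is a minimal sketch on toy data. The `(name, price)` pairs and their values are made up for illustration; the real `Item` objects come from `utils.items`. Two details are worth noticing: Python's built-in `round` sends exact halves to the nearest even integer, and any item that rounds to $0 or to $1,000 (if the data contains one) lands in a slot that a loop over `range(1, 1000)` never visits.

```python
from collections import defaultdict

# Hypothetical toy data: (name, price) pairs standing in for Item objects
toy_items = [
    ("cable", 0.49),
    ("filter", 12.5),
    ("hose", 13.5),
    ("valve", 13.2),
    ("compressor", 999.7),
]

# Same bucketing idea as above: key = price rounded to the nearest dollar
toy_slots = defaultdict(list)
for name, price in toy_items:
    toy_slots[round(price)].append(name)

# Slots that a loop over range(1, 1000) would actually visit
visited = {p: names for p, names in toy_slots.items() if 1 <= p <= 999}

print(dict(toy_slots))
print(visited)
```

Here `12.5` falls into slot 12 while `13.5` falls into slot 14, and the items in slots 0 and 1000 are absent from `visited`.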

**Next, let's create a dataset called "sample" that draws more evenly from the range of prices and gives more weight to items from categories other than `Automotive`. We set the random seeds first for reproducibility.**

In [None]:
# Step 1: Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Step 2: Initialize an empty list to store the final sampled dataset
sample = []

# Step 3: Iterate through price slots from 1 to 999
for i in range(1, 1000):
    slot = slots[i]  # Retrieve the list of items in the current price slot

    # Step 4: If price is >= 240, include all items (no filtering)
    if i >= 240:
        sample.extend(slot)

    # Step 5: If the slot contains 1200 items or fewer, include all items
    elif len(slot) <= 1200:
        sample.extend(slot)

    # Step 6: If the slot has more than 1200 items, perform weighted sampling
    else:
        # Assign a sampling weight:
        # 'Automotive' items get weight 1, all others get weight 5 (favoring non-automotive)
        weights = np.array([1 if item.category == 'Automotive' else 5 for item in slot])

        # Normalize the weights so they sum to 1 and form a probability distribution
        weights = weights / np.sum(weights)

        # Randomly select 1200 distinct items based on the assigned weights
        selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)

        # Add the selected items to the final sample
        # (idx avoids reusing the outer loop variable i)
        sample.extend([slot[idx] for idx in selected_indices])

print(f"Final dataset contains {len(sample):,} items")
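
The effect of the 1-versus-5 weighting is easier to see in isolation. The sketch below is a toy illustration, not part of the pipeline: the slot contents are invented, and it uses NumPy's `default_rng` generator instead of the global seed set in the cell above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical slot: 1,000 Automotive items and 1,000 items from other categories
categories = np.array(["Automotive"] * 1000 + ["Other"] * 1000)

# Same weighting scheme as above: 1 for Automotive, 5 for everything else
weights = np.where(categories == "Automotive", 1.0, 5.0)
probabilities = weights / weights.sum()

# Draw 500 distinct indices; replace=False guarantees no item is picked twice
picked = rng.choice(len(categories), size=500, replace=False, p=probabilities)

automotive_share = float(np.mean(categories[picked] == "Automotive"))
print("Automotive share in the slot:   50.0%")
print(f"Automotive share in the sample: {automotive_share:.1%}")
```

Automotive items make up half of the toy slot but a much smaller share of the draw. Because sampling is without replacement, the weights act as relative preferences rather than exact 5-to-1 odds.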

### **Step 5: Final Dataset Check**  

#### **Price Distribution (Final Version)** 

In [None]:
# Plot the distribution of prices in sample

prices = [item.price for item in sample]
plt.figure(figsize=(15, 6))
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.title(f"Avg Price: {sum(prices)/len(prices):.2f}, Highest Price: {max(prices):,.2f}\n")
plt.show()

#### We did well at raising the average price and getting a 'smooth-ish' spread of prices. Now let's look at the categories.

In [None]:
category_counts = Counter()
for item in sample:
    category_counts[item.category]+=1

categories = category_counts.keys()
counts = [category_counts[category] for category in categories]

# Create bar chart
plt.figure(figsize=(15, 6))
plt.bar(categories, counts, color="lightgreen")

# Customize the chart
plt.title('How many in each category')
plt.xlabel('Categories')
plt.ylabel('Count')

plt.xticks(rotation=30, ha='right')

# Add value labels on top of each bar
for i, v in enumerate(counts):
    plt.text(i, v, f"{v:,}", ha='center', va='bottom')

# Display the chart
plt.show()

In [None]:
def report(item):
    """Print an item's prompt, then the IDs and decoded text of its last 10 tokens."""
    prompt = item.prompt
    tokens = Item.tokenizer.encode(prompt)
    print(prompt)
    print(tokens[-10:])
    print(Item.tokenizer.batch_decode(tokens[-10:]))

In [None]:
report(sample[398000])

In [None]:
report(sample[40000])

### **Step 6: Train-Test Split & Upload**  

#### **Create Train-Test Split** 

In [None]:
random.seed(42)
random.shuffle(sample)
train, test = sample[:400_000], sample[400_000:402_000]
print(f"Training set: {len(train):,} items | Test set: {len(test):,} items")
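
Slicing past the end of a Python list does not raise an error; it just returns fewer items. If `sample` held fewer than 402,000 items, the split above would quietly produce a short test set. One way to guard against that is a small helper like the sketch below. `split_train_test` is a hypothetical name introduced here for illustration, and the integers stand in for `Item` objects.

```python
import random


def split_train_test(items, train_size, test_size, seed=42):
    """Shuffle a copy of items and slice it into non-overlapping train and test lists."""
    needed = train_size + test_size
    if len(items) < needed:
        raise ValueError(f"Need {needed:,} items but only have {len(items):,}")
    shuffled = list(items)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:needed]


# Toy usage with integers standing in for Item objects
toy_train, toy_test = split_train_test(list(range(1_000)), train_size=900, test_size=50)
print(len(toy_train), len(toy_test))
print(set(toy_train).isdisjoint(toy_test))
```

Unlike the cell above, this version shuffles a copy with its own `random.Random` instance, so it leaves both the input list and the global random state untouched.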

In [None]:
print(train[0].prompt)

In [None]:
print(test[0].test_prompt())

In [None]:
# Plot the distribution of prices in the first 250 test points

prices = [float(item.price) for item in test[:250]]
plt.figure(figsize=(15, 6))
plt.title(f"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="darkblue", bins=range(0, 1000, 10))
plt.show()

#### **Save Data as a Hugging Face Dataset**  


In [None]:
train_prompts = [item.prompt for item in train]
train_prices = [item.price for item in train]
test_prompts = [item.test_prompt() for item in test]
test_prices = [item.price for item in test]

In [None]:
# Create a Dataset from the lists

train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

In [None]:
# Replace the placeholder with your own Hugging Face username before running
HF_USER = "your-hf-user-name"
DATASET_NAME = f"{HF_USER}/pricer-data"
dataset.push_to_hub(DATASET_NAME, private=True)


#### **Save Locally as Pickle Files**  


In [None]:
with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)
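
Reading the files back is the mirror image of writing them. The round trip below is a minimal sketch using a toy list of dictionaries and a temporary directory, both invented for illustration. With the real `train.pkl` and `test.pkl`, the `Item` class from `utils.items` must be importable when loading, because pickle stores a reference to the class rather than the class itself. Only unpickle files from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path

# Toy records standing in for the real train list
toy_train = [{"text": "How much does this cost?", "price": 19.99}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.pkl"

    # Write the list to disk in binary mode
    with open(path, "wb") as file:
        pickle.dump(toy_train, file)

    # Read it back; the result is an equal but separate object
    with open(path, "rb") as file:
        reloaded = pickle.load(file)

print(reloaded == toy_train)
```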