## **The Big Project Begins: Data Curation for Large Language Models (LLMs) - Part 1**  

Welcome to the first part of our journey into **data curation for LLM training**! In this session, we will explore how to **scrub, filter, and structure data** for a machine-learning model that estimates product prices based on textual descriptions.  

### **Key Learning Objectives**
By the end of this lecture, you will:
- Understand the **importance of data curation** in machine learning  
- Learn how to **load and preprocess datasets** from Hugging Face  
- Explore **dataset statistics and distributions** using Python  
- Learn how to **format data into training-friendly prompts**  
- Understand **tokenization constraints** and their impact on model training  

### **Problem Statement: The Product Pricer**  
The goal of this project is to build an **AI model** that can estimate **how much a product costs based on its description**. To do this, we need to **curate a high-quality dataset** that contains product descriptions and their corresponding prices.  

For this lecture, we will **focus on home appliances**, using the **Amazon Reviews 2023** dataset from Hugging Face.

---

### **Dataset Information**
We will be using a publicly available dataset:  
🔗 **Dataset:** [Amazon Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023)  
🔗 **Product Metadata Folder:** [Meta Categories](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories)  

This dataset contains **metadata for various product categories**, including their **titles, descriptions, features, details, and prices**.

In [None]:
# Importing required libraries
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
import matplotlib.pyplot as plt


In [None]:
# Load environment variables
import os

# Load variables from .env file
load_dotenv()

# Retrieve API keys from environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')
hf_token = os.getenv('HF_TOKEN')

# Check if API keys are properly loaded
if not openai_api_key or not hf_token:
    raise ValueError("Missing API keys. Ensure OPENAI_API_KEY and HF_TOKEN are set in the .env file.")

# Set environment variables explicitly (optional)
os.environ['OPENAI_API_KEY'] = openai_api_key
os.environ['HF_TOKEN'] = hf_token

# Log in to Hugging Face
login(hf_token, add_to_git_credential=True)


### **Loading and Exploring the Dataset**
#### **1. Loading the Home Appliances Dataset**
We will load the dataset **specific to home appliances** from Hugging Face.

In [None]:
# Load the dataset for Home Appliances
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)

#### **2. Checking the Dataset Size**
To understand the dataset scale, let's check how many data points we have:

In [None]:
print(f"Number of Appliances: {len(dataset):,}")


In [None]:
# Investigate a particular datapoint
datapoint = dataset[2]
datapoint

### **Understanding the Data Structure**
Each data point in the dataset contains multiple attributes:

In [None]:
# Investigate: Print key attributes

print(datapoint["title"])       # Product title
print(datapoint["description"]) # Product description
print(datapoint["features"])    # Product features
print(datapoint["details"])     # Additional details
print(datapoint["price"])       # Price
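
The `details` field may be stored as a JSON-encoded string rather than a dictionary. That is an assumption about this dataset's layout, so check `type(datapoint["details"])` on your own copy. A minimal sketch of decoding it defensively, where `parse_details` and the sample record are hypothetical:

```python
import json

def parse_details(raw):
    """Return the details as a dict, or an empty dict if they cannot be decoded."""
    if isinstance(raw, dict):          # already decoded
        return raw
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):    # None, non-string, or malformed JSON
        return {}
    return parsed if isinstance(parsed, dict) else {}

# Hypothetical record shaped like the fields printed above
sample = {"details": '{"Brand": "Acme", "Item Weight": "2 pounds"}'}
details = parse_details(sample["details"])
print(details.get("Brand"))
```

Returning an empty dict on failure keeps later code simple, because `details.get(...)` works whether or not the field could be parsed.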

### **Data Cleaning and Filtering**
To ensure high-quality data for training, we need to **filter and clean the dataset**.

#### **1. Checking How Many Items Have Prices**
Many product entries might be missing price information. We count how many have valid prices.

In [None]:
# How many have prices?

prices = 0
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 0:
            prices += 1
    except ValueError:
        pass

print(f"There are {prices:,} with prices which is {prices/len(dataset)*100:,.1f}%")
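
The same parse-and-check logic recurs in several cells of this lecture. One way to centralize it is a small helper; `parse_price` and the sample values below are hypothetical and only illustrate the idea:

```python
def parse_price(raw):
    """Return the price as a float, or None if it is missing, malformed, or not positive."""
    try:
        price = float(raw)
    except (TypeError, ValueError):   # None, or strings such as "None" or ""
        return None
    # NaN fails this comparison too, so it is also rejected
    return price if price > 0 else None

# Hypothetical raw values standing in for datapoint["price"]
raw_prices = ["19.99", "None", "", None, "0", "249"]
parsed = [parse_price(p) for p in raw_prices]
print(parsed)
```

Catching `TypeError` as well as `ValueError` also covers a price stored as a real `None` instead of a string.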

#### **2. Filtering Items with Valid Prices**
We only select **products with valid prices** between **$1 and $999**, ensuring a **reasonable price range**.

In [None]:
# For those with prices, gather the price and the length

valid_prices = []
lengths = []

for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if 1 <= price <= 999:
            valid_prices.append(price)
            content = datapoint["title"] + str(datapoint["description"]) + str(datapoint["features"]) + str(datapoint["details"])
            lengths.append(len(content))
    except ValueError:
        pass
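
Before plotting, a quick numeric summary can show whether the mean is being pulled up by a few large values. This is a sketch using only the standard library; `summarize` and the sample prices are hypothetical:

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a non-empty list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "max": ordered[-1],
    }

# Hypothetical prices; in the notebook you would pass valid_prices or lengths
sample_prices = [5.0, 12.5, 20.0, 99.0, 450.0]
summary = summarize(sample_prices)
print(summary)
```

A median far below the mean, as in this sample, suggests a right-skewed distribution, which is common for prices.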

### **Visualizing Data Distributions**
#### **1. Distribution of Text Lengths**
We analyze the **combined character length of each product's title, description, features, and details** to check that items carry enough text to be informative.

In [None]:
plt.figure(figsize=(15, 6))
plt.title(f"Lengths: Avg {sum(lengths)/len(lengths):,.0f} and highest {max(lengths):,}\n")
plt.xlabel('Length (chars)')
plt.ylabel('Count')
plt.hist(lengths, rwidth=0.7, color="lightblue", bins=range(0, 6000, 100))
plt.show()

#### **2. Distribution of Prices**
To understand price variations, we plot a histogram of **price values**.

In [None]:
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(valid_prices)/len(valid_prices):,.2f} and highest {max(valid_prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(valid_prices, rwidth=0.7, color="orange", bins=range(0, 1000, 10))
plt.show()

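#### Let's print out the outlier item

The histogram above covers only prices up to $999. The cell below scans the unfiltered dataset and prints the title of any item priced above $21,000, which that filter excludes.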

In [None]:
# So what is this item??

for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if price > 21000:
            print(datapoint['title'])
    except ValueError:
        pass
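
A fixed threshold such as 21000 presumes you already know roughly where the outlier sits. An alternative sketch uses `heapq.nlargest` to list the most expensive items without choosing a threshold; `top_priced` and the records below are hypothetical and not taken from the dataset:

```python
import heapq

def top_priced(records, n=3):
    """Return the n (price, title) pairs with the highest parseable prices."""
    priced = []
    for record in records:
        try:
            priced.append((float(record["price"]), record["title"]))
        except (TypeError, ValueError):
            continue  # skip records without a usable price
    return heapq.nlargest(n, priced)

# Hypothetical records, not taken from the dataset
records = [
    {"title": "Ice maker", "price": "129.99"},
    {"title": "Commercial oven", "price": "21000.50"},
    {"title": "Filter", "price": "None"},
    {"title": "Dryer vent kit", "price": "18.75"},
]
for price, title in top_priced(records, n=2):
    print(f"${price:,.2f}  {title}")
```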

### **Preparing Data for Training**
#### **1. Creating Structured Data with the `Item` Class**
We use an `Item` class to **process and format** data for training.

In [None]:
from utils.items import Item

# Create Item objects for products with valid prices
items = []
for datapoint in dataset:
    try:
        price = float(datapoint["price"])
        if 1 <= price <= 999:
            item = Item(datapoint, price)
            if item.include:
                items.append(item)
    except ValueError:
        pass

print(f"There are {len(items):,} items")
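
The source of `utils.items` is not shown in this lecture. To make the `include`, `prompt`, and `test_prompt()` interface concrete, here is a hypothetical stand-in. The class name `SketchItem`, the character threshold, and the prompt wording are all assumptions, and the real `Item` class may differ:

```python
class SketchItem:
    """Hypothetical stand-in for utils.items.Item; thresholds and wording are assumptions."""

    MIN_CHARS = 30                      # assumed minimum amount of text to keep an item
    QUESTION = "How much does this cost to the nearest dollar?"
    PREFIX = "Price is $"

    def __init__(self, data, price):
        self.title = data["title"]
        self.price = price
        # Join the free-text fields into one block; list fields are flattened
        parts = []
        for key in ("description", "features"):
            value = data.get(key) or []
            parts.extend(value if isinstance(value, list) else [str(value)])
        text = " ".join(" ".join(parts).split())   # collapse runs of whitespace
        self.include = len(text) >= self.MIN_CHARS
        self.prompt = None
        if self.include:
            self.prompt = (
                f"{self.QUESTION}\n\n{self.title}\n{text}\n\n"
                f"{self.PREFIX}{round(price)}.00"
            )

    def test_prompt(self):
        """Return the prompt with the answer removed, ending at the price prefix."""
        return self.prompt.rsplit(self.PREFIX, 1)[0] + self.PREFIX


# Hypothetical record shaped like a dataset entry
item = SketchItem(
    {
        "title": "Ice maker",
        "description": ["Compact countertop ice maker."],
        "features": ["Makes ice in 10 minutes"],
    },
    price=129.99,
)
print(item.test_prompt())
```

The training prompt ends with the price so the model learns to produce it, while the test prompt stops at the prefix so the model has to supply the number itself.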

#### **2. Understanding Tokenization Constraints**
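We **truncate product descriptions** to **160 tokens** to fit within model input limits. This cap applies to the product text only; the question and price wrapped around it add further tokens to the full prompt.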
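
Truncating to a token budget comes down to three steps: encode, slice, decode. The sketch below uses a whitespace tokenizer as a stand-in so it runs without downloading a model. That is an assumption made for illustration, and `truncate_to_tokens` is a hypothetical helper. With a Hugging Face tokenizer, the same steps would use `tokenizer.encode(text, add_special_tokens=False)` and `tokenizer.decode(ids)`.

```python
def truncate_to_tokens(text, max_tokens, encode, decode):
    """Cut text to at most max_tokens tokens using the supplied encode/decode pair."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens)
    kept = tokens[:max_tokens]
    return decode(kept), len(kept)

# Stand-in tokenizer: one token per whitespace-separated word (an assumption;
# real LLM tokenizers split text into subword pieces, so counts will differ)
def encode(text):
    return text.split()

def decode(tokens):
    return " ".join(tokens)

description = "Stainless steel countertop ice maker with self-cleaning function"
short_text, count = truncate_to_tokens(description, 5, encode, decode)
print(count, short_text)
```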

In [None]:
# Look at one item (index 1, the second in the list)

items[1]

In [None]:
# Investigate the prompt that will be used during training - the model learns to complete this

print(items[1].prompt)

In [None]:
# Investigate the prompt that will be used during testing - the model has to complete this

print(items[1].test_prompt())

### **Let's plot the distribution of token counts**

In [None]:
# Plot the distribution of token counts

tokens = [item.token_count for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\n")
plt.xlabel('Length (tokens)')
plt.ylabel('Count')
plt.hist(tokens, rwidth=0.7, color="green", bins=range(0, 300, 10))
plt.show()

### **Let's plot the distribution of prices**

In [None]:
# Plot the distribution of prices

prices = [item.price for item in items]
plt.figure(figsize=(15, 6))
plt.title(f"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\n")
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.hist(prices, rwidth=0.7, color="purple", bins=range(0, 1000, 10))
plt.show()



### **Why 180 Tokens?**
- **Balance of detail & efficiency** – Enough context to predict the price while keeping training efficient.  
- **Matches real-world usage** – At inference time, product descriptions will be short.  
- **Trial & error** – Among the lengths we tried, about **180 tokens** per prompt (the 160-token product text plus the surrounding question and price) gave the **best results**.  
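
Choosing a cap by trial and error is easier when you can see how many items each candidate cap would cut short. A minimal sketch, where `share_truncated` and the token counts are hypothetical; in practice you would measure the counts on untruncated product texts:

```python
def share_truncated(token_counts, cap):
    """Return the fraction of items whose token count exceeds the cap."""
    if not token_counts:
        return 0.0
    return sum(1 for n in token_counts if n > cap) / len(token_counts)

# Hypothetical token counts for untruncated product texts
token_counts = [90, 140, 175, 200, 260, 310, 120, 185]
for cap in (120, 160, 180, 240):
    print(f"cap={cap}: {share_truncated(token_counts, cap):.0%} of items would be truncated")
```

A lower cap shortens training sequences but discards more text, so the share truncated is one input to the decision alongside model accuracy.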

---

### **Coming Up Next**
- **Merging with other categories** – We will combine Electronics, Automotive, etc.  
- **Building the final dataset** – Selecting **high-quality** samples for model training  
