# Dataset for Amazon Review NLP project
#### CSCI 3832 Natural Language Processing
Members: Adam Wuth, Benjamin Kohav, Noah Vilas, Aiden Devine, Evan Zachary

The dataset we went with was McAuley-Lab/Amazon-Reviews-2023. It is a hugging face dataset. We wanted to represent the most recent language trends possible, so we only took data from 2023. Additionally, the dataset is huge, so we filtered it to get 100,000 reviews. (100,000 reviews / 34 product categories) / 5 review categories = 589 reviews per star rating per product category.

### Project Imports and Requirements

In [1]:
import os, random, sys, copy
import torch, torch.nn as nn, numpy as np
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from nltk.tokenize import word_tokenize
from datasets import load_dataset, concatenate_datasets, load_from_disk
from datetime import datetime
import matplotlib.pyplot as plt

#seperate imports for bert

from datasets import load_from_disk
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from collections import Counter



### Load in the data set
The dataset is split into categories, but we wanted all categories from 2023. This code block will take forever to run, only run it the first time to get the dataset. In order to do that, I loaded in 1000000, and filter from there

In [2]:
#The dataset is split into categories

categories = [
    "All_Beauty",
    "Amazon_Fashion",
    "Appliances",
    "Arts_Crafts_and_Sewing",
    "Automotive",
    "Baby_Products",
    "Beauty_and_Personal_Care",
    "Books",
    "CDs_and_Vinyl",
    "Cell_Phones_and_Accessories",
    "Clothing_Shoes_and_Jewelry",
    "Digital_Music",
    "Electronics",
    "Gift_Cards",
    "Grocery_and_Gourmet_Food",
    "Handmade_Products",
    "Health_and_Household",
    "Health_and_Personal_Care",
    "Home_and_Kitchen",
    "Industrial_and_Scientific",
    "Kindle_Store",
    "Magazine_Subscriptions",
    "Movies_and_TV",
    "Musical_Instruments",
    "Office_Products",
    "Patio_Lawn_and_Garden",
    "Pet_Supplies",
    "Software",
    "Sports_and_Outdoors",
    "Subscription_Boxes",
    "Tools_and_Home_Improvement",
    "Toys_and_Games",
    "Video_Games",
    "Unknown"
]


limit = 589  # 100,000 target reviews 34 categories 5 stars, (100,000/34)/5 = 889

allcats = []

for cat in categories:
    print(f"Loading category: {cat}")
    #arbitrary 10000000 to make sure I get enough data after filter
    dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{cat}", split="full[:1000000]", trust_remote_code=True)
    #dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{cat}", split="full[:1%]",  trust_remote_code=True)
    #to get reviews from 2023 onwards 2020 onwards was millions of reviews and was taking
    #over an hour just to load the data
    filtered_dataset = dataset.filter(lambda x: x['timestamp'] >= int(datetime(2023, 1, 1).timestamp() * 1000))
    #in each category, for stars 1-5(not inclusive)
    for star in range(1, 6):
        data = dataset.filter(lambda x: int(float(x["rating"])) == star)
        if len(data) >= limit:
            #trim extra reviews randomly to avoid bias
            data = data.shuffle().select(range(limit))
            allcats.append(data)


Loading category: All_Beauty
Loading category: Amazon_Fashion
Loading category: Appliances
Loading category: Arts_Crafts_and_Sewing
Loading category: Automotive
Loading category: Baby_Products
Loading category: Beauty_and_Personal_Care
Loading category: Books
Loading category: CDs_and_Vinyl
Loading category: Cell_Phones_and_Accessories
Loading category: Clothing_Shoes_and_Jewelry
Loading category: Digital_Music
Loading category: Electronics
Loading category: Gift_Cards
Loading category: Grocery_and_Gourmet_Food
Loading category: Handmade_Products
Loading category: Health_and_Household
Loading category: Health_and_Personal_Care
Loading category: Home_and_Kitchen
Loading category: Industrial_and_Scientific
Loading category: Kindle_Store
Loading category: Magazine_Subscriptions
Loading category: Movies_and_TV
Loading category: Musical_Instruments


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Office_Products


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Patio_Lawn_and_Garden


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Pet_Supplies


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Software


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Sports_and_Outdoors


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Subscription_Boxes


Filter:   0%|          | 0/16216 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16216 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16216 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16216 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16216 [00:00<?, ? examples/s]

Loading category: Tools_and_Home_Improvement


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Toys_and_Games


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Video_Games


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Loading category: Unknown


Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000000 [00:00<?, ? examples/s]

In [None]:
#Save reviews so we don't have to run code again
reviews = concatenate_datasets(allcats)
reviews.save_to_disk("filetred_amazon_reviews")
print(Counter(reviews["rating"])) 
print(f"Total reviews loaded: {len(reviews)}")

If you have run that already, reviews was saved(should be in the working directory)so you can just do the next code block instead when you want to load in the data.
## Use the line below to load in the saved dataset, so you don't have to run the code again

In [2]:
reviews = load_from_disk("filetred_amazon_reviews")

In [3]:
#code to make sure reviews is correct
print(len(reviews))
print(reviews[0])
print(reviews[1])
print(reviews.column_names)

100130
{'rating': 1.0, 'title': 'Worst nail polish ever', 'text': 'Worst nail polish ever! My daughter and I both used this nail polish in two different colors and now our nails are damaged. Our nails split horizontally and are peeling. Plus the damage has caused pain. Worst Sally Hansen product ever!', 'images': [], 'asin': 'B011855ADM', 'parent_asin': 'B011855ADM', 'user_id': 'AEMVAG56MA7MAFULCQJEOVJCKGHA', 'timestamp': 1454738837000, 'helpful_vote': 8, 'verified_purchase': True}
{'rating': 1.0, 'title': 'No funciona para mi', 'text': 'Bueno en cuanto a mi respondo que no me funciono. Tengo pocas pestañas, las enchufe antes de poner la máscara y el resultado desastroso. El producto hizo que mis pestañas perdieran el volumen del encrespado horrible.', 'images': [], 'asin': 'B09GTV6WL6', 'parent_asin': 'B09GTV6WL6', 'user_id': 'AFPNHXMEBYKO3SPMFXZCALLZ5IHA', 'timestamp': 1645820993736, 'helpful_vote': 5, 'verified_purchase': True}
['rating', 'title', 'text', 'images', 'asin', 'parent_a

## Splitting the Dataset
Because of limited access to a GPU and other issues with training time, in order to split the data into a more managable chunk, you can use this code bellow.

In [4]:
samples_per_class = 1000
rate = []

for rating in [1.0, 2.0, 3.0, 4.0, 5.0]:
    data = reviews.filter(lambda x: x["rating"] == rating)
    data = data.shuffle().select(range(min(len(data), samples_per_class)))
    rate.append(data)

reviews_small = concatenate_datasets(rate).shuffle()

def preprocess(example):
    example['labels'] = int(example['rating']) - 1
    return example

reviews_small = reviews_small.map(preprocess)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

tokenized = reviews_small.map(tokenize_function, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(Counter(reviews_small["rating"])) 


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Counter({5.0: 1000, 3.0: 1000, 4.0: 1000, 2.0: 1000, 1.0: 1000})


## Training split
For training, split the data 80 percent train, 20 percent test as seen here

In [5]:

split = tokenized.train_test_split(test_size=0.2, seed=42)
train_dataset = split["train"]
valid_dataset = split["test"]

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=8)
print(Counter(train_dataset["rating"])) 
print(Counter(valid_dataset["rating"])) 

Counter({2.0: 812, 4.0: 806, 5.0: 801, 3.0: 795, 1.0: 786})
Counter({1.0: 214, 3.0: 205, 5.0: 199, 4.0: 194, 2.0: 188})
