### Dataset Description

**Amazon Reviews 2023**

This dataset, collected in 2023 by McAuley Lab, is a large-scale collection of Amazon reviews. It includes rich features such as user reviews, item metadata, and user-item interaction links. Key features of the dataset include:

- **User Reviews:** Contains user ratings, review text, helpfulness votes, and more.
- **Item Metadata:** Includes descriptions, prices, images, and other attributes of the items.
- **Links:** Provides user-item and bought-together graphs.

**What's New?**

- **Larger Dataset:** The dataset comprises 571.54 million reviews, which is 245.2% larger than the previous version.
- **Newer Interactions:** The interactions range from May 1996 to September 2023.
- **Richer Metadata:** The item metadata includes more descriptive features.
- **Fine-grained Timestamp:** Interaction timestamps are provided at the second or finer level.
- **Cleaner Processing:** The item metadata is cleaner compared to previous versions.
- **Standard Splitting:** Standard data splits are provided to encourage benchmarking in recommendation systems.

**Basic Statistics**

The dataset provides statistics such as the number of reviews, users, items, tokens, and the timespan covered, grouped by year.

**Grouped by Category**

The dataset is grouped by category, with information on the number of users, items, ratings, tokens in user reviews, and tokens in item metadata for each category.

**Data Fields**

For user reviews, the fields include rating, title, text, images, ASIN, parent ASIN, user ID, timestamp, verified purchase, and helpful vote. For item metadata, the fields include main category, title, average rating, rating number, features, description, price, images, videos, store, categories, details, parent ASIN, and bought together.

**Citation**

Please cite the dataset using the provided citation:

Hou, Yupeng, et al. "Bridging Language and Items for Retrieval and Recommendation." arXiv preprint arXiv:2403.03952 (2024).


In [2]:
from datasets import load_dataset

# Load the dataset using the load_dataset function directly
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", name="raw_review_All_Beauty")

print(dataset)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading data:   0%|          | 0.00/327M [00:00<?, ?B/s]

In [1]:
import pandas as pd
import string
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ok\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ok\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [None]:
# Extract the DataFrame from the loaded dataset
df = dataset["full"].to_pandas()

# Display basic information about the dataset
print("Dataset Information:")
print(df.info())

# Display summary statistics for numerical columns
print("\nSummary Statistics:")
print(df.describe())

In [None]:
# Distribution of Ratings
plt.figure(figsize=(8, 6))
sns.countplot(x='rating', data=df)
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

# Distribution of Review Lengths
df['review_length'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
plt.figure(figsize=(8, 6))
sns.histplot(df['review_length'], bins=50, kde=True)
plt.title('Distribution of Review Lengths')
plt.xlabel('Review Length')
plt.ylabel('Count')
plt.show()

In [None]:
# Text Normalization and Standardization
stop_words = set(stopwords.words('english'))

def normalize_text(text):
    # Tokenize text
    tokens = nltk.word_tokenize(text)
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]
    # Remove stopwords
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Lemmatization
    lemmatizer = nltk.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back into a string
    normalized_text = ' '.join(tokens)
    return normalized_text

# Apply text normalization to the 'text' column
df['normalized_text'] = df['text'].apply(normalize_text)

# Display the first few rows of the preprocessed dataset
print("\nPreprocessed Dataset:")
print(df.head())