# Electronics Dataset Preprocessing

This notebook creates a manageable subset of the Amazon Electronics dataset by keeping only the most-reviewed products.

**Goal**: Extract the 1,000 most popular Electronics products, ranked by review count, for our RAG pipeline.

**Data Source**: Amazon Reviews 2023 dataset by McAuley Lab

**Citation**: Hou et al. (2024) - Bridging Language and Items for Retrieval and Recommendation (arXiv:2403.03952)

In [1]:
import json
import pandas as pd
from collections import Counter, defaultdict
import gzip
from pathlib import Path
import numpy as np
from tqdm import tqdm

# Set paths
DATA_DIR = Path("../data")
REVIEWS_FILE = DATA_DIR / "Electronics.jsonl"
META_FILE = DATA_DIR / "meta_Electronics.jsonl"
OUTPUT_DIR = DATA_DIR / "processed"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # Also create missing parent directories

# Configuration
TARGET_PRODUCTS = 1000
MIN_REVIEWS_PER_PRODUCT = 10  # Minimum reviews to be considered "popular"
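
As a rough illustration of how the two configuration values might work together, the sketch below ranks products by review count, drops those under the minimum, and keeps the top N. The helper `select_popular_products` and the toy ASINs are hypothetical and are not part of this notebook; the actual selection step may differ.

```python
from collections import Counter

def select_popular_products(review_counts, target_products=1000, min_reviews=10):
    """Return up to `target_products` (parent_asin, count) pairs
    that have at least `min_reviews` reviews, most-reviewed first.

    Hypothetical helper for illustration only.
    """
    # Counter.most_common() returns items sorted by count, highest first.
    ranked = review_counts.most_common()
    eligible = [(asin, n) for asin, n in ranked if n >= min_reviews]
    return eligible[:target_products]

# Toy counts with made-up ASINs
toy_counts = Counter({"B0AAA": 25, "B0BBB": 12, "B0CCC": 9, "B0DDD": 40})
top = select_popular_products(toy_counts, target_products=2, min_reviews=10)
print(top)  # [('B0DDD', 40), ('B0AAA', 25)]
```

Filtering before truncating means the result can contain fewer than `target_products` entries when too few products meet the minimum.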


## Step 1: Count Reviews per Product

First, we'll scan through the reviews file to count how many reviews each product (parent_asin) has received.
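
To make the idea concrete before running it on the full file, here is a minimal sketch of the same counting logic on three in-memory records. The ASINs and the `rating` values are made up for illustration; only the `parent_asin` key is taken from the description above.

```python
import io
import json
from collections import Counter

# Three made-up review records in JSON Lines form (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"parent_asin": "B0AAA", "rating": 5.0}\n'
    '{"parent_asin": "B0BBB", "rating": 3.0}\n'
    '{"parent_asin": "B0AAA", "rating": 4.0}\n'
)

sample_counts = Counter()
for line in sample_jsonl:
    record = json.loads(line)
    asin = record.get("parent_asin")
    if asin:  # Ignore records without a product identifier
        sample_counts[asin] += 1

print(sample_counts)  # Counter({'B0AAA': 2, 'B0BBB': 1})
```

Because the file is read one line at a time, only the counter is held in memory, not the reviews themselves.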

In [2]:
def count_reviews_per_product(reviews_file):
    """Count the number of reviews for each product (parent_asin)."""
    print(f"Counting reviews per product from {reviews_file}...")
    
    review_counts = Counter()
    total_reviews = 0
    
    # Handle both .jsonl and .jsonl.gz files
    if str(reviews_file).endswith('.gz'):
        file_opener = gzip.open
        mode = 'rt'
    else:
        file_opener = open
        mode = 'r'
    
    with file_opener(reviews_file, mode, encoding='utf-8') as f:
        for line_num, line in enumerate(tqdm(f, desc="Processing reviews")):
            # Use tqdm.write() so status messages do not overlap the progress bar
            if line_num % 100000 == 0 and line_num > 0:
                tqdm.write(f"Processed {line_num:,} reviews, found {len(review_counts):,} unique products")

            line = line.strip()
            if not line:
                continue  # Skip blank lines instead of reporting them as invalid JSON

            try:
                review = json.loads(line)
                parent_asin = review.get('parent_asin')
                if parent_asin:
                    review_counts[parent_asin] += 1
                    total_reviews += 1
            except json.JSONDecodeError:
                tqdm.write(f"Skipping invalid JSON at line {line_num + 1}")
                continue
    
    print(f"\nTotal reviews processed: {total_reviews:,}")
    print(f"Unique products found: {len(review_counts):,}")
    
    return review_counts

# Count reviews per product
review_counts = count_reviews_per_product(REVIEWS_FILE)


Counting reviews per product from ../data/Electronics.jsonl...

Processed 100,000 reviews, found 59,429 unique products
Processed 200,000 reviews, found 95,630 unique products
...
Processed 1,000,000 reviews, found 271,211 unique products
...
Processed 5,000,000 reviews, found 643,931 unique products
...
Processed 10,000,000 reviews, found 885,488 unique products
...
Processed 15,000,000 reviews, found 1,058,172 unique products
...
Processed 20,000,000 reviews, found 1,191,387 unique products
...
Processed 25,000,000 reviews, found 1,303,710 unique products
...
Processed 26,900,000 reviews, found 1,342,045 unique products


Processing reviews: 27069961it [01:15, 355537.24it/s]

Processed 27,000,000 reviews, found 1,344,081 unique products


Processing reviews: 27140931it [01:15, 322508.09it/s]

Processed 27,100,000 reviews, found 1,345,853 unique products


Processing reviews: 27246520it [01:15, 335170.13it/s]

Processed 27,200,000 reviews, found 1,347,730 unique products


Processing reviews: 27355048it [01:16, 353530.02it/s]

Processed 27,300,000 reviews, found 1,349,680 unique products


Processing reviews: 27464710it [01:16, 361330.33it/s]

Processed 27,400,000 reviews, found 1,351,615 unique products


Processing reviews: 27537300it [01:16, 361689.28it/s]

Processed 27,500,000 reviews, found 1,353,642 unique products


Processing reviews: 27648366it [01:17, 359477.36it/s]

Processed 27,600,000 reviews, found 1,355,442 unique products


Processing reviews: 27759105it [01:17, 366485.65it/s]

Processed 27,700,000 reviews, found 1,357,280 unique products


Processing reviews: 27869440it [01:17, 366441.32it/s]

Processed 27,800,000 reviews, found 1,359,185 unique products


Processing reviews: 27942803it [01:17, 366594.34it/s]

Processed 27,900,000 reviews, found 1,361,197 unique products


Processing reviews: 28053774it [01:18, 365676.21it/s]

Processed 28,000,000 reviews, found 1,363,151 unique products


Processing reviews: 28164507it [01:18, 366041.76it/s]

Processed 28,100,000 reviews, found 1,365,032 unique products


Processing reviews: 28238208it [01:18, 367242.34it/s]

Processed 28,200,000 reviews, found 1,366,898 unique products


Processing reviews: 28348474it [01:19, 365869.14it/s]

Processed 28,300,000 reviews, found 1,368,865 unique products


Processing reviews: 28460778it [01:19, 369816.55it/s]

Processed 28,400,000 reviews, found 1,370,873 unique products


Processing reviews: 28571372it [01:19, 365055.95it/s]

Processed 28,500,000 reviews, found 1,372,822 unique products


Processing reviews: 28643748it [01:19, 356059.77it/s]

Processed 28,600,000 reviews, found 1,374,723 unique products


Processing reviews: 28749700it [01:20, 303543.32it/s]

Processed 28,700,000 reviews, found 1,376,431 unique products


Processing reviews: 28860560it [01:20, 344408.28it/s]

Processed 28,800,000 reviews, found 1,378,313 unique products


Processing reviews: 28970538it [01:20, 359226.48it/s]

Processed 28,900,000 reviews, found 1,380,319 unique products


Processing reviews: 29044008it [01:20, 363629.72it/s]

Processed 29,000,000 reviews, found 1,382,093 unique products


Processing reviews: 29154105it [01:21, 365286.97it/s]

Processed 29,100,000 reviews, found 1,383,883 unique products


Processing reviews: 29265052it [01:21, 361547.60it/s]

Processed 29,200,000 reviews, found 1,385,876 unique products


Processing reviews: 29375533it [01:21, 365210.87it/s]

Processed 29,300,000 reviews, found 1,387,709 unique products


Processing reviews: 29449192it [01:22, 367017.39it/s]

Processed 29,400,000 reviews, found 1,389,441 unique products


Processing reviews: 29561704it [01:22, 336162.11it/s]

Processed 29,500,000 reviews, found 1,391,062 unique products


Processing reviews: 29668366it [01:22, 348548.95it/s]

Processed 29,600,000 reviews, found 1,392,726 unique products


Processing reviews: 29740210it [01:22, 353992.16it/s]

Processed 29,700,000 reviews, found 1,394,400 unique products


Processing reviews: 29848760it [01:23, 359164.39it/s]

Processed 29,800,000 reviews, found 1,396,294 unique products


Processing reviews: 29956377it [01:23, 339809.56it/s]

Processed 29,900,000 reviews, found 1,398,058 unique products


Processing reviews: 30058377it [01:23, 314090.12it/s]

Processed 30,000,000 reviews, found 1,399,836 unique products


Processing reviews: 30166116it [01:24, 343318.06it/s]

Processed 30,100,000 reviews, found 1,401,629 unique products


Processing reviews: 30238881it [01:24, 354418.54it/s]

Processed 30,200,000 reviews, found 1,403,282 unique products


Processing reviews: 30350016it [01:24, 364163.00it/s]

Processed 30,300,000 reviews, found 1,404,975 unique products


Processing reviews: 30460051it [01:25, 365795.00it/s]

Processed 30,400,000 reviews, found 1,406,558 unique products


Processing reviews: 30568913it [01:25, 337139.47it/s]

Processed 30,500,000 reviews, found 1,408,186 unique products


Processing reviews: 30642342it [01:25, 352037.61it/s]

Processed 30,600,000 reviews, found 1,409,813 unique products


Processing reviews: 30752461it [01:25, 362278.53it/s]

Processed 30,700,000 reviews, found 1,411,550 unique products


Processing reviews: 30861130it [01:26, 359847.54it/s]

Processed 30,800,000 reviews, found 1,413,100 unique products


Processing reviews: 30971134it [01:26, 364623.32it/s]

Processed 30,900,000 reviews, found 1,414,790 unique products


Processing reviews: 31043947it [01:26, 360606.22it/s]

Processed 31,000,000 reviews, found 1,416,342 unique products


Processing reviews: 31154015it [01:26, 363871.51it/s]

Processed 31,100,000 reviews, found 1,418,088 unique products


Processing reviews: 31263246it [01:27, 363282.78it/s]

Processed 31,200,000 reviews, found 1,419,791 unique products


Processing reviews: 31375091it [01:27, 370305.06it/s]

Processed 31,300,000 reviews, found 1,421,250 unique products


Processing reviews: 31449023it [01:27, 368550.81it/s]

Processed 31,400,000 reviews, found 1,422,879 unique products


Processing reviews: 31560278it [01:28, 369856.33it/s]

Processed 31,500,000 reviews, found 1,424,574 unique products


Processing reviews: 31672302it [01:28, 371646.71it/s]

Processed 31,600,000 reviews, found 1,426,135 unique products


Processing reviews: 31747448it [01:28, 373436.66it/s]

Processed 31,700,000 reviews, found 1,427,703 unique products


Processing reviews: 31860345it [01:28, 373959.35it/s]

Processed 31,800,000 reviews, found 1,429,321 unique products


Processing reviews: 31972058it [01:29, 370548.74it/s]

Processed 31,900,000 reviews, found 1,430,881 unique products


Processing reviews: 32047438it [01:29, 374094.87it/s]

Processed 32,000,000 reviews, found 1,432,654 unique products


Processing reviews: 32159154it [01:29, 334737.68it/s]

Processed 32,100,000 reviews, found 1,434,302 unique products


Processing reviews: 32269534it [01:30, 354023.84it/s]

Processed 32,200,000 reviews, found 1,435,955 unique products


Processing reviews: 32342669it [01:30, 360138.00it/s]

Processed 32,300,000 reviews, found 1,437,648 unique products


Processing reviews: 32453120it [01:30, 364699.20it/s]

Processed 32,400,000 reviews, found 1,439,180 unique products


Processing reviews: 32563707it [01:30, 366675.09it/s]

Processed 32,500,000 reviews, found 1,440,753 unique products


Processing reviews: 32637412it [01:31, 367689.60it/s]

Processed 32,600,000 reviews, found 1,442,362 unique products


Processing reviews: 32748972it [01:31, 370909.97it/s]

Processed 32,700,000 reviews, found 1,443,862 unique products


Processing reviews: 32860683it [01:31, 370188.45it/s]

Processed 32,800,000 reviews, found 1,445,319 unique products


Processing reviews: 32971840it [01:31, 368579.24it/s]

Processed 32,900,000 reviews, found 1,446,830 unique products


Processing reviews: 33045882it [01:32, 367317.08it/s]

Processed 33,000,000 reviews, found 1,448,501 unique products


Processing reviews: 33156142it [01:32, 358605.69it/s]

Processed 33,100,000 reviews, found 1,450,056 unique products


Processing reviews: 33266406it [01:32, 364482.89it/s]

Processed 33,200,000 reviews, found 1,451,601 unique products


Processing reviews: 33340285it [01:32, 365574.09it/s]

Processed 33,300,000 reviews, found 1,453,175 unique products


Processing reviews: 33451826it [01:33, 369808.91it/s]

Processed 33,400,000 reviews, found 1,454,713 unique products


Processing reviews: 33562794it [01:33, 366815.62it/s]

Processed 33,500,000 reviews, found 1,456,232 unique products


Processing reviews: 33673228it [01:33, 367483.40it/s]

Processed 33,600,000 reviews, found 1,457,801 unique products


Processing reviews: 33749152it [01:34, 373220.83it/s]

Processed 33,700,000 reviews, found 1,459,316 unique products


Processing reviews: 33823538it [01:34, 369754.41it/s]

Processed 33,800,000 reviews, found 1,460,790 unique products


Processing reviews: 33972587it [01:34, 351921.43it/s]

Processed 33,900,000 reviews, found 1,462,229 unique products


Processing reviews: 34047471it [01:34, 364222.05it/s]

Processed 34,000,000 reviews, found 1,463,563 unique products


Processing reviews: 34160367it [01:35, 371184.09it/s]

Processed 34,100,000 reviews, found 1,464,940 unique products


Processing reviews: 34273889it [01:35, 374348.29it/s]

Processed 34,200,000 reviews, found 1,466,463 unique products


Processing reviews: 34348400it [01:35, 367975.49it/s]

Processed 34,300,000 reviews, found 1,467,990 unique products


Processing reviews: 34458755it [01:36, 363220.08it/s]

Processed 34,400,000 reviews, found 1,469,558 unique products


Processing reviews: 34569333it [01:36, 366956.44it/s]

Processed 34,500,000 reviews, found 1,471,191 unique products


Processing reviews: 34643449it [01:36, 367295.53it/s]

Processed 34,600,000 reviews, found 1,472,810 unique products


Processing reviews: 34755074it [01:36, 370621.97it/s]

Processed 34,700,000 reviews, found 1,474,278 unique products


Processing reviews: 34866344it [01:37, 369414.96it/s]

Processed 34,800,000 reviews, found 1,475,825 unique products


Processing reviews: 34940866it [01:37, 370353.05it/s]

Processed 34,900,000 reviews, found 1,477,388 unique products


Processing reviews: 35051568it [01:37, 365528.47it/s]

Processed 35,000,000 reviews, found 1,479,046 unique products


Processing reviews: 35161285it [01:37, 363774.24it/s]

Processed 35,100,000 reviews, found 1,480,565 unique products


Processing reviews: 35271022it [01:38, 364705.14it/s]

Processed 35,200,000 reviews, found 1,482,129 unique products


Processing reviews: 35344046it [01:38, 364555.00it/s]

Processed 35,300,000 reviews, found 1,483,764 unique products


Processing reviews: 35455953it [01:38, 370847.55it/s]

Processed 35,400,000 reviews, found 1,485,341 unique products


Processing reviews: 35566974it [01:39, 331648.01it/s]

Processed 35,500,000 reviews, found 1,486,755 unique products


Processing reviews: 35638421it [01:39, 344073.77it/s]

Processed 35,600,000 reviews, found 1,488,254 unique products


Processing reviews: 35747781it [01:39, 357330.34it/s]

Processed 35,700,000 reviews, found 1,489,731 unique products


Processing reviews: 35859775it [01:39, 367369.54it/s]

Processed 35,800,000 reviews, found 1,491,301 unique products


Processing reviews: 35971223it [01:40, 370509.91it/s]

Processed 35,900,000 reviews, found 1,492,707 unique products


Processing reviews: 36045256it [01:40, 368504.70it/s]

Processed 36,000,000 reviews, found 1,494,196 unique products


Processing reviews: 36157108it [01:40, 370152.25it/s]

Processed 36,100,000 reviews, found 1,495,683 unique products


Processing reviews: 36268316it [01:41, 370502.58it/s]

Processed 36,200,000 reviews, found 1,497,146 unique products


Processing reviews: 36343388it [01:41, 373245.37it/s]

Processed 36,300,000 reviews, found 1,498,486 unique products


Processing reviews: 36454595it [01:41, 368020.69it/s]

Processed 36,400,000 reviews, found 1,499,963 unique products


Processing reviews: 36564574it [01:41, 363453.64it/s]

Processed 36,500,000 reviews, found 1,501,394 unique products


Processing reviews: 36637220it [01:42, 362860.07it/s]

Processed 36,600,000 reviews, found 1,502,988 unique products


Processing reviews: 36746397it [01:42, 356394.18it/s]

Processed 36,700,000 reviews, found 1,504,586 unique products


Processing reviews: 36855946it [01:42, 363840.48it/s]

Processed 36,800,000 reviews, found 1,506,133 unique products


Processing reviews: 36965316it [01:42, 362284.37it/s]

Processed 36,900,000 reviews, found 1,507,595 unique products


Processing reviews: 37038329it [01:43, 363260.47it/s]

Processed 37,000,000 reviews, found 1,509,153 unique products


Processing reviews: 37148038it [01:43, 365016.83it/s]

Processed 37,100,000 reviews, found 1,510,510 unique products


Processing reviews: 37221382it [01:43, 308873.61it/s]

Processed 37,200,000 reviews, found 1,511,848 unique products


Processing reviews: 37364493it [01:44, 347983.26it/s]

Processed 37,300,000 reviews, found 1,513,277 unique products


Processing reviews: 37474507it [01:44, 359574.31it/s]

Processed 37,400,000 reviews, found 1,514,751 unique products


Processing reviews: 37548136it [01:44, 364335.16it/s]

Processed 37,500,000 reviews, found 1,516,165 unique products


Processing reviews: 37659683it [01:44, 367413.04it/s]

Processed 37,600,000 reviews, found 1,517,689 unique products


Processing reviews: 37769924it [01:45, 364360.80it/s]

Processed 37,700,000 reviews, found 1,519,243 unique products


Processing reviews: 37843161it [01:45, 365311.08it/s]

Processed 37,800,000 reviews, found 1,520,765 unique products


Processing reviews: 37952534it [01:45, 362906.55it/s]

Processed 37,900,000 reviews, found 1,522,198 unique products


Processing reviews: 38063300it [01:46, 367280.92it/s]

Processed 38,000,000 reviews, found 1,523,727 unique products


Processing reviews: 38174788it [01:46, 370376.16it/s]

Processed 38,100,000 reviews, found 1,525,081 unique products


Processing reviews: 38248830it [01:46, 369168.36it/s]

Processed 38,200,000 reviews, found 1,526,366 unique products


Processing reviews: 38359600it [01:46, 367345.38it/s]

Processed 38,300,000 reviews, found 1,527,717 unique products


Processing reviews: 38471077it [01:47, 368938.87it/s]

Processed 38,400,000 reviews, found 1,529,084 unique products


Processing reviews: 38545332it [01:47, 369697.76it/s]

Processed 38,500,000 reviews, found 1,530,435 unique products


Processing reviews: 38656447it [01:47, 368864.01it/s]

Processed 38,600,000 reviews, found 1,531,728 unique products


Processing reviews: 38768008it [01:47, 371363.65it/s]

Processed 38,700,000 reviews, found 1,533,036 unique products


Processing reviews: 38843165it [01:48, 372840.86it/s]

Processed 38,800,000 reviews, found 1,534,318 unique products


Processing reviews: 38955236it [01:48, 336745.00it/s]

Processed 38,900,000 reviews, found 1,535,532 unique products


Processing reviews: 39063877it [01:48, 352989.23it/s]

Processed 39,000,000 reviews, found 1,536,807 unique products


Processing reviews: 39138257it [01:48, 363005.69it/s]

Processed 39,100,000 reviews, found 1,538,189 unique products


Processing reviews: 39249155it [01:49, 365730.78it/s]

Processed 39,200,000 reviews, found 1,539,471 unique products


Processing reviews: 39360602it [01:49, 366748.21it/s]

Processed 39,300,000 reviews, found 1,540,841 unique products


Processing reviews: 39471820it [01:49, 370078.00it/s]

Processed 39,400,000 reviews, found 1,542,230 unique products


Processing reviews: 39546410it [01:50, 371673.38it/s]

Processed 39,500,000 reviews, found 1,543,668 unique products


Processing reviews: 39658042it [01:50, 370671.04it/s]

Processed 39,600,000 reviews, found 1,545,157 unique products


Processing reviews: 39770831it [01:50, 374183.94it/s]

Processed 39,700,000 reviews, found 1,546,639 unique products


Processing reviews: 39846651it [01:50, 374937.94it/s]

Processed 39,800,000 reviews, found 1,547,937 unique products


Processing reviews: 39960248it [01:51, 377364.49it/s]

Processed 39,900,000 reviews, found 1,549,382 unique products


Processing reviews: 40073182it [01:51, 375269.95it/s]

Processed 40,000,000 reviews, found 1,550,776 unique products


Processing reviews: 40148039it [01:51, 373181.62it/s]

Processed 40,100,000 reviews, found 1,552,244 unique products


Processing reviews: 40260917it [01:51, 374852.55it/s]

Processed 40,200,000 reviews, found 1,553,673 unique products


Processing reviews: 40370667it [01:52, 338867.18it/s]

Processed 40,300,000 reviews, found 1,555,165 unique products


Processing reviews: 40442821it [01:52, 350770.87it/s]

Processed 40,400,000 reviews, found 1,556,718 unique products


Processing reviews: 40555613it [01:52, 367236.18it/s]

Processed 40,500,000 reviews, found 1,558,116 unique products


Processing reviews: 40667174it [01:53, 370600.02it/s]

Processed 40,600,000 reviews, found 1,559,529 unique products


Processing reviews: 40742999it [01:53, 375375.89it/s]

Processed 40,700,000 reviews, found 1,560,964 unique products


Processing reviews: 40856569it [01:53, 376175.58it/s]

Processed 40,800,000 reviews, found 1,562,428 unique products


Processing reviews: 40972000it [01:53, 382162.61it/s]

Processed 40,900,000 reviews, found 1,563,939 unique products


Processing reviews: 41048575it [01:54, 382388.12it/s]

Processed 41,000,000 reviews, found 1,565,387 unique products


Processing reviews: 41162770it [01:54, 379057.76it/s]

Processed 41,100,000 reviews, found 1,566,930 unique products


Processing reviews: 41238494it [01:54, 377409.35it/s]

Processed 41,200,000 reviews, found 1,568,373 unique products


Processing reviews: 41353969it [01:54, 379487.82it/s]

Processed 41,300,000 reviews, found 1,569,839 unique products


Processing reviews: 41468706it [01:55, 379626.18it/s]

Processed 41,400,000 reviews, found 1,571,259 unique products


Processing reviews: 41545190it [01:55, 380637.24it/s]

Processed 41,500,000 reviews, found 1,572,687 unique products


Processing reviews: 41659675it [01:55, 380487.33it/s]

Processed 41,600,000 reviews, found 1,574,056 unique products


Processing reviews: 41770028it [01:56, 348197.74it/s]

Processed 41,700,000 reviews, found 1,575,475 unique products


Processing reviews: 41841610it [01:56, 351727.74it/s]

Processed 41,800,000 reviews, found 1,577,167 unique products


Processing reviews: 41952275it [01:56, 360777.26it/s]

Processed 41,900,000 reviews, found 1,579,117 unique products


Processing reviews: 42062451it [01:56, 364737.36it/s]

Processed 42,000,000 reviews, found 1,581,134 unique products


Processing reviews: 42175150it [01:57, 373006.24it/s]

Processed 42,100,000 reviews, found 1,582,999 unique products


Processing reviews: 42249666it [01:57, 371792.02it/s]

Processed 42,200,000 reviews, found 1,584,908 unique products


Processing reviews: 42360742it [01:57, 367726.86it/s]

Processed 42,300,000 reviews, found 1,586,802 unique products


Processing reviews: 42472054it [01:58, 369779.48it/s]

Processed 42,400,000 reviews, found 1,588,721 unique products


Processing reviews: 42545927it [01:58, 368580.36it/s]

Processed 42,500,000 reviews, found 1,590,522 unique products


Processing reviews: 42658427it [01:58, 373316.23it/s]

Processed 42,600,000 reviews, found 1,592,307 unique products


Processing reviews: 42770095it [01:58, 369017.09it/s]

Processed 42,700,000 reviews, found 1,594,036 unique products


Processing reviews: 42844899it [01:59, 369996.04it/s]

Processed 42,800,000 reviews, found 1,595,666 unique products


Processing reviews: 42957403it [01:59, 373111.58it/s]

Processed 42,900,000 reviews, found 1,597,558 unique products


Processing reviews: 43032008it [01:59, 315532.92it/s]

Processed 43,000,000 reviews, found 1,599,293 unique products


Processing reviews: 43144072it [01:59, 351653.47it/s]

Processed 43,100,000 reviews, found 1,600,977 unique products


Processing reviews: 43255364it [02:00, 365052.93it/s]

Processed 43,200,000 reviews, found 1,602,764 unique products


Processing reviews: 43366462it [02:00, 368051.35it/s]

Processed 43,300,000 reviews, found 1,604,496 unique products


Processing reviews: 43441080it [02:00, 370779.39it/s]

Processed 43,400,000 reviews, found 1,606,217 unique products


Processing reviews: 43554333it [02:00, 375653.57it/s]

Processed 43,500,000 reviews, found 1,607,744 unique products


Processing reviews: 43669021it [02:01, 379468.12it/s]

Processed 43,600,000 reviews, found 1,609,401 unique products


Processing reviews: 43744892it [02:01, 378770.64it/s]

Processed 43,700,000 reviews, found 1,609,858 unique products


Processing reviews: 43858398it [02:01, 377130.08it/s]

Processed 43,800,000 reviews, found 1,609,859 unique products


Processing reviews: 43886944it [02:01, 360105.72it/s]


Total reviews processed: 43,886,944
Unique products found: 1,609,860





## Step 2: Select Top Products by Review Count

Now we'll select the 1,000 most-reviewed products, considering only products with at least 10 reviews.

In [3]:
def select_top_products(review_counts, target_count=1000, min_reviews=10):
    """Select top products by review count."""
    print(f"\nSelecting top {target_count} products with at least {min_reviews} reviews...")
    
    # Filter products with minimum review count
    filtered_products = {k: v for k, v in review_counts.items() if v >= min_reviews}
    print(f"Products with ≥{min_reviews} reviews: {len(filtered_products):,}")
    
    # Get top products by review count, considering only products that passed the filter
    top_products = dict(Counter(filtered_products).most_common(target_count))
    
    print(f"\nSelected {len(top_products)} products")
    print(f"Review count range: {min(top_products.values())} - {max(top_products.values())}")
    
    # Show distribution
    counts = list(top_products.values())
    print(f"\nReview count statistics:")
    print(f"  Mean: {np.mean(counts):.1f}")
    print(f"  Median: {np.median(counts):.1f}")
    print(f"  75th percentile: {np.percentile(counts, 75):.1f}")
    print(f"  90th percentile: {np.percentile(counts, 90):.1f}")
    
    return top_products

top_products = select_top_products(review_counts, TARGET_PRODUCTS, MIN_REVIEWS_PER_PRODUCT)
selected_parent_asins = set(top_products.keys())



Selecting top 1000 products with at least 10 reviews...
Products with ≥10 reviews: 403,258

Selected 1000 products
Review count range: 3453 - 178239

Review count statistics:
  Mean: 8022.4
  Median: 5331.5
  75th percentile: 7555.8
  90th percentile: 12708.4
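
The selection rule combines a minimum-review threshold with a top-N cut. The following is a minimal sketch on toy data of how the two interact when fewer than N products clear the threshold. The function name `top_k_with_minimum` and the single-letter product keys are illustrative assumptions, not part of the notebook or the dataset.

```python
from collections import Counter


def top_k_with_minimum(counts, k, min_count):
    """Return up to k highest-count items whose count is at least min_count."""
    # Apply the threshold first so that low-count items can never be selected,
    # even when fewer than k items are eligible.
    eligible = Counter({key: n for key, n in counts.items() if n >= min_count})
    return dict(eligible.most_common(k))


# Toy counts (hypothetical keys, not real ASINs)
toy_counts = Counter({"A": 50, "B": 12, "C": 9, "D": 3})
selected = top_k_with_minimum(toy_counts, k=3, min_count=10)
print(selected)
```

Here only two of the four toy products have at least 10 reviews, so the result holds two entries even though three were requested. In the real run above this case does not arise, since 403,258 products pass the threshold.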


## Step 3: Extract Product Metadata

Extract metadata for our selected products from the meta_Electronics.jsonl file.

In [4]:
def extract_product_metadata(meta_file, selected_asins, review_counts):
    """Extract metadata for selected products.

    review_counts maps parent_asin -> number of reviews; it is passed in
    explicitly so the function does not depend on a global variable.
    """
    print(f"\nExtracting metadata for {len(selected_asins)} products...")

    products_metadata = []
    found_count = 0

    # Handle both .jsonl and .jsonl.gz files
    if str(meta_file).endswith('.gz'):
        file_opener = gzip.open
        mode = 'rt'
    else:
        file_opener = open
        mode = 'r'

    with file_opener(meta_file, mode, encoding='utf-8') as f:
        for line_num, line in enumerate(tqdm(f, desc="Processing metadata")):
            try:
                product = json.loads(line.strip())
                parent_asin = product.get('parent_asin')

                if parent_asin in selected_asins:
                    # Add review count to metadata
                    product['review_count'] = review_counts[parent_asin]
                    products_metadata.append(product)
                    found_count += 1

                    if found_count % 100 == 0:
                        print(f"Found metadata for {found_count}/{len(selected_asins)} products")

            except json.JSONDecodeError:
                print(f"Skipping invalid JSON at line {line_num + 1}")
                continue

    print(f"\nFound metadata for {found_count}/{len(selected_asins)} products")
    return products_metadata

products_metadata = extract_product_metadata(META_FILE, selected_parent_asins, top_products)



Extracting metadata for 1000 products...


Processing metadata: 38517it [00:00, 84778.60it/s]

Found metadata for 100/1000 products


Processing metadata: 74445it [00:00, 88535.47it/s]

Found metadata for 200/1000 products


Processing metadata: 120302it [00:01, 91585.44it/s]

Found metadata for 300/1000 products


Processing metadata: 183424it [00:02, 88990.53it/s]

Found metadata for 400/1000 products


Processing metadata: 267369it [00:03, 93247.51it/s]

Found metadata for 500/1000 products


Processing metadata: 379429it [00:04, 89044.42it/s]

Found metadata for 600/1000 products


Processing metadata: 538565it [00:06, 87599.09it/s]

Found metadata for 700/1000 products


Processing metadata: 738729it [00:08, 96641.66it/s]

Found metadata for 800/1000 products


Processing metadata: 1046659it [00:11, 100167.75it/s]

Found metadata for 900/1000 products


Processing metadata: 1400367it [00:14, 104374.75it/s]

Found metadata for 1000/1000 products


Processing metadata: 1610012it [00:16, 96984.57it/s] 


Found metadata for 1000/1000 products
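
The cells in this notebook repeat the same logic for opening either a `.jsonl` or a `.jsonl.gz` file and skipping malformed lines. One possible way to factor that out is a small generator, sketched below. The name `iter_jsonl` is a hypothetical helper, not something the notebook defines, and the sketch leaves out the `tqdm` progress bar for brevity.

```python
import gzip
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per line from a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # gzip.open and the built-in open both accept text mode "rt" with an encoding
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON at line {line_num}")
```

With such a helper, a scan could be written as `for product in iter_jsonl(META_FILE): ...`, keeping each extraction function focused on its filtering logic.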





## Step 4: Extract Sample Reviews

Extract up to 20 reviews per selected product (the first ones encountered in the reviews file, not a random sample) to use in the RAG pipeline.

In [5]:
def extract_sample_reviews(reviews_file, selected_asins, max_reviews_per_product=20):
    """Extract sample reviews for selected products."""
    print(f"\nExtracting sample reviews (max {max_reviews_per_product} per product)...")
    
    product_reviews = defaultdict(list)
    total_extracted = 0
    
    # Handle both .jsonl and .jsonl.gz files
    if str(reviews_file).endswith('.gz'):
        file_opener = gzip.open
        mode = 'rt'
    else:
        file_opener = open
        mode = 'r'
    
    with file_opener(reviews_file, mode, encoding='utf-8') as f:
        for line_num, line in enumerate(tqdm(f, desc="Processing reviews")):
            try:
                review = json.loads(line.strip())
                parent_asin = review.get('parent_asin')
                
                if (parent_asin in selected_asins and 
                    len(product_reviews[parent_asin]) < max_reviews_per_product):
                    
                    # Only keep essential review fields
                    clean_review = {
                        'asin': review.get('asin'),
                        'parent_asin': parent_asin,
                        'rating': review.get('rating'),
                        'title': review.get('title', ''),
                        'text': review.get('text', ''),
                        'timestamp': review.get('timestamp'),
                        'verified_purchase': review.get('verified_purchase'),
                        'helpful_vote': review.get('helpful_vote', 0)
                    }
                    
                    product_reviews[parent_asin].append(clean_review)
                    total_extracted += 1
                    
                    if total_extracted % 1000 == 0:
                        print(f"Extracted {total_extracted} reviews for {len(product_reviews)} products")
                        
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON at line {line_num + 1}")
                continue
    
    print(f"\nExtracted {total_extracted} reviews for {len(product_reviews)} products")
    return dict(product_reviews)

sample_reviews = extract_sample_reviews(REVIEWS_FILE, selected_parent_asins, max_reviews_per_product=20)



Extracting sample reviews (max 20 per product)...


Processing reviews: 28372it [00:00, 283703.36it/s]

Extracted 1000 reviews for 525 products
Extracted 2000 reviews for 746 products
Extracted 3000 reviews for 875 products
Extracted 4000 reviews for 928 products
Extracted 5000 reviews for 959 products
Extracted 6000 reviews for 976 products
Extracted 7000 reviews for 983 products
Extracted 8000 reviews for 993 products


Processing reviews: 67673it [00:00, 347995.63it/s]

Extracted 9000 reviews for 998 products
Extracted 10000 reviews for 999 products


Processing reviews: 145487it [00:00, 375839.05it/s]

Extracted 11000 reviews for 999 products
Extracted 12000 reviews for 999 products
Extracted 13000 reviews for 999 products
Extracted 14000 reviews for 999 products
Extracted 15000 reviews for 1000 products
Extracted 16000 reviews for 1000 products


Processing reviews: 230880it [00:00, 406073.59it/s]

Extracted 17000 reviews for 1000 products
Extracted 18000 reviews for 1000 products


Processing reviews: 272040it [00:00, 407876.02it/s]

Extracted 19000 reviews for 1000 products


Processing reviews: 924183it [00:02, 428114.91it/s]

Extracted 20000 reviews for 1000 products


Processing reviews: 43886944it [01:43, 425525.46it/s]


Extracted 20000 reviews for 1000 products
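
In the log above, the 20,000th review is reported shortly after the progress bar shows about 924,000 lines, yet the scan continues through all 43,886,944 lines. Because every selected product has far more than 20 reviews, the loop could stop as soon as each product has reached its cap. The sketch below shows one way to do that, under the assumption that records arrive as an iterable of dicts. The function name `take_first_n_per_key` and the toy records are illustrative, not part of the notebook.

```python
from collections import defaultdict


def take_first_n_per_key(records, wanted_keys, n, key_field="parent_asin"):
    """Collect up to n records per wanted key, stopping once every key is full."""
    collected = defaultdict(list)
    full_keys = 0
    for record in records:
        key = record.get(key_field)
        # Skip records for other products or for products already at the cap
        if key not in wanted_keys or len(collected[key]) >= n:
            continue
        collected[key].append(record)
        if len(collected[key]) == n:
            full_keys += 1
            if full_keys == len(wanted_keys):
                break  # every wanted key has n records; no need to read further
    return dict(collected)


# Toy stream (hypothetical records); an iterator lets us see where reading stopped
records = iter([
    {"parent_asin": "A", "text": "a1"},
    {"parent_asin": "B", "text": "b1"},
    {"parent_asin": "A", "text": "a2"},
    {"parent_asin": "C", "text": "c1"},
    {"parent_asin": "B", "text": "b2"},
    {"parent_asin": "A", "text": "a3"},
])
result = take_first_n_per_key(records, {"A", "B"}, n=2)
remaining = list(records)  # records that were never read
```

The early exit only helps when every selected product actually reaches the cap. If some product had fewer than `n` reviews, the loop would still read the whole file.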





## Step 4b: Analyze the Filtered Dataset

Let's analyze our filtered dataset to ensure quality.

In [6]:
# Convert to DataFrame for analysis
df_products = pd.DataFrame(products_metadata)

print("=== DATASET SUMMARY ===")
print(f"Total products: {len(df_products)}")
print(f"Total reviews extracted: {sum(len(reviews) for reviews in sample_reviews.values())}")

print("\n=== PRODUCT METADATA FIELDS ===")
print(f"Available fields: {list(df_products.columns)}")

print("\n=== REVIEW COUNT DISTRIBUTION ===")
if 'review_count' in df_products.columns:
    print(df_products['review_count'].describe())

print("\n=== PRICE DISTRIBUTION ===")
if 'price' in df_products.columns:
    # Clean price data (remove nulls and convert to numeric)
    prices = pd.to_numeric(df_products['price'], errors='coerce').dropna()
    print(f"Products with price info: {len(prices)}/{len(df_products)}")
    if len(prices) > 0:
        print(prices.describe())

print("\n=== RATING DISTRIBUTION ===")
if 'average_rating' in df_products.columns:
    ratings = pd.to_numeric(df_products['average_rating'], errors='coerce').dropna()
    print(f"Products with rating info: {len(ratings)}/{len(df_products)}")
    if len(ratings) > 0:
        print(ratings.describe())

print("\n=== TOP 10 MOST REVIEWED PRODUCTS ===")
if 'review_count' in df_products.columns and 'title' in df_products.columns:
    # Select only the columns that are checked above and actually printed,
    # so a missing 'average_rating' or 'price' column cannot raise a KeyError
    top_10 = df_products.nlargest(10, 'review_count')[['title', 'review_count']]
    for _, row in top_10.iterrows():
        print(f"{row['review_count']:,} reviews - {row['title'][:80]}...")


=== DATASET SUMMARY ===
Total products: 1000
Total reviews extracted: 20000

=== PRODUCT METADATA FIELDS ===
Available fields: ['main_category', 'title', 'average_rating', 'rating_number', 'features', 'description', 'price', 'images', 'videos', 'store', 'categories', 'details', 'parent_asin', 'bought_together', 'review_count']

=== REVIEW COUNT DISTRIBUTION ===
count      1000.000000
mean       8022.383000
std       11149.838281
min        3453.000000
25%        4192.000000
50%        5331.500000
75%        7555.750000
max      178239.000000
Name: review_count, dtype: float64

=== PRICE DISTRIBUTION ===
Products with price info: 772/1000
count     772.000000
mean       55.203510
std        81.188092
min         3.490000
25%        14.990000
50%        26.990000
75%        59.990000
max      1175.350000
Name: price, dtype: float64

=== RATING DISTRIBUTION ===
Products with rating info: 1000/1000
count    1000.000000
mean        4.459000
std         0.224654
min         3.300000
25%     
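The price summary above reports only 772 of 1000 products with a usable price, so it can help to measure coverage for several fields at once before relying on them downstream. The following is a minimal sketch, not part of the original notebook: `field_coverage` and the toy DataFrame are hypothetical names, and the toy frame stands in for `df_products`.

```python
import pandas as pd

def field_coverage(df, fields, numeric_fields=("price",)):
    """Return the share of rows with a usable (non-null) value for each field.

    Fields listed in numeric_fields are coerced with pd.to_numeric first,
    so unparseable strings count as missing.
    """
    coverage = {}
    for field in fields:
        if field not in df.columns:
            # A column that does not exist has no usable values
            coverage[field] = 0.0
            continue
        col = df[field]
        if field in numeric_fields:
            col = pd.to_numeric(col, errors="coerce")
        coverage[field] = float(col.notna().mean())
    return coverage

# Toy stand-in for df_products (hypothetical values)
toy_products = pd.DataFrame({
    "title": ["Cable", "Charger", "Headphones", "Case"],
    "price": [9.99, None, "19.99", "N/A"],
    "average_rating": [4.5, 4.1, 4.7, 3.9],
})

coverage = field_coverage(toy_products, ["title", "price", "average_rating", "store"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%} of products")
```

On the real data, calling `field_coverage(df_products, ['price', 'description', 'features'])` would show which fields are sparse enough to need a fallback in the RAG documents.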

## Step 5: Save Processed Dataset

Save our filtered dataset for use in the RAG pipeline.

In [7]:
# Save product metadata
products_file = OUTPUT_DIR / "electronics_top1000_products.jsonl"
with open(products_file, 'w', encoding='utf-8') as f:
    for product in products_metadata:
        f.write(json.dumps(product, ensure_ascii=False) + '\n')

print(f"Saved {len(products_metadata)} products to {products_file}")

# Save sample reviews
reviews_file = OUTPUT_DIR / "electronics_top1000_reviews.jsonl"
total_reviews_saved = 0
with open(reviews_file, 'w', encoding='utf-8') as f:
    for parent_asin, reviews in sample_reviews.items():
        for review in reviews:
            f.write(json.dumps(review, ensure_ascii=False) + '\n')
            total_reviews_saved += 1

print(f"Saved {total_reviews_saved} reviews to {reviews_file}")

# Save summary statistics
summary = {
    'dataset_info': {
        'source': 'Amazon Reviews 2023 - Electronics Category',
        'citation': 'Hou et al. (2024) - Bridging Language and Items for Retrieval and Recommendation (arXiv:2403.03952)',
        'processing_date': pd.Timestamp.now().isoformat(),
        'selection_criteria': {
            'target_products': TARGET_PRODUCTS,
            'min_reviews_per_product': MIN_REVIEWS_PER_PRODUCT,
            'max_reviews_per_product': 20
        }
    },
    'statistics': {
        'total_products': len(products_metadata),
        'total_reviews': total_reviews_saved,
        'products_with_metadata': len(products_metadata),
        'products_with_reviews': len(sample_reviews)
    }
}

if 'review_count' in df_products.columns:
    summary['statistics']['review_count_stats'] = {
        'min': int(df_products['review_count'].min()),
        'max': int(df_products['review_count'].max()),
        'mean': float(df_products['review_count'].mean()),
        'median': float(df_products['review_count'].median())
    }

summary_file = OUTPUT_DIR / "dataset_summary.json"
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False)

print(f"Saved dataset summary to {summary_file}")

print("\n=== PROCESSING COMPLETE ===")
print(f"Output files created in: {OUTPUT_DIR}")
print(f"  - {products_file.name}: Product metadata")
print(f"  - {reviews_file.name}: Sample reviews")
print(f"  - {summary_file.name}: Dataset summary")


Saved 1000 products to ../data/processed/electronics_top1000_products.jsonl
Saved 20000 reviews to ../data/processed/electronics_top1000_reviews.jsonl
Saved dataset summary to ../data/processed/dataset_summary.json

=== PROCESSING COMPLETE ===
Output files created in: ../data/processed
  - electronics_top1000_products.jsonl: Product metadata
  - electronics_top1000_reviews.jsonl: Sample reviews
  - dataset_summary.json: Dataset summary
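Before building on these files, it may be worth confirming that they read back cleanly. The sketch below is an assumption-laden illustration rather than part of the original notebook: `write_jsonl`, `read_jsonl`, and the toy records are hypothetical, and it writes to a temporary directory instead of `OUTPUT_DIR`. It mirrors the one-JSON-object-per-line format used by the save cell above.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII characters as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Toy records standing in for products_metadata (hypothetical values)
toy_records = [
    {"parent_asin": "TOY0000001", "title": "USB-C Cable 2 m"},
    {"parent_asin": "TOY0000002", "title": "Écouteurs sans fil"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_products.jsonl"
    write_jsonl(toy_path, toy_records)
    reloaded = read_jsonl(toy_path)

round_trip_ok = reloaded == toy_records
print(f"Round trip OK: {round_trip_ok} ({len(reloaded)} records)")
```

The same `read_jsonl` helper could be pointed at `products_file` and `reviews_file` to compare record counts against the numbers printed above.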


## Summary

This notebook reduced the Electronics dataset to a subset small enough for fast iteration:

- **Selected**: Top 1000 most popular products by review count
- **Criteria**: Products with ≥10 reviews each (every selected product has at least 3,453)
- **Output Files**:
  - `electronics_top1000_products.jsonl`: Product metadata
  - `electronics_top1000_reviews.jsonl`: Sample reviews (max 20 per product)
  - `electronics_rag_documents.jsonl`: RAG-optimized documents (created in Step 6)
  - `dataset_summary.json`: Processing statistics

The filtered dataset is ready for use in the RAG pipeline. It holds 1000 products and 20,000 sample reviews, compared with the roughly 43.9 million review records scanned in the full file, and every retained product has thousands of reviews behind its rating.

In [8]:
def create_rag_documents(products_metadata, sample_reviews):
    """Create documents optimized for RAG retrieval."""
    rag_documents = []
    
    for product in products_metadata:
        parent_asin = product.get('parent_asin')
        
        # Create product document
        doc = {
            'id': f"product_{parent_asin}",
            'type': 'product',
            'parent_asin': parent_asin,
            'title': product.get('title', ''),
            'description': ' '.join(product.get('description', [])) if product.get('description') else '',
            'features': ' '.join(product.get('features', [])) if product.get('features') else '',
            'price': product.get('price'),
            'average_rating': product.get('average_rating'),
            'rating_number': product.get('rating_number'),
            'review_count': product.get('review_count'),
            'store': product.get('store', ''),
            'categories': product.get('categories', []),
            'details': product.get('details', {})
        }
        
        # Create searchable text content
        content_parts = []
        if doc['title']:
            content_parts.append(f"Product: {doc['title']}")
        if doc['description']:
            content_parts.append(f"Description: {doc['description']}")
        if doc['features']:
            content_parts.append(f"Features: {doc['features']}")
        if doc['store']:
            content_parts.append(f"Store: {doc['store']}")
        if doc['categories']:
            content_parts.append(f"Categories: {' > '.join(doc['categories'])}")
        
        doc['content'] = ' '.join(content_parts)
        rag_documents.append(doc)
        
        # Add review summaries
        if parent_asin in sample_reviews:
            reviews = sample_reviews[parent_asin]
            
            # Create review summary document
            # Treat a missing or null rating as unrated rather than as negative
            positive_reviews = [r for r in reviews if (r.get('rating') or 0) >= 4]
            negative_reviews = [r for r in reviews if r.get('rating') is not None and r['rating'] <= 2]
            
            review_summary = {
                'id': f"reviews_{parent_asin}",
                'type': 'review_summary',
                'parent_asin': parent_asin,
                'product_title': doc['title'],
                'total_reviews': len(reviews),
                'positive_reviews': len(positive_reviews),
                'negative_reviews': len(negative_reviews)
            }
            
            # Sample positive and negative review texts
            pos_texts = [r.get('text', '') for r in positive_reviews[:5] if r.get('text')]
            neg_texts = [r.get('text', '') for r in negative_reviews[:5] if r.get('text')]
            
            content_parts = [f"Reviews for {doc['title']}"]
            if pos_texts:
                content_parts.append(f"Positive feedback: {' '.join(pos_texts[:3])}")
            if neg_texts:
                content_parts.append(f"Critical feedback: {' '.join(neg_texts[:3])}")
            
            review_summary['content'] = ' '.join(content_parts)
            rag_documents.append(review_summary)
    
    return rag_documents

# Create RAG documents
rag_documents = create_rag_documents(products_metadata, sample_reviews)

# Save RAG documents
rag_file = OUTPUT_DIR / "electronics_rag_documents.jsonl"
with open(rag_file, 'w', encoding='utf-8') as f:
    for doc in rag_documents:
        f.write(json.dumps(doc, ensure_ascii=False) + '\n')

print(f"Created {len(rag_documents)} RAG documents saved to {rag_file}")
print(f"Document types: {Counter(doc['type'] for doc in rag_documents)}")


Created 2000 RAG documents saved to ../data/processed/electronics_rag_documents.jsonl
Document types: Counter({'product': 1000, 'review_summary': 1000})
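The counts above show one `product` and one `review_summary` document per product. A quick way to confirm that pairing holds for every `parent_asin` is to group the documents and look for gaps. This is a minimal sketch under stated assumptions: `group_docs_by_product`, `find_incomplete_products`, and the toy documents are hypothetical and only reuse the `type` and `parent_asin` keys produced by `create_rag_documents`.

```python
from collections import defaultdict

def group_docs_by_product(rag_documents):
    """Map parent_asin -> {document type -> document}."""
    grouped = defaultdict(dict)
    for doc in rag_documents:
        grouped[doc["parent_asin"]][doc["type"]] = doc
    return dict(grouped)

def find_incomplete_products(grouped, required_types=("product", "review_summary")):
    """Return parent_asins that are missing any of the required document types."""
    return sorted(
        asin for asin, docs in grouped.items()
        if any(doc_type not in docs for doc_type in required_types)
    )

# Toy documents standing in for rag_documents (hypothetical values)
toy_docs = [
    {"id": "product_TOY1", "type": "product", "parent_asin": "TOY1",
     "content": "Product: Toy cable"},
    {"id": "reviews_TOY1", "type": "review_summary", "parent_asin": "TOY1",
     "content": "Reviews for Toy cable"},
    {"id": "product_TOY2", "type": "product", "parent_asin": "TOY2",
     "content": "Product: Toy charger"},
]

grouped_docs = group_docs_by_product(toy_docs)
incomplete = find_incomplete_products(grouped_docs)
print(f"Products missing a document type: {incomplete}")
```

Grouping by `parent_asin` also makes it straightforward to retrieve a product document and its review summary together at query time.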


## Step 6: Create RAG Documents

Prepare the data in a format optimized for RAG applications.


In [None]:
# TODO