# 2. Data Preprocessing Demo

This notebook demonstrates the data preprocessing pipeline for the Amazon Beauty dataset.

In [2]:
import sys
sys.path.insert(0, '..')

import json
import pickle
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict

## Step 1: Load Raw Data

We have two raw files:
- `reviews_Beauty_5.json`: user-item interactions with timestamps
- `meta_Beauty.json`: item metadata with attributes

In [5]:
# Check available raw data
raw_dir = Path('../../')  # Root of recsyska_imba

reviews_file = raw_dir / 'reviews_Beauty_5.json'
meta_file = raw_dir / 'meta_Beauty.json'

print("Raw Data Files:")
print(f"  Reviews: {reviews_file.exists()} ({reviews_file})")
print(f"  Metadata: {meta_file.exists()} ({meta_file})")

Raw Data Files:
  Reviews: True (../../reviews_Beauty_5.json)
  Metadata: True (../../meta_Beauty.json)


In [6]:
# Preview the first review (the file stores one JSON object per line)
if reviews_file.exists():
    print("\n📋 Sample Review Entry:")
    with open(reviews_file, 'r') as f:
        review = json.loads(f.readline())  # read only the first line, not the whole file
    for k, v in review.items():
        print(f"  {k}: {v}")


📋 Sample Review Entry:
  reviewerID: A1YJEY40YUW4SE
  asin: 7806397051
  reviewerName: Andrea
  helpful: [3, 4]
  reviewText: Very oily and creamy. Not at all what I expected... ordered this to try to highlight and contour and it just looked awful!!! Plus, took FOREVER to arrive.
  overall: 1.0
  summary: Don't waste your money
  unixReviewTime: 1391040000
  reviewTime: 01 30, 2014


## Step 2: K-core Filtering

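We apply k-core filtering (k=5 for this dataset) so that, in the filtered data:
- Each user has at least K interactions
- Each item appears in at least K interactions

This removes very sparse users and unpopular items. Because dropping an item can push a user below K (and vice versa), the filter has to be repeated until a pass removes nothing. The simulation below runs on random data and caps the loop at 5 passes; in the run shown, pass 5 still removes interactions, so the result is not yet a full 5-core.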

In [4]:
def simulate_kcore_filtering(n_users=1000, n_items=500, n_interactions=5000, k=5):
    """Simulate k-core filtering process"""
    np.random.seed(42)
    
    # Generate random interactions
    users = np.random.randint(0, n_users, n_interactions)
    items = np.random.randint(0, n_items, n_interactions)
    
    print(f"Initial: {n_users} users, {n_items} items, {n_interactions} interactions")
    
    # K-core iterations (capped at 5 passes, so the loop may stop before convergence)
    for iteration in range(5):
        user_counts = Counter(users)
        item_counts = Counter(items)
        
        # Filter
        valid_mask = np.array([
            user_counts[u] >= k and item_counts[i] >= k 
            for u, i in zip(users, items)
        ])
        
        users = users[valid_mask]
        items = items[valid_mask]
        
        remaining_users = len(set(users))
        remaining_items = len(set(items))
        remaining_interactions = len(users)
        
        print(f"  Iter {iteration+1}: {remaining_users} users, {remaining_items} items, {remaining_interactions} interactions")
        
        if valid_mask.all():
            print("  Converged!")
            break
    
    return remaining_users, remaining_items, remaining_interactions

simulate_kcore_filtering()

Initial: 1000 users, 500 items, 5000 interactions
  Iter 1: 552 users, 481 items, 3658 interactions
  Iter 2: 547 users, 432 items, 3461 interactions
  Iter 3: 507 users, 431 items, 3301 interactions
  Iter 4: 507 users, 419 items, 3254 interactions
  Iter 5: 492 users, 419 items, 3194 interactions


(492, 419, 3194)
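
For comparison, here is a minimal sketch of the same filter run until it converges, written over a plain list of `(user, item)` pairs. The function name `kcore_filter` is hypothetical, and repeated pairs are counted as separate interactions, as in the simulation above.

```python
from collections import Counter


def kcore_filter(interactions, k=5):
    """Drop (user, item) pairs until every remaining user and item has >= k interactions."""
    interactions = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in interactions)
        item_counts = Counter(i for _, i in interactions)
        kept = [
            (u, i)
            for u, i in interactions
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        if len(kept) == len(interactions):
            return kept  # a full pass removed nothing: converged
        interactions = kept


# Removing item "c" (1 interaction) pushes user "u3" below k=2 on the next pass
pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
core = kcore_filter(pairs, k=2)
print(core)
```

The loop always terminates because each pass either removes at least one pair or returns.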

## Step 3: Sequence Building

For each user, we sort their interactions by timestamp to create a chronological sequence.

In [5]:
# Example of sequence building
def demo_sequence_building():
    """Demonstrate how user sequences are built"""
    
    # Example user interactions
    user_interactions = [
        {'item': 'A', 'timestamp': 1420000000, 'rating': 5},
        {'item': 'C', 'timestamp': 1420500000, 'rating': 4},
        {'item': 'B', 'timestamp': 1420100000, 'rating': 3},
        {'item': 'D', 'timestamp': 1421000000, 'rating': 5},
    ]
    
    print("Raw interactions (unsorted):")
    for i in user_interactions:
        print(f"  Item: {i['item']}, Time: {i['timestamp']}")
    
    # Sort by timestamp
    sorted_interactions = sorted(user_interactions, key=lambda x: x['timestamp'])
    
    print("\nSorted sequence:")
    sequence = [i['item'] for i in sorted_interactions]
    print(f"  {' → '.join(sequence)}")
    
    return sequence

demo_sequence_building()

Raw interactions (unsorted):
  Item: A, Time: 1420000000
  Item: C, Time: 1420500000
  Item: B, Time: 1420100000
  Item: D, Time: 1421000000

Sorted sequence:
  A → B → C → D


['A', 'B', 'C', 'D']
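
The same idea extends to many users by grouping interactions before sorting. The sketch below assumes `(user, item, timestamp)` tuples as input; `build_sequences` is a hypothetical name. Ties are kept in input order because Python's sort is stable. That matters here: the sample review's timestamp in Step 1 (1391040000) is an exact multiple of 86400 seconds, which suggests day-level resolution, so equal timestamps may be common.

```python
from collections import defaultdict


def build_sequences(interactions):
    """Group (user, item, timestamp) tuples by user and order each group by time."""
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    # Sort on the timestamp only, so equal timestamps keep their input order
    return {
        user: [item for _, item in sorted(events, key=lambda e: e[0])]
        for user, events in by_user.items()
    }


interactions = [
    ("u1", "A", 1420000000),
    ("u1", "C", 1420500000),
    ("u2", "X", 1420200000),
    ("u1", "B", 1420100000),
]
sequences = build_sequences(interactions)
print(sequences)
```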

## Step 4: ID Mapping

Convert string IDs to integer indices for neural network processing.

In [6]:
def demo_id_mapping():
    """Demonstrate ID mapping"""
    
    # Example items
    items = ['B001ABC123', 'B002XYZ789', 'B003DEF456', 'B001ABC123', 'B002XYZ789']
    
    # Create mapping
    unique_items = sorted(set(items))
    item2id = {item: idx + 1 for idx, item in enumerate(unique_items)}  # 0 reserved for padding
    id2item = {v: k for k, v in item2id.items()}
    
    print("Item to ID Mapping:")
    for item, idx in item2id.items():
        print(f"  {item} → {idx}")
    
    # Map items
    mapped = [item2id[i] for i in items]
    print(f"\nOriginal: {items}")
    print(f"Mapped:   {mapped}")
    
    return item2id

demo_id_mapping()

Item to ID Mapping:
  B001ABC123 → 1
  B002XYZ789 → 2
  B003DEF456 → 3

Original: ['B001ABC123', 'B002XYZ789', 'B003DEF456', 'B001ABC123', 'B002XYZ789']
Mapped:   [1, 2, 3, 1, 2]


{'B001ABC123': 1, 'B002XYZ789': 2, 'B003DEF456': 3}
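
Reserving ID 0 pays off once sequences are brought to a fixed length. The sketch below encodes a sequence with the mapping and pads it with 0. Left-padding, keeping only the most recent `max_len` items, and the helper names are assumptions made for illustration; the project may pad or truncate differently.

```python
def build_item_vocab(sequences):
    """Map each distinct item to an integer ID starting at 1 (0 is the padding ID)."""
    items = sorted({item for seq in sequences.values() for item in seq})
    return {item: idx + 1 for idx, item in enumerate(items)}


def encode_and_pad(seq, item2id, max_len, pad_id=0):
    """Encode a sequence, keep its last max_len items, and left-pad with pad_id."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    ids = [item2id[item] for item in seq][-max_len:]
    return [pad_id] * (max_len - len(ids)) + ids


sequences = {
    "u1": ["B002XYZ789", "B001ABC123"],
    "u2": ["B003DEF456", "B001ABC123", "B002XYZ789"],
}
item2id = build_item_vocab(sequences)
padded_u1 = encode_and_pad(sequences["u1"], item2id, max_len=4)
padded_u2 = encode_and_pad(sequences["u2"], item2id, max_len=2)
print(padded_u1, padded_u2)
```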

## Step 5: Attribute Extraction

Extract attributes from item metadata (brand, categories).

In [7]:
def demo_attribute_extraction():
    """Demonstrate attribute extraction from metadata"""
    
    # Example metadata
    sample_metadata = {
        'asin': 'B001ABC123',
        'brand': "L'Oreal",
        'categories': [['Beauty', 'Makeup', 'Lipstick']],
        'title': "L'Oreal Paris Colour Riche Lipstick"
    }
    
    print("Sample Metadata:")
    for k, v in sample_metadata.items():
        print(f"  {k}: {v}")
    
    # Extract attributes
    attributes = []
    
    # Add brand
    if 'brand' in sample_metadata and sample_metadata['brand']:
        attributes.append(f"brand:{sample_metadata['brand']}")
    
    # Add categories
    if 'categories' in sample_metadata:
        for cat_list in sample_metadata['categories']:
            for cat in cat_list:
                attributes.append(f"cat:{cat}")
    
    print(f"\nExtracted Attributes ({len(attributes)}):")
    for attr in attributes:
        print(f"  - {attr}")
    
    return attributes

demo_attribute_extraction()

Sample Metadata:
  asin: B001ABC123
  brand: L'Oreal
  categories: [['Beauty', 'Makeup', 'Lipstick']]
  title: L'Oreal Paris Colour Riche Lipstick

Extracted Attributes (4):
  - brand:L'Oreal
  - cat:Beauty
  - cat:Makeup
  - cat:Lipstick


["brand:L'Oreal", 'cat:Beauty', 'cat:Makeup', 'cat:Lipstick']
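
To turn these strings into the item-to-attributes mapping mentioned in Step 7, each distinct attribute needs its own integer ID. A minimal sketch follows, assuming metadata records shaped like the sample above. The function names and the 1-based attribute IDs (mirroring the item IDs) are assumptions, not the project's confirmed layout.

```python
def extract_attributes(meta):
    """Return 'brand:...' and 'cat:...' strings for one metadata record."""
    attrs = []
    brand = meta.get("brand")
    if brand:
        attrs.append(f"brand:{brand}")
    for cat_list in meta.get("categories", []):
        for cat in cat_list:
            attrs.append(f"cat:{cat}")
    return attrs


def build_item_attributes(metadata_records):
    """Assign an integer ID to every attribute and list the IDs for each item."""
    attr2id = {}
    item_attrs = {}
    for meta in metadata_records:
        ids = []
        for attr in extract_attributes(meta):
            if attr not in attr2id:
                attr2id[attr] = len(attr2id) + 1  # IDs start at 1
            if attr2id[attr] not in ids:
                ids.append(attr2id[attr])  # skip attributes repeated within an item
        item_attrs[meta["asin"]] = ids
    return item_attrs, attr2id


records = [
    {"asin": "B001ABC123", "brand": "L'Oreal",
     "categories": [["Beauty", "Makeup", "Lipstick"]]},
    {"asin": "B002XYZ789", "categories": [["Beauty", "Skin Care"]]},
]
item_attrs, attr2id = build_item_attributes(records)
print(item_attrs)
```

Items that share an attribute such as `cat:Beauty` end up sharing the same attribute ID, which is what lets a model relate them.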

## Step 6: Train/Val/Test Split

We use leave-one-out splitting:
- Last item → Test target
- Second-to-last item → Validation target
- All earlier items → Training

In [8]:
def demo_train_val_test_split():
    """Demonstrate leave-one-out splitting"""
    
    # Example sequence
    sequence = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    
    print(f"Full Sequence: {' → '.join(sequence)}")
    print(f"Length: {len(sequence)}")
    print()
    
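    # Split: each evaluation input is the full prefix that precedes its target
    train = sequence[:-2]        # everything except the last two items
    val_input = sequence[:-2]    # same prefix as train, used to predict val_target
    val_target = sequence[-2]    # second-to-last item
    test_input = sequence[:-1]   # everything except the last item
    test_target = sequence[-1]   # last item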
    
    print("Leave-One-Out Split:")
    print(f"  Train Input:  {train}")
    print(f"  Val Input:    {val_input}")
    print(f"  Val Target:   {val_target}")
    print(f"  Test Input:   {test_input}")
    print(f"  Test Target:  {test_target}")
    
demo_train_val_test_split()

Full Sequence: A → B → C → D → E → F → G → H
Length: 8

Leave-One-Out Split:
  Train Input:  ['A', 'B', 'C', 'D', 'E', 'F']
  Val Input:    ['A', 'B', 'C', 'D', 'E', 'F']
  Val Target:   G
  Test Input:   ['A', 'B', 'C', 'D', 'E', 'F', 'G']
  Test Target:  H
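
Applied to every user, the split might look like the sketch below. The function name and the `min_len=3` guard are assumptions: a sequence needs at least three items to yield a non-empty training prefix plus validation and test targets. After 5-core filtering the guard should rarely trigger.

```python
def leave_one_out_split(user_sequences, min_len=3):
    """Split each user's sequence into train, validation, and test parts."""
    splits = {}
    for user, seq in user_sequences.items():
        if len(seq) < min_len:
            continue  # too short to hold out two items
        splits[user] = {
            "train": seq[:-2],
            "val_input": seq[:-2],
            "val_target": seq[-2],
            "test_input": seq[:-1],
            "test_target": seq[-1],
        }
    return splits


splits = leave_one_out_split({"u1": [1, 2, 3, 4], "u2": [5, 6]})
print(splits)
```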


## Step 7: Final Processed Data

The preprocessing outputs a pickle file with:
- User sequences (mapped to integer IDs)
- Item-to-attributes mapping
- Various statistics

In [9]:
# Check if processed data exists
processed_path = Path('../data/processed/beauty_processed.pkl')

if processed_path.exists():
    with open(processed_path, 'rb') as f:
        data = pickle.load(f)
    
    print("✅ Processed Data Summary:")
    print(f"  Users: {data['num_users']:,}")
    print(f"  Items: {data['num_items']:,}")
    print(f"  Attributes: {data['num_attributes']:,}")
    print(f"  Avg sequence length: {np.mean([len(s) for s in data['user_sequences'] if s]):.1f}")
else:
    print("⚠️ Processed data not found. Run preprocessing first:")
    print("  python experiments/preprocess.py")

✅ Processed Data Summary:
  Users: 22,363
  Items: 12,102
  Attributes: 2,320
  Avg sequence length: 8.9
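
For reference, a file with these keys could be written as in the sketch below. The keys `num_users`, `num_items`, `num_attributes`, and `user_sequences` come from the cell above; the `item_attributes` key, the list-of-lists layout, and the function name are assumptions, since the real layout is defined by `experiments/preprocess.py`. The example writes to a temporary directory so that it does not touch real data.

```python
import pickle
import tempfile
from pathlib import Path


def save_processed(path, user_sequences, item_attributes):
    """Bundle sequences, attributes, and summary counts into one pickle file."""
    data = {
        "user_sequences": user_sequences,
        "item_attributes": item_attributes,  # hypothetical key name
        "num_users": sum(1 for s in user_sequences if s),
        "num_items": len({i for s in user_sequences for i in s}),
        "num_attributes": len({a for attrs in item_attributes.values() for a in attrs}),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


# Round trip through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "demo_processed.pkl"
    saved = save_processed(out_path, [[1, 2, 3], [2, 3]], {1: [1], 2: [1, 2], 3: [2]})
    with open(out_path, "rb") as f:
        loaded = pickle.load(f)
print(loaded["num_users"], loaded["num_items"], loaded["num_attributes"])
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that your own pipeline produced.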


## Summary

The preprocessing pipeline:
1. **Load** raw JSON files
2. **Filter** with k-core algorithm (k=5)
3. **Build** chronological sequences per user
4. **Map** string IDs to integers
5. **Extract** attributes (brand, categories)
6. **Split** using leave-one-out
7. **Save** as pickle for fast loading