# Data Preprocessing - Amazon Beauty Products Recommendation System

**Goal:** Prepare the data for the recommendation system.

## Contents:
- Handling missing values and outliers
- Validating and cleaning the data
- Feature engineering for recommendation
- Normalization and standardization
- Handling numerical stability
- Building the user-item matrix
- Splitting the data for training/validation/testing

## 1. Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
np.set_printoptions(precision=4, suppress=True)

# Constants for numerical stability
EPSILON = 1e-10

print("✓ Libraries imported!")

## 2. Load Raw Data

Load the data from the CSV file.

In [None]:
# TODO: Load data using NumPy (reuse the loader from notebook 01)
data_path = '../data/raw/ratings_Beauty.csv'

print("Data loaded!")
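One possible way to fill in the loading TODO is sketched below. It assumes the CSV has a header row with the columns `UserId`, `ProductId`, `Rating`, and `Timestamp`; only the first two names appear elsewhere in this notebook, so the other two are assumptions to check against the real file. `load_ratings` is a hypothetical helper name, and the in-memory sample stands in for `data_path`.

```python
import io
import numpy as np

def load_ratings(source):
    """Load ratings from a CSV path or file-like object into four NumPy arrays."""
    # Assumed header: UserId,ProductId,Rating,Timestamp (adjust to the real file)
    data = np.genfromtxt(source, delimiter=',', names=True,
                         dtype=None, encoding='utf-8')
    data = np.atleast_1d(data)  # a single-row file would otherwise be 0-d
    return (data['UserId'],
            data['ProductId'],
            data['Rating'].astype(float),
            data['Timestamp'].astype(np.int64))

# Tiny in-memory stand-in for the CSV, for illustration only
sample = io.StringIO(
    "UserId,ProductId,Rating,Timestamp\n"
    "U1,P1,5.0,1369699200\n"
    "U2,P1,3.0,1355443200\n"
)
user_ids, product_ids, ratings, timestamps = load_ratings(sample)
```

In the notebook itself, `load_ratings(data_path)` would replace the `io.StringIO` sample.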

## 3. Data Validation

Check the validity of the data.

In [None]:
# TODO: Validate data
# 1. Check rating range (should be 1-5 or 0-5)
# 2. Check for invalid user IDs
# 3. Check for invalid product IDs
# 4. Check timestamp validity (if available)
# 5. Remove duplicate entries (same user, same product, same timestamp)

# Example validation:
# valid_ratings = (ratings >= 1) & (ratings <= 5)
# data = data[valid_ratings]
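The validation checks listed above could be combined into a single boolean mask, roughly as follows. This is a sketch under assumptions: IDs are strings where an empty string means "invalid", timestamps are positive integers, and the valid rating range is 1 to 5. `validation_mask` is a hypothetical name.

```python
import numpy as np

def validation_mask(user_ids, product_ids, ratings, timestamps,
                    min_rating=1.0, max_rating=5.0):
    """Return True for rows that pass all checks and are not duplicates."""
    user_ids = np.asarray(user_ids, dtype=str)
    product_ids = np.asarray(product_ids, dtype=str)
    ratings = np.asarray(ratings, dtype=float)
    timestamps = np.asarray(timestamps, dtype=np.int64)

    # 1. Rating range (NaN/inf fail the isfinite check)
    valid = np.isfinite(ratings) & (ratings >= min_rating) & (ratings <= max_rating)
    # 2-3. Non-empty user and product IDs
    valid &= np.char.str_len(user_ids) > 0
    valid &= np.char.str_len(product_ids) > 0
    # 4. Timestamps must be positive
    valid &= timestamps > 0

    # 5. Keep only the first occurrence of each (user, product, timestamp)
    _, u = np.unique(user_ids, return_inverse=True)
    _, p = np.unique(product_ids, return_inverse=True)
    keys = np.column_stack([u, p, timestamps])
    _, first = np.unique(keys, axis=0, return_index=True)
    is_first = np.zeros(len(ratings), dtype=bool)
    is_first[first] = True

    return valid & is_first

mask = validation_mask(['u1', 'u1', 'u2', ''],
                       ['p1', 'p1', 'p2', 'p3'],
                       [5, 5, 7, 3],
                       [10, 10, 20, 30])
```

Here only the first row survives: the second is a duplicate, the third has an out-of-range rating, and the fourth has an empty user ID.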

## 4. Handle Missing Values

Handle missing values (if any).

In [None]:
# TODO: Handle missing values
# Strategy for ratings data:
# 1. Remove rows with missing UserId or ProductId (critical)
# 2. Fill missing ratings with the median rating (if any)
# 3. Handle missing timestamps (if any)

# Note: rating data usually has few missing values,
# but any that exist must be handled carefully so they do not distort the recommendations
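A minimal sketch of the first two strategy items, assuming missing IDs are encoded as empty strings and missing ratings as `NaN` (the real encoding depends on how the file was loaded). `handle_missing` is a hypothetical helper.

```python
import numpy as np

def handle_missing(user_ids, product_ids, ratings):
    """Drop rows without IDs, then fill missing ratings with the median."""
    user_ids = np.asarray(user_ids, dtype=str)
    product_ids = np.asarray(product_ids, dtype=str)
    ratings = np.asarray(ratings, dtype=float)

    # 1. Rows without a user or product ID cannot be used at all
    keep = (np.char.str_len(user_ids) > 0) & (np.char.str_len(product_ids) > 0)
    user_ids, product_ids, ratings = user_ids[keep], product_ids[keep], ratings[keep]

    # 2. Fill missing ratings with the median of the observed ratings
    median = np.nanmedian(ratings)
    ratings = np.where(np.isnan(ratings), median, ratings)
    return user_ids, product_ids, ratings

users, products, filled = handle_missing(
    ['a', '', 'b', 'c'], ['x', 'y', 'z', 'w'], [4.0, 5.0, np.nan, 2.0])
```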

## 5. Outlier Detection and Treatment

Detect and handle outliers in rating behavior.

In [None]:
# TODO: Detect outliers
# For recommendation systems, outliers might be:
# 1. Users with an unusually high number of ratings (bots, professional reviewers)
# 2. Products with an abnormal number of ratings
# 3. Suspicious rating patterns (all 5-star or all 1-star from one user)

# Detection methods:
# - IQR method for ratings count per user/product
# - Z-score for abnormal rating patterns
# - Statistical tests for suspicious behavior

# Treatment: 
# - Can keep them but flag for separate analysis
# - Or filter out extreme cases
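The IQR method for ratings count per user might look like the sketch below. The multiplier `k=1.5` is the conventional default rather than something this notebook prescribes, and `iqr_outlier_users` is a hypothetical name. Only the upper fence is checked, since the concern here is unusually active users.

```python
import numpy as np

def iqr_outlier_users(user_ids, k=1.5):
    """Flag users whose rating count exceeds Q3 + k * IQR."""
    users, counts = np.unique(user_ids, return_counts=True)
    q1, q3 = np.percentile(counts, [25, 75])
    upper = q3 + k * (q3 - q1)
    return users[counts > upper], upper

# Toy data: user 'e' rates far more often than everyone else
toy_users = np.array(['a', 'b', 'c', 'c', 'd', 'd'] + ['e'] * 20)
outliers, upper_fence = iqr_outlier_users(toy_users)
```

Because rating counts are typically heavy-tailed, this rule can flag many legitimate power users, so flagging for separate analysis is often safer than deleting.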

## 6. Feature Engineering

Create new features to improve the recommendations.

### 6.1. User Features

In [None]:
# TODO: Create user-based features using NumPy
# 1. user_avg_rating: Average rating per user (user bias)
# 2. user_rating_count: Total ratings per user
# 3. user_rating_std: Rating standard deviation per user
# 4. user_activity_level: Categorize users (low/medium/high activity)
# 5. user_rating_range: max - min rating per user

# Use np.unique() and broadcasting to compute these efficiently
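As a sketch of how the per-user statistics can be computed without a Python loop, the example below groups ratings with `np.unique(..., return_inverse=True)` and aggregates with `np.bincount`. The function name `user_features` and the return layout are assumptions; the activity-level categorization is left out.

```python
import numpy as np

def user_features(user_ids, ratings):
    """Per-user mean, count, standard deviation, and range of ratings."""
    ratings = np.asarray(ratings, dtype=float)
    users, idx, counts = np.unique(user_ids, return_inverse=True,
                                   return_counts=True)
    n = len(users)

    # Mean via grouped sums
    mean = np.bincount(idx, weights=ratings, minlength=n) / counts
    # Population variance via E[x^2] - E[x]^2, clipped against round-off
    mean_sq = np.bincount(idx, weights=ratings ** 2, minlength=n) / counts
    std = np.sqrt(np.clip(mean_sq - mean ** 2, 0, None))

    # Range via unbuffered grouped min/max
    r_min = np.full(n, np.inf)
    r_max = np.full(n, -np.inf)
    np.minimum.at(r_min, idx, ratings)
    np.maximum.at(r_max, idx, ratings)

    return users, mean, counts, std, r_max - r_min

users, u_mean, u_count, u_std, u_range = user_features(
    ['a', 'a', 'b'], [5.0, 3.0, 4.0])
```

The same pattern applies to product features by passing product IDs instead of user IDs.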

### 6.2. Product Features

In [None]:
# TODO: Create product-based features
# 1. product_avg_rating: Average rating per product
# 2. product_rating_count: Number of ratings per product (popularity)
# 3. product_rating_std: Rating standard deviation (controversy score)
# 4. product_weighted_rating: Bayesian average
# 5. product_popularity_tier: Categorize products

# Bayesian average formula:
# weighted_rating = (v/(v+m)) * R + (m/(v+m)) * C
# v = ratings count, m = min threshold, R = product avg, C = global avg
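The Bayesian average formula above translates directly into NumPy. In this sketch the threshold `m=5` is an arbitrary illustrative value, and `bayesian_average` is a hypothetical name.

```python
import numpy as np

def bayesian_average(product_avg, product_count, global_avg, m=5):
    """weighted_rating = (v/(v+m)) * R + (m/(v+m)) * C"""
    v = np.asarray(product_count, dtype=float)
    R = np.asarray(product_avg, dtype=float)
    return (v / (v + m)) * R + (m / (v + m)) * global_avg

# One 5-star rating vs. ninety-five ratings averaging 4.0; global average 3.0
weighted = bayesian_average([5.0, 4.0], [1, 95], global_avg=3.0, m=5)
```

The product with a single 5-star rating is pulled strongly toward the global average, while the well-rated product keeps a score close to its own mean.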

### 6.3. Interaction Features

In [None]:
# TODO: Create user-product interaction features
# 1. rating_deviation_user: rating - user_avg_rating
# 2. rating_deviation_product: rating - product_avg_rating
# 3. rating_deviation_global: rating - global_avg_rating
# 4. normalized_rating: (rating - user_avg) / user_std

# These features help remove biases:
# - Some users always rate high/low
# - Some products are generally rated high/low
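The three deviation features could be computed as in this sketch, which looks up each rating's user and product average through the inverse indices from `np.unique`. `rating_deviations` is a hypothetical name, and the `normalized_rating` feature is omitted because it needs a policy for users with zero standard deviation.

```python
import numpy as np

def rating_deviations(user_ids, product_ids, ratings):
    """Return rating minus user average, product average, and global average."""
    ratings = np.asarray(ratings, dtype=float)
    _, u_idx, u_cnt = np.unique(user_ids, return_inverse=True, return_counts=True)
    _, p_idx, p_cnt = np.unique(product_ids, return_inverse=True, return_counts=True)

    user_avg = np.bincount(u_idx, weights=ratings) / u_cnt
    product_avg = np.bincount(p_idx, weights=ratings) / p_cnt
    global_avg = ratings.mean()

    # Index the per-group averages back onto the individual ratings
    return (ratings - user_avg[u_idx],
            ratings - product_avg[p_idx],
            ratings - global_avg)

dev_user, dev_product, dev_global = rating_deviations(
    ['a', 'a', 'b'], ['x', 'y', 'x'], [5.0, 3.0, 1.0])
```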

### 6.4. Temporal Features (if timestamp available)

In [None]:
# TODO: Create temporal features (if timestamp available)
# 1. days_since_first_rating: User tenure
# 2. rating_recency: How recent is the rating
# 3. user_rating_velocity: Ratings per day/week
# 4. temporal_weights: Weight recent ratings more

# Recent ratings might be more relevant for recommendations
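One common way to weight recent ratings more is exponential decay, sketched below. It assumes timestamps are Unix seconds and measures age relative to the newest rating in the data; the one-year half-life is an illustrative choice, and `temporal_weights` is a hypothetical name.

```python
import numpy as np

def temporal_weights(timestamps, half_life_days=365.0):
    """Weight 1.0 for the newest rating, halving every `half_life_days`."""
    t = np.asarray(timestamps, dtype=float)
    age_days = (t.max() - t) / 86400.0  # seconds per day
    return 0.5 ** (age_days / half_life_days)

# Two ratings exactly one year apart
weights = temporal_weights([0, 365 * 86400], half_life_days=365.0)
```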

## 7. Data Normalization & Standardization

Scale the features to improve model performance.

### 7.1. Min-Max Normalization

In [None]:
# TODO: Implement Min-Max Normalization
def min_max_normalize(x, x_min=None, x_max=None):
    """
    Normalize to [0, 1] range
    Formula: (x - min) / (max - min)
    """
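    x = np.asarray(x, dtype=float)  # float output even for integer input
    if x_min is None:
        x_min = np.min(x)
    if x_max is None:
        x_max = np.max(x)
    
    # Numerical stability: a constant feature has zero range, so return zeros
    denominator = x_max - x_min
    if denominator == 0:
        return np.zeros_like(x)
    
    # The zero case is handled above; adding EPSILON here would bias the result
    return (x - x_min) / denominator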

# Apply to features like rating counts

### 7.2. Z-score Standardization

In [None]:
# TODO: Implement Z-score Standardization
def standardize(x, mean=None, std=None):
    """
    Standardize to mean=0, std=1
    Formula: (x - mean) / std
    """
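    x = np.asarray(x, dtype=float)  # float output even for integer input
    if mean is None:
        mean = np.mean(x)
    if std is None:
        std = np.std(x)
    
    # Numerical stability: a constant feature has zero std, so return zeros
    if std == 0:
        return np.zeros_like(x)
    
    # The zero case is handled above; adding EPSILON here would bias the result
    return (x - mean) / std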

# Important for gradient-based algorithms
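When scaling is combined with a train/validation/test split, the statistics are usually estimated on the training data only and then reused for the other sets, so no information from held-out data leaks into preprocessing. The sketch below illustrates that pattern; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not functions defined elsewhere in this notebook.

```python
import numpy as np

def fit_standardizer(train_values):
    """Estimate mean and std from the training data only."""
    train_values = np.asarray(train_values, dtype=float)
    return {'mean': float(train_values.mean()),
            'std': float(train_values.std())}

def apply_standardizer(values, params):
    """Apply previously fitted parameters to any split."""
    values = np.asarray(values, dtype=float)
    if params['std'] == 0:
        return np.zeros_like(values)
    return (values - params['mean']) / params['std']

train_counts = np.array([1.0, 3.0, 5.0])
test_counts = np.array([3.0, 7.0])

params = fit_standardizer(train_counts)
train_scaled = apply_standardizer(train_counts, params)
test_scaled = apply_standardizer(test_counts, params)  # reuses train statistics
```

The fitted `params` dictionary is also what would be stored as "preprocessing parameters" for applying the same transformation to new data.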

### 7.3. Log Transformation

In [None]:
# TODO: Implement Log Transformation
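def log_transform(x, constant=1):
    """
    Log transformation for skewed distributions
    Formula: log(x + constant), with negative x clipped to 0
    Requires constant > 0 so the argument of the log stays positive
    """
    if constant <= 0:
        raise ValueError("constant must be positive")
    # Clip negatives to 0; no EPSILON is needed because constant > 0
    x_clipped = np.clip(x, 0, None)
    return np.log(x_clipped + constant)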

# Useful for:
# - Rating counts (often power-law distributed)
# - Popularity scores

## 8. Numerical Stability Techniques

Ensure stability in numerical computations.

In [None]:
# Numerical stability functions

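def safe_divide(numerator, denominator, epsilon=EPSILON):
    """Safe division: denominators with |d| < epsilon are replaced by +/-epsilon"""
    denominator = np.asarray(denominator, dtype=float)
    # Adding epsilon directly would hit zero at d = -epsilon, so guard by magnitude
    sign = np.where(denominator < 0, -1.0, 1.0)
    safe_denominator = np.where(np.abs(denominator) < epsilon, sign * epsilon, denominator)
    return numerator / safe_denominator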

def safe_log(x, epsilon=EPSILON):
    """Safe logarithm"""
    return np.log(np.clip(x, epsilon, None))

def safe_sqrt(x, epsilon=EPSILON):
    """Safe square root"""
    return np.sqrt(np.clip(x, epsilon, None))

def clip_values(x, min_val, max_val):
    """Clip values to prevent overflow/underflow"""
    return np.clip(x, min_val, max_val)

# Use these functions throughout preprocessing

## 9. User-Item Matrix Construction

Build the user-item rating matrix, the core structure for recommendation.

In [None]:
# TODO: Create user-item matrix using NumPy

def create_user_item_matrix(user_ids, product_ids, ratings):
    """
    Create user-item rating matrix
    
    Returns:
    - matrix: (n_users, n_products) array
    - user_mapping: dict {user_id: index}
    - product_mapping: dict {product_id: index}
    """
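    # Get unique IDs and, for every rating, the row/column index it belongs to
    unique_users, user_idx = np.unique(user_ids, return_inverse=True)
    unique_products, product_idx = np.unique(product_ids, return_inverse=True)
    
    # Create mappings
    user_to_idx = {user: idx for idx, user in enumerate(unique_users)}
    product_to_idx = {prod: idx for idx, prod in enumerate(unique_products)}
    
    # Initialize matrix with zeros (0 means "no rating"; use np.nan instead if
    # 0 could be a valid rating). A dense matrix needs n_users * n_products * 8
    # bytes, so filter the data first if that does not fit in memory.
    n_users = len(unique_users)
    n_products = len(unique_products)
    matrix = np.zeros((n_users, n_products))
    
    # Fill matrix with one vectorized assignment. Duplicate (user, product)
    # pairs should be removed beforehand: NumPy does not guarantee which value wins.
    matrix[user_idx, product_idx] = ratings
    
    return matrix, user_to_idx, product_to_idx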

# Calculate sparsity
# sparsity = 1 - (nnz / total_elements)
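The sparsity formula above can be evaluated without building the dense matrix, which matters when the matrix would be too large for memory. This sketch counts distinct (user, product) pairs as `nnz`; `matrix_sparsity` is a hypothetical name.

```python
import numpy as np

def matrix_sparsity(user_ids, product_ids):
    """sparsity = 1 - nnz / (n_users * n_products), computed from the ID arrays."""
    users, u = np.unique(user_ids, return_inverse=True)
    products, p = np.unique(product_ids, return_inverse=True)
    n_users, n_products = len(users), len(products)

    # Encode each (user, product) pair as one integer and count distinct pairs
    pair_codes = u.astype(np.int64) * n_products + p.astype(np.int64)
    nnz = np.unique(pair_codes).size
    return 1.0 - nnz / (n_users * n_products)

# 2 users x 2 products with 3 observed ratings
sparsity = matrix_sparsity(['a', 'a', 'b'], ['x', 'y', 'x'])
```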

## 10. Handle Cold Start Problem

Handle users/products with little data.

In [None]:
# TODO: Filter or handle cold start items
# Strategy 1: Filter out users/products with fewer than a threshold number of ratings
min_user_ratings = 5  # Minimum ratings per user
min_product_ratings = 5  # Minimum ratings per product

# Strategy 2: Keep them but use different recommendation approach
# - Popularity-based for new users
# - Content-based for new products
# - Use global statistics

# Note: Balance data quality against dataset size
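Strategy 1 has a subtlety: removing sparse products can push some users below the user threshold, and vice versa. A sketch that repeats the filter until both conditions hold is shown below; `filter_cold_start` is a hypothetical name, and whether to iterate or filter only once is a design choice this notebook leaves open.

```python
import numpy as np

def filter_cold_start(user_ids, product_ids,
                      min_user_ratings=5, min_product_ratings=5):
    """Return a mask of ratings whose user and product both meet the thresholds."""
    user_ids = np.asarray(user_ids)
    product_ids = np.asarray(product_ids)
    keep = np.ones(len(user_ids), dtype=bool)

    while True:
        idx = np.flatnonzero(keep)
        if idx.size == 0:
            break
        # Counts are recomputed on the surviving ratings only
        _, u_inv, u_cnt = np.unique(user_ids[idx], return_inverse=True,
                                    return_counts=True)
        _, p_inv, p_cnt = np.unique(product_ids[idx], return_inverse=True,
                                    return_counts=True)
        ok = (u_cnt[u_inv] >= min_user_ratings) & (p_cnt[p_inv] >= min_product_ratings)
        if ok.all():
            break
        keep[idx[~ok]] = False
    return keep

keep_mask = filter_cold_start(['a', 'a', 'b', 'b', 'c'],
                              ['x', 'y', 'x', 'y', 'x'],
                              min_user_ratings=2, min_product_ratings=2)
```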

## 11. Data Splitting

Split the data into training, validation, and test sets.

In [None]:
# TODO: Split data using NumPy
def train_val_test_split(data, train_ratio=0.7, val_ratio=0.15, random_state=42):
    """
    Split data into train/val/test sets
    
    Important: For recommendation systems, consider:
    1. Random split: Random selection of ratings
    2. Temporal split: Split by time (if timestamp available)
    3. User-based split: Some users in train, others in test
    4. Leave-one-out: Hold out one rating per user for testing
    """
    # Use a local generator so the global NumPy random state is not modified
    rng = np.random.default_rng(random_state)
    n_samples = len(data)
    
    # Shuffle indices
    indices = rng.permutation(n_samples)
    
    # Calculate split points
    train_end = int(train_ratio * n_samples)
    val_end = int((train_ratio + val_ratio) * n_samples)
    
    # Split
    train_idx = indices[:train_end]
    val_idx = indices[train_end:val_end]
    test_idx = indices[val_end:]
    
    return train_idx, val_idx, test_idx

# Important: Avoid data leakage!
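As an illustration of the leave-one-out option from the docstring, the sketch below holds out each user's most recent rating for testing, which also respects time order and so reduces leakage from future ratings. It assumes timestamps are available and numeric; `leave_last_out` is a hypothetical name.

```python
import numpy as np

def leave_last_out(user_ids, timestamps):
    """Hold out each user's latest rating; return (train_idx, test_idx)."""
    timestamps = np.asarray(timestamps)
    # Integer-encode users so they can be used as a sort key
    _, u = np.unique(user_ids, return_inverse=True)

    # Sort by user first, then by timestamp within each user
    order = np.lexsort((timestamps, u))
    sorted_users = u[order]

    # A position is "last" if the next sorted row belongs to another user
    is_last = np.ones(len(order), dtype=bool)
    is_last[:-1] = sorted_users[:-1] != sorted_users[1:]

    return np.sort(order[~is_last]), np.sort(order[is_last])

train_idx, test_idx = leave_last_out(['a', 'b', 'a', 'b', 'a'],
                                     [1, 5, 3, 2, 2])
```

Users with a single rating end up only in the test set under this scheme, so it is usually applied after the cold-start filter.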

## 12. Statistical Hypothesis Testing

Test statistical hypotheses about the data.

### Test 1: Rating distribution normality

**H0:** Rating distribution follows normal distribution  
**H1:** Rating distribution does not follow normal distribution

In [None]:
# TODO: Test for normality
# Use Q-Q plot visualization
# Calculate skewness and kurtosis
# Interpretation will guide choice of transformation
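Skewness and excess kurtosis can be computed with plain NumPy from standardized moments, as in the sketch below. These are descriptive statistics rather than a formal normality test, and the function assumes the input is not constant. `skewness_kurtosis` is a hypothetical name.

```python
import numpy as np

def skewness_kurtosis(x):
    """Return (skewness, excess kurtosis); both are 0 for a normal distribution."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # assumes x.std() > 0
    skew = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0
    return skew, excess_kurtosis

skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
```

A strongly negative skew, which is common for star ratings concentrated at 4 and 5, would be evidence against H0.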

### Test 2: User rating independence

**H0:** User ratings are independent  
**H1:** User ratings are correlated (e.g., user tends to rate similarly)

In [None]:
# TODO: Test for independence
# Calculate intra-user correlation
# This helps understand if user bias adjustment is needed
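One simple proxy for intra-user correlation is the share of total rating variance explained by per-user means: a value near 0 suggests users rate alike on average, while a value near 1 suggests strong user-specific bias. This is an illustrative statistic, not a formal hypothesis test, and `user_variance_share` is a hypothetical name.

```python
import numpy as np

def user_variance_share(user_ids, ratings):
    """Between-user sum of squares divided by total sum of squares."""
    ratings = np.asarray(ratings, dtype=float)
    _, idx, counts = np.unique(user_ids, return_inverse=True, return_counts=True)
    user_avg = np.bincount(idx, weights=ratings) / counts
    global_avg = ratings.mean()

    total_ss = np.sum((ratings - global_avg) ** 2)
    if total_ss == 0:
        return 0.0  # all ratings identical: nothing to explain
    between_ss = np.sum(counts * (user_avg - global_avg) ** 2)
    return between_ss / total_ss

share_biased = user_variance_share(['a', 'a', 'b', 'b'], [5, 5, 1, 1])
share_unbiased = user_variance_share(['a', 'a', 'b', 'b'], [4, 2, 4, 2])
```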

## 13. Save Processed Data

Save the processed data for use in modeling.

In [None]:
# TODO: Save processed data
output_dir = '../data/processed/'

# Save:
# 1. Train/val/test splits (as .npy files)
# 2. User-item matrix
# 3. User/product mappings (as .npy or pickle)
# 4. Feature arrays
# 5. Preprocessing parameters (for applying to new data)

# Example:
# np.save(f'{output_dir}train_data.npy', train_data)
# np.save(f'{output_dir}user_item_matrix.npy', user_item_matrix)
# np.save(f'{output_dir}user_features.npy', user_features)

print(f"✓ Data saved to {output_dir}")
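A sketch of the saving step, assuming the splits are index arrays and the preprocessing parameters fit in a small dictionary of plain numbers. The file names `splits.npz` and `preprocessing_params.json` are illustrative choices, and `save_processed` is a hypothetical helper.

```python
import json
import os
import numpy as np

def save_processed(output_dir, train_idx, val_idx, test_idx, params):
    """Write split indices to one .npz file and parameters to JSON."""
    os.makedirs(output_dir, exist_ok=True)  # create the folder if needed

    np.savez(os.path.join(output_dir, 'splits.npz'),
             train=train_idx, val=val_idx, test=test_idx)

    # JSON keeps the parameters human-readable and avoids pickle
    with open(os.path.join(output_dir, 'preprocessing_params.json'), 'w') as f:
        json.dump(params, f, indent=2)
```

Calling `save_processed(output_dir, train_idx, val_idx, test_idx, {'mean': 4.1, 'std': 1.2})` (placeholder values) would create both files; `np.load` on the `.npz` file returns the arrays by name.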

## 14. Preprocessing Summary

Summary of the preprocessing process.

### Preprocessing Steps Completed:

**Data Cleaning:**
- TODO: Document validation results
- TODO: Missing values handling
- TODO: Outlier treatment

**Feature Engineering:**
- TODO: List new features created
- TODO: User features
- TODO: Product features
- TODO: Interaction features

**Data Transformation:**
- TODO: Normalization applied
- TODO: Standardization applied
- TODO: Log transforms

**Data Structure:**
- TODO: User-item matrix dimensions
- TODO: Sparsity level
- TODO: Train/val/test sizes

**Ready for Modeling:**
- ✓ Clean data
- ✓ Engineered features
- ✓ Normalized/standardized
- ✓ Train/val/test splits
- ✓ Data saved