# Data Preprocessing

**Student:** Phạm Phú Hòa  
**Student ID:** 23122030

**Purpose:** Filter the data (drop users and products with too few ratings), build index mappings, split train/test by time, and save the artifacts for modeling.

## 1. Setup

In [29]:
import sys
import os
sys.path.insert(0, os.path.abspath('../src'))

import numpy as np
from data_processing import DataProcessor, filter_by_min_ratings, train_test_split

np.random.seed(42)

## 2. Load Data

In [30]:
dp = DataProcessor().load_and_extract('../data/raw/ratings_Beauty.csv')

user_ids = dp.user_ids
product_ids = dp.product_ids
ratings = dp.ratings
timestamps = dp.timestamps

print("Original data:")
print(f"  Users: {len(np.unique(user_ids)):,}")
print(f"  Products: {len(np.unique(product_ids)):,}")
print(f"  Ratings: {len(ratings):,}")

Original data:
  Users: 1,210,271
  Products: 249,274
  Ratings: 2,023,070


## 3. Filter by Minimum Number of Ratings

Rationale:
- Users with fewer than 5 ratings: too few to analyze behavior patterns
- Products with fewer than 5 ratings: cold-start problem, too little signal for collaborative filtering

In [31]:
min_user_ratings = 5
min_product_ratings = 5

filtered_users, filtered_products, filtered_ratings, filtered_timestamps = filter_by_min_ratings(
    user_ids, product_ids, ratings, timestamps,
    min_user_ratings=min_user_ratings,
    min_product_ratings=min_product_ratings
)

print(f"After filtering (min {min_user_ratings} ratings/user, {min_product_ratings} ratings/product):")
print(f"  Users: {len(np.unique(filtered_users)):,}")
print(f"  Products: {len(np.unique(filtered_products)):,}")
print(f"  Ratings: {len(filtered_ratings):,}")
print(f"  Kept: {len(filtered_ratings)/len(ratings)*100:.1f}%")

After filtering (min 5 ratings/user, 5 ratings/product):
  Users: 22,480
  Products: 12,153
  Ratings: 199,177
  Kept: 9.8%
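
The implementation of `filter_by_min_ratings` lives in `src/data_processing` and is not shown in this notebook. As a rough sketch of how such a filter could work, the hypothetical helper below (`filter_min_counts` is an illustrative name, not the project's API) repeats the count check until it is stable. Repeating matters because dropping sparse products can push a user back under the threshold, and vice versa. Whether the project's function iterates this way is an assumption.

```python
import numpy as np

def filter_min_counts(users, items, min_user=5, min_item=5, max_iter=100):
    """Return a boolean mask of rows whose user and item both meet the minimums.

    Hypothetical helper for illustration only.
    """
    users = np.asarray(users)
    items = np.asarray(items)
    keep = np.ones(len(users), dtype=bool)
    for _ in range(max_iter):
        # Count ratings per user and per item among the rows still kept.
        _, u_inv, u_cnt = np.unique(users[keep], return_inverse=True, return_counts=True)
        _, i_inv, i_cnt = np.unique(items[keep], return_inverse=True, return_counts=True)
        ok = (u_cnt[u_inv] >= min_user) & (i_cnt[i_inv] >= min_item)
        if ok.all():
            break  # every remaining row satisfies both thresholds
        kept_rows = np.flatnonzero(keep)
        keep[kept_rows[~ok]] = False  # drop failing rows, then recount
    return keep

# Tiny example: the lone (c, z) rating is removed, the rest survive.
demo_users = np.array(["a", "a", "b", "b", "b", "c"])
demo_items = np.array(["x", "y", "x", "y", "x", "z"])
demo_mask = filter_min_counts(demo_users, demo_items, min_user=2, min_item=2)
```

The mask can then be applied to every parallel array (`ratings[mask]`, `timestamps[mask]`, and so on) so they stay aligned.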


## 4. Build Index Mappings

Convert string IDs to 0-based integer indices for use with NumPy arrays.

In [32]:
# User mappings
unique_users = np.unique(filtered_users)
user_to_idx = {user_id: idx for idx, user_id in enumerate(unique_users)}
idx_to_user = {idx: user_id for user_id, idx in user_to_idx.items()}

user_indices = np.array([user_to_idx[u] for u in filtered_users])

# Product mappings
unique_products = np.unique(filtered_products)
product_to_idx = {product_id: idx for idx, product_id in enumerate(unique_products)}
idx_to_product = {idx: product_id for product_id, idx in product_to_idx.items()}

product_indices = np.array([product_to_idx[p] for p in filtered_products])

n_users = len(unique_users)
n_products = len(unique_products)

print("Index ranges:")
print(f"  User indices: 0-{n_users-1}")
print(f"  Product indices: 0-{n_products-1}")

Index ranges:
  User indices: 0-22479
  Product indices: 0-12152
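
As an alternative to the dictionary lookup in a list comprehension, `np.unique` with `return_inverse=True` returns the same 0-based indices in one vectorized call. Because `np.unique` sorts its output, these indices agree with `enumerate(np.unique(...))` as used above. The sample IDs below are made up for illustration.

```python
import numpy as np

# Made-up IDs standing in for filtered_users.
raw_ids = np.array(["u3", "u1", "u3", "u2"])

# unique_ids is sorted; inverse[i] is the position of raw_ids[i] in unique_ids.
unique_ids, inverse = np.unique(raw_ids, return_inverse=True)

# The dictionaries are still useful for looking up single IDs later.
id_to_idx = {uid: idx for idx, uid in enumerate(unique_ids)}
idx_to_id = {idx: uid for uid, idx in id_to_idx.items()}
```

Indexing `unique_ids[inverse]` reconstructs the original ID array, which is a cheap sanity check on the mapping.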


## 5. Train/Test Split

Temporal split: the earliest 80% of ratings go to train, the latest 20% go to test.

In [33]:
test_size = 0.2

result = train_test_split(
    user_indices, product_indices, filtered_ratings, filtered_timestamps,
    test_size=test_size
)

train_users = result['train']['user_indices']
train_products = result['train']['product_indices']
train_ratings = result['train']['ratings']
train_timestamps = result['train']['timestamps']
test_users = result['test']['user_indices']
test_products = result['test']['product_indices']
test_ratings = result['test']['ratings']
test_timestamps = result['test']['timestamps']

print(f"Train/Test split ({(1-test_size):.0%}/{test_size:.0%}):")
print(f"  Train: {len(train_ratings):,} ratings")
print(f"  Test: {len(test_ratings):,} ratings")

Train/Test split (80%/20%):
  Train: 159,342 ratings
  Test: 39,835 ratings
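
The project's `train_test_split` is imported from `src/data_processing`, so its internals are not visible here. A minimal sketch of a temporal split, assuming it simply sorts by timestamp and cuts at the 80% mark, might look like the following (`temporal_split` and its rounding rule are assumptions for illustration):

```python
import numpy as np

def temporal_split(timestamps, test_size=0.2):
    """Return (train_idx, test_idx) so that test rows are the most recent ones.

    Hypothetical helper; the rounding of the test size is an assumption.
    """
    timestamps = np.asarray(timestamps)
    # A stable sort keeps the original order among equal timestamps.
    order = np.argsort(timestamps, kind="stable")
    n_test = int(round(len(order) * test_size))
    split = len(order) - n_test
    return order[:split], order[split:]

demo_ts = np.array([50, 10, 40, 20, 30])
train_idx, test_idx = temporal_split(demo_ts, test_size=0.2)
```

Every training timestamp is then no later than any test timestamp, which avoids leaking future ratings into training. One consequence is that some users or products may appear only in the test portion.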


## 6. Compute User/Product Statistics

Compute the rating count and mean rating per user and per product for the baseline models.

In [34]:
# User statistics (training data only)
user_stats = np.zeros((n_users, 2))  # [count, avg_rating]
for user_idx in range(n_users):
    user_mask = (train_users == user_idx)
    user_ratings = train_ratings[user_mask]
    user_stats[user_idx, 0] = len(user_ratings)
    user_stats[user_idx, 1] = np.mean(user_ratings) if len(user_ratings) > 0 else 0.0

# Product statistics (training data only)
product_stats = np.zeros((n_products, 2))  # [count, avg_rating]
for product_idx in range(n_products):
    product_mask = (train_products == product_idx)
    product_ratings = train_ratings[product_mask]
    product_stats[product_idx, 0] = len(product_ratings)
    product_stats[product_idx, 1] = np.mean(product_ratings) if len(product_ratings) > 0 else 0.0

print("User statistics:")
print(f"  Mean user rating: {np.mean(user_stats[:, 1]):.3f}")
print("Product statistics:")
print(f"  Mean product rating: {np.mean(product_stats[:, 1]):.3f}")

User statistics:
  Mean user rating: 4.187
Product statistics:
  Mean product rating: 4.188
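
The loops above scan the full training array once per user and once per product. A vectorized sketch using `np.bincount` yields the same `[count, avg_rating]` layout in a single pass. The helper name `group_stats` is hypothetical.

```python
import numpy as np

def group_stats(indices, values, n_groups):
    """Return an (n_groups, 2) array of [count, mean]; empty groups get mean 0.0."""
    counts = np.bincount(indices, minlength=n_groups)
    sums = np.bincount(indices, weights=values, minlength=n_groups)
    # Divide only where count > 0; other entries keep the 0.0 from `out`.
    means = np.divide(sums, counts, out=np.zeros(n_groups), where=counts > 0)
    return np.column_stack([counts, means])

demo_stats = group_stats(np.array([0, 0, 2]), np.array([4.0, 5.0, 3.0]), n_groups=3)
```

Entities with no training ratings keep a mean of 0.0, as in the loop version. If any exist, they pull down `np.mean(stats[:, 1])`; restricting to rows with `stats[:, 0] > 0` gives the mean over observed entities only.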


## 7. Save Processed Data

Save all artifacts for notebook 03 (modeling).

In [35]:
# ID mappings
np.savez_compressed('../data/processed/id_mappings.npz',
                    user_to_idx=np.array(list(user_to_idx.items()), dtype=object),
                    product_to_idx=np.array(list(product_to_idx.items()), dtype=object),
                    idx_to_user=np.array(list(idx_to_user.items()), dtype=object),
                    idx_to_product=np.array(list(idx_to_product.items()), dtype=object))

# Train and test data
np.savez_compressed('../data/processed/preprocessed_data.npz',
                    train_users=train_users,
                    train_products=train_products,
                    train_ratings=train_ratings,
                    train_timestamps=train_timestamps,
                    test_users=test_users,
                    test_products=test_products,
                    test_ratings=test_ratings,
                    test_timestamps=test_timestamps,
                    n_users=n_users,
                    n_products=n_products)

# Statistics
np.save('../data/processed/user_stats.npy', user_stats)
np.save('../data/processed/product_stats.npy', product_stats)

print("Saved preprocessing outputs:")
print("  - data/processed/id_mappings.npz")
print("  - data/processed/preprocessed_data.npz")
print("  - data/processed/user_stats.npy")
print("  - data/processed/product_stats.npy")

Saved preprocessing outputs:
  - data/processed/id_mappings.npz
  - data/processed/preprocessed_data.npz
  - data/processed/user_stats.npy
  - data/processed/product_stats.npy
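
Because the ID mappings are stored as `dtype=object` arrays, reading them back requires `allow_pickle=True` in `np.load`. The sketch below shows one way the modeling notebook might restore a mapping. `load_mapping` is a hypothetical helper, and the round trip uses a temporary file with made-up IDs rather than the real artifacts.

```python
import os
import tempfile

import numpy as np

def load_mapping(path, key):
    """Rebuild a {id: index} dict from an (n, 2) object array stored in an .npz file."""
    # allow_pickle=True is required for object arrays; only use it on trusted files.
    with np.load(path, allow_pickle=True) as data:
        return {item_id: int(idx) for item_id, idx in data[key]}

# Round trip on a temporary file, mirroring how id_mappings.npz is written.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "id_mappings.npz")
    demo_mapping = {"A1": 0, "B2": 1}
    np.savez_compressed(
        demo_path,
        user_to_idx=np.array(list(demo_mapping.items()), dtype=object),
    )
    restored = load_mapping(demo_path, "user_to_idx")
```

The numeric arrays in `preprocessed_data.npz` and the `.npy` statistics files contain no Python objects, so they load with plain `np.load(path)`.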
