# Data Exploration - Amazon Beauty Products Recommendation System

**Problem:** Build a recommendation system for beauty products based on users' rating history

**Dataset:** Amazon Ratings - Beauty Products

## Objectives:
- Explore and understand the structure of the ratings data
- Analyze user behavior
- Analyze product characteristics
- Analyze the rating distribution and its patterns
- Pose questions and answer them with data
- Extract insights for the recommendation system

## 1. Import Libraries

**Requirement:** Use ONLY NumPy for data processing, and Matplotlib and Seaborn for visualization

In [2]:
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.set_printoptions(precision=4, suppress=True)

print("✓ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

✓ Libraries imported successfully!
NumPy version: 2.0.0


## 2. Load Data using NumPy

Load the CSV data using NumPy only (no Pandas)

In [None]:
# Load CSV data using NumPy
data_path = '../data/raw/ratings_Beauty.csv'

# TODO: Implement CSV reader using NumPy
# Expected columns: UserId, ProductId, Rating, Timestamp
# Strategy: 
# 1. Read file line by line
# 2. Parse CSV format
# 3. Store in numpy arrays

def load_csv_numpy(filepath, delimiter=',', skip_header=True):
    """
    Load CSV file using only NumPy
    Returns: data array, header
    """
    # TODO: Implement
    pass

# Load data
# data, header = load_csv_numpy(data_path)
# print(f"✓ Data shape: {data.shape}")
# print(f"✓ Columns: {header}")
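
One possible way to fill in the loader is sketched below. It assumes a plain comma-separated file without quoted fields, and it splits the work into a hypothetical helper `parse_csv_lines` (not part of the original notebook) so the parsing can be tried on an in-memory sample. The sample rows are synthetic, not real records from `ratings_Beauty.csv`.

```python
import io
import numpy as np

def parse_csv_lines(lines, delimiter=',', skip_header=True):
    """Split an iterable of text lines into (data, header).

    data is a 2-D NumPy array of strings; header is a list or None.
    Assumes no quoted fields containing the delimiter.
    """
    rows = [line.strip().split(delimiter) for line in lines if line.strip()]
    header = rows.pop(0) if skip_header and rows else None
    return np.array(rows, dtype=str), header

def load_csv_numpy(filepath, delimiter=',', skip_header=True):
    """Open `filepath` and parse it with parse_csv_lines."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return parse_csv_lines(f, delimiter, skip_header)

# Synthetic sample standing in for the real file (values are made up).
sample = io.StringIO(
    "UserId,ProductId,Rating,Timestamp\n"
    "U1,P1,5.0,1369699200\n"
    "U2,P1,3.0,1355443200\n"
    "U1,P2,4.0,1404691200\n"
)
data, header = parse_csv_lines(sample)

# Everything is loaded as text, so numeric columns need an explicit cast.
ratings = data[:, 2].astype(np.float64)
timestamps = data[:, 3].astype(np.int64)
print(data.shape, header)
```

Keeping the IDs as strings and casting only `Rating` and `Timestamp` avoids forcing a single numeric dtype onto mixed columns.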

## 3. Basic Data Information

Display basic information about the dataset

In [None]:
# TODO: Display basic information
# - Total number of ratings
# - Number of unique users
# - Number of unique products
# - Data types of each column
# - Sample records (first 10 rows)

# Example:
# n_ratings = data.shape[0]
# n_users = len(np.unique(data[:, 0]))  # Column 0: UserId
# n_products = len(np.unique(data[:, 1]))  # Column 1: ProductId

## 4. Descriptive Statistics

Compute descriptive statistics for the ratings using NumPy

In [None]:
# TODO: Calculate descriptive statistics for Rating column
# Using NumPy functions:
# - np.mean() - Mean rating
# - np.median() - Median rating
# - np.std() - Standard deviation
# - np.var() - Variance
# - np.min(), np.max() - Min, Max
# - np.percentile() - Quantiles (25%, 50%, 75%)

# Calculate skewness and excess kurtosis manually:
# Skewness = E[(X - μ)³] / σ³
# Excess kurtosis = E[(X - μ)⁴] / σ⁴ - 3
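
The statistics listed in this cell can be sketched as follows. The `ratings` array is synthetic, and the population standard deviation (`ddof=0`) is an assumption; the skewness and kurtosis lines follow the formulas in the comments above.

```python
import numpy as np

# Synthetic ratings used only for illustration.
ratings = np.array([5, 5, 4, 5, 1, 3, 5, 4, 2, 5], dtype=np.float64)

mean = np.mean(ratings)
median = np.median(ratings)
std = np.std(ratings)                      # population std (ddof=0)
variance = np.var(ratings)
q25, q50, q75 = np.percentile(ratings, [25, 50, 75])

# Standardize once, then take the third and fourth moments.
z = (ratings - mean) / std
skewness = np.mean(z ** 3)                 # E[(X - mu)^3] / sigma^3
excess_kurtosis = np.mean(z ** 4) - 3.0    # E[(X - mu)^4] / sigma^4 - 3

print(f"mean={mean:.2f} median={median:.2f} std={std:.2f}")
print(f"skewness={skewness:.3f} excess kurtosis={excess_kurtosis:.3f}")
```

A negative skewness here would indicate a long tail toward low ratings, which is the usual shape when 5-star ratings dominate.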

## 5. Missing Values Analysis

Check for missing values in the dataset

In [None]:
# TODO: Check for missing values using NumPy
# - Count NaN values: np.isnan() (numeric arrays only; it raises TypeError on string arrays)
# - Count None values
# - Count empty strings (if there is string data)
# - Calculate percentage of missing values

# Visualize missing data pattern
# - Bar chart showing missing percentage per column
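
A minimal sketch of the check, assuming the data was loaded as a 2-D array of strings so that missing entries show up as empty strings or the literal text `nan`. The rows are synthetic.

```python
import numpy as np

# Synthetic string table with a few deliberately missing cells.
data = np.array([
    ["U1", "P1", "5.0", "1369699200"],
    ["U2", "",   "3.0", "1355443200"],
    ["U3", "P2", "nan", "1404691200"],
    ["U4", "P3", "",    ""],
], dtype=str)
columns = ["UserId", "ProductId", "Rating", "Timestamp"]

stripped = np.char.strip(data)
is_missing = (stripped == "") | (np.char.lower(stripped) == "nan")

missing_count = is_missing.sum(axis=0)
missing_pct = 100.0 * missing_count / data.shape[0]
for name, count, pct in zip(columns, missing_count, missing_pct):
    print(f"{name:10s} missing={count} ({pct:.1f}%)")

# Cast the rating column to float, mapping missing cells to NaN first.
rating_text = np.where(is_missing[:, 2], "nan", stripped[:, 2])
ratings = rating_text.astype(np.float64)
```

After the cast, `np.isnan(ratings)` works as usual because the column is numeric.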

## 6. Rating Distribution Analysis

Analyze the distribution of ratings

In [None]:
# TODO: Analyze rating distribution
# 1. Count frequency of each rating value (1, 2, 3, 4, 5)
# 2. Calculate percentage for each rating
# 3. Visualize:
#    - Histogram of ratings
#    - Bar chart showing rating counts
#    - Pie chart showing rating proportions

# Check for rating bias (e.g., more 5-star ratings?)
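
The counting steps could look like the sketch below, using a synthetic `ratings` array. The share of extreme ratings (1 and 5 stars) is included as one simple indicator of bias.

```python
import numpy as np

# Synthetic ratings used only for illustration.
ratings = np.array([5, 5, 4, 5, 1, 3, 5, 4, 2, 5])

values, counts = np.unique(ratings, return_counts=True)
percentages = 100.0 * counts / counts.sum()

for value, count, pct in zip(values, counts, percentages):
    print(f"{value} stars: {count} ratings ({pct:.1f}%)")

# Fraction of ratings that are either 1 or 5 stars.
extreme_share = counts[np.isin(values, [1, 5])].sum() / counts.sum()
```

`values` and `counts` can be passed directly to `plt.bar(values, counts)` or `plt.pie(counts, labels=values)` for the charts listed above.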

## 7. User Behavior Analysis

Analyze user behavior

### 7.1. Number of Ratings per User

In [None]:
# TODO: Calculate ratings per user using NumPy
# Use np.unique() with return_counts=True
# - Distribution of number of ratings per user
# - Mean, median, std of ratings per user
# - Top 10 most active users
# - Percentage of users with only 1 rating (cold start problem)

# Visualize:
# - Histogram of ratings per user
# - Log-scale histogram (if power-law distribution)
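
A small sketch of the `np.unique(..., return_counts=True)` approach on synthetic user IDs:

```python
import numpy as np

# Synthetic user IDs, one entry per rating.
user_ids = np.array(["U1", "U2", "U1", "U3", "U1", "U2", "U4"])

unique_users, ratings_per_user = np.unique(user_ids, return_counts=True)

# Most active users first.
order = np.argsort(ratings_per_user)[::-1]
top_users = unique_users[order][:10]
top_counts = ratings_per_user[order][:10]

# Share of users who rated exactly once.
single_rating_pct = 100.0 * np.mean(ratings_per_user == 1)

print("mean:", ratings_per_user.mean(), "median:", np.median(ratings_per_user))
print("top users:", list(zip(top_users, top_counts)))
```

For the log-scale histogram, `plt.hist(ratings_per_user, bins=50, log=True)` is one option.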

### 7.2. User Rating Behavior

In [None]:
# TODO: Analyze user rating patterns
# - Average rating per user (user bias)
# - User rating variance (harsh vs generous raters)
# - Distribution of user mean ratings

# Identify user types:
# - Harsh raters (mean rating < 3)
# - Generous raters (mean rating > 4)
# - Balanced raters (3 ≤ mean rating ≤ 4)
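
Per-user means and variances can be computed without loops by combining `np.unique(..., return_inverse=True)` with `np.bincount`. This is a sketch on synthetic data, with the thresholds taken from the comments above.

```python
import numpy as np

# Synthetic data: one entry per rating.
user_ids = np.array(["U1", "U2", "U1", "U3", "U1", "U2", "U3"])
ratings = np.array([5.0, 2.0, 4.0, 3.0, 5.0, 1.0, 4.0])

# inverse[i] is the index of user_ids[i] within `users`.
users, inverse, counts = np.unique(
    user_ids, return_inverse=True, return_counts=True
)

user_mean = np.bincount(inverse, weights=ratings) / counts
# Var(X) = E[X^2] - E[X]^2, clipped at 0 to absorb rounding error.
user_var = np.maximum(
    np.bincount(inverse, weights=ratings ** 2) / counts - user_mean ** 2, 0.0
)

harsh = users[user_mean < 3]
generous = users[user_mean > 4]
print("harsh:", harsh, "generous:", generous)
```

The same pattern works per product by swapping `user_ids` for the product ID column.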

## 8. Product Analysis

Analyze product characteristics

### 8.1. Number of Ratings per Product

In [None]:
# TODO: Calculate ratings per product
# - Distribution of ratings per product
# - Top 20 most reviewed products
# - Products with few ratings (cold start problem)
# - Percentage of products with < 5 ratings

# Visualize:
# - Histogram of ratings per product
# - Bar chart of top products

### 8.2. Product Rating Quality

In [None]:
# TODO: Analyze product quality
# - Average rating per product
# - Top 20 highest rated products (with min. reviews threshold)
# - Top 20 lowest rated products
# - Product rating variance (controversial products)
# - Correlation between popularity and rating

## 9. Temporal Analysis

Analyze trends over time (if timestamps are available)

In [None]:
# TODO: Temporal analysis (if timestamp available)
# - Number of ratings over time (line chart)
# - Average rating trends over time
# - Seasonal patterns (monthly, yearly)
# - Peak activity periods

# Check for:
# - Rating inflation over time
# - Product lifecycle patterns
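
A sketch of yearly aggregation, assuming the `Timestamp` column holds Unix time in seconds (an assumption worth verifying against the real file). The values below are synthetic.

```python
import numpy as np

# Synthetic Unix timestamps (seconds) and matching ratings.
timestamps = np.array(
    [1369699200, 1355443200, 1404691200, 1388534400, 1372636800],
    dtype=np.int64,
)
ratings = np.array([5.0, 3.0, 4.0, 2.0, 5.0])

# Convert to NumPy datetimes, then truncate to calendar years.
dates = timestamps.astype("datetime64[s]")
years = dates.astype("datetime64[Y]").astype(int) + 1970

unique_years, inverse, counts = np.unique(
    years, return_inverse=True, return_counts=True
)
mean_by_year = np.bincount(inverse, weights=ratings) / counts

for year, n, avg in zip(unique_years, counts, mean_by_year):
    print(f"{year}: {n} ratings, mean {avg:.2f}")
```

Replacing `"datetime64[Y]"` with `"datetime64[M]"` gives monthly buckets for the seasonal analysis.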

## 10. Data Sparsity Analysis

Analyze data sparsity, which is important for recommendation systems!

In [None]:
# TODO: Calculate sparsity of user-item matrix
# Sparsity = 1 - (number of ratings / (n_users × n_products))

# Example:
# sparsity = 1 - (n_ratings / (n_users * n_products))
# density = (n_ratings / (n_users * n_products)) * 100

# Visualize:
# - Sample heatmap of user-item matrix (small subset)
# - Distribution of non-zero entries per row/column
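
The sparsity formula can be tried on a toy example. The sketch assumes each (user, product) pair appears at most once; the IDs are synthetic.

```python
import numpy as np

# Synthetic ratings: 6 ratings from 4 users on 3 products.
user_ids = np.array(["U1", "U1", "U2", "U3", "U4", "U4"])
product_ids = np.array(["P1", "P2", "P1", "P3", "P2", "P3"])

n_ratings = user_ids.size
n_users = np.unique(user_ids).size
n_products = np.unique(product_ids).size

# Fraction of the user-item matrix that is filled, and its complement.
density = n_ratings / (n_users * n_products)
sparsity = 1.0 - density

print(f"density = {density * 100:.2f}%, sparsity = {sparsity * 100:.2f}%")
```

On real rating data the density is typically a tiny fraction of a percent, so printing it with several decimal places is advisable.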

## 11. Exploratory Questions

Pose questions about the data and answer them

### Question 1: Is there rating polarization?

Do users tend to give very high (5-star) or very low (1-star) ratings, and rarely middle ratings?

In [None]:
# TODO: Analyze rating polarization
# - Compare frequency of extreme ratings (1, 5) vs middle ratings (2, 3, 4)
# - Visualize with a bar chart and a pie chart

### Question 2: Which products stand out the most?

Identify products that combine a high number of ratings with high rating quality

In [None]:
# TODO: Find outstanding products
# Calculate popularity score: weighted rating
# Formula: weighted_rating = (v/(v+m)) * R + (m/(v+m)) * C
# Where:
# v = number of ratings for the product
# m = minimum ratings threshold
# R = average rating for the product
# C = mean rating across all products

# Visualize top products with a scatter plot (ratings count vs avg rating)
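
The weighted-rating formula can be sketched as below. The per-product counts and means are synthetic, `m = 10` is an arbitrary threshold, and `C` is taken here as the mean over all individual ratings (using the mean of product means instead would be another reasonable reading).

```python
import numpy as np

# Synthetic per-product statistics.
v = np.array([500, 3, 40, 1], dtype=float)   # number of ratings
R = np.array([4.2, 5.0, 4.6, 1.0])           # average rating
m = 10                                       # minimum ratings threshold (assumed)

# Global mean over all ratings, reconstructed from counts and means.
C = np.sum(v * R) / np.sum(v)

# Products with few ratings are pulled toward C; popular ones stay near R.
weighted_rating = (v / (v + m)) * R + (m / (v + m)) * C

ranking = np.argsort(weighted_rating)[::-1]
print("ranking (best first):", ranking)
print("weighted ratings:", np.round(weighted_rating, 3))
```

In this toy case the product with three 5-star ratings is shrunk well below 5.0, while the single 1-star product is lifted toward the global mean.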

### Question 3: Do "power users" exist?

Analyze how user activity is distributed: does a small group of users contribute most of the ratings?

In [None]:
# TODO: Analyze power users
# - Calculate cumulative percentage of ratings
# - Create Pareto chart (80-20 rule?)
# - What % of users contribute 80% of ratings?

# This is important for:
# - Understanding data concentration
# - Cold start problem severity
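
A sketch of the cumulative-share calculation behind a Pareto chart, using synthetic per-user rating counts:

```python
import numpy as np

# Synthetic number of ratings per user.
ratings_per_user = np.array([40, 25, 20, 5, 3, 2, 2, 1, 1, 1])

# Sort users from most to least active and accumulate their share.
sorted_counts = np.sort(ratings_per_user)[::-1]
cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()

# Smallest number of top users whose ratings reach 80% of the total.
n_needed = int(np.searchsorted(cum_share, 0.80) + 1)
pct_users_for_80 = 100.0 * n_needed / sorted_counts.size

print(f"{pct_users_for_80:.1f}% of users contribute 80% of ratings")
```

Plotting `cum_share` against the user rank (as a line over a bar chart of `sorted_counts`) gives the Pareto chart itself.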

### Question 4: Is there a relationship between the number of ratings and the average rating?

Do popular products tend to be rated higher?

In [None]:
# TODO: Correlation analysis
# - Calculate correlation between popularity and rating
# - Scatter plot with trend line
# - Statistical significance test

# Use NumPy to calculate Pearson correlation:
# r = cov(X,Y) / (std(X) * std(Y))
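
The Pearson formula above can be sketched with NumPy alone. Since SciPy is outside the stated toolset, significance is approximated here with a permutation test, which is a substitute technique chosen for this sketch rather than something the notebook prescribes. The data is synthetic.

```python
import numpy as np

def pearson_r(x, y):
    """r = cov(X, Y) / (std(X) * std(Y)), all with ddof=0."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (np.std(x) * np.std(y))

# Synthetic per-product popularity and average rating.
n_ratings = np.array([500, 3, 40, 1, 120, 15], dtype=float)
avg_rating = np.array([4.2, 5.0, 4.6, 1.0, 4.4, 3.8])

r = pearson_r(n_ratings, avg_rating)

# Permutation test: shuffle one variable to simulate "no relationship".
rng = np.random.default_rng(0)
perm_r = np.array(
    [pearson_r(n_ratings, rng.permutation(avg_rating)) for _ in range(2000)]
)
p_value = np.mean(np.abs(perm_r) >= abs(r))

print(f"r = {r:.3f}, permutation p-value = {p_value:.3f}")
```

Because rating counts are heavily skewed, computing the correlation on `np.log10(n_ratings)` as well may give a more informative picture.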

### Question 5: How severe is the cold start problem?

Analyze how many users and products have few interactions

In [None]:
# TODO: Analyze cold start problem
# - % of users with <= 5 ratings
# - % of products with <= 5 ratings
# - Distribution of ratings per user/product

# Impact on recommendation system:
# - Hard to make recommendations for new users/products
# - Need different strategies (content-based, popularity-based)

### Question 6: User Segmentation

Can users be divided into groups based on their rating behavior?

## 12. User-Item Matrix Visualization

Visualize a sample of the user-item matrix

In [None]:
# TODO: Create and visualize sample user-item matrix
# - Select 50 most active users
# - Select 50 most popular products
# - Create 50x50 matrix
# - Visualize with heatmap

# This helps visualize sparsity and patterns
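
One way to build the sample matrix is sketched below with `k = 2` on synthetic data; the notebook's target is `k = 50`. The helper `top_k` is hypothetical, and 0 is used to mean "no rating".

```python
import numpy as np

def top_k(ids, k):
    """Return the k most frequent IDs, sorted alphabetically."""
    values, counts = np.unique(ids, return_counts=True)
    keep = np.argsort(counts)[::-1][:k]
    return np.sort(values[keep])

# Synthetic ratings.
user_ids = np.array(["U1", "U1", "U2", "U2", "U3", "U1"])
product_ids = np.array(["P1", "P2", "P1", "P3", "P1", "P3"])
ratings = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 5.0])

k = 2
top_users = top_k(user_ids, k)
top_products = top_k(product_ids, k)

# Keep only ratings where both the user and the product are in the top k.
mask = np.isin(user_ids, top_users) & np.isin(product_ids, top_products)
rows = np.searchsorted(top_users, user_ids[mask])
cols = np.searchsorted(top_products, product_ids[mask])

matrix = np.zeros((top_users.size, top_products.size))
matrix[rows, cols] = ratings[mask]   # a repeated pair would overwrite
print(matrix)
```

The result can be passed to `sns.heatmap(matrix, xticklabels=top_products, yticklabels=top_users)`.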

## 13. Key Insights & Summary

Summary of the key findings

### Key Findings:

**About the Dataset:**
- TODO: Dataset size, number of users, number of products
- TODO: Degree of data sparsity
- TODO: Data quality (missing values, outliers)

**About User Behavior:**
- TODO: Distribution of user activity
- TODO: Rating patterns (harsh/generous raters)
- TODO: Cold start users

**About Products:**
- TODO: Popular products
- TODO: Product quality distribution
- TODO: Cold start products

**About Ratings:**
- TODO: Rating distribution and bias
- TODO: Temporal trends
- TODO: Correlations

**Implications for the Recommendation System:**
- TODO: Challenges to address
- TODO: Suitable strategies (collaborative filtering, content-based, hybrid)
- TODO: Recommended preprocessing steps

In [None]:
# TODO: Segment users by multiple criteria (Question 6: User Segmentation)
# 1. Activity level:
#    - Light users: 1-5 ratings
#    - Medium users: 6-20 ratings  
#    - Heavy users: >20 ratings
#
# 2. Rating behavior:
#    - Harsh critics: mean rating < 3.0
#    - Selective buyers: mean rating 3.0-4.0
#    - Enthusiasts: mean rating > 4.0
#
# 3. Rating variance:
#    - Consistent: std < 0.5
#    - Moderate: std 0.5-1.5
#    - Diverse opinions: std > 1.5

# Business Application:
# - Personalized recommendation strategies per segment
# - Targeted marketing campaigns
# - Identify brand advocates (enthusiasts with high activity)

# Visualize:
# - 2D scatter plot: Activity (x-axis) vs Mean Rating (y-axis)
# - Color by variance
# - Size by total ratings
# - Annotate segment boundaries
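
The activity and rating-behavior segments can be assigned with `np.digitize` and `np.where`. This sketch uses synthetic per-user statistics and the thresholds from the comments above.

```python
import numpy as np

# Synthetic per-user statistics.
n_ratings = np.array([1, 5, 6, 20, 21, 150])
mean_rating = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 4.8])

# Activity: 1-5 -> light, 6-20 -> medium, >20 -> heavy.
activity_labels = np.array(["light", "medium", "heavy"])
activity = activity_labels[np.digitize(n_ratings, bins=[6, 21])]

# Behavior: <3.0 harsh, 3.0-4.0 selective, >4.0 enthusiast.
behavior = np.where(
    mean_rating < 3.0,
    "harsh",
    np.where(mean_rating <= 4.0, "selective", "enthusiast"),
)

# Brand advocates: enthusiasts with high activity.
advocates = np.flatnonzero((activity == "heavy") & (behavior == "enthusiast"))
print(activity, behavior, advocates)
```

The variance segments follow the same pattern with `bins=[0.5, 1.5]` applied to the per-user standard deviation.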

### Question 7: Product Lifecycle - "Rising Stars" and "Falling Stars"

Detect products that are trending up and products that are declining

In [None]:
# TODO: Analyze the product lifecycle if timestamps are available
# 1. Split the timeline into periods (e.g., quarterly)
# 2. Compute the number of ratings and the average rating per period for each product
# 3. Compute the growth rate:
#    growth_rate = (recent_ratings - old_ratings) / old_ratings
#    (undefined when old_ratings = 0; handle newly launched products separately)
#
# Rising Stars:
# - Positive growth rate
# - Improving average rating
# - Increasing review frequency
#
# Falling Stars:
# - Negative growth rate
# - Declining ratings
# - Decreasing review frequency
#
# Stable Products:
# - Consistent rating volume
# - Stable quality

# Business Application:
# - Inventory management: Stock up on rising stars
# - Promotional strategy: Boost falling stars or discontinue
# - Recommendation priority: Feature trending products
# - New product launch insights

# Visualize:
# - Line chart: Rating count over time for top rising/falling products
# - Scatter: Growth rate vs Current avg rating
# - Heatmap: Product performance matrix
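
A sketch of the growth-rate step on synthetic per-product counts. Products with no ratings in the old period are left as NaN instead of dividing by zero, which is one possible convention.

```python
import numpy as np

# Synthetic rating counts per product in an old and a recent period.
old_ratings = np.array([10, 40, 0, 25])
recent_ratings = np.array([30, 20, 5, 25])

# Divide only where old_ratings > 0; other entries stay NaN.
growth_rate = np.full(old_ratings.shape, np.nan)
np.divide(
    recent_ratings - old_ratings,
    old_ratings,
    out=growth_rate,
    where=old_ratings > 0,
)

rising = np.flatnonzero(growth_rate > 0)
falling = np.flatnonzero(growth_rate < 0)
new_products = np.flatnonzero(np.isnan(growth_rate))
print(growth_rate, rising, falling, new_products)
```

Comparisons against NaN evaluate to False, so new products fall into neither the rising nor the falling group.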

### Question 8: Cross-Product Purchase Patterns

Are there groups of products that are frequently bought together?

In [None]:
# TODO: Analyze co-occurrence patterns
# 1. Build a user-product matrix (binary: rated or not)
# 2. Calculate product co-occurrence matrix:
#    Co-occurrence(i,j) = number of users who rated both product i and j
# 3. Normalize by product popularity
# 4. Calculate lift score:
#    lift(i,j) = P(i,j) / (P(i) * P(j))
#    Where P(i,j) = probability both rated together
#          P(i), P(j) = individual probabilities

# Find product pairs with:
# - High co-occurrence count
# - High lift score (> 1 means positive association)

# Business Application:
# - Bundle recommendations: "Customers who bought X also bought Y"
# - Cross-selling opportunities
# - Product placement in store/website
# - Combo deals and promotions

# Visualize:
# - Network graph: Products as nodes, co-occurrence as edges
# - Heatmap: Top 30 products co-occurrence matrix
# - Bar chart: Top product pairs by lift score

# Example insights:
# - Shampoo + Conditioner (expected)
# - Lipstick + Eye shadow (complementary)
# - Face cream + Serum (product line)
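
The co-occurrence and lift steps reduce to a matrix product on the binary user-product matrix. The matrix below is synthetic and tiny; on the full dataset a dense matrix would likely be too large, so restricting to the most popular products first is assumed.

```python
import numpy as np

# Rows = users, columns = products; 1 means the user rated the product.
B = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
n_users = B.shape[0]

# co[i, j] = number of users who rated both product i and product j.
co = B.T @ B

p_item = np.diag(co) / n_users           # P(i)
p_pair = co / n_users                    # P(i, j)
lift = p_pair / np.outer(p_item, p_item)

# Ignore the diagonal when looking for the strongest pair.
off_diag = lift.copy()
np.fill_diagonal(off_diag, 0.0)
best_pair = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(co, np.round(lift, 3), best_pair, sep="\n")
```

A lift above 1 for a pair means the two products are rated together more often than independence would predict.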

### Question 9: Rating Reliability - Which products have trustworthy ratings?

Assess the reliability of ratings based on their quantity and consistency

In [None]:
# TODO: Calculate rating reliability score
# Reliability factors:
# 1. Sample size: More ratings = more reliable
# 2. Rating consistency: Low variance = more reliable
# 3. Recency: Recent ratings more relevant (if timestamp available)
# 4. Reviewer diversity: More unique users = less bias

# Calculate Confidence Score:
# confidence = (n_ratings / (n_ratings + k)) * consistency_factor
# Where:
# - k = threshold constant (e.g., 10)
# - consistency_factor = 1 / (1 + std_rating)

# Identify categories:
# - High confidence products: Many ratings, low variance
# - Controversial products: Many ratings, high variance
# - Uncertain products: Few ratings (need more data)

# Calculate Wilson Score Confidence Interval (advanced):
# For binary outcomes (positive/negative), its lower bound is a conservative estimate of the true share of positive ratings

# Business Application:
# - Quality control: Flag products with low confidence for review
# - Recommendation confidence: Show reliability indicators to users
# - Inventory decisions: Prioritize high-confidence high-rated products
# - A/B testing: Focus improvement efforts on uncertain products

# Visualize:
# - Scatter plot: Number of ratings vs Rating variance
# - Color by average rating
# - Size by confidence score
# - Quadrant analysis (x-axis: number of ratings, y-axis: rating variance):
#   * Bottom-right: Popular & Reliable (safe bets)
#   * Top-right: Popular & Controversial (investigate)
#   * Bottom-left: Niche & Reliable (hidden gems)
#   * Top-left: Uncertain (need data)
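
Both scores can be sketched in a few lines. The confidence score follows the formula in the comments with `k = 10`. For the Wilson bound, treating a rating of 4 or 5 as "positive" is an assumption made for this sketch, and `z = 1.96` corresponds to a 95% interval. All inputs are synthetic.

```python
import numpy as np

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    positive = np.asarray(positive, dtype=float)
    total = np.asarray(total, dtype=float)   # must be > 0
    p_hat = positive / total
    centre = p_hat + z ** 2 / (2 * total)
    margin = z * np.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * total)) / total)
    return (centre - margin) / (1 + z ** 2 / total)

# Synthetic per-product statistics.
n_ratings = np.array([200, 200, 3])
std_rating = np.array([0.4, 1.8, 0.0])
n_positive = np.array([180, 110, 3])      # ratings >= 4 (assumed definition)

k = 10
consistency_factor = 1.0 / (1.0 + std_rating)
confidence = (n_ratings / (n_ratings + k)) * consistency_factor

lower = wilson_lower_bound(n_positive, n_ratings)
print(np.round(confidence, 3), np.round(lower, 3))
```

The third product has only perfect ratings, yet both scores rank it below the first one because three ratings carry little evidence.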

### Question 10: Market Basket Analysis - Loyalty Patterns

Which product groups are users loyal to? Where is the churn risk?

In [None]:
# TODO: Analyze loyalty patterns
# 1. User loyalty metrics:
#    - Product diversity: Unique products / Total ratings
#    - Average rating trend: Early ratings vs Recent ratings
#    - Rating frequency: Time between ratings (if timestamp available)
#
# 2. Identify user types:
#    Loyal fans:
#    - High ratings consistently (>4.0)
#    - Low product diversity (stick to favorites)
#    - Regular activity
#
#    Explorers:
#    - High product diversity
#    - Variable ratings
#    - Frequent activity
#
#    Churned/At-risk:
#    - Declining average ratings over time
#    - Increasing time gaps between ratings (a proxy for purchases)
#    - Recent low ratings
#
#    One-time buyers:
#    - Single rating
#    - No return

# Calculate churn indicators:
# - Time since last rating (recency)
# - Negative rating trend
# - Comparison with historical behavior

# Business Application:
# - Retention campaigns: Target at-risk users
# - Win-back campaigns: Re-engage churned users
# - Loyalty rewards: Incentivize loyal fans
# - Product recommendations: 
#   * Loyal fans → Similar products in same category
#   * Explorers → Diverse recommendations
# - Customer lifetime value prediction

# Visualize:
# - Sankey diagram: User journey through product categories
# - Cohort analysis: Rating behavior over user lifecycle
# - RFM analysis adapted:
#   * Recency: Time since last rating
#   * Frequency: Number of ratings
#   * Monetary (proxy): Average rating (satisfaction level)
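
The adapted RFM metrics can be computed per user as sketched below. The data is synthetic, timestamps are assumed to be Unix seconds, and recency is measured relative to the latest timestamp in the data.

```python
import numpy as np

# Synthetic ratings; timestamps are offsets in days from an arbitrary base.
base = 1_400_000_000
user_ids = np.array(["U1", "U2", "U1", "U3"])
timestamps = base + 86400 * np.array([0, 10, 30, 5], dtype=np.int64)
ratings = np.array([5.0, 3.0, 4.0, 2.0])

users, inverse = np.unique(user_ids, return_inverse=True)

# Recency: days between each user's last rating and the newest rating overall.
last_ts = np.full(users.size, timestamps.min(), dtype=np.int64)
np.maximum.at(last_ts, inverse, timestamps)
recency_days = (timestamps.max() - last_ts) / 86400

# Frequency: number of ratings. Monetary proxy: average rating.
frequency = np.bincount(inverse)
satisfaction = np.bincount(inverse, weights=ratings) / frequency

for u, r, f, s in zip(users, recency_days, frequency, satisfaction):
    print(f"{u}: recency={r:.0f}d frequency={f} satisfaction={s:.2f}")
```

Users with high recency values and low satisfaction would be natural candidates for the at-risk group described above.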

### Question 11: Price-Quality Perception (if price data is available)

Is there a relationship between ratings and product price tiers?

In [None]:
# TODO: Price-Quality analysis (if price data available in product metadata)
# Note: If price data is not available, a proxy can be inferred from rating patterns
#
# 1. Categorize products by price tier (if available):
#    - Budget: < 25th percentile
#    - Mid-range: 25th - 75th percentile
#    - Premium: > 75th percentile
#
# 2. Alternative: Infer "perceived value" from ratings:
#    High-value products: High ratings + Many reviews
#    Low-value products: Low ratings despite popularity
#
# 3. Analyze:
#    - Average rating per price tier
#    - Rating variance per price tier
#    - Customer expectations: Do expensive products need higher ratings?
#    - Value for money: High rating + Lower price tier

# Calculate "bang for buck" score:
# value_score = avg_rating / (price_tier_normalized + epsilon)

# Identify:
# - Overperformers: Budget/Mid-range with premium-level ratings
# - Underperformers: Premium with mediocre ratings
# - Sweet spots: Best value for money

# Business Application:
# - Pricing strategy: Adjust prices based on perceived value
# - Marketing positioning: Highlight value products
# - Premium justification: Ensure premium products deliver quality
# - Recommendation diversity: Mix price tiers in recommendations
# - Competitive analysis: Compare similar products across tiers

# Visualize:
# - Box plots: Rating distribution per price tier
# - Scatter plot: Price vs Rating (if data available)
# - Bar chart: Value score leaders
# - Heatmap: Price tier × Rating category matrix

### Question 12: Rating Velocity - Momentum Analysis

Which products are suddenly receiving many ratings? (Viral products)

In [None]:
# TODO: Analyze rating velocity and momentum (requires timestamp)
# 1. Calculate rating velocity:
#    velocity = ratings_in_recent_period / ratings_in_previous_period
#    
# 2. Calculate acceleration:
#    acceleration = change in velocity over time
#
# 3. Identify patterns:
#    Viral products:
#    - High velocity (many recent ratings)
#    - Positive acceleration (accelerating growth)
#    - Sudden spike in activity
#
#    Steady growers:
#    - Velocity consistently above 1
#    - Stable acceleration
#
#    Declining products:
#    - Velocity below 1 (fewer recent ratings than in the previous period)
#    - Negative acceleration
#
#    Seasonal products:
#    - Periodic velocity spikes
#    - Predictable patterns

# Calculate momentum score:
# momentum = (recent_ratings / avg_ratings_per_period) * (recent_avg_rating / overall_avg)

# Business Application:
# - Trend spotting: Catch viral products early
# - Inventory management: Stock up on high-momentum products
# - Marketing timing: Promote products at peak momentum
# - Recommendation freshness: Feature trending products
# - Competitive intelligence: Track momentum vs competitors
# - Product launch success: Monitor new product momentum

# Visualize:
# - Time series: Rating count over time for viral products
# - Velocity chart: Rate of change visualization
# - Momentum heatmap: Products × Time periods
# - Acceleration scatter: Current velocity vs acceleration
# - Top movers dashboard: Biggest gainers/losers

# Real-world examples:
# - Holiday season spikes (e.g., gift sets)
# - Influencer effects (sudden popularity after review)
# - Seasonal trends (sunscreen in summer)
# - Competitor product failures (switchers)
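
Velocity, acceleration, and momentum can be sketched from a products-by-periods count matrix. All numbers are synthetic, and clamping the denominator to at least 1 is an assumption made here to avoid division by zero.

```python
import numpy as np

# Synthetic rating counts: rows = products, columns = consecutive periods.
counts = np.array([
    [10, 12, 11, 40],   # sudden spike
    [20, 20, 21, 19],   # steady
    [30, 20, 10, 5],    # declining
], dtype=float)
recent_avg_rating = np.array([4.6, 4.1, 3.2])
overall_avg_rating = np.array([4.3, 4.1, 3.9])

# velocity = ratings in the recent period / ratings in the previous period.
velocity = counts[:, -1] / np.maximum(counts[:, -2], 1)
previous_velocity = counts[:, -2] / np.maximum(counts[:, -3], 1)
acceleration = velocity - previous_velocity

# momentum combines relative volume with relative rating quality.
avg_per_period = counts.mean(axis=1)
momentum = (counts[:, -1] / avg_per_period) * (
    recent_avg_rating / overall_avg_rating
)

print(np.round(velocity, 2), np.round(acceleration, 2), np.round(momentum, 2))
```

Because velocity is a ratio of counts, growth shows up as values above 1 and decline as values below 1.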

### Question 13: User Influence Score - Who are the Key Opinion Leaders?

Identify highly influential users (those whose ratings predict the behavior of others)

In [None]:
# TODO: Calculate user influence scores
# Influence factors:
# 1. Activity level:
#    - Number of ratings
#    - Diversity of products rated
#
# 2. Early adopter behavior:
#    - Among first to rate new products (if timestamp)
#    - Ratings on niche/less popular items
#
# 3. Rating impact:
#    - Agreement with community: How often user rating aligns with final consensus
#    - Predictive power: User rates high → product becomes popular
#    - Contrarian accuracy: User finds hidden gems others missed
#
# 4. Expertise indicators:
#    - Detailed ratings (if text reviews available)
#    - Consistent rating behavior
#    - Coverage of product categories

# Calculate influence score:
# influence = w1 * activity_score + 
#             w2 * early_adopter_score + 
#             w3 * prediction_accuracy_score +
#             w4 * diversity_score

# For prediction accuracy:
# - Compare user's early rating vs product's eventual average
# - Reward users whose ratings predict future popularity
# - Calculate correlation between user rating and product success

# Identify user tiers:
# - Influencers: High influence, high activity, early adopters
# - Experts: High accuracy, niche focus, consistent ratings
# - Casual users: Low influence, sporadic activity
# - Followers: Late adopters, rate popular items

# Business Application:
# - Influencer partnerships: Engage high-influence users
# - Beta testing: Invite influencers to try new products
# - User-generated content: Encourage reviews from experts
# - Weighted recommendations: Give more weight to influencer ratings
# - Community building: Create expert/influencer badges
# - Product seeding: Send samples to key opinion leaders
# - Credibility indicators: Show "Expert rated 4.5★" separately

# Visualize:
# - Influence distribution: Histogram of influence scores
# - Scatter plot: Activity vs Influence (are they correlated?)
# - Network graph: User influence relationships
# - Leaderboard: Top 50 most influential users
# - Influence over time: Track how influence changes

### Question 14: Recommendation Diversity vs Accuracy Trade-off

Balance recommending similar products against exploring new ones

In [None]:
# TODO: Analyze diversity patterns in user behavior
# 1. Calculate user exploration behavior:
#    Diversity Index = Entropy of product categories rated
#    entropy = -Σ(p_i * log(p_i))
#    where p_i = proportion of ratings in category i
#
# 2. Similarity within user's rated products:
#    - Calculate avg pairwise similarity of products user rated
#    - Use rating patterns of all users as feature space
#    - High similarity = user likes similar products (easy to recommend)
#    - Low similarity = diverse taste (hard but interesting)
#
# 3. Exploration vs Exploitation:
#    Exploitation: Rating products similar to past high-rated items
#    Exploration: Rating diverse/different products
#
# 4. Calculate per user:
#    - % of ratings on popular products (>100 reviews)
#    - % of ratings on niche products (<10 reviews)
#    - Sequential pattern: Do they alternate or cluster?

# Segment users by recommendation strategy:
# - Conservative users (high similarity, low diversity):
#   → Recommend similar popular items (safe bets)
#   → Collaborative filtering works well
#
# - Adventurous users (low similarity, high diversity):
#   → Recommend diverse items from multiple categories
#   → Content-based + serendipity important
#   → "Because you like variety" recommendations
#
# - Balanced users:
#   → Hybrid approach with some exploration

# Business Application:
# - Personalized recommendation strategy per user type
# - A/B testing: Test diversity levels in recommendations
# - User satisfaction: Match recommendation diversity to user preference
# - Product discovery: Help users find hidden gems
# - Filter bubble avoidance: Prevent over-specialization
# - Long-tail promotion: Use diversity-loving users to promote niche items

# Calculate system-level metrics:
# - Aggregate diversity: How many unique items recommended across all users?
# - Coverage: What % of catalog gets recommended?
# - Serendipity: How often do users rate high on unexpected recommendations?

# Visualize:
# - User scatter: Diversity Index vs Average Rating
# - Distribution: Histogram of user diversity scores
# - Category sunburst: Rating distribution across categories per user type
# - Trade-off curve: Recommendation accuracy vs diversity
# - Temporal: Does diversity change over user lifecycle?
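
The entropy-based diversity index can be sketched as follows. The ratings file described above lists only UserId, ProductId, Rating, and Timestamp, so the per-category counts here are hypothetical and would have to come from separate product metadata.

```python
import numpy as np

def diversity_index(category_counts):
    """Shannon entropy -sum(p_i * log(p_i)) over categories with ratings."""
    counts = np.asarray(category_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()   # skip empty categories: 0*log(0) -> 0
    return float(-np.sum(p * np.log(p)))

# Hypothetical number of ratings per category for three users.
focused = diversity_index([10, 0, 0, 0])    # everything in one category
varied = diversity_index([3, 3, 2, 2])
uniform = diversity_index([5, 5, 5, 5])     # maximum for 4 categories

print(f"focused={focused:.3f} varied={varied:.3f} uniform={uniform:.3f}")
```

Dividing by `np.log(n_categories)` would rescale the index to the range 0 to 1, which makes users comparable when the number of categories changes.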