# Data Exploration - Amazon Beauty Products Recommendation System

**Problem:** Build a beauty product recommendation system based on users' rating history

**Dataset:** Amazon Ratings - Beauty Products

## Objectives:
- Explore and understand the structure of the ratings data
- Analyze user behavior
- Analyze product characteristics
- Analyze rating distributions and patterns
- Pose questions and answer them with data
- Surface insights for the recommendation system

## 1. Import Libraries

**Requirement:** Use ONLY NumPy for data processing, and Matplotlib and Seaborn for visualization

In [None]:
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.set_printoptions(precision=4, suppress=True)

print(f"NumPy version: {np.__version__}")

## 2. Load Data using NumPy

Load the CSV data using only NumPy (no Pandas)

In [None]:
# Load CSV data using NumPy
data_path = '../data/raw/ratings_Beauty.csv'

def load_csv_numpy(filepath, delimiter=',', skip_header=True):
    """
    Load CSV file using only NumPy
    Returns: data array, header, user_map, product_map
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # header
    header = lines[0].strip().split(delimiter) if skip_header else None
    
    # data lines
    data_lines = lines[1:] if skip_header else lines
    
    # Accumulate each column in its own list
    user_ids = []
    product_ids = []
    ratings = []
    timestamps = []
    
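    for line in data_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines so parts[1] does not raise IndexError
        parts = line.split(delimiter)
        user_ids.append(parts[0])
        product_ids.append(parts[1])
        ratings.append(float(parts[2]))
        # Non-numeric timestamps are stored as 0 and counted in the zero-value check later
        timestamps.append(int(parts[3]) if parts[3].isdigit() else 0)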
    
    # convert lists to NumPy arrays
    n_records = len(ratings)
    data = np.zeros((n_records, 4))
    
    # Map user_ids and product_ids to integer indices
    # (sorted so the index assignment is reproducible across runs)
    unique_users = sorted(set(user_ids))
    unique_products = sorted(set(product_ids))
    
    user_id_map = {uid: idx for idx, uid in enumerate(unique_users)}
    product_id_map = {pid: idx for idx, pid in enumerate(unique_products)}
    
    for i in range(n_records):
        data[i, 0] = user_id_map[user_ids[i]]
        data[i, 1] = product_id_map[product_ids[i]]
        data[i, 2] = ratings[i]
        data[i, 3] = timestamps[i]
    
    return data, header, user_id_map, product_id_map


print("Loading data...")
data, header, user_map, product_map = load_csv_numpy(data_path)

print(f"✓ Data loaded successfully!")
print(f"  Shape: {data.shape}")
print(f"  Columns: {header}")
print(f"  Unique users: {len(user_map):,}")
print(f"  Unique products: {len(product_map):,}")
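
The ID-to-index mapping built inside `load_csv_numpy` can also be sketched with `np.unique(..., return_inverse=True)`, which avoids the explicit dictionary and Python loop. This is a minimal illustration on toy data, not the loader used above; the array names `raw_user_ids` and `raw_product_ids` are hypothetical.

```python
import numpy as np

# Toy stand-ins for the raw string ID columns (hypothetical values).
raw_user_ids = np.array(["A3", "A1", "A3", "A2"])
raw_product_ids = np.array(["B9", "B9", "B7", "B8"])

# np.unique returns the sorted unique values and, with return_inverse=True,
# the position of each original entry within that sorted array.
unique_users, user_idx = np.unique(raw_user_ids, return_inverse=True)
unique_products, product_idx = np.unique(raw_product_ids, return_inverse=True)

print(user_idx)                # integer code for every row
print(unique_users[user_idx])  # indexing back recovers the original IDs
```

Because the unique values come back sorted, the integer codes are the same on every run, and `unique_users[i]` decodes index `i` without a reverse dictionary.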

## 3. Basic Data Information

Display basic information about the dataset

In [None]:
# Basic statistics
n_ratings = data.shape[0]
n_users = len(np.unique(data[:, 0]))  # Column 0: UserId
n_products = len(np.unique(data[:, 1]))  # Column 1: ProductId

print(f"Dataset Overview:")
print(f"  Total ratings: {n_ratings:,}")
print(f"  Unique users: {n_users:,}")
print(f"  Unique products: {n_products:,}")

# Sample records
print(f"\nSample Records (first 5):")
print(f"{'UserId':<12} {'ProductId':<12} {'Rating':<10} {'Timestamp'}")
print("-" * 50)
for i in range(min(5, n_ratings)):
    print(f"{data[i, 0]:<12.0f} {data[i, 1]:<12.0f} {data[i, 2]:<10.1f} {data[i, 3]:<12.0f}")

# Sparsity metrics
sparsity = (1 - n_ratings / (n_users * n_products)) * 100
print(f"\nSparsity Metrics:")
print(f"  Potential matrix size: {n_users:,} × {n_products:,}")
print(f"  Sparsity: {sparsity:.4f}%")
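
To make the sparsity formula concrete, here is a small worked sketch on a toy matrix. The helper name `sparsity_pct` is hypothetical and only restates the expression used in the cell above.

```python
def sparsity_pct(n_ratings, n_users, n_products):
    """Percentage of empty cells in the user x product matrix."""
    total_cells = n_users * n_products
    return (1.0 - n_ratings / total_cells) * 100.0

# Toy matrix: 4 users, 5 products, 6 observed ratings -> 14 of 20 cells are empty.
toy = sparsity_pct(n_ratings=6, n_users=4, n_products=5)
print(f"{toy:.1f}%")  # 70.0%
```

Passing plain Python integers keeps `n_users * n_products` exact even for very large matrices, whereas fixed-width NumPy integers could overflow.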

## 4. Descriptive Statistics

Compute descriptive statistics for the ratings using NumPy

In [None]:
# Extract ratings column
ratings = data[:, 2]

# Central tendency
mean_rating = np.mean(ratings)
median_rating = np.median(ratings)
mode_rating = np.bincount(ratings.astype(int)).argmax()

# Dispersion
std_rating = np.std(ratings)
var_rating = np.var(ratings)
min_rating = np.min(ratings)
max_rating = np.max(ratings)

# Quantiles
q25 = np.percentile(ratings, 25)
q50 = np.percentile(ratings, 50)
q75 = np.percentile(ratings, 75)
iqr = q75 - q25

# Shape metrics (Skewness and Kurtosis)
centered = ratings - mean_rating
skewness = np.mean(centered ** 3) / (std_rating ** 3)
kurtosis = np.mean(centered ** 4) / (std_rating ** 4) - 3

print("Descriptive Statistics for Ratings:")
print(f"\nCentral Tendency:")
print(f"  Mean:   {mean_rating:.4f}")
print(f"  Median: {median_rating:.4f}")
print(f"  Mode:   {mode_rating:.0f}")

print(f"\nDispersion:")
print(f"  Std Dev:  {std_rating:.4f}")
print(f"  Variance: {var_rating:.4f}")
print(f"  Range:    {max_rating - min_rating:.1f} (from {min_rating:.1f} to {max_rating:.1f})")
print(f"  IQR:      {iqr:.4f}")

print(f"\nQuantiles:")
print(f"  Q1 (25%):  {q25:.2f}")
print(f"  Q2 (50%):  {q50:.2f}")
print(f"  Q3 (75%):  {q75:.2f}")

print(f"\nDistribution Shape:")
print(f"  Skewness: {skewness:.4f}", end="")
if skewness < -0.5:
    print(" (left-skewed)")
elif skewness > 0.5:
    print(" (right-skewed)")
else:
    print(" (symmetric)")

print(f"  Kurtosis: {kurtosis:.4f}", end="")
if kurtosis > 0:
    print(" (heavy tails)")
elif kurtosis < 0:
    print(" (light tails)")
else:
    print(" (normal-like)")
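
The skewness and kurtosis above are moment-based (population) estimates. The sketch below wraps the same formulas in a hypothetical helper, `shape_metrics`, and tries them on two toy samples to show how the sign of the skewness should be read.

```python
import numpy as np

def shape_metrics(x):
    """Moment-based skewness and excess kurtosis (population form)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    std = x.std()  # population std (ddof=0), the default of np.std
    skew = np.mean(centered ** 3) / std ** 3
    excess_kurt = np.mean(centered ** 4) / std ** 4 - 3.0
    return skew, excess_kurt

symmetric = np.array([1, 2, 3, 4, 5])     # evenly spread: skewness of zero
top_heavy = np.array([1, 4, 5, 5, 5, 5])  # mass at the high end, tail to the left

print(shape_metrics(symmetric))
print(shape_metrics(top_heavy))
```

A ratings column dominated by 5-star reviews behaves like `top_heavy`: the long tail points toward low values, so the skewness comes out negative (left-skewed).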

In [None]:
# Visualize rating distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1 = axes[0]
ax1.hist(ratings, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
ax1.axvline(mean_rating, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_rating:.2f}')
ax1.axvline(median_rating, color='green', linestyle='--', linewidth=2, label=f'Median: {median_rating:.2f}')
ax1.set_xlabel('Rating', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Ratings', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot
ax2 = axes[1]
box = ax2.boxplot(ratings, vert=True, patch_artist=True, widths=0.5)
box['boxes'][0].set_facecolor('lightblue')
ax2.set_ylabel('Rating', fontsize=12)
ax2.set_title('Box Plot of Ratings', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Add statistics text
stats_text = f"Mean: {mean_rating:.2f}\nMedian: {median_rating:.2f}\nStd: {std_rating:.2f}\nSkew: {skewness:.2f}"
ax2.text(1.3, max_rating * 0.5, stats_text, fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

## 5. Missing Values Analysis

Check for missing values in the dataset

In [None]:

# NaN values
nan_counts = np.sum(np.isnan(data), axis=0)
total_records = data.shape[0]

column_names = ['UserId', 'ProductId', 'Rating', 'Timestamp']

print(f"\nMissing Values Check:")
print(f"{'Column':<15} {'Missing Count':<15} {'Percentage':<15} {'Status'}")
print("-" * 60)

has_missing = False
for i, col_name in enumerate(column_names):
    missing_count = int(nan_counts[i])
    missing_pct = (missing_count / total_records) * 100
    status = "Clean" if missing_count == 0 else "Has Missing"
    print(f"{col_name:<15} {missing_count:<15,} {missing_pct:<15.2f} {status}")
    if missing_count > 0:
        has_missing = True

if not has_missing:
    print("No missing values detected")
else:
    print("Missing values detected")
# zero values
print(f"\nZero Values Check:")
zero_ratings = np.sum(data[:, 2] == 0)
zero_timestamps = np.sum(data[:, 3] == 0)

print(f" Zero ratings: {zero_ratings:,} ({(zero_ratings/total_records)*100:.2f}%)")
print(f" Zero timestamps: {zero_timestamps:,} ({(zero_timestamps/total_records)*100:.2f}%)")

if zero_ratings > 0:
    print("   Warning: Some ratings are 0 (invalid for 1-5 scale)")

# Visualize missing data
if has_missing or zero_ratings > 0 or zero_timestamps > 0:
    fig, ax = plt.subplots(figsize=(10, 6))
    
    issues = {
        'UserId\nNaN': nan_counts[0],
        'ProductId\nNaN': nan_counts[1],
        'Rating\nNaN': nan_counts[2],
        'Timestamp\nNaN': nan_counts[3],
        'Rating\nZero': zero_ratings,
        'Timestamp\nZero': zero_timestamps
    }
    
    x_pos = np.arange(len(issues))
    counts = list(issues.values())
    colors = ['red' if c > 0 else 'green' for c in counts]
    
    bars = ax.bar(x_pos, counts, color=colors, alpha=0.7, edgecolor='black')
    ax.set_xlabel('Column / Issue Type', fontsize=12)
    ax.set_ylabel('Count', fontsize=12)
    ax.set_title('Data Quality Issues Overview', fontsize=14, fontweight='bold')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(issues.keys(), rotation=45, ha='right')
    ax.grid(True, alpha=0.3, axis='y')
    
    # add value labels
    for i, (bar, count) in enumerate(zip(bars, counts)):
        if count > 0:
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                   f'{count:,}', ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    plt.tight_layout()
    plt.show()

print("\nCompleted")
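
Once invalid entries are found, one possible follow-up is to keep only rows whose rating lies on the 1-5 scale. The sketch below shows a boolean-mask filter on toy values; whether to drop or repair such rows is a design decision this notebook does not prescribe.

```python
import numpy as np

# Toy ratings column containing a zero, a NaN, and an out-of-range value.
toy_ratings = np.array([5.0, 0.0, 3.0, np.nan, 7.0, 1.0])

# Comparisons with NaN evaluate to False, so NaN rows fail the range test too.
valid = (toy_ratings >= 1) & (toy_ratings <= 5)
clean = toy_ratings[valid]

print(f"kept {valid.sum()} of {len(toy_ratings)} rows")
print(clean)
```

The same mask can index the full array (for example `data[valid]`) so that all four columns stay aligned.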

## 6. Rating Distribution Analysis

### 6.1. Rating Frequency Analysis

In [None]:
# Count frequency of each rating value
rating_values, rating_counts = np.unique(ratings, return_counts=True)
total_ratings = len(ratings)

# Calculate percentages
rating_percentages = (rating_counts / total_ratings) * 100

# Calculate statistics
mean_rating = np.mean(ratings)
median_rating = np.median(ratings)
mode_idx = np.argmax(rating_counts)
mode_rating = rating_values[mode_idx]

print("Rating Frequency Statistics:")
print(f"  Mean Rating: {mean_rating:.4f}")
print(f"  Median Rating: {median_rating:.1f}")
print(f"  Mode Rating: {mode_rating:.0f} (most frequent)")
print(f"  Std Dev: {np.std(ratings):.4f}")

print(f"\nRating Distribution:")
print(f"{'Rating':<10} {'Count':<15} {'Percentage'}")
print("-" * 40)
for val, count, pct in zip(rating_values, rating_counts, rating_percentages):
    print(f"{val:<10.1f} {count:<15,} {pct:.2f}%")

### 6.2. Rating Sentiment Analysis

In [None]:
# Analyze rating bias
positive_ratings = np.sum(ratings >= 4)  # 4 and 5 stars
negative_ratings = np.sum(ratings <= 2)  # 1 and 2 stars
neutral_ratings = np.sum(ratings == 3)   # 3 stars

positive_pct = (positive_ratings / total_ratings) * 100
negative_pct = (negative_ratings / total_ratings) * 100
neutral_pct = (neutral_ratings / total_ratings) * 100

print("Rating Sentiment Breakdown:")
print(f"  Positive (4-5 stars): {positive_ratings:,} ({positive_pct:.2f}%)")
print(f"  Neutral (3 stars): {neutral_ratings:,} ({neutral_pct:.2f}%)")
print(f"  Negative (1-2 stars): {negative_ratings:,} ({negative_pct:.2f}%)")

if positive_pct > 60:
    print("\n  ⚠ Positive bias detected - users tend to give high ratings")
elif negative_pct > 40:
    print("\n  ⚠ Negative bias detected - users tend to give low ratings")
else:
    print("\n  ✓ Ratings are relatively balanced")

### 6.3. Rating Distribution Visualizations

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Rating Distribution Analysis', fontsize=16, fontweight='bold')

# Bar chart - Rating frequency
ax1 = axes[0, 0]
colors_bars = ['#d62728', '#ff7f0e', '#ffdd57', '#98df8a', '#2ca02c']
bars = ax1.bar(rating_values, rating_counts, color=colors_bars, edgecolor='black', alpha=0.8, width=0.6)
ax1.set_xlabel('Rating Value', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.set_title('Rating Frequency Distribution', fontsize=13, fontweight='bold')
ax1.set_xticks(rating_values)
ax1.grid(True, alpha=0.3, axis='y')

# Add percentage labels
for bar, count, pct in zip(bars, rating_counts, rating_percentages):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(count):,}\n({pct:.1f}%)',
            ha='center', va='bottom', fontsize=9, fontweight='bold')

# Pie chart - Rating proportions
ax2 = axes[0, 1]
explode = [0.05 if pct == max(rating_percentages) else 0 for pct in rating_percentages]
ax2.pie(rating_counts, labels=[f'{int(r)}★' for r in rating_values],
        autopct='%1.1f%%', colors=colors_bars, explode=explode,
        startangle=90, textprops={'fontsize': 11, 'weight': 'bold'})
ax2.set_title('Rating Proportions', fontsize=13, fontweight='bold')

# Histogram with density
ax3 = axes[1, 0]
ax3.hist(ratings, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
ax3.axvline(mean_rating, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_rating:.2f}')
ax3.axvline(median_rating, color='green', linestyle='--', linewidth=2, label=f'Median: {median_rating:.1f}')
ax3.set_xlabel('Rating Value', fontsize=12)
ax3.set_ylabel('Frequency', fontsize=12)
ax3.set_title('Rating Histogram', fontsize=13, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3)

# Sentiment categories bar chart
ax4 = axes[1, 1]
categories = ['Negative\n(1-2★)', 'Neutral\n(3★)', 'Positive\n(4-5★)']
category_counts = np.array([negative_ratings, neutral_ratings, positive_ratings])
category_colors = ['#e74c3c', '#f39c12', '#27ae60']

bars4 = ax4.bar(categories, category_counts, color=category_colors, edgecolor='black', alpha=0.8)
ax4.set_ylabel('Count', fontsize=12)
ax4.set_title('Rating Sentiment Distribution', fontsize=13, fontweight='bold')
ax4.grid(True, alpha=0.3, axis='y')

for bar, count in zip(bars4, category_counts):
    height = bar.get_height()
    pct = (count / total_ratings) * 100
    ax4.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(count):,}\n({pct:.1f}%)',
            ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

## 7. User Behavior Analysis

Analyze user behavior

### 7.1. User Activity Statistics

In [None]:
# Calculate ratings per user using NumPy
user_ids = data[:, 0]
unique_users, user_rating_counts = np.unique(user_ids, return_counts=True)

n_users = len(unique_users)
total_ratings = len(user_ids)

# Statistics
mean_ratings_per_user = np.mean(user_rating_counts)
median_ratings_per_user = np.median(user_rating_counts)
std_ratings_per_user = np.std(user_rating_counts)
min_ratings_per_user = np.min(user_rating_counts)
max_ratings_per_user = np.max(user_rating_counts)

print("User Activity Statistics:")
print(f"  Total unique users: {n_users:,}")
print(f"  Mean ratings per user: {mean_ratings_per_user:.2f}")
print(f"  Median ratings per user: {median_ratings_per_user:.1f}")
print(f"  Std deviation: {std_ratings_per_user:.2f}")
print(f"  Range: {min_ratings_per_user} to {max_ratings_per_user}")

# Top 10 most active users
top_10_indices = np.argsort(user_rating_counts)[-10:][::-1]
top_10_users = unique_users[top_10_indices]
top_10_counts = user_rating_counts[top_10_indices]

print(f"\nTop 10 Most Active Users:")
print(f"{'Rank':<6} {'User ID':<15} {'# Ratings':<15} {'% of Total'}")
print("-" * 50)
for rank, (user_id, count) in enumerate(zip(top_10_users, top_10_counts), 1):
    pct = (count / total_ratings) * 100
    print(f"{rank:<6} {int(user_id):<15} {count:<15,} {pct:.4f}%")

# Cold start analysis
users_with_le_5_ratings = np.sum(user_rating_counts <= 5)
users_with_le_10_ratings = np.sum(user_rating_counts <= 10)

pct_le_5 = (users_with_le_5_ratings / n_users) * 100
pct_le_10 = (users_with_le_10_ratings / n_users) * 100

print(f"\nCold Start Analysis:")
print(f"  Users with ≤ 5 ratings: {users_with_le_5_ratings:,} ({pct_le_5:.2f}%)")
print(f"  Users with ≤ 10 ratings: {users_with_le_10_ratings:,} ({pct_le_10:.2f}%)")

if pct_le_5 > 50:
    print("  ⚠ High cold start risk - many users have minimal interaction")
else:
    print("  ✓ Moderate cold start problem")

In [None]:
# User activity categories
very_active = np.sum(user_rating_counts > 50)
active = np.sum((user_rating_counts > 20) & (user_rating_counts <= 50))
moderate = np.sum((user_rating_counts > 10) & (user_rating_counts <= 20))
occasional = np.sum((user_rating_counts > 5) & (user_rating_counts <= 10))
rare = np.sum((user_rating_counts > 1) & (user_rating_counts <= 5))
one_time = np.sum(user_rating_counts == 1)

print("User Activity Categories:")
print(f"  Very Active (>50): {very_active:,} ({(very_active/n_users)*100:.2f}%)")
print(f"  Active (21-50): {active:,} ({(active/n_users)*100:.2f}%)")
print(f"  Moderate (11-20): {moderate:,} ({(moderate/n_users)*100:.2f}%)")
print(f"  Occasional (6-10): {occasional:,} ({(occasional/n_users)*100:.2f}%)")
print(f"  Rare (2-5): {rare:,} ({(rare/n_users)*100:.2f}%)")
print(f"  One-time (1): {one_time:,} ({(one_time/n_users)*100:.2f}%)")
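
The six activity categories can also be computed in one pass with `np.digitize` and `np.bincount` instead of six separate masks. This is a sketch on toy counts that assumes the same category boundaries as the cell above; `toy_counts` and `upper_bounds` are illustrative names.

```python
import numpy as np

toy_counts = np.array([1, 1, 3, 5, 6, 10, 11, 20, 21, 50, 51, 200])

# Inclusive upper bound of every category except the last, open-ended one.
upper_bounds = np.array([1, 5, 10, 20, 50])
labels = ["One-time (1)", "Rare (2-5)", "Occasional (6-10)",
          "Moderate (11-20)", "Active (21-50)", "Very Active (>50)"]

# right=True assigns a count c to bin i when upper_bounds[i-1] < c <= upper_bounds[i].
bin_idx = np.digitize(toy_counts, upper_bounds, right=True)
per_category = np.bincount(bin_idx, minlength=len(labels))

for name, n in zip(labels, per_category):
    print(f"{name:<20} {n}")
```

Keeping the boundaries in a single array makes it harder for the masks and the printed labels to drift apart when thresholds change.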

In [None]:
# Visualize user activity
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('User Activity Analysis', fontsize=16, fontweight='bold')

# Histogram - ratings per user
ax1 = axes[0, 0]
ax1.hist(user_rating_counts, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
ax1.axvline(mean_ratings_per_user, color='red', linestyle='--', linewidth=2, 
           label=f'Mean: {mean_ratings_per_user:.1f}')
ax1.axvline(median_ratings_per_user, color='green', linestyle='--', linewidth=2,
           label=f'Median: {median_ratings_per_user:.1f}')
ax1.set_xlabel('Number of Ratings', fontsize=12)
ax1.set_ylabel('Number of Users', fontsize=12)
ax1.set_title('Distribution of Ratings per User', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Log-scale histogram
ax2 = axes[0, 1]
ax2.hist(user_rating_counts, bins=np.logspace(0, np.log10(max_ratings_per_user+1), 50),
        color='coral', alpha=0.7, edgecolor='black')
ax2.set_xscale('log')
ax2.set_yscale('log')
ax2.set_xlabel('Number of Ratings (log)', fontsize=12)
ax2.set_ylabel('Number of Users (log)', fontsize=12)
ax2.set_title('User Activity (Log-Log Scale)', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, which='both')

# Top 10 most active users
ax3 = axes[1, 0]
x_pos = np.arange(len(top_10_counts))
bars = ax3.barh(x_pos, top_10_counts, color='purple', alpha=0.7, edgecolor='black')
ax3.set_yticks(x_pos)
ax3.set_yticklabels([f'User {int(uid)}' for uid in top_10_users], fontsize=9)
ax3.set_xlabel('Number of Ratings', fontsize=12)
ax3.set_title('Top 10 Most Active Users', fontsize=13, fontweight='bold')
ax3.invert_yaxis()
ax3.grid(True, alpha=0.3, axis='x')

for i, (bar, count) in enumerate(zip(bars, top_10_counts)):
    ax3.text(count + max(top_10_counts)*0.01, bar.get_y() + bar.get_height()/2,
            f'{count:,}', va='center', fontsize=9, fontweight='bold')

# Activity categories pie chart
ax4 = axes[1, 1]
category_names = ['Very Active\n(>50)', 'Active\n(21-50)', 'Moderate\n(11-20)', 
                  'Occasional\n(6-10)', 'Rare\n(2-5)', 'One-time\n(1)']
category_counts = np.array([very_active, active, moderate, occasional, rare, one_time])
colors4 = ['#2ecc71', '#3498db', '#f39c12', '#e67e22', '#e74c3c', '#95a5a6']

ax4.pie(category_counts, labels=category_names,
        autopct=lambda pct: f'{pct:.1f}%\n({int(round(pct/100*n_users)):,})',
        colors=colors4, startangle=90,
        textprops={'fontsize': 9, 'weight': 'bold'})
ax4.set_title('User Activity Categories', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

### 7.2. User Rating Behavior Analysis

In [None]:
# Calculate user rating statistics (vectorized)
unique_users_arr, inverse_indices = np.unique(user_ids, return_inverse=True)
n_users_total = len(unique_users_arr)

# Count ratings per user
user_n_ratings = np.bincount(inverse_indices)

# Calculate mean ratings per user
rating_sums = np.bincount(inverse_indices, weights=ratings)
user_mean_ratings = rating_sums / user_n_ratings

# Calculate std per user
rating_squared_sums = np.bincount(inverse_indices, weights=ratings**2)
mean_squared = rating_squared_sums / user_n_ratings
user_variance = mean_squared - (user_mean_ratings ** 2)
user_std_ratings = np.sqrt(np.maximum(user_variance, 0))

# Overall statistics
overall_mean = np.mean(user_mean_ratings)
overall_median = np.median(user_mean_ratings)

print("User Rating Behavior Statistics:")
print(f"  Mean of user avg ratings: {overall_mean:.4f}")
print(f"  Median of user avg ratings: {overall_median:.4f}")
print(f"  Min user average: {np.min(user_mean_ratings):.4f}")
print(f"  Max user average: {np.max(user_mean_ratings):.4f}")

# User rating tendencies
harsh_raters = np.sum(user_mean_ratings < 3.0)
generous_raters = np.sum(user_mean_ratings > 4.0)
balanced_raters = np.sum((user_mean_ratings >= 3.0) & (user_mean_ratings <= 4.0))

harsh_pct = (harsh_raters / n_users_total) * 100
generous_pct = (generous_raters / n_users_total) * 100
balanced_pct = (balanced_raters / n_users_total) * 100

print(f"\nUser Rating Tendencies:")
print(f"  Harsh (<3.0): {harsh_raters:,} ({harsh_pct:.2f}%)")
print(f"  Balanced (3.0-4.0): {balanced_raters:,} ({balanced_pct:.2f}%)")
print(f"  Generous (>4.0): {generous_raters:,} ({generous_pct:.2f}%)")

# Rating consistency
users_with_multiple = user_n_ratings > 1
filtered_std = user_std_ratings[users_with_multiple]
n_users_multiple = np.sum(users_with_multiple)

consistent_users = np.sum(filtered_std < 0.5)
moderate_users = np.sum((filtered_std >= 0.5) & (filtered_std < 1.0))
diverse_users = np.sum(filtered_std >= 1.0)

print(f"\nUser Rating Consistency (users with >1 rating):")
print(f"  Consistent (std<0.5): {consistent_users:,} ({(consistent_users/n_users_multiple)*100:.2f}%)")
print(f"  Moderate (0.5-1.0): {moderate_users:,} ({(moderate_users/n_users_multiple)*100:.2f}%)")
print(f"  Diverse (std≥1.0): {diverse_users:,} ({(diverse_users/n_users_multiple)*100:.2f}%)")
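
The per-user standard deviation above relies on the identity Var(x) = E[x²] − (E[x])², evaluated per group with `np.bincount`. The following sketch checks that shortcut against a plain per-group `np.std` on toy data; the arrays `group` and `values` are hypothetical.

```python
import numpy as np

group = np.array([0, 0, 1, 1, 1, 2])  # e.g. encoded user index of each rating
values = np.array([5.0, 3.0, 4.0, 4.0, 1.0, 2.0])

n = np.bincount(group)
mean = np.bincount(group, weights=values) / n
var = np.bincount(group, weights=values ** 2) / n - mean ** 2
std = np.sqrt(np.maximum(var, 0))  # clip tiny negatives caused by rounding

# Reference: explicit loop calling np.std on each group.
reference = np.array([values[group == g].std() for g in range(len(n))])

print(std)
print(np.allclose(std, reference))
```

The `np.maximum(var, 0)` clip matters because subtracting two nearly equal floating-point numbers can leave a slightly negative variance for users whose ratings are all identical.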

In [None]:
# Visualize user rating behavior
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('User Rating Behavior Analysis', fontsize=16, fontweight='bold')

# Distribution of user mean ratings
ax1 = axes[0, 0]
ax1.hist(user_mean_ratings, bins=50, color='#3498db', alpha=0.7, edgecolor='black')
ax1.axvline(overall_mean, color='red', linestyle='--', linewidth=2, label=f'Mean: {overall_mean:.2f}')
ax1.axvline(3.0, color='orange', linestyle=':', linewidth=2, alpha=0.7, label='Harsh threshold')
ax1.axvline(4.0, color='purple', linestyle=':', linewidth=2, alpha=0.7, label='Generous threshold')
ax1.set_xlabel('Average Rating per User', fontsize=11)
ax1.set_ylabel('Number of Users', fontsize=11)
ax1.set_title('Distribution of User Average Ratings', fontsize=12, fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)

# Distribution of user std
ax2 = axes[0, 1]
ax2.hist(user_std_ratings, bins=50, color='#e74c3c', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Std Deviation per User', fontsize=11)
ax2.set_ylabel('Number of Users', fontsize=11)
ax2.set_title('Distribution of User Rating Variability', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Scatter: Mean vs Std (sample)
ax3 = axes[1, 0]
sample_size = min(5000, n_users_total)
sample_indices = np.random.choice(n_users_total, sample_size, replace=False)
scatter = ax3.scatter(user_mean_ratings[sample_indices], user_std_ratings[sample_indices], 
                     alpha=0.3, s=10, c=user_n_ratings[sample_indices], cmap='viridis')
ax3.set_xlabel('Mean Rating', fontsize=11)
ax3.set_ylabel('Std Deviation', fontsize=11)
ax3.set_title(f'Mean vs Std (sample: {sample_size:,})', fontsize=12, fontweight='bold')
ax3.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax3, label='# Ratings')

# User types pie chart
ax4 = axes[1, 1]
behavior_names = ['Harsh\n(<3.0)', 'Balanced\n(3.0-4.0)', 'Generous\n(>4.0)']
behavior_counts = np.array([harsh_raters, balanced_raters, generous_raters])
behavior_colors = ['#e74c3c', '#f39c12', '#2ecc71']

ax4.pie(behavior_counts, labels=behavior_names,
        autopct=lambda pct: f'{pct:.1f}%\n({int(round(pct/100*n_users_total)):,})',
        colors=behavior_colors, startangle=90,
        textprops={'fontsize': 10, 'weight': 'bold'})
ax4.set_title('User Rating Tendency', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 8. Product Analysis

Analyze product characteristics

### 8.1. Product Popularity Statistics

In [None]:
# Calculate ratings per product
product_ids = data[:, 1]
unique_products, product_rating_counts = np.unique(product_ids, return_counts=True)

n_products = len(unique_products)
total_ratings = len(product_ids)

# Statistics
mean_ratings_per_product = np.mean(product_rating_counts)
median_ratings_per_product = np.median(product_rating_counts)
std_ratings_per_product = np.std(product_rating_counts)
min_ratings_per_product = np.min(product_rating_counts)
max_ratings_per_product = np.max(product_rating_counts)

print("Product Popularity Statistics:")
print(f"  Total unique products: {n_products:,}")
print(f"  Mean ratings per product: {mean_ratings_per_product:.2f}")
print(f"  Median ratings per product: {median_ratings_per_product:.1f}")
print(f"  Std deviation: {std_ratings_per_product:.2f}")
print(f"  Range: {min_ratings_per_product} to {max_ratings_per_product}")

# Top 20 most reviewed products
top_20_indices = np.argsort(product_rating_counts)[-20:][::-1]
top_20_products = unique_products[top_20_indices]
top_20_counts = product_rating_counts[top_20_indices]

print(f"\nTop 20 Most Reviewed Products:")
print(f"{'Rank':<6} {'Product ID':<15} {'# Ratings':<15} {'% of Total'}")
print("-" * 50)
for rank, (product_id, count) in enumerate(zip(top_20_products, top_20_counts), 1):
    pct = (count / total_ratings) * 100
    print(f"{rank:<6} {int(product_id):<15} {count:<15,} {pct:.4f}%")

# Cold start problem
products_with_lt_5_ratings = np.sum(product_rating_counts < 5)
products_with_lt_10_ratings = np.sum(product_rating_counts < 10)

pct_lt_5 = (products_with_lt_5_ratings / n_products) * 100
pct_lt_10 = (products_with_lt_10_ratings / n_products) * 100

print(f"\nProduct Cold Start Analysis:")
print(f"  Products with < 5 ratings: {products_with_lt_5_ratings:,} ({pct_lt_5:.2f}%)")
print(f"  Products with < 10 ratings: {products_with_lt_10_ratings:,} ({pct_lt_10:.2f}%)")

if pct_lt_5 > 50:
    print("  ⚠ High cold start risk for products")
elif pct_lt_5 > 30:
    print("  ⚠ Moderate cold start problem")
else:
    print("  ✓ Low cold start problem")
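
A complementary way to look at the long tail is to ask what share of all ratings falls on the most-rated products. The helper `top_share` below is a hypothetical sketch on toy counts, not part of the analysis above.

```python
import numpy as np

def top_share(counts, fraction):
    """Share of all ratings that falls on the top `fraction` of items."""
    counts = np.sort(np.asarray(counts))[::-1]        # most-rated first
    k = max(1, int(np.ceil(fraction * len(counts))))  # keep at least one item
    return counts[:k].sum() / counts.sum()

toy_product_counts = np.array([100, 50, 20, 10, 5, 5, 4, 3, 2, 1])
share = top_share(toy_product_counts, 0.2)
print(f"Top 20% of products hold {share:.0%} of the ratings")
```

Applied to `product_rating_counts`, a high share for a small fraction would suggest that popularity-based baselines are strong but cover few products.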

In [None]:
# Product popularity categories
blockbuster = np.sum(product_rating_counts > 100)
popular = np.sum((product_rating_counts > 50) & (product_rating_counts <= 100))
moderate = np.sum((product_rating_counts >= 20) & (product_rating_counts <= 50))
niche = np.sum((product_rating_counts >= 10) & (product_rating_counts < 20))
unpopular = np.sum((product_rating_counts >= 5) & (product_rating_counts < 10))
rare = np.sum(product_rating_counts < 5)

print("Product Popularity Categories:")
print(f"  Blockbuster (>100): {blockbuster:,} ({(blockbuster/n_products)*100:.2f}%)")
print(f"  Popular (51-100): {popular:,} ({(popular/n_products)*100:.2f}%)")
print(f"  Moderate (20-50): {moderate:,} ({(moderate/n_products)*100:.2f}%)")
print(f"  Niche (10-19): {niche:,} ({(niche/n_products)*100:.2f}%)")
print(f"  Unpopular (5-9): {unpopular:,} ({(unpopular/n_products)*100:.2f}%)")
print(f"  Rare (<5): {rare:,} ({(rare/n_products)*100:.2f}%)")

In [None]:
# Visualize product popularity
fig, axes = plt.subplots(figsize=(15, 10))

# Top 20 bar chart

x_pos = np.arange(len(top_20_counts))
bars = axes.barh(x_pos, top_20_counts, color="#389379", alpha=0.7, edgecolor='black')
axes.set_yticks(x_pos)
axes.set_yticklabels([f'Prod {int(pid)}' for pid in top_20_products], fontsize=9)
axes.set_xlabel('Number of Ratings', fontsize=12)
axes.set_title('Top 20 Most Reviewed Products', fontsize=13, fontweight='bold')
axes.invert_yaxis()
axes.grid(True, alpha=0.3, axis='x')

for i, (bar, count) in enumerate(zip(bars, top_20_counts)):
    axes.text(count + max(top_20_counts)*0.01, bar.get_y() + bar.get_height()/2,
            f'{count:,}', va='center', fontsize=9, fontweight='bold')


plt.tight_layout()
plt.show()

### 8.2. Product Quality Analysis

In [None]:
# Calculate product rating statistics (vectorized)
unique_products_sorted, inverse_indices_products = np.unique(product_ids, return_inverse=True)
n_products_total = len(unique_products_sorted)

# Number of ratings per product
product_n_ratings = np.bincount(inverse_indices_products)

# Mean ratings per product
rating_sums_products = np.bincount(inverse_indices_products, weights=ratings)
product_mean_ratings = rating_sums_products / product_n_ratings

# Std ratings per product
rating_squared_sums_products = np.bincount(inverse_indices_products, weights=ratings**2)
mean_squared_products = rating_squared_sums_products / product_n_ratings
product_variance = mean_squared_products - (product_mean_ratings ** 2)
product_std_ratings = np.sqrt(np.maximum(product_variance, 0))

# Overall statistics
overall_mean_quality = np.mean(product_mean_ratings)
overall_median_quality = np.median(product_mean_ratings)

print("Product Quality Statistics:")
print(f"  Mean of product avg ratings: {overall_mean_quality:.4f}")
print(f"  Median of product avg ratings: {overall_median_quality:.4f}")
print(f"  Min product average: {np.min(product_mean_ratings):.4f}")
print(f"  Max product average: {np.max(product_mean_ratings):.4f}")

# Quality categories
excellent = np.sum(product_mean_ratings >= 4.5)
good = np.sum((product_mean_ratings >= 4.0) & (product_mean_ratings < 4.5))
average = np.sum((product_mean_ratings >= 3.0) & (product_mean_ratings < 4.0))
poor = np.sum((product_mean_ratings >= 2.0) & (product_mean_ratings < 3.0))
very_poor = np.sum(product_mean_ratings < 2.0)

print(f"\nProduct Quality Categories:")
print(f"  Excellent (≥4.5): {excellent:,} ({(excellent/n_products_total)*100:.2f}%)")
print(f"  Good (4.0-4.5): {good:,} ({(good/n_products_total)*100:.2f}%)")
print(f"  Average (3.0-4.0): {average:,} ({(average/n_products_total)*100:.2f}%)")
print(f"  Poor (2.0-3.0): {poor:,} ({(poor/n_products_total)*100:.2f}%)")
print(f"  Very Poor (<2.0): {very_poor:,} ({(very_poor/n_products_total)*100:.2f}%)")

# Controversy analysis
products_with_multiple = product_n_ratings > 1
filtered_product_std = product_std_ratings[products_with_multiple]
n_products_multiple = np.sum(products_with_multiple)

controversial = np.sum(filtered_product_std >= 1.5)
mixed = np.sum((filtered_product_std >= 1.0) & (filtered_product_std < 1.5))
consistent = np.sum(filtered_product_std < 1.0)

print(f"\nProduct Rating Controversy (products with >1 rating):")
print(f"  Consistent (std<1.0): {consistent:,} ({(consistent/n_products_multiple)*100:.2f}%)")
print(f"  Mixed (1.0-1.5): {mixed:,} ({(mixed/n_products_multiple)*100:.2f}%)")
print(f"  Controversial (std≥1.5): {controversial:,} ({(controversial/n_products_multiple)*100:.2f}%)")

In [None]:
# Visualize product quality
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Product Quality Analysis', fontsize=16, fontweight='bold')

# Distribution of product mean ratings
ax1 = axes[0, 0]
ax1.hist(product_mean_ratings, bins=50, color='#3498db', alpha=0.7, edgecolor='black')
ax1.axvline(overall_mean_quality, color='red', linestyle='--', linewidth=2, 
           label=f'Mean: {overall_mean_quality:.2f}')
ax1.axvline(overall_median_quality, color='green', linestyle='--', linewidth=2,
           label=f'Median: {overall_median_quality:.2f}')
ax1.set_xlabel('Average Rating per Product', fontsize=11)
ax1.set_ylabel('Number of Products', fontsize=11)
ax1.set_title('Distribution of Product Average Ratings', fontsize=12, fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)

# Distribution of product std
ax2 = axes[0, 1]
ax2.hist(product_std_ratings, bins=50, color='#e74c3c', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Std Deviation per Product', fontsize=11)
ax2.set_ylabel('Number of Products', fontsize=11)
ax2.set_title('Distribution of Product Rating Variability', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Scatter: Popularity vs Quality
ax3 = axes[1, 0]
eligible_mask = product_n_ratings >= 10
if np.sum(eligible_mask) > 0:
    eligible_counts = product_n_ratings[eligible_mask]
    eligible_means = product_mean_ratings[eligible_mask]
    eligible_stds = product_std_ratings[eligible_mask]
    
    scatter = ax3.scatter(eligible_counts, eligible_means, 
                         alpha=0.5, s=30, c=eligible_stds, cmap='RdYlGn_r',
                         edgecolors='black', linewidth=0.5)
    ax3.set_xscale('log')
    ax3.set_xlabel('Number of Ratings (log)', fontsize=11)
    ax3.set_ylabel('Average Rating', fontsize=11)
    ax3.set_title('Popularity vs Quality (min 10 ratings)', fontsize=12, fontweight='bold')
    ax3.grid(True, alpha=0.3)
    ax3.axhline(overall_mean_quality, color='red', linestyle='--', alpha=0.5, linewidth=1.5)
    plt.colorbar(scatter, ax=ax3, label='Std Dev')

# Quality categories pie chart
ax4 = axes[1, 1]
quality_names = ['Excellent\n(≥4.5)', 'Good\n(4.0-4.5)', 'Average\n(3.0-4.0)', 
                 'Poor\n(2.0-3.0)', 'Very Poor\n(<2.0)']
quality_counts = np.array([excellent, good, average, poor, very_poor])
quality_colors = ['#27ae60', '#2ecc71', '#f39c12', '#e67e22', '#e74c3c']

# Filter non-zero
non_zero_mask = quality_counts > 0
filtered_names = [name for name, count in zip(quality_names, quality_counts) if count > 0]
filtered_counts = quality_counts[non_zero_mask]
filtered_colors = [color for color, count in zip(quality_colors, quality_counts) if count > 0]

ax4.pie(filtered_counts, labels=filtered_names,
        autopct=lambda pct: f'{pct:.1f}%\n({int(pct/100*n_products_total):,})',
        colors=filtered_colors, startangle=90,
        textprops={'fontsize': 9, 'weight': 'bold'})
ax4.set_title('Product Quality Categories', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 9. Temporal Analysis

Analyze trends over time (if timestamps are available)

In [None]:
print("="*70)
print("9. TEMPORAL ANALYSIS")
print("="*70)

# Extract timestamps column
timestamps = data[:, 3]

# Check if timestamps are available
non_zero_timestamps = np.sum(timestamps > 0)
timestamp_availability = (non_zero_timestamps / len(timestamps)) * 100

print(f"\n📅 Timestamp Availability:")
print("-" * 70)
print(f"  • Total ratings: {len(timestamps):,}")
print(f"  • Ratings with valid timestamps: {non_zero_timestamps:,} ({timestamp_availability:.2f}%)")
print(f"  • Ratings without timestamps: {len(timestamps) - non_zero_timestamps:,}")

if non_zero_timestamps > 0:
    # Filter ratings with valid timestamps
    valid_timestamp_mask = timestamps > 0
    valid_timestamps = timestamps[valid_timestamp_mask]
    valid_ratings = ratings[valid_timestamp_mask]
    
    # Convert timestamps to datetime
    from datetime import datetime
    
    # Find timestamp range
    min_timestamp = int(np.min(valid_timestamps))
    max_timestamp = int(np.max(valid_timestamps))
    
    min_date = datetime.fromtimestamp(min_timestamp)
    max_date = datetime.fromtimestamp(max_timestamp)
    
    print(f"\n📊 Temporal Coverage:")
    print("-" * 70)
    print(f"  • First rating: {min_date.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  • Last rating: {max_date.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  • Time span: {(max_timestamp - min_timestamp) / (365.25 * 24 * 3600):.2f} years")
    
    # Convert timestamps to years and months for aggregation
    dates = np.array([datetime.fromtimestamp(int(ts)) for ts in valid_timestamps])
    years = np.array([d.year for d in dates])
    months = np.array([d.month for d in dates])
    year_months = np.array([d.year * 100 + d.month for d in dates])  # YYYYMM format
    
    # 1. Ratings count over time (by month)
    unique_year_months = np.unique(year_months)
    monthly_counts = np.array([np.sum(year_months == ym) for ym in unique_year_months])
    
    # 2. Average rating over time (by month)
    monthly_avg_ratings = np.array([np.mean(valid_ratings[year_months == ym]) for ym in unique_year_months])
    
    print(f"\n📈 Activity Trends:")
    print("-" * 70)
    print(f"  • Unique months with data: {len(unique_year_months)}")
    print(f"  • Peak month: {unique_year_months[np.argmax(monthly_counts)]} with {np.max(monthly_counts):,} ratings")
    print(f"  • Average ratings per month: {np.mean(monthly_counts):.2f}")
    print(f"  • Median ratings per month: {np.median(monthly_counts):.2f}")
    
    # 3. Rating inflation analysis
    first_half_mask = valid_timestamps < (min_timestamp + (max_timestamp - min_timestamp) / 2)
    second_half_mask = ~first_half_mask
    
    first_half_avg = np.mean(valid_ratings[first_half_mask])
    second_half_avg = np.mean(valid_ratings[second_half_mask])
    rating_change = second_half_avg - first_half_avg
    
    print(f"\n🔍 Rating Inflation Analysis:")
    print("-" * 70)
    print(f"  • Average rating (first half): {first_half_avg:.4f}")
    print(f"  • Average rating (second half): {second_half_avg:.4f}")
    print(f"  • Change: {rating_change:+.4f}")
    
    if abs(rating_change) > 0.1:
        if rating_change > 0:
            print("  ⚠ Rating inflation detected! Ratings increased over time.")
        else:
            print("  ⚠ Rating deflation detected! Ratings decreased over time.")
    else:
        print("  ✓ Ratings are relatively stable over time.")
    
    # 4. Yearly analysis
    unique_years = np.unique(years)
    yearly_counts = np.array([np.sum(years == y) for y in unique_years])
    yearly_avg_ratings = np.array([np.mean(valid_ratings[years == y]) for y in unique_years])
    
    print(f"\n📅 Yearly Breakdown:")
    print("-" * 70)
    print(f"{'Year':<8} {'# Ratings':<15} {'Avg Rating':<15} {'% of Total'}")
    print("-" * 70)
    for year, count, avg_rating in zip(unique_years, yearly_counts, yearly_avg_ratings):
        pct = (count / len(valid_timestamps)) * 100
        print(f"{int(year):<8} {count:<15,} {avg_rating:<15.4f} {pct:.2f}%")
    
    # 5. Seasonal patterns (monthly aggregation)
    monthly_distribution = np.array([np.sum(months == m) for m in range(1, 13)])
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    
    peak_month = np.argmax(monthly_distribution) + 1
    print(f"\n🌡️ Seasonal Patterns:")
    print("-" * 70)
    print(f"  • Peak month: {month_names[peak_month-1]} with {monthly_distribution[peak_month-1]:,} ratings")
    print(f"  • Lowest month: {month_names[np.argmin(monthly_distribution)]} with {np.min(monthly_distribution):,} ratings")
    
    # Visualizations
    fig, axes = plt.subplots(2, 3, figsize=(18, 11))
    fig.suptitle('Temporal Analysis', fontsize=16, fontweight='bold')
    
    # Subplot 1: Ratings count over time (monthly)
    ax1 = axes[0, 0]
    # Convert YYYYMM to readable format for plotting
    month_labels = [f"{ym//100}-{ym%100:02d}" for ym in unique_year_months]
    # Plot every Nth label to avoid crowding
    label_step = max(1, len(month_labels) // 10)
    x_indices = np.arange(len(monthly_counts))
    
    ax1.plot(x_indices, monthly_counts, color='steelblue', linewidth=2)
    ax1.fill_between(x_indices, monthly_counts, alpha=0.3, color='steelblue')
    ax1.set_xlabel('Time', fontsize=11)
    ax1.set_ylabel('Number of Ratings', fontsize=11)
    ax1.set_title('Rating Volume Over Time (Monthly)', fontsize=12, fontweight='bold')
    ax1.set_xticks(x_indices[::label_step])
    ax1.set_xticklabels(month_labels[::label_step], rotation=45, ha='right', fontsize=8)
    ax1.grid(True, alpha=0.3)
    
    # Subplot 2: Average rating over time (monthly)
    ax2 = axes[0, 1]
    ax2.plot(x_indices, monthly_avg_ratings, color='#e74c3c', linewidth=2, marker='o', markersize=3)
    ax2.axhline(np.mean(valid_ratings), color='green', linestyle='--', linewidth=2, 
               label=f'Overall avg: {np.mean(valid_ratings):.2f}', alpha=0.7)
    ax2.set_xlabel('Time', fontsize=11)
    ax2.set_ylabel('Average Rating', fontsize=11)
    ax2.set_title('Average Rating Trend Over Time', fontsize=12, fontweight='bold')
    ax2.set_xticks(x_indices[::label_step])
    ax2.set_xticklabels(month_labels[::label_step], rotation=45, ha='right', fontsize=8)
    ax2.set_ylim([0, 5.5])
    ax2.legend(fontsize=9)
    ax2.grid(True, alpha=0.3)
    
    # Subplot 3: Yearly distribution
    ax3 = axes[0, 2]
    bars = ax3.bar(unique_years.astype(str), yearly_counts, color='#9b59b6', alpha=0.7, edgecolor='black')
    ax3.set_xlabel('Year', fontsize=11)
    ax3.set_ylabel('Number of Ratings', fontsize=11)
    ax3.set_title('Ratings Distribution by Year', fontsize=12, fontweight='bold')
    ax3.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for bar, count in zip(bars, yearly_counts):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height,
                f'{count:,}', ha='center', va='bottom', fontsize=9, fontweight='bold')
    
    # Subplot 4: Seasonal patterns (monthly distribution)
    ax4 = axes[1, 0]
    bars4 = ax4.bar(month_names, monthly_distribution, color='#3498db', alpha=0.7, edgecolor='black')
    ax4.set_xlabel('Month', fontsize=11)
    ax4.set_ylabel('Total Ratings', fontsize=11)
    ax4.set_title('Seasonal Pattern (All Years Combined)', fontsize=12, fontweight='bold')
    ax4.tick_params(axis='x', rotation=45)
    ax4.grid(True, alpha=0.3, axis='y')
    
    # Highlight peak month
    bars4[peak_month-1].set_color('#e74c3c')
    
    # Subplot 5: Cumulative ratings over time
    ax5 = axes[1, 1]
    cumulative_counts = np.cumsum(monthly_counts)
    ax5.plot(x_indices, cumulative_counts, color='#27ae60', linewidth=2)
    ax5.fill_between(x_indices, cumulative_counts, alpha=0.3, color='#27ae60')
    ax5.set_xlabel('Time', fontsize=11)
    ax5.set_ylabel('Cumulative Ratings', fontsize=11)
    ax5.set_title('Cumulative Rating Growth', fontsize=12, fontweight='bold')
    ax5.set_xticks(x_indices[::label_step])
    ax5.set_xticklabels(month_labels[::label_step], rotation=45, ha='right', fontsize=8)
    ax5.grid(True, alpha=0.3)
    
    # Subplot 6: Rating distribution comparison (first vs second half)
    ax6 = axes[1, 2]
    first_half_ratings = valid_ratings[first_half_mask]
    second_half_ratings = valid_ratings[second_half_mask]
    
    rating_values = [1, 2, 3, 4, 5]
    first_half_dist = [np.sum(first_half_ratings == r) / len(first_half_ratings) * 100 for r in rating_values]
    second_half_dist = [np.sum(second_half_ratings == r) / len(second_half_ratings) * 100 for r in rating_values]
    
    x = np.arange(len(rating_values))
    width = 0.35
    
    bars1 = ax6.bar(x - width/2, first_half_dist, width, label='First Half', 
                    color='#3498db', alpha=0.7, edgecolor='black')
    bars2 = ax6.bar(x + width/2, second_half_dist, width, label='Second Half', 
                    color='#e74c3c', alpha=0.7, edgecolor='black')
    
    ax6.set_xlabel('Rating', fontsize=11)
    ax6.set_ylabel('Percentage (%)', fontsize=11)
    ax6.set_title('Rating Distribution: First vs Second Half', fontsize=12, fontweight='bold')
    ax6.set_xticks(x)
    ax6.set_xticklabels(rating_values)
    ax6.legend(fontsize=10)
    ax6.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\n✓ Temporal analysis completed!")

else:
    print("\n⚠ No valid timestamps available for temporal analysis.")
    print("Skipping temporal visualizations.")
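
The per-month loops above scan the full array once for every month. As a possible alternative, the sketch below groups ratings by month with NumPy's `datetime64` type and `np.bincount`. The `toy_*` arrays are made-up values for illustration. Note that `datetime64` interprets the timestamps as UTC, whereas `datetime.fromtimestamp` uses the local time zone, so month boundaries may differ slightly between the two approaches.

```python
import numpy as np

# Toy Unix timestamps (seconds) and ratings; illustrative values only.
toy_timestamps = np.array([1356998400, 1357084800, 1359676800, 1388534400],
                          dtype=np.int64)
toy_ratings = np.array([5.0, 3.0, 4.0, 1.0])

# Interpret as UTC seconds, then truncate to month precision.
toy_months = toy_timestamps.astype('datetime64[s]').astype('datetime64[M]')

# Sorted unique months, a group index for every rating, and per-month counts.
uniq_months, month_idx, month_counts = np.unique(
    toy_months, return_inverse=True, return_counts=True)

# Sum the ratings per month, then divide by the counts to get monthly means.
month_means = np.bincount(month_idx, weights=toy_ratings) / month_counts

for month, count, mean in zip(uniq_months, month_counts, month_means):
    print(f"{month}: {count} ratings, mean {mean:.2f}")
```

On the real data, `valid_timestamps` and `valid_ratings` would take the place of the toy arrays.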


## 10. Data Sparsity Analysis

Analyze the sparsity of the data, which is important for recommendation systems.

In [None]:
print("="*70)
print("10. DATA SPARSITY ANALYSIS")
print("="*70)

# Get dimensions of user-item matrix
n_users = len(np.unique(user_ids))
n_products = len(np.unique(product_ids))
n_ratings = len(ratings)

# Calculate sparsity
total_possible_ratings = n_users * n_products
sparsity = 1 - (n_ratings / total_possible_ratings)
density = (n_ratings / total_possible_ratings) * 100

print(f"\n📊 User-Item Matrix Dimensions:")
print("-" * 70)
print(f"  • Number of users: {n_users:,}")
print(f"  • Number of products: {n_products:,}")
print(f"  • Matrix size: {n_users:,} × {n_products:,}")
print(f"  • Total possible ratings: {total_possible_ratings:,}")

print(f"\n🕳️ Sparsity Metrics:")
print("-" * 70)
print(f"  • Actual ratings: {n_ratings:,}")
print(f"  • Sparsity: {sparsity:.8f} ({sparsity*100:.6f}%)")
print(f"  • Density: {density:.8f}%")
print(f"  • Fill ratio: 1 in {int(1/density*100):,} entries")

if sparsity > 0.99:
    print("  ⚠ Extremely sparse matrix! Collaborative filtering will be challenging.")
elif sparsity > 0.95:
    print("  ⚠ Very sparse matrix. Matrix factorization recommended.")
else:
    print("  ✓ Moderate sparsity. Standard CF algorithms should work well.")

# Calculate coverage metrics
users_with_ratings = n_users  # All users have at least 1 rating (by definition)
products_with_ratings = n_products  # All products have at least 1 rating

user_coverage = (users_with_ratings / n_users) * 100
product_coverage = (products_with_ratings / n_products) * 100

print(f"\n📈 Coverage Statistics:")
print("-" * 70)
print(f"  • User coverage: {user_coverage:.2f}% ({users_with_ratings:,}/{n_users:,})")
print(f"  • Product coverage: {product_coverage:.2f}% ({products_with_ratings:,}/{n_products:,})")

# Analyze distribution of non-zero entries per user and product
# Use the already calculated user_rating_counts and product_rating_counts
print(f"\n📊 Non-zero Entries Distribution:")
print("-" * 70)
print(f"  • Avg ratings per user: {np.mean(user_rating_counts):.2f}")
print(f"  • Median ratings per user: {np.median(user_rating_counts):.1f}")
print(f"  • Max ratings per user: {np.max(user_rating_counts):,}")
print(f"  • Avg ratings per product: {np.mean(product_rating_counts):.2f}")
print(f"  • Median ratings per product: {np.median(product_rating_counts):.1f}")
print(f"  • Max ratings per product: {np.max(product_rating_counts):,}")

# Calculate percentiles for better understanding
user_percentiles = np.percentile(user_rating_counts, [25, 50, 75, 90, 95, 99])
product_percentiles = np.percentile(product_rating_counts, [25, 50, 75, 90, 95, 99])

print(f"\n📊 Rating Distribution Percentiles:")
print("-" * 70)
print(f"{'Percentile':<12} {'Users':<15} {'Products':<15}")
print("-" * 70)
for pct, u_val, p_val in zip([25, 50, 75, 90, 95, 99], user_percentiles, product_percentiles):
    print(f"{pct}th{'':<9} {u_val:<15.1f} {p_val:<15.1f}")

# Implications for recommendation systems
print(f"\n🎯 Implications for Recommendation Systems:")
print("-" * 70)

# Long tail analysis
users_top_10_pct = np.sum(user_rating_counts >= np.percentile(user_rating_counts, 90))
ratings_by_top_users = np.sum(user_rating_counts[user_rating_counts >= np.percentile(user_rating_counts, 90)])
pct_ratings_by_top = (ratings_by_top_users / n_ratings) * 100

print(f"  • Top 10% most active users: {users_top_10_pct:,} ({(users_top_10_pct/n_users)*100:.2f}%)")
print(f"  • They contribute: {ratings_by_top_users:,} ratings ({pct_ratings_by_top:.2f}%)")

products_top_10_pct = np.sum(product_rating_counts >= np.percentile(product_rating_counts, 90))
ratings_for_top_products = np.sum(product_rating_counts[product_rating_counts >= np.percentile(product_rating_counts, 90)])
pct_ratings_for_top = (ratings_for_top_products / n_ratings) * 100

print(f"  • Top 10% most popular products: {products_top_10_pct:,} ({(products_top_10_pct/n_products)*100:.2f}%)")
print(f"  • They receive: {ratings_for_top_products:,} ratings ({pct_ratings_for_top:.2f}%)")

# Recommendation strategies
print(f"\n💡 Recommended Strategies:")
print("-" * 70)
if sparsity > 0.999:
    print("  • Use Matrix Factorization (SVD, ALS)")
    print("  • Implement Content-Based Filtering")
    print("  • Consider Hybrid Approaches")
    print("  • Use Popularity-Based for cold start")
elif sparsity > 0.99:
    print("  • Matrix Factorization is highly recommended")
    print("  • Item-based CF may work better than user-based")
    print("  • Consider dimensionality reduction")
else:
    print("  • Both user-based and item-based CF viable")
    print("  • Matrix factorization will improve performance")
    print("  • Neighborhood-based methods should work")

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 11))
fig.suptitle('Data Sparsity Analysis', fontsize=16, fontweight='bold')

# Subplot 1: Sparsity visualization
ax1 = axes[0, 0]
categories = ['Filled', 'Empty']
values = [n_ratings, total_possible_ratings - n_ratings]
colors = ['#27ae60', '#e74c3c']
explode = [0.1, 0]

wedges, texts, autotexts = ax1.pie(values, labels=categories, autopct='%1.4f%%',
                                     colors=colors, explode=explode, startangle=90,
                                     textprops={'fontsize': 11, 'weight': 'bold'})
ax1.set_title('User-Item Matrix Sparsity', fontsize=12, fontweight='bold')

# Subplot 2: Sample heatmap (small subset)
ax2 = axes[0, 1]
# Create a small sample matrix for visualization
sample_size_users = min(50, n_users)
sample_size_products = min(50, n_products)

# Randomly sample users and products
sampled_user_indices = np.random.choice(n_users, sample_size_users, replace=False)
sampled_product_indices = np.random.choice(n_products, sample_size_products, replace=False)

# Get unique users and products for indexing
unique_users_array = np.unique(user_ids)
unique_products_array = np.unique(product_ids)

sampled_users = unique_users_array[sampled_user_indices]
sampled_products = unique_products_array[sampled_product_indices]

# Create sample matrix
sample_matrix = np.zeros((sample_size_users, sample_size_products))

for i, user in enumerate(sampled_users):
    user_mask = user_ids == user
    user_products = product_ids[user_mask]
    user_ratings = ratings[user_mask]
    
    for j, product in enumerate(sampled_products):
        product_mask = user_products == product
        if np.any(product_mask):
            sample_matrix[i, j] = user_ratings[product_mask][0]

# Plot heatmap
im = ax2.imshow(sample_matrix, cmap='RdYlGn', aspect='auto', vmin=0, vmax=5)
ax2.set_xlabel('Products (sample)', fontsize=11)
ax2.set_ylabel('Users (sample)', fontsize=11)
ax2.set_title(f'Sample User-Item Matrix ({sample_size_users}×{sample_size_products})', 
             fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax2, label='Rating')

# Subplot 3: Ratings per user distribution (log scale)
ax3 = axes[0, 2]
ax3.hist(user_rating_counts, bins=50, color='#3498db', alpha=0.7, edgecolor='black')
ax3.set_xlabel('Ratings per User', fontsize=11)
ax3.set_ylabel('Number of Users (log scale)', fontsize=11)
ax3.set_yscale('log')
ax3.set_title('Distribution of Ratings per User', fontsize=12, fontweight='bold')
ax3.grid(True, alpha=0.3)

# Subplot 4: Ratings per product distribution (log scale)
ax4 = axes[1, 0]
ax4.hist(product_rating_counts, bins=50, color='#e67e22', alpha=0.7, edgecolor='black')
ax4.set_xlabel('Ratings per Product', fontsize=11)
ax4.set_ylabel('Number of Products (log scale)', fontsize=11)
ax4.set_yscale('log')
ax4.set_title('Distribution of Ratings per Product', fontsize=12, fontweight='bold')
ax4.grid(True, alpha=0.3)

# Subplot 5: Cumulative distribution of user ratings
ax5 = axes[1, 1]
sorted_user_counts = np.sort(user_rating_counts)[::-1]
cumulative_pct = np.cumsum(sorted_user_counts) / np.sum(sorted_user_counts) * 100
user_pct = np.arange(1, len(sorted_user_counts) + 1) / len(sorted_user_counts) * 100

ax5.plot(user_pct, cumulative_pct, color='#9b59b6', linewidth=2)
ax5.axhline(80, color='red', linestyle='--', linewidth=2, alpha=0.7, label='80% of ratings')
ax5.axvline(20, color='green', linestyle='--', linewidth=2, alpha=0.7, label='20% of users')
ax5.set_xlabel('Cumulative % of Users', fontsize=11)
ax5.set_ylabel('Cumulative % of Ratings', fontsize=11)
ax5.set_title('Pareto Analysis: User Contribution', fontsize=12, fontweight='bold')
ax5.legend(fontsize=10)
ax5.grid(True, alpha=0.3)

# Find the % of users contributing 80% of ratings
# (searchsorted returns a 0-based index, so add 1 to get a user count)
users_for_80_pct = (np.searchsorted(cumulative_pct, 80) + 1) / len(cumulative_pct) * 100
ax5.text(0.5, 0.95, f'{users_for_80_pct:.1f}% of users\ncontribute 80% of ratings', 
        transform=ax5.transAxes, fontsize=10, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Subplot 6: Cumulative distribution of product ratings
ax6 = axes[1, 2]
sorted_product_counts = np.sort(product_rating_counts)[::-1]
cumulative_pct_products = np.cumsum(sorted_product_counts) / np.sum(sorted_product_counts) * 100
product_pct = np.arange(1, len(sorted_product_counts) + 1) / len(sorted_product_counts) * 100

ax6.plot(product_pct, cumulative_pct_products, color='#e74c3c', linewidth=2)
ax6.axhline(80, color='red', linestyle='--', linewidth=2, alpha=0.7, label='80% of ratings')
ax6.axvline(20, color='green', linestyle='--', linewidth=2, alpha=0.7, label='20% of products')
ax6.set_xlabel('Cumulative % of Products', fontsize=11)
ax6.set_ylabel('Cumulative % of Ratings', fontsize=11)
ax6.set_title('Pareto Analysis: Product Popularity', fontsize=12, fontweight='bold')
ax6.legend(fontsize=10)
ax6.grid(True, alpha=0.3)

# Find the % of products receiving 80% of ratings
# (searchsorted returns a 0-based index, so add 1 to get a product count)
products_for_80_pct = (np.searchsorted(cumulative_pct_products, 80) + 1) / len(cumulative_pct_products) * 100
ax6.text(0.5, 0.95, f'{products_for_80_pct:.1f}% of products\nreceive 80% of ratings', 
        transform=ax6.transAxes, fontsize=10, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n✓ Data sparsity analysis completed!")


## 11. Exploratory Questions

Pose questions about the data and answer them.

### Question 1: Is there rating polarization?

Do users tend to give very high (5-star) or very low (1-star) ratings and rarely give middle ratings?

In [None]:
# TODO: Analyze rating polarization
# - Compare frequency of extreme ratings (1, 5) vs middle ratings (2, 3, 4)
# - Visualize with a bar chart and a pie chart
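
A minimal sketch of the comparison described in the TODO above, assuming integer star ratings from 1 to 5. The `toy_ratings` array is invented for illustration; on the real data the notebook's `ratings` array would be used instead.

```python
import numpy as np

# Toy ratings; illustrative values only.
toy_ratings = np.array([5, 5, 1, 3, 5, 4, 1, 5, 2, 5], dtype=float)

star_values = np.array([1, 2, 3, 4, 5])
star_counts = np.array([np.sum(toy_ratings == s) for s in star_values])

# Extreme ratings are 1 and 5 stars; middle ratings are 2, 3 and 4 stars.
extreme_share = (star_counts[0] + star_counts[4]) / star_counts.sum()
middle_share = 1.0 - extreme_share

print(f"Extreme (1 or 5 stars): {extreme_share * 100:.1f}%")
print(f"Middle (2 to 4 stars):  {middle_share * 100:.1f}%")
```

Because there are two extreme values and three middle values, comparing the share per star value (`extreme_share / 2` against `middle_share / 3`) may give a fairer picture than comparing the raw shares.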

### Question 2: Which products stand out the most?

Identify products that combine a large number of ratings with high rating quality.

In [None]:
# TODO: Find outstanding products
# Calculate popularity score: weighted rating
# Formula: weighted_rating = (v/(v+m)) * R + (m/(v+m)) * C
# Where:
# v = number of ratings for the product
# m = minimum ratings threshold
# R = average rating for the product
# C = mean rating across all products

# Visualize top products with a scatter plot (ratings count vs avg rating)
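
The weighted-rating formula in the TODO above can be sketched as follows. The toy arrays and the threshold `min_votes = 10` are assumptions for illustration, and `prior_mean` (C in the formula) is taken here as the mean of the per-product averages, which is one possible reading of "mean rating across all products".

```python
import numpy as np

# Toy per-product statistics; illustrative values only.
toy_counts = np.array([200, 3, 50, 1], dtype=float)   # v: ratings per product
toy_means = np.array([4.5, 5.0, 3.0, 1.0])            # R: average rating per product

min_votes = 10                    # m: assumed threshold
prior_mean = np.mean(toy_means)   # C: assumed to be the mean of product averages

# weighted_rating = (v/(v+m)) * R + (m/(v+m)) * C
weight = toy_counts / (toy_counts + min_votes)
weighted_rating = weight * toy_means + (1.0 - weight) * prior_mean

# Product indices ordered from highest to lowest weighted rating.
top_order = np.argsort(weighted_rating)[::-1]
print(np.round(weighted_rating, 4), top_order)
```

The product with 3 ratings and a perfect 5.0 average is pulled toward the prior, so it ranks below the product with 200 ratings averaging 4.5.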

### Question 3: Do "power users" exist?

Analyze how user activity is distributed: does a small group of users contribute most of the ratings?

In [None]:
# TODO: Analyze power users
# - Calculate cumulative percentage of ratings
# - Create Pareto chart (80-20 rule?)
# - What % of users contribute 80% of ratings?

# This is important for:
# - Understanding data concentration
# - Cold start problem severity
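
A small sketch of the Pareto calculation outlined above. `toy_user_counts` is an invented array of ratings per user; in the notebook, `user_rating_counts` would play that role.

```python
import numpy as np

# Toy ratings-per-user counts; illustrative values only.
toy_user_counts = np.array([40, 30, 15, 5, 3, 2, 2, 1, 1, 1])

# Sort users from most to least active and accumulate their share of ratings.
sorted_counts = np.sort(toy_user_counts)[::-1]
cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()

# searchsorted gives the 0-based index of the first user at which the
# cumulative share reaches 80%, so add 1 to turn it into a user count.
n_needed = int(np.searchsorted(cum_share, 0.8)) + 1
pct_users_needed = n_needed / len(sorted_counts) * 100

print(f"{pct_users_needed:.1f}% of users contribute 80% of ratings")
```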

### Question 4: Is there a relationship between the number of ratings and the average rating?

Do popular products tend to be rated higher?

In [None]:
# TODO: Correlation analysis
# - Calculate correlation between popularity and rating
# - Scatter plot with trend line
# - Statistical significance test

# Use NumPy to calculate Pearson correlation:
# r = cov(X,Y) / (std(X) * std(Y))
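
The Pearson formula above can be checked against `np.corrcoef` on a small example. The toy arrays are invented. The t statistic is the standard one for testing a correlation against zero; turning it into a p-value needs a Student t distribution, which NumPy does not provide, so that step is left out of this sketch.

```python
import numpy as np

# Toy per-product popularity and average rating; illustrative values only.
toy_popularity = np.array([1, 2, 3, 4, 5], dtype=float)
toy_avg_rating = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# r = cov(X, Y) / (std(X) * std(Y)), using population (ddof=0) statistics.
dx = toy_popularity - toy_popularity.mean()
dy = toy_avg_rating - toy_avg_rating.mean()
cov_xy = np.mean(dx * dy)
r = cov_xy / (np.std(toy_popularity) * np.std(toy_avg_rating))

# Cross-check with NumPy's built-in correlation matrix.
r_check = np.corrcoef(toy_popularity, toy_avg_rating)[0, 1]

# t statistic with n - 2 degrees of freedom.
n = len(toy_popularity)
t_stat = r * np.sqrt((n - 2) / (1.0 - r**2))

print(f"r = {r:.4f}, np.corrcoef = {r_check:.4f}, t = {t_stat:.4f}")
```

Since rating counts are heavily skewed, correlating against `np.log10` of the counts may be worth trying on the real data.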

### Question 5: How severe is the cold start problem?

Analyze how many users and products have few interactions.

In [None]:
# TODO: Analyze cold start problem
# - % of users with <= 5 ratings
# - % of products with <= 5 ratings
# - Distribution of ratings per user/product

# Impact on recommendation system:
# - Hard to make recommendations for new users/products
# - Need different strategies (content-based, popularity-based)
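
A minimal sketch of the percentages listed in the TODO above, using `np.unique` with `return_counts=True`. The ID arrays are toy data, and the threshold of 5 ratings comes from the TODO.

```python
import numpy as np

# Toy rating log: one entry per rating; illustrative values only.
toy_user_ids = np.array(['u1', 'u1', 'u2', 'u3', 'u3', 'u3', 'u3', 'u3', 'u3', 'u4'])
toy_product_ids = np.array(['p1', 'p2', 'p1', 'p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p1'])

# Number of ratings per distinct user and per distinct product.
_, per_user = np.unique(toy_user_ids, return_counts=True)
_, per_product = np.unique(toy_product_ids, return_counts=True)

threshold = 5
pct_cold_users = np.mean(per_user <= threshold) * 100
pct_cold_products = np.mean(per_product <= threshold) * 100

print(f"Users with <= {threshold} ratings: {pct_cold_users:.1f}%")
print(f"Products with <= {threshold} ratings: {pct_cold_products:.1f}%")
```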

### Question 6: User Segmentation

Can users be divided into groups based on their rating behavior?

### Question 7: Product Lifecycle - "Rising Stars" and "Falling Stars"

Detect products that are trending up and products that are declining.

In [None]:
# TODO: Analyze the product lifecycle if timestamps are available
# 1. Split the timeline into periods (e.g., quarterly)
# 2. Compute the number of ratings and the avg rating per period for each product
# 3. Compute the growth rate:
#    growth_rate = (recent_ratings - old_ratings) / old_ratings
#
# Rising Stars:
# - Positive growth rate
# - Improving average rating
# - Increasing review frequency
#
# Falling Stars:
# - Negative growth rate
# - Declining ratings
# - Decreasing review frequency
#
# Stable Products:
# - Consistent rating volume
# - Stable quality

# Business Application:
# - Inventory management: Stock up on rising stars
# - Promotional strategy: Boost falling stars or discontinue
# - Recommendation priority: Feature trending products
# - New product launch insights

# Visualize:
# - Line chart: Rating count over time for top rising/falling products
# - Scatter: Growth rate vs Current avg rating
# - Heatmap: Product performance matrix
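
The growth-rate formula above is undefined for products with no ratings in the older period. The sketch below handles that case by leaving the rate as NaN; this is an assumption of the sketch, not something the TODO specifies. The per-period counts are toy values with only two periods.

```python
import numpy as np

# Rows = products, columns = [old period, recent period]; illustrative values.
toy_period_counts = np.array([
    [10, 20],   # doubled
    [40, 10],   # shrinking
    [0, 5],     # new product: no ratings in the old period
    [8, 8],     # flat
], dtype=float)

old = toy_period_counts[:, 0]
recent = toy_period_counts[:, 1]

# growth_rate = (recent_ratings - old_ratings) / old_ratings, NaN without history.
growth_rate = np.full(old.shape, np.nan)
has_history = old > 0
growth_rate[has_history] = (recent[has_history] - old[has_history]) / old[has_history]

rising = has_history & (recent > old)
falling = has_history & (recent < old)

print(growth_rate, rising, falling)
```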

### Question 8: Cross-Product Purchase Patterns

Are there groups of products that are often bought together?

In [None]:
# TODO: Analyze co-occurrence patterns
# 1. Build the user-product matrix (binary: rated or not)
# 2. Calculate product co-occurrence matrix:
#    Co-occurrence(i,j) = number of users who rated both product i and j
# 3. Normalize by product popularity
# 4. Calculate lift score:
#    lift(i,j) = P(i,j) / (P(i) * P(j))
#    Where P(i,j) = probability both rated together
#          P(i), P(j) = individual probabilities

# Find product pairs with:
# - High co-occurrence count
# - High lift score (> 1 means positive association)

# Business Application:
# - Bundle recommendations: "Customers who bought X also bought Y"
# - Cross-selling opportunities
# - Product placement in store/website
# - Combo deals and promotions

# Visualize:
# - Network graph: Products as nodes, co-occurrence as edges
# - Heatmap: Top 30 products co-occurrence matrix
# - Bar chart: Top product pairs by lift score

# Example insights:
# - Shampoo + Conditioner (expected)
# - Lipstick + Eye shadow (complementary)
# - Face cream + Serum (product line)
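
The co-occurrence and lift steps above can be sketched with a matrix product on a small binary matrix. `toy_matrix` is invented data. A dense matrix like this is unlikely to be practical for the full dataset, so on real data the same computation would probably be restricted to a subset such as the most-rated products.

```python
import numpy as np

# Binary user x product matrix (1 = user rated the product); toy data.
toy_matrix = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
], dtype=float)

toy_n_users = toy_matrix.shape[0]

# cooc[i, j] = number of users who rated both product i and product j.
cooc = toy_matrix.T @ toy_matrix

# P(i) comes from the diagonal, P(i, j) from the full matrix.
p_single = np.diag(cooc) / toy_n_users
p_joint = cooc / toy_n_users

# lift(i, j) = P(i, j) / (P(i) * P(j)); values above 1 suggest association.
lift = p_joint / np.outer(p_single, p_single)

print(cooc)
print(np.round(lift, 4))
```

The diagonal of `lift` compares each product with itself and should be ignored when ranking pairs.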

### Question 9: Rating Reliability - Which products have trustworthy ratings?

Assess the reliability of ratings based on their number and consistency.

In [None]:
# TODO: Calculate rating reliability score
# Reliability factors:
# 1. Sample size: More ratings = more reliable
# 2. Rating consistency: Low variance = more reliable
# 3. Recency: Recent ratings more relevant (if timestamp available)
# 4. Reviewer diversity: More unique users = less bias

# Calculate Confidence Score:
# confidence = (n_ratings / (n_ratings + k)) * consistency_factor
# Where:
# - k = threshold constant (e.g., 10)
# - consistency_factor = 1 / (1 + std_rating)

# Identify categories:
# - High confidence products: Many ratings, low variance
# - Controversial products: Many ratings, high variance
# - Uncertain products: Few ratings (need more data)

# Calculate Wilson Score Confidence Interval (advanced):
# For binary outcomes (positive/negative), gives lower bound of true rating

# Business Application:
# - Quality control: Flag products with low confidence for review
# - Recommendation confidence: Show reliability indicators to users
# - Inventory decisions: Prioritize high-confidence high-rated products
# - A/B testing: Focus improvement efforts on uncertain products

# Visualize:
# - Scatter plot: Number of ratings vs Rating variance
# - Color by average rating
# - Size by confidence score
# - Quadrant analysis (y-axis: number of ratings; x-axis: consistency, i.e. low variance on the right):
#   * Top-right: Popular & Reliable (safe bets)
#   * Top-left: Popular & Controversial (investigate)
#   * Bottom-right: Niche & Reliable (hidden gems)
#   * Bottom-left: Uncertain (need data)
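
A sketch of the confidence score defined above, followed by the Wilson score lower bound for a binary outcome. The toy arrays, `k = 10`, and `z = 1.96` (roughly a 95% interval) are assumptions. Counting a rating as positive when it is at least 4 stars would be one way to obtain `n_pos`, but the TODO does not fix that choice.

```python
import numpy as np

# Toy per-product statistics; illustrative values only.
toy_n_ratings = np.array([200, 150, 2], dtype=float)
toy_std = np.array([0.5, 1.8, 0.0])
k = 10  # threshold constant from the TODO's example

# confidence = (n_ratings / (n_ratings + k)) * consistency_factor
consistency_factor = 1.0 / (1.0 + toy_std)
confidence = (toy_n_ratings / (toy_n_ratings + k)) * consistency_factor


def wilson_lower_bound(n_pos, n_total, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    n_total = np.asarray(n_total, dtype=float)
    p_hat = np.asarray(n_pos, dtype=float) / n_total
    denom = 1.0 + z**2 / n_total
    centre = p_hat + z**2 / (2.0 * n_total)
    margin = z * np.sqrt(p_hat * (1.0 - p_hat) / n_total
                         + z**2 / (4.0 * n_total**2))
    return (centre - margin) / denom


print(np.round(confidence, 4))
print(wilson_lower_bound(90, 100), wilson_lower_bound(2, 2))
```

A product with 2 positive ratings out of 2 gets a much lower bound than one with 90 out of 100, which is the behavior wanted for small samples.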

### Question 10: Market Basket Analysis - Loyalty Patterns

Which product groups are users loyal to? Where is the churn risk?

In [None]:
# TODO: Analyze loyalty patterns
# 1. User loyalty metrics:
#    - Product diversity: Unique products / Total ratings
#    - Average rating trend: Early ratings vs Recent ratings
#    - Rating frequency: Time between ratings (if timestamp available)
#
# 2. Identify user types:
#    Loyal fans:
#    - High ratings consistently (>4.0)
#    - Low product diversity (stick to favorites)
#    - Regular activity
#
#    Explorers:
#    - High product diversity
#    - Variable ratings
#    - Frequent activity
#
#    Churned/At-risk:
#    - Declining average ratings over time
#    - Increasing time gaps between purchases
#    - Recent low ratings
#
#    One-time buyers:
#    - Single rating
#    - No return

# Calculate churn indicators:
# - Time since last rating (recency)
# - Negative rating trend
# - Comparison with historical behavior

# Business Application:
# - Retention campaigns: Target at-risk users
# - Win-back campaigns: Re-engage churned users
# - Loyalty rewards: Incentivize loyal fans
# - Product recommendations: 
#   * Loyal fans → Similar products in same category
#   * Explorers → Diverse recommendations
# - Customer lifetime value prediction

# Visualize:
# - Sankey diagram: User journey through product categories
# - Cohort analysis: Rating behavior over user lifecycle
# - RFM analysis adapted:
#   * Recency: Time since last rating
#   * Frequency: Number of ratings
#   * Monetary (proxy): Average rating (satisfaction level)
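
The adapted RFM metrics and the product-diversity ratio above can be computed per user with `np.unique`, `np.bincount`, and `np.maximum.at`. All `toy_*` arrays are invented, and recency is measured here against the latest timestamp in the data, which is an assumption of this sketch.

```python
import numpy as np

# Toy rating log: one entry per rating; illustrative values only.
toy_users = np.array(['a', 'a', 'a', 'b', 'b', 'c'])
toy_products = np.array(['p1', 'p1', 'p2', 'p1', 'p2', 'p3'])
toy_ratings = np.array([5.0, 4.0, 5.0, 2.0, 3.0, 4.0])
toy_times = np.array([100, 200, 300, 150, 400, 50])

# Frequency: number of ratings per user, plus a user index for every rating.
uniq_users, user_idx, frequency = np.unique(
    toy_users, return_inverse=True, return_counts=True)

# Recency: time between the latest timestamp overall and each user's last rating.
last_seen = np.full(len(uniq_users), toy_times.min())
np.maximum.at(last_seen, user_idx, toy_times)
recency = toy_times.max() - last_seen

# Satisfaction proxy: mean rating per user.
mean_rating = np.bincount(user_idx, weights=toy_ratings) / frequency

# Diversity: distinct products per user divided by total ratings per user.
uniq_products, product_idx = np.unique(toy_products, return_inverse=True)
pair_keys = np.unique(user_idx * len(uniq_products) + product_idx)
distinct_products = np.bincount(pair_keys // len(uniq_products),
                                minlength=len(uniq_users))
diversity = distinct_products / frequency

print(uniq_users, frequency, recency, np.round(mean_rating, 2), np.round(diversity, 2))
```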

### Question 11: Price-Quality Perception (if price data is available)

Is there a relationship between ratings and product price tier?

In [None]:
# TODO: Price-Quality analysis (if price data available in product metadata)
# Note: If there is no price data, it may be possible to infer it from rating patterns
#
# 1. Categorize products by price tier (if available):
#    - Budget: < 25th percentile
#    - Mid-range: 25th - 75th percentile
#    - Premium: > 75th percentile
#
# 2. Alternative: Infer "perceived value" from ratings:
#    High-value products: High ratings + Many reviews
#    Low-value products: Low ratings despite popularity
#
# 3. Analyze:
#    - Average rating per price tier
#    - Rating variance per price tier
#    - Customer expectations: Do expensive products need higher ratings?
#    - Value for money: High rating + Lower price tier

# Calculate "bang for buck" score:
# value_score = avg_rating / (price_tier_normalized + epsilon)

# Identify:
# - Overperformers: Budget/Mid-range with premium-level ratings
# - Underperformers: Premium with mediocre ratings
# - Sweet spots: Best value for money

# Business Application:
# - Pricing strategy: Adjust prices based on perceived value
# - Marketing positioning: Highlight value products
# - Premium justification: Ensure premium products deliver quality
# - Recommendation diversity: Mix price tiers in recommendations
# - Competitive analysis: Compare similar products across tiers

# Visualize:
# - Box plots: Rating distribution per price tier
# - Scatter plot: Price vs Rating (if data available)
# - Bar chart: Value score leaders
# - Heatmap: Price tier × Rating category matrix

### Question 12: Rating Velocity - Momentum Analysis

Which products are suddenly receiving many ratings? (Viral products)

In [None]:
# TODO: Analyze rating velocity and momentum (requires timestamp)
# 1. Calculate rating velocity:
#    velocity = ratings_in_recent_period / ratings_in_previous_period
#    
# 2. Calculate acceleration:
#    acceleration = change in velocity over time
#
# 3. Identify patterns:
#    Viral products:
#    - High velocity (many recent ratings)
#    - Positive acceleration (accelerating growth)
#    - Sudden spike in activity
#
#    Steady growers:
#    - Velocity consistently above 1 (as a ratio, velocity is never negative)
#    - Stable acceleration
#
#    Declining products:
#    - Velocity below 1 (fewer ratings than in the previous period)
#    - Negative acceleration
#
#    Seasonal products:
#    - Periodic velocity spikes
#    - Predictable patterns

# Calculate momentum score:
# momentum = (recent_ratings / avg_ratings_per_period) * (recent_avg_rating / overall_avg)

# Business Application:
# - Trend spotting: Catch viral products early
# - Inventory management: Stock up on high-momentum products
# - Marketing timing: Promote products at peak momentum
# - Recommendation freshness: Feature trending products
# - Competitive intelligence: Track momentum vs competitors
# - Product launch success: Monitor new product momentum

# Visualize:
# - Time series: Rating count over time for viral products
# - Velocity chart: Rate of change visualization
# - Momentum heatmap: Products × Time periods
# - Acceleration scatter: Current velocity vs acceleration
# - Top movers dashboard: Biggest gainers/losers

# Real-world examples:
# - Holiday season spikes (e.g., gift sets)
# - Influencer effects (sudden popularity after review)
# - Seasonal trends (sunscreen in summer)
# - Competitor product failures (switchers)
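
One possible way to compute velocity and acceleration from the timestamp column is to bucket ratings into fixed-length periods and compare neighbouring buckets. The sketch below assumes Unix timestamps in seconds and 30-day periods; `period_counts` is a hypothetical helper, and the add-one smoothing in the ratio is an assumption added to avoid division by zero for products with no ratings in the earlier period.

```python
import numpy as np

def period_counts(product_idx, timestamps, n_products, period_seconds):
    """Count ratings per product and per fixed-length time period."""
    period = (timestamps - timestamps.min()) // period_seconds
    n_periods = int(period.max()) + 1
    # Encode (product, period) as one flat index so a single bincount suffices
    flat = product_idx * n_periods + period
    counts = np.bincount(flat, minlength=n_products * n_periods)
    return counts.reshape(n_products, n_periods)

DAY = 86400
# Toy data: product 0 is accelerating, product 1 is fading
product_idx = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
timestamps = np.array([0, 35, 40, 65, 70, 75, 80, 5, 10, 15, 45]) * DAY

counts = period_counts(product_idx, timestamps, n_products=2,
                       period_seconds=30 * DAY)

# velocity = ratings in a period / ratings in the previous period (smoothed)
velocity = (counts[:, 1:] + 1) / (counts[:, :-1] + 1)
# acceleration = change in velocity between consecutive periods
acceleration = np.diff(velocity, axis=1)

# Viral candidates: latest velocity above 1 and still increasing
viral = np.where((velocity[:, -1] > 1) & (acceleration[:, -1] > 0))[0]
print(counts, velocity, viral)
```

On the full dataset the period length matters: short periods make low-volume products look noisy, so a minimum rating count per period is probably needed before labelling anything viral.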

### Question 13: User Influence Score - Who Are the Key Opinion Leaders?

Identify highly influential users (users whose ratings predict the behavior of others)

In [None]:
# TODO: Calculate user influence scores
# Influence factors:
# 1. Activity level:
#    - Number of ratings
#    - Diversity of products rated
#
# 2. Early adopter behavior:
#    - Among first to rate new products (if timestamp)
#    - Ratings on niche/less popular items
#
# 3. Rating impact:
#    - Agreement with community: How often user rating aligns with final consensus
#    - Predictive power: User rates high → product becomes popular
#    - Contrarian accuracy: User finds hidden gems others missed
#
# 4. Expertise indicators:
#    - Detailed ratings (if text reviews available)
#    - Consistent rating behavior
#    - Coverage of product categories

# Calculate influence score:
# influence = w1 * activity_score + 
#             w2 * early_adopter_score + 
#             w3 * prediction_accuracy_score +
#             w4 * diversity_score

# For prediction accuracy:
# - Compare user's early rating vs product's eventual average
# - Reward users whose ratings predict future popularity
# - Calculate correlation between user rating and product success

# Identify user tiers:
# - Influencers: High influence, high activity, early adopters
# - Experts: High accuracy, niche focus, consistent ratings
# - Casual users: Low influence, sporadic activity
# - Followers: Late adopters, rate popular items

# Business Application:
# - Influencer partnerships: Engage high-influence users
# - Beta testing: Invite influencers to try new products
# - User-generated content: Encourage reviews from experts
# - Weighted recommendations: Give more weight to influencer ratings
# - Community building: Create expert/influencer badges
# - Product seeding: Send samples to key opinion leaders
# - Credibility indicators: Show "Expert rated 4.5★" separately

# Visualize:
# - Influence distribution: Histogram of influence scores
# - Scatter plot: Activity vs Influence (are they correlated?)
# - Network graph: User influence relationships
# - Leaderboard: Top 50 most influential users
# - Influence over time: Track how influence changes
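
A simplified version of the weighted influence score could look like the sketch below. It covers three of the four terms (activity, early adoption, agreement with the eventual product average) and leaves out the diversity term, since the ratings file has no category column. The function name `influence_scores`, the weights `(0.4, 0.3, 0.3)`, and the 1 to 5 rating scale used for normalization are assumptions for illustration.

```python
import numpy as np

def influence_scores(user_idx, product_idx, ratings, timestamps,
                     n_users, n_products, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of activity, early-adopter and accuracy scores per user."""
    w_activity, w_early, w_accuracy = weights   # hypothetical weights

    user_counts = np.bincount(user_idx, minlength=n_users)
    safe_counts = np.maximum(user_counts, 1)    # avoid division by zero

    # Activity: log-scaled number of ratings, normalized to [0, 1]
    activity = np.log1p(user_counts) / np.log1p(user_counts.max())

    # Early adopter: share of a user's ratings that were the first for the product
    first_ts = np.full(n_products, np.iinfo(np.int64).max, dtype=np.int64)
    np.minimum.at(first_ts, product_idx, timestamps)
    is_first = (timestamps == first_ts[product_idx]).astype(float)
    early = np.bincount(user_idx, weights=is_first, minlength=n_users) / safe_counts

    # Accuracy: 1 - mean absolute gap to the product's final average,
    # scaled by 4 (the largest gap on a 1-5 scale). The user's own rating
    # is included in that average, which slightly flatters every user.
    product_counts = np.maximum(np.bincount(product_idx, minlength=n_products), 1)
    product_avg = np.bincount(product_idx, weights=ratings,
                              minlength=n_products) / product_counts
    abs_err = np.abs(ratings - product_avg[product_idx])
    mean_err = np.bincount(user_idx, weights=abs_err, minlength=n_users) / safe_counts
    accuracy = 1.0 - mean_err / 4.0

    return w_activity * activity + w_early * early + w_accuracy * accuracy

# Toy data: user 0 rates first and agrees with the consensus
user_idx = np.array([0, 0, 1, 1, 2])
product_idx = np.array([0, 1, 0, 1, 0])
ratings = np.array([5.0, 4.0, 5.0, 4.0, 1.0])
timestamps = np.array([1, 1, 2, 2, 3])

scores = influence_scores(user_idx, product_idx, ratings, timestamps,
                          n_users=3, n_products=2)
leaderboard = np.argsort(-scores)   # user indices, most influential first
print(leaderboard, np.round(scores, 3))
```

Because most users in sparse rating data have only one or two ratings, a minimum activity threshold before ranking is likely needed to keep the leaderboard meaningful.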

### Question 14: Recommendation Diversity vs Accuracy Trade-off

Balance between recommending similar products and exploring new ones

In [None]:
# TODO: Analyze diversity patterns in user behavior
# 1. Calculate user exploration behavior:
#    Diversity Index = Entropy of product categories rated
#    entropy = -Σ(p_i * log(p_i))
#    where p_i = proportion of ratings in category i
#
# 2. Similarity within user's rated products:
#    - Calculate avg pairwise similarity of products user rated
#    - Use rating patterns of all users as feature space
#    - High similarity = user likes similar products (easy to recommend)
#    - Low similarity = diverse taste (hard but interesting)
#
# 3. Exploration vs Exploitation:
#    Exploitation: Rating products similar to past high-rated items
#    Exploration: Rating diverse/different products
#
# 4. Calculate per user:
#    - % of ratings on popular products (>100 reviews)
#    - % of ratings on niche products (<10 reviews)
#    - Sequential pattern: Do they alternate or cluster?

# Segment users by recommendation strategy:
# - Conservative users (high similarity, low diversity):
#   → Recommend similar popular items (safe bets)
#   → Collaborative filtering works well
#
# - Adventurous users (low similarity, high diversity):
#   → Recommend diverse items from multiple categories
#   → Content-based + serendipity important
#   → "Because you like variety" recommendations
#
# - Balanced users:
#   → Hybrid approach with some exploration

# Business Application:
# - Personalized recommendation strategy per user type
# - A/B testing: Test diversity levels in recommendations
# - User satisfaction: Match recommendation diversity to user preference
# - Product discovery: Help users find hidden gems
# - Filter bubble avoidance: Prevent over-specialization
# - Long-tail promotion: Use diversity-loving users to promote niche items

# Calculate system-level metrics:
# - Aggregate diversity: How many unique items recommended across all users?
# - Coverage: What % of catalog gets recommended?
# - Serendipity: How often do users rate high on unexpected recommendations?

# Visualize:
# - User scatter: Diversity Index vs Average Rating
# - Distribution: Histogram of user diversity scores
# - Category sunburst: Rating distribution across categories per user type
# - Trade-off curve: Recommendation accuracy vs diversity
# - Temporal: Does diversity change over user lifecycle?
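
The per-user diversity index and the popular/niche shares can be sketched as follows. The category array is hypothetical, because the ratings file contains no category column; `user_category_entropy`, `popularity_shares`, and the toy thresholds are illustrative names and values (the notebook's own cut-offs are >100 and <10 reviews).

```python
import numpy as np

def user_category_entropy(user_idx, category_of_rating, n_users, n_categories):
    """Shannon entropy of each user's ratings across categories."""
    flat = user_idx * n_categories + category_of_rating
    counts = np.bincount(flat, minlength=n_users * n_categories)
    counts = counts.reshape(n_users, n_categories)
    totals = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, totals,
                  out=np.zeros(counts.shape, dtype=float), where=totals > 0)
    # Treat 0 * log(0) as 0 by only taking the log where p > 0
    log_p = np.log(p, out=np.zeros_like(p), where=p > 0)
    return -(p * log_p).sum(axis=1)

def popularity_shares(user_idx, product_idx, n_users, n_products,
                      popular_min=100, niche_max=10):
    """Share of each user's ratings on popular (> popular_min reviews)
    and niche (< niche_max reviews) products."""
    product_counts = np.bincount(product_idx, minlength=n_products)
    per_rating = product_counts[product_idx]
    user_counts = np.maximum(np.bincount(user_idx, minlength=n_users), 1)
    popular = np.bincount(user_idx, weights=(per_rating > popular_min).astype(float),
                          minlength=n_users) / user_counts
    niche = np.bincount(user_idx, weights=(per_rating < niche_max).astype(float),
                        minlength=n_users) / user_counts
    return popular, niche

# Toy data: user 0 stays in one category, user 1 spreads over four
user_idx = np.array([0, 0, 0, 0, 1, 1, 1, 1])
product_idx = np.array([0, 1, 2, 3, 0, 4, 5, 6])
category_of_product = np.array([0, 0, 0, 0, 1, 2, 3])   # hypothetical labels
category_of_rating = category_of_product[product_idx]

entropy = user_category_entropy(user_idx, category_of_rating,
                                n_users=2, n_categories=4)
popular, niche = popularity_shares(user_idx, product_idx, n_users=2,
                                   n_products=7, popular_min=1, niche_max=2)
print(entropy, popular, niche)
```

A user with entropy near 0 fits the conservative segment, while entropy near `log(n_categories)` points to the adventurous segment; where to draw the boundary between them is a judgment call to validate on the real distribution.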