# Modeling - Amazon Beauty Products Recommendation System

**Goal:** Build a recommendation system from scratch with NumPy

## Approaches:
1. **Baseline Models:** Mean ratings, popularity-based
2. **Collaborative Filtering:** User-based and Item-based
3. **Matrix Factorization:** SVD implementation
4. **Advanced (Optional):** Hybrid approaches

## Evaluation Metrics:
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- Precision@K, Recall@K
- NDCG (Normalized Discounted Cumulative Gain)

## 1. Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from time import time

warnings.filterwarnings('ignore')
np.set_printoptions(precision=4, suppress=True)

EPSILON = 1e-10
np.random.seed(42)

print("✓ Libraries imported!")

## 2. Load Processed Data

Load the processed data produced in notebook 02.

In [None]:
# Load the processed arrays saved by notebook 02
train_data = np.load('../data/processed/train_data.npy')
val_data = np.load('../data/processed/val_data.npy')
test_data = np.load('../data/processed/test_data.npy')
user_item_matrix = np.load('../data/processed/user_item_matrix.npy')

print("Data loaded!")

## 3. Evaluation Metrics

Implement the metrics used to evaluate the recommendation system.
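
The ranking metrics can be sketched for a single user as follows. This is a minimal illustration, not the notebook's final implementation: the function names `precision_recall_at_k` and `ndcg_from_scores` are hypothetical, `y_true` is assumed to hold the user's actual ratings for a set of candidate items, `y_score` the predicted scores for the same items, and the NDCG gain is assumed to be the raw rating (linear gain).

```python
import numpy as np

def precision_recall_at_k(y_true, y_score, k=10, threshold=4.0):
    """Precision@K and Recall@K for one user's candidate items."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    k = min(k, len(y_score))
    if k == 0:
        return 0.0, 0.0
    # Indices of the k highest predicted scores (stable sort keeps ties deterministic)
    top_k = np.argsort(-y_score, kind="stable")[:k]
    # An item counts as relevant when its true rating reaches the threshold
    relevant = y_true >= threshold
    hits = relevant[top_k].sum()
    precision = hits / k
    recall = hits / relevant.sum() if relevant.any() else 0.0
    return float(precision), float(recall)

def ndcg_from_scores(y_true, y_score, k=10):
    """NDCG@K with linear gain (assumption: gain = true rating)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    k = min(k, len(y_score))
    if k == 0:
        return 0.0
    # Position discounts 1/log2(rank + 1) for ranks 1..k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    order = np.argsort(-y_score, kind="stable")[:k]
    dcg = np.sum(y_true[order] * discounts)
    # Ideal DCG: the same discounts applied to the best possible ordering
    ideal = np.sort(y_true)[::-1][:k]
    idcg = np.sum(ideal * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0
```

To report a dataset-level figure, these per-user values would typically be averaged over all users that have at least one candidate item.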

In [None]:
# TODO: Implement evaluation metrics using NumPy

def rmse(y_true, y_pred):
    """
    Root Mean Squared Error
    Formula: sqrt(mean((y_true - y_pred)^2))
    """
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """
    Mean Absolute Error
    Formula: mean(|y_true - y_pred|)
    """
    return np.mean(np.abs(y_true - y_pred))

def precision_at_k(y_true, y_pred, k=10, threshold=4.0):
    """
    Precision@K: Proportion of recommended items that are relevant
    """
    # TODO: Implement
    pass

def recall_at_k(y_true, y_pred, k=10, threshold=4.0):
    """
    Recall@K: Proportion of relevant items that are recommended
    """
    # TODO: Implement
    pass

def ndcg_at_k(y_true, y_pred, k=10):
    """
    Normalized Discounted Cumulative Gain
    Measures ranking quality
    """
    # TODO: Implement DCG and IDCG
    pass

print("✓ Metrics implemented!")

## 4. Baseline Models

Implement simple baseline models.

### 4.1. Global Mean Baseline

In [None]:
# TODO: Global mean baseline - predict mean rating for all
global_mean = np.mean(train_data[:, 2])  # Column 2: ratings
baseline_pred = np.full(len(test_data), global_mean)

# Evaluate on the test set
rmse_baseline = rmse(test_data[:, 2], baseline_pred)
print(f"Global Mean RMSE: {rmse_baseline:.4f}")

### 4.2. User Mean Baseline
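
A per-user (or per-item) mean baseline can be sketched with `np.bincount`. The helper name `group_mean_predict` is hypothetical, and the sketch assumes the rows are `(user_id, item_id, rating)` with non-negative integer ids, as in the global mean baseline. IDs that never appear in the training data fall back to the global mean.

```python
import numpy as np

def group_mean_predict(train, test, col):
    """Predict each test rating as the train mean of its user (col=0) or item (col=1)."""
    global_mean = train[:, 2].mean()
    ids = train[:, col].astype(int)
    n = ids.max() + 1
    # Sum and count of ratings per id, computed without Python loops
    sums = np.bincount(ids, weights=train[:, 2], minlength=n)
    counts = np.bincount(ids, minlength=n)
    # Ids below n that have no ratings keep the global mean
    means = np.full(n, global_mean)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen]
    # Ids outside the training range (cold start) also get the global mean
    test_ids = test[:, col].astype(int)
    preds = np.full(len(test), global_mean)
    known = test_ids < n
    preds[known] = means[test_ids[known]]
    return preds
```

Calling it with `col=1` gives the item mean baseline of section 4.3 under the same assumptions.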

In [None]:
# TODO: User mean baseline - predict user's average rating
# Calculate mean rating per user
# For each test case, predict user's mean rating
# If user is new, use global mean

### 4.3. Item Mean Baseline

In [None]:
# TODO: Item mean baseline - predict product's average rating
# Calculate mean rating per product
# For each test case, predict product's mean rating
# If product is new, use global mean

### 4.4. Bias-Based Baseline

Combining user and item biases:  
**Prediction = μ + b_u + b_i**

Where:
- μ = global mean
- b_u = user bias (user_mean - global_mean)
- b_i = item bias (item_mean - global_mean)
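
The definitions above translate into a short vectorized sketch. The function name `bias_baseline_predict` is hypothetical, the rows are assumed to be `(user_id, item_id, rating)` with non-negative integer ids, unseen users or items get a bias of 0, and clipping to the 1 to 5 range is an assumption about the rating scale.

```python
import numpy as np

def bias_baseline_predict(train, test, clip=(1.0, 5.0)):
    """Predict mu + b_u + b_i, with b_u and b_i as mean offsets from the global mean."""
    mu = train[:, 2].mean()
    users = train[:, 0].astype(int)
    items = train[:, 1].astype(int)
    n_users, n_items = users.max() + 1, items.max() + 1

    def mean_offset(ids, n):
        # mean(rating - mu) per id equals (per-id mean - mu); ids with no ratings get 0
        counts = np.bincount(ids, minlength=n)
        sums = np.bincount(ids, weights=train[:, 2] - mu, minlength=n)
        return sums / np.maximum(counts, 1)

    b_u = mean_offset(users, n_users)
    b_i = mean_offset(items, n_items)

    test_users = test[:, 0].astype(int)
    test_items = test[:, 1].astype(int)
    # Out-of-range ids (cold start) contribute a zero bias
    bu = np.where(test_users < n_users, b_u[np.minimum(test_users, n_users - 1)], 0.0)
    bi = np.where(test_items < n_items, b_i[np.minimum(test_items, n_items - 1)], 0.0)
    return np.clip(mu + bu + bi, *clip)
```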

In [None]:
# TODO: Implement bias-based baseline
# 1. Calculate global mean
# 2. Calculate user biases
# 3. Calculate item biases
# 4. Predict: μ + b_u + b_i

## 5. Collaborative Filtering - User-Based

User-based CF: recommend based on similar users.

### 5.1. User Similarity Calculation

**Cosine Similarity:**  
$$\text{sim}(u, v) = \frac{\sum_i r_{ui} \cdot r_{vi}}{\sqrt{\sum_i r_{ui}^2} \cdot \sqrt{\sum_i r_{vi}^2}}$$

**Pearson Correlation:**  
$$\text{sim}(u, v) = \frac{\sum_i (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_i (r_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_i (r_{vi} - \bar{r}_v)^2}}$$

In [None]:
# TODO: Implement similarity calculations using NumPy

def cosine_similarity(matrix):
    """
    Calculate pairwise cosine similarity between rows (users)
    Uses vectorization with np.dot() and broadcasting
    """
    # Normalize rows
    norms = np.sqrt(np.sum(matrix ** 2, axis=1, keepdims=True))
    normalized = matrix / (norms + EPSILON)
    
    # Compute similarity matrix
    similarity = np.dot(normalized, normalized.T)
    return similarity

def pearson_correlation(matrix):
    """
    Calculate pairwise Pearson correlation between rows (users)
    """
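    # Center each row by subtracting its mean. Note: the mean is taken over all
    # columns, so unrated entries stored as 0 are treated as ratings of 0.
    mean_centered = matrix - np.mean(matrix, axis=1, keepdims=True)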
    
    # Calculate correlation
    norms = np.sqrt(np.sum(mean_centered ** 2, axis=1, keepdims=True))
    normalized = mean_centered / (norms + EPSILON)
    correlation = np.dot(normalized, normalized.T)
    
    return correlation

# Compute user similarity matrix
# user_similarity = cosine_similarity(user_item_matrix)

### 5.2. User-Based Prediction

**Prediction formula:**  
$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}$$

Where N(u) = k most similar users who rated item i
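
The prediction formula can be sketched as a standalone function before wiring it into a class. The name `predict_user_based` is hypothetical, and the sketch assumes a dense user-item matrix `R` in which 0 marks an unrated item, plus a precomputed user-user similarity matrix `sim`. When no neighbour rated the item, it falls back to the user's own mean.

```python
import numpy as np

def predict_user_based(R, sim, user, item, k=20):
    """Mean-centered weighted average over the k most similar users who rated `item`."""
    rated = R > 0
    counts = rated.sum(axis=1)
    # Mean over rated entries only (unrated entries are 0 and add nothing to the sum)
    user_means = R.sum(axis=1) / np.maximum(counts, 1)

    # Candidate neighbours: users other than `user` who rated `item`
    candidates = np.flatnonzero(rated[:, item])
    candidates = candidates[candidates != user]
    if candidates.size == 0:
        return float(user_means[user])

    # Keep the k candidates most similar to `user`
    order = np.argsort(-sim[user, candidates])[:k]
    neighbours = candidates[order]
    weights = sim[user, neighbours]
    denom = np.abs(weights).sum()
    if denom == 0:
        return float(user_means[user])

    # Each neighbour contributes its deviation from its own mean rating
    deviations = R[neighbours, item] - user_means[neighbours]
    return float(user_means[user] + weights @ deviations / denom)
```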

In [None]:
# TODO: Implement user-based CF prediction

class UserBasedCF:
    def __init__(self, k=20, similarity='cosine'):
        """
        k: number of similar users to consider
        similarity: 'cosine' or 'pearson'
        """
        self.k = k
        self.similarity = similarity
        self.user_similarity = None
        self.user_means = None
    
    def fit(self, user_item_matrix):
        """Train the model"""
        # Calculate similarity
        if self.similarity == 'cosine':
            self.user_similarity = cosine_similarity(user_item_matrix)
        else:
            self.user_similarity = pearson_correlation(user_item_matrix)
        
        # Calculate user means
        # Handle zeros (non-rated items)
        # TODO: Implement
        pass
    
    def predict(self, user_id, item_id):
        """Predict rating for a user-item pair"""
        # Find k most similar users who rated this item
        # Weighted average of their ratings
        # TODO: Implement using NumPy
        pass
    
    def recommend(self, user_id, n=10):
        """Recommend top-n items for a user"""
        # Predict for all unrated items
        # Return top-n highest predictions
        # TODO: Implement
        pass

# Train and evaluate
# model = UserBasedCF(k=20)
# model.fit(user_item_matrix)

## 6. Collaborative Filtering - Item-Based

Item-based CF: recommend based on similar items.

### 6.1. Item Similarity

Calculate similarity between items (products)

In [None]:
# TODO: Calculate item similarity
# Transpose the user-item matrix to get the item-user matrix
# item_item_matrix = user_item_matrix.T
# item_similarity = cosine_similarity(item_item_matrix)

### 6.2. Item-Based Prediction

**Prediction formula:**  
$$\hat{r}_{ui} = \frac{\sum_{j \in N(i)} \text{sim}(i,j) \cdot r_{uj}}{\sum_{j \in N(i)} |\text{sim}(i,j)|}$$

Where N(i) = k most similar items rated by user u
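
As with the user-based case, the formula can be sketched as a standalone function. The name `predict_item_based` is hypothetical; `R` is assumed to be a dense user-item matrix with 0 for unrated items and `item_sim` a precomputed item-item similarity matrix. The fallback to the mean of all observed ratings is an illustrative choice for users with no usable neighbours.

```python
import numpy as np

def predict_item_based(R, item_sim, user, item, k=20):
    """Weighted average of the user's ratings on the k items most similar to `item`."""
    observed = R[R > 0]
    fallback = float(observed.mean()) if observed.size else 0.0

    # Items this user has rated, excluding the target item itself
    rated = np.flatnonzero(R[user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return fallback

    # Keep the k rated items most similar to the target item
    order = np.argsort(-item_sim[item, rated])[:k]
    neighbours = rated[order]
    weights = item_sim[item, neighbours]
    denom = np.abs(weights).sum()
    if denom == 0:
        return fallback
    return float(weights @ R[user, neighbours] / denom)
```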

In [None]:
# TODO: Implement item-based CF

class ItemBasedCF:
    def __init__(self, k=20, similarity='cosine'):
        self.k = k
        self.similarity = similarity
        self.item_similarity = None
    
    def fit(self, user_item_matrix):
        """Train the model"""
        # Calculate item-item similarity
        # TODO: Implement
        pass
    
    def predict(self, user_id, item_id):
        """Predict rating"""
        # Find k most similar items rated by user
        # Weighted average
        # TODO: Implement
        pass
    
    def recommend(self, user_id, n=10):
        """Recommend top-n items"""
        # TODO: Implement
        pass

## 7. Matrix Factorization (SVD)

Implement SVD from scratch with NumPy - **BONUS POINTS!**

### 7.1. SVD Basics

**Matrix Factorization:** R ≈ U × Σ × V^T

Where:
- R: user-item matrix (m × n)
- U: user features (m × k)
- Σ: singular values (k × k)
- V: item features (n × k)
- k: number of latent factors

In [None]:
# TODO: Implement SVD using NumPy

def svd_decomposition(matrix, k=20):
    """
    Perform SVD decomposition using NumPy's linalg.svd
    """
    # Use numpy's SVD
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    
    # Keep only top k factors
    U_k = U[:, :k]
    sigma_k = sigma[:k]
    Vt_k = Vt[:k, :]
    
    # Reconstruct approximated matrix
    matrix_approx = U_k @ np.diag(sigma_k) @ Vt_k
    
    return U_k, sigma_k, Vt_k, matrix_approx

# Apply SVD
# U, sigma, Vt, R_approx = svd_decomposition(user_item_matrix, k=50)

### 7.2. Gradient Descent for Matrix Factorization

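Optimize with gradient descent - **ADVANCED BONUS!**

**Loss Function:**  
$$L = \sum_{(u,i) \in \text{observed}} \left[ (r_{ui} - \hat{r}_{ui})^2 + \lambda(||p_u||^2 + ||q_i||^2) \right], \quad \hat{r}_{ui} = p_u \cdot q_i^T$$

**Gradients** (for a single observed pair $(u, i)$):  
$$\frac{\partial L}{\partial p_u} = -2(r_{ui} - \hat{r}_{ui})q_i + 2\lambda p_u$$  
$$\frac{\partial L}{\partial q_i} = -2(r_{ui} - \hat{r}_{ui})p_u + 2\lambda q_i$$

The SGD updates in the `MatrixFactorization` class drop the constant factor 2 (it is absorbed into the learning rate) and extend the prediction with the global mean and the user and item biases.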
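
The gradient formulas can be checked numerically for a single pair with central finite differences. This is only a sanity-check sketch: the helper names `pair_loss`, `pair_gradients`, and `numerical_gradient` are hypothetical, and the random vectors and the values of `r` and `lam` are arbitrary test inputs.

```python
import numpy as np

def pair_loss(p, q, r, lam):
    """Regularized squared error for one observed (user, item) pair."""
    return (r - p @ q) ** 2 + lam * (p @ p + q @ q)

def pair_gradients(p, q, r, lam):
    """Analytic gradients of pair_loss with respect to p and q."""
    err = r - p @ q
    grad_p = -2 * err * q + 2 * lam * p
    grad_q = -2 * err * p + 2 * lam * q
    return grad_p, grad_q

def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
p, q = rng.normal(size=4), rng.normal(size=4)
r, lam = 4.0, 0.02

grad_p, grad_q = pair_gradients(p, q, r, lam)
num_p = numerical_gradient(lambda x: pair_loss(x, q, r, lam), p)
num_q = numerical_gradient(lambda x: pair_loss(p, x, r, lam), q)

# Both comparisons should agree up to finite-difference error
print(np.allclose(grad_p, num_p, atol=1e-5), np.allclose(grad_q, num_q, atol=1e-5))
```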

In [None]:
# TODO: Implement Matrix Factorization with SGD

class MatrixFactorization:
    def __init__(self, n_factors=20, learning_rate=0.01, reg=0.02, n_epochs=20):
        """
        n_factors: number of latent factors (k)
        learning_rate: α for gradient descent
        reg: λ for regularization
        n_epochs: number of training iterations
        """
        self.n_factors = n_factors
        self.lr = learning_rate
        self.reg = reg
        self.n_epochs = n_epochs
        self.user_factors = None  # P matrix
        self.item_factors = None  # Q matrix
        self.user_bias = None
        self.item_bias = None
        self.global_mean = None
    
    def fit(self, train_data):
        """
        Train using Stochastic Gradient Descent
        train_data: array of (user_id, item_id, rating)
        """
        # Initialize parameters
        n_users = int(np.max(train_data[:, 0])) + 1
        n_items = int(np.max(train_data[:, 1])) + 1
        
        # Random initialization
        self.user_factors = np.random.normal(0, 0.1, (n_users, self.n_factors))
        self.item_factors = np.random.normal(0, 0.1, (n_items, self.n_factors))
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.global_mean = np.mean(train_data[:, 2])
        
        # Training loop
        for epoch in range(self.n_epochs):
            # Visit samples in a fresh random order without mutating the caller's array
            order = np.random.permutation(len(train_data))
            
            # SGD updates
            for user_id, item_id, rating in train_data[order]:
                user_id = int(user_id)
                item_id = int(item_id)
                
                # Predict
                pred = self.predict(user_id, item_id)
                error = rating - pred
                
                # Copy the user factors so the item update uses the pre-update values
                p_u = self.user_factors[user_id].copy()
                
                # Update user factors
                self.user_factors[user_id] += self.lr * (
                    error * self.item_factors[item_id] - 
                    self.reg * p_u
                )
                
                # Update item factors
                self.item_factors[item_id] += self.lr * (
                    error * p_u -
                    self.reg * self.item_factors[item_id]
                )
                
                # Update biases
                self.user_bias[user_id] += self.lr * (error - self.reg * self.user_bias[user_id])
                self.item_bias[item_id] += self.lr * (error - self.reg * self.item_bias[item_id])
            
            # TODO: Calculate and print training RMSE
            # if epoch % 5 == 0:
            #     train_rmse = self.evaluate(train_data)
            #     print(f"Epoch {epoch}: RMSE = {train_rmse:.4f}")
    
    def predict(self, user_id, item_id):
        """Predict rating for user-item pair"""
        pred = (
            self.global_mean +
            self.user_bias[user_id] +
            self.item_bias[item_id] +
            np.dot(self.user_factors[user_id], self.item_factors[item_id])
        )
        return pred
    
    def evaluate(self, test_data):
        """Evaluate on test data"""
        predictions = []
        for user_id, item_id, _ in test_data:
            pred = self.predict(int(user_id), int(item_id))
            predictions.append(pred)
        
        return rmse(test_data[:, 2], np.array(predictions))

# Train model
# mf = MatrixFactorization(n_factors=50, learning_rate=0.01, reg=0.02, n_epochs=20)
# mf.fit(train_data)

## 8. Model Comparison

Compare the performance of the models.

In [None]:
# TODO: Compare all models
# Create a comparison table and visualization

models_results = {
    'Model': [],
    'RMSE': [],
    'MAE': [],
    'Training Time (s)': []
}

# Example:
# models_results['Model'].append('Global Mean')
# models_results['RMSE'].append(rmse_baseline)
# ...

# Visualize comparison
# Bar chart showing RMSE for each model
# Training time comparison

## 9. Hyperparameter Tuning

Tune the hyperparameters of the best model.
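
An exhaustive grid search can be sketched with `itertools.product`. The function name `grid_search_sketch` and the `fit_and_score` callable are hypothetical: the callable is assumed to train a model with the given keyword parameters and return its validation RMSE. The toy scoring function at the end only stands in for real training so the loop can be run in isolation.

```python
import itertools

def grid_search_sketch(param_grid, fit_and_score):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_rmse = None, float("inf")
    # Cartesian product of all candidate values, one tuple per combination
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        if score < best_rmse:
            best_params, best_rmse = params, score
    return best_params, best_rmse

# Toy stand-in for "train and return validation RMSE" (not a real model)
def toy_score(n_factors, reg):
    return (n_factors - 20) ** 2 + reg

best_params, best_rmse = grid_search_sketch(
    {"n_factors": [10, 20, 50], "reg": [0.01, 0.05]}, toy_score
)
```

With the `MatrixFactorization` class, `fit_and_score` would construct the model from the parameters, call `fit(train_data)`, and return `evaluate(val_data)`.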

In [None]:
# TODO: Implement grid search or random search using NumPy
# For example, tune Matrix Factorization:
# - n_factors: [10, 20, 50, 100]
# - learning_rate: [0.001, 0.01, 0.05]
# - regularization: [0.01, 0.02, 0.05]

def grid_search(param_grid, train_data, val_data):
    """
    Simple grid search implementation
    """
    best_params = None
    best_rmse = float('inf')
    
    # TODO: Iterate through all parameter combinations
    # Train the model and evaluate on the validation set
    # Keep track of best parameters
    
    return best_params, best_rmse

## 10. Final Evaluation on Test Set

Evaluate the best model on the test set.

In [None]:
# TODO: Final evaluation
# Use best model on test set
# Calculate all metrics:
# - RMSE, MAE
# - Precision@K, Recall@K
# - NDCG@K

# Visualize:
# - Prediction vs Actual scatter plot
# - Error distribution
# - Top-N recommendations for sample users

## 11. Recommendation Examples

Demonstrate recommendations for specific users.
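
Selecting the top-n items for one user can be sketched independently of the model. The helper name `top_n_unrated` is hypothetical; `scores` is assumed to hold the predicted rating of every item for that user and `already_rated` a boolean mask of items the user has already rated.

```python
import numpy as np

def top_n_unrated(scores, already_rated, n=10):
    """Indices of the n highest-scoring items the user has not rated, best first."""
    scores = np.asarray(scores, dtype=float).copy()
    # Exclude items the user already rated
    scores[np.asarray(already_rated, dtype=bool)] = -np.inf
    n = min(n, int(np.isfinite(scores).sum()))
    if n == 0:
        return np.array([], dtype=int)
    # argpartition finds the n best candidates without sorting the whole array
    top = np.argpartition(-scores, n - 1)[:n]
    # Sort only those n candidates by descending score
    return top[np.argsort(-scores[top], kind="stable")]
```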

In [None]:
# TODO: Show recommendations for sample users
# Select 5 users
# For each user:
# 1. Show their rating history
# 2. Show top-10 recommendations
# 3. Explain why (similar users/items, predicted ratings)

# Example:
# user_id = 123
# user_history = get_user_history(user_id)
# recommendations = model.recommend(user_id, n=10)
# print(f"User {user_id} rated {len(user_history)} products")
# print(f"Top 10 recommendations: {recommendations}")

## 12. Model Analysis & Insights

### 12.1. Learned Latent Factors

Analyze the latent factors learned by Matrix Factorization.
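
Two of these analyses can be sketched with plain NumPy: a PCA projection to 2D for plotting, and a cosine-similarity lookup in latent space. The helper names `project_to_2d` and `most_similar_rows` are hypothetical, and `factors` is assumed to be a user-factor or item-factor matrix with one row per user or item and at least two columns.

```python
import numpy as np

def project_to_2d(factors):
    """Project rows onto their first two principal components (PCA via SVD)."""
    centered = factors - factors.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

def most_similar_rows(factors, index, n=5):
    """Indices of the n rows with the highest cosine similarity to row `index`."""
    norms = np.linalg.norm(factors, axis=1)
    sims = factors @ factors[index] / (norms * norms[index] + 1e-10)
    # Never return the query row itself
    sims[index] = -np.inf
    return np.argsort(-sims)[:n]
```

The 2D coordinates can then be passed to a scatter plot, for example colored by average rating or popularity.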

In [None]:
# TODO: Analyze latent factors
# - Visualize user/item factors (dimensionality reduction to 2D)
# - Find similar users in latent space
# - Find similar items in latent space
# - Interpret factors (if possible)

### 12.2. Error Analysis
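
One way to relate error to the number of ratings is to bucket test rows by how active their user (or how popular their item) is in the training data. The sketch below is illustrative: the function name `mae_by_activity` and the bucket edges are assumptions, and the rows are assumed to be `(user_id, item_id, rating)` with non-negative integer ids.

```python
import numpy as np

def mae_by_activity(train, test, preds, col=0, edges=(1, 5, 20, 100)):
    """MAE on test rows, grouped by the train rating count of the user (col=0) or item (col=1)."""
    train_ids = train[:, col].astype(int)
    test_ids = test[:, col].astype(int)
    n = max(train_ids.max(), test_ids.max()) + 1
    # Number of training ratings per id (0 for ids seen only in the test set)
    counts = np.bincount(train_ids, minlength=n)
    abs_err = np.abs(test[:, 2] - preds)
    # Bucket 0: count < edges[0]; bucket j: edges[j-1] <= count < edges[j]; last: count >= edges[-1]
    bucket = np.digitize(counts[test_ids], edges)
    n_buckets = len(edges) + 1
    sizes = np.bincount(bucket, minlength=n_buckets)
    totals = np.bincount(bucket, weights=abs_err, minlength=n_buckets)
    # Buckets with no test rows stay NaN
    mae = np.full(n_buckets, np.nan)
    filled = sizes > 0
    mae[filled] = totals[filled] / sizes[filled]
    return mae, sizes
```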

In [None]:
# TODO: Analyze prediction errors
# - Which users are hardest to predict?
# - Which items are hardest to predict?
# - Relationship between error and:
#   * Number of ratings
#   * Rating variance
#   * User/item popularity

## 13. Summary & Conclusions

### Key Findings:

**Model Performance:**
- TODO: Best performing model and metrics
- TODO: Comparison with baselines
- TODO: Trade-offs (accuracy vs speed)

**Recommendation Quality:**
- TODO: Diversity of recommendations
- TODO: Coverage (% of items recommended)
- TODO: Cold start handling

**Implementation:**
- TODO: NumPy techniques used (vectorization, broadcasting, etc.)
- TODO: Numerical stability considerations
- TODO: Computational efficiency

**Business Insights:**
- TODO: Most recommended products
- TODO: User segments with different preferences
- TODO: Opportunities for improvement

**Limitations:**
- TODO: Data sparsity challenges
- TODO: Cold start problem
- TODO: Scalability considerations

**Future Improvements:**
- TODO: Content-based features
- TODO: Hybrid approaches
- TODO: Deep learning methods (if allowed)
- TODO: Real-time recommendations