# Week 3: Baseline Model

I will build a baseline recommendation model using collaborative filtering with Singular Value Decomposition (SVD). I will use the preprocessed user-item interaction matrix from `04_feature_engineering.ipynb` to:
- Train an SVD model.
- Evaluate performance using RMSE and Precision@K.
- Address business questions like personalization, conversion optimization, and algorithm performance.

# Baseline Model for Product Recommendations

## Introduction
This notebook aims to build a baseline model for product recommendations based on user interactions and features derived from the data. The model will address the following business questions:

1. How can we personalize product recommendations to enhance user satisfaction and retention?
2. What are the most effective strategies to improve conversion rates from views to transactions?
3. How do seasonal patterns or trends influence user interactions and sales?
4. Which products or categories are most popular or have the highest interaction rates?
5. How can we identify and filter out bot activity or anomalous user behavior to ensure accurate recommendations?
6. How can we balance recommendation accuracy with diversity to cater to a wide range of user preferences?
7. What machine learning algorithm provides the most accurate and efficient recommendations for our platform?

## Data Loading
We load the preprocessed data from feature engineering: the sparse user-item interaction matrix and the user and item ID mappings.


## Baseline Model Summary

I develop the following components for the baseline recommendation model:
- **SVD Model**: A collaborative filtering model using Singular Value Decomposition (SVD) with 20 factors, trained on the user-item interaction data. The trained model is held in memory during notebook execution; this notebook does not write it to disk.
- **Performance Metrics**: Root Mean Squared Error (RMSE) and Precision@5, computed on the held-out test set to evaluate rating accuracy and recommendation relevance (RMSE: 1.5190, Precision@5: 0.0027).
- **Interaction DataFrame**: A DataFrame (`interactions_df`) that maps the rows and columns of the sparse matrix back to original user and item IDs, built during execution as the training input.

These components establish a baseline for personalized recommendations. They partially address the business questions on personalization and algorithm performance, and they serve as a foundation for further modeling toward all seven business questions and the project objectives.
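
To make the two metrics concrete, here is a minimal sketch that computes RMSE and Precision@K on a handful of made-up predictions. The tuples and the names `rmse` and `precision_at_k_toy` are illustrative assumptions, not notebook outputs; the precision rule mirrors the simplified definition used in this notebook (an item is a hit when both its true and estimated rating reach the threshold, divided by the number of items in the user's top-k).

```python
import math
from collections import defaultdict

# Toy (user, item, true_rating, estimated_rating) tuples; all values are made up.
toy_predictions = [
    ("u1", "i1", 5.0, 4.2),
    ("u1", "i2", 1.0, 3.5),
    ("u1", "i3", 3.0, 2.0),
    ("u2", "i1", 1.0, 1.5),
    ("u2", "i4", 5.0, 3.0),
]

def rmse(preds):
    """Square root of the mean squared difference between true and estimated ratings."""
    squared = [(true_r - est) ** 2 for _, _, true_r, est in preds]
    return math.sqrt(sum(squared) / len(squared))

def precision_at_k_toy(preds, k=5, threshold=3.0):
    """Per-user hits in the top-k (by estimate), averaged over users."""
    by_user = defaultdict(list)
    for uid, _, true_r, est in preds:
        by_user[uid].append((est, true_r))
    per_user = []
    for ratings in by_user.values():
        top_k = sorted(ratings, key=lambda pair: pair[0], reverse=True)[:k]
        hits = sum(1 for est, true_r in top_k
                   if est >= threshold and true_r >= threshold)
        per_user.append(hits / len(top_k))
    return sum(per_user) / len(per_user)

print(f"RMSE: {rmse(toy_predictions):.4f}")
print(f"Precision@2: {precision_at_k_toy(toy_predictions, k=2):.4f}")  # prints 0.5000
```

With `k=2`, each toy user has exactly one hit among their two top-ranked items, so both per-user precisions are 0.5.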

### Details
- **Input Data**: Utilizes preprocessed files `user_item_sparse.pkl`, `user_ids.csv`, and `item_ids.csv` from feature engineering.
- **Optimizations**: Reduced the test set size to 10% and the number of SVD factors from 50 to 20 for faster runtime (<30 minutes).
- **Output**: Model performance metrics (RMSE, Precision@5) printed to console, with the trained model ready for evaluation and refinement.
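
As a rough illustration of what an SVD-style model with latent factors computes for one user-item pair, the sketch below applies the usual biased matrix-factorization rule: global mean plus user bias plus item bias plus the dot product of the two factor vectors, clipped to the rating scale. The function name `svd_predict` and all numbers are assumptions for illustration; with 20 factors the vectors would have length 20 rather than 3.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i, rating_scale=(1.0, 5.0)):
    """Estimate a rating as mu + b_u + b_i + p_u . q_i, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))

# Illustrative values only (not learned parameters from this notebook).
global_mean = 1.4               # mean rating over the training set
user_bias = 0.2                 # this user rates slightly above average
item_bias = -0.1                # this item is rated slightly below average
user_factors = [0.5, -0.3, 0.1]
item_factors = [0.4, 0.2, -0.6]

estimate = svd_predict(global_mean, user_bias, item_bias, user_factors, item_factors)
print(f"Estimated rating: {estimate:.2f}")  # prints 1.58
```

Reducing the number of factors shortens these vectors, which is why it cuts training time at some cost in expressiveness.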

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
import joblib
import os

# Define paths
preprocessed_data_dir = "../data/preprocessed_data/"

# Load user and item ID mappings
try:
    user_ids = pd.read_csv(os.path.join(preprocessed_data_dir, "user_ids.csv"), index_col="Unnamed: 0")["0"].to_dict()
    item_ids = pd.read_csv(os.path.join(preprocessed_data_dir, "item_ids.csv"), index_col="Unnamed: 0")["0"].to_dict()
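    # As loaded, each dict maps original ID -> matrix index (inferred from the lookups below,
    # which key on matrix row/column positions). Invert to get matrix index -> original ID.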
    user_idx_to_id = {v: k for k, v in user_ids.items()}
    item_idx_to_id = {v: k for k, v in item_ids.items()}
except KeyError as e:
    print(f"Error loading ID mappings: {e}. Check CSV structure.")
    raise

# Load the sparse matrix using joblib
try:
    with open(os.path.join(preprocessed_data_dir, "user_item_sparse.pkl"), "rb") as f:
        user_item_matrix = joblib.load(f)
except Exception as e:
    print(f"Error loading sparse matrix: {e}. Ensure it’s regenerated with current NumPy.")
    raise

# Convert sparse matrix to COO format to iterate over its non-zero entries
user_item_coo = coo_matrix(user_item_matrix)
rows, cols, ratings = user_item_coo.row, user_item_coo.col, user_item_coo.data

# Build (user_id, item_id, rating) triples; `ratings` avoids reusing the name `data`,
# which is assigned to the surprise Dataset below
interactions_list = [(user_idx_to_id[row], item_idx_to_id[col], rating) for row, col, rating in zip(rows, cols, ratings)]

# Create a DataFrame
interactions_df = pd.DataFrame(interactions_list, columns=['user_id', 'item_id', 'rating'])

# Define the reader for surprise
reader = Reader(rating_scale=(1, 5))  # Matches event weights: view=1, addtocart=3, transaction=5

# Load data into surprise Dataset
data = Dataset.load_from_df(interactions_df[['user_id', 'item_id', 'rating']], reader)

# Split into train and test sets (use a smaller test size for speed)
trainset, testset = train_test_split(data, test_size=0.1, random_state=42)  # Reduced to 10% for faster testing

# Train SVD model with fewer factors for speed
svd = SVD(n_factors=20, random_state=42)  # Reduced from 50 to 20
svd.fit(trainset)

# Evaluate on test set
predictions = svd.test(testset)

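# Calculate RMSE (accuracy.rmse prints "RMSE: ..." itself because verbose defaults to True,
# so the value appears twice in the output)
rmse = accuracy.rmse(predictions)
print(f"RMSE on Test Set: {rmse:.4f}")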

# Simplified Precision@K (K=5) for quick results
from collections import defaultdict

def precision_at_k(predictions, k=5, threshold=3):
    """Return Precision@k averaged over users; a hit needs both true and estimated rating >= threshold."""
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = []
    for uid, user_ratings in user_est_true.items():
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        n_rel_and_rec_k = sum((true_r >= threshold and est >= threshold)
                            for est, true_r in user_ratings[:k])
        n_rec_k = min(k, len(user_ratings))
        precisions.append(n_rel_and_rec_k / n_rec_k if n_rec_k > 0 else 0)

    return np.mean(precisions) if precisions else 0

# Store the result under a new name so the precision_at_k function is not overwritten
precision_at_5 = precision_at_k(predictions, k=5, threshold=3)
print(f"Precision@5: {precision_at_5:.4f}")

RMSE: 1.5190
RMSE on Test Set: 1.5190
Precision@5: 0.0027
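
One check that may help interpret the Precision@5 value is the share of test interactions whose true rating reaches the relevance threshold, since under the definition above only those items can ever count as hits. The sketch below is an assumption-laden helper, not part of the original run: `relevant_share` is a hypothetical name and the toy ratings are made up. In the notebook, the input could be built as `[true_r for _, _, true_r, _, _ in predictions]`.

```python
def relevant_share(true_ratings, threshold=3.0):
    """Fraction of interactions whose true rating is at or above the threshold."""
    if not true_ratings:
        return 0.0
    relevant = sum(1 for rating in true_ratings if rating >= threshold)
    return relevant / len(true_ratings)

# Made-up ratings on the 1/3/5 event-weight scale (view=1, addtocart=3, transaction=5).
toy_true_ratings = [1, 1, 1, 3, 1, 5, 1, 1, 1, 1]
print(f"Share of relevant interactions: {relevant_share(toy_true_ratings):.2f}")  # prints 0.20
```

If this share turns out to be small on the real test set, a low Precision@5 would be expected even from a reasonable ranking.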


### Business Questions Addressed
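- **Q1 (Personalize recommendations)**: Yes, via per-user SVD rating predictions.
- **Q2 (Improve conversion rates)**: Partially. Precision@5 (0.0027) measures how often top-ranked items are relevant, but the model needs further optimization before it can inform conversion.
- **Q7 (Best algorithm)**: Partially. The SVD results serve as a baseline for comparing other algorithms.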
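
Turning rating predictions into personalized recommendations amounts to scoring the items a user has not interacted with and keeping the highest-scoring ones. The sketch below shows that step with a hypothetical helper `recommend_top_n` and a stand-in scoring function; in the notebook, `predict_fn` could plausibly be `lambda u, i: svd.predict(u, i).est`, but that wiring is an assumption and is not part of the original code.

```python
def recommend_top_n(user_id, candidate_items, seen_items, predict_fn, n=5):
    """Score unseen candidate items for one user and return the n best (item, score) pairs."""
    scored = [
        (item, predict_fn(user_id, item))
        for item in candidate_items
        if item not in seen_items
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Stand-in scores for illustration; a real predict_fn would query the trained model.
toy_scores = {"a": 4.1, "b": 2.0, "c": 3.3, "d": 4.8}

def toy_predict(user_id, item_id):
    return toy_scores[item_id]

top = recommend_top_n("u1", ["a", "b", "c", "d"], seen_items={"d"},
                      predict_fn=toy_predict, n=2)
print(top)  # prints [('a', 4.1), ('c', 3.3)]
```

Item `d` has the highest score but is excluded because the user has already interacted with it.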

### Remaining Business Questions
- **Q3 (Seasonal patterns)**: Requires time-based features.
- **Q4 (Popular products)**: Needs post-analysis.
- **Q5 (Anomaly filtering)**: Requires pre-processing.
- **Q6 (Diversity)**: Needs hybrid modeling.
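
Two of these follow-ups can be sketched from interaction triples alone: ranking items by interaction count (Q4) and flagging unusually active users as candidates for anomaly review (Q5). The helper names, the toy data, and the `max_events` cutoff below are all assumptions for illustration; a real bot filter would likely need the raw event log with timestamps rather than the aggregated matrix.

```python
from collections import Counter

def most_popular_items(interactions, n=3):
    """Return the n items with the most interactions as (item, count) pairs."""
    counts = Counter(item for _, item, _ in interactions)
    return counts.most_common(n)

def flag_heavy_users(interactions, max_events=1000):
    """Return users whose interaction count exceeds max_events (a hypothetical cutoff)."""
    counts = Counter(user for user, _, _ in interactions)
    return {user for user, count in counts.items() if count > max_events}

# Made-up (user_id, item_id, rating) triples in the same shape as interactions_df rows.
toy_interactions = [
    ("u1", "i1", 1), ("u2", "i1", 3), ("u3", "i1", 1),
    ("u1", "i2", 5), ("u2", "i2", 1),
    ("u1", "i3", 1),
]

print(most_popular_items(toy_interactions, n=2))        # prints [('i1', 3), ('i2', 2)]
print(flag_heavy_users(toy_interactions, max_events=2))  # prints {'u1'}
```

In the notebook, the triples could be taken from `interactions_df.itertuples(index=False)`, though that call is not in the original code.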

### Project Objectives Supported
- **Personalization**: Achieved with SVD.
- **Conversion Rates**: Partially achieved.
- **Historical Data Utilization**: Achieved.
- **Engagement**: Partially achieved.
- **Scalability**: Achieved with optimizations.
- **Sales Boost**: Partially achieved.
- **Diversity**: Not achieved.

### Next Steps
- **Model Tuning**: Adjust parameters (e.g., factors, threshold) using cross-validation.
- **Advanced Modeling**: Incorporate time features, anomaly filtering, and hybrid models for diversity.
- **Evaluation**: Compare with other algorithms and validate in a live setting.
