# SVD (Singular Value Decomposition) Experiment

This notebook tests and validates the SVD implementation for collaborative filtering.

## Theory

Truncated SVD approximates the ratings matrix $R$ (m users × n items) as a product of three matrices:
$$R \approx U \Sigma V^T$$

Where:
- $U$ is the user feature matrix (m × k)
- $\Sigma$ is the diagonal matrix of singular values (k × k)
- $V^T$ is the item feature matrix transposed (k × n)
- k is the number of latent factors

Keeping only the top k singular values yields the best rank-k approximation of $R$ in the least-squares (Frobenius norm) sense, so the retained factors capture the strongest patterns in the ratings.
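
As a small illustration of the rank-k truncation described above, the sketch below applies NumPy's `np.linalg.svd` to a toy 4 × 3 matrix. The matrix values and the helper name `truncated_svd` are made up for this example; it is not the code in `src.models.svd`.

```python
import numpy as np


def truncated_svd(R, k):
    """Return the rank-k approximation of R built from its top k singular values."""
    # np.linalg.svd returns the singular values s sorted in descending order
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Toy ratings matrix (4 users x 3 items); values are illustrative only
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

for k in (1, 2, 3):
    R_k = truncated_svd(R, k)
    error = np.linalg.norm(R - R_k)  # Frobenius norm of the residual
    print(f"k={k}: reconstruction error = {error:.4f}")
```

The error shrinks as k grows and reaches zero (up to floating-point precision) at k = 3, because a 4 × 3 matrix has at most three non-zero singular values.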

In [1]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_reading import read_ratings_file
from src.evaluation import temporal_split, evaluate_rmse
from src.models.svd import solve_with_svd

np.random.seed(42)

## Load and Split Data

We use a temporal split for a realistic evaluation: the model trains on earlier ratings and is tested on later ones.
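
The implementation of `temporal_split` lives in `src.evaluation` and is not shown in this notebook. The sketch below is only one plausible way such a split could work, assuming a `timestamp` column; the function name `temporal_split_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def temporal_split_sketch(ratings, test_ratio=0.2, time_col="timestamp"):
    """Hypothetical stand-in: the most recent `test_ratio` share of rows becomes the test set."""
    ordered = ratings.sort_values(time_col, kind="stable")
    n_test = int(len(ordered) * test_ratio)
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy ratings with made-up timestamps
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "movie_id": [10, 10, 20, 30, 20],
    "rating": [4, 5, 3, 2, 4],
    "timestamp": pd.to_datetime([
        "2000-05-01", "2000-06-01", "2000-07-01", "2000-08-01", "2000-09-01",
    ]),
})

train_toy, test_toy = temporal_split_sketch(toy, test_ratio=0.2)
print(len(train_toy), len(test_toy))
```

Because every test rating is at least as recent as every training rating, users and movies that first appear late in the data can end up only in the test set, which is why the notebook filters those cold-start rows before evaluating.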

In [3]:
ratings = read_ratings_file()
print(f"Loaded {len(ratings)} ratings")

# Temporal split: train on past, test on future
train, test = temporal_split(ratings, test_ratio=0.2)

Loaded 1000209 ratings
Train set size is: (800168, 4) 
Test set size is: (200041, 4)
Train set timeframes are: 2000-04-25 23:05:32 - 2000-12-02 14:52:18 
Test set timeframes are 2000-12-02 14:52:28 - 2003-02-28 17:49:50


## Create Ratings Matrix

In [4]:
train_matrix = train.pivot_table(
    index='user_id',
    columns='movie_id',
    values='rating',
    fill_value=0
)

print(f"Training matrix shape: {train_matrix.shape}")
sparsity = (train_matrix == 0).to_numpy().mean() * 100  # share of empty (zero-filled) cells
print(f"Sparsity: {sparsity:.2f}%")

# Filter test set to only include users/movies in training set (no cold start)
test_users = np.intersect1d(test.user_id.unique(), train.user_id.unique())
test_movies = np.intersect1d(test.movie_id.unique(), train.movie_id.unique())
test = test[(test.user_id.isin(test_users)) & (test.movie_id.isin(test_movies))]
print(f"\nFiltered test set size: {test.shape[0]} ratings")

Training matrix shape: (5400, 3662)
Sparsity: 95.95%

Filtered test set size: 104448 ratings


## Train SVD Model

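We train one model per value of k (the number of latent factors). The full sweep over `[10, 20, 50, 100]` is commented out below, so only k = 10 is trained in this run.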
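
`solve_with_svd` is imported from `src.models.svd` and its body is not shown here. As a rough idea of what a function with that call shape might do, the sketch below reconstructs a user × movie DataFrame from its top k singular values with `np.linalg.svd`. The name `solve_with_svd_sketch` and the toy matrix are hypothetical, and the real implementation may differ (for example, by centering ratings first).

```python
import numpy as np
import pandas as pd


def solve_with_svd_sketch(matrix, k):
    """Hypothetical stand-in: rank-k reconstruction of a user x movie DataFrame."""
    U, s, Vt = np.linalg.svd(matrix.to_numpy(dtype=float), full_matrices=False)
    k = min(k, len(s))  # cannot keep more factors than there are singular values
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]  # same as U_k @ diag(s_k) @ Vt_k
    # Keep the original labels so predictions can be looked up with .loc
    return pd.DataFrame(approx, index=matrix.index, columns=matrix.columns)


# Toy matrix in the same layout as train_matrix (0 marks a missing rating)
toy_matrix = pd.DataFrame(
    [[5, 0, 1], [4, 0, 1], [0, 2, 5]],
    index=pd.Index([101, 102, 103], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)
toy_pred = solve_with_svd_sketch(toy_matrix, k=2)
print(toy_pred.round(2))
```

Note that a plain SVD of a zero-filled matrix treats each 0 as an observed value rather than a missing one, so reconstructed scores for unrated movies tend to be pulled toward 0.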

In [None]:
# Train models with different k values
# k_values = [10, 20, 50, 100]
k_values = [10]
models = {}

for k in k_values:
    print(f"Training SVD with k={k}...")
    predictions = solve_with_svd(train_matrix, k=k)
    models[k] = predictions
    print("  Completed.")

print("\n✓ All models trained")

Training SVD with k=10...


## Evaluate Models

We evaluate with RMSE on the filtered test set, consistent with the other experiments.
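
For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. `evaluate_rmse` comes from `src.evaluation` and is not shown here; the sketch below is a hypothetical stand-in that assumes `predict_fn(user_id, movie_id)` returns a float and that rows with a NaN prediction are skipped.

```python
import numpy as np
import pandas as pd


def evaluate_rmse_sketch(test, predict_fn):
    """Hypothetical stand-in: RMSE over test rows, skipping rows with no prediction."""
    preds = np.array(
        [predict_fn(row.user_id, row.movie_id) for row in test.itertuples(index=False)],
        dtype=float,
    )
    actual = test["rating"].to_numpy(dtype=float)
    valid = ~np.isnan(preds)  # NaN marks a user/movie the model cannot score
    return float(np.sqrt(np.mean((preds[valid] - actual[valid]) ** 2)))


# Toy test set with made-up ratings
toy_test = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [4.0, 2.0, 5.0],
})

# A constant predictor that always answers 3.0
rmse_toy = evaluate_rmse_sketch(toy_test, lambda user_id, movie_id: 3.0)
print(f"RMSE: {rmse_toy:.4f}")
```

With errors of -1, 1, and -2, the mean squared error is 2, so the RMSE is the square root of 2.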

In [None]:
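def create_predict_fn(prediction_matrix):
    """Build a (user_id, movie_id) -> rating lookup bound to one prediction matrix."""
    def predict(user_id, movie_id):
        try:
            return prediction_matrix.loc[user_id, movie_id]
        except (KeyError, IndexError):
            # Unknown user or movie: no prediction available
            return np.nan
    return predict

In [None]:
# Evaluate each model
results = []

for k, predictions in models.items():
    print(f"Evaluating k={k}...")
    predict_fn = create_predict_fn(predictions)
    rmse = evaluate_rmse(test=test, predict_fn=predict_fn)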
    
    results.append({
        'k': k,
        'RMSE': rmse
    })
    print(f"  RMSE: {rmse:.4f}")

results_df = pd.DataFrame(results)
print("\n" + "="*60)
print("SVD Results")
print("="*60)
print(results_df.to_string(index=False))

## Visualize Results

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(results_df['k'], results_df['RMSE'], marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Latent Factors (k)', fontsize=12)
plt.ylabel('RMSE', fontsize=12)
plt.title('SVD: RMSE vs Number of Latent Factors', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Find best k
best_k = results_df.loc[results_df['RMSE'].idxmin(), 'k']
best_rmse = results_df.loc[results_df['RMSE'].idxmin(), 'RMSE']
print(f"\nBest performance: k={int(best_k)} with RMSE={best_rmse:.4f}")

## Detailed Analysis of Best Model

In [None]:
# Use the best k value
best_k = int(results_df.loc[results_df['RMSE'].idxmin(), 'k'])
best_predictions = models[best_k]

# Analyze prediction distribution
print(f"Analysis for SVD with k={best_k}")
print("="*60)
print("\nPrediction statistics:")
pred_values = best_predictions.to_numpy()  # flatten so statistics cover all entries at once
print(f"  Min: {pred_values.min():.2f}")
print(f"  Max: {pred_values.max():.2f}")
print(f"  Mean: {pred_values.mean():.2f}")
print(f"  Std: {pred_values.std():.2f}")  # overall std, not the std of per-column stds

# Compare with actual ratings (entries > 0 are observed; 0 marks a missing rating)
train_values = train_matrix.to_numpy()
observed = train_values[train_values > 0]
print("\nActual ratings statistics:")
print(f"  Min: {observed.min():.2f}")
print(f"  Max: {observed.max():.2f}")
print(f"  Mean: {observed.mean():.2f}")  # mean over ratings, not the mean of per-movie means
print(f"  Std: {observed.std():.2f}")

In [None]:
# Sample predictions for a random user
sample_user = np.random.choice(train_matrix.index)
user_actual = train_matrix.loc[sample_user]
user_pred = best_predictions.loc[sample_user]

# Get rated movies
rated_movies = user_actual[user_actual > 0].sample(min(10, (user_actual > 0).sum()))
comparison = pd.DataFrame({
    'Actual': rated_movies,
    'Predicted': user_pred[rated_movies.index]
})

print(f"Sample predictions for User {sample_user}:")
print(comparison)

# Calculate correlation
corr = comparison['Actual'].corr(comparison['Predicted'])
print(f"\nCorrelation: {corr:.4f}")