# ALS (Alternating Least Squares) Experiment

This notebook tests and validates the ALS implementation for implicit feedback collaborative filtering.

## Theory

ALS approximates the user-item matrix $R$ (m users × n items) as the product of two low-rank matrices:
$$R \approx U \cdot V^T$$

Where:
- $U$ is the user feature matrix (m × f)
- $V$ is the item feature matrix (n × f)
- $f$ is the number of latent factors

For implicit feedback:
- **Preference**: $p_{ui} = 1$ if $r_{ui} > 0$, else $0$
- **Confidence**: $c_{ui} = 1 + \alpha \cdot r_{ui}$

The algorithm alternates between:
1. Fixing $V$ and solving for $U$
2. Fixing $U$ and solving for $V$
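
The alternating steps above can be sketched in NumPy as follows. This is a minimal dense illustration of confidence-weighted ALS, not the code behind `solve_with_als`; the names `confidence_and_preference`, `solve_side`, and `als_sketch` are hypothetical, and the toy matrix `R` is made up for the example. Each row is solved in closed form as $x_u = (Y^T C_u Y + \lambda I)^{-1} Y^T C_u p_u$, where $Y$ is the side held fixed.

```python
import numpy as np

def confidence_and_preference(R, alpha):
    """Build confidence C = 1 + alpha * R and binary preference P from raw feedback R."""
    C = 1.0 + alpha * R
    P = (R > 0).astype(float)
    return C, P

def solve_side(C, P, fixed, regularization):
    """Solve for one side's factors while the other side (`fixed`) is held constant."""
    n_rows = C.shape[0]
    f = fixed.shape[1]
    solved = np.zeros((n_rows, f))
    reg = regularization * np.eye(f)
    for u in range(n_rows):
        weighted = fixed.T * C[u]            # Y^T C_u, shape (f, n)
        A = weighted @ fixed + reg           # Y^T C_u Y + lambda * I
        b = weighted @ P[u]                  # Y^T C_u p_u
        solved[u] = np.linalg.solve(A, b)
    return solved

def als_sketch(R, alpha=40, factors=2, regularization=0.1, iterations=10, seed=42):
    """Alternate between solving for U (V fixed) and V (U fixed)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, factors))
    V = rng.normal(scale=0.1, size=(n, factors))
    C, P = confidence_and_preference(R, alpha)
    for _ in range(iterations):
        U = solve_side(C, P, V, regularization)      # fix V, solve for U
        V = solve_side(C.T, P.T, U, regularization)  # fix U, solve for V
    return U, V

# Toy feedback matrix (3 users x 3 items); 0 means "no interaction"
R = np.array([[5, 0, 3], [0, 4, 0], [1, 0, 5]], dtype=float)
U, V = als_sketch(R)
scores = U @ V.T  # predicted preference scores, shape (3, 3)
```

The per-row loop keeps the sketch readable; a practical implementation would typically use sparse matrices and precompute $Y^T Y$ so that only observed entries are touched.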

In [1]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_reading import read_ratings_file
from src.evaluation import temporal_split, evaluate_rmse
from src.models.als import solve_with_als

np.random.seed(42)

## Load and Split Data

A temporal split gives a more realistic evaluation: the model trains on past ratings and is tested on future ones.
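
As a rough illustration of the idea, a temporal split can be sketched as below. This is not the implementation of `temporal_split` in `src.evaluation`; the function name `temporal_split_sketch`, the `timestamp` column name, and the toy data are assumptions made for the example.

```python
import pandas as pd

def temporal_split_sketch(ratings, test_ratio=0.2, time_col="timestamp"):
    """Sort by time and hold out the most recent `test_ratio` fraction as the test set."""
    ordered = ratings.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_test = int(round(len(ordered) * test_ratio))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy ratings (made-up values) with out-of-order timestamps
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "movie_id": [10, 10, 20, 30, 20],
    "rating": [5, 3, 4, 2, 1],
    "timestamp": pd.to_datetime(
        ["2000-05-01", "2000-04-25", "2001-01-10", "2000-12-02", "2003-02-28"]
    ),
})
toy_train, toy_test = temporal_split_sketch(toy, test_ratio=0.2)
```

Every row in `toy_train` is no later than every row in `toy_test`, which is the property that prevents future information from leaking into training.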

In [3]:
ratings = read_ratings_file()
print(f"Loaded {len(ratings)} ratings")

# Temporal split: train on past, test on future
train, test = temporal_split(ratings, test_ratio=0.2)

Loaded 1000209 ratings
Train set size is: (800168, 4) 
Test set size is: (200041, 4)
Train set timeframes are: 2000-04-25 23:05:32 - 2000-12-02 14:52:18 
Test set timeframes are 2000-12-02 14:52:28 - 2003-02-28 17:49:50


## Create Ratings Matrix

In [4]:
train_matrix = train.pivot_table(
    index='user_id',
    columns='movie_id',
    values='rating',
    fill_value=0
)

print(f"Training matrix shape: {train_matrix.shape}")
print(f"Sparsity: {(train_matrix == 0).sum().sum() / (train_matrix.shape[0] * train_matrix.shape[1]) * 100:.2f}%")

# Filter test set to only include users/movies in training set (no cold start)
test_users = np.intersect1d(test.user_id.unique(), train.user_id.unique())
test_movies = np.intersect1d(test.movie_id.unique(), train.movie_id.unique())
test = test[(test.user_id.isin(test_users)) & (test.movie_id.isin(test_movies))]
print(f"\nFiltered test set size: {test.shape[0]} ratings")

Training matrix shape: (5400, 3662)
Sparsity: 95.95%

Filtered test set size: 104448 ratings


## Train ALS Model

We first train with a default configuration, then vary the hyperparameters one at a time.

In [5]:
# Train with default parameters first
print("Training ALS with default parameters...")
default_predictions = solve_with_als(
    train_matrix,
    alpha=40,
    iterations=10,
    factors=20,
    regularization=0.1,
    verbose=True
)
print("\n✓ Default model trained")

Training ALS with default parameters...
--- Running ALS with parameters ---
Alpha: 40
Iterations: 10
Factors: 20
Regularization: 0.1

Iteration 1/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 2/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 3/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 4/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 5/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 6/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 7/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 8/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 9/10
Solving for users (fixed items)...
Solving for items (fixed users)...

Iteration 10/10
Solving for users (fixed items)...
Solving for items (fixed users)...

✓ Default model trained

## Evaluate Default Model

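RMSE is used for consistency with the other experiments. Note that implicit-feedback ALS fits the binary preference $p_{ui}$ rather than the raw rating $r_{ui}$, so RMSE measured against raw ratings can be large unless the predictions are rescaled.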
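
For reference, an RMSE evaluation driven by a `predict_fn(user_id, movie_id)` callable might look like the sketch below. The `(test, predict_fn)` arguments mirror how `evaluate_rmse` is called in this notebook, but the body, the name `rmse_sketch`, the choice to skip `NaN` predictions, and the toy data are assumptions rather than the contents of `src.evaluation`.

```python
import numpy as np
import pandas as pd

def rmse_sketch(test, predict_fn):
    """RMSE over test rows, skipping rows where predict_fn returns NaN."""
    predicted = np.array(
        [predict_fn(row.user_id, row.movie_id) for row in test.itertuples(index=False)],
        dtype=float,
    )
    actual = test["rating"].to_numpy(dtype=float)
    mask = ~np.isnan(predicted)
    return float(np.sqrt(np.mean((predicted[mask] - actual[mask]) ** 2)))

# Toy data (made-up values): user 2 has no row in the score matrix
toy_test = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [4, 2, 5]})
toy_scores = pd.DataFrame([[3.0, 2.0]], index=[1], columns=[10, 20])

def toy_predict(user_id, movie_id):
    try:
        return toy_scores.loc[user_id, movie_id]
    except KeyError:
        return np.nan

toy_rmse = rmse_sketch(toy_test, toy_predict)  # sqrt((1**2 + 0**2) / 2), about 0.7071
```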

In [None]:
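def create_predict_fn(prediction_matrix):
    """Return a predict(user_id, movie_id) function bound to a prediction matrix."""
    def predict(user_id, movie_id):
        try:
            return prediction_matrix.loc[user_id, movie_id]
        except (KeyError, IndexError):
            # User or movie is missing from the prediction matrix
            return np.nan
    return predict

In [6]:
predict_fn = create_predict_fn(default_predictions)
default_rmse = evaluate_rmse(test=test, predict_fn=predict_fn)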

print(f"Default model RMSE: {default_rmse:.4f}")

Default model RMSE: 3.0631


## Hyperparameter Tuning

### Test Different Number of Factors

In [8]:
# Test different number of factors
factor_values = [10, 20, 50]
factor_results = []

for factors in factor_values:
    print(f"\nTraining with factors={factors}...")
    predictions = solve_with_als(
        train_matrix,
        alpha=40,
        iterations=10,
        factors=factors,
        regularization=0.1,
        verbose=True
    )
    
    predict_fn = create_predict_fn(predictions)
    rmse = evaluate_rmse(test=test, predict_fn=predict_fn)
    
    factor_results.append({
        'factors': factors,
        'RMSE': rmse
    })
    print(f"  RMSE: {rmse:.4f}")

factor_df = pd.DataFrame(factor_results)
print("\n" + "="*60)
print("ALS: Number of Factors")
print("="*60)
print(factor_df.to_string(index=False))


Training with factors=10...
--- Running ALS with parameters ---
Alpha: 40
Iterations: 10
Factors: 10
Regularization: 0.1

Iteration 1/10
Solving for users (fixed items)...


KeyboardInterrupt: 

### Test Different Alpha Values

In [None]:
# Test different alpha values
alpha_values = [10, 40, 80]
alpha_results = []

for alpha in alpha_values:
    print(f"\nTraining with alpha={alpha}...")
    predictions = solve_with_als(
        train_matrix,
        alpha=alpha,
        iterations=10,
        factors=20,
        regularization=0.1,
        verbose=False
    )
    
    predict_fn = create_predict_fn(predictions)
    rmse = evaluate_rmse(test=test, predict_fn=predict_fn)
    
    alpha_results.append({
        'alpha': alpha,
        'RMSE': rmse
    })
    print(f"  RMSE: {rmse:.4f}")

alpha_df = pd.DataFrame(alpha_results)
print("\n" + "="*60)
print("ALS: Alpha Parameter")
print("="*60)
print(alpha_df.to_string(index=False))

### Test Different Iteration Counts

In [None]:
# Test different iteration counts
iteration_values = [5, 10, 15]
iteration_results = []

for iters in iteration_values:
    print(f"\nTraining with iterations={iters}...")
    predictions = solve_with_als(
        train_matrix,
        alpha=40,
        iterations=iters,
        factors=20,
        regularization=0.1,
        verbose=False
    )
    
    predict_fn = create_predict_fn(predictions)
    rmse = evaluate_rmse(test=test, predict_fn=predict_fn)
    
    iteration_results.append({
        'iterations': iters,
        'RMSE': rmse
    })
    print(f"  RMSE: {rmse:.4f}")

iteration_df = pd.DataFrame(iteration_results)
print("\n" + "="*60)
print("ALS: Number of Iterations")
print("="*60)
print(iteration_df.to_string(index=False))

## Visualize Results

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Factors plot
axes[0].plot(factor_df['factors'], factor_df['RMSE'], marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Latent Factors', fontsize=12)
axes[0].set_ylabel('RMSE', fontsize=12)
axes[0].set_title('ALS: Factors Sensitivity', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Alpha plot
axes[1].plot(alpha_df['alpha'], alpha_df['RMSE'], marker='o', linewidth=2, markersize=8)
axes[1].set_xlabel('Alpha (Confidence Scaling)', fontsize=12)
axes[1].set_ylabel('RMSE', fontsize=12)
axes[1].set_title('ALS: Alpha Sensitivity', fontsize=14)
axes[1].grid(True, alpha=0.3)

# Iterations plot
axes[2].plot(iteration_df['iterations'], iteration_df['RMSE'], marker='o', linewidth=2, markersize=8)
axes[2].set_xlabel('Number of Iterations', fontsize=12)
axes[2].set_ylabel('RMSE', fontsize=12)
axes[2].set_title('ALS: Convergence', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary of Best Parameters

In [None]:
# Find best values
best_factors = factor_df.loc[factor_df['RMSE'].idxmin(), 'factors']
best_alpha = alpha_df.loc[alpha_df['RMSE'].idxmin(), 'alpha']
best_iterations = iteration_df.loc[iteration_df['RMSE'].idxmin(), 'iterations']

print("Best Parameters:")
print("="*60)
print(f"Factors: {int(best_factors)} (RMSE: {factor_df['RMSE'].min():.4f})")
print(f"Alpha: {int(best_alpha)} (RMSE: {alpha_df['RMSE'].min():.4f})")
print(f"Iterations: {int(best_iterations)} (RMSE: {iteration_df['RMSE'].min():.4f})")

## Detailed Analysis of Best Model

In [None]:
# Train with best parameters
print("Training ALS with best parameters...")
best_predictions = solve_with_als(
    train_matrix,
    alpha=int(best_alpha),
    iterations=int(best_iterations),
    factors=int(best_factors),
    regularization=0.1,
    verbose=True
)

predict_fn = create_predict_fn(best_predictions)
best_rmse = evaluate_rmse(test=test, predict_fn=predict_fn)
print(f"\nBest model RMSE: {best_rmse:.4f}")

In [None]:
# Analyze prediction distribution over all entries of the matrix
# (chaining .std().std() would give the spread of per-column stds, not the overall std)
pred_values = best_predictions.to_numpy()
print("\nPrediction statistics:")
print(f"  Min: {pred_values.min():.2f}")
print(f"  Max: {pred_values.max():.2f}")
print(f"  Mean: {pred_values.mean():.2f}")
print(f"  Std: {pred_values.std():.2f}")

# Compare with actual ratings (observed entries only; 0 marks "no rating")
rated_values = train_matrix.to_numpy()
rated_values = rated_values[rated_values > 0]
print("\nActual ratings statistics:")
print(f"  Min: {rated_values.min():.2f}")
print(f"  Max: {rated_values.max():.2f}")
print(f"  Mean: {rated_values.mean():.2f}")
print(f"  Std: {rated_values.std():.2f}")

In [None]:
# Sample predictions for a random user
sample_user = np.random.choice(train_matrix.index)
user_actual = train_matrix.loc[sample_user]
user_pred = best_predictions.loc[sample_user]

# Get rated movies
rated_movies = user_actual[user_actual > 0].sample(min(10, (user_actual > 0).sum()))
comparison = pd.DataFrame({
    'Actual': rated_movies,
    'Predicted': user_pred[rated_movies.index]
})

print(f"Sample predictions for User {sample_user}:")
print(comparison)

# Calculate correlation
corr = comparison['Actual'].corr(comparison['Predicted'])
print(f"\nCorrelation: {corr:.4f}")

# Show top recommendations (unrated movies)
unrated_movies = user_actual[user_actual == 0]
top_recs = user_pred[unrated_movies.index].nlargest(5)
print(f"\nTop 5 recommendations for User {sample_user}:")
for movie_id, score in top_recs.items():
    print(f"  Movie {movie_id}: {score:.2f}")