For the unit of randomization, we considered three options: session-level, user-level, and item-level.
Session-level randomization is not possible here, because the dataset has no session IDs. It could also introduce cross-exposure bias, which occurs when the same user is exposed to multiple variants during the experiment. With item-level randomization, it would be harder to attribute performance to a particular model (although this kind of randomization can be applied as an interleaving test). For simplicity and higher precision, we chose user-level randomization for this test.
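
As an illustration of user-level randomization, here is a minimal sketch of a deterministic assignment in which a user always lands in the same variant. The function name `assign_variant`, the salt string, and the use of SHA-256 are assumptions made for this example; the experiment in this notebook draws the assignment with `np.random.choice` instead.

```python
import hashlib

def assign_variant(user_id, variants=("c", "p"), salt="cf-ab-test"):
    """Map a user to one variant deterministically (hypothetical helper)."""
    # Hash the salted user id so the bucket does not depend on the order of users
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Every user gets exactly one variant, and the mapping is stable across runs
assignments = {uid: assign_variant(uid) for uid in range(1, 1001)}
```

Because the variant depends only on the user ID and the salt, the same user cannot be exposed to both models, which is the cross-exposure problem that user-level randomization is meant to avoid.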

As the comparison metric, we chose click-through rate (CTR). In our case it is simulated by the presence of movie ratings in the test dataset: a recommended movie that the user rated in the test period counts as a click. As the guardrail metric, we calculate the average rating of recommendations (ARR), the mean test rating of the recommended movies that the user actually rated. ARR balances the comparison by checking that the quality of recommendations does not suffer while we optimize CTR.
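
The two metrics can be sketched for a single user as follows. The helper `ctr_and_arr` and the toy data are hypothetical and exist only to show the arithmetic; the column names `movie_id` and `rating` follow the ones used later in the notebook.

```python
import pandas as pd

def ctr_and_arr(recommended, user_test, k=10):
    """Return (CTR, ARR) for one user (hypothetical helper).

    recommended: list of recommended movie ids
    user_test:   this user's test ratings, with columns movie_id and rating
    """
    # A "click" is a recommended movie that the user rated in the test set
    hits = user_test[user_test["movie_id"].isin(recommended[:k])]
    ctr = len(hits) / k
    # ARR is undefined when there are no clicks
    arr = hits["rating"].mean() if len(hits) else float("nan")
    return ctr, arr

toy_user_test = pd.DataFrame({"movie_id": [1, 2, 3], "rating": [5, 3, 4]})
toy_ctr, toy_arr = ctr_and_arr([1, 3, 7, 8, 9, 10, 11, 12, 13, 14], toy_user_test)
```

In this toy case, two of the ten recommendations were rated (movies 1 and 3), so CTR is 0.2 and ARR is the mean of their ratings, 4.5.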

This notebook contains an online evaluation simulated on the MovieLens dataset. We compare the two most accurate models among the similarity-based collaborative filtering methods developed in the project: user-based CF with cosine similarity and user-based CF with Pearson similarity.

The results are shown at the end of the notebook.

The results analysis and interpretation are in the final report.

In [1]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

In [26]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from scipy import stats
import warnings

warnings.filterwarnings('ignore')

In [3]:
from src.data_reading import read_ratings_file
from src.evaluation import temporal_split
from src.models.similarity_based_cf import predict_rating_cf_user_based, recommend_k

# Data preparation

In [4]:
# Similarity-based CF needs only the movie ratings file; movie metadata and user features are not used

ratings = read_ratings_file()

In [5]:
# Split into train and test sets by date

train, test = temporal_split(ratings, test_ratio=0.1)

Train set size: (900188, 4)
Test set size: (100021, 4)
Train timeframe: 2000-04-25 23:05:32 - 2000-12-29 23:42:47
Test timeframe: 2000-12-29 23:43:34 - 2003-02-28 17:49:50
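
The implementation of `temporal_split` lives in `src.evaluation` and is not shown in this notebook. A minimal sketch of what a split by date could look like is given below; the function name `temporal_split_sketch`, the `timestamp` column name, and the rounding rule are assumptions, not the project's actual code.

```python
import pandas as pd

def temporal_split_sketch(ratings, test_ratio=0.1, time_col="timestamp"):
    """Put the most recent share of ratings into the test set (hypothetical)."""
    # Sort chronologically so the test set lies strictly after the train set
    ordered = ratings.sort_values(time_col, kind="stable")
    n_test = int(round(len(ordered) * test_ratio))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movie_id": [10, 11, 10, 12, 11, 13, 12, 14, 13, 15],
    "rating": [4, 5, 3, 4, 2, 5, 4, 3, 5, 4],
    "timestamp": pd.date_range("2000-01-01", periods=10, freq="D"),
})
toy_train, toy_test = temporal_split_sketch(toy_ratings, test_ratio=0.2)
```

Splitting by time rather than at random keeps future ratings out of the training data, which is closer to how an online experiment would see the data.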


In [6]:
# Create user_id x movie_id matrix

train_prep = train.pivot_table(
    index='user_id',
    columns='movie_id',
    values='rating'
)
train_prep_ = train_prep.fillna(0)

# User similarity: cosine and Pearson

In [7]:
# Calculate user similarity with cosine similarity (missing ratings are treated as 0)

user_sim_cos = pd.DataFrame(
    cosine_similarity(train_prep_),
    index=train_prep.index,
    columns=train_prep.index
)

# Calculate user similarity with Pearson similarity, computed as the cosine
# of each user's mean-centered ratings (missing ratings become 0 after centering)

user_sim_pearson = pd.DataFrame(
    cosine_similarity(
        train_prep.sub(train_prep.mean(axis=1), axis=0)
        .fillna(0)
        .values
    ),
    index=train_prep.index,
    columns=train_prep.index
)
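
The Pearson matrix above is built as a cosine similarity over mean-centered rows. The short check below, on a small made-up dense matrix, shows why this works: for fully observed rows, the cosine of centered vectors equals the Pearson correlation. With missing ratings filled by 0 after centering, as in this notebook, the result is an approximation rather than the exact Pearson correlation over co-rated movies.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings with no missing values (illustrative data, not from MovieLens)
dense = np.array([
    [5.0, 3.0, 4.0, 1.0],
    [4.0, 2.0, 5.0, 2.0],
    [1.0, 5.0, 2.0, 4.0],
])

# Center each row on its own mean, then take the cosine between rows
centered = dense - dense.mean(axis=1, keepdims=True)
centered_cosine = cosine_similarity(centered)

# Reference: row-wise Pearson correlation
pearson_ref = np.corrcoef(dense)
```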

In [8]:
# Remove from the test set the users and movies that are missing from the train set, since similarity-based collaborative filtering cannot handle cold start

test_users = np.intersect1d(test.user_id.unique(), train.user_id.unique())
test_movies = np.intersect1d(test.movie_id.unique(), train.movie_id.unique())

test = test[(test.user_id.isin(test_users)) & (test.movie_id.isin(test_movies))]
print(f'New test set shape is: {test.shape}')

New test set shape is: (95723, 4)


# Experiment

In [25]:
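# Sample 100 test users and randomly assign each to a model:
# 'c' (cosine similarity) or 'p' (Pearson similarity).
# No random seed is set, so the sample and the results change between runs.
sample_size = 100
results = pd.DataFrame({
    'user_id': np.random.choice(test.user_id.unique(), size=sample_size, replace=False),
    'model': np.random.choice(['c', 'p'], size=sample_size)
}).set_index('user_id')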

# Similarity matrix used by each variant
sim_by_model = {'c': user_sim_cos, 'p': user_sim_pearson}

for user_id, row in results.iterrows():
    # Recommend with the model assigned to this user
    rec = recommend_k(
        user_id=user_id,
        test=test,
        predict_fn=predict_rating_cf_user_based,
        train_prep=train_prep,
        sim_df=sim_by_model[row['model']],
        n=10,
        k=10
    )

    # Recommended movies that the user actually rated in the test set ("clicks")
    user_test = test[test.user_id == user_id]
    relevant = np.intersect1d(rec, user_test.movie_id.values)

    # CTR: share of the 10 recommended movies that were rated
    ctr = len(relevant) / 10
    # ARR: mean test rating of the rated recommendations (NaN if there are none)
    arr = user_test[user_test.movie_id.isin(relevant)].rating.mean()

    results.loc[[user_id], ['ctr', 'arr']] = ctr, arr

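# Assign the result back instead of using inplace=True on an attribute-accessed column
results['ctr'] = results['ctr'].fillna(0)

# Display the per-user CTR
results['ctr']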

user_id
1587    0.0
403     0.0
649     0.0
2828    0.0
2258    0.2
       ... 
5411    0.0
5322    0.0
5371    0.0
2270    0.0
5453    0.1
Name: ctr, Length: 100, dtype: float64

In [57]:
def ttest(column_name: str, df: pd.DataFrame = None, alpha: float = 0.05):
    """Compare a per-user metric between the two groups with Welch's t-test."""
    # Look up `results` at call time, so a re-run experiment is not tested on stale data
    df = results if df is None else df

    # Users without a value for the metric (e.g. ARR with no clicks) are dropped
    cos = df.loc[df['model'] == 'c', column_name].dropna()
    pearson = df.loc[df['model'] == 'p', column_name].dropna()
    print(f'Cosine similarity user-based model {column_name.upper()}: {cos.mean():.4f}')
    print(f'Pearson similarity user-based model {column_name.upper()}: {pearson.mean():.4f}')
    print(f'Metric difference: {(cos.mean() - pearson.mean()):.4f}')

    # equal_var=False selects Welch's t-test, which does not assume equal variances
    t_stat, p_value = stats.ttest_ind(cos, pearson, equal_var=False)
    print(f'T-statistic: {t_stat:.4f}')
    print(f'P-value: {p_value:.4f}')

    if p_value < alpha:
        print('Statistically significant difference')
    else:
        print('No statistically significant difference')
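
For reference, `stats.ttest_ind(..., equal_var=False)` is Welch's t-test. The sketch below recomputes the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, and compares them with SciPy. The two samples are synthetic and the helper name `welch_t` is an assumption made for this example.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (not experiment data)
rng = np.random.default_rng(0)
sample_a = rng.normal(0.10, 0.05, size=40)
sample_b = rng.normal(0.05, 0.10, size=60)

def welch_t(x, y):
    """Welch's t statistic, degrees of freedom, and two-sided p-value."""
    # Squared standard error of each sample mean
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

t_manual, dof_manual, p_manual = welch_t(sample_a, sample_b)
t_scipy, p_scipy = stats.ttest_ind(sample_a, sample_b, equal_var=False)
```

Welch's variant is a reasonable default here, since the two groups generally differ in size and there is no reason to expect equal variances of per-user metrics.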

In [58]:
ttest('ctr')

Cosine similarity user-based model CTR: 0.0980
Pearson similarity user-based model CTR: 0.0294
Metric difference: 0.0685
T-statistic: 2.8659
P-value: 0.0051
Statistically significant difference


In [59]:
ttest('arr')

Cosine similarity user-based model ARR: 4.4750
Pearson similarity user-based model ARR: 4.6190
Metric difference: -0.1440
T-statistic: -0.5082
P-value: 0.6314
No statistically significant difference
