# ALS applications

## Dzen dataset

Data comes from [dzen.ru](https://dzen.ru/) site and consists of likes which users put to text articles

### Columns
1. item_id - unique id of an item (article)
2. user_id - unique id of a user
3. source_id - unique id of an author. If two items have same source_id, then they come from one author
4. Name of item is name of the article
5. Raw dataset represents user_id and list of item_ids which user liked

In [1]:
# !curl -O -J -L 'https://www.dropbox.com/s/ia4bvhuqg8kesee/zen_dataset.zip?dl=1'
# !unzip zen_dataset.zip

In [2]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from tqdm.notebook import tqdm
import ast

In [3]:
item_names = pd.read_csv("zen_item_to_name.csv")
item_sources = pd.read_csv("zen_item_to_source.csv")
dataset = pd.read_csv("zen_ratings.csv", converters={'item_ids': ast.literal_eval})

In [4]:
item_names

Unnamed: 0,id,name
0,94962,Что обычно ожидало русских казачек в руках у к...
1,3972,Почему Россия решила строить новую скоростную ...
2,94644,"5 неприличных фактов об Андрее Макаревиче, кот..."
3,82518,"Что стало с красавицей Хмельницкой, которую му..."
4,53264,"Понять и Простить: Почему угонщики, бежавшие и..."
...,...,...
104498,36769,"Плюс один источник мифа о рыцарях, неспособных..."
104499,9190,Мой сад - малоуходный
104500,52731,Купил первую в жизни циркулярную пилу. Честный...
104501,72660,Решили предложить Марине помощь в лечении ч.10


In [5]:
item_sources

Unnamed: 0,id,source
0,94962,2919814402697966089
1,3972,3263022753228392991
2,94644,-3857390427602554682
3,82518,-9036908390349249792
4,53264,3353856219169766284
...,...,...
104498,36769,3818746211375738614
104499,9190,4975535765688979937
104500,52731,3720366796439288909
104501,72660,-7860042973720636310


In [8]:
dataset 

Unnamed: 0,user_id,item_ids
0,993675863667353526,"[15267, 61075, 81203, 17066, 25471, 88427, 638..."
1,4250619547882954185,"[4555, 94644, 84972, 17774, 94962, 78217, 2485..."
2,3847785305345691076,"[1898, 26703, 16525, 86939, 55017, 31069, 4035..."
3,1785181112918558233,"[75601, 102458, 28716, 100694, 5757, 47104, 60..."
4,5078748097863903181,"[72260, 40825, 2615, 42549, 379, 100818, 56827..."
...,...,...
75905,4954138831959898373,"[11881, 55520, 63054, 48015, 66952, 103830, 21..."
75906,4967793435819938014,"[74697, 11830, 63858, 87245, 41956, 62089, 686..."
75907,7137764184903122777,"[10353, 1775, 103680, 29704, 9782, 13295, 9975..."
75908,2624987805086334956,"[24324, 18854, 73319, 66641, 64078, 97387, 426..."


In [9]:
total_interactions_count = dataset.item_ids.map(len).sum()
user_coo = np.zeros(total_interactions_count, dtype=np.int64)
item_coo = np.zeros(total_interactions_count, dtype=np.int64)
pos = 0

for user_id, item_ids in enumerate(tqdm(dataset.item_ids)):
    user_coo[pos : pos + len(item_ids)] = user_id
    item_coo[pos : pos + len(item_ids)] = item_ids
    pos += len(item_ids)

shape = (max(user_coo) + 1, max(item_coo) + 1)
user_item_matrix = sp.coo_matrix(
    (np.ones(len(user_coo)), (user_coo, item_coo)), shape=shape
)
user_item_matrix = user_item_matrix.tocsr()
sp.save_npz("data_train.npz", user_item_matrix)
# Cleanup memory. Later you need just data_train.npz
del user_coo
del item_coo
del dataset

  0%|          | 0/75910 [00:00<?, ?it/s]

In [10]:
# you could start here if you already done precomputing
user_item_matrix = sp.load_npz("data_train.npz")

In [11]:
user_item_matrix

<75910x104503 sparse matrix of type '<class 'numpy.float64'>'
	with 5792423 stored elements in Compressed Sparse Row format>

In [12]:
def sparce_matrix_report(matrix):
    print('Size of raw data:', matrix.data.nbytes / 10**6, 'Mb')
    print('Feedback matrix size:', matrix.shape)

In [13]:
sparce_matrix_report(user_item_matrix)

Size of raw data: 46.339384 Mb
Feedback matrix size: (75910, 104503)


In [15]:
item_weights = np.array(user_item_matrix.tocsc().sum(0))[0]
top_to_bottom_order = np.argsort(-item_weights)
item_mapping = np.empty(top_to_bottom_order.shape, dtype=int)
item_mapping[top_to_bottom_order] = np.arange(len(top_to_bottom_order))
total_item_count = (item_weights > 0).sum()
total_user_count = user_item_matrix.shape[0]


def build_debug_dataset(user_item_matrix, item_pct: float, user_pct: float):
    '''Get given percent of top rated items and given percent of random users'''
    user_count = int(total_user_count * user_pct),
    item_count = int(total_item_count * item_pct)
    item_ids = top_to_bottom_order[:item_count]
    user_ids = np.random.choice(
        np.arange(user_item_matrix.shape[0]), size=user_count, replace=False
    )
    train = user_item_matrix[user_ids]
    train = train[:, item_ids]
    return train

In [16]:
debug_dataset = build_debug_dataset(user_item_matrix, 0.05, 0.05)

sparce_matrix_report(debug_dataset)

Size of raw data: 1.076688 Mb
Feedback matrix size: (3795, 5019)


This is useful for debugging (just to save time).

**Final answers should use full dataset!!!**

## Split dataset matrix (5 points)

in the following way: for 20% of users (random) remove one like - this will be test data. The rest is train data.

In [None]:
def split_data(ratings):
    # your code here
    """
    Разделение матрицы рейтингов: для 20% пользователей удаляем один лайк
    """
    import numpy as np
    import scipy.sparse as sp
    
    # Получаем количество пользователей и выбираем 20% из них случайно
    n_users = ratings.shape[0]
    n_test_users = int(0.2 * n_users)data_train
    test_user_indices = np.random.choice(n_users, n_test_users, replace=False)
    
    # Инициализируем тестовую матрицу
    test_data = []
    test_row = []
    test_col = []
    
    # Создаем копию для тренировочной матрицыdata_train
    train_matrix = ratings.copy()
    
    for user_idx in test_user_indices:
        # Получаем айтемы, с которыми взаимодействовал пользователь
        user_items = ratings.getrow(user_idx).indices
        
        if len(user_items) > 0:
            # Случайно выбираем один айтем для удаления
            item_idx = np.random.choice(user_items)
            
            # Добавляем в тестовую выборку
            test_data.append(1.0)  # Бинарные взаимодействия
            test_row.append(user_idx)
            test_col.append(item_idx)
            
            # Удаляем из тренировочной выборки
            train_matrix[user_idx, item_idx] = 0
            data_train
    
    # Создаем тестовую матрицу
    test_matrix = sp.coo_matrix((test_data, (test_row, test_col)), shape=ratings.shape)
    test_matrix = test_matrix.tocsr()
    
    # Убеждаемся, что тренировочная матрица в формате CSR
    train_matrix = train_matrix.tocsr()
    train_matrix.eliminate_zeros()  # Удаляем нулевые элементы
    
    return train_matrix, test_matrix

In [15]:
train_ratings, test_ratings = split_data(user_item_matrix[:10])
train_ratings.size, test_ratings.size

(778, 2)

## Implement IALS (10 points each)

Note that due to size of data you need to implement algorithm with _sparce matrices_!

You are welcome to use classes like on the seminar:)

In [None]:
def ials(ratings, k=40, lam=0.1, n_iterations=10, alpha=40):
    '''Implicit Alternating Least Squares algorithm

    Args:
        ratings: sparse matrix of ratings
        k: size of embeddings
        lam: regularization term
        n_iterations: number of iterations
        alpha: confidence scaling parameter

    Returns:
        two matrices: of user embeddings and of item embeddings
    '''
    import numpy as np
    import scipy.sparse as sp
    from tqdm.notebook import tqdm
    
    # Get dimensions
    num_users, num_items = ratings.shape
    
    # Initialize factor matrices randomly
    user_embeddings = np.random.normal(0, 0.01, (num_users, k))data_train
    item_embeddings = np.random.normal(0, 0.01, (num_items, k))
    
    # Ensure we have CSR format for efficient row slicing
    ratings_csr = ratings.tocsr()
    
    # For CSC format (efficient column access)
    ratings_csc = ratings.tocsc()
    
    # Identity matrix for regularization
    lambda_I = lam * np.eye(k)
    
    for _ in tqdm(range(n_iterations), desc="IALS iterations"):
        # Step 1: Fix item factors and solve for user factors
        
        # Precompute YtY once for all users
        YtY = item_embeddings.T @ item_embeddings
        
        for u in range(num_users):
            # Get items rated by user u
            items = ratings_csr[u].indices
            
            if len(items) == 0:
                continue
                
            # Get item factors for these items
            factors = item_embeddings[items]
            
            # Create confidence matrix Cu and preference matrix Pu
            # For implicit feedback: Cu = 1 + alpha*Pu, where Pu is binary
            confidence = 1.0 + alpha
            
            # Compute the left side of the equation: YtCuY + λI
            A = YtY + factors.T @ ((confidence - 1.0) * factors) + lambda_I
            
            # Compute the right side: YtCupu (where pu is a vector of 1's for implicit data)
            b = confidence * (factors.sum(axis=0))
            
            # Solve the linear system (A * x = b)
            try:
                user_embeddings[u] = np.linalg.solve(A, b)
            except np.linalg.LinAlgError:
                # Fallback to least squares if matrix is singular
                user_embeddings[u] = np.linalg.lstsq(A, b, rcond=None)[0]data_train
        
        # Step 2: Fix user factors and solve for item factors
        
        # Precompute XtX once for all items
        XtX = user_embeddings.T @ user_embeddings
        
        for i in range(num_items):
            # Get users who rated item i
            users = ratings_csc[:, i].indices
            
            if len(users) == 0:
                continue
                
            # Get user factors for these users
            factors = user_embeddings[users]
            
            # Create confidence matrix Ci and preference matrix Pi
            confidence = 1.0 + alpha
            
            # Compute the left side: XtCiX + λI
            A = XtX + factors.T @ ((confidence - 1.0) * factors) + lambda_I
            
            # Compute the right side: XtCipi
            b = confidence * (factors.sum(axis=0))
            
            # Solve the linear system
            try:
                item_embeddings[i] = np.linalg.solve(A, b)
            except np.linalg.LinAlgError:
                # Fallback to least squares if matrix is singular
                item_embeddings[i] = np.linalg.lstsq(A, b, rcond=None)[0]
    
    return user_embeddings, item_embeddings

In [17]:
train_ratings

<10x104503 sparse matrix of type '<class 'numpy.float64'>'
	with 778 stored elements in Compressed Sparse Row format>

In [18]:
# user_embeddings, item_embeddings = ials(train_ratings, k=40, lam=0.1)

## Compute MRR@100 metric for test users

For ALS and IALS algorithms.

**Don't forget to use full dataset!**

In [None]:
def mrr(user_embeddings, item_embeddings, test_ratings, train_ratings, k=100):
    """Compute MRR@k for test ratings based on embeddings
    
    Args:
        user_embeddings: matrix of user embeddings
        item_embeddings: matrix of item embeddings
        test_ratings: sparse matrix of test ratings
        train_ratings: sparse matrix of train ratings
        k: cutoff for top-k recommendations
        
    Returns:
        mrr_value: Mean Reciprocal Rank score
    """
    import numpy as np
    
    # Ensure test ratings is in CSR format
    test_ratings_csr = test_ratings.tocsr()
    train_ratings_csr = train_ratings.tocsr()
    
    # Get users with test items
    test_users = np.unique(test_ratings_csr.nonzero()[0])
    
    # Initialize sum of reciprocal ranks
    rr_sum = 0.0
    count = 0
    
    for user in test_users:
        # Get test items for this user
        test_items = test_ratings_csr[user].indices
        
        if len(test_items) == 0:
            continue
            
        # Get items this user has already rated (to exclude from recommendations)
        train_items = train_ratings_csr[user].indices
        
        # Compute scores for all items
        scores = user_embeddings[user] @ item_embeddings.T
        
        # Mask out training items
        scores[train_items] = -np.inf
        
        # Get top-k item indices
        top_items = np.argsort(-scores)[:k]
        
        # Compute reciprocal rank for each test item
        for test_item in test_items:
            # Find position of test item in top-k list (if present)
            rank_idx = np.where(top_items == test_item)[0]
            
            if len(rank_idx) > 0:
                # Add 1 because rank starts from 0
                rank = rank_idx[0] + 1
                rr_sum += 1.0 / rank
                count += 1
    
    # Compute mean
    if count > 0:
        mrr_value = rr_sum / count
    else:
        mrr_value = 0.0
        
    return mrr_value

In [20]:
# mrr_ials = mrr(ials_predictions, test_ratings, k=100)
# print(mrr_ials)

## Adjust hyperparameters of IALS to maximize MRR (10 points)

Main hyperparameters are regularization and weights for implicit case.

In [21]:
def tune_ials_hyperparameters(train_ratings, test_ratings):
    """Tune hyperparameters for IALS to maximize MRR
    
    Args:
        train_ratings: training data
        test_ratings: test data
        
    Returns:
        best_params: dictionary of best parameters
        best_mrr: best MRR score achieved
    """
    import numpy as np
    from itertools import product
    from tqdm.notebook import tqdm
    
    # Define hyperparameter grids
    k_values = [20, 40, 60]  # Embedding sizes
    lam_values = [0.01, 0.1, 1.0]  # Regularization terms
    alpha_values = [10, 40, 100]  # Confidence scaling factors
    
    # Use fewer iterations for tuning to save time
    n_iterations = 5
    
    # Initialize best values
    best_mrr = 0
    best_params = {}
    
    # Generate all parameter combinations
    param_grid = list(product(k_values, lam_values, alpha_values))
    
    print(f"Tuning IALS with {len(param_grid)} parameter combinations...")
    
    # Track results for all combinations
    results = []
    
    for k, lam, alpha in tqdm(param_grid, desc="Parameter Combinations"):
        print(f"\
Testing: k={k}, λ={lam}, α={alpha}")
        
        # Train IALS model with these parameters
        user_embeddings, item_embeddings = ials(
            train_ratings, k=k, lam=lam, alpha=alpha, n_iterations=n_iterations
        )
        
        # Calculate MRR score
        mrr_score = mrr(user_embeddings, item_embeddings, test_ratings, train_ratings)
        
        print(f"MRR@100: {mrr_score:.4f}")
        
        # Store result
        results.append((k, lam, alpha, mrr_score))
        
        # Update best parameters if better MRR found
        if mrr_score > best_mrr:
            best_mrr = mrr_score
            best_params = {'k': k, 'lam': lam, 'alpha': alpha}
    
    # Sort results by MRR for reporting
    results.sort(key=lambda x: x[3], reverse=True)
    
    print("\
All Results (sorted by MRR):")
    for k, lam, alpha, score in results:
        print(f"k={k}, λ={lam}, α={alpha}: MRR={score:.4f}")
    
    print(f"\
Best Parameters: k={best_params['k']}, λ={best_params['lam']}, α={best_params['alpha']}")
    print(f"Best MRR: {best_mrr:.4f}")
    
    return best_params, best_mrr


Optimal parameters of IALS are:

....

In [22]:
# best_params = tune_ials_hyperparameters(train_ratings, test_ratings)

## Get similarities from item2item CF (10 points)

Item2item can be taken from the first homework, SLIM was implemented in the class.

Alternatively you could use libraries, but in this case you will need to convert dataset to their format.

You need to compute only item similarities, not predictions for users.

In [23]:
def calculate_i2i_similarities(ratings, method='cosine', min_common=5):
    """Compute item-item similarities using different methods
    
    Args:
        ratings: sparse matrix of user-item interactions
        method: similarity method ('cosine', 'pearson', or 'jaccard')
        min_common: minimum number of users in common
        
    Returns:
        similarities: item-item similarity matrix
    """
    import numpy as np
    import scipy.sparse as sp
    from tqdm.notebook import tqdm
    
    # Ensure we have CSR format for the item-user matrix
    item_user = ratings.T.tocsr()  # Transpose to get item-user matrix
    
    num_items = ratings.shape[1]
    
    # Initialize similarity matrix (sparse)
    similarities = sp.lil_matrix((num_items, num_items))
    
    # Compute item norms for cosine similarity
    if method == 'cosine':
        # Compute L2 norm for each item vector
        item_norms = np.sqrt(np.array(item_user.power(2).sum(axis=1)).flatten())
    
    # Iterate through items and compute similarities
    for i in tqdm(range(num_items), desc=f"Computing {method} similarities"):
        # Skip if item i has no ratings
        if item_user[i].nnz == 0:
            continue
            
        # Get users who rated item i
        i_users = item_user[i].indices
        
        # We only need to compute similarities for j > i (symmetric matrix)
        for j in range(i+1, num_items):
            # Skip if item j has no ratings
            if item_user[j].nnz == 0:
                continue
                
            # Get users who rated item j
            j_users = item_user[j].indices
            
            # Find common users who rated both items
            common_users = np.intersect1d(i_users, j_users, assume_unique=True)
            
            # Skip if not enough common users
            if len(common_users) < min_common:
                continue
            
            # Compute similarity based on method
            if method == 'cosine':
                # Get ratings for common users
                i_data = item_user[i, common_users].toarray().flatten()
                j_data = item_user[j, common_users].toarray().flatten()
                
                # Compute cosine similarity: dot(i,j) / (norm(i) * norm(j))
                dot_product = np.dot(i_data, j_data)
                similarity = dot_product / (item_norms[i] * item_norms[j])
                
            elif method == 'jaccard':
                # Jaccard similarity = |intersection| / |union|
                similarity = len(common_users) / (len(i_users) + len(j_users) - len(common_users))
                
            elif method == 'pearson':
                # Get ratings for common users
                i_data = item_user[i, common_users].toarray().flatten()
                j_data = item_user[j, common_users].toarray().flatten()
                
                # Compute means
                i_mean = np.mean(i_data)
                j_mean = np.mean(j_data)
                
                # Centered vectors
                i_centered = i_data - i_mean
                j_centered = j_data - j_mean
                
                # Compute numerator and denominator
                numerator = np.dot(i_centered, j_centered)
                denominator = np.sqrt(np.dot(i_centered, i_centered) * np.dot(j_centered, j_centered))
                
                # Handle division by zero
                if denominator == 0:
                    similarity = 0
                else:
                    similarity = numerator / denominator
            
            # Store similarity value (symmetric)
            similarities[i, j] = similarity
            similarities[j, i] = similarity
    
    # Convert to CSR format for efficient operations
    return similarities.tocsr()


def get_ials_similarities(item_embeddings):
    """Compute item-item similarities from IALS embeddings
    
    Args:
        item_embeddings: Matrix of item embeddings from IALS
        
    Returns:
        similarities: Item-item similarity matrix
    """
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Compute cosine similarities between item embeddings
    similarities = cosine_similarity(item_embeddings)
    
    # Set diagonal to zero (self-similarity)
    np.fill_diagonal(similarities, 0)
    
    return similarities

In [24]:
# i2i_similarities = ... # your code here


## Compare similarities from four algorithms (20 points)

* plot distributions
* compute metrics (which you think are relevant)
* look at several top similar lists

Make conclusion how these methods differ in computing similarities

In [25]:
def compare_similarities(similarities_list, method_names, item_names=None, sample_size=5):
    """Compare similarities from different methods
    
    Args:
        similarities_list: list of similarity matrices
        method_names: list of method names
        item_names: DataFrame mapping item IDs to names (optional)
        sample_size: number of items to sample for comparison
        
    Returns:
        None (displays plots and prints statistics)
    """
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy.sparse as sp
    
    n_methods = len(similarities_list)
    
    # 1. Plot similarity distributions
    plt.figure(figsize=(n_methods*5, 5))
    
    for i, (sim_matrix, method) in enumerate(zip(similarities_list, method_names)):
        plt.subplot(1, n_methods, i+1)
        
        # Get similarity values, handling both sparse and dense matrices
        if sp.issparse(sim_matrix):
            # For sparse matrix, use the data array
            sim_values = sim_matrix.data
        else:
            # For dense matrix, flatten and remove diagonal
            mask = ~np.eye(sim_matrix.shape[0], dtype=bool)
            sim_values = sim_matrix[mask]
        
        # Plot distribution
        sns.histplot(sim_values, kde=True)
        plt.title(f"{method} Similarity Distribution")
        plt.xlabel("Similarity Value")
        plt.ylabel("Frequency")
    
    plt.tight_layout()
    plt.show()
    
    # 2. Compute and print basic statistics
    print("Similarity Statistics:")
    print("-" * 50)
    
    for method, sim_matrix in zip(method_names, similarities_list):
        # Get similarity values
        if sp.issparse(sim_matrix):
            sim_values = sim_matrix.data
            density = sim_matrix.nnz / (sim_matrix.shape[0] * sim_matrix.shape[1])
        else:
            mask = ~np.eye(sim_matrix.shape[0], dtype=bool)
            sim_values = sim_matrix[mask]
            density = np.count_nonzero(sim_values) / len(sim_values)
        
        # Compute statistics
        print(f"{method}:")
        print(f"  Mean: {np.mean(sim_values):.4f}")
        print(f"  Median: {np.median(sim_values):.4f}")
        print(f"  Min: {np.min(sim_values):.4f}")
        print(f"  Max: {np.max(sim_values):.4f}")
        print(f"  Standard Deviation: {np.std(sim_values):.4f}")
        print(f"  Density: {density:.4f}")
        print()
    
    # 3. Compare top similar items for sample items
    print("\
Top Similar Items Comparison:")
    print("-" * 50)
    
    # Sample random items
    n_items = similarities_list[0].shape[0]
    sampled_items = np.random.choice(range(n_items), min(sample_size, n_items), replace=False)
    
    for item_id in sampled_items:
        print(f"\
Item {item_id}")
        
        # Print item name if available
        if item_names is not None:
            item_name = item_names.loc[item_names['id'] == item_id, 'name'].values
            if len(item_name) > 0:
                print(f"Name: {item_name[0]}")
        
        # Compare top similar items from each method
        for method, sim_matrix in zip(method_names, similarities_list):
            # Get similarity row for this item
            if sp.issparse(sim_matrix):
                sim_row = sim_matrix[item_id].toarray().flatten()
            else:
                sim_row = sim_matrix[item_id].copy()
            
            # Set self-similarity to -inf to exclude from top items
            sim_row[item_id] = -np.inf
            
            # Get top 5 similar items
            top_indices = np.argsort(-sim_row)[:5]
            top_similarities = sim_row[top_indices]
            
            # Print top similar items
            print(f"\
{method} Top 5 Similar Items:")
            for idx, (similar_item, sim) in enumerate(zip(top_indices, top_similarities)):
                output = f"  {idx+1}. Item {similar_item} (sim={sim:.4f})"
                
                # Add item name if available
                if item_names is not None:
                    similar_name = item_names.loc[item_names['id'] == similar_item, 'name'].values
                    if len(similar_name) > 0:
                        output += f": {similar_name[0]}"
                        
                print(output)
    
    # 4. Compute agreement between methods
    print("\
Agreement Between Methods:")
    print("-" * 50)
    
    # Initialize agreement matrix
    agreement_matrix = np.zeros((n_methods, n_methods))
    
    for i in range(n_methods):
        for j in range(n_methods):
            if i == j:
                agreement_matrix[i, j] = 1.0
                continue
            
            # Get similarity values for both methods
            if sp.issparse(similarities_list[i]):
                sim_i = similarities_list[i].toarray().flatten()
            else:
                sim_i = similarities_list[i].flatten()
                
            if sp.issparse(similarities_list[j]):
                sim_j = similarities_list[j].toarray().flatten()
            else:
                sim_j = similarities_list[j].flatten()
            
            # Compute correlation for non-zero values
            mask = (sim_i != 0) & (sim_j != 0)
            if np.sum(mask) > 0:
                agreement_matrix[i, j] = np.corrcoef(sim_i[mask], sim_j[mask])[0, 1]
    
    # Plot agreement matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(agreement_matrix, annot=True, fmt=".2f", cmap="YlGnBu",
               xticklabels=method_names, yticklabels=method_names)
    plt.title("Correlation Between Similarity Methods")
    plt.tight_layout()
    plt.show()
    
    return None

In [None]:
import numpy as np
import scipy.sparse as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# 1. Split the dataset
train_ratings, test_ratings = split_data(user_item_matrix)
print(f"Training matrix: {train_ratings.shape}, nnz: {train_ratings.nnz}")
print(f"Test matrix: {test_ratings.shape}, nnz: {test_ratings.nnz}")

# 2. Train IALS model
print("\
Training IALS model...")
user_embeddings, item_embeddings = ials(train_ratings, k=40, lam=0.1, n_iterations=10, alpha=40)
print(f"User embeddings shape: {user_embeddings.shape}")
print(f"Item embeddings shape: {item_embeddings.shape}")

# 3. Compute MRR
print("\
Computing MRR@100...")
mrr_score = mrr(user_embeddings, item_embeddings, test_ratings, train_ratings, k=100)
print(f"MRR@100: {mrr_score:.4f}")

# 4. Tune hyperparameters (uncomment to run - this takes time)
# print("\Tuning IALS hyperparameters...")
# best_params, best_mrr = tune_ials_hyperparameters(train_ratings, test_ratings)

# 5. Train with optimal hyperparameters
# You would normally use the best parameters from tuning
optimal_k = 40  # Example value - use results from tuning
optimal_lam = 0.1
optimal_alpha = 40

print(f"\
Training IALS with optimal parameters (k={optimal_k}, λ={optimal_lam}, α={optimal_alpha})...")
user_embeddings_opt, item_embeddings_opt = ials(
    train_ratings, k=optimal_k, lam=optimal_lam, alpha=optimal_alpha, n_iterations=15
)

# Calculate MRR with optimal model
mrr_opt = mrr(user_embeddings_opt, item_embeddings_opt, test_ratings, train_ratings, k=100)
print(f"Optimal MRR@100: {mrr_opt:.4f}")

# 6. Compute item similarities using different methods
print("\
Computing item similarities...")

# I2I Cosine similarity from collaborative filtering
print("- Computing I2I Cosine similarity...")
i2i_cosine_sim = calculate_i2i_similarities(train_ratings, method='cosine', min_common=5)

# I2I Jaccard similarity
print("- Computing I2I Jaccard similarity...")
i2i_jaccard_sim = calculate_i2i_similarities(train_ratings, method='jaccard', min_common=5)

# IALS-based similarity
print("- Computing IALS-based similarity...")
ials_sim = get_ials_similarities(item_embeddings_opt)

# 7. Compare similarities
print("\
Comparing similarities between methods...")
similarities_list = [i2i_cosine_sim, i2i_jaccard_sim, ials_sim]
method_names = ['I2I Cosine', 'I2I Jaccard', 'IALS']

# Compare with item names for better interpretability
compare_similarities(similarities_list, method_names, item_names=item_names, sample_size=3)

Training matrix: (75910, 104503), nnz: 5777242
Test matrix: (75910, 104503), nnz: 15181
Training IALS model...


IALS iterations:   0%|          | 0/10 [00:00<?, ?it/s]

User embeddings shape: (75910, 40)
Item embeddings shape: (104503, 40)
Computing MRR@100...
MRR@100: 0.0932
Training IALS with optimal parameters (k=40, λ=0.1, α=40)...


IALS iterations:   0%|          | 0/15 [00:00<?, ?it/s]

Optimal MRR@100: 0.0944
Computing item similarities...
- Computing I2I Cosine similarity...


Computing cosine similarities:   0%|          | 0/104503 [00:00<?, ?it/s]

Conclusion:

....