# ALS applications

## Dzen dataset

Data comes from [dzen.ru](https://dzen.ru/) site and consists of likes which users put to text articles

### Columns
1. item_id - unique id of an item (article)
2. user_id - unique id of a user
3. source_id - unique id of an author. If two items have same source_id, then they come from one author
4. Name of item is name of the article
5. Raw dataset represents user_id and list of item_ids which user liked

In [21]:
# !curl -O -J -L 'https://www.dropbox.com/s/ia4bvhuqg8kesee/zen_dataset.zip?dl=1'
# !unzip zen_dataset.zip

In [2]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from tqdm.notebook import tqdm
import ast

In [3]:
# Load the datasets
item_names = pd.read_csv("zen_item_to_name.csv")
item_sources = pd.read_csv("zen_item_to_source.csv")
dataset = pd.read_csv("zen_ratings.csv", converters={'item_ids': ast.literal_eval})

In [4]:
# Total number of interactions
total_interactions_count = dataset.item_ids.map(len).sum()

# Preprocessing the data for the sparse user-item matrix
user_coo = np.zeros(total_interactions_count, dtype=np.int64)
item_coo = np.zeros(total_interactions_count, dtype=np.int64)
pos = 0

for user_id, item_ids in enumerate(tqdm(dataset.item_ids)):
    user_coo[pos : pos + len(item_ids)] = user_id
    item_coo[pos : pos + len(item_ids)] = item_ids
    pos += len(item_ids)

# Create the user-item sparse matrix in COO format and then convert to CSR
shape = (max(user_coo) + 1, max(item_coo) + 1)
user_item_matrix = sp.coo_matrix(
    (np.ones(len(user_coo)), (user_coo, item_coo)), shape=shape
)
user_item_matrix = user_item_matrix.tocsr()

# Save the matrix for later use
sp.save_npz("data_train.npz", user_item_matrix)
# Cleanup memory. Later you need just data_train.npz
del user_coo
del item_coo
del dataset

  0%|          | 0/75910 [00:00<?, ?it/s]

In [5]:
# you could start here if you already done precomputing
# Load the sparse matrix for use in models
user_item_matrix = sp.load_npz("data_train.npz")

In [6]:
# Function to report sparse matrix size
def sparce_matrix_report(matrix):
    print('Size of raw data:', matrix.data.nbytes / 10**6, 'Mb')
    print('Feedback matrix size:', matrix.shape)

In [7]:
sparce_matrix_report(user_item_matrix)

Size of raw data: 46.339384 Mb
Feedback matrix size: (75910, 104503)


In [8]:
# Item weight distribution (for debugging)
item_weights = np.array(user_item_matrix.tocsc().sum(0))[0]
top_to_bottom_order = np.argsort(-item_weights)
item_mapping = np.empty(top_to_bottom_order.shape, dtype=int)
item_mapping[top_to_bottom_order] = np.arange(len(top_to_bottom_order))

# Define a function to build a debug dataset for faster testing
def build_debug_dataset(user_item_matrix, item_pct: float, user_pct: float):
    '''Get given percent of top rated items and given percent of random users'''
    total_item_count = (item_weights > 0).sum()
    total_user_count = user_item_matrix.shape[0]

    user_count = int(total_user_count * user_pct),
    item_count = int(total_item_count * item_pct)
    item_ids = top_to_bottom_order[:item_count]
    user_ids = np.random.choice(
        np.arange(user_item_matrix.shape[0]), size=user_count, replace=False
    )
    train = user_item_matrix[user_ids]
    train = train[:, item_ids]
    return train

In [9]:
debug_dataset = build_debug_dataset(user_item_matrix, 0.05, 0.05)
sparce_matrix_report(debug_dataset)

Size of raw data: 1.095792 Mb
Feedback matrix size: (3795, 5019)


This is useful for debugging (just to save time).

**Final answers should use full dataset!!!**

## Split dataset matrix (5 points)

in the following way: for 20% of users (random) remove one like - this will be test data. The rest is train data.

In [10]:
def split_data(ratings):
    # your code here
    """
    Разделение матрицы рейтингов: для 20% пользователей удаляем один лайк
    """
    import numpy as np
    import scipy.sparse as sp
    
    # Choose 20% random users for testing, remove one like for each
    n_users = ratings.shape[0]
    n_test_users = int(0.2 * n_users)
    test_user_indices = np.random.choice(n_users, n_test_users, replace=False)
    
    # Initialize test matrix
    test_data = []
    test_row = []
    test_col = []
    
    # Create a copy for the training matrix
    train_matrix = ratings.copy()
    
    for user_idx in test_user_indices:
        # Get items the user interacted with
        user_items = ratings.getrow(user_idx).indices
        
        if len(user_items) > 0:
            # Randomly remove one item for the test set
            item_idx = np.random.choice(user_items)
            
            # Add to the test set
            test_data.append(1.0)  # Бинарные взаимодействия
            test_row.append(user_idx)
            test_col.append(item_idx)
            
            # Remove from the training set
            train_matrix[user_idx, item_idx] = 0
    
    # Создаем тестовую матрицу
    test_matrix = sp.coo_matrix((test_data, (test_row, test_col)), shape=ratings.shape)
    test_matrix = test_matrix.tocsr()
    
    # Убеждаемся, что тренировочная матрица в формате CSR
    train_matrix = train_matrix.tocsr()
    train_matrix.eliminate_zeros()  # Удаляем нулевые элементы
    
    return train_matrix, test_matrix

In [11]:
train_ratings, test_ratings = split_data(user_item_matrix)

## Implement IALS (10 points each)

Note that due to size of data you need to implement algorithm with _sparce matrices_!

You are welcome to use classes like on the seminar:)

In [12]:
# Function for GPU-accelerated IALS (Implicit Alternating Least Squares)
def ials_gpu(ratings, k=40, lam=0.1, n_iterations=10, alpha=40):
    '''GPU-accelerated Implicit Alternating Least Squares algorithm'''
    import cupy as cp
    import numpy as np
    from tqdm.notebook import tqdm
    
    # Переносим данные на GPU
    num_users, num_items = ratings.shape
    
    # Инициализируем факторы на GPU
    user_embeddings = cp.random.normal(0, 0.01, (num_users, k)).astype(cp.float32)
    item_embeddings = cp.random.normal(0, 0.01, (num_items, k)).astype(cp.float32)
    
    # Формат CSR для эффективной работы
    ratings_csr = ratings.tocsr()
    ratings_csc = ratings.tocsc()
    
    # Идентичная матрица для регуляризации на GPU
    lambda_I = cp.eye(k, dtype=cp.float32) * lam
    
    for _ in tqdm(range(n_iterations), desc="IALS iterations"):
        # Step 1: Fix item factors and solve for user factors
        YtY = cp.dot(item_embeddings.T, item_embeddings)
        
        for u in range(num_users):
            items = ratings_csr[u].indices            
            if len(items) == 0:
                continue
                
            factors = cp.array(item_embeddings[items].get(), dtype=cp.float32)
            confidence = 1.0 + alpha
            
            # Решаем систему на GPU
            A = YtY + cp.dot(factors.T, (confidence - 1.0) * factors) + lambda_I
            b = confidence * (cp.sum(factors, axis=0))
            
            try:
                user_embeddings[u] = cp.linalg.solve(A, b)
            except cp.linalg.LinAlgError:
                user_embeddings[u] = cp.linalg.lstsq(A, b)[0]
        
        # Step 2: Fix user factors and solve for item factors
        XtX = cp.dot(user_embeddings.T, user_embeddings)
        
        for i in range(num_items):
            users = ratings_csc[:, i].indices
            
            if len(users) == 0:
                continue
                
            factors = cp.array(user_embeddings[users].get(), dtype=cp.float32)
            confidence = 1.0 + alpha
            
            A = XtX + cp.dot(factors.T, (confidence - 1.0) * factors) + lambda_I
            b = confidence * (cp.sum(factors, axis=0))
            
            try:
                item_embeddings[i] = cp.linalg.solve(A, b)
            except cp.linalg.LinAlgError:
                item_embeddings[i] = cp.linalg.lstsq(A, b)[0]
    
    # Возвращаем результаты на CPU
    return user_embeddings.get(), item_embeddings.get()

In [13]:
# ials_predictions = ials_gpu(train_ratings, k=40, lam= 0.1)

## Compute MRR@100 metric for test users

For ALS and IALS algorithms.

**Don't forget to use full dataset!**

In [14]:
def mrr_gpu(user_embeddings, item_embeddings, test_ratings, train_ratings, k=100):
    """GPU-accelerated MRR calculation"""
    import cupy as cp
    import numpy as np
    
    # Переносим эмбеддинги на GPU
    user_embeddings_gpu = cp.array(user_embeddings)
    item_embeddings_gpu = cp.array(item_embeddings)
    
    test_ratings_csr = test_ratings.tocsr()
    train_ratings_csr = train_ratings.tocsr()
    
    test_users = np.unique(test_ratings_csr.nonzero()[0])
    
    rr_sum = 0.0
    count = 0
    
    for user in test_users:
        test_items = test_ratings_csr[user].indices        
        if len(test_items) == 0:
            continue
            
        train_items = train_ratings_csr[user].indices
        
        # Вычисляем скоры на GPU
        user_emb = user_embeddings_gpu[user]
        scores = cp.dot(user_emb, item_embeddings_gpu.T)
        
        # Маскируем тренировочные элементы
        scores_np = scores.get()
        scores_np[train_items] = -np.inf
        
        # Получаем топ-k элементов
        top_items = np.argsort(-scores_np)[:k]
        
        for test_item in test_items:
            rank_idx = np.where(top_items == test_item)[0]
            
            if len(rank_idx) > 0:
                rank = rank_idx[0] + 1
                rr_sum += 1.0 / rank
                count += 1
    
    if count > 0:
        mrr_value = rr_sum / count
    else:
        mrr_value = 0.0
        
    return mrr_value

In [15]:
# Train IALS with GPU
user_embeddings_gpu, item_embeddings_gpu = ials_gpu(train_ratings, k=40, lam=0.1, n_iterations=10, alpha=40)


IALS iterations:   0%|          | 0/10 [00:00<?, ?it/s]

In [16]:
# Compute MRR for the model
mrr_value = mrr_gpu(user_embeddings_gpu, item_embeddings_gpu, test_ratings, train_ratings, k=100)
print(f"MRR@100: {mrr_value:.4f}")

MRR@100: 0.0911


In [None]:
def get_ials_similarities_sparse_gpu(item_embeddings, batch_size=1000):
    import cupy as cp
    from scipy.sparse import lil_matrix
    
    # Ensure that item_embeddings is a CuPy array
    item_embeddings = cp.asarray(item_embeddings)  # Convert to CuPy array if it's a NumPy array
    
    num_items = item_embeddings.shape[0]
    
    # Normalize the embeddings to prevent large values in the dot product
    norms = cp.linalg.norm(item_embeddings, axis=1, keepdims=True)
    normalized_embeddings = item_embeddings / (norms + 1e-8)
    
    # Initialize a sparse similarity matrix in LIL format (efficient for incremental construction)
    similarities = lil_matrix((num_items, num_items), dtype=cp.float32)
    
    # Process in batches to avoid memory issues
    for start_idx in range(0, num_items, batch_size):
        end_idx = min(start_idx + batch_size, num_items)
        
        # Slice the batch of embeddings
        batch_embeddings = normalized_embeddings[start_idx:end_idx]
        
        # Compute the dot product for the batch with only the items in the batch
        batch_similarities = cp.dot(batch_embeddings, normalized_embeddings.T)
        
        # Store only non-zero similarities in the sparse matrix
        for i in range(batch_similarities.shape[0]):
            for j in range(batch_similarities.shape[1]):
                if batch_similarities[i, j] != 0:  # Only store non-zero similarities
                    similarities[start_idx + i, j] = batch_similarities[i, j].get()  # Convert to NumPy
    
    # Set diagonal to zero (self-similarity)
    cp.fill_diagonal(similarities, 0)
    
    # Return the computed similarities as a sparse matrix on CPU
    return similarities

# Example of computing the item-item similarity matrix using sparse format
item_similarities_sparse = get_ials_similarities_sparse_gpu(item_embeddings_gpu, batch_size=500)

# Output the similarity matrix
print(item_similarities_sparse)

## Adjust hyperparameters of IALS to maximize MRR (10 points)

Main hyperparameters are regularization and weights for implicit case.

In [40]:
# your code here


Optimal parameters of IALS are:

....

## Get similarities from item2item CF (10 points)

Item2item can be taken from the first homework, SLIM was implemented in the class.

Alternatively you could use libraries, but in this case you will need to convert dataset to their format.

You need to compute only item similarities, not predictions for users.

## Compare similarities from four algorithms (20 points)

* plot distributions
* compute metrics (which you think are relevant)
* look at several top similar lists

Make conclusion how these methods differ in computing similarities

In [None]:
# your code here

Conclusion:

....