## User-Based Collaborative Filtering
For a given user $ u $ and item $ i $, predict the rating $ r_{ui} $ as follows:

$$
r_{ui} = \frac{\sum_{v \in N_u} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_u} |\text{sim}(u, v)|}
$$

Where:
- $ N_u $: Neighbors of user $ u $; in this notebook, all users who rated item $ i $.
- $ \text{sim}(u, v) $: Similarity between users $ u $ and $ v $.
- $ r_{vi} $: Rating of user $ v $ for item $ i $.
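
As a minimal sketch of the formula above, the snippet below computes the similarity-weighted average for one user and one item. The similarity and rating values are made up for illustration, and `weighted_average_prediction` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

# Hypothetical toy values: similarities of user u to three neighbors
# who rated item i, and those neighbors' ratings of item i.
sims = np.array([0.9, 0.5, 0.1])
ratings = np.array([4.0, 3.0, 1.0])

def weighted_average_prediction(sims, ratings):
    """Similarity-weighted average of neighbor ratings."""
    denom = np.abs(sims).sum()
    if denom == 0:
        return float("nan")  # no informative neighbors
    return float(np.dot(sims, ratings) / denom)

pred = weighted_average_prediction(sims, ratings)
print(round(pred, 3))
```

The most similar neighbor dominates the result, so the prediction lands closer to 4.0 than to the plain average of the three ratings.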

---

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from tqdm import tqdm

In [2]:
# Description of the idea
# Collaborative filtering:

# Given a user x and an item i, estimate the rating r_xi by:
# 1. finding a set of users N who rated the same items as x
# 2. aggregating the ratings of i provided by the users in N

# In practice:
# Take all items that user x has already rated, then search for other users who have rated those items
# and have also rated the new item i, for which x's rating is still unknown. This selects users with
# similar interests.

# Similarity between users can be computed with cosine or Pearson similarity.
# One downside of Pearson: it is undefined when the variance of a user's ratings is 0, e.g. all ratings are 2.5.

# The aggregation function can use whether the neighbors' ratings for the unseen item i are higher or lower than their own average.
# ..
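
The notes above mention Pearson similarity, its zero-variance problem, and aggregating deviations from each neighbor's average. The following is a rough sketch of both ideas on a tiny made-up matrix, not the method used later in this notebook (which uses cosine similarity and a plain weighted average). The names `pearson_similarity` and `mean_centered_prediction` are hypothetical, `NaN` marks an unrated item, and returning 0.0 for an undefined correlation is an assumption made here for convenience.

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation over co-rated items (NaN = not rated).
    Returns 0.0 when undefined: fewer than 2 co-rated items or zero variance."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    x, y = a[both], b[both]
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def mean_centered_prediction(R, u, i):
    """User u's mean plus the weighted deviation of neighbors from their own means."""
    user_means = np.nanmean(R, axis=1)
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue  # skip the user themself and users who did not rate item i
        s = pearson_similarity(R[u], R[v])
        num += s * (R[v, i] - user_means[v])
        den += abs(s)
    return float(user_means[u]) if den == 0 else float(user_means[u] + num / den)

# Toy ratings: rows are users, columns are items.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],   # user 0: item 2 is the unknown rating
    [4.0, 2.0, 4.0, 1.0],
    [1.0, 5.0, 2.0, 4.0],
    [2.5, 2.5, 2.5, 2.5],      # constant rater: Pearson is undefined
])

pred = mean_centered_prediction(R, u=0, i=2)
print(pearson_similarity(R[0], R[3]), pred)
```

The constant rater in the last row contributes nothing to the prediction, which is one way to sidestep the zero-variance case.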

In [6]:
train_df = pd.read_csv('../data/train.csv')

In [7]:
train_df.head()

Unnamed: 0,book_id,user_id,rating
0,7260,20145,3.5
1,243238,85182,4.0
2,9135,45973,1.0
3,18671,63554,3.0
4,243293,81002,5.0


In [8]:
train_df['book_id'].value_counts()

book_id
408       257
748       213
522       149
356       142
26        142
         ... 
247693      1
248107      1
245643      1
246570      1
246356      1
Name: count, Length: 15712, dtype: int64

In [10]:
train_df['user_id'].value_counts()

user_id
3785     2041
28251     524
43652     350
5180      345
27445     266
         ... 
87162       1
83607       1
79107       1
89349       1
87278       1
Name: count, Length: 18905, dtype: int64

In [11]:
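# Rows are users, columns are books; pivot assumes each (user_id, book_id) pair occurs at most once.
user_item_matrix = train_df.pivot(index='user_id', columns='book_id', values='rating')
user_item_matrix.fillna(0, inplace=True)  # 0 encodes "not rated"
user_item_sparse = csr_matrix(user_item_matrix)  # sparse copy for the similarity computation
user_similarity = cosine_similarity(user_item_sparse)  # dense (n_users, n_users) cosine similarity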
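
To make the shapes concrete, here is a small sketch of the same pivot, fill, and similarity steps on a made-up ratings table. The `toy` DataFrame and its values are invented for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: users 1 and 2 agree exactly, user 3 rated a different book.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "book_id": [10, 20, 10, 20, 30],
    "rating":  [4.0, 2.0, 4.0, 2.0, 5.0],
})

toy_matrix = toy.pivot(index="user_id", columns="book_id", values="rating").fillna(0)
toy_sim = cosine_similarity(csr_matrix(toy_matrix.to_numpy()))

print(toy_matrix)
print(toy_sim.round(2))
```

Because missing ratings are filled with 0, users with no books in common get a similarity of exactly 0, and every similarity is non-negative.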

In [18]:
def predict_rating(user_id, book_id, user_item_matrix, user_similarity):
    """Predict one rating as the similarity-weighted average of neighbors' ratings."""
    user_idx = user_item_matrix.index.get_loc(user_id)
    item_idx = user_item_matrix.columns.get_loc(book_id)
    item_ratings = user_item_matrix.iloc[:, item_idx]
    # Neighbors are all users with a stored rating for this book (0 means "not rated").
    neighbors = item_ratings[item_ratings > 0].index
    neighbor_idxs = user_item_matrix.index.get_indexer(neighbors)
    similarities = user_similarity[user_idx, neighbor_idxs]
    ratings = item_ratings.loc[neighbors].to_numpy()
    numerator = np.dot(similarities, ratings)
    denominator = np.sum(np.abs(similarities))

    # No neighbor has non-zero similarity to this user, so no prediction is possible.
    if denominator == 0:
        return np.nan

    return numerator / denominator

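def predict_test_ratings(test_df, user_item_matrix, user_similarity, batch_size=128):
    """Predict a rating for every (user_id, book_id) row of test_df, batch by batch."""
    user_item_np = user_item_matrix.to_numpy()
    # Global mean of the observed ratings, used as the fallback prediction.
    mean_rating = user_item_np[user_item_np > 0].mean()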
    
    user_to_idx = {user: idx for idx, user in enumerate(user_item_matrix.index)}
    book_to_idx = {book: idx for idx, book in enumerate(user_item_matrix.columns)}
    
    predictions = []
    num_batches = len(test_df) // batch_size + int(len(test_df) % batch_size > 0)

    for batch_idx in tqdm(range(num_batches), desc="Predicting test set in batches"):
        start_idx = batch_idx * batch_size
        end_idx = min((batch_idx + 1) * batch_size, len(test_df))
        
        batch = test_df.iloc[start_idx:end_idx]
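        # Users or books absent from the training matrix map to NaN.
        user_indices = batch['user_id'].map(user_to_idx).to_numpy()
        book_indices = batch['book_id'].map(book_to_idx).to_numpy()
        valid_mask = (~np.isnan(user_indices)) & (~np.isnan(book_indices))
        # Default every prediction to the global mean; valid pairs are overwritten below.
        batch_predictions = np.full(len(batch), mean_rating, dtype=np.float32)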
        
        valid_user_indices = user_indices[valid_mask].astype(int)
        valid_book_indices = book_indices[valid_mask].astype(int)

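        # Rows: similarities of each test user to all training users.
        user_similarities = user_similarity[valid_user_indices, :]
        # Columns: all training users' ratings of each test book (0 = not rated).
        item_ratings = user_item_np[:, valid_book_indices]
        item_ratings_mask = item_ratings > 0
        # Entry (a, b) pairs test user a with test book b; only the diagonal (a == b) is needed.
        weighted_ratings = np.dot(user_similarities, item_ratings * item_ratings_mask)
        # Cosine similarities of non-negative ratings are >= 0, so no abs() is needed here.
        similarity_sums = np.dot(user_similarities, item_ratings_mask)

        # Fall back to the global mean where no neighbor has positive similarity.
        predictions_valid = np.divide(
            weighted_ratings,
            similarity_sums,
            out=np.full_like(weighted_ratings, mean_rating),
            where=similarity_sums > 0
        )

        batch_predictions[valid_mask] = predictions_valid.diagonal()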
        predictions.extend(zip(batch['id'], batch_predictions))
    
    return predictions
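
The batched function above builds a full batch-by-batch product and keeps only its diagonal. As a possible alternative, the sketch below computes only the needed dot products with `np.einsum` and checks that both approaches agree. The random matrices are hypothetical stand-ins for `user_similarity` and the zero-filled rating matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 6, 4

# Hypothetical stand-ins for the similarity matrix and the 0-filled ratings.
sim = rng.random((n_users, n_users))
ratings = rng.integers(0, 6, size=(n_users, n_items)).astype(float)  # 0 = not rated

users = np.array([0, 2, 5])   # row index of each test pair's user
items = np.array([1, 1, 3])   # column index of each test pair's book

S = sim[users, :]             # (batch, n_users)
R = ratings[:, items]         # (n_users, batch)
M = (R > 0).astype(float)     # 1.0 where a training user rated the book

# Approach used above: full (batch, batch) product, then the diagonal.
num_diag = np.dot(S, R * M).diagonal()
den_diag = np.dot(S, M).diagonal()

# Row-wise alternative: one dot product per test pair.
num_row = np.einsum("ij,ji->i", S, R * M)
den_row = np.einsum("ij,ji->i", S, M)

print(np.allclose(num_diag, num_row), np.allclose(den_diag, den_row))
```

The row-wise form does work proportional to the batch size times the number of users, instead of the batch size squared times the number of users, which may matter for larger batches.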

### Submit

In [19]:
test_df = pd.read_csv('../data/test.csv')

In [20]:
predictions = predict_test_ratings(test_df, user_item_matrix, user_similarity)

Predicting test set in batches: 100%|██████████| 230/230 [00:16<00:00, 14.01it/s]


In [21]:
predictions[-1]

(29366, 2.3218882)

In [34]:
output_df = pd.DataFrame(predictions, columns=['id', 'rating'])
output_csv_path = 'predictions.csv'
output_df.to_csv(output_csv_path, index=False)
print(f"Predictions saved to {output_csv_path}")

Predictions saved to predictions.csv
