# Item-to-item collaborative filtering

The idea behind item-to-item collaborative filtering is to recommend items that are similar to the ones a user has already rated, where similarity is measured by how users rated the items.\
To do this, we build the user-item interaction matrix and compute the cosine similarity between its item columns.
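
To make the idea concrete, here is a minimal sketch of item-item cosine similarity on a tiny dense matrix. The toy ratings and the helper name `item_cosine_similarity` are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np

# Hypothetical toy interaction matrix: rows are users, columns are items,
# 0 means "no rating".
ratings = np.array([
    [5, 3, 0],
    [4, 0, 0],
    [1, 1, 5],
    [0, 0, 4],
], dtype=float)

def item_cosine_similarity(matrix: np.ndarray) -> np.ndarray:
    """Return the item-by-item cosine similarity of a user-by-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)  # one norm per item column
    norms[norms == 0] = 1.0                 # avoid division by zero for unrated items
    normalized = matrix / norms
    return normalized.T @ normalized

similarity = item_cosine_similarity(ratings)
print(np.round(similarity, 3))
```

Items 0 and 1 are rated by overlapping users, so their similarity comes out higher than that of items 0 and 2. A real dataset is far too sparse for a dense matrix like this, which is why a dedicated library is used later on.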

In [38]:
import numpy as np
import pandas as pd

**Preprocessing**\
For this task we need the ratings table only.

In [39]:
ratings_df = pd.read_csv('../data/Ratings.csv', delimiter=';', dtype={'User-ID': np.int32, 'ISBN': str, 'Rating': np.int8})
ratings_df.columns = ['user', 'item', 'label']
ratings_df.head()

| | user | item | label |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


Then we drop the rows with:
- missing data (all the values are crucial)
- zero ratings (in this dataset a 0 marks an implicit interaction rather than an explicit score, so it carries no rating information)

In [40]:
print('Total ratings:', ratings_df.shape[0])

ratings_df = ratings_df[ratings_df['label'] > 0]
ratings_df = ratings_df.dropna()  # dropna() returns a new DataFrame, so assign it back

print('Total ratings (cleaned):', ratings_df.shape[0])

Total ratings: 1149780
Total ratings (cleaned): 433671
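
One detail of the cleaning step is easy to miss: `DataFrame.dropna()` does not modify the frame in place by default. The sketch below shows the same two filters on a hypothetical toy frame (the data is invented for illustration only).

```python
import pandas as pd

# Hypothetical toy ratings, using the same column names as above.
toy_df = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "item": ["A", "B", "A", None],
    "label": [0, 7, 5, 8],
})

# Keep explicit ratings only, then drop incomplete rows.
# dropna() returns a new DataFrame, so the result has to be assigned back.
cleaned = toy_df[toy_df["label"] > 0]
cleaned = cleaned.dropna()

print(len(toy_df), "->", len(cleaned))
```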


After experimenting with a straightforward sparse-matrix approach (the ItemRecommender class), we switched to LibRecommender, which is far superior in terms of performance due to a number of optimizations.\
The tasks performed below are:
- train/eval split
- conversion of the data to the LibRecommender input format
- model creation with 10 nearest neighbors configuration
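
To illustrate what a nearest-neighbors configuration means for an item-based model, here is a minimal sketch that keeps only the `k` most similar items per item and scores candidates for one user. It is not LibRecommender's implementation; the function names `top_k_neighbors` and `score_items` and the toy numbers are assumptions made for this example.

```python
import numpy as np

def top_k_neighbors(similarity: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest similarities in each row (self-similarity excluded)."""
    sim = similarity.copy()
    np.fill_diagonal(sim, 0.0)
    pruned = np.zeros_like(sim)
    for item in range(sim.shape[0]):
        neighbors = np.argsort(sim[item])[-k:]  # indices of the k most similar items
        pruned[item, neighbors] = sim[item, neighbors]
    return pruned

def score_items(user_ratings: np.ndarray, pruned_similarity: np.ndarray) -> np.ndarray:
    """Score each item as the similarity-weighted sum of the user's ratings."""
    scores = pruned_similarity @ user_ratings
    scores[user_ratings > 0] = -np.inf          # do not recommend already-rated items
    return scores

# Hypothetical 4-item similarity matrix and one user's ratings.
similarity = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.0, 0.2, 0.9, 1.0],
])
user_ratings = np.array([0.0, 0.0, 9.0, 0.0])

pruned = top_k_neighbors(similarity, k=1)
scores = score_items(user_ratings, pruned)
best_item = int(np.argmax(scores))
print(best_item)
```

Pruning to the top neighbors keeps the similarity structure small and limits the influence of weakly related items, which is the role `k_sim=10` plays in the configuration used in this notebook.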

In [41]:
from libreco.algorithms import ItemCF
from libreco.data import DatasetPure
from libreco.data import random_split

ratings_df = ratings_df[["user", "item", "label"]]
train_df, eval_df = random_split(ratings_df, test_size=0.2)

train_data, data_info = DatasetPure.build_trainset(train_df)
eval_data = DatasetPure.build_evalset(eval_df)

model = ItemCF(task="ranking", 
               data_info=data_info,
               k_sim=10, 
               sim_type="cosine", 
               min_common=1)

model.fit(train_data, neg_sampling=True)


Training start time: 2024-08-11 22:08:04
Final block size and num: (1251, 127)
sim_matrix elapsed: 113.577s
sim_matrix, shape: (158852, 158852), num_elements: 96093388, density: 0.3808 %


top_k: 100%|██████████| 158852/158852 [00:37<00:00, 4252.62it/s]


Evaluating the model on 3 metrics: precision, recall and NDCG (Normalized Discounted Cumulative Gain).
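
For reference, the sketch below computes the three metrics for a single user with binary relevance. It follows the common textbook definitions and is only an assumption about how such metrics are typically computed; the cutoff and averaging that LibRecommender applies are not shown here. The function name and the book ids are hypothetical.

```python
import math

def precision_recall_ndcg_at_k(recommended, relevant, k):
    """Binary-relevance precision@k, recall@k and NDCG@k for one user."""
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts a hit at 0-based position `rank` by log2(rank + 2).
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    # The ideal ranking puts all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

recommended = ["b1", "b2", "b3", "b4", "b5"]  # hypothetical ranked recommendations
relevant = {"b2", "b5", "b9"}                 # hypothetical held-out items
precision, recall, ndcg = precision_recall_ndcg_at_k(recommended, relevant, k=5)
print(precision, recall, ndcg)
```

Precision penalizes every recommended slot that misses, recall penalizes every relevant item that is not recommended, and NDCG additionally rewards placing hits near the top of the list.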

In [42]:
from libreco.evaluation import evaluate

eval_result = evaluate(model, eval_data, neg_sampling=True, metrics=[ "precision", "recall", "ndcg"])
print(f"Evaluation Results:\n{eval_result}")

no suitable recommendation for user 39925, return default recommendation
no suitable recommendation for user 25412, return default recommendation
no suitable recommendation for user 44239, return default recommendation
no suitable recommendation for user 12084, return default recommendation
no suitable recommendation for user 56690, return default recommendation
no suitable recommendation for user 17623, return default recommendation
no suitable recommendation for user 53915, return default recommendation
no suitable recommendation for user 66434, return default recommendation
no suitable recommendation for user 13863, return default recommendation
no suitable recommendation for user 1751, return default recommendation

eval_listwise: 100%|██████████| 16407/16407 [00:11<00:00, 1394.84it/s]


Evaluation Results:
{'precision': 0.004382275857865545, 'recall': 0.014942970740036218, 'ndcg': 0.033249598472747145}


The results are weak: precision is about 0.004 and recall about 0.015. Moreover, for some users no suitable recommendation could be found because too few ratings were available. We therefore decided to remove rare books from the dataset for this task.

**Removing rare books**

In [43]:
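# Count ratings per book and drop the books with 20 or fewer ratings.
# The "count" column name assumes pandas >= 2.0.
rating_count = pd.DataFrame(ratings_df["item"].value_counts())
rare_books = rating_count[rating_count["count"] <= 20].index
ratings_df = ratings_df[~ratings_df["item"].isin(rare_books)]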

print('Total ratings (rare books excluded):', ratings_df.shape[0])

Total ratings (rare books excluded): 94503


In [44]:
train_df, eval_df = random_split(ratings_df, test_size=0.2)

train_data, data_info = DatasetPure.build_trainset(train_df)
eval_data = DatasetPure.build_evalset(eval_df)

# Build and train the model
model = ItemCF(task="ranking", 
               data_info=data_info,
               k_sim=10, 
               sim_type="cosine", 
               min_common=1)

model.fit(train_data, neg_sampling=True)

Training start time: 2024-08-11 22:10:48
Final block size and num: (2034, 1)
sim_matrix elapsed: 0.080s
sim_matrix, shape: (2034, 2034), num_elements: 1217048, density: 29.4175 %


top_k: 100%|██████████| 2034/2034 [00:00<00:00, 5034.83it/s]
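
The density values reported in the two training logs can be reproduced from the logged shapes and element counts. The helper `density_percent` below is a hypothetical name used only for this check.

```python
def density_percent(num_elements: int, n_items: int) -> float:
    """Share of non-zero entries in an n_items x n_items similarity matrix, in percent."""
    return 100.0 * num_elements / (n_items * n_items)

# Numbers copied from the two training logs in this notebook.
full = density_percent(96_093_388, 158_852)   # all books with explicit ratings
filtered = density_percent(1_217_048, 2_034)  # rare books removed

print(f"{full:.4f} % vs {filtered:.4f} %")
```

After the rare books are removed, the similarity matrix is much smaller and far denser, so each remaining book has many more neighbors to draw on.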


In [45]:
from libreco.evaluation import evaluate

eval_result = evaluate(model, eval_data, neg_sampling=True, metrics=[ "precision", "recall", "ndcg"])
print(f"Evaluation Results:\n{eval_result}")

eval_listwise: 100%|██████████| 1676/1676 [00:02<00:00, 704.92it/s] 


Evaluation Results:
{'precision': 0.02199015366253916, 'recall': 0.10206671918612688, 'ndcg': 0.11273428439190147}


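Precision is still low in absolute terms (about 0.022), but it is roughly 5 times higher than on the dataset that included rare books (about 0.004). Recall and NDCG improved as well, from about 0.015 to 0.102 and from about 0.033 to 0.113.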
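
The size of the improvement can be checked directly from the two evaluation results printed above. Note that the two runs were evaluated on different test splits, so the ratios are indicative rather than a strict like-for-like comparison.

```python
# Evaluation results copied from the two runs in this notebook.
with_rare_books = {
    "precision": 0.004382275857865545,
    "recall": 0.014942970740036218,
    "ndcg": 0.033249598472747145,
}
without_rare_books = {
    "precision": 0.02199015366253916,
    "recall": 0.10206671918612688,
    "ndcg": 0.11273428439190147,
}

# Ratio of the filtered-dataset metric to the full-dataset metric.
improvement = {
    name: without_rare_books[name] / with_rare_books[name]
    for name in with_rare_books
}

for name, ratio in improvement.items():
    print(f"{name}: {ratio:.2f}x")
```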