# Item-based collaborative filtering

Here, I'm going to compute item-to-item similarities using only the recorded user/item interactions.

In effect, similarity is based on unknown, latent features of each user and item. These features represent qualities of users and items that make users likely or unlikely to interact with items.

## Memory-based approach

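This method works directly on the full user/item matrix. It does not fit a model or reduce dimensionality, so memory and compute grow quickly with the number of items and it does not scale well to large datasets. Similarity is computed using the scikit-learn `cosine_similarity` function.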
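
To make the idea concrete, here is a minimal sketch of item/item cosine similarity on a toy user/item matrix. The matrix values and the variable names (`ratings`, `item_similarity`) are made up for illustration; only `cosine_similarity` and `csr_matrix` are real library calls.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy user/item matrix (illustrative values): 4 users (rows) x 3 items (columns).
ratings = csr_matrix(
    np.array(
        [
            [1.0, 1.0, 0.0],
            [1.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
            [1.0, 0.0, 1.0],
        ]
    )
)

# Items are columns, so transpose to get one row vector per item.
# dense_output=False keeps the result sparse when the input is sparse.
item_similarity = cosine_similarity(ratings.T, dense_output=False)

print(item_similarity.toarray().round(2))
```

Items 0 and 1 share two users and come out as the most similar pair, while items 1 and 2 share no users and get a similarity of zero.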

In [1]:
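%load_ext autoreload
# mode 2 reloads all modules before each cell runs;
# mode 1 only reloads modules registered with %aimport
%autoreload 2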

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display

from sklearn.preprocessing import LabelEncoder

from pipeliner.recommendations.transformer import (
    UserItemMatrixTransformer,
    SimilarityTransformer,
    UserItemMatrixTransformerNP,
)
from pipeliner.recommendations.recommender import ItemBasedRecommender

pd.options.display.float_format = "{:,.2f}".format

In [3]:
data_types = {"user_id": str, "item_id": str, "rating": np.float64}
user_item_ratings = pd.read_csv(
    "./data/usable_user_item_ratings_prepared.csv.gz",
    compression="gzip",
    dtype=data_types,
)

# confirm that each user/item pair is unique
assert user_item_ratings.groupby(["user_id", "item_id"]).size().max() == 1

print(user_item_ratings.shape)
user_item_ratings.head(3)

(1522154, 3)


Unnamed: 0,user_id,item_id,rating
0,U000003,I00037925,0.61
1,U000003,I00189384,0.61
2,U000003,I00256366,0.61


In [4]:
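# create a smaller dataset for the memory-based recommender:
# cap the interactions kept per user, then take the first SAMPLE_SIZE rows
# (this is a truncation, not a random sample)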
MAX_INTERACTIONS_PER_USER = 1000
SAMPLE_SIZE = 100000

user_item_ratings_sample = (
    (
        user_item_ratings.groupby("user_id")
        .head(MAX_INTERACTIONS_PER_USER)
        .reset_index(drop=True)
    )
    .head(SAMPLE_SIZE)
    .reset_index(drop=True)
)

print(user_item_ratings_sample.shape)
user_item_ratings_sample.head(3)

(100000, 3)


Unnamed: 0,user_id,item_id,rating
0,U000003,I00037925,0.61
1,U000003,I00189384,0.61
2,U000003,I00256366,0.61


In [5]:
# encode the user and item ids
user_sample_encoder = LabelEncoder()
item_sample_encoder = LabelEncoder()

user_item_ratings_sample["user_id"] = user_sample_encoder.fit_transform(user_item_ratings_sample["user_id"])
user_item_ratings_sample["item_id"] = item_sample_encoder.fit_transform(user_item_ratings_sample["item_id"])

unique_sample_users = pd.Series(user_sample_encoder.classes_)
unique_sample_items = pd.Series(item_sample_encoder.classes_)

print(unique_sample_users.shape[0], unique_sample_items.shape[0])
user_item_ratings_sample.head(3)

2527 83738


Unnamed: 0,user_id,item_id,rating
0,0,4352,0.61
1,0,19106,0.61
2,0,24288,0.61
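
The encoded integers double as row and column positions in the user/item matrix, so it helps to see how they map back to the original ids. The following is a small sketch on a toy frame; the frame `toy` and the column `item_idx` are hypothetical names, and the ids are borrowed from the rows shown above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (illustrative): the same item can appear more than once.
toy = pd.DataFrame(
    {"item_id": ["I00256366", "I00037925", "I00189384", "I00037925"]}
)

encoder = LabelEncoder()
# fit_transform assigns integer codes following the sorted order of the ids.
toy["item_idx"] = encoder.fit_transform(toy["item_id"])

# inverse_transform maps integer codes back to the original ids.
recovered = encoder.inverse_transform(toy["item_idx"])

print(toy)
print(list(recovered))
```

Because the codes follow the sorted order of the ids, `encoder.classes_[i]` is the original id for code `i`, which is what `unique_sample_items` holds for the sample.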


In [12]:
# create the user/item matrix
user_item_matrix_transformer = UserItemMatrixTransformerNP(sparse=True)
user_item_matrix_sample = user_item_matrix_transformer.transform(
    user_item_ratings_sample.to_numpy(),
)

print(user_item_matrix_sample.shape)
user_item_matrix_sample[0:5, 0:5].toarray()

(2527, 83738)


array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]], dtype=float32)
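
The implementation of `UserItemMatrixTransformerNP` is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical helper `build_user_item_matrix` below builds a sparse matrix from `(user, item, rating)` triples with SciPy, assuming the columns arrive in that order and the ids are already integer-encoded.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_user_item_matrix(triples: np.ndarray) -> csr_matrix:
    """Build a sparse user/item matrix from (user, item, rating) rows.

    Hypothetical helper for illustration; not the pipeliner implementation.
    """
    # DataFrame.to_numpy() on mixed int/float columns yields floats,
    # so cast the id columns back to integers before indexing.
    users = triples[:, 0].astype(np.int64)
    items = triples[:, 1].astype(np.int64)
    ratings = triples[:, 2].astype(np.float32)

    shape = (int(users.max()) + 1, int(items.max()) + 1)
    return csr_matrix((ratings, (users, items)), shape=shape)


# Toy input (illustrative): 2 users, 3 items, 3 interactions.
triples = np.array([[0, 2, 0.5], [1, 0, 1.0], [1, 2, 0.25]])
matrix = build_user_item_matrix(triples)

density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(matrix.toarray(), density)
```

The same density calculation explains the all-zero corner printed above: 100,000 interactions spread over 2527 × 83738 cells means that roughly 0.05% of the entries are non-zero.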

In [None]:
# similarity_matrix_transformer = SimilarityTransformer(
#     kind="item", metric="cosine", normalise=True
# )
# similarity_matrix_sample = similarity_matrix_transformer.transform(
#     user_item_matrix_sample
# )
# similarity_matrix_sample.iloc[0:5, 0:5]
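
A dense item/item matrix for the 83,738 items in the sample would hold about 7 billion entries, roughly 28 GB as `float32`. One way around that, sketched below, is to compute similarities for a single item at a time instead of materialising the whole matrix. The helper `top_k_similar_items` and the toy matrix are hypothetical and only meant to illustrate the idea; this is not how `SimilarityTransformer` is implemented.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar_items(user_item: csr_matrix, item_idx: int, k: int = 3):
    """Return the k items most similar to item_idx as (index, score) pairs.

    Hypothetical helper for illustration only.
    """
    # One row per item; CSR makes row slicing cheap.
    item_vectors = user_item.T.tocsr()

    # Compare a single item against all items: the result has shape (1, n_items).
    query = item_vectors[item_idx : item_idx + 1]
    scores = cosine_similarity(query, item_vectors).ravel()

    # Exclude the item itself before ranking.
    scores[item_idx] = -np.inf
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy user/item matrix (illustrative): 4 users x 3 items.
user_item = csr_matrix(
    np.array(
        [
            [1.0, 1.0, 0.0],
            [1.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
            [1.0, 0.0, 1.0],
        ]
    )
)

neighbours = top_k_similar_items(user_item, item_idx=0, k=2)
print(neighbours)
```

Each call only allocates one row of scores, so memory stays proportional to the number of items rather than its square. The trade-off is that similarities are recomputed on every query.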