# Research Workshop

## Building a recommender system from OLX Praca data using the Item-Item Nearest Neighbour method

#### Wiktoria Boguszewska, Mateusz Zacharecki, Patrycja Żak

## Item-Item Nearest Neighbour

The Item-Item Nearest Neighbour algorithm is a collaborative filtering technique. It finds items that are similar to one another based on user behaviour. Unlike user-user filtering, it first computes the similarity between items and then recommends to a given user the items most similar to those the user has rated or interacted with.
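As a minimal sketch of this idea (not the `implicit` implementation used later in the notebook), the toy example below computes cosine similarity between the item columns of a small user-item matrix and scores unseen items for one user. The matrix and the helper names `item_cosine_similarity` and `recommend` are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical binary user-item matrix: rows are users, columns are items.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=float)


def item_cosine_similarity(matrix):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items without interactions
    normalized = matrix / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)  # an item should not recommend itself
    return sim


def recommend(user_row, sim, n=2):
    """Score each item by its summed similarity to the user's items, skipping seen items."""
    scores = sim @ user_row
    scores[user_row > 0] = -np.inf
    return np.argsort(-scores)[:n]


similarity = item_cosine_similarity(interactions)
top_items = recommend(interactions[0], similarity, n=2)
```

The first user interacted with items 0 and 1, so the sketch ranks item 2 first: it co-occurs with both of them in other users' histories.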

Advantages:
- Once the similarities have been computed, recommendations are generated quickly.
- Compared with the user-user approach, item similarities are less sensitive to the arrival of new users.

Disadvantages:
- New items with no interactions cannot be matched to similar items, so the algorithm struggles to recommend them (the cold-start problem).
- Computing the similarities can be expensive for very large datasets.
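Both disadvantages can be illustrated with a small sketch. It assumes a raw co-occurrence count as the similarity measure and a brute-force comparison of all item pairs; the toy matrix and the helper `pair_count` are hypothetical, and sparse implementations only visit pairs that actually co-occur.

```python
import numpy as np

# Hypothetical toy matrix: the third item is new and has no interactions yet.
interactions = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)

# Co-occurrence counts between items, with self-similarity removed.
cooccurrence = interactions.T @ interactions
np.fill_diagonal(cooccurrence, 0)

# The new item has zero similarity to everything, so it can never be recommended.
new_item_similarities = cooccurrence[2]


def pair_count(n_items):
    """Number of distinct item pairs a brute-force similarity computation would compare."""
    return n_items * (n_items - 1) // 2


pairs_for_million_items = pair_count(1_000_000)
```

The number of pairs grows quadratically with the catalogue size, which is why the similarity step dominates the cost on large datasets.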

# Code

Libraries to install


In [38]:
!pip install implicit



In [39]:
!pip install scikit-optimize



In [40]:
!pip install tqdm



In [41]:
import pandas as pd
import numpy as np
import random
import implicit
from scipy.sparse import coo_matrix
from sklearn.base import BaseEstimator
from implicit.nearest_neighbours import ItemItemRecommender
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
from tqdm.notebook import tqdm

We work with the following dataset (details at the link): https://www.kaggle.com/datasets/olxdatascience/olx-jobs-interactions

In [52]:
# Step 1: Preprocessing

# Load the file
data = pd.read_csv("interactions.csv")
data.head()

# Keep only click events
data = data[data['event'] == 'click']

In [53]:
data.dtypes

user          int64
item          int64
event        object
timestamp     int64
dtype: object

In [54]:
data.head()

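|   | user | item | event | timestamp |
|---|------|------|-------|-----------|
| 0 | 27901 | 56865 | click | 1581465600 |
| 1 | 124480 | 115662 | click | 1581465600 |
| 2 | 159509 | 5150 | click | 1581465600 |
| 3 | 188861 | 109981 | click | 1581465600 |
| 4 | 207348 | 88746 | click | 1581465600 |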


Helper steps used to prepare the data: sampling, the train/test split, and construction of the user-item matrix

In [55]:
# Sample 10% of the users at random, then keep the 10% of items with the most distinct interacting users

user_unique = data['user'].unique()
sample_size = int(len(user_unique) * 0.1)
random_user_10 = random.sample(list(user_unique), sample_size)
df_10 = data[data['user'].isin(random_user_10)]

df_10_dist = df_10[['item','user']].drop_duplicates()
item_count = df_10_dist.groupby('item').size().reset_index(name='count')
item_count = item_count.sort_values(by = 'count', ascending = False)
item_unique = df_10['item'].unique()
sample_size = int(len(item_unique) * 0.1)
top_item_10 = item_count.iloc[:sample_size, :]
df_10_10 = df_10[df_10['item'].isin(top_item_10['item'])]

# Split into train and test sets with a temporal split: after sorting by timestamp, the last 20% of interactions form the test set, so the earlier 80% are used to predict the later ones

df_10_10 = df_10_10.sort_values('timestamp')
train_size = int(len(df_10_10) * 0.8)
train_data = df_10_10[:train_size]
test_data = df_10_10[train_size:]
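The temporal split above can be wrapped in a small reusable function. The sketch below is one possible formulation, assuming a `timestamp` column as in this dataset; the function name `temporal_split` and the toy data are hypothetical.

```python
import pandas as pd


def temporal_split(interactions, train_fraction=0.8, time_col="timestamp"):
    """Split interactions chronologically: the earliest rows form the training set."""
    ordered = interactions.sort_values(time_col, kind="stable")
    cut = int(len(ordered) * train_fraction)
    # Return copies so that columns can be added later without pandas warnings.
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()


# Hypothetical toy interactions with shuffled timestamps.
toy = pd.DataFrame({
    "user": [1, 2, 1, 3, 2],
    "item": [10, 11, 12, 10, 13],
    "timestamp": [50, 10, 30, 20, 40],
})
train, test = temporal_split(toy)
```

Every training interaction precedes every test interaction, so the model never sees events from the period it is evaluated on.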

In [56]:
# Copy the training slice so that new columns can be added without a SettingWithCopyWarning
train_data = train_data.copy()

# Build mappings from user and item IDs to consecutive indices
user_map = {u: i for i, u in enumerate(train_data['user'].unique())}
item_map = {i: j for j, i in enumerate(train_data['item'].unique())}

# Map user and item IDs to indices
train_data['user_idx'] = train_data['user'].map(user_map)
train_data['item_idx'] = train_data['item'].map(item_map)

# Build the sparse user-item matrix for training
train_matrix = coo_matrix(
    (np.ones(len(train_data)), (train_data['user_idx'], train_data['item_idx'])),
    shape=(len(user_map), len(item_map))
)
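One detail of this construction is worth illustrating: when the same (user, item) pair occurs several times, converting a COO matrix to CSR sums the duplicate entries, so the stored value becomes an interaction count rather than a 0/1 flag. The sketch below shows this on hypothetical toy indices, together with one way to binarise the result if plain presence is preferred.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy indices: the pair (user 0, item 0) appears twice.
user_idx = np.array([0, 0, 1, 0])
item_idx = np.array([0, 1, 1, 0])
values = np.ones(len(user_idx))

# Duplicate coordinates are summed during the conversion to CSR.
matrix = coo_matrix((values, (user_idx, item_idx)), shape=(2, 2)).tocsr()
counts = matrix.toarray()

# Optional: collapse counts to binary presence/absence.
binary = matrix.copy()
binary.data[:] = 1.0
```

Whether counts or binary flags work better is an empirical question; the notebook keeps the counts.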


In [57]:
# Scale the interaction weights: multiply every entry by 100 (a constant rescaling, not a normalization)
train_matrix = (train_matrix.T * 100).T
# Convert the matrix to CSR format
train_matrix_csr = train_matrix.tocsr()

In [58]:
class ItemItemKNNWrapper(BaseEstimator):
    def __init__(self, K=10, num_threads=1):
        self.K = K
        self.num_threads = num_threads
        self.model = None
        self.user_map = None
        self.item_map = None
        self.train_matrix = None

    def fit(self, X, y=None):
        self.train_matrix = X.copy()
        # Build user_map and item_map
        self.user_map = {user_id: idx for idx, user_id in enumerate(np.unique(X.nonzero()[0]))}
        self.item_map = {item_id: idx for idx, item_id in enumerate(np.unique(X.nonzero()[1]))}

        self.model = ItemItemRecommender(K=self.K, num_threads=self.num_threads)
        self.model.fit(X)
        return self

    def recommend(self, user, user_items, N=10):
        return self.model.recommend(user, user_items, N=N)

    def get_params(self, deep=True):
        return {"K": self.K, "num_threads": self.num_threads}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

In [59]:
def precision_at_k_score(estimator, X_test, y=None):
    # estimator: trained model for the current fold
    # X_test: validation data for the current fold

    # Use estimator's mappings and train matrix
    precision, recall, accuracy, f1 = precision_recall_accuracy_f1_at_k(
        estimator.model,
        estimator.train_matrix,
        X_test,
        estimator.user_map,
        estimator.item_map,
        k=10
    )
    return precision

In [60]:
def precision_recall_accuracy_f1_at_k(model, train_matrix, X_test, user_map, item_map, k=10):
    user_precisions = []
    user_recalls = []
    user_accuracies = []
    user_f1s = []

    # Get list of users in X_test
    users = np.unique(X_test.nonzero()[0])

    for user_idx in users:
        if user_idx not in user_map.values():
            continue  # Skip users not in the mapping

        # Get actual items for this user in X_test
        actual_items_idx = X_test[user_idx].nonzero()[1]

        if len(actual_items_idx) == 0:
            continue  # Skip if no actual items

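        # Get recommendations.
        # Note: the unpacking below assumes recommend() returns a list of (item, score) pairs.
        # implicit >= 0.5 returns an (ids, scores) tuple instead; there, use: ids, _ = model.recommend(...)
        recommended_items = model.recommend(user_idx, train_matrix[user_idx], N=k)
        recommended_items_idx = [item[0] for item in recommended_items]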

        if not recommended_items_idx:
            continue  # Skip if no recommendations

        # Calculate metrics
        tp = len(set(recommended_items_idx) & set(actual_items_idx))
        fp = len(recommended_items_idx) - tp
        fn = len(actual_items_idx) - tp

        denominator_precision = tp + fp
        denominator_recall = tp + fn

        precision = tp / denominator_precision if denominator_precision > 0 else 0.0
        recall = tp / denominator_recall if denominator_recall > 0 else 0.0

        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0

        accuracy_denominator = tp + fp + fn
        accuracy = tp / accuracy_denominator if accuracy_denominator > 0 else 0.0

        user_precisions.append(precision)
        user_recalls.append(recall)
        user_accuracies.append(accuracy)
        user_f1s.append(f1)

    # Calculate average metrics
    avg_precision = np.mean(user_precisions) if user_precisions else 0.0
    avg_recall = np.mean(user_recalls) if user_recalls else 0.0
    avg_accuracy = np.mean(user_accuracies) if user_accuracies else 0.0
    avg_f1 = np.mean(user_f1s) if user_f1s else 0.0

    return avg_precision, avg_recall, avg_accuracy, avg_f1
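The per-user part of this computation can be checked by hand on a small case. The sketch below is a simplified single-user version, assuming Precision@K = hits / number of recommendations and Recall@K = hits / number of relevant items; the function name `precision_recall_at_k` and the item IDs are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@K and Recall@K for a single user."""
    top_k = list(recommended)[:k]
    relevant_set = set(relevant)
    if not top_k or not relevant_set:
        return 0.0, 0.0
    hits = len(set(top_k) & relevant_set)
    return hits / len(top_k), hits / len(relevant_set)


# Four recommendations, five relevant items, two of them in common (items 5 and 2).
precision, recall = precision_recall_at_k([5, 9, 2, 7], [2, 4, 5, 8, 11], k=4)
```

Here two of the four recommended items are relevant, giving a precision of 0.5, and two of the five relevant items were found, giving a recall of 0.4.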

In [61]:
# Define parameter space to optimize
param_knn = {
    'K': np.arange(5, 101, 5),  # Number of neighbors from 5 to 100 with step 5
    'num_threads': [1, 2, 4, 8]  # Number of threads (affects run time only, not the recommendations)
}

# Create Item-Item KNN model using wrapper
item_knn_wrapper = ItemItemKNNWrapper()

# Number of optimization iterations
n_iter = 10

# Use RandomizedSearchCV to optimize hyperparameters
opt_knn = RandomizedSearchCV(
    item_knn_wrapper,
    param_distributions=param_knn,
    n_iter=n_iter,
    scoring=precision_at_k_score,  # Pass the function directly
    cv=3,  # Cross-validation
    n_jobs=1,
    verbose=2
)

opt_knn.fit(train_matrix_csr, y=None)
best_knn_model = opt_knn.best_estimator_

# Display optimization results
print("Best parameters for Item-Item KNN:", opt_knn.best_params_)
print("Best score for Item-Item KNN:", opt_knn.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ................................K=60, num_threads=4; total time=21.2min
[CV] END ................................K=60, num_threads=4; total time=21.1min
[CV] END ................................K=60, num_threads=4; total time=21.6min
[CV] END ................................K=45, num_threads=8; total time=21.0min
[CV] END ................................K=45, num_threads=8; total time=21.0min
[CV] END ................................K=45, num_threads=8; total time=21.1min
[CV] END ................................K=95, num_threads=8; total time=21.0min
[CV] END ................................K=95, num_threads=8; total time=21.0min
[CV] END ................................K=95, num_threads=8; total time=21.0min
[CV] END ................................K=40, num_threads=1; total time=21.2min
[CV] END ................................K=40, num_threads=1; total time=21.1min
[CV] END ................................K=40, num_threads=1; total time=21.0min
[CV] END ................................K=30, num_threads=2; total time=21.1min
[CV] END ................................K=30, num_threads=2; total time=21.0min
[CV] END ................................K=30, num_threads=2; total time=21.1min
[CV] END .................................K=5, num_threads=4; total time=21.0min
[CV] END .................................K=5, num_threads=4; total time=21.0min
[CV] END .................................K=5, num_threads=4; total time=21.2min
[CV] END ................................K=45, num_threads=1; total time=21.2min
[CV] END ................................K=45, num_threads=1; total time=21.1min
[CV] END ................................K=45, num_threads=1; total time=21.1min
[CV] END ................................K=55, num_threads=4; total time=21.1min
[CV] END ................................K=55, num_threads=4; total time=21.2min
[CV] END ................................K=55, num_threads=4; total time=21.2min
[CV] END ................................K=15, num_threads=4; total time=21.5min
[CV] END ................................K=15, num_threads=4; total time=21.4min
[CV] END ................................K=15, num_threads=4; total time=21.4min
[CV] END ................................K=30, num_threads=8; total time=21.4min
[CV] END ................................K=30, num_threads=8; total time=21.7min
[CV] END ................................K=30, num_threads=8; total time=21.4min

Best parameters for Item-Item KNN: {'num_threads': 4, 'K': 60}
Best score for Item-Item KNN: 0.0005501895923677131
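The log lists 10 candidates because the search draws `n_iter` parameter combinations at random from the grid instead of trying all of them. When every parameter is given as a list, scikit-learn samples combinations without replacement. The pure-Python sketch below mimics only that sampling step; the helper `sample_candidates` and its `seed` argument are hypothetical, and it is not how `RandomizedSearchCV` is implemented internally.

```python
import itertools
import random


def sample_candidates(grid, n_iter, seed=0):
    """Draw n_iter distinct parameter combinations from a grid of lists."""
    keys = sorted(grid)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[key] for key in keys))
    ]
    rng = random.Random(seed)
    return rng.sample(combos, min(n_iter, len(combos)))


# Same search space as in the notebook: 20 values of K times 4 thread counts.
grid = {"K": list(range(5, 101, 5)), "num_threads": [1, 2, 4, 8]}
candidates = sample_candidates(grid, n_iter=10)
```

With 80 possible combinations and 10 draws, only a fraction of the `K` values is explored. Since `num_threads` changes the run time but not the model, searching over `K` alone would cover the same space with fewer fits.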


In [62]:
def precision_recall_accuracy_f1_at_k(model, train_matrix_csr, test_data, user_map, item_map, k=10):
    # Make sure the training matrix is in CSR format so that single rows can be sliced efficiently
    train_matrix_csr = train_matrix_csr.tocsr()

    user_precisions = []
    user_recalls = []
    user_accuracies = []
    user_f1s = []

    # Get list of unique users in test_data
    users = test_data['user'].unique()

    for user_id in users:
        if user_id not in user_map:
            continue  # Skip users not in the mapping

        user_idx = user_map[user_id]

        # Get actual items for this user in test_data
        actual_items_idx = test_data[test_data['user'] == user_id]['item'].map(item_map).dropna().values

        if len(actual_items_idx) == 0:
            continue  # Skip if no actual items

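        # Get recommendations.
        # Note: the unpacking below assumes recommend() returns a list of (item, score) pairs.
        # implicit >= 0.5 returns an (ids, scores) tuple instead; there, use: ids, _ = model.recommend(...)
        recommended_items = model.recommend(user_idx, train_matrix_csr[user_idx], N=k)
        recommended_items_idx = [item[0] for item in recommended_items]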

        if not recommended_items_idx:
            continue  # Skip if no recommendations

        # Calculate metrics
        tp = len(set(recommended_items_idx) & set(actual_items_idx))
        fp = len(recommended_items_idx) - tp
        fn = len(actual_items_idx) - tp

        denominator_precision = tp + fp
        denominator_recall = tp + fn

        precision = tp / denominator_precision if denominator_precision > 0 else 0.0
        recall = tp / denominator_recall if denominator_recall > 0 else 0.0

        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0

        accuracy_denominator = tp + fp + fn
        accuracy = tp / accuracy_denominator if accuracy_denominator > 0 else 0.0

        user_precisions.append(precision)
        user_recalls.append(recall)
        user_accuracies.append(accuracy)
        user_f1s.append(f1)

    # Calculate average metrics
    avg_precision = np.mean(user_precisions) if user_precisions else 0.0
    avg_recall = np.mean(user_recalls) if user_recalls else 0.0
    avg_accuracy = np.mean(user_accuracies) if user_accuracies else 0.0
    avg_f1 = np.mean(user_f1s) if user_f1s else 0.0

    return avg_precision, avg_recall, avg_accuracy, avg_f1

In [63]:
# Compute all metrics for the best model
precision_at_k, recall_at_k, accuracy_at_k, f1_at_k = precision_recall_accuracy_f1_at_k(
    best_knn_model.model,      # Model inside the wrapper
    train_matrix_csr,          # Training matrix
    test_data,                 # Test data (a validation set could be used here instead)
    user_map,                  # User mapping
    item_map,                  # Item mapping
    k=10                       # Number of recommendations (adjustable)
)

# Print the results
print(f"Precision@K: {precision_at_k}")
print(f"Recall@K: {recall_at_k}")

Precision@K: 0.023952588690427195
Recall@K: 0.011698680777340315
