
# Recommender Systems Based on Collaborative Filtering




In [None]:
import warnings
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.base import BaseEstimator

%pylab inline

Populating the interactive namespace from numpy and matplotlib




Let us build a recommender system on [`MovieLens`](https://grouplens.org/datasets/movielens/), a dataset from `GroupLens`.
The version used here contains about 9,700 movies and 610 users, with roughly 100,000 ratings in total.


The dataset can be downloaded directly from this [link](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip).

**Loading the data**

In [None]:
# for UNIX systems
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2025-02-24 07:24:01--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2025-02-24 07:24:02 (5.23 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


## Data
- `links.csv` $-$ mapping between a movie's `id` in the dataset and the `id` of the same movie on `imdb.com` and `themoviedb.org`;
- `movies.csv` $-$ the title and genres of each movie;
- `ratings.csv` $-$ users' ratings of movies, with timestamps;
- `tags.csv` $-$ tags that users assigned to movies, with timestamps.

For this task we need only part of the data $-$ the ratings that users gave to movies.

In [None]:
ratings = pd.read_csv('./ml-latest-small/ratings.csv', parse_dates=['timestamp'])
ratings.head(7)

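|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| 5 | 1 | 70 | 3.0 | 964982400 |
| 6 | 1 | 101 | 5.0 | 964980868 |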


In [None]:
rm = ratings.pivot_table(index='userId', columns='movieId', values='rating')

In [None]:
rm.columns

Index([     1,      2,      3,      4,      5,      6,      7,      8,      9,
           10,
       ...
       193565, 193567, 193571, 193573, 193579, 193581, 193583, 193585, 193587,
       193609],
      dtype='int64', name='movieId', length=9724)

In [None]:
ratings['movieId'].max()

193609
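
The largest `movieId` (193609) is far greater than the number of distinct movies (9724 columns in `rm`), so movie IDs are sparse and cannot double as row or column positions. Below is a minimal sketch of one way to map IDs to contiguous positions with `pd.factorize`. The `toy` frame and the `position_of` name are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy ratings with sparse movie IDs (illustrative data only)
toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 193609, 47]})

# codes[k] is the position of toy.movieId[k] among the sorted unique IDs
codes, unique_ids = pd.factorize(toy["movieId"], sort=True)

# Lookup table: movieId -> contiguous position
position_of = {int(movie_id): pos for pos, movie_id in enumerate(unique_ids)}

print(list(codes))
print(position_of)
```

A lookup like this avoids assumptions such as "position = ID - 1", which hold only when IDs are contiguous and start at 1.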

In [None]:
ratings.rating.value_counts()

| rating | count |
|--------|-------|
| 4.0 | 26818 |
| 3.0 | 20047 |
| 5.0 | 13211 |
| 3.5 | 13136 |
| 4.5 | 8551 |
| 2.0 | 7551 |
| 2.5 | 5550 |
| 1.0 | 2811 |
| 1.5 | 1791 |
| 0.5 | 1370 |


### Metric. Building the test set.

We use `RMSE`, which has been the standard metric for rating-prediction tasks since the [Netflix Prize](https://ru.wikipedia.org/wiki/Netflix_Prize).
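
For reference, RMSE over $N$ test ratings is $\sqrt{\frac{1}{N}\sum (r - \hat{r})^2}$, where $r$ is the true rating and $\hat{r}$ the predicted one. Here is a small worked example with made-up numbers (the arrays are illustrative only):

```python
import numpy as np

y_true = np.array([4.0, 3.0, 5.0, 2.0])   # true ratings (made up)
y_hat = np.array([3.5, 3.0, 4.0, 3.0])    # predicted ratings (made up)

errors = y_true - y_hat                    # [0.5, 0.0, 1.0, -1.0]
mse = np.mean(errors ** 2)                 # (0.25 + 0 + 1 + 1) / 4 = 0.5625
rmse_value = np.sqrt(mse)                  # sqrt(0.5625) = 0.75
print(rmse_value)
```

Because errors are squared before averaging, one large miss weighs more than several small ones.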

We hold out part of the data for testing as follows: for each user, the most recent 20% of their ratings go to the test set.

In [None]:
rmse = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred))

def train_test_split(X, ratio=0.2, user_col='userId', item_col='movieId',
                     rating_col='rating', time_col='timestamp'):
    # sort ratings by time (on a copy, so the caller's DataFrame is not modified)
    X = X.sort_values(by=[time_col])
    # list of all users
    userIds = X[user_col].unique()
    X_train_data = []
    X_test_data = []
    y_train = []
    y_test = []
    for userId in tqdm_notebook(userIds):
        curUser = X[X[user_col] == userId]
        # find the split position and distribute the data across the lists
        idx = int(curUser.shape[0] * (1 - ratio))
        X_train_data.append(curUser[[user_col, item_col]].iloc[:idx, :].values)
        X_test_data.append(curUser[[user_col, item_col]].iloc[idx:, :].values)
        y_train.append(curUser[rating_col].values[:idx])
        y_test.append(curUser[rating_col].values[idx:])
    # stack the per-user data into common arrays
    X_train = pd.DataFrame(np.vstack(X_train_data), columns=[user_col, item_col])
    X_test = pd.DataFrame(np.vstack(X_test_data), columns=[user_col, item_col])
    y_train = np.hstack(y_train)
    y_test = np.hstack(y_test)
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(ratings)

  0%|          | 0/610 [00:00<?, ?it/s]

In [None]:
X_train.shape, len(y_train), X_test.shape, len(y_test)

((80419, 2), 80419, (20417, 2), 20417)
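
The per-user loop in `train_test_split` can also be written with `groupby`. The sketch below is one possible alternative, not the notebook's method: the function name `chronological_split` and the toy data are hypothetical, and it returns two DataFrames rather than separate feature and target arrays.

```python
import pandas as pd

def chronological_split(df, ratio=0.2, user_col="userId", time_col="timestamp"):
    """Hold out the last `ratio` share of each user's ratings (by time)."""
    df = df.sort_values(time_col)
    # 0-based position of each rating within its user's history
    rank = df.groupby(user_col).cumcount()
    # Number of ratings per user, aligned with the rows of df
    size = df.groupby(user_col)[user_col].transform("count")
    # Same cut point as the loop version: int(n * (1 - ratio))
    cutoff = (size * (1 - ratio)).astype(int)
    is_train = rank < cutoff
    return df[is_train], df[~is_train]

# Toy data: user 1 has 5 ratings, user 2 has 10 (illustrative only)
toy = pd.DataFrame({
    "userId": [1] * 5 + [2] * 10,
    "movieId": range(15),
    "rating": [4.0] * 15,
    "timestamp": range(15),
})
train, test = chronological_split(toy)
print(len(train), len(test))
```

On the toy data, user 1 contributes 4 training and 1 test rating, and user 2 contributes 8 and 2, matching the 80/20 rule applied per user.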

In [None]:
X_train = X_train.assign(rating=y_train)

# Correlation-based models

The task is to implement user-based and item-based models.
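
Both models implemented below predict a rating as a mean plus a similarity-weighted average of mean-centered ratings, taken over neighbours whose similarity exceeds a threshold $\delta$. Read off from the code, with $r$ denoting ratings after missing values have been filled with movie means and $s$ the cosine similarity of mean-centered rating vectors, the user-based prediction is

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in U_u} s_{uv}\,(r_{vi} - \bar{r}_v)}{\sum_{v \in U_u} s_{uv}}, \qquad U_u = \{v : s_{uv} > \delta\},$$

and the item-based prediction is

$$\hat{r}_{ui} = \bar{r}_i + \frac{\sum_{j \in I_i} s_{ij}\,(r_{uj} - \bar{r}_j)}{\sum_{j \in I_i} s_{ij}}, \qquad I_i = \{j : s_{ij} > \delta\}.$$

The denominator sums the similarities themselves rather than their absolute values. With the default positive thresholds ($\delta = 0.93$ and $\delta = 0.1$) every term in the sum is positive, so the two coincide.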

In [None]:
from pandas import DataFrame

# User x movie rating matrix (rows: userId, columns: movieId).
# Note: it is built from all of `ratings`, including the ratings held out in
# X_test / y_test, so the evaluation below is not a clean held-out test.
ratings_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')


class UserBasedModel:
    def __init__(self, matrix: DataFrame, delta=0.93) -> None:
        matrix = matrix.fillna(matrix.mean())
        self.matrix = matrix.T
        self.means = np.array([round(self.matrix[u].mean(), 2) for u in self.matrix.columns])
        self.centered_matrix = self.matrix - self.means
        self.delta = delta
        self.m = np.array(self.matrix.index)
        self.S = self.build_S()

    def pearson_correlation(self, u, v):
        r_u = self.matrix[u] - self.means[u - 1]
        r_v = self.matrix[v] - self.means[v - 1]
        numerator = (r_u * r_v).sum()
        denominator = np.sqrt((r_u ** 2).sum()) * np.sqrt((r_v ** 2).sum())
        return round(numerator / denominator, 2) if denominator != 0 else np.nan

    def cosine_similarity(self, u, v):
        # Convert the two users' centered rating columns to NumPy arrays
        A = self.centered_matrix[u].to_numpy()
        B = self.centered_matrix[v].to_numpy()

        # Dot product of the two vectors
        dot_product = np.dot(A, B)

        # Vector norms
        norm_A = np.linalg.norm(A)
        norm_B = np.linalg.norm(B)

        # Cosine similarity
        if norm_A == 0 or norm_B == 0:
            return 0.0  # avoid division by zero
        else:
            return dot_product / (norm_A * norm_B)

    def build_S(self):
        num_cols = self.matrix.shape[1]
        S = np.eye(num_cols, dtype=float)

        for i in range(num_cols):
            for j in range(i + 1, num_cols):
                result = self.cosine_similarity(i + 1, j + 1)
                S[i, j] = result
                S[j, i] = result
        return S

    # def build_S(self):
    #     centered_matrix = self.matrix - self.means  # center the data
    #     centered_matrix = centered_matrix.T
    #     for s in centered_matrix:
    #         print(s)
    #     centered_matrix = centered_matrix.to_numpy()
    #     correlation_matrix = cosine_similarity(centered_matrix)
    #     return np.round(correlation_matrix, 2)

    def prediction(self, params):
        user, item = params
        S_u = self.S[user - 1]
        r_u = self.means[user - 1]
        U_a = np.where(S_u > self.delta)[0]
        i = np.where(self.m == item)[0]
        S_U = S_u[U_a]

        ratings_diff = self.centered_matrix.iloc[i, U_a].to_numpy()
        numerator = (S_U * ratings_diff).sum()
        denominator = (S_U).sum()
        result = r_u + numerator / denominator if denominator != 0 else r_u
        return result
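
The double loop in `build_S` evaluates one pair of users at a time. Below is a sketch of a vectorized alternative that uses only NumPy and pandas. The function name `centered_cosine_similarity` and the toy matrix are hypothetical. Unlike the class above, it does not round the user means to two decimals, so its values may differ slightly.

```python
import numpy as np
import pandas as pd

def centered_cosine_similarity(ratings_matrix: pd.DataFrame) -> np.ndarray:
    """User-user cosine similarity of mean-centered ratings (rows = users)."""
    # Fill gaps with each movie's mean rating, as the class above does
    filled = ratings_matrix.fillna(ratings_matrix.mean())
    # Subtract each user's mean rating from that user's row
    centered = filled.sub(filled.mean(axis=1), axis=0).to_numpy()
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    # Scale rows to unit length; rows with zero norm stay zero (similarity 0)
    unit = np.divide(centered, norms, out=np.zeros_like(centered), where=norms > 0)
    S = unit @ unit.T
    np.fill_diagonal(S, 1.0)  # the loop version starts from np.eye
    return S

# Toy matrix: 3 users x 3 movies with missing ratings (illustrative only)
toy = pd.DataFrame(
    {10: [5, 4, np.nan], 20: [3, np.nan, 2], 30: [np.nan, 1, 5]},
    index=[1, 2, 3],
)
S = centered_cosine_similarity(toy)
print(S.round(3))
```

A single matrix product replaces the pairwise loop, which matters most for the item-based model, where the number of movie pairs is in the tens of millions.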

In [None]:
X_test = X_test.to_numpy()
counts = len(X_test)

In [None]:
!pip install tqdm



In [None]:
ubm = UserBasedModel(matrix=ratings_matrix)

In [None]:
from tqdm.notebook import tqdm

# Predict a rating for every (userId, movieId) pair in the test set
y_pred = [ubm.prediction(pair) for pair in tqdm(X_test)]


  0%|          | 0/20417 [00:00<?, ?it/s]

In [None]:
rmse(y_test, y_pred)

0.8495887639958832

In [None]:
class ItemBasedModel:
    def __init__(self, matrix: DataFrame, delta=0.1) -> None:
        self.matrix = matrix.fillna(matrix.mean())
        self.means = np.array([round(self.matrix[u].mean(), 2) for u in self.matrix.columns])
        self.centered_matrix = self.matrix - self.means
        self.delta = delta
        self.m = np.array(self.matrix.columns)
        self.S = self.build_S()

    def cosine_similarity(self, u, v):
        # Convert the two movies' centered rating columns to NumPy arrays
        A = self.centered_matrix[u].to_numpy()
        B = self.centered_matrix[v].to_numpy()
        # Dot product of the two vectors
        dot_product = np.dot(A, B)
        # Vector norms
        norm_A = np.linalg.norm(A)
        norm_B = np.linalg.norm(B)
        # Cosine similarity
        if norm_A == 0 or norm_B == 0:
            return 0.0  # avoid division by zero
        else:
            return dot_product / (norm_A * norm_B)

    def build_S(self):
        num_cols = self.matrix.shape[1]
        S = np.eye(num_cols, dtype=float)

        for i in tqdm(range(num_cols)):
            for j in range(i + 1, num_cols):
                result = self.cosine_similarity(self.m[i], self.m[j])
                S[i, j] = result
                S[j, i] = result
        return S

    def prediction(self, params):
        user, item = params
        i = np.where(self.m == item)[0][0]
        S_i = self.S[i]
        r_i = self.means[i]
        I_i = np.where(S_i > self.delta)[0]
        S_I = S_i[I_i]

        ratings_diff = self.centered_matrix.iloc[user - 1, I_i].to_numpy()
        numerator = (S_I * ratings_diff).sum()
        denominator = (S_I).sum()

        result = r_i + numerator / denominator if denominator != 0 else r_i

        return result

ibm = ItemBasedModel(matrix=ratings_matrix)

  0%|          | 0/9724 [00:00<?, ?it/s]

In [None]:
X_test[69]
print(np.where(ibm.m == 1121))
print(len(ibm.m))
print(np.where(ubm.m == 1121))

(array([], dtype=int64),)
8306
(array([], dtype=int64),)
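
The empty arrays above show that movie 1121 was not among the model's columns when this cell was run. For such a movie, `np.where(self.m == item)[0][0]` in `ItemBasedModel.prediction` raises `IndexError`. Below is a minimal sketch of a guard that returns a fallback value, such as a global mean rating, when the movie is unknown. The helper names `find_position` and `predict_or_fallback` are hypothetical and not part of the notebook.

```python
from typing import Callable, Optional

import numpy as np

def find_position(ids: np.ndarray, target) -> Optional[int]:
    """Return the position of `target` in `ids`, or None if it is absent."""
    hits = np.flatnonzero(ids == target)
    return int(hits[0]) if hits.size else None

def predict_or_fallback(predict: Callable, ids: np.ndarray,
                        user, item, fallback: float) -> float:
    """Call `predict((user, item))` only when the model knows `item`."""
    if find_position(ids, item) is None:
        return fallback
    return predict((user, item))

# Toy usage with a stand-in predictor (illustrative only)
known_ids = np.array([1, 3, 6])
print(predict_or_fallback(lambda pair: 4.0, known_ids, 1, 1121, fallback=3.5))
```

With the models above, the call would look like `predict_or_fallback(ibm.prediction, ibm.m, user, item, fallback)`, where `fallback` is chosen by the caller.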


In [None]:
# Predict a rating for every (userId, movieId) pair in the test set
y_pred1 = [ibm.prediction(pair) for pair in tqdm(X_test)]

  0%|          | 0/20417 [00:00<?, ?it/s]

In [None]:
rmse(y_test, y_pred1)

0.7665141619651054

# Surprise
[Surprise](http://surpriselib.com/) is a Python library for building and evaluating recommender systems.


In [None]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl.metadata (327 bytes)
Collecting scikit-surprise (from surprise)
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 3.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl size=2505178 sha256=d7fcaa332f792d004276b83ab3c5ece6500ff679ca7d64d1b688a043f2092201
  Stored in directory: /root/.cache/pip/wheels/2a/8f/6e/7e2899163e2d85d8266daab4aa1cdabec7a6c56f83c015b5af
Successfully built scikit-surprise
Install

In [None]:
from surprise.prediction_algorithms.knns import KNNBasic
from surprise import Dataset
from surprise.model_selection import cross_validate

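# Load the built-in movielens-100k dataset (download it if needed).
# This is a different dataset from ml-latest-small used above,
# so the scores are not directly comparable with the RMSE values above.
data = Dataset.load_builtin('ml-100k')

# KNNBasic is user-based by default and uses MSD similarity.
algo = KNNBasic()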

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)


Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9776  0.9754  0.9843  0.9783  0.9782  0.9788  0.0030  
MAE (testset)     0.7723  0.7698  0.7773  0.7735  0.7730  0.7732  0.0024  
Fit time          0.56    0.41    0.57    0.42    0.57    0.51    0.07    
Test time         3.64    4.14    3.49    3.39    3.83    3.70    0.27    


{'test_rmse': array([0.97758965, 0.97541352, 0.98429325, 0.97834733, 0.97815775]),
 'test_mae': array([0.7722961 , 0.76976739, 0.77725239, 0.77349757, 0.77298087]),
 'fit_time': (0.5605363845825195,
  0.41263484954833984,
  0.566570520401001,
  0.4186227321624756,
  0.5684120655059814),
 'test_time': (3.642064332962036,
  4.138620138168335,
  3.488823413848877,
  3.3907861709594727,
  3.8313651084899902)}