### In this notebook I apply library-based approaches to recommendations and tune each model's hyperparameters. I try a kNN approach as well as an SVD-based approach.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [2]:
data = pd.read_csv('dataset.csv')
data = data.sort_values(['timestamp'])

In [3]:
train = data[:80000]
test = data[80000:]

In [4]:
train.head()

Unnamed: 0,user_id,item_id,rating,timestamp
217,259,255,4,874724710
83968,259,286,4,874724727
43030,259,298,4,874724754
21399,259,185,4,874724781
82658,259,173,4,874724843


In [5]:
test.head()

Unnamed: 0,user_id,item_id,rating,timestamp
1346,3,245,1,889237247
27978,3,355,3,889237247
1260,3,335,1,889237269
38673,3,322,3,889237269
3761,3,323,2,889237269


In [6]:
def average_precision(actual, recommended, k=30):
    ap_sum = 0
    hits = 0
    for i in range(k):
        product_id = recommended[i] if i < len(recommended) else None
        if product_id is not None and product_id in actual:
            hits += 1
            ap_sum += hits / (i + 1)
    return ap_sum / k


def normalized_average_precision(actual, recommended, k=30):
    actual = set(actual)
    if len(actual) == 0:
        return 0.0

    ap = average_precision(actual, recommended, k=k)
    ap_ideal = average_precision(actual, list(actual)[:k], k=k)
    return ap / ap_ideal
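
As a quick sanity check of the metric, the sketch below restates the two functions in compact form and evaluates them on a made-up example (the item ids are arbitrary). Because the ideal list scores `min(len(actual), k) / k`, the normalized value reduces to the usual AP@k: the sum of the precisions at each hit, divided by `min(len(actual), k)`.

```python
def average_precision(actual, recommended, k=30):
    # Sum precision@i over the positions i where a relevant item appears.
    ap_sum, hits = 0.0, 0
    for i, item in enumerate(recommended[:k]):
        if item in actual:
            hits += 1
            ap_sum += hits / (i + 1)
    return ap_sum / k


def normalized_average_precision(actual, recommended, k=30):
    actual = set(actual)
    if not actual:
        return 0.0
    ideal = average_precision(actual, list(actual)[:k], k=k)
    return average_precision(actual, recommended, k=k) / ideal


# Toy example: two relevant items, hit at positions 1 and 3.
actual = [10, 20]
recommended = [10, 99, 20, 98]
score = normalized_average_precision(actual, recommended)
print(score)  # (1/1 + 2/3) / 2, about 0.833
```

A recommendation list with no relevant items scores 0, and a list that starts with all relevant items scores 1.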

In [15]:
def recommend(user):
    return [288, 1, 286, 121, 174]

In [16]:
scores = []
for user in tqdm(test['user_id'].unique()):
    actual = list(test[test['user_id'] == user]['item_id'])
    recommended = recommend(user)
    
    scores.append(normalized_average_precision(actual, recommended))

np.mean(scores)

100%|██████████| 301/301 [00:00<00:00, 1197.93it/s]


0.03566965142495101

In [None]:
# Task: train a model so that the metric exceeds 0.1

In [23]:
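def weight(item):
#     coef = (0.1, 0.5, 1, 2, 3)
    # Weighted count of an item's ratings: coef[i-1] is the weight of rating i.
    coef = (0.1, 0.2, 0.5, 1.5, 3)
    total = 0  # named total to avoid shadowing the built-in sum
    for i in range(1, 6):
        cnt = train[(train['item_id'] == item) & (train['rating'] == i)]['user_id'].count()
        total += coef[i-1] * cnt
    return total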

dct = {
    'item' : [],
    'weight' : []
}
for item in tqdm(train['item_id'].unique()):
    dct['item'].append(item)
    dct['weight'].append(weight(item))

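# Sort by weight so that best_item['item'][:30] holds the 30 highest-weighted items.
best_item = pd.DataFrame(dct).sort_values('weight', ascending=False)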

100%|██████████| 1616/1616 [00:12<00:00, 131.64it/s]
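
The per-item loop above filters the whole `train` frame five times for every item. A minimal single-pass sketch of the same weighted count is shown below. It works on plain `(item_id, rating)` pairs (for example `zip(train['item_id'], train['rating'])`), so it runs without the dataset; the helper name `popularity_ranking` and the toy data are illustrative assumptions. It also orders the items by weight, highest first, which is what a popularity fallback would typically want.

```python
from collections import defaultdict

# Same coefficients as in weight(): index 0 is rating 1, index 4 is rating 5.
COEF = (0.1, 0.2, 0.5, 1.5, 3)


def popularity_ranking(pairs, coef=COEF):
    """Return item ids sorted by weighted rating count, highest first."""
    weights = defaultdict(float)
    for item, rating in pairs:
        weights[item] += coef[int(rating) - 1]
    # Sort by weight descending; break ties by item id for determinism.
    return sorted(weights, key=lambda item: (-weights[item], item))


# Hypothetical toy data: (item_id, rating)
pairs = [(1, 5), (1, 4), (2, 5), (3, 1), (3, 2), (2, 1)]
ranking = popularity_ranking(pairs)
print(ranking)  # [1, 2, 3]
```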


# kNN

In [7]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619412 sha256=dd164942e20180a467d7f42cbf2ad5942bdd59d3789ffd51f8cc009cf6264fb1
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [8]:
from surprise import KNNWithMeans, KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import GridSearchCV

In [11]:
reader = Reader(rating_scale=(1, 5))

dataset = Dataset.load_from_df(train[["user_id", "item_id", "rating"]], reader)

In [95]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3, refit=True)
gs.fit(dataset)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [96]:
gs.predict(269, 13).est

3.5177

In [98]:
def recommend_knn_means(user):
    # Cold-start fallback: the first 30 rows of best_item.
    best_items = best_item['item'].values

    if user in train['user_id'].unique():
        # Items the user has already rated in train are excluded.
        seen = set(train.loc[train['user_id'] == user, 'item_id'])
        rec = [(gs.predict(user, i).est, i)
               for i in train['item_id'].unique() if i not in seen]

        # Highest predicted rating first.
        rec.sort(reverse=True)
        return [j[1] for j in rec[:30]]
    else:
        return best_items[:30]
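
The recommendation functions in this notebook all follow the same pattern: score every item the user has not rated yet and keep the 30 with the highest predicted rating. A minimal generic sketch of that step is given below. `top_n`, its parameters, and the toy `fake_predict` are hypothetical names; `predict` stands in for something like `lambda u, i: gs.predict(u, i).est`.

```python
import heapq


def top_n(predict, user, candidates, seen, n=30):
    """Return the n unseen candidates with the highest predicted score."""
    scored = ((predict(user, item), item) for item in candidates if item not in seen)
    # nlargest avoids sorting the full candidate list.
    return [item for _, item in heapq.nlargest(n, scored)]


# Toy stand-in for a fitted model: the score is the item id modulo 7.
def fake_predict(user, item):
    return item % 7


recs = top_n(fake_predict, user=1, candidates=range(1, 15), seen={6, 13}, n=3)
print(recs)  # [12, 5, 11]
```

Ties are broken by the larger item id, which matches the `(score, item)` tuple sort used in the notebook.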

In [99]:
scores = []
for user in tqdm(test['user_id'].unique()):
    actual = list(test[test['user_id'] == user]['item_id'])
    recommended = recommend_knn_means(user)
    
    scores.append(normalized_average_precision(actual, recommended))

np.mean(scores)

100%|██████████| 301/301 [03:03<00:00,  1.64it/s]


0.04423288002299501

# KNN Basic

In [100]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs_knnbasic = GridSearchCV(KNNBasic, param_grid, measures=["rmse", "mae"], cv=3, refit=True)
gs_knnbasic.fit(dataset)

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [101]:
print(gs_knnbasic.best_score["rmse"])
print(gs_knnbasic.best_params["rmse"])

0.9904492771131556
{'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}
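
The best configuration uses the `msd` similarity. As documented for Surprise, it is the inverse of one plus the mean squared difference over the items that two users have both rated, and pairs with fewer than `min_support` common items get a similarity of 0. A minimal pure-Python sketch for two users follows; the function name and the dict-based input format are illustrative assumptions, not the library's internals.

```python
def msd_similarity(ratings_u, ratings_v, min_support=3):
    """MSD similarity between two users given {item_id: rating} dicts."""
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < min_support:
        return 0.0  # too few co-rated items to trust the estimate
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


# Hypothetical users with three items in common (1, 2, 3).
u = {1: 5, 2: 3, 3: 4, 4: 1}
v = {1: 4, 2: 3, 3: 2, 5: 5}
sim = msd_similarity(u, v)                       # msd = (1 + 0 + 4) / 3
sim_strict = msd_similarity(u, v, min_support=4)  # only 3 common items
print(sim, sim_strict)
```

Raising `min_support` zeroes out more pairs, so neighbors are chosen only among users with enough overlap.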


In [102]:
def recommend_knn_basic(user):
    # Cold-start fallback: the first 30 rows of best_item.
    best_items = best_item['item'].values

    if user in train['user_id'].unique():
        # Items the user has already rated in train are excluded.
        seen = set(train.loc[train['user_id'] == user, 'item_id'])
        rec = [(gs_knnbasic.predict(user, i).est, i)
               for i in train['item_id'].unique() if i not in seen]

        # Highest predicted rating first.
        rec.sort(reverse=True)
        return [j[1] for j in rec[:30]]
    else:
        return best_items[:30]

In [103]:
scores = []
for user in tqdm(test['user_id'].unique()):
    actual = list(test[test['user_id'] == user]['item_id'])
    recommended = recommend_knn_basic(user)
    
    scores.append(normalized_average_precision(actual, recommended))

np.mean(scores)

100%|██████████| 301/301 [02:45<00:00,  1.82it/s]


0.044500359881094104

# SVD

In [104]:
from surprise import SVD

In [105]:
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6]
}
gs_svd = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3, refit=True)

gs_svd.fit(dataset)

In [106]:
print(gs_svd.best_score["rmse"])
print(gs_svd.best_params["rmse"])

0.9628905957969903
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
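
For context on the tuned parameters: as described in the Surprise documentation, `SVD` predicts a rating from a global mean, user and item biases, and a dot product of latent factors, and fits them by stochastic gradient descent on a regularized squared error. In the sketch below, $\lambda$ corresponds to `reg_all`, while `lr_all` is the SGD step size and `n_epochs` is the number of passes over the training ratings.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\, p,\, q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```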


In [107]:
def recommend_svd(user):
    # Cold-start fallback: the first 30 rows of best_item.
    best_items = best_item['item'].values

    if user in train['user_id'].unique():
        # Items the user has already rated in train are excluded.
        seen = set(train.loc[train['user_id'] == user, 'item_id'])
        rec = [(gs_svd.predict(user, i).est, i)
               for i in train['item_id'].unique() if i not in seen]

        # Highest predicted rating first.
        rec.sort(reverse=True)
        return [j[1] for j in rec[:30]]
    else:
        return best_items[:30]

In [108]:
scores = []
for user in tqdm(test['user_id'].unique()):
    actual = list(test[test['user_id'] == user]['item_id'])
    recommended = recommend_svd(user)
    
    scores.append(normalized_average_precision(actual, recommended))

np.mean(scores)

100%|██████████| 301/301 [02:30<00:00,  2.00it/s]


0.0559516863589438