<a href="https://colab.research.google.com/github/Kozhedu/recommender_systems/blob/main/%D0%9C%D0%BE%D0%B4%D1%83%D0%BB%D1%8C_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

## 1. Content-based recommender system (content-based model) ##

In [2]:
df = pd.read_csv('netflix_titles.csv')

In [3]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...


Converting text into vectors with TF-IDF (Term Frequency-Inverse Document Frequency).

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
model = TfidfVectorizer(stop_words='english')  # ignore English stop words

In [6]:
df['description'] = df['description'].fillna('')  # fill missing descriptions with empty strings

In [7]:
feature_matrix = model.fit_transform(df['description'])  # turn the descriptions into a TF-IDF matrix

In [8]:
feature_matrix

<7787x17905 sparse matrix of type '<class 'numpy.float64'>'
	with 107187 stored elements in Compressed Sparse Row format>

In [9]:
feature_matrix.shape

(7787, 17905)
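To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The toy corpus and the helper name `tfidf_rows` are made up for illustration, and tokenization is a plain whitespace split, which is simpler than what the vectorizer actually does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # term count times smoothed IDF
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # L2-normalize the row; "or 1.0" guards against empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["space crew explores galaxy", "space crew fights aliens", "romantic comedy"]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

Terms that occur in fewer documents ("galaxy") end up with a larger weight than terms shared across documents ("space").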

Next, compute the cosine similarity. One way to do it:

In [10]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(feature_matrix, feature_matrix)

In [11]:
cosine_sim

array([[1.        , 0.        , 0.05827946, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.09600035, 0.        ,
        0.        ],
       [0.05827946, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.09600035, 0.        , ..., 1.        , 0.        ,
        0.02819239],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.02819239, 0.        ,
        1.        ]])

Note: we use linear_kernel() here rather than cosine_similarity(). Cosine similarity divides the dot product of two vectors by the product of their norms, but TfidfVectorizer already L2-normalizes every row by default, so the plain dot product gives the same values without the extra normalization step.
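The equivalence can be checked on a small example. This is a sketch on a made-up three-document corpus (`toy_docs` is not part of the Netflix data); it assumes the vectorizer keeps its default `norm='l2'`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = [
    "a crew explores deep space",
    "a space crew fights aliens",
    "romantic comedy set in paris",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# With the default norm='l2', every row has unit length
row_norms = np.asarray(np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()

# so the dot product and the cosine similarity coincide
same = np.allclose(linear_kernel(toy_matrix, toy_matrix),
                   cosine_similarity(toy_matrix, toy_matrix))
print(row_norms, same)
```

If the vectorizer were created with `norm=None`, the two functions would generally disagree and `cosine_similarity()` would be the one to use.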

In [12]:
indices = pd.Series(df.index, index=df['title']).drop_duplicates()  # title -> row index lookup; drop_duplicates() compares the values (row indices), not the titles

In [13]:
indices

title
3%                                            0
7:19                                          1
23:59                                         2
9                                             3
21                                            4
                                           ... 
Zozo                                       7782
Zubaan                                     7783
Zulu Man in Japan                          7784
Zumbo's Just Desserts                      7785
ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS    7786
Length: 7787, dtype: int64
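One caveat about this lookup: `Series.drop_duplicates()` compares the values, which here are the row indices and are already unique, so the length stays the same as the DataFrame's. If a catalogue contained repeated titles, `indices[title]` would return several rows instead of a single integer. The sketch below, on a made-up three-row frame, shows one possible way to keep the first row per title.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Alpha", "Beta", "Alpha"]})
toy_indices = pd.Series(toy.index, index=toy["title"])

# drop_duplicates() looks at the values (0, 1, 2), so nothing is removed
unchanged = toy_indices.drop_duplicates()

# keep only the first row for each title instead
by_title = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(len(unchanged), len(by_title), by_title["Alpha"])
```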

Now write a function that produces recommendations:

In [14]:
def get_recommendations(title):
    idx = indices[title]
    # pair every row index with its precomputed cosine similarity to this title
    scores = list(enumerate(cosine_sim[idx]))
    # sort the titles by cosine similarity, highest first
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    # keep the ten highest scores; skip position 0, which is the title itself
    scores = scores[1:11]
    # extract the row indices
    ind_movie = [i[0] for i in scores]
    # return the titles at those indices
    return df['title'].iloc[ind_movie]

In [15]:
get_recommendations('Star Trek')

5788             Star Trek: The Next Generation
5787                      Star Trek: Enterprise
5786                 Star Trek: Deep Space Nine
5557                     She's Out of My League
134                                  7 Days Out
6664                        The Midnight Gospel
6023                                     Teresa
4863    Pinkfong & Baby Shark's Space Adventure
5104                                       Rats
5970                             Tales by Light
Name: title, dtype: object

Find the second recommendation for Balto, the children's film released in 1995:

In [16]:
get_recommendations('Balto')

709                Balto 2: Wolf Quest
7446                           Vroomiz
1338    Chilling Adventures of Sabrina
7388                          Vampires
1770                          Dinotrux
2767                     Hold the Dark
5540                 Shanghai Fortress
4041                             Mercy
2582                       Half & Half
1365        Christmas in the Heartland
Name: title, dtype: object
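The whole content-based pipeline of this section can be sketched end to end on a toy catalogue. The titles, descriptions and the helper name `recommend` are invented for illustration. Unlike `get_recommendations`, this variant removes the query title explicitly instead of assuming it is ranked first, which is one possible way to stay correct when two titles share an identical description.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

toy = pd.DataFrame({
    "title": ["Star Voyage", "Star Patrol", "Paris Romance"],
    "description": [
        "starship crew explores distant galaxy",
        "starship crew defends galaxy from invaders",
        "two strangers fall in love in paris",
    ],
})
toy_sim = linear_kernel(TfidfVectorizer().fit_transform(toy["description"]))
toy_index = pd.Series(toy.index, index=toy["title"])

def recommend(title, n=2):
    """Return the n titles most similar to `title` (hypothetical helper)."""
    idx = toy_index[title]
    order = np.argsort(-toy_sim[idx])   # most similar first
    order = order[order != idx][:n]     # drop the query title itself
    return toy["title"].iloc[order].tolist()

print(recommend("Star Voyage", n=1))
```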

## 2. Collaborative filtering ##

In [17]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp39-cp39-linux_x86_64.whl size=3193672 sha256=fbf71a6d32acc0644cf7225efd5eaf8cbf013bf431f95b9e85294a9732739555
  Stored in directory: /root/.cache/pip/wheels/c6/3a/46/9b17b3512bdf283c6cb84f59929cdd5199d4e754d596d22784
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [18]:
from surprise import Dataset
from surprise import Reader
from surprise.dataset import BUILTIN_DATASETS  # registry of built-in datasets (not used below: the ratings are loaded from a local file)

data = Dataset.load_from_file(
    "u.data.txt",
    reader=Reader(line_format="user item rating timestamp", sep="\t"),
)

In [19]:
df = pd.DataFrame(data.raw_ratings, columns=['userId', 'movieId', 'rating', 'timestamp'])

In [20]:
df

Unnamed: 0,userId,movieId,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596
...,...,...,...,...
99995,880,476,3.0,880175444
99996,716,204,5.0,879795543
99997,276,1090,1.0,874795795
99998,13,225,2.0,882399156


In [21]:
df.movieId.nunique()

1682

In [22]:
df.userId.nunique()

943

In [23]:
df.groupby("rating").count()

Unnamed: 0_level_0,userId,movieId,timestamp
rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,6110,6110,6110
2.0,11370,11370,11370
3.0,27145,27145,27145
4.0,34174,34174,34174
5.0,21201,21201,21201


In [25]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.25, random_state=13)
len(testset)

25000

In [26]:
from surprise import SVD, KNNBasic, accuracy

In [27]:
sim_options = {
    'name': 'cosine',
    'user_based': False
}

# Plain collaborative filtering: cosine similarity and the item-based approach
knn = KNNBasic(sim_options=sim_options)

In [28]:
knn.fit(trainset)  # train the algorithm

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f38f63c2e80>
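As a rough illustration of what the item-based model estimates, the sketch below computes a similarity-weighted mean of one user's ratings over the k items most similar to the target item. The helper `knn_item_estimate` and its toy inputs are hypothetical; Surprise's own implementation covers more cases (for example, what to return when no usable neighbours exist), so treat this as a simplified picture rather than the library's code.

```python
import heapq

def knn_item_estimate(user_ratings, sims_to_target, k=40):
    """Similarity-weighted mean of the user's ratings on the k nearest items.

    user_ratings: {item_id: rating} for one user (toy data)
    sims_to_target: {item_id: similarity to the item being predicted}
    """
    neighbours = [(sims_to_target[j], r) for j, r in user_ratings.items()
                  if j in sims_to_target]
    nearest = heapq.nlargest(k, neighbours, key=lambda pair: pair[0])
    # only neighbours with positive similarity contribute
    num = sum(s * r for s, r in nearest if s > 0)
    den = sum(s for s, r in nearest if s > 0)
    return num / den if den else None

est = knn_item_estimate({"a": 5.0, "b": 3.0, "c": 1.0},
                        {"a": 0.9, "b": 0.6, "c": 0.1}, k=2)
print(est)
```

With `k=2` only items "a" and "b" are used, giving (0.9 * 5 + 0.6 * 3) / (0.9 + 0.6).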

In [29]:
predictions = knn.test(testset)


In [None]:
predictions 

In [31]:
for prediction in predictions:
    if prediction.uid == '500' and prediction.iid == '699':
        print(prediction.r_ui)
        print(round(prediction.est, 2))
        break

3.0
3.47


Now compute the RMSE of the resulting predictions:

In [32]:
accuracy.rmse(predictions)

RMSE: 1.0272


1.0271678039029761
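For reference, RMSE is the square root of the mean squared difference between the true rating and the estimate. A minimal sketch of the metric on made-up pairs (the helper `rmse` is illustrative and independent of `accuracy.rmse`):

```python
import math

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# errors are 0 and 2, so the result is sqrt((0 + 4) / 2)
value = rmse([(3.0, 3.0), (4.0, 2.0)])
print(value)
```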

In [33]:
pred = pd.DataFrame(predictions)

In [34]:
pred

Unnamed: 0,uid,iid,r_ui,est,details
0,7,633,5.0,4.199452,"{'actual_k': 40, 'was_impossible': False}"
1,422,287,3.0,3.470344,"{'actual_k': 40, 'was_impossible': False}"
2,804,163,3.0,3.571674,"{'actual_k': 40, 'was_impossible': False}"
3,189,480,5.0,4.222826,"{'actual_k': 40, 'was_impossible': False}"
4,238,546,3.0,3.473417,"{'actual_k': 17, 'was_impossible': False}"
...,...,...,...,...,...
24995,426,617,3.0,3.822890,"{'actual_k': 40, 'was_impossible': False}"
24996,328,708,2.0,3.247313,"{'actual_k': 40, 'was_impossible': False}"
24997,727,465,2.0,2.598054,"{'actual_k': 40, 'was_impossible': False}"
24998,376,328,3.0,3.749518,"{'actual_k': 20, 'was_impossible': False}"


In [35]:
pred.sort_values(by=['est'], inplace=True, ascending=False)

In [36]:
recom=pred[pred.uid == '849']['iid'].to_list()

In [37]:
recom

['234', '427', '568', '174']
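The same ranking can be built for every user at once. The sketch below groups prediction tuples by user and keeps the n highest estimates; the tuples imitate the `(uid, iid, r_ui, est)` fields seen above, but the ids and values are invented. Keep in mind that predictions made on the test set only cover items each user has already rated there; to recommend unseen items one would have to score user-item pairs absent from the training data (Surprise offers `trainset.build_anti_testset()` for that purpose).

```python
from collections import defaultdict

def top_n(predictions, n=3):
    """Group (uid, iid, r_ui, est) tuples by user and keep the n highest estimates."""
    per_user = defaultdict(list)
    for uid, iid, _r_ui, est in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: [iid for iid, _ in sorted(items, key=lambda p: p[1], reverse=True)[:n]]
        for uid, items in per_user.items()
    }

toy_predictions = [
    ("u1", "i1", 5.0, 4.6),
    ("u1", "i2", 4.0, 3.1),
    ("u1", "i3", 5.0, 4.2),
    ("u2", "i1", 3.0, 3.9),
]
print(top_n(toy_predictions, n=2))
```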

Implement the user-based algorithm.

In [40]:
sim_options = {'name': 'cosine', 'user_based': True}

In [41]:
knn = KNNBasic(sim_options=sim_options)

In [42]:
knn.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f38ee2cbe50>

In [43]:
pred = knn.test(testset)

In [46]:
accuracy.rmse(pred)

RMSE: 1.0175


1.0174852296380237

The SVD algorithm

In [48]:
model = SVD()
model.fit(trainset)
predict = model.test(testset)
accuracy.rmse(predict)

RMSE: 0.9417


0.9416585746010805
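In the biased matrix-factorization model that Surprise documents for `SVD`, a rating is estimated as the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors. The sketch below evaluates that formula for made-up parameter values; in the real model these parameters are learned by stochastic gradient descent, and the library additionally clips the estimate to the rating scale.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

# all numbers here are invented for illustration
est = svd_estimate(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -1.0], q_i=[0.4, 0.1])
print(round(est, 2))
```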

## 4. Hybrid models ##

In [2]:
!pip install lightfm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lightfm
  Downloading lightfm-1.16.tar.gz (310 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.16-cp39-cp39-linux_x86_64.whl size=895275 sha256=badbe780aa11bd43454be3ff14bdc9466b39933040a465a1bb5dfa3e95a2fb49
  Stored in directory: /root/.cache/pip/wheels/d7/75/52/e42e5f9cd86d4902a352aff4dadde75ec041af713ffcf3ed05
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.16


In [3]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k, recall_at_k    

In [4]:
ratings = pd.read_csv('ratings.csv')  # ratings given by users
books = pd.read_csv('books.csv')  # information about the books
tags = pd.read_csv('tags.csv')  # information about the tags
book_tags = pd.read_csv('book_tags.csv')  # books paired with their tags

In [5]:
books.head(2)

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...


In [6]:
book_tags.head(2)

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174


Add a column with the regular book id to book_tags, using the mapping between the regular id and the Goodreads id.

In [7]:
dict_map = dict(zip(books.goodreads_book_id, books.book_id))

In [8]:
book_tags['id'] = book_tags.goodreads_book_id.apply(lambda x: dict_map[x])

In [9]:
book_tags[book_tags['goodreads_book_id']==5]

Unnamed: 0,goodreads_book_id,tag_id,count,id
300,5,11557,40087,18
301,5,11305,39330,18
302,5,8717,17944,18
303,5,33114,12856,18
304,5,30574,11909,18
...,...,...,...,...
395,5,20781,299,18
396,5,32345,298,18
397,5,12600,282,18
398,5,3379,277,18
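A possible alternative to `apply` with a lambda is `Series.map`, which accepts the dictionary directly. The sketch below uses a tiny made-up `toy_book_tags` frame (the two id pairs mirror the `books.head(2)` output above; the id 999 is invented). One behavioural difference worth knowing: `map` produces NaN for ids missing from the mapping, whereas `dict_map[x]` raises a `KeyError`.

```python
import pandas as pd

toy_books = pd.DataFrame({"goodreads_book_id": [2767052, 3], "book_id": [1, 2]})
toy_book_tags = pd.DataFrame({"goodreads_book_id": [3, 2767052, 3, 999]})

id_map = dict(zip(toy_books.goodreads_book_id, toy_books.book_id))

# .map() yields NaN for ids that are missing from the mapping
toy_book_tags["id"] = toy_book_tags.goodreads_book_id.map(id_map)
unmatched = int(toy_book_tags["id"].isna().sum())
print(toy_book_tags, unmatched)
```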


Next, keep only those rows of book_tags whose tags are present in the tags table.

In [10]:
tags.head(2)

Unnamed: 0,tag_id,tag_name
0,509,19th-century
1,923,20th-century


In [11]:
book_tags = book_tags[book_tags.tag_id.isin(tags.tag_id)]
book_tags.shape

(300738, 4)

In [12]:
book_tags

Unnamed: 0,goodreads_book_id,tag_id,count,id
1,1,11305,37174,27
4,1,33114,12716,27
5,1,11743,9954,27
6,1,14017,7169,27
10,1,27199,3857,27
...,...,...,...,...
999877,33288638,9886,10,8892
999879,33288638,3358,10,8892
999880,33288638,1679,10,8892
999889,33288638,1659,9,8892


In [13]:
from scipy.sparse import csr_matrix

In [14]:
# The ratings become the matrix values; user ids and book ids become the row and column indices
ratings_matrix = csr_matrix((ratings.rating, (ratings.user_id, ratings.book_id)))

In [15]:
meta_matrix  = csr_matrix(([1]*len(book_tags),(book_tags.id,book_tags.tag_id))) 
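To see what the `csr_matrix((values, (rows, cols)))` constructor does, here is a sketch on three made-up ratings. The shape is inferred from the largest ids, so when ids start at 1, row 0 and column 0 stay empty. Averaging over the whole matrix counts every empty cell as zero.

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy triplets: (user_id, book_id) -> rating
user_id = np.array([1, 1, 3])
book_id = np.array([2, 4, 2])
rating = np.array([5, 3, 4])

toy_ratings = csr_matrix((rating, (user_id, book_id)))
print(toy_ratings.shape)      # (4, 5): largest id + 1 along each axis
print(toy_ratings.toarray())
print(toy_ratings.mean())     # 12 / 20 cells, empty cells count as 0
```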

In [17]:
ratings_matrix.mean()

0.007086188900997592

In [18]:
model = LightFM(loss='warp',         # loss function
                random_state=13,     # fix the random seed for reproducibility
                learning_rate=0.05,  # learning rate
                no_components=100)   # dimensionality of the latent embeddings

In [19]:
train,test = random_train_test_split(ratings_matrix, test_percentage=0.3, random_state=13)

In [20]:
model = model.fit(train, item_features = meta_matrix)

In [21]:
prec_score = precision_at_k(
                     model,
                     test,
                     item_features = meta_matrix).mean() 
print(prec_score)

0.017568793
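For intuition about the metric: precision at k is the share of the first k recommended items that the user actually interacted with in the test data. LightFM's `precision_at_k` uses `k=10` by default and returns one value per user, which is why the cell above calls `.mean()`. The helper below is a hypothetical single-user version on invented ids, not the library's implementation.

```python
def precision_at_k_single(ranked_items, relevant_items, k=10):
    """Share of the first k ranked items that appear in relevant_items."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k

# two of the four recommended books are in the user's test interactions
score = precision_at_k_single(["b1", "b7", "b3", "b9"], {"b3", "b1"}, k=4)
print(score)
```

Note that the evaluation call above does not pass `train_interactions`, so items already seen during training may occupy some of the top positions; passing the training matrix through that parameter is one way to exclude them.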
