**Assignment**

Build a hybrid recommender system

**1. Install the LightFM library and load the dataset**

In [2]:
import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

In [3]:
!pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=808330 sha256=ec1c616ecd9dd14ccb35e1f4e8ae8ba0f9a41cfc06888e338db1aaba9534764a
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [4]:
from lightfm import LightFM

In [5]:
!wget 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

--2024-10-23 12:48:12--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2024-10-23 12:48:12 (4.31 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [6]:
!unzip ml-latest-small.zip

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [83]:
links = pd.read_csv('/content/ml-latest-small/links.csv')
movies = pd.read_csv('/content/ml-latest-small/movies.csv')
ratings = pd.read_csv('/content/ml-latest-small/ratings.csv')
tags = pd.read_csv('/content/ml-latest-small/tags.csv')

In [84]:
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


**2. Preparing the feature matrices**

To build a hybrid recommender system with the LightFM library, we need to assemble:

1) an interaction matrix (whether or not the user rated a given film)

2) a TF-IDF matrix (the genre description of each film)

**Building the interaction matrix**

If the rating is 3 or higher, we treat the film as liked and put 1 in the interactions column. Otherwise we put -1.


In [85]:
movies_with_ratings['interactions'] = movies_with_ratings['rating'] >= 3
movies_with_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,interactions
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,True
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,True
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946,True
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970,False
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483,True
...,...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082,True
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545,True
100833,193585,Flint (2017),Drama,184,3.5,1537109805,True
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021,True


In [86]:
movies_with_ratings['interactions'] = movies_with_ratings['interactions'].map({False: -1, True: 1})
movies_with_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,interactions
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,1
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,1
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946,1
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970,-1
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483,1
...,...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082,1
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545,1
100833,193585,Flint (2017),Drama,184,3.5,1537109805,1
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021,1


In [87]:
df_interactions = movies_with_ratings[['userId', 'title', 'interactions']]
df_interactions

Unnamed: 0,userId,title,interactions
0,1,Toy Story (1995),1
1,5,Toy Story (1995),1
2,7,Toy Story (1995),1
3,15,Toy Story (1995),-1
4,17,Toy Story (1995),1
...,...,...,...
100831,184,Black Butler: Book of the Atlantic (2017),1
100832,184,No Game No Life: Zero (2017),1
100833,184,Flint (2017),1
100834,184,Bungo Stray Dogs: Dead Apple (2018),1


In [88]:
interactions_data = df_interactions.groupby(['userId', 'title']).value_counts().reset_index(level=2)
interactions_data

Unnamed: 0_level_0,Unnamed: 1_level_0,interactions,count
userId,title,Unnamed: 2_level_1,Unnamed: 3_level_1
1,"13th Warrior, The (1999)",1,1
1,20 Dates (1998),1,1
1,"Abyss, The (1989)",1,1
1,"Adventures of Robin Hood, The (1938)",1,1
1,Alice in Wonderland (1951),1,1
...,...,...,...
610,[REC] (2007),1,1
610,[REC]² (2009),1,1
610,[REC]³ 3 Génesis (2012),1,1
610,xXx (2002),-1,1


In [89]:
interactions = interactions_data.interactions.unstack().fillna(0)
interactions

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This gives us an interaction matrix in which:
- 1 means the film was watched and liked
- 0 means the film was not watched (no rating was given)
- -1 means the film was watched and disliked
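
The same 1 / 0 / -1 encoding can also be built as a sparse matrix directly, without first pivoting to a dense frame. The sketch below is only an illustration on a hypothetical toy frame (`toy`, `toy_matrix` and the ratings in it are made up, not taken from MovieLens); it uses the same "3 or higher" threshold as the cells above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy ratings (hypothetical data, not taken from MovieLens).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["A", "B", "A", "B", "C"],
    "rating": [4.0, 2.5, 3.0, 5.0, 1.0],
})

# +1 for ratings >= 3, -1 otherwise; unrated pairs stay implicit zeros.
signs = np.where(toy["rating"] >= 3, 1, -1)

# factorize assigns consecutive integer codes to users and titles.
user_codes, user_labels = pd.factorize(toy["userId"], sort=True)
item_codes, item_labels = pd.factorize(toy["title"], sort=True)

toy_matrix = sparse.coo_matrix(
    (signs, (user_codes, item_codes)),
    shape=(len(user_labels), len(item_labels)),
).tocsr()

print(toy_matrix.toarray())
```

Rows follow the sorted `userId` values and columns the sorted titles, so `user_labels` and `item_labels` are needed to translate positions back to labels.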

In [90]:
from scipy import sparse

interactions_sparse = sparse.csr_matrix(interactions)
interactions_sparse

<610x9719 sparse matrix of type '<class 'numpy.float64'>'
	with 100832 stored elements in Compressed Sparse Row format>

**Building a collaborative filtering model**

In [91]:
model = LightFM(no_components=30, random_state=10, loss='logistic')
model.fit(interactions_sparse, epochs=10)

<lightfm.lightfm.LightFM at 0x7c939cc85390>

In [92]:
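n_items = interactions.shape[1]
# predict() expects the 0-based row position in interactions_sparse, not a userId label:
# position 2 is the third row of `interactions`, i.e. userId 3.
user_id = 2
scores = model.predict(user_id, np.arange(n_items))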
scores

array([-0.20011519, -0.12887977, -0.06598704, ..., -0.37567177,
       -0.20808762, -0.43475148], dtype=float32)

In [93]:
len(scores)

9719
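
Because `model.predict` works with row positions while the pivoted frame is indexed by `userId` labels, it can help to convert explicitly between the two. A minimal sketch, assuming a frame shaped like `interactions`; `toy_interactions` and `row_index_for` are hypothetical names introduced only for this illustration:

```python
import pandas as pd

# Hypothetical miniature of the pivoted frame: the index holds userId labels.
toy_interactions = pd.DataFrame(
    [[1, 0], [0, -1], [1, 1]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=["Film A", "Film B"],
)

def row_index_for(user_label, frame):
    """Return the 0-based row position of a userId label."""
    return frame.index.get_loc(user_label)

user_label = 2
user_idx = row_index_for(user_label, toy_interactions)
print(user_idx)  # the position to pass to model.predict for this userId
```

With the real data, `row_index_for(2, interactions)` would give the position to use when scores for `userId` 2 are wanted.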

In [97]:
scores = pd.Series(scores)
scores.index = interactions.columns
scores

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
'71 (2014),-0.200115
'Hellboy': The Seeds of Creation (2004),-0.128880
'Round Midnight (1986),-0.065987
'Salem's Lot (2004),-0.107230
'Til There Was You (1997),-0.022189
...,...
eXistenZ (1999),0.057374
xXx (2002),-0.215697
xXx: State of the Union (2005),-0.375672
¡Three Amigos! (1986),-0.208088


In [98]:
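# .loc selects by userId label, so this is the row of userId 2, whereas the scores above
# were computed for row position 2; interactions.iloc[user_id] would select that same row.
user_row = interactions.loc[user_id]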
known_items = user_row[user_row != 0].index
len(known_items)

29

In [99]:
unknown_items = list(set(interactions.columns) - set(known_items))
scores = scores[unknown_items].sort_values(ascending=False)
scores

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
Twelve Monkeys (a.k.a. 12 Monkeys) (1995),1.234868
Apollo 13 (1995),1.146255
Forrest Gump (1994),1.091770
Beauty and the Beast (1991),1.068164
"Fugitive, The (1993)",1.064699
...,...
Anaconda (1997),-0.712921
Nutty Professor II: The Klumps (2000),-0.731170
Sky Captain and the World of Tomorrow (2004),-0.749325
Daredevil (2003),-0.799701


We now have a collaborative recommender that ranks the films the user has not rated, from those most likely to be liked down to those least suitable to recommend.
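
The filtering and sorting steps above can be folded into one small helper. This is only a sketch under the assumption that scores and the user's interaction row are both pandas Series indexed by title; `top_n_unseen` is a hypothetical name, not part of LightFM.

```python
import pandas as pd

def top_n_unseen(scores, user_row, n=5):
    """Return the n highest-scoring titles the user has not rated.

    scores   -- pd.Series of predicted scores indexed by title
    user_row -- pd.Series of interactions (0 means unrated), indexed by title
    """
    # Titles missing from user_row are treated as unrated.
    unseen = user_row.reindex(scores.index).fillna(0) == 0
    return scores[unseen].nlargest(n)

toy_scores = pd.Series([0.9, 0.1, 0.7, -0.3], index=["A", "B", "C", "D"])
toy_row = pd.Series([1.0, 0.0, 0.0, -1.0], index=["A", "B", "C", "D"])
print(top_n_unseen(toy_scores, toy_row, n=2))
```

Using a boolean mask rather than a Python `set` keeps the original title order until the final sort.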

**Adding a recommender based on the content-based approach**

In [100]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [101]:
# Clean the genres column: drop spaces and hyphens, turn '|' into a space, lowercase
def change_string_genres(s):
    return s.replace(' ', '').replace('-', '').replace('|', ' ').lower()

In [102]:
df_movies = movies.copy()  # work on a copy so the original movies frame keeps its raw genres
df_movies['genres'] = df_movies['genres'].apply(change_string_genres)
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure animation children comedy fantasy
1,2,Jumanji (1995),adventure children fantasy
2,3,Grumpier Old Men (1995),comedy romance
3,4,Waiting to Exhale (1995),comedy drama romance
4,5,Father of the Bride Part II (1995),comedy


In [103]:
# Collect the genre strings into a list
movies_genres_list = df_movies['genres'].tolist()

movies_genres_list[:10]

['adventure animation children comedy fantasy',
 'adventure children fantasy',
 'comedy romance',
 'comedy drama romance',
 'comedy',
 'action crime thriller',
 'comedy romance',
 'adventure children',
 'action',
 'action adventure thriller']

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [105]:
# Convert the genre strings into TF-IDF vectors
tfidf_genres = TfidfVectorizer()
X_train_tfidf_genres = tfidf_genres.fit_transform(movies_genres_list)
X_train_tfidf_genres

<9742x20 sparse matrix of type '<class 'numpy.float64'>'
	with 22084 stored elements in Compressed Sparse Row format>

In [106]:
# A DataFrame is a more convenient representation for the LightFM model
features = tfidf_genres.get_feature_names_out()
tfidf_movie = pd.DataFrame(X_train_tfidf_genres.toarray(), columns=features)
tfidf_movie.head()

Unnamed: 0,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,filmnoir,horror,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western
0,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.0,0.48299,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,0.0,0.593662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,0.466405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
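
To see where these weights come from, the computation can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 row normalisation) and uses a hypothetical three-film corpus; it is an illustration of the formula, not the library's implementation.

```python
import numpy as np

# Hypothetical three-film corpus; each genre appears at most once per film.
docs = ["comedy romance", "comedy drama romance", "comedy"]
vocab = sorted({g for d in docs for g in d.split()})

# Term counts: rows are films, columns are genres.
counts = np.array([[d.split().count(g) for g in vocab] for d in docs], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
weighted = counts * idf
# Scale every row to unit L2 norm.
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(vocab)
print(tfidf.round(3))
```

A film with a single genre always gets weight 1.0 after normalisation, which matches the `comedy`-only row in the table above, and within a row the rarer genre receives the larger weight.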


In [107]:
# Add the movie ID
tfidf_movie['movieId'] = movies['movieId']
tfidf_movie.head()

Unnamed: 0,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,filmnoir,...,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western,movieId
0,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.0,0.48299,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,0.0,0.593662,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0,3
3,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,0.466405,0.0,0.0,...,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0,4
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


**Assembling the LightFM model**

In [108]:
from lightfm.data import Dataset

ds = Dataset()
ds

<lightfm.data.Dataset at 0x7c939c0240d0>

In [109]:
# Register users, items and feature names; interactions are added later
ds.fit(users = movies_with_ratings['userId'].unique(),
       items = movies['movieId'],
       item_features = features)

In [110]:
def transform_features(features, id_name):
    """
    Convert a feature DataFrame into the input format of
    build_user_features / build_item_features.

    Returns a list of (id, {feature name: weight}) tuples that keeps only
    the non-zero features of each row.

    Based on the LightFM documentation.
    """

    transformed_features = []

    for row in features.to_dict(orient='records'):
        id_value = row.pop(id_name)
        feature_weights = {key: value for key, value in row.items() if value != 0}
        transformed_features.append((id_value, feature_weights))
    return transformed_features


In [111]:
transform_features(tfidf_movie, 'movieId')[:5]

[(1,
  {'adventure': 0.41684567364693936,
   'animation': 0.5162254711770092,
   'children': 0.5048454681396087,
   'comedy': 0.26758647689140014,
   'fantasy': 0.482990142708577}),
 (2,
  {'adventure': 0.5123612074824269,
   'children': 0.6205251727456431,
   'fantasy': 0.5936619434123594}),
 (3, {'comedy': 0.5709154064399099, 'romance': 0.8210088907493954}),
 (4,
  {'comedy': 0.5050154397005037,
   'drama': 0.46640480307738325,
   'romance': 0.726240982959826}),
 (5, {'comedy': 1.0})]

In [112]:
item_features_matrix = ds.build_item_features(transform_features(tfidf_movie, 'movieId'))
item_features_matrix

<9742x9762 sparse matrix of type '<class 'numpy.float32'>'
	with 31826 stored elements in Compressed Sparse Row format>
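
The shape 9742x9762 is wider than the 20 genre columns because, with its default settings, `Dataset` also creates one identity feature per item (9742 + 20 = 9762 columns, 9742 + 22084 = 31826 stored values), and `build_item_features` rescales each row so that its weights sum to 1. The sketch below imitates that layout with plain SciPy on hypothetical weights; the column order and the helper names are assumptions made for illustration, not LightFM code.

```python
import numpy as np
from scipy import sparse

n_items = 3
# Hypothetical genre weights for 3 items and 2 genre features.
genre_weights = sparse.csr_matrix(np.array([
    [0.6, 0.8],
    [1.0, 0.0],
    [0.0, 0.0],
]))

# One identity feature per item, followed by the genre features.
stacked = sparse.hstack(
    [sparse.identity(n_items, format="csr"), genre_weights]
).tocsr()

# Scale each row so that its weights sum to 1.
row_sums = np.asarray(stacked.sum(axis=1)).ravel()
normalized = sparse.diags(1.0 / row_sums) @ stacked

print(normalized.shape)
print(normalized.toarray().round(3))
```

The identity block guarantees that every row has at least one non-zero entry, so an item without any genre still has its own embedding.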

In [113]:
# (userId, movieId) pairs; the rating itself is not passed, so every rated film counts as an interaction
data = list(movies_with_ratings[['userId', 'movieId']].itertuples(index=False, name=None))

data[:10]

[(1, 1),
 (5, 1),
 (7, 1),
 (15, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (21, 1),
 (27, 1),
 (31, 1)]

In [114]:
interaction_matrix = ds.build_interactions(data=data)[0]
interaction_matrix

<610x9742 sparse matrix of type '<class 'numpy.int32'>'
	with 100836 stored elements in COOrdinate format>
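
Note that all 100836 ratings, including the low ones, enter this matrix as interactions, because the tuples carry no rating. One possible variation, which is not what this notebook does, is to keep only ratings of 3 or higher before building the tuples. A minimal sketch on a hypothetical toy frame (`toy_ratings` and `positive_pairs` are illustrative names):

```python
import pandas as pd

# Hypothetical ratings in the same column layout as movies_with_ratings.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 5, 7],
    "movieId": [1, 3, 1, 1],
    "rating":  [4.0, 2.0, 4.0, 4.5],
})

# Keep only the films the user liked (rating of 3 or higher).
liked = toy_ratings[toy_ratings["rating"] >= 3]
positive_pairs = list(
    liked[["userId", "movieId"]].itertuples(index=False, name=None)
)
print(positive_pairs)
```

The resulting list has the same shape as `data` and could be passed to `ds.build_interactions` in its place.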

**The final hybrid recommender model**

In [115]:
model = LightFM(no_components=30, random_state=10, loss='bpr')
model.fit(interaction_matrix, epochs=10, item_features=item_features_matrix)

<lightfm.lightfm.LightFM at 0x7c939c026f80>

In [116]:
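n_items = interaction_matrix.shape[1]
# predict() takes the internal index assigned by ds.fit, not the raw userId;
# ds.mapping()[0] holds the userId -> internal index dictionary.
# The model was fitted with item_features, so item_features=item_features_matrix should be
# passed here as well; without it the genre features do not contribute to the scores.
user_id = 2
scores = model.predict(user_id, np.arange(n_items))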
scores

array([-2.0344148, -2.6314387, -3.5380545, ..., -3.4019089, -3.5515428,
       -3.3523276], dtype=float32)
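
To find out which user an internal index refers to, the index assignment can be imitated with plain pandas. The sketch below assumes that `Dataset.fit` numbers ids in the order it first sees them; the sample ids are the first `userId` values from the tuples shown earlier, and the dictionaries are hypothetical stand-ins for what `ds.mapping()[0]` returns on a fitted `Dataset`.

```python
import pandas as pd

# First few userId values in the order they appear in the interaction tuples.
user_ids_in_order = pd.Series([1, 5, 7, 15, 17, 1, 5])

# Imitates first-seen index assignment (assumption about Dataset.fit).
user_id_map = {
    int(uid): idx for idx, uid in enumerate(pd.unique(user_ids_in_order))
}
index_to_user = {idx: uid for uid, idx in user_id_map.items()}

print(user_id_map)
print(index_to_user[2])  # the userId behind internal index 2 in this sample
```

Under that assumption, internal index 2 in this notebook would correspond to `userId` 7 rather than `userId` 2, so looking the index up through the mapping is safer than reusing the raw id.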

In [117]:
scores = pd.Series(scores)
scores.index = movies['title']
scores.sort_values(ascending=False)

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
Forrest Gump (1994),0.158636
"Lord of the Rings: The Return of the King, The (2003)",-0.870440
"Matrix, The (1999)",-0.938254
Pulp Fiction (1994),-0.949107
"Shawshank Redemption, The (1994)",-0.997824
...,...
Run Lola Run (Lola rennt) (1998),-4.200052
Heathers (1989),-4.201553
Under Siege 2: Dark Territory (1995),-4.225579
Sleeper (1973),-4.238766
