<a href="https://colab.research.google.com/github/Murcha1990/ML_AI24/blob/main/Lesson22_RecSys/Intro_RecSys.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Recommendation Algorithms

We will analyze what music users listen to and build simple recommendations from the listening matrix.

* In the matrix `sample_matrix`, rows are users and columns are artists.

* For each (user, artist) pair, the table holds a number: the share (fraction) of that user's listening that went to this artist.

## Importing libraries and loading data

In [None]:
import pandas as pd

In [None]:
ratings = pd.read_excel("https://github.com/evgpat/edu_stepik_rec_sys/blob/main/datasets/sample_matrix.xlsx?raw=true", engine='openpyxl')

In [None]:
ratings.head()

Unnamed: 0,user,the beatles,radiohead,deathcab for cutie,coldplay,modest mouse,sufjan stevens,dylan. bob,red hot clili peppers,pink fluid,...,municipal waste,townes van zandt,curtis mayfield,jewel,lamb,michal w. smith,群星,agalloch,meshuggah,yellowcard
0,0,,0.020417,,,,,,0.030496,,...,,,,,,,,,,
1,1,,0.184962,0.024561,,,0.136341,,,,...,,,,,,,,,,
2,2,,,0.028635,,,,0.024559,,,...,,,,,,,,,,
3,3,,,,,,,,,,...,,,,,,,,,,
4,4,0.043529,0.086281,0.03459,0.016712,0.015935,,,,,...,,,,,,,,,,


In [None]:
ratings.drop('user', axis=1, inplace=True)

# Approach 1: Finding similar artists with the Jaccard measure

Convert the data to a binary format (listened / did not listen).

In [None]:
import numpy as np

ratings = pd.DataFrame(np.where(pd.isna(ratings), 0, 1), columns=ratings.columns)

ratings.head()

Unnamed: 0,the beatles,radiohead,deathcab for cutie,coldplay,modest mouse,sufjan stevens,dylan. bob,red hot clili peppers,pink fluid,kanye west,...,municipal waste,townes van zandt,curtis mayfield,jewel,lamb,michal w. smith,群星,agalloch,meshuggah,yellowcard
0,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
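The same binarization can also be written with pandas alone. Below is a minimal sketch on a small made-up frame; the `toy` data and the artist names `a`, `b`, `c` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listening shares: NaN means the user never played the artist.
toy = pd.DataFrame(
    {"a": [0.2, np.nan, 0.5], "b": [np.nan, np.nan, 0.1], "c": [0.8, 1.0, np.nan]}
)

# notna() is True where a share is present; astype(int) turns it into 1/0.
toy_binary = toy.notna().astype(int)

# The np.where construction used in the notebook gives the same result.
toy_binary_np = pd.DataFrame(np.where(pd.isna(toy), 0, 1), columns=toy.columns)

print(toy_binary)
```

Either form discards how much a user listened and keeps only whether they listened at all, which is what the Jaccard measure needs.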


Compute pairwise Jaccard distances between artists.

In [None]:
from sklearn.metrics import pairwise_distances

jaccard_dist = pairwise_distances(ratings.values.T, metric="jaccard")

# Convert the distance into a similarity, then wrap it in a DataFrame
jaccard_sim = 1 - jaccard_dist
jaccard_df = pd.DataFrame(jaccard_sim, index=ratings.columns, columns=ratings.columns)
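For two artists, the Jaccard similarity is the number of users who listened to both, divided by the number of users who listened to at least one of them. The sketch below checks this by hand on a made-up 4 x 2 binary matrix (the `toy` array is an assumption for illustration) and compares the result with `pairwise_distances`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical binary matrix: 4 users (rows) x 2 artists (columns).
toy = np.array([[1, 1],
                [1, 0],
                [0, 1],
                [1, 1]], dtype=bool)

a, b = toy[:, 0], toy[:, 1]

# Users who played both artists over users who played either one.
manual_sim = (a & b).sum() / (a | b).sum()

# scikit-learn returns the distance (1 - similarity) between rows,
# so the matrix is transposed to compare artists instead of users.
sk_sim = 1 - pairwise_distances(toy.T, metric="jaccard")[0, 1]

print(manual_sim, sk_sim)
```

Two of the four users played both artists and all four played at least one, so both values should agree.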



Find the artists closest to a chosen artist by the Jaccard measure.

In [None]:
# Print similarities to the chosen artist, most similar first
print(jaccard_df["the beatles"].sort_values(ascending=False))

the beatles       1.000000
radiohead         0.298261
dylan. bob        0.232994
led zeppelin.     0.216245
pink fluid        0.212553
                    ...   
agalloch          0.002334
kamelot           0.002323
ross, rick        0.002322
chamillionaire    0.001747
t-pain            0.001743
Name: the beatles, Length: 1000, dtype: float64


In [None]:
print(jaccard_df["radiohead"].sort_values(ascending=False))

radiohead             1.000000
the beatles           0.298261
modest mouse          0.245336
deathcab for cutie    0.215933
beck                  0.214640
                        ...   
destinys child        0.000696
static-x              0.000692
iced eatrth           0.000691
t-pain                0.000000
carrie underwood      0.000000
Name: radiohead, Length: 1000, dtype: float64


In [None]:
print(jaccard_df["metallica"].sort_values(ascending=False))

metallica           1.000000
megadeth            0.224299
iron maiden         0.203285
system of a down    0.185249
gunsnroses          0.158537
                      ...   
fergie              0.000000
eluvium             0.000000
josh ritter         0.000000
the thermals        0.000000
les savy fav        0.000000
Name: metallica, Length: 1000, dtype: float64


In [None]:
print(jaccard_df["kanye west"].sort_values(ascending=False))

kanye west                   1.000000
jay-z                        0.253012
lil' wayne                   0.234211
lupe the gorilla             0.220170
t.i.                         0.187234
                               ...   
kmfdm                        0.000000
siouxsie and the banshees    0.000000
mötley crüe                  0.000000
the cramps                   0.000000
skinny puppy                 0.000000
Name: kanye west, Length: 1000, dtype: float64


# Approach 2: Clustering artists

In [None]:
ratings = pd.read_excel("https://github.com/evgpat/edu_stepik_rec_sys/blob/main/datasets/sample_matrix.xlsx?raw=true", engine='openpyxl')

Transpose the `ratings` matrix so that rows correspond to artists.

In [None]:
ratings = ratings.T

In [None]:
ratings.shape

(1001, 5000)

Drop the row named `user`.

In [None]:
ratings.drop('user', axis=0, inplace=True)

Fill the missing values with zeros.

In [None]:
ratings.fillna(0, inplace=True)

Normalize the data with `normalize`.

`normalize(X, norm='l2', axis=1)` rescales each row (or each column when `axis=0`) so that its norm equals 1.

In [None]:
from sklearn.preprocessing import normalize

ratings_new = normalize(ratings)
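A small sketch of what row-wise L2 normalization does, using a made-up 2 x 2 matrix (the `toy` array is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical 2 x 2 matrix; rows play the role of artists.
toy = np.array([[3.0, 4.0],
                [0.0, 2.0]])

toy_unit = normalize(toy)  # defaults: norm="l2", axis=1 (row-wise)

# Each row is divided by its Euclidean length: [3, 4] / 5 and [0, 2] / 2.
row_norms = np.linalg.norm(toy_unit, axis=1)

print(toy_unit)
print(row_norms)
```

Once every row has unit length, the squared Euclidean distance between two rows equals 2 minus twice their cosine similarity, so KMeans on the normalized matrix groups artists by the shape of their audience rather than by overall popularity.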

Apply KMeans with 5 clusters to the normalized matrix (call `fit`, then compute the cluster assignments with `predict`).

In [None]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5)
km.fit(ratings_new)  # reuse the normalized matrix computed above
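The `fit` / `predict` / `cluster_centers_` workflow can be tried on a tiny example first. This is a sketch on made-up points; the `toy` array and the fixed `random_state` and `n_init` values are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy points forming two well-separated groups.
toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])

# random_state and n_init are fixed here only to make the sketch repeatable.
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_km.fit(toy)

toy_labels = toy_km.predict(toy)       # cluster index for each row
toy_centers = toy_km.cluster_centers_  # one centroid per cluster

print(toy_labels, toy_centers.shape)
```

Cluster numbers are arbitrary labels: without a fixed `random_state`, the same group of artists may receive a different index on each run.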

Get the cluster centers (centroids).

In [None]:
centroids = km.cluster_centers_

In [None]:
clusters = km.predict(ratings_new)

clusters

array([4, 4, 4, 4, 4, 4, 4, 3, 3, 1, 3, 4, 3, 4, 4, 4, 3, 2, 4, 3, 3, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4, 4, 4, 3, 0,
       1, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4,
       4, 4, 0, 3, 4, 4, 4, 1, 2, 1, 4, 4, 4, 4, 3, 0, 4, 3, 3, 4, 3, 4,
       4, 4, 4, 4, 4, 3, 4, 1, 1, 3, 4, 4, 4, 4, 4, 4, 3, 3, 1, 4, 3, 4,
       3, 4, 2, 3, 1, 2, 4, 3, 4, 1, 2, 3, 0, 4, 4, 2, 4, 3, 1, 3, 2, 2,
       4, 1, 4, 4, 4, 1, 4, 2, 4, 3, 4, 0, 2, 4, 0, 4, 4, 4, 3, 1, 3, 1,
       0, 1, 4, 0, 4, 1, 4, 3, 4, 3, 3, 3, 0, 1, 4, 3, 3, 4, 1, 3, 2, 1,
       4, 4, 2, 4, 4, 4, 3, 4, 0, 3, 2, 0, 3, 3, 1, 4, 4, 3, 4, 0, 4, 2,
       3, 4, 4, 2, 4, 4, 2, 3, 3, 1, 3, 3, 3, 4, 4, 4, 4, 3, 3, 4, 4, 0,
       1, 3, 0, 4, 4, 4, 3, 4, 3, 4, 3, 3, 3, 4, 4, 4, 3, 4, 3, 1, 4, 3,
       3, 2, 3, 4, 2, 4, 4, 4, 4, 2, 3, 3, 2, 3, 4, 4, 4, 4, 2, 3, 1, 3,
       2, 4, 0, 4, 0, 0, 1, 3, 3, 4, 3, 4, 2, 3, 0, 3, 4, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 0, 3, 4, 1, 4, 2, 4, 2, 1, 4, 4,

In [None]:
Clusters = []

for i in range(5):
  Clusters.append(ratings.iloc[clusters == i])

In [None]:
Clusters[0].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
brand new,0.0,0.0,0.03475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.188972,0.0,0.0,0.0,0.0,0.0,0.02279,0.0
jimmy eat world,0.0,0.0,0.017732,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
explosions in the sky,0.0,0.0,0.0,0.0,0.017101,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
the misfits,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033644,0.0,0.0,0.0,0.0,0.0,0.0
alkaline trio,0.0,0.0,0.01121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.012404,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
centroids[0]

array([ 8.67361738e-19,  2.59242589e-04,  1.83390710e-03, ...,
       -2.38524478e-18,  1.15444826e-02,  1.49242472e-03])

For each cluster, we will find the 10 artists closest to that cluster's centroid.

Similarity between artists will be measured with the cosine measure.

Compute the cosine distance between "the beatles" and "coldplay" (rows 0 and 3 of `ratings_new`). Round the answer to two decimal places.


In [None]:
from scipy import spatial

spatial.distance.cosine(ratings_new[0], ratings_new[3])

0.8955669648278295
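Cosine distance is one minus the cosine of the angle between two vectors. The sketch below computes it by hand for two made-up vectors (`u` and `v` are illustrative assumptions) and compares the result with `scipy.spatial.distance.cosine`.

```python
import numpy as np
from scipy import spatial

# Hypothetical listening vectors for two artists over three users.
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])

# Cosine distance = 1 - (u . v) / (|u| * |v|).
manual = 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
scipy_value = spatial.distance.cosine(u, v)

print(round(manual, 2), round(scipy_value, 2))
```

Because the formula divides by both lengths, the distance is the same whether it is computed on the raw rows or on the normalized ones.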

For each cluster, print the 10 artists closest to that cluster's centroid.

In [None]:
from scipy import spatial

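def pClosest(points, pt, K=10):
    # Sort row indices of `points` by cosine distance to `pt`, nearest first
    ind = [i[0] for i in sorted(enumerate(points), key=lambda x: spatial.distance.cosine(x[1], pt))]

    return ind[:K]

# Note: this ranks the members of cluster 0 against the centroid of cluster 2
res = pClosest(Clusters[0].values, centroids[2])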
res

[1, 26, 11, 2, 0, 39, 5, 10, 54, 6]
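The same ranking can be computed without a Python-level sort key. This is a sketch of one possible vectorized variant, not the notebook's own method; the helper name `closest_rows` and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def closest_rows(points, pt, k=10):
    """Return indices of the k rows of `points` nearest to `pt` by cosine distance."""
    # cosine_distances expects 2-D inputs, so the single vector becomes shape (1, n).
    dist = cosine_distances(points, pt.reshape(1, -1)).ravel()
    # A stable sort keeps the original order among ties, as sorted() does.
    return np.argsort(dist, kind="stable")[:k].tolist()

# Hypothetical toy data: three rows compared with one reference vector.
toy_points = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
toy_center = np.array([1.0, 0.1])

toy_top = closest_rows(toy_points, toy_center, k=2)
print(toy_top)
```

Computing all distances in one call avoids evaluating the distance function once per row, which matters more as the number of artists grows.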

In [None]:
for i in range(5):
    res = pClosest(Clusters[i].values, centroids[i])
    print(Clusters[i].iloc[res].index)

Index(['against me!', 'alkaline trio', 'brand new', 'descendents',
       'the lawrence arms', 'saves the day', 'thursday', 'minus the bear',
       'hot water music', 'thrice'],
      dtype='object')
Index(['nas', 'jay-z', 'a tribe called quest', 'the roots featuring d'angelo',
       'kanye west', 'gangstarr', 'lupe the gorilla', 'de la soul', 'mos def',
       'murs and 9th wonder'],
      dtype='object')
Index(['janet jackson.', 'rihanna & jay-z', 'the pussycat dolls',
       'mariah carey', 'alicia keys', 'destinys child', 'beyoncé',
       'kelly clarkson', 'justin timberlake', 'mary j. blige'],
      dtype='object')
Index(['fall out boy', 'system of a down', 'blink-182', '‌linkin park',
       'paramore', 'metallica', 'the used', 'the all-americian rejects',
       'koЯn', 'taking back sunday'],
      dtype='object')
Index(['radiohead', 'sufjan stevens', 'the arcade fire', 'the shins',
       'belle and sebastian', 'the beatles', 'deathcab for cutie',
       'broken social scene

What can be said about the meaning of the clusters?

In [None]:
# indie rock
# rock
# punk rock
# hip-hop
# pop-punk