# Introduction

In this assignment you will continue working with the data from the seminar, [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop). If you do not have a Kaggle account, you can download the dataset [here](https://drive.google.com/file/d/1rLSr49zx6RPZIn7PV_LQr9KnnpPhrr0K/view?usp=sharing).

# Data loading and preprocessing

In [1]:
import math
import numpy as np
import pandas as pd

Let's load the data and preprocess it the same way as in the seminar.

In [2]:
articles_df = pd.read_csv("articles/shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [3]:
interactions_df = pd.read_csv("articles/users_interactions.csv")
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [4]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [5]:
# define a dictionary that maps each event type to an interaction strength
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(
    lambda x: event_type_strength[x]
)

We keep only the users who interacted with at least five distinct articles.

In [6]:
users_interactions_count_df = (
    interactions_df.groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId")
    .size()
)
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = users_interactions_count_df[
    users_interactions_count_df >= 5
].reset_index()[["personId"]]
print("# users with at least 5 interactions:", len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


We keep only the interactions that belong to the selected users.

In [7]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df["personId"])
]

In [8]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

# interactions before: (72312, 9)
# interactions after: (69868, 9)


We sum all of a user's interactions with each article and smooth the result by taking its logarithm.

In [9]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)


interactions_full_df = (
    interactions_from_selected_users_df.groupby(["personId", "contentId"])
    .eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index()
    .set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = interactions_from_selected_users_df.groupby(
    ["personId", "contentId"]
)["timestamp"].last()

interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
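To make the smoothing step concrete, here is a small sketch that applies the same `log2(1 + x)` transform to a few summed strengths. The example sums are illustrative and are not taken from the dataset.

```python
import math

event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}


def smooth_user_preference(x):
    # Same transform as in the notebook: log base 2 of (1 + total strength)
    return math.log(1 + x, 2)


# A single VIEW: total strength 1.0
single_view = smooth_user_preference(event_type_strength["VIEW"])

# A VIEW followed by a LIKE on the same article: total strength 3.0
view_and_like = smooth_user_preference(
    event_type_strength["VIEW"] + event_type_strength["LIKE"]
)

# Ten VIEWs of the same article: the logarithm damps repeated events
ten_views = smooth_user_preference(10 * event_type_strength["VIEW"])

print(single_view, view_and_like, ten_views)
```

The logarithm keeps heavy repeat activity on one article from dominating the preference signal: ten views score well below ten times a single view.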


In [10]:
interactions_df

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry,eventStrength
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,,1.0
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US,1.0
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,,1.0
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,,3.0
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,,1.0
...,...,...,...,...,...,...,...,...,...
72307,1485190425,LIKE,-6590819806697898649,-9016528795238256703,8614469745607949425,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,MG,BR,2.0
72308,1485190425,VIEW,-5813211845057621660,102305705598210278,5527770709392883642,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,SP,BR,1.0
72309,1485190072,VIEW,-1999468346928419252,-9196668942822132778,-8300596454915870873,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,SP,BR,1.0
72310,1485190434,VIEW,-6590819806697898649,-9016528795238256703,8614469745607949425,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,MG,BR,1.0


In [11]:
# Drop the articles that we will not use
articles_df = articles_df[articles_df['contentId'].isin(interactions_full_df['contentId'].unique())]

Let's get rid of the negative IDs by remapping them to consecutive integers.

In [12]:
# Dictionary: key = old id, value = new id (users)
user_dict = dict()
for i, user_id in enumerate(interactions_full_df['personId'].unique()):
    user_dict[user_id] = i

# Dictionary: key = old id, value = new id (content)
content_dict = dict()
for i, content_id in enumerate(interactions_full_df['contentId'].unique()):
    content_dict[content_id] = i

In [13]:
# Replace the old ids with the new ones in the interactions and the articles
interactions_full_df['personId'] = [user_dict[user_id_old] for user_id_old in interactions_full_df['personId']]
interactions_full_df['contentId'] = [content_dict[content_id_old] for content_id_old in interactions_full_df['contentId']]

articles_df['contentId'] = [content_dict[content_id_old] for content_id_old in articles_df['contentId']]

In [14]:
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,2827,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,2969,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [15]:
interactions_full_df.head(2)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,0,0,1.0,1470395911
1,0,1,1.0,1487240080


Let's split the data into train and test sets by time.

In [16]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp < split_ts
].copy()

interactions_test_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp >= split_ts
].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,0,0,1.0,1470395911
2,0,2,1.0,1472834892
6,1,6,1.0,1469129122
7,1,7,1.0,1459376415
8,1,8,2.0,1464054093
...,...,...,...,...
39099,1138,2289,2.0,1472479493
39100,1139,1645,1.0,1474567164
39101,1139,107,1.0,1474567449
39103,1139,1711,1.0,1474567675


To make quality evaluation easier, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [17]:
interactions = (
    interactions_train_df.groupby("personId")["contentId"]
    .agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = interactions_test_df.groupby("personId")["contentId"].agg(
    lambda x: list(x)
)

# fill missing values with empty lists
interactions["true_test"] = interactions["true_test"].apply(
    lambda x: x if isinstance(x, list) else []
)

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[0, 2]","[1, 3, 4, 5]"


# The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. To evaluate recommendation quality we will use the *precision@10* metric, as in the seminar.

In [14]:
# !pip install lightfm

In [18]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k



## Task 1 (1.5 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [19]:
from scipy.sparse import csr_matrix

In [37]:
# Collect the ids of all items and all users
all_items = interactions_full_df['contentId'].unique()
all_users = interactions_full_df['personId'].unique()

# Long-format Cartesian product: every user paired with every item
U_x_I = pd.DataFrame(all_users, columns=['personId']).merge(
    pd.DataFrame(all_items, columns=['contentId']),
    how='cross',
)

# Full train and test interaction tables (0 where there was no interaction)
full_interactions_train = U_x_I.merge(
    interactions_train_df.drop('last_timestamp', axis=1),
    on=['personId', 'contentId'],
    how='left',
).fillna(0)

full_interactions_test = U_x_I.merge(
    interactions_test_df.drop('last_timestamp', axis=1),
    on=['personId', 'contentId'],
    how='left',
).fillna(0)

# Sparse copies of the long-format tables, so that only non-zero values are stored
data_train_full = csr_matrix(full_interactions_train)
data_test_full = csr_matrix(full_interactions_test)

# Train and test pivot tables: users in rows, articles in columns
full_interactions_train_pivot = full_interactions_train.pivot_table(
    index='personId',
    columns='contentId',
    values='eventStrength',
    aggfunc='max',
).fillna(0)

full_interactions_test_pivot = full_interactions_test.pivot_table(
    index='personId',
    columns='contentId',
    values='eventStrength',
    aggfunc='max',
).fillna(0)

# Store the pivot tables as sparse matrices, so that only non-zero values are kept
data_train = csr_matrix(full_interactions_train_pivot)
data_test = csr_matrix(full_interactions_test_pivot)
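The same matrices can also be built without materializing the full user-by-item table. The sketch below is an alternative, not the approach used in this notebook: it passes `(value, (row, col))` triplets straight to `csr_matrix`. The helper name `to_sparse` and the toy frame are assumptions made for illustration; the column names match `interactions_train_df`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def to_sparse(df, n_users, n_items):
    """Build an (n_users, n_items) CSR matrix from long-format interactions."""
    values = df["eventStrength"].to_numpy(dtype=np.float32)
    rows = df["personId"].to_numpy()
    cols = df["contentId"].to_numpy()
    # Duplicate (row, col) pairs would be summed; the notebook's data has one row per pair
    return csr_matrix((values, (rows, cols)), shape=(n_users, n_items))


# Toy data standing in for interactions_train_df
toy = pd.DataFrame(
    {
        "personId": [0, 0, 1],
        "contentId": [0, 2, 1],
        "eventStrength": [1.0, 1.0, 2.0],
    }
)
toy_matrix = to_sparse(toy, n_users=2, n_items=3)
print(toy_matrix.toarray())
```

With the notebook's variables, the call would look like `to_sparse(interactions_train_df, len(all_users), len(all_items))`. Memory then scales with the number of observed interactions rather than with users times items.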

In [43]:
full_interactions_train_pivot

contentId,0,1,2,3,4,5,6,7,8,9,...,2974,2975,2976,2977,2978,2979,2980,2981,2982,2983
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
full_interactions_test_pivot

contentId,0,1,2,3,4,5,6,7,8,9,...,2974,2975,2976,2977,2978,2979,2980,2981,2982,2983
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,1.0,0.0,1.0,1.584963,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1136,0.0,1.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1137,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1138,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Task 2 (0.5 points)

Train a LightFM model with `loss="warp"` and compute *precision@10* on the test set.

In [39]:
# Train the model on the matrix built from the pivot table
l_fm_mvp = LightFM(loss="warp", random_state=42, learning_rate=0.05)
l_fm_mvp.fit(data_train, epochs=25, num_threads=3)

<lightfm.lightfm.LightFM at 0x1882ed990>

In [40]:
# Visualize the sparse train matrix
dense_matrix = data_train.toarray()
pd.DataFrame(dense_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2974,2975,2976,2977,2978,2979,2980,2981,2982,2983
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
# Visualize the sparse test matrix
dense_matrix = data_test.toarray()
pd.DataFrame(dense_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2974,2975,2976,2977,2978,2979,2980,2981,2982,2983
0,0.0,1.0,0.0,1.0,1.584963,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1136,0.0,1.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1137,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1138,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# Evaluate precision@10
preds = precision_at_k(l_fm_mvp, data_train, k=10)
print(f'Train: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_mvp, data_test, k=10)
print(f'Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_mvp, data_test, k=10, train_interactions=data_train)
print(f'Without previously evaluated items:\n Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}')


Train: cnt users = 1112, mean: 0.23228415846824646

Test: cnt users = 982, mean: 0.004480652045458555

Without previously evaluated items:
 Test: cnt users = 982, mean: 0.006415478885173798
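The third number differs from the second because passing `train_interactions` removes items already seen in training from the ranking before the top k are taken. The pure-Python sketch below illustrates that idea for a single user. The function name and toy scores are hypothetical, and this is not LightFM's implementation.

```python
def precision_at_k_for_user(scores, test_items, k=10, train_items=()):
    """precision@k for one user; scores[i] is the model score of item i."""
    excluded = set(train_items)
    # Candidate items: everything the user has not already seen in training
    candidates = [i for i in range(len(scores)) if i not in excluded]
    # Rank the candidates by score and keep the k best
    top_k = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    hits = len(set(top_k) & set(test_items))
    return hits / k


toy_scores = [0.9, 0.8, 0.7, 0.1]  # item 0 is a training item, item 2 is a test item

# Training item 0 occupies a slot in the top 2, so the test item is missed
plain = precision_at_k_for_user(toy_scores, test_items=[2], k=2)

# With item 0 excluded, the top 2 becomes [1, 2] and the test item is found
masked = precision_at_k_for_user(toy_scores, test_items=[2], k=2, train_items=[0])

print(plain, masked)
```

A model that fits the training data well tends to rank training items first, which is why masking them can only help a time-based test split like this one.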


## Task 3 (2 points)

When calling `fit`, LightFM lets you pass a feature description of the items through `item_features`. Let's use this. We will derive the feature description from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by feature dimension, train LightFM with `loss="warp"`, and compute precision@10 on the test set.

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
# initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

In [47]:
feat = vectorizer.fit_transform(articles_df['text'])

feat

<2976x71875 sparse matrix of type '<class 'numpy.float64'>'
	with 1041129 stored elements in Compressed Sparse Row format>
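The LightFM documentation describes `item_features` as a matrix of shape `[n_items, n_item_features]`, with row `j` describing item `j`. Here `feat` has 2976 rows in `articles_df` order, while the interaction matrices have 2984 item columns. One way to reconcile them is to scatter the TF-IDF rows into a matrix indexed by `contentId`. The sketch below does this; the helper name `align_item_features` and the toy data are assumptions for illustration. The metrics reported in this notebook were obtained with `feat.T` instead, so a run with an aligned matrix would give different numbers.

```python
import numpy as np
from scipy.sparse import csr_matrix


def align_item_features(feat, content_ids, n_items):
    """Return an (n_items, n_features) CSR matrix whose row j holds the features of item j.

    feat        : (n_rows, n_features) matrix, row r describes the item content_ids[r]
    content_ids : integer item index for every row of feat (assumed unique)
    n_items     : number of item columns in the interaction matrix
    """
    feat = csr_matrix(feat)
    content_ids = np.asarray(content_ids)
    n_rows = feat.shape[0]
    # Selection matrix S with S[content_ids[r], r] = 1, so (S @ feat)[j] is the row of item j
    selector = csr_matrix(
        (np.ones(n_rows, dtype=feat.dtype), (content_ids, np.arange(n_rows))),
        shape=(n_items, n_rows),
    )
    # Items without any text end up with an all-zero feature row
    return (selector @ feat).tocsr()


# Toy example: two described articles with item indices 2 and 0, three items in total
toy_feat = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
toy_aligned = align_item_features(toy_feat, content_ids=[2, 0], n_items=3)
print(toy_aligned.toarray())
```

With the notebook's variables, the call would be `align_item_features(feat, articles_df['contentId'].to_numpy(), n_items=data_train.shape[1])`, and the result would be passed as `item_features` without transposing.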

In [48]:
# Visualize the sparse TF-IDF matrix
dense_matrix = feat.toarray()
pd.DataFrame(dense_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,71865,71866,71867,71868,71869,71870,71871,71872,71873,71874
0,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.04929,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.00000,0.032992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2971,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2972,0.00000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2973,0.00000,0.081980,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2974,0.00000,0.059100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
# Visualize the sparse train matrix
dense_matrix = data_train.toarray()
pd.DataFrame(dense_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2974,2975,2976,2977,2978,2979,2980,2981,2982,2983
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
# Train the model on the pivot-based matrix, adding item features
l_fm_item_data = LightFM(loss="warp", random_state=42, learning_rate=0.05)
l_fm_item_data.fit(data_train, item_features=feat.T, epochs=5, num_threads=2)

<lightfm.lightfm.LightFM at 0x1882eeb90>

In [51]:
# Evaluate precision@10
preds = precision_at_k(l_fm_item_data, data_train, k=10, item_features=feat.T)
print(f'Train: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_item_data, data_test, k=10, item_features=feat.T)
print(f'Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_item_data, data_test, k=10, train_interactions=data_train, item_features=feat.T)
print(f'Without previously evaluated items:\n Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}')

Train: cnt users = 1112, mean: 0.010881295427680016

Test: cnt users = 982, mean: 0.009063135832548141

Without previously evaluated items:
 Test: cnt users = 982, mean: 0.009164969436824322


Quality improved: test precision@10 rose from 0.0045 to 0.0091. Excluding the items a user has already seen raises it slightly further, to 0.0092.

## Task 4 (1.5 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert it to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data.

In [25]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/dan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/dan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/dan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
def preprocess_text(text:str) -> str:
    '''
    Takes a text as input,
        converts it to lowercase,
        removes stop words,
        reduces words to their normal form (lemmatization)
    '''
    # Convert to lowercase
    text = text.lower()
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    
    # Normal form (lemmatization)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return ' '.join(tokens)


In [27]:
# Example of text preprocessing
txt = articles_df['text'][1]
print(f'Before preprocessing: \n {txt}')

txt = preprocess_text(articles_df['text'][1])
print(f'After preprocessing: \n {txt}')

Before preprocessing: 
 All of this work is still very early. The first full public version of the Ethereum software was recently released, and the system could face some of the same technical and legal problems that have tarnished Bitcoin. Many Bitcoin advocates say Ethereum will face more security problems than Bitcoin because of the greater complexity of the software. Thus far, Ethereum has faced much less testing, and many fewer attacks, than Bitcoin. The novel design of Ethereum may also invite intense scrutiny by authorities given that potentially fraudulent contracts, like the Ponzi schemes, can be written directly into the Ethereum system. But the sophisticated capabilities of the system have made it fascinating to some executives in corporate America. IBM said last year that it was experimenting with Ethereum as a way to control real world objects in the so-called Internet of things. Microsoft has been working on several projects that make it easier to use Ethereum on its computing cl

In [28]:
# Preprocessed article texts
articles_df['preprocessed_text'] = [preprocess_text(txt) for txt in articles_df['text']]

___TfidfVectorizer___

In [29]:
# initialize TfidfVectorizer
vectorizer_v2 = TfidfVectorizer()

feat_modify = vectorizer_v2.fit_transform(articles_df['preprocessed_text'])

__LightFM__

In [30]:
# Train the model on the pivot-based matrix, adding item features
l_fm_item_data_mdf = LightFM(loss="warp", random_state=42, learning_rate=0.05)
l_fm_item_data_mdf.fit(data_train, item_features=feat_modify.T, epochs=22, num_threads=6)

<lightfm.lightfm.LightFM at 0x18a351310>

In [31]:
# Evaluate precision@10
preds = precision_at_k(l_fm_item_data_mdf, data_train, k=10, item_features=feat_modify.T)
print(f'Train: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_item_data_mdf, data_test, k=10, item_features=feat_modify.T)
print(f'Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}\n')

preds = precision_at_k(l_fm_item_data_mdf, data_test, k=10, train_interactions=data_train, item_features=feat_modify.T)
print(f'Without previously evaluated items:\n Test: cnt users = {preds.shape[0]}, mean: {preds.mean()}')

Train: cnt users = 1112, mean: 0.007284172810614109

Test: cnt users = 982, mean: 0.008044807240366936

Without previously evaluated items:
 Test: cnt users = 982, mean: 0.008248472586274147


Did the prediction quality improve?

No, the quality dropped slightly, 0.0091 -> 0.0082 (I tried tuning `epochs` and `num_threads`).

Interestingly, the model scores higher on the test data than on the training data. Note also that, compared with the first model (`l_fm_mvp`), quality on the training set dropped substantially, 0.23 -> 0.007.

## Task 5 (1.5 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve its quality.

I borrowed the implementation from here:
https://stackoverflow.com/questions/49896816/how-do-i-optimize-the-hyperparameters-of-lightfm

In [32]:
from tqdm import tqdm

In [34]:
loss = 'warp'

# Five random values of the L2 penalty on item features
item_alpha_list = [np.random.exponential(1e-8) for _ in range(5)]

score_dict = dict()

for no_components in tqdm(np.arange(1, 64, 5)):
    for learning_schedule in ["adagrad", "adadelta"]:
        for item_alpha in item_alpha_list:
            for num_epochs in np.arange(1, 50, 7):
                for num_threads in np.arange(1, 10, 5):

                    # Initialize the model
                    model_iter = LightFM(
                        no_components=no_components,
                        learning_schedule=learning_schedule,
                        item_alpha=item_alpha,
                        random_state=42,
                    )

                    # Train the model
                    model_iter.fit(
                        data_train,
                        item_features=feat_modify.T,
                        epochs=num_epochs,
                        num_threads=num_threads,
                    )

                    # Mean precision@10 on the test set, excluding training items
                    preds = precision_at_k(
                        model_iter,
                        data_test,
                        k=10,
                        train_interactions=data_train,
                        item_features=feat_modify.T,
                    ).mean()

                    txt = f'no_components={no_components}, learning_schedule={learning_schedule}, item_alpha={item_alpha}, epochs={num_epochs}, num_threads={num_threads}'

                    score_dict[txt] = preds

                  
            

100%|██████████| 13/13 [6:18:10<00:00, 1745.41s/it]


P.S. The search itself was much faster than the progress bar suggests: the loop was paused for several hours because the laptop went to sleep.

In [36]:
for key in score_dict.keys():
    mean = score_dict[key].mean()
    score_dict[key] = mean

In [41]:
sorted_dict = dict(sorted(score_dict.items(), key=lambda item: item[1], reverse=True))

sorted_dict


{'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=1, num_threads=1': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=1, num_threads=6': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=8, num_threads=1': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=8, num_threads=6': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=15, num_threads=1': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=15, num_threads=6': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=22, num_threads=1': 0.007942974,
 'no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=22, num_threads=6': 0.007942974,
 'no_components=1, learning_schedule

Hyperparameter tuning did not improve the metric; the best value is 0.007942974, obtained with the following hyperparameters:

no_components=1, learning_schedule=adadelta, item_alpha=4.341979600501513e-09, epochs=1, num_threads=1
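A full grid grows multiplicatively with every added parameter, so a random search over the same space is a common cheaper alternative. The sketch below is generic and runs without LightFM. `random_search`, `param_space` and `toy_evaluate` are hypothetical names, and the toy objective merely stands in for "fit a model and return mean precision@10". Two points are worth keeping in mind for a real `evaluate`: `LightFM` uses `loss='logistic'` unless `loss` is passed explicitly, and `num_threads` affects speed rather than model quality, so it does not need to be searched.

```python
import random


def random_search(evaluate, param_space, n_trials=20, seed=42):
    """Sample n_trials random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


param_space = {
    "no_components": [8, 16, 32, 64],
    "learning_schedule": ["adagrad", "adadelta"],
    "item_alpha": [0.0, 1e-8, 1e-6],
    "epochs": [5, 15, 30],
}


def toy_evaluate(params):
    # Placeholder objective with a known optimum at no_components=32, epochs=15.
    # A real version would fit LightFM(loss="warp", no_components=..., ...) on the
    # training matrix and return precision_at_k(...).mean() on the test matrix.
    return -abs(params["no_components"] - 32) - abs(params["epochs"] - 15)


best_params, best_score = random_search(toy_evaluate, param_space, n_trials=200)
print(best_params, best_score)
```

Returning the parameters as a dictionary, rather than formatting them into a string key, makes it easy to refit the best configuration afterwards.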

## Task 6 (1 point)

Implement functions that compute the following metrics:
* precision@k
* recall@k
* NDCG@k



In [217]:
from typing import List
import warnings
warnings.filterwarnings('ignore')

In [100]:
# Dense DataFrame copy of the TF-IDF matrix
feat_matrix = feat.toarray()
feat_matrix = pd.DataFrame(feat_matrix)

In [237]:
def precision_k(
    model,
    user_idx: int,
    item_set: List,
    user_features: pd.DataFrame = None,
    item_features: pd.DataFrame = None,
    k: int = 10,
    y_true: pd.DataFrame = interactions[['true_test']],
):
    '''
    Computes precision@k for one user. The function also obtains the predictions
    from the model, which is passed in as an argument.

    PARAMS

    :model: recommender model (LightFM)
    :user_idx: user to predict for
    :item_set: items to score
    :user_features: user feature matrix (optional)
    :item_features: item feature matrix with one row per article (optional)
    :k: number of top recommendations to evaluate
    :y_true: ground-truth items for every user
    '''

    # If user features are given, convert them to the sparse format LightFM expects
    if user_features is not None:
        user_features = csr_matrix(user_features)

    # If item features are given, convert them and transpose them
    # to match the `feat.T` layout that the models above were fitted with
    if item_features is not None:
        item_features = csr_matrix(item_features).T.tocsr()

    # Get the model scores for every item in item_set
    preds = model.predict(
        user_ids=int(user_idx),
        item_ids=np.asarray(item_set, dtype=np.int32),
        item_features=item_features,
        user_features=user_features,
    )

    # Ground-truth items of this user
    true_items = np.array(y_true.loc[user_idx].iloc[0])

    # Sort by score and keep the ids of the top-k items
    preds_items = (
        pd.DataFrame({'item_set': item_set, 'score': preds})
        .sort_values('score', ascending=False)['item_set']
        .to_numpy()[:k]
    )

    # The metric is the size of the intersection divided by k
    metric_score = len(np.intersect1d(preds_items, true_items)) / k

    return metric_score

In [252]:
def recall_at_k(recom_items, true_items, k):
    '''
    Computes the recall@k metric

    PARAMS
    :recom_items: list of recommendations produced by the model
    :true_items: list of relevant (ground-truth) items
    :k: number of recommendations used to compute the metric

    '''

    # Count the relevant items among the top-k recommendations
    relevant_in_top_k = sum(item in true_items for item in recom_items[:k])
    
    # Recall is undefined for an empty ground-truth list; return 0.0 in that case
    recall = relevant_in_top_k / len(true_items) if len(true_items) > 0 else 0.0
    
    return recall
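The task also lists NDCG@k. A minimal sketch follows, assuming binary relevance (an item is relevant if it appears in `true_items`) and the usual `1 / log2(position + 1)` discount. The function names are illustrative and follow the signature of `recall_at_k`.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance values."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(recom_items, true_items, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    true_set = set(true_items)
    if not true_set:
        return 0.0
    # Gain 1 for every recommended item that is truly relevant, 0 otherwise
    gains = [1.0 if item in true_set else 0.0 for item in recom_items[:k]]
    # The ideal ranking places all relevant items (at most k of them) first
    ideal = [1.0] * min(len(true_set), k)
    return dcg_at_k(gains, k) / dcg_at_k(ideal, k)


perfect = ndcg_at_k([1, 2, 3], [1, 2, 3], k=3)
second_place = ndcg_at_k([9, 1], [1], k=2)
no_hits = ndcg_at_k([5, 6], [1], k=2)
print(perfect, second_place, no_hits)
```

Unlike precision@k and recall@k, NDCG@k depends on where in the top k the relevant items appear: a hit in second place scores `1 / log2(3)` instead of 1.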

## Task 7 (1 point)

Compute the values of the implemented metrics with $k=10$ for the best model obtained in the previous steps.

Find existing implementations of these metrics in the lightfm and sklearn libraries. Compare the values you obtained with the results of the built-in library metrics.

In [249]:
# My prediction routine does not reach even a precision of 0.009

# None of the predictions overlap with the truly relevant items

# Evaluate the first 950 users that have training interactions
for i in interactions.index[:950]:
    metric = precision_k(l_fm_item_data, i, all_items, item_features=feat_matrix)
    if metric > 0:
        print(metric)

## Task 8 (1 point)

Implement the ALS algorithm and apply it to the problem solved in this notebook.

**ALS**

So, the task is to build a latent factor model for collaborative filtering:

$$ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \to \min_{P,Q}$$

The sum runs over all pairs $(u, i)$ for which the rating $r_{ui}$ is known (and only over them). Here $p_u$ and $q_i$ are the latent representations of user $u$ and item $i$, respectively, and the matrices $P$ and $Q$ are formed by stacking the vectors $p_u$ and $q_i$ as columns.

The ALS (Alternating Least Squares) approach solves the problem by alternately fixing the matrices $P$ and $Q$: once one of the matrices is fixed, the problem for the other one has a closed-form solution.

$$\nabla_{p_u} \bigg[ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \bigg] = -\sum_{i} 2(r_{ui} - \langle p_u, q_i \rangle)q_i = 0$$

Using the fact that $a^Tbc = cb^Ta$, we obtain
$$\sum_{i} r_{ui}q_i - \sum_i q_i q_i^T p_u = 0.$$

Each column of the matrix $P$ can then be found by the formula
$$p_u = \bigg( \sum_i q_i q_i^T\bigg)^{-1}\sum_ir_{ui}q_i \;\; \forall u,$$

and similarly for the columns of the matrix $Q$
$$q_i = \bigg( \sum_u p_u p_u^T\bigg)^{-1}\sum_ur_{ui}p_u \;\; \forall i.$$

In this way we can solve the optimization problem by alternately fixing one of the matrices $P$ or $Q$ and optimizing over the other.
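The two update formulas translate almost directly into NumPy. The sketch below is one possible minimal implementation, not a reference one. It stores $p_u$ and $q_i$ as rows rather than columns, and it adds a small ridge term `reg` to the matrix being inverted so that the system stays solvable for users or items with few known ratings. That term is an assumption and is not part of the derivation above. The function name `als` and the toy matrix are illustrative.

```python
import numpy as np


def als(R, mask, n_factors=2, n_iters=50, reg=0.01, seed=0):
    """Alternating least squares on the observed entries of R.

    R    : (n_users, n_items) array of ratings
    mask : boolean array of the same shape, True where r_ui is known
    Returns P (n_users, n_factors) and Q (n_items, n_factors).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)  # assumption: small ridge term for stability

    for _ in range(n_iters):
        # Fix Q and solve for every p_u: (sum_i q_i q_i^T) p_u = sum_i r_ui q_i
        for u in range(n_users):
            Qu = Q[mask[u]]
            P[u] = np.linalg.solve(Qu.T @ Qu + ridge, Qu.T @ R[u, mask[u]])
        # Fix P and solve for every q_i: (sum_u p_u p_u^T) q_i = sum_u r_ui p_u
        for i in range(n_items):
            Pi = P[mask[:, i]]
            Q[i] = np.linalg.solve(Pi.T @ Pi + ridge, Pi.T @ R[mask[:, i], i])
    return P, Q


# Toy rank-1 ratings with two entries hidden from the model
R = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
mask = np.ones_like(R, dtype=bool)
mask[0, 2] = False
mask[3, 0] = False

P, Q = als(R, mask)
rmse_observed = float(np.sqrt(np.mean((R - P @ Q.T)[mask] ** 2)))
print(rmse_observed)
```

For the notebook's data, `R` would be the dense form of `data_train` and `mask` would be `R > 0`. Scores for ranking are then the rows of `P @ Q.T`.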

**The original paper that formulates ALS for explicit feedback:**

* Bell, R.M. and Koren, Y., 2007, October. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Seventh IEEE international conference on data mining (ICDM 2007) (pp. 43-52). IEEE.

**The original paper on ALS for implicit data, which became the better known of the two:**

* Hu, Y., Koren, Y. and Volinsky, C., 2008, December. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE international conference on data mining (pp. 263-272). IEEE.
