<center><img src="https://github.com/hse-ds/iad-applied-ds/blob/master/2021/hw/hw1/img/logo_hse.png?raw=1" width="1000"></center>

<h1><center>Applied data analysis tasks</center></h1>
<h2><center>Homework 4: Recommendation systems</center></h2>

# Introduction

In this assignment, you will continue to work with the data from the workshop [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

# Data loading and preprocessing

In [1]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm_notebook

Matplotlib is building the font cache; this may take a moment.


We will load and preprocess the data as in the seminar.

In [3]:
! pip install --user kaggle

Collecting kaggle
  Downloading kaggle-1.7.4.2-py3-none-any.whl.metadata (16 kB)
Downloading kaggle-1.7.4.2-py3-none-any.whl (173 kB)
Installing collected packages: kaggle
Successfully installed kaggle-1.7.4.2


In [11]:
# Colab-only helper; `files.upload()` below needs this import
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d gspmoreira/articles-sharing-reading-from-cit-deskdrop
!unzip articles-sharing-reading-from-cit-deskdrop.zip -d articles

In [None]:
articles_df = pd.read_csv("articles/shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

In [None]:
interactions_df = pd.read_csv("articles/users_interactions.csv")
interactions_df.head(2)

In [None]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [None]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
   "VIEW": 1.0,
   "LIKE": 2.0, 
   "BOOKMARK": 2.5, 
   "FOLLOW": 3.0,
   "COMMENT CREATED": 4.0,  
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

In [None]:
interactions_df

We keep only the users who have interacted with at least five articles.

In [None]:
interactions_df.shape

In [None]:
users_interactions_count_df = (
    interactions_df
    .groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId").size())
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[["personId"]]
print("# users with at least 5 interactions:",len(users_with_enough_interactions_df))

We keep only the interactions of the selected users.

In [None]:
# keep only the rows whose personId belongs to the selected users
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)]

In [None]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

For each user-article pair, we sum the strengths of all interactions and smooth the result by taking its logarithm.

In [None]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"]).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"])["timestamp"].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)
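
To make the effect of the smoothing concrete, here is a small standalone sketch. The weights are copied from `event_type_strength` above; the list of events for a single user-article pair is made up for illustration.

```python
import math

# Weights copied from event_type_strength above.
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

def smooth_user_preference(x):
    # log2(1 + x) grows slowly, so many repeated events do not dominate the score
    return math.log(1 + x, 2)

# Hypothetical events of one user on one article (not taken from the dataset).
events = ["VIEW", "VIEW", "LIKE", "COMMENT CREATED"]

# Sum of the weights: 1 + 1 + 2 + 4
total_strength = sum(event_type_strength[e] for e in events)
smoothed = smooth_user_preference(total_strength)

print(total_strength, round(smoothed, 3))
```

A single view maps to a score of 1, while the eight points accumulated here map to roughly 3.17, so heavy users of one article are compressed rather than scaled linearly.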

Let's split the data into training and test sets by time.

In [None]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

For the convenience of calculating the quality, we will write the data in a format where the row corresponds to the user, and the columns will be true labels and predictions in the form of lists.

In [None]:
interactions = (
    interactions_train_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = (
    interactions_test_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
)

# fill missing values (users with no test interactions) with empty lists
interactions["true_test"] = interactions["true_test"].apply(
    lambda x: x if isinstance(x, list) else [])

interactions.head(1)

# LightFM library

For recommendation, you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. For evaluating the quality of the recommendation, as in the seminar, we will use the *precision@10* metric.
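
As a reminder of what the metric measures, here is a minimal sketch of precision@k for a single user. The function name `precision_at_k_single` and the toy item ids are illustrative assumptions, not part of LightFM; the library's `precision_at_k` computes a per-user array that is then averaged with `.mean()`.

```python
def precision_at_k_single(recommended, relevant, k=10):
    """Share of the first k recommended items that appear in the relevant set."""
    relevant = set(relevant)
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# Hypothetical ranked recommendations and held-out test items for one user.
recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
relevant = ["b", "e", "k", "z"]

# "b" and "e" are inside the top 10; "k" is ranked 11th and does not count.
p10 = precision_at_k_single(recommended, relevant, k=10)
print(p10)
```

Because the denominator is always `k`, a user with only a few test articles cannot reach a high precision@10 even with a perfect ranking, which is worth keeping in mind when reading the absolute values.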

In [None]:
!pip install lightfm

In [None]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k
from scipy.sparse import csr_matrix
from scipy.sparse import coo_matrix

## Task 1 (2 points)

Models in LightFM work with sparse matrices. Create sparse matrices `data_train` and `data_test` (with dimensions equal to the number of users by the number of items), where the value at the intersection of a user row and an item column represents the strength of their interaction if there was one, and zero if there was no interaction.
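
One possible way to build such matrices is sketched below on toy data. The triples, the names `user_index`, `item_index`, and `to_sparse` are assumptions made for illustration. The key point is that a single user and item mapping is built over train and test together, so both matrices end up with the same shape.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (user, item, strength) triples standing in for the train/test frames.
train = [("u1", "a", 1.0), ("u1", "b", 2.0), ("u2", "a", 1.5)]
test = [("u2", "c", 1.0), ("u3", "b", 2.5)]

# One mapping over train and test together, so both matrices share a shape.
users = sorted({u for u, _, _ in train + test})
items = sorted({i for _, i, _ in train + test})
user_index = {u: n for n, u in enumerate(users)}
item_index = {i: n for n, i in enumerate(items)}

def to_sparse(triples):
    """Build a (n_users, n_items) CSR matrix; absent pairs are implicit zeros."""
    rows = np.array([user_index[u] for u, _, _ in triples])
    cols = np.array([item_index[i] for _, i, _ in triples])
    vals = np.array([s for _, _, s in triples], dtype=np.float32)
    return coo_matrix((vals, (rows, cols)), shape=(len(users), len(items))).tocsr()

toy_train = to_sparse(train)
toy_test = to_sparse(test)
print(toy_train.shape, toy_test.shape, toy_train.nnz, toy_test.nnz)
```

Only the observed interactions are stored, which is what makes this representation practical when the full user-by-item table is mostly zeros.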


In [None]:
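# Your code here
# Both matrices use the same axes (all users x all items) so that their shapes match.
all_users = np.sort(interactions_full_df['personId'].unique())
all_items = np.sort(interactions_full_df['contentId'].unique())
data_train = csr_matrix(pd.pivot_table(
    interactions_train_df,
    values='eventStrength',
    index='personId',
    columns='contentId').reindex(index=all_users, columns=all_items).fillna(0).to_numpy())
data_test = csr_matrix(pd.pivot_table(
    interactions_test_df,
    values='eventStrength',
    index='personId',
    columns='contentId').reindex(index=all_users, columns=all_items).fillna(0).to_numpy())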

## Task 2 (1 point)

Train a LightFM model with `loss="warp"` and compute *precision@10* on the test set.

In [None]:
# Your code here
# Shared axes (all users x all items); the training table is filled from training interactions only.
all_users = np.sort(interactions_full_df['personId'].unique())
all_items = np.sort(interactions_full_df['contentId'].unique())
data_train = pd.pivot_table(
    interactions_train_df,
    values='eventStrength',
    index='personId',
    columns='contentId').reindex(index=all_users, columns=all_items).fillna(0)
data_test = pd.pivot_table(
    interactions_test_df,
    values='eventStrength',
    index='personId',
    columns='contentId').reindex(index=all_users, columns=all_items).fillna(0)
print(data_train.shape)
print(data_test.shape)
# row/column positions (in the interaction matrix) of users and items that occur in the test set
users_needed = np.flatnonzero(np.isin(all_users, interactions_test_df['personId'].unique()))
items_needed = np.flatnonzero(np.isin(all_items, interactions_test_df['contentId'].unique()))
# every (test user, test item) pair, as two aligned index arrays
user_ids = np.repeat(users_needed, items_needed.shape[0])
item_ids = np.tile(items_needed, users_needed.shape[0])
# Your code here
train_csr = csr_matrix(data_train.to_numpy())
test_csr = csr_matrix(data_test.to_numpy())
model = LightFM(loss='warp')
model.fit(train_csr)
predictions = model.predict(user_ids, item_ids)
print(precision_at_k(model, train_csr, k=10).mean())
print(precision_at_k(model, test_csr, k=10).mean())
# same metric, with known training interactions excluded from the ranking
print(precision_at_k(model, test_csr, train_interactions=train_csr, k=10).mean())

In [None]:
# Build index mappings for users and articles
user_id_mapping = {id:i for i, id in enumerate(interactions_full_df['personId'].unique())}
item_id_mapping = {id:i for i, id in enumerate(interactions_full_df['contentId'].unique())}

# Apply them to the training and test sets
train_user_data = interactions_train_df['personId'].map(user_id_mapping)
train_item_data = interactions_train_df['contentId'].map(item_id_mapping)

test_user_data = interactions_test_df['personId'].map(user_id_mapping)
test_item_data = interactions_test_df['contentId'].map(item_id_mapping)

# Build the sparse interaction matrices
shape = (len(user_id_mapping), len(item_id_mapping))
train_matrix = coo_matrix((interactions_train_df['eventStrength'].values, (train_user_data.astype(int), train_item_data.astype(int))), shape=shape)
test_matrix = coo_matrix((interactions_test_df['eventStrength'].values, (test_user_data.astype(int), test_item_data.astype(int))), shape=shape)

# Create a LightFM model and train it
model = LightFM(loss='warp')
# %time runs the training once and reports the elapsed time
%time model.fit(train_matrix, epochs=30, num_threads=2)
# predictions = model.predict(test_user_data.values, test_item_data.values)

k = 10
print('Train precision at k={}:\t{:.4f}'.format(k, precision_at_k(model, train_matrix, k=k).mean()))
print('Test precision at k={}:\t\t{:.4f}'.format(k, precision_at_k(model, test_matrix, k=k).mean()))

In [None]:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit((int(x[1].personId) for x in interactions_full_df.iterrows()),
            (int(x[1].contentId) for x in interactions_full_df.iterrows()))

In [None]:
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))

In [None]:
(interactions_train, weights) = dataset.build_interactions((int(x[1].personId), int(x[1].contentId))
                                                      for x in interactions_train_df.iterrows())
print(repr(interactions_train))

In [None]:
(interactions_test, weights) = dataset.build_interactions((int(x[1].personId), int(x[1].contentId))
                                                      for x in interactions_test_df.iterrows())
print(repr(interactions_test))

In [None]:
from lightfm import LightFM

model = LightFM(loss='warp')
model.fit(interactions_train) #item_features=item_features)
# predictions = model.predict(user_ids, item_ids)  # what is the predict method for?!

In [None]:
k = 10
print('Train precision at k={}:\t{:.4f}'.format(k, precision_at_k(model, interactions_train, k=k).mean()))
print('Test precision at k={}:\t\t{:.4f}'.format(k, precision_at_k(model, interactions_test, k=k).mean()))

In [None]:
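# Leftover from the LightFM documentation example on book data: `get_book_features`
# and the 'ISBN' / 'Book-Author' fields are not defined in this notebook, so the call is disabled.
# item_features = dataset.build_item_features(((x['ISBN'], [x['Book-Author']])
#                                               for x in get_book_features()))
# print(repr(item_features))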

## Task 3 (3 points)

When calling the `fit` method, LightFM allows passing `item_features` which represent the feature descriptions of the items. Let's make use of this. We will obtain the feature descriptions from the article text in the form of [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` with dimensions equal to the number of articles by the size of the feature description and train LightFM with `loss="warp"`, then calculate precision@10 on the test data.
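
A detail that is easy to miss: row `i` of the feature matrix has to describe the same article as column `i` of the interaction matrix. The sketch below shows one way to keep them aligned on toy data. The mapping `item_index`, the dictionary `texts_by_item`, and the texts themselves are hypothetical; articles without text are assumed to get an empty document and therefore an all-zero feature row.

```python
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item -> column mapping, as used when building the interaction matrix.
item_index = {"a": 0, "b": 1, "c": 2}

# Hypothetical article texts; item "c" has no text available.
texts_by_item = {
    "b": "deep learning for recommender systems",
    "a": "cloud computing and devops",
}

# Order the documents by item index; missing texts become empty strings.
ordered_items = sorted(item_index, key=item_index.get)
docs = [texts_by_item.get(item, "") for item in ordered_items]

# Sparse TF-IDF matrix of shape (n_items, vocabulary size).
toy_feat = csr_matrix(TfidfVectorizer().fit_transform(docs), dtype="float32")
print(toy_feat.shape)
```

When a custom feature matrix is passed, LightFM represents each item only through those features, so an item with an all-zero row has no learned representation. Stacking an identity matrix next to the TF-IDF columns is one common way to give every item its own embedding as well; whether that helps here is something to check on the test set.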


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Your code here
tfidf_vec = TfidfVectorizer()
feat = tfidf_vec.fit_transform(articles_df['text']).toarray()

In [None]:
# take the number of columns from the matrix instead of hard-coding the vocabulary size
feat = pd.DataFrame(feat, columns=np.arange(feat.shape[1]))

In [None]:
articles_concatted = articles_df.reset_index(drop=True).join(feat, how='right')

In [None]:
articles_concatted.drop(['timestamp', 'eventType', 'authorPersonId', 'authorSessionId', 'authorUserAgent', 'authorRegion',
                         'authorCountry', 'contentType', 'url', 'title', 'text', 'lang'], axis=1, inplace=True)

In [None]:
articles_concatted

In [None]:
pd.merge(interactions_full_df, articles_concatted)

In [None]:
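# `item_features` expects the feature names; interactions_full_df has no `tf_idf` column,
# so the TF-IDF vocabulary of `tfidf_vec` is registered instead
dataset.fit(users=(int(x[1].personId) for x in interactions_full_df.iterrows()),
            items=(int(x[1].contentId) for x in interactions_full_df.iterrows()),
            item_features=tfidf_vec.get_feature_names_out())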

In [None]:
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))

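In [None]:
# LightFM expects item_features as a sparse (n_items, n_features) matrix whose rows
# follow the item indices of train_matrix, so TF-IDF is rebuilt in item_id_mapping order.
# Items that are missing from articles_df get an empty text and an all-zero feature row.
item_texts = (articles_df.drop_duplicates('contentId').set_index('contentId')['text']
              .reindex(list(item_id_mapping)).fillna(''))
feat = csr_matrix(TfidfVectorizer().fit_transform(item_texts), dtype=np.float32)
model = LightFM(loss='warp')
%time model.fit(train_matrix, item_features=feat, epochs=30, num_threads=2)

In [None]:
k = 10
# a model trained with item features needs the same features at evaluation time
print('Train precision at k={}:\t{:.4f}'.format(k, precision_at_k(model, train_matrix, item_features=feat, k=k).mean()))
print('Test precision at k={}:\t\t{:.4f}'.format(k, precision_at_k(model, test_matrix, item_features=feat, k=k).mean()))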

## Task 4 (2 points)

In Task 3, we used the raw text of the articles. In this task, you must first preprocess the text (convert it to lowercase, remove stop words, lemmatize the words, etc.), then train the model and evaluate its quality on the test data.

In [None]:
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [None]:
articles_df['lang'].unique()

In [None]:
print(stopwords.fileids())

In [None]:
stop_words_lang = stopwords.words('english') + stopwords.words('portuguese') + stopwords.words('spanish') # no latin and japanese in stopwords module
# a set makes the per-token membership check in process_text fast
stop_words = set(string.punctuation) | set(stop_words_lang)

def process_text(text):
    return [word for word in word_tokenize(text.lower()) if word not in stop_words]
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
def to_lemmatize(sentence):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in sentence])

articles_df['text_processed'] = articles_df['text'].apply(process_text).apply(to_lemmatize)