# Introduction

In this assignment you will continue working with the data from the seminar, [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

In [None]:
import pandas as pd
import numpy as np
import math
import scipy.sparse  # import the submodule explicitly; scipy.sparse.csr_matrix is used below

## Loading and preprocessing the data

We load the data and preprocess it the same way as in the seminar.

In [None]:
!wget -q -N https://www.dropbox.com/s/z8syrl5trawxs0n/articles.zip?dl=0 -O articles.zip
!unzip -o -q articles.zip

In [None]:
articles_df = pd.read_csv('articles/shared_articles.csv')
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [None]:
interactions_df = pd.read_csv('articles/users_interactions.csv')
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [None]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [None]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

interactions_df['eventStrength'] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

Keep only the users who interacted with at least five distinct articles.

In [None]:
users_interactions_count_df = (
    interactions_df
    .groupby(['personId', 'contentId'])
    .first()
    .reset_index()
    .groupby('personId').size())
print('# users:', len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions:',len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


Keep only the interactions that belong to the selected users.

In [None]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)]

In [None]:
print('# interactions before:', interactions_df.shape)
print('# interactions after:', interactions_from_selected_users_df.shape)

# interactions before: (72312, 9)
# interactions after: (69868, 9)


Aggregate all interactions of a user with each article by summing their strengths, then smooth the result with a logarithm, log2(1 + x).

In [None]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId']).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(['personId', 'contentId'])
)
interactions_full_df['last_timestamp'] = (
    interactions_from_selected_users_df
    .groupby(['personId', 'contentId'])['timestamp'].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
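
To make the aggregation step concrete, here is a minimal sketch on a hypothetical toy log (the IDs `u1`, `a1`, `a2` are made up for illustration). Event strengths are summed per (user, article) pair and then passed through log2(1 + x), so a single VIEW maps to 1.0 and a total strength of 2 maps to log2(3) ≈ 1.585, the two values visible in the table above.

```python
import math
import pandas as pd

# Hypothetical toy interaction log; the IDs are illustrative only.
toy = pd.DataFrame({
    'personId': ['u1', 'u1', 'u1'],
    'contentId': ['a1', 'a1', 'a2'],
    'eventStrength': [1.0, 2.0, 1.0],   # VIEW + LIKE on a1, VIEW on a2
})

def smooth_user_preference(x):
    return math.log(1 + x, 2)

# Sum strengths per (user, article) pair, then apply the log smoothing.
smoothed = (
    toy.groupby(['personId', 'contentId'])['eventStrength'].sum()
    .apply(smooth_user_preference)
)
print(smoothed)
```

Here the pair (`u1`, `a1`) has total strength 3 and is smoothed to log2(4) = 2, so repeated or stronger interactions raise the score, but sublinearly.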


Split the data into train and test sets by time.

In [None]:
# interactions before this timestamp go to train, the rest go to test
split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675


To make quality evaluation convenient, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [None]:
interactions = (
    interactions_train_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={'contentId': 'true_train'})
    .set_index('personId')
)

interactions['true_test'] = (
    interactions_test_df
    .groupby('personId')['contentId'].agg(lambda x: list(x))
)

# fill missing values with empty lists
interactions.loc[pd.isnull(interactions.true_test), 'true_test'] = [
    list() for x in range(len(interactions.loc[pd.isnull(interactions.true_test), 'true_test']))]

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."


## The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements several popular algorithms. As in the seminar, recommendation quality is measured with *precision@10*.

In [None]:
!pip install lightfm
from lightfm import LightFM
from lightfm.evaluation import precision_at_k



## Task 1. (2 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column is the strength of their interaction if there was one, and zero otherwise.

In [None]:
# Your code here
data_train = pd.pivot_table(
    interactions_train_df,
    values='eventStrength',
    index='personId',
    columns='contentId'
).fillna(0)
data_test = pd.pivot_table(
    interactions_test_df,
    values='eventStrength',
    index='personId',
    columns='contentId'
).fillna(0)

In [None]:
content_ids_full = interactions_full_df.sort_values(by=['last_timestamp'])['contentId'].unique()
person_ids_full = interactions_full_df.sort_values(by=['last_timestamp'])['personId'].unique()

In [None]:
data_train_shape = data_train.shape
personIds_train = data_train.index.to_list()
# column position of every article in the global item ordering
columnIds_train = [np.where(content_ids_full == i)[0][0] for i in data_train.columns.to_list()]
# row position of every user in the global user ordering, repeated once per column
sparseRows_train = np.array([[np.where(person_ids_full == i)[0][0]] * data_train_shape[1] for i in personIds_train]).flatten()
sparseColumns_train = np.array(columnIds_train * data_train_shape[0])
data_train_flatten = data_train.to_numpy().flatten()
# shape is (number of users, number of articles), i.e. (1140, 2984) for this data
data_train_sparse = scipy.sparse.csr_matrix(
    (data_train_flatten, (sparseRows_train, sparseColumns_train)),
    shape=(len(person_ids_full), len(content_ids_full)))

In [None]:
data_test_shape = data_test.shape
personIds_test = data_test.index.to_list()
# column position of every article in the global item ordering
columnIds_test = [np.where(content_ids_full == i)[0][0] for i in data_test.columns.to_list()]
# row position of every user in the global user ordering, repeated once per column
sparseRows_test = np.array([[np.where(person_ids_full == i)[0][0]] * data_test_shape[1] for i in personIds_test]).flatten()
sparseColumns_test = np.array(columnIds_test * data_test_shape[0])
data_test_flatten = data_test.to_numpy().flatten()
# same shape as the train matrix so that rows and columns line up
data_test_sparse = scipy.sparse.csr_matrix(
    (data_test_flatten, (sparseRows_test, sparseColumns_test)),
    shape=(len(person_ids_full), len(content_ids_full)))
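
As an alternative to pivoting and flattening, the same kind of matrix can be sketched directly from the long-format table. The helper `to_sparse` and the toy IDs below are hypothetical, not part of the original solution; the idea is to map each ID to its position in a fixed ordering and hand the (row, column, value) triples to SciPy.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical long-format interactions (same column names as above).
toy = pd.DataFrame({
    'personId': ['u1', 'u1', 'u2'],
    'contentId': ['a1', 'a3', 'a2'],
    'eventStrength': [1.0, 2.0, 1.5],
})
# Fixed orderings that define the row and column indices of the matrix.
person_ids = np.array(['u1', 'u2', 'u3'])
content_ids = np.array(['a1', 'a2', 'a3'])

def to_sparse(df, person_ids, content_ids):
    """Build a users x items CSR matrix that stores only observed interactions."""
    rows = pd.Categorical(df['personId'], categories=person_ids).codes
    cols = pd.Categorical(df['contentId'], categories=content_ids).codes
    return sparse.coo_matrix(
        (df['eventStrength'].to_numpy(), (rows, cols)),
        shape=(len(person_ids), len(content_ids)),
    ).tocsr()

toy_sparse = to_sparse(toy, person_ids, content_ids)
print(toy_sparse.toarray())
```

Unlike the pivot-based route, this variant never materialises a dense table and stores no explicit zeros. That difference may matter for routines that iterate over the stored entries of a sparse matrix, so results obtained with the two constructions are not guaranteed to coincide.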

## Task 2. (1 point)

Train a LightFM model with `loss='warp'` and compute *precision@10* on the test set.

In [None]:
# Your code here
model = LightFM(loss='warp')
model.fit(data_train_sparse, epochs=20)

<lightfm.lightfm.LightFM at 0x7f5956b56310>

In [None]:
predictions = model.predict(sparseRows_test, sparseColumns_test)

In [None]:
prc = precision_at_k(model, data_test_sparse)

In [None]:
prc.mean()

0.6076374
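
For reference, precision@k for one user is the share of the k highest-scored items that the user actually interacted with in the test period. The sketch below follows that textbook definition on made-up scores; `precision_at_k_manual` is a hypothetical helper and is not LightFM's implementation.

```python
import numpy as np

def precision_at_k_manual(scores, relevant, k=10):
    """Mean precision@k over users that have at least one relevant item.

    scores:   (n_users, n_items) array of predicted scores
    relevant: (n_users, n_items) boolean array, True where the user
              interacted with the item in the test period
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best k items per user
    hits = np.take_along_axis(relevant, top_k, axis=1)  # which of them are relevant
    per_user = hits.sum(axis=1) / k
    has_test = relevant.any(axis=1)                     # skip users with no test items
    return per_user[has_test].mean()

# Hypothetical scores for 2 users and 4 items.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
relevant = np.array([[True, False, False, True],
                     [False, True, False, True]])
result = precision_at_k_manual(scores, relevant, k=2)
print(result)
```

In this toy case the first user gets one hit out of two and the second gets two out of two, so the mean is 0.75.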

## Task 3. (3 points)

When `fit` is called, LightFM accepts a feature description of the items through `item_features`. Let's use this. The feature description is obtained from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by number of features, train LightFM with `loss='warp'`, and compute precision@10 on the test set.

In [None]:
articles_unique_df = articles_df[['contentId', 'text']]

In [None]:
articles_unique_df = (
    articles_unique_df
    .groupby('contentId')
    .first()
)

In [None]:
articles_index = articles_unique_df.index.to_numpy()

In [None]:
intersection = np.in1d(articles_index, content_ids_full)

In [None]:
articles_unique_df = articles_unique_df[intersection]

In [None]:
articles_unique_df

Unnamed: 0_level_0,text
contentId,Unnamed: 1_level_1
-1006791494035379303,DeepMind may be a master at one of the most co...
-1021685224930603833,*Igor Schiewig 25/03/2016 - A Indústria 4.0 é ...
-1022885988494278200,In this post I will share 12 extremely useful ...
-1024046541613287684,It is no secret bitcoin entrepreneurs and star...
-1033806831489252007,v0.32.0-rc.0 on GitHub (npm) Breaking changes ...
...,...
967143806332397325,"O gigante chinês de buscas, Baidu, está procur..."
972258375127367383,The Better Exposed Filters module replaces the...
980458131533897249,Why? We use Elasticsearch + Kibana for data an...
98528655405030624,"Da política, do time de futebol, do tempo, do ..."


In [None]:
interaction_articles_index = articles_unique_df.index.to_numpy()

In [None]:
intersection = ~np.in1d(content_ids_full, interaction_articles_index)

In [None]:
intersection.shape

(2984,)

In [None]:
missing_content_ids_full = content_ids_full[intersection]

In [None]:
missing_values = np.array([''] * missing_content_ids_full.shape[0])

In [None]:
missing_df = pd.DataFrame({'text': missing_values})

In [None]:
index_dct = {i: missing_content_ids_full[i] for i in range(missing_content_ids_full.shape[0])}

In [None]:
index_dct

{0: '8078873160882064481',
 1: '1179326165172129711',
 2: '-729129249377835720',
 3: '-6451309518266745024',
 4: '-8418620743404378592',
 5: '1556878199027930272',
 6: '-1172724258904585136',
 7: '3823268327704412514'}

In [None]:
missing_df = missing_df.rename(index=index_dct)

In [None]:
missing_df

Unnamed: 0,text
8078873160882064481,
1179326165172129711,
-729129249377835720,
-6451309518266745024,
-8418620743404378592,
1556878199027930272,
-1172724258904585136,
3823268327704412514,


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
articles_unique_df = pd.concat([articles_unique_df, missing_df])

In [None]:
articles_unique_df

Unnamed: 0,text
-1006791494035379303,DeepMind may be a master at one of the most co...
-1021685224930603833,*Igor Schiewig 25/03/2016 - A Indústria 4.0 é ...
-1022885988494278200,In this post I will share 12 extremely useful ...
-1024046541613287684,It is no secret bitcoin entrepreneurs and star...
-1033806831489252007,v0.32.0-rc.0 on GitHub (npm) Breaking changes ...
...,...
-6451309518266745024,
-8418620743404378592,
1556878199027930272,
-1172724258904585136,


In [None]:
articles_unique_df_true_index = [np.where(i == content_ids_full)[0][0] for i in articles_unique_df.index.to_list()]

In [None]:
articles_unique_df['true_index'] = articles_unique_df_true_index

In [None]:
articles_unique_df_sorted = articles_unique_df.sort_values(by='true_index')

In [None]:
articles_unique_df_sorted.head()

Unnamed: 0,text,true_index
8078873160882064481,,0
1179326165172129711,,1
-729129249377835720,,2
-6451309518266745024,,3
3353902017498793780,"Ethereum, considered by many to be the most pr...",4
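
The alignment carried out in the cells above (keep the articles that occur in the interactions, add empty text for the missing ones, sort by column position) can also be sketched with a single `reindex`. The toy IDs and texts below are made up; the only assumption is that the text is held in a Series indexed by `contentId`.

```python
import numpy as np
import pandas as pd

# Hypothetical article texts indexed by contentId (article a3 has no text).
texts = pd.Series(
    {'a2': 'bitcoin price', 'a1': 'ethereum contracts'},
    name='text',
)
# Item order used for the columns of the interaction matrix.
content_ids = np.array(['a3', 'a1', 'a2'])

# reindex puts rows in matrix-column order; missing articles become ''.
aligned = texts.reindex(content_ids, fill_value='')
print(aligned.tolist())
```

Row i of the aligned Series then describes the item in column i of the interaction matrix, which is the property the feature matrix needs.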


In [None]:
# Your code here
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=2984)
feat = vectorizer.fit_transform(articles_unique_df_sorted['text'])
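
A small sketch of what `fit_transform` returns may help here. The three documents are made up and `max_features=10` is an arbitrary choice; the point is that the result is a sparse matrix with one row per item and one column per retained term, and that an article with empty text simply gets an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts, already ordered like the item columns; '' means no text.
docs = ['ethereum bitcoin blockchain', '', 'bitcoin price']

toy_vectorizer = TfidfVectorizer(max_features=10)
toy_feat = toy_vectorizer.fit_transform(docs)   # sparse, (n_items, n_terms)

print(toy_feat.shape)
print(toy_feat.getnnz(axis=1))   # non-zero features per item
```

The number of feature columns is set by the vocabulary and `max_features`; it does not have to match the number of items.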

In [None]:
model = LightFM(loss='warp')
model.fit(data_train_sparse, item_features=feat, epochs=20)

<lightfm.lightfm.LightFM at 0x7f5956432d10>

In [None]:
predictions = model.predict(sparseRows_test, sparseColumns_test, item_features=feat)

In [None]:
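# pass the item features used in fit; otherwise LightFM falls back to identity features
prc_feat = precision_at_k(model, data_test_sparse, item_features=feat)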

In [None]:
prc_feat.mean()

0.32301426

## Task 4. (2 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert to lowercase, remove stop words, reduce words to their normal form, and so on), then train the model and evaluate its quality on the test data.

In [None]:
# Your code here
sample_text = articles_df[:1]['text']

In [None]:
sample_text = sample_text[1]

In [None]:
sample_text

'All of this work is still very early. The first full public version of the Ethereum software was recently released, and the system could face some of the same technical and legal problems that have tarnished Bitcoin. Many Bitcoin advocates say Ethereum will face more security problems than Bitcoin because of the greater complexity of the software. Thus far, Ethereum has faced much less testing, and many fewer attacks, than Bitcoin. The novel design of Ethereum may also invite intense scrutiny by authorities given that potentially fraudulent contracts, like the Ponzi schemes, can be written directly into the Ethereum system. But the sophisticated capabilities of the system have made it fascinating to some executives in corporate America. IBM said last year that it was experimenting with Ethereum as a way to control real world objects in the so-called Internet of things. Microsoft has been working on several projects that make it easier to use Ethereum on its computing cloud, Azure. "Et

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer 

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
stop_words = stopwords.words('english')
signs = ['.', ',', '/', '?', '!', ':', ';', '\'', '"', '-']

def preprocess_text(text):
    sample_text = text.lower()
    for s in signs:
        sample_text = sample_text.replace(s, '')

    tokenizer = WordPunctTokenizer()
    tokenized_list = tokenizer.tokenize(sample_text)

    tokenized_list_without_sw = [t for t in tokenized_list if t not in stop_words]

    lemmatizer = WordNetLemmatizer()
    lemmed_text = [lemmatizer.lemmatize(t) for t in tokenized_list_without_sw]

    sample_text_preprocessed = ' '.join(lemmed_text)
    return sample_text_preprocessed

In [None]:
preprocess_text(sample_text)

'work still early first full public version ethereum software recently released system could face technical legal problem tarnished bitcoin many bitcoin advocate say ethereum face security problem bitcoin greater complexity software thus far ethereum faced much le testing many fewer attack bitcoin novel design ethereum may also invite intense scrutiny authority given potentially fraudulent contract like ponzi scheme written directly ethereum system sophisticated capability system made fascinating executive corporate america ibm said last year experimenting ethereum way control real world object socalled internet thing microsoft working several project make easier use ethereum computing cloud azure ethereum general platform solve problem many industry using fairly elegant solution elegant solution seen date said marley gray director business development strategy microsoft mr gray responsible microsofts work blockchains database concept bitcoin introduced blockchains designed store trans
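
Because `preprocess_text` deletes punctuation before tokenizing, hyphenated words are glued together (`socalled` in the output above). A possible variation, shown here as a sketch with a tiny made-up stop-word list and without lemmatization, is to split on anything that is not a letter or digit instead.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
TOY_STOP_WORDS = {'the', 'of', 'is', 'a', 'in'}

def simple_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    # [^\W_]+ matches runs of letters and digits, including non-ASCII letters.
    tokens = re.findall(r'[^\W_]+', text.lower())
    return ' '.join(t for t in tokens if t not in stop_words)

print(simple_preprocess('The so-called Internet of Things is growing.'))
```

With this tokenization `so-called` becomes the two tokens `so` and `called`. Whether that helps the recommender is an empirical question that this sketch does not answer.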

In [None]:
articles_df_preprocessed = articles_unique_df_sorted.copy()  # copy so the raw text stays intact

In [None]:
articles_df_preprocessed['text'] = articles_df_preprocessed['text'].apply(preprocess_text)

In [None]:
articles_df_preprocessed

Unnamed: 0,text,true_index
8078873160882064481,,0
1179326165172129711,,1
-729129249377835720,,2
-6451309518266745024,,3
3353902017498793780,ethereum considered many promising altcoin gra...,4
...,...,...
9213260650272029784,liga venture aceleradora de startup especializ...,2979
-3295913657316686039,amazon launched chime video conferencing chat ...,2980
3618271604906293310,february 9 2017 begin year look k12 computer s...,2981
6607431762270322325,jpmorgan chase & co learning machine parsing f...,2982


In [None]:
vectorizer = TfidfVectorizer(max_features=2984)
feat = vectorizer.fit_transform(articles_df_preprocessed['text'])

In [None]:
model = LightFM(loss='warp')
model.fit(data_train_sparse, item_features=feat, epochs=20)

<lightfm.lightfm.LightFM at 0x7f59485fefd0>

In [None]:
predictions_preprocessed = model.predict(sparseRows_test, sparseColumns_test, item_features=feat)
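# pass the item features used in fit; otherwise LightFM falls back to identity features
prc_feat_preprocessed = precision_at_k(model, data_test_sparse, item_features=feat)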
prc_feat_preprocessed.mean()

0.3789206

Did the prediction quality improve?

Answer: unfortunately, no. Either I made a mistake in the code (although I seem to have checked everything), or the features should be computed with something other than TF-IDF, but precision@10 drops by almost a factor of two relative to the model that ignores the article text (0.608 versus 0.323). Preprocessing does slightly improve quality relative to TF-IDF on raw text, though (0.379 versus 0.323).
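
One possible factor, offered as an assumption rather than a verified diagnosis: when an explicit `item_features` matrix is passed, the model describes every item only through those features and no longer has a separate per-item indicator. A commonly used remedy is to place an identity block next to the content features, so each item keeps its own latent vector in addition to the text-based ones. The sketch below shows only the matrix construction on made-up numbers.

```python
import numpy as np
from scipy import sparse

n_items = 3
# Hypothetical content features (for example TF-IDF) with 4 feature columns.
content_feat = sparse.csr_matrix(np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # item without text
    [0.0, 1.0, 0.0, 0.0],
], dtype=np.float32))

# Identity block first: column j of this block is the indicator of item j.
hybrid_feat = sparse.hstack(
    [sparse.identity(n_items, dtype=np.float32, format='csr'), content_feat]
).tocsr()

print(hybrid_feat.shape)
```

A matrix built this way would be passed as `item_features` both to `fit` and to the evaluation call. Items without text then still carry one non-zero feature instead of an all-zero row.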


## Task 5. (2 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve its quality.

In [None]:
seed = np.random.RandomState(seed=19937)

In [None]:
# Your code here
model = LightFM(loss='warp', no_components=25, max_sampled=100, learning_schedule='adagrad', 
                learning_rate=5e-4, random_state=seed)
model.fit(data_train_sparse, epochs=20)

<lightfm.lightfm.LightFM at 0x7f59485b7d90>

In [None]:
predictions = model.predict(sparseRows_test, sparseColumns_test)

In [None]:
prc = precision_at_k(model, data_test_sparse)

In [None]:
prc.mean()

0.6991852
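
The configuration above is a single hand-picked point. A more systematic option is a small random search, sketched below. `random_search`, `toy_evaluate` and the candidate values are all hypothetical; the stand-in objective only exists so that the block runs on its own, and in the notebook it would be replaced by a function that trains LightFM with the sampled parameters and returns the mean precision@10.

```python
import random

def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample configurations from `space` and keep the best-scoring one.

    evaluate: callable taking a dict of hyperparameters, returning a score
    space:    dict mapping parameter name -> list of candidate values
    """
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values are illustrative assumptions, not recommendations.
space = {
    'no_components': [10, 25, 50, 100],
    'learning_rate': [5e-4, 5e-3, 5e-2],
    'max_sampled': [10, 50, 100],
}

# Stand-in objective so the sketch is self-contained.
def toy_evaluate(params):
    return -abs(params['no_components'] - 50) - params['learning_rate']

best_params, best_score = random_search(toy_evaluate, space, n_trials=50)
print(best_params, best_score)
```

If the search is scored on the test set, the reported number becomes optimistic; holding out a validation slice of the training period for tuning would be a cleaner setup.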

## Bonus task. (3 points)

Above we used a fairly simple TF-IDF representation of the article text. In this task you need to represent the article text (optionally together with the title) as an embedding obtained from a recurrent network or a transformer (you may use any pretrained model you like). Train the model with these embeddings and compare the results with the previous ones.

In [None]:
# Your code here
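
One common way to turn the per-token outputs of a pretrained encoder into a single article vector is mean pooling over the non-padding tokens. The sketch below covers only that pooling step on made-up numbers; `mean_pool` is a hypothetical helper, and the choice of encoder and the way its token embeddings are obtained are left open.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: (n_tokens, dim) array produced by some encoder
    attention_mask:   (n_tokens,) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask.astype(token_embeddings.dtype)[:, None]
    total = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)          # avoid division by zero for empty text
    return total / count

# Hypothetical encoder output: 3 tokens (the last one is padding), dimension 2.
token_embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
attention_mask = np.array([1, 1, 0])
article_vec = mean_pool(token_embeddings, attention_mask)
print(article_vec)
```

Stacking one such vector per article, in the same order as the item columns, gives a dense (number of articles by embedding size) array. Since LightFM expects a sparse CSR matrix for `item_features`, the array would need to be converted, for example with `scipy.sparse.csr_matrix`.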