**Key points**
- Deadline: February 19, 23:59
- Target metric: precision@5
- Baseline solution: [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Submit a link to a GitHub repository with your solution. The solution must clearly show the metric on the new test set from the file retail_test1.csv: generate your recommendations for every user in that file and compute precision@5 against their actual purchases.
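As a rough illustration of the target metric, precision@5 can be computed per user and then averaged. The sketch below is a minimal pure-Python version; the function name `precision_at_k_sketch` and the toy dictionaries are hypothetical and are not the `precision_at_k` from the course's `metrics` module.

```python
def precision_at_k_sketch(recommended, actual, k=5):
    """Share of the top-k recommended items that the user actually bought."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / k


# Hypothetical toy data: user_id -> recommended items / actually purchased items.
recs = {1: [10, 20, 30, 40, 50], 2: [10, 60, 70, 80, 90]}
purchases = {1: [20, 50, 99], 2: [11, 12]}

# Average the per-user scores to get the final metric.
scores = [precision_at_k_sketch(recs[u], purchases[u], k=5) for u in recs]
mean_precision = sum(scores) / len(scores)
print(mean_precision)
```

Here user 1 has two hits among five recommendations and user 2 has none, so the averaged value is 0.2.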

**!! We do not consider user cold start: users are assumed to be the same across all sets, so make sure to exclude from the test set any users who do not appear in train.**
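One possible way to drop cold-start users is to keep only the test rows whose `user_id` was seen in train. The frames below are hypothetical toy stand-ins for retail_train.csv and retail_test1.csv; only the `user_id` column name is taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test files.
train_df = pd.DataFrame({"user_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12]})
test_df = pd.DataFrame({"user_id": [1, 2, 4], "item_id": [11, 13, 10]})

# Keep only test rows whose user was seen in train (user 4 is a cold-start user).
known_users = set(train_df["user_id"].unique())
test_warm = test_df[test_df["user_id"].isin(known_users)]

print(sorted(test_warm["user_id"].unique().tolist()))
```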


**Hints:**

Start by simply trying different MainRecommender parameters:
- N in the top-N items used to build the user-item matrix (currently top-5000)
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)
- Different matrix weightings (TF-IDF, BM25, which has its own parameters)
- Different ways of blending recommendations (note the baseline: the user's past purchases)
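The weighting options for the user-item matrix can be sketched as follows. This is only an illustration on a hypothetical toy purchase log, using the summed `quantity` as the purchase count; it is not how MainRecommender builds its matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase log; column names follow the notebook's data.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": [10, 10, 11, 10, 12],
    "quantity": [1, 2, 1, 4, 1],
})

# Purchase counts per (user, item); missing pairs become 0.
counts = pd.pivot_table(
    purchases, index="user_id", columns="item_id",
    values="quantity", aggfunc="sum", fill_value=0,
).astype(float)

binary = (counts > 0).astype(float)   # 0/1 weights
log_counts = np.log1p(counts)         # log(count + 1) weights
```

Any of the three frames could then be converted to a sparse matrix (for example with `scipy.sparse.csr_matrix`) before fitting a factorization model.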

Build an MVP (minimum viable product) first, even if it is just top-popular, and then improve it.
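A top-popular MVP can be very small. The sketch below counts purchases per item and recommends the same most frequent items to everyone; the function name and the toy log are hypothetical.

```python
from collections import Counter


def top_popular(purchases, n=5):
    """Return the n most frequently purchased item ids."""
    counts = Counter(item for _, item in purchases)
    return [item for item, _ in counts.most_common(n)]


# Hypothetical (user_id, item_id) purchase pairs.
log = [(1, 10), (1, 11), (2, 10), (3, 10), (3, 12), (2, 12), (4, 13)]
popular = top_popular(log, n=2)
print(popular)
```

Such a recommender gives a floor for precision@5 that any personalized model should beat.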

If you build a two-level model, keep a close eye on validation.
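For a two-level model, one common scheme is a three-way split by time: older weeks train the first-level model, the next block trains the second-level ranker, and the last block is held out for the final check. The sketch below assumes a `week_no` column as in the notebook's data; the window sizes and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical transactions covering weeks 1..10.
df = pd.DataFrame({
    "week_no": list(range(1, 11)),
    "user_id": [1] * 10,
    "item_id": list(range(100, 110)),
})

LVL_1_WEEKS = 3  # weeks reserved for training the second-level model (assumed)
LVL_2_WEEKS = 2  # weeks reserved for the final check (assumed)

last_week = df["week_no"].max()
lvl_2_start = last_week - LVL_2_WEEKS + 1   # first held-out week
lvl_1_start = lvl_2_start - LVL_1_WEEKS     # first week of the ranker's training block

train_lvl_1 = df[df["week_no"] < lvl_1_start]
train_lvl_2 = df[(df["week_no"] >= lvl_1_start) & (df["week_no"] < lvl_2_start)]
valid_lvl_2 = df[df["week_no"] >= lvl_2_start]
```

Keeping the three blocks disjoint in time means the ranker is never evaluated on weeks that either model was trained on.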

### Importing modules

In [2]:
!pip install -q condacolab
import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!


In [3]:
!conda install -c conda-forge implicit

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - implicit


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2021.10.8  |       ha878542_0         139 KB  conda-forge
    certifi-2021.10.8          

In [4]:
!conda install -c conda-forge implicit implicit-proc=*=gpu

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - implicit
    - implicit-proc[build=gpu]


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    implicit-proc-0.5.2        |              gpu           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:           4 KB

The following NEW packages will be INSTALLED:

  implicit-proc      conda-forge/linux-64::implicit-proc-0.5.2-gpu



Downloading and Extracting Packages
implicit-proc-0.5.2  | 4 KB      | : 100% 1.0/1 [00:00<00:00,  4

In [5]:
!pip install implicit



In [6]:
import implicit

  f"CUDA extension is built, but disabling GPU support because of '{e}'",


In [7]:
!conda install -c conda-forge lightgbm

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - lightgbm


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-1.1.0               |     pyhd8ed1ab_0         210 KB  conda-forge
    lightgbm-3.3.2             |   py37hcd2ae1e_0         1.8 MB  conda-forge
    scikit-learn-1.0.2         |   py37hf9e9bfc_0         7.8 MB  conda-forge
    threadpoolctl-3.1.0        |     pyh8a188c0_0          18 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         9.9 MB

The following NEW packages will be INSTALLED:

  joblib             conda-forge/noarch

In [8]:
import sys

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrix support
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath('/content/drive/MyDrive')
if module_path not in sys.path:
    sys.path.append(module_path)
    
from statistics import mean

# Our own helper functions
from metrics import precision_at_k, recall_at_k, money_precision_at_k

# from utils import prefilter_items, get_targets_sec_level, extend_new_user_features, extend_new_item_features, extend_user_item_new_features, get_important_features, get_popularity_recommendations, postfilter_items, get_final_recomendations


from utils import prefilter_items#, \
#extend_user_item_new_features, get_important_features, get_popularity_recommendations, filter_by_diff_cat, \
#postfilter_items, get_final_recomendations, get_targets_sec_level, extend_new_user_features, extend_new_item_features


from recommenders import MainRecommender

from tqdm import tqdm
tqdm.pandas()

pd.set_option('display.max_columns', None)
import warnings
warnings.simplefilter('ignore')

In [20]:
%load_ext autoreload
%autoreload 2

### Loading the data and splitting into train and test

In [22]:
data = pd.read_csv('/content/drive/MyDrive/retail_train.csv')
data_test = pd.read_csv('/content/drive/MyDrive/retail_test1.csv')
item_features = pd.read_csv('/content/drive/MyDrive/product.csv')
user_features = pd.read_csv('/content/drive/MyDrive/hh_demographic.csv')

In [23]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

In [24]:
# Number of recommendations
N=100 

VAL_SIZE = 3

train_1 = data[data['week_no'] < data['week_no'].max() - (VAL_SIZE)]
val = data[data['week_no'] >= data['week_no'].max() - (VAL_SIZE)]

train_2 = val.copy()

### Prefiltering the data

In [25]:
n_items_before = train_1['item_id'].nunique()
train_1 = prefilter_items(train_1, item_features=item_features, take_n_popular= 3000)
n_items_after = train_1['item_id'].nunique()

print(f'Decreased # items from {n_items_before} to {n_items_after}')

Decreased # items from 86865 to 3001
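A result of 3001 rather than 3000 distinct items would be consistent with a prefilter that keeps the top-N items and maps everything else to a single placeholder id. The sketch below shows that idea under stated assumptions: the function name, the placeholder value, and popularity measured by summed `quantity` are all hypothetical, and this is not the actual `prefilter_items` from `utils`.

```python
import pandas as pd

PLACEHOLDER_ID = 999999  # assumed dummy id for all non-top items


def keep_top_n_items(df, n):
    """Keep the n best-selling item ids; map every other item to a placeholder."""
    popularity = df.groupby("item_id")["quantity"].sum()
    top_items = popularity.nlargest(n).index
    out = df.copy()
    out.loc[~out["item_id"].isin(top_items), "item_id"] = PLACEHOLDER_ID
    return out


# Hypothetical toy transactions.
toy = pd.DataFrame({
    "item_id": [10, 10, 11, 12, 13],
    "quantity": [5, 3, 4, 1, 1],
})
filtered = keep_top_n_items(toy, n=2)
print(filtered["item_id"].nunique())
```

With this scheme the number of distinct ids after filtering is N + 1 whenever at least one item falls outside the top N.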


### Training the first-level model

In [26]:
recommender = MainRecommender(train_1)

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/2497 [00:00<?, ?it/s]

### Embeddings

In [28]:
items_emb_df = recommender.items_emb_df
users_emb_df = recommender.users_emb_df

AttributeError: ignored

### Features

In [None]:
train = extend_user_item_new_features(train_2, train_1, recommender, item_features, user_features, items_emb_df, users_emb_df, N)
train.head()

In [None]:
X_train = train.drop(['target'], axis=1)
y_train = train[['target']]

In [None]:
# Treat object-typed columns as categorical features
# (np.object was removed in recent NumPy releases; the builtin object is equivalent)
cat_features = [col for col in X_train.columns if X_train[col].dtype == object]

cat_cols = cat_features + ['user_id', 'item_id']
X_train[cat_cols] = X_train[cat_cols].astype('category')

In [None]:
test = extend_user_item_new_features(data_test, train_1, recommender, item_features, user_features, items_emb_df, users_emb_df, N)
X_test = test.drop(['target'], axis=1)
y_test = test[['target']]
X_test[cat_features + ['user_id', 'item_id']] = X_test[cat_features + ['user_id', 'item_id']].astype('category')

### Running LightGBM to find the most important features

In [None]:
lgb = LGBMClassifier(objective='binary', max_depth=5, categorical_feature=cat_features)
important_features = get_important_features(lgb, X_train, y_train)

### Training the second-level model

In [None]:
lgb = LGBMClassifier(
    objective='binary',
    max_depth=5,
    categorical_feature=cat_features
)
lgb.fit(X_train[important_features], y_train)

In [None]:
preds = lgb.predict(X_test[important_features])
test_preds_proba = lgb.predict_proba(X_test[important_features])[:, 1]
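One way the predicted probabilities could be turned into per-user recommendation lists is to sort the candidates by score and take the top K for each user. The sketch below uses a hypothetical toy candidate table and is not the `get_final_recomendations` helper from `utils`, which may apply additional filters.

```python
import pandas as pd

# Hypothetical candidate table: one row per (user, item) with a model score.
scored = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": [10, 11, 12, 10, 13],
    "proba":   [0.2, 0.9, 0.5, 0.7, 0.1],
})

K = 2  # the assignment uses 5; 2 keeps the toy example small

# Sort by score within each user, then keep the K best items per user.
top_k = (
    scored.sort_values(["user_id", "proba"], ascending=[True, False])
          .groupby("user_id")["item_id"]
          .apply(lambda s: s.head(K).tolist())
)
print(top_k)
```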

### Final filtering of the data

In [None]:
result = get_final_recomendations(X_test, test_preds_proba, data, train_1, item_features)

In [None]:
price = train_1.groupby('item_id')['price'].mean().reset_index()

### Money precision @ k

In [None]:
final_result = result.apply(lambda row: money_precision_at_k(row['recomendations'], row['actual'], price), axis=1).mean()

In [None]:
final_result
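For reference, one common definition of money precision@k weights each recommended item by its price: the value of the recommended items that were actually bought, divided by the value of all top-k recommended items. The sketch below assumes that definition and takes prices as a plain dict; it is not the `money_precision_at_k` from the `metrics` module, whose exact signature is not shown here.

```python
def money_precision_at_k_sketch(recommended, actual, prices, k=5):
    """Price-weighted precision: value of hits divided by value of the top-k."""
    top_k = list(recommended)[:k]
    actual_set = set(actual)
    total = sum(prices.get(item, 0.0) for item in top_k)
    if total == 0:
        return 0.0
    hit_value = sum(prices.get(item, 0.0) for item in top_k if item in actual_set)
    return hit_value / total


# Hypothetical item prices and a single user's lists.
prices = {10: 2.0, 11: 3.0, 12: 5.0}
score = money_precision_at_k_sketch([10, 11, 12], [11, 12], prices, k=3)
print(score)
```

In this toy case the hits are worth 8.0 out of a recommended total of 10.0, so the score is 0.8.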

### Saving the predictions

In [None]:
result.drop('actual', axis=1, inplace=True)

In [None]:
result.to_csv('/content/drive/MyDrive/recommendations.csv', index=False)