# Course project

## **Main points**
- Deadline: 25.11.2021 20:00
- Target metric: precision@5
- Baseline solution: [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Submit a link to a GitHub repository with your solution. The solution must clearly show the metric on the new test set from the file retail_test1.csv: generate your recommendations for every user in that file and compute precision@5 against their actual purchases.
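
As a reminder of what is being measured, precision@5 is the share of the top five recommended items that the user actually bought. A minimal sketch follows; the function name `precision_at_k_sketch` and the toy item ids are illustrative assumptions, and the project's own `precision_at_k` in `metrics.py` may differ in details such as how short recommendation lists are handled.

```python
def precision_at_k_sketch(recommended, actual, k=5):
    """Share of the first k recommended items that appear in the actual purchases."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    bought = set(actual)
    hits = sum(1 for item in top_k if item in bought)
    # Assumption: divide by k even when fewer than k items were recommended
    return hits / k


# Toy example with made-up item ids
recommended = [101, 202, 303, 404, 505, 606]
actual = [202, 505, 999]
score = precision_at_k_sketch(recommended, actual, k=5)
print(score)
```

Averaging this value over all users in the test file gives the number to report.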

**!! We do not consider user cold start: the users are assumed to be the same in every set, so make sure to exclude cold-start users (those absent from the training data) from the test set.**
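
One possible way to do this with pandas is sketched below. The toy frames stand in for `retail_train.csv` and `retail_test1.csv`, and the variable names are assumptions made for illustration only.

```python
import pandas as pd

# Toy frames standing in for the real train and test files
train_df = pd.DataFrame({"user_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12]})
test_df = pd.DataFrame({"user_id": [1, 2, 4], "item_id": [11, 13, 10]})

# Keep only the test rows whose user was seen in training
known_users = set(train_df["user_id"])
test_warm = test_df[test_df["user_id"].isin(known_users)].reset_index(drop=True)

print(test_warm)
```

Here user 4 appears only in the test frame, so its rows are dropped before any metric is computed.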


**Hints:** 

First, simply try different MainRecommender parameters:  
- N in the top-N items used to build the user-item matrix (currently top-5000)  
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)  
- Different matrix weightings (TF-IDF, BM25, which has its own parameters)  
- Different ways of blending recommendations (note the baseline: the user's past purchases)  
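
The weight options listed above can be sketched with pandas and NumPy as follows. The toy purchase log and the helper `build_user_item` are assumptions for illustration and are not part of MainRecommender.

```python
import numpy as np
import pandas as pd

# Toy purchase log; column names follow the retail data set
purchases = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2, 3],
    "item_id":     [10, 10, 11, 10, 12, 11],
    "quantity":    [1, 2, 1, 1, 4, 1],
    "sales_value": [2.0, 4.0, 1.5, 2.0, 8.0, 1.5],
})


def build_user_item(df, values, aggfunc):
    """Hypothetical helper: dense user-item matrix with the chosen aggregation."""
    matrix = pd.pivot_table(
        df, index="user_id", columns="item_id",
        values=values, aggfunc=aggfunc, fill_value=0,
    )
    return matrix.astype(float)


counts = build_user_item(purchases, "quantity", "count")    # number of purchases
binary = (counts > 0).astype(float)                         # 0/1 weights
log_counts = np.log1p(counts)                               # log(number of purchases + 1)
money = build_user_item(purchases, "sales_value", "sum")    # purchase amount

print(counts)
```

For the TF-IDF and BM25 weightings, the `implicit` library provides `tfidf_weight` and `bm25_weight` in `implicit.nearest_neighbours`; they are not called here so that the sketch depends only on pandas and NumPy.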

Build an MVP (minimum viable product) first, even if it is just top-popular, and then improve it.
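
A top-popular MVP can be as small as the sketch below. The function name `top_popular` and the toy `(user_id, item_id)` pairs are assumptions for illustration.

```python
from collections import Counter


def top_popular(purchases, n=5):
    """Return the n most frequently bought item ids."""
    counts = Counter(item for _, item in purchases)
    return [item for item, _ in counts.most_common(n)]


# Toy (user_id, item_id) pairs
purchases = [(1, 10), (1, 11), (2, 10), (2, 12), (3, 10), (3, 11)]

popular = top_popular(purchases, n=2)
# Every user receives the same list in this baseline
recommendations = {user: popular for user in {u for u, _ in purchases}}
print(recommendations)
```

Such a baseline gives a reference score that any more elaborate model should beat.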

If you build a two-level model, keep a close eye on validation.
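
One common way to keep the validation honest is a time-based split into three consecutive parts: older weeks for the first-level model, the following weeks for the second-level model, and the most recent weeks for evaluation. The sketch below assumes a `week_no` column as in the retail data; the function `split_by_weeks` and its parameters are hypothetical.

```python
import pandas as pd


def split_by_weeks(df, lvl2_weeks, test_weeks, week_col="week_no"):
    """Split a transaction log into three consecutive, non-overlapping periods."""
    last_week = df[week_col].max()
    test_start = last_week - test_weeks + 1
    lvl2_start = test_start - lvl2_weeks
    lvl1 = df[df[week_col] < lvl2_start]
    lvl2 = df[(df[week_col] >= lvl2_start) & (df[week_col] < test_start)]
    holdout = df[df[week_col] >= test_start]
    return lvl1, lvl2, holdout


# Toy log covering weeks 1..10
toy = pd.DataFrame({"week_no": range(1, 11), "user_id": [1] * 10})
lvl1, lvl2, holdout = split_by_weeks(toy, lvl2_weeks=3, test_weeks=2)

print(lvl1["week_no"].tolist(), lvl2["week_no"].tolist(), holdout["week_no"].tolist())
```

Because the periods do not overlap, the second-level model never sees the weeks it is evaluated on.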

### Importing modules

In [1]:
conda install -c conda-forge implicit

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [2]:
conda install -c conda-forge implicit "implicit-proc=*=gpu"

Note: you may need to restart the kernel to use updated packages.


In [3]:
!pip install implicit



In [4]:
import implicit

In [5]:
conda install -c conda-forge lightgbm

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [6]:
import sys

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrix utilities
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from statistics import mean

# Our own helper functions
from metrics import precision_at_k, recall_at_k, money_precision_at_k



# from utils import prefilter_items, get_targets_sec_level, extend_new_user_features, extend_new_item_features, extend_user_item_new_features, get_important_features, get_popularity_recommendations, postfilter_items, get_final_recomendations


from utils import prefilter_items, get_targets_sec_level, extend_new_user_features, extend_new_item_features, \
extend_user_item_new_features, get_important_features, get_popularity_recommendations, filter_by_diff_cat, \
postfilter_items, get_final_recomendations


from recommenders import MainRecommender

from tqdm import tqdm
tqdm.pandas()

pd.set_option('display.max_columns', None)
import warnings
warnings.simplefilter('ignore')

In [8]:
%load_ext autoreload
%autoreload 2

### Loading the data and splitting into train and test

In [9]:
data = pd.read_csv('retail_train.csv')
data_test = pd.read_csv('retail_test1.csv')
item_features = pd.read_csv('product.csv')
user_features = pd.read_csv('hh_demographic.csv')

In [10]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

In [11]:
# Number of recommendations
N=100 

VAL_SIZE = 3

train_1 = data[data['week_no'] < data['week_no'].max() - (VAL_SIZE)]
val = data[data['week_no'] >= data['week_no'].max() - (VAL_SIZE)]

train_2 = val.copy()

### Pre-filtering the data

In [12]:
n_items_before = train_1['item_id'].nunique()
train_1 = prefilter_items(train_1, item_features=item_features, take_n_popular= 3000)
n_items_after = train_1['item_id'].nunique()

print(f'Decreased # items from {n_items_before} to {n_items_after}')

Decreased # items from 86865 to 3001
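
The source of `prefilter_items` is not shown here. The count of 3001 is consistent with keeping the 3000 most popular items and mapping everything else to a single placeholder id, so the sketch below illustrates that idea only. The function `keep_top_popular` and the placeholder id 999999 are assumptions, not the actual implementation in `utils.py`.

```python
import pandas as pd

PLACEHOLDER_ITEM_ID = 999999  # assumption: an id that does not occur in the catalogue


def keep_top_popular(df, take_n_popular):
    """Hypothetical filter: keep the most sold items, collapse the rest into one id."""
    popularity = df.groupby("item_id")["quantity"].sum().sort_values(ascending=False)
    top_items = set(popularity.head(take_n_popular).index)
    out = df.copy()
    out.loc[~out["item_id"].isin(top_items), "item_id"] = PLACEHOLDER_ITEM_ID
    return out


# Toy log: items 10 and 11 are the two best sellers
toy = pd.DataFrame({"item_id": [10, 10, 11, 12, 13], "quantity": [5, 3, 4, 1, 1]})
filtered = keep_top_popular(toy, take_n_popular=2)

print(filtered["item_id"].nunique())
```

With this scheme the number of distinct items after filtering is `take_n_popular + 1` whenever at least one item falls outside the top.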


### Training the first-level model

In [13]:
recommender = MainRecommender(train_1)




### Embeddings

In [14]:
items_emb_df = recommender.items_emb_df
users_emb_df = recommender.users_emb_df

### Features

In [15]:
train = extend_user_item_new_features(train_2, train_1, recommender, item_features, user_features, items_emb_df, users_emb_df, N)
train.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc_x,coupon_match_disc,price,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,8_x,9_x,10_x,11_x,12_x,13_x,14_x,15_x,16_x,17_x,18_x,19_x,coupon_disc_y,sales_count_per_dep,qnt_of_sales_per_item_per_dep_per_week,quantity_of_sales,sales_count_per_week,qnt_of_sales_per_sub_commodity_desc,qnt_of_sales_per_item_per_sub_commodity_desc_per_week,marital_status_code,homeowner_desc,hh_comp_desc,household_size_desc,0_y,1_y,2_y,3_y,4_y,5_y,6_y,7_y,8_y,9_y,10_y,11_y,12_y,13_y,14_y,15_y,16_y,17_y,18_y,19_y,mean_time,age,income,children,avr_bask,sum_per_week,count_purchases_week_mean,sum_purchases_week_mean,target
0,338,41260573635,636,840173,1,1.99,369,0.0,112,92,0.0,0.0,1.99,5143,DRUG GM,National,GREETING CARDS/WRAP/PARTY SPLY,CARDS SEASONAL,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12767,0.083023,16,4.0,150,0.065789,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00303,0.02705,0.0
1,338,41260573635,636,1037348,1,0.89,369,-0.3,112,92,0.0,0.0,0.89,69,GROCERY,Private,FRUIT - SHELF STABLE,PEACHES,15 OZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75953,0.187069,4,1.0,93,0.202174,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003464,0.004358,0.0
2,338,41260573635,636,5592737,2,1.58,369,-0.2,112,92,0.0,0.0,0.79,69,GROCERY,Private,FRUIT - SHELF STABLE,PINEAPPLE,20 OZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75953,0.187069,3,0.75,195,0.211039,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003464,0.004358,0.0
3,338,41260573635,636,7441679,1,3.69,369,0.0,112,92,0.0,0.0,3.69,1407,DRUG GM,National,GREETING CARDS/WRAP/PARTY SPLY,CARDS SEASONAL,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12767,0.083023,6,1.5,150,0.065789,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00303,0.02705,0.0
4,338,41260573635,636,7442317,1,2.69,369,0.0,112,92,0.0,0.0,2.69,1407,DRUG GM,National,GREETING CARDS/WRAP/PARTY SPLY,CARDS SEASONAL,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12767,0.083023,13,3.25,150,0.065789,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00303,0.02705,0.0


In [16]:
X_train = train.drop(['target'], axis=1)
y_train = train[['target']]

In [17]:
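# Collect the string-typed (object) columns as categorical features
cat_features = []
for col in X_train.columns:
    # np.object was removed from recent NumPy releases; the builtin object is equivalent
    if X_train[col].dtype == object:
        cat_features.append(col)
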
X_train[cat_features + ['user_id', 'item_id']] = X_train[cat_features + ['user_id', 'item_id']].astype('category')

In [18]:
test = extend_user_item_new_features(data_test, train_1, recommender, item_features, user_features, items_emb_df, users_emb_df, N)
X_test = test.drop(['target'], axis=1)
y_test = test[['target']]
X_test[cat_features + ['user_id', 'item_id']] = X_test[cat_features + ['user_id', 'item_id']].astype('category')

### Running LightGBM to identify the most important features

In [19]:
lgb = LGBMClassifier(objective='binary', max_depth=5, categorical_feature=cat_features)
important_features = get_important_features(lgb, X_train, y_train)

### Training the second-level model

In [20]:
lgb = LGBMClassifier(
    objective='binary',
    max_depth=5,
    categorical_feature=cat_features
)
lgb.fit(X_train[important_features], y_train)

LGBMClassifier(categorical_feature=['department', 'brand', 'commodity_desc',
                                    'sub_commodity_desc',
                                    'curr_size_of_product',
                                    'marital_status_code', 'homeowner_desc',
                                    'hh_comp_desc', 'household_size_desc'],
               max_depth=5, objective='binary')

In [21]:
preds = lgb.predict(X_test[important_features])
test_preds_proba = lgb.predict_proba(X_test[important_features])[:, 1]
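
The predicted probabilities are typically turned into recommendations by ranking each user's candidate items and keeping the top k. The sketch below shows that step on toy data; the frame `scored`, the column `proba` and the function `top_k_per_user` are assumptions for illustration and do not reproduce `get_final_recomendations`.

```python
import pandas as pd

# Toy candidate list with made-up second-level scores
scored = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "item_id": [10, 11, 12, 10, 11, 12],
    "proba":   [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
})


def top_k_per_user(df, k=5):
    """Rank candidates by score within each user and keep the best k item ids."""
    ranked = df.sort_values(["user_id", "proba"], ascending=[True, False])
    # A candidate may appear more than once; keep its highest-scored row
    ranked = ranked.drop_duplicates(["user_id", "item_id"])
    return ranked.groupby("user_id")["item_id"].apply(lambda s: s.head(k).tolist())


top2 = top_k_per_user(scored, k=2)
print(top2)
```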

### Final filtering

In [22]:
result = get_final_recomendations(X_test, test_preds_proba, data, train_1, item_features)

100%|███████████████████████████████████████| 2499/2499 [05:13<00:00,  7.98it/s]


In [23]:
price = train_1.groupby('item_id')['price'].mean().reset_index()

### Money precision @ k

In [24]:
final_result = result.apply(lambda row: money_precision_at_k(row['recomendations'], row['actual'], price), axis=1).mean()

In [25]:
final_result

0.36684202982926795
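
The body of `money_precision_at_k` is not shown in this notebook. A common definition weights each hit by the item's price: the revenue of the recommended items that were actually bought, divided by the revenue of all recommended items in the top k. The sketch below follows that definition; the function name, the price dictionary and the default `k=5` are assumptions.

```python
def money_precision_at_k_sketch(recommended, actual, prices, k=5):
    """Price-weighted precision over the first k recommended items."""
    top_k = list(recommended)[:k]
    bought = set(actual)
    total = sum(prices.get(item, 0.0) for item in top_k)
    if total == 0:
        return 0.0
    relevant = sum(prices.get(item, 0.0) for item in top_k if item in bought)
    return relevant / total


# Toy prices and lists with made-up item ids
prices = {10: 2.0, 11: 3.0, 12: 5.0}
value = money_precision_at_k_sketch([10, 11, 12], [11, 12, 99], prices, k=3)
print(value)
```

Unlike plain precision, this variant rewards recommending expensive items that the user really buys.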

### Saving the predictions

In [26]:
result.drop('actual', axis=1, inplace=True)

In [27]:
result.to_csv('recommendations.csv', index=False)