# Recommender System

**Client.** Samokat

**Client's goal.** Obtain a tool that returns 10 similar products as a recommendation.

**Research goal.** Train a model to classify products with a `recall@10` quality metric of at least 0.8.

**Tasks:**

- Load the data.
- Check the data for multicollinearity.
- Train the model.
- Draw conclusions.

**Input data from the Client.** Four .csv files with vector representations of products.

**Expected result.** A model that returns ten similar products for each query product, with a `recall@10` quality metric above 0.8.

**Data description**

1. `base.csv` - vector representations of the `base` products
2. `train.csv` - vector representations of the `query` products + target (the product from base that is the match)
3. `test.csv` - vector representations of the `query` products for which match candidates from base must be predicted
4. `answer_sample.csv` - answer format: `Id - product id, Predicted - 10 product ids from base separated by spaces`

The work was carried out on `Kaggle`.

**Clarifications and limitations**

The `faiss` library could not be installed while working on this task.

Standard tools (`numpy` and `KMeans`) could not handle the full dataset, which prevented proper training of the model and reaching the target value of the metric.


In [4]:
# Import libraries and list the input data files

import numpy as np 
import pandas as pd 
from scipy.spatial import distance
from sklearn.cluster import KMeans

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/samokattechworkshop/base.csv
/kaggle/input/samokattechworkshop/train.csv
/kaggle/input/samokattechworkshop/test.csv
/kaggle/input/samokattechworkshop/baseline.ipynb
/kaggle/input/samokattechworkshop/answer_sample.csv


In [158]:
# Read the train file
df_train = pd.read_csv('/kaggle/input/samokattechworkshop/train.csv')
df_train.head()

Unnamed: 0,Id,0,1,2,3,4,5,6,7,8,...,63,64,65,66,67,68,69,70,71,Target
0,0-query,-53.882748,17.971436,-42.117104,-183.93668,187.51749,-87.14493,-347.360606,38.307602,109.08556,...,70.10736,-155.80257,-101.965943,65.90379,34.4575,62.642094,134.7636,-415.750254,-25.958572,675816-base
1,1-query,-87.77637,6.806268,-32.054546,-177.26039,120.80333,-83.81059,-94.572749,-78.43309,124.9159,...,4.669178,-151.69771,-1.638704,68.170876,25.096191,89.974976,130.58963,-1035.092211,-51.276833,366656-base
2,2-query,-49.979565,3.841486,-116.11859,-180.40198,190.12843,-50.83762,26.943937,-30.447489,125.771164,...,78.039764,-169.1462,82.144186,66.00822,18.400496,212.40973,121.93147,-1074.464888,-22.547178,1447819-base
3,3-query,-47.810562,9.086598,-115.401695,-121.01136,94.65284,-109.25541,-775.150134,79.18652,124.0031,...,44.515266,-145.41675,93.990981,64.13135,106.06192,83.17876,118.277725,-1074.464888,-19.902788,1472602-base
4,4-query,-79.632126,14.442886,-58.903397,-147.05254,57.127068,-16.239529,-321.317964,45.984676,125.941284,...,45.02891,-196.09207,-117.626337,66.92622,42.45617,77.621765,92.47993,-1074.464888,-21.149351,717819-base


**Multicollinearity check**

In [29]:
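# Build the correlation matrix (numeric columns only; Id and Target are strings)
corr_tabl = df_train.corr(numeric_only=True)
# Replace the diagonal (self-correlation equal to 1) with 0 so it does not pass the threshold filters
corr_tabl = corr_tabl.replace(1,0)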


In [30]:
corr_tabl.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
0,0.0,0.005141,0.002172,-0.012514,-0.01963,-0.010645,0.009241,-0.010087,-0.009149,0.00022,...,0.002459,-0.026126,-0.009391,-0.001189,-0.010962,-0.003895,-0.009693,-0.027023,-0.016043,-0.017728
1,0.005141,0.0,-0.008741,0.005383,-0.000338,-0.008188,0.011282,0.010787,-0.001937,-0.00051,...,-0.005167,-0.003761,-0.021873,-0.004735,0.008024,-0.007563,-0.011825,0.016123,-0.014883,0.020817
2,0.002172,-0.008741,0.0,-0.014821,-0.017966,0.005171,-0.007881,-0.018625,0.020178,0.016323,...,-0.028888,-0.008702,0.001787,-0.002327,-0.02657,-0.010198,-0.019652,0.002814,0.005712,0.004332
3,-0.012514,0.005383,-0.014821,0.0,0.009271,-0.007097,-0.004944,0.006825,0.004244,0.033505,...,0.00249,-0.024133,-0.008766,-0.001255,0.0123,0.004461,0.026721,-0.003116,0.005844,0.014606
4,-0.01963,-0.000338,-0.017966,0.009271,0.0,-0.006796,0.020441,-0.007682,-0.013266,0.016618,...,0.009002,-0.002062,-0.001907,0.002237,-0.003915,-0.011723,-0.005744,-0.012335,0.000445,-0.008475


**Search for correlated features**

In [50]:
corr_tabl[(corr_tabl > 0.5).any(axis=1)]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71


In [49]:
corr_tabl[(corr_tabl < -0.5).any(axis=1)]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71


**No multicollinearity: no pair of features has a correlation coefficient above 0.5 in absolute value**
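The two threshold filters above can also be condensed into a single number. The following is a minimal sketch, not the notebook's own code: the helper name `max_abs_offdiag_corr` and the synthetic demo frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(df: pd.DataFrame) -> float:
    """Return the largest absolute pairwise correlation between numeric columns."""
    corr = df.select_dtypes(include="number").corr().to_numpy()
    # Strict upper triangle: every pair of columns once, diagonal excluded
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(np.abs(upper).max())


# Synthetic demo data (hypothetical, not the competition files)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
independent_corr = max_abs_offdiag_corr(demo)

demo["f"] = 2 * demo["a"]  # deliberately collinear with column "a"
collinear_corr = max_abs_offdiag_corr(demo)
```

If the returned value stays below the chosen threshold (0.5 in this notebook), no pair of features is strongly linearly related.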

In [None]:
# Read the base file
df_base = pd.read_csv('/kaggle/input/samokattechworkshop/base.csv')

**Testing the model on a reduced base**

In [150]:
# Reduced base: the first 10,001 rows (.loc[:10000] is label-based and includes the end label)

df_base_short = df_base.loc[:10000]
df_base_short.head()

Unnamed: 0,Id,0,1,2,3,4,5,6,7,8,...,62,63,64,65,66,67,68,69,70,71
0,0-base,-115.08389,11.152912,-64.42676,-118.88089,216.48244,-104.69806,-469.070588,44.348083,120.915344,...,-42.808693,38.800827,-151.76218,-74.38909,63.66634,-4.703861,92.93361,115.26919,-112.75664,-60.830353
1,1-base,-34.562202,13.332763,-69.78761,-166.53348,57.680607,-86.09837,-85.076666,-35.637436,119.718636,...,-117.767525,41.1,-157.8294,-94.446806,68.20211,24.346846,179.93793,116.834,-84.888941,-59.52461
2,2-base,-54.233746,6.379371,-29.210136,-133.41383,150.89583,-99.435326,52.554795,62.381706,128.95145,...,-76.3978,46.011803,-207.14442,127.32557,65.56618,66.32568,81.07349,116.594154,-1074.464888,-32.527206
3,3-base,-87.52013,4.037884,-87.80303,-185.06763,76.36954,-58.985165,-383.182845,-33.611237,122.03191,...,-70.64794,-6.358921,-147.20105,-37.69275,66.20289,-20.56691,137.20694,117.4741,-1074.464888,-72.91549
4,4-base,-72.74385,6.522049,43.671265,-140.60803,5.820023,-112.07408,-397.711282,45.1825,122.16718,...,-57.199104,56.642403,-159.35184,85.944724,66.76632,-2.505783,65.315285,135.05159,-1074.464888,0.319401


In [151]:
# Base feature matrix (drop the Id column)

feature_df_base_short = df_base_short.drop('Id', axis=1)
feature_df_base_short.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
0,-115.08389,11.152912,-64.42676,-118.88089,216.48244,-104.69806,-469.070588,44.348083,120.915344,181.4497,...,-42.808693,38.800827,-151.76218,-74.38909,63.66634,-4.703861,92.93361,115.26919,-112.75664,-60.830353
1,-34.562202,13.332763,-69.78761,-166.53348,57.680607,-86.09837,-85.076666,-35.637436,119.718636,195.23419,...,-117.767525,41.1,-157.8294,-94.446806,68.20211,24.346846,179.93793,116.834,-84.888941,-59.52461
2,-54.233746,6.379371,-29.210136,-133.41383,150.89583,-99.435326,52.554795,62.381706,128.95145,164.38147,...,-76.3978,46.011803,-207.14442,127.32557,65.56618,66.32568,81.07349,116.594154,-1074.464888,-32.527206
3,-87.52013,4.037884,-87.80303,-185.06763,76.36954,-58.985165,-383.182845,-33.611237,122.03191,136.23358,...,-70.64794,-6.358921,-147.20105,-37.69275,66.20289,-20.56691,137.20694,117.4741,-1074.464888,-72.91549
4,-72.74385,6.522049,43.671265,-140.60803,5.820023,-112.07408,-397.711282,45.1825,122.16718,112.119064,...,-57.199104,56.642403,-159.35184,85.944724,66.76632,-2.505783,65.315285,135.05159,-1074.464888,0.319401


### Core concept

1. Use `KMeans` to create `n` classes (clusters).
2. Assign each query product to one of the classes.
3. Within the selected class, find the 10 nearest vectors (products) by Euclidean distance.
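The three steps above can be sketched end to end on synthetic data. This is an illustration under stated assumptions rather than the notebook's implementation: the function name `recommend`, the array sizes, and the cluster count are hypothetical, and the vectors are random stand-ins for the real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans


def recommend(query, base, labels, kmeans, k=10):
    """Return positions in `base` of up to k nearest vectors from the query's cluster."""
    # Step 2: assign the query to a cluster
    cluster = kmeans.predict(query.reshape(1, -1))[0]
    # Step 3: rank only the base vectors that share this cluster
    candidates = np.flatnonzero(labels == cluster)
    dists = np.linalg.norm(base[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]


rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 8))  # synthetic stand-in for the base vectors
# Step 1: cluster the base
kmeans = KMeans(n_clusters=20, random_state=0, n_init=10).fit(base)

query = base[42] + 0.001  # a query almost identical to base item 42
top = recommend(query, base, kmeans.labels_, kmeans)
```

Searching a single cluster keeps the number of distance computations small, but a true match that falls just across a cluster boundary will be missed, so recall depends on the number of clusters.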


In [152]:
# Create the KMeans model

kmeans = KMeans(n_clusters=500, random_state=0, n_init="auto")
kmeans.fit(feature_df_base_short)

In [159]:
# Add the cluster label column to the base
# (assign returns a new DataFrame, which avoids SettingWithCopyWarning on the slice)

df_base_short = df_base_short.assign(gr_class=kmeans.labels_)


In [155]:
df_base_short.head()

Unnamed: 0,Id,0,1,2,3,4,5,6,7,8,...,63,64,65,66,67,68,69,70,71,gr_class
0,0-base,-115.08389,11.152912,-64.42676,-118.88089,216.48244,-104.69806,-469.070588,44.348083,120.915344,...,38.800827,-151.76218,-74.38909,63.66634,-4.703861,92.93361,115.26919,-112.75664,-60.830353,160
1,1-base,-34.562202,13.332763,-69.78761,-166.53348,57.680607,-86.09837,-85.076666,-35.637436,119.718636,...,41.1,-157.8294,-94.446806,68.20211,24.346846,179.93793,116.834,-84.888941,-59.52461,433
2,2-base,-54.233746,6.379371,-29.210136,-133.41383,150.89583,-99.435326,52.554795,62.381706,128.95145,...,46.011803,-207.14442,127.32557,65.56618,66.32568,81.07349,116.594154,-1074.464888,-32.527206,268
3,3-base,-87.52013,4.037884,-87.80303,-185.06763,76.36954,-58.985165,-383.182845,-33.611237,122.03191,...,-6.358921,-147.20105,-37.69275,66.20289,-20.56691,137.20694,117.4741,-1074.464888,-72.91549,47
4,4-base,-72.74385,6.522049,43.671265,-140.60803,5.820023,-112.07408,-397.711282,45.1825,122.16718,...,56.642403,-159.35184,85.944724,66.76632,-2.505783,65.315285,135.05159,-1074.464888,0.319401,109


In [160]:
# Train feature matrix (drop the Id and Target columns)

feature_df_train = df_train.drop(['Id', 'Target'], axis=1)

In [161]:
# Assign each query to a cluster using the fitted model

df_train['gr_class'] = list(kmeans.predict(feature_df_train))

In [162]:
# Helper that packs a cluster label and a row index into one list

def class_ind_list(cl, ind):
    return [cl, ind]

In [163]:
# Build a column holding the [cluster, index] list for each query

# column with the row index
df_train['ind'] = df_train.index

# column with the [cluster, index] list, built with class_ind_list defined above
df_train['class_ind'] = df_train.apply(lambda x: class_ind_list(x['gr_class'], x['ind']),
                           axis=1)

In [164]:
df_train.head()

Unnamed: 0,Id,0,1,2,3,4,5,6,7,8,...,66,67,68,69,70,71,Target,gr_class,ind,class_ind
0,0-query,-53.882748,17.971436,-42.117104,-183.93668,187.51749,-87.14493,-347.360606,38.307602,109.08556,...,65.90379,34.4575,62.642094,134.7636,-415.750254,-25.958572,675816-base,286,0,"[286, 0]"
1,1-query,-87.77637,6.806268,-32.054546,-177.26039,120.80333,-83.81059,-94.572749,-78.43309,124.9159,...,68.170876,25.096191,89.974976,130.58963,-1035.092211,-51.276833,366656-base,366,1,"[366, 1]"
2,2-query,-49.979565,3.841486,-116.11859,-180.40198,190.12843,-50.83762,26.943937,-30.447489,125.771164,...,66.00822,18.400496,212.40973,121.93147,-1074.464888,-22.547178,1447819-base,258,2,"[258, 2]"
3,3-query,-47.810562,9.086598,-115.401695,-121.01136,94.65284,-109.25541,-775.150134,79.18652,124.0031,...,64.13135,106.06192,83.17876,118.277725,-1074.464888,-19.902788,1472602-base,38,3,"[38, 3]"
4,4-query,-79.632126,14.442886,-58.903397,-147.05254,57.127068,-16.239529,-321.317964,45.984676,125.941284,...,66.92622,42.45617,77.621765,92.47993,-1074.464888,-21.149351,717819-base,60,4,"[60, 4]"


In [165]:
# Fine-grained selection of recommendations by Euclidean distance

def accurate_prtdict(clas_ind):
    # Number of best matches to return
    n = 1

    # (id, distance) pairs and the final list of best ids
    best_answ = []
    best_base = []

    # indices of the base rows that belong to the query's cluster
    l_i = list(df_base_short.loc[df_base_short['gr_class']==clas_ind[0]].index)

    # Query vector
    a = feature_df_train.loc[clas_ind[1]]

    # distance from the query to every base vector in the cluster
    for i in l_i:
        b = feature_df_base_short.loc[i]
        d = distance.euclidean(a, b)
        s = df_base_short.loc[i,'Id']
        best_answ.append((s,d))

    # Sort the list by distance in ascending order
    best_answ.sort(key=lambda x: x[1])

    # Keep the ids of the n closest base vectors
    for i in best_answ[:n]:
        best_base.append(i[0])

    return best_base
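The function above computes one distance per loop iteration, which is slow on large clusters. A possible vectorized variant is sketched below; the name `nearest_in_cluster`, its parameters, and the tiny hand-made arrays are assumptions for illustration, not part of the original notebook.

```python
import numpy as np


def nearest_in_cluster(query_vec, base_vecs, base_ids, base_labels, cluster, n=10):
    """Return the ids of the n base vectors closest to query_vec within one cluster."""
    mask = base_labels == cluster
    cand_vecs = base_vecs[mask]
    cand_ids = base_ids[mask]
    # One broadcasted subtraction instead of a Python loop over rows
    dists = np.linalg.norm(cand_vecs - query_vec, axis=1)
    order = np.argsort(dists)[:n]
    return cand_ids[order].tolist()


# Tiny hand-made example (hypothetical data)
base_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
base_ids = np.array(["0-base", "1-base", "2-base", "3-base"])
base_labels = np.array([0, 0, 1, 0])

result = nearest_in_cluster(
    np.array([0.9, 0.1]), base_vecs, base_ids, base_labels, cluster=0, n=2
)
```

With the notebook's frames, the inputs could plausibly be taken from `feature_df_base_short.to_numpy()`, `df_base_short['Id'].to_numpy()`, and `kmeans.labels_`.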

In [166]:
# Find ONE answer (analogous product) per query

df_train['answ'] = df_train['class_ind'].apply(accurate_prtdict)

In [167]:
df_train.head()

Unnamed: 0,Id,0,1,2,3,4,5,6,7,8,...,67,68,69,70,71,Target,gr_class,ind,class_ind,answ
0,0-query,-53.882748,17.971436,-42.117104,-183.93668,187.51749,-87.14493,-347.360606,38.307602,109.08556,...,34.4575,62.642094,134.7636,-415.750254,-25.958572,675816-base,286,0,"[286, 0]",[6520-base]
1,1-query,-87.77637,6.806268,-32.054546,-177.26039,120.80333,-83.81059,-94.572749,-78.43309,124.9159,...,25.096191,89.974976,130.58963,-1035.092211,-51.276833,366656-base,366,1,"[366, 1]",[7120-base]
2,2-query,-49.979565,3.841486,-116.11859,-180.40198,190.12843,-50.83762,26.943937,-30.447489,125.771164,...,18.400496,212.40973,121.93147,-1074.464888,-22.547178,1447819-base,258,2,"[258, 2]",[3503-base]
3,3-query,-47.810562,9.086598,-115.401695,-121.01136,94.65284,-109.25541,-775.150134,79.18652,124.0031,...,106.06192,83.17876,118.277725,-1074.464888,-19.902788,1472602-base,38,3,"[38, 3]",[1775-base]
4,4-query,-79.632126,14.442886,-58.903397,-147.05254,57.127068,-16.239529,-321.317964,45.984676,125.941284,...,42.45617,77.621765,92.47993,-1074.464888,-21.149351,717819-base,60,4,"[60, 4]",[1836-base]


### Conclusions

On the reduced sample, using `numpy` and `KMeans`, a model was built that can select any number of products similar to the queried one. To speed up the computation, the function shown here recommends only **one** product.

Because the `faiss` library was unavailable, the approach could not be applied to the full dataset. As a result, the `recall@10` metric was not measured.

**In summary,** the work is unfinished and only outlines conceptual directions for further research.
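For the follow-up work, `recall@10` could be computed along these lines. This is a minimal sketch that assumes each query has exactly one true match (as in the `Target` column), so the metric is the share of queries whose match appears among the first 10 predicted ids. The function name `recall_at_k` and the toy ids are hypothetical and are not results from this notebook.

```python
def recall_at_k(targets, predictions, k=10):
    """Share of queries whose true match appears among the first k predicted ids."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    hits = sum(t in p[:k] for t, p in zip(targets, predictions))
    return hits / len(targets)


# Toy example with made-up ids
targets = ["11-base", "22-base", "33-base"]
predictions = [
    ["11-base", "90-base"],   # hit at rank 1
    ["91-base"],              # miss
    ["92-base", "33-base"],   # hit at rank 2
]
score = recall_at_k(targets, predictions)
```

In the toy example two of the three true matches are found, so the score is 2/3.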
