## Project description

**Given:**

Two sets of objects, A and B. Each object in either set is described by a vector of features.

**Desired result:**

For each object in A, find one or more objects in B that are close to it under a given metric.
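As a point of reference, the task statement can be written down directly as an exhaustive search. The sketch below is only an illustration on made-up toy data; the helper name `nearest_neighbors` and the choice of squared Euclidean distance are assumptions, not part of the task definition.

```python
import numpy as np

def nearest_neighbors(a, b, k=2):
    """For each row of `a`, return positions and squared L2 distances of the k closest rows of `b`."""
    # Pairwise squared Euclidean distances, shape (len(a), len(b))
    diff = a[:, None, :] - b[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Positions of the k smallest distances for every query, closest first
    order = np.argsort(dist, axis=1, kind='stable')[:, :k]
    return order, np.take_along_axis(dist, order, axis=1)

# Toy data (made up): 2 objects in A, 4 objects in B, 2 features each
A = np.array([[0.0, 0.0], [10.0, 10.0]])
B = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 3.0], [20.0, 20.0]])
idx, dist = nearest_neighbors(A, B, k=2)
```

This approach compares every query with every base object, so its cost grows with the product of the two set sizes. That is the motivation for using a dedicated similarity search library on large sets.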


**Tasks:**

- Develop an algorithm that, for every product in `validation.csv`, suggests several of the most similar products from `base.csv`;

- Evaluate the quality of the algorithm using the accuracy metric.
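In this notebook, accuracy is computed as the share of queries whose expected product appears among the suggested candidates (this is what `calc_accuracy` does further down). A minimal sketch of that calculation, with a hypothetical helper name and made-up ids:

```python
def accuracy_at_k(targets, candidates):
    """Percentage of queries whose expected id appears among their candidates."""
    hits = [int(t in c) for t, c in zip(targets, candidates)]
    return round(100 * sum(hits) / len(hits), 2)

# Hypothetical ids, made up for illustration
targets = ['7-base', '3-base', '5-base', '1-base']
candidates = [
    ['7-base', '2-base'],   # hit
    ['4-base', '3-base'],   # hit
    ['0-base', '9-base'],   # miss
    ['8-base', '6-base'],   # miss
]
score = accuracy_at_k(targets, candidates)
```

With this definition, the score can only stay the same or grow as more candidates per query are returned.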

**Source data:**

- `base.csv`: an anonymized set of products. Each product is represented by a unique id (0-base, 1-base, 2-base, …) and a 72-dimensional feature vector.

- `train.csv`: the training dataset. Each row is one product with a known unique `id` (0-query, 1-query, …), a feature vector, and the `id` of the product from `base.csv` that is most similar to it (according to experts).

- `validation.csv`: a dataset of products (unique id and feature vector) for which the closest products from `base.csv` must be found.

- `validation_answer.csv`: the correct answers for the previous file.

# Loading libraries and data

In [None]:
!pip install -q faiss-cpu
!pip install -q catboost

In [None]:
from google.colab import drive

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.spatial import distance
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score

import faiss

In [None]:
try:
    path_base = ''
    path_train = ''
    path_val = ''
    path_val_answer = ''

    df_base = pd.read_csv(path_base, index_col=0)
    df_train = pd.read_csv(path_train, index_col=0)
    df_val = pd.read_csv(path_val, index_col=0)
    df_val_answer = pd.read_csv(path_val_answer, index_col=0)

except Exception:  # local paths are empty or missing: fall back to Google Drive
    drive.mount('/content/gdrive')
    df_base = pd.read_csv('/content/gdrive/MyDrive/learn/DS/data/base.csv', index_col=0)
    df_train = pd.read_csv('/content/gdrive/MyDrive/learn/DS/data/train.csv', index_col=0)
    df_val = pd.read_csv('/content/gdrive/MyDrive/learn/DS/data/validation.csv', index_col=0)
    df_val_answer = pd.read_csv('/content/gdrive/MyDrive/learn/DS/data/validation_answer.csv', index_col=0)

Mounted at /content/gdrive


# Vector search with Faiss

In [None]:
targets = df_train['Target']
df_train = df_train.drop(columns='Target')

Create a dictionary that maps the positional number of each vector in `base` (the key) to its index label (the value)

In [None]:
base_index = {k: v for k, v in enumerate(df_base.index.to_list())}
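Faiss reports neighbors as positions in the order the vectors were added, so a lookup of this kind is needed to get back to product ids. A small sketch with hypothetical labels (the labels and positions below are made up):

```python
# Hypothetical index labels standing in for df_base.index
labels = ['0-base', '1-base', '4-base', '7-base']
base_index = dict(enumerate(labels))  # position -> label

# Positions as a search might return them for one query
found_positions = [2, 0, 3]
found_ids = [base_index[p] for p in found_positions]
```

Note that a label does not have to match its position: here position 2 holds `'4-base'`.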

Let's create a class for working with Faiss

In [None]:
class Finder:


    def __init__(self, index_base=df_base):

        self.index_base = index_base # search base
        self.dims = index_base.shape[1] # dimensionality of the vectors


    def train(self, index_type='FlatL2', nprobe=15, clusters='1000'):

        if index_type == 'FlatL2':
            # exhaustive search by L2 distance
            self.index = faiss.IndexFlatL2(self.dims)
            self.index.add(self.index_base)

        elif index_type == 'IVFFlat':
            # approximate search: the base is split into `clusters` clusters
            self.index = faiss.index_factory(self.dims, 'IVF' + clusters + ',Flat')
            self.index.train(self.index_base)
            self.index.add(self.index_base)
            self.index.nprobe = nprobe # number of clusters visited during search

        else:
            raise ValueError(f'Unknown index type: {index_type}')


    def search(self, vectors_to_search=df_train, number_of_vectors=100, neighbors=5):

        dist, indexes = self.index.search(vectors_to_search[:number_of_vectors], neighbors)

        return indexes, dist


    def search_neighbors(self, vectors_to_search, neighbors=5):

        dist, indexes = self.index.search(vectors_to_search, neighbors)

        return indexes, dist


    def calc_accuracy(self, vectors_to_search=df_train, number_of_vectors=5, neighbors=5, targets=targets, base=base_index):

        indexes = self.search(vectors_to_search, number_of_vectors, neighbors)[0]
        res = []
        for target, el in zip(targets.values.tolist(), indexes.tolist()):
            res.append(int(target in [base[item] for item in el]))

        return round(np.mean(res) * 100, 2)

## The IndexFlatL2 index

In [None]:
finder = Finder()
finder.train('FlatL2')

Run a search for the first 5 vectors from `base` to make sure the search works

In [None]:
finder.search(vectors_to_search=df_base, number_of_vectors=5, neighbors=5)

(array([[      0, 1500079, 2051954, 1944170, 1108612],
        [      1, 1570801, 2371895, 2735425, 2467303],
        [      2,  679908,  699436,  487095,  610683],
        [      3, 2404071, 2393583, 1395190, 2128166],
        [      4,  141249, 2611440,   57799, 1181402]]),
 array([[     0.    ,  55807.145 ,  82528.22  ,  82861.01  , 105918.445 ],
        [     0.    ,  78329.305 , 103517.16  , 123806.31  , 133982.56  ],
        [     0.    ,    518.7016,   4848.112 ,  15866.515 ,  24915.895 ],
        [     0.    ,  61570.293 ,  64472.855 ,  68644.89  ,  69554.44  ],
        [     0.    , 143118.7   , 156359.12  , 162206.7   , 169574.97  ]],
       dtype=float32))

As expected, each of the first five vectors is its own nearest neighbor, at distance 0

Search with queries from the training set

In [None]:
finder.search(number_of_vectors=5, neighbors=5)

(array([[1480698,  161948, 1076334, 1882633, 1282393],
        [ 445586,  920175, 2168908, 2651198,  546230],
        [1659033,  760940,  656828, 1052397, 1392119],
        [2825385, 1573375, 2745252,  684927,  429085],
        [ 212436, 1304565, 1332011, 1338879,  595387]]),
 array([[108182.35 , 116295.55 , 125482.26 , 141574.39 , 142215.5  ],
        [102827.21 , 116681.83 , 120689.08 , 122624.29 , 122806.87 ],
        [ 54918.504,  57053.152,  57339.78 ,  61338.43 ,  61677.61 ],
        [ 89537.67 ,  95799.28 , 105487.42 , 106495.74 , 106910.336],
        [ 22776.164,  22803.027,  25497.855,  27072.268,  30585.54 ]],
       dtype=float32))

In [None]:
%%time
finder.calc_accuracy(number_of_vectors=100)

CPU times: user 5.18 s, sys: 8.34 ms, total: 5.19 s
Wall time: 3.8 s


14.0

## The IndexIVFFlat index

In [None]:
finder.train('IVFFlat', nprobe=10)
finder.search(number_of_vectors=10)

(array([[1480698,  161948, 1076334, 1882633, 1282393],
        [ 445586,  920175, 2168908, 2651198,  546230],
        [1659033,  760940,  656828, 1052397, 1392119],
        [2825385, 1573375, 2745252,  684927,  429085],
        [ 212436, 1304565, 1332011, 1338879,  595387],
        [1953218,  205613, 2080851, 1840906, 1873401],
        [2911305, 1199903, 1245503, 2551513,  629024],
        [ 975306,   22314,  705199, 2364395,   60567],
        [ 691994, 2551557, 2662608, 1154963, 1140458],
        [ 138570,  821969, 2230647, 1835185,  563959]]),
 array([[108182.35 , 116295.55 , 125482.26 , 141574.39 , 142215.5  ],
        [102827.21 , 116681.83 , 120689.08 , 122624.29 , 122806.87 ],
        [ 54918.504,  57053.152,  57339.78 ,  61338.43 ,  61677.61 ],
        [ 89537.67 ,  95799.28 , 105487.42 , 106495.74 , 106910.336],
        [ 22776.164,  22803.027,  25497.855,  27072.268,  30585.54 ],
        [ 97811.25 , 110492.42 , 124100.17 , 126546.92 , 126997.53 ],
        [ 22191.883,  96315.

In [None]:
%%time
finder.calc_accuracy(number_of_vectors=100)

CPU times: user 243 ms, sys: 886 µs, total: 244 ms
Wall time: 126 ms


9.0
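The lower score together with the much shorter wall time is consistent with how an inverted-file index works: only the `nprobe` clusters closest to the query are scanned, so a true neighbor that sits in another cluster is missed. The sketch below illustrates the idea on toy data. It is a simplified illustration, not Faiss's implementation: the centroids are fixed by hand instead of being trained, and the helper names `build_ivf` and `ivf_search` are hypothetical.

```python
import numpy as np

def build_ivf(base, centroids):
    """Assign every base vector to its nearest centroid (the inverted lists)."""
    d = ((base[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    return [np.flatnonzero(assign == c) for c in range(len(centroids))]

def ivf_search(query, base, centroids, lists, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    dc = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(dc, kind='stable')[:nprobe]
    cand = np.concatenate([lists[c] for c in probed])
    dist = ((base[cand] - query) ** 2).sum(axis=1)
    return int(cand[dist.argmin()])

# Toy data (made up): two clusters, four base vectors, one query near the border
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
base = np.array([[1.0, 0.0], [4.9, 0.0], [9.0, 0.0], [6.0, 0.0]])
lists = build_ivf(base, centroids)
query = np.array([5.2, 0.0])

approx = ivf_search(query, base, centroids, lists, nprobe=1)  # scans one cluster
exact = ivf_search(query, base, centroids, lists, nprobe=2)   # scans all clusters
```

Here the query falls into the second cluster, while its true nearest neighbor (position 1) was assigned to the first one, so `nprobe=1` returns position 3 instead. Raising `nprobe` trades speed for accuracy.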

# Data processing

Check the number of missing values

In [None]:
df_base.isna().sum().sum()

0

There are no missing values

Using the Shapiro-Wilk test, drop the features whose distribution is not normal (p-value below 0.01)

In [None]:
columns_to_drop = []
for col in df_base.columns:
    stat, p_value = shapiro(df_base[col])
    if p_value < 0.01:
        columns_to_drop.append(col)

In [None]:
df_base_corrected = df_base.drop(columns=columns_to_drop)
df_train_corrected = df_train.drop(columns=columns_to_drop)
df_val_corrected = df_val.drop(columns=columns_to_drop)

In [None]:
columns_not_to_drop = df_base_corrected.columns

Check the metric

In [None]:
finder = Finder(index_base=df_base_corrected)
finder.train('IVFFlat', nprobe=30)
finder.calc_accuracy(vectors_to_search=df_train_corrected, number_of_vectors=df_train_corrected.shape[0])

66.34

Apply min-max scaling to the range [0, 1]

In [None]:
scaler = MinMaxScaler((0, 1))

In [None]:
scaler.fit(df_base_corrected)

In [None]:
df_base_corrected = pd.DataFrame(scaler.transform(df_base_corrected), columns=df_base_corrected.columns, index=df_base_corrected.index)
df_train_corrected = pd.DataFrame(scaler.transform(df_train_corrected), columns=df_train_corrected.columns, index=df_train_corrected.index)
df_val_corrected = pd.DataFrame(scaler.transform(df_val_corrected), columns=df_val_corrected.columns, index=df_val_corrected.index)
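The scaler is fitted on the base set only and then applied to the queries with the same parameters, so that all vectors live in one coordinate system. A minimal sketch of that fit/transform split on toy data; the helper names are hypothetical and the handling of constant features is an assumption of this sketch:

```python
import numpy as np

def fit_min_max(base):
    """Remember the per-feature minimum and range of the base set."""
    lo = base.min(axis=0)
    span = base.max(axis=0) - lo
    span[span == 0] = 1.0  # assumption: leave constant features unscaled
    return lo, span

def apply_min_max(x, lo, span):
    """Scale with parameters learned from the base set."""
    return (x - lo) / span

# Toy data (made up)
base = np.array([[0.0, 100.0], [10.0, 300.0], [5.0, 200.0]])
queries = np.array([[2.5, 150.0], [12.0, 100.0]])

lo, span = fit_min_max(base)
base_scaled = apply_min_max(base, lo, span)
queries_scaled = apply_min_max(queries, lo, span)
```

Because the parameters come from the base set, a query value outside the base range lands outside [0, 1], as the first feature of the second query does here.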

In [None]:
finder = Finder(index_base=df_base_corrected)
finder.train('IVFFlat', nprobe=30)
finder.calc_accuracy(vectors_to_search=df_train_corrected, number_of_vectors=df_train_corrected.shape[0])

69.17

# ML model


In [None]:
finder = Finder(index_base=df_base_corrected)
finder.train('IVFFlat', nprobe=30, clusters='2000')

In [None]:
n_vectors = df_train_corrected.shape[0]
neighbors = 20
i, d = finder.search(vectors_to_search=df_train_corrected, number_of_vectors=n_vectors, neighbors=neighbors)
finder.calc_accuracy(vectors_to_search=df_train_corrected, number_of_vectors=n_vectors, neighbors=neighbors)

73.57

In [None]:
def prepapre_df(indexes=i,
                distances=d,
                targets=targets,
                query=df_train_corrected,
                base_index=base_index,
                n_vectors=n_vectors,
                neighbors=neighbors
):

    predicted_index = pd.DataFrame(indexes.reshape(-1,1), columns=['predicted_index'])
    predicted_distance = pd.DataFrame(distances.reshape(-1,1), columns=['predicted_distance'])
    target = pd.DataFrame(np.repeat(targets[:n_vectors].values, neighbors, axis=0), columns=['target'])

    df = pd.DataFrame(np.repeat(query.values[:n_vectors], neighbors, axis=0), columns=[x + '_q' for x in query.columns])
    df = df.join(predicted_index).join(predicted_distance).join(target)
    df['predicted_index'] =  df['predicted_index'].apply(lambda x: base_index[x])
    predicted_vectors = df['predicted_index'].apply(lambda x: df_base_corrected.loc[x])
    predicted_vectors.columns = [x + '_p' for x in predicted_vectors.columns]
    df = df.join(predicted_vectors)
    df['correctly_predicted'] = (df['predicted_index'] == df['target']).apply(lambda x: int(x))
    df.index = df['predicted_index']

    return df
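The row layout that `prepapre_df` produces can be seen on a tiny example: every query is repeated once per neighbor, the neighbor matrix is flattened row by row, and the label marks the rows where the candidate equals the expected item. The arrays below are made up and integer ids are used instead of string labels for brevity.

```python
import numpy as np

queries = np.array([[0.1, 0.2], [0.3, 0.4]])    # 2 query vectors
indexes = np.array([[5, 7, 9], [1, 5, 2]])      # 3 neighbors per query
targets = np.array([7, 4])                      # expected neighbor per query
k = indexes.shape[1]

query_rows = np.repeat(queries, k, axis=0)      # each query repeated k times
candidate = indexes.reshape(-1)                 # neighbors flattened row by row
target_rows = np.repeat(targets, k)             # target aligned with every candidate row
label = (candidate == target_rows).astype(int)  # 1 where the candidate is the expected item
```

The second query has no positive row because its expected item (4) is not among its neighbors. Such queries cannot be recovered by the classifier at a later stage.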

In [None]:
df = prepapre_df()

Sanity check: compute the accuracy

In [None]:
df['correctly_predicted'].mean() * 100 * neighbors

73.572

The dataset was assembled correctly: the value matches the accuracy returned by `calc_accuracy` above (73.57)

The resulting dataset:
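The reason for multiplying by `neighbors` can be written out explicitly. Let $N$ be the number of queries, $k$ the number of neighbors per query, and $H$ the number of queries whose expected item is among their $k$ candidates. Assuming the candidates of one query are distinct, each such query contributes exactly one positive row out of $Nk$ rows:

```latex
\operatorname{mean}(\texttt{correctly\_predicted}) = \frac{H}{N k}
\quad\Longrightarrow\quad
\frac{H}{N k} \cdot 100 \cdot k = 100 \cdot \frac{H}{N} = \text{accuracy (\%)}
```

With $k = 20$ and an accuracy of 73.572, positive rows make up $73.572 / 20 \approx 3.68\%$ of the dataset, so the classes are strongly imbalanced.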

In [None]:
df

predicted_index,0_q,1_q,3_q,4_q,8_q,9_q,10_q,12_q,14_q,17_q,...,56_p,58_p,61_p,62_p,64_p,66_p,68_p,69_p,71_p,correctly_predicted
361564-base,0.658807,0.727132,0.268795,0.689247,0.269548,0.339675,0.663879,0.571987,0.695780,0.277188,...,0.672785,0.264694,0.520681,0.468145,0.486100,0.387351,0.457268,0.558390,0.516625,0
1375561-base,0.658807,0.727132,0.268795,0.689247,0.269548,0.339675,0.663879,0.571987,0.695780,0.277188,...,0.697609,0.312090,0.512275,0.409455,0.499998,0.400291,0.459177,0.552781,0.549812,0
2515747-base,0.658807,0.727132,0.268795,0.689247,0.269548,0.339675,0.663879,0.571987,0.695780,0.277188,...,0.656571,0.313121,0.469729,0.431101,0.483488,0.394174,0.472865,0.576053,0.614547,0
3543241-base,0.658807,0.727132,0.268795,0.689247,0.269548,0.339675,0.663879,0.571987,0.695780,0.277188,...,0.668178,0.300585,0.535309,0.477636,0.475497,0.422184,0.396232,0.569622,0.563246,0
3411737-base,0.658807,0.727132,0.268795,0.689247,0.269548,0.339675,0.663879,0.571987,0.695780,0.277188,...,0.683475,0.270708,0.547228,0.478129,0.477399,0.428807,0.413955,0.544013,0.567360,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020327-base,0.499675,0.566754,0.488044,0.585608,0.419509,0.410980,0.508945,0.464354,0.572758,0.554238,...,0.491959,0.374716,0.358994,0.591486,0.537819,0.610271,0.551465,0.390549,0.509457,0
870586-base,0.499675,0.566754,0.488044,0.585608,0.419509,0.410980,0.508945,0.464354,0.572758,0.554238,...,0.356717,0.343377,0.362300,0.555951,0.402605,0.508804,0.447909,0.491360,0.479170,0
4424300-base,0.499675,0.566754,0.488044,0.585608,0.419509,0.410980,0.508945,0.464354,0.572758,0.554238,...,0.406287,0.358767,0.348205,0.538911,0.571868,0.366157,0.469021,0.513989,0.509304,0
3824498-base,0.499675,0.566754,0.488044,0.585608,0.419509,0.410980,0.508945,0.464354,0.572758,0.554238,...,0.256288,0.395869,0.236231,0.533257,0.495314,0.569403,0.675950,0.541235,0.347442,0


In [None]:
X = df.drop(columns=['correctly_predicted', 'predicted_index', 'target'])
y = df['correctly_predicted']

In [None]:
del df

Use a CatBoost model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
model = CatBoostClassifier(random_state=0)
model.fit(X_train, y_train, verbose=50)

Learning rate set to 0.227159
0:	learn: 0.3594516	total: 690ms	remaining: 11m 29s
50:	learn: 0.0838338	total: 34.9s	remaining: 10m 49s
100:	learn: 0.0805969	total: 1m 5s	remaining: 9m 41s
150:	learn: 0.0784179	total: 1m 36s	remaining: 9m 1s
200:	learn: 0.0767284	total: 2m 8s	remaining: 8m 32s
250:	learn: 0.0752828	total: 2m 40s	remaining: 7m 58s
300:	learn: 0.0740538	total: 3m 10s	remaining: 7m 22s
350:	learn: 0.0728820	total: 3m 41s	remaining: 6m 49s
400:	learn: 0.0717613	total: 4m 12s	remaining: 6m 17s
450:	learn: 0.0708324	total: 4m 43s	remaining: 5m 45s
500:	learn: 0.0699189	total: 5m 15s	remaining: 5m 13s
550:	learn: 0.0690290	total: 5m 46s	remaining: 4m 42s
600:	learn: 0.0682053	total: 6m 17s	remaining: 4m 10s
650:	learn: 0.0674313	total: 6m 47s	remaining: 3m 38s
700:	learn: 0.0666532	total: 7m 16s	remaining: 3m 6s
750:	learn: 0.0659170	total: 7m 47s	remaining: 2m 35s
800:	learn: 0.0651864	total: 8m 18s	remaining: 2m 3s
850:	learn: 0.0645184	total: 8m 47s	remaining: 1m 32s
900:	l

<catboost.core.CatBoostClassifier at 0x7b4662ac17e0>

In [None]:
roc_auc_score(y, model.predict_proba(X)[:, 1])

0.949534274069866

In [None]:
del X
del y

Check the metric on the data from `validation.csv`

In [None]:
n_vectors = df_val_corrected.shape[0]
neighbors = 5

In [None]:
i_val, d_val = finder.search(vectors_to_search=df_val_corrected, number_of_vectors=n_vectors, neighbors=neighbors)

In [None]:
df_v = prepapre_df(indexes=i_val,
                distances=d_val,
                targets=df_val_answer['Expected'],
                query=df_val_corrected,
                base_index=base_index,
                n_vectors=n_vectors,
                neighbors=neighbors,
)

In [None]:
X_val = df_v.drop(columns=['correctly_predicted', 'predicted_index', 'target'])
y_val = df_v['correctly_predicted']

In [None]:
roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

0.9009905957413769

# Final pipeline

The search for the most similar products works as follows:


1.   Find the n nearest neighbors with Faiss;
2.   Build a dataset from the features of the query vector, the features of the neighbor vectors returned by Faiss, and the distances between them;
3.   Pass this data to CatBoostClassifier to predict the most similar products.
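The cells below stop at building the classifier input, so the last step is not shown explicitly. One plausible reading is that the candidates are reordered by the predicted probability of the positive class. A minimal sketch under that assumption; the helper name `rerank`, the ids and the scores are all made up:

```python
def rerank(candidates, scores, top=1):
    """Order candidate ids by classifier score, highest first, and keep `top` of them."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in ranked[:top]]

# Hypothetical candidates for one query and made-up probabilities of class 1
candidates = ['11-base', '42-base', '7-base']
scores = [0.10, 0.85, 0.30]
best = rerank(candidates, scores, top=2)
```

In the notebook's terms, `scores` would correspond to something like `model.predict_proba(X_val)[:, 1]` and `candidates` to `X_val.index`.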


In [None]:
n = 5
vector = df_val_corrected.sample(1)
i_s, d_s = finder.search(vector, neighbors=n)

In [None]:
df_1 = prepapre_df(indexes=i_s,
                distances=d_s,
                targets=df_val_answer.loc[vector.index, 'Expected'],
                query=vector,
                n_vectors=1,
                neighbors=n,
)
X_val = df_1.drop(columns=['correctly_predicted', 'predicted_index', 'target'])

Vectors found:

In [None]:
X_val.index

Index(['286474-base', '3763022-base', '2187348-base', '3774329-base',
       '3146992-base'],
      dtype='object', name='predicted_index')

Correct answer:

In [None]:
df_val_answer.loc[vector.index[0]]

Expected    286474-base
Name: 134583-query, dtype: object