# Matching

## This is only an example of the work that will need to be done. A baseline model for the matching task is built here and trained on marketplace data. Once we settle on a dataset, I will simply feed it to this model, then tune and test it.

The goal of the project is to find, for each query item, the five items closest to it in their parameters, using anonymized data from one of the country's largest marketplaces.

**Data description**\
We have 4 datasets to work with:
- base: the main dataset containing the entire catalog of items; the items are unlabeled.
- train: the training dataset; each item is labeled with its most similar item from the full base.
- validation: used for final testing; its labels are provided separately.
- validation_answer: the labels for the validation dataset.
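
Finding the five closest items and checking them against the labeled match amounts to a top-5 hit rate: a query counts as correct when its labeled item appears among the five returned candidates. The sketch below illustrates that metric in plain Python. The function name `hit_rate_at_k` and the toy IDs are assumptions made for illustration, not part of the dataset.

```python
def hit_rate_at_k(predictions, targets, k=5):
    """Return the share (in %) of queries whose true match is among the first k candidates."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = sum(target in candidates[:k]
               for candidates, target in zip(predictions, targets))
    return 100 * hits / len(targets)


# Toy example with made-up IDs: the first query is matched, the second is not.
predictions = [
    ["1-base", "2-base", "3-base", "4-base", "5-base"],
    ["9-base", "8-base", "7-base", "6-base", "5-base"],
]
targets = ["3-base", "1-base"]
print(hit_rate_at_k(predictions, targets))  # 50.0
```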

## 1. Install the required components and import libraries

In [1]:
!apt -q install libomp-dev
!pip -q install faiss-cpu --no-cache
!pip -q install faiss-gpu
!pip -q install optuna

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

from sklearn.metrics import silhouette_score
import numpy as np
from tqdm import tqdm
from sklearn.decomposition import PCA

import faiss
import time

import optuna

In [3]:
import requests
import os
from urllib.parse import urlencode
import zipfile

In [4]:
def downloader(size: str = 'small'):
    """Download the dataset archive from Yandex Disk and unpack it into the working directory."""
    if size == 'small':
        public_key = 'https://disk.yandex.ru/d/YQElc_cNQQLSOw'
    elif size == 'large':
        public_key = 'https://disk.yandex.ru/d/BBEphK0EHSJ5Jw'
    else:
        raise ValueError(f"Unknown size: {size!r}; expected 'small' or 'large'")

    base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?'
    zip_path = '/content/data.zip'

    # Ask the API for a direct download link to the public resource.
    final_url = base_url + urlencode(dict(public_key=public_key))
    response = requests.get(final_url)
    response.raise_for_status()
    download_url = response.json()['href']

    # Download the archive and save it to disk.
    download_response = requests.get(download_url)
    download_response.raise_for_status()
    with open(zip_path, 'wb') as f:
        f.write(download_response.content)

    # Unpack the archive into the current working directory.
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall()

In [5]:
downloader('small')

## 2. Exploring the data

In [6]:
base = pd.read_csv("/content/base.csv", index_col="Id")

In [7]:
base.head()

Id,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
4207931-base,-43.946243,15.364378,17.515854,-132.31146,157.06442,-4.069252,-340.63086,-57.55014,128.39822,45.090958,...,-71.92717,30.711966,-90.190475,-24.931271,66.972534,106.346634,-44.270622,155.98834,-1074.464888,-25.066608
2710972-base,-73.00489,4.923342,-19.750746,-136.52908,99.90717,-70.70911,-567.401996,-128.89015,109.914986,201.4722,...,-109.04466,20.916021,-171.20139,-110.596844,67.7301,8.909615,-9.470253,133.29536,-545.897014,-72.91323
1371460-base,-85.56557,-0.493598,-48.374817,-157.98502,96.80951,-81.71021,-22.297688,79.76867,124.357086,105.71518,...,-58.82165,41.369606,-132.9345,-43.016839,67.871925,141.77824,69.04852,111.72038,-1111.038833,-23.087206
3438601-base,-105.56409,15.393871,-46.223934,-158.11488,79.514114,-48.94448,-93.71301,38.581398,123.39796,110.324326,...,-87.90729,-58.80687,-147.7948,-155.830237,68.974754,21.39751,126.098785,139.7332,-1282.707248,-74.52794
422798-base,-74.63888,11.315012,-40.204174,-161.7643,50.507114,-80.77556,-640.923467,65.225,122.34494,191.46585,...,-30.002094,53.64293,-149.82323,176.921371,69.47328,-43.39518,-58.947716,133.84064,-1074.464888,-1.164146


In [9]:
train = pd.read_csv("./train.csv", index_col="Id")

In [10]:
train.head()

Id,0,1,2,3,4,5,6,7,8,9,...,63,64,65,66,67,68,69,70,71,Target
109249-query,-24.021454,3.122524,-80.947525,-112.329994,191.09018,-66.90313,-759.626065,-75.284454,120.55149,131.1317,...,-24.60167,-167.76077,133.678516,68.1846,26.317545,11.938202,148.54932,-778.563381,-46.87775,66971-base
34137-query,-82.03358,8.115866,-8.793022,-182.9721,56.645336,-52.59761,-55.720337,130.05925,129.38335,76.20288,...,54.448433,-120.894806,-12.292085,66.608116,-27.997612,10.091335,95.809265,-1022.691531,-88.564705,1433819-base
136121-query,-75.71964,-0.223386,-86.18613,-162.06406,114.320114,-53.3946,-117.261013,-24.857851,124.8078,112.190155,...,-5.609123,-93.02988,-80.997871,63.733383,11.378683,62.932007,130.97539,-1074.464888,-74.861176,290133-base
105191-query,-56.58062,5.093593,-46.94311,-149.03912,112.43643,-76.82051,-324.995645,-32.833107,119.47865,120.07479,...,21.624313,-158.88037,179.597294,69.89136,-33.804955,233.91461,122.868546,-1074.464888,-93.775375,1270048-base
63983-query,-52.72565,9.027046,-92.82965,-113.11101,134.12497,-42.423073,-759.626065,8.261169,119.49023,172.36536,...,13.807772,-208.65004,41.742014,66.52242,41.36293,162.72305,111.26131,-151.162805,-33.83145,168591-base


In [21]:
scaler = RobustScaler()
scaler.fit(base)
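
With its default settings, `RobustScaler` centers each feature on its median and divides it by the interquartile range, so a few extreme values do not dominate the scale. The sketch below reproduces that idea for a single column in plain Python. The helper name `robust_scale` and the sample values are assumptions for illustration; scikit-learn's own implementation is vectorized and may differ in details.

```python
import statistics


def robust_scale(column):
    """Scale one feature as (x - median) / IQR, leaving it unscaled if the IQR is zero."""
    median = statistics.median(column)
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    return [(x - median) / iqr for x in column]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up column with one outlier
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not compress the remaining values the way min-max or mean/variance scaling would.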

## 3. Baseline

In [29]:
base_scale = scaler.transform(base)

In [30]:
kmeans = KMeans(n_clusters=10, init="k-means++", max_iter=300, random_state=0)
kmeans.fit(base_scale)



In [31]:
X = train.drop("Target", axis=1)
X_normalize = scaler.transform(X)

In [32]:
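def top_func(data_norm, data, base_data):
    """Print the share of queries whose target is among the 5 nearest base items in its cluster."""
    # Assign every query to one of the KMeans clusters fitted on the base set.
    train_cluster_pred = kmeans.predict(data_norm)

    # Row positions in base_data follow the row order of the global `base` frame.
    index_mapping = {i: idx for i, idx in enumerate(base.index)}

    # Row positions of the base items that belong to each cluster.
    cluster_index = {cluster: np.where(kmeans.labels_ == cluster)[0]
                     for cluster in np.unique(kmeans.labels_)}

    accuracy = 0
    for cluster in tqdm(np.unique(train_cluster_pred), desc="Processing clusters"):
        base_cluster_index = cluster_index[cluster]
        base_cluster = base_data[base_cluster_index]
        train_cluster = data_norm[train_cluster_pred == cluster]
        true_labels_cluster = data[train_cluster_pred == cluster]['Target']

        # Search for neighbors only inside the query's own cluster.
        neighbors = NearestNeighbors(n_neighbors=5)
        neighbors.fit(base_cluster)

        _, index = neighbors.kneighbors(train_cluster)

        for target, idx in zip(true_labels_cluster.values.tolist(), index):
            # Map positions inside the cluster back to base item IDs.
            text_index = [index_mapping[base_cluster_index[i]] for i in idx]
            accuracy += int(target in text_index)

    # Divide by the number of evaluated queries rather than by the global `train`.
    general_accuracy = 100 * accuracy / len(data)
    print(f"Total Accuracy: {general_accuracy:.2f}%")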


In [33]:
top_func(X_normalize, train, base_scale)

Processing clusters: 100%|██████████| 10/10 [00:19<00:00,  1.96s/it]

Total Accuracy: 60.80%





## 4. FAISS

Faiss is a library developed by Facebook for fast nearest-neighbor search in large datasets.

Using Faiss involves:
- preparing the data
- building a dedicated index for the search
- searching for the nearest neighbors
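
To make the three steps concrete, here is a dependency-free sketch of exhaustive L2 search, which is the kind of result an exact (flat) index returns. It is a plain-Python stand-in for illustration, not Faiss itself: the function `flat_l2_search` and the toy vectors are assumptions, and the "index" here is simply the list of base vectors.

```python
import heapq
import math


def flat_l2_search(base_vectors, query, k=5):
    """Exact k-nearest-neighbor search: compare the query with every base vector."""
    scored = ((math.dist(vec, query), i) for i, vec in enumerate(base_vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


# 1. Prepare the data (made-up 2-D vectors).
base_vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2), (-3.0, 2.0)]
query = (1.0, 1.0)
# 2. The "index" is just the stored list; 3. search it for the closest items.
print(flat_l2_search(base_vectors, query, k=2))  # [1, 3]
```

Exhaustive search is exact but scales linearly with the size of the base, which is why approximate index types become attractive on large catalogs.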

In [39]:
sample_ = train.sample(n=1_000, random_state=0)
sample_x = sample_.drop("Target", axis=1)
sample_x_normalize = scaler.transform(sample_x)

base_data = base_scale.astype(np.float32)
train_data = sample_x_normalize.astype(np.float32)
train_labels = sample_['Target']

In [40]:
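def index_func(data,
               n_clusters=None,
               use_pca=False,
               pca_dimensions=None,
               metric=faiss.METRIC_L2):
    """Build a Faiss index over `data`; return it with the PCA transform (or None)."""
    d = data.shape[1]
    pca_matrix = None

    if use_pca and pca_dimensions:
        # Reduce dimensionality; eigen_power=-0.5 also whitens the components.
        pca_matrix = faiss.PCAMatrix(d, pca_dimensions, eigen_power=-0.5)
        pca_matrix.train(data)
        data = pca_matrix.apply_py(data)
        d = pca_dimensions

    if n_clusters:
        # IVF index: vectors are split into n_clusters cells by a flat L2 quantizer.
        quantizer = faiss.IndexFlatL2(d)
        index = faiss.IndexIVFFlat(quantizer, d, n_clusters, metric)
        index.train(data)
    else:
        # Exact exhaustive search (always L2; `metric` is not used in this branch).
        index = faiss.IndexFlatL2(d)

    index.add(data)
    return index, pca_matrix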

In [41]:
def search_func(index, query_data, k=5, pca_matrix=None):
    start_time = time.time()
    if pca_matrix is not None:
        # Apply the same PCA transform that was used for the base vectors.
        query_data = pca_matrix.apply_py(query_data)
    distances, indices = index.search(query_data, k)
    print(f"Search took {time.time() - start_time:.2f} seconds")
    return distances, indices

In [42]:
def accuracy_func(indices, train_labels, index_to_label):
    matches = sum(label in [index_to_label[idx] for idx in neighbors] for label, neighbors in zip(train_labels, indices))
    accuracy = 100 * matches / len(train_labels)
    return accuracy

In [43]:
def match_func(base_data, train_data, train_labels,
                     n_clusters=None,
                     use_pca=False,
                     pca_dimensions=None,
                     k=5, nprobe=2,
                     metric=faiss.METRIC_L2):

    index, pca_matrix = index_func(base_data,
                                   n_clusters=n_clusters,
                                   use_pca=use_pca,
                                   pca_dimensions=pca_dimensions,
                                   metric=metric)

    index.nprobe = nprobe

    _, indices = search_func(index, train_data, k=k, pca_matrix=pca_matrix)

    index_to_label = {idx: label for idx, label in enumerate(base.index)}

    accuracy = accuracy_func(indices, train_labels, index_to_label)
    return accuracy
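
The `n_clusters` and `nprobe` arguments control the inverted-file (IVF) index: base vectors are grouped around centroids, and a query scans only the `nprobe` groups whose centroids are closest to it. The plain-Python sketch below illustrates that trade-off under simplifying assumptions. The centroids are hand-picked rather than learned (Faiss trains them with k-means), and all function names and vectors are hypothetical.

```python
import math


def nearest_centroids(centroids, vector, n):
    """Return the ids of the n centroids closest to the vector."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(centroids[c], vector))
    return order[:n]


def build_inverted_lists(centroids, base_vectors):
    """Assign every base vector to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(base_vectors):
        lists[nearest_centroids(centroids, vec, 1)[0]].append(i)
    return lists


def ivf_search(centroids, lists, base_vectors, query, k=5, nprobe=1):
    """Scan only the nprobe closest lists and return the k best candidates."""
    candidates = [i for c in nearest_centroids(centroids, query, nprobe)
                  for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(base_vectors[i], query))
    return candidates[:k]


centroids = [(0.0, 0.0), (10.0, 0.0)]                              # hand-picked
base_vectors = [(1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (9.0, 0.0)]   # made up
lists = build_inverted_lists(centroids, base_vectors)
query = (4.9, 0.0)

print(ivf_search(centroids, lists, base_vectors, query, k=2, nprobe=1))  # [1, 0]
print(ivf_search(centroids, lists, base_vectors, query, k=2, nprobe=2))  # [1, 2]
```

With `nprobe=1` the query near the cluster boundary misses its true second neighbor (item 2), because that item lives in the other list. Raising `nprobe` recovers it at the cost of scanning more vectors.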


In [44]:
def params_func(trial):

    n_clusters = trial.suggest_int('n_clusters', 10, 500, step=10)

    pca_dimensions_option = trial.suggest_categorical('pca_dimensions_option', [True, False])
    pca_dimensions = trial.suggest_int('pca_dimensions', 27, 70) if pca_dimensions_option else None

    nprobe = trial.suggest_int('nprobe', 1, 10)

    base_data_c = np.ascontiguousarray(base_data)
    train_data_c = np.ascontiguousarray(train_data)

    accuracy = match_func(base_data_c, train_data_c, train_labels,
                          n_clusters=n_clusters,
                          use_pca=pca_dimensions_option,
                          pca_dimensions=pca_dimensions,
                          k=5, nprobe=nprobe)
    return accuracy

study = optuna.create_study(direction='maximize')
study.optimize(params_func, n_trials=25)

print("Best metric:", study.best_value)
print("Best parameters:", study.best_params)

[I 2024-05-07 09:45:12,446] A new study created in memory with name: no-name-84a2102d-75cc-4fa5-aba9-c41afb6546ec
[I 2024-05-07 09:45:14,251] Trial 0 finished with value: 73.5 and parameters: {'n_clusters': 150, 'pca_dimensions_option': True, 'pca_dimensions': 53, 'nprobe': 9}. Best is trial 0 with value: 73.5.

Search took 0.60 seconds

[I 2024-05-07 09:45:15,334] Trial 1 finished with value: 63.0 and parameters: {'n_clusters': 130, 'pca_dimensions_option': True, 'pca_dimensions': 31, 'nprobe': 5}. Best is trial 0 with value: 73.5.

Search took 0.27 seconds

[I 2024-05-07 09:45:17,268] Trial 2 finished with value: 75.0 and parameters: {'n_clusters': 80, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 7}. Best is trial 2 with value: 75.0.

Search took 1.06 seconds
Search took 0.33 seconds

[I 2024-05-07 09:45:25,077] Trial 3 finished with value: 58.5 and parameters: {'n_clusters': 430, 'pca_dimensions_option': False, 'nprobe': 9}. Best is trial 2 with value: 75.0.
[I 2024-05-07 09:45:27,530] Trial 4 finished with value: 58.2 and parameters: {'n_clusters': 70, 'pca_dimensions_option': False, 'nprobe': 4}. Best is trial 2 with value: 75.0.

Search took 1.16 seconds
Search took 0.14 seconds

[I 2024-05-07 09:45:35,551] Trial 5 finished with value: 60.0 and parameters: {'n_clusters': 470, 'pca_dimensions_option': True, 'pca_dimensions': 34, 'nprobe': 3}. Best is trial 2 with value: 75.0.

Search took 0.30 seconds

[I 2024-05-07 09:45:44,381] Trial 6 finished with value: 58.3 and parameters: {'n_clusters': 390, 'pca_dimensions_option': False, 'nprobe': 6}. Best is trial 2 with value: 75.0.
[I 2024-05-07 09:45:46,983] Trial 7 finished with value: 71.0 and parameters: {'n_clusters': 260, 'pca_dimensions_option': True, 'pca_dimensions': 47, 'nprobe': 5}. Best is trial 2 with value: 75.0.

Search took 0.19 seconds
Search took 2.18 seconds

[I 2024-05-07 09:45:50,759] Trial 8 finished with value: 58.3 and parameters: {'n_clusters': 80, 'pca_dimensions_option': False, 'nprobe': 7}. Best is trial 2 with value: 75.0.

Search took 0.28 seconds

[I 2024-05-07 09:46:00,405] Trial 9 finished with value: 64.6 and parameters: {'n_clusters': 370, 'pca_dimensions_option': True, 'pca_dimensions': 34, 'nprobe': 8}. Best is trial 2 with value: 75.0.
[I 2024-05-07 09:46:04,556] Trial 10 finished with value: 64.5 and parameters: {'n_clusters': 10, 'pca_dimensions_option': True, 'pca_dimensions': 67, 'nprobe': 1}. Best is trial 2 with value: 75.0.

Search took 2.10 seconds

[I 2024-05-07 09:46:09,478] Trial 11 finished with value: 75.6 and parameters: {'n_clusters': 200, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 10}. Best is trial 11 with value: 75.6.

Search took 1.12 seconds

[I 2024-05-07 09:46:12,581] Trial 12 finished with value: 74.9 and parameters: {'n_clusters': 260, 'pca_dimensions_option': True, 'pca_dimensions': 68, 'nprobe': 10}. Best is trial 11 with value: 75.6.

Search took 0.46 seconds

[I 2024-05-07 09:46:14,462] Trial 13 finished with value: 75.0 and parameters: {'n_clusters': 160, 'pca_dimensions_option': True, 'pca_dimensions': 57, 'nprobe': 10}. Best is trial 11 with value: 75.6.

Search took 0.69 seconds

[I 2024-05-07 09:46:16,462] Trial 14 finished with value: 73.9 and parameters: {'n_clusters': 200, 'pca_dimensions_option': True, 'pca_dimensions': 61, 'nprobe': 7}. Best is trial 11 with value: 75.6.

Search took 0.40 seconds

[I 2024-05-07 09:46:18,959] Trial 15 finished with value: 70.3 and parameters: {'n_clusters': 310, 'pca_dimensions_option': True, 'pca_dimensions': 45, 'nprobe': 8}. Best is trial 11 with value: 75.6.

Search took 0.21 seconds
Search took 6.97 seconds

[I 2024-05-07 09:46:26,817] Trial 16 finished with value: 76.4 and parameters: {'n_clusters': 20, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 7}. Best is trial 16 with value: 76.4.
[I 2024-05-07 09:46:32,516] Trial 17 finished with value: 69.1 and parameters: {'n_clusters': 10, 'pca_dimensions_option': True, 'pca_dimensions': 70, 'nprobe': 2}. Best is trial 16 with value: 76.4.

Search took 3.31 seconds

[I 2024-05-07 09:46:34,844] Trial 18 finished with value: 58.8 and parameters: {'n_clusters': 220, 'pca_dimensions_option': False, 'nprobe': 9}. Best is trial 16 with value: 76.4.

Search took 0.61 seconds

[I 2024-05-07 09:46:37,618] Trial 19 finished with value: 71.7 and parameters: {'n_clusters': 320, 'pca_dimensions_option': True, 'pca_dimensions': 52, 'nprobe': 6}. Best is trial 16 with value: 76.4.

Search took 0.19 seconds

[I 2024-05-07 09:46:39,510] Trial 20 finished with value: 71.0 and parameters: {'n_clusters': 110, 'pca_dimensions_option': True, 'pca_dimensions': 40, 'nprobe': 10}. Best is trial 16 with value: 76.4.

Search took 0.98 seconds

[I 2024-05-07 09:46:44,079] Trial 21 finished with value: 74.8 and parameters: {'n_clusters': 50, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 7}. Best is trial 16 with value: 76.4.

Search took 3.12 seconds

[I 2024-05-07 09:46:46,258] Trial 22 finished with value: 74.5 and parameters: {'n_clusters': 190, 'pca_dimensions_option': True, 'pca_dimensions': 63, 'nprobe': 8}. Best is trial 16 with value: 76.4.

Search took 0.57 seconds

[I 2024-05-07 09:46:47,573] Trial 23 finished with value: 72.0 and parameters: {'n_clusters': 100, 'pca_dimensions_option': True, 'pca_dimensions': 55, 'nprobe': 4}. Best is trial 16 with value: 76.4.

Search took 0.45 seconds

[I 2024-05-07 09:46:49,976] Trial 24 finished with value: 74.2 and parameters: {'n_clusters': 40, 'pca_dimensions_option': True, 'pca_dimensions': 59, 'nprobe': 6}. Best is trial 16 with value: 76.4.

Search took 1.69 seconds
Best metric: 76.4
Best parameters: {'n_clusters': 20, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 7}


Best metric: 76.4\
Best parameters: {'n_clusters': 20, 'pca_dimensions_option': True, 'pca_dimensions': 62, 'nprobe': 7}