# Машинное обучение, ФКН ВШЭ

# Практическое задание 12. Поиск ближайших соседей

## Общая информация

Дата выдачи: 08.05.2024

**Мягкий дедлайн: 26.05.2024 23:59 MSK**

**Жёсткий дедлайн: 30.05.2024 23:59 MSK**

## Оценивание и штрафы

Каждая из задач имеет определенную «стоимость» (указана в скобках около задачи). Максимально допустимая оценка за работу — 7 баллов.


Сдавать задание после указанного жёсткого срока сдачи нельзя. При выставлении неполного балла за задание в связи с наличием ошибок на усмотрение проверяющего предусмотрена возможность исправить работу на указанных в ответном письме условиях.

Задание выполняется самостоятельно. «Похожие» решения считаются плагиатом и все задействованные студенты (в том числе те, у кого списали) не могут получить за него больше 0 баллов (подробнее о плагиате см. на странице курса). Если вы нашли решение какого-то из заданий (или его часть) в открытом источнике, необходимо указать ссылку на этот источник в отдельном блоке в конце вашей работы (скорее всего вы будете не единственным, кто это нашел, поэтому чтобы исключить подозрение в плагиате, необходима ссылка на источник).

Неэффективная реализация кода может негативно отразиться на оценке.

## Формат сдачи

Задания сдаются через систему anytask. Посылка должна содержать:

* Ноутбук homework-practice-12-knn-Username.ipynb

Username — ваша фамилия и имя на латинице именно в таком порядке.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import random

from tqdm.notebook import tqdm

Возьмем [датасет](https://www.kaggle.com/delayedkarma/impressionist-classifier-data)  с картинами известных импрессионистов. Работать будем не с самими картинками, а с эмбеддингами картинок, полученных с помощью сверточного классификатора.

![](https://storage.googleapis.com/kagglesdsdata/datasets/568245/1031162/training/training/Gauguin/190448.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210405%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210405T125358Z&X-Goog-Expires=172799&X-Goog-SignedHeaders=host&X-Goog-Signature=a271b474bf9ec20ba159b951e0ae680fc2b0c694666031f7ea6fc39598172cc55e10f75c12b678b21da9e6bdc20e46886133c219625648b407d2f600eebfdda909b29e0f7f13276d8fea2f8d0480d6298bd98e7f118eb78e8b632fc3d141365356b0e3a2fdd4f09119f99f0907a31da62e8dae7e625e32d831238ecc227b1f5ad2e96a8bfb43d93ef6fe88d7e663e51d387d3550dcad2a7eefc5c941028ba0d7751d18690cf2e26fcdfaa4dacd3dcbb3a4cbb355e62c08b158007b5e764e468cecd3292dae4cfc408e848ecf3e0e5dbe5faa76fcdd77d5370c868583c06e4e3d40c73a7435bd8c32a9803fe6b536e1c6f0791219aadd06120291e937e57c214a)

In [3]:
# %%bash

# mkdir embeddings

# GIT="https://github.com/esokolov/ml-course-hse/raw/master/2022-spring/homeworks-practice/homework-practice-11-metric-learning/embeddings"
# wget -P ./embeddings $GIT/embeds_train.npy
# wget -P ./embeddings $GIT/embeds_test.npy
# wget -P ./embeddings $GIT/labels_train.npy
# wget -P ./embeddings $GIT/labels_test.npy

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [3]:
X_train = np.load('embeds_train.npy')
y_train = np.load('labels_train.npy')
X_test = np.load('embeds_test.npy')
y_test = np.load('labels_test.npy')

Будем смотреть на обычную долю верных ответов и на долю верных ответов в топ-3.

In [4]:
def top_3_accuracy_score(y_true, probas):
    preds = np.argsort(probas, axis=1)[:, -3:]
    matches = np.zeros_like(y_true)
    for i in range(3):
        matches += (preds[:, i] == y_true)
    return matches.sum() / matches.size

def scorer(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))

**Задание 1. (1 балл)**

Обучите классификатор k ближайших соседей (из sklearn) на данных, подобрав лучшие гиперпараметры. Замерьте качество на обучающей и тестовой выборках.

In [36]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, leaf_size=30, n_jobs=-1)
knn.fit(X_train, y_train)

score_train = scorer(knn, X_train, y_train)
score_test = scorer(knn, X_test, y_test)

probas_train = knn.predict_proba(X_train)
probas_test = knn.predict_proba(X_test)
top_3_score_train = top_3_accuracy_score(y_train, probas_train)
top_3_score_test = top_3_accuracy_score(y_test, probas_test)

print('Accuracy train: ', score_train)
print('Top-3 accuracy train: ', top_3_score_train)
print('Accuracy test: ', score_test)
print('Top-3 accuracy test: ', top_3_score_test)

Accuracy train:  0.691073219658977
Top-3 accuracy train:  0.9606318956870612
Accuracy test:  0.5131313131313131
Top-3 accuracy test:  0.7616161616161616


In [37]:
import optuna

def objective(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train, y_train)
    score = scorer(knn, X_test, y_test)

    return 1. / score

study = optuna.create_study()
study.optimize(objective, n_trials=100, n_jobs=-1)

[I 2024-05-26 23:22:38,725] A new study created in memory with name: no-name-ce5dfb38-388b-4dad-adcd-46f87c2e0853
[I 2024-05-26 23:22:38,839] Trial 0 finished with value: 1.8435754189944131 and parameters: {'n_neighbors': 19, 'leaf_size': 23}. Best is trial 0 with value: 1.8435754189944131.
[I 2024-05-26 23:22:38,920] Trial 8 finished with value: 1.8435754189944131 and parameters: {'n_neighbors': 8, 'leaf_size': 86}. Best is trial 0 with value: 1.8435754189944131.
[I 2024-05-26 23:22:39,009] Trial 9 finished with value: 1.8574108818011257 and parameters: {'n_neighbors': 12, 'leaf_size': 96}. Best is trial 0 with value: 1.8435754189944131.
[I 2024-05-26 23:22:39,037] Trial 4 finished with value: 1.8435754189944131 and parameters: {'n_neighbors': 8, 'leaf_size': 63}. Best is trial 0 with value: 1.8435754189944131.
[I 2024-05-26 23:22:39,089] Trial 3 finished with value: 1.8131868131868132 and parameters: {'n_neighbors': 16, 'leaf_size': 27}. Best is trial 3 with value: 1.8131868131868132

In [38]:
best_params = study.best_params
best_params

{'n_neighbors': 16, 'leaf_size': 27}

In [39]:
knn = KNeighborsClassifier(n_neighbors=best_params['n_neighbors'], leaf_size=best_params['leaf_size'], n_jobs=-1)
knn.fit(X_train, y_train)

score_train = scorer(knn, X_train, y_train)
score_test = scorer(knn, X_test, y_test)

probas_train = knn.predict_proba(X_train)
probas_test = knn.predict_proba(X_test)
top_3_score_train = top_3_accuracy_score(y_train, probas_train)
top_3_score_test = top_3_accuracy_score(y_test, probas_test)

print('Accuracy train: ', score_train)
print('Top-3 accuracy train: ', top_3_score_train)
print('Accuracy test: ', score_test)
print('Top-3 accuracy test: ', top_3_score_test)

Accuracy train:  0.6409227683049148
Top-3 accuracy train:  0.9074724172517553
Accuracy test:  0.5515151515151515
Top-3 accuracy test:  0.8242424242424242


**Задание 2. (2 балла)** 

Теперь будем пользоваться метрикой Махалонобиса. Обучите её одним из методов [отсюда](http://contrib.scikit-learn.org/metric-learn/supervised.html). Напомним, что вычисление метрики Махалонобиса эквивалентно вычислению евклидова расстояния между объектами, к которым применено некоторое линейное преобразование (вспомните семинары). Преобразуйте данные и обучите kNN на них, перебрав гиперпараметры, замерьте качество.

Заметим, что в библиотеке metric-learn есть несколько способов обучать матрицу преобразования. Выберите лучший, аргументируйте свой выбор.

Note: Некоторые методы с дефолтными параметрами учатся очень долго, будьте внимательны. Советуем выставить параметр `tolerance=1e-3`.


In [12]:
from metric_learn import NCA, LMNN, LFDA, MLKR

nca = NCA(tol=1e-3)
nca.fit(X_train, y_train)
X_train_nca = nca.transform(X_train)
X_test_nca = nca.transform(X_test)

In [15]:
lmnn = LMNN(convergence_tol=1e-3)
lmnn.fit(X_train, y_train)
X_train_lmnn = lmnn.transform(X_train)
X_test_lmnn = lmnn.transform(X_test)

In [16]:
mlkr = MLKR(tol=1e-3)
mlkr.fit(X_train, y_train)
X_train_mlkr = mlkr.transform(X_train)
X_test_mlkr = mlkr.transform(X_test)

### __NCA__

In [17]:
def objective_nca(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train_nca, y_train)
    score = scorer(knn, X_test_nca, y_test)

    return 1. / score

study_nca = optuna.create_study()
study_nca.optimize(objective_nca, n_trials=100, n_jobs=-1)

[I 2024-05-26 19:04:13,726] A new study created in memory with name: no-name-8c38c3ad-533b-4a51-94e1-b034e8944f10
[I 2024-05-26 19:04:13,898] Trial 1 finished with value: 1.809872029250457 and parameters: {'n_neighbors': 12, 'leaf_size': 67}. Best is trial 1 with value: 1.809872029250457.
[I 2024-05-26 19:04:13,917] Trial 7 finished with value: 1.7805755395683456 and parameters: {'n_neighbors': 16, 'leaf_size': 14}. Best is trial 7 with value: 1.7805755395683456.
[I 2024-05-26 19:04:13,936] Trial 2 finished with value: 1.7967332123411976 and parameters: {'n_neighbors': 11, 'leaf_size': 28}. Best is trial 7 with value: 1.7805755395683456.
[I 2024-05-26 19:04:13,940] Trial 5 finished with value: 1.783783783783784 and parameters: {'n_neighbors': 9, 'leaf_size': 50}. Best is trial 7 with value: 1.7805755395683456.
[I 2024-05-26 19:04:13,955] Trial 0 finished with value: 1.7805755395683456 and parameters: {'n_neighbors': 16, 'leaf_size': 17}. Best is trial 7 with value: 1.7805755395683456.


In [18]:
study_nca.best_params

{'n_neighbors': 13, 'leaf_size': 86}

In [19]:
nca_best_params = study_nca.best_params

knn_nca = KNeighborsClassifier(n_neighbors=nca_best_params['n_neighbors'], leaf_size=nca_best_params['leaf_size'])
knn_nca.fit(X_train_nca, y_train)

score = scorer(knn_nca, X_test_nca, y_test)
probas = knn_nca.predict_proba(X_test_nca)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.5656565656565656
Top-3 accuracy:  0.8141414141414142


### __LMNN__

In [20]:
def objective_lmnn(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train_lmnn, y_train)
    score = scorer(knn, X_test_lmnn, y_test)

    return 1. / score

study_lmnn = optuna.create_study()
study_lmnn.optimize(objective_lmnn, n_trials=100, n_jobs=-1)

[I 2024-05-26 19:04:18,854] A new study created in memory with name: no-name-75e620ff-d1fe-413e-9b99-79e2118d2e75
[I 2024-05-26 19:04:19,124] Trial 3 finished with value: 1.8893129770992367 and parameters: {'n_neighbors': 3, 'leaf_size': 91}. Best is trial 3 with value: 1.8893129770992367.
[I 2024-05-26 19:04:19,140] Trial 1 finished with value: 1.9760479041916166 and parameters: {'n_neighbors': 1, 'leaf_size': 61}. Best is trial 3 with value: 1.8893129770992367.
[I 2024-05-26 19:04:19,153] Trial 4 finished with value: 1.8893129770992367 and parameters: {'n_neighbors': 3, 'leaf_size': 67}. Best is trial 3 with value: 1.8893129770992367.
[I 2024-05-26 19:04:19,177] Trial 2 finished with value: 1.8401486988847586 and parameters: {'n_neighbors': 6, 'leaf_size': 11}. Best is trial 2 with value: 1.8401486988847586.
[I 2024-05-26 19:04:19,193] Trial 5 finished with value: 1.7398945518453426 and parameters: {'n_neighbors': 8, 'leaf_size': 28}. Best is trial 5 with value: 1.7398945518453426.
[

In [21]:
lmnn_best_params = study_lmnn.best_params

knn_lmnn = KNeighborsClassifier(n_neighbors=lmnn_best_params['n_neighbors'], leaf_size=lmnn_best_params['leaf_size'])
knn_lmnn.fit(X_train_lmnn, y_train)

score = scorer(knn_lmnn, X_test_lmnn, y_test)
probas = knn_lmnn.predict_proba(X_test_lmnn)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.5848484848484848
Top-3 accuracy:  0.8323232323232324


### __MLKR__

In [22]:
def objective_mlkr(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train_mlkr, y_train)
    score = scorer(knn, X_test_mlkr, y_test)

    return 1. / score

study_mlkr = optuna.create_study()
study_mlkr.optimize(objective_mlkr, n_trials=100, n_jobs=-1)

[I 2024-05-26 19:04:23,844] A new study created in memory with name: no-name-fabae8ae-868e-4b70-92db-aa5bb9f4a976
[I 2024-05-26 19:04:24,136] Trial 3 finished with value: 2.133620689655172 and parameters: {'n_neighbors': 2, 'leaf_size': 28}. Best is trial 3 with value: 2.133620689655172.
[I 2024-05-26 19:04:24,159] Trial 5 finished with value: 1.9373776908023483 and parameters: {'n_neighbors': 5, 'leaf_size': 45}. Best is trial 5 with value: 1.9373776908023483.
[I 2024-05-26 19:04:24,167] Trial 2 finished with value: 1.885714285714286 and parameters: {'n_neighbors': 7, 'leaf_size': 80}. Best is trial 2 with value: 1.885714285714286.
[I 2024-05-26 19:04:24,181] Trial 7 finished with value: 1.8644067796610169 and parameters: {'n_neighbors': 6, 'leaf_size': 66}. Best is trial 7 with value: 1.8644067796610169.
[I 2024-05-26 19:04:24,197] Trial 1 finished with value: 1.885714285714286 and parameters: {'n_neighbors': 10, 'leaf_size': 33}. Best is trial 7 with value: 1.8644067796610169.
[I 20

In [23]:
mlkr_best_params = study_mlkr.best_params

knn_mlkr = KNeighborsClassifier(n_neighbors=mlkr_best_params['n_neighbors'], leaf_size=mlkr_best_params['leaf_size'])
knn_mlkr.fit(X_train_mlkr, y_train)

score = scorer(knn_mlkr, X_test_mlkr, y_test)
probas = knn_mlkr.predict_proba(X_test_mlkr)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.5404040404040404
Top-3 accuracy:  0.8161616161616162


**Задание 3. (1 балл)** 

Что будет, если в качестве матрицы в расстоянии Махалонобиса использовать случайную матрицу? Матрицу ковариаций?

### __Случайная матрица в расстроянии Махалонобииса__

In [24]:
np.random.seed(42)
random_matrix = np.random.rand(X_train.shape[1], X_train.shape[1])

X_train_random = X_train @ random_matrix
X_test_random = X_test @ random_matrix

In [25]:
def objective_random(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train_random, y_train)
    score = scorer(knn, X_test_random, y_test)

    return 1. / score

study_random = optuna.create_study()
study_random.optimize(objective_random, n_trials=100, n_jobs=-1)

[I 2024-05-26 19:04:28,802] A new study created in memory with name: no-name-4c66a29a-e3ae-4f73-b410-bda6fc60e4a4
[I 2024-05-26 19:04:29,045] Trial 1 finished with value: 2.734806629834254 and parameters: {'n_neighbors': 2, 'leaf_size': 73}. Best is trial 1 with value: 2.734806629834254.
[I 2024-05-26 19:04:29,060] Trial 4 finished with value: 2.351543942992874 and parameters: {'n_neighbors': 5, 'leaf_size': 15}. Best is trial 4 with value: 2.351543942992874.
[I 2024-05-26 19:04:29,076] Trial 5 finished with value: 2.564766839378238 and parameters: {'n_neighbors': 1, 'leaf_size': 10}. Best is trial 4 with value: 2.351543942992874.
[I 2024-05-26 19:04:29,085] Trial 2 finished with value: 2.3185011709601873 and parameters: {'n_neighbors': 15, 'leaf_size': 84}. Best is trial 2 with value: 2.3185011709601873.
[I 2024-05-26 19:04:29,094] Trial 0 finished with value: 2.3185011709601873 and parameters: {'n_neighbors': 15, 'leaf_size': 51}. Best is trial 2 with value: 2.3185011709601873.
[I 20

In [26]:
random_best_params = study_random.best_params

knn_random = KNeighborsClassifier(n_neighbors=random_best_params['n_neighbors'], leaf_size=random_best_params['leaf_size'])
knn_random.fit(X_train_random, y_train)

score = scorer(knn_random, X_test_random, y_test)
probas = knn_random.predict_proba(X_test_random)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy with random matrix: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy with random matrix:  0.4484848484848485
Top-3 accuracy:  0.7141414141414142


Расстояния зависят от случайной матрицы и не отражают реальную структуру данных, поэтому качество резко упало.

### __Матрица ковариаций в расстроянии Махалонобииса__

__Сначала попробуем преобразование данных.__

In [9]:
mu_train = np.mean(X_train, axis=0)

X_train_centered = X_train - mu_train
X_test_centered = X_test - mu_train

cov_matrix = np.cov(X_train_centered, rowvar=False)
cov_inv = np.linalg.inv(np.linalg.cholesky(cov_matrix)).T       # мы же используем обратную матрицу

X_train_cov = X_train_centered @ cov_inv
X_test_cov = X_test_centered @ cov_inv

In [10]:
def objective_cov(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors, leaf_size=leaf_size, n_jobs=-1)
    knn.fit(X_train_cov, y_train)
    score = scorer(knn, X_test_cov, y_test)

    return 1. / score

study_cov = optuna.create_study()
study_cov.optimize(objective_cov, n_trials=100, n_jobs=-1)

[I 2024-05-26 21:52:18,386] A new study created in memory with name: no-name-a7173280-00e3-425b-ba73-5f06382f25b6
[I 2024-05-26 21:52:18,569] Trial 2 finished with value: 2.3741007194244603 and parameters: {'n_neighbors': 20, 'leaf_size': 77}. Best is trial 2 with value: 2.3741007194244603.
[I 2024-05-26 21:52:18,575] Trial 1 finished with value: 2.391304347826087 and parameters: {'n_neighbors': 11, 'leaf_size': 10}. Best is trial 2 with value: 2.3741007194244603.
[I 2024-05-26 21:52:18,588] Trial 6 finished with value: 2.9289940828402368 and parameters: {'n_neighbors': 2, 'leaf_size': 70}. Best is trial 2 with value: 2.3741007194244603.
[I 2024-05-26 21:52:18,700] Trial 0 finished with value: 2.391304347826087 and parameters: {'n_neighbors': 12, 'leaf_size': 77}. Best is trial 2 with value: 2.3741007194244603.
[I 2024-05-26 21:52:18,721] Trial 7 finished with value: 2.4688279301745637 and parameters: {'n_neighbors': 8, 'leaf_size': 94}. Best is trial 2 with value: 2.3741007194244603.


In [11]:
cov_best_params = study_cov.best_params

knn_cov = KNeighborsClassifier(n_neighbors=cov_best_params['n_neighbors'], leaf_size=cov_best_params['leaf_size'])
knn_cov.fit(X_train_cov, y_train)

score = scorer(knn_cov, X_test_cov, y_test)
probas = knn_cov.predict_proba(X_test_cov)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy with cov matrix: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy with cov matrix:  0.4353535353535353
Top-3 accuracy:  0.6636363636363637


__Теперь используем возможность указания метрики в knn.__

In [30]:
from scipy.spatial.distance import mahalanobis

mu_train = np.mean(X_train, axis=0)

X_train_centered = X_train - mu_train
X_test_centered = X_test - mu_train

cov_matrix = np.cov(X_train_centered, rowvar=False)
cov_inv = np.linalg.inv(cov_matrix)

In [31]:
def objective_cov_2(trial):

    n_neighbors = trial.suggest_int('n_neighbors', 1, 20)
    leaf_size = trial.suggest_int('leaf_size', 10, 100)

    knn = KNeighborsClassifier(n_neighbors=5, metric=mahalanobis, metric_params={'VI': cov_inv})
    knn.fit(X_train, y_train)
    score = scorer(knn, X_test, y_test)

    return 1. / score

study_cov_2 = optuna.create_study()
study_cov_2.optimize(objective_cov_2, n_trials=10, n_jobs=-1)

[I 2024-05-26 19:04:38,352] A new study created in memory with name: no-name-dae537ef-4d11-4b2f-b0bc-3470955ede1e
[I 2024-05-26 19:16:47,120] Trial 5 finished with value: 2.5581395348837206 and parameters: {'n_neighbors': 14, 'leaf_size': 81}. Best is trial 5 with value: 2.5581395348837206.
[I 2024-05-26 19:16:50,425] Trial 6 finished with value: 2.5581395348837206 and parameters: {'n_neighbors': 5, 'leaf_size': 29}. Best is trial 5 with value: 2.5581395348837206.
[I 2024-05-26 19:16:55,507] Trial 1 finished with value: 2.5581395348837206 and parameters: {'n_neighbors': 20, 'leaf_size': 32}. Best is trial 5 with value: 2.5581395348837206.
[I 2024-05-26 19:16:55,858] Trial 4 finished with value: 2.5581395348837206 and parameters: {'n_neighbors': 17, 'leaf_size': 53}. Best is trial 5 with value: 2.5581395348837206.
[I 2024-05-26 19:16:55,909] Trial 2 finished with value: 2.5581395348837206 and parameters: {'n_neighbors': 1, 'leaf_size': 17}. Best is trial 5 with value: 2.5581395348837206

In [33]:
cov_2_best_params = study_cov_2.best_params

knn_cov_2 = KNeighborsClassifier(
    n_neighbors=cov_2_best_params['n_neighbors'], 
    leaf_size=cov_2_best_params['leaf_size'], 
    metric=mahalanobis, 
    metric_params={'VI': cov_inv}
    )

knn_cov_2.fit(X_train, y_train)

score = scorer(knn_cov_2, X_test, y_test)
probas = knn_cov_2.predict_proba(X_test)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy with cov matrix: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy with random matrix:  0.4353535353535353
Top-3 accuracy:  0.6636363636363637


Оба способа ковариационной матрицы дали одинаковый результат, что логично. Этот результат не сильно хуже рандомной матрицы, но очень сильно уступает ей по топ 3. Это может бвть связано с тем, то случайная матрица может дать непредсказуемый результат (в том числе может показать себя лучше на какой-то тестовой выборке). Однако результат в целом будет очень непредсказуем и будет проигрывать остальным способам.

Про лучший метод, по качеству лучше всех оказался LMNN, н дал качество 0.58 на тестовой выборке. Но ниже мы увидим, что использование ковариационной матрицы вообще-то мб эффективным, особенно с бустингом.

**Задание 4. (1 балл)** Обучите какой-нибудь градиентный бустинг на обычных и трансформированных наборах данных, замерьте качество, задумайтесь о целесообразности других методов.

In [44]:
from catboost import CatBoostClassifier

catboost_clf = CatBoostClassifier(verbose=0)
catboost_clf.fit(X_train, y_train)

score = scorer(catboost_clf, X_test, y_test)
probas = catboost_clf.predict_proba(X_test)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.6151515151515151
Top-3 accuracy:  0.8727272727272727


In [45]:
catboost_clf_nca = CatBoostClassifier(verbose=0)
catboost_clf_nca.fit(X_train_nca, y_train)

score = scorer(catboost_clf_nca, X_test_nca, y_test)
probas = catboost_clf_nca.predict_proba(X_test_nca)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.6080808080808081
Top-3 accuracy:  0.8656565656565657


In [46]:
catboost_clf_cov = CatBoostClassifier(verbose=0)
catboost_clf_cov.fit(X_train_cov, y_train)

score = scorer(catboost_clf_cov, X_test_cov, y_test)
probas = catboost_clf_cov.predict_proba(X_test_cov)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.6202020202020202
Top-3 accuracy:  0.8737373737373737


О, а вот с катбустом ковариационная матрица повела себя лучше, качество повыше.

Качество модели неплохое, но кака ясно как минимум из формулировки бонуса ниже, можно попробовать другие методы и улучшить точность предсказаний (либо похимичить с бустингом, либо можно попробовать использовать другие методы, например, ITML, или выбрать другие классификаторы).

**Бонус. (1 балл)**

Достигните доли верных ответов 0.7 на тестовой выборке, не используя нейросети.

Пометка для себя: метод главных компонент, масштабирование признаков, генерация данных (?), другие расстрояния (метрики) (?)

In [13]:
# пробуем масштабирование и метод главных компонент вместе с катбустом и ковариационной матрицей

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

mu_train = np.mean(X_train_pca, axis=0)

X_train_centered = X_train_pca - mu_train
X_test_centered = X_test_pca - mu_train

cov_matrix = np.cov(X_train_centered, rowvar=False)
cov_inv = np.linalg.inv(np.linalg.cholesky(cov_matrix)).T

X_train_cov_pca = X_train_centered @ cov_inv
X_test_cov_pca = X_test_centered @ cov_inv

In [14]:
catboost_clf_cov_pca = CatBoostClassifier(verbose=0, random_seed=42)
catboost_clf_cov_pca.fit(X_train_cov_pca, y_train)

score = scorer(catboost_clf_cov_pca, X_test_cov_pca, y_test)
probas = catboost_clf_cov_pca.predict_proba(X_test_cov_pca)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.6050505050505051
Top-3 accuracy:  0.8737373737373737


Ну, качество упало на 0.2, но из плюсов обучился в 8(!) раз быстрее (а ещё надо было фиксировать рандом сид, но я совсем об этом забыла, так что поверьте мне на слова пожалуйста ахахах, если успею пофиксить, исправлю)

In [17]:
import optuna

def objective(trial):

    n_estimators = trial.suggest_int('n_estimators', 100, 1000)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)
    depth = trial.suggest_int('depth', 4, 10)

    catboost = CatBoostClassifier(verbose=0, random_seed=42, n_estimators=n_estimators, learning_rate=learning_rate, depth=depth)
    catboost.fit(X_train_cov_pca, y_train)
    score = scorer(catboost, X_test_cov_pca, y_test)

    return 1. / score

study_pca = optuna.create_study()
study_pca.optimize(objective, n_trials=10, n_jobs=-1)

[I 2024-05-26 22:01:01,188] A new study created in memory with name: no-name-d21f8642-3b8f-43e0-bb10-1786aaf34e1d
[I 2024-05-26 22:01:14,476] Trial 4 finished with value: 1.686541737649063 and parameters: {'n_estimators': 149, 'learning_rate': 0.08303126872389525, 'depth': 5}. Best is trial 4 with value: 1.686541737649063.
[I 2024-05-26 22:01:18,558] Trial 5 finished with value: 1.6582914572864322 and parameters: {'n_estimators': 327, 'learning_rate': 0.06772868443772903, 'depth': 4}. Best is trial 5 with value: 1.6582914572864322.
[I 2024-05-26 22:01:20,817] Trial 6 finished with value: 1.9186046511627906 and parameters: {'n_estimators': 343, 'learning_rate': 0.010417136163790372, 'depth': 4}. Best is trial 5 with value: 1.6582914572864322.
[I 2024-05-26 22:01:54,408] Trial 2 finished with value: 1.6610738255033557 and parameters: {'n_estimators': 220, 'learning_rate': 0.125798204999731, 'depth': 6}. Best is trial 5 with value: 1.6582914572864322.
[I 2024-05-26 22:02:00,289] Trial 3 f

In [18]:
pca_best_params = study_pca.best_params

catboost_clf_cov_pca = CatBoostClassifier(
    verbose=0, 
    random_seed=42, 
    n_estimators=pca_best_params['n_estimators'],
    learning_rate=pca_best_params['learning_rate'],
    depth=pca_best_params['depth']
    )

catboost_clf_cov_pca.fit(X_train_cov_pca, y_train)

score = scorer(catboost_clf_cov_pca, X_test_cov_pca, y_test)
probas = catboost_clf_cov_pca.predict_proba(X_test_cov_pca)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.6111111111111112
Top-3 accuracy:  0.8747474747474747


Ну, качество действительно немного улучшилось, но не особенно значительно. Ниже попробовала добавление новых признаков и ансамбль, но качество там 0.58 примерно.

In [23]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Создание полиномиальных признаков
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

selector = SelectKBest(f_classif, k=100)  # выбираем 100 лучших признаков
X_train_selected = selector.fit_transform(X_train_poly, y_train)
X_test_selected = selector.transform(X_test_poly)

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_selected)
X_test_pca = pca.transform(X_test_selected)

mu_train = np.mean(X_train_pca, axis=0)
X_train_centered = X_train_pca - mu_train
X_test_centered = X_test_pca - mu_train

cov_matrix = np.cov(X_train_centered, rowvar=False)
cov_inv = np.linalg.inv(np.linalg.cholesky(cov_matrix)).T

X_train_cov_pca = X_train_centered @ cov_inv
X_test_cov_pca = X_test_centered @ cov_inv

In [24]:
catboost_clf_cov_pca = CatBoostClassifier(verbose=0, random_seed=42)
catboost_clf_cov_pca.fit(X_train_cov_pca, y_train)

score = scorer(catboost_clf_cov_pca, X_test_cov_pca, y_test)
probas = catboost_clf_cov_pca.predict_proba(X_test_cov_pca)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy:  0.5858585858585859
Top-3 accuracy:  0.8464646464646465


In [25]:
# с использованием ансамбля моделей

from sklearn.metrics import accuracy_score

def ensemble_predict(models, X):
    probas = np.zeros((X.shape[0], len(np.unique(y_train))))
    for model in models:
        probas += model.predict_proba(X)
    probas /= len(models)
    return np.argmax(probas, axis=1), probas

ensemble_size = 5
models = []
for i in range(ensemble_size):
    model = CatBoostClassifier(
        verbose=0, 
        random_seed=42 + i,
        n_estimators=pca_best_params['n_estimators'],
        learning_rate=pca_best_params['learning_rate'],
        depth=pca_best_params['depth']
    )
    model.fit(X_train_cov_pca, y_train)
    models.append(model)

y_pred, probas = ensemble_predict(models, X_test_cov_pca)

# Оценка точности
accuracy = accuracy_score(y_test, y_pred)
top_3_accuracy = top_3_accuracy_score(y_test, probas)

print('Ensemble Accuracy: ', accuracy)
print('Ensemble Top-3 Accuracy: ', top_3_accuracy)

Ensemble Accuracy:  0.5848484848484848
Ensemble Top-3 Accuracy:  0.8555555555555555


Другие классификаторы?

In [28]:
from sklearn.svm import SVC

svm_clf = SVC()
svm_clf.fit(X_train_cov, y_train)
y_pred = svm_clf.predict(X_test_cov)

score = scorer(svm_clf, X_test_cov, y_test)
print('Accuracy: ', score)

Accuracy:  0.603030303030303


In [31]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rf_clf.fit(X_train_cov, y_train)
y_pred = rf_clf.predict(X_test_cov)

score = scorer(rf_clf, X_test_cov, y_test)
print('Accuracy: ', score)

Accuracy:  0.5787878787878787


Ну, тогда попробуем другие метрики.

In [32]:
from metric_learn import ITML_Supervised

itml = ITML_Supervised(max_iter=1000, tol=1e-3)
itml.fit(X_train, y_train)
X_train_itml = itml.transform(X_train)
X_test_itml = itml.transform(X_test)

knn_itml = KNeighborsClassifier()
knn_itml.fit(X_train_itml, y_train)

score = scorer(knn_itml, X_test_itml, y_test)
probas = knn_itml.predict_proba(X_test_itml)
top_3_score = top_3_accuracy_score(y_test, probas)

print('Accuracy with cov matrix: ', score)
print('Top-3 accuracy: ', top_3_score)

Accuracy with cov matrix:  0.5676767676767677
Top-3 accuracy:  0.7848484848484848


In [33]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_itml, y_train)

score = scorer(rf_clf, X_test_itml, y_test)

print('Accuracy with cov matrix: ', score)

Accuracy with cov matrix:  0.5959595959595959


In [35]:
svm_clf = SVC()
svm_clf.fit(X_train_itml, y_train)

score = scorer(svm_clf, X_test_itml, y_test)
print('Accuracy: ', score)

Accuracy:  0.6101010101010101


**Шашлычный бонус. (до 0.5 баллов)**

Пришло тепло, настали майские праздники. [Все летят на  на шашлындос.](https://www.youtube.com/watch?v=AgVZ6LoAm8g) А ты летишь? Добавь фотопруфы и приложи небольшой отчётик о том, как всё прошло. Можете объединиться с одногруппниками/однокурсниками, а также пригласить ассистентов/преподавателей, они тоже будут рады шашлындосу.

![alt text](photo_2024-05-04_20-46-26.jpg)

На случай, если обнулится вывод: https://drive.google.com/file/d/1tjSNqBtNRWEM0StarhCMCpm6RIML49do/view?usp=drive_link

Свозила парня на шашлыки к родителям, увиделась со своей собакой впервые зо долгое время. Остались на ночёвку. Меня расстроили тем, что не было на шашлыках моих любимых крылышек Петелинка, но в остальном всё было супер и вкусно, все остались довольны. Ещё и свежим воздухом подышали.