Preparing training and test sets, cross-validation, and hyperparameter tuning, using the k-nearest neighbors method as an example.

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.datasets import load_digits
import pandas as pd
import numpy as np
import seaborn as sns

Loading the digits dataset

In [36]:
data = load_digits()

In [37]:
x = pd.DataFrame(data.data)
y = pd.Series(data.target, name="target")

In [38]:
x.isnull().sum()

0     0
1     0
2     0
3     0
4     0
     ..
59    0
60    0
61    0
62    0
63    0
Length: 64, dtype: int64

In [39]:
x.shape

(1797, 64)

In [40]:
x.dropna(inplace=True)

In [64]:
x.shape

(1797, 64)

In [42]:
x.head()

       0     1     2     3     4     5     6     7     8     9   ...    54    55    56    57    58    59    60    61    62    63
0    0.0   0.0   5.0  13.0   9.0   1.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0   6.0  13.0  10.0   0.0   0.0   0.0
1    0.0   0.0   0.0  12.0  13.0   5.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0   0.0  11.0  16.0  10.0   0.0   0.0
2    0.0   0.0   0.0   4.0  15.0  12.0   0.0   0.0   0.0   0.0   ...   5.0   0.0   0.0   0.0   0.0   3.0  11.0  16.0   9.0   0.0
3    0.0   0.0   7.0  15.0  13.0   1.0   0.0   0.0   0.0   8.0   ...   9.0   0.0   0.0   0.0   7.0  13.0  13.0   9.0   0.0   0.0
4    0.0   0.0   0.0   1.0  11.0   0.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0   0.0   2.0  16.0   4.0   0.0   0.0


Scaling the data

In [43]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
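In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and variance used for scaling. A minimal sketch of one way to avoid that, assuming a scikit-learn pipeline is acceptable here: split the raw features first and let the pipeline fit `StandardScaler` on the training rows only. The names `digits`, `X_train_raw`, `X_test_raw`, `y_train_p`, `y_test_p`, `pipe`, and `test_accuracy` are introduced for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Split the raw (unscaled) features first.
X_train_raw, X_test_raw, y_train_p, y_test_p = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when it transforms the test rows inside score().
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
pipe.fit(X_train_raw, y_train_p)

test_accuracy = pipe.score(X_test_raw, y_test_p)
print(f"Test accuracy with train-only scaling: {test_accuracy:.2f}")
```

The resulting accuracy may differ slightly from the figures reported in this notebook, because the scaling statistics are computed from fewer rows.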

In [58]:
# Train the model with K=2
knn = KNeighborsClassifier(n_neighbors=2)  # KNN classifier that considers the 2 nearest neighbors when classifying a new sample
knn.fit(X_train, y_train)  # fit on the training set X_train with the labels y_train

# Predict classes on the test set
y_pred = knn.predict(X_test)

# Evaluate model quality
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy (K=2): {accuracy:.2f}")
print(classification_report(y_test, y_pred))

Accuracy (K=2): 0.96
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       0.88      1.00      0.93        28
           2       0.94      0.97      0.96        33
           3       0.94      0.97      0.96        34
           4       0.98      0.98      0.98        46
           5       0.98      0.98      0.98        47
           6       0.97      1.00      0.99        35
           7       1.00      0.97      0.99        34
           8       0.96      0.90      0.93        30
           9       0.97      0.88      0.92        40

    accuracy                           0.96       360
   macro avg       0.96      0.96      0.96       360
weighted avg       0.97      0.96      0.96       360
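To make the role of `n_neighbors` concrete, here is a minimal NumPy sketch of the nearest-neighbor vote on toy data. It assumes Euclidean distance and a plain majority vote with no special tie handling; it is an illustration of the idea, not scikit-learn's implementation. The function `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical names.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count the labels of those neighbors and return the most frequent one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two points near the origin (class 0), three near (1, 1) (class 1).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=3))
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3))
```

An odd `k` is used in the toy calls so that a two-class vote cannot end in a tie; with an even value such as K=2 the two neighbors can disagree, and the result then depends on the tie-breaking rule.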



Tuning the hyperparameter K with GridSearchCV and RandomizedSearchCV

In [59]:
# Parameter grid to search
param_grid = {'n_neighbors': np.arange(2, 20)}

# GridSearchCV: exhaustive search over every value in the grid (3-fold CV)
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best K (GridSearchCV): {grid_search.best_params_}")
print(f"Best CV accuracy (GridSearchCV): {grid_search.best_score_:.2f}")

# RandomizedSearchCV: random sample of 15 of the 18 grid values (5-fold CV, so its
# best score is not computed on the same folds as the GridSearchCV score above)
random_search = RandomizedSearchCV(KNeighborsClassifier(), param_grid, cv=5, n_iter=15, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)
print(f"Best K (RandomizedSearchCV): {random_search.best_params_}")
print(f"Best CV accuracy (RandomizedSearchCV): {random_search.best_score_:.2f}")

Best K (GridSearchCV): {'n_neighbors': 3}
Best CV accuracy (GridSearchCV): 0.97
Best K (RandomizedSearchCV): {'n_neighbors': 3}
Best CV accuracy (RandomizedSearchCV): 0.97
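The best score alone does not show how close the other values of K came. A sketch of how the per-candidate scores could be inspected through the `cv_results_` attribute of a fitted search; it assumes the scaler is placed inside a `Pipeline` so that each CV fold is scaled using its own training part. The step names `scale` and `knn` and the variables `search`, `results`, `X_tr`, `X_te`, `y_tr`, `y_te` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Scaling happens inside the pipeline, so it is refitted for every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(
    pipe, {"knn__n_neighbors": np.arange(2, 20)}, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# One row per candidate K, best-ranked first.
columns = ["param_knn__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.head())
```

Comparing `mean_test_score` against `std_test_score` gives a rough sense of whether the gap between neighboring values of K is larger than the fold-to-fold variation.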


Evaluating the best model

In [60]:
# Use the estimator refitted with the best K found by GridSearchCV
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)

# Evaluate the best model on the test set
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy (best model): {accuracy_best:.2f}")
print(classification_report(y_test, y_pred_best))

Accuracy (best model): 0.97
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       0.93      1.00      0.97        28
           2       0.94      0.97      0.96        33
           3       0.97      0.97      0.97        34
           4       0.98      1.00      0.99        46
           5       0.98      0.96      0.97        47
           6       0.97      1.00      0.99        35
           7       1.00      0.97      0.99        34
           8       0.97      0.93      0.95        30
           9       0.95      0.90      0.92        40

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360
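The precision and recall columns in the report can be derived from a confusion matrix: per-class recall is the diagonal entry divided by its row sum (true samples of that class), and per-class precision is the diagonal entry divided by its column sum (samples predicted as that class). A small sketch on made-up labels, cross-checked against scikit-learn's own metric functions; `y_true_toy` and `y_hat_toy` are illustrative arrays, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for three classes.
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat_toy = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_hat_toy)

# Recall: correct predictions over all true samples of the class (row sum).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions over all samples predicted as the class (column sum).
precision = np.diag(cm) / cm.sum(axis=0)

print(cm)
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))

# Cross-check against scikit-learn's per-class metrics.
assert np.allclose(precision, precision_score(y_true_toy, y_hat_toy, average=None))
assert np.allclose(recall, recall_score(y_true_toy, y_hat_toy, average=None))
```

Applying `confusion_matrix(y_test, y_pred_best)` in the same way would show which digits are confused with each other.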



Comparing quality metrics of the initial and best models

In [61]:
print(f"Accuracy of the initial model (K=2): {accuracy:.2f}")
print(f"Accuracy of the best model (K={grid_search.best_params_['n_neighbors']}): {accuracy_best:.2f}")

Accuracy of the initial model (K=2): 0.96
Accuracy of the best model (K=3): 0.97


In [62]:
# Strategy 1: integer cv=5. For a classifier, scikit-learn resolves an integer cv to
# StratifiedKFold without shuffling (not plain KFold); GridSearchCV uses the same default.
cv_scores = cross_val_score(best_knn, X_scaled, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy (cv=5, no shuffling): {np.mean(cv_scores):.2f}")

# Strategy 2: StratifiedKFold with shuffling. Each fold keeps the class proportions of the
# full dataset, which matters most for imbalanced classes.
from sklearn.model_selection import StratifiedKFold
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_stratified = cross_val_score(best_knn, X_scaled, y, cv=stratified_cv, scoring='accuracy')
print(f"Cross-validation accuracy (StratifiedKFold, shuffled): {np.mean(cv_scores_stratified):.2f}")

Cross-validation accuracy (cv=5, no shuffling): 0.94
Cross-validation accuracy (StratifiedKFold, shuffled): 0.97
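When `cv` is an integer and the estimator is a classifier, scikit-learn builds a `StratifiedKFold` splitter without shuffling, so both runs above are stratified and differ in shuffling. The sketch below checks this with `check_cv` and compares how evenly plain `KFold` and `StratifiedKFold` spread the classes over the test folds. The helper `class_count_spread` and the variable names are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_digits = load_digits().target

# What does cv=5 turn into for a classifier with multiclass labels?
resolved = check_cv(5, y_digits, classifier=True)
print(type(resolved).__name__, "shuffle =", resolved.shuffle)


def class_count_spread(cv, y):
    """Largest fold-to-fold difference in the test-fold count of any single class."""
    placeholder_X = np.zeros((len(y), 1))  # the splitters only need the number of rows
    counts = np.array(
        [np.bincount(y[test_idx], minlength=10) for _, test_idx in cv.split(placeholder_X, y)]
    )
    return int((counts.max(axis=0) - counts.min(axis=0)).max())


kfold_spread = class_count_spread(KFold(n_splits=5), y_digits)
strat_spread = class_count_spread(StratifiedKFold(n_splits=5), y_digits)
print("KFold spread:", kfold_spread)
print("StratifiedKFold spread:", strat_spread)
```

A smaller spread means the class proportions in each fold are closer to those of the full dataset.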
