## **Homework: Model Selection for Binary Classification**

**Goal**: Perform feature selection and hyperparameter tuning for several binary classification models using different methods. Compare the results and choose the best model.

The assignment counts as successfully completed if at least three models are trained and the best one is chosen based on test metrics.

Send your notebooks to simon.ilishaev@gmail.com with the subject line [ML в Рисках].


#### **Data and Initial Setup**  
1. [Load the dataset](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset) (numerical and categorical features, binary target variable).  
2. Make a **stratified train-test split** (for example, 70-30). The **test set** is used **only for the final model evaluation**.

#### Train-Validation Split Approach  
1. Split the **training set (train)** once more into **train-validation** (for example, 80-20).  
2. Perform **feature selection with filter methods** on the **train subset**.  
3. Tune the hyperparameters (for example, `C` for logistic regression, `max_depth` for a decision tree, and so on) on the **validation set**.  
4. **Optional**: Use **Differential Evolution from SciPy** to optimize the logistic regression hyperparameters.  

#### Cross-Validation Approach  
1. Use **cross-validation (CV)** for **feature selection and hyperparameter tuning**.  
2. Implement **GridSearchCV** to search over the hyperparameters.  
3. **Optional**: Use **Optuna** with **multi-objective optimization** (maximizing ROC-AUC and Precision-Recall AUC).  
4. **Optional**: Visualize the **Pareto front** of the Optuna trials.  

#### **Final Model Evaluation**  
1. Evaluate all tuned models on the **test set** (ROC-AUC, Precision-Recall AUC, F1-score).  
2. **Choose the best model** based on the test metrics.  
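
The three test metrics listed above can be computed with one small helper. This is a minimal sketch, not part of the original assignment: the function name `evaluate_scores`, the 0.5 threshold used for F1, and the toy labels and scores are all assumptions made for illustration.

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def evaluate_scores(y_true, y_score, threshold=0.5):
    """Return ROC-AUC, PR-AUC and F1 for one model.

    ROC-AUC and PR-AUC are computed from the scores; F1 needs hard labels,
    so the scores are cut at `threshold` (0.5 is an assumed default).
    """
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "ROC-AUC": roc_auc_score(y_true, y_score),
        "PR-AUC": average_precision_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
    }


# Toy labels and scores, made up for illustration only.
metrics = evaluate_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(metrics)
```

Keeping the metric computation in one place makes it harder to evaluate different models with different thresholds or metric definitions by accident.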

### **Models to Use**  
- Logistic regression (`LogisticRegression`)  
- Decision tree (`DecisionTreeClassifier`)  
- Random forest (`RandomForestClassifier`)
- ...

### Documentation

[Scikit-Learn Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html)

[Category Encoders](https://contrib.scikit-learn.org/category_encoders/)

[Grid Search](https://scikit-learn.org/stable/modules/grid_search.html)

[Optuna example](https://github.com/optuna/optuna-examples/blob/main/sklearn/sklearn_simple.py)

[Pareto front](https://optuna.readthedocs.io/en/stable/reference/visualization/generated/optuna.visualization.plot_pareto_front.html#sphx-glr-reference-visualization-generated-optuna-visualization-plot-pareto-front-py)

[Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/compose.html)

[Differential Evolution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html)


---

In [1]:
# %%bash
# !pip install ucimlrepo
# !pip install category_encoders
# !pip install optuna

In [2]:
# libraries that may be needed to complete the assignment
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RepeatedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score
from category_encoders import TargetEncoder
from scipy.optimize import differential_evolution
import optuna
from optuna.visualization import plot_pareto_front
import matplotlib.pyplot as plt

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
secondary_mushroom = fetch_ucirepo(id=848)

# data (as pandas dataframes)
X = secondary_mushroom.data.features
y = secondary_mushroom.data.targets

# print the dataset metadata and the variable information
# metadata
print(secondary_mushroom.metadata)

# variable information
print(secondary_mushroom.variables)

{'uci_id': 848, 'name': 'Secondary Mushroom', 'repository_url': 'https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/848/data.csv', 'abstract': 'Dataset of simulated mushrooms for binary classification into edible and poisonous.', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 61068, 'num_features': 20, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2021, 'last_updated': 'Wed Apr 10 2024', 'dataset_doi': '10.24432/C5FP5Q', 'creators': ['Dennis Wagner', 'D. Heider', 'Georges Hattab'], 'intro_paper': {'ID': 259, 'type': 'NATIVE', 'title': 'Mushroom data creation, curation, and simulation to support classification tasks', 'authors': 'Dennis Wagner, D. Heider, Georges Hattab', 'venue': 'Scientific Reports', 'year': 2021, 'journal': None, '

In [4]:
# target: p - poisonous, e - edible
y = y['class'].map({'p': 1, 'e': 0})

In [5]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Categorical features
cat_cols = list(X.select_dtypes('object').columns)
print(cat_cols)
# Numerical features
num_cols = [col for col in X.columns if col not in cat_cols + ["target"]]
print(num_cols)

['cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed', 'gill-attachment', 'gill-spacing', 'gill-color', 'stem-root', 'stem-surface', 'stem-color', 'veil-type', 'veil-color', 'has-ring', 'ring-type', 'spore-print-color', 'habitat', 'season']
['cap-diameter', 'stem-height', 'stem-width']


In [6]:
# Hint: assemble a pipeline from several components
pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000))
])

# Example with logistic regression
# Tuning via GridSearchCV with RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
param_grid = {
    "selector__k": [5, 10, 15],
    "model__C": [0.01, 0.1, 1, 10]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Evaluation on the test set
test_roc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Test ROC-AUC (GridSearch): {test_roc:.3f}")

Test ROC-AUC (GridSearch): 0.861


---
## Homework Implementation


### Data and Initial Setup

In [7]:
df = pd.read_csv("secondary_data.csv", sep=";")


y = df['class'].map({'p': 1, 'e': 0})
X = df.drop(columns=['class'])

# train/test (70/30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")
print("Train target distribution:")
print(y_train.value_counts(normalize=True))
print("Test target distribution:")
print(y_test.value_counts(normalize=True))

df.head()

Train shape: (42748, 20), Test shape: (18321, 20)
Train target distribution:
class
1    0.554903
0    0.445097
Name: proportion, dtype: float64
Test target distribution:
class
1    0.554937
0    0.445063
Name: proportion, dtype: float64


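|  | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 15.26 | x | g | o | f | e |  | w | 16.95 | ... | s | y | w | u | w | t | g |  | d | w |
| 1 | p | 16.6 | x | g | o | f | e |  | w | 17.99 | ... | s | y | w | u | w | t | g |  | d | u |
| 2 | p | 14.07 | x | g | o | f | e |  | w | 17.8 | ... | s | y | w | u | w | t | g |  | d | w |
| 3 | p | 14.17 | f | h | e | f | e |  | w | 15.77 | ... | s | y | w | u | w | t | p |  | d | w |
| 4 | p | 14.64 | x | h | o | f | e |  | w | 16.53 | ... | s | y | w | u | w | t | p |  | d | w |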


### Train-Validation Split Approach  

In [8]:
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

print(f"Train split shape: {X_tr.shape}, Validation split shape: {X_val.shape}")
print("Train split target distribution:")
print(y_tr.value_counts(normalize=True))
print("Validation split target distribution:")
print(y_val.value_counts(normalize=True))

Train split shape: (34198, 20), Validation split shape: (8550, 20)
Train split target distribution:
class
1    0.554915
0    0.445085
Name: proportion, dtype: float64
Validation split target distribution:
class
1    0.554854
0    0.445146
Name: proportion, dtype: float64


In [9]:
# train/validation (80/20)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

# Filter method: rank features by absolute Spearman correlation with the target
cat_cols = list(X_tr.select_dtypes('object').columns)
num_cols = [col for col in X_tr.columns if col not in cat_cols]

encoder = TargetEncoder(cols=cat_cols)
X_tr_enc = encoder.fit_transform(X_tr, y_tr)
X_val_enc = encoder.transform(X_val)
X_test_enc = encoder.transform(X_test)

correlations = X_tr_enc.corrwith(y_tr, method='spearman').abs()
top_features = correlations.sort_values(ascending=False).head(8).index.tolist()
print("Selected features:", top_features)

X_tr_sel = X_tr_enc[top_features]
X_val_sel = X_val_enc[top_features]
X_test_sel = X_test_enc[top_features]

Selected features: ['stem-color', 'stem-width', 'cap-color', 'stem-surface', 'cap-diameter', 'cap-surface', 'spore-print-color', 'gill-attachment']
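
The cell above fits the encoder on the train subset only and then reuses it for the validation and test sets, so target information from those sets never enters the encoding. A minimal sketch of that idea in plain Python follows. It uses simple per-category means, whereas `TargetEncoder` from `category_encoders` additionally smooths the means toward the global prior; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_mean_encoding(categories, targets):
    """Learn category -> mean(target) from the training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    prior = sum(targets) / len(targets)  # fallback for unseen categories
    return {c: sums[c] / counts[c] for c in sums}, prior


def apply_mean_encoding(categories, mapping, prior):
    """Encode any split with the statistics learned on the training split."""
    return [mapping.get(c, prior) for c in categories]


# Hypothetical toy data: one categorical column and a binary target.
train_cat = ["w", "w", "n", "n", "n", "y"]
train_y = [1, 0, 1, 1, 0, 0]
val_cat = ["w", "n", "g"]  # "g" never appears in the training split

mapping, prior = fit_mean_encoding(train_cat, train_y)
val_encoded = apply_mean_encoding(val_cat, mapping, prior)
print(val_encoded)
```

The validation rows are encoded without ever looking at their labels, which is the property the fit/transform separation is meant to guarantee.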


In [10]:
# Hyperparameter tuning on the validation set
best_roc = 0
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"LogReg C={C}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_C = C
logreg_best = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr_sel, y_tr)

best_roc = 0
for d in [2, 3, 4, 5, 6, 7, 8]:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"Tree max_depth={d}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_d = d
tree_best = DecisionTreeClassifier(max_depth=best_d, random_state=42).fit(X_tr_sel, y_tr)

best_roc = 0
for n in [10, 30, 50, 100]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"RF n_estimators={n}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_n = n
rf_best = RandomForestClassifier(n_estimators=best_n, random_state=42).fit(X_tr_sel, y_tr)


LogReg C=0.01: ROC-AUC=0.795
LogReg C=0.1: ROC-AUC=0.813
LogReg C=1: ROC-AUC=0.815
LogReg C=10: ROC-AUC=0.815
LogReg C=100: ROC-AUC=0.815
Tree max_depth=2: ROC-AUC=0.682
Tree max_depth=3: ROC-AUC=0.750
Tree max_depth=4: ROC-AUC=0.785
Tree max_depth=5: ROC-AUC=0.830
Tree max_depth=6: ROC-AUC=0.877
Tree max_depth=7: ROC-AUC=0.922
Tree max_depth=8: ROC-AUC=0.942
RF n_estimators=10: ROC-AUC=1.000
RF n_estimators=30: ROC-AUC=1.000
RF n_estimators=50: ROC-AUC=1.000
RF n_estimators=100: ROC-AUC=1.000
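
The three loops above follow the same pattern: fit one model per candidate value and keep the one with the best validation ROC-AUC. A possible way to factor that pattern out is sketched below. The helper name `tune_on_validation` and the synthetic data from `make_classification` are assumptions for illustration and stand in for the encoded, filtered mushroom features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_on_validation(make_model, values, X_fit, y_fit, X_hold, y_hold):
    """Fit one model per candidate value; keep the best by validation ROC-AUC."""
    best_value, best_score, best_model = None, float("-inf"), None
    for value in values:
        model = make_model(value).fit(X_fit, y_fit)
        score = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
        if score > best_score:
            best_value, best_score, best_model = value, score, model
    return best_value, best_score, best_model


# Synthetic stand-in data (an assumption, not the mushroom dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

best_depth, best_auc, best_tree = tune_on_validation(
    lambda d: DecisionTreeClassifier(max_depth=d, random_state=42),
    [2, 3, 4, 5, 6, 7, 8],
    X_a, y_a, X_b, y_b,
)
print(f"best max_depth={best_depth}, validation ROC-AUC={best_auc:.3f}")
```

The helper returns the already fitted best model, so no separate refit on the same data is needed.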


In [11]:
# [optional] Differential Evolution from SciPy to optimize C of the logistic regression
def de_objective(params):
    C = params[0]
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    return -roc_auc_score(y_val, val_pred)  # minimize

bounds = [(0.001, 100)]
result = differential_evolution(de_objective, bounds, disp=False)
print(f"DE best C: {result.x[0]:.4f}, ROC-AUC={-result.fun:.3f}")

DE best C: 76.0960, ROC-AUC=0.815
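
Regularization strength is usually searched on a logarithmic scale, because uniform sampling over the linear range [0.001, 100] puts about 90% of the candidates between 10 and 100. One possible variant, sketched below under stated assumptions, optimizes `log10(C)` instead and fixes `seed` for reproducibility. The synthetic data, the variable names, and the small `maxiter` and `popsize` values are illustrative choices, not the settings used in the cell above.

```python
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the mushroom dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)


def de_objective_log(params):
    """Negative validation ROC-AUC as a function of log10(C)."""
    C = 10.0 ** params[0]
    model = LogisticRegression(C=C, max_iter=1000).fit(X_fit, y_fit)
    return -roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])


# log10(C) in [-3, 2] corresponds to C in [0.001, 100].
log_bounds = [(-3.0, 2.0)]
de_result = differential_evolution(
    de_objective_log, log_bounds, seed=42, maxiter=10, popsize=8, polish=False
)
best_C_log = 10.0 ** de_result.x[0]
print(f"DE best C: {best_C_log:.4f}, ROC-AUC={-de_result.fun:.3f}")
```

`polish=False` skips the final gradient-based refinement, which adds little here because ROC-AUC is piecewise constant in `C`.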


### Cross-Validation Approach  

In [12]:
cat_cols = list(X_train.select_dtypes('object').columns)

# Pipeline: TargetEncoder -> SelectKBest -> model
pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000))
])

# parameter grid for GridSearchCV
param_grid = {
    "selector__k": [5, 8, 12, 16],
    "model__C": [0.01, 0.1, 1, 10, 100]
}

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV ROC-AUC:", grid_search.best_score_)

test_pred = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
test_roc = roc_auc_score(y_test, test_pred)
test_pr = average_precision_score(y_test, test_pred)
print(f"Test ROC-AUC: {test_roc:.3f}")
print(f"Test PR-AUC: {test_pr:.3f}")

Fitting 10 folds for each of 20 candidates, totalling 200 fits
Best parameters: {'model__C': 10, 'selector__k': 16}
Best CV ROC-AUC: 0.8670090306092192
Test ROC-AUC: 0.868
Test PR-AUC: 0.899
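
`RepeatedKFold`, used above, shuffles rows without looking at the labels, so the class balance can drift from fold to fold. scikit-learn also provides `RepeatedStratifiedKFold`, which keeps the class proportions in every fold and can be passed as `cv=` to `GridSearchCV` in the same way. The sketch below compares the two on toy labels; the 55/45 balance only roughly mimics the train split, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

# Toy labels with a 55/45 class balance, roughly like the train split above.
y_toy = np.array([1] * 55 + [0] * 45)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

plain_cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
strat_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# Share of positives in the validation part of every fold.
plain_rates = [y_toy[val].mean() for _, val in plain_cv.split(X_toy, y_toy)]
strat_rates = [y_toy[val].mean() for _, val in strat_cv.split(X_toy, y_toy)]

print("RepeatedKFold positive rate per fold:          ", np.round(plain_rates, 2))
print("RepeatedStratifiedKFold positive rate per fold:", np.round(strat_rates, 2))
```

With tens of thousands of rows the drift under plain K-fold is small, so this matters mostly for smaller or more imbalanced datasets.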


In [13]:
# [optional] Optuna multi-objective optimization: maximize ROC-AUC and PR-AUC

def objective(trial):
    k = trial.suggest_int("k", 5, min(16, X_train.shape[1]))
    C = trial.suggest_float("C", 0.01, 100, log=True)
    pipeline = Pipeline([
        ("encoder", TargetEncoder(cols=cat_cols)),
        ("selector", SelectKBest(score_func=f_classif, k=k)),
        ("model", LogisticRegression(C=C, max_iter=1000))
    ])
    scores_roc = []
    scores_pr = []
    for train_idx, val_idx in cv.split(X_train, y_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        pipeline.fit(X_tr, y_tr)
        pred = pipeline.predict_proba(X_val)[:, 1]
        scores_roc.append(roc_auc_score(y_val, pred))
        scores_pr.append(average_precision_score(y_val, pred))
    return np.mean(scores_roc), np.mean(scores_pr)

study = optuna.create_study(
    directions=["maximize", "maximize"],
    study_name="multi_metric"
)
study.optimize(objective, n_trials=30)

[I 2025-05-21 15:40:49,812] A new study created in memory with name: multi_metric
[I 2025-05-21 15:40:52,800] Trial 0 finished with values: [0.8495939621572834, 0.8882009863741007] and parameters: {'k': 10, 'C': 6.3105860823988404}.
[I 2025-05-21 15:40:55,998] Trial 1 finished with values: [0.8527700293383186, 0.8900770640213022] and parameters: {'k': 12, 'C': 19.95628421916136}.
[I 2025-05-21 15:40:58,571] Trial 2 finished with values: [0.8327834850984583, 0.87728188361848] and parameters: {'k': 7, 'C': 0.11075661685329231}.
[I 2025-05-21 15:41:01,574] Trial 3 finished with values: [0.8496060918165933, 0.888197105214631] and parameters: {'k': 10, 'C': 5.111320349097709}.
[I 2025-05-21 15:41:04,224] Trial 4 finished with values: [0.8456347885744547, 0.886929643274047] and parameters: {'k': 8, 'C': 0.6573074077755312}.
[I 2025-05-21 15:41:07,199] Trial 5 finished with values: [0.8495694669806733, 0.8882166706507564] and parameters: {'k': 10, 'C': 90.05220680965148}.
[I 2025-05-21 15:41:

In [14]:
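# In a multi-objective study, best_trials holds all Pareto-optimal trials;
# the first of them is reported here.
print("Optuna best ROC-AUC:", study.best_trials[0].values[0])
print("Optuna best PR-AUC:", study.best_trials[0].values[1])
print("Optuna best params:", study.best_trials[0].params)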

Optuna best ROC-AUC: 0.8670018031739989
Optuna best PR-AUC: 0.8990184189288014
Optuna best params: {'k': 16, 'C': 12.592857649031602}


In [15]:
# [optional] Pareto front of the Optuna trials
from optuna.visualization import plot_pareto_front
plot_pareto_front(study)
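
For two maximized objectives, a trial belongs to the Pareto front if no other trial is at least as good on both metrics and strictly better on one. The sketch below shows that rule in plain Python on hypothetical (ROC-AUC, PR-AUC) pairs; the numbers are made up and are not taken from the study above. It also shows one possible rule for reducing the front to a single trial.

```python
def pareto_front(points):
    """Return the points not dominated by any other (both objectives maximized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (ROC-AUC, PR-AUC) pairs, made up for illustration.
trials = [(0.850, 0.888), (0.853, 0.890), (0.833, 0.877), (0.852, 0.891)]
front = pareto_front(trials)
print(front)

# One way to pick a single trial: highest ROC-AUC among the Pareto-optimal ones.
chosen = max(front, key=lambda t: t[0])
print(chosen)
```

Choosing by ROC-AUC is only one option; any explicit rule is easier to justify than taking whichever Pareto-optimal trial happens to come first.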

### Final Model Evaluation

In [17]:
logreg_test_pred = logreg_best.predict_proba(X_test_sel)[:, 1]
tree_test_pred = tree_best.predict_proba(X_test_sel)[:, 1]
rf_test_pred = rf_best.predict_proba(X_test_sel)[:, 1]
grid_test_pred = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
de_logreg = LogisticRegression(C=result.x[0], max_iter=1000).fit(X_tr_sel, y_tr)
de_test_pred = de_logreg.predict_proba(X_test_sel)[:, 1]

best_trial = study.best_trials[0]
k_optuna = best_trial.params['k']
C_optuna = best_trial.params['C']

optuna_pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif, k=k_optuna)),
    ("model", LogisticRegression(C=C_optuna, max_iter=1000))
])
optuna_pipeline.fit(X_train, y_train)
optuna_test_pred = optuna_pipeline.predict_proba(X_test)[:, 1]

In [None]:
results_table = pd.DataFrame(
[
    {
        "Model": "Logistic regression (train/val)",
        "ROC-AUC": roc_auc_score(y_test, logreg_test_pred),
        "PR-AUC": average_precision_score(y_test, logreg_test_pred),
        "F1-score": f1_score(y_test, logreg_best.predict(X_test_sel))
    },
    {
        "Model": "Decision tree (train/val)",
        "ROC-AUC": roc_auc_score(y_test, tree_test_pred),
        "PR-AUC": average_precision_score(y_test, tree_test_pred),
        "F1-score": f1_score(y_test, tree_best.predict(X_test_sel))
    },
    {
        "Model": "Random forest (train/val)",
        "ROC-AUC": roc_auc_score(y_test, rf_test_pred),
        "PR-AUC": average_precision_score(y_test, rf_test_pred),
        "F1-score": f1_score(y_test, rf_best.predict(X_test_sel))
    },
    {
        "Model": "GridSearchCV (cross-validation)",
        "ROC-AUC": roc_auc_score(y_test, grid_test_pred),
        "PR-AUC": average_precision_score(y_test, grid_test_pred),
        "F1-score": f1_score(y_test, grid_search.best_estimator_.predict(X_test))
    },
    {
        "Model": "Differential Evolution",
        "ROC-AUC": roc_auc_score(y_test, de_test_pred),
        "PR-AUC": average_precision_score(y_test, de_test_pred),
        "F1-score": f1_score(y_test, de_logreg.predict(X_test_sel))
    },
    {
        "Model": "Optuna (optimization)",
        "ROC-AUC": roc_auc_score(y_test, optuna_test_pred),
        "PR-AUC": average_precision_score(y_test, optuna_test_pred),
        "F1-score": f1_score(y_test, optuna_pipeline.predict(X_test))
    }
])

# Rank the models by test ROC-AUC, best first
results_table = results_table.sort_values("ROC-AUC", ascending=False).reset_index(drop=True)

display(results_table.round(6))

best_row = results_table.iloc[0]
print(f"\nBest model: {best_row['Model']} (ROC-AUC={best_row['ROC-AUC']:.6f})")


|  | Model | ROC-AUC | PR-AUC | F1-score |
|---|---|---|---|---|
| 0 | Random forest (train/val) | 0.999911 | 0.999935 | 0.998918 |
| 1 | Decision tree (train/val) | 0.942962 | 0.948545 | 0.878764 |
| 2 | GridSearchCV (cross-validation) | 0.86786 | 0.899306 | 0.822181 |
| 3 | Optuna (optimization) | 0.867859 | 0.899221 | 0.822279 |
| 4 | Differential Evolution | 0.82052 | 0.867635 | 0.770071 |
| 5 | Logistic regression (train/val) | 0.82043 | 0.867598 | 0.77005 |


Best model: Random forest (train/val) (ROC-AUC=0.999911)
