## **Homework: Model Selection for Binary Classification**

**Goal**: Perform feature selection and hyperparameter tuning for several binary classification models using different methods. Compare the results and select the best model.

The assignment counts as successfully completed if at least three models are trained and the best one is selected based on test metrics.

Send your notebooks to simon.ilishaev@gmail.com with the email subject [ML в Рисках].


#### **Data and initial setup**  
1. [Load the dataset](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset) (numerical and categorical features, binary target variable).  
2. Make a **stratified train-test split** (for example, 70-30). The **test set** is used **only for the final model evaluation**.
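
As a small illustration of what stratification buys, the sketch below splits synthetic data (not the mushroom dataset) and checks that both parts keep the original class ratio. The `_demo` names and the 30% positive rate are assumptions made only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 rows with a 30% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratify=y_demo, both parts keep (approximately) the 30% positive rate.
print(f"train positive rate: {y_tr_demo.mean():.3f}")
print(f"test positive rate:  {y_te_demo.mean():.3f}")
```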

#### Train-validation split approach  
1. Split the **training set (train)** once more into **train-validation** (for example, 80-20).  
2. Perform **feature selection with filter methods** on the **train subset**.  
3. Tune hyperparameters (for example, `C` for logistic regression, `max_depth` for a decision tree, etc.) by fitting on the train subset and scoring on the **validation set**.  
4. **Optional**: Use **Differential Evolution from SciPy** to optimize the logistic regression hyperparameters.  
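
The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data that stands in for an already encoded training set; the `_demo` names, `k=5`, and the grid of `C` values are arbitrary assumptions, and `f_classif` is just one possible filter criterion.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an already encoded training set (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=0
)
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit the filter on the train subset only, then reuse it on validation.
selector_demo = SelectKBest(score_func=f_classif, k=5)
X_tr_sel_demo = selector_demo.fit_transform(X_tr_demo, y_tr_demo)
X_val_sel_demo = selector_demo.transform(X_val_demo)

# Score every candidate C on the validation split and keep the best one.
val_scores = {}
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr_sel_demo, y_tr_demo)
    val_pred = model.predict_proba(X_val_sel_demo)[:, 1]
    val_scores[C] = roc_auc_score(y_val_demo, val_pred)

best_C_demo = max(val_scores, key=val_scores.get)
print(f"best C: {best_C_demo}, validation ROC-AUC: {val_scores[best_C_demo]:.3f}")
```

Fitting the selector only on the train subset keeps the validation score an honest estimate, because the validation rows never influence which features are kept.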

#### Cross-validation approach  
1. Use **cross-validation (CV)** for **feature selection and hyperparameter tuning**.  
2. Implement **GridSearchCV** to search over hyperparameters.  
3. **Optional**: Use **Optuna** with **multi-objective optimization** (maximizing ROC-AUC and Precision-Recall AUC).  
4. **Optional**: Visualize the **Pareto front** of the Optuna trials.  
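
One way to obtain both metrics from a single cross-validation run is `cross_validate` with a dictionary of scorers, keeping feature selection inside the pipeline so it is refit in every fold. The sketch below uses synthetic data; the `_demo` names, `k=5`, and the choice of `RepeatedStratifiedKFold` are assumptions for illustration rather than requirements of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic data, illustrative only.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Selection lives inside the pipeline, so each fold selects features
# using only its own training part.
pipe_demo = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
cv_result = cross_validate(
    pipe_demo, X_demo, y_demo, cv=cv_demo,
    scoring={"roc_auc": "roc_auc", "pr_auc": "average_precision"},
)

mean_roc = float(np.mean(cv_result["test_roc_auc"]))
mean_pr = float(np.mean(cv_result["test_pr_auc"]))
print(f"CV ROC-AUC: {mean_roc:.3f}, CV PR-AUC: {mean_pr:.3f}")
```

The two averaged scores are the kind of pair a multi-objective Optuna objective would return for one set of hyperparameters.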

#### **Final model evaluation**  
1. Evaluate all tuned models on the **test set** (ROC-AUC, Precision-Recall AUC, F1-score).  
2. **Select the best model** based on the test metrics.  
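
To make the three metrics concrete, here is a tiny hand-checkable example. The labels and scores are made up for illustration; note that ROC-AUC and PR-AUC take predicted probabilities, while F1 needs hard labels (here obtained with an assumed 0.5 threshold).

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Made-up labels and predicted probabilities for four objects.
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Ranking metrics work directly on the scores.
roc_demo = roc_auc_score(y_true_demo, y_score_demo)
pr_demo = average_precision_score(y_true_demo, y_score_demo)

# F1 needs hard labels; 0.5 is an assumed decision threshold.
y_label_demo = (y_score_demo >= 0.5).astype(int)
f1_demo = f1_score(y_true_demo, y_label_demo)

print(f"ROC-AUC={roc_demo:.3f}, PR-AUC={pr_demo:.3f}, F1={f1_demo:.3f}")
# → ROC-AUC=0.750, PR-AUC=0.833, F1=0.667
```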

### **Models to use**  
- Logistic regression (`LogisticRegression`)  
- Decision tree (`DecisionTreeClassifier`)  
- Random forest (`RandomForestClassifier`)
- ...

### Documentation

[Scikit-Learn Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html)

[Category Encoders](https://contrib.scikit-learn.org/category_encoders/)

[Grid Search](https://scikit-learn.org/stable/modules/grid_search.html)

[Optuna example](https://github.com/optuna/optuna-examples/blob/main/sklearn/sklearn_simple.py)

[Pareto front](https://optuna.readthedocs.io/en/stable/reference/visualization/generated/optuna.visualization.plot_pareto_front.html#sphx-glr-reference-visualization-generated-optuna-visualization-plot-pareto-front-py)

[Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/compose.html)

[Differential Evolution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html)


---

In [None]:
# %%bash
# !pip install ucimlrepo
# !pip install category_encoders
# !pip install optuna

In [None]:
# libraries that may be needed to complete the assignment
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RepeatedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score
from category_encoders import TargetEncoder
from scipy.optimize import differential_evolution
import optuna
from optuna.visualization import plot_pareto_front
import matplotlib.pyplot as plt

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
secondary_mushroom = fetch_ucirepo(id=848)

# data (as pandas dataframes)
X = secondary_mushroom.data.features
y = secondary_mushroom.data.targets

# print the dataset metadata (comment out the print calls below to reduce output)
# metadata
print(secondary_mushroom.metadata)

# variable information
print(secondary_mushroom.variables)

In [None]:
# target: p - poisonous (mapped to 1), e - edible (mapped to 0)
y = y['class'].map({'p': 1, 'e': 0})

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Categorical features
cat_cols = list(X.select_dtypes('object').columns)
print(cat_cols)
# Numerical features (X holds only features, so no target column needs excluding)
num_cols = [col for col in X.columns if col not in cat_cols]
print(num_cols)

In [None]:
# Hint: assemble a pipeline from several components
pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000))
])

# Example with logistic regression
# Tuning via GridSearchCV with RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
param_grid = {
    "selector__k": [5, 10, 15],
    "model__C": [0.01, 0.1, 1, 10]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Evaluation on the test set
test_roc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"Test ROC-AUC (GridSearch): {test_roc:.3f}")

---
## Homework implementation


### Data and initial setup

In [None]:
df = pd.read_csv("secondary_data.csv", sep=";")


y = df['class'].map({'p': 1, 'e': 0})
X = df.drop(columns=['class'])

# train/test split (70/30); the test part is used only for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")
print("Train target distribution:")
print(y_train.value_counts(normalize=True))
print("Test target distribution:")
print(y_test.value_counts(normalize=True))

df.head()

### Train-validation split approach  

In [None]:
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

print(f"Train split shape: {X_tr.shape}, Validation split shape: {X_val.shape}")
print("Train split target distribution:")
print(y_tr.value_counts(normalize=True))
print("Validation split target distribution:")
print(y_val.value_counts(normalize=True))

In [None]:
# Filter-based feature selection on the train split
# (X_tr, X_val, y_tr, y_val come from the 80/20 split in the previous cell):
# absolute Spearman correlation of each target-encoded feature with the target
cat_cols = list(X_tr.select_dtypes('object').columns)
num_cols = [col for col in X_tr.columns if col not in cat_cols]

encoder = TargetEncoder(cols=cat_cols)
X_tr_enc = encoder.fit_transform(X_tr, y_tr)
X_val_enc = encoder.transform(X_val)
X_test_enc = encoder.transform(X_test)

correlations = X_tr_enc.corrwith(y_tr, method='spearman').abs()
top_features = correlations.sort_values(ascending=False).head(8).index.tolist()
print("Selected features:", top_features)

X_tr_sel = X_tr_enc[top_features]
X_val_sel = X_val_enc[top_features]
X_test_sel = X_test_enc[top_features]

In [None]:
# Hyperparameter tuning: fit on the train split, score on the validation split
best_roc = 0
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"LogReg C={C}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_C = C
logreg_best = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr_sel, y_tr)

best_roc = 0
for d in [2, 3, 4, 5, 6, 7, 8]:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"Tree max_depth={d}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_d = d
tree_best = DecisionTreeClassifier(max_depth=best_d, random_state=42).fit(X_tr_sel, y_tr)

best_roc = 0
for n in [10, 30, 50, 100]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    roc = roc_auc_score(y_val, val_pred)
    print(f"RF n_estimators={n}: ROC-AUC={roc:.3f}")
    if roc > best_roc:
        best_roc = roc
        best_n = n
rf_best = RandomForestClassifier(n_estimators=best_n, random_state=42).fit(X_tr_sel, y_tr)


In [None]:
# [optional] Differential Evolution from SciPy to optimize C for logistic regression
def de_objective(params):
    C = params[0]
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr_sel, y_tr)
    val_pred = model.predict_proba(X_val_sel)[:, 1]
    return -roc_auc_score(y_val, val_pred)  # minimize

bounds = [(0.001, 100)]
# seed fixes the random population so the search is reproducible
result = differential_evolution(de_objective, bounds, seed=42, disp=False)
print(f"DE best C: {result.x[0]:.4f}, ROC-AUC={-result.fun:.3f}")
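
Because useful values of `C` span several orders of magnitude, one option is to let Differential Evolution search over `log10(C)` instead of `C` itself, so that small values are explored as thoroughly as large ones. The sketch below illustrates this on synthetic data; the `_demo` names, the bounds, and the small `maxiter`/`popsize` budget are assumptions chosen to keep the example fast, not tuned settings.

```python
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, illustrative only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

def neg_val_auc(params):
    """Objective for DE: negative validation ROC-AUC for C = 10 ** params[0]."""
    C = 10.0 ** params[0]
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr_demo, y_tr_demo)
    val_pred = model.predict_proba(X_val_demo)[:, 1]
    return -roc_auc_score(y_val_demo, val_pred)  # DE minimizes

# Search log10(C) in [-3, 2], i.e. C between 0.001 and 100.
# polish=False skips the final gradient-based refinement, which adds little
# for a piecewise-constant objective such as ROC-AUC.
result_demo = differential_evolution(
    neg_val_auc, bounds=[(-3, 2)], seed=0, maxiter=5, popsize=8, polish=False
)
best_C_demo = 10.0 ** result_demo.x[0]
print(f"best C: {best_C_demo:.4f}, validation ROC-AUC: {-result_demo.fun:.3f}")
```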

### Cross-validation approach  

In [None]:
cat_cols = list(X_train.select_dtypes('object').columns)

# Pipeline: TargetEncoder -> SelectKBest -> model
pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000))
])

# parameter grid for GridSearchCV
param_grid = {
    "selector__k": [5, 8, 12, 16],
    "model__C": [0.01, 0.1, 1, 10, 100]
}

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV ROC-AUC:", grid_search.best_score_)

test_pred = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
test_roc = roc_auc_score(y_test, test_pred)
test_pr = average_precision_score(y_test, test_pred)
print(f"Test ROC-AUC: {test_roc:.3f}")
print(f"Test PR-AUC: {test_pr:.3f}")

In [None]:
# [optional] Multi-objective Optuna search (maximize ROC-AUC and PR-AUC)

def objective(trial):
    k = trial.suggest_int("k", 5, min(16, X_train.shape[1]))
    C = trial.suggest_float("C", 0.01, 100, log=True)
    pipeline = Pipeline([
        ("encoder", TargetEncoder(cols=cat_cols)),
        ("selector", SelectKBest(score_func=f_classif, k=k)),
        ("model", LogisticRegression(C=C, max_iter=1000))
    ])
    scores_roc = []
    scores_pr = []
    for train_idx, val_idx in cv.split(X_train, y_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        pipeline.fit(X_tr, y_tr)
        pred = pipeline.predict_proba(X_val)[:, 1]
        scores_roc.append(roc_auc_score(y_val, pred))
        scores_pr.append(average_precision_score(y_val, pred))
    return np.mean(scores_roc), np.mean(scores_pr)

study = optuna.create_study(
    directions=["maximize", "maximize"],
    study_name="multi_metric"
)
study.optimize(objective, n_trials=30)

In [None]:
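# In a multi-objective study, best_trials is the whole Pareto front, not a single winner.
# Take the Pareto-optimal trial with the highest ROC-AUC as the representative.
best_trial = max(study.best_trials, key=lambda t: t.values[0])
print("Optuna ROC-AUC:", best_trial.values[0])
print("Optuna PR-AUC:", best_trial.values[1])
print("Optuna params:", best_trial.params)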

In [None]:
# [optional] Pareto front of the Optuna trials
from optuna.visualization import plot_pareto_front
plot_pareto_front(study, target_names=["ROC-AUC", "PR-AUC"])
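
For intuition about what the plot highlights: a trial is on the Pareto front if no other trial is at least as good on both metrics and strictly better on at least one. The sketch below computes this with NumPy for hypothetical (ROC-AUC, PR-AUC) pairs; the helper `pareto_mask` and the numbers are illustrative assumptions, not Optuna's implementation or real results.

```python
import numpy as np

def pareto_mask(points):
    """Return a boolean mask of non-dominated rows when every column is maximized."""
    points = np.asarray(points, dtype=float)
    mask = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        # p is dominated if some point is >= in every objective and > in at least one
        dominates_p = np.all(points >= p, axis=1) & np.any(points > p, axis=1)
        mask[i] = not dominates_p.any()
    return mask

# Hypothetical (ROC-AUC, PR-AUC) pairs for five trials; not real results.
trial_values = [(0.90, 0.80), (0.92, 0.78), (0.88, 0.85), (0.91, 0.77), (0.85, 0.84)]
front = pareto_mask(trial_values)
print(front.tolist())
# → [True, True, True, False, False]
```

Here the fourth trial is dominated by the second and the fifth by the third, so only the first three remain on the front. None of those three is better than the others on both metrics, which is why a multi-objective study reports a set of trials rather than a single best one.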

### Final model evaluation

In [None]:
logreg_test_pred = logreg_best.predict_proba(X_test_sel)[:, 1]
tree_test_pred = tree_best.predict_proba(X_test_sel)[:, 1]
rf_test_pred = rf_best.predict_proba(X_test_sel)[:, 1]
grid_test_pred = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
de_logreg = LogisticRegression(C=result.x[0], max_iter=1000).fit(X_tr_sel, y_tr)
de_test_pred = de_logreg.predict_proba(X_test_sel)[:, 1]

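# best_trials is the Pareto front; use its trial with the highest ROC-AUC
best_trial = max(study.best_trials, key=lambda t: t.values[0])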
k_optuna = best_trial.params['k']
C_optuna = best_trial.params['C']

optuna_pipeline = Pipeline([
    ("encoder", TargetEncoder(cols=cat_cols)),
    ("selector", SelectKBest(score_func=f_classif, k=k_optuna)),
    ("model", LogisticRegression(C=C_optuna, max_iter=1000))
])
optuna_pipeline.fit(X_train, y_train)
optuna_test_pred = optuna_pipeline.predict_proba(X_test)[:, 1]

In [None]:
# Test-set probabilities (for ROC-AUC, PR-AUC) and hard labels (for F1) of every tuned model
evaluated = {
    "Logistic regression (train/val)": (logreg_test_pred, logreg_best.predict(X_test_sel)),
    "Decision tree (train/val)": (tree_test_pred, tree_best.predict(X_test_sel)),
    "Random forest (train/val)": (rf_test_pred, rf_best.predict(X_test_sel)),
    "GridSearchCV (cross-validation)": (grid_test_pred, grid_search.best_estimator_.predict(X_test)),
    "Differential evolution": (de_test_pred, de_logreg.predict(X_test_sel)),
    "Optuna (multi-objective)": (optuna_test_pred, optuna_pipeline.predict(X_test)),
}

results_table = pd.DataFrame([
    {
        "Model": name,
        "ROC-AUC": roc_auc_score(y_test, proba),
        "PR-AUC": average_precision_score(y_test, proba),
        "F1-score": f1_score(y_test, labels),
    }
    for name, (proba, labels) in evaluated.items()
])

# Rank the models by test ROC-AUC, best first
results_table = results_table.sort_values("ROC-AUC", ascending=False).reset_index(drop=True)

display(results_table.round(6))

best_row = results_table.iloc[0]
print(f"\nBest model: {best_row['Model']} (ROC-AUC={best_row['ROC-AUC']:.6f})")

