<a href="https://colab.research.google.com/github/Aleksey55555/LMT/blob/master/LMT_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.9/400.9 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, optuna
Successfully installed colorlog-6.9.0 optuna-4.5.0


In [2]:
# === Чистые импорты для проекта LMT ===

# стандартные
import warnings
from math import ceil
from dataclasses import dataclass

# сторонние
import numpy as np
import pandas as pd
import optuna
import matplotlib.pyplot as plt

# sklearn
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import (
    load_breast_cancer,
    load_wine,
    load_digits,
    make_classification,
    fetch_openml,
)
from sklearn.ensemble import (
    BaggingClassifier,
    AdaBoostClassifier,
    RandomForestClassifier,
)
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    fbeta_score,
    roc_auc_score,
    classification_report,
    make_scorer,
    roc_curve,
    auc,
    precision_recall_curve,
    average_precision_score,
)
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# XGBoost
from xgboost import XGBClassifier


Идея создать такой классификатор LMT (logistic model tree), который будет сочитать дерево решений и логистическую регрессию. В каждом листе будет логистическая регрессия на признаках, которые не использовались в ветвлении дерева.
Подход

Строим дерево по подмножеству признаков:

на каждом узле выбираем признак и порог для разбиения (как в DecisionTreeClassifier).

глубина/мин-сэмплы ограничивают переобучение.

В листьях:

берём только те признаки, которые не использовались для делений выше по пути.

обучаем LogisticRegression на этом подмножестве данных.

Предсказание:

объект проходит по дереву до листа.

в листе к нему применяется локальная логистическая регрессия.

Реализация с помощью scikit-learn

In [3]:
class LogisticModelTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=3, min_samples_leaf=20, random_state=None):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
    def fit(self, X, y):
        # шаг 1: строим дерево только для разбиений
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y)
        # шаг 2: находим индексы объектов в листьях
        leaf_ids = self.tree_.apply(X)
        self.models_ = {}
        self.classes_ = self.tree_.classes_
        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            # получаем признаки, использованные на пути до этого листа
            path_features = self._get_features_on_path(leaf)
            remaining_features = [i for i in range(X.shape[1]) if i not in path_features]
            if not remaining_features:
                remaining_features = list(range(X.shape[1]))
            X_leaf = X[mask][:, remaining_features]
            y_leaf = y[mask]
            if len(np.unique(y_leaf)) == 1:
                # "чистый" лист: всегда один класс
                class_idx = np.where(self.classes_ == y_leaf[0])[0][0]
                def dummy_model(X_input, c=class_idx):
                    proba = np.zeros((X_input.shape[0], len(self.classes_)))
                    proba[:, c] = 1.0
                    return proba
                self.models_[leaf] = (dummy_model, remaining_features, True)
            else:
                model = LogisticRegression(max_iter=500)
                model.fit(X_leaf, y_leaf)
                self.models_[leaf] = (model, remaining_features, False)
        return self
    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, (model, feats, is_dummy) in self.models_.items():
            mask = (leaf_ids == leaf)
            if np.any(mask):
                X_leaf = X[mask][:, feats]
                if is_dummy:
                    proba[mask] = model(X_leaf)
                else:
                    proba[mask] = model.predict_proba(X_leaf)
        return proba
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    def _get_features_on_path(self, leaf_id):
        """Собрать все признаки, использованные на пути до данного листа"""
        tree = self.tree_.tree_
        path_features = set()
        def recurse(node, path):
            if node == leaf_id:
                return path
            if tree.feature[node] >= 0:
                left = tree.children_left[node]
                right = tree.children_right[node]
                if left != -1:
                    res = recurse(left, path | {tree.feature[node]})
                    if res is not None:
                        return res
                if right != -1:
                    res = recurse(right, path | {tree.feature[node]})
                    if res is not None:
                        return res
            return None
        return recurse(0, set()) or set()


Посмотрим метрики на датасете breast_cancer

In [4]:
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticModelTree(max_depth=3, min_samples_leaf=30, random_state=42)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.9707602339181286


In [5]:
print(classification_report(y_test, clf.predict(X_test)))


              precision    recall  f1-score   support

           0       0.98      0.94      0.96        63
           1       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171



Точность (precision)

Класс 0: 0.98

Класс 1: 0.96
→ почти без ложноположительных ошибок.

Полнота (recall)

Класс 0: 0.94

Класс 1: 0.99
→ модель чуть чаще путает класс 0

F1-score

Оба класса ≈ 0.96–0.98 → очень сбалансировано.


 Сравним  с другими моделями: RandomForestClassifier, LogisticRegression, XGBClassifier.

In [6]:
# обучаем все модели
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        colsample_bytree=0.8,
        eval_metric="logloss",
        use_label_encoder=False,
        random_state=42
    ),
    "Logistic Model Tree": LogisticModelTree(max_depth=3, min_samples_leaf=30, random_state=42)
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name}")
    print("Accuracy:", acc)
    print(classification_report(y_test, y_pred))
    results[name] = acc


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Logistic Regression
Accuracy: 0.9707602339181286
              precision    recall  f1-score   support

           0       0.97      0.95      0.96        63
           1       0.97      0.98      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.97      0.97      0.97       171


Random Forest
Accuracy: 0.9707602339181286
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        63
           1       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171



Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



XGBoost
Accuracy: 0.9590643274853801
              precision    recall  f1-score   support

           0       0.95      0.94      0.94        63
           1       0.96      0.97      0.97       108

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Logistic Model Tree
Accuracy: 0.9707602339181286
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        63
           1       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171



LMT и Random Forest показали одинаковый результат, LogReg с таким же accuracy 0,971, но recall чуть хуже (на классе 1, что важно для данного набора). XGBoost дал хуже результат  accuract - 0.959

Попробуем улучшить модель LMT, добавив возможность использования небольшого количества признаков, использованнных для ветвления в логистической регрессии в листе. Гипрепараметр reuse_ratio=0.1

In [7]:
class LogisticModelTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=3, min_samples_leaf=20, random_state=None, reuse_ratio=0.1):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.reuse_ratio = reuse_ratio  # доля признаков из пути, которые можно "вернуть"
    def fit(self, X, y):
        # шаг 1: строим дерево для разбиений
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y)
        # шаг 2: распределяем объекты по листьям
        leaf_ids = self.tree_.apply(X)
        self.models_ = {}
        self.classes_ = self.tree_.classes_
        self.leaf_samples_ = {}
        rng = np.random.RandomState(self.random_state)
        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            self.leaf_samples_[leaf] = np.sum(mask)
            # признаки, использованные на пути
            path_features = list(self._get_features_on_path(leaf))
            unused_features = [i for i in range(X.shape[1]) if i not in path_features]
            # пропорция признаков из пути
            k = max(1, int(len(path_features) * self.reuse_ratio)) if path_features else 0
            reuse_features = rng.choice(path_features, size=k, replace=False).tolist() if k > 0 else []
            final_features = unused_features + reuse_features
            if not final_features:  # fallback
                final_features = list(range(X.shape[1]))
            X_leaf = X[mask][:, final_features]
            y_leaf = y[mask]
            if len(np.unique(y_leaf)) == 1:
                # чистый лист
                class_idx = np.where(self.classes_ == y_leaf[0])[0][0]
                def dummy_model(X_input, c=class_idx):
                    proba = np.zeros((X_input.shape[0], len(self.classes_)))
                    proba[:, c] = 1.0
                    return proba
                self.models_[leaf] = (dummy_model, final_features, True, class_idx, None)
            else:
                model = LogisticRegression(max_iter=500)
                model.fit(X_leaf, y_leaf)
                self.models_[leaf] = (model, final_features, False, None, model.coef_)
        return self
    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, (model, feats, is_dummy, _, _) in self.models_.items():
            mask = (leaf_ids == leaf)
            if np.any(mask):
                X_leaf = X[mask][:, feats]
                if is_dummy:
                    proba[mask] = model(X_leaf)
                else:
                    proba[mask] = model.predict_proba(X_leaf)
        return proba
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    def _get_features_on_path(self, leaf_id):
        """Собрать все признаки, использованные на пути до данного листа"""
        tree = self.tree_.tree_
        path_features = set()
        def recurse(node, path):
            if tree.children_left[node] == -1 and tree.children_right[node] == -1:
                if node == leaf_id:
                    return path
                return None
            if tree.feature[node] >= 0:
                left = tree.children_left[node]
                right = tree.children_right[node]
                if left != -1:
                    res = recurse(left, path | {tree.feature[node]})
                    if res is not None:
                        return res
                if right != -1:
                    res = recurse(right, path | {tree.feature[node]})
                    if res is not None:
                        return res
            return None
        return recurse(0, set()) or set()
    def print_leaf_stats(self, feature_names=None):
        """Вывести статистику по каждому листу"""
        for leaf, (model, feats, is_dummy, class_idx, coefs) in self.models_.items():
            print("="*60)
            print(f"Лист {leaf} | объектов: {self.leaf_samples_[leaf]}")
            used_feats = self._get_features_on_path(leaf)
            if feature_names is not None:
                used_feats = [feature_names[i] for i in used_feats]
                feats_names = [feature_names[i] for i in feats]
            else:
                feats_names = feats
            print(f"  Использованные признаки на пути: {used_feats}")
            print(f"  Признаки в логрег: {feats_names}")
            if is_dummy:
                print(f"  Модель: ЧИСТЫЙ ЛИСТ → всегда класс {self.classes_[class_idx]}")
            else:
                print("  Модель: Логистическая регрессия")
                print("   Коэффициенты:")
                for i, c in enumerate(coefs[0]):
                    fname = feats_names[i]
                    print(f"     {fname}: {c:.4f}")


In [8]:
clf = LogisticModelTree(max_depth=3, min_samples_leaf=30, random_state=42, reuse_ratio=0.2)
clf.fit(X_train, y_train)
clf.print_leaf_stats(feature_names=load_breast_cancer().feature_names)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Лист 3 | объектов: 148
  Использованные признаки на пути: [np.str_('texture error'), np.str_('worst area'), np.str_('mean concave points')]
  Признаки в логрег: [np.str_('mean radius'), np.str_('mean texture'), np.str_('mean perimeter'), np.str_('mean area'), np.str_('mean smoothness'), np.str_('mean compactness'), np.str_('mean concavity'), np.str_('mean symmetry'), np.str_('mean fractal dimension'), np.str_('radius error'), np.str_('perimeter error'), np.str_('area error'), np.str_('smoothness error'), np.str_('compactness error'), np.str_('concavity error'), np.str_('concave points error'), np.str_('symmetry error'), np.str_('fractal dimension error'), np.str_('worst radius'), np.str_('worst texture'), np.str_('worst perimeter'), np.str_('worst smoothness'), np.str_('worst compactness'), np.str_('worst concavity'), np.str_('worst concave points'), np.str_('worst symmetry'), np.str_('worst fractal dimension'), np.str_('texture error')]
  Модель: ЧИСТЫЙ ЛИСТ → всегда класс 1
Лист 4 | 

Протетстируем переиспользование признаков при разном reuse_ratio

Перепишем модель

In [9]:
class LogisticModelTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=3, min_samples_leaf=20, random_state=None,
                 reuse_ratio=0.1, max_iter=5000, solver="lbfgs"):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.reuse_ratio = reuse_ratio  # доля признаков из пути, которые можно "вернуть"
        self.max_iter = max_iter
        self.solver = solver
    def fit(self, X, y):
        # шаг 1: строим дерево для разбиений
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y)
        # шаг 2: распределяем объекты по листьям
        leaf_ids = self.tree_.apply(X)
        self.models_ = {}
        self.classes_ = self.tree_.classes_
        self.leaf_samples_ = {}
        rng = np.random.RandomState(self.random_state)
        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            self.leaf_samples_[leaf] = np.sum(mask)
            # признаки, использованные на пути
            path_features = list(self._get_features_on_path(leaf))
            unused_features = [i for i in range(X.shape[1]) if i not in path_features]
            # пропорция признаков из пути
            k = max(1, int(len(path_features) * self.reuse_ratio)) if path_features else 0
            reuse_features = rng.choice(path_features, size=k, replace=False).tolist() if k > 0 else []
            final_features = unused_features + reuse_features
            if not final_features:  # fallback
                final_features = list(range(X.shape[1]))
            X_leaf = X[mask][:, final_features]
            y_leaf = y[mask]
            if len(np.unique(y_leaf)) == 1:
                # чистый лист
                class_idx = np.where(self.classes_ == y_leaf[0])[0][0]
                def dummy_model(X_input, c=class_idx):
                    proba = np.zeros((X_input.shape[0], len(self.classes_)))
                    proba[:, c] = 1.0
                    return proba
                self.models_[leaf] = (dummy_model, final_features, True, class_idx, None)
            else:
                model = make_pipeline(
                    StandardScaler(),
                    LogisticRegression(max_iter=self.max_iter, solver=self.solver)
                )
                model.fit(X_leaf, y_leaf)
                coefs = model.named_steps["logisticregression"].coef_
                self.models_[leaf] = (model, final_features, False, None, coefs)
        return self
    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, (model, feats, is_dummy, _, _) in self.models_.items():
            mask = (leaf_ids == leaf)
            if np.any(mask):
                X_leaf = X[mask][:, feats]
                if is_dummy:
                    proba[mask] = model(X_leaf)
                else:
                    proba[mask] = model.predict_proba(X_leaf)
        return proba
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    def _get_features_on_path(self, leaf_id):
        """Собрать все признаки, использованные на пути до данного листа"""
        tree = self.tree_.tree_
        path_features = set()
        def recurse(node, path):
            if tree.children_left[node] == -1 and tree.children_right[node] == -1:
                if node == leaf_id:
                    return path
                return None
            if tree.feature[node] >= 0:
                left = tree.children_left[node]
                right = tree.children_right[node]
                if left != -1:
                    res = recurse(left, path | {tree.feature[node]})
                    if res is not None:
                        return res
                if right != -1:
                    res = recurse(right, path | {tree.feature[node]})
                    if res is not None:
                        return res
            return None
        return recurse(0, set()) or set()
    def print_leaf_stats(self, feature_names=None):
        """Вывести статистику по каждому листу"""
        for leaf, (model, feats, is_dummy, class_idx, coefs) in self.models_.items():
            print("="*60)
            print(f"Лист {leaf} | объектов: {self.leaf_samples_[leaf]}")
            used_feats = self._get_features_on_path(leaf)
            if feature_names is not None:
                used_feats = [feature_names[i] for i in used_feats]
                feats_names = [feature_names[i] for i in feats]
            else:
                feats_names = feats
            print(f"  Использованные признаки на пути: {used_feats}")
            print(f"  Признаки в логрег: {feats_names}")
            if is_dummy:
                print(f"  Модель: ЧИСТЫЙ ЛИСТ → всегда класс {self.classes_[class_idx]}")
            else:
                print("  Модель: Логистическая регрессия (с масштабированием)")
                print("   Коэффициенты:")
                for i, c in enumerate(coefs[0]):
                    fname = feats_names[i]
                    print(f"     {fname}: {c:.4f}")


In [10]:
clf = LogisticModelTree(max_depth=3, min_samples_leaf=30, random_state=42,
                        reuse_ratio=0.2, max_iter=5000, solver="lbfgs")
clf.fit(X_train, y_train)
#clf.print_leaf_stats(feature_names=load_breast_cancer().feature_names)


In [11]:
ratios = [0.0, 0.1, 0.2, 0.5]
rows = []
for r in ratios:
    clf = LogisticModelTree(
        max_depth=3,
        min_samples_leaf=30,
        random_state=42,
        reuse_ratio=r,
        max_iter=5000,
        solver="lbfgs"
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    precision = precision_score(y_test, y_pred, average="weighted")
    recall = recall_score(y_test, y_pred, average="weighted")
    f1 = f1_score(y_test, y_pred, average="weighted")
    f2 = fbeta_score(y_test, y_pred, beta=2, average="weighted")
    rows.append({
        "reuse_ratio": r,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "f2": f2
    })
results_df = pd.DataFrame(rows)
print(results_df)


   reuse_ratio  precision    recall        f1        f2
0          0.0   0.976608  0.976608  0.976608  0.976608
1          0.1   0.976608  0.976608  0.976608  0.976608
2          0.2   0.976608  0.976608  0.976608  0.976608
3          0.5   0.976608  0.976608  0.976608  0.976608


Метрики одинаковые при любых reuse_ratio, что говорит о том, что дерево делит пространство так, что оставшихся признаков уже хватает для локальной логистической регрессии. Добавление/убавление 10–50% «старых» признаков не меняет картину — модель в листьях даёт одинаковые предсказания.

Датасет Breast Cancer достаточно «лёгкий»: он линейно разделим и малошумный, поэтому гибрид быстро выходит на потолок ≈97–98% accuracy.

Попробуем на синтетических данных

In [12]:
# 1. создаём более сложный датасет
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=15,
    n_redundant=10,
    n_classes=2,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. модели для сравнения
models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        colsample_bytree=0.8,
        eval_metric="logloss",
        use_label_encoder=False,
        random_state=42
    )
}
rows = []
# 3. прогон LogisticModelTree с разными reuse_ratio
for r in [0.0, 0.1, 0.2, 0.5]:
    clf = LogisticModelTree(
        max_depth=4,
        min_samples_leaf=50,
        random_state=42,
        reuse_ratio=r,
        max_iter=5000,
        solver="lbfgs"
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    rows.append({
        "Model": f"LMT (reuse={r})",
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average="weighted"),
        "Recall": recall_score(y_test, y_pred, average="weighted"),
        "F1": f1_score(y_test, y_pred, average="weighted"),
        "F2": fbeta_score(y_test, y_pred, beta=2, average="weighted")
    })
# 4. прогон классических моделей
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average="weighted"),
        "Recall": recall_score(y_test, y_pred, average="weighted"),
        "F1": f1_score(y_test, y_pred, average="weighted"),
        "F2": fbeta_score(y_test, y_pred, beta=2, average="weighted")
    })
# 5. выводим результаты
results_df = pd.DataFrame(rows)
print(results_df)


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


                 Model  Accuracy  Precision    Recall        F1        F2
0      LMT (reuse=0.0)  0.878000   0.878116  0.878000  0.877992  0.877981
1      LMT (reuse=0.1)  0.878000   0.878116  0.878000  0.877992  0.877981
2      LMT (reuse=0.2)  0.878000   0.878116  0.878000  0.877992  0.877981
3      LMT (reuse=0.5)  0.882000   0.882249  0.882000  0.881983  0.881959
4  Logistic Regression  0.818667   0.818807  0.818667  0.818643  0.818635
5        Random Forest  0.934667   0.934743  0.934667  0.934663  0.934655
6              XGBoost  0.944667   0.944687  0.944667  0.944666  0.944664


Гибрид (LMT) заметно сильнее обычной Logistic Regression (+6 процентных пунктов), но сильно проигрывает ансамблям деревьев (RF и XGB).

При reuse_ratio=0.5 результат немного лучше, чем при меньших значениях,то есть подмешивание части признаков ветвления действительно помогает.

Random Forest и XGBoost на этом датасете показывают высокие результаты (93–94%).

XGBoost чуть лучше, что типично для задач с нелинейной структурой и шумом.

Выводы

LMT уже даёт более гибкую модель, чем чистая логрег, но чтобы конкурировать с ансамблями, нужно либо глубже дерево, либо более «умный» выбор признаков в листьях (Можно добавить фича-селекшн по критериямв листе).

При reuse_ratio=0.5 есть небольшой, но заметный прирост — так что идея рабочая.

Добавим масштабирование перед построением регрессии в листе и фича-селекшн в модель. Подберем лучшие гиперпараметры с помощью Optuna.

In [13]:
# ==== 1) Модель: LogisticModelTree с локальным feature selection ====
class LogisticModelTree(BaseEstimator, ClassifierMixin):
    """
    Дерево разбиений + в листьях логистическая регрессия.
    Улучшения:
      - масштабирование признаков в каждом листе (StandardScaler),
      - reuse_ratio: можно "вернуть" часть признаков, использованных на пути,
      - per-leaf feature selection: выбор top-k признаков (по mutual information) из final_features.
    """
    def __init__(self,
                 max_depth=3,
                 min_samples_leaf=20,
                 random_state=None,
                 reuse_ratio=0.1,               # 0..1, доля признаков из пути, возвращаемых в лист
                 topk_frac=1.0,                 # 0..1, доля final_features, оставляемая в листе (>=1 признак)
                 C=1.0,                         # регуляризация логрег
                 solver="lbfgs",
                 max_iter=5000):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.reuse_ratio = reuse_ratio
        self.topk_frac = topk_frac
        self.C = C
        self.solver = solver
        self.max_iter = max_iter
    def fit(self, X, y):
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y)
        leaf_ids = self.tree_.apply(X)
        self.models_ = {}
        self.classes_ = self.tree_.classes_
        self.leaf_samples_ = {}
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            self.leaf_samples_[leaf] = int(np.sum(mask))
            # признаки на пути к листу
            path_features = list(self._get_features_on_path(leaf))
            unused_features = [i for i in range(n_features) if i not in path_features]
            # вернуть часть "деревянных" признаков
            k_reuse = max(0, int(len(path_features) * float(self.reuse_ratio))) if path_features else 0
            reuse_features = rng.choice(path_features, size=k_reuse, replace=False).tolist() if k_reuse > 0 else []
            final_features = unused_features + reuse_features
            if not final_features:   # fallback
                final_features = list(range(n_features))
            X_leaf_full = X[mask]
            y_leaf = y[mask]
            # "чистый" лист -> детерминистическая модель
            if len(np.unique(y_leaf)) == 1:
                class_idx = int(np.where(self.classes_ == y_leaf[0])[0][0])
                def dummy_model(X_input, c=class_idx, n_classes=len(self.classes_)):
                    proba = np.zeros((X_input.shape[0], n_classes))
                    proba[:, c] = 1.0
                    return proba
                self.models_[leaf] = (dummy_model, final_features, True, class_idx, None, None)
                continue
            # ---- локальный feature selection по mutual information ----
            # считаем важности только по final_features
            X_sub = X_leaf_full[:, final_features]
            # mutual_info_classif устойчив к масштабам; дискретизации не нужно
            mi = mutual_info_classif(X_sub, y_leaf, random_state=self.random_state)
            order = np.argsort(mi)[::-1]  # убыв. важность
            k_top = max(1, int(ceil(len(final_features) * float(self.topk_frac))))
            keep_idx = order[:k_top]
            selected_features = [final_features[i] for i in keep_idx]
            # обучаем пайплайн: скейлер + логрег
            X_leaf = X_leaf_full[:, selected_features]
            model = make_pipeline(
                StandardScaler(),
                LogisticRegression(
                    max_iter=self.max_iter,
                    solver=self.solver,
                    C=self.C
                )
            )
            model.fit(X_leaf, y_leaf)
            coefs = model.named_steps["logisticregression"].coef_
            self.models_[leaf] = (model, selected_features, False, None, coefs, mi)
        return self
    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, (model, feats, is_dummy, _, _, _) in self.models_.items():
            mask = (leaf_ids == leaf)
            if not np.any(mask):
                continue
            X_leaf = X[mask][:, feats]
            if is_dummy:
                proba[mask] = model(X_leaf)
            else:
                proba[mask] = model.predict_proba(X_leaf)
        return proba
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    def _get_features_on_path(self, leaf_id):
        tree = self.tree_.tree_
        def recurse(node, used):
            # лист
            if tree.children_left[node] == -1 and tree.children_right[node] == -1:
                return used if node == leaf_id else None
            if tree.feature[node] >= 0:
                left = tree.children_left[node]
                right = tree.children_right[node]
                if left != -1:
                    r = recurse(left, used | {int(tree.feature[node])})
                    if r is not None:
                        return r
                if right != -1:
                    r = recurse(right, used | {int(tree.feature[node])})
                    if r is not None:
                        return r
            return None
        res = recurse(0, set())
        return res or set()
    def print_leaf_stats(self, feature_names=None, show_top=10):
        for leaf, (model, feats, is_dummy, class_idx, coefs, mi) in self.models_.items():
            print("="*70)
            print(f"Лист {leaf} | объектов: {self.leaf_samples_[leaf]}")
            used_feats = self._get_features_on_path(leaf)
            if feature_names is not None:
                used_feats_names = [feature_names[i] for i in used_feats]
                feats_names = [feature_names[i] for i in feats]
            else:
                used_feats_names = list(used_feats)
                feats_names = feats
            print(f"  Признаки на пути: {used_feats_names}")
            print(f"  Признаки в логрег (после selection): {feats_names[:show_top]}{' ...' if len(feats_names)>show_top else ''}")
            if is_dummy:
                print(f"  Модель: ЧИСТЫЙ ЛИСТ → класс {self.classes_[class_idx]}")
            else:
                print("  Модель: Логистическая регрессия (скейлер + L2)")
                print(f"   Кол-во признаков в листе: {len(feats_names)}")
                if coefs is not None:
                    for i, c in enumerate(coefs[0][:min(len(feats_names), show_top)]):
                        print(f"     {feats_names[i]}: {c:.4f}")
                if mi is not None:
                    print("   (MI использовалось для отбора признаков)")
# ==== 2) Optuna: подбор гиперпараметров LMT ====

def tune_lmt_with_optuna(X, y, n_trials=50, random_state=42):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    scorer = make_scorer(f1_score, average="weighted")
    def objective(trial: optuna.Trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 2, 7),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 10, 200),
            "reuse_ratio": trial.suggest_float("reuse_ratio", 0.0, 0.8),
            "topk_frac": trial.suggest_float("topk_frac", 0.2, 1.0),
            "C": trial.suggest_float("C", 1e-3, 10.0, log=True),
            "solver": trial.suggest_categorical("solver", ["lbfgs", "saga"]),
            "max_iter": 5000,
            "random_state": random_state,
        }

        model = LogisticModelTree(**params)
        scores = cross_val_score(model, X, y, scoring=scorer, cv=skf, n_jobs=-1)
        return float(np.mean(scores))
    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=random_state),
                                pruner=optuna.pruners.MedianPruner(n_warmup_steps=10))
    study.optimize(objective, n_trials=n_trials, show_progress_bar=False)
    return study
# ==== 3) Запуск тюнинга и финальная оценка ====
study = tune_lmt_with_optuna(X_train, y_train, n_trials=60, random_state=42)
best_params = study.best_params
print("Best params (Optuna):", best_params)

lmt_best = LogisticModelTree(**{**best_params, "max_iter": 5000, "random_state": 42})
lmt_best.fit(X_train, y_train)
y_pred_lmt = lmt_best.predict(X_test)
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }
rows = [metrics_row("LMT (Optuna)", y_test, y_pred_lmt)]
# ==== 4) Бейзлайны: LogisticRegression / RandomForest / XGBoost ====
try:
    has_xgb = True
except Exception:
    has_xgb = False
base_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, C=1.0, solver="lbfgs")
).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, base_lr.predict(X_test)))
rf = RandomForestClassifier(n_estimators=400, max_depth=None, min_samples_leaf=1,
                            random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))
if has_xgb:
    xgb = XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_lambda=1.0,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1
    ).fit(X_train, y_train)
    rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test)))
results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
print(results)
# (опционально) быстрый просмотр важности гиперов в Optuna:
try:
    fig = viz.plot_param_importances(study)
    # fig.show()
except Exception:
    pass


[I 2025-10-03 17:20:17,055] A new study created in memory with name: no-name-0449bcea-2698-4833-8386-61f08a72c7cc
[I 2025-10-03 17:20:31,883] Trial 0 finished with value: 0.7922722535939213 and parameters: {'max_depth': 4, 'min_samples_leaf': 191, 'reuse_ratio': 0.585595153449124, 'topk_frac': 0.6789267873576292, 'C': 0.004207988669606638, 'solver': 'lbfgs'}. Best is trial 0 with value: 0.7922722535939213.
[I 2025-10-03 17:20:38,358] Trial 1 finished with value: 0.847630010515865 and parameters: {'max_depth': 7, 'min_samples_leaf': 124, 'reuse_ratio': 0.5664580622368364, 'topk_frac': 0.21646759543664196, 'C': 7.579479953348009, 'solver': 'lbfgs'}. Best is trial 1 with value: 0.847630010515865.
[I 2025-10-03 17:20:43,095] Trial 2 finished with value: 0.8356066190706523 and parameters: {'max_depth': 3, 'min_samples_leaf': 45, 'reuse_ratio': 0.2433937943676302, 'topk_frac': 0.6198051453057902, 'C': 0.05342937261279776, 'solver': 'saga'}. Best is trial 1 with value: 0.847630010515865.
[I 2

Best params (Optuna): {'max_depth': 6, 'min_samples_leaf': 175, 'reuse_ratio': 0.35748122127911736, 'topk_frac': 0.7820979668396375, 'C': 1.1349835828662918, 'solver': 'lbfgs'}
                 Model  Accuracy  Precision    Recall        F1        F2
3              XGBoost  0.955333   0.955341  0.955333  0.955333  0.955332
2        Random Forest  0.934667   0.934715  0.934667  0.934664  0.934659
0         LMT (Optuna)  0.872667   0.872722  0.872667  0.872663  0.872658
1  Logistic Regression  0.818667   0.818807  0.818667  0.818643  0.818635


XGBoost ожидаемо лидер, он идеально справляется с нелинейными разделяющими поверхностями на синтетике.

RandomForest чуть слабее, но тоже близко.

LMT (Optuna): лучше, чем глобальная логрег (+5%), но заметно отстаёт от ансамблей.

Logistic Regression в чистом виде — самая простая и наименее подходящая модель для этого датасета.

LMT реально улучшает линейную модель, сохраняя интерпретируемость и гибкость, но ансамбли деревьев остаются лучшими на сложных данных.
Оптимизация гиперпараметров дала неплохой результат, но сам класс моделей (LMT) пока ограничен по мощности.

Возможные апгрейды LMT:
добавить регуляризацию на уровне признаков в листьях (L1 для отбора, ElasticNet);
попробовать бустинг из LMT (как GradientBoosting, но листья = логреги);
попробовать беггинг;
попробовать более глубокие деревья + уменьшить min_samples_leaf, чтобы сделать более локальные логреги.

Но сначала попробуем улучшения и на реальном датасете Wine


In [14]:
class LogisticModelTree(BaseEstimator, ClassifierMixin):
    def __init__(self,
                 max_depth=3,
                 min_samples_leaf=20,
                 random_state=None,
                 reuse_ratio=0.1,
                 topk_frac=1.0,
                 C=1.0,
                 solver="lbfgs",
                 max_iter=5000):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.reuse_ratio = reuse_ratio
        self.topk_frac = topk_frac
        self.C = C
        self.solver = solver
        self.max_iter = max_iter
    def fit(self, X, y):
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y)
        leaf_ids = self.tree_.apply(X)
        self.models_ = {}
        self.classes_ = np.array(self.tree_.classes_)
        self.class_to_index_ = {c: i for i, c in enumerate(self.classes_)}
        self.leaf_samples_ = {}
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            self.leaf_samples_[leaf] = int(np.sum(mask))
            path_features = list(self._get_features_on_path(leaf))
            unused_features = [i for i in range(n_features) if i not in path_features]
            k_reuse = max(0, int(len(path_features) * float(self.reuse_ratio))) if path_features else 0
            reuse_features = rng.choice(path_features, size=k_reuse, replace=False).tolist() if k_reuse > 0 else []
            final_features = unused_features + reuse_features
            if not final_features:
                final_features = list(range(n_features))
            X_leaf_full = X[mask]
            y_leaf = y[mask]
            # чистый лист
            unique_leaf_classes = np.unique(y_leaf)
            if len(unique_leaf_classes) == 1:
                class_idx = int(self.class_to_index_[unique_leaf_classes[0]])
                def dummy_model(X_input, c=class_idx, n_classes=len(self.classes_)):
                    proba = np.zeros((X_input.shape[0], n_classes))
                    proba[:, c] = 1.0
                    return proba
                self.models_[leaf] = {
                    "model": dummy_model,
                    "feats": final_features,
                    "is_dummy": True,
                    "leaf_classes": np.array([unique_leaf_classes[0]]),
                    "coefs": None,
                    "mi": None,
                }
                continue
            # локальный feature selection по MI (если объектов совсем мало — пропускаем селекцию)
            X_sub = X_leaf_full[:, final_features]
            if X_sub.shape[0] >= 5:
                mi = mutual_info_classif(X_sub, y_leaf, random_state=self.random_state)
                order = np.argsort(mi)[::-1]
                k_top = max(1, int(ceil(len(final_features) * float(self.topk_frac))))
                keep_idx = order[:k_top]
                selected_features = [final_features[i] for i in keep_idx]
            else:
                mi = None
                selected_features = final_features
            X_leaf = X_leaf_full[:, selected_features]
            pipe = make_pipeline(
                StandardScaler(),
                LogisticRegression(
                    max_iter=self.max_iter,
                    solver=self.solver,
                    C=self.C,
                    multi_class="auto"
                )
            )
            pipe.fit(X_leaf, y_leaf)
            lr = pipe.named_steps["logisticregression"]
            self.models_[leaf] = {
                "model": pipe,
                "feats": selected_features,
                "is_dummy": False,
                "leaf_classes": np.array(lr.classes_),
                "coefs": lr.coef_,
                "mi": mi,
            }
        return self
    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, blob in self.models_.items():
            mask = (leaf_ids == leaf)
            if not np.any(mask):
                continue
            feats = blob["feats"]
            X_leaf = X[mask][:, feats]
            if blob["is_dummy"]:
                proba[mask] = blob["model"](X_leaf)
            else:
                local_proba = blob["model"].predict_proba(X_leaf)  # shape: [n, n_leaf_classes]
                leaf_classes = blob["leaf_classes"]
                # распределяем по глобальным классам
                tmp = np.zeros((local_proba.shape[0], len(self.classes_)))
                for j, cls in enumerate(leaf_classes):
                    gidx = self.class_to_index_[cls]
                    tmp[:, gidx] = local_proba[:, j]
                proba[mask] = tmp
        return proba
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    def _get_features_on_path(self, leaf_id):
        tree = self.tree_.tree_
        def recurse(node, used):
            if tree.children_left[node] == -1 and tree.children_right[node] == -1:
                return used if node == leaf_id else None
            if tree.feature[node] >= 0:
                left = tree.children_left[node]
                right = tree.children_right[node]
                if left != -1:
                    r = recurse(left, used | {int(tree.feature[node])})
                    if r is not None:
                        return r
                if right != -1:
                    r = recurse(right, used | {int(tree.feature[node])})
                    if r is not None:
                        return r
            return None
        res = recurse(0, set())
        return res or set()


In [15]:
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [16]:
study = tune_lmt_with_optuna(X_train, y_train, n_trials=25)


[I 2025-10-03 17:25:59,058] A new study created in memory with name: no-name-29086b43-00a0-46a1-9dd3-20690dfd0450
[I 2025-10-03 17:25:59,383] Trial 0 finished with value: 0.9185826078439889 and parameters: {'max_depth': 4, 'min_samples_leaf': 191, 'reuse_ratio': 0.585595153449124, 'topk_frac': 0.6789267873576292, 'C': 0.004207988669606638, 'solver': 'lbfgs'}. Best is trial 0 with value: 0.9185826078439889.
[I 2025-10-03 17:25:59,643] Trial 1 finished with value: 0.9017217818988715 and parameters: {'max_depth': 7, 'min_samples_leaf': 124, 'reuse_ratio': 0.5664580622368364, 'topk_frac': 0.21646759543664196, 'C': 7.579479953348009, 'solver': 'lbfgs'}. Best is trial 0 with value: 0.9185826078439889.
[I 2025-10-03 17:26:00,022] Trial 2 finished with value: 0.9343255676555987 and parameters: {'max_depth': 3, 'min_samples_leaf': 45, 'reuse_ratio': 0.2433937943676302, 'topk_frac': 0.6198051453057902, 'C': 0.05342937261279776, 'solver': 'saga'}. Best is trial 2 with value: 0.9343255676555987.
[

In [17]:
# дообучаем LMT на лучших параметрах
lmt_best = LogisticModelTree(**{**best_params, "max_iter": 5000, "random_state": 42})
lmt_best.fit(X_train, y_train)
y_pred_lmt = lmt_best.predict(X_test)
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted")
    }
rows = [metrics_row("LMT (Optuna)", y_test, y_pred_lmt)]
# Logistic Regression (со скейлингом)
base_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, solver="lbfgs", multi_class="auto")
).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, base_lr.predict(X_test)))
# Random Forest
rf = RandomForestClassifier(
    n_estimators=400, random_state=42, n_jobs=-1
).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))
# XGBoost
try:
    xgb = XGBClassifier(
        objective="multi:softprob",
        num_class=len(np.unique(y_train)),
        n_estimators=500,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        eval_metric="mlogloss",
        random_state=42,
        n_jobs=-1
    ).fit(X_train, y_train)
    rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test)))
except ImportError:
    print("XGBoost недоступен")
results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
print(results)




                 Model  Accuracy  Precision    Recall        F1        F2
3              XGBoost  1.000000   1.000000  1.000000  1.000000  1.000000
2        Random Forest  1.000000   1.000000  1.000000  1.000000  1.000000
1  Logistic Regression  0.981481   0.982456  0.981481  0.981506  0.981380
0         LMT (Optuna)  0.962963   0.963938  0.962963  0.962894  0.962803


XGBoost и RandomForest идеально решают Wine,
Logistic Regression на скейлинге показывает очень достойно: почти 98%.
LMT (Optuna) отстаёт (96%), но всё равно выше, чем ожидалось для интерпретируемой гибридной модели.  
Выводы  
Wine — относительно простой датасет. Ансамбли деревьев справляются идеально.
Логрег и LMT немного ошибаются, но дают хорошую интерпретируемость.

Попробуем LMT бустинг и беггинг на датасете breast_cancer

In [18]:
# данные
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted")
    }
rows = []
# --- базовый LMT  ---
base_lmt = LogisticModelTree(
    max_depth=3, min_samples_leaf=30, random_state=42,
    reuse_ratio=0.2, topk_frac=1.0, C=1.0, solver="lbfgs", max_iter=5000
)
base_lmt.fit(X_train, y_train)
rows.append(metrics_row("LMT (base)", y_test, base_lmt.predict(X_test)))
# --- LMT-Bagging ---
lmt_for_bag = LogisticModelTree(
    max_depth=3, min_samples_leaf=25, random_state=42,
    reuse_ratio=0.2, topk_frac=0.8, C=1.0, solver="lbfgs", max_iter=5000
)
bag = BaggingClassifier(
    estimator=lmt_for_bag,
    n_estimators=25,
    max_samples=0.8,
    max_features=1.0,
    bootstrap=True,
    bootstrap_features=False,
    n_jobs=-1,
    random_state=42
)
bag.fit(X_train, y_train)
rows.append(metrics_row("LMT-Bagging (25x, 80%)", y_test, bag.predict(X_test)))
# --- LMT-Boosting (AdaBoost; "SAMME") ---

class LogisticModelTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=3, min_samples_leaf=20, random_state=None,
                 reuse_ratio=0.1, topk_frac=1.0, C=1.0, solver="lbfgs", max_iter=5000):
        self.max_depth=max_depth; self.min_samples_leaf=min_samples_leaf; self.random_state=random_state
        self.reuse_ratio=reuse_ratio; self.topk_frac=topk_frac; self.C=C; self.solver=solver; self.max_iter=max_iter

    def fit(self, X, y, sample_weight=None):
        self.tree_ = DecisionTreeClassifier(max_depth=self.max_depth,
                                            min_samples_leaf=self.min_samples_leaf,
                                            random_state=self.random_state)
        self.tree_.fit(X, y, sample_weight=sample_weight)

        leaf_ids = self.tree_.apply(X)
        self.classes_ = np.array(self.tree_.classes_)
        self.class_to_index_ = {c:i for i,c in enumerate(self.classes_)}
        self.models_ = {}
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        sw = sample_weight if sample_weight is not None else np.ones(len(y), float)

        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            y_leaf = y[mask]; X_leaf_full = X[mask]; sw_leaf = sw[mask]

            # признаки: неиспользованные + часть "путевых"
            path = self._get_features_on_path(leaf)
            unused = [i for i in range(n_features) if i not in path]
            k_reuse = int(len(path)*self.reuse_ratio) if path else 0
            reuse = rng.choice(list(path), size=k_reuse, replace=False).tolist() if k_reuse>0 else []
            final_feats = unused + reuse or list(range(n_features))

            uniq = np.unique(y_leaf)
            if len(uniq)==1:
                cls_idx = self.class_to_index_[uniq[0]]
                def dummy(X_in, c=cls_idx, n=len(self.classes_)):
                    P = np.zeros((X_in.shape[0], n)); P[:,c]=1.0; return P
                self.models_[leaf] = {"is_dummy":True, "feats":final_feats, "model":dummy, "leaf_classes":np.array([uniq[0]])}
                continue

            X_sub = X_leaf_full[:, final_feats]
            if X_sub.shape[0] >= 5:
                mi = mutual_info_classif(X_sub, y_leaf, random_state=self.random_state)
                order = np.argsort(mi)[::-1]
                k_top = max(1, int(ceil(len(final_feats)*self.topk_frac)))
                keep = order[:k_top]
                feats = [final_feats[i] for i in keep]
            else:
                feats = final_feats

            scaler = StandardScaler().fit(X_leaf_full[:, feats])
            Xs = scaler.transform(X_leaf_full[:, feats])
            lr = LogisticRegression(max_iter=self.max_iter, solver=self.solver, C=self.C)
            lr.fit(Xs, y_leaf, sample_weight=sw_leaf)

            self.models_[leaf] = {"is_dummy":False, "feats":feats,
                                  "model":(scaler, lr), "leaf_classes":np.array(lr.classes_)}
        return self

    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, blob in self.models_.items():
            mask = (leaf_ids == leaf)
            if not np.any(mask): continue
            feats = blob["feats"]; X_leaf = X[mask][:, feats]
            if blob["is_dummy"]:
                proba[mask] = blob["model"](X_leaf)
            else:
                scaler, lr = blob["model"]
                local = lr.predict_proba(scaler.transform(X_leaf))
                tmp = np.zeros((local.shape[0], len(self.classes_)))
                for j, cls in enumerate(blob["leaf_classes"]):
                    tmp[:, self.class_to_index_[cls]] = local[:, j]
                proba[mask] = tmp
        return proba

    def predict(self, X): return np.argmax(self.predict_proba(X), axis=1)

    def _get_features_on_path(self, leaf_id):
        t = self.tree_.tree_
        def rec(node, used):
            if t.children_left[node]==-1 and t.children_right[node]==-1:
                return used if node==leaf_id else None
            if t.feature[node] >= 0:
                left, right = t.children_left[node], t.children_right[node]
                r = rec(left, used|{int(t.feature[node])});  r = r if r is not None else rec(right, used|{int(t.feature[node])})
                return r
            return None
        return rec(0, set()) or set()

boost = AdaBoostClassifier(
    estimator=LogisticModelTree(max_depth=2, min_samples_leaf=20, random_state=42,
                                reuse_ratio=0.3, topk_frac=0.9, C=1.0, solver="lbfgs", max_iter=5000),
    n_estimators=30, learning_rate=0.5, random_state=42
)

boost.fit(X_train, y_train)
rows.append(metrics_row("LMT-Boosting (AdaBoost, 30, 0.5)", y_test, boost.predict(X_test)))
# --- Бейзлайны: LogisticRegression / RandomForest / XGBoost ---
lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, solver="lbfgs")
).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test)))
rf = RandomForestClassifier(
    n_estimators=400, random_state=42, n_jobs=-1
).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))
try:
    xgb = XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1
    ).fit(X_train, y_train)
    rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test)))
except Exception as e:
    pass
results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
print(results)




                              Model  Accuracy  Precision    Recall        F1  \
3               Logistic Regression  0.988304   0.988304  0.988304  0.988304   
2  LMT-Boosting (AdaBoost, 30, 0.5)  0.964912   0.964964  0.964912  0.964796   
5                           XGBoost  0.964912   0.965576  0.964912  0.964668   
0                        LMT (base)  0.953216   0.953216  0.953216  0.953216   
1            LMT-Bagging (25x, 80%)  0.947368   0.947463  0.947368  0.947101   
4                     Random Forest  0.947368   0.947463  0.947368  0.947101   

         F2  
3  0.988304  
2  0.964832  
5  0.964679  
0  0.953216  
1  0.947187  
4  0.947187  


Логистическая регрессия — победитель. Датасет почти линейно разделим, поэтому ансамбли и гибриды даже проигрывают по метрикам.

LMT-Boosting ≈ XGBoost —  бустинг на LMT вышел на уровень XGBoost, хотя тот куда более оптимизирован.  
Bagging не улучшает базовый LMT, как бустинг.  
Random Forest — хуже всех, что тоже ожидаемо: дерево «дробит» пространство, а Breast Cancer этому не очень подходит.

Вывод  
На «чистых» и почти линейных данных глобальная логрег остаётся топом.
LMT+Boosting показал, что может соревноваться с XGBoost. На более сложных данных он может раскрыться ещё лучше.
Bagging стабилизирует, но не даёт прироста.

Попробуем реализовать Gradient Boosting с LMT в листьях и добавим регуляризацию для отбора признаков

In [19]:
# === Gradient Boosting с LMT в листьях + регуляризация; сравнение на синтетике ===

class LogisticModelTreePenalized(BaseEstimator, ClassifierMixin):
    def __init__(self,
                 max_depth=3,
                 min_samples_leaf=20,
                 random_state=None,
                 reuse_ratio=0.1,
                 topk_frac=1.0,
                 penalty="l2",
                 C=1.0,
                 l1_ratio=0.5,
                 solver="lbfgs",
                 max_iter=5000):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.reuse_ratio = reuse_ratio
        self.topk_frac = topk_frac
        self.penalty = penalty
        self.C = C
        self.l1_ratio = l1_ratio
        self.solver = solver
        self.max_iter = max_iter

    def fit(self, X, y, sample_weight=None):
        self.tree_ = DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            random_state=self.random_state
        )
        self.tree_.fit(X, y, sample_weight=sample_weight)

        leaf_ids = self.tree_.apply(X)
        self.classes_ = np.array(self.tree_.classes_)
        self.class_to_index_ = {c: i for i, c in enumerate(self.classes_)}
        self.models_ = {}
        self.leaf_samples_ = {}
        rng = np.random.RandomState(self.random_state)
        n_features = X.shape[1]
        sw = sample_weight if sample_weight is not None else np.ones(len(y), float)

        for leaf in np.unique(leaf_ids):
            mask = (leaf_ids == leaf)
            self.leaf_samples_[leaf] = int(np.sum(mask))
            X_leaf_full = X[mask]; y_leaf = y[mask]; sw_leaf = sw[mask]

            # признаки: (все неиспользованные по пути) + часть путевых
            path = self._get_features_on_path(leaf)
            unused = [i for i in range(n_features) if i not in path]
            k_reuse = int(len(path) * float(self.reuse_ratio)) if path else 0
            reuse = rng.choice(list(path), size=k_reuse, replace=False).tolist() if k_reuse > 0 else []
            final_feats = unused + reuse
            if not final_feats:
                final_feats = list(range(n_features))

            # чистый лист
            uniq = np.unique(y_leaf)
            if len(uniq) == 1:
                cls = uniq[0]
                gidx = self.class_to_index_[cls]
                def dummy(X_in, c=gidx, k=len(self.classes_)):
                    P = np.zeros((X_in.shape[0], k)); P[:, c] = 1.0; return P
                self.models_[leaf] = dict(is_dummy=True, feats=final_feats, model=dummy,
                                          leaf_classes=np.array([cls]))
                continue

            # локальный feature selection
            X_sub = X_leaf_full[:, final_feats]
            if X_sub.shape[0] >= 5:
                mi = mutual_info_classif(X_sub, y_leaf, random_state=self.random_state)
                order = np.argsort(mi)[::-1]
                k_top = max(1, int(ceil(len(final_feats) * float(self.topk_frac))))
                feats = [final_feats[i] for i in order[:k_top]]
            else:
                feats = final_feats

            # скейлинг + логрег (правильный solver для регуляризации!)
            scaler = StandardScaler().fit(X_leaf_full[:, feats])
            Xs = scaler.transform(X_leaf_full[:, feats])

            penalty = self.penalty
            solver_local = self.solver
            lr_kwargs = dict(max_iter=self.max_iter, C=self.C, penalty=penalty)

            if penalty in ("l1", "elasticnet"):
                solver_local = "saga"              # единственный solver для L1/ENet
                lr_kwargs["solver"] = solver_local
                if penalty == "elasticnet":
                    lr_kwargs["l1_ratio"] = self.l1_ratio
            else:
                # l2 / none — используем заданный solver (lbfgs по умолчанию)
                lr_kwargs["solver"] = solver_local

            lr = LogisticRegression(**lr_kwargs)
            lr.fit(Xs, y_leaf, sample_weight=sw_leaf)

            self.models_[leaf] = dict(
                is_dummy=False, feats=feats,
                model=(scaler, lr),
                leaf_classes=np.array(lr.classes_)
            )
        return self

    def predict_proba(self, X):
        leaf_ids = self.tree_.apply(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for leaf, blob in self.models_.items():
            mask = (leaf_ids == leaf)
            if not np.any(mask):
                continue
            feats = blob["feats"]; X_leaf = X[mask][:, feats]
            if blob["is_dummy"]:
                proba[mask] = blob["model"](X_leaf)
            else:
                scaler, lr = blob["model"]
                local = lr.predict_proba(scaler.transform(X_leaf))
                tmp = np.zeros((local.shape[0], len(self.classes_)))
                for j, cls in enumerate(blob["leaf_classes"]):
                    tmp[:, self.class_to_index_[cls]] = local[:, j]
                proba[mask] = tmp
        return proba

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

    def _get_features_on_path(self, leaf_id):
        t = self.tree_.tree_
        def rec(node, used):
            if t.children_left[node] == -1 and t.children_right[node] == -1:
                return used if node == leaf_id else None
            if t.feature[node] >= 0:
                left, right = t.children_left[node], t.children_right[node]
                r = rec(left, used | {int(t.feature[node])})
                if r is not None: return r
                r = rec(right, used | {int(t.feature[node])})
                if r is not None: return r
            return None
        return rec(0, set()) or set()

# ---- 2) LogitBoost-style Gradient Boosting для бинарной классификации ----
@dataclass
class LMTGBParams:
    n_estimators: int = 50
    learning_rate: float = 0.3
    random_state: int = 42
    # гиперы базового LMT:
    max_depth: int = 2
    min_samples_leaf: int = 20
    reuse_ratio: float = 0.3
    topk_frac: float = 0.9
    penalty: str = "l2"      # 'l2' | 'l1' | 'elasticnet'
    C: float = 1.0
    l1_ratio: float = 0.5
    solver: str = "lbfgs"
    max_iter: int = 5000

class LMTGradientBoostingBinary(BaseEstimator, ClassifierMixin):
    """
    Добавочная логистическая регрессия (LogitBoost-style):
    F_{m}(x) += ν * 0.5*log(p_m/(1-p_m)), где p_m даёт базовый LMT.
    """
    def __init__(self, params: LMTGBParams | None = None):
        self.params = params or LMTGBParams()

    def fit(self, X, y):
        classes = np.unique(y)
        if len(classes) != 2:
            raise ValueError("LMTGradientBoostingBinary поддерживает только бинарную классификацию.")
        self.classes_ = classes
        self.class_to_index_ = {c: i for i, c in enumerate(self.classes_)}
        y_signed = np.where(y == self.classes_[1], 1.0, -1.0)

        n = X.shape[0]
        self.learning_rate_ = self.params.learning_rate
        self.estimators_ = []
        self.rng_ = np.random.RandomState(self.params.random_state)

        # начальный F(x)=0 (p=0.5)
        self.init_score_ = 0.0

        # равномерные веса
        w = np.full(n, 1.0 / n, dtype=float)

        for m in range(self.params.n_estimators):
            base = LogisticModelTreePenalized(
                max_depth=self.params.max_depth,
                min_samples_leaf=self.params.min_samples_leaf,
                random_state=self.rng_.randint(0, 10**9),
                reuse_ratio=self.params.reuse_ratio,
                topk_frac=self.params.topk_frac,
                penalty=self.params.penalty,
                C=self.params.C,
                l1_ratio=self.params.l1_ratio,
                solver=self.params.solver,
                max_iter=self.params.max_iter
            )
            # обучаем с текущими весами
            base.fit(X, y, sample_weight=w)

            # вероятности класса "1"
            p = np.clip(base.predict_proba(X)[:, self.class_to_index_[self.classes_[1]]], 1e-6, 1-1e-6)
            f_m = 0.5 * np.log(p / (1 - p))  # вклад в логит

            # обновление весов (реал-Адабуст/логитбуст-подобно)
            w *= np.exp(- self.learning_rate_ * y_signed * f_m)
            w_sum = np.sum(w)
            if not np.isfinite(w_sum) or w_sum <= 0:
                w = np.full(n, 1.0 / n, dtype=float)
            else:
                w /= w_sum

            self.estimators_.append(base)

        return self

    def _raw_score(self, X):
        raw = np.zeros(X.shape[0], dtype=float) + self.init_score_
        for base in self.estimators_:
            p = np.clip(base.predict_proba(X)[:, 1], 1e-6, 1-1e-6)
            raw += self.learning_rate_ * 0.5 * np.log(p / (1 - p))
        return raw

    def predict_proba(self, X):
        logits = 2.0 * self._raw_score(X)
        prob1 = 1.0 / (1.0 + np.exp(-logits))
        prob0 = 1.0 - prob1
        return np.vstack([prob0, prob1]).T

    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(self.classes_.dtype)

# ---- 3) Эксперимент на синтетике + сравнение с классическими моделями ----
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }

# синтетика посложнее
X, y = make_classification(
    n_samples=5000, n_features=30, n_informative=15, n_redundant=10,
    n_repeated=0, n_classes=2, class_sep=1.0, flip_y=0.02, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rows = []

# Одиночный LMT (для ориентира)
lmt_single = LogisticModelTreePenalized(
    max_depth=3, min_samples_leaf=30, random_state=42,
    reuse_ratio=0.2, topk_frac=0.9,
    penalty="elasticnet", C=1.0, l1_ratio=0.3, solver="saga", max_iter=5000
).fit(X_train, y_train)
rows.append(metrics_row("LMT (single, ENet)", y_test, lmt_single.predict(X_test)))

# Gradient Boosting (LogitBoost-style) с LMT-листами
gb_params = LMTGBParams(
    n_estimators=40, learning_rate=0.3, random_state=42,
    max_depth=2, min_samples_leaf=25, reuse_ratio=0.3, topk_frac=0.9,
    penalty="elasticnet", C=1.0, l1_ratio=0.5, solver="saga", max_iter=5000
)
lmt_gb = LMTGradientBoostingBinary(gb_params).fit(X_train, y_train)
rows.append(metrics_row("LMT-GB (LogitBoost-style)", y_test, lmt_gb.predict(X_test)))

# Logistic Regression (скейлинг)

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, solver="lbfgs")).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test)))

# Random Forest
rf = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))

# XGBoost (бинарная)
try:

    xgb = XGBClassifier(
        objective="binary:logistic",
        n_estimators=500,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1
    ).fit(X_train, y_train)
    y_pred_xgb = (xgb.predict_proba(X_test)[:, 1] >= 0.5).astype(int)
    rows.append(metrics_row("XGBoost", y_test, y_pred_xgb))
except Exception as e:
    print("XGBoost недоступен:", e)

results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
print(results)


                       Model  Accuracy  Precision    Recall        F1  \
4                    XGBoost  0.947333   0.947399  0.947333  0.947332   
3              Random Forest  0.929333   0.929585  0.929333  0.929325   
0         LMT (single, ENet)  0.860000   0.861265  0.860000  0.859890   
2        Logistic Regression  0.810667   0.810683  0.810667  0.810662   
1  LMT-GB (LogitBoost-style)  0.722000   0.723000  0.722000  0.721727   

         F2  
4  0.947325  
3  0.929298  
0  0.859779  
2  0.810662  
1  0.721706  


Результаты совсем слабые 0,72, попробуем выполнить тюнинг гиперпараметров

Сравнним оптимизированный LMT-GB (тюнинг Optuna) и классических моделей на синтетике

In [20]:
# ===  Optuna for LMT-GB (no pruning, L2 only) ===

try:
    X_train, X_test, y_train, y_test
except NameError:
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    X, y = make_classification(
        n_samples=5000, n_features=30, n_informative=15, n_redundant=10,
        n_repeated=0, n_classes=2, class_sep=1.0, flip_y=0.02, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

scorer = make_scorer(f1_score, average="weighted")

def objective_stable(trial: optuna.Trial):
    # Только L2 + lbfgs
    params = LMTGBParams(
        n_estimators=trial.suggest_int("n_estimators", 60, 220),
        learning_rate=trial.suggest_float("learning_rate", 0.05, 0.2),
        random_state=42,
        max_depth=trial.suggest_int("max_depth", 2, 3),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 10, 40),
        reuse_ratio=trial.suggest_float("reuse_ratio", 0.2, 0.6),
        topk_frac=trial.suggest_float("topk_frac", 0.7, 1.0),
        penalty="l2",
        C=trial.suggest_float("C", 0.5, 5.0, log=True),
        l1_ratio=0.5,          # не используется при l2
        solver="lbfgs",
        max_iter=5000
    )

    model = LMTGradientBoostingBinary(params)
    # без сабсэмплинга, без прунинга — чтобы trial завершался
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    try:
        scores = cross_val_score(model, X_train, y_train, scoring=scorer, cv=skf, n_jobs=-1)
        val = float(np.mean(scores))
        return val if np.isfinite(val) else -1e9
    except Exception:
        return -1e9

#
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42)
)

# ограничение по времени:
study.optimize(objective_stable, n_trials=25, timeout=1200, show_progress_bar=False)

print("Best CV F1 (weighted):", study.best_value)
print("Best params:", study.best_params)

# обучим лучшую LMT-GB и сравним
best = study.best_params
gb_best = LMTGradientBoostingBinary(LMTGBParams(
    n_estimators=best["n_estimators"],
    learning_rate=best["learning_rate"],
    random_state=42,
    max_depth=best["max_depth"],
    min_samples_leaf=best["min_samples_leaf"],
    reuse_ratio=best["reuse_ratio"],
    topk_frac=best["topk_frac"],
    penalty="l2",
    C=best["C"],
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=3000
)).fit(X_train, y_train)

# метрики и сравнение

def metrics_row(name, y_true, y_pred, proba=None, is_binary=True):
    row = {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }
    if is_binary and proba is not None:
        try:
            row["ROC-AUC"] = roc_auc_score(y_true, proba)
        except Exception:
            row["ROC-AUC"] = np.nan
    return row

is_binary = (len(np.unique(y_train)) == 2)
rows = []

y_pred_gb = gb_best.predict(X_test)
proba_gb = gb_best.predict_proba(X_test)[:, 1] if is_binary else None
rows.append(metrics_row("LMT-GB (Optuna L2)", y_test, y_pred_gb, proba_gb, is_binary))

# LMT (single) с такими же базовыми настройками
lmt_single = LogisticModelTreePenalized(
    max_depth=best["max_depth"],
    min_samples_leaf=best["min_samples_leaf"],
    random_state=42,
    reuse_ratio=best["reuse_ratio"],
    topk_frac=best["topk_frac"],
    penalty="l2",
    C=best["C"],
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=5000
).fit(X_train, y_train)
rows.append(metrics_row("LMT (single, L2)", y_test, lmt_single.predict(X_test),
                        lmt_single.predict_proba(X_test)[:,1] if is_binary else None, is_binary))

# Logistic Regression
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, solver="lbfgs")).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test),
                        lr.predict_proba(X_test)[:,1] if is_binary else None, is_binary))

# Random Forest
rf = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test),
                        rf.predict_proba(X_test)[:,1] if is_binary else None, is_binary))

# XGBoost
try:

    if is_binary:
        xgb = XGBClassifier(
            objective="binary:logistic",
            n_estimators=500, learning_rate=0.05, max_depth=5,
            subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
            eval_metric="logloss", random_state=42, n_jobs=-1
        ).fit(X_train, y_train)
        rows.append(metrics_row("XGBoost", y_test,
                                (xgb.predict_proba(X_test)[:,1] >= 0.5).astype(int),
                                xgb.predict_proba(X_test)[:,1], True))
    else:
        xgb = XGBClassifier(
            objective="multi:softmax",
            num_class=len(np.unique(y_train)),
            n_estimators=500, learning_rate=0.05, max_depth=5,
            subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
            eval_metric="mlogloss", random_state=42, n_jobs=-1
        ).fit(X_train, y_train)
        rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test), None, False))
except Exception as e:
    print("XGBoost недоступен:", e)

results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(results)


[I 2025-10-03 17:27:06,611] A new study created in memory with name: no-name-8df17821-cb92-4a61-9e3a-f13baba1390e
[I 2025-10-03 17:30:56,526] Trial 0 finished with value: 0.9262663800031752 and parameters: {'n_estimators': 120, 'learning_rate': 0.19260714596148742, 'max_depth': 3, 'min_samples_leaf': 28, 'reuse_ratio': 0.2624074561769746, 'topk_frac': 0.7467983561008608, 'C': 0.5715491938156609}. Best is trial 0 with value: 0.9262663800031752.
[I 2025-10-03 17:37:43,302] Trial 1 finished with value: 0.9245567552999425 and parameters: {'n_estimators': 199, 'learning_rate': 0.14016725176148134, 'max_depth': 3, 'min_samples_leaf': 10, 'reuse_ratio': 0.5879639408647976, 'topk_frac': 0.9497327922401265, 'C': 0.8152843673110735}. Best is trial 0 with value: 0.9262663800031752.
[I 2025-10-03 17:39:47,835] Trial 2 finished with value: 0.8865382446837499 and parameters: {'n_estimators': 89, 'learning_rate': 0.07751067647801507, 'max_depth': 2, 'min_samples_leaf': 26, 'reuse_ratio': 0.3727780074

Best CV F1 (weighted): 0.9262663800031752
Best params: {'n_estimators': 120, 'learning_rate': 0.19260714596148742, 'max_depth': 3, 'min_samples_leaf': 28, 'reuse_ratio': 0.2624074561769746, 'topk_frac': 0.7467983561008608, 'C': 0.5715491938156609}
                 Model  Accuracy  Precision    Recall        F1        F2  \
0              XGBoost  0.947333   0.947399  0.947333  0.947332  0.947325   
1        Random Forest  0.929333   0.929585  0.929333  0.929325  0.929298   
2   LMT-GB (Optuna L2)  0.920667   0.921003  0.920667  0.920654  0.920618   
3     LMT (single, L2)  0.857333   0.858706  0.857333  0.857211  0.857091   
4  Logistic Regression  0.810667   0.810683  0.810667  0.810662  0.810662   

    ROC-AUC  
0  0.982843  
1  0.973248  
2  0.974245  
3  0.921633  
4  0.889891  


После оптимизации гиепрпараметров LMT заметно подтянулся LMT-GB (Optuna L2) ROC-AUC - 0.97 ~ Random Forrest и проигрывает чуть-чуть XGBoost

Попробуем мультикаласс на датасете Wine


In [21]:
# === LMT-GB на реальном датасете Wine (мультикласс, OVR) + сравнение ===

# --- проверки наличия базовых классов (из предыдущих ячеек) ---
try:
    LogisticModelTreePenalized, LMTGradientBoostingBinary, LMTGBParams
except NameError as e:
    raise RuntimeError(
        "Похоже, классы LogisticModelTreePenalized, LMTGradientBoostingBinary и/или LMTGBParams "
        "ещё не определены в этом ноутбуке. Выполни предыдущие ячейки с их реализацией."
    )


# ---------- 1) Данные: Wine (3 класса) ----------
wine = load_wine()
X, y = wine.data, wine.target
classes = np.unique(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# ---------- 2) Обёртки One-vs-Rest ----------
class OneVsRestLMTGB:
    """
    OVR для LMTGradientBoostingBinary: обучаем по одному бинарному бустингу на каждый класс (класс vs остальные),
    затем выбираем класс с максимальной вероятностью.
    """
    def __init__(self, params):
        self.params = params
        self.models_ = []
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            y_bin = (y == c).astype(int)
            model = LMTGradientBoostingBinary(self.params)
            model.fit(X, y_bin)
            self.models_.append(model)
        return self

    def predict_proba(self, X):
        # вероятность "попасть" в каждый класс — берём столбец 1 (P(y=1)) у бинарных моделей
        P = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        # нормализуем строки, чтобы суммы были ~1
        row_sums = P.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0.0] = 1.0
        return P / row_sums

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


class OneVsRestLMT:
    """
    OVR для одиночного LMT (LogisticModelTreePenalized).
    """
    def __init__(self, base_lmt_kwargs):
        self.base_lmt_kwargs = base_lmt_kwargs
        self.models_ = []
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            y_bin = (y == c).astype(int)
            model = LogisticModelTreePenalized(**self.base_lmt_kwargs)
            model.fit(X, y_bin)
            self.models_.append(model)
        return self

    def predict_proba(self, X):
        P = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        row_sums = P.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0.0] = 1.0
        return P / row_sums

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)


# ---------- 3) Метрики ----------
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }

rows = []

# ---------- 4) LMT-GB (OVR) ----------
# Стабильные настройки (L2 + lbfgs).
gb_params_wine = LMTGBParams(
    n_estimators=120,
    learning_rate=0.1,
    random_state=42,
    max_depth=2,
    min_samples_leaf=15,
    reuse_ratio=0.4,
    topk_frac=1.0,   # без отбора фич для устойчивости на мультиклассе
    penalty="l2",
    C=2.0,
    l1_ratio=0.5,    # не используется при l2
    solver="lbfgs",
    max_iter=4000
)

lmt_gb_ovr = OneVsRestLMTGB(gb_params_wine).fit(X_train, y_train)
y_pred_lmtgb = lmt_gb_ovr.predict(X_test)
rows.append(metrics_row("LMT-GB (OVR, Wine)", y_test, y_pred_lmtgb))

# ---------- 5) Одиночный LMT (OVR) ----------
lmt_single_ovr = OneVsRestLMT(dict(
    max_depth=gb_params_wine.max_depth,
    min_samples_leaf=gb_params_wine.min_samples_leaf,
    random_state=42,
    reuse_ratio=gb_params_wine.reuse_ratio,
    topk_frac=gb_params_wine.topk_frac,
    penalty="l2",
    C=gb_params_wine.C,
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=gb_params_wine.max_iter
)).fit(X_train, y_train)
y_pred_lmt_single = lmt_single_ovr.predict(X_test)
rows.append(metrics_row("LMT (single OVR, Wine)", y_test, y_pred_lmt_single))

# ---------- 6) Классические модели ----------
# Logistic Regression (скейлер обязателен)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, solver="lbfgs")).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test)))

# Random Forest
rf = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))

# XGBoost (мультикласс)
try:

    xgb = XGBClassifier(
        objective="multi:softmax",
        num_class=len(classes),
        n_estimators=500, learning_rate=0.05, max_depth=5,
        subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
        eval_metric="mlogloss", random_state=42, n_jobs=-1
    ).fit(X_train, y_train)
    rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test)))
except Exception as e:
    print("XGBoost недоступен:", e)

# (опционально) SVM RBF
try:
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=False, random_state=42)).fit(X_train, y_train)
    rows.append(metrics_row("SVM (RBF)", y_test, svm.predict(X_test)))
except Exception as e:
    print("SVM недоступен:", e)

# ---------- 7) Итоговая таблица ----------
results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(results)


                    Model  Accuracy  Precision    Recall        F1        F2
0                 XGBoost  1.000000   1.000000  1.000000  1.000000  1.000000
1           Random Forest  1.000000   1.000000  1.000000  1.000000  1.000000
2     Logistic Regression  0.981481   0.982456  0.981481  0.981506  0.981380
3  LMT (single OVR, Wine)  0.981481   0.982639  0.981481  0.981554  0.981388
4               SVM (RBF)  0.981481   0.982323  0.981481  0.981378  0.981316
5      LMT-GB (OVR, Wine)  0.962963   0.966184  0.962963  0.962715  0.962428


XGBoost / RandomForest = 1.00 на Wine — вполне ожидаемо для этого небольшого и «чистого» датасета.

LogReg / SVM / LMT (single OVR) ≈ 0.981 — почти потолок без деревьев.

LMT-GB (OVR) = 0.963 — заметно ниже, но адекватно.


Попробуем провести оптимизацию гиперпараметров (n_estimators 60–240, lr 0.03–0.20, depth 1–3, min_leaf 8–40, reuse_ratio 0–0.6, topk_frac 0.6–1.0, C 0.5–5) всё для L2 + lbfgs для стабильности и сравним результаты.


In [22]:
# === Optuna-тюнинг LMT-GB (OVR) на Wine + финальное сравнение ===

# (опционально) приглушим варнинги сходимости
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# --- проверка наличия базовых классов из предыдущих ячеек ---
try:
    LogisticModelTreePenalized
    LMTGradientBoostingBinary
    LMTGBParams
    OneVsRestLMTGB
    OneVsRestLMT
except NameError:
    raise RuntimeError(
        "Не найдены классы/обёртки LMT."
    )

# ---------- 1) Данные Wine ----------
wine = load_wine()
X, y = wine.data, wine.target
classes = np.unique(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# ---------- 2) Метрики ----------
def metrics_row(name, y_true, y_pred):
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }

# ---------- 3) Optuna objective для OVR-LMT-GB (стабильное пространство) ----------
def objective(trial: optuna.Trial):
    params = LMTGBParams(
        n_estimators=trial.suggest_int("n_estimators", 60, 240),
        learning_rate=trial.suggest_float("learning_rate", 0.03, 0.20),
        random_state=42,
        max_depth=trial.suggest_int("max_depth", 1, 3),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 8, 40),
        reuse_ratio=trial.suggest_float("reuse_ratio", 0.0, 0.6),
        topk_frac=trial.suggest_float("topk_frac", 0.6, 1.0),
        penalty="l2",          # для устойчивости (без saga/elasticnet)
        C=trial.suggest_float("C", 0.5, 5.0, log=True),
        l1_ratio=0.5,          # не используется при l2
        solver="lbfgs",
        max_iter=5000
    )

    # 3-фолд стратифицированный CV по weighted F1
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    scores = []
    for tr, va in skf.split(X_train, y_train):
        model_ovr = OneVsRestLMTGB(params)
        model_ovr.fit(X_train[tr], y_train[tr])
        y_pred = model_ovr.predict(X_train[va])
        scores.append(f1_score(y_train[va], y_pred, average="weighted"))
        # промежуточный репорт для потенциального прунинга
        trial.report(scores[-1], len(scores))
        # if trial.should_prune():
        #     raise optuna.TrialPruned()
    return float(np.mean(scores))

# ---------- 4) Запуск Optuna (время/трейлы) ----------
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    # прунинг можно включить:
    # pruner=optuna.pruners.MedianPruner(n_warmup_steps=2),
)
study.optimize(objective, n_trials=100, timeout=3600, show_progress_bar=False)

print("Best CV F1 (weighted):", study.best_value)
print("Best params:", study.best_params)

best = study.best_params
gb_best_params = LMTGBParams(
    n_estimators=best["n_estimators"],
    learning_rate=best["learning_rate"],
    random_state=42,
    max_depth=best["max_depth"],
    min_samples_leaf=best["min_samples_leaf"],
    reuse_ratio=best["reuse_ratio"],
    topk_frac=best["topk_frac"],
    penalty="l2",
    C=best["C"],
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=5000
)

# ---------- 5) Обучаем лучшую LMT-GB (OVR) и сравниваем ----------
rows = []

# LMT-GB (OVR, tuned)
lmt_gb_ovr_tuned = OneVsRestLMTGB(gb_best_params).fit(X_train, y_train)
rows.append(metrics_row("LMT-GB (OVR, Optuna)", y_test, lmt_gb_ovr_tuned.predict(X_test)))

# LMT (single OVR) с "созвучными" листовыми гиперами (для честности)
lmt_single_ovr = OneVsRestLMT(dict(
    max_depth=gb_best_params.max_depth,
    min_samples_leaf=gb_best_params.min_samples_leaf,
    random_state=42,
    reuse_ratio=gb_best_params.reuse_ratio,
    topk_frac=gb_best_params.topk_frac,
    penalty="l2",
    C=gb_best_params.C,
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=gb_best_params.max_iter
)).fit(X_train, y_train)
rows.append(metrics_row("LMT (single OVR, tuned-like)", y_test, lmt_single_ovr.predict(X_test)))

# Логистическая регрессия (скейлер)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, solver="lbfgs")).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test)))

# Random Forest
rf = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test)))

# XGBoost (мультикласс)
try:
    from xgboost import XGBClassifier
    xgb = XGBClassifier(
        objective="multi:softmax",
        num_class=len(classes),
        n_estimators=500, learning_rate=0.05, max_depth=5,
        subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
        eval_metric="mlogloss", random_state=42, n_jobs=-1
    ).fit(X_train, y_train)
    rows.append(metrics_row("XGBoost", y_test, xgb.predict(X_test)))
except Exception as e:
    print("XGBoost недоступен:", e)

# (опционально) SVM RBF
try:
    from sklearn.svm import SVC
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=False, random_state=42)).fit(X_train, y_train)
    rows.append(metrics_row("SVM (RBF)", y_test, svm.predict(X_test)))
except Exception as e:
    print("SVM недоступен:", e)

# ---------- 6) Итоговая таблица ----------
results = pd.DataFrame(rows).sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(results)


[I 2025-10-03 17:54:20,576] A new study created in memory with name: no-name-bbcd3a81-9ab7-4b17-8e25-33750447dcca
[I 2025-10-03 17:55:40,505] Trial 0 finished with value: 0.9760504606740025 and parameters: {'n_estimators': 127, 'learning_rate': 0.19162143208968577, 'max_depth': 3, 'min_samples_leaf': 27, 'reuse_ratio': 0.0936111842654619, 'topk_frac': 0.662397808134481, 'C': 0.5715491938156609}. Best is trial 0 with value: 0.9760504606740025.
[I 2025-10-03 17:57:14,888] Trial 1 finished with value: 0.9030276458988483 and parameters: {'n_estimators': 216, 'learning_rate': 0.13218955199634552, 'max_depth': 3, 'min_samples_leaf': 8, 'reuse_ratio': 0.5819459112971965, 'topk_frac': 0.9329770563201687, 'C': 0.8152843673110735}. Best is trial 0 with value: 0.9760504606740025.
[I 2025-10-03 17:58:13,010] Trial 2 finished with value: 0.9596094324510895 and parameters: {'n_estimators': 92, 'learning_rate': 0.06117876667508375, 'max_depth': 1, 'min_samples_leaf': 25, 'reuse_ratio': 0.259167011185

Best CV F1 (weighted): 0.9760504606740025
Best params: {'n_estimators': 127, 'learning_rate': 0.19162143208968577, 'max_depth': 3, 'min_samples_leaf': 27, 'reuse_ratio': 0.0936111842654619, 'topk_frac': 0.662397808134481, 'C': 0.5715491938156609}
                          Model  Accuracy  Precision    Recall        F1  \
0                       XGBoost  1.000000   1.000000  1.000000  1.000000   
1                 Random Forest  1.000000   1.000000  1.000000  1.000000   
2           Logistic Regression  0.981481   0.982456  0.981481  0.981506   
3  LMT (single OVR, tuned-like)  0.981481   0.982456  0.981481  0.981506   
4                     SVM (RBF)  0.981481   0.982323  0.981481  0.981378   
5          LMT-GB (OVR, Optuna)  0.944444   0.945039  0.944444  0.944297   

         F2  
0  1.000000  
1  1.000000  
2  0.981380  
3  0.981380  
4  0.981316  
5  0.944280  


Датасет Wine слишком «лёгкий»

Wine разделим относительно просто — деревья (RF/XGB) учат идеально (100%).

Простая логистическая регрессия и SVM дают ≈98%.

Видимо из-за того, что пространство поиска Optuna «штрафует» модель Optuna может «сойтись» на более консервативных параметрах (например, меньшая глубина, больше регуляризации), которые хуже обобщаются на test.

При этом Baseline был подобран вручную и оказался ближе к оптимуму для конкретного train/test сплита.

Также фиксированный penalty="l2", solver="lbfgs" вместо  ElasticNet и saga стабилизирует, но сужает пространство гиперов.

Возможно еще, что Optuna работал внутри более широкого диапазона, и «лучший по CV» оказался хуже на тесте.

В целом Gradient Boosting с LMT-листьями нестабилен (каждый лист учит логистическую регрессию).

Когда датасет маленький и чистый, добавление бустинга часто портит качество по сравнению с одиночной моделью.



На Wine лучше использовать baseline или одиночный LMT (они ближе к логрег/SVM).

Оптимизация через Optuna может показывать хуже на простых датасетах, т.к. перебор гиперов приводит к лишней регуляризации.

Чтобы увидеть выгоду LMT-GB + Optuna, надо идти на более сложные или зашумлённые датасеты.

____

Попробуем подобрать датасет, на ктором LMT модель будет работать лучше всего.

In [23]:
# === Подбор датасета ===
# --- Проверяем, что базовые классы уже есть ---
try:
    LogisticModelTreePenalized
    LMTGradientBoostingBinary
    LMTGBParams
except NameError:
    raise RuntimeError("Нужны классы LogisticModelTreePenalized, LMTGradientBoostingBinary, LMTGBParams.")

# --- OVR обёртки (на случай, если их нет в окружении этой секции) ---
try:
    OneVsRestLMTGB
except NameError:
    class OneVsRestLMTGB:
        def __init__(self, params): self.params=params
        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.models_ = []
            for c in self.classes_:
                yb = (y==c).astype(int)
                m = LMTGradientBoostingBinary(self.params)
                m.fit(X,yb); self.models_.append(m)
            return self
        def predict_proba(self, X):
            P = np.column_stack([m.predict_proba(X)[:,1] for m in self.models_])
            s = P.sum(axis=1, keepdims=True); s[s==0.0]=1.0
            return P/s
        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

try:
    OneVsRestLMT
except NameError:
    class OneVsRestLMT:
        def __init__(self, base_lmt_kwargs): self.kw=base_lmt_kwargs
        def fit(self, X, y):
            self.classes_ = np.unique(y); self.models_=[]
            for c in self.classes_:
                yb = (y==c).astype(int)
                m = LogisticModelTreePenalized(**self.kw).fit(X,yb)
                self.models_.append(m)
            return self
        def predict_proba(self, X):
            P = np.column_stack([m.predict_proba(X)[:,1] for m in self.models_])
            s = P.sum(axis=1, keepdims=True); s[s==0.0]=1.0
            return P/s
        def predict(self, X):
            return np.argmax(self.predict_proba(X), axis=1)

def metrics_row(name, y_true, y_pred, proba=None, binary=None):
    row = {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }
    if binary and proba is not None:
        try:
            row["ROC-AUC"] = roc_auc_score(y_true, proba)
        except Exception:
            row["ROC-AUC"] = np.nan
    return row

# --- синтетический датасет: кусочно-линейная логит-модель ---
def make_piecewise_logit(n=6000, d=20, seed=42):
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n,d))
    # Регион определяется по X[:,0] и X[:,1]
    region = (X[:,0] > 0).astype(int) + (X[:,1] > 0).astype(int)*2  # 4 квадранта
    # В каждом регионе свои линейные веса:
    W = rng.normal(size=(4,d))
    b = rng.normal(size=4) * 0.5
    logits = np.sum(W[region]*X, axis=1) + b[region]
    p = 1/(1+np.exp(-logits))
    y = (rng.rand(n) < p).astype(int)
    return X, y

# --- единый раннер сравнения ---
def evaluate_dataset(name, X, y):
    multiclass = (len(np.unique(y)) > 2)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y
    )

    rows = []

    # LMT-GB (параметры “стабильные” по умолчанию; для мультикласса — OVR)
    gb_params = LMTGBParams(
        n_estimators=160 if not multiclass else 180,
        learning_rate=0.08 if not multiclass else 0.06,
        random_state=42,
        max_depth=2,
        min_samples_leaf=12 if not multiclass else 15,
        reuse_ratio=0.4,
        topk_frac=1.0,
        penalty="l2",
        C=1.0 if name=="piecewise_logit" else 0.8,
        l1_ratio=0.5,
        solver="lbfgs",
        max_iter=6000
    )

    if multiclass:
        lmt_gb = OneVsRestLMTGB(gb_params).fit(X_train, y_train)
        y_pred_gb = lmt_gb.predict(X_test)
        proba_gb = None
        rows.append(metrics_row("LMT-GB", y_test, y_pred_gb, proba_gb, binary=False))
        # Одиночный LMT (OVR)
        lmt_single = OneVsRestLMT(dict(
            max_depth=gb_params.max_depth,
            min_samples_leaf=gb_params.min_samples_leaf,
            random_state=42,
            reuse_ratio=gb_params.reuse_ratio,
            topk_frac=gb_params.topk_frac,
            penalty="l2", C=gb_params.C, l1_ratio=0.5, solver="lbfgs",
            max_iter=gb_params.max_iter
        )).fit(X_train, y_train)
        y_pred_lmt = lmt_single.predict(X_test)
        proba_lmt = None
        rows.append(metrics_row("LMT (single)", y_test, y_pred_lmt, proba_lmt, binary=False))
    else:
        lmt_gb = LMTGradientBoostingBinary(gb_params).fit(X_train, y_train)
        y_pred_gb = lmt_gb.predict(X_test)
        proba_gb = lmt_gb.predict_proba(X_test)[:,1]
        rows.append(metrics_row("LMT-GB", y_test, y_pred_gb, proba_gb, binary=True))
        # Одиночный LMT
        lmt_single = LogisticModelTreePenalized(
            max_depth=gb_params.max_depth, min_samples_leaf=gb_params.min_samples_leaf,
            random_state=42, reuse_ratio=gb_params.reuse_ratio, topk_frac=gb_params.topk_frac,
            penalty="l2", C=gb_params.C, l1_ratio=0.5, solver="lbfgs", max_iter=gb_params.max_iter
        ).fit(X_train, y_train)
        y_pred_lmt = lmt_single.predict(X_test)
        proba_lmt = lmt_single.predict_proba(X_test)[:,1]
        rows.append(metrics_row("LMT (single)", y_test, y_pred_lmt, proba_lmt, binary=True))

    # Бейзлайны
    # Logistic Regression
    lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, solver="lbfgs")).fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    proba_lr = (lr.predict_proba(X_test)[:,1] if not multiclass else None)
    rows.append(metrics_row("Logistic Regression", y_test, y_pred_lr, proba_lr, binary=not multiclass))

    # Random Forest
    rf = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1).fit(X_train, y_train)
    y_pred_rf = rf.predict(X_test)
    proba_rf = (rf.predict_proba(X_test)[:,1] if not multiclass else None)
    rows.append(metrics_row("Random Forest", y_test, y_pred_rf, proba_rf, binary=not multiclass))

    # XGBoost
    try:
        from xgboost import XGBClassifier
        if multiclass:
            xgb = XGBClassifier(
                objective="multi:softmax",
                num_class=len(np.unique(y_train)),
                n_estimators=500, learning_rate=0.05, max_depth=5,
                subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
                eval_metric="mlogloss", random_state=42, n_jobs=-1
            ).fit(X_train, y_train)
            y_pred_xgb = xgb.predict(X_test)
            proba_xgb = None
        else:
            xgb = XGBClassifier(
                objective="binary:logistic",
                n_estimators=500, learning_rate=0.05, max_depth=5,
                subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
                eval_metric="logloss", random_state=42, n_jobs=-1
            ).fit(X_train, y_train)
            y_pred_xgb = (xgb.predict_proba(X_test)[:,1] >= 0.5).astype(int)
            proba_xgb = xgb.predict_proba(X_test)[:,1]
        rows.append(metrics_row("XGBoost", y_test, y_pred_xgb, proba_xgb, binary=not multiclass))
    except Exception as e:
        pass

    df = pd.DataFrame(rows).sort_values("Accuracy", ascending=False).reset_index(drop=True)
    df.insert(0, "Dataset", name)
    return df

# --- Готовим наборы для проверки ---
datasets = []

# классические
w = load_wine(); datasets.append(("wine", w.data, w.target))
bc = load_breast_cancer(); datasets.append(("breast_cancer", bc.data, bc.target))
d = load_digits(); datasets.append(("digits", d.data, d.target))

# синтетика "кусочно-линейная логит-модель"
Xpw, ypw = make_piecewise_logit(n=8000, d=30, seed=42)
datasets.append(("piecewise_logit", Xpw, ypw))

# --- Запуск и агрегация результатов ---
all_results = []
for name, X, y in datasets:
    df = evaluate_dataset(name, X, y)
    all_results.append(df)

results = pd.concat(all_results, axis=0)
print(results)

# --- Подсветка, где LMT-GB >= RF и/или XGB (по Accuracy) ---
def highlight_wins(df):
    marks = []
    for ds in df["Dataset"].unique():
        block = df[df["Dataset"]==ds].copy()
        acc = dict(zip(block["Model"], block["Accuracy"]))
        lmt = acc.get("LMT-GB", None)
        if lmt is None:
            marks.append((ds, "LMT-GB not present"))
            continue
        win_rf = lmt >= acc.get("Random Forest", -1)
        win_xgb = lmt >= acc.get("XGBoost", -1)
        marks.append((ds, f"LMT-GB ≥ RF: {win_rf}, ≥ XGB: {win_xgb}, LMT-GB Acc: {lmt:.3f}"))
    return pd.DataFrame(marks, columns=["Dataset","LMT-GB vs baselines"])

summary = highlight_wins(results)
print("\n=== Где LMT-GB обгоняет/сравним с RF/XGB ===")
print(summary)


           Dataset                Model  Accuracy  Precision    Recall  \
0             wine              XGBoost  1.000000   1.000000  1.000000   
1             wine        Random Forest  1.000000   1.000000  1.000000   
2             wine  Logistic Regression  0.981481   0.982456  0.981481   
3             wine               LMT-GB  0.962963   0.966184  0.962963   
4             wine         LMT (single)  0.962963   0.964120  0.962963   
0    breast_cancer  Logistic Regression  0.988304   0.988304  0.988304   
1    breast_cancer              XGBoost  0.964912   0.965576  0.964912   
2    breast_cancer         LMT (single)  0.959064   0.959287  0.959064   
3    breast_cancer               LMT-GB  0.953216   0.953187  0.953216   
4    breast_cancer        Random Forest  0.947368   0.947463  0.947368   
0           digits  Logistic Regression  0.981481   0.981824  0.981481   
1           digits        Random Forest  0.968519   0.969705  0.968519   
2           digits               LMT-G

Wine – RF/XGB учат идеально (100%), LMT-GB ≈ 96% (отстаёт).

Breast Cancer – LMT-GB (0.953) обогнал Random Forest (0.947), хотя до XGBoost (0.965) чуть не дотянул.

Digits – LMT-GB (0.957) обошёл XGBoost (0.954), но слабее RF (0.968).

Piecewise logit (синтетический) –  все деревья уступают, а логистическая структура (LMT-single, LMT-GB) держит лучшие позиции. LMT-single (0.836) лучший, LMT-GB (0.754) пока недонастроен, но всё равно выше простой Logistic Regression (0.72).
Это возможно, так как целевая функция LMT кусочно-логистическая: общая зависимость нелинейная, но на подотрезках/подобластях она линейно-разделима.  
Logistic Model Tree (LMT) как раз делает следующее:  
Дерево по признакам разбивает пространство на подобласти.  
В каждом листе — логистическая регрессия по остальным признакам.


Теперь опробуем модели на реальном датасете с кусочно-логистической завиимсотью.
Возьмем датасет Adult Income (UCI Adult/Census Income)

In [25]:
# === UCI Adult Income: сравнение LMT, LMT-GB и классики ===

from sklearn.datasets import fetch_openml

# ---------- 1) Загрузка и подготовка Adult ----------
adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame.copy()

# Целевая переменная: '>50K' / '<=50K' → 1/0
target_col = "class"
df[target_col] = (df[target_col].astype(str).str.strip() == ">50K").astype(int)

# Разделим признаки
y = df[target_col].values
X = df.drop(columns=[target_col])

# Явно отметим категориальные/числовые
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# Иногда в adult есть строки с '?' — оставим как NaN (OneHotEncoder обработает как отдельную категорию)
X = X.replace("?", np.nan)

# Трейн/тест
X_train_df, X_test_df, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# One-Hot (делаем dense, чтобы работало и с деревьями, и с LMT)
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
preprocess = ColumnTransformer(
    transformers=[
        ("cat", ohe, cat_cols),
        ("num", "passthrough", num_cols),
    ],
    remainder="drop",
)

# fit/transform
X_train = preprocess.fit_transform(X_train_df)
X_test  = preprocess.transform(X_test_df)

# Названия фич
# feature_names = (list(preprocess.named_transformers_["cat"].get_feature_names_out(cat_cols))
#                  + num_cols)

# ---------- 2) Вспомогательная функция метрик ----------
def metrics_row(name, y_true, y_pred, proba=None):
    row = {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }
    if proba is not None:
        try:
            row["ROC-AUC"] = roc_auc_score(y_true, proba)
        except Exception:
            row["ROC-AUC"] = np.nan
    return row

rows = []

# ---------- 3) LMT (single) ----------
lmt_single = LogisticModelTreePenalized(
    max_depth=3,
    min_samples_leaf=40,
    random_state=42,
    reuse_ratio=0.4,
    topk_frac=1.0,        # без отбора признаков на старте
    penalty="l2",
    C=1.0,
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=6000
).fit(X_train, y_train)
y_pred_lmt = lmt_single.predict(X_test)
proba_lmt = lmt_single.predict_proba(X_test)[:, 1]
rows.append(metrics_row("LMT (single)", y_test, y_pred_lmt, proba_lmt))

# ---------- 4) LMT-GB (градиентный бустинг с LMT-листами) ----------
gb_params = LMTGBParams(
    n_estimators=220,
    learning_rate=0.06,
    random_state=42,
    max_depth=2,
    min_samples_leaf=40,
    reuse_ratio=0.4,
    topk_frac=1.0,
    penalty="l2",
    C=1.0,
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=6000
)
lmt_gb = LMTGradientBoostingBinary(gb_params).fit(X_train, y_train)
y_pred_gb = lmt_gb.predict(X_test)
proba_gb = lmt_gb.predict_proba(X_test)[:, 1]
rows.append(metrics_row("LMT-GB", y_test, y_pred_gb, proba_gb))

# ---------- 5) Логистическая регрессия (baseline) ----------
lr = LogisticRegression(max_iter=5000, solver="lbfgs", n_jobs=-1)
lr.fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test), lr.predict_proba(X_test)[:, 1]))

# ---------- 6) Random Forest ----------
rf = RandomForestClassifier(n_estimators=600, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test), rf.predict_proba(X_test)[:, 1]))

# ---------- 7) XGBoost ----------
try:
    from xgboost import XGBClassifier
    xgb = XGBClassifier(
        objective="binary:logistic",
        n_estimators=800,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1
    )
    xgb.fit(X_train, y_train)
    proba_xgb = xgb.predict_proba(X_test)[:, 1]
    y_pred_xgb = (proba_xgb >= 0.5).astype(int)
    rows.append(metrics_row("XGBoost", y_test, y_pred_xgb, proba_xgb))
except Exception as e:
    print("XGBoost недоступен:", e)

# ---------- 8) Итог ----------
results = pd.DataFrame(rows).sort_values(["ROC-AUC", "Accuracy"], ascending=False).reset_index(drop=True)
print(results)


                 Model  Accuracy  Precision    Recall        F1        F2  \
0              XGBoost  0.874974   0.870663  0.874974  0.871174  0.873095   
1               LMT-GB  0.874497   0.869925  0.874497  0.869305  0.871805   
2         LMT (single)  0.859551   0.853494  0.859551  0.853674  0.856598   
3        Random Forest  0.856821   0.851218  0.856821  0.852416  0.854708   
4  Logistic Regression  0.850065   0.843221  0.850065  0.844163  0.847175   

    ROC-AUC  
0  0.929234  
1  0.927305  
2  0.907992  
3  0.905456  
4  0.897063  


XGBoost (Acc=0.875, ROC-AUC=0.929) остаётся лидером.

LMT-GB практически сравнялся (Acc=0.8745, ROC-AUC=0.927) — то есть гибридная модель реально догоняет сильнейший бустинг!

LMT (single) лучше Random Forest и Logistic Regression по точности (0.86 vs 0.85), и с хорошим ROC-AUC.

Adult Income — реальный датасет, где LMT/LMT-GB показывают конкурентный уровень с бустингами.
Данные сильно сегментированы (пол, образование, работа, часы работы и т.п.).
Глобальная логрег плохо справляется.
Деревья хорошо ловят сегменты, но внутри них они плоские.
LMT/GB сочетает сегментацию и линейные эффекты.

При этом LMT в отличие от XGBoost даёт возможность в каждом листе интерпретировать логистические коэффициенты (например, в сегменте «женщины >35 лет, высшее образование» вероятность >50К зависит от признаков так-то).

Попробуем подтянуть метрики с помощью Optuna.

In [27]:
# === UCI Adult Income + Optuna тюнинг LMT-GB (устойчивый) ===

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# ---------- 1) Загрузка и препроцессинг Adult ----------
adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame.copy()

# целевая переменная 1/0
df["class"] = (df["class"].astype(str).str.strip() == ">50K").astype(int)
y = df["class"].values
X = df.drop(columns=["class"])

# категориальные/числовые, unknown как NaN
X = X.replace("?", np.nan)
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

X_train_df, X_test_df, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# OneHotEncoder: кросс-версионный фикс (sparse_output vs sparse)
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

preprocess = ColumnTransformer(
    transformers=[
        ("cat", ohe, cat_cols),
        ("num", "passthrough", num_cols),
    ],
    remainder="drop",
)

X_train = preprocess.fit_transform(X_train_df)
X_test  = preprocess.transform(X_test_df)

# ---------- 2) Вспомогательные функции ----------
def safe_auc(y_true, proba):
    p = np.clip(proba, 1e-12, 1-1e-12)
    return roc_auc_score(y_true, p)

def robust_cv_auc(params, X, y, n_splits=3, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    vals = []
    for tr, va in skf.split(X, y):
        try:
            model = LMTGradientBoostingBinary(params)
            model.fit(X[tr], y[tr])
            proba = model.predict_proba(X[va])[:, 1]
            val = safe_auc(y[va], proba)
            if np.isfinite(val):
                vals.append(val)
        except Exception:
            # пропускаем упавший фолд
            continue
    if len(vals) == 0:
        return -1e9
    return float(np.mean(vals))

def metrics_row(name, y_true, y_pred, proba=None):
    row = {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "F2": fbeta_score(y_true, y_pred, beta=2, average="weighted"),
    }
    if proba is not None:
        try:
            row["ROC-AUC"] = safe_auc(y_true, proba)
        except Exception:
            row["ROC-AUC"] = np.nan
    return row

# ---------- 3) Optuna objective: стабильное пространство (L2 + lbfgs) ----------
def objective_l2(trial: optuna.Trial):
    params = LMTGBParams(
        n_estimators=trial.suggest_int("n_estimators", 180, 420),
        learning_rate=trial.suggest_float("learning_rate", 0.02, 0.10),
        random_state=42,
        max_depth=trial.suggest_int("max_depth", 1, 3),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 40, 140),  # крупнее листья для устойчивости
        reuse_ratio=trial.suggest_float("reuse_ratio", 0.2, 0.7),
        topk_frac=trial.suggest_float("topk_frac", 0.8, 1.0),
        penalty="l2",
        C=trial.suggest_float("C", 0.5, 3.0, log=True),                  # умеренная L2-регуляризация
        l1_ratio=0.5,                                                    # не используется при L2
        solver="lbfgs",
        max_iter=7000
    )
    val = robust_cv_auc(params, X_train, y_train, n_splits=3, seed=42)
    trial.report(val, 1)
    return val

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective_l2, n_trials=50, timeout=1200, show_progress_bar=False)

print("Best CV ROC-AUC (L2):", study.best_value)
print("Best params (L2):", study.best_params)

best = study.best_params
gb_best = LMTGradientBoostingBinary(LMTGBParams(
    n_estimators=best["n_estimators"],
    learning_rate=best["learning_rate"],
    random_state=42,
    max_depth=best["max_depth"],
    min_samples_leaf=best["min_samples_leaf"],
    reuse_ratio=best["reuse_ratio"],
    topk_frac=best["topk_frac"],
    penalty="l2",
    C=best["C"],
    l1_ratio=0.5,
    solver="lbfgs",
    max_iter=7000
)).fit(X_train, y_train)

# ---------- 4) Финальное сравнение на тесте ----------
rows = []

# LMT-GB (Optuna, best)
y_pred = gb_best.predict(X_test)
proba = gb_best.predict_proba(X_test)[:, 1]
rows.append(metrics_row("LMT-GB (Optuna, Adult)", y_test, y_pred, proba))

# LMT-GB (baseline из предыдущего эксперимента, для ориентира)
gb_base = LMTGradientBoostingBinary(LMTGBParams(
    n_estimators=220, learning_rate=0.06, random_state=42,
    max_depth=2, min_samples_leaf=40, reuse_ratio=0.4, topk_frac=1.0,
    penalty="l2", C=1.0, l1_ratio=0.5, solver="lbfgs", max_iter=6000
)).fit(X_train, y_train)
rows.append(metrics_row("LMT-GB (baseline)", y_test,
                        gb_base.predict(X_test), gb_base.predict_proba(X_test)[:,1]))

# LMT (single) — на тех же листовых гиперах
lmt_single = LogisticModelTreePenalized(
    max_depth=best["max_depth"],
    min_samples_leaf=best["min_samples_leaf"],
    random_state=42,
    reuse_ratio=best["reuse_ratio"],
    topk_frac=best["topk_frac"],
    penalty="l2", C=best["C"], l1_ratio=0.5, solver="lbfgs", max_iter=7000
).fit(X_train, y_train)
rows.append(metrics_row("LMT (single, tuned-like)", y_test,
                        lmt_single.predict(X_test), lmt_single.predict_proba(X_test)[:,1]))

# Logistic Regression (baseline)
lr = make_pipeline(StandardScaler(with_mean=False),  # with_mean=False для совместимости с OHE dense/large
                   LogisticRegression(max_iter=5000, solver="lbfgs", n_jobs=-1)).fit(X_train, y_train)
rows.append(metrics_row("Logistic Regression", y_test, lr.predict(X_test), lr.predict_proba(X_test)[:,1]))

# Random Forest
rf = RandomForestClassifier(n_estimators=600, random_state=42, n_jobs=-1).fit(X_train, y_train)
rows.append(metrics_row("Random Forest", y_test, rf.predict(X_test), rf.predict_proba(X_test)[:,1]))

# XGBoost
try:
    from xgboost import XGBClassifier
    xgb = XGBClassifier(
        objective="binary:logistic",
        n_estimators=800, learning_rate=0.05, max_depth=6,
        subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
        eval_metric="logloss", random_state=42, n_jobs=-1
    ).fit(X_train, y_train)
    proba_xgb = xgb.predict_proba(X_test)[:, 1]
    y_pred_xgb = (proba_xgb >= 0.5).astype(int)
    rows.append(metrics_row("XGBoost", y_test, y_pred_xgb, proba_xgb))
except Exception as e:
    print("XGBoost недоступен:", e)

results = pd.DataFrame(rows).sort_values(["ROC-AUC","Accuracy"], ascending=False).reset_index(drop=True)
print(results)


[I 2025-10-03 21:17:55,415] A new study created in memory with name: no-name-f7c3297c-1637-4a52-9b2e-d47e0a26c18b
[I 2025-10-04 00:31:22,307] Trial 0 finished with value: 0.9264141289787141 and parameters: {'n_estimators': 270, 'learning_rate': 0.0960571445127933, 'max_depth': 3, 'min_samples_leaf': 100, 'reuse_ratio': 0.27800932022121827, 'topk_frac': 0.8311989040672406, 'C': 0.554840098004973}. Best is trial 0 with value: 0.9264141289787141.


Best CV ROC-AUC (L2): 0.9264141289787141
Best params (L2): {'n_estimators': 270, 'learning_rate': 0.0960571445127933, 'max_depth': 3, 'min_samples_leaf': 100, 'reuse_ratio': 0.27800932022121827, 'topk_frac': 0.8311989040672406, 'C': 0.554840098004973}
                      Model  Accuracy  Precision    Recall        F1  \
0                   XGBoost  0.874974   0.870663  0.874974  0.871174   
1    LMT-GB (Optuna, Adult)  0.876408   0.872044  0.876408  0.872062   
2         LMT-GB (baseline)  0.874497   0.869925  0.874497  0.869305   
3       Logistic Regression  0.855524   0.849180  0.855524  0.849819   
4  LMT (single, tuned-like)  0.855729   0.849234  0.855729  0.849477   
5             Random Forest  0.856821   0.851218  0.856821  0.852416   

         F2   ROC-AUC  
0  0.873095  0.929234  
1  0.874203  0.929224  
2  0.871805  0.927305  
3  0.852702  0.906712  
4  0.852595  0.905817  
5  0.854708  0.905456  


Optuna-тюнинг LMT-GB дал ROC-AUC = 0.9292, Accuracy = 0.8764, и это фактически сравнялось с XGBoost (0.9292 ROC-AUC, 0.8750 Acc).  
Причём Optuna-LMT-GB чуть выше по Accuracy и F2, то есть модель более устойчива к недообнаружению положительного класса.
Базовый LMT-GB тоже неплох (0.927 ROC-AUC), но хуже без тюнинга.
Логистическая регрессия и RF заметно позади (ROC-AUC около 0.905–0.907).
При этом одиночный LMT (single OVR) догнал логистическую регрессию, что показывает, что сама структура «логистическое дерево» хорошо работает с табличными кусочно-линейными зависимостями.

Вывод:
На реальном датасете Adult Income LMT-GB после Optuna догнал XGBoost по ROC-AUC и даже чуть обошёл по Accuracy / F2.
Это отличный кейс: можно показать, что гибрид логистики и бустинга работает на уровне state-of-the-art табличных моделей, сохраняя интерпретируемость (каждый лист — регрессия).