# Predicting the biological response of molecules

The task is to predict the biological response of molecules (the 'Activity' column) from their chemical composition (columns D1-D1776).

The data is provided in CSV format. Each row represents one molecule.

- The first column, Activity, contains experimental data describing the actual biological response (0 or 1);
- The remaining columns, D1-D1776, are molecular **descriptors**: computed properties that can capture characteristics of a molecule such as its size, shape, or elemental composition.

No preprocessing is required: the data is already encoded and normalized.

We will use the **F1-score** as the metric.
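
As a reminder of what this metric measures, here is a minimal sketch that computes F1 from confusion counts. The helper `f1_from_counts` and the counts are hypothetical illustrations, not values from this dataset; in the notebook itself the metric comes from `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        # Without true positives, precision and recall are zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 true positives, 20 false positives, 20 false negatives
example_f1 = f1_from_counts(tp=80, fp=20, fn=20)
print(f"F1 = {example_f1:.2f}")
```

Because it combines precision and recall, F1 penalizes a model that favors one at the expense of the other.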

Two models must be trained: **logistic regression and random forest**. Their hyperparameters are then tuned with both basic and advanced optimization methods. **All four methods** (GridSearchCV, RandomizedSearchCV, Hyperopt, Optuna) must be used at least once, and the maximum number of iterations must not exceed 50.


In [18]:
import warnings
warnings.filterwarnings('ignore')

## Importing libraries

In [19]:
import numpy as np
import optuna
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# hp is the hyperopt.hp search-space module; a separate "import hyperopt as hp" would only be shadowed by this import
from hyperopt import hp, fmin, tpe, Trials
from sklearn import linear_model 
from sklearn import tree 
from sklearn import ensemble 
from sklearn import metrics 
from sklearn import preprocessing 
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV

%matplotlib inline
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'

## Loading and exploring the data

In [20]:
molecules = pd.read_csv('data/_train_sem09 (1).csv')
molecules.head()

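|  | Activity | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | ... | D1767 | D1768 | D1769 | D1770 | D1771 | D1772 | D1773 | D1774 | D1775 | D1776 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0.497009 | 0.1 | 0.0 | 0.132956 | 0.678031 | 0.273166 | 0.585445 | 0.743663 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.366667 | 0.606291 | 0.05 | 0.0 | 0.111209 | 0.803455 | 0.106105 | 0.411754 | 0.836582 | ... | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0.0333 | 0.480124 | 0.0 | 0.0 | 0.209791 | 0.61035 | 0.356453 | 0.51772 | 0.679051 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.0 | 0.538825 | 0.0 | 0.5 | 0.196344 | 0.72423 | 0.235606 | 0.288764 | 0.80511 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.1 | 0.517794 | 0.0 | 0.0 | 0.494734 | 0.781422 | 0.154361 | 0.303809 | 0.812646 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |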


In [21]:
data = molecules.copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3751 entries, 0 to 3750
Columns: 1777 entries, Activity to D1776
dtypes: float64(942), int64(835)
memory usage: 50.9 MB


In [22]:
print('Missing values: ', data.isnull().sum().sum())

Missing values:  0


There are no missing values, and the data is already encoded and normalized, so we can move straight on to building the models.

In [23]:
X = data.drop(['Activity'], axis=1)
y = data['Activity']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 1, test_size = 0.2)

## Logistic regression
First, let's record the baseline metric obtained without any tuning, with the hyperparameters left at their defaults (apart from `max_iter=50`):

In [24]:
# Create a logistic regression object
log_reg = linear_model.LogisticRegression(random_state=42, max_iter = 50)
# Fit the model by minimizing log loss
log_reg.fit(X_train, y_train)
y_test_pred = log_reg.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

f1_score on the test set: 0.79


Let's compare the results of the different optimization methods.

## 1. GridSearchCV

In [25]:
param_grid = [
    {
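        'penalty': ['l2', 'none'] , # regularization type; scikit-learn >= 1.2 expects None instead of the string 'none'
        'solver': ['lbfgs', 'sag'], # optimization algorithm
        'C': list(np.linspace(0.01, 1, 20, dtype=float)) # inverse regularization strength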
    },    
    {
        'penalty': ['l1', 'l2'] ,
        'solver': ['liblinear', 'saga'],
        'C': list(np.linspace(0.01, 1, 20, dtype=float))
    },
]

In [26]:
grid_search = GridSearchCV(
    estimator=linear_model.LogisticRegression(
        random_state=42, # random number generator seed
        max_iter=50 # maximum number of solver iterations allowed for convergence
    ), 
    param_grid=param_grid, 
    scoring='f1',
    cv=10, 
    n_jobs = -1
)  
grid_search.fit(X_train, y_train) 
print("f1_score on the test set: {:.2f}".format(grid_search.score(X_test, y_test)))
print("Best hyperparameter values: {}".format(grid_search.best_params_))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

f1_score on the test set: 0.80
Best hyperparameter values: {'C': 0.01, 'penalty': 'l2', 'solver': 'sag'}


Tuning improved the metric only marginally, from 0.79 to 0.80.
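
To put that gain in context, it helps to count how much work the grid search did. The sketch below counts the candidates in the two sub-grids above using only the standard library. `count_candidates` is a hypothetical helper, and the 20 values of `C` are represented by placeholders because only their number matters here.

```python
from itertools import product

def count_candidates(grids):
    """Count hyperparameter combinations across a list of sub-grids."""
    total = 0
    for grid in grids:
        # Cartesian product of all value lists in this sub-grid
        total += sum(1 for _ in product(*grid.values()))
    return total

c_values = list(range(20))  # stand-in for the 20 values from np.linspace(0.01, 1, 20)
grids = [
    {'penalty': ['l2', 'none'], 'solver': ['lbfgs', 'sag'], 'C': c_values},
    {'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga'], 'C': c_values},
]

n_candidates = count_candidates(grids)
n_fits = n_candidates * 10  # cv=10 means ten fits per candidate
print(n_candidates, n_fits)
```

With 160 candidates and 10 folds each, the search performs 1,600 fits before the final refit, which is why an exhaustive grid becomes expensive quickly.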

## 2. RandomizedSearchCV

In [27]:
param_distributions_l1 = {
    'penalty': ['l1'] ,
    'solver': ['liblinear', 'saga'],
    'C': list(np.linspace(0.01, 0.5, 20, dtype=float))
} 
param_distributions_l2 = {
    'penalty': ['l2'] , # regularization type
    'solver': ['lbfgs', 'sag', 'liblinear', 'saga'], # optimization algorithm
    'C': list(np.linspace(0.01, 0.5, 20, dtype=float))
} 

In [28]:
random_search = RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_distributions=param_distributions_l1, 
    cv=10, 
    scoring='f1',
    n_iter = 20, 
    n_jobs = -1
)  
random_search.fit(X_train, y_train) 
print("f1_score on the test set: {:.2f}".format(random_search.score(X_test, y_test)))
print("Best hyperparameter values: {}".format(random_search.best_params_))

f1_score on the test set: 0.78
Best hyperparameter values: {'solver': 'liblinear', 'penalty': 'l1', 'C': 0.1131578947368421}


In [29]:
random_search = RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_distributions=param_distributions_l2, 
    cv=10, 
    scoring='f1',
    n_iter = 20, 
    n_jobs = -1
)  
random_search.fit(X_train, y_train) 
print("f1_score on the test set: {:.2f}".format(random_search.score(X_test, y_test)))
print("Best hyperparameter values: {}".format(random_search.best_params_))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

f1_score on the test set: 0.79
Best hyperparameter values: {'solver': 'sag', 'penalty': 'l2', 'C': 0.08736842105263157}


The random search finished faster, but the F1-score did not improve: 0.78 with the L1 space and 0.79 with the L2 space, against a baseline of 0.79.
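
Much of the speed-up comes from evaluating only a sample of the grid. As a rough illustration of that idea (not scikit-learn's actual implementation), the sketch below draws 20 of the 80 candidates in the L2 space without replacement. The variable names are hypothetical and the `C` values are placeholders.

```python
import random
from itertools import product

solvers = ['lbfgs', 'sag', 'liblinear', 'saga']
c_values = list(range(20))  # stand-in for the 20 values of C

# Full grid for the L2 space: 1 penalty x 4 solvers x 20 values of C
all_candidates = list(product(['l2'], solvers, c_values))

rng = random.Random(42)  # fixed seed so the sample is reproducible
sampled = rng.sample(all_candidates, k=20)  # 20 distinct candidates, like n_iter=20

coverage = len(sampled) / len(all_candidates)
print(f"{len(sampled)} of {len(all_candidates)} candidates ({coverage:.0%})")
```

Note that the random search cells above do not set `random_state`, so rerunning them may select different candidates and give slightly different scores.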

## 3. Hyperopt

We will need several hyperparameter search spaces, because the L1 and L2 penalties are supported by different sets of solvers.
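
The reason for splitting the space is solver compatibility. The lookup table below is a summary of the scikit-learn `LogisticRegression` documentation for a subset of solvers (worth re-checking against the installed version), and `solvers_for` is a hypothetical helper that shows which solvers can go into each space.

```python
# Penalties accepted by each solver (subset of solvers, summarized from the scikit-learn docs)
SUPPORTED_PENALTIES = {
    'lbfgs': {'l2', None},
    'newton-cg': {'l2', None},
    'sag': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

def solvers_for(penalty):
    """Return the solvers that accept the given penalty, sorted by name."""
    return sorted(s for s, p in SUPPORTED_PENALTIES.items() if penalty in p)

l1_solvers = solvers_for('l1')
l2_solvers = solvers_for('l2')
print(l1_solvers)
print(l2_solvers)
```

Mixing both penalties and all solvers in a single space would produce invalid combinations such as `penalty='l1'` with `solver='lbfgs'`, which scikit-learn rejects.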

In [30]:
# fix random_state for reproducibility
random_state = 42
def hyperopt_lr(params, cv=10, X=X_train, y=y_train, random_state=random_state):
    # the function receives one combination of hyperparameters in "params"
    params = {
        'penalty': params['penalty'], 
        'solver': params['solver'], 
        'C': params['C'],
    }
    if params['C']==0:
        return 0
  
    model = linear_model.LogisticRegression(**params, max_iter=50, random_state=random_state)

    # no separate model.fit is needed: cross_val_score clones and fits the model on each fold
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()
    
    # hyperopt minimizes the objective, so return the negated F1
    return -score

In [31]:
penalty_l1 = ['l1']
solver_l1 = ['liblinear', 'saga']
space_l1 = {
    'penalty': hp.choice('penalty', penalty_l1), # regularization type
    'solver': hp.choice('solver', solver_l1), # optimization algorithm
    'C': hp.quniform('C', 0.01, 0.5, 0.01),
    # 'max_iter': hp.quniform('max_iter', 10, 50, 5)
}

penalty_l2 = ['l2']
solver_l2 = ['lbfgs', 'sag', 'liblinear', 'saga']
space_l2 = {
    'penalty': hp.choice('penalty', penalty_l2), # regularization type
    'solver': hp.choice('solver', solver_l2), # optimization algorithm
    'C': hp.quniform('C', 0.01, 0.5, 0.01),
    # 'max_iter': hp.quniform('max_iter', 10, 50, 5)
}

In [32]:
# start the hyperparameter search
trials = Trials() # used to log the results

best=fmin(
    hyperopt_lr, # our objective function
    space=space_l1, # hyperparameter space
    algo=tpe.suggest, # optimization algorithm; this is the default, so specifying it is optional
    max_evals=20, # maximum number of evaluations
    trials=trials, # result logging
    rstate=np.random.default_rng(random_state)# fixed for reproducibility
)
print("Best hyperparameter values {}".format(best))

# compute F1 on the test set
model = linear_model.LogisticRegression(
    random_state=random_state, 
    penalty=penalty_l1[best['penalty']],
    solver=solver_l1[best['solver']],
    C=best['C'],
    max_iter=50
)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

100%|██████████| 20/20 [10:11<00:00, 30.59s/trial, best loss: -0.7814251316350409]
Best hyperparameter values {'C': 0.25, 'penalty': 0, 'solver': 1}
f1_score on the test set: 0.78


In [33]:
# continue the same search: `trials` already holds 20 evaluations, so max_evals=30 adds 10 more
best=fmin(
    hyperopt_lr, # our objective function
    space=space_l1, # hyperparameter space
    algo=tpe.suggest, # optimization algorithm; this is the default, so specifying it is optional
    max_evals=30, # maximum total number of evaluations
    trials=trials, # result logging
    rstate=np.random.default_rng(random_state)# fixed for reproducibility
)
print("Best hyperparameter values {}".format(best))

# compute F1 on the test set
model = linear_model.LogisticRegression(
    random_state=random_state, 
    penalty=penalty_l1[best['penalty']],
    solver=solver_l1[best['solver']],
    C=best['C'],
    max_iter=50
)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

100%|██████████| 30/30 [07:51<00:00, 47.18s/trial, best loss: -0.783356238248303]
Best hyperparameter values {'C': 0.16, 'penalty': 0, 'solver': 1}
f1_score on the test set: 0.79


The first search space (L1 penalty) did not improve on the baseline test F1 of 0.79.

In [34]:
# start the hyperparameter search
trials = Trials() # used to log the results

best=fmin(
    hyperopt_lr, # our objective function
    space=space_l2, # hyperparameter space
    algo=tpe.suggest, # optimization algorithm; this is the default, so specifying it is optional
    max_evals=20, # maximum number of evaluations
    trials=trials, # result logging
    rstate=np.random.default_rng(random_state)# fixed for reproducibility
)
print("Best hyperparameter values {}".format(best))

# compute F1 on the test set
model = linear_model.LogisticRegression(
    max_iter=50,
    random_state=random_state, 
    penalty=penalty_l2[best['penalty']],
    solver=solver_l2[best['solver']],
    C=best['C']
)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

100%|██████████| 20/20 [05:34<00:00, 16.73s/trial, best loss: -0.7855646326818487]
Best hyperparameter values {'C': 0.03, 'penalty': 0, 'solver': 0}
f1_score on the test set: 0.79


In [35]:
# continue the same search: `trials` already holds 20 evaluations, so max_evals=30 adds 10 more
best=fmin(hyperopt_lr, # our objective function
          space=space_l2, # hyperparameter space
          algo=tpe.suggest, # optimization algorithm; this is the default, so specifying it is optional
          max_evals=30, # maximum total number of evaluations
          trials=trials, # result logging
          rstate=np.random.default_rng(random_state)# fixed for reproducibility
         )
print("Best hyperparameter values {}".format(best))

# compute F1 on the test set
model = linear_model.LogisticRegression(
    max_iter=50,
    random_state=random_state, 
    penalty=penalty_l2[best['penalty']],
    solver=solver_l2[best['solver']],
    C=best['C']
)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

100%|██████████| 30/30 [01:56<00:00, 11.60s/trial, best loss: -0.7858668594512256]
Best hyperparameter values {'C': 0.03, 'penalty': 0, 'solver': 2}
f1_score on the test set: 0.79


The metric did not change: the test F1 is still 0.79.
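
One detail worth spelling out: for parameters defined with `hp.choice`, `fmin` reports the index of the chosen option rather than the option itself, which is why the cells above index into `penalty_l2` and `solver_l2`. Below is a minimal sketch of that decoding step, applied to the last result printed above. `decode_best` is a hypothetical helper; Hyperopt also provides `space_eval(space, best)` for the same purpose.

```python
penalty_l2 = ['l2']
solver_l2 = ['lbfgs', 'sag', 'liblinear', 'saga']

def decode_best(best, choices):
    """Replace hp.choice indices in `best` with the actual option values."""
    decoded = dict(best)  # copy so the original result is left untouched
    for name, options in choices.items():
        decoded[name] = options[best[name]]
    return decoded

best = {'C': 0.03, 'penalty': 0, 'solver': 2}  # as printed by the last fmin call
decoded = decode_best(best, {'penalty': penalty_l2, 'solver': solver_l2})
print(decoded)
```

Decoded this way, the final L2 result corresponds to `solver='liblinear'`, while the result after the first 20 evaluations (`'solver': 0`) corresponds to `lbfgs`.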

## 4. Optuna

In [36]:
def optuna_lr_l1(trial):
  # define the hyperparameter search space
  params = {
    'penalty': trial.suggest_categorical('penalty', ['l1']), # regularization type
    'solver': trial.suggest_categorical('solver', ['liblinear', 'saga']), # optimization algorithm
    'C': trial.suggest_float('C', 0.01, 0.5, step=0.01)
  }

  # create the model
  model = linear_model.LogisticRegression(**params, max_iter=50, random_state=random_state)
  
  # no separate model.fit is needed: cross_val_score clones and fits the model on each fold
  score = cross_val_score(model, X_train, y_train, cv=10, scoring="f1", n_jobs=-1).mean()

  return score

In [37]:
def optuna_lr_l2(trial):
  # define the hyperparameter search space
  params = {
    'penalty': trial.suggest_categorical('penalty', ['l2']), # regularization type
    'solver': trial.suggest_categorical('solver', ['lbfgs', 'sag', 'liblinear', 'saga']), # optimization algorithm
    'C': trial.suggest_float('C', 0.01, 0.5, step=0.01) # inverse regularization strength
  }

  # create the model
  model = linear_model.LogisticRegression(**params, max_iter=50, random_state=random_state)

  # cross_val_score clones and fits the model on each fold itself,
  # so a separate model.fit call is not needed
  score = cross_val_score(model, X_train, y_train, cv=10, scoring="f1", n_jobs=-1).mean()

  return score

In [38]:
%%time
# create the study object
# direction="maximize" tells Optuna that the metric should be maximized
study = optuna.create_study(study_name="LogisticRegression", direction="maximize")
# search for the best hyperparameter combination over n_trials trials
study.optimize(optuna_lr_l1, n_trials=20)

[I 2023-05-04 14:04:05,757] A new study created in memory with name: LogisticRegression
[I 2023-05-04 14:04:08,643] Trial 0 finished with value: 0.7768567697551922 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.09}. Best is trial 0 with value: 0.7768567697551922.
[I 2023-05-04 14:04:14,478] Trial 1 finished with value: 0.7763277959460242 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.37}. Best is trial 0 with value: 0.7768567697551922.
[I 2023-05-04 14:04:19,805] Trial 2 finished with value: 0.7785144944797836 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.24000000000000002}. Best is trial 2 with value: 0.7785144944797836.
[I 2023-05-04 14:04:23,278] Trial 3 finished with value: 0.7825813250376223 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.11}. Best is trial 3 with value: 0.7825813250376223.
[I 2023-05-04 14:05:07,533] Trial 4 finished with v

CPU times: user 2min, sys: 1.75 s, total: 2min 2s
Wall time: 9min 1s


In [40]:
# print the best hyperparameters found by the study
print("Best hyperparameter values {}".format(study.best_params))
# compute F1 on the test set
# (unlike the objective, this refit uses the default max_iter)
model = linear_model.LogisticRegression(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

Best hyperparameter values {'penalty': 'l1', 'solver': 'saga', 'C': 0.16}
f1_score on the test set: 0.79


In [41]:
%%time
# create the study object
# direction="maximize" tells Optuna that the metric should be maximized
study = optuna.create_study(study_name="LogisticRegression", direction="maximize")
# search for the best hyperparameter combination over n_trials trials
study.optimize(optuna_lr_l2, n_trials=20)

[I 2023-05-04 14:15:54,500] A new study created in memory with name: LogisticRegression
[I 2023-05-04 14:16:36,534] Trial 0 finished with value: 0.7817738728934732 and parameters: {'penalty': 'l2', 'solver': 'saga', 'C': 0.4}. Best is trial 0 with value: 0.7817738728934732.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

CPU times: user 1min 5s, sys: 2.01 s, total: 1min 7s
Wall time: 4min 33s


In [42]:
# print the best hyperparameters found by the study
print("Best hyperparameter values {}".format(study.best_params))
# compute F1 on the test set
# (unlike the objective, this refit uses the default max_iter)
model = linear_model.LogisticRegression(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

Best hyperparameter values {'penalty': 'l2', 'solver': 'liblinear', 'C': 0.03}
f1_score on the test set: 0.79


> For logistic regression, only GridSearchCV improved the test F1-score over the default hyperparameters, and only by 0.01 (from 0.79 to 0.80).

# Random forest
First, we record the metric obtained without any tuning, i.e. with the default hyperparameter values:

In [43]:
# create a random forest classifier with default hyperparameters
rf = ensemble.RandomForestClassifier(random_state=random_state)

# fit the model
rf.fit(X_train, y_train)
# print the metric on the test set
y_test_pred = rf.predict(X_test)
print('Test: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

Test: 0.81


Even with default settings, the random forest outperforms logistic regression (test F1 of 0.81 vs. 0.79).

Let us compare the results of the different optimization methods:

## 1. GridSearchCV

In [44]:
param_grid = {
    'n_estimators': list(range(80, 200, 10)),
    'min_samples_leaf': [5],
    'max_depth': list(np.linspace(10, 30, 5, dtype=int))
}
            
grid_search_forest = GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state), 
    param_grid=param_grid, 
    cv=10, 
    scoring='f1',
    n_jobs = -1
)  
%time grid_search_forest.fit(X_train, y_train) 
y_test_pred = grid_search_forest.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Best hyperparameter values: {}".format(grid_search_forest.best_params_))

CPU times: user 6.49 s, sys: 341 ms, total: 6.83 s
Wall time: 12min 54s
f1_score on the test set: 0.83
Best hyperparameter values: {'max_depth': 15, 'min_samples_leaf': 5, 'n_estimators': 110}


The hand-picked parameter grid improved the model: test F1 rose from 0.81 to 0.83.
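As a quick check of how much work this grid implies, the sketch below counts the candidate combinations and the resulting number of model fits. It uses only the values from the `param_grid` cell above; the `max_depth` list is written out by hand as the values that `np.linspace(10, 30, 5, dtype=int)` produces.

```python
from math import prod

# Grid copied from the GridSearchCV cell above
param_grid = {
    "n_estimators": list(range(80, 200, 10)),  # 80, 90, ..., 190
    "min_samples_leaf": [5],
    "max_depth": [10, 15, 20, 25, 30],         # np.linspace(10, 30, 5, dtype=int)
}
cv = 10  # number of cross-validation folds used in the search

# Every combination of values is one candidate; each candidate is fit once per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

print(f"candidates: {n_candidates}, model fits: {n_fits}")
# candidates: 60, model fits: 600
```

With 60 candidates, this grid goes beyond the 50-iteration limit stated in the task, so GridSearchCV was given a larger search budget than the other optimizers in this notebook.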

## 2. RandomizedSearchCV

In [45]:
param_distributions = {
    'n_estimators': list(np.linspace(100, 160, 13, dtype=int)),
    'min_samples_leaf': [5],
    'max_depth': list(np.linspace(10, 25, 5, dtype=int))
}
            
random_search_forest = RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=random_state), 
    param_distributions=param_distributions, 
    cv=10,
    scoring='f1',
    n_iter = 30, 
    n_jobs = -1
)  
%time random_search_forest.fit(X_train, y_train) 
y_test_pred = random_search_forest.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Best hyperparameter values: {}".format(random_search_forest.best_params_))

CPU times: user 5.1 s, sys: 339 ms, total: 5.44 s
Wall time: 6min 16s
f1_score on the test set: 0.82
Best hyperparameter values: {'n_estimators': 125, 'min_samples_leaf': 5, 'max_depth': 13}


RandomizedSearchCV improved the metric less than GridSearchCV did (0.82 vs. 0.83).

## 3. Hyperopt

In [46]:
# define the hyperparameter search space
space = {
       'n_estimators': hp.quniform('n_estimators', 110, 160, 1),
       'max_depth' : hp.quniform('max_depth', 11, 24, 1),
       'min_samples_leaf': hp.quniform('min_samples_leaf', 5, 6, 1)
}

In [47]:
def hyperopt_rf(params, cv=10, X=X_train, y=y_train, random_state=random_state):
    # the function receives one hyperparameter combination in "params";
    # hp.quniform yields floats, so cast them to int
    params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']), 
        'min_samples_leaf': int(params['min_samples_leaf'])
    }
  
    # build the model with this combination
    model = ensemble.RandomForestClassifier(**params, random_state=random_state)

    # cross_val_score clones and fits the model on each fold itself,
    # so a separate model.fit call is not needed
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    # fmin minimizes the objective, so return the negated score
    return -score

In [48]:
# start the hyperparameter search
trials = Trials() # used to log the results

best = fmin(
    hyperopt_rf, # our objective function
    space=space, # hyperparameter search space
    algo=tpe.suggest, # optimization algorithm (the default, so this is optional)
    max_evals=20, # maximum number of iterations
    trials=trials, # result logging
    rstate=np.random.default_rng(random_state) # fixed for reproducibility
)
print("Best hyperparameter values {}".format(best))

# compute F1 on the test set
model = ensemble.RandomForestClassifier(
    random_state=random_state, 
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf'])
)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

100%|██████████| 20/20 [05:44<00:00, 17.21s/trial, best loss: -0.807903170418055] 
Best hyperparameter values {'max_depth': 16.0, 'min_samples_leaf': 5.0, 'n_estimators': 126.0}
f1_score on the test set: 0.82


With the given number of iterations, Hyperopt improved the metric less than GridSearchCV did (0.82 vs. 0.83).
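The Hyperopt output above reports every value as a float (`16.0`, `5.0`, `126.0`) because `hp.quniform` samples floats, which is why the cells cast each value with `int(...)` before building the forest. A minimal sketch of doing that cast in one place follows; the helper name `to_int_params` is hypothetical and not part of Hyperopt.

```python
def to_int_params(best, int_keys=("n_estimators", "max_depth", "min_samples_leaf")):
    """Cast the listed keys of a Hyperopt result dict to int, leaving others as-is."""
    return {key: int(value) if key in int_keys else value
            for key, value in best.items()}


# The result printed by fmin in the cell above
best = {"max_depth": 16.0, "min_samples_leaf": 5.0, "n_estimators": 126.0}

rf_params = to_int_params(best)
print(rf_params)
# {'max_depth': 16, 'min_samples_leaf': 5, 'n_estimators': 126}
```

The resulting dictionary could then be unpacked directly, for example `ensemble.RandomForestClassifier(**rf_params, random_state=random_state)`.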

## 4. Optuna

In [49]:
def optuna_rf(trial):
  # define the hyperparameter search space (the step defaults to 1)
  n_estimators = trial.suggest_int('n_estimators', 110, 160)
  max_depth = trial.suggest_int('max_depth', 11, 24)
  min_samples_leaf = trial.suggest_int('min_samples_leaf', 5, 6)

  # create the model
  model = ensemble.RandomForestClassifier(n_estimators=n_estimators,
                                          max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf,
                                          random_state=random_state)
  # fit the model
  model.fit(X_train, y_train)
  # NOTE: this is F1 on the training data, not a cross-validated score,
  # so the objective rewards fitting the training set
  score = metrics.f1_score(y_train, model.predict(X_train))

  return score
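The objective above scores each trial on the training data, whereas the logistic regression objectives and the Hyperopt objective use 10-fold cross-validation. A hedged sketch of a cross-validated variant is shown below. The factory name `make_rf_cv_objective` and the default `random_state=42` are assumptions made for illustration (the notebook's own `random_state` is defined earlier); the search ranges are copied from `optuna_rf`.

```python
from sklearn import ensemble
from sklearn.model_selection import cross_val_score


def make_rf_cv_objective(X, y, cv=10, random_state=42, n_jobs=-1):
    """Build an Optuna objective that scores a random forest by cross-validated F1."""

    def objective(trial):
        # Same search ranges as in optuna_rf
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 110, 160),
            "max_depth": trial.suggest_int("max_depth", 11, 24),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 5, 6),
        }
        model = ensemble.RandomForestClassifier(**params, random_state=random_state)
        # cross_val_score clones and fits the model on each fold itself,
        # so the returned value is a mean F1 over held-out folds
        scores = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=n_jobs)
        return scores.mean()

    return objective
```

It could be plugged into the study in place of `optuna_rf`, for example `study.optimize(make_rf_cv_objective(X_train, y_train), n_trials=20)`. Each trial would then take roughly ten times as many fits, so the run time would no longer be comparable to the timings shown in this section.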

In [50]:
%%time
# create the study object
# direction="maximize" tells Optuna that the metric should be maximized
study = optuna.create_study(study_name="RandomForestClassifier", direction="maximize")
# search for the best hyperparameter combination over n_trials trials
study.optimize(optuna_rf, n_trials=20)

[I 2023-05-04 14:56:50,767] A new study created in memory with name: RandomForestClassifier
[I 2023-05-04 14:56:54,749] Trial 0 finished with value: 0.9435114503816794 and parameters: {'n_estimators': 119, 'max_depth': 21, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.9435114503816794.
[I 2023-05-04 14:56:59,323] Trial 1 finished with value: 0.9354147535965719 and parameters: {'n_estimators': 146, 'max_depth': 22, 'min_samples_leaf': 6}. Best is trial 0 with value: 0.9435114503816794.
[I 2023-05-04 14:57:04,185] Trial 2 finished with value: 0.9446652399877714 and parameters: {'n_estimators': 151, 'max_depth': 24, 'min_samples_leaf': 5}. Best is trial 2 with value: 0.9446652399877714.
[I 2023-05-04 14:57:08,644] Trial 3 finished with value: 0.9346365302382407 and parameters: {'n_estimators': 146, 'max_depth': 13, 'min_samples_leaf': 5}. Best is trial 2 with value: 0.9446652399877714.
[I 2023-05-04 14:57:13,432

CPU times: user 1min 18s, sys: 1.24 s, total: 1min 19s
Wall time: 1min 21s


In [51]:
# print the best hyperparameters and the best objective value (training-set F1)
print("Best hyperparameter values {}".format(study.best_params))
print("f1_score on the training set: {:.2f}".format(study.best_value))

Best hyperparameter values {'n_estimators': 117, 'max_depth': 16, 'min_samples_leaf': 5}
f1_score on the training set: 0.95


In [52]:
# compute F1 on the test set
model = ensemble.RandomForestClassifier(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

f1_score on the test set: 0.82


In [53]:
%%time
# the search can be resumed: these n_trials are added to the trials already run
study.optimize(optuna_rf, n_trials=10)

[I 2023-05-04 15:01:28,922] Trial 20 finished with value: 0.9319852941176471 and parameters: {'n_estimators': 118, 'max_depth': 17, 'min_samples_leaf': 6}. Best is trial 6 with value: 0.9459541984732824.
[I 2023-05-04 15:01:32,967] Trial 21 finished with value: 0.9443084455324358 and parameters: {'n_estimators': 122, 'max_depth': 19, 'min_samples_leaf': 5}. Best is trial 6 with value: 0.9459541984732824.
[I 2023-05-04 15:01:37,352] Trial 22 finished with value: 0.945054945054945 and parameters: {'n_estimators': 131, 'max_depth': 19, 'min_samples_leaf': 5}. Best is trial 6 with value: 0.9459541984732824.
[I 2023-05-04 15:01:41,946] Trial 23 finished with value: 0.9438339438339438 and parameters: {'n_estimators': 132, 'max_depth': 21, 'min_samples_leaf': 5}. Best is trial 6 with value: 0.9459541984732824.
[I 2023-05-04 15:01:46,249] Trial 24 finished with value: 0.945054945054945 and parameters: {'n_estimators': 131, 'max_depth

CPU times: user 42 s, sys: 522 ms, total: 42.6 s
Wall time: 43.6 s


In [54]:
# print the best hyperparameters found by the study
print("Best hyperparameter values {}".format(study.best_params))

# compute F1 on the test set
model = ensemble.RandomForestClassifier(**study.best_params, random_state=random_state)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print('f1_score on the test set: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

Best hyperparameter values {'n_estimators': 117, 'max_depth': 16, 'min_samples_leaf': 5}
f1_score on the test set: 0.82


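Optuna improved the metric less than GridSearchCV did (0.82 vs. 0.83). Note that the Optuna objective for the forest scored F1 on the training set (about 0.95) rather than with cross-validation, so the search favored configurations that fit the training data, not necessarily those that generalize best.

> For the random forest, GridSearchCV improved the test F1-score over the default hyperparameters by 0.02 (from 0.81 to 0.83); the other methods improved it by only 0.01 (to 0.82).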

## Summary

| Model | Optimizer | F1 on the test set |
|-------------|-------------|-------------|
| Logistic regression | None (defaults) | 0.79 |
| | GridSearchCV | 0.80 |
| | RandomizedSearchCV | 0.79 |
| | Hyperopt | 0.79 |
| | Optuna | 0.79 |
| Random forest | None (defaults) | 0.81 |
| | GridSearchCV | 0.83 |
| | RandomizedSearchCV | 0.82 |
| | Hyperopt | 0.82 |
| | Optuna | 0.82 |
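The gain of each optimizer over the untuned baseline can be recomputed from the table. The sketch below simply restates the table values as dictionaries and subtracts the baseline; nothing here is measured anew.

```python
# Test-set F1 values copied from the summary table
baseline = {"Logistic regression": 0.79, "Random forest": 0.81}
tuned = {
    "Logistic regression": {
        "GridSearchCV": 0.80, "RandomizedSearchCV": 0.79,
        "Hyperopt": 0.79, "Optuna": 0.79,
    },
    "Random forest": {
        "GridSearchCV": 0.83, "RandomizedSearchCV": 0.82,
        "Hyperopt": 0.82, "Optuna": 0.82,
    },
}

# Gain over the default hyperparameters, rounded to the table's precision
gains = {
    model: {opt: round(score - baseline[model], 2) for opt, score in scores.items()}
    for model, scores in tuned.items()
}

for model, per_optimizer in gains.items():
    for optimizer, gain in per_optimizer.items():
        print(f"{model:<20} {optimizer:<20} {gain:+.2f}")
```

All gains are at most 0.02, which is small enough that differences between the optimizers may partly reflect the particular train/test split rather than the search methods themselves.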

> Random forest achieved the best F1 score. GridSearchCV found the best hyperparameters, possibly because the hand-picked grid happened to cover a good region of the search space.
Overall, GridSearchCV was the slowest optimizer here (12 min 54 s of wall time for the random forest), and it is less flexible than the other optimizers.
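To put the run times in context, the sketch below divides the wall-clock time of each random forest search by the number of candidates it evaluated. The times and candidate counts are taken from the cells above (60 grid combinations, `n_iter=30`, `max_evals=20`, and the first 20 Optuna trials); the Optuna objective did no cross-validation, so its figure is not directly comparable.

```python
# (wall-clock seconds, number of candidates evaluated) for the random forest searches
searches = {
    "GridSearchCV": (12 * 60 + 54, 60),
    "RandomizedSearchCV": (6 * 60 + 16, 30),
    "Hyperopt": (5 * 60 + 44, 20),
    "Optuna (no CV)": (1 * 60 + 21, 20),
}

# Average seconds spent per evaluated candidate
seconds_per_candidate = {
    name: seconds / n_candidates
    for name, (seconds, n_candidates) in searches.items()
}

for name, value in seconds_per_candidate.items():
    print(f"{name:<20} {value:5.1f} s per candidate")
```

Per candidate, GridSearchCV costs about the same as RandomizedSearchCV; its longer total time comes from evaluating more candidates.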