## ML-7. Прогнозирование биологического ответа (HW-3)
Необходимо предсказать биологический ответ молекул (столбец 'Activity') по их химическому составу (столбцы D1-D1776).

В качестве метрики будем использовать F1-score.

Необходимо обучить две модели: логистическую регрессию и случайный лес. Далее нужно сделать подбор гиперпараметров с помощью базовых и продвинутых методов оптимизации. Важно использовать все четыре метода (GridSeachCV, RandomizedSearchCV, Hyperopt, Optuna) хотя бы по разу, максимальное количество итераций не должно превышать 50.

In [16]:
#импорт библиотек
import numpy as np #для матричных вычислений
import pandas as pd #для анализа и предобработки данных
import matplotlib.pyplot as plt #для визуализации
import seaborn as sns #для визуализации

from sklearn import linear_model #линейные моделиё
from sklearn import tree #деревья решений
from sklearn import ensemble #ансамбли
from sklearn import metrics #метрики
from sklearn import preprocessing #предобработка
from sklearn.model_selection import train_test_split #сплитование выборки
from sklearn.model_selection import RandomizedSearchCV #Randomized Search
from sklearn.model_selection import GridSearchCV #Grid Search
from sklearn.model_selection import cross_val_score

import hyperopt
from hyperopt import hp, fmin, tpe, Trials
import optuna

%matplotlib inline
plt.style.use('seaborn')

In [17]:
# Прочитаем данные
data = pd.read_csv('data/_train_sem09.csv')
data.head()

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.0,0.497009,0.1,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.0333,0.480124,0.0,0.0,0.209791,0.61035,0.356453,0.51772,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.0,0.538825,0.0,0.5,0.196344,0.72423,0.235606,0.288764,0.80511,...,0,0,0,0,0,0,0,0,0,0
4,0,0.1,0.517794,0.0,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Хотя предварительная обработка данных не требуется, на всякий случай проверим наличие пропусков
data.isnull().sum()

Activity    0
D1          0
D2          0
D3          0
D4          0
           ..
D1772       0
D1773       0
D1774       0
D1775       0
D1776       0
Length: 1777, dtype: int64

In [19]:
# Создаем матрицу наблюдений X и вектор ответов y
X = data.drop(['Activity'], axis=1)
y = data['Activity']

In [20]:
# Разделяем выборку на тренировочную и тестовую в соотношении 80/20. 
# Для сохранения соотношений целевого признака используем параметр stratify (стратифицированное разбиение). 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 42, test_size = 0.2)

Сначала рассчитаем метрики моделей с параметрами по умолчанию

### Логистическая регрессия

In [21]:
%%time

# Зафиксируем только метрики, которые были получены без дополнительной настройки, 
# т.е со значениями гиперпараметров, установленных по умолчанию

#Создаем объект класса логистическая регрессия
log_reg = linear_model.LogisticRegression(max_iter = 10000)
#Обучаем модель
log_reg.fit(X_train, y_train)
y_test_pred = log_reg.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(log_reg.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

accuracy на тестовом наборе: 0.75
f1_score на тестовом наборе: 0.78
CPU times: user 6.28 s, sys: 135 ms, total: 6.42 s
Wall time: 3.75 s


### Дерево решений

In [22]:
%%time

# Зафиксируем только метрики, которые были получены без дополнительной настройки, 
# т.е со значениями гиперпараметров, установленных по умолчанию

#Создаем объект класса дерево решений
dt = tree.DecisionTreeClassifier(random_state=42)
#Обучаем дерево
dt.fit(X_train, y_train)
#Выводим значения метрики 
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(log_reg.score(X_test, y_test_pred)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

accuracy на тестовом наборе: 0.74
f1_score на тестовом наборе: 0.75
CPU times: user 1.13 s, sys: 40.5 ms, total: 1.17 s
Wall time: 1.58 s


### Random Forest

In [23]:
%%time

# Зафиксируем только метрики, которые были получены без дополнительной настройки, 
# т.е со значениями гиперпараметров, установленных по умолчанию

#Создаем объект класса random forest
rf = ensemble.RandomForestClassifier(random_state=42)
#Обучаем модель
rf.fit(X_train, y_train)
#Выводим значения метрики 
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(log_reg.score(X_test, y_test_pred)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

accuracy на тестовом наборе: 0.84
f1_score на тестовом наборе: 0.80
CPU times: user 2.08 s, sys: 53.6 ms, total: 2.14 s
Wall time: 2.23 s


С параметрами по умолчанию наилучшие значения метрики F1 показали Random Forest и Логистическая регрессия.

Подбор гиперпараметров будем дальше осуществлять для этих двух моделей.

### Randomized Search + Логистическая регрессия

In [31]:
%%time

param_distributions = {'penalty': ['l2', 'none'] ,
              'solver': ['lbfgs', 'sag'],
               'C': list(np.linspace(0.01, 1, 10, dtype=float))},
            
random_search = RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_distributions=param_distributions, 
    cv=5, # количество фолдов в кросс-валидации, по умолчанию используется 5.
    n_iter = 50, # количество комбинаций на расчёт. От этого параметра напрямую зависит время оптимизации и качество модели.
    n_jobs = -1 # количество ядер для распараллеливания расчёта. -1 использует все существующие ядра.
)  
%time random_search.fit(X_train, y_train) 
y_test_pred = random_search.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(random_search.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(random_search.best_params_))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

CPU times: user 3.82 s, sys: 447 ms, total: 4.27 s
Wall time: 3min 29s
accuracy на тестовом наборе: 0.75
f1_score на тестовом наборе: 0.78
Наилучшие значения гиперпараметров: {'solver': 'sag', 'penalty': 'l2', 'C': 0.01}
CPU times: user 3.91 s, sys: 475 ms, total: 4.38 s
Wall time: 3min 29s




In [32]:
print("Наилучшее значение точности при кросс-валидаци: {:.2f}".format(random_search.best_score_))

Наилучшее значение точности при кросс-валидаци: 0.76


### Randomized Search + Random Forest

In [59]:
%%time

param_distributions = {
                'n_estimators': list(range(10, 500, 10)),
                'min_samples_leaf': list(range(1, 30, 1)),
                'max_depth': list(range(1, 30, 1))
                }
            
random_search_tree = RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42), 
    param_distributions=param_distributions, 
    n_jobs = -1, # количество ядер для распараллеливания расчёта. -1 использует все существующие ядра.
    cv=5) # количество фолдов в кросс-валидации, по умолчанию используется 5.
  
%time random_search_tree.fit(X_train, y_train) 
y_test_pred = random_search_tree.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(random_search_tree.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(random_search_tree.best_params_))
print("Наилучшее значение точности при кросс-валидаци: {:.2f}".format(random_search_tree.best_score_))

CPU times: user 5.55 s, sys: 182 ms, total: 5.73 s
Wall time: 1min 9s
accuracy на тестовом наборе: 0.77
f1_score на тестовом наборе: 0.79
Наилучшие значения гиперпараметров: {'n_estimators': 450, 'min_samples_leaf': 9, 'max_depth': 19}
Наилучшее значение точности при кросс-валидаци: 0.78
CPU times: user 5.86 s, sys: 200 ms, total: 6.06 s
Wall time: 1min 9s


Использхование Randomized Search практически не повлияло на результат логистической регрессии, при этом работа с логистической регрессией заняла больше времени.

### Grid Search + Логистическая регрессия

In [33]:
%%time

param_grid = {'penalty': ['l2', 'none'] ,
              'solver': ['lbfgs', 'sag'],
               'C': list(np.linspace(0.01, 10, 10, dtype=float))},
            
grid_search_2 = GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_grid=param_grid, 
    cv=5, # количество фолдов в кросс-валидации, по умолчанию используется 5.
    n_jobs = -1 # количество ядер для распараллеливания расчёта. -1 использует все существующие ядра.
)  
%time grid_search_2.fit(X_train, y_train) 
y_test_pred = grid_search_2.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(grid_search_2.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(grid_search_2.best_params_))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

CPU times: user 3.71 s, sys: 276 ms, total: 3.98 s
Wall time: 2min 30s
accuracy на тестовом наборе: 0.75
f1_score на тестовом наборе: 0.78
Наилучшие значения гиперпараметров: {'C': 0.01, 'penalty': 'l2', 'solver': 'sag'}
CPU times: user 3.78 s, sys: 283 ms, total: 4.06 s
Wall time: 2min 30s




In [34]:
print("Наилучшее значение точности при кросс-валидаци: {:.2f}".format(grid_search_2.best_score_))

Наилучшее значение точности при кросс-валидаци: 0.76


### Grid Search + Random Forest

In [56]:
%%time

param_grid = {
            'n_estimators': list(range(10, 500, 10)),
            'min_samples_leaf': list(range(1, 30, 1)),
            'max_depth': list(range(1, 30, 1))},
            
grid_search = GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42), 
    param_grid=param_grid, 
    cv=5, # количество фолдов в кросс-валидации, по умолчанию используется 5.
    n_jobs = -1 # количество ядер для распараллеливания расчёта. -1 использует все существующие ядра.
)  
%time grid_search.fit(X_train, y_train) 
y_test_pred = grid_search_2.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(grid_search.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(grid_search.best_params_))

CPU times: user 13min 30s, sys: 1min 34s, total: 15min 4s
Wall time: 2d 23h 58min 53s
accuracy на тестовом наборе: 0.79
f1_score на тестовом наборе: 0.78
Наилучшие значения гиперпараметров: {'max_depth': 14, 'min_samples_leaf': 1, 'n_estimators': 190}
CPU times: user 13min 30s, sys: 1min 34s, total: 15min 5s
Wall time: 2d 23h 58min 53s


In [58]:
print("Наилучшее значение точности при кросс-валидаци: {:.2f}".format(grid_search.best_score_))

Наилучшее значение точности при кросс-валидаци: 0.80


Grid Search показал сходные с Random Search результаты. При этом его работа заняла гораздо большее количетво времени.

### Hyperopt + Логистическая регрессия

In [35]:
# зададим пространство поиска гиперпараметров
space={'penalty': hp.choice(label='penalty', options=['l2', 'none']),
       'solver': hp.choice(label='solver', options=['lbfgs', 'sag']),
       'C': hp.loguniform(label='C', low=0.01, high=10)}

penalty = ['l2', 'none']
solver = ['lbfgs', 'sag']

# зафиксируем random_state
def hyperopt_lr(params, cv=5, X=X_train, y=y_train, random_state=42):
    params = {'penalty': params['penalty'], 
              'solver': params['solver'], 
             'C': params['C']
              }
  
    # используем эту комбинацию для построения модели
    model = linear_model.LogisticRegression(**params, random_state=42, max_iter=50)

    # обучаем модель
    model.fit(X, y)
    #score = metrics.f1_score(y, model.predict(X))
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    # метрику необходимо минимизировать, поэтому ставим знак минус
    return -score

In [36]:
%%time
# начинаем подбор гиперпараметров
trials = Trials() # используется для логирования результатов

best=fmin(hyperopt_lr, # наша функция 
          space=space, # пространство гиперпараметров
          algo=tpe.suggest, # алгоритм оптимизации, установлен по умолчанию, задавать необязательно
          max_evals=50, # максимальное количество итераций
          trials=trials, # логирование результатов
          rstate=np.random.RandomState(42)# фиксируем для повторяемости результата
         )
print("Наилучшие значения гиперпараметров {}".format(best))

  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]





  2%|▏         | 1/50 [00:09<07:51,  9.63s/trial, best loss: -0.7789014212164245]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

  4%|▍         | 2/50 [00:11<04:17,  5.36s/trial, best loss: -0.7789014212164245]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

  6%|▌         | 3/50 [00:14<03:08,  4.02s/trial, best loss: -0.7789014212164245]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

  8%|▊         | 4/50 [00:16<02:36,  3.41s/trial, best loss: -0.7789014212164245]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 10%|█         | 5/50 [00:26<04:09,  5.55s/trial, best loss: -0.7789014212164245]




 12%|█▏        | 6/50 [00:35<05:05,  6.95s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

 14%|█▍        | 7/50 [00:38<03:54,  5.46s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 16%|█▌        | 8/50 [00:40<03:09,  4.52s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

 18%|█▊        | 9/50 [00:43<02:37,  3.85s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 20%|██        | 10/50 [00:45<02:17,  3.44s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 22%|██▏       | 11/50 [00:55<03:24,  5.25s/trial, best loss: -0.7795013054082204]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 24%|██▍       | 12/50 [00:57<02:47,  4.40s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(




 26%|██▌       | 13/50 [01:08<03:51,  6.27s/trial, best loss: -0.7795013054082204]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 28%|██▊       | 14/50 [01:10<03:04,  5.13s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 30%|███       | 15/50 [01:12<02:30,  4.31s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

 32%|███▏      | 16/50 [01:15<02:07,  3.74s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(




 34%|███▍      | 17/50 [01:24<02:58,  5.41s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

 36%|███▌      | 18/50 [01:27<02:24,  4.52s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 38%|███▊      | 19/50 [01:36<03:08,  6.09s/trial, best loss: -0.7795013054082204]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 40%|████      | 20/50 [01:39<02:29,  5.00s/trial, best loss: -0.7795013054082204]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 42%|████▏     | 21/50 [01:48<03:05,  6.39s/trial, best loss: -0.7795013054082204]




 44%|████▍     | 22/50 [01:58<03:28,  7.44s/trial, best loss: -0.7795013054082204]




 46%|████▌     | 23/50 [02:08<03:36,  8.01s/trial, best loss: -0.7795013054082204]




 48%|████▊     | 24/50 [02:17<03:40,  8.47s/trial, best loss: -0.7795013054082204]




 50%|█████     | 25/50 [02:27<03:37,  8.72s/trial, best loss: -0.7795013054082204]




 52%|█████▏    | 26/50 [02:36<03:35,  8.97s/trial, best loss: -0.7795013054082204]




 54%|█████▍    | 27/50 [02:46<03:33,  9.26s/trial, best loss: -0.7795013054082204]




 56%|█████▌    | 28/50 [02:56<03:27,  9.44s/trial, best loss: -0.7795013054082204]




 58%|█████▊    | 29/50 [03:06<03:20,  9.57s/trial, best loss: -0.7795013054082204]




 60%|██████    | 30/50 [03:16<03:15,  9.76s/trial, best loss: -0.7795013054082204]




 62%|██████▏   | 31/50 [03:25<03:03,  9.65s/trial, best loss: -0.7795013054082204]




 64%|██████▍   | 32/50 [03:35<02:54,  9.67s/trial, best loss: -0.7796811856564594]




 66%|██████▌   | 33/50 [03:45<02:45,  9.76s/trial, best loss: -0.7796811856564594]




 68%|██████▊   | 34/50 [03:55<02:39,  9.96s/trial, best loss: -0.7805297326386645]




 70%|███████   | 35/50 [04:05<02:27,  9.84s/trial, best loss: -0.7805297326386645]




 72%|███████▏  | 36/50 [04:14<02:15,  9.69s/trial, best loss: -0.7805297326386645]




 74%|███████▍  | 37/50 [04:24<02:04,  9.54s/trial, best loss: -0.7805297326386645]




 76%|███████▌  | 38/50 [04:33<01:54,  9.57s/trial, best loss: -0.7805297326386645]




 78%|███████▊  | 39/50 [04:43<01:44,  9.51s/trial, best loss: -0.7805297326386645]




 80%|████████  | 40/50 [04:53<01:37,  9.79s/trial, best loss: -0.7805297326386645]




 82%|████████▏ | 41/50 [05:02<01:27,  9.70s/trial, best loss: -0.7805297326386645]





 84%|████████▍ | 42/50 [05:12<01:17,  9.66s/trial, best loss: -0.7805297326386645]




 86%|████████▌ | 43/50 [05:22<01:07,  9.61s/trial, best loss: -0.7805297326386645]





 88%|████████▊ | 44/50 [05:33<01:00, 10.13s/trial, best loss: -0.7805297326386645]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver op

 90%|█████████ | 45/50 [05:36<00:39,  7.88s/trial, best loss: -0.7805297326386645]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 92%|█████████▏| 46/50 [05:47<00:35,  8.99s/trial, best loss: -0.7805297326386645]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 94%|█████████▍| 47/50 [05:50<00:21,  7.12s/trial, best loss: -0.7805297326386645]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 96%|█████████▌| 48/50 [06:03<00:17,  8.92s/trial, best loss: -0.7805297326386645]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver o

 98%|█████████▊| 49/50 [06:07<00:07,  7.37s/trial, best loss: -0.7805297326386645]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



100%|██████████| 50/50 [06:20<00:00,  7.61s/trial, best loss: -0.7805297326386645]
Наилучшие значения гиперпараметров {'C': 1.0336815823702419, 'penalty': 0, 'solver': 1}
CPU times: user 1min 47s, sys: 2.91 s, total: 1min 50s
Wall time: 6min 20s




In [37]:
# рассчитаем точность для тестовой выборки
model = linear_model.LogisticRegression(
    random_state=42, 
    penalty=penalty[best['penalty']],
    solver=solver[best['solver']],
    C=float(best['C'])
)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print('f1_score на обучающем наборе: {:.2f}'.format(metrics.f1_score(y_train, y_train_pred)))
print("accuracy на тестовом наборе: {:.2f}".format(model.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))



f1_score на обучающем наборе: 0.88
accuracy на тестовом наборе: 0.75
f1_score на тестовом наборе: 0.77


### Hyperopt + Random Forest

In [38]:
# зададим пространство поиска гиперпараметров
space={'n_estimators': hp.quniform('n_estimators', 10, 500, 10),
       'max_depth' : hp.quniform('max_depth', 1, 30, 1),
       'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 30, 1)
      }

# зафиксируем random_state
def hyperopt_rf(params, cv=5, X=X_train, y=y_train, random_state=42):
    params = {'n_estimators': int(params['n_estimators']), 
              'max_depth': int(params['max_depth']), 
             'min_samples_leaf': int(params['min_samples_leaf'])
              }
  
    # используем эту комбинацию для построения модели
    model = ensemble.RandomForestClassifier(**params, random_state=random_state)

    # обучаем модель
    model.fit(X, y)
    #score = metrics.f1_score(y, model.predict(X))
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    # метрику необходимо минимизировать, поэтому ставим знак минус
    return -score

In [39]:
%%time
# начинаем подбор гиперпараметров
trials = Trials() # используется для логирования результатов

best=fmin(hyperopt_rf, # наша функция 
          space=space, # пространство гиперпараметров
          algo=tpe.suggest, # алгоритм оптимизации, установлен по умолчанию, задавать необязательно
          max_evals=50, # максимальное количество итераций
          trials=trials, # логирование результатов
          rstate=np.random.RandomState(42)# фиксируем для повторяемости результата
         )
print("Наилучшие значения гиперпараметров {}".format(best))

100%|██████████| 50/50 [09:29<00:00, 11.39s/trial, best loss: -0.8175082484282425]
Наилучшие значения гиперпараметров {'max_depth': 26.0, 'min_samples_leaf': 1.0, 'n_estimators': 270.0}
CPU times: user 2min 34s, sys: 4.4 s, total: 2min 39s
Wall time: 9min 29s


In [40]:
# рассчитаем точность для тестовой выборки
model = ensemble.RandomForestClassifier(
    random_state=42, 
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf'])
)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print('f1_score на обучающем наборе: {:.2f}'.format(metrics.f1_score(y_train, y_train_pred)))
print("accuracy на тестовом наборе: {:.2f}".format(model.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

f1_score на обучающем наборе: 1.00
accuracy на тестовом наборе: 0.79
f1_score на тестовом наборе: 0.81


Hyperopt позволил еще немного улучшить метрику и сделал это за более короткое время

### Optuna + Логистическая регрессия

In [41]:
from secrets import choice

def optuna_lr(trial):
  # задаем пространства поиска гиперпараметров
  penalty = trial.suggest_categorical('penalty', ['l2', 'none'])
  solver = trial.suggest_categorical('solver', ['lbfgs', 'sag'])
  C = trial.suggest_float(name='C', low=0.01, high=10)

  # создаем модель
  model = linear_model.LogisticRegression(penalty=penalty,
                                          solver=solver,
                                          C=C,
                                          random_state=42,
                                          max_iter=50)
  # обучаем модель
  model.fit(X_train, y_train)
  score = metrics.f1_score(y_train, model.predict(X_train))

  return score

In [42]:
%%time
# cоздаем объект исследования
# можем напрямую указать, что нам необходимо максимизировать метрику direction="maximize"
study = optuna.create_study(study_name="LogisticRegression", direction="maximize")
# ищем лучшую комбинацию гиперпараметров n_trials раз
study.optimize(optuna_lr, n_trials=50)

[32m[I 2022-09-19 16:16:18,203][0m A new study created in memory with name: LogisticRegression[0m
[32m[I 2022-09-19 16:16:21,011][0m Trial 0 finished with value: 0.8804611650485437 and parameters: {'penalty': 'none', 'solver': 'sag', 'C': 4.028374070818028}. Best is trial 0 with value: 0.8804611650485437.[0m
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[32m[I 2022-09-19 16:16:21,528][0m Trial 1 finished with value: 0.880683552029295 and parameters: {'penalty': 'none', 'solver': 'lbfgs', 'C': 1.835619441840671}. Best is trial 1 with value: 0.880683552029295.[0m
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or 

CPU times: user 1min 7s, sys: 1.33 s, total: 1min 8s
Wall time: 49.3 s


In [43]:
# рассчитаем точность для тестовой выборки
model = linear_model.LogisticRegression(**study.best_params,random_state=42, max_iter=50)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(model.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

accuracy на тестовом наборе: 0.75
f1_score на тестовом наборе: 0.77


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Optuna + Random Forest

In [44]:
def optuna_rf(trial):
  # задаем пространства поиска гиперпараметров
  n_estimators = trial.suggest_int('n_estimators', 10, 500, 10)
  max_depth = trial.suggest_int('max_depth', 1, 30, 1)
  min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 30, 1)

  # создаем модель
  model = ensemble.RandomForestClassifier(n_estimators=n_estimators,
                                          max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf,
                                          random_state=42)
  # обучаем модель
  model.fit(X_train, y_train)
  score = metrics.f1_score(y_train, model.predict(X_train))

  return score

In [45]:
%%time
# cоздаем объект исследования
# можем напрямую указать, что нам необходимо максимизировать метрику direction="maximize"
study = optuna.create_study(study_name="RandomForestClassifier", direction="maximize")
# ищем лучшую комбинацию гиперпараметров n_trials раз
study.optimize(optuna_rf, n_trials=50)

[32m[I 2022-09-19 16:17:08,559][0m A new study created in memory with name: RandomForestClassifier[0m
[32m[I 2022-09-19 16:17:09,751][0m Trial 0 finished with value: 0.8486127864897466 and parameters: {'n_estimators': 100, 'max_depth': 13, 'min_samples_leaf': 22}. Best is trial 0 with value: 0.8486127864897466.[0m
[32m[I 2022-09-19 16:17:10,173][0m Trial 1 finished with value: 0.7249370277078087 and parameters: {'n_estimators': 100, 'max_depth': 1, 'min_samples_leaf': 8}. Best is trial 0 with value: 0.8486127864897466.[0m
[32m[I 2022-09-19 16:17:13,696][0m Trial 2 finished with value: 0.8553230209281165 and parameters: {'n_estimators': 330, 'max_depth': 10, 'min_samples_leaf': 18}. Best is trial 2 with value: 0.8553230209281165.[0m
[32m[I 2022-09-19 16:17:19,387][0m Trial 3 finished with value: 0.8698278465720326 and parameters: {'n_estimators': 460, 'max_depth': 14, 'min_samples_leaf': 16}. Best is trial 3 with value: 0.8698278465720326.[0m
[32m[I 2022-09-19 16:17:21,5

CPU times: user 3min 47s, sys: 4.53 s, total: 3min 51s
Wall time: 4min 9s


In [46]:
# рассчитаем точность для тестовой выборки
model = ensemble.RandomForestClassifier(**study.best_params,random_state=42, )
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("accuracy на тестовом наборе: {:.2f}".format(model.score(X_test, y_test)))
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

accuracy на тестовом наборе: 0.79
f1_score на тестовом наборе: 0.80


Метод Optuna показал сходные результаты, но сделал это быстрее

In [48]:
# Формируем требования
!pip freeze > requirements.txt
!conda env export > environment.yaml