Необходимо предсказать биологический ответ молекул (столбец 'Activity') по их химическому составу (столбцы D1-D1776).

Данные представлены в формате CSV.  Каждая строка представляет молекулу. 

* Первый столбец Activity содержит экспериментальные данные, описывающие фактический биологический ответ [0, 1]; 
* Остальные столбцы D1-D1776 представляют собой молекулярные дескрипторы — это вычисляемые свойства, которые могут фиксировать некоторые характеристики молекулы, например размер, форму или состав элементов.

Предварительная обработка не требуется, данные уже закодированы и нормализованы.

В качестве метрики будем использовать F1-score.

Необходимо обучить две модели: логистическую регрессию и случайный лес. Далее нужно сделать подбор гиперпараметров с помощью базовых и продвинутых методов оптимизации. Важно использовать все четыре метода (GridSeachCV, RandomizedSearchCV, Hyperopt, Optuna) хотя бы по разу, максимальное количество итераций не должно превышать 50.

Для начала подключим необходимые для выполнения задачи библиотеки и подготовим исходные данные для обучения моделей.

In [16]:
#импорт библиотек
import numpy as np #для матричных вычислений
import pandas as pd #для анализа и предобработки данных

from sklearn import linear_model #линейные моделиё
from sklearn import ensemble #ансамбли
from sklearn import metrics #метрики
from sklearn.model_selection import train_test_split #сплитование выборки
from sklearn.model_selection import GridSearchCV # Поиск по сетке
from sklearn.model_selection import RandomizedSearchCV # Рандомизированный поиск
from sklearn.model_selection import cross_val_score # Для оценки кросс-валидации
from sklearn import model_selection
import hyperopt # Первая библиотека для продвинутой оптимизации гиперпараметров
from hyperopt import hp, fmin, tpe, Trials
# fmin - основная функция, она будет минимизировать наш функционал
# tpe - алгоритм оптимизации
# hp - включает набор методов для объявления пространства поиска гиперпараметров
# trails - используется для логирования результатов
import optuna # Вторая библиотека для продвинутой оптимизации гиперпараметров
import plotly.express as px # для построения графиков

In [17]:
data = pd.read_csv('data/practice_data.csv')
data.head()

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.0,0.497009,0.1,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.0333,0.480124,0.0,0.0,0.209791,0.61035,0.356453,0.51772,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.0,0.538825,0.0,0.5,0.196344,0.72423,0.235606,0.288764,0.80511,...,0,0,0,0,0,0,0,0,0,0
4,0,0.1,0.517794,0.0,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0


Создаём матрицу Х и вектор ответов у

In [18]:
X = data.drop('Activity', axis=1)
y = data['Activity']

Проверим сбалансированность признака

In [19]:
y.value_counts(True)

1    0.542255
0    0.457745
Name: Activity, dtype: float64

Разделим данные на тренировочную и тестовую выборки в соотношении 70 на 30, со стратификацией

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y)

display(X_train.shape, X_test.shape)

(2625, 1776)

(1126, 1776)

Обозначим и обучим модель со стандартными гиперпараметрами

# 1. Логистическая регрессия

In [21]:
#Создаем объект класса логистическая регрессия
log_reg = linear_model.LogisticRegression(random_state=42, max_iter = 50)
#Обучаем модель, минимизируя logloss
log_reg.fit(X_train, y_train)
y_test_pred = log_reg.predict(X_test)
f1_lr = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_lr))

f1_score на тестовом наборе: 0.77


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 1.1 GridSearchCV

In [22]:
skf = model_selection.StratifiedKFold(n_splits=5) # Для верификации используем Statified KFold

param_grid = [
              {'penalty': ['l2', 'none'] , # тип регуляризации
              'solver': ['lbfgs', 'sag'], # алгоритм оптимизации
               'C': [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1]}, # уровень силы регуляризации
]
grid_search = GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_grid=param_grid, 
    cv=skf, 
    n_jobs = -1,
    scoring='f1'
)  
 
# %time - замеряет время выполнения
%time grid_search.fit(X_train, y_train) 
y_test_pred = grid_search.predict(X_test)
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(grid_search.best_params_))

CPU times: total: 1.08 s
Wall time: 1min 2s
f1_score на тестовом наборе: 0.78
Наилучшие значения гиперпараметров: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [23]:
print("Наилучшее значение точности при кросс-валидации: {:.2f}".format(grid_search.best_score_))

Наилучшее значение точности при кросс-валидации: 0.78


Обучаем модель с оптимизированными параметрами и находим искомую метрику F1-score:

In [24]:
log_reg_gscv = linear_model.LogisticRegression(
    random_state=42, 
    max_iter = 50, 
    C=grid_search.best_params_['C'], 
    penalty=grid_search.best_params_['penalty'], 
    solver=grid_search.best_params_['solver'])
#Обучаем модель, минимизируя logloss
log_reg_gscv.fit(X_train, y_train)
y_test_pred = log_reg_gscv.predict(X_test)
f1_lr_gscv = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_lr_gscv))

f1_score на тестовом наборе: 0.78


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 1.2 RandomizedSearchCV

In [25]:
from sklearn.model_selection import RandomizedSearchCV
#np.linspace(start(от), stop(до), num=50(количество),dtype-тип данных)
param_distributions = {
    'penalty': ['l2', 'none'] ,
    'solver': ['lbfgs', 'sag'],
    'C': [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1]}
            
random_search = RandomizedSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=50), 
    param_distributions=param_distributions, 
    cv=skf, 
    n_iter = 40, 
    n_jobs = -1,
    scoring='f1'
)  
 
%time random_search.fit(X_train, y_train) 
y_test_pred = random_search.predict(X_test)
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(random_search.best_params_))



CPU times: total: 1.09 s
Wall time: 1min 4s
f1_score на тестовом наборе: 0.78
Наилучшие значения гиперпараметров: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Обучаем модель с оптимизированными параметрами и находим искомую метрику F1-score:

In [26]:
log_reg_rscv = linear_model.LogisticRegression(
    random_state=42, 
    max_iter = 1000, 
    C=random_search.best_params_['C'], 
    penalty=random_search.best_params_['penalty'], 
    solver=random_search.best_params_['solver'])

log_reg_rscv.fit(X_train, y_train)

y_test_pred = log_reg_rscv.predict(X_test)
f1_lr_rscv = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_lr_rscv))

f1_score на тестовом наборе: 0.78


### 1.3 Hyperopt

In [30]:
# зададим пространство поиска гиперпараметров (для логической регрессии)
space={'penalty': hp.choice(label='penalty', options=['l2', 'none']),
       'solver' : hp.choice(label='solver', options=['lbfgs', 'sag']),
       'C': hp.uniform('C', 0.01, 1)
      }

In [31]:
random_state = 42

def hyperopt_rf(space, cv=skf, X=X_train, y=y_train, random_state=random_state):
    # используем эту комбинацию для построения модели
    model = linear_model.LogisticRegression(**space, random_state=random_state, max_iter=50)

    # обучаем модель
    model.fit(X, y)
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()
    
    # метрику необходимо минимизировать, поэтому ставим знак минус
    return -score

In [32]:
%time
#начинаем подбор гиперпараметров
trials = Trials() # используется для логирования результатов

best=fmin(hyperopt_rf, # наша функция 
          space=space, # пространство гиперпараметров
          algo=tpe.suggest, # алгоритм оптимизации, установлен по умолчанию, задавать необязательно
          max_evals=20, # максимальное количество итераций
          trials=trials, # логирование результатов
          rstate=np.random.default_rng(random_state)# фиксируем для повторяемости результата
         )
print("Наилучшие значения гиперпараметров {}".format(best))

CPU times: total: 0 ns
Wall time: 0 ns
  0%|          | 0/20 [00:00<?, ?trial/s, best loss=?]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



  5%|▌         | 1/20 [00:17<05:33, 17.57s/trial, best loss: -0.7508682018126108]




 10%|█         | 2/20 [00:25<03:35, 11.95s/trial, best loss: -0.7756236039506652]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 15%|█▌        | 3/20 [00:28<02:10,  7.70s/trial, best loss: -0.7807217972804994]




 20%|██        | 4/20 [00:35<02:02,  7.63s/trial, best loss: -0.7819351679351612]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 25%|██▌       | 5/20 [00:38<01:26,  5.74s/trial, best loss: -0.7819351679351612]





 30%|███       | 6/20 [00:45<01:25,  6.14s/trial, best loss: -0.7819351679351612]





 35%|███▌      | 7/20 [00:55<01:38,  7.61s/trial, best loss: -0.7819351679351612]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 40%|████      | 8/20 [00:58<01:13,  6.15s/trial, best loss: -0.7819351679351612]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 45%|████▌     | 9/20 [01:01<00:55,  5.07s/trial, best loss: -0.7819351679351612]




 50%|█████     | 10/20 [01:11<01:07,  6.76s/trial, best loss: -0.7819351679351612]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 55%|█████▌    | 11/20 [01:14<00:49,  5.45s/trial, best loss: -0.7819351679351612]





 60%|██████    | 12/20 [01:21<00:48,  6.05s/trial, best loss: -0.7819351679351612]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 65%|██████▌   | 13/20 [01:24<00:34,  4.93s/trial, best loss: -0.7819351679351612]




 70%|███████   | 14/20 [01:30<00:32,  5.42s/trial, best loss: -0.7819351679351612]




 75%|███████▌  | 15/20 [01:37<00:29,  5.83s/trial, best loss: -0.7819351679351612]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 80%|████████  | 16/20 [01:39<00:19,  4.76s/trial, best loss: -0.7819351679351612]





 85%|████████▌ | 17/20 [01:46<00:16,  5.35s/trial, best loss: -0.7819351679351612]




 90%|█████████ | 18/20 [01:53<00:11,  5.81s/trial, best loss: -0.7819351679351612]




 95%|█████████▌| 19/20 [02:00<00:06,  6.12s/trial, best loss: -0.7819351679351612]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



100%|██████████| 20/20 [02:02<00:00,  6.13s/trial, best loss: -0.7819351679351612]
Наилучшие значения гиперпараметров {'C': 0.2200949458234019, 'penalty': 0, 'solver': 1}


In [42]:
best_solver = None
best_penalty = None

if best['penalty'] == 0:
    best_penalty = 'l2' 
else:
    best_penalty = 'none'
if best['solver'] == 0:
    best_solver = 'lbfgs'
else: 
    best_solver = 'sag'

In [45]:
# рассчитаем точность для тестовой выборки
log_reg_hopt = linear_model.LogisticRegression(
    random_state=random_state, 
    penalty=best_penalty,
    solver=best_solver,
    C=float(best['C'])
)
log_reg_hopt.fit(X_train, y_train)
y_test_pred = log_reg_hopt.predict(X_test)
f1_lr_hopt = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_lr_hopt))

f1_score на тестовом наборе: 0.78




### 1.4 Optuna

In [46]:
def optuna_rf(trial):
  # задаем пространства поиска гиперпараметров
  solver = trial.suggest_categorical('solver', ['lbfgs', 'sag'])
  penalty = trial.suggest_categorical('penalty', ['l2', 'none'])
  C = trial.suggest_uniform('C', 0.01, 1)

  # создаем модель
  model = linear_model.LogisticRegression(
    solver=solver,
    penalty=penalty, 
    C=C,
    random_state=random_state,
    max_iter=50)
  # обучаем модель
  score = cross_val_score(model, X_train, y_train, cv=skf, scoring='f1', n_jobs=-1).mean()

  return score

In [48]:
%time
# cоздаем объект исследования
study = optuna.create_study(study_name="LogisticRegression", direction="maximize")
# ищем лучшую комбинацию гиперпараметров n_trials раз
study.optimize(optuna_rf, n_trials=20)

# выводим результаты на обучающей выборке
print("Наилучшие значения гиперпараметров {}".format(study.best_params))

[32m[I 2022-11-21 13:04:48,724][0m A new study created in memory with name: LogisticRegression[0m


CPU times: total: 0 ns
Wall time: 0 ns


  C = trial.suggest_uniform('C', 0.01, 1)
[32m[I 2022-11-21 13:04:50,579][0m Trial 0 finished with value: 0.7748182407109502 and parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.011424692929649156}. Best is trial 0 with value: 0.7748182407109502.[0m
  C = trial.suggest_uniform('C', 0.01, 1)
[32m[I 2022-11-21 13:04:54,214][0m Trial 1 finished with value: 0.7683445930970254 and parameters: {'solver': 'sag', 'penalty': 'none', 'C': 0.05953236913640745}. Best is trial 0 with value: 0.7748182407109502.[0m
  C = trial.suggest_uniform('C', 0.01, 1)
[32m[I 2022-11-21 13:04:56,264][0m Trial 2 finished with value: 0.7778748017732046 and parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.26884992422640763}. Best is trial 2 with value: 0.7778748017732046.[0m
  C = trial.suggest_uniform('C', 0.01, 1)
[32m[I 2022-11-21 13:04:59,913][0m Trial 3 finished with value: 0.7750531218684499 and parameters: {'solver': 'sag', 'penalty': 'l2', 'C': 0.417181009881459}. Best is trial 2 wit

Наилучшие значения гиперпараметров {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.15041934608452112}


In [50]:
# рассчитаем точность для тестовой выборки
log_reg_opt = linear_model.LogisticRegression(**study.best_params,random_state=random_state, max_iter=50)
log_reg_opt.fit(X_train, y_train)
y_test_pred = log_reg_opt.predict(X_test)
f1_lr_opt = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_lr_opt))

f1_score на тестовом наборе: 0.78


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## 2. Случайный лес

In [55]:
#Создаем объект класса случайный лес
rf = ensemble.RandomForestClassifier(random_state=42)

#Обучаем модель
rf.fit(X_train, y_train)
y_test_pred = rf.predict(X_test)
f1_rf = metrics.f1_score(y_test, y_test_pred)
print('Test: {:.2f}'.format(f1_rf))

Test: 0.81


### 2.1 GridSearchCV

In [53]:
param_grid = {'n_estimators': list(range(80, 200, 30)),
              'min_samples_leaf': [5, 7],
              'max_depth': list(np.linspace(20, 40, 5, dtype=int))
              }
            
grid_search_forest = GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42), 
    param_grid=param_grid, 
    cv=skf, 
    n_jobs = -1
)  

%time grid_search_forest.fit(X_train, y_train) 

y_test_pred = grid_search_forest.predict(X_test)
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(grid_search_forest.best_params_))

CPU times: total: 2.38 s
Wall time: 1min 18s
f1_score на тестовом наборе: 0.82
Наилучшие значения гиперпараметров: {'max_depth': 20, 'min_samples_leaf': 5, 'n_estimators': 140}


In [54]:
rf_gscv = ensemble.RandomForestClassifier(
    random_state=42,
    max_depth=grid_search_forest.best_params_['max_depth'],
    min_samples_leaf=grid_search_forest.best_params_['min_samples_leaf'],
    n_estimators=grid_search_forest.best_params_['n_estimators']
    )

rf_gscv.fit(X_train, y_train)

y_test_pred = rf_gscv.predict(X_test)
f1_rf_gscv = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_rf_gscv))

f1_score на тестовом наборе: 0.82


### 2.2 RandomizedSearchCV

In [56]:
param_distributions = {
    'n_estimators': list(range(80, 200, 30)),
    'min_samples_leaf': [5, 7],
    'max_depth': list(np.linspace(20, 40, 5, dtype=int))
    }
            
random_search_forest = RandomizedSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42), 
    param_distributions=param_distributions, 
    cv=skf,
    n_iter = 20, 
    n_jobs = -1
)  

%time random_search_forest.fit(X_train, y_train) 

y_test_pred = random_search_forest.predict(X_test)
print('f1_score на тестовом наборе: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))
print("Наилучшие значения гиперпараметров: {}".format(random_search_forest.best_params_))

CPU times: total: 1.98 s
Wall time: 45.3 s
f1_score на тестовом наборе: 0.80
Наилучшие значения гиперпараметров: {'n_estimators': 140, 'min_samples_leaf': 5, 'max_depth': 30}


In [67]:
rf_rscv = ensemble.RandomForestClassifier(
    random_state=42,
    max_depth=random_search_forest.best_params_['max_depth'],
    min_samples_leaf=random_search_forest.best_params_['min_samples_leaf'],
    n_estimators=random_search_forest.best_params_['n_estimators']
    )

rf_rscv.fit(X_train, y_train)

y_test_pred = rf_gscv.predict(X_test)
f1_rf_rscv = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_rf_gscv))

f1_score на тестовом наборе: 0.82


### 2.3 Hyperopt

In [58]:
# зададим пространство поиска гиперпараметров (для случайного леса)
space={'n_estimators': hp.quniform('n_estimators', 100, 200, 1),
       'max_depth' : hp.quniform('max_depth', 15, 40, 1),
       'min_samples_leaf': hp.quniform('min_samples_leaf', 2, 10, 1)
      }

In [59]:
def hyperopt_rf(params, cv=skf, X=X_train, y=y_train, random_state=random_state):
    # функция получает комбинацию гиперпараметров в "params"
    params = {'n_estimators': int(params['n_estimators']), 
              'max_depth': int(params['max_depth']), 
             'min_samples_leaf': int(params['min_samples_leaf'])
              }
  
    # используем эту комбинацию для построения модели
    model = ensemble.RandomForestClassifier(**params, random_state=random_state)

    # обучаем модель
    model.fit(X, y)
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()
    
    # метрику необходимо минимизировать, поэтому ставим знак минус
    return -score

In [60]:
%%time
#начинаем подбор гиперпараметров
trials = Trials() # используется для логирования результатов

best=fmin(hyperopt_rf, # наша функция 
          space=space, # пространство гиперпараметров
          algo=tpe.suggest, # алгоритм оптимизации, установлен по умолчанию, задавать необязательно
          max_evals=50, # максимальное количество итераций
          trials=trials, # логирование результатов
          rstate=np.random.default_rng(random_state)# фиксируем для повторяемости результата
         )
print("Наилучшие значения гиперпараметров {}".format(best))

100%|██████████| 50/50 [05:03<00:00,  6.07s/trial, best loss: -0.8168885711519562]
Наилучшие значения гиперпараметров {'max_depth': 32.0, 'min_samples_leaf': 2.0, 'n_estimators': 194.0}
CPU times: total: 1min 33s
Wall time: 5min 3s


In [61]:
# рассчитаем точность для тестовой выборки
rf_hopt = ensemble.RandomForestClassifier(
    random_state=random_state, 
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf'])
)
rf_hopt.fit(X_train, y_train)

y_train_pred = rf_hopt.predict(X_train)

y_test_pred = rf_hopt.predict(X_test)
f1_rf_hopt = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_rf_hopt))

f1_score на тестовом наборе: 0.82


### 2.4 Optuna

In [62]:
def optuna_rf(trial):
  # задаем пространства поиска гиперпараметров
  n_estimators = trial.suggest_int('n_estimators', 100, 200, 1)
  max_depth = trial.suggest_int('max_depth', 15, 40, 1)
  min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 10, 1)

  # создаем модель
  model = ensemble.RandomForestClassifier(
      n_estimators=n_estimators,
      max_depth=max_depth,
      min_samples_leaf=min_samples_leaf,
      random_state=random_state)
  # обучаем модель
  score = cross_val_score(model, X_train, y_train, cv=skf, scoring='f1', n_jobs=-1).mean()

  return score

In [64]:
%time
# cоздаем объект исследования
study = optuna.create_study(study_name="RandomForestClassifier", direction="maximize")
# ищем лучшую комбинацию гиперпараметров n_trials раз
study.optimize(optuna_rf, n_trials=20)

print("Наилучшие значения гиперпараметров {}".format(study.best_params))

[32m[I 2022-11-21 13:46:18,708][0m A new study created in memory with name: RandomForestClassifier[0m


CPU times: total: 0 ns
Wall time: 0 ns


[32m[I 2022-11-21 13:46:21,972][0m Trial 0 finished with value: 0.7885955346162629 and parameters: {'n_estimators': 199, 'max_depth': 27, 'min_samples_leaf': 9}. Best is trial 0 with value: 0.7885955346162629.[0m
[32m[I 2022-11-21 13:46:24,300][0m Trial 1 finished with value: 0.8027858109658268 and parameters: {'n_estimators': 108, 'max_depth': 35, 'min_samples_leaf': 5}. Best is trial 1 with value: 0.8027858109658268.[0m
[32m[I 2022-11-21 13:46:29,449][0m Trial 2 finished with value: 0.7938016037488695 and parameters: {'n_estimators': 200, 'max_depth': 33, 'min_samples_leaf': 7}. Best is trial 1 with value: 0.8027858109658268.[0m
[32m[I 2022-11-21 13:46:33,718][0m Trial 3 finished with value: 0.7945990925275014 and parameters: {'n_estimators': 184, 'max_depth': 37, 'min_samples_leaf': 7}. Best is trial 1 with value: 0.8027858109658268.[0m
[32m[I 2022-11-21 13:46:37,657][0m Trial 4 finished with value: 0.7920731691596292 and parameters: {'n_estimators': 160, 'max_depth': 

Наилучшие значения гиперпараметров {'n_estimators': 139, 'max_depth': 20, 'min_samples_leaf': 2}


In [65]:
# рассчитаем точность для тестовой выборки
rf_opt = ensemble.RandomForestClassifier(**study.best_params,random_state=random_state)

rf_opt.fit(X_train, y_train)

y_test_pred = rf_opt.predict(X_test)
f1_rf_opt = metrics.f1_score(y_test, y_test_pred)
print('f1_score на тестовом наборе: {:.2f}'.format(f1_rf_opt))

f1_score на тестовом наборе: 0.82


________________

In [82]:
df = pd.DataFrame(data = [
        ['LR (Default)', f1_lr],
        ['LR (GridSearchCV)', f1_lr_gscv],
        ['LR (RandomSearchCV)', f1_lr_rscv],
        ['LR (Hyperopt)', f1_lr_hopt],
        ['LR (Optuna)', f1_lr_opt],
        ['RF (Default)', f1_rf],
        ['RF (GridSearchCV)', f1_rf_gscv],
        ['RF (RandomSearchCV)', f1_rf_rscv],
        ['RF (Hyperopt)', f1_rf_hopt],
        ['RF (Optuna)', f1_rf_opt]
    ],
    columns = ['Model', 'F1-score']   
)

In [86]:
df = df.sort_values(by='F1-score')

In [88]:
#строим график
fig = px.bar(
    data_frame=df, #датафрейм
    x="Model", #ось x
    y="F1-score", #ось y
    color='Model', #расцветка в зависимости от страны
    text = 'F1-score', #текст на столбцах
    orientation='v', #ориентация графика
    height=500, #высота
    width=1000, #ширина
    title='F1-scores of models, according to different ways of searching an optimal hyperparameters' #заголовок
)

#отображаем его
fig.show()

Наилучший результат показала модель случайного леса с оптимальными гиперпараметрами, найденными при помощи библиотеки Optuna.