**Overview**  
This notebook describes the process of regression model (XGBRegressor) optimization with hyperopt package and checking for $R^2$ and RMSE values

import numpy as np
import pandas as pd
from rdkit import RDConfig
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier, XGBRegressor
from rdkit import Chem
import datamol as dm
from rdkit.Chem import AllChem

from functools import partial
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import root_mean_squared_error, r2_score
from hyperopt import tpe, Trials, hp, STATUS_OK, fmin

In [3]:
with open('viscosity.pickle', 'rb') as inp:
    target = pickle.load(inp)

In [5]:
with open('X_standard_dropped.pickle', 'rb') as inp:
    X = pickle.load(inp)

Viscosity values are given in cP, so now let's convert them into Pa*s and make a logarithm of them to reduce the range of values

In [11]:
target = target / 1000
target = np.log(target)

Here we just repeat the process of optimization with hyperopt package: get objective function and function for best model retrieval, and minimize the objective function with fmin function from hyperopt

In [19]:
def objective(params, pipe, X, target):
    score = cross_val_score(pipe, X, target, scoring = 'neg_root_mean_squared_error')
    return {'loss':-score.mean(), 'status':STATUS_OK, 'Trained_model':pipe}
    

In [13]:
trials = Trials()

In [14]:
space = {'xgbregressorr__n_estimators':hp.randint('n_estimators', 1, 30),
         'xgbregressor__learning_rate': hp.loguniform('learning_rate', low = -4*np.log(10), high  = 2*np.log(10)),
         'xgbcregressor__reg_alpha':hp.loguniform(label = 'reg_alpha', low = -4*np.log(10), high  = 2*np.log(10)),
         'xgbregressor__grow_policy':hp.choice('grow_policy', ['depthwise', 'logwise']),
        'xgbregressor__max_depth':hp.randint('max_depth', 1, 100)}
         

In [24]:
def get_best_model_from_trials(trials):
    valid_trial_list = [trial for trial in trials if trial['result']['status'] == STATUS_OK]
    losses = [ float(trial['result']['loss']) for trial in valid_trial_list]
    index_having_minumum_loss = np.argmin(losses)
    best_trial_obj = valid_trial_list[index_having_minumum_loss]
    return best_trial_obj['result']['Trained_model']

In [20]:
pipeline = make_pipeline(MinMaxScaler(), XGBRegressor())
best = fmin(partial(objective, pipe = pipeline, X = X, target = target), max_evals=50, algo = tpe.suggest, trials = trials, space=space)

100%|██████████| 50/50 [04:06<00:00,  4.93s/trial, best loss: 3.8581691949666075]


Here are the optimal parameters

In [23]:
print(best)

{'grow_policy': 0, 'learning_rate': 1.1849939625930745, 'max_depth': 10, 'n_estimators': 26, 'reg_alpha': 0.0022376683053165628}


And trained model

In [25]:
trained_model = get_best_model_from_trials(trials)

**Check for **$R^{2}$****

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, target)
trained_model.fit(X_train, y_train)
r2 = r2_score(y_test, trained_model.predict(X_test))
rmse = root_mean_squared_error(y_test, trained_model.predict(X_test))
print('R2 value is {}'.format(r2))
print('RMSE value is {}'.format(rmse))


R2 value is 0.5908799708671384
RMSE value is 2.474471490980701


**Create final calculator**

In [29]:
trained_model = trained_model.fit(X, target)