logloss issue... #90
Comments
Hi, @caprone, thanks for opening this issue! Have you tried providing the do_predict_proba argument to your Environment?
HI @HunterMcGushion, thanks for the answer. Unfortunately this doesn't resolve the problem. EDIT: could the issue be that we can't specify the positive class in Environment...?

from hyperparameter_hunter import Environment, Integer, Real, ExtraTreesOptimization
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=100, n_features=10)
hunter_path = '../HyperparameterHunterAssets'

def execute():
    ...

if __name__ == '__main__':
    execute()
Sorry for the delayed response, @caprone! In the provided example, the comparison being made wouldn't work, since HyperparameterHunter automatically performs KFold cross-validation with the given parameters to produce the scores it reports.

I've attempted to make a closer comparison; however, there is a lot that HyperparameterHunter handles behind the scenes, so we can't make a true comparison. That said, I've modified your script and come up with the following to replicate a little bit of HyperparameterHunter's core functionality:

from hyperparameter_hunter import Environment, CrossValidationExperiment
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
CV_PARAMS = dict(n_splits=3, shuffle=True, random_state=32)
MODEL_INIT_PARAMS = dict(n_estimators=10, max_depth=4, random_state=32)
INPUT, TARGET = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=32)
TRAIN_DF = pd.DataFrame(data=INPUT, columns=range(INPUT.shape[1]))
TRAIN_DF["target"] = TARGET
def run_hyperparameter_hunter():
    env = Environment(
        train_dataset=TRAIN_DF.copy(),
        root_results_path="HyperparameterHunterAssets",
        do_predict_proba=False,
        metrics_map=dict(log_loss=lambda t, p: -log_loss(t, p)),
        cross_validation_type="KFold",
        cross_validation_params=CV_PARAMS,
    )
    experiment = CrossValidationExperiment(
        model_initializer=RandomForestClassifier,
        model_init_params=MODEL_INIT_PARAMS
    )
    return experiment
def run_normal(random_seeds):
    #################### Result Placeholders ####################
    oof_predictions = np.zeros_like(TARGET)
    # The probability placeholders need a float dtype; zeros_like(TARGET) would be
    # integer-typed and silently truncate the assigned probabilities
    oof_predictions_proba_0 = np.zeros_like(TARGET, dtype=np.float64)
    oof_predictions_proba_1 = np.zeros_like(TARGET, dtype=np.float64)
    oof_scores = []
    oof_scores_proba_0 = []
    oof_scores_proba_1 = []
    for fold, (train_index, validation_index) in enumerate(KFold(**CV_PARAMS).split(INPUT, TARGET)):
        np.random.seed(random_seeds[fold][0])
        #################### Split Data ####################
        train_input, validation_input = INPUT[train_index], INPUT[validation_index]
        train_target, validation_target = TARGET[train_index], TARGET[validation_index]
        #################### Fit Classifier ####################
        classifier = RandomForestClassifier(
            **dict(MODEL_INIT_PARAMS, **dict(random_state=random_seeds[fold][0]))
        )
        classifier.fit(train_input, train_target)
        #################### Make Predictions ####################
        validation_predictions = classifier.predict(validation_input)
        validation_predictions_proba = classifier.predict_proba(validation_input)
        #################### Calculate Score ####################
        validation_score = -log_loss(validation_target, validation_predictions)
        validation_score_proba_0 = -log_loss(validation_target, validation_predictions_proba[:, 0])
        validation_score_proba_1 = -log_loss(validation_target, validation_predictions_proba[:, 1])
        #################### Collect Results ####################
        oof_scores.append(validation_score)
        oof_scores_proba_0.append(validation_score_proba_0)
        oof_scores_proba_1.append(validation_score_proba_1)
        oof_predictions[validation_index] = validation_predictions
        oof_predictions_proba_0[validation_index] = validation_predictions_proba[:, 0]
        oof_predictions_proba_1[validation_index] = validation_predictions_proba[:, 1]

        print(" - F{}: {} {} {}".format(
            fold, validation_score, validation_score_proba_0, validation_score_proba_1
        ))

    print("FINAL: {} {} {}".format(
        np.average(oof_scores), np.average(oof_scores_proba_0), np.average(oof_scores_proba_1)
    ))
def execute():
    exp = run_hyperparameter_hunter()
    print("#" * 80)
    run_normal(exp.experiment_params["random_seeds"][0])

if __name__ == "__main__":
    execute()

If we run the above script as-is, with do_predict_proba=False, HyperparameterHunter's final log_loss matches the first of the three "normal" scores, i.e. the one calculated on the hard class labels returned by classifier.predict.
If, instead, we slightly modify the script and run it with do_predict_proba=True, HyperparameterHunter's final log_loss matches the second "normal" score, i.e. the one calculated on column 0 of classifier.predict_proba's output.
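Concretely, the modification is assumed here to be nothing more than flipping the do_predict_proba flag in the Environment defined above (a sketch of that second run, not copied from the original thread):

env = Environment(
    train_dataset=TRAIN_DF.copy(),
    root_results_path="HyperparameterHunterAssets",
    do_predict_proba=True,  # changed from False: score on predict_proba output (column 0 by default)
    metrics_map=dict(log_loss=lambda t, p: -log_loss(t, p)),
    cross_validation_type="KFold",
    cross_validation_params=CV_PARAMS,
)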
TL;DR: HyperparameterHunter computes log_loss correctly for the predictions it is given; the question is which predictions those are. With do_predict_proba=False the metric sees hard class labels, and with do_predict_proba=True it sees the probability column at index 0.
I hope this clears things up for you, but please let me know if you have any other questions, and thanks again for opening this issue!
HI @HunterMcGushion! Yes, in the end the problem is that with Environment.do_predict_proba the metric is not evaluated on the positive class's probability column.
Glad to hear it! You are correct. If a model's predictions are not one-dimensional, the default is to use the column at index 0. This takes place in hyperparameter_hunter/models.py, lines 174 to 196 (commit f3a5bd0).
Specifically, I think you'll be interested in line 194. Do you think it would be helpful to be able to specify the column index selected when using do_predict_proba?
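For reference, here is a simplified illustration of that default behavior; this is a paraphrase of the idea, not the actual code in models.py:

import numpy as np

def select_prediction_column(raw_predictions, do_predict_proba=True):
    # Mimics the described default: for 2-D predict_proba output, keep only column 0
    raw_predictions = np.asarray(raw_predictions)
    if do_predict_proba and raw_predictions.ndim > 1:
        return raw_predictions[:, 0]
    return raw_predictions

proba = np.array([[0.8, 0.2], [0.3, 0.7]])
print(select_prediction_column(proba))  # [0.8 0.3] -- the class-0 probabilities

For a binary classifier this keeps P(class 0), which is why log_loss looks wrong when class 1 is the positive class.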
OK @HunterMcGushion, perfect! Yes, I think it would be very helpful to be able to specify the column index for the probability predictions, because algorithms usually return the positive class's probability (class 1 for binary classification) in the second column of the matrix. Thanks again!!
Great point! I'm thinking the easiest way to get this done would be to allow do_predict_proba to accept an int in addition to a bool. If it's a boolean and False, then the model's predict method is used, and if it's True, the current behavior (column 0 of predict_proba) is kept. The new part is that if it's an int, that value is used as the column index into the predict_proba output.
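A rough sketch of how that bool-or-int handling could work (illustrative only; the function and variable names below are placeholders, not the actual implementation):

import numpy as np

def resolve_predictions(model, data, do_predict_proba=False):
    # Illustrative bool-or-int handling of do_predict_proba
    if do_predict_proba is False:
        return model.predict(data)                 # boolean False: plain class predictions
    probabilities = np.asarray(model.predict_proba(data))
    if do_predict_proba is True:
        return probabilities[:, 0]                 # boolean True: current behavior (column 0)
    return probabilities[:, do_predict_proba]      # int: explicit column index

The identity checks (is False / is True) matter because True == 1 in Python, so a plain equality check could not tell "use column 1" apart from "keep the default behavior".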
HI @HunterMcGushion!! My only worry is that an overloaded parameter could be confusing, especially if someone does not read the documentation or the comments; one solution could be to keep model.do_predict_proba boolean-only and create a new integer class attribute for the column index. However, I think your solution is very useful!!
That's a great idea. For now, I'd prefer not to add too many more parameters, but if it looks like others are having problems, we should revisit your idea! Let me know if the problems persist after merging, and thanks again for opening the issue!
HI @HunterMcGushion!
When I use 'log_loss' as the metric in Environment, is there a way to tell the optimizer to predict probabilities? It seems not... (maybe through "model_extra_params"??).
For example, even if I set 'logloss' as the metric in xgb, all of the optimizer's predictions are probably binary values, so the log_loss scores become totally bogus.
Thanks
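To illustrate why hard 0/1 predictions make log_loss misleading, here is a standalone toy sketch (not code from the issue):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0, 1])
probabilities = np.array([0.1, 0.8, 0.45, 0.3, 0.9])  # predicted P(class 1)
hard_predictions = (probabilities > 0.5).astype(int)   # what predict() would return

print(log_loss(y_true, probabilities))     # modest loss: mistakes are penalized in proportion to confidence
print(log_loss(y_true, hard_predictions))  # huge loss: the single wrong prediction counts as 100% confident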