
No score variance over 50 iterations despite multiple parameters switched #23

Closed
suciokhan opened this issue Mar 19, 2021 · 5 comments

@suciokhan

suciokhan commented Mar 19, 2021

I attempted to tune an XGBoost binary classifier on tf-idf data, adjusting n_estimators, max_depth, and learning_rate, but there was zero variation in the score across 50 iterations. When I manually tweak parameters and run a single training instance, the score does change. Note: I have also tried this with the default optimizer for 20 iterations and with different ranges for the parameter tuning, and it gave me the same result: the score is always 0.6590446358653093.

SYSTEM DETAILS:
Amazon SageMaker
Hyperactive ver: 3.0.5.1
Python ver: 3.6.13

Here is my code:

freq_df, y_labels = jc.prep_train_data('raw_data.pkl', remove_stopwords=False)

def model(opt):
    clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
                            #eta=0.4,
                            #max_depth=8,
                            subsample=0.5,
                            base_score=np.mean(y_labels),
                            eval_metric = 'logloss',
                            missing=None,
                            use_label_encoder=False,
                            seed=42)
    
    scores = cross_val_score(clf_xgb, freq_df, y_labels, cv=5) # default is 5, hyperactive example is 3

    return scores.mean()

# Configure the range of hyperparameters we want to test out
search_space = {
    "n_estimators": list(range(500, 5000, 100)),
    "max_depth": list(range(6, 12)),
    "learning_rate": [0.1, 0.3, 0.4, 0.5, 0.7],
}

# Configure the optimizer
optimizer = SimulatedAnnealingOptimizer(
    epsilon=0.1,
    distribution="laplace",
    n_neighbours=4,
    rand_rest_p=0.1,
    p_accept=0.15,
    norm_factor="adaptive",
    annealing_rate=0.999,
    start_temp=0.8)

# Execute optimization
hyper = Hyperactive()
hyper.add_search(model, search_space, n_iter=50, optimizer=optimizer)
hyper.run()

# Print-out the results and save them to a dataframe
results_filename = "xgboost_hyperactive_results.csv"

search_data = hyper.results(model)
search_data.to_csv(results_filename, index=0)
@suciokhan added the bug label on Mar 19, 2021
@SimonBlanke
Owner

SimonBlanke commented Mar 19, 2021

Could you provide the entire script? How do I get "jc"? I would like to reproduce this bug. Could you also provide a random_state that shows the bug?

@SimonBlanke
Owner

I already have the suspicion that the SimulatedAnnealingOptimizer "sticks" to the edge of the search space. This can happen for those kinds of local optimizers when n_iter is very small.
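
One way to check that suspicion is to look at the collected search data itself: Hyperactive records the sampled parameters for every iteration, so if the optimizer is stuck at an edge of the search space, the parameter columns barely change. A rough sketch, assuming the results DataFrame uses the search-space keys and a "score" column as in the script above:

# run this after hyper.run() from the script above
search_data = hyper.results(model)

# if the optimizer is stuck, these columns contain very few distinct values
print(search_data[["n_estimators", "max_depth", "learning_rate"]].nunique())
print(search_data["score"].describe())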

@SimonBlanke added the question label and removed the bug label on Mar 19, 2021
@SimonBlanke self-assigned this on Mar 19, 2021
@suciokhan
Author

suciokhan commented Mar 19, 2021

jc is a script of helper functions I have for taking a dictionary of social media posts, their authors, dates, and recipients, pulling out the post texts, cleaning them, tokenizing them, converting them into tf-idf, and generating labels for them. I also get the exact same score when I do not use SimulatedAnnealing and instead use the default random optimizer. freq_df is a dataframe of the tf-idf values for each token in the corpus, where each row is a separate document and the columns are the tokens.
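
For anyone trying to reproduce this without the jc helper, a rough stand-in that builds a comparable tf-idf dataframe from a public corpus could look like the following (the 20 newsgroups categories and max_features are arbitrary placeholders, not the original data; get_feature_names_out assumes a recent scikit-learn, older versions used get_feature_names):

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in for jc.prep_train_data: two categories give binary labels
raw = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

vectorizer = TfidfVectorizer(max_features=2000)
tfidf = vectorizer.fit_transform(raw.data)

# one row per document, one column per token, as described above
freq_df = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
y_labels = np.array(raw.target)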

@suciokhan
Author

Apologies, but I'm not sure what you mean by providing a random_state; this is admittedly my first rodeo :)
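
For reference, a random_state is just a fixed seed that makes the optimization run reproducible, so the same behaviour can be replayed exactly. Assuming add_search accepts a random_state keyword in this Hyperactive version (an assumption, not confirmed in this thread), it would be passed like this:

hyper = Hyperactive()
hyper.add_search(
    model,
    search_space,
    n_iter=50,
    optimizer=optimizer,
    random_state=42,  # any fixed integer; report this value together with the bug
)
hyper.run()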

@SimonBlanke
Owner

I just found the "error". You define the parameters in the search space, but you never use any of them in the objective function, so the model does not change during the optimization run.

Your objective function should look like this:

def model(opt):
    clf_xgb = xgb.XGBClassifier(
        n_estimators=opt["n_estimators"],
        max_depth=opt["max_depth"],
        learning_rate=opt["learning_rate"],
        objective="binary:logistic",
        # eta=0.4,
        # max_depth=8,
        subsample=0.5,
        base_score=np.mean(y_labels),
        eval_metric="logloss",
        missing=None,
        use_label_encoder=False,
        seed=42,
    )

    scores = cross_val_score(
        clf_xgb, freq_df, y_labels, cv=5
    )  # default is 5, hyperactive example is 3

    return scores.mean()

# Configure the range of hyperparameters we want to test out
search_space = {
    "n_estimators": list(range(500, 5000, 100)),
    "max_depth": list(range(6, 12)),
    "learning_rate": [0.1, 0.3, 0.4, 0.5, 0.7],
}
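
With the parameters actually passed into XGBClassifier, the scores in the search data should now vary between iterations. A quick way to confirm, assuming the results dataframe exposes a "score" column as used in your script above:

hyper = Hyperactive()
hyper.add_search(model, search_space, n_iter=50, optimizer=optimizer)
hyper.run()

search_data = hyper.results(model)
print(search_data["score"].nunique())   # should be greater than 1 now
print(search_data["score"].describe())  # std should be non-zero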

It is funny how I missed this in my answers above. If you are convinced something is wrong, it is sometimes hard to see the obvious.

I will close this issue now, but if you have further questions about this you can ask them here.
