Bug: Cannot load training states from checkpoints to resume training #374

Closed
alarivarmann opened this issue Jul 31, 2020 · 8 comments
Labels: bug (Something isn't working), invalid (This doesn't seem right)

Comments

@alarivarmann

Describe the bug
When training AutoML on ResumablePipeline, logs are full of messages like these:
UserWarning: Cannot Load Step /models/resumable_pipeline/AutoML/ResumablePipeline (ResumablePipeline:ResumablePipeline) With Step Saver JoblibStepSaver.
saver.__class__.__name__))

To Reproduce

# Imports assumed for this snippet (Neuraxle ~0.5.x module paths).
# epochs, test_set_ratio, resumable_pipeline_folder, train and y_train
# are defined elsewhere in the full script.
import time

from sklearn.decomposition import FastICA, PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

from neuraxle.hyperparams.distributions import Choice, RandInt, Uniform
from neuraxle.hyperparams.space import HyperparameterSpace
from neuraxle.metaopt.auto_ml import AutoML, RandomSearchHyperparameterSelectionStrategy, ValidationSplitter
from neuraxle.metaopt.callbacks import MetricCallback, ScoringCallback
from neuraxle.pipeline import ResumablePipeline
from neuraxle.steps.sklearn import RidgeModelStacking
from neuraxle.union import AddFeatures

HP_real = HyperparameterSpace({
    "learning_rate": Uniform(1e-5, 1),
    "max_depth": RandInt(2, 4),
    "n_estimators": Choice([30, 60, 90, 100, 130])
})

pipeline_sk = ResumablePipeline([
    # A Pipeline is composed of multiple chained steps. Steps
    # can alter the data before passing it to the next steps.
    AddFeatures([
        PCA(n_components=2),
        FastICA(n_components=2),
    ]),
    RidgeModelStacking([
        RandomForestRegressor(),
        GradientBoostingRegressor(warm_start=False, min_samples_leaf=2, random_state=42)  # validation_fraction=0.2
    ])
], cache_folder=resumable_pipeline_folder).set_hyperparams_space(HP_real)

time_a = time.time()
auto_ml = AutoML(
    pipeline_sk,
    # AutoMLContainer = AutoMLContainer(main_scoring_metric_name="mse"),
    refit_trial=True,
    n_trials=int(epochs),
    cache_folder_when_no_handle=resumable_pipeline_folder,
    validation_splitter=ValidationSplitter(test_set_ratio),
    hyperparams_optimizer=RandomSearchHyperparameterSelectionStrategy(),
    scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),
    callbacks=[
        MetricCallback('mse', metric_function=mean_squared_error, higher_score_is_better=False)
    ]
)
auto_ml = auto_ml.fit(train, y_train)  # with a custom label encoder, fit takes in the whole data at once

Expected behavior
I expect no warnings like these. The goal is to be able to warm-start training from old checkpoints.

@alexbrillant
Contributor

Hello @alarivarmann, you are right about the warning; I am fixing it right now. However, if you want to use checkpoints, you need to add one inside your pipeline. I am trying your code out. Thanks for creating this issue.

@alexbrillant
Contributor

I have fixed the warnings here: #378

@alexbrillant
Contributor

alexbrillant commented Aug 8, 2020

@alarivarmann you would need to add a checkpoint at the end of your pipeline for it to be truly resumable.
Check out the documentation that we have here for "checkpoints": https://www.neuraxle.org/stable/examples/caching/plot_auto_ml_checkpoint.html
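
For illustration, here is a minimal sketch of what is described above, reusing the imports and resumable_pipeline_folder from the snippet in the issue, and assuming the DefaultCheckpoint step from neuraxle.checkpoints as shown in the linked caching example:

from neuraxle.checkpoints import DefaultCheckpoint
from neuraxle.pipeline import ResumablePipeline

# Each DefaultCheckpoint persists the data flowing through the pipeline at
# that point, so a rerun can resume from the latest saved checkpoint instead
# of recomputing everything from scratch.
pipeline = ResumablePipeline([
    AddFeatures([
        PCA(n_components=2),
        FastICA(n_components=2),
    ]),
    DefaultCheckpoint(),  # checkpoint after feature engineering
    RidgeModelStacking([
        RandomForestRegressor(),
    ]),
    DefaultCheckpoint(),  # checkpoint at the end, as suggested above
], cache_folder=resumable_pipeline_folder)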

However, do you really need this? It seems like you are trying to run an AutoML loop, not to use any checkpoints. I think you should use Pipeline instead of ResumablePipeline.

You could use the Trainer alone to do this, without the AutoML loop. I think you simply want to save and load the pipeline you have trained. Am I right? If so, please take a look at the Step Saving documentation (and the sketch below): https://www.neuraxle.org/stable/step_saving_and_lifecycle.html
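
For readers who just want to save and load a trained pipeline without the AutoML loop, here is a rough sketch assuming the ExecutionContext full-dump flow described on the linked Step Saving page; exact signatures may differ between Neuraxle versions:

import numpy as np
from neuraxle.base import ExecutionContext, Identity
from neuraxle.pipeline import Pipeline

saving_folder = '/tmp/my_pipeline_cache'  # hypothetical cache folder

# Fit a (trivial) pipeline, then persist the whole trained pipeline at once
# as a "full dump", using the pipeline's step savers (e.g. JoblibStepSaver).
pipeline = Pipeline([Identity()])
pipeline = pipeline.fit(np.array([[1.0], [2.0]]), np.array([1.0, 2.0]))
pipeline.save(ExecutionContext(saving_folder), full_dump=True)

# Later, possibly in another process: reload the trained pipeline by name
# and keep using it.
loaded_pipeline = ExecutionContext(saving_folder).load('Pipeline')
outputs = loaded_pipeline.transform(np.array([[3.0]]))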

I don't think you are trying to search hyperparams here. We need to add more examples for this...

This feature is still experimental. It will be fully ready/functional after this PR: #377

@alarivarmann
Author

Hi Alex,

Thanks for your response.

This code snippet was just an example. The reason I wanted to try Neuraxle is to have an AutoML pipeline whose training can be resumed from checkpoints. Does Neuraxle currently support this feature? Thanks a lot!

@alexbrillant
Contributor

> This code snippet was just an example. The reason I wanted to try Neuraxle is to have an AutoML pipeline whose training can be resumed from checkpoints. Does Neuraxle currently support this feature?

Yes, there is an example here: https://www.neuraxle.org/stable/examples/caching/plot_auto_ml_checkpoint.html
The step saving checkpoints had problems, and they have been fixed in #377. I am waiting for @guillaume-chevalier to approve the pull request.

The next release will include this fix for the step saving checkpoints. Thanks for trying out Neuraxle :)

@alarivarmann
Author

OK, thanks!

When do you plan to implement Bayesian optimization/TPE similar to Optuna?

@guillaume-chevalier
Member

@alarivarmann Until Alexandre answers, I think you could dig into our unit tests for the AutoML hyperparameter selection algorithms. You should be able to see example usages of our TPE!

We just need to add some documentation; I think it was working.

@alexbrillant can you confirm that we have a functional and working TPE implementation? I think this should be documented very soon; this is an important point not to forget.

@guillaume-chevalier
Member

guillaume-chevalier commented Sep 29, 2020

@alarivarmann So here are the TPE tests and how you can use them:
https://github.com/Neuraxio/Neuraxle/blob/master/testing/metaopt/test_tpe.py

If you have difficulty reading those tests, you may want to understand how @pytest.mark.parametrize works, as well as joblib's delayed function (see the sketch below).
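
For readers unfamiliar with those two helpers, here is a minimal, self-contained illustration (not taken from the Neuraxle test suite):

import pytest
from joblib import Parallel, delayed

# @pytest.mark.parametrize runs the same test once per listed argument value.
@pytest.mark.parametrize("exponent", [2, 3])
def test_parallel_powers(exponent):
    # joblib's delayed() wraps a call so that Parallel can schedule it lazily,
    # here across 2 worker processes.
    results = Parallel(n_jobs=2)(delayed(pow)(i, exponent) for i in range(5))
    assert results == [i ** exponent for i in range(5)]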
