Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New TPOT features #498

Merged
merged 8 commits into from
Jun 21, 2017
Merged

New TPOT features #498

merged 8 commits into from
Jun 21, 2017

Conversation

kuratsak
Copy link

@kuratsak kuratsak commented Jun 15, 2017

What does this PR do?

Implements 3 cool new features:

  1. A new optional argument that saves the best pipeline so far after each generation. Run TPOT -> use pipelines without stopping it's optimization process.
  2. TPOT now saves the score that mattered for every optimized pipeline it outputs
  3. personal/manual scoring function now supported in the main driver too.

Where should the reviewer start?

Relatively small PR, should be straightforward.. just look at the full diff 💃

How should this PR be tested?

  1. create a local_pipeline_folder
  2. run TPOT and add -psp="./local_pipeline_folder/"
  3. after 2-3 generations TPOT will output at least one pipeline into the folder

Any background context you want to provide?

Extremely useful features we used in our research projects!

What are the relevant issues?

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated? Nope
  • Does this PR add new (Python) dependencies? Nope

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.5%) to 85.102% when pulling ab809ad on kuratsak:tpot_features into f89d2de on rhiever:development.

@kuratsak
Copy link
Author

Hi Approvers!

This is my second attempt after splitting the pull request into a bug fix one that was approved already.

I am finishing a job and starting another, So I won't be able to fix code until at least July probably..
I have thoroughly tested and used these features in my research in the current job, I really hope we can approve this.

I am available until sunday for any fixes to be made in this pull request, otherwise it will have to wait and I really hope we don't have to.

@coveralls
Copy link

Coverage Status

Coverage increased (+9.8%) to 97.436% when pulling 2603815 on kuratsak:tpot_features into f89d2de on rhiever:development.

Copy link
Contributor

@weixuanfu weixuanfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice functions!

tpot/driver.py Outdated
print('taken from module: {}'.format(module_name))
except Exception as e:
print('failed importing custom scoring function, error: {}'.format(str(e)))
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need raise ValueError Here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the exception string itself is enough at this point
Done, I didn't want to lose the traceback inside. is there a trick for this?

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.4%) to 86.204% when pulling 4d885dc on kuratsak:tpot_features into f89d2de on rhiever:development.

tpot/base.py Outdated
def __init__(self, generations=100, population_size=100, offspring_size=None,
mutation_rate=0.9, crossover_rate=0.1,
scoring=None, cv=5, subsample=1.0, n_jobs=1,
max_time_mins=None, max_eval_time_mins=5,
random_state=None, config_dict=None, warm_start=False,
verbosity=0, disable_update_check=False):
verbosity=0, disable_update_check=False, periodic_save_path=None):
Copy link
Contributor

@rhiever rhiever Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move periodic_save_path above verbosity and disable_update_check (both here and the docstring).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

if pscore is not None:
pipeline_text += '\n# Score on the training set was:{}\n'.format(pscore)
else:
pipeline_text += '\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor to:

    if pscore is not None:
         pipeline_text += '\n# Score on the training set was: {}'.format(pscore)
    pipeline_text += '\n'

Please also rename pscore to pipeline_score for a more descriptive variable name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

@EpistasisLab EpistasisLab deleted a comment from weixuanfu Jun 15, 2017
@EpistasisLab EpistasisLab deleted a comment from kuratsak Jun 15, 2017
tpot/base.py Outdated
def _save_periodic_pipeline(self):
if self.periodic_save_path is not None:
try:
write = self._pbar.write if not self._pbar.disable else print
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please give a more descriptive variable name for write.

Copy link
Author

@kuratsak kuratsak Jun 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍
renamed to print_func since it does what print does

tpot/base.py Outdated
if self.verbosity >= 2:
write('failed saving periodic pipeline, exception:\n{}'.format(str(e)[:250]))

def export(self, output_file_name, skip_if_repeated=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document skip_if_repeated in the docstrings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


#dont export a pipeline you just had
if skip_if_repeated and (self._exported_pipeline_text == to_write):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the meaning of the True/False return values for the export function in the docstrings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

tpot/gp_deap.py Outdated
@@ -164,6 +165,9 @@ def eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar,

# Begin the generational process
for gen in range(1, ngen + 1):
# after each population save a periodic pipeline
if periodic_pipeline_saver is not None:
periodic_pipeline_saver()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this addition is related to #79.

tpot/gp_deap.py Outdated
@@ -103,7 +103,8 @@ def varOr(population, toolbox, lambda_, cxpb, mutpb):


def eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar,
stats=None, halloffame=None, verbose=0, max_time_mins=None):
stats=None, halloffame=None, verbose=0, max_time_mins=None,
periodic_pipeline_saver=None):
Copy link
Contributor

@rhiever rhiever Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor periodic_pipeline_saver to periodic_pipeline_saving_fn or a variable name that is more descriptive. Please also document it in the docstrings.

Copy link
Author

@kuratsak kuratsak Jun 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea,
in fact in eaMuPlusLambda context this is a general parameter "per generation function"
renaming to per_generation_function to be more explicit in its role.

This can later be used for more extended logic too.

tpot/base.py Outdated
@@ -702,7 +719,33 @@ def set_params(self, **params):

return self

def export(self, output_file_name):
def save_pipeline_if_period(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save_pipeline_if_period should also be a"private" function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

tpot/base.py Outdated
@@ -598,6 +614,7 @@ def _update_top_pipeline(self):
if pipeline_scores.wvalues[1] > top_score:
self._optimized_pipeline = pipeline
top_score = pipeline_scores.wvalues[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

top_score variable isn't needed any more if the same value is being stored in self._optimized_pipeline_score.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed top_score variable

tpot/base.py Outdated
@@ -185,6 +189,13 @@ def __init__(self, generations=100, population_size=100, offspring_size=None,
A setting of 2 or higher will add a progress bar during the optimization procedure.
disable_update_check: bool, optional (default: False)
Flag indicating whether the TPOT version checker should be disabled.
periodic_save_path: path string (default: None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor the periodic_save_path name to periodic_checkpoint_path. I believe that's a more descriptive term for what it's doing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to periodic_checkpoint_folder to be explicit in its role. Thanks!

@@ -79,12 +80,15 @@ def handler(dwCtrlType, hook_sigint=_thread.interrupt_main):
class TPOTBase(BaseEstimator):
"""Automatically creates and optimizes machine learning pipelines using GP."""

# dont save periodic pipelines more often than this
OUTPUT_BEST_PIPELINE_PERIOD_SECONDS = 30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a TPOT parameter? Thoughts @weixuanfu2016 and @teaearlgraycold? I suppose I'm fine with a hard-coded 30 seconds, but I can see a user down the line wanting to tweak this value.

Copy link
Author

@kuratsak kuratsak Jun 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's initial target is not to be user configurable - but to prevent too many calls to export. in any case the limit is per generation.

if we one day allow the user to save pipeline after each single evaluation for example, then we might want this (or even might NOT want this).

In any case IMHO it seems a bit early to let this be a parameter.

tests.py Outdated
"""

assert_equal(expected_code, export_pipeline(pipeline, tpot_obj.operators, tpot_obj._pset, pscore=0.929813743))

Copy link
Contributor

@rhiever rhiever Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests look good, but they need to be rebased on the most recent version of the dev branch. @teaearlgraycold reorganized tests.py into a tests directory. Less clutter now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

@rhiever
Copy link
Contributor

rhiever commented Jun 15, 2017

Overall, I'm 👍 on these changes, pending the rebase and requested changes.

Beyond the several requests for adding to the docstrings, please also update the API documentation and other documentation where relevant.

I'm also concerned that the coverage decreased by 1.4%.

Thank you for the PR, @kuratsak.

print('failed importing custom scoring function, error: {}'.format(str(e)))
raise ValueError(e)

return scoring_func
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this function work if I pass a sklearn metric, e.g., sklearn.metrics.auc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added your function in the test, works out of the box 👍

@kuratsak
Copy link
Author

kuratsak commented Jun 15, 2017 via email

@rhiever
Copy link
Contributor

rhiever commented Jun 15, 2017

We merged some PRs today that re-organized the unit tests. I think it's just a matter of moving the unit tests into the appropriate files in the tests directory.

@kuratsak
Copy link
Author

kuratsak commented Jun 15, 2017 via email

@rhiever
Copy link
Contributor

rhiever commented Jun 15, 2017

It's merged into the dev branch now.

Conflicts:
	tests/tpot_tests.py
	tpot/base.py
	tpot/driver.py
@kuratsak
Copy link
Author

kuratsak commented Jun 18, 2017 via email

@kuratsak
Copy link
Author

kuratsak commented Jun 18, 2017

Can someone please check and rerun travis build? it seems the build failed because of external sources of trouble..

both py36 builds failed with errors related to:
Error: HTTPError: 522 against the url https://repo.continuum.io/pkgs/....

I checked of course locally: both py3 and py2 tests pass

Update:
pushed documentation changes, tests passed.
As I suspected the problem was external.

Everything looks good now 🙌

@coveralls
Copy link

Coverage Status

Coverage increased (+6.5%) to 94.81% when pulling 81045ab on kuratsak:tpot_features into 2b0c29c on rhiever:development.

@kuratsak
Copy link
Author

kuratsak commented Jun 18, 2017 via email

@kuratsak
Copy link
Author

I am being taken away from my working computer, they are saying im fired and i cant touch it anymore 🥇

@rhiever
Copy link
Contributor

rhiever commented Jun 19, 2017

Thanks @kuratsak. Will look through this soon.

@rhiever
Copy link
Contributor

rhiever commented Jun 21, 2017

LGTM. Thanks @kuratsak!

@rhiever rhiever merged commit 7f2b6a7 into EpistasisLab:development Jun 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants