Possible speed up at large data sets #87

Closed

zhangyingmath opened this issue Feb 19, 2016 · 6 comments

@zhangyingmath

Hi Randal,

We spoke briefly after the Data Philly Meetup at SIG on Feb 18. First of all, thank you for your talk; I really enjoyed it!

During your talk, you mentioned that TPOT can currently be slow on large data sets, and we discussed a possible way to speed it up. Here is the rough idea:

Let's say you have a large data set:

  1. As you start the pipeline/model selection process, begin with relatively simple pipelines/models, using only small random subsets of the data for fitting and testing. This serves as a first-round, somewhat heuristic weed-out process.

Additional twist 1: because you are only looking at small subsets, repeating the same process five times and averaging the results (like five-fold validation, except that you don't need to be as careful about the folds -- any five random subsets of the data will do) still wouldn't get too expensive.

If any simple pipelines/models look reasonably good, there are two ways to proceed:

i) Try them on a larger subset or on the entire data set. If they still look successful, we have a winner.

ii) If the pipelines/models are OK but not yet satisfactory, they may suggest a certain pattern in the pipelines, or a certain type of model, that can be a jumping-off point for more complicated models. You can then choose a higher-complexity pipeline/model along the same lines, pick a bigger subset size, and repeat the process.

Additional twist 2: as the subsets grow larger, the "multiple subsets, average out" step may become too expensive, so we may skip it.

Additional twist 3: in the model-fitting stage, suppose there is something like gradient descent involved. When you are looking at simple pipelines/models and small subsets, it may be fine to start with a relatively big step size; as the subsets grow, you may choose smaller step sizes, and so on.

  2. Now, if some of the simple pipelines/models don't look promising at a small subset size, we may run a few more iterations before weeding them out completely, so that we don't discard them prematurely. In any case, finding a globally optimal pipeline/model is hard, so it may be acceptable to settle for a local optimum along the search tree.
  3. If the final pipeline/model is too complicated, trim it down. We may arrive at a simple model different from any of the simple ones we tried earlier.

So that's the rough sketch. Since I am not very familiar with TPOT, I may have said things that are untrue or that you have already done; it's just a random thought. The idea is for TPOT to try out pipelines/models more like a human would: start with simple models and a few heuristic experiments; if we think we are on the right track, try more complicated hypotheses and more careful experiments to see if the results improve; and if we missed the right path from the start but eventually reached something that works yet looks horribly complicated, trim it down and look for its simplified essence. That's the general idea.

I haven't implemented any of this. Conceivably it would work on long data, but I don't know how it would play out on wide data.
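
For concreteness, here is a rough, standalone sketch of the subset-growing weed-out loop I'm describing. It is not TPOT code, and the candidate models, subset fractions, and "keep the top half" rule are all illustrative assumptions:

```python
# A rough, standalone sketch of the subset-growing weed-out loop above.
# Not TPOT code: the candidate models, subset fractions, and the
# "keep the top half" survival rule are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)

candidates = [
    LogisticRegression(max_iter=200),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(n_estimators=50),
]
rng = np.random.default_rng(0)

for frac in (0.01, 0.1, 1.0):  # progressively larger random subsets
    idx = rng.choice(len(X), size=int(len(X) * frac), replace=False)
    # "repeat five times and average" stays cheap while the subset is small
    scores = [cross_val_score(clone(m), X[idx], y[idx], cv=5).mean()
              for m in candidates]
    ranked = sorted(zip(scores, range(len(candidates))), reverse=True)
    # weed out the bottom half before moving on to a larger subset
    candidates = [candidates[i] for _, i in ranked[: max(1, len(ranked) // 2)]]

best = candidates[0].fit(X, y)  # the survivor, refit on the full data
print(best)
```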

Thanks very much,

Ying Zhang

@rhiever
Contributor

rhiever commented Feb 19, 2016

Thank you for writing this up, Ying. This is definitely an idea that we'll have to experiment with.

@jni

jni commented Mar 2, 2016

Would it make sense to train/evaluate each individual in the population with a different subset of a large dataset every time? I have a dataset of ~4M rows and TPOT appears to be kinda useless in this scenario... =)

Sorry if my question is naive; I don't have much experience with evolutionary algorithms.
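
To make the question concrete, something like this toy evaluation routine is what I have in mind (purely illustrative, not TPOT internals; the function name and subset size are made up):

```python
# Toy illustration only, not TPOT internals: score each candidate
# pipeline on a freshly drawn random subset every evaluation. The
# function name and the 10,000-row subset size are made up.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def evaluate_individual(pipeline, X, y, subset_size=10000, rng=None):
    """Fit and score one pipeline on a newly drawn random subset."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(X), size=subset_size, replace=False)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=0.25, random_state=0)
    return clone(pipeline).fit(X_tr, y_tr).score(X_te, y_te)
```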

@rhiever
Contributor

rhiever commented Mar 2, 2016

4M rows is definitely too large for TPOT right now, until we figure out a method like the one discussed in this issue for training on subsets. It would take a long time to train even one model on 4M rows, so you can imagine that a tool that trains many pipelines would be very slow on 4M rows.

One option for you is to try to reduce the number of rows by removing duplicates and using other data reduction methods. Otherwise, any sort of model or pipeline optimization technique will not really be feasible for your data set unless you have a lot of parallelized computation power.
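
For illustration, a minimal data-reduction pass along those lines might look like the following (the file name and the 5% sampling fraction are hypothetical placeholders):

```python
# Illustrative pre-TPOT data reduction; "data.csv" and the 5% fraction
# are hypothetical placeholders, not recommendations.
import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical 4M-row input
df = df.drop_duplicates()                    # exact duplicate rows add no signal
df = df.sample(frac=0.05, random_state=42)   # crude random downsampling
```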

@jni

jni commented Mar 2, 2016

unless you have a lot of parallelized computation power

Which I do. ;) But I imagine that would require a lot of concerted software engineering effort to get working. Nevertheless, I'm wondering whether there is a better strategy for local TPOT than naive, dramatic subsampling, such as varying the subsets used by individuals in the population, as I mentioned above.

@rhiever
Contributor

rhiever commented Mar 23, 2017

I bet we could hack this into TPOT for a quick test by passing TPOT's cv parameter a StratifiedShuffleSplit instead of KFold. If the StratifiedShuffleSplit's training split is a small sample (e.g., 10%) of the full training data, then every pipeline would be fit on only that fraction of the rows, which is what we're looking for here.
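
A sketch of that hack (assuming a TPOT version whose cv parameter accepts an sklearn splitter; the split counts and sizes here are illustrative, not recommendations):

```python
# Hack sketch: each CV "fold" fits on a random stratified 10% of the
# training rows, so pipeline evaluation only ever touches a fraction
# of the data. All settings here are illustrative.
from sklearn.model_selection import StratifiedShuffleSplit
from tpot import TPOTClassifier

cv = StratifiedShuffleSplit(n_splits=5, train_size=0.1, test_size=0.1,
                            random_state=42)
tpot = TPOTClassifier(generations=5, population_size=20, cv=cv,
                      random_state=42, verbosity=2)
# tpot.fit(X_train, y_train)  # X_train, y_train: your full training data
```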

@rhiever
Contributor

rhiever commented Oct 9, 2017

This feature was implemented in TPOT 0.8.
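
Specifically, it is exposed as the subsample parameter, which evaluates pipelines on a random fraction of the training rows. A minimal usage sketch (the 0.1 fraction is illustrative, not a recommendation):

```python
# Sketch of TPOT's subsample parameter (added in 0.8): pipelines are
# evaluated on a random fraction of the training rows. The 0.1 value
# is illustrative only.
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20,
                      subsample=0.1, random_state=42)
# tpot.fit(X_train, y_train)
```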
