Possible speed up at large data sets #87

Closed

zhangyingmath opened this issue Feb 19, 2016 · 6 comments

@zhangyingmath

Hi Randal,

We spoke briefly after the Data Philly Meetup at SIG on Feb 18. First of all, thank you for your talk; I really enjoyed it!

During your talk, you mentioned that TPOT can currently be slow on large data sets, and we discussed a possible way to speed it up. Here is the rough idea:

Let's say you have a large data set:

  1. As you start the pipeline/model selection process, begin with relatively simple pipelines/models, using only small random subsets of the data for fitting and testing. This serves as a first-round, somewhat heuristic weed-out process.

Additional twist 1: because you are only looking at small subsets, repeating the same process five times and averaging the results (like five-fold validation, except that you don't need to be as careful about the folds -- any five random subsets of the data will do) still wouldn't get too expensive.

If any simple pipelines/models look reasonably good, there are two ways to proceed:

i) Try them on a larger subset or on the entire data set. If they still look successful, we have a winner.

ii) If the pipelines/models are OK but not yet satisfactory, they may suggest a certain pattern in the pipelines, or a certain type of model, that can be a jumping-off point for more complicated models. You can then choose a higher-complexity pipeline/model along the same lines, pick a bigger subset size, and repeat the process.

Additional twist 2: as the subsets grow larger, the "multiple subsets, average out" step may become too expensive, so we may skip it.

Additional twist 3: in the model-fitting stage, suppose there is something like gradient descent involved. When you are looking at simple pipelines/models and small subsets, it may be fine to start with a relatively big step size; as the subsets grow, you may choose smaller step sizes, and so on.

  2. Now, if some of the simple pipelines/models don't look promising at a small subset size, we may run a few more iterations before weeding them out completely, so that we don't discard them prematurely. In any case, finding a globally optimal pipeline/model is hard, so it may be acceptable to settle for a local optimum along the search tree.
  3. If the final pipeline/model is too complicated, trim it down. We may arrive at a simple model different from any of the simple ones we tried earlier.

So that's the rough sketch. Since I am not very familiar with TPOT, I may have said things that are untrue or that you have already done; it's just a random thought. The idea is for TPOT to try out pipelines/models more like a human would: start with simple models and a few heuristic experiments; if we think we are on the right track, try more complicated hypotheses and more careful experiments to see if the results improve; and if we missed the right path from the start but eventually reached something that works yet looks horribly complicated, trim it down and look for its simplified essence. That's the general idea.

I haven't implemented any of this. Conceivably it would work on long data, but I don't know how it would play out on wide data.
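
For concreteness, here is a rough, standalone sketch of the subset-growing weed-out loop I'm describing. It is not TPOT code, and the candidate models, subset fractions, and "keep the top half" rule are all illustrative assumptions:

```python
# A rough, standalone sketch of the subset-growing weed-out loop above.
# Not TPOT code: the candidate models, subset fractions, and the
# "keep the top half" survival rule are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)

candidates = [
    LogisticRegression(max_iter=200),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(n_estimators=50),
]
rng = np.random.default_rng(0)

for frac in (0.01, 0.1, 1.0):  # progressively larger random subsets
    idx = rng.choice(len(X), size=int(len(X) * frac), replace=False)
    # "repeat five times and average" stays cheap while the subset is small
    scores = [cross_val_score(clone(m), X[idx], y[idx], cv=5).mean()
              for m in candidates]
    ranked = sorted(zip(scores, range(len(candidates))), reverse=True)
    # weed out the bottom half before moving on to a larger subset
    candidates = [candidates[i] for _, i in ranked[: max(1, len(ranked) // 2)]]

best = candidates[0].fit(X, y)  # the survivor, refit on the full data
print(best)
```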

Thanks very much,

Ying Zhang

@rhiever
Contributor

rhiever commented Feb 19, 2016

Thank you for writing this up, Ying. This is definitely an idea that we'll have to experiment with.

@jni

jni commented Mar 2, 2016

Would it make sense to train/evaluate each individual in the population with a different subset of a large dataset every time? I have a dataset of ~4M rows and TPOT appears to be kinda useless in this scenario... =)

Sorry if my question is naive; I don't have much experience with evolutionary algorithms.
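
To make the question concrete, something like this toy evaluation routine is what I have in mind (purely illustrative, not TPOT internals; the function name and subset size are made up):

```python
# Toy illustration only, not TPOT internals: score each candidate
# pipeline on a freshly drawn random subset every evaluation. The
# function name and the 10,000-row subset size are made up.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def evaluate_individual(pipeline, X, y, subset_size=10000, rng=None):
    """Fit and score one pipeline on a newly drawn random subset."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(X), size=subset_size, replace=False)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=0.25, random_state=0)
    return clone(pipeline).fit(X_tr, y_tr).score(X_te, y_te)
```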

@rhiever
Contributor

rhiever commented Mar 2, 2016

4M rows is definitely too large for TPOT right now, until we figure out a method like the one discussed in this issue for training on subsets. It would take a long time to train even one model on 4M rows, so you can imagine that a tool that trains many pipelines would be very slow on 4M rows.

One option for you is to try to reduce the number of rows by removing duplicates and using other data reduction methods. Otherwise, any sort of model or pipeline optimization technique will not really be feasible for your data set unless you have a lot of parallelized computation power.
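
For illustration, a minimal data-reduction pass along those lines might look like the following (the file name and the 5% sampling fraction are hypothetical placeholders):

```python
# Illustrative pre-TPOT data reduction; "data.csv" and the 5% fraction
# are hypothetical placeholders, not recommendations.
import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical 4M-row input
df = df.drop_duplicates()                    # exact duplicate rows add no signal
df = df.sample(frac=0.05, random_state=42)   # crude random downsampling
```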

@jni

jni commented Mar 2, 2016

unless you have a lot of parallelized computation power

Which I do. ;) But I imagine that would require a lot of concerted software engineering effort to get working. Nevertheless, I'm wondering whether there is a better strategy for local TPOT than naive, dramatic subsampling, such as varying the subsets used by individuals in the population, as I mentioned above.

@rhiever
Contributor

rhiever commented Mar 23, 2017

I bet we could hack this into TPOT for a quick test by passing TPOT's cv parameter a StratifiedShuffleSplit instead of KFold. If the StratifiedShuffleSplit's training split is a small sample (e.g., 10%) of the full training data, then every pipeline would be fit on only that fraction of the rows, which is what we're looking for here.
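
A sketch of that hack (assuming a TPOT version whose cv parameter accepts an sklearn splitter; the split counts and sizes here are illustrative, not recommendations):

```python
# Hack sketch: each CV "fold" fits on a random stratified 10% of the
# training rows, so pipeline evaluation only ever touches a fraction
# of the data. All settings here are illustrative.
from sklearn.model_selection import StratifiedShuffleSplit
from tpot import TPOTClassifier

cv = StratifiedShuffleSplit(n_splits=5, train_size=0.1, test_size=0.1,
                            random_state=42)
tpot = TPOTClassifier(generations=5, population_size=20, cv=cv,
                      random_state=42, verbosity=2)
# tpot.fit(X_train, y_train)  # X_train, y_train: your full training data
```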

@rhiever
Contributor

rhiever commented Oct 9, 2017

This feature was implemented in TPOT 0.8.
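
Specifically, it is exposed as the subsample parameter, which evaluates pipelines on a random fraction of the training rows. A minimal usage sketch (the 0.1 fraction is illustrative, not a recommendation):

```python
# Sketch of TPOT's subsample parameter (added in 0.8): pipelines are
# evaluated on a random fraction of the training rows. The 0.1 value
# is illustrative only.
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20,
                      subsample=0.1, random_state=42)
# tpot.fit(X_train, y_train)
```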
