Possible speed up at large data sets #87
Thank you for writing this up, Ying. This is definitely an idea that we'll have to experiment with.
Would it make sense to train/evaluate each individual in the population with a different subset of a large dataset every time? I have a dataset of ~4M rows and TPOT appears to be kinda useless in this scenario... =) Sorry if my question is naive; I don't have much experience with evolutionary algorithms.
4M rows is definitely too large for TPOT right now until we figure out a … One option for you is to try to reduce the number of rows by removing …
Which I do. ;) But I imagine that would require a lot of concerted software engineering effort to get working. Nevertheless, I'm wondering whether there is a better strategy for local TPOT than naive and dramatic subsampling, as I mentioned, such as varying the subsets used by individuals in the populations.
I bet we could hack this into TPOT for a quick test by passing TPOT's …
This feature was implemented in TPOT 0.8.
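For reference, here's a minimal usage sketch of that feature, assuming the `subsample` parameter described in the TPOT docs (the fraction of training rows used during the optimization process); the dataset and hyperparameters below are placeholders, not recommendations.

```python
# Minimal sketch of TPOT >= 0.8's `subsample` option; the dataset and
# hyperparameters here are placeholders, not tuned recommendations.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# subsample=0.1: each candidate pipeline is evaluated on a random 10% of
# the training rows, which keeps evaluations cheap on large datasets.
tpot = TPOTClassifier(generations=5, population_size=20, subsample=0.1,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```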
Hi Randal,
We spoke briefly after the Data Philly Meetup at SIG on Feb 18. First of all, thank you for your nice talk; I really liked it!
During your talk, you mentioned that TPOT can currently be slow on large data sets, and we spoke about a possible way to speed it up. Here is the rough idea:
Let's say you have a large set of data. The idea is to start by trying simple pipelines/models on small random subsets of it: since both the pipelines and the subsets are small, each evaluation is cheap.
Additional twist 1: because you are only looking at small subsets, if you do things like "repeat the same process five times and average the results" (like five-fold cross-validation, except that you don't need to be as careful about the five-fold part -- any five random subsets of the data will do), it still wouldn't get too expensive.
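As a rough illustration of twist 1, here is a minimal sketch of scoring a candidate pipeline on several small random subsets and averaging. `subset_score` and its parameters are hypothetical names of mine, not anything in TPOT, and the 80/20 split inside each subset is my own assumption.

```python
# Sketch of twist 1: average a pipeline's score over several small random
# subsets instead of one careful k-fold split. `subset_score` is a
# hypothetical helper, not part of TPOT's API.
import numpy as np
from sklearn.base import clone

def subset_score(pipeline, X, y, n_subsets=5, subset_size=1_000, seed=None):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=False)
        Xs, ys = X[idx], y[idx]
        split = int(0.8 * subset_size)  # hold out 20% of the subset
        model = clone(pipeline).fit(Xs[:split], ys[:split])
        scores.append(model.score(Xs[split:], ys[split:]))
    return float(np.mean(scores))
```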
If any simple pipelines/models look reasonably good, there are two ways to proceed:
i) try them on a larger subset/the entire data set. If they look really successful, we have a winner.
ii) if the pipelines/models are OK but still not satisfactory, then maybe they suggest a certain pattern in the pipelines or a certain type of model that can be a jump start for more complicated models. You can then choose a pipeline/model with higher complexity along the same lines, choose a bigger subset size, and reiterate the process (a sketch of this loop follows twist 2 below).
Additional twist 2: As the subsets grow large, it may be too expensive to do the "multiple subsets, average out" process. So we may skip doing it.
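Here is a minimal sketch of the promote-or-refine loop from steps (i)/(ii), folding in twist 2. It reuses the hypothetical `subset_score` helper sketched earlier; the subset sizes and the `good_enough` threshold are illustrative assumptions, not values from the thread.

```python
# Sketch of steps (i)/(ii) plus twist 2, reusing the hypothetical
# `subset_score` helper from the earlier sketch. Sizes and the threshold
# are illustrative assumptions.
def grow_and_check(pipeline, X, y, sizes=(1_000, 10_000, 100_000),
                   good_enough=0.95):
    for size in sizes:
        size = min(size, len(X))
        # Twist 2: once subsets are large, averaging over several of them
        # gets expensive, so fall back to a single subset.
        n_subsets = 5 if size <= 10_000 else 1
        score = subset_score(pipeline, X, y,
                             n_subsets=n_subsets, subset_size=size)
        if score >= good_enough:
            return score, size  # step (i): a winner at this subset size
    # Step (ii): promising but not satisfactory; pick a more complex
    # pipeline along the same lines and call this again.
    return score, size
```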
Additional twist 3: in the model-fitting stage, suppose there is something like gradient descent; then when you are looking at simple pipelines/models and small subsets, it may be OK to just start with a relatively big step size. As the subsets grow, you may choose to use smaller step sizes, etc.
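A toy sketch of twist 3 using scikit-learn's SGDClassifier: the `eta0` schedule tying step size to subset size is my own assumption, only meant to illustrate the "bigger subsets, smaller steps" idea, and the data here is synthetic filler.

```python
# Sketch of twist 3: bigger steps on small subsets, smaller steps as the
# subset grows. The eta0 schedule is an illustrative assumption.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] > 0).astype(int)  # toy labels so the snippet runs end to end

for subset_size, eta0 in [(1_000, 0.1), (10_000, 0.01), (100_000, 0.001)]:
    idx = rng.choice(len(X), size=subset_size, replace=False)
    model = SGDClassifier(learning_rate="constant", eta0=eta0,
                          max_iter=5, tol=None)
    model.fit(X[idx], y[idx])
```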
So that's the rough sketch. Since I am not very familiar with TPOT, I may have said things that are not true or that you have already done; it's just a random thought. The idea is for TPOT to try out pipelines/models more like a human would: we start from simple models and a few heuristic experiments; if we think we are on the right track, we try more complicated hypotheses and more careful experiments, to see if it gets any better; and if we missed the correct path from the start but finally reached something that works yet looks horribly complicated, we try to trim it down and look for its simplified essence. That's the general idea.
I haven't implemented any of this, and conceivably it might work on long data, but I don't know how it plays on wide data.
Thanks very much,
Ying Zhang