
Address pipeline overfitting #64

Closed
rhiever opened this issue Dec 22, 2015 · 0 comments
rhiever commented Dec 22, 2015

Currently, TPOT tends to build pipelines that overfit the data unless a good training sample is provided. We need to devise a method to combat overfitting at the pipeline level. Here's what I'm looking to explore:

Multi-objective fitness: Optimize along two fitness axes, where one is classification accuracy and the other is model complexity (see the sketch after this list). Model complexity can be quantified in several ways:

  • The number of modeling operators (e.g., classifiers) in the pipeline
  • The total number of operators of any kind in the pipeline
  • The sum of the number of features at every stage of the pipeline
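A minimal sketch of how such a two-objective fitness could be declared with DEAP (the evolutionary framework TPOT is built on) is below. Note that `evaluate_pipeline`, `score_pipeline`, and `count_operators` are hypothetical placeholder names for illustration, not existing TPOT functions:

```python
from deap import base, creator

# Two objectives: maximize accuracy (+1.0 weight), minimize complexity (-1.0 weight).
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

def evaluate_pipeline(individual):
    """Return (accuracy, complexity) for one candidate pipeline."""
    accuracy = score_pipeline(individual)      # placeholder: e.g., CV accuracy on the training set
    complexity = count_operators(individual)   # placeholder: e.g., number of operators in the pipeline
    return accuracy, complexity
```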

Pareto optimization: Borrowing ideas from the well-known NSGA-II algorithm, we can keep the same two fitness axes but compare pipelines by Pareto dominance rather than collapsing the objectives into a single score. This produces a Pareto front of pipelines at the end of the optimization process, from which the user hand-selects their preferred trade-off between complexity and accuracy (rather than the algorithm strictly minimizing model complexity, as in the multi-objective fitness approach above).
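A rough sketch of what the selection step could look like with DEAP's built-in NSGA-II selection and Pareto-front hall of fame (the surrounding generation loop is omitted, and `select_next_generation` is a hypothetical helper):

```python
from deap import tools

# Retain every non-dominated (accuracy, complexity) pipeline seen during the run,
# so the user can pick their preferred trade-off afterwards.
pareto_front = tools.ParetoFront()

def select_next_generation(population, offspring, mu):
    """NSGA-II environmental selection: non-dominated sorting + crowding distance."""
    combined = population + offspring
    pareto_front.update(combined)
    return tools.selNSGA2(combined, mu)
```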

I'll be working on this over Winter break, so please feel free to provide feedback and ideas.
