
Address pipeline overfitting #64

Closed
rhiever opened this issue Dec 22, 2015 · 0 comments
rhiever commented Dec 22, 2015

Currently, TPOT tends to build pipelines that overfit the data unless a good training sample is provided. We need to devise a method to combat overfitting at the pipeline level. Here's what I'm looking to explore:

Multi-objective fitness: Optimize along two fitness axes, where one is classification accuracy and the other is model complexity (see the sketch after this list). Model complexity can be quantified in several ways:

  • The number of modeling operators (e.g., classifiers) in the pipeline
  • The total number of operators of any kind in the pipeline
  • The sum of the number of features at every stage of the pipeline
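A minimal sketch of how such a two-objective fitness could be declared with DEAP (the evolutionary framework TPOT is built on) is below. Note that `evaluate_pipeline`, `score_pipeline`, and `count_operators` are hypothetical placeholder names for illustration, not existing TPOT functions:

```python
from deap import base, creator

# Two objectives: maximize accuracy (+1.0 weight), minimize complexity (-1.0 weight).
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

def evaluate_pipeline(individual):
    """Return (accuracy, complexity) for one candidate pipeline."""
    accuracy = score_pipeline(individual)      # placeholder: e.g., CV accuracy on the training set
    complexity = count_operators(individual)   # placeholder: e.g., number of operators in the pipeline
    return accuracy, complexity
```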

Pareto optimization: Borrowing ideas from the well-known NSGA-II algorithm, we can keep the same two fitness axes but compare pipelines by Pareto dominance rather than collapsing the objectives into a single score. This produces a Pareto front of pipelines at the end of the optimization process, from which the user hand-selects their preferred trade-off between complexity and accuracy (rather than the algorithm strictly minimizing model complexity, as in the multi-objective fitness approach above).
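A rough sketch of what the selection step could look like with DEAP's built-in NSGA-II selection and Pareto-front hall of fame (the surrounding generation loop is omitted, and `select_next_generation` is a hypothetical helper):

```python
from deap import tools

# Retain every non-dominated (accuracy, complexity) pipeline seen during the run,
# so the user can pick their preferred trade-off afterwards.
pareto_front = tools.ParetoFront()

def select_next_generation(population, offspring, mu):
    """NSGA-II environmental selection: non-dominated sorting + crowding distance."""
    combined = population + offspring
    pareto_front.update(combined)
    return tools.selNSGA2(combined, mu)
```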

I'll be working on this over Winter break, so please feel free to provide feedback and ideas.
