Tpot examples do not seem to differentiate/evolve? #503

Closed
dartdog opened this issue Jun 21, 2017 · 28 comments

@dartdog

dartdog commented Jun 21, 2017

Running the example, the initial optimization values found do not change.

Context of the issue

For instance, when I ran the Iris classifier example I got the following:

/home/tom/anaconda3/envs/py36n/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
Optimization Progress: 31%|███ | 92/300 [00:23<00:32, 6.41pipeline/s]
Generation 1 - Current best internal CV score: 0.9746376811594203
Optimization Progress: 47%|████▋ | 141/300 [00:39<00:30, 5.23pipeline/s]
Generation 2 - Current best internal CV score: 0.9746376811594203
Optimization Progress: 63%|██████▎ | 190/300 [00:50<00:14, 7.69pipeline/s]
Generation 3 - Current best internal CV score: 0.9746376811594203
Optimization Progress: 77%|███████▋ | 231/300 [00:57<00:07, 8.68pipeline/s]
Generation 4 - Current best internal CV score: 0.9746376811594203

Generation 5 - Current best internal CV score: 0.9746376811594203

Best pipeline: GaussianNB(input_matrix)
0.921052631579

So the best internal CV score never changes.

I'm seeing the same when running the MNIST example, i.e. no change in scores.

One would expect to see some search activity with differing values?

MNIST output:

Optimization Progress: 31%|███▏ | 94/300 [05:29<04:56, 1.44s/pipeline]
Generation 1 - Current best internal CV score: 0.9859273244941604
Optimization Progress: 48%|████▊ | 143/300 [10:57<12:22, 4.73s/pipeline]
Generation 2 - Current best internal CV score: 0.9859273244941604
Optimization Progress: 62%|██████▏ | 187/300 [13:09<04:19, 2.30s/pipeline]
Generation 3 - Current best internal CV score: 0.9859273244941604
Optimization Progress: 77%|███████▋ | 231/300 [15:04<03:09, 2.74s/pipeline]
Generation 4 - Current best internal CV score: 0.9859273244941604

Generation 5 - Current best internal CV score: 0.9859273244941604

Best pipeline: LinearSVC(PolynomialFeatures(input_matrix, PolynomialFeatures__degree=2, PolynomialFeatures__include_bias=DEFAULT, PolynomialFeatures__interaction_only=DEFAULT), LinearSVC__C=0.0001, LinearSVC__dual=True, LinearSVC__loss=DEFAULT, LinearSVC__penalty=l2, LinearSVC__tol=0.1)
0.986666666667

@weixuanfu
Contributor

I think the population size of 50 and generation count of 5 in the examples may be too small to show evolution.

@rhiever
Contributor

rhiever commented Jun 22, 2017

Right. Does it work better if you set population_size=100 and generations=100? The examples are meant to run fast, but they're not ideal parameter settings. The defaults (100 pop and 100 gen) are better.

@PGijsbers
Contributor

PGijsbers commented Jun 22, 2017

Actually, while working on a new extension to TPOT and analyzing my results, I found the same issue to be true (in both the 'vanilla' 0.8.1 version of TPOT and my extension).
A significant number of duplicates are being produced, both within the same generation and across generations.
I have not yet tried to figure out the cause.
I made a very brief write-up here.

EDIT: I made a mistake about when to log the population, over-representing the number of duplicates and cache hits. This mistake is currently present in the above notebook.

@rhiever
Contributor

rhiever commented Jun 22, 2017

Wow, that's a lot of duplicates! I wonder if we can change the mutation/xover functions to only produce individuals not in the cache using the pre_check function, @weixuanfu2016?

@dartdog
Author

dartdog commented Jun 22, 2017

FWIW I upped the population size to 150 on the iris example and got 0.990909090909091 in the 1st generation, with no changes beyond that in subsequent generations. I was looking for something to really show the power that I assume is here!

@PGijsbers
Contributor

PGijsbers commented Jun 22, 2017

@dartdog Good to hear! Indeed, the initial population seems to be generated with fair diversity :) so that would align with getting better results by increasing the population size (even in the 1st generation).

@rhiever I think you can do this in the _random_mutation_operator and _mate_operator functions by rewriting them from

@_pre_test
def _mate_operator(self, ind1, ind2):
    return cxOnePoint(ind1, ind2)

@_pre_test
def _random_mutation_operator(self, individual):
    mutation_techniques = [
        partial(gp.mutInsert, pset=self._pset),
        partial(mutNodeReplacement, pset=self._pset),
        partial(gp.mutShrink)
    ]
    return np.random.choice(mutation_techniques)(individual)

to

@_pre_test
def _mate_operator(self, ind1, ind2):
    offspring = cxOnePoint(ind1, ind2)
    while offspring in self.evaluated_individuals_:
        offspring = cxOnePoint(ind1, ind2)
    return offspring

@_pre_test
def _random_mutation_operator(self, individual):
    mutation_techniques = [
        partial(gp.mutInsert, pset=self._pset),
        partial(mutNodeReplacement, pset=self._pset),
        partial(gp.mutShrink)
    ]
    mutator = np.random.choice(mutation_techniques)
    offspring = mutator(individual)
    while offspring in self.evaluated_individuals_:
        offspring = mutator(individual)
    return offspring

Of course, you probably want to use a for-loop with some maximum number of tries instead, to avoid infinite loops.
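For illustration, a bounded-retry version of the mutation operator might look roughly like this (just a sketch: max_tries is a made-up parameter, and it assumes evaluated_individuals_ is keyed by the pipeline's string representation):

```python
from copy import deepcopy

@_pre_test
def _random_mutation_operator(self, individual, max_tries=50):
    mutation_techniques = [
        partial(gp.mutInsert, pset=self._pset),
        partial(mutNodeReplacement, pset=self._pset),
        partial(gp.mutShrink)
    ]
    mutator = np.random.choice(mutation_techniques)
    # DEAP mutators modify the tree in place, so mutate a fresh copy on each attempt
    offspring, = mutator(deepcopy(individual))
    for _ in range(max_tries):
        if str(offspring) not in self.evaluated_individuals_:
            break
        offspring, = mutator(deepcopy(individual))
    return offspring,
```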

However, I would still try to see if the behavior is caused by a bug. Considering 30% of individuals will have a random insert mutation each generation (on average), it seems very weird to me that we can get 99% cache hits consistently.

Edit: As stated below, the bug was with logging; the actual numbers rise much more slowly.

@PGijsbers
Contributor

PGijsbers commented Jun 22, 2017

Oops! It turns out the problem does exist, but is not as severe as stated in the notebook.
I made a mistake about when to log the population, over-representing the number of duplicates and cache hits. I am currently rerunning the experiment; so far it seems that there are still a fair number of cache hits and duplicates, but not nearly as many (maybe 10~40% instead of near 100%).

In addition to changing where the logging happens, I now also have a duplicate/cache-hit counter directly where they occur in the code, in evaluate_individuals, to verify.

@weixuanfu
Contributor

weixuanfu commented Jun 23, 2017

@rhiever that is a good idea. I will work on it.

@PG-TUe thank you for the notebook and idea. Maybe my understanding of cache_hits in your plots is not right. I made a quick test branch based on the master branch in my forked repo to check how many unique new individuals there are in each generation (since this line will skip pipelines already evaluated in earlier generations). The percentage of new individuals in later generations was around 30% in the test (check the partial log below).

@dartdog check the partial log below: I saw that scores changed in the early generations but stayed fixed in the later generations. I think the problem with this example is that it is easy to find a very good pipeline in the early generations. As mentioned above, lower population diversity in later generations is indeed a concern, since it slows down the optimization process. I will work on a solution to increase population diversity in later generations.

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=2)

tpot = TPOTClassifier(generations=100, population_size=100, verbosity=2, random_state=2)
tpot.fit(X_train, y_train)


Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Optimization Progress:   0%|                                                                                                    | 0/10100 [00:00<?, ?pipeline/s]
Number of new vaild pipelines in currect generation: 90
Optimization Progress:   1%|▊                                                                                        | 92/10100 [00:22<1:08:13,  2.44pipeline/s]
Number of new vaild pipelines in currect generation: 86
Generation 1 - Current best internal CV score: 0.9726896292113685                                                                                               
Optimization Progress:   2%|█▋                                                                                      | 188/10100 [00:48<1:37:21,  1.70pipeline/s]
Number of new vaild pipelines in currect generation: 77
Generation 2 - Current best internal CV score: 0.9726896292113685                                                                                               
Optimization Progress:   3%|██▍                                                                                     | 277/10100 [01:19<1:47:59,  1.52pipeline/s]
Number of new vaild pipelines in currect generation: 72
Generation 3 - Current best internal CV score: 0.9726896292113685                                                                                               
Optimization Progress:   4%|███▏                                                                                    | 366/10100 [01:46<2:13:02,  1.22pipeline/s]
Number of new vaild pipelines in currect generation: 74
Generation 4 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   5%|███▉                                                                                    | 457/10100 [02:19<2:13:45,  1.20pipeline/s]
Number of new vaild pipelines in currect generation: 67
Generation 5 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   5%|████▊                                                                                   | 547/10100 [02:50<4:00:17,  1.51s/pipeline]
Number of new vaild pipelines in currect generation: 71
Generation 6 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   6%|█████▌                                                                                  | 639/10100 [03:23<3:01:06,  1.15s/pipeline]
Number of new vaild pipelines in currect generation: 67
Generation 7 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   7%|██████▍                                                                                 | 733/10100 [03:51<2:14:14,  1.16pipeline/s]
Number of new vaild pipelines in currect generation: 66
Generation 8 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   8%|███████▏                                                                                | 824/10100 [04:16<2:08:42,  1.20pipeline/s]
Number of new vaild pipelines in currect generation: 61
Generation 9 - Current best internal CV score: 0.982213438735178                                                                                                
Optimization Progress:   9%|███████▉                                                                                | 914/10100 [04:43<2:29:08,  1.03pipeline/s]
Number of new vaild pipelines in currect generation: 59
Generation 10 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  10%|████████▋                                                                               | 996/10100 [05:00<1:34:22,  1.61pipeline/s]

...

Generation 21 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  20%|█████████████████▎                                                                     | 2011/10100 [11:11<2:27:03,  1.09s/pipeline]
Number of new vaild pipelines in currect generation: 58
Generation 22 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  21%|██████████████████▏                                                                    | 2106/10100 [11:52<3:47:06,  1.70s/pipeline]
Number of new vaild pipelines in currect generation: 53
Generation 23 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  22%|██████████████████▉                                                                    | 2202/10100 [12:48<2:17:50,  1.05s/pipeline]
Number of new vaild pipelines in currect generation: 46
Generation 24 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  23%|███████████████████▊                                                                   | 2296/10100 [13:25<1:36:54,  1.34pipeline/s]
Number of new vaild pipelines in currect generation: 55
Generation 25 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  24%|████████████████████▌                                                                  | 2384/10100 [13:58<1:49:03,  1.18pipeline/s]
Number of new vaild pipelines in currect generation: 51
Generation 26 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  25%|█████████████████████▍                                                                 | 2482/10100 [14:45<2:29:02,  1.17s/pipeline]
Number of new vaild pipelines in currect generation: 44
Generation 27 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  25%|██████████████████████▏                                                                | 2571/10100 [15:21<2:09:12,  1.03s/pipeline]
Number of new vaild pipelines in currect generation: 55
Generation 28 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  26%|██████████████████████▉                                                                | 2666/10100 [15:55<2:25:48,  1.18s/pipeline]
Number of new vaild pipelines in currect generation: 46
Generation 29 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  27%|███████████████████████▊                                                               | 2762/10100 [16:46<2:36:52,  1.28s/pipeline]
Number of new vaild pipelines in currect generation: 44
Generation 30 - Current best internal CV score: 0.982213438735178     

...

Generation 60 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  56%|████████████████████████████████████████████████▌                                      | 5631/10100 [36:31<1:26:53,  1.17s/pipeline]
Number of new vaild pipelines in currect generation: 27
Generation 61 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  57%|█████████████████████████████████████████████████▏                                     | 5716/10100 [37:15<2:04:30,  1.70s/pipeline]
Number of new vaild pipelines in currect generation: 35
Generation 62 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  57%|█████████████████████████████████████████████████▉                                     | 5804/10100 [37:44<1:13:17,  1.02s/pipeline]
Number of new vaild pipelines in currect generation: 35
Generation 63 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  58%|██████████████████████████████████████████████████▋                                    | 5890/10100 [38:21<1:40:59,  1.44s/pipeline]
Number of new vaild pipelines in currect generation: 30
Generation 64 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  59%|███████████████████████████████████████████████████▌                                   | 5986/10100 [39:04<1:23:10,  1.21s/pipeline]
Number of new vaild pipelines in currect generation: 33
Generation 65 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  60%|████████████████████████████████████████████████████▎                                  | 6074/10100 [39:48<1:28:08,  1.31s/pipeline]
Number of new vaild pipelines in currect generation: 36
Generation 66 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  61%|██████████████████████████████████████████████████████▎                                  | 6166/10100 [40:09<49:16,  1.33pipeline/s]
Number of new vaild pipelines in currect generation: 39
Generation 67 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  62%|█████████████████████████████████████████████████████▉                                 | 6262/10100 [40:56<1:22:02,  1.28s/pipeline]
Number of new vaild pipelines in currect generation: 33
Generation 68 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  63%|██████████████████████████████████████████████████████▋                                | 6352/10100 [41:34<1:20:02,  1.28s/pipeline]
Number of new vaild pipelines in currect generation: 24
Generation 69 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  64%|███████████████████████████████████████████████████████▍                               | 6442/10100 [42:08<1:12:56,  1.20s/pipeline]
Number of new vaild pipelines in currect generation: 30
Generation 70 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  65%|████████████████████████████████████████████████████████▏                              | 6530/10100 [42:56<1:10:04,  1.18s/pipeline]
Number of new vaild pipelines in currect generation: 37

...

Generation 90 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  83%|█████████████████████████████████████████████████████████████████████████▉               | 8390/10100 [54:50<25:52,  1.10pipeline/s]
Number of new vaild pipelines in currect generation: 29
Generation 91 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  84%|██████████████████████████████████████████████████████████████████████████▋              | 8481/10100 [55:06<22:49,  1.18pipeline/s]
Number of new vaild pipelines in currect generation: 39
Generation 92 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  85%|███████████████████████████████████████████████████████████████████████████▌             | 8573/10100 [55:32<32:06,  1.26s/pipeline]
Number of new vaild pipelines in currect generation: 34
Generation 93 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  86%|████████████████████████████████████████████████████████████████████████████▎            | 8664/10100 [56:02<27:22,  1.14s/pipeline]
Number of new vaild pipelines in currect generation: 28
Generation 94 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  87%|█████████████████████████████████████████████████████████████████████████████            | 8752/10100 [56:40<28:31,  1.27s/pipeline]
Number of new vaild pipelines in currect generation: 28
Generation 95 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  88%|█████████████████████████████████████████████████████████████████████████████▉           | 8840/10100 [57:12<24:24,  1.16s/pipeline]
Number of new vaild pipelines in currect generation: 24
Generation 96 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  88%|██████████████████████████████████████████████████████████████████████████████▋          | 8930/10100 [57:48<20:35,  1.06s/pipeline]
Number of new vaild pipelines in currect generation: 23
Generation 97 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  89%|███████████████████████████████████████████████████████████████████████████████▌         | 9022/10100 [58:22<18:36,  1.04s/pipeline]
Number of new vaild pipelines in currect generation: 29
Generation 98 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  90%|████████████████████████████████████████████████████████████████████████████████▎        | 9109/10100 [58:46<25:46,  1.56s/pipeline]
Number of new vaild pipelines in currect generation: 26
Generation 99 - Current best internal CV score: 0.982213438735178                                                                                               
Optimization Progress:  91%|█████████████████████████████████████████████████████████████████████████████████        | 9198/10100 [59:08<20:48,  1.38s/pipeline]
Number of new vaild pipelines in currect generation: 26
Generation 100 - Current best internal CV score: 0.982213438735178                                                                                              
                                                                                                                                                                
Best pipeline: GaussianNB(RBFSampler(input_matrix, RBFSampler__gamma=0.55))


                                                                                          

@PGijsbers
Contributor

PGijsbers commented Jun 23, 2017

@weixuanfu2016 No, your idea of cache hits was correct. As stated just before you made your post (you probably missed it since you were typing yours), I made a mistake that inflated the cache hits significantly (specifically, I logged the population instead of the offspring, meaning I only looked at what remained after selection).

My new results are just in (with correct logging) and results are similar to yours. Here is a (quickly and badly made) bar chart link.

@weixuanfu
Contributor

@PG-TUe thank you for double-checking it!

@PGijsbers
Contributor

I still think the suggested changes to _random_mutation_operator and _mate_operator, or alternatively _pre_test, would be good!

@weixuanfu
Contributor

weixuanfu commented Jun 23, 2017 via email

@weixuanfu
Contributor

@PG-TUe I will work on this PR and test it, since it is just a few small changes.

@PGijsbers
Contributor

@weixuanfu2016 I actually worked on this before I managed to check in here ^^;;
I'll put up a pull request with my changes when I wrap it up; then you can compare and see if you want to use some of it.

@dartdog
Author

dartdog commented Jun 23, 2017

FYI, my new results following the advice here now show progress. I do have xgboost installed, so that is why I don't get that warning message, but I'm wondering: why don't I get these lines?

Number of new vaild pipelines in currect generation: 35

Also, it would be very nice to have more insight into what is going on internally. What is being tried?

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=20, population_size=100, verbosity=2, random_state=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')
Optimization Progress:   9%|▊         | 182/2100 [00:23<03:39,  8.76pipeline/s]
Generation 1 - Current best internal CV score: 0.9739130434782609
Optimization Progress:  13%|█▎        | 272/2100 [00:41<05:09,  5.92pipeline/s]
Generation 2 - Current best internal CV score: 0.9739130434782609
Optimization Progress:  17%|█▋        | 361/2100 [01:07<04:52,  5.95pipeline/s]
Generation 3 - Current best internal CV score: 0.9739130434782609
Optimization Progress:  22%|██▏       | 457/2100 [01:35<04:25,  6.19pipeline/s]
Generation 4 - Current best internal CV score: 0.9826086956521738
Optimization Progress:  26%|██▌       | 551/2100 [02:12<04:59,  5.17pipeline/s]
Generation 5 - Current best internal CV score: 0.9826086956521738
Optimization Progress:  31%|███       | 647/2100 [02:55<03:55,  6.16pipeline/s]
.....
Optimization Progress:  81%|████████  | 1692/2100 [07:20<01:11,  5.70pipeline/s]
Generation 17 - Current best internal CV score: 0.9913043478260869
Optimization Progress:  85%|████████▌ | 1786/2100 [08:00<00:55,  5.68pipeline/s]
Generation 18 - Current best internal CV score: 0.9913043478260869
Optimization Progress:  90%|████████▉ | 1883/2100 [08:34<00:30,  7.02pipeline/s]
Generation 19 - Current best internal CV score: 0.9913043478260869
                                                                                
Generation 20 - Current best internal CV score: 0.9913043478260869

Best pipeline: DecisionTreeClassifier(FastICA(input_matrix, FastICA__tol=0.25), DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=8, DecisionTreeClassifier__min_samples_leaf=19, DecisionTreeClassifier__min_samples_split=5)
0.842105263158

@PGijsbers
Contributor

@dartdog
You do not get the line about the new valid individuals because that is something @weixuanfu2016 added himself when trying to evaluate your issue, for his local version of TPOT only.

also would be very nice to have more insight as to what is going on internally?

Off the top of my head, I am not sure what is exposed in the most recent version of TPOT, but I think you can access the results of all evaluated pipelines with tpot.evaluated_individuals_. What kind of extra information would be useful to you? Perhaps we can expose more of the process in a new release.
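For example, something like this should already let you rank what has been tried so far (a rough sketch, assuming evaluated_individuals_ maps each pipeline string to an (operator count, mean CV score) tuple, as in the dumps shown further down this thread):

```python
# Sketch: rank everything TPOT has evaluated by its internal CV score.
top = sorted(tpot.evaluated_individuals_.items(),
             key=lambda item: item[1][1], reverse=True)[:5]
for pipeline, (n_operators, cv_score) in top:
    print('{:.4f}  ({} operators)  {}'.format(cv_score, n_operators, pipeline))
```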

Small tip: to show code blocks, instead of using a single backtick, use three, i.e.:
```
my code can be
many lines now
```
gives

my code can be
many lines now

@PGijsbers
Contributor

PGijsbers commented Jun 23, 2017

@weixuanfu2016 I worked on the issue; my work can be found here: https://github.com/PG-TUe/tpot/tree/no_repeats
Specifically I changed:

  • Crossover will keep trying new crossovers until one that has not been evaluated before is found.
  • Mutation will keep trying new mutations until one that has not been evaluated before is found.
  • Mutation will no longer consider the mutShrink operator if there is only one primitive in the pipeline (and hence it cannot be shrunk).
  • For longer pipelines, if all shrunk versions have already been considered, it will mutate using a different operator instead.
  • For crossover, only individuals with 2 or more primitives are considered.

The reason I have not submitted a pull request yet is that the last point circumvents a problem rather than fixing it.
Previously, it was possible for two individuals with only one Primitive each to be passed to crossover, which often meant no crossover could take place, and hence no new individual would be made.
Now, with both individuals needing at least two primitives, crossover can almost always take place.

While two individuals with only one Primitive each can undergo crossover, they must have the same typed terminals (in our case, this means the same base learner); in that case crossover will swap out a terminal (in our case, a parameter value).
In practice, two different learners were often matched, and input_matrix would be the only terminal that could be exchanged between the two individuals, producing two pipelines identical to their parents.
Of course, for any two individuals, no matter their structure, it can always be the case that all crossover combinations have been tried before, in which case this solution still reintroduces an old individual.
The mismatch of one-primitive pipelines is now avoided, but this also means that crossover between two such pipelines (swapping terminals) is no longer possible.
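Roughly, the eligibility rule described above amounts to something like this (a sketch; _count_primitives and _pair_is_crossover_eligible are hypothetical helpers, not part of TPOT):

```python
from deap import gp

def _count_primitives(individual):
    # A TPOT individual is a DEAP PrimitiveTree; count its primitive nodes.
    return sum(isinstance(node, gp.Primitive) for node in individual)

def _pair_is_crossover_eligible(ind1, ind2):
    # Only pair up parents that both contain at least two primitives, so
    # cxOnePoint has structure to exchange beyond the shared input_matrix terminal.
    return _count_primitives(ind1) >= 2 and _count_primitives(ind2) >= 2
```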

The problem of duplicates within a generation is largely removed by the fact that mutation and crossover now almost always produce a new individual, rather than one seen before.

I ran the same setup as yesterday (digits, 10 pop, 50 gens), though I cut it short after 20 generations, because they take much longer now, with all these new individuals ;)
Only one generation contained a duplicate, and there were no cache hits.
I ran an experiment on iris with a population of 100 and 36 generations (planned 100, but I accidentally had max_time_mins set), and over those 36 generations, each generation had on average 1.8 duplicates and 0.3 cache hits.

I think that in practice, this solution is probably good enough (for now).

@dartdog
Author

dartdog commented Jun 23, 2017

tpot.evaluated_individuals_ looks quite helpful... at this point all I can think of is maybe a way to prettify the output?

@PGijsbers
Contributor

PGijsbers commented Jun 23, 2017

@dartdog
Hmm, I am not sure why you get this as output:

Optimization Progress: 9%|▊ | 182/2100 [00:23<03:39, 8.76pipeline/s]
Generation 1 - Current best internal CV score: 0.9739130434782609
Optimization Progress: 13%|█▎ | 272/2100 [00:41<05:09, 5.92pipeline/s]
Generation 2 - Current best internal CV score: 0.9739130434782609
Optimization Progress: 17%|█▋ | 361/2100 [01:07<04:52, 5.95pipeline/s]
Generation 3 - Current best internal CV score: 0.9739130434782609
Optimization Progress: 22%|██▏ | 457/2100 [01:35<04:25, 6.19pipeline/s]
Generation 4 - Current best internal CV score: 0.9826086956521738
Optimization Progress: 26%|██▌ | 551/2100 [02:12<04:59, 5.17pipeline/s]
Generation 5 - Current best internal CV score: 0.9826086956521738
Optimization Progress: 31%|███ | 647/2100 [02:55<03:55, 6.16pipeline/s]
.....
Optimization Progress: 81%|████████ | 1692/2100 [07:20<01:11, 5.70pipeline/s]
Generation 17 - Current best internal CV score: 0.9913043478260869
Optimization Progress: 85%|████████▌ | 1786/2100 [08:00<00:55, 5.68pipeline/s]
Generation 18 - Current best internal CV score: 0.9913043478260869
Optimization Progress: 90%|████████▉ | 1883/2100 [08:34<00:30, 7.02pipeline/s]
Generation 19 - Current best internal CV score: 0.9913043478260869

I got something like this:

Generation 1 - Current best internal CV score: 0.9739130434782609
Generation 2 - Current best internal CV score: 0.9739130434782609
Generation 3 - Current best internal CV score: 0.9739130434782609
Generation 4 - Current best internal CV score: 0.9826086956521738
Generation 5 - Current best internal CV score: 0.9826086956521738
.....
Generation 17 - Current best internal CV score: 0.9913043478260869
Generation 18 - Current best internal CV score: 0.9913043478260869
Generation 19 - Current best internal CV score: 0.9913043478260869
Optimization Progress: 90%|████████▉ | 1883/2100 [08:34<00:30, 7.02pipeline/s]

For the final line, we could 'prettify'

Best pipeline: DecisionTreeClassifier(FastICA(input_matrix, FastICA__tol=0.25), DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=8, DecisionTreeClassifier__min_samples_leaf=19, DecisionTreeClassifier__min_samples_split=5)
0.842105263158

to

Best pipeline: DecisionTreeClassifier(FastICA(input_matrix, tol=0.25), criterion=entropy, max_depth=8, min_samples_leaf=19, min_samples_split=5)
0.842105263158
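A quick way to get that cosmetic cleanup today, outside of TPOT (a sketch; prettify is a hypothetical helper that simply strips the Operator__ prefixes from the pipeline string):

```python
import re

def prettify(pipeline_str):
    # Drop the "<OperatorName>__" prefix in front of every hyperparameter.
    return re.sub(r'\w+__', '', pipeline_str)

print(prettify('DecisionTreeClassifier(FastICA(input_matrix, FastICA__tol=0.25), '
               'DecisionTreeClassifier__criterion=entropy)'))
# DecisionTreeClassifier(FastICA(input_matrix, tol=0.25), criterion=entropy)
```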

Anything else you had in mind?

@dartdog
Author

dartdog commented Jun 23, 2017

Actually, just the output of tpot.evaluated_individuals_ in a more attractive format? It is a bit dump-like...
so maybe a few embedded line breaks? I guess I'm just being too picky...

  0.94703557312252973),
 'GaussianNB(GaussianNB(input_matrix))': (2, 0.95569358178053831),
 'XGBClassifier(MinMaxScaler(input_matrix), XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=5, XGBClassifier__min_child_weight=6, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.2)': (2,
  0.69819311123658956),
 'BernoulliNB(input_matrix, BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=DEFAULT)': (1,
  0.35761340109166195),
 'GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.1, GradientBoostingClassifier__max_depth=4, GradientBoostingClassifier__max_features=0.3, GradientBoostingClassifier__min_samples_leaf=5, GradientBoostingClassifier__min_samples_split=16, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=1.0)': (1,
  0.9652173913043478),

@PGijsbers
Contributor

Okay, that makes sense.
This field has only recently been officially exposed; the dump you see is an internal dictionary simply being printed to the screen.
For people wanting to access the TPOT objects themselves, this dictionary is useful as is.
I think perhaps the best solution is to also present a nicer, more legible option for those who just want to quickly gain some insight into what kinds of pipelines have been examined and what their performance was like.

At this point I would prefer not to go into this aspect any further on this specific issue thread.
However, issues #337 and #459 both talk about the visualization of TPOT results, so feel free to contribute to the conversation there.

@dartdog
Author

dartdog commented Jun 23, 2017

Feel free to close this so as not to pile things up.
FWIW, I just stuck the results in a DataFrame for a somewhat prettier presentation...
Hmm, it's not rendering the header, which has the model params...

proc=pd.DataFrame(tpot.evaluated_individuals_)
proc.head()

0	2.00000	2.000000	2.000000	2.0000	2.000000	2.00000	2.000000	2.00000	2.000000	2.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.00000
1	0.36621	0.750456	0.929903	0.7786	0.669433	0.36621	0.651647	0.36621	0.860926	0.851402	...	0.940114	0.725927	0.893864	0.894297	0.356686	0.946932	0.348353	0.285437	0.938599	0.36621
2 rows × 1410 columns
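(One possible workaround, sketched under the assumption that evaluated_individuals_ maps each pipeline string to an (operator_count, cv_score) tuple: build the frame with orient='index', so the pipeline definitions become the row index rather than 1410 unlabeled column headers.)

```python
import pandas as pd

proc = pd.DataFrame.from_dict(tpot.evaluated_individuals_, orient='index')
proc.columns = ['operator_count', 'internal_cv_score']
proc.sort_values('internal_cv_score', ascending=False).head()
```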

@PGijsbers
Contributor

@dartdog Will do once the respective pull request is made.
@weixuanfu2016 Let me know if you think I should just send a PR for the fixes as I implemented them.

@weixuanfu
Contributor

weixuanfu commented Jun 23, 2017

@PG-TUe Thank you for the fixes.

For crossover, I think a pair can make a crossover as long as at least one individual has 2 or more Primitives. We would also need to change the crossover operator for this purpose. For example:

PCA-RFC X LR --> PCA-LR + RFC
# Note: PCA-LR is a new individual, output as ind1 from a modified cxOnePoint

Though crossover may not make a big difference compared with mutation.

I think you can make a PR for these fixes.

@PGijsbers
Contributor

PGijsbers commented Jun 23, 2017

@weixuanfu2016 The length of an individual is defined by the number of primitives and terminals (code you linked).
You can have crossover given two individuals with only one Primitive, try this example where two one-primitive individuals perform crossover by swapping a terminal:

from tpot import TPOTClassifier
from deap import creator, gp
tpot = TPOTClassifier()

ind1 = creator.Individual.from_string('BernoulliNB(input_matrix, BernoulliNB__alpha=10.0, BernoulliNB__fit_prior=True)', tpot._pset)
ind2 = creator.Individual.from_string('BernoulliNB(input_matrix, BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False)', tpot._pset)

new1, new2 = gp.cxOnePoint(ind1, ind2)
str(new1)
>>> BernoulliNB(input_matrix, BernoulliNB__alpha=10.0, BernoulliNB__fit_prior=False)

So two individuals are eligible for crossover if they have any one terminal in common.
For TPOT, we want to select any two individuals from the population for crossover as long as either (1)
they both have only one Primitive, but it is the same one (which means the same Terminal types*), or (2) at least one of them has two Primitives.

I will go ahead and make a PR.

* Technically, all one-primitive individuals have one terminal type in common regardless of primitive type, namely the input_matrix terminal, but this is always the same terminal for one-primitive individuals, so crossover there does not generate a new individual.

@dartdog
Author

dartdog commented Jun 23, 2017

Further (sorry), when running the same example I get:


Best pipeline: LinearSVC(BernoulliNB(input_matrix, BernoulliNB__alpha=0.001, BernoulliNB__fit_prior=True), LinearSVC__C=DEFAULT, LinearSVC__dual=True, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=0.1)
0.947368421053
yet none of the tpot.evaluated_individuals_ results show the same result numbers... and in fact this seems to be the "best" from that output:

BernoulliNB(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=7, DecisionTreeClassifier__min_samples_leaf=2, DecisionTreeClassifier__min_samples_split=3), BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=DEFAULT)
(2, 0.962771)

@weixuanfu
Contributor

The scores in tpot.evaluated_individuals_ are the average CV scores from cross_val_score. The output of tpot.score is the score of the best pipeline evaluated on the test data you pass in (see these lines). @dartdog
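To illustrate the difference (a sketch, assuming the same (operator count, CV score) tuple layout for evaluated_individuals_ as shown earlier in this thread):

```python
# Best internal CV score recorded during the search ...
best_pipeline = max(tpot.evaluated_individuals_,
                    key=lambda key: tpot.evaluated_individuals_[key][1])
print(tpot.evaluated_individuals_[best_pipeline])  # (operator count, mean CV score)

# ... versus the score of the fitted best pipeline on the held-out test set.
print(tpot.score(X_test, y_test))
```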

@rhiever
Contributor

rhiever commented Jul 18, 2017

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.
