
OverflowError: Python int too large to convert to C long #84
Closed
magsol opened this issue Feb 18, 2016 · 8 comments
magsol commented Feb 18, 2016

I tried running TPOT on a reasonably small dataset (141 data points, each with 78 features) for 10-way classification. In the interest of nearly brute-forcing it, I set TPOT to run for 1,000 generations with a population size of 1,000.

Unfortunately, it made it through only 27 generations before crashing with OverflowError: Python int too large to convert to C long. The full stack trace is reproduced below:

Traceback (most recent call last):
  File "/opt/conda/bin/tpot", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 1213, in main
    tpot.fit(training_features, training_classes)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 219, in fit
    stats=stats, halloffame=self.hof, verbose=verbose)
  File "/opt/conda/lib/python2.7/site-packages/deap/algorithms.py", line 169, in eaSimple
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 1040, in _evaluate_individual
    result = func(training_testing_data)
  File "<string>", line 1, in <lambda>
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 568, in gradient_boosting
    return self._train_model_and_predict(input_df, GradientBoostingClassifier, learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth, random_state=42)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 606, in _train_model_and_predict
    clf.fit(training_features, training_classes)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 1025, in fit
    begin_at_stage, monitor, X_idx_sorted)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 1080, in _fit_stages
    X_csc, X_csr)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 784, in _fit_stage
    check_input=False, X_idx_sorted=X_idx_sorted)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/tree/tree.py", line 342, in fit
    max_depth)
  File "sklearn/tree/_tree.pyx", line 134, in sklearn.tree._tree.DepthFirstTreeBuilder.__cinit__ (sklearn/tree/_tree.c:3107)
OverflowError: Python int too large to convert to C long

I'm not familiar enough with TPOT's internals to diagnose this on my own, though I have a fair idea of the basic problem: somewhere a Python integer grows beyond the range of a C long before it is handed off to native code. As for why, I'm unsure and could use some suggestions.
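For what it's worth, the error itself is easy to reproduce in isolation. Python ints are arbitrary precision, but any C extension that stores them into a fixed-width C long (sklearn's Cython tree builder in the trace above, or the stdlib array module here) must convert first, and CPython raises this exact OverflowError when the value doesn't fit. A minimal sketch:

```python
import array

# A C long is fixed-width (typically 64 bits on Linux, 32 on Windows),
# so storing an oversized Python int into a C-long-backed container
# triggers the same conversion error seen in the traceback above.
try:
    array.array('l', [2**100])  # far beyond any C long
except OverflowError as e:
    print(e)  # same "int too large to convert to C long" message
```

This suggests some evolved hyperparameter (e.g. a tree depth or estimator count) grew huge before reaching sklearn's Cython code.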

rhiever commented Feb 18, 2016

Interesting... this appears to be an issue with sklearn's GradientBoostingClassifier, or more specifically, with the decision trees that it constructs during gradient tree boosting.

Is the data set you're using publicly available (or can it be)? I'd be interested to see if this bug is reproducible on a simple GradientBoostingClassifier call.

Regardless, I think your issue raises a broader point: we should put exception handling around the pipeline evaluation function so that a single invalid or faulty pipeline can't crash all of TPOT. I've filed this as a bug in #85.
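That guard could look something like the following sketch (the names safe_evaluate, evaluate_pipeline, and WORST_FITNESS are illustrative, not TPOT's actual internals): wrap each pipeline evaluation in a try/except and hand back a sentinel worst-case fitness when it throws, so evolution selects against the broken pipeline instead of aborting the whole run.

```python
# Illustrative sketch only; these names are hypothetical, not TPOT's API.
WORST_FITNESS = 0.0

def safe_evaluate(evaluate_pipeline, pipeline, data):
    """Evaluate one pipeline, trapping crashes instead of killing the run."""
    try:
        return evaluate_pipeline(pipeline, data)
    except (OverflowError, ValueError, MemoryError):
        # A pipeline that cannot even be fitted gets the worst fitness,
        # so the evolutionary search discards it on the next generation.
        return WORST_FITNESS
```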

magsol commented Feb 18, 2016

Hopefully in a few months the paper I'm working on with this dataset will be published and I can publish the dataset as well ;)

I've noticed this bug pop up a lot on Python 2 installs (since many built-ins there return lists rather than lazy iterators). Would switching to Python 3 be a potential fix?

rhiever commented Feb 18, 2016

Well, I always recommend upgrading to Python 3. ;-)

Try running this code on your data set and see if it generates an error:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(my_data_features, my_data_targets,
                                                    train_size=0.75, test_size=0.25)
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
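As a side note for anyone landing here later: sklearn.cross_validation was the right import in early 2016 but was later removed in scikit-learn 0.20 in favor of sklearn.model_selection. A self-contained modern equivalent of the check above (substituting the bundled iris data, since the dataset in this issue isn't public):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # replaces cross_validation

# Stand-in data; swap in your own features/targets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy in [0, 1]
```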

rhiever commented Feb 18, 2016

I updated the code above to do the CV split.

magsol commented Feb 18, 2016

That seems to work just fine; it spits out an accuracy with no problems (in both Python 2 and 3).

rhiever commented Feb 18, 2016

Interesting. TPOT must have been performing some feature transformations before passing them to the classifier.


rhiever commented Feb 19, 2016

Since #85 should address this issue, I'm going to close this one. Please feel free to reopen if you need further assistance.

rhiever closed this as completed Feb 19, 2016
magsol commented Feb 19, 2016

SGTM. I've been running the same thing on Python 3 and haven't run into any problems yet (it's already run for twice as long). Will update on #85 if I hit a roadblock. Thanks!

rhiever added a commit that referenced this issue Feb 20, 2016
Do not allow one pipeline that crashes to cause TPOT to crash. Instead,
assign the crashing pipeline a poor fitness.

Reference: #84 and #85
rhiever mentioned this issue Feb 24, 2016