
OverflowError: Python int too large to convert to C long #84
Closed
magsol opened this issue Feb 18, 2016 · 8 comments
magsol commented Feb 18, 2016

I tried running TPOT on a reasonably small dataset (141 data points, each with 78 features) for 10-way classification. In the interest of nearly brute-forcing it, I set TPOT to run for 1,000 generations with a population size of 1,000.

Unfortunately, it made it through only 27 generations before crashing with OverflowError: Python int too large to convert to C long. The full stack trace is reproduced below:

Traceback (most recent call last):
  File "/opt/conda/bin/tpot", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 1213, in main
    tpot.fit(training_features, training_classes)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 219, in fit
    stats=stats, halloffame=self.hof, verbose=verbose)
  File "/opt/conda/lib/python2.7/site-packages/deap/algorithms.py", line 169, in eaSimple
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 1040, in _evaluate_individual
    result = func(training_testing_data)
  File "<string>", line 1, in <lambda>
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 568, in gradient_boosting
    return self._train_model_and_predict(input_df, GradientBoostingClassifier, learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth, random_state=42)
  File "/opt/conda/lib/python2.7/site-packages/tpot/tpot.py", line 606, in _train_model_and_predict
    clf.fit(training_features, training_classes)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 1025, in fit
    begin_at_stage, monitor, X_idx_sorted)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 1080, in _fit_stages
    X_csc, X_csr)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 784, in _fit_stage
    check_input=False, X_idx_sorted=X_idx_sorted)
  File "/opt/conda/lib/python2.7/site-packages/sklearn/tree/tree.py", line 342, in fit
    max_depth)
  File "sklearn/tree/_tree.pyx", line 134, in sklearn.tree._tree.DepthFirstTreeBuilder.__cinit__ (sklearn/tree/_tree.c:3107)
OverflowError: Python int too large to convert to C long

I'm not familiar enough with TPOT's internals to diagnose this on my own, though I have a fair idea of the basic problem: somewhere a Python integer grows beyond the range of a C long before it is handed off to native code. As for why, I'm unsure and could use some suggestions.
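For what it's worth, the error itself is easy to reproduce in isolation. Python ints are arbitrary precision, but any C extension that stores them into a fixed-width C long (sklearn's Cython tree builder in the trace above, or the stdlib array module here) must convert first, and CPython raises this exact OverflowError when the value doesn't fit. A minimal sketch:

```python
import array

# A C long is fixed-width (typically 64 bits on Linux, 32 on Windows),
# so storing an oversized Python int into a C-long-backed container
# triggers the same conversion error seen in the traceback above.
try:
    array.array('l', [2**100])  # far beyond any C long
except OverflowError as e:
    print(e)  # same "int too large to convert to C long" message
```

This suggests some evolved hyperparameter (e.g. a tree depth or estimator count) grew huge before reaching sklearn's Cython code.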

rhiever commented Feb 18, 2016

Interesting... this appears to be an issue with sklearn's GradientBoostingClassifier, or more specifically, with the decision trees that it constructs during gradient tree boosting.

Is the data set you're using publicly available (or can it be)? I'd be interested to see if this bug is reproducible on a simple GradientBoostingClassifier call.

Regardless, I think your issue raises a broader point: we should put exception handling around the pipeline evaluation function so that a single invalid or faulty pipeline can't crash all of TPOT. I've filed this as a bug in #85.
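That guard could look something like the following sketch (the names safe_evaluate, evaluate_pipeline, and WORST_FITNESS are illustrative, not TPOT's actual internals): wrap each pipeline evaluation in a try/except and hand back a sentinel worst-case fitness when it throws, so evolution selects against the broken pipeline instead of aborting the whole run.

```python
# Illustrative sketch only; these names are hypothetical, not TPOT's API.
WORST_FITNESS = 0.0

def safe_evaluate(evaluate_pipeline, pipeline, data):
    """Evaluate one pipeline, trapping crashes instead of killing the run."""
    try:
        return evaluate_pipeline(pipeline, data)
    except (OverflowError, ValueError, MemoryError):
        # A pipeline that cannot even be fitted gets the worst fitness,
        # so the evolutionary search discards it on the next generation.
        return WORST_FITNESS
```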

magsol commented Feb 18, 2016

Hopefully in a few months the paper I'm working on with this dataset will be published and I can publish the dataset as well ;)

I've noticed this bug pop up a lot on Python 2 installs (since many built-ins there return lists rather than lazy iterators). Would switching to Python 3 be a potential fix?

rhiever commented Feb 18, 2016

Well, I always recommend upgrading to Python 3. ;-)

Try running this code on your data set and see if it generates an error:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(my_data_features, my_data_targets,
                                                    train_size=0.75, test_size=0.25)
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
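As a side note for anyone landing here later: sklearn.cross_validation was the right import in early 2016 but was later removed in scikit-learn 0.20 in favor of sklearn.model_selection. A self-contained modern equivalent of the check above (substituting the bundled iris data, since the dataset in this issue isn't public):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # replaces cross_validation

# Stand-in data; swap in your own features/targets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy in [0, 1]
```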

rhiever commented Feb 18, 2016

I updated the code above to do the CV split.

magsol commented Feb 18, 2016

That seems to work just fine; it spits out an accuracy with no problems (in both Python 2 and 3).

rhiever commented Feb 18, 2016

Interesting. TPOT must have been performing some feature transformations before passing them to the classifier.


rhiever commented Feb 19, 2016

Since #85 should address this issue, I'm going to close this one. Please feel free to reopen if you need further assistance.

rhiever closed this as completed Feb 19, 2016
magsol commented Feb 19, 2016

SGTM. I've been running the same thing on Python 3 and haven't run into any problems yet (it's already run for twice as long). Will update on #85 if I hit a roadblock. Thanks!

rhiever added a commit that referenced this issue Feb 20, 2016
Do not allow one pipeline that crashes to cause TPOT to crash. Instead,
assign the crashing pipeline a poor fitness.

Reference: #84 and #85
rhiever mentioned this issue Feb 24, 2016