Adding the initial examples #60
- Added IRIS example
- Added the updated exported files
- Modified description of Using_TPOT_via_code
- Added a tutorial folder with Jupyter notebooks for follow along examples
from sklearn.cross_validation import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75)
Please ensure that test_size=0.25 is also in this call. See #52 for the reasoning.
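A minimal sketch of the requested call, with both sizes given explicitly. It assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` (the `sklearn.cross_validation` module used in this PR was removed in later releases); `random_state=0` is added here only to make the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Pass train_size and test_size together so the split proportions are explicit.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    train_size=0.75,
    test_size=0.25,
    random_state=0,  # assumption: fixed seed, only for reproducibility
)

# Every sample ends up in exactly one of the two subsets.
print(X_train.shape[0] + X_test.shape[0] == iris.data.shape[0])
```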
Great start! Thank you @pronojitsaha.
Sounds fine to me. Although it's easy enough for people to copy-and-paste the code from our docs, I could see this being a useful folder for detailed notebooks that describe the process step-by-step, like this one.
👍
Can you please describe the error you encountered? Was it an issue with data encoding? At this point, I think we need to clarify in the docs that TPOT expects all of the features and classes to be numerical. That said, we can certainly raise an issue to let TPOT handle non-numeric features/classes by mapping them to numerical representations within TPOT.
Thanks!
Further, could we look at a different pre-processing for non-numerical/categorical features, such as creating dummy flags and encoding the categorical features, as a separate issue?
I have made the changes.
Absolutely, I think this is a convenience feature we should look into adding to TPOT: Detect if there are non-numerical features and encode them as numerical features before passing them to the optimizer.
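The idea of detecting non-numerical features and encoding them could look roughly like the sketch below. This is not TPOT code: `encode_non_numeric` and the toy columns are hypothetical names used only for illustration, and the sketch assumes the data is held in a pandas DataFrame.

```python
import pandas as pd


def encode_non_numeric(df):
    """Return a copy of df where each non-numeric column is replaced by integer codes."""
    encoded = df.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # factorize assigns 0, 1, 2, ... in order of first appearance;
            # missing values are coded as -1.
            codes, _ = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded


# Hypothetical toy data, not the actual Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "class": [0, 1, 1],
})
encoded_toy = encode_non_numeric(toy)
print(encoded_toy)
```

Integer codes impose an arbitrary ordering on the categories, so the dummy-flag approach mentioned above (for example via `pd.get_dummies`) may be preferable for some models.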
For the titanic data, a call to
So as it turns out some of the training indices produced by
in Further, checking my
That version of the titanic data set is quite messy: it contains non-numerical values, missing values, etc. Do you have a clean version that you passed to TPOT?
I see. That fix isn't in the latest version on pip. You'd have to install tpot from development:
That will install the development version of tpot onto your system. Alternatively, you can
Thanks! Good news: I imported from the base directory and it now uses the development version. Bad news: the problem remains exactly the same. I then dropped the features having categorical values and kept only features having numerical values i.e.
and then imputed missing values with a placeholder value (-999), but the exact same problem with indices still persists.
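The preprocessing described here (keep only numerical features, then fill missing values with -999) might be written as follows. The column names are hypothetical stand-ins, since the actual list of retained features is not shown in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real columns from train.csv are not reproduced here.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
    "class": [0, 1, 1],
})

# Keep only numeric columns, then replace missing values with the placeholder.
numeric_only = toy.select_dtypes(include="number").fillna(-999)
print(numeric_only)
```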
Can you please share the latest version of this data set? I'll take a look.
Here it is:
Here's the code I used:
from tpot import TPOT
from sklearn.cross_validation import StratifiedShuffleSplit
import pandas as pd
pipeline_optimizer = TPOT(verbosity=2)
tpot_data = pd.read_csv('train.csv', sep=',')
tpot_data.rename(columns={'Survived': 'class'}, inplace=True)
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))
pipeline_optimizer.fit(tpot_data.loc[training_indeces].drop('class', axis=1).values,
                       tpot_data.loc[training_indeces, 'class'].values)
It seems to work fine, except it throws
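For readers on a newer scikit-learn, the same stratified split might be sketched as below. It assumes the `sklearn.model_selection` API, in which `StratifiedShuffleSplit` takes `n_splits` and yields indices from `.split(X, y)`. The small DataFrame is a hypothetical stand-in for `train.csv`, and the TPOT call is left out so the sketch does not depend on a particular TPOT version.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for train.csv with the target already renamed to 'class'.
tpot_data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    "class": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Convert to numpy arrays up front so the positional indices from the
# splitter can be used directly, independent of the DataFrame's index labels.
features = tpot_data.drop("class", axis=1).values
labels = tpot_data["class"].values

splitter = StratifiedShuffleSplit(
    n_splits=1, train_size=0.75, test_size=0.25, random_state=0
)
training_indices, testing_indices = next(splitter.split(features, labels))

X_train, y_train = features[training_indices], labels[training_indices]
X_test, y_test = features[testing_indices], labels[testing_indices]
print(len(training_indices), len(testing_indices))
```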
OK, I will look into this. Thanks.
OK, so your code works. I had used I did some preprocessing as follows to make the data compliant with TPOT's requirements:
Did a However, this results in the following error:
Try:
Otherwise a DataFrame is being passed rather than a numpy array. If that doesn't work: It looks like the
should be
I'll fix that now.
Thanks @rhiever. I did use .values earlier but it didn't work either. I think result.loc is the solution for this hashing issue. Anyway, I updated my local copy from your master, and now I get a new error
So the problem was with my pandas installation only. Fixed it. Also got predict to work! Will update the material soon.
Interesting - old version of pandas?
No, the locale information somehow got corrupted in .bash_profile and that affected pandas. Corrected it as
examples. Let me know if this is required?
Follow-on work: