
Adding the initial examples #60

Merged
3 commits merged on Dec 16, 2015

Conversation

pronojitsaha
Contributor

  • Added IRIS example
  • Added the updated exported files
  • Modified description of Using_TPOT_via_code
  • Added a tutorial folder with Jupyter notebooks for follow-along examples. Let me know if this is required.

Follow-on work:

  • Need to add more detailed descriptions of fit, score & export.
  • Encountered an error while working on the titanic data set for the examples. Need your input on the same.

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75)
Contributor
Please ensure that test_size=0.25 is also in this call. See #52 for the reasoning.
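
For reference, the call with both sizes spelled out would then look like this (a sketch of the requested change, not the committed code):

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75, test_size=0.25)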

@rhiever
Contributor

rhiever commented Dec 16, 2015

Great start! Thank you @pronojitsaha.

Added a tutorial folder with Jupyter notebooks for follow-along examples. Let me know if this is required.

Sounds fine to me. Although it's easy enough for people to copy-and-paste the code from our docs, I could see this being a useful folder for detailed notebooks that describe the process step-by-step, like this one.

Need to add more detailed descriptions of fit, score & export.

👍

Encountered an error while working on the titanic data set for the examples. Need your input on the same.

Can you please describe the error you encountered? Was it an issue with data encoding? At this point, I think we need to clarify in the docs that TPOT expects all of the features and classes to be numerical. That said, this is certainly something we could raise an issue for, so that TPOT can handle non-numeric features/classes by mapping them to numerical representations internally.

@pronojitsaha
Contributor Author

Thanks!

  • Yes, the idea is to have much larger notebooks in later phases. This is just a container for now.
  • I will check on the titanic data set again and report back. Yes, I think we should note the limitation to numerical features in the documentation for now; will do that. I think that might be the culprit here as well, since the titanic data set has non-numerical features.

Further, we could certainly look at pre-processing for non-numerical/categorical features (creating dummy flags and encoding categorical features) as a separate issue?

@pronojitsaha
Contributor Author

I have made the changes.

@rhiever
Contributor

rhiever commented Dec 16, 2015

Further, we could certainly look at pre-processing for non-numerical/categorical features (creating dummy flags and encoding categorical features) as a separate issue?

Absolutely, I think this is a convenience feature we should look into adding to TPOT: Detect if there are non-numerical features and encode them as numerical features before passing them to the optimizer.
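
As a rough illustration of what that convenience feature might do (a sketch only, not TPOT's actual implementation; the helper name and the choice of LabelEncoder are assumptions), object-dtype columns could be detected and encoded before the data reaches the optimizer:

from sklearn.preprocessing import LabelEncoder

def encode_non_numeric(df):
    # Hypothetical helper: label-encode every object-dtype (string) column,
    # leaving numerical columns untouched
    df = df.copy()
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = LabelEncoder().fit_transform(df[column].astype(str))
    return df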

rhiever pushed a commit that referenced this pull request Dec 16, 2015
Adding the initial examples
rhiever merged commit 0925ffe into EpistasisLab:master on Dec 16, 2015
@pronojitsaha
Contributor Author

For the titanic data, a call to tpot.fit(X_train, y_train) raises a KeyError from the line training_testing_data.loc[training_indeces, 'group'] = 'training' as follows:

KeyError: '[460 75 196 430 221 350 294 610 561 207 84 24 291 281 432 29 134 456\n 467 126 289 336 246 104 38 22 220 488 273 418 177 457 590 613 484 557\n 151 609 642 322 152 558 556 127 532 284 361 657 564 487 358 123 539 380\n 280 441 43 227 549 202 204 449 72 629 165 143 265 553 6 311 173 200\n 297 599 634 192 435 219 568 156 277 544 531 224 563 379 225 399 398 1\n 570 529 14 97 517 575 428 189 187 353 534 344 130 434 643 502 442 70\n 272 56 305 76 85 217 174 420 140 581 522 182 144 3 631 505 268 472\n 396 326 400 10 264] not in index'

So it turns out that some of the training indices produced by

training_indices, testing_indices = next(iter(StratifiedShuffleSplit(training_testing_data['class'].values, n_iter=1, train_size=0.75)))

in tpot.fit() are not in X_train but in X_test (which is not even used in the call to fit()). What seems to be missing here? I have attached the data set.
titanic_train.csv.zip

Further, checking my anaconda/lib/python2.7/site-packages/tpot/tpot.pyc, it seems it's an old version and not the current one (which has test_size mentioned in StratifiedShuffleSplit).
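
For what it's worth, here is a minimal standalone sketch (made-up data, not the titanic set) of how .loc can raise this kind of KeyError once a DataFrame's index labels no longer line up with the 0..n-1 positions that StratifiedShuffleSplit returns:

import numpy as np
import pandas as pd

# A frame whose index labels are not 0..n-1, e.g. a slice of a larger frame
df = pd.DataFrame({'class': np.random.randint(0, 2, 10)}, index=range(100, 110))
positions = np.array([0, 3, 7])  # positional indices, as a splitter would return

df.iloc[positions]           # fine: positional lookup
df.loc[positions, 'class']   # KeyError: labels 0, 3 and 7 are not in the index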

@rhiever
Contributor

rhiever commented Dec 17, 2015

That version of the titanic data set is quite messy: it contains non-numerical values, missing values, etc. Do you have a clean version that you passed to TPOT?

@pronojitsaha
Contributor Author

No, I did not clean it. However, I do not think the error is due to that, as the problem is in the train/test splitting (we don't get to the pipeline optimisation stages at all).

I tried to upgrade tpot using pip, which succeeded, but as I stated earlier, my anaconda folder still shows an old version of tpot.py, which is the one referenced in the error. I believe the error may be due to this. Screenshot attached. Any inputs on this?
screen shot 2015-12-17 at 10 57 28 pm

@rhiever
Contributor

rhiever commented Dec 17, 2015

I see. That fix isn't in the latest version on pip. You'd have to install tpot from development:

  1. Download and unzip https://github.com/rhiever/tpot/archive/master.zip

  2. cd into the tpot directory you unzipped

  3. python setup.py build

  4. python setup.py install

That will install the development version of tpot onto your system.

Alternatively, you can cd into the directory you're using to develop tpot, sync (i.e., pull) the latest updates from github, and then any code you run in the base tpot directory will reference the development tpot version rather than the version installed via pip.

@pronojitsaha
Contributor Author

Thanks! Good news: I imported from the base directory and it now uses the development version. Bad news: the problem remains exactly the same.

I then dropped the features with categorical values and kept only the features with numerical values, i.e.

PassengerId int64
Survived int64
Pclass int64
Sex int64
Age float64
SibSp int64
Parch int64
Fare float64
Embarked float64
dtype: object

and then imputed missing values with a placeholder value (-999), but the exact same problem with the indices persists.

@rhiever
Contributor

rhiever commented Dec 18, 2015

Can you please share the latest version of this data set? I'll take a look.

@pronojitsaha
Contributor Author

Here it is:
train.zip

@rhiever
Contributor

rhiever commented Dec 18, 2015

Here's the code I used:

from tpot import TPOT
from sklearn.cross_validation import StratifiedShuffleSplit
import pandas as pd

pipeline_optimizer = TPOT(verbosity=2)

tpot_data = pd.read_csv('train.csv', sep=',')
tpot_data.rename(columns={'Survived': 'class'}, inplace=True)
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))

pipeline_optimizer.fit(tpot_data.loc[training_indeces].drop('class', axis=1).values,
                       tpot_data.loc[training_indeces, 'class'].values)

It seems to work fine, except it throws ValueError: could not convert string to float: 'W.E.P. 5734' when the data is passed to a StandardScaler (as would be expected since it is non-numerical data).
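
As an aside, a quick way to spot the columns that would trip up the StandardScaler (a generic pandas check on the tpot_data frame loaded above, not part of TPOT itself):

# List the columns that still hold strings and cannot be cast to float;
# for the standard Kaggle file this flags Name, Sex, Ticket, Cabin and Embarked
print(tpot_data.dtypes[tpot_data.dtypes == object])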

@pronojitsaha
Contributor Author

OK, I will look into this. Thanks.

@pronojitsaha
Contributor Author

OK, so your code works. I had used train_test_split, and I believe that was the culprit in my case, though I am not quite sure why.

I did some preprocessing as follows to make the data compliant with the tpot requirements:

titanic.rename(columns={'Survived': 'class'}, inplace=True)
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})
titanic_new = titanic.drop(['Name','Ticket','Cabin'], axis=1)
titanic_new = titanic_new.fillna(-999)

I did a fit() and score() to get ~80% accuracy. Then, to predict on the submission test set, I applied the same preprocessing as above and ran the following command:
tpot.predict(titanic_new.drop('class', axis=1), titanic_new['class'], titanic_sub)

However, this results in the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed, raised on the line return result[result['group'] == 'testing', 'guess'].values in tpot.predict(). Any inputs?

@rhiever
Contributor

rhiever commented Dec 20, 2015

Try:

tpot.predict(titanic_new.drop('class', axis=1).values, titanic_new['class'].values, titanic_sub.values)

Otherwise a DataFrame is being passed rather than a numpy array.

If that doesn't work:

It looks like the predict function wasn't coded correctly at the end:

return result[result['group'] == 'testing', 'guess'].values

should be

return result.loc[result['group'] == 'testing', 'guess'].values

I'll fix that now.
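
For context, a small generic pandas example (any frame with 'group' and 'guess' columns) showing why the original line fails: plain [] indexing receives the (boolean Series, 'guess') tuple as a single column key and tries to hash it, while .loc takes the row mask and the column label separately:

import pandas as pd

result = pd.DataFrame({'group': ['training', 'testing'], 'guess': [0, 1]})

# result[result['group'] == 'testing', 'guess']     # raises TypeError: 'Series' objects are mutable, thus they cannot be hashed
result.loc[result['group'] == 'testing', 'guess']   # works: the 'guess' values for the testing rows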

@pronojitsaha
Contributor Author

Thanks @rhiever. I did use .values earlier, but it didn't work either. I think result.loc is the solution for this hashing issue.

Anyway, I updated my local copy from your master, and now I get a new error, ValueError: unknown locale: UTF-8, at the import pandas as pd line in tpot. Are there any changes you implemented that I should know of?

@pronojitsaha
Contributor Author

So the problem was only with my pandas setup. Fixed it. Also got predict to work! Will update the material soon.

@rhiever
Contributor

rhiever commented Dec 21, 2015

Interesting - old version of pandas?

@pronojitsaha
Contributor Author

No, the locale information somehow got corrupted in .bash_profile, and that affected pandas. Corrected it with:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
