
Adding the initial examples #60

Merged
3 commits merged on Dec 16, 2015

Conversation

pronojitsaha
Contributor

  • Added IRIS example
  • Added the updated exported files
  • Modified description of Using_TPOT_via_code
  • Added a tutorial folder with Jupyter notebooks for follow-along examples. Let me know if this is required.

Follow-on work:

  • Need to add more detailed descriptions of fit, score & export.
  • Encountered an error while working on the titanic data set for the examples. Need your input on the same.

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75)
Contributor
Please ensure that test_size=0.25 is also in this call. See #52 for the reasoning.
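
For reference, the call with both sizes spelled out would then look like this (a sketch of the requested change, not the committed code):

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75, test_size=0.25)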

@rhiever
Contributor

rhiever commented Dec 16, 2015

Great start! Thank you @pronojitsaha.

Added a tutorial folder with Jupyter notebooks for follow-along examples. Let me know if this is required.

Sounds fine to me. Although it's easy enough for people to copy-and-paste the code from our docs, I could see this being a useful folder for detailed notebooks that describe the process step-by-step, like this one.

Need to add more detailed descriptions of fit, score & export.

👍

Encountered an error while working on the titanic data set for the examples. Need your input on the same.

Can you please describe the error you encountered? Was it an issue with data encoding? At this point, I think we need to clarify in the docs that TPOT expects all of the features and classes to be numerical. That said, this is certainly something we could raise an issue for, so that TPOT can handle non-numeric features/classes by mapping them to numerical representations internally.

@pronojitsaha
Contributor Author

Thanks!

  • Yes, the idea is to have much larger notebooks in later phases. This is just a container for now.
  • I will check on the titanic data set again and report back. Yes, I think we should note the limitation to numerical features in the documentation for now; will do that. I think that might be the culprit here as well, since the titanic data set has non-numerical features.

Further, we could certainly look at pre-processing for non-numerical/categorical features (creating dummy flags and encoding categorical features) as a separate issue?

@pronojitsaha
Contributor Author

I have made the changes.

@rhiever
Contributor

rhiever commented Dec 16, 2015

Further, we could certainly look at pre-processing for non-numerical/categorical features (creating dummy flags and encoding categorical features) as a separate issue?

Absolutely, I think this is a convenience feature we should look into adding to TPOT: Detect if there are non-numerical features and encode them as numerical features before passing them to the optimizer.
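
As a rough illustration of what that convenience feature might do (a sketch only, not TPOT's actual implementation; the helper name and the choice of LabelEncoder are assumptions), object-dtype columns could be detected and encoded before the data reaches the optimizer:

from sklearn.preprocessing import LabelEncoder

def encode_non_numeric(df):
    # Hypothetical helper: label-encode every object-dtype (string) column,
    # leaving numerical columns untouched
    df = df.copy()
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = LabelEncoder().fit_transform(df[column].astype(str))
    return df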

rhiever pushed a commit that referenced this pull request Dec 16, 2015
Adding the initial examples
rhiever merged commit 0925ffe into EpistasisLab:master on Dec 16, 2015
@pronojitsaha
Contributor Author

For the titanic data, a call to tpot.fit(X_train, y_train) raises a KeyError from the line training_testing_data.loc[training_indeces, 'group'] = 'training' as follows:

KeyError: '[460 75 196 430 221 350 294 610 561 207 84 24 291 281 432 29 134 456\n 467 126 289 336 246 104 38 22 220 488 273 418 177 457 590 613 484 557\n 151 609 642 322 152 558 556 127 532 284 361 657 564 487 358 123 539 380\n 280 441 43 227 549 202 204 449 72 629 165 143 265 553 6 311 173 200\n 297 599 634 192 435 219 568 156 277 544 531 224 563 379 225 399 398 1\n 570 529 14 97 517 575 428 189 187 353 534 344 130 434 643 502 442 70\n 272 56 305 76 85 217 174 420 140 581 522 182 144 3 631 505 268 472\n 396 326 400 10 264] not in index'

So it turns out that some of the training indices produced by

training_indices, testing_indices = next(iter(StratifiedShuffleSplit(training_testing_data['class'].values, n_iter=1, train_size=0.75)))

in tpot.fit() are not in X_train but in X_test (which is not even used in the call to fit()). What seems to be missing here? I have attached the data set.
titanic_train.csv.zip

Further, checking my anaconda/lib/python2.7/site-packages/tpot/tpot.pyc, it seems it's an old version and not the current one (which has test_size mentioned in StratifiedShuffleSplit).
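
For what it's worth, here is a minimal standalone sketch (made-up data, not the titanic set) of how .loc can raise this kind of KeyError once a DataFrame's index labels no longer line up with the 0..n-1 positions that StratifiedShuffleSplit returns:

import numpy as np
import pandas as pd

# A frame whose index labels are not 0..n-1, e.g. a slice of a larger frame
df = pd.DataFrame({'class': np.random.randint(0, 2, 10)}, index=range(100, 110))
positions = np.array([0, 3, 7])  # positional indices, as a splitter would return

df.iloc[positions]           # fine: positional lookup
df.loc[positions, 'class']   # KeyError: labels 0, 3 and 7 are not in the index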

@rhiever
Contributor

rhiever commented Dec 17, 2015

That version of the titanic data set is quite messy: it contains non-numerical values, missing values, etc. Do you have a clean version that you passed to TPOT?

@pronojitsaha
Contributor Author

No, I did not clean it. However, I do not think the error is due to that, as the problem is in the train/test splitting (we don't get to the pipeline optimisation stages at all).

I tried to upgrade tpot using pip, which succeeded, but as I stated earlier, my anaconda folder still shows an old version of tpot.py, which is the one referenced in the error. I believe the error may be due to this. Screenshot attached. Any inputs on this?
screen shot 2015-12-17 at 10 57 28 pm

@rhiever
Contributor

rhiever commented Dec 17, 2015

I see. That fix isn't in the latest version on pip. You'd have to install tpot from development:

  1. Download and unzip https://github.com/rhiever/tpot/archive/master.zip

  2. cd into the tpot directory you unzipped

  3. python setup.py build

  4. python setup.py install

That will install the development version of tpot onto your system.

Alternatively, you can cd into the directory you're using to develop tpot, sync (i.e., pull) the latest updates from github, and then any code you run in the base tpot directory will reference the development tpot version rather than the version installed via pip.

@pronojitsaha
Contributor Author

Thanks! Good news: I imported from the base directory and it now uses the development version. Bad news: the problem remains exactly the same.

I then dropped the features with categorical values and kept only the features with numerical values, i.e.

PassengerId int64
Survived int64
Pclass int64
Sex int64
Age float64
SibSp int64
Parch int64
Fare float64
Embarked float64
dtype: object

and then imputed missing values with a placeholder value (-999), but the exact same problem with the indices persists.

@rhiever
Contributor

rhiever commented Dec 18, 2015

Can you please share the latest version of this data set? I'll take a look.

@pronojitsaha
Contributor Author

Here it is:
train.zip

@rhiever
Contributor

rhiever commented Dec 18, 2015

Here's the code I used:

from tpot import TPOT
from sklearn.cross_validation import StratifiedShuffleSplit
import pandas as pd

pipeline_optimizer = TPOT(verbosity=2)

tpot_data = pd.read_csv('train.csv', sep=',')
tpot_data.rename(columns={'Survived': 'class'}, inplace=True)
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))

pipeline_optimizer.fit(tpot_data.loc[training_indeces].drop('class', axis=1).values,
                       tpot_data.loc[training_indeces, 'class'].values)

It seems to work fine, except it throws ValueError: could not convert string to float: 'W.E.P. 5734' when the data is passed to a StandardScaler (as would be expected since it is non-numerical data).
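
As an aside, a quick way to spot the columns that would trip up the StandardScaler (a generic pandas check on the tpot_data frame loaded above, not part of TPOT itself):

# List the columns that still hold strings and cannot be cast to float;
# for the standard Kaggle file this flags Name, Sex, Ticket, Cabin and Embarked
print(tpot_data.dtypes[tpot_data.dtypes == object])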

@pronojitsaha
Contributor Author

OK, I will look into this. Thanks.

@pronojitsaha
Contributor Author

OK, so your code works. I had used train_test_split, and I believe that was the culprit in my case, though I am not quite sure why.

I did some preprocessing as follows to make the data compliant with the tpot requirements:

titanic.rename(columns={'Survived': 'class'}, inplace=True)
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})
titanic_new = titanic.drop(['Name','Ticket','Cabin'], axis=1)
titanic_new = titanic_new.fillna(-999)

I did a fit() and score() to get ~80% accuracy. Then, to predict on the submission test set, I applied the same preprocessing as above and ran the following command:
tpot.predict(titanic_new.drop('class', axis=1), titanic_new['class'], titanic_sub)

However, this results in the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed, raised on the line return result[result['group'] == 'testing', 'guess'].values in tpot.predict(). Any inputs?

@rhiever
Contributor

rhiever commented Dec 20, 2015

Try:

tpot.predict(titanic_new.drop('class', axis=1).values, titanic_new['class'].values, titanic_sub.values)

Otherwise a DataFrame is being passed rather than a numpy array.

If that doesn't work:

It looks like the predict function wasn't coded correctly at the end:

return result[result['group'] == 'testing', 'guess'].values

should be

return result.loc[result['group'] == 'testing', 'guess'].values

I'll fix that now.
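
For context, a small generic pandas example (any frame with 'group' and 'guess' columns) showing why the original line fails: plain [] indexing receives the (boolean Series, 'guess') tuple as a single column key and tries to hash it, while .loc takes the row mask and the column label separately:

import pandas as pd

result = pd.DataFrame({'group': ['training', 'testing'], 'guess': [0, 1]})

# result[result['group'] == 'testing', 'guess']     # raises TypeError: 'Series' objects are mutable, thus they cannot be hashed
result.loc[result['group'] == 'testing', 'guess']   # works: the 'guess' values for the testing rows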

@pronojitsaha
Contributor Author

Thanks @rhiever. I did use .values earlier, but it didn't work either. I think result.loc is the solution for this hashing issue.

Anyway, I updated my local copy from your master, and now I get a new error, ValueError: unknown locale: UTF-8, at the import pandas as pd line in tpot. Are there any changes you implemented that I should know of?

@pronojitsaha
Contributor Author

So the problem was only with my pandas setup. Fixed it. Also got predict to work! Will update the material soon.

@rhiever
Contributor

rhiever commented Dec 21, 2015

Interesting - old version of pandas?

@pronojitsaha
Contributor Author

No, the locale information somehow got corrupted in .bash_profile, and that affected pandas. Corrected it with:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
