Change StratifiedShuffleSplit to train_test_split #112

pronojitsaha · 2016-03-12T18:00:35Z

Addresses #99

rhiever · 2016-03-12T18:04:22Z

Hmmm... it's saying that no changes were made. Did you update from master and overwrite your changes?

pronojitsaha · 2016-03-12T19:54:24Z

I was in a hurry and forgot to commit! Should be good now.

rhiever · 2016-03-12T22:47:04Z

Thanks! Did you verify that it returns the same splits? If not, we'll have to verify that before merging.

rhiever · 2016-03-12T23:03:20Z

Just checked and, by default, train_test_split doesn't stratify the data by class. You have to pass the stratify option and a list of the class labels, e.g.,

X_train, X_test, y_train, y_test = train_test_split(input_data.drop('class', axis=1).values, 
                                                    input_data['class'].values,
                                                    train_size=0.75, test_size=0.25,
                                                    random_state=RANDOM_STATE,
                                                    stratify=input_data['class'].values)

Please make that change and let's see if that passes on Travis-CI. If it does, we'll merge away!
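To make the effect of the stratify option concrete, here is a minimal sketch on made-up, imbalanced data. It assumes a scikit-learn release in which train_test_split is importable from sklearn.model_selection; the arrays, the seed, and the variable names are hypothetical and are not taken from TPOT.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0 and 10 of class 1.
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 90 + [1] * 10)

# Default behaviour: rows are shuffled without regard to class labels.
_, _, _, y_test_plain = train_test_split(
    features, labels, train_size=0.75, test_size=0.25, random_state=42)

# With stratify, each split keeps the 90/10 class ratio as closely as it can.
_, _, y_train_strat, y_test_strat = train_test_split(
    features, labels, train_size=0.75, test_size=0.25, random_state=42,
    stratify=labels)

print("unstratified test counts:", Counter(y_test_plain.tolist()))
print("stratified test counts:  ", Counter(y_test_strat.tolist()))
```

The unstratified counts depend on the seed, while the stratified test set should hold roughly a quarter of each class.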

pronojitsaha · 2016-03-13T05:06:20Z

OK, sure.

pronojitsaha · 2016-03-13T10:19:56Z

I have checked the splits, and the split ratio is the same for both train_test_split and StratifiedShuffleSplit, but the split indices are somewhat different, which is expected.

rhiever · 2016-03-13T12:48:47Z

If you set the random_state to the same value in your tests, they should come out the same. That's what I verified on my end yesterday.
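One way to run that check is sketched below. It assumes a current scikit-learn, where StratifiedShuffleSplit takes n_splits and exposes split(X, y), which differs from the older n_iter-style call shown in this PR's diff. The labels and the seed are hypothetical. Whether the two splitters agree may depend on the installed version, so the script prints the comparison instead of assuming the result.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

RANDOM_STATE = 42  # hypothetical seed, stands in for the project's constant

# Hypothetical labels: 60 samples of class 0 and 40 of class 1.
labels = np.array([0] * 60 + [1] * 40)
row_ids = np.arange(len(labels))  # row positions double as the data to split


def split_ids(seed):
    """Return (train_ids, test_ids) from a stratified train_test_split."""
    train_ids, test_ids = train_test_split(
        row_ids, train_size=0.75, test_size=0.25,
        random_state=seed, stratify=labels)
    return train_ids, test_ids


# Calling twice with the same seed should give the same split.
first_train, first_test = split_ids(RANDOM_STATE)
second_train, second_test = split_ids(RANDOM_STATE)
print("repeatable:", np.array_equal(first_test, second_test))

# Feed the same seed and sizes to StratifiedShuffleSplit and compare.
splitter = StratifiedShuffleSplit(
    n_splits=1, train_size=0.75, test_size=0.25, random_state=RANDOM_STATE)
sss_train, sss_test = next(splitter.split(row_ids, labels))
print("matches StratifiedShuffleSplit:", np.array_equal(first_test, sss_test))
```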

rhiever · 2016-03-13T12:56:52Z

tpot/tpot.py

@@ -214,10 +214,10 @@ def fit(self, features, classes):
            np.random.shuffle(data_columns)
            training_testing_data = training_testing_data[data_columns]

-            training_indices, testing_indices = next(iter(StratifiedShuffleSplit(training_testing_data['class'].values,
-                                                                                 n_iter=1,
+            training_indices, testing_indices = train_test_split(training_testing_data.index,


This doesn't look right. The call needs to look something like:

(training_features, testing_features,
 training_labels, testing_labels) = train_test_split(input_data.drop('class', axis=1).values,
                                                     input_data['class'].values,
                                                     train_size=0.75, test_size=0.25,
                                                     random_state=RANDOM_STATE,
                                                     stratify=input_data['class'].values)

Have you tested this?

pronojitsaha · 2016-03-13

Yes, I have tested it on the Iris and MNIST datasets, and it works the same. We could also do it the way you have pointed out, but using training_testing_data.index to get training_indices and testing_indices is in line with the rest of our code.
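For readers following along, the index-based approach described above might look like the sketch below. The DataFrame is a hypothetical stand-in for training_testing_data, and the seed is arbitrary. The sketch converts the index to a NumPy array with .values before splitting, which is a cautious assumption on my part; the PR itself passes the index object directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for training_testing_data: two features plus 'class'.
training_testing_data = pd.DataFrame({
    "feature_a": range(20),
    "feature_b": range(20, 40),
    "class": [0, 1] * 10,
})

# Split the row labels instead of the feature matrix, stratified by class.
# .values turns the index into a NumPy array (an assumption of this sketch).
training_indices, testing_indices = train_test_split(
    training_testing_data.index.values,
    train_size=0.75, test_size=0.25,
    random_state=42,
    stratify=training_testing_data["class"].values)

# The index arrays can then select rows wherever the rest of the code needs them.
training_rows = training_testing_data.loc[training_indices]
testing_rows = training_testing_data.loc[testing_indices]
print(testing_rows["class"].value_counts())
```

Splitting indices keeps a single DataFrame as the source of truth, at the cost of an extra lookup step compared with splitting the features and labels directly.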

pronojitsaha · 2016-03-13T14:38:39Z

Ok, got that. Thanks.

rhiever · 2016-03-13T20:46:38Z

Understood. Alright, looks good to merge. Thanks again! :-)

pronojitsaha added 2 commits March 5, 2016 23:13

Merge remote-tracking branch 'rhiever/master'

b19b3ae

Merge remote-tracking branch 'rhiever/master'

82a5557

pronojitsaha changed the title ~~Addresses #99~~ Change StratifiedShuffleSplit to ttrain_test_split Mar 12, 2016

Changing to train_test_split

2dac19e

pronojitsaha added 3 commits March 13, 2016 12:05

Incorporate Stratify in train_test_split

c0c1cd7

Updated the tutorials to incorporate train_test_split

11cf3aa

Updated documentation to incorporate train_test_split

7e46113

rhiever reviewed Mar 13, 2016

rhiever pushed a commit that referenced this pull request Mar 13, 2016

Merge pull request #112 from pronojitsaha/master

73f5b38

Change StratifiedShuffleSplit to ttrain_test_split

rhiever merged commit 73f5b38 into EpistasisLab:master Mar 13, 2016

AIAdventures mentioned this pull request Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed

saddy001 mentioned this pull request Mar 20, 2018

Segfault on optimization process #676

Closed
