Change StratifiedShuffleSplit to ttrain_test_split #112

Merged: 6 commits merged on Mar 13, 2016

Conversation

pronojitsaha
Contributor

Addresses #99

@pronojitsaha changed the title from "Addresses #99" to "Change StratifiedShuffleSplit to ttrain_test_split" on Mar 12, 2016
@rhiever
Contributor

rhiever commented Mar 12, 2016

Hmmm... it's saying that no changes were made. Did you update from master and overwrite your changes?

@pronojitsaha
Contributor Author

I was in a hurry and forgot to commit! Should be good now.

@rhiever
Contributor

rhiever commented Mar 12, 2016

Thanks! Did you verify that it returns the same splits? If not, we'll have to verify that before merging.

@rhiever
Contributor

rhiever commented Mar 12, 2016

Just checked and, by default, train_test_split doesn't stratify the data by class. You have to pass the stratify option and a list of the class labels, e.g.,

X_train, X_test, y_train, y_test = train_test_split(input_data.drop('class', axis=1).values, 
                                                    input_data['class'].values,
                                                    train_size=0.75, test_size=0.25,
                                                    random_state=RANDOM_STATE,
                                                    stratify=input_data['class'].values)

Please make that change and let's see if that passes on Travis-CI. If it does, we'll merge away!
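
For readers following along, here is a minimal sketch of what the stratify option changes. It is not code from this PR: the toy data, the variable names, and the seed value are made up for illustration, and it imports from sklearn.model_selection, which is where current scikit-learn releases expose train_test_split (an assumption about the installed version).

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # hypothetical seed, chosen only for this sketch

# Toy data: 80 samples of class 0 and 40 of class 1 (a 2:1 ratio).
labels = np.array([0] * 80 + [1] * 40)
features = np.arange(len(labels)).reshape(-1, 1)

# Without stratify, the class ratio in each part is left to chance.
_, _, _, y_test_plain = train_test_split(
    features, labels,
    train_size=0.75, test_size=0.25,
    random_state=RANDOM_STATE)

# With stratify, each part keeps the class ratio of the full data set.
_, _, y_train_strat, y_test_strat = train_test_split(
    features, labels,
    train_size=0.75, test_size=0.25,
    random_state=RANDOM_STATE,
    stratify=labels)

print("plain test split:      ", Counter(y_test_plain.tolist()))
print("stratified train split:", Counter(y_train_strat.tolist()))
print("stratified test split: ", Counter(y_test_strat.tolist()))
```

With these sizes the stratified parts divide evenly, so the 2:1 ratio is preserved exactly in both the training and the testing part; the unstratified counts depend on the seed.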

@pronojitsaha
Contributor Author

OK, sure.

@pronojitsaha
Contributor Author

I have checked the splits. The split ratio is the same for both train_test_split and StratifiedShuffleSplit, but the split indices differ somewhat, which is expected.

@rhiever
Contributor

rhiever commented Mar 13, 2016

If you set random_state to the same value in both calls in your tests, they should come out the same. That's what I verified on my end yesterday.
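
One way such a check could be written is sketched below. It is an illustration, not the test that was actually run: it uses the current sklearn.model_selection API (n_splits and .split) rather than the n_iter constructor form shown in this PR's diff, and the toy labels and seed are made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

RANDOM_STATE = 42  # hypothetical seed; what matters is using it in both calls

# Toy labels: three balanced classes of 40 samples each.
labels = np.repeat([0, 1, 2], 40)
row_indices = np.arange(len(labels))

# Split 1: a single stratified shuffle split, requested explicitly.
splitter = StratifiedShuffleSplit(n_splits=1,
                                  train_size=0.75, test_size=0.25,
                                  random_state=RANDOM_STATE)
sss_train, sss_test = next(splitter.split(row_indices, labels))

# Split 2: train_test_split over the same row indices, stratified by class.
tts_train, tts_test = train_test_split(row_indices,
                                       train_size=0.75, test_size=0.25,
                                       random_state=RANDOM_STATE,
                                       stratify=labels)

# Compare the two splits element by element, order included.
same_split = (np.array_equal(sss_train, tts_train)
              and np.array_equal(sss_test, tts_test))
print("identical splits:", same_split)
```

If the seeds differ, or stratify is left out of the second call, the two sets of indices would generally not match.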

@@ -214,10 +214,10 @@ def fit(self, features, classes):
 np.random.shuffle(data_columns)
 training_testing_data = training_testing_data[data_columns]

-training_indices, testing_indices = next(iter(StratifiedShuffleSplit(training_testing_data['class'].values,
-n_iter=1,
+training_indices, testing_indices = train_test_split(training_testing_data.index,
Contributor

This doesn't look right. The call needs to look something like:

(training_features, testing_features,
 training_labels, testing_labels) = train_test_split(input_data.drop('class', axis=1).values,
                                                     input_data['class'].values,
                                                     train_size=0.75, test_size=0.25,
                                                     random_state=RANDOM_STATE,
                                                     stratify=input_data['class'].values)

Have you tested this?

Contributor Author

Yes, I have tested it on the Iris and MNIST datasets, and it works the same. We can also do it the way you have pointed out, but using training_testing_data.index to get training_indices and testing_indices is in line with the rest of our code.
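
The equivalence of the two call styles can be sketched as follows. This is an illustration under stated assumptions, not the project's code: a plain NumPy range stands in for training_testing_data.index, the random features and the seed are made up, and the import path is the one used by current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # hypothetical seed for this sketch

# Toy data: 120 rows, 4 features, three balanced classes.
rng = np.random.RandomState(0)
features = rng.rand(120, 4)
labels = np.repeat([0, 1, 2], 40)
row_indices = np.arange(len(labels))  # stand-in for training_testing_data.index

# Index-based style: split the row indices, then select rows afterwards.
training_indices, testing_indices = train_test_split(
    row_indices,
    train_size=0.75, test_size=0.25,
    random_state=RANDOM_STATE,
    stratify=labels)

# Array-based style: split the feature and label arrays directly.
(training_features, testing_features,
 training_labels, testing_labels) = train_test_split(
    features, labels,
    train_size=0.75, test_size=0.25,
    random_state=RANDOM_STATE,
    stratify=labels)

# Both styles should pick the same rows in the same order.
rows_match = (
    np.array_equal(features[training_indices], training_features)
    and np.array_equal(features[testing_indices], testing_features)
    and np.array_equal(labels[training_indices], training_labels)
    and np.array_equal(labels[testing_indices], testing_labels))
print("same rows selected:", rows_match)
```

The index-based style has the side benefit that the indices can be reused later to select rows from any column-aligned structure, which fits code that already works with training_indices and testing_indices.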

@pronojitsaha
Contributor Author

Ok, got that. Thanks.

@rhiever
Contributor

rhiever commented Mar 13, 2016

Understood. Alright, looks good to merge. Thanks again! :-)

rhiever pushed a commit that referenced this pull request Mar 13, 2016
Change StratifiedShuffleSplit to ttrain_test_split
@rhiever merged commit 73f5b38 into EpistasisLab:master on Mar 13, 2016