
Condense our copy of DictVectorizer to just the one method we still need. #374

Merged
desilinguist merged 19 commits into master from feature/remove-our-dict-vectorizer on Oct 24, 2017

Conversation

desilinguist
Member

  • Now that @dan-blanchard's DictVectorizer additions have been merged into scikit-learn, the only method we still need is __eq__(). The rest of the code is unnecessary and has been removed (see the sketch after this list).
  • Shortened some of the test descriptions so that they fit on the console.
  • Renamed a variable in a test so that its name is more appropriate.
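
For reference, here is a minimal sketch of what the condensed class could look like. The comparison logic shown is an assumption for illustration only, not a verbatim copy of SKLL's dict_vectorizer.py:

```python
from sklearn.feature_extraction import DictVectorizer as SklearnDictVectorizer


class DictVectorizer(SklearnDictVectorizer):
    """scikit-learn's DictVectorizer plus an equality check."""

    def __eq__(self, other):
        # Hypothetical comparison: two vectorizers are considered equal if
        # they map the same feature names to the same column indices.
        return (isinstance(other, SklearnDictVectorizer) and
                getattr(self, "vocabulary_", None) == getattr(other, "vocabulary_", None))
```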

@desilinguist self-assigned this on Oct 19, 2017
@desilinguist changed the title from "Remove our copy of DictVectorizer" to "Condense our copy of DictVectorizer to just the one method we still need." on Oct 19, 2017
@desilinguist
Member Author

As part of reviewing, please also run this version of SKLL on an existing experiment you have access to and make sure that the results don't change.

@coveralls

Coverage Status

Coverage decreased (-0.8%) to 91.205% when pulling 2c7edf0 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.

@coveralls

coveralls commented Oct 19, 2017

Coverage Status

Coverage decreased (-0.2%) to 91.864% when pulling 2c7edf0 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.


@desilinguist
Member Author

Looks like the code coverage for featureset.py went down by 2 lines and nothing else changed. I can't figure out why the streamlining of dict_vectorizer.py would cause that. I'll look into it.

@desilinguist
Member Author

Okay, I have figured out why we lost the two lines of coverage in this branch.

Essentially, until this branch came along, our copy of DictVectorizer did not explicitly sort the indices of the underlying sparse feature matrix when its fit_transform() method was called during the __init__() method for FeatureSet. We delayed that sorting until we absolutely needed it, for example when checking for equality with another FeatureSet instance.

However, scikit-learn's DictVectorizer explicitly sorts the indices of sparse feature matrices as soon as fit_transform() is called. See this line. Therefore, in this branch, when we create FeatureSet instances with sparse=True, the indices are already sorted, the explicit sorting in the __eq__() method is never triggered, and those lines end up not being covered at all.

That said, if we include non-sparse FeatureSets in the test, I think we should be able to trigger those lines. That's what I am going to try next.
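
A quick illustration of the coverage effect (plain scikit-learn/scipy, assuming a scikit-learn version that sorts indices in fit_transform() as described above; this is not SKLL code):

```python
from sklearn.feature_extraction import DictVectorizer

dicts = [{"f1": 1.0, "f3": 2.0}, {"f2": 3.0}]

# With scikit-learn sorting inside fit_transform(), the returned CSR matrix
# already has sorted indices ...
X = DictVectorizer(sparse=True).fit_transform(dicts)
print(X.has_sorted_indices)  # truthy: indices are already in order

# ... so a defensive re-sort like this never actually runs its body,
# and the corresponding lines show up as uncovered.
if not X.has_sorted_indices:
    X.sort_indices()
```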

@desilinguist
Member Author

desilinguist commented Oct 23, 2017

D'oh! sort_indices() as an operation is only needed for sparse matrices because dense matrices are ... you know, dense. So, the two options are:

  1. Remove these lines of code since they are redundant now that sorting is done by default.
  2. Leave them in just to be safe and live with the decrease in coverage.

@dan-blanchard @aoifecahill thoughts? I am personally leaning towards the second option for this release and then removing the lines in a subsequent release once we are satisfied that nothing weird is happening.
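
To illustrate the sparse-vs-dense point, here is a small hypothetical snippet (not the actual FeatureSet.__eq__ code): sort_indices() only exists on scipy sparse matrices, so dense feature arrays can never exercise that branch.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1.0, 0.0], [0.0, 2.0]])
sparse = csr_matrix(dense)

# Only sparse matrices carry an `indices` array that can be out of order.
sparse.sort_indices()
# dense.sort_indices()  # would raise AttributeError: ndarrays have no such method

# Equality checks therefore differ by representation:
dense_equal = np.array_equal(dense, dense)
sparse_equal = abs(sparse - sparse).nnz == 0
```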

@aoifecahill
Collaborator

I like the option of removing redundant lines of code better. What is the "just to be safe" scenario?

@coveralls

coveralls commented Oct 23, 2017

Coverage Status

Coverage decreased (-0.1%) to 91.929% when pulling 19b1041 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.

@desilinguist
Member Author

Hmm, so the decrease went from 0.2% to 0.1%. Ugh. Stay tuned :)

@desilinguist
Member Author

Ah, I think the decreased coverage is basically the result of getting rid of the 4 lines. I compared the coverage HTMLs for featureset.py (which is the only file that changed) and there are no differences in them whatsoever. Here's the coverage HTML for master and here's the coverage HTML for this branch.

So, @aoifecahill @dan-blanchard @bndgyawali this branch is now ready for review.

@desilinguist
Member Author

@dan-blanchard do you think you will have a chance to look at this? I really want your input since you filed the original issue :)

@dan-blanchard
Contributor

dan-blanchard left a comment


👍 This looks good to me. I double-checked the scikit-learn code to see how they're doing sorting now, and this all makes sense.

I must admit that I haven't touched SKLL since leaving ETS, so as more time goes on, I'm probably going to be less and less useful for reviews.

@desilinguist
Member Author

Thanks @dan-blanchard! I recognize that limitation, so I only request reviews from you in cases where I think you have specific insight that the rest of us might lack. Don't worry, I won't bug you too much :)

@desilinguist merged commit efa0a5b into master on Oct 24, 2017
@desilinguist deleted the feature/remove-our-dict-vectorizer branch on October 24, 2017 at 13:40