
Condense our copy of DictVectorizer to just the one method we still need. #374

Merged
desilinguist merged 19 commits into master from feature/remove-our-dict-vectorizer on Oct 24, 2017

Conversation

desilinguist
Member

  • Now that @dan-blanchard's DictVectorizer additions have been merged into scikit-learn, the only method we still need is __eq__(). The rest of the code is unnecessary and has been removed (see the sketch after this list).
  • Shortened some of the test descriptions so that they fit on the console.
  • Renamed a variable in a test so that its name is more appropriate.
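
For reference, here is a minimal sketch of what the condensed class could look like. The comparison logic shown is an assumption for illustration only, not a verbatim copy of SKLL's dict_vectorizer.py:

```python
from sklearn.feature_extraction import DictVectorizer as SklearnDictVectorizer


class DictVectorizer(SklearnDictVectorizer):
    """scikit-learn's DictVectorizer plus an equality check."""

    def __eq__(self, other):
        # Hypothetical comparison: two vectorizers are considered equal if
        # they map the same feature names to the same column indices.
        return (isinstance(other, SklearnDictVectorizer) and
                getattr(self, "vocabulary_", None) == getattr(other, "vocabulary_", None))
```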

@desilinguist self-assigned this on Oct 19, 2017
@desilinguist changed the title from "Remove our copy of DictVectorizer" to "Condense our copy of DictVectorizer to just the one method we still need." on Oct 19, 2017
@desilinguist
Member Author

As part of reviewing, please also run this version of SKLL on an existing experiment you have access to and make sure that the results don't change.

@coveralls

Coverage Status

Coverage decreased (-0.8%) to 91.205% when pulling 2c7edf0 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.

@coveralls

coveralls commented Oct 19, 2017

Coverage Status

Coverage decreased (-0.2%) to 91.864% when pulling 2c7edf0 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.


@desilinguist
Member Author

Looks like the code coverage for featureset.py went down by 2 lines and nothing else changed. I can't figure out why the streamlining of dict_vectorizer.py would cause that. I'll look into it.

@desilinguist
Member Author

Okay, I have figured out why we lost the two lines of coverage in this branch.

Essentially, until this branch came along, our copy of DictVectorizer did not explicitly sort the indices of the underlying sparse feature matrix when its fit_transform() method was called during the __init__() method for FeatureSet. We delayed that sorting until we absolutely needed it, for example when checking for equality with another FeatureSet instance.

However, scikit-learn's DictVectorizer explicitly sorts the indices of sparse feature matrices as soon as fit_transform() is called. See this line. Therefore, in this branch, when we create FeatureSet instances with sparse=True, the indices are already sorted, the explicit sorting in the __eq__() method is never triggered, and those lines end up not being covered at all.

That said, if we include non-sparse FeatureSets in the test, I think we should be able to trigger those lines. That's what I am going to try next.
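
A quick illustration of the coverage effect (plain scikit-learn/scipy, assuming a scikit-learn version that sorts indices in fit_transform() as described above; this is not SKLL code):

```python
from sklearn.feature_extraction import DictVectorizer

dicts = [{"f1": 1.0, "f3": 2.0}, {"f2": 3.0}]

# With scikit-learn sorting inside fit_transform(), the returned CSR matrix
# already has sorted indices ...
X = DictVectorizer(sparse=True).fit_transform(dicts)
print(X.has_sorted_indices)  # truthy: indices are already in order

# ... so a defensive re-sort like this never actually runs its body,
# and the corresponding lines show up as uncovered.
if not X.has_sorted_indices:
    X.sort_indices()
```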

@desilinguist
Member Author

desilinguist commented Oct 23, 2017

D'oh! sort_indices() as an operation is only needed for sparse matrices because dense matrices are ... you know, dense. So, the two options are:

  1. Remove these lines of code since they are redundant now that sorting is done by default.
  2. Leave them in just to be safe and live with the decrease in coverage.

@dan-blanchard @aoifecahill thoughts? I am personally leaning towards the second option for this release and then removing the lines in a subsequent release once we are satisfied that nothing weird is happening.
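
To illustrate the sparse-vs-dense point, here is a small hypothetical snippet (not the actual FeatureSet.__eq__ code): sort_indices() only exists on scipy sparse matrices, so dense feature arrays can never exercise that branch.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1.0, 0.0], [0.0, 2.0]])
sparse = csr_matrix(dense)

# Only sparse matrices carry an `indices` array that can be out of order.
sparse.sort_indices()
# dense.sort_indices()  # would raise AttributeError: ndarrays have no such method

# Equality checks therefore differ by representation:
dense_equal = np.array_equal(dense, dense)
sparse_equal = abs(sparse - sparse).nnz == 0
```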

@aoifecahill
Collaborator

I like the option of removing redundant lines of code better. What is the "just to be safe" scenario?

@coveralls

coveralls commented Oct 23, 2017

Coverage Status

Coverage decreased (-0.1%) to 91.929% when pulling 19b1041 on feature/remove-our-dict-vectorizer into 4a1cc23 on master.

@desilinguist
Member Author

Hmm, so the decrease went from 0.2% to 0.1%. Ugh. Stay tuned :)

@desilinguist
Member Author

Ah, I think the decreased coverage is basically the result of getting rid of the 4 lines. I compared the coverage HTMLs for featureset.py (which is the only file that changed) and there are no differences in them whatsoever. Here's the coverage HTML for master and here's the coverage HTML for this branch.

So, @aoifecahill @dan-blanchard @bndgyawali this branch is now ready for review.

@desilinguist
Member Author

@dan-blanchard do you think you will have a chance to look at this? I really want your input since you filed the original issue :)

@dan-blanchard
Contributor

dan-blanchard left a comment


👍 This looks good to me. I double-checked the scikit-learn code to see how they're doing sorting now, and this all makes sense.

I must admit that I haven't touched SKLL since leaving ETS, so as more time goes on, I'm probably going to be less and less useful for reviews.

@desilinguist
Member Author

Thanks @dan-blanchard! I recognize that limitation, so I only request reviews from you in cases where I think you have specific insight that the rest of us might lack. Don't worry, I won't bug you too much :)

@desilinguist merged commit efa0a5b into master on Oct 24, 2017
@desilinguist deleted the feature/remove-our-dict-vectorizer branch on October 24, 2017 at 13:40