Filter searches before citation detector #266
Merged
This moves the extract_features method in Detector::MlCitation into the initialize method, allowing us to quickly check whether a given term has enough non-zero features to make it worth calling the detector.
From our analysis in the TACOS notebooks, we believe that phrases which result in only two non-zero values among all their features will never end up being a citation, and that this threshold will allow us to skip the citation detector in 90% of searches.
The filtering is performed in a convenience method named enough_nonzero_values? (naming things is hard).
There is one side effect worth noting: the @detections instance variable is now defined as false at the top of the initialize method, before the first guard clause, so that we get a consistent Boolean value in all conditions. This required changing one test that previously expected nil from the guard clause.
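As a rough illustration of the flow described above, here is a minimal, self-contained sketch. It is not the code from this pull request: the class name MlCitationSketch, the MINIMUM_NONZERO constant, the enabled: flag standing in for the guard clause, the toy feature set, and the classify placeholder are all assumptions made for the example. Only the enough_nonzero_values? name, the early @detections = false, and the idea of skipping phrases with two or fewer non-zero features come from the description.

```ruby
# Hypothetical stand-in for Detector::MlCitation (illustration only).
class MlCitationSketch
  # Assumed threshold: a phrase needs more than two non-zero features
  # before it is worth sending to the detector.
  MINIMUM_NONZERO = 3

  attr_reader :detections, :features

  def initialize(phrase, enabled: true)
    # Defined before any guard clause so every exit path leaves a Boolean.
    @detections = false
    return unless enabled # stand-in for the existing guard clause

    @features = extract_features(phrase)
    return unless enough_nonzero_values?

    @detections = classify(@features)
  end

  private

  # Toy feature extraction; the real feature set is not shown in this PR text.
  def extract_features(phrase)
    {
      commas: phrase.count(','),
      periods: phrase.count('.'),
      colons: phrase.count(':'),
      quotes: phrase.count('"'),
      years: phrase.scan(/\b(?:19|20)\d{2}\b/).length
    }
  end

  # True when enough features are non-zero to justify calling the detector.
  def enough_nonzero_values?
    @features.values.count { |value| !value.zero? } >= MINIMUM_NONZERO
  end

  # Placeholder for the expensive detector call; always answers true here.
  def classify(_features)
    true
  end
end
```

With this sketch, MlCitationSketch.new('climate change').detections stays false without classify ever being called, while a phrase with commas, periods, a colon, and a year passes the filter and reaches the placeholder detector.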
I have one concern about this approach, but I'd like to talk about it as part of code review: This approach doesn't seem to leave any traces for us to diagnose after the fact. I thought about whether there should be a data model change here, to indicate which Term records were filtered by this approach, but we haven't talked about that in any detail yet.
Some of the ways that we might use this information would be:
An alternative arrangement to what I'm proposing here would be to have the number of nonzero features calculated in a public method, ideally during feature extraction in Detector::Citation. That might allow that value to be returned via GraphQL for training notebooks, or consulted internally without having to recalculate it. The actual filter could still be internal to Detector::MlCitation, but enough_nonzero_features? would just be an inequality check without the calculation.
Developer
Ticket(s)
https://mitlibraries.atlassian.net/browse/TCO-190
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
YES dependencies are updated
NO migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing