Filter searches before citation detector #266

matt-bernhardt · 2025-07-30T20:59:14Z

This moves the extract_features method in Detector::MlCitation into the initialize method, allowing us to quickly check for whether a given term has enough non-zero features to make it worth calling the detector.

From our analysis in the TACOS notebooks, we believe that phrases which result in only two non-zero values among all their features will never end up being a citation - and that this threshold will allow us to skip the citation detector in 90% of searches.

The filtering is performed in a convenience method named enough_nonzero_values? (naming things is hard).

There is one side effect worth noting: the @Detections instance variable is now defined as false at the top of the initialize method, before the first guard clause, so that we get a consistent Boolean value in all conditions. This required one test to change that previously expected a nil from the guard clause.

I have one concern about this approach, but I'd like to talk about it as part of code review: This approach doesn't seem to leave any traces for us to diagnose after the fact. I thought about whether there should be a data model change here, to indicate which Term records were filtered by this approach, but we haven't talked about that in any detail yet.

Some of the ways that we might use this information would be:

To assemble a new batch of training data that has survived the filter;
To maintain awareness of how the filter might need to change as we update what features we extract or how the algorithm is trained.

An alternative arrangement to what I'm proposing here would be to have the number of nonzero features calculated in a public method, ideally during feature extraction in Detector::Citation. That might allow that value to be returned via GraphQL for training notebooks, or consulted internally without having to recalculate it. The actual filter could still be internal to Detector::MlCitation, but enough_nonzero_features? would just be an inequality check without the calculation.

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-190

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

YES dependencies are updated

NO migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

** Why are these changes being introduced: There is a transaction cost to calling the citation detector, both in terms of time and processing time. If we can develop a way to flag searches which have zero chance of being a citation, we can cut out this expense by skipping the citation detector. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-190 ** How does this address that need: This moves the extract_features method in Detector::MlCitation into the initialize method, allowing us to quickly check for whether a given term has enough non-zero features to make it worth calling the detector. From our analysis in the TACOS notebooks, we believe that phrases which result in only two non-zero values among all their features will never end up being a citation - and that this threshold will allow us to skip the citation detector in 90% of searches. The filtering is performed in a convenience method named enough_nonzero_values? (naming things is hard). ** Document any side effects to this change: The @Detections instance variable is defined as false at the top of the initialize method, before the first guard clause, so that we get a consistent Boolean value in all conditions. This required one test to change that previously expected a nil.

app/models/detector/ml_citation.rb

This makes two slight changes after discussing code review feedback: 1. We move the assignment of the @Detections instance variable down below the first guard clause in the Detector::MlCitation initializer. This better accounts for the fact that the Term model implements its own guard clause, resulting in the rest of the application seeing a detector result of null if the detector is not enabled via env vars. 2. Adds a note to the enough_nonzero_values? method comment explaining how we came to decide that three non-zero feature values is sufficient. This includes links to the notebooks where we conducted that analysis.

matt-bernhardt · 2025-07-31T20:44:25Z

Okay, @JPrevost - I think this is ready for your review again.

mitlib temporarily deployed to tacos-api-pipeline-pr-266 July 30, 2025 21:01 Inactive

matt-bernhardt changed the title ~~tco-190 Filter searches before citation detector~~ Filter searches before citation detector Jul 30, 2025

matt-bernhardt requested review from JPrevost and jazairi July 31, 2025 14:33

JPrevost self-assigned this Jul 31, 2025

JPrevost reviewed Jul 31, 2025

View reviewed changes

app/models/detector/ml_citation.rb Outdated Show resolved Hide resolved

app/models/detector/ml_citation.rb Show resolved Hide resolved

mitlib temporarily deployed to tacos-api-pipeline-pr-266 July 31, 2025 20:43 Inactive

JPrevost approved these changes Jul 31, 2025

View reviewed changes

matt-bernhardt merged commit cf35392 into main Aug 1, 2025
5 checks passed

matt-bernhardt deleted the tco-190-filter branch August 1, 2025 13:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Filter searches before citation detector #266

Filter searches before citation detector #266

Uh oh!

matt-bernhardt commented Jul 30, 2025 •

edited by JPrevost

Loading

Uh oh!

Uh oh!

Uh oh!

matt-bernhardt commented Jul 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Filter searches before citation detector #266

Filter searches before citation detector #266

Uh oh!

Conversation

matt-bernhardt commented Jul 30, 2025 • edited by JPrevost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

Uh oh!

Uh oh!

Uh oh!

matt-bernhardt commented Jul 31, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

matt-bernhardt commented Jul 30, 2025 •

edited by JPrevost

Loading