Skip to content

Conversation

@matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Jul 30, 2025

This moves the extract_features method in Detector::MlCitation into the initialize method, allowing us to quickly check for whether a given term has enough non-zero features to make it worth calling the detector.

From our analysis in the TACOS notebooks, we believe that phrases which result in only two non-zero values among all their features will never end up being a citation - and that this threshold will allow us to skip the citation detector in 90% of searches.

The filtering is performed in a convenience method named enough_nonzero_values? (naming things is hard).

There is one side effect worth noting: the @Detections instance variable is now defined as false at the top of the initialize method, before the first guard clause, so that we get a consistent Boolean value in all conditions. This required one test to change that previously expected a nil from the guard clause.

I have one concern about this approach, but I'd like to talk about it as part of code review: This approach doesn't seem to leave any traces for us to diagnose after the fact. I thought about whether there should be a data model change here, to indicate which Term records were filtered by this approach, but we haven't talked about that in any detail yet.

Some of the ways that we might use this information would be:

  1. To assemble a new batch of training data that has survived the filter;
  2. To maintain awareness of how the filter might need to change as we update what features we extract or how the algorithm is trained.

An alternative arrangement to what I'm proposing here would be to have the number of nonzero features calculated in a public method, ideally during feature extraction in Detector::Citation. That might allow that value to be returned via GraphQL for training notebooks, or consulted internally without having to recalculate it. The actual filter could still be internal to Detector::MlCitation, but enough_nonzero_features? would just be an inequality check without the calculation.

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-190

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

YES dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

** Why are these changes being introduced:

There is a transaction cost to calling the citation detector, both in
terms of time and processing time. If we can develop a way to flag
searches which have zero chance of being a citation, we can cut out this
expense by skipping the citation detector.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-190

** How does this address that need:

This moves the extract_features method in Detector::MlCitation into the
initialize method, allowing us to quickly check for whether a given term
has enough non-zero features to make it worth calling the detector.

From our analysis in the TACOS notebooks, we believe that phrases which
result in only two non-zero values among all their features will never
end up being a citation - and that this threshold will allow us to skip
the citation detector in 90% of searches.

The filtering is performed in a convenience method named
enough_nonzero_values? (naming things is hard).

** Document any side effects to this change:

The @Detections instance variable is defined as false at the top of
the initialize method, before the first guard clause, so that we get a
consistent Boolean value in all conditions. This required one test to
change that previously expected a nil.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-266 July 30, 2025 21:01 Inactive
@matt-bernhardt matt-bernhardt changed the title tco-190 Filter searches before citation detector Filter searches before citation detector Jul 30, 2025
@JPrevost JPrevost self-assigned this Jul 31, 2025
This makes two slight changes after discussing code review feedback:

1. We move the assignment of the @Detections instance variable down
   below the first guard clause in the Detector::MlCitation initializer.
   This better accounts for the fact that the Term model implements its
   own guard clause, resulting in the rest of the application seeing a
   detector result of null if the detector is not enabled via env vars.

2. Adds a note to the enough_nonzero_values? method comment explaining
   how we came to decide that three non-zero feature values is
   sufficient. This includes links to the notebooks where we conducted
   that analysis.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-266 July 31, 2025 20:43 Inactive
@matt-bernhardt
Copy link
Member Author

Okay, @JPrevost - I think this is ready for your review again.

@matt-bernhardt matt-bernhardt merged commit cf35392 into main Aug 1, 2025
5 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-190-filter branch August 1, 2025 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants