-
Notifications
You must be signed in to change notification settings - Fork 0
USE 131 - Framework for embedding input strategies #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: A core requirement of this application is the ability to take a TIMDEX JSON record and "transform" all or parts of it into a single string for which an embedding can be created. We are calling these "embedding strategies" in the context of this app. While our first strategy will likely be a very simple, full record approach, we want to support multiple strategies in the application, and even multiple strategies for a single record in a single invocation. How this addresses that need: * A new 'strategies' module is created * A base 'BaseStrategy' class, with a required 'extract_text()' method for implementations * Our first strategy represented in class 'FullRecordStrategy', which JSON dumps the entire TIMDEX JSON record. * A registry of strategies, similar to our models, that allow CLI level validation. Side effects of this change: * None really, but further solidifies that this application is contains the opinionation about how text is prepared for the embedding process. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-131 * https://mitlibraries.atlassian.net/browse/USE-132
7fa20b0 to
70d8394
Compare
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ipython testing worked as expected, a few minor comments and questions but nothing blocking. Great work, this is clean and tight!
| super().__init_subclass__(**kwargs) | ||
|
|
||
| # require class level STRATEGY_NAME to be set | ||
| if not hasattr(cls, "STRATEGY_NAME"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check necessary since the attribute exists in the base class even if it's not defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it is. Without this, you could leave out the class level STRATEGY_NAME attribute on a real strategy class.
Another pattern could be something like this:
class BaseStrategy(ABC):
@absractmethod
@classmethod
def strategy_name(self) -> str:
# return strategy name...Which would allow getting the strategy name for an uninstantiated class, which is important. We do something similar in other projects I think.
But this pattern, with a little logic in the base class, enforces that child classes define it.
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm curious if you had an idea of other methods that could be added to a child class of BaseStrategy? 🤔 When I think of this class, I understand it as a "collection of methods that could be applied to TIMDEX records", and while hard to explain, I was surprised by the number of required variables for the __init__ method:
self.timdex_record_id = timdex_record_id
self.run_id = run_id
self.run_record_offset = run_record_offset
self.transformed_record = transformed_recordIt didn't feel clear to me as to why all these fields were required to instantiate a "strategy", if that makes sense. I do understand why it is important information for the resulting EmbeddingInput returned by the BaseStrategy.to_embedding_input() method, which is why I find it fits more as parameters for that method specifically, instead of the __init__ method. However, perhaps there is something I'm not thinking of re: other methods that you expect to define for BaseStrategy classes that will heavily rely on these fields. 🤔
It's an excellent question, and one I don't have an excellent answer for. I'm glad you caught and surfaced this, as this was something I had intended to return to but did not. I think what we have here is a pretty classic major reshuffling during work on this, and this is an awkward leftover that needs attention. One thing is known: we want strategies to emit fully formed I'm going to take another pass at this with this discussion in mind, and will re-tag everyone for a review. Thanks for raising this @jonavellecuerdo! |
Why these changes are being introduced: Formerly, a transformer strategy class was instantiated in a per-record fashion, where things like the timdex_record_id and other record-level values were passed. This ultimately felt awkward, when we could just as easily instantiate it once in a more generic fashion, then build EmbeddingInput instances with the *result* of the strategy extracting text from the TIMDEX JSON record. How this addresses that need: All record-level details are removed as arguments for initializing a transformer strategy. Instead, the helper function create_embedding_inputs() is responsible for passing the TIMDEX JSON record to the transformer strategies, and then building an EmbeddingInput object before yielding. This keeps the init of those strategies much simpler, and preventing properties in the class they don't really need. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-131 * https://mitlibraries.atlassian.net/browse/USE-132
|
@jonavellecuerdo, @ehanson8 - I've added a new commit per the discussion above: 421ba71. Thanks again @jonavellecuerdo, that rough edge completely slipped my mind and/or I couldn't see it anymore when working through this. This further simplifies the transformer strategies as just that: accept a TIMDEX JSON record, be opinionated about how to return a string representation based on the strategy, and then get out of the way. Because we are utilizing iterators most everywhere in the CLI, there was a need for the |
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It got even cleaner and tighter! Great comment @jonavellecuerdo and great code in response @ghukill !
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look great and code flows better! Great work. :D
Purpose and background context
This PR establishes a framework for "strategies" used to extract and transform data from TIMDEX records into single strings used to create an embedding from.
As noted in the commit message, while we only have a single strategy at the moment of using the full record, we want to ensure we can support other strategies in the future and potentially multiple strategies for a record, in a single run (resulting in the cartesian product of records * strategies requested).
This establishes a new
embeddings/strategiesmodule, with a base classBaseStrategy, and our first implementation classFullRecordStrategy.Overall, it's quite simple: a bit of boilerplate to get down to the strategy's
extract_text()method which is opinionated for how text is extracted to create an embedding from.How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS credentials
2- Set env vars:
3- Start an ipython shell:
4- Prepare an iterator of TIMDEX records:
5- Create
EmbeddingInputinstances:The following is an example
EmbeddingInputreturned:Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review