-
Notifications
You must be signed in to change notification settings - Fork 7
feat(medcat): CU-869b9h7y6 Add faster linker #243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Task linked: CU-869b9h7y6 Add simple/fast linker |
|
I've gone ahead and looked at the inference speed for the regular and fast tokenizer. This was run over 400 MIMIC documents with no filter (this becomes important a little later). Here are the results:
As we can see, there's a clear speedup of between 27% (with regex) and 35% (with spacy) with the faster linker introduced here. However, it comes at the expese of 28% fewer entities with spacy and 33% fewer with regex. Notably, we can see from this that the The last piece of the puzzle I wanted to explore and explain was the difference in performance when comparing the COMETA dataset and the linking challenge dataset. So I ran inference only on the texts in these datasets (while using the filters within), and here's the results:
We see again the same situation we did at the start. The COMETA dataset enjoys a brilliant increase in throughput / speed. This time even more than 10 fold (around 13.7 times at max). However, we still see that the regex tokenizer is slower for the Linking challenge dataset. Turns out this is down to how the datasets were prepared. This explains a number of key things:
So my takeaway from this would be: |
tomolopolis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - couple of typos, logger things and a broader registry comment
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class OnlyPrimaryNamesLinker(Linker): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for brevity - consider renaming this to PrimaryNameLinker or PNameLinker?
| in StatusTypes.PRIMARY_STATUS and | ||
| cnf_l.filters.check_filters(cui))] | ||
| if not primary_cuis: | ||
| logger.info("No pimary CUIs for name %s", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo for primary
| return | ||
| if len(primary_cuis) > 1: | ||
| logger.info( | ||
| "Ambiguous pimary CUIs for name %s: %s", name, primary_cuis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here. Also should these info statements be debug?
medcat-v2/medcat/components/types.py
Outdated
| # primary name only | ||
| "primary_name_only_linker": ( | ||
| "medcat.components.linking.only_primary_name_linker", | ||
| "OnlyPrimaryNamesLinker.create_new_component"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about the comp_registry generally - should it just accept the comp class, the module will be included in that, and when won't it be clazz.create_new_component
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I've set it up like this is for lazy loading. Don't want to import all the components and their internals if we aren't going to be using them.
There is the option to register classes - and that's what you do with a custom component. But the assumption there is that you're registering it in order to immediately use it.
This PR adds a faster linker to the mix.
This faster linker (
primary_name_only_linker) is designed to link names only ifa) There's 1 suitable concept
b) There's 1 concept that considers the name a primary name
This results in faster linking. But it's also likely to reduce performance in cases where disambiguation is needed.
I ran a few performance / speed tests to look at the throughput and performance tradeoffs:
As we can see, for the COMETA dataset, there's a clear benefit in running the faster componetns (tested for both regex tokenizer and this new faster linker). You can improve throughput by an order of magnitude! And the performance benefit isn't that big (up to around 10% in recall - no change in precision).
However, the Linking Challenge dataset shows that the situation is quite a bit more nuanced. In this case, the regex tokenizer results in slower execution than its spacy counterpart. I'm not entirely sure what the underlying cause is here (because the
regextokenizer creates aroudn 25% fewer entities accross the dataset). But it's a good example of having to tailor the config to the specific usecase.EDIT: See comment below for why the linking challenge dataset isn't seeing speedup from the regex based tokenizer as well as some overall inference speed investigation.