SciSpacy plugin compatibility #425

hrshdhgd · 2022-12-23T22:14:55Z

Added model as TextAnnotationConfiguration attribute.
Update annotate() to receive scispacy model names for NLP. Here's the list:

Available SciSpacy models

en_ner_craft_md: A spaCy NER model trained on the CRAFT corpus.
en_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus.
en_ner_bc5cdr_md: A spaCy NER model trained on the BC5CDR corpus.
en_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus.
en_core_sci_scibert: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.
en_core_sci_sm: A full spaCy pipeline for biomedical data.
en_core_sci_md: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.
en_core_sci_lg: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.

Available spaCy models: English pipelines. The sm, md, and lg pipelines are optimized for CPU; en_core_web_trf is transformer-based.

en_core_web_sm: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_md: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_lg: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_trf: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

EntityLinkers:

umls: Links to the Unified Medical Language System, levels 0, 1, 2 and 9. This has ~3M concepts.
mesh: Links to the Medical Subject Headings.
rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs.
go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
hpo: Links to the Human Phenotype Ontology.
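
A minimal sketch of how a model name carried on the configuration might be resolved before loading. `AnnotationConfigSketch`, `resolve_model`, and the default of `en_core_sci_sm` are hypothetical names and choices made for illustration; they are not the actual OAK classes or behaviour. The set of names is copied from the lists above.

```python
from dataclasses import dataclass
from typing import Optional

# Model names copied from the lists above.
KNOWN_MODELS = frozenset(
    {
        "en_ner_craft_md",
        "en_ner_jnlpba_md",
        "en_ner_bc5cdr_md",
        "en_ner_bionlp13cg_md",
        "en_core_sci_scibert",
        "en_core_sci_sm",
        "en_core_sci_md",
        "en_core_sci_lg",
        "en_core_web_sm",
        "en_core_web_md",
        "en_core_web_lg",
        "en_core_web_trf",
    }
)

# Assumption: the fallback model is chosen here only for the example.
DEFAULT_MODEL = "en_core_sci_sm"


@dataclass
class AnnotationConfigSketch:
    """Hypothetical stand-in for TextAnnotationConfiguration with a model slot."""

    model: Optional[str] = None


def resolve_model(config: AnnotationConfigSketch, default: str = DEFAULT_MODEL) -> str:
    """Return the configured model name, or the default when none is set."""
    name = config.model or default
    if name not in KNOWN_MODELS:
        raise ValueError(f"Unknown model name: {name!r}")
    return name
```

The resolved name could then be handed to `spacy.load(name)`, provided the corresponding model package is installed.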

codecov-commenter · 2022-12-23T22:39:35Z

Codecov Report

Base: 80.09% // Head: 80.09% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (562677d) compared to base (429cc1d).
Patch coverage: 71.42% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #425      +/-   ##
==========================================
- Coverage   80.09%   80.09%   -0.01%     
==========================================
  Files         194      194              
  Lines       21055    21062       +7     
==========================================
+ Hits        16864    16869       +5     
- Misses       4191     4193       +2

Impacted Files	Coverage Δ
src/oaklib/interfaces/text_annotator_interface.py	`74.57% <ø> (ø)`
src/oaklib/cli.py	`60.08% <66.66%> (+0.01%)`	⬆️
src/oaklib/datamodels/text_annotator.py	`80.18% <75.00%> (-0.10%)`	⬇️


cmungall

I think you need to update the PR description?

In this particular case, I prefer extending TextAnnotatorConfig to include a model parameter with a range of string. Having a named pre-trained model is not specific to SciSpacy; any annotator that uses ML rather than string matching is using some pre-trained model.

In future we may still want a way of passing in genuinely plugin-specific parameters, but this can be done with a generic key-value list, rather than passing in a filename string that the plugin is expected to parse.
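
One way the generic key-value list could look, as a rough sketch only. `parse_plugin_params` and the `key=value` string syntax are assumptions made for illustration; nothing in this PR defines that format.

```python
from typing import Dict, Iterable


def parse_plugin_params(pairs: Iterable[str]) -> Dict[str, str]:
    """Turn strings such as 'linker=umls' into a plain dict.

    Hypothetical helper: values are kept as strings, and interpreting them
    is left to whichever plugin receives the dict.
    """
    params: Dict[str, str] = {}
    for pair in pairs:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"Expected key=value, got: {pair!r}")
        params[key] = value.strip()
    return params
```

With this shape, the core configuration stays generic and each plugin reads only the keys it understands.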

cmungall · 2023-01-03T19:18:46Z

I'm also reading allenai/scispacy#463

From the perspective of OAK, if I request annotation using HP (e.g. -i scispacy:hp) then I want HP IDs back.

The simplest thing to do in the short term is to simply ignore the linker and treat scispacy purely as a UMLS annotator. The user can then do their own mapping and filtering (e.g. using the MappingProviderInterface and existing implementations like the SRI NodeNormalizer, see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.61)
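
The do-your-own-mapping-and-filtering step could be sketched roughly as follows. This does not use the MappingProviderInterface or the NodeNormalizer; `Annotation`, `map_and_filter`, and the mapping table are hypothetical stand-ins, and the IDs are made-up placeholders rather than real UMLS or HP identifiers.

```python
from typing import Dict, Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical minimal annotation: a concept ID plus a text span."""

    object_id: str
    start: int
    end: int


def map_and_filter(
    annotations: Iterable[Annotation],
    mappings: Dict[str, List[str]],
    prefix: str,
) -> List[Annotation]:
    """Rewrite each annotation to its mapped IDs, keeping only the given prefix."""
    results: List[Annotation] = []
    for ann in annotations:
        for target in mappings.get(ann.object_id, []):
            if target.startswith(prefix + ":"):
                results.append(ann._replace(object_id=target))
    return results


# Placeholder data: these IDs are invented for the example.
PLACEHOLDER_MAPPINGS = {
    "UMLS:C9999991": ["HP:9999991", "MESH:D999991"],
    "UMLS:C9999992": ["MESH:D999992"],
}
umls_annotations = [
    Annotation("UMLS:C9999991", 0, 5),
    Annotation("UMLS:C9999992", 10, 18),
]
hp_annotations = map_and_filter(umls_annotations, PLACEHOLDER_MAPPINGS, "HP")
```

Annotations with no mapping into the requested prefix are dropped, which is the filtering half of the step.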

We could have the scispacy plugin also do the mapping, such that if I said scispacy:hp it would use the hpo linker, and then behind the scenes map the UMLS IDs back to HPO IDs. But I think this makes the plugin more complicated as it would have to rely on some additional service, and doesn't ultimately buy us much.

Note that later on we will extend the annotator framework to take in arbitrary value sets (see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.58), and we can revisit using the linker here.

cmungall · 2023-01-10T22:32:37Z

src/oaklib/cli.py

@@ -1330,6 +1337,11 @@ def annotate(
        if exclude_tokens:
            token_exclusion_list = get_exclusion_token_list(exclude_tokens)
            configuration.token_exclusion_list = token_exclusion_list
+        if model:
+            configuration.model = model
+        # if plugin_config:


no point keeping around

hrshdhgd added 3 commits December 23, 2022 16:12

Added scispacy_model_name

eb3a8cb

Added scispacy_model_name

7ee548b

version bump for debugging

532daa5

hrshdhgd added 9 commits December 23, 2022 16:41

unnecessary docstring

4ed4941

added models and entityLinkers as enums

16b7759

added linker cli shortcut -l

575c187

undo 2 extra slots and added 1

eee606f

cleanup

85bdb2e

rolled back entirely

0f6146f

missed commenting one line

15117af

reintroduced plugin config

9457ed3

ran make py

4ac82a0

hrshdhgd marked this pull request as ready for review December 29, 2022 16:25

hrshdhgd requested a review from cmungall December 29, 2022 16:25

hrshdhgd added 5 commits December 29, 2022 10:26

rollled back version

b311984

temp change in version #

991364b

rolled back version

78b7237

temp version bump

9803fd5

version reset to 0

250140a

cmungall requested changes Jan 3, 2023

hrshdhgd added 5 commits January 3, 2023 13:29

added model in onfig a version bump

8a2e3bc

rolled back version

ca12d06

added model in CLI

8618aef

added -m

d41d568

rolled back version

562677d

cmungall reviewed Jan 10, 2023

cmungall approved these changes Jan 10, 2023

hrshdhgd merged commit 5d39eb8 into main Jan 10, 2023

hrshdhgd deleted the spacy-implementation branch January 10, 2023 22:34
