SciSpacy plugin compatibility #425

Merged
merged 22 commits into from Jan 10, 2023


Collaborator

@hrshdhgd hrshdhgd commented Dec 23, 2022

  • Added model as a TextAnnotationConfiguration attribute.
  • Updated annotate() to accept SciSpacy and spaCy model names for NLP. Here's the list:

Available SciSpacy models

  1. en_ner_craft_md: A spaCy NER model trained on the CRAFT corpus.
  2. en_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus.
  3. en_ner_bc5cdr_md: A spaCy NER model trained on the BC5CDR corpus.
  4. en_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus.
  5. en_core_sci_scibert: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.
  6. en_core_sci_sm: A full spaCy pipeline for biomedical data.
  7. en_core_sci_md: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.
  8. en_core_sci_lg: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.

Available spaCy models: English pipelines optimized for CPU.

  1. en_core_web_sm: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
  2. en_core_web_md: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
  3. en_core_web_lg: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
  4. en_core_web_trf: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

EntityLinkers:

  1. umls: Links to the Unified Medical Language System, levels 0, 1, 2 and 9. This has ~3M concepts.
  2. mesh: Links to the Medical Subject Headings.
  3. rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs.
  4. go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
  5. hpo: Links to the Human Phenotype Ontology.
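A minimal sketch of how one of the model names above might be combined with an entity linker. This is not the plugin's actual code: `validate_choice` and `build_pipeline` are hypothetical helpers, and the `scispacy_linker` call follows the usage shown in the scispacy README. It assumes that spaCy, scispacy, and the chosen model package are already installed.

```python
from typing import Optional

# Names copied from the lists in this PR description.
SCISPACY_MODELS = {
    "en_ner_craft_md", "en_ner_jnlpba_md", "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md", "en_core_sci_scibert", "en_core_sci_sm",
    "en_core_sci_md", "en_core_sci_lg",
}
SPACY_MODELS = {
    "en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf",
}
ENTITY_LINKERS = {"umls", "mesh", "rxnorm", "go", "hpo"}


def validate_choice(model: str, linker: Optional[str] = None) -> None:
    """Raise ValueError if the model or linker name is not listed above."""
    if model not in SCISPACY_MODELS | SPACY_MODELS:
        raise ValueError(f"Unknown model: {model}")
    if linker is not None and linker not in ENTITY_LINKERS:
        raise ValueError(f"Unknown entity linker: {linker}")


def build_pipeline(model: str = "en_core_sci_sm", linker: Optional[str] = "umls"):
    """Load a model and optionally attach a scispacy entity linker (hypothetical helper)."""
    validate_choice(model, linker)
    import spacy  # imported lazily so validation works without spaCy installed

    nlp = spacy.load(model)  # fails if the model package is not installed
    if linker is not None:
        # Importing this module registers the "scispacy_linker" factory.
        from scispacy.linking import EntityLinker  # noqa: F401

        nlp.add_pipe("scispacy_linker", config={"linker_name": linker})
    return nlp
```

Calling `build_pipeline("en_core_sci_md", "mesh")` would return a spaCy pipeline whose entities carry linker candidates. The first use of a linker downloads a sizeable knowledge base, so the validation step is kept separate and cheap.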

@codecov-commenter

codecov-commenter commented Dec 23, 2022

Codecov Report

Base: 80.09% // Head: 80.09% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (562677d) compared to base (429cc1d).
Patch coverage: 71.42% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #425      +/-   ##
==========================================
- Coverage   80.09%   80.09%   -0.01%     
==========================================
  Files         194      194              
  Lines       21055    21062       +7     
==========================================
+ Hits        16864    16869       +5     
- Misses       4191     4193       +2     
Impacted Files                                       Coverage Δ
src/oaklib/interfaces/text_annotator_interface.py    74.57% <ø> (ø)
src/oaklib/cli.py                                    60.08% <66.66%> (+0.01%) ⬆️
src/oaklib/datamodels/text_annotator.py              80.18% <75.00%> (-0.10%) ⬇️


@hrshdhgd hrshdhgd marked this pull request as ready for review December 29, 2022 16:25
Collaborator

@cmungall cmungall left a comment


I think you need to update the PR description?

In this particular case, I prefer extending TextAnnotatorConfig to include a model parameter with a range of string. Having a named pre-trained model is not specific to SciSpacy; any annotator that uses ML rather than string matching is using some pre-trained model.

In the future we may still want a way of passing in genuinely plugin-specific parameters, but this can be done with a generic key-value list, rather than passing in a filename string that the plugin is expected to parse.
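The design suggested above can be sketched roughly as follows. This is an illustrative stand-in, not the real OAK datamodel: only `model` and `token_exclusion_list` come from this PR, while the class name `TextAnnotationConfigurationSketch`, the `plugin_parameters` field, and the `parse_key_values` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TextAnnotationConfigurationSketch:
    """Illustrative stand-in for the annotator configuration."""

    token_exclusion_list: List[str] = field(default_factory=list)
    # Name of a pre-trained model; meaningful for any ML-based annotator.
    model: Optional[str] = None
    # Hypothetical generic key-value list for plugin-specific settings.
    plugin_parameters: Dict[str, str] = field(default_factory=dict)


def parse_key_values(pairs: List[str]) -> Dict[str, str]:
    """Turn strings such as "linker=umls" into a dict (hypothetical helper)."""
    result: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected KEY=VALUE, got: {pair!r}")
        result[key.strip()] = value.strip()
    return result


config = TextAnnotationConfigurationSketch(
    model="en_core_sci_md",
    plugin_parameters=parse_key_values(["linker=umls"]),
)
```

With this shape the plugin reads `config.model` directly and looks up anything else by key, so it never has to parse a filename-like string.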

@cmungall
Collaborator

cmungall commented Jan 3, 2023

I'm also reading allenai/scispacy#463

From the perspective of OAK, if I request annotation using HP (e.g. -i scispacy:hp) then I want HP IDs back.

The simplest thing to do in the short term is to ignore the linker and treat scispacy purely as a UMLS annotator. The user can then do their own mapping and filtering (e.g. using the MappingProviderInterface and existing implementations like the SRI NodeNormalizer, see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.61)

We could have the scispacy plugin also do the mapping, such that if I said scispacy:hp it would use the hpo linker, and then behind the scenes map the UMLS IDs back to HPO IDs. But I think this makes the plugin more complicated as it would have to rely on some additional service, and doesn't ultimately buy us much.

Note that later on we will extend the annotator framework to take in arbitrary value sets (see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.58), and we can revisit using the linker here.
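The "annotate with UMLS, then map and filter" approach described in this comment might look something like the sketch below. Everything here is hypothetical: `Annotation` and `map_and_filter` are invented for illustration, a plain dict stands in for a real mapping provider, and the identifiers are made-up placeholders rather than real UMLS or HP IDs.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Annotation(NamedTuple):
    """Minimal stand-in for an annotation result."""

    text: str
    object_id: str


def map_and_filter(
    annotations: Iterable[Annotation],
    mappings: Dict[str, List[str]],
    prefix: str,
) -> Iterator[Annotation]:
    """Yield annotations re-keyed to mapped IDs that carry the wanted prefix."""
    for ann in annotations:
        for mapped_id in mappings.get(ann.object_id, []):
            if mapped_id.startswith(prefix + ":"):
                yield Annotation(ann.text, mapped_id)


# Placeholder data only; these are not real identifiers.
umls_annotations = [
    Annotation("term one", "UMLS:PLACEHOLDER_1"),
    Annotation("term two", "UMLS:PLACEHOLDER_2"),
]
placeholder_mappings = {
    "UMLS:PLACEHOLDER_1": ["HP:PLACEHOLDER_A", "MESH:PLACEHOLDER_B"],
}

hp_annotations = list(map_and_filter(umls_annotations, placeholder_mappings, "HP"))
```

Annotations with no mapping to the requested prefix are dropped, which is the filtering behaviour a user asking for HP IDs would expect.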

@@ -1330,6 +1337,11 @@ def annotate(
     if exclude_tokens:
         token_exclusion_list = get_exclusion_token_list(exclude_tokens)
         configuration.token_exclusion_list = token_exclusion_list
     if model:
         configuration.model = model
     # if plugin_config:

no point keeping around

@hrshdhgd hrshdhgd merged commit 5d39eb8 into main Jan 10, 2023
@hrshdhgd hrshdhgd deleted the spacy-implementation branch January 10, 2023 22:34

3 participants