SciSpacy plugin compatibility #425
Codecov Report
Base: 80.09% // Head: 80.09% // Decreases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## main #425 +/- ##
==========================================
- Coverage 80.09% 80.09% -0.01%
==========================================
Files 194 194
Lines 21055 21062 +7
==========================================
+ Hits 16864 16869 +5
- Misses 4191 4193 +2
I think you need to update the PR description?
In this particular case, I prefer extending TextAnnotatorConfig to include a model parameter with a range of string. Having a named pre-trained model is not specific to SciSpacy: any annotator that uses ML rather than string matching relies on some pre-trained model.
In future we may still want a way of passing in genuinely plugin-specific parameters, but this can be done with a generic key-value list rather than by passing in a filename string that the plugin is expected to parse.
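A minimal sketch of the shape suggested above, assuming a dataclass-style configuration. The class name TextAnnotationConfigurationSketch, the plugin_options field, the parse_key_values helper, and the "threshold" key are all hypothetical and only illustrate the idea; they are not OAK's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TextAnnotationConfigurationSketch:
    """Hypothetical stand-in for the annotator configuration (not OAK's class)."""

    token_exclusion_list: List[str] = field(default_factory=list)
    # Generic: any ML-backed annotator can name a pre-trained model.
    model: Optional[str] = None
    # Assumption: plugin-specific settings travel as key-value pairs
    # instead of a filename string that the plugin has to parse.
    plugin_options: Dict[str, str] = field(default_factory=dict)


def parse_key_values(pairs: List[str]) -> Dict[str, str]:
    """Turn ["key=value", ...] into a dict, rejecting malformed entries."""
    options: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        options[key.strip()] = value.strip()
    return options


config = TextAnnotationConfigurationSketch(
    model="en_ner_craft_md",
    # "linker" and "threshold" are illustrative keys only.
    plugin_options=parse_key_values(["linker=umls", "threshold=0.8"]),
)
```

Keeping model as a first-class field and everything else in a flat dictionary means the core configuration never needs to know which plugin will consume the options.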
I'm also reading allenai/scispacy#463.
From the perspective of OAK, if I request annotation using HP (e.g.
The simplest thing to do in the short term is to ignore the linker and treat scispacy purely as a UMLS annotator. The user can then do their own mapping and filtering (e.g. using the MappingProviderInterface and existing implementations like the SRI NodeNormalizer, see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.61).
We could have the scispacy plugin also do the mapping, such that if I said
Note: later on we will extend the annotator framework to take in arbitrary value sets (see https://github.com/INCATools/ontology-access-kit/releases/tag/v0.1.58), and we can revisit using the linker here.
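The short-term approach (annotate with UMLS only, then map and filter downstream) could look roughly like the sketch below. It does not use the MappingProviderInterface or the SRI NodeNormalizer; the map_and_filter function, the plain dictionary of mappings, and the identifiers are placeholders made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# (matched text, UMLS CURIE) pairs, as a UMLS-only annotator might emit them.
Annotation = Tuple[str, str]


def map_and_filter(
    annotations: Iterable[Annotation],
    mappings: Dict[str, List[str]],
    prefix: str,
) -> Iterator[Annotation]:
    """Map UMLS IDs to other ontologies and keep only IDs with the given prefix."""
    for text, umls_id in annotations:
        for target in mappings.get(umls_id, []):
            if target.startswith(prefix + ":"):
                yield text, target


# Placeholder identifiers, not real UMLS or HP terms.
umls_hits = [("term a", "UMLS:EXAMPLE_1"), ("term b", "UMLS:EXAMPLE_2")]
example_mappings = {
    "UMLS:EXAMPLE_1": ["HP:EXAMPLE_1", "MESH:EXAMPLE_1"],
    "UMLS:EXAMPLE_2": ["GO:EXAMPLE_2"],
}

hp_only = list(map_and_filter(umls_hits, example_mappings, "HP"))
```

Annotations with no mapping into the requested ontology are dropped, which is the filtering step the comment leaves to the user.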
@@ -1330,6 +1337,11 @@ def annotate(
    if exclude_tokens:
        token_exclusion_list = get_exclusion_token_list(exclude_tokens)
        configuration.token_exclusion_list = token_exclusion_list
    if model:
        configuration.model = model
    # if plugin_config:
no point keeping around
model as TextAnnotationConfiguration attribute.
annotate() to receive scispacy model names for NLP. Here's the list:
Available SciSpacy models
en_ner_craft_md: A spaCy NER model trained on the CRAFT corpus.
en_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus.
en_ner_bc5cdr_md: A spaCy NER model trained on the BC5CDR corpus.
en_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus.
en_core_sci_scibert: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.
en_core_sci_sm: A full spaCy pipeline for biomedical data.
en_core_sci_md: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.
en_core_sci_lg: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.
Available spaCy models: English pipelines optimized for CPU.
en_core_web_sm: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_md: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_lg: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
en_core_web_trf: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.
EntityLinkers:
umls: Links to the Unified Medical Language System, levels 0, 1, 2 and 9. This has ~3M concepts.
mesh: Links to the Medical Subject Headings.
rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs.
go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
hpo: Links to the Human Phenotype Ontology.
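As a rough illustration of how a model name from the lists above might be checked before loading, here is a sketch. The check_model_and_linker and build_pipeline helpers are hypothetical and not part of this PR; spacy.load and the "scispacy_linker" pipe follow the usage documented by spaCy and scispacy, and running build_pipeline assumes both packages and the named model are installed.

```python
from typing import Optional

SCISPACY_MODELS = frozenset({
    "en_ner_craft_md", "en_ner_jnlpba_md", "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md", "en_core_sci_scibert", "en_core_sci_sm",
    "en_core_sci_md", "en_core_sci_lg",
})
SPACY_MODELS = frozenset({
    "en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf",
})
ENTITY_LINKERS = frozenset({"umls", "mesh", "rxnorm", "go", "hpo"})


def check_model_and_linker(model: str, linker: Optional[str] = None) -> str:
    """Return "scispacy" or "spacy" for a known model; raise on unknown names."""
    if linker is not None and linker not in ENTITY_LINKERS:
        raise ValueError(f"unknown entity linker: {linker!r}")
    if model in SCISPACY_MODELS:
        return "scispacy"
    if model in SPACY_MODELS:
        return "spacy"
    raise ValueError(f"unknown model: {model!r}")


def build_pipeline(model: str, linker: Optional[str] = None):
    """Load the named pipeline and optionally attach a scispacy entity linker."""
    check_model_and_linker(model, linker)
    import spacy  # imported lazily so the name check works without spaCy

    nlp = spacy.load(model)
    if linker is not None:
        # Importing EntityLinker registers the "scispacy_linker" factory.
        from scispacy.linking import EntityLinker  # noqa: F401

        nlp.add_pipe("scispacy_linker", config={"linker_name": linker})
    return nlp
```

Validating the name up front gives a clear error before spaCy attempts to locate a package that may not be installed.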