## Preprocessing of BioText data

As mentioned, biomedical text has several *nuances* that must be taken into account, such as abbreviations, domain-specific vocabulary, and synonyms. Pre-processing the text is therefore essential for achieving good results. The *NLPre* library can be used for this purpose.
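
To make the preprocessing idea concrete, here is a minimal sketch of one such step: finding abbreviations that are defined in parentheses and expanding their later uses. It relies only on the Python standard library and does not use NLPre's API. The function names `find_parenthetical_abbreviations` and `expand_abbreviations` are hypothetical, and the heuristic (matching the initials of the preceding words) is an assumption made for illustration.

```python
import re


def find_parenthetical_abbreviations(text):
    """Map abbreviations defined as 'long form (ABBR)' to their long form.

    Heuristic (an assumption for this sketch): the abbreviation's letters must
    match the initials of the words immediately preceding the parenthesis.
    """
    abbreviations = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        preceding_words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = preceding_words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == abbr:
            abbreviations[abbr] = " ".join(candidate)
    return abbreviations


def expand_abbreviations(text, abbreviations):
    """Remove the parenthetical definitions and expand later standalone uses."""
    for abbr, long_form in abbreviations.items():
        # Drop the definition itself: 'long form (ABBR)' -> 'long form'.
        text = text.replace(f" ({abbr})", "")
        # Replace the remaining whole-word occurrences of the abbreviation.
        text = re.sub(rf"\b{re.escape(abbr)}\b", long_form, text)
    return text


sample = (
    "The patient has chronic kidney disease (CKD) and atrial fibrillation (AF). "
    "CKD was staged last year; AF is treated with warfarin."
)
found = find_parenthetical_abbreviations(sample)
cleaned = expand_abbreviations(sample, found)
print(found)
print(cleaned)
```

A heuristic like this misses abbreviations that are never defined in the text (for example `pna` or `hx` in clinical notes), which is one reason to use a dedicated library or a curated dictionary.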

## Rule-based Part-of-Speech Tagging of Biomedical Text

The medspaCy library can be used to perform POS tagging on biomedical text by means of rules or models. The following example shows how to use the library with several rules.

In [9]:
import medspacy
from medspacy.ner import TargetRule

nlp = medspacy.load()
print(nlp.pipe_names)

nlp.get_pipe('target_matcher').add([TargetRule('stroke', 'CONDITION'), TargetRule('diabetes', 'CONDITION'), TargetRule('pna', 'CONDITION')])
doc = nlp('Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna.')

# for ent in doc.ents:
#     print(ent, ent._.is_negated, ent._.is_family, ent._.is_historical)
medspacy.visualization.visualize_ent(doc)
medspacy.visualization.visualize_dep(doc)


['sentencizer', 'target_matcher', 'context']


## Model-based Part-of-Speech Tagging Compared to Rule-based Methods
Example from: 
https://github.com/Melbourne-BMDS/mimic34md2020_materials/blob/45ff27874d211795c4cf525f85201692aec13809/notebooks/nlp-04-machine-learning-ner.ipynb

In [4]:
# Rule-based method
import medspacy
from medspacy.ner import TargetRule
from medspacy.visualization import visualize_ent, visualize_dep

# Load medspacy model
nlp = medspacy.load()

text = """
Past Medical History:
1. Atrial fibrillation
2. Type II Diabetes Mellitus

Assessment and Plan:
There is no evidence of pneumonia. Continue warfarin for Afib. Follow up for management of type 2 DM.
"""

# Add rules for target concept extraction
target_matcher = nlp.get_pipe("target_matcher")
target_rules = [
    TargetRule("atrial fibrillation", "PROBLEM"),
    TargetRule("atrial fibrillation", "PROBLEM", pattern=[{"LOWER": "afib"}]),
    TargetRule("pneumonia", "PROBLEM"),
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ]),
    TargetRule("warfarin", "MEDICATION")
]
target_matcher.add(target_rules)

doc = nlp(text)
visualize_ent(doc)
visualize_dep(doc)
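
The token pattern used for "Type II Diabetes Mellitus" above can be hard to read at first. The following is a minimal, standard-library sketch of how such a pattern is evaluated. It is not spaCy's or medspaCy's matcher: it supports only the `LOWER`, `IN`, and `"OP": "?"` keys that appear in the rule, it matches greedily without backtracking, and the helper names (`token_matches`, `match_at`, `find_matches`) are hypothetical.

```python
import re

# Same pattern as the TargetRule above: 'type', then 2/ii/two, then dm/diabetes,
# then an optional 'mellitus'.
PATTERN = [
    {"LOWER": "type"},
    {"LOWER": {"IN": ["2", "ii", "two"]}},
    {"LOWER": {"IN": ["dm", "diabetes"]}},
    {"LOWER": "mellitus", "OP": "?"},
]


def token_matches(token, spec):
    """Check one token against one pattern element (LOWER only)."""
    expected = spec["LOWER"]
    lowered = token.lower()
    if isinstance(expected, dict):
        return lowered in expected["IN"]
    return lowered == expected


def match_at(tokens, start, pattern):
    """Return the end index of a greedy match starting at `start`, or None."""
    position = start
    for spec in pattern:
        optional = spec.get("OP") == "?"
        if position < len(tokens) and token_matches(tokens[position], spec):
            position += 1
        elif not optional:
            return None
    return position


def find_matches(text, pattern):
    """Return matched phrases, scanning left to right without overlaps."""
    tokens = re.findall(r"\w+", text)
    matches, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is None:
            i += 1
        else:
            matches.append(" ".join(tokens[i:end]))
            i = end
    return matches


text = "2. Type II Diabetes Mellitus. Follow up for management of type 2 DM."
print(find_matches(text, PATTERN))
```

Both surface forms in the note are captured by the single pattern, which is why the rule assigns them the same literal and the same `PROBLEM` category.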


In [4]:
# Pre-trained, model-based POS
nlp = medspacy.load("en_info_3700_i2b2_2012", enable=['sentencizer', 'tagger', 'parser',
                                                      'ner', 'target_matcher', 'context',
                                                     'sectionizer'])

# List the entity labels that the pre-trained NER component can predict
ner = nlp.get_pipe("ner")
print(ner.labels)

text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm."""
doc = nlp(text)
visualize_ent(doc)


## Context and relationship tagging from biomedical text based on POS

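The medspaCy library includes an example in which, starting from a set of target rules, contextual relationships are assigned to the entities tagged in the text, for instance whether a condition is negated, historical, or attributed to a family member.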
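
As a rough illustration of the idea (not medspaCy's ConText implementation), the sketch below sets negation, family, and historical flags on target terms whenever a trigger phrase occurs in the same sentence. The trigger lists, the sentence-level scope, and the function name `tag_context` are assumptions chosen for this example.

```python
import re

# Hypothetical target terms and trigger phrases for this sketch.
TARGETS = {"stroke": "CONDITION", "diabetes": "CONDITION", "pna": "CONDITION"}
TRIGGERS = {
    "is_negated": ["no evidence of", "denies"],
    "is_family": ["mother", "father"],
    "is_historical": ["hx of", "history of"],
}


def tag_context(text):
    """Return one dict per target mention with sentence-scoped context flags."""
    results = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        for target, label in TARGETS.items():
            if not re.search(rf"\b{re.escape(target)}\b", lowered):
                continue
            entity = {"text": target, "label": label}
            for attribute, phrases in TRIGGERS.items():
                # A flag is set when any of its trigger phrases is in the sentence.
                entity[attribute] = any(
                    re.search(rf"\b{re.escape(phrase)}\b", lowered)
                    for phrase in phrases
                )
            results.append(entity)
    return results


note = "Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna."
for entity in tag_context(note):
    print(entity)
```

A sentence-wide scope is a simplification: a trigger anywhere in the sentence affects every target in it, regardless of word order or distance.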

In [11]:
from medspacy.context import ConTextComponent, ConTextRule
from medspacy.visualization import visualize_ent, visualize_dep
from medspacy.ner import TargetMatcher, TargetRule

nlp = medspacy.load(enable=["sentencizer"])
target_matcher = TargetMatcher(nlp)


target_rules1 = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

nlp.add_pipe(target_matcher)
target_matcher.add(target_rules1)

context = ConTextComponent(nlp, rules="default")
nlp.add_pipe(context)

doc = nlp("Mother with stroke at age 82.")

visualize_ent(doc)
visualize_dep(doc)


### Parser for COVID text analysis

In [12]:
import cov_bsv
from cov_bsv.knowledge_base import context_rules

nlp_cov = cov_bsv.load(model="default", load_rules=True, enable=['tagger', 'parser', 'concept_tagger', 'target_matcher', 'sectionizer', 'postprocessor', 'document_classifier'])
context = ConTextComponent(nlp_cov,  # Build the component on the COVID pipeline, not the earlier `nlp`
                           rules=None, # Don't load the default cycontext rules
                           add_attrs=cov_bsv.util.CONTEXT_MAPPING # Mapping of modifiers to attribute values
                          )
# Add the COVID-specific context rules
context.add(context_rules)

nlp_cov.add_pipe(context)

text = """
Patient presents to rule out COVID-19.
His wife recently tested positive for novel coronavirus.

COVID-19 results pending.
"""

doc = nlp_cov(text)

cov_bsv.visualize_doc(doc)
visualize_dep(doc)


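In this environment the cell fails with the error below. It most likely indicates that the installed spaCy version does not match the version the underlying model was built for, since the model's metadata refers to a `tok2vec` component that the installed spaCy does not provide.

KeyError: "[E002] Can't find factory for 'tok2vec'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['tok2vec']` or remove it from the model meta and add it via `nlp.add_pipe` instead."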

## BioBERT

- https://github.com/dmis-lab/biobert-pytorch
- https://github.com/huggingface/transformers
- https://github.com/ThilinaRajapakse/simpletransformers

## References

- https://github.com/Melbourne-BMDS/mimic34md2020_materials/blob/45ff27874d211795c4cf525f85201692aec13809/notebooks/nlp-04-machine-learning-ner.ipynb

- https://github.com/medspacy/medspacy

- https://spacy.io/universe/project/medspacy

COVID

- https://github.com/abchapman93/VA_COVID-19_NLP_BSV
