## Preprocessing of BioText data

As mentioned, biomedical text has several *nuances* that must be taken into account, such as abbreviations, domain-specific vocabulary, and synonyms. Pre-processing is therefore essential for achieving good results on biomedical text. The *NLPre* library can be used for this purpose.
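As a rough illustration of one such pre-processing step, the sketch below expands abbreviations that a text defines in parentheses. It uses only the standard library and is not NLPre's API; the function name `expand_parenthetical_abbreviations` and the sample sentence are assumptions made for this example.

```python
import re

def expand_parenthetical_abbreviations(text):
    """Replace uses of an abbreviation with the long form defined as 'long form (ABBR)'."""
    definitions = {}
    # Look for upper-case abbreviations in parentheses, e.g. "(CKD)".
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words_before = text[:match.start()].split()
        candidate = words_before[-len(abbr):]
        # Accept the candidate only if its initials spell the abbreviation.
        if len(candidate) == len(abbr) and all(
            word[0].lower() == char.lower() for word, char in zip(candidate, abbr)
        ):
            definitions[abbr] = " ".join(candidate)
    for abbr, long_form in definitions.items():
        # Drop the parenthetical definition, then expand standalone uses.
        text = text.replace(f" ({abbr})", "")
        text = re.sub(rf"\b{re.escape(abbr)}\b", long_form, text)
    return text

sample = "Patient has chronic kidney disease (CKD). CKD was diagnosed in 2015."
print(expand_parenthetical_abbreviations(sample))
```

The initials check is deliberately strict: it misses abbreviations that skip words or use inner letters, which a dedicated library would need to handle.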

## Rule-based part-of-speech tagging of biomedical text

The medspaCy library can be used to tag biomedical text by means of rules or models. The following example shows how to use the library with several target rules, each of which labels a term (`stroke`, `diabetes`, `pna`) as a `CONDITION`.

In [14]:
import medspacy
from medspacy.ner import TargetRule

nlp = medspacy.load()
print(nlp.pipe_names)

nlp.get_pipe('target_matcher').add([TargetRule('stroke', 'CONDITION'), TargetRule('diabetes', 'CONDITION'), TargetRule('pna', 'CONDITION')])
doc = nlp('Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna.')

# for ent in doc.ents:
#     print(ent, ent._.is_negated, ent._.is_family, ent._.is_historical)
medspacy.visualization.visualize_ent(doc, jupyter=True)
medspacy.visualization.visualize_dep(doc, jupyter=True)

['sentencizer', 'target_matcher', 'context']




![Caption](../Figures/bio_pos.PNG)
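To make the idea behind the target rules more concrete, here is a minimal standard-library sketch of literal, case-insensitive matching. It is not how medspaCy is implemented (medspaCy matches on spaCy tokens, not raw characters); the `Match` tuple and `match_literals` function are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple

class Match(NamedTuple):
    text: str
    label: str
    start: int
    end: int

def match_literals(text, rules):
    """Return labelled character spans for every case-insensitive literal hit."""
    matches = []
    for literal, label in rules:
        pattern = re.compile(rf"\b{re.escape(literal)}\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            matches.append(Match(m.group(0), label, m.start(), m.end()))
    # Report hits in reading order.
    return sorted(matches, key=lambda match: match.start)

rules = [("stroke", "CONDITION"), ("diabetes", "CONDITION"), ("pna", "CONDITION")]
text = "Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna."
for hit in match_literals(text, rules):
    print(hit.text, hit.label, hit.start, hit.end)
```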

## Model-based biomedical tagging compared to rule-based methods
Example from: 
https://github.com/Melbourne-BMDS/mimic34md2020_materials/blob/45ff27874d211795c4cf525f85201692aec13809/notebooks/nlp-04-machine-learning-ner.ipynb

In [4]:
# Rule-based method
import medspacy
from medspacy.ner import TargetRule
from medspacy.visualization import visualize_ent, visualize_dep

# Load medspacy model
nlp = medspacy.load()

text = """
Past Medical History:
1. Atrial fibrillation
2. Type II Diabetes Mellitus

Assessment and Plan:
There is no evidence of pneumonia. Continue warfarin for Afib. Follow up for management of type 2 DM.
"""

# Add rules for target concept extraction
target_matcher = nlp.get_pipe("target_matcher")
target_rules = [
    TargetRule("atrial fibrillation", "PROBLEM"),
    TargetRule("atrial fibrillation", "PROBLEM", pattern=[{"LOWER": "afib"}]),
    TargetRule("pneumonia", "PROBLEM"),
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ]),
    TargetRule("warfarin", "MEDICATION")
]
target_matcher.add(target_rules)

doc = nlp(text)
visualize_ent(doc)
visualize_dep(doc)



![Caption](../Figures/bio_pos2.PNG)

In [4]:
# Pre-trained, model-based POS
nlp = medspacy.load("en_info_3700_i2b2_2012", enable=['sentencizer', 'tagger', 'parser',
                                                      'ner', 'target_matcher', 'context',
                                                     'sectionizer'])

ner = nlp.get_pipe("ner")
ner.labels

text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm."""
doc = nlp(text)
visualize_ent(doc)



![Caption](../Figures/bio_pos_model.PNG)

## Context and relationship tagging from biomedical text based on POS

The medspaCy library provides an example in which, starting from entities tagged with `target_rules`, contextual relationships (such as family history or negation) are assigned to those entities.

In [11]:
from medspacy.context import ConTextComponent, ConTextRule
from medspacy.visualization import visualize_ent, visualize_dep
from medspacy.ner import TargetMatcher, TargetRule

nlp = medspacy.load(enable=["sentencizer"])
target_matcher = TargetMatcher(nlp)


target_rules1 = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

nlp.add_pipe(target_matcher)
target_matcher.add(target_rules1)

context = ConTextComponent(nlp, rules="default")
nlp.add_pipe(context)

doc = nlp("Mother with stroke at age 82.")

visualize_ent(doc)
visualize_dep(doc)
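The following standard-library sketch illustrates the general idea of context tagging: trigger phrases found in the same sentence as a target entity are attached to it as modifiers. It is a simplification and not medspaCy's ConText implementation, which also handles direction and scope; the `TRIGGERS` table and the `assign_context` function are assumptions made for this example.

```python
import re

# Hypothetical trigger phrases; a real rule set is far larger.
TRIGGERS = {
    "no evidence of": "NEGATED_EXISTENCE",
    "mother": "FAMILY",
    "hx of": "HISTORICAL",
}

def assign_context(text, targets):
    """Attach to each target the categories of the triggers found in its sentence."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        modifiers = sorted(
            category
            for phrase, category in TRIGGERS.items()
            if re.search(rf"\b{re.escape(phrase)}\b", lowered)
        )
        for target in targets:
            if re.search(rf"\b{re.escape(target.lower())}\b", lowered):
                results.append((target, modifiers))
    return results

text = "Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna."
targets = ["stroke", "diabetes", "pna"]
for target, modifiers in assign_context(text, targets):
    print(target, modifiers)
```

Scoping modifiers to the whole sentence keeps the sketch short, but it would wrongly attach a trigger to every target in a sentence that mentions several entities.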



### Parser for COVID text analysis

In [5]:
import cov_bsv
from medspacy.visualization import visualize_dep


nlp = cov_bsv.load(model="default", load_rules=True)
text = """
Patient presents to rule out COVID-19. 
His wife recently tested positive for novel coronavirus.

COVID-19 results pending.
"""

doc = nlp(text)

cov_bsv.visualize_doc(doc)
visualize_dep(doc)



### Usage of the Med7 model with spaCy

In [1]:
import spacy

med7 = spacy.load("en_core_med7_lg")

# create distinct colours for labels
col_dict = {}
seven_colours = ['#e6194B', '#3cb44b', '#ffe119', '#ffd8b1', '#f58231', '#f032e6', '#42d4f4']
for label, colour in zip(med7.pipe_labels['ner'], seven_colours):
    col_dict[label] = colour

options = {'ents': med7.pipe_labels['ner'], 'colors':col_dict}

text = 'A patient was prescribed Magnesium hydroxide 400mg/5ml suspension PO of total 30ml bid for the next 5 days.'
doc = med7(text)

spacy.displacy.render(doc, style='ent', jupyter=True, options=options)

[(ent.text, ent.label_) for ent in doc.ents]


[('Magnesium hydroxide', 'DRUG'),
 ('400mg/5ml', 'DOSAGE'),
 ('suspension', 'FORM'),
 ('PO', 'ROUTE'),
 ('30ml', 'DOSAGE'),
 ('bid', 'FREQUENCY'),
 ('for the next 5 days', 'DURATION')]

![Caption](../Figures/med7pos.PNG)
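Once entities are extracted, a common next step is to collect them into a structured record. The sketch below groups the `(text, label)` pairs from the output above by label. It needs only the standard library; `group_by_label` is a hypothetical helper, not part of Med7 or spaCy.

```python
from collections import defaultdict

# Entity tuples copied from the Med7 output above.
entities = [
    ("Magnesium hydroxide", "DRUG"),
    ("400mg/5ml", "DOSAGE"),
    ("suspension", "FORM"),
    ("PO", "ROUTE"),
    ("30ml", "DOSAGE"),
    ("bid", "FREQUENCY"),
    ("for the next 5 days", "DURATION"),
]

def group_by_label(ents):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in ents:
        grouped[label].append(text)
    return dict(grouped)

prescription = group_by_label(entities)
for label, texts in prescription.items():
    print(f"{label}: {texts}")
```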

### Stanza library usage

In [3]:
import stanza

stanza.download('en', package='craft')
nlp = stanza.Pipeline('en', package='craft')
doc = nlp('A single-cell transcriptomic atlas characterizes ageing tissues in the mouse.')
doc.sentences[0].print_dependencies()

2021-04-06 19:58:41 INFO: Downloading these customized packages for language: en (English)...
| Processor | Package |
-----------------------
| tokenize  | craft   |
| pos       | craft   |
| lemma     | craft   |
| depparse  | craft   |
| pretrain  | craft   |


('A', 6, 'det')
('single', 4, 'amod')
('-', 4, 'punct')
('cell', 6, 'compound')
('transcriptomic', 6, 'amod')
('atlas', 7, 'nsubj')
('characterizes', 0, 'root')
('ageing', 9, 'compound')
('tissues', 7, 'obj')
('in', 12, 'case')
('the', 12, 'det')
('mouse', 7, 'obl')
('.', 7, 'punct')
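Each triple printed above is `(word, head index, relation)`, where the head index is 1-based and `0` marks the root. As a small sketch of how to read this output, the code below regroups the triples by head word. It uses only the standard library; `children_by_head` is a hypothetical helper, and keying by word text assumes that no word is repeated in the sentence.

```python
from collections import defaultdict

# Triples copied from the print_dependencies() output above.
dependencies = [
    ("A", 6, "det"), ("single", 4, "amod"), ("-", 4, "punct"),
    ("cell", 6, "compound"), ("transcriptomic", 6, "amod"),
    ("atlas", 7, "nsubj"), ("characterizes", 0, "root"),
    ("ageing", 9, "compound"), ("tissues", 7, "obj"),
    ("in", 12, "case"), ("the", 12, "det"), ("mouse", 7, "obl"),
    (".", 7, "punct"),
]

def children_by_head(deps):
    """Map each head word to the (dependent, relation) pairs attached to it."""
    children = defaultdict(list)
    for word, head, relation in deps:
        # Head indices are 1-based; 0 denotes the artificial root.
        head_word = "ROOT" if head == 0 else deps[head - 1][0]
        children[head_word].append((word, relation))
    return dict(children)

tree = children_by_head(dependencies)
print(tree["ROOT"])
print(tree["characterizes"])
print(tree["atlas"])
```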


In [5]:
# download and initialize a mimic pipeline with an i2b2 NER model
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
# annotate clinical text
doc = nlp('The patient had a sore throat and was treated with Cepacol lozenges.')
# print out all entities
for ent in doc.entities:
    print(f'{ent.text}\t{ent.type}')

2021-04-06 20:14:10 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| forward_charlm  | mimic   |
| backward_charlm | mimic   |
| pretrain        | mimic   |


a sore throat	PROBLEM
Cepacol lozenges	TREATMENT


## BioBERT Transformer and Embeddings

- https://github.com/dmis-lab/biobert-pytorch
- https://github.com/huggingface/transformers
- https://github.com/ThilinaRajapakse/simpletransformers
- https://overfitter.github.io/2020-04-04-github-biobert-embeddings/

In [18]:
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('monologg/biobert_v1.1_pubmed')
tokenizer = BertTokenizer.from_pretrained('monologg/biobert_v1.1_pubmed')

text = 'NLM websites (primarily MedlinePlus and PubMed) account for 55% to 80% of total NIH website usage depending on the metric used. In turn, NIH.gov top-level domain usage (inclusive of NLM) ranks second only behind WebMD in the US domestic home health information market and ranks first on a global basis. NIH.gov consistently ranks among the top three or four US government top-level domains based on global Web usage. On a site-specific basis, the top health information websites in terms of global usage appear to be WebMD, MSN Health, PubMed, Yahoo! Health, AOL Health, and MedlinePlus. Based on MedlinePlus Web log data and external Internet audience measurement data, the three most heavily used cancer-centric websites appear to be www.cancer.gov (National Cancer Institute), www.cancer.org (American Cancer Society), and www.breastcancer.org (non-profit organization)'
tokens_pt = tokenizer(text, return_tensors="pt")
for key, value in tokens_pt.items():
    print("{}:\n\t{}".format(key, value))
    
single_seg_input = tokenizer(text)
print("Single segment token (str): {}".format(tokenizer.convert_ids_to_tokens(single_seg_input['input_ids'])))




input_ids:
	tensor([[  101, 21239,  2107, 12045,   113,  3120,  2508, 28054,  2042,  2101,
          5954,  1105, 21385,  2107,  1174,   114,  3300,  1111,  3731,   110,
          1106,  2908,   110,  1104,  1703,   151,  2240,  3048,  3265,  7991,
          5763,  1113,  1103, 12676,  1215,   119,  1130,  1885,   117,   151,
          2240,  3048,   119,  1301,  1964,  1499,   118,  1634,  5777,  7991,
           113, 21783,  1104, 21239,  2107,   114,  6496,  1248,  1178,  1481,
          9059, 18219,  1107,  1103,  1646,  4500,  1313,  2332,  1869,  2319,
          1105,  6496,  1148,  1113,   170,  4265,  3142,   119,   151,  2240,
          3048,   119,  1301,  1964, 10887,  6496,  1621,  1103,  1499,  1210,
          1137,  1300,  1646,  1433,  1499,   118,  1634, 13770,  1359,  1113,
          4265,  9059,  7991,   119,  1212,   170,  1751,   118,  2747,  3142,
           117,  1103,  1499,  2332,  1869, 12045,  1107,  2538,  1104,  4265,
          7991,  2845,  1106,  1129,  90
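The `input_ids` above are vocabulary indices of WordPiece sub-tokens, which is why a single word can map to several ids. The sketch below shows the greedy longest-match-first splitting that WordPiece tokenizers generally use. It is a simplified standard-library illustration, not the `BertTokenizer` implementation, and `toy_vocab` is a made-up, lower-case vocabulary rather than the BioBERT one.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink it until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No sub-token matches: the whole word becomes unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"med", "##line", "##plus", "pub", "##med"}
print(wordpiece("medlineplus", toy_vocab))
print(wordpiece("pubmed", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```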

## References

- https://github.com/Melbourne-BMDS/mimic34md2020_materials/blob/45ff27874d211795c4cf525f85201692aec13809/notebooks/nlp-04-machine-learning-ner.ipynb

- https://github.com/medspacy/medspacy

- https://spacy.io/universe/project/medspacy

- http://www.lesfleursdunormal.fr/static/informatique/pymedtermino/index_en.html

- https://stanfordnlp.github.io/stanza/available_biomed_models.html

- https://scispacy.apps.allenai.org/

COVID

- https://github.com/abchapman93/VA_COVID-19_NLP_BSV
