![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb)

# Pretrained_Clinical_Pipelines

## Colab Setup

In [1]:
import json

with open('workshop_license_keys_365.json') as f:
    license_keys = json.load(f)

license_keys.keys()

dict_keys(['PUBLIC_VERSION', 'JSL_VERSION', 'SECRET', 'SPARK_NLP_LICENSE', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'SPARK_OCR_LICENSE', 'SPARK_OCR_SECRET'])
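
Before running the install cell, it can help to confirm that the license file contains every key the setup reads. The following is a minimal sketch: `missing_keys` is a hypothetical helper (not part of Spark NLP), and the sample values are placeholders rather than real credentials.

```python
# Keys that the setup cell in this notebook reads from the license file.
REQUIRED_KEYS = [
    "PUBLIC_VERSION", "JSL_VERSION", "SECRET",
    "SPARK_NLP_LICENSE", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
]

def missing_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [k for k in required if not license_keys.get(k)]

# Placeholder values for illustration only.
sample_keys = {"PUBLIC_VERSION": "x.y.z", "JSL_VERSION": "2.6.0", "SECRET": "placeholder"}
print(missing_keys(sample_keys))
```

An empty list means the setup cell should find everything it looks up.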

In [2]:
license_keys['JSL_VERSION']

'2.6.0'

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

secret = license_keys['SECRET']

os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID']= license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
version = license_keys['PUBLIC_VERSION']
jsl_version = license_keys['JSL_VERSION']

! pip install --ignore-installed -q pyspark==2.4.4

! python -m pip install --upgrade spark-nlp-jsl==$jsl_version  --extra-index-url https://pypi.johnsnowlabs.com/$secret

! pip install --ignore-installed -q spark-nlp==$version

import sparknlp

print (sparknlp.version())

import json
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp_jsl.start(secret)


<b>  if you want to work with Spark 2.3 </b>
```
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

!tar xf spark-2.3.0-bin-hadoop2.7.tgz
!pip install -q findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/content/spark-2.3.0-bin-hadoop2.7"
! java -version

import findspark
findspark.init()
from pyspark.sql import SparkSession

! pip install --ignore-installed -q spark-nlp==2.5.5
import sparknlp

spark = sparknlp.start(spark23=True)
```

## Pretrained Pipelines

To save you from building a pipeline from scratch, Spark NLP also provides pre-trained pipelines that are already fitted with specific annotators and transformers for various use cases.

Here is the list of clinical pre-trained pipelines: 

> These clinical pipelines are trained with `embeddings_healthcare_100d`, so their accuracy might be 1-2% lower than with `embeddings_clinical`, which is 200d.

**1.   explain_clinical_doc_carp** :

> a pipeline with `ner_clinical`, `assertion_dl`, `re_clinical` and `ner_posology`. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities.

**2.   explain_clinical_doc_era** :

> a pipeline with `ner_clinical_events`, `assertion_dl` and `re_temporal_events_clinical`. It will extract clinical entities, assign assertion status and find temporal relationships between clinical entities.

**3.   recognize_entities_posology** :

> a pipeline with `ner_posology`. It will only extract medication entities.


** Since the 3rd pipeline (`ner_posology` only) is already included in the 1st pipeline, this notebook does not cover it separately.

**4.   explain_clinical_doc_ade** :

> a pipeline for `Adverse Drug Events (ADE)` with `ner_ade_healthcare` and `classifierdl_ade_biobert`. It will extract `ADE` and `DRUG` clinical entities, and then assign an ADE status to the text (`Negative` means ADE, `Neutral` means not related to ADE).

**letter codes in the naming conventions:**

> c : ner_clinical

> e : ner_clinical_events

> r : relation extraction

> p : ner_posology

> a : assertion

> ade : adverse drug events
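
The naming convention can be read mechanically. The sketch below expands the trailing letter code of an `explain_clinical_doc_*` name using the codes listed above. `decode_suffix` is a hypothetical helper written for illustration, and it only handles names that follow this letter-code convention.

```python
# Letter codes taken from the list above.
LETTER_CODES = {
    "c": "ner_clinical",
    "e": "ner_clinical_events",
    "r": "relation extraction",
    "p": "ner_posology",
    "a": "assertion",
}

def decode_suffix(pipeline_name):
    """Expand the trailing letter code of a pipeline name (hypothetical helper)."""
    suffix = pipeline_name.rsplit("_", 1)[-1]
    if suffix == "ade":  # 'ade' is one code, not three separate letters
        return ["adverse drug events"]
    return [LETTER_CODES[ch] for ch in suffix]

print(decode_suffix("explain_clinical_doc_carp"))
print(decode_suffix("explain_clinical_doc_era"))
```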

**Relation Extraction types:**

`re_clinical` >> TrIP (improved), TrWP (worsened), TrCP (caused problem), TrAP (administered), TrNAP (avoided), TeRP (revealed problem), TeCP (investigate problem), PIP (problems related)

`re_temporal_events_clinical` >> `AFTER`, `BEFORE`, `OVERLAP`
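
When reading the relation tables later in this notebook, a small lookup can turn a label into its description. This is only a sketch built from the labels listed above; `describe_relation` is a hypothetical helper, not a Spark NLP function.

```python
# Descriptions copied from the label lists above.
RE_CLINICAL_LABELS = {
    "TrIP": "improved",
    "TrWP": "worsened",
    "TrCP": "caused problem",
    "TrAP": "administered",
    "TrNAP": "avoided",
    "TeRP": "revealed problem",
    "TeCP": "investigate problem",
    "PIP": "problems related",
}
TEMPORAL_LABELS = {"AFTER", "BEFORE", "OVERLAP"}

def describe_relation(label):
    """Return a readable description for a relation label (hypothetical helper)."""
    if label in RE_CLINICAL_LABELS:
        return f"{label}: {RE_CLINICAL_LABELS[label]}"
    if label in TEMPORAL_LABELS:
        return f"{label}: temporal relation"
    return f"{label}: unknown label"

print(describe_relation("TrAP"))
print(describe_relation("OVERLAP"))
```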


## 1. explain_clinical_doc_carp

a pipeline with ner_clinical, assertion_dl, re_clinical and ner_posology. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities.

In [None]:
pipeline = PretrainedPipeline('explain_clinical_doc_carp', 'en', 'clinical/models')

explain_clinical_doc_carp download started this may take some time.
Approx size to download 526.5 MB
[OK!]


In [None]:
pipeline.model.stages

[DocumentAssembler_f42e69ce9e76,
 SentenceDetector_70b249a55601,
 REGEX_TOKENIZER_f212b14a3f41,
 WORD_EMBEDDINGS_MODEL_a5c1afb0b657,
 POS_be8d41751649,
 NerDLModel_706522935b2e,
 NerConverter_7694118eadb1,
 dependency_68159e3d6dac,
 NerDLModel_01b90ff03d9e,
 NerConverter_afd758da620a,
 RelationExtractionModel_9c255241fec3,
 ASSERTION_DL_941a00a50db4]

In [None]:
# Load pretrained pipeline from local disk:

# >> pipeline_local = PretrainedPipeline.from_disk('/root/cache_pretrained/explain_clinical_doc_carp_en_2.5.5_2.4_1597841630062')

In [None]:
text ="""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.
"""

annotations = pipeline.annotate(text)

annotations.keys()


dict_keys(['sentences', 'clinical_ner_tags', 'document', 'ner_chunks', 'clinical_ner_chunks', 'ner_tags', 'assertion', 'clinical_relations', 'tokens', 'embeddings', 'pos_tags', 'dependencies'])

In [None]:
import pandas as pd

rows = list(zip(annotations['tokens'], annotations['clinical_ner_tags'], annotations['ner_tags'], annotations['pos_tags'], annotations['dependencies']))

df = pd.DataFrame(rows, columns = ['tokens','clinical_ner_tags','posology_ner_tags','POS_tags','dependencies'])

df.head(20)

Unnamed: 0,tokens,clinical_ner_tags,posology_ner_tags,POS_tags,dependencies
0,A,O,O,DD,female
1,28-year-old,O,O,NN,female
2,female,O,O,NN,ROOT
3,with,O,O,II,history
4,a,O,O,DD,history
5,history,O,O,NN,female
6,of,O,O,II,history
7,gestational,B-PROBLEM,O,JJ,of
8,diabetes,I-PROBLEM,O,NN,mellitus
9,mellitus,I-PROBLEM,O,NN,gestational
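
The `B-`/`I-` prefixes in the tag columns mark where an entity begins and continues. Inside the pipeline the NerConverter stage produces the chunks; the sketch below is not that implementation, only a plain-Python illustration of how the tags in the table relate to a chunk. `bio_to_chunks` is a hypothetical helper, and the sample lists are copied from the rows shown above.

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) chunks (illustrative helper)."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any chunk still being built.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # 'O' (or a stray I- tag) ends the chunk being built.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["A", "28-year-old", "female", "with", "a", "history", "of",
          "gestational", "diabetes", "mellitus"]
tags = ["O"] * 7 + ["B-PROBLEM", "I-PROBLEM", "I-PROBLEM"]
print(bio_to_chunks(tokens, tags))
```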


In [None]:
text = 'Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain'

result = pipeline.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]

for n,m in zip(result['clinical_ner_chunks'],result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

df

Unnamed: 0,chunks,entities,assertion
0,a headache,PROBLEM,present
1,anxious,PROBLEM,present
2,alopecia,PROBLEM,absent
3,pain,PROBLEM,absent
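
Once each chunk carries an assertion status, selecting the asserted or the negated findings is a simple filter. A minimal sketch on the rows shown above, written in plain Python so that it runs without a Spark session; `with_status` is a hypothetical helper.

```python
# Rows copied from the output table above.
rows = [
    {"chunk": "a headache", "entity": "PROBLEM", "assertion": "present"},
    {"chunk": "anxious", "entity": "PROBLEM", "assertion": "present"},
    {"chunk": "alopecia", "entity": "PROBLEM", "assertion": "absent"},
    {"chunk": "pain", "entity": "PROBLEM", "assertion": "absent"},
]

def with_status(rows, status):
    """Return the chunk texts whose assertion equals `status` (hypothetical helper)."""
    return [r["chunk"] for r in rows if r["assertion"] == status]

print("present:", with_status(rows, "present"))
print("absent:", with_status(rows, "absent"))
```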


In [None]:
text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also 
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""

result = pipeline.fullAnnotate(text)[0]

chunks=[]
entities=[]
begins=[]
ends=[]

for n in result['ner_chunks']:
    
    chunks.append(n.result)
    begins.append(n.begin)
    ends.append(n.end)
    entities.append(n.metadata['entity']) 
        
df = pd.DataFrame({'chunks':chunks, 'begin':begins, 'end':ends, 'entities':entities})

df

Unnamed: 0,chunks,begin,end,entities
0,1 unit,28,33,Dosage
1,Advil,38,42,Drug
2,for 5 days,44,53,Duration
3,1 unit,96,101,Dosage
4,Metformin,106,114,Drug
5,daily,116,120,Frequency
6,40 units,190,197,Dosage
7,insulin glargine,202,217,Drug
8,at night,219,226,Frequency
9,12 units,231,238,Dosage
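
Note that the `end` offset in the table is inclusive: it points at the last character of the chunk, so a Python slice needs `end + 1`. The sketch below checks this against the first three rows above, using only the first sentence of the input text (including its leading newline, which counts as position 0). `slice_span` is a hypothetical helper.

```python
# First sentence of the input above, including the leading newline.
text = "\nThe patient was prescribed 1 unit of Advil for 5 days after meals."

# (chunk, begin, end) copied from the first rows of the output table.
spans = [("1 unit", 28, 33), ("Advil", 38, 42), ("for 5 days", 44, 53)]

def slice_span(text, begin, end):
    """Recover a chunk from inclusive begin/end offsets (hypothetical helper)."""
    return text[begin:end + 1]

for chunk, begin, end in spans:
    assert slice_span(text, begin, end) == chunk
    print(begin, end, repr(slice_span(text, begin, end)))
```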


In [None]:
import pandas as pd

def get_relations_df (results, col='relations'):
  rel_pairs=[]
  for rel in results[0][col]:
      rel_pairs.append((
          rel.result, 
          rel.metadata['entity1'], 
          rel.metadata['entity1_begin'],
          rel.metadata['entity1_end'],
          rel.metadata['chunk1'], 
          rel.metadata['entity2'],
          rel.metadata['entity2_begin'],
          rel.metadata['entity2_end'],
          rel.metadata['chunk2'], 
          rel.metadata['confidence']
      ))

  rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

  rel_df.confidence = rel_df.confidence.astype(float)
  
  return rel_df


In [None]:
text ="""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.
"""

annotations = pipeline.fullAnnotate(text)

rel_df = get_relations_df (annotations, 'clinical_relations')

rel_df

Unnamed: 0,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,TrAP,PROBLEM,39,67,gestational diabetes mellitus,TREATMENT,83,91,metformin,0.9998964
1,TrAP,PROBLEM,39,67,gestational diabetes mellitus,PROBLEM,155,162,polyuria,0.8757551
2,TrCP,PROBLEM,39,67,gestational diabetes mellitus,PROBLEM,166,175,polydipsia,0.5539198
3,TrCP,PROBLEM,39,67,gestational diabetes mellitus,PROBLEM,179,191,poor appetite,0.9128578
4,TrAP,TREATMENT,83,91,metformin,PROBLEM,155,162,polyuria,0.9592948
5,TrAP,TREATMENT,83,91,metformin,PROBLEM,166,175,polydipsia,0.656755
6,TrAP,TREATMENT,83,91,metformin,PROBLEM,179,191,poor appetite,0.6427657
7,TrAP,TREATMENT,83,91,metformin,PROBLEM,199,206,vomiting,0.57237124
8,TrCP,PROBLEM,155,162,polyuria,PROBLEM,166,175,polydipsia,0.647005
9,TrCP,PROBLEM,155,162,polyuria,PROBLEM,179,191,poor appetite,0.9097432
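
Several of these pairs have low confidence, so a threshold is a natural way to narrow the table. The sketch below is a plain-Python equivalent of a filter such as `rel_df[rel_df.confidence > 0.9]`, applied to the first five rows above. It assumes the confidence arrives as a string (which is why `get_relations_df` casts the column to float); `confident_relations` is a hypothetical helper.

```python
# (relation, chunk1, chunk2, confidence) copied from the first five rows above.
relations = [
    ("TrAP", "gestational diabetes mellitus", "metformin", "0.9998964"),
    ("TrAP", "gestational diabetes mellitus", "polyuria", "0.8757551"),
    ("TrCP", "gestational diabetes mellitus", "polydipsia", "0.5539198"),
    ("TrCP", "gestational diabetes mellitus", "poor appetite", "0.9128578"),
    ("TrAP", "metformin", "polyuria", "0.9592948"),
]

def confident_relations(relations, threshold=0.9):
    """Keep relations whose confidence exceeds `threshold` (hypothetical helper)."""
    return [r for r in relations if float(r[3]) > threshold]

for relation, chunk1, chunk2, confidence in confident_relations(relations):
    print(relation, "|", chunk1, "->", chunk2, "|", confidence)
```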


In [None]:
text ="""
he patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
"""

annotations = pipeline.fullAnnotate(text)

rel_df = get_relations_df (annotations, 'clinical_relations')

rel_df[rel_df.confidence>0.9]


Unnamed: 0,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,TrIP,TEST,140,154,serum chemistry,TEST,203,213,her glucose,0.996591
1,TrIP,TEST,140,154,serum chemistry,TEST,231,243,the anion gap,0.983426
3,TrAP,TEST,140,154,serum chemistry,TEST,306,323,triglyceride level,0.93054
4,TrAP,TEST,140,154,serum chemistry,TEST,352,357,lipase,0.96276
8,TrAP,PROBLEM,249,262,still elevated,TEST,272,288,serum bicarbonate,0.997434
9,TrAP,PROBLEM,249,262,still elevated,TEST,306,323,triglyceride level,0.996227
10,TrAP,PROBLEM,249,262,still elevated,TEST,352,357,lipase,0.99667
11,TrAP,PROBLEM,249,262,still elevated,TEST,366,368,U/L,0.93907
13,TrIP,TEST,272,288,serum bicarbonate,TEST,366,368,U/L,0.926109
15,TrAP,TEST,352,357,lipase,TEST,366,368,U/L,0.999179


## 2. explain_clinical_doc_era

> a pipeline with `ner_clinical_events`, `assertion_dl` and `re_temporal_events_clinical`. It will extract clinical entities, assign assertion status and find temporal relationships between clinical entities.



In [None]:
era_pipeline = PretrainedPipeline('explain_clinical_doc_era', 'en', 'clinical/models')

explain_clinical_doc_era download started this may take some time.
Approx size to download 512.8 MB
[OK!]


In [None]:
era_pipeline.model.stages

[DocumentAssembler_f548f799ea2a,
 SentenceDetector_249f783c340c,
 REGEX_TOKENIZER_209534c638ca,
 POS_be8d41751649,
 dependency_68159e3d6dac,
 WORD_EMBEDDINGS_MODEL_a5c1afb0b657,
 NerDLModel_3c27190d1858,
 NerConverter_9fff8fa39dbd,
 RelationExtractionModel_fb3be959a99e,
 NerConverter_50212b459b4b,
 ASSERTION_DL_941a00a50db4]

In [None]:
text ="""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache.
She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 
12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. 
"""

result = era_pipeline.fullAnnotate(text)[0]


In [None]:
result.keys()

dict_keys(['sentences', 'clinical_ner_tags', 'clinical_ner_chunks_re', 'document', 'clinical_ner_chunks', 'assertion', 'clinical_relations', 'tokens', 'embeddings', 'pos_tags', 'dependencies'])

In [None]:
import pandas as pd

chunks=[]
entities=[]
begins=[]
ends=[]

for n in result['clinical_ner_chunks']:
    
    chunks.append(n.result)
    begins.append(n.begin)
    ends.append(n.end)
    entities.append(n.metadata['entity']) 
        
df = pd.DataFrame({'chunks':chunks, 'begin':begins, 'end':ends, 'entities':entities})

df

Unnamed: 0,chunks,begin,end,entities
0,admitted,7,14,OCCURRENCE
1,The John Hopkins Hospital,19,43,CLINICAL_DEPT
2,2 days ago,45,54,DATE
3,gestational diabetes mellitus,74,102,PROBLEM
4,diagnosed,104,112,OCCURRENCE
5,denied,119,124,EVIDENTIAL
6,pain,126,129,PROBLEM
7,any headache,135,146,PROBLEM
8,seen,157,160,OCCURRENCE
9,the endocrinology service,165,189,CLINICAL_DEPT


In [None]:

chunks=[]
entities=[]
status=[]

for n,m in zip(result['clinical_ner_chunks_re'],result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

df

Unnamed: 0,chunks,entities,assertion
0,admitted,OCCURRENCE,present
1,The John Hopkins Hospital,CLINICAL_DEPT,present
2,gestational diabetes mellitus,PROBLEM,present
3,diagnosed,OCCURRENCE,present
4,denied,EVIDENTIAL,absent
5,pain,PROBLEM,absent
6,any headache,PROBLEM,absent
7,seen,OCCURRENCE,present
8,the endocrinology service,CLINICAL_DEPT,present
9,discharged,OCCURRENCE,present


In [None]:
annotations = era_pipeline.fullAnnotate(text)

rel_df = get_relations_df (annotations, 'clinical_relations')

rel_df


Unnamed: 0,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,AFTER,OCCURRENCE,7,14,admitted,CLINICAL_DEPT,19,43,The John Hopkins Hospital,1.0
1,AFTER,OCCURRENCE,7,14,admitted,DATE,45,54,2 days ago,0.77979285
2,BEFORE,OCCURRENCE,7,14,admitted,PROBLEM,74,102,gestational diabetes mellitus,0.50644654
3,OVERLAP,OCCURRENCE,7,14,admitted,OCCURRENCE,104,112,diagnosed,1.0
4,OVERLAP,CLINICAL_DEPT,19,43,The John Hopkins Hospital,DATE,45,54,2 days ago,1.0
5,OVERLAP,CLINICAL_DEPT,19,43,The John Hopkins Hospital,PROBLEM,74,102,gestational diabetes mellitus,0.9907205
6,OVERLAP,CLINICAL_DEPT,19,43,The John Hopkins Hospital,OCCURRENCE,104,112,diagnosed,0.99992645
7,BEFORE,PROBLEM,74,102,gestational diabetes mellitus,OCCURRENCE,104,112,diagnosed,1.0
8,BEFORE,EVIDENTIAL,119,124,denied,PROBLEM,126,129,pain,1.0
9,OVERLAP,EVIDENTIAL,119,124,denied,PROBLEM,135,146,any headache,0.998176
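
For downstream use it can be convenient to express every ordering with a single label. The sketch below rewrites `AFTER(a, b)` as `BEFORE(b, a)` and leaves the other labels untouched. This assumes that `AFTER` means the first chunk happens after the second, which is an interpretation and not something this notebook confirms; `normalize_temporal` is a hypothetical helper, and the sample rows are copied from the table above.

```python
# (relation, chunk1, chunk2) copied from rows of the table above.
temporal = [
    ("AFTER", "admitted", "2 days ago"),
    ("BEFORE", "admitted", "gestational diabetes mellitus"),
    ("OVERLAP", "admitted", "diagnosed"),
]

def normalize_temporal(relations):
    """Rewrite AFTER(a, b) as BEFORE(b, a); assumes AFTER means 'a after b'."""
    normalized = []
    for label, first, second in relations:
        if label == "AFTER":
            normalized.append(("BEFORE", second, first))
        else:
            normalized.append((label, first, second))
    return normalized

for row in normalize_temporal(temporal):
    print(row)
```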


In [None]:
annotations[0]['clinical_relations']

[]

## 3. explain_clinical_doc_ade

a pipeline for `Adverse Drug Events (ADE)` with `ner_ade_healthcare` and `classifierdl_ade_biobert`. It will extract `ADE` and `DRUG` clinical entities, and then assign an ADE status to the text (`Negative` means ADE, `Neutral` means not related to ADE).

In [4]:
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')

explain_clinical_doc_ade download started this may take some time.
Approx size to download 897.4 MB
[OK!]


In [9]:
ade_pipeline.fullAnnotate("I feel a bit drowsy & have a little blurred vision, so far no gastric problems.")[0]['class'][0].metadata

{'sentence': '0', 'Neutral': '1.2872865E-8', 'Negative': '1.0'}
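
The metadata stores the class probabilities as strings, next to a `sentence` index. A minimal sketch for converting them to floats and picking the most likely class, using the dictionary printed above; `class_probabilities` and `top_class` are hypothetical helpers.

```python
# Metadata dictionary copied from the output above.
metadata = {"sentence": "0", "Neutral": "1.2872865E-8", "Negative": "1.0"}

def class_probabilities(metadata):
    """Convert the per-class entries to floats, skipping the sentence index."""
    return {k: float(v) for k, v in metadata.items() if k != "sentence"}

def top_class(metadata):
    """Return (label, probability) for the most likely class."""
    probs = class_probabilities(metadata)
    label = max(probs, key=probs.get)
    return label, probs[label]

print(top_class(metadata))
```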

In [20]:
# Assign to `text`, the name used by the print and fullAnnotate calls below
text = "As she became very drowsy, we discontinued Dilantin discharge medication"

import pandas as pd

chunks = []
entities = []
begin =[]
end = []

print ('sentence:', text)
print()

result = ade_pipeline.fullAnnotate(text)

print ('ADE status:', result[0]['class'][0].result)

print ('prediction probability>> Negative (ADE True): ', result[0]['class'][0].metadata['Negative'], \
        'Neutral (ADE False): ', result[0]['class'][0].metadata['Neutral'])

for n in result[0]['ner_chunk']:

  begin.append(n.begin)
  end.append(n.end)
  chunks.append(n.result)
  entities.append(n.metadata['entity']) 

df = pd.DataFrame({'chunks':chunks, 'entities':entities,
                'begin': begin, 'end': end})

df


sentence: As she became very drowsy, we discontinued Dilantin discharge medication

ADE status: Negative
prediction probability>> Negative (ADE True):  0.99999905 Neutral (ADE False):  1.00849E-6


Unnamed: 0,chunks,entities,begin,end
0,drowsy,ADE,19,24
1,Dilantin discharge medication,DRUG,43,71
