
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_CLINICAL_DATE.ipynb)

# **Detect test, result and date relations**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload your license file there, named `spark_jsl.json` so that the setup cell finds it. Otherwise, you can simply read the example outputs included in the notebook.
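The setup cells in this notebook read three keys from the license file: `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION`. As a minimal sketch, you could check an uploaded file for them before running the install cells. The helper name `missing_license_keys` is hypothetical, and a real license file may contain more keys than these.

```python
import json

# Keys this notebook reads later; the real license file may contain others.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(raw_json):
    """Return the required keys that are absent from a license JSON string."""
    keys = json.loads(raw_json)
    return [k for k in REQUIRED_KEYS if k not in keys]

# Placeholder values for illustration only, not real credentials.
example = json.dumps({"SECRET": "xxx", "PUBLIC_VERSION": "4.2.8"})
print(missing_license_keys(example))
```

An empty list means all three keys are present.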

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# **🔎 About the models**


📌 **re_test_result_date** --> *Relation extraction between lab test names, their findings, measurements, results, and dates.*

*   Predicted Entities => **is_finding_of, is_result_of, is_date_of, O**

📌 **redl_date_clinical_biobert** --> *Identifies whether a test was conducted or a diagnosis was made on a specific date by checking relations between clinical entities and dates. 1: the date and the clinical entity are related. 0: the date and the clinical entity are not related.*

*   Predicted Entities => **0, 1** 
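
Because `redl_date_clinical_biobert` emits `1` or `0` rather than a named relation, downstream code will usually keep only the `1` rows. Here is a minimal sketch on hand-written records. The dictionaries and their values are illustrative, not model output, and labels are compared as strings on the assumption that annotation results arrive as text.

```python
# Illustrative records only; real rows come from the pipeline's relation annotations.
records = [
    {"relation": "1", "chunk1": "02/23/2007", "chunk2": "leukemia"},
    {"relation": "0", "chunk1": "02/23/2007", "chunk2": "muscle strain"},
]

def related_pairs(rows):
    """Keep (chunk1, chunk2) pairs whose label marks the date and entity as related."""
    return [(r["chunk1"], r["chunk2"]) for r in rows if str(r["relation"]) == "1"]

print(related_pairs(records))
```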

🔎**You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition&edition=Spark+NLP+for+Healthcare)**



# **📌re_test_result_date**

### **🔎Define Spark NLP pipeline**

In [4]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer()\
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

clinical_ner_model = MedicalNerModel()\
        .pretrained('jsl_ner_wip_greedy_clinical', 'en', 'clinical/models')\
        .setInputCols("sentences", "tokens", "embeddings")\
        .setOutputCol("clinical_ner_tags") 

clinical_ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "clinical_ner_tags"])\
    .setOutputCol("clinical_ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")  

clinical_re_Model = RelationExtractionModel()\
        .pretrained("re_test_result_date", "en", 'clinical/models')\
        .setInputCols(["embeddings", "pos_tags", "clinical_ner_chunks", "dependencies"])\
        .setOutputCol("relations")\
        .setPredictionThreshold(0.0)\
        .setMaxSyntacticDistance(5)\
        .setRelationPairs(["test-test_result", "test_result-test", "oncological-oncological", 
                           "test-oncological", "oncological-test", "test_result-oncological", 
                           "oncological-test_result", "test-date", "date-test", "test_result-date", 
                           "date-test_result", "date-oncological", "oncological-date", "date-treatment", 
                           "treatment-date", "oncological-treatment", "treatment-oncological", 
                           "symptom-oncological", "oncological-symptom", "relativedate-oncological", 
                           "oncological-relativedate", "symptom-relativedate", "relativedate-symptom", 
                           "relativedate-test", "test-relativedate", "disease_syndrome_disorder-date", "date-disease_syndrome_disorder",
                           "weight-test_result", "test_result-weight", "hyperlipidemia-date", "date-hyperlipidemia", "date-bmi", "bmi-date",
                           "cerebrovascular_disease-date", "date-cerebrovascular_disease", "heart_disease-date", "date-heart_disease", 
                           "blood_pressure-date", "date-blood_pressure", "ekg_findings-date", "date-ekg_findings", 
                           "ekg_findings-heart_disease", "heart_disease-ekg_findings", "hypertension-date", "date-hypertension"])

pipeline = Pipeline(
    stages=[
        document_assembler, 
        sentence_detector,
        tokenizer,
        pos_tagger,
        embeddings,
        clinical_ner_model,
        clinical_ner_chunker,
        dependency_parser,
        clinical_re_Model
        ])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = pipeline.fit(empty_df)
light_model = LightPipeline(pipelineModel)

pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
jsl_ner_wip_greedy_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
re_test_result_date download started this may take some time.
Approximate size to download 9.3 MB
[OK!]
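
The `setRelationPairs` list above names each entity pair in both orders (for example `test-date` and `date-test`). One way to avoid typing both by hand is a small helper. `both_directions` is a hypothetical name, not part of Spark NLP, and this is only a sketch of the idea.

```python
def both_directions(pairs):
    """Expand (a, b) label tuples into 'a-b' and 'b-a' strings without duplicates."""
    expanded = []
    for a, b in pairs:
        for pair in (f"{a}-{b}", f"{b}-{a}"):
            if pair not in expanded:
                expanded.append(pair)
    return expanded

relation_pairs = both_directions([
    ("test", "test_result"),
    ("test", "date"),
    ("oncological", "oncological"),  # a self-pair yields a single entry
])
print(relation_pairs)
```

The resulting list of strings has the same shape as the literal list passed to `setRelationPairs` in the pipeline definition.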


In [5]:
def get_relations_df(results, col='relations'):
    """Collect relation annotations from one fullAnnotate result into a DataFrame."""
    rel_pairs = []
    # The column name and the loop variable are kept separate to avoid shadowing.
    for rel in results[col]:
        rel_pairs.append((
          rel.result, 
          rel.metadata['entity1'],
          rel.metadata['entity1_begin'],
          rel.metadata['entity1_end'],
          rel.metadata['chunk1'], 
          rel.metadata['entity2'],
          rel.metadata['entity2_begin'],
          rel.metadata['entity2_end'],
          rel.metadata['chunk2'], 
          rel.metadata['confidence']
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

    # Keep only rows whose label is not 'O'.
    return rel_df[rel_df.relation != 'O']

### **🔎Sample Text**

In [6]:
text ="""She is followed by Dr. X in our office and has a history of severe tricuspid regurgitation diagnosed on 05/12/08 . She has previously had a Persantine Myoview nuclear rest-stress test scan completed at ABCD Medical Center that was negative in 07/06/08. She has had significant mitral valve regurgitation in the past being moderate, but on the most recent echocardiogram on 05/12/08, that was not felt to be significant. She has a history of hypertension and EKGs on September 2007, show normal sinus rhythm with frequent APCs versus wandering atrial pacemaker. She does have a history of significant hypertension diagnosed in 2007. She has had dizzy spells and denies clearly any true syncope. She has had bradycardia in the past from beta-blocker therapy."""

### **🔎Run the pipeline**

In [7]:
import pandas as pd

light_result = light_model.fullAnnotate(text)
get_relations_df(light_result[0])

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | is_finding_of | Heart_Disease | 67 | 89 | tricuspid regurgitation | Date | 104 | 111 | 05/12/08 | 0.8936198 |
| 1 | is_result_of | Test | 159 | 187 | nuclear rest-stress test scan | Test_Result | 231 | 238 | negative | 0.9825417 |
| 3 | is_result_of | Test_Result | 231 | 238 | negative | Date | 243 | 250 | 07/06/08 | 1.0 |
| 4 | is_finding_of | Test | 355 | 368 | echocardiogram | Date | 373 | 380 | 05/12/08 | 0.7808746 |
| 5 | is_date_of | Hypertension | 441 | 452 | hypertension | Date | 466 | 479 | September 2007 | 0.9260339 |
| 6 | is_date_of | Test | 458 | 461 | EKGs | Date | 466 | 479 | September 2007 | 1.0 |
| 7 | is_date_of | Date | 466 | 479 | September 2007 | EKG_Findings | 487 | 505 | normal sinus rhythm | 0.787751 |
| 8 | is_date_of | Date | 466 | 479 | September 2007 | EKG_Findings | 512 | 524 | frequent APCs | 0.99999595 |
| 9 | is_finding_of | Hypertension | 600 | 611 | hypertension | Date | 626 | 629 | 2007 | 0.9996922 |
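
With `setPredictionThreshold(0.0)` no relation is dropped for low confidence, so a cut-off can be applied afterwards if needed. Here is a minimal sketch using three rows copied from the output above. The 0.9 cut-off is an arbitrary choice for illustration, and `float()` is applied on the assumption that confidence may arrive as a string in the annotation metadata.

```python
# Rows copied from the output above: (relation, chunk1, chunk2, confidence).
rows = [
    ("is_finding_of", "tricuspid regurgitation", "05/12/08", "0.8936198"),
    ("is_result_of", "nuclear rest-stress test scan", "negative", "0.9825417"),
    ("is_finding_of", "echocardiogram", "05/12/08", "0.7808746"),
]

MIN_CONFIDENCE = 0.9  # arbitrary cut-off chosen for this example

# Convert confidence to a number before comparing.
confident = [r for r in rows if float(r[3]) >= MIN_CONFIDENCE]
for relation, chunk1, chunk2, conf in confident:
    print(relation, chunk1, chunk2, conf)
```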


### **🔎Visualize results**

In [8]:
from sparknlp_display import RelationExtractionVisualizer

re_vis = RelationExtractionVisualizer()

re_vis.display(light_result[0],
               relation_col = 'relations',
               document_col = 'document',
               show_relations=True
               )

# **📌redl_date_clinical_biobert**

### **🔎Define Spark NLP pipeline**

In [9]:
events_ner_tagger = MedicalNerModel()\
    .pretrained("ner_events_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("clinical_ner_tags")

events_re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["clinical_ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(['problem-date', 'date-problem', 'date-treatment', 'treatment-date',  'date-test', 'test-date'])

events_re_Model = RelationExtractionDLModel() \
    .pretrained('redl_date_clinical_biobert', "en", "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")


pipeline = Pipeline(
    stages=[
        document_assembler, 
        sentence_detector,
        tokenizer,
        pos_tagger,
        embeddings,
        events_ner_tagger,
        clinical_ner_chunker,
        dependency_parser,
        events_re_ner_chunk_filter,
        events_re_Model
        ])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = pipeline.fit(empty_df)
light_model = LightPipeline(pipelineModel)

ner_events_clinical download started this may take some time.
[OK!]
redl_date_clinical_biobert download started this may take some time.
[OK!]


### **🔎Sample Text**

In [10]:
text = """The patient was transferred here the evening of 02/23/2007 from Hospital with a new diagnosis of high-risk acute lymphoblastic leukemia by flow cytometry of peripheral blood lymphoblasts that afternoon. History related to this illness probably dates back to October of 2006 when he had onset of swelling and discomfort in the left testicle with what he described as a residual "lump" posteriorly. The left testicle has continued to be painful off and on since. In early November, he developed pain in the posterior part of his upper right leg, which he initially thought was related to skateboarding and muscle strain."""


### **🔎Run the pipeline**

In [11]:
light_result = light_model.fullAnnotate(text)
get_relations_df(light_result[0])

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DATE | 33 | 57 | the evening of 02/23/2007 | PROBLEM | 97 | 134 | high-risk acute lymphoblastic leukemia | 0.9997087 |
| 1 | 1 | DATE | 33 | 57 | the evening of 02/23/2007 | TEST | 139 | 185 | flow cytometry of peripheral blood lymphoblasts | 0.9998977 |
| 2 | 1 | PROBLEM | 97 | 134 | high-risk acute lymphoblastic leukemia | DATE | 192 | 200 | afternoon | 0.9998311 |
| 3 | 1 | TEST | 139 | 185 | flow cytometry of peripheral blood lymphoblasts | DATE | 192 | 200 | afternoon | 0.9999057 |
| 4 | 1 | PROBLEM | 222 | 233 | this illness | DATE | 258 | 272 | October of 2006 | 0.99816656 |
| 5 | 1 | DATE | 258 | 272 | October of 2006 | PROBLEM | 295 | 302 | swelling | 0.9991328 |
| 6 | 1 | DATE | 258 | 272 | October of 2006 | PROBLEM | 308 | 338 | discomfort in the left testicle | 0.99895144 |
| 7 | 1 | DATE | 258 | 272 | October of 2006 | PROBLEM | 378 | 394 | lump" posteriorly | 0.9963766 |
| 8 | 1 | DATE | 464 | 477 | early November | PROBLEM | 493 | 496 | pain | 0.9995981 |
| 9 | 1 | DATE | 464 | 477 | early November | PROBLEM | 586 | 598 | skateboarding | 0.9964504 |
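
Since every row pairs a date with a clinical entity, the relations can be regrouped into a per-date timeline. Here is a minimal sketch using four rows copied from the output above. It assumes that whichever side is labeled `DATE` is the date, because the output lists pairs in either order.

```python
from collections import defaultdict

# Rows copied from the output above: (entity1, chunk1, entity2, chunk2).
rows = [
    ("DATE", "the evening of 02/23/2007", "PROBLEM", "high-risk acute lymphoblastic leukemia"),
    ("PROBLEM", "this illness", "DATE", "October of 2006"),
    ("DATE", "October of 2006", "PROBLEM", "swelling"),
    ("DATE", "early November", "PROBLEM", "pain"),
]

timeline = defaultdict(list)
for entity1, chunk1, entity2, chunk2 in rows:
    # The date can be on either side of the pair, so check both.
    if entity1 == "DATE":
        timeline[chunk1].append(chunk2)
    elif entity2 == "DATE":
        timeline[chunk2].append(chunk1)

for date, findings in timeline.items():
    print(date, "->", findings)
```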


### **🔎Visualize results**

In [12]:
#from sparknlp_display import RelationExtractionVisualizer

re_vis = RelationExtractionVisualizer()

re_vis.display(light_result[0],
               relation_col = 'relations',
               document_col = 'document',
               show_relations=True
               )