![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/22.CPT_Entity_Resolver.ipynb)

# CPT Entity Resolvers with sBert

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os
import sys, time

import sparknlp
import sparknlp_jsl

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.util import *
from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.4
Spark NLP_JSL Version : 4.2.3


## Named Entity Recognition

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

ner_pipeline = Pipeline(
    stages = [
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
        ])


data_ner = spark.createDataFrame([['']]).toDF("text")

ner_model = ner_pipeline.fit(data_ner)

ner_light_pipeline = LightPipeline(ner_model)


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [None]:
clinical_note = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years '
    'prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior '
    'episode of HTG-induced pancreatitis three years prior to presentation, associated '
    'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '
    'presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. '
    'Two weeks prior to presentation, she was treated with a five-day course of amoxicillin '
    'for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin '
    'for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months '
    'at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; '
    'significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent '
    'laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, '
    'creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) '
    '10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed '
    'as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for '
    'starvation ketosis, as she reported poor oral intake for three days prior to admission. However, '
    'serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap '
    'was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and '
    'lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - '
    'the original sample was centrifuged and the chylomicron layer removed prior to analysis due to '
    'interference from turbidity caused by lipemia again. The patient was treated with an insulin drip '
    'for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within '
    '24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting '
    'of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on '
    '40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg '
    'two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She '
    'had close follow-up with endocrinology post discharge.'
)


from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

# Change color of an entity label
visualiser.set_label_colors({'PROBLEM':'#008080', 'TEST':'#800080', 'TREATMENT':'#806080'})

# Optionally restrict the display to selected labels, e.g.:
# visualiser.display(ner_light_pipeline.fullAnnotate(clinical_note)[0], label_col='ner_chunk', labels=['Test'])

visualiser.display(ner_light_pipeline.fullAnnotate(clinical_note)[0], label_col='ner_chunk', document_col='document')


## CPT Resolver

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['Test','Procedure'])

c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("cpt_code")\
    .setDistanceFunction("EUCLIDEAN")
  
sbert_pipeline_cpt = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        cpt_resolver])

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_cpt_procedures_augmented download started this may take some time.
[OK!]
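
Conceptually, the resolver compares each chunk embedding against a reference index of code embeddings and returns the closest entries; `setDistanceFunction("EUCLIDEAN")` selects Euclidean distance for that comparison. The block below is a toy sketch of that idea only, not the library's implementation: the 3-dimensional vectors, the `toy_index` dictionary, the `nearest_codes` helper, and the `code_A`/`code_B`/`code_C` labels are all made up for illustration.

```python
import math

# Hypothetical reference index: label -> embedding (toy 3-d vectors).
toy_index = {
    "code_A": [1.0, 0.0, 0.0],
    "code_B": [0.0, 1.0, 0.0],
    "code_C": [0.9, 0.1, 0.0],
}

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_codes(query, index, k=2):
    """Return the k (label, distance) pairs closest to the query vector."""
    scored = [(label, euclidean(query, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# A made-up "chunk embedding" that lies close to code_A and code_C.
query = [1.0, 0.05, 0.0]
ranked = nearest_codes(query, toy_index, k=2)
print(ranked)
```

The first entry of `ranked` plays the role of the resolved code, and the remaining entries correspond to the alternative candidates that the resolver reports in its metadata.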


In [None]:
text = '''
EXAM: Left heart cath, selective coronary angiogram, right common femoral angiogram, and StarClose closure of right common femoral artery.

REASON FOR EXAM: Abnormal stress test and episode of shortness of breath.

PROCEDURE: Right common femoral artery, 6-French sheath, JL4, JR4, and pigtail catheters were used.

FINDINGS:
1. Left main is a large-caliber vessel. It is angiographically free of disease,
2. LAD is a large-caliber vessel. It gives rise to two diagonals and septal perforator. It erupts around the apex. LAD shows an area of 60% to 70% stenosis probably in its mid portion. The lesion is a type A finishing before the takeoff of diagonal 1. The rest of the vessel is angiographically free of disease.
3. Diagonal 1 and diagonal 2 are angiographically free of disease.
4. Left circumflex is a small-to-moderate caliber vessel, gives rise to 1 OM. It is angiographically free of disease.
5. OM-1 is angiographically free of disease.
6. RCA is a large, dominant vessel, gives rise to conus, RV marginal, PDA and one PL. RCA has a tortuous course and it has a 30% to 40% stenosis in its proximal portion.
7. LVEDP is measured 40 mmHg.
8. No gradient between LV and aorta is noted.

Due to contrast concern due to renal function, no LV gram was performed.

Following this, right common femoral angiogram was performed followed by StarClose closure of the right common femoral artery.
'''

data_ner = spark.createDataFrame([[text]]).toDF("text")

sbert_models = sbert_pipeline_cpt.fit(data_ner)

sbert_outputs = sbert_models.transform(data_ner)

from pyspark.sql import functions as F

# Zip the parallel annotation arrays so that each chunk becomes one struct, then explode to one row per chunk.
# Struct fields are addressed by position, in the order passed to arrays_zip:
# '0' chunk text, '1' chunk metadata, '2' resolved code, '3' resolver metadata, '4' begin, '5' end.
cpt_sdf = sbert_outputs.select(F.explode(F.arrays_zip(sbert_outputs.ner_chunk.result,
                                                      sbert_outputs.ner_chunk.metadata,
                                                      sbert_outputs.cpt_code.result,
                                                      sbert_outputs.cpt_code.metadata,
                                                      sbert_outputs.ner_chunk.begin,
                                                      sbert_outputs.ner_chunk.end)).alias("cpt_code")) \
    .select(F.expr("cpt_code['0']").alias("chunk"),
            F.expr("cpt_code['4']").alias("begin"),
            F.expr("cpt_code['5']").alias("end"),
            F.expr("cpt_code['1'].entity").alias("entity"),
            F.expr("cpt_code['2']").alias("code"),
            F.expr("cpt_code['3'].confidence").alias("confidence"),
            F.expr("cpt_code['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("cpt_code['3'].all_k_results").alias("all_k_codes"))

cpt_sdf.show(10, truncate=100)


+----------------------------+-----+----+---------+-----+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                       chunk|begin| end|   entity| code|confidence|                                                                                   all_k_resolutions|                                                                                         all_k_codes|
+----------------------------+-----+----+---------+-----+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|             Left heart cath|    7|  21|Procedure|93462|    0.3829|Cardiac catheterisation, left heart [Left heart catheterization by transseptal puncture through i...|                  
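
The `arrays_zip` plus `explode` step in the cell above turns several parallel arrays (one element per chunk) into one row per chunk, with the zipped fields addressed by position. A plain-Python analogue with made-up values may make that index mapping easier to follow; it only mirrors the shape of the operation and does not use Spark.

```python
# Made-up parallel arrays, one element per detected chunk.
chunks = ["chunk one", "chunk two"]
entities = ["Procedure", "Test"]
codes = ["C1", "C2"]
begins = [0, 20]
ends = [8, 28]

# zip(...) plays the role of arrays_zip; iterating over it plays the role of explode.
rows = [
    {"chunk": z[0], "entity": z[1], "code": z[2], "begin": z[3], "end": z[4]}
    for z in zip(chunks, entities, codes, begins, ends)
]

for row in rows:
    print(row)
```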

In [None]:
import pandas as pd

def get_codes(light_model, code, text):
    """Annotate `text` and return a DataFrame with one row per NER chunk.

    `code` is the name of the resolver output column (e.g. 'cpt_code').
    """
    full_light_result = light_model.fullAnnotate(text)

    chunks = []
    terms = []
    begin = []
    end = []
    resolutions = []
    entity = []
    all_codes = []

    # Pair each NER chunk with its resolution; both lists follow chunk order.
    for chunk, term in zip(full_light_result[0]['ner_chunk'], full_light_result[0][code]):
        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        terms.append(term.result)
        entity.append(chunk.metadata['entity'])
        resolutions.append(term.metadata['all_k_resolutions'])
        all_codes.append(term.metadata['all_k_results'])

    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'entity': entity,
                       'code': terms, 'resolutions': resolutions, 'all_codes': all_codes})

    return df
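
The `all_k_results` and `all_k_resolutions` metadata values are single strings in which the ranked candidates are typically joined by `:::`. Below is a minimal sketch, assuming that delimiter, for turning the two strings into ranked (code, description) pairs. The `split_candidates` helper and the sample strings are hypothetical and not part of Spark NLP.

```python
def split_candidates(all_k_results, all_k_resolutions, sep=":::", top_n=3):
    """Pair candidate codes with their descriptions, keeping the first top_n."""
    codes = all_k_results.split(sep)
    descriptions = all_k_resolutions.split(sep)
    return list(zip(codes, descriptions))[:top_n]

# Made-up strings in the assumed ':::'-delimited format.
sample_codes = "C1:::C2:::C3:::C4"
sample_resolutions = "first candidate:::second candidate:::third candidate:::fourth candidate"

pairs = split_candidates(sample_codes, sample_resolutions, top_n=2)
print(pairs)
```

Applied row by row to the `all_codes` and `resolutions` columns of the DataFrame returned by `get_codes`, a helper like this would make the alternative candidates easier to inspect than the raw delimited strings.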


In [None]:
text='''
REASON FOR EXAM:  Evaluate for retroperitoneal hematoma on the right side of pelvis, the patient has been following, is currently on Coumadin.  

In CT abdomen,  there is no evidence for a retroperitoneal hematoma, but there is an metastases on the right kidney.  
  
The liver, spleen, adrenal glands, and pancreas are unremarkable. Within the superior pole of the left kidney, there is a 3.9 cm cystic lesion. A 3.3 cm cystic lesion is also seen within the inferior pole of the left kidney. No calcifications are noted. The kidneys are small bilaterally.  
  
In CT pelvis,  evaluation of the bladder is limited due to the presence of a Foley catheter, the bladder is nondistended. The large and small bowels are normal in course and caliber. There is no obstruction.  
'''

cpt_light_pipeline = LightPipeline(sbert_models)

get_codes(cpt_light_pipeline, 'cpt_code', text)

In [None]:
from sparknlp_display import EntityResolverVisualizer

vis = EntityResolverVisualizer()

# Change color of an entity label
vis.set_label_colors({'Procedure':'#008080', 'Test':'#800080'})

light_data_cpt = cpt_light_pipeline.fullAnnotate(text)

vis.display(light_data_cpt[0], 'ner_chunk', 'cpt_code')


In [None]:
text='''1. The left ventricular cavity size and wall thickness appear normal. The wall motion and left ventricular systolic function appears hyperdynamic with estimated ejection fraction of 70% to 75%. There is near-cavity obliteration seen. There also appears to be increased left ventricular outflow tract gradient at the mid cavity level consistent with hyperdynamic left ventricular systolic function. There is abnormal left ventricular relaxation pattern seen as well as elevated left atrial pressures seen by Doppler examination.
2. The left atrium appears mildly dilated.
3. The right atrium and right ventricle appear normal.
4. The aortic root appears normal.
5. The aortic valve appears calcified with mild aortic valve stenosis, calculated aortic valve area is 1.3 cm square with a maximum instantaneous gradient of 34 and a mean gradient of 19 mm.
6. There is mitral annular calcification extending to leaflets and supportive structures with thickening of mitral valve leaflets with mild mitral regurgitation.
7. The tricuspid valve appears normal with trace tricuspid regurgitation with moderate pulmonary artery hypertension. Estimated pulmonary artery systolic pressure is 49 mmHg. Estimated right atrial pressure of 10 mmHg.
8. The pulmonary valve appears normal with trace pulmonary insufficiency.
9. There is no pericardial effusion or intracardiac mass seen.
10. There is a color Doppler suggestive of a patent foramen ovale with lipomatous hypertrophy of the interatrial septum.
11. The study was somewhat technically limited and hence subtle abnormalities could be missed from the study.

'''

df = get_codes(cpt_light_pipeline, 'cpt_code', text)

df

In [None]:
text='''
CC: Left hand numbness on presentation; then developed lethargy later that day.

HX: On the day of presentation, this 72 y/o RHM suddenly developed generalized weakness and lightheadedness, and could not rise from a chair. Four hours later he experienced sudden left hand numbness lasting two hours. There were no other associated symptoms except for the generalized weakness and lightheadedness. He denied vertigo.

He had been experiencing falling spells without associated LOC up to several times a month for the past year.

MEDS: procardia SR, Lasix, Ecotrin, KCL, Digoxin, Colace, Coumadin.

PMH: 1)8/92 evaluation for presyncope (Echocardiogram showed: AV fibrosis/calcification, AV stenosis/insufficiency, MV stenosis with annular calcification and regurgitation, moderate TR, Decreased LV systolic function, severe LAE. MRI brain: focal areas of increased T2 signal in the left cerebellum and in the brainstem probably representing microvascular ischemic disease. 

'''


df = get_codes(cpt_light_pipeline, 'cpt_code', text)

df