![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Clinical Entity Coding with Pretrained Resolver Models

In [0]:
import os
import json
import string
import numpy as np
import pandas as pd


import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader

from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)


print('sparknlp_jsl.version : ',sparknlp_jsl.version())

spark

# Clinical Resolvers

## Entity Resolvers for ICD-10

A common NLP problem in biomedical applications is identifying the presence of clinical entities in a given text. These clinical entities could be diseases, symptoms, drugs, results of clinical investigations, or others.

Besides returning the code in the "result" field, an entity resolver provides additional metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- alternative_confidence_ratios -> Rest of confidence ratios
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
- chunk -> ChunkId
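
To make the `confidence_ratio` definition concrete, here is a minimal sketch of the TopMatchConfidence / SecondMatchConfidence calculation. The helper name `confidence_ratio` and the confidence values are hypothetical; how the library derives the confidences from distances is not shown here.

```python
def confidence_ratio(confidences):
    """Return top / second confidence for a list sorted best match first."""
    if len(confidences) < 2:
        raise ValueError("need at least two candidates")
    top, second = confidences[0], confidences[1]
    if second == 0:
        # Avoid division by zero when the runner-up has no confidence at all.
        return float("inf")
    return top / second


# Hypothetical confidences for three candidate codes, best match first.
ratio = confidence_ratio([0.6, 0.3, 0.1])
print(ratio)
```

A ratio close to 1 means the top two candidates are almost equally plausible, so the alternative codes deserve a closer look.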

### Clinical NER Pipeline creation

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector DL annotator, processes various sentences per line
sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

# Tokenizer splits sentences into tokens

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

# StopWordsCleaner removes stop words from the raw tokens

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")
  

The next annotator in the pipeline is "WordEmbeddingsModel". We will download a pretrained model named "embeddings_clinical", available from "clinical/models".

Be patient when running this cell.

The first time you call this pretrained model, it has to be downloaded to your local machine, which takes a while.

The model is about 1.7 GB and is typically saved in your home folder as

`~HOMEFOLDER/cached_models/embeddings_clinical_en_2.0.2_2.4_1558454742956`

On later calls the model is loaded from your cached copy, but it still has to be indexed each time, so expect to wait up to 5 minutes (depending on your machine).

In [0]:
# WordEmbeddingsModel pretrained "embeddings_clinical" includes a model of about 1.7Gb that needs to be downloaded

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")
  

The last model in our NER pipeline is the pretrained `ner_clinical` model, available from "clinical/models" and loaded below with `MedicalNerModel`; a `NerConverter` then merges the tagged tokens into chunks. The model takes the "sentence", "token" and "embeddings" (clinical embeddings pretrained model) columns as input and classifies each token into one of four categories:

- `PROBLEM`: for patient problems

- `TEST`: for tests, labs, etc.

- `TREATMENT`: for treatments, medicines, etc.

- `OTHER`: for the rest of tokens.

To separate consecutive entities, the `B-` prefix (as in `B-PROBLEM`) marks the first token of each entity, and the `I-` prefix (as in `I-PROBLEM`) marks the remaining tokens inside it.
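
The B/I scheme can be sketched in plain Python. The function below is only an illustration of how prefixed tags are grouped into chunks, not the library's converter; the helper name `bio_to_chunks` and the hand-written tags are assumptions, and any tag without a `B-` or `I-` prefix is treated as outside an entity.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (label, text) chunks using B-/I- tag prefixes."""
    chunks = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != label):
            # A B- tag (or an I- tag with a different label) starts a new chunk.
            if words:
                chunks.append((label, " ".join(words)))
            label, words = entity, [token]
        elif prefix == "I":
            # Same label as the open chunk: extend it.
            words.append(token)
        else:
            # Any other tag closes the open chunk.
            if words:
                chunks.append((label, " ".join(words)))
            label, words = None, []
    if words:
        chunks.append((label, " ".join(words)))
    return chunks


# Hand-written tags for illustration, not actual model output.
tokens = ["polyuria", "polydipsia", "poor", "appetite", "and", "vomiting"]
tags = ["B-PROBLEM", "B-PROBLEM", "B-PROBLEM", "I-PROBLEM", "O", "B-PROBLEM"]
print(bio_to_chunks(tokens, tags))
```

Without the `B-` prefix, the adjacent entities "polyuria" and "polydipsia" would be indistinguishable from one two-token entity.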

In [0]:
# Named Entity Recognition for clinical concepts.

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")


### Define the NER pipeline

Now we will define the actual pipeline that puts together the annotators we have created.

In [0]:
# Build up the pipeline

pipeline_ner = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter
  ])

### Create a SparkDataFrame with the content

Now we will create a sample Spark dataframe with our clinical note example.

In this example we are working with a single clinical note. In production environments, a table with many such clinical notes could be distributed across a cluster and processed at scale.

In [0]:

clinical_note = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years '
    'prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior '
    'episode of HTG-induced pancreatitis three years prior to presentation, associated '
    'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '
    'presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. '
    'Two weeks prior to presentation, she was treated with a five-day course of amoxicillin '
    'for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin '
    'for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months '
    'at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; '
    'significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent '
    'laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, '
    'creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) '
    '10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed '
    'as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for '
    'starvation ketosis, as she reported poor oral intake for three days prior to admission. However, '
    'serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap '
    'was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and '
    'lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - '
    'the original sample was centrifuged and the chylomicron layer removed prior to analysis due to '
    'interference from turbidity caused by lipemia again. The patient was treated with an insulin drip '
    'for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within '
    '24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting '
    'of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on '
    '40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg '
    'two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She '
    'had close follow-up with endocrinology post discharge.'
)

data_ner = spark.createDataFrame([[clinical_note]]).toDF("text")

In [0]:
data_ner.show(truncate = 100)


### Transform / annotate the clinical note using the model.

To process the data with the newly created pipeline we have two options.

The first is to fit the pipeline and use the resulting model to transform our clinical note:

`output = model.transform(data_ner)`

That would store the results of running the model over the clinical note in a Spark DataFrame (`output`).

However, for small tests like this one, or for real-time requests, a `LightPipeline` is a simpler way to handle the data. It returns a dictionary (instead of a Spark DataFrame) with the results of the transformation.

We will fit the pipeline to get `model`, wrap it in a `LightPipeline` (`light_pipeline`), and then annotate `clinical_note` with it.

In [0]:
model = pipeline_ner.fit(data_ner)

light_pipeline = LightPipeline(model)
light_data = light_pipeline.annotate(clinical_note)

Now we have a dictionary (`light_data`) that contains the results of running the NER pipeline over our clinical note.

It contains the original document:

In [0]:
light_data['document'][0][0:100]


In [0]:
print("Number of sentences: {}".format(len(light_data['sentence'])))
print("")
for i in range(5):
    print("Sentence {}: {}".format(i, light_data['sentence'][i]))

In [0]:
print("Number of tokens: {}".format(len(light_data['token'])))
print("")
for i in range(25):
    print("Token {}: {} ({})".format(i, light_data['token'][i], light_data['ner'][i]))
print("...")
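
Because the annotated result is a plain dictionary of lists, ordinary Python tools work on it directly. The sketch below counts tagged tokens per entity label; `sample` is a small hand-written stand-in for `light_data`, and `count_entity_tokens` is a hypothetical helper.

```python
from collections import Counter

# Hand-written stand-in for the dictionary returned by annotate().
sample = {
    "token": ["history", "gestational", "diabetes", "mellitus", "diagnosed"],
    "ner": ["O", "B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O"],
}


def count_entity_tokens(annotations):
    """Count tokens per entity label, ignoring the B-/I- prefix and untagged tokens."""
    labels = [tag.split("-", 1)[1] for tag in annotations["ner"] if "-" in tag]
    return Counter(labels)


counts = count_entity_tokens(sample)
print(counts)
```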

Let's apply some HTML formatting to see the results of the pipeline in a nicer layout:

In [0]:
from sparknlp_display import NerVisualizer

light_result = light_pipeline.fullAnnotate(clinical_note)

visualiser = NerVisualizer()

# Change color of an entity label
visualiser.set_label_colors({'PROBLEM':'#008080', 'TEST':'#800080', 'TREATMENT':'#808080'})
vis = visualiser.display(light_result[0], label_col='ner_chunk', return_html=True)

# Set label filter
# vis = visualiser.display(light_result, label_col='ner_chunk', document_col='document',
                   #labels=['PROBLEM','TEST'])
  
displayHTML(vis)

# SentenceEntityResolver Models

- sbiobertresolve_icd10cm 
- sbiobertresolve_icd10cm_augmented
- sbiobertresolve_hcc_augmented
- sbiobertresolve_icd10cm_augmented_billable_hcc
- sbertresolve_icd10cm_slim_billable_hcc_med
- sbiobertresolve_icd10cm_slim_billable_hcc
- sbiobertresolve_icd10pcs
- sbiobertresolve_snomed_findings (with clinical_findings concepts from CT version)
- sbiobertresolve_snomed_findings_int  (with clinical_findings concepts from INT version)
- sbiobertresolve_snomed_auxConcepts (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from CT version)
- sbiobertresolve_snomed_auxConcepts_int  (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from INT version)
- sbertresolve_snomed_bodyStructure_med
- sbiobertresolve_rxnorm
- sbiobertresolve_rxcui
- sbiobertresolve_icdo
- sbiobertresolve_icdo_augmented
- sbiobertresolve_cpt
- sbiobertresolve_cpt_procedures_augmented
- sbiobertresolve_loinc
- sbiobertresolve_HPO
- sbiobertresolve_umls_major_concepts
- sbiobertresolve_umls_findings

In [0]:
import pandas as pd

pd.set_option('display.max_colwidth', 0)


def get_codes(lp, text, vocab='icd10cm_code', hcc=False):
    """Run a LightPipeline on `text` and return one DataFrame row per resolved chunk."""

    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions = []
    all_codes = []
    all_cosines = []
    all_k_aux_labels = []

    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):

        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result)
        # Candidate lists are stored in the metadata as ':::'-separated strings
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
        if hcc:
            # Fall back to an empty list when the resolver provides no aux labels
            try:
                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
            except KeyError:
                all_k_aux_labels.append([])
        else:
            all_k_aux_labels.append([])

    # Note: the 'all_distances' column holds the cosine distances
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes,
                       'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})

    if hcc:
        # Each aux label has the form 'billable||hcc_status||hcc_score'
        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
        df['hcc_score'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])

    df = df.drop(['all_k_aux_labels'], axis=1)

    return df
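
The metadata layout that `get_codes` relies on can be shown without Spark. This is a minimal sketch with a hand-written metadata dictionary: the keys are the ones used above, the values are illustrative only, and `parse_resolver_metadata` is a hypothetical helper.

```python
def parse_resolver_metadata(metadata, sep=":::"):
    """Pair each candidate code with its resolution text and cosine distance."""
    codes = metadata["all_k_results"].split(sep)
    texts = metadata["all_k_resolutions"].split(sep)
    dists = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return list(zip(codes, texts, dists))


# Hand-written metadata in the ':::'-separated layout; values are illustrative.
example = {
    "all_k_results": "O24919:::E119:::O2441",
    "all_k_resolutions": (
        "gestational diabetes mellitus:::"
        "type 2 diabetes mellitus:::"
        "gestational diabetes mellitus in pregnancy"
    ),
    "all_k_cosine_distances": "0.0000:::0.0330:::0.0429",
}

for code, text, dist in parse_resolver_metadata(example):
    print(code, text, dist)
```

The three lists are positionally aligned, which is why zipping them gives one tuple per candidate.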

## Sentence Entity Resolver - ICD10CM

**ICD10 Background Info**

ICD-10-CM vs. ICD-10-PCS

With the transition to ICD-10 in the United States, ICD-9 codes were split into ICD-10-CM and ICD-10-PCS codes. **The "CM" in ICD-10-CM stands for Clinical Modification**; ICD-10-CM codes were developed by the Centers for Disease Control and Prevention in conjunction with the National Center for Health Statistics (NCHS) for diagnosis coding and reporting in the United States, based on ICD-10 as published by the World Health Organization (WHO).

**The "PCS" in ICD-10-PCS stands for Procedure Coding System**. ICD-10-PCS is a completely separate medical coding system from ICD-10-CM, containing an additional 87,000 codes for use ONLY in United States inpatient hospital settings. ICD-10-PCS was developed by the Centers for Medicare and Medicaid Services (CMS) in conjunction with 3M Health Information Management (HIM).

ICD-10-CM codes add increased specificity to their ICD-9 predecessors, growing to five times the number of codes as the present system; a total of 68,000 clinical modification diagnosis codes. ICD-10-CM codes provide the ability to track and reveal more information about the quality of healthcare, allowing healthcare providers to better understand medical complications, better design treatment and care, and better comprehend and determine the outcome of care.

ICD-10-PCS is used only for inpatient, hospital settings in the United States, and is meant to replace volume 3 of ICD-9 for facility reporting of inpatient procedures. Due to the rapid and constant state of flux in medical procedures and technology, ICD-10-PCS was developed to accommodate the changing landscape. Common procedures, lab tests, and educational sessions that are not unique to the inpatient, hospital setting have been omitted from ICD-10-PCS.

ICD-10 is confusing enough when you’re trying to digest the differences between ICD-9 and ICD-10, but there are also different types of ICD-10 codes that providers should be aware of.


**Primary difference between ICD-10-CM and ICD-10-PCS**

When most people talk about ICD-10, they are referring to ICD-10-CM. This is the code set for diagnosis coding and is used in all healthcare settings in the United States. ICD-10-PCS, on the other hand, is used in hospital inpatient settings for inpatient procedure coding.

ICD-10-CM breakdown

- Approximately 68,000 codes
- 3–7 alphanumeric characters
- Facilitates timely processing of claims
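
The 3–7 character shape can be checked with a short helper. This is a rough sketch under stated assumptions: the pattern (a letter, a digit, then one to five further alphanumerics) and the decimal point after the third character follow the general ICD-10-CM convention, and `format_icd10cm` is a hypothetical name. It only checks the shape, not whether the code actually exists.

```python
import re

# Assumed shape: a letter, a digit, one alphanumeric, then up to four more
# alphanumerics (3-7 characters in total, written without the decimal point).
ICD10CM_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z]{1,5}$")


def format_icd10cm(code):
    """Validate the shape of an ICD-10-CM code and insert the decimal point."""
    code = code.strip().upper().replace(".", "")
    if not ICD10CM_PATTERN.match(code):
        raise ValueError(f"not a plausible ICD-10-CM code: {code!r}")
    # Three-character category codes have no decimal part.
    return code if len(code) == 3 else f"{code[:3]}.{code[3:]}"


print(format_icd10cm("O24919"))
print(format_icd10cm("E11"))
```

The resolvers in this notebook return codes without the decimal point, so a helper like this is mostly useful for display.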


ICD-10-PCS breakdown

- Will replace ICD-9-CM for hospital inpatient use only.
- ICD-10-PCS will not replace CPT codes used by physicians. According to HealthCare Information Management, Inc. (HCIM), “Its only intention is to identify inpatient facility services in a way not directly related to physician work, but directed towards allocation of hospital services.”
- 7 alphanumeric characters

ICD-10-PCS is very different from ICD-9-CM procedure coding due to its ability to be more specific and accurate. “This becomes increasingly important when assessing and tracking the quality of medical processes and outcomes, and compiling statistics that are valuable tools for research,” according to HCIM.

**Hierarchical Condition Category (HCC)**

Hierarchical condition category (HCC) coding is a risk-adjustment model originally designed to estimate future health care costs for patients. The Centers for Medicare & Medicaid Services (CMS) HCC model was initiated in 2004 but is becoming increasingly prevalent as the environment shifts to value-based payment models.

Hierarchical condition category relies on ICD-10 coding to assign risk scores to patients. Each HCC is mapped to an ICD-10 code. Along with demographic factors (such as age and gender), insurance companies use HCC coding to assign patients a risk adjustment factor (RAF) score. Using algorithms, insurers can use a patient’s RAF score to predict costs. For example, a patient with few serious health conditions could be expected to have average medical costs for a given time. However, a patient with multiple chronic conditions would be expected to have higher health care utilization and costs.
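
As a toy illustration of how demographic factors and HCCs could combine into a RAF score, the sketch below simply adds one coefficient per distinct HCC to a demographic factor. Every number, table and name here is made up for illustration; the real model uses published coefficients and additional rules (hierarchies, interactions) that are not modelled.

```python
# All coefficients are hypothetical placeholders, not real published values.
DEMOGRAPHIC_FACTOR = {("F", "25-34"): 0.20}
HCC_COEFFICIENT = {"19": 0.10, "22": 0.25}


def raf_score(sex, age_band, hcc_categories):
    """Add a demographic factor and one coefficient per distinct HCC."""
    score = DEMOGRAPHIC_FACTOR[(sex, age_band)]
    for hcc in sorted(set(hcc_categories)):
        # Unknown categories contribute nothing in this simplified sketch.
        score += HCC_COEFFICIENT.get(hcc, 0.0)
    return round(score, 3)


# A repeated HCC is only counted once.
print(raf_score("F", "25-34", ["19", "22", "19"]))
```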

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

# Sentence Detector DL annotator, processes various sentences per line
sentenceDetectorDL = SentenceDetectorDLModel\
      .pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \
      .setInputCols(["document"]) \
      .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

# WordEmbeddingsModel pretrained "embeddings_clinical" includes a model of 1.7Gb that needs to be downloaded
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")
  

# Named Entity Recognition for clinical concepts.
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter_icd = NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['PROBLEM'])\
      .setPreservePosition(False)

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")
    
icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
     .setInputCols(["ner_chunk", "sbert_embeddings"]) \
     .setOutputCol("icd10cm_code")\
     .setDistanceFunction("EUCLIDEAN")
    

# Build up the pipeline
resolver_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_icd,
        c2doc,
        sbert_embedder,
        icd_resolver
  ])


empty_data = spark.createDataFrame([['']]).toDF("text")

model = resolver_pipeline.fit(empty_data)
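
Conceptually, the resolver stage compares the embedding of each chunk with embeddings of known code descriptions and keeps the closest ones. The sketch below illustrates that idea with Euclidean distance on toy vectors; it is not the library's implementation, and the vectors, the `CODE_VECTORS` table and the `resolve` helper are all hypothetical.

```python
import math

# Toy 3-dimensional "embeddings" keyed by code; real sentence embeddings
# have hundreds of dimensions and come from the embedding stage.
CODE_VECTORS = {
    "E119": (0.9, 0.1, 0.0),
    "K859": (0.0, 0.8, 0.3),
    "R631": (0.1, 0.2, 0.9),
}


def resolve(chunk_vector, code_vectors, k=2):
    """Return the k codes whose vectors are closest to chunk_vector."""
    ranked = sorted(
        (math.dist(chunk_vector, vec), code) for code, vec in code_vectors.items()
    )
    return [(code, round(dist, 4)) for dist, code in ranked[:k]]


# The query vector sits close to the toy vector for E119.
print(resolve((0.85, 0.15, 0.05), CODE_VECTORS))
```

The ranked list corresponds to what the resolver exposes as the top code plus its alternatives in the metadata.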

In [0]:
icd10_res = model.transform(data_ner)

In [0]:
import pandas as pd

pd.set_option('display.max_colwidth', 0)

icd10_df = icd10_res.select(F.explode(F.arrays_zip('ner_chunk.result', 
                                                   'ner_chunk.metadata', 
                                                   'icd10cm_code.result', 
                                                   'icd10cm_code.metadata')).alias("cols")) \
                            .select(F.expr("cols['0']").alias("ner_chunk"),
                                     F.expr("cols['1']['entity']").alias("entity"), 
                                     F.expr("cols['2']").alias("icd10_code"),
                                     F.expr("cols['3']['all_k_results']").alias("all_codes"),
                                     F.expr("cols['3']['all_k_resolutions']").alias("resolutions"),
                                     F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list")).toPandas()
 
 
codes = []
resolutions = []
hcc_all = []

for code, resolution, hcc in zip(icd10_df['all_codes'], icd10_df['resolutions'], icd10_df['hcc_list']):
    
    codes.append(code.split(':::'))
    resolutions.append(resolution.split(':::'))
    hcc_all.append(hcc.split(":::"))

icd10_df['all_codes'] = codes  
icd10_df['resolutions'] = resolutions
icd10_df['hcc_list'] = hcc_all

Each entry in the `hcc_list` column packs the `billable`, `hcc_status` and `hcc_score` values into one string separated by `||`; we will split them into separate list columns.

In [0]:
def extract_billable(bil):
    """Split each 'billable||hcc_status||hcc_score' string into three parallel lists."""

    billable = []
    status = []
    score = []

    for b in bil:
        parts = b.split("||")
        billable.append(parts[0])
        status.append(parts[1])
        score.append(parts[2])

    return (billable, status, score)


# Split once, then fan the three lists out into their own columns
hcc_parts = icd10_df["hcc_list"].apply(extract_billable).apply(pd.Series)

icd10_df["billable"] = hcc_parts.iloc[:,0]
icd10_df["hcc_status"] = hcc_parts.iloc[:,1]
icd10_df["hcc_score"] = hcc_parts.iloc[:,2]

icd10_df.drop("hcc_list", axis=1, inplace= True)

In [0]:
icd10_df.head(15)

index,ner_chunk,entity,icd10_code,all_codes,resolutions,billable,hcc_status,hcc_score
0,gestational diabetes mellitus,PROBLEM,O24919,"[O24919, E119, O2441, O24419, O2443, O24439, Z8632, O24319, O2431, O2411, O244, P700, O241]","[gestational diabetes mellitus [Unspecified diabetes mellitus in pregnancy, unspecified trimester], gestational diabetes mellitus [Type 2 diabetes mellitus without complications], gestational diabetes mellitus [Gestational diabetes mellitus in pregnancy], gestational diabetes mellitus in pregnancy [Gestational diabetes mellitus in pregnancy, unspecified control], postpartum gestational diabetes mellitus [Gestational diabetes mellitus in the puerperium], postpartum gestational diabetes mellitus [Gestational diabetes mellitus in the puerperium, unspecified control], history of gestational diabetes mellitus [Personal history of gestational diabetes], preexisting diabetes mellitus in pregnancy [Unspecified pre-existing diabetes mellitus in pregnancy, unspecified trimester], pre-existing diabetes mellitus in pregnancy [Unspecified pre-existing diabetes mellitus in pregnancy], pre-existing diabetes mellitus in pregnancy (disorder) [Pre-existing type 2 diabetes mellitus, in pregnancy], postpartum gestational diabetes mellitus (disorder) [Gestational diabetes mellitus], newborn affected by maternal gestational diabetes mellitus [Syndrome of infant of mother with gestational diabetes], pre-existing type 2 diabetes mellitus in pregnancy [Pre-existing type 2 diabetes mellitus, in pregnancy, childbirth and the puerperium]]","[1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
1,subsequent type two diabetes mellitus,PROBLEM,O2411,"[O2411, E118, E119, E11, E139, E113, E1144, E1143, E1151, E117, E115]","[pre-existing type 2 diabetes mellitus [Pre-existing type 2 diabetes mellitus, in pregnancy], disorder associated with type 2 diabetes mellitus [Type 2 diabetes mellitus with unspecified complications], type 2 diabetes mellitus [Type 2 diabetes mellitus without complications], type 2 diabetes mellitus [Type 2 diabetes mellitus], secondary diabetes mellitus [Other specified diabetes mellitus without complications], disorder of eye due to type 2 diabetes mellitus [Type 2 diabetes mellitus with ophthalmic complications], disorder of nervous system due to type 2 diabetes mellitus [Type 2 diabetes mellitus with diabetic amyotrophy], neurological disorder with type 2 diabetes mellitus [Type 2 diabetes mellitus with diabetic autonomic (poly)neuropathy], diabetes mellitus type 2 with complications [Type 2 diabetes mellitus with diabetic peripheral angiopathy without gangrene], multiple complications of type 2 diabetes mellitus [Type 2 diabetes mellitus], peripheral circulatory disorder due to type 2 diabetes mellitus [Type 2 diabetes mellitus with circulatory complications]]","[0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0]","[0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0]","[0, 18, 19, 0, 19, 0, 75/18, 75/18, 108/18, 0, 0]"
2,T2DM,PROBLEM,E11,"[E11, E119, E118, O2411, E139, E113, E148, E149, Z833, E138, E117, E114, Z794, E1169]","[type 2 diabetes mellitus [Type 2 diabetes mellitus], type 2 diabetes mellitus [Type 2 diabetes mellitus without complications], disorder associated with type 2 diabetes mellitus [Type 2 diabetes mellitus with unspecified complications], pre-existing type 2 diabetes mellitus [Pre-existing type 2 diabetes mellitus, in pregnancy], secondary diabetes mellitus [Other specified diabetes mellitus without complications], disorder of eye with type 2 diabetes mellitus [Type 2 diabetes mellitus with ophthalmic complications], disorder associated with diabetes mellitus, dm - diabetes mellitus, fh: diabetes mellitus [Family history of diabetes mellitus], secondary diabetes [Other specified diabetes mellitus with unspecified complications], multiple complications of type 2 diabetes mellitus [Type 2 diabetes mellitus], neurological disorder with diabetes type 2 [Type 2 diabetes mellitus with neurological complications], insulin treated type 2 diabetes mellitus [Long term (current) use of insulin], hyperglycemia due to type 2 diabetes mellitus (disorder) [Type 2 diabetes mellitus with other specified complication]]","[0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]","[0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]","[0, 19, 18, 0, 19, 0, 0, 0, 0, 18, 0, 0, 19, 18]"
3,HTG-induced pancreatitis,PROBLEM,K859,"[K859, F10988, K853, K852, K8591, K858, K85, K851, K850]","[alcohol-induced pancreatitis [Acute pancreatitis, unspecified], alcohol-induced pancreatitis [Alcohol use, unspecified with other alcohol-induced disorder], drug-induced acute pancreatitis [Drug induced acute pancreatitis], alcohol-induced acute pancreatitis [Alcohol induced acute pancreatitis], necrotizing pancreatitis [Acute pancreatitis with uninfected necrosis, unspecified], traumatic acute pancreatitis [Other acute pancreatitis], acute pancreatitis [Acute pancreatitis], biliary acute pancreatitis [Biliary acute pancreatitis], idiopathic acute pancreatitis [Idiopathic acute pancreatitis]]","[0, 1, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0]"
4,an acute hepatitis,PROBLEM,B159,"[B159, B15, K720, B179, B172, Z0389, K712, B16, B169, K701, K7200, K7010]","[acute hepatitis a [Hepatitis A without hepatic coma], acute hepatitis a [Acute hepatitis A], acute hepatitis [Acute and subacute hepatic failure], acute hepatitis [Acute viral hepatitis, unspecified], acute hepatitis e [Acute hepatitis E], acute infectious hepatitis suspected [Encounter for observation for other suspected diseases and conditions ruled out], toxic liver disease with acute hepatitis [Toxic liver disease with acute hepatitis], acute hepatitis b [Acute hepatitis B], acute hepatitis b [Acute hepatitis B without delta-agent and without hepatic coma], acute alcoholic hepatitis [Alcoholic hepatitis], acute hepatic failure [Acute and subacute hepatic failure without coma], alcoholic hepatitis, acute [Alcoholic hepatitis without ascites]]","[1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
5,obesity,PROBLEM,E668,"[E668, E66, E6601, E669, Z9189, E663, E660, Z6854]","[abdominal obesity [Other obesity], overweight and obesity [Overweight and obesity], morbid obesity [Morbid (severe) obesity due to excess calories], generalised obesity [Obesity, unspecified], potential for obesity [Other specified personal risk factors, not elsewhere classified], patient overweight [Overweight], obesity due to excess calories [Obesity due to excess calories], childhood obesity [Body mass index [BMI] pediatric, greater than or equal to 95th percentile for age]]","[1, 0, 1, 1, 1, 1, 0, 1]","[0, 0, 1, 0, 0, 0, 0, 0]","[0, 0, 22, 0, 0, 0, 0, 0]"
6,a body mass index,PROBLEM,R2233,"[R2233, R220, R2230, Z68, R229, R2243, R4189, R222, R1900, R2989, R198, H579, R6889, M799, R193]","[skin mass of arms [Localized swelling, mass and lump, upper limb, bilateral], skin mass of head [Localized swelling, mass and lump, head], skin mass of arm [Localized swelling, mass and lump, unspecified upper limb], body mass index [bmi] [Body mass index [BMI]], skin mass [Localized swelling, mass and lump, unspecified], skin mass of legs [Localized swelling, mass and lump, lower limb, bilateral], preoccupation with body size [Other symptoms and signs involving cognitive functions and awareness], abdominal wall mass [Localized swelling, mass and lump, trunk], abdominal mass [Intra-abdominal and pelvic swelling, mass and lump, unspecified site], mass of musculoskeletal structure [Other symptoms and signs involving the musculoskeletal system], abdominal wall movement [Other specified symptoms and signs involving the digestive system and abdomen], mass of eye structure [Unspecified disorder of eye and adnexa], head pressure [Other general symptoms and signs], muscle mass of limb [Soft tissue disorder, unspecified], abdominal muscle resistance [Abdominal rigidity]]","[1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
7,polyuria,PROBLEM,R358,"[R358, R35, R31, R300, E7201, R808, R80, R809, N026, R319, R34, N029]","[polyuria [Other polyuria], polyuria [Polyuria], hematuria [Hematuria], dysuria [Dysuria], cystinuria [Cystinuria], paroxysmal proteinuria [Other proteinuria], proteinuria [Proteinuria], proteinuria [Proteinuria, unspecified], persistent hematuria (disorder) [Recurrent and persistent hematuria with dense deposit disease], upper urinary tract hematuria (disorder) [Hematuria, unspecified], anuria and oliguria [Anuria and oliguria], macroscopic hematuria [Recurrent and persistent hematuria with unspecified morphologic changes]]","[1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]","[0, 0, 0, 0, 23, 0, 0, 0, 141, 0, 0, 141]"
8,polydipsia,PROBLEM,R631,"[R631, F6389, O40, O409XX0, G4750, G475, M7989, R632, R061, G479, G471, R002, G255, G4754, M353, G4713]","[polydipsia [Polydipsia], psychogenic polydipsia [Other impulse disorders], polyhydramnios [Polyhydramnios], polyhydramnios [Polyhydramnios, unspecified trimester, not applicable or unspecified], parasomnia [Parasomnia, unspecified], parasomnia [Parasomnia], polyalgia [Other specified soft tissue disorders], polyphagia [Polyphagia], biphasic stridor [Stridor], dyssomnia [Sleep disorder, unspecified], hypersomnia (disorder) [Hypersomnia], intermittent palpitations [Palpitations], choreoathetosis, paroxysmal [Other chorea], secondary parasomnia [Parasomnia in conditions classified elsewhere], polymyalgia [Polymyalgia rheumatica], recurrent hypersomnia (disorder) [Recurrent hypersomnia]]","[1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[0, nan, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 0]"
9,poor appetite,PROBLEM,R630,"[R630, Z7689, R531, N484, R109, R4581, R448, R6889, E46, R635, R628, E639, R410, R4589, R452, Z789, M629]","[poor appetite [Anorexia], patient dissatisfied with nutrition regime [Persons encountering health services in other specified circumstances], general weakness [Weakness], poor erection [Other disorders of penis], stomach discomfort [Unspecified abdominal pain], poor self-esteem [Low self-esteem], poor self-image [Other symptoms and signs involving general sensations and perceptions], low motivation [Other general symptoms and signs], undernutrition (disorder) [Unspecified protein-calorie malnutrition], poor weight gain [Abnormal weight gain], poor weight gain [Lack of expected normal physiological development in childhood and adults], undernourished (finding) [Nutritional deficiency, unspecified], orientation poor [Disorientation, unspecified], feeling bitter [Other symptoms and signs involving emotional state], feeling sad [Unhappiness], patient's condition poor [Other specified health status], poor muscle tone [Disorder of muscle, unspecified]]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0, 0, 0]"
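
Since the per-candidate lists line up by position, they can be combined to post-filter the suggestions. As a sketch, the hypothetical helper below picks the best-ranked candidate that is flagged as billable, using the first few candidates for "T2DM" from the table above.

```python
def first_billable(codes, billable):
    """Return the best-ranked code whose billable flag is '1', or None."""
    for code, flag in zip(codes, billable):
        if str(flag) == "1":
            return code
    return None


# First four candidates and billable flags for the chunk "T2DM".
codes = ["E11", "E119", "E118", "O2411"]
billable = ["0", "1", "1", "0"]
print(first_billable(codes, billable))
```

Here the top-ranked code `E11` is not billable, so the helper falls through to `E119`.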


Let's try the `sbertresolve_icd10cm_slim_billable_hcc_med` model, which was trained with `sbert_jsl_medium_uncased` embeddings.

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

jsl_sbert_embedder = BertSentenceEmbeddings.pretrained('sbert_jsl_medium_uncased','en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sbert_embeddings")
    
icd_resolver_med = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") \
     .setInputCols(["ner_chunk", "sbert_embeddings"]) \
     .setOutputCol("icd10cm_code")\
     .setDistanceFunction("EUCLIDEAN")

icd_medPipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        jsl_sbert_embedder,
        icd_resolver_med])

icd_med_lp = LightPipeline(icd_medPipelineModel)
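
Conceptually, the resolver embeds the input chunk and returns the codes whose reference embeddings lie closest under the chosen distance function (here `EUCLIDEAN`). The lookup can be sketched in plain Python as below. This is only an illustration, not the library's implementation: the 3-dimensional vectors are made up (real sentence embeddings have hundreds of dimensions), and `nearest_codes` is a hypothetical helper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_codes(query, index, k=2):
    """Return the k (code, distance) pairs closest to the query vector."""
    scored = [(code, euclidean(query, vec)) for code, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical toy index: code -> made-up embedding (not real model output).
toy_index = {
    "E119": [0.9, 0.1, 0.0],
    "O24919": [0.7, 0.6, 0.1],
    "R631": [0.0, 0.2, 0.9],
}

query_vec = [0.85, 0.15, 0.05]  # made-up embedding of a query chunk
result = nearest_codes(query_vec, toy_index, k=2)
print(result)
```

The ranked list of neighbours is what surfaces as `all_codes` and `all_distances` in the result tables.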

In [0]:
text = 'gestational diabetes mellitus'

get_codes (icd_med_lp, text, vocab='icd10cm_code', hcc=True)

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances,billable,hcc_status,hcc_score
0,gestational diabetes mellitus,0,28,O24919,"[O24919, E119, O2441, O244, O2443, O24439, Z8632, P702, P700]","[gestational diabetes mellitus [Unspecified diabetes mellitus in pregnancy, unspecified trimester], gestational diabetes mellitus [Type 2 diabetes mellitus without complications], gestational diabetes mellitus [Gestational diabetes mellitus in pregnancy], gestational diabetes mellitus, class r [Gestational diabetes mellitus], gestational diabetes mellitus in the puerperium [Gestational diabetes mellitus in the puerperium], postpartum gestational diabetes mellitus [Gestational diabetes mellitus in the puerperium, unspecified control], history of gestational diabetes mellitus [Personal history of gestational diabetes], neonatal diabetes mellitus [Neonatal diabetes mellitus], newborn affected by maternal gestational diabetes mellitus [Syndrome of infant of mother with gestational diabetes]]","[0.0000, 0.0000, 0.0000, 0.0330, 0.0429, 0.0596, 0.0618, 0.0719, 0.0873]","[1, 1, 0, 0, 0, 1, 1, 1, 1]","[0, 1, 0, 0, 0, 0, 0, 0, 0]","[0, 19, 0, 0, 0, 0, 0, 0, 0]"
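
The `billable` and `hcc_status` columns appear to line up position by position with `all_codes`, so they can be used to pick a candidate that satisfies a coding constraint instead of simply taking the first one. A minimal sketch, assuming the list columns are available as Python lists (they may need parsing if the frame was round-tripped through CSV). `closest_matching` is a hypothetical helper, not part of the library, and the values are copied from the row above.

```python
def closest_matching(codes, distances, billable, hcc_status, require_hcc=False):
    """Return (code, distance) of the nearest billable candidate, or None."""
    candidates = [
        (dist, pos, code)
        for pos, (code, dist, bill, hcc) in enumerate(
            zip(codes, distances, billable, hcc_status))
        if bill == 1 and (hcc == 1 or not require_hcc)
    ]
    if not candidates:
        return None
    dist, _, code = min(candidates)  # ties on distance keep the original rank
    return code, dist

# Values copied from the 'gestational diabetes mellitus' row.
codes = ["O24919", "E119", "O2441", "O244", "O2443", "O24439", "Z8632", "P702", "P700"]
distances = [0.0000, 0.0000, 0.0000, 0.0330, 0.0429, 0.0596, 0.0618, 0.0719, 0.0873]
billable = [1, 1, 0, 0, 0, 1, 1, 1, 1]
hcc_status = [0, 1, 0, 0, 0, 0, 0, 0, 0]

print(closest_matching(codes, distances, billable, hcc_status))
print(closest_matching(codes, distances, billable, hcc_status, require_hcc=True))
```

With `require_hcc=True` the only candidate left is `E119`, which shows why the alternatives are worth inspecting: three candidates tie at distance 0.0000, and only one of them carries an HCC status.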


In [0]:
text = 'T2DM'

get_codes (icd_med_lp, text, vocab='icd10cm_code', hcc=True)

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances,billable,hcc_status,hcc_score
0,T2DM,0,3,E119,"[E119, S23111A, S23131A, S23121A, Q923, I8000, N6489, S23123A, H7389, S52226P, S52324P, S72324P, S72326P, S82116C, S52325P, I802, S52226C, M7981, I620, S82116K, S52326P, S52324M, S23133A, S52224P, S82115B]","[t2dm [Type 2 diabetes mellitus without complications], closed dislocation t1/t2 [Dislocation of T1/T2 thoracic vertebra, initial encounter], closed dislocation t4/t5 [Dislocation of T4/T5 thoracic vertebra, initial encounter], closed dislocation t2/t3 [Dislocation of T2/T3 thoracic vertebra, initial encounter], trisomy 16p11.2p12.2 [Other trisomies and partial trisomies of the autosomes, not elsewhere classified], phlbts and thombophlb of superfic vessels of unsp low extrm [Phlebitis and thrombophlebitis of superficial vessels of unspecified lower extremity], nontraumatic hematoma of breast [Other specified disorders of breast], closed dislocation t3/t4 [Dislocation of T3/T4 thoracic vertebra, initial encounter], deformity of tympanic membrane [Other specified disorders of tympanic membrane], nondisp transverse fx shaft of unsp ulna, 7thp [Nondisplaced transverse fracture of shaft of unspecified ulna, subsequent encounter for closed fracture with malunion], nondisp transverse fx shaft of r rad, 7thp [Nondisplaced transverse fracture of shaft of right radius, subsequent encounter for closed fracture with malunion], nondisp transverse fx shaft of r femr, 7thp [Nondisplaced transverse fracture of shaft of right femur, subsequent encounter for closed fracture with malunion], nondisp transverse fx shaft of unsp femr, 7thp [Nondisplaced transverse fracture of shaft of unspecified femur, subsequent encounter for closed fracture with malunion], nondisp fx of unsp tibial spine, init for opn fx type 3a/b/c [Nondisplaced fracture of unspecified tibial spine, initial encounter for open fracture type IIIA, IIIB, or IIIC], nondisp transverse fx shaft of l rad, 7thp [Nondisplaced transverse fracture of shaft of left radius, subsequent encounter for closed 
fracture with malunion], phlbts and thombophlb of and unsp deep vessels of low extrm [Phlebitis and thrombophlebitis of other and unspecified deep vessels of lower extremities], nondisp transverse fx shaft of unsp ulna, 7thc [Nondisplaced transverse fracture of shaft of unspecified ulna, initial encounter for open fracture type IIIA, IIIB, or IIIC], nontraumatic hematoma of soft tissue [Nontraumatic hematoma of soft tissue], nontraumatic subdural hemorrhage [Nontraumatic subdural hemorrhage], nondisp fx of unsp tibial spine, subs for clos fx w nonunion [Nondisplaced fracture of unspecified tibial spine, subsequent encounter for closed fracture with nonunion], nondisp transverse fx shaft of unsp rad, 7thp [Nondisplaced transverse fracture of shaft of unspecified radius, subsequent encounter for closed fracture with malunion], nondisp transverse fx shaft of r rad, 7thm [Nondisplaced transverse fracture of shaft of right radius, subsequent encounter for open fracture type I or II with nonunion], closed dislocation t5/t6 [Dislocation of T5/T6 thoracic vertebra, initial encounter], nondisp transverse fx shaft of r ulna, 7thp [Nondisplaced transverse fracture of shaft of right ulna, subsequent encounter for closed fracture with malunion], nondisp fx of left tibial spine, init for opn fx type i/2 [Nondisplaced fracture of left tibial spine, initial encounter for open fracture type I or II]]","[0.0000, 0.3226, 0.3402, 0.3381, 0.3536, 0.3454, 0.3351, 0.3510, 0.3351, 0.3455, 0.3474, 0.3552, 0.3547, 0.3641, 0.3590, 0.3569, 0.3612, 0.3512, 0.3510, 0.3704, 0.3623, 0.3607, 0.3713, 0.3597, 0.3610]","[1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


In [0]:
text = 'dry oral mucosa'

get_codes (icd_med_lp, text, vocab='icd10cm_code', hcc=True)

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances,billable,hcc_status,hcc_score
0,dry oral mucosa,0,14,J3489,"[J3489, R682, Z7689, K6289, L988, L853, R3989, H3531, T180X, H18899, O731, Z789, Y92091, K1322, R102, O923]","[nasal mucosa dry [Other specified disorders of nose and nasal sinuses], dry mouth [Dry mouth, unspecified], sterile diet [Persons encountering health services in other specified circumstances], empty rectum [Other specified diseases of anus and rectum], dry skin [Other specified disorders of the skin and subcutaneous tissue], dry skin [Xerosis cutis], perineum dry (finding) [Other symptoms and signs involving the genitourinary system], dry senile macular degeneration [Nonexudative age-related macular degeneration], foreign body in oral mucosa [Foreign body in mouth], dry cornea [Other specified disorders of cornea, unspecified eye], postpartum retained portion of placenta or membranes with no hemorrhage [Retained portions of placenta and membranes, without hemorrhage], minimal salicylate diet [Other specified health status], bathroom in oth non-institutional residence as place [Bathroom in other non-institutional residence as the place of occurrence of the external cause], minimal keratinized residual ridge mucosa [Minimal keratinized residual ridge mucosa], vulvovaginal dryness [Pelvic and perineal pain], agalactia (little or no breast milk) [Agalactia]]","[0.1347, 0.2658, 0.2684, 0.2762, 0.2780, 0.2780, 0.3073, 0.3079, 0.3081, 0.3002, 0.3079, 0.3244, 0.3238, 0.3357, 0.3237, 0.3480]","[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## Sentence Entity Resolver - CPT

The Current Procedural Terminology (CPT) code set is a medical code set maintained by the American Medical Association. The CPT code set describes medical, surgical, and diagnostic services and is designed to communicate uniform information about medical services and procedures among physicians, coders, patients, accreditation organizations, and payers for administrative, financial, and analytical purposes.

In [0]:
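# Re-create the assembler: the ICD-10 cell above rebound `documentAssembler`
# to output "ner_chunk", but the sentence-level stages below need "document".
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")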

ner_converter_cpt = NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['TREATMENT','TEST'])\
      .setPreservePosition(False)

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_augmented","en", "clinical/models") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("sbert_cpt_code")\
      .setDistanceFunction("EUCLIDEAN")
  
sbert_resolver_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_cpt,
        c2doc,
        sbert_embedder,
        cpt_resolver])


data_ner = spark.createDataFrame([['sample text']]).toDF("text")

sbert_models = sbert_resolver_pipeline.fit(data_ner)

In [0]:
sbert_light_pipeline_cpt = LightPipeline(sbert_models)

In [0]:
text = 'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) 10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for starvation ketosis, as she reported poor oral intake for three days prior to admission. However, serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again. The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within 24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She had close follow-up with endocrinology post discharge.'

light_result = sbert_light_pipeline_cpt.annotate(text)

light_result.keys()

In [0]:
light_result['sbert_cpt_code']

In [0]:
# The converter whitelists TREATMENT and TEST chunks, so label the column generically.
df = pd.DataFrame(list(zip(light_result['ner_chunk'], light_result['sbert_cpt_code'])),
                  columns = ['chunk','cpt_code'])

df

Unnamed: 0,chunk,cpt_code
0,BMI,3008F
1,amoxicillin,80150
2,metformin,83858
3,glipizide,82980
4,dapagliflozin,1022259
5,atorvastatin,4145F
6,gemfibrozil,80145
7,dapagliflozin,1022259
8,Physical examination,1014526
9,her abdominal examination,76700


In [0]:
df = get_codes (sbert_light_pipeline_cpt, text, 'sbert_cpt_code')

df

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,BMI,316,318,3008F,"[3008F, 0001F, 4014F, 2001F, 84112, 85041, 89055, 85048, 86706, 84379, 84378, 84377, 84376, 82962, 86336, 88248, 88245, 88249, 1011483, 1011517, 82980]","[Body Mass Index (BMI), documented (PV), Weight monitoring, Weight monitoring, Weighing, Globulin level, RBC count, WBC count, WBC count, HBsAb measurement, Saccharide level, Saccharide level, Saccharide level, Saccharide level, Glucometer blood sugar, Inhibin A, BCR analysis, BCR analysis, BCR analysis, Hemoglobin, Insulin, Glutethimide]","[0.2327, 0.2331, 0.2331, 0.2343, 0.2526, 0.2574, 0.2581, 0.2581, 0.2651, 0.2705, 0.2705, 0.2705, 0.2705, 0.2761, 0.2786, 0.2781, 0.2781, 0.2781, 0.2663, 0.2696, 0.2866]"
1,amoxicillin,499,509,80150,"[80150, 80170, 4049F, 4043F, 4046F, 4124F, 4047F, 4041F, 4042F, 4048F, 4120F, 80202, 4045F, 82415, 80200, 87186]","[Amikacin, Gentamicin, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Antibiotic coverage, Vancomycin, Empirical antibiotic therapy, Chloramphenicol, Tobramycin, Antimicrobial susceptibility test]","[0.1113, 0.1306, 0.1287, 0.1287, 0.1287, 0.1287, 0.1287, 0.1287, 0.1287, 0.1287, 0.1287, 0.1467, 0.1461, 0.1772, 0.1937, 0.1844]"
2,metformin,557,565,83858,"[83858, 80358, 83840, 80048, 80436, 1011517, 84431, 84305, 82962, 80434, 80435, 1011211, 1013548, 3066F, 80203, 82943, 82980, 80168, 99500, 80053]","[Methsuximide, Methadone, Methadone, Metabolic function test, Metyrapone panel, Insulin, Fatty acid measurement, Somatomedin, Glucometer blood glucose, Insulin tolerance test, Insulin tolerance test, Insulin tolerance panel, Medical nutrition therapy, Therapeutic regimen, Zonisamide, Glucagon, Glutethimide, Ethosuximide, Diabetes monitoring call, Chem. metabolic function tests]","[0.2017, 0.2145, 0.2145, 0.2217, 0.2345, 0.2212, 0.2221, 0.2416, 0.2374, 0.2337, 0.2337, 0.2382, 0.2341, 0.2368, 0.2533, 0.2470, 0.2561, 0.2585, 0.2452, 0.2582]"
3,glipizide,568,576,82980,"[82980, 1022259, 82943, 80175, 1011157, 80235, 80199, 1011204, 80168, 80156, 80157, 4190F, 80183, 80203, 82760, 82962, 80176, 80186]","[Glutethimide, Digoxin, Glucagon, Lamotrigine, Carbamazepine, Lacosamide, Tiagabine, Glucagon tolerance panel, Ethosuximide, Carbamazepine measurement (procedure), Carbamazepine measurement (procedure), Therapeutic drug monitoring assay, Oxcarbazepine, Zonisamide, Galactose measurement (procedure), Glucometer blood glucose, Lidocaine measurement (procedure), Dilantin measurement]","[0.1760, 0.1793, 0.1912, 0.2042, 0.2154, 0.2249, 0.2244, 0.2280, 0.2322, 0.2367, 0.2367, 0.2258, 0.2450, 0.2419, 0.2422, 0.2404, 0.2425, 0.2546]"
4,dapagliflozin,583,595,1022259,"[1022259, 80203, 80199, 80166, 80186, 80185, 80175, 82646, 80183, 33920, 80160, 25830, 25170, 80162, 33615, 33647, 82742, 82638, 1022260, 43775, 1003485, 78660, 1011157]","[Digoxin, Zonisamide, Tiagabine, Doxepin, Dilantin measurement, Dilantin measurement, Lamotrigine, Dihydrocodeinone, Oxcarbazepine, Rastelli operation, Desipramine, Darrach operation, Darrach operation, Digoxin; total, Atrioseptopexy, Atrioseptopexy, Flurazepam, Dibucaine number, Valproic acid (dipropylacetic acid), Longitudinal gastroplasty (procedure), Dermabrasion, Dacroscintigraphy, Carbamazepine]","[0.1627, 0.2213, 0.2169, 0.2300, 0.2382, 0.2382, 0.2423, 0.2510, 0.2523, 0.2532, 0.2549, 0.2512, 0.2512, 0.2542, 0.2695, 0.2695, 0.2532, 0.2617, 0.2608, 0.2690, 0.2652, 0.2663, 0.2670]"
5,atorvastatin,610,621,4145F,"[4145F, 4010F, 4210F, 4009F, 82610, 1022248, 37221, 37207, 37225, 37236, 37217, 37206, 37205, 37230, 37223, 37234, 37238, 75960, 37208, 37226, 0075T]","[Antihypertensive therapy (procedure), Angiotensin converting enzyme inhibitor therapy, Angiotensin converting enzyme inhibitor therapy, Angiotensin converting enzyme inhibitor prophylaxis, Cystatin C, Anabolic steroids, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent, Insertion of arterial stent]","[0.2017, 0.2119, 0.2119, 0.2210, 0.2260, 0.2161, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203, 0.2203]"
6,gemfibrozil,627,637,80145,"[80145, 80280, 0264T, 1008F, 4481F, 4133F, 4019F, 4132F, 4186F, 4221F, 4220F, 4131F, 4134F, 4013F, 4005F, 4210F, 4480F, 4185F, 1011153, 4190F, 83858, 80170, 79999]","[Adalimumab, Vedolizumab, Biological response modifier therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Drug therapy, Therapeutic Drug Assays, Therapeutic drug monitoring assay, Methsuximide, Gentamicin, Sr89 therapy]","[0.2224, 0.2418, 0.2332, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2291, 0.2447, 0.2392, 0.2620, 0.2585, 0.2594]"
7,dapagliflozin,664,676,1022259,"[1022259, 80203, 80199, 80166, 80186, 80185, 80175, 82646, 80183, 33920, 80160, 25830, 25170, 80162, 33615, 33647, 82742, 82638, 1022260, 43775, 1003485, 78660, 1011157]","[Digoxin, Zonisamide, Tiagabine, Doxepin, Dilantin measurement, Dilantin measurement, Lamotrigine, Dihydrocodeinone, Oxcarbazepine, Rastelli operation, Desipramine, Darrach operation, Darrach operation, Digoxin; total, Atrioseptopexy, Atrioseptopexy, Flurazepam, Dibucaine number, Valproic acid (dipropylacetic acid), Longitudinal gastroplasty (procedure), Dermabrasion, Dacroscintigraphy, Carbamazepine]","[0.1627, 0.2213, 0.2169, 0.2300, 0.2382, 0.2382, 0.2423, 0.2510, 0.2523, 0.2532, 0.2549, 0.2512, 0.2512, 0.2542, 0.2695, 0.2695, 0.2532, 0.2617, 0.2608, 0.2690, 0.2652, 0.2663, 0.2670]"
8,Physical examination,722,741,1014526,"[1014526, 97750, 99143, 99145, 99144, 1002F, 1003F, 97799, 97533, 97532, 97139, 97110, 97150, 97039]","[Physical Examination, Physical assessment, Monitors physical parameters, Monitors physical parameters, Monitors physical parameters, Physical activity assessment, Physical activity assessment, Physical medicine procedure, Physical medicine procedure, Physical medicine procedure, Physical medicine procedure, Physical medicine procedure, Physical medicine procedure, Physical medicine service]","[0.0216, 0.0581, 0.0914, 0.0914, 0.0914, 0.1153, 0.1153, 0.1314, 0.1314, 0.1314, 0.1314, 0.1314, 0.1314, 0.1607]"
9,her abdominal examination,811,835,76700,"[76700, 76705, 50949, 44202, 59898, 58544, 43653, 44188, 50542, 50548, 60650, 44204, 49327, 43647, 38570, 44238, 54692, 55866, 58578, 49325, 44208, 54690, 55559, 58660, 58679]","[US abdominal scan, US abdominal scan, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen]","[0.1269, 0.1269, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573, 0.1573]"
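
The returned codes come in several shapes. In the CPT code set, Category I codes are five digits, Category II codes are four digits followed by `F`, and Category III codes are four digits followed by `T`. The longer seven-digit values in the table do not fit any of these patterns. A small sketch that buckets the top codes by surface format only, using a hypothetical `cpt_code_shape` helper. It does not check that a code actually exists, and it ignores other alphanumeric CPT formats.

```python
import re

def cpt_code_shape(code):
    """Classify a code string by its surface format only."""
    if re.fullmatch(r"\d{5}", code):
        return "category I (five digits)"
    if re.fullmatch(r"\d{4}F", code):
        return "category II (four digits + F)"
    if re.fullmatch(r"\d{4}T", code):
        return "category III (four digits + T)"
    return "other"

# Top code per chunk, copied from the `code` column above.
top_codes = ["3008F", "80150", "83858", "82980", "1022259",
             "4145F", "80145", "1022259", "1014526", "76700"]

for code in top_codes:
    print(code, "->", cpt_code_shape(code))
```

Codes that land in the `other` bucket are worth a manual look before they are used downstream.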


In [0]:
df = get_codes (sbert_light_pipeline_cpt, 'The patient needs to have a coronary artery bypass but doctor suggests an abdomen CT at first.', 'sbert_cpt_code')

df

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,a coronary artery bypass,26,49,1006216,"[1006216, 1006217, 33504, 33503, 33505, 34802, 35540, 01925, 35537, 35637, 35538, 35638, 34831, 33533, 33536, 33535, 33534, 33863, 33864, 33507, 33506]","[Arterial Grafting for Coronary Artery Bypass, Coronary artery bypass, using arterial graft(s), Coronary artery graft placement, Coronary artery graft placement, Coronary artery graft placement, Aortic bifurcation bypass graft, Aortic bifurcation bypass graft, Procedure on coronary arteries, Aortoiliac vascular bypass, Aortoiliac vascular bypass, Aortoiliac vascular bypass, Aortoiliac vascular bypass, Aortoiliac vascular bypass, Arterial bypass graft, Arterial bypass graft, Arterial bypass graft, Arterial bypass graft, Coronary artery reconstruction, Coronary artery reconstruction, Reimplantation of coronary artery, Reimplantation of coronary artery]","[0.0817, 0.0875, 0.0895, 0.0895, 0.0895, 0.0983, 0.0983, 0.1103, 0.1123, 0.1123, 0.1123, 0.1123, 0.1123, 0.1115, 0.1115, 0.1115, 0.1115, 0.1163, 0.1163, 0.1184, 0.1184]"
1,an abdomen CT,71,83,1010526,"[1010526, 76700, 76705, 1031051, 1010521, 74160, 74183, 74181, 74175, 49329, 49999, 50949, 47370, 59898, 50542, 50548, 60650, 38570, 54692, 49325, 38571]","[Computed tomography, abdomen, US abdominal scan, US abdominal scan, Radiologic examination, abdomen, Radiologic examination, abdomen, Computed tomography, abdomen; with contrast material(s), Magnetic resonance imaging of abdomen, Magnetic resonance imaging of abdomen, Computed tomography of abdominal vascular structures, Procedure on abdomen, Procedure on abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen, Endoscopy of abdomen]","[0.0751, 0.1059, 0.1059, 0.1156, 0.1156, 0.1300, 0.1334, 0.1334, 0.1349, 0.1422, 0.1422, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544, 0.1544]"
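
The top distance is a useful signal of match quality. In the tables above, chunks that name a procedure resolve at distances below 0.1, while the drug mentions resolve at larger distances to unrelated entries such as Amikacin for amoxicillin. A minimal sketch of filtering on that signal follows. The 0.10 cut-off is an illustrative assumption rather than a recommended value, and `split_by_distance` is a hypothetical helper.

```python
def split_by_distance(matches, max_distance):
    """Split (chunk, resolution, distance) rows into kept and flagged lists."""
    kept = [row for row in matches if row[2] <= max_distance]
    flagged = [row for row in matches if row[2] > max_distance]
    return kept, flagged

# (chunk, top resolution, top distance) copied from the two result tables
# above; the repeated dapagliflozin row is omitted.
top_matches = [
    ("BMI", "Body Mass Index (BMI), documented (PV)", 0.2327),
    ("amoxicillin", "Amikacin", 0.1113),
    ("metformin", "Methsuximide", 0.2017),
    ("glipizide", "Glutethimide", 0.1760),
    ("dapagliflozin", "Digoxin", 0.1627),
    ("atorvastatin", "Antihypertensive therapy (procedure)", 0.2017),
    ("gemfibrozil", "Adalimumab", 0.2224),
    ("Physical examination", "Physical Examination", 0.0216),
    ("her abdominal examination", "US abdominal scan", 0.1269),
    ("a coronary artery bypass", "Arterial Grafting for Coronary Artery Bypass", 0.0817),
    ("an abdomen CT", "Computed tomography, abdomen", 0.0751),
]

kept, flagged = split_by_distance(top_matches, max_distance=0.10)  # assumed cut-off
print("kept:", [chunk for chunk, _, _ in kept])
print("flagged for review:", [chunk for chunk, _, _ in flagged])
```

A fixed threshold is crude, and a correct match such as BMI is also flagged here, so the flagged list is better treated as a review queue than as a discard pile.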


In [0]:
df['resolutions'][0]

In [0]:
df['resolutions'][1]

## Sentence Entity Resolver - RxNorm

RxNorm is a standardized vocabulary for prescription drugs. It provides a set of codes for clinical drugs, where a clinical drug is the combination of active ingredients, dose form, and strength. For example, the RxNorm code for ciprofloxacin 500 mg 24-hour extended-release tablet (the generic name for Cipro XR 500 mg) is RX10359383, regardless of brand or packaging.

The goal of RxNorm is to allow computer systems to communicate drug-related information efficiently and unambiguously. Produced by the National Library of Medicine (NLM), RxNorm is available for distribution in both Metathesaurus Relation (MR) and Rich Release Format (RRF) tables. Currently there are no RxNorm names available for drugs with more than four active ingredients, those that are sold over the counter (OTC) or those that are international, due to the lack of appropriate information available about such drugs.

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
      .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sbert_embeddings")
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

rxnorm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        rxnorm_resolver])

rxnorm_lp = LightPipeline(rxnorm_pipelineModel)


In [0]:
text = 'metformin 100 mg'

get_codes (rxnorm_lp, text, vocab='rxnorm_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,metformin 100 mg,0,15,406081,"[406081, 576612, 403968, 861024, 404727, 334738, 332848, 861026, 333262, 439563, 450523, 1744000, 484793, 402346, 1726496, 316350, 858858, 336846, 316844, 1946837, 451225, 328507, 437723, 385601, 315677]","[metformin 100 mg/ml, metformin 100 mg/ml [riomet], metformin 100 mg/ml oral solution, metformin hydrochloride 100 mg/ml, metformin 100 mg/ml oral solution [riomet], fenofibrate 100 mg, ciprofibrate 100 mg, metformin hydrochloride 100 mg/ml [riomet], rutin 100 mg, fendiline 100 mg, perazine 100 mg, emtricitabine 100 mg, solifenacin 100 mg, miglustat 100 mg, azacitidine 100 mg, niacin 100 mg, carnosine 100 mg, trimebutine 100 mg, torsemide 100 mg, abemaciclib 100 mg, pyrantel 100 mg, rimantadine 100 mg, azintamide 100 mg, mebeverine 100 mg, cimetidine 100 mg]","[0.0235, 0.0376, 0.0505, 0.0658, 0.0670, 0.0684, 0.0697, 0.0770, 0.0765, 0.0775, 0.0804, 0.0818, 0.0807, 0.0811, 0.0812, 0.0811, 0.0828, 0.0844, 0.0846, 0.0840, 0.0844, 0.0850, 0.0851, 0.0851, 0.0876]"


In [0]:
text = 'aspirin 10 meq/ 5 ml oral sol'

get_codes (rxnorm_lp, text, 'rxnorm_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,aspirin 10 meq/ 5 ml oral sol,0,28,104920,"[104920, 410146, 636574, 729048, 422031, 205247, 251102, 246529, 544468, 755941, 251830, 1117402, 422305, 241082, 245328, 604932, 247061, 604936, 2057152, 206399, 830193, 2057158, 252942, 794260, 428592]","[aspirin 500 mg / papaveretum 10 mg oral solution, cromoglicic acid 10 mg oral capsule, guaifenesin 10 mg/ml oral solution, pseudoephedrine tannate 10 mg/ml oral suspension, silymarin 10 mg/ml oral suspension, niacin 10 mg/ml oral solution, nimesulide 10 mg/ml oral suspension, midodrine 10 mg/ml oral solution, hypromellose 10 mg/ml oral solution, teferrol 10 mg/ml oral solution, mebeverine 10 mg/ml oral solution, guaifenesin 10 mg/ml oral solution [liqufruta], mebeverine 10 mg/ml oral suspension, methadyl acetate 10 mg/ml oral solution, cascara sagrada 10 mg/ml oral solution, ubidecarenone 10 mg/ml oral solution, periciazine 10 mg/ml oral solution, ubidecarenone 10 mg/ml oral solution [liquid co-q10], telmisartan 10 mg/ml oral solution, methadyl acetate 10 mg/ml oral solution [orlaam], opium tincture 10 mg/ml oral solution, telmisartan 10 mg/ml oral solution [semintra], bismuth subsalicylate 10 mg/ml oral suspension, ferrous sulfate 10 mg/ml oral solution, melitracen 10 mg / periciazine 0.5 mg oral tablet]","[0.0678, 0.0918, 0.0927, 0.0940, 0.0932, 0.0960, 0.0970, 0.0980, 0.1010, 0.0989, 0.1004, 0.1028, 0.1008, 0.0995, 0.1004, 0.1024, 0.1044, 0.1028, 0.1033, 0.1042, 0.1035, 0.1053, 0.1086, 0.1078, 0.1094]"


## RxNorm with DrugNormalizer

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk_v0")

drug_normalizer = DrugNormalizer() \
      .setInputCols("ner_chunk_v0") \
      .setOutputCol("ner_chunk") \
      .setPolicy('all')

rxnorm_pipelineModel2 = PipelineModel(
    stages = [
        documentAssembler,
        drug_normalizer,
        sbert_embedder,
        rxnorm_resolver])

rxnorm_lp2 = LightPipeline(rxnorm_pipelineModel2)

In [0]:
text = 'aspirin 10 meq/ 5 ml oral sol'

get_codes (rxnorm_lp2, text, vocab='rxnorm_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,aspirin 2 meq/ml oral solution,0,29,688214,"[688214, 342917, 688213, 904947, 343786, 1192949, 755937, 756028, 104072, 343790, 246229, 755924, 309427, 755920, 756064, 756078, 607610, 756171, 1115998, 248847, 243387, 250944, 577157, 108870, 421803]","[aspirin 2.5 mg/ml oral solution, aspirin 2.2 mg/ml, aspirin 2.5 mg/ml, aspirin 2.7 mg/ml, aspirin 2.75 mg/ml, aspirin 2.71 mg/ml, periciazine 2 mg/ml oral solution, fenspiride 2 mg/ml oral solution, cimetidine 2 mg/ml oral solution, aspirin 2.12 mg/ml, thioridazine 2 mg/ml oral solution, carbetapentane 2 mg/ml oral solution, codeine 2 mg/ml oral solution, butamirate 2 mg/ml oral solution, oxeladin 2 mg/ml oral solution, pipazethate 2 mg/ml oral solution, memantine 2 mg/ml oral solution [namenda], selegiline 2 mg/ml oral solution, ephedrine sulfate 2.2 mg/ml oral solution, homatropine 2 mg/ml oral solution, docusate 2 mg/ml oral suspension, piracetam 2 mg/ml oral solution, memantine 2 mg/ml oral solution, theophylline 2 mg/ml oral solution, codeine 2 mg/ml oral suspension]","[0.0347, 0.0566, 0.0748, 0.0753, 0.0912, 0.0944, 0.0947, 0.1040, 0.1039, 0.1125, 0.1155, 0.1166, 0.1098, 0.1149, 0.1154, 0.1178, 0.1154, 0.1184, 0.1196, 0.1211, 0.1212, 0.1196, 0.1205, 0.1236, 0.1175]"
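
Note how the `chunks` column changed: the input `aspirin 10 meq/ 5 ml oral sol` reached the resolver as `aspirin 2 meq/ml oral solution`, so the nearest neighbours are now aspirin solutions rather than unrelated 10 mg/ml products. The sketch below reproduces that one transformation in plain Python to make the arithmetic explicit. It is an illustration under stated assumptions, not how `DrugNormalizer` is implemented: the regular expression, the abbreviation table, and `normalize_drug_text` are all hypothetical.

```python
import re

# Hypothetical abbreviation table for this sketch only.
ABBREVIATIONS = {"sol": "solution", "susp": "suspension", "tab": "tablet"}

def _format_number(value):
    """Render 2.0 as '2' and 2.5 as '2.5'."""
    return str(int(value)) if value == int(value) else str(round(value, 4))

def normalize_drug_text(text):
    """Reduce 'X unit/ Y ml' to a per-ml strength and expand abbreviations."""
    def per_ml(match):
        amount, unit, volume = float(match.group(1)), match.group(2), float(match.group(3))
        if volume == 0:
            return match.group(0)  # leave malformed strengths untouched
        return f"{_format_number(amount / volume)} {unit}/ml"

    pattern = r"(\d+(?:\.\d+)?)\s*([a-z]+)\s*/\s*(\d+(?:\.\d+)?)\s*ml\b"
    text = re.sub(pattern, per_ml, text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())

print(normalize_drug_text("aspirin 10 meq/ 5 ml oral sol"))
# -> aspirin 2 meq/ml oral solution
```

Strings without a `unit/volume` strength, such as `metformin 100 mg`, pass through unchanged.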


## Sentence Entity Resolver - SNOMED

SNOMED CT is one of a suite of designated standards for use in U.S. Federal Government systems for the electronic exchange of clinical health information and is also a required standard in interoperability specifications of the U.S. Healthcare Information Technology Standards Panel. The clinical terminology is owned and maintained by SNOMED International, a not-for-profit association. 

SNOMED CT:

- Is the most comprehensive and precise, multilingual health terminology in the world.
- Has been, and continues to be, developed collaboratively to ensure it meets the diverse needs and expectations of the worldwide medical profession.
- Assists with the electronic exchange of clinical health information.
- Can be mapped to other coding systems, such as ICD-9 and ICD-10, which helps facilitate semantic interoperability.
- Is accepted as a common global language for health terms in over 50 countries.
- Is a resource with extensive, scientifically validated clinical content.

In [0]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")\
  .setWhiteList(['PROBLEM'])

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") 

bert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\
  .setInputCols(["ner_chunk_doc"])\
  .setOutputCol("bert_embeddings")

snomed_resolution = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("snomed_code")

pipeline_snomed = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    snomed_resolution
  ])


In [0]:
clinical_note = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years '
    'prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior '
    'episode of HTG-induced pancreatitis three years prior to presentation, associated '
    'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '
    'presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. '
    'Two weeks prior to presentation, she was treated with a five-day course of amoxicillin '
    'for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin '
    'for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months '
    'at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; '
    'significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent '
    'laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, '
    'creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) '
    '10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed '
    'as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for '
    'starvation ketosis, as she reported poor oral intake for three days prior to admission. However, '
    'serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap '
    'was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and '
    'lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - '
    'the original sample was centrifuged and the chylomicron layer removed prior to analysis due to '
    'interference from turbidity caused by lipemia again. The patient was treated with an insulin drip '
    'for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within '
    '24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting '
    'of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on '
    '40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg '
    'two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She '
    'had close follow-up with endocrinology post discharge.'
)

data_ner = spark.createDataFrame([[clinical_note]]).toDF("text")

snomed_output = pipeline_snomed.fit(data_ner).transform(data_ner)


In [0]:
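# Note: the "target_text" column holds the ":::"-joined candidate resolutions (all_k_resolutions),
# and the "distance" column holds the resolver's "confidence" metadata value.
df = snomed_output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_code.result","snomed_code.metadata"))\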
                          .alias("snomed_result")) \
                  .select(F.expr("snomed_result['0']").alias("chunk"),
                          F.expr("snomed_result['1'].entity").alias("entity"),
                          F.expr("snomed_result['3'].all_k_resolutions").alias("target_text"),
                          F.expr("snomed_result['2']").alias("snomed_code"),
                          F.expr("snomed_result['3'].confidence").alias("distance")).toPandas()

df

Unnamed: 0,chunk,entity,target_text,snomed_code,distance
0,gestational diabetes mellitus,PROBLEM,"gestational diabetes mellitus:::gestational diabetes mellitus:::gestational diabetes mellitus:::gestational diabetes mellitus:::maternal gestational diabetes mellitus:::maternal diabetes mellitus:::postpartum gestational diabetes mellitus:::pre-existing diabetes mellitus in pregnancy:::gestational diabetes mellitus complicating pregnancy:::pregnancy and insulin-dependent diabetes mellitus:::pre-existing type 2 diabetes mellitus in pregnancy:::gestational diabetes mellitus, class f:::pregnancy and type 2 diabetes mellitus",237629002,0.2475
1,subsequent type two diabetes mellitus,PROBLEM,pre-existing type 2 diabetes mellitus:::disorder associated with type 2 diabetes mellitus:::type 2 diabetes mellitus:::secondary diabetes mellitus:::multiple complications due to type 2 diabetes mellitus:::complication due to diabetes mellitus type 2:::disorder of eye due to type 2 diabetes mellitus:::disorder of nervous system due to type 2 diabetes mellitus:::secondary endocrine diabetes mellitus:::type ii diabetes mellitus:::multiple complications of type 2 diabetes mellitus:::peripheral circulatory disorder due to type 2 diabetes mellitus,199230006,0.1653
2,T2DM,PROBLEM,type 2 diabetes mellitus:::disorder associated with type 2 diabetes mellitus:::pre-existing type 2 diabetes mellitus:::type ii diabetes mellitus:::diabetes mellitus:::diabetes mellitus:::diabetes mellitus:::diabetes mellitus:::secondary diabetes mellitus:::disorder of eye with type 2 diabetes mellitus:::disorder associated with diabetes mellitus:::type 2 diabetes mellitus in obese:::insulin resistance in diabetes:::multiple complications of type 2 diabetes mellitus,44054006,0.1989
3,HTG-induced pancreatitis,PROBLEM,alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::alcohol-induced acute pancreatitis:::hemorrhagic pancreatitis:::alcohol-induced chronic pancreatitis:::alcohol-induced chronic pancreatitis:::pancreatitis:::pancreatitis:::pancreatitis:::traumatic acute pancreatitis (disorder):::drug-induced chronic pancreatitis:::necrotizing pancreatitis:::post-ercp acute pancreatitis:::acute pancreatitis:::acute pancreatitis:::acute pancreatitis:::viral acute pancreatitis (disorder):::apoplectic pancreatitis,445507008,0.1443
4,an acute hepatitis,PROBLEM,acute hepatitis a:::acute hepatitis:::acute infectious hepatitis:::acute hepatitis e:::acute hepatitis e:::acute viral hepatitis:::acute focal hepatitis:::toxic liver disease with acute hepatitis:::acute fulminating type a viral hepatitis:::fulminant hepatitis:::acute hepatitis b:::acute alcoholic hepatitis:::acute alcoholic hepatitis:::acute fulminating viral hepatitis,25102003,0.3444
5,obesity,PROBLEM,obesity:::obesity:::obesity:::abdominal obesity:::generalized obesity:::obese:::central obesity:::morbid obesity:::morbid obesity:::morbid obesity:::obese abdomen:::obese build:::severe obesity:::o/e - obese:::o/e - obese:::o/e - obese:::o/e - obese:::endogenous obesity:::gynoid obesity:::constitutional obesity:::constitutional obesity,414916001,0.3193
6,a body mass index,PROBLEM,finding of body mass index:::mass of body region:::mass of body structure:::a mass:::mass of upper limb:::head and neck mass:::mass of trunk:::mass of head:::finding of body composition:::finding of head circumference:::body positions:::preoccupation with body size:::finding of size of upper limb:::finding of body region:::mass of cardiovascular structure:::mass of abdominal cavity structure:::increased body mass index:::increase in circumference:::mass of skin of head:::finding of proportion of upper limb,301331008,0.6762
7,polyuria,PROBLEM,polyuria:::polyuria:::polyuria:::polyuric state:::hematuria:::hematuria:::hematuria:::micturition frequency and polyuria:::uricosuria:::pollakiuria:::saccharopinuria:::saccharopinuria:::oligouria:::dysuria:::dysuria:::hematuria syndrome:::cystinuria:::cystinuria:::cystinuria:::cyclic proteinuria,28442001,0.333
8,polydipsia,PROBLEM,polydipsia:::polydipsia:::psychogenic polydipsia:::psychogenic polydipsia:::primary polydipsia:::polyhydramnios:::polyhydramnios:::polyhydramnios:::parasomnia:::polyalgia:::polyalgia:::polyphagia:::polyphagia:::biphasic stridor:::oscillopsia:::parasystole (disorder):::dyssomnia:::hypersomnia (disorder):::polyrrhinia,161847005,0.4972
9,poor appetite,PROBLEM,poor appetite:::lack of appetite:::lack of appetite:::poor feeding:::loss of appetite:::loss of appetite:::bad taste in mouth:::bad taste in mouth:::bad taste in mouth:::bad taste in mouth:::decreased sense of taste:::poor fluid intake:::bad breath:::low libido:::diet poor (finding):::diet poor (finding):::diet poor (finding),64379006,0.9835
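
The `target_text` column above packs every candidate resolution into a single `:::`-delimited string, often with repeated terms. The following is a minimal sketch of unpacking it, assuming the delimiter is always `:::` and that the order of terms reflects the resolver's ranking. The helper name `unique_resolutions` is hypothetical and not part of Spark NLP.

```python
def unique_resolutions(all_k_resolutions, sep=":::"):
    """Split a ':::'-joined candidate string and drop repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for term in all_k_resolutions.split(sep):
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            unique.append(term)
    return unique


# Shortened example taken from the 'polyuria' row above
example = "polyuria:::polyuria:::polyuria:::polyuric state:::hematuria:::hematuria"
print(unique_resolutions(example))
```

Applied to the pandas output, this could be used as `df["target_text"].apply(unique_resolutions)` to make the candidate lists easier to read.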


### With SNOMED INT

In [0]:

snomed_resolution_int = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_int", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("snomed_code_int")

pipeline_snomed_int = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    snomed_resolution_int
  ])

snomed_output_int = pipeline_snomed_int.fit(data_ner).transform(data_ner)


In [0]:
snomed_output_int.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_code_int.result","snomed_code_int.metadata"))\
                         .alias("snomed_result")) \
                  .select(F.expr("snomed_result['0']").alias("chunk"),
                          F.expr("snomed_result['1'].entity").alias("entity"),
                          F.expr("snomed_result['3'].all_k_resolutions").alias("target_text"),
                          F.expr("snomed_result['2']").alias("snomed_code"),
                          F.expr("snomed_result['3'].confidence").alias("distance")).show(truncate = 100)


## Sentence Entity Resolver (LOINC)

Logical Observation Identifiers Names and Codes (LOINC) is a database and universal standard for identifying medical laboratory observations. First developed in 1994, it was created and is maintained by the Regenstrief Institute, a US nonprofit medical research organization. LOINC was created in response to the demand for an electronic database for clinical care and management and is publicly available at no cost.

It is endorsed by the American Clinical Laboratory Association. Since its inception, the database has expanded to include not just medical laboratory code names but also nursing diagnosis, nursing interventions, outcomes classification, and patient care data sets.

LOINC applies universal code names and identifiers to medical terminology related to electronic health records. The purpose is to assist in the electronic exchange and gathering of clinical results (such as laboratory tests, clinical observations, outcomes management and research). LOINC has two main parts: laboratory LOINC and clinical LOINC.

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
      .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sbert_embeddings")
    
loinc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc", "en", "clinical/models") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("loinc_code")\
      .setDistanceFunction("EUCLIDEAN")

loinc_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        loinc_resolver])

loinc_lp = LightPipeline(loinc_pipelineModel)
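
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`, while other resolvers in this notebook use `"COSINE"`. As a rough illustration of how the two metrics differ, the sketch below computes the textbook definitions on toy vectors. It does not reproduce how the resolver normalizes or scales distances internally, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length and compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction as the query, twice the length
orthogonal = [0.0, 1.0]       # same length as the query, different direction

print(euclidean_distance(query, same_direction), cosine_distance(query, same_direction))
print(euclidean_distance(query, orthogonal), cosine_distance(query, orthogonal))
```

Euclidean distance penalizes the difference in vector length, whereas cosine distance treats `same_direction` as identical to the query.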

In [0]:
text = 'FLT3 gene mutation analysis'

get_codes(loinc_lp, text, vocab='loinc_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,FLT3 gene mutation analysis,0,26,54447-8,"[54447-8, 47958-4, 53840-5, 21676-2, 75391-3, 38413-1, 58987-9, 49704-0, 46225-9, 82517-4, 53895-9, 21740-6, 34499-4, 41098-5, 49873-3, 71358-6, 47974-1, 40427-7, 41108-2, 81831-0, 22074-9, 35291-4, 38532-8, 59045-5, 41753-5]","[FLT3 gene targeted mutation analysis, FLT3 gene targeted mutation analysis, TGFB3 gene targeted mutation analysis, FGFR3 gene targeted mutation analysis, FGFR3 gene targeted mutation analysis, FGFR3 gene targeted mutation analysis, ITGB3 gene targeted mutation analysis, SH3TC2 gene targeted mutation analysis, ABCA3 gene targeted mutation analysis, LAMA3 gene targeted mutation analysis, SH3BP2 gene targeted mutation analysis, TRAF3 gene targeted mutation analysis, GPC3 gene targeted mutation analysis, GPC3 gene targeted mutation analysis, POU3F4 gene targeted mutation analysis, GJB3 gene targeted mutation analysis, THRB gene targeted mutation analysis, FGF23 gene targeted mutation analysis, FGF23 gene targeted mutation analysis, LAMB3 gene targeted mutation analysis, FGFR3 gene mutations tested for, UBE3A gene targeted mutation analysis, SPG3A gene targeted mutation analysis, ITGB3 gene mutations tested for, FKRP gene targeted mutation analysis]","[0.1369, 0.1369, 0.1516, 0.1571, 0.1571, 0.1571, 0.1594, 0.1641, 0.1635, 0.1635, 0.1692, 0.1651, 0.1701, 0.1701, 0.1716, 0.1739, 0.1719, 0.1797, 0.1797, 0.1771, 0.1810, 0.1744, 0.1803, 0.1845, 0.1804]"
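
Each LOINC code ends in a check digit after the hyphen, so returned codes can be sanity-checked before they are stored. The sketch below follows the Mod 10 scheme as I understand it from the LOINC Users' Guide; treat it as an assumption to verify against the official documentation. The function names are hypothetical.

```python
def loinc_check_digit(number_part):
    """Compute the Mod 10 check digit for the numeric part of a LOINC code."""
    digits = number_part[::-1]      # read the digits from right to left
    odd = digits[0::2]              # 1st, 3rd, 5th, ... digit from the right
    even = digits[1::2]             # 2nd, 4th, ... digit from the right
    doubled = str(int(odd) * 2)     # the odd-position digits, read as one number, times two
    total = sum(int(d) for d in even + doubled)
    return (10 - total % 10) % 10


def is_valid_loinc(code):
    """Return True when the code looks like '<digits>-<check digit>' and the check digit matches."""
    number_part, sep, check = code.partition("-")
    if not sep or not number_part.isdigit() or len(check) != 1 or not check.isdigit():
        return False
    return loinc_check_digit(number_part) == int(check)


# Codes taken from the output above
for code in ["54447-8", "47958-4", "53840-5"]:
    print(code, is_valid_loinc(code))
```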


In [0]:
text = 'Hematocrit'

get_codes(loinc_lp, text, vocab='loinc_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Hematocrit,0,9,11151-8,"[11151-8, 42908-4, 11271-4, 30398-2, 48703-3, 13508-7, 71828-8, 55781-9, 17809-5, 4545-0, 71829-6, 47640-8, 71830-4, 11153-4, 31100-1, 4544-3, 71831-2, 70168-0, 32354-3, 62241-5, 70169-8, 71833-8, 20570-8, 41654-5, 41655-2]","[Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit, Hematocrit]","[0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732, 0.0732]"
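
In the Hematocrit result above, all 25 candidates share the same resolution text and the same distance, so the top code is one of many equally close matches. A minimal sketch for detecting such ties is shown below. It assumes the distances can be converted with `float`, and the helper name `tied_top_candidates` is hypothetical.

```python
def tied_top_candidates(codes, distances, tolerance=1e-9):
    """Return every code whose distance equals the best (smallest) distance."""
    values = [float(d) for d in distances]
    best = min(values)
    return [code for code, dist in zip(codes, values) if abs(dist - best) <= tolerance]


# First three candidates from the Hematocrit row above
codes = ["11151-8", "42908-4", "11271-4"]
distances = [0.0732, 0.0732, 0.0732]

ties = tied_top_candidates(codes, distances)
if len(ties) > 1:
    print(f"{len(ties)} candidates are tied for the best match: {ties}")
```

When several codes are tied, more context (specimen, method, units) is usually needed to choose between them.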


## Sentence Entity Resolver (UMLS)

The Unified Medical Language System (UMLS) is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. It integrates and distributes key terminology, classification and coding standards, and associated resources to promote the creation of more effective and interoperable biomedical information systems and services, including electronic health records.

The Metathesaurus forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, drawn from more than 100 incorporated controlled vocabularies and classification systems. Examples of the incorporated vocabularies are CPT, ICD-10, MeSH, SNOMED CT, DSM-IV, LOINC, WHO Adverse Drug Reaction Terminology, UK Clinical Terms, RxNorm, Gene Ontology, and OMIM.

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
      .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sbert_embeddings")
    
umls_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("umls_code")\
      .setDistanceFunction("EUCLIDEAN")

umls_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        umls_resolver])

umls_lp = LightPipeline(umls_pipelineModel)

In [0]:
# Injuries & poisoning
text = 'food poisoning'

get_codes(umls_lp, text, vocab='umls_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,food poisoning,0,13,C0016479,"[C0016479, C0349783, C0411266, C0178496, C0161615, C0275107, C1272775, C0161721, C0232071, C0749596, C1271087, C0274909, C0679360]","[food poisoning, infectious food poisoning, chemical food poisoning (disorder), food poisoning bacterial, digestant poisoning, poisoning caused by ingestion of insect-infested food (disorder), burns food, toxic effect of noxious substance eaten as food (disorder), food aspiration, ingestion poisoning, suspected food poisoning (situation), toxic effect of food contaminant (disorder), foodborne illness]","[0.0000, 0.0507, 0.0660, 0.0668, 0.0871, 0.0876, 0.0913, 0.0981, 0.0989, 0.1006, 0.1033, 0.1042, 0.1041]"


In [0]:
# clinical findings
text = 'type two diabetes mellitus'

get_codes(umls_lp, text, vocab='umls_code')

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,type two diabetes mellitus,0,25,C4014362,"[C4014362, C3532488, C1320657, C1313937, C0455488, C2733146, C4017629, C3532489, C4538688, C3275844, C0241863, C3280267, C1261139, C0421248, C5195213, C1317301, C3278636, C2675471, C3280359]","[type 2 diabetes mellitus (t2d), history of diabetes mellitus type 2 (situation), diabete type, fh: diabetes mellitus, pre-existing diabetes mellitus, type 2 diabetes mellitus uncontrolled, diabetes mellitus, type ii, digenic, history of diabetes mellitus type i, diabetes mellitus, type 2 (in some heterozygous adults), diabetes mellitus, type ii, susceptibility to, diabetic, type 2 diabetes (in some), diabetic relative, insulin diabetic, increased risk of type 2 diabetes, diabetes status, neonatal insulin-dependent diabetes mellitus, microvascular complications of diabetes, susceptibility to, 2, diabetes mellitus (in some patients)]","[0.0438, 0.0568, 0.0618, 0.0697, 0.0780, 0.0891, 0.0910, 0.0883, 0.0937, 0.0940, 0.0939, 0.0971, 0.1031, 0.1062, 0.1087, 0.1104, 0.1116, 0.1201, 0.1147]"


# ChunkEntityResolver Models

## ICD10CM Resolver

**ICD10 Coding Pipeline Creation**

A common NLP problem in biomedical applications is identifying the presence of clinical entities in a given text. These clinical entities could be diseases, symptoms, drugs, results of clinical investigations, or others.

Besides providing the code in the "result" field, the resolver provides additional metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Confidence of the top match relative to the second match: TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- alternative_confidence_ratios -> Confidence ratios for the remaining candidates
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
- chunk -> ChunkId
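
To make the `confidence_ratio` definition concrete, the sketch below computes TopMatchConfidence / SecondMatchConfidence from a list of candidate confidences. The input values are made up for illustration, the helper name is hypothetical, and the way the resolver turns distances into confidences is not covered here.

```python
def confidence_ratio(confidences):
    """Ratio between the highest and second-highest confidence, or None if undefined."""
    ranked = sorted(confidences, reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return None
    return ranked[0] / ranked[1]


# Illustrative confidences for three candidate codes (not real model output)
print(confidence_ratio([0.25, 0.5, 0.25]))
```

A ratio close to 1 means the top two candidates are nearly indistinguishable, while a larger ratio indicates a clearer winner.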

We will now create a new pipeline that assigns an ICD10 code to each of these problems, based on the content, the word embeddings, and pretrained models for ICD10 annotation.

The architecture of this new pipeline will be as follows:

- DocumentAssembler (text -> document)

- SentenceDetectorDLModel (document -> sentence)

- Tokenizer (sentence -> token)

- WordEmbeddingsModel ([sentence, token] -> embeddings)

- MedicalNerModel ([sentence, token, embeddings] -> ner)

- NerConverterInternal ([sentence, token, ner] -> ner_chunk)

- ChunkEmbeddings ([ner_chunk, embeddings] -> chunk_embeddings)

- ChunkEntityResolverModel ([token, chunk_embeddings] -> icd10cm_code)

So from a text we end up with a list of named entities (ner_chunk) and their ICD10-CM codes (icd10cm_code).

Most of the annotators in this pipeline have already been created for the previous pipeline, but we need to create three additional annotators: NerConverterInternal, ChunkEmbeddings, and a ChunkEntityResolverModel for ICD10CM.

Now we define the new pipeline:

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Named Entity Recognition concepts parser, transforms entities into CHUNKS (required for the next steps: chunk embeddings and entity resolution)

ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")\
  .setWhiteList(['PROBLEM'])\
  .setPreservePosition(False)

chunk_embeddings = ChunkEmbeddings()\
  .setInputCols("ner_chunk", "embeddings")\
  .setOutputCol("chunk_embeddings")

# ICD resolution model
icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") \
  .setInputCols(["token", "chunk_embeddings"]) \
  .setOutputCol("icd10cm_code") \
  .setDistanceFunction("COSINE") \
  .setNeighbours(5)
# .setDistanceFunction("EUCLIDEAN")

`setPreservePosition(True)` keeps exactly the original indices (under some tokenization conditions the chunk might include undesired characters such as `)` or `]`).

`setPreservePosition(False)` uses adjusted indices based on the substring index of the first token (for begin) and the last token (for end).

With `NerConverterInternal` we can also use `greedyMode`, which merges consecutive entities of the same type regardless of B- boundaries.
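
The adjusted-index idea described for `setPreservePosition(False)` can be pictured with plain string operations. The sketch below is only an illustration of locating the first and last token of a chunk inside the sentence. It is not Spark NLP's implementation, and `adjusted_span` is a hypothetical helper.

```python
def adjusted_span(sentence, tokens):
    """Approximate inclusive (begin, end) offsets from the first and last token of a chunk."""
    begin = sentence.index(tokens[0])                 # where the first token starts
    last_start = sentence.index(tokens[-1], begin)    # where the last token starts, searching from begin
    end = last_start + len(tokens[-1]) - 1            # inclusive end offset
    return begin, end


sentence = 'history of "gestational diabetes mellitus" diagnosed'
begin, end = adjusted_span(sentence, ["gestational", "diabetes", "mellitus"])

# The surrounding quotes are excluded from the adjusted span
print(sentence[begin:end + 1])
```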

In [0]:
pipeline_icd10 = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunk_embeddings,
    icd10cm_resolution
  ])

model_icd10 = pipeline_icd10.fit(data_ner)


In [0]:
text = light_data['document'][0]

text

In [0]:
light_pipeline_icd10 = LightPipeline(model_icd10)


In [0]:
import pandas as pd

light_result = light_pipeline_icd10.annotate(text)

df = pd.DataFrame(list(zip(light_result['ner_chunk'], light_result['icd10cm_code'])),
                  columns = ['Problem','ICD10-CM-Code'])

In [0]:
df.head()

Unnamed: 0,Problem,ICD10-CM-Code
0,gestational diabetes mellitus,P702
1,type two diabetes mellitus,E1142
2,T2DM,E1121
3,one prior episode of HTG-induced pancreatitis,K860
4,associated with an acute hepatitis,B172


In [0]:
df = get_codes(light_pipeline_icd10, text, 'icd10cm_code')

df

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,gestational diabetes mellitus,39,67,P702,"[P702, E1165, E232]","[Neonatal diabetes mellitus, Type 2 diabetes mellitus with hyperglycemia, Diabetes insipidus]","[0.0796, 0.1034, 0.1886]"
1,type two diabetes mellitus,128,153,E1142,"[E1142, P702, E232]","[Type 2 diabetes mellitus with diabetic polyneuropathy, Neonatal diabetes mellitus, Diabetes insipidus]","[0.0509, 0.1148, 0.1855]"
2,T2DM,156,159,E1121,[E1121],[Type 2 diabetes mellitus with diabetic nephropathy],[0.2151]
3,one prior episode of HTG-induced pancreatitis,163,207,K860,"[K860, F309, F310, K810]","[Alcohol-induced chronic pancreatitis, Manic episode, unspecified, Bipolar disorder, current episode hypomanic, Acute cholecystitis]","[0.2399, 0.2462, 0.2369, 0.2629]"
4,associated with an acute hepatitis,244,277,B172,"[B172, B179, K739, B189, K754]","[Acute hepatitis E, Acute viral hepatitis, unspecified, Chronic hepatitis, unspecified, Chronic viral hepatitis, unspecified, Autoimmune hepatitis]","[0.1159, 0.1360, 0.1674, 0.1507, 0.2232]"
5,obesity with a body mass,284,307,E661,"[E661, Z6826, G3183, Z681]","[Drug-induced obesity, Body mass index (BMI) 26.0-26.9, adult, Dementia with Lewy bodies, Body mass index (BMI) 19.9 or less, adult]","[0.2370, 0.1790, 0.2363, 0.1421]"
6,polyuria,373,380,R358,"[R358, R631, R601, R1114, E2681]","[Other polyuria, Polydipsia, Generalized edema, Bilious vomiting, Bartter's syndrome]","[0.2193, 0.2170, 0.3266, 0.3308, 0.3688]"
7,polydipsia,383,392,R631,"[R631, R4584, O926]","[Polydipsia, Anhedonia, Galactorrhea]","[0.0000, 0.3852, 0.4084]"
8,poor appetite,395,407,R630,"[R630, E639, R1310, R110, R6882]","[Anorexia, Nutritional deficiency, unspecified, Dysphagia, unspecified, Nausea, Decreased libido]","[0.2084, 0.2460, 0.2612, 0.4441, 0.4495]"
9,vomiting,414,421,R1114,"[R1114, R1111, R110]","[Bilious vomiting, Vomiting without nausea, Nausea]","[0.0771, 0.0681, 0.0854]"
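
Note that the `all_distances` lists above are not always in ascending order (for example, in row 3 the value 0.2369 comes after 0.2399), which suggests the candidate order may reflect more than the raw distance. To inspect candidates strictly by distance, a small helper such as the hypothetical `rank_by_distance` below could be used. It assumes the three lists are aligned by position.

```python
def rank_by_distance(codes, resolutions, distances):
    """Return (code, resolution, distance) tuples sorted from closest to farthest."""
    rows = zip(codes, resolutions, (float(d) for d in distances))
    return sorted(rows, key=lambda row: row[2])


# Candidates for 'one prior episode of HTG-induced pancreatitis' from the table above
codes = ["K860", "F309", "F310", "K810"]
resolutions = [
    "Alcohol-induced chronic pancreatitis",
    "Manic episode, unspecified",
    "Bipolar disorder, current episode hypomanic",
    "Acute cholecystitis",
]
distances = [0.2399, 0.2462, 0.2369, 0.2629]

for code, resolution, distance in rank_by_distance(codes, resolutions, distances):
    print(f"{distance:.4f}  {code}  {resolution}")
```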


In [0]:
import pyspark.sql.functions as F

output = model_icd10.transform(data_ner).cache()

output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata",
                                     "icd10cm_code.result","icd10cm_code.metadata")).alias("icd10cm_result")) \
      .select(F.expr("icd10cm_result['0']").alias("chunk"),
              F.expr("icd10cm_result['1'].entity").alias("entity"),
              F.expr("icd10cm_result['3'].resolved_text").alias("resolved_text"),
              F.expr("icd10cm_result['2']").alias("code"),
              F.expr("icd10cm_result['3'].all_k_resolutions").alias("cms"))\
      .distinct() \
      .toPandas()


Unnamed: 0,chunk,entity,resolved_text,code,cms
0,type two diabetes mellitus,PROBLEM,Type 2 diabetes mellitus with diabetic polyneuropathy,E1142,Type 2 diabetes mellitus with diabetic polyneuropathy:::Neonatal diabetes mellitus:::Diabetes insipidus
1,lipemia,PROBLEM,Glycosuria,R81,Glycosuria:::Pure hyperglyceridemia:::Hyperchylomicronemia
2,precipitated by her respiratory tract infection,PROBLEM,"Respiratory disorder, unspecified",J989,"Respiratory disorder, unspecified:::Acute nasopharyngitis [common cold]:::Diseases of the respiratory system complicating childbirth"
3,initially admitted for starvation ketosis,PROBLEM,Type 2 diabetes mellitus with ketoacidosis with coma,E1111,Type 2 diabetes mellitus with ketoacidosis with coma:::Type 2 diabetes mellitus with hyperosmolarity with coma:::Propionic acidemia:::Bartter's syndrome
4,gestational diabetes mellitus,PROBLEM,Neonatal diabetes mellitus,P702,Neonatal diabetes mellitus:::Type 2 diabetes mellitus with hyperglycemia:::Diabetes insipidus
5,vomiting,PROBLEM,Bilious vomiting,R1114,Bilious vomiting:::Vomiting without nausea:::Nausea
6,polydipsia,PROBLEM,Polydipsia,R631,Polydipsia:::Anhedonia:::Galactorrhea
7,elevated,PROBLEM,Elevated Lipoprotein(a),E7841,"Elevated Lipoprotein(a):::Elevated erythrocyte sedimentation rate:::Hyperaldosteronism, unspecified:::Hyperkalemia"
8,obesity with a body mass,PROBLEM,Drug-induced obesity,E661,"Drug-induced obesity:::Body mass index (BMI) 26.0-26.9, adult:::Dementia with Lewy bodies:::Body mass index (BMI) 19.9 or less, adult"
9,significant for dry oral mucosa,PROBLEM,Irritative hyperplasia of oral mucosa,K136,"Irritative hyperplasia of oral mucosa:::Leukoplakia of oral mucosa, including tongue:::Oral submucous fibrosis:::Dysphagia, oral phase"


In [0]:
text = 'He has a starvation ketosis but nothing found for significant for dry oral mucosa'


In [0]:
df = get_codes(light_pipeline_icd10, text, 'icd10cm_code')

df

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,starvation ketosis,9,26,E71121,"[E71121, E2681, E8351, E240, E873]","[Propionic acidemia, Bartter's syndrome, Hypocalcemia, Pituitary-dependent Cushing's disease, Alkalosis]","[0.3089, 0.3760, 0.3961, 0.3978, 0.3933]"
1,significant for dry oral mucosa,50,80,K136,"[K136, K1321, K135, R1311]","[Irritative hyperplasia of oral mucosa, Leukoplakia of oral mucosa, including tongue, Oral submucous fibrosis, Dysphagia, oral phase]","[0.1760, 0.1708, 0.2918, 0.2326]"


## Resolved Entities Visualization

In [0]:
sample_text = 'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.'

full_light_result = light_pipeline_icd10.fullAnnotate(sample_text)


In [0]:
from sparknlp_display import EntityResolverVisualizer

visualizer = EntityResolverVisualizer()

# Change color of an entity label
#visualizer.set_label_colors({'PROBLEM':'#008080'})

vis = visualizer.display(full_light_result[0], 'ner_chunk', 'icd10cm_code', return_html=True)

displayHTML(vis)

## CPT Resolver

The Current Procedural Terminology (CPT) code set is a medical code set maintained by the American Medical Association. The CPT code set describes medical, surgical, and diagnostic services and is designed to communicate uniform information about medical services and procedures among physicians, coders, patients, accreditation organizations, and payers for administrative, financial, and analytical purposes.

In [0]:
ner_converter_cpt = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['TREATMENT','TEST'])\
    .setPreservePosition(False)

cpt_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical", "en", "clinical/models") \
    .setInputCols(["token", "chunk_embeddings"]) \
    .setOutputCol("cpt_code") \
    .setDistanceFunction("COSINE") \
    .setNeighbours(5)

pipeline_cpt = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter_cpt,
    chunk_embeddings,
    cpt_resolution
  ])

model_cpt = pipeline_cpt.fit(data_ner)


In [0]:
light_model_cpt = LightPipeline(model_cpt)

text = 'The patient needs to have a coronary artery bypass but doctor suggests a abdomen CT at first.'

df = get_codes(light_model_cpt, text, 'cpt_code')

df

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,coronary artery bypass,28,49,33535,"[33535, 33534]","[Coronary artery bypass, using arterial graft(s); 3coronary arterial grafts, Coronary artery bypass, using arterial graft(s); 2 coronary arterial grafts]","[0.0668, 0.0551]"
1,abdomen CT,73,82,44970,"[44970, 61751]","[Laparoscopy, surgical, appendectomy , Stereotactic biopsy, aspiration, or excision, including burr hole(s), for intracranial lesion; with computed tomography and/or magnetic resonance guidance]","[0.3754, 0.2620]"
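
In the output above, "abdomen CT" resolves to a laparoscopic appendectomy code with a much larger distance than "coronary artery bypass". One simple safeguard is to route distant matches to manual review. The sketch below assumes a cutoff of 0.2, which is an arbitrary illustrative value rather than a recommended setting, and `flag_for_review` is a hypothetical helper.

```python
def flag_for_review(rows, max_distance=0.2):
    """Return the chunks whose top-match distance exceeds the cutoff."""
    return [chunk for chunk, code, distance in rows if float(distance) > max_distance]


# (chunk, top code, top distance) taken from the table above
rows = [
    ("coronary artery bypass", "33535", 0.0668),
    ("abdomen CT", "44970", 0.3754),
]

print(flag_for_review(rows))
```

A suitable cutoff depends on the resolver model and the distance function, so it should be tuned on labeled examples.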


## RxNorm Resolver

In [0]:
posology_ner = MedicalNerModel.pretrained("ner_drugs_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setPreservePosition(False)

rxnorm_resolver = ChunkEntityResolverModel.pretrained('chunkresolve_rxnorm_sbd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols('token', 'chunk_embeddings')\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")

pipeline_rxnorm = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_embeddings,
    rxnorm_resolver
  ])


model_rxnorm = pipeline_rxnorm.fit(data_ner)

In [0]:
output = model_rxnorm.transform(data_ner)

result_df = output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata"))\
                          .alias("rxnorm_result")) \
                  .select(F.expr("rxnorm_result['0']").alias("chunk"),
                          F.expr("rxnorm_result['1'].entity").alias("entity"),
                          F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
                          F.expr("rxnorm_result['2']").alias("code"),
                          F.expr("rxnorm_result['3'].confidence").alias("confidence"))

result_df.show(truncate = 100)

In [0]:
df = result_df.toPandas()
df                 

Unnamed: 0,chunk,entity,target_text,code,confidence
0,amoxicillin,DRUG,Amoxicillin 50 MG Oral Tablet [Biomox]:::Amoxicillin 500 MG Oral Capsule [Amix]:::Amoxicillin 250 MG Oral Capsule [Amix]:::Amoxicillin 500 MG Oral Capsule [Wymox]:::Amoxicillin 500 MG Oral Capsule [Sumox],791949,0.217
1,metformin,DRUG,Metformin hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Tablet [Glucophage]:::Metformin hydrochloride 625 MG Oral Tablet [Glucophage]:::Metformin hydrochloride 850 MG Oral Tablet [Metforming],105376,0.2067
2,glipizide,DRUG,Glipizide 5 MG Oral Tablet [Minidiab]:::Glipizide 5 MG Oral Tablet [Glucotrol]:::Glipizide 5 MG Oral Tablet [Glibenese]:::Glipizide 10 MG Oral Tablet [Glucotrol]:::Glipizide 2.5 MG Oral Tablet [Minidiab],105373,0.2224
3,dapagliflozin for T2DM,DRUG,dapagliflozin 5 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 10 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 5 MG / Metformin hydrochloride 500 MG Extended Release Oral Tablet [Xigduo]:::dapagliflozin 5 MG / Metformin hydrochloride 1000 MG Extended Release Oral Tablet [Xigduo]:::dapagliflozin 10 MG / Metformin hydrochloride 500 MG Extended Release Oral Tablet [Xigduo],2169276,0.2532
4,atorvastatin and gemfibrozil,DRUG,atorvastatin 10 MG Oral Tablet [Lipitor]:::atorvastatin 20 MG Oral Tablet [Lipitor]:::atorvastatin 40 MG Oral Tablet [Lipitor]:::atorvastatin 80 MG Oral Tablet [Lipitor]:::Amlodipine 5 MG / atorvastatin 10 MG Oral Tablet [Caduet],617314,0.2166
5,dapagliflozin,DRUG,dapagliflozin 5 MG Oral Tablet [Farxiga]:::dapagliflozin 10 MG Oral Tablet [Farxiga]:::dapagliflozin 5 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 10 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 5 MG / Metformin hydrochloride 500 MG Extended Release Oral Tablet [Xigduo],1486981,0.3523
6,insulin drip,DRUG,Insulin Lispro 100 UNT/ML Injectable Solution [Humalog]:::Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::3 ML Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::Insulin Lispro 100 UNT/ML Injectable Solution [Admelog]:::insulin degludec 100 UNT/ML Injectable Solution [Tresiba],865098,0.225
7,SGLT2 inhibitor,DRUG,"C1 esterase inhibitor (human) 500 UNT Injection [Cinryze]:::alpha 1-proteinase inhibitor, human 1 MG Injection [Zemaira]:::alpha 1-proteinase inhibitor, human 1 MG Injection [Aralast]:::alpha 1-proteinase inhibitor, human 1 MG Injection [Glassia]:::C1 esterase inhibitor (human) 500 UNT Injection [Berinert]",809871,0.2044
8,insulin glargine,DRUG,Insulin Glargine 100 UNT/ML Pen Injector [Lantus]:::Insulin Glargine 300 UNT/ML Pen Injector [Toujeo]:::Insulin Glargine 100 UNT/ML Pen Injector [Basaglar]:::3 ML Insulin Glargine 100 UNT/ML Pen Injector [Lantus]:::3 ML Insulin Glargine 300 UNT/ML Pen Injector [Toujeo],1359856,0.2265
9,insulin lispro,DRUG,Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::Insulin Lispro 200 UNT/ML Pen Injector [Humalog]:::Insulin Lispro 100 UNT/ML Pen Injector [Admelog]:::Insulin Lispro 100 UNT/ML Pen Injector [Humalog]:::3 ML Insulin Lispro 100 UNT/ML Cartridge [Humalog],1652648,0.234
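
The `target_text` column packs the top candidates into one string separated by `:::`, and each candidate in this output ends with a bracketed brand name. A small sketch, assuming that layout holds for every candidate, that splits the string and separates the drug description from the brand (`parse_candidates` is a hypothetical helper, not a library function):

```python
import re

# Matches "<description> [<brand>]"; assumes the brand is the final bracketed part.
_CANDIDATE = re.compile(r"^(?P<desc>.*?)\s*\[(?P<brand>[^\]]+)\]$")

def parse_candidates(target_text, sep=":::"):
    """Split a ':::'-delimited candidate string into (description, brand) pairs."""
    pairs = []
    for candidate in target_text.split(sep):
        candidate = candidate.strip()
        match = _CANDIDATE.match(candidate)
        if match:
            pairs.append((match.group("desc"), match.group("brand")))
        else:
            # No bracketed brand found: keep the text and leave the brand empty.
            pairs.append((candidate, None))
    return pairs

row = ("Amoxicillin 50 MG Oral Tablet [Biomox]:::"
       "Amoxicillin 500 MG Oral Capsule [Amix]")
candidates = parse_candidates(row)
print(candidates)
```

Applied to the pandas frame, this could be mapped over the `target_text` column to compare strengths or dose forms across candidates.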


## SNOMED Resolver

In [0]:
snomed_ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("greedy_chunk")\
  .setWhiteList(['PROBLEM','TEST'])

chunk_embeddings = ChunkEmbeddings()\
  .setInputCols('greedy_chunk', 'embeddings')\
  .setOutputCol('chunk_embeddings')

snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")\
    .setInputCols("token","chunk_embeddings")\
    .setOutputCol("snomed_resolution")


pipeline_snomed = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    snomed_ner_converter,
    chunk_embeddings,
    snomed_resolver
  ])

model_snomed = pipeline_snomed.fit(data_ner)


In [0]:
snomed_output = model_snomed.transform(data_ner)

snomed_output.write.mode("overwrite").save("/snomed_temp")

snomed_output = spark.read.load("/snomed_temp")

In [0]:
snomed_output.select(F.explode(F.arrays_zip("greedy_chunk.result","greedy_chunk.metadata","snomed_resolution.result","snomed_resolution.metadata"))\
                     .alias("snomed_result")) \
              .select(F.expr("snomed_result['0']").alias("chunk"),
                      F.expr("snomed_result['1'].entity").alias("entity"),
                      F.expr("snomed_result['3'].all_k_resolutions").alias("target_text"),
                      F.expr("snomed_result['2']").alias("code"),
                      F.expr("snomed_result['3'].confidence").alias("confidence")).show(truncate = 50)

## ICD10 + RxNorm with Multiple NERs

In [0]:
notes = [
'Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !',
'Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .',
'Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control .',
'The patient\'s incisions sternal and right leg were clean and healing well , normal sinus rhythm at 70-80 , with blood pressure 98-110/60 and patient was doing well , recovering , ambulating , tolerating regular diet and last hematocrit prior to discharge was 39% with a BUN and creatinine of 15 and 1.0 , prothrombin time level of 13.8 , chest X-ray prior to discharge showed small bilateral effusions with mild cardiomegaly and subsegmental atelectasis bibasilar and electrocardiogram showed normal sinus rhythm with left atrial enlargement and no acute ischemic changes on electrocardiogram .',
'This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .',
'O2 95% on 3L NC mixed Quinn 82% genrl : in nad , resting comfortably heent : perrla ( 4->3 mm ) bilaterally , blind in right visual field , eomi , dry mm , ? thrush neck : no bruits cv : rrr , no m/r/g , faint s1/s2 pulm : cta bilaterally abd : midline scar ( from urostomy ) , nabs , soft , appears distended but patient denies , ostomy RLQ c/d/i , NT to palpation back : right flank urostomy tube , c/d/i , nt to palpation extr : no Gardner neuro : a , ox3 , wiggles toes bilaterally , unable to lift LE , 06-12 grip bilaterally w/ UE , decrease sensation to soft touch in left',
'Is notable for an inferior myocardial infarction , restrictive and obstructive lung disease with an FEV1 of . 9 and FVC of 1.34 and a moderate at best response to bronchodilators , and a negative sestamibi scan in May , 1999 apart from a severe fixed inferolateral defect , systolic dysfunction with recent echocardiography revealing an LVID of 62 mm . and ejection fraction of 28 percent , moderate mitral regurgitation and mild-to-moderate aortic stenosis with a peak gradient of 33 and a mean gradient of 19 and a valve area of 1.4 cm . squared .',
'This is a 47 - year-old male with a past medical history of type 2 diabetes , high cholesterol , hypertension , and coronary artery disease , status post percutaneous transluminal coronary angioplasty times two , who presented with acute coronary syndrome refractory to medical treatment and TNK , now status post Angio-Jet percutaneous transluminal coronary angioplasty and stent of proximal left anterior descending artery and percutaneous transluminal coronary angioplasty of first diagonal with intra-aortic balloon pump placement .',
'Clinical progression of skin and sinus infection on maximal antimicrobial therapy continued , with emergence on November 20 of a new right-sided ptosis in association with a left homonymous hemianopsia , and fleeting confusion while febrile , prompting head MRI which revealed a large 5 x 2 x 4.3 cm region in the right occipital lobe of hemorrhage and edema , with dural and , likely , leptomeningeal enhancement in association with small foci in the right cerebellum and pons , concerning for early lesions of similar type .',
'The patient had an echocardiogram on day two of admission , which revealed a mildly dilated left atrium , mild symmetric LVH , normal LV cavity size , mild region LV systolic dysfunction , arresting regional wall motion abnormality including focal apical hypokinesis , a normal right ventricular chamber size and free wall motion , a moderately dilated aortic root , a mildly dilated ascending aorta , normal aortic valve leaflet , normal mitral valve leaflet and no pericardial effusions .',
'The patient is a 65-year-old man with refractory CLL , status post non-myeloblative stem cell transplant approximately nine months prior to admission , and status post prolonged recent Retelk County Medical Center stay for Acanthamoeba infection of skin and sinuses , complicated by ARS due to medication toxicity , as well as GVHD and recent CMV infection , readmitted for new fever , increasing creatinine , hepatomegaly and fluid surge spacing , in the setting of hyponatremia .',
'Tylenol 650 mg p.o . q.4h . p.r.n . , Benadryl 25 mg p.o . q.h.s . p.r.n . , Colace 100 mg p.o . q.i.d . , Nortriptyline 25 mg p.o . q.h.s . , Simvastatin 10 mg p.o . q.h.s . , Metamucil one packet p.o . b.i.d . p.r.n . , Neurontin 300 mg p.o . t.i.d . , Levsinex 0.375 mg p.o . q.12h . , Lisinopril / hydrochlorothiazide 20/25 mg p.o . q.d . , hydrocortisone topical ointment to affected areas , MS Contin 30 mg p.o . b.i.d . , MSIR 15 to 30 mg p.o . q.4h . p.r.n . pain .',
'Aspirin 325 q.d . ; albuterol nebs 2.5 mg q . 4h ; Colace 100 mg b.i.d . ; heparin 5,000 units subcu b.i.d . ; Synthroid 200 mcg q.d . ; Ocean Spray 2 sprays q . i.d . ; simvastatin 10 mg q . h.s . ; Flovent 220 mcg 2 puffs b.i.d . ; Zantac 150 b.i.d . ; nystatin ointment to the gluteal fold b.i.d . ; Lisinopril 20 mg q.d . ; Mestinon controlled release 180 q . h.s . ; Mestinon 30 mg q . 4h while awake ; prednisone 60 mg p.o . q . IM ; Atrovent nebs 0.5 mg q . i.d .',
'An echocardiogram was obtained on 4-26 which showed concentric left ventricular hypertrophy with normal _____ left ventricular function , severe right ventricular dilatation with septal hypokinesis and flattening with a question of right ventricular apical clot raised with mild aortic stenosis , severe tricuspid regurgitation and increased pulmonary artery pressure of approximately 70 millimeters , consistent with fairly severe pulmonary hypertension .',
'1 ) CV ( R ) finished amio IV load then started on po , agressive lytes ; although interrogation showed >100 episodes of VT ( as / x ) , pt prefers med therapy as opposed to ablation ( I ) enzymes mildly elevated but not actively ischemic ; lipids , ASA , statin , BB ; Adenosine thal 1/4 and echo 1/4 to look for signs of ischemia as active cause for VT ( P ) JVP at angle of jaw 1/4 -- > giving 20 Lasix ; dig level 1/4 1.3 -- > 1/2 dose as on Amio',
'sodium 141 , potassium 3.5 , chloride 107 , bicarbonates 23.8 , BUN 23 , creatinine 1.1 , glucose 165 , PO2 377 , PCO2 32 , PH 7.50 , asomus 298 , toxic screen negative , white blood cell count 11.1 , hematocrit 39.6 , platelet count 137 , prothrombin time 25.2 , INR 4.3 , partial thromboplastin time 34.7 , urinalysis 1+ albumin , 0-5 high link caths , cervical spine negative , pelvis negative , lumbar spine ; negative , thoracic spine negative .',
]

In [0]:
data = spark.createDataFrame([(i,n.lower()) for i,n in enumerate(notes)]).toDF('doc_id', 'text')

data.show(truncate=50)

In [0]:
from IPython.core.display import display, HTML

html_output=""
for i, d in enumerate(notes):
    html_output += f'Note {i}:'
    html_output +='<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px">'
    html_output += d
    html_output += '</div><br/>'

# displayHTML is a Databricks notebook built-in; elsewhere, use display(HTML(html_output))
displayHTML(html_output)

Let's build a Spark NLP pipeline with the following stages:

`DocumentAssembler`: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework.

`SentenceDetector`: Annotator that pragmatically separates complete sentences inside each document.

`Tokenizer`: Annotator that separates sentences into tokens (generally words).

`StopWordsCleaner`: Annotator that removes words defined as stop words in Spark ML.

`WordEmbeddings`: Vectorization of word tokens, in this case using word embeddings trained from PubMed, ICD10 and other clinical resources.

`ChunkEmbeddings`: Aggregates the WordEmbeddings for each NER chunk.

`JSL NER + NerConverter`: These annotators return chunks found by the generic `ner_jsl` model.

`Drug NER + NerConverter`: These annotators return chunks related to drugs.

`ChunkEntityResolver`: Annotator that performs the KNN search, in this case with resolver models for ICD-10-CM and RxNorm.
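
To make the KNN step concrete, here is a toy sketch of nearest-neighbor lookup with cosine distance in plain Python. The three-dimensional vectors are invented for illustration and the function names are hypothetical; the codes are ICD-10-CM codes that appear in the results later in this section, paired here with made-up coordinates. The pretrained resolvers use their own index and additional distance components, so this shows only the general idea.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def k_nearest(query, index, k):
    """Return the k (code, distance) pairs closest to the query vector."""
    scored = [(code, cosine_distance(query, vec)) for code, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy index: code -> made-up embedding (NOT real model vectors).
toy_index = {
    "E871": [0.9, 0.1, 0.0],
    "E876": [0.7, 0.3, 0.1],
    "K2970": [0.0, 0.2, 0.9],
}

neighbours = k_nearest([1.0, 0.1, 0.0], toy_index, k=2)
print(neighbours)
```

The first element of the sorted list plays the role of the resolved code, and the remaining ones correspond to the alternatives.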

In [0]:
# NER annotators: ner_jsl (generic clinical entities) and ner_drugs (drug entities)

jslNer = MedicalNerModel.pretrained('ner_jsl', 'en', "clinical/models")\
    .setInputCols('sentence', 'token', 'embeddings')\
    .setOutputCol('ner_jsl')

drugNer = MedicalNerModel.pretrained('ner_drugs', 'en', "clinical/models")\
    .setInputCols('sentence', 'token', 'embeddings')\
    .setOutputCol('ner_drug')

In [0]:

# Converter annotators transform IOB tags into full chunks (token sequences) tagged with `entity` metadata

jslConverter = NerConverter()\
    .setInputCols('sentence', 'token', 'ner_jsl')\
    .setOutputCol('chunk_jsl')\
    .setWhiteList(["Disease_Syndrome_Disorder"])

drugConverter = NerConverter()\
    .setInputCols('sentence', 'token', 'ner_drug')\
    .setOutputCol('chunk_drug')

In [0]:

#ChunkEmbeddings annotators aggregate embeddings for each token in the chunk

jslChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols('chunk_jsl', 'embeddings')\
  .setOutputCol('chunk_embs_jsl')

drugChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols('chunk_drug', 'embeddings')\
  .setOutputCol('chunk_embs_drug')
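
As a rough picture of what this aggregation does, the sketch below pools per-token vectors into one chunk vector by averaging or summing. The vectors are invented, `pool_chunk` is a hypothetical helper, and the annotator's actual behavior depends on its configured pooling strategy.

```python
def pool_chunk(token_vectors, strategy="AVERAGE"):
    """Combine equal-length token vectors into a single chunk vector."""
    if not token_vectors:
        raise ValueError("a chunk needs at least one token vector")
    # Sum each dimension across all tokens of the chunk.
    sums = [sum(dims) for dims in zip(*token_vectors)]
    if strategy == "SUM":
        return sums
    if strategy == "AVERAGE":
        return [s / len(token_vectors) for s in sums]
    raise ValueError(f"unknown strategy: {strategy}")

# Made-up 3-dimensional embeddings for the two tokens of one chunk.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
avg_vec = pool_chunk(tokens)
sum_vec = pool_chunk(tokens, "SUM")
print(avg_vec, sum_vec)
```

Averaging keeps chunk vectors on a comparable scale regardless of chunk length, which matters when they are later compared by distance.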

In [0]:
# Entity Resolution Pretrained Models

icd10cmResolver2 = ChunkEntityResolverModel.pretrained('chunkresolve_icd10cm_diseases_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200)\
    .setAlternatives(5)\
    .setDistanceWeights([3,3,2,0,0,7])\
    .setInputCols('token', 'chunk_embs_jsl')\
    .setOutputCol('icd10cm_resolution')

rxnormResolver2 = ChunkEntityResolverModel.pretrained('chunkresolve_rxnorm_scd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200)\
    .setAlternatives(5)\
    .setDistanceWeights([3,3,2,0,0,7])\
    .setInputCols('token', 'chunk_embs_drug')\
    .setOutputCol('rxnorm_resolution')

In [0]:
# Tokenizer splits each sentence into tokens

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

# StopWordsCleaner removes stop words from the raw tokens

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")


pipelineFull = Pipeline().setStages([
    documentAssembler, 
    sentenceDetectorDL, 
    tokenizer, 
    stopwords, 
    word_embeddings, 
    jslNer,
    drugNer,
    jslConverter,
    drugConverter,
    jslChunkEmbeddings, 
    drugChunkEmbeddings,
    icd10cmResolver2,
    rxnormResolver2
])

In [0]:
# Fit and transform; the output is persisted to disk in the next cell to keep DAG size and resource usage low (word embeddings are resource intensive)
pipelineModelFull = pipelineFull.fit(data)

output = pipelineModelFull.transform(data)


In [0]:
output.write.mode("overwrite").save("/temp")

output = spark.read.load("/temp")

In [0]:
output.show()

In [0]:
# Let's see what would have happened if we hadn't persisted the output to disk.
output = pipelineModelFull.transform(data)

In [0]:
output.show()
# Persisted output: 1.0 s vs. recomputed output: 18.39 s for the first 20 rows (roughly 18x faster)
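
Timings like these can be collected with a small helper. The sketch below is generic Python: `timed` is a hypothetical name, and the `sum(range(...))` call is only a stand-in for a Spark action such as `output.show()`.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could be timed(output.show).
result, seconds = timed(sum, range(1000))
print(f"{seconds:.4f} s")

# Speedup implied by the two timings quoted above.
speedup = 18.39 / 1.0
print(f"{speedup:.2f}x")
```

Because Spark evaluates lazily, only actions (`show`, `count`, `toPandas`) are worth timing; a bare `transform` call returns almost immediately.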

In [0]:
def quick_metadata_analysis(df, doc_field, chunk_field, code_fields):
    code_res_meta = ", ".join([f"{cf}.metadata" for cf in code_fields])
    expression = f"explode(arrays_zip({chunk_field}.begin, {chunk_field}.end, {chunk_field}.result, {chunk_field}.metadata, "+code_res_meta+")) as a"
    top_n_rest = [(f"float(a['{i+4}'].confidence) as {(cf.split('_')[0])}_conf",
                    f"arrays_zip(split(a['{i+4}'].all_k_results,':::'),split(a['{i+4}'].all_k_resolutions,':::')) as {cf.split('_')[0]+'_opts'}")
                    for i, cf in enumerate(code_fields)]
    top_n_rest_args = []
    for tr in top_n_rest:
        for t in tr:
            top_n_rest_args.append(t)
    return df.selectExpr(doc_field, expression) \
        .orderBy(doc_field, F.expr("a['0']"), F.expr("a['1']"))\
        .selectExpr(f"concat_ws('::',{doc_field},a['0'],a['1']) as coords", "a['2'] as chunk","a['3'].entity as entity", *top_n_rest_args)
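
The `*_opts` column built above pairs each candidate code with its description by splitting two `:::`-delimited metadata strings and zipping them. A plain-Python equivalent of that one expression, using two candidates taken from the ICD-10-CM output further down (the function name `zip_options` is illustrative):

```python
def zip_options(all_k_results, all_k_resolutions, sep=":::"):
    """Pair each candidate code with its description, position by position."""
    codes = all_k_results.split(sep)
    names = all_k_resolutions.split(sep)
    return list(zip(codes, names))

options = zip_options(
    "E871:::E870",
    "Hypo-osmolality and hyponatremia:::Hyperosmolality and hypernatremia",
)
print(options)
```

One difference to keep in mind: Python's `zip` stops at the shorter list, whereas Spark's `arrays_zip` pads the shorter array with nulls.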

In [0]:
icd10cm_analysis = quick_metadata_analysis(output, 'doc_id', 'chunk_jsl',['icd10cm_resolution']).toPandas()

In [0]:
rxnorm_analysis = quick_metadata_analysis(output, 'doc_id', 'chunk_drug',['rxnorm_resolution']).toPandas()

In [0]:
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_rows', 500)

In [0]:
icd10cm_analysis[icd10cm_analysis.icd10cm_conf>0.4]

Unnamed: 0,coords,chunk,entity,icd10cm_conf,icd10cm_opts
1,4::120::128,gastritis,Disease_Syndrome_Disorder,0.468,"[(K2970, Gastritis, unspecified, without bleeding), (B9681, Helicobacter pylori [H. pylori] as the cause of diseases classified elsewhere), (K2900, Acute gastritis without bleeding), (A084, Viral intestinal infection, unspecified), (K2960, Other ..."
3,6::51::103,restrictive and obstructive lung disease with an fev1,Disease_Syndrome_Disorder,0.6697,"[(J984, Other disorders of lung), (J670, Farmer's lung), (J449, Chronic obstructive pulmonary disease, unspecified), (J440, Chronic obstructive pulmonary disease with acute lower respiratory infection), (G709, Myoneural disorder, unspecified)]"
5,10::223::264,acanthamoeba infection of skin and sinuses,Disease_Syndrome_Disorder,0.4519,"[(L089, Local infection of the skin and subcutaneous tissue, unspecified), (B6010, Acanthamebiasis, unspecified), (A311, Cutaneous mycobacterial infection), (B383, Cutaneous coccidioidomycosis), (L080, Pyoderma)]"
6,10::283::312,ars due to medication toxicity,Disease_Syndrome_Disorder,0.592,"[(D649, Anemia, unspecified), (I952, Hypotension due to drugs), (J310, Chronic rhinitis), (N46021, Azoospermia due to drug therapy), (D642, Secondary sideroblastic anemia due to drugs and toxins)]"
8,10::467::478,hyponatremia,Disease_Syndrome_Disorder,0.9991,"[(E871, Hypo-osmolality and hyponatremia), (E870, Hyperosmolality and hypernatremia), (E876, Hypokalemia), (E875, Hyperkalemia), (E8341, Hypermagnesemia)]"
9,14::323::330,ischemia,Disease_Syndrome_Disorder,0.8969,"[(G450, Vertebro-basilar artery syndrome), (N280, Ischemia and infarction of kidney), (H3582, Retinal ischemia), (I6782, Cerebral ischemia), (I248, Other forms of acute ischemic heart disease)]"
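
The `coords` value is the document id and the chunk's begin and end offsets joined with `::`. A short sketch for turning it back into integers, for example to look up the source note (`parse_coords` is a hypothetical helper):

```python
def parse_coords(coords):
    """Split 'doc_id::begin::end' into a tuple of three integers."""
    doc_id, begin, end = (int(part) for part in coords.split("::"))
    return doc_id, begin, end

parsed = parse_coords("4::120::128")
print(parsed)
```

With the end offset treated as inclusive, the chunk text would be `text[begin:end + 1]` on the lowercased note that was fed to the pipeline.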


In [0]:
rxnorm_analysis[rxnorm_analysis.rxnorm_conf>0.4].head(20)

Unnamed: 0,coords,chunk,entity,rxnorm_conf,rxnorm_opts
0,0::0::10,pentamidine,DrugChem,0.5925,"[(861601, Pentamidine Isethionate 300 MG Injection), (861597, Pentamidine Isethionate 50 MG/ML Inhalation Solution), (755627, Chloroquine 5 MG/ML Oral Solution), (855624, Dibromopropamidine isethionate 1 MG/ML Ophthalmic Solution), (1119497, chlo..."
1,0::37::47,pentamidine,DrugChem,0.5925,"[(861601, Pentamidine Isethionate 300 MG Injection), (861597, Pentamidine Isethionate 50 MG/ML Inhalation Solution), (755627, Chloroquine 5 MG/ML Oral Solution), (855624, Dibromopropamidine isethionate 1 MG/ML Ophthalmic Solution), (1119497, chlo..."
55,3::278::287,creatinine,DrugChem,0.9996,"[(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)]"
58,7::83::93,cholesterol,DrugChem,0.5609,"[(2104173, beta Sitosterol 35 MG Oral Tablet), (832876, phytosterol esters 500 MG Oral Capsule), (637208, phytosterol esters 650 MG Oral Capsule), (411217, Lecithin 228 MG Oral Capsule), (1737442, amphotericin B lipid complex 5 MG/ML Injection)]"
59,10::397::406,creatinine,DrugChem,0.9996,"[(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)]"
83,12::328::335,mestinon,DrugChem,0.4385,"[(2099309, moxetumomab pasudotox-tdfk 1 MG Injection), (886677, Clidinium bromide 2.5 MG Oral Capsule), (415693, Heparinoids 0.1 UNT/MG Topical Gel), (204558, Peptide Hydrolases 82 UNT/MG Topical Ointment), (1659998, ANTI-INHIBITOR COAGULANT COMP..."
84,12::372::379,mestinon,DrugChem,0.4385,"[(2099309, moxetumomab pasudotox-tdfk 1 MG Injection), (886677, Clidinium bromide 2.5 MG Oral Capsule), (415693, Heparinoids 0.1 UNT/MG Topical Gel), (204558, Peptide Hydrolases 82 UNT/MG Topical Ointment), (1659998, ANTI-INHIBITOR COAGULANT COMP..."
92,15::73::82,creatinine,DrugChem,0.9996,"[(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)]"
