# ICD-O Entity Resolution - version 2.4.6

## Example for ICD-O Entity Resolution Pipeline
A common NLP problem in medical applications is to identify clinical entities and map them to SNOMED codes, and to map cancer-specific entities to their ICD10CM codes and their ICD-O histology and behavior codes.

In this example we will use Spark NLP to identify and resolve these entities using three ontologies: SNOMED, ICD10CM and ICD-O.

Some cancer-related clinical notes (taken from https://www.cancernetwork.com/case-studies and https://oncology.medicinematters.com):
- https://www.cancernetwork.com/case-studies/large-scrotal-mass-multifocal-intra-abdominal-retroperitoneal-and-pelvic-metastases
- https://oncology.medicinematters.com/lymphoma/chronic-lymphocytic-leukemia/case-study-small-b-cell-lymphocytic-lymphoma-and-chronic-lymphoc/12133054
- https://oncology.medicinematters.com/lymphoma/epidemiology/central-nervous-system-lymphoma/12124056
- https://oncology.medicinematters.com/lymphoma/case-study-cutaneous-t-cell-lymphoma/12129416

Note 1: Desmoplastic small round cell tumor
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.

Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.

The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.
</div>

Note 2: SLL and CLL
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.
</div>

Note 3: CNS lymphoma: https://www.icd10data.com/ICD10CM/Codes/C00-D49/C81-C96/C85-/C85.89
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.
</div>

Note 4: Cutaneous T-cell lymphoma
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite. 
</div>

In [1]:
import sys, os, time, pandas as pd

sys.path.append("/home/fernandrez/JSL/repos/spark-nlp/python")
sys.path.append("/home/fernandrez/JSL/repos/spark-nlp-internal/python")

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_rows', 500)

In [2]:
import sparknlp_jsl
sparknlp_jsl.version()

'2.4.7'

In [3]:
spark = sparknlp_jsl.start("####")

## Let's create a dataset with all four case studies

In [24]:
notes = []
notes.append("""A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.
Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.
The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.""")
notes.append("""A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.""")
notes.append("A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.") 
notes.append("An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite.")

# Notes column names

docid_col         = "doc_id"
note_col          = "description"

# doc_id is declared as a string column, so the note index is converted explicitly
data = spark.createDataFrame([(str(i), n.lower()) for i, n in enumerate(notes)], 
                             StructType([StructField(docid_col, StringType()),
                                         StructField(note_col, StringType())]))

## And let's build a Spark NLP pipeline with the following stages:
- DocumentAssembler: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework
- SentenceDetector: Annotator to pragmatically separate complete sentences inside each document
- Tokenizer: Annotator to separate sentences into tokens (generally words)
- WordEmbeddings: Vectorization of word tokens, in this case using word embeddings trained from PubMed, ICD10 and other clinical resources
- Clinical, BioNLP and drug NER + NerConverter: These annotators return chunks; the BioNLP chunks relate to cancer and genetics
- ChunkEmbeddings: Aggregates the WordEmbeddings for each NER chunk
- ChunkTokenizer: Tokenizes each NER chunk, giving extra flexibility over how a chunk is split
- ChunkEntityResolver: Annotator that performs a k-nearest-neighbours (KNN) search over the reference terms, in this case trained on ICD-O histology and behavior codes
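
To make the nearest-neighbour idea behind the resolver concrete, here is a minimal pure-Python sketch. It is not Spark NLP's implementation: the three-dimensional vectors are invented toy values, plain Euclidean distance stands in for whatever the resolver computes, and `INDEX`, `euclidean` and `nearest_codes` are hypothetical names. The three ICD-O codes and terms are ones that appear in the results at the end of this notebook.

```python
import math

# Toy reference index: ICD-O code -> (term, made-up 3-d vector).
# Real chunk vectors come from the embeddings model; these numbers are invented.
INDEX = {
    "8806/3": ("Desmoplastic small round cell tumor", [0.9, 0.1, 0.0]),
    "9823/3": ("Chronic lymphocytic leukemia/small lymphocytic lymphoma", [0.1, 0.9, 0.2]),
    "9591/3": ("Malignant lymphoma, non-Hodgkin", [0.2, 0.7, 0.6]),
}

def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_codes(query_vec, k=2):
    """Return the k closest (code, term, distance) triples, nearest first."""
    scored = [(code, term, euclidean(query_vec, vec))
              for code, (term, vec) in INDEX.items()]
    return sorted(scored, key=lambda item: item[2])[:k]

# An invented query vector standing in for an embedded chunk.
query = [0.15, 0.85, 0.25]
for code, term, dist in nearest_codes(query):
    print(code, term, round(dist, 3))
```

The first triple plays the role of the resolver's `result`, and the remaining ones play the role of the alternatives reported in the metadata.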

In [5]:
#Language and Model Repository

models_language   = "en"
models_repository = "clinical/models"

In [6]:
# Embeddings Pretrained Model Name

embeddings_name   = "embeddings_clinical"

# Preparation and embeddings column names

doc_col           = "document"
sent_col          = "sentence"
token_col         = "token"
embeddings_col    = "embeddings"

# Usual preparation Annotators and Embeddings

docAssembler = DocumentAssembler().setInputCol(note_col).setOutputCol(doc_col)
sentenceDetector = SentenceDetector().setInputCols(doc_col).setOutputCol(sent_col)
tokenizer_chars = [",","\\/"," ",".","|","@","#","%","&","\\$","\\[","\\]","\\(","\\)","\\-",";"]
tokenizer = Tokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(sent_col).setOutputCol(token_col)
embeddings = WordEmbeddingsModel.pretrained(embeddings_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col)\
    .setOutputCol(embeddings_col)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
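
The effect of the split characters can be approximated outside Spark with Python's `re` module. This is only a rough stand-in, not the Tokenizer's actual logic: it assumes that the last character of each entry acts as a literal separator, and `split_token` is a hypothetical helper. The sample strings are fragments of the notes above.

```python
import re

# Split characters from the cell above; each entry is one character,
# optionally preceded by a backslash escape.
split_chars = [",", "\\/", " ", ".", "|", "@", "#", "%", "&", "\\$",
               "\\[", "\\]", "\\(", "\\)", "\\-", ";"]

# Assumption: treat the last character of each entry as a literal separator
# and collect all of them in one regex character class.
separators = "".join(re.escape(entry[-1]) for entry in split_chars)
splitter = re.compile(f"[{separators}]+")

def split_token(text):
    """Split text on any run of separator characters, dropping empty pieces."""
    return [piece for piece in splitter.split(text) if piece]

print(split_token("AE1/AE3"))      # ['AE1', 'AE3']
print(split_token("EWS-WT1"))      # ['EWS', 'WT1']
print(split_token("(CLL)/small"))  # ['CLL', 'small']
```

Splitting on `/`, `-` and parentheses matters for these notes because marker names and abbreviations are often glued together, and the embeddings are looked up per token.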


In [7]:
# NER Pretrained Model Names

clinical_ner_name = "ner_clinical"
cancer_ner_name   = "ner_bionlp"
drug_ner_name     = "ner_drugs"

# NER Column Names

ner_clinical_col  = "ner_clinical"
ner_bio_col       = "ner_bio"
ner_drug_col      = "ner_drug"

# Annotators responsible for the clinical, cancer genetics and drug entity recognition tasks

clinicalNer = NerDLModel.pretrained(clinical_ner_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col, embeddings_col)\
    .setOutputCol(ner_clinical_col)

bioNer = NerDLModel.pretrained(cancer_ner_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col, embeddings_col)\
    .setOutputCol(ner_bio_col)

drugNer = NerDLModel.pretrained(drug_ner_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col, embeddings_col)\
    .setOutputCol(ner_drug_col)

ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
ner_drugs download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [8]:
# Chunk column names

chunk_clinical_col = "chunk_clinical"
chunk_bio_col      = "chunk_bio"
chunk_cancer_col   = "chunk_cancer"
chunk_drug_col     = "chunk_drug"

#Converter annotators transform IOB tags into full chunks (sequence set of tokens) tagged with `entity` metadata

clinicalConverter = NerConverter().setInputCols(sent_col, token_col, ner_clinical_col)\
    .setOutputCol(chunk_clinical_col)
bioConverter = NerConverter().setInputCols(sent_col, token_col, ner_bio_col)\
    .setOutputCol(chunk_bio_col)
cancerConverter = NerConverter().setInputCols(sent_col, token_col, ner_bio_col)\
    .setOutputCol(chunk_cancer_col).setWhiteList(['Cancer']) # We whitelist just `Cancer` entities
drugConverter = NerConverter().setInputCols(sent_col, token_col, ner_drug_col)\
    .setOutputCol(chunk_drug_col)
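
What the converters do can be sketched in plain Python. This is an illustration of the IOB-to-chunk idea with a whitelist, not the NerConverter source: `iob_to_chunks` is a hypothetical helper and the tags below are hand-written rather than model output.

```python
def iob_to_chunks(tokens, tags, whitelist=None):
    """Group B-/I- tagged tokens into (text, entity) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag with a different label, starts a new chunk
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    if whitelist is not None:
        chunks = [c for c in chunks if c[1] in whitelist]
    return chunks

tokens = ["a", "testicular", "germ", "cell", "tumor", "near", "the", "bladder"]
tags = ["O", "B-Cancer", "I-Cancer", "I-Cancer", "I-Cancer", "O", "O", "B-Organ"]

print(iob_to_chunks(tokens, tags))
print(iob_to_chunks(tokens, tags, whitelist=["Cancer"]))
```

The second call mirrors the `Cancer` whitelist: chunks with any other entity label are dropped before they reach the cancer resolvers.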

In [9]:
# ChunkEmbeddings column names

chunk_embs_clinical_col = "chunk_embs_clinical"
chunk_embs_bio_col      = "chunk_embs_bio"
chunk_embs_cancer_col   = "chunk_embs_cancer"
chunk_embs_drug_col     = "chunk_embs_drug"

#ChunkEmbeddings annotators aggregate embeddings for each token in the chunk

clinicalChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_clinical_col, embeddings_col)\
  .setOutputCol(chunk_embs_clinical_col)
bioChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_bio_col, embeddings_col)\
  .setOutputCol(chunk_embs_bio_col)
cancerChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_cancer_col, embeddings_col)\
  .setOutputCol(chunk_embs_cancer_col)
drugChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_drug_col, embeddings_col)\
  .setOutputCol(chunk_embs_drug_col)
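
As a minimal sketch of the aggregation step, assuming average pooling (the strategy the resolvers further down are configured with), a chunk vector is the element-wise mean of its token vectors. The vectors are invented toy values and `average_pool` is a hypothetical helper, not a Spark NLP function.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("a chunk needs at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Invented 3-d vectors for the two tokens of the chunk "pelvic masses".
chunk_tokens = {"pelvic": [1.0, 0.0, 2.0], "masses": [3.0, 4.0, 0.0]}
chunk_vector = average_pool(list(chunk_tokens.values()))
print(chunk_vector)  # [2.0, 2.0, 1.0]
```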

In [10]:
# ChunkTokenizer column names

chunk_token_clinical_col = "chunk_token_clinical"
chunk_token_bio_col      = "chunk_token_bio"
chunk_token_cancer_col   = "chunk_token_cancer"
chunk_token_drug_col     = "chunk_token_drug"

# ChunkTokenizer provides extra flexibility at the time of tokenizing a given chunk

clinicalChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_clinical_col).setOutputCol(chunk_token_clinical_col)
bioChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_bio_col).setOutputCol(chunk_token_bio_col)
cancerChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_cancer_col).setOutputCol(chunk_token_cancer_col)
drugChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_drug_col).setOutputCol(chunk_token_drug_col)

In [11]:
# Entity Resolution Pretrained Model Names

snomed_model_name           = "ensembleresolve_snomed_clinical"
icd10cm_cancer_model_name   = "chunkresolve_icd10cm_neoplasms_clinical"
icd_model_name              = "chunkresolve_icdo_clinical"

# Entity Resolution column Names

snomed_token_col,snomed_chunk_col,snomed_embs_col = chunk_token_clinical_col,chunk_clinical_col,chunk_embs_clinical_col
#snomed_token_col,snomed_chunk_col,snomed_embs_col = chunk_token_bio_col,chunk_bio_col,chunk_embs_bio_col

snomed_clinical_col  = "snomed_resolution"
icd10cm_col   = "icd10cm_resolution"
icdo_col      = "icdo_resolution" 

snomedResolver = EnsembleEntityResolverModel\
    .pretrained(snomed_model_name, models_language, models_repository)\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([1,3,2,0,0,1])\
    .setExtramassPenalty(3).setPoolingStrategy("AVERAGE")\
    .setInputCols(snomed_token_col, snomed_embs_col)\
    .setOutputCol(snomed_clinical_col)

icd10cmResolver = ChunkEntityResolverModel\
    .pretrained(icd10cm_cancer_model_name, models_language, models_repository)\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([1,3,2,0,0,1])\
    .setExtramassPenalty(3).setPoolingStrategy("AVERAGE")\
    .setInputCols(chunk_token_cancer_col, chunk_embs_cancer_col)\
    .setOutputCol(icd10cm_col)

icdoResolver = ChunkEntityResolverModel\
    .pretrained(icd_model_name, models_language, models_repository)\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([1,3,2,0,0,1])\
    .setExtramassPenalty(3).setPoolingStrategy("AVERAGE")\
    .setInputCols(chunk_token_cancer_col, chunk_embs_cancer_col)\
    .setOutputCol(icdo_col)

ensembleresolve_snomed_clinical download started this may take some time.
Approximate size to download 592.9 MB
[OK!]
chunkresolve_icd10cm_neoplasms_clinical download started this may take some time.
Approximate size to download 15.4 MB
[OK!]
chunkresolve_icdo_clinical download started this may take some time.
Approximate size to download 8.2 MB
[OK!]
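
The resolvers above are given six distance weights and a pooling strategy. The sketch below shows one plausible reading of those settings, and every detail of it is an assumption rather than the library's documented behaviour: the weights are taken to follow the order of the distance fields listed in the metadata (WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein), `AVERAGE` is taken to mean the mean of the weighted values, and the codes and distances are invented.

```python
# Assumed order: WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein.
WEIGHTS = [1, 3, 2, 0, 0, 1]

def pooled_distance(distances, weights=WEIGHTS, strategy="AVERAGE"):
    """Weight each distance, then pool with AVERAGE or SUM (assumed semantics)."""
    if len(distances) != len(weights):
        raise ValueError("need one distance per weight")
    weighted = [d * w for d, w in zip(distances, weights)]
    total = sum(weighted)
    if strategy == "SUM":
        return total
    if strategy == "AVERAGE":
        return total / len(weighted)
    raise ValueError(f"unknown strategy: {strategy}")

def top_alternatives(candidates, alternatives=5):
    """candidates: {code: [six distances]} -> closest codes, nearest first."""
    ranked = sorted(candidates, key=lambda code: pooled_distance(candidates[code]))
    return ranked[:alternatives]

# Hypothetical candidate codes with invented per-metric distances.
candidates = {
    "code_a": [0.2, 0.1, 0.3, 0.9, 0.9, 0.4],
    "code_b": [0.6, 0.5, 0.4, 0.1, 0.1, 0.2],
}
print(top_alternatives(candidates, alternatives=1))
```

Under this reading, a weight of 0 removes a metric from the ranking entirely, which is why `code_a` wins here despite its large fourth and fifth distances.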


In [12]:
pipelineFull = Pipeline().setStages([
    docAssembler, 
    sentenceDetector, 
    tokenizer, 
    embeddings, 
    clinicalNer,
    bioNer,
    drugNer,
    clinicalConverter,
    bioConverter,
    cancerConverter,
    drugConverter,
    clinicalChunkEmbeddings, 
    bioChunkEmbeddings, 
    cancerChunkEmbeddings, 
    drugChunkEmbeddings,
    clinicalChunkTokenizer,
    bioChunkTokenizer,
    cancerChunkTokenizer,
    drugChunkTokenizer,
    snomedResolver,
    icd10cmResolver,
    icdoResolver
])

In [13]:
pipelineModelFull = pipelineFull.fit(data)

In [14]:
output = pipelineModelFull.transform(data).cache()

## The key parts of our model are the **WordEmbeddings and ChunkEntityResolver**: 

### WordEmbeddings:
Word2Vec model trained on semantically augmented datasets, using information from curated datasets in JSL Data Market.

### EntityResolver:
Trained on an augmented ICD-O dataset from JSL Data Market, it resolves matched expressions to histology codes. Besides the code in the `result` field, it returns further metadata about the matching process:

- all_k_results -> Sorted ResolverLabels in the top `alternatives` that match the distance `threshold`
- all_k_resolutions -> Respective ResolverNormalized strings
- all_k_distances -> Respective distance values after aggregation
- all_k_wmd_distances -> Respective WMD distance values (added if allDistancesMetadata==True)
- all_k_tfidf_distances -> Respective TFIDF Cosine distance values (added if allDistancesMetadata==True)
- all_k_jaccard_distances -> Respective Jaccard distance values (added if allDistancesMetadata==True)
- all_k_sorensen_distances -> Respective SorensenDice distance values (added if allDistancesMetadata==True)
- all_k_jaro_distances -> Respective JaroWinkler distance values (added if allDistancesMetadata==True)
- all_k_levenshtein_distances -> Respective Levenshtein distance values (added if allDistancesMetadata==True)
- all_k_confidences -> Respective normalized probabilities based on inverse distance values (added if allDistancesMetadata==True)
- target_text -> The actual searched string
- resolved_text -> The top ResolverNormalized string
- confidence -> Top probability
- distance -> Top distance value
- sentence -> Sentence index
- chunk -> Chunk Index
- token -> Token index
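
The `all_k_*` fields arrive as delimited strings. The analysis function in this notebook splits `all_k_resolutions` on `:::`; the sketch below assumes the other `all_k_*` fields use the same delimiter. `parse_alternatives` is a hypothetical helper, and the metadata dictionary is hand-written for illustration (the second distance is invented), not copied from a pipeline run.

```python
def parse_alternatives(metadata, sep=":::"):
    """Zip the all_k_* strings into one dict per candidate."""
    codes = metadata["all_k_results"].split(sep)
    names = metadata["all_k_resolutions"].split(sep)
    dists = [float(d) for d in metadata["all_k_distances"].split(sep)]
    return [{"code": c, "resolution": n, "distance": d}
            for c, n, d in zip(codes, names, dists)]

# Hand-written example metadata in the assumed format.
metadata = {
    "all_k_results": "8806/3:::8002/3",
    "all_k_resolutions": "Desmoplastic small round cell tumor:::Malignant tumor, small cell type",
    "all_k_distances": "0.0:::0.5",
}

for candidate in parse_alternatives(metadata):
    print(candidate)
```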

In [15]:
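def quick_metadata_analysis(df, doc_field, chunk_field, code_fields, dist_thres=1):
    """Return one row per chunk with its top code, distance, confidence and alternatives.

    `dist_thres` is not used here; rows are filtered by distance on the pandas side.
    """
    # Zip chunk begin/end/result/metadata with each resolver's result/metadata: one struct per chunk
    code_res_meta = ", ".join([f"{cf}.result, {cf}.metadata" for cf in code_fields])
    expression = f"explode(arrays_zip({chunk_field}.begin, {chunk_field}.end, {chunk_field}.result, {chunk_field}.metadata, "+code_res_meta+")) as a"
    # Resolver i sits at struct positions 2*i+4 (result) and 2*i+5 (metadata);
    # output columns take the prefix of the resolver column, e.g. `icdo`, `icdo_dst`
    top_n_rest = [(f"a['{2*i+4}'] as {(cf.split('_')[0])}",
                   f"float(a['{2*i+5}'].distance) as {(cf.split('_')[0])}_dst",
                   f"a['{2*i+5}'].confidence as {(cf.split('_')[0])}_conf",
                   f"split(a['{2*i+5}'].all_k_resolutions,':::') as {cf.split('_')[0]+'_opts'}")
                  for i, cf in enumerate(code_fields)]
    # Flatten the per-resolver tuples into a single list of select expressions
    top_n_rest_args = [t for tr in top_n_rest for t in tr]
    # Order by document and chunk position, then build the `doc::begin::end` coordinates
    return df.selectExpr(doc_field, expression) \
        .orderBy(doc_field, F.expr("a['0']"), F.expr("a['1']"))\
        .selectExpr(f"concat_ws('::',{doc_field},a['0'],a['1']) as coords", "a['2'] as chunk","a['3'].entity as entity", *top_n_rest_args)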
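
The `coords` column packs the document id and the chunk's begin and end offsets into one `::`-separated string. A small sketch of how such a value can be mapped back to the note text follows; `decode_coords` and `chunk_text` are hypothetical helpers. The value `0::123::147` is the first row of the SNOMED results shown further down, and the text is the first sentence of Note 1.

```python
def decode_coords(coords, sep="::"):
    """Split 'doc::begin::end' into (doc_id, begin, end)."""
    doc_id, begin, end = coords.split(sep)
    return doc_id, int(begin), int(end)

def chunk_text(note, begin, end):
    """Annotation end offsets are inclusive, so the slice runs to end + 1."""
    return note[begin:end + 1]

# First sentence of Note 1.
note = ("A 35-year-old African-American man was referred to our urology clinic by his "
        "primary care physician for consultation about a large left scrotal mass.")

doc_id, begin, end = decode_coords("0::123::147")
print(doc_id, "->", chunk_text(note, begin, end))
```

This prints the chunk `a large left scrotal mass`, which is a quick way to check a resolution against its source sentence.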

In [16]:
snomed_analysis = \
quick_metadata_analysis(output, docid_col, snomed_chunk_col,[snomed_clinical_col], 1).toPandas()

In [17]:
cancer_analysis = \
quick_metadata_analysis(output, docid_col, chunk_cancer_col,[icd10cm_col, icdo_col], 1).toPandas()

In [23]:
snomed_analysis[snomed_analysis.snomed_dst<1]

Unnamed: 0,coords,chunk,entity,snomed,snomed_dst,snomed_conf,snomed_opts
0,0::123::147,a large left scrotal mass,PROBLEM,15634751000119101,0.9883,0.2084,"[Mass of left ovary, Mass in left breast, Mass of skin of left hand, Mass of skin of left forearm, Mass of skin of left foot]"
1,0::192::212,left scrotal swelling,PROBLEM,15952141000119106,0.8715,0.2158,"[Left parotid gland swelling, Swelling of left foot, Swelling of left arm, Swelling of left tonsil, Swelling of scrotum]"
3,0::279::300,mild left scrotal pain,PROBLEM,722829006,0.6781,0.2581,"[Acute scrotal pain, Left inguinal pain, Pain of left thigh, Pain of left hand, Pain of left testicle]"
4,0::329::345,mild constipation,PROBLEM,111360009,0.7484,0.2289,"[Intractable constipation, Functional constipation, Spastic constipation, Mild asthma, Mild dietary indigestion]"
5,0::353::363,hard stools,PROBLEM,75295004,0.0,0.3194,"[Hard stools, Red stools, Loose stools, Black stools, Green stools]"
6,0::396::413,urinary complaints,PROBLEM,38276004,0.7313,0.3216,"[Multiple complaints, Urinary casts, Urinary eosinophils, Abnormal urinary product, Urinary reducing substance]"
7,0::419::438,physical examination,TEST,5880005,0.0,0.2345,"[Physical examination, Physical examination assessment, Physical examination management, Cardiovascular physical examination, Postoperative physical examination]"
8,0::441::466,a hard paratesticular mass,PROBLEM,102031000119109,0.7981,0.253,"[Paratesticular mass (disorder), Mass of hard palate, Observation of a mass, Mass of soft tissue, Neoplasm of hard palate]"
9,0::621::673,"A hard, lower abdominal mass in the suprapubic region",PROBLEM,163293006,0.9794,0.2181,"[On examination - abdominal mass-very hard, On examination - abdominal mass - hard (finding), On examination - abdominal mass - lower border defined, On examination - left lower abdominal mass, Mass in nipple region of right breast]"
10,0::768::785,further evaluation,TEST,182770003,0.836,0.3724,"[Preanaesthesia evaluation, Initial psychiatric evaluation, Cardiovascular examination and evaluation, Limited interview and evaluation, Comprehensive interview and evaluation]"


In [20]:
cancer_analysis[cancer_analysis.icd10cm_dst<2]

Unnamed: 0,coords,chunk,entity,icd10cm,icd10cm_dst,icd10cm_conf,icd10cm_opts,icdo,icdo_dst,icdo_conf,icdo_opts
0,0::448::461,paratesticular,Cancer,D481,1.6465,0.9522,"[Neoplasm of uncertain behavior of connective and other soft tissue, Neoplasm of uncertain behavior of bone and articular cartilage, Malignant neoplasm of retroperitoneum, Hemangioma of other sites, Malignant neoplasm of connective and soft tissu...",9540/3,1.5978,0.8433,"[Malignant peripheral nerve sheath tumor, Teratoid medulloepithelioma, Spermatocytic seminoma, Hemangiopericytoma, NOS, Solitary fibrous tumor, malignant]"
1,0::1078::1103,testicular germ cell tumor,Cancer,C801,0.6457,0.243,"[Malignant (primary) neoplasm, unspecified, Malignant neoplasm of unspecified ovary, Malignant neoplasm of unspecified testis, unspecified whether descended or undescended, Malignant neoplasm of left ovary, Malignant neoplasm of right ovary]",9085/3,0.5235,0.3044,"[Mixed germ cell tumor, Germ cell tumor, nonseminomatous, Enterochromaffin-like cell tumor, malignant, Clear cell tumor, NOS, Squamous cell carcinoma, HPV-positive]"
2,0::1632::1644,pelvic masses,Cancer,C763,0.8573,0.6731,"[Malignant neoplasm of pelvis, Benign neoplasm of peripheral nerves and autonomic nervous system of pelvis, Malignant neoplasm of specified parts of peritoneum, Benign neoplasm of peripheral nerves and autonomic nervous system of abdomen, Maligna...",8312/3,1.7226,0.6066,"[Renal cell carcinoma, Neoplasm, benign, Craniopharyngioma, Hemangioendothelioma, benign, Skin appendage carcinoma]"
3,0::2429::2463,desmoplastic small round cell tumor,Cancer,C784,1.0226,0.2288,"[Secondary malignant neoplasm of small intestine, Malignant (primary) neoplasm, unspecified, Neoplasm of uncertain behavior, unspecified, Small cell B-cell lymphoma, spleen, Malignant neoplasm of unspecified part of right bronchus or lung]",8806/3,0.0,0.3747,"[Desmoplastic small round cell tumor, Malignant tumor, small cell type, Alveolar rhabdomyosarcoma, Round cell liposarcoma, Myxoid liposarcoma]"
5,1::370::433,chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (S,Cancer,C9102,0.8568,0.2264,"[Acute lymphoblastic leukemia, in relapse, Acute lymphoblastic leukemia, in remission, Small cell B-cell lymphoma, spleen, Lymphocyte-rich Hodgkin lymphoma, spleen, Lymphocyte-rich Hodgkin lymphoma, intra-abdominal lymph nodes]",9823/3,0.3718,0.3345,"[Chronic lymphocytic leukemia/small lymphocytic lymphoma, Precursor cell lymphoblastic leukemia, NOS, Hodgkin lymphoma, lymphocytic deplet., NOS, Acute lymphoblastic leukemia, L2 type, NOS, Precursor T-cell lymphoblastic lymphoma]"
7,2::386::421,central nervous system lymphoma (PCN,Cancer,C8589,0.2957,0.2052,"[Other specified types of non-Hodgkin lymphoma, extranodal and solid organ sites, Other non-follicular lymphoma, unspecified site, Other non-follicular lymphoma, extranodal and solid organ sites, Other non-follicular lymphoma, intrathoracic lymph...",9501/3,0.5925,0.2181,"[Medulloepithelioma, NOS, Medulloblastoma, WNT-activated, Medulloblastoma, non-WNT/non-SHH, Craniopharyngioma, Medulloblastoma, NOS]"
8,3::373::380,Lymphoma,Cancer,C8133,1.5289,0.4794,"[Lymphocyte depleted Hodgkin lymphoma, intra-abdominal lymph nodes, Immunoproliferative small intestinal disease, Sezary disease, unspecified site, Hodgkin lymphoma, unspecified, unspecified site, Burkitt lymphoma, unspecified site]",9591/3,1.4131,0.3647,"[Malignant lymphoma, non-Hodgkin, Mantle cell lymphoma, Immunoproliferative small intestinal disease, Sezary syndrome, Hydroa vacciniforme-like lymphoma]"
9,3::408::436,left axillary lymphadenopathy,Cancer,C8514,1.4175,0.2312,"[Unspecified B-cell lymphoma, lymph nodes of axilla and upper limb, Non-Hodgkin lymphoma, unspecified, lymph nodes of axilla and upper limb, Mature T/NK-cell lymphomas, unspecified, lymph nodes of axilla and upper limb, Secondary and unspecified ...",9705/3,1.0236,0.4108,"[Angioimmunoblastic T-cell lymphoma, Renal cell carcinoma, Neoplasm, malignant, Neoplasm, benign, T-cell large granular lymphocytic leukemia]"
