# ICD-O Entity Resolution - version 2.4.6

## Example for ICD-O Entity Resolution Pipeline
A common NLP problem in medical applications is to identify clinical entities and map them to SNOMED codes, and to map cancer-specific entities to their ICD10CM codes and their ICD-O histology and behavior codes.

In this example we will use Spark NLP to identify and resolve these entities using three ontologies: SNOMED, ICD10CM and ICD-O.

Some cancer-related clinical notes, taken from https://www.cancernetwork.com/case-studies and oncology.medicinematters.com:

- https://www.cancernetwork.com/case-studies/large-scrotal-mass-multifocal-intra-abdominal-retroperitoneal-and-pelvic-metastases
- https://oncology.medicinematters.com/lymphoma/chronic-lymphocytic-leukemia/case-study-small-b-cell-lymphocytic-lymphoma-and-chronic-lymphoc/12133054
- https://oncology.medicinematters.com/lymphoma/epidemiology/central-nervous-system-lymphoma/12124056
- https://oncology.medicinematters.com/lymphoma/case-study-cutaneous-t-cell-lymphoma/12129416

Note 0: Desmoplastic small round cell tumor
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.

Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.

The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.
</div>

Note 1: SLL and CLL
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.
</div>

Note 2: CNS lymphoma
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.
</div>

Note 3: Cutaneous T-cell lymphoma
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite. 
</div>

In [1]:
import os
import sys
import time

import pandas as pd

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
import pyspark.sql.functions as F
from pyspark.ml import Pipeline  # used explicitly when the pipeline is assembled
from pyspark.sql.types import StructType, StructField, StringType

pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)

In [2]:
import sparknlp_jsl
sparknlp_jsl.version()

'2.4.7'

In [3]:
spark = sparknlp_jsl.start("####")

## Let's create a dataset with all four case studies

In [4]:
notes = []
notes.append("""A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.
Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.
The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.""")
notes.append("""A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.""")
notes.append("A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.") 
notes.append("An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite.")

# Notes column names

docid_col         = "doc_id"
note_col          = "description"

# doc_id is declared as a string column, so the enumerate index is cast explicitly
data = spark.createDataFrame([(str(i), n.lower()) for i, n in enumerate(notes)],
                             StructType([StructField(docid_col, StringType()),
                                         StructField(note_col, StringType())]))

## And let's build a Spark NLP pipeline with the following stages:
- DocumentAssembler: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework.
- SentenceDetector: Annotator that pragmatically splits each document into complete sentences.
- Tokenizer: Annotator that splits sentences into tokens (generally words).
- WordEmbeddings: Vectorization of word tokens, in this case using word embeddings trained on PubMed, ICD10 and other clinical resources.
- ChunkEmbeddings: Aggregates the WordEmbeddings for each NER chunk.
- Clinical NER + NerConverter: These annotators return chunks for generic clinical entities.
- BioNLP NER + NerConverter: These annotators return chunks for cancer and genetics entities.
- ChunkTokenizer: Tokenizes each NER chunk with the same split characters as the Tokenizer; its output feeds the resolvers.
- SNOMED ChunkEntityResolver: Annotator that performs a k-nearest-neighbors (KNN) search, in this case with a model trained on the SNOMED CT ontology.
- ICD10CM Neoplasms ChunkEntityResolver: Annotator that performs a KNN search, in this case with a model trained on ICD10CM codes from C000-D499 and R590-R599.
- ICDO ChunkEntityResolver: Annotator that performs a KNN search, in this case with a model trained on ICD-O histology and behavior codes.
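
As a rough illustration of the nearest-neighbor idea behind the resolvers listed above, the following sketch ranks a handful of candidate codes by cosine distance to a chunk vector. The three-dimensional vectors and the helper names (`cosine_distance`, `top_k`) are made up for this example; the pretrained resolvers combine several distance measures and are not reproduced here.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def top_k(query, candidates, k):
    """Return the k (code, distance) pairs closest to the query vector."""
    scored = [(code, cosine_distance(query, vec)) for code, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 3-dimensional "embeddings" (assumed values, not real model output).
candidates = {
    "8806/3": [0.9, 0.1, 0.0],
    "9591/3": [0.1, 0.9, 0.1],
    "8000/0": [0.3, 0.3, 0.9],
}
chunk_vector = [0.8, 0.2, 0.1]

neighbours = top_k(chunk_vector, candidates, k=2)
print(neighbours)
```

The `k` argument plays a role loosely comparable to the number of alternatives a resolver returns: only the closest candidates survive the ranking.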

In [5]:
#Language and Model Repository

models_language   = "en"
models_repository = "clinical/models"

In [6]:
# Embeddings Pretrained Model Name

embeddings_name   = "embeddings_clinical"

# Preparation and embeddings column names

doc_col           = "document"
sent_col          = "sentence"
token_col         = "token"
embeddings_col    = "embeddings"

# Usual preparation Annotators and Embeddings

docAssembler = DocumentAssembler().setInputCol(note_col).setOutputCol(doc_col)
sentenceDetector = SentenceDetector().setInputCols(doc_col).setOutputCol(sent_col)
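# Split characters are regex fragments; every backslash is doubled so Python passes it through unchanged
tokenizer_chars = [",","\\/"," ",".","|","@","#","%","&","\\$","\\[","\\]","\\(","\\)","\\-",";"]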
tokenizer = Tokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(sent_col).setOutputCol(token_col)
embeddings = WordEmbeddingsModel.pretrained(embeddings_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col)\
    .setOutputCol(embeddings_col)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [7]:
# NER Pretrained Model Names

clinical_ner_name = "ner_clinical"
cancer_ner_name   = "ner_bionlp"

# NER Column Names

ner_clinical_col  = "ner_clinical"
ner_bio_col       = "ner_bio"

# Annotators responsible for the clinical and the cancer genetics entity recognition tasks

clinicalNer = NerDLModel.pretrained(clinical_ner_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col, embeddings_col)\
    .setOutputCol(ner_clinical_col)

bioNer = NerDLModel.pretrained(cancer_ner_name, models_language, models_repository)\
    .setInputCols(sent_col, token_col, embeddings_col)\
    .setOutputCol(ner_bio_col)

ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [8]:
# Chunk column names

chunk_clinical_col = "chunk_clinical"
chunk_cancer_col   = "chunk_cancer"

# Converter annotators transform IOB tags into full chunks (sequences of tokens) tagged with `entity` metadata

clinicalConverter = NerConverter().setInputCols(sent_col, token_col, ner_clinical_col)\
    .setOutputCol(chunk_clinical_col)
cancerConverter = NerConverter().setInputCols(sent_col, token_col, ner_bio_col)\
    .setOutputCol(chunk_cancer_col).setWhiteList(['Cancer']) # We whitelist just `Cancer` entities
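
To make the IOB-to-chunk step concrete, here is a minimal pure-Python sketch of the idea: consecutive `B-`/`I-` tags with the same label are merged into one chunk, and an optional whitelist keeps only selected entity types. The function name `iob_to_chunks`, the tokens and the tags are handwritten for illustration and are not output of the pretrained models; the sketch does not reproduce how `NerConverter` is implemented.

```python
def iob_to_chunks(tokens, tags, whitelist=None):
    """Group IOB-tagged tokens into (text, entity) chunks."""
    chunks, current, entity = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != entity):
            # A new chunk starts: flush the one in progress first.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continuation of the current chunk
        else:
            # "O" tag: close any open chunk.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    if whitelist is not None:
        chunks = [chunk for chunk in chunks if chunk[1] in whitelist]
    return chunks

# Handwritten example tags (assumed, not produced by a model).
tokens = ["the", "left", "testicle", "and", "a", "testicular", "germ", "cell", "tumor"]
tags = ["O", "B-Organ", "I-Organ", "O", "O", "B-Cancer", "I-Cancer", "I-Cancer", "I-Cancer"]

all_chunks = iob_to_chunks(tokens, tags)
cancer_chunks = iob_to_chunks(tokens, tags, whitelist=["Cancer"])
print(all_chunks)
print(cancer_chunks)
```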

In [9]:
# ChunkEmbeddings column names

chunk_embs_clinical_col = "chunk_embs_clinical"
chunk_embs_cancer_col   = "chunk_embs_cancer"

#ChunkEmbeddings annotators aggregate embeddings for each token in the chunk

clinicalChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_clinical_col, embeddings_col)\
  .setOutputCol(chunk_embs_clinical_col)
cancerChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols(chunk_cancer_col, embeddings_col)\
  .setOutputCol(chunk_embs_cancer_col)
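
The aggregation can be pictured as pooling the token vectors of a chunk into a single vector. The sketch below uses an element-wise average, which is one common pooling choice and is assumed here purely for illustration; the toy two-token vectors and the helper name `average_pool` are hypothetical.

```python
def average_pool(token_vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(token_vectors)
    return [sum(dimension) / count for dimension in zip(*token_vectors)]

# Toy token embeddings (assumed values, not real model output).
token_embeddings = {
    "pelvic": [0.25, 0.5, 0.0],
    "masses": [0.75, 0.0, 0.5],
}

chunk_tokens = ["pelvic", "masses"]
chunk_vector = average_pool([token_embeddings[token] for token in chunk_tokens])
print(chunk_vector)
```

Whatever the pooling strategy, the result has the same dimensionality as a single token vector, which is what lets a resolver compare chunks of different lengths.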

In [10]:
# ChunkTokenizer column names

chunk_token_clinical_col = "chunk_token_clinical"
chunk_token_cancer_col   = "chunk_token_cancer"

# ChunkTokenizer provides extra flexibility at the time of tokenizing a given chunk

clinicalChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_clinical_col).setOutputCol(chunk_token_clinical_col)
cancerChunkTokenizer = ChunkTokenizer().setSplitChars(tokenizer_chars)\
    .setInputCols(chunk_cancer_col).setOutputCol(chunk_token_cancer_col)

In [11]:
# Entity Resolution Pretrained Model Names

snomed_model_name           = "ensembleresolve_snomed_clinical"
icd10cm_model_name   = "chunkresolve_icd10cm_neoplasms_clinical"
icd_model_name              = "chunkresolve_icdo_clinical"

# Entity Resolution column Names

snomed_clinical_col  = "snomed_resolution"
icd10cm_col   = "icd10cm_resolution"
icdo_col      = "icdo_resolution" 

snomedResolver = EnsembleEntityResolverModel()\
    .pretrained(snomed_model_name, models_language, models_repository)\
    .setEnableJaccard(False).setEnableLevenshtein(True).setEnableJaroWinkler(True)\
    .setNeighbours(200).setAlternatives(10).setDistanceWeights([3,3,0,0,4,2])\
    .setExtramassPenalty(2.5).setPoolingStrategy("MAX").setConfidenceFunction("INVERSE")\
    .setInputCols(chunk_token_clinical_col, chunk_embs_clinical_col)\
    .setOutputCol(snomed_clinical_col)

icd10cmResolver = ChunkEntityResolverModel\
    .pretrained(icd10cm_model_name, models_language, models_repository)\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([1,4,1,0,0,3])\
    .setExtramassPenalty(2.5).setPoolingStrategy("MAX").setConfidenceFunction("INVERSE")\
    .setInputCols(chunk_token_cancer_col, chunk_embs_cancer_col)\
    .setOutputCol(icd10cm_col)

icdoResolver = ChunkEntityResolverModel\
    .pretrained(icd_model_name, models_language, models_repository)\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([4,4,2,0,8,8])\
    .setExtramassPenalty(3).setPoolingStrategy("MAX").setConfidenceFunction("INVERSE")\
    .setInputCols(chunk_token_cancer_col, chunk_embs_cancer_col)\
    .setOutputCol(icdo_col)

ensembleresolve_snomed_clinical download started this may take some time.
Approximate size to download 592.9 MB
[OK!]
chunkresolve_icd10cm_neoplasms_clinical download started this may take some time.
Approximate size to download 20.4 MB
[OK!]
chunkresolve_icdo_clinical download started this may take some time.
Approximate size to download 8.2 MB
[OK!]
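
The resolvers above are configured with `setConfidenceFunction("INVERSE")`. One plausible reading of such a function is sketched below: each candidate's confidence is proportional to the reciprocal of its distance, normalized so that the confidences sum to 1. The formula, the `eps` smoothing term and the distance values are assumptions for illustration only and are not taken from the library's implementation.

```python
def inverse_confidences(distances, eps=1e-9):
    """Map distances to confidences proportional to 1 / distance, summing to 1."""
    # eps (an assumption of this sketch) avoids division by zero for exact matches.
    inverse = [1.0 / (distance + eps) for distance in distances]
    total = sum(inverse)
    return [value / total for value in inverse]

# Hypothetical pooled distances for three candidate codes.
distances = [0.5, 1.0, 2.0]
confidences = inverse_confidences(distances)
print([round(confidence, 3) for confidence in confidences])
```

Under this reading, a candidate at distance close to zero takes almost all of the probability mass, which would be consistent with exact string matches reporting a confidence near 1.0.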


In [12]:
pipelineFull = Pipeline().setStages([
    docAssembler, 
    sentenceDetector, 
    tokenizer, 
    embeddings, 
    clinicalNer,
    bioNer,
    clinicalConverter,
    cancerConverter,
    clinicalChunkEmbeddings, 
    cancerChunkEmbeddings, 
    clinicalChunkTokenizer,
    cancerChunkTokenizer,
    snomedResolver,
    icd10cmResolver,
    icdoResolver
])

In [13]:
pipelineModelFull = pipelineFull.fit(data)

In [14]:
output = pipelineModelFull.transform(data).cache()

## The last part of our pipeline consists of the **ChunkEntityResolvers**:

### EntityResolver:  
Trained on an augmented ICD-O dataset from the JSL Data Market, the ICDO resolver provides histology code resolution for the matched expressions. Besides the code in the "result" field, it returns further metadata about the matching process:  

- all_k_results -> Sorted ResolverLabels in the top `alternatives` that match the distance `threshold`
- all_k_resolutions -> Respective ResolverNormalized strings
- all_k_confidences -> Respective normalized probabilities based on inverse distance values
- all_k_distances -> Respective distance values after aggregation
- all_k_wmd_distances -> Respective WMD distance values (added if allDistancesMetadata==True)
- all_k_tfidf_distances -> Respective TFIDF Cosine distance values (added if allDistancesMetadata==True)
- all_k_jaccard_distances -> Respective Jaccard distance values (added if allDistancesMetadata==True)
- all_k_sorensen_distances -> Respective SorensenDice distance values (added if allDistancesMetadata==True)
- all_k_jaro_distances -> Respective JaroWinkler distance values (added if allDistancesMetadata==True)
- all_k_levenshtein_distances -> Respective Levenshtein distance values (added if allDistancesMetadata==True)
- target_text -> The actual searched string
- resolved_text -> The top ResolverNormalized string
- confidence -> Top probability
- distance -> Top distance value
- sentence -> Sentence index
- chunk -> Chunk Index
- token -> Token index
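
The `all_k_*` fields arrive as single strings in which the alternatives are joined with `:::`, which is the separator the analysis function in this notebook splits on. A minimal sketch of unpacking two of those fields outside Spark follows; the helper name `parse_alternatives` and the sample dictionary are assumptions, with codes and labels copied from the ICD-O results shown in this notebook.

```python
def parse_alternatives(metadata, separator=":::"):
    """Pair each alternative code with its normalized description."""
    codes = metadata["all_k_results"].split(separator)
    names = metadata["all_k_resolutions"].split(separator)
    return list(zip(codes, names))

# Hand-built metadata dictionary in the assumed ':::'-joined layout.
sample_metadata = {
    "all_k_results": "8806/3:::8002/3:::8003/3",
    "all_k_resolutions": (
        "Desmoplastic small round cell tumor:::"
        "Malignant tumor, small cell type:::"
        "Malignant tumor, giant cell type"
    ),
}

alternatives = parse_alternatives(sample_metadata)
for code, name in alternatives:
    print(code, name)
```

Splitting on `:::` rather than on commas matters because many descriptions contain commas themselves.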

In [15]:
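def quick_metadata_analysis(df, doc_field, chunk_field, code_fields):
    """Return one row per chunk with its entity and, per resolver, the top confidence and the alternatives."""
    code_res_meta = ", ".join([f"{cf}.metadata" for cf in code_fields])
    # Zip the chunk annotations with each resolver's metadata and explode to one row per chunk.
    # Zipped fields are addressed by position: 0=begin, 1=end, 2=result, 3=chunk metadata, 4+=resolver metadata.
    expression = f"explode(arrays_zip({chunk_field}.begin, {chunk_field}.end, {chunk_field}.result, {chunk_field}.metadata, "+code_res_meta+")) as a"
    top_n_rest = [(f"float(a['{i+4}'].confidence) as {(cf.split('_')[0])}_conf",
                    f"arrays_zip(split(a['{i+4}'].all_k_results,':::'),split(a['{i+4}'].all_k_resolutions,':::')) as {cf.split('_')[0]+'_opts'}")
                    for i, cf in enumerate(code_fields)]
    # Flatten the (confidence, options) pairs into a single list of select expressions.
    top_n_rest_args = [t for tr in top_n_rest for t in tr]
    # Order by document and chunk offsets, then build the `coords` key as doc_id::begin::end.
    return df.selectExpr(doc_field, expression) \
        .orderBy(doc_field, F.expr("a['0']"), F.expr("a['1']"))\
        .selectExpr(f"concat_ws('::',{doc_field},a['0'],a['1']) as coords", "a['2'] as chunk","a['3'].entity as entity", *top_n_rest_args)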

In [16]:
snomed_analysis = \
quick_metadata_analysis(output, docid_col, chunk_clinical_col,[snomed_clinical_col]).toPandas()

In [17]:
cancer_analysis = \
quick_metadata_analysis(output, docid_col, chunk_cancer_col,[icd10cm_col, icdo_col]).toPandas()

In [18]:
snomed_analysis

Unnamed: 0,coords,chunk,entity,snomed_conf,snomed_opts
0,0::123::147,a large left scrotal mass,PROBLEM,0.2148,"[(10682271000119109, Mass of skin of left hand), (10692461000119101, Mass of skin of left thumb), (10682191000119102, Mass of skin of left foot), (10682231000119106, Mass of skin of left forearm), (15634751000119101, Mass of left ovary), (12240181000119103, Mass in left breast), (15748121000119108, Mass of joint of left foot), (13340001000004107, Mass of right submandibular region), (15744721000119103, Mass of joint of left hand), (15746361000119101, Mass of structure of left eye)]"
1,0::192::212,left scrotal swelling,PROBLEM,0.6447,"[(271687003, Swelling of scrotum), (762916009, Swelling of left foot), (15952141000119106, Left parotid gland swelling), (442648006, Swelling of left tonsil), (12242351000119109, Swelling of left arm), (438457000, Swelling of testicle), (441974004, Swelling of buttock), (341401000119104, Left conjunctival edema), (60728008, Swelling of abdomen), (15629941000119104, Left inguinal pain)]"
2,0::223::253,progressively increased in size,PROBLEM,1.0,"[(771339005, Hyperzincaemia and hypercalprotectinaemia), (51978008, Head lag in the newborn (disorder)), (762459007, Disorder due to and following breast reduction), (81543006, Abnormal hard tissue formation in pulp (disorder)), (772791006, Age-related loss of skeletal muscle mass), (191290006, Hemorrhagic disorder due to increase in anti-8a), (191291005, Hemorrhagic disorder due to increase in anti-9a), (315252002, Changing shape of pigmented skin lesion (disorder)), (52139007, Volume exces..."
3,0::279::300,mild left scrotal pain,PROBLEM,0.8921,"[(16675301000119100, Pain of left testicle), (316821000119105, Pain of left thigh), (15629941000119104, Left inguinal pain), (287047008, Pain in left leg), (722829006, Acute scrotal pain), (301368006, Left hypochondrial pain), (1076811000119109, Pain of left heel), (285387005, Left sided abdominal pain), (316751000119107, Pain in left foot), (316851000119102, Pain of left wrist)]"
4,0::329::345,mild constipation,PROBLEM,0.9955,"[(111360009, Intractable constipation), (197118003, Functional constipation), (370218001, Mild asthma), (430097009, Spastic constipation), (331987008, Mild dietary indigestion), (409587002, Severe diarrhea), (426979002, Mild persistent asthma), (397543001, Mild visual impairment), (308853007, Mild stomach dysplasia), (427679007, Mild intermittent asthma)]"
5,0::353::363,hard stools,PROBLEM,1.0,"[(75295004, Hard stools), (398032003, Loose stools), (35064005, Black stools), (64412006, Red stools), (167609007, Green stools), (300328001, Liver hard), (27731006, Soft stool), (449181000124106, Seedy stool), (449201000124107, Creamy stool), (267058009, Stool floats)]"
6,0::396::413,urinary complaints,PROBLEM,0.5742,"[(264832001, Urinary eosinophils), (5277004, Urinary casts), (38276004, Multiple complaints), (106102002, Abnormal urinary product), (7766007, Nebulous urine), (102850002, Urinary reducing substance), (703453009, Complaining of wooziness), (267022002, General symptom), (102840003, Urinary cast, erythrocyte), (102837003, Urinary cast, waxy)]"
7,0::419::438,physical examination,TEST,1.0,"[(5880005, Physical examination), (410184003, Physical examination management), (438547004, Postoperative physical examination), (103740001, Periodic physical examination), (363003006, Cardiovascular physical examination), (410182004, Physical examination assessment), (116302003, Physical examination maneuver), (81375008, Physical assessment), (27032005, Rhinolaryngologic examination), (84728005, Neurological examination)]"
8,0::441::466,a hard paratesticular mass,PROBLEM,1.0,"[(6370001000004104, Mass of hard palate), (300848003, Observation of a mass), (102031000119109, Paratesticular mass (disorder)), (444905003, Mass of soft tissue), (69559004, Mass of retroperitoneal structure), (237047009, Tubo-ovarian mass), (74285003, Mass of pelvic structure), (440299000, Mass of thoracic structure), (15636651000119101, Mass in both breasts), (126806005, Neoplasm of hard palate)]"
9,0::621::673,"a hard, lower abdominal mass in the suprapubic region",PROBLEM,0.7964,"[(163293006, On examination - abdominal mass-very hard), (163307004, On examination - abdominal mass - lower border defined), (274745006, Localised swelling, mass and lump, lower limb), (312355005, On examination - left lower abdominal mass), (457321000124103, Mass in nipple region of right breast), (438512007, Abdominal rigidity of periumbilical region (finding)), (448772000, Mass of abdominal cavity structure (finding)), (201489002, Arthropathy in Behcet syndrome of the pelvic region and t..."
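
The `coords` column packs the document id and the chunk's begin and end character offsets into one `::`-separated string. The sketch below shows how such a key can be mapped back to the source text, treating the end offset as inclusive; the helper names are hypothetical, and only the first sentence of note 0 (lower-cased, as in the dataset) is used as the document.

```python
def parse_coords(coords, separator="::"):
    """Split a 'doc_id::begin::end' key into its three parts."""
    doc_id, begin, end = coords.split(separator)
    return doc_id, int(begin), int(end)

def chunk_from_coords(texts, coords):
    """Recover the chunk text that a coords key points to."""
    doc_id, begin, end = parse_coords(coords)
    return texts[int(doc_id)][begin:end + 1]  # the end offset is inclusive

# First sentence of note 0, lower-cased like the `description` column.
texts = [
    "a 35-year-old african-american man was referred to our urology clinic "
    "by his primary care physician for consultation about a large left scrotal mass."
]

recovered = chunk_from_coords(texts, "0::123::147")
print(recovered)
```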


In [19]:
cancer_analysis

Unnamed: 0,coords,chunk,entity,icd10cm_conf,icd10cm_opts,icdo_conf,icdo_opts
0,0::448::461,paratesticular,Cancer,1.0,"[(C6290, Malignant neoplasm of unspecified testis, unspecified whether descended or undescended), (D481, Neoplasm of uncertain behavior of connective and other soft tissue), (D480, Neoplasm of uncertain behavior of bone and articular cartilage), (C480, Malignant neoplasm of retroperitoneum), (C493, Malignant neoplasm of connective and soft tissue of thorax)]",0.4685,"[(9540/3, Malignant peripheral nerve sheath tumor), (9502/3, Teratoid medulloepithelioma), (9150/1, Hemangiopericytoma, NOS), (8815/3, Solitary fibrous tumor, malignant), (9150/3, Hemangiopericytoma, malignant)]"
1,0::1078::1103,testicular germ cell tumor,Cancer,1.0,"[(C6290, Malignant neoplasm of unspecified testis, unspecified whether descended or undescended), (C801, Malignant (primary) neoplasm, unspecified), (C561, Malignant neoplasm of right ovary), (C562, Malignant neoplasm of left ovary), (C569, Malignant neoplasm of unspecified ovary)]",0.4306,"[(9085/3, Mixed germ cell tumor), (9065/3, Germ cell tumor, nonseminomatous), (8621/3, Granulosa cell-theca cell tumor, mal.), (8650/3, Leydig cell tumor, malignant), (8620/3, Granulosa cell tumor, malignant)]"
2,0::1632::1644,pelvic masses,Cancer,0.419,"[(C763, Malignant neoplasm of pelvis), (D3616, Benign neoplasm of peripheral nerves and autonomic nervous system of pelvis), (D3615, Benign neoplasm of peripheral nerves and autonomic nervous system of abdomen), (C481, Malignant neoplasm of specified parts of peritoneum), (C495, Malignant neoplasm of connective and soft tissue of pelvis)]",0.3418,"[(8312/3, Renal cell carcinoma), (8000/0, Neoplasm, benign), (9350/1, Craniopharyngioma), (9130/0, Hemangioendothelioma, benign), (8815/0, Solitary fibrous tumor)]"
3,0::2429::2463,desmoplastic small round cell tumor,Cancer,0.212,"[(C784, Secondary malignant neoplasm of small intestine), (C8307, Small cell B-cell lymphoma, spleen), (C801, Malignant (primary) neoplasm, unspecified), (C3491, Malignant neoplasm of unspecified part of right bronchus or lung), (C3492, Malignant neoplasm of unspecified part of left bronchus or lung)]",1.0,"[(8806/3, Desmoplastic small round cell tumor), (8002/3, Malignant tumor, small cell type), (8003/3, Malignant tumor, giant cell type), (8005/3, Malignant tumor, clear cell type), (9252/3, Malignant tenosynovial giant cell tumor)]"
4,1::370::397,chronic lymphocytic leukemia,Cancer,1.0,"[(C9110, Chronic lymphocytic leukemia of B-cell type not having achieved remission), (C9210, Chronic myeloid leukemia, BCR/ABL-positive, not having achieved remission), (C9310, Chronic myelomonocytic leukemia not having achieved remission), (C9102, Acute lymphoblastic leukemia, in relapse), (C9101, Acute lymphoblastic leukemia, in remission)]",0.3443,"[(9805/3, Acute biphenotypic leukemia), (9729/3, Precursor T-cell lymphoblastic lymphoma), (9946/3, Juvenile myelomonocytic leukemia), (9963/3, Chronic neutrophilic leukemia), (9835/3, Precursor cell lymphoblastic leukemia, NOS)]"
5,1::411::430,lymphocytic lymphoma,Cancer,1.0,"[(C8350, Lymphoblastic (diffuse) lymphoma, unspecified site), (C8300, Small cell B-cell lymphoma, unspecified site), (C880, Waldenstrom macroglobulinemia), (C8143, Lymphocyte-rich Hodgkin lymphoma, intra-abdominal lymph nodes), (C8146, Lymphocyte-rich Hodgkin lymphoma, intrapelvic lymph nodes)]",0.2147,"[(9764/3, Immunoproliferative small intestinal disease), (9673/3, Mantle cell lymphoma), (9761/3, Waldenstrom macroglobulinemia), (9701/3, Sezary syndrome), (9651/3, Hodgkin lymphoma, lymphocyte-rich)]"
6,2::386::416,central nervous system lymphoma,Cancer,0.4149,"[(C8589, Other specified types of non-Hodgkin lymphoma, extranodal and solid organ sites), (C8380, Other non-follicular lymphoma, unspecified site), (C8387, Other non-follicular lymphoma, spleen), (C729, Malignant neoplasm of central nervous system, unspecified), (C8389, Other non-follicular lymphoma, extranodal and solid organ sites)]",0.3507,"[(9501/3, Medulloepithelioma, NOS), (9350/1, Craniopharyngioma), (9475/3, Medulloblastoma, WNT-activated), (9477/3, Medulloblastoma, non-WNT/non-SHH), (9591/3, Malignant lymphoma, non-Hodgkin)]"
7,3::373::380,lymphoma,Cancer,0.5,"[(C8590, Non-Hodgkin lymphoma, unspecified, unspecified site), (C862, Enteropathy-type (intestinal) T-cell lymphoma), (C8580, Other specified types of non-Hodgkin lymphoma, unspecified site), (C8290, Follicular lymphoma, unspecified, unspecified site), (C8350, Lymphoblastic (diffuse) lymphoma, unspecified site)]",0.5,"[(9591/3, Malignant lymphoma, non-Hodgkin), (9673/3, Mantle cell lymphoma), (9651/3, Hodgkin lymphoma, lymphocyte-rich), (9701/3, Sezary syndrome), (9764/3, Immunoproliferative small intestinal disease)]"
8,3::408::436,left axillary lymphadenopathy,Cancer,0.7554,"[(R590, Localized enlarged lymph nodes), (C50622, Malignant neoplasm of axillary tail of left male breast), (C8594, Non-Hodgkin lymphoma, unspecified, lymph nodes of axilla and upper limb), (C8514, Unspecified B-cell lymphoma, lymph nodes of axilla and upper limb), (C84A4, Cutaneous T-cell lymphoma, unspecified, lymph nodes of axilla and upper limb)]",0.3171,"[(8312/3, Renal cell carcinoma), (9705/3, Angioimmunoblastic T-cell lymphoma), (8000/0, Neoplasm, benign), (8000/3, Neoplasm, malignant), (9831/3, T-cell large granular lymphocytic leukemia)]"
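
Because the two resolvers often disagree in confidence for the same chunk, a downstream step might keep a code only when its confidence clears a cut-off. The sketch below applies that idea to three rows copied from the table above (top option and confidence per resolver); the 0.9 threshold and the helper name `accept` are hypothetical choices for this example, not a recommendation.

```python
MIN_CONFIDENCE = 0.9  # hypothetical cut-off, chosen only for this example

# (top code, confidence) per resolver, copied from the cancer_analysis output.
rows = [
    {"chunk": "desmoplastic small round cell tumor",
     "icd10cm": ("C784", 0.212), "icdo": ("8806/3", 1.0)},
    {"chunk": "chronic lymphocytic leukemia",
     "icd10cm": ("C9110", 1.0), "icdo": ("9805/3", 0.3443)},
    {"chunk": "pelvic masses",
     "icd10cm": ("C763", 0.419), "icdo": ("8312/3", 0.3418)},
]

def accept(candidate, threshold=MIN_CONFIDENCE):
    """Return the code if its confidence reaches the threshold, else None."""
    code, confidence = candidate
    return code if confidence >= threshold else None

accepted = {
    row["chunk"]: {"icd10cm": accept(row["icd10cm"]), "icdo": accept(row["icdo"])}
    for row in rows
}
for chunk, codes in accepted.items():
    print(chunk, codes)
```

With this cut-off, the desmoplastic small round cell tumor keeps only its ICD-O code, chronic lymphocytic leukemia keeps only its ICD10CM code, and the generic "pelvic masses" chunk keeps neither.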
