![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.4.Resolving_Medical_Terms_to_Terminology_Codes_Directly.ipynb)

This notebook shows how to speed up getting outputs from a `SentenceEntityResolverModel`. First, we extract the named entities that are relevant to the resolver's terminology with an NER pipeline. Then we build a 3-stage resolver pipeline and get the resolutions of the extracted entities.

## Colab Setup

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel
import pandas as pd
pd.set_option('display.max_colwidth', 100)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.2
Spark NLP_JSL Version : 5.3.3


## Data

In this example, we will use the MT Samples dataset: we extract entities from the notes and map them to their corresponding ICD-10-CM codes.

In [None]:
# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_samples_10.csv

In [None]:
mt_samples_df = spark.read.csv("mt_samples_10.csv", header=True, multiLine=True)

mt_samples_df.show()

+-----+--------------------+
|index|                text|
+-----+--------------------+
|    0|Sample Type / Med...|
|    1|Sample Type / Med...|
|    2|Sample Type / Med...|
|    3|Sample Type / Med...|
|    4|Sample Type / Med...|
|    5|Sample Type / Med...|
|    6|Sample Type / Med...|
|    7|Sample Type / Med...|
|    8|Sample Type / Med...|
|    9|Sample Type / Med...|
+-----+--------------------+



Let's check what the data looks like.

In [None]:
print(mt_samples_df.limit(1).collect()[0]['text'])

Sample Type / Medical Specialty:
Hematology - Oncology
Sample Name:
Discharge Summary - Mesothelioma - 1
Description:
Mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:
Mesothelioma.
SECONDARY DIAGNOSES:
Pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
PROCEDURES
1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.
2. On August 20, 2007, thoracentesis.
3. On August 31, 2007, Port-A-Cath placement.
HISTORY AND PHYSICAL:
The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-sided chest pain, and went to an urgent care cen

## Clinical NER Pipeline (with pretrained models)

The entities we feed to the resolver model should be related to its concept (ICD-10-CM in this case), so we need a robust NER pipeline for entity extraction. Here we use only the `ner_jsl` model, which has more than 80 labels, and keep the labels relevant to ICD-10-CM with a whitelist. You can also run several NER models in the same pipeline and merge their results into a single chunk column with the `ChunkMergeApproach` annotator.

To speed up the process, we will use the pipeline's `.transform` method for entity extraction. This way, we can repartition the dataset according to the resources we have and get the results faster.

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

jsl_ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("clinical_ner_chunk")\
    .setWhiteList(['Cerebrovascular_Disease',
                   'Communicable_Disease', 'Diabetes',
                   'Disease_Syndrome_Disorder',
                   'EKG_Findings', 'Heart_Disease',
                   'Hyperlipidemia', 'Hypertension',
                   'ImagingFindings', 'Injury_or_Poisoning',
                   'Kidney_Disease', 'Obesity', 'Oncological',
                   'Overweight', 'Pregnancy',
                   'Psychological_Condition', 'Symptom', 'VS_Finding'])

jsl_ner_pipeline = Pipeline(
    stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      jsl_ner,
      jsl_ner_converter])

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


Now we will fit the NER pipeline and transform our data with it.

In [None]:
result = jsl_ner_pipeline.fit(mt_samples_df).transform(mt_samples_df)
result = result.withColumnRenamed("index", "doc_id")
result.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|          embeddings|        clinical_ner|  clinical_ner_chunk|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 53...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 88, 99, ...|
|     1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 53...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 344, 363...|
|     2|Sample Type / Med...|[{document, 0, 42...|[{document, 0, 53...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 73, ...|
|     3|Sample Type / Med...|[{document, 0, 20...|[{document, 0,

We need the detected entities as a list for the next step, so we will explode the dataset and convert it to a Pandas dataframe.

In [None]:
ner_result_df = result.select("doc_id", F.explode(F.arrays_zip(result.clinical_ner_chunk.result,
                                     result.clinical_ner_chunk.begin,
                                     result.clinical_ner_chunk.end,
                                     result.clinical_ner_chunk.metadata)).alias("cols"))\
        .select("doc_id", F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']").alias("begin"),
                F.expr("cols['2']").alias("end"),
                F.expr("cols['3']['entity']").alias("entity"),
                F.expr("cols['3']['ner_source']").alias("ner_source")).toPandas()
ner_result_df

Unnamed: 0,doc_id,chunk,begin,end,entity,ner_source
0,0,Mesothelioma,88,99,Oncological,clinical_ner_chunk
1,0,Mesothelioma,118,129,Oncological,clinical_ner_chunk
2,0,pleural effusion,132,147,Disease_Syndrome_Disorder,clinical_ner_chunk
3,0,atrial fibrillation,150,168,Heart_Disease,clinical_ner_chunk
4,0,anemia,171,176,Disease_Syndrome_Disorder,clinical_ner_chunk
...,...,...,...,...,...,...
297,9,ductal carcinoma of the breast,593,622,Oncological,clinical_ner_chunk
298,9,axillary adenopathy,846,864,Symptom,clinical_ner_chunk
299,9,lesion,905,910,Symptom,clinical_ner_chunk
300,9,wound,1673,1677,Symptom,clinical_ner_chunk
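
The zip-then-explode step used above can be pictured in plain Python. This is only an illustrative sketch: the `rows` structure is a simplified stand-in for the Spark annotation columns (in the real dataframe the entity label lives inside each chunk's metadata map), and the values are copied from the first two rows of the output.

```python
# One row per document, holding parallel lists (a simplified stand-in
# for the `clinical_ner_chunk` annotation fields).
rows = [
    {
        "doc_id": 0,
        "result": ["Mesothelioma", "pleural effusion"],
        "begin": [88, 132],
        "end": [99, 147],
        "entity": ["Oncological", "Disease_Syndrome_Disorder"],
    },
]

# zip() pairs the i-th elements of each list (like F.arrays_zip);
# the nested loop emits one output row per chunk (like F.explode).
exploded = [
    {"doc_id": row["doc_id"], "chunk": chunk, "begin": begin, "end": end, "entity": entity}
    for row in rows
    for chunk, begin, end, entity in zip(row["result"], row["begin"], row["end"], row["entity"])
]

for item in exploded:
    print(item)
```

Each document row becomes as many rows as it has chunks, which is why ten documents produce a few hundred rows in `ner_result_df`.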


## Entity Resolution

We will create a 3-stage pipeline with `DocumentAssembler`, `BertSentenceEmbeddings` and `SentenceEntityResolverModel` components. Then we will create a `LightPipeline` and feed the entity list into it.

Spark NLP for Healthcare has several ICD-10-CM Sentence Entity Resolver models, trained with different embeddings or on datasets of different sizes. You can [check here](https://nlp.johnsnowlabs.com/models?q=icd10cm&task=Entity+Resolution) and use one of them in your pipeline.

Here we will use the `sbiobertresolve_icd10cm_slim_billable_hcc` resolver model to get the ICD-10-CM codes of the detected entities. It returns the official resolution text within brackets and also provides the billable and HCC information of the codes in the `all_k_aux_labels` field of the metadata. Each entry in this field can be split to get further details: `billable status || hcc status || hcc score`. For example, if an `all_k_aux_labels` entry is `1||1||19`, the `billable status` is 1, the `hcc status` is 1, and the `hcc score` is 19.
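
The `1||1||19` example can be unpacked with a few lines of plain Python. This is a minimal sketch, not part of the library: `AuxLabel` and `parse_aux_label` are hypothetical names, and it assumes all three fields are always integers, as they are in the outputs shown in this notebook.

```python
from typing import NamedTuple


class AuxLabel(NamedTuple):
    """Hypothetical container for one `all_k_aux_labels` entry."""
    billable: int
    hcc_status: int
    hcc_score: int


def parse_aux_label(label: str) -> AuxLabel:
    """Split a 'billable||hcc_status||hcc_score' string into integers."""
    billable, hcc_status, hcc_score = label.split("||")
    return AuxLabel(int(billable), int(hcc_status), int(hcc_score))


parsed = parse_aux_label("1||1||19")
print(parsed)  # AuxLabel(billable=1, hcc_status=1, hcc_score=19)
```

A named tuple keeps the three fields readable (`parsed.hcc_score`) instead of relying on positional indexes.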

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("chunk")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)

# NOTE: despite the `rxnorm_` prefix in the variable names, this is an ICD-10-CM resolver.
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        rxnorm_resolver])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_slim_billable_hcc download started this may take some time.
[OK!]


### LightPipelines

In [None]:
lmodel=LightPipeline(rxnorm_pipelineModel)

Now we will build a list of unique chunks so that we do not compute BERT embeddings for duplicates. Then we will feed this list to the resolver pipeline.

In [None]:
chunk_list = ner_result_df.chunk.unique().tolist()
len(chunk_list)

210

In [None]:
chunk_list[:10]

['Mesothelioma',
 'pleural effusion',
 'atrial fibrillation',
 'anemia',
 'ascites',
 'esophageal reflux',
 'deep venous thrombosis',
 'Pleural effusion',
 'cough',
 'chest pain']
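
The list above still contains both `pleural effusion` and `Pleural effusion`, because exact-match de-duplication treats them as different strings. The sketch below contrasts the two behaviours in plain Python. The case-insensitive variant is a hypothetical extension, not something this notebook does, and it assumes that casing does not affect the resolution you want.

```python
chunks = ["Mesothelioma", "Mesothelioma", "pleural effusion", "Pleural effusion", "cough"]

# Exact-match de-duplication that keeps first-seen order,
# which is also the order `Series.unique()` returns.
unique_exact = list(dict.fromkeys(chunks))

# Hypothetical stricter variant: chunks that differ only by case
# collapse onto the first spelling that was seen.
first_seen = {}
for chunk in chunks:
    first_seen.setdefault(chunk.lower(), chunk)
unique_caseless = list(first_seen.values())

print(unique_exact)     # ['Mesothelioma', 'pleural effusion', 'Pleural effusion', 'cough']
print(unique_caseless)  # ['Mesothelioma', 'pleural effusion', 'cough']
```

If you adopt the case-insensitive variant, any later join back to the NER results would need to use the same lowercased key rather than the raw `chunk` string.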

In [None]:
%%time
results = lmodel.fullAnnotate(chunk_list)

CPU times: user 5.66 s, sys: 1.55 s, total: 7.21 s
Wall time: 2min 20s


The output above shows how long `fullAnnotate` took to run on the unique chunk list.

In [None]:
%%time

chunks = [i["icd10cm_code"][0].metadata["target_text"] for i in results]
codes = [i["icd10cm_code"][0].result for i in results]
resolutions = [i["icd10cm_code"][0].metadata["resolved_text"] for i in results]
all_codes = [i["icd10cm_code"][0].metadata["all_k_results"].split(':::') for i in results]
all_resolutions = [i["icd10cm_code"][0].metadata["all_k_resolutions"].split(':::') for i in results]
all_k_aux_labels = [i["icd10cm_code"][0].metadata["all_k_aux_labels"].split(':::') for i in results]

resolver_result_df = pd.DataFrame({'chunk':chunks,
                                   'icd10cm_code':codes,
                                   'resolution':resolutions,
                                   'all_codes':all_codes,
                                   'all_resolutions':all_resolutions,
                                   'all_k_aux_labels':all_k_aux_labels})

resolver_result_df

CPU times: user 110 ms, sys: 16.2 ms, total: 126 ms
Wall time: 181 ms


Unnamed: 0,chunk,icd10cm_code,resolution,all_codes,all_resolutions,all_k_aux_labels
0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0||0||0, 1||1||9, 1||1||9, 1||1||9, 1||1||9, 0||0||0, 1||1||8, 0||0||0, 1||1||11, 1||1||12, 0||..."
1,pleural effusion,J94.0,chylous effusion [chylous effusion],"[J94.0, J91.0, J92, S27.63, J91, R09.1, J81, R09.3, Q34.0, S27.6, R07.81, J86, B59, S27.63XS, Q3...","[chylous effusion [chylous effusion], malignant pleural effusion [malignant pleural effusion], p...","[1||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0|..."
2,atrial fibrillation,I48.1,persistent atrial fibrillation [persistent atrial fibrillation],"[I48.1, I48.2, I48.0, I48.21, I48, I48.19, I48.11, I49.01, I48.4, I48.9, I49.0, I48.91, I49.1, I...","[persistent atrial fibrillation [persistent atrial fibrillation], chronic atrial fibrillation [c...","[0||0||0, 0||0||0, 1||1||96, 1||1||96, 0||0||0, 1||1||96, 1||1||96, 1||1||84, 1||1||96, 0||0||0,..."
3,anemia,D53.2,scorbutic anemia [scorbutic anemia],"[D53.2, D50, D72.825, D53.0, D74, R43.0, D70, D75.83, E50, D52, E71.110, D46.4, D72.810, R82.3, ...","[scorbutic anemia [scorbutic anemia], iron deficiency anemia [iron deficiency anemia], bandemia ...","[1||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 0||0||0, 1||1|..."
4,ascites,R18,ascites [ascites],"[R18, R06.6, H53.54, L94.6, R14.2, Q06.0, W17.0, R43.0, J94.0, R46.3, Y93.B4, T78.41, T48.4, T48...","[ascites [ascites], hiccough [hiccough], protanomaly [protanomaly], ainhum [ainhum], eructation ...","[0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||1||72, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0..."
...,...,...,...,...,...,...
205,lumps,R06.6,hiccough [hiccough],"[R06.6, R18, R60.0, L02.23, L02.13, L02.43, R43.0, R06.5, R14.2, J94.0, L02.03, H57.03, R14.0, J...","[hiccough [hiccough], ascites [ascites], localized edema [localized edema], carbuncle of trunk [...","[1||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0|..."
206,bumps,L94.6,ainhum [ainhum],"[L94.6, R06.6, R18, Q06.0, H53.54, R43.0, W17.0, R14.2, H51.12, B48.3, R25.3, Z93, R19.2, T78.41...","[ainhum [ainhum], hiccough [hiccough], ascites [ascites], amyelia [amyelia], protanomaly [protan...","[1||0||0, 1||0||0, 0||0||0, 1||1||72, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0..."
207,hepatomegaly,K76.7,hepatorenal syndrome [hepatorenal syndrome],"[K76.7, K76.4, K72.1, D73.2, Q44.6, K76.81, P59.1, R94.5, D61.82, E72.23, K71, R82.0, S36.11, K7...","[hepatorenal syndrome [hepatorenal syndrome], peliosis hepatis [peliosis hepatis], chronic hepat...","[1||1||27, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||1||27, 1||0||0, 1||0||0, 1||1||46, 1||1||23, 0..."
208,axillary adenopathy,J35.02,chronic adenoiditis [chronic adenoiditis],"[J35.02, K11.2, L40.2, E71.522, M89.09, K14.0, M31.0, H02.51, K11.23, M85.3, I77.4, M76.81, M89....","[chronic adenoiditis [chronic adenoiditis], sialoadenitis [sialoadenitis], acrodermatitis contin...","[1||0||0, 0||0||0, 1||0||0, 1||1||23, 1||0||0, 1||0||0, 1||1||40, 0||0||0, 1||0||0, 0||0||0, 1||..."
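
The `all_codes`, `all_resolutions`, and `all_k_aux_labels` columns are parallel lists: the i-th entry of each describes the same candidate. The sketch below lines them up for one chunk, using the first two `Mesothelioma` candidates from the table (the real metadata holds many more). Picking the first billable candidate is a hypothetical post-processing step shown for illustration, not a recommendation from the model authors.

```python
# Metadata strings as returned by the resolver, shortened to two candidates.
metadata = {
    "all_k_results": "C45:::C45.0",
    "all_k_resolutions": "mesothelioma [mesothelioma]:::mesothelioma of pleura [mesothelioma of pleura]",
    "all_k_aux_labels": "0||0||0:::1||1||9",
}

codes = metadata["all_k_results"].split(":::")
resolutions = metadata["all_k_resolutions"].split(":::")
aux_labels = metadata["all_k_aux_labels"].split(":::")

# Line up the i-th code, resolution text, and aux label of each candidate.
candidates = [
    {"code": code, "resolution": resolution, "billable": aux.split("||")[0] == "1"}
    for code, resolution, aux in zip(codes, resolutions, aux_labels)
]

# Hypothetical rule: take the closest candidate that is flagged as billable.
first_billable = next((c for c in candidates if c["billable"]), None)
print(first_billable["code"])  # C45.0
```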


As you can see, the NER pipeline detected 302 chunks, but we only resolved the 210 unique ones. This let us use resources and time more efficiently.
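
As a quick worked example of the saving, the arithmetic on the two counts is shown below. It only measures how many resolver inputs were avoided; it assumes nothing about how wall-clock time scales with the number of inputs.

```python
total_chunks = 302   # chunks detected by the NER pipeline
unique_chunks = 210  # distinct chunk strings sent to the resolver

skipped = total_chunks - unique_chunks
reduction = skipped / total_chunks

print(f"{skipped} resolver inputs avoided ({reduction:.1%} fewer)")
# 92 resolver inputs avoided (30.5% fewer)
```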

Now let's merge the resolutions with `ner_result_df`.

In [None]:
merged_resolver_df = pd.merge(ner_result_df, resolver_result_df, on="chunk", how="left")
merged_resolver_df.drop(columns=['begin','end','entity','ner_source'], inplace=True)
merged_resolver_df

Unnamed: 0,doc_id,chunk,icd10cm_code,resolution,all_codes,all_resolutions,all_k_aux_labels
0,0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0||0||0, 1||1||9, 1||1||9, 1||1||9, 1||1||9, 0||0||0, 1||1||8, 0||0||0, 1||1||11, 1||1||12, 0||..."
1,0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0||0||0, 1||1||9, 1||1||9, 1||1||9, 1||1||9, 0||0||0, 1||1||8, 0||0||0, 1||1||11, 1||1||12, 0||..."
2,0,pleural effusion,J94.0,chylous effusion [chylous effusion],"[J94.0, J91.0, J92, S27.63, J91, R09.1, J81, R09.3, Q34.0, S27.6, R07.81, J86, B59, S27.63XS, Q3...","[chylous effusion [chylous effusion], malignant pleural effusion [malignant pleural effusion], p...","[1||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0|..."
3,0,atrial fibrillation,I48.1,persistent atrial fibrillation [persistent atrial fibrillation],"[I48.1, I48.2, I48.0, I48.21, I48, I48.19, I48.11, I49.01, I48.4, I48.9, I49.0, I48.91, I49.1, I...","[persistent atrial fibrillation [persistent atrial fibrillation], chronic atrial fibrillation [c...","[0||0||0, 0||0||0, 1||1||96, 1||1||96, 0||0||0, 1||1||96, 1||1||96, 1||1||84, 1||1||96, 0||0||0,..."
4,0,anemia,D53.2,scorbutic anemia [scorbutic anemia],"[D53.2, D50, D72.825, D53.0, D74, R43.0, D70, D75.83, E50, D52, E71.110, D46.4, D72.810, R82.3, ...","[scorbutic anemia [scorbutic anemia], iron deficiency anemia [iron deficiency anemia], bandemia ...","[1||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 0||0||0, 1||1|..."
...,...,...,...,...,...,...,...
297,9,ductal carcinoma of the breast,D05.0,lobular carcinoma in situ of breast [lobular carcinoma in situ of breast],"[D05.0, D05, D05.1, C44.521, C4A.52, C50.6, C44.511, C50, C50.1, D05.12, D05.02, C50.11, C50.61,...","[lobular carcinoma in situ of breast [lobular carcinoma in situ of breast], carcinoma in situ of...","[0||0||0, 0||0||0, 0||0||0, 1||0||0, 1||1||12, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||0..."
298,9,axillary adenopathy,J35.02,chronic adenoiditis [chronic adenoiditis],"[J35.02, K11.2, L40.2, E71.522, M89.09, K14.0, M31.0, H02.51, K11.23, M85.3, I77.4, M76.81, M89....","[chronic adenoiditis [chronic adenoiditis], sialoadenitis [sialoadenitis], acrodermatitis contin...","[1||0||0, 0||0||0, 1||0||0, 1||1||23, 1||0||0, 1||0||0, 1||1||40, 0||0||0, 1||0||0, 0||0||0, 1||..."
299,9,lesion,J63.4,siderosis [siderosis],"[J63.4, R14.2, Z73.82, T75.4, M75, H83.2X, H83.2, R29.2, Q72.7, L02.42, S05.7, L02.43, Q69.0, R4...","[siderosis [siderosis], eructation [eructation], dual sensory impairment [dual sensory impairmen...","[1||1||112, 1||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 0||..."
300,9,wound,S51.8,open wound of forearm [open wound of forearm],"[S51.8, S81.8, S01, S01.0, S61.4, R14.2, L02.42, S11, R60.0, Q72.7, H11.2, S91.3, S01.3, Q70.1, ...","[open wound of forearm [open wound of forearm], open wound of lower leg [open wound of lower leg...","[0||0||0, 0||0||0, 0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0|..."
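
The left merge above is effectively a lookup: every NER row keeps its place and picks up the resolution stored for its chunk text. A plain-Python sketch of the same idea, with a small hand-made sample taken from the tables, looks like this. The dictionary-based approach is an illustration only, not how pandas implements `merge`.

```python
# Every detected chunk, duplicates included (as in `ner_result_df`).
ner_rows = [
    {"doc_id": 0, "chunk": "Mesothelioma"},
    {"doc_id": 0, "chunk": "Mesothelioma"},
    {"doc_id": 0, "chunk": "pleural effusion"},
]

# One resolution per unique chunk (as in `resolver_result_df`).
resolved = {"Mesothelioma": "C45", "pleural effusion": "J94.0"}

# Left join: keep every NER row; a chunk with no resolution would get None.
merged = [{**row, "icd10cm_code": resolved.get(row["chunk"])} for row in ner_rows]

for row in merged:
    print(row)
```

Both `Mesothelioma` rows receive the same code even though the resolver only saw that chunk once.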


Together, the timings above show the total time taken to return all the results. Additionally, you can split the information in `all_k_aux_labels` to see the `billable`, `hcc_status`, and `hcc_code` details of each candidate (`hcc_code` holds the third field, which was called `hcc score` earlier).

In [None]:
merged_resolver_df['billable'] = merged_resolver_df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
merged_resolver_df['hcc_status'] = merged_resolver_df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
merged_resolver_df['hcc_code'] = merged_resolver_df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])
merged_resolver_df = merged_resolver_df.drop(['all_k_aux_labels'], axis=1)

merged_resolver_df.head(15)

Unnamed: 0,doc_id,chunk,icd10cm_code,resolution,all_codes,all_resolutions,billable,hcc_status,hcc_code
0,0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 9, 9, 9, 9, 0, 8, 0, 11, 12, 0, 8, 9, 10, 0, 0, 0, 0, 0, 0, 11, 0, 11, 10, 12]"
1,0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 9, 9, 9, 9, 0, 8, 0, 11, 12, 0, 8, 9, 10, 0, 0, 0, 0, 0, 0, 11, 0, 11, 10, 12]"
2,0,pleural effusion,J94.0,chylous effusion [chylous effusion],"[J94.0, J91.0, J92, S27.63, J91, R09.1, J81, R09.3, Q34.0, S27.6, R07.81, J86, B59, S27.63XS, Q3...","[chylous effusion [chylous effusion], malignant pleural effusion [malignant pleural effusion], p...","[1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 84, 0]"
3,0,atrial fibrillation,I48.1,persistent atrial fibrillation [persistent atrial fibrillation],"[I48.1, I48.2, I48.0, I48.21, I48, I48.19, I48.11, I49.01, I48.4, I48.9, I49.0, I48.91, I49.1, I...","[persistent atrial fibrillation [persistent atrial fibrillation], chronic atrial fibrillation [c...","[0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]","[0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]","[0, 0, 96, 96, 0, 96, 96, 84, 96, 0, 0, 96, 0, 96, 84, 0, 96, 0, 96, 85, 0, 0, 0, 96, 0]"
4,0,anemia,D53.2,scorbutic anemia [scorbutic anemia],"[D53.2, D50, D72.825, D53.0, D74, R43.0, D70, D75.83, E50, D52, E71.110, D46.4, D72.810, R82.3, ...","[scorbutic anemia [scorbutic anemia], iron deficiency anemia [iron deficiency anemia], bandemia ...","[1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 46, 0, 0, 23, 0, 0, 23, 0, 0, 0, 0, 0, 0, 47]"
5,0,ascites,R18,ascites [ascites],"[R18, R06.6, H53.54, L94.6, R14.2, Q06.0, W17.0, R43.0, J94.0, R46.3, Y93.B4, T78.41, T48.4, T48...","[ascites [ascites], hiccough [hiccough], protanomaly [protanomaly], ainhum [ainhum], eructation ...","[0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 72, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
6,0,esophageal reflux,K21,gastro-esophageal reflux disease [gastro-esophageal reflux disease],"[K21, K22.2, K21.0, K22.4, K20.8, K20, T28.6, T85.521, K22.3, T18.1, T85.521S, T18.11, P78.83, T...","[gastro-esophageal reflux disease [gastro-esophageal reflux disease], esophageal obstruction [es...","[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 188, 0, 0]"
7,0,deep venous thrombosis,I81,portal vein thrombosis [portal vein thrombosis],"[I81, I82.72, I82.5, I82.592, I82.722, K64.5, I82.62, I82.591, I82.492, I82.721, I74, I82.4, I82...","[portal vein thrombosis [portal vein thrombosis], chronic embolism and thrombosis of deep veins ...","[1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]","[0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]","[0, 0, 0, 108, 108, 0, 0, 108, 108, 108, 0, 0, 108, 108, 0, 0, 0, 108, 108, 0, 0, 0]"
8,0,Mesothelioma,C45,mesothelioma [mesothelioma],"[C45, C45.0, C45.9, C45.1, C45.2, C4A, C7B.1, C96.2, C30.1, Q85.03, C34.0, C78.1, C45.7, C96.22,...","[mesothelioma [mesothelioma], mesothelioma of pleura [mesothelioma of pleura], mesothelioma, uns...","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]","[0, 9, 9, 9, 9, 0, 8, 0, 11, 12, 0, 8, 9, 10, 0, 0, 0, 0, 0, 0, 11, 0, 11, 10, 12]"
9,0,Pleural effusion,J94.0,chylous effusion [chylous effusion],"[J94.0, J91.0, J92, S27.63, J91, R09.1, J81, R09.3, Q34.0, S27.6, R07.81, J86, B59, S27.63XS, Q3...","[chylous effusion [chylous effusion], malignant pleural effusion [malignant pleural effusion], p...","[1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 84, 0]"


We can save `merged_resolver_df` to a CSV file, `result_df.csv`, to use it later.

In [None]:
merged_resolver_df.to_csv("result_df.csv", index=False)

In [None]:
# Read from the same relative path the file was written to.
merged_resolver_df = pd.read_csv("result_df.csv")
merged_resolver_df.head(10)

Unnamed: 0,doc_id,chunk,icd10cm_code,resolution,all_codes,all_resolutions,billable,hcc_status,hcc_code
0,0,Mesothelioma,C45,mesothelioma [mesothelioma],"['C45', 'C45.0', 'C45.9', 'C45.1', 'C45.2', 'C4A', 'C7B.1', 'C96.2', 'C30.1', 'Q85.03', 'C34.0',...","['mesothelioma [mesothelioma]', 'mesothelioma of pleura [mesothelioma of pleura]', 'mesothelioma...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0', ...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '0', '0', '0', '0', '0', ...","['0', '9', '9', '9', '9', '0', '8', '0', '11', '12', '0', '8', '9', '10', '0', '0', '0', '0', '0..."
1,0,Mesothelioma,C45,mesothelioma [mesothelioma],"['C45', 'C45.0', 'C45.9', 'C45.1', 'C45.2', 'C4A', 'C7B.1', 'C96.2', 'C30.1', 'Q85.03', 'C34.0',...","['mesothelioma [mesothelioma]', 'mesothelioma of pleura [mesothelioma of pleura]', 'mesothelioma...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0', ...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '0', '0', '0', '0', '0', ...","['0', '9', '9', '9', '9', '0', '8', '0', '11', '12', '0', '8', '9', '10', '0', '0', '0', '0', '0..."
2,0,pleural effusion,J94.0,chylous effusion [chylous effusion],"['J94.0', 'J91.0', 'J92', 'S27.63', 'J91', 'R09.1', 'J81', 'R09.3', 'Q34.0', 'S27.6', 'R07.81', ...","['chylous effusion [chylous effusion]', 'malignant pleural effusion [malignant pleural effusion]...","['1', '1', '0', '0', '0', '1', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '6', '0', '0', '0', '0', '0', '0', ..."
3,0,atrial fibrillation,I48.1,persistent atrial fibrillation [persistent atrial fibrillation],"['I48.1', 'I48.2', 'I48.0', 'I48.21', 'I48', 'I48.19', 'I48.11', 'I49.01', 'I48.4', 'I48.9', 'I4...","['persistent atrial fibrillation [persistent atrial fibrillation]', 'chronic atrial fibrillation...","['0', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '1', '1', '1', '1', '0', '1', '1', '1', ...","['0', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '1', '0', '1', '1', '0', '1', '0', '1', ...","['0', '0', '96', '96', '0', '96', '96', '84', '96', '0', '0', '96', '0', '96', '84', '0', '96', ..."
4,0,anemia,D53.2,scorbutic anemia [scorbutic anemia],"['D53.2', 'D50', 'D72.825', 'D53.0', 'D74', 'R43.0', 'D70', 'D75.83', 'E50', 'D52', 'E71.110', '...","['scorbutic anemia [scorbutic anemia]', 'iron deficiency anemia [iron deficiency anemia]', 'band...","['1', '0', '1', '1', '0', '1', '0', '0', '0', '0', '1', '1', '1', '1', '1', '0', '0', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '1', '0', '0', '1', '0', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '23', '46', '0', '0', '23', '0', '0', '23', '..."
5,0,ascites,R18,ascites [ascites],"['R18', 'R06.6', 'H53.54', 'L94.6', 'R14.2', 'Q06.0', 'W17.0', 'R43.0', 'J94.0', 'R46.3', 'Y93.B...","['ascites [ascites]', 'hiccough [hiccough]', 'protanomaly [protanomaly]', 'ainhum [ainhum]', 'er...","['0', '1', '1', '1', '1', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0', '0', '1', '1', '0', ...","['0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...","['0', '0', '0', '0', '0', '72', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',..."
6,0,esophageal reflux,K21,gastro-esophageal reflux disease [gastro-esophageal reflux disease],"['K21', 'K22.2', 'K21.0', 'K22.4', 'K20.8', 'K20', 'T28.6', 'T85.521', 'K22.3', 'T18.1', 'T85.52...","['gastro-esophageal reflux disease [gastro-esophageal reflux disease]', 'esophageal obstruction ...","['0', '1', '0', '1', '0', '0', '0', '0', '1', '0', '1', '0', '1', '1', '1', '0', '0', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ..."
7,0,deep venous thrombosis,I81,portal vein thrombosis [portal vein thrombosis],"['I81', 'I82.72', 'I82.5', 'I82.592', 'I82.722', 'K64.5', 'I82.62', 'I82.591', 'I82.492', 'I82.7...","['portal vein thrombosis [portal vein thrombosis]', 'chronic embolism and thrombosis of deep vei...","['1', '0', '0', '1', '1', '1', '0', '1', '1', '1', '0', '0', '1', '1', '0', '0', '0', '1', '1', ...","['0', '0', '0', '1', '1', '0', '0', '1', '1', '1', '0', '0', '1', '1', '0', '0', '0', '1', '1', ...","['0', '0', '0', '108', '108', '0', '0', '108', '108', '108', '0', '0', '108', '108', '0', '0', '..."
8,0,Mesothelioma,C45,mesothelioma [mesothelioma],"['C45', 'C45.0', 'C45.9', 'C45.1', 'C45.2', 'C4A', 'C7B.1', 'C96.2', 'C30.1', 'Q85.03', 'C34.0',...","['mesothelioma [mesothelioma]', 'mesothelioma of pleura [mesothelioma of pleura]', 'mesothelioma...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0', ...","['0', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '0', '0', '0', '0', '0', ...","['0', '9', '9', '9', '9', '0', '8', '0', '11', '12', '0', '8', '9', '10', '0', '0', '0', '0', '0..."
9,0,Pleural effusion,J94.0,chylous effusion [chylous effusion],"['J94.0', 'J91.0', 'J92', 'S27.63', 'J91', 'R09.1', 'J81', 'R09.3', 'Q34.0', 'S27.6', 'R07.81', ...","['chylous effusion [chylous effusion]', 'malignant pleural effusion [malignant pleural effusion]...","['1', '1', '0', '0', '0', '1', '0', '1', '1', '0', '1', '0', '1', '1', '1', '1', '1', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '6', '0', '0', '0', '0', '0', '0', ..."
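
Note that the list columns come back from the CSV as strings such as `"['C45', 'C45.0', ...]"`, as the quoted values in the output show. The sketch below reproduces this round trip with the standard library and restores the list with `ast.literal_eval`. The sample row is hand-made for illustration, and the code list is shortened to three codes.

```python
import ast
import csv
import io

# Write one row whose second column is a Python list rendered as text,
# which is what happens to list cells when a dataframe is saved to CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["chunk", "all_codes"])
writer.writerow(["Mesothelioma", str(["C45", "C45.0", "C45.9"])])

# Read it back: the cell is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["all_codes"]

# literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw)

print(type(raw).__name__, raw)
print(type(restored).__name__, restored)
```

With pandas, the same idea should work by applying `ast.literal_eval` to each list column after `pd.read_csv`, for example `merged_resolver_df["all_codes"].apply(ast.literal_eval)`.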
