## Medicare Risk Adjustment
In the United States, the Centers for Medicare & Medicaid Services sets reimbursement for private Medicare plan sponsors based on the assessed risk of their beneficiaries. Information found in unstructured medical records may be more indicative of member risk than existing structured data, creating more accurate risk pools.

In [2]:
import os

jsl_secret = os.getenv('SECRET')

import sparknlp
sparknlp_version = sparknlp.version()
import sparknlp_jsl
jsl_version = sparknlp_jsl.version()

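# Confirm that the license secret was found without echoing it into the notebook output.
print("SECRET is set:", jsl_secret is not None)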

In [4]:
import os
import json
import string
import numpy as np
import pandas as pd

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.training import CoNLL

In [5]:
params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(jsl_secret,params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.2.1
Spark NLP_JSL Version : 3.2.0rc3


In [6]:
spark

## Downloading oncology notes
In this notebook we will use the clinical notes extracted from www.mtsamples.com. 

In [7]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_oncology_10.zip
!unzip -q mt_oncology_10.zip 

In [8]:
df = spark.sparkContext.wholeTextFiles('mt_oncology_10/mt_note_*.txt').toDF().withColumnRenamed('_1','path').withColumnRenamed('_2','text')
df.show(truncate=50)

+-------------------------------------------+--------------------------------------------------+
|                                       path|                                              text|
+-------------------------------------------+--------------------------------------------------+
|file:/content/mt_oncology_10/mt_note_01.txt|
Medical Specialty:Hematology - Oncology
Sample...|
|file:/content/mt_oncology_10/mt_note_02.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_03.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_04.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_05.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_06.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_07.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/

In [9]:
sample_text = df.limit(2).select("text").collect()[1][0]
print(sample_text)

Medical Specialty:Hematology - Oncology
Sample Name: Mullerian Adenosarcoma
Description: Discharge summary of a patient presenting with a large mass aborted through the cervix.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:  Mullerian adenosarcoma.
HISTORY OF PRESENT ILLNESS:  The patient is a 56-year-old presenting with a large mass aborted through the cervix.
PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender.
PELVIC: There is a large mass in the vagina.
HOSPITAL COURSE:  The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day.
OPERATIONS:  July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy.
DISCHARGE CONDITION:  Stable.
PLAN:  The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She wi

## ICD-10 code extraction
Next, we build a pipeline that extracts ICD-10-CM codes. It detects diseases and problems with an NER model, resolves each detected chunk to an ICD-10-CM code, and runs an assertion model to check whether each problem is currently present.

In [10]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
 
sentenceDetector = SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
 
tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")
 
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")
 
c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 
 
clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")
 
ner_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Oncological", "Disease_Syndrome_Disorder", "Heart_Disease"])
 
sbert_embedder = BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")
 
icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")\
    .setInputCols(["ner_chunk", "sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)
 
clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
 
resolver_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10_resolver,
        clinical_assertion
    ])
 
data_ner = spark.createDataFrame([[""]]).toDF("text")
 
icd_model = resolver_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
Approximate size to download 1.4 GB
[OK!]
jsl_assertion_wip download started this may take some time.
Approximate size to download 1.4 MB
[OK!]


We can now transform the data. The `path` column holds the full file path, so we derive a shorter `filename` column from it and use that instead. Each file name corresponds to a different patient.


In [11]:
path_array = F.split(df['path'], '/')
df = df.withColumn('filename', path_array.getItem(F.size(path_array)- 1)).select(['filename', 'text'])
 
icd10_sdf = icd_model.transform(df)
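
The `filename` derivation can be checked locally without Spark. The following is a minimal sketch, assuming the paths look like the `file:/content/...` URIs shown in the earlier output; `filename_from_path` is a hypothetical helper introduced only for illustration and is not part of Spark NLP.

```python
def filename_from_path(path: str) -> str:
    """Return the last '/'-separated segment of a path.

    Mirrors F.split(path, '/') followed by getItem(size - 1) in the Spark cell.
    """
    return path.split("/")[-1]


# Paths copied from the df.show() output earlier in the notebook.
paths = [
    "file:/content/mt_oncology_10/mt_note_01.txt",
    "file:/content/mt_oncology_10/mt_note_02.txt",
]

filenames = [filename_from_path(p) for p in paths]
print(filenames)
```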

Let's see how our model extracted ICD Codes on a sample.

In [12]:
light_model = LightPipeline(icd_model)
 
light_result = light_model.fullAnnotate(sample_text)
 
from sparknlp_display import EntityResolverVisualizer
 
vis = EntityResolverVisualizer()
 
# Change color of an entity label
vis.set_label_colors({'PROBLEM':'#008080'})
 
vis.display(light_result[0], 'ner_chunk', 'icd10cm_code')

The ICD resolver also reports an HCC status for each code. The HCC status is 1 if the ICD code is included in the Medicare Risk Adjustment model.



In [13]:
icd10_hcc_df = icd10_sdf.select("filename", F.explode(F.arrays_zip('ner_chunk.result', 
                                                                   'icd10cm_code.result',
                                                                   'icd10cm_code.metadata',
                                                                   "assertion.result"
                                                                  )).alias("cols")) \
                            .select("filename", F.expr("cols['0']").alias("chunk"),
                                    F.expr("cols['1']").alias("icd10_code"),
                                    F.expr("cols['2']['all_k_aux_labels']").alias("hcc_list"),
                                    F.expr("cols['3']").alias("assertion")
                                   ).toPandas()

icd10_hcc_df.head()

Unnamed: 0,filename,chunk,icd10_code,hcc_list,assertion
0,mt_note_01.txt,breast cancer,C5092,0||1||12:::0||1||12:::1||0||0:::0||0||0:::1||0...,Family
1,mt_note_01.txt,breast cancer,C5092,0||1||12:::0||1||12:::1||0||0:::0||0||0:::1||0...,Family
2,mt_note_01.txt,dysplasia,P614,1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||...,Absent
3,mt_note_01.txt,cancer,C801,1||1||12:::1||1||10:::1||0||0:::1||0||0:::1||1...,Absent
4,mt_note_02.txt,Mullerian adenosarcoma,N40,0||0||0:::0||0||0:::1||1||12:::0||0||0:::1||0|...,Present


In [14]:
icd10_hcc_df["hcc_status"] = icd10_hcc_df["hcc_list"].apply(lambda x: x.split("||")[1])
icd10_df = icd10_hcc_df.drop("hcc_list", axis = 1)
icd10_df.head()

Unnamed: 0,filename,chunk,icd10_code,assertion,hcc_status
0,mt_note_01.txt,breast cancer,C5092,Family,1
1,mt_note_01.txt,breast cancer,C5092,Family,1
2,mt_note_01.txt,dysplasia,P614,Absent,0
3,mt_note_01.txt,cancer,C801,Absent,1
4,mt_note_02.txt,Mullerian adenosarcoma,N40,Present,0
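
The `hcc_list` strings above separate resolver candidates with `:::` and the fields of each candidate with `||`. The sketch below parses that layout in plain Python. The notebook only relies on the second field being the HCC status; the names `billable` and `hcc_code` for the other two fields are assumptions, and `parse_aux_labels` is a hypothetical helper.

```python
from typing import List, NamedTuple


class AuxLabel(NamedTuple):
    billable: str    # assumed meaning of field 0
    hcc_status: str  # field 1, the value the notebook extracts
    hcc_code: str    # assumed meaning of field 2


def parse_aux_labels(raw: str) -> List[AuxLabel]:
    """Split an all_k_aux_labels string into one AuxLabel per candidate."""
    labels = []
    for candidate in raw.split(":::"):
        parts = candidate.split("||")
        if len(parts) == 3:  # skip truncated or malformed candidates
            labels.append(AuxLabel(*parts))
    return labels


# Visible prefix of the hcc_list value for "breast cancer" in the output above.
raw = "0||1||12:::0||1||12:::1||0||0"
parsed = parse_aux_labels(raw)

# The first candidate corresponds to the top-ranked code, so its status
# matches what x.split("||")[1] returns in the notebook cell.
top_hcc_status = parsed[0].hcc_status
print(top_hcc_status)
```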


In [15]:
icd10_df = icd10_df[icd10_df.hcc_status=="1"]
icd10_df = icd10_df[~icd10_df.assertion.isin(["Family", "Past"])][['filename','chunk','icd10_code']].drop_duplicates()
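
To make the filter explicit, here is a small plain-Python sketch of the same logic applied to the five rows shown in the previous output. It is only an illustration under that assumption; the notebook itself does this with pandas.

```python
# Rows copied from the icd10_df.head() output above.
rows = [
    {"filename": "mt_note_01.txt", "chunk": "breast cancer", "icd10_code": "C5092", "assertion": "Family", "hcc_status": "1"},
    {"filename": "mt_note_01.txt", "chunk": "breast cancer", "icd10_code": "C5092", "assertion": "Family", "hcc_status": "1"},
    {"filename": "mt_note_01.txt", "chunk": "dysplasia", "icd10_code": "P614", "assertion": "Absent", "hcc_status": "0"},
    {"filename": "mt_note_01.txt", "chunk": "cancer", "icd10_code": "C801", "assertion": "Absent", "hcc_status": "1"},
    {"filename": "mt_note_02.txt", "chunk": "Mullerian adenosarcoma", "icd10_code": "N40", "assertion": "Present", "hcc_status": "0"},
]

EXCLUDED_ASSERTIONS = {"Family", "Past"}

kept = []
seen = set()
for row in rows:
    # Keep only codes that count toward the risk model ...
    if row["hcc_status"] != "1":
        continue
    # ... and that are not family history or past conditions.
    if row["assertion"] in EXCLUDED_ASSERTIONS:
        continue
    key = (row["filename"], row["chunk"], row["icd10_code"])
    if key not in seen:  # equivalent of drop_duplicates()
        seen.add(key)
        kept.append(key)

print(kept)
```

Entities asserted as `Absent` pass this filter, which is why `cancer` (`C801`) from `mt_note_01.txt` survives. Whether to exclude `Absent` as well is a separate design decision.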

We kept only the codes with an HCC status of 1 and dropped entities asserted as `Family` or `Past`. Next, we add a column that pairs each extracted entity with its ICD-10 code.

In [16]:
icd10_df['Extracted_Entities_vs_ICD_Codes'] = list(zip(icd10_df.chunk, icd10_df.icd10_code))
icd10_df.head(10)

Unnamed: 0,filename,chunk,icd10_code,Extracted_Entities_vs_ICD_Codes
3,mt_note_01.txt,cancer,C801,"(cancer, C801)"
6,mt_note_03.txt,leiomyosarcoma,C499,"(leiomyosarcoma, C499)"
10,mt_note_03.txt,Leiomyosarcoma,C499,"(Leiomyosarcoma, C499)"
12,mt_note_03.txt,Pancytopenia,D6181,"(Pancytopenia, D6181)"
15,mt_note_03.txt,rheumatoid arthritis,M069,"(rheumatoid arthritis, M069)"
47,mt_note_05.txt,Breast Cancer,C5092,"(Breast Cancer, C5092)"
48,mt_note_05.txt,ductal carcinoma of the left breast,C5091,"(ductal carcinoma of the left breast, C5091)"
50,mt_note_05.txt,breast cancer,C5092,"(breast cancer, C5092)"
55,mt_note_05.txt,metastatic disease,C800,"(metastatic disease, C800)"
56,mt_note_05.txt,ALLERGIES,G20,"(ALLERGIES, G20)"


In [17]:
icd10_codes= icd10_df.groupby("filename").icd10_code.apply(lambda x: list(x)).reset_index()
icd10_vs_entities = icd10_df.groupby("filename").Extracted_Entities_vs_ICD_Codes.apply(lambda x: list(x)).reset_index()
 
icd10_df_all = icd10_codes.merge(icd10_vs_entities)
 
icd10_df_all

Unnamed: 0,filename,icd10_code,Extracted_Entities_vs_ICD_Codes
0,mt_note_01.txt,[C801],"[(cancer, C801)]"
1,mt_note_03.txt,"[C499, C499, D6181, M069]","[(leiomyosarcoma, C499), (Leiomyosarcoma, C499..."
2,mt_note_05.txt,"[C5092, C5091, C5092, C800, G20, C5092]","[(Breast Cancer, C5092), (ductal carcinoma of ..."
3,mt_note_06.txt,[F319],"[(Type 1 bipolar disease, F319)]"
4,mt_note_08.txt,"[C459, C800]","[(malignant mesothelioma, C459), (metastatic d..."
5,mt_note_09.txt,"[D5702, K5505]","[(Sickle cell crisis, D5702), (Veno-occlusive ..."
6,mt_note_10.txt,"[C6960, C6960]","[(Rhabdomyosarcoma of the left orbit, C6960), ..."
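
The per-file aggregation can be sketched without pandas. The example below groups a few of the (filename, code) pairs shown earlier into one list per file, keeping duplicates just as `groupby(...).apply(list)` does; the variable names are illustrative only.

```python
from collections import defaultdict

# (filename, icd10_code) pairs taken from the filtered table above.
pairs = [
    ("mt_note_01.txt", "C801"),
    ("mt_note_03.txt", "C499"),
    ("mt_note_03.txt", "C499"),
    ("mt_note_03.txt", "D6181"),
    ("mt_note_03.txt", "M069"),
]

codes_by_file = defaultdict(list)
for filename, code in pairs:
    # Duplicates are kept on purpose, matching the notebook's output.
    codes_by_file[filename].append(code)

print(dict(codes_by_file))
```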


## Gender Classification

Spark NLP provides a pretrained model for detecting a patient's gender. We load it with `ClassifierDLModel`.

In [18]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
 
tokenizer = Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")
 
biobert_embeddings = BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
        .setInputCols(["document",'token'])\
        .setOutputCol("bert_embeddings")
 
sentence_embeddings = SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")
 
genderClassifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["document", "sentence_bert_embeddings"]) \
       .setOutputCol("gender")
 
gender_pipeline = Pipeline(stages=[documentAssembler,
                                   #sentenceDetector,
                                   tokenizer, 
                                   biobert_embeddings, 
                                   sentence_embeddings, 
                                   genderClassifier])

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[OK!]


In [19]:
data_ner = spark.createDataFrame([[""]]).toDF("text")
 
gender_model = gender_pipeline.fit(data_ner)
 
gender_df = gender_model.transform(df)

In [20]:
gender_pd_df = gender_df.select("filename", F.explode(F.arrays_zip('gender.result', 'gender.metadata')).alias("cols")) \
                       .select("filename",
                               F.expr("cols['0']").alias("Gender"),
                               F.expr("cols['1']['Female']").alias("Female"),
                               F.expr("cols['1']['Male']").alias("Male")).toPandas()
 
gender_pd_df['Gender'] = gender_pd_df.apply(lambda x : "F" if float(x['Female']) >= float(x['Male']) else "M", axis=1)
 
gender_pd_df = gender_pd_df[['filename', 'Gender']]
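
The gender label is chosen by comparing the two class scores stored as strings in the classifier metadata. A minimal sketch of that rule follows; `pick_gender` is a hypothetical helper and the scores are made-up values used only to exercise it.

```python
def pick_gender(metadata: dict) -> str:
    """Return "F" when the Female score is at least the Male score, else "M".

    Scores arrive as strings in the annotation metadata, so they are cast
    to float before comparison, as in the notebook's lambda.
    """
    return "F" if float(metadata["Female"]) >= float(metadata["Male"]) else "M"


# Hypothetical metadata dictionaries, not real model output.
examples = [
    {"Female": "0.91", "Male": "0.09"},
    {"Female": "0.20", "Male": "0.80"},
    {"Female": "0.50", "Male": "0.50"},  # ties resolve to "F" because of >=
]

labels = [pick_gender(m) for m in examples]
print(labels)
```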

Each patient's predicted gender is now available in a DataFrame.

In [21]:
gender_pd_df

Unnamed: 0,filename,Gender
0,mt_note_01.txt,F
1,mt_note_02.txt,F
2,mt_note_03.txt,F
3,mt_note_04.txt,F
4,mt_note_05.txt,F
5,mt_note_06.txt,F
6,mt_note_07.txt,M
7,mt_note_08.txt,F
8,mt_note_09.txt,M
9,mt_note_10.txt,M


## Age
We can extract each patient's age from the notes with another pipeline. The age pipeline below keeps only entities labeled `Age`. A note may contain more than one age entity, so we take the first one as the patient's age.

In [22]:
date_ner_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Age"])
 
age_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        date_ner_converter
    ])
 
data_ner = spark.createDataFrame([[""]]).toDF("text")
 
age_model = age_pipeline.fit(data_ner)

In [24]:
light_model = LightPipeline(age_model)
 
light_result = light_model.fullAnnotate(sample_text)
 
from sparknlp_display import NerVisualizer
 
visualiser = NerVisualizer()
 
ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

In [25]:
age_result = age_model.transform(df)
 
age_df = age_result.select("filename",F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
                   .select("filename", 
                           F.expr("cols['0']").alias("Age"),
                           F.expr("cols['1']['entity']").alias("ner_label")).toPandas().groupby('filename').first().reset_index()

In [26]:
age_df.Age = age_df.Age.replace(r"\D", "", regex = True).astype(int)
age_df.drop('ner_label', axis=1, inplace=True)
age_df.head()

Unnamed: 0,filename,Age
0,mt_note_01.txt,59
1,mt_note_02.txt,56
2,mt_note_03.txt,66
3,mt_note_04.txt,61
4,mt_note_05.txt,57
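
The age conversion strips every non-digit character from the chunk before casting to an integer. The sketch below reproduces that step with the standard `re` module; `age_from_chunk` is a hypothetical helper, and the example chunks are assumed inputs rather than actual NER output.

```python
import re


def age_from_chunk(chunk: str) -> int:
    """Remove all non-digit characters and cast the remainder to int."""
    digits = re.sub(r"\D", "", chunk)
    if not digits:
        raise ValueError(f"no digits found in age chunk: {chunk!r}")
    return int(digits)


age = age_from_chunk("56-year-old")
print(age)

# Caveat: every digit in the chunk is concatenated, and units are dropped.
concatenated = age_from_chunk("6 1/2 months")
print(concatenated)
```

Because the regex keeps all digits and discards units, a chunk such as `6 1/2 months` would become `612`. This may be worth guarding against on notes that contain infants or fractional ages.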


# Calculating Medicare Risk Adjustment Score
We now have all the data that can be extracted from the clinical notes, so we can calculate the Medicare Risk Adjustment score.

In [74]:
patient_df = age_df.merge(icd10_df_all, on='filename', how = "left")\
                   .merge(gender_pd_df, on='filename', how = "left")
 
patient_df = patient_df.dropna()
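
The left merge followed by `dropna()` keeps only patients for whom an age, a gender, and at least one HCC-relevant code were all found. A plain-Python sketch of that behavior on three of the notes, using values from the earlier outputs:

```python
# Values copied from the age, ICD-10, and gender tables above.
ages = {"mt_note_01.txt": 59, "mt_note_02.txt": 56, "mt_note_03.txt": 66}
codes = {
    "mt_note_01.txt": ["C801"],
    "mt_note_03.txt": ["C499", "C499", "D6181", "M069"],
}
genders = {"mt_note_01.txt": "F", "mt_note_02.txt": "F", "mt_note_03.txt": "F"}

patients = []
for filename, age in ages.items():  # the age table is the left side of the join
    row = {
        "filename": filename,
        "Age": age,
        "icd10_code": codes.get(filename),  # None when no code survived filtering
        "Gender": genders.get(filename),
    }
    # Equivalent of dropna(): discard rows with any missing value.
    if all(value is not None for value in row.values()):
        patients.append(row)

kept_files = [p["filename"] for p in patients]
print(kept_files)
```

`mt_note_02.txt` is dropped here because none of its codes had an HCC status of 1, which is consistent with the seven rows reported by `patient_df.info()`.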

In [76]:
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   filename                         7 non-null      object
 1   Age                              7 non-null      int64 
 2   icd10_code                       7 non-null      object
 3   Extracted_Entities_vs_ICD_Codes  7 non-null      object
 4   Gender                           7 non-null      object
dtypes: int64(1), object(4)
memory usage: 336.0+ bytes


In [77]:
df = spark.createDataFrame(patient_df)
df.show(truncate=False)

+--------------+---+---------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|filename      |Age|icd10_code                             |Extracted_Entities_vs_ICD_Codes                                                                                                                                      |Gender|
+--------------+---+---------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|mt_note_01.txt|59 |[C801]                                 |[{cancer, C801}]                                                                                                                                                     |F     |
|mt_note_03.txt|66 |[C499, C499, D6181, M069]              |[{le

In [78]:
from pyspark.sql.types import MapType, IntegerType, DoubleType, StringType, StructType, StructField, FloatType
import pyspark.sql.functions as f

schema = StructType([
            StructField('risk_score', FloatType()),
            StructField('hcc_lst', StringType()),
            StructField('parameters', StringType()),
            StructField('details', StringType())])
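
The schema describes a JSON object with four fields, which Spark later unpacks with `F.from_json`. As a rough local analogue, the sketch below parses such a string with the standard `json` module. The payload is a hypothetical placeholder that only reuses the field names from the schema; it is not real output of the risk-scoring function.

```python
import json

# Field names taken from the StructType defined above.
FIELDS = ("risk_score", "hcc_lst", "parameters", "details")


def parse_profile(raw: str) -> dict:
    """Parse a profile JSON string, returning None for any missing field."""
    data = json.loads(raw)
    return {name: data.get(name) for name in FIELDS}


# Hypothetical payload: placeholder values, for illustration only.
raw_profile = json.dumps(
    {"risk_score": 0.15, "hcc_lst": "[]", "parameters": "{}", "details": "{}"}
)

profile_row = parse_profile(raw_profile)
print(profile_row["risk_score"])

# A missing field yields None, similar to the null Spark produces.
partial_row = parse_profile('{"risk_score": 1.0}')
print(partial_row["details"])
```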

In [82]:
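# `profile` returns the risk-adjustment profile as a JSON string for each row.
from sparknlp_jsl.functions import profile

df = df.withColumn("hcc_profile", profile(df.icd10_code, df.Age, df.Gender))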
                                          
df = df.withColumn("hcc_profile", F.from_json(F.col("hcc_profile"), schema))
df = df.withColumn("risk_score", df.hcc_profile.getItem("risk_score"))\
       .withColumn("hcc_lst", df.hcc_profile.getItem("hcc_lst"))\
       .withColumn("parameters", df.hcc_profile.getItem("parameters"))\
       .withColumn("details", df.hcc_profile.getItem("details"))

df.select('risk_score','icd10_code', 'Age', 'Gender').show(truncate=False )

df.show(truncate=100, vertical=True)

+----------+---------------------------------------+---+------+
|risk_score|icd10_code                             |Age|Gender|
+----------+---------------------------------------+---+------+
|0.15      |[C801]                                 |59 |F     |
|1.419     |[C499, C499, D6181, M069]              |66 |F     |
|3.265     |[C5092, C5091, C5092, C800, G20, C5092]|57 |F     |
|0.309     |[F319]                                 |63 |F     |
|2.982     |[C459, C800]                           |66 |F     |
|1.372     |[D5702, K5505]                         |19 |M     |
|0.15      |[C6960, C6960]                         |16 |M     |
+----------+---------------------------------------+---+------+

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------
 filename                        | mt_note_01.txt                                                                                       
 Age                 