![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/05.3.Calculate_Medicare_Risk_Adjustment_Score.ipynb)

## Medicare Risk Adjustment
In the United States, the Centers for Medicare & Medicaid Services sets reimbursement for private Medicare plan sponsors based on the assessed risk of their beneficiaries. Information found in unstructured medical records may be more indicative of member risk than existing structured data, creating more accurate risk pools.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import json
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/5.0.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## Downloading oncology notes
In this notebook we will use the clinical notes extracted from www.mtsamples.com.

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/healthcare-nlp/data/mt_oncology_10.zip
!unzip -q mt_oncology_10.zip

In [None]:
df = spark.sparkContext.wholeTextFiles('mt_oncology_10/mt_note_*.txt').toDF().withColumnRenamed('_1','path').withColumnRenamed('_2','text')
df.show(truncate=50)

+-------------------------------------------+--------------------------------------------------+
|                                       path|                                              text|
+-------------------------------------------+--------------------------------------------------+
|file:/content/mt_oncology_10/mt_note_01.txt|
Medical Specialty:Hematology - Oncology
Sample...|
|file:/content/mt_oncology_10/mt_note_02.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_03.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_04.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_05.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_06.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/mt_note_07.txt|Medical Specialty:Hematology - Oncology
Sample ...|
|file:/content/mt_oncology_10/

## ICD-10 code extraction
Now we will create a pipeline to extract ICD-10 codes. The pipeline finds diseases and problems, maps each one to an ICD-10 code, and uses an assertion model to check whether the problem is still present.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

c2doc = nlp.Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc")

clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Oncological", "Disease_Syndrome_Disorder", "Heart_Disease"])

sbert_embedder = nlp.BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

icd10_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

resolver_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10_resolver,
        clinical_assertion
    ])

data_ner = spark.createDataFrame([[""]]).toDF("text")

icd_model = resolver_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]
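
Stage order matters in the pipeline above: each annotator can only read columns that an earlier stage has already produced. The following is a minimal sketch in plain Python that checks this wiring, using the input and output column names copied from the stage definitions. The `missing_inputs` helper is hypothetical and not part of Spark NLP.

```python
# (stage name, input columns, output column), copied from the definitions above.
STAGES = [
    ("documentAssembler", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "embeddings"),
    ("clinical_ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
    ("c2doc", ["ner_chunk"], "ner_chunk_doc"),
    ("sbert_embedder", ["ner_chunk_doc"], "sbert_embeddings"),
    ("icd10_resolver", ["sbert_embeddings"], "icd10cm_code"),
    ("clinical_assertion", ["sentence", "ner_chunk", "embeddings"], "assertion"),
]


def missing_inputs(stages, initial_columns):
    """Hypothetical helper: list (stage, column) pairs not produced earlier."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


problems = missing_inputs(STAGES, ["text"])
print(problems)  # → []
```

An empty list means every stage finds its inputs. Moving `c2doc` ahead of `ner_converter` in `STAGES`, for example, would report `ner_chunk` as missing.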


Now we can transform the data. The `path` column holds the full file path, so we derive a shorter `filename` column from it and use that instead. Each file name refers to a different patient.


In [None]:
path_array = F.split(df['path'], '/')
df = df.withColumn('filename', path_array.getItem(F.size(path_array)- 1)).select(['filename', 'text'])

icd10_sdf = icd_model.transform(df)
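
The `filename` derivation splits the path on `/` and keeps the last element. A minimal plain-Python sketch of the same idea, with `filename_from_path` as a hypothetical helper name:

```python
def filename_from_path(path: str) -> str:
    # Mirrors F.split(path, '/') followed by picking the last array element.
    return path.split("/")[-1]


print(filename_from_path("file:/content/mt_oncology_10/mt_note_01.txt"))
# → mt_note_01.txt
```

Within Spark, `F.element_at(path_array, -1)` should select the same last element without computing the array size, assuming a Spark version that provides `element_at`.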

Let's see how our model extracted ICD Codes on a sample.

In [None]:
sample_text = df.select("text").take(2)[1][0]
print(sample_text)

Medical Specialty:Hematology - Oncology
Sample Name: Mullerian Adenosarcoma
Description: Discharge summary of a patient presenting with a large mass aborted through the cervix.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:  Mullerian adenosarcoma.
HISTORY OF PRESENT ILLNESS:  The patient is a 56-year-old presenting with a large mass aborted through the cervix.
PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender.
PELVIC: There is a large mass in the vagina.
HOSPITAL COURSE:  The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day.
OPERATIONS:  July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy.
DISCHARGE CONDITION:  Stable.
PLAN:  The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She wi

In [None]:
light_model = nlp.LightPipeline(icd_model)

light_result = light_model.fullAnnotate(sample_text)

vis = nlp.viz.EntityResolverVisualizer()

# Change color of an entity label
# vis.set_label_colors({'Oncological':'#008080'})

vis.display(light_result[0], 'ner_chunk', 'icd10cm_code')

In [None]:
icd10_df = icd10_sdf.select("filename", F.explode(F.arrays_zip(icd10_sdf.ner_chunk.result,
                                                                   icd10_sdf.icd10cm_code.result,
                                                                   icd10_sdf.assertion.result
                                                                  )).alias("cols")) \
                            .select("filename", F.expr("cols['0']").alias("chunk"),
                                    F.expr("cols['1']").alias("icd10_code"),
                                    F.expr("cols['2']").alias("assertion")
                                   ).toPandas()

icd10_df.head()

| | filename | chunk | icd10_code | assertion |
|---|---|---|---|---|
| 0 | mt_note_01.txt | breast cancer | C50.92 | Family |
| 1 | mt_note_01.txt | breast cancer | C50.92 | Family |
| 2 | mt_note_01.txt | dysplasia | P61.4 | Absent |
| 3 | mt_note_01.txt | cancer | C80.1 | Absent |
| 4 | mt_note_02.txt | Name: Mullerian Adenosarcoma | C53.9 | Present |


In [None]:
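# Drop entities asserted as family history or past conditions, then remove duplicate rows.
# Entities with any other assertion status (for example Absent) are kept.
icd10_df = icd10_df[~icd10_df.assertion.isin(["Family", "Past"])][['filename','chunk','icd10_code']].drop_duplicates()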

Now we will create a column that pairs each extracted entity with its ICD-10 code.
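
The filtering, de-duplication, pairing and per-file grouping done in this part of the notebook can be sketched without pandas. This is only an illustration on the first five rows shown earlier; `group_codes` and `EXCLUDED` are hypothetical names.

```python
from collections import defaultdict

# (filename, chunk, icd10_code, assertion), taken from the sample output above.
rows = [
    ("mt_note_01.txt", "breast cancer", "C50.92", "Family"),
    ("mt_note_01.txt", "breast cancer", "C50.92", "Family"),
    ("mt_note_01.txt", "dysplasia", "P61.4", "Absent"),
    ("mt_note_01.txt", "cancer", "C80.1", "Absent"),
    ("mt_note_02.txt", "Name: Mullerian Adenosarcoma", "C53.9", "Present"),
]

EXCLUDED = {"Family", "Past"}  # assertion statuses that are dropped


def group_codes(rows):
    """Hypothetical helper: filter, de-duplicate and group (chunk, code) pairs."""
    seen = set()
    grouped = defaultdict(list)
    for filename, chunk, code, assertion in rows:
        if assertion in EXCLUDED:
            continue
        key = (filename, chunk, code)
        if key in seen:  # same effect as drop_duplicates on these three columns
            continue
        seen.add(key)
        grouped[filename].append((chunk, code))
    return dict(grouped)


grouped = group_codes(rows)
print(grouped["mt_note_01.txt"])
# → [('dysplasia', 'P61.4'), ('cancer', 'C80.1')]
```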

In [None]:
icd10_df['Extracted_Entities_vs_ICD_Codes'] = list(zip(icd10_df.chunk, icd10_df.icd10_code))
icd10_df.head(10)

| | filename | chunk | icd10_code | Extracted_Entities_vs_ICD_Codes |
|---|---|---|---|---|
| 2 | mt_note_01.txt | dysplasia | P61.4 | (dysplasia, P61.4) |
| 3 | mt_note_01.txt | cancer | C80.1 | (cancer, C80.1) |
| 4 | mt_note_02.txt | Name: Mullerian Adenosarcoma | C53.9 | (Name: Mullerian Adenosarcoma, C53.9) |
| 5 | mt_note_02.txt | Mullerian adenosarcoma | C53.9 | (Mullerian adenosarcoma, C53.9) |
| 7 | mt_note_03.txt | leiomyosarcoma | C49.9 | (leiomyosarcoma, C49.9) |
| 8 | mt_note_03.txt | pulmonary embolism | I26 | (pulmonary embolism, I26) |
| 9 | mt_note_03.txt | pancytopenia | D61.81 | (pancytopenia, D61.81) |
| 10 | mt_note_03.txt | pneumonia | J18.9 | (pneumonia, J18.9) |
| 11 | mt_note_03.txt | Leiomyosarcoma | C49.9 | (Leiomyosarcoma, C49.9) |
| 13 | mt_note_03.txt | Pancytopenia | D61.81 | (Pancytopenia, D61.81) |


In [None]:
icd10_codes= icd10_df.groupby("filename").icd10_code.apply(lambda x: list(x)).reset_index()
icd10_vs_entities = icd10_df.groupby("filename").Extracted_Entities_vs_ICD_Codes.apply(lambda x: list(x)).reset_index()

icd10_df_all = icd10_codes.merge(icd10_vs_entities)

icd10_df_all

| | filename | icd10_code | Extracted_Entities_vs_ICD_Codes |
|---|---|---|---|
| 0 | mt_note_01.txt | [P61.4, C80.1] | [(dysplasia, P61.4), (cancer, C80.1)] |
| 1 | mt_note_02.txt | [C53.9, C53.9] | [(Name: Mullerian Adenosarcoma, C53.9), (Mulle... |
| 2 | mt_note_03.txt | [C49.9, I26, D61.81, J18.9, C49.9, D61.81, M06... | [(leiomyosarcoma, C49.9), (pulmonary embolism,... |
| 3 | mt_note_04.txt | [C44.9, C44.9, N64.81, C44.90, R06.8, I80.9, R... | [(basal cell carcinoma, C44.9), (Basal cell ca... |
| 4 | mt_note_05.txt | [C50.92, C50.91, C50.9, C50.92, C80.0, T78.40,... | [(Breast Cancer, C50.92), (ductal carcinoma of... |
| 5 | mt_note_06.txt | [C45, D20.1, F31.9, R18, D20.1, F31.9, L72.0] | [(Name: Intraperitoneal Mesothelioma, C45), (p... |
| 6 | mt_note_08.txt | [J90, C45.9, J90] | [(Description: Right pleural effusion, J90), (... |
| 7 | mt_note_09.txt | [D50.8, D57.0, O99.0, D57.02] | [(Name: Sickle Cell Anemia, D50.8), (sickle ce... |
| 8 | mt_note_10.txt | [C69.60, C69.60] | [(Rhabdomyosarcoma of the left orbit, C69.60),... |


## Gender Classification

Spark NLP provides a pretrained model that detects the patient's gender. We load it with `ClassifierDLModel`.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

biobert_embeddings = nlp.BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
    .setInputCols(["document",'token'])\
    .setOutputCol("bert_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "bert_embeddings"]) \
    .setOutputCol("sentence_bert_embeddings") \
    .setPoolingStrategy("AVERAGE")

genderClassifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
    .setInputCols(["sentence_bert_embeddings"]) \
    .setOutputCol("gender")

gender_pipeline = nlp.Pipeline(stages=[documentAssembler,
                                   #sentenceDetector,
                                   tokenizer,
                                   biobert_embeddings,
                                   sentence_embeddings,
                                   genderClassifier])

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[OK!]


In [None]:
data_ner = spark.createDataFrame([[""]]).toDF("text")

gender_model = gender_pipeline.fit(data_ner)

gender_df = gender_model.transform(df)

In [None]:
gender_pd_df = gender_df.select("filename", F.explode(F.arrays_zip(gender_df.gender.result,
                                                                   gender_df.gender.metadata)).alias("cols")) \
                        .select("filename",
                                F.expr("cols['0']").alias("Gender"),
                                F.expr("cols['1']['Female']").alias("Female"),
                                F.expr("cols['1']['Male']").alias("Male")).toPandas()

gender_pd_df['Gender'] = gender_pd_df.apply(lambda x : "F" if float(x['Female']) >= float(x['Male']) else "M", axis=1)

gender_pd_df = gender_pd_df[['filename', 'Gender']]

The gender of every patient is now available in a dataframe.

In [None]:
gender_pd_df

| | filename | Gender |
|---|---|---|
| 0 | mt_note_01.txt | F |
| 1 | mt_note_02.txt | F |
| 2 | mt_note_03.txt | F |
| 3 | mt_note_04.txt | F |
| 4 | mt_note_05.txt | F |
| 5 | mt_note_06.txt | F |
| 6 | mt_note_07.txt | M |
| 7 | mt_note_08.txt | F |
| 8 | mt_note_09.txt | M |
| 9 | mt_note_10.txt | M |
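
The `Gender` values above come from comparing the `Female` and `Male` scores in the classifier metadata rather than from the predicted label itself. A minimal sketch of that rule, assuming the metadata arrives as a dict of strings; `pick_gender` is a hypothetical helper and the scores below are placeholders, not real model output.

```python
def pick_gender(metadata: dict) -> str:
    # Metadata values are strings, so convert them before comparing.
    female = float(metadata["Female"])
    male = float(metadata["Male"])
    # Ties resolve to "F", matching the >= comparison used in the notebook.
    return "F" if female >= male else "M"


print(pick_gender({"Female": "0.91", "Male": "0.09"}))  # → F
print(pick_gender({"Female": "0.20", "Male": "0.80"}))  # → M
```

Because only these two scores are compared, any other class the classifier might emit is ignored by this rule.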


## Age
We can get the patient's age from the notes with another pipeline that extracts entities labelled `Age`. A note can contain more than one age entity; we take the first one as the patient's age.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

date_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Age"])

age_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        date_ner_converter
    ])

data_ner = spark.createDataFrame([[""]]).toDF("text")

age_model = age_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_enriched download started this may take some time.
[OK!]


In [None]:
light_model = nlp.LightPipeline(age_model)

light_result = light_model.fullAnnotate(sample_text)


visualiser = nlp.viz.NerVisualizer()

ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

In [None]:
age_result = age_model.transform(df)

age_df = age_result.select("filename",F.explode(F.arrays_zip(age_result.ner_chunk.result,
                                                             age_result.ner_chunk.metadata)).alias("cols")) \
                   .select("filename",
                           F.expr("cols['0']").alias("Age"),
                           F.expr("cols['1']['entity']").alias("ner_label")).toPandas().groupby('filename').first().reset_index()

In [None]:
age_df.Age = age_df.Age.replace(r"\D", "", regex = True).astype(int)
age_df.drop('ner_label', axis=1, inplace=True)
age_df.head()

| | filename | Age |
|---|---|---|
| 0 | mt_note_01.txt | 59 |
| 1 | mt_note_02.txt | 56 |
| 2 | mt_note_03.txt | 66 |
| 3 | mt_note_04.txt | 61 |
| 4 | mt_note_05.txt | 57 |
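
The age clean-up removes every non-digit character from the first `Age` chunk and casts the rest to an integer. A minimal sketch of that step in plain Python, assuming a chunk such as `56-year-old`; `age_from_chunk` is a hypothetical helper.

```python
import re


def age_from_chunk(chunk: str) -> int:
    # Same idea as replace(r"\D", "", regex=True).astype(int) in pandas.
    digits = re.sub(r"\D", "", chunk)
    if not digits:
        raise ValueError(f"no digits found in age chunk: {chunk!r}")
    return int(digits)


print(age_from_chunk("56-year-old"))  # → 56
```

This approach concatenates all digits, so a chunk that contains several numbers would produce a misleading value. It is worth checking the extracted ages before using them.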


# Calculating Medicare Risk Adjustment Score
We now have all the data that can be extracted from the clinical notes, so we can calculate the Medicare Risk Adjustment Score with the Spark NLP Healthcare CMS-HCC risk-adjustment score calculation module.

**This module supports V22, V23, V24 and V28 of the CMS-HCC risk adjustment model.**

It needs the following parameters in order to calculate the risk score:

- ICD Codes
- Age
- Gender
- The eligibility segment of the patient
- The original reason for entitlement
- Whether the patient is in Medicaid


In [None]:
patient_df = age_df.merge(icd10_df_all, on='filename', how = "left")\
                   .merge(gender_pd_df, on='filename', how = "left")

patient_df = patient_df.dropna()

In [None]:
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   filename                         9 non-null      object
 1   Age                              9 non-null      int64 
 2   icd10_code                       9 non-null      object
 3   Extracted_Entities_vs_ICD_Codes  9 non-null      object
 4   Gender                           9 non-null      object
dtypes: int64(1), object(4)
memory usage: 432.0+ bytes
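
The left merge followed by `dropna()` keeps only the patients for whom age, ICD codes and gender are all available, which is why fewer rows remain than there are notes. A minimal sketch of the same logic with plain dicts; the file names and values below are illustrative placeholders and `left_join_dropna` is a hypothetical helper.

```python
def left_join_dropna(base, *others):
    """Left-join dicts keyed by filename and drop rows with a missing value."""
    rows = {}
    for key, value in base.items():
        row = [value] + [other.get(key) for other in others]
        if all(item is not None for item in row):
            rows[key] = tuple(row)
    return rows


# Placeholder inputs, keyed by filename.
ages = {"note_a.txt": 59, "note_b.txt": 56}
codes = {"note_a.txt": ["C80.1"]}  # note_b.txt has no retained ICD codes
genders = {"note_a.txt": "F", "note_b.txt": "F"}

patients = left_join_dropna(ages, codes, genders)
print(sorted(patients))  # → ['note_a.txt']
```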


In [None]:
df = spark.createDataFrame(patient_df)
df.show(truncate=False)

+--------------+---+----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|filename      |Age|icd10_code                                                                  |Extracted_Entities_vs_ICD_Codes                                                                                                                                                                                                                                                          |Gender|
+--------------+---+----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------

In [None]:
from pyspark.sql.types import MapType, IntegerType, DoubleType, StringType, StructType, StructField, FloatType
import pyspark.sql.functions as F

schema = StructType([
            StructField('risk_score', FloatType()),
            StructField('hcc_lst', StringType()),
            StructField('parameters', StringType()),
            StructField('details', StringType())])

In [None]:
# Additional per-patient parameters: eligibility segment, original reason for
# entitlement (orec), Medicaid status and date of birth.
extra_columns = pd.DataFrame({
    "filename": ["mt_note_01.txt", "mt_note_03.txt", "mt_note_05.txt", "mt_note_06.txt",
                 "mt_note_08.txt", "mt_note_09.txt", "mt_note_10.txt"],
    "eligibility": ["CFA", "CND", "CPA", "CFA", "CND", "CPA", "CFA"],
    "orec": ["0", "1", "3", "0", "1", "3", "2"],
    "medicaid": [True, False, True, False, True, True, False],
    "DOB": ["1961-10-12", "1956-05-30", "1963-08-12", "1959-07-24", "1956-03-17", "2003-06-11", "2006-02-14"],
})

df_extra = spark.createDataFrame(extra_columns)
df_extra.show(truncate=False)

+--------------+-----------+----+--------+----------+
|filename      |eligibility|orec|medicaid|DOB       |
+--------------+-----------+----+--------+----------+
|mt_note_01.txt|CFA        |0   |true    |1961-10-12|
|mt_note_03.txt|CND        |1   |false   |1956-05-30|
|mt_note_05.txt|CPA        |3   |true    |1963-08-12|
|mt_note_06.txt|CFA        |0   |false   |1959-07-24|
|mt_note_08.txt|CND        |1   |true    |1956-03-17|
|mt_note_09.txt|CPA        |3   |true    |2003-06-11|
|mt_note_10.txt|CFA        |2   |false   |2006-02-14|
+--------------+-----------+----+--------+----------+



If the notes do not contain the patient's age but a date of birth is available, we can compute the age as follows. Dividing the day difference by 365 yields an approximate, fractional age.

```python
from pyspark.sql import functions as F

df_extra = df_extra.withColumn("DOB", F.to_date(F.col("DOB")))
df_extra = df_extra.withColumn("Age", F.datediff(F.current_date(), F.col("DOB"))/365)
df_extra.show()
```
```text
+--------------+-----------+----+--------+----------+------------------+
|      filename|eligibility|orec|medicaid|       DOB|               Age|
+--------------+-----------+----+--------+----------+------------------+
|mt_note_01.txt|        CFA|   0|    true|1961-10-12| 60.93972602739726|
|mt_note_03.txt|        CND|   1|   false|1956-05-30| 66.31232876712329|
|mt_note_05.txt|        CPA|   3|    true|1963-08-12|59.106849315068494|
|mt_note_06.txt|        CFA|   0|   false|1959-07-24| 63.16164383561644|
|mt_note_08.txt|        CND|   1|    true|1956-03-17| 66.51506849315068|
|mt_note_09.txt|        CPA|   3|    true|2003-06-11| 19.24931506849315|
|mt_note_10.txt|        CFA|   2|   false|2006-02-14|16.567123287671233|
+--------------+-----------+----+--------+----------+------------------+
```
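
Dividing by 365 ignores leap years and returns a fraction. If an integer age in completed years is needed, one option is to compare the calendar dates directly. The sketch below uses plain Python; `age_in_years` is a hypothetical helper and the reference date is an arbitrary example.

```python
from datetime import date


def age_in_years(dob: date, on: date) -> int:
    """Completed years between a date of birth and a reference date."""
    years = on.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years


print(age_in_years(date(1961, 10, 12), date(2022, 9, 20)))  # → 60
```

In Spark, something along the lines of `F.floor(F.months_between(F.current_date(), F.col("DOB")) / 12)` should give a comparable whole-year age, though that variant is an assumption and is not used in this notebook.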

In [None]:
df = df.join(df_extra, on= "filename")

In [None]:
df.show()

+--------------+---+--------------------+-------------------------------+------+-----------+----+--------+----------+
|      filename|Age|          icd10_code|Extracted_Entities_vs_ICD_Codes|Gender|eligibility|orec|medicaid|       DOB|
+--------------+---+--------------------+-------------------------------+------+-----------+----+--------+----------+
|mt_note_03.txt| 66|[C49.9, I26, D61....|           [{leiomyosarcoma,...|     F|        CND|   1|   false|1956-05-30|
|mt_note_01.txt| 59|      [P61.4, C80.1]|           [{dysplasia, P61....|     F|        CFA|   0|    true|1961-10-12|
|mt_note_10.txt| 16|    [C69.60, C69.60]|           [{Rhabdomyosarcom...|     M|        CFA|   2|   false|2006-02-14|
|mt_note_08.txt| 66|   [J90, C45.9, J90]|           [{Description: Ri...|     F|        CND|   1|    true|1956-03-17|
|mt_note_09.txt| 19|[D50.8, D57.0, O9...|           [{Name: Sickle Ce...|     M|        CPA|   3|    true|2003-06-11|
|mt_note_05.txt| 57|[C50.92, C50.91, ...|           [{Br

## Importing the model version

You can import one of the following functions to calculate the score.

```
- profileV22Y17   - profileV23Y18  - profileV24Y17  - profileV28    - profileESRDV21Y19
- profileV22Y18   - profileV23Y19  - profileV24Y18  - profileV28Y24
- profileV22Y19                    - profileV24Y19
- profileV22Y20                    - profileV24Y20
- profileV22Y21                    - profileV24Y21
- profileV22Y22                    - profileV24Y22
                                   - profileV24
```
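
In the next cell the value returned by the profile function is parsed with `F.from_json` and the schema defined earlier, which implies a JSON string with the fields `risk_score`, `hcc_lst`, `parameters` and `details`. The following sketch shows the equivalent parsing in plain Python. The payload is a hypothetical placeholder built from the schema's field names, not real module output.

```python
import json

# Hypothetical payload: field names follow the schema, values are placeholders.
raw_profile = json.dumps({
    "risk_score": 0.196,
    "hcc_lst": "[]",
    "parameters": "{}",
    "details": "{}",
})

FIELDS = ("risk_score", "hcc_lst", "parameters", "details")


def parse_profile(raw: str) -> dict:
    data = json.loads(raw)
    # Missing keys become None, similar to the nulls produced by from_json.
    return {name: data.get(name) for name in FIELDS}


profile = parse_profile(raw_profile)
print(profile["risk_score"])  # → 0.196
```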

In [None]:
df = df.withColumn("hcc_profile", medical.profileV28Y24(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.medicaid))

df = df.withColumn("hcc_profile", F.from_json(F.col("hcc_profile"), schema))
df = df.withColumn("risk_score", df.hcc_profile.getItem("risk_score"))\
       .withColumn("hcc_lst", df.hcc_profile.getItem("hcc_lst"))\
       .withColumn("parameters", df.hcc_profile.getItem("parameters"))\
       .withColumn("details", df.hcc_profile.getItem("details"))

df.select('risk_score','icd10_code', 'Age', 'Gender').show(truncate=False )

df.show(truncate=100, vertical=True)

+----------+----------------------------------------------------------------------------+---+------+
|risk_score|icd10_code                                                                  |Age|Gender|
+----------+----------------------------------------------------------------------------+---+------+
|1.01      |[C49.9, I26, D61.81, J18.9, C49.9, D61.81, M06.9, C44.9, I26, J32.9]        |66 |F     |
|0.196     |[P61.4, C80.1]                                                              |59 |F     |
|0.196     |[C69.60, C69.60]                                                            |16 |M     |
|1.989     |[J90, C45.9, J90]                                                           |66 |F     |
|0.303     |[D50.8, D57.0, O99.0, D57.02]                                               |19 |M     |
|2.166     |[C50.92, C50.91, C50.9, C50.92, C80.0, T78.40, R50.9, G25.81, P39.9, C50.92]|57 |F     |
|0.349     |[C45, D20.1, F31.9, R18, D20.1, F31.9, L72.0]                               |63

# Using a Question Answering Model

In [None]:
sample_texts = ["""Medical Specialty:Hematology - Oncology
Sample Name: Consult - Breast Cancer - 1
Description: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.
(Medical Transcription Sample Report)
CHIEF COMPLAINT:  Left breast cancer.
HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging. She prior to that area, denied any complaints. She had no nipple discharge. No trauma history. She has had been on no estrogen supplementation. She has had no other personal history of breast cancer. Her family history is positive for her mother having breast cancer at age 48. The patient has had no children and no pregnancies. She denies any change in the right breast. Subsequent to the office visit and tissue diagnosis of breast cancer, she has had medical oncology consultation with Dr. X and radiation oncology consultation with Dr. Y. I have discussed the case with Dr. X and Dr. Y, who are both in agreement with proceeding with surgery prior to adjuvant therapy. The patient's metastatic workup has otherwise been negative with MRI scan and CT scanning. The MRI scan showed some close involvement possibly involving the left pectoralis muscle, although thought to also possibly represent biopsy artifact. CT scan of the neck, chest, and abdomen is negative for metastatic disease. PAST MEDICAL HISTORY: Previous surgery is history of benign breast biopsy in 1972, laparotomy in 1981, 1982, and 1984, right oophorectomy in 1984, and ganglion cyst removal of the hand in 1987.
MEDICATIONS: She is currently on omeprazole for reflux and indigestion.
ALLERGIES: SHE HAS NO KNOWN DRUG ALLERGIES.
REVIEW OF SYSTEMS: Negative for any recent febrile illnesses, chest pains or shortness of breath. Positive for restless leg syndrome. Negative for any unexplained weight loss and no change in bowel or bladder habits.
FAMILY HISTORY: Positive for breast cancer in her mother and also mesothelioma from possible asbestosis or asbestos exposure.
SOCIAL HISTORY: The patient works as a school teacher and teaching high school.
PHYSICAL EXAMINATION: GENERAL: The patient is a white female, alert and oriented x 3, appears her stated age of 57.
HEENT: Head is atraumatic and normocephalic. Sclerae are anicteric. NECK: Supple.
CHEST: Clear. HEART: Regular rate and rhythm. BREASTS: Exam reveals an approximately 1.5 cm relatively mobile focal palpable mass in the inferomedial left breast at the 7 o'clock position, which clinically is not fixed to the underlying pectoralis muscle. There are no nipple retractions. No skin dimpling. There is some, at the time of the office visit, ecchymosis from recent biopsy. There is no axillary adenopathy. The remainder of the left breast is without abnormality. The right breast is without abnormality. The axillary areas are negative for adenopathy bilaterally. ABDOMEN: Soft, nontender without masses. No gross organomegaly. No CVA or flank tenderness. EXTREMITIES: Grossly neurovascularly intact.
IMPRESSION:  The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.
RECOMMENDATIONS:  I have discussed with the patient in detail about the diagnosis of breast cancer and the surgical options, and medical oncologist has discussed with her issues about adjuvant or neoadjuvant chemotherapy. We have decided to recommend to the patient breast conservation surgery with left breast lumpectomy with preoperative sentinel lymph node injection and mapping and left axillary dissection. The possibility of further surgery requiring wider lumpectomy or even completion mastectomy was explained to the patient. The procedure and risks of the surgery were explained to include, but not limited to extra bleeding, infection, unsightly scar formation, the possibility of local recurrence, the possibility of left upper extremity lymphedema was explained. Local numbness, paresthesias or chronic pain was explained. The patient was given an educational brochure and several brochures about the diagnosis and treatment of breast cancers. She was certainly encouraged to obtain further surgical medical opinions prior to proceeding. I believe the patient has given full informed consent and desires to proceed with the above.
"""]



In [None]:
new_text = []
questions = {0: ["What is the patient's age?"],
             1: ["What is the patient's gender?"],
             2: ["What is the patient's diagnosis?"],
}

for i in range(3):
    for x in questions[i]:
        new_text.append([x, sample_texts[0]])

example = spark.createDataFrame(new_text).toDF("question", "context")


In [None]:
document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = medical.MedicalQuestionAnswering()\
    .pretrained("clinical_notes_qa_base", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("Context: {context} \n Question: {question} \n Answer: ")\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, med_qa])


result = pipeline.fit(example).transform(example)

clinical_notes_qa_base download started this may take some time.
[OK!]


In [None]:
df = result.selectExpr("document_question.result as Question", "answer.result as Answer")

# Convert the array of answers to a single string
df = df.withColumn("Answer", F.concat_ws(" ", df["Answer"]))

# Add a shared "filename" key so that dataframes produced in later steps can be combined on it
df = df.withColumn("filename", F.lit("text_01"))

df.show(truncate=False)

+----------------------------------+------------------------------------------------------------------------------------------------+--------+
|Question                          |Answer                                                                                          |filename|
+----------------------------------+------------------------------------------------------------------------------------------------+--------+
|[What is the patient's age?]      |The patient is 57 years old.                                                                    |text_01 |
|[What is the patient's gender?]   |The patient is a white female.                                                                  |text_01 |
|[What is the patient's diagnosis?]|The patient has invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.|text_01 |
+----------------------------------+------------------------------------------------------------------------------------------------+--------+
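
The age and gender answers are free text, so they still need to be turned into structured values before they can feed a risk score. One possible approach, which is an assumption and not part of this notebook, is a small rule-based parser. `parse_age` and `parse_gender` are hypothetical helpers.

```python
import re


def parse_age(answer: str):
    # Take the first standalone number of up to three digits, if any.
    match = re.search(r"\b(\d{1,3})\b", answer)
    return int(match.group(1)) if match else None


def parse_gender(answer: str):
    text = answer.lower()
    # Check the female terms first; word boundaries keep "male" from matching inside "female".
    if re.search(r"\b(female|woman)\b", text):
        return "F"
    if re.search(r"\b(male|man)\b", text):
        return "M"
    return None


print(parse_age("The patient is 57 years old."))       # → 57
print(parse_gender("The patient is a white female."))  # → F
```

Rules like these are brittle on unexpected phrasing, so returning `None` for anything unrecognised leaves room for a manual check.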

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc")

clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Oncological", "Disease_Syndrome_Disorder", "Heart_Disease"])

sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

icd10_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)

resolver_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10_resolver
    ])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

icd_model = resolver_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[OK!]


In [None]:
icd10_sdf = icd_model.transform(df)

In [None]:
# Zip chunks, resolved codes and chunk metadata, then emit one row per extracted entity
icd10_df = icd10_sdf.select("filename", F.explode(F.arrays_zip(icd10_sdf.ner_chunk.result,
                                                               icd10_sdf.icd10cm_code.result,
                                                               icd10_sdf.ner_chunk.metadata)).alias("cols")) \
                    .select("filename",
                            F.expr("cols['0']").alias("chunk"),
                            F.expr("cols['1']").alias("icd10_code"),
                            F.expr("cols['2']['entity']").alias("entity")).toPandas()

icd10_df.head()

Unnamed: 0,filename,chunk,icd10_code,entity
0,text_01,ductal carcinoma of the left breast,C50.91,Oncological
1,text_01,breast carcinoma,C50.9,Oncological


In [None]:
icd10_df['Extracted_Entities_vs_ICD_Codes'] = list(zip(icd10_df.chunk, icd10_df.icd10_code))

In [None]:
icd10_codes= icd10_df.groupby("filename").icd10_code.apply(lambda x: list(x)).reset_index()
icd10_vs_entities = icd10_df.groupby("filename").Extracted_Entities_vs_ICD_Codes.apply(lambda x: list(x)).reset_index()

icd10_df_all = icd10_codes.merge(icd10_vs_entities)

icd10_df_all

Unnamed: 0,filename,icd10_code,Extracted_Entities_vs_ICD_Codes
0,text_01,"[C50.91, C50.9]","[(ductal carcinoma of the left breast, C50.91)..."


## Gender Classification

Spark NLP provides a pretrained model for detecting a patient's gender. We load it with `ClassifierDLModel`.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

biobert_embeddings = nlp.BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
        .setInputCols(["document",'token'])\
        .setOutputCol("bert_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")

genderClassifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["sentence_bert_embeddings"]) \
       .setOutputCol("gender")

gender_pipeline = nlp.Pipeline(stages=[documentAssembler,
                                   #sentenceDetector,
                                   tokenizer,
                                   biobert_embeddings,
                                   sentence_embeddings,
                                   genderClassifier])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

gender_model = gender_pipeline.fit(data_ner)

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[OK!]


In [None]:
# Concatenate all answers for each file into a single text
concatenated_text_df = df.groupBy("filename").agg(F.concat_ws(" ", F.collect_list("Answer")).alias("Answer"))

concatenated_text_df.show(truncate=False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filename|Answer                                                                                                                                                      |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text_01 |The patient is 57 years old. The patient is a white female. The patient has invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.|
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
gender_df = gender_model.transform(concatenated_text_df)

gender_pd_df = gender_df.select("filename", F.explode(F.arrays_zip(gender_df.gender.result,
                                                                   gender_df.gender.metadata)).alias("cols")) \
                       .select("filename",F.expr("cols['0']").alias("Gender"),
                               F.expr("cols['1']['Female']").alias("Female"),
                               F.expr("cols['1']['Male']").alias("Male")).toPandas()

gender_pd_df['Gender'] = gender_pd_df.apply(lambda x : "F" if float(x['Female']) >= float(x['Male']) else "M", axis=1)

gender_pd_df = gender_pd_df[['filename', 'Gender']]

The gender of each patient is now available in a DataFrame.

In [None]:
gender_pd_df

Unnamed: 0,filename,Gender
0,text_01,F
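
The cell that builds `gender_pd_df` overrides the predicted label by comparing the two class scores stored in the classifier metadata. A minimal plain-Python sketch of that rule follows. `pick_gender` is a hypothetical helper, and the scores are made up for illustration rather than taken from the model output.

```python
def pick_gender(metadata: dict) -> str:
    """Return "F" when the Female score is at least the Male score, else "M"."""
    # The notebook casts these values with float(), so the sketch does the same.
    female = float(metadata["Female"])
    male = float(metadata["Male"])
    return "F" if female >= male else "M"


# Illustrative (made-up) scores, not real model output
print(pick_gender({"Female": "0.97", "Male": "0.03"}))
print(pick_gender({"Female": "0.50", "Male": "0.50"}))  # a tie resolves to "F"
```

Because the comparison uses `>=`, a tie is assigned to "F". This is a property of the rule as written, not a clinical judgment.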


## Age
We can get the patient's age from the notes with another pipeline that extracts entities labelled `Age`. A note may contain more than one age entity; we take the first one as the patient's age.
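
The "first age entity" rule can be sketched in plain Python, independently of Spark. `first_age` is a hypothetical helper, not part of the library, and it assumes each extracted chunk is a string such as "57 years old". It returns the number from the first chunk that contains digits.

```python
import re
from typing import Iterable, Optional


def first_age(chunks: Iterable[str]) -> Optional[int]:
    """Return the integer found in the first chunk that contains digits, or None."""
    for chunk in chunks:
        match = re.search(r"\d+", chunk)  # first run of digits in this chunk
        if match:
            return int(match.group())
    return None


print(first_age(["57 years old", "62-year-old"]))
```

Searching for the first run of digits, instead of stripping every non-digit character, keeps a chunk that contains two numbers from being fused into one.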

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

date_ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Age"])

age_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        date_ner_converter
    ])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

age_model = age_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_enriched download started this may take some time.
[OK!]


In [None]:
age_result = age_model.transform(concatenated_text_df)

age_df = age_result.select("filename",F.explode(F.arrays_zip(age_result.ner_chunk.result,
                                                             age_result.ner_chunk.metadata)).alias("cols")) \
                   .select("filename",F.expr("cols['0']").alias("Age"),
                           F.expr("cols['1']['entity']").alias("ner_label")).toPandas()

In [None]:
age_df

Unnamed: 0,filename,Age,ner_label
0,text_01,57 years old,Age


In [None]:
age_df.Age = age_df.Age.replace(r"\D", "", regex = True).astype(int)
age_df.drop('ner_label', axis=1, inplace=True)
age_df.head()

Unnamed: 0,filename,Age
0,text_01,57


## Calculating Medicare Risk Adjustment Score
We now have all the data that can be extracted from the clinical notes, so we can calculate the Medicare Risk Adjustment Score with the CMS-HCC risk-adjustment score calculation module of Spark NLP for Healthcare.

This module supports V22, V23, V24, V28 and ESRDV21 of the CMS-HCC risk adjustment model.

It needs the following parameters in order to calculate the risk score:

- ICD Codes
- Age
- Gender
- The eligibility segment of the patient
- The original reason for entitlement
- Whether the patient is in Medicaid
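
To make the six inputs concrete, here is a minimal sketch of a record that holds them. `RiskInput` is a hypothetical container used only for illustration, and its checks are basic sanity checks, not the CMS validation rules. The sample values are the ones used for `text_01` in this notebook.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RiskInput:
    """Hypothetical container for the six inputs listed above."""
    icd_codes: List[str]   # ICD-10-CM codes, e.g. ["C50.91", "C50.9"]
    age: int               # age in years
    gender: str            # "F" or "M", as used in this notebook
    eligibility: str       # eligibility segment code, e.g. "INS"
    orec: str              # original reason for entitlement code, e.g. "0"
    medicaid: bool         # True if the patient is in Medicaid

    def __post_init__(self) -> None:
        # Basic sanity checks only; these are not the CMS validation rules.
        if not self.icd_codes:
            raise ValueError("at least one ICD code is required")
        if self.age < 0:
            raise ValueError("age must be non-negative")
        if self.gender not in ("F", "M"):
            raise ValueError("gender must be 'F' or 'M'")


patient = RiskInput(["C50.91", "C50.9"], 57, "F", "INS", "0", False)
print(patient)
```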


In [None]:
patient_df = age_df.merge(icd10_df_all, on='filename', how = "left")\
                   .merge(gender_pd_df, on='filename', how = "left")

patient_df = patient_df.dropna()

In [None]:
patient_df

Unnamed: 0,filename,Age,icd10_code,Extracted_Entities_vs_ICD_Codes,Gender
0,text_01,57,"[C50.91, C50.9]","[(ductal carcinoma of the left breast, C50.91)...",F


In [None]:
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   filename                         1 non-null      object
 1   Age                              1 non-null      int64 
 2   icd10_code                       1 non-null      object
 3   Extracted_Entities_vs_ICD_Codes  1 non-null      object
 4   Gender                           1 non-null      object
dtypes: int64(1), object(4)
memory usage: 48.0+ bytes


In [None]:
df = spark.createDataFrame(patient_df)
df.show(truncate=False)

+--------+---+---------------+--------------------------------------------------------------------------+------+
|filename|Age|icd10_code     |Extracted_Entities_vs_ICD_Codes                                           |Gender|
+--------+---+---------------+--------------------------------------------------------------------------+------+
|text_01 |57 |[C50.91, C50.9]|[{ductal carcinoma of the left breast, C50.91}, {breast carcinoma, C50.9}]|F     |
+--------+---+---------------+--------------------------------------------------------------------------+------+



In [None]:
from pyspark.sql.types import MapType, IntegerType, DoubleType, StringType, StructType, StructField, FloatType
import pyspark.sql.functions as f

schema = StructType([
            StructField('risk_score', FloatType()),
            StructField('hcc_lst', StringType()),
            StructField('parameters', StringType()),
            StructField('details', StringType())])

In [None]:
extra_columns = pd.DataFrame({"filename": ["text_01"],
                              "eligibility": ["INS"],
                              "orec": ["0"],
                              "medicaid": [False]})

df_extra = spark.createDataFrame(extra_columns)
df_extra.show(truncate=False)

+--------+-----------+----+--------+
|filename|eligibility|orec|medicaid|
+--------+-----------+----+--------+
|text_01 |INS        |0   |false   |
+--------+-----------+----+--------+



In [None]:
df = df.join(df_extra, on= "filename")

In [None]:
df.show()

+--------+---+---------------+-------------------------------+------+-----------+----+--------+
|filename|Age|     icd10_code|Extracted_Entities_vs_ICD_Codes|Gender|eligibility|orec|medicaid|
+--------+---+---------------+-------------------------------+------+-----------+----+--------+
| text_01| 57|[C50.91, C50.9]|           [{ductal carcinom...|     F|        INS|   0|   false|
+--------+---+---------------+-------------------------------+------+-----------+----+--------+



## Importing the model version

You can import one of the following functions to calculate the score.

```
- profileV22Y17   - profileV23Y18  - profileV24Y17  - profileV28    - profileESRDV21Y19
- profileV22Y18   - profileV23Y19  - profileV24Y18  - profileV28Y24
- profileV22Y19                    - profileV24Y19
- profileV22Y20                    - profileV24Y20
- profileV22Y21                    - profileV24Y21
- profileV22Y22                    - profileV24Y22
                                   - profileV24
```
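
The names above follow a `profile<version>[Y<year>]` pattern. The sketch below builds a name from a model version and an optional year and checks it against the list. `profile_function_name` is a hypothetical helper, and it assumes the `Y` suffix denotes a two-digit year.

```python
from typing import Optional

# Names copied from the list above
AVAILABLE_PROFILES = {
    "profileV22Y17", "profileV22Y18", "profileV22Y19",
    "profileV22Y20", "profileV22Y21", "profileV22Y22",
    "profileV23Y18", "profileV23Y19",
    "profileV24Y17", "profileV24Y18", "profileV24Y19",
    "profileV24Y20", "profileV24Y21", "profileV24Y22", "profileV24",
    "profileV28", "profileV28Y24",
    "profileESRDV21Y19",
}


def profile_function_name(version: str, year: Optional[int] = None) -> str:
    """Build a profile function name and verify that it is in the list above."""
    suffix = "" if year is None else f"Y{year % 100:02d}"
    name = f"profile{version}{suffix}"
    if name not in AVAILABLE_PROFILES:
        raise ValueError(f"unsupported combination: {name}")
    return name


print(profile_function_name("V22", 2017))  # profileV22Y17
print(profile_function_name("V24"))        # profileV24
```

With the library installed, the returned name could then be looked up on the `sparknlp_jsl.functions` module with `getattr`, which avoids hard-coding a single import when comparing model versions.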

In [None]:
from sparknlp_jsl.functions import profileV22Y17

In [None]:
# Compute the HCC profile for each row
df = df.withColumn("hcc_profile", profileV22Y17(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.medicaid))

# Parse the JSON result with the schema defined earlier and expand its fields into columns
df = df.withColumn("hcc_profile", F.from_json(F.col("hcc_profile"), schema))
df = df.withColumn("risk_score", df.hcc_profile.getItem("risk_score"))\
       .withColumn("hcc_lst", df.hcc_profile.getItem("hcc_lst"))\
       .withColumn("parameters", df.hcc_profile.getItem("parameters"))\
       .withColumn("details", df.hcc_profile.getItem("details"))

df.select('risk_score', 'icd10_code', 'Age', 'Gender').show(truncate=False)

df.show(truncate=100, vertical=True)

+----------+---------------+---+------+
|risk_score|icd10_code     |Age|Gender|
+----------+---------------+---+------+
|0.986     |[C50.91, C50.9]|57 |F     |
+----------+---------------+---+------+

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------
 filename                        | text_01                                                                                              
 Age                             | 57                                                                                                   
 icd10_code                      | [C50.91, C50.9]                                                                                      
 Extracted_Entities_vs_ICD_Codes | [{ductal carcinoma of the left breast, C50.91}, {breast carcinoma, C50.9}]                           
 Gender                          | F                                                                              