## Medicare Risk Adjustment
In the United States, the Centers for Medicare & Medicaid Services (CMS) sets reimbursement for private Medicare plan sponsors based on the assessed risk of their beneficiaries. Information found in unstructured medical records may be more indicative of member risk than existing structured data alone, which can lead to more accurate risk pools.

#### Initial configurations

In [0]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
#nlp.install()

In [0]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL

import os
import json
import string
import numpy as np
import pandas as pd

from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)

spark



## Download oncology notes

In this notebook, we will use clinical notes extracted from www.mtsamples.com.

Let's create the folder in which we will store the notes.

In [0]:
notes_path='/FileStore/HLS/nlp/data/'
delta_path='/FileStore/HLS/nlp/delta/jsl/'

dbutils.fs.mkdirs(notes_path)
os.environ['notes_path']=f'/dbfs{notes_path}'
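
The `/dbfs` prefix matters because `%sh` cells run on the driver's local file system, where DBFS is exposed under the `/dbfs` mount, while `dbutils.fs` takes DBFS paths directly. A minimal sketch of that mapping follows; the helper name `dbfs_to_local` is hypothetical and not part of the notebook or of Databricks.

```python
def dbfs_to_local(path: str) -> str:
    """Map a DBFS path to the equivalent path under the local /dbfs mount.

    Hypothetical helper for illustration only.
    """
    # Drop the optional "dbfs:" scheme used in dbutils.fs listings.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    # Make sure the remainder is absolute before prefixing the mount point.
    if not path.startswith("/"):
        path = "/" + path
    return "/dbfs" + path


print(dbfs_to_local("/FileStore/HLS/nlp/data/"))
print(dbfs_to_local("dbfs:/FileStore/HLS/nlp/data/mt_oncology_10"))
```

The first call reproduces the value that the cell above stores in the `notes_path` environment variable.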

In [0]:
%sh
cd $notes_path
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_oncology_10.zip
unzip -o mt_oncology_10.zip

Archive:  mt_oncology_10.zip
  inflating: mt_oncology_10/mt_note_02.txt  
  inflating: mt_oncology_10/mt_note_03.txt  
  inflating: mt_oncology_10/mt_note_01.txt  
  inflating: mt_oncology_10/mt_note_10.txt  
  inflating: mt_oncology_10/mt_note_04.txt  
  inflating: __MACOSX/mt_oncology_10/._.DS_Store  
  inflating: mt_oncology_10/mt_note_05.txt  
  inflating: mt_oncology_10/mt_note_07.txt  
  inflating: mt_oncology_10/mt_note_06.txt  
  inflating: mt_oncology_10/mt_note_08.txt  
  inflating: mt_oncology_10/mt_note_09.txt  


In [0]:
dbutils.fs.ls(f'{notes_path}/mt_oncology_10')

Out[5]: [FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_01.txt', name='mt_note_01.txt', size=1371, modificationTime=1700485110000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_02.txt', name='mt_note_02.txt', size=1274, modificationTime=1700485110000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_03.txt', name='mt_note_03.txt', size=3699, modificationTime=1700485110000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_04.txt', name='mt_note_04.txt', size=8178, modificationTime=1700485111000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_05.txt', name='mt_note_05.txt', size=4707, modificationTime=1700485111000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_06.txt', name='mt_note_06.txt', size=4817, modificationTime=1700485111000),
 FileInfo(path='dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_07.txt', name='mt_note_07.txt', size=1727, modificationTime=

## Read Data and Write to Bronze Delta Layer

There are 10 clinical notes in the downloaded archive. We read them and write the raw note text to a bronze Delta table.

In [0]:
df = sc.wholeTextFiles(f'{notes_path}/mt_oncology_10/mt_note_*.txt').toDF().withColumnRenamed('_1','path').withColumnRenamed('_2','text')
display(df.limit(5))

path,text
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_01.txt,"Medical Specialty:Hematology - Oncology Sample Name: BRCA-2 mutation Description: Discharge summary of a patient with a BRCA-2 mutation. (Medical Transcription Sample Report) DISCHARGE DIAGNOSES: BRCA-2 mutation. HISTORY OF PRESENT ILLNESS: The patient is a 59-year-old with a BRCA-2 mutation. Her sister died of breast cancer at age 32 and her daughter had breast cancer at age 27. PHYSICAL EXAMINATION: The chest was clear. The abdomen was nontender. Pelvic examination shows no masses. No heart murmur. HOSPITAL COURSE: The patient underwent surgery on the day of admission. In the postoperative course she was afebrile and unremarkable. The patient regained bowel function and was discharged on the morning of the fourth postoperative day. OPERATIONS AND PROCEDURES: Total abdominal hysterectomy/bilateral salpingo-oophorectomy with resection of ovarian fossa peritoneum en bloc on July 25, 2006. PATHOLOGY: A 105-gram uterus without dysplasia or cancer. CONDITION ON DISCHARGE: Stable. PLAN: The patient will remain at rest initially with progressive ambulation after. She will avoid lifting, driving or intercourse. She will call me if any fevers, drainage, bleeding, or pain. Follow up in my office in four weeks. Family history, social history, psychosocial needs per the social worker. DISCHARGE MEDICATIONS: Percocet 5 #40 one every 3 hours p.r.n. pain."
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_02.txt,"Medical Specialty:Hematology - Oncology Sample Name: Mullerian Adenosarcoma Description: Discharge summary of a patient presenting with a large mass aborted through the cervix. (Medical Transcription Sample Report) PRINCIPAL DIAGNOSIS: Mullerian adenosarcoma. HISTORY OF PRESENT ILLNESS: The patient is a 56-year-old presenting with a large mass aborted through the cervix. PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender. PELVIC: There is a large mass in the vagina. HOSPITAL COURSE: The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day. OPERATIONS: July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy. DISCHARGE CONDITION: Stable. PLAN: The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She will call me for fevers, drainage, bleeding, or pain. Family history, social history, and psychosocial needs per the social worker. The patient will follow up in my office in one week. PATHOLOGY: Mullerian adenosarcoma. MEDICATIONS: Percocet 5, #40, one q.3 h. p.r.n. pain."
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_03.txt,"Medical Specialty:Hematology - Oncology Sample Name: Leiomyosarcoma Description: Discharge summary of patient with leiomyosarcoma and history of pulmonary embolism, subdural hematoma, pancytopenia, and pneumonia. (Medical Transcription Sample Report) ADMITTING DIAGNOSES:1. Leiomyosarcoma.2. History of pulmonary embolism.3. History of subdural hematoma.4. Pancytopenia.5. History of pneumonia. PROCEDURES DURING HOSPITALIZATION:1. Cycle six of CIVI-CAD (Cytoxan, Adriamycin, and DTIC) from 07/22/2008 to 07/29/2008.2. CTA, chest PE study showing no evidence for pulmonary embolism. 3. Head CT showing no evidence of acute intracranial abnormalities.4. Sinus CT, normal mini-CT of the paranasal sinuses. HISTORY OF PRESENT ILLNESS: Ms. ABC is a pleasant 66-year-old Caucasian female who first palpated a mass in the left posterior arm in spring of 2007. The mass increased in size and she was seen by her primary care physician and referred to orthopedic surgeon. MRI showed inflammation and was thought to be secondary to rheumatoid arthritis. The mass increased in size. She eventually underwent a partial resection found to have pathologic grade 2 leiomyosarcoma, margins were impossible to assess, but were likely positive. She was evaluated by Dr. X and Dr. Y and a decision was made to proceed with preoperative chemotherapy. She began treatment with CIVI-CAD in December 2007. Her course was complicated by pulmonary embolus, pneumonia, and subdural hematoma while on anticoagulation. She eventually underwent surgical resection on May 1, 2008 with small area of residual disease, but otherwise clear margins. HOSPITAL COURSE:1. Leiomyosarcoma, the patient was admitted to Hem/Onco B Service under attending Dr. XYZ for cycle six of continuous IV infusion Cytoxan, Adriamycin, and DTIC, which she tolerated well.2. History of pulmonary embolism. 
Upon admission, the patient reported an approximate two-week history of dyspnea on exertion and some mild chest pain. She underwent a CTA, which showed no evidence of pulmonary embolism and the patient was started on prophylactic doses of Lovenox at 40 mg a day. She had no further complaints throughout the hospitalization with any shortness of breath or chest pain.3. History of subdural hematoma, also on admission the patient noted some mild intermittent headaches that were fleeting in nature, several a day that would resolve on their own. Her headaches were not responding to pain medication and so on 07/24/2008, we obtained a head CT that showed no evidence of acute intracranial abnormalities. The patient also had a history of sinusitis and so a sinus CT scan was obtained, which was normal.4. Pancytopenia. On admission, the patient's white blood count was 3.4, hemoglobin 11.3, platelet count 82, and ANC of 2400. The patient's counts were followed throughout admission. She did not require transfusion of red blood cells or platelets; however, on 07/26/2008 her ANC did dip to 900 and she was placed on neutropenic diet. At discharge her ANC is back up to 1100 and she is taken off neutropenic diet. Her white blood cell count at discharge was 1.4 and her hemoglobin was 11.2 with a platelet count of 140. 5. History of pneumonia. During admission, the patient did not exhibit any signs or symptoms of pneumonia. DISPOSITION: Home in stable condition. DIET: Regular and less neutropenic. ACTIVITY: Resume same activity. FOLLOWUP: The patient will have lab work at Dr. XYZ on 08/05/2008 and she will also return to the cancer center on 08/12/2008 at 10:20 a.m. The patient is also advised to monitor for any fevers greater than 100.5 and should she have any further problems in the meantime to please call in to be seen sooner."
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_04.txt,"Medical Specialty:Hematology - Oncology Sample Name: BCCa Excision - Cheek Description: Excision of basal cell carcinoma. Closure complex, open wound. Bilateral capsulectomies. Bilateral explantation and removal of ruptured silicone gel implants (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSES1. Basal cell carcinoma, right cheek.2. Basal cell carcinoma, left cheek.3. Bilateral ruptured silicone gel implants.4. Bilateral Baker grade IV capsular contracture.5. Breast ptosis. POSTOPERATIVE DIAGNOSES1. Basal cell carcinoma, right cheek.2. Basal cell carcinoma, left cheek. 3. Bilateral ruptured silicone gel implants.4. Bilateral Baker grade IV capsular contracture.5. Breast ptosis. PROCEDURE1. Excision of basal cell carcinoma, right cheek, 2.7 cm x 1.5 cm.2. Excision of basal cell carcinoma, left cheek, 2.3 x 1.5 cm.3. Closure complex, open wound utilizing local tissue advancement flap, right cheek.4. Closure complex, open wound, left cheek utilizing local tissue advancement flap.5. Bilateral explantation and removal of ruptured silicone gel implants. 6. Bilateral capsulectomies.7. Replacement with bilateral silicone gel implants, 325 cc. INDICATIONS FOR PROCEDURESThe patient is a 61-year-old woman who presents with a history of biopsy-proven basal cell carcinoma, right and left cheek. She had no prior history of skin cancer. She is status post bilateral cosmetic breast augmentation many years ago and the records are not available for this procedure. She has noted progressive hardening and distortion of the implant. She desires to have the implants removed, capsulectomy and replacement of implants. She would like to go slightly smaller than her current size as she has ptosis going with a smaller implant combined with capsulectomy will result in worsening of her ptosis. She may require a lift. She is not consenting to lift due to the surgical scars. 
PAST MEDICAL HISTORYSignificant for deep venous thrombosis and acid reflux. PAST SURGICAL HISTORY Significant for appendectomy, colonoscopy and BAM. MEDICATIONS: Coumadin. She stopped her Coumadin five days prior to the procedures.2. Lipitor3. Effexor.4. Klonopin. ALLERGIES: None. REVIEW OF SYSTEMS: Negative for dyspnea on exertion, palpitations, chest pain, and phlebitis. PHYSICAL EXAMINATION: VITAL SIGNS: Height 5'8"", weight 155 pounds. FACE: Examination of the face demonstrates basal cell carcinoma, right and left cheek. No lesions are noted in the regional lymph node base and no mass is appreciated. BREAST: Examination of the breast demonstrates bilateral grade IV capsular contracture. She has asymmetry in distortion of the breast. No masses are appreciated in the breast or the axilla. The implants appear to be subglandular. CHEST: Clear to auscultation and percussion. CARDIOVASCULAR: Regular rate and rhythm. EXTREMITIES: Show full range of motion. No clubbing, cyanosis or edema. SKIN: Significant environmental actinic skin damage.I recommended excision of basal cell cancers with frozen section control of the margin, closure will require local tissue flaps. I recommended exchange of the implants with reaugmentation. No final size is guaranteed or implied. We will decrease the size of the implants based on the intraoperative findings as the size is not known. Several options are available. Sizer implants will be placed to best estimate postoperative size. Ptosis will be worse following capsulectomy and going with a smaller implant. She may require a lift in the future. We have obtained preoperative clearance from the patient's cardiologist, Dr. K. The patient has been taken off Coumadin for five days and will be placed back on Coumadin the day after the surgery. The risk of deep venous thrombosis is discussed. 
Other risk including bleeding, infection, allergic reaction, pain, scarring, hypertrophic scarring and poor cosmetic resolve, worsening of ptosis, exposure, extrusion, the rupture of the implants, numbness of the nipple-areolar complex, hematoma, need for additional surgery, recurrent capsular contracture and recurrence of the skin cancer was all discussed, which she understands and informed consent is obtained. PROCEDURE IN DETAILAfter appropriate informed consent was obtained, the patient was placed in the preoperative holding area with **** input. She was then taken to the major operating room with ABCD Surgery Center, placed in a supine position. Intravenous antibiotics were given. TED hose and SCDs were placed. After the induction of adequate general endotracheal anesthesia, she was prepped and draped in the usual sterile fashion. Sites for excision and skin cancers were carefully marked with 5 mm margin. These were injected with 1% lidocaine with epinephrine.After allowing adequate time for basal constriction hemostasis, excision was performed, full thickness of the skin. They were tagged at the 12 o'clock position and sent for frozen section. Hemostasis was achieved using electrocautery. Once margins were determined to be free of involvement, local tissue flaps were designed for advancement. Undermining was performed. Hemostasis was achieved using electrocautery. Closure was performed under moderate tension with interrupted 5-0 Vicryl. Skin was closed under loop magnification paying meticulous attention and cosmetic details with 6-0 Prolene. Attention was then turned to the breast, clothes were changed, gloves were changed, incision was planned and the previous inframammary incision beginning on the right incision was made. Dissection was carried down to the capsule. It was extremely calcified. Dissection of the anterior surface of the capsule was performed. 
The implant was subglandular, the capsule was entered, implant was noted to be grossly intact; however, there was free silicone. Implant was removed and noted to be ruptured. No marking as to the size of the implant was found.Capsulectomy was performed leaving a small portion in the axilla in the inframammary fold. Pocket was modified to medialize the implant by placing 2-0 Prolene laterally in mattress sutures to restrict the pocket. In identical fashion, capsulectomy was performed on the left. Implant was noted to be grossly ruptured. No marking was found for the size of the implant. The entire content was weighed and found to be 350 grams. Right side was weighed and noted to be 338 grams, although some silicone was lost in the transfer and most likely was identical 350 grams. The implants appeared to be double lumen with the saline portion deflated. Completion of the capsulectomy was performed on the left.The pocket was again fashioned to improve symmetry with the right with Prolene. Pockets were thoroughly irrigated. Hemostasis was achieved using electrocautery, checked for symmetry, which is determined to be excellent. Several liters of normal saline were utilized to irrigate the pocket. Hemostasis was determined to be excellent. Drains were placed. 2-0 Vicryl sutures were preplaced. Pockets were checked for hemostasis, irrigated with normal saline, sizing was performed with placement of 275 cc implants. She was placed in a sitting position. This significantly worsened ptosis and for this reason, 325 cc implants were chosen. She is placed back in a supine position. Pockets were irrigated with antibiotic solution, 2-0 Vicryl sutures were preplaced and gloves were changed. The patient was reprepped with Betadine, towels were changed.These were soaked in antibiotic solution. Gloves were changed, gowns were changed, the patient was reprepped, towels were changed and implants were placed. The patient was placed in a sitting position. Symmetry was excellent. 
There was noted to be a decrease of volume of approximately 40 cc from the capsulectomy as well as the additional reduction of 25 cc. Ptosis was slightly worse; however, excellent shape of the breast. She was placed back in a supine position, preplaced 2-0 Vicryl sutures were tied; a second layer subcutaneous of 3-0 Vicryl was placed followed by a third of 4-0 Vicryl. The skin was closed with a running 4-0 Prolene. Drains were secured. All sponge and needle counts were correct. Dressing was applied. COMPLICATIONSNone. DISPOSITIONTo recovery room."
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_05.txt,"Medical Specialty:Hematology - Oncology Sample Name: Consult - Breast Cancer - 1 Description: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma. (Medical Transcription Sample Report) CHIEF COMPLAINT: Left breast cancer. HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging. She prior to that area, denied any complaints. She had no nipple discharge. No trauma history. She has had been on no estrogen supplementation. She has had no other personal history of breast cancer. Her family history is positive for her mother having breast cancer at age 48. The patient has had no children and no pregnancies. She denies any change in the right breast. Subsequent to the office visit and tissue diagnosis of breast cancer, she has had medical oncology consultation with Dr. X and radiation oncology consultation with Dr. Y. I have discussed the case with Dr. X and Dr. Y, who are both in agreement with proceeding with surgery prior to adjuvant therapy. The patient's metastatic workup has otherwise been negative with MRI scan and CT scanning. The MRI scan showed some close involvement possibly involving the left pectoralis muscle, although thought to also possibly represent biopsy artifact. CT scan of the neck, chest, and abdomen is negative for metastatic disease. 
PAST MEDICAL HISTORY: Previous surgery is history of benign breast biopsy in 1972, laparotomy in 1981, 1982, and 1984, right oophorectomy in 1984, and ganglion cyst removal of the hand in 1987. MEDICATIONS: She is currently on omeprazole for reflux and indigestion. ALLERGIES: SHE HAS NO KNOWN DRUG ALLERGIES. REVIEW OF SYSTEMS: Negative for any recent febrile illnesses, chest pains or shortness of breath. Positive for restless leg syndrome. Negative for any unexplained weight loss and no change in bowel or bladder habits. FAMILY HISTORY: Positive for breast cancer in her mother and also mesothelioma from possible asbestosis or asbestos exposure. SOCIAL HISTORY: The patient works as a school teacher and teaching high school. PHYSICAL EXAMINATION: GENERAL: The patient is a white female, alert and oriented x 3, appears her stated age of 57. HEENT: Head is atraumatic and normocephalic. Sclerae are anicteric. NECK: Supple. CHEST: Clear. HEART: Regular rate and rhythm. BREASTS: Exam reveals an approximately 1.5 cm relatively mobile focal palpable mass in the inferomedial left breast at the 7 o'clock position, which clinically is not fixed to the underlying pectoralis muscle. There are no nipple retractions. No skin dimpling. There is some, at the time of the office visit, ecchymosis from recent biopsy. There is no axillary adenopathy. The remainder of the left breast is without abnormality. The right breast is without abnormality. The axillary areas are negative for adenopathy bilaterally. ABDOMEN: Soft, nontender without masses. No gross organomegaly. No CVA or flank tenderness. EXTREMITIES: Grossly neurovascularly intact. IMPRESSION: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma. RECOMMENDATIONS: I have discussed with the patient in detail about the diagnosis of breast cancer and the surgical options, and medical oncologist has discussed with her issues about adjuvant or neoadjuvant chemotherapy. 
We have decided to recommend to the patient breast conservation surgery with left breast lumpectomy with preoperative sentinel lymph node injection and mapping and left axillary dissection. The possibility of further surgery requiring wider lumpectomy or even completion mastectomy was explained to the patient. The procedure and risks of the surgery were explained to include, but not limited to extra bleeding, infection, unsightly scar formation, the possibility of local recurrence, the possibility of left upper extremity lymphedema was explained. Local numbness, paresthesias or chronic pain was explained. The patient was given an educational brochure and several brochures about the diagnosis and treatment of breast cancers. She was certainly encouraged to obtain further surgical medical opinions prior to proceeding. I believe the patient has given full informed consent and desires to proceed with the above."


In [0]:
df.write.format('delta').mode('overwrite').save(f'{delta_path}/bronze/mt-oc-notes')
display(dbutils.fs.ls(f'{delta_path}/bronze/mt-oc-notes'))

path,name,size,modificationTime
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/_delta_log/,_delta_log/,0,1700485143000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0362e841-51bc-4a6e-b208-98e304a609e6-c000.snappy.parquet,part-00000-0362e841-51bc-4a6e-b208-98e304a609e6-c000.snappy.parquet,14465,1677436344000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0813bb9f-ab15-49ab-9a88-422df69bd614-c000.snappy.parquet,part-00000-0813bb9f-ab15-49ab-9a88-422df69bd614-c000.snappy.parquet,14465,1694720522000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0a7a9d5d-3e07-47b3-be75-e72b8d5ca574-c000.snappy.parquet,part-00000-0a7a9d5d-3e07-47b3-be75-e72b8d5ca574-c000.snappy.parquet,14465,1670935010000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0d47db17-d9cd-409b-ac8f-f3db8dbd813d-c000.snappy.parquet,part-00000-0d47db17-d9cd-409b-ac8f-f3db8dbd813d-c000.snappy.parquet,14465,1670889292000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0f791bf4-bdfc-433b-ac6e-d5728dc9e7ec-c000.snappy.parquet,part-00000-0f791bf4-bdfc-433b-ac6e-d5728dc9e7ec-c000.snappy.parquet,14465,1692343864000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-0fbbef16-85cf-4966-a7cc-8c71ff6abcf3-c000.snappy.parquet,part-00000-0fbbef16-85cf-4966-a7cc-8c71ff6abcf3-c000.snappy.parquet,16863,1672316882000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-18b61779-cba8-4d14-8690-5a3e2e816a79-c000.snappy.parquet,part-00000-18b61779-cba8-4d14-8690-5a3e2e816a79-c000.snappy.parquet,14982,1671462255000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-21ed556d-0570-4172-b707-5d242be2fbc5-c000.snappy.parquet,part-00000-21ed556d-0570-4172-b707-5d242be2fbc5-c000.snappy.parquet,16863,1672621045000
dbfs:/FileStore/HLS/nlp/delta/jsl/bronze/mt-oc-notes/part-00000-27aa4670-5abf-4ef8-a878-9c838386e739-c000.snappy.parquet,part-00000-27aa4670-5abf-4ef8-a878-9c838386e739-c000.snappy.parquet,14465,1676300072000


## ICD-10 code extraction
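Now we will create a pipeline to extract ICD-10 codes. The pipeline finds diseases and problems in the text and maps each one to an ICD-10-CM code. An assertion model then labels each finding with its status (for example Present, Absent, Past, or Family), so we can tell whether the problem currently applies to the patient.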

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
 
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
 
tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")
 
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")
 
c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 
 
clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")
 
ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Oncological", "Disease_Syndrome_Disorder", "Heart_Disease"])
 
sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")\
      .setCaseSensitive(False)
 
icd10_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")\
      .setInputCols(["sbert_embeddings"])\
      .setOutputCol("icd10cm_code")\
      .setDistanceFunction("EUCLIDEAN")\
      .setReturnCosineDistances(True)
 
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
      .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
      .setOutputCol("assertion")
 
resolver_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10_resolver,
        clinical_assertion
    ])
 
data_ner = spark.createDataFrame([[""]]).toDF("text")
 
icd_model = resolver_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ][ / ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl download started this may take some time.
[ | ][ / ][ — ][OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[ | ][OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[ | ][OK!]
assertion_jsl_augmented download started this may take some time.
[ | ][ / ][OK!]
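
Conceptually, the resolver stage compares the embedding of each extracted chunk with embeddings of known code descriptions and returns the closest codes, which is why the pipeline sets a distance function. The toy sketch below illustrates that idea with nearest-neighbour search in plain Python. It is not the library's implementation: the 2-D vectors are made up for illustration, and `CODE_VECTORS` and `resolve` are hypothetical names.

```python
import math

# Made-up 2-D vectors standing in for sentence embeddings of code descriptions.
# Real embeddings have hundreds of dimensions; these values are illustrative only.
CODE_VECTORS = {
    "C53.9": (0.9, 0.1),
    "J18.9": (0.1, 0.9),
    "I26": (0.5, 0.5),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def resolve(chunk_vector, k=2):
    """Return the k candidate codes closest to chunk_vector by Euclidean distance."""
    scored = [
        (code, euclidean(chunk_vector, vec), cosine_distance(chunk_vector, vec))
        for code, vec in CODE_VECTORS.items()
    ]
    scored.sort(key=lambda item: item[1])  # smallest Euclidean distance first
    return scored[:k]


for code, dist, cos in resolve((0.8, 0.2)):
    print(f"{code}: euclidean={dist:.3f}, cosine_distance={cos:.3f}")
```

Ranking by one distance while also reporting another mirrors the combination of `setDistanceFunction("EUCLIDEAN")` and `setReturnCosineDistances(True)` in the pipeline above.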


We can now transform the data. The `path` column holds the full file path, so we replace it with a shorter `filename` column. Each file name refers to a different patient.

In [0]:
path_array = F.split(df['path'], '/')
df = df.withColumn('filename', path_array.getItem(F.size(path_array)- 1)).select(['filename', 'text'])

icd10_sdf = icd_model.transform(df)
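
For readers less familiar with Spark column expressions, the `split` / `getItem(size - 1)` combination above simply keeps the last path segment. A plain-Python sketch of the same logic, using the hypothetical helper name `filename_from_path`:

```python
def filename_from_path(path: str) -> str:
    """Return the last '/'-separated segment, like getItem(size - 1) on the split array."""
    parts = path.split("/")
    return parts[len(parts) - 1]


print(filename_from_path("dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_01.txt"))
```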

In [0]:
sample_text = df.select("text").take(2)[1][0]
print(sample_text)

Medical Specialty:Hematology - Oncology
Sample Name: Mullerian Adenosarcoma
Description: Discharge summary of a patient presenting with a large mass aborted through the cervix.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:  Mullerian adenosarcoma.
HISTORY OF PRESENT ILLNESS:  The patient is a 56-year-old presenting with a large mass aborted through the cervix.
PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender.
PELVIC: There is a large mass in the vagina.
HOSPITAL COURSE:  The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day.
OPERATIONS:  July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy.
DISCHARGE CONDITION:  Stable.
PLAN:  The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She wi

Let's see which ICD-10 codes the model extracts from a sample note.

In [0]:
sample_text = """
Medical Specialty:Hematology - Oncology
Sample Name: Mullerian Adenosarcoma
Description: Discharge summary of a patient presenting with a large mass aborted through the cervix.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:  Mullerian adenosarcoma.
HISTORY OF PRESENT ILLNESS:  The patient is a 56-year-old presenting with a large mass aborted through the cervix.
PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender.
PELVIC: There is a large mass in the vagina.
HOSPITAL COURSE:  The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day.
OPERATIONS:  July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy.
DISCHARGE CONDITION:  Stable.
PLAN:  The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She will call me for fevers, drainage, bleeding, or pain. Family history, social history, and psychosocial needs per the social worker. The patient will follow up in my office in one week.
PATHOLOGY: Mullerian adenosarcoma.
MEDICATIONS: Percocet 5, #40, one q.3 h. p.r.n. pain.
"""

In [0]:
light_model = nlp.LightPipeline(icd_model)

light_result = light_model.fullAnnotate(sample_text)

vis = nlp.viz.EntityResolverVisualizer()

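# Change the color of an entity label.
# Note: the NER whitelist above keeps only Oncological, Disease_Syndrome_Disorder and Heart_Disease,
# so a 'PROBLEM' entry is not expected to match any chunk here.
vis.set_label_colors({'PROBLEM':'#008080'})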

icd_vis = vis.display(light_result[0], 'ner_chunk', 'icd10cm_code', return_html=True)

displayHTML(icd_vis)

The ICD resolver can also tell us the HCC status of a code. The HCC status is 1 if the Medicare Risk Adjustment model includes the ICD code.
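
As a rough sketch of how such a status could be read, assume the resolver exposes an auxiliary label string of the form `billable||hcc_status||hcc_code` for each code. That format, the example strings, and the names `HccInfo` and `parse_aux_label` are assumptions made for illustration; check the model card for the actual metadata layout before relying on it.

```python
from typing import NamedTuple


class HccInfo(NamedTuple):
    billable: bool
    hcc_status: bool
    hcc_code: str


def parse_aux_label(aux_label: str, sep: str = "||") -> HccInfo:
    """Parse an assumed 'billable||hcc_status||hcc_code' string into named fields."""
    parts = [p.strip() for p in aux_label.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by {sep!r}, got {aux_label!r}")
    billable, status, code = parts
    return HccInfo(billable == "1", status == "1", code)


# Hypothetical example strings, not taken from the model output.
print(parse_aux_label("1||1||8"))
print(parse_aux_label("1||0||0"))
```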

In [0]:
icd10_df = icd10_sdf.select("filename", F.explode(F.arrays_zip(icd10_sdf.ner_chunk.result,
                                                                   icd10_sdf.icd10cm_code.result,
                                                                   icd10_sdf.assertion.result
                                                                  )).alias("cols")) \
                            .select("filename", F.expr("cols['0']").alias("chunk"),
                                    F.expr("cols['1']").alias("icd10_code"),
                                    F.expr("cols['2']").alias("assertion")
                                   ).toPandas()

icd10_df.head()

Unnamed: 0,filename,chunk,icd10_code,assertion
0,mt_note_01.txt,breast cancer,C50.92,Family
1,mt_note_01.txt,breast cancer,C50.92,Family
2,mt_note_01.txt,dysplasia,P61.4,Absent
3,mt_note_01.txt,cancer,C80.1,Absent
4,mt_note_02.txt,Name: Mullerian Adenosarcoma,C53.9,Present
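
The `arrays_zip` plus `explode` pattern used above may be easier to follow in plain Python. The sketch below mimics it for a single note: the parallel annotation lists are zipped position by position, and each aligned triple becomes its own row. The sample values come from the table above; the helper name is illustrative only.

```python
def zip_and_explode(filename, chunks, codes, assertions):
    """Emulate F.explode(F.arrays_zip(...)): one output row per aligned triple."""
    return [
        {"filename": filename, "chunk": c, "icd10_code": i, "assertion": a}
        for c, i, a in zip(chunks, codes, assertions)
    ]


rows = zip_and_explode(
    "mt_note_01.txt",
    ["breast cancer", "dysplasia", "cancer"],
    ["C50.92", "P61.4", "C80.1"],
    ["Family", "Absent", "Absent"],
)
for row in rows:
    print(row)
```

One difference to keep in mind: Python's `zip` stops at the shortest list, whereas Spark's `arrays_zip` pads shorter arrays with nulls, so the two agree only when all arrays have the same length.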


In [0]:
icd10_df = icd10_df[~icd10_df.assertion.isin(["Family", "Past"])][['filename','chunk','icd10_code']].drop_duplicates()
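
Note that the filter above removes only `Family` and `Past` mentions, so chunks asserted as `Absent` (such as `dysplasia` in the first note) still contribute codes. Whether that is intended is not stated here. A stricter variant is sketched below as an assumption rather than as the notebook's method; the exclusion set and the helper name are hypothetical, and the sample rows are copied from the table above.

```python
import pandas as pd

# Hypothetical stricter exclusion list; adjust to the assertion labels your model emits.
EXCLUDED_ASSERTIONS = {"Family", "Past", "Absent"}


def keep_current_findings(df, excluded=EXCLUDED_ASSERTIONS):
    """Drop rows whose assertion is excluded, then de-duplicate the remaining chunks."""
    mask = ~df["assertion"].isin(excluded)
    return df.loc[mask, ["filename", "chunk", "icd10_code"]].drop_duplicates()


sample = pd.DataFrame({
    "filename": ["mt_note_01.txt"] * 4 + ["mt_note_02.txt"],
    "chunk": ["breast cancer", "breast cancer", "dysplasia", "cancer",
              "Name: Mullerian Adenosarcoma"],
    "icd10_code": ["C50.92", "C50.92", "P61.4", "C80.1", "C53.9"],
    "assertion": ["Family", "Family", "Absent", "Absent", "Present"],
})
filtered = keep_current_findings(sample)
print(filtered)
```

Applying this to the full data would change the code lists and therefore the scores shown later in the notebook.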

Next, we add a column that pairs each extracted entity with its ICD-10 code.

In [0]:
icd10_df['Extracted_Entities_vs_ICD_Codes'] = list(zip(icd10_df.chunk, icd10_df.icd10_code))
icd10_df.head(10)

Unnamed: 0,filename,chunk,icd10_code,Extracted_Entities_vs_ICD_Codes
2,mt_note_01.txt,dysplasia,P61.4,"(dysplasia, P61.4)"
3,mt_note_01.txt,cancer,C80.1,"(cancer, C80.1)"
4,mt_note_02.txt,Name: Mullerian Adenosarcoma,C53.9,"(Name: Mullerian Adenosarcoma, C53.9)"
5,mt_note_02.txt,Mullerian adenosarcoma,C53.9,"(Mullerian adenosarcoma, C53.9)"
7,mt_note_03.txt,leiomyosarcoma,C49.9,"(leiomyosarcoma, C49.9)"
8,mt_note_03.txt,pulmonary embolism,I26,"(pulmonary embolism, I26)"
9,mt_note_03.txt,pancytopenia,D61.81,"(pancytopenia, D61.81)"
10,mt_note_03.txt,pneumonia,J18.9,"(pneumonia, J18.9)"
11,mt_note_03.txt,Leiomyosarcoma,C49.9,"(Leiomyosarcoma, C49.9)"
13,mt_note_03.txt,Pancytopenia,D61.81,"(Pancytopenia, D61.81)"


In [0]:
icd10_codes= icd10_df.groupby("filename").icd10_code.apply(lambda x: list(x)).reset_index()
icd10_vs_entities = icd10_df.groupby("filename").Extracted_Entities_vs_ICD_Codes.apply(lambda x: list(x)).reset_index()

icd10_df_all = icd10_codes.merge(icd10_vs_entities)

icd10_df_all

Unnamed: 0,filename,icd10_code,Extracted_Entities_vs_ICD_Codes
0,mt_note_01.txt,"[P61.4, C80.1]","[(dysplasia, P61.4), (cancer, C80.1)]"
1,mt_note_02.txt,"[C53.9, C53.9]","[(Name: Mullerian Adenosarcoma, C53.9), (Mullerian adenosarcoma, C53.9)]"
2,mt_note_03.txt,"[C49.9, I26, D61.81, J18.9, C49.9, D61.81, M06.9, C44.9, I26, J32.9]","[(leiomyosarcoma, C49.9), (pulmonary embolism, I26), (pancytopenia, D61.81), (pneumonia, J18.9),..."
3,mt_note_04.txt,"[C44.9, C44.9, N64.81, C44.90, R06.8, I80.9, R23.8, H02.4, I80.2, P39.9]","[(basal cell carcinoma, C44.9), (Basal cell carcinoma, C44.9), (Breast ptosis, N64.81), (skin ca..."
4,mt_note_05.txt,"[C50.92, C50.91, C50.9, C50.92, C80.0, T78.40, R50.9, G25.81, P39.9, C50.92]","[(Breast Cancer, C50.92), (ductal carcinoma of the left breast, C50.91), (ductal carcinoma, C50...."
5,mt_note_06.txt,"[C45, D20.1, F31.9, R18, D20.1, F31.9, L72.0]","[(Name: Intraperitoneal Mesothelioma, C45), (peritoneal mesothelioma, D20.1), (type 1 bipolar di..."
6,mt_note_08.txt,"[J90, C45.9, J90]","[(Description: Right pleural effusion, J90), (malignant mesothelioma, C45.9), (pleural effusion,..."
7,mt_note_09.txt,"[D50.8, D57.0, O99.0, D57.02]","[(Name: Sickle Cell Anemia, D50.8), (sickle cell anemia, D57.0), (sickle cell, O99.0), (Sickle c..."
8,mt_note_10.txt,"[C69.60, C69.60]","[(Rhabdomyosarcoma of the left orbit, C69.60), (rhabdomyosarcoma of the left orbit, C69.60)]"
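
The grouped lists above repeat some codes, for example `[C53.9, C53.9]`. If duplicates are not wanted downstream, one option is an order-preserving de-duplication. A minimal sketch with a hypothetical helper name:

```python
def unique_in_order(codes):
    """Drop repeated codes while keeping first-seen order."""
    return list(dict.fromkeys(codes))


# Codes for mt_note_03.txt, copied from the table above.
note_03 = ["C49.9", "I26", "D61.81", "J18.9", "C49.9",
           "D61.81", "M06.9", "C44.9", "I26", "J32.9"]
print(unique_in_order(note_03))
```

The helper could stand in for `lambda x: list(x)` in the `groupby(...).apply(...)` call. Whether repeated codes affect the score depends on the scoring module, which this section does not state.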


## Gender Classification

Spark NLP provides a pretrained model for detecting a patient's gender. We load it with `ClassifierDLModel`.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

biobert_embeddings = nlp.BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
        .setInputCols(["document",'token'])\
        .setOutputCol("bert_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")

genderClassifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["sentence_bert_embeddings"]) \
       .setOutputCol("gender")

gender_pipeline = nlp.Pipeline(stages=[documentAssembler,
                                   #sentenceDetector,
                                   tokenizer, 
                                   biobert_embeddings, 
                                   sentence_embeddings, 
                                   genderClassifier])

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[ | ][ / ][ — ][ \ ][ | ][OK!]


In [0]:
data_ner = spark.createDataFrame([[""]]).toDF("text")

gender_model = gender_pipeline.fit(data_ner)

gender_df = gender_model.transform(df)

In [0]:
gender_pd_df = gender_df.select("filename", F.explode(F.arrays_zip(gender_df.gender.result,
                                                                   gender_df.gender.metadata)).alias("cols")) \
                       .select("filename",
                               F.expr("cols['0']").alias("Gender"),
                               F.expr("cols['1']['Female']").alias("Female"),
                               F.expr("cols['1']['Male']").alias("Male")).toPandas()

gender_pd_df['Gender'] = gender_pd_df.apply(lambda x : "F" if float(x['Female']) >= float(x['Male']) else "M", axis=1)

gender_pd_df = gender_pd_df[['filename', 'Gender']]

The gender of every patient is now available in a dataframe.

In [0]:
gender_pd_df

Unnamed: 0,filename,Gender
0,mt_note_01.txt,F
1,mt_note_02.txt,F
2,mt_note_03.txt,F
3,mt_note_04.txt,F
4,mt_note_05.txt,F
5,mt_note_06.txt,F
6,mt_note_07.txt,M
7,mt_note_08.txt,F
8,mt_note_09.txt,M
9,mt_note_10.txt,M
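
The labelling rule applied above compares the two class probabilities, which arrive as strings in the classifier metadata, and resolves ties in favor of `F`. The sketch below restates that rule on made-up probabilities; any further class the model may emit is ignored, exactly as in the cell above.

```python
def pick_gender(metadata):
    """Return 'F' when the Female probability is at least the Male one, else 'M'."""
    female = float(metadata["Female"])
    male = float(metadata["Male"])
    return "F" if female >= male else "M"


# Invented probabilities, for illustration only.
print(pick_gender({"Female": "0.91", "Male": "0.07"}))
print(pick_gender({"Female": "0.40", "Male": "0.55"}))
```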


## Age

A separate pipeline extracts the patient's age from the notes by keeping only `Age`-labelled entities. A note may contain more than one age entity, so we take the first one as the patient's age.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
 
tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")
 
clinical_ner = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

date_ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Age"])

age_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        date_ner_converter
    ])

data_ner = spark.createDataFrame([[""]]).toDF("text")

age_model = age_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl_enriched download started this may take some time.
[ | ][ / ][ — ][OK!]


In [0]:
light_model = nlp.LightPipeline(age_model)

light_result = light_model.fullAnnotate(sample_text)

visualiser = nlp.viz.NerVisualizer()

ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', return_html=True)

displayHTML(ner_vis)

In [0]:
age_result = age_model.transform(df)
 
age_df = age_result.select("filename",F.explode(F.arrays_zip(age_result.ner_chunk.result, 
                                                             age_result.ner_chunk.metadata)).alias("cols")) \
                   .select("filename", 
                           F.expr("cols['0']").alias("Age"),
                           F.expr("cols['1']['entity']").alias("ner_label")).toPandas().groupby('filename').first().reset_index()

In [0]:
age_df.Age = age_df.Age.replace(r"\D", "", regex = True).astype(int)
age_df.drop('ner_label', axis=1, inplace=True)
age_df.head()

Unnamed: 0,filename,Age
0,mt_note_01.txt,59
1,mt_note_02.txt,56
2,mt_note_03.txt,66
3,mt_note_04.txt,61
4,mt_note_05.txt,57
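
Stripping every non-digit works for chunks such as `59-year-old`, but it would glue numbers together if a chunk held more than one (`57 years 3 months` would become `573`), and it would fail on a chunk with no digits at all. A more defensive sketch keeps only the first run of digits; the helper name and the sample chunks are hypothetical.

```python
import re


def first_number(chunk):
    """Return the first integer found in an age chunk, or None if there is none."""
    match = re.search(r"\d+", chunk)
    return int(match.group()) if match else None


for chunk in ["59-year-old", "57 years 3 months", "elderly"]:
    print(chunk, "->", first_number(chunk))
```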


# Calculating Medicare Risk Adjustment Score

We now have all the data that can be extracted from the clinical notes, so we can calculate the Medicare Risk Adjustment Score with the Spark NLP Healthcare CMS-HCC risk-adjustment score calculation module.

**This module supports V22, V23, V24 and V28 of the CMS-HCC risk adjustment model.**

It needs the following parameters in order to calculate the risk score:

- ICD Codes
- Age
- Gender
- The eligibility segment of the patient
- The original reason for entitlement
- If the patient is in Medicaid or not
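
To make the six inputs concrete, the sketch below bundles them into one record and checks them before scoring. The segment and OREC code sets follow the usual CMS-HCC definitions and should be verified against the module's documentation. `PatientInput` is a hypothetical container and not part of the library, and the age bounds are an arbitrary sanity check.

```python
from dataclasses import dataclass
from typing import List

# Eligibility segments as commonly defined for CMS-HCC (assumed list).
ELIGIBILITY_SEGMENTS = {"CFA", "CFD", "CNA", "CND", "CPA", "CPD", "INS", "NE", "SNPNE"}
# Original reason for entitlement: 0 old age, 1 disability, 2 ESRD, 3 disability and ESRD.
OREC_CODES = {"0", "1", "2", "3"}


@dataclass
class PatientInput:
    icd10_codes: List[str]
    age: int
    gender: str
    eligibility: str
    orec: str
    medicaid: bool

    def problems(self) -> List[str]:
        """Return readable validation issues; the list is empty when all checks pass."""
        found = []
        if not self.icd10_codes:
            found.append("no ICD-10 codes")
        if not 0 <= self.age <= 120:
            found.append(f"implausible age: {self.age}")
        if self.gender not in {"F", "M"}:
            found.append(f"unknown gender: {self.gender}")
        if self.eligibility not in ELIGIBILITY_SEGMENTS:
            found.append(f"unknown eligibility segment: {self.eligibility}")
        if self.orec not in OREC_CODES:
            found.append(f"unknown OREC: {self.orec}")
        return found


patient = PatientInput(["C50.91", "C50.9"], 57, "F", "CPA", "3", True)
print(patient.problems())
```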

In [0]:
patient_df = age_df.merge(icd10_df_all, on='filename', how = "left")\
                   .merge(gender_pd_df, on='filename', how = "left")
 
patient_df = patient_df.dropna()

In [0]:
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   filename                         9 non-null      object
 1   Age                              9 non-null      int64 
 2   icd10_code                       9 non-null      object
 3   Extracted_Entities_vs_ICD_Codes  9 non-null      object
 4   Gender                           9 non-null      object
dtypes: int64(1), object(4)
memory usage: 432.0+ bytes
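
Because the left joins are followed by `dropna()`, any note that lacks ICD codes or a gender is removed without notice. The summary above (9 entries with index 0 to 9) suggests that one of ten notes was dropped. Comparing filename sets is a quick way to see which one. The sketch uses the filenames visible in the earlier tables and assumes the age step produced a row for every note.

```python
# Ten notes were downloaded; the grouped ICD table above has no row for mt_note_07.txt.
all_notes = {f"mt_note_{i:02d}.txt" for i in range(1, 11)}
notes_with_codes = all_notes - {"mt_note_07.txt"}
notes_with_gender = set(all_notes)

kept = all_notes & notes_with_codes & notes_with_gender
dropped = sorted(all_notes - kept)
print(dropped)
```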


In [0]:
df = spark.createDataFrame(patient_df)
df.show(truncate=False)

+--------------+---+----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|filename      |Age|icd10_code                                                                  |Extracted_Entities_vs_ICD_Codes                                                                                                                                                                                                                                                          |Gender|
+--------------+---+----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------

In [0]:
schema = T.StructType([
            T.StructField('risk_score', T.FloatType()),
            T.StructField('hcc_lst', T.StringType()),
            T.StructField('parameters', T.StringType()),
            T.StructField('details', T.StringType())])

In [0]:
extra_columns = pd.DataFrame({"filename" : ["mt_note_01.txt", "mt_note_03.txt", "mt_note_05.txt", "mt_note_06.txt", 
                                            "mt_note_08.txt", "mt_note_09.txt", "mt_note_10.txt", ],
                              "eligibility" : ["CFA", "CND", "CPA", "CFA", "CND", "CPA", "CFA"],
                      "orec" : ["0", "1", "3", "0", "1", "3", "2"],
                      "medicaid":[True, False, True, False, True, True, False],
                      "DOB" : ['1961-10-12', "1956-05-30", '1963-08-12', "1959-07-24", '1956-03-17', "2003-06-11", '2006-02-14']
                      })

df_extra = spark.createDataFrame(extra_columns)
df_extra.show(truncate=False)

+--------------+-----------+----+--------+----------+
|filename      |eligibility|orec|medicaid|DOB       |
+--------------+-----------+----+--------+----------+
|mt_note_01.txt|CFA        |0   |true    |1961-10-12|
|mt_note_03.txt|CND        |1   |false   |1956-05-30|
|mt_note_05.txt|CPA        |3   |true    |1963-08-12|
|mt_note_06.txt|CFA        |0   |false   |1959-07-24|
|mt_note_08.txt|CND        |1   |true    |1956-03-17|
|mt_note_09.txt|CPA        |3   |true    |2003-06-11|
|mt_note_10.txt|CFA        |2   |false   |2006-02-14|
+--------------+-----------+----+--------+----------+



If the documents contain no age information but a date of birth is available for each patient, the age can be calculated as follows.

```python
from pyspark.sql import functions as F

df_extra = df_extra.withColumn("DOB", F.to_date(F.col("DOB")))
df_extra = df_extra.withColumn("Age", F.datediff(F.current_date(), F.col("DOB"))/365)
df_extra.show()
```
```bash
+--------------+-----------+----+--------+----------+------------------+
|      filename|eligibility|orec|medicaid|       DOB|               Age|
+--------------+-----------+----+--------+----------+------------------+
|mt_note_01.txt|        CFA|   0|    true|1961-10-12| 60.93972602739726|
|mt_note_03.txt|        CND|   1|   false|1956-05-30| 66.31232876712329|
|mt_note_05.txt|        CPA|   3|    true|1963-08-12|59.106849315068494|
|mt_note_06.txt|        CFA|   0|   false|1959-07-24| 63.16164383561644|
|mt_note_08.txt|        CND|   1|    true|1956-03-17| 66.51506849315068|
|mt_note_09.txt|        CPA|   3|    true|2003-06-11| 19.24931506849315|
|mt_note_10.txt|        CFA|   2|   false|2006-02-14|16.567123287671233|
+--------------+-----------+----+--------+----------+------------------+
```
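
Dividing a day count by 365 yields a fractional age and ignores leap days. If whole years are wanted, a calendar-based calculation is one alternative. The pure-Python sketch below uses a fixed reference date, chosen only to keep the example reproducible; the helper name is hypothetical.

```python
from datetime import date


def age_in_years(dob: date, as_of: date) -> int:
    """Completed years between dob and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


reference = date(2022, 9, 20)  # arbitrary reference date for the example
print(age_in_years(date(1961, 10, 12), reference))
print(age_in_years(date(1956, 5, 30), reference))
```

In Spark, `F.floor(F.months_between(F.current_date(), F.col("DOB")) / 12)` expresses a similar idea without leaving the DataFrame API.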

In [0]:
df = df.join(df_extra, on= "filename")

In [0]:
df.show()

+--------------+---+--------------------+-------------------------------+------+-----------+----+--------+----------+
|      filename|Age|          icd10_code|Extracted_Entities_vs_ICD_Codes|Gender|eligibility|orec|medicaid|       DOB|
+--------------+---+--------------------+-------------------------------+------+-----------+----+--------+----------+
|mt_note_01.txt| 59|      [P61.4, C80.1]|           [[dysplasia, P61....|     F|        CFA|   0|    true|1961-10-12|
|mt_note_03.txt| 66|[C49.9, I26, D61....|           [[leiomyosarcoma,...|     F|        CND|   1|   false|1956-05-30|
|mt_note_05.txt| 57|[C50.92, C50.91, ...|           [[Breast Cancer, ...|     F|        CPA|   3|    true|1963-08-12|
|mt_note_06.txt| 63|[C45, D20.1, F31....|           [[Name: Intraperi...|     F|        CFA|   0|   false|1959-07-24|
|mt_note_08.txt| 66|   [J90, C45.9, J90]|           [[Description: Ri...|     F|        CND|   1|    true|1956-03-17|
|mt_note_09.txt| 19|[D50.8, D57.0, O9...|           [[Na

## Importing the model version

You can import one of the following functions to calculate the score.

```
- profileV22Y17   - profileV23Y18  - profileV24Y19  - profileV28Y24 - profileESRDV21Y19 - profileRxHCCV05Y20 - profileRxHCCV08Y22                                                           
- profileV22Y18   - profileV23Y19  - profileV24Y20  - profileV28                        - profileRxHCCV05Y21 - profileRxHCCV08Y23
- profileV22Y19   - profileV23     - profileV24Y21                                      - profileRxHCCV05Y22
- profileV22Y20                    - profileV24Y22                                      - profileRxHCCV05Y23
- profileV22Y21                    - profileV24
- profileV22Y22                     
- profileV22                        
```
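
Since the names follow a `profile<version>[Y<two-digit year>]` pattern, a name can be composed from a version and a year and then looked up, for instance with `getattr(medical, name)`. The sketch below only builds the name and validates it against the list above; `profile_name` is a hypothetical helper.

```python
# Function names copied from the list above.
AVAILABLE = {
    "profileV22Y17", "profileV22Y18", "profileV22Y19", "profileV22Y20",
    "profileV22Y21", "profileV22Y22", "profileV22",
    "profileV23Y18", "profileV23Y19", "profileV23",
    "profileV24Y19", "profileV24Y20", "profileV24Y21", "profileV24Y22", "profileV24",
    "profileV28Y24", "profileV28",
    "profileESRDV21Y19",
    "profileRxHCCV05Y20", "profileRxHCCV05Y21", "profileRxHCCV05Y22", "profileRxHCCV05Y23",
    "profileRxHCCV08Y22", "profileRxHCCV08Y23",
}


def profile_name(version, year=None):
    """Compose a profile function name and check that it appears in the list."""
    suffix = f"Y{year % 100:02d}" if year is not None else ""
    name = f"profile{version}{suffix}"
    if name not in AVAILABLE:
        raise ValueError(f"{name} is not in the list of available profile functions")
    return name


print(profile_name("V28", 2024))
print(profile_name("V24"))
```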

In [0]:
df = df.withColumn("hcc_profile", medical.profileV28Y24(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.medicaid))

df = df.withColumn("hcc_profile", F.from_json(F.col("hcc_profile"), schema))
df = df.withColumn("risk_score", df.hcc_profile.getItem("risk_score"))\
      .withColumn("hcc_lst", df.hcc_profile.getItem("hcc_lst"))\
      .withColumn("parameters", df.hcc_profile.getItem("parameters"))\
      .withColumn("details", df.hcc_profile.getItem("details"))

df.select('risk_score','icd10_code', 'Age', 'Gender').show(truncate=False )

df.show(truncate=100, vertical=True)

+----------+----------------------------------------------------------------------------+---+------+
|risk_score|icd10_code                                                                  |Age|Gender|
+----------+----------------------------------------------------------------------------+---+------+
|0.196     |[P61.4, C80.1]                                                              |59 |F     |
|1.01      |[C49.9, I26, D61.81, J18.9, C49.9, D61.81, M06.9, C44.9, I26, J32.9]        |66 |F     |
|2.166     |[C50.92, C50.91, C50.9, C50.92, C80.0, T78.40, R50.9, G25.81, P39.9, C50.92]|57 |F     |
|0.349     |[C45, D20.1, F31.9, R18, D20.1, F31.9, L72.0]                               |63 |F     |
|1.989     |[J90, C45.9, J90]                                                           |66 |F     |
|0.303     |[D50.8, D57.0, O99.0, D57.02]                                               |19 |M     |
|0.196     |[C69.60, C69.60]                                                            |16
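
The profile function returns a JSON string, which is why the cell above parses it with `from_json` and the four-field schema. The sketch below performs the same unpacking in plain Python. The field names come from the schema, but the payload is a made-up placeholder and does not show real module output.

```python
import json

FIELDS = ("risk_score", "hcc_lst", "parameters", "details")


def unpack_profile(raw_json):
    """Pull the four schema fields out of a profile JSON string; missing keys become None."""
    data = json.loads(raw_json)
    return {name: data.get(name) for name in FIELDS}


# Placeholder payload: the values are invented and only illustrate the shape.
raw = json.dumps({"risk_score": 0.5, "hcc_lst": "[]", "parameters": "{}"})
profile = unpack_profile(raw)
print(profile)
```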

# Using Question Answer Model

In [0]:
import pyspark.sql.functions as F

In [0]:
sample_texts = ["""Medical Specialty:Hematology - Oncology
Sample Name: Consult - Breast Cancer - 1
Description: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.
(Medical Transcription Sample Report)
CHIEF COMPLAINT:  Left breast cancer.
HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging. She prior to that area, denied any complaints. She had no nipple discharge. No trauma history. She has had been on no estrogen supplementation. She has had no other personal history of breast cancer. Her family history is positive for her mother having breast cancer at age 48. The patient has had no children and no pregnancies. She denies any change in the right breast. Subsequent to the office visit and tissue diagnosis of breast cancer, she has had medical oncology consultation with Dr. X and radiation oncology consultation with Dr. Y. I have discussed the case with Dr. X and Dr. Y, who are both in agreement with proceeding with surgery prior to adjuvant therapy. The patient's metastatic workup has otherwise been negative with MRI scan and CT scanning. The MRI scan showed some close involvement possibly involving the left pectoralis muscle, although thought to also possibly represent biopsy artifact. CT scan of the neck, chest, and abdomen is negative for metastatic disease. PAST MEDICAL HISTORY: Previous surgery is history of benign breast biopsy in 1972, laparotomy in 1981, 1982, and 1984, right oophorectomy in 1984, and ganglion cyst removal of the hand in 1987.
MEDICATIONS: She is currently on omeprazole for reflux and indigestion.
ALLERGIES: SHE HAS NO KNOWN DRUG ALLERGIES.
REVIEW OF SYSTEMS: Negative for any recent febrile illnesses, chest pains or shortness of breath. Positive for restless leg syndrome. Negative for any unexplained weight loss and no change in bowel or bladder habits.
FAMILY HISTORY: Positive for breast cancer in her mother and also mesothelioma from possible asbestosis or asbestos exposure.
SOCIAL HISTORY: The patient works as a school teacher and teaching high school.
PHYSICAL EXAMINATION: GENERAL: The patient is a white female, alert and oriented x 3, appears her stated age of 57.
HEENT: Head is atraumatic and normocephalic. Sclerae are anicteric. NECK: Supple.
CHEST: Clear. HEART: Regular rate and rhythm. BREASTS: Exam reveals an approximately 1.5 cm relatively mobile focal palpable mass in the inferomedial left breast at the 7 o'clock position, which clinically is not fixed to the underlying pectoralis muscle. There are no nipple retractions. No skin dimpling. There is some, at the time of the office visit, ecchymosis from recent biopsy. There is no axillary adenopathy. The remainder of the left breast is without abnormality. The right breast is without abnormality. The axillary areas are negative for adenopathy bilaterally. ABDOMEN: Soft, nontender without masses. No gross organomegaly. No CVA or flank tenderness. EXTREMITIES: Grossly neurovascularly intact.
IMPRESSION:  The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.
RECOMMENDATIONS:  I have discussed with the patient in detail about the diagnosis of breast cancer and the surgical options, and medical oncologist has discussed with her issues about adjuvant or neoadjuvant chemotherapy. We have decided to recommend to the patient breast conservation surgery with left breast lumpectomy with preoperative sentinel lymph node injection and mapping and left axillary dissection. The possibility of further surgery requiring wider lumpectomy or even completion mastectomy was explained to the patient. The procedure and risks of the surgery were explained to include, but not limited to extra bleeding, infection, unsightly scar formation, the possibility of local recurrence, the possibility of left upper extremity lymphedema was explained. Local numbness, paresthesias or chronic pain was explained. The patient was given an educational brochure and several brochures about the diagnosis and treatment of breast cancers. She was certainly encouraged to obtain further surgical medical opinions prior to proceeding. I believe the patient has given full informed consent and desires to proceed with the above.
"""]



In [0]:
new_text = []
questions = {0: ["What is the patient's age?"],
             1: ["What is the patient's gender?"],
             2: ["What is the patient's diagnosis?"],
}

for i in range(3):
        for x in questions[i]:
            new_text.append([x, sample_texts[0]])

example = spark.createDataFrame(new_text).toDF("question", "context")
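
The loop above pairs every question with a single note, and a later cell tags the result with a fixed `filename`. With several notes, carrying the filename alongside each pair would avoid that manual step. A sketch under that assumption, using hypothetical note names and shortened contexts:

```python
from itertools import product

QUESTIONS = [
    "What is the patient's age?",
    "What is the patient's gender?",
    "What is the patient's diagnosis?",
]


def build_qa_rows(notes):
    """Return one (filename, question, context) row per note/question combination."""
    return [(name, q, text) for (name, text), q in product(notes.items(), QUESTIONS)]


# Hypothetical notes; real contexts would be the full note texts.
notes = {
    "text_01": "The patient is a 57-year-old female ...",
    "text_02": "Another note ...",
}
rows = build_qa_rows(notes)
print(len(rows))
```

The rows could then be passed to `spark.createDataFrame(rows).toDF("filename", "question", "context")`.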


In [0]:
document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa  = medical.QuestionAnswering()\
    .pretrained("clinical_notes_qa_base", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("Context: {context} \n Question: {question} \n Answer: ")\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, med_qa])


result = pipeline.fit(example).transform(example)

clinical_notes_qa_base download started this may take some time.
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][OK!]


In [0]:
df = result.selectExpr("document_question.result as Question", "answer.result as Answer")

# Convert the array of answers to a single string
df = df.withColumn("Answer", F.concat_ws(" ", df["Answer"]))

# Add a shared `filename` key so that DataFrames created later can be joined on it
df = df.withColumn("filename", F.lit("text_01"))

df.show(truncate=False)

+----------------------------------+------------------------------------------------------------------------------------------------+--------+
|Question                          |Answer                                                                                          |filename|
+----------------------------------+------------------------------------------------------------------------------------------------+--------+
|[What is the patient's age?]      |The patient is 57 years old.                                                                    |text_01 |
|[What is the patient's gender?]   |The patient is a white female.                                                                  |text_01 |
|[What is the patient's diagnosis?]|The patient has invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma.|text_01 |
+----------------------------------+------------------------------------------------------------------------------------------------+--------+

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc")

clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Oncological", "Disease_Syndrome_Disorder", "Heart_Disease"])

sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

icd10_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)

resolver_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10_resolver
    ])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

icd_model = resolver_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl download started this may take some time.
[ | ][OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[ | ][OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[ | ][OK!]


In [0]:
icd10_sdf = icd_model.transform(df)

In [0]:
icd10_df = icd10_sdf.select("filename",F.explode(F.arrays_zip(icd10_sdf.ner_chunk.result,
                                                   icd10_sdf.icd10cm_code.result,
                                                   icd10_sdf.ner_chunk.metadata,

                                                    )).alias("cols")) \
                            .select("filename",F.expr("cols['0']").alias("chunk"),
                                    F.expr("cols['1']").alias("icd10_code"),
                                    F.expr("cols['2']['entity']").alias("entity"),
                                                     ).toPandas()

icd10_df.head()

Unnamed: 0,filename,chunk,icd10_code,entity
0,text_01,ductal carcinoma of the left breast,C50.91,Oncological
1,text_01,breast carcinoma,C50.9,Oncological


In [0]:
icd10_df['Extracted_Entities_vs_ICD_Codes'] = list(zip(icd10_df.chunk, icd10_df.icd10_code))

In [0]:
icd10_codes= icd10_df.groupby("filename").icd10_code.apply(lambda x: list(x)).reset_index()
icd10_vs_entities = icd10_df.groupby("filename").Extracted_Entities_vs_ICD_Codes.apply(lambda x: list(x)).reset_index()

icd10_df_all = icd10_codes.merge(icd10_vs_entities)

icd10_df_all

Unnamed: 0,filename,icd10_code,Extracted_Entities_vs_ICD_Codes
0,text_01,"[C50.91, C50.9]","[(ductal carcinoma of the left breast, C50.91), (breast carcinoma, C50.9)]"


## Gender Classification

Spark NLP provides a pretrained model for detecting a patient's gender. We load it with `ClassifierDLModel`.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

biobert_embeddings = nlp.BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
        .setInputCols(["document",'token'])\
        .setOutputCol("bert_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")

genderClassifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["sentence_bert_embeddings"]) \
       .setOutputCol("gender")

gender_pipeline = nlp.Pipeline(stages=[documentAssembler,
                                   #sentenceDetector,
                                   tokenizer,
                                   biobert_embeddings,
                                   sentence_embeddings,
                                   genderClassifier])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

gender_model = gender_pipeline.fit(data_ner)

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ | ][OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[ | ][OK!]


In [0]:
# Concatenate all answers for each file into a single text
concatenated_text_df = df.groupBy("filename").agg(F.concat_ws(" ", F.collect_list("Answer")).alias("Answer"))

concatenated_text_df.show(truncate=False)

+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filename|Answer                                                                                                                                                      |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text_01 |The patient is 57 years old. The patient has invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma. The patient is a white female.|
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [0]:
gender_df = gender_model.transform(concatenated_text_df)

gender_pd_df = gender_df.select("filename", F.explode(F.arrays_zip(gender_df.gender.result,
                                                                   gender_df.gender.metadata)).alias("cols")) \
                       .select("filename",F.expr("cols['0']").alias("Gender"),
                               F.expr("cols['1']['Female']").alias("Female"),
                               F.expr("cols['1']['Male']").alias("Male")).toPandas()

gender_pd_df['Gender'] = gender_pd_df.apply(lambda x : "F" if float(x['Female']) >= float(x['Male']) else "M", axis=1)

gender_pd_df = gender_pd_df[['filename', 'Gender']]

In [0]:
gender_pd_df

Unnamed: 0,filename,Gender
0,text_01,F


## Age
A separate pipeline extracts the patient's age from the answers by keeping only `Age`-labelled entities. A text may contain more than one age entity, so we take the first one as the patient's age.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("Answer")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

# Keep only the chunks labelled "Age"
age_ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Age"])

age_pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        age_ner_converter
    ])

data_ner = spark.createDataFrame([[""]]).toDF("Answer")

age_model = age_pipeline.fit(data_ner)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl_enriched download started this may take some time.
[ | ][OK!]


In [0]:
age_result = age_model.transform(concatenated_text_df)

age_df = age_result.select("filename",F.explode(F.arrays_zip(age_result.ner_chunk.result,
                                                             age_result.ner_chunk.metadata)).alias("cols")) \
                   .select("filename",F.expr("cols['0']").alias("Age"),
                           F.expr("cols['1']['entity']").alias("ner_label")).toPandas()

In [0]:
age_df

Unnamed: 0,filename,Age,ner_label
0,text_01,57 years old,Age


In [0]:
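# Keep only the first Age entity per note, then strip non-digit characters ("57 years old" -> 57)
age_df = age_df.drop_duplicates(subset="filename", keep="first")
age_df["Age"] = age_df["Age"].replace(r"\D", "", regex=True).astype(int)
age_df = age_df.drop(columns="ner_label")
age_df.head()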

Unnamed: 0,filename,Age
0,text_01,57


## Calculating Medicare Risk Adjustment Score
We now have all the data that can be extracted from the clinical notes, so we can calculate the Medicare Risk Adjustment Score with the Spark NLP Healthcare CMS-HCC risk-adjustment score calculation module.

This module supports V22, V23, V24, V28 and ESRDV21 of the CMS-HCC risk adjustment model.

It also supports V05 and V08 of the CMS-RxHCC risk adjustment model.

It needs the following parameters in order to calculate the risk score:

- ICD Codes
- Age
- Gender
- The eligibility segment of the patient
- The original reason for entitlement
- If the patient is in Medicaid or not
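
Before handing these six inputs to the scoring module, it can help to check that each patient record is complete. The following is a minimal sketch, not part of the library: `RiskInput` and its `problems` method are hypothetical names, and the checks cover only basic shape, not the full set of values the module accepts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RiskInput:
    """Hypothetical container for the six inputs listed above (one patient)."""
    icd10_codes: List[str]
    age: int
    gender: str       # "F" or "M", as produced earlier in this notebook
    eligibility: str  # eligibility segment, e.g. "INS" in this notebook
    orec: str         # original reason for entitlement, e.g. "0" in this notebook
    medicaid: bool

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the record looks usable."""
        found = []
        if not self.icd10_codes:
            found.append("no ICD-10 codes")
        if not isinstance(self.age, int) or self.age < 0:
            found.append("age must be a non-negative integer")
        if self.gender not in ("F", "M"):
            found.append("gender must be 'F' or 'M'")
        if not self.eligibility:
            found.append("eligibility segment is missing")
        if not self.orec:
            found.append("orec is missing")
        if not isinstance(self.medicaid, bool):
            found.append("medicaid must be True or False")
        return found


# Values taken from the example patient in this notebook.
patient = RiskInput(["C50.91", "C50.9"], 57, "F", "INS", "0", False)
print(patient.problems())  # prints []
```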


In [0]:
patient_df = age_df.merge(icd10_df_all, on='filename', how = "left")\
                   .merge(gender_pd_df, on='filename', how = "left")

patient_df = patient_df.dropna()

In [0]:
patient_df

Unnamed: 0,filename,Age,icd10_code,Extracted_Entities_vs_ICD_Codes,Gender
0,text_01,57,"[C50.91, C50.9]","[(ductal carcinoma of the left breast, C50.91), (breast carcinoma, C50.9)]",F


In [0]:
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   filename                         1 non-null      object
 1   Age                              1 non-null      int64 
 2   icd10_code                       1 non-null      object
 3   Extracted_Entities_vs_ICD_Codes  1 non-null      object
 4   Gender                           1 non-null      object
dtypes: int64(1), object(4)
memory usage: 48.0+ bytes


In [0]:
df = spark.createDataFrame(patient_df)
df.show(truncate=False)

+--------+---+---------------+--------------------------------------------------------------------------+------+
|filename|Age|icd10_code     |Extracted_Entities_vs_ICD_Codes                                           |Gender|
+--------+---+---------------+--------------------------------------------------------------------------+------+
|text_01 |57 |[C50.91, C50.9]|[[ductal carcinoma of the left breast, C50.91], [breast carcinoma, C50.9]]|F     |
+--------+---+---------------+--------------------------------------------------------------------------+------+



In [0]:
from pyspark.sql.types import StructType, StructField, FloatType, StringType

# Schema used to parse the JSON string returned by the profile functions
schema = StructType([
            StructField('risk_score', FloatType()),
            StructField('hcc_lst', StringType()),
            StructField('parameters', StringType()),
            StructField('details', StringType())])

In [0]:
extra_columns = pd.DataFrame({"filename": ["text_01"],
                              "eligibility": ["INS"],
                              "orec": ["0"],
                              "medicaid": [False],
                              })

df_extra = spark.createDataFrame(extra_columns)
df_extra.show(truncate=False)

+--------+-----------+----+--------+
|filename|eligibility|orec|medicaid|
+--------+-----------+----+--------+
|text_01 |INS        |0   |false   |
+--------+-----------+----+--------+



In [0]:
df = df.join(df_extra, on= "filename")

In [0]:
df.show()

+--------+---+---------------+-------------------------------+------+-----------+----+--------+
|filename|Age|     icd10_code|Extracted_Entities_vs_ICD_Codes|Gender|eligibility|orec|medicaid|
+--------+---+---------------+-------------------------------+------+-----------+----+--------+
| text_01| 57|[C50.91, C50.9]|           [[ductal carcinom...|     F|        INS|   0|   false|
+--------+---+---------------+-------------------------------+------+-----------+----+--------+



## Importing the model version

You can import one of the following functions to calculate the score.

```
- profileV22Y17   - profileV23Y18  - profileV24Y17  - profileV28    - profileESRDV21Y19 - profileRxHCCV05Y20 - profileRxHCCV08Y22                                                           
- profileV22Y18   - profileV23Y19  - profileV24Y18  - profileV28Y24                     - profileRxHCCV05Y21 - profileRxHCCV08Y23
- profileV22Y19                    - profileV24Y19                                      - profileRxHCCV05Y22
- profileV22Y20                    - profileV24Y20                                      - profileRxHCCV05Y23
- profileV22Y21                    - profileV24Y21
- profileV22Y22                    - profileV24Y22
                                   - profileV24
```
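
If the model version should come from configuration rather than being hard-coded, the function can be looked up by name. The sketch below is an assumption-laden illustration: `resolve_profile` is a hypothetical helper, and `fake_medical` is a stand-in object so the example runs without the licensed library. It does not compute a real risk score.

```python
from types import SimpleNamespace


def resolve_profile(module, name):
    """Look up a profile function such as 'profileV24Y22' on `module` by name."""
    if not name.startswith("profile"):
        raise ValueError(f"not a profile function name: {name!r}")
    func = getattr(module, name, None)
    if not callable(func):
        raise ValueError(f"{name!r} is not available in this installation")
    return func


# Stand-in for the `medical` module (assumption: the real functions are
# exposed as attributes of `medical`, as in `medical.profileV22Y17`).
# The lambda is a placeholder and does NOT calculate anything.
fake_medical = SimpleNamespace(profileV24Y22=lambda *cols: "stub")

profile_fn = resolve_profile(fake_medical, "profileV24Y22")
```

In the notebook, passing `medical` instead of the stand-in should return the same function that a direct attribute access such as `medical.profileV22Y17` gives.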

In [0]:
# The profile function returns its result as a JSON string
df = df.withColumn("hcc_profile", medical.profileV22Y17(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.medicaid))

# Parse the JSON string into a struct, then promote its fields to top-level columns
df = df.withColumn("hcc_profile", F.from_json(F.col("hcc_profile"), schema))
df = df.withColumn("risk_score", df.hcc_profile.getItem("risk_score"))\
       .withColumn("hcc_lst", df.hcc_profile.getItem("hcc_lst"))\
       .withColumn("parameters", df.hcc_profile.getItem("parameters"))\
       .withColumn("details", df.hcc_profile.getItem("details"))

df.select('risk_score','icd10_code', 'Age', 'Gender').show(truncate=False )

df.show(truncate=100, vertical=True)

+----------+---------------+---+------+
|risk_score|icd10_code     |Age|Gender|
+----------+---------------+---+------+
|0.986     |[C50.91, C50.9]|57 |F     |
+----------+---------------+---+------+

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------
 filename                        | text_01                                                                                              
 Age                             | 57                                                                                                   
 icd10_code                      | [C50.91, C50.9]                                                                                      
 Extracted_Entities_vs_ICD_Codes | [[ductal carcinoma of the left breast, C50.91], [breast carcinoma, C50.9]]                           
 Gender                          | F                                                                              

## License
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors.  The source in this notebook is provided subject to the [Apache 2.0 License](https://spdx.org/licenses/Apache-2.0.html).  All included or referenced third party libraries are subject to the licenses set forth below.

|Library Name|Library License|Library License URL|Library Source URL|
| :-: | :-:| :-: | :-:|
|Pandas |BSD 3-Clause License| https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas|
|Numpy |BSD 3-Clause License| https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy|
|Apache Spark |Apache License 2.0| https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark|
|BeautifulSoup|MIT License|https://www.crummy.com/software/BeautifulSoup/#Download|https://www.crummy.com/software/BeautifulSoup/bs4/download/|
|Requests|Apache License 2.0|https://github.com/psf/requests/blob/main/LICENSE|https://github.com/psf/requests|
|Spark NLP Display|Apache License 2.0|https://github.com/JohnSnowLabs/spark-nlp-display/blob/main/LICENSE|https://github.com/JohnSnowLabs/spark-nlp-display|
|Spark NLP |Apache License 2.0| https://github.com/JohnSnowLabs/spark-nlp/blob/master/LICENSE | https://github.com/JohnSnowLabs/spark-nlp|
|Spark NLP for Healthcare|[Proprietary license - John Snow Labs Inc.](https://www.johnsnowlabs.com/spark-nlp-health/) |NA|NA|



|Author|
|-|
|Databricks Inc.|
|John Snow Labs Inc.|

## Disclaimers
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account.  Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.