![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Abstracting Real World Data from Oncology Notes: Data Analysis

In the previous notebook (`./15-entity-extraction`) we used Spark NLP pipelines to extract highly specialized oncological entities from unstructured notes and stored the resulting tabular data in our Delta Lake.

In this notebook we analyze these data to answer questions such as: What are the most common cancer subtypes? What are the most common symptoms, and how are they associated with each cancer subtype? Which indications have the highest risk factor?

# 0. Initial configurations

In [0]:
import os
import json
import string
import numpy as np
import pandas as pd


import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader

from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)


print('sparknlp.version : ',sparknlp.version())
print('sparknlp_jsl.version : ',sparknlp_jsl.version())

spark

In [0]:
delta_path='/FileStore/HLS/nlp/delta/jsl/'
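Table locations in this notebook are built from this root with f-strings such as `f'{delta_path}/bronze/mt-oc-notes'`. Because `delta_path` ends with `/`, the resulting string contains a doubled slash, which is usually harmless but can be avoided. The following is a minimal sketch of one way to build the same paths; `DELTA_ROOT` and `table_path` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import posixpath

# Same root as `delta_path` in the cell above (copied here so the sketch is self-contained).
DELTA_ROOT = '/FileStore/HLS/nlp/delta/jsl/'

def table_path(layer, name, root=DELTA_ROOT):
    """Join root, layer, and table name into a single path without doubled slashes.

    `layer` is one of the layer folders used in this notebook
    ('bronze', 'silver', 'gold'); `name` is the table folder name.
    """
    return posixpath.join(root.rstrip('/'), layer, name)

print(table_path('bronze', 'mt-oc-notes'))
```

A call such as `spark.read.load(table_path('silver', 'icd10-hcc-df'))` would then read the same table as the f-string version.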

Let's take a look at the raw text dataset.

In [0]:
df=spark.read.load(f'{delta_path}/bronze/mt-oc-notes')
display(df)

path,text,note_id,patient_id
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_01.txt,"Medical Specialty:Hematology - Oncology Sample Name: BRCA-2 mutation Description: Discharge summary of a patient with a BRCA-2 mutation. (Medical Transcription Sample Report) DISCHARGE DIAGNOSES: BRCA-2 mutation. HISTORY OF PRESENT ILLNESS: The patient is a 59-year-old with a BRCA-2 mutation. Her sister died of breast cancer at age 32 and her daughter had breast cancer at age 27. PHYSICAL EXAMINATION: The chest was clear. The abdomen was nontender. Pelvic examination shows no masses. No heart murmur. HOSPITAL COURSE: The patient underwent surgery on the day of admission. In the postoperative course she was afebrile and unremarkable. The patient regained bowel function and was discharged on the morning of the fourth postoperative day. OPERATIONS AND PROCEDURES: Total abdominal hysterectomy/bilateral salpingo-oophorectomy with resection of ovarian fossa peritoneum en bloc on July 25, 2006. PATHOLOGY: A 105-gram uterus without dysplasia or cancer. CONDITION ON DISCHARGE: Stable. PLAN: The patient will remain at rest initially with progressive ambulation after. She will avoid lifting, driving or intercourse. She will call me if any fevers, drainage, bleeding, or pain. Follow up in my office in four weeks. Family history, social history, psychosocial needs per the social worker. DISCHARGE MEDICATIONS: Percocet 5 #40 one every 3 hours p.r.n. pain.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_02.txt,"Medical Specialty:Hematology - Oncology Sample Name: Mullerian Adenosarcoma Description: Discharge summary of a patient presenting with a large mass aborted through the cervix. (Medical Transcription Sample Report) PRINCIPAL DIAGNOSIS: Mullerian adenosarcoma. HISTORY OF PRESENT ILLNESS: The patient is a 56-year-old presenting with a large mass aborted through the cervix. PHYSICAL EXAM:CHEST: Clear. There is no heart murmur. ABDOMEN: Nontender. PELVIC: There is a large mass in the vagina. HOSPITAL COURSE: The patient went to surgery on the day of admission. The postoperative course was marked by fever and ileus. The patient regained bowel function. She was discharged on the morning of the seventh postoperative day. OPERATIONS: July 25, 2006: Total abdominal hysterectomy, bilateral salpingo-oophorectomy. DISCHARGE CONDITION: Stable. PLAN: The patient will remain at rest initially with progressive ambulation thereafter. She will avoid lifting, driving, stairs, or intercourse. She will call me for fevers, drainage, bleeding, or pain. Family history, social history, and psychosocial needs per the social worker. The patient will follow up in my office in one week. PATHOLOGY: Mullerian adenosarcoma. MEDICATIONS: Percocet 5, #40, one q.3 h. p.r.n. pain.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_03.txt,"Medical Specialty:Hematology - Oncology Sample Name: Leiomyosarcoma Description: Discharge summary of patient with leiomyosarcoma and history of pulmonary embolism, subdural hematoma, pancytopenia, and pneumonia. (Medical Transcription Sample Report) ADMITTING DIAGNOSES:1. Leiomyosarcoma.2. History of pulmonary embolism.3. History of subdural hematoma.4. Pancytopenia.5. History of pneumonia. PROCEDURES DURING HOSPITALIZATION:1. Cycle six of CIVI-CAD (Cytoxan, Adriamycin, and DTIC) from 07/22/2008 to 07/29/2008.2. CTA, chest PE study showing no evidence for pulmonary embolism. 3. Head CT showing no evidence of acute intracranial abnormalities.4. Sinus CT, normal mini-CT of the paranasal sinuses. HISTORY OF PRESENT ILLNESS: Ms. ABC is a pleasant 66-year-old Caucasian female who first palpated a mass in the left posterior arm in spring of 2007. The mass increased in size and she was seen by her primary care physician and referred to orthopedic surgeon. MRI showed inflammation and was thought to be secondary to rheumatoid arthritis. The mass increased in size. She eventually underwent a partial resection found to have pathologic grade 2 leiomyosarcoma, margins were impossible to assess, but were likely positive. She was evaluated by Dr. X and Dr. Y and a decision was made to proceed with preoperative chemotherapy. She began treatment with CIVI-CAD in December 2007. Her course was complicated by pulmonary embolus, pneumonia, and subdural hematoma while on anticoagulation. She eventually underwent surgical resection on May 1, 2008 with small area of residual disease, but otherwise clear margins. HOSPITAL COURSE:1. Leiomyosarcoma, the patient was admitted to Hem/Onco B Service under attending Dr. XYZ for cycle six of continuous IV infusion Cytoxan, Adriamycin, and DTIC, which she tolerated well.2. History of pulmonary embolism. 
Upon admission, the patient reported an approximate two-week history of dyspnea on exertion and some mild chest pain. She underwent a CTA, which showed no evidence of pulmonary embolism and the patient was started on prophylactic doses of Lovenox at 40 mg a day. She had no further complaints throughout the hospitalization with any shortness of breath or chest pain.3. History of subdural hematoma, also on admission the patient noted some mild intermittent headaches that were fleeting in nature, several a day that would resolve on their own. Her headaches were not responding to pain medication and so on 07/24/2008, we obtained a head CT that showed no evidence of acute intracranial abnormalities. The patient also had a history of sinusitis and so a sinus CT scan was obtained, which was normal.4. Pancytopenia. On admission, the patient's white blood count was 3.4, hemoglobin 11.3, platelet count 82, and ANC of 2400. The patient's counts were followed throughout admission. She did not require transfusion of red blood cells or platelets; however, on 07/26/2008 her ANC did dip to 900 and she was placed on neutropenic diet. At discharge her ANC is back up to 1100 and she is taken off neutropenic diet. Her white blood cell count at discharge was 1.4 and her hemoglobin was 11.2 with a platelet count of 140. 5. History of pneumonia. During admission, the patient did not exhibit any signs or symptoms of pneumonia. DISPOSITION: Home in stable condition. DIET: Regular and less neutropenic. ACTIVITY: Resume same activity. FOLLOWUP: The patient will have lab work at Dr. XYZ on 08/05/2008 and she will also return to the cancer center on 08/12/2008 at 10:20 a.m. The patient is also advised to monitor for any fevers greater than 100.5 and should she have any further problems in the meantime to please call in to be seen sooner.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_04.txt,"Medical Specialty:Hematology - Oncology Sample Name: BCCa Excision - Cheek Description: Excision of basal cell carcinoma. Closure complex, open wound. Bilateral capsulectomies. Bilateral explantation and removal of ruptured silicone gel implants (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSES1. Basal cell carcinoma, right cheek.2. Basal cell carcinoma, left cheek.3. Bilateral ruptured silicone gel implants.4. Bilateral Baker grade IV capsular contracture.5. Breast ptosis. POSTOPERATIVE DIAGNOSES1. Basal cell carcinoma, right cheek.2. Basal cell carcinoma, left cheek. 3. Bilateral ruptured silicone gel implants.4. Bilateral Baker grade IV capsular contracture.5. Breast ptosis. PROCEDURE1. Excision of basal cell carcinoma, right cheek, 2.7 cm x 1.5 cm.2. Excision of basal cell carcinoma, left cheek, 2.3 x 1.5 cm.3. Closure complex, open wound utilizing local tissue advancement flap, right cheek.4. Closure complex, open wound, left cheek utilizing local tissue advancement flap.5. Bilateral explantation and removal of ruptured silicone gel implants. 6. Bilateral capsulectomies.7. Replacement with bilateral silicone gel implants, 325 cc. INDICATIONS FOR PROCEDURESThe patient is a 61-year-old woman who presents with a history of biopsy-proven basal cell carcinoma, right and left cheek. She had no prior history of skin cancer. She is status post bilateral cosmetic breast augmentation many years ago and the records are not available for this procedure. She has noted progressive hardening and distortion of the implant. She desires to have the implants removed, capsulectomy and replacement of implants. She would like to go slightly smaller than her current size as she has ptosis going with a smaller implant combined with capsulectomy will result in worsening of her ptosis. She may require a lift. She is not consenting to lift due to the surgical scars. 
PAST MEDICAL HISTORYSignificant for deep venous thrombosis and acid reflux. PAST SURGICAL HISTORY Significant for appendectomy, colonoscopy and BAM. MEDICATIONS: Coumadin. She stopped her Coumadin five days prior to the procedures.2. Lipitor3. Effexor.4. Klonopin. ALLERGIES: None. REVIEW OF SYSTEMS: Negative for dyspnea on exertion, palpitations, chest pain, and phlebitis. PHYSICAL EXAMINATION: VITAL SIGNS: Height 5'8"", weight 155 pounds. FACE: Examination of the face demonstrates basal cell carcinoma, right and left cheek. No lesions are noted in the regional lymph node base and no mass is appreciated. BREAST: Examination of the breast demonstrates bilateral grade IV capsular contracture. She has asymmetry in distortion of the breast. No masses are appreciated in the breast or the axilla. The implants appear to be subglandular. CHEST: Clear to auscultation and percussion. CARDIOVASCULAR: Regular rate and rhythm. EXTREMITIES: Show full range of motion. No clubbing, cyanosis or edema. SKIN: Significant environmental actinic skin damage.I recommended excision of basal cell cancers with frozen section control of the margin, closure will require local tissue flaps. I recommended exchange of the implants with reaugmentation. No final size is guaranteed or implied. We will decrease the size of the implants based on the intraoperative findings as the size is not known. Several options are available. Sizer implants will be placed to best estimate postoperative size. Ptosis will be worse following capsulectomy and going with a smaller implant. She may require a lift in the future. We have obtained preoperative clearance from the patient's cardiologist, Dr. K. The patient has been taken off Coumadin for five days and will be placed back on Coumadin the day after the surgery. The risk of deep venous thrombosis is discussed. 
Other risk including bleeding, infection, allergic reaction, pain, scarring, hypertrophic scarring and poor cosmetic resolve, worsening of ptosis, exposure, extrusion, the rupture of the implants, numbness of the nipple-areolar complex, hematoma, need for additional surgery, recurrent capsular contracture and recurrence of the skin cancer was all discussed, which she understands and informed consent is obtained. PROCEDURE IN DETAILAfter appropriate informed consent was obtained, the patient was placed in the preoperative holding area with **** input. She was then taken to the major operating room with ABCD Surgery Center, placed in a supine position. Intravenous antibiotics were given. TED hose and SCDs were placed. After the induction of adequate general endotracheal anesthesia, she was prepped and draped in the usual sterile fashion. Sites for excision and skin cancers were carefully marked with 5 mm margin. These were injected with 1% lidocaine with epinephrine.After allowing adequate time for basal constriction hemostasis, excision was performed, full thickness of the skin. They were tagged at the 12 o'clock position and sent for frozen section. Hemostasis was achieved using electrocautery. Once margins were determined to be free of involvement, local tissue flaps were designed for advancement. Undermining was performed. Hemostasis was achieved using electrocautery. Closure was performed under moderate tension with interrupted 5-0 Vicryl. Skin was closed under loop magnification paying meticulous attention and cosmetic details with 6-0 Prolene. Attention was then turned to the breast, clothes were changed, gloves were changed, incision was planned and the previous inframammary incision beginning on the right incision was made. Dissection was carried down to the capsule. It was extremely calcified. Dissection of the anterior surface of the capsule was performed. 
The implant was subglandular, the capsule was entered, implant was noted to be grossly intact; however, there was free silicone. Implant was removed and noted to be ruptured. No marking as to the size of the implant was found.Capsulectomy was performed leaving a small portion in the axilla in the inframammary fold. Pocket was modified to medialize the implant by placing 2-0 Prolene laterally in mattress sutures to restrict the pocket. In identical fashion, capsulectomy was performed on the left. Implant was noted to be grossly ruptured. No marking was found for the size of the implant. The entire content was weighed and found to be 350 grams. Right side was weighed and noted to be 338 grams, although some silicone was lost in the transfer and most likely was identical 350 grams. The implants appeared to be double lumen with the saline portion deflated. Completion of the capsulectomy was performed on the left.The pocket was again fashioned to improve symmetry with the right with Prolene. Pockets were thoroughly irrigated. Hemostasis was achieved using electrocautery, checked for symmetry, which is determined to be excellent. Several liters of normal saline were utilized to irrigate the pocket. Hemostasis was determined to be excellent. Drains were placed. 2-0 Vicryl sutures were preplaced. Pockets were checked for hemostasis, irrigated with normal saline, sizing was performed with placement of 275 cc implants. She was placed in a sitting position. This significantly worsened ptosis and for this reason, 325 cc implants were chosen. She is placed back in a supine position. Pockets were irrigated with antibiotic solution, 2-0 Vicryl sutures were preplaced and gloves were changed. The patient was reprepped with Betadine, towels were changed.These were soaked in antibiotic solution. Gloves were changed, gowns were changed, the patient was reprepped, towels were changed and implants were placed. The patient was placed in a sitting position. Symmetry was excellent. 
There was noted to be a decrease of volume of approximately 40 cc from the capsulectomy as well as the additional reduction of 25 cc. Ptosis was slightly worse; however, excellent shape of the breast. She was placed back in a supine position, preplaced 2-0 Vicryl sutures were tied; a second layer subcutaneous of 3-0 Vicryl was placed followed by a third of 4-0 Vicryl. The skin was closed with a running 4-0 Prolene. Drains were secured. All sponge and needle counts were correct. Dressing was applied. COMPLICATIONSNone. DISPOSITIONTo recovery room.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_05.txt,"Medical Specialty:Hematology - Oncology Sample Name: Consult - Breast Cancer - 1 Description: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma. (Medical Transcription Sample Report) CHIEF COMPLAINT: Left breast cancer. HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging. She prior to that area, denied any complaints. She had no nipple discharge. No trauma history. She has had been on no estrogen supplementation. She has had no other personal history of breast cancer. Her family history is positive for her mother having breast cancer at age 48. The patient has had no children and no pregnancies. She denies any change in the right breast. Subsequent to the office visit and tissue diagnosis of breast cancer, she has had medical oncology consultation with Dr. X and radiation oncology consultation with Dr. Y. I have discussed the case with Dr. X and Dr. Y, who are both in agreement with proceeding with surgery prior to adjuvant therapy. The patient's metastatic workup has otherwise been negative with MRI scan and CT scanning. The MRI scan showed some close involvement possibly involving the left pectoralis muscle, although thought to also possibly represent biopsy artifact. CT scan of the neck, chest, and abdomen is negative for metastatic disease. 
PAST MEDICAL HISTORY: Previous surgery is history of benign breast biopsy in 1972, laparotomy in 1981, 1982, and 1984, right oophorectomy in 1984, and ganglion cyst removal of the hand in 1987. MEDICATIONS: She is currently on omeprazole for reflux and indigestion. ALLERGIES: SHE HAS NO KNOWN DRUG ALLERGIES. REVIEW OF SYSTEMS: Negative for any recent febrile illnesses, chest pains or shortness of breath. Positive for restless leg syndrome. Negative for any unexplained weight loss and no change in bowel or bladder habits. FAMILY HISTORY: Positive for breast cancer in her mother and also mesothelioma from possible asbestosis or asbestos exposure. SOCIAL HISTORY: The patient works as a school teacher and teaching high school. PHYSICAL EXAMINATION: GENERAL: The patient is a white female, alert and oriented x 3, appears her stated age of 57. HEENT: Head is atraumatic and normocephalic. Sclerae are anicteric. NECK: Supple. CHEST: Clear. HEART: Regular rate and rhythm. BREASTS: Exam reveals an approximately 1.5 cm relatively mobile focal palpable mass in the inferomedial left breast at the 7 o'clock position, which clinically is not fixed to the underlying pectoralis muscle. There are no nipple retractions. No skin dimpling. There is some, at the time of the office visit, ecchymosis from recent biopsy. There is no axillary adenopathy. The remainder of the left breast is without abnormality. The right breast is without abnormality. The axillary areas are negative for adenopathy bilaterally. ABDOMEN: Soft, nontender without masses. No gross organomegaly. No CVA or flank tenderness. EXTREMITIES: Grossly neurovascularly intact. IMPRESSION: The patient is a 57-year-old female with invasive ductal carcinoma of the left breast, T1c, Nx, M0 left breast carcinoma. RECOMMENDATIONS: I have discussed with the patient in detail about the diagnosis of breast cancer and the surgical options, and medical oncologist has discussed with her issues about adjuvant or neoadjuvant chemotherapy. 
We have decided to recommend to the patient breast conservation surgery with left breast lumpectomy with preoperative sentinel lymph node injection and mapping and left axillary dissection. The possibility of further surgery requiring wider lumpectomy or even completion mastectomy was explained to the patient. The procedure and risks of the surgery were explained to include, but not limited to extra bleeding, infection, unsightly scar formation, the possibility of local recurrence, the possibility of left upper extremity lymphedema was explained. Local numbness, paresthesias or chronic pain was explained. The patient was given an educational brochure and several brochures about the diagnosis and treatment of breast cancers. She was certainly encouraged to obtain further surgical medical opinions prior to proceeding. I believe the patient has given full informed consent and desires to proceed with the above.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_06.txt,"Medical Specialty:Hematology - Oncology Sample Name: Intraperitoneal Mesothelioma Description: A female with a history of peritoneal mesothelioma who has received prior intravenous chemotherapy. (Medical Transcription Sample Report) REASON FOR ADMISSION: Intraperitoneal chemotherapy. HISTORY: A very pleasant 63-year-old hypertensive, nondiabetic, African-American female with a history of peritoneal mesothelioma. The patient has received prior intravenous chemotherapy. Due to some increasing renal insufficiency and difficulties with hydration, it was elected to change her to intraperitoneal therapy. She had her first course with intraperitoneal cisplatin, which was very difficultly tolerated by her. Therefore, on the last hospitalization for IP chemo, she was switched to Taxol. The patient since her last visit has done relatively well. She had no acute problems and has basically only chronic difficulties. She has had some decrease in her appetite, although her weight has been stable. She has had no fever, chills, or sweats. Activity remains good and she has continued difficulty with depression associated with type 1 bipolar disease. She had a recent CT scan of the chest and abdomen. The report showed the following findings. In the chest, there was a small hiatal hernia and a calcification in the region of the mitral valve. There was one mildly enlarged mediastinal lymph node. Several areas of ground-glass opacity were noted in the lower lungs, which were subtle and nonspecific. No pulmonary masses were noted. In the abdomen, there were no abnormalities of the liver, pancreas, spleen, and left adrenal gland. On the right adrenal gland, a 17 x 13 mm right adrenal adenoma was noted. There were some bilateral renal masses present, which were not optimally evaluated due to noncontrast study. A hyperdense focus in the lower pole of the left kidney was felt to most probably represent a hemorrhagic renal cyst. 
It was unchanged from February and measured 9 mm. There was again minimal left pelvic/iliac _______ with right and left peritoneal catheters noted and were unremarkable. Mesenteric nodes were seen, which were similar in appearance to the previous study that was felt somewhat more conspicuous due to opacified bowel adjacent to them. There was a conglomerate omental mass, which had decreased in volume when compared to previous study, now measuring 8.4 x 1.6 cm. In the pelvis, there was a small amount of ascites in the right pelvis extending from the inferior right paracolic gutter. No suspicious osseous lesions were noted. CURRENT MEDICATIONS: Norco 10 per 325 one to two p.o. q.4h. p.r.n. pain, atenolol 50 mg p.o. b.i.d., Levoxyl 75 mcg p.o. daily, Phenergan 25 mg p.o. q.4-6h. p.r.n. nausea, lorazepam 0.5 mg every 8 hours as needed for anxiety, Ventolin HFA 2 puffs q.6h. p.r.n., Plavix 75 mg p.o. daily, Norvasc 10 mg p.o. daily, Cymbalta 60 mg p.o. daily, and Restoril 30 mg at bedtime as needed for sleep. ALLERGIES: THE PATIENT STATES THAT ON OCCASION LORAZEPAM DOSE PRODUCE HALLUCINATIONS, AND SHE HAD DIFFICULTY TOLERATING ATIVAN. PHYSICAL EXAMINATIONVITAL SIGNS: The patient's height is 165 cm, weight is 77 kg. BSA is 1.8 sq m. The vital signs reveal blood pressure to be 158/75, heart rate 61 per minute with a regular sinus rhythm, temperature of 96.6 degrees, respiratory rate 18 with an SpO2 of 100% on room air. GENERAL: She is normally developed; well nourished; very cooperative; oriented to person, place, and time; and in no distress at this time. She is anicteric. HEENT: EOM is full. Pupils are equal, round, reactive to light and accommodation. Disc margins are unremarkable as are the ocular fields. Mouth and pharynx within normal limits. The TMs are glistening bilaterally. External auditory canals are unremarkable. NECK: Supple, nontender without adenopathy. Trachea is midline. There are no bruits nor is there jugular venous distention. 
CHEST: Clear to percussion and auscultation bilaterally. HEART: Regular rate and rhythm without murmur, gallop, or rub. BREASTS: Unremarkable. ABDOMEN: Slightly protuberant. Bowel tones are present and normal. She has no palpable mass, and there is no hepatosplenomegaly. EXTREMITIES: Within normal limits. NEUROLOGICAL: Nonfocal. DIAGNOSTIC IMPRESSION: 1. Intraperitoneal mesothelioma, partial remission, as noted by CT scan of the abdomen.2. Presumed left lower pole kidney hemorrhagic cyst.3. History of hypertension.4. Type 1 bipolar disease. PLAN: The patient will have appropriate laboratory studies done. A left renal ultrasound is requested to further delineate the possible hemorrhagic cyst in the lower left pole of the left kidney. Interventional radiology will access for ports in the abdomen. She will receive chemotherapy intraperitoneally. The plan will be to use intraperitoneal Taxol.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_07.txt,"Medical Specialty:Hematology - Oncology Sample Name: Mesothelioma - Port-A-Cath Insertion Description: Biopsy-proven mesothelioma - Placement of Port-A-Cath, left subclavian vein with fluoroscopy. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Mesothelioma. POSTOPERATIVE DIAGNOSIS: Mesothelioma. OPERATIVE PROCEDURE: Placement of Port-A-Cath, left subclavian vein with fluoroscopy. ASSISTANT: None. ANESTHESIA: General endotracheal. COMPLICATIONS: None. DESCRIPTION OF PROCEDURE: The patient is a 74-year-old gentleman who underwent right thoracoscopy and was found to have biopsy-proven mesothelioma. He was brought to the operating room now for Port-A-Cath placement for chemotherapy. After informed consent was obtained with the patient, the patient was taken to the operating room, placed in supine position. After induction of general endotracheal anesthesia, routine prep and drape of the left chest, left subclavian vein was cannulated with #18 gauze needle, and guidewire was inserted. Needle was removed. Small incision was made large enough to harbor the port. Dilator and introducers were then placed over the guidewire. Guidewire and dilator were removed, and a Port-A-Cath was introduced in the subclavian vein through the introducers. Introducers were peeled away without difficulty. He measured with fluoroscopy and cut to the appropriate length. The tip of the catheter was noted to be at the junction of the superior vena cava and right atrium. It was then connected to the hub of the port. Port was then aspirated for patency and flushed with heparinized saline and summoned to the chest wall. Wounds were then closed. Needle count, sponge count, and instrument counts were all correct.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_08.txt,"Medical Specialty:Hematology - Oncology Sample Name: Mesothelioma - Pleural Biopsy Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma. POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma. PROCEDURE: Right VATS pleurodesis and pleural biopsy. ANESTHESIA: General double-lumen endotracheal. DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface. SPECIMEN: Pleural biopsies for pathology and microbiology. ESTIMATED BLOOD LOSS: Minimal. FLUIDS: Crystalloid 1.2 L and 1.9 L of pleural effusion drained. INDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed. PROCEDURE IN DETAIL: After informed consent was obtained, the patient was brought to the operating room and placed in supine position. A double-lumen endotracheal tube was placed. SCDs were also placed and he was given preoperative Kefzol. The patient was then brought into the right side up, left decubitus position, and the area was prepped and draped in the usual fashion. A needle was inserted in the axillary line to determine position of the effusion. At this time, a 10-mm port was placed using the knife and Bovie cautery. The effusion was drained by placing a sucker into this port site. 
Upon feeling the surface of the pleura, there were multiple firm nodules. An additional anterior port was then placed in similar fashion. The effusion was then drained with a sucker. Multiple pleural biopsies were taken with the biopsy device in all areas of the pleura. Of note, feeling the diaphragmatic surface, it appeared that it was quite nodular, but these nodules felt as though they were on the other side of the diaphragm and not on the pleural surface of the diaphragm concerning for a possibly metastatic disease. This will be worked up with further imaging study later in his hospitalization. After the effusion had been drained, 2 cans of talc pleurodesis aerosol were used to cover the lung and pleural surface with talc. The lungs were then inflated and noted to inflate well. A 32 curved chest tube chest tube was placed and secured with nylon. The other port site was closed at the level of the fascia with 2-0 Vicryl and then 4-0 Monocryl for the skin. The patient was then brought in the supine position and extubated and brought to recovery room in stable condition. Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case.",,
dbfs:/FileStore/HLS/nlp/data/mt_oncology_10/mt_note_09.txt,"Medical Specialty:Hematology - Oncology Sample Name: Sickle Cell Anemia - ER Visit Description: A 19-year-old known male with sickle cell anemia comes to the emergency room on his own with 3-day history of back pain. (Medical Transcription Sample Report) HISTORY OF PRESENT ILLNESS: This is a 19-year-old known male with sickle cell anemia. He comes to the emergency room on his own with 3-day history of back pain. He is on no medicines. He does live with a room mate. Appetite is decreased. No diarrhea, vomiting. Voiding well. Bowels have been regular. Denies any abdominal pain. Complains of a slight headaches, but his main concern is back ache that extends from above the lower T-spine to the lumbosacral spine. The patient is not sure of his immunizations. The patient does have sickle cell and hemoglobin is followed in the Hematology Clinic. ALLERGIES: THE PATIENT IS ALLERGIC TO TYLENOL WITH CODEINE, but he states he can get morphine along with Benadryl. MEDICATIONS: He was previously on folic acid. None at the present time. PAST SURGICAL HISTORY: He has had no surgeries in the past. FAMILY HISTORY: Positive for diabetes, hypertension and cancer. SOCIAL HISTORY: He denies any smoking or drug usage. PHYSICAL EXAMINATION: VITAL SIGNS: On examination, the patient has a temp of 37 degrees tympanic, pulse was recorded at 37 per minute, but subsequently it was noted to be 66 per minute, respiratory rate is 24 per minute and blood pressure is 149/66, recheck blood pressure was 132/72. GENERAL: He is alert, speaks in full sentences, he does not appear to be in distress. HEENT: Normal. NECK: Supple. CHEST: Clear. HEART: Regular. ABDOMEN: Soft. He has pain over the mid to lower spine. SKIN: Color is normal. EXTREMITIES: He moves all extremities well. NEUROLOGIC: Age appropriate.ER COURSE: It was indicated to the patient that I will be drawing labs and giving him IV fluids. 
Also that he will get morphine and Benadryl combination. The patient was ordered a liter of NS over an hour, and was then maintained on D5 half-normal saline at 125 an hour. CBC done showed white blood cells 4300, hemoglobin 13.1 g/dL, hematocrit 39.9%, platelets 162,000, segs 65.9, lymphs 27, monos 3.4. Chemistries done were essentially normal except for a total bilirubin of 1.6 mg/dL, all of which was indirect. The patient initially received morphine and diphenhydramine at 18:40 and this was repeated again at 8 p.m. He received morphine 5 mg and Benadryl 25 mg. I subsequently spoke to Dr. X and it was decided to admit the patient.The patient initially stated that he wanted to be observed in the ER and given pain control and fluids and wanted to go home in the morning. He stated that he has a job interview in the morning. The resident service did come to evaluate him. The resident service then spoke to Dr. X and it was decided to admit him on to the Hematology service for control of pain and IV hydration. He is to be transitioned to p.o. medications about 4 a.m. and hopefully, he can be discharged in time to make his interview tomorrow. IMPRESSION: Sickle cell crisis. DIFFERENTIAL DIAGNOSIS: Veno-occlusive crisis, and diskitis.",,


# 1. ICD-10 Codes and HCC Status

Now we load the ICD-10 Delta tables and register them as temporary views so that they can be queried with SQL.

In [0]:
icd10_hcc_df=spark.read.load(f'{delta_path}/silver/icd10-hcc-df')
icd10_hcc_df.createOrReplaceTempView('icd10HccView')

best_icd_mapped_df=spark.read.load(f'{delta_path}/gold/best-icd-mapped')
best_icd_mapped_df.createOrReplaceTempView('bestIcdMappedView')
best_icd_mapped_pdf=best_icd_mapped_df.toPandas()

In [0]:
%sql
select * from icd10HccView
limit 10

path,final_chunk,entity,icd10_code,confidence,all_codes,resolutions,hcc_status,hcc_score,billable,icd_codes_names,icd_code_billable
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cough,Symptom,R05,1.0,"List(R05, G4483, F458)","List(cough [Cough], cough headache syndrome [Primary cough headache], cough, psychogenic [Other somatoform disorders])","List(0, 0, 1)","List(0, 0, nan)","List(1, 1, 1)",cough,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,chest pain,Symptom,R074,0.4975,"List(R074, R079, R073, R0789, M542, R070, R1031, R1030, R071, R078, M549, R52, R0781)","List(chest pain [Pain in throat and chest], chest pain [Chest pain, unspecified], chest wall pain [Pain in throat and chest], chest wall pain [Other chest pain], neck pain [Cervicalgia], throat pain [Pain in throat], groin pain [Right lower quadrant pain], groin pain [Lower abdominal pain, unspecified], chest pain on breathing [Chest pain on breathing], other chest pain [Other chest pain], spine pain [Dorsalgia, unspecified], pain [Pain, unspecified], chest pain, pleuritic [Pleurodynia])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)",chest pain,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cough,Symptom,R05,1.0,"List(R05, G4483, F458)","List(cough [Cough], cough headache syndrome [Primary cough headache], cough, psychogenic [Other somatoform disorders])","List(0, 0, 1)","List(0, 0, nan)","List(1, 1, 1)",cough,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,chest pain,Symptom,R074,0.4975,"List(R074, R079, R073, R0789, M542, R070, R1031, R1030, R071, R078, M549, R52, R0781)","List(chest pain [Pain in throat and chest], chest pain [Chest pain, unspecified], chest wall pain [Pain in throat and chest], chest wall pain [Other chest pain], neck pain [Cervicalgia], throat pain [Pain in throat], groin pain [Right lower quadrant pain], groin pain [Lower abdominal pain, unspecified], chest pain on breathing [Chest pain on breathing], other chest pain [Other chest pain], spine pain [Dorsalgia, unspecified], pain [Pain, unspecified], chest pain, pleuritic [Pleurodynia])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)",chest pain,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cancer,Oncological,C801,0.9985,"List(C801, C44509, C4490, C449, C380, C61, C3490, C760, C4452, C169, C768, C189, C4459, C44519, C50919, C4440, C444, C56, C569, C6960, C809, C162)","List(cancer [Malignant (primary) neoplasm, unspecified], cancer of chest [Unspecified malignant neoplasm of skin of other part of trunk], cancer of the skin [Unspecified malignant neoplasm of skin, unspecified], cancer of the skin [Other and unspecified malignant neoplasm of skin, unspecified], cancer of the heart [Malignant neoplasm of heart], cancer of prostate [Malignant neoplasm of prostate], cancer of the lung [Malignant neoplasm of unspecified part of unspecified bronchus or lung], cancer of the neck [Malignant neoplasm of head, face and neck], cancer, skin of breast [Squamous cell carcinoma of skin of trunk], cancer of the stomach [Malignant neoplasm of stomach, unspecified], cancer of the back [Malignant neoplasm of other specified ill-defined sites], cancer of the colon [Malignant neoplasm of colon, unspecified], cancer of the back, basal cell [Other specified malignant neoplasm of skin of trunk], cancer of the back, basal cell [Basal cell carcinoma of skin of other part of trunk], breast cancer [Malignant neoplasm of unspecified site of unspecified female breast], cancer of the skin, neck [Unspecified malignant neoplasm of skin of scalp and neck], cancer of the skin, neck [Other and unspecified malignant neoplasm of skin of scalp and neck], ovarian cancer [Malignant neoplasm of ovary], ovarian cancer [Malignant neoplasm of unspecified ovary], cancer of the orbit [Malignant neoplasm of unspecified orbit], dmmr cancer [Malignant neoplasm without specification of site], cancer of the stomach, body [Malignant neoplasm of body of stomach])","List(1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1)","List(12, 0, 0, 0, 11, 12, 9, 12, 0, 9, 12, 11, 0, 0, 0, 0, 0, 0, 10, 12, 0, 9)","List(1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 
1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1)",cancer,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,numbness,Symptom,R202,0.8508,"List(R202, N9489, R6889, R4020, G252, R42, R401, R4589, R410, R451, M25642)","List(numbness of skin [Paresthesia of skin], numbness of vulva (finding) [Other specified conditions associated with female genital organs and menstrual cycle], mucosal numbness (finding) [Other general symptoms and signs], unconsciousness [Unspecified coma], tremor, rest [Other specified forms of tremor], subjective vertigo [Dizziness and giddiness], mental status, stupor [Stupor], feeling physically tense (finding) [Other symptoms and signs involving emotional state], clouded consciousness [Disorientation, unspecified], restlessness [Restlessness and agitation], stiffness of left hand [Stiffness of left hand, not elsewhere classified])","List(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 80, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)",numbness of skin,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,tingling of her left arm,Symptom,R202,0.2334,"List(R202, M7989, R223, R298, R2230, R2242, R2232, R2231, T23052A, R198)","List(tingling sensation [Paresthesia of skin], swelling of left arm [Other specified soft tissue disorders], swelling of upper arm [Localized swelling, mass and lump, upper limb], downward drift of outstretched supinated arm (finding) [Other symptoms and signs involving the nervous and musculoskeletal systems], localized swelling on forearm [Localized swelling, mass and lump, unspecified upper limb], swelling of left lower limb [Localized swelling, mass and lump, left lower limb], localized swelling on left arm [Localized swelling, mass and lump, left upper limb], swelling of right upper limb [Localized swelling, mass and lump, right upper limb], burn of left palm [Burn of unspecified degree of left palm, initial encounter], sensation as if bowel still full (finding) [Other specified symptoms and signs involving the digestive system and abdomen])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 0, 0, 1, 1, 1, 1, 1, 1)",tingling sensation,1


In [0]:
%sql
select entity, count(*) from icd10HccView
group by 1
order by 2

entity,count(*)
Treatment,90
Oncological,441
Symptom,968


## 1.1. Get general information for staff management, reporting, & planning.

Let's take a look at the distribution of mapped codes

In [0]:
display(
  best_icd_mapped_df
  .select('onc_code_desc')
  .filter("onc_code_desc!='-'")
  .groupBy('onc_code_desc')
  .count()
  .orderBy('count')
)

onc_code_desc,count
Aplastic and other anemias and other bone marrow failure syndromes,3
Other disorders of the nervous system,3
In situ neoplasms,4
Metabolic disorders,6
Persons encountering health services for examinations,6
Malignant neoplasms of digestive organs,26
Malignant neoplasms of female genital organs,41
Melanoma and other malignant neoplasms of skin,51
"Malignant neoplasms of ill-defined, other secondary and unspecified sites",53
"Malignant neoplasms of lymphoid, hematopoietic and related tissue",85


We can also visualize the results as a count plot to see the number of cases in each parent category.

In [0]:
import plotly.graph_objects as go

_ps=best_icd_mapped_pdf['onc_code_desc'].value_counts()
data=_ps[_ps.index!='-']

fig = go.Figure(go.Bar(
            x=data.values,
            y=data.index,
            orientation='h'))
fig.show()

## 1.2. Reimbursement-ready data with billable codes
In the previous notebook, using an ICD-10 oncology mapping dictionary, we created a dataset of coded conditions that are all billable. To assess the quality of the mapping, we can look at the distribution of the positions of the nearest billable codes.

In [0]:
import plotly.express as px

_ps=best_icd_mapped_pdf['nearest_billable_code_pos'].value_counts()
# filter on the index (the code positions), not on the counts, to drop the '-' placeholder
data=_ps[_ps.index!='-']
data_pdf=pd.DataFrame({"count":data.values,"Index of Nearest Billable Codes":data.index})

fig = px.bar(data_pdf, x='Index of Nearest Billable Codes', y='count')
fig.show()
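
The `nearest_billable_code_pos` column was computed in the previous notebook, which is not shown here. As a rough illustration of what such a position could mean, the sketch below assumes it is the index of the first billable entry in the resolver's ranked candidate list (the `all_codes` and `billable` columns shown earlier). The helper name `nearest_billable_pos` and this definition are assumptions, not necessarily how the column was actually built.

```python
from typing import Optional, Sequence

def nearest_billable_pos(billable_flags: Sequence[int]) -> Optional[int]:
    """Return the index of the first billable candidate, or None if there is none.

    Hypothetical helper: assumes the flags are ordered like `all_codes`,
    i.e. best-ranked candidate first, with 1 meaning billable.
    """
    for pos, flag in enumerate(billable_flags):
        if flag == 1:
            return pos
    return None

# Flags copied from the `billable` column of the "Mesothelioma" row above.
mesothelioma_flags = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
# Flags copied from the "cough" row: the top candidate is already billable.
cough_flags = [1, 1, 1]

print(nearest_billable_pos(mesothelioma_flags))
print(nearest_billable_pos(cough_flags))
```

Under this reading, a distribution concentrated at small positions would suggest that the top-ranked resolution is usually billable or very close to a billable code.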

## 1.3. See which indications have the highest average risk factor
In our pipeline we used `sbiobertresolve_icd10cm_augmented_billable_hcc` as a sentence resolver; this model also returns HCC codes. We can look at the distribution of risk factors for each category.
Note that each category has a different number of data points, so to get a full picture of the distribution of risk factors for each condition, we use box plots.

In [0]:
import plotly.express as px

df = best_icd_mapped_pdf[best_icd_mapped_pdf.onc_code_desc!='-'].dropna()
fig = px.box(df, y="onc_code_desc", x="corresponding_hcc_score", hover_data=df.columns)
fig.show()

As we can see, some categories have a higher risk factor even though they contain fewer cases.
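
To read the box plot alongside the case counts, it can help to tabulate the number of cases and the median score per category. The sketch below does this on made-up data; the category names and scores are hypothetical and only stand in for `onc_code_desc` and `corresponding_hcc_score`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (category, HCC score) pairs; the values are made up for illustration.
rows = [
    ("Category A", 9), ("Category A", 10), ("Category A", 12),
    ("Category A", 9), ("Category A", 11),
    ("Category B", 75), ("Category B", 12),
]

# Collect the scores of each category.
scores_by_category = defaultdict(list)
for category, score in rows:
    scores_by_category[category].append(score)

# Per-category case count and the statistics a box plot is built around.
summary = {
    category: {
        "n": len(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }
    for category, scores in scores_by_category.items()
}
for category, stats in summary.items():
    print(category, stats)
```

In this toy example the category with only two cases has the higher median, which is the kind of pattern the box plot makes visible.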

## 1.4. Analyze Oncological Entities
We can find the most frequent oncological entities.

In [0]:
onc_df = (
  icd10_hcc_df
  .filter("entity == 'Oncological'")
  .select("path","final_chunk","entity","icd10_code","icd_codes_names","icd_code_billable")
 )
onc_pdf=onc_df.toPandas()
onc_pdf.head(10)

Unnamed: 0,path,final_chunk,entity,icd10_code,icd_codes_names,icd_code_billable
0,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_47.txt,cancer,Oncological,C801,cancer,1
1,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_47.txt,tumor cells,Oncological,D497,tumor of carotid body,1
2,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,Colon Cancer,Oncological,C189,cancer of the colon,1
3,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,colon cancer,Oncological,C189,cancer of the colon,1
4,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,T3c,Oncological,E119,t2dm,1
5,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,M0 colon cancer,Oncological,C9200,"leukemia, acute myelocytic, fab m0",1
6,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,tumor,Oncological,D446,carotid body tumor,1
7,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,colon cancer,Oncological,C189,cancer of the colon,1
8,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,colon cancers,Oncological,C189,"cancer of the colon, stage 3",1
9,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,colon cancer,Oncological,C189,cancer of the colon,1


In [0]:
import plotly.express as px

_ps=onc_pdf['icd_codes_names'].value_counts()
data=_ps[_ps.index!='-']
data_pdf=pd.DataFrame({"count":data.values,'icd code names':data.index})
data_pdf=data_pdf[data_pdf['count']>5]
fig = px.bar(data_pdf, y='icd code names', x='count',orientation='h')
fig.show()

### Report Counts by ICD10CM Code Names
Each bar shows the number of reports that contain the corresponding cancer entity.

In [0]:
display(
  onc_df.select('icd_codes_names','path')
  .dropDuplicates()
  .groupBy('icd_codes_names')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

icd_codes_names,count
carcinoma,10
cancer,10
carotid body tumor,8
breast cancer,6
breast cyst,6
ent symptoms,5
nephrotic syndrome w focal and segmental glomerular lesions,5
ekc,4
basal cell carcinoma of back,4
lymphoma,4


### Most common symptoms
We can find the most common symptoms by counting, for each symptom, the number of documents in which it appears.

In [0]:
display(
  icd10_hcc_df
  .filter("lower(entity)='symptom'")
  .selectExpr('path','icd_codes_names as symptom')
  .dropDuplicates()
  .groupBy('symptom')
  .count()
  .orderBy(F.desc('count'))
  .limit(30)
)

symptom,count
edema,18
chest mass,17
distress,16
vesicular murmur,15
symptom occurs at night (finding),14
hepatosplenomegaly,13
pain,13
lymphadenopathy,12
amblyopic,11
nausea,11
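
The `dropDuplicates()` call is what turns mention counts into document counts: a symptom repeated within one note is counted once. A minimal pure-Python sketch of the same logic, using hypothetical note names and mentions:

```python
from collections import Counter

# Hypothetical (document, symptom) mentions; a symptom can repeat within a note.
mentions = [
    ("note_1.txt", "cough"), ("note_1.txt", "cough"), ("note_1.txt", "edema"),
    ("note_2.txt", "edema"), ("note_3.txt", "edema"), ("note_3.txt", "cough"),
    ("note_3.txt", "nausea"),
]

unique_pairs = set(mentions)  # plays the role of dropDuplicates()
doc_counts = Counter(symptom for _, symptom in unique_pairs)  # groupBy + count
mention_counts = Counter(symptom for _, symptom in mentions)  # without deduplication

print("documents per symptom:", dict(doc_counts))
print("mentions per symptom: ", dict(mention_counts))
```

Here "cough" is mentioned three times but appears in only two notes, so the two counts differ.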


### Extract most frequent oncological diseases and symptoms based on documents

Here, we count the number of documents for each symptom-disease pair. To do this, we first keep only the entities resolved with sufficient confidence (above 0.30) and then create a pivot table.

In [0]:
entity_symptom_df = (
  icd10_hcc_df
  .select('path','entity','icd_codes_names')
  .filter("lower(entity) in ('symptom','oncological') AND confidence > 0.30")
  .dropDuplicates()
)
display(entity_symptom_df)

path,entity,icd_codes_names
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,difficulty using urine bottle (finding)
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,syncope
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,congestion of nose
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_49.txt,Oncological,history of breast biopsy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,loss of balance
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,worried
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,dysphagia
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_5.txt,Symptom,distress
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_48.txt,Symptom,blood in stool
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,muscle ache


In [0]:
condition_symptom_df = (
  entity_symptom_df.groupBy('path').pivot("entity").agg(F.collect_list("icd_codes_names"))
  .withColumnRenamed('Oncological','Condition')
  .withColumn('Conditions',F.explode('Condition'))
  .withColumn('Symptoms',F.explode('Symptom'))
  .drop('Condition','Symptom')
  .dropna()
  .fillna(0)
)
display(condition_symptom_df)

path,Conditions,Symptoms
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",dysphagia
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",swollen legs
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",blood in stool
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",dyspnea
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",difficulty in voiding
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",constipation
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",feeling tired
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",hemoptysis
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",odynophagia
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"renal cell carcinoma, l kidney",cough


In [0]:
conditions_symptoms_count_df=condition_symptom_df.groupBy('Conditions').pivot("Symptoms").count().fillna(0)
conditions_symptoms_count_pdf=conditions_symptoms_count_df.toPandas()
conditions_symptoms_count_pdf.index=conditions_symptoms_count_pdf['Conditions']
conditions_symptoms_count_pdf=conditions_symptoms_count_pdf.drop('Conditions',axis=1)

In [0]:
selected_rows=conditions_symptoms_count_pdf.index[conditions_symptoms_count_pdf.sum(axis=1)>10]
selected_columns=conditions_symptoms_count_pdf.columns[conditions_symptoms_count_pdf.sum(axis=0)>10]

In [0]:
data_pdf=conditions_symptoms_count_pdf.loc[selected_rows,selected_columns]
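
The three cells above build a condition-by-symptom count matrix and keep only the rows and columns whose totals exceed 10. A small pure-Python sketch of the same idea, on hypothetical pairs and with a lower threshold so that the toy data is not filtered out entirely:

```python
from collections import Counter

# Hypothetical (condition, symptom) pairs, one per document in which both occur.
pairs = [
    ("lymphoma", "edema"), ("lymphoma", "edema"), ("lymphoma", "nausea"),
    ("breast cancer", "edema"), ("colon cancer", "constipation"),
]
min_total = 2  # the notebook uses 10 on the real data; 2 suits this toy example

counts = Counter(pairs)

# Row (condition) and column (symptom) totals, computed on the full matrix.
row_totals = Counter()
col_totals = Counter()
for (condition, symptom), n in counts.items():
    row_totals[condition] += n
    col_totals[symptom] += n

kept_rows = sorted(c for c, total in row_totals.items() if total > min_total)
kept_cols = sorted(s for s, total in col_totals.items() if total > min_total)

# Counter returns 0 for pairs that never co-occur, mirroring fillna(0).
filtered = {c: {s: counts[(c, s)] for s in kept_cols} for c in kept_rows}
print(filtered)
```

As in the notebook, both totals are computed before any filtering, so dropping a sparse column does not change which rows are kept.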

Now let's visualize the co-occurrence of conditions and symptoms as a heatmap. We can directly look at the counts of symptoms by condition.

In [0]:
import plotly.express as px
def plot_heatmap(data,color='occurrence'):
  # rows of `data` (conditions) go on the y-axis, columns (symptoms) on the x-axis
  fig = px.imshow(data,labels=dict(x="Symptom", y="Condition", color=color),y=list(data.index),x=list(data.columns))
  fig.update_layout(
    autosize=False,
    width=1100,
    height=1100,
  )
  fig.update_xaxes(side="top")
  return(fig)

In [0]:
fg=plot_heatmap(data_pdf)
fg.show()

As we can see, this heatmap does not take the expected frequency of a given symptom into account: a symptom that is common across all notes produces large counts for many conditions. To better reflect the association between a symptom and a given condition, we need to normalize the counts. To do so, we use `MinMaxScaler`, which rescales each symptom column to the range [0, 1].

In [0]:
from sklearn.preprocessing import MinMaxScaler
normalized_data=MinMaxScaler().fit(data_pdf).transform(data_pdf)

In [0]:
norm_data_pdf=pd.DataFrame(normalized_data,index=data_pdf.index,columns=data_pdf.columns)
plot_heatmap(norm_data_pdf,'normalized occurrence')

As we can see, symptoms that did not appear to be enriched in the raw counts now show a strong association with their corresponding conditions.
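
To make the scaling step concrete: `MinMaxScaler` treats each column independently and maps a value `x` to `(x - min) / (max - min)` for that column. The sketch below applies the same formula in plain Python to a hypothetical count table; the conditions, symptoms, and counts are made up.

```python
# Hypothetical counts: rows are conditions, columns are symptoms.
symptoms = ["edema", "nausea"]
counts = {
    "lymphoma":      [12, 1],
    "breast cancer": [6, 3],
    "colon cancer":  [2, 2],
}

def min_max_scale_columns(table):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    columns = list(zip(*table.values()))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        scaled_columns.append([(x - lo) / span if span else 0.0 for x in col])
    # Transpose back so that each condition gets its row of scaled values.
    return {name: list(row) for name, row in zip(table, zip(*scaled_columns))}

scaled = min_max_scale_columns(counts)
for condition, row in scaled.items():
    print(condition, dict(zip(symptoms, row)))
```

After scaling, the condition with the highest count for a symptom gets 1 in that symptom's column, regardless of how frequent the symptom is overall. This is why rarer symptoms become visible in the normalized heatmap.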

# 2. Get Drug codes from the notes

## Analyze drug usage patterns for inventory management and reporting

Here we count how many times each drug is encountered in the documents, keeping only resolutions with confidence above 0.8.

In [0]:
rxnorm_res_df=spark.read.load(f'{delta_path}/gold/rxnorm-res-cleaned')

In [0]:
display(
  rxnorm_res_df
  .filter('confidence > 0.8')
  .groupBy('drugs')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

drugs,count
carboplatin,14
heparin,11
prednisone,10
cyclosporine,8
aspirin,6
iron,5
doxorubicin,5
cyclophosphamide,5
fluconazole,4
epinephrine,4


# 3. Get Timeline Using RE Models

## Find the problems that occurred after treatments

To see the problems that occurred after treatments, we filter the dataframe to select rows that satisfy the following conditions:
* `relation =='AFTER'`
* `entity1=='TREATMENT'`
* `entity2=='PROBLEM'`

In [0]:
temporal_re_df=spark.read.load(f"{delta_path}/silver/temporal-re")

In [0]:
display(temporal_re_df)

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,BEFORE,OCCURRENCE,Discharge,PROBLEM,Mesothelioma,0.99999833
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,BEFORE,OCCURRENCE,Discharge,DURATION,1,0.99978834
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,DURATION,1,0.9346253
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,pleural effusion,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,atrial fibrillation,0.99999607
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,anemia,0.9996013
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,atrial fibrillation,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,anemia,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,ascites,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,esophageal reflux,0.94568694


In [0]:
display(
  temporal_re_df
  .where("relation == 'AFTER' AND entity1=='TREATMENT' AND entity2 == 'PROBLEM'")
  .filter('confidence > 0.8')
  .orderBy(F.desc('confidence'))
)

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_21.txt,AFTER,TREATMENT,epinephrine,PROBLEM,a transverse incision,0.9999982
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_21.txt,AFTER,TREATMENT,Xylocaine,PROBLEM,a transverse incision,0.9999807
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_14.txt,AFTER,TREATMENT,this procedure,PROBLEM,complications,0.97644377
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_20.txt,AFTER,TREATMENT,intravenous heparin,PROBLEM,hereditary hypercoagulable state,0.9599375
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_37.txt,AFTER,TREATMENT,hemostatic,PROBLEM,a skin stab inferior,0.95746464


# 4. Analyze the Relations Between Body Parts and Procedures

In the extraction notebook, we created a relation extraction model to identify relationships between body parts and problem entities, using the pretrained **RelationExtractionModel** `re_bodypart_problem`. Now let's load the data and take a look at the relationships between body parts and procedures. Filtering the dataframe to rows where `entity1 != entity2` keeps only relations between entities of different types, which lets us see the procedures applied to internal organs.

In [0]:
bodypart_re_df=spark.read.load(f'{delta_path}/silver/bodypart-relationships')

In [0]:
display(
  bodypart_re_df
  .where('entity1!=entity2')
  .drop_duplicates()
  )

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,IVC,Procedure,placement of a vena caval filter,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,1,Procedure,lumpectomy,Internal_organ_or_component,axillary node,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_35.txt,0,Internal_organ_or_component,prostate,Procedure,ultrasound-guided I-125 seed implantation,0.51350594
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,inferior vena cava,Procedure,placement of a vena caval filter,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,inferior vena cava,Procedure,mechanical and pharmacologic thrombolysis,0.9999087
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,1,Internal_organ_or_component,inferior vena cava,Procedure,balloon angioplasty,0.98969346
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_8.txt,0,Procedure,bone marrow biopsy,Internal_organ_or_component,cellular marrow,0.7271675
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_45.txt,0,Procedure,lymph node injection,Internal_organ_or_component,sentinel lymph node,0.9483816
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,0,Internal_organ_or_component,nerve,Procedure,thyroidectomy,0.99999917
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,iliac vein,Procedure,placement of a vena caval filter,1.0


# 5. Get Procedure codes from notes

We created a dataset of procedure codes using the `jsl_ner_wip_greedy_clinical` NER model, setting the NerConverter whitelist to `['Procedure']` so that only procedure entities are kept. Let's take a look at this table:

In [0]:
cpt_df=spark.read.load(f'{delta_path}/silver/cpt')

In [0]:
display(cpt_df)

path,chunks,entity,cpt_code,confidence,all_codes,resolutions,cpt
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_32.txt,hysterectomy,Procedure,51925,0.312,"List(51925, 58180, 58285, 58925, 59525, 58280, 19328, 58150, 59100, 44970, 19304, 44950, 19330, 27290, 58940, 19371, 31380, 38120, 44147, 63191, 31420, 62180, 45114, 45111, 58740)","List(Closure of vesicouterine fistula; with hysterectomy, Supracervical abdominal hysterectomy (subtotal hysterectomy), with or without removal of tube(s), with or without removal of ovary(s), Vaginal hysterectomy, radical (Schauta type operation), Ovarian cystectomy, unilateral or bilateral , Subtotal or total hysterectomy after cesarean delivery, Vaginal hysterectomy, with total or partial vaginectomy; with repair of enterocele, Removal of intact mammary implant, Total abdominal hysterectomy (corpus and cervix), with or without removal of tube(s), with or without removal of ovary(s), Hysterotomy, abdominal (e.g., for hydatidiform mole, abortion), Laparoscopy, surgical, appendectomy , Mastectomy, subcutaneous, Appendectomy, Removal of mammary implant material, Interpelviabdominal amputation (hindquarter amputation) , Oophorectomy, partial or total, unilateral or bilateral;, Periprosthetic capsulectomy, breast, Partial laryngectomy (hemilaryngectomy); anterovertical , Laparoscopy, surgical, splenectomy, Colectomy, partial; abdominal and transanal approach, Laminectomy with section of spinal accessory nerve, Epiglottidectomy , Ventriculocisternostomy (Torkildsen type operation), Proctectomy, partial, with anastomosis; abdominal and transsacral approach, Proctectomy; partial resection of rectum, transabdominal approach, Lysis of adhesions (salpingolysis, ovariolysis) )",Closure of vesicouterine fistula; with hysterectomy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_32.txt,cataract surgery,Procedure,31592,0.064,"List(31592, 50065, 61304, 61524, 61570, 50205, 61601, 61860, 61600, 61305, 61880, 62163, 43870, 50280, 61692, 61682, 61705, 61340, 61546, 47400, 44345, 49324, 61708, 44188, 33050)","List(Cricotracheal resection, Nephrolithotomy; secondary surgical operation for calculus , Craniectomy or craniotomy, exploratory; supratentorial, Craniectomy, infratentorial or posterior fossa; for excision or fenestration of cyst, Craniectomy or craniotomy; with excision of foreign body from brain, Renal biopsy; by surgical exposure of kidney , Resection or excision of neoplastic, vascular or infectious lesion of base of anterior cranial fossa; intradural, including dural repair, with or without graft, Craniectomy or craniotomy for implantation of neurostimulator electrodes, cerebral, cortical, Resection or excision of neoplastic, vascular or infectious lesion of base of anterior cranial fossa; extradural, Craniectomy or craniotomy, exploratory; infratentorial (posterior fossa), Revision or removal of intracranial neurostimulator electrodes, Neuroendoscopy, intracranial; with retrieval of foreign body, Closure of gastrostomy, surgical, Excision or unroofing of cyst(s) of kidney , Surgery of intracranial arteriovenous malformation; dural, complex, Surgery of intracranial arteriovenous malformation; supratentorial, complex, Surgery of aneurysm, vascular malformation or carotid-cavernous fistula; by intracranial and cervical occlusion of carotid artery, Subtemporal cranial decompression (pseudotumor cerebri, slit ventricle syndrome), Craniotomy for hypophysectomy or excision of pituitary tumor, intracranial approach, Hepaticotomy or hepaticostomy with exploration, drainage, or removal of calculus , Revision of colostomy; complicated (reconstruction in-depth) (separate procedure), Laparoscopy, surgical; with insertion of tunneled intraperitoneal catheter, Surgery of aneurysm, vascular 
malformation or carotid-cavernous fistula; by intracranial electrothrombosis, Laparoscopy, surgical, colostomy or skin level cecostomy, Resection of pericardial cyst or tumor)",Cricotracheal resection
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,biopsy,Procedure,32609,0.1621,"List(32609, 47100, 32100, 44950, 19316, 60280, 39200, 50290, 50120, 49321, 43800, 32607, 44025, 50205, 60500, 64746, 61140, 31420, 31592, 44010, 25927, 33050, 32608, 50045, 31770)","List(Thoracoscopy; with biopsy(ies) of pleura, Biopsy of liver, wedge , Thoracotomy; with exploration, Appendectomy, Mastopexy, Excision of thyroglossal duct cyst or sinus, Resection of mediastinal cyst, Excision of perinephric cyst , Pyelotomy; with exploration , Laparoscopy, surgical; with biopsy (single or multiple) , Pyloroplasty, Thoracoscopy; with diagnostic biopsy(ies) of lung infiltrate(s) (eg, wedge, incisional), unilateral, Colotomy, for exploration, biopsy(s), or foreign body removal, Renal biopsy; by surgical exposure of kidney , Parathyroidectomy or exploration of parathyroid(s), Transection or avulsion of; phrenic nerve, Burr hole(s) or trephine; with biopsy of brain or intracranial lesion, Epiglottidectomy , Cricotracheal resection, Duodenotomy, for exploration, biopsy(s), or foreign body removal, Transmetacarpal amputation; , Resection of pericardial cyst or tumor, Thoracoscopy; with diagnostic biopsy(ies) of lung nodule(s) or mass(es) (eg, wedge, incisional), unilateral, Nephrotomy, with exploration , Bronchoplasty; graft repair)",Thoracoscopy; with biopsy(ies) of pleura
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,fine needle aspiration biopsy,Procedure,32609,0.1769,"List(32609, 32608, 32607, 60280, 49321, 32096, 32097, 43610, 50290, 47100, 61750, 19101, 58900, 39200, 32100, 19112, 33050, 49215, 61140, 61751, 49010, 55705, 60200, 43611, 44025)","List(Thoracoscopy; with biopsy(ies) of pleura, Thoracoscopy; with diagnostic biopsy(ies) of lung nodule(s) or mass(es) (eg, wedge, incisional), unilateral, Thoracoscopy; with diagnostic biopsy(ies) of lung infiltrate(s) (eg, wedge, incisional), unilateral, Excision of thyroglossal duct cyst or sinus, Laparoscopy, surgical; with biopsy (single or multiple) , Thoracotomy, with diagnostic biopsy(ies) of lung infiltrate(s) (eg, wedge, incisional), unilateral, Thoracotomy, with diagnostic biopsy(ies) of lung nodule(s) or mass(es) (eg, wedge, incisional), unilateral, Excision, local; ulcer or benign tumor of stomach, Excision of perinephric cyst , Biopsy of liver, wedge , Stereotactic biopsy, aspiration, or excision, including burr hole(s), for intracranial lesion, Biopsy of breast; open, incisional, Biopsy of ovary, unilateral or bilateral (separate procedure) , Resection of mediastinal cyst, Thoracotomy; with exploration, Excision of lactiferous duct fistula, Resection of pericardial cyst or tumor, Excision of presacral or sacrococcygeal tumor, Burr hole(s) or trephine; with biopsy of brain or intracranial lesion, Stereotactic biopsy, aspiration, or excision, including burr hole(s), for intracranial lesion; with computed tomography and/or magnetic resonance guidance, Exploration, retroperitoneal area with or without biopsy(s) (separate procedure), Biopsy, prostate; incisional, any approach, Excision of cyst or adenoma of thyroid, or transection of isthmus, Excision, local; malignant tumor of stomach, Colotomy, for exploration, biopsy(s), or foreign body removal)",Thoracoscopy; with biopsy(ies) of pleura
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,fine needle aspiration,Procedure,60280,0.0686,"List(60280, 50290, 32100, 44850, 32609, 64746, 44950, 39560, 19316, 49010, 27889, 32607, 44820, 38101, 62194, 32608, 47100, 19112, 39200, 32310, 23920, 32225, 62192, 61680, 61460)","List(Excision of thyroglossal duct cyst or sinus, Excision of perinephric cyst , Thoracotomy; with exploration, Suture of mesentery (separate procedure), Thoracoscopy; with biopsy(ies) of pleura, Transection or avulsion of; phrenic nerve, Appendectomy, Resection, diaphragm; with simple repair (eg, primary suture), Mastopexy, Exploration, retroperitoneal area with or without biopsy(s) (separate procedure), Ankle disarticulation , Thoracoscopy; with diagnostic biopsy(ies) of lung infiltrate(s) (eg, wedge, incisional), unilateral, Excision of lesion of mesentery (separate procedure), Splenectomy; partial (separate procedure), Replacement or irrigation, subarachnoid/subdural catheter, Thoracoscopy; with diagnostic biopsy(ies) of lung nodule(s) or mass(es) (eg, wedge, incisional), unilateral, Biopsy of liver, wedge , Excision of lactiferous duct fistula, Resection of mediastinal cyst, Pleurectomy, parietal (separate procedure), Disarticulation of shoulder, Decortication, pulmonary (separate procedure); partial, Creation of shunt; subarachnoid/subdural-peritoneal, -pleural, other terminus, Surgery of intracranial arteriovenous malformation; supratentorial, simple, for section of 1 or more cranial nerves)",Excision of thyroglossal duct cyst or sinus
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,thyroid surgery,Procedure,60270,0.1494,"List(60270, 60271, 60500, 60240, 60200, 60502, 60281, 60280, 60505, 60252, 60260, 61548, 32100, 31592, 60254, 49553, 39200, 32662, 61880, 62180, 39220, 32661, 49215, 25927, 61304)","List(Thyroidectomy, including substernal thyroid; sternal split or transthoracic approach, Thyroidectomy, including substernal thyroid; cervical approach, Parathyroidectomy or exploration of parathyroid(s), Thyroidectomy, total or complete, Excision of cyst or adenoma of thyroid, or transection of isthmus, Parathyroidectomy or exploration of parathyroid(s); re-exploration, Excision of thyroglossal duct cyst or sinus; recurrent, Excision of thyroglossal duct cyst or sinus, Parathyroidectomy or exploration of parathyroid(s); with mediastinal exploration, sternal split or transthoracic approach, Thyroidectomy, total or subtotal for malignancy; with limited neck dissection, Thyroidectomy, removal of all remaining thyroid tissue following previous removal of a portion of thyroid, Hypophysectomy or excision of pituitary tumor, transnasal or transseptal approach, nonstereotactic, Thoracotomy; with exploration, Cricotracheal resection, Thyroidectomy, total or subtotal for malignancy; with radical neck dissection, incarcerated or strangulated, Resection of mediastinal cyst, Thoracoscopy, surgical; with excision of mediastinal cyst, tumor, or mass, Revision or removal of intracranial neurostimulator electrodes, Ventriculocisternostomy (Torkildsen type operation), Resection of mediastinal tumor, Thoracoscopy, surgical; with excision of pericardial cyst, tumor, or mass, Excision of presacral or sacrococcygeal tumor, Transmetacarpal amputation; , Craniectomy or craniotomy, exploratory; supratentorial)","Thyroidectomy, including substernal thyroid; sternal split or transthoracic approach"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,biopsy,Procedure,32609,0.1621,"List(32609, 47100, 32100, 44950, 19316, 60280, 39200, 50290, 50120, 49321, 43800, 32607, 44025, 50205, 60500, 64746, 61140, 31420, 31592, 44010, 25927, 33050, 32608, 50045, 31770)","List(Thoracoscopy; with biopsy(ies) of pleura, Biopsy of liver, wedge , Thoracotomy; with exploration, Appendectomy, Mastopexy, Excision of thyroglossal duct cyst or sinus, Resection of mediastinal cyst, Excision of perinephric cyst , Pyelotomy; with exploration , Laparoscopy, surgical; with biopsy (single or multiple) , Pyloroplasty, Thoracoscopy; with diagnostic biopsy(ies) of lung infiltrate(s) (eg, wedge, incisional), unilateral, Colotomy, for exploration, biopsy(s), or foreign body removal, Renal biopsy; by surgical exposure of kidney , Parathyroidectomy or exploration of parathyroid(s), Transection or avulsion of; phrenic nerve, Burr hole(s) or trephine; with biopsy of brain or intracranial lesion, Epiglottidectomy , Cricotracheal resection, Duodenotomy, for exploration, biopsy(s), or foreign body removal, Transmetacarpal amputation; , Resection of pericardial cyst or tumor, Thoracoscopy; with diagnostic biopsy(ies) of lung nodule(s) or mass(es) (eg, wedge, incisional), unilateral, Nephrotomy, with exploration , Bronchoplasty; graft repair)",Thoracoscopy; with biopsy(ies) of pleura
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,total thyroidectomy,Procedure,60240,0.601,"List(60240, 60260, 60270, 60271, 31365, 60254, 60225, 60220, 60252, 48155, 38100, 43621, 43620, 60212, 31368, 60210, 60500, 60505, 33238, 41145, 61548, 31395, 60200, 43622, 60502)","List(Thyroidectomy, total or complete, Thyroidectomy, removal of all remaining thyroid tissue following previous removal of a portion of thyroid, Thyroidectomy, including substernal thyroid; sternal split or transthoracic approach, Thyroidectomy, including substernal thyroid; cervical approach, Laryngectomy; total, with radical neck dissection , Thyroidectomy, total or subtotal for malignancy; with radical neck dissection, Total thyroid lobectomy, unilateral; with contralateral subtotal lobectomy, including isthmusectomy, Total thyroid lobectomy, unilateral; with or without isthmusectomy, Thyroidectomy, total or subtotal for malignancy; with limited neck dissection, Pancreatectomy, total , Splenectomy; total (separate procedure), Gastrectomy, total; with Roux-en-Y reconstruction, Gastrectomy, total; with esophagoenterostomy, Partial thyroid lobectomy, unilateral; with contralateral subtotal lobectomy, including isthmusectomy, Laryngectomy; subtotal supraglottic, with radical neck dissection , Partial thyroid lobectomy, unilateral; with or without isthmusectomy, Parathyroidectomy or exploration of parathyroid(s), Parathyroidectomy or exploration of parathyroid(s); with mediastinal exploration, sternal split or transthoracic approach, Removal of permanent transvenous electrode(s) by thoracotomy, Glossectomy; complete or total, with or without tracheostomy, with unilateral radical neck dissection , Hypophysectomy or excision of pituitary tumor, transnasal or transseptal approach, nonstereotactic, Pharyngolaryngectomy, with radical neck dissection; with reconstruction , Excision of cyst or adenoma of thyroid, or transection of isthmus, Gastrectomy, total; with formation of intestinal pouch, any type, Parathyroidectomy or exploration of parathyroid(s); re-exploration)","Thyroidectomy, total or complete"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,sentinel node dissection,Procedure,38724,0.2405,"List(38724, 44950, 49215, 63191, 44850, 38720, 19350, 69155, 61458, 44820, 61460, 38101, 41135, 19304, 38542, 39220, 44147, 63064, 44140, 44320, 47600, 31368, 50290, 33235, 58720)","List(Cervical lymphadenectomy (modified radical neck dissection) , Appendectomy, Excision of presacral or sacrococcygeal tumor, Laminectomy with section of spinal accessory nerve, Suture of mesentery (separate procedure), Cervical lymphadenectomy (complete) , Nipple/areola reconstruction, Radical excision external auditory canal lesion; with neck dissection , Craniectomy, suboccipital; for exploration or decompression of cranial nerves, Excision of lesion of mesentery (separate procedure), for section of 1 or more cranial nerves, Splenectomy; partial (separate procedure), Glossectomy; partial, with unilateral radical neck dissection , Mastectomy, subcutaneous, Dissection, deep jugular node(s) , Resection of mediastinal tumor, Colectomy, partial; abdominal and transanal approach, Costovertebral approach with decompression of spinal cord or nerve root(s), (eg, herniated intervertebral disc), thoracic; single segment, Colectomy, partial; with anastomosis, Colostomy or skin level cecostomy;, Cholecystectomy;, Laryngectomy; subtotal supraglottic, with radical neck dissection , Excision of perinephric cyst , Removal of transvenous pacemaker electrode(s); dual lead system , Salpingo-oophorectomy, complete or partial, unilateral or bilateral (separate procedure) )",Cervical lymphadenectomy (modified radical neck dissection)
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,thyroidectomy,Procedure,60240,0.2535,"List(60240, 60271, 60270, 60500, 60252, 60200, 60260, 60502, 60254, 60281, 60280, 60505, 60210, 31592, 31380, 31420, 31300, 31382, 60212, 32100, 60220, 33238, 25927, 61548, 32320)","List(Thyroidectomy, total or complete, Thyroidectomy, including substernal thyroid; cervical approach, Thyroidectomy, including substernal thyroid; sternal split or transthoracic approach, Parathyroidectomy or exploration of parathyroid(s), Thyroidectomy, total or subtotal for malignancy; with limited neck dissection, Excision of cyst or adenoma of thyroid, or transection of isthmus, Thyroidectomy, removal of all remaining thyroid tissue following previous removal of a portion of thyroid, Parathyroidectomy or exploration of parathyroid(s); re-exploration, Thyroidectomy, total or subtotal for malignancy; with radical neck dissection, Excision of thyroglossal duct cyst or sinus; recurrent, Excision of thyroglossal duct cyst or sinus, Parathyroidectomy or exploration of parathyroid(s); with mediastinal exploration, sternal split or transthoracic approach, Partial thyroid lobectomy, unilateral; with or without isthmusectomy, Cricotracheal resection, Partial laryngectomy (hemilaryngectomy); anterovertical , Epiglottidectomy , Laryngotomy (thyrotomy, laryngofissure); with removal of tumor or laryngocele, cordectomy , Partial laryngectomy (hemilaryngectomy); antero-latero-vertical , Partial thyroid lobectomy, unilateral; with contralateral subtotal lobectomy, including isthmusectomy, Thoracotomy; with exploration, Total thyroid lobectomy, unilateral; with or without isthmusectomy, Removal of permanent transvenous electrode(s) by thoracotomy, Transmetacarpal amputation; , Hypophysectomy or excision of pituitary tumor, transnasal or transseptal approach, nonstereotactic, Decortication and parietal pleurectomy)","Thyroidectomy, total or complete"


We can now find the most common procedures by counting how often each resolved procedure occurs and plotting the result.

In [0]:
#top 20
display(
  cpt_df
  .groupBy('cpt')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

cpt,count
Appendectomy,33
Thoracoscopy; with biopsy(ies) of pleura,30
Excision of presacral or sacrococcygeal tumor,19
Thoracotomy; with exploration,18
Resection of mediastinal tumor,10
Epiglottidectomy,9
Revision of peritoneal-venous shunt,8
"Closure of enterostomy, large or small intestine",8
Excision of perinephric cyst,6
Decortication and parietal pleurectomy,6
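
The counts above include every resolved chunk, however weak the match. In the resolver output earlier in this section, for example, `cataract surgery` resolves to `Cricotracheal resection` with a score of 0.064. The following is a minimal plain-Python sketch of filtering by that score before counting. It assumes the numeric column is the resolver's confidence; the cutoff `MIN_CONFIDENCE` and the helper `count_confident` are hypothetical and not part of the original notebook.

```python
from collections import Counter

# Unique rows copied from the resolver output shown earlier in this section:
# (chunk, score, resolved CPT description). The score is assumed to be a confidence.
rows = [
    ("cataract surgery", 0.064, "Cricotracheal resection"),
    ("biopsy", 0.1621, "Thoracoscopy; with biopsy(ies) of pleura"),
    ("fine needle aspiration biopsy", 0.1769, "Thoracoscopy; with biopsy(ies) of pleura"),
    ("fine needle aspiration", 0.0686, "Excision of thyroglossal duct cyst or sinus"),
    ("thyroid surgery", 0.1494,
     "Thyroidectomy, including substernal thyroid; sternal split or transthoracic approach"),
    ("total thyroidectomy", 0.601, "Thyroidectomy, total or complete"),
    ("sentinel node dissection", 0.2405,
     "Cervical lymphadenectomy (modified radical neck dissection)"),
    ("thyroidectomy", 0.2535, "Thyroidectomy, total or complete"),
]

MIN_CONFIDENCE = 0.2  # assumed cutoff, chosen only for illustration


def count_confident(rows, min_confidence):
    """Count resolved procedures, keeping only rows at or above the cutoff."""
    return Counter(cpt for _, score, cpt in rows if score >= min_confidence)


confident_counts = count_confident(rows, MIN_CONFIDENCE)
for cpt, n in confident_counts.most_common():
    print(f"{n:>2}  {cpt}")
```

With this cutoff only the two thyroidectomy chunks and the sentinel node dissection survive. A suitable threshold for real data would need to be validated against manually reviewed notes.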


# 6. Get Assertion Status of Cancer Entities

Using the assertion status dataset, we can count how many cancer or symptom mentions refer to someone other than the patient (for example, a family member), and we can further check whether each symptom is present or absent.

In [0]:
assertion_df=spark.read.load(f'{delta_path}/silver/assertion').drop_duplicates()

In [0]:
display(assertion_df)

path,chunk,entity,assertion
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_34.txt,edema,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,"focal motor, sensory or other neurological symptoms",Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,Newly diagnosed high-risk acute lymphoblastic leukemia,Oncological,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_40.txt,clubbing,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,narrowing,Symptom,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,decreased appetite,Symptom,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_7.txt,follicular non,Cancer,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,weight loss,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,chest pains,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_43.txt,shortness of breath,Symptom,absent


In [0]:
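# Count extracted mentions whose assertion status points to someone other than the patient
n_associated_with_someone_else = assertion_df.where("assertion == 'associated_with_someone_else'").count()
print(f"Number of cancer or symptom mentions associated with someone else (e.g. a family member): {n_associated_with_someone_else}")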
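
Note that `.count()` here counts extracted mentions, not people or notes: one note can contribute several mentions. The sketch below shows the difference in plain Python. The rows are made up for illustration and only mimic the `(path, chunk, entity, assertion)` shape of `assertion_df`.

```python
# Hypothetical rows in the shape of assertion_df: (path, chunk, entity, assertion).
rows = [
    ("note_a.txt", "breast cancer", "Cancer", "associated_with_someone_else"),
    ("note_a.txt", "ovarian cancer", "Cancer", "associated_with_someone_else"),
    ("note_b.txt", "colon cancer", "Cancer", "associated_with_someone_else"),
    ("note_b.txt", "weight loss", "Symptom", "absent"),
]

others = [row for row in rows if row[3] == "associated_with_someone_else"]

n_mentions = len(others)                      # what .count() on the filtered rows returns
n_notes = len({path for path, *_ in others})  # notes with at least one such mention

print(f"mentions: {n_mentions}, distinct notes: {n_notes}")
```

In Spark, the distinct-note figure could be obtained by adding `.select('path').distinct()` before `.count()`.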

In [0]:
display(
  assertion_df
  .groupBy('assertion')
  .count()
)

assertion,count
present,571
hypothetical,13
conditional,12
possible,46
absent,500
associated_with_someone_else,35
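
Raw counts are easier to compare as shares of the total. This small plain-Python sketch uses the counts copied from the output above; the variable names are illustrative.

```python
# Counts copied from the groupBy('assertion') output above.
assertion_counts = {
    "present": 571,
    "hypothetical": 13,
    "conditional": 12,
    "possible": 46,
    "absent": 500,
    "associated_with_someone_else": 35,
}

total = sum(assertion_counts.values())
shares = {label: n / total for label, n in assertion_counts.items()}

# Print the statuses from most to least frequent, with count and percentage.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<30}{assertion_counts[label]:>5}{share:>8.1%}")
```

Together, `present` and `absent` account for roughly 91% of the 1,177 rows, which is why the next cells focus on those two statuses.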


In [0]:
assertion_symptom_df= (
  assertion_df
  .where("assertion in ('present', 'absent') AND entity=='Symptom'")
)
most_common_symptoms_df=(
  assertion_symptom_df
  .select('path','chunk')
  .groupBy('chunk')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
  )
display(most_common_symptoms_df)

chunk,count
edema,19
mass,15
murmurs,14
acute distress,12
hepatosplenomegaly,12
night sweats,12
pain,11
nausea,10
chills,10
lymphadenopathy,9


In [0]:
display(
  assertion_symptom_df
  .join(most_common_symptoms_df, on='chunk')
  .groupBy('chunk','assertion')
  .count()
  .orderBy(F.desc('count'))
  )

chunk,assertion,count
edema,absent,16
murmurs,absent,14
hepatosplenomegaly,absent,12
acute distress,absent,12
night sweats,absent,11
mass,present,10
chills,absent,10
pain,present,9
nausea,absent,9
vomiting,absent,9
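
The long format above puts `present` and `absent` counts for the same symptom on separate rows. A wide layout with one row per symptom can be easier to read. The sketch below pivots the displayed rows in plain Python; in Spark, a similar result could be obtained with `groupBy('chunk').pivot('assertion').count()`.

```python
from collections import defaultdict

# (chunk, assertion, count) rows copied from the output above.
long_rows = [
    ("edema", "absent", 16),
    ("murmurs", "absent", 14),
    ("hepatosplenomegaly", "absent", 12),
    ("acute distress", "absent", 12),
    ("night sweats", "absent", 11),
    ("mass", "present", 10),
    ("chills", "absent", 10),
    ("pain", "present", 9),
    ("nausea", "absent", 9),
    ("vomiting", "absent", 9),
]

# Pivot to one entry per symptom with a slot per assertion status.
# A 0 only means the pair is not among the rows displayed above.
wide = defaultdict(lambda: {"present": 0, "absent": 0})
for chunk, assertion, count in long_rows:
    wide[chunk][assertion] = count

print(f"{'symptom':<20}{'present':>8}{'absent':>8}")
for chunk, counts in wide.items():
    print(f"{chunk:<20}{counts['present']:>8}{counts['absent']:>8}")
```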


## License
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors.  The source in this notebook is provided subject to the [Apache 2.0 License](https://spdx.org/licenses/Apache-2.0.html).  All included or referenced third party libraries are subject to the licenses set forth below.

|Library Name|Library License|Library License URL|Library Source URL|
| :-: | :-:| :-: | :-:|
|Pandas |BSD 3-Clause License| https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas|
|Numpy |BSD 3-Clause License| https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy|
|Apache Spark |Apache License 2.0| https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark|
|Plotly|MIT License|https://github.com/plotly/plotly.py/blob/master/LICENSE.txt|https://github.com/plotly/plotly.py|
|Scikit Learn|BSD 3-Clause License|https://github.com/scikit-learn/scikit-learn/blob/main/COPYING|https://github.com/scikit-learn/scikit-learn|

## Disclaimers
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account.  Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.