You may find this series of notebooks at https://github.com/databricks-industry-solutions/oncology. For more information about this solution accelerator, visit https://www.databricks.com/solutions/accelerators/nlp-oncology.

# Abstracting Real World Data from Oncology Notes: Data Analysis

In the previous notebook (`./00-entity-extraction`) we used Spark NLP pipelines to extract highly specialized oncological entities from unstructured notes and stored the resulting tabular data in our Delta Lake.

In this notebook we analyze these data to answer questions such as:
What are the most common cancer subtypes? What are the most common symptoms, and how are they associated with each cancer subtype? Which indications have the highest risk factor?
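To make these questions concrete, here is a minimal sketch in plain Python of the two kinds of aggregation they imply: counting how many notes mention each entity, and counting how often a diagnosis and a symptom appear in the same note. The rows, the label names (`Cancer_Dx`, `Symptom`) and the helper functions `top_chunks` and `cooccurrence` are illustrative assumptions, not the schema produced by the previous notebook; on the real tables the same logic would typically be a Spark `groupBy` and `count`.

```python
from collections import Counter, defaultdict
from itertools import product

# Toy rows of (note_id, entity_label, chunk). These values are made up for
# illustration; the real table and label names may differ.
rows = [
    ("note_0", "Cancer_Dx", "mesothelioma"),
    ("note_0", "Symptom", "cough"),
    ("note_0", "Symptom", "chest pain"),
    ("note_1", "Cancer_Dx", "basal cell carcinoma"),
    ("note_15", "Cancer_Dx", "ovarian cancer"),
    ("note_15", "Symptom", "chest pain"),
    ("note_15", "Symptom", "cough"),
    ("note_16", "Cancer_Dx", "breast adenocarcinoma"),
    ("note_16", "Symptom", "fatigue"),
]


def top_chunks(rows, label, n=3):
    """Return the n chunks with the given label that appear in the most notes."""
    # Deduplicate per note so a chunk repeated within one note counts once.
    seen = {(note, chunk.lower()) for note, lab, chunk in rows if lab == label}
    return Counter(chunk for _, chunk in seen).most_common(n)


def cooccurrence(rows, left_label, right_label):
    """Count notes in which a left_label chunk and a right_label chunk co-occur."""
    by_note = defaultdict(lambda: defaultdict(set))
    for note, lab, chunk in rows:
        by_note[note][lab].add(chunk.lower())
    pairs = Counter()
    for labels in by_note.values():
        # Every (left, right) combination found in the same note is one hit.
        pairs.update(product(labels[left_label], labels[right_label]))
    return pairs


symptom_counts = dict(top_chunks(rows, "Symptom", n=10))
dx_symptom = cooccurrence(rows, "Cancer_Dx", "Symptom")
print(symptom_counts)
print(dx_symptom.most_common(5))
```

Counting per note rather than per mention keeps a long note that repeats the same term from dominating the ranking; whether that is the right unit depends on the question being asked.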

In [0]:
!pip install mlflow


In [0]:
import mlflow
import numpy as np
import pandas as pd
from pyspark.sql import functions as F



In [0]:
mlflow.set_tracking_uri('databricks')

**Important Note! After running the cell above, please detach & re-attach to the cluster and continue.**

In [0]:
import mlflow
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

In [0]:
%run ./16-onco_config

In [0]:
ade_demo_util = SolAccUtil('onc-lh', data_path='/FileStore/HLS/nlp/data/')
ade_demo_util.print_info()
# Root path of the Delta tables read in this notebook.
delta_path = ade_demo_util.settings['delta_path']
# delta_path = '/FileStore/HLS/nlp/delta/jsl/'

Let's take a look at the raw text dataset.

In [0]:
df=spark.read.load(f'{delta_path}/bronze/mt-oc-notes')
display(df)

path,text
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Discharge Summary - Mesothelioma - 1 Description: Mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis. (Medical Transcription Sample Report) PRINCIPAL DIAGNOSIS: Mesothelioma. SECONDARY DIAGNOSES: Pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis. PROCEDURES 1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy. 2. On August 20, 2007, thoracentesis. 3. On August 31, 2007, Port-A-Cath placement. HISTORY AND PHYSICAL: The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-sided chest pain, and went to an urgent care center. Chest x-ray revealed right-sided pleural effusion. PAST MEDICAL HISTORY 1. Pericardectomy. 2. Pericarditis. 2. Atrial fibrillation. 4. RNCA with intracranial thrombolytic treatment. 5 PTA of MCA. 6. Mesenteric venous thrombosis. 7. Pericardial window. 8. Cholecystectomy. 9. Left thoracentesis. FAMILY HISTORY: No family history of coronary artery disease, CVA, diabetes, CHF or MI. The patient has one family member, a sister, with history of cancer. SOCIAL HISTORY: She is married. Employed with the US Post Office. She is a mother of three. Denies tobacco, alcohol or illicit drug use. MEDICATIONS 1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3. 2. Amiodarone 100 mg p.o. daily. REVIEW OF SYSTEMS: Complete review of systems negative except as in pulmonary as noted above. The patient also reports occasional numbness and tingling of her left arm. 
PHYSICAL EXAMINATION VITAL SIGNS: Blood pressure 123/95, heart rate 83, respirations 20, temperature 97, and oxygen saturation 97%. GENERAL: Positive nonproductive cough and pain with coughing. HEENT: Pupils are equal and reactive to light and accommodation. Tympanic membranes are clear. NECK: Supple. No lymphadenopathy. No masses. RESPIRATORY: Pleural friction rub is noted. GI: Soft, nondistended, and nontender. Positive bowel sounds. No organomegaly. EXTREMITIES: No edema, no clubbing, no cyanosis, no tenderness. Full range of motion. Normal pulses in all extremities. SKIN: No breakdown or lesions. No ulcers. NEUROLOGIC: Grossly intact. No focal deficits. Awake, alert, and oriented to person, place, and time. LABORATORY DATA: Labs are pending. HOSPITAL COURSE: The patient was admitted for a right-sided pleural effusion for thoracentesis on Monday by Dr. X. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT with Lovenox 40 mg subcutaneously. Her history dated back to March 2005 when she first sought medical attention for evidence of pericarditis, which was treated with pericardial window in an outside hospital, at that time she was also found to have mesenteric pain and thrombosis, is now anticoagulated. Her pericardial fluid was accumulated and she was seen by Dr. Y. At that time, she was recommended for pericardectomy, which was performed by Dr. Z. Review of her CT scan from March 2006 prior to her pericardectomy, already shows bilateral plural effusions. The patient improved clinically after the pericardectomy with resolution of her symptoms. Recently, she was readmitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. 
At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed epithelioid malignant mesothelioma. The patient was then stained with a PET CT, which showed extensive uptake in the chest, bilateral pleural pericardial effusions, and lymphadenopathy. She also had acidic fluid, pectoral and intramammary lymph nodes and uptake in L4 with SUV of 4. This was consistent with stage III disease. Her repeat echocardiogram showed an ejection fraction of 45% to 49%. She was transferred to Oncology service and started on chemotherapy on September 1, 2007 with cisplatin 75 mg/centimeter squared equaling 109 mg IV piggyback over 2 hours on September 1, 2007, Alimta 500 mg/ centimeter squared equaling 730 mg IV piggyback over 10 minutes. This was all initiated after a Port-A-Cath was placed. The chemotherapy was well tolerated and the patient was discharged the following day after discontinuing IV fluid and IV. Her Port-A-Cath was packed with heparin according to protocol. DISCHARGE MEDICATIONS: Zofran, Phenergan, Coumadin, and Lovenox, and Vicodin DISCHARGE INSTRUCTIONS: She was instructed to followup with Dr. XYZ in the office to check her INR on Tuesday. She was instructed to call if she had any other questions or concerns in the interim. Keywords: hematology - oncology, mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, deep venous thrombosis, port-a-cath placement, port a cath, iv piggyback, venous thrombosis, atrial, thrombosis, pericardial, lymphadenopathy, fluid, pericardectomy, chest, pleural,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_1.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: BCCa Excision - Lower Lid Description: Excision of large basal cell carcinoma, right lower lid, and repaired with used dorsal conjunctival flap in the upper lid and a large preauricular skin graft. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Extremely large basal cell carcinoma, right lower lid. POSTOPERATIVE DIAGNOSIS: Extremely large basal cell carcinoma, right lower lid. TITLE OF OPERATION: Excision of large basal cell carcinoma, right lower lid, and repaired with used dorsal conjunctival flap in the upper lid and a large preauricular skin graft. PROCEDURE: The patient was brought into the operating room and prepped and draped in usual fashion. Xylocaine 2% with epinephrine was injected beneath the conjunctiva and skin of the lower lid and also beneath the conjunctiva and skin of the upper lid. A frontal nerve block was also given on the right upper lid. The anesthetic agent was also injected in the right preauricular region which would provide a donor graft for the right lower lid defect. The area was marked with a marking pen with margins of 3 to 4 mm, and a #15 Bard-Parker blade was used to make an incision at the nasal and temporal margins of the lesion. The incision was carried inferiorly, and using a Steven scissors the normal skin, muscle, and conjunctiva was excised inferiorly. The specimen was then marked and sent to pathology for frozen section. Bleeding was controlled with a wet-field cautery, and the right upper lid was everted, and an incision was made 3 mm above the lid margin with the Bard-Parker blade in the entire length of the upper lid. The incision reached the orbicularis, and Steven scissors were used to separate the tarsus from the underlying orbicularis. Vertical cuts were made nasally and temporally, and a large dorsal conjunctival flap was fashioned with the conjunctiva attached superiorly. 
It was placed into the defect in the lower lid and sutured with multiple interrupted 6-0 Vicryl sutures nasally, temporally, and inferiorly. The defect in the skin was measured and an appropriate large preauricular graft was excised from the right preauricular region. The defect was closed with interrupted 5-0 Prolene sutures, and the preauricular graft was sutured in place with multiple interrupted 6-0 silk sutures. The upper border of the graft was attached to the upper lid after incision was made in the gray line with a Superblade, and the superior portion of the skin graft was sutured to the upper lid through the anterior lamella created by the razor blade incision. Cryotherapy was then used to treat the nasal and temporal margins of the area of excision because of positive margins, and following this an antibiotic steroid ointment was instilled and a light pressure dressing was applied. The patient tolerated the procedure well and was sent to recovery room in good condition. Keywords: hematology - oncology, basal cell carcinoma, cryotherapy, steven scissors, conjunctiva, conjunctival flap, frontal nerve block, frozen section, lower lid, orbicularis, skin graft, nasal and temporal margins, dorsal conjunctival flap, upper lid, basal, carcinoma, preauricular, incision, conjunctival,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_10.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Anemia - Consult Description: Refractory anemia that is transfusion dependent. At this time, he has been admitted for anemia with hemoglobin of 7.1 and requiring transfusion. (Medical Transcription Sample Report) DIAGNOSIS: Refractory anemia that is transfusion dependent. CHIEF COMPLAINT: I needed a blood transfusion. HISTORY: The patient is a 78-year-old gentleman with no substantial past medical history except for diabetes. He denies any comorbid complications of the diabetes including kidney disease, heart disease, stroke, vision loss, or neuropathy. At this time, he has been admitted for anemia with hemoglobin of 7.1 and requiring transfusion. He reports that he has no signs or symptom of bleeding and had a blood transfusion approximately two months ago and actually several weeks before that blood transfusion, he had a transfusion for anemia. He has been placed on B12, oral iron, and Procrit. At this time, we are asked to evaluate him for further causes and treatment for his anemia. He denies any constitutional complaints except for fatigue, malaise, and some dyspnea. He has no adenopathy that he reports. No fevers, night sweats, bone pain, rash, arthralgias, or myalgias. PAST MEDICAL HISTORY: Diabetes. PAST SURGICAL HISTORY: Hernia repair. ALLERGIES: He has no allergies. MEDICATIONS: Listed in the chart and include Coumadin, Lasix, metformin, folic acid, diltiazem, B12, Prevacid, and Feosol. SOCIAL HISTORY: He is a tobacco user. He does not drink. He lives alone, but has family and social support to look on him. FAMILY HISTORY: Negative for blood or cancer disorders according to the patient. PHYSICAL EXAMINATION: GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately. 
VITAL SIGNS: Blood pressure of 110/60, pulse of 99, respiratory rate of 14, and temperature of 97.4. He is 69 inches tall and weighs 174 pounds. HEENT: Sclerae show mild arcus senilis in the right. Left is clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear. NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas. CHEST: Clear. HEART: Regular rate and rhythm. ABDOMEN: Soft and nontender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration. EXTREMITIES: No clubbing, but there is some edema, but no cyanosis. NEUROLOGIC: Noncontributory. DERMATOLOGIC: Noncontributory. CARDIOVASCULAR: Noncontributory. IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and a recent esophagogastroduodenoscopy, which was negative. His creatinine was 1. My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia. RECOMMENDATIONS: At this time, my recommendation for the patient is that he undergoes further serologic evaluation with reticulocyte count, serum protein, and electrophoresis, LDH, B12, folate, erythropoietin level, and he should undergo a bone marrow aspiration and biopsy. I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him. He is willing to proceed with the studies I have described to him. 
We will order an ultrasound of his abdomen because of the possible fullness of the spleen, and I will probably see him in follow up after this hospitalization. As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient. Keywords: hematology - oncology, electrophoresis, ldh, b12, folate, erythropoietin level, reticulocyte count, serum protein, blood transfusion, bone marrow, refractory anemia, anemia, myalgias, marrow, bone,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_11.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Intensity-Modulated Radiation Therapy Description: Intensity-modulated radiation therapy is a complex set of procedures which requires appropriate positioning and immobilization typically with customized immobilization devices. (Medical Transcription Sample Report) INTENSITY-MODULATED RADIATION THERAPY Intensity-modulated radiation therapy is a complex set of procedures which requires appropriate positioning and immobilization typically with customized immobilization devices. The treatment planning process requires at least 4 hours of physician time. The technology is appropriate in this patient's case due to the fact that the target volume is adjacent to significant radiosensitive structures. Sequential CT scans are obtained and transferred to the treatment planning software. Extensive analysis occurs. The target volumes, including margins for uncertainty, patient movement and occult tumor extension are selected. In addition, organs at risk are outlined. Doses are selected both for targets, as well as for organs at risk. Associated dose constraints are placed. Inverse treatment planning is then performed in conjunction with the physics staff. These are reviewed by the physician and ultimately performed only following approval by the physician. Multiple beam arrangements may be tested for appropriateness and optimal dose delivery in order to maximize the chance of controlling disease, while minimizing exposure to organs at risk. This is performed in hopes of minimizing associated complications. The physician delineates the treatment type, number of fractions and total volume. During the time of treatment, there is extensive physician intervention, monitoring the patient set up and tolerance. In addition, specific QA is performed by the physics staff under the physician's direction. 
In view of the above, the special procedure code 77470 is deemed appropriate. Keywords: hematology - oncology, multiple beam arrangements, intensity modulated radiation therapy,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_12.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Neck Dissection Description: Left neck dissection. Metastatic papillary cancer, left neck. The patient had thyroid cancer, papillary cell type, removed with a total thyroidectomy and then subsequently recurrent disease was removed with a paratracheal dissection. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Metastatic papillary cancer, left neck. POSTOPERATIVE DIAGNOSIS: Metastatic papillary cancer, left neck. OPERATION PERFORMED: Left neck dissection. ANESTHESIA: General endotracheal. INDICATIONS: The patient is a very nice gentleman, who has had thyroid cancer, papillary cell type, removed with a total thyroidectomy and then subsequently recurrent disease was removed with a paratracheal dissection. He now has evidence of lesion in the left mid neck and the left superior neck on ultrasound, which are suspicious for recurrent cancer. Left neck dissection is indicated. DESCRIPTION OF OPERATION: The patient was placed on the operating room table in the supine position. After adequate general endotracheal anesthesia was administered, the table was then turned. A shoulder roll placed under the shoulders and the face was placed in an extended fashion. The left neck, chest, and face were prepped with Betadine and draped in a sterile fashion. A hockey stick skin incision was performed, extending a previous incision line superiorly towards the mastoid cortex through skin, subcutaneous tissue and platysma with Bovie electrocautery on cut mode. Subplatysmal superior and inferior flaps were raised. The dissection was left lateral neck dissection encompassing zones 1, 2A, 2B, 3, and the superior portion of 4. The sternocleidomastoid muscle was unwrapped at its fascial attachment and this was taken back posterior to the XI cranial nerve into the superior posterior most triangle of the neck. 
This was carried forward off of the deep rooted muscles including the splenius capitis and anterior and middle scalenes taken medially off of these muscles including the fascia of the muscles, stripped from the carotid artery, the X cranial nerve, the internal jugular vein and then carried anteriorly to the lateral most extent of the dissection previously done by Dr. X in the paratracheal region. The submandibular gland was removed as well. The X, XI, and XII cranial nerves were preserved. The internal jugular vein and carotid artery were preserved as well. Copious irrigation of the wound bed showed no identifiable bleeding at the termination of the procedure. There were two obviously positive nodes in this neck dissection. One was left medial neck just lateral to the previous tracheal dissection and one was in the mid region of zone 2. A #10 flat fluted Blake drain was placed through a separate stab incision and it was secured to the skin with a 2-0 silk ligature. The wound was closed in layers using a 3-0 Vicryl in a buried knot interrupted fashion for the subcutaneous tissue and the skin was closed with staples. A fluff and Kling pressure dressing was then applied. The patient was extubated in the operating room, brought to the recovery room in satisfactory condition. There were no intraoperative complications. Keywords: hematology - oncology, metastatic papillary cancer, thyroidectomy, thyroid cancer, papillary cell type, dissection, neck, metastatic, paratracheal, papillary, cancer,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_13.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: HDR Brachytherapy Description: HDR Brachytherapy (Medical Transcription Sample Report) HDR BRACHYTHERAPY The intracavitary brachytherapy applicator was placed appropriately and secured after the patient was identified. Simulation films were obtained, documenting its positioning. The 3-dimensional treatment planning process was accomplished utilizing the CT derived data. A treatment plan was selected utilizing sequential dwell positions within a single catheter. The patient was taken to the treatment area. The patient was appropriately positioned and the position of the intracavitary device was checked. Catheter length measurements were taken. Appropriate measurements of the probe dimensions and assembly were also performed. The applicator was attached to the HDR after-loader device. The device ran through its checking sequences appropriately and the brachytherapy was then delivered without difficulty or complication. The brachytherapy source was appropriately removed back to the brachytherapy safe within the device. Radiation screening was performed with the Geiger-Muller counter both prior to and after the brachytherapy procedure was completed and the results were deemed appropriate. Following completion of the procedure, the intracavitary device was removed without difficulty. The patient was in no apparent distress and was discharged home. Keywords: hematology - oncology, geiger-muller, treatment planning, hdr brachytherapy, intracavitary, applicator, brachytherapy,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_14.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Biopsy - Cervical Lymph Node Description: Excisional biopsy of right cervical lymph node. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. PROCEDURE: Excisional biopsy of right cervical lymph node. ANESTHESIA: General endotracheal anesthesia. SPECIMEN: Right cervical lymph node. EBL: 10 cc. COMPLICATIONS: None. FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination. FLUIDS: Please see anesthesia report. URINE OUTPUT: None recorded during the case. INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node. PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node. A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound. 
The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed. Keywords: hematology - oncology, lymphadenopathy, excisional biopsy, fna, mastisol, penrose drain, cervical, cervical lymph node, endotracheal anesthesia, lymph node, sternocleidomastoid, cervical lymph, lymph, anesthesia,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Metastatic Ovarian Cancer - Consult Description: A very pleasant 66-year-old woman with recurrent metastatic ovarian cancer. (Medical Transcription Sample Report) REASON FOR CONSULTATION: Metastatic ovarian cancer. HISTORY OF PRESENT ILLNESS: Mrs. ABCD is a very nice 66-year-old woman who is followed in clinic by Dr. X for history of renal cell cancer, breast cancer, as well as ovarian cancer, which was initially diagnosed 10 years ago, but over the last several months has recurred and is now metastatic. She last saw Dr. X in clinic towards the beginning of this month. She has been receiving gemcitabine and carboplatin, and she receives three cycles of this with the last one being given on 12/15/08. She was last seen in clinic on 12/22/08 by Dr. Y. At that point, her white count was 0.9 with the hemoglobin of 10.3, hematocrit of 30%, and platelets of 81,000. Her ANC was 0.5. She was started on prophylactic Augmentin as well as Neupogen shots. She has also had history of recurrent pleural effusions with the knee for thoracentesis. She had two of these performed in November and the last one was done about a week ago. Over the last 2 or 3 days, she states she has been getting more short of breath. Her history is somewhat limited today as she is very tired and falls asleep readily. Her history comes from herself but also from the review of the records. Overall, her shortness of breath has been going on for the past few weeks related to her pleural effusions. She was seen in the emergency room this time and on chest x-ray was found to have a new right-sided pulmonic consolidative infiltrate, which was felt to be possibly related to pneumonia. She specifically denied any fevers or chills. However, she was complaining of chest pain. 
She states that the chest pain was located in the substernal area, described as aching, coming and going and associated with shortness of breath and cough. When she did cough, it was nonproductive. While in the emergency room on examination, her vital signs were stable except that she required 5 liters nasal cannula to maintain oxygen saturations. An EKG was performed, which showed sinus rhythm without any evidence of Q waves or other ischemic changes. The chest x-ray described above showed a right lower lobe infiltrate. A V/Q scan was done, which showed a small mismatched defect in the left upper lobe and a mass defect in the right upper lobe. The findings were compatible with an indeterminate study for a pulmonary embolism. Apparently, an ultrasound of the lower extremities was done and was negative for DVT. There was apparently still some concern that this might be pulmonary embolism and she was started on Lovenox. There was also concern for pneumonia and she was started on Zosyn as well as vancomycin and admitted to the hospital. At this point, we have been consulted to help follow along with this patient who is well known to our clinic. PAST MEDICAL HISTORY 1. Ovarian cancer - This was initially diagnosed about 10 years ago and treated with surgical resection including TAH and BSO. This has recurred over the last couple of months with metastatic disease. 2. History of breast cancer - She has been treated with bilateral mastectomy with the first one about 14 years and the second one about 5 years ago. She has had no recurrent disease. 3. Renal cell carcinoma - She is status post nephrectomy. 4. Hypertension. 5. Anxiety disorder. 6. Chronic pain from neuropathy secondary to chemotherapy from breast cancer treatment. 7. Ongoing tobacco use. PAST SURGICAL HISTORY 1. Recent and multiple thoracentesis as described above. 2. Bilateral mastectomies. 3. Multiple abdominal surgeries. 4. Cholecystectomy. 5. Remote right ankle fracture. ALLERGIES: No known drug allergies. 
MEDICATIONS: At home, 1. Atenolol 50 mg daily 2. Ativan p.r.n. 3. Clonidine 0.1 mg nightly. 4. Compazine p.r.n. 5. Dilaudid p.r.n. 6. Gabapentin 300 mg p.o. t.i.d. 7. K-Dur 20 mEq p.o. daily. 8. Lasix unknown dose daily. 9. Norvasc 5 mg daily. 10. Zofran p.r.n. SOCIAL HISTORY: She smokes about 6-7 cigarettes per day and has done so for more than 50 years. She quit smoking about 6 weeks ago. She occasionally has alcohol. She is married and has 3 children. She lives at home with her husband. She used to work as a unit clerk at XYZ Medical Center. FAMILY HISTORY: Both her mother and father had a history of lung cancer and both were smokers. REVIEW OF SYSTEMS: GENERAL/CONSTITUTIONAL: She has not had any fever, chills, night sweats, but has had fatigue and weight loss of unspecified amount. HEENT: She has not had trouble with headaches; mouth, jaw, or teeth pain; change in vision; double vision; or loss of hearing or ringing in her ears. CHEST: Per the HPI, she has had some increasing dyspnea, shortness of breath with exertion, cough, but no sputum production or hemoptysis. CVS: She has had the episodes of chest pains as described above but has not had, PND, orthopnea lower extremity swelling or palpitations. GI: No heartburn, odynophagia, dysphagia, nausea, vomiting, diarrhea, constipation, blood in her stool, and black tarry stools. GU: No dysuria, burning with urination, kidney stones, and difficulty voiding. MUSCULOSKELETAL: No new back pain, hip pain, rib pain, swollen joints, history of gout, or muscle weakness. NEUROLOGIC: She has been diffusely weak but no lateralizing loss of strength or feeling. She has some chronic neuropathic pain and numbness as described above in the past medical history. She is fatigued and tired today and falls asleep while talking but is easily arousable. Some of this is related to her lack of sleep over the admission thus far. PHYSICAL EXAMINATION VITAL SIGNS: Her T-max is 99.3. 
Her pulse is 54, her respirations is 12, and blood pressure 118/61. GENERAL: Somewhat fatigued appearing but in no acute distress. HEENT: NC/AT. Sclerae anicteric. Conjunctiva clear. Oropharynx is clear without any erythema, exudate, or discharge. NECK: Supple. Nontender. No elevated JVP. No thyromegaly. No thyroid nodules. CHEST: Clear to auscultation and percussion bilaterally with decreased breath sounds on the right. CVS: Regular rate and rhythm. No murmurs, gallops or rubs. Normal S1 and S2. No S3 or S4. ABDOMEN: Soft, nontender, nondistended. Normoactive bowel sounds. No guarding or rebound. No hepatosplenomegaly. No masses. MUSCULOSKELETAL: Generalized muscle weakness but no joint swelling or other abnormalities. SKIN: No rashes, bruising, or petechia. No non-healing wounds or ulcerations. NEUROLOGIC: She is oriented x3 but she falls asleep readily. On exam and conversation, her cranial nerves are intact. She has no sensory loss. Her strength is symmetric. LABORATORY DATA: Her white blood cell count is 8.0, hemoglobin 11.1, hematocrit 33.2%, and platelets 29,000. Her differential shows 2% metamyelocytes, 57% neutrophils, 29% bands, 6% lymphocytes, 5% monocytes, and 1% eosinophils. Her sodium is 138, potassium 4.0, chloride 101, CO2 of 23, BUN 21, creatinine 1.4, glucose 107, and calcium 8.7. Her INR is 1.0, PT of 12, and PTT 24. Urinalysis negative for nitrite and leukocyte esterase with moderate epithelial cells, bacteria, white blood cells, and yeast suggesting of contamination. Her troponins have been negative x3. IMAGINING DATA: CT scan of her chest on 12/25/08 shows bilateral pleural effusions, larger on the right than the left but these are somewhat decreased in size compared to the prior CT scan at the end of November. There is some consolidative atelectasis at the bilateral basis. There is some peripheral interstitial opacifications noted in the right lung and to a lesser extent in the left lung possibly consistent with pneumonitis. 
There are small peripheral nodular densities in both lungs unchanged compared to prior scan. There is an enlarged right adrenal gland again noted without change. ASSESSMENT: ABCD is a very pleasant 66-year-old woman with recurrent metastatic ovarian cancer known to our clinic. At this point, she has been admitted for shortness of breath with possible presumed pneumonia. The possibility of a PE also remains and the plan has been to do a CTA once her kidney function improves. Currently, she is being treated with broad-spectrum antibiotics and Lovenox prophylactically. At this point, it does not appear that her pleural effusions have increased and this would not be the etiology behind her worsening symptoms. Her blood counts appear to be recovering from chemotherapy except for the fact that her platelets have gone lower. It is unclear as to the etiology behind this but may still be related to chemotherapy effect. This also could be related to consumptive process such as DIC in the face of infections or medication effect. We will keep track of her blood counts over this admission. We will continue to follow along through the course of her admission. She has requested being full code. I went back and looked at Dr. X's chart after our clinic chart and at the last visit with Dr. X and Dr. Y, she confirmed that she wanted to be DNR/DNI. I am not sure why this is changed and I will address this issue with her once she is more alert. Thank you very much for this consult. Keywords: hematology - oncology, renal cell cancer, breast cancer, metastatic ovarian cancer, shortness of breath, pleural effusions, cancer, ovarian, recurrent,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_16.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Breast Radiation Therapy Followup Description: Breast radiation therapy followup note. Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. (Medical Transcription Sample Report) DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. CURRENT MEDICATIONS 1. Glucosamine complex. 2. Toprol XL. 3. Alprazolam 4. Hydrochlorothiazide. 5. Dyazide. 6. Centrum. Dr. X has given her some carboplatin and Taxol more recently and feels that she would benefit from electron beam radiotherapy to the left chest wall as well as the neck. She previously received a total of 46.8 Gy in 26 fractions of external beam radiotherapy to the left supraclavicular area. As such, I feel that we could safely re-treat the lower neck. Her weight has increased to 189.5 from 185.2. She does complain of some coughing and fatigue. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time. I look forward to keeping you informed of her progress. Thank you for having allowed me to participate in her care. 
Keywords: hematology - oncology, carboplatin, taxol, radiation therapy, breast adenocarcinoma, beam radiotherapy, chest wall, radiotherapy, supraclavicular, lymphadenopathy, adenocarcinoma, breast,"
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_17.txt,"Sample Type / Medical Specialty: Hematology - Oncology Sample Name: Lymph Node Excisional Biopsy Description: Left axillary lymph node excisional biopsy. Left axillary adenopathy. (Medical Transcription Sample Report) PREOPERATIVE DIAGNOSIS: Left axillary adenopathy. POSTOPERATIVE DIAGNOSIS: Left axillary adenopathy. PROCEDURE: Left axillary lymph node excisional biopsy. ANESTHESIA: LMA. INDICATIONS: Patient is a very pleasant woman who in 2006 had breast conservation therapy with radiation only. Note, she refused her CMF adjuvant therapy and this was for a triple-negative infiltrating ductal carcinoma of the breast. Patient has been following with Dr. Diener and Dr. Wilmot. I believe that genetic counseling had been recommended to her and obviously the CMF was recommended, but she declined both. She presented to the office with left axillary adenopathy in view of the high-risk nature of her lesion. I recommended that she have this lymph node removed. The procedure, purpose, risk, expected benefits, potential complications, alternative forms of therapy were discussed with her and she was agreeable to surgery. TECHNIQUE: Patient was identified, then taken into the operating room where after induction of appropriate anesthesia, her left chest, neck, axilla, and arm were prepped with Betadine solution, draped in a sterile fashion. An incision was made at the hairline, carried down by sharp dissection through the clavipectoral fascia. I was able to easily palpate the lymph node and grasp it with a figure-of-eight 2-0 silk suture and by sharp dissection, was carried to hemoclip all attached structures. The lymph node was excised in its entirety. The wound was irrigated. The lymph node sent to pathology. The wound was then closed. Hemostasis was assured and the patient was taken to recovery room in stable condition. 
Keywords: hematology - oncology, axillary lymph node excisional biopsy, sharp dissection, excisional biopsy, lymph node, axillary, excisional, biopsy,"


# 1. ICD-10 Codes and HCC Status

Now we load the `icd10` Delta tables.

In [0]:
icd10_hcc_df=spark.read.load(f'{delta_path}/silver/icd10-hcc-df')
icd10_hcc_df.createOrReplaceTempView('icd10HccView')

best_icd_mapped_df=spark.read.load(f'{delta_path}/gold/best-icd-mapped')
best_icd_mapped_df.createOrReplaceTempView('bestIcdMappedView')
best_icd_mapped_pdf=best_icd_mapped_df.toPandas()

In [0]:
%sql
select * from icd10HccView
limit 10

path,final_chunk,entity,icd10_code,confidence,all_codes,resolutions,hcc_status,hcc_score,billable,icd_codes_names,icd_code_billable
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Mesothelioma,Oncological,C45,0.9986,"List(C45, C450, C459, C452, C457, C451, G731, C439, D165, C717, C649, C710, D352, C9000, C900)","List(mesothelioma [Mesothelioma], mesothelioma of pleura [Mesothelioma of pleura], mesothelioma, unspecified [Mesothelioma, unspecified], mesothelioma of pericardium [Mesothelioma of pericardium], mesothelioma of other sites [Mesothelioma of other sites], mesothelioma of mesentery [Mesothelioma of peritoneum], cancer, mesothelioma [Lambert-Eaton syndrome in neoplastic disease], amelanotic melanoma [Malignant melanoma of skin, unspecified], ameloblastoma of mandible [Benign neoplasm of lower jaw bone], glioma of brainstem [Malignant neoplasm of brain stem], nephroblastoma [Malignant neoplasm of unspecified kidney, except renal pelvis], glioblastoma multiforme, cerebrum [Malignant neoplasm of cerebrum, except lobes and ventricles], pituitary microadenoma [Benign neoplasm of pituitary gland], smoldering myeloma [Multiple myeloma not having achieved remission], smoldering myeloma [Multiple myeloma])","List(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)","List(0, 9, 9, 9, 9, 9, 75, 12, 0, 10, 11, 10, 12, 9, 0)","List(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)",mesothelioma,0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cough,Symptom,R05,1.0,"List(R05, G4483, F458)","List(cough [Cough], cough headache syndrome [Primary cough headache], cough, psychogenic [Other somatoform disorders])","List(0, 0, 1)","List(0, 0, nan)","List(1, 1, 1)",cough,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,chest pain,Symptom,R074,0.4975,"List(R074, R079, R073, R0789, M542, R070, R1031, R1030, R071, R078, M549, R52, R0781)","List(chest pain [Pain in throat and chest], chest pain [Chest pain, unspecified], chest wall pain [Pain in throat and chest], chest wall pain [Other chest pain], neck pain [Cervicalgia], throat pain [Pain in throat], groin pain [Right lower quadrant pain], groin pain [Lower abdominal pain, unspecified], chest pain on breathing [Chest pain on breathing], other chest pain [Other chest pain], spine pain [Dorsalgia, unspecified], pain [Pain, unspecified], chest pain, pleuritic [Pleurodynia])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)",chest pain,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cough,Symptom,R05,1.0,"List(R05, G4483, F458)","List(cough [Cough], cough headache syndrome [Primary cough headache], cough, psychogenic [Other somatoform disorders])","List(0, 0, 1)","List(0, 0, nan)","List(1, 1, 1)",cough,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,chest pain,Symptom,R074,0.4975,"List(R074, R079, R073, R0789, M542, R070, R1031, R1030, R071, R078, M549, R52, R0781)","List(chest pain [Pain in throat and chest], chest pain [Chest pain, unspecified], chest wall pain [Pain in throat and chest], chest wall pain [Other chest pain], neck pain [Cervicalgia], throat pain [Pain in throat], groin pain [Right lower quadrant pain], groin pain [Lower abdominal pain, unspecified], chest pain on breathing [Chest pain on breathing], other chest pain [Other chest pain], spine pain [Dorsalgia, unspecified], pain [Pain, unspecified], chest pain, pleuritic [Pleurodynia])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)",chest pain,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,cancer,Oncological,C801,0.9985,"List(C801, C44509, C4490, C449, C380, C61, C3490, C760, C4452, C169, C768, C189, C4459, C44519, C50919, C4440, C444, C56, C569, C6960, C809, C162)","List(cancer [Malignant (primary) neoplasm, unspecified], cancer of chest [Unspecified malignant neoplasm of skin of other part of trunk], cancer of the skin [Unspecified malignant neoplasm of skin, unspecified], cancer of the skin [Other and unspecified malignant neoplasm of skin, unspecified], cancer of the heart [Malignant neoplasm of heart], cancer of prostate [Malignant neoplasm of prostate], cancer of the lung [Malignant neoplasm of unspecified part of unspecified bronchus or lung], cancer of the neck [Malignant neoplasm of head, face and neck], cancer, skin of breast [Squamous cell carcinoma of skin of trunk], cancer of the stomach [Malignant neoplasm of stomach, unspecified], cancer of the back [Malignant neoplasm of other specified ill-defined sites], cancer of the colon [Malignant neoplasm of colon, unspecified], cancer of the back, basal cell [Other specified malignant neoplasm of skin of trunk], cancer of the back, basal cell [Basal cell carcinoma of skin of other part of trunk], breast cancer [Malignant neoplasm of unspecified site of unspecified female breast], cancer of the skin, neck [Unspecified malignant neoplasm of skin of scalp and neck], cancer of the skin, neck [Other and unspecified malignant neoplasm of skin of scalp and neck], ovarian cancer [Malignant neoplasm of ovary], ovarian cancer [Malignant neoplasm of unspecified ovary], cancer of the orbit [Malignant neoplasm of unspecified orbit], dmmr cancer [Malignant neoplasm without specification of site], cancer of the stomach, body [Malignant neoplasm of body of stomach])","List(1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1)","List(12, 0, 0, 0, 11, 12, 9, 12, 0, 9, 12, 11, 0, 0, 0, 0, 0, 0, 10, 12, 0, 9)","List(1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 
1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1)",cancer,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,numbness,Symptom,R202,0.8508,"List(R202, N9489, R6889, R4020, G252, R42, R401, R4589, R410, R451, M25642)","List(numbness of skin [Paresthesia of skin], numbness of vulva (finding) [Other specified conditions associated with female genital organs and menstrual cycle], mucosal numbness (finding) [Other general symptoms and signs], unconsciousness [Unspecified coma], tremor, rest [Other specified forms of tremor], subjective vertigo [Dizziness and giddiness], mental status, stupor [Stupor], feeling physically tense (finding) [Other symptoms and signs involving emotional state], clouded consciousness [Disorientation, unspecified], restlessness [Restlessness and agitation], stiffness of left hand [Stiffness of left hand, not elsewhere classified])","List(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 80, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)",numbness of skin,1
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,tingling of her left arm,Symptom,R202,0.2334,"List(R202, M7989, R223, R298, R2230, R2242, R2232, R2231, T23052A, R198)","List(tingling sensation [Paresthesia of skin], swelling of left arm [Other specified soft tissue disorders], swelling of upper arm [Localized swelling, mass and lump, upper limb], downward drift of outstretched supinated arm (finding) [Other symptoms and signs involving the nervous and musculoskeletal systems], localized swelling on forearm [Localized swelling, mass and lump, unspecified upper limb], swelling of left lower limb [Localized swelling, mass and lump, left lower limb], localized swelling on left arm [Localized swelling, mass and lump, left upper limb], swelling of right upper limb [Localized swelling, mass and lump, right upper limb], burn of left palm [Burn of unspecified degree of left palm, initial encounter], sensation as if bowel still full (finding) [Other specified symptoms and signs involving the digestive system and abdomen])","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)","List(1, 1, 0, 0, 1, 1, 1, 1, 1, 1)",tingling sensation,1


In [0]:
%sql
select entity, count('*') from icd10HccView
group by 1
order by 2

entity,count(*)
Treatment,90
Oncological,439
Symptom,968


## 1.1. Get general information for staff management, reporting, & planning.

Let's take a look at the distribution of mapped codes.

In [0]:
display(
  best_icd_mapped_df
  .select('onc_code_desc')
  .filter("onc_code_desc!='-'")
  .groupBy('onc_code_desc')
  .count()
  .orderBy('count')
)

onc_code_desc,count
Other disorders of the nervous system,3
Aplastic and other anemias and other bone marrow failure syndromes,3
In situ neoplasms,4
Metabolic disorders,5
Persons encountering health services for examinations,6
Malignant neoplasms of digestive organs,27
Malignant neoplasms of female genital organs,41
Melanoma and other malignant neoplasms of skin,51
"Malignant neoplasms of ill-defined, other secondary and unspecified sites",53
"Malignant neoplasms of lymphoid, hematopoietic and related tissue",83


We can also visualize the results as a count plot to see the number of entries in each parent category.

In [0]:
import plotly.graph_objects as go

_ps=best_icd_mapped_pdf['onc_code_desc'].value_counts()
data=_ps[_ps.index!='-']

fig = go.Figure(go.Bar(
            x=data.values,
            y=data.index,
            orientation='h'))
fig.show()

## 1.2. Reimbursement-ready data with billable codes
In the previous notebook, using an ICD-10 oncology mapping dictionary, we created a dataset of coded conditions that are all billable. To assess the quality of the mapping, we can look at the distribution of the positions of the nearest billable codes.

In [0]:
import plotly.express as px
import pandas as pd
_ps=best_icd_mapped_pdf['nearest_billable_code_pos'].value_counts()
# drop the '-' placeholder, which lives in the index (the values are counts)
data=_ps[_ps.index!='-']
data_pdf=pd.DataFrame({"count":data.values,"Index of Nearest Billable Codes":data.index})

fig = px.bar(data_pdf, x='Index of Nearest Billable Codes', y='count')
fig.show()

## 1.3. See which indications have the highest average risk factor
In our pipeline we used `sbiobertresolve_icd10cm_augmented_billable_hcc` as the sentence resolver, which also returns HCC codes. We can therefore look at the distribution of risk factors for each category of conditions.
Note that since each category has a different number of corresponding data points, to get a full picture of the distribution of risk factors for each condition, we use box plots.

In [0]:
import plotly.express as px

df = best_icd_mapped_pdf[best_icd_mapped_pdf.onc_code_desc!='-'].dropna()
fig = px.box(df, y="onc_code_desc", x="corresponding_hcc_score", hover_data=df.columns)
fig.show()

As we can see, some categories have a higher risk factor even though they contain fewer cases.
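
To read the same comparison as numbers rather than from the box plot, a per-category summary can help. The following is a minimal sketch on made-up data: the column names `onc_code_desc` and `corresponding_hcc_score` follow `best_icd_mapped_pdf`, while `toy_pdf` and all of its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data; column names follow best_icd_mapped_pdf, values are made up.
toy_pdf = pd.DataFrame({
    "onc_code_desc": [
        "In situ neoplasms", "In situ neoplasms",
        "Metabolic disorders", "Metabolic disorders", "Metabolic disorders",
        "-",
    ],
    "corresponding_hcc_score": [12.0, 10.0, 0.0, 9.0, 0.0, 0.0],
})

# Drop the '-' placeholder, then summarize the score per category.
summary = (
    toy_pdf[toy_pdf["onc_code_desc"] != "-"]
    .groupby("onc_code_desc")["corresponding_hcc_score"]
    .agg(["count", "median", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Showing `count` next to `median` makes it easy to spot categories whose high score rests on only a few cases.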

## 1.4. Analyze Oncological Entities
We can find the most frequent oncological entities.

In [0]:
onc_df = (
  icd10_hcc_df
  .filter("entity == 'Oncological'")
  .select("path","final_chunk","entity","icd10_code","icd_codes_names","icd_code_billable")
 )
onc_pdf=onc_df.toPandas()
onc_pdf.head(10)

Unnamed: 0,path,final_chunk,entity,icd10_code,icd_codes_names,icd_code_billable
0,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,metastatic lesions,Oncological,N40,localized bph (benign prostatic hyperplasia),0
1,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,thyroid carcinoma,Oncological,C73,medullary thyroid carcinoma,1
2,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,vocal cord,Oncological,J383,vocal cord cyst,1
3,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,papillary carcinoma,Oncological,D059,papillary carcinoma in situ of breast,0
4,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,cancer,Oncological,C801,cancer,1
5,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,Breast Cancer,Oncological,C50919,breast cancer,1
6,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,breast cancer,Oncological,C50919,breast cancer,1
7,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,tumor markers,Oncological,R97,abnormal tumor markers,0
8,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,Metastatic breast cancer,Oncological,C7981,cancer metastatic to right breast,1
9,dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_onco...,breast cancer,Oncological,C50919,breast cancer,1


In [0]:
import plotly.express as px

_ps=onc_pdf['icd_codes_names'].value_counts()
data=_ps[_ps.index!='-']
data_pdf=pd.DataFrame({"count":data.values,'icd code names':data.index})
data_pdf=data_pdf[data_pdf['count']>5]
fig = px.bar(data_pdf, y='icd code names', x='count',orientation='h')
fig.show()

### Report Counts by ICD10CM Code Names
Each bar shows the number of reports that contain the given cancer entity.

In [0]:
display(
  onc_df.select('icd_codes_names','path')
  .dropDuplicates()
  .groupBy('icd_codes_names')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

icd_codes_names,count
carcinoma,10
cancer,10
carotid body tumor,8
breast cancer,6
breast cyst,6
nephrotic syndrome w focal and segmental glomerular lesions,5
basal cell carcinoma of back,4
lymphoma,4
inclusion cyst,4
cancer of the lung,4


### Most common symptoms
We can find the most common symptoms by counting each symptom once per document.

In [0]:
display(
  icd10_hcc_df
  .filter("lower(entity)='symptom'")
  .selectExpr('path','icd_codes_names as symptom')
  .dropDuplicates()
  .groupBy('symptom')
  .count()
  .orderBy(F.desc('count'))
  .limit(30)
)

symptom,count
edema,18
chest mass,17
distress,16
vesicular murmur,15
symptom occurs at night (finding),14
pain,13
hepatosplenomegaly,13
lymphadenopathy,12
amblyopic,11
nausea,11


### Extract most frequent oncological diseases and symptoms based on documents

Here, we count the number of documents for each symptom-disease pair. To do this, we first keep only entities above a confidence threshold and then create a pivot table.
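
Before the Spark cells, the idea of counting documents per symptom-disease pair can be sketched in plain pandas on toy data. The column names mirror `icd10_hcc_df`; the rows and the file names `a.txt` and `b.txt` are invented. The sketch pairs conditions and symptoms with a merge on `path`, which is a swapped-in equivalent of the pivot-and-explode approach used in this notebook, not the same code.

```python
import pandas as pd

# Toy rows: one line per extracted entity; duplicates within a document are dropped.
toy = pd.DataFrame({
    "path": ["a.txt", "a.txt", "a.txt", "b.txt", "b.txt", "b.txt"],
    "entity": ["Oncological", "Symptom", "Symptom", "Oncological", "Symptom", "Symptom"],
    "icd_codes_names": ["breast cancer", "pain", "pain", "breast cancer", "pain", "nausea"],
}).drop_duplicates()

conditions = (
    toy[toy["entity"] == "Oncological"][["path", "icd_codes_names"]]
    .rename(columns={"icd_codes_names": "condition"})
)
symptoms = (
    toy[toy["entity"] == "Symptom"][["path", "icd_codes_names"]]
    .rename(columns={"icd_codes_names": "symptom"})
)

# Every condition in a document is paired with every symptom in the same document.
pairs = conditions.merge(symptoms, on="path")

# Rows are conditions, columns are symptoms, cells are document counts.
cooc = pd.crosstab(pairs["condition"], pairs["symptom"])
print(cooc)
```

Because duplicates are removed first, a symptom mentioned several times in one note still contributes only one count for that note.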

In [0]:
entity_symptom_df = (
  icd10_hcc_df
  .select('path','entity','icd_codes_names')
  .filter("lower(entity) in ('symptom','oncological') AND confidence > 0.30")
  .dropDuplicates()
)
display(entity_symptom_df)

path,entity,icd_codes_names
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_46.txt,Symptom,distress
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,difficulty using urine bottle (finding)
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,syncope
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,Symptom,lump on neck
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,Symptom,nipple discharge
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_37.txt,Oncological,hodgkin lymphoma
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_46.txt,Oncological,mantle cell lymphoma
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,Oncological,breast cancer
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,Symptom,congestion of nose
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_7.txt,Oncological,follicular non-hodgkin's lymphoma


In [0]:
condition_symptom_df = (
  entity_symptom_df.groupBy('path').pivot("entity").agg(F.collect_list("icd_codes_names"))
  .withColumnRenamed('Oncological','Condition')
  .withColumn('Conditions',F.explode('Condition'))
  .withColumn('Symptoms',F.explode('Symptom'))
  .drop('Condition','Symptom')
  .dropna()
  .fillna(0)
)
display(condition_symptom_df)

path,Conditions,Symptoms
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,erythema
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,dysphagia
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,numbness of skin
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,swollen legs
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,blood in stool
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,dyspnea
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,difficulty in voiding
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,constipation
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,hepatosplenomegaly
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_15.txt,cancer,feeling tired


In [0]:
conditions_symptoms_count_df=condition_symptom_df.groupBy('Conditions').pivot("Symptoms").count().fillna(0)
conditions_symptoms_count_pdf=conditions_symptoms_count_df.toPandas()
conditions_symptoms_count_pdf.index=conditions_symptoms_count_pdf['Conditions']
conditions_symptoms_count_pdf=conditions_symptoms_count_pdf.drop('Conditions',axis=1)

In [0]:
selected_rows=conditions_symptoms_count_pdf.index[conditions_symptoms_count_pdf.sum(axis=1)>10]
selected_columns=conditions_symptoms_count_pdf.columns[conditions_symptoms_count_pdf.sum(axis=0)>10]

In [0]:
data_pdf=conditions_symptoms_count_pdf.loc[selected_rows,selected_columns]

Now let's visualize the co-occurrence of conditions and symptoms as a heatmap. We can look directly at the counts of symptoms by condition.

In [0]:
import plotly.express as px
def plot_heatmap(data,color='occurrence'):
  # rows of `data` (index) are conditions and columns are symptoms, so x=Symptom and y=Condition
  fig = px.imshow(data,labels=dict(x="Symptom", y="Condition", color=color),y=list(data.index),x=list(data.columns))
  fig.update_layout(
    autosize=False,
    width=1100,
    height=1100,
  )
  fig.update_xaxes(side="top")
  return(fig)

In [0]:
fg=plot_heatmap(data_pdf)
fg.show()

As we see, this heatmap does not take into account how frequently a given symptom occurs overall. To reflect the association between a symptom and a given condition, we need to normalize the counts.
To do so, we use `MinMaxScaler`, which rescales each symptom's counts to the range [0, 1].

In [0]:
from sklearn.preprocessing import MinMaxScaler
normalized_data=MinMaxScaler().fit(data_pdf).transform(data_pdf)

In [0]:
norm_data_pdf=pd.DataFrame(normalized_data,index=data_pdf.index,columns=data_pdf.columns)
plot_heatmap(norm_data_pdf,'normalized occurrence')

As we can see, symptoms that did not appear to be enriched before now show a strong association with their corresponding conditions.
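
To make the effect of the scaling concrete, here is a small sketch of the arithmetic. With its default settings, `MinMaxScaler` rescales each column independently as `(x - min) / (max - min)`; the code below reproduces that in plain pandas. The `counts` table and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy co-occurrence counts: rows are conditions, columns are symptoms.
counts = pd.DataFrame(
    {"pain": [10, 5, 0], "nausea": [2, 1, 4]},
    index=["breast cancer", "lymphoma", "carcinoma"],
)

# Column-wise min-max scaling: each symptom ends up in the range [0, 1].
col_min = counts.min(axis=0)
col_range = counts.max(axis=0) - col_min
# Note: a constant column would give 0/0 = NaN here; MinMaxScaler maps it to 0 instead.
scaled = (counts - col_min) / col_range
print(scaled)
```

After scaling, a frequent symptom such as `pain` and a rare one such as `nausea` both span 0 to 1, so the condition where each symptom peaks stands out regardless of its raw frequency.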

# 2. Get Drug codes from the notes

## Analyze drug usage patterns for inventory management and reporting

Here we check how many times each drug is encountered in the documents.

In [0]:
rxnorm_res_df=spark.read.load(f'{delta_path}/gold/rxnorm-res-cleaned')

In [0]:
display(
  rxnorm_res_df
  .filter('confidence > 0.8')
  .groupBy('drugs')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

drugs,count
iron,5
Aromasin,3
Taxol,2
calcium,2
Saline Spray,2
Coreg,1
Aciphex,1
Dyazide,1
Synthroid,1
technetium Tc-99m sulfur colloid,1


# 3. Get Timeline Using RE Models

## Find the problems that occurred after treatments

We filter the dataframe on the following conditions to see the problems that occurred after treatments:
* `relation =='AFTER'`
* `entity1=='TREATMENT'`
* `entity2=='PROBLEM'`

In [0]:
temporal_re_df=spark.read.load(f"{delta_path}/silver/temporal-re")

In [0]:
display(temporal_re_df)

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,BEFORE,OCCURRENCE,Discharge,PROBLEM,Mesothelioma,0.99999833
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,pleural effusion,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,atrial fibrillation,0.99999607
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,Mesothelioma,PROBLEM,anemia,0.9996013
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,atrial fibrillation,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,anemia,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,ascites,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,pleural effusion,PROBLEM,esophageal reflux,0.94568694
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,atrial fibrillation,PROBLEM,anemia,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,OVERLAP,PROBLEM,atrial fibrillation,PROBLEM,ascites,1.0


In [0]:
display(
  temporal_re_df
  .where("relation == 'AFTER' AND entity1=='TREATMENT' AND entity2 == 'PROBLEM'")
  .filter('confidence > 0.8')
  .orderBy(F.desc('confidence'))
)

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_21.txt,AFTER,TREATMENT,epinephrine,PROBLEM,a transverse incision,0.9999982
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_21.txt,AFTER,TREATMENT,Xylocaine,PROBLEM,a transverse incision,0.9999807
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_40.txt,AFTER,TREATMENT,Implant,PROBLEM,ruptured,0.9999635
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_14.txt,AFTER,TREATMENT,this procedure,PROBLEM,complications,0.97644377
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_20.txt,AFTER,TREATMENT,intravenous heparin,PROBLEM,hereditary hypercoagulable state,0.9599375
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_37.txt,AFTER,TREATMENT,hemostatic,PROBLEM,a skin stab inferior,0.95746464


# 4. Analyze the Relations Between Body Parts and Procedures

In the extraction notebook, we created a relation extraction model to identify relationships between body parts and problem entities by using the pretrained **RelationExtractionModel** `re_bodypart_problem`. Now let's load the data and take a look at the relationships between body parts and procedures. By filtering the dataframe to rows satisfying `entity1 != entity2`, we can see the relations between different entity types, including the procedures applied to internal organs.
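
In the resulting table, the procedure can appear on either side of a relation (`entity1` or `entity2`). A minimal pandas sketch of how such rows could be brought into a fixed `procedure` / `body_part` layout is shown below. The toy rows imitate the table's schema, the names `rel_pdf` and `pairs` are hypothetical, and the sketch assumes every kept row has exactly one `Procedure` side.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the relation table (illustrative values only).
rel_pdf = pd.DataFrame({
    "entity1": ["Internal_organ_or_component", "Procedure", "Internal_organ_or_component"],
    "chunk1": ["IVC", "lumpectomy", "prostate"],
    "entity2": ["Procedure", "Internal_organ_or_component", "Procedure"],
    "chunk2": [
        "placement of a vena caval filter",
        "axillary node",
        "ultrasound-guided I-125 seed implantation",
    ],
})

# Keep only pairs whose two sides have different entity types.
mixed = rel_pdf[rel_pdf["entity1"] != rel_pdf["entity2"]]

# Put the procedure text in one column and the body part in the other,
# regardless of which side each appeared on.
proc_first = mixed["entity1"] == "Procedure"
pairs = pd.DataFrame({
    "procedure": np.where(proc_first, mixed["chunk1"], mixed["chunk2"]),
    "body_part": np.where(proc_first, mixed["chunk2"], mixed["chunk1"]),
})
print(pairs)
```

With the columns aligned this way, grouping by `body_part` or `procedure` becomes straightforward.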

In [0]:
bodypart_re_df=spark.read.load(f'{delta_path}/silver/bodypart-relationships')

In [0]:
display(
  bodypart_re_df
  .where('entity1!=entity2')
  .drop_duplicates()
  )

path,relation,entity1,chunk1,entity2,chunk2,confidence
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,IVC,Procedure,placement of a vena caval filter,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,1,Procedure,lumpectomy,Internal_organ_or_component,axillary node,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_35.txt,0,Internal_organ_or_component,prostate,Procedure,ultrasound-guided I-125 seed implantation,0.51350594
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,inferior vena cava,Procedure,placement of a vena caval filter,1.0
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,inferior vena cava,Procedure,mechanical and pharmacologic thrombolysis,0.9999087
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,1,Internal_organ_or_component,inferior vena cava,Procedure,balloon angioplasty,0.98969346
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_8.txt,0,Procedure,bone marrow biopsy,Internal_organ_or_component,cellular marrow,0.7271675
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_45.txt,0,Procedure,lymph node injection,Internal_organ_or_component,sentinel lymph node,0.9483816
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_33.txt,0,Internal_organ_or_component,nerve,Procedure,thyroidectomy,0.99999917
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,0,Internal_organ_or_component,iliac vein,Procedure,placement of a vena caval filter,1.0


# 5. Get Procedure codes from notes

We created a dataset of procedure codes using the `jsl_ner_wip_greedy_clinical` NER model, with the NerConverter's whitelist set to `['Procedure']` so that only procedure entities are kept. Let's take a look at this table:

In [0]:
cpt_df=spark.read.load(f'{delta_path}/silver/cpt')

In [0]:
display(cpt_df)

path,chunks,entity,cpt_code,confidence,all_codes,resolutions,cpt
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,decortication of the lung,Procedure,32225,0.3863,"List(32225, 32651, 39545, 36823, 32215, 33025, 1005980, 31320, 1005815, 32110, 32652, 32220, 32310, 60522, 69424, 31645, 32520)","List(Partial decortication of lung, Partial decortication of lung, Imbrication of diaphragm, Decannulation, Obliteration of pleural cavity, Decompression of pericardium, Decortication, pulmonary (separate procedure), Incision of larynx, Excision Procedures on the Larynx, Repair of lung laceration (procedure), Total decortication of lung, Total decortication of lung, Total decortication of lung, Excision of tissue of mediastinum, Removal of ventilation tube from middle ear, Tracheobronchial suctioning, Resection of lung; with resection of chest wall (Deprecated))",Partial decortication of lung
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,pleural biopsy,Procedure,32098,0.9951,"List(32098, 32095, 32609, 32604, 32400, 32491, 1007169, 37200, 75970, 32096, 32607, 32097, 32608, 10021, 32405, 32602, 32668)","List(Pleural biopsy, Open pleural biopsy, Thoracoscopy; with biopsy(ies) of pleura, Pericardial biopsy, Biopsy, pleura, percutaneous needle, Pleural procedure, Biopsy, Biopsy, Biopsy, Biopsy of lung (procedure), Biopsy of lung (procedure), Biopsy of lung (procedure), Biopsy of lung (procedure), Aspiration biopsy, Percutaneous needle biopsy lung, Thoracoscopic biopsy of lung, Pleural endoscopy)",Pleural biopsy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,thoracentesis,Procedure,32555,0.4992,"List(32555, 32421, 32036, 1005962, 35820, 32668, 32657, 32660, 32667, 32671, 1020900, 32653, 32654, 32661, 32673, 32658, 32651)","List(Thoracentesis, Thoracentesis, Thoracostomy, Thoracostomy, Thoracostomy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy)",Thoracentesis
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Port-A-Cath placement,Procedure,36593,0.0751,"List(36593, 75970, 49325, 20555, 62318, 75989, 62160, 64446, 55875, 49419, 41019, 19296, 61770, 19297, 64416, 64448, 55920, 0169T, 19298, 64449, 62319, 93564)","List(Catheter procedure (procedure), Catheter procedure (procedure), Catheter procedure (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Catheterization (procedure), Cardiac catheterization (procedure), Cardiac catheterization (procedure))",Catheter procedure (procedure)
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,pericardectomy,Procedure,33031,0.2498,"List(33031, 33025, 33030, 32660, 33020, 64746, 49250, 27566, 27350, 27424, 27524)","List(Pericardectomy, Pericardectomy, Pericardectomy, Pericardectomy, Pericardotomy, Phrenicectomy, Omphalectomy, Patellectomy, Patellectomy, Patellectomy, Patellectomy)",Pericardectomy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Pericardectomy,Procedure,32660,0.2466,"List(32660, 33031, 33025, 33030, 33020, 64746, 27566, 27350, 27424, 27524)","List(Pericardectomy, Pericardectomy, Pericardectomy, Pericardectomy, Pericardotomy, Phrenicectomy, Patellectomy, Patellectomy, Patellectomy, Patellectomy)",Pericardectomy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,Cholecystectomy,Procedure,47145,0.1599,"List(47145, 47143, 47144, 1014153, 47600, 47605, 47562, 47620, 68520, 47490, 47480, 21070, 48000, 42335, 42340)","List(Cholecystectomy, Cholecystectomy, Cholecystectomy, Cholecystectomy, Cholecystectomy, Cholecystectomy; with cholangiography, Endoscopic cholecystectomy, Cholecystectomy with cholangiography, Dacryocystectomy, Percutaneous cholecystotomy, Percutaneous cholecystotomy, Coronoidectomy, Cholecystostomy (procedure), Sialolithectomy, Sialolithectomy)",Cholecystectomy
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,thoracentesis,Procedure,32555,0.4992,"List(32555, 32421, 32036, 1005962, 35820, 32668, 32657, 32660, 32667, 32671, 1020900, 32653, 32654, 32661, 32673, 32658, 32651)","List(Thoracentesis, Thoracentesis, Thoracostomy, Thoracostomy, Thoracostomy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy)",Thoracentesis
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,thoracentesis,Procedure,32555,0.4992,"List(32555, 32421, 32036, 1005962, 35820, 32668, 32657, 32660, 32667, 32671, 1020900, 32653, 32654, 32661, 32673, 32658, 32651)","List(Thoracentesis, Thoracentesis, Thoracostomy, Thoracostomy, Thoracostomy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy, Thoracoscopy)",Thoracentesis
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_0.txt,pericardial window,Procedure,33025,0.2983,"List(33025, 33020, 32604, 1006059, 33031, 32660, 33030, 32661, 33050, 33814)","List(Pericardial drainage, Incision of pericardium, Pericardial biopsy, Pericardiocentesis, Excision of pericardium, Excision of pericardium, Excision of pericardium, Excision of pericardial cyst, Excision of pericardial cyst, Aorticopulmonary window operation)",Pericardial drainage
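
The `confidence` column in the rows above ranges from 0.0751 to 0.9951, and some rows repeat verbatim. A hedged sketch of deduplicating and thresholding the resolved codes before counting them is shown below, in plain Python over values copied from the output. The 0.25 cutoff, the decision to count a repeated mention only once, and the function name `confident_codes` are assumptions made for illustration.

```python
from collections import Counter

# Values copied from the table above: (note, chunk, cpt_code, confidence, cpt).
rows = [
    ("mt_oncology_0.txt", "decortication of the lung", "32225", 0.3863, "Partial decortication of lung"),
    ("mt_oncology_0.txt", "pleural biopsy", "32098", 0.9951, "Pleural biopsy"),
    ("mt_oncology_0.txt", "thoracentesis", "32555", 0.4992, "Thoracentesis"),
    ("mt_oncology_0.txt", "Port-A-Cath placement", "36593", 0.0751, "Catheter procedure (procedure)"),
    ("mt_oncology_0.txt", "pericardectomy", "33031", 0.2498, "Pericardectomy"),
    ("mt_oncology_0.txt", "thoracentesis", "32555", 0.4992, "Thoracentesis"),
    ("mt_oncology_0.txt", "thoracentesis", "32555", 0.4992, "Thoracentesis"),
]


def confident_codes(rows, min_confidence=0.25):
    """Count resolved procedures after dropping verbatim repeats and low-confidence rows."""
    unique_rows = set(rows)  # assumption: identical rows should count once
    return Counter(
        cpt
        for _, _, _, confidence, cpt in unique_rows
        if confidence >= min_confidence
    )


counts = confident_codes(rows)
for cpt, n in sorted(counts.items()):
    print(cpt, n)
```

Whether repeated mentions within one note should count once or several times depends on the question being asked, so the deduplication step is optional.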


We can see the most common procedures by counting the occurrences of each procedure and plotting the result.

In [0]:
#top 20
display(
  cpt_df
  .groupBy('cpt')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
)

cpt,count
Biopsy,17
Bone marrow aspiration,14
Colonoscopy,11
Capsulectomy or capsulotomy,8
Pericardectomy,6
Endoscopic biopsy,6
Thoracentesis,5
Excision of chemodectoma,5
Decannulation,5
Incision AND drainage,4


# 6. Get Assertion Status of Cancer Entities

Using the assertion status dataset, we can count how many cancer or symptom mentions are attributed to family members of the patient, and we can further check whether a symptom is absent or present.

In [0]:
assertion_df=spark.read.load(f'{delta_path}/silver/assertion').drop_duplicates()

In [0]:
display(assertion_df)

path,chunk,entity,assertion
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_34.txt,edema,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,"focal motor, sensory or other neurological symptoms",Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_39.txt,Newly diagnosed high-risk acute lymphoblastic leukemia,Oncological,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_40.txt,clubbing,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,narrowing,Symptom,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,decreased appetite,Symptom,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_7.txt,follicular non,Cancer,present
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_9.txt,weight loss,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_6.txt,chest pains,Symptom,absent
dbfs:/FileStore/HLS/nlp/data/mt_onc_50/mt_oncology_43.txt,shortness of breath,Symptom,absent


In [0]:
n_associated_with_someone_else = assertion_df.where("assertion=='associated_with_someone_else'").count()
print(f"Number of cancer or symptom mentions attributed to someone else (e.g. family members): {n_associated_with_someone_else}")

Number of cancer or symptom mentions attributed to someone else (e.g. family members): 35


In [0]:
display(
  assertion_df
  .groupBy('assertion')
  .count()
)

assertion,count
present,570
hypothetical,13
conditional,12
possible,47
absent,496
associated_with_someone_else,35
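
To put these counts in proportion, the short sketch below converts them to shares of all assertions. The numbers are copied from the output above; nothing else is assumed beyond the variable names.

```python
# Counts copied from the output above.
assertion_counts = {
    "present": 570,
    "hypothetical": 13,
    "conditional": 12,
    "possible": 47,
    "absent": 496,
    "associated_with_someone_else": 35,
}

total = sum(assertion_counts.values())
shares = {status: count / total for status, count in assertion_counts.items()}

# Print the statuses from most to least frequent.
for status, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{status:<30} {share:6.1%}")
```

Of the 1,173 assertions, `present` and `absent` together account for 1,066, which is why the next cell restricts the symptom analysis to those two statuses.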


In [0]:
assertion_symptom_df= (
  assertion_df
  .where("assertion in ('present', 'absent') AND entity=='Symptom'")
)
most_common_symptoms_df=(
  assertion_symptom_df
  .select('path','chunk')
  .groupBy('chunk')
  .count()
  .orderBy(F.desc('count'))
  .limit(20)
  )
display(most_common_symptoms_df)

chunk,count
edema,19
mass,15
murmurs,14
acute distress,12
night sweats,12
hepatosplenomegaly,12
pain,11
nausea,10
chills,10
vomiting,9


In [0]:
display(
  assertion_symptom_df
  .join(most_common_symptoms_df, on='chunk')
  .groupBy('chunk','assertion')
  .count()
  .orderBy(F.desc('count'))
  )

chunk,assertion,count
edema,absent,16
murmurs,absent,14
hepatosplenomegaly,absent,12
acute distress,absent,12
night sweats,absent,11
mass,present,10
chills,absent,10
pain,present,9
nausea,absent,9
vomiting,absent,9
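
Because `assertion_symptom_df` only contains `present` and `absent` rows, each symptom's total in the previous output is the sum of its two statuses. The sketch below uses that to recover the counts that do not appear in the rows shown above. It assumes both outputs were computed from the same `assertion_symptom_df`; the function name `split_counts` is hypothetical.

```python
# Totals per symptom (present + absent), copied from the most common symptoms output.
totals = {
    "edema": 19, "mass": 15, "murmurs": 14, "acute distress": 12,
    "night sweats": 12, "hepatosplenomegaly": 12, "pain": 11,
    "nausea": 10, "chills": 10, "vomiting": 9,
}

# The (assertion, count) row shown for each symptom in the output directly above.
shown = {
    "edema": ("absent", 16), "murmurs": ("absent", 14),
    "hepatosplenomegaly": ("absent", 12), "acute distress": ("absent", 12),
    "night sweats": ("absent", 11), "mass": ("present", 10),
    "chills": ("absent", 10), "pain": ("present", 9),
    "nausea": ("absent", 9), "vomiting": ("absent", 9),
}


def split_counts(totals, shown):
    """Derive both status counts per symptom from its total and the one status shown."""
    table = {}
    for chunk, total in totals.items():
        status, count = shown[chunk]
        other = "present" if status == "absent" else "absent"
        table[chunk] = {status: count, other: total - count}
    return table


table = split_counts(totals, shown)
for chunk, counts in table.items():
    print(f"{chunk:<20} present={counts['present']:>2} absent={counts['absent']:>2}")
```

Read this way, most of the frequent symptom mentions are negated findings (for example, edema is absent in 16 of 19 mentions), while `mass` and `pain` are mostly reported as present.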


## License
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors.  The source in this notebook is provided subject to the [Apache 2.0 License](https://spdx.org/licenses/Apache-2.0.html).  All included or referenced third party libraries are subject to the licenses set forth below.

|Library Name|Library License|Library License URL|Library Source URL| 
| :-: | :-:| :-: | :-:|
|Pandas |BSD 3-Clause License| https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas|
|Numpy |BSD 3-Clause License| https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy|
|Apache Spark |Apache License 2.0| https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark|
|Plotly |MIT License| https://github.com/plotly/plotly.py/blob/master/LICENSE.txt | https://github.com/plotly/plotly.py|
|Scikit-Learn |BSD 3-Clause| https://github.com/scikit-learn/scikit-learn/blob/main/COPYING | https://github.com/scikit-learn/scikit-learn/|


|Author|
|-|
|Databricks Inc.|
|John Snow Labs Inc.|

## Disclaimers
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account.  Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.