# Notes Report
This report will answer:
1. What do the notes contain? -- How do they differ structurally? Semantically? 
2. Statistical info about the note category. Token count? Note count per icustay?
3. Statistical comparison between categories

In [2]:
import pandas as pd
rad_path = '/data/datasets/mimiciv_notes/physionet.org/files/mimic-iv-note/2.2/note/radiology.csv'
dis_path = '/data/datasets/mimiciv_notes/physionet.org/files/mimic-iv-note/2.2/note/discharge.csv'
rad_cohort_path = '/home/ugrads/a/aa_ron_su/BoXHED_Fuse/JSS_SUBMISSION_NEW/data/till_end_mimic_iv_extra_features_train_NOTE_TARGET1_FT_rad.csv'
dis_cohort_path = '/home/ugrads/a/aa_ron_su/BoXHED_Fuse/JSS_SUBMISSION_NEW/data/till_end_mimic_iv_extra_features_train_NOTE_TARGET1_FT.csv'
rad_c_df = pd.read_csv(rad_cohort_path)
dis_c_df = pd.read_csv(dis_cohort_path)

## 1. What do the notes contain?

The discharge table contains discharge summaries for hospitalizations. Discharge summaries are long-form narratives that describe the reason for a patient’s admission to the hospital, their hospital course, and any relevant discharge instructions.

The radiology table contains free-text radiology reports associated with radiography imaging. Radiology reports cover a variety of imaging modalities: x-ray, computed tomography, magnetic resonance imaging, ultrasound, and so on. Free-text radiology reports are semi-structured and usually follow a consistent template for a given imaging protocol.

In [None]:
import numpy as np

np.random.seed(1)
n = rad_c_df.shape[0]
idxs = [np.random.randint(n) for _ in range(3)]

# print(rad_c_df.iloc[idxs[0]].text)
# print( '-' * 40)
'''
HISTORY
FINDINGS
'''

# print(rad_c_df.iloc[idxs[1]].text)
# print( '-' * 40)
'''
INDICATION
COMPARISON
TECHNIQUE
FINDINGS
IMPRESSION
'''

# print(rad_c_df.iloc[idxs[2]].text)
# print( '-' * 40)
'''
EXAMINATION
INDICATION
DOSE
COMPARISON
FINDINGS
Head_CTA
IMPRESSION
'''

Radiology reports are organized under all-caps headings. The set of headings varies from report to report, but a report generally names the examination or indication and then gives sections such as comparison, findings, and impression.
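
The heading structure seen in the samples above can be pulled out programmatically. The following is a minimal sketch, assuming a heading is an all-caps phrase at the start of a line followed by a colon. The `extract_headings` helper and the sample report are hypothetical and not part of this notebook's pipeline; real reports may need a looser pattern (for example, mixed-case headings such as `Head_CTA`).

```python
import re

# Assumption: a heading is an all-caps phrase at the start of a line, ending in a colon.
HEADING_RE = re.compile(r"^\s*([A-Z][A-Z /&-]*[A-Z]):", re.MULTILINE)

def extract_headings(text):
    """Return the all-caps section headings found in a report, in order."""
    return HEADING_RE.findall(text)

# Hypothetical report text, not taken from the dataset.
sample_report = (
    "INDICATION: Shortness of breath.\n"
    "COMPARISON: None.\n"
    "FINDINGS: Lungs are clear.\n"
    "IMPRESSION: No acute process.\n"
)
print(extract_headings(sample_report))
```

Applied to every report in the cohort, the resulting lists could be tallied to see which heading templates are most common.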

In [None]:
import numpy as np

np.random.seed(1)
n = dis_c_df.shape[0]
idxs = [np.random.randint(n) for _ in range(3)]

# print(dis_c_df.iloc[idxs[0]].text)
# print( '-' * 40)
'''
personal info... (hidden data for privacy)
  Service: Neurosurgery
  Chief Complaint: Fall
  Major Surgical or Invasive Procedure: None
  Name:  ___.           Unit No:   ___
  Admission Date:  ___              Discharge Date:   ___
  Date of Birth:  ___             Sex:   M
  ...
Service: NEUROSURGERY
Chief Complaint:
Major Surgical or Invasive Procedure:
History of Present Illness:
Past Medical History:
Social History:
Family History:
Physical Exam: 

On Discharge: 
Pertinent Results:
CT Head:
IMPRESSION: 

Brief Hospital Course:
Medications on Admission:
Discharge Medications: 
Discharge Disposition: 
Discharge Diagnosis:
Discharge Condition:
Discharge Instructions: 
Followup Instructions:
'''

# print(dis_c_df.iloc[idxs[1]].text)
# print( '-' * 40)
'''
personal info...
Chief Complaint:
Major Surgical or Invasive Procedure:
History of Present Illness:
Review of sytems:  
Past Medical History:
Social History:
Family History:
Physical Exam: 

GU: no foley  
Ext: 
Neuro: 
Pertinent Results:
Alpha-1 antitrypsin level: 31
Alpha-1 antitrypsin phenotype:
MICROBIOLOGY:
IMAGING:

Brief Hospital Course:
#. Ascitic fluid leakage: 
#. Cirrhosis: 
#. Pneumonia: 
Medications on Admission:
Discharge Medications:
Discharge Disposition:
Discharge Diagnosis:
Discharge Condition:
Discharge Instructions:
Followup Instructions:
'''

# print(dis_c_df.iloc[idxs[2]].text)
# print( '-' * 40)
'''
----------------------------------------
----------------------------------------
 
personal info...
Chief Complaint:
Major Surgical or Invasive Procedure:
History of Present Illness:
Past Medical History:
Social History:
Family History:
Physical Exam:

CT ABDOMEN W/O CONTRAST Study Date of ___ 1:14 ___ 
CT HEAD W/O CONTRAST Study Date of ___ 10:___vidence of acute intracranial process.  Chronic changes 
CT ABD & PELVIS W/O CONTRAST Study Date of ___ 4:09 ___ 
CT CHEST W/O CONTRAST Study Date of ___ 4:23 ___ 
CT ABD & PELVIS W/O CONTRAST Study Date of ___ 10:56 ___ 

Brief Hospital Course:
Medications on Admission:
Discharge Medications:
Discharge Disposition:
Facility:
Discharge Diagnosis:
Discharge Condition:
Discharge Instructions:
Followup Instructions:
'''

## 2. Statistical info about the note category
What is the token count per note? What is the note count per ICU stay?

In [None]:
import sys
import os
sys.path.append("/home/ugrads/a/aa_ron_su")
# from BoXHED_Fuse.src.helpers import tokenization

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('../models/Clinical-T5-Base/')

In [None]:
def tokenization(tokenizer, batched_text):
    return tokenizer(batched_text['text'], padding = False, truncation=False)

In [None]:
from functools import partial
from datasets import Dataset
rad_c_ds = Dataset.from_pandas(rad_c_df).select_columns(['text'])
tokens_rad = rad_c_ds.map(partial(tokenization, tokenizer), batched = True, batch_size = len(rad_c_ds) // 10)

dis_c_ds = Dataset.from_pandas(dis_c_df).select_columns(['text'])
tokens_dis = dis_c_ds.map(partial(tokenization, tokenizer), batched = True, batch_size = len(dis_c_ds) // 10)

In [None]:
note_token_lens_rad = [len(t['input_ids']) for t in tokens_rad]
note_token_lens_dis = [len(t['input_ids']) for t in tokens_dis]

In [21]:
def summary_stats(data):
    print(f'MEAN: {np.mean(data)}\nMEDIAN: {np.median(data)},\nMAX: {np.max(data)}\nMIN: {np.min(data)}')

summary_stats(note_token_lens_rad)
summary_stats(note_token_lens_dis)

Radiology reports are more succinct. At an average length of 265 tokens, a typical report fits within a 512-token transformer input.
Discharge notes, on the other hand, are much longer, averaging more than 3000 tokens, and will need to be truncated.
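
The mean alone does not say how many notes would actually be truncated. As a rough sketch, the share of notes exceeding the input limit can be computed directly from the token lengths. The `fraction_over_limit` helper and the toy lengths below are hypothetical; the 512-token limit is the one mentioned above.

```python
def fraction_over_limit(token_lengths, limit=512):
    """Share of notes whose token count exceeds the model's input limit."""
    if not token_lengths:
        return 0.0
    return sum(1 for n in token_lengths if n > limit) / len(token_lengths)

# Hypothetical token counts, for illustration only (not results from the dataset).
toy_rad_lens = [120, 265, 300, 610]
toy_dis_lens = [900, 2800, 3400, 5100]
print(fraction_over_limit(toy_rad_lens))  # 0.25
print(fraction_over_limit(toy_dis_lens))  # 1.0
```

In this notebook the same function could be called on `note_token_lens_rad` and `note_token_lens_dis`.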

In [None]:
print(tokens_rad[int(np.argmin(note_token_lens_rad))]['text'])
print(tokens_dis[int(np.argmin(note_token_lens_dis))]['text'])

In [None]:
print(tokens_rad[int(np.argmax(note_token_lens_rad))]['text'])
print(tokens_dis[int(np.argmax(note_token_lens_dis))]['text'])

Next, let's find the note count per ICU stay.

In [None]:
note_counts_rad = rad_c_df.groupby('ICUSTAY_ID')['NOTE_ID'].count()
note_counts_dis = dis_c_df.groupby('ICUSTAY_ID')['NOTE_ID'].count()


In [None]:
summary_stats(note_counts_rad)
summary_stats(note_counts_dis)

While there are 2 or 3 radiology reports per stay, there is almost always only 1 discharge summary per stay.
Because a discharge summary is written at discharge, real-time survival analysis can rely only on discharge summaries from previous stays, which limits their usefulness.
Radiology reports, on the other hand, provide recent information on the patient's condition.
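
To make the "previous stays only" constraint concrete, here is a minimal sketch of selecting the discharge notes that would be available when a new stay begins. The record layout, the timestamps, and the `prior_discharge_notes` helper are all hypothetical; the cohort files used in this notebook may store these fields differently.

```python
from datetime import datetime

# Hypothetical records: (subject_id, icustay_id, discharge_time, note_text).
discharge_notes = [
    (1, 101, datetime(2020, 1, 5), "summary of stay 101"),
    (1, 102, datetime(2020, 6, 9), "summary of stay 102"),
    (2, 201, datetime(2021, 3, 2), "summary of stay 201"),
]

def prior_discharge_notes(notes, subject_id, current_stay_start):
    """Discharge notes for this subject written before the current stay began."""
    return [
        note for note in notes
        if note[0] == subject_id and note[2] < current_stay_start
    ]

# Subject 1's second stay (assumed here to start on 2020-06-01) can use
# only the note written at the end of stay 101.
usable = prior_discharge_notes(discharge_notes, 1, datetime(2020, 6, 1))
print([n[1] for n in usable])
```

A subject with a single stay, such as subject 2 here, has no usable discharge note during that stay.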

## Further analysis
Given that there is almost always only one discharge summary per ICU stay, how many patients have multiple stays? In other words, how many patients would benefit from the inclusion of discharge notes?

In [7]:
trainpath = '/home/ugrads/a/aa_ron_su/BoXHED_Fuse/JSS_SUBMISSION_NEW/data/till_end_mimic_iv_extra_features_train.csv'
mimic_iv_train = pd.read_csv(trainpath)

In [8]:
mimic_iv_train.rename(columns={'Icustay':'ICUSTAY_ID', 'subject':'SUBJECT_ID'}, inplace=True)

In [13]:
rad_c_df_w_subj = rad_c_df.merge(mimic_iv_train[['ICUSTAY_ID', 'SUBJECT_ID']], on='ICUSTAY_ID', how='inner')
dis_c_df_w_subj = dis_c_df.merge(mimic_iv_train[['ICUSTAY_ID', 'SUBJECT_ID']], on='ICUSTAY_ID', how='inner')

In [27]:
subject_note_counts_rad = rad_c_df_w_subj.groupby('SUBJECT_ID')['NOTE_ID'].nunique()
subject_note_counts_dis = dis_c_df_w_subj.groupby('SUBJECT_ID')['NOTE_ID'].nunique()

print('radiology note counts by subject\n', subject_note_counts_rad)
print('discharge note counts by subject\n', subject_note_counts_dis)

radiology note counts by subject
 SUBJECT_ID
10000032.0    1
10000980.0    1
10001217.0    3
10001725.0    1
10002013.0    2
             ..
19999287.0    5
19999297.0    8
19999442.0    4
19999625.0    2
19999987.0    2
Name: NOTE_ID, Length: 22725, dtype: int64
discharge note counts by subject
 SUBJECT_ID
10000032.0    1
10000980.0    1
10001217.0    1
10002013.0    1
10002155.0    1
             ..
19998843.0    1
19999287.0    1
19999297.0    1
19999442.0    1
19999625.0    1
Name: NOTE_ID, Length: 11042, dtype: int64


In [28]:
summary_stats(subject_note_counts_rad)
summary_stats(subject_note_counts_dis)

MEAN: 3.350847084708471
MEDIAN: 3.0,
MAX: 44
MIN: 1
MEAN: 1.3197790255388517
MEDIAN: 1.0,
MAX: 6
MIN: 1


Even when grouping by subject, most patients have only 1 discharge summary (median 1, mean about 1.32).
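
The summary statistics do not directly answer how many patients would benefit, so a more direct measure is the share of subjects with more than one distinct note. Below is a small sketch under that framing; the `share_with_multiple_notes` helper and the toy pairs are hypothetical.

```python
from collections import defaultdict

def share_with_multiple_notes(pairs):
    """Fraction of subjects that have more than one distinct note."""
    notes_by_subject = defaultdict(set)
    for subject_id, note_id in pairs:
        notes_by_subject[subject_id].add(note_id)
    if not notes_by_subject:
        return 0.0
    multi = sum(1 for ids in notes_by_subject.values() if len(ids) > 1)
    return multi / len(notes_by_subject)

# Hypothetical (subject_id, note_id) pairs; the repeated pair is counted once.
toy_pairs = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "d")]
print(share_with_multiple_notes(toy_pairs))  # one of three subjects has more than one note
```

With the pandas Series already computed in this notebook, the equivalent would likely be `(subject_note_counts_dis > 1).mean()`.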