## AIMI High School Internship 2024
### Notebook 1: Extracting Labels from Radiology Reports

**The Problem**: Our goal in this project is to classify each chest X-ray image into one of four categories: pneumonia, pneumothorax, pleural effusion, or normal. This is an important clinical task because it supports the early diagnosis and treatment of various lung conditions.

To train a model that classifies pneumonia status from chest X-rays, we need a ***training set*** of chest X-rays paired with labels and patient metadata extracted from radiology reports. However, in real-world medical data, clear-cut labels (e.g. "pneumonia") are often not annotated ahead of time. Often the only data a researcher has access to are the raw images and the free-form clinical text written by the radiologist.

**Your First Task**: Given a set of chest X-rays and paired radiology reports, your goal is to use natural language processing (NLP) tools to extract the pneumonia label and relevant patient metadata from the reports.

**Looking Ahead**: When you complete this task, you should have a training dataset with chest X-rays labeled with pneumonia status and other relevant patient metadata. You will later use this dataset to train a computer vision model that predicts the pneumonia class given an image.

### Load Data

FOR GOOGLE COLAB

In [1]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

In [3]:
import os
# TODO: update this path to point to wherever the `2024 AIMI Summer Internship - Intern Materials/Datasets/`
# folder is stored for you (e.g. add a shortcut to the folder in your Drive and point to that location)
os.chdir(r'/content/drive/MyDrive/Cody - AIMI 2024/2024 AIMI Summer Internship - Intern Materials/Datasets')

In [5]:
!unzip -qq student_data_split.zip -d /content/

In [6]:
os.chdir(r'/content/student_data_split')

In [4]:
!ls

patient_reports.csv  student_data_split  student_data_split_small.zip  student_data_split.zip


FOR LOCAL RUNTIME

In [17]:
import os
os.chdir(r'C:\Users\codys\Desktop\AIMI2024')
os.chdir(r'student_data_split')

In [18]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 3641-391A

 Directory of C:\Users\codys\Desktop\AIMI2024\student_data_split

06/21/2024  11:00 AM    <DIR>          .
06/21/2024  11:00 AM    <DIR>          ..
06/16/2024  11:59 PM            10,244 .DS_Store
06/21/2024  11:06 AM    <DIR>          .ipynb_checkpoints
06/16/2024  11:48 PM        11,055,153 Reports.json
06/21/2024  11:07 AM    <DIR>          student_test
06/21/2024  11:06 AM    <DIR>          student_train
               2 File(s)     11,065,397 bytes
               5 Dir(s)  52,661,051,392 bytes free


### Load reports only (as needed)

In [19]:
import json

# Load the Reports.json file
with open('Reports.json', 'r') as file:
    reports = json.load(file)

# Print example report contents
print(reports[0])

{'study_id': 'student_train/patient39668/study2', 'report': 'NARRATIVE:\nExam: Chest 1 View, 6-19-2003\n \nClinical History: 64 years Female with Critical  care follow-up(ICU)\n \nComparison: 6/19/2003\n \nIMPRESSION:\n \n1.AP SUPINE CHEST RADIOGRAPH DEMONSTRATES INTERVAL PLACEMENT OF A \nRIGHT IJ VENOUS LINE, WITH THE TIP IN THE MID-SVC.\n \n2.STABLE CARDIOMEDIASTINAL SILHOUETTE.\n \n3.INCREASED RETICULAR MARKINGS ARE SEEN IN THE LUNGS BILATERALLY, \nWITH PERIBRONCHIAL CUFFING THAT COULD REPRESENT EARLY PULMONARY EDEMA.\n \n4.INCREASED OPACIFICATION IS SEEN IN THE RETROCARDIAC LUNG, \nCONCERNING FOR DEVELOPING INFECTION.\n \n5.MILD ATELECTASIS IN THE RIGHT MID AND LOWER ZONES.\n \n6.DEGENERATIVE CHANGES IN THE LOWER CERVICAL SPINE.\n \nSUMMARY:4-POSSIBLY SIGNIFICANT FINDING, MAY NEED ACTION\n \n \n \nACCESSION NUMBER:\n4047176\nThis report has been anonymized. All dates are offset from the actual dates by a fixed interval associated with the patient.'}


### Understanding the Data

Let's first go through some terminology. Medical data is often stored in a hierarchy consisting of three levels: patient, study, and images.
- Patient: A patient is a single unique individual.
- Study: Each patient may have multiple sets of images taken, perhaps on different days. Each set of images is referred to as a *study*.
- Images: Each study consists of one or more *images*.
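
The three levels map directly onto the folder layout seen in `study_id` values such as `student_train/patient39668/study2`. The sketch below splits an image path into those levels. It assumes every image sits exactly one folder below its study, and the file name `image1.jpg` is hypothetical, used only for illustration.

```python
from pathlib import PurePosixPath

def split_image_path(image_path):
    """Split 'split/patientX/studyY/image' into its hierarchy levels."""
    parts = PurePosixPath(image_path).parts
    # Expect exactly: split folder, patient folder, study folder, image file
    if len(parts) != 4:
        raise ValueError(f"unexpected path layout: {image_path}")
    split, patient, study, image = parts
    return {"split": split, "patient": patient, "study": study, "image": image}

# "image1.jpg" is a hypothetical file name used only for illustration
info = split_image_path("student_train/patient39668/study2/image1.jpg")
print(info)
```

Keeping the patient level explicit matters later: all images from one patient should usually stay on the same side of a train/validation split.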

### Extracting Labels

Try some naive approaches to extracting diseases from the reports!
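
One possible naive approach is keyword matching with a crude negation check, sketched below. The keyword patterns and negation cues are assumptions chosen for illustration, not a validated clinical vocabulary, and the sample text is an excerpt from one of the training reports.

```python
import re

# Hypothetical keyword patterns and negation cues, chosen for illustration only
CONDITION_PATTERNS = {
    "pneumonia": r"pneumonia",
    "pneumothorax": r"pneumothorax",
    "effusion": r"effusion",
}
NEGATION_CUES = re.compile(r"\b(no|without|negative for|resolved)\b")

def naive_labels(report_text):
    """Flag a condition if any sentence mentions it without a negation cue."""
    found = {name: False for name in CONDITION_PATTERNS}
    # Lowercase and collapse line breaks so sentences wrapped across lines stay whole
    text = " ".join(report_text.lower().split())
    for sentence in re.split(r"[.;]\s*", text):
        negated = NEGATION_CUES.search(sentence) is not None
        for name, pattern in CONDITION_PATTERNS.items():
            if re.search(pattern, sentence) and not negated:
                found[name] = True
    return found

sample = ("2. NO DEFINITE EVIDENCE OF PNEUMOTHORAX. THERE ARE PERSISTENT LOW\n"
          "LUNG VOLUMES WITH BIBASILAR ATELECTASIS. THERE ARE LIKELY SMALL\n"
          "BILATERAL PLEURAL EFFUSIONS.")
print(naive_labels(sample))
```

This rule treats hedged mentions ("LIKELY", "COULD REPRESENT") as positive and marks a whole sentence as negated if it contains any cue, so it will make mistakes that a trained model such as RadGraph may avoid.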

In [13]:
print(reports[0]["report"])

NARRATIVE:
Exam: Chest 1 View, 6-19-2003
 
Clinical History: 64 years Female with Critical  care follow-up(ICU)
 
Comparison: 6/19/2003
 
IMPRESSION:
 
1.AP SUPINE CHEST RADIOGRAPH DEMONSTRATES INTERVAL PLACEMENT OF A 
RIGHT IJ VENOUS LINE, WITH THE TIP IN THE MID-SVC.
 
2.STABLE CARDIOMEDIASTINAL SILHOUETTE.
 
3.INCREASED RETICULAR MARKINGS ARE SEEN IN THE LUNGS BILATERALLY, 
WITH PERIBRONCHIAL CUFFING THAT COULD REPRESENT EARLY PULMONARY EDEMA.
 
4.INCREASED OPACIFICATION IS SEEN IN THE RETROCARDIAC LUNG, 
CONCERNING FOR DEVELOPING INFECTION.
 
5.MILD ATELECTASIS IN THE RIGHT MID AND LOWER ZONES.
 
6.DEGENERATIVE CHANGES IN THE LOWER CERVICAL SPINE.
 
SUMMARY:4-POSSIBLY SIGNIFICANT FINDING, MAY NEED ACTION
 
 
 
ACCESSION NUMBER:
4047176
This report has been anonymized. All dates are offset from the actual dates by a fixed interval associated with the patient.


Code from Part 0

In [20]:
import re

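def get_patient_report(target_study_id):
  # Find the report whose study_id matches; None if no report exists for this study
  filtered_report = next((item['report'] for item in reports if item['study_id'] == target_study_id), None)
  # Pull "patientNNNNN" and "studyN" out of a path like "student_train/patient26819/study2"
  patient_id = re.search(r"patient\d+", target_study_id).group(0)
  study_id = re.search(r"study\d+", target_study_id).group(0)

  return [patient_id, filtered_report, study_id]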

print(get_patient_report("student_train/patient26819/study2"))


['patient26819', 'NARRATIVE:\nCOMPARISON: 1/9/2003.\nIMPRESSION:\n1. THE ENDOTRACHEAL TUBE HAS BEEN REMOVED. NASOGASTRIC TUBE IS\nSTABLE IN POSITION. THERE IS A NEW LEFT SUBCLAVIAN CATHETER WITH\nTHE TIP IN THE BRACHIOCEPHALIC VEIN AT THE SVC JUNCTION.\n2. NO DEFINITE EVIDENCE OF PNEUMOTHORAX. THERE ARE PERSISTENT LOW\nLUNG VOLUMES WITH BIBASILAR ATELECTASIS. THERE ARE LIKELY SMALL\nBILATERAL PLEURAL EFFUSIONS.\nEND OF IMPRESSION\nSUMMARY: 2 ABNORMAL, PREVIOUSLY REPORTED\nI have personally reviewed the images for this examination and agree\nwith the report transcribed above.\nBy: Ian, Warren  on: 1-9-2003\n \nACCESSION NUMBER:\nJBXEFIJH\nThis report has been anonymized. All dates are offset from the actual dates by a fixed interval associated with the patient.', 'study2']
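
`get_patient_report` scans the whole `reports` list on every call, which gets slow when it is called once per study. A minimal sketch of an alternative, assuming each `study_id` appears only once in `Reports.json`, is to build a dictionary once and look reports up by key. The helper name `build_report_index` and the two example records are made up for illustration.

```python
def build_report_index(reports):
    """Map each study_id to its report text for constant-time lookup."""
    return {item["study_id"]: item["report"] for item in reports}

# Two made-up records standing in for the contents of Reports.json
example_reports = [
    {"study_id": "student_train/patient00001/study1", "report": "NARRATIVE: ..."},
    {"study_id": "student_train/patient00001/study2", "report": "IMPRESSION: ..."},
]
report_index = build_report_index(example_reports)
print(report_index.get("student_train/patient00001/study2"))
print(report_index.get("student_train/patient99999/study1"))  # missing study gives None
```

If a `study_id` were duplicated, the dictionary would keep the last report while the `next(...)` scan returns the first, so the two approaches only agree when IDs are unique.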


In [54]:
import pandas as pd

#patient_ids = os.listdir('/content/student_data_split/student_train')
# Skip hidden files such as .DS_Store (a plain .remove() raises ValueError when the file is absent)
patient_ids = [p for p in os.listdir('student_train') if not p.startswith('.')]

patient_reports = []
for patient_id in patient_ids:
  #all_studies = os.listdir(f'/content/student_data_split/student_train/{patient_id}')
  all_studies = os.listdir(f'student_train/{patient_id}')
  for study_number in all_studies:
    if not study_number.startswith('.'):
      patient_reports.append(get_patient_report(f"student_train/{patient_id}/{study_number}"))

df = pd.DataFrame(patient_reports, columns=['Patient ID', 'Report', 'Study ID'])
df


| | Patient ID | Report | Study ID |
|---|---|---|---|
| 0 | patient00001 | NARRATIVE:\nRADIOGRAPHIC EXAMINATION OF THE CH... | study1 |
| 1 | patient00004 | NARRATIVE:\nChest 2 Views, DECEMBER 2003\n \nH... | study1 |
| 2 | patient00005 | NARRATIVE:\nChest 2 Views, 26 january\n \nHIST... | study1 |
| 3 | patient00006 | NARRATIVE:\nRADIOGRAPHIC EXAMINATION OF THE CH... | study1 |
| 4 | patient00011 | NARRATIVE:\nCHEST ONE VIEW: June 2002 \n \n ... | study3 |
| ... | ... | ... | ... |
| 12711 | patient64515 | NARRATIVE:\nRADIOGRAPHIC EXAMINATION OF THE CH... | study1 |
| 12712 | patient64516 | NARRATIVE:\nPORTABLE CHEST, 7-29-01:\nCOMPARIS... | study1 |
| 12713 | patient64517 | NARRATIVE:\nEXAM: Chest 1 View Portable, 4-22-... | study1 |
| 12714 | patient64520 | NARRATIVE:\nRADIOGRAPHIC EXAMINATION OF THE CH... | study1 |


### Looking Ahead: RadGraph

Paper: https://arxiv.org/pdf/2106.14463

In [12]:
%%capture
!pip install radgraph

Learn more about RadGraph: https://arxiv.org/abs/2106.14463

In [22]:
from radgraph import RadGraph, F1RadGraph

### RadGraph example

In [23]:
radgraph = RadGraph(model_type='radgraph-xl')
annotations = radgraph([reports[0]["report"]])

cuda




In [24]:
def label_reports(reports):
    annotations = radgraph(reports)
    return annotations

In [27]:
def checkConditions(report_txt):
  # Annotate one report; the result for the first (and only) report is stored under key '0'
  annotation = label_reports(report_txt)
  entities = annotation['0']['entities']

  interestedConditions = ['pneumonia', 'pneumothorax', 'effusion']
  hasConditions = {condition: False for condition in interestedConditions}

  for entity in entities.values():
    # Count only positive mentions; "definitely absent" and "uncertain" entities are skipped
    if 'definitely present' not in entity['label']:
      continue
    tokens = entity['tokens'].lower()
    for condition in interestedConditions:
      if condition in tokens:
        hasConditions[condition] = True

  return hasConditions
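
Because `checkConditions` only counts "definitely present" entities, a condition that is mentioned as uncertain or absent ends up looking the same as one that is never mentioned. The sketch below tallies how each condition is labeled, which can help when deciding how to treat uncertain mentions. The `mock_annotation` dictionary is hand-written: it reuses the field names from the code above (`entities`, `tokens`, `label`), but its values are assumptions and not real model output.

```python
from collections import Counter

# Hand-written stand-in for one report's RadGraph output; the field names follow
# the ones used above ('entities', 'tokens', 'label'), the values are made up.
mock_annotation = {
    "0": {
        "entities": {
            "1": {"tokens": "pneumothorax", "label": "Observation::definitely absent"},
            "2": {"tokens": "effusions", "label": "Observation::uncertain"},
            "3": {"tokens": "atelectasis", "label": "Observation::definitely present"},
            "4": {"tokens": "pleural", "label": "Anatomy::definitely present"},
        }
    }
}

def certainty_by_condition(annotation, conditions):
    """Tally how each condition of interest is labeled (present / absent / uncertain)."""
    tally = {condition: Counter() for condition in conditions}
    for entity in annotation["0"]["entities"].values():
        # Labels look like "Observation::definitely absent"; keep the part after "::"
        certainty = entity["label"].split("::")[-1]
        for condition in conditions:
            if condition in entity["tokens"].lower():
                tally[condition][certainty] += 1
    return tally

tally = certainty_by_condition(mock_annotation, ["pneumonia", "pneumothorax", "effusion"])
for condition, counts in tally.items():
    print(condition, dict(counts))
```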


In [28]:
%%capture
!pip install tqdm
from tqdm import tqdm

In [73]:
processedPatients = []

#for index, row in df.iterrows():
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
  study_dir = f'student_train/{row["Patient ID"]}/{row["Study ID"]}'
  imagesInStudy = os.listdir(study_dir)
  conditions = checkConditions(row['Report'])

  # Names of the conditions found in this report (an empty list means "normal")
  labels = [name for name, present in conditions.items() if present]

  # One row per image; every image in a study shares the labels from the study's report
  for imageName in imagesInStudy:
    newPatient = {"Image_Path": f'{study_dir}/{imageName}', "Patient ID": row['Patient ID'], "Study ID": row['Study ID'],
                  'Pneumonia': conditions['pneumonia'], 'Pneumothorax': conditions['pneumothorax'], 'Effusion': conditions['effusion'],
                  'Normal': not any(conditions.values()),
                  'labels': labels}
    processedPatients.append(newPatient)
conditionsDf = pd.DataFrame(processedPatients)


100%|████████████████████████████████████████████████████████████████████████████| 12716/12716 [15:09<00:00, 13.98it/s]


In [74]:
conditionsDf

| | Image_Path | Patient ID | Study ID | Pneumonia | Pneumothorax | Effusion | Normal | labels |
|---|---|---|---|---|---|---|---|---|
| 0 | student_train/patient00001/study1/view1_fronta... | patient00001 | study1 | False | False | False | True | [] |
| 1 | student_train/patient00004/study1/view1_fronta... | patient00004 | study1 | False | False | False | True | [] |
| 2 | student_train/patient00004/study1/view2_latera... | patient00004 | study1 | False | False | False | True | [] |
| 3 | student_train/patient00005/study1/view1_fronta... | patient00005 | study1 | False | False | False | True | [] |
| 4 | student_train/patient00005/study1/view2_latera... | patient00005 | study1 | False | False | False | True | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16767 | student_train/patient64515/study1/view1_fronta... | patient64515 | study1 | False | False | False | True | [] |
| 16768 | student_train/patient64516/study1/view1_fronta... | patient64516 | study1 | False | False | False | True | [] |
| 16769 | student_train/patient64517/study1/view1_fronta... | patient64517 | study1 | False | False | False | True | [] |
| 16770 | student_train/patient64520/study1/view1_fronta... | patient64520 | study1 | False | False | False | True | [] |


In [75]:
conditionsDf.loc[conditionsDf['Normal'] == False]

| | Image_Path | Patient ID | Study ID | Pneumonia | Pneumothorax | Effusion | Normal | labels |
|---|---|---|---|---|---|---|---|---|
| 6 | student_train/patient00011/study3/view1_fronta... | patient00011 | study3 | False | True | False | False | [pneumothorax] |
| 7 | student_train/patient00011/study4/view1_fronta... | patient00011 | study4 | False | True | False | False | [pneumothorax] |
| 8 | student_train/patient00011/study6/view1_fronta... | patient00011 | study6 | False | True | False | False | [pneumothorax] |
| 13 | student_train/patient00023/study5/view1_fronta... | patient00023 | study5 | False | False | True | False | [effusion] |
| 14 | student_train/patient00023/study5/view2_latera... | patient00023 | study5 | False | False | True | False | [effusion] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16710 | student_train/patient63939/study1/view1_fronta... | patient63939 | study1 | False | True | False | False | [pneumothorax] |
| 16724 | student_train/patient64055/study2/view1_fronta... | patient64055 | study2 | True | False | False | False | [pneumonia] |
| 16730 | student_train/patient64086/study1/view1_fronta... | patient64086 | study1 | False | False | True | False | [effusion] |
| 16737 | student_train/patient64139/study1/view1_fronta... | patient64139 | study1 | False | False | True | False | [effusion] |


In [77]:
conditionsDf.to_pickle('conditionsDf.pkl')
conditionsDf.to_csv('conditionsDfCsv.csv')
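
The pickle file preserves Python objects, but a CSV file stores only text, so the `labels` lists are written as strings such as `"['pneumothorax']"`. The sketch below shows this round trip with the standard-library `csv` module on two made-up rows (the `image1.jpg` file names are hypothetical). The same effect is expected when the CSV is read back with pandas, where the saved index would also reappear as an extra unnamed column unless it is declared as the index.

```python
import ast
import csv
import io

# Two made-up rows; "image1.jpg" is a hypothetical file name
rows = [
    {"Image_Path": "student_train/patient00011/study3/image1.jpg", "labels": ["pneumothorax"]},
    {"Image_Path": "student_train/patient00001/study1/image1.jpg", "labels": []},
]

# Write to an in-memory CSV; each list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Image_Path", "labels"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["labels"]).__name__)

# ast.literal_eval safely turns the string form back into a Python list
for row in loaded:
    row["labels"] = ast.literal_eval(row["labels"])
print(loaded[0]["labels"], loaded[1]["labels"])
```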

### F1RadGraph Example

In [18]:
references = ["no acute cardiopulmonary abnormality",
        "et tube terminates 2 cm above the carina retraction by several centimeters is recommended for more optimal placement bibasilar consolidations better assessed on concurrent chest ct"
]

hypotheses = ["no acute cardiopulmonary abnormality",
        "endotracheal tube terminates 2 5 cm above the carina bibasilar opacities likely represent atelectasis or aspiration",
]
f1radgraph = F1RadGraph(reward_level="all")
mean_reward, reward_list, hypothesis_annotation_lists, reference_annotation_lists = f1radgraph(hyps=hypotheses, refs=references)

model_type not provided, defaulting to radgraph-xl
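
The score compares what RadGraph extracts from a hypothesis report with what it extracts from the reference report. As a simplified illustration of that idea (not the library's implementation), the sketch below computes precision, recall, and F1 between two sets of `(tokens, label)` pairs. The function name `entity_f1` and both entity sets are made up for this example.

```python
def entity_f1(hypothesis_entities, reference_entities):
    """Precision, recall and F1 between two sets of (tokens, label) pairs."""
    hyp, ref = set(hypothesis_entities), set(reference_entities)
    if not hyp and not ref:
        return 1.0, 1.0, 1.0  # convention chosen here: two empty sets count as a perfect match
    overlap = len(hyp & ref)
    precision = overlap / len(hyp) if hyp else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Made-up entity sets, not real RadGraph output
hyp_entities = {("tube", "Observation::definitely present"),
                ("opacities", "Observation::definitely present"),
                ("atelectasis", "Observation::uncertain")}
ref_entities = {("tube", "Observation::definitely present"),
                ("consolidations", "Observation::definitely present")}
precision, recall, f1 = entity_f1(hyp_entities, ref_entities)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here one of three hypothesis entities matches one of two reference entities, so precision is 1/3, recall is 1/2, and F1 is 0.4.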


### spaCy Exploration

In [19]:
!pip install spacy



In [20]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

tokenizer = Tokenizer(English().vocab)

def tokenize(idx, reports=reports, tokenizer=tokenizer):
    tokens = tokenizer(reports[idx]["report"])
    return tokens
tokenize(1)

NARRATIVE:
Chest 1 View: 11-11-06
 
HISTORY: 45 years Male, Repeat CXR - chest tube with small PTX on 
prior.
 
COMPARISON: 11/11/06
 
IMPRESSION: 
 
1.  SINGLE FRONTAL CHEST RADIOGRAPH DEMONSTRATES UNCHANGED POSITION 
OF LEFT CHEST TUBE WITH SIDE-PORT IN THORACIC CAVITY.  TINY LEFT 
APICAL PNEUMOTHORAX IS AGAIN SEEN.
 
2.  MULTIPLE DISPLACED LEFT RIB FRACTURES SEEN.
 
3.  UNCHANGED RETROCARDIAC DENSITY AND LEFT BASILAR OPACITY WHICH 
LIKELY REFLECTS ATELECTASIS OR CONSOLIDATION.
 
4.  UNCHANGED LEFT CHEST WALL SUBCUTANEOUS EDEMA.
 
SUMMARY:2-ABNORMAL, PREVIOUSLY REPORTED 
I have personally reviewed the images for this examination and agreed
with the report transcribed above.
 
ACCESSION NUMBER:
65-73-45-73
This report has been anonymized. All dates are offset from the actual dates by a fixed interval associated with the patient.
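
A `Tokenizer` built from only a vocab has no prefix, suffix, or infix rules, so it should split on whitespace alone and leave punctuation attached to words (for example `SEEN.`). The standard-library sketch below shows the difference this makes on one line from the report above. The regex is an assumption chosen for illustration, not spaCy's own rule set; within spaCy, the default rule-based tokenizer from `English().tokenizer` is the usual way to get punctuation split off.

```python
import re

line = "APICAL PNEUMOTHORAX IS AGAIN SEEN."

# Whitespace-only splitting keeps punctuation attached to words
whitespace_tokens = line.split()
print(whitespace_tokens)

# A simple regex alternative: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", line)
print(regex_tokens)
```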