# MIMIC
location:
- /DATA1/llm-research/MIMIC-CXR

number of studies:
- 227,835 radiology studies

# RadGraph
location:
- /DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/

file:
- train.json  
- 425 reports
- source: {'MIMIC-CXR'}
  - question: why not 500? The remaining 75 reports are in dev.json (425 + 75 = 500).

file:
- test.json  
- 100 reports
- source: {'MIMIC-CXR', 'CheXpert'}

Paper RadGraph
> We release a development dataset, which contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets.
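
The per-file report counts and source sets listed above can be re-derived from the JSON files. A minimal sketch, assuming each report record carries a `data_source` field (an assumption about the RadGraph 1.0.0 layout, not confirmed in these notes); `count_sources` and the toy records are hypothetical:

```python
from collections import Counter

def count_sources(reports):
    """Count reports per data source; records without the field are counted as 'unknown'."""
    return Counter(r.get('data_source', 'unknown') for r in reports.values())

# Toy records standing in for a loaded test.json (keys and values are made up)
toy_reports = {
    'report_a': {'text': '...', 'data_source': 'MIMIC-CXR'},
    'report_b': {'text': '...', 'data_source': 'MIMIC-CXR'},
    'report_c': {'text': '...', 'data_source': 'CheXpert'},
}
source_counts = count_sources(toy_reports)
print(dict(source_counts))
```

Running the same helper on the loaded `train.json`, `dev.json`, and `test.json` dictionaries would show whether the 425 / 75 / 100 split matches the paper's description.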

## Summary of train.json

In [1]:
import json
from pprint import pprint

path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/train.json'
# path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/dev.json'


with open(path, 'r') as f:
    train_data = json.load(f)

# Length of data
print('Length of data:', len(train_data))

# Print first 10 keys of data
print('First 10 keys of data:', list(train_data.keys())[:10])

# Show an example (uncomment to inspect a single report)
# example_key = 'p18/p18004941/s58821758.txt'
# print('Example:', example_key)
# print('Text:', train_data[example_key]['text'])
# pprint(train_data[example_key])



Length of data: 425
First 10 keys of data: ['p18/p18004941/s58821758.txt', 'p18/p18004941/s58034164.txt', 'p18/p18004941/s57101461.txt', 'p18/p18004941/s52289350.txt', 'p18/p18004941/s55426484.txt', 'p18/p18004941/s55576421.txt', 'p18/p18003081/s57007000.txt', 'p18/p18003081/s51239857.txt', 'p18/p18003081/s56557601.txt', 'p18/p18003081/s59630984.txt']


## Unique entities in training data
Length of all_unique_entities_dict: 1250

In [8]:
import json
import pprint

# Step 1: Initialize an empty dictionary to store all unique entities
# structure: entity: {label, reports: [{report_key: {start_ix, end_ix}}], normalization}
all_unique_entities_dict = {}

# running count of all entity mentions
i = 0

# Step 2: Iterate through each report in the train_data
for report_key, report_data in train_data.items():
    for entity_id, entity_info in report_data['entities'].items():
        i += 1
        entity = entity_info['tokens']
        label = entity_info['label']
        start_ix = entity_info['start_ix']
        end_ix = entity_info['end_ix']



        if entity not in all_unique_entities_dict:
            all_unique_entities_dict[entity] = {
                'label': label,
                'reports': [
                    {report_key: 
                        {   'start_ix': start_ix, 
                            'end_ix': end_ix
                        }
                    }
                ],
                'normalization': None
            }
        else:
            all_unique_entities_dict[entity]['reports'].append(
                {report_key: 
                    {   'start_ix': start_ix, 
                        'end_ix': end_ix
                    }
                }
            )

# len of all_unique_entities_dict
print('Length of all_unique_entities_dict:', len(all_unique_entities_dict))
print('Number of entities:', i)

Length of all_unique_entities_dict: 1250
Number of entities: 12388
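
The count of 1250 is case-sensitive, so surface forms such as 'Lungs' and 'lungs' are stored as separate keys. A minimal sketch of grouping surface forms case-insensitively to see how much the count shrinks; `group_by_casefold` is a hypothetical helper and the list is toy data:

```python
from collections import defaultdict

def group_by_casefold(surface_forms):
    """Group surface forms that differ only by letter case."""
    groups = defaultdict(set)
    for surface in surface_forms:
        groups[surface.casefold()].add(surface)
    return dict(groups)

# Toy input; in the notebook this would be all_unique_entities_dict.keys()
surfaces = ['Lungs', 'lungs', 'Cardiac', 'CARDIAC', 'cardiac', 'pleural']
groups = group_by_casefold(surfaces)
print(len(surfaces), '->', len(groups))
```

Whether case should be merged before normalization is a design choice; this only measures the effect.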


## write to file

In [5]:
# Save Output all_entities to a JSON file
# with open('all_unique_entities.json', 'w', encoding='utf-8') as f:
#     json.dump(all_unique_entities_dict, f, ensure_ascii=False, indent=4)

# print("All entities have been saved to 'all_unique_entities.json'")

All entities have been saved to 'all_unique_entities.json'


## Read the saved file

In [7]:
import json
import pprint

## Read the saved file
with open('../resource/all_unique_entities.json', 'r', encoding='utf-8') as f:
    all_unique_entities = json.load(f)

# Print the first 10 entities
print('First 10 entities:')
pprint.pprint(list(all_unique_entities.keys())[:10])

print(all_unique_entities['Lung']['normalization'])

First 10 entities:
['Lungs',
 'clear',
 'Normal',
 'cardiomediastinal',
 'hilar',
 'silhouettes',
 'pleural',
 'surfaces',
 'Endotracheal',
 'tube']
None


## sorted by label (e.g. ANATOMY, OBSERVATION, etc.)

In [43]:
import json
import pprint

# Step 1: Initialize an empty dictionary to store all unique entities
# structure: label: {entity: {normalization: null, reports: list}}
all_entities = {}

# Step 2: Iterate through each report in the train_data
for report_key, report_data in train_data.items():
    if 'entities' in report_data:
        # Step 3: Iterate through each entity in 'entities'
        for entity_id, entity_info in report_data['entities'].items():
            # get 'tokens' as entity
            entity = entity_info['tokens']
            # get 'label' as label
            label = entity_info['label']
            
            # if label not in all_entities
            if label not in all_entities:
                # add new label with entity, normalization, and report_key
                all_entities[label] = {entity: {'normalization': None, 'reports': [report_key]}}
            else:
                # if entity not in all_entities[label]
                if entity not in all_entities[label]:
                    # add new entity with normalization and report_key
                    all_entities[label][entity] = {'normalization': None, 'reports': [report_key]}
                else:
                    # update existing entity with new report_key
                    all_entities[label][entity]['reports'].append(report_key)

# Print the results for 'ANAT-DP' label
pprint.pprint(all_entities['ANAT-DP'])

# Print the number of unique entities for 'ANAT-DP' label
print(f"Number of unique 'ANAT-DP' entities: {len(all_entities['ANAT-DP'])}")

# Output all_entities to a JSON file
with open('all_entities.json', 'w', encoding='utf-8') as f:
    json.dump(all_entities, f, ensure_ascii=False, indent=4)

print("All entities have been saved to 'all_entities.json'")

{'1.8 cm above': {'normalization': None,
                  'reports': ['p10/p10002428/s59659695.txt',
                              'p10/p10002428/s59659695.txt']},
 '2.8 cm above': {'normalization': None,
                  'reports': ['p15/p15007517/s52317711.txt']},
 '2.9 cm above': {'normalization': None,
                  'reports': ['p15/p15009504/s51227163.txt']},
 '3 cm above': {'normalization': None,
                'reports': ['p15/p15006483/s59006665.txt',
                            'p15/p15006483/s57012181.txt']},
 '3.4 cm': {'normalization': None, 'reports': ['p15/p15009504/s58857907.txt']},
 '3.5 cm above': {'normalization': None,
                  'reports': ['p15/p15007517/s59416262.txt']},
 '3.6 cm above': {'normalization': None,
                  'reports': ['p15/p15009504/s52538987.txt']},
 '4 cm above': {'normalization': None,
                'reports': ['p18/p18010079/s55080309.txt']},
 '4.3 cm': {'normalization': None, 'reports': ['p18/p18004941/s52289350.txt']},


## Count how often each entity in all_entities occurs

In [29]:
from collections import Counter

def analyze_entity_frequency(all_entities):
    # Initialize a Counter to store entity frequencies
    entity_frequency = Counter()

    # Iterate through all labels and entities
    for label, entities in all_entities.items():
        for entity, entity_info in entities.items():
            # Count one occurrence per entry in the entity's 'reports' list
            entity_frequency[(entity, label)] += len(entity_info['reports'])

    # Print the results
    print("Entity Frequency Analysis:")
    for (entity, label), count in entity_frequency.most_common():
        print(f"Entity: '{entity}', Label: '{label}', Frequency: {count}")

    # Optional: Return the Counter object for further analysis if needed
    return entity_frequency

# Assuming all_entities is already defined and populated
analyze_entity_frequency(all_entities)

Entity Frequency Analysis:
Entity: 'pleural', Label: 'ANAT-DP', Frequency: 316
Entity: 'pneumothorax', Label: 'OBS-DA', Frequency: 239
Entity: 'right', Label: 'ANAT-DP', Frequency: 211
Entity: 'left', Label: 'ANAT-DP', Frequency: 191
Entity: 'pulmonary', Label: 'ANAT-DP', Frequency: 176
Entity: 'effusion', Label: 'OBS-DA', Frequency: 163
Entity: 'normal', Label: 'OBS-DP', Frequency: 142
Entity: 'unchanged', Label: 'OBS-DP', Frequency: 137
Entity: 'lung', Label: 'ANAT-DP', Frequency: 124
Entity: 'silhouette', Label: 'ANAT-DP', Frequency: 120
Entity: 'size', Label: 'ANAT-DP', Frequency: 116
Entity: 'atelectasis', Label: 'OBS-DP', Frequency: 110
Entity: 'contours', Label: 'ANAT-DP', Frequency: 104
Entity: 'lungs', Label: 'ANAT-DP', Frequency: 96
Entity: 'clear', Label: 'OBS-DP', Frequency: 96
Entity: 'consolidation', Label: 'OBS-DA', Frequency: 96
Entity: 'acute', Label: 'OBS-DA', Frequency: 95
Entity: 'mediastinal', Label: 'ANAT-DP', Frequency: 87
Entity: 'focal', Label: 'OBS-DA', Frequency: 86
...

Counter({('pleural', 'ANAT-DP'): 316,
         ('pneumothorax', 'OBS-DA'): 239,
         ('right', 'ANAT-DP'): 211,
         ('left', 'ANAT-DP'): 191,
         ('pulmonary', 'ANAT-DP'): 176,
         ('effusion', 'OBS-DA'): 163,
         ('normal', 'OBS-DP'): 142,
         ('unchanged', 'OBS-DP'): 137,
         ('lung', 'ANAT-DP'): 124,
         ('silhouette', 'ANAT-DP'): 120,
         ('size', 'ANAT-DP'): 116,
         ('atelectasis', 'OBS-DP'): 110,
         ('contours', 'ANAT-DP'): 104,
         ('lungs', 'ANAT-DP'): 96,
         ('clear', 'OBS-DP'): 96,
         ('consolidation', 'OBS-DA'): 96,
         ('acute', 'OBS-DA'): 95,
         ('mediastinal', 'ANAT-DP'): 87,
         ('focal', 'OBS-DA'): 86,
         ('lower', 'ANAT-DP'): 82,
         ('stable', 'OBS-DP'): 81,
         ('effusion', 'OBS-DP'): 78,
         ('tube', 'OBS-DP'): 73,
         ('cardiac', 'ANAT-DP'): 71,
         ('opacity', 'OBS-DP'): 70,
         ('small', 'OBS-DP'): 70,
         ('effusions', 'OBS-DP'): 68,
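
The counts above show the same surface form under more than one label (for example 'effusion' as both OBS-DA and OBS-DP), while `all_unique_entities_dict` is keyed by tokens only and keeps just the first label seen. A minimal sketch for listing such entities, using toy reports in the same `entities` / `tokens` / `label` layout as the cells above; `labels_per_entity` is a hypothetical helper:

```python
from collections import defaultdict

def labels_per_entity(reports):
    """Map each entity surface form to the set of labels it was annotated with."""
    labels = defaultdict(set)
    for report in reports.values():
        for info in report['entities'].values():
            labels[info['tokens']].add(info['label'])
    return labels

# Toy reports (made up) mirroring the train.json entity layout
toy_reports = {
    'r1': {'entities': {'1': {'tokens': 'effusion', 'label': 'OBS-DA'},
                        '2': {'tokens': 'pleural', 'label': 'ANAT-DP'}}},
    'r2': {'entities': {'1': {'tokens': 'effusion', 'label': 'OBS-DP'}}},
}
labels = labels_per_entity(toy_reports)
ambiguous = {entity: sorted(found) for entity, found in labels.items() if len(found) > 1}
print(ambiguous)
```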


## Discussion about the result from RadGraph

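cardiomediastinal
should be split into "cardio" and "mediastinal":  
Heart (C0018787)  
Mediastinal (C1522718)


The 500 gold-standard reports here are not fully reliable either: the annotations do not account for nesting or for later normalization steps such as splitting compound words.
- The data is quite rough.

We need one extra step here, which I call "simple entity to normalized entity".  
- Generative AI works well for this step, as in the example above.  
- Rules would also work.

This prepares for the subsequent normalization.

Given this result, we can do more practical, application-level work.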
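
The rule-based variant of the "simple entity to normalized entity" step can be sketched as a lookup table. This is only an illustration: `COMPOUND_RULES` and `to_normalized_entities` are hypothetical names, and the single rule is the cardiomediastinal example from the notes above.

```python
# Hypothetical rule table: compound surface form -> list of (normalized name, UMLS CUI).
# Only the 'cardiomediastinal' entry comes from the notes; real rules would be added by hand.
COMPOUND_RULES = {
    'cardiomediastinal': [('Heart', 'C0018787'), ('Mediastinal', 'C1522718')],
}

def to_normalized_entities(entity):
    """Return the normalized parts of a known compound; pass other entities through unmapped."""
    return COMPOUND_RULES.get(entity.lower(), [(entity, None)])

print(to_normalized_entities('cardiomediastinal'))
print(to_normalized_entities('pleural'))
```

Entities without a rule come back with `None` as the identifier, so they can still be routed to a generative model or to manual review.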


In [9]:
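# all_tokens is assumed to be a list of (token_text, label) pairs built in an earlier cell not shown here
unique_second_elements = {token[1] for token in all_tokens}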

print(unique_second_elements)

num_unique_second_elements = len(unique_second_elements)
print(num_unique_second_elements)

type_counts = {}
for token in all_tokens:
    token_type = token[1]
    if token_type in type_counts:
        type_counts[token_type] += 1
    else:
        type_counts[token_type] = 1

print(type_counts)

# Filter the tokens to include only ANAT-DP tokens
anat_dp_tokens = [token for token in all_tokens if token[1] == 'ANAT-DP']

# Print the content of ANAT-DP tokens
for token in anat_dp_tokens:
    print(token[0])


{'ANAT-DP', 'OBS-DP', 'OBS-U', 'OBS-DA'}
4
{'OBS-DP': 812, 'ANAT-DP': 398, 'OBS-DA': 153, 'OBS-U': 190}
superior
of T 5 through T 9
structures
knuckle
pectus
internal jugular
lower
diameter
valve
volumes
fifith
heart
barely
vascularity
AP
Cardiac
cutaneous
nodule
CP
sided
gastroesophageal
AZYGOS
CARDIAC
border
Vasculature
first
infectious
skin
near
Median
aorta
L 2 through L 4
BASILAR
axilla
contour
Lower
paratracheal
hilus
locations
margin
left arm
bases
valvular
Interstitial
spine
esophagus
lateral structures
subpulmonic
superior cavoatrial junction
3.4 cm
mid - to - distal
interspace
lingula
duodenum
Bibasilar
basilar
adjacent
Pectus
4 cm above
mediastinum
thoracolumbar junction
biapical
LOBE
cardiac
Pleural
costal
apex
angle
right
aspect
atrium
vertebral
medial
hilar
posteriorly
alveolar
loculated
excavatum
Left sided
vein
5.6 cm
coronary
contents
Multiple
approximately 2.3 cm
biventricular
beneath
mild
Bronchial
spinal
Lateral
bones
peripherally
cardia
other
chest
tricuspid
Sterna

entity: 12388  
unique: 1250

In [9]:
relations = set()

i = 0

# Step 1: Iterate through each document in train_data
for document_key in train_data:
    # Access the 'entities' for the current document
    entities = train_data[document_key]['entities']
    
    # Step 2: Iterate through each entity in 'entities'
    for entity_key in entities:
        # Access the entity and its relations
        entity = entities[entity_key]
        entity_relations = entity['relations']
        
        # Step 3: Iterate through each relation in 'entity_relations'
        for relation in entity_relations:
            # Extract the related entity key and relation type
            related_entity_key = relation[1]
            relation_type = relation[0]
            
            # Get the related entity
            related_entity = entities[related_entity_key]
            
            # Create a tuple with entity1, entity2, and relation
            relation_tuple = (entity['tokens'], related_entity['tokens'], relation_type)
            
            # Add the relation tuple to the set of relations
            relations.add(relation_tuple)

            i += 1

# At this point, 'relations' contains all the relations between entities in the 'train_data'
print(len(relations))
print(i)
# print(relations)

3363
9251


3,363 unique (entity, entity, relation) triples out of 9,251 relation instances
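
The cell above collapses relations into unique triples; a breakdown by relation type is a natural companion. A minimal sketch, assuming each relation is a `[relation_type, target_entity_id]` pair as the loop above treats it; the toy report and its relation names are illustrative values, and `count_relation_types` is a hypothetical helper:

```python
from collections import Counter

def count_relation_types(reports):
    """Count relation instances per relation type across all reports."""
    counts = Counter()
    for report in reports.values():
        for info in report['entities'].values():
            for relation_type, _target_id in info['relations']:
                counts[relation_type] += 1
    return counts

# Toy report (made up) in the same layout as train_data
toy_relations_data = {'r1': {'entities': {
    '1': {'tokens': 'effusion', 'label': 'OBS-DP', 'relations': [['located_at', '2']]},
    '2': {'tokens': 'pleural', 'label': 'ANAT-DP', 'relations': []},
    '3': {'tokens': 'small', 'label': 'OBS-DP', 'relations': [['modify', '1']]},
}}}
relation_type_counts = count_relation_types(toy_relations_data)
print(relation_type_counts.most_common())
```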

## neo4j database setup

In [11]:
from py2neo import Graph, Node, Relationship

# Connect to the Neo4j database
# graph = Graph("bolt://localhost:7689", auth=("neo4j", "neo4j"))

# Create nodes
# alice = Node("Person", name="Alice", age=25)
# bob = Node("Person", name="Bob", age=30)
# graph.create(alice)
# graph.create(bob)

# # Create a relationship
# alice_knows_bob = Relationship(alice, "KNOWS", bob)
# graph.create(alice_knows_bob)
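
To move from the Person example toward the RadGraph triples, one option is to turn each (entity, entity, relation) triple into a Cypher `MERGE` statement. A minimal sketch that only builds the query text and parameters, so it runs without a database; the `Entity` node label, the `name` property, and `triple_to_cypher` are assumptions made for illustration:

```python
import re

def triple_to_cypher(head, tail, relation_type):
    """Build a parameterized Cypher MERGE statement for one (head, tail, relation) triple."""
    # Relationship types cannot be passed as query parameters, so sanitize and inline the type.
    rel = re.sub(r'\W', '_', relation_type).upper()
    query = (
        'MERGE (a:Entity {name: $head}) '
        'MERGE (b:Entity {name: $tail}) '
        f'MERGE (a)-[:{rel}]->(b)'
    )
    return query, {'head': head, 'tail': tail}

query, params = triple_to_cypher('effusion', 'pleural', 'located_at')
print(query)
print(params)
```

With a live connection, the pair could then be handed to something like `graph.run(query, params)`; that step is not exercised here.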

# RadGraph 75 dev set

In [2]:
import json
from pprint import pprint

# path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/train.json'
path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/dev.json'


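# NOTE: train_data holds the dev split in this section
with open(path, 'r') as f:
    train_data = json.load(f)

# Length of data
print('Length of data:', len(train_data))

# Print first 10 keys of data
print('First 10 keys of data:', list(train_data.keys())[:10])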



Length of data: 75
First 10 keys of data: ['p10/p10003412/s59172281.txt', 'p18/p18001816/s54309228.txt', 'p15/p15003878/s57380048.txt', 'p15/p15003878/s59424963.txt', 'p15/p15003878/s54410100.txt', 'p15/p15003878/s56836141.txt', 'p15/p15003878/s55982972.txt', 'p15/p15003878/s55991257.txt', 'p15/p15003878/s51651577.txt', 'p15/p15003878/s51266756.txt']


## to nen file

In [4]:
import json
import pprint

# Step 1: Initialize an empty dictionary to store all unique entities
# structure: entity: {label, reports: [{report_key: {start_ix, end_ix}}], normalization}
all_unique_entities_dict = {}

# running count of all entity mentions
i = 0

# Step 2: Iterate through each report in the train_data
for report_key, report_data in train_data.items():
    for entity_id, entity_info in report_data['entities'].items():
        i += 1
        entity = entity_info['tokens']
        label = entity_info['label']
        start_ix = entity_info['start_ix']
        end_ix = entity_info['end_ix']



        if entity not in all_unique_entities_dict:
            all_unique_entities_dict[entity] = {
                'label': label,
                'reports': [
                    {report_key: 
                        {   'start_ix': start_ix, 
                            'end_ix': end_ix
                        }
                    }
                ],
                'normalization': None
            }
        else:
            all_unique_entities_dict[entity]['reports'].append(
                {report_key: 
                    {   'start_ix': start_ix, 
                        'end_ix': end_ix
                    }
                }
            )

# len of all_unique_entities_dict
print('Length of all_unique_entities_dict:', len(all_unique_entities_dict))
print('Number of entities:', i)

# Save Output all_entities to a JSON file
path = '../resource/all_unique_entities_dev.json'
with open(path , 'w', encoding='utf-8') as f:
    json.dump(all_unique_entities_dict, f, ensure_ascii=False, indent=4)

print(f"All entities have been saved to '{path}'")

Length of all_unique_entities_dict: 482
Number of entities: 2191
All entities have been saved to '../resource/all_unique_entities_dev.json'


## from dev.json to not_normalized_dev.csv

In [5]:
# all_unique_entities_dict is json
# json has entity, (label, reports: list, normalization)
# csv has name,ui,normalized_name,semanticTypes,definition

import csv
import json
import pprint

# create csv, entity = name
file_path = '../resource/not_normalized_dev.csv'
with open(file_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'ui', 'normalized_name', 'semanticTypes', 'definition'])
    for entity, entity_info in all_unique_entities_dict.items():
        writer.writerow([entity, '', '', '', ''])

print(f"CSV file has been saved to '{file_path}'")

CSV file has been saved to '../resource/not_normalized_dev.csv'


# whole RadGraph 500

In [9]:
import json
from pprint import pprint

path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/train.json'


with open(path, 'r') as f:
    train_data = json.load(f)

path = '/DATA1/llm-research/RadGraph/physionet.org/files/radgraph/1.0.0/dev.json'
with open(path, 'r') as f:
    dev_data = json.load(f)

# combine train_data and dev_data
train_data.update(dev_data)

# len of data
print('Length of data:', len(train_data))

Length of data: 500
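
`dict.update` silently overwrites duplicate keys, so the 500 total only confirms that train and dev do not share report keys after the fact. A minimal sketch that checks for overlap before merging and leaves both inputs untouched; `merge_splits` is a hypothetical helper and the dictionaries are toy data:

```python
def merge_splits(train, dev):
    """Merge two report dictionaries, refusing to overwrite shared report keys."""
    overlap = train.keys() & dev.keys()
    if overlap:
        raise ValueError(f'{len(overlap)} report keys appear in both splits')
    merged = dict(train)  # copy, so the original train dictionary is not mutated
    merged.update(dev)
    return merged

toy_train = {'p10/a.txt': {'text': 'train report'}}
toy_dev = {'p15/b.txt': {'text': 'dev report'}}
merged = merge_splits(toy_train, toy_dev)
print(len(merged))
```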


In [12]:
import json
import pprint

# Step 1: Initialize an empty dictionary to store all unique entities
# structure: entity: {label, reports: [{report_key: {start_ix, end_ix}}], normalization}
all_unique_entities_dict = {}

# running count of all entity mentions
i = 0

# Step 2: Iterate through each report in the train_data
for report_key, report_data in train_data.items():
    for entity_id, entity_info in report_data['entities'].items():
        i += 1
        entity = entity_info['tokens']
        label = entity_info['label']
        start_ix = entity_info['start_ix']
        end_ix = entity_info['end_ix']


        if entity not in all_unique_entities_dict:
            all_unique_entities_dict[entity] = {
                'label': label,
                'reports': [
                    {report_key: 
                        {   'start_ix': start_ix, 
                            'end_ix': end_ix
                        }
                    }
                ],
                'normalization': None
            }
        else:
            all_unique_entities_dict[entity]['reports'].append(
                {report_key: 
                    {   'start_ix': start_ix, 
                        'end_ix': end_ix
                    }
                }
            )

# len of all_unique_entities_dict
print('Length of all_unique_entities_dict:', len(all_unique_entities_dict))
print('Number of entities:', i)


# create csv, entity = name
# file_path = '../resource/not_normalized.csv'
# with open(file_path, 'w', newline='') as f:
#     writer = csv.writer(f)
#     writer.writerow(['name', 'ui', 'normalized_name', 'semanticTypes', 'definition'])
#     for entity, entity_info in all_unique_entities_dict.items():
#         writer.writerow([entity, '', '', '', ''])

# print(f"CSV file has been saved to '{file_path}'")

Length of all_unique_entities_dict: 1353
Number of entities: 14579


## read from not_normalized.csv

In [14]:
import csv

file_path = '../resource/not_normalized.csv'
# Read the data from the CSV file
data = []
with open(file_path, 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

    
# Print the length of the data
print(f"Length of data: {len(data)}")

Length of data: 1353
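
Once the `ui` and `normalized_name` columns of the CSV are filled in, the results could be written back into the `normalization` field of the entity dictionary. A minimal sketch under that assumption; `attach_normalization` is a hypothetical helper, and an in-memory CSV with toy rows stands in for `not_normalized.csv`:

```python
import csv
import io

def attach_normalization(entities, csv_file):
    """Copy ui / normalized_name from CSV rows into each entity's 'normalization' field."""
    filled = 0
    for row in csv.DictReader(csv_file):
        name = row['name']
        # Skip rows that have not been normalized yet (empty ui)
        if name in entities and row['ui']:
            entities[name]['normalization'] = {
                'ui': row['ui'],
                'normalized_name': row['normalized_name'],
            }
            filled += 1
    return filled

# Toy data in the same shape as all_unique_entities_dict and the CSV header above
toy_entities = {
    'heart': {'label': 'ANAT-DP', 'reports': [], 'normalization': None},
    'clear': {'label': 'OBS-DP', 'reports': [], 'normalization': None},
}
toy_csv = io.StringIO(
    'name,ui,normalized_name,semanticTypes,definition\n'
    'heart,C0018787,Heart,,\n'
    'clear,,,,\n'
)
filled_count = attach_normalization(toy_entities, toy_csv)
print(filled_count)
```

Entities whose row is still empty keep `normalization` as `None`, which makes the remaining work easy to count.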
