<a href="https://colab.research.google.com/github/Sunantha17/dicom_metadata_extraction/blob/main/DICOM_metadata_extracting_attributes_to_dataframe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DICOM metadata

This notebook extracts every metadata attribute found in a set of DICOM files and collects the values in a DataFrame, which is then written to a .csv file so it can be reused from other notebooks. On a large dataset the process can take a few minutes and use a lot of memory, so it is best run just once.

In another notebook I use this data for some initial research into what each attribute means and for an EDA. This notebook only dumps the data to CSV.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pydicom
from tqdm import tqdm
import os

1. **File listing:**

Collect the paths of all files in the training set that have a .dcm extension.

In [None]:
dcms = []
for root, _dirs, fnames in os.walk('/path/folder/dicom_files'):
    # Keep the full path of every file with a .dcm extension
    dcms.extend(os.path.join(root, f) for f in fnames if f.endswith('.dcm'))
print(f'There are {len(dcms)} CT scans')

There are 550 CT scans
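
If some files happen to carry an upper-case extension such as `.DCM`, the `endswith('.dcm')` check skips them. A minimal sketch of a case-insensitive listing with `pathlib` follows; the helper name `list_dicom_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def list_dicom_files(folder):
    """Return the sorted paths of all files under `folder` with a .dcm extension (any case)."""
    return sorted(
        str(p) for p in Path(folder).rglob('*')
        if p.is_file() and p.suffix.lower() == '.dcm'
    )
```

Sorting the result also keeps the row order of the final table stable between runs.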


2. **Attribute name listing:**

Next, collect every attribute that appears in any of the DICOM files. The .dir() method, which lists the attribute names of a file, comes in handy for this. Attributes vary from file to file, so inspecting a single file is not enough. This can take a few minutes on a large dataset; the 550 files used here are processed in under a second.

In [None]:
attrs = set()
for fname in tqdm(dcms):
    with pydicom.dcmread(fname) as obj:
        attrs.update(obj.dir())

100%|██████████| 550/550 [00:00<00:00, 826.73it/s]
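
Only the attribute names are needed at this stage, so the pixel data does not have to be read at all. The sketch below assumes `pydicom.dcmread` is called with `stop_before_pixels=True`, which stops reading before the pixel data element. `read_header` and `collect_keywords` are hypothetical helper names, and the reader is passed in as a parameter so the collection logic can be tried with any object that offers a `.dir()` method.

```python
def read_header(path):
    """Read one DICOM file without its pixel data (assumes pydicom is installed)."""
    import pydicom  # imported lazily so collect_keywords works with other readers too
    return pydicom.dcmread(path, stop_before_pixels=True)

def collect_keywords(paths, read=read_header):
    """Return the set of attribute names present in at least one file."""
    keywords = set()
    for path in paths:
        keywords.update(read(path).dir())
    return keywords
```

With this option `PixelData` will not show up among the collected names, so any later step that expects it must tolerate its absence.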


`attrs` now holds every attribute name found in the files. We drop PixelData, which contains the actual image and would make us run out of memory, and PatientName, so that the output stays anonymous.

In [None]:
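# PixelData is the actual array of pixels, not metadata;
# PatientName is dropped to keep the data anonymous.
# Filtering (rather than list.remove) avoids a ValueError if a name is absent.
excluded = {'PixelData', 'PatientName'}
dcm_keys = [key for key in attrs if key not in excluded]
dcm_keys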

['PatientID',
 'GraphicLayerSequence',
 'ValueType',
 'FluoroscopyFlag',
 'InstanceCreationTime',
 'InstitutionAddress',
 'InstanceNumber',
 'StudyDescription',
 'Columns',
 'PresentationLUTShape',
 'VerifyingObserverSequence',
 'ContrastBolusStartTime',
 'ConceptNameCodeSequence',
 'CompletionFlag',
 'FilterType',
 'DerivationDescription',
 'SamplesPerPixel',
 'GraphicGroupSequence',
 'ContinuityOfContent',
 'ScheduledProcedureStepEndDate',
 'Modality',
 'DataCollectionDiameter',
 'ScanOptions',
 'ReferencedPerformedProcedureStepSequence',
 'SliceLocation',
 'Rows',
 'ContentTemplateSequence',
 'Manufacturer',
 'RequestAttributesSequence',
 'StudyInstanceUID',
 'LossyImageCompression',
 'ImageOrientationPatient',
 'AcquisitionType',
 'TableFeedPerRotation',
 'ConversionType',
 'SOPClassUID',
 'RescaleSlope',
 'DateOfSecondaryCapture',
 'LossyImageCompressionMethod',
 'AcquisitionDate',
 'PatientOrientation',
 'WindowCenter',
 'PatientPosition',
 'PerformedProcedureStepEndTime',
 'Perf
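
The iteration order of a set of strings is not guaranteed to be the same from one Python session to the next, so the column order derived from `attrs` can change between runs. A minimal sketch that excludes the unwanted attributes and sorts the rest for a reproducible order follows; `select_keys` and `EXCLUDED` are hypothetical names.

```python
EXCLUDED = frozenset({'PixelData', 'PatientName'})

def select_keys(attrs, excluded=EXCLUDED):
    """Return the attribute names to keep, in alphabetical order."""
    return sorted(set(attrs) - set(excluded))

print(select_keys({'Rows', 'PixelData', 'Columns'}))  # ['Columns', 'Rows']
```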

3. **Load the actual values from the files:**

If an attribute is not present in a file, we store np.nan in its place. We also cast some pydicom types to standard Python types to make things easier.

In [None]:
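meta = []
# Map pydicom-specific types to plain Python equivalents
typemap = {
    pydicom.uid.UID: str,
    pydicom.multival.MultiValue: list
}
def cast(x):
    # Convert x if its exact type is in typemap; otherwise return it unchanged
    return typemap.get(type(x), lambda x: x)(x)

for fname in tqdm(dcms):
    with pydicom.dcmread(fname) as obj:
        # Attributes missing from this file become np.nan
        meta.append([cast(obj.get(key, np.nan)) for key in dcm_keys])

dfmeta = pd.DataFrame(meta, columns=dcm_keys)
dfmeta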

100%|██████████| 550/550 [00:01<00:00, 545.06it/s]


| | PatientID | GraphicLayerSequence | ValueType | FluoroscopyFlag | InstanceCreationTime | InstitutionAddress | InstanceNumber | StudyDescription | Columns | PresentationLUTShape | ... | WindowWidth | ConvolutionKernel | DeviceSerialNumber | ManufacturerModelName | SeriesDate | RequestedProcedureDescription | ImageComments | ReferencedSeriesSequence | InstitutionalDepartmentName | ImageType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 74 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA\Sag-MIP | | ID_DEPARTMENT | [DERIVED, PRIMARY, MPR] |
| 1 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 67 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA | | ID_DEPARTMENT | [ORIGINAL, PRIMARY, AXIAL] |
| 2 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 56 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA\Sag-MIP | | ID_DEPARTMENT | [DERIVED, PRIMARY, MPR] |
| 3 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 18 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA\Sagittal | | ID_DEPARTMENT | [DERIVED, PRIMARY, MPR] |
| 4 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 80 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA | | ID_DEPARTMENT | [ORIGINAL, PRIMARY, AXIAL] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 545 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 71 | CTA CHEST WO W CON | 512.0 | | ... | 1600.0 | FC50 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA | | ID_DEPARTMENT | [ORIGINAL, PRIMARY, AXIAL] |
| 546 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 19 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA\Sagittal | | ID_DEPARTMENT | [DERIVED, PRIMARY, MPR] |
| 547 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 1 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA\Cor-MIP | | ID_DEPARTMENT | [DERIVED, PRIMARY, MPR] |
| 548 | OH001Q161421 | | | NO | | TOSHIBA_MEC | 17 | CTA CHEST WO W CON | 512.0 | | ... | 400.0 | FC07 | FLB1662113 | Aquilion PRIME | 20230105 | CTA CHEST W CON | CTA | | ID_DEPARTMENT | [ORIGINAL, PRIMARY, AXIAL] |
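
The `typemap` lookup matches on the exact type of a value, so a value whose class is only a subclass of a standard type is left unchanged unless that class is listed explicitly. A sketch of an `isinstance`-based alternative follows. It assumes the pydicom value classes of interest either subclass `str`, `int` or `float`, or implement `collections.abc.MutableSequence`; this is believed to hold for `UID` and `MultiValue` in recent pydicom releases, but is worth checking against the installed version. The name `to_builtin` is hypothetical.

```python
from collections.abc import MutableSequence

def to_builtin(value):
    """Convert subclasses of str/int/float and list-like values to plain Python types."""
    if isinstance(value, bool):
        return value  # bool subclasses int; keep it as a bool
    for base in (str, int, float):
        if isinstance(value, base):
            return base(value)
    if isinstance(value, MutableSequence):
        # Convert list-like containers element by element
        return [to_builtin(item) for item in value]
    return value  # anything else is returned unchanged
```

Values of other types, including `np.nan` placeholders (plain floats), pass through with their value intact.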


4. **Writing to a CSV file:**

In [None]:
dfmeta.to_csv('filename.csv', index=False)
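
List-valued columns such as ImageType are written to the CSV as text, so they come back as plain strings when the file is read again. A minimal sketch of a converter that turns such a cell back into a list follows, assuming the cell holds the Python representation of a list of simple values (for example `['DERIVED', 'PRIMARY', 'MPR']`); `parse_list_cell` is a hypothetical helper.

```python
import ast

def parse_list_cell(text):
    """Parse a CSV cell such as "['DERIVED', 'PRIMARY', 'MPR']" back into a list.

    Empty cells (missing values) are returned as None.
    """
    text = text.strip()
    if not text:
        return None
    return ast.literal_eval(text)
```

It could then be applied to selected columns when reloading, for instance `pd.read_csv('filename.csv', converters={'ImageType': parse_list_cell})`. It should only be used on columns known to contain lists, since free-text cells are not valid Python literals.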