## Working with DICOM Files

This notebook aims to:
* explore how the information in a DICOM file is organized and stored
* determine what to remove from the DICOM header and how to remove it
    * show how to access the ultrasound (US) region coordinates using pydicom
* determine what to remove from the pixel data and how to remove it

The article <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3354356/">Managing DICOM images: Tips and tricks for the radiologist</a>
gives a readable overview of the DICOM format and says this about deidentification:

> The common tags that indicate the patient identity include the patient's name, age, sex, birth date, hospital identity number, ethnic group, occupation, referring physician, institution name, study date, and DICOM Unique Identifiers (UIDs). As described earlier, such demographic information of the patient and a host of other information about the imaging study is encoded within an image header. The data may or may not be displayed on the screen, but the information can be extracted from the header by anyone who has access to the DICOM file. Several educational resources using DICOM files are available for radiology students on the World Wide Web. Creating and accessing such electronic teaching files often involve transmission of DICOM data over the Internet. In the interest of patient confidentiality, all information identifying the patient should be removed from the DICOM header when a DICOM file is uploaded for such purposes.
>
> Respecting the patient's privacy is important when images are used in presentations, teaching files, or publications. A simple and easy method of ensuring this is by converting and exporting the DICOM file into other image formats such as JPEG or TIFF. The header information is lost and patient identity cannot be obtained from the resultant image. Another method is “anonymization,” whereby all patient information is removed from the DICOM header.[3] This is achieved by using software like DicomWorks, ImageJ, and FP Image.[7,8,10] Specifically, all tags contained in groups “0008” (study information) and “0010” (patient information) of the DICOM header should be removed and replaced during anonymization.

## The early pipeline:

First, break the large Datamart file into smaller Datamart files with at most 100 cases each.  (csv_splitter function)
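
The csv_splitter function itself is not shown in this notebook, so the following is only a minimal sketch of what such a splitter might look like. The names `split_csv` and `_write_chunk` are hypothetical, and the sketch assumes one row per case, a header row in the source file, and batch numbering that starts at 0.

```python
import csv
import os


def split_csv(source_file, target_dir, rows_per_file=100, suffix="datamart"):
    """Split source_file into numbered CSVs with at most rows_per_file data rows each.

    Every output file repeats the header row. Returns the list of paths written.
    """
    os.makedirs(target_dir, exist_ok=True)
    written = []

    with open(source_file, "r", newline="") as source:
        reader = csv.reader(source)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written

        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(target_dir, len(written), suffix, header, chunk))
                chunk = []
        if chunk:  # final, partially filled batch
            written.append(_write_chunk(target_dir, len(written), suffix, header, chunk))

    return written


def _write_chunk(target_dir, batch_number, suffix, header, rows):
    # Zero-padded batch number, matching names such as 00007_datamart.csv
    path = os.path.join(target_dir, f"{batch_number:05}_{suffix}.csv")
    with open(path, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

If several rows can belong to the same case, the chunk boundary would need to be chosen by case rather than by row count.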

For each smaller Datamart file that does not begin with PROC:

* Build a matching Notion query file
* Use Notion to retrieve zip archive of dicom files and anonymization map
* Store small datamart file, zip archive, and anon_map file using 00007_datamart.csv, 00007_dicoms.zip, 00007_anon_map.csv
* Append 00007_anon_map.csv to master_anon_map.csv
* Rename 00007_anon_map.csv to PROC_00007_anon_map.csv
* Extract the dicom files from 00007_dicoms.zip and deidentify them.  Save them in data_anon/dicoms/00007_dicoms
* Clean and anonymize the 00007_datamart.csv (add anonymized ids, delete any columns with PHI, rename and reorder columns).  Save in data_anon/dicoms/00007_dicoms
* Compress and archive data_anon/dicoms/00007_dicoms to 00007_dicoms.zip.  Delete folder.
* Append 00007_datamart.csv to master_datamart.csv (will have PHI)
* Rename 00007_datamart.csv to PROC_00007_datamart.csv
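
The naming convention in the steps above (a five-digit batch number, a role suffix, and a PROC_ prefix once a file has been handled) can be sketched with a few small helpers. The helper names `batch_filenames`, `is_processed`, and `mark_processed` are hypothetical and only illustrate the convention.

```python
import os


def batch_filenames(batch_number):
    """Return the file names used for one batch, keyed by role."""
    prefix = f"{batch_number:05}"
    return {
        "datamart": f"{prefix}_datamart.csv",
        "dicoms": f"{prefix}_dicoms.zip",
        "anon_map": f"{prefix}_anon_map.csv",
    }


def is_processed(filename):
    """A file counts as processed once its name starts with PROC_."""
    return os.path.basename(filename).startswith("PROC_")


def mark_processed(path):
    """Rename path so its file name gains the PROC_ prefix; return the new path."""
    directory, filename = os.path.split(path)
    new_path = os.path.join(directory, "PROC_" + filename)
    os.rename(path, new_path)
    return new_path
```

Keeping the prefix logic in one place makes it harder for the different file types to drift toward different naming patterns.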

On the Mayo side we will store:

* All small datamart and anon_map files.
* The Notion zip archives.
* The master_datamart and master_anon_map files.
* No need to store Notion queries, extracted images, etc.
* We can always rebuild de-identified dicoms from the stored zip archives and other saved data.

On the UWL side we will store:

* de-identified dicoms (at least temporarily)
* anonymized small datamart files (at least temporarily)
* Tristan will use these to build the database

## Things to update, fix, refactor

* to make the anonymization safer we should delete everything in the dicom file that we don't intend to keep
    * in the opposite direction we may need to put back the time for each image so we can fill in missing lateralities
* rewrite create_dcm_filename to use dicom_media_type
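
One possible shape for the create_dcm_filename rewrite is to look up the media label directly from the Media Storage SOP Class UID instead of searching the printed element for 'Multi-frame' or 'Secondary'. The sketch below is an assumption about how that could look, not the planned implementation: `MEDIA_LABELS`, `media_label`, and `build_dcm_filename` are hypothetical names, and the function takes plain values so it can run without a DICOM file. With pydicom, a caller would pass values such as `ds.file_meta.MediaStorageSOPClassUID`, `ds.PatientID`, `ds.AccessionNumber`, and `ds.pixel_array.tobytes()`.

```python
from hashlib import sha1

# Storage SOP Class UIDs mapped to the short labels used in file names
MEDIA_LABELS = {
    "1.2.840.10008.5.1.4.1.1.6.1": "image",   # Ultrasound Image Storage
    "1.2.840.10008.5.1.4.1.1.3.1": "video",   # Ultrasound Multi-frame Image Storage
    "1.2.840.10008.5.1.4.1.1.7": "second",    # Secondary Capture Image Storage
}


def media_label(sop_class_uid):
    """Return the file-name label for a SOP class UID, or 'other' if unknown."""
    return MEDIA_LABELS.get(str(sop_class_uid), "other")


def build_dcm_filename(sop_class_uid, patient_id, accession_number, pixel_bytes):
    """Build "type"_"patient id"_"acc num"_"hash".dcm from plain values."""
    media = media_label(sop_class_uid)
    patient_id = str(patient_id).rjust(8, "0")
    accession_number = str(accession_number).rjust(8, "0")
    image_hash = sha1(pixel_bytes).hexdigest()
    return f"{media}_{patient_id}_{accession_number}_{image_hash}.dcm"
```

The labels follow the ones create_dcm_filename already writes ('image', 'video', 'second'); dicom_media_type currently returns 'multi' for multi-frame files, so one of the two would need to change for them to share a lookup.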

## New functionalities to add

* build an anonymization map for biopsy accession numbers
    * incrementally ingest datamart files and add to the biop anon csv
* write a general function that inputs a datamart type file and selects two columns, output is a notion query file (or multiple)
* include the csv splitter function in this notebook
* write a function that inputs datamart type files along with both anonymization maps and outputs a growing anonymized datamart file
* write a function that gets notion query reports and adds the descriptions to the master anonymized datamart file
* write a function that takes biopsy query reports, extracts and adds the laterality to any biopsy anon map file
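
As a starting point for the incrementally built biopsy anonymization map, the sketch below keeps a two-column CSV and only ever appends new identifiers, so earlier assignments stay stable between runs. Everything here is an assumption for illustration: the function name `update_anon_map`, the column names `Original` and `Anonymized`, and the choice of sequential zero-padded ids (which reveal ingestion order and may not be acceptable for the real map).

```python
import csv
import os


def update_anon_map(anon_map_file, original_ids, id_width=8):
    """Add any unseen original_ids to anon_map_file and return the full mapping."""
    mapping = {}
    if os.path.exists(anon_map_file):
        with open(anon_map_file, "r", newline="") as f:
            for row in csv.DictReader(f):
                mapping[row["Original"]] = row["Anonymized"]

    next_id = len(mapping) + 1  # sequential ids; existing entries are never renumbered
    for original in original_ids:
        if original not in mapping:
            mapping[original] = f"{next_id:0{id_width}}"
            next_id += 1

    # Rewrite the whole map so the file always has exactly one header row
    with open(anon_map_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Original", "Anonymized"])
        writer.writerows(mapping.items())

    return mapping
```

Calling the function once per ingested datamart file, with that file's biopsy accession numbers, would grow the map without changing ids that were handed out earlier.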

In [1]:
# imports

import matplotlib.pyplot as plt
from pydicom import dcmread
from hashlib import sha1
import pydicom
import os
import zipfile
import csv
import pandas as pd
import numpy as np

In [2]:
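def anon_callback(ds, element):
    """used with "walk" to loop over dicom and anonymize entries

    Args:
        ds: pydicom dataset that contains the element (required by walk, not used here)
        element: pydicom data element currently being visited

    Returns:
        nothing - the element value is overwritten in place
    """
    names = ['SOP Instance UID','Study Time','Series Time','Content Time',
             'Study Instance UID','Series Instance UID','Private Creator',
             'Media Storage SOP Instance UID',
             'Implementation Class UID']
    
    if element.name in names:
        element.value = "anon"

    # skip empty dates, otherwise the value would become the invalid date "0101"
    if element.VR == "DA" and element.value:
        date = element.value
        date = date[0:4] + "0101" # set all dates to YYYY0101
        element.value = date

    if element.VR == "TM":
        element.value = "anon"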

def deidentify_dicom_dataset( ds ):
    """remove patient information from pydicom dataset that Notion has partially anonymized

    Args:
        ds: pydicom dataset from reading dicom file with pydicom

    Returns:
        ds: the same pydicom dataset, modified in place, with private information
            removed from the header and the rows above the US region set to zero
    """

    ds.remove_private_tags() # take out private tags added by notion or otherwise

    ds.file_meta.walk(anon_callback)
    ds.walk(anon_callback)

    media_type = ds.file_meta[0x00020002]
    is_video = str(media_type).find('Multi-frame')>-1
    is_secondary = str(media_type).find('Secondary')>-1
    if is_secondary:
        y0 = 101
    else:
        if (0x0018, 0x6011) in ds:
            y0 = ds['SequenceOfUltrasoundRegions'][0]['RegionLocationMinY0'].value
        else:
            y0 = 101

    if 'OriginalAttributesSequence' in ds:
        del ds.OriginalAttributesSequence

    # crop patient info above US region 

    arr = ds.pixel_array
    
    if is_video:
        arr[:,:y0] = 0
    else:
        arr[:y0] = 0

    ds.PixelData = arr.tobytes()

    return ds

def create_dcm_filename( ds ):
    """uses info from dicom file to create informative filename

    Args:
        ds:  dataset extracted from dicom file

    Returns:
        filename:  "type"_"patient id"_"acc num"_"hash".dcm
             "type" is image, video (multi-frame array of images), or second (weird type of image, rare)
             "patient id" is already anonymized by Notion
             "acc num" is already anonymized by Notion
             "hash" is sha1 hash created from ds.pixel_array (could use to check for duplicates)
    """
    patient_id = ds.PatientID.rjust(8,'0')
    accession_number = ds.AccessionNumber.rjust(8,'0')

    media_type = ds.file_meta[0x00020002]
    is_video = str(media_type).find('Multi-frame')>-1
    is_secondary = str(media_type).find('Secondary')>-1
    
    if is_video:
        media = 'video'
    elif is_secondary:
        media = 'second'
    else:
        media = 'image'

    image_hash = sha1( ds.pixel_array ).hexdigest()

    filename = f'{media}_{patient_id}_{accession_number}_{image_hash}.dcm'

    return filename

def dicom_media_type( dataset ):
    """classify a dicom dataset by its Media Storage SOP Class UID"""
    sop_class_uid = str( dataset.file_meta[0x00020002].value )
    if sop_class_uid == '1.2.840.10008.5.1.4.1.1.6.1': # single ultrasound image
        return 'image'
    elif sop_class_uid == '1.2.840.10008.5.1.4.1.1.3.1': # multi-frame ultrasound image
        return 'multi'
    else:
        return 'other' # something else

def extract_deidentify_dcm_files(directory, target_directory):
    
    # Create the target directory if it doesn't exist
    os.makedirs(target_directory, exist_ok=True)
    
    # Get a list of all ZIP files in the directory
    zip_files = [filename for filename in os.listdir(directory) if filename.endswith('.zip') and not filename.startswith('PROC_')]
    
    # Loop over each ZIP file
    for zip_file in zip_files:
        # create target subdirectory
        zip_name, extension = os.path.splitext(zip_file)
        target_subdirectory = os.path.join(target_directory, zip_name + '_anon')
        os.makedirs(target_subdirectory, exist_ok = True)
        
        # Open the ZIP file
        zip_path = os.path.join(directory, zip_file)
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            # Loop over each file in the ZIP file
            for member in zip_ref.namelist():
                if member.endswith('.dcm'):

                    # Read the DICOM file using PyDICOM
                    with zip_ref.open(member, 'r') as dicom_file:
                        dataset = pydicom.dcmread(dicom_file)

                        # check to make sure dicom has image or multi-frame video, else ignore
                        # if it's an image, make sure SequenceOfUltrasoundRegions is present
                        media_type = dicom_media_type( dataset )
                        if (media_type == 'image' and (0x0018, 0x6011) in dataset) or media_type=='multi':
    
                            # remove patient information from the dicom dataset
                            dataset = deidentify_dicom_dataset(dataset)
    
                            # create new filename from header and hashed image
                            new_filename = create_dcm_filename( dataset ) 
                            
                            # Set the target path to write the DICOM file
                            target_path = os.path.join(target_subdirectory, new_filename)
                        
                            # Write the DICOM dataset to a new DICOM file
                            dataset.save_as(target_path)
        
        # Rename the processed file with 'PROC_' at the beginning
        processed_file = os.path.join(directory, 'PROC_' + zip_file)
        os.rename(zip_path, processed_file)

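def check_uncompressed( ds ):
    """return True if the dataset's transfer syntax stores pixel data uncompressed"""
    transfer_syntax = ds.file_meta.TransferSyntaxUID
    uncompressed_types = ['1.2.840.10008.1.2.1','1.2.840.10008.1.2.2','1.2.840.10008.1.2']
    return transfer_syntax in uncompressed_types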

def extract_deidentify_dcm_file(zip_path, zip_file, target_directory):
    
    os.makedirs(target_directory, exist_ok = True)
    
    # Open the ZIP file
    full_path_zip_file = os.path.join(zip_path, zip_file)
    with zipfile.ZipFile(full_path_zip_file, 'r') as zip_ref:
        # Loop over each file in the ZIP file
        for member in zip_ref.namelist():
            if member.endswith('.dcm'):

                # Read the DICOM file using PyDICOM
                with zip_ref.open(member, 'r') as dicom_file:
                    #print(dicom_file)
                    dataset = pydicom.dcmread(dicom_file)

                    # check to make sure dicom has image or multi-frame video, else ignore
                    # if it's an image, make sure SequenceOfUltrasoundRegions is present
                    media_type = dicom_media_type( dataset )

                    # dicom_media_type only returns 'image', 'multi', or 'other', so look
                    # at the SOP class element itself to report secondary captures
                    is_secondary = str(dataset.file_meta[0x00020002]).find('Secondary')>-1
                    if is_secondary:
                        print('SECONDARY:', member)
                    
                    if (( media_type == 'image' and (0x0018, 0x6011) in dataset) or media_type=='multi'):

                        # if image is compressed, decompress it and change colorspace if needed
                        is_compressed = not check_uncompressed(dataset)
                        if is_compressed:
                            dataset.decompress()
                            arr = dataset.pixel_array
                            color_space_in = dataset.PhotometricInterpretation
                            if color_space_in not in ['MONOCHROME2','RGB']:
                                color_space_out = 'RGB'
                                arr = pydicom.pixel_data_handlers.util.convert_color_space(arr, 
                                                                                           color_space_in, 
                                                                                           color_space_out, 
                                                                                           True)
                                dataset.PixelData = arr.tobytes()
                                dataset.PhotometricInterpretation = color_space_out
                                

                        # remove patient information from the dicom dataset
                        dataset = deidentify_dicom_dataset(dataset)
                        
                        # create new filename from header and hashed image
                        new_filename = create_dcm_filename( dataset ) 
                        
                        # Set the target path to write the DICOM file
                        target_path = os.path.join(target_directory, new_filename)
                    
                        # Write the DICOM dataset to a new DICOM file
                        #print(dicom_file)
                        dataset.save_as(target_path)
    
    # Rename the processed file with 'PROC_' at the beginning
    new_zip_file = os.path.join(zip_path,f'PROC_{zip_file}')
    os.rename(full_path_zip_file, new_zip_file)

def append_to_csv(target_file, input_file):
    # Add prefix "PROC_" to the input filename
    input_dir = os.path.dirname(input_file)
    input_filename = os.path.basename(input_file)
    input_filename_with_prefix = "PROC_" + input_filename
    input_file_with_prefix = os.path.join(input_dir, input_filename_with_prefix)

    # Check if the target file exists
    target_exists = os.path.exists(target_file)

    # Open the input file for reading
    with open(input_file, 'r', newline='') as input_csv_file:
        input_csv_reader = csv.reader(input_csv_file)
        input_rows = list(input_csv_reader)

    # Check if the input file has any rows
    if len(input_rows) == 0:
        print("Input file is empty. No rows to append.")
        return

    # Split the input into its header and its data rows
    header, data_rows = input_rows[0], input_rows[1:]

    # Read the existing rows of the target file (empty if it does not exist yet)
    existing_rows = []
    if target_exists:
        with open(target_file, 'r', newline='') as target_csv_file:
            existing_rows = list(csv.reader(target_csv_file))

    # Open the target file for appending
    with open(target_file, 'a', newline='') as target_csv_file:
        target_csv_writer = csv.writer(target_csv_file)

        # If the target file is missing or empty, start it with the input header
        if not existing_rows:
            target_csv_writer.writerow(header)
            existing_rows.append(header)

        # Append only the data rows that are not already in the target
        # (the earlier version compared against input_rows[0] after slicing off
        # the header, which silently dropped the first data row)
        for row in data_rows:
            if row not in existing_rows:
                target_csv_writer.writerow(row)
                existing_rows.append(row)

    # Rename the input file with the "PROC_" prefix
    os.rename(input_file, input_file_with_prefix)


def remove_proc_prefix(folder):
    # Get a list of all files in the folder
    files = os.listdir(folder)
    
    # Loop over each file
    for filename in files:
        old_path = os.path.join(folder, filename)
        
        # Check if the file starts with 'PROC_'
        if filename.startswith('PROC_'):
            new_filename = filename[5:]  # Remove the 'PROC_' prefix
            new_path = os.path.join(folder, new_filename)
            
            # Rename the file by removing the 'PROC_' prefix
            os.rename(old_path, new_path)

def remove_spaces_from_column_names(dataframe):
    dataframe.columns = dataframe.columns.str.replace(' ', '')
    return dataframe

def merge_anon_into_datamart(datamart_csv_file, anon_map_csv_file, target_directory):

    """use anonymization map to replace accession and patient numbers in datamart file, remove patient info, and rename datamart features

    Args:
        datamart_csv_file: full path to unanonymized datamart csv file
        anon_map_csv_file: full path to anonymization map file
        target_directory:  path where the anonymized datamart file is saved as datamart_anon.csv

    Returns:
        nothing - output is csv file on disk

    """
 
    datamart_df = remove_spaces_from_column_names( pd.read_csv(datamart_csv_file) )

    print(anon_map_csv_file)
    anon_map_df = remove_spaces_from_column_names( pd.read_csv(anon_map_csv_file) )
    
    datamart_joined_df = pd.merge(datamart_df, anon_map_df,left_on='ACCESSIONNUMBER',right_on='OriginalAccessionNumber',how='left')

    col_to_keep = ['AnonymizedPatientID',
                   'AnonymizedAccessionNumber',
                   'SCORE_CD',
                   'BIOP_SCORE',
                   'SEQ',
                   'A1_PATHOLOGY_TXT',
                   'DENSITY_TXT',
                   'AGE',
                   'RACE',
                   'ETHNICITY']
    
    new_names = {"AGE":"Age",
                 "RACE":"Race",
                 "ETHNICITY":"Ethnicity",
                 "DENSITY_TXT":"Density_Desc",
                 "A1_PATHOLOGY_TXT":"Path_Desc",
                 "SCORE_CD":"BI-RADS",
                 "SEQ":"Biop_Seq",
                 "BIOP_SCORE":"Biopsy",
                 "AnonymizedPatientID":"Patient_ID",
                 "AnonymizedAccessionNumber":"Accession_Number"}
                   
    datamart_joined_df = datamart_joined_df[col_to_keep]
    datamart_joined_df.rename(columns = new_names, inplace=True)
    datamart_joined_df.sort_values(by=['Patient_ID','Accession_Number'], inplace=True)

    filename = 'datamart_anon.csv'
    full_path = os.path.join(target_directory, filename)

    datamart_joined_df.to_csv(full_path, index = False)
    

def concat_csv_files(directory_path, target_file, remove_target=False):
    """
    Concatenate CSV files from a directory (without PROC_ prefix) into a target CSV file.

    Args:
        directory_path (str): Path to the directory containing CSV files.
        target_file (str): Path to the target CSV file where merged data will be saved.
        remove_target (bool, optional): If True, the target file will be removed before appending. Default is False.
    """

    # Remove the target file if it exists and remove_target is True
    if remove_target and os.path.exists(target_file):
        os.remove(target_file)

    # Get a list of CSV files without the PROC_ prefix
    csv_files = [file for file in os.listdir(directory_path) if file.endswith('.csv') and not file.startswith('PROC_')]

    # Initialize a set to store the appended rows to avoid duplicates
    appended_rows = set()

    # If the target file exists, read its contents and populate the set
    if os.path.exists(target_file):
        with open(target_file, 'r', newline='') as target:
            target_reader = csv.reader(target)
            appended_rows.update(tuple(row) for row in target_reader)

    # Loop over each CSV file and append its unique contents to the target file
    with open(target_file, 'a', newline='') as target:
        target_writer = csv.writer(target)

        for index, file in enumerate(csv_files):
            file_path = os.path.join(directory_path, file)
            with open(file_path, 'r', newline='') as source:
                source_reader = csv.reader(source)

                # Skip the first row (headers) of all files except the first file;
                # the default of None keeps an empty file from raising StopIteration
                if index > 0:
                    next(source_reader, None)

                # Append the unique contents of the current CSV file to the target file
                for row in source_reader:
                    row_tuple = tuple(row)
                    if row_tuple not in appended_rows:
                        target_writer.writerow(row)
                        appended_rows.add(row_tuple)

            # Add a PROC_ prefix to the current CSV file after it's appended
            os.rename(file_path, os.path.join(directory_path, 'PROC_' + file))


## Total Processing Loop

In [3]:
path_anon_maps = r'./data_orig/notion_anon_maps/'
path_datamart_splits = r'./data_orig/datamart_splits/processed/'
path_notion_queries = r'./data_orig/notion_queries/'
path_dicom_zips = r'./data_orig/zip/'
path_datamart_master = r'./data_orig/'
path_anon_map_master = r'./data_orig/'
path_data_anon = r'./data_anon/'

datamart_master_file = 'master_datamart.csv'
anon_map_master_file = 'master_anon_map.csv'

full_path_datamart_master_file = os.path.join(path_datamart_master, datamart_master_file)
full_path_anon_map_master_file = os.path.join(path_anon_map_master, anon_map_master_file)


In [4]:
clean_folders = True
if clean_folders:
    remove_proc_prefix( path_anon_maps )

directory_path = path_anon_maps
target_file = full_path_anon_map_master_file
concat_csv_files(directory_path, target_file, remove_target=True)

# get list of unprocessed datamart_split files
datamart_splits_files = [filename for filename in os.listdir(path_datamart_splits) if filename.endswith('.csv') and not filename.startswith('PROC')]

# main processing loop
for datamart_file in datamart_splits_files:
    # get batch number
    batch_number_str = datamart_file.split('_')[0]

    # process the zip files, deidentify the dicoms, and add them to target directory    
    dicoms_zip_file = f'{batch_number_str}_dicoms.zip'
    target_directory = os.path.join(path_data_anon, f'{batch_number_str}_dicoms_anon/')
    extract_deidentify_dcm_file(path_dicom_zips, dicoms_zip_file, target_directory)

    # merge anon_map into datamart split file and clean out all PHI
    datamart_csv_file = os.path.join(path_datamart_splits,f'{batch_number_str}_datamart.csv')
    merge_anon_into_datamart(datamart_csv_file, full_path_anon_map_master_file, target_directory)
    
    # add datamart split file to master datamart file and add PROC_ prefix
    append_to_csv( full_path_datamart_master_file, datamart_csv_file)



./data_orig/master_anon_map.csv
./data_orig/master_anon_map.csv
./data_orig/master_anon_map.csv
./data_orig/master_anon_map.csv
./data_orig/master_anon_map.csv

FileNotFoundError: [Errno 2] No such file or directory: './data_orig/zip/00127_dicoms.zip'
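
The traceback above shows the loop stopping as soon as one batch has no matching zip archive. A hedged sketch of a pre-check that separates batches with an archive from those without is given below. `split_batches_by_zip` is a hypothetical helper; it assumes the `<batch>_dicoms.zip` naming used in the loop and does not look for archives that already carry the PROC_ prefix.

```python
import os


def split_batches_by_zip(datamart_files, zip_directory):
    """Return (ready, missing) lists of batch numbers, based on whether <batch>_dicoms.zip exists."""
    ready, missing = [], []
    for datamart_file in datamart_files:
        batch = datamart_file.split("_")[0]  # e.g. '00127' from '00127_datamart.csv'
        zip_path = os.path.join(zip_directory, f"{batch}_dicoms.zip")
        if os.path.isfile(zip_path):
            ready.append(batch)
        else:
            missing.append(batch)
    return ready, missing
```

Running the main loop over `ready` only, and printing `missing`, would let the remaining batches finish while still reporting which archives have to be retrieved from Notion.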

In [24]:
# merge anon_map into datamart split file and clean out all PHI
datamart_csv_file = os.path.join(path_datamart_splits,f'{batch_number_str}_datamart.csv')
merge_anon_into_datamart(datamart_csv_file, full_path_anon_map_master_file, target_directory)

./data_orig/master_anon_map.csv


In [25]:
append_to_csv( full_path_datamart_master_file, datamart_csv_file)

In [8]:
datamart_splits_files

['00095_datamart.csv', '00096_datamart.csv']

In [75]:
def get_dcm_files(directory):
    dcm_files = []
    
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.dcm'):
                dcm_files.append(os.path.join(root, file))
    
    return dcm_files

dcm_files = get_dcm_files('./data_anon')

print(f'There were {len(dcm_files)} dicom files found.')

There were 8147 dicom files found.
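
Because create_dcm_filename ends every file name with a sha1 hash of the pixel data, the list of paths can also be used to look for duplicate images, as its docstring suggests. The following is a minimal sketch under that assumption; `find_duplicate_images` is a hypothetical helper and relies on the hash being the text after the last underscore in the file name.

```python
import os
from collections import defaultdict


def find_duplicate_images(dcm_paths):
    """Group paths by the hash at the end of the file name; return groups with more than one file."""
    by_hash = defaultdict(list)
    for path in dcm_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        image_hash = stem.rsplit("_", 1)[-1]  # text after the last underscore
        by_hash[image_hash].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Since the hash is computed after the rows above the US region are zeroed, two files in the same group have identical masked pixel data, even if their original headers differed.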


In [None]:
# this was one-time code for converting all the datamart split files
# to notion query files, keep the code in case notion rejects the converted files

def datamart_to_notion_query( datamart_file, notion_query_file):
    """read datamart csv and write out notion_query xlsx file

    Args:
        datamart_file:  string with full path filename to datamart file (csv)
        notion_query_file:  target filename (xlsx)

    Returns:
        none

    """
    datamart_df = pd.read_csv( datamart_file )

    col_names = [
        'PatientName', 'PatientID', 'AccessionNumber', 
        'PatientBirthDate', 'StudyDate', 'ModalitiesInStudy', 
        'StudyDescription', 'AnonymizedName', 'AnonymizedID']
    
    notion_query_df = pd.DataFrame(
        {'PatientID': datamart_df['PATIENTID'],
         'AccessionNumber': datamart_df['ACCESSIONNUMBER']}, 
        columns=col_names)

    notion_query_df.to_excel(notion_query_file, index=False)

for batch in np.arange(19,127):
    batch_string = f'{batch:05}'
    datamart_file = f'./data_orig/datamart_splits/unprocessed/{batch_string}_datamart.csv'
    notion_query_file = f'./data_orig/notion_queries/{batch_string}_notion_query.xlsx'
    datamart_to_notion_query( datamart_file, notion_query_file )

'0000000'

In [None]:
import csv
import os


def append_to_csv(target_file, input_file, columns):
    # Add prefix "PROC_" to the input filename
    input_dir = os.path.dirname(input_file)
    input_filename = os.path.basename(input_file)
    input_filename_with_prefix = "PROC_" + input_filename
    input_file_with_prefix = os.path.join(input_dir, input_filename_with_prefix)

    # Check if the target file exists
    target_exists = os.path.exists(target_file)

    # Open the input file for reading
    with open(input_file, 'r', newline='') as input_csv_file:
        input_csv_reader = csv.reader(input_csv_file)
        input_rows = list(input_csv_reader)

    # Check if the input file has any rows
    if len(input_rows) == 0:
        print("Input file is empty. No rows to append.")
        return

    # Determine if the target file already has a header
    target_has_header = False
    if target_exists:
        with open(target_file, 'r', newline='') as target_csv_file:
            target_csv_reader = csv.reader(target_csv_file)
            target_has_header = next(target_csv_reader, None) is not None

    # Open the target file for appending
    with open(target_file, 'a', newline='') as target_csv_file:
        target_csv_writer = csv.writer(target_csv_file)

        # If target file doesn't have a header, write the header from the input file
        if not target_has_header:
            target_csv_writer.writerow(input_rows[0])

        # The header is handled above, so only the data rows are candidates for appending
        # (leaving it in input_rows would write it a second time into a new target file)
        input_rows = input_rows[1:]

        # Get the existing rows in the target file
        existing_rows = []
        if target_exists:
            with open(target_file, 'r', newline='') as target_csv_file:
                existing_csv_reader = csv.reader(target_csv_file)
                existing_rows = list(existing_csv_reader)

        # Append only the non-duplicate rows
        for row in input_rows:
            match_found = False
            for existing_row in existing_rows:
                if all(existing_row[col] == row[col] for col in columns):
                    match_found = True
                    break

            if not match_found:
                target_csv_writer.writerow(row)
                existing_rows.append(row)

    print("Rows appended successfully!")

    # Rename the input file with the "PROC_" prefix
    os.rename(input_file, input_file_with_prefix)
    print("Input file renamed with prefix:", input_file_with_prefix)


# Example usage
target_csv_file = 'target.csv'
input_csv_file = 'input.csv'
columns_to_check = [0, 1]  # Example columns to check (0-based indices)

append_to_csv(target_csv_file, input_csv_file, columns_to_check)


In [51]:
ds_problem1 = pydicom.dcmread('./problem.dcm')
ds_problem2 = pydicom.dcmread('./problem2.dcm')
ds_no_problem = pydicom.dcmread('./no_problem.dcm')
ds_problem3 = pydicom.dcmread('./problem3.dcm')

In [None]:
media_type = ds.file_meta[0x00020002]
is_video = str(media_type).find('Multi-frame')>-1
is_secondary = str(media_type).find('Secondary')>-1
if is_secondary:
    y0 = 101
else:
    y0 = ds['SequenceOfUltrasoundRegions'][0]['RegionLocationMinY0'].value

In [39]:
ds_problem2

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 194
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: Secondary Capture Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: 1.2.840.113713.17.50545756.3788426393483279394079865510187416286
(0002, 0010) Transfer Syntax UID                 UI: Explicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: 1.2.40.0.13.1.1
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.28'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'SMALL PARTS', '0011']
(0008, 0012) Instance Creation Date              DA: '20220916'
(0008, 0013) Instance Creation Time              TM: '111644'
(0008, 0016) SOP Class UID                       UI: Secondary Ca

In [19]:
ds2 = pydicom.dcmread('./no_problem.dcm')

In [23]:
ds1

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 194
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: Secondary Capture Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: 1.2.840.113713.17.50545756.5031917645990166336542510695148645574
(0002, 0010) Transfer Syntax UID                 UI: Explicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: 1.2.40.0.13.1.1
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.28'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'SMALL PARTS', '0001']
(0008, 0016) SOP Class UID                       UI: Secondary Capture Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.2.840.113713.17.50545756.50319176459901663365425106

In [52]:
(0x0018,0x6011) in ds_problem3

False

In [29]:
media_type = ds1.file_meta[0x00020002]

In [30]:
media_type

(0002, 0002) Media Storage SOP Class UID         UI: Secondary Capture Image Storage

In [31]:
ds = ds1

In [33]:
( media_type == 'image' and (0x0018, 0x6011) in ds) or media_type=='multi'

False

In [40]:
( media_type == 'image' and (0x0018, 0x6011) in ds) 

False

In [43]:
media_type = ds_problem2.file_meta[0x00020002]
media_type=='image'

False

In [53]:
ds_problem3

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 194
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: Ultrasound Multi-frame Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: 1.2.840.113713.17.50545756.7983231200671517409213250989291907106
(0002, 0010) Transfer Syntax UID                 UI: Implicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: 1.2.40.0.13.1.1
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.28'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', '', '0001']
(0008, 0016) SOP Class UID                       UI: Ultrasound Multi-frame Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.2.840.113713.17.50545756.798323120067151740921325098

In [58]:
 ds_new = deidentify_dicom_dataset(ds_problem3)

KeyError: (0018, 6011)

In [46]:
ds_new

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 194
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: Secondary Capture Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: anon
(0002, 0010) Transfer Syntax UID                 UI: Explicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: anon
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.28'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'SMALL PARTS', '0011']
(0008, 0012) Instance Creation Date              DA: '20220101'
(0008, 0013) Instance Creation Time              TM: 'anon'
(0008, 0016) SOP Class UID                       UI: Secondary Capture Image Storage
(0008, 0018) SOP Instance UID                    UI: 