### Add DICOM metadata to the master key

<details>
<summary>STEP 1 BIG PICTURE</summary>
We collected data from the centers in folders named by patient ID (e.g. admission). We want to clean up these directories so that:
I: Each CT study is placed in its own folder.
II: Cases are stored in an Excel file that lists each case's DICOM files alongside all other variables (outcome, clinical, and pathology data). We call this file the master key; it also contains the un-anonymized patient ID along with the key used for anonymization.
III: DICOM-only files are transferred to a new destination and the images are anonymized.
</details>
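
As a rough illustration of item II, a master key can be thought of as one row per patient that pairs the original ID with an anonymization key. The sketch below uses only the standard library; the column names (`patient_id`, `anon_id`), the `PT0001`-style key format, and the helpers `build_master_key` and `master_key_to_csv` are assumptions made for illustration, not the layout actually used in this project.

```python
import csv
import io

def build_master_key(patient_ids, prefix="PT"):
    """Assign a sequential anonymized ID to each unique patient ID (hypothetical scheme)."""
    rows = []
    seen = {}
    for pid in patient_ids:
        if pid not in seen:
            seen[pid] = f"{prefix}{len(seen) + 1:04d}"
            rows.append({"patient_id": pid, "anon_id": seen[pid]})
    return rows

def master_key_to_csv(rows):
    """Serialize the master key rows as CSV text (an Excel writer could be used instead)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["patient_id", "anon_id"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up IDs; the repeated ID maps to the same anonymized key
master_key = build_master_key(["A-102", "A-417", "A-102"])
print(master_key_to_csv(master_key))
```

Because the master key links real and anonymized IDs, it would normally be kept separate from the anonymized images.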


<details>
<summary>PREVIOUS STEP</summary>
We found all file types in our directory. (I ran the code for each center separately. With 1.5 terabytes of data and ~1800 cases, it took about 30 hours in total on a laptop with an RTX 3080 Ti, a 12th-gen Core i9, and 32 GB of RAM.)
</details>
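
For readers who do not have the previous notebook at hand, the idea of listing file types can be sketched as below. This is a minimal stand-in, not the code that was actually run: the helper name `count_file_types` and the temporary demo folder are assumptions, and files without an extension (which can still be DICOM) are counted under a separate label.

```python
import os
import tempfile
from collections import Counter

def count_file_types(directory):
    """Count files per extension in a directory tree (hypothetical helper)."""
    counts = Counter()
    for _root, _dirs, files in os.walk(directory):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "<no extension>"] += 1
    return counts

# Build a tiny throwaway tree so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    series_dir = os.path.join(tmp, "case_1", "series_1")
    os.makedirs(series_dir)
    for name in ("IM0001.dcm", "IM0002.dcm", "report.pdf", "DICOMDIR"):
        open(os.path.join(series_dir, name), "w").close()
    file_type_counts = count_file_types(tmp)

print(file_type_counts)
```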


<details>
<summary>THIS STEP</summary>
In this step we add unique DICOM metadata: patient info, study info, and series info.
</details>
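
To make the target structure concrete, here is a small sketch of the nested layout this step aims for: one entry per study, with a `private_info` dictionary and a per-series dictionary inside it. The key names mirror the ones used later in this notebook, but the helper `add_record` and the sample records are invented for illustration, and no real DICOM file is read.

```python
import json

def add_record(dicom_data, root_name, record):
    """Insert one per-file record into the root -> study -> series layout."""
    studies = dicom_data.setdefault(root_name, {})
    study_key = f"{record['folder']}_{record['study_id']}_{record['study_date']}"
    if study_key not in studies:
        studies[study_key] = {
            "date": record["study_date"],
            "study_id": record["study_id"],
            "private_info": {"patient_id": record["patient_id"]},
            "image_dicom_data_list": {},
        }
    # Files of the same series overwrite the same entry, so each series appears once
    studies[study_key]["image_dicom_data_list"][record["series_number"]] = {
        "slice_thickness": record["slice_thickness"],
    }
    return dicom_data

# Made-up records standing in for values read from DICOM headers
records = [
    {"folder": "P001", "study_id": "1", "study_date": "20200101",
     "patient_id": "12345", "series_number": 2, "slice_thickness": 1.0},
    {"folder": "P001", "study_id": "1", "study_date": "20200101",
     "patient_id": "12345", "series_number": 3, "slice_thickness": 5.0},
    {"folder": "P001", "study_id": "1", "study_date": "20200101",
     "patient_id": "12345", "series_number": 2, "slice_thickness": 1.0},
]

dicom_data = {}
for record in records:
    add_record(dicom_data, "Valid Case", record)

print(json.dumps(dicom_data, indent=2))
```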


<details>
<summary>NEXT STEP</summary>
Finding dicom meta data
</details>


### Libraries & Functions
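
One of the helpers in this section is a custom JSON encoder, needed because some DICOM value types are not JSON-serializable by default. The sketch below shows the general `json.JSONEncoder.default` pattern with stand-in classes (`FakePersonName`, `FakeMultiValue`) so it runs without pydicom; these classes and the name `SketchEncoder` are hypothetical and only mimic the relevant behavior.

```python
import json

class FakePersonName:
    """Stand-in for a person-name type that is not JSON-serializable."""
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

class FakeMultiValue:
    """Stand-in for a multi-valued type that behaves like an iterable."""
    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        # default() is only called for objects json cannot serialize itself
        if isinstance(obj, FakeMultiValue):
            return list(obj)
        if isinstance(obj, FakePersonName):
            return str(obj)
        return super().default(obj)  # raises TypeError for anything else

payload = {
    "name": FakePersonName("DOE^JANE"),
    "Image_type": FakeMultiValue(["ORIGINAL", "PRIMARY"]),
}
encoded = json.dumps(payload, cls=SketchEncoder)
print(encoded)
```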

In [None]:
#FINAL 20231216
#My context: I coded this on Windows 11 with an RTX 3080 Ti, a 12th-gen Core i9, and 32 GB of RAM. I am coding in VS Code and using a Jupyter notebook.
#Your requirement: it doesn't need any exceptional hardware; you can run it on an average PC/laptop

import pydicom as pm #for reading DICOM files
import os #for looping through system directories
from pydicom.multival import MultiValue #for handling multi-valued DICOM metadata
from pydicom.valuerep import PersonName #turning the dictionary into JSON raised an error for this type, so the encoder below converts it
from tqdm.notebook import tqdm #for the loop progress bar
import pandas as pd #for turning the dictionary into Excel; first we transform it into a pandas DataFrame
import json #for storing as JSON

from IPython.display import HTML, display #so you can click on the stored Excel and JSON paths and open them from the Jupyter notebook

def get_dicom_tag_value(dicom_file, tag, default=None):
    '''Return the value stored under the given tag/code in the DICOM file, or default if the tag is absent.'''
    data_element = dicom_file.get(tag, None)
    if data_element is None:
        return default
    value = data_element.value
    if isinstance(value, MultiValue):  # check the element's value, not the DataElement itself
        return list(value)  # Convert MultiValue to list
    return value

def get_path_to_first_subfolder(full_path, first_subfolder):
    """Return the path up to and including the first-level subfolder of the root, i.e. the subfolder that contains all DICOM files of one DICOM study."""
    path_parts = full_path.split(os.sep)
    if first_subfolder in path_parts:
        subfolder_index = path_parts.index(first_subfolder)
        return os.sep.join(path_parts[:subfolder_index + 1])
    else:
        return full_path

def count_subfolders(directory):
    '''Count the number of subfolders and files within a directory.'''
    total_subfolders = 0
    total_files=0
    for root, dirs, files in os.walk(directory):
        total_subfolders += len(dirs)
        total_files += len(files)
    return total_subfolders,total_files 


class CustomJSONEncoder(json.JSONEncoder): #this class will turn our multilevel dictionary into a json file
    def default(self, obj):
        if isinstance(obj, MultiValue):
            return list(obj)  # Convert MultiValue to list
        elif isinstance(obj, PersonName):
            return str(obj)   # Convert PersonName to string
        return json.JSONEncoder.default(self, obj)

def ensure_json_extension(directory):
    '''Ensure that the given JSON path ends with the required extension; otherwise, append a default file name (JSON.json) to the given directory.'''
    if not directory.endswith(".json"):
        return os.path.join(directory, "JSON.json")
    return directory

def ensure_excel_extension(directory):
    '''Ensure that the given Excel path ends with the required extension; otherwise, append a default file name (excel.xlsx) to the given directory.'''
    if not directory.endswith(".xlsx"):
        return os.path.join(directory, "excel.xlsx")
    return directory

def create_clickable_dir_path(dir_path):
    # Convert the directory path to a file URL
    file_url = f"{dir_path}"
    return HTML(f'<a href="{file_url}" target="_blank">{dir_path}</a>')



def get_dicomdir_give_dicomdicom_datadic(dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
                                     dicom_validation=True, #checks whether each file in the loop is DICOM or not. Although it makes the loop slower, I recommend using it to ensure only DICOM files go through the loop
                                     folder_list_name_indicomdir=None, #optional list of folder names inside dicom_dir that you want to read; other folders are skipped. Keep in mind that this looks at the first-level subfolders of the main folder, not the subfolders of subfolders :)
                                     store_as_json_dir=None, #if you want to store your dictionary as JSON, give your desired JSON directory or .json file path
                                     store_as_excel_dir=None #if you want to store your dictionary as Excel, give your desired Excel directory or .xlsx file path
                                     ):
    """
    This function creates a multi-level dictionary of DICOM metadata (named dicom_data) for a directory (named dicom_dir).
    The top level has the last component of dicom_dir as its key.
    Below it, each study found under a first-level subfolder gets its own dictionary of study data.
    Each study dictionary holds a series dictionary (image_dicom_data_list), keyed by series number, with the data for each series.
    Each study dictionary also holds a private_info dictionary with identifying fields.
    
    - dicom_validation: If you set dicom_validation=True, each file in the loop is checked for being a DICOM file. This is very important, although it makes the code slower,
    because some DICOM files have no extension, and reading other file types may cause errors in the loop.
    
    - folder_list_name_indicomdir: optional list of folder names inside dicom_dir that you want to read; other folders are skipped. Keep in mind that this looks at the first-level subfolders of the main folder, not the subfolders of subfolders :)
    
    - store_as_json_dir: if you want to store your dictionary as JSON, give your desired JSON directory or .json file path
    
    - store_as_excel_dir: if you want to store your dictionary as Excel, give your desired Excel directory or .xlsx file path
    
    The best practice for using this function is to place each folder containing one DICOM study in its own subfolder under dicom_dir.
    However, unique DICOM studies are still told apart even when they are placed next to each other, because I defined study_unique=f'{first_subfolder}_{study_id}_{study_date}'.
    If you want your code to be faster, you can change study_unique to study_unique=first_subfolder. It makes the code about 15% faster, sometimes at the cost of incorrect retrieval.
    
    """

    total_subfolder,total_files=count_subfolders(dicom_dir)
    print(f'your directory contains {total_subfolder} folders and {total_files} files')
    
    last_dir_name = os.path.basename(os.path.normpath(dicom_dir))
    dicom_data = {last_dir_name: {}}

    for root, dirs, files in tqdm(os.walk(dicom_dir), desc="Processing directories", total=total_subfolder + 1, unit='folder'):  # +1 because os.walk also yields dicom_dir itself
        if folder_list_name_indicomdir:
            split_path = root.replace(dicom_dir, '').split(os.sep)
            first_subfolder = split_path[1] if len(split_path) > 1 else ""
            if first_subfolder not in folder_list_name_indicomdir:
                print(f"""The folder {first_subfolder} was not in your defined list.""")
                continue  # Skip if the first subfolder is not in the user-defined list
            
        for file in files:
            if dicom_validation and not pm.misc.is_dicom(os.path.join(root, file)):
                continue # Skip if it is not a DICOM file
                   

            try:
                dicom_file = pm.dcmread(os.path.join(root, file))
                study_id = get_dicom_tag_value(dicom_file, (0x0020, 0x0010))
                dicom_data_number = get_dicom_tag_value(dicom_file, (0x0020, 0x0011))
                study_date = get_dicom_tag_value(dicom_file, (0x0008, 0x0020))
                split_path = root.replace(dicom_dir, '').split(os.sep)
                first_subfolder = split_path[1] if len(split_path) > 1 else ""
                if study_id and dicom_data_number and study_date:
                    study_unique = f'{first_subfolder}_{study_id}_{study_date}' #you can change it for increasing the speed > study_unique=first_subfolder
                    if study_unique not in dicom_data[last_dir_name]:
                        private_info={'name': get_dicom_tag_value(dicom_file, (0x0010, 0x0010)),
                                      'institute': get_dicom_tag_value(dicom_file, (0x0008, 0x0080)),
                                      'patient_id': get_dicom_tag_value(dicom_file, (0x0010, 0x0020)),
                                      'accession_number':get_dicom_tag_value(dicom_file, (0x0008, 0x0050))
                                      }
                        
                        dicom_data[last_dir_name][study_unique] = {
                            'dir_to_root': get_path_to_first_subfolder(root, first_subfolder),
                            'study_description': get_dicom_tag_value(dicom_file, (0x0008, 0x1030)),
                            'date': study_date,
                            'age': get_dicom_tag_value(dicom_file, (0x0010, 0x1010)),
                            'sex': get_dicom_tag_value(dicom_file, (0x0010, 0x0040)),
                            'manufacture_model': get_dicom_tag_value(dicom_file, (0x0008, 0x1090)),
                            'manufacture_brand': get_dicom_tag_value(dicom_file, (0x0008, 0x0070)),
                            'protocol': get_dicom_tag_value(dicom_file, (0x0018, 0x1030)),
                            'study_id': study_id,
                            'patient_weight': get_dicom_tag_value(dicom_file, (0x0010, 0x1030)),
                            'Image_type': get_dicom_tag_value(dicom_file, (0x0008, 0x0008)),
                            'body_part': get_dicom_tag_value(dicom_file, (0x0018, 0x0015)),
                            'modality': get_dicom_tag_value(dicom_file, (0x0008, 0x0060)),  # Modality is (0008,0060); (0008,0050) is the accession number
                            'private_info':private_info,
                            'image_dicom_data_list': {}
                        }

                    

                    dicom_data_info = {
                        'dicom_data_description': get_dicom_tag_value(dicom_file, (0x0008, 0x103E)),
                        'body_part': get_dicom_tag_value(dicom_file, (0x0018, 0x0015)),
                        'slice_thickness': get_dicom_tag_value(dicom_file, (0x0018, 0x0050)),
                        'Image_comment': get_dicom_tag_value(dicom_file, (0x0020, 0x4000)),
                        'kvp': get_dicom_tag_value(dicom_file, (0x0018, 0x0060)),
                        'exposure': get_dicom_tag_value(dicom_file, (0x0018, 0x1152)),
                        'exposure_time': get_dicom_tag_value(dicom_file, (0x0018, 0x1150)),
                    }
                    dicom_data[last_dir_name][study_unique]['image_dicom_data_list'][dicom_data_number] = dicom_data_info

            except Exception as e:
                print(f"""Error reading for {file}::: {e} \n """)
                continue
            
    if store_as_json_dir is not None:
        try:
            json_read = json.dumps(dicom_data, indent=4, cls=CustomJSONEncoder)
            store_as_json_dir=str(store_as_json_dir)
            store_as_json_dir=ensure_json_extension(store_as_json_dir)
            with open(store_as_json_dir, 'w') as json_file:
                json_file.write(json_read)
            print(f"""Json stored at :::""")
            display(create_clickable_dir_path(store_as_json_dir))         
        except Exception as e:  # bind the exception so it can be printed
            print(f"""Error storing the json ::: {e} \n """)
            
    if store_as_excel_dir is not None:
        try:
            dataframes = []
            for key, value in dicom_data.items():
                # Convert value to DataFrame if necessary
                df = pd.DataFrame(value)
                # Add the key as a new column or as part of the index
                df['Key'] = key  # Add key as a column
                # df = df.set_index(['Key'], append=True)  # Add key as part of a MultiIndex
                dataframes.append(df)

            # Concatenate all dataframes
            df2 = pd.concat(dataframes).T
            store_as_excel_dir=str(store_as_excel_dir)
            store_as_excel_dir=ensure_excel_extension(store_as_excel_dir)
            df2.to_excel(store_as_excel_dir)
            print(f"""Excel stored at :::""")
            display(create_clickable_dir_path(store_as_excel_dir))          
        except Exception as e:  # bind the exception so it can be printed
            print(f"""Error storing the excel ::: {e} \n """)
            
                                 
    return dicom_data




### Code
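
The function defined in the previous section maps every folder visited by `os.walk` back to the first-level subfolder of `dicom_dir`, using string replacement and splitting. A minimal alternative sketch of the same idea with `os.path.relpath` is shown below; the helper name `first_subfolder_of` and the example paths are assumptions, and the sketch expects `root` to lie inside `base_dir`.

```python
import os

def first_subfolder_of(root, base_dir):
    """Return the first-level subfolder of base_dir that contains root, or '' for base_dir itself."""
    rel = os.path.relpath(root, base_dir)
    if rel == os.curdir:
        return ""
    return rel.split(os.sep)[0]

# Hypothetical layout: <base>/<patient folder>/<study>/<series>
base = os.path.join("data", "Valid Case")
nested = os.path.join(base, "P001", "CT", "series_2")

print(first_subfolder_of(nested, base))
print(repr(first_subfolder_of(base, base)))
```

Working on relative paths avoids surprises when the base directory string happens to appear more than once in a path.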

In [None]:
### TO DO

# r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Shiraz\Valid Case Image"                   #Done  #corrupted JSON, NEEDS TO BE DONE AGAIN

# r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Taleghani\Maybe Case Image',               #Done #corrupted JSON, NEEDS TO BE DONE AGAIN
# r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Taleghani\Maybe Control Image'             #Done

# r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Dr Radmard\Valid Case"                     #Done

#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Case Image"                   #in progress
#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Control (wo rep)"             #in progress
#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Control Image"                #in progress
#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Valid Case"                         #in progress

#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan\Valid Case"                           #in progress 
#r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan\Valid Control"                        #in progress


In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Dr Radmard\Valid Case" 
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Radmard_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Radmard_all_dcm.xlsx'


dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan\Valid Control" 
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan_control_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan_control_all_dcm.xlsx'


dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )



In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan\Valid Case"
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan_case_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan_case_all_dcm.xlsx'


dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Case Image"
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybeCase_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybeCase_all_dcm.xlsx'

dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Control (wo rep)"
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybeControlwithoutrep_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybeControlwithoutrep_all_dcm.xlsx'

dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Maybe Control Image"
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybecontrol_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_maybecontrol_all_dcm.xlsx'

dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:
dicom_dir=r"F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Emam Kh\Valid Case"
save_dir_json=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_validcase_all_dcm.json'
save_dir_xlsx=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\EmamKh_validcase_all_dcm.xlsx'

dicom_dic=get_dicomdir_give_dicomdicom_datadic(
    dicom_dir, #directory that you want to read; usually each DICOM study should be in one folder, preferably named with the patient's unique id/name
    dicom_validation=True, #checks whether each file in the loop is DICOM; slower, but recommended so that only DICOM files go through the loop
    folder_list_name_indicomdir=None, #optional list of first-level folder names in dicom_dir to read; other folders are skipped (subfolders of subfolders are not matched)
    store_as_json_dir=save_dir_json, #where to store the dictionary as JSON
    store_as_excel_dir=save_dir_xlsx #where to store the dictionary as Excel
    )

In [None]:

#save to dataframe
import pandas as pd

# Assuming 'series' is your dictionary containing the data
dataframes = []
for key, value in series.items():
    # Convert value to DataFrame if necessary
    df = pd.DataFrame(value)
    # Add the key as a new column or as part of the index
    df['Key'] = key  # Add key as a column
    # df = df.set_index(['Key'], append=True)  # Add key as part of a MultiIndex
    dataframes.append(df)

# Concatenate all dataframes
df2 = pd.concat(dataframes).T
df2.to_excel(r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Shiraz_Case_dcm.xlsx')  # raw string so backslashes are not treated as escapes

json_read = json.dumps(series, indent=4, cls=CustomJSONEncoder)

with open(r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Taleghani_MaybeControl_all_dcm.json', 'w') as json_file:
    json_file.write(json_read)  # file objects have no dump(); write the already-serialized string
    


### ARCHIVED CODES (TRASH)

In [None]:
for i in range(len(data['Full_Directory'])):
    # Build the full file path only for rows flagged as DICOM; other rows are skipped
    if data['If_dicom'].iloc[i]:
        dcm_dir = os.path.join(data["Full_Directory"].iloc[i], data["File"].iloc[i])

In [None]:
import os
import pydicom as pm

dcm_dir=os.path.join(data["Full_Directory"].iloc[0],data["File"].iloc[0])  # avoids the invalid "\{" escape in the format string
dicom_file = pm.dcmread(dcm_dir)





In [None]:
Hospital_name= "Guilan"
# raw f-strings so the Windows backslashes are not treated as escape sequences
directory_shortlist=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data_short_just_dcm.xlsx"
directory_longlist=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data.csv"



directory_longlist=pd.read_csv(directory_longlist)
directory_longlist_dcm=directory_longlist[directory_longlist['If_dicom']==True]
directory_longlist_dcm=directory_longlist_dcm.reset_index()



i=1
dir_path = os.path.join(directory_longlist_dcm.iloc[i][3], directory_longlist_dcm.iloc[i][2])
dir_path



# This code aimed to get all DICOM metadata so I could work with it and get to know it, especially for anonymization and for identifying the phase.
# However, it failed due to different DICOM formats. I will use JSON instead (I should have used it in the first place).
# The JSON approach also raises many errors, so I added a try-except inside the loop to handle errors and still finish the loop.

import time  # needed for the progress timing below

dcminfo_list = []  # List to store the individual DataFrame pieces

print(f'total rows in your dataframe is {len(directory_longlist_dcm)}')
start_time=time.time()

for i in range(1, len(directory_longlist_dcm)):
    dir_path = os.path.join(directory_longlist_dcm.iloc[i][3], directory_longlist_dcm.iloc[i][2])
    
    try:
        ds = pm.dcmread(dir_path)
        ds = pd.DataFrame(ds.values())
        if ds.shape[1]>1:
            ds= pd.DataFrame({'WARNING_MORETHAN1ROW_DF2CELL': [ds.to_string()]})
        else: 
            ds[0] = ds[0].apply(lambda x: pm.dataelem.DataElement_from_raw(x) if isinstance(x, pm.dataelem.RawDataElement) else x)
            ds['name'] = ds[0].apply(lambda x: x.name)
            ds['value'] = ds[0].apply(lambda x: x.value)
            ds = ds[['name', 'value']]
            ds = ds.T
            new_header = ds.iloc[0]  # First row as header
            ds = ds[1:]  # Taking the rest of the data
            ds.columns = new_header  # Setting the new header

        if i % 1000 == 0:
            percentage = (i / len(directory_longlist_dcm)) * 100
            end_time = time.time() 
            elapsed_time = end_time - start_time
            print(f'Processed {i} rows, which is {percentage:.2f}% of total rows in {elapsed_time} seconds.')

        ds['to_directory'] = dir_path
        ds['key2csv']=directory_longlist_dcm['Unnamed: 0'][i]
    
    except Exception as e:
        error_message = str(e)
        ds = pd.DataFrame({'WARNING_ERROR': [error_message], 'to_directory': dir_path, 'key2csv': directory_longlist_dcm['Unnamed: 0'][i]})

    dcminfo_list.append(ds)

for df in dcminfo_list:
    rename_duplicate_columns(df)


dcminfo_all=pd.concat(dcminfo_list, ignore_index=True, sort=False)
dcminfo_all


In [None]:

ds = pm.dcmread(r'D:\\Data\\Big Pancreas (CT, EUS)\\Raw Data Hospital\\Guilan\\Valid Case\\PG1002-malihe hoseynlo\\DICOMDIR')
js=ds.to_json()
data=json.loads(js)
data

In [None]:
import pydicom as pm
import pandas as pd
import os

def rename_duplicate_columns(df):
    """Rename duplicate columns in the DataFrame."""
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique(): 
        cols[cols[cols == dup].index.values.tolist()] = [dup + '_DUP' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
    df.columns = cols

Hospital_name= "Guilan"
# raw f-strings so the Windows backslashes are not treated as escape sequences
directory_shortlist=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data_short_just_dcm.xlsx"
directory_longlist=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data.csv"


# read the CSV once; calling read_csv again on the resulting DataFrame would raise an error
directory_longlist=pd.read_csv(directory_longlist)
directory_longlist_dcm=directory_longlist[directory_longlist['If_dicom']==True]
directory_longlist_dcm=directory_longlist_dcm.reset_index()


dcminfo_list = []  # List to store the individual DataFrame pieces

for i in range(1, len(directory_longlist_dcm)):
    dir_path = os.path.join(directory_longlist_dcm.iloc[i][4], directory_longlist_dcm.iloc[i][3])
    ds = pm.dcmread(dir_path)
    ds = pd.DataFrame(ds.values())
    ds[0] = ds[0].apply(lambda x: pm.dataelem.DataElement_from_raw(x) if isinstance(x, pm.dataelem.RawDataElement) else x)
    ds['name'] = ds[0].apply(lambda x: x.name)
    ds['value'] = ds[0].apply(lambda x: x.value)
    ds = ds[['name', 'value']]
    ds = ds.T
    new_header = ds.iloc[0]  # First row as header
    ds = ds[1:]  # Taking the rest of the data
    ds.columns = new_header  # Setting the new header
    ds['to_directory'] = dir_path
    ds['key2csv']=directory_longlist_dcm['Unnamed: 0'][i]
    

    dcminfo_list.append(ds)

for df in dcminfo_list:
    rename_duplicate_columns(df)


dcminfo_all=pd.concat(dcminfo_list, ignore_index=True, sort=False)
dcminfo_all


In [None]:
# from previous dataframe of directories, read all dicoms.
import json
import time

dcminfo_list = []  # List to store the individual DataFrame pieces

print(f'total rows in your dataframe is {len(directory_longlist_dcm)}')
start_time=time.time()

for i in range(1, len(directory_longlist_dcm)):
    dir_path = os.path.join(directory_longlist_dcm.iloc[i][4], directory_longlist_dcm.iloc[i][3])
    ds = pm.dcmread(dir_path)
    js=ds.to_json()
    ds = pd.DataFrame([json.loads(js)])  # one-row DataFrame so the extra columns below can be assigned
    

    if i % 1000 == 0:
        percentage = (i / len(directory_longlist_dcm)) * 100
        end_time = time.time() 
        elapsed_time = end_time - start_time
        print(f'Processed {i} rows, which is {percentage:.2f}% of total rows in {elapsed_time} seconds.')

    ds['to_directory'] = dir_path
    ds['key2csv']=directory_longlist_dcm['Unnamed: 0'][i]
    

    dcminfo_list.append(ds)

for df in dcminfo_list:
    rename_duplicate_columns(df)


dcminfo_all=pd.concat(dcminfo_list, ignore_index=True, sort=False)
dcminfo_all


In [None]:

ds = pm.dcmread(r'D:\\Data\\Big Pancreas (CT, EUS)\\Raw Data Hospital\\Guilan\\Valid Case\\PG1002-malihe hoseynlo\\DICOMDIR')
js=ds.to_json()
data=json.loads(js)
data

In [None]:
# Function to flatten the JSON recursively
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

flat_data = flatten_json(data)  # flatten the parsed dictionary, not the raw JSON string
dff = pd.DataFrame([flat_data])
dff

In [None]:
dcminfo_all.to_excel(rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_testdicomdataframe.xlsx")

In [None]:
## Add this to the first block in your notebook to show JSON files in the Jupyter output
import uuid
from IPython.display import display, HTML
import json

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())
        # This line is missed out in most of the versions of this script across the web, it is essential for this to work interleaved with print statements
        self._ipython_display_()
        
    def _ipython_display_(self):
        display(HTML('<div id="{}" style="height: auto; width:100%;"></div>'.format(self.uuid)))
        display(HTML("""<script>
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
        renderjson.set_show_to_level(1)
        document.getElementById('%s').appendChild(renderjson(%s))
        });</script>
        """ % (self.uuid, self.json_str)))

# Since this is copy-pasted wrongly(mostly) at a lot of places across the web, i'm putting the fixed, updated version here, mainly for self-reference


## To use this function, call this, this now works even when you have a print statement before or after the RenderJSON call
#RenderJSON(dict_to_render)