### Transfer dicom files to pseudonymization destination

<details>
<summary>STEP 1 BIG PICTURE</summary>
We collected data from the centers in folders named by patient ID (e.g. admission ID). We want to clean these directories so that:
I: Each CT study is placed in one folder.
II: Cases are stored in an Excel file, with their DICOM files listed in the table and all other variables (outcome, clinical, and pathology data) stored alongside. We call this the master key; it also contains the un-anonymized patient ID along with the key for anonymization.
III: DICOM-only files are transferred to a new destination, where the images are anonymized.
</details>
<details>
<summary>PREVIOUS STEP</summary>
We previously asked the centers to give us their data:
1- Patients' CT scans, each patient in a folder named with their admission ID.
2- CT scan report + pathology report (if available), stored as PDF or image in the folder.
3- We collected eligible cases in a folder (Maybe Case) and eligible controls in (Maybe control); if the data collection was physician-based or pathology-based, we asked them to store these validated patients as (Valid Case) or (Valid control).
4- We collected all folders from the different centers (which were not that clean, and I can totally understand the hardship of data collection for our team members).
5- We validated cases by reviewing their data case by case, and God knows how much time I put into this. If you are reading this, let me say it clearly: you should dedicate a lot of time if you want to validate cases and do not aim to enroll eligible-only cases. In my case, I lost ~50% of cases during validation, due to lack of reliable validation or lack of the needed CT scan (e.g. post-surgery or post-chemo CT).
</details>
<details>
<summary>THIS STEP</summary>
This code finds all files in a directory and flags the DICOM files among them. It returns a long list, with one row per file, and a short list, reporting the number of files of each type (same extension) in each folder.

[20231001]v1: The first final code
[20231201]v2: I added find_dicom, since some DICOM files have no extension (although they usually have a .dcm extension); the check uses the pydicom library. This step makes the function three times slower :)
</details>
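
To make the extension-free DICOM check less of a black box, here is a minimal sketch of the idea, using only the standard library. It relies on the DICOM Part 10 file layout (a 128-byte preamble followed by the 4-byte word `DICM`), which, as far as I know, is also what `pydicom.misc.is_dicom` looks at. The function name `looks_like_dicom` and the demo file names are hypothetical; files written without the preamble would be reported as non-DICOM.

```python
import os
import tempfile


def looks_like_dicom(file_path):
    """Return True if the file carries the DICOM Part 10 signature."""
    with open(file_path, "rb") as fp:
        fp.seek(128)                  # skip the 128-byte preamble
        return fp.read(4) == b"DICM"  # 4-byte magic word


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files: one with the signature but no extension, one plain text file
    dicom_like = os.path.join(tmp, "IM000001")
    with open(dicom_like, "wb") as fp:
        fp.write(b"\x00" * 128 + b"DICM")
    text_file = os.path.join(tmp, "report.txt")
    with open(text_file, "wb") as fp:
        fp.write(b"CT report")

    print(looks_like_dicom(dicom_like))  # True
    print(looks_like_dicom(text_file))   # False
```

Every file has to be opened once for this check, which is probably why turning the DICOM check on slows the listing down.
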
<details>
<summary>NEXT STEP</summary>
Finding DICOM metadata
</details>


### Environment and Functions

In [None]:
import os
import pandas as pd
import time
from tqdm import tqdm


def get_filepaths_dataframe(directory, Ignore=None,give_second_dupreomved_df=False, multiple_or_no_dot_handler=True, find_dicom=False):
    """
    List all files in a directory (file name, file format, folder directory, sub-directory levels) while optionally ignoring specific file extensions.
    It can also check whether each file is a valid dicom file, since some dicom files don't have any extension (they usually have .dcm, however).
    
    
    Args:
        directory (str): The path to the directory to search for files.
        Ignore (list, optional): A list of file extensions to ignore. If specified, files with these extensions
        will be excluded from the list. Default is None.
        give_second_dupreomved_df (bool, optional): If True, a second DataFrame is generated with duplicated file formats
        within one directory and their count in the data directory.
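        multiple_or_no_dot_handler (bool, optional): If True, file names with multiple dots are handled (the text after
        the last dot is taken as the format), and files with no dot are labelled 'WARNING: NODATAFORMAT'. Default is True.
        find_dicom (bool, optional): If True, each file is checked with pydicom and the result is stored in an
        'If_dicom' column. Default is False.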

    Returns:
        pandas.DataFrame: A DataFrame with one row per file, containing 'File', 'Full_Directory', one 'Sub_dir_N' column per path level, and 'File_Format'.

    If `give_second_dupreomved_df` is set to True, the function returns a tuple of two DataFrames:
    1. The first DataFrame contains all file information, including the file's full directory, file name, and file format.
    2. The second DataFrame (only if `give_second_dupreomved_df` is True) contains the same file information but with
       duplicated file formats within a directory removed, along with a count of each file format in the directory.
    
    If `find_dicom` is set to True, the returned DataFrame(s) include an 'If_dicom' column reporting whether each file is a dicom file.
    
    
    The function also measures the execution time and prints it to the console.

    Example:
        directory_path = r'/path/to/directory'
        Ignore = [".dcm"]  # List of file extensions to ignore
        data, count_perDirectandType = get_filepaths_dataframe(directory_path, Ignore, give_second_dupreomved_df=True, multiple_or_no_dot_handler=True,find_dicom=False)
        print(data)  # Display all files
        print(count_perDirectandType)  # Display files with duplicated formats removed and counts.
    """
    try:
        import os
        import pandas as pd
        import time
        if find_dicom:
            import pydicom  # only needed when the dicom check is requested
    except ImportError:
        raise ImportError("The required packages (os, pandas, time; plus pydicom if find_dicom=True) could not be imported. Please make sure they are installed before using this function.")
    start_time=time.time()

    if find_dicom is True:
        data = {'File': [], 'Full_Directory': [], 'If_dicom':[]}
        i=0        
        for root, dirs, files in os.walk(directory):
            for file in files:
                file_path = os.path.join(root, file)
                file_extension = os.path.splitext(file)[1]
                if Ignore and file_extension in Ignore:
                    continue  # Skip files with extensions specified in the Ignore list
                
                data['Full_Directory'].append(root)
                data['File'].append(file)
                data['If_dicom'].append(pydicom.misc.is_dicom(file_path))
                i=i+1
                if i%100 ==0:
                    print(f"{i} files listed")  # progress report every 100 files


        
        tmp_data=pd.DataFrame(data)     
        data_split=tmp_data['Full_Directory'].str.split('\\\\', expand=True)
        data_split.columns = [f'Sub_dir_{i+1}' for i in range(data_split.shape[1])]
        data=pd.concat([tmp_data, data_split], axis=1)

        if multiple_or_no_dot_handler == True:
            data['File_Format'] = data['File'].apply(lambda x: x.rsplit('.', 1)[-1] if '.' in x else 'WARNING: NODATAFORMAT')
        else:
            data['File_Format'] = data['File'].str.split('.').str[-1]

        if give_second_dupreomved_df==True:
            count = data.groupby(['Full_Directory', 'File_Format','If_dicom'])['File_Format'].count().reset_index(name='Count')    
            cdata_split=count['Full_Directory'].str.split('\\\\', expand=True)
            cdata_split.columns = [f'Sub_dir_{i+1}' for i in range(cdata_split.shape[1])]
            count_perDirectandType=pd.concat([count, cdata_split], axis=1)


        end_time = time.time()  # Record the end time
        elapsed_time = end_time - start_time
        print(f"Execution time: {elapsed_time} seconds")
        

        if give_second_dupreomved_df == True:
            # Print before returning; a statement after return is never reached
            print("since you turned give_second_dupreomved_df on, this function will give you two dataframes (dupremoved with counts as the second df)")
            return data, count_perDirectandType
        else:
            return data
        


    else:
        # No dicom check in this branch (pydicom is not imported), so only file name and folder are recorded
        data = {'File': [], 'Full_Directory': []}
        
        i=0
        for root, dirs, files in os.walk(directory):
            for file in files:
                file_extension = os.path.splitext(file)[1]
                if Ignore and file_extension in Ignore:
                    continue  # Skip files with extensions specified in the Ignore list
                
                data['Full_Directory'].append(root)
                data['File'].append(file)
                i=i+1
                if i%100 ==0:
                    print(f"{i} files listed")  # progress report every 100 files

        
        tmp_data=pd.DataFrame(data)     
        data_split=tmp_data['Full_Directory'].str.split('\\\\', expand=True)
        data_split.columns = [f'Sub_dir_{i+1}' for i in range(data_split.shape[1])]
        data=pd.concat([tmp_data, data_split], axis=1)

        if multiple_or_no_dot_handler == True:
            data['File_Format'] = data['File'].apply(lambda x: x.rsplit('.', 1)[-1] if '.' in x else 'WARNING: NODATAFORMAT')
        else:
            data['File_Format'] = data['File'].str.split('.').str[-1]

        if give_second_dupreomved_df==True:
            count = data.groupby(['Full_Directory', 'File_Format'])['File_Format'].count().reset_index(name='Count')    
            cdata_split=count['Full_Directory'].str.split('\\\\', expand=True)
            cdata_split.columns = [f'Sub_dir_{i+1}' for i in range(cdata_split.shape[1])]
            count_perDirectandType=pd.concat([count, cdata_split], axis=1)


        end_time = time.time()  # Record the end time
        elapsed_time = end_time - start_time
        print(f"Execution time: {elapsed_time} seconds")
        

        if give_second_dupreomved_df == True:
            # Print before returning; a statement after return is never reached
            print("since you turned give_second_dupreomved_df on, this function will give you two dataframes (dupremoved with counts as the second df)")
            return data, count_perDirectandType
        else:
            return data
        

#
#### F O R    D E B U G
##Hospital_name= "Guilan"
##directory=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}" #directory on old pc
#directory=rf"E:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}" #directory on ssd hard
#
#
#Ignore=None
#give_second_dupreomved_df=True
#multiple_or_no_dot_handler=True
#find_dicom=True
#import os
#import pandas as pd
#import time
#import pydicom
#
#start_time=time.time()
#
#data = {'File': [], 'Full_Directory': [],'If_dicom':[]}
#
#for root, dirs, files in os.walk(directory):
#    for file in files:
#        file_path = os.path.join(root, file)
#        file_extension = os.path.splitext(file)[1]
#        
#        if Ignore and file_extension in Ignore:
#            continue  # Skip files with extensions specified in the Ignore list
#        
#        data['Full_Directory'].append(root)
#        data['File'].append(file)
#        data['If_dicom'].append(pydicom.misc.is_dicom(file_path))
#
#
#tmp_data=pd.DataFrame(data)   
#print(tmp_data)
#
#data_split=tmp_data['Full_Directory'].str.split('\\\\', expand=True)
#data_split.columns = [f'Sub_dir_{i+1}' for i in range(data_split.shape[1])]
#data=pd.concat([tmp_data, data_split], axis=1)
#
#if multiple_or_no_dot_handler == True:
#    data['File_Format'] = data['File'].apply(lambda x: x.rsplit('.', 1)[-1] if '.' in x else 'WARNING: NODATAFORMAT')
#else:
#    data['File_Format'] = data['File'].str.split('.').str[-1]
#
#if give_second_dupreomved_df==True:
#    count_perDirectandType = data.groupby(['Full_Directory', 'File_Format'])['File_Format'].count().reset_index(name='Count')
#
#end_time = time.time()  # Record the end time
#elapsed_time = end_time - start_time
#print(f"Execution time: {elapsed_time} seconds")
#
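
The function above builds the `Sub_dir_N` columns by splitting `Full_Directory` on backslashes, which only works for Windows-style paths. If the same notebook ever has to run on Linux or macOS, one option is to let `pathlib` do the splitting. The sketch below is only an illustration: `split_directory` and its `windows` flag are hypothetical names, not part of the function above.

```python
from pathlib import PurePosixPath, PureWindowsPath


def split_directory(full_directory, windows=True):
    """Split a folder path into numbered levels (Sub_dir_1, Sub_dir_2, ...)."""
    path_cls = PureWindowsPath if windows else PurePosixPath
    parts = path_cls(full_directory).parts
    return {f"Sub_dir_{i + 1}": part for i, part in enumerate(parts)}


levels = split_directory(r"E:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\Guilan")
for name, value in levels.items():
    print(name, value)
```

One difference to keep in mind: `pathlib` keeps the drive together with its root, so the first level comes out as `E:\` rather than `E:`.
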


### Changeables (for reuse) & Code

In [None]:
#changeables: the variables, directories, and file names for saving/loading should be defined here

Hospital_name= "Guilan"
# raw f-strings (rf"...") keep the Windows backslashes literal and avoid invalid-escape warnings
#directory=rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}" #directory on old pc
directory=rf"E:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}" #directory on ssd hard

In [None]:
from myfunc_directory import get_filepaths_dataframe #get_filepaths_dataframe(directory, Ignore=None, give_second_dupreomved_df=False, multiple_or_no_dot_handler=True, find_dicom=False)

ignore=[] # if you want to ignore some specific file types, insert them in this list.
data, count_perDirectandType=get_filepaths_dataframe(directory, Ignore=ignore,give_second_dupreomved_df=True, multiple_or_no_dot_handler=True,find_dicom=True)

# raw f-strings (rf"...") keep the Windows backslashes literal
#data.to_csv(rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data.csv")
data.to_csv(rf"E:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data.csv")
#count_perDirectandType.to_excel(rf"D:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data_short.xlsx")
count_perDirectandType.to_excel(rf"E:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\{Hospital_name}_data_short.xlsx")
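
Once the long table exists, the DICOM-only transfer from the big picture (point III) could be sketched roughly as below. This is not the project's actual transfer code, only an illustration using the standard library. It assumes each row carries the `Full_Directory`, `File`, and `If_dicom` columns produced with `find_dicom=True`; `copy_dicom_rows`, `source_root`, and `destination_root` are hypothetical names. It only copies files and does not anonymize anything.

```python
import os
import shutil


def copy_dicom_rows(rows, source_root, destination_root):
    """Copy every row flagged as DICOM, keeping its folder layout relative to source_root."""
    copied = []
    for row in rows:
        if not row["If_dicom"]:
            continue  # skip reports, images, and other non-DICOM files
        relative_dir = os.path.relpath(row["Full_Directory"], source_root)
        target_dir = os.path.normpath(os.path.join(destination_root, relative_dir))
        os.makedirs(target_dir, exist_ok=True)
        source_file = os.path.join(row["Full_Directory"], row["File"])
        # copy2 keeps file timestamps and returns the path of the new copy
        copied.append(shutil.copy2(source_file, target_dir))
    return copied
```

With the DataFrame from the cell above, the rows could be obtained with `data.to_dict("records")` and passed in together with `directory` as `source_root`.
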

