### Transfer cases to new location for anonymization and upload

<details>
<summary>STEP 1 BIG PICTURE</summary>
We collected data from the centers in folders named by patient ID (e.g. admission ID). We want to clean these directories so that:
I: Each CT study is placed in its own folder.
II: Cases are stored in an Excel file, with the path to their DICOM files in the table and all other variables (outcome, clinical, and pathology data) stored there as well. We call this file the master key; it also contains the un-anonymized patient ID along with the key for anonymization.
III: DICOM-only files are transferred to a new destination, where the images are anonymized.
</details>
<details>
<summary>PREVIOUS STEP</summary>
We previously created an Excel file that lists each CT study with the directory of its DICOM-only folder, the real ID, and the pseudonymized ID. Now we want to transfer these studies.
</details>
<details>
<summary>THIS STEP</summary>
This code reads an Excel file containing each patient's pseudonymized ID and the path (directory) to the patient folder with DICOM-only content. You can filter cases using any condition you like.

It then copies each folder to the new destination folder, using a hash (SHA-256) to check whether a file already exists there with identical content. This is a trustworthy way of detecting identical files because it does not rely only on the file name and size.

It prints the successful transfers and gives a list of the directories (from the Excel file) that it could not transfer. This can happen for many reasons, which you should investigate.
</details>
<details>
<summary>NEXT STEP</summary>
Anonymize dicoms
</details>
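
To illustrate why the hash check described under THIS STEP is more reliable than comparing name and size, here is a minimal sketch. The folder names, file names, and file contents are hypothetical stand-ins (not real DICOM data), and `sha256_of` is an illustrative helper that mirrors the chunked hashing used in the code below.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=4096):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical folders, each holding a file with the same name
    dir_a = os.path.join(tmp, "a")
    dir_b = os.path.join(tmp, "b")
    os.makedirs(dir_a)
    os.makedirs(dir_b)
    file_a = os.path.join(dir_a, "IM0001.dcm")
    file_b = os.path.join(dir_b, "IM0001.dcm")

    # Same length (10 bytes) but different content
    with open(file_a, "wb") as f:
        f.write(b"slice-0001")
    with open(file_b, "wb") as f:
        f.write(b"slice-0002")

    same_name = os.path.basename(file_a) == os.path.basename(file_b)
    same_size = os.path.getsize(file_a) == os.path.getsize(file_b)
    same_hash = sha256_of(file_a) == sha256_of(file_b)

print(same_name, same_size, same_hash)  # True True False
```

Name and size agree for the two files, yet the hashes differ, so only the hash comparison reveals that the contents are not the same.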

### Changeable variables (change these for reuse)

In [None]:
import pandas as pd

# Variables for reading the Excel file
excel_dir=r'F:\Data\Big Pancreas (CT, EUS)\Raw Data Hospital\MasterKeyE2.xlsx' # path to the Excel file
sheet_name='MasterKey' # sheet name in the Excel file
df = pd.read_excel(excel_dir,sheet_name=sheet_name)

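# Conditions to filter the dataframe
condition1=(df['VALID Follow-up'] == 'پرسشنامه ی تکمیل شده')  # value means 'questionnaire completed'
condition2=(df['Case_Validation_STATUS (0: exclude; 1: include; 5: noimage)'].str.startswith('1', na=False))  # treat missing values as no match
condition3=(df['Directory'].notna() & (df['Directory'] != ''))  # empty Excel cells are read as NaN
condition4=(df['Pseudonymaize']== 'Done_20231212')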

# Column names holding the source directory and the pseudonymized ID
Column_w_dir='Directory'
Column_w_pseudoID='CT_No'

# Destination folder path for transferring files
dest_folder = "C:\\PanCanAID_Valid_Case_20231212"


### Code

In [None]:
# Apply the filters to the dataframe, if you want them.
# df was already read from the Excel file in the cell above,
# and the conditions were built from that same dataframe.

# Filter conditions
# 1. 'VALID Follow-up' should be 'پرسشنامه ی تکمیل شده' ('questionnaire completed')
# 2. 'Case_Validation_STATUS (0: exclude; 1: include; 5: noimage)' should start with '1'
# 3. 'Directory' should not be empty
# 4. 'Pseudonymaize' should be 'Done_20231212'
filtered_df = df[
    condition1 &
    condition2 &
    condition3 &
    condition4
]

filtered_df
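
A note on the `Directory` filter: pandas reads empty Excel cells as missing values (NaN), and comparing a missing value with `''` using `!=` evaluates to True, so a `!= ''` check on its own does not drop rows with an empty directory. The sketch below shows the difference on a small made-up table; the column names and values are shortened, hypothetical stand-ins for the master key columns.

```python
import pandas as pd

# Hypothetical toy table; not the real master key
toy = pd.DataFrame({
    "Status": ["1: include", "0: exclude", None, "1: include"],
    "Directory": ["D:/case_a", "D:/case_b", "D:/case_c", None],
})

# Missing values never match when na=False is passed
status_mask = toy["Status"].str.startswith("1", na=False)

# Naive check: a missing directory still passes, because NaN != '' is True
naive_dir_mask = toy["Directory"] != ""

# More robust check: require a non-missing and non-empty directory
robust_dir_mask = toy["Directory"].notna() & (toy["Directory"] != "")

naive_rows = toy[status_mask & naive_dir_mask]
robust_rows = toy[status_mask & robust_dir_mask]

print(len(naive_rows), len(robust_rows))
```

With the naive mask, the last row (status `1` but no directory) is kept; with the robust mask, it is filtered out before the copy step tries to use it.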

In [None]:
import os
import shutil
import hashlib

def file_hash(filepath):
    """Calculate the SHA-256 hash of a file."""
    sha256_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Read and update hash in chunks of 4K
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()

def is_file_identical(src, dst):
    """Check if the destination file exists and is identical to the source file."""
    if not os.path.exists(dst):
        return False
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    return file_hash(src) == file_hash(dst)

def copy_if_not_identical(src, dst):
    """Copy a file only if it doesn't exist at the destination or is different."""
    if not is_file_identical(src, dst):
        shutil.copy(src, dst)

def copytree_if_not_identical(src, dst):
    """Recursively copy a directory tree, skipping files that are identical."""
    if not os.path.exists(dst):
        os.makedirs(dst)
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        if os.path.isdir(s):
            copytree_if_not_identical(s, d)
        else:
            copy_if_not_identical(s, d)

def print_list_elements_new_line(lst):
    """
    Prints each element of the list in a new row.

    :param lst: List of elements to be printed.
    """
    for element in lst:
        print(element)





# Create the destination folder if it doesn't exist
if not os.path.exists(dest_folder):
    os.makedirs(dest_folder)

# Collect the source directories that could not be copied
error_list = []

# Iterate over the DataFrame
for _, row in filtered_df.iterrows():
    source_directory = row[Column_w_dir]
    ct_no = str(row[Column_w_pseudoID])  # Get the case number from the dataframe

    # Define the destination folder for this specific case
    dest_folder_case = os.path.join(dest_folder, ct_no)

    # Check that the source directory is a valid path and exists
    if isinstance(source_directory, str) and os.path.exists(source_directory):
        # Copy the directory, skipping files that are already identical
        copytree_if_not_identical(source_directory, dest_folder_case)
        print(f'Done copying from {source_directory} to {dest_folder_case}')
    else:
        print(f'The source directory does not exist: {source_directory}')
        error_list.append(source_directory)

# Report the directories that were not copied
if error_list:
    print('!!!! WATCH OUT, the following directories were not copied due to some problem; look for the error manually:\n')
    print_list_elements_new_line(error_list)
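
Because identical files are skipped, the transfer can be rerun safely (for example after an interrupted copy). The following is a minimal sketch of that behavior on temporary folders. `copy_tree_counting` is a hypothetical variant of the copy function above that also counts copied and skipped files; the folder and file names are made up for illustration.

```python
import hashlib
import os
import shutil
import tempfile

def file_hash(path):
    """Return the SHA-256 hex digest of a file, read in 4K chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            digest.update(block)
    return digest.hexdigest()

def copy_tree_counting(src, dst):
    """Copy src into dst, skipping identical files; return (copied, skipped)."""
    copied = skipped = 0
    os.makedirs(dst, exist_ok=True)
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        if os.path.isdir(s):
            sub_copied, sub_skipped = copy_tree_counting(s, d)
            copied += sub_copied
            skipped += sub_skipped
        elif os.path.exists(d) and file_hash(s) == file_hash(d):
            skipped += 1  # identical file already at the destination
        else:
            shutil.copy(s, d)
            copied += 1
    return copied, skipped

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source case with one series of two files
    src = os.path.join(tmp, "source_case")
    os.makedirs(os.path.join(src, "series1"))
    for name in ("a.dcm", "b.dcm"):
        with open(os.path.join(src, "series1", name), "wb") as f:
            f.write(name.encode())

    dst = os.path.join(tmp, "dest", "CT_001")
    first_run = copy_tree_counting(src, dst)   # everything is copied
    second_run = copy_tree_counting(src, dst)  # everything is skipped

print(first_run, second_run)  # (2, 0) (0, 2)
```

The first run copies both files; the second run finds them unchanged at the destination and copies nothing.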