## Narrowing the data based on sequencing type
Previously I sorted data to 4 cohort (check "mri_data_sort_to_cohorts.ipynb"). <br>
Now I want to keep only relevant sequencing types: MPRAGE and FSPGR. <br> 
<br>
In my 4 cohorts you can find 3 different naming conventions. I will provide 3 examples, one from each type: <br>
1) 1018_NACC282203_20170908ni <br>
2) mri129ni<br>
3) NACC497363_128401136192134176253428319601354034337135ni<br> 

The 1) all have MPRAGE sequencing, the 2) all have FSPGR sequencing and some of 3) have FSPGR, some MPRAGE.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os 
import re
import shutil
import pathlib

In [2]:
ncPath = '../../NACC_data/sorted_cohorts/NC/'
mciPath = '../../NACC_data/sorted_cohorts/MCI/'
alzdPath = '../../NACC_data/sorted_cohorts/ALZD/'
transPath = '../../NACC_data/sorted_cohorts/TRANS/'

In [3]:
# Convert the relative path to an absolute path
ncPath = os.path.abspath(ncPath)
mciPath = os.path.abspath(mciPath)
alzdPath = os.path.abspath(alzdPath)
transPath = os.path.abspath(transPath)

# Modify the absolute path for long path support on Windows
if os.name == 'nt':                     # Check if the operating system is Windows
    ncPath = '\\\\?\\' + ncPath
    mciPath = '\\\\?\\' + mciPath
    alzdPath = '\\\\?\\' + alzdPath
    transPath = '\\\\?\\' + transPath

In [4]:
print(f"Modified path for Windows: {ncPath}")
print(f"Modified path for Windows: {mciPath}")
print(f"Modified path for Windows: {alzdPath}")
print(f"Modified path for Windows: {transPath}")

Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS


### 1) 1018_NACC282203_20170908ni
Directory structure example: <br>
<br>
1018_NACC282203_20170908ni 
- 1018_NACC282203_20170908 
    - Mag_Images_17_1312211075219452552017090815014550782025439000
    - mIP_ImagesSW_19_1312211075219452552017090815014550780625438000
    - MPRAGE_GRAPPA2_6_1312211075219452552017090814552268696468275000
    - Pha_Images_18_1312211075219452552017090815014550782825440000
    - SWI_Images_20_1312211075219452552017090815014550784425442000
    - T2FLAIRSPACENEW_7_1312211075219452552017090814555267209169171000

I will only keep the MPRAGE folder and delete the others.

Converting the paths, sice Windows has a limit to 260 characters in paths, which causes errors in this case.

In [5]:
# Regular expression pattern for folder names starting with 4 digits followed by an underscore
pattern = r'^\d{4}_'

In [6]:
# List to store the matching folders
matching_folders = []

# Iterate over the items in the directory
for item in os.listdir(ncPath):
    item_path = os.path.join(ncPath, item)
    
    # Check if the item is a folder and matches the pattern
    if os.path.isdir(item_path) and re.match(pattern, item):
        matching_folders.append(item_path)

In [7]:
for folder in matching_folders:
    print(folder)

\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC282203_20170908ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC282203_20201106ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC711567_20200114ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC711567_20201214ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC822475_20171116ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC822475_20201119ni


In [8]:
# Regular expression for MPRAGE subfolder
keep_prefix = 'MPRAGE'

This is commented, since it ran only once (to delete folders).

In [9]:
# Loop through each folder in matching_folders
for folder in matching_folders:

    # Get the path to the subfolder (e.g., 1018_NACC282203_20170908)
    subfolder_path = os.path.join(folder, os.listdir(folder)[0])         # only 1 subfolder exists at this level

    # List all subfolders in the subfolder_path
    subfolders = os.listdir(subfolder_path)

    for subfolder in subfolders:
        
        subfolder_full_path = os.path.join(subfolder_path, subfolder)
        
        # Check if the subfolder name starts with 'MPRAGE'
        if os.path.isdir(subfolder_full_path) and not subfolder.startswith(keep_prefix):

            # If the subfolder doesn't start with 'MPRAGE', delete it
            #shutil.rmtree(subfolder_full_path)
            print(f"Deleted: {subfolder_full_path}")
        
        else:
            print(f"Kept: {subfolder_full_path}")

Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC282203_20170908ni\1018_NACC282203_20170908\MPRAGE_GRAPPA2_6_1312211075219452552017090814552268696468275000
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC282203_20201106ni\1018_NACC282203_20201106\MPRAGE_GRAPPA2_6_1312211075219452552020110610432126420639580000
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC711567_20200114ni\1018_NACC711567_20200114\MPRAGE_GRAPPA2_6_1312211075219452552020011409045885927639580000
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC711567_20201214ni\1018_NACC711567_20201214\MPRAGE_GRAPPA2_6_131221107521945255202012140935373703439580000
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC\1018_NACC822475_20171116ni\1018_NACC822475_20171116\MPRAGE_GRAPPA2_6_131221107521945255201711161545017450704733000
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymme

### 2) mri129ni
Directory structure example: <br>
<br>
mri129ni
- s5_dti_DTI
- s6_bravo_T1
- s8_cubet2_T2
- s9_cubet2flair_T2_Flair

I will keep only the s6_bravo_T1 subflolder (all the subfolders that have T1 in their names).

IMPORTANT: not every mri folder has this substructure. Some look like this: <br>
<br>
mri1900ni
- scans
    - many subfolders

or like this: <br>
<br>
mri6615ni
- DICOM
    - many files

or like this <br>
<br>
mri8192ni
- NACC581039
    - nacc
        - many subfolder

Probably there are other structures, just keep this in mind.

In [58]:
pattern = 'mri'

In [59]:
# List to store the matching folders
matching_folders = []

# Iterate over the items in the directory (change Path names to access different cohorts)
for item in os.listdir(mciPath):
    item_path = os.path.join(mciPath, item)
    
    # Check if the item is a folder and matches the pattern
    if os.path.isdir(item_path) and re.match(pattern, item):
        matching_folders.append(item_path)

In [60]:
for folder in matching_folders:
    print(folder)

\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri2714ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri3249ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri3426ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri4012ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri4031ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5084ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5086ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5089ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5090ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5145ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri5224ni
\\?\c:\Users\Crt\Desktop\WIMR

In [61]:
# Regular expression for T1 subfolder
T1_prefix = 'T1'
nacc_prefix = 'NACC'
mprage_prefix = 'MPRAGE'
fspgr_prefix = 'FSPGR'

In [62]:
# Storing the names of deleted patients
deleted_mri_folders = []

In [63]:
# Loop through each folder in matching_folders
for folder in matching_folders:

    # List all subfolders in the folder_path
    subfolders = os.listdir(folder)
    
    # Folders that contain only 1 subfolder (those have different substructures)
    if len(subfolders) == 1:

        subfolder_full_path = os.path.join(folder, subfolders[0])                        # only one subfolder 

        # Only keeping those folders whose subfolders that start with NACC, deleting the rest
        if os.path.isdir(subfolder_full_path) and not subfolder.startswith(nacc_prefix):

            deleted_mri_folders.append(folder)
            shutil.rmtree(folder)
            print(f"Deleted: {folder}")

        # If it does start with NACC, check inside the NACC folder for MPRAGE or FSPGR
        elif os.path.isdir(subfolder_full_path) and subfolder.startswith(nacc_prefix):

            subfolders_in_nacc = os.listdir(subfolder_full_path)                         # List subfolders inside NACC (either nacc or many subfolders)

            if len(subfolders_in_nacc) == 1:                                             # Some have structure NACC--->nacc--->subfolders like MPRAGE, FSPGR
                
                subfolder_in_nacc_full_path = os.path.join(subfolder_full_path, subfolders_in_nacc[0])         # only one subfolder
 
                subfolders_in_subfolder_nacc_in_nacc = os.listdir(subfolder_in_nacc_full_path)

                for subs in subfolders_in_subfolder_nacc_in_nacc: 

                    subs_full_path = os.path.join(subfolder_in_nacc_full_path, subs)

                    if os.path.isdir(subs_full_path) and (mprage_prefix not in subs and fspgr_prefix not in subs):

                        shutil.rmtree(subs_full_path)
                        print(f"Deleted: {subs_full_path}")

                    else:

                        print(f"Kept: {subs_full_path}")
            
            else: 
                for subs in subfolders_in_nacc:

                    subs_full_path = os.path.join(subfolder_full_path, subs)

                    # Check if subfolder does NOT contain 'MPRAGE' or 'FSPGR'
                    if os.path.isdir(subs_full_path) and (mprage_prefix not in subs and fspgr_prefix not in subs):

                        shutil.rmtree(subs_full_path)
                        print(f"Deleted: {subs_full_path}")

                    else:

                        print(f"Kept: {subs_full_path}")
      

    else:
        # For folders that contain multiple subfolders, delete subfolders that don't have 'T1'
        for subfolder in subfolders:

            subfolder_full_path = os.path.join(folder, subfolder)

            # Check if the subfolder name contains 'T1' anywhere in its name
            if os.path.isdir(subfolder_full_path) and T1_prefix not in subfolder:

                # If the subfolder doesn't contain 'T1', delete it
                shutil.rmtree(subfolder_full_path)
                print(f"Deleted: {subfolder_full_path}")

            else:

                print(f"Kept: {subfolder_full_path}")

Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni\006_DTI
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni\007_T1
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni\008_T1
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni\009_T2
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri141ni\010_T2_Flair
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri2714ni\002_T1_Volumetric
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri2714ni\008_DTI
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri2714ni\010_T2
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri2714ni\011_T2_Flair
Kept: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI\mri3249ni\002_T1_Volumetric
Deleted: \\

Saving deleted mri folders, so I can later correct the .csv files containing what patients I have.

In [16]:
writePath = '../../NACC_data/sorted_cohorts/'

In [17]:
def save_to_txt(array, filepath):
    with open(filepath, 'w') as f:
        for filepath in array:
            f.write(f"{filepath}\n")

In [34]:
#save_to_txt(deleted_mri_folders, writePath + 'alzd_deleted.txt')
print("Number of deleted folders:", len(deleted_mri_folders))
print("Save to .txt complete!")

Number of deleted folders: 62
Save to .txt complete!


Deleting freshly produced empty folders (example mri8191ni)

In [36]:
deleted_empty_mri_folders = []

In [37]:
for folder in matching_folders: 

    subfolders = os.listdir(folder)

    if len(subfolders) == 1:

        subfolder_full_path = os.path.join(folder, subfolders[0])

        subs = os.listdir(subfolder_full_path)

        if len(subs) == 0:

            deleted_empty_mri_folders.append(folder)
            shutil.rmtree(folder)
            print(f"Deleted: {folder}")

Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5112ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5113ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5132ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5133ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5134ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5152ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5155ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5158ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri8171ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri8191ni
Deleted: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri8192ni

In [38]:
save_to_txt(deleted_empty_mri_folders, writePath + 'alzd_empty_deleted.txt')
print("Number of deleted empty folders:", len(deleted_mri_folders))
print("Save to .txt complete!")

Number of deleted empty folders: 62
Save to .txt complete!
